WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
71758
URLs are encoded in UTF-8, then decoded as if they are Latin1
https://bugs.webkit.org/show_bug.cgi?id=71758
Summary
URLs are encoded in UTF-8, then decoded as if they are Latin1
Benjamin Poulain
Reported
2011-11-07 18:42:46 PST
Before parsing, we encode the URL in UTF-8. When we encode invalid URL, we do it with Latin1. This makes it possible to transform an URL in an other.
Attachments
Patch
(6.91 KB, patch)
2011-11-07 18:51 PST
,
Benjamin Poulain
webkit.review.bot
: commit-queue-
Details
Formatted Diff
Diff
Patch suggestion
(8.53 KB, text/plain)
2011-11-07 22:36 PST
,
Benjamin Poulain
no flags
Details
Patch
(23.03 KB, patch)
2011-11-10 21:38 PST
,
Benjamin Poulain
no flags
Details
Formatted Diff
Diff
Patch
(29.03 KB, patch)
2011-11-22 23:15 PST
,
Benjamin Poulain
darin
: review+
benjamin
: commit-queue-
Details
Formatted Diff
Diff
Show Obsolete
(2)
View All
Add attachment
proposed patch, testcase, etc.
Benjamin Poulain
Comment 1
2011-11-07 18:51:57 PST
Created
attachment 113975
[details]
Patch
Darin Adler
Comment 2
2011-11-07 20:09:15 PST
Comment on
attachment 113975
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=113975&action=review
> Source/WebCore/platform/KURL.cpp:1123 > + m_string = originalString ? *originalString : String::fromUTF8(url);
String::fromUTF8 will return a null string if the input data contains invalid UTF-8 sequences. Is there some guarantee that “url” will not have such sequences? If so, what guarantees that? If not, then can we make tests showing our behavior in that case?
> LayoutTests/ChangeLog:9 > + * fast/url/invalid_url-expected.txt: Added. > + * fast/url/invalid_url.html: Added.
We use "-" characters between words in our test file names, not "_" characters.
Darin Adler
Comment 3
2011-11-07 20:11:27 PST
You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8?
Benjamin Poulain
Comment 4
2011-11-07 20:17:27 PST
(In reply to
comment #3
)
> You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8?
As far as I can see, they are either all ASCII, or UTF-8. The encoding is done through KURL::init() in encodeRelativeString().
> String::fromUTF8 will return a null string if the input data contains invalid UTF-8 sequences. Is there some guarantee that “url” will not have such sequences? If so, what guarantees that? If not, then can we make tests showing our behavior in that case?
I think that should not happen because the encoding is done with URLEncodedEntitiesForUnencodables (e.g.: CString pathDecoded = pathEncoding.encode(s.data(), pathEnd, URLEncodedEntitiesForUnencodables);). I can add a test for this case.
Darin Adler
Comment 5
2011-11-07 20:20:30 PST
(In reply to
comment #4
)
> (In reply to
comment #3
) > > You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8? > > As far as I can see, they are either all ASCII, or UTF-8. > The encoding is done through KURL::init() in encodeRelativeString().
encodeRelativeString takes a TextEncoding argument, which can be an encoding other than UTF-8 if the webpage is using an encoding other than UTF-8.
WebKit Review Bot
Comment 6
2011-11-07 20:20:44 PST
Comment on
attachment 113975
[details]
Patch
Attachment 113975
[details]
did not pass chromium-ews (chromium-xvfb): Output:
http://queues.webkit.org/results/10331902
New failing tests: fast/url/invalid_url.html
Benjamin Poulain
Comment 7
2011-11-07 20:33:16 PST
Comment on
attachment 113975
[details]
Patch (In reply to
comment #5
)
> > As far as I can see, they are either all ASCII, or UTF-8. > > The encoding is done through KURL::init() in encodeRelativeString(). > > encodeRelativeString takes a TextEncoding argument, which can be an encoding other than UTF-8 if the webpage is using an encoding other than UTF-8.
Hum, yep, I did not realize that was also added to the output. So we can possibly have mixed encoding in the output. That is getting tricky. I am tempted to change KURL::parse() to use a String instead of a char*.
Roland Steiner
Comment 8
2011-11-07 20:54:28 PST
IIRC from a bug a long time ago, different parts of an URL (fragment part, query part) can have different encodings.
Benjamin Poulain
Comment 9
2011-11-07 22:36:40 PST
Created
attachment 113993
[details]
Patch suggestion Since the encoding problem appear only for the invalid URLs. What do you think about always assigning m_string to the original string if the url is invalid? We don't try to partially resolve it, if it is invalid, the URLString is the original string.
Darin Adler
Comment 10
2011-11-08 09:56:13 PST
(In reply to
comment #9
)
> Since the encoding problem appear only for the invalid URLs. What do you think about always assigning m_string to the original string if the url is invalid?
Sounds like a good direction! I am not sure it will work perfectly without studying the issue more. As you point out, there are two different kinds of KURL: 1) One that holds a valid URL. 2) One that passes through a string that is not a valid URL. We should try to be as clear as possible about how all the KURL functions behave when it is case (2). There have been some bugs due to code making mistakes due to (2), not checking the parts of the URL properly and then passing to some lower level framework that treats (2) as a URL. Longer term we might want to eliminate (2) and solve the problem of passing strings that are not valid URLs through the engine in some other way.
Darin Adler
Comment 11
2011-11-08 09:56:58 PST
Maybe in the future we could factor the KURL class into two: One named WebCore::URL and another named WebCore::StringOrURL.
Benjamin Poulain
Comment 12
2011-11-08 10:12:14 PST
(In reply to
comment #10
)
> Sounds like a good direction! > > I am not sure it will work perfectly without studying the issue more. > > As you point out, there are two different kinds of KURL: > > 1) One that holds a valid URL. > 2) One that passes through a string that is not a valid URL. > > We should try to be as clear as possible about how all the KURL functions behave when it is case (2). There have been some bugs due to code making mistakes due to (2), not checking the parts of the URL properly and then passing to some lower level framework that treats (2) as a URL. Longer term we might want to eliminate (2) and solve the problem of passing strings that are not valid URLs through the engine in some other way.
I agree with you, that is how I got interested in this problem in the first place. In KURL, we have the string and m_isValid. When we pass that to the platform layer, the isValid information is often lost.
Adam Barth
Comment 13
2011-11-08 12:43:11 PST
We've often talked about having a ParsedURL and a URLString, but that's related but somewhat different than what you're talking about.
Benjamin Poulain
Comment 14
2011-11-10 21:38:07 PST
Created
attachment 114624
[details]
Patch
WebKit Review Bot
Comment 15
2011-11-10 21:40:32 PST
Attachment 114624
[details]
did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'LayoutTests/ChangeLog', u'LayoutTests/fast..." exit_code: 1 Source/WebCore/platform/KURL.cpp:1421: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5] Total errors found: 1 in 13 files If any of these errors are false positives, please file a bug against check-webkit-style.
Benjamin Poulain
Comment 16
2011-11-10 21:43:52 PST
The style issue is intended. The Chromium bot will likely complain since it will need a new baseline for the modified tests. Is there any plan to get rid of GOOGLEURL or use by default?
WebKit Review Bot
Comment 17
2011-11-11 17:09:48 PST
Comment on
attachment 114624
[details]
Patch
Attachment 114624
[details]
did not pass chromium-ews (chromium-xvfb): Output:
http://queues.webkit.org/results/10460497
New failing tests: fast/url/invalid-urls-utf8.html
Adam Barth
Comment 18
2011-11-13 22:43:40 PST
> Is there any plan to get rid of GOOGLEURL or use by default?
I did some work a while back to change WebKit to use GURL by default, but I didn't finish the project. I believe that would be valuable for the project and am happy to help if someone (such as yourself) was interested in taking the lead.
Darin Adler
Comment 19
2011-11-14 15:57:30 PST
Comment on
attachment 114624
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=114624&action=review
> Source/WebCore/ChangeLog:8 > + URLs are encoded in UTF-8, then decoded as if they are Latin1 > +
https://bugs.webkit.org/show_bug.cgi?id=71758
> + > + Reviewed by NOBODY (OOPS!). > + > + Test: fast/url/invalid-urls-utf8.html
This change log says nothing about what you changed to accomplish this and why.
> Source/WebCore/platform/KURL.cpp:379 > - String* originalString = &rel; > + String originalString = rel;
Why is this better? The new code has to bump reference counts, and the old code didn’t, so the new code is slower.
> Source/WebCore/platform/KURL.h:251 > - void parse(const char* url, const String* originalString); > + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL);
Where is the explanation of this new argument? I think this change is mysterious!
Benjamin Poulain
Comment 20
2011-11-14 16:06:17 PST
(In reply to
comment #19
)
> (From update of
attachment 114624
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=114624&action=review
> > > Source/WebCore/ChangeLog:8 > > + URLs are encoded in UTF-8, then decoded as if they are Latin1 > > +
https://bugs.webkit.org/show_bug.cgi?id=71758
> > + > > + Reviewed by NOBODY (OOPS!). > > + > > + Test: fast/url/invalid-urls-utf8.html > > This change log says nothing about what you changed to accomplish this and why.
Yep, sorry I forgot to fill in the changelog. I will update with a Changelog.
> > Source/WebCore/platform/KURL.cpp:379 > > - String* originalString = &rel; > > + String originalString = rel; > > Why is this better? The new code has to bump reference counts, and the old code didn’t, so the new code is slower. > > > Source/WebCore/platform/KURL.h:251 > > - void parse(const char* url, const String* originalString); > > + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL); > > Where is the explanation of this new argument? I think this change is mysterious!
I can use a String& for originalString. The goal here was to make the code easier to follow and to have consistency between originalString and unmodifiedSourceURL. With a pointer you can have null pointer, null string, empty string. With the ref, only null string and empty string. Do you agree with a const String& or you think the pointer reads better?
Benjamin Poulain
Comment 21
2011-11-14 16:07:15 PST
(In reply to
comment #18
)
> I did some work a while back to change WebKit to use GURL by default, but I didn't finish the project. I believe that would be valuable for the project and am happy to help if someone (such as yourself) was interested in taking the lead.
I am short on time at the moment but I can give it a shot after thanksgiving. Do you have a bug number?
Darin Adler
Comment 22
2011-11-14 16:35:09 PST
(In reply to
comment #20
)
> (In reply to
comment #19
) > The goal here was to make the code easier to follow and to have consistency between originalString and unmodifiedSourceURL. > > With a pointer you can have null pointer, null string, empty string. > With the ref, only null string and empty string. > Do you agree with a const String& or you think the pointer reads better?
A pointer is a way to have an optional parameter, and is a relatively common idiom in WebKit when an argument is optional. I’m not sure that using a special value for a string, such as null or empty, to mean the parameter is not being passed is quite as clear.
Benjamin Poulain
Comment 23
2011-11-14 16:54:10 PST
Comment on
attachment 114624
[details]
Patch (In reply to
comment #22
)
> A pointer is a way to have an optional parameter, and is a relatively common idiom in WebKit when an argument is optional. I’m not sure that using a special value for a string, such as null or empty, to mean the parameter is not being passed is quite as clear.
As you wish, I don't have a strong opinion against this. I'll post an update.
Darin Adler
Comment 24
2011-11-15 09:48:59 PST
Comment on
attachment 114624
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=114624&action=review
>> Source/WebCore/platform/KURL.h:251 >> + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL); > > Where is the explanation of this new argument? I think this change is mysterious!
Because the difference between “original string” and “unmodified source URL” is so subtle, I think we need to be clearer about what these are. I don’t think having the name “URL” in the new argument is good, because it’s not a URL any more than originalString is, it’s a string just like originalString. I think the three arguments are: 1) A pointer to an eight-bit string to parse as the URL. 2) A copy of that exact same URL as a String, if the code path that led here happens to mean we have that string already in a String. 3) A copy of the original String passed in to the KURL API. Except that (3) is not right. It is supported only for certain code paths. So we need a clearer explanation of what (3) is. Then perhaps we can rename (1), (2), and (3) to make these relationships clear.
Benjamin Poulain
Comment 25
2011-11-22 23:15:42 PST
Created
attachment 116323
[details]
Patch
Benjamin Poulain
Comment 26
2011-11-26 11:53:36 PST
Comment on
attachment 116323
[details]
Patch I will land manually so I can rebaseline Chromium afterwards.
Darin Adler
Comment 27
2011-11-30 11:03:38 PST
Comment on
attachment 116323
[details]
Patch Patch looks fine. But the Early Warning System failed on it, and you should figure out why.
Benjamin Poulain
Comment 28
2011-12-01 11:11:40 PST
Committed
r101691
: <
http://trac.webkit.org/changeset/101691
>
Benjamin Poulain
Comment 29
2011-12-01 14:51:28 PST
Committed
r101713
: <
http://trac.webkit.org/changeset/101713
>
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug