RESOLVED FIXED 71758
URLs are encoded in UTF-8, then decoded as if they are Latin1
https://bugs.webkit.org/show_bug.cgi?id=71758
Summary URLs are encoded in UTF-8, then decoded as if they are Latin1
Benjamin Poulain
Reported 2011-11-07 18:42:46 PST
Before parsing, we encode the URL in UTF-8. When we encode invalid URL, we do it with Latin1. This makes it possible to transform an URL in an other.
Attachments
Patch (6.91 KB, patch)
2011-11-07 18:51 PST, Benjamin Poulain
webkit.review.bot: commit-queue-
Patch suggestion (8.53 KB, text/plain)
2011-11-07 22:36 PST, Benjamin Poulain
no flags
Patch (23.03 KB, patch)
2011-11-10 21:38 PST, Benjamin Poulain
no flags
Patch (29.03 KB, patch)
2011-11-22 23:15 PST, Benjamin Poulain
darin: review+
benjamin: commit-queue-
Benjamin Poulain
Comment 1 2011-11-07 18:51:57 PST
Darin Adler
Comment 2 2011-11-07 20:09:15 PST
Comment on attachment 113975 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=113975&action=review > Source/WebCore/platform/KURL.cpp:1123 > + m_string = originalString ? *originalString : String::fromUTF8(url); String::fromUTF8 will return a null string if the input data contains invalid UTF-8 sequences. Is there some guarantee that “url” will not have such sequences? If so, what guarantees that? If not, then can we make tests showing our behavior in that case? > LayoutTests/ChangeLog:9 > + * fast/url/invalid_url-expected.txt: Added. > + * fast/url/invalid_url.html: Added. We use "-" characters between words in our test file names, not "_" characters.
Darin Adler
Comment 3 2011-11-07 20:11:27 PST
You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8?
Benjamin Poulain
Comment 4 2011-11-07 20:17:27 PST
(In reply to comment #3) > You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8? As far as I can see, they are either all ASCII, or UTF-8. The encoding is done through KURL::init() in encodeRelativeString(). > String::fromUTF8 will return a null string if the input data contains invalid UTF-8 sequences. Is there some guarantee that “url” will not have such sequences? If so, what guarantees that? If not, then can we make tests showing our behavior in that case? I think that should not happen because the encoding is done with URLEncodedEntitiesForUnencodables (e.g.: CString pathDecoded = pathEncoding.encode(s.data(), pathEnd, URLEncodedEntitiesForUnencodables);). I can add a test for this case.
Darin Adler
Comment 5 2011-11-07 20:20:30 PST
(In reply to comment #4) > (In reply to comment #3) > > You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8? > > As far as I can see, they are either all ASCII, or UTF-8. > The encoding is done through KURL::init() in encodeRelativeString(). encodeRelativeString takes a TextEncoding argument, which can be an encoding other than UTF-8 if the webpage is using an encoding other than UTF-8.
WebKit Review Bot
Comment 6 2011-11-07 20:20:44 PST
Comment on attachment 113975 [details] Patch Attachment 113975 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10331902 New failing tests: fast/url/invalid_url.html
Benjamin Poulain
Comment 7 2011-11-07 20:33:16 PST
Comment on attachment 113975 [details] Patch (In reply to comment #5) > > As far as I can see, they are either all ASCII, or UTF-8. > > The encoding is done through KURL::init() in encodeRelativeString(). > > encodeRelativeString takes a TextEncoding argument, which can be an encoding other than UTF-8 if the webpage is using an encoding other than UTF-8. Hum, yep, I did not realize that was also added to the output. So we can possibly have mixed encoding in the output. That is getting tricky. I am tempted to change KURL::parse() to use a String instead of a char*.
Roland Steiner
Comment 8 2011-11-07 20:54:28 PST
IIRC from a bug a long time ago, different parts of an URL (fragment part, query part) can have different encodings.
Benjamin Poulain
Comment 9 2011-11-07 22:36:40 PST
Created attachment 113993 [details] Patch suggestion Since the encoding problem appear only for the invalid URLs. What do you think about always assigning m_string to the original string if the url is invalid? We don't try to partially resolve it, if it is invalid, the URLString is the original string.
Darin Adler
Comment 10 2011-11-08 09:56:13 PST
(In reply to comment #9) > Since the encoding problem appear only for the invalid URLs. What do you think about always assigning m_string to the original string if the url is invalid? Sounds like a good direction! I am not sure it will work perfectly without studying the issue more. As you point out, there are two different kinds of KURL: 1) One that holds a valid URL. 2) One that passes through a string that is not a valid URL. We should try to be as clear as possible about how all the KURL functions behave when it is case (2). There have been some bugs due to code making mistakes due to (2), not checking the parts of the URL properly and then passing to some lower level framework that treats (2) as a URL. Longer term we might want to eliminate (2) and solve the problem of passing strings that are not valid URLs through the engine in some other way.
Darin Adler
Comment 11 2011-11-08 09:56:58 PST
Maybe in the future we could factor the KURL class into two: One named WebCore::URL and another named WebCore::StringOrURL.
Benjamin Poulain
Comment 12 2011-11-08 10:12:14 PST
(In reply to comment #10) > Sounds like a good direction! > > I am not sure it will work perfectly without studying the issue more. > > As you point out, there are two different kinds of KURL: > > 1) One that holds a valid URL. > 2) One that passes through a string that is not a valid URL. > > We should try to be as clear as possible about how all the KURL functions behave when it is case (2). There have been some bugs due to code making mistakes due to (2), not checking the parts of the URL properly and then passing to some lower level framework that treats (2) as a URL. Longer term we might want to eliminate (2) and solve the problem of passing strings that are not valid URLs through the engine in some other way. I agree with you, that is how I got interested in this problem in the first place. In KURL, we have the string and m_isValid. When we pass that to the platform layer, the isValid information is often lost.
Adam Barth
Comment 13 2011-11-08 12:43:11 PST
We've often talked about having a ParsedURL and a URLString, but that's related but somewhat different than what you're talking about.
Benjamin Poulain
Comment 14 2011-11-10 21:38:07 PST
WebKit Review Bot
Comment 15 2011-11-10 21:40:32 PST
Attachment 114624 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'LayoutTests/ChangeLog', u'LayoutTests/fast..." exit_code: 1 Source/WebCore/platform/KURL.cpp:1421: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5] Total errors found: 1 in 13 files If any of these errors are false positives, please file a bug against check-webkit-style.
Benjamin Poulain
Comment 16 2011-11-10 21:43:52 PST
The style issue is intended. The Chromium bot will likely complain since it will need a new baseline for the modified tests. Is there any plan to get rid of GOOGLEURL or use by default?
WebKit Review Bot
Comment 17 2011-11-11 17:09:48 PST
Comment on attachment 114624 [details] Patch Attachment 114624 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10460497 New failing tests: fast/url/invalid-urls-utf8.html
Adam Barth
Comment 18 2011-11-13 22:43:40 PST
> Is there any plan to get rid of GOOGLEURL or use by default? I did some work a while back to change WebKit to use GURL by default, but I didn't finish the project. I believe that would be valuable for the project and am happy to help if someone (such as yourself) was interested in taking the lead.
Darin Adler
Comment 19 2011-11-14 15:57:30 PST
Comment on attachment 114624 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review > Source/WebCore/ChangeLog:8 > + URLs are encoded in UTF-8, then decoded as if they are Latin1 > + https://bugs.webkit.org/show_bug.cgi?id=71758 > + > + Reviewed by NOBODY (OOPS!). > + > + Test: fast/url/invalid-urls-utf8.html This change log says nothing about what you changed to accomplish this and why. > Source/WebCore/platform/KURL.cpp:379 > - String* originalString = &rel; > + String originalString = rel; Why is this better? The new code has to bump reference counts, and the old code didn’t, so the new code is slower. > Source/WebCore/platform/KURL.h:251 > - void parse(const char* url, const String* originalString); > + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL); Where is the explanation of this new argument? I think this change is mysterious!
Benjamin Poulain
Comment 20 2011-11-14 16:06:17 PST
(In reply to comment #19) > (From update of attachment 114624 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review > > > Source/WebCore/ChangeLog:8 > > + URLs are encoded in UTF-8, then decoded as if they are Latin1 > > + https://bugs.webkit.org/show_bug.cgi?id=71758 > > + > > + Reviewed by NOBODY (OOPS!). > > + > > + Test: fast/url/invalid-urls-utf8.html > > This change log says nothing about what you changed to accomplish this and why. Yep, sorry I forgot to fill in the changelog. I will update with a Changelog. > > Source/WebCore/platform/KURL.cpp:379 > > - String* originalString = &rel; > > + String originalString = rel; > > Why is this better? The new code has to bump reference counts, and the old code didn’t, so the new code is slower. > > > Source/WebCore/platform/KURL.h:251 > > - void parse(const char* url, const String* originalString); > > + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL); > > Where is the explanation of this new argument? I think this change is mysterious! I can use a String& for originalString. The goal here was to make the code easier to follow and to have consistency between originalString and unmodifiedSourceURL. With a pointer you can have null pointer, null string, empty string. With the ref, only null string and empty string. Do you agree with a const String& or you think the pointer reads better?
Benjamin Poulain
Comment 21 2011-11-14 16:07:15 PST
(In reply to comment #18) > I did some work a while back to change WebKit to use GURL by default, but I didn't finish the project. I believe that would be valuable for the project and am happy to help if someone (such as yourself) was interested in taking the lead. I am short on time at the moment but I can give it a shot after thanksgiving. Do you have a bug number?
Darin Adler
Comment 22 2011-11-14 16:35:09 PST
(In reply to comment #20) > (In reply to comment #19) > The goal here was to make the code easier to follow and to have consistency between originalString and unmodifiedSourceURL. > > With a pointer you can have null pointer, null string, empty string. > With the ref, only null string and empty string. > Do you agree with a const String& or you think the pointer reads better? A pointer is a way to have an optional parameter, and is a relatively common idiom in WebKit when an argument is optional. I’m not sure that using a special value for a string, such as null or empty, to mean the parameter is not being passed is quite as clear.
Benjamin Poulain
Comment 23 2011-11-14 16:54:10 PST
Comment on attachment 114624 [details] Patch (In reply to comment #22) > A pointer is a way to have an optional parameter, and is a relatively common idiom in WebKit when an argument is optional. I’m not sure that using a special value for a string, such as null or empty, to mean the parameter is not being passed is quite as clear. As you wish, I don't have a strong opinion against this. I'll post an update.
Darin Adler
Comment 24 2011-11-15 09:48:59 PST
Comment on attachment 114624 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review >> Source/WebCore/platform/KURL.h:251 >> + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL); > > Where is the explanation of this new argument? I think this change is mysterious! Because the difference between “original string” and “unmodified source URL” is so subtle, I think we need to be clearer about what these are. I don’t think having the name “URL” in the new argument is good, because it’s not a URL any more than originalString is, it’s a string just like originalString. I think the three arguments are: 1) A pointer to an eight-bit string to parse as the URL. 2) A copy of that exact same URL as a String, if the code path that led here happens to mean we have that string already in a String. 3) A copy of the original String passed in to the KURL API. Except that (3) is not right. It is supported only for certain code paths. So we need a clearer explanation of what (3) is. Then perhaps we can rename (1), (2), and (3) to make these relationships clear.
Benjamin Poulain
Comment 25 2011-11-22 23:15:42 PST
Benjamin Poulain
Comment 26 2011-11-26 11:53:36 PST
Comment on attachment 116323 [details] Patch I will land manually so I can rebaseline Chromium afterwards.
Darin Adler
Comment 27 2011-11-30 11:03:38 PST
Comment on attachment 116323 [details] Patch Patch looks fine. But the Early Warning System failed on it, and you should figure out why.
Benjamin Poulain
Comment 28 2011-12-01 11:11:40 PST
Benjamin Poulain
Comment 29 2011-12-01 14:51:28 PST
Note You need to log in before you can comment on or make changes to this bug.