Before parsing, we encode the URL in UTF-8. When we encode invalid URL, we do it with Latin1. This makes it possible to transform an URL in an other.
Created attachment 113975 [details] Patch
Comment on attachment 113975 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=113975&action=review > Source/WebCore/platform/KURL.cpp:1123 > + m_string = originalString ? *originalString : String::fromUTF8(url); String::fromUTF8 will return a null string if the input data contains invalid UTF-8 sequences. Is there some guarantee that “url” will not have such sequences? If so, what guarantees that? If not, then can we make tests showing our behavior in that case? > LayoutTests/ChangeLog:9 > + * fast/url/invalid_url-expected.txt: Added. > + * fast/url/invalid_url.html: Added. We use "-" characters between words in our test file names, not "_" characters.
You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8?
(In reply to comment #3) > You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8? As far as I can see, they are either all ASCII, or UTF-8. The encoding is done through KURL::init() in encodeRelativeString(). > String::fromUTF8 will return a null string if the input data contains invalid UTF-8 sequences. Is there some guarantee that “url” will not have such sequences? If so, what guarantees that? If not, then can we make tests showing our behavior in that case? I think that should not happen because the encoding is done with URLEncodedEntitiesForUnencodables (e.g.: CString pathDecoded = pathEncoding.encode(s.data(), pathEnd, URLEncodedEntitiesForUnencodables);). I can add a test for this case.
(In reply to comment #4) > (In reply to comment #3) > > You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8? > > As far as I can see, they are either all ASCII, or UTF-8. > The encoding is done through KURL::init() in encodeRelativeString(). encodeRelativeString takes a TextEncoding argument, which can be an encoding other than UTF-8 if the webpage is using an encoding other than UTF-8.
Comment on attachment 113975 [details] Patch Attachment 113975 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10331902 New failing tests: fast/url/invalid_url.html
Comment on attachment 113975 [details] Patch (In reply to comment #5) > > As far as I can see, they are either all ASCII, or UTF-8. > > The encoding is done through KURL::init() in encodeRelativeString(). > > encodeRelativeString takes a TextEncoding argument, which can be an encoding other than UTF-8 if the webpage is using an encoding other than UTF-8. Hum, yep, I did not realize that was also added to the output. So we can possibly have mixed encoding in the output. That is getting tricky. I am tempted to change KURL::parse() to use a String instead of a char*.
IIRC from a bug a long time ago, different parts of an URL (fragment part, query part) can have different encodings.
Created attachment 113993 [details] Patch suggestion Since the encoding problem appear only for the invalid URLs. What do you think about always assigning m_string to the original string if the url is invalid? We don't try to partially resolve it, if it is invalid, the URLString is the original string.
(In reply to comment #9) > Since the encoding problem appear only for the invalid URLs. What do you think about always assigning m_string to the original string if the url is invalid? Sounds like a good direction! I am not sure it will work perfectly without studying the issue more. As you point out, there are two different kinds of KURL: 1) One that holds a valid URL. 2) One that passes through a string that is not a valid URL. We should try to be as clear as possible about how all the KURL functions behave when it is case (2). There have been some bugs due to code making mistakes due to (2), not checking the parts of the URL properly and then passing to some lower level framework that treats (2) as a URL. Longer term we might want to eliminate (2) and solve the problem of passing strings that are not valid URLs through the engine in some other way.
Maybe in the future we could factor the KURL class into two: One named WebCore::URL and another named WebCore::StringOrURL.
(In reply to comment #10) > Sounds like a good direction! > > I am not sure it will work perfectly without studying the issue more. > > As you point out, there are two different kinds of KURL: > > 1) One that holds a valid URL. > 2) One that passes through a string that is not a valid URL. > > We should try to be as clear as possible about how all the KURL functions behave when it is case (2). There have been some bugs due to code making mistakes due to (2), not checking the parts of the URL properly and then passing to some lower level framework that treats (2) as a URL. Longer term we might want to eliminate (2) and solve the problem of passing strings that are not valid URLs through the engine in some other way. I agree with you, that is how I got interested in this problem in the first place. In KURL, we have the string and m_isValid. When we pass that to the platform layer, the isValid information is often lost.
We've often talked about having a ParsedURL and a URLString, but that's related but somewhat different than what you're talking about.
Created attachment 114624 [details] Patch
Attachment 114624 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'LayoutTests/ChangeLog', u'LayoutTests/fast..." exit_code: 1 Source/WebCore/platform/KURL.cpp:1421: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5] Total errors found: 1 in 13 files If any of these errors are false positives, please file a bug against check-webkit-style.
The style issue is intended. The Chromium bot will likely complain since it will need a new baseline for the modified tests. Is there any plan to get rid of GOOGLEURL or use by default?
Comment on attachment 114624 [details] Patch Attachment 114624 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10460497 New failing tests: fast/url/invalid-urls-utf8.html
> Is there any plan to get rid of GOOGLEURL or use by default? I did some work a while back to change WebKit to use GURL by default, but I didn't finish the project. I believe that would be valuable for the project and am happy to help if someone (such as yourself) was interested in taking the lead.
Comment on attachment 114624 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review > Source/WebCore/ChangeLog:8 > + URLs are encoded in UTF-8, then decoded as if they are Latin1 > + https://bugs.webkit.org/show_bug.cgi?id=71758 > + > + Reviewed by NOBODY (OOPS!). > + > + Test: fast/url/invalid-urls-utf8.html This change log says nothing about what you changed to accomplish this and why. > Source/WebCore/platform/KURL.cpp:379 > - String* originalString = &rel; > + String originalString = rel; Why is this better? The new code has to bump reference counts, and the old code didn’t, so the new code is slower. > Source/WebCore/platform/KURL.h:251 > - void parse(const char* url, const String* originalString); > + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL); Where is the explanation of this new argument? I think this change is mysterious!
(In reply to comment #19) > (From update of attachment 114624 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review > > > Source/WebCore/ChangeLog:8 > > + URLs are encoded in UTF-8, then decoded as if they are Latin1 > > + https://bugs.webkit.org/show_bug.cgi?id=71758 > > + > > + Reviewed by NOBODY (OOPS!). > > + > > + Test: fast/url/invalid-urls-utf8.html > > This change log says nothing about what you changed to accomplish this and why. Yep, sorry I forgot to fill in the changelog. I will update with a Changelog. > > Source/WebCore/platform/KURL.cpp:379 > > - String* originalString = &rel; > > + String originalString = rel; > > Why is this better? The new code has to bump reference counts, and the old code didn’t, so the new code is slower. > > > Source/WebCore/platform/KURL.h:251 > > - void parse(const char* url, const String* originalString); > > + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL); > > Where is the explanation of this new argument? I think this change is mysterious! I can use a String& for originalString. The goal here was to make the code easier to follow and to have consistency between originalString and unmodifiedSourceURL. With a pointer you can have null pointer, null string, empty string. With the ref, only null string and empty string. Do you agree with a const String& or you think the pointer reads better?
(In reply to comment #18) > I did some work a while back to change WebKit to use GURL by default, but I didn't finish the project. I believe that would be valuable for the project and am happy to help if someone (such as yourself) was interested in taking the lead. I am short on time at the moment but I can give it a shot after thanksgiving. Do you have a bug number?
(In reply to comment #20) > (In reply to comment #19) > The goal here was to make the code easier to follow and to have consistency between originalString and unmodifiedSourceURL. > > With a pointer you can have null pointer, null string, empty string. > With the ref, only null string and empty string. > Do you agree with a const String& or you think the pointer reads better? A pointer is a way to have an optional parameter, and is a relatively common idiom in WebKit when an argument is optional. I’m not sure that using a special value for a string, such as null or empty, to mean the parameter is not being passed is quite as clear.
Comment on attachment 114624 [details] Patch (In reply to comment #22) > A pointer is a way to have an optional parameter, and is a relatively common idiom in WebKit when an argument is optional. I’m not sure that using a special value for a string, such as null or empty, to mean the parameter is not being passed is quite as clear. As you wish, I don't have a strong opinion against this. I'll post an update.
Comment on attachment 114624 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review >> Source/WebCore/platform/KURL.h:251 >> + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL); > > Where is the explanation of this new argument? I think this change is mysterious! Because the difference between “original string” and “unmodified source URL” is so subtle, I think we need to be clearer about what these are. I don’t think having the name “URL” in the new argument is good, because it’s not a URL any more than originalString is, it’s a string just like originalString. I think the three arguments are: 1) A pointer to an eight-bit string to parse as the URL. 2) A copy of that exact same URL as a String, if the code path that led here happens to mean we have that string already in a String. 3) A copy of the original String passed in to the KURL API. Except that (3) is not right. It is supported only for certain code paths. So we need a clearer explanation of what (3) is. Then perhaps we can rename (1), (2), and (3) to make these relationships clear.
Created attachment 116323 [details] Patch
Comment on attachment 116323 [details] Patch I will land manually so I can rebaseline Chromium afterwards.
Comment on attachment 116323 [details] Patch Patch looks fine. But the Early Warning System failed on it, and you should figure out why.
Committed r101691: <http://trac.webkit.org/changeset/101691>
Committed r101713: <http://trac.webkit.org/changeset/101713>