Bug 71758 - URLs are encoded in UTF-8, then decoded as if they are Latin1
Summary: URLs are encoded in UTF-8, then decoded as if they are Latin1
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Benjamin Poulain
URL:
Keywords:
Depends on: 73588
Blocks:
  Show dependency treegraph
 
Reported: 2011-11-07 18:42 PST by Benjamin Poulain
Modified: 2011-12-01 14:51 PST (History)
8 users (show)

See Also:


Attachments
Patch (6.91 KB, patch)
2011-11-07 18:51 PST, Benjamin Poulain
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
Patch suggestion (8.53 KB, text/plain)
2011-11-07 22:36 PST, Benjamin Poulain
no flags Details
Patch (23.03 KB, patch)
2011-11-10 21:38 PST, Benjamin Poulain
no flags Details | Formatted Diff | Diff
Patch (29.03 KB, patch)
2011-11-22 23:15 PST, Benjamin Poulain
darin: review+
benjamin: commit-queue-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Benjamin Poulain 2011-11-07 18:42:46 PST
Before parsing, we encode the URL in UTF-8. When we encode invalid URL, we do it with Latin1. This makes it possible to transform an URL in an other.
Comment 1 Benjamin Poulain 2011-11-07 18:51:57 PST
Created attachment 113975 [details]
Patch
Comment 2 Darin Adler 2011-11-07 20:09:15 PST
Comment on attachment 113975 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=113975&action=review

> Source/WebCore/platform/KURL.cpp:1123
> +        m_string = originalString ? *originalString : String::fromUTF8(url);

String::fromUTF8 will return a null string if the input data contains invalid UTF-8 sequences. Is there some guarantee that “url” will not have such sequences? If so, what guarantees that? If not, then can we make tests showing our behavior in that case?

> LayoutTests/ChangeLog:9
> +        * fast/url/invalid_url-expected.txt: Added.
> +        * fast/url/invalid_url.html: Added.

We use "-" characters between words in our test file names, not "_" characters.
Comment 3 Darin Adler 2011-11-07 20:11:27 PST
You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8?
Comment 4 Benjamin Poulain 2011-11-07 20:17:27 PST
(In reply to comment #3)
> You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8?

As far as I can see, they are either all ASCII, or UTF-8. 
The encoding is done through KURL::init() in encodeRelativeString().

> String::fromUTF8 will return a null string if the input data contains invalid UTF-8 sequences. Is there some guarantee that “url” will not have such sequences? If so, what guarantees that? If not, then can we make tests showing our behavior in that case?

I think that should not happen because the encoding is done with URLEncodedEntitiesForUnencodables (e.g.: CString pathDecoded = pathEncoding.encode(s.data(), pathEnd, URLEncodedEntitiesForUnencodables);).

I can add a test for this case.
Comment 5 Darin Adler 2011-11-07 20:20:30 PST
(In reply to comment #4)
> (In reply to comment #3)
> > You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8?
> 
> As far as I can see, they are either all ASCII, or UTF-8. 
> The encoding is done through KURL::init() in encodeRelativeString().

encodeRelativeString takes a TextEncoding argument, which can be an encoding other than UTF-8 if the webpage is using an encoding other than UTF-8.
Comment 6 WebKit Review Bot 2011-11-07 20:20:44 PST
Comment on attachment 113975 [details]
Patch

Attachment 113975 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/10331902

New failing tests:
fast/url/invalid_url.html
Comment 7 Benjamin Poulain 2011-11-07 20:33:16 PST
Comment on attachment 113975 [details]
Patch

(In reply to comment #5)
> > As far as I can see, they are either all ASCII, or UTF-8. 
> > The encoding is done through KURL::init() in encodeRelativeString().
> 
> encodeRelativeString takes a TextEncoding argument, which can be an encoding other than UTF-8 if the webpage is using an encoding other than UTF-8.

Hum, yep, I did not realize that was also added to the output. So we can possibly have mixed encoding in the output.

That is getting tricky. I am tempted to change KURL::parse() to use a String instead of a char*.
Comment 8 Roland Steiner 2011-11-07 20:54:28 PST
IIRC from a bug a long time ago, different parts of an URL (fragment part, query part) can have different encodings.
Comment 9 Benjamin Poulain 2011-11-07 22:36:40 PST
Created attachment 113993 [details]
Patch suggestion

Since the encoding problem appear only for the invalid URLs. What do you think about always assigning m_string to the original string if the url is invalid?
We don't try to partially resolve it, if it is invalid, the URLString is the original string.
Comment 10 Darin Adler 2011-11-08 09:56:13 PST
(In reply to comment #9)
> Since the encoding problem appear only for the invalid URLs. What do you think about always assigning m_string to the original string if the url is invalid?

Sounds like a good direction!

I am not sure it will work perfectly without studying the issue more.

As you point out, there are two different kinds of KURL:

1) One that holds a valid URL.
2) One that passes through a string that is not a valid URL.

We should try to be as clear as possible about how all the KURL functions behave when it is case (2). There have been some bugs due to code making mistakes due to (2), not checking the parts of the URL properly and then passing to some lower level framework that treats (2) as a URL. Longer term we might want to eliminate (2) and solve the problem of passing strings that are not valid URLs through the engine in some other way.
Comment 11 Darin Adler 2011-11-08 09:56:58 PST
Maybe in the future we could factor the KURL class into two: One named WebCore::URL and another named WebCore::StringOrURL.
Comment 12 Benjamin Poulain 2011-11-08 10:12:14 PST
(In reply to comment #10)
> Sounds like a good direction!
> 
> I am not sure it will work perfectly without studying the issue more.
> 
> As you point out, there are two different kinds of KURL:
> 
> 1) One that holds a valid URL.
> 2) One that passes through a string that is not a valid URL.
> 
> We should try to be as clear as possible about how all the KURL functions behave when it is case (2). There have been some bugs due to code making mistakes due to (2), not checking the parts of the URL properly and then passing to some lower level framework that treats (2) as a URL. Longer term we might want to eliminate (2) and solve the problem of passing strings that are not valid URLs through the engine in some other way.

I agree with you, that is how I got interested in this problem in the first place.

In KURL, we have the string and m_isValid. When we pass that to the platform layer, the isValid information is often lost.
Comment 13 Adam Barth 2011-11-08 12:43:11 PST
We've often talked about having a ParsedURL and a URLString, but that's related but somewhat different than what you're talking about.
Comment 14 Benjamin Poulain 2011-11-10 21:38:07 PST
Created attachment 114624 [details]
Patch
Comment 15 WebKit Review Bot 2011-11-10 21:40:32 PST
Attachment 114624 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'LayoutTests/ChangeLog', u'LayoutTests/fast..." exit_code: 1

Source/WebCore/platform/KURL.cpp:1421:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 1 in 13 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 16 Benjamin Poulain 2011-11-10 21:43:52 PST
The style issue is intended.

The Chromium bot will likely complain since it will need a new baseline for the modified tests. Is there any plan to get rid of GOOGLEURL or use by default?
Comment 17 WebKit Review Bot 2011-11-11 17:09:48 PST
Comment on attachment 114624 [details]
Patch

Attachment 114624 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/10460497

New failing tests:
fast/url/invalid-urls-utf8.html
Comment 18 Adam Barth 2011-11-13 22:43:40 PST
> Is there any plan to get rid of GOOGLEURL or use by default?

I did some work a while back to change WebKit to use GURL by default, but I didn't finish the project.  I believe that would be valuable for the project and am happy to help if someone (such as yourself) was interested in taking the lead.
Comment 19 Darin Adler 2011-11-14 15:57:30 PST
Comment on attachment 114624 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review

> Source/WebCore/ChangeLog:8
> +        URLs are encoded in UTF-8, then decoded as if they are Latin1
> +        https://bugs.webkit.org/show_bug.cgi?id=71758
> +
> +        Reviewed by NOBODY (OOPS!).
> +
> +        Test: fast/url/invalid-urls-utf8.html

This change log says nothing about what you changed to accomplish this and why.

> Source/WebCore/platform/KURL.cpp:379
> -    String* originalString = &rel;
> +    String originalString = rel;

Why is this better? The new code has to bump reference counts, and the old code didn’t, so the new code is slower.

> Source/WebCore/platform/KURL.h:251
> -    void parse(const char* url, const String* originalString);
> +    void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL);

Where is the explanation of this new argument? I think this change is mysterious!
Comment 20 Benjamin Poulain 2011-11-14 16:06:17 PST
(In reply to comment #19)
> (From update of attachment 114624 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review
> 
> > Source/WebCore/ChangeLog:8
> > +        URLs are encoded in UTF-8, then decoded as if they are Latin1
> > +        https://bugs.webkit.org/show_bug.cgi?id=71758
> > +
> > +        Reviewed by NOBODY (OOPS!).
> > +
> > +        Test: fast/url/invalid-urls-utf8.html
> 
> This change log says nothing about what you changed to accomplish this and why.

Yep, sorry I forgot to fill in the changelog. I will update with a Changelog.

> > Source/WebCore/platform/KURL.cpp:379
> > -    String* originalString = &rel;
> > +    String originalString = rel;
> 
> Why is this better? The new code has to bump reference counts, and the old code didn’t, so the new code is slower.
> 
> > Source/WebCore/platform/KURL.h:251
> > -    void parse(const char* url, const String* originalString);
> > +    void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL);
> 
> Where is the explanation of this new argument? I think this change is mysterious!

I can use a String& for originalString. The goal here was to make the code easier to follow and to have consistency between originalString and unmodifiedSourceURL.

With a pointer you can have null pointer, null string, empty string.
With the ref, only null string and empty string.
Do you agree with a const String& or you think the pointer reads better?
Comment 21 Benjamin Poulain 2011-11-14 16:07:15 PST
(In reply to comment #18)
> I did some work a while back to change WebKit to use GURL by default, but I didn't finish the project.  I believe that would be valuable for the project and am happy to help if someone (such as yourself) was interested in taking the lead.

I am short on time at the moment but I can give it a shot after thanksgiving. Do you have a bug number?
Comment 22 Darin Adler 2011-11-14 16:35:09 PST
(In reply to comment #20)
> (In reply to comment #19)
> The goal here was to make the code easier to follow and to have consistency between originalString and unmodifiedSourceURL.
> 
> With a pointer you can have null pointer, null string, empty string.
> With the ref, only null string and empty string.
> Do you agree with a const String& or you think the pointer reads better?

A pointer is a way to have an optional parameter, and is a relatively common idiom in WebKit when an argument is optional. I’m not sure that using a special value for a string, such as null or empty, to mean the parameter is not being passed is quite as clear.
Comment 23 Benjamin Poulain 2011-11-14 16:54:10 PST
Comment on attachment 114624 [details]
Patch

(In reply to comment #22)
> A pointer is a way to have an optional parameter, and is a relatively common idiom in WebKit when an argument is optional. I’m not sure that using a special value for a string, such as null or empty, to mean the parameter is not being passed is quite as clear.

As you wish, I don't have a strong opinion against this. I'll post an update.
Comment 24 Darin Adler 2011-11-15 09:48:59 PST
Comment on attachment 114624 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review

>> Source/WebCore/platform/KURL.h:251
>> +    void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL);
> 
> Where is the explanation of this new argument? I think this change is mysterious!

Because the difference between “original string” and “unmodified source URL” is so subtle, I think we need to be clearer about what these are. I don’t think having the name “URL” in the new argument is good, because it’s not a URL any more than originalString is, it’s a string just like originalString.

I think the three arguments are:

1) A pointer to an eight-bit string to parse as the URL.
2) A copy of that exact same URL as a String, if the code path that led here happens to mean we have that string already in a String.
3) A copy of the original String passed in to the KURL API.

Except that (3) is not right. It is supported only for certain code paths. So we need a clearer explanation of what (3) is. Then perhaps we can rename (1), (2), and (3) to make these relationships clear.
Comment 25 Benjamin Poulain 2011-11-22 23:15:42 PST
Created attachment 116323 [details]
Patch
Comment 26 Benjamin Poulain 2011-11-26 11:53:36 PST
Comment on attachment 116323 [details]
Patch

I will land manually so I can rebaseline Chromium afterwards.
Comment 27 Darin Adler 2011-11-30 11:03:38 PST
Comment on attachment 116323 [details]
Patch

Patch looks fine. But the Early Warning System failed on it, and you should figure out why.
Comment 28 Benjamin Poulain 2011-12-01 11:11:40 PST
Committed r101691: <http://trac.webkit.org/changeset/101691>
Comment 29 Benjamin Poulain 2011-12-01 14:51:28 PST
Committed r101713: <http://trac.webkit.org/changeset/101713>