71758 – URLs are encoded in UTF-8, then decoded as if they are Latin1

RESOLVED FIXED 71758

URLs are encoded in UTF-8, then decoded as if they are Latin1

https://bugs.webkit.org/show_bug.cgi?id=71758

Summary URLs are encoded in UTF-8, then decoded as if they are Latin1

Benjamin Poulain

Reported 2011-11-07 18:42:46 PST

Before parsing, we encode the URL in UTF-8. When we encode invalid URL, we do it with Latin1. This makes it possible to transform an URL in an other.

Attachments
Patch (6.91 KB, patch) 2011-11-07 18:51 PST, Benjamin Poulain	webkit.review.bot: commit-queue-	Details Formatted Diff Diff
Patch suggestion (8.53 KB, text/plain) 2011-11-07 22:36 PST, Benjamin Poulain	no flags	Details
Patch (23.03 KB, patch) 2011-11-10 21:38 PST, Benjamin Poulain	no flags	Details Formatted Diff Diff
Patch (29.03 KB, patch) 2011-11-22 23:15 PST, Benjamin Poulain	darin: review+ benjamin: commit-queue-	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Benjamin Poulain

Comment 1 2011-11-07 18:51:57 PST

Created attachment 113975 [details] Patch

Darin Adler

Comment 2 2011-11-07 20:09:15 PST

Comment on attachment 113975 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=113975&action=review > Source/WebCore/platform/KURL.cpp:1123 > + m_string = originalString ? *originalString : String::fromUTF8(url); String::fromUTF8 will return a null string if the input data contains invalid UTF-8 sequences. Is there some guarantee that “url” will not have such sequences? If so, what guarantees that? If not, then can we make tests showing our behavior in that case? > LayoutTests/ChangeLog:9 > + * fast/url/invalid_url-expected.txt: Added. > + * fast/url/invalid_url.html: Added. We use "-" characters between words in our test file names, not "_" characters.

Darin Adler

Comment 3 2011-11-07 20:11:27 PST

You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8?

Benjamin Poulain

Comment 4 2011-11-07 20:17:27 PST

(In reply to comment #3) > You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8? As far as I can see, they are either all ASCII, or UTF-8. The encoding is done through KURL::init() in encodeRelativeString(). > String::fromUTF8 will return a null string if the input data contains invalid UTF-8 sequences. Is there some guarantee that “url” will not have such sequences? If so, what guarantees that? If not, then can we make tests showing our behavior in that case? I think that should not happen because the encoding is done with URLEncodedEntitiesForUnencodables (e.g.: CString pathDecoded = pathEncoding.encode(s.data(), pathEnd, URLEncodedEntitiesForUnencodables);). I can add a test for this case.

Darin Adler

Comment 5 2011-11-07 20:20:30 PST

(In reply to comment #4) > (In reply to comment #3) > > You are claiming that URLs are all encoded in UTF-8. What is the basis of this claim? Where is the code that encodes URLs as UTF-8? > > As far as I can see, they are either all ASCII, or UTF-8. > The encoding is done through KURL::init() in encodeRelativeString(). encodeRelativeString takes a TextEncoding argument, which can be an encoding other than UTF-8 if the webpage is using an encoding other than UTF-8.

WebKit Review Bot

Comment 6 2011-11-07 20:20:44 PST

Comment on attachment 113975 [details] Patch Attachment 113975 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10331902 New failing tests: fast/url/invalid_url.html

Benjamin Poulain

Comment 7 2011-11-07 20:33:16 PST

Comment on attachment 113975 [details] Patch (In reply to comment #5) > > As far as I can see, they are either all ASCII, or UTF-8. > > The encoding is done through KURL::init() in encodeRelativeString(). > > encodeRelativeString takes a TextEncoding argument, which can be an encoding other than UTF-8 if the webpage is using an encoding other than UTF-8. Hum, yep, I did not realize that was also added to the output. So we can possibly have mixed encoding in the output. That is getting tricky. I am tempted to change KURL::parse() to use a String instead of a char*.

Roland Steiner

Comment 8 2011-11-07 20:54:28 PST

IIRC from a bug a long time ago, different parts of an URL (fragment part, query part) can have different encodings.

Benjamin Poulain

Comment 9 2011-11-07 22:36:40 PST

Created attachment 113993 [details] Patch suggestion Since the encoding problem appear only for the invalid URLs. What do you think about always assigning m_string to the original string if the url is invalid? We don't try to partially resolve it, if it is invalid, the URLString is the original string.

Darin Adler

Comment 10 2011-11-08 09:56:13 PST

(In reply to comment #9) > Since the encoding problem appear only for the invalid URLs. What do you think about always assigning m_string to the original string if the url is invalid? Sounds like a good direction! I am not sure it will work perfectly without studying the issue more. As you point out, there are two different kinds of KURL: 1) One that holds a valid URL. 2) One that passes through a string that is not a valid URL. We should try to be as clear as possible about how all the KURL functions behave when it is case (2). There have been some bugs due to code making mistakes due to (2), not checking the parts of the URL properly and then passing to some lower level framework that treats (2) as a URL. Longer term we might want to eliminate (2) and solve the problem of passing strings that are not valid URLs through the engine in some other way.

Darin Adler

Comment 11 2011-11-08 09:56:58 PST

Maybe in the future we could factor the KURL class into two: One named WebCore::URL and another named WebCore::StringOrURL.

Benjamin Poulain

Comment 12 2011-11-08 10:12:14 PST

(In reply to comment #10) > Sounds like a good direction! > > I am not sure it will work perfectly without studying the issue more. > > As you point out, there are two different kinds of KURL: > > 1) One that holds a valid URL. > 2) One that passes through a string that is not a valid URL. > > We should try to be as clear as possible about how all the KURL functions behave when it is case (2). There have been some bugs due to code making mistakes due to (2), not checking the parts of the URL properly and then passing to some lower level framework that treats (2) as a URL. Longer term we might want to eliminate (2) and solve the problem of passing strings that are not valid URLs through the engine in some other way. I agree with you, that is how I got interested in this problem in the first place. In KURL, we have the string and m_isValid. When we pass that to the platform layer, the isValid information is often lost.

Adam Barth

Comment 13 2011-11-08 12:43:11 PST

We've often talked about having a ParsedURL and a URLString, but that's related but somewhat different than what you're talking about.

Benjamin Poulain

Comment 14 2011-11-10 21:38:07 PST

Created attachment 114624 [details] Patch

WebKit Review Bot

Comment 15 2011-11-10 21:40:32 PST

Attachment 114624 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'LayoutTests/ChangeLog', u'LayoutTests/fast..." exit_code: 1 Source/WebCore/platform/KURL.cpp:1421: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5] Total errors found: 1 in 13 files If any of these errors are false positives, please file a bug against check-webkit-style.

Benjamin Poulain

Comment 16 2011-11-10 21:43:52 PST

The style issue is intended. The Chromium bot will likely complain since it will need a new baseline for the modified tests. Is there any plan to get rid of GOOGLEURL or use by default?

WebKit Review Bot

Comment 17 2011-11-11 17:09:48 PST

Comment on attachment 114624 [details] Patch Attachment 114624 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10460497 New failing tests: fast/url/invalid-urls-utf8.html

Adam Barth

Comment 18 2011-11-13 22:43:40 PST

> Is there any plan to get rid of GOOGLEURL or use by default? I did some work a while back to change WebKit to use GURL by default, but I didn't finish the project. I believe that would be valuable for the project and am happy to help if someone (such as yourself) was interested in taking the lead.

Darin Adler

Comment 19 2011-11-14 15:57:30 PST

Comment on attachment 114624 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review > Source/WebCore/ChangeLog:8 > + URLs are encoded in UTF-8, then decoded as if they are Latin1 > + https://bugs.webkit.org/show_bug.cgi?id=71758 > + > + Reviewed by NOBODY (OOPS!). > + > + Test: fast/url/invalid-urls-utf8.html This change log says nothing about what you changed to accomplish this and why. > Source/WebCore/platform/KURL.cpp:379 > - String* originalString = &rel; > + String originalString = rel; Why is this better? The new code has to bump reference counts, and the old code didn’t, so the new code is slower. > Source/WebCore/platform/KURL.h:251 > - void parse(const char* url, const String* originalString); > + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL); Where is the explanation of this new argument? I think this change is mysterious!

Benjamin Poulain

Comment 20 2011-11-14 16:06:17 PST

(In reply to comment #19) > (From update of attachment 114624 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review > > > Source/WebCore/ChangeLog:8 > > + URLs are encoded in UTF-8, then decoded as if they are Latin1 > > + https://bugs.webkit.org/show_bug.cgi?id=71758 > > + > > + Reviewed by NOBODY (OOPS!). > > + > > + Test: fast/url/invalid-urls-utf8.html > > This change log says nothing about what you changed to accomplish this and why. Yep, sorry I forgot to fill in the changelog. I will update with a Changelog. > > Source/WebCore/platform/KURL.cpp:379 > > - String* originalString = &rel; > > + String originalString = rel; > > Why is this better? The new code has to bump reference counts, and the old code didn’t, so the new code is slower. > > > Source/WebCore/platform/KURL.h:251 > > - void parse(const char* url, const String* originalString); > > + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL); > > Where is the explanation of this new argument? I think this change is mysterious! I can use a String& for originalString. The goal here was to make the code easier to follow and to have consistency between originalString and unmodifiedSourceURL. With a pointer you can have null pointer, null string, empty string. With the ref, only null string and empty string. Do you agree with a const String& or you think the pointer reads better?

Benjamin Poulain

Comment 21 2011-11-14 16:07:15 PST

(In reply to comment #18) > I did some work a while back to change WebKit to use GURL by default, but I didn't finish the project. I believe that would be valuable for the project and am happy to help if someone (such as yourself) was interested in taking the lead. I am short on time at the moment but I can give it a shot after thanksgiving. Do you have a bug number?

Darin Adler

Comment 22 2011-11-14 16:35:09 PST

(In reply to comment #20) > (In reply to comment #19) > The goal here was to make the code easier to follow and to have consistency between originalString and unmodifiedSourceURL. > > With a pointer you can have null pointer, null string, empty string. > With the ref, only null string and empty string. > Do you agree with a const String& or you think the pointer reads better? A pointer is a way to have an optional parameter, and is a relatively common idiom in WebKit when an argument is optional. I’m not sure that using a special value for a string, such as null or empty, to mean the parameter is not being passed is quite as clear.

Benjamin Poulain

Comment 23 2011-11-14 16:54:10 PST

Comment on attachment 114624 [details] Patch (In reply to comment #22) > A pointer is a way to have an optional parameter, and is a relatively common idiom in WebKit when an argument is optional. I’m not sure that using a special value for a string, such as null or empty, to mean the parameter is not being passed is quite as clear. As you wish, I don't have a strong opinion against this. I'll post an update.

Darin Adler

Comment 24 2011-11-15 09:48:59 PST

Comment on attachment 114624 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=114624&action=review >> Source/WebCore/platform/KURL.h:251 >> + void parse(const char* url, const String& originalString, const String& unmodifiedSourceURL); > > Where is the explanation of this new argument? I think this change is mysterious! Because the difference between “original string” and “unmodified source URL” is so subtle, I think we need to be clearer about what these are. I don’t think having the name “URL” in the new argument is good, because it’s not a URL any more than originalString is, it’s a string just like originalString. I think the three arguments are: 1) A pointer to an eight-bit string to parse as the URL. 2) A copy of that exact same URL as a String, if the code path that led here happens to mean we have that string already in a String. 3) A copy of the original String passed in to the KURL API. Except that (3) is not right. It is supported only for certain code paths. So we need a clearer explanation of what (3) is. Then perhaps we can rename (1), (2), and (3) to make these relationships clear.

Benjamin Poulain

Comment 25 2011-11-22 23:15:42 PST

Created attachment 116323 [details] Patch

Benjamin Poulain

Comment 26 2011-11-26 11:53:36 PST

Comment on attachment 116323 [details] Patch I will land manually so I can rebaseline Chromium afterwards.

Darin Adler

Comment 27 2011-11-30 11:03:38 PST

Comment on attachment 116323 [details] Patch Patch looks fine. But the Early Warning System failed on it, and you should figure out why.

Benjamin Poulain

Comment 28 2011-12-01 11:11:40 PST

Committed r101691: <http://trac.webkit.org/changeset/101691>

Benjamin Poulain

Comment 29 2011-12-01 14:51:28 PST

Committed r101713: <http://trac.webkit.org/changeset/101713>

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component WebCore Misc.

Assignee

Benjamin Poulain

Reported

2011-11-07 18:42 PST

Modified

2011-12-01 14:51 PST History

CC List

8 users Show

URL

Keywords

Depends on

73588

Blocks

Dependencies

tree graph