Bug 71602

Summary: [Mac] Improve the conversion from KURL to CFURL
Product: WebKit Reporter: Benjamin Poulain <benjamin>
Component: Page LoadingAssignee: Benjamin Poulain <benjamin>
Status: NEW ---    
Severity: Normal CC: abarth, ap, darin, ddkilzer, eric, sam
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch darin: review-

Description Benjamin Poulain 2011-11-04 19:19:34 PDT
Currently converting from KURL to CFURL throws away the high bytes of all the characters. Let's fix that.
Comment 1 Benjamin Poulain 2011-11-04 19:33:49 PDT
Created attachment 113746 [details]
Patch
Comment 2 Benjamin Poulain 2011-11-04 22:53:06 PDT
Comment on attachment 113746 [details]
Patch

Clearing the review flag. I have an idea on how I could add tests for this.
Comment 3 Benjamin Poulain 2011-12-12 20:01:32 PST
Created attachment 118943 [details]
Patch
Comment 4 Darin Adler 2011-12-18 19:33:26 PST
Comment on attachment 118943 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118943&action=review

This patch changes the behavior when creating a CFURL from a URL that came from a webpage with a non-UTF-8 encoding. It’s possible that it only changes behavior for the better, but with no test cases demonstrating the change, I am concerned that we are just doing this blind, without testing. We should come up with a way to test this, and demonstrate that the new behavior is better than the old. The test case we are adding in this patch is entirely in UTF-8, so it doesn’t address the issue of behavior for non-UTF-8 encodings.

> Source/WebCore/platform/cf/KURLCFNet.cpp:70
> +    // NOTE: CFURLCreateAbsoluteURLWithBytes() is the only way to create an absolute URL or completely fail.
> +    // However, this function is not tolerant of illegal UTF-8 sequences, which
>      // could either be a malformed string or bytes in a different encoding, like Shift-JIS, so we fall back
>      // onto using ISO Latin-1 in those cases.

They way you have rewritten this function completely eliminates this possibility, so neither the comment nor the code needs to allow this.

> Source/WebCore/platform/cf/KURLCFNet.cpp:74
> +    // FIXME: does that really ever happen?

This comment is quite mysterious. We should not add it.
Comment 5 Benjamin Poulain 2011-12-18 19:38:10 PST
> > Source/WebCore/platform/cf/KURLCFNet.cpp:70
> > +    // NOTE: CFURLCreateAbsoluteURLWithBytes() is the only way to create an absolute URL or completely fail.
> > +    // However, this function is not tolerant of illegal UTF-8 sequences, which
> >      // could either be a malformed string or bytes in a different encoding, like Shift-JIS, so we fall back
> >      // onto using ISO Latin-1 in those cases.
> 
> They way you have rewritten this function completely eliminates this possibility, so neither the comment nor the code needs to allow this.

I was not sure of that. The previous comment was totally mysterious to me given what could be in m_string.

So would you agree the Latin1 case is useless in that function?
Comment 6 Darin Adler 2011-12-18 19:46:57 PST
(In reply to comment #5)
> So would you agree the Latin1 case is useless in that function?

Yes, it’s useless in your new function, where you can guarantee the characters are UTF-8.

But I think the reason we can guarantee that is suspect; we need test cases for URLs that come from non-UTF-8 webpages to make sure that our behavior is either preserved or improved for such URLs. I am not sure just from reading your patch if that is so.
Comment 7 Benjamin Poulain 2011-12-18 19:54:10 PST
> But I think the reason we can guarantee that is suspect; we need test cases for URLs that come from non-UTF-8 webpages to make sure that our behavior is either preserved or improved for such URLs. I am not sure just from reading your patch if that is so.

Valid URLs are always converted to ASCII.

Invalid URLs are no longer changed by KURL since r101713.

I am unsure what you want me to test. Non-UTF8 pages containing invalid characters?
Comment 8 Darin Adler 2011-12-18 20:47:23 PST
(In reply to comment #7)
> > But I think the reason we can guarantee that is suspect; we need test cases for URLs that come from non-UTF-8 webpages to make sure that our behavior is either preserved or improved for such URLs. I am not sure just from reading your patch if that is so.
> 
> Valid URLs are always converted to ASCII.
> 
> Invalid URLs are no longer changed by KURL since r101713.
> 
> I am unsure what you want me to test.

OK, that’s a good explanation.

Not to state the obvious, but we need to test any case where the code change has an effect on behavior.

> Non-UTF8 pages containing invalid characters?

Sure, that seems worth testing.

Originally, we needed to match the behavior of other browsers, browsers that did no character set conversion on URLs at all. URLs on pages with, say, Shift-JIS encoding could have Shift-JIS character sequences, and in those other browsers those sequences would survive intact and be sent to the server.

Given what you say, it’s possible that percent-encoding is what takes care of all cases like this now.
Comment 9 Darin Adler 2011-12-18 20:50:29 PST
One other consideration is that we want to make sure the new code path is not measurably slower than the old. The interface to the utf8 function requires allocating an extra copy of the string because of the use of CString; the old code did not include memory allocation for simple cases. We may need a special case to preserve a fast case that is no slower than before. Or it may be that the extra memory allocation overhead is negligible. Might have to measure to be sure.
Comment 10 Benjamin Poulain 2011-12-18 21:02:13 PST
(In reply to comment #9)
> One other consideration is that we want to make sure the new code path is not measurably slower than the old. The interface to the utf8 function requires allocating an extra copy of the string because of the use of CString; the old code did not include memory allocation for simple cases. We may need a special case to preserve a fast case that is no slower than before. Or it may be that the extra memory allocation overhead is negligible. Might have to measure to be sure.

Note that the previous code is always fast because it is wrong. I got started with KURL because I was getting weird URLs in CFNetwork.

We can easily add a fast path for valid URLs. I don't mind adding that.
Comment 11 Darin Adler 2011-12-18 21:42:56 PST
(In reply to comment #10)
> (In reply to comment #9)
> > One other consideration is that we want to make sure the new code path is not measurably slower than the old. The interface to the utf8 function requires allocating an extra copy of the string because of the use of CString; the old code did not include memory allocation for simple cases. We may need a special case to preserve a fast case that is no slower than before. Or it may be that the extra memory allocation overhead is negligible. Might have to measure to be sure.
> 
> Note that the previous code is always fast because it is wrong. I got started with KURL because I was getting weird URLs in CFNetwork.
> 
> We can easily add a fast path for valid URLs. I don't mind adding that.

Yes, I understand that when fixing a bug you are changing from incorrect to correct behavior. I won’t debate the point that the old code was fast “because it was wrong”, even though I don’t agree. It was fast and wrong. We want new code that’s right and still fast, at least in the common cases.

There are two possibilities. You can prove that the old speed didn’t matter, or you can include a fast path that is as fast as the old code or faster. Either is OK with me.
Comment 12 Benjamin Poulain 2011-12-18 22:02:12 PST
I am not defensive. I said "_always_ fast".
Comment 13 Darin Adler 2011-12-18 22:24:13 PST
OK. You’ve said the valid URL case is all ASCII, so it seems that the valid case is the one that was both fast and correct before. If speed of these functions does matter, then we don’t want to slow down the valid case when improving the behavior of the invalid case.
Comment 14 Benjamin Poulain 2011-12-18 22:27:01 PST
(In reply to comment #13)
> OK. You’ve said the valid URL case is all ASCII, so it seems that the valid case is the one that was both fast and correct before. If speed of these functions does matter, then we don’t want to slow down the valid case when improving the behavior of the invalid case.

That is exactly what I said:
"We can easily add a fast path for valid URLs. I don't mind adding that."

I don't know what we are disagreeing about.
Comment 15 Darin Adler 2011-12-18 22:28:04 PST
I’m don’t think we are disagreeing.
Comment 16 Darin Adler 2011-12-18 22:28:36 PST
Maybe it seems like disagreeing when I say the same thing you did in a different way, for emphasis.
Comment 17 Benjamin Poulain 2011-12-18 22:31:54 PST
(In reply to comment #16)
> Maybe it seems like disagreeing when I say the same thing you did in a different way, for emphasis.

My bad. For some reason I thought you were pissed when reading comment 11.
Comment 18 Darin Adler 2011-12-18 22:46:21 PST
(In reply to comment #17)
> For some reason I thought you were pissed when reading comment 11.

Nope. My only real new point in that one is that we don’t have to add a fast path if we can prove this code isn’t performance critical.
Comment 19 Benjamin Poulain 2011-12-18 23:19:38 PST
Created attachment 119824 [details]
Patch
Comment 20 Benjamin Poulain 2011-12-18 23:22:53 PST
Comment on attachment 119824 [details]
Patch

Oops, forgot to update the changelog accordingly.
Comment 21 Benjamin Poulain 2011-12-18 23:26:03 PST
Created attachment 119826 [details]
Patch
Comment 22 Darin Adler 2011-12-18 23:34:48 PST
Comment on attachment 119826 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=119826&action=review

This looks great. review- because of the CSTring.h typo, but also we need to figure out if CFURL does anything with the passed-in character encoding later on after the URL is created.

> Source/WebCore/platform/cf/KURLCFNet.cpp:30
> +#include <wtf/text/CSTring.h>

We don’t need this include any more. But also, it will break the build on case sensitive file systems because of the capital T.

> Source/WebCore/platform/cf/KURLCFNet.cpp:69
> +    if (urlString.is8Bit())
> +        return CFURLCreateAbsoluteURLWithBytes(0, urlString.characters8(), urlString.length(), kCFStringEncodingISOLatin1, 0, true);
> +    return CFURLCreateAbsoluteURLWithBytes(0, reinterpret_cast<const UInt8*>(urlString.characters16()), urlString.length(), kCFStringEncodingUnicode, 0, true);

I believe the encoding passed in to this function is used by CFURL for more than just interpreting the bytes at creation time; I think it has some effect on the handling of escape sequences later. I’ll have to research this before I’m sure this change is safe.

What about URL strings that are 16-bit strings, but happen to be all ASCII? Aren’t those common? Or maybe we always create a string with 8-bit characters now.

> Source/WebCore/platform/cf/KURLCFNet.cpp:76
> +    // Note: This function does not handle empty URLs because CFURL does not handle empty URLs.
> +    // See KURL::createCFURL() in KURLMac.mm.

I think it’s confusing to say “does not handle empty URLs”; it raises but does not answer this question: What does this function do when the URL is empty? Does it crash? Create some other kind of URL?
Comment 23 Benjamin Poulain 2011-12-18 23:37:33 PST
Hum, and there is one more case that should be tested: invalid UTF-8 characters in the invalid URL.
Comment 24 Benjamin Poulain 2011-12-18 23:42:58 PST
> What about URL strings that are 16-bit strings, but happen to be all ASCII? Aren’t those common? Or maybe we always create a string with 8-bit characters now.

I think that is the common case at the moment because we call String::characters() when we parse. That is probably something we should fix separately.
Comment 25 Benjamin Poulain 2011-12-18 23:57:52 PST
Created attachment 119828 [details]
Patch
Comment 26 Benjamin Poulain 2011-12-18 23:59:16 PST
This last one is not for review. I still need to figure the encoding details with CFURL.
Comment 27 Benjamin Poulain 2011-12-19 02:19:35 PST
Created attachment 119837 [details]
Patch
Comment 28 Eric Seidel (no email) 2011-12-21 13:22:41 PST
CCing Adam in case he's curious.
Comment 29 Darin Adler 2011-12-21 14:12:55 PST
Comment on attachment 119837 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=119837&action=review

> Source/WebCore/platform/cf/KURLCFNet.cpp:73
> +    const String& urlString = url.string();
> +    // CFURL saves the encoding argument to be able to reverse the percent encoding. We must use kCFStringEncodingUTF8 to
> +    // be consistent with KURL path encoding.
> +    if (urlString.is8Bit())
> +        return CFURLCreateAbsoluteURLWithBytes(0, urlString.characters8(), urlString.length(), kCFStringEncodingUTF8, 0, true);

I think this logic should be used only if the url is valid.
Comment 30 Benjamin Poulain 2011-12-23 01:52:51 PST
> > Source/WebCore/platform/cf/KURLCFNet.cpp:73
> > +    const String& urlString = url.string();
> > +    // CFURL saves the encoding argument to be able to reverse the percent encoding. We must use kCFStringEncodingUTF8 to
> > +    // be consistent with KURL path encoding.
> > +    if (urlString.is8Bit())
> > +        return CFURLCreateAbsoluteURLWithBytes(0, urlString.characters8(), urlString.length(), kCFStringEncodingUTF8, 0, true);
> 
> I think this logic should be used only if the url is valid.

If the string is 8bit, it does not really matter if URL is valid, both cases are Latin1 and will encode fine through UTF-8. Isn't it?
Comment 31 Darin Adler 2011-12-23 08:21:40 PST
(In reply to comment #30)
> both cases are Latin1 and will encode fine through UTF-8

No, Latin-1 will not encode fine as UTF-8. For example, the Latin-1 character for À is 0xC0, but in UTF-8 that’s the leading byte of a multi-byte sequence.
Comment 32 Benjamin Poulain 2011-12-23 09:06:48 PST
Comment on attachment 119837 [details]
Patch

> No, Latin-1 will not encode fine as UTF-8. For example, the Latin-1 character for À is 0xC0, but in UTF-8 that’s the leading byte of a multi-byte sequence.

Of course, I was thinking Unicode, not UTF8. This is indeed completely wrong.
Comment 33 Benjamin Poulain 2012-01-06 20:46:46 PST
Created attachment 121535 [details]
Patch
Comment 34 Eric Seidel (no email) 2012-09-13 09:45:50 PDT
Comment on attachment 121535 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=121535&action=review

> Source/WebCore/platform/cf/KURLCFNet.cpp:39
> +CFURLRef createCFURLFromString(const KURL&);

fromString?  you mean fromKURL, no?
Comment 35 Darin Adler 2013-04-08 18:09:05 PDT
Comment on attachment 121535 [details]
Patch

review- since at least the name of the function here has to change