71602 – [Mac] Improve the conversion from KURL to CFURL

NEW 71602

[Mac] Improve the conversion from KURL to CFURL

https://bugs.webkit.org/show_bug.cgi?id=71602

Summary [Mac] Improve the conversion from KURL to CFURL

Benjamin Poulain

Reported 2011-11-04 19:19:34 PDT

Currently converting from KURL to CFURL throws away the high bytes of all the characters. Let's fix that.

Attachments
Patch (6.41 KB, patch) 2011-11-04 19:33 PDT, Benjamin Poulain	no flags	Details Formatted Diff Diff
Patch (14.90 KB, patch) 2011-12-12 20:01 PST, Benjamin Poulain	no flags	Details Formatted Diff Diff
Patch (14.33 KB, patch) 2011-12-18 23:19 PST, Benjamin Poulain	no flags	Details Formatted Diff Diff
Patch (14.14 KB, patch) 2011-12-18 23:26 PST, Benjamin Poulain	no flags	Details Formatted Diff Diff
Patch (15.20 KB, patch) 2011-12-18 23:57 PST, Benjamin Poulain	no flags	Details Formatted Diff Diff
Patch (15.57 KB, patch) 2011-12-19 02:19 PST, Benjamin Poulain	no flags	Details Formatted Diff Diff
Patch (15.63 KB, patch) 2012-01-06 20:46 PST, Benjamin Poulain	darin: review-	Details Formatted Diff Diff
Show Obsolete (6) View All Add attachment proposed patch, testcase, etc.

Benjamin Poulain

Comment 1 2011-11-04 19:33:49 PDT

Created attachment 113746 [details] Patch

Benjamin Poulain

Comment 2 2011-11-04 22:53:06 PDT

Comment on attachment 113746 [details] Patch Clearing the review flag. I have an idea on how I could add tests for this.

Benjamin Poulain

Comment 3 2011-12-12 20:01:32 PST

Created attachment 118943 [details] Patch

Darin Adler

Comment 4 2011-12-18 19:33:26 PST

Comment on attachment 118943 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118943&action=review This patch changes the behavior when creating a CFURL from a URL that came from a webpage with a non-UTF-8 encoding. It’s possible that it only changes behavior for the better, but with no test cases demonstrating the change, I am concerned that we are just doing this blind, without testing. We should come up with a way to test this, and demonstrate that the new behavior is better than the old. The test case we are adding in this patch is entirely in UTF-8, so it doesn’t address the issue of behavior for non-UTF-8 encodings. > Source/WebCore/platform/cf/KURLCFNet.cpp:70 > + // NOTE: CFURLCreateAbsoluteURLWithBytes() is the only way to create an absolute URL or completely fail. > + // However, this function is not tolerant of illegal UTF-8 sequences, which > // could either be a malformed string or bytes in a different encoding, like Shift-JIS, so we fall back > // onto using ISO Latin-1 in those cases. They way you have rewritten this function completely eliminates this possibility, so neither the comment nor the code needs to allow this. > Source/WebCore/platform/cf/KURLCFNet.cpp:74 > + // FIXME: does that really ever happen? This comment is quite mysterious. We should not add it.

Benjamin Poulain

Comment 5 2011-12-18 19:38:10 PST

> > Source/WebCore/platform/cf/KURLCFNet.cpp:70 > > + // NOTE: CFURLCreateAbsoluteURLWithBytes() is the only way to create an absolute URL or completely fail. > > + // However, this function is not tolerant of illegal UTF-8 sequences, which > > // could either be a malformed string or bytes in a different encoding, like Shift-JIS, so we fall back > > // onto using ISO Latin-1 in those cases. > > They way you have rewritten this function completely eliminates this possibility, so neither the comment nor the code needs to allow this. I was not sure of that. The previous comment was totally mysterious to me given what could be in m_string. So would you agree the Latin1 case is useless in that function?

Darin Adler

Comment 6 2011-12-18 19:46:57 PST

(In reply to comment #5) > So would you agree the Latin1 case is useless in that function? Yes, it’s useless in your new function, where you can guarantee the characters are UTF-8. But I think the reason we can guarantee that is suspect; we need test cases for URLs that come from non-UTF-8 webpages to make sure that our behavior is either preserved or improved for such URLs. I am not sure just from reading your patch if that is so.

Benjamin Poulain

Comment 7 2011-12-18 19:54:10 PST

> But I think the reason we can guarantee that is suspect; we need test cases for URLs that come from non-UTF-8 webpages to make sure that our behavior is either preserved or improved for such URLs. I am not sure just from reading your patch if that is so. Valid URLs are always converted to ASCII. Invalid URLs are no longer changed by KURL since r101713. I am unsure what you want me to test. Non-UTF8 pages containing invalid characters?

Darin Adler

Comment 8 2011-12-18 20:47:23 PST

(In reply to comment #7) > > But I think the reason we can guarantee that is suspect; we need test cases for URLs that come from non-UTF-8 webpages to make sure that our behavior is either preserved or improved for such URLs. I am not sure just from reading your patch if that is so. > > Valid URLs are always converted to ASCII. > > Invalid URLs are no longer changed by KURL since r101713. > > I am unsure what you want me to test. OK, that’s a good explanation. Not to state the obvious, but we need to test any case where the code change has an effect on behavior. > Non-UTF8 pages containing invalid characters? Sure, that seems worth testing. Originally, we needed to match the behavior of other browsers, browsers that did no character set conversion on URLs at all. URLs on pages with, say, Shift-JIS encoding could have Shift-JIS character sequences, and in those other browsers those sequences would survive intact and be sent to the server. Given what you say, it’s possible that percent-encoding is what takes care of all cases like this now.

Darin Adler

Comment 9 2011-12-18 20:50:29 PST

One other consideration is that we want to make sure the new code path is not measurably slower than the old. The interface to the utf8 function requires allocating an extra copy of the string because of the use of CString; the old code did not include memory allocation for simple cases. We may need a special case to preserve a fast case that is no slower than before. Or it may be that the extra memory allocation overhead is negligible. Might have to measure to be sure.

Benjamin Poulain

Comment 10 2011-12-18 21:02:13 PST

(In reply to comment #9) > One other consideration is that we want to make sure the new code path is not measurably slower than the old. The interface to the utf8 function requires allocating an extra copy of the string because of the use of CString; the old code did not include memory allocation for simple cases. We may need a special case to preserve a fast case that is no slower than before. Or it may be that the extra memory allocation overhead is negligible. Might have to measure to be sure. Note that the previous code is always fast because it is wrong. I got started with KURL because I was getting weird URLs in CFNetwork. We can easily add a fast path for valid URLs. I don't mind adding that.

Darin Adler

Comment 11 2011-12-18 21:42:56 PST

(In reply to comment #10) > (In reply to comment #9) > > One other consideration is that we want to make sure the new code path is not measurably slower than the old. The interface to the utf8 function requires allocating an extra copy of the string because of the use of CString; the old code did not include memory allocation for simple cases. We may need a special case to preserve a fast case that is no slower than before. Or it may be that the extra memory allocation overhead is negligible. Might have to measure to be sure. > > Note that the previous code is always fast because it is wrong. I got started with KURL because I was getting weird URLs in CFNetwork. > > We can easily add a fast path for valid URLs. I don't mind adding that. Yes, I understand that when fixing a bug you are changing from incorrect to correct behavior. I won’t debate the point that the old code was fast “because it was wrong”, even though I don’t agree. It was fast and wrong. We want new code that’s right and still fast, at least in the common cases. There are two possibilities. You can prove that the old speed didn’t matter, or you can include a fast path that is as fast as the old code or faster. Either is OK with me.

Benjamin Poulain

Comment 12 2011-12-18 22:02:12 PST

I am not defensive. I said "_always_ fast".

Darin Adler

Comment 13 2011-12-18 22:24:13 PST

OK. You’ve said the valid URL case is all ASCII, so it seems that the valid case is the one that was both fast and correct before. If speed of these functions does matter, then we don’t want to slow down the valid case when improving the behavior of the invalid case.

Benjamin Poulain

Comment 14 2011-12-18 22:27:01 PST

(In reply to comment #13) > OK. You’ve said the valid URL case is all ASCII, so it seems that the valid case is the one that was both fast and correct before. If speed of these functions does matter, then we don’t want to slow down the valid case when improving the behavior of the invalid case. That is exactly what I said: "We can easily add a fast path for valid URLs. I don't mind adding that." I don't know what we are disagreeing about.

Darin Adler

Comment 15 2011-12-18 22:28:04 PST

I’m don’t think we are disagreeing.

Darin Adler

Comment 16 2011-12-18 22:28:36 PST

Maybe it seems like disagreeing when I say the same thing you did in a different way, for emphasis.

Benjamin Poulain

Comment 17 2011-12-18 22:31:54 PST

(In reply to comment #16) > Maybe it seems like disagreeing when I say the same thing you did in a different way, for emphasis. My bad. For some reason I thought you were pissed when reading comment 11.

Darin Adler

Comment 18 2011-12-18 22:46:21 PST

(In reply to comment #17) > For some reason I thought you were pissed when reading comment 11. Nope. My only real new point in that one is that we don’t have to add a fast path if we can prove this code isn’t performance critical.

Benjamin Poulain

Comment 19 2011-12-18 23:19:38 PST

Created attachment 119824 [details] Patch

Benjamin Poulain

Comment 20 2011-12-18 23:22:53 PST

Comment on attachment 119824 [details] Patch Oops, forgot to update the changelog accordingly.

Benjamin Poulain

Comment 21 2011-12-18 23:26:03 PST

Created attachment 119826 [details] Patch

Darin Adler

Comment 22 2011-12-18 23:34:48 PST

Comment on attachment 119826 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=119826&action=review This looks great. review- because of the CSTring.h typo, but also we need to figure out if CFURL does anything with the passed-in character encoding later on after the URL is created. > Source/WebCore/platform/cf/KURLCFNet.cpp:30 > +#include <wtf/text/CSTring.h> We don’t need this include any more. But also, it will break the build on case sensitive file systems because of the capital T. > Source/WebCore/platform/cf/KURLCFNet.cpp:69 > + if (urlString.is8Bit()) > + return CFURLCreateAbsoluteURLWithBytes(0, urlString.characters8(), urlString.length(), kCFStringEncodingISOLatin1, 0, true); > + return CFURLCreateAbsoluteURLWithBytes(0, reinterpret_cast<const UInt8*>(urlString.characters16()), urlString.length(), kCFStringEncodingUnicode, 0, true); I believe the encoding passed in to this function is used by CFURL for more than just interpreting the bytes at creation time; I think it has some effect on the handling of escape sequences later. I’ll have to research this before I’m sure this change is safe. What about URL strings that are 16-bit strings, but happen to be all ASCII? Aren’t those common? Or maybe we always create a string with 8-bit characters now. > Source/WebCore/platform/cf/KURLCFNet.cpp:76 > + // Note: This function does not handle empty URLs because CFURL does not handle empty URLs. > + // See KURL::createCFURL() in KURLMac.mm. I think it’s confusing to say “does not handle empty URLs”; it raises but does not answer this question: What does this function do when the URL is empty? Does it crash? Create some other kind of URL?

Benjamin Poulain

Comment 23 2011-12-18 23:37:33 PST

Hum, and there is one more case that should be tested: invalid UTF-8 characters in the invalid URL.

Benjamin Poulain

Comment 24 2011-12-18 23:42:58 PST

> What about URL strings that are 16-bit strings, but happen to be all ASCII? Aren’t those common? Or maybe we always create a string with 8-bit characters now. I think that is the common case at the moment because we call String::characters() when we parse. That is probably something we should fix separately.

Benjamin Poulain

Comment 25 2011-12-18 23:57:52 PST

Created attachment 119828 [details] Patch

Benjamin Poulain

Comment 26 2011-12-18 23:59:16 PST

This last one is not for review. I still need to figure the encoding details with CFURL.

Benjamin Poulain

Comment 27 2011-12-19 02:19:35 PST

Created attachment 119837 [details] Patch

Eric Seidel (no email)

Comment 28 2011-12-21 13:22:41 PST

CCing Adam in case he's curious.

Darin Adler

Comment 29 2011-12-21 14:12:55 PST

Comment on attachment 119837 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=119837&action=review > Source/WebCore/platform/cf/KURLCFNet.cpp:73 > + const String& urlString = url.string(); > + // CFURL saves the encoding argument to be able to reverse the percent encoding. We must use kCFStringEncodingUTF8 to > + // be consistent with KURL path encoding. > + if (urlString.is8Bit()) > + return CFURLCreateAbsoluteURLWithBytes(0, urlString.characters8(), urlString.length(), kCFStringEncodingUTF8, 0, true); I think this logic should be used only if the url is valid.

Benjamin Poulain

Comment 30 2011-12-23 01:52:51 PST

> > Source/WebCore/platform/cf/KURLCFNet.cpp:73 > > + const String& urlString = url.string(); > > + // CFURL saves the encoding argument to be able to reverse the percent encoding. We must use kCFStringEncodingUTF8 to > > + // be consistent with KURL path encoding. > > + if (urlString.is8Bit()) > > + return CFURLCreateAbsoluteURLWithBytes(0, urlString.characters8(), urlString.length(), kCFStringEncodingUTF8, 0, true); > > I think this logic should be used only if the url is valid. If the string is 8bit, it does not really matter if URL is valid, both cases are Latin1 and will encode fine through UTF-8. Isn't it?

Darin Adler

Comment 31 2011-12-23 08:21:40 PST

(In reply to comment #30) > both cases are Latin1 and will encode fine through UTF-8 No, Latin-1 will not encode fine as UTF-8. For example, the Latin-1 character for À is 0xC0, but in UTF-8 that’s the leading byte of a multi-byte sequence.

Benjamin Poulain

Comment 32 2011-12-23 09:06:48 PST

Comment on attachment 119837 [details] Patch > No, Latin-1 will not encode fine as UTF-8. For example, the Latin-1 character for À is 0xC0, but in UTF-8 that’s the leading byte of a multi-byte sequence. Of course, I was thinking Unicode, not UTF8. This is indeed completely wrong.

Benjamin Poulain

Comment 33 2012-01-06 20:46:46 PST

Created attachment 121535 [details] Patch

Eric Seidel (no email)

Comment 34 2012-09-13 09:45:50 PDT

Comment on attachment 121535 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=121535&action=review > Source/WebCore/platform/cf/KURLCFNet.cpp:39 > +CFURLRef createCFURLFromString(const KURL&); fromString? you mean fromKURL, no?

Darin Adler

Comment 35 2013-04-08 18:09:05 PDT

Comment on attachment 121535 [details] Patch review- since at least the name of the function here has to change

Note You need to log in before you can comment on or make changes to this bug.

Status NEW

Resolution

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component Page Loading

Assignee

Benjamin Poulain

Reported

2011-11-04 19:19 PDT

Modified

2013-04-08 18:09 PDT History

CC List

6 users Show

URL

Keywords

Depends on

Blocks