Summary: | [Mac] Improve the conversion from KURL to CFURL | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Benjamin Poulain <benjamin> | ||||||||||||||||
Component: | Page Loading | Assignee: | Benjamin Poulain <benjamin> | ||||||||||||||||
Status: | NEW --- | ||||||||||||||||||
Severity: | Normal | CC: | abarth, ap, darin, ddkilzer, eric, sam | ||||||||||||||||
Priority: | P2 | ||||||||||||||||||
Version: | 528+ (Nightly build) | ||||||||||||||||||
Hardware: | Unspecified | ||||||||||||||||||
OS: | Unspecified | ||||||||||||||||||
Attachments: |
|
Description
Benjamin Poulain
2011-11-04 19:19:34 PDT
Created attachment 113746 [details]
Patch
Comment on attachment 113746 [details]
Patch
Clearing the review flag. I have an idea on how I could add tests for this.
Created attachment 118943 [details]
Patch
Comment on attachment 118943 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118943&action=review This patch changes the behavior when creating a CFURL from a URL that came from a webpage with a non-UTF-8 encoding. It’s possible that it only changes behavior for the better, but with no test cases demonstrating the change, I am concerned that we are just doing this blind, without testing. We should come up with a way to test this, and demonstrate that the new behavior is better than the old. The test case we are adding in this patch is entirely in UTF-8, so it doesn’t address the issue of behavior for non-UTF-8 encodings. > Source/WebCore/platform/cf/KURLCFNet.cpp:70 > + // NOTE: CFURLCreateAbsoluteURLWithBytes() is the only way to create an absolute URL or completely fail. > + // However, this function is not tolerant of illegal UTF-8 sequences, which > // could either be a malformed string or bytes in a different encoding, like Shift-JIS, so we fall back > // onto using ISO Latin-1 in those cases. They way you have rewritten this function completely eliminates this possibility, so neither the comment nor the code needs to allow this. > Source/WebCore/platform/cf/KURLCFNet.cpp:74 > + // FIXME: does that really ever happen? This comment is quite mysterious. We should not add it. > > Source/WebCore/platform/cf/KURLCFNet.cpp:70
> > + // NOTE: CFURLCreateAbsoluteURLWithBytes() is the only way to create an absolute URL or completely fail.
> > + // However, this function is not tolerant of illegal UTF-8 sequences, which
> > // could either be a malformed string or bytes in a different encoding, like Shift-JIS, so we fall back
> > // onto using ISO Latin-1 in those cases.
>
> They way you have rewritten this function completely eliminates this possibility, so neither the comment nor the code needs to allow this.
I was not sure of that. The previous comment was totally mysterious to me given what could be in m_string.
So would you agree the Latin1 case is useless in that function?
(In reply to comment #5) > So would you agree the Latin1 case is useless in that function? Yes, it’s useless in your new function, where you can guarantee the characters are UTF-8. But I think the reason we can guarantee that is suspect; we need test cases for URLs that come from non-UTF-8 webpages to make sure that our behavior is either preserved or improved for such URLs. I am not sure just from reading your patch if that is so. > But I think the reason we can guarantee that is suspect; we need test cases for URLs that come from non-UTF-8 webpages to make sure that our behavior is either preserved or improved for such URLs. I am not sure just from reading your patch if that is so. Valid URLs are always converted to ASCII. Invalid URLs are no longer changed by KURL since r101713. I am unsure what you want me to test. Non-UTF8 pages containing invalid characters? (In reply to comment #7) > > But I think the reason we can guarantee that is suspect; we need test cases for URLs that come from non-UTF-8 webpages to make sure that our behavior is either preserved or improved for such URLs. I am not sure just from reading your patch if that is so. > > Valid URLs are always converted to ASCII. > > Invalid URLs are no longer changed by KURL since r101713. > > I am unsure what you want me to test. OK, that’s a good explanation. Not to state the obvious, but we need to test any case where the code change has an effect on behavior. > Non-UTF8 pages containing invalid characters? Sure, that seems worth testing. Originally, we needed to match the behavior of other browsers, browsers that did no character set conversion on URLs at all. URLs on pages with, say, Shift-JIS encoding could have Shift-JIS character sequences, and in those other browsers those sequences would survive intact and be sent to the server. Given what you say, it’s possible that percent-encoding is what takes care of all cases like this now. One other consideration is that we want to make sure the new code path is not measurably slower than the old. The interface to the utf8 function requires allocating an extra copy of the string because of the use of CString; the old code did not include memory allocation for simple cases. We may need a special case to preserve a fast case that is no slower than before. Or it may be that the extra memory allocation overhead is negligible. Might have to measure to be sure. (In reply to comment #9) > One other consideration is that we want to make sure the new code path is not measurably slower than the old. The interface to the utf8 function requires allocating an extra copy of the string because of the use of CString; the old code did not include memory allocation for simple cases. We may need a special case to preserve a fast case that is no slower than before. Or it may be that the extra memory allocation overhead is negligible. Might have to measure to be sure. Note that the previous code is always fast because it is wrong. I got started with KURL because I was getting weird URLs in CFNetwork. We can easily add a fast path for valid URLs. I don't mind adding that. (In reply to comment #10) > (In reply to comment #9) > > One other consideration is that we want to make sure the new code path is not measurably slower than the old. The interface to the utf8 function requires allocating an extra copy of the string because of the use of CString; the old code did not include memory allocation for simple cases. We may need a special case to preserve a fast case that is no slower than before. Or it may be that the extra memory allocation overhead is negligible. Might have to measure to be sure. > > Note that the previous code is always fast because it is wrong. I got started with KURL because I was getting weird URLs in CFNetwork. > > We can easily add a fast path for valid URLs. I don't mind adding that. Yes, I understand that when fixing a bug you are changing from incorrect to correct behavior. I won’t debate the point that the old code was fast “because it was wrong”, even though I don’t agree. It was fast and wrong. We want new code that’s right and still fast, at least in the common cases. There are two possibilities. You can prove that the old speed didn’t matter, or you can include a fast path that is as fast as the old code or faster. Either is OK with me. I am not defensive. I said "_always_ fast". OK. You’ve said the valid URL case is all ASCII, so it seems that the valid case is the one that was both fast and correct before. If speed of these functions does matter, then we don’t want to slow down the valid case when improving the behavior of the invalid case. (In reply to comment #13) > OK. You’ve said the valid URL case is all ASCII, so it seems that the valid case is the one that was both fast and correct before. If speed of these functions does matter, then we don’t want to slow down the valid case when improving the behavior of the invalid case. That is exactly what I said: "We can easily add a fast path for valid URLs. I don't mind adding that." I don't know what we are disagreeing about. I’m don’t think we are disagreeing. Maybe it seems like disagreeing when I say the same thing you did in a different way, for emphasis. (In reply to comment #16) > Maybe it seems like disagreeing when I say the same thing you did in a different way, for emphasis. My bad. For some reason I thought you were pissed when reading comment 11. (In reply to comment #17) > For some reason I thought you were pissed when reading comment 11. Nope. My only real new point in that one is that we don’t have to add a fast path if we can prove this code isn’t performance critical. Created attachment 119824 [details]
Patch
Comment on attachment 119824 [details]
Patch
Oops, forgot to update the changelog accordingly.
Created attachment 119826 [details]
Patch
Comment on attachment 119826 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=119826&action=review This looks great. review- because of the CSTring.h typo, but also we need to figure out if CFURL does anything with the passed-in character encoding later on after the URL is created. > Source/WebCore/platform/cf/KURLCFNet.cpp:30 > +#include <wtf/text/CSTring.h> We don’t need this include any more. But also, it will break the build on case sensitive file systems because of the capital T. > Source/WebCore/platform/cf/KURLCFNet.cpp:69 > + if (urlString.is8Bit()) > + return CFURLCreateAbsoluteURLWithBytes(0, urlString.characters8(), urlString.length(), kCFStringEncodingISOLatin1, 0, true); > + return CFURLCreateAbsoluteURLWithBytes(0, reinterpret_cast<const UInt8*>(urlString.characters16()), urlString.length(), kCFStringEncodingUnicode, 0, true); I believe the encoding passed in to this function is used by CFURL for more than just interpreting the bytes at creation time; I think it has some effect on the handling of escape sequences later. I’ll have to research this before I’m sure this change is safe. What about URL strings that are 16-bit strings, but happen to be all ASCII? Aren’t those common? Or maybe we always create a string with 8-bit characters now. > Source/WebCore/platform/cf/KURLCFNet.cpp:76 > + // Note: This function does not handle empty URLs because CFURL does not handle empty URLs. > + // See KURL::createCFURL() in KURLMac.mm. I think it’s confusing to say “does not handle empty URLs”; it raises but does not answer this question: What does this function do when the URL is empty? Does it crash? Create some other kind of URL? Hum, and there is one more case that should be tested: invalid UTF-8 characters in the invalid URL. > What about URL strings that are 16-bit strings, but happen to be all ASCII? Aren’t those common? Or maybe we always create a string with 8-bit characters now.
I think that is the common case at the moment because we call String::characters() when we parse. That is probably something we should fix separately.
Created attachment 119828 [details]
Patch
This last one is not for review. I still need to figure the encoding details with CFURL. Created attachment 119837 [details]
Patch
CCing Adam in case he's curious. Comment on attachment 119837 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=119837&action=review > Source/WebCore/platform/cf/KURLCFNet.cpp:73 > + const String& urlString = url.string(); > + // CFURL saves the encoding argument to be able to reverse the percent encoding. We must use kCFStringEncodingUTF8 to > + // be consistent with KURL path encoding. > + if (urlString.is8Bit()) > + return CFURLCreateAbsoluteURLWithBytes(0, urlString.characters8(), urlString.length(), kCFStringEncodingUTF8, 0, true); I think this logic should be used only if the url is valid. > > Source/WebCore/platform/cf/KURLCFNet.cpp:73
> > + const String& urlString = url.string();
> > + // CFURL saves the encoding argument to be able to reverse the percent encoding. We must use kCFStringEncodingUTF8 to
> > + // be consistent with KURL path encoding.
> > + if (urlString.is8Bit())
> > + return CFURLCreateAbsoluteURLWithBytes(0, urlString.characters8(), urlString.length(), kCFStringEncodingUTF8, 0, true);
>
> I think this logic should be used only if the url is valid.
If the string is 8bit, it does not really matter if URL is valid, both cases are Latin1 and will encode fine through UTF-8. Isn't it?
(In reply to comment #30) > both cases are Latin1 and will encode fine through UTF-8 No, Latin-1 will not encode fine as UTF-8. For example, the Latin-1 character for À is 0xC0, but in UTF-8 that’s the leading byte of a multi-byte sequence. Comment on attachment 119837 [details] Patch > No, Latin-1 will not encode fine as UTF-8. For example, the Latin-1 character for À is 0xC0, but in UTF-8 that’s the leading byte of a multi-byte sequence. Of course, I was thinking Unicode, not UTF8. This is indeed completely wrong. Created attachment 121535 [details]
Patch
Comment on attachment 121535 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=121535&action=review > Source/WebCore/platform/cf/KURLCFNet.cpp:39 > +CFURLRef createCFURLFromString(const KURL&); fromString? you mean fromKURL, no? Comment on attachment 121535 [details]
Patch
review- since at least the name of the function here has to change
|