Make WebKit's fragment cannonicalization match other browsers
Created attachment 81357 [details] Patch
I'm not 100% convinced that Chrome's results (which we now match) are the best here. I explained things mostly in teh ChangeLOg. https://github.com/abarth/url-spec/blob/master/tests/gurl-results/by-browser.txt unfortunately has kinda old data and doesn't cover all of these tests for all browsers.
Comment on attachment 81357 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=81357&action=review Some nits below. w.r.t. diverging from Firefox. I discussed this topic (in general) with some Firefox folks. They said they're willing to converge to match other browsers URL handling as long as we write a spec for how we think things ought to be. If we follow this trajectory, everyone should end up with interoperable behavior. > Source/WebCore/platform/KURL.cpp:997 > + if (c == '\0') > + continue; This part isn't correct. It's an artifact of how Brett did his testing in IE (which strips nulls at the HTML parser layer). It's hard to observe directly, but we should be able to observe via the DOM (not via the HTML parser). > Source/WebCore/platform/KURL.cpp:1001 > + *p++ = '%'; > + *p++ = hexDigits[c >> 4]; > + *p++ = hexDigits[c & 0xF]; We should add an inline helper function to do this work instead of copy/pasting it. Really what we should do is a templated free function that takes a predicate for what to escape, if you want to get fancy.
Created attachment 81418 [details] updated patch
Comment on attachment 81418 [details] updated patch View in context: https://bugs.webkit.org/attachment.cgi?id=81418&action=review > Source/WebCore/platform/KURL.cpp:1012 > + // FIXME: We should be decoding any non-ascii characters (above 0x80), > + // however at this point we don't have the full UChar anymore, just a char. This FIXME isn't correct. The decoding has already happened at this point.
Created attachment 81452 [details] Patch for landing
Comment on attachment 81452 [details] Patch for landing Rejecting attachment 81452 [details] from commit-queue. Failed to run "['./Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=cr-jail-3', 'build-..." exit_code: 2 Last 500 characters of output: .......................................................................... fast/css/content .. fast/css/counters ....................... fast/css/getComputedStyle ........................ fast/css/namespaces ........... fast/doctypes ............ fast/dom .............. fast/dom/anchor-getParameter.html -> failed Exiting early after 1 failures. 6834 tests run. 145.29s total testing time 6833 test cases (99%) succeeded 1 test case (<1%) had incorrect layout 4 test cases (<1%) had stderr output Full output: http://queues.webkit.org/results/7699951
The patch must be wrong. We hit ASSERTs with this. fast/loader/data-url-encoding-html.html is one such test which asserts. I'll have to look at it tomrorow.
We're hitting: static void checkEncodedString(const String& url) { for (unsigned i = 0; i < url.length(); ++i) ASSERT(!(url[i] & ~0x7F)); // THIS ASSERT ASSERT(!url.length() || isSchemeFirstChar(url[0])); } Which is expected since we're now allowing characters over 127 in the URL. Instead of changing the ASSERT I think I'd rather just encode everything over 127 for now.
There is actually a bunch more to do here. About 10 tests fail with this applied.
> Instead of changing the ASSERT I think I'd rather just encode everything over 127 for now. Why? We're supposed to have unicode characters over 127 in the fragment.
(In reply to comment #11) > > Instead of changing the ASSERT I think I'd rather just encode everything over 127 for now. > > Why? We're supposed to have unicode characters over 127 in the fragment. By the time this fragment handling code ever runs we have chars instead of UChars. And checkEncodedString clearly assumes the url will only ever include ascii. :) Still working through the tests.
The idea behind checkEncodedString() was to make as stringent test for the fact that the string is indeed a parsed KURL as possible, catching cases where arbitrary URLs are passed in. But yes, KURL works with bytes here, not with UTF-16 strings.
I'm not sure I understand. By bytes, do you mean UTF-8 strings? Canonicalized URLs can have non-ASCII characters in their fragment.
I think that we can get byte sequences that aren't UTF-8 in some KURL code paths as we decode percent escapes. I'm not very clear on what goes on, but I doubt we can assume that "char*" is UTF-8 in KURL code.
Created attachment 81716 [details] a more conservative patch, not quite ready
Comment on attachment 81716 [details] a more conservative patch, not quite ready Ok. This is certainly a step in the right direction. We'll need to revisit this issue at some point.
Created attachment 81744 [details] Ready for re-review and commit
I believe this passes all the tests now. However its hard to tell, since it seems SL Debug builds still crash a bunch of tests in TOT. :(
Comment on attachment 81744 [details] Ready for re-review and commit View in context: https://bugs.webkit.org/attachment.cgi?id=81744&action=review > Source/WebCore/platform/KURL.cpp:1008 > + // Strip CR, LF and Tab from fragments, per: > + // https://bugs.webkit.org/show_bug.cgi?id=8770 How does GURL handle this case?
(In reply to comment #20) > (From update of attachment 81744 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=81744&action=review > > > Source/WebCore/platform/KURL.cpp:1008 > > + // Strip CR, LF and Tab from fragments, per: > > + // https://bugs.webkit.org/show_bug.cgi?id=8770 > > How does GURL handle this case? I'm not sure. I did test to see that both Chrome and FF pass the test in question (which is about stripping these chars). I'm not sure how GURL does the actual stripping however.
(In reply to comment #21) > (In reply to comment #20) > > (From update of attachment 81744 [details] [details]) > > View in context: https://bugs.webkit.org/attachment.cgi?id=81744&action=review > > > > > Source/WebCore/platform/KURL.cpp:1008 > > > + // Strip CR, LF and Tab from fragments, per: > > > + // https://bugs.webkit.org/show_bug.cgi?id=8770 > > > > How does GURL handle this case? > > I'm not sure. I did test to see that both Chrome and FF pass the test in question (which is about stripping these chars). I'm not sure how GURL does the actual stripping however. GURL documents this here: http://www.google.com/codesearch/p?hl=en#R_oU-jn3RFk/trunk/src/url_canon_query.cc&q=tab%20package:http://google-url%5C.googlecode%5C.com&l=40 But I haven't found the code which actually makes that happen yet. I'm not very familiar with GURL.
Here is the code in question: http://www.google.com/codesearch/p?hl=en#R_oU-jn3RFk/trunk/src/url_canon_etc.cc&q=package:http://google-url%5C.googlecode%5C.com%200x20&l=275
Comment on attachment 81744 [details] Ready for re-review and commit Ok, but we should add tests for these cases to fast/url/fragment.html
(In reply to comment #24) > (From update of attachment 81744 [details]) > Ok, but we should add tests for these cases to fast/url/fragment.html Ok. It's already covered by fast/loader/url-strip-cr-lf-tab.html, but I'm happy to add a test to fast/url/anchor.html.
Created attachment 81774 [details] Patch for landing
Comment on attachment 81774 [details] Patch for landing Clearing flags on attachment: 81774 Committed r78040: <http://trac.webkit.org/changeset/78040>
All reviewed patches have been landed. Closing bug.