Bug 53850 - Make WebKit's fragment canonicalization match other browsers
Summary: Make WebKit's fragment canonicalization match other browsers
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Other OS X 10.5
: P2 Normal
Assignee: Eric Seidel (no email)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-02-05 04:18 PST by Eric Seidel (no email)
Modified: 2011-02-09 03:53 PST (History)
3 users (show)

See Also:


Attachments
Patch (12.08 KB, patch)
2011-02-05 04:44 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
updated patch (12.82 KB, patch)
2011-02-06 11:33 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Patch for landing (12.00 KB, patch)
2011-02-06 23:40 PST, Adam Barth
no flags Details | Formatted Diff | Diff
a more conservative patch, not quite ready (10.98 KB, patch)
2011-02-08 16:47 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Ready for re-review and commit (12.66 KB, patch)
2011-02-08 21:40 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Patch for landing (13.21 KB, patch)
2011-02-09 01:57 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Seidel (no email) 2011-02-05 04:18:08 PST
Make WebKit's fragment cannonicalization match other browsers
Comment 1 Eric Seidel (no email) 2011-02-05 04:44:55 PST
Created attachment 81357 [details]
Patch
Comment 2 Eric Seidel (no email) 2011-02-05 04:46:06 PST
I'm not 100% convinced that Chrome's results (which we now match) are the best here.  I explained things mostly in teh ChangeLOg.

https://github.com/abarth/url-spec/blob/master/tests/gurl-results/by-browser.txt

unfortunately has kinda old data and doesn't cover all of these tests for all browsers.
Comment 3 Adam Barth 2011-02-05 10:24:43 PST
Comment on attachment 81357 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=81357&action=review

Some nits below.  w.r.t. diverging from Firefox.  I discussed this topic (in general) with some Firefox folks.  They said they're willing to converge to match other browsers URL handling as long as we write a spec for how we think things ought to be.  If we follow this trajectory, everyone should end up with interoperable behavior.

> Source/WebCore/platform/KURL.cpp:997
> +        if (c == '\0')
> +            continue;

This part isn't correct.  It's an artifact of how Brett did his testing in IE (which strips nulls at the HTML parser layer).  It's hard to observe directly, but we should be able to observe via the DOM (not via the HTML parser).

> Source/WebCore/platform/KURL.cpp:1001
> +            *p++ = '%';
> +            *p++ = hexDigits[c >> 4];
> +            *p++ = hexDigits[c & 0xF];

We should add an inline helper function to do this work instead of copy/pasting it.  Really what we should do is a templated free function that takes a predicate for what to escape, if you want to get fancy.
Comment 4 Eric Seidel (no email) 2011-02-06 11:33:04 PST
Created attachment 81418 [details]
updated patch
Comment 5 Adam Barth 2011-02-06 11:37:05 PST
Comment on attachment 81418 [details]
updated patch

View in context: https://bugs.webkit.org/attachment.cgi?id=81418&action=review

> Source/WebCore/platform/KURL.cpp:1012
> +        // FIXME: We should be decoding any non-ascii characters (above 0x80),
> +        // however at this point we don't have the full UChar anymore, just a char.

This FIXME isn't correct.  The decoding has already happened at this point.
Comment 6 Adam Barth 2011-02-06 23:40:08 PST
Created attachment 81452 [details]
Patch for landing
Comment 7 WebKit Commit Bot 2011-02-07 02:55:21 PST
Comment on attachment 81452 [details]
Patch for landing

Rejecting attachment 81452 [details] from commit-queue.

Failed to run "['./Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=cr-jail-3', 'build-..." exit_code: 2

Last 500 characters of output:
..........................................................................
fast/css/content ..
fast/css/counters .......................
fast/css/getComputedStyle ........................
fast/css/namespaces ...........
fast/doctypes ............
fast/dom ..............
fast/dom/anchor-getParameter.html -> failed

Exiting early after 1 failures. 6834 tests run.
145.29s total testing time

6833 test cases (99%) succeeded
1 test case (<1%) had incorrect layout
4 test cases (<1%) had stderr output

Full output: http://queues.webkit.org/results/7699951
Comment 8 Eric Seidel (no email) 2011-02-08 00:48:55 PST
The patch must be wrong.  We hit ASSERTs with this.

fast/loader/data-url-encoding-html.html is one such test which asserts.

I'll have to look at it tomrorow.
Comment 9 Eric Seidel (no email) 2011-02-08 02:07:38 PST
We're hitting:
static void checkEncodedString(const String& url)
{
    for (unsigned i = 0; i < url.length(); ++i)
        ASSERT(!(url[i] & ~0x7F)); // THIS ASSERT

    ASSERT(!url.length() || isSchemeFirstChar(url[0]));
}

Which is expected since we're now allowing characters over 127 in the URL.

Instead of changing the ASSERT I think I'd rather just encode everything over 127 for now.
Comment 10 Eric Seidel (no email) 2011-02-08 13:15:12 PST
There is actually a bunch more to do here.  About 10 tests fail with this applied.
Comment 11 Adam Barth 2011-02-08 14:06:07 PST
> Instead of changing the ASSERT I think I'd rather just encode everything over 127 for now.

Why?  We're supposed to have unicode characters over 127 in the fragment.
Comment 12 Eric Seidel (no email) 2011-02-08 14:14:57 PST
(In reply to comment #11)
> > Instead of changing the ASSERT I think I'd rather just encode everything over 127 for now.
> 
> Why?  We're supposed to have unicode characters over 127 in the fragment.

By the time this fragment handling code ever runs we have chars instead of UChars.  And checkEncodedString clearly assumes the url will only ever include ascii. :)

Still working through the tests.
Comment 13 Alexey Proskuryakov 2011-02-08 14:26:20 PST
The idea behind checkEncodedString() was to make as stringent test for the fact that the string is indeed a parsed KURL as possible, catching cases where arbitrary URLs are passed in.

But yes, KURL works with bytes here, not with UTF-16 strings.
Comment 14 Adam Barth 2011-02-08 14:36:13 PST
I'm not sure I understand.  By bytes, do you mean UTF-8 strings?  Canonicalized URLs can have non-ASCII characters in their fragment.
Comment 15 Alexey Proskuryakov 2011-02-08 15:16:26 PST
I think that we can get byte sequences that aren't UTF-8 in some KURL code paths as we decode percent escapes. I'm not very clear on what goes on, but I doubt we can assume that "char*" is UTF-8 in KURL code.
Comment 16 Eric Seidel (no email) 2011-02-08 16:47:39 PST
Created attachment 81716 [details]
a more conservative patch, not quite ready
Comment 17 Adam Barth 2011-02-08 17:57:31 PST
Comment on attachment 81716 [details]
a more conservative patch, not quite ready

Ok.  This is certainly a step in the right direction.  We'll need to revisit this issue at some point.
Comment 18 Eric Seidel (no email) 2011-02-08 21:40:16 PST
Created attachment 81744 [details]
Ready for re-review and commit
Comment 19 Eric Seidel (no email) 2011-02-08 21:42:21 PST
I believe this passes all the tests now.  However its hard to tell, since it seems SL Debug builds still crash a bunch of tests in TOT. :(
Comment 20 Adam Barth 2011-02-08 22:05:31 PST
Comment on attachment 81744 [details]
Ready for re-review and commit

View in context: https://bugs.webkit.org/attachment.cgi?id=81744&action=review

> Source/WebCore/platform/KURL.cpp:1008
> +        // Strip CR, LF and Tab from fragments, per:
> +        // https://bugs.webkit.org/show_bug.cgi?id=8770

How does GURL handle this case?
Comment 21 Eric Seidel (no email) 2011-02-08 23:05:18 PST
(In reply to comment #20)
> (From update of attachment 81744 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=81744&action=review
> 
> > Source/WebCore/platform/KURL.cpp:1008
> > +        // Strip CR, LF and Tab from fragments, per:
> > +        // https://bugs.webkit.org/show_bug.cgi?id=8770
> 
> How does GURL handle this case?

I'm not sure.  I did test to see that both Chrome and FF pass the test in question (which is about stripping these chars).  I'm not sure how GURL does the actual stripping however.
Comment 22 Eric Seidel (no email) 2011-02-09 01:20:10 PST
(In reply to comment #21)
> (In reply to comment #20)
> > (From update of attachment 81744 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=81744&action=review
> > 
> > > Source/WebCore/platform/KURL.cpp:1008
> > > +        // Strip CR, LF and Tab from fragments, per:
> > > +        // https://bugs.webkit.org/show_bug.cgi?id=8770
> > 
> > How does GURL handle this case?
> 
> I'm not sure.  I did test to see that both Chrome and FF pass the test in question (which is about stripping these chars).  I'm not sure how GURL does the actual stripping however.

GURL documents this here:
http://www.google.com/codesearch/p?hl=en#R_oU-jn3RFk/trunk/src/url_canon_query.cc&q=tab%20package:http://google-url%5C.googlecode%5C.com&l=40

But I haven't found the code which actually makes that happen yet.  I'm not very familiar with GURL.
Comment 24 Adam Barth 2011-02-09 01:26:25 PST
Comment on attachment 81744 [details]
Ready for re-review and commit

Ok, but we should add tests for these cases to fast/url/fragment.html
Comment 25 Eric Seidel (no email) 2011-02-09 01:46:25 PST
(In reply to comment #24)
> (From update of attachment 81744 [details])
> Ok, but we should add tests for these cases to fast/url/fragment.html

Ok.  It's already covered by fast/loader/url-strip-cr-lf-tab.html, but I'm happy to add a test to fast/url/anchor.html.
Comment 26 Eric Seidel (no email) 2011-02-09 01:57:53 PST
Created attachment 81774 [details]
Patch for landing
Comment 27 WebKit Commit Bot 2011-02-09 03:53:38 PST
Comment on attachment 81774 [details]
Patch for landing

Clearing flags on attachment: 81774

Committed r78040: <http://trac.webkit.org/changeset/78040>
Comment 28 WebKit Commit Bot 2011-02-09 03:53:45 PST
All reviewed patches have been landed.  Closing bug.