Bug 162105 - Optimize URLParser
Summary: Optimize URLParser
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Alex Christensen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-09-16 16:57 PDT by Alex Christensen
Modified: 2016-09-21 10:55 PDT (History)
2 users (show)

See Also:


Attachments
Patch (15.58 KB, patch)
2016-09-16 16:58 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (8.24 KB, patch)
2016-09-20 18:18 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (8.43 KB, patch)
2016-09-20 18:36 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (8.66 KB, patch)
2016-09-20 19:20 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (9.85 KB, patch)
2016-09-20 23:30 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Christensen 2016-09-16 16:57:40 PDT
Optimize URLParser
Comment 1 Alex Christensen 2016-09-16 16:58:30 PDT
Created attachment 289143 [details]
Patch
Comment 2 Alex Christensen 2016-09-16 17:00:45 PDT
Comment on attachment 289143 [details]
Patch

Incomplete and needs tests.
Based partially on feedback from https://bugs.webkit.org/show_bug.cgi?id=161968
Comment 3 Alex Christensen 2016-09-20 18:18:24 PDT
Created attachment 289419 [details]
Patch
Comment 4 Alex Christensen 2016-09-20 18:19:14 PDT
There's probably a better way to encode invalid code points.  I'm not familiar with it, though.
Comment 5 Alex Christensen 2016-09-20 18:32:54 PDT
By the way, just using the result of U8_APPEND_UNSAFE with invalid code points matches the existing URL::parse.  Should we just keep doing that, or try to match Chrome and Firefox?
Comment 6 Alex Christensen 2016-09-20 18:36:03 PDT
Created attachment 289421 [details]
Patch
Comment 7 Alex Christensen 2016-09-20 19:20:50 PDT
Created attachment 289424 [details]
Patch
Comment 8 Geoffrey Garen 2016-09-20 20:24:50 PDT
Comment on attachment 289424 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=289424&action=review

r=me

> Source/WebCore/ChangeLog:8
> +        Covered by new API tests.

What's the benchmark result?

> Source/WebCore/platform/URLParser.cpp:468
> +const size_t replacementCharacterUTF8PercentEncodedLength = 9;

you can use sizeof(replacementCharacterUTF8PercentEncoded) - 1 here.
Comment 9 Alex Christensen 2016-09-20 23:30:12 PDT
Created attachment 289433 [details]
Patch
Comment 10 Alex Christensen 2016-09-20 23:34:40 PDT
(In reply to comment #8)
> Comment on attachment 289424 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=289424&action=review
> 
> r=me
> 
> > Source/WebCore/ChangeLog:8
> > +        Covered by new API tests.
> 
> What's the benchmark result?
This is about a 5% speedup on my url parsing benchmark.
> 
> > Source/WebCore/platform/URLParser.cpp:468
> > +const size_t replacementCharacterUTF8PercentEncodedLength = 9;
> 
> you can use sizeof(replacementCharacterUTF8PercentEncoded) - 1 here.
replacementCharacterUTF8PercentEncoded is a pointer.
Comment 11 Alex Christensen 2016-09-20 23:36:56 PDT
http://trac.webkit.org/changeset/206198
Comment 12 Darin Adler 2016-09-21 10:55:48 PDT
Comment on attachment 289433 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=289433&action=review

> Source/WebCore/platform/URLParser.cpp:467
> +const char* replacementCharacterUTF8PercentEncoded = "%EF%BF%BD";

Should be either:

    const char* const replacementCharacterUTF8PercentEncoded = "%EF%BF%BD";

Or:

    const char replacementCharacterUTF8PercentEncoded[] = "%EF%BF%BD";

> Source/WebCore/platform/URLParser.cpp:468
> +const size_t replacementCharacterUTF8PercentEncodedLength = 9;

This should not be necessary with modern compilers. Instead you should be able to use strlen. The compilers do constant folding if you call strlen on a constant string. However, because of the issue above, replacementCharacterUTF8PercentEncoded was probably not a constant string, which I suppose made this necessary/valuable.

> Source/WebCore/platform/URLParser.cpp:516
> +        if (!U_IS_UNICODE_CHAR(codePoint)) {

I don’t think the test coverage covers all the cases in U_IS_UNICODE_CHAR:

    a) U+D800-U+DFFF
    b) U+FDD0-U+FDEF (new with Unicode 3.1)
    c) U+xxFFFE and U+xxFFFF where xx is 00-10, a total of 34 different code points
    d) any value greater than U+10FFFF

I think we are focusing on (a), but we’d need to try at least some characters in (b), (c), and (d) to be sure we match what the standard calls for and confirm that this is the correct macro to use. And cover some other unusual character to make sure they are not turned into the replacement characters, such as private use characters and characters right next to the range here.