Bug 160811

Summary: Initial URLParser implementation
Product: WebKit Reporter: Alex Christensen <achristensen>
Component: New BugsAssignee: Alex Christensen <achristensen>
Status: RESOLVED FIXED    
Severity: Normal CC: beidson, darin
Priority: P2    
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch none

Description Alex Christensen 2016-08-12 12:09:10 PDT
Initial URLParser implementation
Comment 1 Alex Christensen 2016-08-12 12:09:58 PDT
Created attachment 285926 [details]
Patch
Comment 2 Brady Eidson 2016-08-12 12:26:27 PDT
Comment on attachment 285926 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=285926&action=review

r+ is contingent on fixing the buffer safety issues.

> Source/WebCore/platform/URLParser.cpp:40
> +static bool isC0Control(const StringView::CodePoints::Iterator& c) { return *c <= 0x001F; }
> +static bool isC0ControlOrSpace(const StringView::CodePoints::Iterator& c) { return isC0Control(c) || *c == 0x0020; }
> +static bool isTabOrNewline(const StringView::CodePoints::Iterator& c) { return *c == 0x0009 || *c == 0x000A || *c == 0x000D; }
> +static bool isASCIIDigit(const StringView::CodePoints::Iterator& c) { return *c >= 0x0030  && *c <= 0x0039; }
> +static bool isASCIIAlpha(const StringView::CodePoints::Iterator& c) { return (*c >= 0x0041 && *c <= 0x005A) || (*c >= 0x0061 && *c <= 0x007A); }
> +static bool isASCIIAlphanumeric(const StringView::CodePoints::Iterator& c) { return isASCIIDigit(c) || isASCIIAlpha(c); }

Do we seriously not have any of these things anywhere else?

> Source/WebCore/platform/URLParser.cpp:55
> +// 4.2. URL parsing
> +Optional<URL> URLParser::parse(const String& input, const URL& base, const TextEncoding&)

This comment literally repeats what the code on the line below it already says. Not sure it's useful.

> Source/WebCore/platform/URLParser.cpp:69
> +        SchemeEndCheckForSlashes, // Not explicitly in spec, but in scheme state steps 2. 8.

Not sure this comment is useful/helpful.

> Source/WebCore/platform/URLParser.cpp:117
> +                if (urlScheme == "file")

ASCIILiteral

> Source/WebCore/platform/URLParser.cpp:153
> +                ++c;
> +                if (*c == '/') {

Deref'ing c after incrementing it without checking for end. :(

> Source/WebCore/platform/URLParser.cpp:181
> +                ++c;
> +                buffer.append('/');
> +                if (*c == '/') {

Deref'ing c after incrementing it without checking for end. :(

> Source/WebCore/platform/URLParser.cpp:275
> +            if (*c == '/') {
> +                ++c;
> +                if (*c == '.') {
> +                    ++c;
> +                    if (*c == '.')
> +                        notImplemented();

Deref'ing c after incrementing it without checking for end. :(
Comment 3 Brady Eidson 2016-08-12 12:26:39 PDT
And making the build work.
Comment 4 Alex Christensen 2016-08-12 12:56:44 PDT
(In reply to comment #2)
> Do we seriously not have any of these things anywhere else?
We probably do, but not ones that take these iterators.  We can probably consolidate this code later.
> > Source/WebCore/platform/URLParser.cpp:69
> > +        SchemeEndCheckForSlashes, // Not explicitly in spec, but in scheme state steps 2. 8.
> 
> Not sure this comment is useful/helpful.
It is.  All the other states are specified in the spec.  This one is not.
> ASCIILiteral
That doesn't compile.  This whole case needs to be rewritten anyways, hence the comment.
Comment 5 Alex Christensen 2016-08-12 12:57:30 PDT
Created attachment 285934 [details]
Patch
Comment 6 Alex Christensen 2016-08-12 13:05:32 PDT
http://trac.webkit.org/changeset/204417
Comment 7 Brady Eidson 2016-08-12 15:32:19 PDT
(In reply to comment #4)
> (In reply to comment #2)
> > Do we seriously not have any of these things anywhere else?
> We probably do, but not ones that take these iterators.  We can probably
> consolidate this code later.
> > > Source/WebCore/platform/URLParser.cpp:69
> > > +        SchemeEndCheckForSlashes, // Not explicitly in spec, but in scheme state steps 2. 8.
> > 
> > Not sure this comment is useful/helpful.
> It is.  All the other states are specified in the spec.  This one is not.

I understand that, because that's what the comment says.
I just don't see how it's relevant here.
There should be a relatively high bar to breaking up source code with comments, and it's usually when the code doesn't explain itself. This name explains itself.

> > ASCIILiteral
> That doesn't compile.  This whole case needs to be rewritten anyways, hence
> the comment.

Weird, why not?
Comment 8 Alex Christensen 2016-08-12 15:38:20 PDT
> > > ASCIILiteral
> > That doesn't compile.  This whole case needs to be rewritten anyways, hence
> > the comment.
> 
> Weird, why not?

/Users/alexchristensen/code/OpenSource/Source/WebCore/platform/URLParser.cpp:115:31: error: use of overloaded operator '==' is ambiguous (with operand types 'WTF::String' and 'WTF::ASCIILiteral')
                if (urlScheme == ASCIILiteral("file"))
                    ~~~~~~~~~ ^  ~~~~~~~~~~~~~~~~~~~~
In file included from /Users/alexchristensen/code/OpenSource/Source/WebCore/platform/URLParser.cpp:27:
In file included from /Users/alexchristensen/code/OpenSource/Source/WebCore/platform/URLParser.h:28:
In file included from /Users/alexchristensen/code/OpenSource/Source/WebCore/platform/text/TextEncoding.h:29:
In file included from /Users/alexchristensen/code/OpenSource/Source/WebCore/platform/text/TextCodec.h:34:
/Users/alexchristensen/code/OpenSource/WebKitBuild/Debug/usr/local/include/wtf/text/WTFString.h:481:13: note: candidate function
inline bool operator==(const String& a, const String& b) { return equal(a.impl(), b.impl()); }
            ^
/Users/alexchristensen/code/OpenSource/WebKitBuild/Debug/usr/local/include/wtf/text/WTFString.h:483:13: note: candidate function
inline bool operator==(const String& a, const char* b) { return equal(a.impl(), reinterpret_cast<const LChar*>(b)); }
Comment 9 Darin Adler 2016-08-15 23:38:48 PDT
How is this different from the algorithms inside URL.cpp?
Comment 10 Alex Christensen 2016-08-16 10:54:09 PDT
The new URLParser will be more standards compliant.
It is designed around one finite state machine that looks more like the spec, so as the spec evolves it will be easier to make changes to keep up.
It uses code point iterators to ideally just go through the URL once instead of making several copies and looking at different parts of the input multiple times.
The character class table and percent encode table in URL.cpp are wrong.  There are a growing number of security bugs based on the way the current parser works.  I decided it would be better to just start from scratch instead of trying to massage the old URL parser into a new one.
Comment 11 Darin Adler 2016-08-18 10:35:42 PDT
(In reply to comment #10)
> The new URLParser will be more standards compliant.
> It is designed around one finite state machine that looks more like the
> spec, so as the spec evolves it will be easier to make changes to keep up.
> It uses code point iterators to ideally just go through the URL once instead
> of making several copies and looking at different parts of the input
> multiple times.
> The character class table and percent encode table in URL.cpp are wrong. 
> There are a growing number of security bugs based on the way the current
> parser works.  I decided it would be better to just start from scratch
> instead of trying to massage the old URL parser into a new one.

All that makes sense. But I don’t think it’s OK to end up having two URL parsers. How can we make sure we get rid of the old one?
Comment 12 Alex Christensen 2016-08-19 11:38:19 PDT
I plan to develop the new one as the URLParser class until it is good enough to replace the URL::parse and URL::init functions, then replace those functions with the code I'm developing now.