160811 – Initial URLParser implementation

RESOLVED FIXED 160811

Initial URLParser implementation

https://bugs.webkit.org/show_bug.cgi?id=160811

Summary Initial URLParser implementation

Alex Christensen

Reported 2016-08-12 12:09:10 PDT

Initial URLParser implementation

Attachments
Patch (12.47 KB, patch) 2016-08-12 12:09 PDT, Alex Christensen	no flags	Details Formatted Diff Diff
Patch (12.69 KB, patch) 2016-08-12 12:57 PDT, Alex Christensen	no flags	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Alex Christensen

Comment 1 2016-08-12 12:09:58 PDT

Created attachment 285926 [details] Patch

Brady Eidson

Comment 2 2016-08-12 12:26:27 PDT

Comment on attachment 285926 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=285926&action=review r+ is contingent on fixing the buffer safety issues. > Source/WebCore/platform/URLParser.cpp:40 > +static bool isC0Control(const StringView::CodePoints::Iterator& c) { return *c <= 0x001F; } > +static bool isC0ControlOrSpace(const StringView::CodePoints::Iterator& c) { return isC0Control(c) || *c == 0x0020; } > +static bool isTabOrNewline(const StringView::CodePoints::Iterator& c) { return *c == 0x0009 || *c == 0x000A || *c == 0x000D; } > +static bool isASCIIDigit(const StringView::CodePoints::Iterator& c) { return *c >= 0x0030 && *c <= 0x0039; } > +static bool isASCIIAlpha(const StringView::CodePoints::Iterator& c) { return (*c >= 0x0041 && *c <= 0x005A) || (*c >= 0x0061 && *c <= 0x007A); } > +static bool isASCIIAlphanumeric(const StringView::CodePoints::Iterator& c) { return isASCIIDigit(c) || isASCIIAlpha(c); } Do we seriously not have any of these things anywhere else? > Source/WebCore/platform/URLParser.cpp:55 > +// 4.2. URL parsing > +Optional<URL> URLParser::parse(const String& input, const URL& base, const TextEncoding&) This comment literally repeats what the code on the line below it already says. Not sure it's useful. > Source/WebCore/platform/URLParser.cpp:69 > + SchemeEndCheckForSlashes, // Not explicitly in spec, but in scheme state steps 2. 8. Not sure this comment is useful/helpful. > Source/WebCore/platform/URLParser.cpp:117 > + if (urlScheme == "file") ASCIILiteral > Source/WebCore/platform/URLParser.cpp:153 > + ++c; > + if (*c == '/') { Deref'ing c after incrementing it without checking for end. :( > Source/WebCore/platform/URLParser.cpp:181 > + ++c; > + buffer.append('/'); > + if (*c == '/') { Deref'ing c after incrementing it without checking for end. :( > Source/WebCore/platform/URLParser.cpp:275 > + if (*c == '/') { > + ++c; > + if (*c == '.') { > + ++c; > + if (*c == '.') > + notImplemented(); Deref'ing c after incrementing it without checking for end. :(

Brady Eidson

Comment 3 2016-08-12 12:26:39 PDT

And making the build work.

Alex Christensen

Comment 4 2016-08-12 12:56:44 PDT

(In reply to comment #2) > Do we seriously not have any of these things anywhere else? We probably do, but not ones that take these iterators. We can probably consolidate this code later. > > Source/WebCore/platform/URLParser.cpp:69 > > + SchemeEndCheckForSlashes, // Not explicitly in spec, but in scheme state steps 2. 8. > > Not sure this comment is useful/helpful. It is. All the other states are specified in the spec. This one is not. > ASCIILiteral That doesn't compile. This whole case needs to be rewritten anyways, hence the comment.

Alex Christensen

Comment 5 2016-08-12 12:57:30 PDT

Created attachment 285934 [details] Patch

Alex Christensen

Comment 6 2016-08-12 13:05:32 PDT

http://trac.webkit.org/changeset/204417

Brady Eidson

Comment 7 2016-08-12 15:32:19 PDT

(In reply to comment #4) > (In reply to comment #2) > > Do we seriously not have any of these things anywhere else? > We probably do, but not ones that take these iterators. We can probably > consolidate this code later. > > > Source/WebCore/platform/URLParser.cpp:69 > > > + SchemeEndCheckForSlashes, // Not explicitly in spec, but in scheme state steps 2. 8. > > > > Not sure this comment is useful/helpful. > It is. All the other states are specified in the spec. This one is not. I understand that, because that's what the comment says. I just don't see how it's relevant here. There should be a relatively high bar to breaking up source code with comments, and it's usually when the code doesn't explain itself. This name explains itself. > > ASCIILiteral > That doesn't compile. This whole case needs to be rewritten anyways, hence > the comment. Weird, why not?

Alex Christensen

Comment 8 2016-08-12 15:38:20 PDT

> > > ASCIILiteral > > That doesn't compile. This whole case needs to be rewritten anyways, hence > > the comment. > > Weird, why not? /Users/alexchristensen/code/OpenSource/Source/WebCore/platform/URLParser.cpp:115:31: error: use of overloaded operator '==' is ambiguous (with operand types 'WTF::String' and 'WTF::ASCIILiteral') if (urlScheme == ASCIILiteral("file")) ~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~ In file included from /Users/alexchristensen/code/OpenSource/Source/WebCore/platform/URLParser.cpp:27: In file included from /Users/alexchristensen/code/OpenSource/Source/WebCore/platform/URLParser.h:28: In file included from /Users/alexchristensen/code/OpenSource/Source/WebCore/platform/text/TextEncoding.h:29: In file included from /Users/alexchristensen/code/OpenSource/Source/WebCore/platform/text/TextCodec.h:34: /Users/alexchristensen/code/OpenSource/WebKitBuild/Debug/usr/local/include/wtf/text/WTFString.h:481:13: note: candidate function inline bool operator==(const String& a, const String& b) { return equal(a.impl(), b.impl()); } ^ /Users/alexchristensen/code/OpenSource/WebKitBuild/Debug/usr/local/include/wtf/text/WTFString.h:483:13: note: candidate function inline bool operator==(const String& a, const char* b) { return equal(a.impl(), reinterpret_cast<const LChar*>(b)); }

Darin Adler

Comment 9 2016-08-15 23:38:48 PDT

How is this different from the algorithms inside URL.cpp?

Alex Christensen

Comment 10 2016-08-16 10:54:09 PDT

The new URLParser will be more standards compliant. It is designed around one finite state machine that looks more like the spec, so as the spec evolves it will be easier to make changes to keep up. It uses code point iterators to ideally just go through the URL once instead of making several copies and looking at different parts of the input multiple times. The character class table and percent encode table in URL.cpp are wrong. There are a growing number of security bugs based on the way the current parser works. I decided it would be better to just start from scratch instead of trying to massage the old URL parser into a new one.

Darin Adler

Comment 11 2016-08-18 10:35:42 PDT

(In reply to comment #10) > The new URLParser will be more standards compliant. > It is designed around one finite state machine that looks more like the > spec, so as the spec evolves it will be easier to make changes to keep up. > It uses code point iterators to ideally just go through the URL once instead > of making several copies and looking at different parts of the input > multiple times. > The character class table and percent encode table in URL.cpp are wrong. > There are a growing number of security bugs based on the way the current > parser works. I decided it would be better to just start from scratch > instead of trying to massage the old URL parser into a new one. All that makes sense. But I don’t think it’s OK to end up having two URL parsers. How can we make sure we get rid of the old one?

Alex Christensen

Comment 12 2016-08-19 11:38:19 PDT

I plan to develop the new one as the URLParser class until it is good enough to replace the URL::parse and URL::init functions, then replace those functions with the code I'm developing now.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component New Bugs

Assignee

Alex Christensen

Reported

2016-08-12 12:09 PDT

Modified

2016-08-19 11:38 PDT History

CC List

2 users Show

URL

Keywords

Depends on

Blocks