Bug 38572 - [WTFURL] Add core URL parser
Summary: [WTFURL] Add core URL parser
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Other OS X 10.5
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-05 02:00 PDT by Adam Barth
Modified: 2010-05-07 13:17 PDT (History)
7 users (show)

See Also:


Attachments
Patch (27.39 KB, patch)
2010-05-05 02:03 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Patch (27.56 KB, patch)
2010-05-05 02:13 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Patch (27.59 KB, patch)
2010-05-05 16:45 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Patch (27.60 KB, patch)
2010-05-05 18:27 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Patch (26.25 KB, patch)
2010-05-07 00:43 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Patch (25.83 KB, patch)
2010-05-07 10:57 PDT, Adam Barth
ap: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Adam Barth 2010-05-05 02:00:15 PDT
[WTFURL] Add core URL parser
Comment 1 Adam Barth 2010-05-05 02:03:37 PDT
Created attachment 55100 [details]
Patch
Comment 2 Adam Barth 2010-05-05 02:04:05 PDT
Comment on attachment 55100 [details]
Patch

oops.  not quite ready
Comment 3 Adam Barth 2010-05-05 02:13:10 PDT
Created attachment 55101 [details]
Patch
Comment 4 WebKit Review Bot 2010-05-05 02:15:19 PDT
Attachment 55101 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1
JavaScriptCore/wtf/url/src/URLParser.h:69:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
JavaScriptCore/wtf/url/src/URLParser.h:374:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
JavaScriptCore/wtf/url/src/URLParser.h:558:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
JavaScriptCore/wtf/url/src/URLParser.h:591:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 4 in 4 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 5 Adam Barth 2010-05-05 02:16:40 PDT
> If any of these errors are false positives, please file a bug against
> check-webkit-style.

These are false positives, sadly.  It's a long-standing bug in the style checker that I don't have a good idea how to fix.
Comment 6 Adam Barth 2010-05-05 14:10:50 PDT
Alexey, would you like to be CCed on all these patches, or just the more interesting ones?
Comment 7 Alexey Proskuryakov 2010-05-05 16:04:34 PDT
Thanks - all would be just fine.
Comment 8 Adam Barth 2010-05-05 16:45:47 PDT
Created attachment 55173 [details]
Patch
Comment 9 Adam Barth 2010-05-05 16:46:36 PDT
Updated with the ref => fragment renaming from Bug 38566.
Comment 10 WebKit Review Bot 2010-05-05 16:51:20 PDT
Attachment 55173 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1
JavaScriptCore/wtf/url/src/URLParser.h:69:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
JavaScriptCore/wtf/url/src/URLParser.h:374:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
JavaScriptCore/wtf/url/src/URLParser.h:558:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
JavaScriptCore/wtf/url/src/URLParser.h:591:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 4 in 4 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 11 Alexey Proskuryakov 2010-05-05 17:34:37 PDT
> +        This patch adds the core of the URL parser.  The URL parser uses a
> +        templated notion of a character to support different character types.

Ideally, we should match Unicode nomenclature, and call these "code units"
<http://unicode.org/glossary/#code_unit>. It's wrong to call a UTF-16 code unit
a character, even though that's already established in WebKit code and some
Web-related specifications - but using this word for bytes in UTF-8 strings is
just weird.

>        'sources': [
>          'src/URLComponent.h',
> +        'src/URLParser.h',

Is this intentional? My understanding was that no port (including chromium) was
using this right away, and it's strange that only a header was added.

> + * The contents of this file are subject to the Mozilla Public License Version

Aren't we supposed to only leave LGPL when importing Mozilla code? I don't know
for sure, but I never saw any official word permitting tri-license in WebKit
repository.

> +    // This handles everything that may be an authority terminator, including
> +    // backslash. For special backslash handling see parseAfterScheme.
> +    static bool isAuthorityTerminator(CHAR ch)

Why not call it "isPossibleAuthorityTerminator" then? BTW, I didn't know that
semicolon could do that.

Also, I'd make this function just take UChar32. We only need CHAR in arrays.

> +    // Given an already-identified auth section, breaks it into its consituent

Typo: "consituent"

> +    static void parseAuthority(const CHAR* spec, const URLComponent& auth, URLComponent* username, URLComponent* password, URLComponent* host, URLComponent* port)

All URLComponent arguments are required here, so they should all be passed in
the same way, either by pointer or by reference. Or is it some calling
convention I do not know about?

> +        // FIXME: add ASSERT(auth.isValid()); // We should always get an authority.
> +        if (auth.length() == 0) {

Should be !auth.length() (or can we even say auth.isEmpty()?)

> +        // Search backwards for @, which is the separator between the user info and
> +        // the server info.
> +        int i = auth.begin() + auth.length() - 1;
> +        while (i > auth.begin() && spec[i] != '@')
> +            --i;

This will find '@' as the last character, but not as the first one. If that's
correct, this should ideally have a comment explaining why we don't care.

I only read a small portion of the patch, but it changed while I was looking at
it; will start over later or tomorrow.
Comment 12 Adam Barth 2010-05-05 17:45:38 PDT
> >        'sources': [
> >          'src/URLComponent.h',
> > +        'src/URLParser.h',
> 
> Is this intentional? My understanding was that no port (including chromium) was
> using this right away, and it's strange that only a header was added.

This build file isn't used by any port at the moment, but I'm trying to land the library in an incrementally buildable form.  We can omit the build file if you like.  (I have a local git repo with a complete build file that I can work from.)

> > + * The contents of this file are subject to the Mozilla Public License Version
> 
> Aren't we supposed to only leave LGPL when importing Mozilla code? I don't know
> for sure, but I never saw any official word permitting tri-license in WebKit
> repository.

Dunno.  I'm happy to strip this down to LGPL if that's the thing to do.

> > +    // This handles everything that may be an authority terminator, including
> > +    // backslash. For special backslash handling see parseAfterScheme.
> > +    static bool isAuthorityTerminator(CHAR ch)
> 
> Why not call it "isPossibleAuthorityTerminator" then? BTW, I didn't know that
> semicolon could do that.

Will do.

> Also, I'd make this function just take UChar32. We only need CHAR in arrays.

The original library used concrete types for these also, but it used a different concrete type.  :)

My plan is to leave the concrete string types one or two layers up in the library so that different folks can use it with different types of strings (e.g., WTFString and std::wstring).

> > +    // Given an already-identified auth section, breaks it into its consituent
> 
> Typo: "consituent"

Will do.

> > +    static void parseAuthority(const CHAR* spec, const URLComponent& auth, URLComponent* username, URLComponent* password, URLComponent* host, URLComponent* port)
> 
> All URLComponent arguments are required here, so they should all be passed in
> the same way, either by pointer or by reference. Or is it some calling
> convention I do not know about?

This is an artifact of the original Google style of this function.  In that convention, input parameters are passed by constant reference and output parameters are passed by pointer.  In this case, we're extracting the username, password, host, and port from the auth "meta"-component.  I can convert the output parameters to non-constant references if you prefer.

> > +        // FIXME: add ASSERT(auth.isValid()); // We should always get an authority.
> > +        if (auth.length() == 0) {
> 
> Should be !auth.length() (or can we even say auth.isEmpty()?)

I'll add an isEmpty() method because I personally dislike !auth.length().

> > +        // Search backwards for @, which is the separator between the user info and
> > +        // the server info.
> > +        int i = auth.begin() + auth.length() - 1;
> > +        while (i > auth.begin() && spec[i] != '@')
> > +            --i;
> 
> This will find '@' as the last character, but not as the first one. If that's
> correct, this should ideally have a comment explaining why we don't care.

I believe it's correct, but I don't have a reference off-hand.  The reason is that people might have passwords that contain @ characters and they might not URL-encode them.

> I only read a small portion of the patch, but it changed while I was looking at
> it; will start over later or tomorrow.

Ah sorry about that.  I updated it to reflect the fact that Maciej asked me to rename ref to fragment.

Thanks!
Comment 13 Alexey Proskuryakov 2010-05-05 18:01:22 PDT
> We can omit the build file if you like.

I don't have a strong preference about what goes into .gyp files - if it's fine with port maintainers, it's fine with me.

> My plan is to leave the concrete string types one or two layers up in the
> library so that different folks can use it with different types of strings
> (e.g., WTFString and std::wstring).

Understood. Ideally, there would be a way to not seed confusion between code points (AKA characters) and code units though.

> I can convert the output parameters to non-constant references if you prefer.

Yes, I think that it would be beneficial.

> I believe it's correct, but I don't have a reference off-hand.

My question was about something else, but I just misread the code, sorry.
Comment 14 Adam Barth 2010-05-05 18:27:30 PDT
Created attachment 55188 [details]
Patch
Comment 15 Adam Barth 2010-05-05 18:28:38 PDT
I think this license block is what you asked for.  I found it in Arean.cpp, which seems like a perfectly respectable file.
Comment 16 Alexey Proskuryakov 2010-05-05 18:53:52 PDT
I defer to Maciej on this topic. Formal requirements on patch submission page don't allow for such license text.
Comment 17 Adam Barth 2010-05-05 21:48:34 PDT
(In reply to comment #16)
> I defer to Maciej on this topic. Formal requirements on patch submission page
> don't allow for such license text.

I'm happy to use whatever license block you like.  Can you point me to an example file?
Comment 18 Alexey Proskuryakov 2010-05-06 12:03:38 PDT
Comment on attachment 55188 [details]
Patch

> I'm happy to use whatever license block you like.  > Can you point me to an
example file?

I'm still deferring to Maciej, but an example of what I did last time when importing a Mozilla tri-license file is in LayoutTests/java/lc3.

+    // Compatability data points. I list "host", "path" extracted:

Typo: should be "compatibility". I don't think that having the "Us" column is a good idea, there is too much danger of it getting out of sync. Data for IE6 an unspecified version of Firefox is of questionable value, too.

+        // Found "//<some data>", looks like an authority section. Treat everything

Even if it's "javascript://%20Nothing%20but%20a%20comment"? If this function is only called for standard URLs, a comment should explain that.

Similar comment for parsePath() - '?' and "#" characters don't break non-standard URLs into pieces.

+            case '#':
+                // Record the first # sign only.
+                if (refSeparator < 0)
+                    refSeparator = i;
+                break;

Seems appropriate to break out of the loop here for slightly better performance, and possibly more readable code.

+        // components. The code below words from the end back to the beginning,

"words"?

+    // Initializes a path URL which is merely a scheme followed by a path. Examples
+    // include "about:foo" and "javascript:alert('bar');"
+    static void parsePathURL(const CHAR* spec, int specLength, URLSegments& parsed)

It's quite unclear which functions are API, and which are internal implementation. In particular, why does parsePathURL() trim spaces, given that some code has already checked the URL and found it to be a path URL, necessarily trimming spaces just for that?

A related general question: what kind of input do these functions operate on? Is it byte arrays, or not? I'd expect "spec" to be a byte array, but they are all const CHAR*.

+        } else {
+            // No scheme found, just path.

What cases is this expected to catch? We seem to fall back on this case for /foo/bar UNIX-style paths, but not for C:\foo\bar ones. And a colon in a UNIX-style path would fool us, too.

Same comments about mailto parser.

+        if (!component.isNonempty())

Not part of this patch, but "empty" should be capitalized. And double negation isn't great.

+        // Skip over any leading 0s.

A drive-by comment: do we have a test for spaces between ':' and port, e.g. "http://foo.com:   80/"?

+        // Copy valid digits to the buffer.
+        char digits[maxDigits + 1]; // +1 for null terminator
+        for (int i = 0; i < nonZeroDigits.length(); i++) {
+            CHAR ch = spec[nonZeroDigits.begin() + i];
+            if (!isPortDigit(ch)) {
+                // Invalid port digit, fail.
+                return InvalidPort;
+            }
+            digits[i] = static_cast<char>(ch);
+        }

We're doing all this work just to call atoi in the end? We could just as easily parse the number directly.

+        // parameter. The path should start with a slash, so we don't need to check
+        // the first one.

This comment slightly clashes with a later one: "No slash found, this means the input was degenerate."

+        if (!query.isNonempty())

More double negation.

+    // We treat slashes and backslashes the same for IE compatability.

Should be "compatibility".

+    // Given an already-initialized begin index and length, this shrinks the range
+    // to eliminate "should-be-trimmed" characters. Note that the length does *not*
+    // indicate the length of untrimmed data from |begin|, but rather the position
+    // in the input string (so the string starts at character |begin| in the spec,
+    // and goes until |length|).

Then it shouldn't be called length.

+} // namespace WTF
+
+#endif // URLParser_h

If there is anything in this file to use outside of WTF, there should be using decarations for these symbols.

I expect that you'll want to address some of the above comments, so marking r-.
Comment 19 Adam Barth 2010-05-06 12:07:37 PDT
(In reply to comment #18)
> (From update of attachment 55188 [details])
> > I'm happy to use whatever license block you like.  > Can you point me to an
> example file?
> 
> I'm still deferring to Maciej, but an example of what I did last time when
> importing a Mozilla tri-license file is in LayoutTests/java/lc3.

I asked him on IRC last night.  He said it was fine.  Will look at the rest of your comments after lunch.  Thanks for the review.
Comment 20 Brett Wilson (Google) 2010-05-06 13:13:07 PDT
(In reply to comment #18)
> (From update of attachment 55188 [details])
> > I'm happy to use whatever license block you like.  > Can you point me to an
> example file?
> 
> I'm still deferring to Maciej, but an example of what I did last time when
> importing a Mozilla tri-license file is in LayoutTests/java/lc3.
> 
> +    // Compatability data points. I list "host", "path" extracted:
> 
> Typo: should be "compatibility". I don't think that having the "Us" column is a
> good idea, there is too much danger of it getting out of sync. Data for IE6 an
> unspecified version of Firefox is of questionable value, too.
> 
> +        // Found "//<some data>", looks like an authority section. Treat
> everything
> 
> Even if it's "javascript://%20Nothing%20but%20a%20comment"? If this function is
> only called for standard URLs, a comment should explain that.

It's correct this is called only for standard URLs.

> Similar comment for parsePath() - '?' and "#" characters don't break
> non-standard URLs into pieces.

A "path URL" is javascript: or data: with no breaking, named so because the data is contained in the "path" component of the URL. This function parses the path of a URL, which is used for both standard and file URLs. Probably the comment should state this.


> +            case '#':
> +                // Record the first # sign only.
> +                if (refSeparator < 0)
> +                    refSeparator = i;
> +                break;
> 
> Seems appropriate to break out of the loop here for slightly better
> performance, and possibly more readable code.
> 
> +        // components. The code below words from the end back to the
> beginning,
> 
> "words"?
> 
> +    // Initializes a path URL which is merely a scheme followed by a path.
> Examples
> +    // include "about:foo" and "javascript:alert('bar');"
> +    static void parsePathURL(const CHAR* spec, int specLength, URLSegments&
> parsed)
> 
> It's quite unclear which functions are API, and which are internal
> implementation. In particular, why does parsePathURL() trim spaces, given that
> some code has already checked the URL and found it to be a path URL,
> necessarily trimming spaces just for that?

I agree this trimming looks unnecessary.

There are three layers of API, here. The top layer is the object-oriented layer (KURL, GURL) which isn't relevant to this work. The bottom layer is the low level "given this file URL parse it as a file URL". There is a middle layer that figures out what type of URL it is, knows about known schemes, and then dispatches to the correct low-level parser and canonicalizer.

Chrome uses all of these layers in different places. For example, our URL bar calls ParseStandardURL directly because the user is typing URL fragments and we want to know "given this input, what is the most reasonable possible thing that this could be if it was a standard URL."

> A related general question: what kind of input do these functions operate on?
> Is it byte arrays, or not? I'd expect "spec" to be a byte array, but they are
> all const CHAR*.

They operate on both byte arrays and UTF-16. It's templatized/overloaded to support both. The canonical version of a URL we use is 8-bit, which has a nice space advantage and also matches the network format. But many sources of uncanonicalized URL data have 16-bit encoding, for example, the DOM, which is why there is this duality.

> +        } else {
> +            // No scheme found, just path.
> 
> What cases is this expected to catch? We seem to fall back on this case for
> /foo/bar UNIX-style paths, but not for C:\foo\bar ones. And a colon in a
> UNIX-style path would fool us, too.

Yes, that's correct. If you ever hit this code, the URL is invalid. But this code is to make the most reasonable possible parsing of the invalid URL possible. This is necessary, for example, in the above example of the URL, where the URL may be invalid but we don't care (because we can fix it once we know where things are, for example, by adding "http".


> Same comments about mailto parser.
> 
> +        if (!component.isNonempty())
> 
> Not part of this patch, but "empty" should be capitalized. And double negation
> isn't great.

The "problem" is that there are three states: (0,-1) meaning invalid or not present (x,0) meaning valid but 0-length, and (x,y) meaning valid and nonzero length. The name IsEmpty gives some ambiguity about whether it treats invalid parts as empty, and I didn't want people to write code like if (!isValid() && !isEmpty()). Nonempty is unambiguously "there are characters" here which also corresponds to the thing we most often want to test.
Comment 21 Brett Wilson (Google) 2010-05-06 13:20:39 PDT
Adam, I'm confused about why this is all implemented in a header file. The old code had concrete types only in header files. Then it had templatized versions in the .cc files. The intent of this was to ensure there would be exactly one copy of each of the two versions of the code globally in the entire project. Since we know there are exactly two possible implementations of each template, we can do this trick to help the compiler generate exactly the two versions it needs globally.

My understanding of templates in headers is that the compiler must generate all version of the code used in any particular source file and include that in the object file, since it doesn't know what other stuff that object file might be linked to in the future. This can possibly generate a huge amount of duplicated code. Then my understanding is that optimizing linkers will try to clean up this mess, but this is imperfect, doesn't happen in debug mode, and slows down linking.

If all compilers we use are guaranteed not to generate duplicate code in this case, and it doesn't slow down compiling and linking, I'm fine having these in a header. But I'm currently quite uncomfortable with that approach and I much prefer the concrete types in public APIs.
Comment 22 Adam Barth 2010-05-06 13:39:24 PDT
> Adam, I'm confused about why this is all implemented in a header file.

Client won't be using the header directly.  Instead, they'll access the library through an interface with concrete types.  For example, here's a first draft of part of the interface for Chromium:

https://bugs.webkit.org/show_bug.cgi?id=38633

The patch in that bug probably isn't how this is going to turn out in the end, but that should give an idea.
Comment 23 Adam Barth 2010-05-07 00:39:29 PDT
> +    // Compatability data points. I list "host", "path" extracted:
> 
> Typo: should be "compatibility".

Fixed.

> I don't think that having the "Us" column is a
> good idea, there is too much danger of it getting out of sync. Data for IE6 an
> unspecified version of Firefox is of questionable value, too.

Removed.

> +        // Found "//<some data>", looks like an authority section. Treat
> everything
> 
> Even if it's "javascript://%20Nothing%20but%20a%20comment"? If this function is
> only called for standard URLs, a comment should explain that.

Done.

> Similar comment for parsePath() - '?' and "#" characters don't break
> non-standard URLs into pieces.
> 
> +            case '#':
> +                // Record the first # sign only.
> +                if (refSeparator < 0)
> +                    refSeparator = i;
> +                break;
> 
> Seems appropriate to break out of the loop here for slightly better
> performance, and possibly more readable code.

I'm not sure how to break out of the loop without a goto, and I'm not sure having a goto improve readability.  :)

> +        // components. The code below words from the end back to the
> beginning,
> 
> "words"?

Presumably "works".  Fixed.

> +    // Initializes a path URL which is merely a scheme followed by a path.
> Examples
> +    // include "about:foo" and "javascript:alert('bar');"
> +    static void parsePathURL(const CHAR* spec, int specLength, URLSegments&
> parsed)
> 
> It's quite unclear which functions are API, and which are internal
> implementation. In particular, why does parsePathURL() trim spaces, given that
> some code has already checked the URL and found it to be a path URL,
> necessarily trimming spaces just for that?

I'm slightly hesitant to remove this code until we've got more of the library landed so we can see which methods are exposed publicly.  I've added a FIXME to reflect our current understanding.

> A related general question: what kind of input do these functions operate on?
> Is it byte arrays, or not? I'd expect "spec" to be a byte array, but they are
> all const CHAR*.

See brettw's answer above.

> +        } else {
> +            // No scheme found, just path.
> 
> What cases is this expected to catch? We seem to fall back on this case for
> /foo/bar UNIX-style paths, but not for C:\foo\bar ones. And a colon in a
> UNIX-style path would fool us, too.

See brettw's answer above.

> Same comments about mailto parser.
> 
> +        if (!component.isNonempty())
> 
> Not part of this patch, but "empty" should be capitalized.

Done.

> And double negation isn't great.

Yeah, that doesn't seem ideal.  Changed to isEmptyOrInvalid().

> +        // Skip over any leading 0s.
> 
> A drive-by comment: do we have a test for spaces between ':' and port, e.g.
> "http://foo.com:   80/"?

I think so:
http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js#L20

I'm not sure if that test covers what you're interested in.

> +        // Copy valid digits to the buffer.
> +        char digits[maxDigits + 1]; // +1 for null terminator
> +        for (int i = 0; i < nonZeroDigits.length(); i++) {
> +            CHAR ch = spec[nonZeroDigits.begin() + i];
> +            if (!isPortDigit(ch)) {
> +                // Invalid port digit, fail.
> +                return InvalidPort;
> +            }
> +            digits[i] = static_cast<char>(ch);
> +        }
> 
> We're doing all this work just to call atoi in the end? We could just as easily
> parse the number directly.

Indeed.  Done.

> +        // parameter. The path should start with a slash, so we don't need to
> check
> +        // the first one.
> 
> This comment slightly clashes with a later one: "No slash found, this means the
> input was degenerate."

Fixed.

> +        if (!query.isNonempty())
> 
> More double negation.

Fixed.

> +    // We treat slashes and backslashes the same for IE compatability.
> 
> Should be "compatibility".

Fixed.

> +    // Given an already-initialized begin index and length, this shrinks the
> range
> +    // to eliminate "should-be-trimmed" characters. Note that the length does
> *not*
> +    // indicate the length of untrimmed data from |begin|, but rather the
> position
> +    // in the input string (so the string starts at character |begin| in the
> spec,
> +    // and goes until |length|).
> 
> Then it shouldn't be called length.

What name do you suggest?  I've changed it to "specLength" because that's the name of the parameter that's always supplied.  (Note that although "end" is a tempting name, it's actually end+1.)

> +} // namespace WTF
> +
> +#endif // URLParser_h
> 
> If there is anything in this file to use outside of WTF, there should be using
> decarations for these symbols.

The exported symbols will use concrete types.  Client aren't meant to call into this code directly.

> I expect that you'll want to address some of the above comments, so marking r-.

Thanks for the review.  Updated patch coming shortly.
Comment 24 Adam Barth 2010-05-07 00:43:39 PDT
Created attachment 55348 [details]
Patch
Comment 25 Alexey Proskuryakov 2010-05-07 08:33:07 PDT
> I'm not sure how to break out of the loop without a goto, and I'm not sure
> having a goto improve readability.  :)

        for (int i = path.begin(); i < pathEnd; i++) {
            switch (spec[i]) {
            case '?':
                // Only match the query string if it precedes the reference fragment
                // and when we haven't found one already.
-                if (refSeparator < 0 && querySeparator < 0)
+                if (querySeparator < 0)
                    querySeparator = i;
                break;
            case '#':
                // Record the first # sign only.
-                if (refSeparator < 0)
-                    refSeparator = i;
+                refSeparator = i;
+                i = pathEnd; // Break out of the loop.
                break;
            default:
                break;
            }
        }

Another alternative is to avoid using switch, which is rather pointless for two cases - then break would work.

> What name do you suggest?  I've changed it to "specLength" because that's the
> name of the parameter that's always supplied.  (Note that although "end" is a
> tempting name, it's actually end+1.)

I don't see it changed in trimURL(), which is the function I was commenting about.

It's a common idiom to use "end" for a position just after the end, see e.g. Vector::end(), or even the code quoted above that uses pathEnd.
Comment 26 Adam Barth 2010-05-07 10:57:14 PDT
Created attachment 55396 [details]
Patch
Comment 27 Adam Barth 2010-05-07 10:57:59 PDT
(In reply to comment #25)
> > I'm not sure how to break out of the loop without a goto, and I'm not sure
> > having a goto improve readability.  :)
> 
>         for (int i = path.begin(); i < pathEnd; i++) {
>             switch (spec[i]) {
>             case '?':
>                 // Only match the query string if it precedes the reference
> fragment
>                 // and when we haven't found one already.
> -                if (refSeparator < 0 && querySeparator < 0)
> +                if (querySeparator < 0)
>                     querySeparator = i;
>                 break;
>             case '#':
>                 // Record the first # sign only.
> -                if (refSeparator < 0)
> -                    refSeparator = i;
> +                refSeparator = i;
> +                i = pathEnd; // Break out of the loop.
>                 break;
>             default:
>                 break;
>             }
>         }

Done.

> > What name do you suggest?  I've changed it to "specLength" because that's the
> > name of the parameter that's always supplied.  (Note that although "end" is a
> > tempting name, it's actually end+1.)
> 
> I don't see it changed in trimURL(), which is the function I was commenting
> about.
> 
> It's a common idiom to use "end" for a position just after the end, see e.g.
> Vector::end(), or even the code quoted above that uses pathEnd.

Done.
Comment 28 Alexey Proskuryakov 2010-05-07 11:55:33 PDT
Comment on attachment 55396 [details]
Patch

+    // the // last CHAR in spec), this shrinks the range to eliminate

The second "//" looks like an error.

r=me
Comment 29 Adam Barth 2010-05-07 12:35:59 PDT
(In reply to comment #28)
> (From update of attachment 55396 [details])
> +    // the // last CHAR in spec), this shrinks the range to eliminate
> 
> The second "//" looks like an error.

Oops.  Yeah.  Textmate is terrible at re-wrapping comments.

> r=me

Thanks!
Comment 30 Adam Barth 2010-05-07 12:57:10 PDT
Committed r58966: <http://trac.webkit.org/changeset/58966>
Comment 31 WebKit Review Bot 2010-05-07 13:10:30 PDT
http://trac.webkit.org/changeset/58966 might have broken Chromium Linux Release
The following changes are on the blame list:
http://trac.webkit.org/changeset/58966
http://trac.webkit.org/changeset/58967
Comment 32 Darin Fisher (:fishd, Google) 2010-05-07 13:17:29 PDT
(In reply to comment #31)
> http://trac.webkit.org/changeset/58966 might have broken Chromium Linux Release
> The following changes are on the blame list:
> http://trac.webkit.org/changeset/58966
> http://trac.webkit.org/changeset/58967

No worries!  This was r58967, and I have taken care of it.