Bug 179573 - Clean up old URL parser remnants
Summary: Clean up old URL parser remnants
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Alex Christensen
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2017-11-10 18:30 PST by Alex Christensen
Modified: 2017-11-15 09:33 PST (History)
6 users (show)

See Also:


Attachments
Patch (39.52 KB, patch)
2017-11-10 18:32 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (39.45 KB, patch)
2017-11-10 20:40 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (42.33 KB, patch)
2017-11-10 22:05 PST, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (44.24 KB, patch)
2017-11-14 10:56 PST, Alex Christensen
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Christensen 2017-11-10 18:30:04 PST
Clean up old URL parser remnants
Comment 1 Alex Christensen 2017-11-10 18:32:01 PST
Created attachment 326670 [details]
Patch
Comment 2 Alex Christensen 2017-11-10 20:40:36 PST
Created attachment 326677 [details]
Patch
Comment 3 Alex Christensen 2017-11-10 22:05:38 PST
Created attachment 326678 [details]
Patch
Comment 4 Darin Adler 2017-11-12 17:54:14 PST
Comment on attachment 326678 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=326678&action=review

Looks OK, except I think this breaks the context menu’s handling of some special characters.

> Source/WebCore/platform/URL.cpp:547
> +    for (size_t i = 0; i < input.length(); ++i) {
> +        if (UNLIKELY(shouldEncode(input[i]))) {
> +            CString utf8 = input.utf8();
> +            StringBuilder builder;
> +            builder.reserveCapacity(utf8.length() + 2);
> +            for (size_t j = 0; j < utf8.length(); j++) {
> +                auto c = utf8.data()[j];
> +                if (shouldEncode(c)) {
> +                    builder.append('%');
> +                    builder.append(upperNibbleToASCIIHexDigit(c));
> +                    builder.append(lowerNibbleToASCIIHexDigit(c));
> +                } else
> +                    builder.append(c);
> +            }
> +            return builder.toString();
> +        }
> +    }
> +    return input;

It’s inelegant to have the algorithm for actually encoding nested inside an if statement in the middle of a loop (containing a second loop!). We should separate the two loops, putting them into separate functions or lambdas if we want to avoid booleans. Another idea (see below) is to use one loop to compute the length, and another to do the encoding, then there’s a clean place to exit between the two.

It’s not quite right to use size_t to iterate a String; indices into WTF::String are unsigned, not size_t, so it’s slightly wasteful to explicitly use the bigger type.

This loop looks to me like it will be pretty slow, but I guess that’s OK. Here are some thoughts on performance:

- Calling a function pointer once per character seems like it will be pretty expensive. But maybe it’s OK for the way we are using this.

- Converting to UTF-8 using String::utf8 means allocating and destroying a CStringBuffer. That’s costly and we can make sure we have code that converts to UTF-8 in streaming fashion without a buffer.

- Returning a String that is then used to concatenate with other strings means allocating and destroying a StringImpl. If this was turned into a StringConcatenate StringTypeAdapter then callers could concatenate this with other strings and only allocate one buffer for the entire result, eliminating this extra StringImpl.

- Calling CString::length and CString::data each time through a loop is not so great. They both do null checks on m_buffer and dereference it each time; I am not sure the compiler will be smart enough to hoist all of that out of the loop, given that we are calling a function through a function pointer each time through.

- I don’t fully understand the rationale for the StringBuilder::reserveCapacity call. It seems to make the assumption that exactly one character needs to be percent-encoded. That makes it likely we will need to reallocate the StringBuffer. If we split the function into two halves as I suggested above, we could compute the exact correct length the first time through instead of stopping on the first character we should encode. Then we could use String::createUninitialized instead of StringBuilder to make the string, if we aren’t doing the StringTypeAdapter thing yet.

> Source/WebCore/platform/URL.cpp:666
> +    auto questionMarkOrNumberSign = [] (UChar c) {
> +        return c == '?' || c == '#';
> +    };

Could use the word "character". I don’t think it would be any harder to read.

> Source/WebCore/platform/URL.cpp:667
> +    URLParser parser(makeString(m_string.left(m_portEnd), percentEncodeCharacters(path, questionMarkOrNumberSign), m_string.substring(m_pathEnd)));

makeString can work with StringView arguments instead of String arguments, so we could cut down on string allocations here by using StringView::substring instead of String::left and String::substring.

> Source/WebKit/WebProcess/WebCoreSupport/WebContextMenuClient.cpp:65
> -    String encoded = encodeWithURLEscapeSequences(searchString);
> -    encoded.replace(ASCIILiteral { "%20" }, ASCIILiteral { "+" });
>  
> -    String url = "http://www.google.com/search?q=" + encoded + "&ie=UTF-8&oe=UTF-8";
> +    URL url(URL(), "https://www.google.com/search?q=" + searchString + "&ie=UTF-8&oe=UTF-8");

This doesn’t seem quite right.

I understand that the URL parser will escape characters that would make the URL illegal otherwise, such as control characters, non-ASCII characters, and spaces. But there is other escaping that is needed to successfully convey an arbitrary search string to the server. Four examples:

"#" needs to be represented as "%23", otherwise it will be treated as the end of the query, and the part after it will be treated as the fragment
"%" needs to be represented as "%25" if it comes before valid hex digits, otherwise it will be treated as a the start of an escape sequence
"&" needs to be represented as "%26", otherwise it will be treated as the end of the string, delimiting another argument
"+" needs to be represented as "%2B", otherwise it will be treated as a space

After this change I don’t think we can correctly do searches for strings containing those characters. If we wanted the URL class to help us, using URL::setQuery instead of concatenating would take care of "#", but not the other characters. I don’t think the URL class has the functions needed to deal with these other issues.

The comment in the ChangeLog does not offer rationale for why we now decided to use "%20" rather than "+". We went out of our way to send the shorter form before; this patch removes the code without comment.
Comment 5 Alex Christensen 2017-11-14 10:56:44 PST
Created attachment 326891 [details]
Patch
Comment 6 Alex Christensen 2017-11-14 11:08:53 PST
> It’s not quite right to use size_t to iterate a String; indices into
> WTF::String are unsigned, not size_t, so it’s slightly wasteful to
> explicitly use the bigger type.
They both would fit into one register, so they would be just as fast.  I changed it to unsigned, but I think we should instead change the String operator[] and length to use size_t instead of unsigned.

> - Converting to UTF-8 using String::utf8 means allocating and destroying a
> CStringBuffer. That’s costly and we can make sure we have code that converts
> to UTF-8 in streaming fashion without a buffer.
This could definitely be improved, but this is the same as what we do now.

> - Returning a String that is then used to concatenate with other strings
> means allocating and destroying a StringImpl. If this was turned into a
> StringConcatenate StringTypeAdapter then callers could concatenate this with
> other strings and only allocate one buffer for the entire result,
> eliminating this extra StringImpl.
I added this optimization everywhere in this file.

> - I don’t fully understand the rationale for the
> StringBuilder::reserveCapacity call. It seems to make the assumption that
> exactly one character needs to be percent-encoded. That makes it likely we
> will need to reallocate the StringBuffer. If we split the function into two
> halves as I suggested above, we could compute the exact correct length the
> first time through instead of stopping on the first character we should
> encode. Then we could use String::createUninitialized instead of
> StringBuilder to make the string, if we aren’t doing the StringTypeAdapter
> thing yet.
I was optimistically hoping that we would only need to encode one character, and if we need to encode more then we can allocate again.  I removed this maybe-optimization-maybe-regression.

> I understand that the URL parser will escape characters that would make the
> URL illegal otherwise, such as control characters, non-ASCII characters, and
> spaces. But there is other escaping that is needed to successfully convey an
> arbitrary search string to the server. 
We also need to encode non-ASCII characters.  I remove this change and use the user info set to encode these queries, which isn't completely correct, but it's about as incorrect as what we used to do and it's non-standardized anyways.
Comment 7 Alex Christensen 2017-11-14 11:15:41 PST
https://trac.webkit.org/changeset/224823/webkit
Comment 8 Radar WebKit Bug Importer 2017-11-15 09:33:52 PST
<rdar://problem/35561963>