179573 – Clean up old URL parser remnants

RESOLVED FIXED 179573

Clean up old URL parser remnants

https://bugs.webkit.org/show_bug.cgi?id=179573

Summary Clean up old URL parser remnants

Alex Christensen

Reported 2017-11-10 18:30:04 PST

Clean up old URL parser remnants

Attachments
Patch (39.52 KB, patch) 2017-11-10 18:32 PST, Alex Christensen	no flags	Details Formatted Diff Diff
Patch (39.45 KB, patch) 2017-11-10 20:40 PST, Alex Christensen	no flags	Details Formatted Diff Diff
Patch (42.33 KB, patch) 2017-11-10 22:05 PST, Alex Christensen	no flags	Details Formatted Diff Diff
Patch (44.24 KB, patch) 2017-11-14 10:56 PST, Alex Christensen	no flags	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Alex Christensen

Comment 1 2017-11-10 18:32:01 PST

Created attachment 326670 [details] Patch

Alex Christensen

Comment 2 2017-11-10 20:40:36 PST

Created attachment 326677 [details] Patch

Alex Christensen

Comment 3 2017-11-10 22:05:38 PST

Created attachment 326678 [details] Patch

Darin Adler

Comment 4 2017-11-12 17:54:14 PST

Comment on attachment 326678 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=326678&action=review Looks OK, except I think this breaks the context menu’s handling of some special characters. > Source/WebCore/platform/URL.cpp:547 > + for (size_t i = 0; i < input.length(); ++i) { > + if (UNLIKELY(shouldEncode(input[i]))) { > + CString utf8 = input.utf8(); > + StringBuilder builder; > + builder.reserveCapacity(utf8.length() + 2); > + for (size_t j = 0; j < utf8.length(); j++) { > + auto c = utf8.data()[j]; > + if (shouldEncode(c)) { > + builder.append('%'); > + builder.append(upperNibbleToASCIIHexDigit(c)); > + builder.append(lowerNibbleToASCIIHexDigit(c)); > + } else > + builder.append(c); > + } > + return builder.toString(); > + } > + } > + return input; It’s inelegant to have the algorithm for actually encoding nested inside an if statement in the middle of a loop (containing a second loop!). We should separate the two loops, putting them into separate functions or lambdas if we want to avoid booleans. Another idea (see below) is to use one loop to compute the length, and another to do the encoding, then there’s a clean place to exit between the two. It’s not quite right to use size_t to iterate a String; indices into WTF::String are unsigned, not size_t, so it’s slightly wasteful to explicitly use the bigger type. This loop looks to me like it will be pretty slow, but I guess that’s OK. Here are some thoughts on performance: - Calling a function pointer once per character seems like it will be pretty expensive. But maybe it’s OK for the way we are using this. - Converting to UTF-8 using String::utf8 means allocating and destroying a CStringBuffer. That’s costly and we can make sure we have code that converts to UTF-8 in streaming fashion without a buffer. - Returning a String that is then used to concatenate with other strings means allocating and destroying a StringImpl. If this was turned into a StringConcatenate StringTypeAdapter then callers could concatenate this with other strings and only allocate one buffer for the entire result, eliminating this extra StringImpl. - Calling CString::length and CString::data each time through a loop is not so great. They both do null checks on m_buffer and dereference it each time; I am not sure the compiler will be smart enough to hoist all of that out of the loop, given that we are calling a function through a function pointer each time through. - I don’t fully understand the rationale for the StringBuilder::reserveCapacity call. It seems to make the assumption that exactly one character needs to be percent-encoded. That makes it likely we will need to reallocate the StringBuffer. If we split the function into two halves as I suggested above, we could compute the exact correct length the first time through instead of stopping on the first character we should encode. Then we could use String::createUninitialized instead of StringBuilder to make the string, if we aren’t doing the StringTypeAdapter thing yet. > Source/WebCore/platform/URL.cpp:666 > + auto questionMarkOrNumberSign = [] (UChar c) { > + return c == '?' || c == '#'; > + }; Could use the word "character". I don’t think it would be any harder to read. > Source/WebCore/platform/URL.cpp:667 > + URLParser parser(makeString(m_string.left(m_portEnd), percentEncodeCharacters(path, questionMarkOrNumberSign), m_string.substring(m_pathEnd))); makeString can work with StringView arguments instead of String arguments, so we could cut down on string allocations here by using StringView::substring instead of String::left and String::substring. > Source/WebKit/WebProcess/WebCoreSupport/WebContextMenuClient.cpp:65 > - String encoded = encodeWithURLEscapeSequences(searchString); > - encoded.replace(ASCIILiteral { "%20" }, ASCIILiteral { "+" }); > > - String url = "http://www.google.com/search?q=" + encoded + "&ie=UTF-8&oe=UTF-8"; > + URL url(URL(), "https://www.google.com/search?q=" + searchString + "&ie=UTF-8&oe=UTF-8"); This doesn’t seem quite right. I understand that the URL parser will escape characters that would make the URL illegal otherwise, such as control characters, non-ASCII characters, and spaces. But there is other escaping that is needed to successfully convey an arbitrary search string to the server. Four examples: "#" needs to be represented as "%23", otherwise it will be treated as the end of the query, and the part after it will be treated as the fragment "%" needs to be represented as "%25" if it comes before valid hex digits, otherwise it will be treated as a the start of an escape sequence "&" needs to be represented as "%26", otherwise it will be treated as the end of the string, delimiting another argument "+" needs to be represented as "%2B", otherwise it will be treated as a space After this change I don’t think we can correctly do searches for strings containing those characters. If we wanted the URL class to help us, using URL::setQuery instead of concatenating would take care of "#", but not the other characters. I don’t think the URL class has the functions needed to deal with these other issues. The comment in the ChangeLog does not offer rationale for why we now decided to use "%20" rather than "+". We went out of our way to send the shorter form before; this patch removes the code without comment.

Alex Christensen

Comment 5 2017-11-14 10:56:44 PST

Created attachment 326891 [details] Patch

Alex Christensen

Comment 6 2017-11-14 11:08:53 PST

> It’s not quite right to use size_t to iterate a String; indices into > WTF::String are unsigned, not size_t, so it’s slightly wasteful to > explicitly use the bigger type. They both would fit into one register, so they would be just as fast. I changed it to unsigned, but I think we should instead change the String operator[] and length to use size_t instead of unsigned. > - Converting to UTF-8 using String::utf8 means allocating and destroying a > CStringBuffer. That’s costly and we can make sure we have code that converts > to UTF-8 in streaming fashion without a buffer. This could definitely be improved, but this is the same as what we do now. > - Returning a String that is then used to concatenate with other strings > means allocating and destroying a StringImpl. If this was turned into a > StringConcatenate StringTypeAdapter then callers could concatenate this with > other strings and only allocate one buffer for the entire result, > eliminating this extra StringImpl. I added this optimization everywhere in this file. > - I don’t fully understand the rationale for the > StringBuilder::reserveCapacity call. It seems to make the assumption that > exactly one character needs to be percent-encoded. That makes it likely we > will need to reallocate the StringBuffer. If we split the function into two > halves as I suggested above, we could compute the exact correct length the > first time through instead of stopping on the first character we should > encode. Then we could use String::createUninitialized instead of > StringBuilder to make the string, if we aren’t doing the StringTypeAdapter > thing yet. I was optimistically hoping that we would only need to encode one character, and if we need to encode more then we can allocate again. I removed this maybe-optimization-maybe-regression. > I understand that the URL parser will escape characters that would make the > URL illegal otherwise, such as control characters, non-ASCII characters, and > spaces. But there is other escaping that is needed to successfully convey an > arbitrary search string to the server. We also need to encode non-ASCII characters. I remove this change and use the user info set to encode these queries, which isn't completely correct, but it's about as incorrect as what we used to do and it's non-standardized anyways.

Alex Christensen

Comment 7 2017-11-14 11:15:41 PST

https://trac.webkit.org/changeset/224823/webkit

Radar WebKit Bug Importer

Comment 8 2017-11-15 09:33:52 PST

<rdar://problem/35561963>

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component New Bugs

Assignee

Alex Christensen

Reported

2017-11-10 18:30 PST

Modified

2017-11-15 09:33 PST History

CC List

6 users Show

URL

Keywords InRadar

Depends on

Blocks