Bug 133990 - Refactor to remove unnecessary code from the string hashing functions
Summary: Refactor to remove unnecessary code from the string hashing functions
Status: ASSIGNED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Adam Treat
URL:
Keywords: BlinkMergeCandidate
Depends on:
Blocks:
 
Reported: 2014-06-17 12:12 PDT by Adam Treat
Modified: 2014-06-24 09:49 PDT (History)
8 users (show)

See Also:


Attachments
Patch (15.86 KB, patch)
2014-06-17 12:22 PDT, Adam Treat
dbates: review-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Adam Treat 2014-06-17 12:12:17 PDT
StringHasher has methods to calculate the hash of a null terminated character array which internally calculate the length by searching for the null value at the end.  These methods are only used by CStringTranslator in AtomicString to calculate the hash of a new AtomicString made by a string literal.  These methods are both unnecessary and a bit costly because CStringTranslator, which is used by Hash* classes in WTF, searches for the end of the character array more than once and then once again when creating the StringImpl that backs the AtomicString.
Comment 1 Adam Treat 2014-06-17 12:22:50 PDT
Created attachment 233248 [details]
Patch
Comment 3 Adam Treat 2014-06-19 07:33:09 PDT
(In reply to comment #2)
> Blink review: <https://codereview.chromium.org/328453003>
> Blink commit: <https://src.chromium.org/viewvc/blink?revision=175826&view=revision>

Can you have a look please or suggest someone else to review?
Comment 4 Daniel Bates 2014-06-20 10:13:42 PDT
Comment on attachment 233248 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=233248&action=review

> Source/WTF/ChangeLog:6
> +        Refactor to remove unnecessary code from the string hashing functions
> +        https://bugs.webkit.org/show_bug.cgi?id=133990
> +
> +        Previously, StringHasher contained methods for calculating the hash of a pure C

We have an convention of referring to the original Blink commit when merging a Blink patch into WebKit by adding a remark to the ChangeLog entry/commit message of the form:

Merged from Blink:
https://src.chromium.org/viewvc/blink?revision=175826&view=revision

You can see an example of ChangeLog entry/commit messages that reference this at <http://trac.webkit.org/changeset/166668>.

> Source/WTF/ChangeLog:17
> +        Reviewed by NOBODY (OOPS!).

Please move the Reviewed by line to be on its own line under the WebKit bug URL.

> Source/WTF/wtf/text/AtomicString.cpp:94
> -    return addToStringTable<const LChar*, CStringTranslator>(c);
> +    return add(c, strlen(reinterpret_cast<const char*>(c)));

This seems like a weird and indirect approach because AtomicString::add(const LChar*, unsigned) duplicates the precondition checks done in this function and otherwise calls through to AtomicString::addToStringTable<LCharBuffer, LCharBufferTranslator>(). We should take a similar approach as AtomicString::add(const UChar*) and stack allocate an LCharBuffer with the specified string and computed length and directly call AtomicString::addToStringTable<LCharBuffer, LCharBufferTranslator>(...), which may be inlined.

> Tools/ChangeLog:6
> +        Refactor to remove unnecessary code from the string hashing functions
> +        https://bugs.webkit.org/show_bug.cgi?id=133990
> +
> +        Previously, StringHasher contained methods for calculating the hash of a pure C

Please also update this ChangeLog entry.
Comment 5 Darin Adler 2014-06-21 12:46:40 PDT
Comment on attachment 233248 [details]
Patch

As far as I can tell, this patch removes a “don’t iterate the string twice” optimization.
Comment 6 Benjamin Poulain 2014-06-21 13:42:45 PDT
(In reply to comment #5)
> (From update of attachment 233248 [details])
> As far as I can tell, this patch removes a “don’t iterate the string twice” optimization.

I was thinking the same :)

Any performance numbers Adam? How did you measure the improvement?
Comment 7 Adam Treat 2014-06-23 07:28:55 PDT
(In reply to comment #5)
> (From update of attachment 233248 [details])
> As far as I can tell, this patch removes a “don’t iterate the string twice” optimization.

Just the opposite unless I'm very mistaken.  This patch removes additional iterations of the string in favor of one iteration up front.
Comment 8 Adam Treat 2014-06-23 07:30:19 PDT
(In reply to comment #6)
> (In reply to comment #5)
> > (From update of attachment 233248 [details] [details])
> > As far as I can tell, this patch removes a “don’t iterate the string twice” optimization.
> 
> I was thinking the same :)
> 
> Any performance numbers Adam? How did you measure the improvement?

I did not measure any improvement.  The reason for the patch is mostly a refactor to remove unnecessary code.  I do believe it removes additional iterations of the string, but I haven't tried to measure it.
Comment 9 Daniel Bates 2014-06-23 09:11:42 PDT
Comment on attachment 233248 [details]
Patch

Looking over the patch again, I agree with Darin Adler's and Benjamin Poulain's remarks in comment #5 and comment #6, respectively, that this patch removes an optimization to avoid an iteration over the specified LChar string to be added to the table. Specifically this patch increases the algorithmic complexity of add() by a factor of N (the length of the LChar string to add to the table) when the string already exists in the table.

Without the proposed patch the cost to add a string S with length N that already exists in the table is O(2N) = "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)). With the proposed patch the cost becomes O(3N) = "cost to compute strlen(S)" (O(N)) + "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)).
Comment 10 Adam Treat 2014-06-23 11:46:33 PDT
(In reply to comment #9)
> (From update of attachment 233248 [details])
> Looking over the patch again, I agree with Darin Adler's and Benjamin Poulain's remarks in comment #5 and comment #6, respectively, that this patch removes an optimization to avoid an iteration over the specified LChar string to be added to the table. Specifically this patch increases the algorithmic complexity of add() by a factor of N (the length of the LChar string to add to the table) when the string already exists in the table.
> 
> Without the proposed patch the cost to add a string S with length N that already exists in the table is O(2N) = "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)). With the proposed patch the cost becomes O(3N) = "cost to compute strlen(S)" (O(N)) + "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)).

It also reduces by a factor of N when the string is not in the table, but hash clashes with another string of different length, right?
Comment 11 Daniel Bates 2014-06-23 13:14:01 PDT
(In reply to comment #10)
> (In reply to comment #9)
> > (From update of attachment 233248 [details] [details])
> > Looking over the patch again, I agree with Darin Adler's and Benjamin Poulain's remarks in comment #5 and comment #6, respectively, that this patch removes an optimization to avoid an iteration over the specified LChar string to be added to the table. Specifically this patch increases the algorithmic complexity of add() by a factor of N (the length of the LChar string to add to the table) when the string already exists in the table.
> > 
> > Without the proposed patch the cost to add a string S with length N that already exists in the table is O(2N) = "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)). With the proposed patch the cost becomes O(3N) = "cost to compute strlen(S)" (O(N)) + "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)).
> 
> It also reduces by a factor of N when the string is not in the table, but hash clashes with another string of different length, right?

Can you elaborate?

As far as I can tell, the proposed patch doesn't reduce the cost by a factor of N for such a situation.
Comment 12 Adam Treat 2014-06-23 14:38:45 PDT
(In reply to comment #11)
> (In reply to comment #10)
> > (In reply to comment #9)
> > > (From update of attachment 233248 [details] [details] [details])
> > > Looking over the patch again, I agree with Darin Adler's and Benjamin Poulain's remarks in comment #5 and comment #6, respectively, that this patch removes an optimization to avoid an iteration over the specified LChar string to be added to the table. Specifically this patch increases the algorithmic complexity of add() by a factor of N (the length of the LChar string to add to the table) when the string already exists in the table.
> > > 
> > > Without the proposed patch the cost to add a string S with length N that already exists in the table is O(2N) = "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)). With the proposed patch the cost becomes O(3N) = "cost to compute strlen(S)" (O(N)) + "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)).
> > 
> > It also reduces by a factor of N when the string is not in the table, but hash clashes with another string of different length, right?
> 
> Can you elaborate?
> 
> As far as I can tell, the proposed patch doesn't reduce the cost by a factor of N for such a situation.

In such a situation, the proposed patch will consist of:

    "cost to compute strlen(S)" (O(N) + "cost to hash S" (O(N)) + "cost to   determine the two strings don't match which is early return 'S.length != S1.length"

whereas without the patch it would be:

    "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)) + cost to create new StringImpl by calling strlen at the end

correct?
Comment 13 Daniel Bates 2014-06-24 09:45:18 PDT
(In reply to comment #12)
> > > [...]
> > > It also reduces by a factor of N when the string is not in the table, but hash clashes with another string of different length, right?
> > 
> > Can you elaborate?
> > 
> > As far as I can tell, the proposed patch doesn't reduce the cost by a factor of N for such a situation.
> 
> In such a situation, the proposed patch will consist of:
> 
>     "cost to compute strlen(S)" (O(N) + "cost to hash S" (O(N)) + "cost to   determine the two strings don't match which is early return 'S.length != S1.length"
> 
> whereas without the patch it would be:
> 
>     "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)) + cost to create new StringImpl by calling strlen at the end
> 
> correct?

You're right for this case! As aforementioned, the proposed patch increases the algorithmic complexity of add() by a factor of N when the string already exists in the table. I'm unclear which case is more likely to happen in practice. Are you able to measure it?
Comment 14 Daniel Bates 2014-06-24 09:49:28 PDT
Let S and S' be strings. We stay S == S' if the contents of the strings are identical. The following are different cases and their worst-case algorithmic complexities with and without the proposed patch:

Case 1: Suppose S != S', strlen(S) == strlen(S'), hash(S) == hash(S') and S' is in the atomic string table. AtomicString::add(S):

Without patch: O(4N) = "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(N)) + "cost to compute strlen(S)" (O(N)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N))
With patch: O(4N) = "cost to compute strlen(S)" (O(N)) + "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(N)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N))

Case 2: Suppose S != S', strlen(S) != strlen(S'), hash(S) == hash(S') and S' is in the atomic string table. AtomicString::add(S):

Without patch: O(4N) = "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(N)) + "cost to compute strlen(S)" (O(N)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N))
With patch: O(3N) = "cost to compute strlen(S)" (O(N)) + "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(1)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N))

Case 3: Suppose S == S', strlen(S) == strlen(S'), hash(S) == hash(S') and S' is in the atomic string table. AtomicString::add(S):

Without patch: O(2N) = "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(N))
With patch: O(3N) = "cost to compute strlen(S)" (O(N)) + "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(N))

Case 4: Suppose hash(S) != hash(S_i) for every string S_i in the atomic string table. AtomicString::add(S):

Without patch: O(3N) = "cost of hash(S)" (O(N)) + "cost to compute strlen(S)" (O(N)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N))
With patch: O(3N) = "cost to compute strlen(S)" (O(N)) + "cost of hash(S)" (O(N)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N))