133990 – Refactor to remove unnecessary code from the string hashing functions

ASSIGNED 133990

Refactor to remove unnecessary code from the string hashing functions

https://bugs.webkit.org/show_bug.cgi?id=133990

Summary Refactor to remove unnecessary code from the string hashing functions

Adam Treat

Reported 2014-06-17 12:12:17 PDT

StringHasher has methods to calculate the hash of a null terminated character array which internally calculate the length by searching for the null value at the end. These methods are only used by CStringTranslator in AtomicString to calculate the hash of a new AtomicString made by a string literal. These methods are both unnecessary and a bit costly because CStringTranslator, which is used by Hash* classes in WTF, searches for the end of the character array more than once and then once again when creating the StringImpl that backs the AtomicString.

Attachments
Patch (15.86 KB, patch) 2014-06-17 12:22 PDT, Adam Treat	dbates: review-	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Adam Treat

Comment 1 2014-06-17 12:22:50 PDT

Created attachment 233248 [details] Patch

Daniel Bates

Comment 2 2014-06-17 16:02:30 PDT

Blink review: <https://codereview.chromium.org/328453003> Blink commit: <https://src.chromium.org/viewvc/blink?revision=175826&view=revision>

Adam Treat

Comment 3 2014-06-19 07:33:09 PDT

(In reply to comment #2) > Blink review: <https://codereview.chromium.org/328453003> > Blink commit: <https://src.chromium.org/viewvc/blink?revision=175826&view=revision> Can you have a look please or suggest someone else to review?

Daniel Bates

Comment 4 2014-06-20 10:13:42 PDT

Comment on attachment 233248 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=233248&action=review > Source/WTF/ChangeLog:6 > + Refactor to remove unnecessary code from the string hashing functions > + https://bugs.webkit.org/show_bug.cgi?id=133990 > + > + Previously, StringHasher contained methods for calculating the hash of a pure C We have an convention of referring to the original Blink commit when merging a Blink patch into WebKit by adding a remark to the ChangeLog entry/commit message of the form: Merged from Blink: https://src.chromium.org/viewvc/blink?revision=175826&view=revision You can see an example of ChangeLog entry/commit messages that reference this at <http://trac.webkit.org/changeset/166668>. > Source/WTF/ChangeLog:17 > + Reviewed by NOBODY (OOPS!). Please move the Reviewed by line to be on its own line under the WebKit bug URL. > Source/WTF/wtf/text/AtomicString.cpp:94 > - return addToStringTable<const LChar*, CStringTranslator>(c); > + return add(c, strlen(reinterpret_cast<const char*>(c))); This seems like a weird and indirect approach because AtomicString::add(const LChar*, unsigned) duplicates the precondition checks done in this function and otherwise calls through to AtomicString::addToStringTable<LCharBuffer, LCharBufferTranslator>(). We should take a similar approach as AtomicString::add(const UChar*) and stack allocate an LCharBuffer with the specified string and computed length and directly call AtomicString::addToStringTable<LCharBuffer, LCharBufferTranslator>(...), which may be inlined. > Tools/ChangeLog:6 > + Refactor to remove unnecessary code from the string hashing functions > + https://bugs.webkit.org/show_bug.cgi?id=133990 > + > + Previously, StringHasher contained methods for calculating the hash of a pure C Please also update this ChangeLog entry.

Darin Adler

Comment 5 2014-06-21 12:46:40 PDT

Comment on attachment 233248 [details] Patch As far as I can tell, this patch removes a “don’t iterate the string twice” optimization.

Benjamin Poulain

Comment 6 2014-06-21 13:42:45 PDT

(In reply to comment #5) > (From update of attachment 233248 [details]) > As far as I can tell, this patch removes a “don’t iterate the string twice” optimization. I was thinking the same :) Any performance numbers Adam? How did you measure the improvement?

Adam Treat

Comment 7 2014-06-23 07:28:55 PDT

(In reply to comment #5) > (From update of attachment 233248 [details]) > As far as I can tell, this patch removes a “don’t iterate the string twice” optimization. Just the opposite unless I'm very mistaken. This patch removes additional iterations of the string in favor of one iteration up front.

Adam Treat

Comment 8 2014-06-23 07:30:19 PDT

(In reply to comment #6) > (In reply to comment #5) > > (From update of attachment 233248 [details] [details]) > > As far as I can tell, this patch removes a “don’t iterate the string twice” optimization. > > I was thinking the same :) > > Any performance numbers Adam? How did you measure the improvement? I did not measure any improvement. The reason for the patch is mostly a refactor to remove unnecessary code. I do believe it removes additional iterations of the string, but I haven't tried to measure it.

Daniel Bates

Comment 9 2014-06-23 09:11:42 PDT

Comment on attachment 233248 [details] Patch Looking over the patch again, I agree with Darin Adler's and Benjamin Poulain's remarks in comment #5 and comment #6, respectively, that this patch removes an optimization to avoid an iteration over the specified LChar string to be added to the table. Specifically this patch increases the algorithmic complexity of add() by a factor of N (the length of the LChar string to add to the table) when the string already exists in the table. Without the proposed patch the cost to add a string S with length N that already exists in the table is O(2N) = "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)). With the proposed patch the cost becomes O(3N) = "cost to compute strlen(S)" (O(N)) + "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)).

Adam Treat

Comment 10 2014-06-23 11:46:33 PDT

(In reply to comment #9) > (From update of attachment 233248 [details]) > Looking over the patch again, I agree with Darin Adler's and Benjamin Poulain's remarks in comment #5 and comment #6, respectively, that this patch removes an optimization to avoid an iteration over the specified LChar string to be added to the table. Specifically this patch increases the algorithmic complexity of add() by a factor of N (the length of the LChar string to add to the table) when the string already exists in the table. > > Without the proposed patch the cost to add a string S with length N that already exists in the table is O(2N) = "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)). With the proposed patch the cost becomes O(3N) = "cost to compute strlen(S)" (O(N)) + "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)). It also reduces by a factor of N when the string is not in the table, but hash clashes with another string of different length, right?

Daniel Bates

Comment 11 2014-06-23 13:14:01 PDT

(In reply to comment #10) > (In reply to comment #9) > > (From update of attachment 233248 [details] [details]) > > Looking over the patch again, I agree with Darin Adler's and Benjamin Poulain's remarks in comment #5 and comment #6, respectively, that this patch removes an optimization to avoid an iteration over the specified LChar string to be added to the table. Specifically this patch increases the algorithmic complexity of add() by a factor of N (the length of the LChar string to add to the table) when the string already exists in the table. > > > > Without the proposed patch the cost to add a string S with length N that already exists in the table is O(2N) = "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)). With the proposed patch the cost becomes O(3N) = "cost to compute strlen(S)" (O(N)) + "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)). > > It also reduces by a factor of N when the string is not in the table, but hash clashes with another string of different length, right? Can you elaborate? As far as I can tell, the proposed patch doesn't reduce the cost by a factor of N for such a situation.

Adam Treat

Comment 12 2014-06-23 14:38:45 PDT

(In reply to comment #11) > (In reply to comment #10) > > (In reply to comment #9) > > > (From update of attachment 233248 [details] [details] [details]) > > > Looking over the patch again, I agree with Darin Adler's and Benjamin Poulain's remarks in comment #5 and comment #6, respectively, that this patch removes an optimization to avoid an iteration over the specified LChar string to be added to the table. Specifically this patch increases the algorithmic complexity of add() by a factor of N (the length of the LChar string to add to the table) when the string already exists in the table. > > > > > > Without the proposed patch the cost to add a string S with length N that already exists in the table is O(2N) = "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)). With the proposed patch the cost becomes O(3N) = "cost to compute strlen(S)" (O(N)) + "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)). > > > > It also reduces by a factor of N when the string is not in the table, but hash clashes with another string of different length, right? > > Can you elaborate? > > As far as I can tell, the proposed patch doesn't reduce the cost by a factor of N for such a situation. In such a situation, the proposed patch will consist of: "cost to compute strlen(S)" (O(N) + "cost to hash S" (O(N)) + "cost to determine the two strings don't match which is early return 'S.length != S1.length" whereas without the patch it would be: "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)) + cost to create new StringImpl by calling strlen at the end correct?

Daniel Bates

Comment 13 2014-06-24 09:45:18 PDT

(In reply to comment #12) > > > [...] > > > It also reduces by a factor of N when the string is not in the table, but hash clashes with another string of different length, right? > > > > Can you elaborate? > > > > As far as I can tell, the proposed patch doesn't reduce the cost by a factor of N for such a situation. > > In such a situation, the proposed patch will consist of: > > "cost to compute strlen(S)" (O(N) + "cost to hash S" (O(N)) + "cost to determine the two strings don't match which is early return 'S.length != S1.length" > > whereas without the patch it would be: > > "cost to hash S" (O(N)) + "cost to perform an equality operation to ensure that a match with the same hash is identical to S" (O(N)) + cost to create new StringImpl by calling strlen at the end > > correct? You're right for this case! As aforementioned, the proposed patch increases the algorithmic complexity of add() by a factor of N when the string already exists in the table. I'm unclear which case is more likely to happen in practice. Are you able to measure it?

Daniel Bates

Comment 14 2014-06-24 09:49:28 PDT

Let S and S' be strings. We stay S == S' if the contents of the strings are identical. The following are different cases and their worst-case algorithmic complexities with and without the proposed patch: Case 1: Suppose S != S', strlen(S) == strlen(S'), hash(S) == hash(S') and S' is in the atomic string table. AtomicString::add(S): Without patch: O(4N) = "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(N)) + "cost to compute strlen(S)" (O(N)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N)) With patch: O(4N) = "cost to compute strlen(S)" (O(N)) + "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(N)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N)) Case 2: Suppose S != S', strlen(S) != strlen(S'), hash(S) == hash(S') and S' is in the atomic string table. AtomicString::add(S): Without patch: O(4N) = "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(N)) + "cost to compute strlen(S)" (O(N)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N)) With patch: O(3N) = "cost to compute strlen(S)" (O(N)) + "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(1)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N)) Case 3: Suppose S == S', strlen(S) == strlen(S'), hash(S) == hash(S') and S' is in the atomic string table. AtomicString::add(S): Without patch: O(2N) = "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(N)) With patch: O(3N) = "cost to compute strlen(S)" (O(N)) + "cost of hash(S)" (O(N)) + "cost to compare S and S'" (O(N)) Case 4: Suppose hash(S) != hash(S_i) for every string S_i in the atomic string table. AtomicString::add(S): Without patch: O(3N) = "cost of hash(S)" (O(N)) + "cost to compute strlen(S)" (O(N)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N)) With patch: O(3N) = "cost to compute strlen(S)" (O(N)) + "cost of hash(S)" (O(N)) + "cost to copy the buffer in StringImpl::createInternal()" (O(N))

Note You need to log in before you can comment on or make changes to this bug.

Status ASSIGNED

Resolution

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component Web Template Framework

Assignee

Adam Treat

Reported

2014-06-17 12:12 PDT

Modified

2014-06-24 09:49 PDT History

CC List

8 users Show

URL

Keywords BlinkMergeCandidate

Depends on

Blocks