Bug 96022

Summary: Minimize collisions when hashing pairs
Product: WebKit Reporter: Dana Jansens <danakj>
Component: New BugsAssignee: Dana Jansens <danakj>
Status: RESOLVED FIXED    
Severity: Normal CC: andersca, ap, benjamin, darin, dglazkov, donggwan.kim, enne, jamesr, koivisto, mifenton, mjs, ojan, peter+ews, piman, pkasting, robin.webkit, rwlbuis, webkit.review.bot, wjmaclean
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
test_demonstration
none
Proof
none
Patch
none
split_up_tests
none
Patch
none
non-empty-lookup-test-vs-noopHash
none
Patch
none
Patch none

Description Dana Jansens 2012-09-06 14:11:17 PDT
Minimize collisions when hashing pairs
Comment 1 Dana Jansens 2012-09-06 14:12:05 PDT
Created attachment 162579 [details]
test_demonstration
Comment 2 Dana Jansens 2012-09-06 14:17:01 PDT
Created attachment 162580 [details]
Proof

Notes containing a method and proof for hashing a constant number of integers (such as a pair).
Comment 3 Dana Jansens 2012-09-06 14:18:22 PDT
The current hash function for pairs has poor performance as it does a nice hash function on 64 bits, but then just drops the top 32 bits. The hash method for pairs tries to use Thomas Wang's 64 bit Mix Function (http://www.cris.com/~Ttwang/tech/inthash.htm), but this requires not dropping any bits in order to retain the characteristics mentioned by Thomas.

A better method is described and proven in section 5.2.2 of https://bugs.webkit.org/attachment.cgi?id=162580 which I have implemented here.
Comment 4 Dana Jansens 2012-09-06 16:15:13 PDT
To compare the two hash methods I implemented the new method and used the unit tests in https://bugs.webkit.org/attachment.cgi?id=162579 which report their running times in milliseconds. A larger number of collisions will lead to longer running times, either due to long lookups or to more allocations to make space for buckets.

Times for comparison (in milliseconds) to hash all combinations of the first 2^11 integers within a std::pair<int,int>:

Current code: 7999 7794 7963 7944 7931 7775 7924 7779 7983 7898 7239
New method  : 6082 6141 6205 6099 6178 6136 5839 6102 6118 6030 5598

You can see an approximately 22% time decrease with the new method for the given input.

For general objects within a pair, we can use the DefaultHash::Hash::hash() method to get an integer. I used the same test case, but ran each of the two integers in the pair through the existing intHash() method first. In this case, we still see a time decrease of about 8%:

Current code: 7926 7898 7186 7174 7173 7987 7168 
New method  : 7353 7231 6636 6544 6600 6616 6614

This hashing method can also be found in [1] with a proof, and originally comes from the work of Woelfel[2][3].

[1] http://opendatastructures.org/versions/edition-0.1d/ods-java/node33.html#SECTION00832000000000000000
[2] http://pages.cpsc.ucalgary.ca/~woelfel/paper/efficient_strongly/efficient_strongly.pdf
[3] http://pages.cpsc.ucalgary.ca/~woelfel/paper/diss/diss.pdf


p.s. The stylebot warnings are to match other similar 1-line template declarations in HashFunctions.h
Comment 5 Dana Jansens 2012-09-06 16:16:03 PDT
Created attachment 162613 [details]
Patch
Comment 6 WebKit Review Bot 2012-09-06 16:19:54 PDT
Attachment 162613 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/ChangeLog', u'Source/WTF/wtf/Ha..." exit_code: 1
Source/WTF/wtf/HashFunctions.h:195:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:196:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:197:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:198:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:199:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:200:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:201:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:202:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:203:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:204:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:205:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:206:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:207:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:208:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:209:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:210:  More than one command on the same line  [whitespace/newline] [4]
Total errors found: 16 in 13 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 7 Alexey Proskuryakov 2012-09-06 16:47:40 PDT
Could you please announce this on webkit-dev? I feel that it would be the best way to reach some people who should be aware of this proposal.
Comment 8 Build Bot 2012-09-06 16:54:03 PDT
Comment on attachment 162613 [details]
Patch

Attachment 162613 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/13769522
Comment 9 Build Bot 2012-09-06 17:20:28 PDT
Comment on attachment 162613 [details]
Patch

Attachment 162613 [details] did not pass win-ews (win):
Output: http://queues.webkit.org/results/13769528
Comment 10 Early Warning System Bot 2012-09-06 17:28:41 PDT
Comment on attachment 162613 [details]
Patch

Attachment 162613 [details] did not pass qt-ews (qt):
Output: http://queues.webkit.org/results/13786189
Comment 11 Benjamin Poulain 2012-09-06 18:27:56 PDT
Interesting patch!

> Times for comparison (in milliseconds) to hash all combinations of the first 2^11 integers within a std::pair<int,int>:
> 
> Current code: 7999 7794 7963 7944 7931 7775 7924 7779 7983 7898 7239
> New method  : 6082 6141 6205 6099 6178 6136 5839 6102 6118 6030 5598

It this the time of hash() or the time to get values? Could you give both separately?
Did you measure this on 32 or 64 bits?


If I am not mistaken, all of our current hash functions rely on the CPU fast shifters. It would be nice to compare this on ARM in addition to x86.
Comment 12 Peter Beverloo (cr-android ews) 2012-09-06 19:14:00 PDT
Comment on attachment 162613 [details]
Patch

Attachment 162613 [details] did not pass cr-android-ews (chromium-android):
Output: http://queues.webkit.org/results/13778391
Comment 13 Early Warning System Bot 2012-09-06 23:34:03 PDT
Comment on attachment 162613 [details]
Patch

Attachment 162613 [details] did not pass qt-wk2-ews (qt):
Output: http://queues.webkit.org/results/13785292
Comment 14 Dana Jansens 2012-09-07 08:53:07 PDT
(In reply to comment #11)
> Interesting patch!
> 
> > Times for comparison (in milliseconds) to hash all combinations of the first 2^11 integers within a std::pair<int,int>:
> > 
> > Current code: 7999 7794 7963 7944 7931 7775 7924 7779 7983 7898 7239
> > New method  : 6082 6141 6205 6099 6178 6136 5839 6102 6118 6030 5598
> 
> It this the time of hash() or the time to get values? Could you give both separately?

For each key K this was the time to lookup K's value, and insert a new value for K. Then, lookup each value in the HashMap a second time.

The performance improvement here is do to fewer collisions, not to the speed of the hash() function. Fewer collisions mean we have a more densely populated HashMap, which means fewer memory allocations to increase the space needed to store all the keys.

When I only measure time to insert keys, the difference in time is exaggerated further.

> Did you measure this on 32 or 64 bits?

This is measured on a 64 bit machine.

> 
> If I am not mistaken, all of our current hash functions rely on the CPU fast shifters. It would be nice to compare this on ARM in addition to x86.

This would be interesting, though again this is showing the negative impact of a bad hash function in collisions, not in the hash() method itself being slow. Causing more reallocs will dominate any time spent in a hash function, so comparing the speed of the function itself is of less relevance unless they have similar hashing characteristics/collision rates.
Comment 15 Dana Jansens 2012-09-07 08:53:21 PDT
(In reply to comment #7)
> Could you please announce this on webkit-dev? I feel that it would be the best way to reach some people who should be aware of this proposal.

Sure, will do.
Comment 16 Dana Jansens 2012-09-07 10:26:26 PDT
Created attachment 162804 [details]
split_up_tests
Comment 17 Dana Jansens 2012-09-07 10:31:34 PDT
Created attachment 162808 [details]
Patch

new patch to fix truncation warning
Comment 18 WebKit Review Bot 2012-09-07 10:34:28 PDT
Attachment 162808 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/ChangeLog', u'Source/WTF/wtf/Ha..." exit_code: 1
Source/WTF/wtf/HashFunctions.h:195:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:196:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:197:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:198:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:199:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:200:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:201:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:202:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:203:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:204:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:205:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:206:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:207:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:208:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:209:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:210:  More than one command on the same line  [whitespace/newline] [4]
Total errors found: 16 in 13 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 19 Benjamin Poulain 2012-09-07 10:39:01 PDT
Gosh, I understand how a hashmap works...

What I want to find out is if this is particular hash is the best solution.

Also, collisions on insert is not the whole story. Many algorithms are inserting items once and use HashTable::contains() a lot. In many of those cases, the speed of hash() will impact your profile.
Comment 20 Dana Jansens 2012-09-07 10:42:39 PDT
(In reply to comment #19)
> Gosh, I understand how a hashmap works...
> 
> What I want to find out is if this is particular hash is the best solution.
> 
> Also, collisions on insert is not the whole story. Many algorithms are inserting items once and use HashTable::contains() a lot. In many of those cases, the speed of hash() will impact your profile.

Okay, I wasn't sure if we were talking about the same things, thanks for clarifying. I added a test that did large numbers of queries on an empty HashMap to verify query time was not regressing, which sounds like your concern as well.  On my machine the time to do ~8000 queries went from about 130ms to under 1ms with the new hash function.
Comment 21 Benjamin Poulain 2012-09-07 10:46:07 PDT
> Okay, I wasn't sure if we were talking about the same things, thanks for clarifying. I added a test that did large numbers of queries on an empty HashMap to verify query time was not regressing, which sounds like your concern as well.  On my machine the time to do ~8000 queries went from about 130ms to under 1ms with the new hash function.

Thanks for testing that, it is exactly what I was looking for.

It looks like the speed of hash() is not gonna be a concern here :)
Comment 22 Peter Kasting 2012-09-07 10:52:12 PDT
(In reply to comment #20)
> I added a test that did large numbers of queries on an empty HashMap to verify query time was not regressing, which sounds like your concern as well.  On my machine the time to do ~8000 queries went from about 130ms to under 1ms with the new hash function.

Just to think through this to be sure we understand why this is improving... presumably, computing a hash value is effectively a series of arithmetic operations, whereas actually checking a slot in the map is one or more memory accesses.  Regardless of whether collisions are handled by chaining or rehashing (I haven't actually looked at WTF's implementation), resolving them involves making further memory accesses.

So if we make the hash function take longer to compute but reduce collisions, we're effectively trading more (or longer) arithmetic operations for fewer memory accesses.  Unless the difference in computation time is large, that seems like it ought to be a win.
Comment 23 Dana Jansens 2012-09-07 10:58:14 PDT
(In reply to comment #22)
> (In reply to comment #20)
> > I added a test that did large numbers of queries on an empty HashMap to verify query time was not regressing, which sounds like your concern as well.  On my machine the time to do ~8000 queries went from about 130ms to under 1ms with the new hash function.
> 
> Just to think through this to be sure we understand why this is improving... presumably, computing a hash value is effectively a series of arithmetic operations, whereas actually checking a slot in the map is one or more memory accesses.  Regardless of whether collisions are handled by chaining or rehashing (I haven't actually looked at WTF's implementation), resolving them involves making further memory accesses.
> 
> So if we make the hash function take longer to compute but reduce collisions, we're effectively trading more (or longer) arithmetic operations for fewer memory accesses.  Unless the difference in computation time is large, that seems like it ought to be a win.

Yes, though it depends on the hash table implementation. For instance, you can make the table larger and move the keys around on insert. Then perhaps you keep the same number of accesses on query. But you've paid for it in allocation/memcpy during insert. Either way though, more collisions means more work to do at some point in the process.
Comment 24 Benjamin Poulain 2012-09-07 11:12:49 PDT
> Just to think through this to be sure we understand why this is improving... presumably, computing a hash value is effectively a series of arithmetic operations, whereas actually checking a slot in the map is one or more memory accesses.  Regardless of whether collisions are handled by chaining or rehashing (I haven't actually looked at WTF's implementation), resolving them involves making further memory accesses.

On an empty hash map?
Comment 25 Build Bot 2012-09-07 11:24:19 PDT
Comment on attachment 162808 [details]
Patch

Attachment 162808 [details] did not pass win-ews (win):
Output: http://queues.webkit.org/results/13770866
Comment 26 Peter Kasting 2012-09-07 11:26:20 PDT
(In reply to comment #24)
> > Just to think through this to be sure we understand why this is improving... presumably, computing a hash value is effectively a series of arithmetic operations, whereas actually checking a slot in the map is one or more memory accesses.  Regardless of whether collisions are handled by chaining or rehashing (I haven't actually looked at WTF's implementation), resolving them involves making further memory accesses.
> 
> On an empty hash map?

No, on an empty hash map, you'd expect the first insertion to take slightly more time with a more complex hashing function.  (I'm assuming Dana's hashing function _is_ more computationally complex.)

I wasn't really trying to reason fully about what could happen at all possible points in a hash map's life cycle.  I was more trying to say, "if we have an experimental result that shows that checking contains() gets much faster, can we also think of a logical reason why that result makes sense?".
Comment 27 Benjamin Poulain 2012-09-07 11:33:40 PDT
> I wasn't really trying to reason fully about what could happen at all possible points in a hash map's life cycle.  I was more trying to say, "if we have an experimental result that shows that checking contains() gets much faster, can we also think of a logical reason why that result makes sense?".

Unless I miss something, the test was done on an empty HashMap: "a test that did large numbers of queries on an empty HashMap".

Otherwise that would not have proven much about the hash() function time itself as you pointed out.
Comment 28 Build Bot 2012-09-07 11:38:01 PDT
Comment on attachment 162808 [details]
Patch

Attachment 162808 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/13776779
Comment 29 Dana Jansens 2012-09-07 11:43:41 PDT
Created attachment 162835 [details]
non-empty-lookup-test-vs-noopHash
Comment 30 Dana Jansens 2012-09-07 11:47:36 PDT
Just in case the empty HashMap is a special case, I will submit another test result.

I made a hash function that just returns pair.first, so essentially a no-op, to compare to this new hash function. Also, I made the query test first insert 4 elements into the HashMap. Then call contains() for 2048 different pairs. Here's the result:

[ RUN      ] HashTest.oldHashLookup
[       OK ] HashTest.oldHashLookup (102 ms)
[ RUN      ] HashTest.newHashLookup
[       OK ] HashTest.newHashLookup (33 ms)
[ RUN      ] HashTest.noopHashLookup
[       OK ] HashTest.noopHashLookup (13 ms)

A no-op is definitely fastest, and this hash function does much better than the current method also. I have attached the modified tests to the bug again, in case you are interested in trying them out.
Comment 31 Peter Kasting 2012-09-07 12:00:54 PDT
Ah, OK, sounds like the new hash is actually faster to compute, in addition to resulting in fewer collisions.  I hadn't grasped that previously.
Comment 32 Dana Jansens 2012-09-07 12:17:54 PDT
Created attachment 162842 [details]
Patch

using LL suffix for 64bit constant
Comment 33 WebKit Review Bot 2012-09-07 12:23:07 PDT
Attachment 162842 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/ChangeLog', u'Source/WTF/wtf/Ha..." exit_code: 1
Source/WTF/wtf/HashFunctions.h:195:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:196:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:197:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:198:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:199:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:200:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:201:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:202:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:203:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:204:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:205:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:206:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:207:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:208:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:209:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:210:  More than one command on the same line  [whitespace/newline] [4]
Total errors found: 16 in 13 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 34 Build Bot 2012-09-07 13:00:04 PDT
Comment on attachment 162842 [details]
Patch

Attachment 162842 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/13788240
Comment 35 Dana Jansens 2012-09-07 13:05:33 PDT
Created attachment 162851 [details]
Patch

update using declaration in GraphicsContextCA to WTF::pairIntHash
Comment 36 WebKit Review Bot 2012-09-07 13:07:04 PDT
Attachment 162851 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/ChangeLog', u'Source/WTF/wtf/Ha..." exit_code: 1
Source/WTF/wtf/HashFunctions.h:195:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:196:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:197:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:198:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:199:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:200:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:201:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:202:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:203:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:204:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:205:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:206:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:207:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:208:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:209:  More than one command on the same line  [whitespace/newline] [4]
Source/WTF/wtf/HashFunctions.h:210:  More than one command on the same line  [whitespace/newline] [4]
Total errors found: 16 in 13 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 37 Adrienne Walker 2012-09-14 13:22:08 PDT
Comment on attachment 162851 [details]
Patch

R=me.  It's been a week since the last comment here and the email to webkit-dev, and it seems like nobody has any remaining objections.  Let's land this.
Comment 38 Benjamin Poulain 2012-09-14 13:25:27 PDT
> R=me.  It's been a week since the last comment here and the email to webkit-dev, and it seems like nobody has any remaining objections.  Let's land this.

I, for one, welcome our new hashing overlords :)
Comment 39 WebKit Review Bot 2012-09-14 13:59:59 PDT
Comment on attachment 162851 [details]
Patch

Clearing flags on attachment: 162851

Committed r128650: <http://trac.webkit.org/changeset/128650>
Comment 40 WebKit Review Bot 2012-09-14 14:00:05 PDT
All reviewed patches have been landed.  Closing bug.
Comment 41 Alexey Proskuryakov 2012-09-14 14:57:34 PDT
Build fix committed <http://trac.webkit.org/r128657>.