RESOLVED FIXED Bug 96022
Minimize collisions when hashing pairs
https://bugs.webkit.org/show_bug.cgi?id=96022
Summary Minimize collisions when hashing pairs
Dana Jansens
Reported 2012-09-06 14:11:17 PDT
Minimize collisions when hashing pairs
Attachments
test_demonstration (2.26 KB, patch)
2012-09-06 14:12 PDT, Dana Jansens
no flags
Proof (226.99 KB, application/pdf)
2012-09-06 14:17 PDT, Dana Jansens
no flags
Patch (15.73 KB, patch)
2012-09-06 16:16 PDT, Dana Jansens
no flags
split_up_tests (3.21 KB, patch)
2012-09-07 10:26 PDT, Dana Jansens
no flags
Patch (15.72 KB, patch)
2012-09-07 10:31 PDT, Dana Jansens
no flags
non-empty-lookup-test-vs-noopHash (4.60 KB, patch)
2012-09-07 11:43 PDT, Dana Jansens
no flags
Patch (15.72 KB, patch)
2012-09-07 12:17 PDT, Dana Jansens
no flags
Patch (16.00 KB, patch)
2012-09-07 13:05 PDT, Dana Jansens
no flags
Dana Jansens
Comment 1 2012-09-06 14:12:05 PDT
Created attachment 162579 [details] test_demonstration
Dana Jansens
Comment 2 2012-09-06 14:17:01 PDT
Created attachment 162580 [details] Proof Notes containing a method and proof for hashing a constant number of integers (such as a pair).
Dana Jansens
Comment 3 2012-09-06 14:18:22 PDT
The current hash function for pairs has poor performance as it does a nice hash function on 64 bits, but then just drops the top 32 bits. The hash method for pairs tries to use Thomas Wang's 64 bit Mix Function (http://www.cris.com/~Ttwang/tech/inthash.htm), but this requires not dropping any bits in order to retain the characteristics mentioned by Thomas. A better method is described and proven in section 5.2.2 of https://bugs.webkit.org/attachment.cgi?id=162580 which I have implemented here.
Dana Jansens
Comment 4 2012-09-06 16:15:13 PDT
To compare the two hash methods I implemented the new method and used the unit tests in https://bugs.webkit.org/attachment.cgi?id=162579 which report their running times in milliseconds. A larger number of collisions will lead to longer running times, either due to long lookups or to more allocations to make space for buckets. Times for comparison (in milliseconds) to hash all combinations of the first 2^11 integers within a std::pair<int,int>: Current code: 7999 7794 7963 7944 7931 7775 7924 7779 7983 7898 7239 New method : 6082 6141 6205 6099 6178 6136 5839 6102 6118 6030 5598 You can see an approximately 22% time decrease with the new method for the given input. For general objects within a pair, we can use the DefaultHash::Hash::hash() method to get an integer. I used the same test case, but ran each of the two integers in the pair through the existing intHash() method first. In this case, we still see a time decrease of about 8%: Current code: 7926 7898 7186 7174 7173 7987 7168 New method : 7353 7231 6636 6544 6600 6616 6614 This hashing method can also be found in [1] with a proof, and originally comes from the work of Woelfel[2][3]. [1] http://opendatastructures.org/versions/edition-0.1d/ods-java/node33.html#SECTION00832000000000000000 [2] http://pages.cpsc.ucalgary.ca/~woelfel/paper/efficient_strongly/efficient_strongly.pdf [3] http://pages.cpsc.ucalgary.ca/~woelfel/paper/diss/diss.pdf p.s. The stylebot warnings are to match other similar 1-line template declarations in HashFunctions.h
Dana Jansens
Comment 5 2012-09-06 16:16:03 PDT
WebKit Review Bot
Comment 6 2012-09-06 16:19:54 PDT
Attachment 162613 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/ChangeLog', u'Source/WTF/wtf/Ha..." exit_code: 1 Source/WTF/wtf/HashFunctions.h:195: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:196: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:197: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:198: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:199: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:200: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:201: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:202: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:203: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:204: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:205: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:206: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:207: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:208: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:209: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:210: More than one command on the same line [whitespace/newline] [4] Total errors found: 16 in 13 files If any of these errors are false positives, please file a bug against check-webkit-style.
Alexey Proskuryakov
Comment 7 2012-09-06 16:47:40 PDT
Could you please announce this on webkit-dev? I feel that it would be the best way to reach some people who should be aware of this proposal.
Build Bot
Comment 8 2012-09-06 16:54:03 PDT
Build Bot
Comment 9 2012-09-06 17:20:28 PDT
Early Warning System Bot
Comment 10 2012-09-06 17:28:41 PDT
Benjamin Poulain
Comment 11 2012-09-06 18:27:56 PDT
Interesting patch! > Times for comparison (in milliseconds) to hash all combinations of the first 2^11 integers within a std::pair<int,int>: > > Current code: 7999 7794 7963 7944 7931 7775 7924 7779 7983 7898 7239 > New method : 6082 6141 6205 6099 6178 6136 5839 6102 6118 6030 5598 It this the time of hash() or the time to get values? Could you give both separately? Did you measure this on 32 or 64 bits? If I am not mistaken, all of our current hash functions rely on the CPU fast shifters. It would be nice to compare this on ARM in addition to x86.
Peter Beverloo (cr-android ews)
Comment 12 2012-09-06 19:14:00 PDT
Comment on attachment 162613 [details] Patch Attachment 162613 [details] did not pass cr-android-ews (chromium-android): Output: http://queues.webkit.org/results/13778391
Early Warning System Bot
Comment 13 2012-09-06 23:34:03 PDT
Dana Jansens
Comment 14 2012-09-07 08:53:07 PDT
(In reply to comment #11) > Interesting patch! > > > Times for comparison (in milliseconds) to hash all combinations of the first 2^11 integers within a std::pair<int,int>: > > > > Current code: 7999 7794 7963 7944 7931 7775 7924 7779 7983 7898 7239 > > New method : 6082 6141 6205 6099 6178 6136 5839 6102 6118 6030 5598 > > It this the time of hash() or the time to get values? Could you give both separately? For each key K this was the time to lookup K's value, and insert a new value for K. Then, lookup each value in the HashMap a second time. The performance improvement here is do to fewer collisions, not to the speed of the hash() function. Fewer collisions mean we have a more densely populated HashMap, which means fewer memory allocations to increase the space needed to store all the keys. When I only measure time to insert keys, the difference in time is exaggerated further. > Did you measure this on 32 or 64 bits? This is measured on a 64 bit machine. > > If I am not mistaken, all of our current hash functions rely on the CPU fast shifters. It would be nice to compare this on ARM in addition to x86. This would be interesting, though again this is showing the negative impact of a bad hash function in collisions, not in the hash() method itself being slow. Causing more reallocs will dominate any time spent in a hash function, so comparing the speed of the function itself is of less relevance unless they have similar hashing characteristics/collision rates.
Dana Jansens
Comment 15 2012-09-07 08:53:21 PDT
(In reply to comment #7) > Could you please announce this on webkit-dev? I feel that it would be the best way to reach some people who should be aware of this proposal. Sure, will do.
Dana Jansens
Comment 16 2012-09-07 10:26:26 PDT
Created attachment 162804 [details] split_up_tests
Dana Jansens
Comment 17 2012-09-07 10:31:34 PDT
Created attachment 162808 [details] Patch new patch to fix truncation warning
WebKit Review Bot
Comment 18 2012-09-07 10:34:28 PDT
Attachment 162808 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/ChangeLog', u'Source/WTF/wtf/Ha..." exit_code: 1 Source/WTF/wtf/HashFunctions.h:195: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:196: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:197: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:198: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:199: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:200: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:201: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:202: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:203: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:204: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:205: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:206: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:207: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:208: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:209: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:210: More than one command on the same line [whitespace/newline] [4] Total errors found: 16 in 13 files If any of these errors are false positives, please file a bug against check-webkit-style.
Benjamin Poulain
Comment 19 2012-09-07 10:39:01 PDT
Gosh, I understand how a hashmap works... What I want to find out is if this is particular hash is the best solution. Also, collisions on insert is not the whole story. Many algorithms are inserting items once and use HashTable::contains() a lot. In many of those cases, the speed of hash() will impact your profile.
Dana Jansens
Comment 20 2012-09-07 10:42:39 PDT
(In reply to comment #19) > Gosh, I understand how a hashmap works... > > What I want to find out is if this is particular hash is the best solution. > > Also, collisions on insert is not the whole story. Many algorithms are inserting items once and use HashTable::contains() a lot. In many of those cases, the speed of hash() will impact your profile. Okay, I wasn't sure if we were talking about the same things, thanks for clarifying. I added a test that did large numbers of queries on an empty HashMap to verify query time was not regressing, which sounds like your concern as well. On my machine the time to do ~8000 queries went from about 130ms to under 1ms with the new hash function.
Benjamin Poulain
Comment 21 2012-09-07 10:46:07 PDT
> Okay, I wasn't sure if we were talking about the same things, thanks for clarifying. I added a test that did large numbers of queries on an empty HashMap to verify query time was not regressing, which sounds like your concern as well. On my machine the time to do ~8000 queries went from about 130ms to under 1ms with the new hash function. Thanks for testing that, it is exactly what I was looking for. It looks like the speed of hash() is not gonna be a concern here :)
Peter Kasting
Comment 22 2012-09-07 10:52:12 PDT
(In reply to comment #20) > I added a test that did large numbers of queries on an empty HashMap to verify query time was not regressing, which sounds like your concern as well. On my machine the time to do ~8000 queries went from about 130ms to under 1ms with the new hash function. Just to think through this to be sure we understand why this is improving... presumably, computing a hash value is effectively a series of arithmetic operations, whereas actually checking a slot in the map is one or more memory accesses. Regardless of whether collisions are handled by chaining or rehashing (I haven't actually looked at WTF's implementation), resolving them involves making further memory accesses. So if we make the hash function take longer to compute but reduce collisions, we're effectively trading more (or longer) arithmetic operations for fewer memory accesses. Unless the difference in computation time is large, that seems like it ought to be a win.
Dana Jansens
Comment 23 2012-09-07 10:58:14 PDT
(In reply to comment #22) > (In reply to comment #20) > > I added a test that did large numbers of queries on an empty HashMap to verify query time was not regressing, which sounds like your concern as well. On my machine the time to do ~8000 queries went from about 130ms to under 1ms with the new hash function. > > Just to think through this to be sure we understand why this is improving... presumably, computing a hash value is effectively a series of arithmetic operations, whereas actually checking a slot in the map is one or more memory accesses. Regardless of whether collisions are handled by chaining or rehashing (I haven't actually looked at WTF's implementation), resolving them involves making further memory accesses. > > So if we make the hash function take longer to compute but reduce collisions, we're effectively trading more (or longer) arithmetic operations for fewer memory accesses. Unless the difference in computation time is large, that seems like it ought to be a win. Yes, though it depends on the hash table implementation. For instance, you can make the table larger and move the keys around on insert. Then perhaps you keep the same number of accesses on query. But you've paid for it in allocation/memcpy during insert. Either way though, more collisions means more work to do at some point in the process.
Benjamin Poulain
Comment 24 2012-09-07 11:12:49 PDT
> Just to think through this to be sure we understand why this is improving... presumably, computing a hash value is effectively a series of arithmetic operations, whereas actually checking a slot in the map is one or more memory accesses. Regardless of whether collisions are handled by chaining or rehashing (I haven't actually looked at WTF's implementation), resolving them involves making further memory accesses. On an empty hash map?
Build Bot
Comment 25 2012-09-07 11:24:19 PDT
Peter Kasting
Comment 26 2012-09-07 11:26:20 PDT
(In reply to comment #24) > > Just to think through this to be sure we understand why this is improving... presumably, computing a hash value is effectively a series of arithmetic operations, whereas actually checking a slot in the map is one or more memory accesses. Regardless of whether collisions are handled by chaining or rehashing (I haven't actually looked at WTF's implementation), resolving them involves making further memory accesses. > > On an empty hash map? No, on an empty hash map, you'd expect the first insertion to take slightly more time with a more complex hashing function. (I'm assuming Dana's hashing function _is_ more computationally complex.) I wasn't really trying to reason fully about what could happen at all possible points in a hash map's life cycle. I was more trying to say, "if we have an experimental result that shows that checking contains() gets much faster, can we also think of a logical reason why that result makes sense?".
Benjamin Poulain
Comment 27 2012-09-07 11:33:40 PDT
> I wasn't really trying to reason fully about what could happen at all possible points in a hash map's life cycle. I was more trying to say, "if we have an experimental result that shows that checking contains() gets much faster, can we also think of a logical reason why that result makes sense?". Unless I miss something, the test was done on an empty HashMap: "a test that did large numbers of queries on an empty HashMap". Otherwise that would not have proven much about the hash() function time itself as you pointed out.
Build Bot
Comment 28 2012-09-07 11:38:01 PDT
Dana Jansens
Comment 29 2012-09-07 11:43:41 PDT
Created attachment 162835 [details] non-empty-lookup-test-vs-noopHash
Dana Jansens
Comment 30 2012-09-07 11:47:36 PDT
Just in case the empty HashMap is a special case, I will submit another test result. I made a hash function that just returns pair.first, so essentially a no-op, to compare to this new hash function. Also, I made the query test first insert 4 elements into the HashMap. Then call contains() for 2048 different pairs. Here's the result: [ RUN ] HashTest.oldHashLookup [ OK ] HashTest.oldHashLookup (102 ms) [ RUN ] HashTest.newHashLookup [ OK ] HashTest.newHashLookup (33 ms) [ RUN ] HashTest.noopHashLookup [ OK ] HashTest.noopHashLookup (13 ms) A no-op is definitely fastest, and this hash function does much better than the current method also. I have attached the modified tests to the bug again, in case you are interested in trying them out.
Peter Kasting
Comment 31 2012-09-07 12:00:54 PDT
Ah, OK, sounds like the new hash is actually faster to compute, in addition to resulting in fewer collisions. I hadn't grasped that previously.
Dana Jansens
Comment 32 2012-09-07 12:17:54 PDT
Created attachment 162842 [details] Patch using LL suffix for 64bit constant
WebKit Review Bot
Comment 33 2012-09-07 12:23:07 PDT
Attachment 162842 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/ChangeLog', u'Source/WTF/wtf/Ha..." exit_code: 1 Source/WTF/wtf/HashFunctions.h:195: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:196: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:197: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:198: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:199: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:200: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:201: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:202: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:203: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:204: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:205: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:206: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:207: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:208: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:209: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:210: More than one command on the same line [whitespace/newline] [4] Total errors found: 16 in 13 files If any of these errors are false positives, please file a bug against check-webkit-style.
Build Bot
Comment 34 2012-09-07 13:00:04 PDT
Dana Jansens
Comment 35 2012-09-07 13:05:33 PDT
Created attachment 162851 [details] Patch update using declaration in GraphicsContextCA to WTF::pairIntHash
WebKit Review Bot
Comment 36 2012-09-07 13:07:04 PDT
Attachment 162851 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/ChangeLog', u'Source/WTF/wtf/Ha..." exit_code: 1 Source/WTF/wtf/HashFunctions.h:195: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:196: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:197: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:198: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:199: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:200: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:201: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:202: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:203: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:204: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:205: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:206: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:207: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:208: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:209: More than one command on the same line [whitespace/newline] [4] Source/WTF/wtf/HashFunctions.h:210: More than one command on the same line [whitespace/newline] [4] Total errors found: 16 in 13 files If any of these errors are false positives, please file a bug against check-webkit-style.
Adrienne Walker
Comment 37 2012-09-14 13:22:08 PDT
Comment on attachment 162851 [details] Patch R=me. It's been a week since the last comment here and the email to webkit-dev, and it seems like nobody has any remaining objections. Let's land this.
Benjamin Poulain
Comment 38 2012-09-14 13:25:27 PDT
> R=me. It's been a week since the last comment here and the email to webkit-dev, and it seems like nobody has any remaining objections. Let's land this. I, for one, welcome our new hashing overlords :)
WebKit Review Bot
Comment 39 2012-09-14 13:59:59 PDT
Comment on attachment 162851 [details] Patch Clearing flags on attachment: 162851 Committed r128650: <http://trac.webkit.org/changeset/128650>
WebKit Review Bot
Comment 40 2012-09-14 14:00:05 PDT
All reviewed patches have been landed. Closing bug.
Alexey Proskuryakov
Comment 41 2012-09-14 14:57:34 PDT
Build fix committed <http://trac.webkit.org/r128657>.
Note You need to log in before you can comment on or make changes to this bug.