HashMap should support selecting a random entry
Created attachment 352930 [details] Patch
Created attachment 352937 [details] Patch
Comment on attachment 352937 [details]
Patch

Attachment 352937 [details] did not pass mac-ews (mac):
Output: https://webkit-queues.webkit.org/results/9700481

New failing tests:
http/wpt/css/css-animations/start-animation-001.html
Created attachment 352943 [details]
Archive of layout-test-results from ews101 for mac-sierra

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: ews101  Port: mac-sierra  Platform: Mac OS X 10.12.6
Comment on attachment 352937 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=352937&action=review

r=me

> Source/WTF/ChangeLog:8
> +        In some cases, remove(begin()) is not quite good enough as a random

We also need to audit all sites that are currently using this pattern.
Comment on attachment 352937 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=352937&action=review

> Source/WTF/wtf/RandomNumber.h:34
> +// Returns a cryptographically secure pseudo-random number in the range [0, 1).
>  WTF_EXPORT_PRIVATE double randomNumber();
>
> +// Returns a cheap pseudo-random number in the range (0, UINT_MAX].
> +WTF_EXPORT_PRIVATE unsigned weakRandomNumber();

It is a bit surprising that these similarly named functions return different types.
Comment on attachment 352937 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=352937&action=review

> Source/WTF/wtf/HashMap.h:272
> +inline auto HashMap<T, U, V, W, X>::random() -> iterator

You should also add this to HashSet/ListHashSet for completeness.
Random cache eviction is going to have interesting effects on performance testing, isn’t it?
> We also need to audit all sites that are currently using this pattern.

I'll do that in a separate patch, just in case it has surprising performance consequences.
>> +// Returns a cheap pseudo-random number in the range (0, UINT_MAX].
>> +WTF_EXPORT_PRIVATE unsigned weakRandomNumber();
>
> It is a bit surprising that these similarly named functions return different types.

Renamed to weakRandomUint32().
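For reference, a sketch of the declarations after the rename (an assumption that weakRandomUint32() stays next to randomNumber() as in the diff quoted above; the doc comments are taken from the reviewed patch, not from the final source):

    // Returns a cryptographically secure pseudo-random number in the range [0, 1).
    WTF_EXPORT_PRIVATE double randomNumber();

    // Returns a cheap pseudo-random number in the range (0, UINT_MAX].
    WTF_EXPORT_PRIVATE unsigned weakRandomUint32();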
> You should also add this to HashSet/ListHashSet for completeness.

Haha, there turned out to be soooo many!
Created attachment 353020 [details] Patch
Comment on attachment 353020 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=353020&action=review

> Source/WTF/wtf/RandomNumber.h:31
> +// Returns a cryptographically secure pseudo-random number in the range [0, 1).
>  WTF_EXPORT_PRIVATE double randomNumber();

Might be nice to rename this to randomDouble() too at some point.
Comment on attachment 353020 [details]
Patch

cq+
Comment on attachment 353020 [details]
Patch

Clearing flags on attachment: 353020

Committed r237419: <https://trac.webkit.org/changeset/237419>
All reviewed patches have been landed. Closing bug.
<rdar://problem/45558390>
For small tables I am concerned that the first entry is twice as likely as the others. I suggest looping and choosing a new random number rather than mapping end to begin. This has come up in many contexts before, and that has almost always been the best strategy.

I also suggest looping on other high numbers beyond what divides easily. Those also result in bias toward the beginning of the table.

I suppose we should construct a test case that proves the results are evenly distributed. Should not be too hard to write one.
> For small tables I am concerned that the first entry is twice as likely as the others.

I guess there's a potential bias issue in any table where an entry has a run of many empty or deleted buckets before it. This could happen at the beginning of the table or not.

Beginning-of-the-table example:

    1 2 x x x x x x

Bucket #1 maps to 2, and all other buckets map to 1. 1 has 7/8 probability.

Not-the-beginning-of-the-table example:

    1 x x x x x x 2

Bucket #0 maps to 1, and all other buckets map to 2. 2 has 7/8 probability.

That said, we expect these cases to be rare on average, because we expect a hash function to distribute 1 and 2 evenly in the table.

> I suggest looping and choosing a new random number rather than mapping end to begin. This has come up in many contexts before, and that has almost always been the best strategy.

A downside to looping is that it makes random() O(TableSize) when the table is sparse. But I guess begin() would have been O(TableSize) too.

> I also suggest looping on other high numbers beyond what divides easily. Those also result in bias toward the beginning of the table.

I didn't get this part. Can you give an example?

> I suppose we should construct a test case that proves the results are evenly distributed. Should not be too hard to write one.

I'll update Random_IsRandom. I was afraid of making a flaky test, but I think the binomial distribution guarantees-ish that 600/1000 will not be flaky.
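To make the skew concrete, here is a small self-contained simulation (illustrative only, not WebKit code; std::optional stands in for empty buckets) of the scan-forward-and-wrap strategy on the first example table:

    #include <cstdio>
    #include <optional>
    #include <random>
    #include <vector>

    int main()
    {
        // Model the "1 2 x x x x x x" table: buckets 0 and 1 hold entries 1 and 2.
        std::vector<std::optional<int>> table(8);
        table[0] = 1;
        table[1] = 2;

        std::mt19937 rng { std::random_device { }() };
        std::uniform_int_distribution<size_t> pick(0, table.size() - 1);

        int counts[3] = { };
        for (int i = 0; i < 100000; ++i) {
            size_t bucket = pick(rng);
            while (!table[bucket]) // Scan forward on a miss; end wraps to begin.
                bucket = (bucket + 1) % table.size();
            ++counts[*table[bucket]];
        }

        // Prints roughly 87500 vs. 12500: entry 1 wins ~7/8 of the time.
        printf("entry 1: %d, entry 2: %d\n", counts[1], counts[2]);
    }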
Reopening to attach new patch.
Created attachment 353159 [details] Patch
Do you have any insight with regard to comment 8 (i.e. effects on performance testing)?

I guess we can wait and see, but randomness in cache eviction seems likely to add noise to test results.
> Do you have any insight with regard to comment 8 (i.e. effects on performance testing)?
>
> I guess we can wait and see, but randomness in cache eviction seems likely to add noise to test results.

There are no known effects. Random eviction is a pre-existing behavior.
Comment on attachment 353159 [details]
Patch

cq+
Comment on attachment 353159 [details]
Patch

Clearing flags on attachment: 353159

Committed r237476: <https://trac.webkit.org/changeset/237476>
> There are no known effects.

What are the expected effects?

> Random eviction is a pre-existing behavior.

Can you elaborate? This bug literally introduces the ability to have random eviction.
(In reply to Geoffrey Garen from comment #19)
>> I also suggest looping on other high numbers beyond what divides easily. Those also result in bias toward the beginning of the table.
>
> I didn't get this part. Can you give an example?

I was confused about size vs. capacity. It is confusing that HashTable::capacity is backed by HashTable::m_tableSize and HashTable::size is backed by HashTable::m_keyCount. Since m_tableSize is always a power of two, there is "number that does not divide easily". And the thing you did about looping when we encounter an empty or deleted bucket is essentially what I was looking for here.

>> I suppose we should construct a test case that proves the results are evenly distributed. Should not be too hard to write one.
>
> I'll update Random_IsRandom. I was afraid of making a flaky test, but I think the binomial distribution guarantees-ish that 600/1000 will not be flaky.

That's great. The patch you landed was exactly what I was looking for.

For testing, I think we should go at least a little bit further. In particular, I'd like to consider adding at least one test that would definitely have failed with the old algorithm; not sure the existing "hash table of size 2" one would have failed. Maybe repeating a similar test for at least one hash table size other than 2 will do the trick. Or some sequence of operations that intentionally makes hash tables with different "densities". And with no deletion, we are definitely not testing correct handling of deleted buckets.
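For reference, the two accessors being conflated look roughly like this (a paraphrase of the HashTable members named above, not the exact source):

    unsigned size() const { return m_keyCount; }      // Number of live entries.
    unsigned capacity() const { return m_tableSize; } // Number of buckets; always a power of two.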
(In reply to Darin Adler from comment #28)
> there is "number that does not divide easily"

there is *no* "number that does not divide easily [sic]" (meant evenly)
(In reply to Geoffrey Garen from comment #19)
> I'll update Random_IsRandom.

This test is misnamed. It should be named something more like Random_IsEvenlyDistributed. It doesn't check for randomness. I believe this checks for one of the desirable properties of pseudorandom numbers that truly random numbers won't necessarily have. Although I suppose that perhaps truly random numbers also have it to some high probability.
Comment on attachment 353159 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=353159&action=review

> Source/WTF/wtf/HashTable.h:392
> +            ValueType* entry = m_table + (weakRandomUint32() & m_tableSizeMask);
> +            if (isEmptyBucket(*entry) || isDeletedBucket(*entry))
> +                continue;
> +            return makeKnownGoodIterator(entry);

I'm in a style nitpick mood, so I wanted to share how I would have written this block:

    auto& bucket = m_table[weakRandomUint32() & m_tableSizeMask];
    if (isEmptyOrDeletedBucket(bucket))
        continue;
    return makeKnownGoodIterator(&bucket);
Comment on attachment 353159 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=353159&action=review

>> Source/WTF/wtf/HashTable.h:392
>> +            return makeKnownGoodIterator(entry);
>
> I'm in a style nitpick mood, so I wanted to share how I would have written this block:
>
>     auto& bucket = m_table[weakRandomUint32() & m_tableSizeMask];
>     if (isEmptyOrDeletedBucket(bucket))
>         continue;
>     return makeKnownGoodIterator(&bucket);

No, that's not quite right. I would have written this:

    auto& bucket = m_table[weakRandomUint32() & m_tableSizeMask];
    if (!isEmptyOrDeletedBucket(bucket))
        return makeKnownGoodIterator(&bucket);
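Putting that in context, the full method presumably reads something like this (a sketch: only the loop body is quoted from the patch; the loop header and the empty-table guard are assumptions about the surrounding code):

    inline auto random() -> iterator
    {
        if (isEmpty())
            return end(); // Assumed guard; without it the loop below would never terminate.

        while (true) {
            // Rejection sampling: re-roll on empty or deleted buckets instead of
            // scanning forward, so every live entry is equally likely.
            auto& bucket = m_table[weakRandomUint32() & m_tableSizeMask];
            if (!isEmptyOrDeletedBucket(bucket))
                return makeKnownGoodIterator(&bucket);
        }
    }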
>> There are no known effects.
>
> What are the expected effects?

That's a broad question. I don't know how to give it any specific answer.

>> Random eviction is a pre-existing behavior.
>
> Can you elaborate? This bug literally introduces the ability to have random eviction.

Our existing idiom for random eviction is remove(begin()). For most tables, most of the time, remove(begin()) is mostly random.
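Concretely, the idiom change looks like this (the cache name is hypothetical, not a specific WebKit call site):

    // Pre-existing idiom: evict whichever entry begin() happens to point at.
    // Because hashing scatters keys across the table, this is mostly random.
    cache.remove(cache.begin());

    // New idiom from this patch: evict an explicitly random entry.
    cache.remove(cache.random());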
> For testing, I think we should go at least a little bit further. In particular, I'd like to consider adding at least one test that would definitely have failed with the old algorithm; not sure the existing "hash table of size 2" one would have failed.

Random_IsRandom [sic] did fail with the old implementation of random().

> Maybe repeating a similar test for at least one hash table size other than 2 will do the trick.

I'm worried that larger tables will require lots more iterations to guarantee that flukes don't cause test failures. I consulted an online calculator for the binomial distribution to prove that a coin flip will never-ish land heads more than 600/1000 times. I'm not quite sure how to calculate the same result across a range of numbers when you're flipping a 100-sided or 1000-sided coin.

> Or some sequence of operations that intentionally makes hash tables with different "densities". And with no deletion, we are definitely not testing correct handling of deleted buckets.

I'll add some tests for remove() at different densities.
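For what it's worth, a sketch of the kind of check being described, in the style of Tools/TestWebKitAPI/Tests/WTF/HashMap.cpp (the name follows Darin's suggestion in comment 30; the landed tests may differ in shape and thresholds):

    TEST(WTF_HashMap, RandomIsEvenlyDistributed)
    {
        HashMap<unsigned, unsigned> map;
        map.add(1, 1);
        map.add(2, 2);

        unsigned ones = 0;
        unsigned twos = 0;
        for (unsigned i = 0; i < 1000; ++i) {
            auto it = map.random();
            if (it->key == 1)
                ++ones;
            else
                ++twos;
        }

        // Per the binomial argument above, a fair pick from a 2-entry table
        // should essentially never return the same entry more than 600/1000 times.
        EXPECT_LE(ones, 600u);
        EXPECT_LE(twos, 600u);
    }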
Created attachment 353251 [details] Patch
Attachment 353251 [details] did not pass style-queue:

ERROR: Tools/TestWebKitAPI/Tests/WTF/HashMap.cpp:1042: Multi line control clauses should use braces. [whitespace/braces] [4]
Total errors found: 1 in 4 files

If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 353251 [details]
Patch

Clearing flags on attachment: 353251

Committed r237522: <https://trac.webkit.org/changeset/237522>
> Our existing idiom for random eviction is remove(begin()). For most tables, most of the time, remove(begin()) is mostly random.

My question was and remains about performance testing. Are you claiming that it's already "mostly random" during page load performance testing? How so?

Anyway, at the end of the day, you are the person with the highest interest in the stability of page load speed and memory use testing, so if this patch makes testing more challenging, you'll discover and fix that eventually.
>> Our existing idiom for random eviction is remove(begin()). For most tables, most of the time, remove(begin()) is mostly random.
>
> My question was and remains about performance testing. Are you claiming that it's already "mostly random" during page load performance testing?

Yes.

> How so?

Hashing algorithms are mostly random. Therefore, the entry with the lowest index at a given table size is mostly random. Therefore, remove(begin()) is mostly random.

It would be easier to have this discussion about a specific use case for a specific cache. The general theory of random eviction is that, though individual eviction decisions are random, overall observed behavior is stable, for the same reason that chaotic electrical impulses in the heart can produce a stable sinus rhythm.
Sounds like you do not have insight into why the general theory would help when running performance tests. When running tests, we do the same things in the same order, so I don't expect remove(begin()) to be "mostly random". As I said, it is OK to look into this when/if test result stability impacts your productivity.

> It would be easier to have this discussion about a specific use case for a specific cache.

That would be every case where the new functionality is used. There was no discussion in bug 190957, where it was applied in several places.