Bug 190814 - HashMap should support selecting a random entry
Summary: HashMap should support selecting a random entry
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Geoffrey Garen
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2018-10-22 16:53 PDT by Geoffrey Garen
Modified: 2018-11-02 10:11 PDT (History)
13 users (show)

See Also:


Attachments
Patch (7.92 KB, patch)
2018-10-22 17:17 PDT, Geoffrey Garen
no flags Details | Formatted Diff | Diff
Patch (7.92 KB, patch)
2018-10-22 17:41 PDT, Geoffrey Garen
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews101 for mac-sierra (2.50 MB, application/zip)
2018-10-22 18:57 PDT, EWS Watchlist
no flags Details
Patch (10.35 KB, patch)
2018-10-23 21:38 PDT, Geoffrey Garen
no flags Details | Formatted Diff | Diff
Patch (3.05 KB, patch)
2018-10-25 22:31 PDT, Geoffrey Garen
no flags Details | Formatted Diff | Diff
Patch (4.21 KB, patch)
2018-10-27 19:00 PDT, Geoffrey Garen
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Geoffrey Garen 2018-10-22 16:53:41 PDT
HashMap should support selecting a random entry
Comment 1 Geoffrey Garen 2018-10-22 17:17:46 PDT
Created attachment 352930 [details]
Patch
Comment 2 Geoffrey Garen 2018-10-22 17:41:52 PDT
Created attachment 352937 [details]
Patch
Comment 3 EWS Watchlist 2018-10-22 18:57:57 PDT
Comment on attachment 352937 [details]
Patch

Attachment 352937 [details] did not pass mac-ews (mac):
Output: https://webkit-queues.webkit.org/results/9700481

New failing tests:
http/wpt/css/css-animations/start-animation-001.html
Comment 4 EWS Watchlist 2018-10-22 18:57:59 PDT
Created attachment 352943 [details]
Archive of layout-test-results from ews101 for mac-sierra

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: ews101  Port: mac-sierra  Platform: Mac OS X 10.12.6
Comment 5 Antti Koivisto 2018-10-22 22:53:00 PDT
Comment on attachment 352937 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=352937&action=review

r=me

> Source/WTF/ChangeLog:8
> +        In some cases, remove(begin()) is not quite good enough as a random

We also need to audit all sites that are currently using this pattern.
Comment 6 Antti Koivisto 2018-10-22 22:58:25 PDT
Comment on attachment 352937 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=352937&action=review

> Source/WTF/wtf/RandomNumber.h:34
> +// Returns a cryptographically secure pseudo-random number in the range [0, 1).
>  WTF_EXPORT_PRIVATE double randomNumber();
>  
> +// Returns a cheap pseudo-random number in the range (0, UINT_MAX].
> +WTF_EXPORT_PRIVATE unsigned weakRandomNumber();

It is bit surprising that these similarly named functions return different types.
Comment 7 Antti Koivisto 2018-10-23 00:05:37 PDT
Comment on attachment 352937 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=352937&action=review

> Source/WTF/wtf/HashMap.h:272
> +inline auto HashMap<T, U, V, W, X>::random() -> iterator

You should also add this to to HashSet/ListHashSet for completeness.
Comment 8 Alexey Proskuryakov 2018-10-23 00:26:53 PDT
Random cache eviction is going to have interesting effects on performance testing, isn’t it?
Comment 9 Geoffrey Garen 2018-10-23 21:33:51 PDT
> We also need to audit all sites that are currently using this pattern.

I'll do that in a separate patch, just in case it has surprising performance consequences.
Comment 10 Geoffrey Garen 2018-10-23 21:34:08 PDT
> > +// Returns a cheap pseudo-random number in the range (0, UINT_MAX].
> > +WTF_EXPORT_PRIVATE unsigned weakRandomNumber();
> 
> It is bit surprising that these similarly named functions return different
> types.

Renamed to weakRandomUint32().
Comment 11 Geoffrey Garen 2018-10-23 21:34:22 PDT
> You should also add this to to HashSet/ListHashSet for completeness.

Haha, there turned out to be soooo many!
Comment 12 Geoffrey Garen 2018-10-23 21:38:41 PDT
Created attachment 353020 [details]
Patch
Comment 13 Antti Koivisto 2018-10-24 00:42:14 PDT
Comment on attachment 353020 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=353020&action=review

> Source/WTF/wtf/RandomNumber.h:31
> +// Returns a cryptographically secure pseudo-random number in the range [0, 1).
>  WTF_EXPORT_PRIVATE double randomNumber();

Might be nice to rename this to randomDouble() too at some point.
Comment 14 Geoffrey Garen 2018-10-25 10:11:06 PDT
Comment on attachment 353020 [details]
Patch

cq+
Comment 15 WebKit Commit Bot 2018-10-25 10:41:44 PDT
Comment on attachment 353020 [details]
Patch

Clearing flags on attachment: 353020

Committed r237419: <https://trac.webkit.org/changeset/237419>
Comment 16 WebKit Commit Bot 2018-10-25 10:41:46 PDT
All reviewed patches have been landed.  Closing bug.
Comment 17 Radar WebKit Bug Importer 2018-10-25 10:42:33 PDT
<rdar://problem/45558390>
Comment 18 Darin Adler 2018-10-25 11:19:17 PDT
For small tables I am concerned that the first entry is twice as likely as the others. I suggest looping and choosing a new random number rather than mapping end to begin. This has come up in many contexts before and that has almost always been the best strategy. I also suggest looping on other high numbers beyond what divides easily. Those also result in bias toward the beginning of the table.

I suppose we should construct a test case that proves the results are evenly distributed. Should not be too hard to write one.
Comment 19 Geoffrey Garen 2018-10-25 22:23:27 PDT
> For small tables I am concerned that the first entry is twice as likely as
> the others. 

I guess there's a potential bias issue in any table where an entry has a run of many empty or deleted buckets before it. This could happen at the beginning of the table or not.

Beginning of the table example:

1 2 x x x x x x

Bucket #1 maps to 2, and all other buckets map to 1. 1 has 7/8 probability. 

Not the beginning of the table example:

1 x x x x x x 2

Bucket #0 maps to 1, and all other buckets map to 2. 2 has 7/8 probability.

That said, we expect these cases to be rare on average because we expect a hash function to distribute 1 and 2 evenly in the table.

> I suggest looping and choosing a new random number rather than
> mapping end to begin. This has come up in many contexts before and that has
> almost always been the best strategy.

A downside to looping is that it makes random() O(TableSize) when the table is sparse. But I guess begin() would have been O(TableSize) too.

> I also suggest looping on other high
> numbers beyond what divides easily. Those also result in bias toward the
> beginning of the table.

I didn't get this part. Can you give an example?

> I suppose we should construct a test case that proves the results are evenly distributed. Should not be too hard to write one.

I'll update Random_IsRandom. I was afraid of making a flaky test, but I think the binomial distribution guarantees-ish that 600/1000 will not be flaky.
Comment 20 Geoffrey Garen 2018-10-25 22:31:17 PDT
Reopening to attach new patch.
Comment 21 Geoffrey Garen 2018-10-25 22:31:18 PDT
Created attachment 353159 [details]
Patch
Comment 22 Alexey Proskuryakov 2018-10-26 09:19:37 PDT
Do you have any insight with regards to comment 8 (i.e. effects on performance testing)?

I guess we can wait and see, but randomness in cache eviction seems unlikely to be good for noise in test results.
Comment 23 Geoffrey Garen 2018-10-26 11:26:21 PDT
> Do you have any insight with regards to comment 8 (i.e. effects on
> performance testing)?
> 
> I guess we can wait and see, but randomness in cache eviction seems unlikely
> to be good for noise in test results.

There are no known effects.

Random eviction is a pre-existing behavior.
Comment 24 Geoffrey Garen 2018-10-26 11:33:38 PDT
Comment on attachment 353159 [details]
Patch

cq+
Comment 25 WebKit Commit Bot 2018-10-26 11:58:59 PDT
Comment on attachment 353159 [details]
Patch

Clearing flags on attachment: 353159

Committed r237476: <https://trac.webkit.org/changeset/237476>
Comment 26 WebKit Commit Bot 2018-10-26 11:59:01 PDT
All reviewed patches have been landed.  Closing bug.
Comment 27 Alexey Proskuryakov 2018-10-26 15:25:38 PDT
> There are no known effects.

What are the expected effects?

> Random eviction is a pre-existing behavior.

Can you elaborate? This bug literally introduces the ability to have random eviction.
Comment 28 Darin Adler 2018-10-26 19:52:24 PDT
(In reply to Geoffrey Garen from comment #19)
> > I also suggest looping on other high
> > numbers beyond what divides easily. Those also result in bias toward the
> > beginning of the table.
> 
> I didn't get this part. Can you give an example?

I was confused about size vs. capacity. Confusing that HashTable::capacity is backed by HashTable::m_tableSize and HashTable::size is backed by HashTable::m_keyCount.

Since m_tableSize is always a power of two then there is "number that does not divide easily". And the thing you did about looping when we encounter an empty or deleted bucket is essentially what I was looking for here.

> > I suppose we should construct a test case that proves the results are evenly distributed. Should not be too hard to write one.
> 
> I'll update Random_IsRandom. I was afraid of making a flaky test, but I
> think the binomial distribution guarantees-ish that 600/1000 will not be
> flaky.

That’s great. The patch you landed was exactly what I was looking for.

For testing, I think we should go at least a little bit further. In particular, I’d like to consider adding at least one test that would definitely have failed with the old algorithm; not sure the existing "hash table of size 2" one would have failed. Maybe repeating a similar test for at least one hash table size other than 2 will do the trick. Or some sequence of operations that intentionally makes hash tables with different "densities". And with no deletion, we are definitely not testing correct handling of deleted buckets.
Comment 29 Darin Adler 2018-10-26 19:54:36 PDT
(In reply to Darin Adler from comment #28)
> there is "number that does not divide easily"

there is *no* "number that does not divide easily [sic]" (meant evenly)
Comment 30 Darin Adler 2018-10-26 19:56:46 PDT
(In reply to Geoffrey Garen from comment #19)
> I'll update Random_IsRandom.

This test is misnamed. Is should be named something more like Random_IsEvenlyDistributed. It doesn’t check for randomness.

I believe this checks for one of the desirable properties of pseudorandom numbers that truly random numbers won’t necessarily have. Although I suppose that perhaps truly random numbers also have it to some high probability.
Comment 31 Darin Adler 2018-10-26 20:14:49 PDT
Comment on attachment 353159 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=353159&action=review

> Source/WTF/wtf/HashTable.h:392
> +                ValueType* entry = m_table + (weakRandomUint32() & m_tableSizeMask);
> +                if (isEmptyBucket(*entry) || isDeletedBucket(*entry))
> +                    continue;
> +                return makeKnownGoodIterator(entry);

I’m in a style nitpick mood so I wanted to share how I would have written this block:

    auto& bucket = m_table[weakRandomUint32() & m_tableSizeMask];
    if (isEmptyOrDeletedBucket(bucket))
        continue;
    return makeKnownGoodIterator(&bucket);
Comment 32 Darin Adler 2018-10-26 20:15:40 PDT
Comment on attachment 353159 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=353159&action=review

>> Source/WTF/wtf/HashTable.h:392
>> +                return makeKnownGoodIterator(entry);
> 
> I’m in a style nitpick mood so I wanted to share how I would have written this block:
> 
>     auto& bucket = m_table[weakRandomUint32() & m_tableSizeMask];
>     if (isEmptyOrDeletedBucket(bucket))
>         continue;
>     return makeKnownGoodIterator(&bucket);

No, that’s not quite right. I would have written this:

    auto& bucket = m_table[weakRandomUint32() & m_tableSizeMask];
    if (!isEmptyOrDeletedBucket(bucket))
        return makeKnownGoodIterator(&bucket);
Comment 33 Geoffrey Garen 2018-10-27 17:29:08 PDT
> > There are no known effects.
> 
> What are the expected effects?

That's a broad question. I don't know how to give it any specific answer.

> > Random eviction is a pre-existing behavior.
> 
> Can you elaborate? This bug literally introduces the ability to have random
> eviction.

Our existing idiom for random eviction is remove(begin()). For most tables most of the time, remove(begin()) is mostly random.
Comment 34 Geoffrey Garen 2018-10-27 18:01:12 PDT
> For testing, I think we should go at least a little bit further. In
> particular, I’d like to consider adding at least one test that would
> definitely have failed with the old algorithm; not sure the existing "hash
> table of size 2" one would have failed. 

Random_IsRandom [sic] did fail with the old implementation of random().

> Maybe repeating a similar test for
> at least one hash table size other than 2 will do the trick.

I'm worried that larger tables will require lots more iterations to guarantee that flukes don't cause test failures. I consulted an online calculator for the binomial distribution to prove that a coin flip will never-ish land heads more than 600/1000 times. I'm not quite sure how to calculate the same result across a range of numbers when you're flipping a 100-sided or 1000-sided coin.

> Or some
> sequence of operations that intentionally makes hash tables with different
> "densities". And with no deletion, we are definitely not testing correct
> handling of deleted buckets.

I'll add some tests for remove() at different densities.
Comment 35 Geoffrey Garen 2018-10-27 19:00:18 PDT
Reopening to attach new patch.
Comment 36 Geoffrey Garen 2018-10-27 19:00:19 PDT
Created attachment 353251 [details]
Patch
Comment 37 EWS Watchlist 2018-10-27 19:03:55 PDT
Attachment 353251 [details] did not pass style-queue:


ERROR: Tools/TestWebKitAPI/Tests/WTF/HashMap.cpp:1042:  Multi line control clauses should use braces.  [whitespace/braces] [4]
Total errors found: 1 in 4 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 38 WebKit Commit Bot 2018-10-28 11:19:05 PDT
Comment on attachment 353251 [details]
Patch

Clearing flags on attachment: 353251

Committed r237522: <https://trac.webkit.org/changeset/237522>
Comment 39 WebKit Commit Bot 2018-10-28 11:19:07 PDT
All reviewed patches have been landed.  Closing bug.
Comment 40 Alexey Proskuryakov 2018-10-29 09:11:24 PDT
> Our existing idiom for random eviction is remove(begin()). For most tables most of the time, remove(begin()) is mostly random.

My question was and remains about performance testing. Are you claiming that it's already "mostly random" during page load performance testing? How so?

Anyway, at the end of the day, you are the person with the highest interest in stability of page load speed and memory use testing, so if this patch makes testing more challenging, you'll discover and fix that eventually.
Comment 41 Geoffrey Garen 2018-11-01 13:07:30 PDT
> > Our existing idiom for random eviction is remove(begin()). For most tables most of the time, remove(begin()) is mostly random.
> 
> My question was and remains about performance testing. Are you claiming that
> it's already "mostly random" during page load performance testing? 

Yes.

> How so?

Hashing algorithms are mostly random. Therefore, the entry with the lowest index at a given table size is mostly random. Therefore, remove(begin()) is mostly random.

It would be easier to have this discussion about a specific use case for a specific cache. The general theory of random eviction is that, though individual eviction decisions are random, overall observed behavior is stable, for the same reason that chaotic electrical impulses in the heart can produce a stable sinus rhythm.
Comment 42 Alexey Proskuryakov 2018-11-02 10:11:27 PDT
Sounds like you do not have insight into why the general theory would help when running performance tests. When running tests, we do the same things in the same order, so I don't expect remove(begin()) to be "mostly random".

As I said, it is OK to look into this when/if test result stability impacts your productivity.

> It would be easier to have this discussion about a specific use case for a specific cache.

That would be every case where the new functionality is used. There was no discussion in bug 190957, where it was applied in several places.