Bug 227150 - [webkitcorepy] Add NestedFuzzyDict
Summary: [webkitcorepy] Add NestedFuzzyDict
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Jonathan Bedard
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2021-06-17 16:03 PDT by Jonathan Bedard
Modified: 2021-06-29 12:27 PDT (History)
3 users (show)

See Also:


Attachments
Patch (11.58 KB, patch)
2021-06-17 16:20 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (11.55 KB, patch)
2021-06-17 16:27 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (11.33 KB, patch)
2021-06-23 08:33 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (11.32 KB, patch)
2021-06-28 14:47 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (12.42 KB, patch)
2021-06-29 11:46 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jonathan Bedard 2021-06-17 16:03:10 PDT
Hashes have some unique properties when it comes to dictionaries. A direct look-up is obviously fastest, but, and this is particularly relevant for git hashes, you don't actually need the entirety of the string to get a low collision rate. With as few as 4 characters, most hashes in the WebKit project won't collide. But the birthday paradox ensures that at least a few hashes will collide, even with as many as 8 or 9 characters. To address this problem, we should have a nested dictionary structure that allows partial look-up keys (as long as those keys meet some minimum length) while allowing a fuzzy look-up of the the remaining 36 characters of a SHA-1 hash.

This technique is generally applicable to strings, so webkitcorepy should own this object.
Comment 1 Radar WebKit Bug Importer 2021-06-17 16:06:51 PDT
<rdar://problem/79475464>
Comment 2 Jonathan Bedard 2021-06-17 16:20:00 PDT
Created attachment 431735 [details]
Patch
Comment 3 Jonathan Bedard 2021-06-17 16:27:25 PDT
Created attachment 431737 [details]
Patch
Comment 4 Jonathan Bedard 2021-06-23 08:33:03 PDT
Created attachment 432048 [details]
Patch
Comment 5 Jonathan Bedard 2021-06-28 14:47:32 PDT
Created attachment 432425 [details]
Patch
Comment 6 dewei_zhu 2021-06-28 15:56:33 PDT
Comment on attachment 432425 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=432425&action=review

> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:26
> +class NestedFuzzyDict(object):

Do we expect to support delete in this function?

> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:33
> +        self.primary_size = int(primary_size or 4)

Do we expect it raise an exception when primary_size is something int() will raise an exception e.g.'invalid' ?
Also, why the primary size is set to 4?
Would it be considered 'over engineering' if we implement this using Trie?

> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:47
> +        self.assert_valid_key(keyname)
> +        key_a, key_b = keyname[:self.primary_size], keyname[self.primary_size:]
> +        value = None
> +        found = False
> +        for key, result in self._data.get(key_a, dict()).items():
> +            if key.startswith(key_b):
> +                if found:
> +                    raise KeyError("Multiple values match '{}'".format(keyname))
> +                found = True
> +                value = result

Can we invoke self.getitem(keyname, None) instead?
Comment 7 Jonathan Bedard 2021-06-28 16:30:56 PDT
Comment on attachment 432425 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=432425&action=review

>> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:26
>> +class NestedFuzzyDict(object):
> 
> Do we expect to support delete in this function?

I don't have a need for it currently, although I could add it relatively easily. I didn't want to completely recreate the API for dict, this isn't something I expect we'll use much.

>> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:33
>> +        self.primary_size = int(primary_size or 4)
> 
> Do we expect it raise an exception when primary_size is something int() will raise an exception e.g.'invalid' ?
> Also, why the primary size is set to 4?
> Would it be considered 'over engineering' if we implement this using Trie?

I think the problem with a Trie is that we really only want the single nested dict. The point of nesting is so that we can narrow down the set of potentially matching hashes sufficiently so we can do the startswith check on just a handful of hashes. Its very similar to the CSS 101 idea of a hash table managing collisions with a linked list

>> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:47
>> +                value = result
> 
> Can we invoke self.getitem(keyname, None) instead?

We cannot, and it's this little bit that makes it fuzzy.

This entire structure is optimized for hashes. The first 4 characters of a hash are mostly sufficient to avoid collisions in the webkit repository, but there are a significant minority of 4 character sequences which do point to multiple hashes. However, it's very unlikely that those 4 character sequences point to more than 2-3 hashes, which is what this function is counting on. Basically, this lets us have the fast look up of a hash table, while also allowing the user to specify partial hashes (hopefully 8-12 characters, but we do our best regardless)
Comment 8 dewei_zhu 2021-06-28 16:58:04 PDT
Comment on attachment 432425 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=432425&action=review

>>> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:47
>>> +                value = result
>> 
>> Can we invoke self.getitem(keyname, None) instead?
> 
> We cannot, and it's this little bit that makes it fuzzy.
> 
> This entire structure is optimized for hashes. The first 4 characters of a hash are mostly sufficient to avoid collisions in the webkit repository, but there are a significant minority of 4 character sequences which do point to multiple hashes. However, it's very unlikely that those 4 character sequences point to more than 2-3 hashes, which is what this function is counting on. Basically, this lets us have the fast look up of a hash table, while also allowing the user to specify partial hashes (hopefully 8-12 characters, but we do our best regardless)

I mean, what if we do
```
def __getitem__(self, keyname):
    found, value = self.getitem(keyname)
    if found:
        return value
    raise KeyError(keyname)
```
Comment 9 Jonathan Bedard 2021-06-29 11:46:21 PDT
Created attachment 432505 [details]
Patch
Comment 10 dewei_zhu 2021-06-29 11:49:37 PDT
Comment on attachment 432505 [details]
Patch

r=me
Comment 11 EWS 2021-06-29 12:27:20 PDT
Committed r279381 (239246@main): <https://commits.webkit.org/239246@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 432505 [details].