WebKit Bugzilla
New
Browse
Search+
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
227150
[webkitcorepy] Add NestedFuzzyDict
https://bugs.webkit.org/show_bug.cgi?id=227150
Summary
[webkitcorepy] Add NestedFuzzyDict
Jonathan Bedard
Reported
2021-06-17 16:03:10 PDT
Hashes have some unique properties when it comes to dictionaries. A direct look-up is obviously fastest, but, and this is particularly relevant for git hashes, you don't actually need the entirety of the string to get a low collision rate. With as few as 4 characters, most hashes in the WebKit project won't collide. But the birthday paradox ensures that at least a few hashes will collide, even with as many as 8 or 9 characters. To address this problem, we should have a nested dictionary structure that allows partial look-up keys (as long as those keys meet some minimum length) while allowing a fuzzy look-up of the the remaining 36 characters of a SHA-1 hash. This technique is generally applicable to strings, so webkitcorepy should own this object.
Attachments
Patch
(11.58 KB, patch)
2021-06-17 16:20 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(11.55 KB, patch)
2021-06-17 16:27 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(11.33 KB, patch)
2021-06-23 08:33 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(11.32 KB, patch)
2021-06-28 14:47 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(12.42 KB, patch)
2021-06-29 11:46 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Show Obsolete
(4)
View All
Add attachment
proposed patch, testcase, etc.
Radar WebKit Bug Importer
Comment 1
2021-06-17 16:06:51 PDT
<
rdar://problem/79475464
>
Jonathan Bedard
Comment 2
2021-06-17 16:20:00 PDT
Created
attachment 431735
[details]
Patch
Jonathan Bedard
Comment 3
2021-06-17 16:27:25 PDT
Created
attachment 431737
[details]
Patch
Jonathan Bedard
Comment 4
2021-06-23 08:33:03 PDT
Created
attachment 432048
[details]
Patch
Jonathan Bedard
Comment 5
2021-06-28 14:47:32 PDT
Created
attachment 432425
[details]
Patch
dewei_zhu
Comment 6
2021-06-28 15:56:33 PDT
Comment on
attachment 432425
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=432425&action=review
> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:26 > +class NestedFuzzyDict(object):
Do we expect to support delete in this function?
> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:33 > + self.primary_size = int(primary_size or 4)
Do we expect it raise an exception when primary_size is something int() will raise an exception e.g.'invalid' ? Also, why the primary size is set to 4? Would it be considered 'over engineering' if we implement this using Trie?
> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:47 > + self.assert_valid_key(keyname) > + key_a, key_b = keyname[:self.primary_size], keyname[self.primary_size:] > + value = None > + found = False > + for key, result in self._data.get(key_a, dict()).items(): > + if key.startswith(key_b): > + if found: > + raise KeyError("Multiple values match '{}'".format(keyname)) > + found = True > + value = result
Can we invoke self.getitem(keyname, None) instead?
Jonathan Bedard
Comment 7
2021-06-28 16:30:56 PDT
Comment on
attachment 432425
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=432425&action=review
>> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:26 >> +class NestedFuzzyDict(object): > > Do we expect to support delete in this function?
I don't have a need for it currently, although I could add it relatively easily. I didn't want to completely recreate the API for dict, this isn't something I expect we'll use much.
>> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:33 >> + self.primary_size = int(primary_size or 4) > > Do we expect it raise an exception when primary_size is something int() will raise an exception e.g.'invalid' ? > Also, why the primary size is set to 4? > Would it be considered 'over engineering' if we implement this using Trie?
I think the problem with a Trie is that we really only want the single nested dict. The point of nesting is so that we can narrow down the set of potentially matching hashes sufficiently so we can do the startswith check on just a handful of hashes. Its very similar to the CSS 101 idea of a hash table managing collisions with a linked list
>> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:47 >> + value = result > > Can we invoke self.getitem(keyname, None) instead?
We cannot, and it's this little bit that makes it fuzzy. This entire structure is optimized for hashes. The first 4 characters of a hash are mostly sufficient to avoid collisions in the webkit repository, but there are a significant minority of 4 character sequences which do point to multiple hashes. However, it's very unlikely that those 4 character sequences point to more than 2-3 hashes, which is what this function is counting on. Basically, this lets us have the fast look up of a hash table, while also allowing the user to specify partial hashes (hopefully 8-12 characters, but we do our best regardless)
dewei_zhu
Comment 8
2021-06-28 16:58:04 PDT
Comment on
attachment 432425
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=432425&action=review
>>> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:47 >>> + value = result >> >> Can we invoke self.getitem(keyname, None) instead? > > We cannot, and it's this little bit that makes it fuzzy. > > This entire structure is optimized for hashes. The first 4 characters of a hash are mostly sufficient to avoid collisions in the webkit repository, but there are a significant minority of 4 character sequences which do point to multiple hashes. However, it's very unlikely that those 4 character sequences point to more than 2-3 hashes, which is what this function is counting on. Basically, this lets us have the fast look up of a hash table, while also allowing the user to specify partial hashes (hopefully 8-12 characters, but we do our best regardless)
I mean, what if we do ``` def __getitem__(self, keyname): found, value = self.getitem(keyname) if found: return value raise KeyError(keyname) ```
Jonathan Bedard
Comment 9
2021-06-29 11:46:21 PDT
Created
attachment 432505
[details]
Patch
dewei_zhu
Comment 10
2021-06-29 11:49:37 PDT
Comment on
attachment 432505
[details]
Patch r=me
EWS
Comment 11
2021-06-29 12:27:20 PDT
Committed
r279381
(
239246@main
): <
https://commits.webkit.org/239246@main
> All reviewed patches have been landed. Closing bug and clearing flags on
attachment 432505
[details]
.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug