Hashes have some unique properties when it comes to dictionaries. A direct look-up is obviously fastest, but, and this is particularly relevant for git hashes, you don't actually need the entirety of the string to get a low collision rate. With as few as 4 characters, most hashes in the WebKit project won't collide. But the birthday paradox ensures that at least a few hashes will collide, even with as many as 8 or 9 characters. To address this problem, we should have a nested dictionary structure that allows partial look-up keys (as long as those keys meet some minimum length) while allowing a fuzzy look-up of the the remaining 36 characters of a SHA-1 hash. This technique is generally applicable to strings, so webkitcorepy should own this object.
<rdar://problem/79475464>
Created attachment 431735 [details] Patch
Created attachment 431737 [details] Patch
Created attachment 432048 [details] Patch
Created attachment 432425 [details] Patch
Comment on attachment 432425 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=432425&action=review > Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:26 > +class NestedFuzzyDict(object): Do we expect to support delete in this function? > Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:33 > + self.primary_size = int(primary_size or 4) Do we expect it raise an exception when primary_size is something int() will raise an exception e.g.'invalid' ? Also, why the primary size is set to 4? Would it be considered 'over engineering' if we implement this using Trie? > Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:47 > + self.assert_valid_key(keyname) > + key_a, key_b = keyname[:self.primary_size], keyname[self.primary_size:] > + value = None > + found = False > + for key, result in self._data.get(key_a, dict()).items(): > + if key.startswith(key_b): > + if found: > + raise KeyError("Multiple values match '{}'".format(keyname)) > + found = True > + value = result Can we invoke self.getitem(keyname, None) instead?
Comment on attachment 432425 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=432425&action=review >> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:26 >> +class NestedFuzzyDict(object): > > Do we expect to support delete in this function? I don't have a need for it currently, although I could add it relatively easily. I didn't want to completely recreate the API for dict, this isn't something I expect we'll use much. >> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:33 >> + self.primary_size = int(primary_size or 4) > > Do we expect it raise an exception when primary_size is something int() will raise an exception e.g.'invalid' ? > Also, why the primary size is set to 4? > Would it be considered 'over engineering' if we implement this using Trie? I think the problem with a Trie is that we really only want the single nested dict. The point of nesting is so that we can narrow down the set of potentially matching hashes sufficiently so we can do the startswith check on just a handful of hashes. Its very similar to the CSS 101 idea of a hash table managing collisions with a linked list >> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:47 >> + value = result > > Can we invoke self.getitem(keyname, None) instead? We cannot, and it's this little bit that makes it fuzzy. This entire structure is optimized for hashes. The first 4 characters of a hash are mostly sufficient to avoid collisions in the webkit repository, but there are a significant minority of 4 character sequences which do point to multiple hashes. However, it's very unlikely that those 4 character sequences point to more than 2-3 hashes, which is what this function is counting on. Basically, this lets us have the fast look up of a hash table, while also allowing the user to specify partial hashes (hopefully 8-12 characters, but we do our best regardless)
Comment on attachment 432425 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=432425&action=review >>> Tools/Scripts/libraries/webkitcorepy/webkitcorepy/nested_fuzzy_dict.py:47 >>> + value = result >> >> Can we invoke self.getitem(keyname, None) instead? > > We cannot, and it's this little bit that makes it fuzzy. > > This entire structure is optimized for hashes. The first 4 characters of a hash are mostly sufficient to avoid collisions in the webkit repository, but there are a significant minority of 4 character sequences which do point to multiple hashes. However, it's very unlikely that those 4 character sequences point to more than 2-3 hashes, which is what this function is counting on. Basically, this lets us have the fast look up of a hash table, while also allowing the user to specify partial hashes (hopefully 8-12 characters, but we do our best regardless) I mean, what if we do ``` def __getitem__(self, keyname): found, value = self.getitem(keyname) if found: return value raise KeyError(keyname) ```
Created attachment 432505 [details] Patch
Comment on attachment 432505 [details] Patch r=me
Committed r279381 (239246@main): <https://commits.webkit.org/239246@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 432505 [details].