237049 – Drop StringHasher::hashMemory() and use the modern Hasher instead

Chris Dumez

Reported 2022-02-22 11:46:51 PST

Drop StringHasher::hashMemory() and use the modern Hasher instead.

Chris Dumez

Comment 1 2022-02-22 11:48:29 PST

Created attachment 452888 [details] Patch

Chris Dumez

Comment 2 2022-02-22 12:56:55 PST

Created attachment 452895 [details] Patch

Sam Weinig

Comment 3 2022-02-22 17:37:07 PST

Comment on attachment 452895 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=452895&action=review > Source/WTF/wtf/SchedulePair.h:77 > + return computeHash(reinterpret_cast<uintptr_t>(pair->runLoop()), pair->mode() ? CFHash(pair->mode()) : 0); Not new, but we should probably not hash a CFHash() here and should just hash the CFStringRef (not needed for this change, just pointing it out). > Source/WTF/wtf/UUID.h:101 > + add(hasher, high, low); Probably worth just adding an overload of add(Hasher, ..) for UInt128. But I think you can also use the UInt128Low64/UInt128High64 helpers instead of the explicit bit operations. > Source/WebCore/Modules/indexeddb/shared/IDBResourceIdentifier.h:93 > - static unsigned hash(const IDBResourceIdentifier& a) { return a.hash(); } > + static unsigned hash(const IDBResourceIdentifier& a) { return computeHash(a); } Would be nice if we could avoid having this and instead have HashTable (or whatever is using these Hash structs) try to call add(Hasher,...) if it exists. > Source/WebCore/page/SecurityOriginHash.h:41 > + return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0)); This will compute a different hash value than the old one, and will be more expensive due to no using cached hash values in the strings. This is likely a better hash overall since you get more entropy per-byte, but it might be a slow down due to rehashing each string. > Source/WebKitLegacy/mac/History/BinaryPropertyList.cpp:81 > + auto* integers = array.integers(); > + for (size_t i = 0; i < array.size(); ++i) > + add(hasher, integers[i]); I would be nice if this could be done instead: add(hasher, Span(array.integers(), array.size()));

Darin Adler

Comment 4 2022-02-22 17:44:32 PST

Comment on attachment 452895 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=452895&action=review Do we need to measure the performance effect of all these hashing changes? > Source/WTF/wtf/SchedulePair.h:77 > + return computeHash(reinterpret_cast<uintptr_t>(pair->runLoop()), pair->mode() ? CFHash(pair->mode()) : 0); Seems like it would be slightly more elegant to have add(hasher&, const SchedulePair&) and call "return computeHash(*pair)" here. This code is still doing a hash of hashes, by calling CFHash on CFStringRef. Is there an alternative? Should we teach Hasher how to hash a CFStringRef? Put that in a header somewhere so we can reuse it? Do we really need to convert the run loop pointer to an integer to hash it? Can we hash pointers without reinterpret_cast, or is it ambiguous whether that means to hash the address or the data structure at that address? Is converting to an integer the way to say we want to hash the address? Mainly, I think uintptr_t is a leftover from wanting the same type for two integers that we wanted to pass to hashMemory. > Source/WTF/wtf/UUID.h:97 > +inline void add(Hasher& hasher, const UUID& uuid) Probably should take UUID instead of const UUID&. > Source/WTF/wtf/UUID.h:101 > + auto high = static_cast<uint64_t>(uuid.m_data >> 64); > + auto low = static_cast<uint64_t>(uuid.m_data & 0xffffffffffffffff); > + add(hasher, high, low); Seems like we should have: void add(Hasher&, UInt128) And have this call that. Can still have it live in this header for now. > Source/WebCore/dom/QualifiedName.h:123 > +inline void add(Hasher& hasher, const QualifiedName& name) > +{ > + auto impl = name.impl(); > + add(hasher, impl->m_prefix, impl->m_localName, impl->m_namespace); > +} I often like to just overload and also have a second function: void add(Hasher&, const QualifiedNameImpl&) And then have this one be: add(hasher, *name.impl()); Not *sure* that is better, but it seems appealing to me. > Source/WebCore/page/SecurityOriginHash.h:41 > + return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0)); I don’t think we need this value_or(0) here. Also seems like we’d like to have: add(Hasher&, const SecurityOrigin&); Then here just call computeHash(*origin); > Source/WebCore/platform/ScriptExecutionContextIdentifier.h:79 > +inline void add(Hasher& hasher, const ProcessQualified<UUID>& uuid) > +{ > + add(hasher, uuid.object()); > +} Should this instead be a function template that works on all ProcessQualified<>? > Source/WebCore/platform/graphics/cocoa/FontPlatformDataCocoa.mm:50 > + return computeHash(fontHash, flags); The use of uintptr_t in this function doesn’t really make sense any more. We can hash any integer types that are appropriate, and seems unlikely we need a pointer-sized int for the flags, nor do we need to cast CFHashCode to uintptr_t. Also a little strange to put these bits all together in a flags integer when the hasher can just be passed multiple things; this is probably a bit more efficient, but why not just hash all the fields in the most straightforward way. Another example of hashing a CFStringRef by hashing a hash, and it’s not great. > Source/WebCore/platform/graphics/win/FontPlatformDataCGWin.cpp:137 > unsigned fontHash = m_font ? m_font->hash() : 0; Might need a hasher that knows how to dereference a pointer. > Source/WebCore/platform/graphics/win/FontPlatformDataCGWin.cpp:140 > + return computeHash(fontHash, cgFontHash, ctFontHash, m_useGDI); The "hashing a hash" pattern again, on both CFStringRef but also on m_font. But how much do we care about code in CGWin? Probably close to retirement at this point. > Source/WebKitLegacy/mac/History/BinaryPropertyList.cpp:82 > +inline void add(Hasher& hasher, const IntegerArray& array) > +{ > + auto* integers = array.integers(); > + for (size_t i = 0; i < array.size(); ++i) > + add(hasher, integers[i]); > +} A little strange that if there’s nothing in the array the hash doesn’t get changed at all. I am confused about what the best practice should be for adding empty collections to a hash.

Darin Adler

Comment 5 2022-02-22 17:44:53 PST

Nice that Sam and I made many of the same comments.

Chris Dumez

Comment 6 2022-02-22 17:58:31 PST

(In reply to Sam Weinig from comment #3) > Comment on attachment 452895 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=452895&action=review > > > Source/WTF/wtf/SchedulePair.h:77 > > + return computeHash(reinterpret_cast<uintptr_t>(pair->runLoop()), pair->mode() ? CFHash(pair->mode()) : 0); > > Not new, but we should probably not hash a CFHash() here and should just > hash the CFStringRef (not needed for this change, just pointing it out). Yes, I wondered about that but tried to be conservative since this is a pre-existing issue. Hashing the CFStringRef would produce a better hash but would be more expensive. > > Source/WTF/wtf/UUID.h:101 > > + add(hasher, high, low); > > Probably worth just adding an overload of add(Hasher, ..) for UInt128. But I > think you can also use the UInt128Low64/UInt128High64 helpers instead of the > explicit bit operations. The reason UUID uses the bit operations and not UInt128Low64()/UInt128High64() is that those functions work on a UInt128Impl, which is an implementation detail. On some platforms, UInt128 is a __uint128_t, not a UInt128Impl. > > > Source/WebCore/Modules/indexeddb/shared/IDBResourceIdentifier.h:93 > > - static unsigned hash(const IDBResourceIdentifier& a) { return a.hash(); } > > + static unsigned hash(const IDBResourceIdentifier& a) { return computeHash(a); } > > Would be nice if we could avoid having this and instead have HashTable (or > whatever is using these Hash structs) try to call add(Hasher,...) if it > exists. > > > Source/WebCore/page/SecurityOriginHash.h:41 > > + return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0)); > > This will compute a different hash value than the old one, and will be more > expensive due to no using cached hash values in the strings. This is likely > a better hash overall since you get more entropy per-byte, but it might be a > slow down due to rehashing each string. Well, technically, those are AtomStrings so we are not actually going to re-hash strings per se. We'll just hash pointers. Therefore, I think it isn't as expensive as it looks. You are right though that it is more expensive than using the cached hash. The alternative though would be to use the cached hash and give that to Hasher. Sadly, this would hash a hash, which is not optimal. Would we prefer that? > > > Source/WebKitLegacy/mac/History/BinaryPropertyList.cpp:81 > > + auto* integers = array.integers(); > > + for (size_t i = 0; i < array.size(); ++i) > > + add(hasher, integers[i]); > > I would be nice if this could be done instead: > > add(hasher, Span(array.integers(), array.size())); This may work already, I'll give it a try.

Chris Dumez

Comment 7 2022-02-22 18:26:16 PST

(In reply to Darin Adler from comment #4) > Comment on attachment 452895 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=452895&action=review > > Do we need to measure the performance effect of all these hashing changes? > > > Source/WTF/wtf/SchedulePair.h:77 > > + return computeHash(reinterpret_cast<uintptr_t>(pair->runLoop()), pair->mode() ? CFHash(pair->mode()) : 0); > > Seems like it would be slightly more elegant to have add(hasher&, const > SchedulePair&) and call "return computeHash(*pair)" here. > > This code is still doing a hash of hashes, by calling CFHash on CFStringRef. > Is there an alternative? Should we teach Hasher how to hash a CFStringRef? > Put that in a header somewhere so we can reuse it? > > Do we really need to convert the run loop pointer to an integer to hash it? > Can we hash pointers without reinterpret_cast, or is it ambiguous whether > that means to hash the address or the data structure at that address? Is > converting to an integer the way to say we want to hash the address? Mainly, > I think uintptr_t is a leftover from wanting the same type for two integers > that we wanted to pass to hashMemory. > > > Source/WTF/wtf/UUID.h:97 > > +inline void add(Hasher& hasher, const UUID& uuid) > > Probably should take UUID instead of const UUID&. > > > Source/WTF/wtf/UUID.h:101 > > + auto high = static_cast<uint64_t>(uuid.m_data >> 64); > > + auto low = static_cast<uint64_t>(uuid.m_data & 0xffffffffffffffff); > > + add(hasher, high, low); > > Seems like we should have: > > void add(Hasher&, UInt128) > > And have this call that. Can still have it live in this header for now. > > > Source/WebCore/dom/QualifiedName.h:123 > > +inline void add(Hasher& hasher, const QualifiedName& name) > > +{ > > + auto impl = name.impl(); > > + add(hasher, impl->m_prefix, impl->m_localName, impl->m_namespace); > > +} > > I often like to just overload and also have a second function: > > void add(Hasher&, const QualifiedNameImpl&) > > And then have this one be: > > add(hasher, *name.impl()); > > Not *sure* that is better, but it seems appealing to me. > > > Source/WebCore/page/SecurityOriginHash.h:41 > > + return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0)); > > I don’t think we need this value_or(0) here. > > Also seems like we’d like to have: > > add(Hasher&, const SecurityOrigin&); > > Then here just call computeHash(*origin); > > > Source/WebCore/platform/ScriptExecutionContextIdentifier.h:79 > > +inline void add(Hasher& hasher, const ProcessQualified<UUID>& uuid) > > +{ > > + add(hasher, uuid.object()); > > +} > > Should this instead be a function template that works on all > ProcessQualified<>? There is already one generic one for ProcessQualified<> and hashes both the object and the processIdentifier. However, the ProcessQualified<UUID> specialization really only needs to hash the object which it is a UUID. It maintains pre-exiting behavior and performance.

Darin Adler

Comment 8 2022-02-22 18:28:47 PST

(In reply to Chris Dumez from comment #7) > However, the ProcessQualified<UUID> > specialization really only needs to hash the object which it is a UUID. It > maintains pre-exiting behavior and performance. Definitely needs a comment, something like this: // Since UUIDs are unique on their own, optimize by not hashing the process identifier.

Chris Dumez

Comment 9 2022-02-22 18:41:55 PST

(In reply to Darin Adler from comment #4) > Comment on attachment 452895 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=452895&action=review > > Do we need to measure the performance effect of all these hashing changes? You are right that there is a slight risk. I'll monitor the perf bots after I land to make sure there is no obvious regression.

Chris Dumez

Comment 10 2022-02-22 18:42:45 PST

Created attachment 452926 [details] Patch

Chris Dumez

Comment 11 2022-02-22 18:47:51 PST

Created attachment 452927 [details] Patch

Darin Adler

Comment 12 2022-02-22 19:09:49 PST

Comment on attachment 452927 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=452927&action=review Looks great > Source/WTF/wtf/Hasher.h:79 > +inline void add(Hasher& hasher, const UInt128& value) I suggest UInt128 instead of const UInt128&. > Source/WebCore/Modules/indexeddb/shared/IDBResourceIdentifier.h:89 > + add(hasher, identifier.connectionIdentifier(), identifier.m_resourceNumber); Just realized this is a friend, so maybe we should use both data members instead of using a getter for one and the data member for the other.

Chris Dumez

Comment 13 2022-02-22 19:26:10 PST

Created attachment 452929 [details] Patch

EWS

Comment 14 2022-02-22 22:13:41 PST

Committed r290349 (247667@main): <https://commits.webkit.org/247667@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 452929 [details].

Radar WebKit Bug Importer

Comment 15 2022-02-22 22:14:22 PST

<rdar://problem/89337284>

Sam Weinig

Comment 16 2022-02-23 10:27:56 PST

> > > Source/WebCore/page/SecurityOriginHash.h:41 > > > + return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0)); > > > > This will compute a different hash value than the old one, and will be more > > expensive due to no using cached hash values in the strings. This is likely > > a better hash overall since you get more entropy per-byte, but it might be a > > slow down due to rehashing each string. > > Well, technically, those are AtomStrings so we are not actually going to > re-hash strings per se. We'll just hash pointers. > Therefore, I think it isn't as expensive as it looks. You are right though > that it is more expensive than using the cached hash. > The alternative though would be to use the cached hash and give that to > Hasher. Sadly, this would hash a hash, which is not optimal. > > Would we prefer that? I don't think those are AtomStrings, though perhaps I am missing something. SecurityOrigin::protocol() returns a const String&. I actually think this is likely better better, but that you should just be on the lookout for a possible perf regression.

Chris Dumez

Comment 17 2022-02-23 10:30:01 PST

(In reply to Sam Weinig from comment #16) > > > > Source/WebCore/page/SecurityOriginHash.h:41 > > > > + return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0)); > > > > > > This will compute a different hash value than the old one, and will be more > > > expensive due to no using cached hash values in the strings. This is likely > > > a better hash overall since you get more entropy per-byte, but it might be a > > > slow down due to rehashing each string. > > > > Well, technically, those are AtomStrings so we are not actually going to > > re-hash strings per se. We'll just hash pointers. > > Therefore, I think it isn't as expensive as it looks. You are right though > > that it is more expensive than using the cached hash. > > The alternative though would be to use the cached hash and give that to > > Hasher. Sadly, this would hash a hash, which is not optimal. > > > > Would we prefer that? > > I don't think those are AtomStrings, though perhaps I am missing something. > SecurityOrigin::protocol() returns a const String&. Right, those are not. My comment was meant for QualifiedNameComponents / QualifiedName. > I actually think this is likely better better, but that you should just be > on the lookout for a possible perf regression. Agreed. Will check the bots today.

Chris Dumez

Comment 18 2022-02-23 12:00:40 PST

(In reply to Chris Dumez from comment #17) > (In reply to Sam Weinig from comment #16) > > > > > Source/WebCore/page/SecurityOriginHash.h:41 > > > > > + return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0)); > > > > > > > > This will compute a different hash value than the old one, and will be more > > > > expensive due to no using cached hash values in the strings. This is likely > > > > a better hash overall since you get more entropy per-byte, but it might be a > > > > slow down due to rehashing each string. > > > > > > Well, technically, those are AtomStrings so we are not actually going to > > > re-hash strings per se. We'll just hash pointers. > > > Therefore, I think it isn't as expensive as it looks. You are right though > > > that it is more expensive than using the cached hash. > > > The alternative though would be to use the cached hash and give that to > > > Hasher. Sadly, this would hash a hash, which is not optimal. > > > > > > Would we prefer that? > > > > I don't think those are AtomStrings, though perhaps I am missing something. > > SecurityOrigin::protocol() returns a const String&. > > Right, those are not. My comment was meant for QualifiedNameComponents / > QualifiedName. > > > I actually think this is likely better better, but that you should just be > > on the lookout for a possible perf regression. > > Agreed. Will check the bots today. I've just checked the perf bots and it looks to be perf-neutral.

Attachments
Patch (27.55 KB, patch) 2022-02-22 11:48 PST, Chris Dumez	ews-feeder: commit-queue-	Details Formatted Diff Diff
Patch (27.55 KB, patch) 2022-02-22 12:56 PST, Chris Dumez	no flags	Details Formatted Diff Diff
Patch (30.01 KB, patch) 2022-02-22 18:42 PST, Chris Dumez	no flags	Details Formatted Diff Diff
Patch (30.01 KB, patch) 2022-02-22 18:47 PST, Chris Dumez	no flags	Details Formatted Diff Diff
Patch (30.06 KB, patch) 2022-02-22 19:26 PST, Chris Dumez	no flags	Details Formatted Diff Diff
Show Obsolete (4) View All Add attachment proposed patch, testcase, etc.

achristensen
alecflett
beidson
benjamin
cmarcelo
darin
dino
esprehn+autocc
ews-watchlist
fmalita
ggaren
gyuyoung.kim
jsbell
kangil.han
mmaxfield
pdr
sabouhallawa
sam
schenney
sergio
webkit-bug-importer
youennf