Bug 237049 - Drop StringHasher::hashMemory() and use the modern Hasher instead
Summary: Drop StringHasher::hashMemory() and use the modern Hasher instead
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Chris Dumez
URL:
Keywords: InRadar
Depends on: 236852
Blocks:
  Show dependency treegraph
 
Reported: 2022-02-22 11:46 PST by Chris Dumez
Modified: 2022-02-23 12:00 PST (History)
22 users (show)

See Also:


Attachments
Patch (27.55 KB, patch)
2022-02-22 11:48 PST, Chris Dumez
ews-feeder: commit-queue-
Details | Formatted Diff | Diff
Patch (27.55 KB, patch)
2022-02-22 12:56 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (30.01 KB, patch)
2022-02-22 18:42 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (30.01 KB, patch)
2022-02-22 18:47 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (30.06 KB, patch)
2022-02-22 19:26 PST, Chris Dumez
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Dumez 2022-02-22 11:46:51 PST
Drop StringHasher::hashMemory() and use the modern Hasher instead.
Comment 1 Chris Dumez 2022-02-22 11:48:29 PST
Created attachment 452888 [details]
Patch
Comment 2 Chris Dumez 2022-02-22 12:56:55 PST
Created attachment 452895 [details]
Patch
Comment 3 Sam Weinig 2022-02-22 17:37:07 PST
Comment on attachment 452895 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=452895&action=review

> Source/WTF/wtf/SchedulePair.h:77
> +        return computeHash(reinterpret_cast<uintptr_t>(pair->runLoop()), pair->mode() ? CFHash(pair->mode()) : 0);

Not new, but we should probably not hash a CFHash() here and should just hash the CFStringRef (not needed for this change, just pointing it out).

> Source/WTF/wtf/UUID.h:101
> +    add(hasher, high, low);

Probably worth just adding an overload of add(Hasher, ..) for UInt128. But I think you can also use the UInt128Low64/UInt128High64 helpers instead of the explicit bit operations.

> Source/WebCore/Modules/indexeddb/shared/IDBResourceIdentifier.h:93
> -    static unsigned hash(const IDBResourceIdentifier& a) { return a.hash(); }
> +    static unsigned hash(const IDBResourceIdentifier& a) { return computeHash(a); }

Would be nice if we could avoid having this and instead have HashTable (or whatever is using these Hash structs) try to call add(Hasher,...) if it exists.

> Source/WebCore/page/SecurityOriginHash.h:41
> +        return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0));

This will compute a different hash value than the old one, and will be more expensive due to no using cached hash values in the strings. This is likely a better hash overall since you get more entropy per-byte, but it might be a slow down due to rehashing each string.

> Source/WebKitLegacy/mac/History/BinaryPropertyList.cpp:81
> +    auto* integers = array.integers();
> +    for (size_t i = 0; i < array.size(); ++i)
> +        add(hasher, integers[i]);

I would be nice if this could be done instead:

    add(hasher, Span(array.integers(), array.size()));
Comment 4 Darin Adler 2022-02-22 17:44:32 PST
Comment on attachment 452895 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=452895&action=review

Do we need to measure the performance effect of all these hashing changes?

> Source/WTF/wtf/SchedulePair.h:77
> +        return computeHash(reinterpret_cast<uintptr_t>(pair->runLoop()), pair->mode() ? CFHash(pair->mode()) : 0);

Seems like it would be slightly more elegant to have add(hasher&, const SchedulePair&) and call "return computeHash(*pair)" here.

This code is still doing a hash of hashes, by calling CFHash on CFStringRef. Is there an alternative? Should we teach Hasher how to hash a CFStringRef? Put that in a header somewhere so we can reuse it?

Do we really need to convert the run loop pointer to an integer to hash it? Can we hash pointers without reinterpret_cast, or is it ambiguous whether that means to hash the address or the data structure at that address? Is converting to an integer the way to say we want to hash the address? Mainly, I think uintptr_t is a leftover from wanting the same type for two integers that we wanted to pass to hashMemory.

> Source/WTF/wtf/UUID.h:97
> +inline void add(Hasher& hasher, const UUID& uuid)

Probably should take UUID instead of const UUID&.

> Source/WTF/wtf/UUID.h:101
> +    auto high = static_cast<uint64_t>(uuid.m_data >> 64);
> +    auto low = static_cast<uint64_t>(uuid.m_data & 0xffffffffffffffff);
> +    add(hasher, high, low);

Seems like we should have:

    void add(Hasher&, UInt128)

And have this call that. Can still have it live in this header for now.

> Source/WebCore/dom/QualifiedName.h:123
> +inline void add(Hasher& hasher, const QualifiedName& name)
> +{
> +    auto impl = name.impl();
> +    add(hasher, impl->m_prefix, impl->m_localName, impl->m_namespace);
> +}

I often like to just overload and also have a second function:

    void add(Hasher&, const QualifiedNameImpl&)

And then have this one be:

    add(hasher, *name.impl());

Not *sure* that is better, but it seems appealing to me.

> Source/WebCore/page/SecurityOriginHash.h:41
> +        return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0));

I don’t think we need this value_or(0) here.

Also seems like we’d like to have:

    add(Hasher&, const SecurityOrigin&);

Then here just call computeHash(*origin);

> Source/WebCore/platform/ScriptExecutionContextIdentifier.h:79
> +inline void add(Hasher& hasher, const ProcessQualified<UUID>& uuid)
> +{
> +    add(hasher, uuid.object());
> +}

Should this instead be a function template that works on all ProcessQualified<>?

> Source/WebCore/platform/graphics/cocoa/FontPlatformDataCocoa.mm:50
> +    return computeHash(fontHash, flags);

The use of uintptr_t in this function doesn’t really make sense any more. We can hash any integer types that are appropriate, and seems unlikely we need a pointer-sized int for the flags, nor do we need to cast CFHashCode to uintptr_t.

Also a little strange to put these bits all together in a flags integer when the hasher can just be passed multiple things; this is probably a bit more efficient, but why not just hash all the fields in the most straightforward way.

Another example of hashing a CFStringRef by hashing a hash, and it’s not great.

> Source/WebCore/platform/graphics/win/FontPlatformDataCGWin.cpp:137
>      unsigned fontHash = m_font ? m_font->hash() : 0;

Might need a hasher that knows how to dereference a pointer.

> Source/WebCore/platform/graphics/win/FontPlatformDataCGWin.cpp:140
> +    return computeHash(fontHash, cgFontHash, ctFontHash, m_useGDI);

The "hashing a hash" pattern again, on both CFStringRef but also on m_font. But how much do we care about code in CGWin? Probably close to retirement at this point.

> Source/WebKitLegacy/mac/History/BinaryPropertyList.cpp:82
> +inline void add(Hasher& hasher, const IntegerArray& array)
> +{
> +    auto* integers = array.integers();
> +    for (size_t i = 0; i < array.size(); ++i)
> +        add(hasher, integers[i]);
> +}

A little strange that if there’s nothing in the array the hash doesn’t get changed at all. I am confused about what the best practice should be for adding empty collections to a hash.
Comment 5 Darin Adler 2022-02-22 17:44:53 PST
Nice that Sam and I made many of the same comments.
Comment 6 Chris Dumez 2022-02-22 17:58:31 PST
(In reply to Sam Weinig from comment #3)
> Comment on attachment 452895 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=452895&action=review
> 
> > Source/WTF/wtf/SchedulePair.h:77
> > +        return computeHash(reinterpret_cast<uintptr_t>(pair->runLoop()), pair->mode() ? CFHash(pair->mode()) : 0);
> 
> Not new, but we should probably not hash a CFHash() here and should just
> hash the CFStringRef (not needed for this change, just pointing it out).

Yes, I wondered about that but tried to be conservative since this is a pre-existing issue.
Hashing the CFStringRef would produce a better hash but would be more expensive.

> > Source/WTF/wtf/UUID.h:101
> > +    add(hasher, high, low);
> 
> Probably worth just adding an overload of add(Hasher, ..) for UInt128. But I
> think you can also use the UInt128Low64/UInt128High64 helpers instead of the
> explicit bit operations.

The reason UUID uses the bit operations and not UInt128Low64()/UInt128High64() is that those functions work on a UInt128Impl, which is an implementation detail.
On some platforms, UInt128 is a __uint128_t, not a UInt128Impl.

> 
> > Source/WebCore/Modules/indexeddb/shared/IDBResourceIdentifier.h:93
> > -    static unsigned hash(const IDBResourceIdentifier& a) { return a.hash(); }
> > +    static unsigned hash(const IDBResourceIdentifier& a) { return computeHash(a); }
> 
> Would be nice if we could avoid having this and instead have HashTable (or
> whatever is using these Hash structs) try to call add(Hasher,...) if it
> exists.
> 
> > Source/WebCore/page/SecurityOriginHash.h:41
> > +        return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0));
> 
> This will compute a different hash value than the old one, and will be more
> expensive due to no using cached hash values in the strings. This is likely
> a better hash overall since you get more entropy per-byte, but it might be a
> slow down due to rehashing each string.

Well, technically, those are AtomStrings so we are not actually going to re-hash strings per se. We'll just hash pointers.
Therefore, I think it isn't as expensive as it looks. You are right though that it is more expensive than using the cached hash.
The alternative though would be to use the cached hash and give that to Hasher. Sadly, this would hash a hash, which is not optimal.

Would we prefer that?

> 
> > Source/WebKitLegacy/mac/History/BinaryPropertyList.cpp:81
> > +    auto* integers = array.integers();
> > +    for (size_t i = 0; i < array.size(); ++i)
> > +        add(hasher, integers[i]);
> 
> I would be nice if this could be done instead:
> 
>     add(hasher, Span(array.integers(), array.size()));

This may work already, I'll give it a try.
Comment 7 Chris Dumez 2022-02-22 18:26:16 PST
(In reply to Darin Adler from comment #4)
> Comment on attachment 452895 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=452895&action=review
> 
> Do we need to measure the performance effect of all these hashing changes?
> 
> > Source/WTF/wtf/SchedulePair.h:77
> > +        return computeHash(reinterpret_cast<uintptr_t>(pair->runLoop()), pair->mode() ? CFHash(pair->mode()) : 0);
> 
> Seems like it would be slightly more elegant to have add(hasher&, const
> SchedulePair&) and call "return computeHash(*pair)" here.
> 
> This code is still doing a hash of hashes, by calling CFHash on CFStringRef.
> Is there an alternative? Should we teach Hasher how to hash a CFStringRef?
> Put that in a header somewhere so we can reuse it?
> 
> Do we really need to convert the run loop pointer to an integer to hash it?
> Can we hash pointers without reinterpret_cast, or is it ambiguous whether
> that means to hash the address or the data structure at that address? Is
> converting to an integer the way to say we want to hash the address? Mainly,
> I think uintptr_t is a leftover from wanting the same type for two integers
> that we wanted to pass to hashMemory.
> 
> > Source/WTF/wtf/UUID.h:97
> > +inline void add(Hasher& hasher, const UUID& uuid)
> 
> Probably should take UUID instead of const UUID&.
> 
> > Source/WTF/wtf/UUID.h:101
> > +    auto high = static_cast<uint64_t>(uuid.m_data >> 64);
> > +    auto low = static_cast<uint64_t>(uuid.m_data & 0xffffffffffffffff);
> > +    add(hasher, high, low);
> 
> Seems like we should have:
> 
>     void add(Hasher&, UInt128)
> 
> And have this call that. Can still have it live in this header for now.
> 
> > Source/WebCore/dom/QualifiedName.h:123
> > +inline void add(Hasher& hasher, const QualifiedName& name)
> > +{
> > +    auto impl = name.impl();
> > +    add(hasher, impl->m_prefix, impl->m_localName, impl->m_namespace);
> > +}
> 
> I often like to just overload and also have a second function:
> 
>     void add(Hasher&, const QualifiedNameImpl&)
> 
> And then have this one be:
> 
>     add(hasher, *name.impl());
> 
> Not *sure* that is better, but it seems appealing to me.
> 
> > Source/WebCore/page/SecurityOriginHash.h:41
> > +        return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0));
> 
> I don’t think we need this value_or(0) here.
> 
> Also seems like we’d like to have:
> 
>     add(Hasher&, const SecurityOrigin&);
> 
> Then here just call computeHash(*origin);
> 
> > Source/WebCore/platform/ScriptExecutionContextIdentifier.h:79
> > +inline void add(Hasher& hasher, const ProcessQualified<UUID>& uuid)
> > +{
> > +    add(hasher, uuid.object());
> > +}
> 
> Should this instead be a function template that works on all
> ProcessQualified<>?

There is already one generic one for ProcessQualified<> and hashes both the object and the processIdentifier. However, the ProcessQualified<UUID> specialization really only needs to hash the object which it is a UUID. It maintains pre-exiting behavior and performance.
Comment 8 Darin Adler 2022-02-22 18:28:47 PST
(In reply to Chris Dumez from comment #7)
> However, the ProcessQualified<UUID>
> specialization really only needs to hash the object which it is a UUID. It
> maintains pre-exiting behavior and performance.

Definitely needs a comment, something like this:

    // Since UUIDs are unique on their own, optimize by not hashing the process identifier.
Comment 9 Chris Dumez 2022-02-22 18:41:55 PST
(In reply to Darin Adler from comment #4)
> Comment on attachment 452895 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=452895&action=review
> 
> Do we need to measure the performance effect of all these hashing changes?

You are right that there is a slight risk. I'll monitor the perf bots after I land to make sure there is no obvious regression.
Comment 10 Chris Dumez 2022-02-22 18:42:45 PST
Created attachment 452926 [details]
Patch
Comment 11 Chris Dumez 2022-02-22 18:47:51 PST
Created attachment 452927 [details]
Patch
Comment 12 Darin Adler 2022-02-22 19:09:49 PST
Comment on attachment 452927 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=452927&action=review

Looks great

> Source/WTF/wtf/Hasher.h:79
> +inline void add(Hasher& hasher, const UInt128& value)

I suggest UInt128 instead of const UInt128&.

> Source/WebCore/Modules/indexeddb/shared/IDBResourceIdentifier.h:89
> +    add(hasher, identifier.connectionIdentifier(), identifier.m_resourceNumber);

Just realized this is a friend, so maybe we should use both data members instead of using a getter for one and the data member for the other.
Comment 13 Chris Dumez 2022-02-22 19:26:10 PST
Created attachment 452929 [details]
Patch
Comment 14 EWS 2022-02-22 22:13:41 PST
Committed r290349 (247667@main): <https://commits.webkit.org/247667@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 452929 [details].
Comment 15 Radar WebKit Bug Importer 2022-02-22 22:14:22 PST
<rdar://problem/89337284>
Comment 16 Sam Weinig 2022-02-23 10:27:56 PST
> > > Source/WebCore/page/SecurityOriginHash.h:41
> > > +        return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0));
> > 
> > This will compute a different hash value than the old one, and will be more
> > expensive due to no using cached hash values in the strings. This is likely
> > a better hash overall since you get more entropy per-byte, but it might be a
> > slow down due to rehashing each string.
> 
> Well, technically, those are AtomStrings so we are not actually going to
> re-hash strings per se. We'll just hash pointers.
> Therefore, I think it isn't as expensive as it looks. You are right though
> that it is more expensive than using the cached hash.
> The alternative though would be to use the cached hash and give that to
> Hasher. Sadly, this would hash a hash, which is not optimal.
> 
> Would we prefer that?

I don't think those are AtomStrings, though perhaps I am missing something. SecurityOrigin::protocol() returns a const String&.

I actually think this is likely better better, but that you should just be on the lookout for a possible perf regression.
Comment 17 Chris Dumez 2022-02-23 10:30:01 PST
(In reply to Sam Weinig from comment #16)
> > > > Source/WebCore/page/SecurityOriginHash.h:41
> > > > +        return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0));
> > > 
> > > This will compute a different hash value than the old one, and will be more
> > > expensive due to no using cached hash values in the strings. This is likely
> > > a better hash overall since you get more entropy per-byte, but it might be a
> > > slow down due to rehashing each string.
> > 
> > Well, technically, those are AtomStrings so we are not actually going to
> > re-hash strings per se. We'll just hash pointers.
> > Therefore, I think it isn't as expensive as it looks. You are right though
> > that it is more expensive than using the cached hash.
> > The alternative though would be to use the cached hash and give that to
> > Hasher. Sadly, this would hash a hash, which is not optimal.
> > 
> > Would we prefer that?
> 
> I don't think those are AtomStrings, though perhaps I am missing something.
> SecurityOrigin::protocol() returns a const String&.

Right, those are not. My comment was meant for QualifiedNameComponents / QualifiedName.

> I actually think this is likely better better, but that you should just be
> on the lookout for a possible perf regression.

Agreed. Will check the bots today.
Comment 18 Chris Dumez 2022-02-23 12:00:40 PST
(In reply to Chris Dumez from comment #17)
> (In reply to Sam Weinig from comment #16)
> > > > > Source/WebCore/page/SecurityOriginHash.h:41
> > > > > +        return computeHash(origin->protocol(), origin->host(), origin->port().value_or(0));
> > > > 
> > > > This will compute a different hash value than the old one, and will be more
> > > > expensive due to no using cached hash values in the strings. This is likely
> > > > a better hash overall since you get more entropy per-byte, but it might be a
> > > > slow down due to rehashing each string.
> > > 
> > > Well, technically, those are AtomStrings so we are not actually going to
> > > re-hash strings per se. We'll just hash pointers.
> > > Therefore, I think it isn't as expensive as it looks. You are right though
> > > that it is more expensive than using the cached hash.
> > > The alternative though would be to use the cached hash and give that to
> > > Hasher. Sadly, this would hash a hash, which is not optimal.
> > > 
> > > Would we prefer that?
> > 
> > I don't think those are AtomStrings, though perhaps I am missing something.
> > SecurityOrigin::protocol() returns a const String&.
> 
> Right, those are not. My comment was meant for QualifiedNameComponents /
> QualifiedName.
> 
> > I actually think this is likely better better, but that you should just be
> > on the lookout for a possible perf regression.
> 
> Agreed. Will check the bots today.

I've just checked the perf bots and it looks to be perf-neutral.