Bug 180318 - Add WTF::Hasher, an easier to use replacement for WTF::IntegerHasher
Summary: Add WTF::Hasher, an easier to use replacement for WTF::IntegerHasher
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Darin Adler
URL:
Keywords: InRadar
Depends on:
Blocks: 180363
  Show dependency treegraph
 
Reported: 2017-12-02 13:20 PST by Darin Adler
Modified: 2017-12-07 10:11 PST (History)
5 users (show)

See Also:


Attachments
Patch (53.22 KB, patch)
2017-12-02 13:31 PST, Darin Adler
no flags Details | Formatted Diff | Diff
Patch (52.75 KB, patch)
2017-12-02 14:08 PST, Darin Adler
jfbastien: review+
Details | Formatted Diff | Diff
Patch (54.51 KB, patch)
2017-12-03 13:18 PST, Darin Adler
no flags Details | Formatted Diff | Diff
Patch (54.52 KB, patch)
2017-12-03 13:50 PST, Darin Adler
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Darin Adler 2017-12-02 13:20:46 PST
Add WTF::Hasher, an easier to use replacement for WTF::IntegerHasher
Comment 1 Darin Adler 2017-12-02 13:31:53 PST Comment hidden (obsolete)
Comment 2 Darin Adler 2017-12-02 13:33:39 PST
This is the class I was talking about. I decided to just post the new Hasher itself first, then I can upload patches for starting to use it in various places where we use IntegerHasher and StringHasher now.
Comment 3 Darin Adler 2017-12-02 14:08:04 PST
Created attachment 328254 [details]
Patch
Comment 4 Darin Adler 2017-12-02 16:39:57 PST
OK, everything green on EWS. Ready for review.
Comment 5 JF Bastien 2017-12-02 21:17:04 PST
Comment on attachment 328254 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=328254&action=review

r=me with a few comments, and a design suggestion.

As we discussed in person I think it would be cool to rely on the tuple-like type API (with either member or non-member `get` functions), because once we move to C++17 all functions that implemented that API get structured binding for free (case #2 in [1]).

Looking at your code though I'm not sure how much uglier it would then look, since what you have now is pretty elegant. It can be added later, but once too many classes don't do something it gets hard to migrate. Your call :-)

  [1]: http://en.cppreference.com/w/cpp/language/structured_binding

> Source/WTF/wtf/Hasher.h:31
> +    void add(unsigned integer)

It's a bit odd to mix unsigned here and for hash, with uint32_t below. I'd go for uint32_t everywhere.

> Source/WTF/wtf/Hasher.h:65
> +    template<typename UnsignedInteger> friend void add(std::enable_if_t<std::is_unsigned<UnsignedInteger>::value && sizeof(UnsignedInteger) <= sizeof(uint32_t), Hasher&> hasher, UnsignedInteger integer)

Just personal preference, but I prefer using enable_if for the return type instead of on parameters because I find it easier to read (that is, unless it can be a default template parameter which you can't do here because you're overloading `add`).

> Source/WTF/wtf/Hasher.h:68
> +        // We can consider adding a more efficient code path for hashing 16-bit values if needed, perhaps using addCharacter,

FIXME follow-up bugs for those?

> Source/WTF/wtf/Hasher.h:90
> +inline void add(Hasher& hasher, double number)

One question about hashing floating-point values is what happens if you hash -0. / +0. (both compare equal) and NaN values (which never compare equal). Do you think we ever do this, and would it be a bug to do so? Maybe you want to normalize zeros to positive, or keep the distinction so they do hash differently? For NaNs is seems like you'd want to canonicalize them to what JSC calls "pure" NaN?

> Source/WTF/wtf/Hasher.h:92
> +    add(hasher, reinterpret_cast<uint64_t&>(number));

You want bitwise_cast here and below.

> Source/WTF/wtf/Hasher.h:111
> +    for (auto& value : container)

const auto& (I know it already is, I just like to say it explicitly).

> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:49
> +}

?

> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:55
> +    EXPECT_EQ(zero32BitHash, computeHash(static_cast<char>(0)));

Here and below, you should also test signed char and unsigned char since they're distinct from char.

> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:129
> +    EXPECT_EQ(zero64BitHash, computeHash(0.0));

As discussed above, I'd figure out what we want from -0. and if it's the same as +0. (won't be if you just bitwise_cast).

> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:141
> +    EXPECT_EQ(2678086759U, computeHash(std::numeric_limits<double>::quiet_NaN()));

I'd test other NaNs as well.
Comment 6 Darin Adler 2017-12-03 11:41:18 PST
Comment on attachment 328254 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=328254&action=review

Thanks for the review, and helpful comments.

>> Source/WTF/wtf/Hasher.h:31
>> +    void add(unsigned integer)
> 
> It's a bit odd to mix unsigned here and for hash, with uint32_t below. I'd go for uint32_t everywhere.

This is IntegerHasher, unmodified, and something I intend to delete soon. I would also have gone with uint32_t if I was changing the class at all.

I can switch to uint32_t but I am not sure I should. Really doesn’t matter since I will delete this soon.

>> Source/WTF/wtf/Hasher.h:65
>> +    template<typename UnsignedInteger> friend void add(std::enable_if_t<std::is_unsigned<UnsignedInteger>::value && sizeof(UnsignedInteger) <= sizeof(uint32_t), Hasher&> hasher, UnsignedInteger integer)
> 
> Just personal preference, but I prefer using enable_if for the return type instead of on parameters because I find it easier to read (that is, unless it can be a default template parameter which you can't do here because you're overloading `add`).

OK. I have never used enable_if at all before, so I have no preference.

>> Source/WTF/wtf/Hasher.h:68
>> +        // We can consider adding a more efficient code path for hashing 16-bit values if needed, perhaps using addCharacter,
> 
> FIXME follow-up bugs for those?

I don’t think of these as necessary projects, just possibilities, which is why I neither used FIXME in the comments nor filed bugs. Maybe you feel differently and think these are necessary?

>> Source/WTF/wtf/Hasher.h:90
>> +inline void add(Hasher& hasher, double number)
> 
> One question about hashing floating-point values is what happens if you hash -0. / +0. (both compare equal) and NaN values (which never compare equal). Do you think we ever do this, and would it be a bug to do so? Maybe you want to normalize zeros to positive, or keep the distinction so they do hash differently? For NaNs is seems like you'd want to canonicalize them to what JSC calls "pure" NaN?

I think the right thing to do is to treat negative zero and NaNs the same way that WTF::FloatHash does, which is to define equality based on bitwise_cast, too. Then negative zero can be a distinct hash key, and different NaN values can be different as well. I agree that the feature of canonicalizing to pure NaN for both hashing and equality comparison could be useful if we really wanted NaN to be good as a key, but I’d hate to pay extra overhead for that behavior, because I am pretty sure the doubles we are using as hash table keys aren’t ever NaN in practice.

What I realize now, though, is that we don’t want people pairing "double == double" with Hasher hashing. Which makes me think I might need to offer something like "equalForHashing" (better name would help) alongside "computeHash" so that programmers can easily get it right when they have a large set of things that they are hashing that include some floating point values. We want them to make use of the bitwise_cast comparison of floating point values rather than floating point equality, otherwise they will run into problems if they have any negative zeros; values that hash differently but are equal which breaks the invariant needed for data structures like HashTable.

>> Source/WTF/wtf/Hasher.h:92
>> +    add(hasher, reinterpret_cast<uint64_t&>(number));
> 
> You want bitwise_cast here and below.

Yes, will fix that! Unfortunately that means I have to add an include of <wtf/StdLibExtras.h>.

That’s funny; I knew bitwise_cast was correct for this, but somehow thought that I had studied the hash functions in HashFunctions.h and exactly imitated them. But I was wrong, they use bitwise_cast as well they should.

>> Source/WTF/wtf/Hasher.h:111
>> +    for (auto& value : container)
> 
> const auto& (I know it already is, I just like to say it explicitly).

My attitude is opposite. I am sort of pretending I have the future version of C++ where I can just say for (value : container) and not have to write auto&&. For me auto& is my attempt to say nothing about the type. And so "const" seems extraneous to me. But I will add "const" since you request it and I don’t feel strongly.

>> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:49
>> +}
> 
> ?

Oops, I think that was supposed to have all the emptyHash cases in it, but they are in addMultiple now.

>> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:55
>> +    EXPECT_EQ(zero32BitHash, computeHash(static_cast<char>(0)));
> 
> Here and below, you should also test signed char and unsigned char since they're distinct from char.

Sure, I was aware of that. But I do test them, just as int8_t and uint8_t. I could also test them under the names "signed char" and "unsigned char" if you think including both is clearer. But I don’t do that for "signed short" and "unsigned short" either. Or I could move to testing with the names "char" and "short" since for sizes 32  and above I am testing using built-in types, not typedefs.

>> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:129
>> +    EXPECT_EQ(zero64BitHash, computeHash(0.0));
> 
> As discussed above, I'd figure out what we want from -0. and if it's the same as +0. (won't be if you just bitwise_cast).

I think we want to stay consistent with WTF::FloatHash and we don’t want it to be the same as +0; I will add a test for it.

>> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:141
>> +    EXPECT_EQ(2678086759U, computeHash(std::numeric_limits<double>::quiet_NaN()));
> 
> I'd test other NaNs as well.

OK, I’ll test at least one other NaN. There are tons of them, of course.
Comment 7 Darin Adler 2017-12-03 11:49:44 PST
(In reply to JF Bastien from comment #5)
> As we discussed in person I think it would be cool to rely on the tuple-like
> type API (with either member or non-member `get` functions), because once we
> move to C++17 all functions that implemented that API get structured binding
> for free (case #2 in [1]).
> 
> Looking at your code though I'm not sure how much uglier it would then look,
> since what you have now is pretty elegant. It can be added later, but once
> too many classes don't do something it gets hard to migrate. Your call :-)
> 
>   [1]: http://en.cppreference.com/w/cpp/language/structured_binding

The one thing I am not clear on about your suggestion is whether you are suggesting I use the tuple-like type API approach *instead* of the "override add for your class" design of if you are suggesting I make the tuple-like support part of the default add function template and still offer "override add" as a hook for more exotic hashing requirements.

Which one did you mean?
Comment 8 Darin Adler 2017-12-03 12:05:27 PST
Comment on attachment 328254 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=328254&action=review

>>> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:141
>>> +    EXPECT_EQ(2678086759U, computeHash(std::numeric_limits<double>::quiet_NaN()));
>> 
>> I'd test other NaNs as well.
> 
> OK, I’ll test at least one other NaN. There are tons of them, of course.

I’m not sure how to do this. How do I make another NaN? I could use signaling_NaN, but that is definitely not what I want to test.
Comment 9 Darin Adler 2017-12-03 13:18:26 PST Comment hidden (obsolete)
Comment 10 Darin Adler 2017-12-03 13:50:19 PST
Created attachment 328302 [details]
Patch
Comment 11 Darin Adler 2017-12-03 14:48:44 PST
Committed r225463: <https://trac.webkit.org/changeset/225463>
Comment 12 Radar WebKit Bug Importer 2017-12-03 14:50:15 PST
<rdar://problem/35820506>
Comment 13 JF Bastien 2017-12-04 10:03:30 PST
> >> Source/WTF/wtf/Hasher.h:90
> >> +inline void add(Hasher& hasher, double number)
> > 
> > One question about hashing floating-point values is what happens if you hash -0. / +0. (both compare equal) and NaN values (which never compare equal). Do you think we ever do this, and would it be a bug to do so? Maybe you want to normalize zeros to positive, or keep the distinction so they do hash differently? For NaNs is seems like you'd want to canonicalize them to what JSC calls "pure" NaN?
> 
> I think the right thing to do is to treat negative zero and NaNs the same
> way that WTF::FloatHash does, which is to define equality based on
> bitwise_cast, too. Then negative zero can be a distinct hash key, and
> different NaN values can be different as well. I agree that the feature of
> canonicalizing to pure NaN for both hashing and equality comparison could be
> useful if we really wanted NaN to be good as a key, but I’d hate to pay
> extra overhead for that behavior, because I am pretty sure the doubles we
> are using as hash table keys aren’t ever NaN in practice.
> 
> What I realize now, though, is that we don’t want people pairing "double ==
> double" with Hasher hashing. Which makes me think I might need to offer
> something like "equalForHashing" (better name would help) alongside
> "computeHash" so that programmers can easily get it right when they have a
> large set of things that they are hashing that include some floating point
> values. We want them to make use of the bitwise_cast comparison of floating
> point values rather than floating point equality, otherwise they will run
> into problems if they have any negative zeros; values that hash differently
> but are equal which breaks the invariant needed for data structures like
> HashTable.

I've been thinking about this, and I think there are 3 things we could plausibly want:

  1. Deal with bits only, treating -0. != +0. and all NaNs with different payload as different (and same payload as the same).
  2. Deal with IEEE-754 semantics, treating -0. == +0., and all NaNs as different.
  3. Ban -0. and all NaNs from hashing.

I think these all make sense in different context, but choosing 1. and 2. requires knowing what situation we're in! It seems like 3. is the most conservative thing to do if we don't know how hashing is used.

Trying it out here:
  https://bugs.webkit.org/show_bug.cgi?id=180363


> >> Source/WTF/wtf/Hasher.h:92
> >> +    add(hasher, reinterpret_cast<uint64_t&>(number));
> > 
> > You want bitwise_cast here and below.
> 
> Yes, will fix that! Unfortunately that means I have to add an include of
> <wtf/StdLibExtras.h>.
> 
> That’s funny; I knew bitwise_cast was correct for this, but somehow thought
> that I had studied the hash functions in HashFunctions.h and exactly
> imitated them. But I was wrong, they use bitwise_cast as well they should.

I could split it into its own header? Looks like I managed to get it into C++20 as `bit_cast` so I could rename as I move it too. What's neat is the one I got in wg21.link/p0476 is also constexpr, which requires compiler support :-)

WDYT?


(In reply to Darin Adler from comment #7)
> (In reply to JF Bastien from comment #5)
> > As we discussed in person I think it would be cool to rely on the tuple-like
> > type API (with either member or non-member `get` functions), because once we
> > move to C++17 all functions that implemented that API get structured binding
> > for free (case #2 in [1]).
> > 
> > Looking at your code though I'm not sure how much uglier it would then look,
> > since what you have now is pretty elegant. It can be added later, but once
> > too many classes don't do something it gets hard to migrate. Your call :-)
> > 
> >   [1]: http://en.cppreference.com/w/cpp/language/structured_binding
> 
> The one thing I am not clear on about your suggestion is whether you are
> suggesting I use the tuple-like type API approach *instead* of the "override
> add for your class" design of if you are suggesting I make the tuple-like
> support part of the default add function template and still offer "override
> add" as a hook for more exotic hashing requirements.
> 
> Which one did you mean?

Initially I thought the former, but what you have seems way simpler that implementing `get`, so much so that the code cost of making something washable doesn't seem worth it. I'm not sure I like my idea anymore!

(In reply to Darin Adler from comment #8)
> Comment on attachment 328254 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=328254&action=review
> 
> >>> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:141
> >>> +    EXPECT_EQ(2678086759U, computeHash(std::numeric_limits<double>::quiet_NaN()));
> >> 
> >> I'd test other NaNs as well.
> > 
> > OK, I’ll test at least one other NaN. There are tons of them, of course.
> 
> I’m not sure how to do this. How do I make another NaN? I could use
> signaling_NaN, but that is definitely not what I want to test.

Crafting NaNs in C++ isn't the best thing ever... bitwise_cast basically :-(
Comment 14 Darin Adler 2017-12-07 10:11:02 PST
(In reply to JF Bastien from comment #13)
> I've been thinking about this, and I think there are 3 things we could
> plausibly want:
> 
>   1. Deal with bits only, treating -0. != +0. and all NaNs with different
> payload as different (and same payload as the same).
>   2. Deal with IEEE-754 semantics, treating -0. == +0., and all NaNs as
> different.
>   3. Ban -0. and all NaNs from hashing.
> 
> I think these all make sense in different context, but choosing 1. and 2.
> requires knowing what situation we're in! It seems like 3. is the most
> conservative thing to do if we don't know how hashing is used.

We’ve always done (1) for all our floating point hashing; it has never caused a problem, nor do I think it ever will. I don’t think there is an urgent need to make a change. Just adding assertions won’t actually make (3) happen. We’d also have to do extensive testing to see if the assertions fire.

> I could split it into its own header? Looks like I managed to get it into
> C++20 as `bit_cast` so I could rename as I move it too. What's neat is the
> one I got in wg21.link/p0476 is also constexpr, which requires compiler
> support :-)

I don’t think it needs to go into a separate header.

I do think it would be good to move it to be more like the future standard one.

> > I’m not sure how to do this. How do I make another NaN? I could use
> > signaling_NaN, but that is definitely not what I want to test.
> 
> Crafting NaNs in C++ isn't the best thing ever... bitwise_cast basically :-(

OK, maybe I’ll come back eventually and add that. I am not in any hurry to do so.