180318 – Add WTF::Hasher, an easier to use replacement for WTF::IntegerHasher

RESOLVED FIXED 180318

Add WTF::Hasher, an easier to use replacement for WTF::IntegerHasher

https://bugs.webkit.org/show_bug.cgi?id=180318

Summary Add WTF::Hasher, an easier to use replacement for WTF::IntegerHasher

Darin Adler

Reported 2017-12-02 13:20:46 PST

Add WTF::Hasher, an easier to use replacement for WTF::IntegerHasher

Attachments
Patch (53.22 KB, patch) 2017-12-02 13:31 PST, Darin Adler	no flags	Details Formatted Diff Diff
Patch (52.75 KB, patch) 2017-12-02 14:08 PST, Darin Adler	jfbastien: review+	Details Formatted Diff Diff
Patch (54.51 KB, patch) 2017-12-03 13:18 PST, Darin Adler	no flags	Details Formatted Diff Diff
Patch (54.52 KB, patch) 2017-12-03 13:50 PST, Darin Adler	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Darin Adler

Comment 1 2017-12-02 13:31:53 PST Comment hidden (obsolete)

Created attachment 328251 [details] Patch

Darin Adler

Comment 2 2017-12-02 13:33:39 PST

This is the class I was talking about. I decided to just post the new Hasher itself first, then I can upload patches for starting to use it in various places where we use IntegerHasher and StringHasher now.

Darin Adler

Comment 3 2017-12-02 14:08:04 PST

Created attachment 328254 [details] Patch

Darin Adler

Comment 4 2017-12-02 16:39:57 PST

OK, everything green on EWS. Ready for review.

JF Bastien

Comment 5 2017-12-02 21:17:04 PST

Comment on attachment 328254 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=328254&action=review r=me with a few comments, and a design suggestion. As we discussed in person I think it would be cool to rely on the tuple-like type API (with either member or non-member `get` functions), because once we move to C++17 all functions that implemented that API get structured binding for free (case #2 in [1]). Looking at your code though I'm not sure how much uglier it would then look, since what you have now is pretty elegant. It can be added later, but once too many classes don't do something it gets hard to migrate. Your call :-) [1]: http://en.cppreference.com/w/cpp/language/structured_binding > Source/WTF/wtf/Hasher.h:31 > + void add(unsigned integer) It's a bit odd to mix unsigned here and for hash, with uint32_t below. I'd go for uint32_t everywhere. > Source/WTF/wtf/Hasher.h:65 > + template<typename UnsignedInteger> friend void add(std::enable_if_t<std::is_unsigned<UnsignedInteger>::value && sizeof(UnsignedInteger) <= sizeof(uint32_t), Hasher&> hasher, UnsignedInteger integer) Just personal preference, but I prefer using enable_if for the return type instead of on parameters because I find it easier to read (that is, unless it can be a default template parameter which you can't do here because you're overloading `add`). > Source/WTF/wtf/Hasher.h:68 > + // We can consider adding a more efficient code path for hashing 16-bit values if needed, perhaps using addCharacter, FIXME follow-up bugs for those? > Source/WTF/wtf/Hasher.h:90 > +inline void add(Hasher& hasher, double number) One question about hashing floating-point values is what happens if you hash -0. / +0. (both compare equal) and NaN values (which never compare equal). Do you think we ever do this, and would it be a bug to do so? Maybe you want to normalize zeros to positive, or keep the distinction so they do hash differently? For NaNs is seems like you'd want to canonicalize them to what JSC calls "pure" NaN? > Source/WTF/wtf/Hasher.h:92 > + add(hasher, reinterpret_cast<uint64_t&>(number)); You want bitwise_cast here and below. > Source/WTF/wtf/Hasher.h:111 > + for (auto& value : container) const auto& (I know it already is, I just like to say it explicitly). > Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:49 > +} ? > Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:55 > + EXPECT_EQ(zero32BitHash, computeHash(static_cast<char>(0))); Here and below, you should also test signed char and unsigned char since they're distinct from char. > Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:129 > + EXPECT_EQ(zero64BitHash, computeHash(0.0)); As discussed above, I'd figure out what we want from -0. and if it's the same as +0. (won't be if you just bitwise_cast). > Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:141 > + EXPECT_EQ(2678086759U, computeHash(std::numeric_limits<double>::quiet_NaN())); I'd test other NaNs as well.

Darin Adler

Comment 6 2017-12-03 11:41:18 PST

Comment on attachment 328254 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=328254&action=review Thanks for the review, and helpful comments. >> Source/WTF/wtf/Hasher.h:31 >> + void add(unsigned integer) > > It's a bit odd to mix unsigned here and for hash, with uint32_t below. I'd go for uint32_t everywhere. This is IntegerHasher, unmodified, and something I intend to delete soon. I would also have gone with uint32_t if I was changing the class at all. I can switch to uint32_t but I am not sure I should. Really doesn’t matter since I will delete this soon. >> Source/WTF/wtf/Hasher.h:65 >> + template<typename UnsignedInteger> friend void add(std::enable_if_t<std::is_unsigned<UnsignedInteger>::value && sizeof(UnsignedInteger) <= sizeof(uint32_t), Hasher&> hasher, UnsignedInteger integer) > > Just personal preference, but I prefer using enable_if for the return type instead of on parameters because I find it easier to read (that is, unless it can be a default template parameter which you can't do here because you're overloading `add`). OK. I have never used enable_if at all before, so I have no preference. >> Source/WTF/wtf/Hasher.h:68 >> + // We can consider adding a more efficient code path for hashing 16-bit values if needed, perhaps using addCharacter, > > FIXME follow-up bugs for those? I don’t think of these as necessary projects, just possibilities, which is why I neither used FIXME in the comments nor filed bugs. Maybe you feel differently and think these are necessary? >> Source/WTF/wtf/Hasher.h:90 >> +inline void add(Hasher& hasher, double number) > > One question about hashing floating-point values is what happens if you hash -0. / +0. (both compare equal) and NaN values (which never compare equal). Do you think we ever do this, and would it be a bug to do so? Maybe you want to normalize zeros to positive, or keep the distinction so they do hash differently? For NaNs is seems like you'd want to canonicalize them to what JSC calls "pure" NaN? I think the right thing to do is to treat negative zero and NaNs the same way that WTF::FloatHash does, which is to define equality based on bitwise_cast, too. Then negative zero can be a distinct hash key, and different NaN values can be different as well. I agree that the feature of canonicalizing to pure NaN for both hashing and equality comparison could be useful if we really wanted NaN to be good as a key, but I’d hate to pay extra overhead for that behavior, because I am pretty sure the doubles we are using as hash table keys aren’t ever NaN in practice. What I realize now, though, is that we don’t want people pairing "double == double" with Hasher hashing. Which makes me think I might need to offer something like "equalForHashing" (better name would help) alongside "computeHash" so that programmers can easily get it right when they have a large set of things that they are hashing that include some floating point values. We want them to make use of the bitwise_cast comparison of floating point values rather than floating point equality, otherwise they will run into problems if they have any negative zeros; values that hash differently but are equal which breaks the invariant needed for data structures like HashTable. >> Source/WTF/wtf/Hasher.h:92 >> + add(hasher, reinterpret_cast<uint64_t&>(number)); > > You want bitwise_cast here and below. Yes, will fix that! Unfortunately that means I have to add an include of <wtf/StdLibExtras.h>. That’s funny; I knew bitwise_cast was correct for this, but somehow thought that I had studied the hash functions in HashFunctions.h and exactly imitated them. But I was wrong, they use bitwise_cast as well they should. >> Source/WTF/wtf/Hasher.h:111 >> + for (auto& value : container) > > const auto& (I know it already is, I just like to say it explicitly). My attitude is opposite. I am sort of pretending I have the future version of C++ where I can just say for (value : container) and not have to write auto&&. For me auto& is my attempt to say nothing about the type. And so "const" seems extraneous to me. But I will add "const" since you request it and I don’t feel strongly. >> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:49 >> +} > > ? Oops, I think that was supposed to have all the emptyHash cases in it, but they are in addMultiple now. >> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:55 >> + EXPECT_EQ(zero32BitHash, computeHash(static_cast<char>(0))); > > Here and below, you should also test signed char and unsigned char since they're distinct from char. Sure, I was aware of that. But I do test them, just as int8_t and uint8_t. I could also test them under the names "signed char" and "unsigned char" if you think including both is clearer. But I don’t do that for "signed short" and "unsigned short" either. Or I could move to testing with the names "char" and "short" since for sizes 32 and above I am testing using built-in types, not typedefs. >> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:129 >> + EXPECT_EQ(zero64BitHash, computeHash(0.0)); > > As discussed above, I'd figure out what we want from -0. and if it's the same as +0. (won't be if you just bitwise_cast). I think we want to stay consistent with WTF::FloatHash and we don’t want it to be the same as +0; I will add a test for it. >> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:141 >> + EXPECT_EQ(2678086759U, computeHash(std::numeric_limits<double>::quiet_NaN())); > > I'd test other NaNs as well. OK, I’ll test at least one other NaN. There are tons of them, of course.

Darin Adler

Comment 7 2017-12-03 11:49:44 PST

(In reply to JF Bastien from comment #5) > As we discussed in person I think it would be cool to rely on the tuple-like > type API (with either member or non-member `get` functions), because once we > move to C++17 all functions that implemented that API get structured binding > for free (case #2 in [1]). > > Looking at your code though I'm not sure how much uglier it would then look, > since what you have now is pretty elegant. It can be added later, but once > too many classes don't do something it gets hard to migrate. Your call :-) > > [1]: http://en.cppreference.com/w/cpp/language/structured_binding The one thing I am not clear on about your suggestion is whether you are suggesting I use the tuple-like type API approach *instead* of the "override add for your class" design of if you are suggesting I make the tuple-like support part of the default add function template and still offer "override add" as a hook for more exotic hashing requirements. Which one did you mean?

Darin Adler

Comment 8 2017-12-03 12:05:27 PST

Comment on attachment 328254 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=328254&action=review >>> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:141 >>> + EXPECT_EQ(2678086759U, computeHash(std::numeric_limits<double>::quiet_NaN())); >> >> I'd test other NaNs as well. > > OK, I’ll test at least one other NaN. There are tons of them, of course. I’m not sure how to do this. How do I make another NaN? I could use signaling_NaN, but that is definitely not what I want to test.

Darin Adler

Comment 9 2017-12-03 13:18:26 PST Comment hidden (obsolete)

Created attachment 328300 [details] Patch

Darin Adler

Comment 10 2017-12-03 13:50:19 PST

Created attachment 328302 [details] Patch

Darin Adler

Comment 11 2017-12-03 14:48:44 PST

Committed r225463: <https://trac.webkit.org/changeset/225463>

Radar WebKit Bug Importer

Comment 12 2017-12-03 14:50:15 PST

<rdar://problem/35820506>

JF Bastien

Comment 13 2017-12-04 10:03:30 PST

> >> Source/WTF/wtf/Hasher.h:90 > >> +inline void add(Hasher& hasher, double number) > > > > One question about hashing floating-point values is what happens if you hash -0. / +0. (both compare equal) and NaN values (which never compare equal). Do you think we ever do this, and would it be a bug to do so? Maybe you want to normalize zeros to positive, or keep the distinction so they do hash differently? For NaNs is seems like you'd want to canonicalize them to what JSC calls "pure" NaN? > > I think the right thing to do is to treat negative zero and NaNs the same > way that WTF::FloatHash does, which is to define equality based on > bitwise_cast, too. Then negative zero can be a distinct hash key, and > different NaN values can be different as well. I agree that the feature of > canonicalizing to pure NaN for both hashing and equality comparison could be > useful if we really wanted NaN to be good as a key, but I’d hate to pay > extra overhead for that behavior, because I am pretty sure the doubles we > are using as hash table keys aren’t ever NaN in practice. > > What I realize now, though, is that we don’t want people pairing "double == > double" with Hasher hashing. Which makes me think I might need to offer > something like "equalForHashing" (better name would help) alongside > "computeHash" so that programmers can easily get it right when they have a > large set of things that they are hashing that include some floating point > values. We want them to make use of the bitwise_cast comparison of floating > point values rather than floating point equality, otherwise they will run > into problems if they have any negative zeros; values that hash differently > but are equal which breaks the invariant needed for data structures like > HashTable. I've been thinking about this, and I think there are 3 things we could plausibly want: 1. Deal with bits only, treating -0. != +0. and all NaNs with different payload as different (and same payload as the same). 2. Deal with IEEE-754 semantics, treating -0. == +0., and all NaNs as different. 3. Ban -0. and all NaNs from hashing. I think these all make sense in different context, but choosing 1. and 2. requires knowing what situation we're in! It seems like 3. is the most conservative thing to do if we don't know how hashing is used. Trying it out here: https://bugs.webkit.org/show_bug.cgi?id=180363 > >> Source/WTF/wtf/Hasher.h:92 > >> + add(hasher, reinterpret_cast<uint64_t&>(number)); > > > > You want bitwise_cast here and below. > > Yes, will fix that! Unfortunately that means I have to add an include of > <wtf/StdLibExtras.h>. > > That’s funny; I knew bitwise_cast was correct for this, but somehow thought > that I had studied the hash functions in HashFunctions.h and exactly > imitated them. But I was wrong, they use bitwise_cast as well they should. I could split it into its own header? Looks like I managed to get it into C++20 as `bit_cast` so I could rename as I move it too. What's neat is the one I got in wg21.link/p0476 is also constexpr, which requires compiler support :-) WDYT? (In reply to Darin Adler from comment #7) > (In reply to JF Bastien from comment #5) > > As we discussed in person I think it would be cool to rely on the tuple-like > > type API (with either member or non-member `get` functions), because once we > > move to C++17 all functions that implemented that API get structured binding > > for free (case #2 in [1]). > > > > Looking at your code though I'm not sure how much uglier it would then look, > > since what you have now is pretty elegant. It can be added later, but once > > too many classes don't do something it gets hard to migrate. Your call :-) > > > > [1]: http://en.cppreference.com/w/cpp/language/structured_binding > > The one thing I am not clear on about your suggestion is whether you are > suggesting I use the tuple-like type API approach *instead* of the "override > add for your class" design of if you are suggesting I make the tuple-like > support part of the default add function template and still offer "override > add" as a hook for more exotic hashing requirements. > > Which one did you mean? Initially I thought the former, but what you have seems way simpler that implementing `get`, so much so that the code cost of making something washable doesn't seem worth it. I'm not sure I like my idea anymore! (In reply to Darin Adler from comment #8) > Comment on attachment 328254 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=328254&action=review > > >>> Tools/TestWebKitAPI/Tests/WTF/Hasher.cpp:141 > >>> + EXPECT_EQ(2678086759U, computeHash(std::numeric_limits<double>::quiet_NaN())); > >> > >> I'd test other NaNs as well. > > > > OK, I’ll test at least one other NaN. There are tons of them, of course. > > I’m not sure how to do this. How do I make another NaN? I could use > signaling_NaN, but that is definitely not what I want to test. Crafting NaNs in C++ isn't the best thing ever... bitwise_cast basically :-(

Darin Adler

Comment 14 2017-12-07 10:11:02 PST

(In reply to JF Bastien from comment #13) > I've been thinking about this, and I think there are 3 things we could > plausibly want: > > 1. Deal with bits only, treating -0. != +0. and all NaNs with different > payload as different (and same payload as the same). > 2. Deal with IEEE-754 semantics, treating -0. == +0., and all NaNs as > different. > 3. Ban -0. and all NaNs from hashing. > > I think these all make sense in different context, but choosing 1. and 2. > requires knowing what situation we're in! It seems like 3. is the most > conservative thing to do if we don't know how hashing is used. We’ve always done (1) for all our floating point hashing; it has never caused a problem, nor do I think it ever will. I don’t think there is an urgent need to make a change. Just adding assertions won’t actually make (3) happen. We’d also have to do extensive testing to see if the assertions fire. > I could split it into its own header? Looks like I managed to get it into > C++20 as `bit_cast` so I could rename as I move it too. What's neat is the > one I got in wg21.link/p0476 is also constexpr, which requires compiler > support :-) I don’t think it needs to go into a separate header. I do think it would be good to move it to be more like the future standard one. > > I’m not sure how to do this. How do I make another NaN? I could use > > signaling_NaN, but that is definitely not what I want to test. > > Crafting NaNs in C++ isn't the best thing ever... bitwise_cast basically :-( OK, maybe I’ll come back eventually and add that. I am not in any hurry to do so.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component Web Template Framework

Assignee

Darin Adler

Reported

2017-12-02 13:20 PST

Modified

2017-12-07 10:11 PST History

CC List

5 users Show

URL

Keywords InRadar

Depends on

Blocks

180363

Dependencies

tree graph