Bug 174715 - Add a data structure for fixed sets of data that can switch its implementation strategy based on the input
Summary: Add a data structure for fixed sets of data that can switch its implementatio...
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 174195
Blocks:
  Show dependency treegraph
 
Reported: 2017-07-21 09:39 PDT by Sam Weinig
Modified: 2017-12-11 08:49 PST (History)
4 users (show)

See Also:


Attachments
FixedSet Proof of Concept (16.60 KB, patch)
2017-07-22 11:55 PDT, Sam Weinig
no flags Details | Formatted Diff | Diff
Updated Proof of Concept (added MetaStrings) (19.40 KB, patch)
2017-07-22 17:13 PDT, Sam Weinig
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Sam Weinig 2017-07-21 09:39:57 PDT
It's become increasingly common for there to be a need for a fixed (at compile time) set or map of objects. We have increasingly turned to NeverDestroyed<HashSet<>>s for these, but in many cases this is non-optimal. We should add a data structure that chooses the right underlying model (Single item, linear search array, binary search array, hash set, perfect hashing) based on the input (or direct choice by the author in special cases). (as a side note, when sets are needed, we could look into optimizations like having the hash table and the values array in contiguous readonly memory, we could avoid the need for deleted values and perhaps other aspects that are only needed in dynamic hash tables).

References to comments leading to this bug:

I noted in bug 174348:

"My biggest takeaway is that we have a lot of static maps / sets / vectors that never change and that perhaps we should have specialized data structures for them. And, it reaffirmed my curiosity on determining the right size of a list deserves what data structure (e.g. does a list of 4 items really need to be in a HashSet, or would lookups in Vector be just as fast)."


Darin noted in bug 174533:

"I am annoyed by the overuse of HashSet for fixed sets where more efficient algorithms may work better. I think I need to build a FixedSet family of class templates. Then it can be a HashSet only when the set is huge, an array with binary search when it’s smaller, an array with linear search when it’s even smaller, and a single function call when it has exactly one element. Either FixedSet can automatically do the right thing, we can have multiple FixedSet classes with identical interfaces but different implementations (FixedSingleItemSet, FixedLinearSearchSet, FixedBinarySearchSet, FixedHashSet), or FixedSet can take a template parameter to tell it which type of implementation to generate."
Comment 1 Sam Weinig 2017-07-21 09:52:31 PDT
An interesting library that solves some of the goals here and might be useful for inspiration or otherwise is Serge Guelton's Frozen:
(Blog post: https://blog.quarkslab.com/frozen-an-header-only-constexpr-alternative-to-gperf-for-c14-users.html, GitHub: https://github.com/serge-sans-paille/frozen).
Comment 2 Yusuke Suzuki 2017-07-21 10:09:55 PDT
(In reply to Sam Weinig from comment #1)
> An interesting library that solves some of the goals here and might be
> useful for inspiration or otherwise is Serge Guelton's Frozen:
> (Blog post:
> https://blog.quarkslab.com/frozen-an-header-only-constexpr-alternative-to-
> gperf-for-c14-users.html, GitHub:
> https://github.com/serge-sans-paille/frozen).

That sounds cool. Reading the post(http://stevehanov.ca/blog/index.php?id=119), their minimal (good!) perfect-hashing construction is O(N). Not taking too much time to find an appropriate seed for the hash function in compile time.
Comment 3 Yusuke Suzuki 2017-07-21 10:14:42 PDT
Potentially, we can use this structure for static hash tables in JSC's prototypes and constructors.
Currently, we are deploying a bit simple perfect hashing table: Inserting the values to the table. If we find a collision, we just double the size of the table and perform again.
This potentially minimizes the current static hash tables.
Comment 4 Yusuke Suzuki 2017-07-21 10:20:14 PDT
I think we should upgrade Windows buildbot's MSVC to 2017 to use C++14 including related-constexpr. According to this library, it requires C++14 constexpr.
And even if we construct the own one, it becomes complicated code if we do not have C++14 relaxed constexpr.

We already upgraded GCC. So, after upgrading MSVC, we can start using this feature.
Comment 5 Darin Adler 2017-07-21 13:33:06 PDT
We already use perfect hashing implemented with gperf for cases where we decided performance was absolutely critical. The goal here is not necessarily to replace that, but to replace the cases where we needlessly use HashSet and create something that is unnecessarily large in code size and inefficient. With something trivially easy to use correctly and with good performance.

We may want to do this for just sets, or both sets and maps, for just strings or for all types of keys.

I do not think we should use perfect hashing for very small sets. I think we can outperform it with linear search and maybe even with binary search for some larger sets. Even for larger sets, I am not entirely sure perfect hashing is what we should be after. These are the considerations from most important to least, I think:

1) clarity of source code
2) memory use
3) code size
4) speed

Maybe I’m wrong about 2-4, but I am pretty sure that 1 is the top one.
Comment 6 Sam Weinig 2017-07-22 11:55:04 PDT
Created attachment 316206 [details]
FixedSet Proof of Concept

I played with this idea a little this morning and got a simple proof of concept done that works with integers. It will need more work to generalize it, but that's why it's a proof of concept.

The way it works is you do something like:

    static constexpr auto set = makeFixedSet<unsigned>( 5, 4, 1, 2, 3 );

If chooses the data structure to use based on how many arguments are passed to makeFixedSet. For a single argument, it uses a FixedSetSingle. For 4 or less, it uses a FixedSetLinearSearchArray. And for 5 and more, it uses FixedSetBinarySearchArray.

The input does not need to be pre-sorted, as the FixedSetBinarySearchArray does a constexpr sort of the contents.

I left out a HashTable or Perfect Hash variant from this proof of concept, but once additional types are added, I don't think adding those will be too much trouble (famous last words).
Comment 7 Sam Weinig 2017-07-22 17:13:08 PDT
Created attachment 316214 [details]
Updated Proof of Concept (added MetaStrings)

Made a little more progress. I added a mechanism to compile time strings from a parameter pack (see makeFixedStringArray) and added a basic MetaString type. I'd like to get to the point where I can convert the MetaString into a WTF::StaticStringImpl, and have those be the basis for the actual set, but I am thinking that will mean I need to maintain the string length as a constant, so MetaString will probably have to evolve.

For AtomicStrings, I think we will have to take a slightly different tack, and build the constant strings at compile time, but lazily create the actual AtomicStringImpls at first use.
Comment 8 Yusuke Suzuki 2017-12-11 08:49:46 PST
Now, VC++ is updated: it means that relaxed constexpr is available in WebKit!