Bug 224968

Summary: Extend SortedArrayMap further to work on case-folded strings, use in MIMETypeRegistry
Product: WebKit Reporter: Darin Adler <darin>
Component: Web Template FrameworkAssignee: Darin Adler <darin>
Status: RESOLVED FIXED    
Severity: Normal CC: benjamin, cdumez, cmarcelo, dino, ews-watchlist, fmalita, gyuyoung.kim, koivisto, pdr, sabouhallawa, sam, schenney, sergio, webkit-bug-importer, ysuzuki
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: All   
OS: All   
Attachments:
Description Flags
Patch
none
Patch
ews-feeder: commit-queue-
Patch sam: review+

Description Darin Adler 2021-04-22 21:28:58 PDT
Extend SortedArrayMap further to work on case-folded strings, use in MIMETypeRegistry
Comment 1 Darin Adler 2021-04-22 21:53:59 PDT Comment hidden (obsolete)
Comment 2 Darin Adler 2021-04-27 17:00:16 PDT Comment hidden (obsolete)
Comment 3 Darin Adler 2021-04-27 17:53:24 PDT
Created attachment 427221 [details]
Patch
Comment 4 Darin Adler 2021-04-27 23:13:16 PDT
Yusuke, this is working properly now and when you get a chance I think you should be the one to review it, although I would welcome input from others as well.
Comment 5 Sam Weinig 2021-04-28 12:12:26 PDT
Comment on attachment 427221 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=427221&action=review

> Source/WTF/wtf/SortedArrayMap.h:117
> +// FIXME: Use std::all_of instead of this and remove it, once we require C++20.
> +template<typename Iterator, typename Predicate> constexpr bool allOfConstExpr(Iterator first, Iterator last, Predicate predicate)
> +{
> +    for (; first != last; ++first) {
> +        if (!predicate(*first))
> +            return false;
> +    }
> +    return true;
> +}

I think putting this in StdLibExtras.h would make more sense. And it gives us fewer places to search when we upgrade to c++20. (There are probably things in there we can remove now actually).

> Source/WTF/wtf/SortedArrayMap.h:202
> +template<typename CharacterType> inline bool lessThanASCIICaseFolding(const CharacterType* characters, unsigned length, const char* literalWithNoUppercase)

Can this be constexpr?

> Source/WTF/wtf/SortedArrayMap.h:218
> +    if (string.is8Bit())
> +        return lessThanASCIICaseFolding(string.characters8(), string.length(), literalWithNoUppercase);
> +    return lessThanASCIICaseFolding(string.characters16(), string.length(), literalWithNoUppercase);

Should we consider a utility on StringView/String that does this for us?

return string.withCharacters([&] (const auto* characters, unsigned length) {
    lessThanASCIICaseFolding(characters, length, literalWithNoUppercase);
});

It's the same number of lines, so I am not sure.

> Source/WebCore/platform/MIMETypeRegistry.cpp:78
> +constexpr ComparableCaseFoldingASCIILiteral supportedImageMIMETypeArray[] = {

With a definition like this, where, e.g. what section, do these values end up getting stored in the final binary? 

My feel for what constexpr on a global, with and without static, means is still not super strong.

> Source/WebCore/platform/MIMETypeRegistry.cpp:243
> +constexpr ComparableLettersLiteral pdfMIMETypeArray[] = {
> +    "application/pdf",
> +    "text/pdf",
> +};
> +
> +FixedVector<const char*> MIMETypeRegistry::pdfMIMETypes()

Could this return a SortedArraySet<> to avoid any additional allocation?

> Tools/TestWebKitAPI/Tests/WTF/SortedArrayMap.cpp:97
> +    EXPECT_FALSE(caseFoldingSet.contains(""));
> +    EXPECT_TRUE(caseFoldingSet.contains("_"));
> +    EXPECT_TRUE(caseFoldingSet.contains("c"));
> +    EXPECT_TRUE(caseFoldingSet.contains("delightful"));
> +    EXPECT_FALSE(caseFoldingSet.contains("d"));
> +    EXPECT_TRUE(caseFoldingSet.contains("q_"));
> +    EXPECT_FALSE(caseFoldingSet.contains("q__"));
> +
> +    EXPECT_FALSE(lettersSet.contains(""));
> +    EXPECT_FALSE(lettersSet.contains("_"));
> +    EXPECT_TRUE(lettersSet.contains("c"));
> +    EXPECT_TRUE(lettersSet.contains("delightful"));
> +    EXPECT_FALSE(lettersSet.contains("d"));
> +    EXPECT_FALSE(lettersSet.contains("q_"));
> +    EXPECT_FALSE(lettersSet.contains("q__"));
> +
> +    ASSERT_TRUE(scriptTypesSet.contains("text/javascript"));
> +    ASSERT_TRUE(scriptTypesSet.contains("TEXT/JAVASCRIPT"));
> +    ASSERT_TRUE(scriptTypesSet.contains("application/javascript"));
> +    ASSERT_TRUE(scriptTypesSet.contains("application/ecmascript"));
> +    ASSERT_TRUE(scriptTypesSet.contains("application/x-javascript"));
> +    ASSERT_TRUE(scriptTypesSet.contains("application/x-ecmascript"));
> +    ASSERT_FALSE(scriptTypesSet.contains("text/plain"));
> +    ASSERT_FALSE(scriptTypesSet.contains("application/json"));
> +    ASSERT_FALSE(scriptTypesSet.contains("foo/javascript"));

I think having static_assert() variants would be good. Or maybe just switch these all to static_asserts.
Comment 6 Darin Adler 2021-04-28 13:26:40 PDT
Comment on attachment 427221 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=427221&action=review

>> Source/WTF/wtf/SortedArrayMap.h:117
>> +}
> 
> I think putting this in StdLibExtras.h would make more sense. And it gives us fewer places to search when we upgrade to c++20. (There are probably things in there we can remove now actually).

I am happy to move them.

>> Source/WTF/wtf/SortedArrayMap.h:202
>> +template<typename CharacterType> inline bool lessThanASCIICaseFolding(const CharacterType* characters, unsigned length, const char* literalWithNoUppercase)
> 
> Can this be constexpr?

It can’t be constexpr because toASCIILower is not constexpr.

toASCIILower is not constexpr because it uses a lookup table, and that lookup table uses external linkage so we don’t replicate it in each executable that calls toASCIILower.

Maybe there is a way to make toASCIILower a hybrid that uses lookup tables when used in non-constexpr constants.

Anyway, none of that is important because this function doesn’t need to be used in constexpr contexts at this time.

>> Source/WTF/wtf/SortedArrayMap.h:218
>> +    return lessThanASCIICaseFolding(string.characters16(), string.length(), literalWithNoUppercase);
> 
> Should we consider a utility on StringView/String that does this for us?
> 
> return string.withCharacters([&] (const auto* characters, unsigned length) {
>     lessThanASCIICaseFolding(characters, length, literalWithNoUppercase);
> });
> 
> It's the same number of lines, so I am not sure.

I am open to that different idiom.

This follows a long-standing pattern; 11 different functions in StringCommon.h work this way, and even use it as an abstraction across string types (at least across String, StringImpl, and StringView).

The reason *this* function isn’t in StringCommon.h yet is that I was worried about overgeneralizing it. If I moved it there I would be tempted to try to make it work with all sorts of combinations of types and that’s not needed for the current purpose.

I’d like to use types more consistently for the various flavors of ASCII literals. A modern version of equalIgnoringASCIICase or equalLettersIgnoringASCIICase might lean more on the type system rather than the function name. Here in this file, the types serve three different purposes:

1) To record compile-time invariants about the literals.

2) To define what == and < mean, incorporating rules about case folding, so we can use algorithms directly rather than passing in predicates.

3) To define the "parse" function, which is a trick for making the SortedArrayMap/Set templates work efficiently when there is a transformation needed on a passed-in object before doing the search in the sorted array. It’s particularly useful for the special packed versions that Yusuke created where we turn the string into a fixed size integer.

>> Source/WebCore/platform/MIMETypeRegistry.cpp:78
>> +constexpr ComparableCaseFoldingASCIILiteral supportedImageMIMETypeArray[] = {
> 
> With a definition like this, where, e.g. what section, do these values end up getting stored in the final binary? 
> 
> My feel for what constexpr on a global, with and without static, means is still not super strong.

I don’t know the answer, but it’s the "best answer". These are constant data. They get fully computed at compile time and stored in the same section that individual string constants are stored in.

>> Tools/TestWebKitAPI/Tests/WTF/SortedArrayMap.cpp:97
>> +    ASSERT_FALSE(scriptTypesSet.contains("foo/javascript"));
> 
> I think having static_assert() variants would be good. Or maybe just switch these all to static_asserts.

These can’t currently be static assertions because the contains function uses the non-constexpr function tryBinarySearch, and the case folding comparisons in the binary search are done with the non-constexpr function lessThanASCIICaseFolding.

Both of those could be rewritten to be constexpr if there was a good reason to do so, but I wouldn’t do it *just* so we could test at compile time.
Comment 7 Darin Adler 2021-04-28 13:32:50 PDT
Comment on attachment 427221 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=427221&action=review

>> Source/WebCore/platform/MIMETypeRegistry.cpp:243
>> +FixedVector<const char*> MIMETypeRegistry::pdfMIMETypes()
> 
> Could this return a SortedArraySet<> to avoid any additional allocation?

If we really cared about the allocation in this case (and I don’t, this is not the case we want to optimize, since it’s only used in one-time initialization of a method that is mostly not called), we would want to return an array view on ASCII strings that can be converted into NSString. Not sure the best type for exposing that cleanly; just a pointer and a count, but also want the caller to just treat this as const char* without having to expose a ComparableXXX type. Would prefer not to just expose implementation detail in the header by using SortedArraySet directly.
Comment 8 Darin Adler 2021-04-28 13:32:51 PDT Comment hidden (obsolete)
Comment 9 Sam Weinig 2021-04-28 18:16:22 PDT
(In reply to Darin Adler from comment #6)
> Comment on attachment 427221 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=427221&action=review
> 
> >> Source/WTF/wtf/SortedArrayMap.h:117
> >> +}
> > 
> > I think putting this in StdLibExtras.h would make more sense. And it gives us fewer places to search when we upgrade to c++20. (There are probably things in there we can remove now actually).
> 
> I am happy to move them.
> 
> >> Source/WTF/wtf/SortedArrayMap.h:202
> >> +template<typename CharacterType> inline bool lessThanASCIICaseFolding(const CharacterType* characters, unsigned length, const char* literalWithNoUppercase)
> > 
> > Can this be constexpr?
> 
> It can’t be constexpr because toASCIILower is not constexpr.
> 
> toASCIILower is not constexpr because it uses a lookup table, and that
> lookup table uses external linkage so we don’t replicate it in each
> executable that calls toASCIILower.
> 
> Maybe there is a way to make toASCIILower a hybrid that uses lookup tables
> when used in non-constexpr constants.
> 
> Anyway, none of that is important because this function doesn’t need to be
> used in constexpr contexts at this time.

When we can use std::is_constant_evaluated we can start doing this. https://en.cppreference.com/w/cpp/types/is_constant_evaluated

> 
> >> Source/WTF/wtf/SortedArrayMap.h:218
> >> +    return lessThanASCIICaseFolding(string.characters16(), string.length(), literalWithNoUppercase);
> > 
> > Should we consider a utility on StringView/String that does this for us?
> > 
> > return string.withCharacters([&] (const auto* characters, unsigned length) {
> >     lessThanASCIICaseFolding(characters, length, literalWithNoUppercase);
> > });
> > 
> > It's the same number of lines, so I am not sure.
> 
> I am open to that different idiom.
> 
> This follows a long-standing pattern; 11 different functions in
> StringCommon.h work this way, and even use it as an abstraction across
> string types (at least across String, StringImpl, and StringView).
> 
> The reason *this* function isn’t in StringCommon.h yet is that I was worried
> about overgeneralizing it. If I moved it there I would be tempted to try to
> make it work with all sorts of combinations of types and that’s not needed
> for the current purpose.
> 
> I’d like to use types more consistently for the various flavors of ASCII
> literals. A modern version of equalIgnoringASCIICase or
> equalLettersIgnoringASCIICase might lean more on the type system rather than
> the function name. Here in this file, the types serve three different
> purposes:
> 
> 1) To record compile-time invariants about the literals.
> 
> 2) To define what == and < mean, incorporating rules about case folding, so
> we can use algorithms directly rather than passing in predicates.
> 
> 3) To define the "parse" function, which is a trick for making the
> SortedArrayMap/Set templates work efficiently when there is a transformation
> needed on a passed-in object before doing the search in the sorted array.
> It’s particularly useful for the special packed versions that Yusuke created
> where we turn the string into a fixed size integer.
> 
> >> Source/WebCore/platform/MIMETypeRegistry.cpp:78
> >> +constexpr ComparableCaseFoldingASCIILiteral supportedImageMIMETypeArray[] = {
> > 
> > With a definition like this, where, e.g. what section, do these values end up getting stored in the final binary? 
> > 
> > My feel for what constexpr on a global, with and without static, means is still not super strong.
> 
> I don’t know the answer, but it’s the "best answer". These are constant
> data. They get fully computed at compile time and stored in the same section
> that individual string constants are stored in.
> 
> >> Tools/TestWebKitAPI/Tests/WTF/SortedArrayMap.cpp:97
> >> +    ASSERT_FALSE(scriptTypesSet.contains("foo/javascript"));
> > 
> > I think having static_assert() variants would be good. Or maybe just switch these all to static_asserts.
> 
> These can’t currently be static assertions because the contains function
> uses the non-constexpr function tryBinarySearch, and the case folding
> comparisons in the binary search are done with the non-constexpr function
> lessThanASCIICaseFolding.
> 
> Both of those could be rewritten to be constexpr if there was a good reason
> to do so, but I wouldn’t do it *just* so we could test at compile time.

Makes sense.
Comment 10 Sam Weinig 2021-04-28 18:20:38 PDT
(In reply to Darin Adler from comment #7)
> Comment on attachment 427221 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=427221&action=review
> 
> >> Source/WebCore/platform/MIMETypeRegistry.cpp:243
> >> +FixedVector<const char*> MIMETypeRegistry::pdfMIMETypes()
> > 
> > Could this return a SortedArraySet<> to avoid any additional allocation?
> 
> If we really cared about the allocation in this case (and I don’t, this is
> not the case we want to optimize, since it’s only used in one-time
> initialization of a method that is mostly not called), we would want to
> return an array view on ASCII strings that can be converted into NSString.
> Not sure the best type for exposing that cleanly; just a pointer and a
> count, but also want the caller to just treat this as const char* without
> having to expose a ComparableXXX type. Would prefer not to just expose
> implementation detail in the header by using SortedArraySet directly.

In these cases it is clearly unnecessary, but it's an intriguing problem. I'll keep thinking about it for future uses.
Comment 11 Darin Adler 2021-04-29 09:40:05 PDT
Committed r276780 (237163@main): <https://commits.webkit.org/237163@main>
Comment 12 Radar WebKit Bug Importer 2021-04-29 09:41:14 PDT
<rdar://problem/77325848>
Comment 13 Antti Koivisto 2021-05-02 23:34:02 PDT
Comment on attachment 427221 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=427221&action=review

> Source/WTF/wtf/SortedArrayMap.h:80
> +using ComparableASCIILiteral = ComparableASCIISubsetLiteral<ASCIISubset::All>;
> +using ComparableCaseFoldingASCIILiteral = ComparableASCIISubsetLiteral<ASCIISubset::NoUppercaseLetters>;
> +using ComparableLettersLiteral = ComparableASCIISubsetLiteral<ASCIISubset::NoUppercaseLettersOptimized>;

Why doesn't the last one say "ASCII"?
Comment 14 Darin Adler 2021-05-03 10:11:33 PDT
Comment on attachment 427221 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=427221&action=review

>> Source/WTF/wtf/SortedArrayMap.h:80
>> +using ComparableLettersLiteral = ComparableASCIISubsetLiteral<ASCIISubset::NoUppercaseLettersOptimized>;
> 
> Why doesn't the last one say "ASCII"?

It probably should.

In bug 225251, Sam is and I are discussing better names for this. I’ll do a rename as soon as we decide.