Bug 224727 - Use binary-search in LocaleToScriptMapping
Summary: Use binary-search in LocaleToScriptMapping
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2021-04-17 20:31 PDT by Yusuke Suzuki
Modified: 2021-04-19 22:07 PDT (History)
3 users (show)

See Also:


Attachments
Patch (34.99 KB, patch)
2021-04-17 20:36 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (35.13 KB, patch)
2021-04-17 20:47 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (35.38 KB, patch)
2021-04-17 21:02 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (40.32 KB, patch)
2021-04-17 22:26 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (40.32 KB, patch)
2021-04-17 22:27 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (31.08 KB, patch)
2021-04-17 23:26 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Yusuke Suzuki 2021-04-17 20:31:34 PDT
Use binary-search in LocaleToScriptMapping
Comment 1 Yusuke Suzuki 2021-04-17 20:36:13 PDT
Created attachment 426363 [details]
Patch
Comment 2 Darin Adler 2021-04-17 20:46:44 PDT
Comment on attachment 426363 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=426363&action=review

Great idea!

> Source/WebCore/ChangeLog:10
> +        This patch removes HashMaps in LocaleToScriptMapping, and binary-search onto the constant data arrays.
> +        These maps are not frequently used. Keys of the maps can be encoded into uint32_t or uint64_t so that
> +        comparison becomes super cheap and we can initialize this array at compile-time.

I’ve long wanted to make a re-usable binary search constant data array class. Glad you are doing this here; there seem to be many other places using hash tables that could do this sort of thing instead.

> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:41
> +using ScriptName = uint32_t;

I’d like to see this use a bit more template machinery so we can do this for 4 and 8 without writing two separate sets of functions.

> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:50
> +        result |= (static_cast<ScriptName>(code) << ((4 - index - 1) * CHAR_BIT));

I think we can use 8 instead of CHAR_BIT; I don’t think CHAR_BIT helps readability or portability here.

> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:55
> +static Optional<ScriptName> tryParseScriptName(StringView string)

Not 100% sure this needs "try" in its name. It is quite common throughout all kinds of WebKit code for functions named parse to have failure modes and use Optional to express that failure.

> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:64
> +        result |= (static_cast<ScriptName>(toASCIILower(code)) << ((4 - index - 1) * CHAR_BIT));

I think we can use 8 instead of CHAR_BIT; I don’t think CHAR_BIT helps readability or portability here.

Ideally we’d find a way to share code so we know the above is the same.

> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:199
> +    auto* element = tryBinarySearch<ScriptNameCode>(scriptNameCodeList, WTF_ARRAY_LENGTH(scriptNameCodeList), name.value(),

std::size should work and seems nicer to use than WTF_ARRAY_LENGTH

> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:455
> +        auto* element = tryBinarySearch<LocaleScript>(localeScriptList, WTF_ARRAY_LENGTH(localeScriptList), localeName.value(),

std::size should work and seems nicer to use than WTF_ARRAY_LENGTH

Also would be nice to make another template like tryBinarySearch, maybe even an overload of it, that handles the sizing automatically. Seems like it should be able to deduce both the type and the size without an explicit template or function argument for either.
Comment 3 Yusuke Suzuki 2021-04-17 20:47:19 PDT
Created attachment 426364 [details]
Patch
Comment 4 Darin Adler 2021-04-17 20:49:09 PDT
Since I already reviewed, I won’t set review+ on the new patch unless you tell me you aren’t going to make any more changes. Otherwise, I’ll wait for a newer patch that takes some of my suggestions and review again then.
Comment 5 Darin Adler 2021-04-17 20:50:19 PDT
More generally I think we can do this binary search pattern with ASCIILiteral in much the same way you are doing here with the fixed length packing into an integer. This is even more optimized, but something reusable could help encourage people not to use HashMap for fixed things like this.
Comment 6 Yusuke Suzuki 2021-04-17 21:02:07 PDT
Created attachment 426366 [details]
Patch
Comment 7 Yusuke Suzuki 2021-04-17 21:02:46 PDT
Ah, the uploaded patch is simply adding static_assert based test :)
Comment 8 Yusuke Suzuki 2021-04-17 22:25:56 PDT
Comment on attachment 426363 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=426363&action=review

Thanks

>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:41
>> +using ScriptName = uint32_t;
> 
> I’d like to see this use a bit more template machinery so we can do this for 4 and 8 without writing two separate sets of functions.

Changed.

>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:50
>> +        result |= (static_cast<ScriptName>(code) << ((4 - index - 1) * CHAR_BIT));
> 
> I think we can use 8 instead of CHAR_BIT; I don’t think CHAR_BIT helps readability or portability here.

Changed.

>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:55
>> +static Optional<ScriptName> tryParseScriptName(StringView string)
> 
> Not 100% sure this needs "try" in its name. It is quite common throughout all kinds of WebKit code for functions named parse to have failure modes and use Optional to express that failure.

Changed it to paseScriptName. Instead, constexpr version is renamed to parseConstantScriptName.

>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:64
>> +        result |= (static_cast<ScriptName>(toASCIILower(code)) << ((4 - index - 1) * CHAR_BIT));
> 
> I think we can use 8 instead of CHAR_BIT; I don’t think CHAR_BIT helps readability or portability here.
> 
> Ideally we’d find a way to share code so we know the above is the same.

Changed it to 8. Unfortunately, currently we have separate function because of constexpr.

>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:199
>> +    auto* element = tryBinarySearch<ScriptNameCode>(scriptNameCodeList, WTF_ARRAY_LENGTH(scriptNameCodeList), name.value(),
> 
> std::size should work and seems nicer to use than WTF_ARRAY_LENGTH

Changed, std::size_ sounds nice.

>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:455
>> +        auto* element = tryBinarySearch<LocaleScript>(localeScriptList, WTF_ARRAY_LENGTH(localeScriptList), localeName.value(),
> 
> std::size should work and seems nicer to use than WTF_ARRAY_LENGTH
> 
> Also would be nice to make another template like tryBinarySearch, maybe even an overload of it, that handles the sizing automatically. Seems like it should be able to deduce both the type and the size without an explicit template or function argument for either.

std::size sounds nice.
For now, I'll leave tryBinarySearch refactoring for the future change.
Comment 9 Yusuke Suzuki 2021-04-17 22:26:43 PDT
Created attachment 426367 [details]
Patch
Comment 10 Yusuke Suzuki 2021-04-17 22:27:30 PDT
Created attachment 426368 [details]
Patch
Comment 11 Darin Adler 2021-04-17 22:40:15 PDT
Comment on attachment 426368 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=426368&action=review

A few more ideas for refinements.

> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:45
> +    ASSERT_UNDER_CONSTEXPR_CONTEXT(length <= sizeof(ResultType));

Since we’re doing assertions like this, how about asserting the terminator is the null character?

> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:48
> +        uint8_t code = static_cast<uint8_t>(string[index]);

Seems like we could have code be "ResultType" instead of uint8_t and avoid the static_cast on the next line of code.

> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:49
> +        result |= (static_cast<ResultType>(code) << ((sizeof(ResultType) - index - 1) * 8));

I think this has one set of parentheses that aren’t needed for clarity or correctness, the outermost set.

> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:64
> +        result |= (static_cast<ResultType>(toASCIILower(code)) << ((sizeof(ResultType) - index - 1) * 8));

I think this has one set of parentheses that aren’t needed for clarity or correctness, the outermost set.

> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:82
> +    static constexpr ScriptNameCode scriptNameCodeList[] = {
> +        { parseConstantScriptCode<ScriptName>("arab"), USCRIPT_ARABIC },

I think we’re going to be able to make a fancier version of this where you don’t have to write the calls to parseConstantScriptCode because it’s all done by the constexpr constructor of an object. Even better, I think it can encapsulate the implementation detail of the "parsing into an integer" and just take a couple of function arguments, and also make it easier to correctly call the binary search function.

Just refactoring! But should be more elegant.

And maybe have a version where we store const char* rather than a fixed size item in the table but otherwise it’s similar, and uses the same template just with different arguments. Depending on what other places in WebKit need.
Comment 12 Yusuke Suzuki 2021-04-17 23:25:05 PDT
Comment on attachment 426368 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=426368&action=review

>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:45
>> +    ASSERT_UNDER_CONSTEXPR_CONTEXT(length <= sizeof(ResultType));
> 
> Since we’re doing assertions like this, how about asserting the terminator is the null character?

Added.

>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:48
>> +        uint8_t code = static_cast<uint8_t>(string[index]);
> 
> Seems like we could have code be "ResultType" instead of uint8_t and avoid the static_cast on the next line of code.

I would like to first cast it from char to uint8_t to ensure that we are handling unsigned types, and avoid accidental sign-extension.

>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:49
>> +        result |= (static_cast<ResultType>(code) << ((sizeof(ResultType) - index - 1) * 8));
> 
> I think this has one set of parentheses that aren’t needed for clarity or correctness, the outermost set.

Removed.

>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:64
>> +        result |= (static_cast<ResultType>(toASCIILower(code)) << ((sizeof(ResultType) - index - 1) * 8));
> 
> I think this has one set of parentheses that aren’t needed for clarity or correctness, the outermost set.

Removed.

>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:82
>> +        { parseConstantScriptCode<ScriptName>("arab"), USCRIPT_ARABIC },
> 
> I think we’re going to be able to make a fancier version of this where you don’t have to write the calls to parseConstantScriptCode because it’s all done by the constexpr constructor of an object. Even better, I think it can encapsulate the implementation detail of the "parsing into an integer" and just take a couple of function arguments, and also make it easier to correctly call the binary search function.
> 
> Just refactoring! But should be more elegant.
> 
> And maybe have a version where we store const char* rather than a fixed size item in the table but otherwise it’s similar, and uses the same template just with different arguments. Depending on what other places in WebKit need.

I've added PackedASCIILowerCodes<T> which encapsulate them.
Comment 13 Yusuke Suzuki 2021-04-17 23:26:41 PDT
Created attachment 426372 [details]
Patch
Comment 14 Yusuke Suzuki 2021-04-18 00:43:40 PDT
Committed r276225 (236707@main): <https://commits.webkit.org/236707@main>
Comment 15 Darin Adler 2021-04-18 09:08:04 PDT
Comment on attachment 426368 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=426368&action=review

>>> Source/WebCore/platform/text/LocaleToScriptMapping.cpp:48
>>> +        uint8_t code = static_cast<uint8_t>(string[index]);
>> 
>> Seems like we could have code be "ResultType" instead of uint8_t and avoid the static_cast on the next line of code.
> 
> I would like to first cast it from char to uint8_t to ensure that we are handling unsigned types, and avoid accidental sign-extension.

Yes, I agree. We would cast it with the static_cast as we are doing here and then have a local variable be of type ResultType which would achieve exactly what you were talking about.
Comment 16 Ryan Haddad 2021-04-19 22:07:56 PDT
rdar://76809872