Bug 215058 - [JSC] Implement Intl Language Tag Parser
Summary: [JSC] Implement Intl Language Tag Parser
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords: InRadar
Depends on: 215756
Blocks: 213425
  Show dependency treegraph
 
Reported: 2020-08-01 02:15 PDT by Yusuke Suzuki
Modified: 2020-08-22 23:06 PDT (History)
10 users (show)

See Also:


Attachments
Patch (20.63 KB, patch)
2020-08-01 04:08 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (47.25 KB, patch)
2020-08-01 05:49 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (59.95 KB, patch)
2020-08-02 18:33 PDT, Yusuke Suzuki
ross.kirsling: review+
Details | Formatted Diff | Diff
Patch (71.04 KB, patch)
2020-08-22 15:49 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Yusuke Suzuki 2020-08-01 02:15:53 PDT
Add precise language tag parser implementing BCP47.
Comment 1 Yusuke Suzuki 2020-08-01 04:08:53 PDT
Created attachment 405783 [details]
Patch

WIP
Comment 2 Yusuke Suzuki 2020-08-01 05:49:24 PDT
Created attachment 405784 [details]
Patch

WIP
Comment 3 Yusuke Suzuki 2020-08-02 18:33:53 PDT
Created attachment 405815 [details]
Patch
Comment 4 Ross Kirsling 2020-08-04 16:22:33 PDT
Comment on attachment 405815 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=405815&action=review

Looks good overall.

It sucks that ICU canonicalization isn't sufficient, but I think you're right that this is unavoidable if we want cross-engine consistency, so thanks for doing what I was too stubborn to do. ;)

I think Darin will likely have comments regarding the string manipulation parts.


> Source/JavaScriptCore/runtime/IntlObject.cpp:851
> +    struct Code {
> +        LChar characters[8] { };
> +    };

Is it necessary that this be wrapped in a struct?

> Source/JavaScriptCore/runtime/IntlObject.cpp:875
> +static bool isUnicodeExtensionAttribute(StringView string)

We could probably merge this with isUnicodeExtensionTypeComponent...

> Source/JavaScriptCore/runtime/IntlObject.cpp:888
> +static bool isUnicodeExtensionTypeComponent(StringView string)

...but note that isUnicodeLocaleIdentifierType also exists, so this already feels a bit duplicative?

> Source/JavaScriptCore/runtime/IntlObject.cpp:917
> +static bool isUnicodeTValue(StringView string)

This appears to be just a TValue "component"?

(And also is matching the two functions above...perhaps we should have an alphanum{i,j} helper function?)

> Source/JavaScriptCore/runtime/IntlObject.cpp:1078
> +    bool isTLangOrTField = false;

Seems like this should be renamed to something like `found`...

> Source/JavaScriptCore/runtime/IntlObject.cpp:1109
> +    if (!isTLangOrTField)
>          return false;
> -    ++cursor;
> +    return true;

...and then you can just return it directly.

> Source/JavaScriptCore/runtime/IntlObject.cpp:1125
> +        if (!isUnicodeOtherExtensionValue(m_current))
> +            return true;
> +        if (!next())
> +            return true;

Seems like in a case like this you could conjoin the conditions, but I understand if you prefer to keep parallelism with the rest of the logic.

> Source/JavaScriptCore/runtime/IntlObject.cpp:1231
> +    if (!parser.parseUnicodeLocaleId())
> +        return false;
> +    if (!parser.isEOS())
> +        return false;
> +    return true;

(There's a part of me that really wants to write this as a single return statement, but I do see what you're aiming to convey with this code style. ;D)
Comment 5 Darin Adler 2020-08-04 16:57:45 PDT
Comment on attachment 405815 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=405815&action=review

I wonder about the idioms here compared with the StringParsingBuffer idioms that Sam has been using. Have you looked at that?

Parsers need extensive test suites, with cases that tickle every single branch. Do we have enough?

> Source/JavaScriptCore/runtime/IntlObject.cpp:871
> +    ASSERT(isASCIIAlphanumeric(singleton));
> +    singleton = toASCIILower(singleton);
> +    // 0 - 9 => numeric
> +    // 10 - 35 => alpha
> +    if (isASCIIDigit(singleton))
> +        return singleton - '0';
> +    return (singleton - 'a') + 10;

This could be a bit more like toASCIIHexDigit from ASCIICType.h.

> Source/JavaScriptCore/runtime/IntlObject.cpp:880
> +    if (length >= 3 && length <= 8)
> +        return string.isAllSpecialCharacters<isASCIIAlphanumeric>();
> +    return false;

Should do it all with &&:

    return length >= 3 && length<= 8 && string.isAllSpecialCharacters<isASCIIAlphanumeric>();
Comment 6 Darin Adler 2020-08-04 16:58:10 PDT
Generally looks fine. I have no objections.
Comment 7 Darin Adler 2020-08-04 16:58:32 PDT
Comment on attachment 405815 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=405815&action=review

>> Source/JavaScriptCore/runtime/IntlObject.cpp:1231
>> +    return true;
> 
> (There's a part of me that really wants to write this as a single return statement, but I do see what you're aiming to convey with this code style. ;D)

I feel the same way.
Comment 8 Radar WebKit Bug Importer 2020-08-08 02:16:20 PDT
<rdar://problem/66720470>
Comment 9 Yusuke Suzuki 2020-08-22 14:46:51 PDT
Comment on attachment 405815 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=405815&action=review

>> Source/JavaScriptCore/runtime/IntlObject.cpp:851
>> +    };
> 
> Is it necessary that this be wrapped in a struct?

This makes `bitwise_cast<VariantCode>()` easy by suppressing array to pointer decay.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:871
>> +    return (singleton - 'a') + 10;
> 
> This could be a bit more like toASCIIHexDigit from ASCIICType.h.

Yeah, the difference is it is using G-Z too (so, it is not hex).

>> Source/JavaScriptCore/runtime/IntlObject.cpp:875
>> +static bool isUnicodeExtensionAttribute(StringView string)
> 
> We could probably merge this with isUnicodeExtensionTypeComponent...

I think we should have different ones because they are different part while the patterns are the same by chance.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:880
>> +    return false;
> 
> Should do it all with &&:
> 
>     return length >= 3 && length<= 8 && string.isAllSpecialCharacters<isASCIIAlphanumeric>();

Changed

>> Source/JavaScriptCore/runtime/IntlObject.cpp:888
>> +static bool isUnicodeExtensionTypeComponent(StringView string)
> 
> ...but note that isUnicodeLocaleIdentifierType also exists, so this already feels a bit duplicative?

Ditto.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:917
>> +static bool isUnicodeTValue(StringView string)
> 
> This appears to be just a TValue "component"?
> 
> (And also is matching the two functions above...perhaps we should have an alphanum{i,j} helper function?)

Right, changed.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:1109
>> +    return true;
> 
> ...and then you can just return it directly.

Nice, fixed.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:1125
>> +            return true;
> 
> Seems like in a case like this you could conjoin the conditions, but I understand if you prefer to keep parallelism with the rest of the logic.

Yeah, these codes are aligned to the style used in parse function in this parser.

>>> Source/JavaScriptCore/runtime/IntlObject.cpp:1231
>>> +    return true;
>> 
>> (There's a part of me that really wants to write this as a single return statement, but I do see what you're aiming to convey with this code style. ;D)
> 
> I feel the same way.

Because the other functions are using `break` / `continue` etc., I'm aligning these parse & isEOS() patterns with the other places not to introduce a bug.
I think this makes code consistent and avoid introducing a bug.
Comment 10 Yusuke Suzuki 2020-08-22 15:49:47 PDT
Created attachment 407058 [details]
Patch

Updating
Comment 11 Yusuke Suzuki 2020-08-22 17:25:11 PDT
> I wonder about the idioms here compared with the StringParsingBuffer idioms that Sam has been using. Have you looked at that?

Currently, this parser relies on split feature, so we cannot use StringParsingBuffer. We can revise whether we can use it later.

> Parsers need extensive test suites, with cases that tickle every single branch. Do we have enough?

I added more tests and ensured code coverage of the parser is 100% at least.
Comment 12 Yusuke Suzuki 2020-08-22 17:26:47 PDT
Committed r266039: <https://trac.webkit.org/changeset/266039>
Comment 13 WebKit Commit Bot 2020-08-22 23:01:33 PDT
Re-opened since this is blocked by bug 215756
Comment 14 Yusuke Suzuki 2020-08-22 23:06:58 PDT
Committed r266043: <https://trac.webkit.org/changeset/266043>