Bug 216224 - [JSC] Canonicalize "true" unicode extension type value to ""
Summary: [JSC] Canonicalize "true" unicode extension type value to ""
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords: InRadar
Depends on:
Blocks: 213425
  Show dependency treegraph
 
Reported: 2020-09-06 01:58 PDT by Yusuke Suzuki
Modified: 2020-09-14 19:50 PDT (History)
11 users (show)

See Also:


Attachments
Patch (20.82 KB, patch)
2020-09-06 02:00 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (20.93 KB, patch)
2020-09-06 02:04 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (33.01 KB, patch)
2020-09-10 02:22 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (38.41 KB, patch)
2020-09-11 20:18 PDT, Yusuke Suzuki
ross.kirsling: review+
Details | Formatted Diff | Diff
Patch (38.51 KB, patch)
2020-09-11 20:24 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Yusuke Suzuki 2020-09-06 01:58:40 PDT
[JSC] Canonicalize "true" unicode extension type value to ""
Comment 1 Yusuke Suzuki 2020-09-06 02:00:43 PDT
Created attachment 408119 [details]
Patch
Comment 2 Yusuke Suzuki 2020-09-06 02:04:31 PDT
Created attachment 408120 [details]
Patch
Comment 3 Ross Kirsling 2020-09-09 16:04:25 PDT
Comment on attachment 408120 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=408120&action=review

> Source/JavaScriptCore/runtime/IntlLocalePrototype.cpp:188
> +    if (caseFirst.isEmpty())
> +        return JSValue::encode(jsUndefined());
> +    // If caseFirst is "true", then we should return "" since language tag canonicalization makes it "".
> +    if (caseFirst == "true"_s)
> +        return JSValue::encode(jsEmptyString(vm));
> +    return JSValue::encode(jsString(vm, caseFirst));

This shouldn't be correct -- kf isn't a boolean extension; it should only ever be "upper"/"lower"/"false"/undefined (though Intl.Locale will evidently let it be an arbitrary string).

This does highlight an interesting separate issue in Intl.Locale though -- if the locale string has a *non-boolean* Unicode extension with the value omitted, we're returning "yes" instead of "true". For example:

    new Intl.Locale('en-us-u-co').collation
    > yes

> Source/JavaScriptCore/runtime/IntlObject.cpp:306
> +        // In the following unicode extension keys, "yes" is alias of "true". So we should not include value.
> +        // See CLDR key name="kb" entry for example.
> +        switch (computeTwoCharacters16Code(key)) {
> +        case computeTwoCharacters16Code("kb"_s):
> +        case computeTwoCharacters16Code("kc"_s):
> +        case computeTwoCharacters16Code("kh"_s):
> +        case computeTwoCharacters16Code("kk"_s):
> +        case computeTwoCharacters16Code("kn"_s):

I don't think we should be maintaining an allowlist here? Most of these are unknown to ECMA-402 and V8 evidently accepts "yes" for arbitrary extensions.
Also, unless I'm missing something, "yes" should never appear in the output of uloc_toLanguageTag, right?
Comment 4 Yusuke Suzuki 2020-09-10 02:15:47 PDT
Comment on attachment 408120 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=408120&action=review

>> Source/JavaScriptCore/runtime/IntlLocalePrototype.cpp:188
>> +    return JSValue::encode(jsString(vm, caseFirst));
> 
> This shouldn't be correct -- kf isn't a boolean extension; it should only ever be "upper"/"lower"/"false"/undefined (though Intl.Locale will evidently let it be an arbitrary string).
> 
> This does highlight an interesting separate issue in Intl.Locale though -- if the locale string has a *non-boolean* Unicode extension with the value omitted, we're returning "yes" instead of "true". For example:
> 
>     new Intl.Locale('en-us-u-co').collation
>     > yes

However, if "true" is explicitly specified (`en-u-kf-true`), then, I think we should return "" since

> Any type or tfield value "true" is removed.

rule can be applied regardless of whether this is a boolean extension. So, when "true" is specified, "" is better than returning "true".

>> Source/JavaScriptCore/runtime/IntlObject.cpp:306
>> +        case computeTwoCharacters16Code("kn"_s):
> 
> I don't think we should be maintaining an allowlist here? Most of these are unknown to ECMA-402 and V8 evidently accepts "yes" for arbitrary extensions.
> Also, unless I'm missing something, "yes" should never appear in the output of uloc_toLanguageTag, right?

Right, "yes" will not appear from ICU except for the user explicitly set "yes". And currently it is tested in test262/test/intl402/Intl/getCanonicalLocales/unicode-ext-canonicalize-yes-to-true.js
Currently, ICU does not handle this correctly, so we need to do that here. I think this list is small enough, so I'm fine to keep it here. And we can remove it once ICU supports this correctly.
Comment 5 Yusuke Suzuki 2020-09-10 02:22:12 PDT
Created attachment 408425 [details]
Patch
Comment 6 Ross Kirsling 2020-09-10 12:47:26 PDT
(In reply to Yusuke Suzuki from comment #4)
> Comment on attachment 408120 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=408120&action=review
> 
> >> Source/JavaScriptCore/runtime/IntlLocalePrototype.cpp:188
> >> +    return JSValue::encode(jsString(vm, caseFirst));
> > 
> > This shouldn't be correct -- kf isn't a boolean extension; it should only ever be "upper"/"lower"/"false"/undefined (though Intl.Locale will evidently let it be an arbitrary string).
> > 
> > This does highlight an interesting separate issue in Intl.Locale though -- if the locale string has a *non-boolean* Unicode extension with the value omitted, we're returning "yes" instead of "true". For example:
> > 
> >     new Intl.Locale('en-us-u-co').collation
> >     > yes
> 
> However, if "true" is explicitly specified (`en-u-kf-true`), then, I think
> we should return "" since
> 
> > Any type or tfield value "true" is removed.
> 
> rule can be applied regardless of whether this is a boolean extension. So,
> when "true" is specified, "" is better than returning "true".

Hmm, but this is purely a statement about locale string canonicalization and not about how Intl.Locale's getters should behave. You may still be right (since canonicalization happens in step 12 of https://tc39.es/ecma402/#sec-Intl.Locale), but there's absolutely nothing unique about caseFirst, so we'd need to do it for the other non-boolean getters too. I guess this and the "yes gets returned" issue I mentioned should probably be handled together within IntlLocale.cpp.


> >> Source/JavaScriptCore/runtime/IntlObject.cpp:306
> >> +        case computeTwoCharacters16Code("kn"_s):
> > 
> > I don't think we should be maintaining an allowlist here? Most of these are unknown to ECMA-402 and V8 evidently accepts "yes" for arbitrary extensions.
> > Also, unless I'm missing something, "yes" should never appear in the output of uloc_toLanguageTag, right?
> 
> Right, "yes" will not appear from ICU except for the user explicitly set
> "yes". And currently it is tested in
> test262/test/intl402/Intl/getCanonicalLocales/unicode-ext-canonicalize-yes-
> to-true.js
> Currently, ICU does not handle this correctly, so we need to do that here. I
> think this list is small enough, so I'm fine to keep it here. And we can
> remove it once ICU supports this correctly.

That test makes sense, but my reason for confusion is because this is the existing behavior without your new canonicalization:

    new Intl.Locale('en-us-u-kb-yes')
    > en-US-u-kb-true
    new Intl.Locale('en-us-u-co-yes')
    > en-US-u-co-yes

...so it seems like we only see "yes" returned from uloc_toLanguageTag if the user provided "yes" *and* the extension is non-boolean, which would mean that we'd never need to check for "yes" in particular.
Comment 7 Yusuke Suzuki 2020-09-11 19:49:59 PDT
Comment on attachment 408120 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=408120&action=review

>>>> Source/JavaScriptCore/runtime/IntlLocalePrototype.cpp:188
>>>> +    return JSValue::encode(jsString(vm, caseFirst));
>>> 
>>> This shouldn't be correct -- kf isn't a boolean extension; it should only ever be "upper"/"lower"/"false"/undefined (though Intl.Locale will evidently let it be an arbitrary string).
>>> 
>>> This does highlight an interesting separate issue in Intl.Locale though -- if the locale string has a *non-boolean* Unicode extension with the value omitted, we're returning "yes" instead of "true". For example:
>>> 
>>>     new Intl.Locale('en-us-u-co').collation
>>>     > yes
>> 
>> However, if "true" is explicitly specified (`en-u-kf-true`), then, I think we should return "" since
> 
> Hmm, but this is purely a statement about locale string canonicalization and not about how Intl.Locale's getters should behave. You may still be right (since canonicalization happens in step 12 of https://tc39.es/ecma402/#sec-Intl.Locale), but there's absolutely nothing unique about caseFirst, so we'd need to do it for the other non-boolean getters too. I guess this and the "yes gets returned" issue I mentioned should probably be handled together within IntlLocale.cpp.

OK, so we should move this logic to IntlLocale.cpp's getters. Changed.

>>>> Source/JavaScriptCore/runtime/IntlObject.cpp:306
>>>> +        case computeTwoCharacters16Code("kn"_s):
>>> 
>>> I don't think we should be maintaining an allowlist here? Most of these are unknown to ECMA-402 and V8 evidently accepts "yes" for arbitrary extensions.
>>> Also, unless I'm missing something, "yes" should never appear in the output of uloc_toLanguageTag, right?
>> 
>> Right, "yes" will not appear from ICU except for the user explicitly set "yes". And currently it is tested in test262/test/intl402/Intl/getCanonicalLocales/unicode-ext-canonicalize-yes-to-true.js
>> Currently, ICU does not handle this correctly, so we need to do that here. I think this list is small enough, so I'm fine to keep it here. And we can remove it once ICU supports this correctly.
> 
> That test makes sense, but my reason for confusion is because this is the existing behavior without your new canonicalization:
> 
>     new Intl.Locale('en-us-u-kb-yes')
>     > en-US-u-kb-true
>     new Intl.Locale('en-us-u-co-yes')
>     > en-US-u-co-yes
> 
> ...so it seems like we only see "yes" returned from uloc_toLanguageTag if the user provided "yes" *and* the extension is non-boolean, which would mean that we'd never need to check for "yes" in particular.

Ah! Right, so this is not necessary. Removed.
Comment 8 Yusuke Suzuki 2020-09-11 20:18:53 PDT
Created attachment 408582 [details]
Patch
Comment 9 Yusuke Suzuki 2020-09-11 20:24:17 PDT
Created attachment 408583 [details]
Patch
Comment 10 Ross Kirsling 2020-09-11 21:03:26 PDT
Comment on attachment 408582 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=408582&action=review

r=me

> Source/JavaScriptCore/runtime/IntlObject.cpp:193
> +static Vector<StringView> unicodeExtensionSubTags(StringView extension)

nit: Since we're moving this anyway, could we rename it unicodeExtensionSubtags with a lowercase t?

...or rather, maybe unicodeExtensionComponents, since that's what it's been renamed to:
https://tc39.es/ecma402/#sec-unicode-extension-components

> Source/JavaScriptCore/runtime/IntlObject.cpp:293
> +        if (subtag.length() != 2) {
> +            ++index;
> +            continue;
> +        }

Is this just for the leading attribute case?
Comment 11 Yusuke Suzuki 2020-09-11 21:15:13 PDT
Comment on attachment 408582 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=408582&action=review

>> Source/JavaScriptCore/runtime/IntlObject.cpp:193
>> +static Vector<StringView> unicodeExtensionSubTags(StringView extension)
> 
> nit: Since we're moving this anyway, could we rename it unicodeExtensionSubtags with a lowercase t?
> 
> ...or rather, maybe unicodeExtensionComponents, since that's what it's been renamed to:
> https://tc39.es/ecma402/#sec-unicode-extension-components

unicodeExtensionComponents sounds good. Changed.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:293
>> +        }
> 
> Is this just for the leading attribute case?

Yes.
Comment 12 Yusuke Suzuki 2020-09-11 22:57:05 PDT
Committed r266973: <https://trac.webkit.org/changeset/266973>
Comment 13 Radar WebKit Bug Importer 2020-09-11 22:58:18 PDT
<rdar://problem/68751852>