RESOLVED FIXED 216224
[JSC] Canonicalize "true" unicode extension type value to ""
https://bugs.webkit.org/show_bug.cgi?id=216224
Summary [JSC] Canonicalize "true" unicode extension type value to ""
Yusuke Suzuki
Reported 2020-09-06 01:58:40 PDT
[JSC] Canonicalize "true" unicode extension type value to ""
Attachments
Patch (20.82 KB, patch)
2020-09-06 02:00 PDT, Yusuke Suzuki
no flags
Patch (20.93 KB, patch)
2020-09-06 02:04 PDT, Yusuke Suzuki
no flags
Patch (33.01 KB, patch)
2020-09-10 02:22 PDT, Yusuke Suzuki
no flags
Patch (38.41 KB, patch)
2020-09-11 20:18 PDT, Yusuke Suzuki
ross.kirsling: review+
Patch (38.51 KB, patch)
2020-09-11 20:24 PDT, Yusuke Suzuki
no flags
Yusuke Suzuki
Comment 1 2020-09-06 02:00:43 PDT
Yusuke Suzuki
Comment 2 2020-09-06 02:04:31 PDT
Ross Kirsling
Comment 3 2020-09-09 16:04:25 PDT
Comment on attachment 408120 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=408120&action=review > Source/JavaScriptCore/runtime/IntlLocalePrototype.cpp:188 > + if (caseFirst.isEmpty()) > + return JSValue::encode(jsUndefined()); > + // If caseFirst is "true", then we should return "" since language tag canonicalization makes it "". > + if (caseFirst == "true"_s) > + return JSValue::encode(jsEmptyString(vm)); > + return JSValue::encode(jsString(vm, caseFirst)); This shouldn't be correct -- kf isn't a boolean extension; it should only ever be "upper"/"lower"/"false"/undefined (though Intl.Locale will evidently let it be an arbitrary string). This does highlight an interesting separate issue in Intl.Locale though -- if the locale string has a *non-boolean* Unicode extension with the value omitted, we're returning "yes" instead of "true". For example: new Intl.Locale('en-us-u-co').collation > yes > Source/JavaScriptCore/runtime/IntlObject.cpp:306 > + // In the following unicode extension keys, "yes" is alias of "true". So we should not include value. > + // See CLDR key name="kb" entry for example. > + switch (computeTwoCharacters16Code(key)) { > + case computeTwoCharacters16Code("kb"_s): > + case computeTwoCharacters16Code("kc"_s): > + case computeTwoCharacters16Code("kh"_s): > + case computeTwoCharacters16Code("kk"_s): > + case computeTwoCharacters16Code("kn"_s): I don't think we should be maintaining an allowlist here? Most of these are unknown to ECMA-402 and V8 evidently accepts "yes" for arbitrary extensions. Also, unless I'm missing something, "yes" should never appear in the output of uloc_toLanguageTag, right?
Yusuke Suzuki
Comment 4 2020-09-10 02:15:47 PDT
Comment on attachment 408120 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=408120&action=review >> Source/JavaScriptCore/runtime/IntlLocalePrototype.cpp:188 >> + return JSValue::encode(jsString(vm, caseFirst)); > > This shouldn't be correct -- kf isn't a boolean extension; it should only ever be "upper"/"lower"/"false"/undefined (though Intl.Locale will evidently let it be an arbitrary string). > > This does highlight an interesting separate issue in Intl.Locale though -- if the locale string has a *non-boolean* Unicode extension with the value omitted, we're returning "yes" instead of "true". For example: > > new Intl.Locale('en-us-u-co').collation > > yes However, if "true" is explicitly specified (`en-u-kf-true`), then, I think we should return "" since > Any type or tfield value "true" is removed. rule can be applied regardless of whether this is a boolean extension. So, when "true" is specified, "" is better than returning "true". >> Source/JavaScriptCore/runtime/IntlObject.cpp:306 >> + case computeTwoCharacters16Code("kn"_s): > > I don't think we should be maintaining an allowlist here? Most of these are unknown to ECMA-402 and V8 evidently accepts "yes" for arbitrary extensions. > Also, unless I'm missing something, "yes" should never appear in the output of uloc_toLanguageTag, right? Right, "yes" will not appear from ICU except for the user explicitly set "yes". And currently it is tested in test262/test/intl402/Intl/getCanonicalLocales/unicode-ext-canonicalize-yes-to-true.js Currently, ICU does not handle this correctly, so we need to do that here. I think this list is small enough, so I'm fine to keep it here. And we can remove it once ICU supports this correctly.
Yusuke Suzuki
Comment 5 2020-09-10 02:22:12 PDT
Ross Kirsling
Comment 6 2020-09-10 12:47:26 PDT
(In reply to Yusuke Suzuki from comment #4) > Comment on attachment 408120 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=408120&action=review > > >> Source/JavaScriptCore/runtime/IntlLocalePrototype.cpp:188 > >> + return JSValue::encode(jsString(vm, caseFirst)); > > > > This shouldn't be correct -- kf isn't a boolean extension; it should only ever be "upper"/"lower"/"false"/undefined (though Intl.Locale will evidently let it be an arbitrary string). > > > > This does highlight an interesting separate issue in Intl.Locale though -- if the locale string has a *non-boolean* Unicode extension with the value omitted, we're returning "yes" instead of "true". For example: > > > > new Intl.Locale('en-us-u-co').collation > > > yes > > However, if "true" is explicitly specified (`en-u-kf-true`), then, I think > we should return "" since > > > Any type or tfield value "true" is removed. > > rule can be applied regardless of whether this is a boolean extension. So, > when "true" is specified, "" is better than returning "true". Hmm, but this is purely a statement about locale string canonicalization and not about how Intl.Locale's getters should behave. You may still be right (since canonicalization happens in step 12 of https://tc39.es/ecma402/#sec-Intl.Locale), but there's absolutely nothing unique about caseFirst, so we'd need to do it for the other non-boolean getters too. I guess this and the "yes gets returned" issue I mentioned should probably be handled together within IntlLocale.cpp. > >> Source/JavaScriptCore/runtime/IntlObject.cpp:306 > >> + case computeTwoCharacters16Code("kn"_s): > > > > I don't think we should be maintaining an allowlist here? Most of these are unknown to ECMA-402 and V8 evidently accepts "yes" for arbitrary extensions. > > Also, unless I'm missing something, "yes" should never appear in the output of uloc_toLanguageTag, right? > > Right, "yes" will not appear from ICU except for the user explicitly set > "yes". And currently it is tested in > test262/test/intl402/Intl/getCanonicalLocales/unicode-ext-canonicalize-yes- > to-true.js > Currently, ICU does not handle this correctly, so we need to do that here. I > think this list is small enough, so I'm fine to keep it here. And we can > remove it once ICU supports this correctly. That test makes sense, but my reason for confusion is because this is the existing behavior without your new canonicalization: new Intl.Locale('en-us-u-kb-yes') > en-US-u-kb-true new Intl.Locale('en-us-u-co-yes') > en-US-u-co-yes ...so it seems like we only see "yes" returned from uloc_toLanguageTag if the user provided "yes" *and* the extension is non-boolean, which would mean that we'd never need to check for "yes" in particular.
Yusuke Suzuki
Comment 7 2020-09-11 19:49:59 PDT
Comment on attachment 408120 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=408120&action=review >>>> Source/JavaScriptCore/runtime/IntlLocalePrototype.cpp:188 >>>> + return JSValue::encode(jsString(vm, caseFirst)); >>> >>> This shouldn't be correct -- kf isn't a boolean extension; it should only ever be "upper"/"lower"/"false"/undefined (though Intl.Locale will evidently let it be an arbitrary string). >>> >>> This does highlight an interesting separate issue in Intl.Locale though -- if the locale string has a *non-boolean* Unicode extension with the value omitted, we're returning "yes" instead of "true". For example: >>> >>> new Intl.Locale('en-us-u-co').collation >>> > yes >> >> However, if "true" is explicitly specified (`en-u-kf-true`), then, I think we should return "" since > > Hmm, but this is purely a statement about locale string canonicalization and not about how Intl.Locale's getters should behave. You may still be right (since canonicalization happens in step 12 of https://tc39.es/ecma402/#sec-Intl.Locale), but there's absolutely nothing unique about caseFirst, so we'd need to do it for the other non-boolean getters too. I guess this and the "yes gets returned" issue I mentioned should probably be handled together within IntlLocale.cpp. OK, so we should move this logic to IntlLocale.cpp's getters. Changed. >>>> Source/JavaScriptCore/runtime/IntlObject.cpp:306 >>>> + case computeTwoCharacters16Code("kn"_s): >>> >>> I don't think we should be maintaining an allowlist here? Most of these are unknown to ECMA-402 and V8 evidently accepts "yes" for arbitrary extensions. >>> Also, unless I'm missing something, "yes" should never appear in the output of uloc_toLanguageTag, right? >> >> Right, "yes" will not appear from ICU except for the user explicitly set "yes". And currently it is tested in test262/test/intl402/Intl/getCanonicalLocales/unicode-ext-canonicalize-yes-to-true.js >> Currently, ICU does not handle this correctly, so we need to do that here. I think this list is small enough, so I'm fine to keep it here. And we can remove it once ICU supports this correctly. > > That test makes sense, but my reason for confusion is because this is the existing behavior without your new canonicalization: > > new Intl.Locale('en-us-u-kb-yes') > > en-US-u-kb-true > new Intl.Locale('en-us-u-co-yes') > > en-US-u-co-yes > > ...so it seems like we only see "yes" returned from uloc_toLanguageTag if the user provided "yes" *and* the extension is non-boolean, which would mean that we'd never need to check for "yes" in particular. Ah! Right, so this is not necessary. Removed.
Yusuke Suzuki
Comment 8 2020-09-11 20:18:53 PDT
Yusuke Suzuki
Comment 9 2020-09-11 20:24:17 PDT
Ross Kirsling
Comment 10 2020-09-11 21:03:26 PDT
Comment on attachment 408582 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=408582&action=review r=me > Source/JavaScriptCore/runtime/IntlObject.cpp:193 > +static Vector<StringView> unicodeExtensionSubTags(StringView extension) nit: Since we're moving this anyway, could we rename it unicodeExtensionSubtags with a lowercase t? ...or rather, maybe unicodeExtensionComponents, since that's what it's been renamed to: https://tc39.es/ecma402/#sec-unicode-extension-components > Source/JavaScriptCore/runtime/IntlObject.cpp:293 > + if (subtag.length() != 2) { > + ++index; > + continue; > + } Is this just for the leading attribute case?
Yusuke Suzuki
Comment 11 2020-09-11 21:15:13 PDT
Comment on attachment 408582 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=408582&action=review >> Source/JavaScriptCore/runtime/IntlObject.cpp:193 >> +static Vector<StringView> unicodeExtensionSubTags(StringView extension) > > nit: Since we're moving this anyway, could we rename it unicodeExtensionSubtags with a lowercase t? > > ...or rather, maybe unicodeExtensionComponents, since that's what it's been renamed to: > https://tc39.es/ecma402/#sec-unicode-extension-components unicodeExtensionComponents sounds good. Changed. >> Source/JavaScriptCore/runtime/IntlObject.cpp:293 >> + } > > Is this just for the leading attribute case? Yes.
Yusuke Suzuki
Comment 12 2020-09-11 22:57:05 PDT
Radar WebKit Bug Importer
Comment 13 2020-09-11 22:58:18 PDT
Note You need to log in before you can comment on or make changes to this bug.