RESOLVED FIXED 210853
[Intl] Locale validation/canonicalization should defer to ICU
https://bugs.webkit.org/show_bug.cgi?id=210853
Summary [Intl] Locale validation/canonicalization should defer to ICU
Ross Kirsling
Reported 2020-04-22 02:18:39 PDT
[Intl] Canonicalize locales according to UTS 35
Attachments
Patch (916.00 KB, patch)
2020-04-22 02:38 PDT, Ross Kirsling
no flags
Patch (915.52 KB, patch)
2020-04-22 11:43 PDT, Ross Kirsling
no flags
Patch (785.18 KB, patch)
2020-04-23 23:53 PDT, Ross Kirsling
no flags
Patch (785.32 KB, patch)
2020-04-24 14:04 PDT, Ross Kirsling
no flags
Patch for landing (787.78 KB, patch)
2020-04-24 17:16 PDT, Ross Kirsling
no flags
Patch for landing (787.74 KB, patch)
2020-04-24 21:29 PDT, Ross Kirsling
no flags
Patch for landing (789.78 KB, patch)
2020-04-25 01:04 PDT, Ross Kirsling
no flags
Ross Kirsling
Comment 1 2020-04-22 02:38:27 PDT
Ross Kirsling
Comment 2 2020-04-22 02:57:35 PDT
Additional notes: 1. For ease of review, here's what the two functions in the generated header end up looking like: static String intlPreferredLanguageTag(const String& tag) { // 354 possible replacements static const auto map = makeNeverDestroyed<HashMap<String, String>>({ { "aam"_s, "aas"_s }, ... }); return map.get().get(tag); } 2. A ~5000-line file is much nicer than a ~50000-line file, but you might ask why ICU can't do this for us, since it has the data. The answer is that uloc_[for|to]LanguageTag unfortunately don't produce results that conform to ECMA-402 (and are also supposedly quite slow?). Still, there are certain transformations that we still need to apply.
Ross Kirsling
Comment 3 2020-04-22 11:43:49 PDT
Ross Kirsling
Comment 4 2020-04-23 13:24:53 PDT
(In reply to Ross Kirsling from comment #2) > 2. A ~5000-line file is much nicer than a ~50000-line file, but you might > ask why ICU can't do this for us, since it has the data. The answer is that > uloc_[for|to]LanguageTag unfortunately don't produce results that conform to > ECMA-402 (and are also supposedly quite slow?). Still, there are certain > transformations that we still need to apply. Hmm, it turns out there's an XML version of this that will be much easier (and future-proof) to parse. I'll try using that instead.
Ross Kirsling
Comment 5 2020-04-23 15:44:22 PDT
Apparently just as we were going to catch up with the existing tests for locale canonicalization, many more landed that raise the bar significantly higher (https://github.com/tc39/test262/pull/2554). This seems utterly unreasonable to ask each engine to implement. We should probably just rely on ICU and accept whatever failures occur in a given version.
Ross Kirsling
Comment 6 2020-04-23 23:53:42 PDT
Ross Kirsling
Comment 7 2020-04-23 23:56:25 PDT
By Yusuke's suggestion, let's just roll with what ICU gives us. What's cool is that there's almost zero loss to doing so in ICU 64 -- the only thing missing is the "-true" elimination of r260151. (And we could even add it back temporarily if we felt it were worth the perf hit.)
Darin Adler
Comment 8 2020-04-24 11:07:46 PDT
Comment on attachment 397431 [details] Patch I’ll review once the JSC stuff compiles.
Ross Kirsling
Comment 9 2020-04-24 14:04:18 PDT
Darin Adler
Comment 10 2020-04-24 16:23:13 PDT
Comment on attachment 397505 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=397505&action=review > Source/JavaScriptCore/runtime/IntlObject.cpp:356 > + return String(result.data(), resultLength); Are these language tags ASCII? Latin-1? UTF-8? This code is fine if they are ASCII or Latin-1, not if they are UTF-8.
Ross Kirsling
Comment 11 2020-04-24 16:53:26 PDT
(In reply to Darin Adler from comment #10) > Comment on attachment 397505 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=397505&action=review > > > Source/JavaScriptCore/runtime/IntlObject.cpp:356 > > + return String(result.data(), resultLength); > > Are these language tags ASCII? Latin-1? UTF-8? > > This code is fine if they are ASCII or Latin-1, not if they are UTF-8. Yeah, the output is all ASCII (just alphanumeric and separator characters).
Ross Kirsling
Comment 12 2020-04-24 17:16:23 PDT
Created attachment 397528 [details] Patch for landing
Ross Kirsling
Comment 13 2020-04-24 17:21:47 PDT
(I'll make sure I see JSC-ARMv7 pass tests before landing.)
Darin Adler
Comment 14 2020-04-24 17:28:19 PDT
I like this new strategy. But: I think it’s OK to add extra code to perform better with older versions of ICU or with things ICU didn’t get to yet, depending on the importance.
Ross Kirsling
Comment 15 2020-04-24 21:29:37 PDT
Created attachment 397538 [details] Patch for landing
Ross Kirsling
Comment 16 2020-04-25 01:04:46 PDT
Created attachment 397549 [details] Patch for landing
EWS
Comment 17 2020-04-25 02:14:16 PDT
Committed r260697: <https://trac.webkit.org/changeset/260697> All reviewed patches have been landed. Closing bug and clearing flags on attachment 397549 [details].
Radar WebKit Bug Importer
Comment 18 2020-04-25 02:15:20 PDT
Ross Kirsling
Comment 19 2020-04-25 14:01:06 PDT
For some reason this hits an unchecked exception assert only via run-javascript-core and not via run-jsc. Will fix...
Ross Kirsling
Comment 20 2020-04-25 16:48:06 PDT
(In reply to Ross Kirsling from comment #19) > For some reason this hits an unchecked exception assert only via > run-javascript-core and not via run-jsc. Will fix... Committed r260710: <https://trac.webkit.org/changeset/260710>
Note You need to log in before you can comment on or make changes to this bug.