Bug 210853 - [Intl] Locale validation/canonicalization should defer to ICU
Summary: [Intl] Locale validation/canonicalization should defer to ICU
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Ross Kirsling
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-04-22 02:18 PDT by Ross Kirsling
Modified: 2020-04-25 16:48 PDT (History)
15 users (show)

See Also:


Attachments
Patch (916.00 KB, patch)
2020-04-22 02:38 PDT, Ross Kirsling
no flags Details | Formatted Diff | Diff
Patch (915.52 KB, patch)
2020-04-22 11:43 PDT, Ross Kirsling
no flags Details | Formatted Diff | Diff
Patch (785.18 KB, patch)
2020-04-23 23:53 PDT, Ross Kirsling
no flags Details | Formatted Diff | Diff
Patch (785.32 KB, patch)
2020-04-24 14:04 PDT, Ross Kirsling
no flags Details | Formatted Diff | Diff
Patch for landing (787.78 KB, patch)
2020-04-24 17:16 PDT, Ross Kirsling
no flags Details | Formatted Diff | Diff
Patch for landing (787.74 KB, patch)
2020-04-24 21:29 PDT, Ross Kirsling
no flags Details | Formatted Diff | Diff
Patch for landing (789.78 KB, patch)
2020-04-25 01:04 PDT, Ross Kirsling
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ross Kirsling 2020-04-22 02:18:39 PDT
[Intl] Canonicalize locales according to UTS 35
Comment 1 Ross Kirsling 2020-04-22 02:38:27 PDT
Created attachment 397180 [details]
Patch
Comment 2 Ross Kirsling 2020-04-22 02:57:35 PDT
Additional notes:

1. For ease of review, here's what the two functions in the generated header end up looking like:

    static String intlPreferredLanguageTag(const String& tag)
    {
        // 354 possible replacements
        static const auto map = makeNeverDestroyed<HashMap<String, String>>({
            { "aam"_s, "aas"_s },
            ...
        });
        return map.get().get(tag);
    }

2. A ~5000-line file is much nicer than a ~50000-line file, but you might ask why ICU can't do this for us, since it has the data. The answer is that uloc_[for|to]LanguageTag unfortunately don't produce results that conform to ECMA-402 (and are also supposedly quite slow?). Still, there are certain transformations that we still need to apply.
Comment 3 Ross Kirsling 2020-04-22 11:43:49 PDT
Created attachment 397227 [details]
Patch
Comment 4 Ross Kirsling 2020-04-23 13:24:53 PDT
(In reply to Ross Kirsling from comment #2)
> 2. A ~5000-line file is much nicer than a ~50000-line file, but you might
> ask why ICU can't do this for us, since it has the data. The answer is that
> uloc_[for|to]LanguageTag unfortunately don't produce results that conform to
> ECMA-402 (and are also supposedly quite slow?). Still, there are certain
> transformations that we still need to apply.

Hmm, it turns out there's an XML version of this that will be much easier (and future-proof) to parse. I'll try using that instead.
Comment 5 Ross Kirsling 2020-04-23 15:44:22 PDT
Apparently just as we were going to catch up with the existing tests for locale canonicalization, many more landed that raise the bar significantly higher (https://github.com/tc39/test262/pull/2554).

This seems utterly unreasonable to ask each engine to implement. We should probably just rely on ICU and accept whatever failures occur in a given version.
Comment 6 Ross Kirsling 2020-04-23 23:53:42 PDT
Created attachment 397431 [details]
Patch
Comment 7 Ross Kirsling 2020-04-23 23:56:25 PDT
By Yusuke's suggestion, let's just roll with what ICU gives us.

What's cool is that there's almost zero loss to doing so in ICU 64 -- the only thing missing is the "-true" elimination of r260151. (And we could even add it back temporarily if we felt it were worth the perf hit.)
Comment 8 Darin Adler 2020-04-24 11:07:46 PDT
Comment on attachment 397431 [details]
Patch

I’ll review once the JSC stuff compiles.
Comment 9 Ross Kirsling 2020-04-24 14:04:18 PDT
Created attachment 397505 [details]
Patch
Comment 10 Darin Adler 2020-04-24 16:23:13 PDT
Comment on attachment 397505 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=397505&action=review

> Source/JavaScriptCore/runtime/IntlObject.cpp:356
> +    return String(result.data(), resultLength);

Are these language tags ASCII? Latin-1? UTF-8?

This code is fine if they are ASCII or Latin-1, not if they are UTF-8.
Comment 11 Ross Kirsling 2020-04-24 16:53:26 PDT
(In reply to Darin Adler from comment #10)
> Comment on attachment 397505 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=397505&action=review
> 
> > Source/JavaScriptCore/runtime/IntlObject.cpp:356
> > +    return String(result.data(), resultLength);
> 
> Are these language tags ASCII? Latin-1? UTF-8?
> 
> This code is fine if they are ASCII or Latin-1, not if they are UTF-8.

Yeah, the output is all ASCII (just alphanumeric and separator characters).
Comment 12 Ross Kirsling 2020-04-24 17:16:23 PDT
Created attachment 397528 [details]
Patch for landing
Comment 13 Ross Kirsling 2020-04-24 17:21:47 PDT
(I'll make sure I see JSC-ARMv7 pass tests before landing.)
Comment 14 Darin Adler 2020-04-24 17:28:19 PDT
I like this new strategy. But: I think it’s OK to add extra code to perform better with older versions of ICU or with things ICU didn’t get to yet, depending on the importance.
Comment 15 Ross Kirsling 2020-04-24 21:29:37 PDT
Created attachment 397538 [details]
Patch for landing
Comment 16 Ross Kirsling 2020-04-25 01:04:46 PDT
Created attachment 397549 [details]
Patch for landing
Comment 17 EWS 2020-04-25 02:14:16 PDT
Committed r260697: <https://trac.webkit.org/changeset/260697>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 397549 [details].
Comment 18 Radar WebKit Bug Importer 2020-04-25 02:15:20 PDT
<rdar://problem/62365727>
Comment 19 Ross Kirsling 2020-04-25 14:01:06 PDT
For some reason this hits an unchecked exception assert only via run-javascript-core and not via run-jsc. Will fix...
Comment 20 Ross Kirsling 2020-04-25 16:48:06 PDT
(In reply to Ross Kirsling from comment #19)
> For some reason this hits an unchecked exception assert only via
> run-javascript-core and not via run-jsc. Will fix...

Committed r260710: <https://trac.webkit.org/changeset/260710>