Bug 224047 - [JSC] DUCET level-1 weighs are equal if characters are alphabets
Summary: [JSC] DUCET level-1 weighs are equal if characters are alphabets
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: Safari 14
Hardware: Mac (Intel) macOS 10.15
: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2021-04-01 07:14 PDT by Eben Packwood
Modified: 2021-04-07 22:48 PDT (History)
10 users (show)

See Also:


Attachments
Minimal working example (314 bytes, application/x-javascript)
2021-04-01 07:14 PDT, Eben Packwood
no flags Details
Patch (9.34 KB, patch)
2021-04-07 14:46 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (10.61 KB, patch)
2021-04-07 14:54 PDT, Yusuke Suzuki
sbarati: review+
Details | Formatted Diff | Diff
sorted DUCET data (ASCII only) excluding 0000 entries (4.87 KB, text/plain)
2021-04-07 16:44 PDT, Yusuke Suzuki
no flags Details
Patch (10.55 KB, patch)
2021-04-07 16:57 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Eben Packwood 2021-04-01 07:14:22 PDT
Created attachment 424894 [details]
Minimal working example

Sorting certain string arrays using `localeCompare` can result in the array being sorted differently when `sort` is called multiple times successively.

It's not immediately clear what the root cause of this issue is, but it seems to be related to the presence of certain unicode strings in the array.

A minimal working example is attached. Each call to `sort` results in a different ordering of the array.
Comment 1 Eben Packwood 2021-04-06 11:40:15 PDT
It appears as if the issue is that `localeCompare` does not define a total ordering for strings, as required in the ECMAScript standard (https://tc39.es/ecma262/#sec-string.prototype.localecompare).

E.g. 

```
"AB - AS Foobar".localeCompare("AB - AS Pulheim Käther") === -1;
"AB - AS Pulheim Käther".localeCompare("Abz - Baz Qux") === -1;
"Abz - Baz Qux".localeCompare("AB - AS Foobar") === -1;
```
Comment 2 Rob 2021-04-06 16:56:50 PDT
I was looking into this a bit with Eben and simplified the test case a bit further. It looks like when comparing pure ASCII strings, localeCompare does a case-sensitive comparison. If the strings are not pure ASCII, localeCompare does a case-insensitive comparison. The only problematic result in this snippet is the third line:

```
"ABC".localeCompare("ABD Ã") === -1
"ABD Ã".localeCompare("AbE") === -1
"AbE".localeCompare("ABC") === -1
```
Comment 3 Smoley 2021-04-07 12:27:56 PDT
This reproduces for me on Safari 13.1.2 as well as TOT 14.2 using the provided test case.
Comment 4 Radar WebKit Bug Importer 2021-04-07 12:28:11 PDT
<rdar://problem/76360173>
Comment 5 Yusuke Suzuki 2021-04-07 14:46:32 PDT
Created attachment 425441 [details]
Patch
Comment 6 Yusuke Suzuki 2021-04-07 14:54:31 PDT
Created attachment 425444 [details]
Patch
Comment 7 Yusuke Suzuki 2021-04-07 16:42:00 PDT
Script used to compute DUCET weight.


import re

data = []

level1Array = []
level2Array = []
level3Array = []
level4Array = []

for i in range(0, 128):
    data.append([0, 0, 0, 0])

with open("sorted.txt") as file:
    for line in file:
        m = re.match("^([0-9A-F]{4})  ; \\[([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]", line)
        level1Array.append(int(m.group(2), 16))
        level2Array.append(int(m.group(3), 16))
        level3Array.append(int(m.group(4), 16))
        level4Array.append(int(m.group(5), 16))

    level1Array = list(set(level1Array))
    level2Array = list(set(level2Array))
    level3Array = list(set(level3Array))
    level4Array = list(set(level4Array))
    level1Array.sort()
    level2Array.sort()
    level3Array.sort()
    level4Array.sort()

with open("sorted.txt") as file:
    for line in file:
        m = re.match("^([0-9A-F]{4})  ; \\[([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]", line)
        code = int(m.group(1), 16)
        level1 = int(m.group(2), 16)
        level2 = int(m.group(3), 16)
        level3 = int(m.group(4), 16)
        level4 = int(m.group(5), 16)
        # putting +1 on level1 since 0 is special meaning, and they are already
        # excluded.
        data[code] = [level1Array.index(level1) + 1, level2Array.index(level2),
                level3Array.index(level3), level4Array.index(level4)]
        print(level1)

for level in range(0, 4):
    print("level%d" % (level + 1))
    result = []
    for i in range(0, 128):
        result.append(str(data[i][level]))
        if i % 8 == 7:
            result.append(",\n")
        else:
            result.append(", ")
    print("".join(result))

# vim: set sw=4 ts=4 et tw=80 :
Comment 8 Yusuke Suzuki 2021-04-07 16:44:17 PDT
Created attachment 425454 [details]
sorted DUCET data (ASCII only) excluding 0000 entries
Comment 9 Mark Lam 2021-04-07 16:47:58 PDT
Comment on attachment 425444 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review

> Source/JavaScriptCore/ChangeLog:12
> +            1. If we found that the result of level-1 weight comparison is equal, and characters are not eqaul, then we do level-3 weight comparison.

/not eqaul/not equal/

> Source/JavaScriptCore/runtime/IntlObject.cpp:436
> +//     6. Level-2 weights are alwasy 0020 except for 0000 cases. So if we include 0000 characters, we do not need to perform level-2 weight comparison.

/alwasy/always/

> JSTests/stress/ducet-level-3-or-4-comparison.js:11
> +shouldBe(collator.compare("ABC", "abc"), 1);

Can you also add `shouldBe(collator.compare("ABC", "ABC"), 0);` just to confirm that the equal case isn't broken?
Comment 10 Mark Lam 2021-04-07 16:53:03 PDT
Comment on attachment 425444 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review

r=me too.

> Source/JavaScriptCore/runtime/IntlObject.cpp:444
> +const uint8_t ducetLevel1Weights[128] = {

As we discussed offline, please fix the L1 weights too.  Thanks.
Comment 11 Yusuke Suzuki 2021-04-07 16:57:39 PDT
Created attachment 425457 [details]
Patch
Comment 12 Yusuke Suzuki 2021-04-07 22:44:36 PDT
Comment on attachment 425444 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review

>> Source/JavaScriptCore/ChangeLog:12
>> +            1. If we found that the result of level-1 weight comparison is equal, and characters are not eqaul, then we do level-3 weight comparison.
> 
> /not eqaul/not equal/

Fixed.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:436
>> +//     6. Level-2 weights are alwasy 0020 except for 0000 cases. So if we include 0000 characters, we do not need to perform level-2 weight comparison.
> 
> /alwasy/always/

Fixed.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:444
>> +const uint8_t ducetLevel1Weights[128] = {
> 
> As we discussed offline, please fix the L1 weights too.  Thanks.

Fixed. Thanks!

>> JSTests/stress/ducet-level-3-or-4-comparison.js:11
>> +shouldBe(collator.compare("ABC", "abc"), 1);
> 
> Can you also add `shouldBe(collator.compare("ABC", "ABC"), 0);` just to confirm that the equal case isn't broken?

Sounds nice, added.
Comment 13 Yusuke Suzuki 2021-04-07 22:48:13 PDT
Committed r275653 (236289@main): <https://commits.webkit.org/236289@main>