Bug 224047

Summary: [JSC] DUCET level-1 weighs are equal if characters are alphabets
Product: WebKit Reporter: Eben Packwood <ebenpackwood>
Component: JavaScriptCoreAssignee: Yusuke Suzuki <ysuzuki>
Status: RESOLVED FIXED    
Severity: Normal CC: 7raivis, ews-watchlist, keith_miller, mark.lam, msaboff, rob, saam, smoley, tzagallo, webkit-bug-importer, ysuzuki
Priority: P2 Keywords: InRadar
Version: Safari 14   
Hardware: Mac (Intel)   
OS: macOS 10.15   
Attachments:
Description Flags
Minimal working example
none
Patch
none
Patch
saam: review+
sorted DUCET data (ASCII only) excluding 0000 entries
none
Patch none

Description Eben Packwood 2021-04-01 07:14:22 PDT
Created attachment 424894 [details]
Minimal working example

Sorting certain string arrays using `localeCompare` can result in the array being sorted differently when `sort` is called multiple times successively.

It's not immediately clear what the root cause of this issue is, but it seems to be related to the presence of certain unicode strings in the array.

A minimal working example is attached. Each call to `sort` results in a different ordering of the array.
Comment 1 Eben Packwood 2021-04-06 11:40:15 PDT
It appears as if the issue is that `localeCompare` does not define a total ordering for strings, as required in the ECMAScript standard (https://tc39.es/ecma262/#sec-string.prototype.localecompare).

E.g. 

```
"AB - AS Foobar".localeCompare("AB - AS Pulheim Käther") === -1;
"AB - AS Pulheim Käther".localeCompare("Abz - Baz Qux") === -1;
"Abz - Baz Qux".localeCompare("AB - AS Foobar") === -1;
```
Comment 2 Rob 2021-04-06 16:56:50 PDT
I was looking into this a bit with Eben and simplified the test case a bit further. It looks like when comparing pure ASCII strings, localeCompare does a case-sensitive comparison. If the strings are not pure ASCII, localeCompare does a case-insensitive comparison. The only problematic result in this snippet is the third line:

```
"ABC".localeCompare("ABD Ã") === -1
"ABD Ã".localeCompare("AbE") === -1
"AbE".localeCompare("ABC") === -1
```
Comment 3 Smoley 2021-04-07 12:27:56 PDT
This reproduces for me on Safari 13.1.2 as well as TOT 14.2 using the provided test case.
Comment 4 Radar WebKit Bug Importer 2021-04-07 12:28:11 PDT
<rdar://problem/76360173>
Comment 5 Yusuke Suzuki 2021-04-07 14:46:32 PDT
Created attachment 425441 [details]
Patch
Comment 6 Yusuke Suzuki 2021-04-07 14:54:31 PDT
Created attachment 425444 [details]
Patch
Comment 7 Yusuke Suzuki 2021-04-07 16:42:00 PDT
Script used to compute DUCET weight.


import re

data = []

level1Array = []
level2Array = []
level3Array = []
level4Array = []

for i in range(0, 128):
    data.append([0, 0, 0, 0])

with open("sorted.txt") as file:
    for line in file:
        m = re.match("^([0-9A-F]{4})  ; \\[([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]", line)
        level1Array.append(int(m.group(2), 16))
        level2Array.append(int(m.group(3), 16))
        level3Array.append(int(m.group(4), 16))
        level4Array.append(int(m.group(5), 16))

    level1Array = list(set(level1Array))
    level2Array = list(set(level2Array))
    level3Array = list(set(level3Array))
    level4Array = list(set(level4Array))
    level1Array.sort()
    level2Array.sort()
    level3Array.sort()
    level4Array.sort()

with open("sorted.txt") as file:
    for line in file:
        m = re.match("^([0-9A-F]{4})  ; \\[([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]", line)
        code = int(m.group(1), 16)
        level1 = int(m.group(2), 16)
        level2 = int(m.group(3), 16)
        level3 = int(m.group(4), 16)
        level4 = int(m.group(5), 16)
        # putting +1 on level1 since 0 is special meaning, and they are already
        # excluded.
        data[code] = [level1Array.index(level1) + 1, level2Array.index(level2),
                level3Array.index(level3), level4Array.index(level4)]
        print(level1)

for level in range(0, 4):
    print("level%d" % (level + 1))
    result = []
    for i in range(0, 128):
        result.append(str(data[i][level]))
        if i % 8 == 7:
            result.append(",\n")
        else:
            result.append(", ")
    print("".join(result))

# vim: set sw=4 ts=4 et tw=80 :
Comment 8 Yusuke Suzuki 2021-04-07 16:44:17 PDT
Created attachment 425454 [details]
sorted DUCET data (ASCII only) excluding 0000 entries
Comment 9 Mark Lam 2021-04-07 16:47:58 PDT
Comment on attachment 425444 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review

> Source/JavaScriptCore/ChangeLog:12
> +            1. If we found that the result of level-1 weight comparison is equal, and characters are not eqaul, then we do level-3 weight comparison.

/not eqaul/not equal/

> Source/JavaScriptCore/runtime/IntlObject.cpp:436
> +//     6. Level-2 weights are alwasy 0020 except for 0000 cases. So if we include 0000 characters, we do not need to perform level-2 weight comparison.

/alwasy/always/

> JSTests/stress/ducet-level-3-or-4-comparison.js:11
> +shouldBe(collator.compare("ABC", "abc"), 1);

Can you also add `shouldBe(collator.compare("ABC", "ABC"), 0);` just to confirm that the equal case isn't broken?
Comment 10 Mark Lam 2021-04-07 16:53:03 PDT
Comment on attachment 425444 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review

r=me too.

> Source/JavaScriptCore/runtime/IntlObject.cpp:444
> +const uint8_t ducetLevel1Weights[128] = {

As we discussed offline, please fix the L1 weights too.  Thanks.
Comment 11 Yusuke Suzuki 2021-04-07 16:57:39 PDT
Created attachment 425457 [details]
Patch
Comment 12 Yusuke Suzuki 2021-04-07 22:44:36 PDT
Comment on attachment 425444 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review

>> Source/JavaScriptCore/ChangeLog:12
>> +            1. If we found that the result of level-1 weight comparison is equal, and characters are not eqaul, then we do level-3 weight comparison.
> 
> /not eqaul/not equal/

Fixed.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:436
>> +//     6. Level-2 weights are alwasy 0020 except for 0000 cases. So if we include 0000 characters, we do not need to perform level-2 weight comparison.
> 
> /alwasy/always/

Fixed.

>> Source/JavaScriptCore/runtime/IntlObject.cpp:444
>> +const uint8_t ducetLevel1Weights[128] = {
> 
> As we discussed offline, please fix the L1 weights too.  Thanks.

Fixed. Thanks!

>> JSTests/stress/ducet-level-3-or-4-comparison.js:11
>> +shouldBe(collator.compare("ABC", "abc"), 1);
> 
> Can you also add `shouldBe(collator.compare("ABC", "ABC"), 0);` just to confirm that the equal case isn't broken?

Sounds nice, added.
Comment 13 Yusuke Suzuki 2021-04-07 22:48:13 PDT
Committed r275653 (236289@main): <https://commits.webkit.org/236289@main>
Comment 14 Yusuke Suzuki 2021-09-15 11:08:48 PDT
*** Bug 230199 has been marked as a duplicate of this bug. ***