RESOLVED FIXED 224047
[JSC] DUCET level-1 weighs are equal if characters are alphabets
https://bugs.webkit.org/show_bug.cgi?id=224047
Summary [JSC] DUCET level-1 weighs are equal if characters are alphabets
Eben Packwood
Reported 2021-04-01 07:14:22 PDT
Created attachment 424894 [details] Minimal working example Sorting certain string arrays using `localeCompare` can result in the array being sorted differently when `sort` is called multiple times successively. It's not immediately clear what the root cause of this issue is, but it seems to be related to the presence of certain unicode strings in the array. A minimal working example is attached. Each call to `sort` results in a different ordering of the array.
Attachments
Minimal working example (314 bytes, application/x-javascript)
2021-04-01 07:14 PDT, Eben Packwood
no flags
Patch (9.34 KB, patch)
2021-04-07 14:46 PDT, Yusuke Suzuki
no flags
Patch (10.61 KB, patch)
2021-04-07 14:54 PDT, Yusuke Suzuki
saam: review+
sorted DUCET data (ASCII only) excluding 0000 entries (4.87 KB, text/plain)
2021-04-07 16:44 PDT, Yusuke Suzuki
no flags
Patch (10.55 KB, patch)
2021-04-07 16:57 PDT, Yusuke Suzuki
no flags
Eben Packwood
Comment 1 2021-04-06 11:40:15 PDT
It appears as if the issue is that `localeCompare` does not define a total ordering for strings, as required in the ECMAScript standard (https://tc39.es/ecma262/#sec-string.prototype.localecompare). E.g. ``` "AB - AS Foobar".localeCompare("AB - AS Pulheim Käther") === -1; "AB - AS Pulheim Käther".localeCompare("Abz - Baz Qux") === -1; "Abz - Baz Qux".localeCompare("AB - AS Foobar") === -1; ```
Rob
Comment 2 2021-04-06 16:56:50 PDT
I was looking into this a bit with Eben and simplified the test case a bit further. It looks like when comparing pure ASCII strings, localeCompare does a case-sensitive comparison. If the strings are not pure ASCII, localeCompare does a case-insensitive comparison. The only problematic result in this snippet is the third line: ``` "ABC".localeCompare("ABD Ã") === -1 "ABD Ã".localeCompare("AbE") === -1 "AbE".localeCompare("ABC") === -1 ```
Smoley
Comment 3 2021-04-07 12:27:56 PDT
This reproduces for me on Safari 13.1.2 as well as TOT 14.2 using the provided test case.
Radar WebKit Bug Importer
Comment 4 2021-04-07 12:28:11 PDT
Yusuke Suzuki
Comment 5 2021-04-07 14:46:32 PDT
Yusuke Suzuki
Comment 6 2021-04-07 14:54:31 PDT
Yusuke Suzuki
Comment 7 2021-04-07 16:42:00 PDT
Script used to compute DUCET weight. import re data = [] level1Array = [] level2Array = [] level3Array = [] level4Array = [] for i in range(0, 128): data.append([0, 0, 0, 0]) with open("sorted.txt") as file: for line in file: m = re.match("^([0-9A-F]{4}) ; \\[([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]", line) level1Array.append(int(m.group(2), 16)) level2Array.append(int(m.group(3), 16)) level3Array.append(int(m.group(4), 16)) level4Array.append(int(m.group(5), 16)) level1Array = list(set(level1Array)) level2Array = list(set(level2Array)) level3Array = list(set(level3Array)) level4Array = list(set(level4Array)) level1Array.sort() level2Array.sort() level3Array.sort() level4Array.sort() with open("sorted.txt") as file: for line in file: m = re.match("^([0-9A-F]{4}) ; \\[([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]", line) code = int(m.group(1), 16) level1 = int(m.group(2), 16) level2 = int(m.group(3), 16) level3 = int(m.group(4), 16) level4 = int(m.group(5), 16) # putting +1 on level1 since 0 is special meaning, and they are already # excluded. data[code] = [level1Array.index(level1) + 1, level2Array.index(level2), level3Array.index(level3), level4Array.index(level4)] print(level1) for level in range(0, 4): print("level%d" % (level + 1)) result = [] for i in range(0, 128): result.append(str(data[i][level])) if i % 8 == 7: result.append(",\n") else: result.append(", ") print("".join(result)) # vim: set sw=4 ts=4 et tw=80 :
Yusuke Suzuki
Comment 8 2021-04-07 16:44:17 PDT
Created attachment 425454 [details] sorted DUCET data (ASCII only) excluding 0000 entries
Mark Lam
Comment 9 2021-04-07 16:47:58 PDT
Comment on attachment 425444 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review > Source/JavaScriptCore/ChangeLog:12 > + 1. If we found that the result of level-1 weight comparison is equal, and characters are not eqaul, then we do level-3 weight comparison. /not eqaul/not equal/ > Source/JavaScriptCore/runtime/IntlObject.cpp:436 > +// 6. Level-2 weights are alwasy 0020 except for 0000 cases. So if we include 0000 characters, we do not need to perform level-2 weight comparison. /alwasy/always/ > JSTests/stress/ducet-level-3-or-4-comparison.js:11 > +shouldBe(collator.compare("ABC", "abc"), 1); Can you also add `shouldBe(collator.compare("ABC", "ABC"), 0);` just to confirm that the equal case isn't broken?
Mark Lam
Comment 10 2021-04-07 16:53:03 PDT
Comment on attachment 425444 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review r=me too. > Source/JavaScriptCore/runtime/IntlObject.cpp:444 > +const uint8_t ducetLevel1Weights[128] = { As we discussed offline, please fix the L1 weights too. Thanks.
Yusuke Suzuki
Comment 11 2021-04-07 16:57:39 PDT
Yusuke Suzuki
Comment 12 2021-04-07 22:44:36 PDT
Comment on attachment 425444 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review >> Source/JavaScriptCore/ChangeLog:12 >> + 1. If we found that the result of level-1 weight comparison is equal, and characters are not eqaul, then we do level-3 weight comparison. > > /not eqaul/not equal/ Fixed. >> Source/JavaScriptCore/runtime/IntlObject.cpp:436 >> +// 6. Level-2 weights are alwasy 0020 except for 0000 cases. So if we include 0000 characters, we do not need to perform level-2 weight comparison. > > /alwasy/always/ Fixed. >> Source/JavaScriptCore/runtime/IntlObject.cpp:444 >> +const uint8_t ducetLevel1Weights[128] = { > > As we discussed offline, please fix the L1 weights too. Thanks. Fixed. Thanks! >> JSTests/stress/ducet-level-3-or-4-comparison.js:11 >> +shouldBe(collator.compare("ABC", "abc"), 1); > > Can you also add `shouldBe(collator.compare("ABC", "ABC"), 0);` just to confirm that the equal case isn't broken? Sounds nice, added.
Yusuke Suzuki
Comment 13 2021-04-07 22:48:13 PDT
Yusuke Suzuki
Comment 14 2021-09-15 11:08:48 PDT
*** Bug 230199 has been marked as a duplicate of this bug. ***
Note You need to log in before you can comment on or make changes to this bug.