Summary: | [JSC] DUCET level-1 weighs are equal if characters are alphabets | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Eben Packwood <ebenpackwood> | ||||||||||||
Component: | JavaScriptCore | Assignee: | Yusuke Suzuki <ysuzuki> | ||||||||||||
Status: | RESOLVED FIXED | ||||||||||||||
Severity: | Normal | CC: | 7raivis, ews-watchlist, keith_miller, mark.lam, msaboff, rob, saam, smoley, tzagallo, webkit-bug-importer, ysuzuki | ||||||||||||
Priority: | P2 | Keywords: | InRadar | ||||||||||||
Version: | Safari 14 | ||||||||||||||
Hardware: | Mac (Intel) | ||||||||||||||
OS: | macOS 10.15 | ||||||||||||||
Attachments: |
|
It appears as if the issue is that `localeCompare` does not define a total ordering for strings, as required in the ECMAScript standard (https://tc39.es/ecma262/#sec-string.prototype.localecompare). E.g. ``` "AB - AS Foobar".localeCompare("AB - AS Pulheim Käther") === -1; "AB - AS Pulheim Käther".localeCompare("Abz - Baz Qux") === -1; "Abz - Baz Qux".localeCompare("AB - AS Foobar") === -1; ``` I was looking into this a bit with Eben and simplified the test case a bit further. It looks like when comparing pure ASCII strings, localeCompare does a case-sensitive comparison. If the strings are not pure ASCII, localeCompare does a case-insensitive comparison. The only problematic result in this snippet is the third line: ``` "ABC".localeCompare("ABD Ã") === -1 "ABD Ã".localeCompare("AbE") === -1 "AbE".localeCompare("ABC") === -1 ``` This reproduces for me on Safari 13.1.2 as well as TOT 14.2 using the provided test case. Created attachment 425441 [details]
Patch
Created attachment 425444 [details]
Patch
Script used to compute DUCET weight. import re data = [] level1Array = [] level2Array = [] level3Array = [] level4Array = [] for i in range(0, 128): data.append([0, 0, 0, 0]) with open("sorted.txt") as file: for line in file: m = re.match("^([0-9A-F]{4}) ; \\[([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]", line) level1Array.append(int(m.group(2), 16)) level2Array.append(int(m.group(3), 16)) level3Array.append(int(m.group(4), 16)) level4Array.append(int(m.group(5), 16)) level1Array = list(set(level1Array)) level2Array = list(set(level2Array)) level3Array = list(set(level3Array)) level4Array = list(set(level4Array)) level1Array.sort() level2Array.sort() level3Array.sort() level4Array.sort() with open("sorted.txt") as file: for line in file: m = re.match("^([0-9A-F]{4}) ; \\[([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]", line) code = int(m.group(1), 16) level1 = int(m.group(2), 16) level2 = int(m.group(3), 16) level3 = int(m.group(4), 16) level4 = int(m.group(5), 16) # putting +1 on level1 since 0 is special meaning, and they are already # excluded. data[code] = [level1Array.index(level1) + 1, level2Array.index(level2), level3Array.index(level3), level4Array.index(level4)] print(level1) for level in range(0, 4): print("level%d" % (level + 1)) result = [] for i in range(0, 128): result.append(str(data[i][level])) if i % 8 == 7: result.append(",\n") else: result.append(", ") print("".join(result)) # vim: set sw=4 ts=4 et tw=80 : Created attachment 425454 [details]
sorted DUCET data (ASCII only) excluding 0000 entries
Comment on attachment 425444 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review > Source/JavaScriptCore/ChangeLog:12 > + 1. If we found that the result of level-1 weight comparison is equal, and characters are not eqaul, then we do level-3 weight comparison. /not eqaul/not equal/ > Source/JavaScriptCore/runtime/IntlObject.cpp:436 > +// 6. Level-2 weights are alwasy 0020 except for 0000 cases. So if we include 0000 characters, we do not need to perform level-2 weight comparison. /alwasy/always/ > JSTests/stress/ducet-level-3-or-4-comparison.js:11 > +shouldBe(collator.compare("ABC", "abc"), 1); Can you also add `shouldBe(collator.compare("ABC", "ABC"), 0);` just to confirm that the equal case isn't broken? Comment on attachment 425444 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review r=me too. > Source/JavaScriptCore/runtime/IntlObject.cpp:444 > +const uint8_t ducetLevel1Weights[128] = { As we discussed offline, please fix the L1 weights too. Thanks. Created attachment 425457 [details]
Patch
Comment on attachment 425444 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review >> Source/JavaScriptCore/ChangeLog:12 >> + 1. If we found that the result of level-1 weight comparison is equal, and characters are not eqaul, then we do level-3 weight comparison. > > /not eqaul/not equal/ Fixed. >> Source/JavaScriptCore/runtime/IntlObject.cpp:436 >> +// 6. Level-2 weights are alwasy 0020 except for 0000 cases. So if we include 0000 characters, we do not need to perform level-2 weight comparison. > > /alwasy/always/ Fixed. >> Source/JavaScriptCore/runtime/IntlObject.cpp:444 >> +const uint8_t ducetLevel1Weights[128] = { > > As we discussed offline, please fix the L1 weights too. Thanks. Fixed. Thanks! >> JSTests/stress/ducet-level-3-or-4-comparison.js:11 >> +shouldBe(collator.compare("ABC", "abc"), 1); > > Can you also add `shouldBe(collator.compare("ABC", "ABC"), 0);` just to confirm that the equal case isn't broken? Sounds nice, added. Committed r275653 (236289@main): <https://commits.webkit.org/236289@main> *** Bug 230199 has been marked as a duplicate of this bug. *** |
Created attachment 424894 [details] Minimal working example Sorting certain string arrays using `localeCompare` can result in the array being sorted differently when `sort` is called multiple times successively. It's not immediately clear what the root cause of this issue is, but it seems to be related to the presence of certain unicode strings in the array. A minimal working example is attached. Each call to `sort` results in a different ordering of the array.