Bug 224047

Summary:

[JSC] DUCET level-1 weighs are equal if characters are alphabets

Product:

WebKit

Reporter:

Eben Packwood <ebenpackwood>

Component:

JavaScriptCore

Assignee:

Yusuke Suzuki <ysuzuki>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

7raivis, ews-watchlist, keith_miller, mark.lam, msaboff, rob, saam, smoley, tzagallo, webkit-bug-importer, ysuzuki

Priority:

Keywords:

InRadar

Version:

Safari 14

Hardware:

Mac (Intel)

OS:

macOS 10.15

Attachments:

Description	Flags
Minimal working example	none
Patch	none
Patch	saam: review+
sorted DUCET data (ASCII only) excluding 0000 entries	none
Patch	none

Eben Packwood

Reported 2021-04-01 07:14:22 PDT

Created attachment 424894 [details] Minimal working example Sorting certain string arrays using `localeCompare` can result in the array being sorted differently when `sort` is called multiple times successively. It's not immediately clear what the root cause of this issue is, but it seems to be related to the presence of certain unicode strings in the array. A minimal working example is attached. Each call to `sort` results in a different ordering of the array.

Attachments
Minimal working example (314 bytes, application/x-javascript) 2021-04-01 07:14 PDT, Eben Packwood	no flags	Details
Patch (9.34 KB, patch) 2021-04-07 14:46 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (10.61 KB, patch) 2021-04-07 14:54 PDT, Yusuke Suzuki	saam: review+	Details Formatted Diff Diff
sorted DUCET data (ASCII only) excluding 0000 entries (4.87 KB, text/plain) 2021-04-07 16:44 PDT, Yusuke Suzuki	no flags	Details
Patch (10.55 KB, patch) 2021-04-07 16:57 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Eben Packwood

Comment 1 2021-04-06 11:40:15 PDT

It appears as if the issue is that `localeCompare` does not define a total ordering for strings, as required in the ECMAScript standard (https://tc39.es/ecma262/#sec-string.prototype.localecompare). E.g. ``` "AB - AS Foobar".localeCompare("AB - AS Pulheim KÃƒÂ¤ther") === -1; "AB - AS Pulheim KÃƒÂ¤ther".localeCompare("Abz - Baz Qux") === -1; "Abz - Baz Qux".localeCompare("AB - AS Foobar") === -1; ```

Rob

Comment 2 2021-04-06 16:56:50 PDT

I was looking into this a bit with Eben and simplified the test case a bit further. It looks like when comparing pure ASCII strings, localeCompare does a case-sensitive comparison. If the strings are not pure ASCII, localeCompare does a case-insensitive comparison. The only problematic result in this snippet is the third line: ``` "ABC".localeCompare("ABD Ã") === -1 "ABD Ã".localeCompare("AbE") === -1 "AbE".localeCompare("ABC") === -1 ```

Smoley

Comment 3 2021-04-07 12:27:56 PDT

This reproduces for me on Safari 13.1.2 as well as TOT 14.2 using the provided test case.

Radar WebKit Bug Importer

Comment 4 2021-04-07 12:28:11 PDT

<rdar://problem/76360173>

Yusuke Suzuki

Comment 5 2021-04-07 14:46:32 PDT

Created attachment 425441 [details] Patch

Yusuke Suzuki

Comment 6 2021-04-07 14:54:31 PDT

Created attachment 425444 [details] Patch

Yusuke Suzuki

Comment 7 2021-04-07 16:42:00 PDT

Script used to compute DUCET weight. import re data = [] level1Array = [] level2Array = [] level3Array = [] level4Array = [] for i in range(0, 128): data.append([0, 0, 0, 0]) with open("sorted.txt") as file: for line in file: m = re.match("^([0-9A-F]{4}) ; \\[([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]", line) level1Array.append(int(m.group(2), 16)) level2Array.append(int(m.group(3), 16)) level3Array.append(int(m.group(4), 16)) level4Array.append(int(m.group(5), 16)) level1Array = list(set(level1Array)) level2Array = list(set(level2Array)) level3Array = list(set(level3Array)) level4Array = list(set(level4Array)) level1Array.sort() level2Array.sort() level3Array.sort() level4Array.sort() with open("sorted.txt") as file: for line in file: m = re.match("^([0-9A-F]{4}) ; \\[([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]", line) code = int(m.group(1), 16) level1 = int(m.group(2), 16) level2 = int(m.group(3), 16) level3 = int(m.group(4), 16) level4 = int(m.group(5), 16) # putting +1 on level1 since 0 is special meaning, and they are already # excluded. data[code] = [level1Array.index(level1) + 1, level2Array.index(level2), level3Array.index(level3), level4Array.index(level4)] print(level1) for level in range(0, 4): print("level%d" % (level + 1)) result = [] for i in range(0, 128): result.append(str(data[i][level])) if i % 8 == 7: result.append(",\n") else: result.append(", ") print("".join(result)) # vim: set sw=4 ts=4 et tw=80 :

Yusuke Suzuki

Comment 8 2021-04-07 16:44:17 PDT

Created attachment 425454 [details] sorted DUCET data (ASCII only) excluding 0000 entries

Mark Lam

Comment 9 2021-04-07 16:47:58 PDT

Comment on attachment 425444 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review > Source/JavaScriptCore/ChangeLog:12 > + 1. If we found that the result of level-1 weight comparison is equal, and characters are not eqaul, then we do level-3 weight comparison. /not eqaul/not equal/ > Source/JavaScriptCore/runtime/IntlObject.cpp:436 > +// 6. Level-2 weights are alwasy 0020 except for 0000 cases. So if we include 0000 characters, we do not need to perform level-2 weight comparison. /alwasy/always/ > JSTests/stress/ducet-level-3-or-4-comparison.js:11 > +shouldBe(collator.compare("ABC", "abc"), 1); Can you also add `shouldBe(collator.compare("ABC", "ABC"), 0);` just to confirm that the equal case isn't broken?

Mark Lam

Comment 10 2021-04-07 16:53:03 PDT

Comment on attachment 425444 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review r=me too. > Source/JavaScriptCore/runtime/IntlObject.cpp:444 > +const uint8_t ducetLevel1Weights[128] = { As we discussed offline, please fix the L1 weights too. Thanks.

Yusuke Suzuki

Comment 11 2021-04-07 16:57:39 PDT

Created attachment 425457 [details] Patch

Yusuke Suzuki

Comment 12 2021-04-07 22:44:36 PDT

Comment on attachment 425444 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=425444&action=review >> Source/JavaScriptCore/ChangeLog:12 >> + 1. If we found that the result of level-1 weight comparison is equal, and characters are not eqaul, then we do level-3 weight comparison. > > /not eqaul/not equal/ Fixed. >> Source/JavaScriptCore/runtime/IntlObject.cpp:436 >> +// 6. Level-2 weights are alwasy 0020 except for 0000 cases. So if we include 0000 characters, we do not need to perform level-2 weight comparison. > > /alwasy/always/ Fixed. >> Source/JavaScriptCore/runtime/IntlObject.cpp:444 >> +const uint8_t ducetLevel1Weights[128] = { > > As we discussed offline, please fix the L1 weights too. Thanks. Fixed. Thanks! >> JSTests/stress/ducet-level-3-or-4-comparison.js:11 >> +shouldBe(collator.compare("ABC", "abc"), 1); > > Can you also add `shouldBe(collator.compare("ABC", "ABC"), 0);` just to confirm that the equal case isn't broken? Sounds nice, added.

Yusuke Suzuki

Comment 13 2021-04-07 22:48:13 PDT

Committed r275653 (236289@main): <https://commits.webkit.org/236289@main>

Yusuke Suzuki

Comment 14 2021-09-15 11:08:48 PDT

*** Bug 230199 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.