RESOLVED FIXED 168182
Update custom line breaking iterators to the latest version of Unicode
https://bugs.webkit.org/show_bug.cgi?id=168182
Summary Update custom line breaking iterators to the latest version of Unicode
Myles C. Maxfield
Reported 2017-02-11 19:49:31 PST
The custom rules we use are from 2014 and appear to be from ICU 54. We should update them.
Attachments
WIP (46.43 KB, patch)
2017-02-12 15:16 PST, Myles C. Maxfield
no flags
Patch (530.45 KB, patch)
2017-02-12 20:00 PST, Myles C. Maxfield
zalan: review+
Myles C. Maxfield
Comment 1 2017-02-11 20:04:53 PST
I went through the breaking rules line by line, and compared it to ICU's 54.1 release[1] . I found a few things: 1. This version of ICU has no concept of strict vs loose line breaking. Therefore, my comparisons are done ignoring the loose/normal/strict pieces of our custom rules. These pieces of our rules work by adding / removing characters from the existing unicode sets. Newer unicode does have a concept of strict / loose rules. 1. Our emoji handling is custom, and not included in ICU. 2. We have a couple declarations hidden behind ADDITIONAL_EMOJI_SUPPORT which ICU includes. 3. There are three constructions we have which the open source rules don't have: $EXcm $INcm; $CM* $IN $CM* $EX; $CM+ $RI; We should probably just opt all ports into the ADDITIONAL_EMOJI_SUPPORT flag and delete the flag. The first two of the different constructions have to do with characters in the inseparable class, and the second one has to do with regional indicators. I'm not sure, but my theory right now is that these are just an oversight and aren't necessary. [1] http://source.icu-project.org/repos/icu/icu/tags/release-54-1/source/data/brkitr/line.txt
Myles C. Maxfield
Comment 2 2017-02-11 20:10:39 PST
ICU 55.1 is the first release to support the concept of loose vs strict line breaking. However, I'm not sure which ICU is used on the oldest OS we support.
Myles C. Maxfield
Comment 3 2017-02-11 20:22:07 PST
I believe the oldest OS release we support is El Capitan, which uses ICU 55, which has loose & strict line breaking. Perhaps the best solution is to just remove the custom rules entirely.
Myles C. Maxfield
Comment 4 2017-02-11 20:23:24 PST
(In reply to comment #3) > which uses ICU 55, *** 55.1
Radar WebKit Bug Importer
Comment 5 2017-02-11 20:30:26 PST
Myles C. Maxfield
Comment 6 2017-02-12 15:16:53 PST
Myles C. Maxfield
Comment 7 2017-02-12 15:19:11 PST
Myles C. Maxfield
Comment 8 2017-02-12 20:00:09 PST
Myles C. Maxfield
Comment 9 2017-02-13 11:09:32 PST
Saam Barati
Comment 10 2020-04-28 21:32:29 PDT
Comment on attachment 301329 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=301329&action=review > Source/WTF/wtf/text/LineBreakIteratorPoolICU.h:54 > + Vector<char> scratchBuffer(utf8Locale.length() + 11, 0); Was this supposed to be + 1?
Myles C. Maxfield
Comment 11 2020-05-04 23:18:43 PDT
Comment on attachment 301329 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=301329&action=review >> Source/WTF/wtf/text/LineBreakIteratorPoolICU.h:54 >> + Vector<char> scratchBuffer(utf8Locale.length() + 11, 0); > > Was this supposed to be + 1? Nope. The below code will add things like "@lb=normal" to the string, which is 10 characters long. Plus an additional one for the '\0' makes 11.
Note You need to log in before you can comment on or make changes to this bug.