Bug 229393 - Start teaching WidthIterator about character clusters
Summary: Start teaching WidthIterator about character clusters
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Myles C. Maxfield
URL:
Keywords: InRadar
Depends on:
Blocks: 229394
  Show dependency treegraph
 
Reported: 2021-08-23 01:31 PDT by Myles C. Maxfield
Modified: 2021-09-10 17:38 PDT (History)
3 users (show)

See Also:


Attachments
Patch (8.12 KB, patch)
2021-08-23 01:33 PDT, Myles C. Maxfield
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Myles C. Maxfield 2021-08-23 01:31:50 PDT
Start teaching WidthIterator about character clusters
Comment 1 Myles C. Maxfield 2021-08-23 01:33:32 PDT
Created attachment 436165 [details]
Patch
Comment 2 Radar WebKit Bug Importer 2021-08-30 01:32:22 PDT
<rdar://problem/82514688>
Comment 3 Fujii Hironori 2021-09-10 00:29:20 PDT
Comment on attachment 436165 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=436165&action=review

> Source/WebCore/ChangeLog:16
> +        * platform/graphics/ComposedCharacterClusterTextIterator.h: Added.

Unicode is using the term "grapheme cluster". Is "grapheme cluster" a better word?
https://unicode.org/reports/tr29/

> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46
> +    bool consume(UChar32& character, unsigned& clusterLength)

Why does 'consume' return a single character?
Comment 4 Myles C. Maxfield 2021-09-10 02:10:30 PDT
Comment on attachment 436165 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=436165&action=review

>> Source/WebCore/ChangeLog:16
>> +        * platform/graphics/ComposedCharacterClusterTextIterator.h: Added.
> 
> Unicode is using the term "grapheme cluster". Is "grapheme cluster" a better word?
> https://unicode.org/reports/tr29/

"Grapheme cluster" has a formal and rigorous definition defined by Unicode, but that definition doesn't actually always match what users view of as a conceptual "letter." It's close, but there are some tweaks that text engines use to better match user expectation. There's a more general, fuzzy term "composed character cluster" that matches this idea. It's similar to what css-text calls a "typographic letter unit" which has an intentionally fuzzy definition. The concept of a composed character cluster does not have a formal rigorous definition; representing it by a grapheme cluster is a good approximation, but many text engines go beyond this with their own custom tweaks.

>> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46
>> +    bool consume(UChar32& character, unsigned& clusterLength)
> 
> Why does 'consume' return a single character?

The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure.
Comment 5 Fujii Hironori 2021-09-10 05:13:56 PDT
Comment on attachment 436165 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=436165&action=review

>>> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46
>>> +    bool consume(UChar32& character, unsigned& clusterLength)
>> 
>> Why does 'consume' return a single character?
> 
> The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure.

'WidthIterator::advanceInternal' advances by 'advanceLength' which is 'clusterLength' returned by the 'consume' method.

https://github.com/WebKit/WebKit/blob/e6953d368b55b5daba4805cc9bbc9383d79e1375/Source/WebCore/platform/graphics/WidthIterator.cpp#L340

> textIterator.advance(advanceLength);

If the complex text code path is removed, I think only base characters of clusters are shown.
For example, family emoji 👨‍👩‍👧‍👦 shows only the first person 👨. This is unhappy.
Comment 6 Myles C. Maxfield 2021-09-10 10:21:02 PDT
Comment on attachment 436165 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=436165&action=review

>>>> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46
>>>> +    bool consume(UChar32& character, unsigned& clusterLength)
>>> 
>>> Why does 'consume' return a single character?
>> 
>> The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure.
> 
> 'WidthIterator::advanceInternal' advances by 'advanceLength' which is 'clusterLength' returned by the 'consume' method.
> 
> https://github.com/WebKit/WebKit/blob/e6953d368b55b5daba4805cc9bbc9383d79e1375/Source/WebCore/platform/graphics/WidthIterator.cpp#L340

Right. The next patch I’m planning to write is to iterate across the whole cluster, instead of only considering the base glyph. Font fallback has to be done for the cluster as a whole, and then after font fallback is complete, each code point in the cluster gets mapped to a glyph in the (one) font. This patch is the first step in that process.
Comment 7 Myles C. Maxfield 2021-09-10 10:24:38 PDT
Comment on attachment 436165 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=436165&action=review

>>>>> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46
>>>>> +    bool consume(UChar32& character, unsigned& clusterLength)
>>>> 
>>>> Why does 'consume' return a single character?
>>> 
>>> The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure.
>> 
>> 'WidthIterator::advanceInternal' advances by 'advanceLength' which is 'clusterLength' returned by the 'consume' method.
>> 
>> https://github.com/WebKit/WebKit/blob/e6953d368b55b5daba4805cc9bbc9383d79e1375/Source/WebCore/platform/graphics/WidthIterator.cpp#L340
> 
> Right. The next patch I’m planning to write is to iterate across the whole cluster, instead of only considering the base glyph. Font fallback has to be done for the cluster as a whole, and then after font fallback is complete, each code point in the cluster gets mapped to a glyph in the (one) font. This patch is the first step in that process.

(That’s why this new code is behind a USE(CLUSTER_AWARE_WIDTH_ITERATOR) which is off by default. Additional patches will be needed before we can flip the switch and enable the functionality. Also, I’m confident there will need to be performance work to make this cluster stuff as fast as the current behavior.)
Comment 8 EWS 2021-09-10 17:38:00 PDT
Committed r282304 (241576@main): <https://commits.webkit.org/241576@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 436165 [details].