Start teaching WidthIterator about character clusters
Created attachment 436165 [details] Patch
<rdar://problem/82514688>
Comment on attachment 436165 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=436165&action=review > Source/WebCore/ChangeLog:16 > + * platform/graphics/ComposedCharacterClusterTextIterator.h: Added. Unicode is using the term "grapheme cluster". Is "grapheme cluster" a better word? https://unicode.org/reports/tr29/ > Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46 > + bool consume(UChar32& character, unsigned& clusterLength) Why does 'consume' return a single character?
Comment on attachment 436165 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=436165&action=review >> Source/WebCore/ChangeLog:16 >> + * platform/graphics/ComposedCharacterClusterTextIterator.h: Added. > > Unicode is using the term "grapheme cluster". Is "grapheme cluster" a better word? > https://unicode.org/reports/tr29/ "Grapheme cluster" has a formal and rigorous definition defined by Unicode, but that definition doesn't actually always match what users view of as a conceptual "letter." It's close, but there are some tweaks that text engines use to better match user expectation. There's a more general, fuzzy term "composed character cluster" that matches this idea. It's similar to what css-text calls a "typographic letter unit" which has an intentionally fuzzy definition. The concept of a composed character cluster does not have a formal rigorous definition; representing it by a grapheme cluster is a good approximation, but many text engines go beyond this with their own custom tweaks. >> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46 >> + bool consume(UChar32& character, unsigned& clusterLength) > > Why does 'consume' return a single character? The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure.
Comment on attachment 436165 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=436165&action=review >>> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46 >>> + bool consume(UChar32& character, unsigned& clusterLength) >> >> Why does 'consume' return a single character? > > The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure. 'WidthIterator::advanceInternal' advances by 'advanceLength' which is 'clusterLength' returned by the 'consume' method. https://github.com/WebKit/WebKit/blob/e6953d368b55b5daba4805cc9bbc9383d79e1375/Source/WebCore/platform/graphics/WidthIterator.cpp#L340 > textIterator.advance(advanceLength); If the complex text code path is removed, I think only base characters of clusters are shown. For example, family emoji 👨👩👧👦 shows only the first person 👨. This is unhappy.
Comment on attachment 436165 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=436165&action=review >>>> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46 >>>> + bool consume(UChar32& character, unsigned& clusterLength) >>> >>> Why does 'consume' return a single character? >> >> The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure. > > 'WidthIterator::advanceInternal' advances by 'advanceLength' which is 'clusterLength' returned by the 'consume' method. > > https://github.com/WebKit/WebKit/blob/e6953d368b55b5daba4805cc9bbc9383d79e1375/Source/WebCore/platform/graphics/WidthIterator.cpp#L340 Right. The next patch I’m planning to write is to iterate across the whole cluster, instead of only considering the base glyph. Font fallback has to be done for the cluster as a whole, and then after font fallback is complete, each code point in the cluster gets mapped to a glyph in the (one) font. This patch is the first step in that process.
Comment on attachment 436165 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=436165&action=review >>>>> Source/WebCore/platform/graphics/ComposedCharacterClusterTextIterator.h:46 >>>>> + bool consume(UChar32& character, unsigned& clusterLength) >>>> >>>> Why does 'consume' return a single character? >>> >>> The |UChar32& character| is an out param that returns the "base" character in the cluster. The "base" character is always the first character. For example, if the cluster is U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (to form "é"), the "base" character is the "e". I believe the term "base character" in a cluster is a term defined by Unicode, but I'm not sure. >> >> 'WidthIterator::advanceInternal' advances by 'advanceLength' which is 'clusterLength' returned by the 'consume' method. >> >> https://github.com/WebKit/WebKit/blob/e6953d368b55b5daba4805cc9bbc9383d79e1375/Source/WebCore/platform/graphics/WidthIterator.cpp#L340 > > Right. The next patch I’m planning to write is to iterate across the whole cluster, instead of only considering the base glyph. Font fallback has to be done for the cluster as a whole, and then after font fallback is complete, each code point in the cluster gets mapped to a glyph in the (one) font. This patch is the first step in that process. (That’s why this new code is behind a USE(CLUSTER_AWARE_WIDTH_ITERATOR) which is off by default. Additional patches will be needed before we can flip the switch and enable the functionality. Also, I’m confident there will need to be performance work to make this cluster stuff as fast as the current behavior.)
Committed r282304 (241576@main): <https://commits.webkit.org/241576@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 436165 [details].