Bug 233748 - Tamil conjuncts are not selected as a single unit when styling initials
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Layout and Rendering
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Nobody
Keywords: InRadar
Depends on:
Reported: 2021-12-01 23:06 PST by Fuqiao Xue
Modified: 2021-12-08 23:06 PST
CC List: 7 users

See Also:

Test case (429 bytes, text/html)
2021-12-01 23:06 PST, Fuqiao Xue

Description Fuqiao Xue 2021-12-01 23:06:02 PST
Created attachment 445674
Test case

When a line starts with a consonant cluster that is rendered as a conjunct (rather than with a visible virama), ::first-letter should select the whole cluster. Modern Tamil usually has only two such conjuncts; however, one of them can be created in two ways, making a total of 3 clusters to test.

This doesn't work well if segmentation relies on Unicode grapheme clusters, since a conjunct with two consonants will be parsed as two grapheme clusters (the first ending after the virama, and the second starting with the second consonant and including any following vowel-signs or other combining characters).

For these situations it is necessary to tailor the segmentation algorithm, so that it recognises the whole consonant cluster plus any attached vowel-signs or combining characters as a single unit. This is a particular issue for Tamil, since all other clusters are typically decomposed and show the virama.
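
A minimal sketch of that kind of tailoring, in Python, may help make the request concrete. It is not WebKit code and not a full UAX29 implementation: `simple_clusters` only approximates grapheme clusters as a base character plus following combining marks, and the `CONJUNCTS` table is an assumption about which code point sequences form the three clusters mentioned above (KA + virama + SSA for KSHA, and SA or SHA + virama + RA for SHRI). All function and constant names are hypothetical.

```python
import unicodedata

VIRAMA = "\u0BCD"  # TAMIL SIGN VIRAMA (pulli)

# Assumed conjunct-forming pairs: (consonant before virama, consonant after).
CONJUNCTS = {
    ("\u0B95", "\u0BB7"),  # KA + SSA  -> KSHA
    ("\u0BB8", "\u0BB0"),  # SA + RA   -> SHRI (followed by vowel sign II)
    ("\u0BB6", "\u0BB0"),  # SHA + RA  -> SHRI (alternative encoding)
}


def simple_clusters(text):
    """Approximate grapheme clusters: a base character plus following marks."""
    clusters = []
    for ch in text:
        # General categories Mn/Mc/Me cover the virama and the vowel signs.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def tailored_clusters(text):
    """Merge a cluster ending in virama with the next one for known conjuncts."""
    merged = []
    for cluster in simple_clusters(text):
        prev = merged[-1] if merged else ""
        if (
            len(prev) >= 2
            and prev[-1] == VIRAMA
            and (prev[-2], cluster[0]) in CONJUNCTS
        ):
            merged[-1] += cluster
        else:
            merged.append(cluster)
    return merged


def first_letter(text):
    """Return the unit that ::first-letter would be expected to select."""
    units = tailored_clusters(text)
    return units[0] if units else ""
```

With this sketch, `first_letter("\u0B95\u0BCD\u0BB7\u0BAE")` returns the whole KSHA sequence, while a cluster that is not in the table, such as PA + virama + PA, still splits after the virama. A fixed lookup table is used here only because the description says modern Tamil has so few conjuncts; a real implementation would more likely tailor the platform's segmentation rules.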

Tests & results:

Interactive test: When ::first-letter is applied to Tamil, the browser will select the KSHA and SHRI conjuncts as a single unit.

Gecko produces the expected result. WebKit and Blink only select the first consonant+pulli.
Comment 1 Darin Adler 2021-12-03 09:41:09 PST
I wonder which Unicode algorithm is the basis for implementing the correct behavior here. We don’t want to come up with something novel, but I understand that to get this right we need to go beyond "grapheme cluster".
Comment 2 Darin Adler 2021-12-03 10:04:43 PST
For example, is "extended grapheme cluster" enough?
Comment 3 Alexey Proskuryakov 2021-12-03 17:37:49 PST
FWIW, following https://drafts.csswg.org/css-pseudo/#first-letter-pseudo it looks like we'd need to devise something that matches platform behavior:

> A UA must use the extended grapheme cluster (not legacy grapheme cluster), as defined in UAX29, as the basis for its typographic character unit. However, the UA should tailor the definitions as required by typographic tradition since the default rules are not always appropriate or ideal—and is expected to tailor them differently depending on the operation as needed.

Maybe it can be the same as character selection.
Comment 4 Myles C. Maxfield 2021-12-03 18:11:47 PST
I’m not sure if our platform has any concept of initial letter… Maybe I should talk to the Pages engineers.
Comment 5 Darin Adler 2021-12-06 17:00:38 PST
It does have the concept of "shift-right-arrow to select one character", which is what Alexey was referring to.
Comment 6 Radar WebKit Bug Importer 2021-12-08 23:06:17 PST