Bug 8854

Summary: CSS1: text-transform: capitalize doesn't handle ligatures and non-BMP characters
Product: WebKit
Reporter: Alexey Proskuryakov <ap>
Component: CSS
Assignee: Nobody <webkit-unassigned>
Status: NEW
Severity: Normal
CC: ahmad.saleem792, heycam, ian, mmaxfield, nicholas, nickshanks, rniwa, ysuzuki, zalan
Priority: P3
Version: 420+
Hardware: Mac
OS: OS X 10.4
Attachments (description, flags):
test case, none
Screenshot of qt linux port, none
Safari 15.5 matching Chrome but differs from Firefox, none

Description Alexey Proskuryakov 2006-05-11 11:17:58 PDT
The attached test case shows several problems with the current capitalize() implementation. All of these problems are pretty esoteric, though.
Comment 1 Alexey Proskuryakov 2006-05-11 11:18:25 PDT
Created attachment 8245 [details]
test case
Comment 2 Nicholas Shanks 2006-07-02 02:27:18 PDT
All we really need here is a list of Unicode ligatures that have no uppercase equivalent, and to decompose them beforehand. Does such a list exist already? I would be prepared to create one if necessary, but would rather not have to!
Comment 3 Alexey Proskuryakov 2006-07-02 05:20:37 PDT
Yes, ICU has support for upper-casing and title-casing.
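As a rough illustration of the list asked about above, the Latin ligatures in the Alphabetic Presentation Forms block (U+FB00 to U+FB06) have no single-character uppercase or titlecase form, so Unicode's full case mapping expands them to several letters. The sketch below uses Python's built-in `str.title()` as a stand-in for ICU title-casing; it is not the ICU API and not WebKit code, and the names `LATIN_LIGATURES` and `title_mappings` are made up for this example.

```python
# Latin ligatures in the Alphabetic Presentation Forms block, U+FB00..U+FB06.
LATIN_LIGATURES = [chr(cp) for cp in range(0xFB00, 0xFB07)]


def title_mappings(chars):
    """Map each character to its full titlecase mapping (hypothetical helper)."""
    return {c: c.title() for c in chars}


mappings = title_mappings(LATIN_LIGATURES)
for lig, titled in mappings.items():
    # Each ligature expands to more than one character when title-cased.
    print(f"U+{ord(lig):04X} {lig!r} -> {titled!r}")
```

The point of the sketch is only that the case-mapping data already encodes the decomposition, so a hand-maintained list should not be needed.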
Comment 4 Nicholas Wilson 2010-04-14 13:46:54 PDT
Now fixed.
Comment 5 Alexey Proskuryakov 2010-04-14 13:53:42 PDT
I still see the problem with ToT on Mac OS X 10.5.8. The code is cross-platform, so it should fail on other OSes, too.
Comment 6 Nicholas Wilson 2010-04-14 15:48:40 PDT
We might have different results then (Debian). I admit though that after opening the test case in Chrome, which is fine, I cavalierly modified the test before looking at it in my trunk version of WebKit because I wanted to add in some harder combining characters. By an unlucky fluke, I also moved the Deseret to the end of the line, and by a bug-of-a-bug the SMP problems vanish in that specific case only (on trunk WebKit; Chrome[ium] is entirely good). WebKit mangles things that follow SMP characters, but renders SMP text at the end of lines perfectly.

Further probing does reveal that this is an especially odd fluke, unfortunately, because normally the end of the string is truncated as well (str.len is clearly being incorrectly used to find the length of the UTF-16 strings used internally in vectors of UChars). I really cannot work out why the Deseret text displays without any truncation while other examples lose exactly one UChar from the end of the string per SMP surrogate pair. On the plus side, it is not possible to get half a pair to display, however hard you try; the dangling half is dropped cleanly. Some characters after runs of SMP text are wrong too (possibly UTF-16 treated as UCS-1 then encoded as UTF-8?).
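To make the suspected failure mode concrete: a character outside the BMP takes two UTF-16 code units (a surrogate pair), so a buffer sized by character count is one unit short per pair. The sketch below only illustrates that arithmetic in Python. It is not WebKit code, the helper name `utf16_units` and the sample string are invented for the example, and whether this is the actual cause in WebKit is questioned later in the thread.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


# Two Deseret small letters (non-BMP), a space, then two BMP letters.
sample = "\U00010428\U00010429 ab"
units = utf16_units(sample)

code_points = len(sample)  # Python strings count code points
code_units = len(units)    # UTF-16 needs one extra unit per surrogate pair

# The mistake: sizing a UTF-16 buffer by the code point count.
truncated = units[:code_points]
recovered = b"".join(u.to_bytes(2, "little") for u in truncated).decode("utf-16-le")

print(code_points, code_units)  # the difference equals the number of surrogate pairs
print(repr(recovered))          # the trailing BMP letters are cut off
```

With two surrogate pairs in the string, two code units are lost from the end, which matches the "one UChar per SMP surrogate pair" pattern described above.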

On the other hand, ligature handling does work. That was the little bug though, unfortunately. SMP problems are actually more urgent because we need the maths symbols hidden in there.

As a side-note, whatever Chrome did to fix this we might be able to copy.

There is plenty of discussion along these lines in their bug tracker, so I will have a look at that.
Comment 7 Alexey Proskuryakov 2010-04-14 16:23:55 PDT
> On the other hand, ligature handling does work.

Could you attach your screenshot of the original test case? I have a hard time trying to believe that any WebKit port gets the "fi" ligature right.

> Further probing does reveal that this is an especially odd fluke,
> unfortunately, because normally the end of the string is truncated as well
> (str.len is clearly being incorrectly used to find the length of the UTF-16
> strings used internally in vectors of UChars)

I'm pretty sure we don't use strlen, at least in cross-platform code. You may be seeing something related to bug 8855.
Comment 8 Nicholas Wilson 2010-04-14 17:10:11 PDT
Created attachment 53386 [details]
Screenshot of qt linux port

Sure. Screenshot. Using Qt 4.6.2, today's SVN. There are some debug messages:

QFontEngine: Glyph neither outline nor bitmap format=0
load glyph failed err=6 face=0x8dab408, glyph=4741
load glyph failed err=6 face=0x8dab408, glyph=4741
load glyph failed err=6 face=0x8dab408, glyph=4741
load glyph failed err=6 face=0x8dab408, glyph=4741
....

But all the glyphs render correctly. Π is uppercased nicely, and the two letters of the fi ligature are individually selectable. All glyphs of the non-BMP text render correctly with nothing missing. The blue drag-select box has some clear issues. There are several bugs outstanding against the corner cases for that though. Copy and paste is not too broken either.

On the other hand, with other example texts there are the other problems I described of characters at the end of a text node not being rendered whenever the final characters of the node are not in the SMP (sic; weird, but true).

What exactly is the relationship between Chromium and WebKit patch-wise? I assumed in core rendering they would be very similar (I am just starting out with WebKit), but this and a few other things work perfectly on Chrome and from a brief look just now their code looks substantially different all over the place, with more changes than just adding another platform around it. I could not find anything which is relevant to this specific bug on their tracker, and I guess hunting down what they did to fix this and the selection problems probably takes too long to be helpful.
Comment 9 Alexey Proskuryakov 2010-04-14 17:16:57 PDT
This is not a screenshot of the original test case that I asked for. But anyway, "fi" also fails - it should be capitalized to "Fi".
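For reference, the expected behaviour can be sketched in a few lines: title-case the first character of each word using the full Unicode mapping and leave the rest of the word alone. This is a simplified Python illustration, not the WebKit implementation. The function name `capitalize_words` is hypothetical, and splitting words on whitespace is an assumption that ignores the finer word-boundary rules of CSS.

```python
import re


def capitalize_words(text):
    """Title-case the first code point of each whitespace-separated word."""
    def fix(match):
        word = match.group()
        # str.title() on one character applies the full titlecase mapping,
        # so a ligature may expand to several letters.
        return word[0].title() + word[1:]
    return re.sub(r"\S+", fix, text)


# A word starting with the U+FB01 ligature, then two Deseret small letters.
print(capitalize_words("\ufb01sh \U00010428\U00010429"))
```

Because Python strings are indexed by code point, the non-BMP Deseret letter is handled as a single character here; an implementation working on UTF-16 code units would have to read the whole surrogate pair before mapping it.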
Comment 10 Ahmad Saleem 2022-06-23 15:52:14 PDT
Created attachment 460461 [details]
Safari 15.5 matching Chrome but differs from Firefox

I am able to reproduce this issue using the attached test case in Safari 15.5 on macOS 12.4.

It matches Chrome Canary but differs from Firefox. As highlighted in the screenshot, "fi" is not capitalised, although capitalising it is the expected result. Thanks!