Bug 6310

Summary: text-transform: uppercase/lowercase don't handle cases one character becomes two
Product: WebKit Reporter: Alexey Proskuryakov <ap>
Component: CSSAssignee: Darin Adler <darin>
Status: VERIFIED FIXED    
Severity: Enhancement CC: darin, julian
Priority: P2 Keywords: HasReduction
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
Attachments:
Description Flags
test case
none
test case
none
dumb down lower() and upper() none

Description Alexey Proskuryakov 2005-12-31 07:12:04 PST
WebCore uppercases and lowercases strings character-by-character, which leads to multiple problems 
with non-ASCII strings. Firefox and Opera are no better at this, but that's probably not something that 
should be copied from them.

To be really advanced, WebCore should not just transform entire strings, but take the language into 
account (Turkish 'i' is the usual example here). However, with :lang being unimplemented (bug 3233), that 
probably has to wait.
Comment 1 Alexey Proskuryakov 2005-12-31 07:12:29 PST
Created attachment 5396 [details]
test case
Comment 2 Alexey Proskuryakov 2005-12-31 07:26:49 PST
Created attachment 5397 [details]
test case

Added Latin ligatures ffi and IJ.
Comment 3 Darin Adler 2006-01-02 14:41:15 PST
I don't think lower() and upper() in DOMString are really to blame here. We won't usually have the entire 
string we want to upper or lower-case in a DOMString -- words can be split over many separate DOM 
nodes and have a mix of styles. So proper implementation of a more sophisticated upper and lowercasing 
algorithm would probably not affect lower() and upper() at all.

Hence this bug should be retitled to describe the symptom.
Comment 4 Darin Adler 2006-01-02 14:41:43 PST
I suggest listing the Turkish "i" in a separate bug, since the issue is different.
Comment 5 Alexey Proskuryakov 2006-01-02 16:15:37 PST
As for DOMString::lower()/upper(), one possible solution could be to dumb it down even further, to handle 
only ASCII text (since it's mostly used for tag canonicalization). Then, it could be renamed to something 
like asciiLower(), as a warning.
Comment 6 Darin Adler 2006-01-21 11:36:32 PST
I like the idea of dumbing lower/upper down and changing their names.

Then we can create functions that take a language parameter and do the right thing, even for the Turkish, and handle cases where the number of characters changes when you change case.
Comment 7 Darin Adler 2006-01-21 11:45:05 PST
The ICU functions for this are:

    u_strToLower
    u_strToUpper
    u_strToTitle

I think we can fix a lot of problems if we change things so that text transforms use these functions, either directly or through DOMString functions.
Comment 8 Alexey Proskuryakov 2006-01-29 07:17:51 PST
Created attachment 6071 [details]
dumb down lower() and upper()

I have tried dumbing down lower() and upper() (and equalIgnoringCase(), for that matter), but there are too many cases where I'm not sure what to do. The attached patch has changes that seemed safe to me (although I woudn't be able to prove even their correctness for functions like getNamedItemNS() or parseMappedAttribute()).
Comment 9 Alexey Proskuryakov 2006-04-08 10:33:08 PDT
KXMLCore::Unicode has some smarter functions for upper/lowercasing now, perhaps they can be reused in WebCore (no locale support yet, though).
Comment 10 Alexey Proskuryakov 2006-05-09 10:51:04 PDT
*** Bug 8805 has been marked as a duplicate of this bug. ***
Comment 11 Darin Adler 2006-05-09 11:23:30 PDT
My QChar -> UChar patch is going to fix this "as a side effect".
Comment 12 Darin Adler 2006-05-09 21:55:38 PDT
The test case attached here did not match the bug report. The test case was testing "capitalize" but the bug report mentions "uppercase" and "lowercase".

I fixed uppercase and lowercase in a recent patch, and I added a test case based on the one here.
Comment 13 Timothy Hatcher 2006-05-09 22:54:07 PDT
Landed in r14273.
Comment 14 Alexey Proskuryakov 2006-05-10 22:37:21 PDT
(In reply to comment #12)
> The test case attached here did not match the bug report. The test case was
> testing "capitalize" but the bug report mentions "uppercase" and "lowercase".

Mea culpa. Capitalize also appears to be broken; I'll file a separate bug for it.

> I fixed uppercase and lowercase in a recent patch, and I added a test case
> based on the one here.

There's still a minor bug in the landed test case - the expected result for "IJ" is incorrect, although it looks the same. The actual/correct result is a ligature "ij", not two separate characters.

As a reminder to myself, I'm seeing an assertion failure when selecting text in the test, should investigate and file a bug:
/Users/ap/WebKit/WebCore/dom/Position.cpp:341: failed assertion `currentOffset >= renderer->caretMaxOffset()'
Comment 15 Alexey Proskuryakov 2006-05-11 11:32:11 PDT
Filed bug 8854, bug 8855.