WebCore uppercases and lowercases strings character-by-character, which leads to multiple problems
with non-ASCII strings. Firefox and Opera are no better at this, but that's probably not something that
should be copied from them.
To be really advanced, WebCore should not just transform entire strings, but take the language into
account (Turkish 'i' is the usual example here). However, with :lang being unimplemented (bug 3233), that
probably has to wait.
Created attachment 5396 [details]
Created attachment 5397 [details]
Added Latin ligatures ffi and IJ.
I don't think lower() and upper() in DOMString are really to blame here. We won't usually have the entire
string we want to upper or lower-case in a DOMString -- words can be split over many separate DOM
nodes and have a mix of styles. So proper implementation of a more sophisticated upper and lowercasing
algorithm would probably not affect lower() and upper() at all.
Hence this bug should be retitled to describe the symptom.
I suggest listing the Turkish "i" in a separate bug, since the issue is different.
As for DOMString::lower()/upper(), one possible solution could be to dumb it down even further, to handle
only ASCII text (since it's mostly used for tag canonicalization). Then, it could be renamed to something
like asciiLower(), as a warning.
I like the idea of dumbing lower/upper down and changing their names.
Then we can create functions that take a language parameter and do the right thing, even for the Turkish, and handle cases where the number of characters changes when you change case.
The ICU functions for this are:
I think we can fix a lot of problems if we change things so that text transforms use these functions, either directly or through DOMString functions.
Created attachment 6071 [details]
dumb down lower() and upper()
I have tried dumbing down lower() and upper() (and equalIgnoringCase(), for that matter), but there are too many cases where I'm not sure what to do. The attached patch has changes that seemed safe to me (although I woudn't be able to prove even their correctness for functions like getNamedItemNS() or parseMappedAttribute()).
KXMLCore::Unicode has some smarter functions for upper/lowercasing now, perhaps they can be reused in WebCore (no locale support yet, though).
*** Bug 8805 has been marked as a duplicate of this bug. ***
My QChar -> UChar patch is going to fix this "as a side effect".
The test case attached here did not match the bug report. The test case was testing "capitalize" but the bug report mentions "uppercase" and "lowercase".
I fixed uppercase and lowercase in a recent patch, and I added a test case based on the one here.
Landed in r14273.
(In reply to comment #12)
> The test case attached here did not match the bug report. The test case was
> testing "capitalize" but the bug report mentions "uppercase" and "lowercase".
Mea culpa. Capitalize also appears to be broken; I'll file a separate bug for it.
> I fixed uppercase and lowercase in a recent patch, and I added a test case
> based on the one here.
There's still a minor bug in the landed test case - the expected result for "IJ" is incorrect, although it looks the same. The actual/correct result is a ligature "ij", not two separate characters.
As a reminder to myself, I'm seeing an assertion failure when selecting text in the test, should investigate and file a bug:
/Users/ap/WebKit/WebCore/dom/Position.cpp:341: failed assertion `currentOffset >= renderer->caretMaxOffset()'
Filed bug 8854, bug 8855.