VERIFIED FIXED 6310
text-transform: uppercase/lowercase don't handle cases one character becomes two
https://bugs.webkit.org/show_bug.cgi?id=6310
Summary text-transform: uppercase/lowercase don't handle cases one character becomes two
Alexey Proskuryakov
Reported 2005-12-31 07:12:04 PST
WebCore uppercases and lowercases strings character-by-character, which leads to multiple problems with non-ASCII strings. Firefox and Opera are no better at this, but that's probably not something that should be copied from them. To be really advanced, WebCore should not just transform entire strings, but take the language into account (Turkish 'i' is the usual example here). However, with :lang being unimplemented (bug 3233), that probably has to wait.
Attachments
test case (665 bytes, text/html)
2005-12-31 07:12 PST, Alexey Proskuryakov
no flags
test case (836 bytes, text/html)
2005-12-31 07:26 PST, Alexey Proskuryakov
no flags
dumb down lower() and upper() (77.08 KB, patch)
2006-01-29 07:17 PST, Alexey Proskuryakov
no flags
Alexey Proskuryakov
Comment 1 2005-12-31 07:12:29 PST
Created attachment 5396 [details] test case
Alexey Proskuryakov
Comment 2 2005-12-31 07:26:49 PST
Created attachment 5397 [details] test case Added Latin ligatures ffi and IJ.
Darin Adler
Comment 3 2006-01-02 14:41:15 PST
I don't think lower() and upper() in DOMString are really to blame here. We won't usually have the entire string we want to upper or lower-case in a DOMString -- words can be split over many separate DOM nodes and have a mix of styles. So proper implementation of a more sophisticated upper and lowercasing algorithm would probably not affect lower() and upper() at all. Hence this bug should be retitled to describe the symptom.
Darin Adler
Comment 4 2006-01-02 14:41:43 PST
I suggest listing the Turkish "i" in a separate bug, since the issue is different.
Alexey Proskuryakov
Comment 5 2006-01-02 16:15:37 PST
As for DOMString::lower()/upper(), one possible solution could be to dumb it down even further, to handle only ASCII text (since it's mostly used for tag canonicalization). Then, it could be renamed to something like asciiLower(), as a warning.
Darin Adler
Comment 6 2006-01-21 11:36:32 PST
I like the idea of dumbing lower/upper down and changing their names. Then we can create functions that take a language parameter and do the right thing, even for the Turkish, and handle cases where the number of characters changes when you change case.
Darin Adler
Comment 7 2006-01-21 11:45:05 PST
The ICU functions for this are: u_strToLower u_strToUpper u_strToTitle I think we can fix a lot of problems if we change things so that text transforms use these functions, either directly or through DOMString functions.
Alexey Proskuryakov
Comment 8 2006-01-29 07:17:51 PST
Created attachment 6071 [details] dumb down lower() and upper() I have tried dumbing down lower() and upper() (and equalIgnoringCase(), for that matter), but there are too many cases where I'm not sure what to do. The attached patch has changes that seemed safe to me (although I woudn't be able to prove even their correctness for functions like getNamedItemNS() or parseMappedAttribute()).
Alexey Proskuryakov
Comment 9 2006-04-08 10:33:08 PDT
KXMLCore::Unicode has some smarter functions for upper/lowercasing now, perhaps they can be reused in WebCore (no locale support yet, though).
Alexey Proskuryakov
Comment 10 2006-05-09 10:51:04 PDT
*** Bug 8805 has been marked as a duplicate of this bug. ***
Darin Adler
Comment 11 2006-05-09 11:23:30 PDT
My QChar -> UChar patch is going to fix this "as a side effect".
Darin Adler
Comment 12 2006-05-09 21:55:38 PDT
The test case attached here did not match the bug report. The test case was testing "capitalize" but the bug report mentions "uppercase" and "lowercase". I fixed uppercase and lowercase in a recent patch, and I added a test case based on the one here.
Timothy Hatcher
Comment 13 2006-05-09 22:54:07 PDT
Landed in r14273.
Alexey Proskuryakov
Comment 14 2006-05-10 22:37:21 PDT
(In reply to comment #12) > The test case attached here did not match the bug report. The test case was > testing "capitalize" but the bug report mentions "uppercase" and "lowercase". Mea culpa. Capitalize also appears to be broken; I'll file a separate bug for it. > I fixed uppercase and lowercase in a recent patch, and I added a test case > based on the one here. There's still a minor bug in the landed test case - the expected result for "IJ" is incorrect, although it looks the same. The actual/correct result is a ligature "ij", not two separate characters. As a reminder to myself, I'm seeing an assertion failure when selecting text in the test, should investigate and file a bug: /Users/ap/WebKit/WebCore/dom/Position.cpp:341: failed assertion `currentOffset >= renderer->caretMaxOffset()'
Alexey Proskuryakov
Comment 15 2006-05-11 11:32:11 PDT
Note You need to log in before you can comment on or make changes to this bug.