Bug 6310

Summary:

text-transform: uppercase/lowercase don't handle cases one character becomes two

Product:

WebKit

Reporter:

Alexey Proskuryakov <ap>

Component:

CSS

Assignee:

Darin Adler <darin>

Status:

VERIFIED FIXED

Severity:

Enhancement

CC:

darin, julian

Priority:

Keywords:

HasReduction

Version:

420+

Hardware:

Mac

OS:

OS X 10.4

Attachments:

Description	Flags
test case	none
test case	none
dumb down lower() and upper()	none

Alexey Proskuryakov

Reported 2005-12-31 07:12:04 PST

WebCore uppercases and lowercases strings character-by-character, which leads to multiple problems with non-ASCII strings. Firefox and Opera are no better at this, but that's probably not something that should be copied from them. To be really advanced, WebCore should not just transform entire strings, but take the language into account (Turkish 'i' is the usual example here). However, with :lang being unimplemented (bug 3233), that probably has to wait.

Attachments
test case (665 bytes, text/html) 2005-12-31 07:12 PST, Alexey Proskuryakov	no flags	Details
test case (836 bytes, text/html) 2005-12-31 07:26 PST, Alexey Proskuryakov	no flags	Details
dumb down lower() and upper() (77.08 KB, patch) 2006-01-29 07:17 PST, Alexey Proskuryakov	no flags	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2005-12-31 07:12:29 PST

Created attachment 5396 [details] test case

Alexey Proskuryakov

Comment 2 2005-12-31 07:26:49 PST

Created attachment 5397 [details] test case Added Latin ligatures ffi and IJ.

Darin Adler

Comment 3 2006-01-02 14:41:15 PST

I don't think lower() and upper() in DOMString are really to blame here. We won't usually have the entire string we want to upper or lower-case in a DOMString -- words can be split over many separate DOM nodes and have a mix of styles. So proper implementation of a more sophisticated upper and lowercasing algorithm would probably not affect lower() and upper() at all. Hence this bug should be retitled to describe the symptom.

Darin Adler

Comment 4 2006-01-02 14:41:43 PST

I suggest listing the Turkish "i" in a separate bug, since the issue is different.

Alexey Proskuryakov

Comment 5 2006-01-02 16:15:37 PST

As for DOMString::lower()/upper(), one possible solution could be to dumb it down even further, to handle only ASCII text (since it's mostly used for tag canonicalization). Then, it could be renamed to something like asciiLower(), as a warning.

Darin Adler

Comment 6 2006-01-21 11:36:32 PST

I like the idea of dumbing lower/upper down and changing their names. Then we can create functions that take a language parameter and do the right thing, even for the Turkish, and handle cases where the number of characters changes when you change case.

Darin Adler

Comment 7 2006-01-21 11:45:05 PST

The ICU functions for this are: u_strToLower u_strToUpper u_strToTitle I think we can fix a lot of problems if we change things so that text transforms use these functions, either directly or through DOMString functions.

Alexey Proskuryakov

Comment 8 2006-01-29 07:17:51 PST

Created attachment 6071 [details] dumb down lower() and upper() I have tried dumbing down lower() and upper() (and equalIgnoringCase(), for that matter), but there are too many cases where I'm not sure what to do. The attached patch has changes that seemed safe to me (although I woudn't be able to prove even their correctness for functions like getNamedItemNS() or parseMappedAttribute()).

Alexey Proskuryakov

Comment 9 2006-04-08 10:33:08 PDT

KXMLCore::Unicode has some smarter functions for upper/lowercasing now, perhaps they can be reused in WebCore (no locale support yet, though).

Alexey Proskuryakov

Comment 10 2006-05-09 10:51:04 PDT

*** Bug 8805 has been marked as a duplicate of this bug. ***

Darin Adler

Comment 11 2006-05-09 11:23:30 PDT

My QChar -> UChar patch is going to fix this "as a side effect".

Darin Adler

Comment 12 2006-05-09 21:55:38 PDT

The test case attached here did not match the bug report. The test case was testing "capitalize" but the bug report mentions "uppercase" and "lowercase". I fixed uppercase and lowercase in a recent patch, and I added a test case based on the one here.

Timothy Hatcher

Comment 13 2006-05-09 22:54:07 PDT

Landed in r14273.

Alexey Proskuryakov

Comment 14 2006-05-10 22:37:21 PDT

(In reply to comment #12) > The test case attached here did not match the bug report. The test case was > testing "capitalize" but the bug report mentions "uppercase" and "lowercase". Mea culpa. Capitalize also appears to be broken; I'll file a separate bug for it. > I fixed uppercase and lowercase in a recent patch, and I added a test case > based on the one here. There's still a minor bug in the landed test case - the expected result for "IJ" is incorrect, although it looks the same. The actual/correct result is a ligature "ij", not two separate characters. As a reminder to myself, I'm seeing an assertion failure when selecting text in the test, should investigate and file a bug: /Users/ap/WebKit/WebCore/dom/Position.cpp:341: failed assertion `currentOffset >= renderer->caretMaxOffset()'

Alexey Proskuryakov

Comment 15 2006-05-11 11:32:11 PDT

Filed bug 8854, bug 8855.

Note You need to log in before you can comment on or make changes to this bug.