Bug 3406

Summary: CSS1: letters following apostrophe are wrongly capitalized when text-transform:capitalize applied
Product: WebKit Reporter: Dave Hyatt <hyatt>
Component: CSS Assignee: Beth Dakin <bdakin>
Status: RESOLVED FIXED    
Severity: Normal CC: nickshanks
Priority: P2    
Version: 412   
Hardware: Mac   
OS: OS X 10.4   
Attachments:
Description                                                   Flags
test case: rsquo, combining diacritic and apos                none
improved test case, added word beginning with apos ('cept)    none
proposed patch                                                none
proposed patch                                                darin: review+
Regression Test                                               none
Regression Test                                               none
a nasty little test >:-D                                      none
nasty test amended                                            none
added UTF-8 BOM to test                                       none
correction from french-speaker                                none

Description Dave Hyatt 2005-06-10 00:11:00 PDT
3/20/03 11:46 AM Vicki Murley:
In IE 5, 5.5, and 6 under Windows, as well as Mozilla / Camino, the following treatment is applied 
correctly -->

<p style="text-transform:capitalize;">todd's bargain basement</p>

Output: Todd's Bargain Basement

In Safari (and IE 5 Mac for that matter) the same line renders as...

Todd'S Bargain Basement

Any character immediately following a quote or apostrophe wrongly receives a capitalization transform.
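
The behaviour described above can be reproduced outside the browser. The following is a minimal sketch, not WebKit's code: `capitalize_naive` uppercases any letter that follows a non-letter (which yields the reported output), while `capitalize_words` keeps an apostrophe inside a word from starting a new one. Both function names and the `APOSTROPHES` set are hypothetical, chosen only for this illustration.

```python
# Assumed set of apostrophe characters: ASCII apostrophe and right single quote.
APOSTROPHES = {"'", "\u2019"}


def capitalize_naive(text: str) -> str:
    """Uppercase every letter that follows a non-letter (reproduces the bug)."""
    out = []
    prev_is_letter = False
    for ch in text:
        out.append(ch.upper() if ch.isalpha() and not prev_is_letter else ch)
        prev_is_letter = ch.isalpha()
    return "".join(out)


def capitalize_words(text: str) -> str:
    """Uppercase only the first letter of each word.

    An apostrophe that appears inside a word does not end the word.
    """
    out = []
    in_word = False
    for ch in text:
        if ch.isalpha():
            out.append(ch if in_word else ch.upper())
            in_word = True
        elif ch in APOSTROPHES and in_word:
            out.append(ch)  # still inside the current word
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


print(capitalize_naive("todd's bargain basement"))  # Todd'S Bargain Basement
print(capitalize_words("todd's bargain basement"))  # Todd's Bargain Basement
```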
Comment 1 Dave Hyatt 2005-06-10 00:12:49 PDT
Apple Bug: 3204011
Comment 2 Andrew Wellington 2005-06-10 02:35:34 PDT
Patch and regression test posted to webkit-reviews
Comment 3 Nicholas Shanks 2005-06-10 02:51:54 PDT
Created attachment 2211 [details]
test case: rsquo, combining diacritic and apos

Correct rendering would be "Safari’s Naïve Nut'in"
Comment 4 Nicholas Shanks 2005-06-10 03:49:37 PDT
Created attachment 2212 [details]
improved test case, added word beginning with apos ('cept)

Note that due to a bug introduced between Safari 2.0 and the current ToT, this
test fails. I have yet to work out the cause, but it's someone else's fault :-)
Comment 5 Nicholas Shanks 2005-06-10 03:50:09 PDT
Created attachment 2213 [details]
proposed patch
Comment 6 Andrew Wellington 2005-06-10 04:05:25 PDT
Created attachment 2214 [details]
proposed patch

This is a more generalised patch.
Comment 7 Andrew Wellington 2005-06-10 04:06:49 PDT
Created attachment 2215 [details]
Regression Test
Comment 8 Dave Hyatt 2005-06-10 23:12:45 PDT
More tests will help.  Make sure you don't capitalize letters that occur after soft hyphens or after regular 
hyphens.
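
A table-driven check along these lines might look like the sketch below. It assumes a simple rule in which apostrophes, hyphens and soft hyphens (U+00AD) continue the current word; `WORD_JOINERS`, `capitalize_words` and `CASES` are hypothetical names for this illustration and are not taken from the patch or the regression test.

```python
SOFT_HYPHEN = "\u00ad"

# Characters assumed to continue a word rather than start a new one.
WORD_JOINERS = {"'", "\u2019", "-", SOFT_HYPHEN}


def capitalize_words(text: str, joiners=WORD_JOINERS) -> str:
    """Uppercase the first letter of each word; joiners do not end a word."""
    out = []
    in_word = False
    for ch in text:
        if ch.isalpha():
            out.append(ch if in_word else ch.upper())
            in_word = True
        else:
            out.append(ch)
            # A joiner keeps us inside the word only if we were already in one.
            in_word = in_word and ch in joiners
    return "".join(out)


# (input, expected) pairs covering regular and soft hyphens.
CASES = [
    ("well-known fact", "Well-known Fact"),
    ("co\u00adoperate now", "Co\u00adoperate Now"),
]

for source, expected in CASES:
    assert capitalize_words(source) == expected
```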
Comment 9 Andrew Wellington 2005-06-11 02:35:33 PDT
Created attachment 2244 [details]
Regression Test

Now includes hyphenated and soft hyphenated words and an abbreviation "e.g."
Comment 10 Nicholas Shanks 2005-06-11 10:59:14 PDT
Created attachment 2248 [details]
a nasty little test   >:-D
Comment 11 Nicholas Shanks 2005-06-11 11:07:35 PDT
Created attachment 2249 [details]
nasty test amended
Comment 12 Darin Adler 2005-06-12 17:02:59 PDT
Test looks great. Needs a UTF-8 encoding meta tag.
Comment 13 Nicholas Shanks 2005-06-15 07:28:50 PDT
Created attachment 2360 [details]
added UTF-8 BOM to test
Comment 14 Nicholas Shanks 2005-06-16 00:09:33 PDT
Created attachment 2381 [details]
correction from french-speaker
Comment 15 Darin Adler 2005-07-27 14:16:11 PDT
I think a UTF-8 meta tag would be better than a UTF-8 BOM in the test case.
Comment 16 Darin Adler 2005-07-27 14:26:00 PDT
Beth and I just researched this a bit.

To do a good job of capitalizing words, we really want to use the ICU library. ICU specifically suggests 
using the break iterator in this way -- they call it "title boundary analysis".

We're thinking of landing the patch attached to this bug to make things a little better, then eventually 
following up with a much better implementation that uses UBreakIterator (from <unicode/ubrk.h>).

We also think that some of the items in this test case are beyond the scope of what should be expected 
from text-transform: capitalize. Specifically, we don't think the browser should be required to do 
linguistic analysis to tell the difference between words that should be capitalized in title case and words 
that should not. We can't find any other browser that does this.
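
The design being proposed separates finding word starts from applying the case change. Here is a rough sketch of that split, with a regular expression standing in for the boundary analyzer. It does not use ICU; `regex_title_boundaries` and `capitalize_at` are hypothetical names, and the regex's notion of a word (letters joined by apostrophes or hyphens) is an assumption. It does not handle combining marks, for example.

```python
import re
from typing import Callable, Iterable

# Stand-in for a real boundary analyzer. Assumption: a word is a run of
# letters, optionally joined by an apostrophe, hyphen or soft hyphen.
_WORD = re.compile(r"[^\W\d_]+(?:['\u2019\-\u00ad][^\W\d_]+)*")


def regex_title_boundaries(text: str) -> Iterable[int]:
    """Yield the index of the first letter of each word."""
    return (m.start() for m in _WORD.finditer(text))


def capitalize_at(text: str, boundaries: Callable[[str], Iterable[int]]) -> str:
    """Uppercase the character at every offset reported by `boundaries`."""
    chars = list(text)
    for i in boundaries(text):
        chars[i] = chars[i].upper()
    return "".join(chars)


print(capitalize_at("todd's bargain basement", regex_title_boundaries))
# Todd's Bargain Basement
```

Because `capitalize_at` only consumes offsets, a better boundary source could replace the regex later without touching the transform step.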
Comment 17 Darin Adler 2005-07-27 14:26:38 PDT
Comment on attachment 2214 [details]
proposed patch

While this patch is not perfect, it does seem to make things better. So let's
land this, and write another bug about doing even better later on.
Comment 18 Nicholas Shanks 2005-08-08 07:38:40 PDT
I used a BOM in the test case because Safari first checks for a BOM, then goes to the charset in the 
Content-Type HTTP header. The bugzilla.opendarwin.org server seems to be sending incorrect charset 
information, as can be seen when viewing any page with non-ASCII characters in it (e.g. θης ις συμ γρεεκ) 
when your default/current page encoding is set to something other than ISO-8859-1.
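
The lookup order described here (BOM first, then the HTTP header, then the user's default) can be sketched as follows. This only illustrates the stated precedence and is not Safari's implementation; `pick_encoding`, its parameters and the fallback value are hypothetical, and the in-document meta tag step is deliberately left out.

```python
import codecs
from typing import Optional

# Byte order marks recognised by this sketch, paired with an encoding name.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def pick_encoding(body: bytes,
                  header_charset: Optional[str],
                  default: str = "iso-8859-1") -> str:
    """Choose an encoding: BOM wins, then the HTTP header, then the default."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    if header_charset:
        return header_charset.lower()
    return default


# A BOM overrides a wrong header, which is why the test case carries one.
print(pick_encoding(codecs.BOM_UTF8 + b"<p>test</p>", "ISO-8859-1"))  # utf-8
print(pick_encoding(b"<p>test</p>", "ISO-8859-1"))                    # iso-8859-1
```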