Bug 4171

Summary: text-transform:capitalize exhibits incorrect behavior in many edge cases
Product: WebKit Reporter: Beth Dakin <bdakin>
Component: CSSAssignee: Beth Dakin <bdakin>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, bdakin, darin, nickshanks
Priority: P2    
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
Attachments:
Description Flags
Test case created by Nicholas Shanks for 3406
none
First attempt at implementing UBreakIterator
mjs: review+
screenshot of observed effect none

Description Beth Dakin 2005-07-27 17:20:29 PDT
This is the continuation of http://bugzilla.opendarwin.org/show_bug.cgi?id=3406

Bugzilla 3406 began to address some of the problems with text-transform:capitalize, but fundamentally, 
to fix all of the problems (see attached test case which was submitted by Nicholas Shanks for bug 3406), 
we need a better word-break algorithm. Darin suggests we implement the algorithms provided in 
<unicode/ubrk.h>

This bug is also being tracked with Radar 4195862.
Comment 1 Beth Dakin 2005-07-27 17:21:48 PDT
Created attachment 3125 [details]
Test case created by Nicholas Shanks for 3406
Comment 2 Alexey Proskuryakov 2005-07-29 01:39:23 PDT
Just wondering, has CFString been considered for WebCore? It's open source, cross-platform, and has a lot 
of string manipulation routines, including CFStringCapitalize().

Under OS X, that would give results consistent with other applications, and reduce memory footprint. 
Under other OSes, that would give a better tested CF-Lite, which is a good thing IMO.
Comment 3 Beth Dakin 2006-03-14 20:19:06 PST
Created attachment 7069 [details]
First attempt at implementing UBreakIterator

Here is a first pass at implementing text-transform:capitalize with a UBreakIterator. This fixes a TOT bug I noticed where we were no longer capitalizing after non-breaking spaces. It also changes some of our current behavior in ways that I think is good. For example, instead of "Newcastle-upon-tyne," this patch makes it "Newcastle-Upon-Tyne." Also instead of "E.g." this new patch writes out "E.G." These new behaviors are not necessarily correct, but they at least seem consistent and more correct than the old behavior for these edge cases.

The patch came out very interleaved so it is kind of hard to read.
Comment 4 Maciej Stachowiak 2006-03-14 21:44:56 PST
Comment on attachment 7069 [details]
First attempt at implementing UBreakIterator

Looks great! Reusing the word break iterator like this is good. 

One thing I'm not 100% sure about:

+ QChar previous = 0;

Does the ICU word break iterator handle the case of an initial null character properly? For example, if you do a document that contains only <span style="text-transform: capitalize">word</style> properly capitalize the word? If so, fine to land as is, otherwise I think a space might be a better default choice.

r=me assuming this case works.
Comment 5 Beth Dakin 2006-03-14 22:40:42 PST
Setting previous to 0 does work, but I think that a space makes more sense, so I changed it. I committed the patch. 

Our behavior still isn't perfect, so we may want to create another bug for edge cases, but I am going to conisder this to be the bug representing our switch to the UBreakIterator and mark it fixed.
Comment 6 Nicholas Shanks 2006-03-15 07:40:24 PST
In the test case, this turns "earth<span>worm</span>" into "EarthOrm" !
What is happening to the 'w' ?
Comment 7 Beth Dakin 2006-03-15 09:33:12 PST
I am not seeing that behavior. I just see "Earthworm." Are you sure you have updated your tree? That is very strange.
Comment 8 Nicholas Shanks 2006-03-15 10:18:47 PST
I updated to ToT (r. 13304) a few hours ago, compiled and ran it. Incredibly, turning on "Use ATSU for text" causes the rendering to appear as "Earth?Orm", despite no hyphen occurring in the HTML, and a copy/paste causes "EarthCorm" to be put on the clipboard under 'TEXT' and "Earth-Orm" as NSStringPboardType!
Comment 9 Nicholas Shanks 2006-03-15 10:25:05 PST
Oops. That question mark was supposed to be a HYPHEN MINUS.
Comment 10 Alexey Proskuryakov 2006-03-15 10:32:07 PST
I can confirm that different text goes into different clipboard flavors (like Earthworm vs. EarthWorm), but I don't see any dependence on the "Use ATSU for text" setting, nor any seriously broken text. FWIW.
Comment 11 Nicholas Shanks 2006-03-15 10:35:31 PST
Created attachment 7090 [details]
screenshot of observed effect

Rendering of test case with ToT (r. 13304) and "Use ATSU For All Text" turned on. Includes result of a copy/paste into TextEdit's plain text (right) and rich text (left) windows. The hyphen on the clipboard is U+2D HYPHEN MINUS. Turning the ATSU option off causes a SegFault on page reload.
Comment 12 Alexey Proskuryakov 2006-03-15 10:44:01 PST
Oh, I see something that can kind of explain the problem (which would likely be separate from different pasteboard flavors getting different results):

Safari(2737,0xa000ed68) malloc: *** error for object 0x12fe29e0: incorrect checksum for freed object - object was probably modified after being freed, break at szone_error to debug
Safari(2737,0xa000ed68) malloc: *** set a breakpoint in szone_error to debug
Comment 13 Beth Dakin 2006-03-15 10:52:46 PST
I still need to investigate the ATSU problem, but I just checked in a patch for a leak and some occassional crashes that were happening through dump render tree. I don't expect that to fix whatever this ATSU problem is, but I figure it is worth noting.
Comment 14 Nicholas Shanks 2006-03-15 11:02:36 PST
Revision 13306 now renders as Earth"Orm.
Clipboard 'TEXT' = Earth"Orm
NSPboardStringType = EarthGorm
(ATSU off)


I suspect this is a memory clobbering issue, but that the recent checkin is not the one causing it.
Comment 15 Beth Dakin 2006-03-15 11:17:38 PST
Okay, so I also see "EarthWorm" in the pasteboard, but I haven't been able to reproduce any of the other strange behaviors you have described, with or without ATSU enabled. Is there anything else different about your settings that might make a difference? 

I am going to add Darin to the cc list because he has a lot of expertise in this area. (Hope you don't mind, Darin!)
Comment 16 Nicholas Shanks 2006-03-15 13:12:29 PST
I originally thought it might have been caused by some of the stuff I have installed, like Inquisitor and SafariStand, but I ruled that out after disabling it and still seeing the problems. 417.9.2 does not exhibit the SegFault on switching between ATSU on/off nor the letter munging behaviour (which also causes the first letter of the line in Klingon to disappear), either with or without the extensions. The difference in clipboard contents is not testable in released version of safari.
Comment 17 Beth Dakin 2006-03-15 13:44:22 PST
Could you file a new bug for the problems you are seeing? Still can't repro. Thanks!