This is the continuation of http://bugzilla.opendarwin.org/show_bug.cgi?id=3406
Bugzilla 3406 began to address some of the problems with text-transform:capitalize, but fundamentally,
to fix all of the problems (see attached test case which was submitted by Nicholas Shanks for bug 3406),
we need a better word-break algorithm. Darin suggests we implement the algorithms provided in
This bug is also being tracked with Radar 4195862.
Created attachment 3125 [details]
Test case created by Nicholas Shanks for 3406
Just wondering, has CFString been considered for WebCore? It's open source, cross-platform, and has a lot
of string manipulation routines, including CFStringCapitalize().
Under OS X, that would give results consistent with other applications, and reduce memory footprint.
Under other OSes, that would give a better tested CF-Lite, which is a good thing IMO.
Created attachment 7069 [details]
First attempt at implementing UBreakIterator
Here is a first pass at implementing text-transform:capitalize with a UBreakIterator. This fixes a TOT bug I noticed where we were no longer capitalizing after non-breaking spaces. It also changes some of our current behavior in ways that I think is good. For example, instead of "Newcastle-upon-tyne," this patch makes it "Newcastle-Upon-Tyne." Also instead of "E.g." this new patch writes out "E.G." These new behaviors are not necessarily correct, but they at least seem consistent and more correct than the old behavior for these edge cases.
The patch came out very interleaved so it is kind of hard to read.
Comment on attachment 7069 [details]
First attempt at implementing UBreakIterator
Looks great! Reusing the word break iterator like this is good.
One thing I'm not 100% sure about:
+ QChar previous = 0;
Does the ICU word break iterator handle the case of an initial null character properly? For example, if you do a document that contains only <span style="text-transform: capitalize">word</style> properly capitalize the word? If so, fine to land as is, otherwise I think a space might be a better default choice.
r=me assuming this case works.
Setting previous to 0 does work, but I think that a space makes more sense, so I changed it. I committed the patch.
Our behavior still isn't perfect, so we may want to create another bug for edge cases, but I am going to conisder this to be the bug representing our switch to the UBreakIterator and mark it fixed.
In the test case, this turns "earth<span>worm</span>" into "EarthOrm" !
What is happening to the 'w' ?
I am not seeing that behavior. I just see "Earthworm." Are you sure you have updated your tree? That is very strange.
I updated to ToT (r. 13304) a few hours ago, compiled and ran it. Incredibly, turning on "Use ATSU for text" causes the rendering to appear as "Earth?Orm", despite no hyphen occurring in the HTML, and a copy/paste causes "EarthCorm" to be put on the clipboard under 'TEXT' and "Earth-Orm" as NSStringPboardType!
Oops. That question mark was supposed to be a HYPHEN MINUS.
I can confirm that different text goes into different clipboard flavors (like Earthworm vs. EarthWorm), but I don't see any dependence on the "Use ATSU for text" setting, nor any seriously broken text. FWIW.
Created attachment 7090 [details]
screenshot of observed effect
Rendering of test case with ToT (r. 13304) and "Use ATSU For All Text" turned on. Includes result of a copy/paste into TextEdit's plain text (right) and rich text (left) windows. The hyphen on the clipboard is U+2D HYPHEN MINUS. Turning the ATSU option off causes a SegFault on page reload.
Oh, I see something that can kind of explain the problem (which would likely be separate from different pasteboard flavors getting different results):
Safari(2737,0xa000ed68) malloc: *** error for object 0x12fe29e0: incorrect checksum for freed object - object was probably modified after being freed, break at szone_error to debug
Safari(2737,0xa000ed68) malloc: *** set a breakpoint in szone_error to debug
I still need to investigate the ATSU problem, but I just checked in a patch for a leak and some occassional crashes that were happening through dump render tree. I don't expect that to fix whatever this ATSU problem is, but I figure it is worth noting.
Revision 13306 now renders as Earth"Orm.
Clipboard 'TEXT' = Earth"Orm
NSPboardStringType = EarthGorm
I suspect this is a memory clobbering issue, but that the recent checkin is not the one causing it.
Okay, so I also see "EarthWorm" in the pasteboard, but I haven't been able to reproduce any of the other strange behaviors you have described, with or without ATSU enabled. Is there anything else different about your settings that might make a difference?
I am going to add Darin to the cc list because he has a lot of expertise in this area. (Hope you don't mind, Darin!)
I originally thought it might have been caused by some of the stuff I have installed, like Inquisitor and SafariStand, but I ruled that out after disabling it and still seeing the problems. 417.9.2 does not exhibit the SegFault on switching between ATSU on/off nor the letter munging behaviour (which also causes the first letter of the line in Klingon to disappear), either with or without the extensions. The difference in clipboard contents is not testable in released version of safari.
Could you file a new bug for the problems you are seeing? Still can't repro. Thanks!