Summary: | Arabic characters are not correctly joined when using Arial | ||
---|---|---|---|
Product: | WebKit | Reporter: | Oliver Hunt <oliver> |
Component: | Text | Assignee: | Nobody <webkit-unassigned> |
Status: | RESOLVED FIXED | ||
Severity: | Normal | CC: | ap, mitz |
Priority: | P1 | Keywords: | HasReduction, InRadar |
Version: | 523.x (Safari 3) | ||
Hardware: | Mac | ||
OS: | OS X 10.4 | ||
URL: | http://www.voanews.com/pashto/2007-04-23-voa2.cfm | ||
Attachments: |
Description
Oliver Hunt
2007-05-03 01:25:26 PDT
I looked at this in the debugger. shapeArabic() is called, and the text is handed over to ICU for shaping. u_shapeArabic takes U+0686 U+0627 U+06A9 U+0648 and returns U+0686 U+FE8E U+06A9 U+FEEE Somehow at least the first and third characters don't seem to be shaped correctly. Could be a bug in ICU or in the way it's called (maybe it needs different options?). The string is small enough that I might try to run it through <http://www.opensource.apple.com/darwinsource/10.4.9.ppc/ICU-6.2.13/icuSources/common/ushape.c> manually and see what happens. Created attachment 14314 [details]
Simple test case to show series of glyphs that aren't being merged correctly :(
Here is a page that demonstrates these issues. Arial was clearly sent to damn us all :( Oops, actually include url http://www.voanews.com/pashto/2007-04-23-voa2.cfm replaced original data url with this link as well OK, so that page and the test cases are in the Arabic script, but not in the Arabic language. The page is quite evidently in Pashto. The non-Arabic characters' contextual forms reside in the Arabic Presentation Forms-A block (U+FB50..U+FDFF), whereas Arabic characters' contextual forms are in Arabic Presentation Forms-B (U+FE70..U+FEFF). ICU's u_shapeArabic doesn't shape the non-Arabic characters, so they get no shaping in fonts where WebKit relies on ICU for that. However, ICU does have the linking type of those characters (left, right, both, none), so I think it might be possible to patch ICU and add the pointers to the contextual forms. Created attachment 14315 [details] ushape.c patch Done using <http://www.unicode.org/Public/4.1.0/ucd/ArabicShaping.txt> (which is identical to Unicode 5.0 for the characters in question) and Mac OS X's Character Palette. Created attachment 14316 [details]
Test case rendering with the ushape.c patch
Created attachment 14317 [details]
Site rendering with the ushape.c patch
It should be possible to build the patched u_shapeArabic into WebKit. That would also lock in (or allow fixing of) the Lam-Aleph quirk mentioned in FontMac.mm:298. I'm concerned about testing, though. I don't have a source for test data and if I make a test I'm not sure I'll know what to expect. Automatic testing poses another problem, since Tiger doesn't ship with any fonts that trigger the ICU code path. It is now possible to bundle extra fonts with DumpRenderTree, but I don't know if there's a suitable font that's compatible with WebKit's licensing requirements. In the long term, I think it would be nice to create a font for testing this and a few other code paths in WebKit that are only selected for fonts with pathologies that don't exist in standard fonts (e.g. non-zero-width zero-width-space, non-mirrored mirrored characters). Only now it occurred to me to check out ICU TOT. It has a few of those fixed <http://source.icu-project.org/repos/icu/icu/trunk/source/common/ushape.c> and it agrees with what I did in attachment 14315 [details]. I wonder if the rest are needed. A compile-time version check on ICU might be a good idea, if there's a release that includes what's currently TOT and if that's sufficient. This is the ICU patch: <http://bugs.icu-project.org/trac/changeset/20705>. As far as I can tell it is not included in any ICU release. Created attachment 14476 [details]
Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages
What a mess. I started out just copying ushape.c and patching it, but I ended up deleting stuff that's unused in WebKit and removing code that reversed the string twice.
Comment on attachment 14476 [details]
Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages
I'm surprised by the function name. In what way is this Macintosh-specific?
It's hard to review the differences in this between the ICU version and the new version because the diff just shows the new code.
I don't think we need to use U_CAPI and U_EXPORT2 in an internal WebCore header.
Typically we capitalize filenames, so I would call this ShapeArabicMac.h/cpp.
I'm comfortable with this patch despite those concerns. So I'm going to mark this review+ -- feel free to clear the review flag if you want to do a new version.
Comment on attachment 14476 [details]
Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages
I'm going to address at least some of Darin's comments.
Created attachment 14488 [details] Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages (In reply to comment #14) > (From update of attachment 14476 [details] [edit]) > I'm surprised by the function name. In what way is this Macintosh-specific? I removed "Mac" from the function and file names, but I left them in the platform/mac subdirectory, since this code is not cross-platform (and in that sense it is Macintosh-specific). I really hope this hack can be removed in the future. > I don't think we need to use U_CAPI and U_EXPORT2 in an internal WebCore > header. Fixed. > Typically we capitalize filenames, so I would call this ShapeArabicMac.h/cpp. Fixed. Comment on attachment 14488 [details]
Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages
r=me
Landed in r21408. |