Bug 13572 - Arabic characters are not correctly joined when using Arial
Summary: Arabic characters are not correctly joined when using Arial
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text (show other bugs)
Version: 523.x (Safari 3)
Hardware: Macintosh OS X 10.4
: P1 Normal
Assignee: Nobody
URL: http://www.voanews.com/pashto/2007-04...
Keywords: HasReduction, InRadar
Depends on:
Blocks:
 
Reported: 2007-05-03 01:25 PDT by Oliver Hunt
Modified: 2007-05-11 09:47 PDT (History)
2 users (show)

See Also:


Attachments
Simple test case to show series of glyphs that aren't being merged correctly :( (423 bytes, text/html)
2007-05-03 03:11 PDT, Oliver Hunt
no flags Details
ushape.c patch (4.44 KB, patch)
2007-05-03 05:43 PDT, mitz
no flags Details | Formatted Diff | Diff
Test case rendering with the ushape.c patch (3.48 KB, image/png)
2007-05-03 05:44 PDT, mitz
no flags Details
Site rendering with the ushape.c patch (144.52 KB, image/png)
2007-05-03 05:45 PDT, mitz
no flags Details
Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages (28.29 KB, patch)
2007-05-10 13:14 PDT, mitz
no flags Details | Formatted Diff | Diff
Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages (28.19 KB, patch)
2007-05-11 06:10 PDT, mitz
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Oliver Hunt 2007-05-03 01:25:26 PDT
The above url shows two divs, the first using Arial, the second in Helvetica.  They should be more or less identical, however in the first case (Arial) four separate glyphs are drawn instead of the two combined glyphs produced by helvetica.
Comment 1 mitz 2007-05-03 02:38:44 PDT
I looked at this in the debugger. shapeArabic() is called, and the text is handed over to ICU for shaping. u_shapeArabic takes
U+0686 U+0627 U+06A9 U+0648
and returns
U+0686 U+FE8E U+06A9 U+FEEE

Somehow at least the first and third characters don't seem to be shaped correctly. Could be a bug in ICU or in the way it's called (maybe it needs different options?). The string is small enough that I might try to run it through <http://www.opensource.apple.com/darwinsource/10.4.9.ppc/ICU-6.2.13/icuSources/common/ushape.c> manually and see what happens.
Comment 2 Oliver Hunt 2007-05-03 03:11:08 PDT
Created attachment 14314 [details]
Simple test case to show series of glyphs that aren't being merged correctly :(
Comment 3 Oliver Hunt 2007-05-03 03:12:35 PDT
Here is a page that demonstrates these issues.  Arial was clearly sent to damn us all :(
Comment 4 Oliver Hunt 2007-05-03 03:13:18 PDT
Oops, actually include url
http://www.voanews.com/pashto/2007-04-23-voa2.cfm

replaced original data url with this link as well
Comment 5 mitz 2007-05-03 04:21:00 PDT
OK, so that page and the test cases are in the Arabic script, but not in the Arabic language. The page is quite evidently in Pashto. The non-Arabic characters' contextual forms reside in the Arabic Presentation Forms-A block (U+FB50..U+FDFF), whereas Arabic characters' contextual forms are in Arabic Presentation Forms-B (U+FE70..U+FEFF).

ICU's u_shapeArabic doesn't shape the non-Arabic characters, so they get no shaping in fonts where WebKit relies on ICU for that. However, ICU does have the linking type of those characters (left, right, both, none), so I think it might be possible to patch ICU and add the pointers to the contextual forms.
Comment 6 mitz 2007-05-03 05:43:16 PDT
Created attachment 14315 [details]
ushape.c patch

Done using <http://www.unicode.org/Public/4.1.0/ucd/ArabicShaping.txt> (which is identical to Unicode 5.0 for the characters in question) and Mac OS X's Character Palette.
Comment 7 mitz 2007-05-03 05:44:03 PDT
Created attachment 14316 [details]
Test case rendering with the ushape.c patch
Comment 8 mitz 2007-05-03 05:45:43 PDT
Created attachment 14317 [details]
Site rendering with the ushape.c patch
Comment 9 mitz 2007-05-03 07:39:16 PDT
It should be possible to build the patched u_shapeArabic into WebKit. That would also lock in (or allow fixing of) the Lam-Aleph quirk mentioned in FontMac.mm:298.

I'm concerned about testing, though. I don't have a source for test data and if I make a test I'm not sure I'll know what to expect. Automatic testing poses another problem, since Tiger doesn't ship with any fonts that trigger the ICU code path. It is now possible to bundle extra fonts with DumpRenderTree, but I don't know if there's a suitable font that's compatible with WebKit's licensing requirements. In the long term, I think it would be nice to create a font for testing this and a few other code paths in WebKit that are only selected for fonts with pathologies that don't exist in standard fonts (e.g. non-zero-width zero-width-space, non-mirrored mirrored characters).
Comment 10 mitz 2007-05-03 09:27:00 PDT
Only now it occurred to me to check out ICU TOT. It has a few of those fixed <http://source.icu-project.org/repos/icu/icu/trunk/source/common/ushape.c> and it agrees with what I did in attachment 14315 [details]. I wonder if the rest are needed. A compile-time version check on ICU might be a good idea, if there's a release that includes what's currently TOT and if that's sufficient.
Comment 11 Darin Adler 2007-05-04 22:20:11 PDT
<rdar://problem/5183696>
Comment 12 mitz 2007-05-05 01:36:19 PDT
This is the ICU patch: <http://bugs.icu-project.org/trac/changeset/20705>. As far as I can tell it is not included in any ICU release.
Comment 13 mitz 2007-05-10 13:14:45 PDT
Created attachment 14476 [details]
Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages

What a mess. I started out just copying ushape.c and patching it, but I ended up deleting stuff that's unused in WebKit and removing code that reversed the string twice.
Comment 14 Darin Adler 2007-05-10 13:40:51 PDT
Comment on attachment 14476 [details]
Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages

I'm surprised by the function name. In what way is this Macintosh-specific?

It's hard to review the differences in this between the ICU version and the new version because the diff just shows the new code.

I don't think we need to use U_CAPI and U_EXPORT2 in an internal WebCore header.

Typically we capitalize filenames, so I would call this ShapeArabicMac.h/cpp.

I'm comfortable with this patch despite those concerns. So I'm going to mark this review+ -- feel free to clear the review flag if you want to do a new version.
Comment 15 mitz 2007-05-10 13:43:59 PDT
Comment on attachment 14476 [details]
Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages

I'm going to address at least some of Darin's comments.
Comment 16 mitz 2007-05-11 06:10:17 PDT
Created attachment 14488 [details]
Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages

(In reply to comment #14)
> (From update of attachment 14476 [details] [edit])
> I'm surprised by the function name. In what way is this Macintosh-specific?

I removed "Mac" from the function and file names, but I left them in the platform/mac
subdirectory, since this code is not cross-platform (and in that sense it is Macintosh-specific). I really hope this hack can be removed in the future.

> I don't think we need to use U_CAPI and U_EXPORT2 in an internal WebCore
> header.

Fixed.

> Typically we capitalize filenames, so I would call this ShapeArabicMac.h/cpp.

Fixed.
Comment 17 Darin Adler 2007-05-11 09:02:13 PDT
Comment on attachment 14488 [details]
Add a private stripped-down version of u_shapeArabic that handles the non-Arabic languages

r=me
Comment 18 Mark Rowe (bdash) 2007-05-11 09:47:31 PDT
Landed in r21408.