Bug 16548 - REGRESSION(r28810): Font style and sizes are weird for Japanese text
Summary: REGRESSION(r28810): Font style and sizes are weird for Japanese text
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text
Version: 528+ (Nightly build)
Hardware: PC Windows XP
Importance: P1 Normal
Assignee: Nobody
URL: http://www.google.co.jp/search?q=safa...
Keywords: InRadar, Regression
Depends on:
Blocks:
 
Reported: 2007-12-20 19:22 PST by Louise
Modified: 2008-01-03 18:06 PST
CC List: 2 users

See Also:


Attachments
Screenshot taken with Safari 3.0.4 (523.13) (45.33 KB, image/png)
2007-12-20 19:24 PST, Louise
Screenshot taken with Firefox (13.66 KB, image/png)
2007-12-20 19:24 PST, Louise
test code (not patch) corresponds to screenshots below (1.24 KB, text/plain)
2007-12-21 02:51 PST, 808caaa4.8ce9.9cd6c799e9f6
screenshots, result samples, with CP_ACP (16.30 KB, image/gif)
2007-12-21 02:52 PST, 808caaa4.8ce9.9cd6c799e9f6
screenshots, result samples, with CP=932(Japanese) (16.26 KB, image/gif)
2007-12-21 02:52 PST, 808caaa4.8ce9.9cd6c799e9f6
Use FontLink\SystemLink registry values to map fallback fonts (8.35 KB, patch)
2007-12-24 22:47 PST, mitz
Use FontLink\SystemLink registry values to map fallback fonts (10.40 KB, patch)
2008-01-02 08:25 PST, mitz
mitz: review-
Use FontLink\SystemLink registry values to map fallback fonts (9.15 KB, patch)
2008-01-03 17:37 PST, mitz
darin: review+

Description Louise 2007-12-20 19:22:54 PST
Font sizes look different for Japanese hiragana than for Chinese characters and Latin letters. The page title shown in Safari's window has the same problem.

Reproducible: Always

Steps to Reproduce:
Go to the above URL (or any site with Japanese text)

Tested with Safari 3.0.4 (523.13) on Windows XP SP2.
Comment 1 Louise 2007-12-20 19:24:03 PST
Created attachment 18023 [details]
Screenshot taken with Safari 3.0.4 (523.13)
Comment 2 Louise 2007-12-20 19:24:23 PST
Created attachment 18024 [details]
Screenshot taken with Firefox
Comment 3 Louise 2007-12-20 20:09:11 PST
This is a regression from r28810. Earlier WebKit builds did not have this problem.
The font style for Japanese text typed in Safari (in text inputs and the search box) has also changed from earlier WebKit builds, and the text is now somewhat difficult to read.
Comment 4 mitz 2007-12-20 20:20:08 PST
<rdar://problem/5659452>
Comment 5 mitz 2007-12-20 20:32:57 PST
Need to devise a better heuristic for picking fallback fonts for CJK characters that is consistent but doesn't favor the Chinese font everywhere.
Comment 6 808caaa4.8ce9.9cd6c799e9f6 2007-12-21 02:51:10 PST
Created attachment 18028 [details]
test code (not patch) corresponds to screenshots below
Comment 7 808caaa4.8ce9.9cd6c799e9f6 2007-12-21 02:52:00 PST
Created attachment 18029 [details]
screenshots, result samples, with CP_ACP
Comment 8 808caaa4.8ce9.9cd6c799e9f6 2007-12-21 02:52:53 PST
Created attachment 18030 [details]
screenshots, result samples, with CP=932(Japanese)
Comment 9 808caaa4.8ce9.9cd6c799e9f6 2007-12-21 03:00:33 PST
In my environment:
- CodePageToCodePages(CP_ACP,...) fails with E_FAIL, so acpCodePages remains zero.
- GetStrCodePages(<JapaneseUNICODEStr>,...) returns
	0x1e0000 == Japanese | ChineseSimplified | Korean | ChineseTraditional
so the final codePage == simplifiedChineseCP, *even when* 932 (Japanese) is passed to CodePageToCodePages().

Screenshots from my environment are attached.
The 1st line is 'apple'. The 2nd line is 'apple' written in hiragana.
The 3rd line is ASCII art that contains hankaku-hiragana.
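
For reference, the mask above can be decoded with a small sketch like the following. The bit values are the FS_* charset bits from wingdi.h, which agree with the 0x1e0000 breakdown quoted above. The table order in kCJKBits is an assumption chosen only to reproduce the reported result (936); it is not taken from the WebKit source, and describeCodePages() and pickCodePage() are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct CodePageBit {
    std::uint32_t bit;
    unsigned codePage;
    const char* name;
};

// CJK charset bits (FS_CHINESESIMP, FS_JISJAPAN, FS_WANSUNG, FS_CHINESETRAD).
// The order of this table is a hypothetical priority: Simplified Chinese first.
static const CodePageBit kCJKBits[] = {
    { 0x00040000, 936, "ChineseSimplified" },
    { 0x00020000, 932, "Japanese" },
    { 0x00080000, 949, "Korean" },
    { 0x00100000, 950, "ChineseTraditional" },
};

// Lists the names of the CJK code pages present in a mask.
std::vector<std::string> describeCodePages(std::uint32_t mask)
{
    std::vector<std::string> names;
    for (const CodePageBit& entry : kCJKBits) {
        if (mask & entry.bit)
            names.push_back(entry.name);
    }
    return names;
}

// Returns the first code page, in table order, that the mask contains, or 0.
unsigned pickCodePage(std::uint32_t mask)
{
    for (const CodePageBit& entry : kCJKBits) {
        if (mask & entry.bit)
            return entry.codePage;
    }
    return 0;
}
```

Under this ordering, any mask that includes the Simplified Chinese bit resolves to 936 regardless of which other CJK bits are set, while a mask with only the Japanese bit resolves to 932.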
Comment 10 mitz 2007-12-21 07:36:19 PST
One limitation is that system fallback font selection in WebKit is done on a character-by-character basis. GetStrCodePages will always be passed a single character.
Comment 11 808caaa4.8ce9.9cd6c799e9f6 2007-12-21 20:20:35 PST
As I understand it, the current implementation (getFontDataForCharacters() in WebCore/platform/graphics/win/FontCacheWin.cpp) means that
if a Unicode character CAN BE Simplified Chinese (CP936), it is always treated as a Simplified Chinese character.

Currently a great many characters, not only kanji but also alphanumeric characters, are detected as
(Chinese | Japanese | Korean | more...) by mlang.dll and are rendered with unfamiliar fonts
(system fonts for Chinese, which are also preinstalled on Japanese NT, as specified by windows/inf/intl.inf).

I don't yet know how to solve this cleanly while taking the international use of WebKit into consideration.
I'll think it over while doing housework today....

The quickest *temporary* hack is to exclude Japanese installations, which are simply checkable:
	if (GetACP()==932){ /* is Japanese */}
so hacks like:
	if (/*TEMPHACK*/GetACP()!=932 && actualCodePages && ...

Some anonymous testers say that this hack avoids the current problem.
Comment 12 808caaa4.8ce9.9cd6c799e9f6 2007-12-21 20:21:42 PST
/* Correction to comment #9: that is not hankaku-hiragana but hankaku-kana.

Yes, those examples use strings. Checking single characters gives the same result.

---
D:\works>cptest A
acpCodePages: 0(80004005),actualCodePages: 1F01FF, cchActual: 1,finalCodePage: 936
D:\works>cptest <one-hiragana-char>
acpCodePages: 0(80004005),actualCodePages: E0000, cchActual: 1,finalCodePage: 936
D:\works>
---

*/
Comment 13 mitz 2007-12-21 22:45:42 PST
(In reply to comment #11)
> The quickest *temporary* hack is exclude Japanese installation, simply
> checkable:
>         if (GetACP()==932){ /* is Japanese */}
> so hacks like:
>         if (/*TEMPHACK*/GetACP()!=932 && actualCodePages && ...

I believe the above will still break if you are on a Japanese installation once you go to a Chinese page. We will then ask MLang to map from codepage 936 alone, so it will return a Chinese font, and subsequently it will return the same font for all characters, even those that are both in 932 and in 936. This inconsistent behavior and dependency on the order of operations is what r28810 was trying to prevent.

The system code page should probably factor into the font linking process, but hopefully there is a way for that to happen that is internal to MLang or another Windows API without having to query it explicitly in WebKit.

Perhaps hiding information from MLang is the way to go: when a character belongs to multiple code pages, if one of them is the system code page ask only about that code page. Then figure out what to do if none of the multiple code pages is the system code page (as is the case on an English installation). I think Mac OS X uses font traits to pick the fallback font.
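
The first half of that idea could look roughly like the sketch below. narrowToSystemCodePages() is a hypothetical helper, and both arguments are assumed to be code page bit masks of the kind shown in comment #9. Returning the mask unchanged when nothing overlaps is only a placeholder, since what to do in that case is still an open question.

```cpp
#include <cstdint>

// Hides from the font mapper every code page except the system one(s),
// but only when the character actually belongs to a system code page.
std::uint32_t narrowToSystemCodePages(std::uint32_t characterCodePages,
                                      std::uint32_t systemCodePages)
{
    const std::uint32_t common = characterCodePages & systemCodePages;
    if (common)
        return common; // Ask only about the system code page.
    // Placeholder: no overlap (e.g. CJK text on an English installation).
    return characterCodePages;
}
```

On a Japanese installation (system mask 0x00020000), a character reported as 0x1e0000 would be narrowed to 0x00020000, so the result would no longer depend on which page was visited first.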
Comment 14 mitz 2007-12-23 13:32:18 PST
I have been looking at how IE behaves on different language versions of Windows. Using two UTF-8 encoded HTML files with no locale metadata and no style information, one containing text from MSN in Chinese and the other containing text from Google search results in Japanese, I have observed the following:

* On the English and Chinese (cn) installs
  - Chinese was rendered using a single "Chinese"-looking font (probably Simsun).
  - Japanese was rendered using a mixture of two fonts, the "Chinese" font for some characters and a "Japanese" font (probably MS PGothic) for others.

* On the Japanese install
  - Chinese was rendered using a mixture of two fonts, a "Chinese" one and a "Japanese" one.
  - Japanese was rendered using a single "Japanese" font.

* On the Chinese (zh_HK) install
  - Japanese looked like on the (cn) install.
  - Chinese used mostly the "Chinese" font but a few characters were rendered using a "Japanese"-looking font.

I have not tested it, but IE might perform better when it can infer the language from the encoding or metadata.

Font fallback in WebKit is per-character and cannot be specific to a document, so I think ideas that involve context or metadata are not the right answer.

The use of code pages on Windows is what leads to the "mixed fonts" behavior. I think the whole notion of code page should be avoided in WebKit, just like on the Mac.

The other thing that helps on the Mac is that font fallback tries to maintain font traits, so for example even though the google.co.jp style sheet does not specify any font family that has CJK characters on Leopard, since fallback is from a sans-serif font, the system hands back a "Japanese"-looking font.

As far as I could tell, Windows font fallback mechanisms do not try to match traits. However, it might still be possible to at least fall back on the appropriate font for the installed language by using the registry keys that GDI uses for its internal font fallback.
Comment 15 mitz 2007-12-24 22:47:08 PST
Created attachment 18104 [details]
Use FontLink\SystemLink registry values to map fallback fonts

This patch does not use code pages or MapFont. On an English XP, Japanese is rendered consistently in MS UI Gothic, but Chinese uses a mixture of MS UI Gothic and Simsun. I have not tested on other systems yet, but I expect Japanese Vista to behave the same, and Chinese Vista to use Simsun exclusively for both languages.
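
For illustration, here is a rough sketch of how the data of a SystemLink value could be scanned. It assumes the REG_MULTI_SZ data has already been read into a buffer and that each string has the form "file,face", optionally followed by more comma-separated fields. linkedFontFaces() and the sample entries are hypothetical, and this is not the code in the attached patch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Extracts the face names from a REG_MULTI_SZ-style buffer: strings are
// separated by L'\0' and the list ends with an empty string. `length` is the
// number of wchar_t elements in the buffer; the scan never reads past it.
std::vector<std::wstring> linkedFontFaces(const wchar_t* data, std::size_t length)
{
    std::vector<std::wstring> faces;
    std::size_t i = 0;
    while (i < length && data[i] != L'\0') {
        // Find the end of the current string without overrunning the buffer.
        std::size_t end = i;
        while (end < length && data[end] != L'\0')
            ++end;
        const std::wstring entry(data + i, end - i);

        // The face name is the second comma-separated field, if present.
        const std::size_t comma = entry.find(L',');
        if (comma != std::wstring::npos) {
            const std::size_t faceEnd = entry.find(L',', comma + 1);
            const std::size_t count = faceEnd == std::wstring::npos
                ? std::wstring::npos
                : faceEnd - comma - 1;
            faces.push_back(entry.substr(comma + 1, count));
        }
        i = end + 1;
    }
    return faces;
}
```

Given a buffer holding the two strings "MSGOTHIC.TTC,MS UI Gothic" and "SIMSUN.TTC,SimSun", the function returns the two face names in order. Entries that contain only a file name are skipped in this sketch.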
Comment 16 808caaa4.8ce9.9cd6c799e9f6 2007-12-25 02:47:53 PST
Future extensions:
How about first trying to fetch a WebKit-local font link list (prefs in a plist, the registry, ...)
before querying the SystemLink key, as a further implementation step?
These keys cannot be modified by members of the 'Users' group.
Comment 17 mitz 2007-12-30 19:03:29 PST
(In reply to comment #15)
> I have not tested on other systems yet, but I expect Japanese Vista to
> behave the same, and Chinese Vista to use Simsun exclusively for both
> languages.

Confirmed.
Comment 18 mitz 2008-01-02 08:25:45 PST
Created attachment 18238 [details]
Use FontLink\SystemLink registry values to map fallback fonts

Cleaned up and added a change log. Not sure this is the best/correct approach.
Comment 19 Darin Adler 2008-01-02 09:29:30 PST
Comment on attachment 18238 [details]
Use FontLink\SystemLink registry values to map fallback fonts

This looks good.

r=me
Comment 20 mitz 2008-01-02 14:38:21 PST
Comment on attachment 18238 [details]
Use FontLink\SystemLink registry values to map fallback fonts

I found an error in the loop that scans the registry key value and a few layout test failures.
Comment 21 mitz 2008-01-03 13:42:02 PST
(In reply to comment #20)
> layout test failures

Some tests that were using Ahem were failing because FontCache::getFontDataForCharacters() returned 0 when the primary font was Ahem and the character was a zero width space. Here is a list of characters for which Uniscribe says that it will use Ahem even though Ahem does not have a glyph for them:

U+070F SYRIAC ABBREVIATION MAKR
U+180B MONGOLIAN FREE VARIATION SELECTOR ONE
U+180C MONGOLIAN FREE VARIATION SELECTOR TWO
U+180D MONGOLIAN FREE VARIATION SELECTOR THREE
U+180E MONGOLIAN VOWEL SEPARATOR
U+180F
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
Comment 22 mitz 2008-01-03 13:43:56 PST
U+070F SYRIAC ABBREVIATION MARK (fixed typo in case anyone ever searches Bugzilla for this string).
Comment 23 mitz 2008-01-03 14:02:23 PST
For U+070F, U+180B, U+180C, U+180D, U+180E, U+180F, the current code calls MapFont, and the call fails, so next it tries the Uniscribe-metafile method, which as mentioned above returns "Ahem", and that is what is returned to the caller.

It seems like U+2000..U+200D are really the only characters that should be treated specially.
Comment 24 mitz 2008-01-03 17:37:31 PST
Created attachment 18258 [details]
Use FontLink\SystemLink registry values to map fallback fonts

Corrected the loop limit and added a special case for characters in the range U+2000..U+200F.
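
The range check for that special case can be sketched as below. The function name is hypothetical, and what the patch does once a character matches is not spelled out in this comment, so only the test itself is shown.

```cpp
// True for the General Punctuation spaces and format characters
// U+2000..U+200F (EN QUAD through RIGHT-TO-LEFT MARK).
constexpr bool isInSpecialCasedRange(char32_t character)
{
    return character >= 0x2000 && character <= 0x200F;
}
```

U+200B ZERO WIDTH SPACE falls inside this range, while the Syriac and Mongolian characters from comment #21 (U+070F, U+180B..U+180F) do not.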
Comment 25 mitz 2008-01-03 17:41:50 PST
Comment on attachment 18258 [details]
Use FontLink\SystemLink registry values to map fallback fonts

I think I do not need this:
+#include "CharacterNames.h"
Comment 26 Darin Adler 2008-01-03 17:47:26 PST
Comment on attachment 18258 [details]
Use FontLink\SystemLink registry values to map fallback fonts

r=me
Comment 27 mitz 2008-01-03 18:06:44 PST
Landed in <http://trac.webkit.org/projects/webkit/changeset/29140>.