Bug 20237

Summary:

Zero width and space characters are displayed incorrectly if not contained in a fallback font.

Product:

WebKit

Reporter:

Brett Wilson (Google) <brettw>

Component:

Platform

Assignee:

mitz

Status:

RESOLVED FIXED

Severity:

Normal

CC:

ap, dglazkov, dimitri, hyatt, jshin, mitz, nickshanks

Priority:

Version:

528+ (Nightly build)

Hardware:

OS:

Windows XP

Attachments:

Description	Flags
Test case	none
Patch	mitz: review-
Updated Patch	sam: review-
Updated: ZWS fix only	none
With just the 0xFFFC check	mitz: review+

Brett Wilson (Google)

Reported 2008-07-30 17:23:55 PDT

If I have a character like U+200B (non-breaking space) in a document, WebCore will go ask the font system for it. Some fonts include a "glyph" for this character with the correct rendering (nothing) and advance (space). I'm pretty sure the fallback font shipped with Mac/Safari does this and all is well. The same is true for Arial Unicode MS which ships with MS Office on Windows. However, if these fonts are unavailable, this glyph will not be found in Arial or Times and will be mapped to the 0 glyph (since it doesn't exist). This glyph is a box in most fonts. You want this behavior for most nonexistant glyphs since if you don't have a Chinese font installed, you want to see boxes rather than nothing if you visit a Chinese site so you know you're missing something. For characters that should be treated as space or zero width space, this is not the case. The same happens for U+200E (LTR marker) and the like. These should be rendered as zero width spaces. When we get mapped to the 0 glyph, it turns into a box. I think these two classes of characters need to be special cased to work without a pan-Unicode font that includes these characters.

Attachments
Test case (41 bytes, text/html) 2008-07-30 17:24 PDT, Brett Wilson (Google)	no flags	Details
Patch (10.42 KB, patch) 2008-07-31 16:56 PDT, Brett Wilson (Google)	mitz: review-	Details Formatted Diff Diff
Updated Patch (11.29 KB, patch) 2008-09-04 12:43 PDT, Dimitri Glazkov (Google)	sam: review-	Details Formatted Diff Diff
Updated: ZWS fix only (8.27 KB, patch) 2008-09-05 14:18 PDT, Dimitri Glazkov (Google)	no flags	Details Formatted Diff Diff
With just the 0xFFFC check (1.40 KB, patch) 2008-09-08 15:25 PDT, Dimitri Glazkov (Google)	mitz: review+	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Brett Wilson (Google)

Comment 1 2008-07-30 17:24:52 PDT

Created attachment 22570 [details] Test case This patch shows a ZWSP and LTR characters. If you don't have Apple's fallback font or Arial Unicode MS on Windows, then these will be rendered as squares.

Brett Wilson (Google)

Comment 2 2008-07-31 16:56:27 PDT

Created attachment 22591 [details] Patch This patch makes boxes for directional markers and nonbreaking spaces disappear when there is no fallback font that supports them. It hardcodes that the width of "zero width space" characters is 0. Platforms will also have to hardcode that the glyph for such characters is the space glyph. I've added this to the Windows code. Mac doesn't seem to have this problem because the fonts are less crappy. If Mac wanted to implement it, it would be best in wkGetGlyphVectorFirstRecord which looks like isn't public. I added a parameter to the glyph width getter so it gets the original character. I also use this for an optimization for Chinese characters which I hope to submit soon; it's very helpful for the width computation code to actually know what character we're talking about. The changes in the Font.h file are to make the zero width space (and a few related chars) actually return true in the treatAsZeroWidthSpace function().

Eric Seidel (no email)

Comment 3 2008-08-01 15:33:30 PDT

Comment on attachment 22591 [details] Patch Hyatt says that mitz is the man to review this.

mitz

Comment 4 2008-08-04 23:05:01 PDT

(In reply to comment #2) > I added a parameter to the glyph width getter so it gets the original > character. I also use this for an optimization for Chinese characters which I > hope to submit soon; it's very helpful for the width computation code to > actually know what character we're talking about. I am not too happy with the extra work in widthForGlyph, and not particularly happy with breaking the separation between character space and glyph space (what are we going to pass down to that function when we have ligatures and other features that break the 1:1 character:glyph ratio?), and I can think of a different way to address this bug, so I would like to know more about your planned optimization for Chinese.

Brett Wilson (Google)

Comment 5 2008-08-06 07:42:02 PDT

With our CJK optimization, it looks like this: SimpleFontData::widthForGlyph(UChar32 c, Glyph glyph) const { bool is_CJK = IsCJKCodePoint(c); float width = is_CJK ? m_cjkGlyphWidth : m_glyphToWidthMap.widthForGlyph(glyph); #ifndef NDEBUG // Test our optimization that assuming all CGK glyphs have the same width if (is_CJK) { const float actual_width = platformWidthForGlyph(glyph); ASSERT((cGlyphWidthUnknown == width) || (actual_width == width)); } #endif if (c > ' ' && Font::treatAsZeroWidthSpace(c)) return 0.0f; if (width != cGlyphWidthUnknown) return width; width = platformWidthForGlyph(glyph); if (is_CJK) { m_cjkGlyphWidth = width; } else { m_glyphToWidthMap.setWidthForGlyph(glyph, width); } return width; } So basically, this caches the widths of the first CJK character seen in a font, and uses that for all subsequent ones. If I recall correctly, this, along with making the widths 16-bit fixed point, saves us ~150MB for a page cycler run with a lot of different CJK pages in it.

Jungshik Shin

Comment 6 2008-08-06 15:53:09 PDT

There are a whole bunch of characters to fall into 'zero-width' category. It might be better to use | u_hasBinaryProperty(c, UCHAR_DEFAULT_IGNORABLE_CODE_POINT) | ( http://www.icu-project.org/apiref/icu4c/uchar_8h.html#25c5c820d4141e4099acc15ca83572a5 ). Firefox does not use the set (as it is) but defines its own (derived from the default ignorable). Some characters with default ignorable property cannot be ignored but as long as they're in 'simple script' code path, I think it's ok.

Brett Wilson (Google)

Comment 7 2008-08-07 17:01:15 PDT

mitz: Can you comment on the overall approach. I'll try to come up with better patch using Jungshik's suggestions, but I would like to know how best to integrate the new code.

mitz

Comment 8 2008-08-13 09:46:58 PDT

I think it would be best to keep character-to-glyph mapping separate from glyph-to-width mapping, and I think the goals of your changes could be achieved by creating separate FontData objects, such as one that has a zero-width invisible glyph (the underlying platform font can be any font with a space glyph; the width map changed to give it zero width) which could be used for characters that should render as zero-width space; and such as one that shares the same underlying platform font with another FontData, has a 'monospace' bit set, has no width map, which could be used for all CJK characters in the font.

mitz

Comment 9 2008-08-19 22:34:50 PDT

Comment on attachment 22591 [details] Patch r- to remove this from the queue while you consider my comments :)

Dimitri Glazkov

Comment 10 2008-08-26 11:24:20 PDT

... trying to figure out how this will work. So, in your view, these FontData instances be would be sub-classes of SimpleFontData (say, ZeroWidthFontData and CJKFontData), each offering a different widthForGlyph implementation? Then, in GlyphPage::fill, we supply either an instance of SimpleFontData (for most chars), ZeroWidthFontData (for ZWS and alike), or CJKFontData (for CJK chars). WDYT?

mitz

Comment 11 2008-09-02 16:31:08 PDT

I think subclassing and making widthForGlyph() virtual might be bad for performance, but if you have a patch that does that I can try and measure it.

Dimitri Glazkov (Google)

Comment 12 2008-09-04 12:43:47 PDT

Created attachment 23174 [details] Updated Patch As discussed on #webkit w/mitz, here's an updated patch. There are opportunities for optimization, but I thought I'd first get the basics right.

Sam Weinig

Comment 13 2008-09-05 10:06:04 PDT

Comment on attachment 23174 [details] Updated Patch +// static Why is this here? I think this patch tries to do too many things at once. This bug is related to the display of zero width characters, and thus, the patch should try and address only that issue. Please file another bug about the potential CJK optimization.

Eric Seidel (no email)

Comment 14 2008-09-05 10:10:52 PDT

(In reply to comment #13) > (From update of attachment 23174 [details] [edit]) > +// static > Why is this here? it seems to be part of the Google C++ Style guidelines (thus what most Google engineers are familiar with). I don't see it listed in the external c++ style guidelines though. It will take a while for many of us to learn the differences between WebKit/Google style. :)

Dimitri Glazkov (Google)

Comment 15 2008-09-05 11:31:52 PDT

(In reply to comment #14) > (In reply to comment #13) > > (From update of attachment 23174 [details] [edit] [edit]) > > +// static > > Why is this here? > > it seems to be part of the Google C++ Style guidelines (thus what most Google > engineers are familiar with). I don't see it listed in the external c++ style > guidelines though. It will take a while for many of us to learn the > differences between WebKit/Google style. :) Yes, I've been kinda sorta trying to merge both styles. I will split up CJK opt and resubmit.

Dimitri Glazkov (Google)

Comment 16 2008-09-05 14:18:46 PDT

Created attachment 23200 [details] Updated: ZWS fix only Removed CJK optimization.

mitz

Comment 17 2008-09-06 14:42:20 PDT

I am confused. From the initial description of the bug it sounded as if this is only a problem in the system fallback case, so I expected the substitution to happen only in the end of Font::glyphDataForCharacter(). What am I missing?

mitz

Comment 18 2008-09-07 00:02:02 PDT

The more I look at the bug description, and the more testing I do in Safari on Windows, the less I understand how this bug works, even in the system fallback case. The only exception is U+FFFC (OBJECT REPLACEMENT CHARACTER) which is not treatAsZeroWidthSpace(), even though it is ZWS-ed out in GlyphPageTreeNode::initializePage. In all other cases, I expect a missing glyph (square) to be rendered only if no font on the system has a glyph for ZWS. As far as I can tell, Times New Roman, plain ol' Arial, and other fonts on Windows have correct glyphs for ZWS.

Dimitri Glazkov (Google)

Comment 19 2008-09-08 10:35:52 PDT

Well, I dug some more, using a pristine (font-wise) XP install and here's what happens for the ZWS chars for a standard Win font, using Times New Roman as an example. wkGetGlyphs returns 0 for the 0x200B char, so the glyph doesn't exist in the font, but FontCache::getFontDataForCharacters is able to scare up the glyph using font-linking (the glyph is borrowed from Lucida Sans Unicode). In the case of 0xFFFC, the same path is followed, except no glyph is found using font-linking, so we fall back into Uniscribe, drawing the string into a metafile. Unlucky for us, this succeeds with (woot!) "Times New Roman", which brings back the original font, which in turn draws a square in place of the character. So, the proposed patch, IMHO (I am not a fonts expert, please correct me if I am wrong) eliminates the need to dip into the IMLang linking business and avoids creating metafile objects for drawing the ORC char. I don't have a strong opinion about moving the SimpleFontData substitution logic to Font::glyphDataForCharacter. My thinking was to initialize the FontCache with the substituted objects to begin with, so that there's no additional checking while drawing each character.

mitz

Comment 20 2008-09-08 11:44:06 PDT

(In reply to comment #19) > In the case of 0xFFFC, the same path is followed, except no glyph is found > using font-linking [...] But we should never be asking for a glyph for U+FFFC. We should -- and we currently don't -- ask for a glyph for zero-width space instead. The reason we don't is that even though GlyphPageTreeNode::initializePage() is aware of this substitution, Font::treatAsZeroWidthSpace() is not, and the latter is what's used in the "system fallback" case near the end of glyphDataForCharacter(). Just adding U+FFFC to treatAsZeroWidthSpace() OR would make the problem go away, I think.

Dimitri Glazkov (Google)

Comment 21 2008-09-08 15:25:59 PDT

Created attachment 23276 [details] With just the 0xFFFC check Shedding another pound of code after healthy mitzpettel's workout. This patch leaves finding ZWS glyphs up to IMLang font-linking, instead of using a separate FontData instance. So, the only change is making Font:treatAsZeroWidthSpace(c) react to 0xFFFC properly.

mitz

Comment 22 2008-09-08 16:03:57 PDT

Comment on attachment 23276 [details] With just the 0xFFFC check + Adds an extra check for Object Replacement Character (0xFFFF) to address Typo: should say FFFC. Unfortunately, a layout test does not seem possible because our test configurations have ZWS mapped to a glyph even in non-fallback fonts, so assuming no existing layout tests break, r=me!

mitz

Comment 23 2008-09-08 17:36:07 PDT

Committed <http://trac.webkit.org/changeset/36276>.

Note You need to log in before you can comment on or make changes to this bug.