Bug 20237

Summary: Zero width and space characters are displayed incorrectly if not contained in a fallback font.
Product: WebKit Reporter: Brett Wilson (Google) <brettw>
Component: PlatformAssignee: mitz
Status: RESOLVED FIXED    
Severity: Normal CC: ap, dglazkov, dimitri, hyatt, jshin, mitz, nickshanks
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: Windows XP   
Attachments:
Description Flags
Test case
none
Patch
mitz: review-
Updated Patch
sam: review-
Updated: ZWS fix only
none
With just the 0xFFFC check mitz: review+

Description Brett Wilson (Google) 2008-07-30 17:23:55 PDT
If I have a character like U+200B (non-breaking space) in a document, WebCore will go ask the font system for it. Some fonts include a "glyph" for this character with the correct rendering (nothing) and advance (space). I'm pretty sure the fallback font shipped with Mac/Safari does this and all is well. The same is true for Arial Unicode MS which ships with MS Office on Windows.

However, if these fonts are unavailable, this glyph will not be found in Arial or Times and will be mapped to the 0 glyph (since it doesn't exist). This glyph is a box in most fonts. You want this behavior for most nonexistant glyphs since if you don't have a Chinese font installed, you want to see boxes rather than nothing if you visit a Chinese site so you know you're missing something. For characters that should be treated as space or zero width space, this is not the case.

The same happens for U+200E (LTR marker) and the like. These should be rendered as zero width spaces. When we get mapped to the 0 glyph, it turns into a box.

I think these two classes of characters need to be special cased to work without a pan-Unicode font that includes these characters.
Comment 1 Brett Wilson (Google) 2008-07-30 17:24:52 PDT
Created attachment 22570 [details]
Test case

This patch shows a ZWSP and LTR characters. If you don't have Apple's fallback font or Arial Unicode MS on Windows, then these will be rendered as squares.
Comment 2 Brett Wilson (Google) 2008-07-31 16:56:27 PDT
Created attachment 22591 [details]
Patch

This patch makes boxes for directional markers and nonbreaking spaces disappear when there is no fallback font that supports them. It hardcodes that the width of "zero width space" characters is 0. Platforms will also have to hardcode that the glyph for such characters is the space glyph. I've added this to the Windows code.

Mac doesn't seem to have this problem because the fonts are less crappy. If Mac wanted to implement it, it would be best in wkGetGlyphVectorFirstRecord which looks like isn't public.

I added a parameter to the glyph width getter so it gets the original character. I also use this for an optimization for Chinese characters which I hope to submit soon; it's very helpful for the width computation code to actually know what character we're talking about.

The changes in the Font.h file are to make the zero width space (and a few related chars) actually return true in the treatAsZeroWidthSpace function().
Comment 3 Eric Seidel (no email) 2008-08-01 15:33:30 PDT
Comment on attachment 22591 [details]
Patch

Hyatt says that mitz is the man to review this.
Comment 4 mitz 2008-08-04 23:05:01 PDT
(In reply to comment #2)
> I added a parameter to the glyph width getter so it gets the original
> character. I also use this for an optimization for Chinese characters which I
> hope to submit soon; it's very helpful for the width computation code to
> actually know what character we're talking about.

I am not too happy with the extra work in widthForGlyph, and not particularly happy with breaking the separation between character space and glyph space (what are we going to pass down to that function when we have ligatures and other features that break the 1:1 character:glyph ratio?), and I can think of a different way to address this bug, so I would like to know more about your planned optimization for Chinese.
Comment 5 Brett Wilson (Google) 2008-08-06 07:42:02 PDT
With our CJK optimization, it looks like this:

SimpleFontData::widthForGlyph(UChar32 c, Glyph glyph) const
{
    bool is_CJK = IsCJKCodePoint(c);
    float width = is_CJK ? m_cjkGlyphWidth : m_glyphToWidthMap.widthForGlyph(glyph);

#ifndef NDEBUG
    // Test our optimization that assuming all CGK glyphs have the same width
    if (is_CJK) {
        const float actual_width = platformWidthForGlyph(glyph);
        ASSERT((cGlyphWidthUnknown == width) || (actual_width == width));
    }
#endif

    if (c > ' ' && Font::treatAsZeroWidthSpace(c))
        return 0.0f;

    if (width != cGlyphWidthUnknown)
        return width;

    width = platformWidthForGlyph(glyph);

    if (is_CJK) {
        m_cjkGlyphWidth = width;
    } else {
        m_glyphToWidthMap.setWidthForGlyph(glyph, width);
    }
    
    return width;
}

So basically, this caches the widths of the first CJK character seen in a font, and uses that for all subsequent ones. If I recall correctly, this, along with making the widths 16-bit fixed point, saves us ~150MB for a page cycler run with a lot of different CJK pages in it.
Comment 6 Jungshik Shin 2008-08-06 15:53:09 PDT
There are a whole bunch of characters to fall into 'zero-width' category. 
It might be better to use | u_hasBinaryProperty(c, UCHAR_DEFAULT_IGNORABLE_CODE_POINT) |  ( http://www.icu-project.org/apiref/icu4c/uchar_8h.html#25c5c820d4141e4099acc15ca83572a5 ).

Firefox does not use the set (as it is) but defines its own (derived from the default ignorable). 

Some characters with default ignorable property cannot be ignored but as long as they're in 'simple script' code path, I think it's ok. 


Comment 7 Brett Wilson (Google) 2008-08-07 17:01:15 PDT
mitz: Can you comment on the overall approach. I'll try to come up with better patch using Jungshik's suggestions, but I would like to know how best to integrate the new code.
Comment 8 mitz 2008-08-13 09:46:58 PDT
I think it would be best to keep character-to-glyph mapping separate from glyph-to-width mapping, and I think the goals of your changes could be achieved by creating separate FontData objects, such as one that has a zero-width invisible glyph (the underlying platform font can be any font with a space glyph; the width map changed to give it zero width) which could be used for characters that should render as zero-width space; and such as one that shares the same underlying platform font with another FontData, has a 'monospace' bit set, has no width map, which could be used for all CJK characters in the font.
Comment 9 mitz 2008-08-19 22:34:50 PDT
Comment on attachment 22591 [details]
Patch

r- to remove this from the queue while you consider my comments :)
Comment 10 Dimitri Glazkov 2008-08-26 11:24:20 PDT
... trying to figure out how this will work. So, in your view, these FontData instances be would be sub-classes of SimpleFontData (say, ZeroWidthFontData and CJKFontData), each offering a different widthForGlyph implementation?

Then, in GlyphPage::fill, we supply either an instance of SimpleFontData (for most chars), ZeroWidthFontData (for ZWS and alike), or CJKFontData (for CJK chars).

WDYT?
Comment 11 mitz 2008-09-02 16:31:08 PDT
I think subclassing and making widthForGlyph() virtual might be bad for performance, but if you have a patch that does that I can try and measure it.
Comment 12 Dimitri Glazkov (Google) 2008-09-04 12:43:47 PDT
Created attachment 23174 [details]
Updated Patch

As discussed on #webkit w/mitz, here's an updated patch. There are opportunities for optimization, but I thought I'd first get the basics right.
Comment 13 Sam Weinig 2008-09-05 10:06:04 PDT
Comment on attachment 23174 [details]
Updated Patch

+// static
Why is this here?

I think this patch tries to do too many things at once.  This bug is related to the display of zero width characters, and thus, the patch should try and address only that issue.  Please file another bug about the potential CJK optimization.
Comment 14 Eric Seidel (no email) 2008-09-05 10:10:52 PDT
(In reply to comment #13)
> (From update of attachment 23174 [details] [edit])
> +// static
> Why is this here?

it seems to be part of the Google C++ Style guidelines (thus what most Google engineers are familiar with).  I don't see it listed in the external c++ style guidelines though.  It will take a while for many of us to learn the differences between WebKit/Google style. :)
Comment 15 Dimitri Glazkov (Google) 2008-09-05 11:31:52 PDT
(In reply to comment #14)
> (In reply to comment #13)
> > (From update of attachment 23174 [details] [edit] [edit])
> > +// static
> > Why is this here?
> 
> it seems to be part of the Google C++ Style guidelines (thus what most Google
> engineers are familiar with).  I don't see it listed in the external c++ style
> guidelines though.  It will take a while for many of us to learn the
> differences between WebKit/Google style. :)

Yes, I've been kinda sorta trying to merge both styles. I will split up CJK opt and resubmit.
Comment 16 Dimitri Glazkov (Google) 2008-09-05 14:18:46 PDT
Created attachment 23200 [details]
Updated: ZWS fix only

Removed CJK optimization.
Comment 17 mitz 2008-09-06 14:42:20 PDT
I am confused. From the initial description of the bug it sounded as if this is only a problem in the system fallback case, so I expected the substitution to happen only in the end of Font::glyphDataForCharacter(). What am I missing?
Comment 18 mitz 2008-09-07 00:02:02 PDT
The more I look at the bug description, and the more testing I do in Safari on Windows, the less I understand how this bug works, even in the system fallback case.

The only exception is U+FFFC (OBJECT REPLACEMENT CHARACTER) which is not treatAsZeroWidthSpace(), even though it is ZWS-ed out in GlyphPageTreeNode::initializePage. In all other cases, I expect a missing glyph (square) to be rendered only if no font on the system has a glyph for ZWS. As far as I can tell, Times New Roman, plain ol' Arial, and other fonts on Windows have correct glyphs for ZWS.
Comment 19 Dimitri Glazkov (Google) 2008-09-08 10:35:52 PDT
Well, I dug some more, using a pristine (font-wise) XP install and here's what happens for the ZWS chars for a standard Win font, using Times New Roman as an example.

wkGetGlyphs returns 0 for the 0x200B char, so the glyph doesn't exist in the font, but FontCache::getFontDataForCharacters is able to scare up the glyph using font-linking (the glyph is borrowed from Lucida Sans Unicode).

In the case of 0xFFFC, the same path is followed, except no glyph is found using font-linking, so we fall back into Uniscribe, drawing the string into a metafile. Unlucky for us, this succeeds with (woot!) "Times New Roman", which brings back the original font, which in turn draws a square in place of the character.

So, the proposed patch, IMHO (I am not a fonts expert, please correct me if I am wrong) eliminates the need to dip into the IMLang linking business and avoids creating metafile objects for drawing the ORC char.

I don't have a strong opinion about moving the SimpleFontData substitution logic to Font::glyphDataForCharacter. My thinking was to initialize the FontCache with the substituted objects to begin with, so that there's no additional checking while drawing each character.
Comment 20 mitz 2008-09-08 11:44:06 PDT
(In reply to comment #19)
> In the case of 0xFFFC, the same path is followed, except no glyph is found
> using font-linking [...]

But we should never be asking for a glyph for U+FFFC. We should -- and we currently don't -- ask for a glyph for zero-width space instead. The reason we don't is that even though GlyphPageTreeNode::initializePage() is aware of this substitution, Font::treatAsZeroWidthSpace() is not, and the latter is what's used in the "system fallback" case near the end of glyphDataForCharacter(). Just adding U+FFFC to treatAsZeroWidthSpace() OR would make the problem go away, I think.
Comment 21 Dimitri Glazkov (Google) 2008-09-08 15:25:59 PDT
Created attachment 23276 [details]
With just the 0xFFFC check

Shedding another pound of code after healthy mitzpettel's workout. This patch leaves finding ZWS glyphs up to IMLang font-linking, instead of using a separate FontData instance.

So, the only change is making Font:treatAsZeroWidthSpace(c) react to 0xFFFC properly.
Comment 22 mitz 2008-09-08 16:03:57 PDT
Comment on attachment 23276 [details]
With just the 0xFFFC check

+        Adds an extra check for Object Replacement Character (0xFFFF) to address

Typo: should say FFFC.

Unfortunately, a layout test does not seem possible because our test configurations have ZWS mapped to a glyph even in non-fallback fonts, so assuming no existing layout tests break, r=me!
Comment 23 mitz 2008-09-08 17:36:07 PDT
Committed <http://trac.webkit.org/changeset/36276>.