Bug 13136

Summary: Spurious glyphs in Google Israel and Gmail (all languages)
Product: WebKit Reporter: Jungshik Shin <jshin>
Component: Layout and RenderingAssignee: Dave Hyatt <hyatt>
Status: RESOLVED FIXED    
Severity: Critical CC: mitz
Priority: P1 Keywords: GoogleBug, InRadar
Version: 523.x (Safari 3)   
Hardware: Mac   
OS: OS X 10.4   
URL: http://google.co.il
Attachments:
Description Flags
Screenshot of Safari w/latest nightly WebKit
none
Screenshot of google.co.il
none
Whitespace collapses through LRM in Firefox
none
Prevent LRM/RLM from rendering on the simple code path
aroben: review+
Screenshot of google.co.il with the patch none

Description Jungshik Shin 2007-03-20 16:20:59 PDT
With 'Personal level indicators' turned on, gmail emits 'LRM' (U+200e) before the subject. When Arabic or Hebrew font is not installed on Mac OS X, 'LRM' and 'RLM' (U+200f) are rendered with a spurrious glyph.  

The same is true of http://google.co.il.  One can argue who uses Google in Hebrew without Hebrew font. However, in case of gmail, this is rather serious because a lot of Mac users would never bother to install fonts for scripts they don't use.  

Apparently, LRM and RLM are passed down to the low-level drawing routine instead of being filtered at a higher layout level.   When a font is present that understands LRM/RLM, this is not an issue. As I wrote above, that's not always the case. Therefore, it should be better to take care of them at a higher level. 

LRM and RLM should be removed in a string that goes down to a drawing routine. Before that, they should be treated as a character with a strong left and right directionality. 

There are other characters that need similar treatments. Most of (not all) characters belonging to 'Unicode default ignorable' need to be treated this way. 

In all cases, they should survive in selection. What about 'cursor/caret movement'? That's a little tough to say.  

I'd be willing to give it a shot. If anyone can give a pointer or two, that would be great.
Comment 1 mitz 2007-03-20 17:05:17 PDT
Jungshik, as far as I know none of the fonts Mac OS X ships with contains non-empty glyphs for LRM and RLM. Some fonts from Microsoft Windows and Microsoft Office X do include visible glyphs for those characters (apparently per Microsoft guidelines for font developers). Can you determine which font the glyphs you are seeing are coming from?

Rendering the glyphs for LRM and RLM in WebKit is consistent with the behavior of other Mac OS X text engines.
Comment 2 Don Gibson 2007-03-20 17:05:34 PDT
Created attachment 13729 [details]
Screenshot of Safari w/latest nightly WebKit

Here's a shot of this in the latest nightly on a Mac.  Take a close look at the start of the first two subject lines.
Comment 3 Don Gibson 2007-03-20 18:36:41 PDT
Created attachment 13730 [details]
Screenshot of google.co.il

In this shot of google.co.il, note the RLM glyphs around the second item from right at the bottom of the page.
Comment 4 mitz 2007-03-21 00:01:09 PDT
(In reply to comment #0)

> I'd be willing to give it a shot. If anyone can give a pointer or two, that
> would be great.

GlyphPageTreeNode::initializePage() replaces characters that should not render with zero-width spaces for the 'simple' code path. For the 'complex' code path, this feature isn't implemented yet (see comment in ATSULayoutParameters::initialize(), which is probably where you'd add it).

I guess the question here is whether you want to make it impossible to display those glyphs. Other browsers doing it is a strong argument for WebKit to do it too (despite the inconsistency with the rest of Mac OS X).

Zero-width joiner and zero-width non-joiner should be handled the same as LRM and RLM. What other characters are "default ignorable"?
Comment 5 Jungshik Shin 2007-03-21 13:12:18 PDT
(In reply to comment #4)
> (In reply to comment #0)
> 
> > I'd be willing to give it a shot. If anyone can give a pointer or two, that
> > would be great.
> 
> GlyphPageTreeNode::initializePage() replaces characters that should not render
> with zero-width spaces for the 'simple' code path. For the 'complex' code path,
> this feature isn't implemented yet (see comment in
> ATSULayoutParameters::initialize(), which is probably where you'd add it).

Thanks for the pointer. I was looking at WidthIterator::advance, but that place seems better.  

> I guess the question here is whether you want to make it impossible to display
> those glyphs. Other browsers doing it is a strong argument for WebKit to do it
> too (despite the inconsistency with the rest of Mac OS X).

IMHO, there are two groups of (invisible or near-invisible) characters in 'default ignorables' 
  
  a. characters that should always be rendered invisible (*) and don't interact with adjacent characters (e.g. 'U+2062 INVISIBLE TIMES'). they should be killed higher up and should not be passed to 'string drawing/measuring' routines to avoid what we're observing with LRM/RLM here.  (some fonts have visibly glyphs for them causing problems like this). Gecko has the list at :

http://lxr.mozilla.org/seamonkey/source/gfx/src/shared/ignorable.x-ccmap

LRM/RLM is not there because Gecko special-case them in its BiDi code at 

http://lxr.mozilla.org/seamonkey/source/layout/generic/nsTextTransformer.h#74

  b. characters that interact with adjacent characters to change the rendering result (e.g ZWNJ, ZWJ). If the rendering routine (API) and fonts can take care of them, that's good. If there isn't a font that covers them, they should be turned into nothingness (rather than a 'last resort' glyph) at the drawing stage. Gecko has the list at : 

http://lxr.mozilla.org/seamonkey/source/intl/unicharutil/src/ignorables_abjadpoints.x-ccmap



For characters in group (a), GlyphPageTreeNode::initializePage() seems to be a good place to kill them. 

For characters in group (b), we may want to 'kill them' right before the invocation of 'last resort' glyph in platform-specific files. 
 
> Zero-width joiner and zero-width non-joiner should be handled the same as LRM
> and RLM. 

I'm afraid it's not that simple. ZWJ and ZWNJ need to be treated in a context-sensitive manner because they interact with adjacent characters to change the rendering result (conjunct formation in Indic scripts and joining behavior in Arabic). In the code path for simple scripts, they might well be safely turned into nothingness, though. 


> What other characters are "default ignorable"?

Unicode database has the list (I'll give the URL later), but as outlined above, not all of them can be treated the same way. 
 

BTW, there might be cases we want to render 'the invisible' with visible glyphs ('code view' of a html editor, mathematical equation editor, etc). However, CSS does not provide any way to that effect at the moment. So that, I guess we don't have to worry about it at least for now. 
Comment 6 mitz 2007-03-21 13:32:30 PDT
(In reply to comment #5)
>  If there isn't a font that covers them, they should be
> turned into nothingness (rather than a 'last resort' glyph) at the drawing
> stage.

I'm not sure why you want to hide them only if there isn't a font that covers them. If you do that, and the user has installed one of the Office X fonts, and the page specifies that font (Arial is a classic example), those characters will render. Is that how it works in WinIE and Firefox?

Anyway, you're right that ATSULayoutParameters::initialize() is too early to hide them (whereas for the fast code path, initializePage() will do). I think it could be done in the overrideLayoutOperation() callback for both the (a) and (b) classes of characters. Is there a way to query if a character is in one of them using ICU?
Comment 7 Dave Hyatt 2007-03-21 14:15:36 PDT
(In reply to comment #5)
> (In reply to comment #4)
> > (In reply to comment #0)
> > 
>
> IMHO, there are two groups of (invisible or near-invisible) characters in
> 'default ignorables' 
> 
>   a. characters that should always be rendered invisible (*) and don't interact
> with adjacent characters (e.g. 'U+2062 INVISIBLE TIMES'). they should be killed
> higher up and should not be passed to 'string drawing/measuring' routines to
> avoid what we're observing with LRM/RLM here.  (some fonts have visibly glyphs
> for them causing problems like this). Gecko has the list at :
> 
> http://lxr.mozilla.org/seamonkey/source/gfx/src/shared/ignorable.x-ccmap
> 
> LRM/RLM is not there because Gecko special-case them in its BiDi code at 
> 
> http://lxr.mozilla.org/seamonkey/source/layout/generic/nsTextTransformer.h#74
> 

I don't agree with removing these characters before passing them into our Font drawing/rendering routines, since the only way to do so would be to actually break up the text into smaller runs.  In general we pass strings unaltered into drawing/rendering routines for performance reasons and then rely on the fixups to happen in the Font routines themselves.

The primary reason for doing this is performance.  In the fast code path we can associate any glyph we want to with a given character, so we can effectively cache the decision to avoid rendering anything for those glyphs.  This means we don't have to waste time comparing the characters themselves with some set of invisible characters over and over again, because our decision of what glyph to use is cached on future measurements that use that GlyphMap.

In the complex text code path we have no glyph map cache, so if there are characters that need to not render, we'll have to take care of it in overrideLayoutOperation on Mac.  You can see that tweaks are made to placement there.  It should be possible to do glyph replacements as needed there as well.
Comment 8 Jungshik Shin 2007-03-21 22:19:10 PDT
(In reply to comment #7)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > (In reply to comment #0)
> > > 
> >
> > IMHO, there are two groups of (invisible or near-invisible) characters in
> > 'default ignorables' 
> > 
> >   a. characters that should always be rendered invisible (*) and don't interact
> > with adjacent characters (e.g. 'U+2062 INVISIBLE TIMES'). they should be killed
> > higher up and should not be passed to 'string drawing/measuring' routines to
> > avoid what we're observing with LRM/RLM here.  (some fonts have visibly glyphs
> > for them causing problems like this).

> I don't agree with removing these characters before passing them into our Font
> drawing/rendering routines, since the only way to do so would be to actually
> break up the text into smaller runs.  

Thank you for the detailed explanation.

I'm afraid my use of the word 'kill' was confusing. What's suggested by mitzin comment #4 is what I want to do for a subset of invisible characters in the fast path. That way, those characters would be available for copy into clipboard and we wouldn't have to keep checking whether they're in the set   as you wrote. So, I guess we agree with each other.

As for white space characters with various widths (EN QUAD, EM QUAD, EN SPACE, THREE-PER-EM SPACE, etc : U+2000 - U+200A), what would be the best way to deal with them? 


> In general we pass strings unaltered into
> drawing/rendering routines for performance reasons and then rely on the fixups
> to happen in the Font routines themselves.

I guess that means nothing is done about  BiDi control characters (LRM, RLM, LRE, RLE, PDF. yes, they'd better not be used in html, but some documents use them)? In other words, do they have impact on the directionality of text layout in WebKit? Or, does it pass them down so that ATSUI (complex path) can take care of them on Mac OS X?    


In reply to comment #6

Thank you for the feedback. 

> I'm not sure why you want to hide them only if there isn't a font that covers
> them. 


> If you do that, and the user has installed one of the Office X fonts, and
> the page specifies that font (Arial is a classic example), those characters
> will render. Is that how it works in WinIE and Firefox?

My assumption was that if the page author specified those characters to be rendered by a specific fonts,  those characters in group (b) (e.g. Hebrew vowel signs) would be rendered well by the font. 

Even if that's not the case, a font claiming to have glyphs for them are likely to have "acceptable" (could be far-from-optimal as in Arial) glyphs.  Because 'truly invisible' characters in group (a) are turned to 'zero-width' glyph higher up in the code path, using these 'acceptable' (fallback) [1] glyphs for group b (perhaps, we need to give more thoughts to how to divide two groups) would make more sense than just using 'last resort' glyph (a box with a symbol representing a script, or a hollow box or question mark.) 


Perhaps, I'll try to solve the group A case first. Or, I may begin with an even simpler case (the original bug of LRM/RLM). 

[1] Examples of 'acceptable fallback' glyphs would include stand-alone glyphs for 'combining mark XXX'. 

 
Comment 9 Jungshik Shin 2007-03-21 22:30:27 PDT
http://www.w3.org/TR/unicode-xml/#Bidi

As for BiDi control characters (I mentioned in comment #8), it's recommended that they be ignored in a browser context. See section 3.3. of "Unicode in XML and other markup languages' at the above link. HTML4 has a different recommendation, but I guess we may ignore them. 

Comment 10 Dave Hyatt 2007-03-21 23:09:31 PDT
A good first step would be to attach simple HTML test cases to this bug that illustrate the various problems you've described.  

That would be very helpful, since those tests can eventually be made into layout tests, and it will help other ports (like Qt) ensure that their results are also compatible with any changes we make to the Mac complex text code path.

Because the complex text code path is platform-specific, lots of regression tests will help us capture these decisions and enable other platforms to know what the desired behavior is.
Comment 11 mitz 2007-03-22 00:11:34 PDT
(In reply to comment #8)
> I guess that means nothing is done about  BiDi control characters (LRM, RLM,
> LRE, RLE, PDF. yes, they'd better not be used in html, but some documents use
> them)? In other words, do they have impact on the directionality of text layout
> in WebKit?

Yes. They are honored by the bidi algorithm implementation in WebCore. LRE/RLE/LRO/RLO and PDF are not passed down to the font rendering routines.

> Or, does it pass them down so that ATSUI (complex path) can take
> care of them on Mac OS X?

No. All bidi processing is done at the WebCore level, and text gets broken down to unidirectional runs that get passed down to the rendering code.

> My assumption was that if the page author specified those characters to be
> rendered by a specific fonts,  those characters in group (b) (e.g. Hebrew vowel
> signs) would be rendered well by the font. 

I guess I need to know more about what each of those groups contain (I had no idea that Hebrew vowels were included). In the RLM example, the author specified Arial, a font that contains a visible RLM glyph, but did not expect it to render (since the author's experience with WinIE and Firefox is that RLM is hidden). If a WebKit user happens to have the same font (the Microsoft version of Arial), it will be selected and the RLM will be visible, contrary to the intention of the author. I think this should be avoided.

Similarly for ZWJ and ZWNJ, despite their effect on the rendering of neighboring characters, if they do not render visibly when the author specifies Arial on WinIE or Firefox, then WebKit should not render them.

> Even if that's not the case, a font claiming to have glyphs for them are likely
> to have "acceptable" (could be far-from-optimal as in Arial) glyphs.  Because
> 'truly invisible' characters in group (a) are turned to 'zero-width' glyph
> higher up in the code path, using these 'acceptable' (fallback) [1] glyphs for
> group b (perhaps, we need to give more thoughts to how to divide two groups)
> would make more sense than just using 'last resort' glyph (a box with a symbol
> representing a script, or a hollow box or question mark.) 

That is already how WebKit works: if a font has a certain character mapped to a glyph -- possibly an empty one -- then depending on the character, either that glyph is used, or the zero-width space glyph is used (as is currently done with the U+0000..U+001F range for example). A "last resort" glyph (or a glyph from a different font) is used only in cases where the original font does not have a mapping for the character. For example, in the screenshots, the version of Arial the user has contains a (non-empty) glyph for LRM, and that's what's used. If Arial did not map LRM to any glyph, font fallback would occur, going through the specified family list (followed by a fallback font suggested by the OS), and the (probably empty) LRM glyph from the first font that has it mapped would be used.
Comment 12 Jungshik Shin 2007-03-22 03:27:31 PDT
David and mitz,

Thanks a lot for your detailed comments.  I'll try to come up with a html file with test cases and attach it here.

As for group A and group B. My explanation was not very clear. 

Simply put, characters in group A should be rendered with zero-width glyph (no visible effect at all) *no matter what*. Therefore, we should block them higher up instead of passing them down. LRM and RLM belong to this group (especially considering that they don't have any more role once they're taken care of by  WebCore BiDi algorithm as you explained). So, your concern about Windows version of Arial having a bogus glyph can be addressed.

As for characters in group B, we expect fonts and drawing/measuring routines to do a 'reasonable' (not perfect) job. Even if they're not combined well with a base character or rendered with a little-off glyph, it's OK. 

In case of ZWNJ and ZWJ, I believe they should go down along the complex path (in most cases). My understanding of complex path  in WebKit (and ATSUI) is rather weak. 

> A "last resort" glyph (or a glyph from a different font) is used only in cases > where the original font does not have a  mapping for the character. 

I don't have  the source code at the moment and am writing from the memory. If I remember correctly and read the code correctly, there's some like this in complex code path:

 1. even a system fallback fails
 2. do something as the last resort

What I'm saying is for group B, we should not take step 2 and just ignore them. 

If no other font covers a given character, on Mac OS X, I believe 'last resort' font comes into play. That 'font' (is it a real font?) has glyphs for script blocks. For characters like Hebrew vowel signs, using that glyph representing Hebrew script is not so good as just ignoring them. 
Comment 13 Dave Hyatt 2007-03-22 04:11:01 PDT
Created attachment 13760 [details]
Whitespace collapses through LRM in Firefox

Here's an example that shows that Firefox even collapses whitespace through LRM.  While we don't render them, we don't collapse whitespace through them.

Doing so seems unnecessary I suppose, but I thought I'd show this just to demonstrate how much effort Firefox is going through to continue rendering as though they don't exist. :)
Comment 14 Dave Hyatt 2007-03-22 04:12:42 PDT
Anyway if the Firefox behavior is desired (and it's not clear how much we need to match them), then we would have to deal with these characters in bidi.cpp rather than in the font routines, so I stand corrected.
Comment 15 Jungshik Shin 2007-03-22 10:18:15 PDT
Collapsing whitespaces through LRMs does not seem to be required, but it seems reasonable to do so if we can do it without sacrificing perf. 

http://www.w3.org/TR/html401/struct/text.html#didx-white_space-1

BTW, I filed bug 13159 about 'Form Feed' (U+000C). 
Comment 16 Alice Liu 2007-03-23 01:35:10 PDT
<rdar://problem/5048219>
Comment 17 Dave Hyatt 2007-03-23 18:29:29 PDT
(In reply to comment #14)
> Anyway if the Firefox behavior is desired (and it's not clear how much we need
> to match them), then we would have to deal with these characters in bidi.cpp
> rather than in the font routines, so I stand corrected.
> 

I thought about this some more, and I believe my original recommendation was correct.  Because floatWidth measures these glyphs (it runs in both calcMinMaxWidth and findNextLineBreak), we need to make sure these glyphs have zero width in the low-level Font rendering routines.  Otherwise widths will be computed inaccurately.

This does not preclude us from adding code to bidi.cpp to avoid having them in the rendered runs, but they also can't be part of measurements either.

Taking this bug since I own the corresponding Radar.
Comment 18 Dave Hyatt 2007-03-23 18:29:53 PDT
Elevating to P1 priority.
Comment 19 Dave Hyatt 2007-03-23 23:09:58 PDT
Created attachment 13792 [details]
Prevent LRM/RLM from rendering on the simple code path
Comment 20 Adam Roben (:aroben) 2007-03-23 23:23:11 PDT
Comment on attachment 13792 [details]
Prevent LRM/RLM from rendering on the simple code path

r=me
Comment 21 Dave Hyatt 2007-03-23 23:42:38 PDT
Simple code path patched.  Open a new bug to cover the complex code path if you can show cases of misrendering there.
Comment 22 mitz 2007-03-24 01:17:04 PDT
(In reply to comment #21)
> Simple code path patched.  Open a new bug to cover the complex code path if you
> can show cases of misrendering there.
> 

Bug 13177.
Comment 23 mitz 2007-03-24 01:39:38 PDT
Created attachment 13794 [details]
Screenshot of google.co.il with the patch

This is what google.co.il renders like with the patch. There is extra space between the links at the bottom because in Arial the zero-width space glyph has non-zero width. I think this should be addressed, either by forcing ZWS to have zero width or by using a different replacement glyph that does have zero width.
Comment 24 mitz 2007-03-24 03:16:00 PDT
(In reply to comment #23)
> in Arial the zero-width space glyph has
> non-zero width.

The problem is not Arial but Helvetica (which is the fallback font for ZWS with the particular version of Arial I have). Filed bug 13178.