Bug 22166

Summary: HTML entities for surrogate pair codepoints cause rendering issues
Product: WebKit Reporter: Kevin Ballard <kevin>
Component: WebKit Misc.Assignee: Nobody <webkit-unassigned>
Status: NEW ---    
Severity: Normal CC: ap, jshin, mathias, mitz
Priority: P2 Keywords: HasReduction, InRadar
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Attachments:
Description Flags
Testcase for the bug
none
test in UTF-16LE with a lone surrogate codepoint none

Description Kevin Ballard 2008-11-10 16:33:17 PST
When WebKit encounters an HTML entity that represents a codepoint that belongs to the UTF-16 surrogate pair range (U+D800 - U+DFFF) it interprets that as a single UTF-16 codepoint. This means a pair of these entities will be treated the same way as a single entity for a high unicode codepoint (e.g. &#x1D367; is interpreted the same as &#xD834;&#xDF67;). This in of itself is kinda strange, but not necessarily incorrect. What is incorrect is WebKit's behavior when only a single half of the surrogate pair is present (such as &#xD834;). In this scenario, WebKit will stop rendering text on that line starting with the codepoint until a linebreak.

I don't know if there's any official spec on how such entities should be treated, but my own preference would be to treat such an entity the same as an unknown named entity and strip it from the rendered text entirely.
Comment 1 Kevin Ballard 2008-11-10 16:39:42 PST
Created attachment 25034 [details]
Testcase for the bug
Comment 2 Kevin Ballard 2008-11-10 16:41:01 PST
I just attached a testcase. Irritatingly, it's using different layout than I saw in TextMate's preview, not sure why, so it doesn't read the same and you only lose a single word of the rendered text, but if you download it locally and play with the source it should be more apparent what's going on.
Comment 3 Mark Rowe (bdash) 2008-11-11 02:34:45 PST
<rdar://problem/6359804>
Comment 4 Jungshik Shin 2008-11-11 14:14:37 PST
> &#x1D367; is interpreted the same as &#xD834;&#xDF67;

If "&#xD834;&#xDF67;" is treated as &#x1D367;,  it's also a bug.

Paired or not, NCRs with surrogate codepoints are invalid and perhaps should be converted to U+FFDF. I don't think it's a good idea to skip them as if there's nothing. 

Comment 5 Kevin Ballard 2008-11-11 14:18:55 PST
Perhaps you mean U+FEFF? U+FFDF is a reserved codepoint, U+FEFF is the zero-width no-break space.
Comment 6 Alexey Proskuryakov 2008-11-11 23:16:07 PST
(In reply to comment #4)
> If "&#xD834;&#xDF67;" is treated as &#x1D367;,  it's also a bug.

This was done to match Firefox in bug 6446. I see that Firefox has changed now; we should check what IE does. But let's not discuss it here - this bug is about disappearing text.

(In reply to comment #5)
> Perhaps you mean U+FEFF? U+FFDF is a reserved codepoint

This was supposed to be U+FFFD REPLACEMENT CHARACTER.
Comment 7 Jungshik Shin 2008-11-12 11:34:10 PST
(In reply to comment #6)
> (In reply to comment #4)
> > If "&#xD834;&#xDF67;" is treated as &#x1D367;,  it's also a bug.
> 
> This was done to match Firefox in bug 6446. I see that Firefox has changed now;
> we should check what IE does. But let's not discuss it here - this bug is about
> disappearing text.

Ok. I've just filed bug 22210 about NCRs with surrogate code points.
 
> (In reply to comment #5)
> > Perhaps you mean U+FEFF? U+FFDF is a reserved codepoint
> 
> This was supposed to be U+FFFD REPLACEMENT CHARACTER.

Oops. sorry for correcting it. that's what I meant. 
 

Comment 8 Jungshik Shin 2008-11-24 22:32:08 PST
Fixing bug 22210 (reverting the patch for bug 6446) 'takes care of' this issue as well. 
Comment 9 Jungshik Shin 2008-11-24 23:21:59 PST
Created attachment 25476 [details]
test in UTF-16LE with a lone surrogate codepoint 

The file is in UTF-16LE with BOM at the beginning. Instead of an invalid NCR for a surrogate codepoint, this file has an unpaired surrogate code point, '0xD835' in UTF-16 ( 0x35 0xD8 ).  What follows after that becomes invisible. 

Even if we take care of the original issue by not allowing NCRs for surrogate codepoints, this issue won't be fixed. And, it should be fixed.
Comment 10 mitz 2008-11-25 00:20:47 PST
WidthIterator::advance() bails out on an unpaired surrogate. Is it its (and the complex text code paths) behavior that needs to be changed or do such encoding errors be fixed at a higher level?
Comment 11 Alexey Proskuryakov 2008-11-25 01:12:01 PST
There may be multiple definitions of correctness in this case, but Firefox 3 draws a custom picture for an unpaired surrogate (at least, I do not know where else this "D835" picture could come from). We could probably just draw a glyph for unpaired surrogate from LastResort font.
Comment 12 Jungshik Shin 2008-11-25 09:49:49 PST
Firefox synthesizes a glyph of a rectangle with a 4-digit or 6-digit hexadecimal codepoint in it for a character it cannot find a font for. (Up to Firefox2, only Linux version did that, but beginning with Firefox3, the same is done on Windows and Mac OS X).

 However, for an unpaired surrogate codepoint, it must NOT do that. I'll alert to them about this issue. Instead, they should show U+FFFD in its place (or another representation of an error in the input). Note that it does replace  the UTF-8 representation of an unpaired surrogate codepoint by U+FFFD. The same must be done for UTF-16. 

I think the encoding error should be fixed before reaching the measuring/drawing text. 

Comment 13 Alexey Proskuryakov 2008-11-25 10:06:35 PST
(In reply to comment #12)
> I think the encoding error should be fixed before reaching the
> measuring/drawing text. 

In what way should it be fixed? If we replace it with U+FFFD, then we won't be able to use an unpaired surrogate glyph from LastResort, which makes the most sense here.
Comment 14 Jungshik Shin 2008-11-25 11:48:16 PST
Unicode 5.1 section 3.2 ( http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf ) has the following conformance requirement:

C10 When a process interprets a code unit sequence which purports to be in a Unicode character
encoding form, it shall treat ill-formed code unit sequences as an error condition
and shall not interpret such sequences as characters.
• For example, in UTF-8 every code unit of the form 110xxxx2 must be followed
by a code unit of the form 10xxxxxx2. A sequence such as 110xxxxx2 0xxxxxxx2
is ill-formed and must never be generated. When faced with this ill-formed
code unit sequence while transforming or interpreting text, a conformant process
must treat the first code unit 110xxxxx2 as an illegally terminated code unit
sequence—for example, by signaling an error, filtering the code unit out, or
representing the code unit with a marker such as U+FFFD replacement
character.


Section 3.9 (Unicode Encoding Forms) has the following (2nd bullet point in D93):
Encoding Form Conversion
D93 Encoding form conversion: A conversion defined directly between the code unit
sequences of one Unicode encoding form and the code unit sequences of another
Unicode encoding form

A conformant encoding form conversion will treat any ill-formed code unit
sequence as an error condition. (See conformance clause C10.) This guarantees
that it will neither interpret nor emit an ill-formed code unit sequence. Any
implementation of encoding form conversion must take this requirement into
account, because an encoding form conversion implicitly involves a verification
that the Unicode strings being converted do, in fact, contain well-formed code
unit sequences.

--------------
I'm not quoting D91 (defining UTF-16 and what's ill-formed in UTF-16) because it's obvious that an isolated surrogate codepoint is ill-formed. 

BTW, the corresponding Firefox bug is http://bugzilla.mozilla.org/show_bug.cgi?id=317216

Using the last resort glyph for an isolated surrogate code point is arguably considered as a way of signaling error, but it's not just rendering that is at stake. Other parts in webkit need to deal with them. By replacing isolated surrogate code points with U+FFFD at the earliest stage, we can spare them from having to do that check themselves. IMHO, it's always a good idea to validate what's coming from an external source before doing anything. 


  
Comment 15 Alexey Proskuryakov 2008-11-25 12:28:18 PST
Rendering code has to deal with this anyway, because an unpaired surrogate can be inserted into the string via DOM APIs. And I doubt that it makes practical sense to require DOM APIs to convert unpaired surrogates into U+FFFD - e.g. JS code may insert text as it comes from some streaming source, and the surrogate pair may be broken between two chunks of data.
Comment 16 Alexey Proskuryakov 2012-05-29 10:50:06 PDT
I can no longer reproduce this with Safari 5.1.7 on Lion.
Comment 17 mitz 2012-05-29 11:24:47 PDT
(In reply to comment #16)
> I can no longer reproduce this with Safari 5.1.7 on Lion.

Did you try with attachment 25476 [details]? I am not sure what the expected behavior is, but with TOT some text appears to be missing.
Comment 18 Alexey Proskuryakov 2012-05-29 11:51:28 PDT
I only tried the first attached test before, and can now see the problem with the second one.