When WebKit encounters an HTML entity that represents a codepoint in the UTF-16 surrogate range (U+D800 - U+DFFF), it interprets that as a single UTF-16 code unit. This means a pair of these entities will be treated the same way as a single entity for a high Unicode codepoint (e.g. the entity for 𝍧, U+1D367, is interpreted the same as the entities for U+D834 and U+DF67 in sequence). This in and of itself is kinda strange, but not necessarily incorrect. What is incorrect is WebKit's behavior when only a single half of the surrogate pair is present (a lone surrogate entity). In this scenario, WebKit will stop rendering text on that line starting with the codepoint until a linebreak.
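For reference, the surrogate arithmetic behind that example can be sketched in Python. This only illustrates the UTF-16 encoding rules; the function names are made up here and are not WebKit code.

```python
# Illustrative helpers (hypothetical names, not WebKit API).

def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D367)
print(hex(high), hex(low))  # 0xd834 0xdf67
```

A lone half such as U+D834 has no partner to combine with, which is why `from_surrogate_pair` rejects it rather than guessing.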
I don't know if there's any official spec on how such entities should be treated, but my own preference would be to treat such an entity the same as an unknown named entity and strip it from the rendered text entirely.
Created attachment 25034
Testcase for the bug
I just attached a testcase. Irritatingly, it's using a different layout than I saw in TextMate's preview (not sure why), so it doesn't read the same and you only lose a single word of the rendered text. If you download it locally and play with the source, it should be more apparent what's going on.
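> the entity for 𝍧, U+1D367, is interpreted the same as the entities for U+D834 and U+DF67 in sequence
If the entity pair for U+D834 and U+DF67 is treated as 𝍧, it's also a bug.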
Paired or not, NCRs with surrogate codepoints are invalid and perhaps should be converted to U+FFDF. I don't think it's a good idea to skip them as if there's nothing.
Perhaps you mean U+FEFF? U+FFDF is a reserved codepoint, U+FEFF is the zero-width no-break space.
(In reply to comment #4)
> If the entity pair for U+D834 and U+DF67 is treated as 𝍧, it's also a bug.
This was done to match Firefox in bug 6446. I see that Firefox has changed now; we should check what IE does. But let's not discuss it here - this bug is about disappearing text.
(In reply to comment #5)
> Perhaps you mean U+FEFF? U+FFDF is a reserved codepoint
This was supposed to be U+FFFD REPLACEMENT CHARACTER.
(In reply to comment #6)
> (In reply to comment #4)
> > If the entity pair for U+D834 and U+DF67 is treated as 𝍧, it's also a bug.
> This was done to match Firefox in bug 6446. I see that Firefox has changed now;
> we should check what IE does. But let's not discuss it here - this bug is about
> disappearing text.
Ok. I've just filed bug 22210 about NCRs with surrogate code points.
> (In reply to comment #5)
> > Perhaps you mean U+FEFF? U+FFDF is a reserved codepoint
> This was supposed to be U+FFFD REPLACEMENT CHARACTER.
Oops, sorry. Thanks for correcting it; that's what I meant.
Fixing bug 22210 (reverting the patch for bug 6446) 'takes care of' this issue as well.
Created attachment 25476
test in UTF-16LE with a lone surrogate codepoint
The file is in UTF-16LE with BOM at the beginning. Instead of an invalid NCR for a surrogate codepoint, this file has an unpaired surrogate code point, '0xD835' in UTF-16 ( 0x35 0xD8 ). What follows after that becomes invisible.
Even if we take care of the original issue by not allowing NCRs for surrogate codepoints, this issue won't be fixed. And, it should be fixed.
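A rough way to confirm what such a file contains is to scan its UTF-16 code units for surrogates that have no partner. The sketch below does that in Python; the helper names and the sample bytes are made up for illustration (the bytes mimic the testcase, they are not copied from it).

```python
import struct


def utf16le_code_units(data):
    """Return the 16-bit code units of UTF-16LE data, skipping a leading BOM."""
    if data[:2] == b"\xff\xfe":
        data = data[2:]
    if len(data) % 2:
        raise ValueError("odd number of bytes in UTF-16 data")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def unpaired_surrogate_indexes(units):
    """Return the indexes of code units that are surrogates without a partner."""
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate is fine only if a low surrogate follows it.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            bad.append(i)
        i += 1
    return bad


# Assumed sample: BOM, "A", lone 0xD835 (bytes 0x35 0xD8), "B".
sample = b"\xff\xfe" + "A".encode("utf-16-le") + b"\x35\xd8" + "B".encode("utf-16-le")
print(unpaired_surrogate_indexes(utf16le_code_units(sample)))  # [1]
```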
WidthIterator::advance() bails out on an unpaired surrogate. Is it the behavior of that code (and of the complex text code paths) that needs to change, or should such encoding errors be fixed at a higher level?
There may be multiple definitions of correctness in this case, but Firefox 3 draws a custom picture for an unpaired surrogate (at least, I do not know where else this "D835" picture could come from). We could probably just draw a glyph for an unpaired surrogate from the LastResort font.
Firefox synthesizes a glyph of a rectangle with a 4-digit or 6-digit hexadecimal codepoint in it for a character it cannot find a font for. (Up to Firefox 2, only the Linux version did that, but beginning with Firefox 3, the same is done on Windows and Mac OS X.)
However, for an unpaired surrogate codepoint, it must NOT do that. I'll alert them to this issue. Instead, they should show U+FFFD in its place (or another representation of an error in the input). Note that it does replace the UTF-8 representation of an unpaired surrogate codepoint with U+FFFD. The same must be done for UTF-16.
I think the encoding error should be fixed before reaching the text measuring/drawing code.
(In reply to comment #12)
> I think the encoding error should be fixed before reaching the
> text measuring/drawing code.
In what way should it be fixed? If we replace it with U+FFFD, then we won't be able to use an unpaired surrogate glyph from LastResort, which makes the most sense here.
Unicode 5.1 section 3.2 ( http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf ) has the following conformance requirement:
C10 When a process interprets a code unit sequence which purports to be in a Unicode character
encoding form, it shall treat ill-formed code unit sequences as an error condition
and shall not interpret such sequences as characters.
• For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed
by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂
is ill-formed and must never be generated. When faced with this ill-formed
code unit sequence while transforming or interpreting text, a conformant process
must treat the first code unit 110xxxxx₂ as an illegally terminated code unit
sequence—for example, by signaling an error, filtering the code unit out, or
representing the code unit with a marker such as U+FFFD replacement character.
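The three reactions C10 lists for that UTF-8 example (signal an error, filter the code unit out, substitute a marker) happen to map onto the standard error handlers of Python's UTF-8 decoder. This is just an illustration of the clause, not a statement about how WebKit's decoders work; the variable and function names are mine.

```python
# 110xxxxx followed by 0xxxxxxx: an illegally terminated two-byte sequence.
ill_formed = bytes([0b11000011, 0b01000001])


def decode_signaling_error(data):
    """Option 1: signal an error (here, reported as None)."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


# Option 2: filter the offending code unit out.
filtered = ill_formed.decode("utf-8", errors="ignore")

# Option 3: represent the offending code unit with U+FFFD.
marked = ill_formed.decode("utf-8", errors="replace")

print(decode_signaling_error(ill_formed))  # None
print(repr(filtered))                      # 'A'
print(repr(marked))                        # '\ufffdA'
```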
Section 3.9 (Unicode Encoding Forms) has the following (2nd bullet point in D93):
Encoding Form Conversion
D93 Encoding form conversion: A conversion defined directly between the code unit
sequences of one Unicode encoding form and the code unit sequences of another
Unicode encoding form
A conformant encoding form conversion will treat any ill-formed code unit
sequence as an error condition. (See conformance clause C10.) This guarantees
that it will neither interpret nor emit an ill-formed code unit sequence. Any
implementation of encoding form conversion must take this requirement into
account, because an encoding form conversion implicitly involves a verification
that the Unicode strings being converted do, in fact, contain well-formed code unit sequences.
I'm not quoting D91 (defining UTF-16 and what's ill-formed in UTF-16) because it's obvious that an isolated surrogate codepoint is ill-formed.
BTW, the corresponding Firefox bug is http://bugzilla.mozilla.org/show_bug.cgi?id=317216
Using the last resort glyph for an isolated surrogate code point can arguably be considered a way of signaling an error, but it's not just rendering that is at stake. Other parts of WebKit need to deal with them. By replacing isolated surrogate code points with U+FFFD at the earliest stage, we can spare them from having to do that check themselves. IMHO, it's always a good idea to validate what's coming from an external source before doing anything.
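What I have in mind is roughly the following, sketched in Python rather than WebKit C++. `sanitize_utf16` is a hypothetical name and the input is assumed to be a list of 16-bit code units coming from a decoder; well-formed pairs are combined and any isolated surrogate becomes U+FFFD.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def sanitize_utf16(units):
    """Turn UTF-16 code units into text, replacing isolated surrogates."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if (0xD800 <= unit <= 0xDBFF
                and nxt is not None and 0xDC00 <= nxt <= 0xDFFF):
            # Well-formed pair: combine into one supplementary code point.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
            continue
        if 0xD800 <= unit <= 0xDFFF:
            # Isolated surrogate: ill-formed, so substitute the marker.
            code_points.append(REPLACEMENT)
        else:
            code_points.append(unit)
        i += 1
    return "".join(chr(cp) for cp in code_points)


print(repr(sanitize_utf16([0x41, 0xD835, 0x42])))  # 'A\ufffdB'
```

Everything downstream of such a step would only ever see well-formed text.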
Rendering code has to deal with this anyway, because an unpaired surrogate can be inserted into the string via DOM APIs. And I doubt that it makes practical sense to require DOM APIs to convert unpaired surrogates into U+FFFD - e.g. JS code may insert text as it comes from some streaming source, and the surrogate pair may be broken between two chunks of data.
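To make the chunk-boundary case concrete, here is a small Python sketch of what a streaming consumer would have to do to avoid handing over half a pair: hold back a trailing high surrogate until the next chunk arrives. `ChunkJoiner` is a hypothetical helper working on lists of UTF-16 code units; nothing like it is implied to exist in WebKit or the DOM.

```python
class ChunkJoiner:
    """Re-chunk UTF-16 code units so a surrogate pair is never split."""

    def __init__(self):
        self._pending = None  # high surrogate held back from the last chunk

    def feed(self, units):
        """Accept one chunk and return the code units that are safe to emit."""
        units = list(units)
        if self._pending is not None:
            units.insert(0, self._pending)
            self._pending = None
        if units and 0xD800 <= units[-1] <= 0xDBFF:
            # The partner may arrive in the next chunk, so do not emit yet.
            self._pending = units.pop()
        return units

    def finish(self):
        """Return whatever is still held back once the stream has ended."""
        rest = [] if self._pending is None else [self._pending]
        self._pending = None
        return rest


joiner = ChunkJoiner()
first = joiner.feed([0x41, 0xD834])    # pair for U+1D367 split across chunks
second = joiner.feed([0xDF67, 0x42])
print([hex(u) for u in first])   # ['0x41']
print([hex(u) for u in second])  # ['0xd834', '0xdf67', '0x42']
```

Code that inserts each chunk as it arrives does none of this, so eagerly converting a trailing high surrogate to U+FFFD at the API boundary would destroy a pair that is completed by the next chunk.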
I can no longer reproduce this with Safari 5.1.7 on Lion.
(In reply to comment #16)
> I can no longer reproduce this with Safari 5.1.7 on Lion.
Did you try with attachment 25476? I am not sure what the expected behavior is, but with TOT some text appears to be missing.
I only tried the first attached test before, and can now see the problem with the second one.