Summary: | HTML entities for surrogate pair codepoints cause rendering issues | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Kevin Ballard <kevin> | ||||||
Component: | WebKit Misc. | Assignee: | Nobody <webkit-unassigned> | ||||||
Status: | RESOLVED CONFIGURATION CHANGED | ||||||||
Severity: | Normal | CC: | ap, jshin, mathias, mitz | ||||||
Priority: | P2 | Keywords: | HasReduction, InRadar | ||||||
Version: | 528+ (Nightly build) | ||||||||
Hardware: | All | ||||||||
OS: | All | ||||||||
Attachments: |
|
Description
Kevin Ballard
2008-11-10 16:33:17 PST
Created attachment 25034 [details]
Testcase for the bug
I just attached a testcase. Irritatingly, it's using different layout than I saw in TextMate's preview, not sure why, so it doesn't read the same and you only lose a single word of the rendered text, but if you download it locally and play with the source it should be more apparent what's going on. > 𝍧 is interpreted the same as ��
If "��" is treated as 𝍧, it's also a bug.
Paired or not, NCRs with surrogate codepoints are invalid and perhaps should be converted to U+FFDF. I don't think it's a good idea to skip them as if there's nothing.
Perhaps you mean U+FEFF? U+FFDF is a reserved codepoint, U+FEFF is the zero-width no-break space. (In reply to comment #4) > If "��" is treated as 𝍧, it's also a bug. This was done to match Firefox in bug 6446. I see that Firefox has changed now; we should check what IE does. But let's not discuss it here - this bug is about disappearing text. (In reply to comment #5) > Perhaps you mean U+FEFF? U+FFDF is a reserved codepoint This was supposed to be U+FFFD REPLACEMENT CHARACTER. (In reply to comment #6) > (In reply to comment #4) > > If "��" is treated as 𝍧, it's also a bug. > > This was done to match Firefox in bug 6446. I see that Firefox has changed now; > we should check what IE does. But let's not discuss it here - this bug is about > disappearing text. Ok. I've just filed bug 22210 about NCRs with surrogate code points. > (In reply to comment #5) > > Perhaps you mean U+FEFF? U+FFDF is a reserved codepoint > > This was supposed to be U+FFFD REPLACEMENT CHARACTER. Oops. sorry for correcting it. that's what I meant. Created attachment 25476 [details]
test in UTF-16LE with a lone surrogate codepoint
The file is in UTF-16LE with BOM at the beginning. Instead of an invalid NCR for a surrogate codepoint, this file has an unpaired surrogate code point, '0xD835' in UTF-16 ( 0x35 0xD8 ). What follows after that becomes invisible.
Even if we take care of the original issue by not allowing NCRs for surrogate codepoints, this issue won't be fixed. And, it should be fixed.
WidthIterator::advance() bails out on an unpaired surrogate. Is it its (and the complex text code paths) behavior that needs to be changed or do such encoding errors be fixed at a higher level? There may be multiple definitions of correctness in this case, but Firefox 3 draws a custom picture for an unpaired surrogate (at least, I do not know where else this "D835" picture could come from). We could probably just draw a glyph for unpaired surrogate from LastResort font. Firefox synthesizes a glyph of a rectangle with a 4-digit or 6-digit hexadecimal codepoint in it for a character it cannot find a font for. (Up to Firefox2, only Linux version did that, but beginning with Firefox3, the same is done on Windows and Mac OS X). However, for an unpaired surrogate codepoint, it must NOT do that. I'll alert to them about this issue. Instead, they should show U+FFFD in its place (or another representation of an error in the input). Note that it does replace the UTF-8 representation of an unpaired surrogate codepoint by U+FFFD. The same must be done for UTF-16. I think the encoding error should be fixed before reaching the measuring/drawing text. (In reply to comment #12) > I think the encoding error should be fixed before reaching the > measuring/drawing text. In what way should it be fixed? If we replace it with U+FFFD, then we won't be able to use an unpaired surrogate glyph from LastResort, which makes the most sense here. Unicode 5.1 section 3.2 ( http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf ) has the following conformance requirement: C10 When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters. • For example, in UTF-8 every code unit of the form 110xxxx2 must be followed by a code unit of the form 10xxxxxx2. A sequence such as 110xxxxx2 0xxxxxxx2 is ill-formed and must never be generated. When faced with this ill-formed code unit sequence while transforming or interpreting text, a conformant process must treat the first code unit 110xxxxx2 as an illegally terminated code unit sequence—for example, by signaling an error, filtering the code unit out, or representing the code unit with a marker such as U+FFFD replacement character. Section 3.9 (Unicode Encoding Forms) has the following (2nd bullet point in D93): Encoding Form Conversion D93 Encoding form conversion: A conversion defined directly between the code unit sequences of one Unicode encoding form and the code unit sequences of another Unicode encoding form A conformant encoding form conversion will treat any ill-formed code unit sequence as an error condition. (See conformance clause C10.) This guarantees that it will neither interpret nor emit an ill-formed code unit sequence. Any implementation of encoding form conversion must take this requirement into account, because an encoding form conversion implicitly involves a verification that the Unicode strings being converted do, in fact, contain well-formed code unit sequences. -------------- I'm not quoting D91 (defining UTF-16 and what's ill-formed in UTF-16) because it's obvious that an isolated surrogate codepoint is ill-formed. BTW, the corresponding Firefox bug is http://bugzilla.mozilla.org/show_bug.cgi?id=317216 Using the last resort glyph for an isolated surrogate code point is arguably considered as a way of signaling error, but it's not just rendering that is at stake. Other parts in webkit need to deal with them. By replacing isolated surrogate code points with U+FFFD at the earliest stage, we can spare them from having to do that check themselves. IMHO, it's always a good idea to validate what's coming from an external source before doing anything. Rendering code has to deal with this anyway, because an unpaired surrogate can be inserted into the string via DOM APIs. And I doubt that it makes practical sense to require DOM APIs to convert unpaired surrogates into U+FFFD - e.g. JS code may insert text as it comes from some streaming source, and the surrogate pair may be broken between two chunks of data. I can no longer reproduce this with Safari 5.1.7 on Lion. (In reply to comment #16) > I can no longer reproduce this with Safari 5.1.7 on Lion. Did you try with attachment 25476 [details]? I am not sure what the expected behavior is, but with TOT some text appears to be missing. I only tried the first attached test before, and can now see the problem with the second one. This was fixed at some point between 609.3.5.1.5 and 613.1.8.2. |