Bug 22210 - don't allow NCRs with surrogate codepoints
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM
Version: 528+ (Nightly build)
Hardware: All All
Importance: P2 Normal
Assignee: Nobody
URL: http://i18nl10n.com/webkit/ncr.html
Depends on:
Reported: 2008-11-12 11:32 PST by Jungshik Shin
Modified: 2012-05-29 16:57 PDT
Description Jungshik Shin 2008-11-12 11:32:48 PST
The HTML/XML document character set (independent of the actual character encoding of a document) is Unicode/ISO 10646, and NCRs (numeric character references) represent Unicode code points. They do not represent 2-byte code units of UTF-16. So NCRs with surrogate code points should not be allowed, whether they are paired or not.

They were let in by the change made in bug 6446 to be compatible with Firefox (as of 2006), but we should reverse it. IMHO, this is not something to change to be compatible with other browsers (IE, Firefox or whatever).
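
To illustrate the distinction drawn above between code points and UTF-16 code units, here is a minimal Python sketch. The helper names `utf16_code_units` and `is_surrogate` are made up for this example and do not come from WebKit. It shows that U+10400 is a single code point that UTF-16 happens to encode as two code units, both of which fall in the surrogate range.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def is_surrogate(code_point):
    """True if code_point lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


ch = "\U00010400"                    # one Unicode code point
units = utf16_code_units(ch)         # two UTF-16 code units

print(hex(ord(ch)))                  # the code point an NCR should name
print([hex(u) for u in units])       # the code units, which are surrogates
print([is_surrogate(u) for u in units])
```

Under this view, `&#x10400;` is the only valid way to reference the character; `&#xD801;&#xDC00;` names two surrogate code points rather than one character.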
Comment 1 Alexey Proskuryakov 2008-11-12 14:43:59 PST
(In reply to comment #0)
> IMHO, this is not something to change to be compatible with other browsers (IE, Firefox or whatever).

Could you explain why compatibility isn't the primary concern in this case? I know that Firefox has changed its behavior, so the original reason for our change is void, but at this point, I don't see why we shouldn't just check what IE does, and match that.
Comment 2 Jungshik Shin 2008-11-14 15:52:47 PST
IMHO, we should stick to the standard unless there's a very compelling compatibility reason for violating it. I don't think there is in this case (for one, Firefox and IE disagree). I don't have any hard numbers [1], but I guess we would gain very little in terms of compatibility by matching IE.

IE7 does indeed interpret a pair of NCRs with high and low surrogate code points as a single Unicode character. On the page given in the URL field, the 1st and 3rd columns are rendered identically by IE (U+10400 is shown). In the middle column, the lone high surrogate code point (D801) is rendered invisible, while DC00 (a lone low surrogate code point) is rendered as an empty box.
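
For reference, the pairing that IE7 appears to perform corresponds to the standard UTF-16 surrogate-pair arithmetic. The sketch below is only an illustration of that arithmetic, not IE's actual code; `combine_surrogates` is a hypothetical name.

```python
def combine_surrogates(high, low):
    """Combine a high/low surrogate pair into one supplementary code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair D801 DC00 is the one that IE7 is described as showing as U+10400.
print(hex(combine_surrogates(0xD801, 0xDC00)))
```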

Firefox turns any surrogate code point (paired or not) in an NCR into U+FFFD (replacement character).

BTW, Firefox's recent change (in addition to being compliant with CHARMOD) may have been motivated by security concerns (http://www.mozilla.org/security/announce/2008/mfsa2008-43.html).

[1] If you really need a hard number, I can produce (ask somebody to produce) one based on Google's repository, but it seems too much work for too little.
Comment 3 Alexey Proskuryakov 2008-11-14 16:39:56 PST
I don't have a strong opinion on this. It seems a little weird to flip-flop without additional data, but it doesn't sound too dangerous.
Comment 4 Jungshik Shin 2009-03-06 18:09:46 PST
It turned out that HTML5 (current draft) has the following to say about the issue:

Otherwise, if the number is in the range 0x0000 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is a parse error; return a character token for the U+FFFD REPLACEMENT CHARACTER character instead.

So, for surrogate codepoints in NCRs, we have to replace each of them with U+FFFD (as we used to do). 
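
A rough Python sketch of the quoted step, for illustration only. The quoted passage starts with "Otherwise", so any earlier steps of the draft's algorithm are not modeled here, and the names `is_disallowed` and `decode_ncr` are hypothetical rather than anything in WebKit's parser.

```python
REPLACEMENT = "\uFFFD"


def is_disallowed(n):
    """True if n falls in one of the ranges or values listed in the quoted draft text."""
    if n > 0x10FFFF:
        return True
    if (0x0000 <= n <= 0x0008 or 0x000E <= n <= 0x001F
            or 0x007F <= n <= 0x009F or 0xD800 <= n <= 0xDFFF
            or 0xFDD0 <= n <= 0xFDEF):
        return True
    if n == 0x000B:
        return True
    # 0xFFFE, 0xFFFF, 0x1FFFE, ..., 0x10FFFF: the last two code points of each plane.
    return (n & 0xFFFE) == 0xFFFE


def decode_ncr(n):
    """Return the character for numeric reference n, or U+FFFD if the quoted rule rejects it."""
    return REPLACEMENT if is_disallowed(n) else chr(n)


# Each surrogate NCR is replaced on its own, so a pair yields two U+FFFD characters.
print([hex(ord(decode_ncr(n))) for n in (0xD801, 0xDC00, 0x10400)])
```

Because the check is applied per reference, a paired `&#xD801;&#xDC00;` and a lone `&#xD801;` are handled the same way: every surrogate becomes its own U+FFFD.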
Comment 5 Alexey Proskuryakov 2012-05-29 11:53:11 PDT
I'm a bit confused. This looks like it was done already. Was this part of HTML5 parser rewrite?
Comment 6 Adam Barth 2012-05-29 16:33:32 PDT
> Was this part of HTML5 parser rewrite?

I believe so, yes.  I didn't know this issue was controversial and just did what the spec said to do.
Comment 7 Alexey Proskuryakov 2012-05-29 16:57:28 PDT
I don't think that this one was particularly controversial. Just not willing to flip-flop without a good reason to.