The HTML/XML document character set (independent of the actual character encoding of a document) is Unicode/ISO 10646, and numeric character references (NCRs) represent Unicode code points. They do not represent 2-byte code units of UTF-16. So NCRs with surrogate code points should not be allowed, whether they are paired or not.
They were let in by the change made in bug 6446 for compatibility with Firefox (as of 2006), but we should reverse it. IMHO, this is not something to change to be compatible with other browsers (IE, Firefox or whatever).
(In reply to comment #0)
> IMHO, this is not something to change to be compatible with other browsers (IE, Firefox or whatever).
Could you explain why compatibility isn't the primary concern in this case? I know that Firefox has changed its behavior, so the original reason for our change is void, but at this point, I don't see why we shouldn't just check what IE does, and match that.
IMHO, we should stick to the standard unless there's a very compelling compatibility reason for violating it. I don't think there is one in this case (for one thing, Firefox and IE disagree). I don't have any hard numbers, but I guess we would gain very little in terms of compatibility by matching IE.
IE7 does indeed interpret a pair of NCRs with high and low surrogate code points as a single Unicode character. In the page given in the URL field, the 1st and 3rd columns are rendered identically by IE (U+10400 is shown). In the middle column, the lone high surrogate code point (D801) is rendered invisible, while DC00 (a lone low surrogate code point) is rendered as an empty box.
Firefox turns any surrogate code point (paired or not) in an NCR into U+FFFD (replacement character).
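To make the difference concrete, here is a minimal Python sketch of the two interpretations described above. It is only an illustration, not code from either browser: the function names are hypothetical, the input is assumed to be the list of numbers already parsed out of consecutive NCRs, and the first function models only the pairing step (how IE7 renders lone surrogates is a display matter and is not modeled).

```python
def is_high_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDBFF


def is_low_surrogate(cp: int) -> bool:
    return 0xDC00 <= cp <= 0xDFFF


def pair_as_utf16(code_points: list[int]) -> list[int]:
    """Treat NCR values as UTF-16 code units (the IE7-like reading).

    A high surrogate immediately followed by a low surrogate is combined
    into one supplementary code point; everything else passes through.
    """
    out = []
    i = 0
    while i < len(code_points):
        cp = code_points[i]
        if (is_high_surrogate(cp) and i + 1 < len(code_points)
                and is_low_surrogate(code_points[i + 1])):
            low = code_points[i + 1]
            out.append(0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            out.append(cp)
            i += 1
    return out


def replace_surrogates(code_points: list[int]) -> list[int]:
    """Treat NCR values as code points (the Firefox-like reading).

    Every surrogate, paired or not, becomes U+FFFD.
    """
    return [0xFFFD if 0xD800 <= cp <= 0xDFFF else cp for cp in code_points]


# The pair D801 DC00 from the test page, under each reading:
print([hex(cp) for cp in pair_as_utf16([0xD801, 0xDC00])])       # ['0x10400']
print([hex(cp) for cp in replace_surrogates([0xD801, 0xDC00])])  # ['0xfffd', '0xfffd']
```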
BTW, Firefox's recent change (in addition to being compliant with CHARMOD) may have been motivated by security concerns (http://www.mozilla.org/security/announce/2008/mfsa2008-43.html).
If you really need a hard number, I can produce one (or ask somebody to produce one) based on Google's repository, but it seems like too much work for too little gain.
I don't have a strong opinion on this. It seems a little weird to flip-flop without additional data, but it doesn't sound too dangerous.
It turns out that the current HTML5 draft has the following to say about the issue:
Otherwise, if the number is in the range 0x0000 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is a parse error; return a character token for the U+FFFD REPLACEMENT CHARACTER character instead.
So, for surrogate code points in NCRs, we have to replace each of them with U+FFFD (as we used to do).
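For reference, the quoted clause can be sketched in Python roughly as follows. This is an illustration of that one clause only, not the parser's actual code: the function name is hypothetical, and because the quoted text begins with "Otherwise", any earlier steps in the draft that handle specific numbers are not shown and are assumed to have run already.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def resolve_numeric_reference(number: int) -> str:
    """Map an NCR's number to a character per the quoted draft clause."""
    # Ranges listed in the clause, including the surrogates D800..DFFF.
    disallowed_ranges = (
        (0x0000, 0x0008),
        (0x000E, 0x001F),
        (0x007F, 0x009F),
        (0xD800, 0xDFFF),
        (0xFDD0, 0xFDEF),
    )
    if any(lo <= number <= hi for lo, hi in disallowed_ranges):
        return REPLACEMENT_CHARACTER
    if number == 0x000B:
        return REPLACEMENT_CHARACTER
    # Beyond the Unicode code space.
    if number > 0x10FFFF:
        return REPLACEMENT_CHARACTER
    # 0xFFFE, 0xFFFF, 0x1FFFE, ..., 0x10FFFF: the last two code points
    # of every plane have their low 16 bits equal to FFFE or FFFF.
    if (number & 0xFFFE) == 0xFFFE:
        return REPLACEMENT_CHARACTER
    return chr(number)


# Each surrogate is replaced on its own, so a "pair" yields two U+FFFD.
print([resolve_numeric_reference(n) for n in (0xD801, 0xDC00)])
# A direct reference to the supplementary character is fine.
print(hex(ord(resolve_numeric_reference(0x10400))))  # 0x10400
```

Note that under this rule the pair D801 DC00 produces two replacement characters, whereas the single reference 10400 produces the intended character.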
I'm a bit confused. This looks like it was done already. Was this part of the HTML5 parser rewrite?
> Was this part of the HTML5 parser rewrite?
I believe so, yes. I didn't know this issue was controversial and just did what the spec said to do.
I don't think that this one was particularly controversial. I just wasn't willing to flip-flop without a good reason.