22210 – don't allow NCRs with surrogate codepoints

https://bugs.webkit.org/show_bug.cgi?id=22210

Jungshik Shin

Reported 2008-11-12 11:32:48 PST

The document character set of HTML/XML (independent of the actual character encoding of a document) is Unicode/ISO 10646, and numeric character references (NCRs) represent Unicode code points. They do not represent 2-byte code units of UTF-16. So NCRs with surrogate code points should not be allowed, whether they are paired or not. They were let in by the change made in bug 6446 to be compatible with Firefox (as of 2006), but we should reverse it. IMHO, this is not something to change to be compatible with other browsers (IE, Firefox or whatever).

Alexey Proskuryakov

Comment 1 2008-11-12 14:43:59 PST

(In reply to comment #0)
> IMHO, this is not something to change to be compatible with other browsers (IE, Firefox or whatever).

Could you explain why compatibility isn't the primary concern in this case? I know that Firefox has changed its behavior, so the original reason for our change is void, but at this point, I don't see why we shouldn't just check what IE does, and match that.

Jungshik Shin

Comment 2 2008-11-14 15:52:47 PST

IMHO, we should stick to the standard unless there's a very compelling compatibility reason for violating it. I don't think there is one in this case (for one, Firefox and IE disagree). I don't have hard numbers [1], but I guess we would gain very little in terms of compatibility by matching IE.

IE7 does indeed interpret a pair of NCRs with high and low surrogate code points as a single Unicode character. On the page in the URL field, the 1st and 3rd columns are rendered identically by IE (U+10400 is shown). In the middle column, the lone high surrogate code point (D801) is rendered invisible, while DC00 (a lone low surrogate code point) is rendered as an empty box.

Firefox turns any surrogate code point (paired or not) in an NCR into U+FFFD (the replacement character). BTW, Firefox's recent change (in addition to being compliant with CHARMOD) may have been motivated by security concerns (http://www.mozilla.org/security/announce/2008/mfsa2008-43.html).

[1] If you really need hard numbers, I can produce them (or ask somebody to) based on Google's repository, but it seems like too much work for too little.
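For reference, the IE7 behavior described in this comment amounts to applying UTF-16 surrogate-pair decoding to two NCR values. Below is a minimal sketch of that arithmetic in Python. The function name `combine_surrogates` is made up for illustration; this is not IE or WebKit code, only the standard UTF-16 formula applied to the values from the test page.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Decode a UTF-16 surrogate pair into one Unicode code point.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the code point offset above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the test page, D801 followed by DC00:
print(hex(combine_surrogates(0xD801, 0xDC00)))  # 0x10400
```

This is why the paired NCRs and the single NCR for U+10400 look identical in IE7, even though surrogate values are UTF-16 code units rather than characters.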

Alexey Proskuryakov

Comment 3 2008-11-14 16:39:56 PST

I don't have a strong opinion on this. It seems a little weird to flip-flop without additional data, but it doesn't sound too dangerous.

Jungshik Shin

Comment 4 2009-03-06 18:09:46 PST

It turned out that HTML5 (current draft) has the following to say about the issue:

> Otherwise, if the number is in the range 0x0000 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is a parse error; return a character token for the U+FFFD REPLACEMENT CHARACTER character instead.

So, for surrogate codepoints in NCRs, we have to replace each of them with U+FFFD (as we used to do).
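The quoted clause can be sketched roughly as follows in Python. This covers only the text quoted in this comment; the draft's preceding cases (the quote begins at "Otherwise") are not shown in this bug and are not modeled here. The names `is_disallowed` and `resolve_ncr` are hypothetical and are not taken from the WebKit parser.

```python
REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER

# Inclusive ranges listed in the quoted draft text.
_DISALLOWED_RANGES = (
    (0x0000, 0x0008),
    (0x000E, 0x001F),
    (0x007F, 0x009F),
    (0xD800, 0xDFFF),  # surrogate code points
    (0xFDD0, 0xFDEF),
)


def is_disallowed(number: int) -> bool:
    """Return True if the quoted clause maps this NCR value to U+FFFD."""
    if number > 0x10FFFF or number == 0x000B:
        return True
    # 0xFFFE, 0xFFFF, 0x1FFFE, ..., 0x10FFFF: the last two values of each plane.
    if (number & 0xFFFE) == 0xFFFE:
        return True
    return any(lo <= number <= hi for lo, hi in _DISALLOWED_RANGES)


def resolve_ncr(number: int) -> str:
    """Map a numeric character reference value to the character it yields."""
    return REPLACEMENT if is_disallowed(number) else chr(number)


# Each surrogate NCR is replaced on its own, so a "pair" yields two U+FFFD.
print([hex(ord(resolve_ncr(n))) for n in (0xD801, 0xDC00, 0x10400)])
```

Under this rule the paired NCRs for D801 and DC00 no longer produce U+10400; only the direct reference to 0x10400 does.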

Alexey Proskuryakov

Comment 5 2012-05-29 11:53:11 PDT

I'm a bit confused. This looks like it was done already. Was this part of HTML5 parser rewrite?

Adam Barth

Comment 6 2012-05-29 16:33:32 PDT

> Was this part of HTML5 parser rewrite?

I believe so, yes. I didn't know this issue was controversial and just did what the spec said to do.

Alexey Proskuryakov

Comment 7 2012-05-29 16:57:28 PDT

I don't think that this one was particularly controversial. Just not willing to flip-flop without a good reason to. Marking WORKSFORME.

Status RESOLVED

Resolution WORKSFORME

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component DOM

Assignee

Nobody

Reported

2008-11-12 11:32 PST

Modified

2012-05-29 16:57 PDT

CC List

5 users

URL

http://i18nl10n.com/webkit/ncr.html
