WebKit Bugzilla
RESOLVED WORKSFORME
22210
don't allow NCRs with surrogate codepoints
https://bugs.webkit.org/show_bug.cgi?id=22210
Jungshik Shin
Reported
2008-11-12 11:32:48 PST
The HTML/XML character set (independent of the actual character encoding of a document) is Unicode/ISO 10646, and NCRs represent Unicode code points. They do not represent 2-byte code units of UTF-16. So NCRs with surrogate code points should not be allowed, whether they are paired or not. They were let in by the change made in bug 6446 to be compatible with Firefox (as of 2006), but we should reverse it. IMHO, this is not something to change to be compatible with other browsers (IE, Firefox or whatever).
Alexey Proskuryakov
Comment 1
2008-11-12 14:43:59 PST
(In reply to comment #0)
> IMHO, this is not something to change to be compatible with other browsers (IE, Firefox or whatever).
Could you explain why compatibility isn't the primary concern in this case? I know that Firefox has changed its behavior, so the original reason for our change is void, but at this point, I don't see why we shouldn't just check what IE does, and match that.
Jungshik Shin
Comment 2
2008-11-14 15:52:47 PST
IMHO, we should stick to the standard unless there's a very compelling compatibility reason for violating it. I don't think there is one in this case (for one, Firefox and IE disagree). I don't have any hard numbers [1], but I guess we would gain very little in terms of compatibility by matching IE.
IE7 does indeed interpret a pair of NCRs with high and low surrogate code points as a single Unicode character. In the page at the URL field, both the 1st and 3rd columns are rendered identically by IE (U+10400 is shown). In the middle column, the lone high surrogate code point (D801) is turned invisible, while DC00 (a lone low surrogate code point) is rendered as an empty box. Firefox turns any surrogate code point (paired or not) in an NCR into U+FFFD (replacement character).
BTW, Firefox's recent change (in addition to being compliant with CHARMOD) may have been motivated by security concerns (http://www.mozilla.org/security/announce/2008/mfsa2008-43.html).
[1] If you really need hard numbers, I can produce (or ask somebody to produce) them based on Google's repository, but it seems like too much work for too little.
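For readers following along, the two behaviors described in this comment can be sketched roughly as below. This is only an illustration, not code from any browser: the function names (`combine_surrogates`, `ie7_like`, `firefox_like`) are hypothetical, the input is assumed to be the list of numeric values already parsed out of the NCRs, and how IE7 actually renders lone surrogates (invisible vs. empty box) is not modeled; they are simply passed through.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def is_high(v: int) -> bool:
    return 0xD800 <= v <= 0xDBFF


def is_low(v: int) -> bool:
    return 0xDC00 <= v <= 0xDFFF


def combine_surrogates(high: int, low: int) -> int:
    """Standard UTF-16 arithmetic: map a surrogate pair to one code point."""
    if not (is_high(high) and is_low(low)):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def ie7_like(values: list[int]) -> str:
    """Assumed model of the IE7 behavior: a high+low pair of NCRs is
    merged into a single character; everything else is kept as-is."""
    out = []
    i = 0
    while i < len(values):
        v = values[i]
        if is_high(v) and i + 1 < len(values) and is_low(values[i + 1]):
            out.append(chr(combine_surrogates(v, values[i + 1])))
            i += 2
        else:
            out.append(chr(v))  # lone surrogates pass through unmodeled
            i += 1
    return "".join(out)


def firefox_like(values: list[int]) -> str:
    """Assumed model of the Firefox behavior: every surrogate value,
    paired or not, becomes U+FFFD."""
    return "".join(
        REPLACEMENT if 0xD800 <= v <= 0xDFFF else chr(v) for v in values
    )
```

With the values from the test page, `combine_surrogates(0xD801, 0xDC00)` gives 0x10400, which is the U+10400 that IE shows, while `firefox_like([0xD801, 0xDC00])` yields two replacement characters.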
Alexey Proskuryakov
Comment 3
2008-11-14 16:39:56 PST
I don't have a strong opinion on this. It seems a little weird to flip-flop without additional data, but it doesn't sound too dangerous.
Jungshik Shin
Comment 4
2009-03-06 18:09:46 PST
It turned out that HTML5 (current draft) has the following to say about the issue:
> Otherwise, if the number is in the range 0x0000 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is a parse error; return a character token for the U+FFFD REPLACEMENT CHARACTER character instead.
So, for surrogate code points in NCRs, we have to replace each of them with U+FFFD (as we used to do).
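The quoted rule can be sketched as a small predicate. This is a minimal illustration and not WebKit's implementation: `is_disallowed` and `decode_ncr_value` are hypothetical names, and the sketch covers only the quoted "Otherwise" branch, not whatever steps precede it in the draft.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Inclusive ranges listed in the quoted draft text.
DISALLOWED_RANGES = (
    (0x0000, 0x0008),
    (0x000E, 0x001F),
    (0x007F, 0x009F),
    (0xD800, 0xDFFF),  # surrogate code points
    (0xFDD0, 0xFDEF),
)


def is_disallowed(code_point: int) -> bool:
    """True if the quoted rule says the number must become U+FFFD."""
    if code_point > 0x10FFFF:
        return True
    if code_point == 0x000B:
        return True
    # 0xFFFE/0xFFFF, 0x1FFFE/0x1FFFF, ... 0x10FFFE/0x10FFFF:
    # the last two code points of every plane share these low 16 bits.
    if (code_point & 0xFFFE) == 0xFFFE:
        return True
    return any(lo <= code_point <= hi for lo, hi in DISALLOWED_RANGES)


def decode_ncr_value(code_point: int) -> str:
    """Map an already-parsed NCR number to a character (sketch only)."""
    return REPLACEMENT if is_disallowed(code_point) else chr(code_point)
```

Under this sketch, each half of a pair such as 0xD801 and 0xDC00 is replaced independently, whereas a direct reference to 0x10400 decodes to the intended character.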
Alexey Proskuryakov
Comment 5
2012-05-29 11:53:11 PDT
I'm a bit confused. This looks like it was done already. Was this part of HTML5 parser rewrite?
Adam Barth
Comment 6
2012-05-29 16:33:32 PDT
> Was this part of HTML5 parser rewrite?
I believe so, yes. I didn't know this issue was controversial and just did what the spec said to do.
Alexey Proskuryakov
Comment 7
2012-05-29 16:57:28 PDT
I don't think that this one was particularly controversial. Just not willing to flip-flop without a good reason to. Marking WORKSFORME.