RESOLVED FIXED Bug 42112
U+0000 is turned to U+FFFD (replacement character)
https://bugs.webkit.org/show_bug.cgi?id=42112
Summary U+0000 is turned to U+FFFD (replacement character)
Jungshik Shin
Reported 2010-07-12 14:24:38 PDT
The page in the url field has td's like this sprinkled all over the place <td>^@^@</td> where ^@ stands for U+0000. They're rendered with a glyph for replacement character (U+0000). This is a recent regression. It's reported against Chromium at http://crbug.com/48603
Attachments
Work in progress (5.67 KB, patch)
2010-07-22 14:56 PDT, Adam Barth
no flags
Patch (21.86 KB, patch)
2010-08-05 12:16 PDT, Adam Barth
no flags
Patch (21.85 KB, patch)
2010-08-05 12:45 PDT, Adam Barth
no flags
Patch (13.89 KB, patch)
2010-08-05 15:35 PDT, Adam Barth
no flags
Jungshik Shin
Comment 1 2010-07-12 15:18:56 PDT
It's strange that this regressed because I don't see a recent change relevant to the treatment of U+0000 in Font.h and GlyphPageTreeNode.cpp
Jungshik Shin
Comment 2 2010-07-12 16:15:21 PDT
It seems that it regressed between 461229 and r61316
Alexey Proskuryakov
Comment 3 2010-07-13 16:58:10 PDT
HTML5 requires that zero bytes are decoded as U+FFFD. I think that this is a mistake, but I haven't been successful at convincing HTML5 powers.
Adam Barth
Comment 4 2010-07-13 17:03:26 PDT
Thanks. These are nice examples.
Adam Barth
Comment 5 2010-07-13 17:27:40 PDT
Our layout is also screwy on that page.
Jungshik Shin
Comment 6 2010-07-14 13:42:28 PDT
(In reply to comment #3) > HTML5 requires that zero bytes are decoded as U+FFFD. I think that this is a mistake, but I haven't been successful at convincing HTML5 powers. Is it about HTML5 10.2.4.70 (http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html) or something else? 10.2.4.70 seems to only about NCR processing. For my reference: it's done in http://trac.webkit.org/changeset/60791
Jungshik Shin
Comment 7 2010-07-14 13:45:37 PDT
Sorry that I missed a reference to the spec in HTMLTokenizer (http://www.whatwg.org/specs/web-apps/current-work/#preprocessing-the-input-stream)
Adam Barth
Comment 8 2010-07-14 15:25:31 PDT
[3:24pm] Hixie: abarth: yeah, it's basically intentional [3:24pm] Hixie: abarth: there shouldn't be NULs there at all and they're likely to be triggering all kinds of security problems [3:24pm] Hixie: abarth: making it ugly is one way to draw attention to it
Alexey Proskuryakov
Comment 9 2010-07-14 16:16:18 PDT
It doesn't seem that we're going to get much sympathy from WhatWG, so the options are basically to ignore such content, or to avoid implementing this aspect of HTML5. My personal preference would be the latter, since it's a useless and incompatible change, but I expect many other WebKit developers to disagree.
Adam Barth
Comment 10 2010-07-14 17:04:22 PDT
I think it's important for the spec and the implementation to agree. I don't have a strong opinion on which behavior is better in this particular topic. One approach to move forward is to talk with Henri and some of the other Mozilla folks and see if we can come to an agreement. If WebKit and Firefox agree, it will be hard for the working group to not agree also.
Henri Sivonen
Comment 11 2010-07-15 00:29:20 PDT
See http://hg.mozilla.org/mozilla-central/rev/ae259fec2443 I made the tree builder ignore U+0000-originating U+FFFD in element content when the tree builder is neither in the 'text' state nor in the 'in foreign content' state (the two states that handle <script>, <style> and <textarea> text.) As far as I can tell, the defense in depth rationale for turning U+0000 into U+FFFD doesn't apply to element content when script, style or textarea isn't involved. Does this change make sense to you? (Note that Hixie didn't put this in the spec in http://html5.org/tools/web-apps-tracker?from=5155&to=5156 )
Adam Barth
Comment 12 2010-07-15 00:38:02 PDT
> Does this change make sense to you? Yeah we could do something similar. Architecturally, we'd have the tree builder set a bit on the tokenizer / input stream preprocessor to not generate FFFD when it's not in the text or foreign content state. It would then be similar to the CR LF collapsing, but it could collapse an unbounded number of characters, which should be doable. @ap, would that address your concerns?
Alexey Proskuryakov
Comment 13 2010-07-15 08:10:30 PDT
Yes, that certainly addresses the known practical issues. In general, I think that the more complicated null handling gets, the more potential security issues there are. The recent security problems with nulls that I know of were all caused by stripping them in tokenizer instead of passing as is, which some software did, and some didn't perform. Converting nulls to U+FFFD will be a unique trait of HTML5 processing, so it will likely always remain an unexpected feature for many.
Adam Barth
Comment 14 2010-07-20 17:47:14 PDT
Apparently this problem also occurs on the US bank login page: http://code.google.com/p/chromium/issues/detail?id=49720
Evan Martin
Comment 15 2010-07-20 17:49:24 PDT
Another page exhibiting the same issue: usbank's login page. This was reported by a Chrome users. You can see the problem without an account here: https://www4.usbank.com/internetBanking/RequestRouter?requestCmdId=DisplayLoginPageNotYou Hex snippet of page contents from the cache: 00004cb0: 66 6d 2f 70 65 72 73 6f 6e 61 6c 2f 61 63 68 69 fm/personal/achi 00004cc0: 65 76 65 5f 67 6f 61 6c 73 2f 49 44 5f 74 68 65 eve_goals/ID_the 00004cd0: 66 74 2e 63 66 6d 22 3e 50 72 6f 74 65 63 74 20 ft.cfm">Protect 00004ce0: 59 6f 75 72 20 49 64 65 6e 74 69 74 79 3c 2f 61 Your Identity</a 00004cf0: 3e 3c 2f 74 64 3e 0a 20 20 20 20 20 20 20 20 20 ></td>. 00004d00: 20 20 20 20 20 20 20 20 20 3c 2f 74 72 3e 20 3c </tr> < 00004d10: 2f 74 61 62 6c 65 3e 00 00 00 00 00 00 00 00 00 /table>......... 00004d20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00004d30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00004d40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Henri Sivonen
Comment 16 2010-07-21 01:20:46 PDT
It seems to me that the fix I implemented for Gecko (ignoring U+0000-originating U+FFFD in text content when the tree builder isn't in the 'text' or 'in foreign content' modes) addresses this problem for the sites reported here.
Adam Barth
Comment 17 2010-07-22 14:55:05 PDT
This bug is hard to fix before enabling the new tree builder because the old tree builder doesn't use the same state machine. It's complicated to fix here too without introducing a per-character branch, but I think I see how to do it.
Adam Barth
Comment 18 2010-07-22 14:56:31 PDT
Created attachment 62344 [details] Work in progress
Adam Barth
Comment 19 2010-08-05 10:43:14 PDT
I haz a patch. Gather test results changes.
Adam Barth
Comment 20 2010-08-05 12:16:45 PDT
Darin Adler
Comment 21 2010-08-05 12:23:25 PDT
Comment on attachment 63614 [details] Patch > + // Swallowing U+0000 characters isn't in the HTML5 spec, but turning all > + // the U+0000 characters into replacement characters has compatibility > + // problems problems. problems problems problems
Adam Barth
Comment 22 2010-08-05 12:45:03 PDT
Darin Adler
Comment 23 2010-08-05 13:08:04 PDT
Comment on attachment 63620 [details] Patch > -PASS canvas.getContext('2d�') is null > +PASS canvas.getContext('2d') is null > -PASS String("�").length is 1 > +PASS String("").length is 1 I think an even better approach to these would be to fix the escapeHTML function in js-test-pre.js to escape null characters. We could do that in a separate pass before making this change.
Adam Barth
Comment 24 2010-08-05 13:31:34 PDT
> I think an even better approach to these would be to fix the escapeHTML function in js-test-pre.js to escape null characters. We could do that in a separate pass before making this change. Ok. I'll file a new bug for that.
Eric Seidel (no email)
Comment 25 2010-08-05 14:02:07 PDT
Comment on attachment 63620 [details] Patch We came up with better names in person.
Adam Barth
Comment 26 2010-08-05 14:39:16 PDT
> Ok. I'll file a new bug for that. https://bugs.webkit.org/show_bug.cgi?id=43579
Adam Barth
Comment 27 2010-08-05 15:35:57 PDT
Eric Seidel (no email)
Comment 28 2010-08-05 15:42:34 PDT
Comment on attachment 63648 [details] Patch OK.
Adam Barth
Comment 29 2010-08-05 17:12:26 PDT
Comment on attachment 63648 [details] Patch Clearing flags on attachment: 63648 Committed r64799: <http://trac.webkit.org/changeset/64799>
Adam Barth
Comment 30 2010-08-05 17:12:35 PDT
All reviewed patches have been landed. Closing bug.
WebKit Review Bot
Comment 31 2010-08-05 17:49:19 PDT
Note You need to log in before you can comment on or make changes to this bug.