The page in the url field has td's like this sprinkled all over the place <td>^@^@</td> where ^@ stands for U+0000. They're rendered with a glyph for replacement character (U+0000). This is a recent regression. It's reported against Chromium at http://crbug.com/48603
It's strange that this regressed because I don't see a recent change relevant to the treatment of U+0000 in Font.h and GlyphPageTreeNode.cpp
It seems that it regressed between 461229 and r61316
HTML5 requires that zero bytes are decoded as U+FFFD. I think that this is a mistake, but I haven't been successful at convincing HTML5 powers.
Thanks. These are nice examples.
Our layout is also screwy on that page.
(In reply to comment #3) > HTML5 requires that zero bytes are decoded as U+FFFD. I think that this is a mistake, but I haven't been successful at convincing HTML5 powers. Is it about HTML5 10.2.4.70 (http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html) or something else? 10.2.4.70 seems to only about NCR processing. For my reference: it's done in http://trac.webkit.org/changeset/60791
Sorry that I missed a reference to the spec in HTMLTokenizer (http://www.whatwg.org/specs/web-apps/current-work/#preprocessing-the-input-stream)
[3:24pm] Hixie: abarth: yeah, it's basically intentional [3:24pm] Hixie: abarth: there shouldn't be NULs there at all and they're likely to be triggering all kinds of security problems [3:24pm] Hixie: abarth: making it ugly is one way to draw attention to it
It doesn't seem that we're going to get much sympathy from WhatWG, so the options are basically to ignore such content, or to avoid implementing this aspect of HTML5. My personal preference would be the latter, since it's a useless and incompatible change, but I expect many other WebKit developers to disagree.
I think it's important for the spec and the implementation to agree. I don't have a strong opinion on which behavior is better in this particular topic. One approach to move forward is to talk with Henri and some of the other Mozilla folks and see if we can come to an agreement. If WebKit and Firefox agree, it will be hard for the working group to not agree also.
See http://hg.mozilla.org/mozilla-central/rev/ae259fec2443 I made the tree builder ignore U+0000-originating U+FFFD in element content when the tree builder is neither in the 'text' state nor in the 'in foreign content' state (the two states that handle <script>, <style> and <textarea> text.) As far as I can tell, the defense in depth rationale for turning U+0000 into U+FFFD doesn't apply to element content when script, style or textarea isn't involved. Does this change make sense to you? (Note that Hixie didn't put this in the spec in http://html5.org/tools/web-apps-tracker?from=5155&to=5156 )
> Does this change make sense to you? Yeah we could do something similar. Architecturally, we'd have the tree builder set a bit on the tokenizer / input stream preprocessor to not generate FFFD when it's not in the text or foreign content state. It would then be similar to the CR LF collapsing, but it could collapse an unbounded number of characters, which should be doable. @ap, would that address your concerns?
Yes, that certainly addresses the known practical issues. In general, I think that the more complicated null handling gets, the more potential security issues there are. The recent security problems with nulls that I know of were all caused by stripping them in tokenizer instead of passing as is, which some software did, and some didn't perform. Converting nulls to U+FFFD will be a unique trait of HTML5 processing, so it will likely always remain an unexpected feature for many.
Apparently this problem also occurs on the US bank login page: http://code.google.com/p/chromium/issues/detail?id=49720
Another page exhibiting the same issue: usbank's login page. This was reported by a Chrome users. You can see the problem without an account here: https://www4.usbank.com/internetBanking/RequestRouter?requestCmdId=DisplayLoginPageNotYou Hex snippet of page contents from the cache: 00004cb0: 66 6d 2f 70 65 72 73 6f 6e 61 6c 2f 61 63 68 69 fm/personal/achi 00004cc0: 65 76 65 5f 67 6f 61 6c 73 2f 49 44 5f 74 68 65 eve_goals/ID_the 00004cd0: 66 74 2e 63 66 6d 22 3e 50 72 6f 74 65 63 74 20 ft.cfm">Protect 00004ce0: 59 6f 75 72 20 49 64 65 6e 74 69 74 79 3c 2f 61 Your Identity</a 00004cf0: 3e 3c 2f 74 64 3e 0a 20 20 20 20 20 20 20 20 20 ></td>. 00004d00: 20 20 20 20 20 20 20 20 20 3c 2f 74 72 3e 20 3c </tr> < 00004d10: 2f 74 61 62 6c 65 3e 00 00 00 00 00 00 00 00 00 /table>......... 00004d20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00004d30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00004d40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
It seems to me that the fix I implemented for Gecko (ignoring U+0000-originating U+FFFD in text content when the tree builder isn't in the 'text' or 'in foreign content' modes) addresses this problem for the sites reported here.
This bug is hard to fix before enabling the new tree builder because the old tree builder doesn't use the same state machine. It's complicated to fix here too without introducing a per-character branch, but I think I see how to do it.
Created attachment 62344 [details] Work in progress
I haz a patch. Gather test results changes.
Created attachment 63614 [details] Patch
Comment on attachment 63614 [details] Patch > + // Swallowing U+0000 characters isn't in the HTML5 spec, but turning all > + // the U+0000 characters into replacement characters has compatibility > + // problems problems. problems problems problems
Created attachment 63620 [details] Patch
Comment on attachment 63620 [details] Patch > -PASS canvas.getContext('2d�') is null > +PASS canvas.getContext('2d') is null > -PASS String("�").length is 1 > +PASS String("").length is 1 I think an even better approach to these would be to fix the escapeHTML function in js-test-pre.js to escape null characters. We could do that in a separate pass before making this change.
> I think an even better approach to these would be to fix the escapeHTML function in js-test-pre.js to escape null characters. We could do that in a separate pass before making this change. Ok. I'll file a new bug for that.
Comment on attachment 63620 [details] Patch We came up with better names in person.
> Ok. I'll file a new bug for that. https://bugs.webkit.org/show_bug.cgi?id=43579
Created attachment 63648 [details] Patch
Comment on attachment 63648 [details] Patch OK.
Comment on attachment 63648 [details] Patch Clearing flags on attachment: 63648 Committed r64799: <http://trac.webkit.org/changeset/64799>
All reviewed patches have been landed. Closing bug.
http://trac.webkit.org/changeset/64799 might have broken Qt Linux Release The following changes are on the blame list: http://trac.webkit.org/changeset/64800 http://trac.webkit.org/changeset/64801 http://trac.webkit.org/changeset/64799