Bug 42112 - U+0000 is turned to U+FFFD (replacement character)
Summary: U+0000 is turned to U+FFFD (replacement character)
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Layout and Rendering (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Adam Barth
URL: http://big5.xinhuanet.com/gate/big5/w...
Keywords:
Depends on:
Blocks: 41115
  Show dependency treegraph
 
Reported: 2010-07-12 14:24 PDT by Jungshik Shin
Modified: 2010-08-05 17:49 PDT (History)
9 users (show)

See Also:


Attachments
Work in progress (5.67 KB, patch)
2010-07-22 14:56 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Patch (21.86 KB, patch)
2010-08-05 12:16 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Patch (21.85 KB, patch)
2010-08-05 12:45 PDT, Adam Barth
no flags Details | Formatted Diff | Diff
Patch (13.89 KB, patch)
2010-08-05 15:35 PDT, Adam Barth
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jungshik Shin 2010-07-12 14:24:38 PDT
The page in the url field has td's like this sprinkled all over the place

<td>^@^@</td>

where ^@ stands for U+0000. 

They're rendered with a glyph for replacement character (U+0000). This is a recent regression. 

It's reported against Chromium at http://crbug.com/48603
Comment 1 Jungshik Shin 2010-07-12 15:18:56 PDT
It's strange that this regressed because I don't see a recent change relevant to the treatment of U+0000 in Font.h and GlyphPageTreeNode.cpp
Comment 2 Jungshik Shin 2010-07-12 16:15:21 PDT
It seems that it regressed between 461229 and r61316
Comment 3 Alexey Proskuryakov 2010-07-13 16:58:10 PDT
HTML5 requires that zero bytes are decoded as U+FFFD. I think that this is a mistake, but I haven't been successful at convincing HTML5 powers.
Comment 4 Adam Barth 2010-07-13 17:03:26 PDT
Thanks.  These are nice examples.
Comment 5 Adam Barth 2010-07-13 17:27:40 PDT
Our layout is also screwy on that page.
Comment 6 Jungshik Shin 2010-07-14 13:42:28 PDT
(In reply to comment #3)
> HTML5 requires that zero bytes are decoded as U+FFFD. I think that this is a mistake, but I haven't been successful at convincing HTML5 powers.

Is it about HTML5 10.2.4.70 (http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html) or something else? 10.2.4.70 seems to only about NCR processing. 

For my reference: it's done in http://trac.webkit.org/changeset/60791
Comment 7 Jungshik Shin 2010-07-14 13:45:37 PDT
Sorry that I missed a reference to the spec in HTMLTokenizer (http://www.whatwg.org/specs/web-apps/current-work/#preprocessing-the-input-stream)
Comment 8 Adam Barth 2010-07-14 15:25:31 PDT
[3:24pm] Hixie: abarth: yeah, it's basically intentional
[3:24pm] Hixie: abarth: there shouldn't be NULs there at all and they're likely to be triggering all kinds of security problems
[3:24pm] Hixie: abarth: making it ugly is one way to draw attention to it
Comment 9 Alexey Proskuryakov 2010-07-14 16:16:18 PDT
It doesn't seem that we're going to get much sympathy from WhatWG, so the options are basically to ignore such content, or to avoid implementing this aspect of HTML5.

My personal preference would be the latter, since it's a useless and incompatible change, but I expect many other WebKit developers to disagree.
Comment 10 Adam Barth 2010-07-14 17:04:22 PDT
I think it's important for the spec and the implementation to agree.  I don't have a strong opinion on which behavior is better in this particular topic.  One approach to move forward is to talk with Henri and some of the other Mozilla folks and see if we can come to an agreement.  If WebKit and Firefox agree, it will be hard for the working group to not agree also.
Comment 11 Henri Sivonen 2010-07-15 00:29:20 PDT
See http://hg.mozilla.org/mozilla-central/rev/ae259fec2443

I made the tree builder ignore U+0000-originating U+FFFD in element content when the tree builder is neither in the 'text' state nor in the 'in foreign content' state (the two states that handle <script>, <style> and <textarea> text.)

As far as I can tell, the defense in depth rationale for turning U+0000 into U+FFFD doesn't apply to element content when script, style or textarea isn't involved.

Does this change make sense to you? (Note that Hixie didn't put this in the spec in http://html5.org/tools/web-apps-tracker?from=5155&to=5156 )
Comment 12 Adam Barth 2010-07-15 00:38:02 PDT
> Does this change make sense to you?

Yeah we could do something similar.  Architecturally, we'd have the tree builder set a bit on the tokenizer / input stream preprocessor to not generate FFFD when it's not in the text or foreign content state.  It would then be similar to the CR LF collapsing, but it could collapse an unbounded number of characters, which should be doable.

@ap, would that address your concerns?
Comment 13 Alexey Proskuryakov 2010-07-15 08:10:30 PDT
Yes, that certainly addresses the known practical issues.

In general, I think that the more complicated null handling gets, the more potential security issues there are. The recent security problems with nulls that I know of were all caused by stripping them in tokenizer instead of passing as is, which some software did, and some didn't perform. Converting nulls to U+FFFD will be a unique trait of HTML5 processing, so it will likely always remain an unexpected feature for many.
Comment 14 Adam Barth 2010-07-20 17:47:14 PDT
Apparently this problem also occurs on the US bank login page:
http://code.google.com/p/chromium/issues/detail?id=49720
Comment 15 Evan Martin 2010-07-20 17:49:24 PDT
Another page exhibiting the same issue: usbank's login page.  This was reported by a Chrome users.

You can see the problem without an account here:
https://www4.usbank.com/internetBanking/RequestRouter?requestCmdId=DisplayLoginPageNotYou

Hex snippet of page contents from the cache:
00004cb0:  66  6d  2f  70  65  72  73  6f  6e  61  6c  2f  61  63  68  69  fm/personal/achi
00004cc0:  65  76  65  5f  67  6f  61  6c  73  2f  49  44  5f  74  68  65  eve_goals/ID_the
00004cd0:  66  74  2e  63  66  6d  22  3e  50  72  6f  74  65  63  74  20  ft.cfm">Protect 
00004ce0:  59  6f  75  72  20  49  64  65  6e  74  69  74  79  3c  2f  61  Your Identity</a
00004cf0:  3e  3c  2f  74  64  3e  0a  20  20  20  20  20  20  20  20  20  ></td>.         
00004d00:  20  20  20  20  20  20  20  20  20  3c  2f  74  72  3e  20  3c           </tr> <
00004d10:  2f  74  61  62  6c  65  3e  00  00  00  00  00  00  00  00  00  /table>.........
00004d20:  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  ................
00004d30:  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  ................
00004d40:  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  ................
Comment 16 Henri Sivonen 2010-07-21 01:20:46 PDT
It seems to me that the fix I implemented for Gecko (ignoring U+0000-originating U+FFFD in text content when the tree builder isn't in the 'text' or 'in foreign content' modes) addresses this problem for the sites reported here.
Comment 17 Adam Barth 2010-07-22 14:55:05 PDT
This bug is hard to fix before enabling the new tree builder because the old tree builder doesn't use the same state machine.  It's complicated to fix here too without introducing a per-character branch, but I think I see how to do it.
Comment 18 Adam Barth 2010-07-22 14:56:31 PDT
Created attachment 62344 [details]
Work in progress
Comment 19 Adam Barth 2010-08-05 10:43:14 PDT
I haz a patch.  Gather test results changes.
Comment 20 Adam Barth 2010-08-05 12:16:45 PDT
Created attachment 63614 [details]
Patch
Comment 21 Darin Adler 2010-08-05 12:23:25 PDT
Comment on attachment 63614 [details]
Patch

> +    // Swallowing U+0000 characters isn't in the HTML5 spec, but turning all
> +    // the U+0000 characters into replacement characters has compatibility
> +    // problems problems.

problems problems problems
Comment 22 Adam Barth 2010-08-05 12:45:03 PDT
Created attachment 63620 [details]
Patch
Comment 23 Darin Adler 2010-08-05 13:08:04 PDT
Comment on attachment 63620 [details]
Patch

> -PASS canvas.getContext('2d�') is null
> +PASS canvas.getContext('2d') is null

> -PASS String("�").length is 1
> +PASS String("").length is 1

I think an even better approach to these would be to fix the escapeHTML function in js-test-pre.js to escape null characters. We could do that in a separate pass before making this change.
Comment 24 Adam Barth 2010-08-05 13:31:34 PDT
> I think an even better approach to these would be to fix the escapeHTML function in js-test-pre.js to escape null characters. We could do that in a separate pass before making this change.

Ok.  I'll file a new bug for that.
Comment 25 Eric Seidel (no email) 2010-08-05 14:02:07 PDT
Comment on attachment 63620 [details]
Patch

We came up with better names in person.
Comment 26 Adam Barth 2010-08-05 14:39:16 PDT
> Ok.  I'll file a new bug for that.

https://bugs.webkit.org/show_bug.cgi?id=43579
Comment 27 Adam Barth 2010-08-05 15:35:57 PDT
Created attachment 63648 [details]
Patch
Comment 28 Eric Seidel (no email) 2010-08-05 15:42:34 PDT
Comment on attachment 63648 [details]
Patch

OK.
Comment 29 Adam Barth 2010-08-05 17:12:26 PDT
Comment on attachment 63648 [details]
Patch

Clearing flags on attachment: 63648

Committed r64799: <http://trac.webkit.org/changeset/64799>
Comment 30 Adam Barth 2010-08-05 17:12:35 PDT
All reviewed patches have been landed.  Closing bug.
Comment 31 WebKit Review Bot 2010-08-05 17:49:19 PDT
http://trac.webkit.org/changeset/64799 might have broken Qt Linux Release
The following changes are on the blame list:
http://trac.webkit.org/changeset/64800
http://trac.webkit.org/changeset/64801
http://trac.webkit.org/changeset/64799