Bug 42112

Summary:

U+0000 is turned to U+FFFD (replacement character)

Product:

WebKit

Reporter:

Jungshik Shin <jshin>

Component:

Layout and Rendering

Assignee:

Adam Barth <abarth>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

abarth, ap, eric, evan, hsivonen, ian, mike, vivianz, webkit.review.bot

Priority:

Version:

528+ (Nightly build)

Hardware:

All

OS:

All

URL:

http://big5.xinhuanet.com/gate/big5/www.xinhuanet.com/

Bug Depends on:

Bug Blocks:

41115

Attachments:

Description	Flags
Work in progress	none
Patch	none
Patch	none
Patch	none

Jungshik Shin

Reported 2010-07-12 14:24:38 PDT

The page in the url field has td's like this sprinkled all over the place <td>^@^@</td> where ^@ stands for U+0000. They're rendered with a glyph for replacement character (U+0000). This is a recent regression. It's reported against Chromium at http://crbug.com/48603

Attachments
Work in progress (5.67 KB, patch) 2010-07-22 14:56 PDT, Adam Barth	no flags	Details Formatted Diff Diff
Patch (21.86 KB, patch) 2010-08-05 12:16 PDT, Adam Barth	no flags	Details Formatted Diff Diff
Patch (21.85 KB, patch) 2010-08-05 12:45 PDT, Adam Barth	no flags	Details Formatted Diff Diff
Patch (13.89 KB, patch) 2010-08-05 15:35 PDT, Adam Barth	no flags	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Jungshik Shin

Comment 1 2010-07-12 15:18:56 PDT

It's strange that this regressed because I don't see a recent change relevant to the treatment of U+0000 in Font.h and GlyphPageTreeNode.cpp

Jungshik Shin

Comment 2 2010-07-12 16:15:21 PDT

It seems that it regressed between 461229 and r61316

Alexey Proskuryakov

Comment 3 2010-07-13 16:58:10 PDT

HTML5 requires that zero bytes are decoded as U+FFFD. I think that this is a mistake, but I haven't been successful at convincing HTML5 powers.

Adam Barth

Comment 4 2010-07-13 17:03:26 PDT

Thanks. These are nice examples.

Adam Barth

Comment 5 2010-07-13 17:27:40 PDT

Our layout is also screwy on that page.

Jungshik Shin

Comment 6 2010-07-14 13:42:28 PDT

(In reply to comment #3) > HTML5 requires that zero bytes are decoded as U+FFFD. I think that this is a mistake, but I haven't been successful at convincing HTML5 powers. Is it about HTML5 10.2.4.70 (http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html) or something else? 10.2.4.70 seems to only about NCR processing. For my reference: it's done in http://trac.webkit.org/changeset/60791

Jungshik Shin

Comment 7 2010-07-14 13:45:37 PDT

Sorry that I missed a reference to the spec in HTMLTokenizer (http://www.whatwg.org/specs/web-apps/current-work/#preprocessing-the-input-stream)

Adam Barth

Comment 8 2010-07-14 15:25:31 PDT

[3:24pm] Hixie: abarth: yeah, it's basically intentional [3:24pm] Hixie: abarth: there shouldn't be NULs there at all and they're likely to be triggering all kinds of security problems [3:24pm] Hixie: abarth: making it ugly is one way to draw attention to it

Alexey Proskuryakov

Comment 9 2010-07-14 16:16:18 PDT

It doesn't seem that we're going to get much sympathy from WhatWG, so the options are basically to ignore such content, or to avoid implementing this aspect of HTML5. My personal preference would be the latter, since it's a useless and incompatible change, but I expect many other WebKit developers to disagree.

Adam Barth

Comment 10 2010-07-14 17:04:22 PDT

I think it's important for the spec and the implementation to agree. I don't have a strong opinion on which behavior is better in this particular topic. One approach to move forward is to talk with Henri and some of the other Mozilla folks and see if we can come to an agreement. If WebKit and Firefox agree, it will be hard for the working group to not agree also.

Henri Sivonen

Comment 11 2010-07-15 00:29:20 PDT

See http://hg.mozilla.org/mozilla-central/rev/ae259fec2443 I made the tree builder ignore U+0000-originating U+FFFD in element content when the tree builder is neither in the 'text' state nor in the 'in foreign content' state (the two states that handle <script>, <style> and <textarea> text.) As far as I can tell, the defense in depth rationale for turning U+0000 into U+FFFD doesn't apply to element content when script, style or textarea isn't involved. Does this change make sense to you? (Note that Hixie didn't put this in the spec in http://html5.org/tools/web-apps-tracker?from=5155&to=5156 )

Adam Barth

Comment 12 2010-07-15 00:38:02 PDT

> Does this change make sense to you? Yeah we could do something similar. Architecturally, we'd have the tree builder set a bit on the tokenizer / input stream preprocessor to not generate FFFD when it's not in the text or foreign content state. It would then be similar to the CR LF collapsing, but it could collapse an unbounded number of characters, which should be doable. @ap, would that address your concerns?

Alexey Proskuryakov

Comment 13 2010-07-15 08:10:30 PDT

Yes, that certainly addresses the known practical issues. In general, I think that the more complicated null handling gets, the more potential security issues there are. The recent security problems with nulls that I know of were all caused by stripping them in tokenizer instead of passing as is, which some software did, and some didn't perform. Converting nulls to U+FFFD will be a unique trait of HTML5 processing, so it will likely always remain an unexpected feature for many.

Adam Barth

Comment 14 2010-07-20 17:47:14 PDT

Apparently this problem also occurs on the US bank login page: http://code.google.com/p/chromium/issues/detail?id=49720

Evan Martin

Comment 15 2010-07-20 17:49:24 PDT

Another page exhibiting the same issue: usbank's login page. This was reported by a Chrome users. You can see the problem without an account here: https://www4.usbank.com/internetBanking/RequestRouter?requestCmdId=DisplayLoginPageNotYou Hex snippet of page contents from the cache: 00004cb0: 66 6d 2f 70 65 72 73 6f 6e 61 6c 2f 61 63 68 69 fm/personal/achi 00004cc0: 65 76 65 5f 67 6f 61 6c 73 2f 49 44 5f 74 68 65 eve_goals/ID_the 00004cd0: 66 74 2e 63 66 6d 22 3e 50 72 6f 74 65 63 74 20 ft.cfm">Protect 00004ce0: 59 6f 75 72 20 49 64 65 6e 74 69 74 79 3c 2f 61 Your Identity</a 00004cf0: 3e 3c 2f 74 64 3e 0a 20 20 20 20 20 20 20 20 20 ></td>. 00004d00: 20 20 20 20 20 20 20 20 20 3c 2f 74 72 3e 20 3c </tr> < 00004d10: 2f 74 61 62 6c 65 3e 00 00 00 00 00 00 00 00 00 /table>......... 00004d20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00004d30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00004d40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

Henri Sivonen

Comment 16 2010-07-21 01:20:46 PDT

It seems to me that the fix I implemented for Gecko (ignoring U+0000-originating U+FFFD in text content when the tree builder isn't in the 'text' or 'in foreign content' modes) addresses this problem for the sites reported here.

Adam Barth

Comment 17 2010-07-22 14:55:05 PDT

This bug is hard to fix before enabling the new tree builder because the old tree builder doesn't use the same state machine. It's complicated to fix here too without introducing a per-character branch, but I think I see how to do it.

Adam Barth

Comment 18 2010-07-22 14:56:31 PDT

Created attachment 62344 [details] Work in progress

Adam Barth

Comment 19 2010-08-05 10:43:14 PDT

I haz a patch. Gather test results changes.

Adam Barth

Comment 20 2010-08-05 12:16:45 PDT

Created attachment 63614 [details] Patch

Darin Adler

Comment 21 2010-08-05 12:23:25 PDT

Comment on attachment 63614 [details] Patch > + // Swallowing U+0000 characters isn't in the HTML5 spec, but turning all > + // the U+0000 characters into replacement characters has compatibility > + // problems problems. problems problems problems

Adam Barth

Comment 22 2010-08-05 12:45:03 PDT

Created attachment 63620 [details] Patch

Darin Adler

Comment 23 2010-08-05 13:08:04 PDT

Comment on attachment 63620 [details] Patch > -PASS canvas.getContext('2dï¿½') is null > +PASS canvas.getContext('2d') is null > -PASS String("ï¿½").length is 1 > +PASS String("").length is 1 I think an even better approach to these would be to fix the escapeHTML function in js-test-pre.js to escape null characters. We could do that in a separate pass before making this change.

Adam Barth

Comment 24 2010-08-05 13:31:34 PDT

> I think an even better approach to these would be to fix the escapeHTML function in js-test-pre.js to escape null characters. We could do that in a separate pass before making this change. Ok. I'll file a new bug for that.

Eric Seidel (no email)

Comment 25 2010-08-05 14:02:07 PDT

Comment on attachment 63620 [details] Patch We came up with better names in person.

Adam Barth

Comment 26 2010-08-05 14:39:16 PDT

> Ok. I'll file a new bug for that. https://bugs.webkit.org/show_bug.cgi?id=43579

Adam Barth

Comment 27 2010-08-05 15:35:57 PDT

Created attachment 63648 [details] Patch

Eric Seidel (no email)

Comment 28 2010-08-05 15:42:34 PDT

Comment on attachment 63648 [details] Patch OK.