WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
Bug 42112
U+0000 is turned to U+FFFD (replacement character)
https://bugs.webkit.org/show_bug.cgi?id=42112
Summary
U+0000 is turned to U+FFFD (replacement character)
Jungshik Shin
Reported
2010-07-12 14:24:38 PDT
The page in the url field has td's like this sprinkled all over the place <td>^@^@</td> where ^@ stands for U+0000. They're rendered with a glyph for replacement character (U+0000). This is a recent regression. It's reported against Chromium at
http://crbug.com/48603
Attachments
Work in progress
(5.67 KB, patch)
2010-07-22 14:56 PDT
,
Adam Barth
no flags
Details
Formatted Diff
Diff
Patch
(21.86 KB, patch)
2010-08-05 12:16 PDT
,
Adam Barth
no flags
Details
Formatted Diff
Diff
Patch
(21.85 KB, patch)
2010-08-05 12:45 PDT
,
Adam Barth
no flags
Details
Formatted Diff
Diff
Patch
(13.89 KB, patch)
2010-08-05 15:35 PDT
,
Adam Barth
no flags
Details
Formatted Diff
Diff
Show Obsolete
(3)
View All
Add attachment
proposed patch, testcase, etc.
Jungshik Shin
Comment 1
2010-07-12 15:18:56 PDT
It's strange that this regressed because I don't see a recent change relevant to the treatment of U+0000 in Font.h and GlyphPageTreeNode.cpp
Jungshik Shin
Comment 2
2010-07-12 16:15:21 PDT
It seems that it regressed between 461229 and
r61316
Alexey Proskuryakov
Comment 3
2010-07-13 16:58:10 PDT
HTML5 requires that zero bytes are decoded as U+FFFD. I think that this is a mistake, but I haven't been successful at convincing HTML5 powers.
Adam Barth
Comment 4
2010-07-13 17:03:26 PDT
Thanks. These are nice examples.
Adam Barth
Comment 5
2010-07-13 17:27:40 PDT
Our layout is also screwy on that page.
Jungshik Shin
Comment 6
2010-07-14 13:42:28 PDT
(In reply to
comment #3
)
> HTML5 requires that zero bytes are decoded as U+FFFD. I think that this is a mistake, but I haven't been successful at convincing HTML5 powers.
Is it about HTML5 10.2.4.70 (
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html
) or something else? 10.2.4.70 seems to only about NCR processing. For my reference: it's done in
http://trac.webkit.org/changeset/60791
Jungshik Shin
Comment 7
2010-07-14 13:45:37 PDT
Sorry that I missed a reference to the spec in HTMLTokenizer (
http://www.whatwg.org/specs/web-apps/current-work/#preprocessing-the-input-stream
)
Adam Barth
Comment 8
2010-07-14 15:25:31 PDT
[3:24pm] Hixie: abarth: yeah, it's basically intentional [3:24pm] Hixie: abarth: there shouldn't be NULs there at all and they're likely to be triggering all kinds of security problems [3:24pm] Hixie: abarth: making it ugly is one way to draw attention to it
Alexey Proskuryakov
Comment 9
2010-07-14 16:16:18 PDT
It doesn't seem that we're going to get much sympathy from WhatWG, so the options are basically to ignore such content, or to avoid implementing this aspect of HTML5. My personal preference would be the latter, since it's a useless and incompatible change, but I expect many other WebKit developers to disagree.
Adam Barth
Comment 10
2010-07-14 17:04:22 PDT
I think it's important for the spec and the implementation to agree. I don't have a strong opinion on which behavior is better in this particular topic. One approach to move forward is to talk with Henri and some of the other Mozilla folks and see if we can come to an agreement. If WebKit and Firefox agree, it will be hard for the working group to not agree also.
Henri Sivonen
Comment 11
2010-07-15 00:29:20 PDT
See
http://hg.mozilla.org/mozilla-central/rev/ae259fec2443
I made the tree builder ignore U+0000-originating U+FFFD in element content when the tree builder is neither in the 'text' state nor in the 'in foreign content' state (the two states that handle <script>, <style> and <textarea> text.) As far as I can tell, the defense in depth rationale for turning U+0000 into U+FFFD doesn't apply to element content when script, style or textarea isn't involved. Does this change make sense to you? (Note that Hixie didn't put this in the spec in
http://html5.org/tools/web-apps-tracker?from=5155&to=5156
)
Adam Barth
Comment 12
2010-07-15 00:38:02 PDT
> Does this change make sense to you?
Yeah we could do something similar. Architecturally, we'd have the tree builder set a bit on the tokenizer / input stream preprocessor to not generate FFFD when it's not in the text or foreign content state. It would then be similar to the CR LF collapsing, but it could collapse an unbounded number of characters, which should be doable. @ap, would that address your concerns?
Alexey Proskuryakov
Comment 13
2010-07-15 08:10:30 PDT
Yes, that certainly addresses the known practical issues. In general, I think that the more complicated null handling gets, the more potential security issues there are. The recent security problems with nulls that I know of were all caused by stripping them in tokenizer instead of passing as is, which some software did, and some didn't perform. Converting nulls to U+FFFD will be a unique trait of HTML5 processing, so it will likely always remain an unexpected feature for many.
Adam Barth
Comment 14
2010-07-20 17:47:14 PDT
Apparently this problem also occurs on the US bank login page:
http://code.google.com/p/chromium/issues/detail?id=49720
Evan Martin
Comment 15
2010-07-20 17:49:24 PDT
Another page exhibiting the same issue: usbank's login page. This was reported by a Chrome users. You can see the problem without an account here:
https://www4.usbank.com/internetBanking/RequestRouter?requestCmdId=DisplayLoginPageNotYou
Hex snippet of page contents from the cache: 00004cb0: 66 6d 2f 70 65 72 73 6f 6e 61 6c 2f 61 63 68 69 fm/personal/achi 00004cc0: 65 76 65 5f 67 6f 61 6c 73 2f 49 44 5f 74 68 65 eve_goals/ID_the 00004cd0: 66 74 2e 63 66 6d 22 3e 50 72 6f 74 65 63 74 20 ft.cfm">Protect 00004ce0: 59 6f 75 72 20 49 64 65 6e 74 69 74 79 3c 2f 61 Your Identity</a 00004cf0: 3e 3c 2f 74 64 3e 0a 20 20 20 20 20 20 20 20 20 ></td>. 00004d00: 20 20 20 20 20 20 20 20 20 3c 2f 74 72 3e 20 3c </tr> < 00004d10: 2f 74 61 62 6c 65 3e 00 00 00 00 00 00 00 00 00 /table>......... 00004d20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00004d30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00004d40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Henri Sivonen
Comment 16
2010-07-21 01:20:46 PDT
It seems to me that the fix I implemented for Gecko (ignoring U+0000-originating U+FFFD in text content when the tree builder isn't in the 'text' or 'in foreign content' modes) addresses this problem for the sites reported here.
Adam Barth
Comment 17
2010-07-22 14:55:05 PDT
This bug is hard to fix before enabling the new tree builder because the old tree builder doesn't use the same state machine. It's complicated to fix here too without introducing a per-character branch, but I think I see how to do it.
Adam Barth
Comment 18
2010-07-22 14:56:31 PDT
Created
attachment 62344
[details]
Work in progress
Adam Barth
Comment 19
2010-08-05 10:43:14 PDT
I haz a patch. Gather test results changes.
Adam Barth
Comment 20
2010-08-05 12:16:45 PDT
Created
attachment 63614
[details]
Patch
Darin Adler
Comment 21
2010-08-05 12:23:25 PDT
Comment on
attachment 63614
[details]
Patch
> + // Swallowing U+0000 characters isn't in the HTML5 spec, but turning all > + // the U+0000 characters into replacement characters has compatibility > + // problems problems.
problems problems problems
Adam Barth
Comment 22
2010-08-05 12:45:03 PDT
Created
attachment 63620
[details]
Patch
Darin Adler
Comment 23
2010-08-05 13:08:04 PDT
Comment on
attachment 63620
[details]
Patch
> -PASS canvas.getContext('2d�') is null > +PASS canvas.getContext('2d') is null
> -PASS String("�").length is 1 > +PASS String("").length is 1
I think an even better approach to these would be to fix the escapeHTML function in js-test-pre.js to escape null characters. We could do that in a separate pass before making this change.
Adam Barth
Comment 24
2010-08-05 13:31:34 PDT
> I think an even better approach to these would be to fix the escapeHTML function in js-test-pre.js to escape null characters. We could do that in a separate pass before making this change.
Ok. I'll file a new bug for that.
Eric Seidel (no email)
Comment 25
2010-08-05 14:02:07 PDT
Comment on
attachment 63620
[details]
Patch We came up with better names in person.
Adam Barth
Comment 26
2010-08-05 14:39:16 PDT
> Ok. I'll file a new bug for that.
https://bugs.webkit.org/show_bug.cgi?id=43579
Adam Barth
Comment 27
2010-08-05 15:35:57 PDT
Created
attachment 63648
[details]
Patch
Eric Seidel (no email)
Comment 28
2010-08-05 15:42:34 PDT
Comment on
attachment 63648
[details]
Patch OK.
Adam Barth
Comment 29
2010-08-05 17:12:26 PDT
Comment on
attachment 63648
[details]
Patch Clearing flags on attachment: 63648 Committed
r64799
: <
http://trac.webkit.org/changeset/64799
>
Adam Barth
Comment 30
2010-08-05 17:12:35 PDT
All reviewed patches have been landed. Closing bug.
WebKit Review Bot
Comment 31
2010-08-05 17:49:19 PDT
http://trac.webkit.org/changeset/64799
might have broken Qt Linux Release The following changes are on the blame list:
http://trac.webkit.org/changeset/64800
http://trac.webkit.org/changeset/64801
http://trac.webkit.org/changeset/64799
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug