4820 – hexadecimal HTML entities split across TCP packets are not parsed correctly

VERIFIED FIXED4820

hexadecimal HTML entities split across TCP packets are not parsed correctly

https://bugs.webkit.org/show_bug.cgi?id=4820

Summary hexadecimal HTML entities split across TCP packets are not parsed correctly

Alexey Proskuryakov

Reported 2005-09-03 01:30:35 PDT

If an HTML entity is not completely contained in one TCP packet, it may be not handled as such, and get rendered as plain text instead. Steps to reproduce: Load the page from bug URL. Won't reproduce if loaded from a local file or from cache, of course. Regression: present in Safari 2.0 (as of Mac OS X 10.4.2 without updates); haven't checked earlier versions.

Attachments
proposed patch (4.40 KB, patch) 2005-09-04 11:17 PDT, Alexey Proskuryakov	darin: review-	Details Formatted Diff Diff
test case for the older bug (356 bytes, text/html) 2005-09-04 13:14 PDT, Alexey Proskuryakov	no flags	Details
proposed patch (1.13 KB, patch) 2005-09-04 13:17 PDT, Alexey Proskuryakov	darin: review+	Details Formatted Diff Diff
Long entities test (1.56 KB, text/html) 2005-09-06 03:42 PDT, Alexey Proskuryakov	no flags	Details
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Mark Rowe (bdash)

Comment 1 2005-09-03 01:47:25 PDT

Confirmed with WebKit 412.7 and ToT.

Darin Adler

Comment 2 2005-09-03 13:22:59 PDT

The issue is likely in htmltokenizer.cpp. The code looks like it's almost right to handle this case, just not quite.

Alexey Proskuryakov

Comment 3 2005-09-04 11:17:30 PDT

Created attachment 3755 [details] proposed patch Apparently, it was only failing for hexadecimal entities - so, I've just copied a couple of lines from a decimal case. This patch also fixes a bunch of compilation errors with TOKEN_DEBUG defined.

Alexey Proskuryakov

Comment 4 2005-09-04 11:24:31 PDT

Is XML DOM indeed a correct component for this? Perhaps, I do not understand the differences well enough, but this doesn't sound related to XML at all (maybe HTML DOM?)

Darin Adler

Comment 5 2005-09-04 12:27:59 PDT

Comment on attachment 3755 [details] proposed patch I believe this patch reintroduces a bug we fixed a while back. It was trying to fix that one that we introduced this bug. This was back when we didn't always make tests for the bugs we fixed (bad!) so there's no layout test. The original fix was back on 2003-11-17, you can see it in the ChangeLog. 3485925: Safari does not correctly parse eight-digit hex character entities Also, please don't include the TOKEN_DEBUG changes along with the bug fix. Lets handle those separately. Here's some text from the original bug report: -------- Safari does not correctly parse eight-digit hex character entities. I noticed this at <http://www.alanwood.net/unicode/deseret.html>. "&#x0010400;" in HTML works fine (gives me the glyph for U+10400, "DESERET CAPITAL LETTER LONG I"). But if I use "&#x00010400;", the page renders incorrectly; I get a Last Resort glyph followed by "0;" It looks like the numeric entity parser only looks seven digits into the hex string, so displays U+1040 followed by "0;", instead of realizing that it's all one entity and displaying U+10400. -------- So we should make a new fix that works properly for both and probably make a layout test for the older bug fix.

Alexey Proskuryakov

Comment 6 2005-09-04 13:14:36 PDT

Created attachment 3760 [details] test case for the older bug

Alexey Proskuryakov

Comment 7 2005-09-04 13:17:15 PDT

Created attachment 3761 [details] proposed patch Just changed the length from 9 to 10 - this seems to fix the issue... I filed TOKEN_DEBUG patch as bug 4849.

Darin Adler

Comment 8 2005-09-04 13:52:58 PDT

Comment on attachment 3761 [details] proposed patch OK, r=me.

Darin Adler

Comment 9 2005-09-05 10:09:36 PDT

Gecko seems to allow hexadecimal entities with any number of digits; it lets extra leading digits overflow off the left.

Darin Adler

Comment 10 2005-09-05 10:15:51 PDT

I added the test case to layout tests. Perhaps we should change the hexadecimal entity code to not have a limit on number of digits to match Gecko/Firefox? Seems that would be simple. I guess it depends on what the behavior of WinIE is, and probably worth investigating too-long decimal entities too.

Alexey Proskuryakov

Comment 11 2005-09-06 03:42:13 PDT

Created attachment 3783 [details] Long entities test Here is a test matrix for long and otherwise invalid entities with results from several browsers. Mozilla bug report about bignum entities is here: <https://bugzilla.mozilla.org/show_bug.cgi?id=215622>.

Darin Adler

Comment 12 2005-09-10 12:56:44 PDT

Do the matrix entries for WinIE that say "?" mean an actual question mark character? Do the empty matrix entries for MacIE mean that no characters at all were emitted? It seems like our behavior is not much like other browsers -- we should probably do a little work in this area? What about when non-digit characters are included before the semicolon?

Alexey Proskuryakov

Comment 13 2005-09-10 13:41:08 PDT

(In reply to comment #12) > Do the matrix entries for WinIE that say "?" mean an actual question mark character? Do the empty > matrix entries for MacIE mean that no characters at all were emitted? Yes in both cases. > It seems like our behavior is not much like other browsers -- we should probably do a little work in > this area? Probably... Would you prefer to have this done while resolving this bug, or in a separate one? > What about when non-digit characters are included before the semicolon? I would expect them to bail out immediately (otherwise, it's too easy to lose a lot of content because of a non-terminated entity)... I'll check on Monday. A note at <http://www.w3.org/TR/REC-html40/charset.html#entities> suggests some more evil cases to possibly deal with: >In SGML, it is possible to eliminate the final ";" after a character >reference in some cases (e.g., at a line break or immediately before >a tag). In other circumstances it may not be eliminated (e.g., in the >middle of a word). We strongly suggest using the ";" in all cases >to avoid problems with user agents that require this character to be present.

Darin Adler

Comment 14 2005-09-10 14:07:20 PDT

I think we should make one or more new bug reports for the issues we turned up here. I'm going to land your patch to fix the "split across two separate tokenizer writes" issue and close this bug report.

Alexey Proskuryakov

Comment 15 2005-09-12 01:48:13 PDT

(In reply to comment #14) Done; bug 4948.

Note You need to log in before you can comment on or make changes to this bug.

Status VERIFIED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 420+

Hardware Mac

OS OS X 10.4

Product WebKit

Component DOM

Assignee

Darin Adler

Reported

2005-09-03 01:30 PDT

Modified

2005-09-12 01:48 PDT History

CC List

2 users Show

URL

http://nypop.com/~ap/webkit/spaces.html

Keywords

Depends on

Blocks