7602 – Only use fixupChar for entities

RESOLVED FIXED7602

Only use fixupChar for entities

https://bugs.webkit.org/show_bug.cgi?id=7602

Summary Only use fixupChar for entities

Alexey Proskuryakov

Reported 2006-03-04 13:58:10 PST

HTMLTokenizer uses fixupChar() to ensure that the incoming data is treated as windows-1252, not the real Latin-1. This mostly duplicates the logic in StreamingTextDecoder (see effectiveEncoding()). More importantly, shuffling code points in HTMLTokenizer affects cases where the character is already known for certain, such as: document.write("\u0080"); // writes 20AC (Euro sign)

Attachments
test case (525 bytes, text/html) 2006-03-04 13:58 PST, Alexey Proskuryakov	no flags	Details
proposed patch (4.08 KB, patch) 2006-03-04 14:15 PST, Alexey Proskuryakov	darin: review+	Details Formatted Diff Diff
test case to distinguish latin1 / winlatin1 parsing (982 bytes, text/html) 2006-03-04 17:24 PST, Maciej Stachowiak	no flags	Details
View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2006-03-04 13:58:44 PST

Created attachment 6855 [details] test case

Alexey Proskuryakov

Comment 2 2006-03-04 14:15:08 PST

Created attachment 6857 [details] proposed patch Besides limiting fixupChar to entities, this removes a semi-broken Latin-1 special case from StreamingTextDecoder. As it stood, this special case was pushing both Latin-1 and windows-1252 into QString's ASCII buffer, which was then converted to Unicode in QString::makeUnicode, as if it were Latin-1. Since the tokenizer called fixupChar on the result, it (usually?) didn't matter, though. If this change negatively affects performance, then we will probably need to postulate that QString's ASCII buffer in fact contains windows-1252 data, and adjust makeUnicode accordingly.

Darin Adler

Comment 3 2006-03-04 16:44:23 PST

Comment on attachment 6857 [details] proposed patch Two issues prevent landing this: 1) We need to do performance testing. When I added a fast case for Latin-1 that relied on fixupChar, we saw a significant performance boost. Even if we don't need the super-fast "just put it in the ASCII buffer" version, we'll probably need to consider making a path that does not go through ICU. 2) I believe this may remove the quirk in our existing code that ISO Latin-1 effectively means Windows Latin-1 -- I believe we need that quirk. Maybe Alexey can handle issue (2). Someone at Apple will probably have to tackle (1) since we haven't yet found a good way to measure performance that people outside the Safari team can do easily.

Maciej Stachowiak

Comment 4 2006-03-04 17:23:21 PST

You are write that document.write("\u0080"); should not apply the fixup hack. However, characters that are present in the original source text of a supposedly latin1 document *do* get treated as latin1 instead. So I think your patch may be backing off a bit too much. See attached test case.

Maciej Stachowiak

Comment 5 2006-03-04 17:24:57 PST

Created attachment 6859 [details] test case to distinguish latin1 / winlatin1 parsing On this test case, Firefox treats the first two examples as WinLatin1 characters and the latter two as Latin1 (i.e. at their true unicode value).

Alexey Proskuryakov

Comment 6 2006-03-05 00:36:04 PST

Comment on attachment 6857 [details] proposed patch Actually, the Latin-1/windows-1252 quirk is still enforced by effectiveEncoding(). It is covered by existing test cases, although indirectly, so adding Maciej's test case will be IMO useful, too. It passes with a single difference from Firefox - the latter maps 0x81 to U+FFFD, while fixupChar leaves it at U+0081. Resetting the r? flag for performance testing...

Darin Adler

Comment 7 2006-03-24 08:21:50 PST

Comment on attachment 6857 [details] proposed patch r=me

Justin Garcia

Comment 8 2006-03-28 03:31:20 PST

Oops I moved this to fixed accidently.

Alexey Proskuryakov

Comment 9 2006-04-13 10:01:17 PDT

Landed, r13863.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 420+

Hardware Mac

OS OS X 10.4

Product WebKit

Component DOM

Assignee

Alexey Proskuryakov

Reported

2006-03-04 13:58 PST

Modified

2006-04-13 10:01 PDT History

CC List

0 users

URL

Keywords

Depends on

Blocks

7518

Dependencies

tree graph