WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
7602
Only use fixupChar for entities
https://bugs.webkit.org/show_bug.cgi?id=7602
Summary
Only use fixupChar for entities
Alexey Proskuryakov
Reported
2006-03-04 13:58:10 PST
HTMLTokenizer uses fixupChar() to ensure that the incoming data is treated as windows-1252, not the real Latin-1. This mostly duplicates the logic in StreamingTextDecoder (see effectiveEncoding()). More importantly, shuffling code points in HTMLTokenizer affects cases where the character is already known for certain, such as: document.write("\u0080"); // writes 20AC (Euro sign)
Attachments
test case
(525 bytes, text/html)
2006-03-04 13:58 PST
,
Alexey Proskuryakov
no flags
Details
proposed patch
(4.08 KB, patch)
2006-03-04 14:15 PST
,
Alexey Proskuryakov
darin
: review+
Details
Formatted Diff
Diff
test case to distinguish latin1 / winlatin1 parsing
(982 bytes, text/html)
2006-03-04 17:24 PST
,
Maciej Stachowiak
no flags
Details
View All
Add attachment
proposed patch, testcase, etc.
Alexey Proskuryakov
Comment 1
2006-03-04 13:58:44 PST
Created
attachment 6855
[details]
test case
Alexey Proskuryakov
Comment 2
2006-03-04 14:15:08 PST
Created
attachment 6857
[details]
proposed patch Besides limiting fixupChar to entities, this removes a semi-broken Latin-1 special case from StreamingTextDecoder. As it stood, this special case was pushing both Latin-1 and windows-1252 into QString's ASCII buffer, which was then converted to Unicode in QString::makeUnicode, as if it were Latin-1. Since the tokenizer called fixupChar on the result, it (usually?) didn't matter, though. If this change negatively affects performance, then we will probably need to postulate that QString's ASCII buffer in fact contains windows-1252 data, and adjust makeUnicode accordingly.
Darin Adler
Comment 3
2006-03-04 16:44:23 PST
Comment on
attachment 6857
[details]
proposed patch Two issues prevent landing this: 1) We need to do performance testing. When I added a fast case for Latin-1 that relied on fixupChar, we saw a significant performance boost. Even if we don't need the super-fast "just put it in the ASCII buffer" version, we'll probably need to consider making a path that does not go through ICU. 2) I believe this may remove the quirk in our existing code that ISO Latin-1 effectively means Windows Latin-1 -- I believe we need that quirk. Maybe Alexey can handle issue (2). Someone at Apple will probably have to tackle (1) since we haven't yet found a good way to measure performance that people outside the Safari team can do easily.
Maciej Stachowiak
Comment 4
2006-03-04 17:23:21 PST
You are write that document.write("\u0080"); should not apply the fixup hack. However, characters that are present in the original source text of a supposedly latin1 document *do* get treated as latin1 instead. So I think your patch may be backing off a bit too much. See attached test case.
Maciej Stachowiak
Comment 5
2006-03-04 17:24:57 PST
Created
attachment 6859
[details]
test case to distinguish latin1 / winlatin1 parsing On this test case, Firefox treats the first two examples as WinLatin1 characters and the latter two as Latin1 (i.e. at their true unicode value).
Alexey Proskuryakov
Comment 6
2006-03-05 00:36:04 PST
Comment on
attachment 6857
[details]
proposed patch Actually, the Latin-1/windows-1252 quirk is still enforced by effectiveEncoding(). It is covered by existing test cases, although indirectly, so adding Maciej's test case will be IMO useful, too. It passes with a single difference from Firefox - the latter maps 0x81 to U+FFFD, while fixupChar leaves it at U+0081. Resetting the r? flag for performance testing...
Darin Adler
Comment 7
2006-03-24 08:21:50 PST
Comment on
attachment 6857
[details]
proposed patch r=me
Justin Garcia
Comment 8
2006-03-28 03:31:20 PST
Oops I moved this to fixed accidently.
Alexey Proskuryakov
Comment 9
2006-04-13 10:01:17 PDT
Landed,
r13863
.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug