Bug 7602

Summary: Only use fixupChar for entities
Product: WebKit Reporter: Alexey Proskuryakov <ap>
Component: DOMAssignee: Alexey Proskuryakov <ap>
Status: RESOLVED FIXED    
Severity: Normal    
Priority: P2    
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
Bug Depends on:    
Bug Blocks: 7518    
Attachments:
Description Flags
test case
none
proposed patch
darin: review+
test case to distinguish latin1 / winlatin1 parsing none

Alexey Proskuryakov
Reported 2006-03-04 13:58:10 PST
HTMLTokenizer uses fixupChar() to ensure that the incoming data is treated as windows-1252, not the real Latin-1. This mostly duplicates the logic in StreamingTextDecoder (see effectiveEncoding()). More importantly, shuffling code points in HTMLTokenizer affects cases where the character is already known for certain, such as: document.write("\u0080"); // writes 20AC (Euro sign)
Attachments
test case (525 bytes, text/html)
2006-03-04 13:58 PST, Alexey Proskuryakov
no flags
proposed patch (4.08 KB, patch)
2006-03-04 14:15 PST, Alexey Proskuryakov
darin: review+
test case to distinguish latin1 / winlatin1 parsing (982 bytes, text/html)
2006-03-04 17:24 PST, Maciej Stachowiak
no flags
Alexey Proskuryakov
Comment 1 2006-03-04 13:58:44 PST
Created attachment 6855 [details] test case
Alexey Proskuryakov
Comment 2 2006-03-04 14:15:08 PST
Created attachment 6857 [details] proposed patch Besides limiting fixupChar to entities, this removes a semi-broken Latin-1 special case from StreamingTextDecoder. As it stood, this special case was pushing both Latin-1 and windows-1252 into QString's ASCII buffer, which was then converted to Unicode in QString::makeUnicode, as if it were Latin-1. Since the tokenizer called fixupChar on the result, it (usually?) didn't matter, though. If this change negatively affects performance, then we will probably need to postulate that QString's ASCII buffer in fact contains windows-1252 data, and adjust makeUnicode accordingly.
Darin Adler
Comment 3 2006-03-04 16:44:23 PST
Comment on attachment 6857 [details] proposed patch Two issues prevent landing this: 1) We need to do performance testing. When I added a fast case for Latin-1 that relied on fixupChar, we saw a significant performance boost. Even if we don't need the super-fast "just put it in the ASCII buffer" version, we'll probably need to consider making a path that does not go through ICU. 2) I believe this may remove the quirk in our existing code that ISO Latin-1 effectively means Windows Latin-1 -- I believe we need that quirk. Maybe Alexey can handle issue (2). Someone at Apple will probably have to tackle (1) since we haven't yet found a good way to measure performance that people outside the Safari team can do easily.
Maciej Stachowiak
Comment 4 2006-03-04 17:23:21 PST
You are write that document.write("\u0080"); should not apply the fixup hack. However, characters that are present in the original source text of a supposedly latin1 document *do* get treated as latin1 instead. So I think your patch may be backing off a bit too much. See attached test case.
Maciej Stachowiak
Comment 5 2006-03-04 17:24:57 PST
Created attachment 6859 [details] test case to distinguish latin1 / winlatin1 parsing On this test case, Firefox treats the first two examples as WinLatin1 characters and the latter two as Latin1 (i.e. at their true unicode value).
Alexey Proskuryakov
Comment 6 2006-03-05 00:36:04 PST
Comment on attachment 6857 [details] proposed patch Actually, the Latin-1/windows-1252 quirk is still enforced by effectiveEncoding(). It is covered by existing test cases, although indirectly, so adding Maciej's test case will be IMO useful, too. It passes with a single difference from Firefox - the latter maps 0x81 to U+FFFD, while fixupChar leaves it at U+0081. Resetting the r? flag for performance testing...
Darin Adler
Comment 7 2006-03-24 08:21:50 PST
Comment on attachment 6857 [details] proposed patch r=me
Justin Garcia
Comment 8 2006-03-28 03:31:20 PST
Oops I moved this to fixed accidently.
Alexey Proskuryakov
Comment 9 2006-04-13 10:01:17 PDT
Landed, r13863.
Note You need to log in before you can comment on or make changes to this bug.