Bug 7602

Summary: Only use fixupChar for entities
Product: WebKit Reporter: Alexey Proskuryakov <ap>
Component: DOMAssignee: Alexey Proskuryakov <ap>
Status: RESOLVED FIXED    
Severity: Normal    
Priority: P2    
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
Bug Depends on:    
Bug Blocks: 7518    
Attachments:
Description Flags
test case
none
proposed patch
darin: review+
test case to distinguish latin1 / winlatin1 parsing none

Description Alexey Proskuryakov 2006-03-04 13:58:10 PST
HTMLTokenizer uses fixupChar() to ensure that the incoming data is treated as windows-1252, not the real Latin-1. This mostly duplicates the logic in StreamingTextDecoder (see effectiveEncoding()). More importantly, shuffling code points in HTMLTokenizer affects cases where the character is already known for certain, such as:

document.write("\u0080"); // writes 20AC (Euro sign)
Comment 1 Alexey Proskuryakov 2006-03-04 13:58:44 PST
Created attachment 6855 [details]
test case
Comment 2 Alexey Proskuryakov 2006-03-04 14:15:08 PST
Created attachment 6857 [details]
proposed patch

Besides limiting fixupChar to entities, this removes a semi-broken Latin-1 special case from StreamingTextDecoder. As it stood, this special case was pushing both Latin-1 and windows-1252 into QString's ASCII buffer, which was then converted to Unicode in QString::makeUnicode, as if it were Latin-1. Since the tokenizer called fixupChar on the result, it (usually?) didn't matter, though.

If this change negatively affects performance, then we will probably need to postulate that QString's ASCII buffer in fact contains windows-1252 data, and adjust makeUnicode accordingly.
Comment 3 Darin Adler 2006-03-04 16:44:23 PST
Comment on attachment 6857 [details]
proposed patch

Two issues prevent landing this:

    1) We need to do performance testing. When I added a fast case for Latin-1 that relied on fixupChar, we saw a significant performance boost. Even if we don't need the super-fast "just put it in the ASCII buffer" version, we'll probably need to consider making a path that does not go through ICU.
    2) I believe this may remove the quirk in our existing code that ISO Latin-1 effectively means Windows Latin-1 -- I believe we need that quirk.

Maybe Alexey can handle issue (2). Someone at Apple will probably have to tackle (1) since we haven't yet found a good way to measure performance that people outside the Safari team can do easily.
Comment 4 Maciej Stachowiak 2006-03-04 17:23:21 PST
You are write that document.write("\u0080"); should not apply the fixup hack. However, characters that are present in the original source text of a supposedly latin1 document *do* get treated as latin1 instead. So I think your patch may be backing off a bit too much. See attached test case.
Comment 5 Maciej Stachowiak 2006-03-04 17:24:57 PST
Created attachment 6859 [details]
test case to distinguish latin1 / winlatin1 parsing

On this test case, Firefox treats the first two examples as WinLatin1  characters and the latter two as Latin1 (i.e. at their true unicode value).
Comment 6 Alexey Proskuryakov 2006-03-05 00:36:04 PST
Comment on attachment 6857 [details]
proposed patch

Actually, the Latin-1/windows-1252 quirk is still enforced by effectiveEncoding(). It is covered by existing test cases, although indirectly, so adding Maciej's test case will be IMO useful, too. It passes with a single difference from Firefox - the latter maps 0x81 to U+FFFD, while fixupChar leaves it at U+0081.

Resetting the r? flag for performance testing...
Comment 7 Darin Adler 2006-03-24 08:21:50 PST
Comment on attachment 6857 [details]
proposed patch

r=me
Comment 8 Justin Garcia 2006-03-28 03:31:20 PST
Oops I moved this to fixed accidently.
Comment 9 Alexey Proskuryakov 2006-04-13 10:01:17 PDT
Landed, r13863.