HTMLTokenizer uses fixupChar() to ensure that the incoming data is treated as windows-1252, not the real Latin-1. This mostly duplicates the logic in StreamingTextDecoder (see effectiveEncoding()). More importantly, shuffling code points in HTMLTokenizer affects cases where the character is already known for certain, such as: document.write("\u0080"); // writes U+20AC (Euro sign) instead of U+0080
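For readers unfamiliar with the remapping being discussed, here is a minimal sketch of what a fixup of this kind does. The name remapC1AsWindows1252 and the table layout are illustrative assumptions, not the actual fixupChar() code; the table values are the standard windows-1252 assignments for bytes 0x80 to 0x9F, with the five unassigned bytes left at their C1 values.

```cpp
// Hypothetical helper, not WebKit's fixupChar(): maps a code point in the
// C1 range (U+0080..U+009F) to the character windows-1252 assigns to the
// byte of the same value. Unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
// keep their C1 value; everything outside the range is returned unchanged.
constexpr char16_t kWindows1252C1Table[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, // 80-87
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F, // 88-8F
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, // 90-97
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178  // 98-9F
};

constexpr char32_t remapC1AsWindows1252(char32_t c)
{
    return (c >= 0x80 && c <= 0x9F) ? kWindows1252C1Table[c - 0x80] : c;
}

// The behavior reported above: U+0080 comes out as the Euro sign.
static_assert(remapC1AsWindows1252(0x80) == 0x20AC, "U+0080 remaps to U+20AC");
```

Applying such a function to every character the tokenizer sees, regardless of where the character came from, is what produces the document.write result described in this report.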
Created attachment 6855 [details] test case
Created attachment 6857 [details] proposed patch Besides limiting fixupChar to entities, this removes a semi-broken Latin-1 special case from StreamingTextDecoder. As it stood, this special case was pushing both Latin-1 and windows-1252 into QString's ASCII buffer, which was then converted to Unicode in QString::makeUnicode, as if it were Latin-1. Since the tokenizer called fixupChar on the result, it (usually?) didn't matter, though. If this change negatively affects performance, then we will probably need to postulate that QString's ASCII buffer in fact contains windows-1252 data, and adjust makeUnicode accordingly.
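To illustrate why the ASCII-buffer path yields Latin-1 semantics, here is a rough sketch of byte-to-UTF-16 widening. widenAsLatin1 is a hypothetical stand-in using standard library strings, not QString::makeUnicode itself; it only shows the general technique of treating each byte value as the code point of the same value.

```cpp
#include <string>

// Hypothetical stand-in for a Latin-1 widening step: every byte becomes
// the UTF-16 code unit with the same numeric value, so byte 0x80 turns
// into U+0080 rather than the windows-1252 Euro sign.
std::u16string widenAsLatin1(const std::string& bytes)
{
    std::u16string result;
    result.reserve(bytes.size());
    for (unsigned char byte : bytes)
        result.push_back(static_cast<char16_t>(byte));
    return result;
}
```

With this kind of widening, windows-1252 input only ends up correct if something downstream remaps U+0080..U+009F, which is the role the tokenizer's fixupChar call was playing.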
Comment on attachment 6857 [details] proposed patch Two issues prevent landing this: 1) We need to do performance testing. When I added a fast case for Latin-1 that relied on fixupChar, we saw a significant performance boost. Even if we don't need the super-fast "just put it in the ASCII buffer" version, we'll probably need to consider making a path that does not go through ICU. 2) I believe this may remove the quirk in our existing code that ISO Latin-1 effectively means Windows Latin-1 -- I believe we need that quirk. Maybe Alexey can handle issue (2). Someone at Apple will probably have to tackle (1) since we haven't yet found a good way to measure performance that people outside the Safari team can do easily.
You are right that document.write("\u0080"); should not apply the fixup hack. However, characters that are present in the original source text of a supposedly latin1 document *do* get treated as latin1 instead. So I think your patch may be backing off a bit too much. See attached test case.
Created attachment 6859 [details] test case to distinguish latin1 / winlatin1 parsing On this test case, Firefox treats the first two examples as WinLatin1 characters and the latter two as Latin1 (i.e. at their true unicode value).
Comment on attachment 6857 [details] proposed patch Actually, the Latin-1/windows-1252 quirk is still enforced by effectiveEncoding(). It is covered by existing test cases, although only indirectly, so adding Maciej's test case would IMO be useful, too. It passes with a single difference from Firefox - the latter maps 0x81 to U+FFFD, while fixupChar leaves it at U+0081. Resetting the r? flag for performance testing...
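The 0x81 difference comes down to how a decoder treats bytes that windows-1252 leaves unassigned. A small sketch of the two policies follows; all names here are hypothetical. Only 0x81 is mentioned in this report, so treating the other four unassigned bytes (0x8D, 0x8F, 0x90, 0x9D) the same way is an assumption.

```cpp
// Hypothetical policy switch, for illustration only.
enum class UnassignedPolicy { KeepC1Control, ReplacementCharacter };

// windows-1252 assigns no character to these five byte values.
constexpr bool isUnassignedInWindows1252(unsigned char byte)
{
    return byte == 0x81 || byte == 0x8D || byte == 0x8F
        || byte == 0x90 || byte == 0x9D;
}

// Call only for bytes where isUnassignedInWindows1252() is true.
// KeepC1Control returns the byte value as a code point (0x81 -> U+0081);
// ReplacementCharacter returns U+FFFD.
constexpr char32_t decodeUnassigned(unsigned char byte, UnassignedPolicy policy)
{
    return policy == UnassignedPolicy::ReplacementCharacter
        ? static_cast<char32_t>(0xFFFD)
        : static_cast<char32_t>(byte);
}
```

KeepC1Control corresponds to the fixupChar behavior described above, and ReplacementCharacter to the Firefox result on the test case.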
Comment on attachment 6857 [details] proposed patch r=me
Oops, I accidentally moved this to fixed.
Landed, r13863.