Summary: | Safari ignores encoding description "charset=Shift_JIS" in invalid html | |
---|---|---|---
Product: | WebKit | Reporter: | Darin Adler <darin>
Component: | WebCore Misc. | Assignee: | Alexey Proskuryakov <ap>
Status: | RESOLVED FIXED | |
Severity: | Normal | CC: | ap, jshin
Priority: | P2 | Keywords: | InRadar
Version: | 420+ | |
Hardware: | Mac | |
OS: | OS X 10.4 | |
URL: | http://www.bandai.co.jp/releases/J2006120401.html | |
Description
Darin Adler
2007-02-01 00:06:43 PST
<div class="moz-text-flowed" style="font-family: -moz-fixed">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">

This is the sort of broken markup we don't aim to support at the moment (though that can be reconsidered).

I think we should reconsider. Our behavior is only a heuristic, so maybe there's a good way to improve the heuristic to work for cases like this without making things significantly worse. On the other hand, I have no specific suggestion.

The following two pages have a very long script (~ 10kB) before <html>, and the charset declaration in <meta> is not honored.

http://db66.vnet.cn/
http://www.ddm.com/event/event84.asp?code=-548

I thought Firefox and IE stop looking for meta charset at 1 or 2 kB into a document, but both seem to go well beyond that. With auto-detection off in FF, the page begins to be rendered in the default encoding, but when the meta charset is read in, it begins decoding anew and the page is rendered correctly. (By trying the second one above, one can see Japanese characters turn into Korean characters in the page.) I guess this is a rather big compatibility issue.

Scripts are particularly tricky - I've seen many bugs at b.m.o. related to scripts being executed twice during page loading, because Firefox restarts parsing when it sees a charset declaration in <meta> (HTML5 suggests the same). And 10K is a lot of data to pre-scan just in case there's a meta somewhere.

The same thing hurts Safari's web compatibility at http://www.hebrewtoday.com

It begins with '<font> and <a name>', but later it has a meta tag for windows-1255. Interestingly, Safari 2.0.x shipped with Mac OS 10.4 does not have this problem. So, this is a relatively new 'regression'(?) introduced for perf. reasons, right?

(In reply to comment #5)
> The same thing hurts Safari's web compatibility at http://www.hebrewtoday.com
>
> It begins with '<font> and <a name>', but later it has a meta tag for
> windows-1255. Interestingly, Safari 2.0.x shipped with Mac OS 10.4 does
> not have this problem. So, this is a relatively new 'regression'(?) introduced
> for perf. reasons, right?

That's oversimplifying. The changes we made weren't to improve performance. They were made to improve correctness. Our old algorithm got the wrong answer at many websites, and the refinements we've made have fixed some and broken others.

What makes this a big challenge is that Firefox takes a completely different approach, reloading the web page when it encounters a <meta> tag that changes the charset.

I'm not sure exactly why Safari 2 worked on this site. It had a similar rule, but there were many bugs in its implementation.

It would be instructive to learn how our charset detection compares to IE's approach. We already understand Firefox's approach pretty well, and we can't adopt that any time soon.

(In reply to comment #5)
> Interestingly, Safari 2.0.x shipped with Mac OS 10.4 does
> not have this problem.

What version of 10.4 do you have? I can reproduce this problem with shipping 10.4.10 Safari/WebKit.

Created attachment 18110
proposed fix

This doesn't fix all the examples we have (per comment 3, there are sites that would require an unacceptably large cut-off), but it fixes quite a few.

I still really dislike the idea that a browser would restart parsing if it sees a charset declaration anywhere in the document. Hopefully, this brings us close enough to real world compatibility and we won't have to implement that.

Comment on attachment 18110
proposed fix
r=me
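
For anyone reading this thread later, here is a rough Python sketch of the kind of cut-off heuristic discussed above: look for a <meta> charset declaration only within the first N bytes, regardless of what markup precedes it, and fall back to a default encoding otherwise. This is not the WebKit patch. The names `sniff_meta_charset` and `decode_page`, the 1024-byte `PRESCAN_LIMIT`, and the latin-1 default are all assumptions made for illustration, and the regular expression is far looser than a real HTML tokenizer.

```python
import re

# Hypothetical cut-off: the thread does not say which value the patch uses.
PRESCAN_LIMIT = 1024

# Loose match for charset=... inside a <meta> tag (illustrative, not a real tokenizer).
META_CHARSET = re.compile(
    rb"""<meta\b[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(data: bytes, limit: int = PRESCAN_LIMIT):
    """Return the charset named by the first <meta> within `limit` bytes, else None."""
    match = META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii") if match else None


def decode_page(data: bytes, default: str = "latin-1", limit: int = PRESCAN_LIMIT) -> str:
    """Decode with the sniffed charset, falling back to `default` if none is usable."""
    charset = sniff_meta_charset(data, limit) or default
    try:
        return data.decode(charset, errors="replace")
    except LookupError:  # charset label unknown to Python's codec registry
        return data.decode(default, errors="replace")


meta = b'<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'

# Stray markup before <html>, as in the original report: still inside the limit.
early = b'<div class="moz-text-flowed">\n<html>\n<head>\n' + meta

# About 10 kB of script before <html>, as in comment 3: the meta falls past the limit.
late = b"<script>" + b" " * 10_000 + b"</script>\n<html>\n<head>\n" + meta

print(sniff_meta_charset(early))               # Shift_JIS
print(sniff_meta_charset(late))                # None
print(sniff_meta_charset(late, limit=16_384))  # Shift_JIS
```

The last call shows the trade-off raised in the thread: the pages with a long leading script are only handled by raising the cut-off, which means pre-scanning more data on every page load.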