Bug 12526 - Safari ignores encoding description "charset=Shift_JIS" in invalid html
Summary: Safari ignores encoding description "charset=Shift_JIS" in invalid html
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: 420+
Hardware: Macintosh OS X 10.4
: P2 Normal
Assignee: Alexey Proskuryakov
URL: http://www.bandai.co.jp/releases/J200...
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2007-02-01 00:06 PST by Darin Adler
Modified: 2007-12-27 00:36 PST (History)
2 users (show)

See Also:


Attachments
proposed fix (13.92 KB, patch)
2007-12-25 11:00 PST, Alexey Proskuryakov
mjs: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Darin Adler 2007-02-01 00:06:43 PST
2006-12-06 00:46:39 Masayuki Shibatani:
* SUMMARY
Recap the problem title and/or include more descriptive summary information.

* STEPS TO REPRODUCE
1. Install Leopard 9A320 as English primary, Japanese country region and Kotoeri as default keyboard.
2. Open URL:http://www.bandai.co.jp/releases/J2006120401.html

* RESULTS
Actual: The page is garbled. See attached screen shot.

Expected: Since there is a description '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'. Safari should treat the page as Shift_JIS encoding.

2006-12-06 00:47:16 Masayuki Shibatani:
Attached html source.

2006-12-11 20:56:14 Stephanie Lewis:
Changing the encoding to Shift_JIS fixes the problem, Safari is not picking up the encoding.  Happens in Safari 2.0.4 and Leopard 9A319.

2007-01-11 21:11:03 Hiroshi Sakuraba:
This is an regression from Tiger and needs to be fixed in Leopard. Nominating Int'l BRB.

2007-01-16 14:36:38 DAVID Murphy:
Intl BRB:  Setting Intl Blocker/P1 for Leopard.

2007-01-16 21:52:09 Oliver Hunt:
Examining this bug the most obvious problem is that this page is fundamentally poorly formed with a div defined outside the body of the page.

As firefox can handle this, we need to however.

2007-01-17 17:36:10 Oliver Hunt:
Just verified: this is *not* a regression.



2007-01-17 17:47:56 Ping Huey:
Per email from Oliver:
I believe you are exagerating the importance of this bug, as the page is invalid.  We offer a lot of leeway in charset definitions, and in those case we do correctly determine the charset.

I have relabelled the radar to make it more clear.  We are failing in this particular example of a broken page, not on all Shift_JIS pages.

Re-nom'ing to intl brb for reconsideration.

2007-01-23 13:38:10 DAVID Murphy:
Intl BRB:  Setting Intl High Priority/P2.

<rdar://problem/4867183>
Comment 1 Alexey Proskuryakov 2007-02-01 06:38:04 PST
<div class="moz-text-flowed" style="font-family: -moz-fixed">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">

This is the sort of broken markup we don't aim to support at the moment (though that can be reconsidered). 
Comment 2 Darin Adler 2007-02-01 10:10:59 PST
I think we should reconsider. Our behavior is only a heuristic, so maybe there's a good way to improve the heuristic to work for cases like this without makng things significantly worse.

On the other hand, I have no specific suggesting.
Comment 3 Jungshik Shin 2007-07-17 19:08:33 PDT
The following two pages have a very long script (~ 10kB) before <html> and charset declaration in <meta> is not honored. 

http://db66.vnet.cn/
http://www.ddm.com/event/event84.asp?code=-548

I thought Firefox and IE stop looking for meta charset at 1 or 2 kB into a document, but both seem to go well beyond that. With auto-detection off in FF, the page begins to be rendered as the default encoding, but when meta charset is read in, it begins decoding anew and the page is rendered correctly. (By trying the second one above, one can see Japanese characters turn into Korean characters in the page). 

I guess this is a rather big compatibility issue.  


Comment 4 Alexey Proskuryakov 2007-07-18 02:28:22 PDT
Scripts are particularly tricky - I've seen many bugs at b.m.o. related to scripts being executed twice during page loading, because Firefox restarts parsing when it sees a charset declaration in <meta> (HTML5 suggests the same). And 10K is a lot of data to pre-scan just in case there's a meta somewhere.
Comment 5 Jungshik Shin 2007-08-20 11:37:47 PDT
The same thing hurts Safari's web compatibility at http://www.hebrewtoday.com

It begins with '<font> and <a name>', but later it has a meta tag for windows-1255.  Interesting is that Safari 2.0.x shipped with Mac OS 10.4 does not have this problem.  So, this is a relatively new 'regression'(?) introduced for perf. reason, right? 
Comment 6 Darin Adler 2007-08-20 12:36:20 PDT
(In reply to comment #5)
> The same thing hurts Safari's web compatibility at http://www.hebrewtoday.com
> 
> It begins with '<font> and <a name>', but later it has a meta tag for
> windows-1255.  Interesting is that Safari 2.0.x shipped with Mac OS 10.4 does
> not have this problem.  So, this is a relatively new 'regression'(?) introduced
> for perf. reason, right? 

That's oversimplifying. The changes we made weren't to improve performance. They've been to improve correctness. Our old algorithm got the wrong answer at many websites, and the refinements we've made have fixed some and broken others.

What makes this a big challenge is that Firefox takes a completely different approach, reloading the web page when it encounters a <meta> tag that changes the charset.

I'm not sure exactly why Safari 2 worked on this site. It had a similar rule, but there were many bugs in its implementation.

It would be instructive to learn how our charset detection compares to IE's approach. We already understand Firefox's approach pretty well, and we can't adopt that any time soon.
Comment 7 Alexey Proskuryakov 2007-08-21 03:31:28 PDT
(In reply to comment #5)
> Interesting is that Safari 2.0.x shipped with Mac OS 10.4 does
> not have this problem. 

What version of 10.4 do you have? I can reproduce this problem with shipping 10.4.10 Safari/WebKit.
Comment 8 Alexey Proskuryakov 2007-12-25 11:00:36 PST
Created attachment 18110 [details]
proposed fix

This doesn't fix all the examples we have (per comment 3, there are sites that would require an unacceptably large cut-off), but it fixes quite a few.

I still really dislike the idea that a browser would restart parsing if it sees a charset declaration anywhere in the document. Hopefully, this brings us close enough to real world compatibility and we won't have to implement that.
Comment 9 Maciej Stachowiak 2007-12-27 00:06:33 PST
Comment on attachment 18110 [details]
proposed fix

r=me
Comment 10 Alexey Proskuryakov 2007-12-27 00:36:53 PST
Committed revision 28998.

Filed bug 16621 for sites mentioned in comment 3 that are still problematic.