Bug 16621

Summary: WebKit ignores encoding description in invalid HTML if it's too far from the start
Product: WebKit
Reporter: Alexey Proskuryakov <ap>
Component: Page Loading
Assignee: Nobody <webkit-unassigned>
Status: NEW
Severity: Normal
CC: darin, ddkilzer, ian, jshin, mrowe
Priority: P2
Version: 528+ (Nightly build)
Hardware: Macintosh
OS: OS X 10.4

Description Alexey Proskuryakov 2007-12-27 00:36:02 PST
From bug 12526 comment 3.

Our heuristic for <meta> charset declarations differs from what Firefox does and from what is documented in HTML5. Specifically, we do not look for <meta> during normal parsing, so we never restart parsing when a charset declaration appears late in the document. We only pre-parse the first 512 bytes of the document, or the whole <head>, whichever is larger. This is usually enough, but we know of pages that are decoded incorrectly because of this difference.
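
To make the heuristic concrete, here is a minimal Python sketch of a pre-scan limited to "the first 512 bytes or the whole <head>, whichever is larger". It is not WebKit's code: the function names, the regular expressions, and the way the end of <head> is detected (a literal </head> tag) are all assumptions made for illustration, and real <meta> detection handles many more cases.

```python
import re

PRESCAN_MIN_BYTES = 512  # figure taken from the description above

# Simplified, hypothetical patterns; a real pre-scanner is far more careful.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)
HEAD_END_RE = re.compile(rb'</head\s*>', re.IGNORECASE)


def prescan_window(data: bytes) -> bytes:
    """Return the first 512 bytes or the whole <head>, whichever is larger."""
    limit = PRESCAN_MIN_BYTES
    head_end = HEAD_END_RE.search(data)
    if head_end is not None:
        limit = max(limit, head_end.end())
    return data[:limit]


def prescan_charset(data: bytes):
    """Return the charset declared inside the pre-scan window, or None."""
    match = META_CHARSET_RE.search(prescan_window(data))
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

Under these assumptions, a document with no <head> tags that opens with a roughly 10 kB script pushes its <meta> past the 512-byte window, so `prescan_charset` returns `None` and the declaration is ignored, which is the failure mode reported here.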

The following two pages have a very long script (~10 kB) at the beginning, so the charset declaration in <meta> is not honored.

http://db66.vnet.cn/
http://www.ddm.com/event/event84.asp?code=-548

Restarting parsing at any point is a big can of worms though - e.g., some scripts with side effects may run twice because of that.
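
The double-execution concern can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: the token tuples, the `parse` and `load` helpers, and the `windows-1252` default are hypothetical and do not correspond to any WebKit or HTML5 parser API.

```python
def parse(tokens, encoding, executed):
    """Walk the tokens once; return a new encoding if a <meta> changes it."""
    for kind, value in tokens:
        if kind == "script":
            executed.append(value)  # stands in for running a script with side effects
        elif kind == "meta" and value != encoding:
            return value  # late charset declaration: caller must restart
    return None


def load(tokens, default_encoding="windows-1252"):
    """Parse, restarting from the top once if the encoding changes mid-document."""
    executed = []
    encoding = default_encoding
    new_encoding = parse(tokens, encoding, executed)
    if new_encoding is not None:
        encoding = new_encoding
        parse(tokens, encoding, executed)  # reparse from the beginning
    return encoding, executed
```

In this model, `load([("script", "a"), ("meta", "gb2312"), ("script", "b")])` returns `("gb2312", ["a", "a", "b"])`: the script that precedes the late <meta> runs on both passes.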

Comment 1 Mark Rowe (bdash) 2007-12-27 01:58:44 PST
Is the handling of scripts when reparsing discussed in the HTML5 specification?  Is that something which should be documented in the spec?

Comment 2 Alexey Proskuryakov 2007-12-27 02:28:56 PST
See <http://www.whatwg.org/specs/web-apps/current-work/#change>.

Comment 3 Ian 'Hixie' Hickson 2008-01-08 18:37:00 PST
(basically, HTML5 requires that the scripts run twice.)