Bug 66085 - The HTML parser doesn't ignore the BOM when the HTTP charset=foo conflicts with it
Status: RESOLVED INVALID
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM
Version: 528+ (Nightly build)
Hardware: All All
Importance: P2 Major
Assignee: Nobody
URL: http://malform.no/testing/html5/bom/h...
Reported: 2011-08-11 12:42 PDT by Leif Halvard Silli
Modified: 2023-04-01 00:38 PDT
Description Leif Halvard Silli 2011-08-11 12:42:46 PDT
ISSUE: 

   Webkit fails to ignore the BOM when the charset=foo attribute of the HTTP Content-Type: header conflicts with it. In other words: it lets the BOM take precedence over the HTTP Content-Type: header. NB: This bug is actually present for *both* HTML and XML. Of course, the BOM is not actually a BOM unless the document uses an encoding which includes the BOM. Hence, if the HTTP Content-Type: says "ISO-8859-1" while the document contains the BOM, then, according to current specs, the parser should land in *QUIRKS-MODE*, due to the presence of the illegal "BOM" before the DOCTYPE.

BACKGROUND:

  HTML5 requires the charset=foo attribute of the HTTP Content-Type header to take precedence over page-internal information, including the BOM.

WAYS TO REPRODUCE THIS BUG:

* Visit http://malform.no/testing/html5/bom/htm.html
    That page is served with an HTTP Content-Type: header which says "text/html;charset=KOI8-r". However, internally the page is actually UTF-8 encoded, and, importantly, it also contains the BOM. When read as KOI8-R encoded, as HTML5 requires, the BOM bytes become illegal characters before the DOCTYPE, which in turn should cause QUIRKS-MODE.
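
    The effect can be checked with a short Python sketch. The sample bytes are made up for illustration (they are not the actual content of the test page), and the sketch only relies on Python's built-in "koi8_r" and "utf-8-sig" codecs:

```python
# Hypothetical sample: a UTF-8 BOM followed by a DOCTYPE (not the real test page).
UTF8_BOM = b"\xef\xbb\xbf"
page = UTF8_BOM + b"<!DOCTYPE html><title>t</title>"

# Decoded as the HTTP header demands (KOI8-R): the three BOM bytes turn into
# three ordinary characters that sit in front of the DOCTYPE.
as_koi8r = page.decode("koi8_r")

# Decoded as the BOM suggests (UTF-8, with the BOM stripped): the DOCTYPE is first.
as_utf8 = page.decode("utf-8-sig")


def doctype_is_first(text: str) -> bool:
    """Return True if nothing precedes the DOCTYPE in the decoded text."""
    return text.startswith("<!DOCTYPE")


print(doctype_is_first(as_koi8r))  # False
print(doctype_is_first(as_utf8))   # True
```

    With the KOI8-R reading, the parser sees text before the DOCTYPE, which is what should trigger quirks mode.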

EXPECTED RESULT:  Webkit should obey the charset info in the HTTP Content-Type: header w.r.t. the encoding. Hence it should land in QUIRKS-MODE.
  
ACTUAL RESULT:  Webkit instead ignores the charset info in the HTTP Content-Type: header and obeys the BOM.
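
The difference between the expected and the actual result comes down to the order in which the two sources of encoding information are consulted. Below is a minimal Python sketch of both orders. It is not Webkit's code: the function names, the `bom_wins` flag and the short BOM table are hypothetical and exist only for illustration, and the windows-1252 fallback is just a stand-in default.

```python
from typing import Optional

# Byte order marks checked by this sketch (an illustrative subset).
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def pick_encoding(data: bytes, http_charset: Optional[str], bom_wins: bool,
                  fallback: str = "windows-1252") -> str:
    """Choose an encoding from the BOM and the HTTP charset (hypothetical helper).

    bom_wins=True models the behaviour reported here as ACTUAL;
    bom_wins=False models the behaviour reported here as EXPECTED.
    """
    bom = sniff_bom(data)
    order = [bom, http_charset] if bom_wins else [http_charset, bom]
    for candidate in order:
        if candidate:
            return candidate
    return fallback


page = b"\xef\xbb\xbf<!DOCTYPE html>"
print(pick_encoding(page, "koi8-r", bom_wins=True))   # utf-8
print(pick_encoding(page, "koi8-r", bom_wins=False))  # koi8-r
```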

COMMENTS:

[BOM CAUSES UAs TO NOT PERMIT USERS TO OVERRIDE THE ENCODING:] 
    For HTML, unlike for XML, the user is permitted to override the encoding. However, when the page includes the BOM, IE (IE6 to IE9) and Webkit browsers do not allow the user to override the encoding. This is, in my view, a good thing - and I don't want to change it! However, to be in accordance with what is currently specified in HTTP and in HTML5, the charset info coming from HTTP should actually take precedence when it differs from the BOM - so that's a detail that perhaps should be changed.

[OVERVIEW - OTHER PARSERS:]
    Firefox does not have this bug - Firefox also lets users override the encoding even when there is a BOM. Opera behaves like Firefox (but Opera makes a special exception for ISO-8859-1 for some reason ... see http://malform.no/testing/html5/bom/). It seems that IE6 to IE9 behave like Webkit too. In fact, there are a few HTML parsers with similar issues; for more data, read http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 (As noted in that bug, I wonder if the Webkit behaviour should become the correct one ... )
Comment 1 Alexey Proskuryakov 2011-08-12 21:44:00 PDT
A BOM is the most authoritative indication of encoding, because there are few ways to get it wrong. It's much easier to get an encoding declaration or an HTTP header wrong.

There are some synthetic examples of strings in other encodings that can be mistaken for a BOM, but it hasn't been a practical issue.
Comment 2 Alexey Proskuryakov 2011-08-12 21:47:04 PDT
Even the linked test's title says "UTF-8 encoded document with erroneous external encoding". We display a UTF-8 document as UTF-8 - would it be better for users if we displayed garbage?
Comment 3 Leif Halvard Silli 2011-08-13 05:59:55 PDT
(In reply to comment #2)
> Even the linked test's title says "UTF-8 encoded document with erroneous external encoding".

It is mostly just a boilerplate text.

> We display a UTF-8 document as UTF-8 - would it be better for users if we displayed garbage?

FIRSTLY: what you state here about what you do, is only 50% true. 

    Because, if a page does *not* contain the BOM but still is UTF-8 encoded, then Webkit EITHER listens to the HTTP charset OR, if that is lacking, defaults to WINDOWS-1252. And this despite the fact that UTF-8 is easy to detect. Thus, for UTF-8 pages which lack the BOM, you seem to favor something other than what is better for users. So, unless there is an effort to change this so that UTF-8 is used whenever it can be detected (also when it conflicts with HTTP), I don't feel that this argument carries as much weight as it otherwise would have.

    NOTE: When HTTP charset=foo is lacking, then Chrome and Opera do detect UTF-8, instead of defaulting to WINDOWS-1252. Webkit should behave the same way.
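
    As a rough illustration of why UTF-8 is easy to detect: bytes that decode cleanly under a strict UTF-8 decoder and contain at least one non-ASCII byte are very unlikely to have been written in a legacy single-byte encoding. The Python sketch below shows that idea only. The function name is hypothetical, and this is not a description of how Chrome or Opera actually implement their detection.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: valid UTF-8 that contains at least one non-ASCII byte."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is also valid UTF-8, but it gives no evidence either way,
    # so this sketch does not count it as a detection.
    return any(b >= 0x80 for b in data)


sample = "Привет"
print(looks_like_utf8(sample.encode("utf-8")))   # True
print(looks_like_utf8(sample.encode("koi8_r")))  # False
print(looks_like_utf8(b"plain ascii"))           # False
```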

SECONDLY: As for the 50% where your statement is true (that is: when there is a BOM), the arguments for changing Webkit are:

 1) To do what the specs (HTTP and HTML5) say
 2) To promote interoperability with Firefox, Opera and more
     (at the expense of IE interoperability)
 3) To promote interoperability (think Polyglot Markup) with how XML 
     parsers should operate (they do not always behave that way though)

That said, 

  1) HTML parsers differ - should they all adopt IE/Webkit behaviour?
  2) Maybe the HTTP spec should change?
  3) Maybe the HTML5 spec should change?
  4) Maybe the XML 1.0 spec should change?

If you think that the Webkit behaviour should not change, then I encourage you to add your voice in support of changing HTML5 - this can e.g. be done by stating Webkit's position in W3C bug 12897: http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
Comment 4 Anne van Kesteren 2023-04-01 00:38:28 PDT
The standards ended up considering the BOM as the most authoritative piece of information. This is codified by the HTML and Encoding standards nowadays.