66085 – The HTML parser doesn't ignore the BOM when the HTTP charset=foo conflicts with it

RESOLVED INVALID 66085

https://bugs.webkit.org/show_bug.cgi?id=66085

Leif Halvard Silli

Reported 2011-08-11 12:42:46 PDT

ISSUE: WebKit fails to ignore the BOM when the charset=foo parameter of the HTTP Content-Type: header conflicts with it. In other words, it lets the BOM take precedence over the HTTP Content-Type: header. NB: This bug is actually present for *both* HTML and XML.

Of course, the BOM is not actually the BOM unless the document uses an encoding which includes the BOM. Hence, if the HTTP Content-Type: says "ISO-8859-1" while the document contains the BOM, then, according to current specs, the parser should land in *QUIRKS-MODE*, due to the presence of the illegal "BOM" before the DOCTYPE.

BACKGROUND: HTML5 requires the charset=foo parameter of the HTTP Content-Type: header to take precedence over page-internal information, including the BOM.

WAYS TO REPRODUCE THIS BUG:

* Visit http://malform.no/testing/html5/bom/htm.html

That page is served with an HTTP Content-Type: header which says "text/html;charset=KOI8-r". Internally, however, the page is actually UTF-8 encoded and, importantly, it also contains the BOM. But when the page is read as KOI8-R encoded, as HTML5 requires, the BOM becomes an illegal character before the DOCTYPE, which in turn should cause QUIRKS-MODE.

EXPECTED RESULT: WebKit should obey the charset info in the HTTP Content-Type: header with respect to the encoding. Hence it should land in QUIRKS-MODE.

ACTUAL RESULT: WebKit instead ignores the charset info in the HTTP Content-Type: header and obeys the BOM.

COMMENTS:

[BOM CAUSES UAs TO NOT PERMIT USERS TO OVERRIDE THE ENCODING:] For HTML, unlike for XML, the user is permitted to override the encoding. However, when the page includes the BOM, IE (IE6 to IE9) and WebKit browsers do not allow the user to override the encoding. This is, in my view, a good thing, and I don't want to change it! However, to be in accordance with what is currently specified in HTTP and in HTML5, the charset info coming from HTTP should take precedence when it differs from the BOM, so that's a detail that perhaps should be changed.

[OVERVIEW - OTHER PARSERS:] Firefox does not have this bug. Firefox also lets users override the encoding even when there is a BOM. Opera behaves like Firefox (but Opera makes a special exception for ISO-8859-1 for some reason; see http://malform.no/testing/html5/bom/). It seems that IE6 to IE9 behave like WebKit too. In fact, there are a few HTML parsers with similar issues; for more data, read http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 (As noted in that bug, I wonder if the WebKit behaviour should become the correct one ...)
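To make the two readings of the test page concrete, here is a minimal Python sketch. The byte string is a hypothetical stand-in for the test page (not its actual content), and Python's codecs are used only for illustration; they are not what any browser runs. It shows that honoring charset=KOI8-R turns the three BOM bytes into three ordinary characters in front of the DOCTYPE, while honoring the BOM strips them.

```python
import codecs

# Hypothetical stand-in for the test page: UTF-8 BOM, then a DOCTYPE.
body = codecs.BOM_UTF8 + b"<!DOCTYPE html><title>BOM test</title>"

# Reading 1: obey the HTTP header (charset=KOI8-R).
# KOI8-R is a single-byte encoding, so the 3 BOM bytes become 3 characters.
as_koi8r = body.decode("koi8_r")

# Reading 2: obey the BOM. The "utf-8-sig" codec drops a leading UTF-8 BOM.
as_utf8 = body.decode("utf-8-sig")

print(repr(as_koi8r[:3]))               # three stray characters before the DOCTYPE
print(as_utf8.startswith("<!DOCTYPE"))  # the DOCTYPE comes first in this reading
```

Under the first reading the document no longer starts with a DOCTYPE, which is why the report expects quirks mode.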


Alexey Proskuryakov

Comment 1 2011-08-12 21:44:00 PDT

A BOM is the most authoritative indication of encoding, because there are few ways to get it wrong. It's much easier to get an encoding declaration or an HTTP header wrong. There are some synthetic examples of strings in other encodings that can be mistaken for a BOM, but it hasn't been a practical issue.
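One synthetic collision of the kind mentioned above can be sketched in Python. The example string is an illustration chosen here, not one taken from this bug: a windows-1252 document that begins with the three characters "ï»¿" has exactly the same leading bytes as a UTF-8 BOM.

```python
import codecs

# Three legitimate windows-1252 characters (U+00EF, U+00BB, U+00BF)...
prefix = "\u00ef\u00bb\u00bf"

# ...encode to the bytes EF BB BF, which is also the UTF-8 BOM.
encoded = prefix.encode("cp1252")

print(encoded == codecs.BOM_UTF8)
```

A real page is unlikely to open with that character sequence, which fits the remark that the ambiguity has not been a practical issue.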

Alexey Proskuryakov

Comment 2 2011-08-12 21:47:04 PDT

Even the linked test's title says "UTF-8 encoded document with erroneous external encoding". We display a UTF-8 document as UTF-8 - would it be better for users if we displayed garbage?

Leif Halvard Silli

Comment 3 2011-08-13 05:59:55 PDT

(In reply to comment #2)
> Even the linked test's title says "UTF-8 encoded document with erroneous external encoding".

It is mostly just boilerplate text.

> We display a UTF-8 document as UTF-8 - would it be better for users if we displayed garbage?

FIRSTLY: What you state here about what you do is only 50% true. If a page does *not* contain the BOM but is still UTF-8 encoded, then WebKit EITHER listens to the HTTP charset OR, if that is lacking, defaults to WINDOWS-1252. And this despite the fact that UTF-8 is easy to detect. Thus, for UTF-8 pages which lack the BOM, you seem to favor something other than what is better for users. So, unless there is an effort to change this so that UTF-8 is used whenever it can be detected (also when it conflicts with HTTP), I don't feel that this argument carries as much weight as it otherwise would have.

NOTE: When HTTP charset=foo is lacking, Chrome and Opera do detect UTF-8 instead of defaulting to WINDOWS-1252. WebKit should behave the same way.

SECONDLY: As for the 50% where your statement is true (that is, when there is a BOM), the arguments for changing WebKit are:

1) To do what the specs (HTTP and HTML5) say
2) To promote interoperability with Firefox, Opera and more (at the expense of IE interoperability)
3) To promote interoperability (think Polyglot Markup) with how XML parsers should operate (they do not always behave that way though)

That said,

1) HTML parsers differ - should they all adopt IE/WebKit behaviour?
2) Maybe the HTTP spec should change?
3) Maybe the HTML5 spec should change?
4) Maybe the XML 1.0 spec should change?

If you think that the WebKit behaviour should not change, then I encourage you to add your voice in support of changing HTML5. This can be done, for example, by stating WebKit's position in W3C bug 12897: http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
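The claim that UTF-8 is easy to detect can be illustrated with a simple validity check. This is a minimal sketch under assumptions: the helper name `looks_like_utf8` is hypothetical, and a strict decode is only a heuristic, not the detection logic that Chrome, Opera, or WebKit actually use. Note that pure ASCII input also passes, since ASCII is a subset of UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence (heuristic only)."""
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
        return True
    except UnicodeDecodeError:
        return False


# "café" encoded as UTF-8 (no BOM) is valid UTF-8.
print(looks_like_utf8("caf\u00e9".encode("utf-8")))

# The same word in windows-1252 ends with the lone byte 0xE9,
# which is an incomplete multi-byte sequence in UTF-8.
print(looks_like_utf8("caf\u00e9".encode("cp1252")))
```

Text in a legacy single-byte encoding rarely happens to be valid UTF-8 once it contains non-ASCII characters, which is what makes this kind of check useful in practice.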

Anne van Kesteren

Comment 4 2023-04-01 00:38:28 PDT

The standards ended up considering the BOM as the most authoritative piece of information. This is codified by the HTML and Encoding standards nowadays.
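The resulting precedence can be sketched as follows. This is a simplified illustration, not the algorithm text from either standard: the function names `sniff_bom` and `choose_encoding` and the windows-1252 fallback are assumptions made for the example, and steps such as meta prescanning are left out.

```python
import codecs
from typing import Optional

# BOM byte patterns checked in order: the three-byte UTF-8 BOM first,
# then the two-byte UTF-16 big-endian and little-endian BOMs.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(body: bytes) -> Optional[str]:
    """Return the encoding indicated by a leading BOM, or None."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    return None


def choose_encoding(body: bytes, http_charset: Optional[str],
                    fallback: str = "windows-1252") -> str:
    """BOM first, then the HTTP charset, then an assumed fallback."""
    return sniff_bom(body) or http_charset or fallback


page = codecs.BOM_UTF8 + b"<!DOCTYPE html>"
print(choose_encoding(page, "koi8-r"))                 # the BOM wins over HTTP
print(choose_encoding(b"<!DOCTYPE html>", "koi8-r"))   # no BOM: HTTP charset is used
```

With this ordering, the test page from the original report is decoded as UTF-8 despite its KOI8-R header, which matches the behaviour WebKit already had.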

Status RESOLVED

Resolution INVALID

Priority P2

Severity Major

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component DOM

Assignee

Nobody

Reported

2011-08-11 12:42 PDT

Modified

2023-04-01 00:38 PDT

CC List

2 users

URL

http://malform.no/testing/html5/bom/htm.html