Bug 66084

Summary: XML parser doesn't emit Fatal Error when HTTP charset=foo conflicts with BOM
Product: WebKit
Reporter: Leif Halvard Silli <xn--mlform-iua>
Component: XML
Assignee: Nobody <webkit-unassigned>
Status: UNCONFIRMED ---
Severity: Major
CC: ap
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
URL: http://malform.no/testing/html5/bom/xml.html

Description Leif Halvard Silli 2011-08-11 12:24:37 PDT
ISSUE: 

   Webkit fails to ignore the BOM when the charset=foo parameter of the HTTP Content-Type: header conflicts with it. In other words: it lets the BOM take precedence over the HTTP Content-Type: header. NB: This bug is actually present for *both* HTML and XML. Of course, the BOM is not actually a BOM unless the document uses an encoding which includes the BOM. Hence, if the HTTP Content-Type: header says "ISO-8859-1" while the document contains the BOM, then, according to current specs, the parser should emit a FATAL ERROR.

BACKGROUND:

  See section 4.3.3 of the XML 1.0 spec.

WAYS TO REPRODUCE THIS BUG:

* Visit http://malform.no/testing/html5/bom/xml.html
    That page is served with an HTTP Content-Type: header which says "application/xhtml+xml;charset=KOI8-r". However, internally the page is actually UTF-8 encoded, and, importantly, it also contains the BOM. But when the page is read as KOI8-R encoded, as XML 1.0 requires, the BOM bytes become illegal characters before the DOCTYPE, which in turn should cause a FATAL ERROR.
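
    To illustrate the step above, here is a minimal Python sketch. It assumes Python's built-in `koi8_r` codec is a fair stand-in for how a conforming parser would decode the bytes; the sample `body` is made up for illustration and is not fetched from the test page.

```python
import codecs

# Hypothetical stand-in for the test page: a UTF-8 BOM followed by a DOCTYPE.
body = codecs.BOM_UTF8 + b"<!DOCTYPE html>"

# Decoding per the HTTP header (KOI8-R) turns the three BOM bytes into three
# ordinary characters that sit in front of the DOCTYPE.
as_koi8r = body.decode("koi8_r")
prefix = as_koi8r[: as_koi8r.index("<")]

# Decoding as UTF-8 with BOM handling strips the BOM instead.
as_utf8 = body.decode("utf-8-sig")

print(len(prefix), [hex(ord(ch)) for ch in prefix])
print(as_utf8.startswith("<!DOCTYPE"))
```

    Under KOI8-R the document no longer starts with "<", which is what should trigger the fatal error.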

EXPECTED RESULT:  Webkit should obey the charset info in the HTTP Content-Type: header w.r.t. the encoding. Hence it should emit a FATAL ERROR.
  
ACTUAL RESULT:  Webkit instead ignores the charset info in the HTTP Content-Type: header and obeys the BOM.
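
The difference between the expected and the actual result can be sketched as two precedence orders. This is only an illustration, not WebKit's code: the function names, the UTF-8 fallback, and the restriction to UTF-8 and UTF-16 signatures are assumptions made for the example.

```python
import codecs
from typing import Optional

# BOM signatures checked by this sketch (UTF-8 and UTF-16 only; UTF-32 is ignored).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for signature, name in BOMS:
        if data.startswith(signature):
            return name
    return None


def pick_encoding_http_first(data: bytes, http_charset: Optional[str]) -> str:
    """Expected behaviour per this report: the HTTP charset wins when present."""
    return http_charset or sniff_bom(data) or "utf-8"


def pick_encoding_bom_first(data: bytes, http_charset: Optional[str]) -> str:
    """Behaviour described under ACTUAL RESULT: the BOM wins over HTTP."""
    return sniff_bom(data) or http_charset or "utf-8"


sample = codecs.BOM_UTF8 + b"<root/>"
print(pick_encoding_http_first(sample, "koi8-r"))
print(pick_encoding_bom_first(sample, "koi8-r"))
```

With the HTTP-first order the parser goes on to decode the BOM bytes as KOI8-R and hits illegal content before the root element; with the BOM-first order it silently decodes the page as UTF-8.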

COMMENTS:

[OTHER PARSERS:] Firefox does not have this bug. Opera also does not have this bug (unless the user manually overrides the encoding, which is another bug and one that it shares with Webkit). And libxml2 does not have this bug either. But it seems that IE9 has this bug too. In fact, there are a few XML parsers with similar issues; for more data, read http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 (As noted in that bug, I wonder if the Webkit behaviour should become the correct one ... But so far that hasn't happened, and I know about at least one parser [Xerces C++] which is aligning with the specs.)
Comment 1 Alexey Proskuryakov 2011-08-12 21:45:11 PDT
A BOM is the most authoritative indication of encoding, because there are few ways to get it wrong. It's much easier to get an encoding declaration or an HTTP header wrong.

There are some synthetic examples of strings in other encodings that can be mistaken for a BOM, but it hasn't been a practical issue.
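
One such synthetic case, as a minimal Python sketch (the sample string is chosen purely for illustration): ISO-8859-1 text that begins with the characters "ï»¿" produces exactly the three bytes of the UTF-8 BOM.

```python
import codecs

# "ï»¿" written with escapes so the source file's own encoding does not matter.
text = "\u00ef\u00bb\u00bf"

# Encoded as ISO-8859-1, these three characters become the bytes EF BB BF.
raw = text.encode("iso-8859-1")

print(raw == codecs.BOM_UTF8)
```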
Comment 2 Leif Halvard Silli 2011-08-13 13:37:56 PDT
(In reply to comment #1)
> A BOM is the most authoritative indication of encoding, because there are few ways to get it wrong. It's much easier to get an encoding declaration or an HTTP header wrong.
> 
> There are some synthetic examples of strings in other encodings that can be mistaken for a BOM, but it hasn't been a practical issue.

I see your point.

But if this is Webkit's position, then I recommend stating this in W3_bug_12897 (<http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897>) and also filing a bug against XML 1.0!

Alternatively, Webkit's XML parser could remove support for all encodings except those for which XML 1.0 requires support: UTF-8 and UTF-16. Then you could at least support *those* encodings correctly!

IMHO, a valid reason not to listen to Webkit, if you should want to change HTML5 and XML 1.0 to behave like Webkit currently behaves w.r.t. HTTP and the BOM, is that Webkit has so many errors when it comes to encodings in XML files: you do not properly treat UTF-8 as the default encoding, and you almost never emit a fatal error (see the other bugs I filed recently).

Thus, to support your position on the BOM when HTTP conflicts with it would be to simultaneously and silently support the way Webkit in general treats encodings for XML files. It would be to "cave in" to Webkit's current position, which seems to be that

  a) encodings in XML and HTML should behave the same way - the HTML way
  b) there is no real encoding default for XML files, except when there is a BOM

To add a point c)

  c) always adhere to HTTP  except when there is a BOM

becomes too much, I am afraid. Maybe c) is useful; I am seriously considering that it is! (See the W3C bug I pointed to above.) But unless Webkit properly supports the rest of the encoding rules, Webkit's position does not sound very credible to me.