66055 – The XML parser doesn't (always) default to UTF-8 when HTTP charset or encoding declaration is lacking

UNCONFIRMED 66055


https://bugs.webkit.org/show_bug.cgi?id=66055


Leif Halvard Silli

Reported 2011-08-11 07:16:37 PDT

ISSUE:

Webkit fails to *always* assume that UTF-8 is the default encoding of an XML file for which explicit external or internal encoding information is lacking.

BACKGROUND:

According to section 4.3.3 of the XML 1.0 spec, documents that are not served with - or do not contain - explicit encoding information MUST be either UTF-16 encoded or UTF-8 encoded:

]] In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration [[

(Note that the encoding known as "UTF-16" always includes the BOM - which in principle is a form of explicit encoding declaration.)

Further down in the same section it is stated that when a page is not served with - or does not contain - explicit encoding information, including when it does not contain the BOM, then it is a FATAL ERROR if the page is not encoded as UTF-8:

]] In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. [[

STEPS TO REPRODUCE THIS BUG:

1) In a browser in the Webkit family (including the nightly build), go to the "Text Encodings" submenu of the View menu and select something other than "Default" or "UTF-8". (I will assume that you select "KOI8-R".) This step changes - for the current window or tab - the default encoding from "Default/Automatic" to the encoding that you selected.

2) Now, within the same window or tab, visit this page: http://malform.no/testing/html5/bom/normal-XML-BOMless-HTTPcharsetLESS

   That page has the following features:
   a) XHTML page
   b) served as application/xhtml+xml in the HTTP Content-Type: header
   c) served *without* the charset=foo attribute in the HTTP Content-Type: header
   d) *no* BOM (byte order mark) in the document
   e) *no* encoding declaration (<?xml version="1.0" encoding="UTF-8" ?>) in the document

EXPECTED RESULTS:

Webkit should ignore that the user changed the default encoding to KOI8-R and instead, in accordance with section 4.3.3 of XML 1.0, assume the encoding of the page to be "UTF-8".

ACTUAL RESULTS:

Webkit instead respects the user's choice of default encoding (i.e. it renders the page as KOI8-R), and does so without displaying a fatal error.

COMMENTS:

[OTHER PARSERS:] Firefox does not have this bug. Opera *does* have a similar bug. I don't know if IE9 has this bug. I don't think XML parsers in general (e.g. libxml2) have this bug.

[RELEVANCE:] Because XML must default to UTF-8 in the absence of other info from the page server or from the page, Polyglot Markup [1] states that one does not need to declare the encoding for XML parsers. However, as long as Webkit does not abide by XML 1.0's default to UTF-8, Polyglot Markup's advice does not really hold. Thus the only way that works is to use the BOM, which some are against using, however. [2]

[1] http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html#character-encoding
[2] http://www.w3.org/Bugs/Public/show_bug.cgi?id=13392
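
For illustration only, the precedence described in the report can be sketched in Python. This is not WebKit code: the function name `pick_xml_encoding` and its parameters are hypothetical, and the sketch is simplified (it does not handle UTF-32 BOMs or the BOM-less UTF-16 detection of XML 1.0 Appendix F, and the declaration regex only works for ASCII-compatible encodings). The `user_default` parameter is accepted but deliberately ignored, which models the expected result above.

```python
import codecs
import re

# Byte order marks recognised by this simplified sketch.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document, e.g. <?xml version="1.0" encoding="UTF-8" ?>.
_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def pick_xml_encoding(body, http_charset=None, user_default=None):
    """Return the encoding name to use for an XML document (hypothetical helper).

    Assumed precedence: HTTP charset, then BOM, then encoding declaration,
    then UTF-8. user_default (e.g. "koi8-r" picked in a Text Encodings menu)
    is intentionally never consulted.
    """
    if http_charset:
        return http_charset.lower()
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # No external info, no BOM, no declaration: XML 1.0 section 4.3.3
    # leaves UTF-8 as the only permitted encoding.
    return "utf-8"


# The test page from step 2: no charset, no BOM, no declaration.
page = "<html xmlns='http://www.w3.org/1999/xhtml'/>".encode("utf-8")
print(pick_xml_encoding(page, user_default="koi8-r"))
```

Under these assumptions the user's menu choice never reaches the decision, so the test page from step 2 resolves to UTF-8 regardless of what was selected in step 1.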


Leif Halvard Silli

Comment 1 2011-08-13 15:09:51 PDT

A special case of this bug is when "an XML document may be in a subframe inside an HTML one" - see bug 66056#c13

Test page - HTML page with XHTML subframe: <http://malform.no/testing/html5/bom/frame>
Results: Webkit lets the XHTML page inherit the encoding from the HTML page ... IE9 and Firefox do not have this bug.

For reference: Subframe served as HTML <http://malform.no/testing/html5/bom/frame4>
Results: Now, Webkit respects the encoding declaration of the subframe. Firefox and IE6 to IE9 behave like Webkit, except that they do not have bug 17873 ("Encoding override should not be persistent").

Comment: It is quite weird that the HTML page works better and more safely (when it comes to encoding) than the XHTML file. And the only reason for this weirdness is that Webkit does not follow XML 1.0's encoding rules.

Leif Halvard Silli

Comment 2 2011-08-13 16:19:27 PDT

Another instance of the same problem is SVG files (and MathML files), which may contain non-ASCII and non-Windows-1252 text.

Test page: http://malform.no/testing/html5/bom/frame5

Features of the HTML page:
a) Windows-1252 encoded
b) contains <iframe>, <img> and <object>, with the same SVG file in each
c) SVG file features:
   * UTF-8 encoded, but without encoding declaration or BOM.
   * Text of the SVG file is "Hello, world!" in Russian (Cyrillic).

EXPECTED RESULTS:

That the SVG file is rendered the same way, regardless of whether it is referenced from the <iframe> element, the <img> element or the <object> element.

ACTUAL RESULTS:

Only when it is referenced from the <img> element does the SVG work as expected.
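
To make the visible symptom concrete, here is a small Python sketch of what happens when UTF-8 encoded Cyrillic is decoded with a different single-byte encoding, such as Windows-1252 (the encoding of the HTML page in the test) or KOI8-R (the encoding picked in the original steps). The sample string is an assumption; the exact text of the SVG test file is not quoted in this report.

```python
# Assumed sample text; the actual SVG test file may use different wording.
text = "Привет, мир!"
raw = text.encode("utf-8")  # the bytes as stored in the BOM-less file

# Correct: decode as UTF-8, the XML default when nothing else is declared.
as_utf8 = raw.decode("utf-8")

# Wrong: decode with an inherited or user-selected single-byte encoding.
as_koi8r = raw.decode("koi8_r")
as_cp1252 = raw.decode("cp1252", errors="replace")

print(as_utf8)
print(as_koi8r)
print(as_cp1252)

# Each Cyrillic letter occupies two bytes in UTF-8, so the mis-decoded
# strings contain more characters than the original text.
print(len(text), len(as_koi8r))
```

Only the first decoding round-trips to the original text; the other two produce garbage of the kind seen in the <iframe> and <object> cases.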
