Bug 66055 - The XML parser doesn't (always) default to UTF-8 when HTTP charset or encoding declaration is lacking
Status: UNCONFIRMED
Alias: None
Product: WebKit
Classification: Unclassified
Component: XML
Version: 528+ (Nightly build)
Hardware: All All
Importance: P2 Major
Assignee: Nobody
URL: http://malform.no/testing/html5/bom/n...
Keywords:
Depends on:
Blocks: 66106
Reported: 2011-08-11 07:16 PDT by Leif Halvard Silli
Modified: 2023-09-21 09:13 PDT
Description Leif Halvard Silli 2011-08-11 07:16:37 PDT
ISSUE: 

   WebKit does not *always* assume that UTF-8 is the default encoding of an XML file that lacks explicit external or internal encoding information.

BACKGROUND:

   According to section 4.3.3 of the XML 1.0 spec, documents that are not served with - or do not contain - explicit encoding information MUST be either UTF-16 encoded or UTF-8 encoded:

]]
  In the absence of external character encoding information (such as MIME 
  headers), parsed entities which are stored in an encoding other than 
  UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The Text 
  Declaration) containing an encoding declaration
[[ 

(Note that the encoding known as  "UTF-16" always includes the BOM - which in principle is a form of explicit encoding declaration.) 

Further down in the same section it is stated that when a page is not served with - and does not contain - explicit encoding information, and also does not contain the BOM, it is a FATAL ERROR if the page is not encoded as UTF-8:

]]
   In the absence of information provided by an external transport protocol 
   (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding
   declaration to be presented to the XML processor in an encoding other 
   than that named in the declaration, or for an entity which begins with 
   neither a Byte Order Mark nor an encoding declaration to use an encoding
   other than UTF-8.
[[
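
The order of precedence implied by the two quotes above can be sketched in Python. This is only an illustration under stated assumptions, not WebKit's code: the function name xml_encoding and the http_charset parameter are hypothetical, UTF-32 BOMs are not handled, and the declaration is only recognised in ASCII-compatible encodings.

```python
import re

# Byte order marks, checked before anything else inside the entity.
_BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
)

# Encoding declaration inside an XML/text declaration at the very start.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_encoding(data, http_charset=None):
    """Return the encoding an XML processor should assume for `data`.

    `http_charset` stands for the charset parameter of the HTTP
    Content-Type header, or None when the server sent none.
    """
    if http_charset:                      # external information comes first
        return http_charset.lower()
    for bom, name in _BOMS:               # then a byte order mark
        if data.startswith(bom):
            return name
    match = _DECL.match(data)             # then the encoding declaration
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"                        # otherwise UTF-8, not a user-chosen default


# The test page in this report: no charset, no BOM, no declaration.
print(xml_encoding(b"<html xmlns='http://www.w3.org/1999/xhtml'/>"))
```

Note that the user's menu choice never enters the function: per the quoted text, nothing except the transport protocol, the BOM and the declaration may override the UTF-8 default.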

STEPS TO REPRODUCE THIS BUG:

1) In a browser in the WebKit family (including the nightly build), go to the "Text Encodings" submenu of the View menu and select something other than "Default" or "UTF-8". (I will assume that you select "KOI8-R".) This step changes - for the current window or tab - the default encoding from "Default/Automatic" to the encoding that you selected.

2) Now, within the same window or tab, visit this page:
    http://malform.no/testing/html5/bom/normal-XML-BOMless-HTTPcharsetLESS

    That page has the following features:
      a) XHTML page
      b) served as application/xhtml+xml in the HTTP Content-Type: header
      c) served *without* the charset=foo attribute in the HTTP Content-Type: header
      d) *no* BOM (byte order mark) in the document
      e) *no* encoding declaration (<?xml version="1.0" encoding="UTF-8" ?>) in the document

EXPECTED RESULTS:  WebKit should ignore that the user changed the default encoding to KOI8-R and instead, in accordance with section 4.3.3 of XML 1.0, assume the encoding of the page to be "UTF-8".

ACTUAL RESULTS:  WebKit instead respects the user's choice of default encoding (i.e. it renders the page as KOI8-R), and does so without displaying a fatal error.
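
To illustrate what this looks like in practice, here is a small Python sketch. The sample string is hypothetical and is not the content of the test page; the point is only that UTF-8 bytes read as KOI8-R turn into unrelated characters rather than triggering an error.

```python
# Hypothetical sample text, stored as UTF-8 without BOM or declaration.
text = "Привет, мир!"
raw = text.encode("utf-8")

# What an XML processor following section 4.3.3 would do.
as_utf8 = raw.decode("utf-8")

# What happens when the user-selected KOI8-R is applied instead.
# errors="replace" is used defensively so that decoding cannot raise.
as_koi8r = raw.decode("koi8-r", errors="replace")

print(as_utf8)
print(as_koi8r)   # one (wrong) character per input byte
```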

COMMENTS:

[OTHER PARSERS:] Firefox does not have this bug. Opera *does* have a similar bug. I don't know if IE9 has this bug. I don't think XML parsers in general (e.g. libxml2) have this bug.

[RELEVANCE:] Because XML must default to UTF-8 in the absence of other info from the page server or from the page, Polyglot Markup [1] states that one does not need to declare the encoding for XML parsers. However, as long as WebKit does not abide by XML 1.0's default to UTF-8, Polyglot Markup's advice does not really hold. Thus the only way that works is to use the BOM - which, however, some are against using. [2]

[1] http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html#character-encoding
[2]  http://www.w3.org/Bugs/Public/show_bug.cgi?id=13392
Comment 1 Leif Halvard Silli 2011-08-13 15:09:51 PDT
A special case of this bug is when "an XML document may be in a subframe inside an HTML one" - see bug 66056#c13

Test page - HTML page with XHTML subframe: <http://malform.no/testing/html5/bom/frame>
   Results: 
      WebKit lets the XHTML page inherit the encoding from the HTML page ...
      IE9 and Firefox do not have this bug.

For reference:  Subframe served as HTML <http://malform.no/testing/html5/bom/frame4>
   Results: 
      Now, WebKit respects the encoding declaration of the subframe
      Firefox and IE6 to IE9 behave like WebKit, except that they do not have bug 17873 ("Encoding override should not be persistent")

Comment: It is quite weird that the HTML page works better and more safely (when it comes to encoding) than the XHTML file. The only reason for this weirdness is that WebKit does not follow XML 1.0's encoding rules.
Comment 2 Leif Halvard Silli 2011-08-13 16:19:27 PDT
Another instance of the same problem is SVG files (and MathML files), which may contain non-ASCII and non-Windows-1252 text.

Test page: http://malform.no/testing/html5/bom/frame5
   Features of the HTML page: 
      a) Windows-1252, 
      b) contains <iframe>, <img> and <object> with the same SVG file in each
      c) SVG file features: 
          * UTF-8 encoded, but without encoding declaration or BOM.
          * Text of the SVG file is "Hello, world!" in Russian (Cyrillic).

EXPECTED RESULTS: 
   That the SVG file is rendered the same way, regardless of whether it is referenced from the <iframe> element, the <img> element or the <object> element.

ACTUAL RESULTS: 
   Only when referenced from the <img> element does the SVG work as expected.
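
A local approximation of this test page could be generated along the following lines. This is a sketch under assumptions: the file names hello.svg and index.html, the helper write_fixture and the markup itself are hypothetical and are not copies of the files at malform.no. To reproduce the conditions, the SVG would also have to be served as image/svg+xml without a charset parameter.

```python
from pathlib import Path

# SVG body: Cyrillic text, no XML declaration (hypothetical markup).
SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="300" height="60">'
    '<text x="10" y="40">Привет, мир!</text></svg>'
)

# HTML host page: declared as Windows-1252, embeds the same SVG three ways.
HTML = (
    "<!DOCTYPE html>\n"
    '<meta charset="windows-1252">\n'
    "<title>SVG encoding test</title>\n"
    '<iframe src="hello.svg"></iframe>\n'
    '<img src="hello.svg" alt="">\n'
    '<object data="hello.svg" type="image/svg+xml"></object>\n'
)


def write_fixture(directory):
    """Write hello.svg (UTF-8, no BOM, no declaration) and index.html (Windows-1252)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    svg_path = directory / "hello.svg"
    html_path = directory / "index.html"
    svg_path.write_bytes(SVG.encode("utf-8"))     # "utf-8", not "utf-8-sig": no BOM
    html_path.write_bytes(HTML.encode("cp1252"))
    return svg_path, html_path
```

Calling write_fixture("fixture") and serving that directory gives one page where the three embedding methods can be compared side by side.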