Bug 66189

Summary: XML parser doesn't emit FATAL ERROR for all detectable encoding errors
Product: WebKit Reporter: Leif Halvard Silli <xn--mlform-iua>
Component: XML Assignee: Nobody <webkit-unassigned>
Status: UNCONFIRMED ---    
Severity: Major CC: annevk, ap
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
URL: http://www.w3.org/TR/xml/#charencoding

Description Leif Halvard Silli 2011-08-13 09:29:10 PDT
ISSUE: 

   XML 1.0 requires that if an explicit or implicit encoding label can be detected to be incorrect, then it is always a fatal error. Webkit does not fully adhere to this requirement.

   * Bug 66084 focuses on a particular detail of this bug, namely that Webkit adheres to the BOM rather than HTTP's Content-Type: charset=foo attribute whenever the two differ. This should lead to a Fatal Error, but Webkit instead behaves as if nothing were wrong. The problem in that case is that HTTP, according to XML, should be considered authoritative, and thus the BOM - viewed from the authoritative encoding's point of view - is not a legal BOM character anymore.

   * The focus of this bug is cases where the file does not have any accompanying external encoding info but there is explicit (declaration) or implicit (default) encoding info inside the file.

   * NOTE: It is - unfortunately perhaps - not an error in itself if HTTP differs from the XML encoding declaration - this is only a Fatal Error if the HTTP declared encoding can be *detected* to be incompatible with the actual encoding *irrespective* of what the internal encoding declaration says. (A typical example of such a detectable error is when the page contains a BOM that, from the externally declared encoding's point of view, is not a BOM - again, bug 66084. Another example could be that the external declaration says "US-ASCII" while the page obviously is not using US-ASCII.)
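
   To illustrate what "detectable" could mean for an externally declared encoding, here is a minimal Python sketch. It is not Webkit code; the function names are hypothetical, UTF-32 BOMs are deliberately left out, and Python's codec registry stands in for the browser's encoding table.

```python
import codecs

# BOMs this sketch knows about (UTF-32 is deliberately left out).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def bom_encoding(data: bytes):
    """Return the encoding signalled by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def external_label_conflicts(data: bytes, external_label: str) -> bool:
    """True if the external (e.g. HTTP) label is detectably wrong for data."""
    external = codecs.lookup(external_label).name  # normalised codec name
    signalled = bom_encoding(data)
    if signalled is not None:
        compatible = {signalled}
        if signalled.startswith("utf-16"):
            compatible.add("utf-16")      # "UTF-16" accepts either byte order
        if signalled == "utf-8":
            compatible.add("utf-8-sig")
        if external not in compatible:
            return True                   # the BOM is not a BOM in that encoding
    try:
        data.decode(external)
    except UnicodeDecodeError:
        return True                       # e.g. non-ASCII bytes under US-ASCII
    return False
```

   With this sketch, a UTF-8 BOM served as ISO-8859-1 and non-ASCII bytes served as US-ASCII are both reported as conflicts, while a plain ASCII document served as US-ASCII is not.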

BACKGROUND:

   Section 4.3.3 of XML 1.0:

]]
  In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.
      [
            snip
      ] 
  It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process
[[ 
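
The rule quoted above, for the case with no external encoding info, can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not how Webkit's parser works: the helper names are hypothetical, the declaration is only looked for in ASCII-compatible, BOM-less input, and entities that start with a BOM are simply not judged.

```python
import codecs
import re

# Matches an encoding declaration at the very start of an ASCII-compatible entity.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes):
    """Return the label from the XML encoding declaration, or None."""
    m = _DECL.match(data)
    return m.group(1).decode("ascii") if m else None


def is_fatal_without_external_info(data: bytes) -> bool:
    """Apply the quoted 4.3.3 rule when no HTTP/MIME charset is available."""
    has_bom = data.startswith(
        (codecs.BOM_UTF8, codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
    )
    label = declared_encoding(data)
    if label is None:
        if has_bom:
            return False      # BOM case is out of scope for this sketch
        label = "utf-8"       # neither BOM nor declaration: must be UTF-8
    try:
        codecs.lookup(label)
    except LookupError:
        return True           # an encoding the processor is unable to process
    try:
        data.decode(label)
    except UnicodeDecodeError:
        return True           # bytes are detectably not in the named encoding
    return False
```

Note that a single-byte encoding such as KOI8-R assigns a character to every byte value, so a mismatch against a KOI8-R declaration is generally not detectable from the bytes alone. That is why the test pages below rely on a conflicting user choice or external label.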

STEPS TO REPRODUCE THIS BUG:

1) Open an empty tab or window in a Webkit browser
2) Manually select the encoding ISO-8859-1 in the Text Encoding submenu.
    NOTE 1: this step can also be performed as step 4) instead of as step 2)
    NOTE 2: For PAGE C, step 2) and step 4) are unnecessary
3) Visit one of these pages:
    PAGE A: http://malform.no/testing/html5/bom/normal-XML-BOMless-HTTPcharsetLESS 
                 (Features: no HTTP charset=foo, no BOM, no encoding declaration.)
    PAGE B: http://malform.no/testing/html5/bom/cyrillic-encoding-declaration
                 (Features: no HTTP charset=foo, no BOM, HOWEVER encoding declaration says "KOI8-R"!)
    PAGE C: http://malform.no/testing/html5/bom/normal-XML-ascii-encoding
                 (Features: HTTP charset=US-ASCII, no BOM, HOWEVER encoding declaration says "KOI8-R"!)
4) Step 4) is the same as step 2) and should be performed if you skipped step 2)
    NOTE: For PAGE C, step 2) and step 4) are unnecessary

EXPECTED RESULTS:  

   FOR PAGE A: Because XML 1.0 says it is a fatal error, quote: "for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8", Webkit should emit a Fatal Error as soon as it discovers that the user tries to use an encoding other than UTF-8.

   FOR PAGE B: Because the user attempts to have the page, quote: "be presented to the XML processor in an encoding other than that named in the declaration", Webkit should emit a Fatal Error as soon as it discovers that the declared encoding ("KOI8-R") differs from the user-chosen encoding.

  ALTERNATIVELY, FOR BOTH PAGE A and PAGE B: Webkit should, like Firefox, prevent, or not react to, the user's encoding choice. This would prevent the user from seeing any fatal errors - this seems like the best choice. (See bug 66056 about ignoring the user's choice and bug 66106 about greying out the encoding names of the Text Encoding menu for XML pages and UTF-8 encoded HTML pages that include the BOM.)
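
  A rough Python sketch of this alternative, where the user's menu choice never overrides what the XML entity itself determines. The function name, parameter names and the exact ordering of BOM versus declaration are assumptions made for illustration only.

```python
def effective_xml_encoding(http_charset, bom_encoding, declared, user_choice=None):
    """Pick the encoding for an XML entity.

    user_choice is accepted but deliberately ignored, mirroring the
    behaviour described for Firefox above.
    """
    # External info first, then in-band signals, then the XML default.
    for candidate in (http_charset, bom_encoding, declared):
        if candidate:
            return candidate
    return "UTF-8"
```

  For PAGE B this would yield "KOI8-R" and for PAGE A "UTF-8", regardless of what is selected in the Text Encoding submenu.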

  FOR PAGE C: as - evidently - US-ASCII is not supported (HTML5 recommends treating the US-ASCII label as equal to WINDOWS-1252), Webkit should emit a fatal error because the page contains non-ASCII letters and because, quote: "It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process"
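
  The difference between honouring the US-ASCII label strictly and silently treating it as WINDOWS-1252 can be shown with a small Python sketch. The sample bytes are made up for illustration and are not the content of PAGE C.

```python
def decodes_as(data: bytes, label: str) -> bool:
    """True if data is a valid byte sequence in the encoding named by label."""
    try:
        data.decode(label)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical page content containing non-ASCII letters, stored as UTF-8.
sample = "<r>bl\u00e5b\u00e6r</r>".encode("utf-8")

strict_ok = decodes_as(sample, "ascii")     # strict US-ASCII: the error is detectable
lenient_ok = decodes_as(sample, "cp1252")   # WINDOWS-1252 alias: the error is masked
```

  Under the strict reading the non-ASCII bytes are rejected, which is where a fatal error could be raised; under the WINDOWS-1252 reading the same bytes decode without complaint.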

ACTUAL RESULTS:  For page A and page B, Webkit respects the user's encoding choice instead of emitting a fatal error (or preventing the user's choice from having effect). For page C, Webkit treats the page as WINDOWS-1251 instead of emitting a fatal error.

COMMENTS:

 * Firefox does not have this bug. (Exception: Firefox has the US-ASCII variant of this bug.)
 * Opera *does* have a similar bug. (Exception: Opera has the US-ASCII variant of this bug.) 
 * IE9: unknown to me.
 * libxml2 does not have this bug (however, it has a bug related to bug 66084, in that it obeys HTTP but fails to emit a fatal error due to the BOM)
Comment 1 Leif Halvard Silli 2011-08-13 13:42:04 PDT
Related HTML5 bug: "Encodings 'misinterpreted for compatibility' should risk fatal error in XHTML"
See: http://www.w3.org/Bugs/Public/show_bug.cgi?id=13771