Bug 3809 - Should default to UTF-8 or UTF-16 for application/xml documents with omitted charset and encoding declaration
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM
Version: 312.x
Hardware: Mac OS X 10.3
Importance: P2 Major
Assignee: Darin Adler
URL: http://hsivonen.iki.fi/test/mobile/la...
Depends on:
Reported: 2005-07-02 04:33 PDT by Henri Sivonen
Modified: 2019-02-06 09:04 PST

proposed patch (741 bytes, patch)
2005-09-09 12:49 PDT, Alexey Proskuryakov
darin: review+

Description Henri Sivonen 2005-07-02 04:33:38 PDT
Steps to reproduce:
1) Make Safari load (either in content area or through XMLHttpRequest) an XML
document that 
  does not have an XML declaration that declares the character encoding 
  does not have a BOM 
  is encoded in UTF-8 
  contains characters from outside the ASCII range
  is served as either application/xml or application/xhtml+xml
  has no charset parameter on the HTTP layer.

(Although the above looks very specific, the conditions commonly hold true.)

2) Observe.
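
The conditions in step 1 can be checked mechanically. A minimal sketch, assuming a hypothetical helper name (matches_bug_conditions) and a made-up sample payload; it is not taken from the test case in the URL field:

```python
# Byte order marks that would let a parser detect the encoding on its own.
BOMS = (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff")
XML_TYPES = ("application/xml", "application/xhtml+xml")


def is_utf8(body: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def matches_bug_conditions(body: bytes, content_type: str) -> bool:
    """Return True if a response meets every condition listed in step 1."""
    value = content_type.lower()
    media_type = value.split(";")[0].strip()
    return (
        not body.startswith(b"<?xml")             # no XML declaration
        and not body.startswith(BOMS)             # no BOM
        and is_utf8(body)                         # encoded in UTF-8
        and any(byte > 0x7F for byte in body)     # non-ASCII characters
        and media_type in XML_TYPES               # served as an XML type
        and "charset=" not in value               # no charset on the HTTP layer
    )


# Hypothetical sample payload, for illustration only.
sample = "<p>Gr\u00fc\u00dfe</p>".encode("utf-8")
print(matches_bug_conditions(sample, "application/xml"))
```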

Actual results:
The bytes are decoded as characters according to the Default Encoding in
Appearance preferences.
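
To illustrate the effect, here is a minimal sketch that assumes the Default Encoding is set to ISO-8859-1 (an assumption; the actual preference value varies per user):

```python
text = "caf\u00e9"

# The document is stored and sent as UTF-8: "é" becomes the two bytes C3 A9.
raw = text.encode("utf-8")

# Decoding with the assumed default encoding turns each byte into its own character.
wrong = raw.decode("iso-8859-1")

# Decoding as UTF-8 restores the original text.
right = raw.decode("utf-8")

print(wrong)  # cafÃ©
print(right)  # café
```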

Expected results:
Expected the bytes to be decoded as characters according to UTF-8 as per section
3.2 of RFC 3023, which defers to XML 1.0 section 4.3.3.
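
A rough sketch of the expected fallback, simplified from XML 1.0 section 4.3.3. The function name (sniff_xml_encoding) is hypothetical, and the sketch ignores UTF-32 and other cases that full autodetection would cover:

```python
import codecs
import re

# Matches an encoding declaration in an XML declaration at the very start of the data.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def sniff_xml_encoding(data: bytes) -> str:
    """Pick an encoding when no higher-level protocol supplies a charset."""
    # 1. A BOM identifies UTF-8 or UTF-16 directly.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. Otherwise, honour an explicit encoding declaration.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. No BOM and no declaration: the document must be UTF-8.
    return "utf-8"


print(sniff_xml_encoding(b"<p>caf\xc3\xa9</p>"))
```

The last branch is the case this bug is about: the result should not depend on any user preference.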

Additional information:
Besides the obvious implications of this bug, there are two less obvious ones:
1) Safari cannot properly consume Canonical XML.
2) Safari cannot properly consume XML documents it has produced itself via
XMLHttpRequest POST!
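
Point 1 follows from Canonical XML being UTF-8 with no XML declaration. A small sketch using Python's standard library (3.8 or later) as a stand-in canonicalizer; the sample document is made up:

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input that carries an XML declaration.
source = '<?xml version="1.0"?><p>caf\u00e9</p>'

# The canonical form drops the XML declaration entirely.
canonical_text = canonicalize(xml_data=source)

# Canonical XML is serialized as UTF-8, so the bytes on the wire look like this.
canonical_bytes = canonical_text.encode("utf-8")

print(canonical_text.startswith("<?xml"))
print(canonical_bytes)
```

Such output meets every condition in the steps to reproduce as soon as it contains a non-ASCII character and is served without a charset parameter.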
Comment 1 Oliver Hunt 2005-07-21 16:26:05 PDT
Would you be able to attach a test document?
Comment 2 Henri Sivonen 2005-09-09 01:14:22 PDT
What reduction is needed beyond the case that has been in the URL field all along?
Comment 3 Oliver Hunt 2005-09-09 01:25:10 PDT
Behaviour is wrong (confirmed against Firefox).
Comment 4 Alexey Proskuryakov 2005-09-09 12:49:23 PDT
Created attachment 3827
proposed patch

Well, the XML spec is pretty explicit about files that do not have an encoding
declaration in the text declaration - they should be UTF-8 or UTF-16, unless a
higher-level protocol defines a charset (4.3.3).
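
The precedence described here can be sketched as follows. This is an illustration only and not the attached patch; the function names are hypothetical:

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def pick_encoding(content_type: str, fallback: str = "utf-8") -> str:
    """A charset from the higher-level protocol wins; otherwise use the XML default."""
    return charset_from_content_type(content_type) or fallback


print(pick_encoding("application/xml; charset=ISO-8859-1"))
print(pick_encoding("application/xml"))
```

The fallback here stands for whatever the XML rules yield (a BOM, an encoding declaration, or UTF-8), never a user preference.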
Comment 5 Alexey Proskuryakov 2005-09-09 12:50:57 PDT
The file from bug URL can serve as a test case (without a link to the next test, of course).
Comment 6 Darin Adler 2005-09-09 15:36:48 PDT
Comment on attachment 3827
proposed patch

Is there any other browser that has this behavior? The comments above lead me
to believe this is not working this way in Firefox.
Comment 7 Henri Sivonen 2005-09-09 23:55:57 PDT
Gecko used to have this same bug (at least in content area--not sure about
XMLHttpRequest), but it has been fixed.
Comment 8 Alexey Proskuryakov 2005-09-10 03:22:28 PDT
Henri, which Gecko bugfix are you referring to? I see that Firefox 1.0.5 renders the test as expected, but I
couldn't find anything in Bugzilla.

I found <https://bugzilla.mozilla.org/show_bug.cgi?id=247024>, but it talks about a different issue:
documents transferred with MIME type text/xml should default to us-ascii, not utf-8. I'm not sure if
WebKit has the same problem, but if it has, that should be in a separate report IMO.
Comment 9 Darin Adler 2005-09-11 21:57:43 PDT
Comment on attachment 3827
proposed patch

I thought about it a lot, and I think it's fine to land the fix just like this.
Comment 10 Lucas Forschler 2019-02-06 09:04:18 PST
Mass moving XML DOM bugs to the "DOM" Component.