Bug 215764

Summary: incorrect charset default for text/xml
Product: WebKit
Reporter: Julian Reschke <julian.reschke>
Component: DOM
Assignee: Nobody <webkit-unassigned>
Status: RESOLVED CONFIGURATION CHANGED
Severity: Normal
CC: annevk, ap, webkit-bug-importer
Priority: P2
Keywords: InRadar
Version: Safari 13
Hardware: Unspecified
OS: Unspecified
URL: http://test.greenbytes.de/tech/tc/httpcontenttype/#textxmlnodefaultutf8nodecl

Description Julian Reschke 2020-08-24 04:09:16 PDT
Apparently, when getting a content-type of "text/xml" (no charset parameter), Safari defaults to ISO-8859-1, instead of inspecting the XML content.

See testcase at

  http://test.greenbytes.de/tech/tc/httpcontenttype/#textxmlnodefaultutf8nodecl

(Note that Firefox and Chrome correctly detect the charset.)
Comment 1 Alexey Proskuryakov 2020-08-24 17:29:45 PDT
Could you please clarify what you expect as "inspecting the XML content"? This test case doesn't seem to have any kind of encoding declaration, so it could expect either defaulting to UTF-8, or sniffing.

I think that we are probably defaulting to the embedding page charset here, and that wouldn't seem obviously wrong.
Comment 2 Julian Reschke 2020-08-24 21:29:42 PDT
I would expect that it follows:

   https://www.w3.org/TR/REC-xml/#sec-guessing

That's what the other browsers do.
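
For illustration, here is a minimal Python sketch of the kind of detection described in that appendix: look for a byte order mark, then for the byte pattern of "<?xml", then for an encoding declaration, and otherwise assume UTF-8. The function name guess_xml_encoding and the 1024-byte scan limit are assumptions made for this example; the UCS-4 and EBCDIC cases from the appendix are left out, and this is not how any browser engine actually implements it.

```python
import re

# Byte order marks (UCS-4 variants deliberately omitted in this sketch).
_BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# The first two characters "<?" as they appear in UTF-16 without a BOM.
_DECL_PREFIXES = [
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]

# Encoding declaration inside an ASCII-compatible XML declaration.
_ENCODING_RE = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its first bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for prefix, name in _DECL_PREFIXES:
        if data.startswith(prefix):
            return name
    match = _ENCODING_RE.match(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no encoding declaration: fall back to UTF-8.
    return "utf-8"
```

For the test case in this bug (no BOM, no XML declaration), such a function would return "utf-8".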
Comment 3 Julian Reschke 2020-08-24 21:32:42 PDT
And:

   https://www.w3.org/TR/REC-xml/#charencoding

says:

"Though an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration: (...)"
Comment 4 Alexey Proskuryakov 2020-08-25 09:30:02 PDT
https://www.w3.org/TR/REC-xml/#charencoding defers to RFC 3023 for text/xml resources delivered over http, which says:

      Conformant with [RFC2046], if a text/xml entity is received with
      the charset parameter omitted, MIME processors and XML processors
      MUST use the default charset value of "us-ascii"[ASCII].  In cases
      where the XML MIME entity is transmitted via HTTP, the default
      charset value is still "us-ascii".  (Note: There is an
      inconsistency between this specification and HTTP/1.1, which uses
      ISO-8859-1[ISO8859] as the default for a historical reason.  Since
      XML is a new format, a new default should be chosen for better
      I18N.  US-ASCII was chosen, since it is the intersection of UTF-8
      and ISO-8859-1 and since it is already used by MIME.)

So it looks like other browser engines violate the spec in a different way. Us inheriting the default charset from the page is at least consistent with how other text/ subresources are handled.
Comment 5 Julian Reschke 2020-08-25 12:37:28 PDT
Unless I'm missing something, https://www.w3.org/TR/REC-xml/#charencoding does not refer to RFC 3023 at all.

That said, what would be relevant is the *current* definition of the text/xml media type, which is RFC 7303.

Also, it seems you missed the normative text in <https://www.w3.org/TR/REC-xml/#charencoding>:

"Though an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration: (...)"

Note the last sentence; if there is no external character encoding information, the default is UTF-8 or UTF-16, nothing else.
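
A small Python sketch of that reading, assuming the entity carries no encoding declaration (as in the test case): a charset parameter on the Content-Type header counts as external information and is used when present; otherwise only UTF-16 (signalled by a byte order mark) or UTF-8 remain. The function names external_charset and effective_encoding are made up for this example, and the sketch does not model any further rules from RFC 7303.

```python
from email.message import Message
from typing import Optional


def external_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def effective_encoding(content_type: str, body: bytes) -> str:
    """Pick an encoding for an XML body that has no encoding declaration."""
    charset = external_charset(content_type)
    if charset is not None:
        # External character encoding information is present.
        return charset
    # No external information: only UTF-16 (with a BOM) or UTF-8 remain.
    if body.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16"
    return "utf-8"
```

With a bare "text/xml" header and a UTF-8 body, effective_encoding returns "utf-8"; it never falls back to ISO-8859-1 or to the charset of an embedding page.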
Comment 6 Alexey Proskuryakov 2020-08-25 13:03:17 PDT
Wrong copy/paste, I wanted to say that https://www.w3.org/TR/REC-xml/#sec-guessing referred to RFC 3023.

My understanding of the specs' language is that anything loaded via HTTP falls into the "has external character encoding information" case, even when there is no charset in the HTTP headers; in that case the external information is simply taken to be the default for HTTP.
Comment 7 Julian Reschke 2020-08-26 03:57:16 PDT
...but there is no default in HTTP.

(There was one in RFC 2616, but it was removed in RFC 723* for good reasons.)
Comment 9 Radar WebKit Bug Importer 2020-08-31 04:10:16 PDT
<rdar://problem/68065097>
Comment 10 Anne van Kesteren 2023-12-18 04:51:42 PST
This appears to have been fixed.