WebKit Bugzilla
RESOLVED CONFIGURATION CHANGED
215764
incorrect charset default for text/xml
https://bugs.webkit.org/show_bug.cgi?id=215764
Julian Reschke
Reported
2020-08-24 04:09:16 PDT
Apparently, when getting a content-type of "text/xml" (no charset parameter), Safari defaults to ISO-8859-1, instead of inspecting the XML content. See testcase at
http://test.greenbytes.de/tech/tc/httpcontenttype/#textxmlnodefaultutf8nodecl
(Note that Firefox and Chrome correctly detect the charset.)
Alexey Proskuryakov
Comment 1
2020-08-24 17:29:45 PDT
Could you please clarify what you expect as "inspecting the XML content"? This test case doesn't seem to have any kind of encoding declaration, so it could expect either defaulting to UTF-8, or sniffing. I think that we are probably defaulting to the embedding page charset here, and that wouldn't seem obviously wrong.
Julian Reschke
Comment 2
2020-08-24 21:29:42 PDT
I would expect that it follows:
https://www.w3.org/TR/REC-xml/#sec-guessing
That's what the other browsers do.
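For illustration, the kind of detection described in that appendix can be sketched roughly as follows. This is a simplified sketch, not what any browser actually implements: the function name `guess_xml_encoding`, the regular expression, and the reduced set of byte patterns are assumptions made for this example, and the appendix covers more cases (for instance EBCDIC and UTF-32 without a BOM).

```python
import codecs
import re

# Byte order marks. UTF-32 must be checked before UTF-16 because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Hypothetical, simplified pattern for an encoding declaration in an
# ASCII-compatible document, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def guess_xml_encoding(body: bytes) -> str:
    """Guess the encoding of an XML entity that has no external charset."""
    # 1. A byte order mark is decisive.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 2. UTF-16 without a BOM shows up as "<?" interleaved with zero bytes.
    if body.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if body.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # 3. Otherwise look for an encoding declaration.
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. No BOM and no declaration: fall back to UTF-8.
    return "utf-8"
```

A UTF-8 body with no declaration, which is presumably what the linked test case serves, ends up in the last branch: `guess_xml_encoding("<doc>ä</doc>".encode("utf-8"))` returns `"utf-8"`.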
Julian Reschke
Comment 3
2020-08-24 21:32:42 PDT
And:
https://www.w3.org/TR/REC-xml/#charencoding
says: "Though an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration: (...)"
Alexey Proskuryakov
Comment 4
2020-08-25 09:30:02 PDT
https://www.w3.org/TR/REC-xml/#charencoding
defers to RFC 3023 for text/xml resources delivered over http, which says: "Conformant with [RFC2046], if a text/xml entity is received with the charset parameter omitted, MIME processors and XML processors MUST use the default charset value of "us-ascii"[ASCII]. In cases where the XML MIME entity is transmitted via HTTP, the default charset value is still "us-ascii". (Note: There is an inconsistency between this specification and HTTP/1.1, which uses ISO-8859-1[ISO8859] as the default for a historical reason. Since XML is a new format, a new default should be chosen for better I18N. US-ASCII was chosen, since it is the intersection of UTF-8 and ISO-8859-1 and since it is already used by MIME.)"
So it looks like other browser engines violate the spec in a different way. Us inheriting the default charset from the page is at least consistent with how other text/ subresources are handled.
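To make the RFC 3023 rule quoted above concrete, here is a minimal sketch of how a processor following it literally would pick a charset from the Content-Type header alone. The function name `charset_per_rfc3023` is hypothetical, and the sketch models only the quoted sentence for text/xml, not the rest of that RFC and not any browser's behavior.

```python
from email.message import Message
from typing import Optional


def charset_per_rfc3023(content_type: str) -> Optional[str]:
    """Pick a charset from a Content-Type value per the quoted rule."""
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_param("charset")
    if explicit:
        # An explicit charset parameter always wins.
        return str(explicit).lower()
    if msg.get_content_type() == "text/xml":
        # Quoted rule: charset omitted on text/xml means us-ascii.
        return "us-ascii"
    # Other media types are outside the scope of this sketch.
    return None


# A UTF-8 body with one non-ASCII character and no encoding declaration.
body = "<doc>\u00e4</doc>".encode("utf-8")

try:
    body.decode(charset_per_rfc3023("text/xml"))
    ascii_ok = True
except UnicodeDecodeError:
    # Under the us-ascii default, any byte above 0x7F is an error.
    ascii_ok = False

# Under an ISO-8859-1 default the same bytes decode, but to the wrong text.
latin1_text = body.decode("iso-8859-1")
```

With this body, the us-ascii default rejects the document outright, while an ISO-8859-1 default silently turns the two UTF-8 bytes of "ä" into two separate characters.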
Julian Reschke
Comment 5
2020-08-25 12:37:28 PDT
Unless I'm missing something,
https://www.w3.org/TR/REC-xml/#charencoding
does not refer to RFC 3023 at all. That said, what would be relevant is the *current* definition of the text/xml media type, which is RFC 7303. Also, it seems you missed the normative text in <https://www.w3.org/TR/REC-xml/#charencoding>: "Though an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration: (...)"
Note the last sentence; if there is no external character encoding information, the default is UTF-8 or UTF-16, nothing else.
Alexey Proskuryakov
Comment 6
2020-08-25 13:03:17 PDT
Wrong copy/paste, I wanted to say that
https://www.w3.org/TR/REC-xml/#sec-guessing
referred to RFC 3023. My understanding of the specs' language is that anything loaded via HTTP falls into the "has external character encoding information" case, even when there is no charset in the HTTP headers; this just means that the external information is taken to be the HTTP default.
Julian Reschke
Comment 7
2020-08-26 03:57:16 PDT
...but there is no default in HTTP. (There was one in RFC 2616, but it was removed in RFC 723* for good reasons.)
Julian Reschke
Comment 8
2020-08-26 04:24:44 PDT
Link:
https://greenbytes.de/tech/webdav/rfc7231.html#rfc.section.B.p.4
Radar WebKit Bug Importer
Comment 9
2020-08-31 04:10:16 PDT
<rdar://problem/68065097>
Anne van Kesteren
Comment 10
2023-12-18 04:51:42 PST
This appears to have been fixed.