Bug 5027 - Decoder doesn't auto-detect XML encoded as UTF-16 if BOM is not present
Status: VERIFIED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM
Version: 420+
Hardware: Mac OS X 10.4
Importance: P2 Normal
Assignee: Maciej Stachowiak
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-09-17 13:00 PDT by Alexey Proskuryakov
Modified: 2019-02-06 09:03 PST

See Also:


Attachments
proposed patch (1.00 KB, patch)
2005-09-17 13:01 PDT, Alexey Proskuryakov
darin: review-
test case (150 bytes, text/html)
2005-09-17 13:02 PDT, Alexey Proskuryakov
no flags
improved test case (236 bytes, text/html)
2005-09-17 13:08 PDT, Alexey Proskuryakov
no flags
proposed patch (1.05 KB, patch)
2005-09-19 21:11 PDT, Alexey Proskuryakov
darin: review+

Description Alexey Proskuryakov 2005-09-17 13:00:36 PDT
When viewing XML files encoded as UTF-16 without a byte order mark, the user's default encoding is used
instead of UTF-16 (big- or little-endian).
Comment 1 Alexey Proskuryakov 2005-09-17 13:01:14 PDT
Created attachment 3925
proposed patch
Comment 2 Alexey Proskuryakov 2005-09-17 13:02:23 PDT
Created attachment 3926
test case
Comment 3 Alexey Proskuryakov 2005-09-17 13:08:36 PDT
Created attachment 3927
improved test case

The previous test case didn't work if the default encoding in Safari was
Latin-1 (worked with Windows Cyrillic).
Comment 4 Eric Seidel (no email) 2005-09-17 15:24:32 PDT
Comment on attachment 3925
proposed patch

Well, the code looks fine, as in it does what it says it does. I'm not sure
what the policy on autodetection is. The previous if block before this one
finds <?xml and searches for an encoding listed there. This one finds
"< ? x m l", says, oh, those must be 2-byte characters, and assumes UTF-16. I'm
not sure if that's a good thing or not, and thus will leave this one up to
Darin to decide.
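
For illustration, here is a minimal sketch of the kind of check described in this comment. It is written in Python rather than taken from the attached patch, and the function name `sniff_bomless_utf16` is hypothetical. It only shows the idea: "<?xml" stored as 2-byte characters has a zero byte next to every ASCII byte, so the byte pattern identifies both UTF-16 and its byte order even when no BOM is present.

```python
from typing import Optional

XML_DECL = "<?xml"


def sniff_bomless_utf16(data: bytes) -> Optional[str]:
    """Guess UTF-16 byte order from a BOM-less XML declaration.

    Hypothetical helper for illustration; not the code from the patch.
    Returns a Python codec name, or None if no UTF-16 pattern matches.
    """
    # "utf-16-be" and "utf-16-le" encode without prepending a BOM,
    # so these are exactly the interleaved "< ? x m l" byte patterns.
    if data.startswith(XML_DECL.encode("utf-16-be")):
        return "utf-16-be"
    if data.startswith(XML_DECL.encode("utf-16-le")):
        return "utf-16-le"
    # Plain ASCII "<?xml" or anything else: leave it to other detection steps.
    return None
```

A caller would fall back to the existing logic (encoding declaration, then the user's default encoding) whenever this returns None.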
Comment 5 Tobias Lidskog 2005-09-18 02:05:49 PDT
This would be the relevant section in the xml spec:
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing-no-ext-info
Comment 6 Alexey Proskuryakov 2005-09-18 03:31:53 PDT
(In reply to comment #5)
So, what is still not auto-detected:
1) UTF-32
2) EBCDIC
3) "other encoding with a 16-bit code unit <...> and ASCII characters encoded as ASCII values (the 
encoding declaration must be read to determine which)"

Of these, only (3) may be relevant in this specific case, but I'm not sure it is practical at all. Yes, the
spec suggests that an encoding declaration should always be read because new encodings may be
invented, but there's nothing we can do about new encodings in advance, so treating them as UTF-16
should be as good as any other solution.

Even if there are existing encodings such as in (3), this patch doesn't make the situation worse.
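
As a rough sketch of the cases listed above, the "without a byte order mark" part of the spec section linked in comment #5 can be read as a lookup on the first four bytes of the document. The Python below is an illustration only: the table name, the function name, and the label strings are made up for this example, and it does not reflect which cases the decoder actually handles.

```python
from typing import Optional

# First four bytes of "<?xm" in each encoding family, for input without a BOM.
# The label strings are informal descriptions chosen for this sketch.
NO_BOM_SIGNATURES = {
    bytes.fromhex("0000003c"): "32-bit code units, big-endian (1234)",
    bytes.fromhex("3c000000"): "32-bit code units, little-endian (4321)",
    bytes.fromhex("00003c00"): "32-bit code units, unusual order (2143)",
    bytes.fromhex("003c0000"): "32-bit code units, unusual order (3412)",
    bytes.fromhex("003c003f"): "16-bit code units, big-endian",
    bytes.fromhex("3c003f00"): "16-bit code units, little-endian",
    bytes.fromhex("3c3f786d"): "ASCII-compatible, read the declaration",
    bytes.fromhex("4c6fa794"): "EBCDIC, read the declaration",
}


def classify_without_bom(data: bytes) -> Optional[str]:
    """Return the encoding family suggested by the first four bytes, if any."""
    return NO_BOM_SIGNATURES.get(data[:4])
```

In these terms, the patch under review covers the two 16-bit rows, while the 32-bit and EBCDIC rows are items (1) and (2) above.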
Comment 7 Darin Adler 2005-09-19 15:56:00 PDT
Comment on attachment 3925
proposed patch

What's the argument for looking only at ptr[0] and not also at ptr[2], [4], and
[6]?

Otherwise, looks great!
Comment 8 Alexey Proskuryakov 2005-09-19 21:11:42 PDT
Created attachment 3955
proposed patch

Well, we're guessing here, and I didn't see how checking for all the zeroes
would be any safer than checking for just the first one - but it makes the code
longer :)

Anyhow, here's a patch with the suggested change.
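
To make the difference between the two variants concrete, here is a hedged Python sketch. It assumes the big-endian case and assumes the check looks at the first eight bytes ("<?xm"), which is a guess based on the indices [0], [2], [4], and [6] mentioned in comment #7. Both function names are hypothetical and neither is the code from the patches.

```python
def looks_utf16be_first_zero(data: bytes) -> bool:
    """Loose variant: only byte 0 must be zero; odd bytes must spell '<?xm'."""
    return len(data) >= 8 and data[0] == 0 and data[1:8:2] == b"<?xm"


def looks_utf16be_all_zeros(data: bytes) -> bool:
    """Strict variant: bytes 0, 2, 4, and 6 must all be zero as well."""
    return (
        len(data) >= 8
        and data[0:8:2] == b"\x00\x00\x00\x00"
        and data[1:8:2] == b"<?xm"
    )


# Genuine BOM-less UTF-16BE input satisfies both variants.
genuine = '<?xml version="1.0"?>'.encode("utf-16-be")

# Contrived input with non-zero bytes at offsets 2, 4, and 6:
# only the loose variant accepts it.
contrived = bytes([0x00, 0x3C, 0x41, 0x3F, 0x42, 0x78, 0x43, 0x6D])
```

The two only disagree on inputs like `contrived`, which are unlikely in practice. That fits the remark that the extra checks add little safety, though they do keep the guess from matching stray data.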
Comment 9 Darin Adler 2005-09-20 17:02:09 PDT
Comment on attachment 3955
proposed patch

r=me
Comment 10 Lucas Forschler 2019-02-06 09:03:08 PST
Mass moving XML DOM bugs to the "DOM" Component.