WebKit Bugzilla
VERIFIED FIXED
5027
Decoder doesn't auto-detect XML encoded as UTF-16 if BOM is not present
https://bugs.webkit.org/show_bug.cgi?id=5027
Alexey Proskuryakov
Reported
2005-09-17 13:00:36 PDT
When viewing XML files encoded as UTF-16 without a byte order mark, the user's default encoding is used instead of UTF-16 (big- or little-endian).
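To make the symptom concrete, here is a minimal sketch of what such input looks like at the byte level. The sample document and the `misdecode` helper are hypothetical and are not taken from the attached test cases; Latin-1 stands in for an arbitrary single-byte default encoding.

```python
# Hypothetical sample document; the bug's actual test case is not reproduced here.
DECL = '<?xml version="1.0"?><root/>'

# Python's "utf-16-be" and "utf-16-le" codecs emit no byte order mark,
# unlike the plain "utf-16" codec, which prepends one.
big_endian = DECL.encode("utf-16-be")
little_endian = DECL.encode("utf-16-le")


def misdecode(data: bytes, default_encoding: str = "latin-1") -> str:
    """Decode the bytes as a single-byte default encoding would."""
    return data.decode(default_encoding)


# Every ASCII character is paired with a zero byte, before it (big-endian)
# or after it (little-endian).
print(big_endian[:8])
print(little_endian[:8])

# Falling back to a single-byte encoding turns each zero byte into a NUL
# character interleaved with the real text.
print(repr(misdecode(big_endian)[:8]))
```

The interleaved zero bytes are what a BOM-less detection heuristic can key on.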
Attachments
proposed patch (1.00 KB, patch), 2005-09-17 13:01 PDT, Alexey Proskuryakov, darin: review-
test case (150 bytes, text/html), 2005-09-17 13:02 PDT, Alexey Proskuryakov, no flags
improved test case (236 bytes, text/html), 2005-09-17 13:08 PDT, Alexey Proskuryakov, no flags
proposed patch (1.05 KB, patch), 2005-09-19 21:11 PDT, Alexey Proskuryakov, darin: review+
Alexey Proskuryakov
Comment 1
2005-09-17 13:01:14 PDT
Created attachment 3925: proposed patch
Alexey Proskuryakov
Comment 2
2005-09-17 13:02:23 PDT
Created attachment 3926: test case
Alexey Proskuryakov
Comment 3
2005-09-17 13:08:36 PDT
Created attachment 3927: improved test case
The previous test case didn't work if the default encoding in Safari was Latin-1 (it worked with Windows Cyrillic).
Eric Seidel (no email)
Comment 4
2005-09-17 15:24:32 PDT
Comment on attachment 3925: proposed patch
Well, the code looks fine, as in it does what it says it does. I'm not sure what the policy on autodetection is. The if block just before this one finds "<?xml" and searches for an encoding listed there. This one finds "< ? x m l", says "oh, those must be 2-byte characters", and assumes UTF-16. I'm not sure if that's a good thing or not, and thus will leave this one up to Darin to decide.
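The two checks described in this comment can be sketched roughly as follows. This is an illustration only, not the attached patch: the function name `guess_xml_encoding`, the `default` parameter, and the regular expression are all assumptions made for the example.

```python
import re

# Hypothetical pattern for the encoding pseudo-attribute of an XML declaration.
_ENCODING_RE = re.compile(rb'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def guess_xml_encoding(data: bytes, default: str = "latin-1") -> str:
    """Guess the encoding of an XML byte stream that has no byte order mark."""
    # Check 1: an ASCII-compatible "<?xml" prefix, so the declaration can be
    # read directly and searched for an encoding name.
    if data.startswith(b"<?xml"):
        end = data.find(b"?>")
        if end != -1:
            match = _ENCODING_RE.search(data[:end])
            if match:
                return match.group(1).decode("ascii").lower()
        return default

    # Check 2: the same five characters stored as 16-bit code units.
    # Assume UTF-16 in the byte order that matches.
    if data.startswith("<?xml".encode("utf-16-be")):
        return "utf-16-be"
    if data.startswith("<?xml".encode("utf-16-le")):
        return "utf-16-le"

    # Nothing recognised: fall back to the caller's default encoding.
    return default


print(guess_xml_encoding(b'<?xml version="1.0" encoding="KOI8-R"?><a/>'))
print(guess_xml_encoding('<?xml version="1.0"?><a/>'.encode("utf-16-le")))
```

The second check is a guess in the sense the comment raises: it never reads an encoding declaration, it only infers the code unit width from the byte layout.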
Tobias Lidskog
Comment 5
2005-09-18 02:05:49 PDT
This would be the relevant section in the xml spec:
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing-no-ext-info
Alexey Proskuryakov
Comment 6
2005-09-18 03:31:53 PDT
(In reply to comment #5)
So, what is still not auto-detected:
1) UTF-32
2) EBCDIC
3) "other encoding with a 16-bit code unit <...> and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)"
Of these, only (3) may be relevant in this specific case, but I'm not sure if it is practical at all. Yes, the spec suggests that an encoding declaration should always be read because new encodings may be invented, but there's nothing we can do about new encodings in advance, so treating them as UTF-16 should be as good as any other solution. Even if there are existing encodings such as in (3), this patch doesn't make the situation worse.
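For reference, the BOM-less byte patterns from the spec section linked in comment 5 can be written as a small lookup table. This is a sketch of the spec's table, not of WebKit's decoder; the function name and the returned labels are assumptions, and the spec's two unusual 4-byte octet orders are left out for brevity.

```python
from typing import Optional

# First four bytes of "<?xm" in each encoding family, per the XML 1.0
# appendix on detection without external encoding information.
# The labels on the right are illustrative names chosen for this example.
_NO_BOM_PATTERNS = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),  # the spec calls this UCS-4, big-endian
    (b"\x3c\x00\x00\x00", "utf-32-le"),  # UCS-4, little-endian
    (b"\x00\x3c\x00\x3f", "utf-16-be"),  # or another big-endian 16-bit encoding
    (b"\x3c\x00\x3f\x00", "utf-16-le"),  # or another little-endian 16-bit encoding
    (b"\x3c\x3f\x78\x6d", "ascii-compatible"),  # read the declaration to decide
    (b"\x4c\x6f\xa7\x94", "ebcdic"),  # some EBCDIC flavour
]


def sniff_without_bom(data: bytes) -> Optional[str]:
    """Return the encoding family suggested by the first four bytes, if any."""
    prefix = data[:4]
    for pattern, family in _NO_BOM_PATTERNS:
        if prefix == pattern:
            return family
    return None


print(sniff_without_bom('<?xml version="1.0"?>'.encode("utf-16-be")))
print(sniff_without_bom('<?xml version="1.0"?>'.encode("cp037")))
```

Case (3) in the list corresponds to the comments on the two 16-bit rows: the byte pattern alone cannot distinguish UTF-16 from another 16-bit encoding that stores ASCII characters as ASCII values.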
Darin Adler
Comment 7
2005-09-19 15:56:00 PDT
Comment on attachment 3925: proposed patch
What's the argument for looking only at ptr[0] and not also at ptr[2], [4], and [6]? Otherwise, looks great!
Alexey Proskuryakov
Comment 8
2005-09-19 21:11:42 PDT
Created attachment 3955: proposed patch
Well, we're guessing here, and I didn't see how checking for all the zeroes would be any safer than checking for just the first one - but it makes the code longer :) Anyhow, here's a patch with the suggested change.
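The trade-off discussed in comments 7 and 8 can be illustrated with two variants of the big-endian check. Both functions are hypothetical reconstructions for illustration; the actual patches are not reproduced on this page, and the exact bytes they compare are an assumption.

```python
def looks_utf16be_loose(ptr: bytes) -> bool:
    """Require a zero only at ptr[0]; the odd offsets must spell "<?xm"."""
    return len(ptr) >= 8 and ptr[0] == 0 and ptr[1:8:2] == b"<?xm"


def looks_utf16be_strict(ptr: bytes) -> bool:
    """Require zeros at ptr[0], ptr[2], ptr[4] and ptr[6] as well."""
    return (
        len(ptr) >= 8
        and ptr[0:8:2] == b"\x00\x00\x00\x00"
        and ptr[1:8:2] == b"<?xm"
    )


# Genuine BOM-less UTF-16BE input passes both checks.
genuine = "<?xml".encode("utf-16-be")

# Contrived input whose second code unit is U+013F rather than "?":
# only the loose check accepts it.
contrived = b"\x00<\x01?\x00x\x00m"

print(looks_utf16be_loose(genuine), looks_utf16be_strict(genuine))
print(looks_utf16be_loose(contrived), looks_utf16be_strict(contrived))
```

The two variants differ only on inputs like the contrived one, which is consistent with the remark that both are guesses and the stricter form mainly adds code.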
Darin Adler
Comment 9
2005-09-20 17:02:09 PDT
Comment on attachment 3955: proposed patch
r=me
Lucas Forschler
Comment 10
2019-02-06 09:03:08 PST
Mass moving XML DOM bugs to the "DOM" Component.