WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED INVALID
51035
UCS2 encoding aliases should be defaulted to Big Endian
https://bugs.webkit.org/show_bug.cgi?id=51035
Summary
UCS2 encoding aliases should be defaulted to Big Endian
Yong Li
Reported
Tuesday, December 14, 2010 3:14:14 PM UTC
I think UCS2 encoding names like "ISO-10646-UCS-2" should be treated as Big Endian, because: 1)
http://www.isi.edu/in-notes/iana/assignments/character-sets
says about ISO-10646-UCS-2: "this needs to specify network byte order: the standard does not specify" 2) RFC 2152 says: "ISO/IEC 10646-1:1993(E) specifies that when characters the UCS-2 form are serialized as octets, that the most significant octet appear first." 3) ICU code page description (
http://publib.boulder.ibm.com/infocenter/tivihelp/v24r1/index.jsp?topic=/com.ibm.itcama.doc_6.2.3/itcam_oraclerac63200.htm
) says: UCS2 UCS-2 (Really UTF-16 BE) 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian. However, it is hard to find a website in reality that uses UCS2, except test cases. So priority should be low.
Attachments
the patch
(2.74 KB, patch)
2010-12-14 08:31 PST
,
Yong Li
ap
: review-
Details
Formatted Diff
Diff
update test and expected result
(5.93 KB, patch)
2010-12-14 13:56 PST
,
Yong Li
no flags
Details
Formatted Diff
Diff
updated
(11.14 KB, patch)
2010-12-15 08:26 PST
,
Yong Li
no flags
Details
Formatted Diff
Diff
test case updated
(5.86 KB, patch)
2010-12-15 08:29 PST
,
Yong Li
no flags
Details
Formatted Diff
Diff
utf-16 test case
(28 bytes, text/html; charset=utf-16)
2010-12-16 10:16 PST
,
Alexey Proskuryakov
no flags
Details
test case with <p> removed
(16 bytes, text/plain; charset=utf-16)
2010-12-16 10:30 PST
,
Yong Li
no flags
Details
test case - BE
(16 bytes, text/plain;charset=utf-16)
2010-12-16 10:33 PST
,
Yong Li
no flags
Details
Test case - BE - ISO-10646-UCS-2
(16 bytes, text/plain;charset="ISO-10646-UCS-2")
2010-12-16 10:44 PST
,
Yong Li
no flags
Details
Show Obsolete
(3)
View All
Add attachment
proposed patch, testcase, etc.
Yong Li
Comment 1
Tuesday, December 14, 2010 4:31:34 PM UTC
Created
attachment 76540
[details]
the patch
Alexey Proskuryakov
Comment 2
Tuesday, December 14, 2010 7:19:14 PM UTC
Comment on
attachment 76540
[details]
the patch
> 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian.
Did you test other encoding names modified here? Do all the added sub-tests pass in IE and Firefox? It's surprising that we fail on every page that's labeled as "UTF-16" and doesn't have a BOM. In fact, I recall that the current behavior is intentional, and was chosen for compatibility with other browsers - but I don't remember details of testing. +// UCS2 encoding aliases. +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); +testDecode('UCS-2', '%00a', 'U+0061'); +testDecode('UTF-16', '%00a', 'U+0061'); +testDecode('Unicode', '%00a', 'U+0061'); +testDecode('csUnicode', '%00a', 'U+0061'); +testDecode('unicodeFFFE', '%00a', 'U+0061'); +testDecode('unicodeFEFF', 'a%00', 'U+0061'); This patch has confused terminology. Why did you label UTF-16 as a "UCS2 encoding alias"? It's clearly not. UCS-2 is a historic name for 16-bit encoding defined by Unicode 1.0, and obsoleted by Unicode 2.0. I suspect that most or all clients treat it as UTF-16 in practice. ICU certainly does. +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); Please add tests for something besides zero bytes. While at it, please consider adding tests for whether surrogate pairs are supported for each of these encoding names. Extending test coverage would is very welcome.
Yong Li
Comment 3
Tuesday, December 14, 2010 8:16:58 PM UTC
(In reply to
comment #2
)
> (From update of
attachment 76540
[details]
) > > 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian. > > Did you test other encoding names modified here? Do all the added sub-tests pass in IE and Firefox? > > It's surprising that we fail on every page that's labeled as "UTF-16" and doesn't have a BOM. In fact, I recall that the current behavior is intentional, and was chosen for compatibility with other browsers - but I don't remember details of testing. > > +// UCS2 encoding aliases. > +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); > +testDecode('UCS-2', '%00a', 'U+0061'); > +testDecode('UTF-16', '%00a', 'U+0061'); > +testDecode('Unicode', '%00a', 'U+0061'); > +testDecode('csUnicode', '%00a', 'U+0061'); > +testDecode('unicodeFFFE', '%00a', 'U+0061'); > +testDecode('unicodeFEFF', 'a%00', 'U+0061'); > > This patch has confused terminology. Why did you label UTF-16 as a "UCS2 encoding alias"? It's clearly not. > > UCS-2 is a historic name for 16-bit encoding defined by Unicode 1.0, and obsoleted by Unicode 2.0. I suspect that most or all clients treat it as UTF-16 in practice. ICU certainly does. > > +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); > > Please add tests for something besides zero bytes. While at it, please consider adding tests for whether surrogate pairs are supported for each of these encoding names. > > Extending test coverage would is very welcome.
(In reply to
comment #2
)
> (From update of
attachment 76540
[details]
) > > 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian. > > Did you test other encoding names modified here? Do all the added sub-tests pass in IE and Firefox?
FireFox seems treat "Unicode" as UTF-8. But for other names, it uses Big Endian. Except, when using 0 bytes, it also works with little endian for "UTF-16".
> > It's surprising that we fail on every page that's labeled as "UTF-16" and doesn't have a BOM. In fact, I recall that the current behavior is intentional, and was chosen for compatibility with other browsers - but I don't remember details of testing. > > +// UCS2 encoding aliases. > +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); > +testDecode('UCS-2', '%00a', 'U+0061'); > +testDecode('UTF-16', '%00a', 'U+0061'); > +testDecode('Unicode', '%00a', 'U+0061'); > +testDecode('csUnicode', '%00a', 'U+0061'); > +testDecode('unicodeFFFE', '%00a', 'U+0061'); > +testDecode('unicodeFEFF', 'a%00', 'U+0061'); > > This patch has confused terminology. Why did you label UTF-16 as a "UCS2 encoding alias"? It's clearly not. > > UCS-2 is a historic name for 16-bit encoding defined by Unicode 1.0, and obsoleted by Unicode 2.0. I suspect that most or all clients treat it as UTF-16 in practice. ICU certainly does.
> Can I say they are aliases of UTF-16 then?
> +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); > > Please add tests for something besides zero bytes. While at it, please consider adding tests for whether surrogate pairs are supported for each of these encoding names. >
Will use surrogate pairs.
Alexey Proskuryakov
Comment 4
Tuesday, December 14, 2010 8:41:48 PM UTC
Please be sure to test Internet Explorer behavior in detail.
> Can I say they are aliases of UTF-16 then?
I'd just say "UTF-16 variants" - unicodeFFFE and unicodeFEFF clearly aren't aliases of the same encoding.
> Will use surrogate pairs.
Thanks!
Darin Adler
Comment 5
Tuesday, December 14, 2010 8:57:57 PM UTC
Have we found any existing test cases on websites? It seems that 16-bit encodings without BOMs are extremely rare. Is there some context other than viewing a webpage where this comes up? Form submission or XMLHttpRequest?
Yong Li
Comment 6
Tuesday, December 14, 2010 9:10:18 PM UTC
(In reply to
comment #5
)
> Have we found any existing test cases on websites? It seems that 16-bit encodings without BOMs are extremely rare.
Yeah. I can hardly find a real website sending UTF-16 without BOM. But seems everyone is saying UTF-16 without BOM should be UTF-16BE.
> > Is there some context other than viewing a webpage where this comes up? Form submission or XMLHttpRequest?
I'm using data URL like "data:text/plain;charset=UTF-16,%D8%69%DE%D6" to be iframe source or just type it in address bar.
Yong Li
Comment 7
Tuesday, December 14, 2010 9:56:47 PM UTC
Created
attachment 76567
[details]
update test and expected result
Alexey Proskuryakov
Comment 8
Tuesday, December 14, 2010 10:03:25 PM UTC
Comment on
attachment 76567
[details]
update test and expected result View in context:
https://bugs.webkit.org/attachment.cgi?id=76567&action=review
I think that key to landing this would be testing IE and Firefox with regular HTML files served over HTTP. Data URLs are a great trick for regression tests, but comparing browser behavior based on them isn't very reliable.
> LayoutTests/fast/encoding/char-decoding-expected.txt:160 > +PASS decodeSurrogatePair('UTF-16BE', '%D8%69%DE%D6') is 'U+D869/U+DED6'
I'm not sure why decodeSurrogatePair is needed - does decode() not work for those? If so, could it be better to fix decode()?
Yong Li
Comment 9
Tuesday, December 14, 2010 10:11:05 PM UTC
(In reply to
comment #8
)
> (From update of
attachment 76567
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=76567&action=review
> > I think that key to landing this would be testing IE and Firefox with regular HTML files served over HTTP. Data URLs are a great trick for regression tests, but comparing browser behavior based on them isn't very reliable. > > > LayoutTests/fast/encoding/char-decoding-expected.txt:160 > > +PASS decodeSurrogatePair('UTF-16BE', '%D8%69%DE%D6') is 'U+D869/U+DED6' > > I'm not sure why decodeSurrogatePair is needed - does decode() not work for those? If so, could it be better to fix decode()?
decode() only returns one UTF-16 character. Do you think it is good to make decode() dump all characters in responseText?
Alexey Proskuryakov
Comment 10
Tuesday, December 14, 2010 10:40:03 PM UTC
> Do you think it is good to make decode() dump all characters in responseText?
I think so, but I haven't been involved in the original design of this test. Perhaps Darin will have some comments.
Darin Adler
Comment 11
Wednesday, December 15, 2010 1:10:18 AM UTC
(In reply to
comment #10
)
> > Do you think it is good to make decode() dump all characters in responseText? > > I think so, but I haven't been involved in the original design of this test. Perhaps Darin will have some comments.
Yes, decode should dump all the characters. There’s no advantage to have it returning only one. In fact, that behavior could hide failures.
Yong Li
Comment 12
Wednesday, December 15, 2010 3:30:22 PM UTC
(In reply to
comment #11
)
> (In reply to
comment #10
) > > > Do you think it is good to make decode() dump all characters in responseText? > > > > I think so, but I haven't been involved in the original design of this test. Perhaps Darin will have some comments. > > Yes, decode should dump all the characters. There’s no advantage to have it returning only one. In fact, that behavior could hide failures.
will do it in that way
Yong Li
Comment 13
Wednesday, December 15, 2010 4:26:02 PM UTC
Created
attachment 76649
[details]
updated
Yong Li
Comment 14
Wednesday, December 15, 2010 4:29:52 PM UTC
Created
attachment 76650
[details]
test case updated
WebKit Commit Bot
Comment 15
Wednesday, December 15, 2010 9:31:34 PM UTC
The commit-queue encountered the following flaky tests while processing
attachment 76650
[details]
: http/tests/navigation/document-location-click-timeout.html
bug 51130
(author:
mpcomplete@chromium.org
) fast/loader/location-port.html
bug 51131
(author:
gns@gnome.org
) The commit-queue is continuing to process your patch.
WebKit Commit Bot
Comment 16
Wednesday, December 15, 2010 11:31:58 PM UTC
Comment on
attachment 76650
[details]
test case updated Rejecting
attachment 76650
[details]
from commit-queue. Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=cr-jail-3', 'build-and-test', '--no-clean', '--no-update', '--test', '--non-interactive']" exit_code: 2 Last 500 characters of output: .................................................................................................................................................................................................... inspector ................................................................. inspector/styles-source-lines-inline.html -> crashed Exiting early after 1 failures. 17876 tests run. 405.07s total testing time 17875 test cases (99%) succeeded 1 test case (<1%) crashed 8 test cases (<1%) had stderr output Full output:
http://queues.webkit.org/results/7140055
Eric Seidel (no email)
Comment 17
Thursday, December 16, 2010 12:07:11 AM UTC
Comment on
attachment 76650
[details]
test case updated The failure seems unlikely to be related.
WebKit Commit Bot
Comment 18
Thursday, December 16, 2010 12:55:32 AM UTC
Comment on
attachment 76650
[details]
test case updated Clearing flags on attachment: 76650 Committed
r74162
: <
http://trac.webkit.org/changeset/74162
>
WebKit Commit Bot
Comment 19
Thursday, December 16, 2010 12:55:39 AM UTC
All reviewed patches have been landed. Closing bug.
Xan Lopez
Comment 20
Thursday, December 16, 2010 3:17:17 PM UTC
This broke the hell out of GTK+ (> 100 tests failing), so I reverted it in
r74186
. In the future please play attention to massive breakage in the bots when landing patches.
Xan Lopez
Comment 21
Thursday, December 16, 2010 3:17:58 PM UTC
Reopening.
Adam Roben (:aroben)
Comment 22
Thursday, December 16, 2010 3:42:25 PM UTC
(In reply to
comment #20
)
> This broke the hell out of GTK+ (> 100 tests failing), so I reverted it in
r74186
. In the future please play attention to massive breakage in the bots when landing patches.
It also broke Safari on Windows's location field, search field, and error pages. These features use the IWebFrame::load[Alternate]HTMLString APIs, which end up calling through WebFrame::loadHTMLString [1]. BSTRs appear to be encoded with UTF-16LE, but we pass just "utf-16" as the encoding name to the SubstituteData constructor. It looks like WebKit2's WebPage has this same problem [2]. 1.
http://trac.webkit.org/browser/trunk/WebKit/win/WebFrame.cpp?rev=73430
2.
http://trac.webkit.org/browser/trunk/WebKit2/WebProcess/WebPage/WebPage.cpp?rev=74164
Adam Roben (:aroben)
Comment 23
Thursday, December 16, 2010 4:10:20 PM UTC
See also
http://build.webkit.org/results/Windows%207%20Release%20(Tests)/r74170%20(7487)/fast/loader/non-deferred-substitute-load-pretty-diff.html
Yong Li
Comment 24
Thursday, December 16, 2010 4:20:54 PM UTC
Am I responsible to fix all these UTF-16 to UTF-16LE in all test cases/error pages? /me cries...
Darin Adler
Comment 25
Thursday, December 16, 2010 4:22:02 PM UTC
(In reply to
comment #24
)
> Am I responsible to fix all these UTF-16 to UTF-16LE in all test cases/error pages? /me cries...
You don’t have to do it yourself, but you will have to talk with the people affected and wait until they’ve resolved these issues if you want to change this.
Yong Li
Comment 26
Thursday, December 16, 2010 4:44:00 PM UTC
(In reply to
comment #25
)
> (In reply to
comment #24
) > > Am I responsible to fix all these UTF-16 to UTF-16LE in all test cases/error pages? /me cries... > You don’t have to do it yourself, but you will have to talk with the people affected and wait until they’ve resolved these issues if you want to change this.
Thanks a lot. I'll try to fix some first
Alexey Proskuryakov
Comment 27
Thursday, December 16, 2010 5:00:11 PM UTC
Despite earlier Darin's r+, I would still appreciate more direct testing of IE behavior.
Xan Lopez
Comment 28
Thursday, December 16, 2010 6:04:57 PM UTC
CCing a couple of GTK+ guys for the fallout.
Darin Adler
Comment 29
Thursday, December 16, 2010 6:06:44 PM UTC
(In reply to
comment #27
)
> Despite earlier Darin's r+, I would still appreciate more direct testing of IE behavior.
I didn’t mean to override your concern, Alexey. This is enough of an edge case that I think we are unlikely to have significant problems no matter which way we go. But I agree that would be very good to know about IE behavior based on resources, not just data URLs. Also, I think it would be good to land the new tests either with expected failures or with expectations that match our current behavior, even if we can't make the change in WebCore right now.
Alexey Proskuryakov
Comment 30
Thursday, December 16, 2010 6:16:36 PM UTC
Created
attachment 76780
[details]
utf-16 test case In fact, I've just tested, and it appears that IE treats "utf-16" as little endian. See the attached test case.
Alexey Proskuryakov
Comment 31
Thursday, December 16, 2010 6:17:36 PM UTC
And so does Firefox, so WebKit matches both.
Yong Li
Comment 32
Thursday, December 16, 2010 6:26:25 PM UTC
(In reply to
comment #31
)
> And so does Firefox, so WebKit matches both.
I don't understand the test case. Can you try these 2 URL in FireFox? data:text/plain;charset=UTF-16,%D8%69%DE%D6 and data:text/plain;charset=UTF-16,%69%D8%D6%DE To me, it shows FireFox uses BigEndian
Yong Li
Comment 33
Thursday, December 16, 2010 6:30:32 PM UTC
Created
attachment 76782
[details]
test case with <p> removed
Yong Li
Comment 34
Thursday, December 16, 2010 6:33:50 PM UTC
Created
attachment 76783
[details]
test case - BE
Yong Li
Comment 35
Thursday, December 16, 2010 6:34:51 PM UTC
Seems IE is using little endian and firefox uses big endian for "utf-16"
Alexey Proskuryakov
Comment 36
Thursday, December 16, 2010 6:36:03 PM UTC
So it appears that Firefox performed content sniffing to change the encoding (I think it's on by default as of Firefox 3.6.16).
Alexey Proskuryakov
Comment 37
Thursday, December 16, 2010 6:36:36 PM UTC
I meant 3.6.13, sorry.
Alexey Proskuryakov
Comment 38
Thursday, December 16, 2010 6:41:15 PM UTC
The currently preferred way to resolve such differences is via standardization in HTML5. In fact, HTML5 already sides with IE/WebKit: --------------------------- When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE. Note: The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content. [RFC2781] --------------------------- The tests added here are very good, and I suggest landing those. But it doesn't seem like we should be changing behavior.
Yong Li
Comment 39
Thursday, December 16, 2010 6:43:04 PM UTC
(In reply to
comment #38
)
> The currently preferred way to resolve such differences is via standardization in HTML5. In fact, HTML5 already sides with IE/WebKit: > --------------------------- > When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE. > Note: The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content. [RFC2781] > --------------------------- > The tests added here are very good, and I suggest landing those. But it doesn't seem like we should be changing behavior.
Tend to agree. But probably we still want to handle 'ISO-10646-UCS-2' and 'UCS-2' as BE?
Yong Li
Comment 40
Thursday, December 16, 2010 6:43:42 PM UTC
(In reply to
comment #39
)
> (In reply to
comment #38
) > > The currently preferred way to resolve such differences is via standardization in HTML5. In fact, HTML5 already sides with IE/WebKit: > > --------------------------- > > When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE. > > Note: The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content. [RFC2781] > > --------------------------- > > The tests added here are very good, and I suggest landing those. But it doesn't seem like we should be changing behavior. > Tend to agree. But probably we still want to handle 'ISO-10646-UCS-2' and 'UCS-2' as BE?
As IE also uses BE for 'ISO-10646-UCS-2'
Yong Li
Comment 41
Thursday, December 16, 2010 6:44:49 PM UTC
Created
attachment 76784
[details]
Test case - BE - ISO-10646-UCS-2
Yong Li
Comment 42
Thursday, December 16, 2010 6:46:14 PM UTC
(In reply to
comment #41
)
> Created an attachment (id=76784) [details] > Test case - BE - ISO-10646-UCS-2
Oops. I'm wrong... werid.
Yong Li
Comment 43
Thursday, December 16, 2010 6:50:13 PM UTC
Sorry. I think it is a false alarm...
Alexey Proskuryakov
Comment 44
Thursday, December 16, 2010 6:53:56 PM UTC
It seems that IE doesn't recognize ISO-10646-UCS-2 at all - browser default encoding was used instead. It doesn't matter whether we interpret it as little or big endian, but I'm fine with matching Firefox and changing it to big endian. I'm also fine with treating it exactly like "utf-16" for internal consistency. Would you be willing to file a bug and make a patch with tests?
Yong Li
Comment 45
Thursday, December 16, 2010 6:55:00 PM UTC
(In reply to
comment #44
)
> It seems that IE doesn't recognize ISO-10646-UCS-2 at all - browser default encoding was used instead. It doesn't matter whether we interpret it as little or big endian, but I'm fine with matching Firefox and changing it to big endian. I'm also fine with treating it exactly like "utf-16" for internal consistency. > Would you be willing to file a bug and make a patch with tests?
Sure. I'll update the patch to only include the tests
Alexey Proskuryakov
Comment 46
Thursday, December 16, 2010 7:55:47 PM UTC
Bug 51199
.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug