51035 – UCS2 encoding aliases should be defaulted to Big Endian

RESOLVED INVALID 51035

UCS2 encoding aliases should be defaulted to Big Endian

https://bugs.webkit.org/show_bug.cgi?id=51035

Summary UCS2 encoding aliases should be defaulted to Big Endian

Yong Li

Reported 2010-12-14 07:14:14 PST

I think UCS2 encoding names like "ISO-10646-UCS-2" should be treated as Big Endian, because: 1) http://www.isi.edu/in-notes/iana/assignments/character-sets says about ISO-10646-UCS-2: "this needs to specify network byte order: the standard does not specify" 2) RFC 2152 says: "ISO/IEC 10646-1:1993(E) specifies that when characters the UCS-2 form are serialized as octets, that the most significant octet appear first." 3) ICU code page description (http://publib.boulder.ibm.com/infocenter/tivihelp/v24r1/index.jsp?topic=/com.ibm.itcama.doc_6.2.3/itcam_oraclerac63200.htm) says: UCS2 UCS-2 (Really UTF-16 BE) 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian. However, it is hard to find a website in reality that uses UCS2, except test cases. So priority should be low.

Attachments
the patch (2.74 KB, patch) 2010-12-14 08:31 PST, Yong Li	ap: review-	Details Formatted Diff Diff
update test and expected result (5.93 KB, patch) 2010-12-14 13:56 PST, Yong Li	no flags	Details Formatted Diff Diff
updated (11.14 KB, patch) 2010-12-15 08:26 PST, Yong Li	no flags	Details Formatted Diff Diff
test case updated (5.86 KB, patch) 2010-12-15 08:29 PST, Yong Li	no flags	Details Formatted Diff Diff
utf-16 test case (28 bytes, text/html; charset=utf-16) 2010-12-16 10:16 PST, Alexey Proskuryakov	no flags	Details
test case with <p> removed (16 bytes, text/plain; charset=utf-16) 2010-12-16 10:30 PST, Yong Li	no flags	Details
test case - BE (16 bytes, text/plain;charset=utf-16) 2010-12-16 10:33 PST, Yong Li	no flags	Details
Test case - BE - ISO-10646-UCS-2 (16 bytes, text/plain;charset="ISO-10646-UCS-2") 2010-12-16 10:44 PST, Yong Li	no flags	Details
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Yong Li

Comment 1 2010-12-14 08:31:34 PST

Created attachment 76540 [details] the patch

Alexey Proskuryakov

Comment 2 2010-12-14 11:19:14 PST

Comment on attachment 76540 [details] the patch > 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian. Did you test other encoding names modified here? Do all the added sub-tests pass in IE and Firefox? It's surprising that we fail on every page that's labeled as "UTF-16" and doesn't have a BOM. In fact, I recall that the current behavior is intentional, and was chosen for compatibility with other browsers - but I don't remember details of testing. +// UCS2 encoding aliases. +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); +testDecode('UCS-2', '%00a', 'U+0061'); +testDecode('UTF-16', '%00a', 'U+0061'); +testDecode('Unicode', '%00a', 'U+0061'); +testDecode('csUnicode', '%00a', 'U+0061'); +testDecode('unicodeFFFE', '%00a', 'U+0061'); +testDecode('unicodeFEFF', 'a%00', 'U+0061'); This patch has confused terminology. Why did you label UTF-16 as a "UCS2 encoding alias"? It's clearly not. UCS-2 is a historic name for 16-bit encoding defined by Unicode 1.0, and obsoleted by Unicode 2.0. I suspect that most or all clients treat it as UTF-16 in practice. ICU certainly does. +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); Please add tests for something besides zero bytes. While at it, please consider adding tests for whether surrogate pairs are supported for each of these encoding names. Extending test coverage would is very welcome.

Yong Li

Comment 3 2010-12-14 12:16:58 PST

(In reply to comment #2) > (From update of attachment 76540 [details]) > > 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian. > > Did you test other encoding names modified here? Do all the added sub-tests pass in IE and Firefox? > > It's surprising that we fail on every page that's labeled as "UTF-16" and doesn't have a BOM. In fact, I recall that the current behavior is intentional, and was chosen for compatibility with other browsers - but I don't remember details of testing. > > +// UCS2 encoding aliases. > +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); > +testDecode('UCS-2', '%00a', 'U+0061'); > +testDecode('UTF-16', '%00a', 'U+0061'); > +testDecode('Unicode', '%00a', 'U+0061'); > +testDecode('csUnicode', '%00a', 'U+0061'); > +testDecode('unicodeFFFE', '%00a', 'U+0061'); > +testDecode('unicodeFEFF', 'a%00', 'U+0061'); > > This patch has confused terminology. Why did you label UTF-16 as a "UCS2 encoding alias"? It's clearly not. > > UCS-2 is a historic name for 16-bit encoding defined by Unicode 1.0, and obsoleted by Unicode 2.0. I suspect that most or all clients treat it as UTF-16 in practice. ICU certainly does. > > +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); > > Please add tests for something besides zero bytes. While at it, please consider adding tests for whether surrogate pairs are supported for each of these encoding names. > > Extending test coverage would is very welcome. (In reply to comment #2) > (From update of attachment 76540 [details]) > > 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian. > > Did you test other encoding names modified here? Do all the added sub-tests pass in IE and Firefox? FireFox seems treat "Unicode" as UTF-8. But for other names, it uses Big Endian. Except, when using 0 bytes, it also works with little endian for "UTF-16". > > It's surprising that we fail on every page that's labeled as "UTF-16" and doesn't have a BOM. In fact, I recall that the current behavior is intentional, and was chosen for compatibility with other browsers - but I don't remember details of testing. > > +// UCS2 encoding aliases. > +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); > +testDecode('UCS-2', '%00a', 'U+0061'); > +testDecode('UTF-16', '%00a', 'U+0061'); > +testDecode('Unicode', '%00a', 'U+0061'); > +testDecode('csUnicode', '%00a', 'U+0061'); > +testDecode('unicodeFFFE', '%00a', 'U+0061'); > +testDecode('unicodeFEFF', 'a%00', 'U+0061'); > > This patch has confused terminology. Why did you label UTF-16 as a "UCS2 encoding alias"? It's clearly not. > > UCS-2 is a historic name for 16-bit encoding defined by Unicode 1.0, and obsoleted by Unicode 2.0. I suspect that most or all clients treat it as UTF-16 in practice. ICU certainly does. > Can I say they are aliases of UTF-16 then? > +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061'); > > Please add tests for something besides zero bytes. While at it, please consider adding tests for whether surrogate pairs are supported for each of these encoding names. > Will use surrogate pairs.

Alexey Proskuryakov

Comment 4 2010-12-14 12:41:48 PST

Please be sure to test Internet Explorer behavior in detail. > Can I say they are aliases of UTF-16 then? I'd just say "UTF-16 variants" - unicodeFFFE and unicodeFEFF clearly aren't aliases of the same encoding. > Will use surrogate pairs. Thanks!

Darin Adler

Comment 5 2010-12-14 12:57:57 PST

Have we found any existing test cases on websites? It seems that 16-bit encodings without BOMs are extremely rare. Is there some context other than viewing a webpage where this comes up? Form submission or XMLHttpRequest?

Yong Li

Comment 6 2010-12-14 13:10:18 PST

(In reply to comment #5) > Have we found any existing test cases on websites? It seems that 16-bit encodings without BOMs are extremely rare. Yeah. I can hardly find a real website sending UTF-16 without BOM. But seems everyone is saying UTF-16 without BOM should be UTF-16BE. > > Is there some context other than viewing a webpage where this comes up? Form submission or XMLHttpRequest? I'm using data URL like "data:text/plain;charset=UTF-16,%D8%69%DE%D6" to be iframe source or just type it in address bar.

Yong Li

Comment 7 2010-12-14 13:56:47 PST

Created attachment 76567 [details] update test and expected result

Alexey Proskuryakov

Comment 8 2010-12-14 14:03:25 PST

Comment on attachment 76567 [details] update test and expected result View in context: https://bugs.webkit.org/attachment.cgi?id=76567&action=review I think that key to landing this would be testing IE and Firefox with regular HTML files served over HTTP. Data URLs are a great trick for regression tests, but comparing browser behavior based on them isn't very reliable. > LayoutTests/fast/encoding/char-decoding-expected.txt:160 > +PASS decodeSurrogatePair('UTF-16BE', '%D8%69%DE%D6') is 'U+D869/U+DED6' I'm not sure why decodeSurrogatePair is needed - does decode() not work for those? If so, could it be better to fix decode()?

Yong Li

Comment 9 2010-12-14 14:11:05 PST

(In reply to comment #8) > (From update of attachment 76567 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=76567&action=review > > I think that key to landing this would be testing IE and Firefox with regular HTML files served over HTTP. Data URLs are a great trick for regression tests, but comparing browser behavior based on them isn't very reliable. > > > LayoutTests/fast/encoding/char-decoding-expected.txt:160 > > +PASS decodeSurrogatePair('UTF-16BE', '%D8%69%DE%D6') is 'U+D869/U+DED6' > > I'm not sure why decodeSurrogatePair is needed - does decode() not work for those? If so, could it be better to fix decode()? decode() only returns one UTF-16 character. Do you think it is good to make decode() dump all characters in responseText?

Alexey Proskuryakov

Comment 10 2010-12-14 14:40:03 PST

> Do you think it is good to make decode() dump all characters in responseText? I think so, but I haven't been involved in the original design of this test. Perhaps Darin will have some comments.

Darin Adler

Comment 11 2010-12-14 17:10:18 PST

(In reply to comment #10) > > Do you think it is good to make decode() dump all characters in responseText? > > I think so, but I haven't been involved in the original design of this test. Perhaps Darin will have some comments. Yes, decode should dump all the characters. There’s no advantage to have it returning only one. In fact, that behavior could hide failures.

Yong Li

Comment 12 2010-12-15 07:30:22 PST

(In reply to comment #11) > (In reply to comment #10) > > > Do you think it is good to make decode() dump all characters in responseText? > > > > I think so, but I haven't been involved in the original design of this test. Perhaps Darin will have some comments. > > Yes, decode should dump all the characters. There’s no advantage to have it returning only one. In fact, that behavior could hide failures. will do it in that way

Yong Li

Comment 13 2010-12-15 08:26:02 PST

Created attachment 76649 [details] updated

Yong Li

Comment 14 2010-12-15 08:29:52 PST

Created attachment 76650 [details] test case updated

WebKit Commit Bot

Comment 15 2010-12-15 13:31:34 PST

The commit-queue encountered the following flaky tests while processing attachment 76650 [details]: http/tests/navigation/document-location-click-timeout.html bug 51130 (author: mpcomplete@chromium.org) fast/loader/location-port.html bug 51131 (author: gns@gnome.org) The commit-queue is continuing to process your patch.

WebKit Commit Bot

Comment 16 2010-12-15 15:31:58 PST

Comment on attachment 76650 [details] test case updated Rejecting attachment 76650 [details] from commit-queue. Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=cr-jail-3', 'build-and-test', '--no-clean', '--no-update', '--test', '--non-interactive']" exit_code: 2 Last 500 characters of output: .................................................................................................................................................................................................... inspector ................................................................. inspector/styles-source-lines-inline.html -> crashed Exiting early after 1 failures. 17876 tests run. 405.07s total testing time 17875 test cases (99%) succeeded 1 test case (<1%) crashed 8 test cases (<1%) had stderr output Full output: http://queues.webkit.org/results/7140055

Eric Seidel (no email)

Comment 17 2010-12-15 16:07:11 PST

Comment on attachment 76650 [details] test case updated The failure seems unlikely to be related.

WebKit Commit Bot

Comment 18 2010-12-15 16:55:32 PST

Comment on attachment 76650 [details] test case updated Clearing flags on attachment: 76650 Committed r74162: <http://trac.webkit.org/changeset/74162>

WebKit Commit Bot

Comment 19 2010-12-15 16:55:39 PST

All reviewed patches have been landed. Closing bug.

Xan Lopez

Comment 20 2010-12-16 07:17:17 PST

This broke the hell out of GTK+ (> 100 tests failing), so I reverted it in r74186. In the future please play attention to massive breakage in the bots when landing patches.

Xan Lopez

Comment 21 2010-12-16 07:17:58 PST

Reopening.

Adam Roben (:aroben)

Comment 22 2010-12-16 07:42:25 PST

(In reply to comment #20) > This broke the hell out of GTK+ (> 100 tests failing), so I reverted it in r74186. In the future please play attention to massive breakage in the bots when landing patches. It also broke Safari on Windows's location field, search field, and error pages. These features use the IWebFrame::load[Alternate]HTMLString APIs, which end up calling through WebFrame::loadHTMLString [1]. BSTRs appear to be encoded with UTF-16LE, but we pass just "utf-16" as the encoding name to the SubstituteData constructor. It looks like WebKit2's WebPage has this same problem [2]. 1. http://trac.webkit.org/browser/trunk/WebKit/win/WebFrame.cpp?rev=73430 2. http://trac.webkit.org/browser/trunk/WebKit2/WebProcess/WebPage/WebPage.cpp?rev=74164

Adam Roben (:aroben)

Comment 23 2010-12-16 08:10:20 PST

See also http://build.webkit.org/results/Windows%207%20Release%20(Tests)/r74170%20(7487)/fast/loader/non-deferred-substitute-load-pretty-diff.html

Yong Li

Comment 24 2010-12-16 08:20:54 PST

Am I responsible to fix all these UTF-16 to UTF-16LE in all test cases/error pages? /me cries...

Darin Adler

Comment 25 2010-12-16 08:22:02 PST

(In reply to comment #24) > Am I responsible to fix all these UTF-16 to UTF-16LE in all test cases/error pages? /me cries... You don’t have to do it yourself, but you will have to talk with the people affected and wait until they’ve resolved these issues if you want to change this.

Yong Li

Comment 26 2010-12-16 08:44:00 PST

(In reply to comment #25) > (In reply to comment #24) > > Am I responsible to fix all these UTF-16 to UTF-16LE in all test cases/error pages? /me cries... > You don’t have to do it yourself, but you will have to talk with the people affected and wait until they’ve resolved these issues if you want to change this. Thanks a lot. I'll try to fix some first

Alexey Proskuryakov

Comment 27 2010-12-16 09:00:11 PST

Despite earlier Darin's r+, I would still appreciate more direct testing of IE behavior.

Xan Lopez

Comment 28 2010-12-16 10:04:57 PST

CCing a couple of GTK+ guys for the fallout.

Darin Adler

Comment 29 2010-12-16 10:06:44 PST

(In reply to comment #27) > Despite earlier Darin's r+, I would still appreciate more direct testing of IE behavior. I didn’t mean to override your concern, Alexey. This is enough of an edge case that I think we are unlikely to have significant problems no matter which way we go. But I agree that would be very good to know about IE behavior based on resources, not just data URLs. Also, I think it would be good to land the new tests either with expected failures or with expectations that match our current behavior, even if we can't make the change in WebCore right now.

Alexey Proskuryakov

Comment 30 2010-12-16 10:16:36 PST

Created attachment 76780 [details] utf-16 test case In fact, I've just tested, and it appears that IE treats "utf-16" as little endian. See the attached test case.

Alexey Proskuryakov

Comment 31 2010-12-16 10:17:36 PST

And so does Firefox, so WebKit matches both.

Yong Li

Comment 32 2010-12-16 10:26:25 PST

(In reply to comment #31) > And so does Firefox, so WebKit matches both. I don't understand the test case. Can you try these 2 URL in FireFox? data:text/plain;charset=UTF-16,%D8%69%DE%D6 and data:text/plain;charset=UTF-16,%69%D8%D6%DE To me, it shows FireFox uses BigEndian

Yong Li

Comment 33 2010-12-16 10:30:32 PST

Created attachment 76782 [details] test case with <p> removed

Yong Li

Comment 34 2010-12-16 10:33:50 PST

Created attachment 76783 [details] test case - BE

Yong Li

Comment 35 2010-12-16 10:34:51 PST

Seems IE is using little endian and firefox uses big endian for "utf-16"

Alexey Proskuryakov

Comment 36 2010-12-16 10:36:03 PST

So it appears that Firefox performed content sniffing to change the encoding (I think it's on by default as of Firefox 3.6.16).

Alexey Proskuryakov

Comment 37 2010-12-16 10:36:36 PST

I meant 3.6.13, sorry.

Alexey Proskuryakov

Comment 38 2010-12-16 10:41:15 PST

The currently preferred way to resolve such differences is via standardization in HTML5. In fact, HTML5 already sides with IE/WebKit: --------------------------- When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE. Note: The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content. [RFC2781] --------------------------- The tests added here are very good, and I suggest landing those. But it doesn't seem like we should be changing behavior.

Yong Li

Comment 39 2010-12-16 10:43:04 PST

(In reply to comment #38) > The currently preferred way to resolve such differences is via standardization in HTML5. In fact, HTML5 already sides with IE/WebKit: > --------------------------- > When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE. > Note: The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content. [RFC2781] > --------------------------- > The tests added here are very good, and I suggest landing those. But it doesn't seem like we should be changing behavior. Tend to agree. But probably we still want to handle 'ISO-10646-UCS-2' and 'UCS-2' as BE?

Yong Li

Comment 40 2010-12-16 10:43:42 PST

(In reply to comment #39) > (In reply to comment #38) > > The currently preferred way to resolve such differences is via standardization in HTML5. In fact, HTML5 already sides with IE/WebKit: > > --------------------------- > > When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE. > > Note: The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content. [RFC2781] > > --------------------------- > > The tests added here are very good, and I suggest landing those. But it doesn't seem like we should be changing behavior. > Tend to agree. But probably we still want to handle 'ISO-10646-UCS-2' and 'UCS-2' as BE? As IE also uses BE for 'ISO-10646-UCS-2'

Yong Li

Comment 41 2010-12-16 10:44:49 PST

Created attachment 76784 [details] Test case - BE - ISO-10646-UCS-2

Yong Li

Comment 42 2010-12-16 10:46:14 PST

(In reply to comment #41) > Created an attachment (id=76784) [details] > Test case - BE - ISO-10646-UCS-2 Oops. I'm wrong... werid.

Yong Li

Comment 43 2010-12-16 10:50:13 PST

Sorry. I think it is a false alarm...

Alexey Proskuryakov

Comment 44 2010-12-16 10:53:56 PST

It seems that IE doesn't recognize ISO-10646-UCS-2 at all - browser default encoding was used instead. It doesn't matter whether we interpret it as little or big endian, but I'm fine with matching Firefox and changing it to big endian. I'm also fine with treating it exactly like "utf-16" for internal consistency. Would you be willing to file a bug and make a patch with tests?

Yong Li

Comment 45 2010-12-16 10:55:00 PST

(In reply to comment #44) > It seems that IE doesn't recognize ISO-10646-UCS-2 at all - browser default encoding was used instead. It doesn't matter whether we interpret it as little or big endian, but I'm fine with matching Firefox and changing it to big endian. I'm also fine with treating it exactly like "utf-16" for internal consistency. > Would you be willing to file a bug and make a patch with tests? Sure. I'll update the patch to only include the tests

Alexey Proskuryakov

Comment 46 2010-12-16 11:55:47 PST

Bug 51199.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution INVALID

Priority P3

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component Text

Assignee

Nobody

Reported

2010-12-14 07:14 PST

Modified

2010-12-16 11:55 PST History

CC List

8 users Show

URL

Keywords

Depends on

51185 51187

Blocks

Dependencies

tree graph