Bug 51035 - UCS2 encoding aliases should be defaulted to Big Endian
Summary: UCS2 encoding aliases should be defaulted to Big Endian
Status: RESOLVED INVALID
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P3 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 51185 51187
Blocks:
  Show dependency treegraph
 
Reported: 2010-12-14 07:14 PST by Yong Li
Modified: 2010-12-16 11:55 PST (History)
8 users (show)

See Also:


Attachments
the patch (2.74 KB, patch)
2010-12-14 08:31 PST, Yong Li
ap: review-
Details | Formatted Diff | Diff
update test and expected result (5.93 KB, patch)
2010-12-14 13:56 PST, Yong Li
no flags Details | Formatted Diff | Diff
updated (11.14 KB, patch)
2010-12-15 08:26 PST, Yong Li
no flags Details | Formatted Diff | Diff
test case updated (5.86 KB, patch)
2010-12-15 08:29 PST, Yong Li
no flags Details | Formatted Diff | Diff
utf-16 test case (28 bytes, text/html; charset=utf-16)
2010-12-16 10:16 PST, Alexey Proskuryakov
no flags Details
test case with <p> removed (16 bytes, text/plain; charset=utf-16)
2010-12-16 10:30 PST, Yong Li
no flags Details
test case - BE (16 bytes, text/plain;charset=utf-16)
2010-12-16 10:33 PST, Yong Li
no flags Details
Test case - BE - ISO-10646-UCS-2 (16 bytes, text/plain;charset="ISO-10646-UCS-2")
2010-12-16 10:44 PST, Yong Li
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Yong Li 2010-12-14 07:14:14 PST
I think UCS2 encoding names like "ISO-10646-UCS-2" should be treated as Big Endian, because:

1) http://www.isi.edu/in-notes/iana/assignments/character-sets says about ISO-10646-UCS-2: "this needs to specify network byte order: the standard does not specify"

2) RFC 2152 says: "ISO/IEC 10646-1:1993(E) specifies that when characters the UCS-2 form are serialized as octets, that the most significant octet appear first."

3) ICU code page description (http://publib.boulder.ibm.com/infocenter/tivihelp/v24r1/index.jsp?topic=/com.ibm.itcama.doc_6.2.3/itcam_oraclerac63200.htm) says:

UCS2 	UCS-2 (Really UTF-16 BE)

4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian.

However, it is hard to find a website in reality that uses UCS2, except test cases. So priority should be low.
Comment 1 Yong Li 2010-12-14 08:31:34 PST
Created attachment 76540 [details]
the patch
Comment 2 Alexey Proskuryakov 2010-12-14 11:19:14 PST
Comment on attachment 76540 [details]
the patch

> 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian.

Did you test other encoding names modified here? Do all the added sub-tests pass in IE and Firefox?

It's surprising that we fail on every page that's labeled as "UTF-16" and doesn't have a BOM. In fact, I recall that the current behavior is intentional, and was chosen for compatibility with other browsers - but I don't remember details of testing.

+// UCS2 encoding aliases.
+testDecode('ISO-10646-UCS-2', '%00a', 'U+0061');
+testDecode('UCS-2', '%00a', 'U+0061');
+testDecode('UTF-16', '%00a', 'U+0061');
+testDecode('Unicode', '%00a', 'U+0061');
+testDecode('csUnicode', '%00a', 'U+0061');
+testDecode('unicodeFFFE', '%00a', 'U+0061');
+testDecode('unicodeFEFF', 'a%00', 'U+0061');

This patch has confused terminology. Why did you label UTF-16 as a "UCS2 encoding alias"? It's clearly not.

UCS-2 is a historic name for 16-bit encoding defined by Unicode 1.0, and obsoleted by Unicode 2.0. I suspect that most or all clients treat it as UTF-16 in practice. ICU certainly does.

+testDecode('ISO-10646-UCS-2', '%00a', 'U+0061');

Please add tests for something besides zero bytes. While at it, please consider adding tests for whether surrogate pairs are supported for each of these encoding names.

Extending test coverage would is very welcome.
Comment 3 Yong Li 2010-12-14 12:16:58 PST
(In reply to comment #2)
> (From update of attachment 76540 [details])
> > 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian.
> 
> Did you test other encoding names modified here? Do all the added sub-tests pass in IE and Firefox?
> 
> It's surprising that we fail on every page that's labeled as "UTF-16" and doesn't have a BOM. In fact, I recall that the current behavior is intentional, and was chosen for compatibility with other browsers - but I don't remember details of testing.
> 
> +// UCS2 encoding aliases.
> +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061');
> +testDecode('UCS-2', '%00a', 'U+0061');
> +testDecode('UTF-16', '%00a', 'U+0061');
> +testDecode('Unicode', '%00a', 'U+0061');
> +testDecode('csUnicode', '%00a', 'U+0061');
> +testDecode('unicodeFFFE', '%00a', 'U+0061');
> +testDecode('unicodeFEFF', 'a%00', 'U+0061');
> 
> This patch has confused terminology. Why did you label UTF-16 as a "UCS2 encoding alias"? It's clearly not.
> 
> UCS-2 is a historic name for 16-bit encoding defined by Unicode 1.0, and obsoleted by Unicode 2.0. I suspect that most or all clients treat it as UTF-16 in practice. ICU certainly does.
> 
> +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061');
> 
> Please add tests for something besides zero bytes. While at it, please consider adding tests for whether surrogate pairs are supported for each of these encoding names.
> 
> Extending test coverage would is very welcome.

(In reply to comment #2)
> (From update of attachment 76540 [details])
> > 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian.
> 
> Did you test other encoding names modified here? Do all the added sub-tests pass in IE and Firefox?

FireFox seems treat "Unicode" as UTF-8. But for other names, it uses Big Endian. Except, when using 0 bytes, it also works with little endian for "UTF-16".

> 
> It's surprising that we fail on every page that's labeled as "UTF-16" and doesn't have a BOM. In fact, I recall that the current behavior is intentional, and was chosen for compatibility with other browsers - but I don't remember details of testing.
> 
> +// UCS2 encoding aliases.
> +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061');
> +testDecode('UCS-2', '%00a', 'U+0061');
> +testDecode('UTF-16', '%00a', 'U+0061');
> +testDecode('Unicode', '%00a', 'U+0061');
> +testDecode('csUnicode', '%00a', 'U+0061');
> +testDecode('unicodeFFFE', '%00a', 'U+0061');
> +testDecode('unicodeFEFF', 'a%00', 'U+0061');
> 
> This patch has confused terminology. Why did you label UTF-16 as a "UCS2 encoding alias"? It's clearly not.
> 
> UCS-2 is a historic name for 16-bit encoding defined by Unicode 1.0, and obsoleted by Unicode 2.0. I suspect that most or all clients treat it as UTF-16 in practice. ICU certainly does.
>

Can I say they are aliases of UTF-16 then?

> +testDecode('ISO-10646-UCS-2', '%00a', 'U+0061');
> 
> Please add tests for something besides zero bytes. While at it, please consider adding tests for whether surrogate pairs are supported for each of these encoding names.
> 
Will use surrogate pairs.
Comment 4 Alexey Proskuryakov 2010-12-14 12:41:48 PST
Please be sure to test Internet Explorer behavior in detail.

> Can I say they are aliases of UTF-16 then?

I'd just say "UTF-16 variants" - unicodeFFFE and unicodeFEFF clearly aren't aliases of the same encoding.

> Will use surrogate pairs.

Thanks!
Comment 5 Darin Adler 2010-12-14 12:57:57 PST
Have we found any existing test cases on websites? It seems that 16-bit encodings without BOMs are extremely rare.

Is there some context other than viewing a webpage where this comes up? Form submission or XMLHttpRequest?
Comment 6 Yong Li 2010-12-14 13:10:18 PST
(In reply to comment #5)
> Have we found any existing test cases on websites? It seems that 16-bit encodings without BOMs are extremely rare.

Yeah. I can hardly find a real website sending UTF-16 without BOM. But seems everyone is saying UTF-16 without BOM should be UTF-16BE.

> 
> Is there some context other than viewing a webpage where this comes up? Form submission or XMLHttpRequest?

I'm using data URL like "data:text/plain;charset=UTF-16,%D8%69%DE%D6" to be iframe source or just type it in address bar.
Comment 7 Yong Li 2010-12-14 13:56:47 PST
Created attachment 76567 [details]
update test and expected result
Comment 8 Alexey Proskuryakov 2010-12-14 14:03:25 PST
Comment on attachment 76567 [details]
update test and expected result

View in context: https://bugs.webkit.org/attachment.cgi?id=76567&action=review

I think that key to landing this would be testing IE and Firefox with regular HTML files served over HTTP. Data URLs are a great trick for regression tests, but comparing browser behavior based on them isn't very reliable.

> LayoutTests/fast/encoding/char-decoding-expected.txt:160
> +PASS decodeSurrogatePair('UTF-16BE', '%D8%69%DE%D6') is 'U+D869/U+DED6'

I'm not sure why decodeSurrogatePair is needed - does decode() not work for those? If so, could it be better to fix decode()?
Comment 9 Yong Li 2010-12-14 14:11:05 PST
(In reply to comment #8)
> (From update of attachment 76567 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=76567&action=review
> 
> I think that key to landing this would be testing IE and Firefox with regular HTML files served over HTTP. Data URLs are a great trick for regression tests, but comparing browser behavior based on them isn't very reliable.
> 
> > LayoutTests/fast/encoding/char-decoding-expected.txt:160
> > +PASS decodeSurrogatePair('UTF-16BE', '%D8%69%DE%D6') is 'U+D869/U+DED6'
> 
> I'm not sure why decodeSurrogatePair is needed - does decode() not work for those? If so, could it be better to fix decode()?

decode() only returns one UTF-16 character. Do you think it is good to make decode() dump all characters in responseText?
Comment 10 Alexey Proskuryakov 2010-12-14 14:40:03 PST
> Do you think it is good to make decode() dump all characters in responseText?

I think so, but I haven't been involved in the original design of this test. Perhaps Darin will have some comments.
Comment 11 Darin Adler 2010-12-14 17:10:18 PST
(In reply to comment #10)
> > Do you think it is good to make decode() dump all characters in responseText?
> 
> I think so, but I haven't been involved in the original design of this test. Perhaps Darin will have some comments.

Yes, decode should dump all the characters. There’s no advantage to have it returning only one. In fact, that behavior could hide failures.
Comment 12 Yong Li 2010-12-15 07:30:22 PST
(In reply to comment #11)
> (In reply to comment #10)
> > > Do you think it is good to make decode() dump all characters in responseText?
> > 
> > I think so, but I haven't been involved in the original design of this test. Perhaps Darin will have some comments.
> 
> Yes, decode should dump all the characters. There’s no advantage to have it returning only one. In fact, that behavior could hide failures.

will do it in that way
Comment 13 Yong Li 2010-12-15 08:26:02 PST
Created attachment 76649 [details]
updated
Comment 14 Yong Li 2010-12-15 08:29:52 PST
Created attachment 76650 [details]
test case updated
Comment 15 WebKit Commit Bot 2010-12-15 13:31:34 PST
The commit-queue encountered the following flaky tests while processing attachment 76650 [details]:

http/tests/navigation/document-location-click-timeout.html bug 51130 (author: mpcomplete@chromium.org)
fast/loader/location-port.html bug 51131 (author: gns@gnome.org)
The commit-queue is continuing to process your patch.
Comment 16 WebKit Commit Bot 2010-12-15 15:31:58 PST
Comment on attachment 76650 [details]
test case updated

Rejecting attachment 76650 [details] from commit-queue.

Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=cr-jail-3', 'build-and-test', '--no-clean', '--no-update', '--test', '--non-interactive']" exit_code: 2
Last 500 characters of output:
....................................................................................................................................................................................................
inspector .................................................................
inspector/styles-source-lines-inline.html -> crashed

Exiting early after 1 failures. 17876 tests run.
405.07s total testing time

17875 test cases (99%) succeeded
1 test case (<1%) crashed
8 test cases (<1%) had stderr output

Full output: http://queues.webkit.org/results/7140055
Comment 17 Eric Seidel (no email) 2010-12-15 16:07:11 PST
Comment on attachment 76650 [details]
test case updated

The failure seems unlikely to be related.
Comment 18 WebKit Commit Bot 2010-12-15 16:55:32 PST
Comment on attachment 76650 [details]
test case updated

Clearing flags on attachment: 76650

Committed r74162: <http://trac.webkit.org/changeset/74162>
Comment 19 WebKit Commit Bot 2010-12-15 16:55:39 PST
All reviewed patches have been landed.  Closing bug.
Comment 20 Xan Lopez 2010-12-16 07:17:17 PST
This broke the hell out of GTK+ (> 100 tests failing), so I reverted it in r74186. In the future please play attention to massive breakage in the bots when landing patches.
Comment 21 Xan Lopez 2010-12-16 07:17:58 PST
Reopening.
Comment 22 Adam Roben (:aroben) 2010-12-16 07:42:25 PST
(In reply to comment #20)
> This broke the hell out of GTK+ (> 100 tests failing), so I reverted it in r74186. In the future please play attention to massive breakage in the bots when landing patches.

It also broke Safari on Windows's location field, search field, and error pages. These features use the IWebFrame::load[Alternate]HTMLString APIs, which end up calling through WebFrame::loadHTMLString [1]. BSTRs appear to be encoded with UTF-16LE, but we pass just "utf-16" as the encoding name to the SubstituteData constructor.

It looks like WebKit2's WebPage has this same problem [2].

1. http://trac.webkit.org/browser/trunk/WebKit/win/WebFrame.cpp?rev=73430
2. http://trac.webkit.org/browser/trunk/WebKit2/WebProcess/WebPage/WebPage.cpp?rev=74164
Comment 24 Yong Li 2010-12-16 08:20:54 PST
Am I responsible to fix all these UTF-16 to UTF-16LE in all test cases/error pages? /me cries...
Comment 25 Darin Adler 2010-12-16 08:22:02 PST
(In reply to comment #24)
> Am I responsible to fix all these UTF-16 to UTF-16LE in all test cases/error pages? /me cries...

You don’t have to do it yourself, but you will have to talk with the people affected and wait until they’ve resolved these issues if you want to change this.
Comment 26 Yong Li 2010-12-16 08:44:00 PST
(In reply to comment #25)
> (In reply to comment #24)
> > Am I responsible to fix all these UTF-16 to UTF-16LE in all test cases/error pages? /me cries...
> You don’t have to do it yourself, but you will have to talk with the people affected and wait until they’ve resolved these issues if you want to change this.

Thanks a lot. I'll try to fix some first
Comment 27 Alexey Proskuryakov 2010-12-16 09:00:11 PST
Despite earlier Darin's r+, I would still appreciate more direct testing of IE behavior.
Comment 28 Xan Lopez 2010-12-16 10:04:57 PST
CCing a couple of GTK+ guys for the fallout.
Comment 29 Darin Adler 2010-12-16 10:06:44 PST
(In reply to comment #27)
> Despite earlier Darin's r+, I would still appreciate more direct testing of IE behavior.

I didn’t mean to override your concern, Alexey. This is enough of an edge case that I think we are unlikely to have significant problems no matter which way we go. But I agree that would be very good to know about IE behavior based on resources, not just data URLs.

Also, I think it would be good to land the new tests either with expected failures or with expectations that match our current behavior, even if we can't make the change in WebCore right now.
Comment 30 Alexey Proskuryakov 2010-12-16 10:16:36 PST
Created attachment 76780 [details]
utf-16 test case

In fact, I've just tested, and it appears that IE treats "utf-16" as little endian. See the attached test case.
Comment 31 Alexey Proskuryakov 2010-12-16 10:17:36 PST
And so does Firefox, so WebKit matches both.
Comment 32 Yong Li 2010-12-16 10:26:25 PST
(In reply to comment #31)
> And so does Firefox, so WebKit matches both.

I don't understand the test case. Can you try these 2 URL in FireFox?
 
data:text/plain;charset=UTF-16,%D8%69%DE%D6

and 

data:text/plain;charset=UTF-16,%69%D8%D6%DE

To me, it shows FireFox uses BigEndian
Comment 33 Yong Li 2010-12-16 10:30:32 PST
Created attachment 76782 [details]
test case with <p> removed
Comment 34 Yong Li 2010-12-16 10:33:50 PST
Created attachment 76783 [details]
test case - BE
Comment 35 Yong Li 2010-12-16 10:34:51 PST
Seems IE is using little endian and firefox uses big endian for "utf-16"
Comment 36 Alexey Proskuryakov 2010-12-16 10:36:03 PST
So it appears that Firefox performed content sniffing to change the encoding (I think it's on by default as of Firefox 3.6.16).
Comment 37 Alexey Proskuryakov 2010-12-16 10:36:36 PST
I meant 3.6.13, sorry.
Comment 38 Alexey Proskuryakov 2010-12-16 10:41:15 PST
The currently preferred way to resolve such differences is via standardization in HTML5. In fact, HTML5 already sides with IE/WebKit:

---------------------------
When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE.

Note: The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content. [RFC2781]
---------------------------

The tests added here are very good, and I suggest landing those. But it doesn't seem like we should be changing behavior.
Comment 39 Yong Li 2010-12-16 10:43:04 PST
(In reply to comment #38)
> The currently preferred way to resolve such differences is via standardization in HTML5. In fact, HTML5 already sides with IE/WebKit:
> ---------------------------
> When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE.
> Note: The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content. [RFC2781]
> ---------------------------
> The tests added here are very good, and I suggest landing those. But it doesn't seem like we should be changing behavior.

Tend to agree. But probably we still want to handle 'ISO-10646-UCS-2' and 'UCS-2' as BE?
Comment 40 Yong Li 2010-12-16 10:43:42 PST
(In reply to comment #39)
> (In reply to comment #38)
> > The currently preferred way to resolve such differences is via standardization in HTML5. In fact, HTML5 already sides with IE/WebKit:
> > ---------------------------
> > When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE.
> > Note: The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content. [RFC2781]
> > ---------------------------
> > The tests added here are very good, and I suggest landing those. But it doesn't seem like we should be changing behavior.
> Tend to agree. But probably we still want to handle 'ISO-10646-UCS-2' and 'UCS-2' as BE?

As IE also uses BE for 'ISO-10646-UCS-2'
Comment 41 Yong Li 2010-12-16 10:44:49 PST
Created attachment 76784 [details]
Test case - BE - ISO-10646-UCS-2
Comment 42 Yong Li 2010-12-16 10:46:14 PST
(In reply to comment #41)
> Created an attachment (id=76784) [details]
> Test case - BE - ISO-10646-UCS-2

Oops. I'm wrong... werid.
Comment 43 Yong Li 2010-12-16 10:50:13 PST
Sorry. I think it is a false alarm...
Comment 44 Alexey Proskuryakov 2010-12-16 10:53:56 PST
It seems that IE doesn't recognize ISO-10646-UCS-2 at all - browser default encoding was used instead. It doesn't matter whether we interpret it as little or big endian, but I'm fine with matching Firefox and changing it to big endian. I'm also fine with treating it exactly like "utf-16" for internal consistency.

Would you be willing to file a bug and make a patch with tests?
Comment 45 Yong Li 2010-12-16 10:55:00 PST
(In reply to comment #44)
> It seems that IE doesn't recognize ISO-10646-UCS-2 at all - browser default encoding was used instead. It doesn't matter whether we interpret it as little or big endian, but I'm fine with matching Firefox and changing it to big endian. I'm also fine with treating it exactly like "utf-16" for internal consistency.
> Would you be willing to file a bug and make a patch with tests?

Sure. I'll update the patch to only include the tests
Comment 46 Alexey Proskuryakov 2010-12-16 11:55:47 PST
Bug 51199.