I think UCS2 encoding names like "ISO-10646-UCS-2" should be treated as big endian, because:

1) http://www.isi.edu/in-notes/iana/assignments/character-sets says about ISO-10646-UCS-2: "this needs to specify network byte order: the standard does not specify"

2) RFC 2152 says: "ISO/IEC 10646-1:1993(E) specifies that when characters the UCS-2 form are serialized as octets, that the most significant octet appear first."

3) The ICU code page description (http://publib.boulder.ibm.com/infocenter/tivihelp/v24r1/index.jsp?topic=/com.ibm.itcama.doc_6.2.3/itcam_oraclerac63200.htm) says: UCS2 UCS-2 (Really UTF-16 BE)

4) IE and Firefox treat "ISO-10646-UCS-2" as big endian.

However, it is hard to find a real website that uses UCS2, except test cases, so the priority should be low.
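As an illustrative aside (not part of the patch), the endianness difference behind the "%00a" test vectors can be reproduced with any UTF-16 codec; this Python sketch uses only the standard library:

```python
# Illustrative sketch: how the byte pair 00 61 ("%00a" in the tests)
# decodes under each byte order.
data = b"\x00a"  # bytes 0x00 0x61

# Big endian reads the pair as U+0061 ("a").
assert data.decode("utf-16-be") == "a"

# Little endian reads the same pair as U+6100, a CJK ideograph.
assert data.decode("utf-16-le") == "\u6100"
```

So a decoder's endianness default is directly observable from which character "%00a" produces.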
Created attachment 76540 [details] the patch
Comment on attachment 76540 [details]
the patch

> 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian.

Did you test the other encoding names modified here? Do all the added sub-tests pass in IE and Firefox?

It's surprising that we fail on every page that's labeled as "UTF-16" and doesn't have a BOM. In fact, I recall that the current behavior is intentional, and was chosen for compatibility with other browsers - but I don't remember the details of the testing.

+// UCS2 encoding aliases.
+testDecode('ISO-10646-UCS-2', '%00a', 'U+0061');
+testDecode('UCS-2', '%00a', 'U+0061');
+testDecode('UTF-16', '%00a', 'U+0061');
+testDecode('Unicode', '%00a', 'U+0061');
+testDecode('csUnicode', '%00a', 'U+0061');
+testDecode('unicodeFFFE', '%00a', 'U+0061');
+testDecode('unicodeFEFF', 'a%00', 'U+0061');

This patch has confused terminology. Why did you label UTF-16 as a "UCS2 encoding alias"? It's clearly not.

UCS-2 is a historic name for the 16-bit encoding defined by Unicode 1.0, obsoleted by Unicode 2.0. I suspect that most or all clients treat it as UTF-16 in practice. ICU certainly does.

+testDecode('ISO-10646-UCS-2', '%00a', 'U+0061');

Please add tests for something besides zero bytes. While at it, please consider adding tests for whether surrogate pairs are supported for each of these encoding names.

Extending test coverage is very welcome.
(In reply to comment #2)
> (From update of attachment 76540 [details])
> > 4) IE and FireFox treat "ISO-10646-UCS-2" as Big Endian.
>
> Did you test other encoding names modified here? Do all the added sub-tests pass in IE and Firefox?

Firefox seems to treat "Unicode" as UTF-8, but for the other names it uses big endian. The exception is that, when using zero bytes, "UTF-16" also works with little endian.

> It's surprising that we fail on every page that's labeled as "UTF-16" and doesn't have a BOM. In fact, I recall that the current behavior is intentional, and was chosen for compatibility with other browsers - but I don't remember details of testing.

> This patch has confused terminology. Why did you label UTF-16 as a "UCS2 encoding alias"? It's clearly not.
>
> UCS-2 is a historic name for 16-bit encoding defined by Unicode 1.0, and obsoleted by Unicode 2.0. I suspect that most or all clients treat it as UTF-16 in practice. ICU certainly does.

Can I say they are aliases of UTF-16 then?

> Please add tests for something besides zero bytes. While at it, please consider adding tests for whether surrogate pairs are supported for each of these encoding names.

Will use surrogate pairs.
Please be sure to test Internet Explorer behavior in detail. > Can I say they are aliases of UTF-16 then? I'd just say "UTF-16 variants" - unicodeFFFE and unicodeFEFF clearly aren't aliases of the same encoding. > Will use surrogate pairs. Thanks!
Have we found any existing test cases on websites? It seems that 16-bit encodings without BOMs are extremely rare. Is there some context other than viewing a webpage where this comes up? Form submission or XMLHttpRequest?
(In reply to comment #5)
> Have we found any existing test cases on websites? It seems that 16-bit encodings without BOMs are extremely rare.

Yeah, I can hardly find a real website sending UTF-16 without a BOM. But it seems everyone says UTF-16 without a BOM should be UTF-16BE.

> Is there some context other than viewing a webpage where this comes up? Form submission or XMLHttpRequest?

I'm using data URLs like "data:text/plain;charset=UTF-16,%D8%69%DE%D6" as an iframe source, or just typing them into the address bar.
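For reference, the percent-escaped bytes in that data URL can be decoded both ways outside the browser; an illustrative Python sketch (not WebKit code):

```python
# The data URL data:text/plain;charset=UTF-16,%D8%69%DE%D6
# delivers these four bytes to the decoder.
data = bytes([0xD8, 0x69, 0xDE, 0xD6])

# Big endian: code units D869 DED6 form a surrogate pair for U+2A6D6.
assert data.decode("utf-16-be") == "\U0002A6D6"

# Little endian: code units 69D8 D6DE are two ordinary BMP characters,
# so an LE decoder renders entirely different text.
assert data.decode("utf-16-le") == "\u69d8\ud6de"
```

Either way the bytes decode without error, which is why the rendered text (rather than a decoding failure) is what distinguishes the two interpretations.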
Created attachment 76567 [details] update test and expected result
Comment on attachment 76567 [details]
update test and expected result

View in context: https://bugs.webkit.org/attachment.cgi?id=76567&action=review

I think that the key to landing this would be testing IE and Firefox with regular HTML files served over HTTP. Data URLs are a great trick for regression tests, but comparing browser behavior based on them isn't very reliable.

> LayoutTests/fast/encoding/char-decoding-expected.txt:160
> +PASS decodeSurrogatePair('UTF-16BE', '%D8%69%DE%D6') is 'U+D869/U+DED6'

I'm not sure why decodeSurrogatePair is needed - does decode() not work for those? If so, would it be better to fix decode()?
(In reply to comment #8) > (From update of attachment 76567 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=76567&action=review > > I think that key to landing this would be testing IE and Firefox with regular HTML files served over HTTP. Data URLs are a great trick for regression tests, but comparing browser behavior based on them isn't very reliable. > > > LayoutTests/fast/encoding/char-decoding-expected.txt:160 > > +PASS decodeSurrogatePair('UTF-16BE', '%D8%69%DE%D6') is 'U+D869/U+DED6' > > I'm not sure why decodeSurrogatePair is needed - does decode() not work for those? If so, could it be better to fix decode()? decode() only returns one UTF-16 character. Do you think it is good to make decode() dump all characters in responseText?
> Do you think it is good to make decode() dump all characters in responseText? I think so, but I haven't been involved in the original design of this test. Perhaps Darin will have some comments.
(In reply to comment #10)
> > Do you think it is good to make decode() dump all characters in responseText?
>
> I think so, but I haven't been involved in the original design of this test. Perhaps Darin will have some comments.

Yes, decode should dump all the characters. There’s no advantage to having it return only one. In fact, that behavior could hide failures.
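A dump-everything helper along the lines suggested here could format each UTF-16 code unit of the decoded string. This is only an illustrative Python sketch (the real test harness is JavaScript, and the name dump_utf16_units is invented for this example):

```python
def dump_utf16_units(s: str) -> str:
    """Render a string as its UTF-16 code units, e.g. 'U+D869/U+DED6',
    so a test sees every decoded character, not just the first one."""
    raw = s.encode("utf-16-be")
    units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    return "/".join("U+%04X" % u for u in units)

# A character outside the BMP shows up as both of its surrogate code units.
assert dump_utf16_units("\U0002A6D6") == "U+D869/U+DED6"
# A BMP character shows up as a single unit.
assert dump_utf16_units("a") == "U+0061"
```

Dumping every unit means a decoder that silently drops or mangles trailing characters would fail the comparison instead of passing by accident.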
(In reply to comment #11)
> Yes, decode should dump all the characters. There’s no advantage to having it return only one. In fact, that behavior could hide failures.

Will do it that way.
Created attachment 76649 [details] updated
Created attachment 76650 [details] test case updated
The commit-queue encountered the following flaky tests while processing attachment 76650 [details]: http/tests/navigation/document-location-click-timeout.html bug 51130 (author: mpcomplete@chromium.org) fast/loader/location-port.html bug 51131 (author: gns@gnome.org) The commit-queue is continuing to process your patch.
Comment on attachment 76650 [details] test case updated Rejecting attachment 76650 [details] from commit-queue. Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=cr-jail-3', 'build-and-test', '--no-clean', '--no-update', '--test', '--non-interactive']" exit_code: 2 Last 500 characters of output: .................................................................................................................................................................................................... inspector ................................................................. inspector/styles-source-lines-inline.html -> crashed Exiting early after 1 failures. 17876 tests run. 405.07s total testing time 17875 test cases (99%) succeeded 1 test case (<1%) crashed 8 test cases (<1%) had stderr output Full output: http://queues.webkit.org/results/7140055
Comment on attachment 76650 [details] test case updated The failure seems unlikely to be related.
Comment on attachment 76650 [details] test case updated Clearing flags on attachment: 76650 Committed r74162: <http://trac.webkit.org/changeset/74162>
All reviewed patches have been landed. Closing bug.
This broke the hell out of GTK+ (> 100 tests failing), so I reverted it in r74186. In the future please pay attention to massive breakage in the bots when landing patches.
Reopening.
(In reply to comment #20)
> This broke the hell out of GTK+ (> 100 tests failing), so I reverted it in r74186. In the future please pay attention to massive breakage in the bots when landing patches.

It also broke the location field, search field, and error pages in Safari on Windows. These features use the IWebFrame::load[Alternate]HTMLString APIs, which end up calling through WebFrame::loadHTMLString [1]. BSTRs appear to be encoded as UTF-16LE, but we pass just "utf-16" as the encoding name to the SubstituteData constructor. It looks like WebKit2's WebPage has the same problem [2].

1. http://trac.webkit.org/browser/trunk/WebKit/win/WebFrame.cpp?rev=73430
2. http://trac.webkit.org/browser/trunk/WebKit2/WebProcess/WebPage/WebPage.cpp?rev=74164
See also http://build.webkit.org/results/Windows%207%20Release%20(Tests)/r74170%20(7487)/fast/loader/non-deferred-substitute-load-pretty-diff.html
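The failure mode described above can be mimicked in a few lines; an illustrative Python sketch (the payload string is invented, not taken from WebKit):

```python
# BSTRs carry UTF-16LE bytes. If the bare label "utf-16" is resolved
# to big endian, those bytes are misread.
bstr_payload = "error page".encode("utf-16-le")

# The intended little-endian interpretation round-trips cleanly.
assert bstr_payload.decode("utf-16-le") == "error page"

# A big-endian default turns the same bytes into garbage characters.
assert bstr_payload.decode("utf-16-be") != "error page"
```

This is why the regression showed up as mojibake in the location field and error pages rather than as a decoding error: the swapped bytes still form valid BMP code units.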
Am I responsible for fixing all these UTF-16 to UTF-16LE in all the test cases/error pages? /me cries...
(In reply to comment #24) > Am I responsible to fix all these UTF-16 to UTF-16LE in all test cases/error pages? /me cries... You don’t have to do it yourself, but you will have to talk with the people affected and wait until they’ve resolved these issues if you want to change this.
(In reply to comment #25) > (In reply to comment #24) > > Am I responsible to fix all these UTF-16 to UTF-16LE in all test cases/error pages? /me cries... > You don’t have to do it yourself, but you will have to talk with the people affected and wait until they’ve resolved these issues if you want to change this. Thanks a lot. I'll try to fix some first
Despite Darin's earlier r+, I would still appreciate more direct testing of IE behavior.
CCing a couple of GTK+ guys for the fallout.
(In reply to comment #27) > Despite earlier Darin's r+, I would still appreciate more direct testing of IE behavior. I didn’t mean to override your concern, Alexey. This is enough of an edge case that I think we are unlikely to have significant problems no matter which way we go. But I agree that would be very good to know about IE behavior based on resources, not just data URLs. Also, I think it would be good to land the new tests either with expected failures or with expectations that match our current behavior, even if we can't make the change in WebCore right now.
Created attachment 76780 [details] utf-16 test case In fact, I've just tested, and it appears that IE treats "utf-16" as little endian. See the attached test case.
And so does Firefox, so WebKit matches both.
(In reply to comment #31)
> And so does Firefox, so WebKit matches both.

I don't understand the test case. Can you try these two URLs in Firefox?

data:text/plain;charset=UTF-16,%D8%69%DE%D6

and

data:text/plain;charset=UTF-16,%69%D8%D6%DE

To me, this shows Firefox uses big endian.
Created attachment 76782 [details] test case with <p> removed
Created attachment 76783 [details] test case - BE
It seems IE uses little endian and Firefox uses big endian for "utf-16".
So it appears that Firefox performed content sniffing to change the encoding (I think it's on by default as of Firefox 3.6.16).
I meant 3.6.13, sorry.
The currently preferred way to resolve such differences is via standardization in HTML5. In fact, HTML5 already sides with IE/WebKit:

---------------------------
When a user agent is to use the UTF-16 encoding but no BOM has been found, user agents must default to UTF-16LE.

Note: The requirement to default UTF-16 to LE rather than BE is a willful violation of RFC 2781, motivated by a desire for compatibility with legacy content. [RFC2781]
---------------------------

The tests added here are very good, and I suggest landing those. But it doesn't seem like we should be changing behavior.
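The HTML5 rule quoted above amounts to a little BOM sniffing in front of the decoder; a hedged Python sketch of that logic (decode_utf16 is an invented name, not a WebKit API):

```python
def decode_utf16(data: bytes) -> str:
    """Decode bytes labeled 'utf-16' per the HTML5 rule: honor a BOM
    if present, otherwise default to little endian (a willful
    violation of RFC 2781)."""
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    if data.startswith(b"\xff\xfe"):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-le")  # no BOM: default to UTF-16LE

assert decode_utf16(b"\xfe\xff\x00a") == "a"  # BE BOM
assert decode_utf16(b"\xff\xfea\x00") == "a"  # LE BOM
assert decode_utf16(b"a\x00") == "a"          # no BOM, LE default
```

Under this rule, content that wants big-endian interpretation must either send a BOM or use an explicit "UTF-16BE" label.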
(In reply to comment #38)
> The currently preferred way to resolve such differences is via standardization in HTML5. In fact, HTML5 already sides with IE/WebKit.
> The tests added here are very good, and I suggest landing those. But it doesn't seem like we should be changing behavior.

I tend to agree. But probably we still want to handle 'ISO-10646-UCS-2' and 'UCS-2' as BE?
(In reply to comment #39)
> Tend to agree. But probably we still want to handle 'ISO-10646-UCS-2' and 'UCS-2' as BE?

As IE also uses BE for 'ISO-10646-UCS-2'.
Created attachment 76784 [details] Test case - BE - ISO-10646-UCS-2
(In reply to comment #41)
> Created an attachment (id=76784) [details]
> Test case - BE - ISO-10646-UCS-2

Oops, I'm wrong... weird.
Sorry. I think it is a false alarm...
It seems that IE doesn't recognize ISO-10646-UCS-2 at all - browser default encoding was used instead. It doesn't matter whether we interpret it as little or big endian, but I'm fine with matching Firefox and changing it to big endian. I'm also fine with treating it exactly like "utf-16" for internal consistency. Would you be willing to file a bug and make a patch with tests?
(In reply to comment #44) > It seems that IE doesn't recognize ISO-10646-UCS-2 at all - browser default encoding was used instead. It doesn't matter whether we interpret it as little or big endian, but I'm fine with matching Firefox and changing it to big endian. I'm also fine with treating it exactly like "utf-16" for internal consistency. > Would you be willing to file a bug and make a patch with tests? Sure. I'll update the patch to only include the tests
Bug 51199.