Bug 179307 - WebKit treats Big5-HKSCS as a distinct encoding from Big5, Encoding standard says it's the same
Status: RESOLVED DUPLICATE of bug 216016
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text
Version: Safari Technology Preview
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Duplicates: 159890
Depends on:
Blocks: 179303
Reported: 2017-11-05 16:38 PST by Maciej Stachowiak
Modified: 2022-09-27 06:44 PDT

Attachments
Test case for (lack of) WebKit's Big5 quirks, meant to go in LayoutTests/fast/encodings (853 bytes, text/html)
2017-11-05 20:41 PST, Maciej Stachowiak

Description Maciej Stachowiak 2017-11-05 16:38:29 PST
WebKit treats Big5-HKSCS as a distinct encoding from Big5, but the Encoding standard says it's the same. Chrome and Firefox report Big5 as the canonical name when using the TextDecoder API. It's not clear to me whether they actually decode it differently, though, and I am not sure how to make a test for that.
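One way to probe for a decoding difference is to run every candidate two-byte sequence through both decoders and collect the mismatches. The sketch below is not browser code: it uses Python's built-in `big5` and `big5hkscs` codecs purely as stand-ins for two Big5 variants, and `decode_or_none` and `differing_sequences` are hypothetical helper names. The same comparison could in principle be run on TextDecoder output captured from each browser.

```python
def decode_or_none(data: bytes, codec: str):
    """Decode strictly; return None when the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


def differing_sequences(codec_a: str, codec_b: str):
    """Yield (bytes, result_a, result_b) for two-byte sequences that differ."""
    # Trail bytes in Big5-family encodings fall in 0x40-0x7E and 0xA1-0xFE.
    trails = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))
    for lead in range(0x81, 0xFF):
        for trail in trails:
            data = bytes([lead, trail])
            a = decode_or_none(data, codec_a)
            b = decode_or_none(data, codec_b)
            if a != b:
                yield data, a, b


if __name__ == "__main__":
    diffs = list(differing_sequences("big5", "big5hkscs"))
    print(f"{len(diffs)} two-byte sequences decode differently")
    for data, a, b in diffs[:5]:
        print(data.hex(), repr(a), repr(b))
```

Treating a decode error as `None` means "rejected by one codec, accepted by the other" is reported as a difference too, which is the interesting case for HKSCS extensions.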
Comment 1 Maciej Stachowiak 2017-11-05 18:29:18 PST
Here are some past revisions that may explain why we have this behavior (pointed out by Darin):

https://trac.webkit.org/changeset/3611/webkit
    We changed to treat all Big5 as an alias for the Windows version (like the latest Encoding spec does)


https://trac.webkit.org/changeset/4054/webkit
    We changed to treat most Big5 character sets as Big5_HKSCS_1999, unless they were explicitly Microsoft-specific.

https://trac.webkit.org/changeset/4689/webkit
    We changed to treat most Big5 character sets as the DOS/Windows version, but left Big5-HKSCS alone.

It's not totally clear why Big5-HKSCS was left alone in that last change. I don't think this is compatible with what other browsers do, so we should probably abandon this direction. But I need to make some tests.
Comment 2 Alexey Proskuryakov 2017-11-05 19:32:48 PST
Big5 is a large family of standards governed by various entities, and we basically never got to check if ICU supported the variant(s) that other browsers used. This is likely moot now, as Chrome also uses ICU.
Comment 3 Maciej Stachowiak 2017-11-05 20:36:22 PST
These are our differences from the standard on Big5-related encodings:

MISMATCH: encoding big5-hkscs is Big5 in the standard, but Big5-HKSCS in WebKit
EXTRA NAME: WebKit knows extra nonstandard name x-windows-950 for Big5
EXTRA NAME: WebKit knows extra nonstandard name windows-950 for Big5
EXTRA NAME: WebKit knows extra nonstandard name x-big5 for Big5
EXTRA NAME: WebKit knows extra nonstandard name ms950 for Big5
EXTRA NAME: WebKit knows extra nonstandard name windows-950-2000 for Big5
EXTRA ENCODING: WebKit knows nonstandard encoding Big5-HKSCS with names ['big5-hkscs', 'big5hk', 'hkscs-big5', 'ibm-1375', 'ibm-1375_p100-2008']
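A report like the one above can be produced by comparing two label tables. This is a minimal sketch, not the script that was actually used: `STANDARD` holds the Big5 labels from the Encoding standard, `WEBKIT` is a hypothetical reconstruction from the lines above (the labels shared with the standard are an assumption), and `compare` is a made-up name.

```python
# Encoding name -> labels, per the Encoding standard (Big5 only).
STANDARD = {
    "Big5": ["big5", "big5-hkscs", "cn-big5", "csbig5", "x-x-big5"],
}

# Hypothetical reconstruction of WebKit's table from the report above.
WEBKIT = {
    "Big5": ["big5", "cn-big5", "csbig5", "x-x-big5", "x-windows-950",
             "windows-950", "x-big5", "ms950", "windows-950-2000"],
    "Big5-HKSCS": ["big5-hkscs", "big5hk", "hkscs-big5", "ibm-1375",
                   "ibm-1375_p100-2008"],
}


def compare(standard, webkit):
    """Return report lines describing where webkit departs from standard."""
    std_by_label = {label: name
                    for name, labels in standard.items()
                    for label in labels}
    mismatches, extra_names, extra_encodings = [], [], []
    for name, labels in webkit.items():
        if name not in standard:
            extra_encodings.append(
                f"EXTRA ENCODING: WebKit knows nonstandard encoding {name} "
                f"with names {labels}")
        for label in labels:
            std_name = std_by_label.get(label)
            if std_name is None:
                # Only report extra labels on encodings the standard knows;
                # labels of nonstandard encodings are listed once above.
                if name in standard:
                    extra_names.append(
                        f"EXTRA NAME: WebKit knows extra nonstandard name "
                        f"{label} for {name}")
            elif std_name != name:
                mismatches.append(
                    f"MISMATCH: encoding {label} is {std_name} in the "
                    f"standard, but {name} in WebKit")
    return mismatches + extra_names + extra_encodings


if __name__ == "__main__":
    for line in compare(STANDARD, WEBKIT):
        print(line)
```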
Comment 4 Maciej Stachowiak 2017-11-05 20:41:52 PST
Created attachment 326098
Test case for (lack of) WebKit's Big5 quirks, meant to go in LayoutTests/fast/encodings

This test case gives exactly the spec-mandated results in both Firefox and Chrome. Safari has the differences described above.
Comment 5 Maciej Stachowiak 2017-11-05 20:58:37 PST
Here's the Gecko bug from when they did the merge: https://bugzilla.mozilla.org/show_bug.cgi?id=912470

It seems like their Big5 supports HKSCS character sequences. But I'm not sure if that's the same as our Big5-HKSCS or something larger that covers both that and Windows-flavored Big5.
Comment 6 Maciej Stachowiak 2017-11-05 21:56:40 PST
Based on http://w3c-test.org/encoding/big5-encoder.html , it doesn't look like either Big5 or Big5_HKSCS encodings from ICU quite match what the Encoding standard requires, and their failures are not the same either, so merging down to one of the two is bound to cause bugs. We might need a custom Big5 codec.

ICU seems to support several apparent Big5 variants:
ibm-1373_P100-2002
windows-950-2000
ibm-950_P110-1999
ibm-1375_P100-2008
ibm-5471_P100-2006

I'm not sure if any of these are the proper web variant.
Comment 7 Anne van Kesteren 2020-05-06 07:11:42 PDT
*** Bug 159890 has been marked as a duplicate of this bug. ***
Comment 8 Anne van Kesteren 2022-09-27 06:27:24 PDT
According to https://wpt.fyi/results/encoding?label=master&label=experimental&aligned&view=subtest&q=big5 we pass all the tests so this was fixed at some point.

Probably by Alex?
Comment 9 Anne van Kesteren 2022-09-27 06:44:37 PDT
Confirmed: https://github.com/WebKit/WebKit/commit/70a5c3285eca476faa66c6e6055d615c26c78fc4

*** This bug has been marked as a duplicate of bug 216016 ***