RESOLVED DUPLICATE of bug 179303 Bug 55441
EUC-JP implementation doesn't fully match CP51932
https://bugs.webkit.org/show_bug.cgi?id=55441
NARUSE, Yui
Reported 2011-02-28 19:26:52 PST
EUC-JP of HTML should be CP51932

= Abstract
HTML5 says EUC-JP should be CP51932, so WebKit's mapping of EUC-JP should be changed.
http://www.w3.org/TR/html5/parsing.html#character-encodings-0

= EUC-JP variants

== CP51932 (Internet Explorer)
CP51932 is a Japanese EUC variant defined by Microsoft. It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special characters
* NEC-selected IBM extended characters
http://www.iana.org/assignments/charset-reg/CP51932

== EUC-JP by IANA
CP51932 is different from "EUC-JP" as defined by IANA, which consists of:
* US-ASCII
* JIS X 0208
* JIS X 0201 Katakana
* JIS X 0212
http://www.iana.org/assignments/character-sets

== Firefox
Firefox uses yet another encoding of its own, CP51932 + JIS X 0212:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special characters
* NEC-selected IBM extended characters
* JIS X 0212
https://bugzilla.mozilla.org/show_bug.cgi?id=600715

== WebKit
Current WebKit seems to use ICU's ibm-33722_P12A_P12A-2004_U2. It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* IBM extended characters (IBM's mapping)
http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&s=ALL

This mapping has some problems:
* it can't decode NEC special characters even when IE sends them
* it can't decode NEC-selected IBM extended characters even when IE sends them
* it encodes and decodes IBM extended characters using IBM's original mapping

== Chrome
Google Chrome extends this mapping to be compatible with IE and Firefox. It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special characters
* NEC-selected IBM extended characters
* JIS X 0212
* IBM extended characters (IBM's mapping)

= Test page
You can test a browser with http://nalsh.jp/euc.cgi

= Ideal implementation

== Plan A
Use CP51932 and be compatible with IE.
http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm

== Plan B
Use Firefox's encoding. But the current Firefox encoding has the problem described in Bug 600715.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
So a version of it with the JIS X 0212 encoder removed seems suitable.
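The component lists above can be made concrete by looking at the lead byte of an EUC sequence. The following is only a rough sketch: the function name `classify_euc_sequence` is hypothetical, and the row numbers it uses (row 13 for NEC special characters, rows 89 to 92 for NEC-selected IBM extended characters, rows 1 to 8 and 16 to 84 for JIS X 0208) are not stated in this report; they are assumed from the usual description of the CP932/CP51932 layout.

```python
def classify_euc_sequence(data: bytes) -> str:
    """Name the character set of the first character in an EUC byte sequence.

    Hypothetical helper for illustration only; the row boundaries are
    assumptions taken from the commonly described CP51932 layout.
    """
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        return "US-ASCII"
    if lead == 0x8E:
        # Single shift 2: one byte of JIS X 0201 Katakana follows.
        return "JIS X 0201 Katakana"
    if lead == 0x8F:
        # Single shift 3: two bytes of JIS X 0212 follow (not in CP51932).
        return "JIS X 0212"
    if 0xA1 <= lead <= 0xFE:
        row = lead - 0xA0  # 1-based row (ku) number
        if row == 13:
            return "NEC special character"
        if 89 <= row <= 92:
            return "NEC-selected IBM extended character"
        if 1 <= row <= 8 or 16 <= row <= 84:
            return "JIS X 0208"
        return "unassigned row"
    return "invalid lead byte"


for sample in (b"A", b"\x8e\xb1", b"\xb0\xa1", b"\xad\xa1", b"\xf9\xa1"):
    print(sample.hex(), classify_euc_sequence(sample))
```

A decoder that only covers US-ASCII, JIS X 0201 Katakana and JIS X 0208 would have nothing to map the last two samples to, which is the gap the report describes.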
Alexey Proskuryakov
Comment 1 2011-03-01 10:08:13 PST
Are any Web sites known to be affected by this? It would be good to have some URLs for real life testing.

> Current Webkit seems to use ICU's ibm-33722_P12A_P12A-2004_U2.
> It consists
> * US-ASCII
> * JIS X 0201 Katakana
> * JIS X 0208
> * IBM extended characters (IBM's mapping)
> http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&s=ALL

The same ICU converter explorer page says that windows-51932 is an alias name for this encoding. Is it a mistake in ICU that windows-51932 is different from what it should be? Has an ICU bug been filed about that?
NARUSE, Yui
Comment 2 2011-03-01 18:36:54 PST
(In reply to comment #1)
> Are any Web sites known to be affected by this? It would be good to have some URLs for real life testing.

For example:
http://d.hatena.ne.jp/eggmoon/20061004/p1
http://blog.livedoor.jp/blog_ch/archives/50992738.html
http://d.hatena.ne.jp/nsjisc/20100605/1275745170

Business users know that NEC special characters and NEC-selected IBM extended characters are vendor-dependent, and avoid them. But casual users don't know this, and post such characters to blogs and other CGM applications.

The characters that WebKit is missing are listed on the following pages. As you can imagine, casual users do use circled characters and Roman numerals.
http://legacy-encoding.sourceforge.jp/wiki/index.php?NEC%C6%C3%BC%EC%CA%B8%BB%FA%28cp51932%29
http://legacy-encoding.sourceforge.jp/wiki/index.php?NEC%C1%AA%C4%EAIBM%B3%C8%C4%A5%CA%B8%BB%FA%28cp51932%29

> > Current Webkit seems to use ICU's ibm-33722_P12A_P12A-2004_U2.
> > It consists
> > * US-ASCII
> > * JIS X 0201 Katakana
> > * JIS X 0208
> > * IBM extended characters (IBM's mapping)
> > http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&s=ALL
>
> The same ICU converter explorer page says that windows-51932 is an alias name for this encoding. Is it a mistake in ICU that windows-51932 is different from what it should be?

Encoding aliases depend on each converter's policy; ICU in particular carries aliases for historical reasons from AIX and other IBM products.
What I can say is that the mapping is different from the original Microsoft code page 51932 and is not suitable for the Web, because its decoder can't read some characters and its encoder sends strange byte sequences that nothing other than WebKit understands.

> Has an ICU bug been filed about that?

I filed http://bugs.icu-project.org/trac/ticket/8390
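To illustrate the circled characters and Roman numerals mentioned above, here is a minimal Python sketch. It assumes that a two-byte CP51932 sequence uses the same row and cell as the corresponding CP932 (Windows Shift_JIS) character, so it converts the EUC pair to Shift_JIS form and decodes that with Python's built-in `cp932` codec. This is an approximation for illustration, not a full CP51932 decoder, and the helper names `euc_to_sjis` and `decode_cp51932_pair` are hypothetical.

```python
def euc_to_sjis(b1: int, b2: int) -> bytes:
    """Convert one two-byte EUC pair to the Shift_JIS pair for the same row/cell."""
    j1, j2 = b1 - 0x80, b2 - 0x80  # 7-bit JIS row and cell bytes
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("not a two-byte EUC sequence")
    s1 = ((j1 - 0x21) >> 1) + 0x81
    if s1 > 0x9F:
        s1 += 0x40  # skip the single-byte katakana range
    if j1 & 1:
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1  # 0x7F is never used as a trail byte
    else:
        s2 = j2 + 0x7E
    return bytes([s1, s2])


def decode_cp51932_pair(b1: int, b2: int) -> str:
    """Approximate CP51932 decoding of one pair by going through cp932."""
    return euc_to_sjis(b1, b2).decode("cp932")


# Row 13 holds the NEC special characters: a circled digit and a Roman numeral.
print(decode_cp51932_pair(0xAD, 0xA1))
print(decode_cp51932_pair(0xAD, 0xB5))
```

The sketch ignores the single-shift bytes 0x8E and 0x8F and does not reject rows that CP51932 leaves unassigned, so it should only be read as a way to see which characters sit behind these byte pairs.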
Alexey Proskuryakov
Comment 3 2011-03-01 21:39:51 PST
Jungshik Shin
Comment 5 2011-04-08 13:31:48 PDT
Chromium uses a custom EUC-JP encoding table (that is very similar to what Firefox used to have before removing JIS X 0212) which is different from the stock EUC-JP table. I planned to add it to the ICU, but haven't managed to. Anyway, I should have paid more attention to the HTML5 decision about EUC-JP => CP51932, which I don't like very much.
Masatoshi Kimura
Comment 6 2011-04-08 13:41:24 PDT
I'm surprised you dislike the decision about the EUC-JP replacement encoding. We removed the JIS X 0212 encoder from EUC-JP for much the same reason that you are planning to remove the KS X 1001:1998 Annex 3 encoder from the EUC-KR encoder in Mozilla bug 562091. https://bugzilla.mozilla.org/show_bug.cgi?id=562091
Masatoshi Kimura
Comment 7 2011-04-08 13:59:56 PDT
Furthermore, your current EUC-JP converter (IBM33722) is incompatible with any of IANA EUC-JP, eucJP-ms, and CP51932. While IBM33722 supports IBM extensions (as the name implies), the mapping is completely different from other variants. Your converter is not interoperable with any other browsers. We are suffering from this incompatibility. It's far better to use CP51932 mappings than the status quo.
Alexey Proskuryakov
Comment 8 2011-04-08 14:09:25 PDT
As far as mainline WebKit is concerned, we'll most likely just use whatever ICU provides, unless the impact is demonstrated to be so huge that a custom table becomes justified.
Masatoshi Kimura
Comment 9 2011-04-08 14:20:27 PDT
I'm fine waiting for the ICU change.
NARUSE, Yui
Comment 10 2011-04-12 01:45:30 PDT
Anne van Kesteren
Comment 11 2022-09-27 06:28:48 PDT
This got fixed as part of bug 179303 and related efforts so marking as a duplicate. *** This bug has been marked as a duplicate of bug 179303 ***