WebKit Bugzilla
RESOLVED DUPLICATE of bug 179303
Bug 55441
EUC-JP implementation doesn't fully match CP51932
https://bugs.webkit.org/show_bug.cgi?id=55441
NARUSE, Yui
Reported
2011-02-28 19:26:52 PST
EUC-JP of HTML should be CP51932

= Abstract
HTML5 says EUC-JP should be CP51932, so WebKit's mapping of EUC-JP should be changed.
http://www.w3.org/TR/html5/parsing.html#character-encodings-0
= EUC-JP variants
== CP51932 (Internet Explorer)
CP51932 is a Japanese EUC variant defined by Microsoft. It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special characters
* NEC-selected IBM extended characters
http://www.iana.org/assignments/charset-reg/CP51932
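The composition listed above can be approximated in a few lines of Python. This is a rough sketch, not the Microsoft table itself: it assumes that CP51932's two-byte area is the CP932 (Windows Shift_JIS) repertoire re-expressed in EUC form, so it converts each EUC pair to Shift_JIS bytes and lets Python's built-in `cp932` codec do the lookup. The function names are made up for illustration, and the result has not been checked against cp51932.ucm.

```python
def euc_to_sjis_pair(b1: int, b2: int) -> bytes:
    """Re-express a two-byte EUC code (each byte 0xA1-0xFE) as Shift_JIS bytes."""
    j1, j2 = b1 - 0x80, b2 - 0x80      # 7-bit JIS bytes, each 0x21-0x7E
    s1 = (j1 - 0x21) // 2 + 0x81
    if s1 > 0x9F:
        s1 += 0x40                     # jump over the single-byte katakana range
    if j1 % 2:                         # odd JIS lead byte: lower half of the trail range
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1                    # 0x7F is never a Shift_JIS trail byte
    else:                              # even JIS lead byte: upper half
        s2 = j2 + 0x7E
    return bytes([s1, s2])


def decode_cp51932_sketch(data: bytes) -> str:
    """Decode CP51932-style bytes (hypothetical helper, approximation only)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                   # US-ASCII
            out.append(chr(b))
            i += 1
        elif b == 0x8E:                # SS2: JIS X 0201 Katakana
            if i + 1 >= len(data) or not 0xA1 <= data[i + 1] <= 0xDF:
                raise ValueError(f"bad katakana sequence at offset {i}")
            out.append(bytes([data[i + 1]]).decode("cp932"))
            i += 2
        elif 0xA1 <= b <= 0xFE:        # JIS X 0208 plus the NEC rows
            if i + 1 >= len(data) or not 0xA1 <= data[i + 1] <= 0xFE:
                raise ValueError(f"truncated or bad pair at offset {i}")
            out.append(euc_to_sjis_pair(b, data[i + 1]).decode("cp932"))
            i += 2
        else:                          # includes 0x8F, the JIS X 0212 prefix
            raise ValueError(f"byte 0x{b:02X} is not used in this sketch")
    return "".join(out)


print(decode_cp51932_sketch(b"\xad\xa1"))  # NEC special character, row 13 cell 1
```

Note that the 0x8F prefix is rejected on purpose, since JIS X 0212 is not part of the CP51932 list above.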
== EUC-JP by IANA
CP51932 is different from the "EUC-JP" defined by IANA, which consists of:
* US-ASCII
* JIS X 0208
* JIS X 0201 Katakana
* JIS X 0212
http://www.iana.org/assignments/character-sets
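To see the practical difference, here is a small check using CPython's built-in `euc_jp` codec. This assumes that codec follows the JIS X 0208 / JIS X 0201 / JIS X 0212 repertoire listed above, which is my understanding of it rather than something stated in this report. The helper name `try_decode` is hypothetical.

```python
def try_decode(data: bytes, codec: str):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# 0xAD 0xA1 is row 13, cell 1: an NEC special character (circled digit one).
print(try_decode(b"\xad\xa1", "euc_jp"))      # rejected by the strict codec
# 0xA4 0xA2 is an ordinary JIS X 0208 character (hiragana A).
print(try_decode(b"\xa4\xa2", "euc_jp"))
# 0x8F prefix introduces a three-byte JIS X 0212 character.
print(try_decode(b"\x8f\xb0\xa1", "euc_jp"))
```

A page that IE encodes with NEC special characters would therefore fail, or show replacement characters, in any decoder limited to the IANA repertoire.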
== Firefox
Firefox uses yet another encoding of its own: CP51932 plus JIS X 0212. It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special characters
* NEC-selected IBM extended characters
* JIS X 0212
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
== WebKit
Current WebKit seems to use ICU's ibm-33722_P12A_P12A-2004_U2. It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* IBM extended characters (IBM's mapping)
http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&s=ALL
This mapping has some problems:
* it can't decode NEC special characters even if IE sends them
* it can't decode NEC-selected IBM extended characters even if IE sends them
* it can encode/decode IBM's original mapping of IBM extended characters

== Chrome
Google Chrome extends this mapping to be compatible with IE and Firefox. It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special characters
* NEC-selected IBM extended characters
* JIS X 0212
* IBM extended characters (IBM's mapping)

= Test page
You can test a browser with:
http://nalsh.jp/euc.cgi
= Ideal implementation
== Plan A
Use CP51932, which is compatible with IE.
http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm
== Plan B
Use Firefox's mapping. However, Firefox's current mapping has the problem described in Mozilla Bug 600715.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
So the variant with the JIS X 0212 encoder removed seems suitable.
Alexey Proskuryakov
Comment 1
2011-03-01 10:08:13 PST
Are any Web sites known to be affected by this? It would be good to have some URLs for real life testing.
> Current Webkit seems to use ICU's ibm-33722_P12A_P12A-2004_U2.
> It consists
> * US-ASCII
> * JIS X 0201 Katakana
> * JIS X 0208
> * IBM extended characters (IBM's mapping)
http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&s=ALL
The same ICU converter explorer page says that windows-51932 is an alias name for this encoding. Is it a mistake in ICU that windows-51932 is different from what it should be? Has an ICU bug been filed about that?
NARUSE, Yui
Comment 2
2011-03-01 18:36:54 PST
(In reply to comment #1)
> Are any Web sites known to be affected by this? It would be good to have some URLs for real life testing.
For example,
http://d.hatena.ne.jp/eggmoon/20061004/p1
http://blog.livedoor.jp/blog_ch/archives/50992738.html
http://d.hatena.ne.jp/nsjisc/20100605/1275745170
Business users know that NEC special characters and NEC-selected IBM extended characters are vendor-dependent, and avoid them. But casual users don't know this and post such characters to blogs and other CGM applications. The characters that WebKit is missing are listed below. As you can imagine, casual users use circled numbers and Roman numerals.
http://legacy-encoding.sourceforge.jp/wiki/index.php?NEC%C6%C3%BC%EC%CA%B8%BB%FA%28cp51932%29
http://legacy-encoding.sourceforge.jp/wiki/index.php?NEC%C1%AA%C4%EAIBM%B3%C8%C4%A5%CA%B8%BB%FA%28cp51932%29
> > Current Webkit seems to use ICU's ibm-33722_P12A_P12A-2004_U2.
> > It consists
> > * US-ASCII
> > * JIS X 0201 Katakana
> > * JIS X 0208
> > * IBM extended characters (IBM's mapping)
http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&s=ALL
> The same ICU converter explorer page says that windows-51932 is an alias name for this encoding. Is it a mistake in ICU that windows-51932 is different from what it should be?
Encoding aliases depend on the converter's policy; ICU in particular carries historical baggage from AIX and other IBM products. What I can say is that the mapping differs from the original Microsoft code page 51932 and is not suitable for the Web, because its decoder can't decode some characters and its encoder emits strange characters that nothing other than WebKit can read.
> Has an ICU bug been filed about that?
I added
http://bugs.icu-project.org/trac/ticket/8390
Alexey Proskuryakov
Comment 3
2011-03-01 21:39:51 PST
<rdar://problem/9073710>
NARUSE, Yui
Comment 4
2011-03-02 00:57:08 PST
FYI, searching for those characters turns up thousands of examples.
http://search.hatena.ne.jp/search?word=%AD%A1&site=d.hatena.ne.jp
http://search.hatena.ne.jp/search?word=%AD%B5&site=d.hatena.ne.jp
http://search.hatena.ne.jp/search?word=%FC%E2&site=d.hatena.ne.jp
http://search.hatena.ne.jp/search?word=%F9%F5&site=d.hatena.ne.jp
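For readers who want to see which characters those search URLs refer to, the percent-encoded `word` values can be unpacked into EUC row/cell numbers. The sketch below assumes the usual CP932-derived layout (row 13 for NEC special characters, rows 89 to 92 for NEC-selected IBM extended characters); the `classify` helper is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def classify(word: str):
    """Map a percent-encoded two-byte EUC code to (row, cell, category)."""
    b1, b2 = unquote_to_bytes(word)
    row, cell = b1 - 0xA0, b2 - 0xA0   # EUC bytes 0xA1-0xFE map to rows/cells 1-94
    if row == 13:
        kind = "NEC special character"
    elif 89 <= row <= 92:
        kind = "NEC-selected IBM extended character"
    else:
        kind = "other"
    return row, cell, kind


for word in ("%AD%A1", "%AD%B5", "%FC%E2", "%F9%F5"):
    print(word, classify(word))
```

The first two search words fall in row 13 and the last two in rows 92 and 89, so all four are in the ranges that the current WebKit mapping cannot decode.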
Jungshik Shin
Comment 5
2011-04-08 13:31:48 PDT
Chromium uses a custom EUC-JP encoding table (that is very similar to what Firefox used to have before removing JIS X 0212) which is different from the stock EUC-JP table. I planned to add it to the ICU, but haven't managed to. Anyway, I should have paid more attention to the HTML5 decision about EUC-JP => CP51932, which I don't like very much.
Masatoshi Kimura
Comment 6
2011-04-08 13:41:24 PDT
I'm surprised you dislike the decision about the EUC-JP replacement encoding. We removed the JIS X 0212 encoder from EUC-JP for a reason similar to why you are planning to remove the KS X 1001:1998 Annex 3 encoder from the EUC-KR encoder in Mozilla bug 562091.
https://bugzilla.mozilla.org/show_bug.cgi?id=562091
Masatoshi Kimura
Comment 7
2011-04-08 13:59:56 PDT
Furthermore, your current EUC-JP converter (IBM33722) is incompatible with all of IANA EUC-JP, eucJP-ms, and CP51932. While IBM33722 supports IBM extensions (as the name implies), its mapping is completely different from the other variants. Your converter is not interoperable with any other browser. We are suffering from this incompatibility. It's far better to use the CP51932 mappings than the status quo.
Alexey Proskuryakov
Comment 8
2011-04-08 14:09:25 PDT
As far as mainline WebKit is concerned, we'll most likely just use whatever ICU provides, unless the impact is demonstrated to be so huge that a custom table becomes justified.
Masatoshi Kimura
Comment 9
2011-04-08 14:20:27 PDT
I'm fine waiting for the ICU change.
NARUSE, Yui
Comment 10
2011-04-12 01:45:30 PDT
Just FYI, ICU added CP51932.
http://bugs.icu-project.org/trac/changeset/29664
The Chromium issue is at
http://code.google.com/p/chromium/issues/detail?id=78847
Anne van Kesteren
Comment 11
2022-09-27 06:28:48 PDT
This got fixed as part of bug 179303 and related efforts, so marking as a duplicate.
*** This bug has been marked as a duplicate of bug 179303 ***