Bug 55441
Summary: | EUC-JP implementation doesn't fully match CP51932 | ||
---|---|---|---|
Product: | WebKit | Reporter: | NARUSE, Yui <naruse> |
Component: | Text | Assignee: | Nobody <webkit-unassigned> |
Status: | RESOLVED DUPLICATE | ||
Severity: | Normal | CC: | annevk, ap, darin, jshin, VYV03354 |
Priority: | P2 | Keywords: | InRadar |
Version: | 528+ (Nightly build) | ||
Hardware: | All | ||
OS: | All |
NARUSE, Yui
EUC-JP of HTML should be CP51932
= Abstract
HTML5 says EUC-JP should be CP51932.
So WebKit's mapping of EUC-JP should be changed.
http://www.w3.org/TR/html5/parsing.html#character-encodings-0
= EUC-JP variants
== CP51932 (Internet Explorer)
CP51932 is a Japanese EUC variant defined by Microsoft.
It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special character
* NEC-selected IBM extended character
http://www.iana.org/assignments/charset-reg/CP51932
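The component list above can be read as a partition of the EUC byte space. The following is a minimal sketch, not taken from any browser's source, that names the component a single encoded character falls into. The function name `classify_cp51932` is hypothetical. The row numbers used for the vendor extensions (row 13 for NEC special characters, rows 89 to 92 for NEC-selected IBM extended characters) are my understanding of the CP51932 layout and should be checked against the mapping table. The sketch does not check whether a given cell is actually assigned.

```python
def classify_cp51932(seq: bytes) -> str:
    """Name the CP51932 component that one encoded character belongs to.

    Assumption: NEC special characters occupy row 13 and NEC-selected IBM
    extended characters occupy rows 89-92 of the 94x94 two-byte grid.
    Unassigned cells are not detected; only the region is reported.
    """
    # Single byte in the 7-bit range.
    if len(seq) == 1 and seq[0] <= 0x7F:
        return "US-ASCII"
    # Single shift 2 (0x8E) introduces half-width katakana.
    if len(seq) == 2 and seq[0] == 0x8E and 0xA1 <= seq[1] <= 0xDF:
        return "JIS X 0201 Katakana"
    # Two bytes, both in 0xA1..0xFE, address the 94x94 grid.
    if len(seq) == 2 and all(0xA1 <= b <= 0xFE for b in seq):
        row = seq[0] - 0xA0  # 1-based row number
        if row == 13:
            return "NEC special character"
        if 89 <= row <= 92:
            return "NEC-selected IBM extended character"
        return "JIS X 0208"
    # Anything else, including 0x8F-prefixed JIS X 0212, is outside CP51932.
    return "not part of CP51932"
```

For instance, `classify_cp51932(b"\xad\xa1")` reports an NEC special character, while a three-byte JIS X 0212 sequence such as `b"\x8f\xb0\xa1"` is reported as outside CP51932.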
== EUC-JP by IANA
CP51932 differs from the "EUC-JP" defined by IANA, which consists of:
* US-ASCII
* JIS X 0208
* JIS X 0201 Katakana
* JIS X 0212
http://www.iana.org/assignments/character-sets
== Firefox
Firefox uses yet another encoding of its own: CP51932 plus JIS X 0212. It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special character
* NEC-selected IBM extended character
* JIS X 0212
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
== WebKit
Current WebKit seems to use ICU's ibm-33722_P12A_P12A-2004_U2.
It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* IBM extended characters (IBM's mapping)
http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&s=ALL
This mapping has some problems:
* it can't decode NEC special characters even if IE sends them
* it can't decode NEC-selected IBM extended characters even if IE sends them
* it encodes and decodes IBM extended characters with IBM's original mapping, which is not part of CP51932
== Chrome
Google Chrome extends this mapping to be compatible with IE and Firefox.
It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special character
* NEC-selected IBM extended character
* JIS X 0212
* IBM extended characters (IBM's mapping)
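The component lists for the five variants can be compared mechanically. The sketch below records each list exactly as written in this report and computes what a variant lacks or adds relative to CP51932. The names `VARIANTS`, `missing_from`, and `extra_in` are illustrative only, and the sets describe repertoire components, not individual code point mappings.

```python
# Shorthand for the component names used in this report.
ASCII = "US-ASCII"
KANA = "JIS X 0201 Katakana"
X0208 = "JIS X 0208"
NEC = "NEC special characters"
NEC_IBM = "NEC-selected IBM extended characters"
X0212 = "JIS X 0212"
IBM = "IBM extended characters (IBM's mapping)"

# Components of each variant, as listed in the sections above.
VARIANTS = {
    "CP51932": {ASCII, KANA, X0208, NEC, NEC_IBM},
    "IANA EUC-JP": {ASCII, KANA, X0208, X0212},
    "Firefox": {ASCII, KANA, X0208, NEC, NEC_IBM, X0212},
    "WebKit": {ASCII, KANA, X0208, IBM},
    "Chrome": {ASCII, KANA, X0208, NEC, NEC_IBM, X0212, IBM},
}


def missing_from(variant: str, reference: str = "CP51932") -> set:
    """Components the reference has that the variant lacks."""
    return VARIANTS[reference] - VARIANTS[variant]


def extra_in(variant: str, reference: str = "CP51932") -> set:
    """Components the variant has beyond the reference."""
    return VARIANTS[variant] - VARIANTS[reference]
```

With these definitions, `missing_from("WebKit")` yields the two NEC-related components, and `extra_in("Firefox")` yields only JIS X 0212.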
= Test page
You can test a browser at http://nalsh.jp/euc.cgi
= Ideal implementation
== Plan A
Use CP51932, which is compatible with IE.
http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm
== Plan B
Use Firefox's mapping.
However, Firefox's current mapping has the problem described in Mozilla bug 600715.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
So Firefox's mapping with the JIS X 0212 encoder removed seems suitable.
Alexey Proskuryakov
Are any Web sites known to be affected by this? It would be good to have some URLs for real life testing.
> Current Webkit seems to use ICU's ibm-33722_P12A_P12A-2004_U2.
> It consists
> * US-ASCII
> * JIS X 0201 Katakana
> * JIS X 0208
> * IBM extended characters (IBM's mapping)
> http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&s=ALL
The same ICU converter explorer page says that windows-51932 is an alias name for this encoding. Is it a mistake in ICU that windows-51932 is different from what it should be? Has an ICU bug been filed about that?
NARUSE, Yui
(In reply to comment #1)
> Are any Web sites known to be affected by this? It would be good to have some URLs for real life testing.
For example,
http://d.hatena.ne.jp/eggmoon/20061004/p1
http://blog.livedoor.jp/blog_ch/archives/50992738.html
http://d.hatena.ne.jp/nsjisc/20100605/1275745170
Professionals know that NEC special characters and NEC-selected IBM extended characters
are vendor-dependent and avoid them. Casual users do not know this and post such characters to blogs
or other CGM applications.
The characters that WebKit cannot handle are listed on the following pages.
You can imagine casual users typing circled characters and Roman numerals.
http://legacy-encoding.sourceforge.jp/wiki/index.php?NEC%C6%C3%BC%EC%CA%B8%BB%FA%28cp51932%29
http://legacy-encoding.sourceforge.jp/wiki/index.php?NEC%C1%AA%C4%EAIBM%B3%C8%C4%A5%CA%B8%BB%FA%28cp51932%29
> > Current Webkit seems to use ICU's ibm-33722_P12A_P12A-2004_U2.
> > It consists
> > * US-ASCII
> > * JIS X 0201 Katakana
> > * JIS X 0208
> > * IBM extended characters (IBM's mapping)
> > http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&s=ALL
>
> The same ICU converter explorer page says that windows-51932 is an alias name for this encoding. Is it a mistake in ICU that windows-51932 is different from what it should be?
Encoding aliases depend on each converter's policy; ICU in particular carries aliases for historical reasons inherited from AIX and other IBM products.
What I can say is that the mapping differs from the original Microsoft code page 51932 and is not suitable for the Web,
because its decoder cannot decode some characters and its encoder emits unusual characters that nothing other than WebKit understands.
> Has an ICU bug been filed about that?
I added http://bugs.icu-project.org/trac/ticket/8390
Alexey Proskuryakov
<rdar://problem/9073710>
NARUSE, Yui
FYI, searching for those characters turns up thousands of examples.
http://search.hatena.ne.jp/search?word=%AD%A1&site=d.hatena.ne.jp
http://search.hatena.ne.jp/search?word=%AD%B5&site=d.hatena.ne.jp
http://search.hatena.ne.jp/search?word=%FC%E2&site=d.hatena.ne.jp
http://search.hatena.ne.jp/search?word=%F9%F5&site=d.hatena.ne.jp
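The `word=` parameters in these URLs are percent-encoded CP51932 byte pairs. As a rough sketch of how to see which characters they denote, the code below unescapes a parameter and converts the pair to its Shift_JIS form with the usual row/cell arithmetic, then decodes it with Python's `cp932` codec. This rests on the assumption that the two-byte area of CP51932 mirrors rows 1 to 94 of Microsoft's code page 932; it is not how any browser implements the decoder. The helper names `euc_pair_to_sjis` and `decode_search_word` are hypothetical.

```python
from urllib.parse import unquote_to_bytes


def euc_pair_to_sjis(pair: bytes) -> bytes:
    """Convert one two-byte EUC sequence (0xA1..0xFE each) to Shift_JIS."""
    row, cell = pair[0] - 0xA0, pair[1] - 0xA0  # 1-based row and cell
    # Two rows share one Shift_JIS lead byte; rows above 62 skip to 0xE0.
    lead = (row + 1) // 2 + (0x80 if row <= 62 else 0xC0)
    if row % 2:
        # Odd row: trail bytes 0x40..0x9E, skipping 0x7F.
        trail = cell + (0x3F if cell <= 63 else 0x40)
    else:
        # Even row: trail bytes 0x9F..0xFC.
        trail = cell + 0x9E
    return bytes([lead, trail])


def decode_search_word(escaped: str) -> str:
    """Decode a percent-encoded two-byte word under the stated assumption."""
    raw = unquote_to_bytes(escaped)
    if len(raw) != 2 or not all(0xA1 <= b <= 0xFE for b in raw):
        raise ValueError("expected one two-byte EUC sequence")
    return euc_pair_to_sjis(raw).decode("cp932")
```

Under this assumption, `decode_search_word("%AD%A1")` gives a circled digit one and `decode_search_word("%AD%B5")` gives a Roman numeral one, which matches the circled characters and Roman numerals mentioned earlier in this thread.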
Jungshik Shin
Chromium uses a custom EUC-JP encoding table (very similar to what Firefox had before removing JIS X 0212), which differs from the stock EUC-JP table. I planned to add it to ICU, but haven't managed to.
Anyway, I should have paid more attention to the HTML5 decision about EUC-JP => CP51932, which I don't like very much.
Masatoshi Kimura
I'm surprised you dislike the decision about EUC-JP replacement encoding.
We've removed the JIS X 0212 encoder from EUC-JP for a reason similar to why you are planning to remove the KS X 1001:1998 Annex 3 encoder from the EUC-KR encoder in Mozilla bug 562091.
https://bugzilla.mozilla.org/show_bug.cgi?id=562091
Masatoshi Kimura
Furthermore, your current EUC-JP converter (IBM33722) is incompatible with all of IANA EUC-JP, eucJP-ms, and CP51932. While IBM33722 supports IBM extensions (as the name implies), its mapping is completely different from the other variants. Your converter is not interoperable with any other browser. We are suffering from this incompatibility. It's far better to use CP51932 mappings than the status quo.
Alexey Proskuryakov
As far as mainline WebKit is concerned, we'll most likely just use whatever ICU provides, unless the impact is demonstrated to be so huge that a custom table becomes justified.
Masatoshi Kimura
I'm fine waiting for the ICU change.
NARUSE, Yui
Just FYI, ICU added CP51932.
http://bugs.icu-project.org/trac/changeset/29664
Chromium's issue is at http://code.google.com/p/chromium/issues/detail?id=78847
Anne van Kesteren
This got fixed as part of bug 179303 and related efforts so marking as a duplicate.
*** This bug has been marked as a duplicate of bug 179303 ***