<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://bugs.webkit.org/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4.1"
          urlbase="https://bugs.webkit.org/"
          
          maintainer="admin@webkit.org"
>

    <bug>
          <bug_id>55441</bug_id>
          
          <creation_ts>2011-02-28 19:26:52 -0800</creation_ts>
          <short_desc>EUC-JP implementation doesn&apos;t fully match CP51932</short_desc>
          <delta_ts>2022-09-27 06:28:48 -0700</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WebKit</product>
          <component>Text</component>
          <version>528+ (Nightly build)</version>
          <rep_platform>All</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>DUPLICATE</resolution>
          <dup_id>179303</dup_id>
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords>InRadar</keywords>
          <priority>P2</priority>
          <bug_severity>Normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>0</everconfirmed>
          <reporter name="NARUSE, Yui">naruse</reporter>
          <assigned_to name="Nobody">webkit-unassigned</assigned_to>
          <cc>annevk</cc>
    
    <cc>ap</cc>
    
    <cc>darin</cc>
    
    <cc>jshin</cc>
    
    <cc>VYV03354</cc>
          

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>359602</commentid>
    <comment_count>0</comment_count>
    <who name="NARUSE, Yui">naruse</who>
    <bug_when>2011-02-28 19:26:52 -0800</bug_when>
    <thetext>EUC-JP of HTML should be CP51932

= Abstract

HTML5 says EUC-JP should be CP51932.
So WebKit&apos;s mapping of EUC-JP should be changed.
http://www.w3.org/TR/html5/parsing.html#character-encodings-0

= EUC-JP variants

== CP51932 (Internet Explorer)

CP51932 is a Japanese EUC variant defined by Microsoft.
It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special character
* NEC-selected IBM extended character
http://www.iana.org/assignments/charset-reg/CP51932

== EUC-JP by IANA

CP51932 is different from the &quot;EUC-JP&quot; defined by IANA, which consists of:
* US-ASCII
* JIS X 0208
* JIS X 0201 Katakana
* JIS X 0212
http://www.iana.org/assignments/character-sets
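
The variants in this report differ mainly in which lead bytes they accept. A minimal Python sketch of that split follows. The function name is hypothetical, and the row positions (row 13 for NEC special characters, rows 89 to 92 for NEC-selected IBM extended characters) are an assumption taken from common descriptions of CP51932, not from the registrations linked here.

```python
def classify_euc_jp_lead(lead):
    """Name the character set that an EUC-JP lead byte introduces.

    This is a union view across the variants discussed in this report;
    no single variant accepts every branch below.
    """
    if lead in range(0x00, 0x80):
        return "US-ASCII"
    if lead == 0x8E:
        # Single shift 2: one JIS X 0201 Katakana byte follows.
        return "JIS X 0201 Katakana"
    if lead == 0x8F:
        # Single shift 3: two JIS X 0212 bytes follow (not in CP51932).
        return "JIS X 0212"
    if lead == 0xAD:
        # Assumed position: row 13, unassigned in JIS X 0208 itself.
        return "NEC special character"
    if lead in range(0xF9, 0xFD):
        # Assumed position: rows 89 to 92.
        return "NEC-selected IBM extended character"
    if lead in range(0xA1, 0xFF):
        return "JIS X 0208"
    return "invalid"
```

Under these assumptions a decoder that rejects lead byte AD, or F9 through FC, loses exactly the two NEC groups listed for CP51932.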

== Firefox

Firefox uses yet another original encoding: CP51932+JIS X 0212
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special character
* NEC-selected IBM extended character
* JIS X 0212
https://bugzilla.mozilla.org/show_bug.cgi?id=600715

== WebKit

Current WebKit seems to use ICU&apos;s ibm-33722_P12A_P12A-2004_U2.
It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* IBM extended characters (IBM&apos;s mapping)
http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&amp;s=ALL

This mapping has some problems:
* can&apos;t decode NEC special characters even when IE sends them
* can&apos;t decode NEC-selected IBM extended characters even when IE sends them
* encodes and decodes IBM extended characters with IBM&apos;s own mapping, which other browsers don&apos;t share
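
The two NEC groups are also carried by CP932 (Shift_JIS as used on Windows), so one way to approximate a CP51932 decoder for two-byte sequences is to convert the EUC byte pair to its Shift_JIS form and decode that. A minimal sketch in Python, assuming the standard library cp932 codec covers these rows. The function name is hypothetical, and this is not how WebKit or ICU implement the conversion.

```python
def decode_cp51932_pair(e1, e2):
    """Decode one two-byte EUC sequence by way of the cp932 codec."""
    if e1 not in range(0xA1, 0xFF) or e2 not in range(0xA1, 0xFF):
        raise ValueError("not a two-byte EUC sequence")
    row = e1 - 0xA1   # 0-based row (row 13 is 12)
    cell = e2 - 0xA1  # 0-based cell within the row
    # Two rows share one Shift_JIS lead byte; leads skip 0xA0 to 0xDF.
    lead = row // 2 + (0x81 if row in range(62) else 0xC1)
    if row % 2 == 0:
        # First row of the pair: trail bytes 0x40 to 0x9E, skipping 0x7F.
        trail = cell + 0x40 + (1 if cell in range(0x3F, 94) else 0)
    else:
        # Second row of the pair: trail bytes 0x9F to 0xFC.
        trail = cell + 0x9F
    return bytes([lead, trail]).decode("cp932")
```

For example, the NEC special character bytes AD A1 map to Shift_JIS 87 40 and decode to U+2460 (circled digit one).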

== Chrome

Google Chrome extends this mapping to be compatible with IE and Firefox.
It consists of:
* US-ASCII
* JIS X 0201 Katakana
* JIS X 0208
* NEC special character
* NEC-selected IBM extended character
* JIS X 0212
* IBM extended characters (IBM&apos;s mapping)

= test page

You can test a browser at http://nalsh.jp/euc.cgi

= Ideal implementation

== Plan A

Use CP51932, which is compatible with IE.
http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm

== Plan B

Use Firefox&apos;s mapping.
But the current Firefox mapping has a problem described in Mozilla bug 600715.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
So a variant with the JIS X 0212 encoder removed seems suitable.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>359978</commentid>
    <comment_count>1</comment_count>
    <who name="Alexey Proskuryakov">ap</who>
    <bug_when>2011-03-01 10:08:13 -0800</bug_when>
    <thetext>Are any Web sites known to be affected by this? It would be good to have some URLs for real life testing.

&gt; Current Webkit seems to use ICU&apos;s ibm-33722_P12A_P12A-2004_U2.
&gt; It consists 
&gt; * US-ASCII
&gt; * JIS X 0201 Katakana
&gt; * JIS X 0208
&gt; * IBM extended characters (IBM&apos;s mapping)
&gt; http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&amp;s=ALL

The same ICU converter explorer page says that windows-51932 is an alias name for this encoding. Is it a mistake in ICU that windows-51932 is different from what it should be? Has an ICU bug been filed about that?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>360491</commentid>
    <comment_count>2</comment_count>
    <who name="NARUSE, Yui">naruse</who>
    <bug_when>2011-03-01 18:36:54 -0800</bug_when>
    <thetext>(In reply to comment #1)
&gt; Are any Web sites known to be affected by this? It would be good to have some URLs for real life testing.

For example,
http://d.hatena.ne.jp/eggmoon/20061004/p1
http://blog.livedoor.jp/blog_ch/archives/50992738.html
http://d.hatena.ne.jp/nsjisc/20100605/1275745170

Business users know that NEC special characters and NEC-selected IBM extended characters
are vendor-dependent and avoid them. But casual users don&apos;t know this and post such characters to blogs
or other CGM applications.

The characters that WebKit is missing are listed on the following pages.
You can imagine casual users using circled characters and Roman numerals:
http://legacy-encoding.sourceforge.jp/wiki/index.php?NEC%C6%C3%BC%EC%CA%B8%BB%FA%28cp51932%29
http://legacy-encoding.sourceforge.jp/wiki/index.php?NEC%C1%AA%C4%EAIBM%B3%C8%C4%A5%CA%B8%BB%FA%28cp51932%29

&gt; &gt; Current Webkit seems to use ICU&apos;s ibm-33722_P12A_P12A-2004_U2.
&gt; &gt; It consists 
&gt; &gt; * US-ASCII
&gt; &gt; * JIS X 0201 Katakana
&gt; &gt; * JIS X 0208
&gt; &gt; * IBM extended characters (IBM&apos;s mapping)
&gt; &gt; http://demo.icu-project.org/icu-bin/convexp?conv=ibm-33722_P12A_P12A-2004_U2&amp;s=ALL
&gt; 
&gt; The same ICU converter explorer page says that windows-51932 is an alias name for this encoding. Is it a mistake in ICU that windows-51932 is different from what it should be?

Encoding aliases depend on each converter&apos;s policy; ICU in particular carries historical baggage from AIX and other IBM products.
What I can say is that the mapping differs from the original Microsoft code page 51932 and is not suitable for the Web,
because its decoder can&apos;t read some characters and its encoder emits characters that nothing other than WebKit can interpret.

&gt; Has an ICU bug been filed about that?

I filed http://bugs.icu-project.org/trac/ticket/8390</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>360555</commentid>
    <comment_count>3</comment_count>
    <who name="Alexey Proskuryakov">ap</who>
    <bug_when>2011-03-01 21:39:51 -0800</bug_when>
    <thetext>&lt;rdar://problem/9073710&gt;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>360651</commentid>
    <comment_count>4</comment_count>
    <who name="NARUSE, Yui">naruse</who>
    <bug_when>2011-03-02 00:57:08 -0800</bug_when>
    <thetext>FYI, on searching those characters you can find thousands of examples.
http://search.hatena.ne.jp/search?word=%AD%A1&amp;site=d.hatena.ne.jp
http://search.hatena.ne.jp/search?word=%AD%B5&amp;site=d.hatena.ne.jp
http://search.hatena.ne.jp/search?word=%FC%E2&amp;site=d.hatena.ne.jp
http://search.hatena.ne.jp/search?word=%F9%F5&amp;site=d.hatena.ne.jp</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>382511</commentid>
    <comment_count>5</comment_count>
    <who name="Jungshik Shin">jshin</who>
    <bug_when>2011-04-08 13:31:48 -0700</bug_when>
    <thetext>Chromium uses a custom EUC-JP encoding table (that is very similar to what Firefox used to have before removing JIS X 0212) which is different from the stock EUC-JP table.  I planned to add it to the ICU, but haven&apos;t managed to. 

Anyway, I should have paid more attention to the HTML5 decision about EUC-JP =&gt; CP51932, which I don&apos;t like very much.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>382521</commentid>
    <comment_count>6</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2011-04-08 13:41:24 -0700</bug_when>
    <thetext>I&apos;m surprised you dislike the decision about EUC-JP replacement encoding.
We&apos;ve removed the JIS X 0212 encoder from EUC-JP for much the same reason that you are planning to remove the KS X 1001:1998 Annex 3 encoder from the EUC-KR encoder in Mozilla bug 562091.
https://bugzilla.mozilla.org/show_bug.cgi?id=562091</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>382545</commentid>
    <comment_count>7</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2011-04-08 13:59:56 -0700</bug_when>
    <thetext>Furthermore, your current EUC-JP converter (IBM33722) is incompatible with any of IANA EUC-JP, eucJP-ms, and CP51932. While IBM33722 supports IBM extensions (as the name implies), the mapping is completely different from other variants. Your converter is not interoperable with any other browsers. We are suffering from this incompatibility. It&apos;s far better to use CP51932 mappings than the status quo.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>382555</commentid>
    <comment_count>8</comment_count>
    <who name="Alexey Proskuryakov">ap</who>
    <bug_when>2011-04-08 14:09:25 -0700</bug_when>
    <thetext>As far as mainline WebKit is concerned, we&apos;ll most likely just use whatever ICU provides, unless the impact is demonstrated to be so huge that a custom table becomes justified.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>382560</commentid>
    <comment_count>9</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2011-04-08 14:20:27 -0700</bug_when>
    <thetext>I&apos;m fine waiting for the ICU change.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>383884</commentid>
    <comment_count>10</comment_count>
    <who name="NARUSE, Yui">naruse</who>
    <bug_when>2011-04-12 01:45:30 -0700</bug_when>
    <thetext>Just FYI, ICU added CP51932.
http://bugs.icu-project.org/trac/changeset/29664

Chromium&apos;s issue is on http://code.google.com/p/chromium/issues/detail?id=78847</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1901470</commentid>
    <comment_count>11</comment_count>
    <who name="Anne van Kesteren">annevk</who>
    <bug_when>2022-09-27 06:28:48 -0700</bug_when>
    <thetext>This got fixed as part of bug 179303 and related efforts so marking as a duplicate.

*** This bug has been marked as a duplicate of bug 179303 ***</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>