<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://bugs.webkit.org/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4.1"
          urlbase="https://bugs.webkit.org/"
          
          maintainer="admin@webkit.org"
>

    <bug>
          <bug_id>28760</bug_id>
          
          <creation_ts>2009-08-26 17:10:41 -0700</creation_ts>
          <short_desc>Make the canonical names (of TextEncoding) robust to changes in ICU&apos;s alias table</short_desc>
          <delta_ts>2014-01-01 23:20:50 -0800</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WebKit</product>
          <component>Platform</component>
          <version>528+ (Nightly build)</version>
          <rep_platform>All</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>NEW</bug_status>
          <resolution></resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>Normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Jungshik Shin">jshin</reporter>
          <assigned_to name="Nobody">webkit-unassigned</assigned_to>
          <cc>ap</cc>
    
    <cc>darin</cc>
    
    <cc>eric</cc>
          

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>143007</commentid>
    <comment_count>0</comment_count>
    <who name="Jungshik Shin">jshin</who>
    <bug_when>2009-08-26 17:10:41 -0700</bug_when>
    <thetext>In ICU 4.2, the gb18030 entry in convrtrs.txt changed to

gb18030 { IANA* }       ibm-1392 { IBM* } windows-54936 { WINDOWS* } GB18030 { MIME* }


from

gb18030 { IANA* }       ibm-1392 { IBM* } windows-54936 { WINDOWS* }

Note that &apos;GB18030&apos; (uppercase) was added as the MIME name for gb18030. Because WebKit gives higher precedence to the MIME name, it picks up GB18030 as the canonical name.

Chromium has some tests that compare charset names case-sensitively (WebKit layout tests have some too, e.g. &apos;EUC-JP&apos;). Chromium also has some unit tests (DOM serialization) and UI tests (encoding menu test) that compare the textual contents of two files. Those files include a &apos;meta charset&apos; label generated by the DOM serializer, which uses the canonical name of TextEncoding to generate it.
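
A minimal sketch of the case-insensitive comparison such tests could use instead, written in Python for brevity. same_charset is a hypothetical helper for illustration, not an existing WebKit or Chromium function:

```python
def same_charset(a, b):
    # Hypothetical helper: charset labels are ASCII, so folding case
    # with lower() is enough to compare them case-insensitively.
    return a.strip().lower() == b.strip().lower()


print(same_charset("GB18030", "gb18030"))  # True
print(same_charset("EUC-JP", "euc-kr"))    # False
```

With a comparison like this, a test would not notice whether the canonical name comes back as GB18030 or gb18030.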

It&apos;s possible to track down all the cases where TextEncoding::name() is used and lowercase the return value in WebKit &apos;clients&apos;, but it may be better to make the canonical name of TextEncoding always lowercase. If we do, we have to change the expected results of a few layout tests.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>143025</commentid>
    <comment_count>1</comment_count>
    <who name="Darin Adler">darin</who>
    <bug_when>2009-08-26 18:30:30 -0700</bug_when>
    <thetext>In theory I like the idea of TextEncoding always using the &quot;canonical&quot; capitalization of charset names, if such a thing existed.

Lacking that, lowercasing all the names and changing the tests sounds OK to me, as long as it doesn&apos;t affect performance.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>143027</commentid>
    <comment_count>2</comment_count>
    <who name="Alexey Proskuryakov">ap</who>
    <bug_when>2009-08-26 18:42:18 -0700</bug_when>
    <thetext>Won&apos;t this change what JavaScript code sees as document.charset? If so, there&apos;s certain potential for negative web site compatibility effects - which is difficult to justify by ease of writing regression tests in my opinion.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>144267</commentid>
    <comment_count>3</comment_count>
    <who name="Jungshik Shin">jshin</who>
    <bug_when>2009-09-02 12:47:35 -0700</bug_when>
    <thetext>(In reply to comment #2)
&gt; Won&apos;t this change what JavaScript code sees as document.charset? If so, there&apos;s
&gt; certain potential for negative web site compatibility effects - which is
&gt; difficult to justify by ease of writing regression tests in my opinion.

In theory, all that JS code and the server-side code behind it should match charset names case-insensitively, but in practice you&apos;re right that there&apos;s a risk. I&apos;ll check what other browsers emit for document.charset (uppercase or lowercase).

It seems that only GB18030 vs gb18030 changed in ICU 4.2. An alternative to lowercasing all charset names is to special-case GB18030 vs gb18030 in TextCodecICU.cpp, probably inside an #ifdef so that it is ICU version-specific.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>963692</commentid>
    <comment_count>4</comment_count>
    <who name="Alexey Proskuryakov">ap</who>
    <bug_when>2014-01-01 23:20:50 -0800</bug_when>
    <thetext>See also: bug 125225.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>