Bug 17537 - Try MIME before trying IANA names when enumerating ICU converters
Summary: Try MIME before trying IANA names when enumerating ICU converters
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Platform (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Jungshik Shin
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-02-25 13:56 PST by Jungshik Shin
Modified: 2008-09-17 09:53 PDT (History)
1 user (show)

See Also:


Attachments
IANA name vs MIME name (2.81 KB, text/plain)
2008-08-29 11:40 PDT, Jungshik Shin
no flags Details
a patch with a modified layout test (2.92 KB, patch)
2008-08-29 14:32 PDT, Jungshik Shin
no flags Details | Formatted Diff | Diff
patch updated with changelog entry (4.66 KB, patch)
2008-09-05 15:25 PDT, Jungshik Shin
no flags Details | Formatted Diff | Diff
patch updated (with typo fixed) (4.47 KB, patch)
2008-09-06 10:13 PDT, Jungshik Shin
no flags Details | Formatted Diff | Diff
patch (that passes webkit test) (4.41 KB, patch)
2008-09-06 12:51 PDT, Jungshik Shin
ap: review-
Details | Formatted Diff | Diff
patch updated with more detailed ChangeLog and strcasecmp added (6.47 KB, patch)
2008-09-08 12:40 PDT, Jungshik Shin
ap: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jungshik Shin 2008-02-25 13:56:34 PST
In TextCodecICU, IANA names are tried, but MIME names are not tried.

The result is that we get horrendously long names like 'Extended_UNIX_Code_Packed_Format_for_Japanese' for common encodings like 'EUC-JP'. 

These long names are also exposed via 'document.charset' (or a couple of other similar properties). See fast/encoding/hanarei-blog32-fc2-com.html

What we can do is try MIME first and then try IANA. If we only try MIME,  windows-125x won't be picked up. So, we have to try both, but MIME first to pick up more commonly recognized names like EUC-JP. 

I have a patch for this and I'll upload it soon.
Comment 1 Alexey Proskuryakov 2008-02-26 09:22:50 PST
Is there a theoretical reason why preferring MIME names will be better? For example, RFC 2616 refers to IANA registry: "The complete set of tokens is defined by the IANA Character Set registry [19]," but also says that "HTTP uses the same definition of the term _character set_ as that described for MIME."
Comment 2 Jungshik Shin 2008-02-26 12:58:19 PST
IANA lists charsets (character encodings) with multiple aliases. In 'MIME context', using the preferred MIME name (alias) in the IANA charset registry makes sense, doesn't it? MIME names are widely used in email, html documents and http header fields. (as you quoted, http and html adopted some of MIME conventions). 

Have you ever seen a web server emitting 'Extended_UNIX_Code_Packed_Format_for_Japanes'  or a web document with meta charset declaration (email )  with that long name?  There may be few, but the vast majority use 'EUC-JP', don't they? 

Comment 3 Alexey Proskuryakov 2008-02-26 13:04:06 PST
I'm just worrying whether this change will fix EUC-JP, but cause similar problems with other encodings. Why can we expect that this won't happen?
Comment 4 Jungshik Shin 2008-08-27 20:38:59 PDT
(In reply to comment #3)
> I'm just worrying whether this change will fix EUC-JP, but cause similar
> problems with other encodings. Why can we expect that this won't happen?

Well, I'll attach a list of encoding names in ICU with MIME and IANA aliases and you'll see yourself. :-)




Comment 5 Jungshik Shin 2008-08-29 11:40:28 PDT
Created attachment 23072 [details]
IANA name vs MIME name 

The file contains IANA and MIME names for ICU converters. For many converters, they're identical. But for some, MIME names are certainly better and I don't see any opposite case.

In addition to Extended_...Unix_..Japanese vs EUC-JP, 

ISO-8859-X has 'year' appended in IANA names but MIME names do not have that. Obviously, we want just ISO-8859-X rather than ISO-8859-X:YYYY

And, the IANA name for US-ASCII is ANSI_X3.4-1968. I don't think you like that better than US-ASCII. 

For some cases, MIME names are missing (windows-12xx) , which is why I proposed trying MIME name first and then falling back to IANA names.

Also note that ICU encoding detector returns MIME names (i.e. EUC-JP rather than Extended_Unix...Japanese, ISO-8859-x rather than ISO-8859-x:yyyy). In case they're not available IANA names - for windows-12xx).  My latest patch (uploaded today) for bug 16482  assumes that we do what I proposed here.
Comment 6 Jungshik Shin 2008-08-29 14:32:53 PDT
Created attachment 23076 [details]
a patch with a modified layout test
Comment 7 Alexey Proskuryakov 2008-08-31 23:28:04 PDT
> ibm-860_P100-1995	IBM860::cp860

Are there any conflicts in "cpXXX" names between IBM and Windows nomenclatures? I vaguely remember that there were some near-conflicts, but I cannot see them right now, will need to look more.

> a patch with a modified layout test

Did you mean to submit this for review? A ChangeLog would be appreciated, as always: <http://webkit.org/coding/contributing.html>.
Comment 8 Jungshik Shin 2008-09-05 14:19:01 PDT
(In reply to comment #7)
> > ibm-860_P100-1995	IBM860::cp860
> 
> Are there any conflicts in "cpXXX" names between IBM and Windows nomenclatures?
> I vaguely remember that there were some near-conflicts, but I cannot see them
> right now, will need to look more.

Yeah, IBM codepage designations and MS-DOS codepage designations are not completely in sync.  However, all those ibm 8xx/cp8xx are of very little importance in today's web. Their share is extremely small if used at all. 


> > a patch with a modified layout test
> 
> Did you mean to submit this for review? A ChangeLog would be appreciated, as
> always: <http://webkit.org/coding/contributing.html>.
> 

Thanks. I'll add a changelog entry and upload a new patch for review. 

Comment 9 Jungshik Shin 2008-09-05 15:25:32 PDT
Created attachment 23205 [details]
patch updated with changelog entry

ap, can you take a look? Thanks.
Comment 10 Jungshik Shin 2008-09-06 10:13:43 PDT
Created attachment 23216 [details]
patch updated (with typo fixed)
Comment 11 Jungshik Shin 2008-09-06 12:51:08 PDT
Created attachment 23219 [details]
patch (that passes webkit test)

On my Mac OS 10.4, somehow, the MIME name for ISO-8859-9 came out in lowercase, which led to the failure of fast/text/encoding/char-decoding.html.

 I checked the revision history of ICU's convrtrs.txt and couldn't find any trace of that being in lowercase in the past.  Anyway, to be safe, I changed the patch to check for both ISO-8859-9 and iso-8859-9 . (I didn't use strcasecmp or stricmp because neither is portable).
Comment 12 Alexey Proskuryakov 2008-09-07 05:02:54 PDT
Comment on attachment 23219 [details]
patch (that passes webkit test)

OK, let's do this.

> Yeah, IBM codepage designations and MS-DOS codepage designations are not
> completely in sync.

Would it be possible to specify what exactly changes in this regard with this patch, if only for posterity?

+2008-09-05  jungshik  <set EMAIL_ADDRESS environment variable>

Please fix this.

> I didn't use strcasecmp or stricmp because neither is portable

We can use strncasecmp from  wtf/StringExtras.h. Also, I think that this line needs a comment - otherwise it's not clear from just looking at the code why we use case-insensitive compare in this single case only.

I'm going to say r- for you to consider my comments, but really it's an r+, as the only thing that absolutely must be fixed is the ChangeLog.
Comment 13 Alexey Proskuryakov 2008-09-07 05:04:19 PDT
(In reply to comment #12)
> We can use strncasecmp from  wtf/StringExtras.h. 

And I think that strcasecmp would be good to add there.
Comment 14 Jungshik Shin 2008-09-08 12:40:19 PDT
Created attachment 23270 [details]
patch updated with more detailed ChangeLog and strcasecmp added

Revised ChangeLog to give more details on changes. 
Also added strcasecmp to wtf/StringExtras.h
Comment 15 Alexey Proskuryakov 2008-09-09 00:22:14 PDT
Comment on attachment 23270 [details]
patch updated with more detailed ChangeLog and strcasecmp added

r=me

+        // And so on. Some versions of ICU may have this in lowercase while others have
+        // in uppercase. Need to confirm. If it's one way or the other, we can just
+        // use strcmp.

I've confirmed that this is different between Tiger and Leopard versions of ICU, and will re-word the comment when landing. I'll also tweak ChangeLog entries a little.

Thanks for fixing this!
Comment 16 Alexey Proskuryakov 2008-09-09 01:12:08 PDT
Committed <http://trac.webkit.org/changeset/36289>.
Comment 17 Alexey Proskuryakov 2008-09-09 03:14:37 PDT
This has caused bug 20741.
Comment 18 Alexey Proskuryakov 2008-09-17 09:53:19 PDT
...and an issue with ICU 4, <http://trac.webkit.org/changeset/36542>.