17537 – Try MIME before trying IANA names when enumerating ICU converters

RESOLVED FIXED17537

Try MIME before trying IANA names when enumerating ICU converters

https://bugs.webkit.org/show_bug.cgi?id=17537

Summary Try MIME before trying IANA names when enumerating ICU converters

Jungshik Shin

Reported 2008-02-25 13:56:34 PST

In TextCodecICU, IANA names are tried, but MIME names are not tried. The result is that we get horrendously long names like 'Extended_UNIX_Code_Packed_Format_for_Japanese' for common encodings like 'EUC-JP'. These long names are also exposed via 'document.charset' (or a couple of other similar properties). See fast/encoding/hanarei-blog32-fc2-com.html What we can do is try MIME first and then try IANA. If we only try MIME, windows-125x won't be picked up. So, we have to try both, but MIME first to pick up more commonly recognized names like EUC-JP. I have a patch for this and I'll upload it soon.

Attachments
IANA name vs MIME name (2.81 KB, text/plain) 2008-08-29 11:40 PDT, Jungshik Shin	no flags	Details
a patch with a modified layout test (2.92 KB, patch) 2008-08-29 14:32 PDT, Jungshik Shin	no flags	Details Formatted Diff Diff
patch updated with changelog entry (4.66 KB, patch) 2008-09-05 15:25 PDT, Jungshik Shin	no flags	Details Formatted Diff Diff
patch updated (with typo fixed) (4.47 KB, patch) 2008-09-06 10:13 PDT, Jungshik Shin	no flags	Details Formatted Diff Diff
patch (that passes webkit test) (4.41 KB, patch) 2008-09-06 12:51 PDT, Jungshik Shin	ap: review-	Details Formatted Diff Diff
patch updated with more detailed ChangeLog and strcasecmp added (6.47 KB, patch) 2008-09-08 12:40 PDT, Jungshik Shin	ap: review+	Details Formatted Diff Diff
Show Obsolete (4) View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2008-02-26 09:22:50 PST

Is there a theoretical reason why preferring MIME names will be better? For example, RFC 2616 refers to IANA registry: "The complete set of tokens is defined by the IANA Character Set registry [19]," but also says that "HTTP uses the same definition of the term _character set_ as that described for MIME."

Jungshik Shin

Comment 2 2008-02-26 12:58:19 PST

IANA lists charsets (character encodings) with multiple aliases. In 'MIME context', using the preferred MIME name (alias) in the IANA charset registry makes sense, doesn't it? MIME names are widely used in email, html documents and http header fields. (as you quoted, http and html adopted some of MIME conventions). Have you ever seen a web server emitting 'Extended_UNIX_Code_Packed_Format_for_Japanes' or a web document with meta charset declaration (email ) with that long name? There may be few, but the vast majority use 'EUC-JP', don't they?

Alexey Proskuryakov

Comment 3 2008-02-26 13:04:06 PST

I'm just worrying whether this change will fix EUC-JP, but cause similar problems with other encodings. Why can we expect that this won't happen?

Jungshik Shin

Comment 4 2008-08-27 20:38:59 PDT

(In reply to comment #3) > I'm just worrying whether this change will fix EUC-JP, but cause similar > problems with other encodings. Why can we expect that this won't happen? Well, I'll attach a list of encoding names in ICU with MIME and IANA aliases and you'll see yourself. :-)

Jungshik Shin

Comment 5 2008-08-29 11:40:28 PDT

Created attachment 23072 [details] IANA name vs MIME name The file contains IANA and MIME names for ICU converters. For many converters, they're identical. But for some, MIME names are certainly better and I don't see any opposite case. In addition to Extended_...Unix_..Japanese vs EUC-JP, ISO-8859-X has 'year' appended in IANA names but MIME names do not have that. Obviously, we want just ISO-8859-X rather than ISO-8859-X:YYYY And, the IANA name for US-ASCII is ANSI_X3.4-1968. I don't think you like that better than US-ASCII. For some cases, MIME names are missing (windows-12xx) , which is why I proposed trying MIME name first and then falling back to IANA names. Also note that ICU encoding detector returns MIME names (i.e. EUC-JP rather than Extended_Unix...Japanese, ISO-8859-x rather than ISO-8859-x:yyyy). In case they're not available IANA names - for windows-12xx). My latest patch (uploaded today) for bug 16482 assumes that we do what I proposed here.

Jungshik Shin

Comment 6 2008-08-29 14:32:53 PDT

Created attachment 23076 [details] a patch with a modified layout test

Alexey Proskuryakov

Comment 7 2008-08-31 23:28:04 PDT

> ibm-860_P100-1995 IBM860::cp860 Are there any conflicts in "cpXXX" names between IBM and Windows nomenclatures? I vaguely remember that there were some near-conflicts, but I cannot see them right now, will need to look more. > a patch with a modified layout test Did you mean to submit this for review? A ChangeLog would be appreciated, as always: <http://webkit.org/coding/contributing.html>.

Jungshik Shin

Comment 8 2008-09-05 14:19:01 PDT

(In reply to comment #7) > > ibm-860_P100-1995 IBM860::cp860 > > Are there any conflicts in "cpXXX" names between IBM and Windows nomenclatures? > I vaguely remember that there were some near-conflicts, but I cannot see them > right now, will need to look more. Yeah, IBM codepage designations and MS-DOS codepage designations are not completely in sync. However, all those ibm 8xx/cp8xx are of very little importance in today's web. Their share is extremely small if used at all. > > a patch with a modified layout test > > Did you mean to submit this for review? A ChangeLog would be appreciated, as > always: <http://webkit.org/coding/contributing.html>. > Thanks. I'll add a changelog entry and upload a new patch for review.

Jungshik Shin

Comment 9 2008-09-05 15:25:32 PDT

Created attachment 23205 [details] patch updated with changelog entry ap, can you take a look? Thanks.

Jungshik Shin

Comment 10 2008-09-06 10:13:43 PDT

Created attachment 23216 [details] patch updated (with typo fixed)

Jungshik Shin

Comment 11 2008-09-06 12:51:08 PDT

Created attachment 23219 [details] patch (that passes webkit test) On my Mac OS 10.4, somehow, the MIME name for ISO-8859-9 came out in lowercase, which led to the failure of fast/text/encoding/char-decoding.html. I checked the revision history of ICU's convrtrs.txt and couldn't find any trace of that being in lowercase in the past. Anyway, to be safe, I changed the patch to check for both ISO-8859-9 and iso-8859-9 . (I didn't use strcasecmp or stricmp because neither is portable).

Alexey Proskuryakov

Comment 12 2008-09-07 05:02:54 PDT

Comment on attachment 23219 [details] patch (that passes webkit test) OK, let's do this. > Yeah, IBM codepage designations and MS-DOS codepage designations are not > completely in sync. Would it be possible to specify what exactly changes in this regard with this patch, if only for posterity? +2008-09-05 jungshik <set EMAIL_ADDRESS environment variable> Please fix this. > I didn't use strcasecmp or stricmp because neither is portable We can use strncasecmp from wtf/StringExtras.h. Also, I think that this line needs a comment - otherwise it's not clear from just looking at the code why we use case-insensitive compare in this single case only. I'm going to say r- for you to consider my comments, but really it's an r+, as the only thing that absolutely must be fixed is the ChangeLog.

Alexey Proskuryakov

Comment 13 2008-09-07 05:04:19 PDT

(In reply to comment #12) > We can use strncasecmp from wtf/StringExtras.h. And I think that strcasecmp would be good to add there.

Jungshik Shin

Comment 14 2008-09-08 12:40:19 PDT

Created attachment 23270 [details] patch updated with more detailed ChangeLog and strcasecmp added Revised ChangeLog to give more details on changes. Also added strcasecmp to wtf/StringExtras.h

Alexey Proskuryakov

Comment 15 2008-09-09 00:22:14 PDT

Comment on attachment 23270 [details] patch updated with more detailed ChangeLog and strcasecmp added r=me + // And so on. Some versions of ICU may have this in lowercase while others have + // in uppercase. Need to confirm. If it's one way or the other, we can just + // use strcmp. I've confirmed that this is different between Tiger and Leopard versions of ICU, and will re-word the comment when landing. I'll also tweak ChangeLog entries a little. Thanks for fixing this!

Alexey Proskuryakov

Comment 16 2008-09-09 01:12:08 PDT

Committed <http://trac.webkit.org/changeset/36289>.

Alexey Proskuryakov

Comment 17 2008-09-09 03:14:37 PDT

This has caused bug 20741.

Alexey Proskuryakov

Comment 18 2008-09-17 09:53:19 PDT

...and an issue with ICU 4, <http://trac.webkit.org/changeset/36542>.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component Platform

Assignee

Jungshik Shin

Reported

2008-02-25 13:56 PST

Modified

2008-09-17 09:53 PDT History

CC List

1 user Show

URL

Keywords

Depends on

Blocks