RESOLVED FIXED 17537
Try MIME before trying IANA names when enumerating ICU converters
https://bugs.webkit.org/show_bug.cgi?id=17537
Summary Try MIME before trying IANA names when enumerating ICU converters
Jungshik Shin
Reported 2008-02-25 13:56:34 PST
In TextCodecICU, IANA names are tried, but MIME names are not tried. The result is that we get horrendously long names like 'Extended_UNIX_Code_Packed_Format_for_Japanese' for common encodings like 'EUC-JP'. These long names are also exposed via 'document.charset' (or a couple of other similar properties). See fast/encoding/hanarei-blog32-fc2-com.html What we can do is try MIME first and then try IANA. If we only try MIME, windows-125x won't be picked up. So, we have to try both, but MIME first to pick up more commonly recognized names like EUC-JP. I have a patch for this and I'll upload it soon.
Attachments
IANA name vs MIME name (2.81 KB, text/plain)
2008-08-29 11:40 PDT, Jungshik Shin
no flags
a patch with a modified layout test (2.92 KB, patch)
2008-08-29 14:32 PDT, Jungshik Shin
no flags
patch updated with changelog entry (4.66 KB, patch)
2008-09-05 15:25 PDT, Jungshik Shin
no flags
patch updated (with typo fixed) (4.47 KB, patch)
2008-09-06 10:13 PDT, Jungshik Shin
no flags
patch (that passes webkit test) (4.41 KB, patch)
2008-09-06 12:51 PDT, Jungshik Shin
ap: review-
patch updated with more detailed ChangeLog and strcasecmp added (6.47 KB, patch)
2008-09-08 12:40 PDT, Jungshik Shin
ap: review+
Alexey Proskuryakov
Comment 1 2008-02-26 09:22:50 PST
Is there a theoretical reason why preferring MIME names will be better? For example, RFC 2616 refers to IANA registry: "The complete set of tokens is defined by the IANA Character Set registry [19]," but also says that "HTTP uses the same definition of the term _character set_ as that described for MIME."
Jungshik Shin
Comment 2 2008-02-26 12:58:19 PST
IANA lists charsets (character encodings) with multiple aliases. In 'MIME context', using the preferred MIME name (alias) in the IANA charset registry makes sense, doesn't it? MIME names are widely used in email, html documents and http header fields. (as you quoted, http and html adopted some of MIME conventions). Have you ever seen a web server emitting 'Extended_UNIX_Code_Packed_Format_for_Japanes' or a web document with meta charset declaration (email ) with that long name? There may be few, but the vast majority use 'EUC-JP', don't they?
Alexey Proskuryakov
Comment 3 2008-02-26 13:04:06 PST
I'm just worrying whether this change will fix EUC-JP, but cause similar problems with other encodings. Why can we expect that this won't happen?
Jungshik Shin
Comment 4 2008-08-27 20:38:59 PDT
(In reply to comment #3) > I'm just worrying whether this change will fix EUC-JP, but cause similar > problems with other encodings. Why can we expect that this won't happen? Well, I'll attach a list of encoding names in ICU with MIME and IANA aliases and you'll see yourself. :-)
Jungshik Shin
Comment 5 2008-08-29 11:40:28 PDT
Created attachment 23072 [details] IANA name vs MIME name The file contains IANA and MIME names for ICU converters. For many converters, they're identical. But for some, MIME names are certainly better and I don't see any opposite case. In addition to Extended_...Unix_..Japanese vs EUC-JP, ISO-8859-X has 'year' appended in IANA names but MIME names do not have that. Obviously, we want just ISO-8859-X rather than ISO-8859-X:YYYY And, the IANA name for US-ASCII is ANSI_X3.4-1968. I don't think you like that better than US-ASCII. For some cases, MIME names are missing (windows-12xx) , which is why I proposed trying MIME name first and then falling back to IANA names. Also note that ICU encoding detector returns MIME names (i.e. EUC-JP rather than Extended_Unix...Japanese, ISO-8859-x rather than ISO-8859-x:yyyy). In case they're not available IANA names - for windows-12xx). My latest patch (uploaded today) for bug 16482 assumes that we do what I proposed here.
Jungshik Shin
Comment 6 2008-08-29 14:32:53 PDT
Created attachment 23076 [details] a patch with a modified layout test
Alexey Proskuryakov
Comment 7 2008-08-31 23:28:04 PDT
> ibm-860_P100-1995 IBM860::cp860 Are there any conflicts in "cpXXX" names between IBM and Windows nomenclatures? I vaguely remember that there were some near-conflicts, but I cannot see them right now, will need to look more. > a patch with a modified layout test Did you mean to submit this for review? A ChangeLog would be appreciated, as always: <http://webkit.org/coding/contributing.html>.
Jungshik Shin
Comment 8 2008-09-05 14:19:01 PDT
(In reply to comment #7) > > ibm-860_P100-1995 IBM860::cp860 > > Are there any conflicts in "cpXXX" names between IBM and Windows nomenclatures? > I vaguely remember that there were some near-conflicts, but I cannot see them > right now, will need to look more. Yeah, IBM codepage designations and MS-DOS codepage designations are not completely in sync. However, all those ibm 8xx/cp8xx are of very little importance in today's web. Their share is extremely small if used at all. > > a patch with a modified layout test > > Did you mean to submit this for review? A ChangeLog would be appreciated, as > always: <http://webkit.org/coding/contributing.html>. > Thanks. I'll add a changelog entry and upload a new patch for review.
Jungshik Shin
Comment 9 2008-09-05 15:25:32 PDT
Created attachment 23205 [details] patch updated with changelog entry ap, can you take a look? Thanks.
Jungshik Shin
Comment 10 2008-09-06 10:13:43 PDT
Created attachment 23216 [details] patch updated (with typo fixed)
Jungshik Shin
Comment 11 2008-09-06 12:51:08 PDT
Created attachment 23219 [details] patch (that passes webkit test) On my Mac OS 10.4, somehow, the MIME name for ISO-8859-9 came out in lowercase, which led to the failure of fast/text/encoding/char-decoding.html. I checked the revision history of ICU's convrtrs.txt and couldn't find any trace of that being in lowercase in the past. Anyway, to be safe, I changed the patch to check for both ISO-8859-9 and iso-8859-9 . (I didn't use strcasecmp or stricmp because neither is portable).
Alexey Proskuryakov
Comment 12 2008-09-07 05:02:54 PDT
Comment on attachment 23219 [details] patch (that passes webkit test) OK, let's do this. > Yeah, IBM codepage designations and MS-DOS codepage designations are not > completely in sync. Would it be possible to specify what exactly changes in this regard with this patch, if only for posterity? +2008-09-05 jungshik <set EMAIL_ADDRESS environment variable> Please fix this. > I didn't use strcasecmp or stricmp because neither is portable We can use strncasecmp from wtf/StringExtras.h. Also, I think that this line needs a comment - otherwise it's not clear from just looking at the code why we use case-insensitive compare in this single case only. I'm going to say r- for you to consider my comments, but really it's an r+, as the only thing that absolutely must be fixed is the ChangeLog.
Alexey Proskuryakov
Comment 13 2008-09-07 05:04:19 PDT
(In reply to comment #12) > We can use strncasecmp from wtf/StringExtras.h. And I think that strcasecmp would be good to add there.
Jungshik Shin
Comment 14 2008-09-08 12:40:19 PDT
Created attachment 23270 [details] patch updated with more detailed ChangeLog and strcasecmp added Revised ChangeLog to give more details on changes. Also added strcasecmp to wtf/StringExtras.h
Alexey Proskuryakov
Comment 15 2008-09-09 00:22:14 PDT
Comment on attachment 23270 [details] patch updated with more detailed ChangeLog and strcasecmp added r=me + // And so on. Some versions of ICU may have this in lowercase while others have + // in uppercase. Need to confirm. If it's one way or the other, we can just + // use strcmp. I've confirmed that this is different between Tiger and Leopard versions of ICU, and will re-word the comment when landing. I'll also tweak ChangeLog entries a little. Thanks for fixing this!
Alexey Proskuryakov
Comment 16 2008-09-09 01:12:08 PDT
Alexey Proskuryakov
Comment 17 2008-09-09 03:14:37 PDT
This has caused bug 20741.
Alexey Proskuryakov
Comment 18 2008-09-17 09:53:19 PDT
...and an issue with ICU 4, <http://trac.webkit.org/changeset/36542>.
Note You need to log in before you can comment on or make changes to this bug.