Bug 22116

Summary: Revise manual charset alias in TextCodecICU
Product: WebKit Reporter: Jungshik Shin <jshin>
Component: PlatformAssignee: Jungshik Shin <jshin>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, darin, ian
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Attachments:
Description Flags
patch
eric: review-
patch
ap: review-
updated patch (with more fleshed-out log entries) ap: review+

Description Jungshik Shin 2008-11-06 15:25:15 PST
Chromium tree has its own customized copy of ICU to take care of various issues with ICU 3.8.1 (it uses). Some of those changes are addressed in WebCore by adding work-arounds in the code, but Chrom does not need them. This bug is to add PLATFORM(CHROMIUM) #ifdefs for chrome to opt out of those work-arounds.
Comment 1 Jungshik Shin 2008-11-06 17:53:29 PST
Created attachment 24960 [details]
patch 

While adding PLATFORM(CHROMIUM) #ifdef's, I fixed what appears to be a typo (windows874-2000 => windows-874-2000).
Comment 2 Jungshik Shin 2008-11-06 18:12:38 PST
Some more info: 
TextCodecICU.cpp changes: most of aliases (except for ISO-8859-9 => Windows-1254) are included in our copy of ICU and we want to avoid duping them in the code. 

line-break change : As commented in the current code, it's fixed in ICU 4.0x (Chrome uses ICU 3.8.1, but this issue was fixed locally) and eventually, that code should go away regardless of PLATFORM(CHROMIUM). It's not that critical, but we want to avoid the unnecessary code if possible. 

Comment 3 Alexey Proskuryakov 2008-11-07 00:58:43 PST
I think that we should try to make this conditional based on ICU version, not the port (although I understand that this may be tricky, given that you have custom modifications).

I'm also not sure what's the benefit in not applying these workarounds - they give negligible if any runtime slowdown, but tracking ICU versions used by ports is additional work for developers.
Comment 4 Eric Seidel (no email) 2008-11-25 21:32:51 PST
Comment on attachment 24960 [details]
patch 

It seems like these changes should be broken into 2 types.  1. are modifications to ICU that chromium has made.  Those should be in PLATFORM(CHROMIUM) blocks.  The others are version differences between ICU versions.  We should use U_ICU_VERSION_MAJOR_NUM and U_ICU_VERSION_MAJOR_NUM and U_ICU_VERSION_MINOR_NUM comparisons for those, I would assume?
Comment 5 Eric Seidel (no email) 2008-11-25 21:39:27 PST
Comment on attachment 24960 [details]
patch 

I'm going to mark this r- (to remove it from the review queue), pending further comment from Jungshik.  Seems like 2 questions were raised above by ap and myself:

1.  What is the harm in leaving any of these workarounds in, even though chromium uses a fixed version of ICU?  Maybe this is very performance sensitive code? (probably is) and thus removing these workarounds is a speedup?

2.  If we're going to conditionals these workarounds, we should probably do so with some smarter #defines.  Here are two possible examples:


#define USE_ICU_3_X_LINE_BREAK_WORKAROUND (U_ICU_VERSION_MAJOR_NUM < 3 && !PLATFORM(CHROMIUM))

and the other one seems like it could just be an ICU major/minor version check for 3.3 or later, no?  Maybe it too needs a nice USE_ICU_3_X_EXTRA_ENCODING_ALIASES or something.
Comment 6 Jungshik Shin 2008-11-25 22:42:37 PST
Sorry I haven't gotten back here earlier. And, thanks for the suggestion. 

I have been tinkering what would be the best approach here, but haven't yet come up with one. 

For the line breaking issue, ICU version macros should work in principle, but Chrome still uses ICU 3.8.1 with a lot of changes while the line breaking issue was not resolved until ICU 4.x. (ICU 4.0 may not have a fix. I have to check again). So, using ICU version macros wouldn't work for Chrome. I don't have any measurement of how hot it's (I can measure it with a special set of pages - Chinese and Japanese pages). [0]  My gut feeling is that it could be pretty hot, but it's often wrong. 

TextCodecICU change is trickier. Though less "invasive" than the patch uploaded, PLATFORM(CHROMIUM) is still necessary. There are two issues:

1. Chrome doesn't want to include various legacy Mac encodings (e.g macgreek, macturkish) in our build of ICU. If we leave alone those names alone in the code without actual mapping tables in ICU, Chrome (well, Chrome's renderer process) would crash if somebody intentionally makes a web page with one of those encodings and points Chrome to it. Then, it occurred to me that some other ports (especially on ports for  mobile devices) may not want them, either, in their ICU build. So, instead of 'PLATFORM(CHROMIUM)', we may use something like 'DISABLE_LEGACY_MAC_ENCODINGS' (which means excluding all legacy mac encodings other than 'macroman' and possibly 'maccyrillic' [2]). 

What do you think of this macro? 

2. I'd rather not expose windows-949-2000. MS's request to register windows-{949,936,932,950} has been stuck in the IANA. windows-949 is most problematic because 936,932,950 have other standard names to use (GBK, Shift_JIS and Big5) while windows-949 does not. (EUC-KR is a subset of windows-949). Nonetheless, virtually everybody means windows-949 while declaring their pages to be EUC-KR. So, I guess we have to live with windows-949. I took a long look at ICU's converters.txt (unmodified version of ICU 3.x and 4.x) and played with ucnv_open to find that we don't have to pass 'windows-949-2000' to ucnv_open to use the table ('windows-949-2000.txt'). We can just use 'windows-949', which I can live with. The same is true of 'windows-874' vs 'windows-874-2000'. (there's no need to explicitly open ucnv_open with 'windows-874-2000'. 'windows-874' works well). 

 Another related concern was that ICU's encoding detector returns "EUC-KR" (bug 16482), but I figured out how to deal with that issue in my patch (to be updated) for bug 16482. So, it's not an issue any more.

Tomorrow, I'll make a new patch for you to take a look at. 

BTW, Chrome's ICU converter customization can be used by any port (including Safari on Windows) which has freedom to customize ICU. So, it's possible to come up with a more generic macro for TextCodecICU. Well,  I'd rather put off this for later.

[0] I realized that I can extract some stats from Chrome's pageload test #2 (by excluding numbers from Korean sites). 

[1] There's one more I'd rather block in Chrome (manual GBK override for a dozen? or so codepoints) that I forgot to include in the patch because again we can take care of them in ICU. 

[2] A couple of layout tests need to be split into two if we take this path. I already have them split in my local tree.
Comment 7 Eric Seidel (no email) 2008-11-25 22:50:44 PST
Clearly Jungshik has given this more thought than I. :)  I defer to those in the project who have actually worked with ICU (darin, ap, possibly mitz?).  Thank you for your additional comments Jungshik.
Comment 8 Alexey Proskuryakov 2008-11-26 00:51:41 PST
In cases where we must check ICU version, I'd prefer runtime version detection over version macros, as the version of ICU used to build WebKit doesn't necessarily match the one present on user systems.

I think that we really should try to avoid making different judgement calls here - e.g. is there any reason to support "windows-949-2000" in Safari, but not in Chrome? If it's harmful, let's drop it from all ports.

I can see a bit of a reason for having different legacy Mac encoding support (as local files on Macs are way more likely to use those than Web documents), but even that is not quite clear cut.
Comment 9 Jungshik Shin 2008-12-02 16:31:38 PST
Perhaps, I was not clear about windows-949. What I meant is not that 'euc-kr' should not be treated as windows-949. We should certainly do that because there are so many web pages labelled with euc-kr that are actually windows-949. 

In a sense, my issue with it is more 'cosmetic' than 'substantial'. I'd want to avoid exposing that name when we emit something to a user, web server, etc. Because windows-949 is not registered with IANA, there's very few, if any, web pages that uses that label and understands the label. 

So, my preference is to use the name 'EUC-KR' with the actual mapping being 'windows-949' (which is a 'lie' :-)).  Without changing the ICU alias table, it's still possible to do that by some tweaking in TextCodecICU, TextEncoding et, but it seems rather dirty.  Because we emit the charset canonical name only in a couple of places (Javascript document.encoding? property and post form submission. I haven't check whether we use 'charset' parameter. I f not, it's not a concern at all), I guess we can just go with 'windows-949'. 

'windows-949' and 'windows-874' as canonical names work as well as 'windows-949-2000' and 'windows-874-2000' (in ICU 3.2, 3.6, 3.8 and 4.0) and I think it's better to use more compact names.

I've revised TextCodecICU.cpp (with some more small changes in addition to the above two) and changed the corresponding tests in fast/encoding and am uploading a patch. 

I'll address the line breaking issue in another bug. 





Comment 10 Jungshik Shin 2008-12-02 16:33:14 PST
Created attachment 25689 [details]
patch 

Revises charset alias map for TextCodecICU.
        - Uses windows-949 and windows-874 instead of windows-949-2000 and windows-874
        - Replaces 'windows874' in a couple of place with 'windows-874' (for the canonical name)
        - Maps 'dos-874' to 'windows-874'
        - Maps 'EUC-CN' to 'GBK' and replaces it with 'GBK' when it's used as the canonical name
        - Puts Mac encodings in a separate test char-decoding-mac.html. Rename char-decoding-mac.html
          in platform/mac to xnextstep-decoding.html to avoid the name collision in the result files.
        - Adds a JS file for common functions used by char-decoding-*.html tests.
Comment 11 Alexey Proskuryakov 2008-12-03 01:22:50 PST
Comment on attachment 25689 [details]
patch 

Thank you, I like this patch a lot!

> +        - Uses windows-949 and windows-874 instead of windows-949-2000 and windows-874

A typo, should be windows-874-2000.

The reason why I used windows-949-2000 is that windows-949 is ambiguous in ICU, see <http://demo.icu-project.org/icu-bin/convexp?s=IANA&s=MIME&s=ALL>. Does anything guarantee that the correct codec will be picked for windows-949 family encodings with your change? I _think_ that windows-949-2000 is the correct one, as in matching WinIE better, but I haven't verified that assumption. This is not an issue for windows-874.

On a tangential note, HTML5 uses the windows-949 name: <http://www.whatwg.org/specs/web-apps/current-work/#character-encoding-requirements>.

> +        - Maps 'dos-874' to 'windows-874'

Could you please mention the rationale, in this bug and/or in ChangeLog? Is this done for compatibility with other browsers? Do both Firefox and IE support this name? Do we have a reason to think that actual pages ever use it? I don't have anything against this change, just want it fully documented.

> +        - Maps 'EUC-CN' to 'GBK' and replaces it with 'GBK' when it's used as the canonical name

Is this a change in behavior? According to ICU converter explorer, EUC-CN is already mapped to GB2312, which we upgrade to GBK.

> +        - Adds 'x-uhc' as an alias for 'windows-949'

Could you please mention the rationale?

r- because of the concern about ambiguity of windows-949, but please feel free to mark for review again if it's actually OK for some reason that I missed.
Comment 12 Ian 'Hixie' Hickson 2008-12-03 02:04:41 PST
I can change HTML5 if there's a better solution than what I have now. Just let me know.
Comment 13 Jungshik Shin 2008-12-03 14:52:40 PST
Thank you for going over it. I'll add a bit more details for my changes and upload a new patch. 

(In reply to comment #11)
> (From update of attachment 25689 [details] [review])
> Thank you, I like this patch a lot!
> 
> > +        - Uses windows-949 and windows-874 instead of windows-949-2000 and windows-874
> 
> A typo, should be windows-874-2000.
> 
> The reason why I used windows-949-2000 is that windows-949 is ambiguous in ICU,
> see <http://demo.icu-project.org/icu-bin/convexp?s=IANA&s=MIME&s=ALL>. Does
> anything guarantee that the correct codec will be picked for windows-949 family
> encodings with your change? I _think_ that windows-949-2000 is the correct one,
> as in matching WinIE better,but I haven't verified that assumption. This is
> not an issue for windows-874.

I was also concerned about it :-) That's why I made sure that windows-949 is not ambiguous in all ICU versions I checked (3.2.1, 3.6, 3.8.1. 4.0 and trunk [1]). At the beginning of convrtrs.txt file, the priority order of various standards is listed. (see also  See also http://source.icu-project.org/repos/icu/icu/trunk/source/common/ucnv_io.c ). 
windows-949 is given to two mapping tables in the file, but one of two ( ibm-1363_P11B-1998) is not tagged with any standard while the other (windows-949-2000) is annotated with 'Windows' and 'Java'. 'Windows' has the 2nd highest priority next to UTR22 so that 'windows-949-2000' will be always picked up when 'windows-949' is asked for in ucnv_open. 

Note that Shift_JIS has  a similar problem and there's no problem with it. Shift_JIS is given to  ibm-943_P15A-2003 and ibm-943_P130-1999, but ucnv_open("Shift_JIS"...) always picks up the former because the latter is untagged while the former is tagged.

One more argument in favor of 'windows-949' over 'windows-949-2000' is that for MIME charset, we'd better use a stable name like 'windows-949' rather than a specific name like 'windows-949-2000'. ISO-8859-X changed over time (by adding a few more characters), but specific names such as ISO-8859-7:200x or ISO-8859-5:198x are not generally used as MIME charset, but  ISO-8859-7 or ISO-8859-5 is used much more widely.
 
> On a tangential note, HTML5 uses the windows-949 name:
> <http://www.whatwg.org/specs/web-apps/current-work/#character-encoding-requirements>.

Sure, it does :-), but registering it with IANA got stuck for 2+ years. IANA registration aside, very few, if any, web pages use that name, which is NOT a concern. My minor concern mentioned earlier was that when we EMIT that name (form submission with POST if we add 'charset' parameter to C-T header) and server-side programs may not recognize it. If that's indeed a problem, we have to use 'EUC-KR' (name) while using the mapping for windows-949-2000. This is messy...

Do we add 'charset' parameter to C-T header in POST form submission? If not, I guess we don't have to worry about that (document.encoding(?) or similar properties is of less concern). Gecko added charset several years ago  very briefly, but was forced to remove it because at that time, most web server-side programs choked on it. 


> 
> > +        - Maps 'dos-874' to 'windows-874'
> 
> Could you please mention the rationale, in this bug and/or in ChangeLog? Is
> this done for compatibility with other browsers? Do both Firefox and IE support
> this name? Do we have a reason to think that actual pages ever use it? I don't
> have anything against this change, just want it fully documented.

Actually, I think we can just drop it, but I'd rather be on the safe-side by keeping it until I have some numbers to back it up.

Currently, 'dos-874' is mapped manually to 'cp874' with this line:

   registrar("dos874", "cp874")

Hmm.. I wonder why this does not trigger assertion in a debug build because cp874 is neither MIME nor IANA standard names.  Anyway, cp874 in ICU is ibm-874_P100-1995, whose IANA name is TIS-620, which is in turn mapped to windows-874(-2000). So, replacing the above line with
   registrar("dos874", "windows-874") 
is 'no-op'. 


> > +        - Maps 'EUC-CN' to 'GBK' and replaces it with 'GBK' when it's used as the canonical name
> 
> Is this a change in behavior? According to ICU converter explorer, EUC-CN is
> already mapped to GB2312, which we upgrade to GBK.
 
Again, we have lines like :

  registrar("cngb", "EUC-CN");

However, EUC-CN will not be picked up in the double-for-loop (going through IANA/MIME names and their aliases) because EUC-CN in convrtrs.txt is NOT tagged with any standard. And, I wonder why this works in a debug build.  

Anyway, directly mapping cngb, xgbk, xeuccn to 'GBK' is cleaner, isn't it? 


> > +        - Adds 'x-uhc' as an alias for 'windows-949'
> 
> Could you please mention the rationale?

UHC stands for Unified Hangul Code (an alternative name for windows-949). Firefox only recognizes it (x-uhc) and tries not to expose it (well, it's in its UI), but some web authors pick the name and I saw it being used in a few cases.
 


[1] In addition to inspecting convrtrs.txt files in various ICU versions, I wrote a simple test program and made sure, in all ICU versions mentioned,  that there's no warning (U_AMBIGUOUS...) and what's actually picked up is 'windows-949-2000' when 'windows-949' is asked for.

Comment 14 Jungshik Shin 2008-12-03 15:14:48 PST
Created attachment 25723 [details]
updated patch (with more fleshed-out log entries)

Alexey, can you take another look? Thanks.
Comment 15 Alexey Proskuryakov 2008-12-03 22:54:49 PST
Comment on attachment 25723 [details]
updated patch (with more fleshed-out log entries)

> However, EUC-CN will not be picked up in the double-for-loop (going through
> IANA/MIME names and their aliases) because EUC-CN in convrtrs.txt is NOT tagged
> with any standard.

According to online converter explorer, it has both MIME and IANA standard names (GB2312 and csGB2312), so it should be picked. But I'm probably a little confused: ChangeLog says "Maps 'EUC-CN' to 'GBK'", while I don't see any changes to EUC-CN behavior in code, only an added test for it, which passes in ToT.

+// FIX: Have to add tests for Euro and a few new characters added to ISO-8859-x 

We use FIXME for such comments.

All substantial changes look good, and this patch significantly cleans up code and tests, r=me!
Comment 16 Jungshik Shin 2008-12-04 13:18:46 PST
Thanks again for the review.

(In reply to comment #15)
> (From update of attachment 25723 [details] [review])
> > However, EUC-CN will not be picked up in the double-for-loop (going through
> > IANA/MIME names and their aliases) because EUC-CN in convrtrs.txt is NOT tagged
> > with any standard.
> 
> According to online converter explorer, it has both MIME and IANA standard
> names (GB2312 and csGB2312), so it should be picked. But I'm probably a little
> confused: ChangeLog says "Maps 'EUC-CN' to 'GBK'", while I don't see any
> changes to EUC-CN behavior in code, only an added test for it, which passes in
> ToT.

You're right. When writing the previous comment, I incorrectly remembered what addToTextEncodingNameMap does. EUC-CN is picked up in the loop as an alias for GBK so 'registra("xeuccn", "EUC-CN")' indirectly aliases 'xeuccn' to 'GBK' because 'EUC-CN' is aliased to GBK in the loop. My change removes that indirection and aliases 'xeuccn' (and others) directly to 'GBK'. Not a change of substance but a kinda code-cleanup.


> +// FIX: Have to add tests for Euro and a few new characters added to
> ISO-8859-x 
> 
> We use FIXME for such comments.
> 
> All substantial changes look good, and this patch significantly cleans up code
> and tests, r=me!
> 

Comment 17 Alexey Proskuryakov 2008-12-05 01:49:08 PST
Committed revision 39021.