Bug 115953

Summary: TextCodecICU complains about ambiguous codec names with current ICU release
Product: WebKit Reporter: Alexey Proskuryakov <ap>
Component: WebCore Misc.Assignee: Alexey Proskuryakov <ap>
Status: RESOLVED FIXED    
Severity: Normal CC: commit-queue, darin
Priority: P2 Keywords: InRadar
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
proposed fix darin: review+

Description Alexey Proskuryakov 2013-05-10 23:08:16 PDT
ICU has gotten a lot more encoding aliases, and as a result, some of them correspond to multiple internal codecs.

For example, whenever a Korean (windows-949) or a Russian (windows-1251) webpage is opened in a debug build, an ERROR is logged.

I am not aware of any specific examples of breakage, however we probably end up with unexpected versions of codecs as a result. Looking at ICU code, there doesn't appear to be any preference given to MIME or IANA names over non-standard aliases.

<rdar://problem/13823864>
Comment 1 Alexey Proskuryakov 2013-05-10 23:38:27 PDT
Created attachment 201450 [details]
proposed fix
Comment 2 WebKit Commit Bot 2013-05-10 23:40:46 PDT
Attachment 201450 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCore/platform/text/TextCodecICU.cpp', u'Source/WebCore/platform/text/TextCodecICU.h']" exit_code: 1
Source/WebCore/platform/text/TextCodecICU.cpp:96:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Source/WebCore/platform/text/TextCodecICU.cpp:99:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Source/WebCore/platform/text/TextCodecICU.cpp:102:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Source/WebCore/platform/text/TextCodecICU.cpp:104:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Source/WebCore/platform/text/TextCodecICU.cpp:190:  canonicalConverterNameForISO_8859_8_I is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
Source/WebCore/platform/text/TextCodecICU.cpp:207:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Source/WebCore/platform/text/TextCodecICU.cpp:208:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Source/WebCore/platform/text/TextCodecICU.cpp:211:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 8 in 3 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 3 Darin Adler 2013-05-11 16:29:55 PDT
Comment on attachment 201450 [details]
proposed fix

View in context: https://bugs.webkit.org/attachment.cgi?id=201450&action=review

It would have been nice to do all the renaming in a separate patch first. The name changes combined with other style changes and behavior changes makes a much harder to review patch.

I’m OK with this, but I also think we could make some significant improvements to make the code more readable.

>> Source/WebCore/platform/text/TextCodecICU.cpp:190
>> +    const char* canonicalConverterNameForISO_8859_8_I = ucnv_getCanonicalName("ISO-8859-8-I", "IANA", &error);
> 
> canonicalConverterNameForISO_8859_8_I is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]

I think you could leave out the underlines in the name of this local variable. Note that it has a small scope so it doesn’t have to have a long name.

Separately, it’s not great that the string "ISO-8859-8-I" is repeated twice in these three lines of code. Would be nice to avoid that.

And there is no “why” comment explaining why this has to be handled differently. Presumably the "MIME" standard name for this is bad and needs to be ignored?

> Source/WebCore/platform/text/TextCodecICU.cpp:198
> +        const char* webStandardName = ucnv_getStandardName(canonicalConverterName, "MIME", &error);

Not good that this ucnv_getStandardName, first MIME, then IANA, technique is repeated twice in these two functions. Can’t we share the code?

> Source/WebCore/platform/text/TextCodecICU.cpp:206
> +        // Don't register codecs for overridden encodings.

It worries me that this repeats the list of encodings that are also listed above. Is there a way to guarantee they stay in sync? Maybe even share a table?

> Source/WebCore/platform/text/TextCodecICU.cpp:210
> +            || strcasecmp(webStandardName, "iso-8859-9") == 0

Why is this one a strcasecmp, but all the others strcmp? Also, historically we mostly steered clear of strcasecmp since it uses a locale-specific concept of case, for the same reason we use toASCIILower instead of tolower.

> Source/WebCore/platform/text/TextCodecICU.cpp:269
> -    m_converterICU = ucnv_open(m_encodingName, &err);
> -#if !LOG_DISABLED
> -    if (err == U_AMBIGUOUS_ALIAS_WARNING)
> -        LOG_ERROR("ICU ambiguous alias warning for encoding: %s", m_encodingName);
> -#endif
> +    m_converterICU = ucnv_open(m_canonicalConverterName, &err);
> +    ASSERT(U_SUCCESS(err));

If we do get an error, now we won’t see the error in the log, just that some error occurred, but not which one. Not sure this is an improvement.

> Source/WebCore/platform/text/TextCodecICU.h:46
> +        static PassOwnPtr<TextCodec> create(const TextEncoding&, const void* canonicalConverterName);

Why const void* for the canonical converter name instead of const char*?
Comment 4 Alexey Proskuryakov 2013-05-11 17:38:16 PDT
Comment on attachment 201450 [details]
proposed fix

View in context: https://bugs.webkit.org/attachment.cgi?id=201450&action=review

Thank you for review!

>>> Source/WebCore/platform/text/TextCodecICU.cpp:190
>>> +    const char* canonicalConverterNameForISO_8859_8_I = ucnv_getCanonicalName("ISO-8859-8-I", "IANA", &error);
>> 
>> canonicalConverterNameForISO_8859_8_I is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
> 
> I think you could leave out the underlines in the name of this local variable. Note that it has a small scope so it doesn’t have to have a long name.
> 
> Separately, it’s not great that the string "ISO-8859-8-I" is repeated twice in these three lines of code. Would be nice to avoid that.
> 
> And there is no “why” comment explaining why this has to be handled differently. Presumably the "MIME" standard name for this is bad and needs to be ignored?

Good point, I'll just use the same canonicalConverterName variable as in the loop below.

A comment above says "See comment above in registerEncodingNames", which explains that ISO-8859-8 and ISO-8859-8-I are exactly the same encoding, so ICU does not differentiate them at all. But the two encoding names differ in how the text should be displayed (logical vs. visual), so this trick helps us preserve this information for TextEncoding class to use. It's taken me a while to understand this at first, but I think that the comments explain it well now. It would certainly be nice to find a way to pass this information in a non-hacky way.

>> Source/WebCore/platform/text/TextCodecICU.cpp:198
>> +        const char* webStandardName = ucnv_getStandardName(canonicalConverterName, "MIME", &error);
> 
> Not good that this ucnv_getStandardName, first MIME, then IANA, technique is repeated twice in these two functions. Can’t we share the code?

I'm actually not sure why we have separate registerEncodingNames and registerCodecs functions.

However, I think that we should stop relying on ICU to provide us with encoding aliases at all, and just hardcode the aliases that we want to support on the Web. It just doesn't make sense to support all the random names ICU knows about.

Once we do that, the two functions will have much less in common.

>> Source/WebCore/platform/text/TextCodecICU.cpp:206
>> +        // Don't register codecs for overridden encodings.
> 
> It worries me that this repeats the list of encodings that are also listed above. Is there a way to guarantee they stay in sync? Maybe even share a table?

Yes, this is something I really want to resolve when moving to a hardcoded alias table. I tried adding an assertion in TextEncodingRegistry, but it got tripped when ICU tried to re-register one of our built-in codecs, because names didn't match (I've seen UTF-16 vs. UTF-16LE).

Once we use a hardcoded table, we'll avoid this class of issues.

>> Source/WebCore/platform/text/TextCodecICU.cpp:210
>> +            || strcasecmp(webStandardName, "iso-8859-9") == 0
> 
> Why is this one a strcasecmp, but all the others strcmp? Also, historically we mostly steered clear of strcasecmp since it uses a locale-specific concept of case, for the same reason we use toASCIILower instead of tolower.

Historically, we needed a case-insensitive comparison just here because of a change in ICU. I'll add a FIXME about how strcasecmp is not the right function for this.

>> Source/WebCore/platform/text/TextCodecICU.cpp:269
>> +    ASSERT(U_SUCCESS(err));
> 
> If we do get an error, now we won’t see the error in the log, just that some error occurred, but not which one. Not sure this is an improvement.

Both LOG_ERROR and ASSERT are debug only, so the error will be visible in debugger once one attaches to the process before crashing.

I didn't quite realize that we used to silently ignore all other errors. This aspect code was (and remains) somewhat broken - one caller of createICUConverter asserts that it succeeded, while another logs an error in debug builds, and returns an empty result.

>> Source/WebCore/platform/text/TextCodecICU.h:46
>> +        static PassOwnPtr<TextCodec> create(const TextEncoding&, const void* canonicalConverterName);
> 
> Why const void* for the canonical converter name instead of const char*?

This function's type has to be NewTextCodecFunction, as it's passed to registrar.

    typedef PassOwnPtr<TextCodec> (*NewTextCodecFunction)(const TextEncoding&, const void* additionalData);
    typedef void (*TextCodecRegistrar)(const char* name, NewTextCodecFunction, const void* additionalData);

The reason why we can't use const char* everywhere is that TextCodecMac uses a different type of additional data (TECTextEncodingID*)

Perhaps it would be less confusing if I called the argument additionalData, and only renamed to canonicalConverterName when casting. I'll make this change.
Comment 5 Alexey Proskuryakov 2013-05-11 17:52:58 PDT
Committed <http://trac.webkit.org/r149947>.