Bug 115848

Summary: Make TextCodecICU not depend on TextEncoding
Product: WebKit Reporter: Alexey Proskuryakov <ap>
Component: WebCore Misc.Assignee: Alexey Proskuryakov <ap>
Status: RESOLVED FIXED    
Severity: Normal CC: benjamin, commit-queue, darin
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
proposed patch darin: review+

Description Alexey Proskuryakov 2013-05-08 22:11:05 PDT
TextEncoding is a WebCore concept that builds on TextCodec, which is a platform API wrapper, so using the former in TextCodecICU is confusing. It's used for two reasons:

1. Passing an encoding as TextEncoding guarantees that the encoding name has been already canonicalized.

2. We use TextEncoding::displayString() to tweak Asian encoders, so that they have desired behavior with backslash. I think that even though the string transformation is the same in both cases, using displayString for this is conceptually unsound.
Comment 1 Alexey Proskuryakov 2013-05-08 22:14:16 PDT
Created attachment 201135 [details]
proposed patch

I'm looking into this because I plan to add an ICU codec name pointer to TextCodecICU for <rdar://problem/13823864>.
Comment 2 WebKit Commit Bot 2013-05-08 22:15:36 PDT
Attachment 201135 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCore/platform/text/TextCodecICU.cpp', u'Source/WebCore/platform/text/TextCodecICU.h']" exit_code: 1
Source/WebCore/platform/text/TextCodecICU.cpp:339:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 1 in 3 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 3 Benjamin Poulain 2013-05-09 01:46:11 PDT
Comment on attachment 201135 [details]
proposed patch

View in context: https://bugs.webkit.org/attachment.cgi?id=201135&action=review

> Source/WebCore/platform/text/TextCodecICU.cpp:62
> +    // TextEncoding name is always atomic, so the string is never deleted.

I am a little confused by "atomic" here. What do you mean?
Comment 4 Alexey Proskuryakov 2013-05-09 08:45:22 PDT
Comment on attachment 201135 [details]
proposed patch

View in context: https://bugs.webkit.org/attachment.cgi?id=201135&action=review

>> Source/WebCore/platform/text/TextCodecICU.cpp:62
>> +    // TextEncoding name is always atomic, so the string is never deleted.
> 
> I am a little confused by "atomic" here. What do you mean?

TextEncodingRegistry has a map for encodings, see atomicCanonicalTextEncodingName().
Comment 5 Darin Adler 2013-05-10 17:22:57 PDT
Comment on attachment 201135 [details]
proposed patch

View in context: https://bugs.webkit.org/attachment.cgi?id=201135&action=review

> Source/WebCore/platform/text/TextCodecICU.cpp:32
> +#include "TextEncodingRegistry.h"

Why is this include needed?

>>> Source/WebCore/platform/text/TextCodecICU.cpp:62
>>> +    // TextEncoding name is always atomic, so the string is never deleted.
>> 
>> I am a little confused by "atomic" here. What do you mean?
> 
> TextEncodingRegistry has a map for encodings, see atomicCanonicalTextEncodingName().

Maybe word more like this?

    // TextEncoding name is always one from atomicCanonicalTextEncodingName, guaranteed to never be deleted.

> Source/WebCore/platform/text/TextCodecICU.cpp:230
> -    const char* name = m_encoding.name();
> -    m_needsGBKFallbacks = name[0] == 'G' && name[1] == 'B' && name[2] == 'K' && !name[3];
> +    m_needsGBKFallbacks = m_encodingName[0] == 'G' && m_encodingName[1] == 'B' && m_encodingName[2] == 'K' && !m_encodingName[3];

Seems to be a slight decrease in readability. The original author (me) thought that using a local made it more readable. It wasn’t performance optimization, but expression readability optimization.

> Source/WebCore/platform/text/TextCodecICU.cpp:437
> +    if (shouldShowBackslashAsCurrencySymbolIn(m_encodingName)) {

Why not optimize further by searching for '\\' and not making a copy if there is no '\\' character?

> Source/WebCore/platform/text/TextCodecICU.cpp:439
> +        copy.append(characters, length);
> +        copy.replace('\\', 0xA5);

This copies the characters twice. Instead, we could just use Vector<UChar> and modify the copy in place.

> Source/WebCore/platform/text/TextCodecICU.h:45
> +        TextCodecICU(const char* encoding);

Maybe mark this explicit?

> Source/WebCore/platform/text/TextCodecICU.h:59
> +        const char* m_encodingName;

Maybe even const char* const?
Comment 6 Alexey Proskuryakov 2013-05-10 22:45:20 PDT
>> +#include "TextEncodingRegistry.h"
>
> Why is this include needed?

We use shouldShowBackslashAsCurrencySymbolIn() that's declared in this header.

> Seems to be a slight decrease in readability. The original author (me) thought that using a local made it 
> more readable. It wasn’t performance optimization, but expression readability optimization.

For reference, this change was <http://trac.webkit.org/changeset/21227/trunk/WebCore/platform/TextCodecICU.cpp>. Before the change, the code would implicitly enumerate non-base codecs just to construct TextEncoding("GBK"). Interestingly, that's a historic example of why it's undesirable to use TextEncoding in TextCodecICU.cpp.

I'm unsure why we can't just use strcmp, and in fact, I have a local patch that moves this code, and changes it to use strcmp.

Made the other suggested changes, except for optimization ideas - this is just encode(), which is probably not worth any added risk.

Committed <http://trac.webkit.org/changeset/149924>.
Comment 7 Darin Adler 2013-05-11 16:18:20 PDT
(In reply to comment #6)
> > Seems to be a slight decrease in readability. The original author (me) thought that using a local made it 
> > more readable. It wasn’t performance optimization, but expression readability optimization.
> 
> For reference, this change was <http://trac.webkit.org/changeset/21227/trunk/WebCore/platform/TextCodecICU.cpp>. Before the change, the code would implicitly enumerate non-base codecs just to construct TextEncoding("GBK"). Interestingly, that's a historic example of why it's undesirable to use TextEncoding in TextCodecICU.cpp.
> 
> I'm unsure why we can't just use strcmp

I think that when I wrote this I thought that the four explicit comparisons would be faster than a strcmp. Depends how aggressive compilers are at optimizing strcmp, particularly when a string constant is involved. And also whether code size or speed is more important.

For readability, I think strcmp and memcmp are some of the worst functions. Remembering that "0 means equal" is confusing enough that we made an exception to our "== 0" rule for it, an exception that most people probably don’t realize it exists.

But a comparison with "GBK" is obviously better, if only there was a clearer way than strcmp ;-)