115848 – Make TextCodecICU not depend on TextEncoding

RESOLVED FIXED115848

Make TextCodecICU not depend on TextEncoding

https://bugs.webkit.org/show_bug.cgi?id=115848

Summary Make TextCodecICU not depend on TextEncoding

Alexey Proskuryakov

Reported 2013-05-08 22:11:05 PDT

TextEncoding is a WebCore concept that builds on TextCodec, which is a platform API wrapper, so using the former in TextCodecICU is confusing. It's used for two reasons: 1. Passing an encoding as TextEncoding guarantees that the encoding name has been already canonicalized. 2. We use TextEncoding::displayString() to tweak Asian encoders, so that they have desired behavior with backslash. I think that even though the string transformation is the same in both cases, using displayString for this is conceptually unsound.

Attachments
proposed patch (5.85 KB, patch) 2013-05-08 22:14 PDT, Alexey Proskuryakov	darin: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2013-05-08 22:14:16 PDT

Created attachment 201135 [details] proposed patch I'm looking into this because I plan to add an ICU codec name pointer to TextCodecICU for <rdar://problem/13823864>.

WebKit Commit Bot

Comment 2 2013-05-08 22:15:36 PDT

Attachment 201135 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCore/platform/text/TextCodecICU.cpp', u'Source/WebCore/platform/text/TextCodecICU.h']" exit_code: 1 Source/WebCore/platform/text/TextCodecICU.cpp:339: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5] Total errors found: 1 in 3 files If any of these errors are false positives, please file a bug against check-webkit-style.

Benjamin Poulain

Comment 3 2013-05-09 01:46:11 PDT

Comment on attachment 201135 [details] proposed patch View in context: https://bugs.webkit.org/attachment.cgi?id=201135&action=review > Source/WebCore/platform/text/TextCodecICU.cpp:62 > + // TextEncoding name is always atomic, so the string is never deleted. I am a little confused by "atomic" here. What do you mean?

Alexey Proskuryakov

Comment 4 2013-05-09 08:45:22 PDT

Comment on attachment 201135 [details] proposed patch View in context: https://bugs.webkit.org/attachment.cgi?id=201135&action=review >> Source/WebCore/platform/text/TextCodecICU.cpp:62 >> + // TextEncoding name is always atomic, so the string is never deleted. > > I am a little confused by "atomic" here. What do you mean? TextEncodingRegistry has a map for encodings, see atomicCanonicalTextEncodingName().

Darin Adler

Comment 5 2013-05-10 17:22:57 PDT

Comment on attachment 201135 [details] proposed patch View in context: https://bugs.webkit.org/attachment.cgi?id=201135&action=review > Source/WebCore/platform/text/TextCodecICU.cpp:32 > +#include "TextEncodingRegistry.h" Why is this include needed? >>> Source/WebCore/platform/text/TextCodecICU.cpp:62 >>> + // TextEncoding name is always atomic, so the string is never deleted. >> >> I am a little confused by "atomic" here. What do you mean? > > TextEncodingRegistry has a map for encodings, see atomicCanonicalTextEncodingName(). Maybe word more like this? // TextEncoding name is always one from atomicCanonicalTextEncodingName, guaranteed to never be deleted. > Source/WebCore/platform/text/TextCodecICU.cpp:230 > - const char* name = m_encoding.name(); > - m_needsGBKFallbacks = name[0] == 'G' && name[1] == 'B' && name[2] == 'K' && !name[3]; > + m_needsGBKFallbacks = m_encodingName[0] == 'G' && m_encodingName[1] == 'B' && m_encodingName[2] == 'K' && !m_encodingName[3]; Seems to be a slight decrease in readability. The original author (me) thought that using a local made it more readable. It wasn’t performance optimization, but expression readability optimization. > Source/WebCore/platform/text/TextCodecICU.cpp:437 > + if (shouldShowBackslashAsCurrencySymbolIn(m_encodingName)) { Why not optimize further by searching for '\\' and not making a copy if there is no '\\' character? > Source/WebCore/platform/text/TextCodecICU.cpp:439 > + copy.append(characters, length); > + copy.replace('\\', 0xA5); This copies the characters twice. Instead, we could just use Vector<UChar> and modify the copy in place. > Source/WebCore/platform/text/TextCodecICU.h:45 > + TextCodecICU(const char* encoding); Maybe mark this explicit? > Source/WebCore/platform/text/TextCodecICU.h:59 > + const char* m_encodingName; Maybe even const char* const?

Alexey Proskuryakov

Comment 6 2013-05-10 22:45:20 PDT

>> +#include "TextEncodingRegistry.h" > > Why is this include needed? We use shouldShowBackslashAsCurrencySymbolIn() that's declared in this header. > Seems to be a slight decrease in readability. The original author (me) thought that using a local made it > more readable. It wasn’t performance optimization, but expression readability optimization. For reference, this change was <http://trac.webkit.org/changeset/21227/trunk/WebCore/platform/TextCodecICU.cpp>. Before the change, the code would implicitly enumerate non-base codecs just to construct TextEncoding("GBK"). Interestingly, that's a historic example of why it's undesirable to use TextEncoding in TextCodecICU.cpp. I'm unsure why we can't just use strcmp, and in fact, I have a local patch that moves this code, and changes it to use strcmp. Made the other suggested changes, except for optimization ideas - this is just encode(), which is probably not worth any added risk. Committed <http://trac.webkit.org/changeset/149924>.

Darin Adler

Comment 7 2013-05-11 16:18:20 PDT

(In reply to comment #6) > > Seems to be a slight decrease in readability. The original author (me) thought that using a local made it > > more readable. It wasn’t performance optimization, but expression readability optimization. > > For reference, this change was <http://trac.webkit.org/changeset/21227/trunk/WebCore/platform/TextCodecICU.cpp>. Before the change, the code would implicitly enumerate non-base codecs just to construct TextEncoding("GBK"). Interestingly, that's a historic example of why it's undesirable to use TextEncoding in TextCodecICU.cpp. > > I'm unsure why we can't just use strcmp I think that when I wrote this I thought that the four explicit comparisons would be faster than a strcmp. Depends how aggressive compilers are at optimizing strcmp, particularly when a string constant is involved. And also whether code size or speed is more important. For readability, I think strcmp and memcmp are some of the worst functions. Remembering that "0 means equal" is confusing enough that we made an exception to our "== 0" rule for it, an exception that most people probably don’t realize it exists. But a comparison with "GBK" is obviously better, if only there was a clearer way than strcmp ;-)

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component WebCore Misc.

Assignee

Alexey Proskuryakov

Reported

2013-05-08 22:11 PDT

Modified

2013-05-11 16:18 PDT History

CC List

3 users Show

URL

Keywords

Depends on

Blocks