Bug 49714 - Yensign hack should work with Shift_JIS encoding
Summary: Yensign hack should work with Shift_JIS encoding
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: HTML Editing (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC All
: P2 Normal
Assignee: Nobody
URL: http://crbug.com/38653
Keywords:
Depends on:
Blocks:
 
Reported: 2010-11-17 22:47 PST by Kent Tamura
Modified: 2010-12-09 00:34 PST (History)
6 users (show)

See Also:


Attachments
Patch (5.67 KB, patch)
2010-11-17 22:52 PST, Kent Tamura
no flags Details | Formatted Diff | Diff
Patch 2 (6.95 KB, patch)
2010-11-18 23:15 PST, Kent Tamura
no flags Details | Formatted Diff | Diff
Patch 3 (6.81 KB, patch)
2010-11-18 23:36 PST, Kent Tamura
no flags Details | Formatted Diff | Diff
Patch 4 (12.68 KB, patch)
2010-11-22 01:37 PST, Kent Tamura
no flags Details | Formatted Diff | Diff
Patch 5 (12.71 KB, patch)
2010-11-22 02:55 PST, Kent Tamura
no flags Details | Formatted Diff | Diff
Patch 6 (12.85 KB, patch)
2010-11-22 02:59 PST, Kent Tamura
ap: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Kent Tamura 2010-11-17 22:47:35 PST
LayoutTests/editing/selection/find-yensign-backslash.html doesn't pass or does pass unexpectedly on non-Mac platforms because
 - Shift_JIS_X0213-2000 is available only with TextCodecMac, and TextCodecICU doesn't provide it.
 - TextEncoding::backslashAsCurrencySymbol() doesn't take care of a case that atomicCanonicalTextEncodingName() result is 0.
Comment 1 Kent Tamura 2010-11-17 22:52:33 PST
Created attachment 74205 [details]
Patch
Comment 2 Tony Chang 2010-11-18 09:37:59 PST
Comment on attachment 74205 [details]
Patch

Out of curiosity, does TextCodecMac have both Shift_JIS and Shift_JIS_X0213-2000?
Comment 3 Alexey Proskuryakov 2010-11-18 10:53:02 PST
Comment on attachment 74205 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=74205&action=review

Note that there are other places where we only check for the longer name - e.g. TextEncoding::backslashAsCurrencySymbol().

> Out of curiosity, does TextCodecMac have both Shift_JIS and Shift_JIS_X0213-2000?

Yes. The latter doesn't come from ICU, but from mac-encodings.txt.

I'd really expect a reviewer to know or at least research the answer to such a question before giving r+.

> WebCore/platform/text/TextEncoding.cpp:207
> +    // Shift_JIS_X0213-2000 is not the same encoding as Shift_JIS on Mac. We
> +    // need to check both of them.

This is a short line, it shouldn't be split into two.

> WebCore/platform/text/TextEncoding.cpp:210
> +    static const char* const shiftJis = atomicCanonicalTextEncodingName("Shift_JIS");
> +    static const char* const shiftJis2000 = atomicCanonicalTextEncodingName("Shift_JIS_X0213-2000");
> +    static const char* const eucJp = atomicCanonicalTextEncodingName("EUC-JP");

Coding style mistake - it's "Shift_JIS", not "Shift_Jis", so variables should be named accordingly. Ditto for EUC-JP.

> WebCore/platform/text/TextEncoding.cpp:211
> +    return ((shiftJis && m_name == shiftJis) || (shiftJis2000 && m_name == shiftJis2000) || (eucJp && m_name == eucJp)) ? 0x00A5 : '\\';

I'm not really sure how performance critical this function is, but it's certainly invoked a lot, so I wouldn't be adding unnecessary null checks.
Comment 4 Tony Chang 2010-11-18 11:13:05 PST
(In reply to comment #3)
> > WebCore/platform/text/TextEncoding.cpp:211
> > +    return ((shiftJis && m_name == shiftJis) || (shiftJis2000 && m_name == shiftJis2000) || (eucJp && m_name == eucJp)) ? 0x00A5 : '\\';
> 
> I'm not really sure how performance critical this function is, but it's certainly invoked a lot, so I wouldn't be adding unnecessary null checks.

I think this is a correctness check.  If the m_name is NULL and shiftJis2000 is NULL, we'll return the wrong value.  I suppose you could reduce the number of comparisons by just NULL checking m_name.
Comment 5 Alexey Proskuryakov 2010-11-18 11:19:45 PST
It were null checks for Shift_JIS and EUC-JP that caught my eye - every reasonable back-end should have those.

Perhaps Shift_JIS_X0213-2000 could just go inside #if PLATFORM(MAC). My recollection is that it's there simply because it was supported in Safari prior to switching to ICU, and no one had the gut to remove it.
Comment 6 Shinichiro Hamaji 2010-11-18 12:58:01 PST
As this hack is for compatibilities with IE, I think we also want other encodings with which Japanese fonts will be used. Specifically, I'd add ISO-2022-JP and ISO-2022-JP-2. Could you add them in this chance? I think we can use a HashSet like TextEncoding::isJapanese.
Comment 7 Kent Tamura 2010-11-18 22:41:16 PST
(In reply to comment #4)
> I suppose you could reduce the number of comparisons by just NULL checking m_name.

It's a good idea.
Comment 8 Kent Tamura 2010-11-18 22:49:34 PST
(In reply to comment #6)
> As this hack is for compatibilities with IE, I think we also want other encodings with which Japanese fonts will be used. Specifically, I'd add ISO-2022-JP and ISO-2022-JP-2. Could you add them in this chance? I think we can use a HashSet like TextEncoding::isJapanese.

I tested some encoding names with IE.
IE showed yensign for 0x5c in the following encoding names:
 EUC-JP
 Shift_JIS, Windows-31J, x-sjis
 Shift_JIS_X0213-2000
 x-mac-japanese
 ISO-2022-JP

IE showed a backslash:
 cp932
 JIS_X0201
 JIS_X0208-1983
 JIS_X0208-1990
 JIS_X0212-1990
 JIS_C6226-1978
 ISO-2022-JP-1
 ISO-2022-JP-2
 ISO-2022-JP-3
Comment 9 Kent Tamura 2010-11-18 23:15:19 PST
Created attachment 74356 [details]
Patch 2
Comment 10 Kent Tamura 2010-11-18 23:36:36 PST
Created attachment 74357 [details]
Patch 3
Comment 11 Alexey Proskuryakov 2010-11-18 23:42:17 PST
Comment on attachment 74357 [details]
Patch 3

View in context: https://bugs.webkit.org/attachment.cgi?id=74357&action=review

> WebCore/platform/text/TextEncoding.cpp:204
> +    DEFINE_STATIC_LOCAL(HashSet<const char*>, set, ());

Can this code be run from a secondary thread?
Comment 12 Kent Tamura 2010-11-18 23:47:19 PST
Comment on attachment 74357 [details]
Patch 3

View in context: https://bugs.webkit.org/attachment.cgi?id=74357&action=review

>> WebCore/platform/text/TextEncoding.cpp:204
>> +    DEFINE_STATIC_LOCAL(HashSet<const char*>, set, ());
> 
> Can this code be run from a secondary thread?

This code is not MT-safe.  Is TextEncoding class used by multiple threads?
TextEncoding::isJapanese() has the same issue.
Comment 13 Kent Tamura 2010-11-19 00:11:23 PST
(In reply to comment #12)
> >> WebCore/platform/text/TextEncoding.cpp:204
> >> +    DEFINE_STATIC_LOCAL(HashSet<const char*>, set, ());
> > 
> > Can this code be run from a secondary thread?
> 
> This code is not MT-safe.  Is TextEncoding class used by multiple threads?
> TextEncoding::isJapanese() has the same issue.

If the main thread always creates a TextEncoding instance and calls TextEncoding::isJapanese() before a secondary thread is created, we have no thread-safety issue.
Comment 14 Alexey Proskuryakov 2010-11-19 00:40:52 PST
I remember that TextEncodingRegistry had multi-threading issues, but don't remember which exactly. I'm not sure about these particular TextCodec functions. At the very least, they need isMainThread() assertions.

One direction to try would be checking svn blame for encodingRegistryMutex to see why it was added.
Comment 15 Kent Tamura 2010-11-22 01:36:03 PST
(In reply to comment #14)
> One direction to try would be checking svn blame for encodingRegistryMutex to see why it was added.

I checked Bug 22341, and found no detail of the reason :-)

Anyway, my "Patch 3" is wrong. If no extended encodings are used before starting workers and then extended encodings are used in the workers, multiple threads can enter the HashSet initialization path.

I tried AtomicallyInitializedStatic(), but it produced a build error about global destructor.  So I decided to move these HashSets to TextEncodingRegistry.
Comment 16 Kent Tamura 2010-11-22 01:37:01 PST
Created attachment 74532 [details]
Patch 4
Comment 17 Alexey Proskuryakov 2010-11-22 01:42:47 PST
> I checked Bug 22341, and found no detail of the reason :-)

Well, it explains that we were getting there through KURL in workers.
Comment 18 Alexey Proskuryakov 2010-11-22 01:49:23 PST
Comment on attachment 74532 [details]
Patch 4

+        IE shows a yensign for 0x5c code point encoded in x-mac-japanese,
+        ISO-2022-JP, EUC-JP, Shift_JIS, Shift_JIS_X0213-2000, x-sjis, and
+        Windows-31J. We follow it.

This explanation is misleading. IE chooses a default font based on encoding, and some Windows fonts have a yen sign in place of backslash. We need much more complicated logic because fonts on other platforms are different, and don't have this quirk.
Comment 19 Kent Tamura 2010-11-22 02:55:40 PST
Created attachment 74535 [details]
Patch 5
Comment 20 Kent Tamura 2010-11-22 02:59:08 PST
Created attachment 74536 [details]
Patch 6
Comment 21 Kent Tamura 2010-11-22 02:59:45 PST
(In reply to comment #18)
> (From update of attachment 74532 [details])
> +        IE shows a yensign for 0x5c code point encoded in x-mac-japanese,
> +        ISO-2022-JP, EUC-JP, Shift_JIS, Shift_JIS_X0213-2000, x-sjis, and
> +        Windows-31J. We follow it.
> 
> This explanation is misleading. IE chooses a default font based on encoding, and some Windows fonts have a yen sign in place of backslash. We need much more complicated logic because fonts on other platforms are different, and don't have this quirk.

Ok, I have updated ChangeLog.
Comment 22 Alexey Proskuryakov 2010-12-06 20:55:28 PST
Comment on attachment 74536 [details]
Patch 6

View in context: https://bugs.webkit.org/attachment.cgi?id=74536&action=review

> WebCore/ChangeLog:10
> +        IE chooses a font which shows a yensign for 0x5c code point for a page
> +        encoded in x-mac-japanese, ISO-2022-JP, EUC-JP, Shift_JIS, Shift_JIS_X0213-2000,
> +        x-sjis, and Windows-31J.

I thought IE didn't support Shift_JIS_X0213-2000.

> WebCore/ChangeLog:16
> +        Also, we move the HashSet initialization for isJapanese() and
> +        backslashAsCurrencySymbol() to TextEncodingRegistry.cpp because of
> +        ease of making them multi-thread safe.

This is not a great reason - TextEncodingRegistry does not and should not care about backslashes.

> WebCore/platform/text/TextEncoding.cpp:166
>  bool TextEncoding::isJapanese() const

This function is misplaced - all it is needed for is to make a decision on whether to invoke charset detection. So it should be in TextResourceDecoder (not something for this patch, of course).
Comment 23 Kent Tamura 2010-12-08 16:05:27 PST
(In reply to comment #22)
> (From update of attachment 74536 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=74536&action=review
> 
> > WebCore/ChangeLog:10
> > +        IE chooses a font which shows a yensign for 0x5c code point for a page
> > +        encoded in x-mac-japanese, ISO-2022-JP, EUC-JP, Shift_JIS, Shift_JIS_X0213-2000,
> > +        x-sjis, and Windows-31J.
> 
> I thought IE didn't support Shift_JIS_X0213-2000.

I was surprised when I saw IE switched the default font for it.
I guess IE uses prefix-matching for encoding names.

> > WebCore/ChangeLog:16
> > +        Also, we move the HashSet initialization for isJapanese() and
> > +        backslashAsCurrencySymbol() to TextEncodingRegistry.cpp because of
> > +        ease of making them multi-thread safe.
> 
> This is not a great reason - TextEncodingRegistry does not and should not care about backslashes.

However, the encoding names for backslashAsCurrencySymbol() depend on TextEncodingRegistry.  If one added new Japanese encoding to TextEncodingRegistry, he/she might need to add it to nonBackslashEncodings.

Anyway, I also feel the new code is not good.  I'll add a FIXME comment.
Comment 24 Kent Tamura 2010-12-08 16:55:18 PST
Landed: http://trac.webkit.org/changeset/73566
Comment 25 WebKit Review Bot 2010-12-09 00:34:46 PST
http://trac.webkit.org/changeset/73566 might have broken GTK Linux 32-bit Release
The following tests are not passing:
fast/css/input-search-padding.html
fast/css/text-input-with-webkit-border-radius.html
fast/forms/box-shadow-override.html
fast/forms/control-restrict-line-height.html
fast/forms/input-appearance-height.html
fast/forms/placeholder-pseudo-style.html
fast/forms/placeholder-set-value.html
fast/forms/search-cancel-button-style-sharing.html
fast/forms/search-placeholder-value-changed.html
fast/forms/search-rtl.html
fast/forms/search-styled.html
fast/forms/search-transformed.html
fast/forms/search-vertical-alignment.html
fast/forms/search-zoomed.html
fast/forms/searchfield-heights.html