Bug 49714

Summary: Yensign hack should work with Shift_JIS encoding
Product: WebKit Reporter: Kent Tamura <tkent>
Component: HTML EditingAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, ap, eric, hamaji, tony, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: All   
URL: http://crbug.com/38653
Attachments:
Description Flags
Patch
none
Patch 2
none
Patch 3
none
Patch 4
none
Patch 5
none
Patch 6 ap: review+

Kent Tamura
Reported 2010-11-17 22:47:35 PST
LayoutTests/editing/selection/find-yensign-backslash.html doesn't pass or does pass unexpectedly on non-Mac platforms because - Shift_JIS_X0213-2000 is available only with TextCodecMac, and TextCodecICU doesn't provide it. - TextEncoding::backslashAsCurrencySymbol() doesn't take care of a case that atomicCanonicalTextEncodingName() result is 0.
Attachments
Patch (5.67 KB, patch)
2010-11-17 22:52 PST, Kent Tamura
no flags
Patch 2 (6.95 KB, patch)
2010-11-18 23:15 PST, Kent Tamura
no flags
Patch 3 (6.81 KB, patch)
2010-11-18 23:36 PST, Kent Tamura
no flags
Patch 4 (12.68 KB, patch)
2010-11-22 01:37 PST, Kent Tamura
no flags
Patch 5 (12.71 KB, patch)
2010-11-22 02:55 PST, Kent Tamura
no flags
Patch 6 (12.85 KB, patch)
2010-11-22 02:59 PST, Kent Tamura
ap: review+
Kent Tamura
Comment 1 2010-11-17 22:52:33 PST
Tony Chang
Comment 2 2010-11-18 09:37:59 PST
Comment on attachment 74205 [details] Patch Out of curiosity, does TextCodecMac have both Shift_JIS and Shift_JIS_X0213-2000?
Alexey Proskuryakov
Comment 3 2010-11-18 10:53:02 PST
Comment on attachment 74205 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=74205&action=review Note that there are other places where we only check for the longer name - e.g. TextEncoding::backslashAsCurrencySymbol(). > Out of curiosity, does TextCodecMac have both Shift_JIS and Shift_JIS_X0213-2000? Yes. The latter doesn't come from ICU, but from mac-encodings.txt. I'd really expect a reviewer to know or at least research the answer to such a question before giving r+. > WebCore/platform/text/TextEncoding.cpp:207 > + // Shift_JIS_X0213-2000 is not the same encoding as Shift_JIS on Mac. We > + // need to check both of them. This is a short line, it shouldn't be split into two. > WebCore/platform/text/TextEncoding.cpp:210 > + static const char* const shiftJis = atomicCanonicalTextEncodingName("Shift_JIS"); > + static const char* const shiftJis2000 = atomicCanonicalTextEncodingName("Shift_JIS_X0213-2000"); > + static const char* const eucJp = atomicCanonicalTextEncodingName("EUC-JP"); Coding style mistake - it's "Shift_JIS", not "Shift_Jis", so variables should be named accordingly. Ditto for EUC-JP. > WebCore/platform/text/TextEncoding.cpp:211 > + return ((shiftJis && m_name == shiftJis) || (shiftJis2000 && m_name == shiftJis2000) || (eucJp && m_name == eucJp)) ? 0x00A5 : '\\'; I'm not really sure how performance critical this function is, but it's certainly invoked a lot, so I wouldn't be adding unnecessary null checks.
Tony Chang
Comment 4 2010-11-18 11:13:05 PST
(In reply to comment #3) > > WebCore/platform/text/TextEncoding.cpp:211 > > + return ((shiftJis && m_name == shiftJis) || (shiftJis2000 && m_name == shiftJis2000) || (eucJp && m_name == eucJp)) ? 0x00A5 : '\\'; > > I'm not really sure how performance critical this function is, but it's certainly invoked a lot, so I wouldn't be adding unnecessary null checks. I think this is a correctness check. If the m_name is NULL and shiftJis2000 is NULL, we'll return the wrong value. I suppose you could reduce the number of comparisons by just NULL checking m_name.
Alexey Proskuryakov
Comment 5 2010-11-18 11:19:45 PST
It were null checks for Shift_JIS and EUC-JP that caught my eye - every reasonable back-end should have those. Perhaps Shift_JIS_X0213-2000 could just go inside #if PLATFORM(MAC). My recollection is that it's there simply because it was supported in Safari prior to switching to ICU, and no one had the gut to remove it.
Shinichiro Hamaji
Comment 6 2010-11-18 12:58:01 PST
As this hack is for compatibilities with IE, I think we also want other encodings with which Japanese fonts will be used. Specifically, I'd add ISO-2022-JP and ISO-2022-JP-2. Could you add them in this chance? I think we can use a HashSet like TextEncoding::isJapanese.
Kent Tamura
Comment 7 2010-11-18 22:41:16 PST
(In reply to comment #4) > I suppose you could reduce the number of comparisons by just NULL checking m_name. It's a good idea.
Kent Tamura
Comment 8 2010-11-18 22:49:34 PST
(In reply to comment #6) > As this hack is for compatibilities with IE, I think we also want other encodings with which Japanese fonts will be used. Specifically, I'd add ISO-2022-JP and ISO-2022-JP-2. Could you add them in this chance? I think we can use a HashSet like TextEncoding::isJapanese. I tested some encoding names with IE. IE showed yensign for 0x5c in the following encoding names: EUC-JP Shift_JIS, Windows-31J, x-sjis Shift_JIS_X0213-2000 x-mac-japanese ISO-2022-JP IE showed a backslash: cp932 JIS_X0201 JIS_X0208-1983 JIS_X0208-1990 JIS_X0212-1990 JIS_C6226-1978 ISO-2022-JP-1 ISO-2022-JP-2 ISO-2022-JP-3
Kent Tamura
Comment 9 2010-11-18 23:15:19 PST
Created attachment 74356 [details] Patch 2
Kent Tamura
Comment 10 2010-11-18 23:36:36 PST
Created attachment 74357 [details] Patch 3
Alexey Proskuryakov
Comment 11 2010-11-18 23:42:17 PST
Comment on attachment 74357 [details] Patch 3 View in context: https://bugs.webkit.org/attachment.cgi?id=74357&action=review > WebCore/platform/text/TextEncoding.cpp:204 > + DEFINE_STATIC_LOCAL(HashSet<const char*>, set, ()); Can this code be run from a secondary thread?
Kent Tamura
Comment 12 2010-11-18 23:47:19 PST
Comment on attachment 74357 [details] Patch 3 View in context: https://bugs.webkit.org/attachment.cgi?id=74357&action=review >> WebCore/platform/text/TextEncoding.cpp:204 >> + DEFINE_STATIC_LOCAL(HashSet<const char*>, set, ()); > > Can this code be run from a secondary thread? This code is not MT-safe. Is TextEncoding class used by multiple threads? TextEncoding::isJapanese() has the same issue.
Kent Tamura
Comment 13 2010-11-19 00:11:23 PST
(In reply to comment #12) > >> WebCore/platform/text/TextEncoding.cpp:204 > >> + DEFINE_STATIC_LOCAL(HashSet<const char*>, set, ()); > > > > Can this code be run from a secondary thread? > > This code is not MT-safe. Is TextEncoding class used by multiple threads? > TextEncoding::isJapanese() has the same issue. If the main thread always creates a TextEncoding instance and calls TextEncoding::isJapanese() before a secondary thread is created, we have no thread-safety issue.
Alexey Proskuryakov
Comment 14 2010-11-19 00:40:52 PST
I remember that TextEncodingRegistry had multi-threading issues, but don't remember which exactly. I'm not sure about these particular TextCodec functions. At the very least, they need isMainThread() assertions. One direction to try would be checking svn blame for encodingRegistryMutex to see why it was added.
Kent Tamura
Comment 15 2010-11-22 01:36:03 PST
(In reply to comment #14) > One direction to try would be checking svn blame for encodingRegistryMutex to see why it was added. I checked Bug 22341, and found no detail of the reason :-) Anyway, my "Patch 3" is wrong. If no extended encodings are used before starting workers and then extended encodings are used in the workers, multiple threads can enter the HashSet initialization path. I tried AtomicallyInitializedStatic(), but it produced a build error about global destructor. So I decided to move these HashSets to TextEncodingRegistry.
Kent Tamura
Comment 16 2010-11-22 01:37:01 PST
Created attachment 74532 [details] Patch 4
Alexey Proskuryakov
Comment 17 2010-11-22 01:42:47 PST
> I checked Bug 22341, and found no detail of the reason :-) Well, it explains that we were getting there through KURL in workers.
Alexey Proskuryakov
Comment 18 2010-11-22 01:49:23 PST
Comment on attachment 74532 [details] Patch 4 + IE shows a yensign for 0x5c code point encoded in x-mac-japanese, + ISO-2022-JP, EUC-JP, Shift_JIS, Shift_JIS_X0213-2000, x-sjis, and + Windows-31J. We follow it. This explanation is misleading. IE chooses a default font based on encoding, and some Windows fonts have a yen sign in place of backslash. We need much more complicated logic because fonts on other platforms are different, and don't have this quirk.
Kent Tamura
Comment 19 2010-11-22 02:55:40 PST
Created attachment 74535 [details] Patch 5
Kent Tamura
Comment 20 2010-11-22 02:59:08 PST
Created attachment 74536 [details] Patch 6
Kent Tamura
Comment 21 2010-11-22 02:59:45 PST
(In reply to comment #18) > (From update of attachment 74532 [details]) > + IE shows a yensign for 0x5c code point encoded in x-mac-japanese, > + ISO-2022-JP, EUC-JP, Shift_JIS, Shift_JIS_X0213-2000, x-sjis, and > + Windows-31J. We follow it. > > This explanation is misleading. IE chooses a default font based on encoding, and some Windows fonts have a yen sign in place of backslash. We need much more complicated logic because fonts on other platforms are different, and don't have this quirk. Ok, I have updated ChangeLog.
Alexey Proskuryakov
Comment 22 2010-12-06 20:55:28 PST
Comment on attachment 74536 [details] Patch 6 View in context: https://bugs.webkit.org/attachment.cgi?id=74536&action=review > WebCore/ChangeLog:10 > + IE chooses a font which shows a yensign for 0x5c code point for a page > + encoded in x-mac-japanese, ISO-2022-JP, EUC-JP, Shift_JIS, Shift_JIS_X0213-2000, > + x-sjis, and Windows-31J. I thought IE didn't support Shift_JIS_X0213-2000. > WebCore/ChangeLog:16 > + Also, we move the HashSet initialization for isJapanese() and > + backslashAsCurrencySymbol() to TextEncodingRegistry.cpp because of > + ease of making them multi-thread safe. This is not a great reason - TextEncodingRegistry does not and should not care about backslashes. > WebCore/platform/text/TextEncoding.cpp:166 > bool TextEncoding::isJapanese() const This function is misplaced - all it is needed for is to make a decision on whether to invoke charset detection. So it should be in TextResourceDecoder (not something for this patch, of course).
Kent Tamura
Comment 23 2010-12-08 16:05:27 PST
(In reply to comment #22) > (From update of attachment 74536 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=74536&action=review > > > WebCore/ChangeLog:10 > > + IE chooses a font which shows a yensign for 0x5c code point for a page > > + encoded in x-mac-japanese, ISO-2022-JP, EUC-JP, Shift_JIS, Shift_JIS_X0213-2000, > > + x-sjis, and Windows-31J. > > I thought IE didn't support Shift_JIS_X0213-2000. I was surprised when I saw IE switched the default font for it. I guess IE uses prefix-matching for encoding names. > > WebCore/ChangeLog:16 > > + Also, we move the HashSet initialization for isJapanese() and > > + backslashAsCurrencySymbol() to TextEncodingRegistry.cpp because of > > + ease of making them multi-thread safe. > > This is not a great reason - TextEncodingRegistry does not and should not care about backslashes. However, the encoding names for backslashAsCurrencySymbol() depend on TextEncodingRegistry. If one added new Japanese encoding to TextEncodingRegistry, he/she might need to add it to nonBackslashEncodings. Anyway, I also feel the new code is not good. I'll add a FIXME comment.
Kent Tamura
Comment 24 2010-12-08 16:55:18 PST
WebKit Review Bot
Comment 25 2010-12-09 00:34:46 PST
http://trac.webkit.org/changeset/73566 might have broken GTK Linux 32-bit Release The following tests are not passing: fast/css/input-search-padding.html fast/css/text-input-with-webkit-border-radius.html fast/forms/box-shadow-override.html fast/forms/control-restrict-line-height.html fast/forms/input-appearance-height.html fast/forms/placeholder-pseudo-style.html fast/forms/placeholder-set-value.html fast/forms/search-cancel-button-style-sharing.html fast/forms/search-placeholder-value-changed.html fast/forms/search-rtl.html fast/forms/search-styled.html fast/forms/search-transformed.html fast/forms/search-vertical-alignment.html fast/forms/search-zoomed.html fast/forms/searchfield-heights.html
Note You need to log in before you can comment on or make changes to this bug.