Bug 30437

Summary:

REGRESSION: Japanese text search ignores small vs. large and voicing mark differences

Product:

WebKit

Reporter:

Satoshi Nakagawa <artension>

Component:

WebCore Misc.

Assignee:

Darin Adler <darin>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

ap, commit-queue, darin, eric, jshin, justin.garcia, keishi, mitz

Priority:

Keywords:

InRadar

Version:

528+ (Nightly build)

Hardware:

All

OS:

All

URL:

http://limechat.net/report/webkit-search-problem.html

Attachments:

Description	Flags
A patch to fix this problem	none
Revised patch	eric: commit-queue-
Revised patch 2	none
Revised patch 3	none
work in progress	none
patch	ap: review+

Satoshi Nakagawa

Reported 2009-10-16 04:36:41 PDT

== Summary == In Japanese, 'ぁ' and 'あ' are treated as different characters in anytime. 'か' and 'が' are as well. But in Safari and Chrome, they are treated as the same characters in its search. == Description == As you know in English, abc and ABC are treated as the same in a case insensitive context like application searches. But in Japanese, for example, "あった" and "あつた" are always different words in any contexts. Because in Japanese semantics, 'っ' is NOT considered as a small form of 'つ'. These characters are never treated as the same characters. In the current Unicode Collation Algorithm, っ and つ are in the same order in the primary collation strength. WebKit uses the primary collation strength in ICU for its search. I reported this problem in the Unicode ML. (http://unicode.org/mail-arch/unicode-ml/y2009-m10/0019.html) Mark Davis replied to my report. (http://unicode.org/mail-arch/unicode-ml/y2009-m10/0022.html) > UTS#10 does not necessarily match the sorting of any particular language. It means we cannot use ICU's search function directly for application searches. It needs some tailoring in the collation table for some languages. I wrote a patch for WebKit to add the following tailoring rules for Japanese text search. This patch doesn't have any regression in the other languages.

Attachments
A patch to fix this problem (4.61 KB, patch) 2009-10-16 04:39 PDT, Satoshi Nakagawa	no flags	Details Formatted Diff Diff
Revised patch (11.67 KB, patch) 2009-10-18 22:02 PDT, Satoshi Nakagawa	eric: commit-queue-	Details Formatted Diff Diff
Revised patch 2 (11.88 KB, patch) 2009-10-19 23:57 PDT, Satoshi Nakagawa	no flags	Details Formatted Diff Diff
Revised patch 3 (11.96 KB, patch) 2009-10-20 19:08 PDT, Satoshi Nakagawa	no flags	Details Formatted Diff Diff
work in progress (8.33 KB, patch) 2010-01-07 17:56 PST, Darin Adler	no flags	Details Formatted Diff Diff
patch (38.43 KB, patch) 2010-01-10 17:39 PST, Darin Adler	ap: review+	Details Formatted Diff Diff
Show Obsolete (5) View All Add attachment proposed patch, testcase, etc.

Satoshi Nakagawa

Comment 1 2009-10-16 04:39:05 PDT

Created attachment 41282 [details] A patch to fix this problem

Satoshi Nakagawa

Comment 2 2009-10-16 04:44:12 PDT

The tailoring rules are like this: &ぁ=ァ=ｧ<あ=ア=ｱ<ぃ=ィ=ｨ<い=イ=ｲ<ぅ=ゥ=ｩ<う=ウ=ｳ<ゔ=ヴ<ぇ=ェ=ｪ<え=エ=ｴ<ぉ=ォ=ｫ<お=オ=ｵ <ゕ=ヵ<か=カ=ｶ<が=ガ<き=キ=ｷ<ぎ=ギ<く=ク=ｸ<ぐ=グ<ゖ=ヶ<け=ケ=ｹ<げ=ゲ<こ=コ=ｺ<ご=ゴ <さ=サ=ｻ<ざ=ザ<し=シ=ｼ<じ=ジ<す=ス=ｽ<ず=ズ<せ=セ=ｾ<ぜ=ゼ<そ=ソ=ｿ<ぞ=ゾ <た=タ=ﾀ<だ=ダ<ち=チ=ﾁ<ぢ=ヂ<っ=ッ=ｯ<つ=ツ=ﾂ<づ=ヅ<て=テ=ﾃ<で=デ<と=ト=ﾄ<ど=ド <な=ナ=ﾅ<に=ニ=ﾆ<ぬ=ヌ=ﾇ<ね=ネ=ﾈ<の=ノ=ﾉ <は=ハ=ﾊ<ば=バ<ぱ=パ<ひ=ヒ=ﾋ<び=ビ<ぴ=ピ<ふ=フ=ﾌ<ぶ=ブ<ぷ=プ<へ=ヘ=ﾍ<べ=ベ<ぺ=ペ<ほ=ホ=ﾎ<ぼ=ボ<ぽ=ポ <ま=マ=ﾏ<み=ミ=ﾐ<む=ム=ﾑ<め=メ=ﾒ<も=モ=ﾓ <ゃ=ャ=ｬ<や=ヤ=ﾔ<ゅ=ュ=ｭ<ゆ=ユ=ﾕ<ょ=ョ=ｮ<よ=ヨ=ﾖ <ら=ラ=ﾗ<り=リ=ﾘ<る=ル=ﾙ<れ=レ=ﾚ<ろ=ロ=ﾛ <ゎ=ヮ<わ=ワ=ﾜ<ヷ<ゐ=ヰ<ヸ<ゑ=ヱ<を=ヲ=ｦ<ん=ン=ﾝ For example, "ぁ=ァ=ｧ<あ=ア=ｱ" means: - 'ぁ', 'ァ' and 'ァ' are in the same order. - 'ぁ', 'ァ' and 'ァ' are smaller than 'あ', 'ア' and 'ｱ'. You can see some screenshots that describe this problem and the effect of this patch. http://limechat.net/report/webkit-search-problem.html

Alexey Proskuryakov

Comment 3 2009-10-16 15:54:23 PDT

A fix like this definitely needs include a regression test. See <http://trac.webkit.org/browser/trunk/LayoutTests/fast/text/find-case-folding.html> for an example of how it could be done. Some more information about contributing code to WebKit is available at <http://webkit.org/coding/contributing.html>. This may seem like a lot to ask for, but it's vital for maintainability of the project. See also: bug 27587. Once you have a patch ready for review, please mark it as such by setting review flag to "?".

Darin Adler

Comment 4 2009-10-16 17:58:11 PDT

Comment on attachment 41282 [details] A patch to fix this problem This code should go inside createSearcher(), not inside searcher().

Satoshi Nakagawa

Comment 5 2009-10-18 22:02:21 PDT

Created attachment 41396 [details] Revised patch

Satoshi Nakagawa

Comment 6 2009-10-18 22:07:35 PDT

Thanks for your comments. I've uploaded a revised patch. About bug 27587, IMO it's natural to treat "シ" and "㋛" as different charactes. "アパート" and "㌀" as well.

Darin Adler

Comment 7 2009-10-19 15:26:19 PDT

Comment on attachment 41396 [details] Revised patch > + * editing/TextIterator.cpp: > + (WebCore::): > + (WebCore::createSearcher): The line saying just "WebCore::" should be deleted. > +// Tailored collation rules for Japanese text search. > +// The default Unicode Collation Algorithm is unnatural in Japanese. > +// These rules intend to treat the following characters as different characters. > +// > +// - Small kana letters and normal kana letters > +// - Voiceless letters, voiced letters and semi-voiced letters > +// This comment should document where this array came from. Is this original work or did you copy this here from some other project? > +static const UChar JAPANESE_KANA_COLLATION_RULES[] = { This array should not have a name that's in all capitals. Those names are reserved for macros. > UErrorCode status = U_ZERO_ERROR; > UStringSearch* searcher = usearch_open(&newlineCharacter, 1, &newlineCharacter, 1, currentSearchLocaleID(), 0, &status); > ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); > + > + static UCollator* collator = 0; > + if (!collator) { > + // Set tailored collation rules to fix Japanese text search. > + // See the comments before JAPANESE_KANA_COLLATION_RULES for details. > + status = U_ZERO_ERROR; > + collator = ucol_openRules(JAPANESE_KANA_COLLATION_RULES, -1, UCOL_DEFAULT, > + UCOL_DEFAULT_STRENGTH, 0, &status); > + ASSERT(status == U_ZERO_ERROR); > + status = U_ZERO_ERROR; > + usearch_setCollator(searcher, collator, &status); > + ASSERT(status == U_ZERO_ERROR); > + usearch_reset(searcher); > + } This is OK, but not quite right. The usearch_setCollator and usearch_reset calls should be outside the if statement, since they are part of the creation of searcher, not the creation of the collator. However, since the createSearcher function is only called once, there's no problem in practice. The entire function should ideally be refactored to match the way the searcher() and createSearcher() function work. There would be a collator() and createCollator() function, and createSearcher() would call collator(). Otherwise, the patch looks good. I'm going to say r=me because I think it's OK to land this patch as is, but I think it would be even better if you took my suggestions above.

Eric Seidel (no email)

Comment 8 2009-10-19 16:54:46 PDT

Comment on attachment 41396 [details] Revised patch Nakagawa-san is not a committer, so I would mark this cq+, except Darin has asked for modifications. So best if Nakagawa-san could post a new patch with r=? and cq=?.

Satoshi Nakagawa

Comment 9 2009-10-19 23:57:39 PDT

Created attachment 41486 [details] Revised patch 2

Satoshi Nakagawa

Comment 10 2009-10-20 00:02:52 PDT

Thanks for reviewing. I've updated the patch per your comments.

WebKit Commit Bot

Comment 11 2009-10-20 09:43:13 PDT

Comment on attachment 41486 [details] Revised patch 2 Rejecting patch 41486 from commit-queue. Failed to run "['WebKitTools/Scripts/run-webkit-tests', '--no-launch-safari', '--quiet', '--exit-after-n-failures=1']" exit_code: 60 Last 500 characters of output: ages while\n\trunning as root. There are known race conditions that\n\twill allow any local user to read any file on the system.\n\tIf you still desire to serve pages as root then\n\tadd -DBIG_SECURITY_HOLE to the CFLAGS env variable\n\tand then rebuild the server.\n\tIt is strongly suggested that you instead modify the User\n\tdirective in your httpd.conf file to list a non-root\n\tuser.\n Timed out waiting for httpd to start at WebKitTools/Scripts/run-webkit-tests line 1359, <IN> line 30187.

Eric Seidel (no email)

Comment 12 2009-10-20 11:03:35 PDT

My apologies. Something is causing run-webkit-tests to fail on the commit bot. I'm not sure what yet.

Eric Seidel (no email)

Comment 13 2009-10-20 11:14:44 PDT

Comment on attachment 41486 [details] Revised patch 2 I've fixed this commit-queue. Again my apologies.

WebKit Commit Bot

Comment 14 2009-10-20 12:00:49 PDT

Comment on attachment 41486 [details] Revised patch 2 Clearing flags on attachment: 41486 Committed r49876: <http://trac.webkit.org/changeset/49876>

WebKit Commit Bot

Comment 15 2009-10-20 12:00:55 PDT

All reviewed patches have been landed. Closing bug.

Satoshi Nakagawa

Comment 16 2009-10-20 14:11:07 PDT

Thanks!

Mark Rowe (bdash)

Comment 17 2009-10-20 17:20:38 PDT

After this change the SnowLeopard debug bot is hitting numerous assertion failures with stack traces like so: Exception Type: EXC_BAD_ACCESS (SIGSEGV) Exception Codes: KERN_INVALID_ADDRESS at 0x00000000bbadbeef Crashed Thread: 0 Dispatch queue: com.apple.main-thread Thread 0 Crashed: Dispatch queue: com.apple.main-thread 0 com.apple.WebCore 0x000000010163333c WebCore::createSearcher() + 216 (TextIterator.cpp:1534) 1 com.apple.WebCore 0x000000010163336f WebCore::searcher() + 23 (TextIterator.cpp:1541) 2 com.apple.WebCore 0x000000010163a27f WebCore::SearchBuffer::SearchBuffer(WebCore::String const&, bool) + 247 (TextIterator.cpp:1581) 3 com.apple.WebCore 0x0000000101636736 WebCore::findPlainText(WebCore::CharacterIterator&, WebCore::String const&, bool, bool, unsigned long&) + 81 (TextIterator.cpp:1983) 4 com.apple.WebCore 0x0000000101637bff WebCore::findPlainText(WebCore::Range const*, WebCore::String const&, bool, bool) + 116 (TextIterator.cpp:2015) I’m going to verify that I can reproduce this locally and may end up rolling this change out.

Mark Rowe (bdash)

Comment 18 2009-10-20 17:34:33 PDT

When doing ‘run-webkit-tests editing’ I hit this assertion failure on seven tests. Program received signal EXC_BAD_ACCESS, Could not access memory. Reason: KERN_INVALID_ADDRESS at address: 0x00000000bbadbeef 0x00000001016360ac in WebCore::createSearcher () at WebCore/editing/TextIterator.cpp:1534 1534 ASSERT(status == U_ZERO_ERROR); (gdb) print status $1 = U_USING_DEFAULT_WARNING Current language: auto; currently c++ (gdb) I’m going to roll this out for now since we can’t leave debug builds in a broken state.

Mark Rowe (bdash)

Comment 19 2009-10-20 17:43:20 PDT

Rolled out in r49876.

Mark Rowe (bdash)

Comment 20 2009-10-20 17:53:42 PDT

Maybe the solution is to modify the assertion to be the same as the one a few lines up rather than requiring U_ZERO_ERROR.

Satoshi Nakagawa

Comment 21 2009-10-20 19:08:45 PDT

Created attachment 41543 [details] Revised patch 3 I'm sorry it caused an assertion failure. Here is a revised patch.

Darin Adler

Comment 22 2009-10-21 00:03:02 PDT

Comment on attachment 41543 [details] Revised patch 3 > + UCollator* collator = ucol_openRules(japaneseKanaCollationRules, -1, UCOL_DEFAULT, UCOL_DEFAULT_STRENGTH, 0, &status); > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); Are all three of these expected here for some reason, or are you just ignoring the same set I ignored for usearch_open? > + usearch_setCollator(searcher, collator(), &status); > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); Same question. r=me, but I would like to hear the answer to that question at some point.

WebKit Commit Bot

Comment 23 2009-10-21 00:11:16 PDT

Comment on attachment 41543 [details] Revised patch 3 Clearing flags on attachment: 41543 Committed r49899: <http://trac.webkit.org/changeset/49899>

WebKit Commit Bot

Comment 24 2009-10-21 00:11:22 PDT

All reviewed patches have been landed. Closing bug.

Satoshi Nakagawa

Comment 25 2009-10-21 00:49:03 PDT

(In reply to comment #22) > (From update of attachment 41543 [details]) > > + UCollator* collator = ucol_openRules(japaneseKanaCollationRules, -1, UCOL_DEFAULT, UCOL_DEFAULT_STRENGTH, 0, &status); > > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); > > Are all three of these expected here for some reason, or are you just ignoring > the same set I ignored for usearch_open? It's just for ignoring the same set as for usearch_open. I read ICU documents, and I thought it's better to ignore them. http://icu-project.org/apiref/icu4c/ucol_8h.html#a128ea0ed3869415c1c96a9a2c997c2d http://icu-project.org/apiref/icu4c/utypes_8h.html#3343c1c8a8377277046774691c98d78c Probably, using U_SUCCESS() would be better for assertion. > > + usearch_setCollator(searcher, collator(), &status); > > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); > > Same question. > > r=me, but I would like to hear the answer to that question at some point. The same here.

Darin Adler

Comment 26 2009-10-21 08:21:09 PDT

(In reply to comment #25) > Probably, using U_SUCCESS() would be better for assertion. I prefer having the more specific assertion, because we want to know if we're getting new, unexpected error codes there. Sorry it caused a problem in this case. I think this is good now.

Jungshik Shin

Comment 27 2009-10-21 15:22:02 PDT

Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish). Before the change, the search collator was locale-dependent. That is, if you use Japanese Safari, you'd get the collator tailored for Japanese (which I think does the right thing for Japanese). The same is true of other languages. The intent of the patch is to make Japanese search work *regardless of the current locale (UI language)*, which I support. However, it should be done without breaking search in Swedish, German, Finnish or any other language. The patch here breaks that because it throws away the locale-dependent collator right after creating one and replace it with the UCA + Japanese tailoring. The way to do that is to get the collation rule string for the current locale and combine it with the Japanese tailoring (added in the patch) and build a new collator from the combined rule strings. (I just went over the patch with one of ICU's leading contributors). I'll make a patch to do the above in a couple of days.

Darin Adler

Comment 28 2009-10-21 15:30:59 PDT

(In reply to comment #27) > Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish). I can't believe I missed that. We may need to roll the patch out until we have a method that doesn't break all the other languages.

Darin Adler

Comment 29 2009-10-21 15:34:09 PDT

(In reply to comment #27) > The way to do that is to get the collation rule string for the current locale > and combine it with the Japanese tailoring (added in the patch) and build a new > collator from the combined rule strings. (I just went over the patch with one > of ICU's leading contributors). For what it's worth, I tried to do exactly that when I added the code to fold quote marks. But as I mention in a comment, I was unable to do this, even with the help of some ICU experts here at Apple. Here's hoping you are able to do what I was not! See <http://trac.webkit.org/changeset/45858> for the comment and code. If you are able to add custom tailoring, then perhaps we can do the quote marks that way.

Jungshik Shin

Comment 30 2009-10-21 15:40:11 PDT

(In reply to comment #28) > (In reply to comment #27) > > Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish). > > I can't believe I missed that. We may need to roll the patch out until we have > a method that doesn't break all the other languages. It'd be great if you can roll it out. Then, I don't have to rush to make a new ICU data file for Chrome with invuca table (which I excluded to save some space) :-). Chrome is crashing due to assertions because it doesn't have invuca table. Without noticing you adding the last two comments, I filed a new bug 30646 to fix the regression. If you think it's better to roll out the patch, perhaps it's better to deal with a better solution for the original issue here instead of bug 30646. Either way is fine with me.

Darin Adler

Comment 31 2009-10-21 15:42:48 PDT

What is the invuca table, and why does Chrome need it, but not other WebKit-based browsers?

Jungshik Shin

Comment 32 2009-10-21 15:56:06 PDT

(In reply to comment #31) > What is the invuca table, and why does Chrome need it, but not other > WebKit-based browsers? I was not clear. invuca is necessary by all other ICU-dependent webkit port IF collator is built at run-time (as is done with the patch for this bug). Chrome excluded it from its ICU data file because neither Webkit nor Chrome's other components build a collator with custom rules at run-time. With the patch here (or a new patch addressing the regression), that's not the case any more. Among other ICU-dependent Webkit ports, Safari on OS X is not affected because it gets the *full* ICU data from the OS. Safari on Win could have cut down the download size a little bit by excluding invuca, but apparently it didn't. So, it's not affected, either.

Darin Adler

Comment 33 2009-10-21 16:11:04 PDT

Rolled the change out in r49926. Reopening the bug.

Eric Seidel (no email)

Comment 34 2009-10-26 12:06:32 PDT

Comment on attachment 41396 [details] Revised patch Clearing darin's r+ on this obsolete patch.

mitz

Comment 35 2009-10-30 16:43:20 PDT

<rdar://problem/7214058>

Darin Adler

Comment 36 2009-10-30 16:46:36 PDT

If we can't fix this with tailoring, maybe there's some other simple way to fix it. I’d really like to resolve this somehow!

Darin Adler

Comment 37 2010-01-06 16:52:34 PST

Does anyone have any ideas about how to solve this?

Darin Adler

Comment 38 2010-01-06 16:59:19 PST

I am going to tackle this.

Darin Adler

Comment 39 2010-01-07 17:56:09 PST

Created attachment 46103 [details] work in progress

Darin Adler

Comment 40 2010-01-10 17:39:32 PST

Created attachment 46249 [details] patch

Alexey Proskuryakov

Comment 41 2010-01-10 23:02:25 PST

Comment on attachment 46249 [details] patch > +static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer) > +{ > + ASSERT(length); > + > + UErrorCode status = U_ZERO_ERROR; > + size_t bufferSize = unorm_normalize(characters, length, UNORM_NFC, 0, 0, 0, &status); > + ASSERT(status == U_BUFFER_OVERFLOW_ERROR); > + ASSERT(bufferSize); Would it make sense to try with an output buffer of size length,, to avoid having two passes in most cases? > Index: LayoutTests/fast/text/find-kana.html I need to add a similar test for Russian! r=me

Darin Adler

Comment 42 2010-01-11 07:54:56 PST

(In reply to comment #41) > Would it make sense to try with an output buffer of size length, to avoid > having two passes in most cases? Yes. > I need to add a similar test for Russian! Please do.

Darin Adler

Comment 43 2010-01-11 08:31:35 PST

http://trac.webkit.org/changeset/53078

Darin Adler

Comment 44 2010-01-11 10:41:43 PST

*** Bug 27587 has been marked as a duplicate of this bug. ***

Eric Seidel (no email)

Comment 45 2010-01-11 12:07:44 PST

Looks like this test breaks on Leopard and Tiger: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r53078%20(9114)/fast/text/find-kana-diffs.txt

Eric Seidel (no email)

Comment 46 2010-01-11 12:23:24 PST

And Qt: http://build.webkit.org/results/Qt%20Linux%20Release/r53092%20(5839)/fast/text/find-kana-diffs.txt

Darin Adler

Comment 47 2010-01-11 13:02:47 PST

Tiger and Leopard are failing because they are using an old ICU. Qt is failing because it's using the non-ICU search code path.

Darin Adler

Comment 48 2010-01-11 13:03:13 PST

In both cases the best solution is checking in expected failure results.

Note You need to log in before you can comment on or make changes to this bug.