Bug 30437

Summary: REGRESSION: Japanese text search ignores small vs. large and voicing mark differences
Product: WebKit Reporter: Satoshi Nakagawa <artension>
Component: WebCore Misc.Assignee: Darin Adler <darin>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, commit-queue, darin, eric, jshin, justin.garcia, keishi, mitz
Priority: P2 Keywords: InRadar
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
URL: http://limechat.net/report/webkit-search-problem.html
Attachments:
Description Flags
A patch to fix this problem
none
Revised patch
eric: commit-queue-
Revised patch 2
none
Revised patch 3
none
work in progress
none
patch ap: review+

Description Satoshi Nakagawa 2009-10-16 04:36:41 PDT
== Summary ==

In Japanese, 'ぁ' and 'あ' are treated as different characters in anytime. 'か' and 'が' are as well.

But in Safari and Chrome, they are treated as the same characters in its search.

== Description ==

As you know in English, abc and ABC are treated as the same in a case insensitive context like application searches.

But in Japanese, for example, "あった" and "あつた" are always different words in any contexts. Because in Japanese semantics, 'っ' is NOT considered as a small form of 'つ'. These characters are never treated as the same characters.

In the current Unicode Collation Algorithm, っ and つ are in the same order in the primary collation strength. WebKit uses the primary collation strength in ICU for its search.

I reported this problem in the Unicode ML. (http://unicode.org/mail-arch/unicode-ml/y2009-m10/0019.html)

Mark Davis replied to my report. (http://unicode.org/mail-arch/unicode-ml/y2009-m10/0022.html)
> UTS#10 does not necessarily match the sorting of any particular language.

It means we cannot use ICU's search function directly for application searches. It needs some tailoring in the collation table for some languages.

I wrote a patch for WebKit to add the following tailoring rules for Japanese text search. This patch doesn't have any regression in the other languages.
Comment 1 Satoshi Nakagawa 2009-10-16 04:39:05 PDT
Created attachment 41282 [details]
A patch to fix this problem
Comment 2 Satoshi Nakagawa 2009-10-16 04:44:12 PDT
The tailoring rules are like this:


&ぁ=ァ=ァ<あ=ア=ア<ぃ=ィ=ィ<い=イ=イ<ぅ=ゥ=ゥ<う=ウ=ウ<ゔ=ヴ<ぇ=ェ=ェ<え=エ=エ<ぉ=ォ=ォ<お=オ=オ
<ゕ=ヵ<か=カ=カ<が=ガ<き=キ=キ<ぎ=ギ<く=ク=ク<ぐ=グ<ゖ=ヶ<け=ケ=ケ<げ=ゲ<こ=コ=コ<ご=ゴ
<さ=サ=サ<ざ=ザ<し=シ=シ<じ=ジ<す=ス=ス<ず=ズ<せ=セ=セ<ぜ=ゼ<そ=ソ=ソ<ぞ=ゾ
<た=タ=タ<だ=ダ<ち=チ=チ<ぢ=ヂ<っ=ッ=ッ<つ=ツ=ツ<づ=ヅ<て=テ=テ<で=デ<と=ト=ト<ど=ド
<な=ナ=ナ<に=ニ=ニ<ぬ=ヌ=ヌ<ね=ネ=ネ<の=ノ=ノ
<は=ハ=ハ<ば=バ<ぱ=パ<ひ=ヒ=ヒ<び=ビ<ぴ=ピ<ふ=フ=フ<ぶ=ブ<ぷ=プ<へ=ヘ=ヘ<べ=ベ<ぺ=ペ<ほ=ホ=ホ<ぼ=ボ<ぽ=ポ
<ま=マ=マ<み=ミ=ミ<む=ム=ム<め=メ=メ<も=モ=モ
<ゃ=ャ=ャ<や=ヤ=ヤ<ゅ=ュ=ュ<ゆ=ユ=ユ<ょ=ョ=ョ<よ=ヨ=ヨ
<ら=ラ=ラ<り=リ=リ<る=ル=ル<れ=レ=レ<ろ=ロ=ロ
<ゎ=ヮ<わ=ワ=ワ<ヷ<ゐ=ヰ<ヸ<ゑ=ヱ<を=ヲ=ヲ<ん=ン=ン


For example, "ぁ=ァ=ァ<あ=ア=ア" means:

- 'ぁ', 'ァ' and 'ァ' are in the same order.
- 'ぁ', 'ァ' and 'ァ' are smaller than 'あ', 'ア' and 'ア'.


You can see some screenshots that describe this problem and the effect of this patch.

http://limechat.net/report/webkit-search-problem.html
Comment 3 Alexey Proskuryakov 2009-10-16 15:54:23 PDT
A fix like this definitely needs include a regression test. See <http://trac.webkit.org/browser/trunk/LayoutTests/fast/text/find-case-folding.html> for an example of how it could be done.

Some more information about contributing code to WebKit is available at <http://webkit.org/coding/contributing.html>. This may seem like a lot to ask for, but it's vital for maintainability of the project.

See also: bug 27587.

Once you have a patch ready for review, please mark it as such by setting review flag to "?".
Comment 4 Darin Adler 2009-10-16 17:58:11 PDT
Comment on attachment 41282 [details]
A patch to fix this problem

This code should go inside createSearcher(), not inside searcher().
Comment 5 Satoshi Nakagawa 2009-10-18 22:02:21 PDT
Created attachment 41396 [details]
Revised patch
Comment 6 Satoshi Nakagawa 2009-10-18 22:07:35 PDT
Thanks for your comments.
I've uploaded a revised patch.

About bug 27587, IMO it's natural to treat "シ" and "㋛" as different charactes. "アパート" and "㌀" as well.
Comment 7 Darin Adler 2009-10-19 15:26:19 PDT
Comment on attachment 41396 [details]
Revised patch

> +        * editing/TextIterator.cpp:
> +        (WebCore::):
> +        (WebCore::createSearcher):

The line saying just "WebCore::" should be deleted.

> +// Tailored collation rules for Japanese text search.
> +// The default Unicode Collation Algorithm is unnatural in Japanese.
> +// These rules intend to treat the following characters as different characters.
> +//
> +// - Small kana letters and normal kana letters
> +// - Voiceless letters, voiced letters and semi-voiced letters
> +//

This comment should document where this array came from. Is this original work or did you copy this here from some other project?

> +static const UChar JAPANESE_KANA_COLLATION_RULES[] = {

This array should not have a name that's in all capitals. Those names are reserved for macros.

>      UErrorCode status = U_ZERO_ERROR;
>      UStringSearch* searcher = usearch_open(&newlineCharacter, 1, &newlineCharacter, 1, currentSearchLocaleID(), 0, &status);
>      ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);
> +    
> +    static UCollator* collator = 0;
> +    if (!collator) {
> +        // Set tailored collation rules to fix Japanese text search.
> +        // See the comments before JAPANESE_KANA_COLLATION_RULES for details.
> +        status = U_ZERO_ERROR;
> +        collator = ucol_openRules(JAPANESE_KANA_COLLATION_RULES, -1, UCOL_DEFAULT,
> +                                  UCOL_DEFAULT_STRENGTH, 0, &status);
> +        ASSERT(status == U_ZERO_ERROR);
> +        status = U_ZERO_ERROR;
> +        usearch_setCollator(searcher, collator, &status);
> +        ASSERT(status == U_ZERO_ERROR);
> +        usearch_reset(searcher);
> +    }

This is OK, but not quite right.

The usearch_setCollator and usearch_reset calls should be outside the if statement, since they are part of the creation of searcher, not the creation of the collator. However, since the createSearcher function is only called once, there's no problem in practice.

The entire function should ideally be refactored to match the way the searcher() and createSearcher() function work. There would be a collator() and createCollator() function, and createSearcher() would call collator().

Otherwise, the patch looks good. I'm going to say r=me because I think it's OK to land this patch as is, but I think it would be even better if you took my suggestions above.
Comment 8 Eric Seidel (no email) 2009-10-19 16:54:46 PDT
Comment on attachment 41396 [details]
Revised patch

Nakagawa-san is not a committer, so I would mark this cq+, except Darin has asked for modifications.  So best if Nakagawa-san could post a new patch with r=? and cq=?.
Comment 9 Satoshi Nakagawa 2009-10-19 23:57:39 PDT
Created attachment 41486 [details]
Revised patch 2
Comment 10 Satoshi Nakagawa 2009-10-20 00:02:52 PDT
Thanks for reviewing.
I've updated the patch per your comments.
Comment 11 WebKit Commit Bot 2009-10-20 09:43:13 PDT
Comment on attachment 41486 [details]
Revised patch 2

Rejecting patch 41486 from commit-queue.

Failed to run "['WebKitTools/Scripts/run-webkit-tests', '--no-launch-safari', '--quiet', '--exit-after-n-failures=1']" exit_code: 60
Last 500 characters of output:
ages while\n\trunning as root.  There are known race conditions that\n\twill allow any local user to read any file on the system.\n\tIf you still desire to serve pages as root then\n\tadd -DBIG_SECURITY_HOLE to the CFLAGS env variable\n\tand then rebuild the server.\n\tIt is strongly suggested that you instead modify the User\n\tdirective in your httpd.conf file to list a non-root\n\tuser.\n
Timed out waiting for httpd to start at WebKitTools/Scripts/run-webkit-tests line 1359, <IN> line 30187.
Comment 12 Eric Seidel (no email) 2009-10-20 11:03:35 PDT
My apologies.  Something is causing run-webkit-tests to fail on the commit bot.  I'm not sure what yet.
Comment 13 Eric Seidel (no email) 2009-10-20 11:14:44 PDT
Comment on attachment 41486 [details]
Revised patch 2

I've fixed this commit-queue.  Again my apologies.
Comment 14 WebKit Commit Bot 2009-10-20 12:00:49 PDT
Comment on attachment 41486 [details]
Revised patch 2

Clearing flags on attachment: 41486

Committed r49876: <http://trac.webkit.org/changeset/49876>
Comment 15 WebKit Commit Bot 2009-10-20 12:00:55 PDT
All reviewed patches have been landed.  Closing bug.
Comment 16 Satoshi Nakagawa 2009-10-20 14:11:07 PDT
Thanks!
Comment 17 Mark Rowe (bdash) 2009-10-20 17:20:38 PDT
After this change the SnowLeopard debug bot is hitting numerous assertion failures with stack traces like so:

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00000000bbadbeef
Crashed Thread:  0  Dispatch queue: com.apple.main-thread

Thread 0 Crashed:  Dispatch queue: com.apple.main-thread
0   com.apple.WebCore                   0x000000010163333c WebCore::createSearcher() + 216 (TextIterator.cpp:1534)
1   com.apple.WebCore                   0x000000010163336f WebCore::searcher() + 23 (TextIterator.cpp:1541)
2   com.apple.WebCore                   0x000000010163a27f WebCore::SearchBuffer::SearchBuffer(WebCore::String const&, bool) + 247 (TextIterator.cpp:1581)
3   com.apple.WebCore                   0x0000000101636736 WebCore::findPlainText(WebCore::CharacterIterator&, WebCore::String const&, bool, bool, unsigned long&) + 81 (TextIterator.cpp:1983)
4   com.apple.WebCore                   0x0000000101637bff WebCore::findPlainText(WebCore::Range const*, WebCore::String const&, bool, bool) + 116 (TextIterator.cpp:2015)


I’m going to verify that I can reproduce this locally and may end up rolling this change out.
Comment 18 Mark Rowe (bdash) 2009-10-20 17:34:33 PDT
When doing ‘run-webkit-tests editing’ I hit this assertion failure on seven tests.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x00000000bbadbeef
0x00000001016360ac in WebCore::createSearcher () at WebCore/editing/TextIterator.cpp:1534
1534	    ASSERT(status == U_ZERO_ERROR);
(gdb) print status
$1 = U_USING_DEFAULT_WARNING
Current language:  auto; currently c++
(gdb) 

I’m going to roll this out for now since we can’t leave debug builds in a broken state.
Comment 19 Mark Rowe (bdash) 2009-10-20 17:43:20 PDT
Rolled out in r49876.
Comment 20 Mark Rowe (bdash) 2009-10-20 17:53:42 PDT
Maybe the solution is to modify the assertion to be the same as the one a few lines up rather than requiring U_ZERO_ERROR.
Comment 21 Satoshi Nakagawa 2009-10-20 19:08:45 PDT
Created attachment 41543 [details]
Revised patch 3

I'm sorry it caused an assertion failure.
Here is a revised patch.
Comment 22 Darin Adler 2009-10-21 00:03:02 PDT
Comment on attachment 41543 [details]
Revised patch 3

> +    UCollator* collator = ucol_openRules(japaneseKanaCollationRules, -1, UCOL_DEFAULT, UCOL_DEFAULT_STRENGTH, 0, &status);
> +    ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);

Are all three of these expected here for some reason, or are you just ignoring the same set I ignored for usearch_open?

> +    usearch_setCollator(searcher, collator(), &status);
> +    ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);

Same question.

r=me, but I would like to hear the answer to that question at some point.
Comment 23 WebKit Commit Bot 2009-10-21 00:11:16 PDT
Comment on attachment 41543 [details]
Revised patch 3

Clearing flags on attachment: 41543

Committed r49899: <http://trac.webkit.org/changeset/49899>
Comment 24 WebKit Commit Bot 2009-10-21 00:11:22 PDT
All reviewed patches have been landed.  Closing bug.
Comment 25 Satoshi Nakagawa 2009-10-21 00:49:03 PDT
(In reply to comment #22)
> (From update of attachment 41543 [details])
> > +    UCollator* collator = ucol_openRules(japaneseKanaCollationRules, -1, UCOL_DEFAULT, UCOL_DEFAULT_STRENGTH, 0, &status);
> > +    ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);
> 
> Are all three of these expected here for some reason, or are you just ignoring
> the same set I ignored for usearch_open?

It's just for ignoring the same set as for usearch_open.
I read ICU documents, and I thought it's better to ignore them.

http://icu-project.org/apiref/icu4c/ucol_8h.html#a128ea0ed3869415c1c96a9a2c997c2d
http://icu-project.org/apiref/icu4c/utypes_8h.html#3343c1c8a8377277046774691c98d78c

Probably,  using U_SUCCESS() would be better for assertion.

> > +    usearch_setCollator(searcher, collator(), &status);
> > +    ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING);
> 
> Same question.
> 
> r=me, but I would like to hear the answer to that question at some point.

The same here.
Comment 26 Darin Adler 2009-10-21 08:21:09 PDT
(In reply to comment #25)
> Probably,  using U_SUCCESS() would be better for assertion.

I prefer having the more specific assertion, because we want to know if we're getting new, unexpected error codes there. Sorry it caused a problem in this case. I think this is good now.
Comment 27 Jungshik Shin 2009-10-21 15:22:02 PDT
Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish). Before the change, the search collator was locale-dependent. That is, if you use Japanese Safari, you'd get the collator tailored for Japanese (which I think does the right thing for Japanese). The same is true of other languages.

The intent of the patch is to make Japanese search work *regardless of the current locale (UI language)*, which I support. However, it should be done without breaking search in Swedish, German, Finnish or any other language.

The patch here breaks that because it throws away the locale-dependent collator right after creating one and replace it with the UCA + Japanese tailoring. 

The way to do that is to get the collation rule string for the current locale and combine it with the Japanese tailoring (added in the patch) and build a new collator from the combined rule strings. (I just went over the patch with one of ICU's leading contributors). 

I'll make a patch to do the above in a couple of days.
Comment 28 Darin Adler 2009-10-21 15:30:59 PDT
(In reply to comment #27)
> Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish).

I can't believe I missed that. We may need to roll the patch out until we have a method that doesn't break all the other languages.
Comment 29 Darin Adler 2009-10-21 15:34:09 PDT
(In reply to comment #27)
> The way to do that is to get the collation rule string for the current locale
> and combine it with the Japanese tailoring (added in the patch) and build a new
> collator from the combined rule strings. (I just went over the patch with one
> of ICU's leading contributors).

For what it's worth, I tried to do exactly that when I added the code to fold quote marks. But as I mention in a comment, I was unable to do this, even with the help of some ICU experts here at Apple. Here's hoping you are able to do what I was not!

See <http://trac.webkit.org/changeset/45858> for the comment and code.

If you are able to add custom tailoring, then perhaps we can do the quote marks that way.
Comment 30 Jungshik Shin 2009-10-21 15:40:11 PDT
(In reply to comment #28)
> (In reply to comment #27)
> > Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish).
> 
> I can't believe I missed that. We may need to roll the patch out until we have
> a method that doesn't break all the other languages.

It'd be great if you can roll it out. Then, I don't have to rush to make a new ICU data file for Chrome with invuca table (which I excluded to save some space) :-).  Chrome is crashing due to assertions because it doesn't have invuca table. 

Without noticing you adding the last two comments, I filed a new bug 30646 to fix the regression. If you think it's better to roll out the patch, perhaps it's better to deal with a better solution for the original issue here instead of bug 30646. 

Either way is fine with me.
Comment 31 Darin Adler 2009-10-21 15:42:48 PDT
What is the invuca table, and why does Chrome need it, but not other WebKit-based browsers?
Comment 32 Jungshik Shin 2009-10-21 15:56:06 PDT
(In reply to comment #31)
> What is the invuca table, and why does Chrome need it, but not other
> WebKit-based browsers?

I was not clear. invuca is necessary by all other ICU-dependent webkit port IF collator is built at run-time (as is done with the patch for this bug). Chrome excluded it from its ICU data file because neither Webkit nor Chrome's other components build a collator with custom rules at run-time. With the patch here (or a new patch addressing the regression), that's not the case any more. 

Among other ICU-dependent Webkit ports, Safari on OS X is not affected because it gets the *full* ICU data from the OS. Safari on Win could have cut down the download size a little bit by excluding invuca, but apparently it didn't. So, it's not affected, either.
Comment 33 Darin Adler 2009-10-21 16:11:04 PDT
Rolled the change out in r49926. Reopening the bug.
Comment 34 Eric Seidel (no email) 2009-10-26 12:06:32 PDT
Comment on attachment 41396 [details]
Revised patch

Clearing darin's r+ on this obsolete patch.
Comment 35 mitz 2009-10-30 16:43:20 PDT
<rdar://problem/7214058>
Comment 36 Darin Adler 2009-10-30 16:46:36 PDT
If we can't fix this with tailoring, maybe there's some other simple way to fix it. I’d really like to resolve this somehow!
Comment 37 Darin Adler 2010-01-06 16:52:34 PST
Does anyone have any ideas about how to solve this?
Comment 38 Darin Adler 2010-01-06 16:59:19 PST
I am going to tackle this.
Comment 39 Darin Adler 2010-01-07 17:56:09 PST
Created attachment 46103 [details]
work in progress
Comment 40 Darin Adler 2010-01-10 17:39:32 PST
Created attachment 46249 [details]
patch
Comment 41 Alexey Proskuryakov 2010-01-10 23:02:25 PST
Comment on attachment 46249 [details]
patch

> +static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer)
> +{
> +    ASSERT(length);
> +
> +    UErrorCode status = U_ZERO_ERROR;
> +    size_t bufferSize = unorm_normalize(characters, length, UNORM_NFC, 0, 0, 0, &status);
> +    ASSERT(status == U_BUFFER_OVERFLOW_ERROR);
> +    ASSERT(bufferSize);

Would it make sense to try with an output buffer of size length,, to avoid having two passes in most cases?

> Index: LayoutTests/fast/text/find-kana.html

I need to add a similar test for Russian!

r=me
Comment 42 Darin Adler 2010-01-11 07:54:56 PST
(In reply to comment #41)
> Would it make sense to try with an output buffer of size length, to avoid
> having two passes in most cases?

Yes.

> I need to add a similar test for Russian!

Please do.
Comment 43 Darin Adler 2010-01-11 08:31:35 PST
http://trac.webkit.org/changeset/53078
Comment 44 Darin Adler 2010-01-11 10:41:43 PST
*** Bug 27587 has been marked as a duplicate of this bug. ***
Comment 45 Eric Seidel (no email) 2010-01-11 12:07:44 PST
Looks like this test breaks on Leopard and Tiger:
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r53078%20(9114)/fast/text/find-kana-diffs.txt
Comment 47 Darin Adler 2010-01-11 13:02:47 PST
Tiger and Leopard are failing because they are using an old ICU.

Qt is failing because it's using the non-ICU search code path.
Comment 48 Darin Adler 2010-01-11 13:03:13 PST
In both cases the best solution is checking in expected failure results.