Summary: | REGRESSION: Japanese text search ignores small vs. large and voicing mark differences | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Satoshi Nakagawa <artension> | ||||||||||||||
Component: | WebCore Misc. | Assignee: | Darin Adler <darin> | ||||||||||||||
Status: | RESOLVED FIXED | ||||||||||||||||
Severity: | Normal | CC: | ap, commit-queue, darin, eric, jshin, justin.garcia, keishi, mitz | ||||||||||||||
Priority: | P2 | Keywords: | InRadar | ||||||||||||||
Version: | 528+ (Nightly build) | ||||||||||||||||
Hardware: | All | ||||||||||||||||
OS: | All | ||||||||||||||||
URL: | http://limechat.net/report/webkit-search-problem.html | ||||||||||||||||
Attachments: |
|
Description
Satoshi Nakagawa
2009-10-16 04:36:41 PDT
Created attachment 41282 [details]
A patch to fix this problem
The tailoring rules are like this: &ぁ=ァ=ァ<あ=ア=ア<ぃ=ィ=ィ<い=イ=イ<ぅ=ゥ=ゥ<う=ウ=ウ<ゔ=ヴ<ぇ=ェ=ェ<え=エ=エ<ぉ=ォ=ォ<お=オ=オ <ゕ=ヵ<か=カ=カ<が=ガ<き=キ=キ<ぎ=ギ<く=ク=ク<ぐ=グ<ゖ=ヶ<け=ケ=ケ<げ=ゲ<こ=コ=コ<ご=ゴ <さ=サ=サ<ざ=ザ<し=シ=シ<じ=ジ<す=ス=ス<ず=ズ<せ=セ=セ<ぜ=ゼ<そ=ソ=ソ<ぞ=ゾ <た=タ=タ<だ=ダ<ち=チ=チ<ぢ=ヂ<っ=ッ=ッ<つ=ツ=ツ<づ=ヅ<て=テ=テ<で=デ<と=ト=ト<ど=ド <な=ナ=ナ<に=ニ=ニ<ぬ=ヌ=ヌ<ね=ネ=ネ<の=ノ=ノ <は=ハ=ハ<ば=バ<ぱ=パ<ひ=ヒ=ヒ<び=ビ<ぴ=ピ<ふ=フ=フ<ぶ=ブ<ぷ=プ<へ=ヘ=ヘ<べ=ベ<ぺ=ペ<ほ=ホ=ホ<ぼ=ボ<ぽ=ポ <ま=マ=マ<み=ミ=ミ<む=ム=ム<め=メ=メ<も=モ=モ <ゃ=ャ=ャ<や=ヤ=ヤ<ゅ=ュ=ュ<ゆ=ユ=ユ<ょ=ョ=ョ<よ=ヨ=ヨ <ら=ラ=ラ<り=リ=リ<る=ル=ル<れ=レ=レ<ろ=ロ=ロ <ゎ=ヮ<わ=ワ=ワ<ヷ<ゐ=ヰ<ヸ<ゑ=ヱ<を=ヲ=ヲ<ん=ン=ン For example, "ぁ=ァ=ァ<あ=ア=ア" means: - 'ぁ', 'ァ' and 'ァ' are in the same order. - 'ぁ', 'ァ' and 'ァ' are smaller than 'あ', 'ア' and 'ア'. You can see some screenshots that describe this problem and the effect of this patch. http://limechat.net/report/webkit-search-problem.html A fix like this definitely needs include a regression test. See <http://trac.webkit.org/browser/trunk/LayoutTests/fast/text/find-case-folding.html> for an example of how it could be done. Some more information about contributing code to WebKit is available at <http://webkit.org/coding/contributing.html>. This may seem like a lot to ask for, but it's vital for maintainability of the project. See also: bug 27587. Once you have a patch ready for review, please mark it as such by setting review flag to "?". Comment on attachment 41282 [details]
A patch to fix this problem
This code should go inside createSearcher(), not inside searcher().
Created attachment 41396 [details]
Revised patch
Thanks for your comments. I've uploaded a revised patch. About bug 27587, IMO it's natural to treat "シ" and "㋛" as different charactes. "アパート" and "㌀" as well. Comment on attachment 41396 [details] Revised patch > + * editing/TextIterator.cpp: > + (WebCore::): > + (WebCore::createSearcher): The line saying just "WebCore::" should be deleted. > +// Tailored collation rules for Japanese text search. > +// The default Unicode Collation Algorithm is unnatural in Japanese. > +// These rules intend to treat the following characters as different characters. > +// > +// - Small kana letters and normal kana letters > +// - Voiceless letters, voiced letters and semi-voiced letters > +// This comment should document where this array came from. Is this original work or did you copy this here from some other project? > +static const UChar JAPANESE_KANA_COLLATION_RULES[] = { This array should not have a name that's in all capitals. Those names are reserved for macros. > UErrorCode status = U_ZERO_ERROR; > UStringSearch* searcher = usearch_open(&newlineCharacter, 1, &newlineCharacter, 1, currentSearchLocaleID(), 0, &status); > ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); > + > + static UCollator* collator = 0; > + if (!collator) { > + // Set tailored collation rules to fix Japanese text search. > + // See the comments before JAPANESE_KANA_COLLATION_RULES for details. > + status = U_ZERO_ERROR; > + collator = ucol_openRules(JAPANESE_KANA_COLLATION_RULES, -1, UCOL_DEFAULT, > + UCOL_DEFAULT_STRENGTH, 0, &status); > + ASSERT(status == U_ZERO_ERROR); > + status = U_ZERO_ERROR; > + usearch_setCollator(searcher, collator, &status); > + ASSERT(status == U_ZERO_ERROR); > + usearch_reset(searcher); > + } This is OK, but not quite right. The usearch_setCollator and usearch_reset calls should be outside the if statement, since they are part of the creation of searcher, not the creation of the collator. However, since the createSearcher function is only called once, there's no problem in practice. The entire function should ideally be refactored to match the way the searcher() and createSearcher() function work. There would be a collator() and createCollator() function, and createSearcher() would call collator(). Otherwise, the patch looks good. I'm going to say r=me because I think it's OK to land this patch as is, but I think it would be even better if you took my suggestions above. Comment on attachment 41396 [details]
Revised patch
Nakagawa-san is not a committer, so I would mark this cq+, except Darin has asked for modifications. So best if Nakagawa-san could post a new patch with r=? and cq=?.
Created attachment 41486 [details]
Revised patch 2
Thanks for reviewing. I've updated the patch per your comments. Comment on attachment 41486 [details]
Revised patch 2
Rejecting patch 41486 from commit-queue.
Failed to run "['WebKitTools/Scripts/run-webkit-tests', '--no-launch-safari', '--quiet', '--exit-after-n-failures=1']" exit_code: 60
Last 500 characters of output:
ages while\n\trunning as root. There are known race conditions that\n\twill allow any local user to read any file on the system.\n\tIf you still desire to serve pages as root then\n\tadd -DBIG_SECURITY_HOLE to the CFLAGS env variable\n\tand then rebuild the server.\n\tIt is strongly suggested that you instead modify the User\n\tdirective in your httpd.conf file to list a non-root\n\tuser.\n
Timed out waiting for httpd to start at WebKitTools/Scripts/run-webkit-tests line 1359, <IN> line 30187.
My apologies. Something is causing run-webkit-tests to fail on the commit bot. I'm not sure what yet. Comment on attachment 41486 [details]
Revised patch 2
I've fixed this commit-queue. Again my apologies.
Comment on attachment 41486 [details] Revised patch 2 Clearing flags on attachment: 41486 Committed r49876: <http://trac.webkit.org/changeset/49876> All reviewed patches have been landed. Closing bug. Thanks! After this change the SnowLeopard debug bot is hitting numerous assertion failures with stack traces like so: Exception Type: EXC_BAD_ACCESS (SIGSEGV) Exception Codes: KERN_INVALID_ADDRESS at 0x00000000bbadbeef Crashed Thread: 0 Dispatch queue: com.apple.main-thread Thread 0 Crashed: Dispatch queue: com.apple.main-thread 0 com.apple.WebCore 0x000000010163333c WebCore::createSearcher() + 216 (TextIterator.cpp:1534) 1 com.apple.WebCore 0x000000010163336f WebCore::searcher() + 23 (TextIterator.cpp:1541) 2 com.apple.WebCore 0x000000010163a27f WebCore::SearchBuffer::SearchBuffer(WebCore::String const&, bool) + 247 (TextIterator.cpp:1581) 3 com.apple.WebCore 0x0000000101636736 WebCore::findPlainText(WebCore::CharacterIterator&, WebCore::String const&, bool, bool, unsigned long&) + 81 (TextIterator.cpp:1983) 4 com.apple.WebCore 0x0000000101637bff WebCore::findPlainText(WebCore::Range const*, WebCore::String const&, bool, bool) + 116 (TextIterator.cpp:2015) I’m going to verify that I can reproduce this locally and may end up rolling this change out. When doing ‘run-webkit-tests editing’ I hit this assertion failure on seven tests. Program received signal EXC_BAD_ACCESS, Could not access memory. Reason: KERN_INVALID_ADDRESS at address: 0x00000000bbadbeef 0x00000001016360ac in WebCore::createSearcher () at WebCore/editing/TextIterator.cpp:1534 1534 ASSERT(status == U_ZERO_ERROR); (gdb) print status $1 = U_USING_DEFAULT_WARNING Current language: auto; currently c++ (gdb) I’m going to roll this out for now since we can’t leave debug builds in a broken state. Rolled out in r49876. Maybe the solution is to modify the assertion to be the same as the one a few lines up rather than requiring U_ZERO_ERROR. Created attachment 41543 [details]
Revised patch 3
I'm sorry it caused an assertion failure.
Here is a revised patch.
Comment on attachment 41543 [details] Revised patch 3 > + UCollator* collator = ucol_openRules(japaneseKanaCollationRules, -1, UCOL_DEFAULT, UCOL_DEFAULT_STRENGTH, 0, &status); > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); Are all three of these expected here for some reason, or are you just ignoring the same set I ignored for usearch_open? > + usearch_setCollator(searcher, collator(), &status); > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); Same question. r=me, but I would like to hear the answer to that question at some point. Comment on attachment 41543 [details] Revised patch 3 Clearing flags on attachment: 41543 Committed r49899: <http://trac.webkit.org/changeset/49899> All reviewed patches have been landed. Closing bug. (In reply to comment #22) > (From update of attachment 41543 [details]) > > + UCollator* collator = ucol_openRules(japaneseKanaCollationRules, -1, UCOL_DEFAULT, UCOL_DEFAULT_STRENGTH, 0, &status); > > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); > > Are all three of these expected here for some reason, or are you just ignoring > the same set I ignored for usearch_open? It's just for ignoring the same set as for usearch_open. I read ICU documents, and I thought it's better to ignore them. http://icu-project.org/apiref/icu4c/ucol_8h.html#a128ea0ed3869415c1c96a9a2c997c2d http://icu-project.org/apiref/icu4c/utypes_8h.html#3343c1c8a8377277046774691c98d78c Probably, using U_SUCCESS() would be better for assertion. > > + usearch_setCollator(searcher, collator(), &status); > > + ASSERT(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING || status == U_USING_DEFAULT_WARNING); > > Same question. > > r=me, but I would like to hear the answer to that question at some point. The same here. (In reply to comment #25) > Probably, using U_SUCCESS() would be better for assertion. I prefer having the more specific assertion, because we want to know if we're getting new, unexpected error codes there. Sorry it caused a problem in this case. I think this is good now. Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish). Before the change, the search collator was locale-dependent. That is, if you use Japanese Safari, you'd get the collator tailored for Japanese (which I think does the right thing for Japanese). The same is true of other languages. The intent of the patch is to make Japanese search work *regardless of the current locale (UI language)*, which I support. However, it should be done without breaking search in Swedish, German, Finnish or any other language. The patch here breaks that because it throws away the locale-dependent collator right after creating one and replace it with the UCA + Japanese tailoring. The way to do that is to get the collation rule string for the current locale and combine it with the Japanese tailoring (added in the patch) and build a new collator from the combined rule strings. (I just went over the patch with one of ICU's leading contributors). I'll make a patch to do the above in a couple of days. (In reply to comment #27) > Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish). I can't believe I missed that. We may need to roll the patch out until we have a method that doesn't break all the other languages. (In reply to comment #27) > The way to do that is to get the collation rule string for the current locale > and combine it with the Japanese tailoring (added in the patch) and build a new > collator from the combined rule strings. (I just went over the patch with one > of ICU's leading contributors). For what it's worth, I tried to do exactly that when I added the code to fold quote marks. But as I mention in a comment, I was unable to do this, even with the help of some ICU experts here at Apple. Here's hoping you are able to do what I was not! See <http://trac.webkit.org/changeset/45858> for the comment and code. If you are able to add custom tailoring, then perhaps we can do the quote marks that way. (In reply to comment #28) > (In reply to comment #27) > > Sorry I'm late here, but this patch breaks non-Japanese search (e.g. Swedish). > > I can't believe I missed that. We may need to roll the patch out until we have > a method that doesn't break all the other languages. It'd be great if you can roll it out. Then, I don't have to rush to make a new ICU data file for Chrome with invuca table (which I excluded to save some space) :-). Chrome is crashing due to assertions because it doesn't have invuca table. Without noticing you adding the last two comments, I filed a new bug 30646 to fix the regression. If you think it's better to roll out the patch, perhaps it's better to deal with a better solution for the original issue here instead of bug 30646. Either way is fine with me. What is the invuca table, and why does Chrome need it, but not other WebKit-based browsers? (In reply to comment #31) > What is the invuca table, and why does Chrome need it, but not other > WebKit-based browsers? I was not clear. invuca is necessary by all other ICU-dependent webkit port IF collator is built at run-time (as is done with the patch for this bug). Chrome excluded it from its ICU data file because neither Webkit nor Chrome's other components build a collator with custom rules at run-time. With the patch here (or a new patch addressing the regression), that's not the case any more. Among other ICU-dependent Webkit ports, Safari on OS X is not affected because it gets the *full* ICU data from the OS. Safari on Win could have cut down the download size a little bit by excluding invuca, but apparently it didn't. So, it's not affected, either. Rolled the change out in r49926. Reopening the bug. Comment on attachment 41396 [details]
Revised patch
Clearing darin's r+ on this obsolete patch.
If we can't fix this with tailoring, maybe there's some other simple way to fix it. I’d really like to resolve this somehow! Does anyone have any ideas about how to solve this? I am going to tackle this. Created attachment 46103 [details]
work in progress
Created attachment 46249 [details]
patch
Comment on attachment 46249 [details] patch > +static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer) > +{ > + ASSERT(length); > + > + UErrorCode status = U_ZERO_ERROR; > + size_t bufferSize = unorm_normalize(characters, length, UNORM_NFC, 0, 0, 0, &status); > + ASSERT(status == U_BUFFER_OVERFLOW_ERROR); > + ASSERT(bufferSize); Would it make sense to try with an output buffer of size length,, to avoid having two passes in most cases? > Index: LayoutTests/fast/text/find-kana.html I need to add a similar test for Russian! r=me (In reply to comment #41) > Would it make sense to try with an output buffer of size length, to avoid > having two passes in most cases? Yes. > I need to add a similar test for Russian! Please do. *** Bug 27587 has been marked as a duplicate of this bug. *** Looks like this test breaks on Leopard and Tiger: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r53078%20(9114)/fast/text/find-kana-diffs.txt And Qt: http://build.webkit.org/results/Qt%20Linux%20Release/r53092%20(5839)/fast/text/find-kana-diffs.txt Tiger and Leopard are failing because they are using an old ICU. Qt is failing because it's using the non-ICU search code path. In both cases the best solution is checking in expected failure results. |