24342 – Cannot insert a Thai character after a Thai prepend character when using ICU 4.0

RESOLVED FIXED24342

Cannot insert a Thai character after a Thai prepend character when using ICU 4.0

https://bugs.webkit.org/show_bug.cgi?id=24342

Summary Cannot insert a Thai character after a Thai prepend character when using ICU 4.0

Hironori Bono

Reported 2009-03-03 20:42:06 PST

Copied from "http://crbug.com/3523" * Additional Builds and Platforms: Other browsers tested: Add OK or FAIL after other browsers where you have tested this issue: Safari 3: OK Firefox 3: OK IE 7: OK * Steps to Reproduce: 1. Type "อยากญี่ปุ่น" in an input field. 2. move cursor to before "ญี่ปุ่น" after "อยาก" like this "อยาก|ญี่ปุ่น" 3. Type "ไป" * Expected Result: The input field becomes "อยากไปญี่ปุ่น". * Actual Result: The input field becomes "อยากไญี่ปปุ่น" * Additional Information: ICU 4.0 changed character-break rules to comply with UAX #29 "http://unicode.org/reports/tr29/" and prevent inserting a character break after a "prepend" character. Even though this rule is good for displaying Thai characters, it is not good for inputting them because Thai people are familiar with the behavior of ICU 3.8 for inputting Thai characters. (ICU 4.0 permits inserting a character-break before a half-width katakana voiced mark again.) To prevent this problem, we should add a new break iterator for inputting characters and use custom rules (based on the character-break rules of ICU 3.8) to avoid using the character-break rules of ICU 4.0 for inputting characters.

Attachments
A proposed fix (16.09 KB, patch) 2009-03-03 21:02 PST, Hironori Bono	ap: review-	Details Formatted Diff Diff
A proposed fix 2 (16.39 KB, patch) 2009-03-04 03:17 PST, Hironori Bono	no flags	Details Formatted Diff Diff
a proposed fix 3 (16.57 KB, patch) 2009-03-05 20:43 PST, Hironori Bono	ap: review+	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Hironori Bono

Comment 1 2009-03-03 21:02:41 PST

Created attachment 28254 [details] A proposed fix Even though this issue may happen all platforms which uses ICU 4.0, I'm wondering if we should change "TextBreakIteratorICU.cpp" because this issue currently happens only on Chromium. Would it be possible to give me your comments?

Mark Rowe (bdash)

Comment 2 2009-03-03 22:02:05 PST

Comment on attachment 28254 [details] A proposed fix It seems incorrect that the declaration of codepointBreakIterator is wrapped in #if PLATFORM(CHROMIUM) but called on all platforms.

Alexey Proskuryakov

Comment 3 2009-03-04 00:17:41 PST

Comment on attachment 28254 [details] A proposed fix It is definitely appropriate to fix this for all platforms that use ICU. Are all the ICU functions used here available in at least ICU 3.2? > + This change creates a new break iterator "codepointBreakIterator" for > + inputting characters. The name "codepointBreakIterator" is extremely misleading. I would expect it to move by a code point (i.e. one or two UTF-16 code units). What is this new iterator actually used for? It's called from RenderText::previousOffset() and RenderText::nextOffset(), which are used for far more than just typing text. Is it perhaps a general cursor movement iterator? Please note that cursor movement conventions may be different on each platform, so we should clearly document what this iterator does conceptually, so that platform-specific overrides could be added safely. +#if PLATFORM(CHROMIUM) + TextBreakIterator* codepointBreakIterator(const UChar*, int length); +#endif As mentioned by Mark, this shouldn't be Chromium-specific. #include "TextBreakIteratorInternalICU.h" +#include "AtomicString.h" Please keep include lists sorted. But I do not think you need AtomicString here. The rules string is only used once, and it is immediately destroyed. + if (!iterator) + return 0; Incorrect indentation here. + if (U_FAILURE(setTextStatus)) + return 0; And here. + // This rule is based on the character-break iterator rules of ICU 3.8. + // http://source.icu-project.org/repos/icu/icu/tags/release-3-8/source/data/brkitr/char.txt As mentioned above, the name "codepointBreakIterator" is not appropriate. It's good to tell where these rules come from, but this comment should also expand on what they are expected to do. + return setUpIteratorWithRules(createdCodepointBreakIterator, + staticCodepointBreakIterator, kRules, string, length); There is no need to break this line, it's quite short. > Property changes on: LayoutTests/editing/inserting/insert-thai-characters-001.html > ___________________________________________________________________ > Name: svn:executable > + * Please don't make new files executable.

Hironori Bono

Comment 4 2009-03-04 03:17:38 PST

Created attachment 28262 [details] A proposed fix 2 Thank you for your comments and sorry for sending a stupid fix which contains lots of simple mistakes. I changed my fix to apply your comments as much as possible. (I assume this rule set works also with ICU 3.2 because it does not contain any rules that depend on ICU 3.8 or later. But I have not verified because I don't have any platforms which use ICU 3.2. Sorry.) Would it be possible to review the updated one?

Alexey Proskuryakov

Comment 5 2009-03-04 03:35:10 PST

+ // For example, we should not insert a character break after a Thai preposed + // character (a Thai vowel character) because it breaks a syllable. + // On the other hand, we should be able to move an input cursor even after + // a Thai preposed character. Where should the former character break iterator come into play? Should we allow selections that only cover a prepend character, for example? I'd like to have as many examples of use cases for either iterator as possible, to avoid future confusion. What about my AtomicString comment? I still don't think that you need atomic strings here. The reason for them to exist is quick comparison - two AtomicStrings are equal if and only if their impl pointers are equal. + // The only difference from the original rules is: + // added "!!chain" to change this rule set to chained. Please expand this comment to explain why we need to make the rule set chained.

Hironori Bono

Comment 6 2009-03-05 04:13:32 PST

Thank you for your comments. Sorry for the lack of writing backgrounds. Yes, it is very confusing to figure out this problem without examples. Even though I may not be able to understand this issue totally because I'm not either a native Thai speaker or a WebKit guru who have good knowledge about WebKit, I would like to describe the backgrounds of this issue as much as possible. (In reply to comment #5) > + // For example, we should not insert a character break after a Thai preposed > + // character (a Thai vowel character) because it breaks a syllable. > + // On the other hand, we should be able to move an input cursor even after > + // a Thai preposed character. > > Where should the former character break iterator come into play? Should we > allow selections that only cover a prepend character, for example? To write from a conclusion, yes, Thai people would like to select only a prepend character, delete it or replace it with another prepend character. In brief, prepend characters (U+0E40,...,U+0E44 and U+0EC0,..,U+0EC4) are vowel characters placed but they are NOT COMBINING CHARACTERS. It is a Thai alphabet placed BEFORE a consonant to denote the first vowel of a Thai syllable. So, a Thai syllable may look like two characters from our eyes. To write the original problem, it is that: they cannot insert a Thai syllable (which consists of a prepend character 'A' and a consonant 'B' and looks like two characters 'AB' from our eyes) before another Thai syllable (which consists of a consonant and combining characters and looks like a character 'C' from our eyes). When they type a prepend character 'A', the input cursor moves after the end of the syllable 'C' and the consonant 'B' is inserted after the 'C'. So, the actual result looks like 'ACB' from our eyes. (Its expected result looks like 'ABC' from our eyes.) Also, they cannot delete only a prepend character 'A' from 'ACB' because they cannot move the input cursor after the prepend character 'A'. To investigate this issue deeply, we noticed this issue was caused by a design change of the character-break iterator in ICU 4.0. Before ICU 4.0, its character-break iterator splits a string into grapheme clusters. a grapheme cluster is a set of characters that look like only one character from our eyes. So, it splits a Thai syllable 'AB' which consists of a prepend character and a consonant into two pieces, 'A' and 'B'. The only exception is Japanese half-width katakana voiced marks. It does not split a Japanese syllable which consists of a half-width katanaka and half-width katakana voiced mark into pieces. On the other hand, ICU 4.0 changed its character-break iterator to split a string into "extended" grapheme clusters. For most languages, an "extended" grapheme cluster is same as a grapheme cluster. But, unfortunately for Thai and Lao, an "extended" grapheme cluster becomes a syllable. So, even if a Thai syllable 'AB' looks like two characters from our eyes, it does not split into pieces. On the other hand, somehow, the character-break iterator of ICU 4.0 deletes the exception for Japanese half-width katakana voiced mark, i.e. it does split a Japanese syllable which consists of a half-width katanaka and a half-width katakana voiced mark into pieces as it did in ICU 3.2. So, we have to re-enable a workaround for ICU 3.2 in RenderText::previousOffset() and RenderText::nextOffset() for ICU 4.0. Even though future ICU may change the behaviors of its character-break iterator, we thought we should use a custom iterator for cursor iteration to avoid problems related to ICU versions. > I'd like to have as many examples of use cases for either iterator as possible, > to avoid future confusion. Sorry for your confusion. My description always lacks important background information and it is confusing. > What about my AtomicString comment? I still don't think that you need atomic > strings here. The reason for them to exist is quick comparison - two > AtomicStrings are equal if and only if their impl pointers are equal. Sorry again for this problem. I thought I changed this AtomicString to String, but, actually, I forgot changing it. > + // The only difference from the original rules is: > + // added "!!chain" to change this rule set to chained. > > Please expand this comment to explain why we need to make the rule set chained. This line is for preventing two or more rules from matching to an input string. Nevertheless, to read the latest rule set, there are not any rules matching at once. Even though this line is harmless, I'm going to remove this line to avoid confusion.

Alexey Proskuryakov

Comment 7 2009-03-05 06:38:16 PST

Comment on attachment 28262 [details] A proposed fix 2 I've now built a local copy of ICU 4.0.1, and I can reproduce this problem in Safari with it. I think that the approach in this fix is absolutely correct. Instead of inputCursorIterator, I'd say "cursorMovementIterator" - it is not directly related to input, as it will be used for e.g. selecting non-editable text. Looking over other uses of characterBreakIterator in WebCore, I noticed another case that might be incorrect, which is in CoreTextController::offsetForPosition(). It's better to handle it separately though. Clearing review flag for now, as you said that you were going to submit a new patch.

Alexey Proskuryakov

Comment 8 2009-03-05 07:00:44 PST

Ironically, ICU 4 fixes the test case from bug 15790, so I expect that this patch will break it again. That's OK, I just wanted to have it cross-referenced here.

Hironori Bono

Comment 9 2009-03-05 20:43:46 PST

Created attachment 28346 [details] a proposed fix 3 Thank you for your useful comments and sorry for my slow updates. I renamed inputCursorIterator to cursorMovementIterator, and AtomicString to String. Also I removed a redundant "!!chain;" rule and updated comments. (As for Issue 15790, I'm also checking it with my local build of Chrome. I'm going to update the issue when I find something.) Would it be possible to review it again?

Alexey Proskuryakov

Comment 10 2009-03-06 02:08:59 PST

Comment on attachment 28346 [details] a proposed fix 3 r=me I'm going to tweak the comments a little more when landing.

Alexey Proskuryakov

Comment 11 2009-03-06 03:28:51 PST

Committed <http://trac.webkit.org/changeset/41477>. Could you please check the comments that I modified for accuracy?

Hironori Bono

Comment 12 2009-03-06 03:52:05 PST

Thank you for your review and updates. (In reply to comment #11) > Committed <http://trac.webkit.org/changeset/41477>. Could you please check the > comments that I modified for accuracy? Awesome. They are pretty simple and practical for users who use WebKit. I realized I didn't consider users who use this new iterator so much. Thank you for your correction. Also I think they are pretty accurate.

mossila

Comment 13 2009-05-05 00:27:37 PDT

this bug is found on libwebkit-1.0 gtk+, and webkitmm,too

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component HTML Editing

Assignee

Nobody

Reported

2009-03-03 20:42 PST

Modified

2009-05-05 00:27 PDT History

CC List

4 users Show

URL

Keywords

Depends on

Blocks