Bug 24342 - Cannot insert a Thai character after a Thai prepend character when using ICU 4.0
: Cannot insert a Thai character after a Thai prepend character when using ICU 4.0
Status: RESOLVED FIXED
: WebKit
HTML Editing
: 528+ (Nightly build)
: All All
: P2 Normal
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2009-03-03 20:42 PST by
Modified: 2009-05-05 00:27 PST (History)


Attachments
A proposed fix (16.09 KB, patch)
2009-03-03 21:02 PST, Hironori Bono
ap: review-
Review Patch | Details | Formatted Diff | Diff
A proposed fix 2 (16.39 KB, patch)
2009-03-04 03:17 PST, Hironori Bono
no flags Review Patch | Details | Formatted Diff | Diff
a proposed fix 3 (16.57 KB, patch)
2009-03-05 20:43 PST, Hironori Bono
ap: review+
Review Patch | Details | Formatted Diff | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2009-03-03 20:42:06 PST
Copied from "http://crbug.com/3523"

* Additional Builds and Platforms:

Other browsers tested:
Add OK or FAIL after other browsers where you have tested this issue:
   Safari 3: OK
  Firefox 3: OK
       IE 7: OK

* Steps to Reproduce:

1. Type "อยากญี่ปุ่น" in an input field.
2. move cursor to before "ญี่ปุ่น" after "อยาก" like this  "อยาก|ญี่ปุ่น"
3. Type "ไป"

* Expected Result:

The input field becomes "อยากไปญี่ปุ่น".

* Actual Result:

The input field becomes "อยากไญี่ปปุ่น"

* Additional Information:

ICU 4.0 changed character-break rules to comply with UAX #29 "http://unicode.org/reports/tr29/" and prevent inserting a character break after a "prepend" character. Even though this rule is good for displaying Thai characters, it is not good for inputting them because Thai people are familiar with the behavior of ICU 3.8 for inputting Thai characters.
(ICU 4.0 permits inserting a character-break before a half-width katakana voiced mark again.)

To prevent this problem, we should add a new break iterator for inputting characters and use custom rules (based on the character-break rules of ICU 3.8) to avoid using the character-break rules of ICU 4.0 for inputting characters.
------- Comment #1 From 2009-03-03 21:02:41 PST -------
Created an attachment (id=28254) [details]
A proposed fix

Even though this issue may happen all platforms which uses ICU 4.0, I'm wondering if we should change "TextBreakIteratorICU.cpp" because this issue currently happens only on Chromium.
Would it be possible to give me your comments?
------- Comment #2 From 2009-03-03 22:02:05 PST -------
(From update of attachment 28254 [details])
It seems incorrect that the declaration of codepointBreakIterator is wrapped in #if PLATFORM(CHROMIUM) but called on all platforms.
------- Comment #3 From 2009-03-04 00:17:41 PST -------
(From update of attachment 28254 [details])
It is definitely appropriate to fix this for all platforms that use ICU. Are all the ICU functions used here available in at least ICU 3.2?

> +        This change creates a new break iterator "codepointBreakIterator" for
> +        inputting characters.

The name "codepointBreakIterator" is extremely misleading. I would expect it to move by a code point (i.e. one or two UTF-16 code units).

What is this new iterator actually used for? It's called from RenderText::previousOffset() and RenderText::nextOffset(), which are used for far more than just typing text. Is it perhaps a general cursor movement iterator?

Please note that cursor movement conventions may be different on each platform, so we should clearly document what this iterator does conceptually, so that platform-specific overrides could be added safely.

+#if PLATFORM(CHROMIUM)
+    TextBreakIterator* codepointBreakIterator(const UChar*, int length);
+#endif

As mentioned by Mark, this shouldn't be Chromium-specific.

 #include "TextBreakIteratorInternalICU.h"
+#include "AtomicString.h"

Please keep include lists sorted. But I do not think you need AtomicString here. The rules string is only used once, and it is immediately destroyed.

+    if (!iterator)
+      return 0;

Incorrect indentation here.

+    if (U_FAILURE(setTextStatus))
+      return 0;

And here.

+    // This rule is based on the character-break iterator rules of ICU 3.8.
+    // http://source.icu-project.org/repos/icu/icu/tags/release-3-8/source/data/brkitr/char.txt

As mentioned above, the name "codepointBreakIterator" is not appropriate. It's good to tell where these rules come from, but this comment should also expand on what they are expected to do.

+    return setUpIteratorWithRules(createdCodepointBreakIterator,
+        staticCodepointBreakIterator, kRules, string, length);

There is no need to break this line, it's quite short.

> Property changes on: LayoutTests/editing/inserting/insert-thai-characters-001.html
> ___________________________________________________________________
> Name: svn:executable
>    + *

Please don't make new files executable.
------- Comment #4 From 2009-03-04 03:17:38 PST -------
Created an attachment (id=28262) [details]
A proposed fix 2

Thank you for your comments and sorry for sending a stupid fix which contains lots of simple mistakes.
I changed my fix to apply your comments as much as possible.
(I assume this rule set works also with ICU 3.2 because it does not contain any rules that depend on ICU 3.8 or later. But I have not verified because I don't have any platforms which use ICU 3.2. Sorry.)

Would it be possible to review the updated one?
------- Comment #5 From 2009-03-04 03:35:10 PST -------
+    // For example, we should not insert a character break after a Thai preposed
+    // character (a Thai vowel character) because it breaks a syllable.
+    // On the other hand, we should be able to move an input cursor even after
+    // a Thai preposed character.

Where should the former character break iterator come into play? Should we allow selections that only cover a prepend character, for example?

I'd like to have as many examples of use cases for either iterator as possible, to avoid future confusion.

What about my AtomicString comment? I still don't think that you need atomic strings here. The reason for them to exist is quick comparison - two AtomicStrings are equal if and only if their impl pointers are equal.

+    // The only difference from the original rules is:
+    //   added "!!chain" to change this rule set to chained.

Please expand this comment to explain why we need to make the rule set chained.
------- Comment #6 From 2009-03-05 04:13:32 PST -------
Thank you for your comments.
Sorry for the lack of writing backgrounds. Yes, it is very confusing to figure
out this problem without examples.
Even though I may not be able to understand this issue totally because I'm not either
a native Thai speaker or a WebKit guru who have good knowledge about WebKit, I would like
to describe the backgrounds of this issue as much as possible.

(In reply to comment #5)
> +    // For example, we should not insert a character break after a Thai preposed
> +    // character (a Thai vowel character) because it breaks a syllable.
> +    // On the other hand, we should be able to move an input cursor even after
> +    // a Thai preposed character.
> 
> Where should the former character break iterator come into play? Should we
> allow selections that only cover a prepend character, for example?

To write from a conclusion, yes, Thai people would like to select only a prepend
character, delete it or replace it with another prepend character.

In brief, prepend characters (U+0E40,...,U+0E44 and U+0EC0,..,U+0EC4) are vowel
characters placed but they are NOT COMBINING CHARACTERS. It is a Thai alphabet
placed BEFORE a consonant to denote the first vowel of a Thai syllable.
So, a Thai syllable may look like two characters from our eyes.

To write the original problem, it is that:
  they cannot insert a Thai syllable (which consists of a prepend character 'A'
  and a consonant 'B' and looks like two characters 'AB' from our eyes) before
  another Thai syllable (which consists of a consonant and combining characters
  and looks like a character 'C' from our eyes).
  When they type a prepend character 'A', the input cursor moves after the end
  of the syllable 'C' and the consonant 'B' is inserted after the 'C'. So, the
  actual result looks like 'ACB' from our eyes. (Its expected result looks like
  'ABC' from our eyes.)
  Also, they cannot delete only a prepend character 'A' from 'ACB' because they
  cannot move the input cursor after the prepend character 'A'.

To investigate this issue deeply, we noticed this issue was caused by a design
change of the character-break iterator in ICU 4.0.

Before ICU 4.0, its character-break iterator splits a string into grapheme
clusters. a grapheme cluster is a set of characters that look like only one
character from our eyes. So, it splits a Thai syllable 'AB' which consists of
a prepend character and a consonant into two pieces, 'A' and 'B'.
The only exception is Japanese half-width katakana voiced marks. It does not
split a Japanese syllable which consists of a half-width katanaka and 
 half-width katakana voiced mark into pieces.

On the other hand, ICU 4.0 changed its character-break iterator to split a
string into "extended" grapheme clusters. For most languages, an "extended"
grapheme cluster is same as a grapheme cluster. But, unfortunately for Thai and
Lao, an "extended" grapheme cluster becomes a syllable. So, even if a Thai
syllable 'AB' looks like two characters from our eyes, it does not split into pieces.
On the other hand, somehow, the character-break iterator of ICU 4.0 deletes the
exception for Japanese half-width katakana voiced mark, i.e. it does split a
Japanese syllable which consists of a half-width katanaka and a half-width
katakana voiced mark into pieces as it did in ICU 3.2. So, we have to re-enable
a workaround for ICU 3.2 in RenderText::previousOffset() and RenderText::nextOffset()
for ICU 4.0.

Even though future ICU may change the behaviors of its character-break iterator,
we thought we should use a custom iterator for cursor iteration to avoid problems
related to ICU versions.

> I'd like to have as many examples of use cases for either iterator as possible,
> to avoid future confusion.

Sorry for your confusion. My description always lacks important background
information and it is confusing.

> What about my AtomicString comment? I still don't think that you need atomic
> strings here. The reason for them to exist is quick comparison - two
> AtomicStrings are equal if and only if their impl pointers are equal.

Sorry again for this problem. I thought I changed this AtomicString to String,
but, actually, I forgot changing it.

> +    // The only difference from the original rules is:
> +    //   added "!!chain" to change this rule set to chained.
>
> Please expand this comment to explain why we need to make the rule set chained.

This line is for preventing two or more rules from matching to an input string.
Nevertheless, to read the latest rule set, there are not any rules matching at
once. Even though this line is harmless, I'm going to remove this line to avoid
confusion.
------- Comment #7 From 2009-03-05 06:38:16 PST -------
(From update of attachment 28262 [details])
I've now built a local copy of ICU 4.0.1, and I can reproduce this problem in Safari with it. I think that the approach in this fix is absolutely correct.

Instead of inputCursorIterator, I'd say "cursorMovementIterator" - it is not directly related to input, as it will be used for e.g. selecting non-editable text.

Looking over other uses of characterBreakIterator in WebCore, I noticed another case that might be incorrect, which is in CoreTextController::offsetForPosition(). It's better to handle it separately though.

Clearing review flag for now, as you said that you were going to submit a new patch.
------- Comment #8 From 2009-03-05 07:00:44 PST -------
Ironically, ICU 4 fixes the test case from bug 15790, so I expect that this patch will break it again. That's OK, I just wanted to have it cross-referenced here.
------- Comment #9 From 2009-03-05 20:43:46 PST -------
Created an attachment (id=28346) [details]
a proposed fix 3

Thank you for your useful comments and sorry for my slow updates.
I renamed inputCursorIterator to cursorMovementIterator, and AtomicString to String. Also I removed a redundant "!!chain;" rule and updated comments.
(As for Issue 15790, I'm also checking it with my local build of Chrome. I'm going to update the issue when I find something.)
Would it be possible to review it again?
------- Comment #10 From 2009-03-06 02:08:59 PST -------
(From update of attachment 28346 [details])
r=me

I'm going to tweak the comments a little more when landing.
------- Comment #11 From 2009-03-06 03:28:51 PST -------
Committed <http://trac.webkit.org/changeset/41477>. Could you please check the comments that I modified for accuracy?
------- Comment #12 From 2009-03-06 03:52:05 PST -------
Thank you for your review and updates.

(In reply to comment #11)
> Committed <http://trac.webkit.org/changeset/41477>. Could you please check the
> comments that I modified for accuracy?

Awesome. They are pretty simple and practical for users who use WebKit. I realized I didn't consider users who use this new iterator so much. Thank you for your correction.
Also I think they are pretty accurate.
------- Comment #13 From 2009-05-05 00:27:37 PST -------
this bug is found on libwebkit-1.0 gtk+, and webkitmm,too