Bug 108347 - CSS3 'word-break: break-all' spoils CJK line-break restrictions
Summary: CSS3 'word-break: break-all' spoils CJK line-break restrictions
Status: UNCONFIRMED
Alias: None
Product: WebKit
Classification: Unclassified
Component: CSS (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Myles C. Maxfield
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-01-30 08:46 PST by murakami
Modified: 2018-01-24 06:46 PST (History)
11 users (show)

See Also:


Attachments
Test html using {word-break: break-all} (245 bytes, text/html)
2013-02-15 01:45 PST, Zheng Xu
no flags Details
This is the wrong case. (1.96 KB, image/png)
2013-02-15 01:52 PST, Zheng Xu
no flags Details
This is the expected case. (1.89 KB, image/png)
2013-02-15 01:52 PST, Zheng Xu
no flags Details
Snapshot after adding CJK detecting (95.51 KB, image/png)
2013-02-24 09:53 PST, Lu jing
no flags Details
Snapshot to test word-break (108.88 KB, image/png)
2013-02-26 08:47 PST, Lu jing
no flags Details
Non-CJK punctuations (half-width) Test. (154.55 KB, image/png)
2013-02-26 09:00 PST, Lu jing
no flags Details
Patch v1 (need a little more tests before r?) (11.42 KB, patch)
2013-03-13 09:01 PDT, Zheng Xu
no flags Details | Formatted Diff | Diff
test for break-normal mode (107.62 KB, image/png)
2013-03-14 06:33 PDT, Lu jing
no flags Details
Test result with Numbers (134.24 KB, image/png)
2013-03-19 22:25 PDT, Lu jing
no flags Details
Patch v1 (6.34 KB, patch)
2013-04-29 04:53 PDT, Lu jing
beidson: review-
dujid0: commit-queue?
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description murakami 2013-01-30 08:46:04 PST
CSS3 Text 'word-break: break-all' has the following definition:

‘break-all’
  In addition to ‘normal’ soft wrap opportunities, lines may break 
  between any two letters (except where forbidden by the ‘line-break’ 
  property). Hyphenation is not applied. This option is used mostly 
  in a context where the text is predominantly using CJK characters 
  with few non-CJK excerpts and it is desired that the text be better 
  distributed on each line. 

However, in the current WebKit implementation, 'word-break: break-all'
does not only allow line-break between non-CJK letters but also 
spoils any line-break restrictions, e.g., opening brackets at end of
line, and closing brackets or comma/fullstops at beginning of line.

A test of 'word-break: all':
http://nadita.com/test/wordbreak/word-break-all.html

Wrong result (Chrome 25):
http://nadita.com/test/wordbreak/webkit.png

Expected result (IE9):
http://nadita.com/test/wordbreak/ie.png
Comment 1 Lu jing 2013-02-05 06:40:12 PST
I want to work for fixing this bug, could you assign to me? orz
Comment 2 Zheng Xu 2013-02-15 01:45:39 PST
Created attachment 188512 [details]
Test html using {word-break: break-all}

Maybe this test can show the issue easily.
Comment 3 Zheng Xu 2013-02-15 01:52:07 PST
Created attachment 188514 [details]
This is the wrong case.
Comment 4 Zheng Xu 2013-02-15 01:52:38 PST
Created attachment 188515 [details]
This is the expected case.
Comment 5 Zheng Xu 2013-02-15 01:53:08 PST
I added some test case, are they correct?
Comment 6 Zheng Xu 2013-02-16 03:41:31 PST
This issue can also be reproduced on 537+ (Safari v6.0.2 OSX Mountain Lion)
Comment 7 Zheng Xu 2013-02-18 06:20:40 PST
Problem seems in RenderBlock::LineBreaker::nextSegmentBreak @ RenderBlockLineLayout.cpp.
Comment 8 Koji Ishii 2013-02-18 17:43:18 PST
It'd be helpful if you describe how you're trying to approach the fix. I'm not sure how easy or hard this one is, given the rules in CSS specs are buried deep in break iterator in bug 89235.
Comment 9 Lu jing 2013-02-24 09:42:33 PST
This can be fixed by only changing the following code:

==== from ====
midWordBreak = width.committedWidth() + wrapW + charWidth > width.availableWidth();

==== to ====
midWordBreak = !Font::isCJKIdeographOrSymbol(c) && width.committedWidth() + wrapW + charWidth > width.availableWidth();

in RenderBlockLineLayout.cpp, line 2824.

The current implementations ignores the CJK detecting and always perform line breaking.
Comment 10 Lu jing 2013-02-24 09:53:10 PST
Created attachment 189978 [details]
Snapshot after adding CJK detecting
Comment 11 murakami 2013-02-24 18:20:28 PST
Thanks. The snapshot seems good. However, this test only contains
CJK (fullwidth) punctuations (fullwidth ( ) and 、 。). 
Please check also with non-CJK punctuations 
(e.g. "坊主が,屏風に(上手に)絵を描いた." instead of 
"坊主が、屏風に(上手に)絵を描いた。").
Comment 12 Lu jing 2013-02-24 19:17:29 PST
Thanks for your confirmation and suggestion. I will test it with non-CJK punctuations.

For a mixed string like "坊主が,屏風に(上手に)絵を描いた.",
This change can support '(' but ')' because only the current character is checked.
It's also need to check the previous character.

midWordBreak = !Font::isCJKIdeographOrSymbol(c)
               && !Font::isCJKIdeographOrSymbol(previousChar)
               && ...

After test I will report my result again.
Comment 13 Lu jing 2013-02-26 08:47:10 PST
Created attachment 190296 [details]
Snapshot to test word-break

Non-CJK half-width parentheses test.
Comment 14 Lu jing 2013-02-26 09:00:48 PST
Created attachment 190298 [details]
Non-CJK punctuations (half-width) Test.
Comment 15 murakami 2013-02-27 01:39:00 PST
The result (After Change, Snapshot to test word-break) still has problem:
'(a' and 'a)' must not be broken. (The result of IE9/Win is correct)

The word-break:break-all means treating non-CJK letters as like CJK letters,
so the expected patch will be:

midWordBreak = IsNonCJKLetter(c) /* I don't know correct function name 
                                    but something like this */
               && IsNonCJKLetter(previousChar)
               && ...
Comment 16 Lu jing 2013-02-27 02:03:16 PST
When I did the test yesterday, I worried about that too. w3schools.com says:

break-all: Lines may break between any two characters for non-CJK scripts
http://www.w3schools.com/cssref/css3_pr_word-break.asp

I'm not very clear about the definition of 'any two characters'. But I think you are right, the '(a' and 'b)' should not be broken. 

I will try to find a function like IsNonCJKLetter and redo this test.
Comment 17 murakami 2013-02-27 19:41:02 PST
See the spec:
http://www.w3.org/TR/css3-text/#word-break
  ‘break-all’
     In addition to ‘normal’ soft wrap opportunities, 
    lines may break between any two **letters**
    (except where forbidden by the ‘line-break’ property).
Comment 18 Lu jing 2013-02-28 21:43:45 PST
Thanks for referencing the spec.

Function like IsNonCJKLetter found in break_lines.cpp but just as Ishii said, it only be provided for line break iterator. I am trying to find a better way, which merges current break-all check and line break iterator in order to improve performance than calling IsNonCJKLetter.
Comment 19 Zheng Xu 2013-03-13 09:01:29 PDT
Created attachment 192932 [details]
Patch v1 (need a little more tests before r?)
Comment 20 Zheng Xu 2013-03-13 15:46:08 PDT
If someone else is also handling this, let me know and we can coordinate. My plan for this issue is to create break_words table based on break_lines table and decide if we can break word by searching prevChar and char in the break_words table.
Maybe I need to wait 89235 for some improvement about line break in detail.
Comment 21 Lu jing 2013-03-14 06:33:44 PDT
Created attachment 193115 [details]
test for break-normal mode

Hi Zheng Xu your plan sounds nice. Please have a look at this attachment that may be useful to you.
Comment 22 murakami 2013-03-14 11:16:35 PDT
For 'word-break: normal', the current WebKit behavior, "a(b)c" unbreakable, 
is good and should not be changed.

You should read the Unicode Standard Annex #14, Unicode Line Breaking Algorithm:
http://www.unicode.org/reports/tr14/

    LB30  Do not break between letters, numbers, or ordinary symbols 
    and opening or closing parentheses.

        (AL | HL | NU) × OP
        CP × (AL | HL | NU)

Here, AL=letters and ordinary symbols (non-CJK), HL=Hebrew letters, 
NU=numbers, OP=Open Punctuation, CP=Close Parenthesis, ×=No break allowed.

In 'word-break: break-all' mode, this LB30 rule should be ignored.
The expected behavior for break-all mode can be obtained by treating 
all AL, HL, NU and AI(Ambiguous) characters same as ID (Ideographic).
The following rules will be also affected:

    LB23 Do not break within ‘a9’, ‘3a’, or ‘H%’
        ...
        (AL | HL) × NU
        NU × (AL | HL)

    LB25  Do not break between the following pairs of classes relevant to numbers:
        ...
        IS × NU
        NU × NU
        SY × NU
        ...

    LB28  Do not break between alphabetics (“at”).
        (AL | HL) × (AL | HL)

    LB29  Do not break between numeric punctuation and alphabetics (“e.g.”).
        IS × (AL | HL)
Comment 23 murakami 2013-03-17 22:28:07 PDT
I am reconsidering what I wrote,
"The expected behavior for break-all mode can be obtained by treating 
all AL, HL, NU and AI(Ambiguous) characters same as ID (Ideographic)."
This is not consistent with what CSS3 Text spec saying:

  ‘break-all’
    In addition to ‘normal’ soft wrap opportunities, lines may break 
    between any two letters (except where forbidden by the ‘line-break’ 
    property). Hyphenation is not applied. ...

Note that the term 'letter' here is defined in the CSS3 Text as,
http://www.w3.org/TR/css3-text/#letter0
    A letter for the purpose of this specification is a character 
    belonging to one of the Letter or Number general categories in 
    Unicode. [UAX44] 

In this break-all definition, there is no break opportunity in "a(b)c", 
"b(ス)c" and "o.w" because the characters '(', ')' and '.' are not 
belonging to the letter categories and the break-all permits only
between two letters.

I'm sorry if I made a confusion.
The "After Changes" result now seems to be conform to the spec.
However, you should also test with Numbers (the 'letter' in CSS3 Text
includes numbers). For example, "12", "A3" and "4β" must be
breakable in the 'break-all' mode.
Comment 24 Lu jing 2013-03-19 22:25:22 PDT
Created attachment 193985 [details]
Test result with Numbers

It seems according to the break-all rule.
Comment 25 murakami 2013-03-19 22:47:47 PDT
In the recent "Test result with Numbers", the "After Change, break-all" result seems good,
but the "After Change, normal" rusult seems like "Before Change, break-all".
Comment 26 Lu jing 2013-03-19 23:03:20 PDT
Sorry that is my mistake. Thanks for correct it and your explanation is very useful to me. I am going to do local layout test and prepare the patch.
Comment 27 Lu jing 2013-04-29 04:53:23 PDT
Created attachment 199990 [details]
Patch v1

Could someone have a review for this patch?
Comment 28 Glenn Adams 2013-04-29 09:09:29 PDT
Comment on attachment 199990 [details]
Patch v1

View in context: https://bugs.webkit.org/attachment.cgi?id=199990&action=review

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:2750
> +}

this is not consistent with the definition of break-all found in http://dev.w3.org/csswg/css-text/#word-break; in particular, the spec says that break all applies between any two 'letters', except where forbidden by line-break settings; since the spec defines 'letter' as "a character belonging to one of the Letter or Number general categories in Unicode", it is more than simply ASCII letters/digits;

accordingly, this patch should be r- (i'm not a reviewer, otherwise I would mark r-)

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:3040
> +            UChar previousCharacterInIteration = 0;

use lastCharacter instead of previousCharacterInIteration.

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:3068
> +                    //             can be broken by isBreakableAllLetter function

remove comment, since it is already described in change long; also this comment doesn't follow comment conventions anyway

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:3072
> +                               && isBreakableAllLetter(previousCharacterInIteration)));

use lastCharacter instead of previousCharacterInIteration

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:3074
> +                previousCharacterInIteration = c;

ditto
Comment 29 Lu jing 2013-04-30 08:30:00 PDT
(In reply to comment #28)
> (From update of attachment 199990 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=199990&action=review
> 
> > Source/WebCore/rendering/RenderBlockLineLayout.cpp:2750
> > +}
> 
> this is not consistent with the definition of break-all found in http://dev.w3.org/csswg/css-text/#word-break; in particular, the spec says that break all applies between any two 'letters', except where forbidden by line-break settings; since the spec defines 'letter' as "a character belonging to one of the Letter or Number general categories in Unicode", it is more than simply ASCII letters/digits;
> 
> accordingly, this patch should be r- (i'm not a reviewer, otherwise I would mark r-)

Thank you for review patch. CJK and other Unicode letters will be broken properly by this patch, Unicode letters checked by original line-break code and ICU. (I just fixed break-all, for the CJK letters there is same logic in break-normal mode.) 

> 
> > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3040
> > +            UChar previousCharacterInIteration = 0;
> 
> use lastCharacter instead of previousCharacterInIteration.
> 
> > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3068
> > +                    //             can be broken by isBreakableAllLetter function
> 
> remove comment, since it is already described in change long; also this comment doesn't follow comment conventions anyway

Ok. I will remove this.

> 
> > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3072
> > +                               && isBreakableAllLetter(previousCharacterInIteration)));
> 
> use lastCharacter instead of previousCharacterInIteration
Ok

> 
> > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3074
> > +                previousCharacterInIteration = c;
> 
> ditto
Comment 30 Brady Eidson 2016-05-24 21:38:05 PDT
Comment on attachment 199990 [details]
Patch v1

r- to clear stale patches from the review queue
Comment 31 JhonMerced 2018-01-24 06:41:44 PST
removed spam
Comment 32 JhonMerced 2018-01-24 06:46:20 PST
removed spam