RESOLVED CONFIGURATION CHANGED108347
CSS3 'word-break: break-all' spoils CJK line-break restrictions
https://bugs.webkit.org/show_bug.cgi?id=108347
Summary CSS3 'word-break: break-all' spoils CJK line-break restrictions
murakami
Reported 2013-01-30 08:46:04 PST
CSS3 Text 'word-break: break-all' has the following definition: ‘break-all’ In addition to ‘normal’ soft wrap opportunities, lines may break between any two letters (except where forbidden by the ‘line-break’ property). Hyphenation is not applied. This option is used mostly in a context where the text is predominantly using CJK characters with few non-CJK excerpts and it is desired that the text be better distributed on each line. However, in the current WebKit implementation, 'word-break: break-all' does not only allow line-break between non-CJK letters but also spoils any line-break restrictions, e.g., opening brackets at end of line, and closing brackets or comma/fullstops at beginning of line. A test of 'word-break: all': http://nadita.com/test/wordbreak/word-break-all.html Wrong result (Chrome 25): http://nadita.com/test/wordbreak/webkit.png Expected result (IE9): http://nadita.com/test/wordbreak/ie.png
Attachments
Test html using {word-break: break-all} (245 bytes, text/html)
2013-02-15 01:45 PST, Zheng Xu
no flags
This is the wrong case. (1.96 KB, image/png)
2013-02-15 01:52 PST, Zheng Xu
no flags
This is the expected case. (1.89 KB, image/png)
2013-02-15 01:52 PST, Zheng Xu
no flags
Snapshot after adding CJK detecting (95.51 KB, image/png)
2013-02-24 09:53 PST, Lu jing
no flags
Snapshot to test word-break (108.88 KB, image/png)
2013-02-26 08:47 PST, Lu jing
no flags
Non-CJK punctuations (half-width) Test. (154.55 KB, image/png)
2013-02-26 09:00 PST, Lu jing
no flags
Patch v1 (need a little more tests before r?) (11.42 KB, patch)
2013-03-13 09:01 PDT, Zheng Xu
no flags
test for break-normal mode (107.62 KB, image/png)
2013-03-14 06:33 PDT, Lu jing
no flags
Test result with Numbers (134.24 KB, image/png)
2013-03-19 22:25 PDT, Lu jing
no flags
Patch v1 (6.34 KB, patch)
2013-04-29 04:53 PDT, Lu jing
beidson: review-
Lu jing
Comment 1 2013-02-05 06:40:12 PST
I want to work for fixing this bug, could you assign to me? orz
Zheng Xu
Comment 2 2013-02-15 01:45:39 PST
Created attachment 188512 [details] Test html using {word-break: break-all} Maybe this test can show the issue easily.
Zheng Xu
Comment 3 2013-02-15 01:52:07 PST
Created attachment 188514 [details] This is the wrong case.
Zheng Xu
Comment 4 2013-02-15 01:52:38 PST
Created attachment 188515 [details] This is the expected case.
Zheng Xu
Comment 5 2013-02-15 01:53:08 PST
I added some test case, are they correct?
Zheng Xu
Comment 6 2013-02-16 03:41:31 PST
This issue can also be reproduced on 537+ (Safari v6.0.2 OSX Mountain Lion)
Zheng Xu
Comment 7 2013-02-18 06:20:40 PST
Problem seems in RenderBlock::LineBreaker::nextSegmentBreak @ RenderBlockLineLayout.cpp.
Koji Ishii
Comment 8 2013-02-18 17:43:18 PST
It'd be helpful if you describe how you're trying to approach the fix. I'm not sure how easy or hard this one is, given the rules in CSS specs are buried deep in break iterator in bug 89235.
Lu jing
Comment 9 2013-02-24 09:42:33 PST
This can be fixed by only changing the following code: ==== from ==== midWordBreak = width.committedWidth() + wrapW + charWidth > width.availableWidth(); ==== to ==== midWordBreak = !Font::isCJKIdeographOrSymbol(c) && width.committedWidth() + wrapW + charWidth > width.availableWidth(); in RenderBlockLineLayout.cpp, line 2824. The current implementations ignores the CJK detecting and always perform line breaking.
Lu jing
Comment 10 2013-02-24 09:53:10 PST
Created attachment 189978 [details] Snapshot after adding CJK detecting
murakami
Comment 11 2013-02-24 18:20:28 PST
Thanks. The snapshot seems good. However, this test only contains CJK (fullwidth) punctuations (fullwidth ( ) and 、 。). Please check also with non-CJK punctuations (e.g. "坊主が,屏風に(上手に)絵を描いた." instead of "坊主が、屏風に(上手に)絵を描いた。").
Lu jing
Comment 12 2013-02-24 19:17:29 PST
Thanks for your confirmation and suggestion. I will test it with non-CJK punctuations. For a mixed string like "坊主が,屏風に(上手に)絵を描いた.", This change can support '(' but ')' because only the current character is checked. It's also need to check the previous character. midWordBreak = !Font::isCJKIdeographOrSymbol(c) && !Font::isCJKIdeographOrSymbol(previousChar) && ... After test I will report my result again.
Lu jing
Comment 13 2013-02-26 08:47:10 PST
Created attachment 190296 [details] Snapshot to test word-break Non-CJK half-width parentheses test.
Lu jing
Comment 14 2013-02-26 09:00:48 PST
Created attachment 190298 [details] Non-CJK punctuations (half-width) Test.
murakami
Comment 15 2013-02-27 01:39:00 PST
The result (After Change, Snapshot to test word-break) still has problem: '(a' and 'a)' must not be broken. (The result of IE9/Win is correct) The word-break:break-all means treating non-CJK letters as like CJK letters, so the expected patch will be: midWordBreak = IsNonCJKLetter(c) /* I don't know correct function name but something like this */ && IsNonCJKLetter(previousChar) && ...
Lu jing
Comment 16 2013-02-27 02:03:16 PST
When I did the test yesterday, I worried about that too. w3schools.com says: break-all: Lines may break between any two characters for non-CJK scripts http://www.w3schools.com/cssref/css3_pr_word-break.asp I'm not very clear about the definition of 'any two characters'. But I think you are right, the '(a' and 'b)' should not be broken. I will try to find a function like IsNonCJKLetter and redo this test.
murakami
Comment 17 2013-02-27 19:41:02 PST
See the spec: http://www.w3.org/TR/css3-text/#word-break ‘break-all’ In addition to ‘normal’ soft wrap opportunities, lines may break between any two **letters** (except where forbidden by the ‘line-break’ property).
Lu jing
Comment 18 2013-02-28 21:43:45 PST
Thanks for referencing the spec. Function like IsNonCJKLetter found in break_lines.cpp but just as Ishii said, it only be provided for line break iterator. I am trying to find a better way, which merges current break-all check and line break iterator in order to improve performance than calling IsNonCJKLetter.
Zheng Xu
Comment 19 2013-03-13 09:01:29 PDT
Created attachment 192932 [details] Patch v1 (need a little more tests before r?)
Zheng Xu
Comment 20 2013-03-13 15:46:08 PDT
If someone else is also handling this, let me know and we can coordinate. My plan for this issue is to create break_words table based on break_lines table and decide if we can break word by searching prevChar and char in the break_words table. Maybe I need to wait 89235 for some improvement about line break in detail.
Lu jing
Comment 21 2013-03-14 06:33:44 PDT
Created attachment 193115 [details] test for break-normal mode Hi Zheng Xu your plan sounds nice. Please have a look at this attachment that may be useful to you.
murakami
Comment 22 2013-03-14 11:16:35 PDT
For 'word-break: normal', the current WebKit behavior, "a(b)c" unbreakable, is good and should not be changed. You should read the Unicode Standard Annex #14, Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/ LB30 Do not break between letters, numbers, or ordinary symbols and opening or closing parentheses. (AL | HL | NU) × OP CP × (AL | HL | NU) Here, AL=letters and ordinary symbols (non-CJK), HL=Hebrew letters, NU=numbers, OP=Open Punctuation, CP=Close Parenthesis, ×=No break allowed. In 'word-break: break-all' mode, this LB30 rule should be ignored. The expected behavior for break-all mode can be obtained by treating all AL, HL, NU and AI(Ambiguous) characters same as ID (Ideographic). The following rules will be also affected: LB23 Do not break within ‘a9’, ‘3a’, or ‘H%’ ... (AL | HL) × NU NU × (AL | HL) LB25 Do not break between the following pairs of classes relevant to numbers: ... IS × NU NU × NU SY × NU ... LB28 Do not break between alphabetics (“at”). (AL | HL) × (AL | HL) LB29 Do not break between numeric punctuation and alphabetics (“e.g.”). IS × (AL | HL)
murakami
Comment 23 2013-03-17 22:28:07 PDT
I am reconsidering what I wrote, "The expected behavior for break-all mode can be obtained by treating all AL, HL, NU and AI(Ambiguous) characters same as ID (Ideographic)." This is not consistent with what CSS3 Text spec saying: ‘break-all’ In addition to ‘normal’ soft wrap opportunities, lines may break between any two letters (except where forbidden by the ‘line-break’ property). Hyphenation is not applied. ... Note that the term 'letter' here is defined in the CSS3 Text as, http://www.w3.org/TR/css3-text/#letter0 A letter for the purpose of this specification is a character belonging to one of the Letter or Number general categories in Unicode. [UAX44] In this break-all definition, there is no break opportunity in "a(b)c", "b(ス)c" and "o.w" because the characters '(', ')' and '.' are not belonging to the letter categories and the break-all permits only between two letters. I'm sorry if I made a confusion. The "After Changes" result now seems to be conform to the spec. However, you should also test with Numbers (the 'letter' in CSS3 Text includes numbers). For example, "12", "A3" and "4β" must be breakable in the 'break-all' mode.
Lu jing
Comment 24 2013-03-19 22:25:22 PDT
Created attachment 193985 [details] Test result with Numbers It seems according to the break-all rule.
murakami
Comment 25 2013-03-19 22:47:47 PDT
In the recent "Test result with Numbers", the "After Change, break-all" result seems good, but the "After Change, normal" rusult seems like "Before Change, break-all".
Lu jing
Comment 26 2013-03-19 23:03:20 PDT
Sorry that is my mistake. Thanks for correct it and your explanation is very useful to me. I am going to do local layout test and prepare the patch.
Lu jing
Comment 27 2013-04-29 04:53:23 PDT
Created attachment 199990 [details] Patch v1 Could someone have a review for this patch?
Glenn Adams
Comment 28 2013-04-29 09:09:29 PDT
Comment on attachment 199990 [details] Patch v1 View in context: https://bugs.webkit.org/attachment.cgi?id=199990&action=review > Source/WebCore/rendering/RenderBlockLineLayout.cpp:2750 > +} this is not consistent with the definition of break-all found in http://dev.w3.org/csswg/css-text/#word-break; in particular, the spec says that break all applies between any two 'letters', except where forbidden by line-break settings; since the spec defines 'letter' as "a character belonging to one of the Letter or Number general categories in Unicode", it is more than simply ASCII letters/digits; accordingly, this patch should be r- (i'm not a reviewer, otherwise I would mark r-) > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3040 > + UChar previousCharacterInIteration = 0; use lastCharacter instead of previousCharacterInIteration. > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3068 > + // can be broken by isBreakableAllLetter function remove comment, since it is already described in change long; also this comment doesn't follow comment conventions anyway > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3072 > + && isBreakableAllLetter(previousCharacterInIteration))); use lastCharacter instead of previousCharacterInIteration > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3074 > + previousCharacterInIteration = c; ditto
Lu jing
Comment 29 2013-04-30 08:30:00 PDT
(In reply to comment #28) > (From update of attachment 199990 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=199990&action=review > > > Source/WebCore/rendering/RenderBlockLineLayout.cpp:2750 > > +} > > this is not consistent with the definition of break-all found in http://dev.w3.org/csswg/css-text/#word-break; in particular, the spec says that break all applies between any two 'letters', except where forbidden by line-break settings; since the spec defines 'letter' as "a character belonging to one of the Letter or Number general categories in Unicode", it is more than simply ASCII letters/digits; > > accordingly, this patch should be r- (i'm not a reviewer, otherwise I would mark r-) Thank you for review patch. CJK and other Unicode letters will be broken properly by this patch, Unicode letters checked by original line-break code and ICU. (I just fixed break-all, for the CJK letters there is same logic in break-normal mode.) > > > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3040 > > + UChar previousCharacterInIteration = 0; > > use lastCharacter instead of previousCharacterInIteration. > > > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3068 > > + // can be broken by isBreakableAllLetter function > > remove comment, since it is already described in change long; also this comment doesn't follow comment conventions anyway Ok. I will remove this. > > > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3072 > > + && isBreakableAllLetter(previousCharacterInIteration))); > > use lastCharacter instead of previousCharacterInIteration Ok > > > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3074 > > + previousCharacterInIteration = c; > > ditto
Brady Eidson
Comment 30 2016-05-24 21:38:05 PDT
Comment on attachment 199990 [details] Patch v1 r- to clear stale patches from the review queue
JhonMerced
Comment 31 2018-01-24 06:41:44 PST
removed spam
JhonMerced
Comment 32 2018-01-24 06:46:20 PST
removed spam
Brent Fulgham
Comment 33 2022-07-13 13:59:58 PDT
WebKit now handles this case properly in Safari 15.
Note You need to log in before you can comment on or make changes to this bug.