Summary: | CSS3 'word-break: break-all' spoils CJK line-break restrictions | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | murakami | ||||||||||||||||||||||
Component: | CSS | Assignee: | Myles C. Maxfield <mmaxfield> | ||||||||||||||||||||||
Status: | RESOLVED CONFIGURATION CHANGED | ||||||||||||||||||||||||
Severity: | Normal | CC: | bfulgham, dujid0, enrica, glenn, jhonmerced5, joone, kojii, mmaxfield, rniwa, syoichi, thorton, xz911jp | ||||||||||||||||||||||
Priority: | P2 | ||||||||||||||||||||||||
Version: | 528+ (Nightly build) | ||||||||||||||||||||||||
Hardware: | All | ||||||||||||||||||||||||
OS: | All | ||||||||||||||||||||||||
Attachments: |
|
Description
murakami
2013-01-30 08:46:04 PST
I want to work for fixing this bug, could you assign to me? orz Created attachment 188512 [details]
Test html using {word-break: break-all}
Maybe this test can show the issue easily.
Created attachment 188514 [details]
This is the wrong case.
Created attachment 188515 [details]
This is the expected case.
I added some test case, are they correct? This issue can also be reproduced on 537+ (Safari v6.0.2 OSX Mountain Lion) Problem seems in RenderBlock::LineBreaker::nextSegmentBreak @ RenderBlockLineLayout.cpp. It'd be helpful if you describe how you're trying to approach the fix. I'm not sure how easy or hard this one is, given the rules in CSS specs are buried deep in break iterator in bug 89235. This can be fixed by only changing the following code: ==== from ==== midWordBreak = width.committedWidth() + wrapW + charWidth > width.availableWidth(); ==== to ==== midWordBreak = !Font::isCJKIdeographOrSymbol(c) && width.committedWidth() + wrapW + charWidth > width.availableWidth(); in RenderBlockLineLayout.cpp, line 2824. The current implementations ignores the CJK detecting and always perform line breaking. Created attachment 189978 [details]
Snapshot after adding CJK detecting
Thanks. The snapshot seems good. However, this test only contains CJK (fullwidth) punctuations (fullwidth ( ) and 、 。). Please check also with non-CJK punctuations (e.g. "坊主が,屏風に(上手に)絵を描いた." instead of "坊主が、屏風に(上手に)絵を描いた。"). Thanks for your confirmation and suggestion. I will test it with non-CJK punctuations. For a mixed string like "坊主が,屏風に(上手に)絵を描いた.", This change can support '(' but ')' because only the current character is checked. It's also need to check the previous character. midWordBreak = !Font::isCJKIdeographOrSymbol(c) && !Font::isCJKIdeographOrSymbol(previousChar) && ... After test I will report my result again. Created attachment 190296 [details]
Snapshot to test word-break
Non-CJK half-width parentheses test.
Created attachment 190298 [details]
Non-CJK punctuations (half-width) Test.
The result (After Change, Snapshot to test word-break) still has problem: '(a' and 'a)' must not be broken. (The result of IE9/Win is correct) The word-break:break-all means treating non-CJK letters as like CJK letters, so the expected patch will be: midWordBreak = IsNonCJKLetter(c) /* I don't know correct function name but something like this */ && IsNonCJKLetter(previousChar) && ... When I did the test yesterday, I worried about that too. w3schools.com says: break-all: Lines may break between any two characters for non-CJK scripts http://www.w3schools.com/cssref/css3_pr_word-break.asp I'm not very clear about the definition of 'any two characters'. But I think you are right, the '(a' and 'b)' should not be broken. I will try to find a function like IsNonCJKLetter and redo this test. See the spec: http://www.w3.org/TR/css3-text/#word-break ‘break-all’ In addition to ‘normal’ soft wrap opportunities, lines may break between any two **letters** (except where forbidden by the ‘line-break’ property). Thanks for referencing the spec. Function like IsNonCJKLetter found in break_lines.cpp but just as Ishii said, it only be provided for line break iterator. I am trying to find a better way, which merges current break-all check and line break iterator in order to improve performance than calling IsNonCJKLetter. Created attachment 192932 [details]
Patch v1 (need a little more tests before r?)
If someone else is also handling this, let me know and we can coordinate. My plan for this issue is to create break_words table based on break_lines table and decide if we can break word by searching prevChar and char in the break_words table. Maybe I need to wait 89235 for some improvement about line break in detail. Created attachment 193115 [details]
test for break-normal mode
Hi Zheng Xu your plan sounds nice. Please have a look at this attachment that may be useful to you.
For 'word-break: normal', the current WebKit behavior, "a(b)c" unbreakable, is good and should not be changed. You should read the Unicode Standard Annex #14, Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/ LB30 Do not break between letters, numbers, or ordinary symbols and opening or closing parentheses. (AL | HL | NU) × OP CP × (AL | HL | NU) Here, AL=letters and ordinary symbols (non-CJK), HL=Hebrew letters, NU=numbers, OP=Open Punctuation, CP=Close Parenthesis, ×=No break allowed. In 'word-break: break-all' mode, this LB30 rule should be ignored. The expected behavior for break-all mode can be obtained by treating all AL, HL, NU and AI(Ambiguous) characters same as ID (Ideographic). The following rules will be also affected: LB23 Do not break within ‘a9’, ‘3a’, or ‘H%’ ... (AL | HL) × NU NU × (AL | HL) LB25 Do not break between the following pairs of classes relevant to numbers: ... IS × NU NU × NU SY × NU ... LB28 Do not break between alphabetics (“at”). (AL | HL) × (AL | HL) LB29 Do not break between numeric punctuation and alphabetics (“e.g.”). IS × (AL | HL) I am reconsidering what I wrote, "The expected behavior for break-all mode can be obtained by treating all AL, HL, NU and AI(Ambiguous) characters same as ID (Ideographic)." This is not consistent with what CSS3 Text spec saying: ‘break-all’ In addition to ‘normal’ soft wrap opportunities, lines may break between any two letters (except where forbidden by the ‘line-break’ property). Hyphenation is not applied. ... Note that the term 'letter' here is defined in the CSS3 Text as, http://www.w3.org/TR/css3-text/#letter0 A letter for the purpose of this specification is a character belonging to one of the Letter or Number general categories in Unicode. [UAX44] In this break-all definition, there is no break opportunity in "a(b)c", "b(ス)c" and "o.w" because the characters '(', ')' and '.' are not belonging to the letter categories and the break-all permits only between two letters. I'm sorry if I made a confusion. The "After Changes" result now seems to be conform to the spec. However, you should also test with Numbers (the 'letter' in CSS3 Text includes numbers). For example, "12", "A3" and "4β" must be breakable in the 'break-all' mode. Created attachment 193985 [details]
Test result with Numbers
It seems according to the break-all rule.
In the recent "Test result with Numbers", the "After Change, break-all" result seems good, but the "After Change, normal" rusult seems like "Before Change, break-all". Sorry that is my mistake. Thanks for correct it and your explanation is very useful to me. I am going to do local layout test and prepare the patch. Created attachment 199990 [details]
Patch v1
Could someone have a review for this patch?
Comment on attachment 199990 [details] Patch v1 View in context: https://bugs.webkit.org/attachment.cgi?id=199990&action=review > Source/WebCore/rendering/RenderBlockLineLayout.cpp:2750 > +} this is not consistent with the definition of break-all found in http://dev.w3.org/csswg/css-text/#word-break; in particular, the spec says that break all applies between any two 'letters', except where forbidden by line-break settings; since the spec defines 'letter' as "a character belonging to one of the Letter or Number general categories in Unicode", it is more than simply ASCII letters/digits; accordingly, this patch should be r- (i'm not a reviewer, otherwise I would mark r-) > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3040 > + UChar previousCharacterInIteration = 0; use lastCharacter instead of previousCharacterInIteration. > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3068 > + // can be broken by isBreakableAllLetter function remove comment, since it is already described in change long; also this comment doesn't follow comment conventions anyway > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3072 > + && isBreakableAllLetter(previousCharacterInIteration))); use lastCharacter instead of previousCharacterInIteration > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3074 > + previousCharacterInIteration = c; ditto (In reply to comment #28) > (From update of attachment 199990 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=199990&action=review > > > Source/WebCore/rendering/RenderBlockLineLayout.cpp:2750 > > +} > > this is not consistent with the definition of break-all found in http://dev.w3.org/csswg/css-text/#word-break; in particular, the spec says that break all applies between any two 'letters', except where forbidden by line-break settings; since the spec defines 'letter' as "a character belonging to one of the Letter or Number general categories in Unicode", it is more than simply ASCII letters/digits; > > accordingly, this patch should be r- (i'm not a reviewer, otherwise I would mark r-) Thank you for review patch. CJK and other Unicode letters will be broken properly by this patch, Unicode letters checked by original line-break code and ICU. (I just fixed break-all, for the CJK letters there is same logic in break-normal mode.) > > > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3040 > > + UChar previousCharacterInIteration = 0; > > use lastCharacter instead of previousCharacterInIteration. > > > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3068 > > + // can be broken by isBreakableAllLetter function > > remove comment, since it is already described in change long; also this comment doesn't follow comment conventions anyway Ok. I will remove this. > > > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3072 > > + && isBreakableAllLetter(previousCharacterInIteration))); > > use lastCharacter instead of previousCharacterInIteration Ok > > > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3074 > > + previousCharacterInIteration = c; > > ditto Comment on attachment 199990 [details]
Patch v1
r- to clear stale patches from the review queue
removed spam removed spam WebKit now handles this case properly in Safari 15. |