Bug 89235 - CSS3: line-break property support
: CSS3: line-break property support
Status: REOPENED
: WebKit
CSS
: 528+ (Nightly build)
: Unspecified Unspecified
: P2 Normal
Assigned To:
:
: WebExposed
: 100739 101138 101302 101371 109954 110872 110892 110944
: 46481 101123 101124 101250 101281 101368 106303
  Show dependency treegraph
 
Reported: 2012-06-15 11:39 PST by
Modified: 2013-10-02 12:28 PST (History)


Attachments
Patch (951.09 KB, patch)
2012-09-05 01:54 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (951.16 KB, patch)
2012-09-05 04:23 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (953.92 KB, patch)
2012-09-05 17:07 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (954.30 KB, patch)
2012-09-05 20:32 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (954.53 KB, patch)
2012-09-06 18:58 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (951.61 KB, patch)
2012-09-06 19:31 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (951.76 KB, patch)
2012-09-06 20:45 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (966.00 KB, patch)
2012-09-11 01:58 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (1.12 MB, patch)
2012-09-13 00:58 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (1.12 MB, patch)
2012-09-13 01:29 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
performance test results and comparison (14.03 KB, text/html)
2012-09-13 05:14 PST, Glenn Adams
no flags Details
Patch (1.12 MB, patch)
2012-09-13 19:47 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (1.12 MB, patch)
2012-09-14 17:42 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
address comment #60 (1.12 MB, patch)
2012-09-18 21:34 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (1.13 MB, patch)
2012-10-15 07:28 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (1.13 MB, patch)
2012-10-16 01:17 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (1.12 MB, patch)
2012-10-16 02:34 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (1.12 MB, patch)
2012-10-28 04:18 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (1.12 MB, patch)
2012-10-29 09:30 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (465.41 KB, patch)
2012-10-31 04:08 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (465.45 KB, patch)
2012-10-31 05:58 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (465.83 KB, patch)
2012-11-01 14:00 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (465.84 KB, patch)
2012-11-01 14:13 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (492.31 KB, patch)
2012-11-05 13:53 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (51.18 KB, patch)
2013-02-22 15:59 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff
Patch (51.09 KB, patch)
2013-02-22 19:34 PST, Glenn Adams
no flags Review Patch | Details | Formatted Diff | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2012-06-15 11:39:19 PST
http://dev.w3.org/csswg/css3-text/#line-break has introduced the following values for line-break. 

‘auto’
The UA determines the set of line-breaking restrictions to use, and it may vary the restrictions based on the length of the line; e.g., use a less restrictive set of line-break rules for short lines.
‘loose’
Breaks text using the least restrictive set of line-breaking rules. Typically used for short lines, such as in newspapers.
‘normal’
Breaks text using the most common set of line-breaking rules.
‘strict’
Breaks text using the most stringent set of line-breaking rules.


Depending on the value specified, line breaking behavior has to be different. 

For ports that use ICU, one possibility is to create separate line breaking iterators instances with different tailorings for each value. 

The other possibility (ports that can easily change ICU itself such as Chromium) is to add tailored line breaking rule files to ICU (  line_ja_strict for strict mode in  Japanese, line_ja_loose for loose mode in Japanese, etc) and pick one by changing the locale name. 


See also http://unicode.org/cldr/trac/ticket/4931 (CLDR feature request to parameterize line breaking rules).
------- Comment #1 From 2012-06-19 13:59:09 PST -------
Another related question is which of three values has to be the default. IE and Firefox use 'normal' as the default. 

The current Safari and Chrome behavior (based on ICU 4.x) is roughly 'strict', I believe.
------- Comment #2 From 2012-06-20 11:05:18 PST -------
Regarding the default selection, allow me to add some background just in case this helps discussions.

Traditional documents usually use "strict" even today, and since JLREQ[1]'s goal is to document traditional typesetting, it recommends "strict" as default.

However, casual documents tend to prefer "normal" today. MS Word defaults to similar setting as "normal," IE and Firefox as well. There was a discussion in Japan to change JLREQ's default to "normal," but we concluded that JLREQ should follow tradition given its goal.

Making normal as default has another benefit for small screen because strict rule will cause more justification. Less justifications are usually considered better.

I prefer making normal as default for Web for reasons above, and allowing author to select would be great for e-book publishers.

[1] http://www.w3.org/TR/jlreq/
------- Comment #3 From 2012-06-20 11:34:13 PST -------
A feature request at ICU is here:
http://bugs.icu-project.org/trac/ticket/9379
Request to add Japanese linebreak tailoring selectable as variations
------- Comment #4 From 2012-08-23 22:42:52 PST -------
if someone else is already is working this, let me know and we can coordinate; i'm planning to implement natively as a first pass, i.e., without making use of ICU, in order to obtain behavioral consistency across ports
------- Comment #5 From 2012-08-24 23:48:41 PST -------
(In reply to comment #4)
> i'm planning to implement natively as a first pass, i.e., without making use of ICU, in order to obtain behavioral consistency across ports

scratch that last; now that i've looked things over, i expect i can use ICU but supply a custom rule set
------- Comment #6 From 2012-08-26 08:13:39 PST -------
(In reply to comment #5)
> (In reply to comment #4)
> > i'm planning to implement natively as a first pass, i.e., without making use of ICU, in order to obtain behavioral consistency across ports
> 
> scratch that last; now that i've looked things over, i expect i can use ICU but supply a custom rule set

How are you trying to "use ICU but supply a custom rule set"? Are you going to wrap ICU with additional code, or to change ICU itself? Could you please share the plan before you got too much code written? Jungshik Shin manages ICU in Chromium, so I'd like to make sure he's ok with your plan.

If we modify ICU, how to upstream the changes to ICU is an issue. If we don't upstream, we need to consider how we'd keep merging the local changes to future ICU. Some platforms use ICU in OS, so that'd be another concern.

If you're wrapping ICU line breaking APIs, we don't have to worry about upstreaming changes to ICU. It aligns to some requests to add @line-break-rule to allow authors fully control break-before/after characters in future levels of CSS Text. The risk here is we may need a bit more code than modifying ICU.
------- Comment #7 From 2012-08-26 17:49:51 PST -------
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > i'm planning to implement natively as a first pass, i.e., without making use of ICU, in order to obtain behavioral consistency across ports
> > 
> > scratch that last; now that i've looked things over, i expect i can use ICU but supply a custom rule set
> 
> How are you trying to "use ICU but supply a custom rule set"? Are you going to wrap ICU with additional code, or to change ICU itself? Could you please share the plan before you got too much code written? Jungshik Shin manages ICU in Chromium, so I'd like to make sure he's ok with your plan.

By using ubrk_openRules instead of ubrk_open (from LineBreakIteratorPool::take); at present, ICU does not admit any customization parameters other than the locale string itself when constructing a BreakIterator. Since there is no locale intrinsic mechanism for specifying break iterator variants (without introducing a new locale string keyword, e.g., @break), that leaves us with having to use ubrk_openRules and specify the rule set from the outside, e.g., in a fashion similar to cursorBreakIterator (cf. TextBreakIteratorICU.cpp).

Thus I'm leaning towards using ubrk_openRules as an initial approach pending further consideration of support in ICU itself (see below).

> 
> If we modify ICU, how to upstream the changes to ICU is an issue. If we don't upstream, we need to consider how we'd keep merging the local changes to future ICU. Some platforms use ICU in OS, so that'd be another concern.

My thinking for subsequent migration of this to ICU is to add an @break keyword to the ICU supported locale, e.g., "ja@break=loose", etc. This would load a different set of rules than currently found in icu/source/data/brkitr/line_ja.txt, e.g., loading instead from icu/source/data/brkitr/line_ja_loose.txt.

> 
> If you're wrapping ICU line breaking APIs, we don't have to worry about upstreaming changes to ICU. It aligns to some requests to add @line-break-rule to allow authors fully control break-before/after characters in future levels of CSS Text. The risk here is we may need a bit more code than modifying ICU.

Presuming that the semantics of the current CSS3 line-break values can be effectively encoded using the ICU break rules syntax (which I am now becoming familiar with), then using ubrk_openRules should minimize the additional wrapper code needed, reducing it to a number of data sets containing the necessary rules. Unfortunately, it doesn't look like there is an easy way to have one set of rules inherit from another set of rules, which means violating DRY principles. I notice, e.g., that line_ja.txt copies the rules from line.txt modifying as needed. We will need to end up with something like the following sets:

line_loose.txt    <= copies from line.txt with mods
line_normal.txt <= copies from line.txt (or is same as line.txt?)
line_strict.txt    <= copies from line.txt with mods

line_ja_loose.txt    <= copies from line_ja.txt with mods
line_ja_normal.txt <= copies from line_ja.txt with mods
line_ja_strict.txt    <= copies from line_ja.txt with mods (this may be same as current line_ja.txt)


line_zh.txt             <= copies from line.txt with mods
line_zh_loose.txt    <= copies from line_zh.txt with mods
line_zh_normal.txt <= copies from line_zh.txt (with mods?)
line_zh_strict.txt    <= copies from line_zh.txt with mods

Comments?
------- Comment #8 From 2012-08-31 06:26:49 PST -------
I've created two wiki pages that attempt to describe where I am going with resolving this bug.

* https://trac.webkit.org/wiki/LineBreaking
* https://trac.webkit.org/wiki/LineBreakingCSS3Mapping

Please review and comment.
------- Comment #9 From 2012-08-31 20:22:48 PST -------
Looks good to me. You might know but please note that each port uses different versions of ICU. Chromium has its own copy in the source tree. I don't know how much ICU changes its APIs across versions, but I wish your attention here.
------- Comment #10 From 2012-08-31 20:41:25 PST -------
(In reply to comment #9)
> Looks good to me. You might know but please note that each port uses different versions of ICU. Chromium has its own copy in the source tree. I don't know how much ICU changes its APIs across versions, but I wish your attention here.

thanks for that heads up
------- Comment #11 From 2012-09-02 22:08:36 PST -------
this bug is expected to change the values of the -webkit-line-break property; in particular, the following changes are expected:

(1) add 'auto', 'loose', and 'strict' values (as defined in CSS3 Text);
(2) change value 'normal' to mean as defined by CSS3 Text;
(3) change initial (default) value to 'auto'

this means that authors expecting the default value returned to be 'normal' will now get 'auto', and the meaning of 'normal' will (very slightly) change from the previous use of normal;

before the change, 'normal' translated to the use of ICU's default line breaking rules, possibly determined by locale (only 'ja' and 'fi' locales have a different set of rules in ICU 49);

after the change, 'normal' will mean the same thing, plus the rules of CSS3 Text that apply to 'normal' will apply as well;

see the following links for further details:

* https://trac.webkit.org/wiki/LineBreaking
* https://trac.webkit.org/wiki/LineBreakingCSS3Mapping

note that no patch has yet been submitted or applied, so the above information is forward looking, and may change before this bug is closed
------- Comment #12 From 2012-09-05 01:54:22 PST -------
Created an attachment (id=162188) [details]
Patch
------- Comment #13 From 2012-09-05 04:23:47 PST -------
Created an attachment (id=162213) [details]
Patch
------- Comment #14 From 2012-09-05 09:46:37 PST -------
Has this feature been announced on webkit-dev? I saw an e-mail discussing testing strategy, but not a discussion of the feature itself.
------- Comment #15 From 2012-09-05 09:50:44 PST -------
(From update of attachment 162213 [details])
Attachment 162213 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/13755805

New failing tests:
fast/css3-line-break/line-break-loose-inseparables-ko.html
fast/css3-line-break/line-break-loose-centered-zh.html
fast/css3-line-break/line-break-auto-sound-marks-en.html
fast/css3-line-break/line-break-auto-hyphens-ja.html
fast/css3-line-break/line-break-auto-sound-marks-ko.html
fast/css3-line-break/line-break-loose-iteration-marks-ja.html
fast/css3-line-break/line-break-auto-half-kana-en.html
fast/css3-line-break/line-break-loose-prefixes-ja.html
fast/css3-line-break/line-break-loose-iteration-marks-zh.html
fast/css3-line-break/line-break-auto-hyphens-ko.html
fast/css3-line-break/line-break-loose-hyphens-ko.html
fast/css3-line-break/line-break-loose-postfixes-ja.html
fast/css3-line-break/line-break-loose-iteration-marks-ko.html
fast/css3-line-break/line-break-loose-centered-ko.html
fast/canvas/webgl/shader-precision-format.html
fast/css3-line-break/line-break-loose-prefixes-ko.html
fast/css3-line-break/line-break-loose-centered-ja.html
fast/css3-line-break/line-break-loose-inseparables-en.html
fast/css3-line-break/line-break-loose-hyphens-ja.html
fast/css3-line-break/line-break-auto-hyphens-zh.html
fast/css3-line-break/line-break-auto-sound-marks-zh.html
fast/css3-line-break/line-break-loose-hyphens-zh.html
fast/css3-line-break/line-break-loose-postfixes-zh.html
fast/css3-line-break/line-break-loose-postfixes-ko.html
fast/css3-line-break/line-break-loose-iteration-marks-en.html
fast/css3-line-break/line-break-auto-sound-marks-ja.html
fast/css3-line-break/line-break-loose-inseparables-ja.html
fast/css3-line-break/line-break-loose-inseparables-zh.html
------- Comment #16 From 2012-09-05 12:13:16 PST -------
Thank you for the patch. 

(In reply to comment #5)
>  i expect i can use ICU but supply a custom rule set

That's my approach #1 in comment #0. It works without for all ports that use ICU regardless of ICU versions used by a particular port (as long as Unicode properties you use in your rules are well-defined in the version of ICU used by a port).  Not having dependency on the version of ICU (and CLDR used by ICU). Then, we don't have to wait for an ICU bug [1] and a CLDR bug [2]  to be fixed.

OTOH, the size of webkit binary will be slightly larger. And, we have to make sure that Unicode properties referred to in your rules are all available in ICU versions used by webkit ports. 

Because it takes time to fix ICU and CLDR issues and for a fixed version to be shipped with Mac OS/iOS (ports like Chrome can apply its own patches to ICU easily, but that's not the case for other ICU ports), your current approach is the only viable way to make all ICU-based webkit ports behave consistently. 

A slight variation would be to add #ifdef (either port-dependency or ICU-version dependency) and use custom rules for ports-using-older-icu and use ICU's upcoming api [1] for ports-using-newer-icu. 


[1] ttp://bugs.icu-project.org/trac/ticket/9379
[2]  http://unicode.org/cldr/trac/ticket/4931
------- Comment #17 From 2012-09-05 12:41:03 PST -------
rewriting my previous comment:

There can be two code paths (if #ifdef'd codes for different ports in this part of Webkit source is acceptable) : 

1. Use Adam's patch for webkit-port-using-old ICU
2a Use a new upcoming ICU API (yet-to-be-implemented) for webkit-port-using-new ICU
2b  Append '@css3lb={loose,strict}' to a locale name when creating a linebreak iterator for webkit ports using an ICU with separate line breaking rules for CSS3 loose and strict. No webkit port using ICU has this, but we can add them easily to Chrome. 


Because an ICU API to parameterize line breaking rules not just with a locale but also with CSS3 line breaking mode is not yet implemented anywhere, we can also consider taking approach 2b in the meantime for ports like Chromium that can easily add additional line breaking rules to its ICU.  2b is the second approach I proposed in the original bug report.
------- Comment #18 From 2012-09-05 16:38:27 PST -------
(In reply to comment #17)
> rewriting my previous comment:
> 
> There can be two code paths (if #ifdef'd codes for different ports in this part of Webkit source is acceptable) : 

We should avoid any option that requires #ifdef code if there is a viable alternative.

> 
> 1. Use Adam's patch for webkit-port-using-old ICU

This is the only viable option until (and if) ICU itself is updated to directly support all of the CSS3 LB semantics. As a preliminary step, I am using a locale with keyword in preparation for such eventuality. I used the following keywords (see patched LineBreakIteratorPool::take):

@break=loose
@break=normal
@break=strict

I would suggest a general keyword name like 'break' rather than 'css3lb'. Though, we could define css3 specific values for it, such as 'css3loose', 'css3normal', 'css3strict'.

We do need an explicit 'normal' (or equivalent) since its semantics (as specified by CSS3 Text) does not match the current (builtin) ICU rules.

> 2a Use a new upcoming ICU API (yet-to-be-implemented) for webkit-port-using-new ICU

We can migrate to that over time when it becomes available. But that shouldn't prevent us from starting with the current approach (using WK copy of rules, parameterized in WK).

> 2b  Append '@css3lb={loose,strict}' to a locale name when creating a linebreak iterator for webkit ports using an ICU with separate line breaking rules for CSS3 loose and strict. No webkit port using ICU has this, but we can add them easily to Chrome.

See above.

> Because an ICU API to parameterize line breaking rules not just with a locale but also with CSS3 line breaking mode is not yet implemented anywhere, we can also consider taking approach 2b in the meantime for ports like Chromium that can easily add additional line breaking rules to its ICU.  2b is the second approach I proposed in the original bug report.

I think we should use approach (1) for now. I don't see any need for a Chrome specific modification if its current ICU implementation supports the approach of (1).

To be clear, the short term approach should be (1), while the long term approach can be (2a) a future version of released ICU that supports a @break keyword and the necessary semantics, with these changes backported into Chromes' flavor of ICU.
------- Comment #19 From 2012-09-05 17:07:30 PST -------
Created an attachment (id=162373) [details]
Patch
------- Comment #20 From 2012-09-05 17:36:37 PST -------
Please do answer the question in comment 14.
------- Comment #21 From 2012-09-05 17:43:12 PST -------
(In reply to comment #14)
> Has this feature been announced on webkit-dev? I saw an e-mail discussing testing strategy, but not a discussion of the feature itself.

http://mac-os-forge.2317878.n4.nabble.com/adding-support-for-css3-line-break-property-to-webkit-line-break-td192326.html
------- Comment #22 From 2012-09-05 20:32:39 PST -------
Created an attachment (id=162404) [details]
Patch
------- Comment #23 From 2012-09-06 09:26:04 PST -------
Approach looks pretty good to me. Can you talk briefly about what the difference is between auto and normal, since that is the only compatibility concern I see here?
------- Comment #24 From 2012-09-06 11:01:58 PST -------
@Glenn, I'm fine with custom-rules hardcoded into Webkit source code as it's done in your patch. It's less work and less cluttering (no #ifdef's) in Webkit tree with negligible  code size increase. 

The only concern is that your patch is still ICU-version dependency because [:LineBreak = Conditional_Japanese_Starter:] was relatively recently introduced and not all ICU versions used by webkit-ports support that. 

+static const char* uax14AssignmentsBefore =
+    "$CJ = [:LineBreak = Conditional_Japanese_Starter:];";

To make your patch ICU-version-independent, you have to specify the set explicitly instead of relying on LB=C_J_S. 

Other sets used also have some ICU-version dependeny, but I think we just have to live with those differences unless we want to hard-code Unicode line breaking and other character properties in webkit source files (which I don't think we do).
------- Comment #25 From 2012-09-06 18:58:45 PST -------
Created an attachment (id=162646) [details]
Patch
------- Comment #26 From 2012-09-06 19:04:15 PST -------
(In reply to comment #24)
> The only concern is that your patch is still ICU-version dependency because [:LineBreak = Conditional_Japanese_Starter:] was relatively recently introduced and not all ICU versions used by webkit-ports support that. 

Thanks for that pointer, as that is probably what has been causing fails on cr-linux EWS, which I see uses ICU46 (patched). I've removed reference to C_J_S and fully enumerated $CJ.

Just re-submitted patch and will see if that fixes it. [I guess i'm just going to have to install and start building/testing cr-linux; guessing is not the best debugging technique for resolving port-specific fails.]
------- Comment #27 From 2012-09-06 19:31:23 PST -------
Created an attachment (id=162654) [details]
Patch
------- Comment #28 From 2012-09-06 20:03:11 PST -------
(In reply to comment #23)
> Approach looks pretty good to me. Can you talk briefly about what the difference is between auto and normal, since that is the only compatibility concern I see here?

From the perspective of the prior -webkit-line-break behavior, 'normal' effectively mapped to ICU's default LB rules, modulo locale specific tailoring, of which only a few locale specific rule sets were available in ICU:

* default (i.e., no locale or locale with no rule set)
* 'ja' locale
* 'fi' locale

[These are the 3 LB rule sets found in ICU50, which none of WK's current platforms use yet. Some are using ICU49, cr-linux uses a locally modified version of ICU46, etc.]

Given different platforms provide different versions of ICU (in general) and different locale specific LB rule sets, the behavior of 'normal' has been somewhat variable.

In contrast, CSS3 Text 'normal' prescribes specific LB rules for certain CJK related character subsets. Some of these apply if locale is CJK and some apply in non-CJK cases. For other characters, CSS3 Text is relatively silent on LB behavior, merely deferring to the default line break rules defined in [1], of which "soft break opportunity" applies in most cases.

So, 'normal' in the pre-CSS3 case and in the CSS3 case differ primarily in these CJK related characters.

The definition of 'auto' in CSS3 Text is non-specific, and leaves it to UA dependent behavior except for a small number of mandatory rules defined in [1].

What I've done is to map 'auto' to the same behavior of the *old* 'normal' when locale is unspecified or is not CJK; in the case of 'auto' and locale is CJK, then I map to the *new* (i.e., CSS3 Text) 'normal'.

This approach seems to provide the best backward compatibility for the *old* default 'normal' while migrating towards the *new* 'normal' in CJK behavior in the case of a defaulted 'auto' value.

More details can be found at [2].

[1] http://dev.w3.org/csswg/css3-text/#line-break-details
[2] https://trac.webkit.org/wiki/LineBreakingCSS3Mapping
------- Comment #29 From 2012-09-06 20:45:46 PST -------
Created an attachment (id=162663) [details]
Patch
------- Comment #30 From 2012-09-06 21:12:57 PST -------
break_lines.cpp is incredibly hot code.  You should be sure to run PerformanceTests/Layout/line-breaking.html to make sure you're not regressing ascii-fast-path line breaking.
------- Comment #31 From 2012-09-06 23:56:00 PST -------
(In reply to comment #30)
> break_lines.cpp is incredibly hot code.  You should be sure to run PerformanceTests/Layout/line-breaking.html to make sure you're not regressing ascii-fast-path line breaking.

I'm getting the following results from line-layout.html (perftest) on revision 127813, using a MBP (Retina, Mid 2012), release configuration:

Without Patch:

avg 92.15439425716251 runs/s
median 92.10526315789474 runs/s
stdev 0.2407525458781455 runs/s
min 91.62303664921465 runs/s
max 92.5925925925926 runs/s

With Patch:

avg 93.33876206192781 runs/s
median 93.28671328671328 runs/s
stdev 0.2669316373954801 runs/s
min 92.96148738379814 runs/s
max 94.00705052878966 runs/s

So the patch is showing a slight performance increase, though I'm at a loss to explain it.
------- Comment #32 From 2012-09-07 15:09:59 PST -------
(From update of attachment 162663 [details])
Attachment 162663 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/13776847

New failing tests:
fast/css3-line-break/line-break-loose-inseparables-ko.html
fast/css3-line-break/line-break-loose-centered-zh.html
fast/css3-line-break/line-break-auto-sound-marks-en.html
fast/css3-line-break/line-break-auto-hyphens-ja.html
fast/css3-line-break/line-break-auto-sound-marks-ko.html
fast/css3-line-break/line-break-loose-iteration-marks-ja.html
fast/css3-line-break/line-break-auto-half-kana-en.html
fast/css3-line-break/line-break-loose-prefixes-ja.html
fast/css3-line-break/line-break-loose-iteration-marks-zh.html
fast/css3-line-break/line-break-auto-hyphens-ko.html
fast/css3-line-break/line-break-loose-sound-marks-en.html
fast/css3-line-break/line-break-loose-hyphens-ko.html
fast/css3-line-break/line-break-loose-postfixes-ja.html
fast/css3-line-break/line-break-loose-iteration-marks-ko.html
fast/css3-line-break/line-break-loose-centered-ko.html
fast/canvas/webgl/shader-precision-format.html
fast/css3-line-break/line-break-loose-prefixes-ko.html
fast/css3-line-break/line-break-loose-centered-ja.html
fast/css3-line-break/line-break-loose-inseparables-en.html
fast/css3-line-break/line-break-loose-hyphens-ja.html
fast/css3-line-break/line-break-auto-hyphens-zh.html
fast/css3-line-break/line-break-auto-sound-marks-zh.html
fast/css3-line-break/line-break-loose-hyphens-zh.html
fast/css3-line-break/line-break-loose-prefixes-zh.html
fast/css3-line-break/line-break-loose-postfixes-zh.html
fast/css3-line-break/line-break-loose-postfixes-ko.html
fast/css3-line-break/line-break-loose-iteration-marks-en.html
fast/css3-line-break/line-break-auto-sound-marks-ja.html
fast/css3-line-break/line-break-loose-inseparables-ja.html
fast/css3-line-break/line-break-loose-inseparables-zh.html
------- Comment #33 From 2012-09-07 22:37:32 PST -------
> Without Patch:
> 
> avg 92.15439425716251 runs/s
> median 92.10526315789474 runs/s
> stdev 0.2407525458781455 runs/s
> min 91.62303664921465 runs/s
> max 92.5925925925926 runs/s
> 
> With Patch:
> 
> avg 93.33876206192781 runs/s
> median 93.28671328671328 runs/s
> stdev 0.2669316373954801 runs/s
> min 92.96148738379814 runs/s
> max 94.00705052878966 runs/s
> 
> So the patch is showing a slight performance increase, though I'm at a loss to explain it.

Did you just that just once? or are those results the median?

If it is one run, you cannot trust those results. Run the test a couple of time without the patch and verify the results are stable. If they are not, find what is causing the variation on your machine.

Keep all the results, and run the test again a few times with the patch.

The result of the very first run is also particularly unreliable, even on a stable setup.

/me wishes there was an automated way to run WebCore tests.
------- Comment #34 From 2012-09-07 23:05:12 PST -------
Do you mean like Tools/Scripts/run-perf-tests?  Or were you looking for a full solution like autovicki (old safari tool)?
------- Comment #35 From 2012-09-07 23:35:09 PST -------
(In reply to comment #34)
> Do you mean like Tools/Scripts/run-perf-tests?  Or were you looking for a full solution like autovicki (old safari tool)?

I mean something to mitigate the issues described above.

Maybe I should modify run-perf-tests to do
1) one warmup launch
2) run the tests a dozen of time
3) report the median results (and tell me if the variance is reasonable).

For some reason, on my setup, I have more reliable results with tests like Sunspider than when I use run-perf-tests.
With run-perf-tests, I often run tests 3-4 times to make sure what I measure is correct.

Am I the only one with that issue? :-D
------- Comment #36 From 2012-09-07 23:42:19 PST -------
Glenn's results already include standard deviation.
------- Comment #37 From 2012-09-07 23:45:44 PST -------
I would encourage you to file bugs against run-perf-tests and cc rniwa.  I suspect he's eager to hear your feedback (and help you hack on it yourself if you so desire).
------- Comment #38 From 2012-09-07 23:52:41 PST -------
(In reply to comment #33)
> > So the patch is showing a slight performance increase, though I'm at a loss to explain it.
> 
> Did you just that just once? or are those results the median?

I elided the individual test run results above, of which there were 20, with the initial one ignored. Just load PerformanceTests/Layout/line-layout.html to repeat.
------- Comment #39 From 2012-09-08 00:07:48 PST -------
(In reply to comment #36)
> Glenn's results already include standard deviation.

As reported by the test. Which is the one I was explaining has been unreliable in my experience.
------- Comment #40 From 2012-09-11 01:58:55 PST -------
Created an attachment (id=163306) [details]
Patch
------- Comment #41 From 2012-09-11 02:33:12 PST -------
(In reply to comment #26)
> (In reply to comment #24)
> > The only concern is that your patch is still ICU-version dependency because [:LineBreak = Conditional_Japanese_Starter:] was relatively recently introduced and not all ICU versions used by webkit-ports support that. 

Turns out that :LineBreak = Hebrew_Letter: also doesn't work on ICU46 (cr-linux). I've handled this by using the older style definition of [[:Hebrew:] & [:Letter:]].

There are also a couple of minor differences in the handling of small kana and prolonged sound marks between the default ICU49 and ICU46 non-CJK locale behaviors that affects 'auto' behavior with non-CJK 
@lang. I've addressed this by adding two reftest expected variants under platform/chromium-linux.

The latest patch attachment 163306 [details] contains these fixes, as well as a reworking of the break_lines changes to use template parameters to remove possible performance effects from special casing of loose mode behavior.
------- Comment #42 From 2012-09-11 09:56:24 PST -------
(From update of attachment 163306 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=163306&action=review

> Source/WebCore/platform/text/LineBreakIteratorPoolICU.h:61
> +        else {
> +          StringBuilder sb;
> +          sb.append(locale);
> +          sb.appendLiteral("@break=");
> +          switch (mode) {

Do we have any perf testing of this path?  Building this string may turn out to be expensive in this possibly hot code path.  It's not clear to me how often we make TextBreakIterators.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:93
> +    if (!locale.isEmpty()) {

nit: I would have written this as an early return.  If (locale.isEmpty()) return false;

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:96
> +        String language = String::createUninitialized(languageBufferLength, reinterpret_cast<LChar*&>(languageBuffer));

I'm confused.  How can this work?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:101
> +            language.truncate(languageLength);
> +            language.makeLower();

More mallocs, are going to likely make this path expensive...
------- Comment #43 From 2012-09-11 12:01:06 PST -------
(From update of attachment 163306 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=163306&action=review

> Source/WebCore/rendering/break_lines.cpp:150
> -template<bool treatNoBreakSpaceAsBreak>
> +template<bool treatNoBreakSpaceAsBreak, bool isLooseMode>

Wait a minute, if the previous change improved performance as you claim, why did you change it to use a template?
Making the generated asm bigger is not good.

Also two bools is a bad idea here. You should transform that to two enums.
------- Comment #44 From 2012-09-11 19:58:47 PST -------
(In reply to comment #42)
> (From update of attachment 163306 [details] [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=163306&action=review
> 
> > Source/WebCore/platform/text/LineBreakIteratorPoolICU.h:61
> > +        else {
> > +          StringBuilder sb;
> > +          sb.append(locale);
> > +          sb.appendLiteral("@break=");
> > +          switch (mode) {
> 
> Do we have any perf testing of this path?  Building this string may turn out to be expensive in this possibly hot code path.  It's not clear to me how often we make TextBreakIterators.

This path is taken only if author specifies an explicit line break mode of 'loose', 'normal', or 'strict'. If author specifies 'auto' or the default (initial) value 'auto' applies, then the non-allocating path

if (mode == LineBreakIteratorModeUAX14)
    localeWithOptionalBreakKeyword = locale;

is taken.

If author does specify loose|normal|strict, then the string building path is taken on the first get() on LazyLineBreakIterator after a reset() of the same, which occurs once for each RenderText object enumerated by RenderBlock::LineBreaker::nextLineBreak (provided the text differs from the current text set into the LazyLineBreakIterator).

I agree that the string building path could be further optimized for this latter case. I chose to take the simpler approach at this juncture, which effectively means modifying the (AtomicString) key used by the LineBreakIteratorPool.m_vendedIterators hash map:

HashMap<TextBreakIterator*, AtomicString> m_vendedIterators;

Since the allocating path is taken only when author uses a non-default line break mode, it seemed an acceptable choice at this time. [An optimization could be a subsequent bug/patch.]

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:93
> > +    if (!locale.isEmpty()) {
> 
> nit: I would have written this as an early return.  If (locale.isEmpty()) return false;

will fix in a new patch

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:96
> > +        String language = String::createUninitialized(languageBufferLength, reinterpret_cast<LChar*&>(languageBuffer));
> 
> I'm confused.  How can this work?
> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:101
> > +            language.truncate(languageLength);
> > +            language.makeLower();
> 
> More mallocs, are going to likely make this path expensive...

will rewrite this code in a new patch
------- Comment #45 From 2012-09-11 20:03:34 PST -------
(From update of attachment 163306 [details])
pending updated patch
------- Comment #46 From 2012-09-11 20:19:42 PST -------
(In reply to comment #43)
> (From update of attachment 163306 [details] [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=163306&action=review
> 
> > Source/WebCore/rendering/break_lines.cpp:150
> > -template<bool treatNoBreakSpaceAsBreak>
> > +template<bool treatNoBreakSpaceAsBreak, bool isLooseMode>
> 
> Wait a minute, if the previous change improved performance as you claim, why did you change it to use a template?

After further analysis of my performance tests, I concluded there was insufficient statistical evidence of an improvement or regression. However, analyzing the code itself showed that the insertion of two references to a boolean (isLooseMode) in the inner loop could have a negative effect.

Based on your recent change in [1] to templatize nextBreakablePosition with a boolean treatNoBreakSpaceAsSpace parameter, I chose to take the same approach with isLooseMode.

[1] http://trac.webkit.org/changeset/127974

> Making the generated asm bigger is not good.

Do you have a suggested alternative approach that avoids use of isLooseMode in the inner loop of nextBreakablePosition?

> Also two bools is a bad idea here. You should transform that to two enums.

I'm not sure I understand you. Are you suggesting using (1) two new enum types each with two values that correspond to the true and false conditions or (2) a single new type that enumerates all four possibilities (of the two unrelated parameters)?
------- Comment #47 From 2012-09-12 15:14:28 PST -------
> > Making the generated asm bigger is not good.
> 
> Do you have a suggested alternative approach that avoids use of isLooseMode in the inner loop of nextBreakablePosition?

No, I don't have a better idea :(
It is likely the safest solution.

> > Also two bools is a bad idea here. You should transform that to two enums.
> 
> I'm not sure I understand you. Are you suggesting using (1) two new enum types each with two values that correspond to the true and false conditions or (2) a single new type that enumerates all four possibilities (of the two unrelated parameters)?

1! :)

It will be easier to read. E.g.:
    nextBreakablePosition<IncludeBreakOnNBSP, LooseBreakMode>(lazyBreakIterator, pos);
------- Comment #48 From 2012-09-13 00:58:23 PST -------
Created an attachment (id=163806) [details]
Patch
------- Comment #49 From 2012-09-13 01:29:37 PST -------
Created an attachment (id=163814) [details]
Patch
------- Comment #50 From 2012-09-13 05:05:30 PST -------
(From update of attachment 163814 [details])
This updated patch addresses the comments from Eric (comment #42) and Benjamin (comment #43); in addition, additional performance improvements were implemented to ensure no regression in the default case of use of default line break mode.
------- Comment #51 From 2012-09-13 05:14:30 PST -------
Created an attachment (id=163844) [details]
performance test results and comparison

This attachment contains the results of performing run-perf-tests on PerformanceTests/Layout/line-layout.html unpatched revision #128140 and patched version (using attachment #163814 [details]). In addition, the results of running run-perf-tests on the new performance tests contained in the patch are provided. As can be seen, there is no regression in the default case (no use of -webkit-line-break), and very little or no difference even when -webkit-line-break is used when @lang is 'en'. In the case of explicit use of -webkit-line-break and @lang of 'ja', some slight performance drop is apparent, which is to be expected.
------- Comment #52 From 2012-09-13 16:45:46 PST -------
(From update of attachment 163814 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=163814&action=review

> Source/WebCore/platform/text/LineBreakIteratorPoolICU.h:60
> +        if (mode == LineBreakIteratorModeUAX14)
> +          localeWithOptionalBreakKeyword = locale;
> +        else {
> +          StringBuilder sb;
> +          sb.append(locale);
> +          sb.appendLiteral("@break=");

sb is a poor name for a variable.

(and following lines): Indent.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:113
> +    if (c1 == 'c' && c2 == 'h')
> +        return true;
> +    if (c1 == 'j' && (c2 == 'a' || (c2 == 'p')))
> +        return true;
> +    if (c1 == 'k' && c2 == 'o')
> +        return true;
> +    if (c1 == 'z' && c2 == 'h')
> +        return true;
> +    if (c1 == 'C' && c2 == 'H')
> +        return true;
> +    if (c1 == 'J' && (c2 == 'A' || (c2 == 'P')))
> +        return true;
> +    if (c1 == 'K' && c2 == 'O')
> +        return true;
> +    if (c1 == 'Z' && c2 == 'H')
> +        return true;

If this is a hotspot, you should aggregate by c2:
if (c2 == 'h') {
   // test c and z
}
if (c2 == 'H') {
   // test C and Z
}

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:125
> +    const String& s(locale.string());
> +    size_t sLen = s.length();
> +    if (s.is8Bit())
> +        return isCJK<LChar>(s.characters8(), sLen);
> +    return isCJK<UChar>(s.characters16(), sLen);

"s" is a poor variable name.
You don't need to create s at all, AtomicString now has characters8() and characters16().

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:738
> +    StringBuilder sb;

"sb".

> Source/WebCore/rendering/break_lines.cpp:175
> +        // don't use ASCII shortcut (shouldBreakAfter) if loose mode

Comment style: Comments are sentences in WebKit.

It is the kind of comment that could be advantageously turned into an inline function.

> Source/WebCore/rendering/break_lines.cpp:180
> +        // always use line break iterator if loose mode
> +        if ((looseBehavior == LooseMode) || needsLineBreakIterator<nbspBehavior>(ch) || needsLineBreakIterator<nbspBehavior>(lastCh)) {

Ditto.
------- Comment #53 From 2012-09-13 19:47:34 PST -------
Created an attachment (id=164031) [details]
Patch
------- Comment #54 From 2012-09-13 21:15:10 PST -------
(From update of attachment 164031 [details])
address comments in comment #52; benjaminp suggests that mitz perform final review
------- Comment #55 From 2012-09-14 15:19:32 PST -------
(From update of attachment 164031 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=164031&action=review

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:92
> +static bool isCJK(const T* s, size_t len)

Why are you treating 'ch' and 'jp' as if they're 'chn' and 'jpn'? I don't think we have to accept ISO 639 3-letter codes. Even if we do, just checking the 1st two letter does not work because there are ISO 639 3-letter codes starting with 'ch' for languages other than Chinese? See http://www.loc.gov/standards/iso639-2/php/code_list.php. Here's the list. 

cha    ch    Chamorro    chamorro
chb         Chibcha    chibcha
che    ce    Chechen    tchétchène
chg         Chagatai    djaghataï
chi (B)
zho (T)    zh    Chinese    chinois
chk         Chuukese    chuuk
chm         Mari    mari
chn         Chinook jargon    chinook, jargon
cho         Choctaw    choctaw
chp         Chipewyan; Dene Suline    chipewyan
chr         Cherokee    cherokee
chu    cu    Church Slavic; Old Slavonic; Church Slavonic; Old Bulgarian; Old Church Slavonic    slavon d'église; vieux slave; slavon liturgique; vieux bulgare
chv    cv    Chuvash    tchouvache
chy         Cheyenne    cheyenne

Even worse is that 'ch' (ISO 639 2-letter code) is Chamorro. 

Given all these, I think we just have to support the shortest ISO-639 codes for CJK ('zh', 'ja', 'ko'). That's also what BCP 47 says. (http://tools.ietf.org/html/bcp47 )

 language      = 2*3ALPHA            ; shortest ISO 639 code
                 ["-" extlang]       ; sometimes followed by
                                     ; extended language subtags
               / 4ALPHA              ; or reserved for future use
               / 5*8ALPHA            ; or registered language subtag
------- Comment #56 From 2012-09-14 16:08:35 PST -------
(From update of attachment 164031 [details])
pending fix for comment #55
------- Comment #57 From 2012-09-14 17:42:53 PST -------
Created an attachment (id=164255) [details]
Patch
------- Comment #58 From 2012-09-14 19:51:36 PST -------
(From update of attachment 164255 [details])
address comment #55
------- Comment #59 From 2012-09-17 17:21:37 PST -------
(From update of attachment 164255 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=164255&action=review

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:103
> +    if (!c3 || c3 == '-') {

I was about to ask you to do this (after my comment #55), but you beat me to that. In addition, I'd like to suggest that you make this a bit more lenient by allowing '_' (underscore) and '@' to support cases like 'ja_JP', 'zh@foo_bar'
------- Comment #60 From 2012-09-17 17:21:39 PST -------
(From update of attachment 164255 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=164255&action=review

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:103
> +    if (!c3 || c3 == '-') {

I was about to ask you to do this (after my comment #55), but you beat me to that. In addition, I'd like to suggest that you make this a bit more lenient by allowing '_' (underscore) and '@' to support cases like 'ja_JP', 'zh@foo_bar'
------- Comment #61 From 2012-09-18 20:28:18 PST -------
(From update of attachment 164255 [details])
pending fix for comment #60
------- Comment #62 From 2012-09-18 21:34:18 PST -------
Created an attachment (id=164656) [details]
address comment #60
------- Comment #63 From 2012-09-18 23:13:14 PST -------
(From update of attachment 164656 [details])
all comments addressed; please review and land
------- Comment #64 From 2012-09-19 16:23:21 PST -------
Thank you for addressing my comments about locale check. It looks good to me. It'd be great if hyatt or mitz can review+ this patch. 

BTW, Chrome is losing some changes it has made locally in its version of ICU to better handle Hebrew letters and some CJK punctuation marks (upstream bugs against CLDR were filed a while ago, but they're not yet resolved at UTC ). Now that we're including the full source of line breaking rules in webkit source, I'll file a follow-up bug to integrate those changes to webkit.
------- Comment #65 From 2012-10-08 05:53:07 PST -------
Sorry for the late comment, but EPUB 3 CSS Profile[1] defines "-epub-line-break" as an alias to the line-break property. Could you please add this one?

[1] http://epub-revision.googlecode.com/svn/trunk/build/30/spec/epub30-contentdocs.html#sec-css-profile
------- Comment #66 From 2012-10-08 05:55:46 PST -------
(In reply to comment #65)
> Sorry for the late comment, but EPUB 3 CSS Profile[1] defines "-epub-line-break" as an alias to the line-break property. Could you please add this one?
> 
> [1] http://epub-revision.googlecode.com/svn/trunk/build/30/spec/epub30-contentdocs.html#sec-css-profile

SInce this is really orthogonal to defining the CSS3 line-break semantics, I would prefer *not* to do this in this patch, but in a follow-on patch once this one has been applied. Is that acceptable?
------- Comment #67 From 2012-10-08 06:59:27 PST -------
(In reply to comment #66)
> (In reply to comment #65)
> > Sorry for the late comment, but EPUB 3 CSS Profile[1] defines "-epub-line-break" as an alias to the line-break property. Could you please add this one?
> > 
> > [1] http://epub-revision.googlecode.com/svn/trunk/build/30/spec/epub30-contentdocs.html#sec-css-profile
> 
> SInce this is really orthogonal to defining the CSS3 line-break semantics, I would prefer *not* to do this in this patch, but in a follow-on patch once this one has been applied. Is that acceptable?

That's reasonable, thank you for the comment.
------- Comment #68 From 2012-10-15 07:28:27 PST -------
Created an attachment (id=168709) [details]
Patch
------- Comment #69 From 2012-10-16 01:17:59 PST -------
Created an attachment (id=168884) [details]
Patch
------- Comment #70 From 2012-10-16 02:34:56 PST -------
Created an attachment (id=168899) [details]
Patch
------- Comment #71 From 2012-10-16 09:34:57 PST -------
(From update of attachment 168899 [details])
update from recent changes to HEAD to ensure patch succeeds; move new layout tests from fast/css3-line-break to css3/line-break due to number of tests producing relatively slow overall test time
------- Comment #72 From 2012-10-28 04:18:55 PST -------
Created an attachment (id=171129) [details]
Patch
------- Comment #73 From 2012-10-28 09:14:12 PST -------
(From update of attachment 171129 [details])
update from recent changes to HEAD to ensure patch succeeds
------- Comment #74 From 2012-10-29 09:30:16 PST -------
Created an attachment (id=171259) [details]
Patch
------- Comment #75 From 2012-10-29 12:06:53 PST -------
Could we land the PerformanceTest changes first?  Those are non-controversial and would reduce the size of the change. :)

I've read through your patch once.  I'll need to go through it a couple more times.  Overall it looks fine.  The big question in my head is performance, since break_lines.* is ridiculously hot code.
------- Comment #76 From 2012-10-29 14:21:36 PST -------
(From update of attachment 171259 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=171259&action=review

The code change overall looks fine.  There are some style nits which I mentioned above.  The test cases seem to be needlessly repetitive (a little bit of JavaScript could probably have generated most of these).  I believe you that you didn't make things slower, and I trust that you got the ICU line breaking rules correct (it's difficult for me to verify those).

I'll give this another look tomorrow.  I'm pretty much ready to pull the trigger here.  Mitz has had a long time to comment but we should probably give him another 24 hours.

Lets talk about this again tomrorow.  If you'd like to post a patch fixing some of the nits mentioned that woudl be nice.

Also, if you could explain the reinterpret_cast scariness, that would also be nice. :)

> Source/WebCore/platform/text/TextBreakIterator.h:49
> +    TextBreakIterator* acquireLineBreakIterator(const LChar*, int length, const AtomicString& locale, LineBreakIteratorMode, bool cjk);
> +    TextBreakIterator* acquireLineBreakIterator(const UChar*, int length, const AtomicString& locale, LineBreakIteratorMode, bool cjk);

WebKit style is probably to write isCJK for a bool (although Enums are almost always preferred).

> Source/WebCore/platform/text/TextBreakIterator.h:67
> +    bool isCJKLocale(const AtomicString& locale);

The arg name doesn't add anything.

> Source/WebCore/platform/text/TextBreakIterator.h:73
> +        , m_cjk(false)

isCJK.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:283
> +    UBreakIterator* ubrkIter = reinterpret_cast<UBreakIterator*>(iterator);

Um.  Bad idea.  TextBreakIterator should expose a function or something...  Or can this be a static_cast?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:293
> -    return reinterpret_cast<TextBreakIterator*>(iterator);
> +    return iterator;

I see, this was already doing this.  Still a bad idea.  Is UBreakIterator a void* or something?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:362
> +        ubrkIter = ubrk_open(UBRK_LINE, localeIsEmpty ? currentTextBreakLocaleID() : locale.string().utf8().data(), 0, 0, &openStatus);

locale.string().utf8().data()?  I assume that ubrk_open doesn't hold onto that string? (since it will die at the end of this line).

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:600
> +static const char* uax14AssignmentsBefore =
> +    // explicitly enumerate $CJ since ICU versions prior to 49 don't support :LineBreak=Conditional_Japanese_Starter:
> +    "$CJ = ["
> +    "\\u3041\\u3043\\u3045\\u3047\\u3049\\u3063\\u3083\\u3085\\u3087\\u308E\\u3095\\u3096\\u30A1\\u30A3\\u30A5\\u30A7"
> +    "\\u30A9\\u30C3\\u30E3\\u30E5\\u30E7\\u30EE\\u30F5\\u30F6\\u30FC"
> +    "\\u31F0\\u31F1\\u31F2\\u31F3\\u31F4\\u31F5\\u31F6\\u31F7\\u31F8\\u31F9\\u31FA\\u31FB\\u31FC\\u31FD\\u31FE\\u31FF"
> +    "\\uFF67\\uFF68\\uFF69\\uFF6A\\uFF6B\\uFF6C\\uFF6D\\uFF6E\\uFF6F\\uFF70"
> +    "];";

Should we use #defines to key this off of the ICU version instead?  If for no other reason than to allow us to remove the old code once all ports are to 49+?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:680
> +    "$HL = [[:Hebrew:] & [:Letter:]];"

Do we need to key this off the ICU version (as you mentioned in the bug?)  Is there any advantage to the syntax you originally tried?

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:2557
> +            bool looseMode = false;

Again, we generally prefix bools with "is" or possibly "in" or "use" in this case (at least that is my understanding of webkit style).

> Source/WebCore/rendering/RenderText.cpp:944
> +    default:
> +        return LineBreakIteratorModeUAX14;

We often avoid using default: so that the compiler will tell us when we add a value to the enum and forget to add it to the switch statements.

> Source/WebCore/rendering/break_lines.cpp:52
> +enum LooseBehavior {
> +    NonLooseMode,
> +    LooseMode,
> +};

Yes, exactly.  We use enums like this instead of bools, mostly to make callsites more readable.  :)

> Source/WebCore/rendering/style/RenderStyleConstants.h:214
> -enum EKHTMLLineBreak {
> -    LBNORMAL, AFTER_WHITE_SPACE
> +enum LineBreak {
> +    LineBreakAuto, LineBreakLoose, LineBreakNormal, LineBreakStrict, LineBreakAfterWhiteSpace

This change could have been split off and done first.  It probably would have made the diff slightly smaller.  You even could have plumbed in the new values before adding support for them in order to reduce the size of the diff.

> Source/WebCore/rendering/style/StyleRareInheritedData.h:88
> -    unsigned khtmlLineBreak : 1; // EKHTMLLineBreak
> +    unsigned lineBreak : 3; // LineBreak

Do we have any constraints on size of this object?  Presumably not, it looks rather huge to begin with. :)

> LayoutTests/css3/line-break/line-break-auto-centered-ja.html:36
> +        <div style="font-family:'Lucida Grande'; font-size:16pt; text-decoration:underline; width:4.1em;" lang="ja">

Even some CSS could have made this test much smaller. :)
------- Comment #77 From 2012-10-30 04:30:55 PST -------
(From update of attachment 171259 [details])
preparing a sub-division of this patch, so canceling this r?
------- Comment #78 From 2012-10-31 04:08:57 PST -------
Created an attachment (id=171613) [details]
Patch
------- Comment #79 From 2012-10-31 04:16:23 PST -------
(From update of attachment 171613 [details])
Attachment 171613 [details] did not pass qt-ews (qt):
Output: http://queues.webkit.org/results/14625857
------- Comment #80 From 2012-10-31 04:17:34 PST -------
(From update of attachment 171613 [details])
Attachment 171613 [details] did not pass qt-wk2-ews (qt):
Output: http://queues.webkit.org/results/14625858
------- Comment #81 From 2012-10-31 04:30:03 PST -------
(From update of attachment 171613 [details])
Attachment 171613 [details] did not pass cr-android-ews (chromium-android):
Output: http://queues.webkit.org/results/14688019
------- Comment #82 From 2012-10-31 04:34:02 PST -------
(From update of attachment 171613 [details])
Attachment 171613 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/14677113
------- Comment #83 From 2012-10-31 05:37:13 PST -------
(From update of attachment 171613 [details])
Attachment 171613 [details] did not pass win-ews (win):
Output: http://queues.webkit.org/results/14661208
------- Comment #84 From 2012-10-31 05:58:03 PST -------
Created an attachment (id=171628) [details]
Patch
------- Comment #85 From 2012-10-31 10:00:46 PST -------
(From update of attachment 171628 [details])
address comment 76 issues
------- Comment #86 From 2012-10-31 10:20:49 PST -------
(In reply to comment #76)
> (From update of attachment 171259 [details] [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=171259&action=review
> 
> Also, if you could explain the reinterpret_cast scariness, that would also be nice. :)

see below

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:49
> > +    TextBreakIterator* acquireLineBreakIterator(const LChar*, int length, const AtomicString& locale, LineBreakIteratorMode, bool cjk);
> > +    TextBreakIterator* acquireLineBreakIterator(const UChar*, int length, const AtomicString& locale, LineBreakIteratorMode, bool cjk);
> 
> WebKit style is probably to write isCJK for a bool (although Enums are almost always preferred).

done

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:67
> > +    bool isCJKLocale(const AtomicString& locale);
> 
> The arg name doesn't add anything.

removed

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:73
> > +        , m_cjk(false)
> 
> isCJK.

done

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:283
> > +    UBreakIterator* ubrkIter = reinterpret_cast<UBreakIterator*>(iterator);
> 
> Um.  Bad idea.  TextBreakIterator should expose a function or something...  Or can this be a static_cast?
> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:293
> > -    return reinterpret_cast<TextBreakIterator*>(iterator);
> > +    return iterator;
> 
> I see, this was already doing this.  Still a bad idea.  Is UBreakIterator a void* or something?

as you notice, i didn't introduce this, just refactored a bit; basically we have the opaque type

typedef struct UBreakIterator UBreakIterator;

defined in either ustring.h or ubrk.h, depending on which ICU include files are used.

then we have an opaque WK type

class TextBreakIterator;

defined in TextBreakIterator.h, but never expanded elsewhere in the case of platforms that use ICU;

for platforms that don't use ICU, TextBreakIterator as well as acquireLineBreakIterator() et al, return a different opaque type or return a port specific type, in which case a different acquireLineBreakIterator() is supplied;

as for using static_cast<>, i'm not sure if that is feasible here

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:362
> > +        ubrkIter = ubrk_open(UBRK_LINE, localeIsEmpty ? currentTextBreakLocaleID() : locale.string().utf8().data(), 0, 0, &openStatus);
> 
> locale.string().utf8().data()?  I assume that ubrk_open doesn't hold onto that string? (since it will die at the end of this line).

correct, the reference isn't held

>
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:600
> > +static const char* uax14AssignmentsBefore =
> > +    // explicitly enumerate $CJ since ICU versions prior to 49 don't support :LineBreak=Conditional_Japanese_Starter:
> > +    "$CJ = ["
> > +    "\\u3041\\u3043\\u3045\\u3047\\u3049\\u3063\\u3083\\u3085\\u3087\\u308E\\u3095\\u3096\\u30A1\\u30A3\\u30A5\\u30A7"
> > +    "\\u30A9\\u30C3\\u30E3\\u30E5\\u30E7\\u30EE\\u30F5\\u30F6\\u30FC"
> > +    "\\u31F0\\u31F1\\u31F2\\u31F3\\u31F4\\u31F5\\u31F6\\u31F7\\u31F8\\u31F9\\u31FA\\u31FB\\u31FC\\u31FD\\u31FE\\u31FF"
> > +    "\\uFF67\\uFF68\\uFF69\\uFF6A\\uFF6B\\uFF6C\\uFF6D\\uFF6E\\uFF6F\\uFF70"
> > +    "];";
> 
> Should we use #defines to key this off of the ICU version instead?  If for no other reason than to allow us to remove the old code once all ports are to 49+?

done

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:680
> > +    "$HL = [[:Hebrew:] & [:Letter:]];"
> 
> Do we need to key this off the ICU version (as you mentioned in the bug?)  Is there any advantage to the syntax you originally tried?

done; i use the "syntax [I] originally tried" if the ICU version is 49 or later

> 
> > Source/WebCore/rendering/RenderBlockLineLayout.cpp:2557
> > +            bool looseMode = false;
> 
> Again, we generally prefix bools with "is" or possibly "in" or "use" in this case (at least that is my understanding of webkit style).

done

> 
> > Source/WebCore/rendering/RenderText.cpp:944
> > +    default:
> > +        return LineBreakIteratorModeUAX14;
> 
> We often avoid using default: so that the compiler will tell us when we add a value to the enum and forget to add it to the switch statements.

removed default, but added catch all return at end of function; otherwise, some platforms failed to compile, complaining that some return path did not return a value

> 
> > Source/WebCore/rendering/break_lines.cpp:52
> > +enum LooseBehavior {
> > +    NonLooseMode,
> > +    LooseMode,
> > +};
> 
> Yes, exactly.  We use enums like this instead of bools, mostly to make callsites more readable.  :)
> 
> > Source/WebCore/rendering/style/RenderStyleConstants.h:214
> > -enum EKHTMLLineBreak {
> > -    LBNORMAL, AFTER_WHITE_SPACE
> > +enum LineBreak {
> > +    LineBreakAuto, LineBreakLoose, LineBreakNormal, LineBreakStrict, LineBreakAfterWhiteSpace
> 
> This change could have been split off and done first.  It probably would have made the diff slightly smaller.  You even could have plumbed in the new values before adding support for them in order to reduce the size of the diff.

done and landed (bug 100739)

> 
> > Source/WebCore/rendering/style/StyleRareInheritedData.h:88
> > -    unsigned khtmlLineBreak : 1; // EKHTMLLineBreak
> > +    unsigned lineBreak : 3; // LineBreak
> 
> Do we have any constraints on size of this object?  Presumably not, it looks rather huge to begin with. :)

don't know of any; it seems to accrete new fields regularly

> 
> > LayoutTests/css3/line-break/line-break-auto-centered-ja.html:36
> > +        <div style="font-family:'Lucida Grande'; font-size:16pt; text-decoration:underline; width:4.1em;" lang="ja">
> 
> Even some CSS could have made this test much smaller. :)

greatly reduce test size by using CSS and combining en|ja|ko|zh tests, left with 32 new tests instead of 128
------- Comment #87 From 2012-11-01 11:51:59 PST -------
(From update of attachment 171628 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=171628&action=review

I briefly looked over this patch and have some minor nits.

> Source/WebCore/platform/text/LineBreakIteratorPoolICU.h:73
> +            switch (mode) {
> +            case LineBreakIteratorModeUAX14Loose:
> +                localeWithKeyword.appendLiteral("loose");
> +                break;
> +            case LineBreakIteratorModeUAX14Normal:
> +                localeWithKeyword.appendLiteral("normal");
> +                break;
> +            case LineBreakIteratorModeUAX14Strict:
> +                localeWithKeyword.appendLiteral("strict");
> +                break;
> +            default:
> +                ASSERT_NOT_REACHED();
> +            }

This code is error prone. Instead of using a runtime assertion to catch an unexpected mode we should take advantage of the compiler to warn us when a switch block doesn't have a case statement for each enum value by omitting the default statement. To do this, we should extract this code into a static function.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:325
> +static bool isCJKLocale(const T* s, size_t len)

Nit: len => length

By <http://www.webkit.org/coding/coding-style.html#names-full-words>, we prefer full words for variable names over abbreviations unless such abbreviations are more canonical. Changing "len" to "length" would also be consistent with the argument of the same name in acquireLineBreakIterator(), sentenceBreakIterator(), and among other text iterator functions.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:347
> +    size_t sLen = locale.length();

I take it that the 's' in "sLen" stands for string. Regardless, I would call this variable length.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:357
> +    UNUSED_PARAM(mode);

I take it you plan to make use of this parameter in a follow up patch?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:360
> +    bool localeIsEmpty = locale.isEmpty();

Nit: localeIsEmpty => isLocaleEmpty

By <http://www.webkit.org/coding/coding-style.html#names-bool>.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:370
> +    // locale comes from a web page and it can be invalid, leading ICU
> +    // to fail, in which case we fall back to the default locale (with default rules)

Please make this comment a proper sentence that begins with a capital letter and ends with a period. This will make the comment conform to <http://www.webkit.org/coding/coding-style.html#comments-sentences>.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:668
> +#define uax14AssignmentsCustomStrictNonCJK      uax14AssignmentsCustomStrictCJK
> +#define uax14AssignmentsCustomDefaultCJK        uax14AssignmentsCustomNormalCJK
> +#define uax14AssignmentsCustomDefaultNonCJK     uax14AssignmentsCustomStrictNonCJK

See my remark in mapLineIteratorModeToRules() on how we can remove these defines.

At the very least, we should #undef these defines at the end of this file.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:981
> +    if (mode == LineBreakIteratorModeUAX14Loose)
> +        rulesBuilder.append(isCJK ? uax14AssignmentsCustomLooseCJK : uax14AssignmentsCustomLooseNonCJK);
> +    else if (mode == LineBreakIteratorModeUAX14Normal)
> +        rulesBuilder.append(isCJK ? uax14AssignmentsCustomNormalCJK : uax14AssignmentsCustomNormalNonCJK);
> +    else if (mode == LineBreakIteratorModeUAX14Strict)
> +        rulesBuilder.append(isCJK ? uax14AssignmentsCustomStrictCJK : uax14AssignmentsCustomStrictNonCJK);
> +    else
> +        rulesBuilder.append(isCJK ? uax14AssignmentsCustomDefaultCJK : uax14AssignmentsCustomDefaultNonCJK);

I would use a switch block here with a default statement instead of if/else if/else statements as it will make this code less error prone because the compiler can ensure that we handle every enum value in enum LineBreakIteratorMode. Currently, LineBreakIteratorModeUAX14 is the only mode that is covered by the else statement. Moreover, using a switch block will allow us to remove macro defines uax14AssignmentsCustomDefaultCJK, uax14AssignmentsCustomDefaultNonCJK, and uax14AssignmentsCustomStrictNonCJK.
------- Comment #88 From 2012-11-01 13:24:57 PST -------
(In reply to comment #87)
> (From update of attachment 171628 [details] [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=171628&action=review
> 
> I briefly looked over this patch and have some minor nits.

uploading updated patch shortly with corrections as described below

> > Source/WebCore/platform/text/LineBreakIteratorPoolICU.h:73
> > +            switch (mode) {
> > +            case LineBreakIteratorModeUAX14Loose:
> > +                localeWithKeyword.appendLiteral("loose");
> > +                break;
> > +            case LineBreakIteratorModeUAX14Normal:
> > +                localeWithKeyword.appendLiteral("normal");
> > +                break;
> > +            case LineBreakIteratorModeUAX14Strict:
> > +                localeWithKeyword.appendLiteral("strict");
> > +                break;
> > +            default:
> > +                ASSERT_NOT_REACHED();
> > +            }
> 
> This code is error prone. Instead of using a runtime assertion to catch an unexpected mode we should take advantage of the compiler to warn us when a switch block doesn't have a case statement for each enum value by omitting the default statement. To do this, we should extract this code into a static function.

fixed, though I'm not sure the relation between fully populating the switch and using a distinct function

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:325
> > +static bool isCJKLocale(const T* s, size_t len)
> 
> Nit: len => length
> 
> By <http://www.webkit.org/coding/coding-style.html#names-full-words>, we prefer full words for variable names over abbreviations unless such abbreviations are more canonical. Changing "len" to "length" would also be consistent with the argument of the same name in acquireLineBreakIterator(), sentenceBreakIterator(), and among other text iterator functions.

fixed

> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:347
> > +    size_t sLen = locale.length();
> 
> I take it that the 's' in "sLen" stands for string. Regardless, I would call this variable length.

fixed

> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:357
> > +    UNUSED_PARAM(mode);
>
> I take it you plan to make use of this parameter in a follow up patch?

removed; i'd neglected to remove it after using mode

> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:360
> > +    bool localeIsEmpty = locale.isEmpty();
> 
> Nit: localeIsEmpty => isLocaleEmpty

fixed

> By <http://www.webkit.org/coding/coding-style.html#names-bool>.
> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:370
> > +    // locale comes from a web page and it can be invalid, leading ICU
> > +    // to fail, in which case we fall back to the default locale (with default rules)
> 
> Please make this comment a proper sentence that begins with a capital letter and ends with a period. This will make the comment conform to <http://www.webkit.org/coding/coding-style.html#comments-sentences>.

i merely copied that comment from existing code without modifying, but i've fixed as you suggest


> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:668
> > +#define uax14AssignmentsCustomStrictNonCJK      uax14AssignmentsCustomStrictCJK
> > +#define uax14AssignmentsCustomDefaultCJK        uax14AssignmentsCustomNormalCJK
> > +#define uax14AssignmentsCustomDefaultNonCJK     uax14AssignmentsCustomStrictNonCJK
> 
> See my remark in mapLineIteratorModeToRules() on how we can remove these defines.

if you insist, i will remove, but i think removing will reduce understandability of code: the #define names are semantically significant, and are derived from a coincidental mapping that happens to apply from [1]

[1] http://trac.webkit.org/wiki/LineBreakingCSS3Mapping

if they are removed, the [code] reader will be left wondering about this mapping, e.g., why StrictNonCJK happens to be using StrictCJK

> At the very least, we should #undef these defines at the end of this file.

If this were an include file, then yes, but it doesn't seem necessary for a compilation unit (or if it does, then I'm not following your reasoning).

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:981
> > +    if (mode == LineBreakIteratorModeUAX14Loose)
> > +        rulesBuilder.append(isCJK ? uax14AssignmentsCustomLooseCJK : uax14AssignmentsCustomLooseNonCJK);
> > +    else if (mode == LineBreakIteratorModeUAX14Normal)
> > +        rulesBuilder.append(isCJK ? uax14AssignmentsCustomNormalCJK : uax14AssignmentsCustomNormalNonCJK);
> > +    else if (mode == LineBreakIteratorModeUAX14Strict)
> > +        rulesBuilder.append(isCJK ? uax14AssignmentsCustomStrictCJK : uax14AssignmentsCustomStrictNonCJK);
> > +    else
> > +        rulesBuilder.append(isCJK ? uax14AssignmentsCustomDefaultCJK : uax14AssignmentsCustomDefaultNonCJK);
> 
> I would use a switch block here with a default statement instead of if/else if/else statements as it will make this code less error prone because the compiler can ensure that we handle every enum value in enum LineBreakIteratorMode. Currently, LineBreakIteratorModeUAX14 is the only mode that is covered by the else statement. Moreover, using a switch block will allow us to remove macro defines uax14AssignmentsCustomDefaultCJK, uax14AssignmentsCustomDefaultNonCJK, and uax14AssignmentsCustomStrictNonCJK.

i've changed to a switch statement, but i don't think we want to remove the macro defines (see above)

is this acceptable?
------- Comment #89 From 2012-11-01 13:27:07 PST -------
(From update of attachment 171628 [details])
cancel r? pending fix for comment 87 issues
------- Comment #90 From 2012-11-01 14:00:47 PST -------
Created an attachment (id=171924) [details]
Patch
------- Comment #91 From 2012-11-01 14:13:43 PST -------
Created an attachment (id=171927) [details]
Patch
------- Comment #92 From 2012-11-01 14:33:34 PST -------
(From update of attachment 171927 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=171927&action=review

> Source/WebCore/ChangeLog:53
> +        Add static function to construct ICU locale argument (also used as pool key) with additional
> +        break keyword.

It's unclear to me why we need to encode the break information in the "locale" string at all?  I assume this is used as the key to some hash?

> Source/WebCore/platform/text/TextBreakIterator.h:51
> +    TextBreakIterator* openLineBreakIterator(const AtomicString& locale, LineBreakIteratorMode, bool isCJK);

This is no longer just "locale" it seems.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:267
> -    UBreakIterator* iterator = LineBreakIteratorPool::sharedPool().take(locale);
> +    TextBreakIterator* iterator = LineBreakIteratorPool::sharedPool().take(locale, mode, isCJK);

I see.  This is why we ad the break part to the string.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:361
> +    if ((mode == LineBreakIteratorModeUAX14) && !isCJK)
> +        ubrkIter = ubrk_open(UBRK_LINE, isLocaleEmpty ? currentTextBreakLocaleID() : locale.string().utf8().data(), 0, 0, &openStatus);

Do other text-breaking backends function with &break= strings like ICU does?  It seems encoding that information in the locale string outside of this file may be a layering violation?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:373
> +    // Locale comes from a web page and it can be invalid, leading ICU
> +    // to fail, in which case we fall back to the default locale (with default rules).
> +    if (!isLocaleEmpty && U_FAILURE(openStatus)) {
> +        openStatus = U_ZERO_ERROR;
> +        ubrkIter = ubrk_open(UBRK_LINE, currentTextBreakLocaleID(), 0, 0, &openStatus);
> +    }

Woh, so webpages can get this behavior indirectly by specifying locale="en&break=strict" already (or do we correctly sanitize the locale string before getting in here?)

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:992
> +    rules = rulesBuilder.toString();

Odd.  Why pass by reference?
------- Comment #93 From 2012-11-01 14:47:39 PST -------
(In reply to comment #88)
> [...]
> > > Source/WebCore/platform/text/LineBreakIteratorPoolICU.h:73
> > > +            switch (mode) {
> > > +            case LineBreakIteratorModeUAX14Loose:
> > > +                localeWithKeyword.appendLiteral("loose");
> > > +                break;
> > > +            case LineBreakIteratorModeUAX14Normal:
> > > +                localeWithKeyword.appendLiteral("normal");
> > > +                break;
> > > +           ...
> > 
> [...]
> fixed, though I'm not sure the relation between fully populating the switch and using a distinct function

Disregard my remark about the distinct function as it isn't necessary.

> 
> [...]
> > See my remark in mapLineIteratorModeToRules() on how we can remove these defines.
> 
> if you insist, i will remove, but i think removing will reduce understandability of code: the #define names are semantically significant, and are derived from a coincidental mapping that happens to apply from [1]
> 
> [1] http://trac.webkit.org/wiki/LineBreakingCSS3Mapping
> 
> if they are removed, the [code] reader will be left wondering about this mapping, e.g., why StrictNonCJK happens to be using StrictCJK
>

OK. I'll trust your judgement with regards to the #defines.

> 
> > At the very least, we should #undef these defines at the end of this file.
> 
> If this were an include file, then yes, but it doesn't seem necessary for a compilation unit (or if it does, then I'm not following your reasoning).

You're right. It's unnecessary to #undef the defines since they are in a .cpp file.

> 
> > 
> > > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:981
> > > +    if (mode == LineBreakIteratorModeUAX14Loose)
> > > +        rulesBuilder.append(isCJK ? uax14AssignmentsCustomLooseCJK : uax14AssignmentsCustomLooseNonCJK);
> > > +    else if (mode == LineBreakIteratorModeUAX14Normal)
> > > +        rulesBuilder.append(isCJK ? uax14AssignmentsCustomNormalCJK : uax14AssignmentsCustomNormalNonCJK);
> > > ...
> > 
> >[...] Moreover, using a switch block will allow us to remove macro defines uax14AssignmentsCustomDefaultCJK, uax14AssignmentsCustomDefaultNonCJK, and uax14AssignmentsCustomStrictNonCJK.
> 
> i've changed to a switch statement, but i don't think we want to remove the macro defines (see above)
> 
> is this acceptable?

Yes, I'll trust your judgement with regards to the #defines.
------- Comment #94 From 2012-11-01 15:03:41 PST -------
(In reply to comment #92)
> (From update of attachment 171927 [details] [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=171927&action=review
> 
> > Source/WebCore/ChangeLog:53
> > +        Add static function to construct ICU locale argument (also used as pool key) with additional
> > +        break keyword.
> 
> It's unclear to me why we need to encode the break information in the "locale" string at all?  I assume this is used as the key to some hash?

it is used both as part of the line pool hash and also as a (future) argument to ICU, which uses keyword/value pairs as locale params [1]

[1] http://icu-project.org/apiref/icu4c/classLocale.html#a2e847ff3c5df01b077293a30347bac11

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:51
> > +    TextBreakIterator* openLineBreakIterator(const AtomicString& locale, LineBreakIteratorMode, bool isCJK);
> 
> This is no longer just "locale" it seems.

from ICU perspective, the locale string [may] contain such optional keyword/value pairs

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:267
> > -    UBreakIterator* iterator = LineBreakIteratorPool::sharedPool().take(locale);
> > +    TextBreakIterator* iterator = LineBreakIteratorPool::sharedPool().take(locale, mode, isCJK);
> 
> I see.  This is why we ad the break part to the string.

yep, but also in future, ICU may make use of this keyword value to parameterize the line break iterator, in which case we can remove our custom rules and let ICU take over with its (at that future time) default (but parameterized) rules [n.b. i've been talking with some folks working on ICU about doing this, and its something they may take up]

in any case, i needed to distinguish the entries in the pool cache by the break param, so this served both purposes

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:361
> > +    if ((mode == LineBreakIteratorModeUAX14) && !isCJK)
> > +        ubrkIter = ubrk_open(UBRK_LINE, isLocaleEmpty ? currentTextBreakLocaleID() : locale.string().utf8().data(), 0, 0, &openStatus);
> 
> Do other text-breaking backends function with &break= strings like ICU does?

according to [2], it would seem there is such use outside ICU, though not necessarily the same syntax

[2] http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

> It seems encoding that information in the locale string outside of this file may be a layering violation?

probably, but that's how the ICU interface works

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:373
> > +    // Locale comes from a web page and it can be invalid, leading ICU
> > +    // to fail, in which case we fall back to the default locale (with default rules).
> > +    if (!isLocaleEmpty && U_FAILURE(openStatus)) {
> > +        openStatus = U_ZERO_ERROR;
> > +        ubrkIter = ubrk_open(UBRK_LINE, currentTextBreakLocaleID(), 0, 0, &openStatus);
> > +    }
> 
> Woh, so webpages can get this behavior indirectly by specifying locale="en&break=strict" already (or do we correctly sanitize the locale string before getting in here?)

this is unchanged code from prior to this patch (from pre-patch LineBreakIteratorPoolICU.h), which uses the value returned by currentTextBreakLocaleID() as a back-up, which, if that value happened to be returned with such a keyword, would be semantically significant to ICU; i haven't traced back that code to see if it sanitizes the value syntax

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:992
> > +    rules = rulesBuilder.toString();
> 
> Odd.  Why pass by reference?

given the length of the rules string, i didn't want to perform an unnecessary copy, presuming that would have occurred if I had returned String, i.e., if I had done that and then had:

String rules = mapLineIteratorModeToRules(...)

if there is a convenient way to avoid a copy, then I could change this if you think it needed
------- Comment #95 From 2012-11-01 15:10:43 PST -------
(From update of attachment 171927 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=171927&action=review

>>> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:992
>>> +    rules = rulesBuilder.toString();
>> 
>> Odd.  Why pass by reference?
> 
> given the length of the rules string, i didn't want to perform an unnecessary copy, presuming that would have occurred if I had returned String, i.e., if I had done that and then had:
> 
> String rules = mapLineIteratorModeToRules(...)
> 
> if there is a convenient way to avoid a copy, then I could change this if you think it needed

Returning a String by value does a little extra ref-churn, but that's all:
http://trac.webkit.org/browser/trunk/Source/WTF/wtf/text/WTFString.h#L144
------- Comment #96 From 2012-11-01 15:33:44 PST -------
(From update of attachment 171927 [details])
address comment 87 and comment 92 nits
------- Comment #97 From 2012-11-01 15:35:53 PST -------
(From update of attachment 171927 [details])
should be ready to go now
------- Comment #98 From 2012-11-02 15:49:09 PST -------
(From update of attachment 171927 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=171927&action=review

My job here is not really to tell you if this works right or not (you're more of a line-breaking expert than I).  My job is to tell you if this makes sense in WebKit's system.  Overall this seems fine.  There are a couple design nits, which I've pointed out in this and previous reviews.  Many of which could be fixed either in lead-up or follow-up bugs to the main change.  I don't know what level of interest in you have in long-term maintenance of this code. :)

> Source/WebCore/platform/text/LineBreakIteratorPoolICU.h:56
> +        localeWithKeyword.appendLiteral("@break=");

Since we're just using this as a hash key it feels a bit odd to hijack a string for this use.  A class would make more sense, but that is a slightly more complicated change, as you'd need to both define the class (with a String/mode pair) and then define how to hash it. I suspect std::pair<String, LineBreakIteratorMode> might actually do what you want though.

> Source/WebCore/platform/text/TextBreakIterator.h:123
> +    LineBreakIteratorMode m_mode;
> +    bool m_isCJK;

I don't know how tight memory is on these iterators? Presumably we only have a couple per page?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:326
> +static bool isCJKLocale(const T* s, size_t length)
> +{

I'm surprised we don't already have this code somewhere else in WebCore.
------- Comment #99 From 2012-11-02 16:01:38 PST -------
(In reply to comment #98)
> (From update of attachment 171927 [details] [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=171927&action=review
> 
> My job here is not really to tell you if this works right or not (you're more of a line-breaking expert than I).  My job is to tell you if this makes sense in WebKit's system.  Overall this seems fine.  There are a couple design nits, which I've pointed out in this and previous reviews.  Many of which could be fixed either in lead-up or follow-up bugs to the main change.  I don't know what level of interest in you have in long-term maintenance of this code. :)

My plan on WK contribution is to continue work on it, and perhaps reach reviewer status eventually. Beyond this patch, I plan to work on filling out ruby functionality and other CT bugs, optimizations, etc.

> 
> > Source/WebCore/platform/text/LineBreakIteratorPoolICU.h:56
> > +        localeWithKeyword.appendLiteral("@break=");
> 
> Since we're just using this as a hash key it feels a bit odd to hijack a string for this use.  A class would make more sense, but that is a slightly more complicated change, as you'd need to both define the class (with a String/mode pair) and then define how to hash it. I suspect std::pair<String, LineBreakIteratorMode> might actually do what you want though.

Yes, I thought of that as well when starting this, but thought it was more complex than needed to get this feature off the ground. It could certainly be done in as a later optimization.

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:123
> > +    LineBreakIteratorMode m_mode;
> > +    bool m_isCJK;
> 
> I don't know how tight memory is on these iterators? Presumably we only have a couple per page?

That's my assumption, though I haven't instrumented it specifically.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:326
> > +static bool isCJKLocale(const T* s, size_t length)
> > +{
> 
> I'm surprised we don't already have this code somewhere else in WebCore.

I didn't look very far to see if this was available elsewhere, but I needed a very tight and performative imll in this case to prevent introducing a slowdown in break_lines.

Overall, I think this is ready to land at this juncture. Unless you disagree, I would appreciate a r/cq+, so I can move to subsequent improvements in this area.
------- Comment #100 From 2012-11-02 16:27:51 PST -------
(From update of attachment 171927 [details])
I agree that it's time to land this and follow-up with other patches as necessary.  I've filed a couple follow-up bugs and related them to this one.  They can be either fixed or closed at a later juncture.
------- Comment #101 From 2012-11-02 16:30:33 PST -------
Thank you for your perseverance.  I look forward to discussing this and other patches with you in person next time you're in SF.
------- Comment #102 From 2012-11-02 16:56:24 PST -------
(From update of attachment 171927 [details])
Clearing flags on attachment: 171927

Committed r133375: <http://trac.webkit.org/changeset/133375>
------- Comment #103 From 2012-11-02 16:56:37 PST -------
All reviewed patches have been landed.  Closing bug.
------- Comment #104 From 2012-11-02 18:21:37 PST -------
Re-opened since this is blocked by bug 101138
------- Comment #106 From 2012-11-02 19:05:25 PST -------
Glenn:  To be clear, the path forward here is just for you to look at the failures and to explain what's right/wrong and for us to fix the location of the -expected files to match, and then we can reland.
------- Comment #107 From 2012-11-05 13:53:58 PST -------
Created an attachment (id=172397) [details]
Patch
------- Comment #108 From 2012-11-05 13:57:41 PST -------
(From update of attachment 172397 [details])
Retarget chromium-linux reftest expectations to chromium (generic), but use default reftests with chromium-android. This should address chromium fails, but may see some Qt fails, for which I've opened https://bugs.webkit.org/show_bug.cgi?id=101250. If Qt fails, then should skip css3/line-break, pending fixes for bug 101250.
------- Comment #109 From 2012-11-05 14:05:35 PST -------
(From update of attachment 172397 [details])
Thanks.
------- Comment #110 From 2012-11-05 14:39:50 PST -------
(From update of attachment 172397 [details])
Clearing flags on attachment: 172397

Committed r133529: <http://trac.webkit.org/changeset/133529>
------- Comment #111 From 2012-11-05 14:40:03 PST -------
All reviewed patches have been landed.  Closing bug.
------- Comment #112 From 2012-11-05 18:34:49 PST -------
This made performance worse on chromium's intl2 page cycler by about 10%

http://build.chromium.org/f/chromium/perf/chromium-rel-win7-webkit/intl1/report.html?history=150&rev=-1

Notice that performance regressed with http://trac.webkit.org/log/?verbose=on&rev=133379&stop_rev=133369, improved with this patch was reverted in http://trac.webkit.org/log/?verbose=on&rev=133386&stop_rev=133385, then regressed again in http://trac.webkit.org/log/?verbose=on&rev=133531&stop_rev=133523
------- Comment #113 From 2012-11-05 18:42:57 PST -------
Is that a test that Glenn can run himself?
------- Comment #114 From 2012-11-05 19:31:53 PST -------
(In reply to comment #112)
> This made performance worse on chromium's intl2 page cycler by about 10%
> 
> http://build.chromium.org/f/chromium/perf/chromium-rel-win7-webkit/intl1/report.html?history=150&rev=-1
> 
> Notice that performance regressed with http://trac.webkit.org/log/?verbose=on&rev=133379&stop_rev=133369, improved with this patch was reverted in http://trac.webkit.org/log/?verbose=on&rev=133386&stop_rev=133385, then regressed again in http://trac.webkit.org/log/?verbose=on&rev=133531&stop_rev=133523

is this perf regress chromium platform specific? e.g., cr-win specific? cr-linux? i'm set up to build/test on Ubuntu 12.04 64-bit, so i'll update and attempt to track down this test
------- Comment #115 From 2012-11-05 19:41:12 PST -------
It seems to be most pronounced on cr-win - here's a cr-linux graph for comparison: http://build.chromium.org/f/chromium/perf/linux-release-webkit-latest/intl2/report.html?history=150&rev=-1.  The results overall are noisier, but there doesn't appear to be room for a 10% regression.  Some parts of the line breaking code are platform-specific so that could explain it - we also use a different toolchain (msvs vs gcc/clang on linux/mac respectively).
------- Comment #116 From 2012-11-05 19:53:10 PST -------
presumably intl2 doesn't use any of the new -webkit-line-break modes [though it's worth checking of it explicitly uses 'normal' as a value in a CJK context];

more likely, this is a compile/optimizer issue, since this patch does introduce a third template parameter on a very hot function (WebCore::nextBreakablePosition), which I was very careful to check for regressions on mac ports;

the hot loop on the above function should devolve to pre-patch behavior (and performance) when the looseBehavior template parameter is not LooseMode; however, if the compiler isn't optimizing this test, and is doing a compare on looseBehavior, then that could be the culprit

i'd need to look at the disassembly of nextBreakBehavior on cr-win to see what the compiler was doing; i did careful perf testing of this on mac-* ports, and proved to my satisfaction that no perf regression occurred
------- Comment #117 From 2012-11-05 19:59:07 PST -------
(In reply to comment #116)
> presumably intl2 doesn't use any of the new -webkit-line-break modes [though it's worth checking of it explicitly uses 'normal' as a value in a CJK context];

The string "line-break" does not appear in the test data at all.

> 
> more likely, this is a compile/optimizer issue, since this patch does introduce a third template parameter on a very hot function (WebCore::nextBreakablePosition), which I was very careful to check for regressions on mac ports;
> 
> the hot loop on the above function should devolve to pre-patch behavior (and performance) when the looseBehavior template parameter is not LooseMode; however, if the compiler isn't optimizing this test, and is doing a compare on looseBehavior, then that could be the culprit
> 
> i'd need to look at the disassembly of nextBreakBehavior on cr-win to see what the compiler was doing; i did careful perf testing of this on mac-* ports, and proved to my satisfaction that no perf regression occurred

The mac and linux perf bots appear to agree with your hypothesis.  The chromium windows build instructions are here: http://www.chromium.org/developers/how-tos/build-instructions-windows or I'm sure we could find someone here who's already set up to build in this configuration who can help diagnose this issue (+cc a few who might be interested)
------- Comment #118 From 2012-11-05 20:01:06 PST -------
Or if you don't/can't set up a build environment yourself, in a few days this patch will hit the Google Chrome Canary and you can disassemble the binary yourself.  We provide debug symbols here: http://www.chromium.org/developers/how-tos/debugging#TOC-Symbol-server
------- Comment #119 From 2012-11-06 10:27:56 PST -------
It looks like this patch may have caused a 15% regression in memory usage on the Chromium page cyclers.

http://build.chromium.org/f/chromium/perf/chromium-rel-win7-webkit/intl2/report.html?rev=166210&graph=ws_single_peak_r&history=30

Regression range: http://trac.webkit.org/log/?verbose=on&rev=133531&stop_rev=133523

Can we come up with a way to support this that uses less memory? In the meantime, can we rollout this patch?
------- Comment #120 From 2012-11-06 11:36:54 PST -------
Thank you Ojan and James.  Yes, we should roll this out.  We can't regress a PLT by 10% and memory by 15%, at least not w/o understanding the tradeoff better.
------- Comment #121 From 2012-11-06 11:42:47 PST -------
One of the tests was also flaky (noted in bug 101368).

I think to land this again, we need to run the memory tests locally, and consider splitting this test into smaller pieces.
------- Comment #122 From 2012-11-06 13:05:16 PST -------
Good new and bad. Turns out I was wrong about this patch causing the memory regression. On the other hand, it does seem to be the culprit for the performance regression James pointed out.
------- Comment #123 From 2012-11-06 15:34:14 PST -------
(In reply to comment #122)
> Good new and bad. Turns out I was wrong about this patch causing the memory regression. On the other hand, it does seem to be the culprit for the performance regression James pointed out.

Sigh. I'm doing too many things at once. This was the cause of the memory regression as well. I just misread the dashboard.
------- Comment #124 From 2012-11-06 15:48:39 PST -------
(In reply to comment #123)
> (In reply to comment #122)
> > Good new and bad. Turns out I was wrong about this patch causing the memory regression. On the other hand, it does seem to be the culprit for the performance regression James pointed out.
> 
> Sigh. I'm doing too many things at once. This was the cause of the memory regression as well. I just misread the dashboard.

Double sigh. Really doing too many things at once. It was *not* this patch the caused the memory regression, it was http://trac.webkit.org/changeset/133526.
------- Comment #125 From 2013-02-22 15:59:26 PST -------
Created an attachment (id=189854) [details]
Patch
------- Comment #126 From 2013-02-22 16:07:03 PST -------
(From update of attachment 189854 [details])
Attachment 189854 [details] did not pass qt-ews (qt):
Output: http://queues.webkit.org/results/16722162
------- Comment #127 From 2013-02-22 16:09:34 PST -------
(From update of attachment 189854 [details])
Attachment 189854 [details] did not pass qt-wk2-ews (qt):
Output: http://queues.webkit.org/results/16719285
------- Comment #128 From 2013-02-22 16:58:34 PST -------
(From update of attachment 189854 [details])
Attachment 189854 [details] did not pass efl-ews (efl):
Output: http://queues.webkit.org/results/16718327
------- Comment #129 From 2013-02-22 19:34:03 PST -------
Created an attachment (id=189897) [details]
Patch
------- Comment #130 From 2013-02-23 04:03:35 PST -------
(From update of attachment 189897 [details])
Ready to re-land after:
(1) subdividing patch to reduce size (see bug 109954)
(2) reworking nextBreakPosition<> (break_lines.cpp) to avoid WIN optimization bug
(3) incrementally enabling new tests on port by port basis as verification proceeds
------- Comment #131 From 2013-02-25 12:08:32 PST -------
(From update of attachment 189897 [details])
This LGTM.  My main concern is possible perf regressions.  We'll need to be careful to check the bots again in 24 hours to make sure this didn't cause a blip.  Some of this is extremely hot code.
------- Comment #132 From 2013-02-26 00:44:07 PST -------
(From update of attachment 189897 [details])
Clearing flags on attachment: 189897

Committed r144019: <http://trac.webkit.org/changeset/144019>
------- Comment #133 From 2013-02-26 00:44:21 PST -------
All reviewed patches have been landed.  Closing bug.
------- Comment #134 From 2013-02-26 05:19:25 PST -------
8% performance regression on chromium-win7 intl2 page cycler noted at

http://build.chromium.org/f/chromium/perf/chromium-rel-win7-webkit/intl1/report.html?history=40&rev=184623&graph=times

and bug opened at

https://bugs.webkit.org/show_bug.cgi?id=110872

with dependency added here

rather than reverting this patch, i'd like to fix the regression in a follow on patch that addresses the win specific issue
------- Comment #135 From 2013-02-26 10:33:11 PST -------
Re-opened since this is blocked by bug 110892
------- Comment #136 From 2013-02-26 10:39:24 PST -------
(From update of attachment 189897 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=189897&action=review

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:384
> +    UBreakIterator* ubrkIter = reinterpret_cast<UBreakIterator*>(iterator);

Why reinterpret rather than static_cast here?

Also you may want to try keeping openLineBreakIterator and especially closeLineBreakIterator (as it's a trivial forwarding function all you're doing here is adding call overhead) inline.
------- Comment #139 From 2013-02-26 11:05:22 PST -------
(In reply to comment #138)
> Thanks, i'll wear that as a badge of honor. :)

You're not a true WebKitten until you've introduced a major regression :)
------- Comment #140 From 2013-02-26 11:13:53 PST -------
(In reply to comment #136)
> (From update of attachment 189897 [details] [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=189897&action=review
> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:384
> > +    UBreakIterator* ubrkIter = reinterpret_cast<UBreakIterator*>(iterator);
> 
> Why reinterpret rather than static_cast here?

it was already there, from https://trac.webkit.org/browser/trunk/Source/WebCore/platform/text/TextBreakIteratorICU.cpp?rev=89592

just moved it

> Also you may want to try keeping openLineBreakIterator and especially closeLineBreakIterator (as it's a trivial forwarding function all you're doing here is adding call overhead) inline.

good idea
------- Comment #141 From 2013-02-26 11:23:38 PST -------
Given that this regression showed in html5-full-render, this should be trivial to debug locally.  PLT performance is a pain, but run-perf-tests is easy.

try:

run-perf-tests Parser/html5-full-render

You can even use:

run-perf-tests Parser/html5-full-render --profile

to have it automatically run whatever your system's default sampling profiler is.  See run-perf-tests for options.