Bug 105692 - Element boundaries prevent Japanese line break opportunities
Summary: Element boundaries prevent Japanese line break opportunities
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Layout and Rendering (show other bugs)
Version: 528+ (Nightly build)
Hardware: All Unspecified
: P2 Normal
Assignee: Glenn Adams
URL:
Keywords:
Depends on: 17427 113811 113823
Blocks:
  Show dependency treegraph
 
Reported: 2012-12-23 09:33 PST by Koji Ishii
Modified: 2013-10-23 06:38 PDT (History)
23 users (show)

See Also:


Attachments
test case (675 bytes, text/html)
2012-12-23 09:33 PST, Koji Ishii
no flags Details
test case result (5.66 KB, image/png)
2013-01-13 23:05 PST, Koji Ishii
no flags Details
test case expected (8.49 KB, image/png)
2013-01-13 23:06 PST, Koji Ishii
no flags Details
Patch (7.21 KB, patch)
2013-03-11 19:14 PDT, Glenn Adams
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
Patch (24.03 KB, patch)
2013-03-15 20:24 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch (42.90 KB, patch)
2013-03-18 22:12 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch (43.13 KB, patch)
2013-03-18 22:37 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch (71.00 KB, patch)
2013-03-19 14:38 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Line layout performance results (en, jp) for attachment 193924. (46.50 KB, text/html)
2013-03-19 14:44 PDT, Glenn Adams
no flags Details
Parser (html5-full-render) performance results for attachment 193924. (32.53 KB, text/html)
2013-03-19 14:47 PDT, Glenn Adams
no flags Details
Patch (73.12 KB, patch)
2013-03-20 19:43 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch (263.95 KB, patch)
2013-03-21 18:29 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from webkit-ews-02 (1.16 MB, application/zip)
2013-03-22 08:50 PDT, Build Bot
no flags Details
Patch (79.70 KB, patch)
2013-03-26 19:02 PDT, Glenn Adams
glenn: review-
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
Archive of layout-test-results from gce-cr-linux-06 for chromium-linux-x86_64 (1.76 MB, application/zip)
2013-03-26 21:18 PDT, WebKit Review Bot
no flags Details
Archive of layout-test-results from webkit-ews-16 for mac-mountainlion-wk2 (1.15 MB, application/zip)
2013-03-26 22:12 PDT, Build Bot
no flags Details
Archive of layout-test-results from gce-cr-linux-03 for chromium-linux-x86_64 (1.89 MB, application/zip)
2013-03-26 22:20 PDT, WebKit Review Bot
no flags Details
Patch (625.42 KB, patch)
2013-03-27 23:22 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch (226.94 KB, patch)
2013-04-02 20:17 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch (226.99 KB, patch)
2013-04-02 20:36 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch (583.64 KB, patch)
2013-04-02 22:00 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Koji Ishii 2012-12-23 09:33:49 PST
Created attachment 180625 [details]
test case

Line break opportunities before and after <span> or any inline elements are prohibited.

EXPECTED:
<span> or any inline elements should not prevent line break opportunities.
Comment 1 mitz 2012-12-31 19:09:56 PST
Looks like a duplicate of bug 17427.
Comment 2 Koji Ishii 2013-01-13 23:05:45 PST
Created attachment 182512 [details]
test case result
Comment 3 Koji Ishii 2013-01-13 23:06:07 PST
Created attachment 182513 [details]
test case expected
Comment 4 Lu jing 2013-01-22 05:13:03 PST
I work for fix this bug, could you assign it to me?
Comment 5 Glenn Adams 2013-01-22 18:13:44 PST
(In reply to comment #4)
> I work for fix this bug, could you assign it to me?

Hi Lu Jing, I have already assigned it to myself and started work on this bug, which I believe may be dependent on bug 89235. If you would like to help, please email me directly to discuss. Thanks, Glenn
Comment 6 Lu jing 2013-01-23 16:32:02 PST
(In reply to comment #5)
> (In reply to comment #4)
> > I work for fix this bug, could you assign it to me?
> 
> Hi Lu Jing, I have already assigned it to myself and started work on this bug, which I believe may be dependent on bug 89235. If you would like to help, please email me directly to discuss. Thanks, Glenn

Got it. Thanks.
Comment 7 Koji Ishii 2013-02-17 06:43:35 PST
I don't think this bug depends on bug 89235. 89325 is about changing iterator while this bug is about how to call iterators. Let's not try to create unnecessary dependencies as it will slow things down.

Lu, can you please keep working on this if you can?
Comment 8 Glenn Adams 2013-02-17 08:48:45 PST
(In reply to comment #7)
> I don't think this bug depends on bug 89235. 89325 is about changing iterator while this bug is about how to call iterators. Let's not try to create unnecessary dependencies as it will slow things down.

agreed; however, it either depends on or is a duplicate of 17427 as pointed out by mitz in comment #1
Comment 9 Lu jing 2013-02-17 11:33:31 PST
(In reply to comment #7)
> I don't think this bug depends on bug 89235. 89325 is about changing iterator while this bug is about how to call iterators. Let's not try to create unnecessary dependencies as it will slow things down.
> 
> Lu, can you please keep working on this if you can?

Ok, I've read all about 17427. I'm not sure I can fix but I will keep working on this.
Comment 10 Koji Ishii 2013-02-21 06:32:36 PST
If 17427 can fix this too, since nobody seems to be working on 17427, you can work on 17427 and then resolve this as dup, if you want to.
Comment 11 Glenn Adams 2013-02-21 06:59:10 PST
(In reply to comment #10)
> If 17427 can fix this too, since nobody seems to be working on 17427, you can work on 17427 and then resolve this as dup, if you want to.

that's my plan
Comment 12 Glenn Adams 2013-02-25 09:28:53 PST
verified that patch https://bugs.webkit.org/attachment.cgi?id=190055 on bug 17427 fixes the problem shown by the attached test case
Comment 13 Glenn Adams 2013-03-11 19:14:41 PDT
Created attachment 192625 [details]
Patch
Comment 14 WebKit Review Bot 2013-03-11 22:14:59 PDT
Comment on attachment 192625 [details]
Patch

Attachment 192625 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://webkit-commit-queue.appspot.com/results/17013978

New failing tests:
http/tests/canvas/philip/tests/security.pattern.canvas.strokeStyle.html
css1/cascade/important.html
css1/box_properties/acid_test.html
http/tests/canvas/philip/tests/security.pattern.cross.html
css1/box_properties/border_bottom.html
css1/basic/containment.html
http/tests/canvas/philip/tests/security.pattern.image.strokeStyle.html
accessibility/radio-button-title-label.html
css1/box_properties/border_bottom_width_inline.html
http/tests/canvas/philip/tests/security.reset.html
http/tests/canvas/philip/tests/security.pattern.canvas.timing.html
css1/classification/list_style_position.html
accessibility/label-for-control-hittest.html
canvas/philip/tests/2d.drawImage.negativeSourceHeight2.html
css1/basic/contextual_selectors.html
canvas/philip/tests/2d.composite.operation.foobar.html
css1/classification/display.html
css1/color_and_background/background.html
css1/classification/list_style.html
http/tests/canvas/philip/tests/security.drawImage.image.html
css1/color_and_background/background_color.html
http/tests/canvas/philip/tests/security.pattern.image.fillStyle.html
css1/box_properties/border_bottom_inline.html
http/tests/canvas/philip/tests/security.drawImage.canvas.html
css1/color_and_background/background_attachment.html
http/tests/canvas/philip/tests/security.pattern.canvas.fillStyle.html
http/tests/canvas/philip/tests/security.pattern.create.html
canvas/philip/tests/2d.drawImage.negativeSourceHeightAndWidth.html
http/tests/cache/subresource-failover-to-network.html
canvas/philip/tests/2d.drawImage.negativeSourceHeight.html
Comment 15 Glenn Adams 2013-03-11 23:58:32 PDT
Comment on attachment 192625 [details]
Patch

causes functional regression
Comment 16 Glenn Adams 2013-03-15 20:24:06 PDT
Created attachment 193419 [details]
Patch
Comment 17 Glenn Adams 2013-03-15 20:33:47 PDT
Comment on attachment 193419 [details]
Patch

Support use of prior context in non-ASCII path without burden of memory allocation. This required changing to a custom ICU text provider for UChar strings and will also require changes to the current custom text provider for LChar strings. This patch is not yet ready for review:

(1) uTextLatin1* (existing LChar text provider) need to be updated to use prior context;
(2) need to handle surrogate pairs in prior context as well as at boundary of prior and primary contexts;
(3) need functional and performance regression testing;

This patch does, however, fix the reported test reduction, as also indicated by the new LayoutTest included in this patch.
Comment 18 Glenn Adams 2013-03-15 21:10:34 PDT
(In reply to comment #17)
> (From update of attachment 193419 [details])
> Support use of prior context in non-ASCII path without burden of memory allocation. This required changing to a custom ICU text provider for UChar strings and will also require changes to the current custom text provider for LChar strings. This patch is not yet ready for review:
> 
> (1) uTextLatin1* (existing LChar text provider) need to be updated to use prior context;
> (2) need to handle surrogate pairs in prior context as well as at boundary of prior and primary contexts;
> (3) need functional and performance regression testing;
> 
> This patch does, however, fix the reported test reduction, as also indicated by the new LayoutTest included in this patch.

Adding msaboff since he implemented original uTextLatin1* text provider. I expect to ask him to take a quick look at this work once it gets to an r? stage.
Comment 19 Glenn Adams 2013-03-18 22:12:42 PDT
Created attachment 193728 [details]
Patch
Comment 20 Early Warning System Bot 2013-03-18 22:18:46 PDT
Comment on attachment 193728 [details]
Patch

Attachment 193728 [details] did not pass qt-ews (qt):
Output: http://webkit-commit-queue.appspot.com/results/17160598
Comment 21 Early Warning System Bot 2013-03-18 22:20:37 PDT
Comment on attachment 193728 [details]
Patch

Attachment 193728 [details] did not pass qt-wk2-ews (qt):
Output: http://webkit-commit-queue.appspot.com/results/17234336
Comment 22 WebKit Review Bot 2013-03-18 22:30:40 PDT
Comment on attachment 193728 [details]
Patch

Attachment 193728 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://webkit-commit-queue.appspot.com/results/17176607
Comment 23 WebKit Review Bot 2013-03-18 22:31:46 PDT
Comment on attachment 193728 [details]
Patch

Attachment 193728 [details] did not pass cr-linux-debug-ews (chromium-xvfb):
Output: http://webkit-commit-queue.appspot.com/results/17211445
Comment 24 Glenn Adams 2013-03-18 22:37:49 PDT
Created attachment 193732 [details]
Patch
Comment 25 Build Bot 2013-03-19 00:51:36 PDT
Comment on attachment 193732 [details]
Patch

Attachment 193732 [details] did not pass mac-ews (mac):
Output: http://webkit-commit-queue.appspot.com/results/17183548
Comment 26 Build Bot 2013-03-19 00:59:25 PDT
Comment on attachment 193732 [details]
Patch

Attachment 193732 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-commit-queue.appspot.com/results/17074957
Comment 27 Glenn Adams 2013-03-19 14:38:57 PDT
Created attachment 193924 [details]
Patch
Comment 28 Glenn Adams 2013-03-19 14:44:55 PDT
Created attachment 193927 [details]
Line layout performance results (en, jp) for attachment 193924 [details].

Approximately 4% improvement for Japanese. Slight improvement for English.
Comment 29 Glenn Adams 2013-03-19 14:47:52 PDT
Created attachment 193929 [details]
Parser (html5-full-render) performance results for attachment 193924 [details].

No measurable regression. Performed on MBP Retina MacOSX 10.8.3.
Comment 30 Glenn Adams 2013-03-19 15:22:08 PDT
(In reply to comment #29)
> Created an attachment (id=193929) [details]
> Parser (html5-full-render) performance results for attachment 193924 [details].
> 
> No measurable regression. Performed on MBP Retina MacOSX 10.8.3.

I should have said no clearly significant regression within the indicated variance.
Comment 31 Darin Adler 2013-03-20 10:46:43 PDT
Comment on attachment 193924 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=193924&action=review

Thanks for tackling this. A lot of the new code here isn’t following WebKit coding style and it should.

I’m concerned about the level of test coverage for this significant amount code. I am pretty sure I spotted an infinite loop in the code, not sure why it was not a problem in practice.

> Source/WebCore/ChangeLog:28
> +        (WebCore):

It’s best to remove bogus lines like this that the script adds to change log entries.

> Source/WebCore/platform/text/TextBreakIterator.h:57
> +    const int TextBreakPriorContextCapacity = 2;

I think this constant would be better as a private member of LazyLineBreakIterator rather exposed at the top level as a public constant. It has nothing to do with the interface to the iterator.

> Source/WebCore/platform/text/TextBreakIterator.h:83
> +    UChar lastCharacter() const { return m_priorContext[1]; }

Do we need an assertion about m_priorContextLength here?

> Source/WebCore/platform/text/TextBreakIterator.h:85
> +    const UChar* getPriorContext(int& priorContextLength)

This is used only within the class, so I suggest it be public. I’m also not sure it needs to be a function. Having this logic in the get function might be good.

> Source/WebCore/platform/text/TextBreakIterator.h:95
> +        m_priorContextLength = last ? (secondToLast ? 2 : 1) : 0;

This seems a bit peculiar. We are storing a length, but it’s not additional data; it’s just an indicator of which characters are zero. m_priorContextLength should not be stored if it can be computed.

> Source/WebCore/platform/text/TextBreakIterator.h:103
> +    void resetPriorContext()

Renaming this to be more abstract isn’t all that useful. We still are storing just two characters. It’s not all that helpful to pretend to be more general in our naming.

> Source/WebCore/platform/text/TextBreakIterator.h:110
> -    TextBreakIterator* get()
> +    TextBreakIterator* get(int& priorContextLength)

This interface is confusing enough that I think we’ll need a comment.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:71
> +static inline int64_t uTextWithPriorContextGetNativeLength(UText* ut)
> +{
> +    return ut->a + ut->b;
> +}

Nothing here seems to be specific to “a UText with prior context” so the function name seems peculiar. A shorter name would likely make the code easier to read. Further, including the word “Get” in this name does not make clear “this is the inlined one you want to call”. It’s just a nearly identical name to the other function and it seems likely we’d call the wrong one.

Further, I don’t think that adding “with prior context” to the names of all the functions is necessary. It’s really hard to parse those long names. There must be a more concise way to put things.

WebKit coding style is to use words rather than abbreviations. I suggest the word “text” rather than the letters “ut” as the name of the object pointer in all these functions.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:77
> +// Generic native length function, works with both Latin1 and UC text providers.
> +static int64_t uTextWithPriorContextNativeLength(UText* ut)
> +{
> +    return uTextWithPriorContextGetNativeLength(ut);
> +}

Here the long name is perhaps justified, since this function exists to be put in the method table below.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:80
> -static UText* uTextLatin1Clone(UText* destination, const UText* source, UBool deep, UErrorCode* status)
> +// Generic clone function, works with both Latin1 and UC text providers.
> +static UText * uTextWithPriorContextClone(UText* dest, const UText* src, UBool deep, UErrorCode* status)

Why the formatting change to add a space before the "*" here? That’s not normal for our project.

Why the abbreviation of destination and source? In WebKit we prefer words to abbreviations.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:89
> +    void* extraNew = dest->pExtra;

What does “extra new” mean? Unclear name.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:93
> +    int sizeToCopy = src->sizeOfStruct;
> +    if (sizeToCopy > dest->sizeOfStruct)
> +        sizeToCopy = dest->sizeOfStruct;

Would read more clearly with std::min instead of an if statement.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:98
> +    if (extraSize)
> +        memcpy(dest->pExtra, src->pExtra, extraSize);

The if here is unneeded and I suggest omitting it.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:103
> +static int32_t uTextWithPriorContextExtract(UText* ut, int64_t start, int64_t limit, UChar* dest, int32_t destCapacity, UErrorCode* pErrorCode)

The use of a p prefix to mean a pointer is not a WebKit standard style element, so should not be done.

The name “ut” does not seem clear; a word would be better than a pile of letters and more consistent with WebKit coding style.

Please don’t use the abbreviation dest; use the word destination instead.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:128
> +    int32_t copied = 0;

It’s normally best to name local variables with noun phrases rather than fragments. The word “copied” sounds like it could be a boolean and it would be clearer to call it charactersCopied or something along those lines.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:129
> +    while (copied < length) {

This loop does not seem to increment “copied”, so I’d expect it to be an infinite loop. How did you test?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:131
> +        if (!ut->pFuncs->access(ut, start, TRUE))
> +            ASSERT_NOT_REACHED();

Is continuing to run an ignoring the failure the most helpful behavior here for release builds?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:133
> +        int32_t ucAvailable = ut->chunkLength - ut->chunkOffset;
> +        int32_t ucToCopy = ucAvailable < length ? ucAvailable : length;

The “uc” prefix on these local variable names is not the usual WebKit style. It’s also best to name local variables with noun phrases rather than fragments. “to copy” is not really so great.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:134
> +        StringImpl::copyChars(&dest[copied], ut->chunkContents + ut->chunkOffset, static_cast<unsigned>(ucToCopy));

Why the static_cast here? Normally we can pass an int32_t to an unsigned without an explicit cast.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:146
> +// Generic map offset to native index function, works with both Latin1 and UC text providers.

The comments like this one before each function don’t say much. I’d omit them. It’s true that these functions are used twice, for both Latin-1 and Unicode, but they’re not “generic” in any additional interesting sense, nor is it helpful to repeat the function name in prose in a comment.

Also, I am not comfortable with the abbreviation “uc” for “Unicode” and I’d prefer we keep the use of that to a minimum.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:157
> +    ASSERT(nativeIndex - ut->chunkNativeStart < INT32_MAX);

What guarantees this is true? We can assert it only if we know it will be true.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:238
> +            ASSERT(nativeIndex - ut->chunkNativeStart < INT32_MAX);

Same question as elsewhere. What guarantees this?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:250
> +            ASSERT(nativeIndex - ut->chunkNativeStart < INT32_MAX);

Same question as elsewhere. What guarantees this?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:291
> +static const struct UTextFuncs uTextLatin1WithPriorContextFuncs = 
> +{

In WebKit coding style the brace would go on the previous line.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:347
> +    ASSERT_UNUSED(forward, forward ? nativeIndex >= ut->b && nativeIndex < ut->a + ut->b : nativeIndex > ut->b && nativeIndex <= ut->a + ut->b);

We don’t normally do assertions with && in them. I suggest breaking this out into two assertions, one about the start and one about the end.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:350
> +    ASSERT(ut->chunkNativeLimit - ut->chunkNativeStart < INT32_MAX);

What guarantees this? We can’t just assert because we hope it’s true.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:353
> +    ASSERT(nativeIndex - ut->chunkNativeStart < INT32_MAX);

What guarantees this? We can’t just assert because we hope it’s true.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:372
> +    ASSERT(nativeIndex < INT32_MAX);

What guarantees this? We can’t just assert because we hope it’s true.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:486
> +    UText uTextUCWithPriorContextLocal = UTEXT_INITIALIZER;

This long name does not seem helpful. Why does the phrase “with prior context” need to appear in the variable name? Why not just call this textLocalStorage or something like that?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:488
> +    UErrorCode uTextOpenStatus = U_ZERO_ERROR;

Why not just call this openStatus?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:489
> +    UText* uTextUCWithPriorContext = UTextOpenUTF16(&uTextUCWithPriorContextLocal, string, length, priorContext, priorContextLength, &uTextOpenStatus);

Why not just call this text?
Comment 32 Glenn Adams 2013-03-20 11:02:08 PDT
(In reply to comment #31)
> (From update of attachment 193924 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=193924&action=review
> 
> Thanks for tackling this. A lot of the new code here isn’t following WebKit coding style and it should.

Thanks for the quick review! I'll address these points and post a new patch ASAP.
Comment 33 Michael Saboff 2013-03-20 12:33:13 PDT
Comment on attachment 193924 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=193924&action=review

> Source/WebCore/ChangeLog:9
> +        in non-ASCII path through nextBreakablePossition(). This enables

typo - nextBreakablePosition()

> Source/WebCore/ChangeLog:68
> +        (WebCore::uTextWithPriorContextGetNativeLength):
> +        (WebCore):
> +        (WebCore::uTextWithPriorContextNativeLength):
> +        (WebCore::uTextWithPriorContextClone):
> +        (WebCore::uTextWithPriorContextExtract):
> +        (WebCore::uTextWithPriorContextMapOffsetToNative):
> +        (WebCore::uTextWithPriorContextMapNativeIndexToUTF16):
> +        (WebCore::uTextWithPriorContextClose):
> +        (WebCore::uTextGetContext):
> +        (WebCore::uTextLatin1GetCurrentContext):
> +        (WebCore::uTextLatin1MoveInPrimaryContext):
> +        (WebCore::uTextLatin1SwitchToPrimaryContext):
> +        (WebCore::uTextLatin1MoveInPriorContext):
> +        (WebCore::uTextLatin1SwitchToPriorContext):
> +        (WebCore::uTextInChunkOrOutOfRange):
> +        (WebCore::uTextLatin1WithPriorContextAccess):
> +        (WebCore::uTextWithPriorContextInit):
> +        (WebCore::UTextOpenLatin1):
> +        (WebCore::uTextUTF16GetCurrentContext):
> +        (WebCore::uTextUTF16MoveInPrimaryContext):
> +        (WebCore::uTextUTF16SwitchToPrimaryContext):
> +        (WebCore::uTextUTF16MoveInPriorContext):
> +        (WebCore::uTextUTF16SwitchToPriorContext):
> +        (WebCore::uTextUTF16WithPriorContextAccess):

I don't think adding WithPriorContext to the various function names adds much.  Some of these functions are shared between the Latin1 and UTF16 UText processing and then there are the two Access functions.

> Source/WebCore/platform/text/TextBreakIterator.h:92
>      {

Shouldn't we ASSERT here that we have at least two characters of space in m_priorContext?

> Source/WebCore/platform/text/TextBreakIterator.h:98
>      {

Shouldn't we ASSERT here that we have at least two characters of space in m_priorContext?

> Source/WebCore/platform/text/TextBreakIterator.h:104
> +    {

Shouldn't we ASSERT here that we have at least two characters of context to reset, or iterate through from 0..TextBreakPriorContextCapacity-1?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:92
> +    if (sizeToCopy > dest->sizeOfStruct)

When would src->sizeOfStruct != dest->sizeOfStruct?

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:94
> +    memcpy(dest, src, sizeToCopy);

I prefer the element by element initialization, since there are special cases to handle in the destination.  For example, the addition of a new pointer in struct UText would end up being a problem with the memcpy.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:459
> +    UTextWithBuffer uTextLatin1WithPriorContextLocal;
> +    uTextLatin1WithPriorContextLocal.ut = emptyUText;
> +    uTextLatin1WithPriorContextLocal.ut.extraSize = sizeof(uTextLatin1WithPriorContextLocal.buffer);
> +    uTextLatin1WithPriorContextLocal.ut.pExtra = uTextLatin1WithPriorContextLocal.buffer;

Same comment about adding WithPriorContext.  I don't think it is needed.
Comment 34 Glenn Adams 2013-03-20 16:56:41 PDT
These comments apply to an update to the current patch I will be submitting presently.

(In reply to comment #31)
> (From update of attachment 193924 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=193924&action=review
> 
> Thanks for tackling this. A lot of the new code here isn’t following WebKit coding style and it should.

Fixed (see details below). Most of the style diffs were from using the ICU text provider code examples (and their style).

> 
> I’m concerned about the level of test coverage for this significant amount code. I am pretty sure I spotted an infinite loop in the code, not sure why it was not a problem in practice.

Apparently never called, but see further comments below.

> 
> > Source/WebCore/ChangeLog:28
> > +        (WebCore):
> 
> It’s best to remove bogus lines like this that the script adds to change log entries.

Fixed.

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:57
> > +    const int TextBreakPriorContextCapacity = 2;
> 
> I think this constant would be better as a private member of LazyLineBreakIterator rather exposed at the top level as a public constant. It has nothing to do with the interface to the iterator.

Fixed.

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:83
> > +    UChar lastCharacter() const { return m_priorContext[1]; }
> 
> Do we need an assertion about m_priorContextLength here?

Added assert against priorContextCapacity.

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:85
> > +    const UChar* getPriorContext(int& priorContextLength)
> 
> This is used only within the class, so I suggest it be public. I’m also not sure it needs to be a function. Having this logic in the get function might be good.

Removed, incorporating logic into get().

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:95
> > +        m_priorContextLength = last ? (secondToLast ? 2 : 1) : 0;
> 
> This seems a bit peculiar. We are storing a length, but it’s not additional data; it’s just an indicator of which characters are zero. m_priorContextLength should not be stored if it can be computed.

Replaced with private compute function.

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:103
> > +    void resetPriorContext()
> 
> Renaming this to be more abstract isn’t all that useful. We still are storing just two characters. It’s not all that helpful to pretend to be more general in our naming.

The present storage of only two characters matches the current ASCII path usage and fixes the CJK cases; however,  I anticipate having to expand prior context to store up to an entire grapheme cluster (or more) in order to properly handle Indic and certain SE Asian scripts. It may eventually require storage of a prior word in order to handle dictionary based word breaking in Thai (and similar writing systems).

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:110
> > -    TextBreakIterator* get()
> > +    TextBreakIterator* get(int& priorContextLength)
> 
> This interface is confusing enough that I think we’ll need a comment.

Added comment.

> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:71
> > +static inline int64_t uTextWithPriorContextGetNativeLength(UText* ut)
> > +{
> > +    return ut->a + ut->b;
> > +}
> 
> Nothing here seems to be specific to “a UText with prior context” so the function name seems peculiar. A shorter name would likely make the code easier to read. Further, including the word “Get” in this name does not make clear “this is the inlined one you want to call”. It’s just a nearly identical name to the other function and it seems likely we’d call the wrong one.

Change "uTextWithPriorContext" to "text" throughout. I originally made the length getter into two parts, an inline part and a non-inline part, because I couldn't recall if the existing compilers would expand an inline as non-inline if a function pointer were taken of it (as is the case here).

> 
> Further, I don’t think that adding “with prior context” to the names of all the functions is necessary. It’s really hard to parse those long names. There must be a more concise way to put things.

Fixed.
 
> WebKit coding style is to use words rather than abbreviations. I suggest the word “text” rather than the letters “ut” as the name of the object pointer in all these functions.

Changed to "text". The "ut" usage was based on the usage found in the ICU library implementation of text providers.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:77
> > +// Generic native length function, works with both Latin1 and UC text providers.
> > +static int64_t uTextWithPriorContextNativeLength(UText* ut)
> > +{
> > +    return uTextWithPriorContextGetNativeLength(ut);
> > +}
> 
> Here the long name is perhaps justified, since this function exists to be put in the method table below.
> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:80
> > -static UText* uTextLatin1Clone(UText* destination, const UText* source, UBool deep, UErrorCode* status)
> > +// Generic clone function, works with both Latin1 and UC text providers.
> > +static UText * uTextWithPriorContextClone(UText* dest, const UText* src, UBool deep, UErrorCode* status)
> 
> Why the formatting change to add a space before the "*" here? That’s not normal for our project.

Fixed. It was a typo. For some reason check-webkit-style didn't catch it.

> 
> Why the abbreviation of destination and source? In WebKit we prefer words to abbreviations.

Fixed. This usage was from the ICU implementation examples of text providers.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:89
> > +    void* extraNew = dest->pExtra;
> 
> What does “extra new” mean? Unclear name.

This name corresponds to the ICU field UText.pExtra. See ICU utext.h for further information. 

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:93
> > +    int sizeToCopy = src->sizeOfStruct;
> > +    if (sizeToCopy > dest->sizeOfStruct)
> > +        sizeToCopy = dest->sizeOfStruct;
> 
> Would read more clearly with std::min instead of an if statement.

Fixed.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:98
> > +    if (extraSize)
> > +        memcpy(dest->pExtra, src->pExtra, extraSize);
> 
> The if here is unneeded and I suggest omitting it.

Removed.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:103
> > +static int32_t uTextWithPriorContextExtract(UText* ut, int64_t start, int64_t limit, UChar* dest, int32_t destCapacity, UErrorCode* pErrorCode)
> 
> The use of a p prefix to mean a pointer is not a WebKit standard style element, so should not be done.

Fixed. This came from ICU text provider usage.

> 
> The name “ut” does not seem clear; a word would be better than a pile of letters and more consistent with WebKit coding style.

Changed to "text". I had been following ICU text provider code example usage here.

> 
> Please don’t use the abbreviation dest; use the word destination instead.

Fixed. I had been following ICU text provider code example usage here.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:128
> > +    int32_t copied = 0;
> 
> It’s normally best to name local variables with noun phrases rather than fragments. The word “copied” sounds like it could be a boolean and it would be clearer to call it charactersCopied or something along those lines.

I had originally called it charsCopied, but that seemed too verbose and a misnomer since it is not characters but character coding units (i.e., UTF-16 coding units) that are being copied. For example, a surrogate pair is two coding units but one character! Since I already use the prefix "uc" to refer to such unicode code (units), perhaps I'll change here to "ucCopied", which goes along with "ucToCopy".

In any case, i've removed this code and this method now asserts or returns an error code if invoked. See more below.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:129
> > +    while (copied < length) {
> 
> This loop does not seem to increment “copied”, so I’d expect it to be an infinite loop. How did you test?

Good catch. I had originally incremented in an earlier fragment but it got chopped out by mistake. It appears that utext_extract() isn't being called by ICU's rbbi (rule based break iterator) implementation even though it appears to be defined as a mandatory method in UTextFuncs.

Perhaps the best way to handle this would be to replace the body of the method with ASSERT_NOT_REACHED() and return U_UNSUPPORTED_ERROR code for release build,  then see if we hit this in any of the test bots? There does appear to be some internal ICU tests aimed at this method, but I haven't attempted to run them. WDYT?

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:131
> > +        if (!ut->pFuncs->access(ut, start, TRUE))
> > +            ASSERT_NOT_REACHED();
> 
> Is continuing to run an ignoring the failure the most helpful behavior here for release builds?

This method now always asserts or returns error code.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:133
> > +        int32_t ucAvailable = ut->chunkLength - ut->chunkOffset;
> > +        int32_t ucToCopy = ucAvailable < length ? ucAvailable : length;
> 
> The “uc” prefix on these local variable names is not the usual WebKit style. It’s also best to name local variables with noun phrases rather than fragments. “to copy” is not really so great.

Code removed. This method now always asserts or returns error code.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:134
> > +        StringImpl::copyChars(&dest[copied], ut->chunkContents + ut->chunkOffset, static_cast<unsigned>(ucToCopy));
> 
> Why the static_cast here? Normally we can pass an int32_t to an unsigned without an explicit cast.

Code removed. This method now always asserts or returns error code.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:146
> > +// Generic map offset to native index function, works with both Latin1 and UC text providers.
> 
> The comments like this one before each function don’t say much. I’d omit them. It’s true that these functions are used twice, for both Latin-1 and Unicode, but they’re not “generic” in any additional interesting sense, nor is it helpful to repeat the function name in prose in a comment.

Removed.

> 
> Also, I am not comfortable with the abbreviation “uc” for “Unicode” and I’d prefer we keep the use of that to a minimum.

Removed.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:157
> > +    ASSERT(nativeIndex - ut->chunkNativeStart < INT32_MAX);
> 
> What guarantees this is true? We can assert it only if we know it will be true.

Fixed. Use 0 if offset exceeds 32 bits. Note also that utext.h says:

"Behavior is undefined if the native index does not fall within the current chunk."

So returning zero will at least be a fixed, defined behavior.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:238
> > +            ASSERT(nativeIndex - ut->chunkNativeStart < INT32_MAX);
> 
> Same question as elsewhere. What guarantees this?

Fixed. Use 0 if offset exceeds 32 bits.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:250
> > +            ASSERT(nativeIndex - ut->chunkNativeStart < INT32_MAX);
> 
> Same question as elsewhere. What guarantees this?

Fixed. Use 0 if offset exceeds 32 bits.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:291
> > +static const struct UTextFuncs uTextLatin1WithPriorContextFuncs = 
> > +{
> 
> In WebKit coding style the brace would go on the previous line.

Fixed.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:347
> > +    ASSERT_UNUSED(forward, forward ? nativeIndex >= ut->b && nativeIndex < ut->a + ut->b : nativeIndex > ut->b && nativeIndex <= ut->a + ut->b);
> 
> We don’t normally do assertions with && in them. I suggest breaking this out into two assertions, one about the start and one about the end.

Fixed.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:350
> > +    ASSERT(ut->chunkNativeLimit - ut->chunkNativeStart < INT32_MAX);
> 
> What guarantees this? We can’t just assert because we hope it’s true.

Fixed. Use 0 if offset exceeds 32 bits.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:353
> > +    ASSERT(nativeIndex - ut->chunkNativeStart < INT32_MAX);
> 
> What guarantees this? We can’t just assert because we hope it’s true.

Fixed. Use 0 if offset exceeds 32 bits.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:372
> > +    ASSERT(nativeIndex < INT32_MAX);
> 
> What guarantees this? We can’t just assert because we hope it’s true.

Fixed. Use 0 if offset exceeds 32 bits.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:486
> > +    UText uTextUCWithPriorContextLocal = UTEXT_INITIALIZER;
> 
> This long name does not seem helpful. Why does the phrase “with prior context” need to appear in the variable name? Why not just call this textLocalStorage or something like that?

Fixed. Changed to "textLocal".

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:488
> > +    UErrorCode uTextOpenStatus = U_ZERO_ERROR;
> 
> Why not just call this openStatus?

Fixed. This was previously based on ICU text provider code usage.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:489
> > +    UText* uTextUCWithPriorContext = UTextOpenUTF16(&uTextUCWithPriorContextLocal, string, length, priorContext, priorContextLength, &uTextOpenStatus);
> 
> Why not just call this text?

Fixed.
Comment 35 Glenn Adams 2013-03-20 17:31:45 PDT
Thanks for the quick review! The following comments apply to an update to the current patch I will be submitting presently.

(In reply to comment #33)
> (From update of attachment 193924 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=193924&action=review
> 
> > Source/WebCore/ChangeLog:9
> > +        in non-ASCII path through nextBreakablePossition(). This enables
> 
> typo - nextBreakablePosition()

Fixed.

> 
> > Source/WebCore/ChangeLog:68
> > +        (WebCore::uTextWithPriorContextGetNativeLength):
> > +        (WebCore):
> > +        (WebCore::uTextWithPriorContextNativeLength):
> > +        (WebCore::uTextWithPriorContextClone):
> > +        (WebCore::uTextWithPriorContextExtract):
> > +        (WebCore::uTextWithPriorContextMapOffsetToNative):
> > +        (WebCore::uTextWithPriorContextMapNativeIndexToUTF16):
> > +        (WebCore::uTextWithPriorContextClose):
> > +        (WebCore::uTextGetContext):
> > +        (WebCore::uTextLatin1GetCurrentContext):
> > +        (WebCore::uTextLatin1MoveInPrimaryContext):
> > +        (WebCore::uTextLatin1SwitchToPrimaryContext):
> > +        (WebCore::uTextLatin1MoveInPriorContext):
> > +        (WebCore::uTextLatin1SwitchToPriorContext):
> > +        (WebCore::uTextInChunkOrOutOfRange):
> > +        (WebCore::uTextLatin1WithPriorContextAccess):
> > +        (WebCore::uTextWithPriorContextInit):
> > +        (WebCore::UTextOpenLatin1):
> > +        (WebCore::uTextUTF16GetCurrentContext):
> > +        (WebCore::uTextUTF16MoveInPrimaryContext):
> > +        (WebCore::uTextUTF16SwitchToPrimaryContext):
> > +        (WebCore::uTextUTF16MoveInPriorContext):
> > +        (WebCore::uTextUTF16SwitchToPriorContext):
> > +        (WebCore::uTextUTF16WithPriorContextAccess):
> 
> I don't think adding WithPriorContext to the various function names adds much.  Some of these functions are shared between the Latin1 and UTF16 UText processing and then there are the two Access functions.

Fixed: s/uTextWithPriorContext/text/ && s/uText/text/

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:92
> >      {
> 
> Shouldn't we ASSERT here that we have at least two characters of space in m_priorContext?

Fixed.

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:98
> >      {
> 
> Shouldn't we ASSERT here that we have at least two characters of space in m_priorContext?

Fixed.

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:104
> > +    {
> 
> Shouldn't we ASSERT here that we have at least two characters of context to reset, or iterate through from 0..TextBreakPriorContextCapacity-1?

Fixed.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:92
> > +    if (sizeToCopy > dest->sizeOfStruct)
> 
> When would src->sizeOfStruct != dest->sizeOfStruct?

I modeled this code on that found in ICU utext.cpp:shallowTextClone(), which hasn't changed between version 46 and the current version 50. I believe the intent of this code was to handle clones from a source to which fields had been added to a destination where some of those added fields were not present. I suspect it does not occur in our particular usage of text, but it doesn't hurt to keep IMO.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:94
> > +    memcpy(dest, src, sizeToCopy);
> 
> I prefer the element by element initialization, since there are special cases to handle in the destination.  For example, the addition of a new pointer in struct UText would end up being a problem with the memcpy.

Again, I pretty much used the ICU shallowTextClone() as a model here. In the prior implementation of uTextLatin1Clone, there were handling of fields that were specific to the Latin1 text provider usage. Here, I chose to use a generic approach as indicated in utext.cpp. In this (next) updated patch, I've added code to perform fixups on pointers in the destination text structure that originally pointed into either the source text structure's extra buffer or somewhere else in the source text structure.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:459
> > +    UTextWithBuffer uTextLatin1WithPriorContextLocal;
> > +    uTextLatin1WithPriorContextLocal.ut = emptyUText;
> > +    uTextLatin1WithPriorContextLocal.ut.extraSize = sizeof(uTextLatin1WithPriorContextLocal.buffer);
> > +    uTextLatin1WithPriorContextLocal.ut.pExtra = uTextLatin1WithPriorContextLocal.buffer;
> 
> Same comment about adding WithPriorContext.  I don't think it is needed.

Fixed.
Comment 36 Glenn Adams 2013-03-20 19:43:22 PDT
Created attachment 194170 [details]
Patch

Address comments from darin and msaboff.
Comment 37 WebKit Review Bot 2013-03-21 09:31:10 PDT
Comment on attachment 194170 [details]
Patch

Attachment 194170 [details] did not pass cr-linux-debug-ews (chromium-xvfb):
Output: http://webkit-commit-queue.appspot.com/results/17132595
Comment 38 Glenn Adams 2013-03-21 18:29:17 PDT
Created attachment 194409 [details]
Patch

Address non-ICU path; replace skipped expectations with rebaselines or requests for same.
Comment 39 Build Bot 2013-03-22 08:50:17 PDT
Comment on attachment 194409 [details]
Patch

Attachment 194409 [details] did not pass mac-ews (mac):
Output: http://webkit-commit-queue.appspot.com/results/17286170

New failing tests:
fast/sub-pixel/table-rtl-padding.html
Comment 40 Build Bot 2013-03-22 08:50:21 PDT
Created attachment 194558 [details]
Archive of layout-test-results from webkit-ews-02

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: webkit-ews-02  Port: <class 'webkitpy.common.config.ports.MacPort'>  Platform: Mac OS X 10.8.2
Comment 41 Glenn Adams 2013-03-26 19:02:19 PDT
Created attachment 195207 [details]
Patch

Address regression in fast/sub-pixel/table-rtl-padding; expect fails on mac/chromium for rebaselining.
Comment 42 WebKit Review Bot 2013-03-26 21:18:28 PDT
Comment on attachment 195207 [details]
Patch

Attachment 195207 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://webkit-commit-queue.appspot.com/results/17237572

New failing tests:
fast/text/international/003.html
fast/text/international/text-combine-image-test.html
fast/writing-mode/Kusa-Makura-background-canvas.html
Comment 43 WebKit Review Bot 2013-03-26 21:18:34 PDT
Created attachment 195218 [details]
Archive of layout-test-results from gce-cr-linux-06 for chromium-linux-x86_64

The attached test failures were seen while running run-webkit-tests on the chromium-ews.
Bot: gce-cr-linux-06  Port: chromium-linux-x86_64  Platform: Linux-3.3.8-gcg-201212281604-x86_64-with-GCEL-10.04-gcel_10.04
Comment 44 Build Bot 2013-03-26 22:12:08 PDT
Comment on attachment 195207 [details]
Patch

Attachment 195207 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-commit-queue.appspot.com/results/17286470

New failing tests:
fast/text/international/003.html
fast/text/international/text-combine-image-test.html
fast/writing-mode/Kusa-Makura-background-canvas.html
Comment 45 Build Bot 2013-03-26 22:12:13 PDT
Created attachment 195221 [details]
Archive of layout-test-results from webkit-ews-16 for mac-mountainlion-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: webkit-ews-16  Port: mac-mountainlion-wk2  Platform: Mac OS X 10.8.2
Comment 46 WebKit Review Bot 2013-03-26 22:20:40 PDT
Comment on attachment 195207 [details]
Patch

Attachment 195207 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://webkit-commit-queue.appspot.com/results/17251671

New failing tests:
fast/text/international/003.html
fast/text/international/text-combine-image-test.html
fast/writing-mode/Kusa-Makura-background-canvas.html
Comment 47 WebKit Review Bot 2013-03-26 22:20:47 PDT
Created attachment 195222 [details]
Archive of layout-test-results from gce-cr-linux-03 for chromium-linux-x86_64

The attached test failures were seen while running run-webkit-tests on the chromium-ews.
Bot: gce-cr-linux-03  Port: chromium-linux-x86_64  Platform: Linux-3.3.8-gcg-201212281604-x86_64-with-GCEL-10.04-gcel_10.04
Comment 48 Glenn Adams 2013-03-27 22:37:21 PDT
Comment on attachment 195207 [details]
Patch

marking as r- in order to terminate apparent endless retry loop on mac-ews test
Comment 49 Glenn Adams 2013-03-27 23:22:56 PDT
Created attachment 195488 [details]
Patch
Comment 50 Darin Adler 2013-04-02 10:35:07 PDT
Comment on attachment 195488 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=195488&action=review

Generally looks great.

review- because of the problem in acquireLineBreakIterator.

> Source/WebCore/ChangeLog:37
> +        (LazyLineBreakIterator):

Ideally, you should remove bogus lines like this that are added because of bugs in the prepare-ChangeLog script.

Even better, you could get someone to fix that script, or do it yourself!

> Source/WebCore/platform/text/TextBreakIterator.h:82
> +    unsigned getPriorContextLength() const

This function should be named priorContextLength in WebKit coding style. We reserve the word “get” for functions that return results in some other way.

> Source/WebCore/platform/text/TextBreakIterator.h:85
> +        ASSERT(sizeof(m_priorContext) / sizeof(m_priorContext[0]) >= 2);

These should be a COMPILE_ASSERT rather than just an ASSERT.

WTF has a macro for array length called WTF_ARRAY_LENGTH that should be used instead of writing out the math with two separate sizeofs.

This should be asserting that it’s == 2, not just >= 2, since this function’s implementation assumes there are exactly 2 elements. Generally speaking, all the functions seem specific to a 2-element, array, so the assertions don’t seem quite right. For example, this one looks only at two characters, and others assume that the last character is at index 1 and the second to last at index 0. So all these functions are assuming 2 characters of context, not >= 2 or >= 1.

> Source/WebCore/platform/text/TextBreakIterator.h:149
> +        // N.B. prior context is explicitly not reset here since it needs to be accumulated and retained
> +        // across multiple inline nodes, each of which cause this reset() to be invoked in order
> +        // to update the primary context string.

This comment makes it sound like we should consider renaming the reset() function for clarity about what it does and does not reset.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:87
> +        const void* pointerSource = static_cast<const void*>(source);

It’s strange to first cast these to void* and then to char*. I suggest a reinterpret_cast directly to char*. There’s no problem comparing a void* to a char*, and the number of casts in this function might be cut down considerably.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:129
> +    ASSERT(errorCode);

Seems a bit excessive to assert this. We don’t have to assert every pointer as non-null.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:130
> +    ASSERT_NOT_REACHED();

Might be worth commenting why this is correct.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:145
> +    return offset < INT32_MAX ? static_cast<int32_t>(offset) : 0;

I didn’t know that INT32_MAX existed, portably. Normally I’d use numeric_limits<int32_t>::max().

Why is it OK to return 0 if the offset is too big to fit into 32 bits? A comment should say why; it’s not obvious.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:187
> +    text->chunkLength = length < INT32_MAX ? static_cast<int32_t>(length) : 0;

Same comments as above.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:209
> +    text->chunkOffset = nativeIndex < INT32_MAX ? static_cast<int32_t>(nativeIndex) : 0;

Same thing again.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:314
> +    UText* text = utext_setup(&utWithBuffer->text, sizeof(utWithBuffer->buffer), status);
> +    if (U_SUCCESS(*status))
> +        textInit(text, &textLatin1Funcs, string, length, priorContext, priorContextLength);
> +    return text;

Is utext_setup guaranteed to return 0 if the status does not indicate success? WebKit style is normally early return for errors, so we’d write:

    if (U_FAILURE(*status)) {
        ASSERT(!text);
        return 0;
    }

And continue with the call to textInit, rather than putting the normal case inside an if statement.

> Source/WebCore/platform/text/TextBreakIteratorICU.cpp:419
> +    if (U_SUCCESS(*status))
> +        textInit(text, &textUTF16Funcs, string, length, priorContext, priorContextLength);
> +    return text;

Same comment as above.

> Source/WebCore/platform/text/wchar/TextBreakIteratorWchar.cpp:263
> +        Vector<UChar> utf16string(priorContextLength + length);

This isn’t a good name for the local variable. There’s an argument called string that is also UTF-16, so calling this utf16string does nothing to make clear why it’s different. Also, the capitalization does not match WebKit coding style, which requires a capital S. I would call this stringWithPrependedContext or some other sort name that makes clear what the difference is.

> Source/WebCore/platform/text/wchar/TextBreakIteratorWchar.cpp:268
> +        return acquireLineBreakIterator(utf16string.data(), utf16string.size(), locale, 0, 0);

Having the function call itself like this doesn’t seem optimal to me.

Is it OK to have a line break iterator that outlives the data it’s pointed to? When the local vector, utf16string, is deleted, why will the line break iterator still work properly? I think this is a major problem!
Comment 51 Darin Adler 2013-04-02 10:36:05 PDT
Comment on attachment 195488 [details]
Patch

I think it would be a good idea to break off the renaming and reformatting from this patch in TextBreakIterator.h and land it first, to make the real patch a bit smaller. Not required, but worthwhile.
Comment 52 Glenn Adams 2013-04-02 13:32:33 PDT
(In reply to comment #50)
> (From update of attachment 195488 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=195488&action=review
> > Source/WebCore/platform/text/wchar/TextBreakIteratorWchar.cpp:263
> > +        Vector<UChar> utf16string(priorContextLength + length);
> 
> This isn’t a good name for the local variable. There’s an argument called string that is also UTF-16, so calling this utf16string does nothing to make clear why it’s different. Also, the capitalization does not match WebKit coding style, which requires a capital S. I would call this stringWithPrependedContext or some other sort name that makes clear what the difference is.

I copied this usage from existing code in platform/text/TextBreakIterator.cpp:

TextBreakIterator* acquireLineBreakIterator(const LChar* string, int length, const AtomicString& locale)
{
    Vector<UChar> utf16string(length);
    for (int i = 0; i < length; ++i)
        utf16string[i] = string[i];
    return acquireLineBreakIterator(utf16string.data(), length, locale);
}

> 
> > Source/WebCore/platform/text/wchar/TextBreakIteratorWchar.cpp:268
> > +        return acquireLineBreakIterator(utf16string.data(), utf16string.size(), locale, 0, 0);
> 
> Having the function call itself like this doesn’t seem optimal to me.
> 
> Is it OK to have a line break iterator that outlives the data it’s pointed to? When the local vector, utf16string, is deleted, why will the line break iterator still work properly? I think this is a major problem!

Interestingly, that is also exactly what was done by the existing code I cite above.

I wonder if it is essential that I attempt to fix up this !USE(ICU_UNICODE) code path, particularly since I don't have such a platform to build/test on. WinCE perhaps? Would it be better to simply not attempt to fix up this code and to let those who depend on this code path to effect a fix?

Or I could simply add the new parameters (so that a build succeeds) and add an ASSERT if priorContextLength is non-zero thus forcing a debug time testing failure? What would you suggest?
Comment 53 Glenn Adams 2013-04-02 15:44:15 PDT
(In reply to comment #51)
> (From update of attachment 195488 [details])
> I think it would be a good idea to break off the renaming and reformatting from this patch in TextBreakIterator.h and land it first, to make the real patch a bit smaller. Not required, but worthwhile.

Done: https://bugs.webkit.org/show_bug.cgi?id=113823

Also, I separated out the new performance test: https://bugs.webkit.org/show_bug.cgi?id=113811
Comment 54 Glenn Adams 2013-04-02 20:17:38 PDT
Created attachment 196271 [details]
Patch
Comment 55 EFL EWS Bot 2013-04-02 20:22:44 PDT
Comment on attachment 196271 [details]
Patch

Attachment 196271 [details] did not pass efl-ews (efl):
Output: http://webkit-commit-queue.appspot.com/results/17388178
Comment 56 WebKit Review Bot 2013-04-02 20:30:53 PDT
Comment on attachment 196271 [details]
Patch

Attachment 196271 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://webkit-commit-queue.appspot.com/results/17323758
Comment 57 WebKit Review Bot 2013-04-02 20:32:28 PDT
Comment on attachment 196271 [details]
Patch

Attachment 196271 [details] did not pass cr-linux-debug-ews (chromium-xvfb):
Output: http://webkit-commit-queue.appspot.com/results/17246817
Comment 58 Glenn Adams 2013-04-02 20:36:52 PDT
Created attachment 196273 [details]
Patch
Comment 59 Glenn Adams 2013-04-02 22:00:40 PDT
Created attachment 196276 [details]
Patch
Comment 60 Glenn Adams 2013-04-02 22:56:19 PDT
Comment on attachment 196276 [details]
Patch

Addressed comment #50 issues.
Comment 61 Glenn Adams 2013-04-02 23:05:12 PDT
(In reply to comment #50)
> (From update of attachment 195488 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=195488&action=review

Addressed in https://bugs.webkit.org/attachment.cgi?id=196276

> > Source/WebCore/ChangeLog:37
> > +        (LazyLineBreakIterator):
> 
> Ideally, you should remove bogus lines like this that are added because of bugs in the prepare-ChangeLog script.

Fixed.

> > Source/WebCore/platform/text/TextBreakIterator.h:82
> > +    unsigned getPriorContextLength() const
> This function should be named priorContextLength in WebKit coding style. We reserve the word “get” for functions that return results in some other way.

Fixed in https://bugs.webkit.org/show_bug.cgi?id=113823.

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:85
> > +        ASSERT(sizeof(m_priorContext) / sizeof(m_priorContext[0]) >= 2);
> 
> These should be a COMPILE_ASSERT rather than just an ASSERT.
> 
> WTF has a macro for array length called WTF_ARRAY_LENGTH that should be used instead of writing out the math with two separate sizeofs.
> 
> This should be asserting that it’s == 2, not just >= 2, since this function’s implementation assumes there are exactly 2 elements. Generally speaking, all the functions seem specific to a 2-element, array, so the assertions don’t seem quite right. For example, this one looks only at two characters, and others assume that the last character is at index 1 and the second to last at index 0. So all these functions are assuming 2 characters of context, not >= 2 or >= 1.

Fixed in https://bugs.webkit.org/show_bug.cgi?id=113823.

> 
> > Source/WebCore/platform/text/TextBreakIterator.h:149
> > +        // N.B. prior context is explicitly not reset here since it needs to be accumulated and retained
> > +        // across multiple inline nodes, each of which cause this reset() to be invoked in order
> > +        // to update the primary context string.
> 
> This comment makes it sound like we should consider renaming the reset() function for clarity about what it does and does not reset.

Fixed in https://bugs.webkit.org/show_bug.cgi?id=113823.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:87
> > +        const void* pointerSource = static_cast<const void*>(source);
> 
> It’s strange to first cast these to void* and then to char*. I suggest a reinterpret_cast directly to char*. There’s no problem comparing a void* to a char*, and the number of casts in this function might be cut down considerably.

Fixed.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:129
> > +    ASSERT(errorCode);
> 
> Seems a bit excessive to assert this. We don’t have to assert every pointer as non-null.

Fixed.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:130
> > +    ASSERT_NOT_REACHED();
> 
> Might be worth commenting why this is correct.

Added comment.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:145
> > +    return offset < INT32_MAX ? static_cast<int32_t>(offset) : 0;
> 
> I didn’t know that INT32_MAX existed, portably. Normally I’d use numeric_limits<int32_t>::max().
> 
> Why is it OK to return 0 if the offset is too big to fit into 32 bits? A comment should say why; it’s not obvious.

Fixed. Use max() and added comments.

> 
> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:187
> > +    text->chunkLength = length < INT32_MAX ? static_cast<int32_t>(length) : 0;
> 
> Same comments as above.
> 

Fixed. Use max() and added comments.

> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:209
> > +    text->chunkOffset = nativeIndex < INT32_MAX ? static_cast<int32_t>(nativeIndex) : 0;
> 
> Same thing again.
> 

Fixed. Use max() and added comments.

> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:314
> > +    UText* text = utext_setup(&utWithBuffer->text, sizeof(utWithBuffer->buffer), status);
> > +    if (U_SUCCESS(*status))
> > +        textInit(text, &textLatin1Funcs, string, length, priorContext, priorContextLength);
> > +    return text;
> 
> Is utext_setup guaranteed to return 0 if the status does not indicate success? WebKit style is normally early return for errors, so we’d write:
> 
>     if (U_FAILURE(*status)) {
>         ASSERT(!text);
>         return 0;
>     }
> 
> And continue with the call to textInit, rather than putting the normal case inside an if statement.
> 

Fixed.

> > Source/WebCore/platform/text/TextBreakIteratorICU.cpp:419
> > +    if (U_SUCCESS(*status))
> > +        textInit(text, &textUTF16Funcs, string, length, priorContext, priorContextLength);
> > +    return text;
> 
> Same comment as above.
> 

Fixed.

> > Source/WebCore/platform/text/wchar/TextBreakIteratorWchar.cpp:263
> > +        Vector<UChar> utf16string(priorContextLength + length);
> 
> This isn’t a good name for the local variable. There’s an argument called string that is also UTF-16, so calling this utf16string does nothing to make clear why it’s different. Also, the capitalization does not match WebKit coding style, which requires a capital S. I would call this stringWithPrependedContext or some other sort name that makes clear what the difference is.
>

Removed code. See also comment #52.

> > Source/WebCore/platform/text/wchar/TextBreakIteratorWchar.cpp:268
> > +        return acquireLineBreakIterator(utf16string.data(), utf16string.size(), locale, 0, 0);
> 
> Having the function call itself like this doesn’t seem optimal to me.

No longer applicable, but FYI, modern compilers convert this to iteration by tail call optimization. [It's a programming paradigm I often used in Lisp back in the "good ole days" :)]

> 
> Is it OK to have a line break iterator that outlives the data it’s pointed to? When the local vector, utf16string, is deleted, why will the line break iterator still work properly? I think this is a major problem!

Replaced code with asserts since I really can't build/test this anyway. See also comment #52. If you suggest another approach, let me know.
Comment 62 Glenn Adams 2013-04-03 02:28:10 PDT
(In reply to comment #61)
> (In reply to comment #50)
> > > Source/WebCore/platform/text/TextBreakIterator.h:82
> > > +    unsigned getPriorContextLength() const
> > This function should be named priorContextLength in WebKit coding style. We reserve the word “get” for functions that return results in some other way.
> 
> Fixed in https://bugs.webkit.org/show_bug.cgi?id=113823.

Correction. Fixed (in current patch to this bug).
Comment 63 WebKit Review Bot 2013-04-03 14:52:18 PDT
Comment on attachment 196276 [details]
Patch

Clearing flags on attachment: 196276

Committed r147588: <http://trac.webkit.org/changeset/147588>
Comment 64 WebKit Review Bot 2013-04-03 14:52:28 PDT
All reviewed patches have been landed.  Closing bug.
Comment 65 Benjamin Poulain 2013-04-03 16:03:23 PDT
This caused 3 failures on the bots. Can you have a look? (rebaseline needed?)
Comment 66 Glenn Adams 2013-04-03 16:06:07 PDT
(In reply to comment #65)
> This caused 3 failures on the bots. Can you have a look? (rebaseline needed?)

Those were expected as I marked in the ChangeLog:

   Expect rebaselines will be needed due to different line breaks:
              fast/text/international/003.html
              fast/text/international/text-combine-image-test-expected.html
              fast/text/writing-mode/Kusa-Makura-background-canvas.html

I'm waiting for bots to catch up so I can rebaseline with garden-o-magic. This will be my first attempt to use it.
Comment 67 Simon Fraser (smfr) 2013-04-03 22:25:04 PDT
Are these diffs expected?
http://build.webkit.org/results/Apple%20MountainLion%20Release%20WK1%20(Tests)/r147588%20(8648)/retries/fast/text/international/003-diffs.html

The new results look rather different.
Comment 68 Glenn Adams 2013-04-04 00:45:23 PDT
(In reply to comment #67)
> Are these diffs expected?
> http://build.webkit.org/results/Apple%20MountainLion%20Release%20WK1%20(Tests)/r147588%20(8648)/retries/fast/text/international/003-diffs.html
> 
> The new results look rather different.

Yes, they were. WIth this patch, there are new line break opportunities that were not previously considered, particularly at element boundaries. The test file you cite and two others have been rebaselined to accommodate the different line breaks. I just committed the third of three rebaseline sets:

[1] http://trac.webkit.org/changeset/147599 [Chromium]
[2] http://trac.webkit.org/changeset/147601 [GTK]
[3] http://trac.webkit.org/changeset/147613 [Mac]

if I missed any, please let me know or go ahead with a rebaseline.
Comment 69 mitz 2013-04-16 21:26:10 PDT
(In reply to comment #63)
> (From update of attachment 196276 [details])
> Clearing flags on attachment: 196276
> 
> Committed r147588: <http://trac.webkit.org/changeset/147588>

This change caused bug 114721.
Comment 70 Mark Salisbury 2013-06-06 14:27:17 PDT
(In reply to comment #52)

> I wonder if it is essential that I attempt to fix up this !USE(ICU_UNICODE) code path, particularly since I don't have such a platform to build/test on. WinCE perhaps? Would it be better to simply not attempt to fix up this code and to let those who depend on this code path to effect a fix?

I'm planning to submit a patch for this.  I've opened https://bugs.webkit.org/show_bug.cgi?id=117320.
Comment 71 Eric Seidel (no email) 2013-06-24 17:17:07 PDT
FYI, Blink also took this patch and found it to be a 4% regression in PerformanceTests/Layout/hindi-line-layout.html

https://code.google.com/p/chromium/issues/detail?id=252050

We're going to keep the patch, but it's also sad to make are already slow complex text path slower. :(
http://stackoverflow.com/questions/16554205/google-chrome-text-rendering-for-non-latin-languages-is-too-slow
Comment 72 Antoine Quint 2013-10-23 06:38:17 PDT
(In reply to comment #63)
> (From update of attachment 196276 [details])
> Clearing flags on attachment: 196276
> 
> Committed r147588: <http://trac.webkit.org/changeset/147588>

This change apparently caused a regression, see https://bugs.webkit.org/show_bug.cgi?id=123204.