Bug 17427 - Line breaking opportunities at the end of a text node are missed
Summary: Line breaking opportunities at the end of a text node are missed
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text (show other bugs)
Version: 528+ (Nightly build)
Hardware: All OS X 10.5
: P2 Normal
Assignee: Glenn Adams
URL:
Keywords: InRadar
Depends on:
Blocks: 110468 105692 110322
  Show dependency treegraph
 
Reported: 2008-02-18 11:54 PST by mitz
Modified: 2013-03-10 20:33 PDT (History)
21 users (show)

See Also:


Attachments
Test case (226 bytes, text/html)
2008-02-18 11:55 PST, mitz
no flags Details
Proposed patch (6.73 KB, patch)
2012-09-16 17:25 PDT, Alex Henrie
alexhenrie24: review-
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
Proposed patch v2 (10.22 KB, patch)
2012-09-18 20:45 PDT, Alex Henrie
darin: review-
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
Proposed patch v3 (20.64 KB, patch)
2012-09-24 21:02 PDT, Alex Henrie
no flags Details | Formatted Diff | Diff
Proposed patch v4 (11.87 KB, patch)
2012-09-24 21:28 PDT, Alex Henrie
no flags Details | Formatted Diff | Diff
Proposed patch v5 (11.80 KB, patch)
2012-09-24 21:38 PDT, Alex Henrie
buildbot: commit-queue-
Details | Formatted Diff | Diff
Proposed patch v6 (11.66 KB, patch)
2012-11-04 22:07 PST, Alex Henrie
no flags Details | Formatted Diff | Diff
Patch (8.39 KB, patch)
2013-02-24 15:13 PST, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch (8.39 KB, patch)
2013-02-25 07:30 PST, Glenn Adams
no flags Details | Formatted Diff | Diff
perf results - attachment 190055 (26.90 KB, text/html)
2013-03-06 08:01 PST, Glenn Adams
no flags Details
Patch (8.97 KB, patch)
2013-03-10 15:35 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch (11.88 KB, patch)
2013-03-10 19:25 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch for landing (11.82 KB, patch)
2013-03-10 19:54 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch for landing (11.82 KB, patch)
2013-03-10 19:59 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff
Patch for landing (11.84 KB, patch)
2013-03-10 20:03 PDT, Glenn Adams
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description mitz 2008-02-18 11:54:45 PST
A line breaking opportunity at the end of a text node is ignored. See the attached test case for example (there is a line-breaking opportunity after a question mark).
Comment 1 mitz 2008-02-18 11:55:26 PST
Created attachment 19193 [details]
Test case
Comment 2 Xianzhu Wang 2011-09-06 18:49:34 PDT
For now the line breaking iterator can only see the text in the current element.  
The solution might be also passing the text of the previous element to the line breaking iterator, if the previous element is in the same paragraph of the current element. For ICU, we can avoid the extra string allocation by passing the texts in UText. Solution to bug 67022 will also use UText.
Comment 3 Darin Adler 2011-09-07 14:24:27 PDT
If we do keep a UText object around, it should be done through an abstraction that would let other platforms do the same kind of thing with whatever context object they would use.
Comment 4 Xianzhu Wang 2011-09-07 18:50:09 PDT
I think at WebKit side, we can change the interfaces of TextBreakIterator by adding a 'context' parameter besides the original text parameter for the xxxxIterator() functions. In TextBreakIteratorICU.cpp we can combine the two parameters into one UText and pass to ICU. For other ports use their own proper methods if possible, or just allocate a new string to copy the context and the text into it, or maybe just ignore the context.
Comment 5 mitz 2012-05-18 22:19:46 PDT
<rdar://problem/11489433>
Comment 6 Alex Henrie 2012-09-16 17:25:33 PDT
Created attachment 164327 [details]
Proposed patch

I've written a patch that fixes this bug. It determines whether or not a line break is possible immediately before a text node by running the line-breaking algorithm on a temporary string made from the first two UChars of that text node and the last two UChars of the preceding text node. I chose this approach over adding a 'context' parameter to TextBreakIterator for two reasons: First, I could not figure out how to combine two UChar* strings into a UText without allocating a new string and copying them both into it, and second, because it would have required much more intrusive changes for a dubious benefit.

I noticed that my patch causes LayoutTests/fast/text/international/003.html and LayoutTests/fast/text/international/text-combine-image-test.html to "fail". I believe that my patch actually makes the rendering of 003.html more correct because it correctly allows a line break between the number 20 and the character 日. As far as text-combine-image-test.html, I do not believe that this page renders correctly with or without the patch because the text should render in a single vertical column, the same as if it had no markup. For what it's worth, I think the page looks a little nicer with the patch than without.

Even if you think there is a better approach, could this patch be accepted as an interim solution? This bug needs to be fixed to make the best use of limited screen space on mobile devices, for consistency with other web browsers and applications in general, and for web applications such as [1] that require line breaking to be consistent.

[1] http://meta.wikimedia.org/wiki/User:Remember_the_dot/Syntax_highlighter
Comment 7 WebKit Review Bot 2012-09-16 21:16:08 PDT
Comment on attachment 164327 [details]
Proposed patch

Attachment 164327 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/13874271

New failing tests:
fast/text/international/003.html
fast/text/line-break-before-text-node.html
fast/css/word-spacing-characters.html
fast/writing-mode/Kusa-Makura-background-canvas.html
Comment 8 Ryosuke Niwa 2012-09-17 14:07:14 PDT
Comment on attachment 164327 [details]
Proposed patch

Why are we using 4 characters?
Comment 9 Darin Adler 2012-09-17 17:44:30 PDT
Comment on attachment 164327 [details]
Proposed patch

View in context: https://bugs.webkit.org/attachment.cgi?id=164327&action=review

> Source/WebCore/ChangeLog:13
> +        Adding semantic markup to text should not affect the line-breaking
> +        algorithm. The standard line-breaking algorithm is defined in Unicode
> +        Annex 14 (UAX14) and strict adherence to it is necessary to make the
> +        best use of limited screen space on mobile devices, for consistency
> +        with other web browsers and applications in general, and for web
> +        applications that require line breaking to be consistent.

This paragraph is not needed in ChangeLog.

> Source/WebCore/ChangeLog:18
> +        This patch determines whether or not a line break is possible
> +        immediately before a text node by running the line-breaking algorithm on
> +        a temporary string made from the first two UChars of that text node and
> +        the last two UChars of the preceding text node.

Why are two characters before and two character after sufficient?
Comment 10 Alex Henrie 2012-09-17 20:24:14 PDT
Comment on attachment 164327 [details]
Proposed patch

Okay, a few notes. text-combine-image-test.html is the test with the string "20日". I looked at this test again and without any markup there is no line break, so the line break my patch causes must be incorrect. However, I wasn't able to investigate it further because I spent all my free time today trying to recompile WebKit and when I run the GTK launcher all I get is a segmentation fault. GDB reports:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff4645a91 in JSC::ArrayConstructor::finishCreation(JSC::ExecState*, JSC::ArrayPrototype*) ()
   from /home/alex/webkit/WebKitBuild/Release/.libs/libjavascriptcoregtk-3.0.so.0

I used four characters because I thought that that was always going to be enough context to determine line breaking, and in everything I've tested, it is. However, reading UAX14 more closely it looks like Southeast Asian languages (such as Thai) require dictionary-based line-breaking algorithms, which would require more context. Do you know of a way to provide this context to the line-breaking algorithm without allocating a new string and copying the contents of the previous and current text nodes into it? Will Thai require more context than just the first previous text node? Any tips on how to fix the segfault?
Comment 11 Ryosuke Niwa 2012-09-17 20:34:32 PDT
> Do you know of a way to provide this context to the line-breaking algorithm without allocating a new string and copying the contents of the previous and current text nodes into it?

Use UText as suggested in comments #2-#4?

> Will Thai require more context than just the first previous text node?

It appears that line breaking in Thai requires dictionary lookups so I'd expect it to require much more than just few characters to determine the line breaking.
Comment 12 Alex Henrie 2012-09-17 20:41:11 PDT
Do you know the syntax for creating a UText object without allocating a new string? I could not find it anywhere.
Comment 13 Alex Henrie 2012-09-18 20:45:28 PDT
Created attachment 164653 [details]
Proposed patch v2

Okay, here is version 2 of the patch (the segfault seems to have been fixed in the latest SVN revision). I added a test for Thai line breaking this time around. Also, I got around the failure on fast/text/line-break-before-text-node.html by not running the new code if -webkit-text-combine is present.

I know that the string manipulation in this patch is highly inefficient; help would really be appreciated. [1] and [2] talk about creating UText objects from discontinuous strings but it is not clear whether this feature is anything more than theoretical. Any other information about how you want the solution to look would also be helpful.

[1] http://userguide.icu-project.org/strings/utext
[2] http://icu-project.org/apiref/icu4c/utext_8h.html
Comment 14 WebKit Review Bot 2012-09-18 20:48:36 PDT
Attachment 164653 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'LayoutTests/ChangeLog', u'LayoutTests/fast..." exit_code: 1
Source/WebCore/rendering/RenderBlockLineLayout.cpp:2445:  This { should be at the end of the previous line  [whitespace/braces] [4]
Source/WebCore/rendering/RenderBlockLineLayout.cpp:2450:  More than one command on the same line in if  [whitespace/parens] [4]
Source/WebCore/rendering/RenderBlockLineLayout.cpp:2451:  More than one command on the same line in if  [whitespace/parens] [4]
Source/WebCore/rendering/RenderBlockLineLayout.cpp:2460:  More than one command on the same line in if  [whitespace/parens] [4]
Source/WebCore/rendering/RenderBlockLineLayout.cpp:2461:  More than one command on the same line in if  [whitespace/parens] [4]
Total errors found: 5 in 7 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 15 WebKit Review Bot 2012-09-19 03:27:06 PDT
Comment on attachment 164653 [details]
Proposed patch v2

Attachment 164653 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/13890488

New failing tests:
fast/forms/access-key-for-all-elements.html
css3/filters/custom/custom-filter-property-computed-style.html
fast/canvas/webgl/data-view-test.html
fast/loader/unload-form-post-about-blank.html
fast/forms/text-control-intrinsic-widths.html
fast/hidpi/video-controls-in-hidpi.html
editing/selection/extend-selection-home-end.html
editing/selection/home-end.html
editing/selection/extend-selection-character.html
css3/filters/custom/custom-filter-property-parsing.html
fast/js/comparison-operators-less.html
css3/filters/filter-property-parsing.html
fast/js/comparison-operators-greater.html
fast/js/comparison-operators.html
fast/canvas/webgl/array-unit-tests.html
editing/selection/programmatic-selection-on-mac-is-directionless.html
fast/canvas/webgl/array-message-passing.html
css3/filters/filter-property-computed-style.html
fast/canvas/webgl/object-deletion-behaviour.html
fast/dom/Window/window-lookup-precedence.html
fast/canvas/webgl/type-conversion-test.html
fast/js/array-enumerators-functions.html
fast/js/kde/encode_decode_uri.html
fast/js/kde/Number.html
http/tests/xmlhttprequest/web-apps/001.html
fast/dom/wrapper-classes.html
fast/css/word-spacing-characters.html
fast/canvas/webgl/tex-image-and-sub-image-2d-with-array-buffer-view.html
fast/forms/option-strip-unicode-spaces.html
fast/loader/local-CSS-from-local.html
Comment 16 Darin Adler 2012-09-19 10:00:40 PDT
Comment on attachment 164653 [details]
Proposed patch v2

View in context: https://bugs.webkit.org/attachment.cgi?id=164653&action=review

Please fix the style complaints from the style checking. They are all correctly expressing WebKit coding style.

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:2446
> +                    // Thai and other Southeast Asian languages use dictionary-based line breaking algorithms that require a lot of context

Please add a period to this sentence-style comment.

The comment is a bit mysterious. This says “a lot of context”, but does not explain how much is enough.

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:2447
> +                    String textWithContext;

Just using a StringBuilder instead of a String will be significantly more efficient. But perhaps we can do even better.

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:2468
> +                        current.m_nextBreakablePosition = 0; // Yes, a line break may occur immediately before

I find this comment confusing. Maybe you could rewrite it to be a bit more explicit?
Comment 17 Darin Adler 2012-09-19 10:01:35 PDT
(In reply to comment #13)
> [1] and [2] talk about creating UText objects from discontinuous strings but it is not clear whether this feature is anything more than theoretical.

It is not theoretical. In fact Michael Saboff is doing something interesting with this to make Latin-1 work with ICU without making UTF-16 copies of the strings.
Comment 18 Alex Henrie 2012-09-24 21:02:32 PDT
Created attachment 165510 [details]
Proposed patch v3

Try 3. I used a StringBuilder this time, and made the code only concatenate the context together once for an entire set of runs. This works quite nicely. I tried a few different ways and finally settled on using a hash table to determine where each text node lies in the context string. I'll email Michael Saboff to ask what would have to be done to change this code to use UText instead of StringBuilder, but I think this patch would be fine as-is.
Comment 19 Alex Henrie 2012-09-24 21:28:38 PDT
Created attachment 165512 [details]
Proposed patch v4

Apologies, the tests were repeated multiple times in the last patch. The patch is now fixed.
Comment 20 Alex Henrie 2012-09-24 21:38:05 PDT
Created attachment 165516 [details]
Proposed patch v5

I really apologize. Try 4 still had a duplicate <meta> tag on the Thai test. I will definitely be more careful to make sure I don't accidentally duplicate the tests again in any future patches.
Comment 21 Build Bot 2012-09-24 22:29:30 PDT
Comment on attachment 165516 [details]
Proposed patch v5

Attachment 165516 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/13992883

New failing tests:
fast/text/international/003.html
fast/text/line-break-between-text-nodes.html
fast/text/line-break-in-marked-up-thai.html
Comment 22 WebKit Review Bot 2012-09-25 07:55:58 PDT
Comment on attachment 165516 [details]
Proposed patch v5

Attachment 165516 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/14021139

New failing tests:
fast/text/international/003.html
fast/text/line-break-between-text-nodes.html
fast/writing-mode/Kusa-Makura-background-canvas.html
fast/text/line-break-in-marked-up-thai.html
Comment 23 Michael Saboff 2012-09-25 09:28:03 PDT
(In reply to comment #17)
> (In reply to comment #13)
> > [1] and [2] talk about creating UText objects from discontinuous strings but it is not clear whether this feature is anything more than theoretical.
> 
> It is not theoretical. In fact Michael Saboff is doing something interesting with this to make Latin-1 work with ICU without making UTF-16 copies of the strings.

You can write your own UText "object", basically a UText struct with function pointers to "virtual" functions for your UText. See http://icu-project.org/apiref/icu4c/structUText.html for some details.  You can also look at my patch out for review at https://bugs.webkit.org/attachment.cgi?id=165221&action=review.  I do copy chunks because I'm starting with 8 bit source data and the ICU breaking code working with UText expects UChar data.

For read only uses, you don't need all the UText functions.  For breaking, I think you'll need clone (shallow only I think), nativeLength, access, extract, mapOffsetToNative, mapNativeIndexToUTF16 and close.  Your functions need to be set in pFuncs along with all the other UText fields, likely in your own makeUTextForCombinedRenderText method. Given that you have data across multiple elements, your access function will need to map in the appropriate element (RenderText) and it's offsets into the UText structure as part of access processing.  Using this method you shouldn't need to copy RenderText data.  THis method will only work on platforms that have version 4 or later of ICU, IIRC.

Hope this helps.

You changes and my changes are touching similar code.  I plan on changing  RenderText so that the m_text String can contain 8 bit data.
Comment 24 Alex Henrie 2012-09-25 16:22:38 PDT
Michael, thanks for the explanation. I think I understand how your patch works. So my next question is, how would we make the the line-breaking functions take a UText if available and a String if not? Would it require changes to all platforms?

Also, given that UText is not available on all platforms, and what we want to do with it requires ICU 4 or later, would you be okay with using the StringBuilder-based implementation to fix the rendering bugs now and adding optimizations for ICU platforms afterwards in a separate patch?
Comment 25 Michael Saboff 2012-09-26 09:48:21 PDT
(In reply to comment #24)
> Michael, thanks for the explanation. I think I understand how your patch works. So my next question is, how would we make the the line-breaking functions take a UText if available and a String if not? Would it require changes to all platforms?
> 
> Also, given that UText is not available on all platforms, and what we want to do with it requires ICU 4 or later, would you be okay with using the StringBuilder-based implementation to fix the rendering bugs now and adding optimizations for ICU platforms afterwards in a separate patch?

I think all the platforms that use ICU are at ICU 4 or greater.  We just updated Mac to ICU 4.6.  I think Chromium is already at 4.6 or higher.  My proposed patch that used UText built on all platforms, although those changes are in Source/WebCore/platform/text/TextBreakIteratorICU.cpp.  I would suggest that you make your changes in an existing or new "ICU" file.

No sense to land a StringBuilder fix when we can make things better for ICU capable platforms.
Comment 26 Alex Henrie 2012-09-26 18:25:29 PDT
I agree that optimizations for ICU platforms would be nice. A few other considerations also come to mind. First, fixing rendering bugs (which are causing problems in real-world applications, see comments #6 and #11) is generally more important than platform-specific optimizations. Second, I am a volunteer: I have limited time, and I only have access to Linux machines. I could probably come up with a UText solution but I would need help. Third, the line-breaking functions in TextBreakIterator.h currently expect a UChar*, not a UText:

TextBreakIterator* acquireLineBreakIterator(const UChar*, int length, const AtomicString& locale);
void LazyLineBreakIterator::reset(const UChar* string, int length, const AtomicString& locale);

How would you suggest changing these functions so that they accept a UText when available and a String or UChar* when not? Won't any change here affect all platforms? If it affects all platforms, how can I test my changes?

Thanks for the help.
Comment 27 Michael Saboff 2012-09-27 11:48:23 PDT
(In reply to comment #26)
> I agree that optimizations for ICU platforms would be nice. A few other considerations also come to mind. First, fixing rendering bugs (which are causing problems in real-world applications, see comments #6 and #11) is generally more important than platform-specific optimizations. Second, I am a volunteer: I have limited time, and I only have access to Linux machines. I could probably come up with a UText solution but I would need help. Third, the line-breaking functions in TextBreakIterator.h currently expect a UChar*, not a UText:
> 
> TextBreakIterator* acquireLineBreakIterator(const UChar*, int length, const AtomicString& locale);
> void LazyLineBreakIterator::reset(const UChar* string, int length, const AtomicString& locale);
> 
> How would you suggest changing these functions so that they accept a UText when available and a String or UChar* when not? Won't any change here affect all platforms? If it affects all platforms, how can I test my changes?
> 
> Thanks for the help.

I think the best way to do this is to create a new acquireLineBreakIterator() and LazyLineBreakIterator::reset() that take the list of RenderText's or TextNode's that contain all of the strings.  The implementation of those new methods could be in a ICU file.  It may be sufficient to provide this implementation and leave other implementations to other platforms.  You might need to #if the definitions.
Comment 28 Alex Henrie 2012-10-01 21:14:31 PDT
layoutRunsAndFloatsInRange calls acquireLineBreakIterator once for each RenderText object in the set, using the locale of the current RenderText object to configure any special line-breaking behavior. With ICU, acquireLineBreakIterator would work best if given a precomputed mapping from indices to RenderText objects. Without ICU, acquireLineBreakIterator would work best if given a precomputed context string. In both cases, layoutRunsAndFloatsInRange also needs a mapping from RenderText objects to indices.

So, to optimize for both cases, we would need to write code for both in layoutRunsAndFloatsInRange and use #ifdef WTF_USE_ICU_UNICODE to switch between the two. Likewise, #ifdef WTF_USE_ICU_UNICODE would add to TextBreakIterator.h a function declaration for an acquireLineBreakIterator that takes (instead of a string) mappings between indices and RenderText objects. The same goes for struct RenderTextInfo, which would have m_textWithContext on non-ICU platforms and the mapping of indices to RenderText objects on ICU.

If this explanation doesn't make sense at first, look at what the StringBuilder patch has to do and think about how it would have to be different to optimize for ICU.

Can you think of a better solution?
Comment 29 Darin Adler 2012-10-02 09:54:02 PDT
(In reply to comment #28)
> With ICU, acquireLineBreakIterator would work best if given a precomputed mapping from indices to RenderText objects. Without ICU, acquireLineBreakIterator would work best if given a precomputed context string.

Can the non-ICU acquireLineBreakIterator function compute and cache the context string? I don’t understand why the platform independent code needs to build the string, just in case the line break code needs to consolidate everything into a single string.
Comment 30 Alex Henrie 2012-10-02 21:44:31 PDT
How would acquireLineBreakIterator know when the list of RenderText objects or their contents has changed? If it takes a pointer to the list, there is no guarantee that the data at the pointer's address is still the same. I suppose we could add a boolean parameter to acquireLineBreakIterator to tell it when to recompute the context string. What did you have in mind?
Comment 31 Darin Adler 2012-10-03 09:07:59 PDT
(In reply to comment #30)
> How would acquireLineBreakIterator know when the list of RenderText objects or their contents has changed? If it takes a pointer to the list, there is no guarantee that the data at the pointer's address is still the same. I suppose we could add a boolean parameter to acquireLineBreakIterator to tell it when to recompute the context string. What did you have in mind?

I didn’t have anything specific in mind, but I suspect we can come up with something. The point is to look for a way to efficiently abstract away these differences in interface rather than putting #if into the platform-independent code.
Comment 32 Alex Henrie 2012-10-03 21:43:25 PDT
I want to see this bug fixed and I've written a cross-platform solution, but I don't see a nice clean way to add ICU optimizations. The WebKit architects need to either agree on a messier solution (so that the solution can be coded up) or leave ICU optimization for later.
Comment 33 Michael Saboff 2012-10-04 13:03:40 PDT
(In reply to comment #32)
> I want to see this bug fixed and I've written a cross-platform solution, but I don't see a nice clean way to add ICU optimizations. The WebKit architects need to either agree on a messier solution (so that the solution can be coded up) or leave ICU optimization for later.

I think the best way to make a cross-platform solution is to create an API that can be used for both an ICU solution and a non-ICU solution.  Put this definition in a common .h file, like TextBreakIterator.h.  Then put  the corresponding implementation in 2 or more platform specific .cpp files.  This new API will likely need to walk through the nodes and therefore need a list of nodes.

For the ICU version, you create the UText routines that walk through the node list and then you use your implementation for non-ICU platforms.
Comment 34 Alex Henrie 2012-10-07 22:32:55 PDT
I thought of a worse problem: If acquireLineBreakIterator takes a list of RenderText objects and on some platforms needs to generate a reusable string, where would that string be stored in memory?

It seems like the most logical way to solve this problem would be to rewrite all the line-breaking functions to be part of a line-breaker class, and then the class could store whatever variables it needs to maintain its state. However, this would amount to a complete redesign of the line-breaking API on all platforms, and I am unable to test any platform but Linux.

I could really use some specific help designing and implementing a solution that makes everyone happy.
Comment 35 Michael Saboff 2012-10-08 16:56:47 PDT
(In reply to comment #34)
> I thought of a worse problem: If acquireLineBreakIterator takes a list of RenderText objects and on some platforms needs to generate a reusable string, where would that string be stored in memory?

Since TextBreakIterator is really just a pointer type and internal objects are cast to and from it, you could create your own type on those platforms that need a string.

> It seems like the most logical way to solve this problem would be to rewrite all the line-breaking functions to be part of a line-breaker class, and then the class could store whatever variables it needs to maintain its state. However, this would amount to a complete redesign of the line-breaking API on all platforms, and I am unable to test any platform but Linux.
> 
> I could really use some specific help designing and implementing a solution that makes everyone happy.
Comment 36 Alex Henrie 2012-10-09 22:05:39 PDT
Changing the non-ICU platforms may not be as daunting a task as I thought. Looking through Source/WebCore/platform/text, it seems that only WinCE and possibly GTK do not use ICU. Is GTK ever used with ICU disabled? If not then only ICU and WinCE need to be supported.

I think we're getting closer to figuring out how to code this. Regardless, I still don't have a way to test changes to WinCE. Any ideas?
Comment 37 Alex Henrie 2012-11-04 22:07:41 PST
Created attachment 172268 [details]
Proposed patch v6

I've spent more time on this and, unfortunately, have not come up with anything workable. The entire system, from acquireLineBreakIterator to LazyLineBreakIterator to nextBreakablePosition, is designed to work with a single string. If you wrap the nodes in a UText and take away the ability to see a single string then multiple problems arise, the most obvious being that there is no way to have the algorithm treat NBSP as a simple space. If you try to fix this problem by passing around the list of nodes as well as the iterator over them, you run into the difficulty of having to find a way to remember and tell all these functions which node and which index within that node they are on, as well as the difficulty of rewriting everything to use a list instead of a string.

Furthermore, writing the necessary functions for pFuncs is nontrivial even with Michael's code as an example. In the implementation of UTextAccess, if the last character of a chunk is requested "going forward" then a new, temporary chunk must be created from the end of the current chunk and the start of the next chunk. According to the ICU documentation, this new chunk MUST "begin and end on code point boundaries" and "a single code point comprised of multiple storage units must never span a chunk boundary." The same complication arises when the first character of a chunk is requested "going backward".

I also learned recently that in addition to the previously listed problems that this bug is causing, the combination of this bug and bug 76131 causes incorrect rendering of Google Translate's yellow highlighting if, at a non-space line break in the textbox, a newline is added and then deleted.

I've rebased the StringBuilder patch against the current tree. I realize that a more efficient ICU solution is theoretically possible, but due to the massive complexity of such a solution and lack of willing developers I just don't see it happening. StringBuilder is designed to minimize memory reallocations, so the performance shouldn't be terrible. Performance might be further improved by changing LazyLineBreakIterator to take a pointer to the string instead of recopying the string.

If there is someone out there who can come up with something better that works, please come forward...
Comment 38 WebKit Review Bot 2012-11-04 22:08:22 PST
Comment on attachment 172268 [details]
Proposed patch v6

Rejecting attachment 172268 [details] from review queue.

alexhenrie24@gmail.com does not have reviewer permissions according to http://trac.webkit.org/browser/trunk/Tools/Scripts/webkitpy/common/config/committers.py.

- If you do not have reviewer rights please read http://webkit.org/coding/contributing.html for instructions on how to use bugzilla flags.

- If you have reviewer rights please correct the error in Tools/Scripts/webkitpy/common/config/committers.py by adding yourself to the file (no review needed).  The commit-queue restarts itself every 2 hours.  After restart the commit-queue will correctly respect your reviewer rights.
Comment 39 WebKit Review Bot 2012-11-04 22:08:58 PST
Comment on attachment 172268 [details]
Proposed patch v6

Rejecting attachment 172268 [details] from commit-queue.

alexhenrie24@gmail.com does not have committer permissions according to http://trac.webkit.org/browser/trunk/Tools/Scripts/webkitpy/common/config/committers.py.

- If you do not have committer rights please read http://webkit.org/coding/contributing.html for instructions on how to use bugzilla flags.

- If you have committer rights please correct the error in Tools/Scripts/webkitpy/common/config/committers.py by adding yourself to the file (no review needed).  The commit-queue restarts itself every 2 hours.  After restart the commit-queue will correctly respect your committer rights.
Comment 40 Build Bot 2013-01-15 19:47:58 PST
Comment on attachment 172268 [details]
Proposed patch v6

Attachment 172268 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://queues.webkit.org/results/15903195

New failing tests:
fast/text/international/003.html
fast/text/line-break-between-text-nodes.html
fast/writing-mode/Kusa-Makura-background-canvas.html
fast/text/line-break-in-marked-up-thai.html
Comment 41 Lu jing 2013-02-22 22:00:40 PST
In RenderBlockLineLayout.cpp,

Change the following code:

==== from ====

// If the next item on the line is text, and if we did not end with
// a space, then the next text run continues our word (and so it needs to
// keep adding to the uncommitted width. Just update and continue.
checkForBreak = !currentCharacterIsSpace && (c == ' ' || c == '\t' || (c == '\n' && !next->preservesNewline()));
              
==== to ====
      
checkForBreak = !currentCharacterIsSpace &&
    (c == ' ' || c == '\t' || (c == '\n' && !next->preservesNewline())
    || isBreakable(renderTextInfo.m_lineBreakIterator, current.m_pos, current.m_nextBreakablePosition, true));

then I can get the expected result.
As the comment above, breaking opportunities are missed here.
How about this solution? I am going to check the side effects.
Comment 42 Lu jing 2013-02-22 23:19:22 PST
I just found some incorrect result, the inline nodes like span will always cause a line breaking. Perhaps more check is needed near isBreakable...
Comment 43 Glenn Adams 2013-02-24 15:13:32 PST
Created attachment 189990 [details]
Patch
Comment 44 Glenn Adams 2013-02-24 17:35:55 PST
(In reply to comment #41)
> In RenderBlockLineLayout.cpp,
> 
> Change the following code:
> 
> ==== from ====
> 
> // If the next item on the line is text, and if we did not end with
> // a space, then the next text run continues our word (and so it needs to
> // keep adding to the uncommitted width. Just update and continue.
> checkForBreak = !currentCharacterIsSpace && (c == ' ' || c == '\t' || (c == '\n' && !next->preservesNewline()));
> 
> ==== to ====
> 
> checkForBreak = !currentCharacterIsSpace &&
>     (c == ' ' || c == '\t' || (c == '\n' && !next->preservesNewline())
>     || isBreakable(renderTextInfo.m_lineBreakIterator, current.m_pos, current.m_nextBreakablePosition, true));
> 
> then I can get the expected result.
> As the comment above, breaking opportunities are missed here.
> How about this solution? I am going to check the side effects.

Could you provide a test file that demonstrates what you were trying to fix with this change?
Comment 45 Glenn Adams 2013-02-25 07:30:55 PST
Created attachment 190055 [details]
Patch

Resubmit same for EWS retry.
Comment 46 Ryosuke Niwa 2013-02-25 12:34:23 PST
Comment on attachment 190055 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=190055&action=review

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:1563
> +    StringBuilder contextBuilder;
> +    for (InlineIterator i = resolver.position(); i.m_obj; i.moveToStartOf(bidiNextSkippingEmptyInlines(this, i.m_obj))) {
> +        if (i.m_obj->isText()) {
> +            renderTextInfo.m_positionInContext.add(i.m_obj, contextBuilder.length());
> +            contextBuilder.append(toRenderText(i.m_obj)->text());
> +        }
> +    }

The most important question is how much of performance impact this change has.

> LayoutTests/TestExpectations:8
> +# pending rebaseline on passing ports
> +webkit.org/b/17427 fast/text/international/003.html [ Skip ]
> +webkit.org/b/17427 fast/writing-mode/Kusa-Makura-background-canvas.html [ Skip ]
> +

Please generate pixel results on at least one port.
Comment 47 Glenn Adams 2013-02-25 12:49:39 PST
(In reply to comment #46)
> (From update of attachment 190055 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=190055&action=review
> 
> > Source/WebCore/rendering/RenderBlockLineLayout.cpp:1563
> > +    StringBuilder contextBuilder;
> > +    for (InlineIterator i = resolver.position(); i.m_obj; i.moveToStartOf(bidiNextSkippingEmptyInlines(this, i.m_obj))) {
> > +        if (i.m_obj->isText()) {
> > +            renderTextInfo.m_positionInContext.add(i.m_obj, contextBuilder.length());
> > +            contextBuilder.append(toRenderText(i.m_obj)->text());
> > +        }
> > +    }
> 
> The most important question is how much of performance impact this change has.

Thanks for the pre-review, though I didn't r? yet. I'm testing performance now and expect to need to optimize the original proposed code (above).

> 
> > LayoutTests/TestExpectations:8
> > +# pending rebaseline on passing ports
> > +webkit.org/b/17427 fast/text/international/003.html [ Skip ]
> > +webkit.org/b/17427 fast/writing-mode/Kusa-Makura-background-canvas.html [ Skip ]
> > +
> 
> Please generate pixel results on at least one port.

Will do. Just want to see if I could get this through EWS once.
Comment 48 Glenn Adams 2013-03-06 08:01:20 PST
Created attachment 191752 [details]
perf results - attachment 190055 [details]

Running perftest on apple mac (release) port (MBP R hw) shows:

* 0.85% regression on line-layout
* 0.75% regression on html5-full-render

Is this too high to warrant a commit?
Comment 49 Ryosuke Niwa 2013-03-06 09:08:20 PST
(In reply to comment #48)
> Created an attachment (id=191752) [details]
> perf results - attachment 190055 [details]
> 
> Running perftest on apple mac (release) port (MBP R hw) shows:
> 
> * 0.85% regression on line-layout
> * 0.75% regression on html5-full-render

I don't think 1% regression is acceptable.
Comment 50 Ryosuke Niwa 2013-03-06 09:10:26 PST
On the other hand, it seems like you got only 3 samples?  We need something like 30-50 samples to get a statistically significant result.
Comment 51 Glenn Adams 2013-03-10 15:35:33 PDT
Created attachment 192388 [details]
Patch
Comment 52 Glenn Adams 2013-03-10 15:39:27 PDT
(In reply to comment #51)
> Created an attachment (id=192388) [details]
> Patch

Reimplement patch so as to eliminate any performance regression (in run time and memory allocation). This patch handles the ASCII shortcut path through nextBreakPosition() and doesn't change (breaking or performance) behavior for non-ASCII path. A subsequent patch against bug 105692 will address the non-ASCII path.
Comment 53 Darin Adler 2013-03-10 17:48:49 PDT
Comment on attachment 192388 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=192388&action=review

Please look into the single-character-node issue I mention below. That’s the only reason I gave this a review- instead of review+.

> Source/WebCore/platform/text/TextBreakIterator.h:63
> +        , m_lastChar(0)
> +        , m_lastLastChar(0)

I know that our existing code calls these lastCh and lastLastCh. But, in WebKit code we normally prefer to use words rather than abbreviations and there is a normal English phrase for the “last last character” that I think is clear. I would prefer m_lastCharacter, m_secondToLastCharacter, setLastTwoCharacters, and resetLastTwoCharacters. All those seem easy to read and less like jargon.

> Source/WebCore/platform/text/TextBreakIterator.h:95
> +    inline UChar lastChar() const { return m_lastChar; }
> +    inline UChar lastLastChar() const { return m_lastLastChar; }
> +    inline void setLastChars(UChar last, UChar lastLast)
> +    {
> +        m_lastChar = last;
> +        m_lastLastChar = lastLast;
> +    }
> +    inline void resetLastChars()
> +    {
> +        m_lastChar = 0;
> +        m_lastLastChar = 0;
> +    }

The “inline” here are redundant and have no effect. All function definitions that are inside a class definition are treated as if “inline” is specified automatically.

Please omit them.

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:2873
> +            UChar lastCharInNode = 0;

lastCharacterInNode, please

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:2874
> +            UChar lastLastCharInNode = 0;

secondToLastCharacterInNode, please

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:2917
> +                            goto nextChar;

nextCharacter, please

> Source/WebCore/rendering/RenderBlockLineLayout.cpp:3083
> +            renderTextInfo.m_lineBreakIterator.setLastChars(lastCharInNode, lastLastCharInNode);

This does not seem to correctly handle a text node that has only one character in it (or zero characters, for that matter). I think that in that case it would overwrite a characters from the second to last node with a zero.

> Source/WebCore/rendering/break_lines.cpp:157
> +    CharacterType lastLastCh = pos > 1 ? str[pos - 2] : CharacterType(lazyBreakIterator.lastLastChar());
> +    CharacterType lastCh = pos > 0 ? str[pos - 1] : CharacterType(lazyBreakIterator.lastChar());

Please use static_cast<CharacterType>(x) instead of CharacterType(x).
Comment 54 Glenn Adams 2013-03-10 17:57:05 PDT
(In reply to comment #53)
> (From update of attachment 192388 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=192388&action=review
> 
> Please look into the single-character-node issue I mention below. That’s the only reason I gave this a review- instead of review+.
> > Source/WebCore/rendering/RenderBlockLineLayout.cpp:3083
> > +            renderTextInfo.m_lineBreakIterator.setLastChars(lastCharInNode, lastLastCharInNode);
> 
> This does not seem to correctly handle a text node that has only one character in it (or zero characters, for that matter). I think that in that case it would overwrite a characters from the second to last node with a zero.

yes, i can see that... of course the current code (without this patch) doesn't handle any way, so this patch wouldn't regress that; i wonder if it could be addressed in a follow-up bug/patch, just to separate the behavioral impact a bit? or do you think it should be fixed here before moving forward (after i can address the other comments)?
Comment 55 Darin Adler 2013-03-10 18:15:25 PDT
I think it would be best to handle that case, too, in this patch. It seems likely to be a small delta on the patch you have already put together.
Comment 56 Glenn Adams 2013-03-10 19:25:18 PDT
Created attachment 192396 [details]
Patch
Comment 57 Glenn Adams 2013-03-10 19:26:12 PDT
(In reply to comment #56)
> Created an attachment (id=192396) [details]
> Patch

address comment #53 (darin)
Comment 58 Glenn Adams 2013-03-10 19:54:29 PDT
Created attachment 192397 [details]
Patch for landing
Comment 59 Glenn Adams 2013-03-10 19:55:08 PDT
(In reply to comment #58)
> Created an attachment (id=192397) [details]
> Patch for landing

tweaked ChangeLog
Comment 60 WebKit Review Bot 2013-03-10 19:58:24 PDT
Comment on attachment 192397 [details]
Patch for landing

Rejecting attachment 192397 [details] from commit-queue.

Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', '--bot-id=gce-cq-03', 'validate-changelog', '--non-interactive', 192397, '--port=chromium-xvfb']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue

darin found in /mnt/git/webkit-commit-queue/LayoutTests/ChangeLog does not appear to be a valid reviewer according to committers.py.
/mnt/git/webkit-commit-queue/LayoutTests/ChangeLog neither lists a valid reviewer nor contains the string "Unreviewed" or "Rubber stamp" (case insensitive).

Full output: http://webkit-commit-queue.appspot.com/results/16991738
Comment 61 Glenn Adams 2013-03-10 19:59:53 PDT
Created attachment 192398 [details]
Patch for landing
Comment 62 WebKit Review Bot 2013-03-10 20:01:48 PDT
Comment on attachment 192398 [details]
Patch for landing

Rejecting attachment 192398 [details] from commit-queue.

Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', '--bot-id=gce-cq-01', 'validate-changelog', '--non-interactive', 192398, '--port=chromium-xvfb']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue

darin found in /mnt/git/webkit-commit-queue/LayoutTests/ChangeLog does not appear to be a valid reviewer according to committers.py.
/mnt/git/webkit-commit-queue/LayoutTests/ChangeLog neither lists a valid reviewer nor contains the string "Unreviewed" or "Rubber stamp" (case insensitive).

Full output: http://webkit-commit-queue.appspot.com/results/17114338
Comment 63 Glenn Adams 2013-03-10 20:03:13 PDT
Created attachment 192399 [details]
Patch for landing
Comment 64 WebKit Review Bot 2013-03-10 20:33:10 PDT
Comment on attachment 192399 [details]
Patch for landing

Clearing flags on attachment: 192399

Committed r145338: <http://trac.webkit.org/changeset/145338>
Comment 65 WebKit Review Bot 2013-03-10 20:33:18 PDT
All reviewed patches have been landed.  Closing bug.