Bug 65277 - Make functions to find word boundaries more flexible
Summary: Make functions to find word boundaries more flexible
Status: RESOLVED WONTFIX
Alias: None
Product: WebKit
Classification: Unclassified
Component: HTML Editing (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks: 25298 65773
  Show dependency treegraph
 
Reported: 2011-07-27 12:32 PDT by Xiaomei Ji
Modified: 2012-06-15 07:31 PDT (History)
13 users (show)

See Also:


Attachments
WIP patch, would like some feedback (13.46 KB, patch)
2011-08-05 15:28 PDT, Van Lam
no flags Details | Formatted Diff | Diff
Proposed fix (13.93 KB, patch)
2011-08-11 18:24 PDT, Van Lam
no flags Details | Formatted Diff | Diff
Proposed fix (14.48 KB, patch)
2011-08-11 18:31 PDT, Van Lam
webkit-ews: commit-queue-
Details | Formatted Diff | Diff
Proposed fix (16.59 KB, patch)
2011-08-15 15:23 PDT, Van Lam
no flags Details | Formatted Diff | Diff
Proposed fix (16.89 KB, patch)
2011-08-15 15:56 PDT, Van Lam
rniwa: review-
Details | Formatted Diff | Diff
Proposed fix (18.73 KB, patch)
2011-08-17 15:07 PDT, Van Lam
no flags Details | Formatted Diff | Diff
Proposed (alternative) fix (7.03 KB, patch)
2011-08-18 18:26 PDT, Van Lam
no flags Details | Formatted Diff | Diff
Revised (alternative) fix (7.35 KB, patch)
2011-08-19 16:36 PDT, Van Lam
rniwa: review-
Details | Formatted Diff | Diff
Revised (alternative) fix (8.65 KB, patch)
2011-08-25 11:34 PDT, Van Lam
rniwa: review-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Xiaomei Ji 2011-07-27 12:32:00 PDT
one way to go might be:
provide a flexible internal API which can return logical word break before or after space depending on passing-in parameter.
expose the  public API by passing the correct parameter so that the logical word break matches native platform's.

--webkit-visual-word should use this internal API,
so, some of the cases where word break in a box is collected will not be needed, instead, we could just use logical word break by passing in correct parameter depends on box and block's directionality.
Comment 1 Xiaomei Ji 2011-07-27 12:34:45 PDT
related issue:
https://bugs.webkit.org/show_bug.cgi?id=64392
Comment 2 Van Lam 2011-08-05 15:28:43 PDT
Created attachment 103122 [details]
WIP patch, would like some feedback

Detailed description of changes:

- Extracted functionality of findNextWordFromIndex into findNextWordFromIndexConsideringWordBreakPositioningRelativeToWord (this name is WIP) so that it can be parametrized with whether the word break should be logically before or after the word while maintaining existing functionality.

- Added four new search functions to be passed into nextBoundary/previousBoundary to choose where the word break should be. This is what I'd like most feedback on as doing it this way results in some duplicate code. An alternative to adding four new search functions introducing some duplicate code is to add a parameter to each search function; while there is no duplicate code, this would require changing more code (which includes adding parameters that will be unused in search functions endWordBoundary, endSentenceBoundary, and nextSentencePositionBoundary).

- In right/leftWordPositionAcrossBoundary, we never need to collect word breaks within a single box; this is an improvement because previously when moving right in an LTR block or left in an RTL block, word breaks would always be collected; this is significant for cases where runs of text of one directionality are very long.
Comment 3 Ryosuke Niwa 2011-08-08 12:06:28 PDT
How does this improve the performance?  Can you give us some metrics?
Comment 4 Xiaomei Ji 2011-08-08 13:08:17 PDT
We are using logical word break to get the visual word break.
due to the way logical word break works (in terms of whether the word break is before or after the space), even when the start position is within the box, we are not able to correctly get the visual word break (in terms of before or after space) by directly using logical word break in 50% of the cases.

Using moving left as example, 
Given a LTR box "abc def|" in LTR block, we can get visually left word break "abc |def" by directly using previousWordPosition().
Given a RTL box "FED CBA|" in LTR block, we can get the visually left word break "FED |CBA" by directly using nextWordPosition().
But given LTR box "abc def|" in RTL block, we are not able to get the visual word break "abc| def" (which is a position after previous word) by directly using previousWordPosition(). (please refer to http://trac.webkit.org/changeset/88359 for more details). Instead, we are collecting all word breaks in the box and compare offset to get the correct visual word break.
Similar for  a RTL box "FED CBA|" in RTL block.

If we provide flexible logical word breaker (in terms of whether it breaks before or after space), we could always use that directly when start position is inside box, instead of collecting word breaks of the box and comparing the offset to get the correct word break for 50% cases.

Also, we could use this flexible logical word breaker in some cases when start position is at the boundary.
Comment 5 Ryosuke Niwa 2011-08-08 13:28:35 PDT
The question is whether this is really impacting the performance or not.   You should use Shirk on Mac or some other means to measure the performance impact of this patch.
Comment 6 Ryosuke Niwa 2011-08-08 13:29:39 PDT
Just running visual word movement tests with --repeat & measure timing is good.  Make 10+ measurements and compute stdev.
Comment 7 Ryosuke Niwa 2011-08-08 13:35:43 PDT
Also, this bug title is too general.  With the current title, any patch that tries to improve the performance of --webkit-visual-word should be posted here.  That's not ideal.  If you have a specific fix that improves the performance measurably, then please file a separate bug as a blocker of this bug.
Comment 8 Ryosuke Niwa 2011-08-08 17:37:46 PDT
I still don't understand the benefit of this change.  Is there a mesurable difference after this change?  Otherwise, it seems to bloats the code pretty badly.
Comment 9 Van Lam 2011-08-08 17:49:03 PDT
Hi Ryosuke. There is a strong performance improvement with this change. If you consider visual word movement code as doing either:

1) getting the next/previous word break within a box
2) collecting *all* the word breaks in a box using (1) and choosing the correct next word break

This patch essentially makes the need for (2) much less frequent. Right now, if we have:
|abc |def ... |xyz| (LTR context)
for every word movement we'd collect every word break in the box. Now with this patch, we just get the correct next/previous word break in time proportional to the length of the next word (as opposed to the length of the entire box).

I'll submit profiling results in a few.
Comment 10 Van Lam 2011-08-08 21:42:11 PDT
When I profiled the release build of DumpRenderTree with --repeat-each 5 on LayoutTests/editing/selection/move-by-word-visually*, I got the results (only showing relevant part):

Before change:
3.8% rightWordPosition
   1.9% collectWordBreaksInBox
   1.7% rightWordBoundary
   0.1% nextBoundary
   ...

After change:
3.4% rightWordPosition
   1.7% rightWordBoundary
   1.2% collectWordBreaksInBox
   0.3% nextBoundary
   ...

Results for leftWordPosition are similar. This makes sense as, before the change, calls to collectWordBreaksInBox occurred when going right in LTR blocks and left in RTL blocks (roughly accounts for half of test cases). Now word breaks don't need to be collected for those cases; but they still need to be collected when the correct word break is at the logical end of line or when moving from one box to another.
Comment 11 Xiaomei Ji 2011-08-09 10:52:21 PDT
(In reply to comment #8)
> I still don't understand the benefit of this change.  Is there a mesurable difference after this change?  Otherwise, it seems to bloats the code pretty badly.

This patch provides 4 word search functions depends on the word break positions relative to word (before word or after word). The pro is cleaner code logic, and the con is code duplication (we could make existing function previousWordPositionBoundary call previousWordPositionBoundaryLogicallyAfterWord to reduce the duplication, if we want to fix the logical word break behavior across platform).

An alternative is:
add one more parameter in nextBoundary(), previousBoundary(), previousWordPositionBounary(), nextWordPositionBoundary(), findNextWordFromIndex(), and other 6 search functions in visible_unit.cpp (including startSentenceBoundary and endSentenceBounary, in which the newly added parameter is just a placeholder).

Inside function findNextWordFromIndex(), depends on the passing in parameter, return logical word break before or after word.

the word breaker search functions just pass in the IN parameter to findNextWordFromIndex().

the caller of nextBoundary()/previousBoundary() pass in the correct (before word or after word) parameter based on their usage: for visual word break, the value of the parameter depends on the platform, box directionality, and block directionality. For logical word break (previousWordPosition and nextWordPosition), the value of the parameter depends on the platform (editing behavior. related to https://bugs.webkit.org/show_bug.cgi?id=64392).


The pro of the alternative: reduce code duplication, and it could be used to solve issue 64392 as well.
The con of the alternative: adding placeholder parameters to the already heavily parameter-loaded search functions such as startSentenceBoundary().

We are debating on the approaches. Van prefer the approach in the patch, and I prefer the alternative. But we believe neither is ideal. 

Would appreciate your inputs!

One more note, the uploaded patch provides findNextWordFromIndexConsideringWordBreakPositioningRelativeToWord() as a cross-platform function, the existing findNextWordFromIndex() is platform-dependent.
Comment 12 Xiaomei Ji 2011-08-09 11:02:49 PDT
(In reply to comment #10)
> When I profiled the release build of DumpRenderTree with --repeat-each 5 on LayoutTests/editing/selection/move-by-word-visually*, I got the results (only showing relevant part):
> 
> Before change:
> 3.8% rightWordPosition
>    1.9% collectWordBreaksInBox
>    1.7% rightWordBoundary
>    0.1% nextBoundary
>    ...
> 
> After change:
> 3.4% rightWordPosition
>    1.7% rightWordBoundary
>    1.2% collectWordBreaksInBox
>    0.3% nextBoundary
>    ...
> 
> Results for leftWordPosition are similar. This makes sense as, before the change, calls to collectWordBreaksInBox occurred when going right in LTR blocks and left in RTL blocks (roughly accounts for half of test cases). Now word breaks don't need to be collected for those cases; but they still need to be collected when the correct word break is at the logical end of line or when moving from one box to another.

rightWordBoundary (and leftWordBoundary) could be improved using the same technic. 
currently, word breaks are collected in the box for 50% of the cases.

Then, nextBoundary() and previousBoundary() will take even higher percentage of time, and we can continue work on reducing constructing of VisiblePosition.
Comment 13 Van Lam 2011-08-11 18:24:44 PDT
Created attachment 103716 [details]
Proposed fix
Comment 14 WebKit Review Bot 2011-08-11 18:27:27 PDT
Attachment 103716 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'LayoutTests/ChangeLog', u'LayoutTests/edit..." exit_code: 1

Source/WebCore/platform/text/TextBoundaries.h:33:  Code inside a namespace should not be indented.  [whitespace/indent] [4]
Total errors found: 1 in 5 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 15 Van Lam 2011-08-11 18:31:43 PDT
Created attachment 103717 [details]
Proposed fix

Fixed style
Comment 16 Early Warning System Bot 2011-08-11 18:38:30 PDT
Comment on attachment 103717 [details]
Proposed fix

Attachment 103717 [details] did not pass qt-ews (qt):
Output: http://queues.webkit.org/results/9358439
Comment 17 Alexey Proskuryakov 2011-08-11 21:03:59 PDT
The measured performance improvements don't seem particularly large. Also, our problem is with calling these functions many times for each operation, not necessarily with them being slow.

I think that prior to adding new code, we should better define the concepts it works with. Different ports return drastically different results for word boundaries, and considering how basic these operations are, that causes really ugly bugs (see e.g. bug 65898 or bug 65999).

While we cannot even agree upon what word boundaries are, adding "word boundaries considering word break positioning" seems counter-productive.
Comment 18 Ryosuke Niwa 2011-08-11 21:10:18 PDT
I agree with Alexey's sentiment here.  We should try investigating different approaches to see which optimization pays off the most.  When I did optimization in editing, the improvement I saw were in the order of 1-2% in release and 30-40% in debug.  Anything below 1% seems ineffective at best.
Comment 19 Xiaomei Ji 2011-08-11 22:20:48 PDT
(In reply to comment #17)
> The measured performance improvements don't seem particularly large. Also, our problem is with calling these functions many times for each operation, not necessarily with them being slow.

Yes. you are right. The problem is calling these functions many times.
And the improvement is to decrease the number of calls.

When we compute visual word breaks, we are calling logical word break many times because logical word break does not return the correct (in terms of before or after space) word break position visual word breaker is looking for.

For example,
given a string "abc def hij",  nextWordPosition() returns the following position "abc| def| hij", and previousWordPosition() returns the following position "|abc |def |hij".
when start from beginning and move left to right, if the visual word break I am looking for is "abc |def |hij", 
then, I am not able to use nextWordPosition(). Instead,  I collected the word breaks inside this box starting from end of the string and using previousWordPosition(), the collected word breaks are "abc def |hij", then, "abc |def hij", then, "|abc def hij".  And by comparing the word break offset and current position offset, return "abc |def hij" as the visual word break.

The proposed improvement is to provide an *internal* API which can return flexible logical word break in terms of space. Given the above example, if there is an *internal* API that returns "abc |def |hij" as logical next word breaker, then, I can use that to get the visual word breaker in one call.

> 
> I think that prior to adding new code, we should better define the concepts it works with. Different ports return drastically different results for word boundaries, and considering how basic these operations are, that causes really ugly bugs (see e.g. bug 65898 or bug 65999). 
> While we cannot even agree upon what word boundaries are, adding "word boundaries considering word break positioning" seems counter-productive.
Comment 20 Ryosuke Niwa 2011-08-11 22:25:56 PDT
(In reply to comment #19)
> The proposed improvement is to provide an *internal* API which can return flexible logical word break in terms of space. Given the above example, if there is an *internal* API that returns "abc |def |hij" as logical next word breaker, then, I can use that to get the visual word breaker in one call.

Regardless of how logical and rationale the improvement is, I don't think 0.4% performance improvement justifies the maintenance burden introduced by doubling the number of functions.  For this to make sense, the performance improvement needs to be much more substantial.
Comment 21 Xiaomei Ji 2011-08-11 22:29:34 PDT
(In reply to comment #18)
> I agree with Alexey's sentiment here.  We should try investigating different approaches to see which optimization pays off the most.  When I did optimization in editing, the improvement I saw were in the order of 1-2% in release and 30-40% in debug.  Anything below 1% seems ineffective at best.

when you say 1-2% and 30-40%, are you talking about relative or absolute numbers?

For example, rightWordPosition from 3.8% to 3.4%, is the improvement 0.4% or ~10% (0.4/3.8)?
collectWordbreakInBox has 30%+ gain (0.7/1.9). And I believe it and rightWordBoundary should improve further if we decrease the collectWordBreakInBox calls in rightWordBoundary.

I think this is the 1st step of improvement (reduce the number of calls to nextBoundary()), after which, the cost of nextBoundary() will be more evident, and that is the next step for improvement.
Comment 22 Van Lam 2011-08-12 10:06:55 PDT
I should have clarified when I posted the profiling results the percentages are relative to running DRT; so originally 3.8% of the time was spent in rightWordPosition and after the change, 3.4%. This can be approximately seen as a ~10% speedup as Xiaomei pointed out.
Comment 23 Alexey Proskuryakov 2011-08-12 11:03:18 PDT
I don't think that we should focus much on performance of ad hoc tests. Move by word is not a very common editing operation.
Comment 24 Xiaomei Ji 2011-08-15 11:19:11 PDT
(In reply to comment #23)
> I don't think that we should focus much on performance of ad hoc tests. Move by word is not a very common editing operation.

hm... shift-arrow (alt-arrow in Mac) should move caret by word visually is the top #1 issue reported by chrome native users.
Comment 25 Van Lam 2011-08-15 11:46:58 PDT
Clarification on the patch:

The overall performance speedup is 0.8% (going left and right) when running in DRT; about 7.5% is spent in visual word movement code. From within visual word movement code, however, this is about a 10% speedup; this might be improved to 15-20% if we more aggressively utilize this API within rightWordBoundary and leftWordBoundary.

A secondary (probably more significant) purpose of this patch is that it will make future work much simpler to implement, most directly the Mac/Linux version (https://bugs.webkit.org/show_bug.cgi?id=65773). I've implemented a working Mac/Linux version but did not submit it for review just because it is not very understandable. I think with this change it would be much easier to implement and maintain.
Comment 26 Van Lam 2011-08-15 15:23:18 PDT
Created attachment 103963 [details]
Proposed fix
Comment 27 WebKit Review Bot 2011-08-15 15:26:46 PDT
Attachment 103963 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'LayoutTests/ChangeLog', u'LayoutTests/edit..." exit_code: 1

Source/WebCore/platform/text/TextBoundaries.h:33:  Code inside a namespace should not be indented.  [whitespace/indent] [4]
Source/WebCore/platform/text/TextBoundaries.h:45:  The parameter name "wordBreakPositioning" adds no information, so it should be removed.  [readability/parameter_name] [5]
Source/WebCore/platform/text/brew/TextBoundariesBrew.cpp:39:  Weird number of spaces at line-start.  Are you using a 4-space indent?  [whitespace/indent] [3]
Source/WebCore/platform/text/brew/TextBoundariesBrew.cpp:40:  Weird number of spaces at line-start.  Are you using a 4-space indent?  [whitespace/indent] [3]
Total errors found: 4 in 7 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 28 Van Lam 2011-08-15 15:56:29 PDT
Created attachment 103969 [details]
Proposed fix
Comment 29 Alexey Proskuryakov 2011-08-15 16:06:39 PDT
Comment on attachment 103969 [details]
Proposed fix

View in context: https://bugs.webkit.org/attachment.cgi?id=103969&action=review

> Source/WebCore/ChangeLog:21
> +        (WebCore::previousWordPositionBoundaryLogicallyBeforeWord):
> +        (WebCore::previousWordPositionBoundaryLogicallyAfterWord):
> +        (WebCore::nextWordPositionBoundaryLogicallyBeforeWord):
> +        (WebCore::nextWordPositionBoundaryLogicallyAfterWord):
> +        (WebCore::leftWordPositionAcrossBoundary):
> +        (WebCore::rightWordPositionAcrossBoundary):

I cannot really figure out from these function names what they are doing. Does "logically" have something to do with RTL languages? What is a "position across boundary"?
Comment 30 Van Lam 2011-08-15 16:38:16 PDT
(In reply to comment #29)
> (From update of attachment 103969 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=103969&action=review
> 
> > Source/WebCore/ChangeLog:21
> > +        (WebCore::previousWordPositionBoundaryLogicallyBeforeWord):
> > +        (WebCore::previousWordPositionBoundaryLogicallyAfterWord):
> > +        (WebCore::nextWordPositionBoundaryLogicallyBeforeWord):
> > +        (WebCore::nextWordPositionBoundaryLogicallyAfterWord):
> > +        (WebCore::leftWordPositionAcrossBoundary):
> > +        (WebCore::rightWordPositionAcrossBoundary):
> 
> I cannot really figure out from these function names what they are doing. Does "logically" have something to do with RTL languages? What is a "position across boundary"?

I'm aiming to specify where word breaks should be. So "logically after word" means those boundaries that follow words:

In LTR text: abc| def|
RTL: |FED |CBA

The usage of these functions is that one is passed into nextBoundary/previousBoundary to specify where the word break should be; it would mean the difference between:

(calling nextBoundary)
|abc def -> abc| def   and   |abc def -> abc |def

Currently when calling nextBoundary, there is only one clear function to pass in at the word-movement level and that will always return the word break logically after the word; same goes for previousBoundary except it returns the word break logically before the word.

An alternative is to specify visually (previousWordPositionBoundaryOnLeftOfWord, etc.).
Comment 31 Xiaomei Ji 2011-08-15 18:00:48 PDT
(In reply to comment #29)
> (From update of attachment 103969 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=103969&action=review
> 
> > Source/WebCore/ChangeLog:21
> > +        (WebCore::previousWordPositionBoundaryLogicallyBeforeWord):
> > +        (WebCore::previousWordPositionBoundaryLogicallyAfterWord):
> > +        (WebCore::nextWordPositionBoundaryLogicallyBeforeWord):
> > +        (WebCore::nextWordPositionBoundaryLogicallyAfterWord):
> > +        (WebCore::leftWordPositionAcrossBoundary):
> > +        (WebCore::rightWordPositionAcrossBoundary):
> 
> I cannot really figure out from these function names what they are doing. Does "logically" have something to do with RTL languages? What is a "position across boundary"?

"position across boundary" is a position does not honor editable boundary. Maybe I should rename it to "right/leftWordPositionAcrossEditableBoundary".
Comment 32 Alexey Proskuryakov 2011-08-15 20:43:20 PDT
Both names mean that the result will always be across boundary, which is apparently untrue.
Comment 33 Ryosuke Niwa 2011-08-15 20:58:02 PDT
I talked with Van on IRC, and the purpose of this patch is apparently not the performance but rather to implement Windows and Linux behaviors.  So let's focus on that and not on the superfluous performance improvement.
Comment 34 Ryosuke Niwa 2011-08-15 21:04:38 PDT
Comment on attachment 103969 [details]
Proposed fix

View in context: https://bugs.webkit.org/attachment.cgi?id=103969&action=review

Why is there no change in platform/text/mac/TextBoundaries.mm?

> Source/WebCore/ChangeLog:19
> +        (WebCore::previousWordPositionBoundaryLogicallyBeforeWord):
> +        (WebCore::previousWordPositionBoundaryLogicallyAfterWord):
> +        (WebCore::nextWordPositionBoundaryLogicallyBeforeWord):
> +        (WebCore::nextWordPositionBoundaryLogicallyAfterWord):

Before and after are logical concepts so "logically" is redundant.

> Source/WebCore/platform/text/TextBoundaries.cpp:112
> +#if !PLATFORM(BREWMP) && !PLATFORM(MAC) && !PLATFORM(QT)

We shouldn't be duplicating code in Brew and Qt and leaving Mac behind.  r- because of this.

> Source/WebCore/platform/text/qt/TextBoundariesQt.cpp:82
> +int findNextWordFromIndex(UChar const* buffer, int len, int position, bool forward)
> +{
> +    return findNextWordFromIndexConsideringWordBreakPositioningRelativeToWord
> +        (buffer, len, position, forward, forward ? LogicallyAfterWord : LogicallyBeforeWord);
> +}
> +

What's the point of duplicating the function here?
Comment 35 Van Lam 2011-08-16 09:58:50 PDT
(In reply to comment #34)
> (From update of attachment 103969 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=103969&action=review
> 
> Why is there no change in platform/text/mac/TextBoundaries.mm?
> 
> > Source/WebCore/ChangeLog:19
> > +        (WebCore::previousWordPositionBoundaryLogicallyBeforeWord):
> > +        (WebCore::previousWordPositionBoundaryLogicallyAfterWord):
> > +        (WebCore::nextWordPositionBoundaryLogicallyBeforeWord):
> > +        (WebCore::nextWordPositionBoundaryLogicallyAfterWord):

As far as visual word movement goes it seemed like the implementation in TextBoundaries.cpp works on Mac; but to make the change uniform across platforms I'll add the duplication.

> 
> Before and after are logical concepts so "logically" is redundant.
> 
> > Source/WebCore/platform/text/TextBoundaries.cpp:112
> > +#if !PLATFORM(BREWMP) && !PLATFORM(MAC) && !PLATFORM(QT)
> 
> We shouldn't be duplicating code in Brew and Qt and leaving Mac behind.  r- because of this.
> 
> > Source/WebCore/platform/text/qt/TextBoundariesQt.cpp:82
> > +int findNextWordFromIndex(UChar const* buffer, int len, int position, bool forward)
> > +{
> > +    return findNextWordFromIndexConsideringWordBreakPositioningRelativeToWord
> > +        (buffer, len, position, forward, forward ? LogicallyAfterWord : LogicallyBeforeWord);
> > +}
> > +
> 
> What's the point of duplicating the function here?

Good catch; there only needs to be one copy of findNextWordFromIndex that delegates to the platform's 
 findNextWordFromIndexConsideringWordBreakPositioningRelativeToWord.

Thanks for the review.
Comment 36 Van Lam 2011-08-17 15:07:16 PDT
Created attachment 104254 [details]
Proposed fix

Modified patch following review and comments.

Would like feedback on my changes in TextBoundaries.mm (not sure if my decision to use TextBreakIterator is correct).
Comment 37 Ryosuke Niwa 2011-08-17 15:11:47 PDT
I don't think I'm qualified to review this patch. ap, darin, or eseidel should do.
Comment 38 Darin Adler 2011-08-17 15:14:07 PDT
I think Dan also understand the issues here, perhaps better than I do.
Comment 39 Alexey Proskuryakov 2011-08-17 15:51:54 PDT
I have commented about function names above, and those comments still stand. This patch adds a good deal of confusion to an already confused area of code.
Comment 40 Van Lam 2011-08-17 16:34:13 PDT
(In reply to comment #39)
> I have commented about function names above, and those comments still stand. This patch adds a good deal of confusion to an already confused area of code.

Hi Alexey,

I understand that the naming of right/leftWordPositionAcrossBoundary should change but thought that the renaming and several other stylistic to-dos would best be addressed in a separate cleanup patch.

I also understand that this isn't a clean patch. I've considered an alternative to implement this based on the observation:

You can chain calls to nextBoundary and previousBoundary to get the correct position. For example in the common case, you can get the next word break before word in LTR text by calling previousBoundary(nextBoundary(nextBoundary(currentVisiblePosition, nextWordPositionBoundary), nextWordPositionBoundary), previousWordPositionBoundary). While this avoids platform-level changes there are many cases to consider especially in bidi text and it is slower.

I'm open to ideas on how best to implement this functionality as it's blocking https://bugs.webkit.org/show_bug.cgi?id=65773.

Thanks
Comment 41 Van Lam 2011-08-18 18:26:32 PDT
Created attachment 104432 [details]
Proposed (alternative) fix
Comment 42 Xiaomei Ji 2011-08-19 15:46:25 PDT
(In reply to comment #41)
> Created an attachment (id=104432) [details]
> Proposed (alternative) fix

This seems the right fix for the code removed in http://trac.webkit.org/changeset/88359.
Comment 43 Ryosuke Niwa 2011-08-19 15:49:06 PDT
Comment on attachment 104432 [details]
Proposed (alternative) fix

View in context: https://bugs.webkit.org/attachment.cgi?id=104432&action=review

> Source/WebCore/editing/visible_units.cpp:1591
> +static VisiblePosition nextWordBoundaryBeforeWord(VisiblePosition startingPosition, InlineBox* startingBox, int startingOffset)

Why do we ned startingBox and startingOffset?
Comment 44 Xiaomei Ji 2011-08-19 15:50:39 PDT
Comment on attachment 104432 [details]
Proposed (alternative) fix

View in context: https://bugs.webkit.org/attachment.cgi?id=104432&action=review

> Source/WebCore/editing/visible_units.cpp:1601
> +    if (currentOffset > startingOffset)

should we checking (currentBox == startingBox &&....)?

> Source/WebCore/editing/visible_units.cpp:1624
> +        return prevNext;

ditto
Comment 45 Van Lam 2011-08-19 15:59:07 PDT
(In reply to comment #43)
> (From update of attachment 104432 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=104432&action=review
> 
> > Source/WebCore/editing/visible_units.cpp:1591
> > +static VisiblePosition nextWordBoundaryBeforeWord(VisiblePosition startingPosition, InlineBox* startingBox, int startingOffset)
> 
> Why do we ned startingBox and startingOffset?

We could obtain these from the VisiblePosition passed in but since these two functions are called from right/leftWordPositionIgnoringEditingBoundary in which we've already called getInlineBoxAndOffset, I chose to just pass in the box and offset instead of recalculating. The actual box and offset are used to check if the result actually makes sense; since these two functions rely on next/previousBoundary which are logical functions, if the result is outside the box there's no guarantee that there's visual continuity.
Comment 46 Van Lam 2011-08-19 16:07:50 PDT
(In reply to comment #44)
> (From update of attachment 104432 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=104432&action=review
> 
> > Source/WebCore/editing/visible_units.cpp:1601
> > +    if (currentOffset > startingOffset)
> 
> should we checking (currentBox == startingBox &&....)?
> 
> > Source/WebCore/editing/visible_units.cpp:1624
> > +        return prevNext;
> 
> ditto

If we exit early because of this check then (In reply to comment #44)
> (From update of attachment 104432 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=104432&action=review
> 
> > Source/WebCore/editing/visible_units.cpp:1601
> > +    if (currentOffset > startingOffset)
> 
> should we checking (currentBox == startingBox &&....)?
> 
> > Source/WebCore/editing/visible_units.cpp:1624
> > +        return prevNext;
> 
> ditto

Right. I didn't catch this as the return values of these two functions are checked using positionIsInBoxButNotOnBoundary right after being called. But we should never return a position outside of startingBox to be safe. I'll resubmit with the additional check.
Comment 47 Van Lam 2011-08-19 16:36:56 PDT
Created attachment 104591 [details]
Revised (alternative) fix

Corrected previous patch following Xiaomei's comment.
Comment 48 Ryosuke Niwa 2011-08-24 20:03:10 PDT
Comment on attachment 104591 [details]
Revised (alternative) fix

View in context: https://bugs.webkit.org/attachment.cgi?id=104591&action=review

> Source/WebCore/editing/visible_units.cpp:1591
> +static VisiblePosition nextWordBoundaryBeforeWord(VisiblePosition startingPosition, InlineBox* startingBox, int startingOffset)

This function's interface is quite confusing.  Why does this function take BOTH startingPosition and startingBox & startingOffset.  I'd rather make an extra call to getInlineBoxAndOffset than requiring callers to pass in the correct arguments.  It appears that this function wants to take RenderedPosition as the argument after all.

> Source/WebCore/editing/visible_units.cpp:1593
> +    VisiblePosition next = nextBoundary(startingPosition, nextWordPositionBoundary);

next is a very vague name.  In this function, we're concerned about word boundary and on which side of the word boundary we're at.  So your variable name should reflect that.  nextPrev, nextNextPrev do not convey such information and should therefore be be avoided.

> LayoutTests/ChangeLog:8
> +        My change fixed this failure.

You should explain what has been fixed and why it has been fixed.
Comment 49 Van Lam 2011-08-25 11:34:47 PDT
Created attachment 105225 [details]
Revised (alternative) fix

Thanks for the review. Here's my updated patch.
Comment 50 Van Lam 2011-08-26 13:37:03 PDT
Ping reviewers.

Would really appreciate a review for my updated patch as I want to work on 65773.
Comment 51 Alexey Proskuryakov 2011-08-26 15:56:36 PDT
I'm having fairly hard time trying to understand what the new functions do.

Let's look at nextWordBoundaryBeforeWord() for a specific example. What will it return in each of the following cases (vertical line denoting starting position)?

te|st
test te|st test
test |test test
test test| test
test        |test test
test   |     test test
+1|0
test +++++++|++++++ test
U|.S.A.
test "te|st" test
Comment 52 Alexey Proskuryakov 2011-08-26 16:01:46 PDT
Also, <p>te|st</p><p>test</p> (two words separated by a paragraph break).
Comment 53 Van Lam 2011-08-27 02:35:18 PDT
(In reply to comment #51)
> I'm having fairly hard time trying to understand what the new functions do.
> 
> Let's look at nextWordBoundaryBeforeWord() for a specific example. What will it return in each of the following cases (vertical line denoting starting position)?
> 
> te|st
> test te|st test
> test |test test
> test test| test
> test        |test test
> test   |     test test
> +1|0
> test +++++++|++++++ test
> U|.S.A.
> test "te|st" test

Hi Alexey,

nextWordBoundaryBeforeWord is based off the observation that in most cases, the next word boundary from a position somePosition that lies before a word is previousBoundary(nextBoundary(nextBoundary(somePosition, _), _), _); let the underscores be the appropriate search functions passed in (next/previousWordPositionBoundary). 

Example:

Given text "a|bc def ghi" where | indicates the starting position,

call nextBoundary: a|bc def ghi -> abc| def ghi
call nextBoundary: abc| def ghi -> abc def| ghi
call previousBoundary: abc def| ghi -> abc |def ghi

This doesn't work when 1) the current position is at the end of a word or 2) when nextBoundary returns the last word boundary in the block. In the latter case, VisiblePosition() is returned to indicate that the desired word break is not in the box of the starting position.

As for the cases:

te|st -> VisiblePosition()
test te|st test -> test test |test
test |test test -> test test |test
test        |test test -> test        test |test
+1|0 -> VisiblePosition()
test +++++++|++++++ test -> test ++++++++|+++++ test
U|.S.A. -> VisiblePosition()
test "te|st" test -> test "test" |test

These are the results in Safari on Mac; results may differ slightly on other platforms as the functions eventually delegate to findNextWordFromIndex in TextBoundaries.
Comment 54 Van Lam 2011-08-27 03:17:10 PDT
(In reply to comment #52)
> Also, <p>te|st</p><p>test</p> (two words separated by a paragraph break).

In this case nextWordBoundaryBeforeWord returns VisiblePosition(). Because the two "test" strings are in different blocks (I think this is the correct term) the behavior of the function will work like:

call nextBoundary: <p>te|st</p><p>test</p> -> <p>test|</p><p>test</p>
call nextBoundary: <p>test|</p><p>test</p> -> <p>test|</p><p>test</p> (same position returned)

At this point the code returns VisiblePosition() and the correct word break (at the end of the first line) is computed in another part of visual word movement code.
Comment 55 Alexey Proskuryakov 2011-08-29 09:29:58 PDT
Some comments:
1. It's important to know not just what function currently does, but what it's supposed to do. There are two ways to describe that in function name - either by telling what exactly it does, or when exactly it's supposed to be called.
2. nextWordBoundaryBeforeWord(VisiblePosition) looks as if it returned next word boundary before word that contained the VisiblePosition, which makes no sense.
3. You changed one of the test cases, so I'm unsure what happens for "test   |     test test".
4. You are saying that there are platform differences allowed in the results. What differences are OK? Mac version of findNextWordFromIndex() isn't based on ICU, and I don't know how different it can be. In particular, how will we avoid issues like bug 65898 going forward?
5. If implementing Windows style alt-arrow movement is the goal, perhaps the code should make this explicit. Also, being unable to move across paragraphs is incorrect for this use case.
Comment 56 Van Lam 2011-08-29 10:28:59 PDT
(In reply to comment #55)
> Some comments:
> 1. It's important to know not just what function currently does, but what it's supposed to do. There are two ways to describe that in function name - either by telling what exactly it does, or when exactly it's supposed to be called.
> 2. nextWordBoundaryBeforeWord(VisiblePosition) looks as if it returned next word boundary before word that contained the VisiblePosition, which makes no sense.
> 3. You changed one of the test cases, so I'm unsure what happens for "test   |     test test".
> 4. You are saying that there are platform differences allowed in the results. What differences are OK? Mac version of findNextWordFromIndex() isn't based on ICU, and I don't know how different it can be. In particular, how will we avoid issues like bug 65898 going forward?
> 5. If implementing Windows style alt-arrow movement is the goal, perhaps the code should make this explicit. Also, being unable to move across paragraphs is incorrect for this use case.

1) I should make it explicit that these functions work within a single box.

2) How's wordBoundaryBeforeNextWord? I think this is more clear.

3) Ah, I overlooked this test case. Here it is: test   |     test test -> test        |test test

4) As far as I know every platform's findNextWordFromIndex returns the word boundary after the next word; the differences that will affect this function are which special non-alphanumeric characters get treated as parts of words. As for bug 65898, isn't this a problem that should be fixed within findWordBoundary and findNextWordFromIndex? Right now "foo bar _baz_" breaks the logical word movement as well (alt left arrow after "_baz_").

5) The ultimate goal is to implement the Mac style alt-arrow movement (as the Windows version is already implemented); to do this we (Xiaomei and I) planned to return the correct word break based on box directionality, block directionality, movement direction, and editing behavior. Having these two functions would simplify the code a lot. I should point out that these two functions would mainly be used as helper functions within visual word movement code; in the end alt-arrow will move across paragraphs.

Thanks for your feedback!
Comment 57 Alexey Proskuryakov 2011-08-29 11:00:09 PDT
> 3) Ah, I overlooked this test case. Here it is: test   |     test test -> test        |test test

This result seems substantially different from others.
Comment 58 Van Lam 2011-08-29 11:15:27 PDT
(In reply to comment #57)
> > 3) Ah, I overlooked this test case. Here it is: test   |     test test -> test        |test test
> 
> This result seems substantially different from others.

If you consider all word breaks before words, nextWordBoundaryBeforeWord will return the logically next one after the passed-in position. The behavior of this function on this case aligns with Windows alt-arrow behavior.
Comment 59 Darin Adler 2011-08-29 11:21:19 PDT
(In reply to comment #58)
> If you consider all word breaks before words, nextWordBoundaryBeforeWord

I think you mean to say “word boundaries” not “word breaks”.

And no, I don’t think that the term “word boundary” specifically means the place before a word. We’d need a more-specific term for that. A word boundary would mean both the place before a word and the place after a word.
Comment 60 Alexey Proskuryakov 2011-08-29 11:38:17 PDT
I think that we need better abstraction for code to be used in cursor movement iterator. For instance, there is no guarantee that Option-left arrow wants the same breaks as Option-delete.

Current code is guilty of having lots of specialized functions with generic names in visible_units.cpp and then mistakenly reusing them when subtly different behavior is needed. Adding more of that is a step in a wrong direction.
Comment 61 Kenneth Rohde Christiansen 2011-08-30 00:48:36 PDT
rniwa told me to add some comments here.

Basically I am in the need for a method similar to VisualSelection.expandUsingGranularity(WordGranularity);, but that does not include spaces and never considers spaces to be a word. This is needed for injecting compositions into our virtual keyboard when clicking on a word in an editable field.
Comment 62 Ryosuke Niwa 2012-04-21 18:59:07 PDT
Comment on attachment 105225 [details]
Revised (alternative) fix

This patch is obsolete after xji's rewrite of visual word movement patch.