Bug 116877

Summary: Optimize RenderText::offsetNext for 8 bit strings
Product: WebKit Reporter: Adam Barth <abarth>
Component: New BugsAssignee: Adam Barth <abarth>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, benjamin, commit-queue, darin, esprehn+autocc, glenn, ojan, rniwa
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch none

Description Adam Barth 2013-05-28 11:58:50 PDT
Optimize RenderText::offsetNext for 8 bit strings
Comment 1 Adam Barth 2013-05-28 12:00:56 PDT
Created attachment 203074 [details]
Patch
Comment 2 Adam Barth 2013-05-28 12:02:34 PDT
@benjaminp: I figured you'd be a good reviewer for the 8-bit strings aspect of this change, but I'm unsure who would be a good reviewer for the RenderText aspects.  Would you be willing to either review this patch or suggest another reviewer?  Thanks!
Comment 3 Ryosuke Niwa 2013-05-28 12:33:50 PDT
Comment on attachment 203074 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=203074&action=review

> Source/WebCore/rendering/RenderText.cpp:1731
> +    if (isAllASCII() || m_text.is8Bit())
> +        return current - 1;

I don't understand how this works. Why is it okay to never call textBreakPreceding in this case?
Comment 4 Adam Barth 2013-05-28 13:11:17 PDT
Comment on attachment 203074 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=203074&action=review

>> Source/WebCore/rendering/RenderText.cpp:1731
>> +        return current - 1;
> 
> I don't understand how this works. Why is it okay to never call textBreakPreceding in this case?

The idea is that the cursorMovementIterator will always move one character at a time for Latin-1 strings.  Perhaps that idea is flawed?
Comment 5 Glenn Adams 2013-05-28 13:20:17 PDT
(In reply to comment #4)
> (From update of attachment 203074 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=203074&action=review
> 
> >> Source/WebCore/rendering/RenderText.cpp:1731
> >> +        return current - 1;
> > 
> > I don't understand how this works. Why is it okay to never call textBreakPreceding in this case?
> 
> The idea is that the cursorMovementIterator will always move one character at a time for Latin-1 strings.  Perhaps that idea is flawed?

Any issue with ligatures (ff, fi, fl, ffi, ffl)?
Comment 6 Darin Adler 2013-05-28 14:06:42 PDT
Comment on attachment 203074 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=203074&action=review

>>>> Source/WebCore/rendering/RenderText.cpp:1731
>>>> +        return current - 1;
>>> 
>>> I don't understand how this works. Why is it okay to never call textBreakPreceding in this case?
>> 
>> The idea is that the cursorMovementIterator will always move one character at a time for Latin-1 strings.  Perhaps that idea is flawed?
> 
> Any issue with ligatures (ff, fi, fl, ffi, ffl)?

The only characters I can think of that that might be a problem: Control characters 0000-001F and 0080-009F, Soft hyphen 00AD. Others should be fine. And no, I don’t see any issue with ligatures.
Comment 7 Darin Adler 2013-05-28 14:09:21 PDT
Comment on attachment 203074 [details]
Patch

We obviously want something like this; if there is an issue with soft hyphen we should be able to deal with it simply. Would be nice to have a test to be sure. The isAllASCII part of this patch could be landed separately and already delivers the bulk of the value.
Comment 8 Adam Barth 2013-05-28 14:19:24 PDT
Comment on attachment 203074 [details]
Patch

(In reply to comment #7)
> (From update of attachment 203074 [details])
> We obviously want something like this; if there is an issue with soft hyphen we should be able to deal with it simply.

It looks like there is an issue with soft hyphen.

> Would be nice to have a test to be sure.

I verified that soft hypen has non-trivial cursor movement by making the Wikipedia page for soft hypen contenteditable and moving the cursor around the page.

> The isAllASCII part of this patch could be landed separately and already delivers the bulk of the value.

Indeed.  /me will work on a test for the soft hypen issue.

Thanks for the review!
Comment 9 Adam Barth 2013-05-28 21:59:37 PDT
Experimenting with a brute force test, I believe this optimization is valid for characters below U+0300 (at least when arranged in an ascending sequence).  U+0300 is COMBINING GRAVE ACCENT:

http://www.fileformat.info/info/unicode/char/300/index.htm

I'll upload a new patch with a test momentarily.
Comment 10 Adam Barth 2013-05-28 22:01:17 PDT
Created attachment 203120 [details]
Patch
Comment 11 Adam Barth 2013-05-28 22:03:04 PDT
I also experimented with a RELEASE_ASSERT and a test case involving a bunch of different line breaking scenarios for soft hyphen.  The RELEASE_ASSERT didn't fire (except for test cases involving characters >= U+0300), and the interactive behavior was unchanged.
Comment 12 Alexey Proskuryakov 2013-05-28 22:38:51 PDT
The behavior may depend on locale. I do not have a specific example where ICU does that (because I don't know how to search for it), but have a look at digraphs section at <https://en.wikipedia.org/wiki/Gaj's_Latin_alphabet>.
Comment 14 Adam Barth 2013-05-29 11:53:07 PDT
(In reply to comment #13)
> Related blink change:
> https://chromium.googlesource.com/chromium/blink/+/441ee6b89876e80754147f3aeccc09b5da423c6b

The most closely related change is of course https://chromium.googlesource.com/chromium/blink/+/c02a5a95086dd4ac48bc46d64518db1b083cac11

Anyway, I thought you all might be interested in these optimizations as well.
Comment 15 Darin Adler 2013-05-29 12:21:22 PDT
Comment on attachment 203120 [details]
Patch

OK. I’m satisfied this performance optimization probably doesn’t change behavior. We have looked and haven’t found any problems.

Seems like someone motivated could investigate if we have good behavior for soft hyphens. Aside from this specific function, do we edit them in the way we would like?
Comment 16 Glenn Adams 2013-05-29 12:22:54 PDT
Have you done any perf tests to show what gain is achieved?
Comment 17 Darin Adler 2013-05-29 12:24:17 PDT
The main gain here is memory; all these 8-bit strings were converted to 16-bit strings. I hesitate to say this, but “the memory gain is dramatic and obvious”.
Comment 18 WebKit Commit Bot 2013-05-29 12:26:03 PDT
Comment on attachment 203120 [details]
Patch

Clearing flags on attachment: 203120

Committed r150922: <http://trac.webkit.org/changeset/150922>
Comment 19 WebKit Commit Bot 2013-05-29 12:26:07 PDT
All reviewed patches have been landed.  Closing bug.
Comment 20 Glenn Adams 2013-05-29 12:29:44 PDT
(In reply to comment #17)
> The main gain here is memory; all these 8-bit strings were converted to 16-bit strings. I hesitate to say this, but “the memory gain is dramatic and obvious”.

I was just wondering about the (time) performance impact of adding the isAllASCII() call.
Comment 21 Ryosuke Niwa 2013-05-29 12:31:01 PDT
(In reply to comment #20)
> (In reply to comment #17)
> > The main gain here is memory; all these 8-bit strings were converted to 16-bit strings. I hesitate to say this, but “the memory gain is dramatic and obvious”.
> 
> I was just wondering about the (time) performance impact of adding the isAllASCII() call.

bool isAllASCII() const { return m_isAllASCII; }

There should be no performance impact.
Comment 22 Adam Barth 2013-05-29 12:31:45 PDT
> Aside from this specific function, do we edit them in the way we would like?

I'm not sure how users expect to be able to edit them, but the behavior isn't what I would have guessed.

foo-bar

In this case, the - is invisible between foo and bar, but if you try to move the caret through the text (e.g., with the right arrow key), you need to press right-arrow an "extra" time (visually) to get over the invisible hypen.

foo-
bar

In this case, if you have the caret at fo|o, then moving pressing right-arrow once moves the caret after the soft hyphen: foo-|.  If you type "a", the character is inserted after the last o and the caret remains after the hypen: fooa-|

I don't know enough about editing to know whether those are sensible behaviors, but they don't appear to be affected by this patch.

> I hesitate to say this, but “the memory gain is dramatic and obvious”.

I profiled all the sources of 16 bit strings in Mobile Gmail in Blink, and this was one of the top sources (aside from a bunch of Chromium-specific sources, which aren't relevant for WebKit).

I don't have exact numbers, unfortunately, because I only recently got my memory benchmark to be stable enough to generate repeatable numbers.
Comment 23 Glenn Adams 2013-05-29 12:45:34 PDT
(In reply to comment #21)
> (In reply to comment #20)
> > (In reply to comment #17)
> > > The main gain here is memory; all these 8-bit strings were converted to 16-bit strings. I hesitate to say this, but “the memory gain is dramatic and obvious”.
> > 
> > I was just wondering about the (time) performance impact of adding the isAllASCII() call.
> 
> bool isAllASCII() const { return m_isAllASCII; }
> 
> There should be no performance impact.

Excellent! I was worried it was iterating over the text (again).