Bug 46695

Summary: Optimize TextBreakIteratorQt to avoid deep-copying the strings used by cached iterators
Product: WebKit Reporter: Matthew Figg <matthew.figg>
Component: TextAssignee: Nobody <webkit-unassigned>
Status: RESOLVED INVALID    
Severity: Enhancement CC: benjamin, buildbot, christopher.reiss, darin, dglazkov, eric, gustavo.noronha, gustavo, kenneth, kling, koivisto, laszlo.gombos, luiz, manuel.lazzari, mitz, ned, webkit.review.bot, xan.lopez
Priority: P3 Keywords: Qt, QtTriaged
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Attachments:
Description Flags
Incomplete - seek feedback on this approach.
menard: review-
one big <div> block
none
proposed approach to safe recycling without doing a deep copy of the string
none
(fix to chromium build break)
kling: review-
style fixes
none
Using RefPtr's to safely reuse textiterators
none
Using RefPtr's to safely reuse textiterators kling: review-

Description Matthew Figg 2010-09-27 21:16:53 PDT
The setUpIterator function in WebCore/platform/text/qt/TextBreakIteratorQt.cpp:53 has two problems:

    TextBreakIterator* setUpIterator(TextBreakIterator& iterator, QTextBoundaryFinder::BoundaryType type, const UChar* string, int length)
    {
        if (!string || !length)
            return 0;

        if (iterator.isValid() && type == iterator.type() && length == iterator.length
            && memcmp(string, iterator.string, length) == 0) {
            iterator.toStart();
            return &iterator;
        }

        iterator = TextBreakIterator(type, string, length);

        return &iterator;
    }

1) The memcmp doesn't compare the entire string.  It should instead be memcmp(string, iterator.string, length * sizeof(UChar))

This is easily proven by loading the following webpage.  The iterator is reused for all inputs:

<html>
  <body>
    <input value="search0"/>
    <input value="search1"/>
    <input value="search2"/>
    <input value="search3"/>
    <input value="search4"/>
    <input value="search5"/>
    <input value="search6"/>
    <input value="search7"/>
    <input value="search8"/>
    <input value="search9"/>
  </body>
</html>

2) The iterator.string is sometimes freed between calls.  If the iterator is valid, the type is the same and the length is the same, it ends up comparing the string to random memory and potentially (if the random memory matches) returning an iterator with the invalid string pointer.  I'm not sure if this could cause larger issues further downstream?

This bug was introduced when fixing https://bugs.webkit.org/show_bug.cgi?id=39958 - See http://trac.webkit.org/changeset/60847/trunk/WebCore/platform/text/qt/TextBreakIteratorQt.cpp
Comment 1 Manuel Lazzari 2010-12-28 01:00:55 PST
I managed to verify problem (2) using debugging tool for windows and enabling global flags:

- page heap
- heap tail check
- heap free checking
- heap parameters checking
- heap validation on call

To reproduce the problem and have the program to crash you can simply start to type in an input field, delete all the input via backspace and type something else: when you start retyping the program crashes. 

Calls stack following:

5caf8f50 MSVCR90D!free+0x00000010
    10c9fe6d QtWebKitd4!WTF::fastFree+0x0000000d
    1008af7d QtWebKitd4!WTF::FastAllocBase::operator delete+0x0000001d
    1008af40 QtWebKitd4!WebCore::StringImpl::`scalar deleting destructor'+0x00000020
    1008aeef QtWebKitd4!WebCore::StringImpl::deref+0x0000003f
    1008aea1 QtWebKitd4!WTF::derefIfNotNull<WebCore::StringImpl>+0x00000011
    1008ae52 QtWebKitd4!WTF::RefPtr<WebCore::StringImpl>::~RefPtr<WebCore::StringImpl>+0x00000012
    1008ad5f QtWebKitd4!WebCore::String::~String+0x0000000f
    108e3d08 QtWebKitd4!WebCore::RenderText::~RenderText+0x00000038
    108e3848 QtWebKitd4!WebCore::RenderBR::~RenderBR+0x00000018
    108e3c9f QtWebKitd4!WebCore::RenderBR::`scalar deleting destructor'+0x0000000f
    1092df7b QtWebKitd4!WebCore::RenderObject::arenaDelete+0x0000013b
    1092de3b QtWebKitd4!WebCore::RenderObject::destroy+0x0000013b
    1095021a QtWebKitd4!WebCore::RenderText::destroy+0x000000ca
    1053b93f QtWebKitd4!WebCore::Node::detach+0x0000003f
    104e41d9 QtWebKitd4!WebCore::ContainerNode::detach+0x00000049
    1052241a QtWebKitd4!WebCore::Element::detach+0x0000003a
    104e3457 QtWebKitd4!WebCore::ContainerNode::removeChild+0x00000157
    1053a4da QtWebKitd4!WebCore::Node::remove+0x0000003a
    105e2a9f QtWebKitd4!WebCore::RemoveNodeCommand::doApply+0x0000008f
    105aafa8 QtWebKitd4!WebCore::EditCommand::apply+0x000000c8
    105959ff QtWebKitd4!WebCore::CompositeEditCommand::applyCommandToComposite+0x0000004f
    105964e4 QtWebKitd4!WebCore::CompositeEditCommand::removeNode+0x00000074
    10598988 QtWebKitd4!WebCore::CompositeEditCommand::removePlaceholderAt+0x00000048
    105d4f53 QtWebKitd4!WebCore::InsertTextCommand::input+0x00000463
    10601769 QtWebKitd4!WebCore::TypingCommand::insertTextRunWithoutNewlines+0x00000119
Comment 2 Manuel Lazzari 2010-12-28 01:04:50 PST
EDIT - call stack obviously reports who frees the string
Comment 3 Benjamin Poulain 2011-01-30 07:46:22 PST
Please follow http://trac.webkit.org/wiki/QtWebKitBugs when reporing bugs here (missing Qt keyword). Your bug was not tracked because of the missing Qt keyword :(
Comment 4 chris reiss 2011-01-31 10:29:06 PST
starting on this ...
Comment 5 chris reiss 2011-01-31 11:06:37 PST
Just seeing this code for the first time, but my initial take is that
this 
        if (iterator.isValid() && type == iterator.type() && length == iterator.length
            && memcmp(string, iterator.string, length) == 0) {
            iterator.toStart();
            return &iterator;
        }


should really be this ...

        if (iterator.isValid() && type == iterator.type() && length == iterator.length && string == iterator.string) {
            iterator.toStart();
            return &iterator;
        }

That is, we just want to make sure iterator.string is at the same place in memory as 'string'.  Then it's safe to recycle it.   Let me do some more digging and testing ...
Comment 6 Benjamin Poulain 2011-01-31 11:23:25 PST
(In reply to comment #5)
> should really be this ...
> 
>         if (iterator.isValid() && type == iterator.type() && length == iterator.length && string == iterator.string) {
>             iterator.toStart();
>             return &iterator;
>         }
> 
> That is, we just want to make sure iterator.string is at the same place in memory as 'string'.  Then it's safe to recycle it.   Let me do some more digging and testing ...

Uh... "string" is a UChar*. I don't think you'll skip the memcmp so easily.
Comment 7 chris reiss 2011-01-31 11:44:40 PST
It seems to me that : 

   We can recycle the iterator if - 
   a) the strings are at the same location in memory, hence identical.   A comparison of (string== iterator.string) tests for this.
   b) the strings happen to be at different locations, but have the same content.
But in this case, we run the risk that 'string' points to freed memory (as Manuel pointed out).

   We could work around this by doing a string copy, but this may incur a worse performance hit than recreating the iterator.

Of course, a) is useless if there is some copying being done somewhere up the stack.   I'm testing for this now, that is, does the test ever pass.
Comment 8 chris reiss 2011-01-31 11:47:45 PST
EDIT : we run the risk that iterator.string points to free memory.
Comment 9 Andreas Kling 2011-01-31 12:26:40 PST
(In reply to comment #7)
> It seems to me that : 
> 
>    We can recycle the iterator if - 
>    a) the strings are at the same location in memory, hence identical.   A comparison of (string== iterator.string) tests for this.

This is not sufficient. 'string' can be == 'iterator.string' for different strings in the case where the allocator re-uses the same address in a subsequent allocation.

There are two solutions, as I see it. Either we hold a strong reference to the string for the lifetime of the cached iterator or we take a deep copy.

The ref fix is superior since it eliminates the bug and avoids the deep copy. Unfortunately the API of these functions is character/length based, which means that we have no string to bump the refcount on.

Is it possible to alter the setUpIterator() (and friends) API to take String objects instead?
Comment 10 chris reiss 2011-01-31 14:04:10 PST
> This is not sufficient. 'string' can be == 'iterator.string' for different strings in the case where the allocator re-uses the same address in a subsequent allocation.

I see - in the case that 'string' has been freed, reallocated and overwritten by 'iterator.string'.  In which  case there is no first string anymore :)

Is this a problem, so long as the first thing we do to the TextBreakIterator is call toStart() ?   It doesn't seem to me that TextBreakIterator maintains any sort of state which clings to the string once its been rewound.  (its parent, QTextBoundaryFinder, doesn't do a copy either - or we cd compare against *that*)
Comment 11 chris reiss 2011-02-11 13:01:05 PST
Created attachment 82164 [details]
Incomplete - seek feedback on this approach.

Curious to hear feedback on this approach to avoiding the deep string copy while safely reusing the TextIterator.    

I found that the big performance occurs hit occurs  down the stack from    
   RenderBlockLineLayout::findNextLineBreak, 
   WebCore::isBreakable, 
   WebCore::nextBreakablePosition and 
   WebCore::lineBreakIterator and finally to TextBreakIteratorQt.cpp.

Here i protect the memory using an optional argument and PassRefPtr .

Is this too disruptive?   Also - this patch isn't complete, some function signatures would need to change in other ports of TextIterator.   I just wanted to see if there were any fundamental objections to this approach before I roll up a complete patch.

i also have timing results comparing no caching, the current (unsafe) code, and this code if ppl are interested.   Also - Am i using PassRefPtr correctly?  It doesn't crash, anyway :)
Comment 12 WebKit Review Bot 2011-02-14 07:13:36 PST
Attachment 82164 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/platform/text/TextBreakIter..." exit_code: 1

Source/WebCore/platform/text/qt/TextBreakIteratorQt.cpp:25:  Alphabetical sorting problem.  [build/include_order] [4]
Source/WebCore/platform/text/qt/TextBreakIteratorQt.cpp:50:  Missing spaces around =  [whitespace/operators] [4]
Source/WebCore/platform/text/qt/TextBreakIteratorQt.cpp:50:  Use 0 instead of NULL.  [readability/null] [5]
Source/WebCore/platform/text/qt/TextBreakIteratorQt.cpp:66:  Missing spaces around =  [whitespace/operators] [4]
Source/WebCore/platform/text/qt/TextBreakIteratorQt.cpp:66:  Use 0 instead of NULL.  [readability/null] [5]
Source/WebCore/platform/text/qt/TextBreakIteratorQt.cpp:72:  Missing spaces around ==  [whitespace/operators] [3]
Source/WebCore/rendering/break_lines.h:25:  Alphabetical sorting problem.  [build/include_order] [4]
Source/WebCore/rendering/break_lines.h:27:  Alphabetical sorting problem.  [build/include_order] [4]
Source/WebCore/rendering/break_lines.h:32:  Code inside a namespace should not be indented.  [whitespace/indent] [4]
Source/WebCore/rendering/break_lines.h:32:  Missing spaces around =  [whitespace/operators] [4]
Source/WebCore/rendering/break_lines.h:32:  Use 0 instead of NULL.  [readability/null] [5]
Source/WebCore/rendering/break_lines.h:34:  Missing spaces around =  [whitespace/operators] [4]
Source/WebCore/rendering/break_lines.h:34:  Use 0 instead of NULL.  [readability/null] [5]
Source/WebCore/platform/text/TextBreakIterator.h:26:  Alphabetical sorting problem.  [build/include_order] [4]
Source/WebCore/platform/text/TextBreakIterator.h:49:  Missing spaces around =  [whitespace/operators] [4]
Source/WebCore/platform/text/TextBreakIterator.h:49:  Use 0 instead of NULL.  [readability/null] [5]
Source/WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:29:  Alphabetical sorting problem.  [build/include_order] [4]
Source/WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:244:  Missing spaces around =  [whitespace/operators] [4]
Source/WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:244:  Use 0 instead of NULL.  [readability/null] [5]
Total errors found: 19 in 6 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 13 WebKit Review Bot 2011-02-14 07:16:43 PST
Attachment 82164 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/7907608
Comment 14 Collabora GTK+ EWS bot 2011-02-14 07:22:18 PST
Attachment 82164 [details] did not build on gtk:
Build output: http://queues.webkit.org/results/7905638
Comment 15 Build Bot 2011-02-14 07:38:01 PST
Attachment 82164 [details] did not build on win:
Build output: http://queues.webkit.org/results/7908612
Comment 16 chris reiss 2011-02-14 08:57:45 PST
Here are some timing results which motivate this caching.

To recap, the code as it is has two errors, the length if the compare is incorrect and there is a risk of a dangling pointer if iterator.string gets destroyed.

The simple fix is to
   *take a deep copy of the string in Iterator.

The more extensive fix is the above patch (or something like it): https://bugs.webkit.org/attachment.cgi?id=82164, 
which (tries to) protect the string in memory using RefPtr's, and only
recycles Iterators in the path where the performance hit seems to occur.

Here is time spent in  TextBreakIterator* setUpIterator(...), using QtTestBrowser on Ubuntu (debug build, 2.8Gz) for the different alternatives :

----
URL : http://en.wikipedia.org/wiki/List_of_science_fiction_television_programs

  No caching :  36 milliseconds    (here we always create a new iterator.)
  
  deep copy :   23 milliseconds

  RefPtr appraoch : 16 milliseconds

here is another set of results, using one enormous <div> (attached) :
----
  No caching : 700 milliseconds (ouch)
  deep copy :  9 milliseconds
  RefPtr's  :  5 milliseconds.
Comment 17 chris reiss 2011-02-14 08:59:07 PST
Created attachment 82322 [details]
one big <div> block
Comment 18 Alexis Menard (darktears) 2011-02-14 10:46:51 PST
Comment on attachment 82164 [details]
Incomplete - seek feedback on this approach.

It breaks the build in many platforms you should take a look.
Comment 19 chris reiss 2011-02-17 07:43:38 PST
Created attachment 82802 [details]
proposed approach to safe recycling without doing a deep copy of the string

Posting this as sort of a straw man, a possible way to safely recycle the TextIterator without doing a deep copy of the string.   Is this too disruptive to other ports?  Am I using RefPtr's correctly.

Please also see timing results above ...
Comment 20 WebKit Review Bot 2011-02-17 07:47:35 PST
Attachment 82802 [details] did not build on chromium:
Build output: http://queues.webkit.org/results/7927642
Comment 21 chris reiss 2011-02-17 07:56:18 PST
Created attachment 82806 [details]
(fix to chromium build break)
Comment 22 Andreas Kling 2011-02-18 02:47:23 PST
Comment on attachment 82806 [details]
(fix to chromium build break)

View in context: https://bugs.webkit.org/attachment.cgi?id=82806&action=review

> Source/WebCore/ChangeLog:8
> +        No new tests. (OOPS!)

This line should be replaced, either mention what tests are added, or explain why none are needed.

> Source/WebCore/ChangeLog:9
> +

I'm missing a comment block here explaining that you're adding a way for TextBreakIterators to hold a reference to the last string they operate on for reuse purposes.

> Source/WebCore/platform/text/qt/TextBreakIteratorQt.cpp:48
> +        TextBreakIterator(QTextBoundaryFinder::BoundaryType type, const UChar* string, int length, PassRefPtr<StringImpl> origText = 0)

"origText" is a bad variable name, we stay away from abbreviations in WebKit style. Repeated below.

> Source/WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp:244
> +TextBreakIterator* lineBreakIterator(const UChar* string, int length, PassRefPtr<StringImpl> origText = 0 /* used for memory-safe recycling of Iterator */)

No need for the " = 0 /* used for memory-safe recycling of Iterator */" here.

> Source/WebCore/rendering/break_lines.h:28
> +typedef PassRefPtr<StringImpl> Blerk; 

Blerk?
Comment 23 chris reiss 2011-02-18 08:16:30 PST
Created attachment 82961 [details]
style fixes 

(and, um, de-blerked :) )
Comment 24 Alexis Menard (darktears) 2011-02-21 09:17:46 PST
Comment on attachment 82961 [details]
style fixes 

View in context: https://bugs.webkit.org/attachment.cgi?id=82961&action=review

> Source/WebCore/ChangeLog:9
> +        to the string they operate on.   In this patch, the reference is only used to

Extra space here.
Comment 25 chris reiss 2011-02-21 10:25:17 PST
Created attachment 83180 [details]
Using RefPtr's to safely reuse textiterators

Fix spacing in changelog.
Comment 26 Alexis Menard (darktears) 2011-02-21 10:32:58 PST
Comment on attachment 83180 [details]
Using RefPtr's to safely reuse textiterators

View in context: https://bugs.webkit.org/attachment.cgi?id=83180&action=review

> Source/WebCore/ChangeLog:9
> +        to the string they operate on.  In this patch, the reference is only used to

Sorry it's one space after the dot :(.
Comment 27 chris reiss 2011-02-21 10:40:38 PST
Created attachment 83185 [details]
Using RefPtr's to safely reuse textiterators

Another correction to spaces in Changelog.

Happy to fix it, thanks Alexis.
Comment 28 Eric Seidel (no email) 2011-02-24 02:51:32 PST
I'm not sure I understand this patch.  It's not Qt only however since you affect all platforms.
Comment 29 chris reiss 2011-02-24 08:39:34 PST
This is an optimization.

When we're rendering lots of text in RenderBlock::findNextLineBreak( ), we make many calls to isBreakable( ), which in turns creates a lot of TextBreakIterators.    This got very expensive in 
the Qt port, for a <div> with lots of text, 0.7 seconds were spent in setUpIterator.  See comment 16 .

Within the RenderBlock::findNextLine( ) loop, we're passing the same string repeatedly to isBreakable( ).  So we'd like to reuse the TextBreakIterator from last time.   We'd like setupTextIterator to recognize that the string
hasn't changed.

The wrinkle is that a raw UChar* is passed to setUpIterator.    The pointer may not have changed since last time, but there's an outside chance that the memory has been deleted and reallocated (isBreakable() has lots of call sites.)  
So setupTextIterator( ) is never sure that it can reuse the last iterator.

This patch gives the call site the option to pass a RefPtr string to isBreakable( ).   If the parameter is present, setupTextIterator will use it to avoid recreating iterators on the same string.

So I needed to add that parameter to code outside the Qt port.    The hope is - it's essentially a NOOP for other ports so they wouldn't mind too much,
and perhaps even find it useful at some point.
Comment 30 mitz 2011-02-24 08:50:01 PST
I think the "optimization" has just been removed in http://trac.webkit.org/changeset/79567 and Ned Holbrook has a patch that provides a safe, cross-platform optimization.
Comment 31 mitz 2011-02-24 08:51:29 PST
See bug 54912.
Comment 32 Andreas Kling 2011-03-18 09:13:55 PDT
Comment on attachment 83185 [details]
Using RefPtr's to safely reuse textiterators

This patch needs to be updated to match the current code in TextBreakIteratorQt.
I'm not sure the patch makes sense anymore though, is there still a performance issue with the cross-platform optimization in place?
Comment 33 Andreas Kling 2011-03-18 09:16:14 PDT
Unblocking 2.2 release as this is no longer about the crash that was fixed in bug 55139.)
Comment 34 Jocelyn Turcotte 2014-02-03 03:50:48 PST
=== Bulk closing of Qt bugs ===

If you believe that this bug report is still relevant for a non-Qt port of webkit.org, please re-open it.

If you believe that this is still an important QtWebKit bug, please fill a new report at https://bugreports.qt-project.org and add a link to this issue. See http://qt-project.org/wiki/ReportingBugsInQt for additional guidelines.