Bug 103295 - TextIterator unnecessarily converts 8 bit strings to 16 bits
Summary: TextIterator unnecessarily converts 8 bit strings to 16 bits
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: HTML Editing (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Michael Saboff
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-26 14:16 PST by Michael Saboff
Modified: 2019-03-30 09:42 PDT (History)
8 users (show)

See Also:


Attachments
Patch (14.02 KB, patch)
2012-11-26 16:05 PST, Michael Saboff
no flags Details | Formatted Diff | Diff
Updated Patch (14.27 KB, patch)
2012-11-27 11:35 PST, Michael Saboff
bfulgham: review-
Details | Formatted Diff | Diff
ANother updated patch (14.27 KB, patch)
2012-11-27 14:21 PST, Michael Saboff
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Saboff 2012-11-26 14:16:49 PST
TextIterator uses String::character() in emitText().  Also, the only access to text in a TextIterator is UChar* characters().

Given that most strings are 8 bit, TextIterator and functions that use TextIterator like plainText() should be changed to use Strings instead of UChar pointers.
Comment 1 Michael Saboff 2012-11-26 16:05:55 PST
Created attachment 176103 [details]
Patch

This change is a progression on the performance test attached to https://bugs.webkit.org/show_bug.cgi?id=81192 referenced in the code.
Comparing the geometric mean of 3 runs before and after the patch:
           Baseline   With Patch   Delta
innerText:  3676.2ms      3032.87ms   +17%
outerText:  3625.7ms      3603.27ms    +0.6%

It is also great improvement on the number of strings/characters that need to be up converted to 16 bits in our page load memory test.

Before, 21289 (10.73%) strings up converted totaling 432,175 characters using 864,350 bytes
After, 2201 ( 1.08%) strings up converted totaling 82,394 characters using 164,788 bytes
This is a 90% reduction of strings and an 81% reduction in characters / bytes.
Comment 2 Michael Saboff 2012-11-26 16:10:07 PST
After posting, I noticed code that should have been removed from TextIterator.h:

#if 0
UChar* plainTextToMallocAllocatedBuffer(const Range*, unsigned& bufferLength, bool isDisplayString, TextIteratorBehavior = TextIteratorDefaultBehavior);
#endif

It has been removed from my working code.  I can repost if needed.
Comment 3 Darin Adler 2012-11-26 16:56:24 PST
Comment on attachment 176103 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=176103&action=review

> Source/WebCore/editing/TextIterator.cpp:468
> +    ASSERT(static_cast<int>(index) < length());
> +    if (!(static_cast<int>(index) < length()))

I think the typecast needs to be done in the other direction, casting length() to an unsigned, otherwise large indexes will get past this check because they get converted to negative numbers.

> Source/WebCore/editing/TextIterator.h:66
> +#if 0
>  UChar* plainTextToMallocAllocatedBuffer(const Range*, unsigned& bufferLength, bool isDisplayString, TextIteratorBehavior = TextIteratorDefaultBehavior);
> +#endif

Why just #if'd out?

> Source/WebCore/editing/TextIterator.h:103
> +    int startOffset() const { return m_positionStartOffset; }
> +    const String string() const { return m_text; }
> +    const UChar* characters() const { return m_textCharacters ? m_textCharacters : m_text.characters() + startOffset(); }
> +    UChar characterAt(unsigned index) const;
> +    void appendTextToStringBuilder(StringBuilder&) const;

This additional API makes TextIterator a lot more confusing to use than it was before. The old API was designed to make it hard to use incorrectly; there was no copying of text ever. I’m particularly confused about how to use the string() function. Do all these new functions need to be public?

The string() function should not return "const String". It should just return "String".

> Source/WebKit/mac/WebView/WebFrame.mm:500
> -    // This will give a system malloc'd buffer that can be turned directly into an NSString
> -    unsigned length;
> -    UChar* buf = plainTextToMallocAllocatedBuffer(core(range), length, true);
> -    
> -    if (!buf)
> -        return [NSString string];
> -
> -    // Transfer buffer ownership to NSString
> -    return [[[NSString alloc] initWithCharactersNoCopy:buf length:length freeWhenDone:YES] autorelease];
> +    return plainText(core(range), TextIteratorDefaultBehavior, true);

Why is it OK to remove the plainTextToMallocAllocatedBuffer optimization? That was a critical change back when Antti originally did it. Did you research why it was needed and determine it’s not needed any more?
Comment 4 Michael Saboff 2012-11-27 11:34:29 PST
(In reply to comment #3)
> (From update of attachment 176103 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=176103&action=review
> 
> > Source/WebCore/editing/TextIterator.cpp:468
> > +    ASSERT(static_cast<int>(index) < length());
> > +    if (!(static_cast<int>(index) < length()))
> 
> I think the typecast needs to be done in the other direction, casting length() to an unsigned, otherwise large indexes will get past this check because they get converted to negative numbers.
> 
> > Source/WebCore/editing/TextIterator.h:66
> > +#if 0
> >  UChar* plainTextToMallocAllocatedBuffer(const Range*, unsigned& bufferLength, bool isDisplayString, TextIteratorBehavior = TextIteratorDefaultBehavior);
> > +#endif
> 
> Why just #if'd out?
> 
> > Source/WebCore/editing/TextIterator.h:103
> > +    int startOffset() const { return m_positionStartOffset; }
> > +    const String string() const { return m_text; }
> > +    const UChar* characters() const { return m_textCharacters ? m_textCharacters : m_text.characters() + startOffset(); }
> > +    UChar characterAt(unsigned index) const;
> > +    void appendTextToStringBuilder(StringBuilder&) const;
> 
> This additional API makes TextIterator a lot more confusing to use than it was before. The old API was designed to make it hard to use incorrectly; there was no copying of text ever. I’m particularly confused about how to use the string() function. Do all these new functions need to be public?
> 
> The string() function should not return "const String". It should just return "String".
> 
> > Source/WebKit/mac/WebView/WebFrame.mm:500
> > -    // This will give a system malloc'd buffer that can be turned directly into an NSString
> > -    unsigned length;
> > -    UChar* buf = plainTextToMallocAllocatedBuffer(core(range), length, true);
> > -    
> > -    if (!buf)
> > -        return [NSString string];
> > -
> > -    // Transfer buffer ownership to NSString
> > -    return [[[NSString alloc] initWithCharactersNoCopy:buf length:length freeWhenDone:YES] autorelease];
> > +    return plainText(core(range), TextIteratorDefaultBehavior, true);
> 
> Why is it OK to remove the plainTextToMallocAllocatedBuffer optimization? That was a critical change back when Antti originally did it. Did you research why it was needed and determine it’s not needed any more?
Comment 5 Michael Saboff 2012-11-27 11:35:12 PST
Created attachment 176308 [details]
Updated Patch

(In reply to comment #3)
> (From update of attachment 176103 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=176103&action=review
> 
> > Source/WebCore/editing/TextIterator.cpp:468
> > +    ASSERT(static_cast<int>(index) < length());
> > +    if (!(static_cast<int>(index) < length()))
> 
> I think the typecast needs to be done in the other direction, casting length() to an unsigned, otherwise large indexes will get past this check because they get converted to negative numbers.

Done.

> > Source/WebCore/editing/TextIterator.h:66
> > +#if 0
> >  UChar* plainTextToMallocAllocatedBuffer(const Range*, unsigned& bufferLength, bool isDisplayString, TextIteratorBehavior = TextIteratorDefaultBehavior);
> > +#endif
> 
> Why just #if'd out?

Oversight in original patch - fixed.
 
> > Source/WebCore/editing/TextIterator.h:103
> > +    int startOffset() const { return m_positionStartOffset; }
> > +    const String string() const { return m_text; }
> > +    const UChar* characters() const { return m_textCharacters ? m_textCharacters : m_text.characters() + startOffset(); }
> > +    UChar characterAt(unsigned index) const;
> > +    void appendTextToStringBuilder(StringBuilder&) const;
> 
> This additional API makes TextIterator a lot more confusing to use than it was before. The old API was designed to make it hard to use incorrectly; there was no copying of text ever. I’m particularly confused about how to use the string() function. Do all these new functions need to be public?

I made startOffset() and string() private.  The way to access text data is as before via characters(), if one character is wanted use characterAt() or if appending to a StringBuilder using appendTextToStringBuilder().  The two new functions are optimized for 8 bit strings.

> The string() function should not return "const String". It should just return "String".

Done.

> > Source/WebKit/mac/WebView/WebFrame.mm:500
> > -    // This will give a system malloc'd buffer that can be turned directly into an NSString
> > -    unsigned length;
> > -    UChar* buf = plainTextToMallocAllocatedBuffer(core(range), length, true);
> > -    
> > -    if (!buf)
> > -        return [NSString string];
> > -
> > -    // Transfer buffer ownership to NSString
> > -    return [[[NSString alloc] initWithCharactersNoCopy:buf length:length freeWhenDone:YES] autorelease];
> > +    return plainText(core(range), TextIteratorDefaultBehavior, true);
> 
> Why is it OK to remove the plainTextToMallocAllocatedBuffer optimization? That was a critical change back when Antti originally did it. Did you research why it was needed and determine it’s not needed any more?

The original change was done in http://trac.webkit.org/changeset/24832 to fix fragmentation caused by plainText() on large full documents.  The change did two things to fix the problem, use system malloc and allocate large buffers.  The patch I propose goes back to using fastMalloc, but it keeps the large buffer behavior exactly for the original reason.

I tested it with the test referenced in the code and posted those results with the original patch.  That is a performance test and not a memory test.  The test site "http://dscoder.com/test.txt" referenced in Antti's change is no longer available.  Instead I tried some memory tests using the HTML 5 spec file in PerformanceTests/Parser/resources/html5.html (a 5.8MB file).  Loading this file invokes plainText() once to create a 2.6MB String.  I loaded this file and measured memory before and after this patch:
                          Baseline     With Patch   
Fast Malloc Allocations     835719         720998
Fast Malloc Dirty          133.1MB        127.5MB
System Malloc Allocations    35318          35320
System Malloc Dirty         25.2MB         22.8MB
Total Dirty                171.0MB        162.7MB
Comment 6 Brent Fulgham 2012-11-27 12:26:52 PST
Comment on attachment 176308 [details]
Updated Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=176308&action=review

This looks like a huge improvement!  However, I think it will be even better if you don't copy the entire String each time you want to access a single character :-)

r- for this reason (or please explain what I'm misunderstanding about your code change).

> Source/WebCore/editing/TextIterator.cpp:472
> +        return string()[startOffset() + index];

Currently this makes a copy of the string, then returns one character from it!

> Source/WebCore/editing/TextIterator.cpp:480
> +        builder.append(string(), startOffset(), length());

Again, I think this is copying for no real reason...

> Source/WebCore/editing/TextIterator.h:110
> +    String string() const { return m_text; }

Isn't this making a copy?  I think your original "const String" should probably have just been "const String&"
Comment 7 Michael Saboff 2012-11-27 14:21:03 PST
Created attachment 176338 [details]
ANother updated patch

(In reply to comment #6)
> (From update of attachment 176308 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=176308&action=review
> 
> This looks like a huge improvement!  However, I think it will be even better if you don't copy the entire String each time you want to access a single character :-)
> 
> r- for this reason (or please explain what I'm misunderstanding about your code change).
> 
> > Source/WebCore/editing/TextIterator.cpp:472
> > +        return string()[startOffset() + index];
> 
> Currently this makes a copy of the string, then returns one character from it!
> 
> > Source/WebCore/editing/TextIterator.cpp:480
> > +        builder.append(string(), startOffset(), length());
> 
> Again, I think this is copying for no real reason...
> 
> > Source/WebCore/editing/TextIterator.h:110
> > +    String string() const { return m_text; }
> 
> Isn't this making a copy?  I think your original "const String" should probably have just been "const String&"

Made string() be const String& as suggested.
Comment 8 Brent Fulgham 2012-11-27 21:20:09 PST
Comment on attachment 176338 [details]
ANother updated patch

Looks great!  Thanks.  r=me.
Comment 9 WebKit Review Bot 2012-11-27 21:28:23 PST
Comment on attachment 176338 [details]
ANother updated patch

Clearing flags on attachment: 176338

Committed r135972: <http://trac.webkit.org/changeset/135972>
Comment 10 WebKit Review Bot 2012-11-27 21:28:27 PST
All reviewed patches have been landed.  Closing bug.