Bug 25779 - Allow Strings to be created with one malloc node with no copying
Summary: Allow Strings to be created with one malloc node with no copying
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Platform (show other bugs)
Version: 528+ (Nightly build)
Hardware: All OS X 10.5
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-05-13 20:33 PDT by Dave Moore
Modified: 2009-05-24 22:41 PDT (History)
1 user (show)

See Also:


Attachments
Fix (9.43 KB, patch)
2009-05-13 20:43 PDT, Dave Moore
darin: review-
Details | Formatted Diff | Diff
Fix (5.19 KB, patch)
2009-05-14 13:20 PDT, Dave Moore
eric: review-
Details | Formatted Diff | Diff
Fix (5.02 KB, patch)
2009-05-15 09:12 PDT, Dave Moore
darin: review+
Details | Formatted Diff | Diff
Additional changes (22.50 KB, patch)
2009-05-24 21:04 PDT, Dave Moore
mjs: review-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dave Moore 2009-05-13 20:33:53 PDT
Currently when creating StringImpls we have to choose between these two apis:

    static PassRefPtr<StringImpl> create(const UChar*, unsigned length);
    static PassRefPtr<StringImpl> adopt(StringBuffer&);

The first mallocs an object big enough to hold both the StringImpl and a buffer to contain the characters that are passed to it. These characters are copied. The second doesn't copy the characters but you have to create a second object, the StringBuffer, which hands its character buffer to the StringImpl.

I would like to add a new method,
    static String createUninitialized(unsigned length, UChar*& data);

on String that returns a pointer to the inline buffer so that the caller can fill the characters in, yielding a String with one malloc node without any copying.
Comment 1 Dave Moore 2009-05-13 20:43:28 PDT
Created attachment 30312 [details]
Fix
Comment 2 Darin Adler 2009-05-14 02:49:47 PDT
Comment on attachment 30312 [details]
Fix

> +        * platform\text\PlatformString.h:
> +        * platform\text\String.cpp:
> +        * platform\text\StringImpl.cpp:
> +        * platform\text\StringImpl.h
> +:

The paths here are using the Windows-style \ patch separator.

>      unsigned end = m_length - 1;
> -    
> +
>      // skip white space from start

Your patch here strips all trailing whitespace from this file. Although we have no formal policy about this, and I believe the git tool does encourage you to do this, we normally discourage including changes like that in the same check-in with substantive changes, since it confuses things in both that check-in and the file history.

We also don't have agreement on whether to do something like this project-wide, and that's something I'd like to see before taking patches that just remove trailing whitespace.

There are more lines in your patch from the whitespace change than from the substantive one.

> +    if (!length) {
> +        data = NULL;

We use 0 in this project, not NULL. See the WebKit style guide.

> +PassRefPtr<StringImpl> StringImpl::create(const UChar* characters, unsigned length)
> +{
> +    if (!characters || !length)
> +        return empty();
> +
> +    UChar *data;

We put the * next to the type -- UChar*, not UChar * -- see the WebKit style guide.

> -    UChar* data = reinterpret_cast<UChar*>(buffer + sizeof(StringImpl));
> +    UChar *data;

Same here.

I'm slightly concerned that this patch adds a branch or two and a new level of function call to a common code path. I assume it won't slow anything down measurably though.

The substance of the patch looks good. Please fix these style issues.
Comment 3 Dave Moore 2009-05-14 13:20:44 PDT
Created attachment 30351 [details]
Fix
Comment 4 Darin Adler 2009-05-14 14:00:36 PDT
Comment on attachment 30351 [details]
Fix

> +        * platform/text/PlatformString.h:
> +        * platform/text/String.cpp:
> +        * platform/text/StringImpl.cpp:
> +        * platform/text/StringImpl.h
> +:

The lists of affected functions are missing. The colon on this last line is on a separate line.

> +    // Returns an uninitialized string. The characters needs to be written
> +    // into the buffer returned in data before using
> +    static String createUninitialized(unsigned length, UChar*& data);

The comment seems to stop mid-sentence. Or maybe it's just a missing period.

> +String String::createUninitialized(unsigned length, UChar*& data)
> +{
> +    return StringImpl::createUninitialized(length, data);
> +}

This should be inlined. No reason to pay function call overhead here.

r=me as is, but maybe fix those things above and post a new patch?
Comment 5 Eric Seidel (no email) 2009-05-14 22:43:23 PDT
Comment on attachment 30351 [details]
Fix

The fix is fine.  Marking this r- to remove it from the commit queue.  I think Dave Moore has commit bit?  Otherwise I'm happy to commit this once he posts a patch with fixed comment (and darin's nits addressed).
Comment 6 Dave Moore 2009-05-15 09:12:32 PDT
Created attachment 30389 [details]
Fix
Comment 7 Darin Adler 2009-05-15 10:21:37 PDT
Comment on attachment 30389 [details]
Fix

> +        * platform/text/StringImpl.cpp:
> +        (WebCore::StringImpl::createUninitialized(),
> +         create(const UChar* characters, unsigned length),
> +         create(const char* characters, unsigned length))

This is not the standard ChangeLog format for the function names. In the future, using the prepare-ChangeLog script can help you get the standard format.

r=me
Comment 8 Mark Rowe (bdash) 2009-05-16 20:45:11 PDT
Landed in r43808.
Comment 9 Darin Adler 2009-05-16 21:13:31 PDT
There are 8 call sites for StringImpl::adopt and 50 call sites for String::adopt. Some those should use this new function for better efficiency. Should we file a new bug about that. Dave, would you be willing to work on that?
Comment 10 Dave Moore 2009-05-18 08:08:18 PDT
I'd be happy to work on that.
Comment 11 Darin Adler 2009-05-18 09:41:26 PDT
I have a few other thoughts about Vector::adopt call sites:

    1) In cases where the string is built up incrementally and often ends up quite large, a Vector and adopt is probably still a good idea.

    2) In cases where the string is small, we may want to copy the string and avoid keeping around two memory blocks, even if we are in the Vector adopt call.

    3) In some cases where we build up the string incrementally and it usually stays small, we could consider using Vector<UChar, 32> (or some suitable size) and making an adopt function that understands how to create a string two different ways depending on whether the inline storage was used or not.

A useful statistic would be how many single-block strings vs. double-block strings are allocated during a page load.
Comment 12 Dave Moore 2009-05-24 21:04:07 PDT
Created attachment 30639 [details]
Additional changes

This patch changes many of the uses of adopt() to createUninitialized(). Most are just straight-forward wins. A couple are places where we walk through a list of nodes and extract the strings from them. These used to build up the string incrementally, appending to a buffer. Now I walk through the list twice, the first to discover the total length and the second to write the string into the inline buffer. I think it's worth the second navigation to keep everything in one malloc node.

In places where the logic was somewhat complicated as to the creation of the to be adopted buffer I left it as it was.

I haven't yet done the automatic inlining of small adopted buffers. I'm not sure what the right size is, at which it would be worth the extra copy.
Comment 13 Maciej Stachowiak 2009-05-24 22:41:56 PDT
Comment on attachment 30639 [details]
Additional changes

This patch should really go in a separate bugzilla bug, as originally suggested by Darin. It is awkward to have a new patch for review on a resolved bug. r- for that but I'd be glad to give substantive review when it is attached to a new bug report. I hope you don't mind the r- for process reasons, it just gets messy to have bug reports reused for a different purpose, especially when already resolved. In general I think this work is great, just trying to keep the bug system clean.

A few quick comments: 

1) Please listen to the script and <set EMAIL_ADDRESS environment variable> :-)

2) For determining the tradeoff point for inline buffers - you could make a microbenchmark to check or make a rough estimate. The tradeoff is basically this: (a) FastMalloc allocate a StringImpl + allocate a buffer of N UChars. (b) One-shot allocate a StringImpl with extra space for N UChars, and copy N UChars. When N is large, the copy will be more expensive than the extra allocation of the StringImpl body. It should be possible to estimate the tradeoff point with microbenchmarks. Also, since space for N UChars will be reserved on the stack, there is a practical limit to N, to avoid risk of blowing the stack.