Bug 67081

Summary: Basic enhancements to StringBuilder
Product: WebKit Reporter: Xianzhu Wang <wangxianzhu>
Component: Web Template FrameworkAssignee: Xianzhu Wang <wangxianzhu>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, aestes, ap, darin, eric, mjs, sullivan, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on:    
Bug Blocks: 66661    
Attachments:
Description Flags
patch
darin: review-, darin: commit-queue-
patch: Basic enhancements to StringBuilder
darin: review-, darin: commit-queue-
patch v3
none
patch v4
darin: review-
patch v5
none
Patch v5.1
darin: review+
patch v6
darin: review+
patch v7 none

Description Xianzhu Wang 2011-08-26 20:54:13 PDT
Now neither StringBuilder::toString() nor StringBuilder::toStringPreserveCapacity() is const. This prevents some const Vector<UChar> usages from being replaced with StringBuilder.
We can change StringBuilder::toStringReserveCapacity() to const, perhaps also change a better name for it.
Comment 1 Xianzhu Wang 2011-09-08 01:17:54 PDT
Created attachment 106709 [details]
patch
Comment 2 Darin Adler 2011-09-09 18:57:46 PDT
Comment on attachment 106709 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=106709&action=review

This patch still has too many changes in it at once. Each change needs to be looked at carefully. Can we do this in smaller pieces. The issues are different for different items here, so reviewing all at once is too hard. I made a few comments.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:46
>          // If we're appending to an empty string, and there is not buffer

Small grammar mistake: is not a buffer

> Source/JavaScriptCore/wtf/text/StringBuilder.h:59
> +        // If we're appending to an empty string, and there is not buffer

Small grammar mistake: is not a buffer

> Source/WebCore/html/parser/HTMLPreloadScanner.cpp:69
> +            AtomicString attributeName(iter->m_name.toStringPreserveCapacity());

We don’t want to allocate a string here if there is already an AtomicString with the correct value. The old code did that correctly, but the new code instead always a string.

> Source/WebCore/html/parser/HTMLTokenizer.cpp:1590
> -    return vectorEqualsString(m_temporaryBuffer, expectedString);
> +    return m_temporaryBuffer.toStringPreserveCapacity() == expectedString;

It seems like a bad design that we have to convert a string buffer to a string just to compare it with a string. I suggest instead we overload the == operator so it can work directly on a StringBuffer.
Comment 3 Darin Adler 2011-09-09 18:58:34 PDT
Comment on attachment 106709 [details]
patch

I’m going to say review-. The change to StringBuilder is fine. But all the changes to adopt StringBuilder need to be reviewed individually or in smaller groups that have similar issues, rather than in one large patch.
Comment 4 Xianzhu Wang 2011-09-10 01:19:29 PDT
I will separate the patch into 2:

1. several enhancements to StringBuilder, including:
  a) toStringPreserveCapacity() const
  b) toAtomicString() const
  c) == and != operators
  d) append(const StringBuilder&)

2. Replace Vector<UChar> in html/xml/VTT parsers with StringBuilder after the above patch is landed (that is, the part of the last patch excluding the changes to StringBuilder)

And there will be also several patches to replace remaining Vector<UChar> usages. If some replacements require other enhancements/changes to StringBuilder, I'll also create separate bugs and patches.
Comment 5 Xianzhu Wang 2011-09-10 01:27:03 PDT
Created attachment 106970 [details]
patch: Basic enhancements to StringBuilder
Comment 6 Darin Adler 2011-09-13 10:22:54 PDT
Comment on attachment 106970 [details]
patch: Basic enhancements to StringBuilder

View in context: https://bugs.webkit.org/attachment.cgi?id=106970&action=review

review- because of the mistake in the == functions

I have a question about the toString and toStringPreserveCapacity functions in this class. It seems that if you do this:

    builder.toString()
    builder.append(character)
    builder.toString()

and the buffer still has additional capacity remaining, then the second toString will return an incorrect result. I think this is a basic implementation flaw in the class as it currently stands.

I also think that for efficiency toString and toStringPreserveCapacity should return const String& rather than String.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:40
> +    friend bool operator==(const String&, const StringBuilder&);
> +    friend bool operator==(const StringBuilder&, const String&);
> +    friend bool operator==(const StringBuilder&, const StringBuilder&);
> +    friend bool operator!=(const String&, const StringBuilder&);
> +    friend bool operator!=(const StringBuilder&, const String&);
> +    friend bool operator!=(const StringBuilder&, const StringBuilder&);

Only two of the functions need to be implemented as friend functions. The rest could just call the others. I’d prefer to do it that way to cut down on the number of friend functions.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:63
> +    void append(const StringBuilder& stringBuilder)

It’s confusing for a member function of class StringBuilder to have an argument named stringBuilder, since "this" is also a string builder. I suggest finding a different name for the argument.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:116
> +    AtomicString toAtomicString() const
> +    {
> +        return AtomicString(characters(), length());
> +    }

This is not good for the case where the the string is not already in the AtomicString table. In that case we would not want to always copy and allocate a new string since we may already have one ready to go. This is likely to be a quite common case and so we should think through how to make sure we have good performance in that case.

There's also a case where we have a String and it already has the flag saying it’s an atomic string. Rehashing the characters in the string just to re-find the string that we already have a reference to is not good.

Due to these issues we probably need "non-preserve-capacity" vs. "preserve-capacity" versions of this function. The fact that we have not yet optimized this is affecting the interface, not just the implementation, of this class.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:159
> +    void swap(StringBuilder& stringBuilder)

It’s good to define a swap function.

I am concerned that this class currently is allowing the compiler to generate the default copy constructor and assignment operator. Any use of these could result in two StringBuilder objects thinking they share the same m_buffer and m_bufferCharacters or even orphan m_bufferCharacters. To fix that, we need to either make the class noncopyable or implement the copy constructor. If we do implement the copy constructor then we can implement the assignment operator trivially by combining the copy constructor with swap.

I think it’s confusing for a member function of class StringBuilder to have an argument named stringBuilder, since "this" is also a string builder. I suggest finding a different name for the argument.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:184
> +inline bool operator==(const StringBuilder& a, const String& b) { return equal(a.impl(), b.impl()); }
> +inline bool operator==(const String& a, const StringBuilder& b) { return equal(a.impl(), b.impl()); }
> +inline bool operator==(const StringBuilder& a, const StringBuilder& b) { return equal(a.impl(), b.impl()); }
> +inline bool operator!=(const StringBuilder& a, const String& b) { return !equal(a.impl(), b.impl()); }
> +inline bool operator!=(const String& a, const StringBuilder& b) { return !equal(a.impl(), b.impl()); }
> +inline bool operator!=(const StringBuilder& a, const StringBuilder& b) { return !equal(a.impl(), b.impl()); }

These functions will all work wrong when m_buffer->length() is longer than m_length. I think the impl() function is probably not a good idea; it’s a sort of unreliable version of toString.
Comment 7 Xianzhu Wang 2011-09-14 02:59:23 PDT
(In reply to comment #6)
> (From update of attachment 106970 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=106970&action=review
> 
> review- because of the mistake in the == functions
> 
> I have a question about the toString and toStringPreserveCapacity functions in this class. It seems that if you do this:
> 
>     builder.toString()
>     builder.append(character)
>     builder.toString()
> 
> and the buffer still has additional capacity remaining, then the second toString will return an incorrect result. I think this is a basic implementation flaw in the class as it currently stands.
> 

I verified several cases and tested, and I think the original code is correct.

The cases:
1. m_string directly uses m_buffer: This only occurs when m_buffer.length() == m_length, so the append() will definitely exceed capacity. New buffer will be allocated and m_string will be set to null.

2. m_string.impl() is a substring of m_buffer, and the append() doesn't exceed capacity. As append() will only change the content after original m_length, m_string will remain unchanged.

3. m_string.impl() is a substring of m_buffer, and the append() exceeds capacity. New buffer will be allocated and m_string will be set to null.

4. All the other mutation operations like clear(), resize() will set m_string/m_buffer to null or a new value.



> I also think that for efficiency toString and toStringPreserveCapacity should return const String& rather than String.

Done.

View in context: https://bugs.webkit.org/attachment.cgi?id=106970&action=review

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:40
>> +    friend bool operator!=(const StringBuilder&, const StringBuilder&);
> 
> Only two of the functions need to be implemented as friend functions. The rest could just call the others. I’d prefer to do it that way to cut down on the number of friend functions.

I changed 4 of them (whose first parameter is StringBuilder) to StringBuilder methods, and the other two as inline functions that call the methods.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:63
>> +    void append(const StringBuilder& stringBuilder)
> 
> It’s confusing for a member function of class StringBuilder to have an argument named stringBuilder, since "this" is also a string builder. I suggest finding a different name for the argument.

Changed to 'other'.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:116

> 
> This is not good for the case where the the string is not already in the AtomicString table. In that case we would not want to always copy and allocate a new string since we may already have one ready to go. This is likely to be a quite common case and so we should think through how to make sure we have good performance in that case.
> 
> There's also a case where we have a String and it already has the flag saying it’s an atomic string. Rehashing the characters in the string just to re-find the string that we already have a reference to is not good.
> 
> Due to these issues we probably need "non-preserve-capacity" vs. "preserve-capacity" versions of this function. The fact that we have not yet optimized this is affecting the interface, not just the implementation, of this class.

Done by creating two versions of it. Use m_string if not null, or m_buffer (directly or create a substring of it). Extracted a private method bufferToStringImpl() for use in toAtomicString(), toAtomicStringPreserveCapacity() and reifyString().

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:159
>> +    void swap(StringBuilder& stringBuilder)
> 
> It’s good to define a swap function.
> 
> I am concerned that this class currently is allowing the compiler to generate the default copy constructor and assignment operator. Any use of these could result in two StringBuilder objects thinking they share the same m_buffer and m_bufferCharacters or even orphan m_bufferCharacters. To fix that, we need to either make the class noncopyable or implement the copy constructor. If we do implement the copy constructor then we can implement the assignment operator trivially by combining the copy constructor with swap.
> 
> I think it’s confusing for a member function of class StringBuilder to have an argument named stringBuilder, since "this" is also a string builder. I suggest finding a different name for the argument.

I choose to make the class noncopyable as with the copy constructor people could easily ignore the cost of it.

Changed the parameter name to "other".

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:184
>> +inline bool operator!=(const StringBuilder& a, const StringBuilder& b) { return !equal(a.impl(), b.impl()); }
> 
> These functions will all work wrong when m_buffer->length() is longer than m_length. I think the impl() function is probably not a good idea; it’s a sort of unreliable version of toString.

Thanks!
Changed them to use WTF::equal(StringImpl*, const UChar*, unsigned length).

Will upload a new patch soon.
Comment 8 Xianzhu Wang 2011-09-14 04:20:14 PDT
Created attachment 107317 [details]
patch v3

After comparing the cost, I chose to use AtomicString(toString()) and AtomicString(toStringPreserveCapacity()) in toAtomicString() and toAtomicStringPreserveCapacity() respectively.

If they don't use toString() and toStringPreserveCapacity(), they still need shrinkToFit() (for toAtomicString()) and something like reifyString() (but without constructing a String). The difference is whether String::String(StringImpl*) or String::String(PassRefPtr<StringImpl>) will be called. The costs of the constructors are trivial, just like those of the constructors of RefPtr and PassRefPtr. The cached m_string might also benefit in some cases.

AtomicString(characters(), m_length) can get a fit-sized independent buffer, but I don't see obvious benefits of it.
Comment 9 Darin Adler 2011-09-16 11:17:05 PDT
(In reply to comment #8)
> After comparing the cost, I chose to use AtomicString(toString()) and AtomicString(toStringPreserveCapacity()) in toAtomicString() and toAtomicStringPreserveCapacity() respectively.

That is the wrong choice when the string happens to be something that’s already in the AtomicString table. In that case, we should do no memory allocation. Only if the string is not in the table do we want to do the normal toString work.

> If they don't use toString() and toStringPreserveCapacity(), they still need shrinkToFit()

That’s wrong. We can and should create a code path that finds the existing AtomicString without modifying the StringBuilder at all.
Comment 10 Darin Adler 2011-09-16 11:20:40 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > (From update of attachment 106970 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=106970&action=review
> > 
> > review- because of the mistake in the == functions
> > 
> > I have a question about the toString and toStringPreserveCapacity functions in this class. It seems that if you do this:
> > 
> >     builder.toString()
> >     builder.append(character)
> >     builder.toString()
> > 
> > and the buffer still has additional capacity remaining, then the second toString will return an incorrect result. I think this is a basic implementation flaw in the class as it currently stands.

> 2. m_string.impl() is a substring of m_buffer, and the append() doesn't exceed capacity. As append() will only change the content after original m_length, m_string will remain unchanged.

Right, that’s the problematic case. The second call to builder.toString() will return the same string as the first, a string that does not include the last appended character. Unless I’m missing something. Did you actually test this case?

> >> Source/JavaScriptCore/wtf/text/StringBuilder.h:40
> >> +    friend bool operator!=(const StringBuilder&, const StringBuilder&);
> > 
> > Only two of the functions need to be implemented as friend functions. The rest could just call the others. I’d prefer to do it that way to cut down on the number of friend functions.
> 
> I changed 4 of them (whose first parameter is StringBuilder) to StringBuilder methods, and the other two as inline functions that call the methods.

Normally it is best style to avoid having binary operators be member functions, and non-members functions are preferred, friend when necessary. Changing them to be member functions has a small downside, since type conversions can’t be done on the left side when the operator is a member function. Probably OK here, though.
Comment 11 Xianzhu Wang 2011-09-17 00:48:53 PDT
(In reply to comment #9)
> (In reply to comment #8)
> > After comparing the cost, I chose to use AtomicString(toString()) and AtomicString(toStringPreserveCapacity()) in toAtomicString() and toAtomicStringPreserveCapacity() respectively.
> 
> That is the wrong choice when the string happens to be something that’s already in the AtomicString table. In that case, we should do no memory allocation. Only if the string is not in the table do we want to do the normal toString work.
>
> > If they don't use toString() and toStringPreserveCapacity(), they still need shrinkToFit() [(for toAtomicString() and something like ...]
> 
> That’s wrong. We can and should create a code path that finds the existing AtomicString without modifying the StringBuilder at all.

I think the following could help us evaluate the cost and choose the best solution:

The cases:
1. string in table
1.1 buffer much over-allocated (that is, shrinkToFit() will allocate new space and copy)
1.2 buffer just fit or over-allocation<20% (that is, shrinkToFit() will do nothing.)
1.2.1 the StringBuilder happens to have only appended one AtomicString (that is, m_string is an AtomicString)
2. string not in table
Descriptions of 2.1, 2.2 and 2.2.1 are the same as 1.1, 1.2 and 1.2.1 respectively.

The implementations:
A. AtomicString(toString())
A1. Use m_string if it is not null, otherwise shrinkToFit() and construct AtomicString over m_buffer or substring of m_buffer (that is, the similar thing in reifyString()).
B. AtomicString(toStringPreserveCapacity())
B1. same as A1 except no shrinkToFit()
C. AtomicString(characters(), length())

       A/A1                       B/B1                  C
1.1    unnecessary shrinkToFit()  trivial cost/good     good
1.2    trivial cost/good          trivial cost/good     good
1.2.1  good                       good                  unnecessary hash
2.1    good                       extra memory space    string space can't be shared
2.2    good                       good                  string space can't be shared
2.2.1  good                       good                  unnecessary hash and allocation

*trivial cost: the cost to set the value of m_string to m_buffer or substring of m_buffer.

Please correct me if I missed anything.

(In reply to comment #10)
> (In reply to comment #7)
> > (In reply to comment #6)
> > 2. m_string.impl() is a substring of m_buffer, and the append() doesn't exceed capacity. As append() will only change the content after original m_length, m_string will remain unchanged.
> 
> Right, that’s the problematic case. The second call to builder.toString() will return the same string as the first, a string that does not include the last appended character. Unless I’m missing something. Did you actually test this case?
> 

Sorry, "'m_string' will remain unchanged" should be "the original result of toString() will remain unchanged and m_string will be set to null".

The unit tests have covered the case.
Comment 12 Xianzhu Wang 2011-09-29 02:59:56 PDT
Hi Darin, what do you think of my analysis in #11? Am I missing something?
Comment 13 Xianzhu Wang 2011-09-29 03:13:45 PDT
P.S. My opinion based on the analysis in #11 is:
Using AtomicString(toString()) and AtomicString(toStringPreserveCapacity()) as the implementation of toAtomicString() and toAtomicStringPreserveCapacity() (the A and B columns in the analysis table), the not good cases are:

1. AtomicString(toString()) when the string is already in the atomic string table, and the StringBuilder is much over-allocated. In this case, the shrinkToFit() might unnecessarily allocate new buffer and copy data.

2. AtomicString(toStringPreserveCapacity()) when the string is not in the atomic string table, and the StringBuilder is much over-allocated. In this case, the string put into the atomic string table will have unnecessary over-allocated buffer.

The other cases seem all good. Because we provide both toAtomicString() and toAtomicStringPreserveCapacity(), the caller can choose one that might have higher chance to avoid the extra cost.
Comment 14 Xianzhu Wang 2011-10-08 20:48:09 PDT
Created attachment 110288 [details]
patch v4

This patch removed toAtomicStringPreserveCapacity() because I think it's not reasonable to hold an over-allocated buffer in the atomic string table.

For toAtomicString() when the string value is already in the string table, there is no extra cost, except the last 'AtomicString(toStringPreserveCapacity())' whose cost is for String(stringImpl) or String(StringImpl::create(stringImpl, 0, m_length) which is small. I've tried to reduce that cost using a HashTranslator in AtomicString.cpp but it brought much complexity in the interface between AtomicString and StringBuilder. I can upload that patch if you prefer it.

The implementation of canShrink() should be tuned (will file another bug for it).
Comment 15 WebKit Review Bot 2011-10-08 20:50:18 PDT
Attachment 110288 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/wtf/text/StringBuilder.h:196:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 1 in 12 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 16 Darin Adler 2011-10-09 19:41:00 PDT
Comment on attachment 110288 [details]
patch v4

View in context: https://bugs.webkit.org/attachment.cgi?id=110288&action=review

> Source/JavaScriptCore/wtf/text/StringBuilder.cpp:171
>      // If the buffer is at least 80% full, don't bother copying. Need to tune this heuristic!

Comment should be removed from here since it was moved into canShrink.
Comment 17 Darin Adler 2011-10-09 19:57:35 PDT
Comment on attachment 110288 [details]
patch v4

View in context: https://bugs.webkit.org/attachment.cgi?id=110288&action=review

This looks good, but is not quite ready to land. I have some comments and there is a problem because this did not build on EWS with the Windows bot.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:118
> +    AtomicString toAtomicString() const
> +    {
> +        if (!m_length)
> +            return AtomicString();
> +        if (canShrink())
> +            return AtomicString(characters(), length());
> +        if (!m_string.isNull())
> +            return AtomicString(m_string);
> +        return AtomicString(toStringPreserveCapacity());

I don’t understand the canShrink logic here. That function determines if it’s a good idea to just use the buffer rather than creating a new one since it’s already mostly full. But here, the code says “if the buffer is already full then don’t use it”, which seems like the opposite of what canShrink is supposed to mean. If this is a good algorithm, then we need a comment explaining why.

The if (!m_string.isNull) part of this function has no value; it just repeats what toStringPreserveCapacity will already do.

Your comment said that there is no easy way to optimize the case where the string is already the right size and happens to be in the table without complicating the relationship between AtomicString and StringBuilder, but that does not seem to be the case given the current form of this function.

If the AtomicString constructor had a version that took a character pointer, a length, and a StringImpl, it could have all the correct logic; StringBuilder::reifyString has nothing in it that is truly specific to StringBuilder policy.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:175
> +    // Disallow the copy constructor and the assignment operator otherwise the cost could be easily ignored.
> +    StringBuilder(const StringBuilder&);
> +    StringBuilder& operator=(const StringBuilder&);

This is a great idea. In WebKit the normal way to do this is to use WTF_NONCOPYABLE.

I would word it like this:

    // Disallow copying since it’s expensive and we don’t want code to do it by accident.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:192
> +    if (s.isEmpty() && !length)
> +        return true;
> +    if (s.length() != length)
> +        return false;

If the StringBuilder is empty, presumably the cost of calling the length function is quite small, so the first check seems to be wasteful.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:195
> +    if (s.characters() == buffer)
> +        return true;

While this condition is possible in theory, it seems it would never occur in practice. Accordingly, I suggest omitting this.

> Source/WebCore/editing/markup.cpp:218
> -    return result.toString().replace(0, "");
> +    String string = result.toString();
> +    return string.replace(0, "");

What’s the rationale for this change? It doesn’t seem that the copy constructor or the assignment operator are relevant here.
Comment 18 Xianzhu Wang 2011-10-09 21:31:31 PDT
Comment on attachment 110288 [details]
patch v4

View in context: https://bugs.webkit.org/attachment.cgi?id=110288&action=review

Thanks Darin for review!

Will upload the new patch after I fixed the break on win.

>> Source/JavaScriptCore/wtf/text/StringBuilder.cpp:171
>>      // If the buffer is at least 80% full, don't bother copying. Need to tune this heuristic!
> 
> Comment should be removed from here since it was moved into canShrink.

Done.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:118
>> +        return AtomicString(toStringPreserveCapacity());
> 
> I don’t understand the canShrink logic here. That function determines if it’s a good idea to just use the buffer rather than creating a new one since it’s already mostly full. But here, the code says “if the buffer is already full then don’t use it”, which seems like the opposite of what canShrink is supposed to mean. If this is a good algorithm, then we need a comment explaining why.
> 
> The if (!m_string.isNull) part of this function has no value; it just repeats what toStringPreserveCapacity will already do.
> 
> Your comment said that there is no easy way to optimize the case where the string is already the right size and happens to be in the table without complicating the relationship between AtomicString and StringBuilder, but that does not seem to be the case given the current form of this function.
> 
> If the AtomicString constructor had a version that took a character pointer, a length, and a StringImpl, it could have all the correct logic; StringBuilder::reifyString has nothing in it that is truly specific to StringBuilder policy.

The comment in canShrink() was misleading. I've changed it to "Only shrink the buffer if it's less than 80% full". The code is: if the buffer is over-allocated, then don't use it.

Your idea is great! Thanks.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:175
>> +    StringBuilder& operator=(const StringBuilder&);
> 
> This is a great idea. In WebKit the normal way to do this is to use WTF_NONCOPYABLE.
> 
> I would word it like this:
> 
>     // Disallow copying since it’s expensive and we don’t want code to do it by accident.

Done.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:192

> 
> If the StringBuilder is empty, presumably the cost of calling the length function is quite small, so the first check seems to be wasteful.

I added the first check to satisfy the next ASSERT(buffer) in case of equal(StringBuilder(), 0, 0).
Removed along with the ASSERT.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:195
>> +        return true;
> 
> While this condition is possible in theory, it seems it would never occur in practice. Accordingly, I suggest omitting this.

This happens when a String is appended to each of two new StringBuilders, or a StringBuilder with non-null m_string is just appended to a new StringBuilder. In this case, the m_strings in the StringBuilders share the same impl. So I prefer keep it.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:196

> 
> Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]

Done.

>> Source/WebCore/editing/markup.cpp:218
>> +    return string.replace(0, "");
> 
> What’s the rationale for this change? It doesn’t seem that the copy constructor or the assignment operator are relevant here.

This is related to the change of the return type of toString() from 'String' to 'const String&', so here needed an explicit copy.
Will update the Source/WebCore/ChangeLog.
Comment 19 Xianzhu Wang 2011-10-09 21:57:28 PDT
Created attachment 110325 [details]
patch v5

I haven't figured out the reason of win-ews failure. It seemed not related to this patch. Just try it.
Comment 20 WebKit Review Bot 2011-10-09 22:00:27 PDT
Attachment 110325 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/wtf/text/AtomicString.h:127:  The parameter name "impl" adds no information, so it should be removed.  [readability/parameter_name] [5]
Total errors found: 1 in 15 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 21 Xianzhu Wang 2011-10-09 23:38:28 PDT
Created attachment 110330 [details]
Patch v5.1
Comment 22 Eric Seidel (no email) 2011-12-19 13:49:54 PST
Darin is really your man here.  Perhaps he can be convinced to take another pass.
Comment 23 Darin Adler 2011-12-22 09:10:36 PST
Comment on attachment 110330 [details]
Patch v5.1

View in context: https://bugs.webkit.org/attachment.cgi?id=110330&action=review

This looks really good. I’m going to say review+ as is, but I think you should consider the improvements I suggest in comments.

> Source/JavaScriptCore/wtf/text/AtomicString.cpp:254
> +struct SubstringInfo {

The suffix info adds nothing. I would call this just Substring, but since it’s not using a RefPtr it probably should be named something more like SubstringLocation.

> Source/JavaScriptCore/wtf/text/AtomicString.cpp:288
> +    SubstringInfo buffer = { impl, offset, length };
> +    return addToStringTable<SubstringInfo, SubstringTranslator>(buffer);

It’s much less expensive to compute the hash if a substring happens to be an entire string; the hash is already pre-computed and stored. This function should either optimize that case or require the caller to not use it in that case.

Also, we want to use the impl and not create a new one if the offset is 0 and then length is the string’s length

> Source/JavaScriptCore/wtf/text/StringBuilder.cpp:67
> +        m_string = String(); // Clear the string earlier to remove the reference to m_buffer if any.

The word “earlier” in this comment makes no sense once the code is checked in. I think by earlier you mean “before checking the reference count of m_buffer” but that’s not really clear.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:36
> +    // Disallow copying since itâs expensive and we donât want code to do it by accident.

We normally stick to ASCII for comments in our source files, so the straight quote is better here.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:52
> -        // If we're appending to an empty string, and there is not buffer
> -        // (in case reserveCapacity has been called) then just retain the
> +        // If we're appending to an empty string, and there is not a buffer
> +        // (in case reserveCapacity has not been called) then just retain the
>          // string.

You should remove the words “in case” from this comment. It would be much clearer without that.

You should rewrap this comment so it’s not three lines long with a trailing single word. We can go wider.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:68
> +        // If we're appending to an empty string, and there is not a buffer
> +        // (in case reserveCapacity has not been called) then just retain the
> +        // string.

You should remove the words “in case” from this comment. It would be much clearer without that.

You should rewrap this comment so it’s not three lines long with a trailing single word. We can go wider.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:119
> +        // If the buffer is much over-allocated, the AtomicString should make a fit-sized copy when needed.

I think I would word the comment like this.

    // If the buffer is sufficiently over-allocated, make a new AtomicString from a copy so its buffer is not so large.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:126
> +        // The AtomicString will create a substring based on m_buffer when needed.

This comment says what AtomicString should do, but the current implementation above does not do it (see comment above).

> Source/JavaScriptCore/wtf/text/StringBuilder.h:200
> +inline bool equal(const StringBuilder& s, const UChar* buffer, unsigned length)
> +{
> +    if (s.length() != length)
> +        return false;
> +    if (s.characters() == buffer)
> +        return true;
> +    return !memcmp(s.characters(), buffer, length * sizeof(UChar));
> +}

For the new 8/16-bit world we may need to rewrite this that it doesn’t use characters() on an 8-bit string.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:202
> +inline bool operator==(const StringBuilder& a, const StringBuilder& b) { return equal(a, b.characters(), b.length()); }

Ditto.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:204
> +inline bool operator==(const StringBuilder& a, const String& b) { return equal(a, b.characters(), b.length()); }

Ditto.

> Source/WebCore/editing/markup.cpp:218
> -    return result.toString().replace(0, "");
> +    String string = result.toString();
> +    return string.replace(0, "");

These are awkward. Maybe changing toString to return a reference for efficiency is not worth the inconvenience.
Comment 24 Xianzhu Wang 2012-01-05 11:42:27 PST
(In reply to comment #23)
> (From update of attachment 110330 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=110330&action=review
> 
> This looks really good. I’m going to say review+ as is, but I think you should consider the improvements I suggest in comments.
> 

Thanks for review!

Sorry for late reply. I'm busy recently. Hopefully I'll update the patch in the next week.
Comment 25 Xianzhu Wang 2012-01-13 14:08:48 PST
Comment on attachment 110330 [details]
Patch v5.1

View in context: https://bugs.webkit.org/attachment.cgi?id=110330&action=review

>> Source/JavaScriptCore/wtf/text/AtomicString.cpp:254
>> +struct SubstringInfo {
> 
> The suffix info adds nothing. I would call this just Substring, but since it’s not using a RefPtr it probably should be named something more like SubstringLocation.

Changed to SubstringLocation.

>> Source/JavaScriptCore/wtf/text/AtomicString.cpp:288
>> +    return addToStringTable<SubstringInfo, SubstringTranslator>(buffer);
> 
> It’s much less expensive to compute the hash if a substring happens to be an entire string; the hash is already pre-computed and stored. This function should either optimize that case or require the caller to not use it in that case.
> 
> Also, we want to use the impl and not create a new one if the offset is 0 and then length is the string’s length

Added check for whole string and call AtomicString::add(StringImpl*) in the case.

>> Source/JavaScriptCore/wtf/text/StringBuilder.cpp:67
>> +        m_string = String(); // Clear the string earlier to remove the reference to m_buffer if any.
> 
> The word “earlier” in this comment makes no sense once the code is checked in. I think by earlier you mean “before checking the reference count of m_buffer” but that’s not really clear.

Done.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:36
>> +    // Disallow copying since itâs expensive and we donât want code to do it by accident.
> 
> We normally stick to ASCII for comments in our source files, so the straight quote is better here.

Done.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:52
>>          // string.
> 
> You should remove the words “in case” from this comment. It would be much clearer without that.
> 
> You should rewrap this comment so it’s not three lines long with a trailing single word. We can go wider.

Done.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:68
>> +        // string.
> 
> You should remove the words “in case” from this comment. It would be much clearer without that.
> 
> You should rewrap this comment so it’s not three lines long with a trailing single word. We can go wider.

Done.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:119
>> +        // If the buffer is much over-allocated, the AtomicString should make a fit-sized copy when needed.
> 
> I think I would word the comment like this.
> 
>     // If the buffer is sufficiently over-allocated, make a new AtomicString from a copy so its buffer is not so large.

Done.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:126
>> +        // The AtomicString will create a substring based on m_buffer when needed.
> 
> This comment says what AtomicString should do, but the current implementation above does not do it (see comment above).

With the added check for whole string in the substring version of AtomicString::add(), if m_string happens to be the same as m_buffer.length(), no substring will be created. Otherwise substring will still be created if no existing AtomicString with the same value.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:200
>> +}
> 
> For the new 8/16-bit world we may need to rewrite this that it doesn’t use characters() on an 8-bit string.

Done.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:202
>> +inline bool operator==(const StringBuilder& a, const StringBuilder& b) { return equal(a, b.characters(), b.length()); }
> 
> Ditto.

Done.

>> Source/JavaScriptCore/wtf/text/StringBuilder.h:204
>> +inline bool operator==(const StringBuilder& a, const String& b) { return equal(a, b.characters(), b.length()); }
> 
> Ditto.

Done.

>> Source/WebCore/editing/markup.cpp:218
>> +    return string.replace(0, "");
> 
> These are awkward. Maybe changing toString to return a reference for efficiency is not worth the inconvenience.

Done.
Comment 26 Xianzhu Wang 2012-01-13 14:09:56 PST
Created attachment 122500 [details]
patch v6

Darin, could you please review the new patch?
Comment 27 Darin Adler 2012-01-16 17:23:08 PST
Comment on attachment 122500 [details]
patch v6

View in context: https://bugs.webkit.org/attachment.cgi?id=122500&action=review

I cleared the commit-queue flag because the patch does not apply so the commit queue would just fail.

> Source/JavaScriptCore/wtf/text/AtomicString.cpp:255
> +    StringImpl* impl;

I would name this “underlyingString” or “baseString” rather than “impl”.

> Source/JavaScriptCore/wtf/text/AtomicString.cpp:288
> +    if (!offset && length == r->length())
> +        return add(r);

This should be >= rather than ==.

> Source/JavaScriptCore/wtf/text/StringBuilder.h:85
> +        if (!other.m_length)
> +            return;
> +
> +        // If we're appending to an empty string, and there is not a buffer (reserveCapacity has not been called)
> +        // then just retain the string.
> +        if (!m_length && !m_buffer && !other.m_string.isNull()) {
> +            m_string = other.m_string;
> +            m_length = other.m_length;
> +            return;
> +        }
> +        append(other.characters(), other.m_length);

Seems like there should be another blank line after the if block but before the call to append.
Comment 28 Xianzhu Wang 2012-01-17 10:02:50 PST
(In reply to comment #27)
> (From update of attachment 122500 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=122500&action=review
> 
> I cleared the commit-queue flag because the patch does not apply so the commit queue would just fail.

> 
> > Source/JavaScriptCore/wtf/text/AtomicString.cpp:288
> > +    if (!offset && length == r->length())
> > +        return add(r);
> 
> This should be >= rather than ==.
> 

How about 'ASSERT(length <= r->length())' before that?
Comment 29 Darin Adler 2012-01-17 17:59:50 PST
Comment on attachment 122500 [details]
patch v6

View in context: https://bugs.webkit.org/attachment.cgi?id=122500&action=review

>>> Source/JavaScriptCore/wtf/text/AtomicString.cpp:288
>>> +        return add(r);
>> 
>> This should be >= rather than ==.
> 
> How about 'ASSERT(length <= r->length())' before that?

Normally, in functions that take a length such as String::substring, we allow callers to pass a length that goes past the end of the string and clamp the result so things end at the end of the string.

I thought this function was one such case, but now that I look at it more carefully, I see that the code below will not work with a larger length and so an assertion is probably best.
Comment 30 Xianzhu Wang 2012-01-18 13:58:06 PST
(In reply to comment #29)
> (From update of attachment 122500 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=122500&action=review
> 
> >>> Source/JavaScriptCore/wtf/text/AtomicString.cpp:288
> >>> +        return add(r);
> >> 
> >> This should be >= rather than ==.
> > 
> > How about 'ASSERT(length <= r->length())' before that?
> 
> Normally, in functions that take a length such as String::substring, we allow callers to pass a length that goes past the end of the string and clamp the result so things end at the end of the string.
> 
> I thought this function was one such case, but now that I look at it more carefully, I see that the code below will not work with a larger length and so an assertion is probably best.

I see. I think we can still let it conform to the convention of String::substring by resetting the length to the max length here (just like what StringImpl::substring() does.)
Comment 31 Xianzhu Wang 2012-01-18 14:12:27 PST
Created attachment 122991 [details]
patch v7

Upload for EWS.
Comment 32 Xianzhu Wang 2012-01-23 09:32:23 PST
Comment on attachment 122991 [details]
patch v7

Verified with EWS and local mac build.
Darin, I flagged 'commit-queue ?' for your approval. Thanks.
Comment 33 Darin Adler 2012-01-23 09:46:59 PST
Comment on attachment 122991 [details]
patch v7

View in context: https://bugs.webkit.org/attachment.cgi?id=122991&action=review

> Source/JavaScriptCore/wtf/text/AtomicString.cpp:287
> +    if (!length || start >= baseString->length())
> +        return StringImpl::empty();
>  
> +    unsigned maxLength = baseString->length() - start;

It would be better to tweak the structure here to avoid calling baseString->length() twice.
Comment 34 WebKit Review Bot 2012-01-23 12:42:21 PST
Comment on attachment 122991 [details]
patch v7

Clearing flags on attachment: 122991

Committed r105635: <http://trac.webkit.org/changeset/105635>
Comment 35 WebKit Review Bot 2012-01-23 12:42:28 PST
All reviewed patches have been landed.  Closing bug.
Comment 36 Andy Estes 2012-01-23 13:41:37 PST
(In reply to comment #34)
> (From update of attachment 122991 [details])
> Clearing flags on attachment: 122991
> 
> Committed r105635: <http://trac.webkit.org/changeset/105635>

This broke the build. Fixed in <http://trac.webkit.org/changeset/105639>.
Comment 37 Xianzhu Wang 2012-01-23 13:50:31 PST
(In reply to comment #36)
> This broke the build. Fixed in <http://trac.webkit.org/changeset/105639>.

Thanks!
Comment 38 Adam Barth 2012-01-23 18:48:12 PST
Did this patch change anything that could have affected performance?
Comment 39 Darin Adler 2012-01-24 09:45:00 PST
(In reply to comment #38)
> Did this patch change anything that could have affected performance?

Not intentionally, but changes to inlining or mistakes could have had an effect. The change is complex enough that I don’t think we can say definitively that it did not affect performance without testing.
Comment 40 Xianzhu Wang 2012-01-24 11:05:48 PST
Actually the change itself won't have performance effect because the added methods are used nowhere except for the unit test. The only logical change to the existing code is in resize() which fixed a small issue and will improve a bit the memory efficiency.

The intent of this change is make it possible to replace the many Vector<UChar> usages with StringBuilder. This will enable 8-bit string of the places which will save memory. I believe the new toAtomicString() method will also improve memory efficiency compared to String::adopt(Vector<UChar>).
Comment 41 Adam Barth 2012-01-24 11:13:28 PST
Thanks for the responses.  It turned out to just be a blip on the perf chart.