Bug 66286 - Use String [] operator instead of indexing into String::characters() directly
Summary: Use String [] operator instead of indexing into String::characters() directly
Status: UNCONFIRMED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Annie Sullivan
URL:
Keywords:
Depends on: 66706
Blocks: 66161
  Show dependency treegraph
 
Reported: 2011-08-16 03:05 PDT by Xianzhu Wang
Modified: 2011-08-31 09:14 PDT (History)
16 users (show)

See Also:


Attachments
patch (31.20 KB, patch)
2011-08-17 00:13 PDT, Xianzhu Wang
gyuyoung.kim: commit-queue-
Details | Formatted Diff | Diff
updated patch (31.29 KB, patch)
2011-08-17 02:48 PDT, Xianzhu Wang
no flags Details | Formatted Diff | Diff
performance test (5.15 KB, text/x-c++src)
2011-08-17 21:28 PDT, Xianzhu Wang
no flags Details
StringIterator as the returning type of characters() (32.98 KB, patch)
2011-08-18 03:40 PDT, Xianzhu Wang
gyuyoung.kim: commit-queue-
Details | Formatted Diff | Diff
patch. The last one had a style issue and missed some JavaScriptCore files (35.47 KB, patch)
2011-08-19 04:19 PDT, Xianzhu Wang
no flags Details | Formatted Diff | Diff
Example of replacing checks in operator[] with ASSERTS (22.05 KB, patch)
2011-08-19 16:28 PDT, Annie Sullivan
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Xianzhu Wang 2011-08-16 03:05:57 PDT
This is part of removing references to String::characters() which depends on the internal data format of String.
Comment 1 Xianzhu Wang 2011-08-17 00:13:49 PDT
Created attachment 104155 [details]
patch
Comment 2 Gyuyoung Kim 2011-08-17 00:27:03 PDT
Comment on attachment 104155 [details]
patch

Attachment 104155 [details] did not pass efl-ews (efl):
Output: http://queues.webkit.org/results/9408587
Comment 3 Collabora GTK+ EWS bot 2011-08-17 00:30:13 PDT
Comment on attachment 104155 [details]
patch

Attachment 104155 [details] did not pass gtk-ews (gtk):
Output: http://queues.webkit.org/results/9405667
Comment 4 Early Warning System Bot 2011-08-17 00:39:59 PDT
Comment on attachment 104155 [details]
patch

Attachment 104155 [details] did not pass qt-ews (qt):
Output: http://queues.webkit.org/results/9404723
Comment 5 Xianzhu Wang 2011-08-17 00:46:18 PDT
Comment on attachment 104155 [details]
patch

Please ignore this corrupted patch. I'm making a new new.
Comment 6 Xianzhu Wang 2011-08-17 02:48:41 PDT
Created attachment 104162 [details]
updated patch
Comment 7 Andreas Kling 2011-08-17 03:56:09 PDT
This seems very likely to regress performance, since String::operator[] does additional bounds checking.

Have you done any perf testing of this change? And is the effort to remove characters() usage tracked somewhere?
Comment 8 Andreas Kling 2011-08-17 04:42:41 PDT
(In reply to comment #7)
> And is the effort to remove characters() usage tracked somewhere?

D'oh, this is blocking bug 66161. Never mind :)
Comment 9 Darin Adler 2011-08-17 10:25:07 PDT
Most places that use characters() do so because of performance optimization.
Comment 10 Gavin Barraclough 2011-08-17 10:57:56 PDT
Looks good, I think this change should be performance tested before landed (per Darin's comment above).  I'm not sure what a good test would be, that might hit any of this code.  Dromaeo? peacemaker? - I'll comment again this afternoon when a few more people are around who might have good suggestions on the coverage.
Comment 11 Geoffrey Garen 2011-08-17 12:14:04 PDT
In the future, String::operator[] will do even more checking, to support 8bit vs 16bit representation, right? If so, the performance impact of repeated String::operator[] will be even greater.

If we need direct access to a character buffer for performance, maybe we should still allow it, but guard it with an ASSERT, e.g.:

if (string.is16Bit()) { // currently always true
    UChar* buffer = string.characters16(); // guarded by ASSERT
    ...
}
Comment 12 Xianzhu Wang 2011-08-17 21:28:01 PDT
Created attachment 104303 [details]
performance test

I tested the performance with the attached program. To run it, you should have a chromium port of WebKit, put it under Source/WebKit/chromium/tests, and add it into Source/WebKit/chromium/WebKit.gypi, update and build webkit, then run 'out/Release/webkit_unit_tests --gtest_filter=StringTest.*'.

My results are:
accessStringPtr: 0.660000
accessStringPtrNoPostPlusPlus: 0.660000
accessCharacters: 0.650000
accessImpl: 0.650000
accessImplCacheLength: 0.660000
accessArray: 1.000000
accessArrayCacheLength: 1.000000

Description of the test:
Run 100000 times sequentially accessing each character of a string containing 10000 characters.
accessStringPtr: use a StringPtr class to simulate characters() and UChar*
accessStringPtrNoPostPlusPlus: same as above, but avoid using ptr++
accessCharacters: use characters()
accessImpl: use StringImpl instead of characters(); use length() in loop
accessImplCacheLength: same as above, but save length() to a local variable before loop
accessArray: use String::operator[]; use length() in loop
accessArrayCacheLength: same as above, but save length() to a local variable before loop

The timings show that the only performance difference is about the range checking in String::operator[]. If eliminating it, there would be no performance difference.

I think we could have two choices:
1. Add String::ptr() to return StringPtr class instead of UChar*, and replace some characters() references with it.
2. Replace the range checking in String::operator[] to an ASSERT. 

I have no preference and would like to hear your opinions.

About the performance cost of checking 8-bit or 16-bit, I will investigate it. However, it seems not in scope of this bug.
Comment 13 Xianzhu Wang 2011-08-17 21:57:30 PDT
Now I prefer the StringPtr class method, because it can hide and optimize checking of 8/16 bit like what Geoffrey suggested. The issue does related to this bug.
Comment 14 Xianzhu Wang 2011-08-18 03:40:02 PDT
Created attachment 104318 [details]
StringIterator as the returning type of characters()

For now StringIterator is just typedefed to const UChar*. In the future it will be defined as a class that can handle 8 and 16-bit buffers.
The subject of this bug should be updated, but I'd like to hear your opinions about the patch and will update the subject if the patch looks good to you.
Comment 15 WebKit Review Bot 2011-08-18 03:41:56 PDT
Attachment 104318 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/css/CSSParser.h:296:  The parameter name "str" adds no information, so it should be removed.  [readability/parameter_name] [5]
Total errors found: 1 in 22 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 16 Gyuyoung Kim 2011-08-18 03:49:41 PDT
Comment on attachment 104318 [details]
StringIterator as the returning type of characters()

Attachment 104318 [details] did not pass efl-ews (efl):
Output: http://queues.webkit.org/results/9419559
Comment 17 Early Warning System Bot 2011-08-18 03:50:52 PDT
Comment on attachment 104318 [details]
StringIterator as the returning type of characters()

Attachment 104318 [details] did not pass qt-ews (qt):
Output: http://queues.webkit.org/results/9422575
Comment 18 WebKit Review Bot 2011-08-18 03:51:04 PDT
Comment on attachment 104318 [details]
StringIterator as the returning type of characters()

Attachment 104318 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/9426277
Comment 19 Gustavo Noronha (kov) 2011-08-18 06:01:15 PDT
Comment on attachment 104318 [details]
StringIterator as the returning type of characters()

Attachment 104318 [details] did not pass gtk-ews (gtk):
Output: http://queues.webkit.org/results/9425364
Comment 20 Geoffrey Garen 2011-08-18 13:03:35 PDT
(In reply to comment #13)
> Now I prefer the StringPtr class method, because it can hide and optimize checking of 8/16 bit like what Geoffrey suggested. The issue does related to this bug.

Actually, what I suggested was that StringPtr *can't* optimize checking the 8 vs 16 bit flag, because it must do the check on every access.
Comment 21 Xianzhu Wang 2011-08-18 21:46:08 PDT
(In reply to comment #20)
> (In reply to comment #13)
> > Now I prefer the StringPtr class method, because it can hide and optimize checking of 8/16 bit like what Geoffrey suggested. The issue does related to this bug.
> 
> Actually, what I suggested was that StringPtr *can't* optimize checking the 8 vs 16 bit flag, because it must do the check on every access.

You are right that there must be some extra cost, but I think with the encapsulation, we could potentially have some optimizations during the life of a StringPtr to keep the extra cost as small as possible.

I'm trying to measure the impact of performance in real applications.
Comment 22 Xianzhu Wang 2011-08-19 04:00:59 PDT
I tried to create a big change which replaces most 'const UChar*' to StringIterator in WebKit to test the performance impact of additional condition check on each access, but gave up because of some bug introduced which is hard to find out from the hundreds of changed files.

I did a simpler test with my first patch that changes some characters() access to []. The results are as below:

                  patched           original
               avg    medium      avg    medium
url-parser   2339.35  2340.5    2270.75  2249.5
xml-parser   1417.4   1417.5    1429.75  1415
html-parser  1459.15  1457      1455.15  1454.5

Though my StringTest shows 1/2 cost increase for [] accesses, the impact seems not actually visible in real applications. I guess the impact of 8/16-bits implementation of String and StringIterator should be also like this.
Comment 23 Xianzhu Wang 2011-08-19 04:19:13 PDT
Created attachment 104488 [details]
patch. The last one had a style issue and missed some JavaScriptCore files
Comment 24 Annie Sullivan 2011-08-19 16:28:57 PDT
Created attachment 104588 [details]
Example of replacing checks in operator[] with ASSERTS
Comment 25 Annie Sullivan 2011-08-19 16:42:00 PDT
(In reply to comment #24)
> Created an attachment (id=104588) [details]
> Example of replacing checks in operator[] with ASSERTS

I looked into how much work it would be to change String operator [] to ASSERT instead of do range checks. This patch passes all layout tests. If it seems reasonable to people to use ASSERT instead of range checks, we can do the following:

1. Clean up this patch and put it up for review on a dependent bug.
2. Write a benchmark that checks the performance of additional code that would need to be added to operator[] to work with 8-bit strings. The code would probably look something like:

if (m_refCountAndFlags & s_is8bit)
  return static_cast<UChar>(m_data8[i]);
else
  return m_data16[i];

3. If that goes well, continue with the original patch on this bug.

Or we could go with the StringIterator approach. Would be great to hear what some more experienced WebKit developers think.
Comment 26 Ryosuke Niwa 2011-08-19 16:57:32 PDT
Comment on attachment 104588 [details]
Example of replacing checks in operator[] with ASSERTS

View in context: https://bugs.webkit.org/attachment.cgi?id=104588&action=review

This patch appears to degrade the readability of call sites of operator[].  Can we instead add charAtWithoutBoundsCheck(unsigned) const, which doesn't do any bounds checks, and use that in places where we currently call characters()?

> Source/WebCore/css/CSSComputedStyleDeclaration.cpp:590
> +    if (!pseudoElementName.isEmpty())
> +        nameWithoutColonsStart = pseudoElementName[0] == ':' ? (pseudoElementName[1] == ':' ? 2 : 1) : 0;

Shouldn't we check that pseudoElementName's length is at least 2 if pseudoElementName[0] == ':'.
Comment 27 Darin Adler 2011-08-19 17:00:55 PDT
(In reply to comment #26)
> Can we instead add charAtWithoutBoundsCheck(unsigned) const

I was going to suggest something similar. But I’m not sure it has to have such a long name.
Comment 28 Annie Sullivan 2011-08-19 17:04:14 PDT
(In reply to comment #27)
> (In reply to comment #26)
> > Can we instead add charAtWithoutBoundsCheck(unsigned) const
> 
> I was going to suggest something similar. But I’m not sure it has to have such a long name.

So this would change my plan in comment 25 to:

1. Implement charAtWithoutBoundsCheck (with a shorter name)
2. Rewrite the first patch on this bug to call that instead of operator[]
3. Write a benchmark that checks the performance of additional code that would need to be added to charAtWithoutBoundsCheck and operator[] to work with 8-bit strings. The code would probably look something like:

if (m_refCountAndFlags & s_is8bit)
  return static_cast<UChar>(m_data8[i]);
else
  return m_data16[i];

Do people like that better than the StringIterator approach?
Comment 29 Ryosuke Niwa 2011-08-19 17:04:43 PDT
(In reply to comment #27)
> (In reply to comment #26)
> > Can we instead add charAtWithoutBoundsCheck(unsigned) const
> 
> I was going to suggest something similar. But I’m not sure it has to have such a long name.

fastCharAt?  But I personally think that renaming the existing operator[] to charAt and then eliminating the bounds check from operator[] will be better since operator[] on an array doesn't do bounds check.
Comment 30 Darin Adler 2011-08-19 17:18:36 PDT
(In reply to comment #29)
> fastCharAt?  But I personally think that renaming the existing operator[] to charAt and then eliminating the bounds check from operator[] will be better since operator[] on an array doesn't do bounds check.

Sure, later we can do that, but the point is to avoid making global risky changes that could even introduce buffer overrun for little benefit.
Comment 31 Ryosuke Niwa 2011-08-19 17:25:01 PDT
(In reply to comment #30)
> Sure, later we can do that, but the point is to avoid making global risky changes that could even introduce buffer overrun for little benefit.

I think we can do it safely if we first renamed operator[] to charAt in one patch, and then re-introduced operator[] in another patch.
Comment 32 Xianzhu Wang 2011-08-19 22:43:09 PDT
About the plan in comment 28, I have two points:

1. About the final usages of const UChar*: Extending from the cases in 'Passed to functions as UChar*', 'Passed as substring as UChar*', 'Classes with UChar* members', 'Return characters()' in Annie's spreadsheet, the final usages would fall into the other categories. This could also increase the cases of 'Use [] operator' category, and the performance measured after code change could be more accurate. We could also know the potential problems using 8-bit buffers in strings, especially with the third-party libraries (e.g. ICU). I can work on this.

2. To also change the final usages of 'Use [] operator' category, we should first determine how to pass const UChar* through parameters, class members and returned values: StringIterator, or 'const String&'/String?

I think one benefit of StringIterator is its partly compatibility with const UChar*. Using StringIterator we can replace const UChar*s with small code change, until the final usages of them. If the final usage is just iterating through the characters (i.e. this bug), the iteration code can remain unchanged. All intermediate steps that just pass const UChar* to the next step (with or without an offset) can just change the type name.
Comment 33 Annie Sullivan 2011-08-22 12:14:43 PDT
(In reply to comment #32)
> About the plan in comment 28, I have two points:
> 
> 1. About the final usages of const UChar*: Extending from the cases in 'Passed to functions as UChar*', 'Passed as substring as UChar*', 'Classes with UChar* members', 'Return characters()' in Annie's spreadsheet, the final usages would fall into the other categories. This could also increase the cases of 'Use [] operator' category, and the performance measured after code change could be more accurate. We could also know the potential problems using 8-bit buffers in strings, especially with the third-party libraries (e.g. ICU). I can work on this.

That would be great!

> 2. To also change the final usages of 'Use [] operator' category, we should first determine how to pass const UChar* through parameters, class members and returned values: StringIterator, or 'const String&'/String?

A lot of these usages pass in something like (characters() + offset, length) to create a substring quickly.

As I think about it more, it should be possible to make StringIterator::substring() method which does this internally but returns a StringIterator so it doesn't need to expose the internal format of the character buffer.

> I think one benefit of StringIterator is its partly compatibility with const UChar*. Using StringIterator we can replace const UChar*s with small code change, until the final usages of them. If the final usage is just iterating through the characters (i.e. this bug), the iteration code can remain unchanged. All intermediate steps that just pass const UChar* to the next step (with or without an offset) can just change the type name.

The more I dig through the code that calls characters(), the more I agree with this. I think StringIterator is the right approach. I set the review flag to r? on your StringIterator patch. Can someone review?
Comment 34 Annie Sullivan 2011-08-22 13:29:05 PDT
On second thought, I created bug 66706 for just the StringIterator part of the patch, since there is a lot of discussion on this bug and I hoped it would be clearer what StringIterator is for if we put it on a separate bug.