RESOLVED FIXED Bug 66161
Investigate storing strings in 8-bit buffers when possible
https://bugs.webkit.org/show_bug.cgi?id=66161
Summary Investigate storing strings in 8-bit buffers when possible
Annie Sullivan
Reported 2011-08-12 13:41:49 PDT
Currently WTF::String always uses 16-bit buffers. This can be wasteful, because many strings only contain characters which are 8-bit. I did some testing on various sites, and found there would be savings of 500k-2MB (3-5% total memory usage) if we managed to store all the strings that don't contain double-byte characters in 8-bit buffers. Getting this right would definitely be difficult, but I think it's worth investigating.
Attachments
Proposed StringImpl patch that supports 8 bit strings. (76.71 KB, patch)
2011-10-09 22:04 PDT, Michael Saboff
webkit-ews: commit-queue-
Updated patch (76.85 KB, patch)
2011-10-10 17:43 PDT, Michael Saboff
no flags
Updated patch (77.88 KB, patch)
2011-10-11 09:43 PDT, Michael Saboff
no flags
Updated patch with tested export fix for Windows (78.40 KB, patch)
2011-10-13 13:59 PDT, Michael Saboff
no flags
Updated patch refactored to use LChar (111.03 KB, patch)
2011-10-25 17:50 PDT, Michael Saboff
ggaren: review-
Updated patch addressing comments #21 (118.58 KB, patch)
2011-10-26 14:50 PDT, Michael Saboff
ggaren: review-
webkit-ews: commit-queue-
Patch with updates suggested in comment #26, stylebot and build failures (117.21 KB, patch)
2011-10-26 17:32 PDT, Michael Saboff
ggaren: review+
Gavin Barraclough
Comment 1 2011-08-12 15:56:52 PDT
Hi Annie, Great! - this is definitely something we're keen to see happen. What are your current plans? I'm assuming any approach will involve a flag in the StringImpl indicating the format of the character data. Clearly in many cases where we currently access characters in the string we could be agnostic as to how the data is stored (e.g. change code that calls to characters() to get a const UChar* to instead access one character at a time), however doing so in all cases will be tricky. One question would be, what do you plan on doing for more difficult cases where we have code expecting to be able to read UFT16 data? A set of obvious options would seem to be: (1) Fix them all, make the entire codebase 8-bit string safe, never need to convert or copy. (2) Create a temporary UTF16 copy as necessary (i.e. when a method that needs a const UChar* is called). (3) Convert 8-bit strings up to UTF16 when necessary, retaining only the UTF16 copy and freeing the 8-bit source. (4) Convert 8-bit strings up to UTF16 when necessary, retaining both the UTF16 and 8-bit representations. (1) Is probably the best option, but is also probably the most work. (2) Could be a performance regression, if a string may be repeatedly converted. (3) May be difficult since the underlying buffers of StringImpls may be shared when substrings are created (one StringImpl may have a pointer into another's buffer). [This may also be fixable, e.g. we could make strings we an offset relative to their buffer's base, or give buffers access to the list of StringImpls referencing them]. (4) In a pathological case would mean that instead of saving 50% of memory used for string data you would start using 150%! What are your current plans? Did you have another course of action in mind? One more complexity is that the contents of the StringImpl are directly accessed by JIT generated code in JSC (used by all ports, since WebCore uses JSC's regexp engine). This will be a bit of work to fix, but should be very doable, and the JSC team will be happy to help. There are very few places that access the string data from the regular JavaScript language JIT, but unsurprisingly more in the regular expression JIT. The regular expression JIT should be a pretty easy fix, it'll mostly just mean switching some 16-bit loads to 8-bit, but we'll probably need restructure some regexp objects to be able to hold onto two copies of JIT generated code, one for use with 16-bit source data & one for use with 8-bit source data. Again, we'll be happy to help make this happen. cheers, G.
Annie Sullivan
Comment 2 2011-08-15 21:40:54 PDT
Thanks so much, Gavin! We'd really like the help, especially since none of us are very familiar with JSC. Our current plan is to start out by aiming for option 1--we made a spreadsheet of all the characters() usages and we're going to try and change as many of them as possible to use String APIs that don't expose the internal format of the character data. It's located here, if anyone cc-ed on this bug would like to take a task off of it, please feel free: https://docs.google.com/a/chromium.org/spreadsheet/ccc?key=0AsX8ECAuTqEzdEVFYnZaUlZ3QzJCbnBKamwxdGMxVkE&hl=en_US#gid=9 Once we've gotten to the point where we're down to just the usages that really make option 1 impractical, we hope that having many fewer issues to look at will make the choice between options 2-4 easier. I know this is a lot of work, but we can easily spread it across several people.
Erik Corry
Comment 3 2011-08-16 00:10:00 PDT
Since this is likely to have a huge impact on V8 I would appreciate this bug being kept up to date on which of the 4 ideas you are thinking of going with. FWIW V8 has has always stored strings in both 1 and 2-byte formats. The 2-byte strings can be 'external' which means that they are backed by a WebKit string. V8's one byte strings are strict ASCII ie only the characters from 0 to 127 are allowed. Latin1 characters from 128 to 255 are not allowed in 1-byte strings. One of the reasons for this is that you can just memcpy an ASCII string into a UTF-8 output buffer, but you have to inspect every character if your source is Latin1. It also simplifies the 1-byte version of the regexp engine if you know there are no Latin1 characters. The ASCII restriction in V8 can be changed, obviously, but that's the way it is currently.
Annie Sullivan
Comment 4 2011-08-16 09:20:52 PDT
(In reply to comment #3) > Since this is likely to have a huge impact on V8 I would appreciate this bug being kept up to date on which of the 4 ideas you are thinking of going with. We definitely will. > FWIW V8 has has always stored strings in both 1 and 2-byte formats. The 2-byte strings can be 'external' which means that they are backed by a WebKit string. V8's one byte strings are strict ASCII ie only the characters from 0 to 127 are allowed. Latin1 characters from 128 to 255 are not allowed in 1-byte strings. > > One of the reasons for this is that you can just memcpy an ASCII string into a UTF-8 output buffer, but you have to inspect every character if your source is Latin1. It also simplifies the 1-byte version of the regexp engine if you know there are no Latin1 characters. > > The ASCII restriction in V8 can be changed, obviously, but that's the way it is currently. I did some preliminary investigation into how many bytes are taken up by strings containing all-ASCII vs some Latin1 chars vs some double-byte chars on different websites: CNN.com (english): 1M ASCII, 2K Latin1, 55K 2byte ESPN.com (english): 1.5M ASCII, 16K Latin1, 54K 2byte Baidu.com (chinese): 92K ASCII, 1.5K Latin1, 64K 2byte news.baidu.com (chinese): 535K ASCII, 2.6K Latin1, 148K 2byte yahoo.fr (french): 892K ASCII, 84K Latin1, 55K 2byte news.google.de (german): 480K ASCII, 46K Latin1, 61K 2byte Based on that, I think v8's approach of having only ASCII characters in 1-byte strings seems quite reasonable. Gavin, what do you think?
Gavin Barraclough
Comment 5 2011-08-16 14:52:21 PDT
Agreed - given those stats, targeting ASCII only seems like the right choice. If we were to ever find evidence that LATIN1 support would be of benefit, then we could look at expanding the character set supported in 8-bit strings from ASCII to LATIN1 (and coordinate with V8 in doing so), but ASCII seems a simpler starting point, and by the look of it should be sufficient.
Jungshik Shin
Comment 6 2011-08-25 13:13:50 PDT
(In reply to comment #5) > Agreed - given those stats, targeting ASCII only seems like the right choice. > > If we were to ever find evidence that LATIN1 support would be of benefit, then we could look at expanding the character set supported in 8-bit strings from ASCII to LATIN1 (and coordinate with V8 in doing so), but ASCII seems a simpler starting point, and by the look of it should be sufficient. I"m relieved that we're limiting a single byte string to ASCII only :-). Extending it to the first 256 bytes (up to U+00FF) may be ok, but we'd better stop there.
Michael Saboff
Comment 7 2011-08-25 16:24:52 PDT
Annie, I have started to work on this from a JSC side of things. I will first focus on making the YARR regular expression engine 8 / 16 bit switchable. After that I'll look at removing JSC use of characters(). Have you thought through the StringImpl changes?
Annie Sullivan
Comment 8 2011-08-26 14:37:54 PDT
(In reply to comment #7) > Annie, > > I have started to work on this from a JSC side of things. I will first focus on making the YARR regular expression engine 8 / 16 bit switchable. After that I'll look at removing JSC use of characters(). > > Have you thought through the StringImpl changes? Thanks, Michael! I'm looking into fleshing out the StringImpl changes now. Is the best way to get feedback to create a strawman patch in a blocking bug and ask people to give feedback as review comments? (Although of course we wouldn't actually submit the patch until all the other work is done and we'd finished properly testing possible performance changes).
Michael Saboff
Comment 9 2011-08-29 10:59:23 PDT
(In reply to comment #8) > (In reply to comment #7) > > Annie, > > > > I have started to work on this from a JSC side of things. I will first focus on making the YARR regular expression engine 8 / 16 bit switchable. After that I'll look at removing JSC use of characters(). > > > > Have you thought through the StringImpl changes? > > Thanks, Michael! I'm looking into fleshing out the StringImpl changes now. Is the best way to get feedback to create a strawman patch in a blocking bug and ask people to give feedback as review comments? (Although of course we wouldn't actually submit the patch until all the other work is done and we'd finished properly testing possible performance changes). I think that would work. You can email me with ideas if you want before posting a patch.
Annie Sullivan
Comment 10 2011-09-23 12:56:26 PDT
I am going on maternity leave in two weeks, so Xianzhu Wang is taking ownership of this.
Sam Weinig
Comment 11 2011-09-25 11:08:54 PDT
(In reply to comment #10) > I am going on maternity leave in two weeks, so Xianzhu Wang is taking ownership of this. Congratulations!
Michael Saboff
Comment 12 2011-10-09 22:04:46 PDT
Created attachment 110327 [details] Proposed StringImpl patch that supports 8 bit strings. This change allows the creation of both 8 bit or 16 bit strings. If the characters() method is called on an 8 bit StringImpl object, a 16 bit shadow copy is created and returned. This patch does not include changes that actually use 8 bit strings, in fact there is an ASSERT_NOT_REACHED() in the characters8() method. Future patches will make other changes to support and use 8 bit strings. The next of these are the neccessary JIT changes. This work is broken up in smaller patches to make it more digestible.
Early Warning System Bot
Comment 13 2011-10-09 22:37:16 PDT
Comment on attachment 110327 [details] Proposed StringImpl patch that supports 8 bit strings. Attachment 110327 [details] did not pass qt-ews (qt): Output: http://queues.webkit.org/results/10000841
Xianzhu Wang
Comment 14 2011-10-09 23:15:41 PDT
(In reply to comment #12) > Created an attachment (id=110327) [details] > Proposed StringImpl patch that supports 8 bit strings. > > This change allows the creation of both 8 bit or 16 bit strings. If the characters() method is called on an 8 bit StringImpl object, a 16 bit shadow copy is created and returned. > > This patch does not include changes that actually use 8 bit strings, in fact there is an ASSERT_NOT_REACHED() in the characters8() method. > > Future patches will make other changes to support and use 8 bit strings. The next of these are the neccessary JIT changes. > > This work is broken up in smaller patches to make it more digestible. Thank you for the patch. Would it be better to create a blocking bug for the change of StringImpl? What do you think about bug 66706? I think with StringCharacterIterator, we can keep most of the characters() usages unchanged without the 16bit shadow copy.
Michael Saboff
Comment 15 2011-10-10 08:50:25 PDT
(In reply to comment #14) > (In reply to comment #12) > > Created an attachment (id=110327) [details] [details] > > Proposed StringImpl patch that supports 8 bit strings. > > > > This change allows the creation of both 8 bit or 16 bit strings. If the characters() method is called on an 8 bit StringImpl object, a 16 bit shadow copy is created and returned. > > > > This patch does not include changes that actually use 8 bit strings, in fact there is an ASSERT_NOT_REACHED() in the characters8() method. > > > > Future patches will make other changes to support and use 8 bit strings. The next of these are the neccessary JIT changes. > > > > This work is broken up in smaller patches to make it more digestible. > > Thank you for the patch. Would it be better to create a blocking bug for the change of StringImpl? > > What do you think about bug 66706? I think with StringCharacterIterator, we can keep most of the characters() usages unchanged without the 16bit shadow copy. I don't have a problem creating a blocking bug. I figured that this bug was for the StringImpl changes, so I used it for the proposed changes. What I found in JavaScriptCore with comprehensive changes for 8 bit strings is that the 16bit shadow copy is hardly every created. Most strings end up being 8 or 16 bit based on source content and there is seldom a need to convert an 8 bit string to a 16 bit string. Having characters8() and characters16() allows callers that want to manipulate actual characters the ability to manipulate the native format. I don't think this precludes characters() returning a StringCharacterIterator and following the 4 step plan described in 66706. We still typedef StringCharacterIterator to be a "const UChar*" and use the shadow copy for 8 bit strings. We can remove the 16 bit shadow copy at a later time, but in the mean time we can make forward progress. Given that per-character size checks will hurt performance, hot paths will likely need to manipulate the characters directly as the current code does. That means that either code paths check once (e.g. at the top of a loop) for 8/16 bit-ness or we templatize classes based on character size. For code that isn't as performance sensitive, an iterator implemented similar to the CharAccess class I added to YarrInterpreter would work.
Michael Saboff
Comment 16 2011-10-10 17:43:44 PDT
Created attachment 110451 [details] Updated patch This patch fixes two issues in the prior patch - The build failure reported - fixed StringImpl::lower() - Test failures - fixed StringImpl::replace(UChar, StringImpl*)
Michael Saboff
Comment 17 2011-10-11 09:43:47 PDT
Created attachment 110526 [details] Updated patch (Speculative fix) Add export to fix windows build.
Michael Saboff
Comment 18 2011-10-13 13:59:23 PDT
Created attachment 110903 [details] Updated patch with tested export fix for Windows
Geoffrey Garen
Comment 19 2011-10-13 16:02:59 PDT
Comment on attachment 110903 [details] Updated patch with tested export fix for Windows View in context: https://bugs.webkit.org/attachment.cgi?id=110903&action=review > Source/JavaScriptCore/wtf/text/StringImpl.h:471 > + const char* m_data8; I think you should create a new typedef, "unsigned char LChar" for "Latin1 character", as distinct from the existing "UChar" for "Unicode character". Regular char is not so good because on OS X and other platforms with signed chars, (a) handing back just a char gives the client a value that will blow up and sign extend silently and incorrectly when converted to UChar, and (b) char requires a bunch of casting.
Michael Saboff
Comment 20 2011-10-25 17:50:57 PDT
Created attachment 112435 [details] Updated patch refactored to use LChar This patch includes the typedef suggested by Geoff. That change rippled through the code growing this patch somewhat. Comment #12 still holds true. Still working on getting the JavaScriptCore symbol export list right for Windows.
Geoffrey Garen
Comment 21 2011-10-25 22:24:12 PDT
Comment on attachment 112435 [details] Updated patch refactored to use LChar View in context: https://bugs.webkit.org/attachment.cgi?id=112435&action=review You added LChar but didn't remove most of the static_casts. I pointed out some of them above, but ran out of time. You need to take a sweep through and remove them all. > Source/JavaScriptCore/runtime/UString.h:108 > + return static_cast<unsigned char>(m_impl->characters8()[index]); No need for static_cast, now that you have LChar. > Source/JavaScriptCore/runtime/UString.h:145 > if (rep1 == rep2) // If they're the same rep, they're equal. This function is way too big to inline now. You should pick the right part to inline and move the rest out of line. > Source/JavaScriptCore/runtime/UString.h:185 > + if (static_cast<unsigned char>(d1[i]) != d2[i]) No need for static_cast, now that you have LChar. > Source/JavaScriptCore/runtime/UString.h:196 > + if (d1[i] != static_cast<unsigned char>(d2[i])) No need for static_cast, now that you have LChar. > Source/JavaScriptCore/wtf/StringHasher.h:165 > + return static_cast<UChar>(ch); No need for static_cast, now that you have LChar. > Source/JavaScriptCore/wtf/text/StringImpl.cpp:222 > + m_copyData16[i] = static_cast<unsigned char>(m_data8[i]); No need for static_cast, now that you have LChar. > Source/JavaScriptCore/wtf/text/StringImpl.cpp:236 > + UChar c = static_cast<unsigned char>(m_data8[i]); No need for static_cast, now that you have LChar.
Michael Saboff
Comment 22 2011-10-26 14:50:14 PDT
Created attachment 112601 [details] Updated patch addressing comments #21 This addresses Geoff's comments and doesn't include the flag bit change factored out in https://bugs.webkit.org/show_bug.cgi?id=70937. Also cleanup of comments suggested by Darin.
WebKit Review Bot
Comment 23 2011-10-26 14:54:10 PDT
Attachment 112601 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/runtime/UString.cpp:277: A case label should not be indented, but line up with its switch statement. [whitespace/indent] [4] Source/JavaScriptCore/runtime/UString.cpp:315: A case label should not be indented, but line up with its switch statement. [whitespace/indent] [4] Source/JavaScriptCore/runtime/UString.cpp:320: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5] Source/JavaScriptCore/runtime/UString.h:180: A case label should not be indented, but line up with its switch statement. [whitespace/indent] [4] Source/JavaScriptCore/runtime/UString.h:198: A case label should not be indented, but line up with its switch statement. [whitespace/indent] [4] Source/JavaScriptCore/runtime/UString.h:203: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5] Total errors found: 6 in 28 files If any of these errors are false positives, please file a bug against check-webkit-style.
Early Warning System Bot
Comment 24 2011-10-26 15:25:03 PDT
Comment on attachment 112601 [details] Updated patch addressing comments #21 Attachment 112601 [details] did not pass qt-ews (qt): Output: http://queues.webkit.org/results/10220438
WebKit Review Bot
Comment 25 2011-10-26 15:30:25 PDT
Comment on attachment 112601 [details] Updated patch addressing comments #21 Attachment 112601 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10223123
Geoffrey Garen
Comment 26 2011-10-26 15:33:23 PDT
Comment on attachment 112601 [details] Updated patch addressing comments #21 View in context: https://bugs.webkit.org/attachment.cgi?id=112601&action=review The fastFree and ASCII vs Latin1 issues look like real bugs. The rest would just improve readability and performance. I have a feeling that, in the end, we'll want all the inline fast cases to cover the 8bit case, moving the 16bit cases to out-of-line slow cases. (Right now, everything is in the same case, which probably defeats inlining. And we can't inline the 8bit case yet, since most strings are still 16bit.) > Source/JavaScriptCore/runtime/UString.h:140 > +NEVER_INLINE bool equalSlowCase(const UString& s1, const UString& s2); > + > ALWAYS_INLINE bool operator==(const UString& s1, const UString& s2) Still worried this is too big to inline. If it doesn't inline, perhaps you could: (1) inline just the size and buffer checks, and out-of-line everything else; (2) Do (1), and always just call memcmp with the right length for the slow case. > Source/JavaScriptCore/wtf/text/StringConcatenate.h:129 > + unsigned char c = m_buffer[i]; > + destination[i] = c; This can just be destination[i] = m_buffer[i]. > Source/JavaScriptCore/wtf/text/StringConcatenate.h:205 > + unsigned char c = m_buffer[i]; > + destination[i] = c; Ditto. > Source/JavaScriptCore/wtf/text/StringHash.h:56 > + if (a->is8Bit()) { As above, I'm concerned that this can't inline anymore, and might need rejiggering to inline properly. > Source/JavaScriptCore/wtf/text/StringImpl.cpp:59 > + if (has16BitShadow() && ((ownership == BufferInternal) || (ownership == BufferOwned))) { if has16BitShadow() is true, we should unconditionally fastFree. > Source/JavaScriptCore/wtf/text/StringImpl.cpp:184 > + m_copyData16 = static_cast<UChar*>(fastCalloc(len, sizeof(UChar))); Can this be fastMalloc instead? > Source/JavaScriptCore/wtf/text/StringImpl.cpp:407 > + if (is8Bit()) { > + LChar* data; > + RefPtr <StringImpl>newImpl = createUninitialized(m_length, data); > + for (int32_t i = 0; i < length; i++) > + data[i] = toASCIILower(m_data8[i]); > + return newImpl.release(); > + } What happens here for a Latin1 string, when you call toASCIILower on its characters? I think you need to check each character for being ASCII. > Source/JavaScriptCore/wtf/text/StringImpl.cpp:748 > + size_t matchStringLength = strlen(reinterpret_cast<const char *>(matchString)); No space before "*", please. > Source/JavaScriptCore/wtf/text/StringImpl.cpp:1017 > + memcpy(data, m_data8, position * sizeof(char)); > + if (str) > + memcpy(data + position, str->m_data8, lengthToInsert * sizeof(char)); > + memcpy(data + position + lengthToInsert, m_data8 + position + lengthToReplace, > + (length() - position - lengthToReplace) * sizeof(char)); Minor nit: should be sizeof(LChar) rather than sizeof(char) in these 3 places -- even though the math works out the same. > Source/JavaScriptCore/wtf/text/StringImpl.h:252 > + // FIXME: Add Vector<char> adopt? This is probably more appropriate as a bug report, with an explanation of why you were considering that. > Source/JavaScriptCore/wtf/text/StringImpl.h:272 > + ALWAYS_INLINE const UChar* characters() const Eventually, we'll want to rename characters() to "deprecatedCharacters()" to communicate this. (Not right now, though.)
Michael Saboff
Comment 27 2011-10-26 17:32:48 PDT
Created attachment 112623 [details] Patch with updates suggested in comment #26, stylebot and build failures > > Source/JavaScriptCore/runtime/UString.h:140 > > +NEVER_INLINE bool equalSlowCase(const UString& s1, const UString& s2); > > + > > ALWAYS_INLINE bool operator==(const UString& s1, const UString& s2) > > Still worried this is too big to inline. > > If it doesn't inline, perhaps you could: (1) inline just the size and buffer checks, and out-of-line everything else; (2) Do (1), and always just call memcmp with the right length for the slow case. It didn't inline, so I changed to (1) - size / buffer check, slow case the rest. > > Source/JavaScriptCore/wtf/text/StringHash.h:56 > > + if (a->is8Bit()) { > > As above, I'm concerned that this can't inline anymore, and might need rejiggering to inline properly. The existing code doesn't inline this method. The StringHash class is used in StringImpl.h to create a template class therefore the complete method in the .h file. > > Source/JavaScriptCore/wtf/text/StringImpl.cpp:59 > > + if (has16BitShadow() && ((ownership == BufferInternal) || (ownership == BufferOwned))) { > > if has16BitShadow() is true, we should unconditionally fastFree. Done. > > Source/JavaScriptCore/wtf/text/StringImpl.cpp:184 > > + m_copyData16 = static_cast<UChar*>(fastCalloc(len, sizeof(UChar))); > > Can this be fastMalloc instead? Yes, done. > > Source/JavaScriptCore/wtf/text/StringImpl.cpp:407 > > + if (is8Bit()) { > > + LChar* data; > > + RefPtr <StringImpl>newImpl = createUninitialized(m_length, data); > > + for (int32_t i = 0; i < length; i++) > > + data[i] = toASCIILower(m_data8[i]); > > + return newImpl.release(); > > + } > > What happens here for a Latin1 string, when you call toASCIILower on its characters? I think you need to check each character for being ASCII. Updated StringImpl::lower(), ::upper() and ::foldCase() to work with Latin-1 strings with an ASCII fast path. > > Source/JavaScriptCore/wtf/text/StringImpl.h:252 > > + // FIXME: Add Vector<char> adopt? > > This is probably more appropriate as a bug report, with an explanation of why you were considering that. Removed. Also cleaned up the sizeof(char) that should be sizeof(LChar) and other nits.
Geoffrey Garen
Comment 28 2011-10-26 18:12:34 PDT
Comment on attachment 112623 [details] Patch with updates suggested in comment #26, stylebot and build failures r=me
Geoffrey Garen
Comment 29 2011-10-26 18:13:02 PDT
Please watch the perf bot result to verify this is not a measurable regression.
Michael Saboff
Comment 30 2011-10-27 13:16:35 PDT
Note You need to log in before you can comment on or make changes to this bug.