72404 – Could save a lot of memory in CharacterData by not always storing a String

RESOLVED FIXED72404

Could save a lot of memory in CharacterData by not always storing a String

https://bugs.webkit.org/show_bug.cgi?id=72404

Summary Could save a lot of memory in CharacterData by not always storing a String

Ojan Vafai

Reported 2011-11-15 12:25:54 PST

Below is the dump just of whitespace text nodes from the HTML spec (string followed by count of that string occurring). If we stored atomic strings in these cases, we'd save a lot of memory. There are also non-whitespace strings that are repeated many times in the spec (e.g. "Unique " is in >30k text nodes). For a short whitespace string that conforms to the list of 5 valid space characters in http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#space-character, we could even store it as a single unsigned. Looking at the HTML spec, this is actually by far the common case. The HTML spec is a bit unique in it's content, but it's certainly a common pattern for sites to have repeated whitespace due to formatting their HTML to be readable. " " 8197 " " 4172 " " 1052 " " 469 " " 1 " " 413 " " 21 " " 3099 " " 224 " " 497 " " 96 " " 157 " " 19240 " " 52 " " 242 " " 122 " " 472 " " 6117 " " 242 " " 4 " " 17 " " 3126 " " 1132 " " 2 " " 7 " " 545 " " 445 " " 78 " " 2 " " 166 " " 140 " " 1 " " 51 " " 26 " " 11 " " 7 " " 2 " " 3 " " 29 " " 6 " " 1 " " 1 " " 7 " " 7 " " 1 " " 7 " " 2 " " 9 " " 1 " " 1 " " 1 " " 2 " " 2 " " 1 " " 8 " " 1 " " 1 " " 1 " " 1 " " 1 " " 1 " " 1 " " 1 " " 2 " " 1 " " 3 " " 1 " " 1 "  " 1 " " 2 " " 1 " " 8 " " 2 " " 1

Attachments
Patch (20.87 KB, patch) 2011-12-01 16:21 PST, Erik Arvidsson	no flags	Details Formatted Diff Diff
WIP patch using AtomicString instead of AlignedBuffer (19.78 KB, patch) 2011-12-01 17:03 PST, Erik Arvidsson	no flags	Details Formatted Diff Diff
Patch (10.04 KB, patch) 2011-12-02 16:50 PST, Erik Arvidsson	no flags	Details Formatted Diff Diff
Patch (9.36 KB, patch) 2011-12-05 13:28 PST, Erik Arvidsson	no flags	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Darin Adler

Comment 1 2011-11-15 13:25:14 PST

Neat ideas. I think yours are are: 1) Use AtomicString instead of String. 2) Use a count of U+0020 space characters instead of a string. 3) Use a count of HTML space characters instead of a string. Not clear on how this would work. Some other ideas: 4) Store the text in the node itself instead of in a separate object, allocating at a larger size to do so. This requires refactoring string operations so they can work directly on text without having to allocate a String object. That’s something I’ve been thinking that refactoring is something we should do. 5) Keep the original text around and reference count it; have the strings created by the HTML parser use pointers and lengths pointing at a single large character buffer instead of allocating separate StringImpl objects for each. I think (1) is the easiest to do correctly and should be tried to see what the performance cost would be.

Geoffrey Garen

Comment 2 2011-11-15 13:26:49 PST

How much memory does CharacterData account for on an average web page?

Adam Barth

Comment 3 2011-11-15 13:33:31 PST

One quick hack would be for the parser to check when parsing whether a string is entirely whitespace and to use AtomicStrings just for those cases (maybe that's what you were suggesting).

Erik Arvidsson

Comment 4 2011-11-15 13:35:14 PST

(In reply to comment #3) > One quick hack would be for the parser to check when parsing whether a string is entirely whitespace and to use AtomicStrings just for those cases (maybe that's what you were suggesting). That was pretty much what we were thinking ;)

Ojan Vafai

Comment 5 2011-11-15 13:36:03 PST

The way I was picturing option 3 work is that we'd just do it for strings that only contain U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF) and U+000D CARRIAGE RETURN (CR). That way, you can use 2 bits per character and could encode strings up to 16 characters long as a single unsigned. I think this is probably too complicated to be worth it. Option 5 sounds awesomest, but complicated. I agree that we should try the AtomicString approach first just for whitespace strings (possibly with a length limit?). It's the simplest and gets the vast majority of the memory benefits.

Ojan Vafai

Comment 6 2011-11-15 13:41:07 PST

BTW, here's the script I'm using to count the whitespace text nodes on a page: var hash = {}; function walk(node) { if (node.nodeType == Node.TEXT_NODE || node.nodeType == Node.COMMENT_NODE || node.nodeType == Node.PROCESSING_INSTRUCTION_NODE) { var value = node.nodeValue; if (!hash[value]) hash[value] = 0; hash[value] += 1; } for (var i = 0, len = node.childNodes.length; i < len; i++) { walk(node.childNodes[i]); } } walk(document.body); var uniqueStrings = 0; for (var value in hash) { var count = hash[value]; if (value.match(/^\s*$/)) { console.log(JSON.stringify(value), count); uniqueStrings++; } } console.log(uniqueStrings); And here's a more readable dump of the HTML spec's whitespace nodes: "\n\n " 8197 "\n " 4172 "\n " 1052 "\n " 471 "\n \n" 1 "\n" 419 "\n " 21 "\n " 3099 "\n " 228 "\n " 497 "\n " 96 "\n " 157 " " 19240 "\n\n\n\n\n " 52 "\n\n\n\n " 242 "\n\n" 122 "\n\n\n " 472 "\n\n " 6117 "\n\n\n " 242 "\n\n\n\n " 4 "\n\n\n" 17 "\n\n " 3126 "\n\n " 1132 "\n\n\n\n\n\n\n " 2 "\n\n\n\n\n\n " 7 "\n\n " 545 "\n\n " 445 "\n\n\n " 78 "\n\n " 2 "\n\n " 166 "\n\n " 140 "\n \n " 1 "\n\n " 51 "\n\n " 26 "\n\n " 11 "\n " 7 "\n\n " 2 "\n\n " 3 "\n\n\n " 29 "\n\n\n " 6 " " 1 " " 1 " \n\n " 7 " \n\n " 7 "\n \n\n\n\n " 1 "\n\n \n\n " 7 " " 2 "\n " 9 " " 1 "\n " 1 "\n\n \n " 1 "\n\n\n\n\n\n\n\n " 2 "\n \n " 2 "\n \n " 1 "\n\n\n " 8 "\n\n\n\n\n" 1 "\n \n " 1 "\n\n\n\n" 1 "\n\n\n " 1 " " 1 " " 1 " " 1 " " 1 " " 2 " " 1 " " 3 " " 1 " " 1 "  " 1 " " 2 "" 2 " \n\n\n " 1 "\n " 8 " \n\n " 2 "\n\n \n" 1 75

Antti Koivisto

Comment 7 2011-11-15 13:48:36 PST

Nice find!

Erik Arvidsson

Comment 8 2011-11-30 16:45:55 PST

I started looking at this and I hit the limitation of my C++ knowledge. To make this worth it, it seems like we would need to use a union type. However, union types cannot have POD members. I talked to Jeffrey Yasskin (C++ guru at Google) and he pointed me to the ManualConstructor pattern (BSD licensed): http://code.google.com/p/google-ctemplate/source/browse/trunk/src/base/manual_constructor.h Does WTF have anything similar? If not, can we add something like it? (Preferably a port of the GTL code since according to Jeffrey it is very hard to get this right). Maybe I'm missing something and there is a simpler way to do this? Also, when testing this I used ActivityMonitor. What is a better tool to see the memory usage of this?

Darin Adler

Comment 9 2011-11-30 18:13:29 PST

(In reply to comment #8) > I talked to Jeffrey Yasskin (C++ guru at Google) and he pointed me to the ManualConstructor pattern (BSD licensed): > > Does WTF have anything similar? It does have something like it, AlignedBuffer, used in the Vector and HashTable class. It could use some additional love to make it handle type conversion more like that ManualConstructor class. > If not, can we add something like it? (Preferably a port of the GTL code since according to Jeffrey it is very hard to get this right). A port of the code is OK, but wouldn’t be my first choice.

Erik Arvidsson

Comment 10 2011-11-30 18:43:10 PST

Darin, thanks for your comment. I took a quick look at AlignedBuffer and it seem to cover this use case. I'll try to add an AlignedBuffer to my patch in progress tomorrow. Just to make things clear. Here are my current thought. - Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not. - Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)> - Add an assert that sizeof(String) == sizeof(AtomicString) - Based on the flag return a String or AtomicString.

Darin Adler

Comment 11 2011-12-01 15:58:00 PST

(In reply to comment #10) > - Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not. > - Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)> > - Add an assert that sizeof(String) == sizeof(AtomicString) > - Based on the flag return a String or AtomicString. If the two different things you are storing as a union either a String or an AtomicString, then I think you are over-engineering this. The StringImpl object already has a flag in it that tells whether it is in the AtomicString table, and simply constructing an AtomicString then storing it in a String accomplishes what you want to do, making it share a single unique copy.

Erik Arvidsson

Comment 12 2011-12-01 16:17:14 PST

(In reply to comment #11) > (In reply to comment #10) > > - Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not. > > - Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)> > > - Add an assert that sizeof(String) == sizeof(AtomicString) > > - Based on the flag return a String or AtomicString. > > If the two different things you are storing as a union either a String or an AtomicString, then I think you are over-engineering this. The StringImpl object already has a flag in it that tells whether it is in the AtomicString table, and simply constructing an AtomicString then storing it in a String accomplishes what you want to do, making it share a single unique copy. I was just about to upload a WIP patch that uses AlignedBuffer. It saves about 10megs (5%) on the HTML5. I'll still upload it for people to look at but I'll investigate the String/AtomicString case too.

Erik Arvidsson

Comment 13 2011-12-01 16:21:58 PST

Created attachment 117514 [details] Patch

Erik Arvidsson

Comment 14 2011-12-01 17:03:08 PST

Created attachment 117525 [details] WIP patch using AtomicString instead of AlignedBuffer

Erik Arvidsson

Comment 15 2011-12-01 17:03:47 PST

(In reply to comment #14) > Created an attachment (id=117525) [details] > WIP patch using AtomicString instead of AlignedBuffer This patch is a mess but it works :-)

Darin Adler

Comment 16 2011-12-01 17:04:46 PST

(In reply to comment #15) > (In reply to comment #14) > > Created an attachment (id=117525) [details] [details] > > WIP patch using AtomicString instead of AlignedBuffer > > This patch is a mess but it works :-) You can accomplish the same thing with a tiny fraction of the code. The Text and CharacterData classes don’t even need to be changed!

Darin Adler

Comment 17 2011-12-01 17:07:09 PST

Comment on attachment 117525 [details] WIP patch using AtomicString instead of AlignedBuffer View in context: https://bugs.webkit.org/attachment.cgi?id=117525&action=review > Source/WebCore/html/parser/HTMLConstructionSite.cpp:383 > + attachAtSite(site, Text::createWhitespaceOnly(site.parent->document(), characters)); All you need to do here is this: attachAtSite(site, Text::create(site.parent->document(), AtomicString(characters).string());

Erik Arvidsson

Comment 18 2011-12-02 16:50:49 PST

Created attachment 117712 [details] Patch

Erik Arvidsson

Comment 19 2011-12-02 16:58:43 PST

(In reply to comment #18) > Created an attachment (id=117712) [details] > Patch On http://www.whatwg.org/specs/web-apps/current-work/ this saves about 5mb (out of 277). This seems too much to me. There are only 169049 characters used for whitespace nodes on that page so the memory saving should be about 340kb. Maybe this is due to how memory is allocated?

Darin Adler

Comment 20 2011-12-02 18:06:11 PST

(In reply to comment #19) > On http://www.whatwg.org/specs/web-apps/current-work/ this saves about 5mb (out of 277). This seems too much to me. There are only 169049 characters used for whitespace nodes on that page so the memory saving should be about 340kb. Maybe this is due to how memory is allocated? The savings are for all the StringImpl objects, not just the characters.

Darin Adler

Comment 21 2011-12-02 18:14:10 PST

Comment on attachment 117712 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=117712&action=review > Source/WebCore/html/parser/HTMLConstructionSite.cpp:350 > + if (whitespaceInfo == WhitespaceUnknown) > + whitespaceInfo = isAllWhitespace(characters) ? AllWhitespace : NotAllWhitespace; I think you should write this: // Strings composed entirely of whitespace are likely to be repeated. // Turn them into AtomicString so we share a single string for each. bool shouldUseAtomicString = whitespaceInfo == AllWhitespace || (whitespaceInfo == WhitespaceUnknown && isAllWhitespace(characters)); > Source/WebCore/html/parser/HTMLConstructionSite.cpp:366 > - RefPtr<Text> textNode = Text::createWithLengthLimit(site.parent->document(), characters, currentPosition); > + RefPtr<Text> textNode = Text::createWithLengthLimit(site.parent->document(), whitespaceInfo == AllWhitespace ? AtomicString(characters).string() : characters, currentPosition); This code needs a comment explaining the policy. Or better, it could just use a boolean named shouldUseAtomicString and the comment could be above. > Source/WebCore/html/parser/HTMLConstructionSite.cpp:369 > - textNode = Text::create(site.parent->document(), characters.substring(currentPosition)); > + textNode = Text::create(site.parent->document(), whitespaceInfo == AllWhitespace ? AtomicString(characters.substring(currentPosition)).string() : characters.substring(currentPosition)); It would be nice to put the substring into a local String so we don’t have to repeat that subexpression. > Source/WebCore/html/parser/HTMLConstructionSite.h:55 > + enum WhitespaceInfo { > + AllWhitespace, > + NotAllWhitespace, > + WhitespaceUnknown > + }; Since this is software, everything is “info”, so that seems like a non-helpful suffix, and abbreviated to boot! I didn’t think of a better name, though, and I did try for a few minutes. I think it would be fine to have this at the namespace level instead of using a class member. Would make the call sites more readable. > Source/WebCore/html/parser/HTMLTreeBuilder.cpp:276 > - String takeLeadingWhitespace() > + AtomicString takeLeadingWhitespace() Why is it helpful for this to return an AtomicString? It seems like extra unnecessary work and extra unnecessary change. > Source/WebCore/html/parser/HTMLTreeBuilder.cpp:300 > - String takeRemainingWhitespace() > + AtomicString takeRemainingWhitespace() Why is it helpful for this to return an AtomicString?

Erik Arvidsson

Comment 22 2011-12-05 13:28:15 PST

Created attachment 117932 [details] Patch

Erik Arvidsson

Comment 23 2011-12-05 13:34:08 PST

Thanks again Darin. New patch is up. I had one concern that I wanted to discuss before committing this. In WebKit1 the AtomicStringTable never gets cleaned up (Chromium kills the process when you leave the page/domain and I assume WebKit2 does something similar.). Maybe we should limit the whitespace text nodes in some way to reduce the amount of memory these strings will take up. One possible option would be to limit this to strings with less than 8 or 16 characters?

Darin Adler

Comment 24 2011-12-05 14:32:57 PST

(In reply to comment #23) > In WebKit1 the AtomicStringTable never gets cleaned up Sure it does. When the last reference to a string is removed the string is removed from the table.

WebKit Review Bot

Comment 25 2011-12-05 15:39:33 PST

Comment on attachment 117932 [details] Patch Attachment 117932 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10728791 New failing tests: svg/custom/linking-uri-01-b.svg

WebKit Review Bot

Comment 26 2011-12-05 22:13:18 PST

Comment on attachment 117932 [details] Patch Clearing flags on attachment: 117932 Committed r102095: <http://trac.webkit.org/changeset/102095>

WebKit Review Bot

Comment 27 2011-12-05 22:13:25 PST

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component DOM

Assignee

Erik Arvidsson

Reported

2011-11-15 12:25 PST

Modified

2011-12-05 22:13 PST History

CC List

12 users Show

URL

Keywords

Depends on

Blocks