Below is the dump just of whitespace text nodes from the HTML spec (string followed by count of that string occurring). If we stored atomic strings in these cases, we'd save a lot of memory. There are also non-whitespace strings that are repeated many times in the spec (e.g. "Unique " is in >30k text nodes). For a short whitespace string that conforms to the list of 5 valid space characters in http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#space-character, we could even store it as a single unsigned. Looking at the HTML spec, this is actually by far the common case. The HTML spec is a bit unique in it's content, but it's certainly a common pattern for sites to have repeated whitespace due to formatting their HTML to be readable. " " 8197 " " 4172 " " 1052 " " 469 " " 1 " " 413 " " 21 " " 3099 " " 224 " " 497 " " 96 " " 157 " " 19240 " " 52 " " 242 " " 122 " " 472 " " 6117 " " 242 " " 4 " " 17 " " 3126 " " 1132 " " 2 " " 7 " " 545 " " 445 " " 78 " " 2 " " 166 " " 140 " " 1 " " 51 " " 26 " " 11 " " 7 " " 2 " " 3 " " 29 " " 6 " " 1 " " 1 " " 7 " " 7 " " 1 " " 7 " " 2 " " 9 " " 1 " " 1 " " 1 " " 2 " " 2 " " 1 " " 8 " " 1 " " 1 " " 1 " " 1 " " 1 " " 1 " " 1 " " 1 " " 2 " " 1 " " 3 " " 1 " " 1 " " 1 " " 2 " " 1 " " 8 " " 2 " " 1
Neat ideas. I think yours are are: 1) Use AtomicString instead of String. 2) Use a count of U+0020 space characters instead of a string. 3) Use a count of HTML space characters instead of a string. Not clear on how this would work. Some other ideas: 4) Store the text in the node itself instead of in a separate object, allocating at a larger size to do so. This requires refactoring string operations so they can work directly on text without having to allocate a String object. That’s something I’ve been thinking that refactoring is something we should do. 5) Keep the original text around and reference count it; have the strings created by the HTML parser use pointers and lengths pointing at a single large character buffer instead of allocating separate StringImpl objects for each. I think (1) is the easiest to do correctly and should be tried to see what the performance cost would be.
How much memory does CharacterData account for on an average web page?
One quick hack would be for the parser to check when parsing whether a string is entirely whitespace and to use AtomicStrings just for those cases (maybe that's what you were suggesting).
(In reply to comment #3) > One quick hack would be for the parser to check when parsing whether a string is entirely whitespace and to use AtomicStrings just for those cases (maybe that's what you were suggesting). That was pretty much what we were thinking ;)
The way I was picturing option 3 work is that we'd just do it for strings that only contain U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF) and U+000D CARRIAGE RETURN (CR). That way, you can use 2 bits per character and could encode strings up to 16 characters long as a single unsigned. I think this is probably too complicated to be worth it. Option 5 sounds awesomest, but complicated. I agree that we should try the AtomicString approach first just for whitespace strings (possibly with a length limit?). It's the simplest and gets the vast majority of the memory benefits.
BTW, here's the script I'm using to count the whitespace text nodes on a page: var hash = {}; function walk(node) { if (node.nodeType == Node.TEXT_NODE || node.nodeType == Node.COMMENT_NODE || node.nodeType == Node.PROCESSING_INSTRUCTION_NODE) { var value = node.nodeValue; if (!hash[value]) hash[value] = 0; hash[value] += 1; } for (var i = 0, len = node.childNodes.length; i < len; i++) { walk(node.childNodes[i]); } } walk(document.body); var uniqueStrings = 0; for (var value in hash) { var count = hash[value]; if (value.match(/^\s*$/)) { console.log(JSON.stringify(value), count); uniqueStrings++; } } console.log(uniqueStrings); And here's a more readable dump of the HTML spec's whitespace nodes: "\n\n " 8197 "\n " 4172 "\n " 1052 "\n " 471 "\n \n" 1 "\n" 419 "\n " 21 "\n " 3099 "\n " 228 "\n " 497 "\n " 96 "\n " 157 " " 19240 "\n\n\n\n\n " 52 "\n\n\n\n " 242 "\n\n" 122 "\n\n\n " 472 "\n\n " 6117 "\n\n\n " 242 "\n\n\n\n " 4 "\n\n\n" 17 "\n\n " 3126 "\n\n " 1132 "\n\n\n\n\n\n\n " 2 "\n\n\n\n\n\n " 7 "\n\n " 545 "\n\n " 445 "\n\n\n " 78 "\n\n " 2 "\n\n " 166 "\n\n " 140 "\n \n " 1 "\n\n " 51 "\n\n " 26 "\n\n " 11 "\n " 7 "\n\n " 2 "\n\n " 3 "\n\n\n " 29 "\n\n\n " 6 " " 1 " " 1 " \n\n " 7 " \n\n " 7 "\n \n\n\n\n " 1 "\n\n \n\n " 7 " " 2 "\n " 9 " " 1 "\n " 1 "\n\n \n " 1 "\n\n\n\n\n\n\n\n " 2 "\n \n " 2 "\n \n " 1 "\n\n\n " 8 "\n\n\n\n\n" 1 "\n \n " 1 "\n\n\n\n" 1 "\n\n\n " 1 " " 1 " " 1 " " 1 " " 1 " " 2 " " 1 " " 3 " " 1 " " 1 " " 1 " " 2 "" 2 " \n\n\n " 1 "\n " 8 " \n\n " 2 "\n\n \n" 1 75
Nice find!
I started looking at this and I hit the limitation of my C++ knowledge. To make this worth it, it seems like we would need to use a union type. However, union types cannot have POD members. I talked to Jeffrey Yasskin (C++ guru at Google) and he pointed me to the ManualConstructor pattern (BSD licensed): http://code.google.com/p/google-ctemplate/source/browse/trunk/src/base/manual_constructor.h Does WTF have anything similar? If not, can we add something like it? (Preferably a port of the GTL code since according to Jeffrey it is very hard to get this right). Maybe I'm missing something and there is a simpler way to do this? Also, when testing this I used ActivityMonitor. What is a better tool to see the memory usage of this?
(In reply to comment #8) > I talked to Jeffrey Yasskin (C++ guru at Google) and he pointed me to the ManualConstructor pattern (BSD licensed): > > Does WTF have anything similar? It does have something like it, AlignedBuffer, used in the Vector and HashTable class. It could use some additional love to make it handle type conversion more like that ManualConstructor class. > If not, can we add something like it? (Preferably a port of the GTL code since according to Jeffrey it is very hard to get this right). A port of the code is OK, but wouldn’t be my first choice.
Darin, thanks for your comment. I took a quick look at AlignedBuffer and it seem to cover this use case. I'll try to add an AlignedBuffer to my patch in progress tomorrow. Just to make things clear. Here are my current thought. - Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not. - Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)> - Add an assert that sizeof(String) == sizeof(AtomicString) - Based on the flag return a String or AtomicString.
(In reply to comment #10) > - Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not. > - Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)> > - Add an assert that sizeof(String) == sizeof(AtomicString) > - Based on the flag return a String or AtomicString. If the two different things you are storing as a union either a String or an AtomicString, then I think you are over-engineering this. The StringImpl object already has a flag in it that tells whether it is in the AtomicString table, and simply constructing an AtomicString then storing it in a String accomplishes what you want to do, making it share a single unique copy.
(In reply to comment #11) > (In reply to comment #10) > > - Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not. > > - Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)> > > - Add an assert that sizeof(String) == sizeof(AtomicString) > > - Based on the flag return a String or AtomicString. > > If the two different things you are storing as a union either a String or an AtomicString, then I think you are over-engineering this. The StringImpl object already has a flag in it that tells whether it is in the AtomicString table, and simply constructing an AtomicString then storing it in a String accomplishes what you want to do, making it share a single unique copy. I was just about to upload a WIP patch that uses AlignedBuffer. It saves about 10megs (5%) on the HTML5. I'll still upload it for people to look at but I'll investigate the String/AtomicString case too.
Created attachment 117514 [details] Patch
Created attachment 117525 [details] WIP patch using AtomicString instead of AlignedBuffer
(In reply to comment #14) > Created an attachment (id=117525) [details] > WIP patch using AtomicString instead of AlignedBuffer This patch is a mess but it works :-)
(In reply to comment #15) > (In reply to comment #14) > > Created an attachment (id=117525) [details] [details] > > WIP patch using AtomicString instead of AlignedBuffer > > This patch is a mess but it works :-) You can accomplish the same thing with a tiny fraction of the code. The Text and CharacterData classes don’t even need to be changed!
Comment on attachment 117525 [details] WIP patch using AtomicString instead of AlignedBuffer View in context: https://bugs.webkit.org/attachment.cgi?id=117525&action=review > Source/WebCore/html/parser/HTMLConstructionSite.cpp:383 > + attachAtSite(site, Text::createWhitespaceOnly(site.parent->document(), characters)); All you need to do here is this: attachAtSite(site, Text::create(site.parent->document(), AtomicString(characters).string());
Created attachment 117712 [details] Patch
(In reply to comment #18) > Created an attachment (id=117712) [details] > Patch On http://www.whatwg.org/specs/web-apps/current-work/ this saves about 5mb (out of 277). This seems too much to me. There are only 169049 characters used for whitespace nodes on that page so the memory saving should be about 340kb. Maybe this is due to how memory is allocated?
(In reply to comment #19) > On http://www.whatwg.org/specs/web-apps/current-work/ this saves about 5mb (out of 277). This seems too much to me. There are only 169049 characters used for whitespace nodes on that page so the memory saving should be about 340kb. Maybe this is due to how memory is allocated? The savings are for all the StringImpl objects, not just the characters.
Comment on attachment 117712 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=117712&action=review > Source/WebCore/html/parser/HTMLConstructionSite.cpp:350 > + if (whitespaceInfo == WhitespaceUnknown) > + whitespaceInfo = isAllWhitespace(characters) ? AllWhitespace : NotAllWhitespace; I think you should write this: // Strings composed entirely of whitespace are likely to be repeated. // Turn them into AtomicString so we share a single string for each. bool shouldUseAtomicString = whitespaceInfo == AllWhitespace || (whitespaceInfo == WhitespaceUnknown && isAllWhitespace(characters)); > Source/WebCore/html/parser/HTMLConstructionSite.cpp:366 > - RefPtr<Text> textNode = Text::createWithLengthLimit(site.parent->document(), characters, currentPosition); > + RefPtr<Text> textNode = Text::createWithLengthLimit(site.parent->document(), whitespaceInfo == AllWhitespace ? AtomicString(characters).string() : characters, currentPosition); This code needs a comment explaining the policy. Or better, it could just use a boolean named shouldUseAtomicString and the comment could be above. > Source/WebCore/html/parser/HTMLConstructionSite.cpp:369 > - textNode = Text::create(site.parent->document(), characters.substring(currentPosition)); > + textNode = Text::create(site.parent->document(), whitespaceInfo == AllWhitespace ? AtomicString(characters.substring(currentPosition)).string() : characters.substring(currentPosition)); It would be nice to put the substring into a local String so we don’t have to repeat that subexpression. > Source/WebCore/html/parser/HTMLConstructionSite.h:55 > + enum WhitespaceInfo { > + AllWhitespace, > + NotAllWhitespace, > + WhitespaceUnknown > + }; Since this is software, everything is “info”, so that seems like a non-helpful suffix, and abbreviated to boot! I didn’t think of a better name, though, and I did try for a few minutes. I think it would be fine to have this at the namespace level instead of using a class member. Would make the call sites more readable. > Source/WebCore/html/parser/HTMLTreeBuilder.cpp:276 > - String takeLeadingWhitespace() > + AtomicString takeLeadingWhitespace() Why is it helpful for this to return an AtomicString? It seems like extra unnecessary work and extra unnecessary change. > Source/WebCore/html/parser/HTMLTreeBuilder.cpp:300 > - String takeRemainingWhitespace() > + AtomicString takeRemainingWhitespace() Why is it helpful for this to return an AtomicString?
Created attachment 117932 [details] Patch
Thanks again Darin. New patch is up. I had one concern that I wanted to discuss before committing this. In WebKit1 the AtomicStringTable never gets cleaned up (Chromium kills the process when you leave the page/domain and I assume WebKit2 does something similar.). Maybe we should limit the whitespace text nodes in some way to reduce the amount of memory these strings will take up. One possible option would be to limit this to strings with less than 8 or 16 characters?
(In reply to comment #23) > In WebKit1 the AtomicStringTable never gets cleaned up Sure it does. When the last reference to a string is removed the string is removed from the table.
Comment on attachment 117932 [details] Patch Attachment 117932 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10728791 New failing tests: svg/custom/linking-uri-01-b.svg
Comment on attachment 117932 [details] Patch Clearing flags on attachment: 117932 Committed r102095: <http://trac.webkit.org/changeset/102095>
All reviewed patches have been landed. Closing bug.