WebKit Bugzilla
New
Browse
Search+
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
72404
Could save a lot of memory in CharacterData by not always storing a String
https://bugs.webkit.org/show_bug.cgi?id=72404
Summary
Could save a lot of memory in CharacterData by not always storing a String
Ojan Vafai
Reported
2011-11-15 12:25:54 PST
Below is the dump just of whitespace text nodes from the HTML spec (string followed by count of that string occurring). If we stored atomic strings in these cases, we'd save a lot of memory. There are also non-whitespace strings that are repeated many times in the spec (e.g. "Unique " is in >30k text nodes). For a short whitespace string that conforms to the list of 5 valid space characters in
http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#space-character
, we could even store it as a single unsigned. Looking at the HTML spec, this is actually by far the common case. The HTML spec is a bit unique in it's content, but it's certainly a common pattern for sites to have repeated whitespace due to formatting their HTML to be readable. " " 8197 " " 4172 " " 1052 " " 469 " " 1 " " 413 " " 21 " " 3099 " " 224 " " 497 " " 96 " " 157 " " 19240 " " 52 " " 242 " " 122 " " 472 " " 6117 " " 242 " " 4 " " 17 " " 3126 " " 1132 " " 2 " " 7 " " 545 " " 445 " " 78 " " 2 " " 166 " " 140 " " 1 " " 51 " " 26 " " 11 " " 7 " " 2 " " 3 " " 29 " " 6 " " 1 " " 1 " " 7 " " 7 " " 1 " " 7 " " 2 " " 9 " " 1 " " 1 " " 1 " " 2 " " 2 " " 1 " " 8 " " 1 " " 1 " " 1 " " 1 " " 1 " " 1 " " 1 " " 1 " " 2 " " 1 " " 3 " " 1 " " 1 " " 1 " " 2 " " 1 " " 8 " " 2 " " 1
Attachments
Patch
(20.87 KB, patch)
2011-12-01 16:21 PST
,
Erik Arvidsson
no flags
Details
Formatted Diff
Diff
WIP patch using AtomicString instead of AlignedBuffer
(19.78 KB, patch)
2011-12-01 17:03 PST
,
Erik Arvidsson
no flags
Details
Formatted Diff
Diff
Patch
(10.04 KB, patch)
2011-12-02 16:50 PST
,
Erik Arvidsson
no flags
Details
Formatted Diff
Diff
Patch
(9.36 KB, patch)
2011-12-05 13:28 PST
,
Erik Arvidsson
no flags
Details
Formatted Diff
Diff
Show Obsolete
(3)
View All
Add attachment
proposed patch, testcase, etc.
Darin Adler
Comment 1
2011-11-15 13:25:14 PST
Neat ideas. I think yours are are: 1) Use AtomicString instead of String. 2) Use a count of U+0020 space characters instead of a string. 3) Use a count of HTML space characters instead of a string. Not clear on how this would work. Some other ideas: 4) Store the text in the node itself instead of in a separate object, allocating at a larger size to do so. This requires refactoring string operations so they can work directly on text without having to allocate a String object. That’s something I’ve been thinking that refactoring is something we should do. 5) Keep the original text around and reference count it; have the strings created by the HTML parser use pointers and lengths pointing at a single large character buffer instead of allocating separate StringImpl objects for each. I think (1) is the easiest to do correctly and should be tried to see what the performance cost would be.
Geoffrey Garen
Comment 2
2011-11-15 13:26:49 PST
How much memory does CharacterData account for on an average web page?
Adam Barth
Comment 3
2011-11-15 13:33:31 PST
One quick hack would be for the parser to check when parsing whether a string is entirely whitespace and to use AtomicStrings just for those cases (maybe that's what you were suggesting).
Erik Arvidsson
Comment 4
2011-11-15 13:35:14 PST
(In reply to
comment #3
)
> One quick hack would be for the parser to check when parsing whether a string is entirely whitespace and to use AtomicStrings just for those cases (maybe that's what you were suggesting).
That was pretty much what we were thinking ;)
Ojan Vafai
Comment 5
2011-11-15 13:36:03 PST
The way I was picturing option 3 work is that we'd just do it for strings that only contain U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF) and U+000D CARRIAGE RETURN (CR). That way, you can use 2 bits per character and could encode strings up to 16 characters long as a single unsigned. I think this is probably too complicated to be worth it. Option 5 sounds awesomest, but complicated. I agree that we should try the AtomicString approach first just for whitespace strings (possibly with a length limit?). It's the simplest and gets the vast majority of the memory benefits.
Ojan Vafai
Comment 6
2011-11-15 13:41:07 PST
BTW, here's the script I'm using to count the whitespace text nodes on a page: var hash = {}; function walk(node) { if (node.nodeType == Node.TEXT_NODE || node.nodeType == Node.COMMENT_NODE || node.nodeType == Node.PROCESSING_INSTRUCTION_NODE) { var value = node.nodeValue; if (!hash[value]) hash[value] = 0; hash[value] += 1; } for (var i = 0, len = node.childNodes.length; i < len; i++) { walk(node.childNodes[i]); } } walk(document.body); var uniqueStrings = 0; for (var value in hash) { var count = hash[value]; if (value.match(/^\s*$/)) { console.log(JSON.stringify(value), count); uniqueStrings++; } } console.log(uniqueStrings); And here's a more readable dump of the HTML spec's whitespace nodes: "\n\n " 8197 "\n " 4172 "\n " 1052 "\n " 471 "\n \n" 1 "\n" 419 "\n " 21 "\n " 3099 "\n " 228 "\n " 497 "\n " 96 "\n " 157 " " 19240 "\n\n\n\n\n " 52 "\n\n\n\n " 242 "\n\n" 122 "\n\n\n " 472 "\n\n " 6117 "\n\n\n " 242 "\n\n\n\n " 4 "\n\n\n" 17 "\n\n " 3126 "\n\n " 1132 "\n\n\n\n\n\n\n " 2 "\n\n\n\n\n\n " 7 "\n\n " 545 "\n\n " 445 "\n\n\n " 78 "\n\n " 2 "\n\n " 166 "\n\n " 140 "\n \n " 1 "\n\n " 51 "\n\n " 26 "\n\n " 11 "\n " 7 "\n\n " 2 "\n\n " 3 "\n\n\n " 29 "\n\n\n " 6 " " 1 " " 1 " \n\n " 7 " \n\n " 7 "\n \n\n\n\n " 1 "\n\n \n\n " 7 " " 2 "\n " 9 " " 1 "\n " 1 "\n\n \n " 1 "\n\n\n\n\n\n\n\n " 2 "\n \n " 2 "\n \n " 1 "\n\n\n " 8 "\n\n\n\n\n" 1 "\n \n " 1 "\n\n\n\n" 1 "\n\n\n " 1 " " 1 " " 1 " " 1 " " 1 " " 2 " " 1 " " 3 " " 1 " " 1 " " 1 " " 2 "" 2 " \n\n\n " 1 "\n " 8 " \n\n " 2 "\n\n \n" 1 75
Antti Koivisto
Comment 7
2011-11-15 13:48:36 PST
Nice find!
Erik Arvidsson
Comment 8
2011-11-30 16:45:55 PST
I started looking at this and I hit the limitation of my C++ knowledge. To make this worth it, it seems like we would need to use a union type. However, union types cannot have POD members. I talked to Jeffrey Yasskin (C++ guru at Google) and he pointed me to the ManualConstructor pattern (BSD licensed):
http://code.google.com/p/google-ctemplate/source/browse/trunk/src/base/manual_constructor.h
Does WTF have anything similar? If not, can we add something like it? (Preferably a port of the GTL code since according to Jeffrey it is very hard to get this right). Maybe I'm missing something and there is a simpler way to do this? Also, when testing this I used ActivityMonitor. What is a better tool to see the memory usage of this?
Darin Adler
Comment 9
2011-11-30 18:13:29 PST
(In reply to
comment #8
)
> I talked to Jeffrey Yasskin (C++ guru at Google) and he pointed me to the ManualConstructor pattern (BSD licensed): > > Does WTF have anything similar?
It does have something like it, AlignedBuffer, used in the Vector and HashTable class. It could use some additional love to make it handle type conversion more like that ManualConstructor class.
> If not, can we add something like it? (Preferably a port of the GTL code since according to Jeffrey it is very hard to get this right).
A port of the code is OK, but wouldn’t be my first choice.
Erik Arvidsson
Comment 10
2011-11-30 18:43:10 PST
Darin, thanks for your comment. I took a quick look at AlignedBuffer and it seem to cover this use case. I'll try to add an AlignedBuffer to my patch in progress tomorrow. Just to make things clear. Here are my current thought. - Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not. - Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)> - Add an assert that sizeof(String) == sizeof(AtomicString) - Based on the flag return a String or AtomicString.
Darin Adler
Comment 11
2011-12-01 15:58:00 PST
(In reply to
comment #10
)
> - Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not. > - Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)> > - Add an assert that sizeof(String) == sizeof(AtomicString) > - Based on the flag return a String or AtomicString.
If the two different things you are storing as a union either a String or an AtomicString, then I think you are over-engineering this. The StringImpl object already has a flag in it that tells whether it is in the AtomicString table, and simply constructing an AtomicString then storing it in a String accomplishes what you want to do, making it share a single unique copy.
Erik Arvidsson
Comment 12
2011-12-01 16:17:14 PST
(In reply to
comment #11
)
> (In reply to
comment #10
) > > - Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not. > > - Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)> > > - Add an assert that sizeof(String) == sizeof(AtomicString) > > - Based on the flag return a String or AtomicString. > > If the two different things you are storing as a union either a String or an AtomicString, then I think you are over-engineering this. The StringImpl object already has a flag in it that tells whether it is in the AtomicString table, and simply constructing an AtomicString then storing it in a String accomplishes what you want to do, making it share a single unique copy.
I was just about to upload a WIP patch that uses AlignedBuffer. It saves about 10megs (5%) on the HTML5. I'll still upload it for people to look at but I'll investigate the String/AtomicString case too.
Erik Arvidsson
Comment 13
2011-12-01 16:21:58 PST
Created
attachment 117514
[details]
Patch
Erik Arvidsson
Comment 14
2011-12-01 17:03:08 PST
Created
attachment 117525
[details]
WIP patch using AtomicString instead of AlignedBuffer
Erik Arvidsson
Comment 15
2011-12-01 17:03:47 PST
(In reply to
comment #14
)
> Created an attachment (id=117525) [details] > WIP patch using AtomicString instead of AlignedBuffer
This patch is a mess but it works :-)
Darin Adler
Comment 16
2011-12-01 17:04:46 PST
(In reply to
comment #15
)
> (In reply to
comment #14
) > > Created an attachment (id=117525) [details] [details] > > WIP patch using AtomicString instead of AlignedBuffer > > This patch is a mess but it works :-)
You can accomplish the same thing with a tiny fraction of the code. The Text and CharacterData classes don’t even need to be changed!
Darin Adler
Comment 17
2011-12-01 17:07:09 PST
Comment on
attachment 117525
[details]
WIP patch using AtomicString instead of AlignedBuffer View in context:
https://bugs.webkit.org/attachment.cgi?id=117525&action=review
> Source/WebCore/html/parser/HTMLConstructionSite.cpp:383 > + attachAtSite(site, Text::createWhitespaceOnly(site.parent->document(), characters));
All you need to do here is this: attachAtSite(site, Text::create(site.parent->document(), AtomicString(characters).string());
Erik Arvidsson
Comment 18
2011-12-02 16:50:49 PST
Created
attachment 117712
[details]
Patch
Erik Arvidsson
Comment 19
2011-12-02 16:58:43 PST
(In reply to
comment #18
)
> Created an attachment (id=117712) [details] > Patch
On
http://www.whatwg.org/specs/web-apps/current-work/
this saves about 5mb (out of 277). This seems too much to me. There are only 169049 characters used for whitespace nodes on that page so the memory saving should be about 340kb. Maybe this is due to how memory is allocated?
Darin Adler
Comment 20
2011-12-02 18:06:11 PST
(In reply to
comment #19
)
> On
http://www.whatwg.org/specs/web-apps/current-work/
this saves about 5mb (out of 277). This seems too much to me. There are only 169049 characters used for whitespace nodes on that page so the memory saving should be about 340kb. Maybe this is due to how memory is allocated?
The savings are for all the StringImpl objects, not just the characters.
Darin Adler
Comment 21
2011-12-02 18:14:10 PST
Comment on
attachment 117712
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=117712&action=review
> Source/WebCore/html/parser/HTMLConstructionSite.cpp:350 > + if (whitespaceInfo == WhitespaceUnknown) > + whitespaceInfo = isAllWhitespace(characters) ? AllWhitespace : NotAllWhitespace;
I think you should write this: // Strings composed entirely of whitespace are likely to be repeated. // Turn them into AtomicString so we share a single string for each. bool shouldUseAtomicString = whitespaceInfo == AllWhitespace || (whitespaceInfo == WhitespaceUnknown && isAllWhitespace(characters));
> Source/WebCore/html/parser/HTMLConstructionSite.cpp:366 > - RefPtr<Text> textNode = Text::createWithLengthLimit(site.parent->document(), characters, currentPosition); > + RefPtr<Text> textNode = Text::createWithLengthLimit(site.parent->document(), whitespaceInfo == AllWhitespace ? AtomicString(characters).string() : characters, currentPosition);
This code needs a comment explaining the policy. Or better, it could just use a boolean named shouldUseAtomicString and the comment could be above.
> Source/WebCore/html/parser/HTMLConstructionSite.cpp:369 > - textNode = Text::create(site.parent->document(), characters.substring(currentPosition)); > + textNode = Text::create(site.parent->document(), whitespaceInfo == AllWhitespace ? AtomicString(characters.substring(currentPosition)).string() : characters.substring(currentPosition));
It would be nice to put the substring into a local String so we don’t have to repeat that subexpression.
> Source/WebCore/html/parser/HTMLConstructionSite.h:55 > + enum WhitespaceInfo { > + AllWhitespace, > + NotAllWhitespace, > + WhitespaceUnknown > + };
Since this is software, everything is “info”, so that seems like a non-helpful suffix, and abbreviated to boot! I didn’t think of a better name, though, and I did try for a few minutes. I think it would be fine to have this at the namespace level instead of using a class member. Would make the call sites more readable.
> Source/WebCore/html/parser/HTMLTreeBuilder.cpp:276 > - String takeLeadingWhitespace() > + AtomicString takeLeadingWhitespace()
Why is it helpful for this to return an AtomicString? It seems like extra unnecessary work and extra unnecessary change.
> Source/WebCore/html/parser/HTMLTreeBuilder.cpp:300 > - String takeRemainingWhitespace() > + AtomicString takeRemainingWhitespace()
Why is it helpful for this to return an AtomicString?
Erik Arvidsson
Comment 22
2011-12-05 13:28:15 PST
Created
attachment 117932
[details]
Patch
Erik Arvidsson
Comment 23
2011-12-05 13:34:08 PST
Thanks again Darin. New patch is up. I had one concern that I wanted to discuss before committing this. In WebKit1 the AtomicStringTable never gets cleaned up (Chromium kills the process when you leave the page/domain and I assume WebKit2 does something similar.). Maybe we should limit the whitespace text nodes in some way to reduce the amount of memory these strings will take up. One possible option would be to limit this to strings with less than 8 or 16 characters?
Darin Adler
Comment 24
2011-12-05 14:32:57 PST
(In reply to
comment #23
)
> In WebKit1 the AtomicStringTable never gets cleaned up
Sure it does. When the last reference to a string is removed the string is removed from the table.
WebKit Review Bot
Comment 25
2011-12-05 15:39:33 PST
Comment on
attachment 117932
[details]
Patch
Attachment 117932
[details]
did not pass chromium-ews (chromium-xvfb): Output:
http://queues.webkit.org/results/10728791
New failing tests: svg/custom/linking-uri-01-b.svg
WebKit Review Bot
Comment 26
2011-12-05 22:13:18 PST
Comment on
attachment 117932
[details]
Patch Clearing flags on attachment: 117932 Committed
r102095
: <
http://trac.webkit.org/changeset/102095
>
WebKit Review Bot
Comment 27
2011-12-05 22:13:25 PST
All reviewed patches have been landed. Closing bug.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug