Bug 72404 - Could save a lot of memory in CharacterData by not always storing a String
Summary: Could save a lot of memory in CharacterData by not always storing a String
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Erik Arvidsson
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-11-15 12:25 PST by Ojan Vafai
Modified: 2011-12-05 22:13 PST (History)
12 users (show)

See Also:


Attachments
Patch (20.87 KB, patch)
2011-12-01 16:21 PST, Erik Arvidsson
no flags Details | Formatted Diff | Diff
WIP patch using AtomicString instead of AlignedBuffer (19.78 KB, patch)
2011-12-01 17:03 PST, Erik Arvidsson
no flags Details | Formatted Diff | Diff
Patch (10.04 KB, patch)
2011-12-02 16:50 PST, Erik Arvidsson
no flags Details | Formatted Diff | Diff
Patch (9.36 KB, patch)
2011-12-05 13:28 PST, Erik Arvidsson
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ojan Vafai 2011-11-15 12:25:54 PST
Below is the dump just of whitespace text nodes from the HTML spec (string followed by count of that string occurring). If we stored atomic strings in these cases, we'd save a lot of memory. There are also non-whitespace strings that are repeated many times in the spec (e.g. "Unique " is in >30k text nodes). 

For a short whitespace string that conforms to the list of 5 valid space characters in http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#space-character, we could even store it as a single unsigned. Looking at the HTML spec, this is actually by far the common case.

The HTML spec is a bit unique in it's content, but it's certainly a common pattern for sites to have repeated whitespace due to formatting their HTML to be readable.


"

  " 8197
"
   " 4172
"
    " 1052
"
  " 469
"
  
" 1
"
" 413
"
 " 21
"
     " 3099
"
      " 224
"
       " 497
"
        " 96
"
         " 157
" " 19240
"




  " 52
"



  " 242
"

" 122
"


  " 472
"

   " 6117
"


   " 242
"



   " 4
"


" 17
"

    " 3126
"

     " 1132
"






  " 2
"





  " 7
"

      " 545
"

       " 445
"


     " 78
"

 " 2
"

        " 166
"

         " 140
"
 
  " 1
"

          " 51
"

           " 26
"

            " 11
"
            " 7
"

              " 2
"

             " 3
"


       " 29
"


    " 6
"          " 1
"     " 1
" 

      " 7
" 

       " 7
"
  



  " 1
"

  

  " 7
"   " 2
"
          " 9
"                                                          " 1
"
                                                                            " 1
"

                                   
   " 1
"







  " 2
"
  
  " 2
"
 
    " 1
"


         " 8
"




" 1
"
                                   
  " 1
"



" 1
"


      " 1
" " 1
" " 1
" " 1
" " 1
" " 2
" " 1
" " 3
" " 1
" " 1
"  " 1
" " 2
"  


  " 1
"
           " 8
"   

   " 2
"

 
" 1
Comment 1 Darin Adler 2011-11-15 13:25:14 PST
Neat ideas. I think yours are are:

1) Use AtomicString instead of String.

2) Use a count of U+0020 space characters instead of a string.

3) Use a count of HTML space characters instead of a string. Not clear on how this would work.

Some other ideas:

4) Store the text in the node itself instead of in a separate object, allocating at a larger size to do so. This requires refactoring string operations so they can work directly on text without having to allocate a String object. That’s something I’ve been thinking that refactoring is something we should do.

5) Keep the original text around and reference count it; have the strings created by the HTML parser use pointers and lengths pointing at a single large character buffer instead of allocating separate StringImpl objects for each.

I think (1) is the easiest to do correctly and should be tried to see what the performance cost would be.
Comment 2 Geoffrey Garen 2011-11-15 13:26:49 PST
How much memory does CharacterData account for on an average web page?
Comment 3 Adam Barth 2011-11-15 13:33:31 PST
One quick hack would be for the parser to check when parsing whether a string is entirely whitespace and to use AtomicStrings just for those cases (maybe that's what you were suggesting).
Comment 4 Erik Arvidsson 2011-11-15 13:35:14 PST
(In reply to comment #3)
> One quick hack would be for the parser to check when parsing whether a string is entirely whitespace and to use AtomicStrings just for those cases (maybe that's what you were suggesting).

That was pretty much what we were thinking ;)
Comment 5 Ojan Vafai 2011-11-15 13:36:03 PST
The way I was picturing option 3 work is that we'd just do it for strings that only contain U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF) and U+000D CARRIAGE RETURN (CR). That way, you can use 2 bits per character and could encode strings up to 16 characters long as a single unsigned. I think this is probably too complicated to be worth it.

Option 5 sounds awesomest, but complicated.

I agree that we should try the AtomicString approach first just for whitespace strings (possibly with a length limit?). It's the simplest and gets the vast majority of the memory benefits.
Comment 6 Ojan Vafai 2011-11-15 13:41:07 PST
BTW, here's the script I'm using to count the whitespace text nodes on a page:

var hash = {};
function walk(node) {
    if (node.nodeType == Node.TEXT_NODE || node.nodeType == Node.COMMENT_NODE || node.nodeType == Node.PROCESSING_INSTRUCTION_NODE) {
        var value = node.nodeValue;
        if (!hash[value])
            hash[value] = 0;
        hash[value] += 1;
    }
    for (var i = 0, len = node.childNodes.length; i < len; i++) {
        walk(node.childNodes[i]);
    }
}

walk(document.body);
var uniqueStrings = 0;
for (var value in hash) {
    var count = hash[value];
    if (value.match(/^\s*$/)) {
        console.log(JSON.stringify(value), count);
        uniqueStrings++;
    }
}
console.log(uniqueStrings);

And here's a more readable dump of the HTML spec's whitespace nodes:
"\n\n  " 8197
"\n   " 4172
"\n    " 1052
"\n  " 471
"\n  \n" 1
"\n" 419
"\n " 21
"\n     " 3099
"\n      " 228
"\n       " 497
"\n        " 96
"\n         " 157
" " 19240
"\n\n\n\n\n  " 52
"\n\n\n\n  " 242
"\n\n" 122
"\n\n\n  " 472
"\n\n   " 6117
"\n\n\n   " 242
"\n\n\n\n   " 4
"\n\n\n" 17
"\n\n    " 3126
"\n\n     " 1132
"\n\n\n\n\n\n\n  " 2
"\n\n\n\n\n\n  " 7
"\n\n      " 545
"\n\n       " 445
"\n\n\n     " 78
"\n\n " 2
"\n\n        " 166
"\n\n         " 140
"\n \n  " 1
"\n\n          " 51
"\n\n           " 26
"\n\n            " 11
"\n            " 7
"\n\n              " 2
"\n\n             " 3
"\n\n\n       " 29
"\n\n\n    " 6
"          " 1
"     " 1
" \n\n      " 7
" \n\n       " 7
"\n  \n\n\n\n  " 1
"\n\n  \n\n  " 7
"   " 2
"\n          " 9
"                                                          " 1
"\n                                                                            " 1
"\n\n                                   \n   " 1
"\n\n\n\n\n\n\n\n  " 2
"\n  \n  " 2
"\n \n    " 1
"\n\n\n         " 8
"\n\n\n\n\n" 1
"\n                                   \n  " 1
"\n\n\n\n" 1
"\n\n\n      " 1
" " 1
" " 1
" " 1
" " 1
" " 2
" " 1
" " 3
" " 1
" " 1
"  " 1
" " 2
"" 2
"  \n\n\n  " 1
"\n           " 8
"   \n\n   " 2
"\n\n \n" 1
75
Comment 7 Antti Koivisto 2011-11-15 13:48:36 PST
Nice find!
Comment 8 Erik Arvidsson 2011-11-30 16:45:55 PST
I started looking at this and I hit the limitation of my C++ knowledge.

To make this worth it, it seems like we would need to use a union type. However, union types cannot have POD members.

I talked to Jeffrey Yasskin (C++ guru at Google) and he pointed me to the ManualConstructor pattern (BSD licensed):

http://code.google.com/p/google-ctemplate/source/browse/trunk/src/base/manual_constructor.h

Does WTF have anything similar? If not, can we add something like it? (Preferably a port of the GTL code since according to Jeffrey it is very hard to get this right).

Maybe I'm missing something and there is a simpler way to do this?

Also, when testing this I used ActivityMonitor. What is a better tool to see the memory usage of this?
Comment 9 Darin Adler 2011-11-30 18:13:29 PST
(In reply to comment #8)
> I talked to Jeffrey Yasskin (C++ guru at Google) and he pointed me to the ManualConstructor pattern (BSD licensed):
> 
> Does WTF have anything similar?

It does have something like it, AlignedBuffer, used in the Vector and HashTable class. It could use some additional love to make it handle type conversion more like that ManualConstructor class.

> If not, can we add something like it? (Preferably a port of the GTL code since according to Jeffrey it is very hard to get this right).

A port of the code is OK, but wouldn’t be my first choice.
Comment 10 Erik Arvidsson 2011-11-30 18:43:10 PST
Darin, thanks for your comment. I took a quick look at AlignedBuffer and it seem to cover this use case.

I'll try to add an AlignedBuffer to my patch in progress tomorrow.

Just to make things clear. Here are my current thought.

- Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not.
- Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)>
- Add an assert that sizeof(String) == sizeof(AtomicString)
- Based on the flag return a String or AtomicString.
Comment 11 Darin Adler 2011-12-01 15:58:00 PST
(In reply to comment #10)
> - Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not.
> - Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)>
> - Add an assert that sizeof(String) == sizeof(AtomicString)
> - Based on the flag return a String or AtomicString.

If the two different things you are storing as a union either a String or an AtomicString, then I think you are over-engineering this. The StringImpl object already has a flag in it that tells whether it is in the AtomicString table, and simply constructing an AtomicString then storing it in a String accomplishes what you want to do, making it share a single unique copy.
Comment 12 Erik Arvidsson 2011-12-01 16:17:14 PST
(In reply to comment #11)
> (In reply to comment #10)
> > - Use 1 bit in NodeFlags (currently 2 remaining) to determine whether we have a white space node or not.
> > - Use an AlignedBuffer<sizeof(String), WTF_ALIGN_OF(String)>
> > - Add an assert that sizeof(String) == sizeof(AtomicString)
> > - Based on the flag return a String or AtomicString.
> 
> If the two different things you are storing as a union either a String or an AtomicString, then I think you are over-engineering this. The StringImpl object already has a flag in it that tells whether it is in the AtomicString table, and simply constructing an AtomicString then storing it in a String accomplishes what you want to do, making it share a single unique copy.

I was just about to upload a WIP patch that uses AlignedBuffer. It saves about 10megs (5%) on the HTML5.

I'll still upload it for people to look at but I'll investigate the String/AtomicString case too.
Comment 13 Erik Arvidsson 2011-12-01 16:21:58 PST
Created attachment 117514 [details]
Patch
Comment 14 Erik Arvidsson 2011-12-01 17:03:08 PST
Created attachment 117525 [details]
WIP patch using AtomicString instead of AlignedBuffer
Comment 15 Erik Arvidsson 2011-12-01 17:03:47 PST
(In reply to comment #14)
> Created an attachment (id=117525) [details]
> WIP patch using AtomicString instead of AlignedBuffer

This patch is a mess but it works :-)
Comment 16 Darin Adler 2011-12-01 17:04:46 PST
(In reply to comment #15)
> (In reply to comment #14)
> > Created an attachment (id=117525) [details] [details]
> > WIP patch using AtomicString instead of AlignedBuffer
> 
> This patch is a mess but it works :-)

You can accomplish the same thing with a tiny fraction of the code. The Text and CharacterData classes don’t even need to be changed!
Comment 17 Darin Adler 2011-12-01 17:07:09 PST
Comment on attachment 117525 [details]
WIP patch using AtomicString instead of AlignedBuffer

View in context: https://bugs.webkit.org/attachment.cgi?id=117525&action=review

> Source/WebCore/html/parser/HTMLConstructionSite.cpp:383
> +    attachAtSite(site, Text::createWhitespaceOnly(site.parent->document(), characters));

All you need to do here is this:

    attachAtSite(site, Text::create(site.parent->document(), AtomicString(characters).string());
Comment 18 Erik Arvidsson 2011-12-02 16:50:49 PST
Created attachment 117712 [details]
Patch
Comment 19 Erik Arvidsson 2011-12-02 16:58:43 PST
(In reply to comment #18)
> Created an attachment (id=117712) [details]
> Patch

On http://www.whatwg.org/specs/web-apps/current-work/ this saves about 5mb (out of 277). This seems too much to me. There are only 169049 characters used for whitespace nodes on that page so the memory saving should be about 340kb. Maybe this is due to how memory is allocated?
Comment 20 Darin Adler 2011-12-02 18:06:11 PST
(In reply to comment #19)
> On http://www.whatwg.org/specs/web-apps/current-work/ this saves about 5mb (out of 277). This seems too much to me. There are only 169049 characters used for whitespace nodes on that page so the memory saving should be about 340kb. Maybe this is due to how memory is allocated?

The savings are for all the StringImpl objects, not just the characters.
Comment 21 Darin Adler 2011-12-02 18:14:10 PST
Comment on attachment 117712 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=117712&action=review

> Source/WebCore/html/parser/HTMLConstructionSite.cpp:350
> +    if (whitespaceInfo == WhitespaceUnknown)
> +        whitespaceInfo = isAllWhitespace(characters) ? AllWhitespace : NotAllWhitespace;

I think you should write this:

    // Strings composed entirely of whitespace are likely to be repeated.
    // Turn them into AtomicString so we share a single string for each.
    bool shouldUseAtomicString = whitespaceInfo == AllWhitespace
        || (whitespaceInfo == WhitespaceUnknown && isAllWhitespace(characters));

> Source/WebCore/html/parser/HTMLConstructionSite.cpp:366
> -        RefPtr<Text> textNode = Text::createWithLengthLimit(site.parent->document(), characters, currentPosition);
> +        RefPtr<Text> textNode = Text::createWithLengthLimit(site.parent->document(), whitespaceInfo == AllWhitespace ? AtomicString(characters).string() : characters, currentPosition);

This code needs a comment explaining the policy. Or better, it could just use a boolean named shouldUseAtomicString and the comment could be above.

> Source/WebCore/html/parser/HTMLConstructionSite.cpp:369
> -            textNode = Text::create(site.parent->document(), characters.substring(currentPosition));
> +            textNode = Text::create(site.parent->document(), whitespaceInfo == AllWhitespace ? AtomicString(characters.substring(currentPosition)).string() : characters.substring(currentPosition));

It would be nice to put the substring into a local String so we don’t have to repeat that subexpression.

> Source/WebCore/html/parser/HTMLConstructionSite.h:55
> +    enum WhitespaceInfo {
> +        AllWhitespace,
> +        NotAllWhitespace,
> +        WhitespaceUnknown
> +    };

Since this is software, everything is “info”, so that seems like a non-helpful suffix, and abbreviated to boot! I didn’t think of a better name, though, and I did try for a few minutes.

I think it would be fine to have this at the namespace level instead of using a class member. Would make the call sites more readable.

> Source/WebCore/html/parser/HTMLTreeBuilder.cpp:276
> -    String takeLeadingWhitespace()
> +    AtomicString takeLeadingWhitespace()

Why is it helpful for this to return an AtomicString? It seems like extra unnecessary work and extra unnecessary change.

> Source/WebCore/html/parser/HTMLTreeBuilder.cpp:300
> -    String takeRemainingWhitespace()
> +    AtomicString takeRemainingWhitespace()

Why is it helpful for this to return an AtomicString?
Comment 22 Erik Arvidsson 2011-12-05 13:28:15 PST
Created attachment 117932 [details]
Patch
Comment 23 Erik Arvidsson 2011-12-05 13:34:08 PST
Thanks again Darin. New patch is up.

I had one concern that I wanted to discuss before committing this. In WebKit1 the AtomicStringTable never gets cleaned up (Chromium kills the process when you leave the page/domain and I assume WebKit2 does something similar.). Maybe we should limit the whitespace text nodes in some way to reduce the amount of memory these strings will take up. One possible option would be to limit this to strings with less than 8 or 16 characters?
Comment 24 Darin Adler 2011-12-05 14:32:57 PST
(In reply to comment #23)
> In WebKit1 the AtomicStringTable never gets cleaned up

Sure it does. When the last reference to a string is removed the string is removed from the table.
Comment 25 WebKit Review Bot 2011-12-05 15:39:33 PST
Comment on attachment 117932 [details]
Patch

Attachment 117932 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/10728791

New failing tests:
svg/custom/linking-uri-01-b.svg
Comment 26 WebKit Review Bot 2011-12-05 22:13:18 PST
Comment on attachment 117932 [details]
Patch

Clearing flags on attachment: 117932

Committed r102095: <http://trac.webkit.org/changeset/102095>
Comment 27 WebKit Review Bot 2011-12-05 22:13:25 PST
All reviewed patches have been landed.  Closing bug.