RESOLVED FIXED 18665
HTMLTokenizer double allocates strings for <script>
https://bugs.webkit.org/show_bug.cgi?id=18665
Summary HTMLTokenizer double allocates strings for <script>
Mike Belshe
Reported 2008-04-21 12:56:09 PDT
In HTMLTokenizer, we call: String scriptCode(buffer, dest - buffer); // this allocates a string processToken(); // This allocates another string for the Text in the DOM So all javascript code gets allocated twice; in the case of the popular web app I was using, this totalled over 1.3MB of extra space. (The JS code measured ~650KB, but in UChars, that measures 1.3MB) I'm attaching a small patch, which passes all layout tests. The fix is to have the scriptCode share the string created via the node Text. The only question I have is that this string is created via StringImpl::createStrippingNullCharacters(). Is it possible to have embedded null characters inside script? I don't think so, but I am not certain.
Attachments
Patch to HTMLTokenizer string allocation (559 bytes, patch)
2008-04-21 12:57 PDT, Mike Belshe
mbelshe: review-
Update. (1.81 KB, patch)
2008-05-08 14:29 PDT, Mike Belshe
no flags
Updated patch, this one with a changelog. (2.84 KB, patch)
2008-05-12 15:52 PDT, Mike Belshe
koivisto: review+
Mike Belshe
Comment 1 2008-04-21 12:57:07 PDT
Created attachment 20738 [details] Patch to HTMLTokenizer string allocation
Eric Seidel (no email)
Comment 2 2008-04-21 13:14:04 PDT
I bet this could be a win on the PLT.
Eric Seidel (no email)
Comment 3 2008-04-21 15:11:23 PDT
Comment on attachment 20738 [details] Patch to HTMLTokenizer string allocation We don't generally use PassRefPtr<> on the stack, rather we use RefPtr<> and call .release() on the ref ptr (to make it more explicit that you're clearing it). This patch changes two behaviors. 1. The string passed off into JSC now has all nulls removed. I'm not sure what we would actually do when encoutering a null char in a script? I assume that's an error, in which case scripts will now succeed which would previously fail... 2. Under what circumstances can processToken() return 0? You've changed the behavior. Before we would have passed a non-empty string off to JSC, where as now we'll pass an empty string in that case. In general the patch looks good, and the idea is sound, but we should answer these minor details before landing this.
Mike Belshe
Comment 4 2008-04-21 15:20:06 PDT
I believe the case for processToken() returning null is when the node hasn't been fully processed yet (e.g. you've seen the start <script> tag, but no end tag) However, I admit, I am not an expert in this code, and would like someone who knows it better to help!
Mike Belshe
Comment 5 2008-04-28 07:18:34 PDT
The patch I submitted is not sufficient; turns out that there is a 64KB limit on text nodes, so if you pass <script></script> larger than that, it gets processed and breaks. The memory savings potential here is real, however.
Mike Belshe
Comment 6 2008-05-08 14:29:41 PDT
Created attachment 21029 [details] Update.
Mike Belshe
Comment 7 2008-05-08 14:31:46 PDT
I've updated with a new patch. Currently, the parser chops text blocks into 64KB chunks. For normal text, this works fine. However, in order to share the string which is passed to JS with the string used in the DOM, we need one contiguous string. Modified the HTMLParser such that it will not chop text blocks into 64KB chunks if it is part of a script tag. Passes all layout tests.
Maciej Stachowiak
Comment 8 2008-05-09 02:08:47 PDT
Comment on attachment 21029 [details] Update. Please provide a ChangeLog entry, and once that is done flag for review.
Mike Belshe
Comment 9 2008-05-12 15:52:57 PDT
Created attachment 21094 [details] Updated patch, this one with a changelog.
Eric Seidel (no email)
Comment 10 2008-08-01 01:49:54 PDT
Comment on attachment 21094 [details] Updated patch, this one with a changelog. I assume he meant to mark this for review. One style nit: + if (t->tagName == textAtom && t->text && current->localName() != scriptTag) + { { should be on the same line as the if in WebKit style. Also, generally the changelog reads: date/name line blank reviewed by line blank description and link to bug blank test cases (if applicable) blank comments on individual files. The patch looks good to me, but I think someone else who has hacked in this code recently should at least glance at it.
Eric Seidel (no email)
Comment 11 2008-08-02 00:58:51 PDT
Comment on attachment 21094 [details] Updated patch, this one with a changelog. Assigning this to Beth. Any reviewer is still of course welcome to review it. But given that she's hacked in this area recently, it makes sense to get her comments. Beth, if you don't feel you should review this, please feel free to request someone else, or just un-assign it with comment.
Antti Koivisto
Comment 12 2008-08-04 13:22:38 PDT
Comment on attachment 21094 [details] Updated patch, this one with a changelog. r=me, with the style fixes pointed out by Eric. Nice simple memory optimization.
Eric Seidel (no email)
Comment 13 2008-08-04 17:33:45 PDT
I've applied this locally, once the tests finish running it will be committed.
Eric Seidel (no email)
Comment 14 2008-08-04 17:39:48 PDT
Note You need to log in before you can comment on or make changes to this bug.