Bug 18665

Summary: HTMLTokenizer double allocates strings for <script>
Product: WebKit Reporter: Mike Belshe <mbelshe>
Component: DOMAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: darin, hyatt, mjs
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: All   
Attachments:
Description Flags
Patch to HTMLTokenizer string allocation
mbelshe: review-
Update.
none
Updated patch, this one with a changelog. koivisto: review+

Description Mike Belshe 2008-04-21 12:56:09 PDT
In HTMLTokenizer, we call:

   String scriptCode(buffer, dest - buffer);  // this allocates a string
   processToken();   // This allocates another string for the Text in the DOM

So all javascript code gets allocated twice; in the case of the popular web app I was using, this totalled over 1.3MB of extra space.  (The JS code measured ~650KB, but in UChars, that measures 1.3MB)

I'm attaching a small patch, which passes all layout tests.  The fix is to have the scriptCode share the string created via the node Text.  The only question I have is that this string is created via StringImpl::createStrippingNullCharacters().  Is it possible to have embedded null characters inside script?  I don't think so, but I am not certain.
Comment 1 Mike Belshe 2008-04-21 12:57:07 PDT
Created attachment 20738 [details]
Patch to HTMLTokenizer string allocation
Comment 2 Eric Seidel (no email) 2008-04-21 13:14:04 PDT
I bet this could be a win on the PLT.
Comment 3 Eric Seidel (no email) 2008-04-21 15:11:23 PDT
Comment on attachment 20738 [details]
Patch to HTMLTokenizer string allocation

We don't generally use PassRefPtr<> on the stack, rather we use RefPtr<> and call .release() on the ref ptr (to make it more explicit that you're clearing it).

This patch changes two behaviors.

1.  The string passed off into JSC now has all nulls removed.  I'm not sure what we would actually do when encoutering a null char in a script?  I assume that's an error, in which case scripts will now succeed which would previously fail...

2. Under what circumstances can processToken() return 0?  You've changed the behavior.  Before we would have passed a non-empty string off to JSC, where as now we'll pass an empty string in that case.

In general the patch looks good, and the idea is sound, but we should answer these minor details before landing this.
Comment 4 Mike Belshe 2008-04-21 15:20:06 PDT
I believe the case for processToken() returning null is when the node hasn't been fully processed yet (e.g. you've seen the start <script> tag, but no end tag)

However, I admit, I am not an expert in this code, and would like someone who knows it better to help!
Comment 5 Mike Belshe 2008-04-28 07:18:34 PDT
The patch I submitted is not sufficient; turns out that there is a 64KB limit on text nodes, so if you pass <script></script> larger than that, it gets processed and breaks.

The memory savings potential here is real, however.
Comment 6 Mike Belshe 2008-05-08 14:29:41 PDT
Created attachment 21029 [details]
Update.
Comment 7 Mike Belshe 2008-05-08 14:31:46 PDT
I've updated with a new patch.  Currently, the parser chops text blocks into 64KB chunks.  For normal text, this works fine.  However, in order to share the string which is passed to JS with the string used in the DOM, we need one contiguous string.  Modified the HTMLParser such that it will not chop text blocks into 64KB chunks if it is part of a script tag.

Passes all layout tests.
Comment 8 Maciej Stachowiak 2008-05-09 02:08:47 PDT
Comment on attachment 21029 [details]
Update.

Please provide a ChangeLog entry, and once that is done flag for review.
Comment 9 Mike Belshe 2008-05-12 15:52:57 PDT
Created attachment 21094 [details]
Updated patch, this one with a changelog.
Comment 10 Eric Seidel (no email) 2008-08-01 01:49:54 PDT
Comment on attachment 21094 [details]
Updated patch, this one with a changelog.

I assume he meant to mark this for review.

One style nit:
+    if (t->tagName == textAtom && t->text && current->localName() != scriptTag)
+    {

{ should be on the same line as the if in WebKit style.

Also, generally the changelog reads:
date/name line
blank
reviewed by line
blank
description and link to bug
blank
test cases (if applicable)
blank
comments on individual files.

The patch looks good to me, but I think someone else who has hacked in this code recently should at least glance at it.
Comment 11 Eric Seidel (no email) 2008-08-02 00:58:51 PDT
Comment on attachment 21094 [details]
Updated patch, this one with a changelog.

Assigning this to Beth.  Any reviewer is still of course welcome to review it.  But given that she's hacked in this area recently, it makes sense to get her comments.  Beth, if you don't feel you should review this, please feel free to request someone else, or just un-assign it with comment.
Comment 12 Antti Koivisto 2008-08-04 13:22:38 PDT
Comment on attachment 21094 [details]
Updated patch, this one with a changelog.

r=me, with the style fixes pointed out by Eric.

Nice simple memory optimization.
Comment 13 Eric Seidel (no email) 2008-08-04 17:33:45 PDT
I've applied this locally, once the tests finish running it will be committed.
Comment 14 Eric Seidel (no email) 2008-08-04 17:39:48 PDT
http://trac.webkit.org/changeset/35544