Bug 105367

Summary: [chromium] Creation of dedicated workers (isolates) leaks reference to HTMLDocument
Product: WebKit Reporter: Brandon Jones <bajones>
Component: WebCore JavaScriptAssignee: Kenneth Russell <kbr>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, danno, dimich, gman, haraken, igor.oliveira, kbr, levin, mhahnenberg, pfeldman, simonjam, tkent, tonikitoo, tonyg, tzik, ulan, webkit.review.bot, yurys, zmo
Priority: P1    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on: 106016    
Bug Blocks:    
Attachments:
Description Flags
Test pages to provoke the described error
none
Simplified test case
none
Patch used to debug problem
none
Patch
none
Patch
none
Patch for landing none

Description Brandon Jones 2012-12-18 16:32:04 PST
It appears that some C++ objects associated with Javascript wrappers are not being properly destroyed if a Worker was created during page execution.

We have primarily noticed this with WebGL contexts, which provide a convenient visualization of the issue in older versions of Chrome, but suspect that it may be happening with other objects as well. If a WebGL context and Worker object are both created on the same page, refreshing that page several times will force the WebGL context to aquire a smaller than requested backbuffer after hitting internal memory limits. On pages without workers these limits are not hit because objects are collected properly.

A more concrete illustration is to add some instrumentation to the WebKit source. In WebCore/html/canvas/WebGLRenderingContext.cpp, add a logging to the WebGLRenderingContext constructor and destructor so that you can monitor when they are created and destroyed. (This modification is included as a patch in the attached zip).

To see the issue using the attached tests extract them locally and make sure they are accessible from a local web server (If you don't have a local web server set up, navigate to the folder the files were extracted to and run "python -m SimpleHTTPServer". The files should now be accessible at http://localhost:8000) Viewing index.html will then start the test, which loads a simple WebGL page with a worker in an iframe and refreshes it multiple times. If the WebGL backbuffer resolution is different than the requested resolution the canvas will turn red and the test will stop.

If you have added logging to the WebGLRenderingContext constructor and destructor you will see that the destructor is never called. Contrast this with the behavior of index-no-worker.html, which only differs in that the page does not create a worker and displays expected garbage collection patterns where the destructors are called in batches. In the case where the contexts are not destroyed the Chrome heap profiler indicates that there are no live WebGLRenderingContext objects other than the ones on the current instance of the page, but even a full GC will not force the destructors to be called properly.
Comment 1 Brandon Jones 2012-12-18 16:33:05 PST
Created attachment 180056 [details]
Test pages to provoke the described error
Comment 2 Kenneth Russell 2012-12-18 17:00:43 PST
+some people who worked on the DOM bindings, isolates, and GC recently

Initially we thought this was a bug in V8's GC, but because the WebGLRenderingContext objects aren't alive according to the heap profiler, I think it's a bug in the DOM bindings related to isolates. I have a bad feeling that if a dedicated worker is created, it may be preventing C++ objects which are referred to by JavaScript objects (whether those objects are in the DOM or not -- the WebGLRenderingContext isn't) from being properly unreferenced and deleted.

I'm raising this to P1 until the cause is at least understood, as this bug could be causing major memory leaks for applications using workers. I don't think it's specific to WebGL.

Please CC: anyone who may have worked in this area.
Comment 3 Kenneth Russell 2012-12-27 12:34:45 PST
I've been investigating this and have some updates which I'll post shortly. Assigning to myself.
Comment 4 Kenneth Russell 2012-12-27 13:16:38 PST
CC'ing a few more people who may be interested.
Comment 5 Kenneth Russell 2012-12-27 13:19:50 PST
Created attachment 180817 [details]
Simplified test case
Comment 6 Kenneth Russell 2012-12-27 13:22:10 PST
Created attachment 180818 [details]
Patch used to debug problem
Comment 7 Kenneth Russell 2012-12-27 13:36:51 PST
The problem is that the WebWorkerClientImpl instance (Source/WebKit/chromium/src/WebWorkerClientImpl.cpp) is never deleted, leaking a reference to the HTMLDocument via m_scriptExecutionContext. This prevents the destruction of all attached DOM elements, including the Canvas, so that when the JavaScript reference to the WebGLRenderingContext is GC'd, it doesn't ultimately cause the Canvas to be deleted. Since the Canvas owns its m_context via OwnPtr, the WebGL context and all associated resources are never cleaned up.

Basically, any page using Workers in Chromium leaks the entire DOM when the page is reloaded.

I found that the same leak didn't exist in Safari, despite a strong hunch that the problem wasn't a leak of a reference outside Chromium's WebKit API boundary. That should have been a hint about where to look.

This bug was pretty tricky to track down. I'd appreciate any suggestions about how it could have been done better or more efficiently. Basically, per the attached patch, I instrumented TreeShared::ref() and deref() to fire when operating upon HTMLDocuments, since I found from early debugging that the Canvas wasn't being deleted and that was because the Document had a nonzero refcount at the last call to unref(). However, ref/unref were called too often against the document to debug directly. I added a mechanism to suppress the firing inside ref/unref and added those suppressions throughout WebKit. That got pretty close to the bug, but it turned out I suppressed the key place (within a call to ScriptElement::executeScript) where the unmatched call to HTMLDocument::ref was being made. I only got lucky because I saw a call to unref in WorkerMessagingProxy::~WorkerMessagingProxy and didn't see the construction, so made the suppression in executeScript more precise, which exposed the unbalanced call from WebWorkerClientImpl::WebWorkerClientImpl.

I don't have a fix yet but am going to try to find one, as well as add a test to ensure this leak doesn't reappear.
Comment 8 Kenneth Russell 2012-12-31 09:32:02 PST
Created attachment 180978 [details]
Patch
Comment 9 Adam Barth 2013-01-02 11:02:33 PST
Interesting!  Who is the best person to review this patch?
Comment 10 David Levin 2013-01-02 11:14:43 PST
(In reply to comment #9)
> Interesting!  Who is the best person to review this patch?

I'd guess dimich@
Comment 11 Dmitry Titov 2013-01-02 14:42:29 PST
Comment on attachment 180978 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=180978&action=review

r- due to "void*" comment.

> Source/WebCore/workers/WorkerLoaderProxy.h:62
> +        virtual void* toWebWorkerBase() = 0;

Lets try to forward-declare the return type here.

In general, it's unfortunate to introduce the 'embedder context' into WorkerLoaderProxy. Is there a better solution?
Comment 12 Kenneth Russell 2013-01-02 17:08:57 PST
Created attachment 181115 [details]
Patch
Comment 13 Kenneth Russell 2013-01-02 17:22:12 PST
Comment on attachment 180978 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=180978&action=review

>> Source/WebCore/workers/WorkerLoaderProxy.h:62
>> +        virtual void* toWebWorkerBase() = 0;
> 
> Lets try to forward-declare the return type here.
> 
> In general, it's unfortunate to introduce the 'embedder context' into WorkerLoaderProxy. Is there a better solution?

New patch uploaded which forward declares the return type, and avoids static_casts in the implementations.

I gave this a lot of thought but can't come up with a better solution. The basic problem is that the class hierarchies of Chromium's dedicated and shared worker implementations now diverge significantly. It used to be the case in Chromium that every implementation of WorkerLoaderProxy that hung off the WorkerThread derived from WebWorkerBase. That is now no longer the case. For dedicated workers, WebCore's WorkerMessagingProxy is now used as the base class. WebSharedWorkerImpl continues to derive from WebWorkerBase. This means that it is no longer legal to cast between WebWorkerBase* and WorkerLoaderProxy*.

It would be possible to add a static method "WebWorkerBase::fromWorkerLoaderProxy", and avoid touching WorkerLoaderProxy.h, but it would be very fragile. If any new implementation of WorkerLoaderProxy were added, the implementation of WebWorkerBase::fromWorkerLoaderProxy would have to be updated -- and there is no way to force a compilation failure to give the developer a hint that that code needs to be updated. The presence of the pure virtual in WorkerLoaderProxy provides this essential information.

In sum, as poor as it is to modify WorkerLoaderProxy with a Chromium-specific forward declaration and method, I think it is the best technical solution given the constraints.
Comment 14 Dmitry Titov 2013-01-02 17:36:50 PST
Comment on attachment 181115 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=181115&action=review

r+ with a note.

> Source/WebKit/chromium/src/WebWorkerClientImpl.cpp:73
>  // Delegates implementation of Worker{Loader,Context,Object}Proxy to WorkerMessagingProxy.

This comment seems old now... thsi class does not really delegate, it is a WMP... Remove?
Comment 15 Kenneth Russell 2013-01-02 17:39:22 PST
Comment on attachment 181115 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=181115&action=review

>> Source/WebKit/chromium/src/WebWorkerClientImpl.cpp:73
>>  // Delegates implementation of Worker{Loader,Context,Object}Proxy to WorkerMessagingProxy.
> 
> This comment seems old now... thsi class does not really delegate, it is a WMP... Remove?

You're right. I'll update the patch.
Comment 16 Kenneth Russell 2013-01-02 18:41:54 PST
Created attachment 181129 [details]
Patch for landing

Fixed comment and small typo in failure case of layout test
Comment 17 WebKit Review Bot 2013-01-02 19:23:49 PST
Comment on attachment 181129 [details]
Patch for landing

Clearing flags on attachment: 181129

Committed r138693: <http://trac.webkit.org/changeset/138693>
Comment 18 WebKit Review Bot 2013-01-02 19:23:54 PST
All reviewed patches have been landed.  Closing bug.