Bug 31615

Summary: REGRESSION(r50919) Introduced a non-thread-safe refcounting in ScriptExecutionContext::postTaskToMainThread
Product: WebKit Reporter: Dmitry Titov <dimich>
Component: WebCore JavaScriptAssignee: Dmitry Titov <dimich>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, darin, eric, levin, oliver
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: OS X 10.5   
Bug Depends on:    
Bug Blocks: 30835    
Attachments:
Description Flags
Proposed patch
dimich: commit-queue-
Same patch, Added comment for the ASSERT.
dimich: commit-queue-
Revert r50919 that has caused the regression. eric: review+, dimich: commit-queue-

Description Dmitry Titov 2009-11-17 22:24:21 PST
Using RefPtr on both sides of thread boundary have caused significant part of recent flakiness of xmlhttprequest/workers tests. Need to remove RefPtr as it was implemented in Document before r50919.

This explains at least some flakiness captured in bug 30835.
Comment 1 Dmitry Titov 2009-11-17 22:57:44 PST
Created attachment 43406 [details]
Proposed patch
Comment 2 Eric Seidel (no email) 2009-11-17 23:08:32 PST
If we add the mainThread() assert, we'll need to at least add a comment there.
Comment 3 Dmitry Titov 2009-11-17 23:20:15 PST
Created attachment 43408 [details]
Same patch, Added comment for the ASSERT.

Added comment for the ASSERT.
Comment 4 Eric Seidel (no email) 2009-11-18 01:35:38 PST
Comment on attachment 43408 [details]
Same patch, Added comment for the ASSERT.

So this is OK if we never plan to use anything TreeShared from a non-main thread, which is probably the case since Nodes are the only TreeShared types iirc.

This is OK with me.  Others should see this go by and comment.  However since this may be the cause of our recent bout of bot crashes, it would be nice to get this in quickly so we can work on finding the real cause behind bug 30835.

Please make sure you've run the layout tests in debug mode before landing this.

I think we may need to solve the FIXME before landing, or at least shortly after:
+    // FIXME: Find out what guarantees that context is still alive when task is executed.
+    // If it is not so, it is another source of random crashes.
We need to understand why a raw pointer there is safe, and if it is, document it.

Would it be easy to write a layout test which caused this crash w/o needing to be run 100 times?
Comment 5 Darin Adler 2009-11-18 09:40:38 PST
(In reply to comment #4)
> (From update of attachment 43408 [details])
> So this is OK if we never plan to use anything TreeShared from a non-main
> thread, which is probably the case since Nodes are the only TreeShared types
> iirc.

The assertion is great. I would also do the assertion in all the other TreeShared member functions: The constructor, destructor, setParent, parent too. It's never safe to read or write these on other threads -- obviously we can't add such assertions to every single access to any field of any DOM node, but if there was an efficient way to do so we would!

The practical way to get this right for the many other reference counted objects is by adding machinery in RefCounted to make it fire an assert when the ref is done from the wrong thread too. For RefCounted, we'll need to add a function to disable this for the few objects that are used cross-thread such as strings that are copied for that reason.

The vast majority of RefCounted can either assert they are used only on the main thread or only on the same thread they were created on. Adding such assertions will catch problems like this in a way that makes them much easier to notice and understand. And a function to turn them off for certain objects is likely to be a practical way to do this without getting in the way of work on things like workers.
Comment 6 Darin Adler 2009-11-18 09:51:23 PST
Comment on attachment 43408 [details]
Same patch, Added comment for the ASSERT.

> +    // This is not a RefPtr because it would require ScriptExecutionContext-derived classes to be
> +    // refcountable in a thread-safe manner which is overkill because they always are created and deleted on
> +    // the same thread.
> +    // FIXME: Find out what guarantees that context is still alive when task is executed.
> +    // If it is not so, it is another source of random crashes.

This comment seems way too big. Mentioning "source of random crashes" seems too vague to me, but your FIXME is right on.

I have reviewed the clients and I can tell you right now there is no guarantee the context is still alive when the task is executed. The design here is broken. The code has this idea that we can post a task for later execution on the main thread and have the document passed to the task, given that we can't add a ref to the document to guarantee it still still be around. That simply can't be done.

So I think we need to do something different here. This patch trades one bug for another, but doesn't fix it, as you suspect.

I don't see any way ScriptExecutionContext::postTaskToMainThread can work. It makes a promise it can't keep, that it can defer work to the main thread for a document, without being given a ref to that document, which is normally needed for such things. Either we need to already have a ref to the document, or the document needs a thread-safe way to cancel already-posted tasks.

I don't know what to say about the patch. It's definitely no worse with the patch than without it, but the bug is still there.
Comment 7 Alexey Proskuryakov 2009-11-18 10:14:40 PST
> I don't see any way ScriptExecutionContext::postTaskToMainThread can work. It
> makes a promise it can't keep, that it can defer work to the main thread for a
> document, without being given a ref to that document, which is normally needed
> for such things. Either we need to already have a ref to the document, or the
> document needs a thread-safe way to cancel already-posted tasks.

The original idea was that tasks could only be sent via this mechanism when there was activity in a worker. So, worker messaging proxy kept the document alive.

It seems that the mechanism has been generalized since then - I'm not sure if the guarantee still holds in practice, but it's certainly possible to abuse postTaskToMainThread in a way that makes it void.
Comment 8 Dmitry Titov 2009-11-18 10:36:48 PST
Thanks guys!

I think we should have a way to remove tasks from the 'post to main thread' queue when Document is destroyed or earlier, since even if we have thread-safe ref it's unclear if it makes sense to run a task on a document that is only alive because it still have refcount > 0. I think this worked before w/o many problems because in practice there are always other 'tasks' that has to be also delivered via main thread's run loop before Document will be destroyed so there is some serialization of work. It is not a guarantee though so I agree this could use more thought.

I can see 2 issues here:
1. Recent surge in test flakiness caused by me adding a RefPtr - the attached patch reverts this back to state as existed before http://trac.webkit.org/changeset/50919. If others think this is an ok intermediate solution, we could land the patch, with suggested modifications. Alternatively I could simply revert r50919.

2. postTaskToMainThread design that has to be redone. I'll be glad to give it a try and iterate on something that could be better. It could take a few days though.
Comment 9 Darin Adler 2009-11-18 10:39:19 PST
For simplicity, I am calling it "document", rather than "script execution context".

(In reply to comment #7)
> The original idea was that tasks could only be sent via this mechanism when
> there was activity in a worker. So, worker messaging proxy kept the document
> alive.

That would be fine as long as the task doesn't get passed a reference to the document. The issue is really that the task passes a reference to the document, but can't guarantee that reference is good. If the task has its own pointer to the document, there is no problem.

> It seems that the mechanism has been generalized since then - I'm not sure if
> the guarantee still holds in practice, but it's certainly possible to abuse
> postTaskToMainThread in a way that makes it void.

The only benefit this seems to provide is callOnMainThread plus a delete call. There's no additional service provided here.
Comment 10 Alexey Proskuryakov 2009-11-18 11:06:51 PST
> That would be fine as long as the task doesn't get passed a reference to the
> document.

A document cannot go away while there are (dedicated?) workers present, or any communication pending with them.
Comment 11 Dmitry Titov 2009-11-18 12:13:49 PST
Created attachment 43444 [details]
Revert r50919 that has caused the regression.

The easiest way to deal with recent regression seems to revert the recent change.

Created bug 31633 for 'dangling Document pointer' issue discussed here.
Comment 12 Eric Seidel (no email) 2009-11-18 12:16:39 PST
Comment on attachment 43444 [details]
Revert r50919 that has caused the regression.

If this is just a revert rs=me.

I also think we'll need separate patches to add the ASSERTs to TreeShared and possibly RefCounted.

Shared workers might still be the underlying cause of bug 30835, which started sometime before r50171, possibly due to another shared worker change.
Comment 13 Oliver Hunt 2009-11-18 12:27:30 PST
(In reply to comment #12)
> (From update of attachment 43444 [details])
> If this is just a revert rs=me.
> 
> I also think we'll need separate patches to add the ASSERTs to TreeShared and
> possibly RefCounted.
> 
> Shared workers might still be the underlying cause of bug 30835, which started
> sometime before r50171, possibly due to another shared worker change.

What assertion can we do with RefCounted?  RefCounted can (and is) legitimately used on non-main threads, maybe refcounted would be expected to maintain a record of what thread it expects to be used on?
Comment 14 Eric Seidel (no email) 2009-11-18 12:31:31 PST
(In reply to comment #13)
> What assertion can we do with RefCounted?  RefCounted can (and is) legitimately
> used on non-main threads, maybe refcounted would be expected to maintain a
> record of what thread it expects to be used on?

Yes.  See comment 5.
Comment 15 Dmitry Titov 2009-11-18 12:44:36 PST
Landed: http://trac.webkit.org/changeset/51127

Created bug 31637 and bug 31639 to add asserts in TreeShared and RefCounted.