Bug 28795 - fast/workers/worker-lifecycle.html crashes intermittently on snowleopard
Summary: fast/workers/worker-lifecycle.html crashes intermittently on snowleopard
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC OS X 10.5
: P2 Normal
Assignee: Andrew Wilson
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-08-27 20:38 PDT by Andrew Wilson
Modified: 2009-08-31 16:42 PDT (History)
2 users (show)

See Also:


Attachments
proposed patch (9.04 KB, patch)
2009-08-28 21:22 PDT, Andrew Wilson
no flags Details | Formatted Diff | Diff
patch with fix for typo in comment (9.04 KB, patch)
2009-08-28 21:25 PDT, Andrew Wilson
ap: review+
ap: commit-queue-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Andrew Wilson 2009-08-27 20:38:37 PDT
The worker-lifecycle layout test crashes intermittently on snowleopard.

The stderr shows:

ASSERTION FAILED: !result
(/Volumes/Data/WebKit-BuildSlave/snowleopard-intel-leaks/build/JavaScriptCore/wtf/ThreadingPthreads.cpp:248 void WTF::Mutex::lock())

pthread_mutex_lock() is returning an error - this seems to imply a deadlock of some kind, but I don't see any way for that to happen given the simple uses of mutexes in the worker code.
Comment 1 Alexey Proskuryakov 2009-08-27 22:14:58 PDT
> pthread_mutex_lock() is returning an error

When I saw errors from this function, it usually meant that a garbage mutex object was passed in (likely because its containing object was already deallocated).
Comment 2 Andrew Wilson 2009-08-28 09:46:58 PDT
I was able to repro this locally on Leopard (not snowleopard) by running the test in a tight loop - seems to happen every few hundred iterations.

Debugging now.
Comment 3 Andrew Wilson 2009-08-28 11:30:18 PDT
The race condition seems to be this:

WorkerContext::close() invokes m_workerThread->stop(). This starts the process of shutting down the worker thread by calling runLoop.terminate(). Once the thread of execution unwinds and the runloop exits, the following actions occur:

    m_workerContext->clearScript();
    // The below assignment will destroy the context, which will in turn notify messaging proxy.
    // We cannot let any objects survive past thread exit, because no other thread will run GC or otherwise destroy them.
    m_workerContext = 0;

Note that we call clearScript() before freeing the workerContext - I'm not certain why, but I'm guessing it's to try to untangle some kind of circular reference.

Now, while those steps are executing, imagine that the main thread decides to GC the Worker object (it can do this, because the WorkerThread went idle as part of shutting down). This invokes WorkerMessagingProxy::workerObjectDestroyed():

void WorkerMessagingProxy::workerObjectDestroyed()
{
    m_workerObject = 0;
    if (m_workerThread)
        terminateWorkerContext();
    else
        workerContextDestroyedInternal();
}

Since the WorkerMessagingProxy doesn't know that the workerContext has been destroyed (hasn't happened yet), it happily invokes terminateWorkerContext() which invokes WorkerThread::stop() again.

WorkerThread::stop() accesses the workerContext and WorkerScriptController to forbidExecution() (force the thread to exit), while those objects could be getting freed by the already-exiting thread, which generates this failure.
Comment 4 Andrew Wilson 2009-08-28 11:35:51 PDT
Looks like only two pieces of WorkerThread are accessed cross-thread:

1) WorkerThread->runLoop()
2) WorkerThread->stop()

Since runLoop is already thread safe, we're OK with accessing that across threads.

WorkerThread::stop() currently grabs the m_threadCreationMutex while doing its work. I'm wondering if we should also grab that mutex while cleaning up after the run loop exits, to protect access to workerContext.

I'll try that fix locally and see how it works. ap - what do you think?
Comment 5 Alexey Proskuryakov 2009-08-28 12:12:51 PDT
I think that WorkerContext.close() shouldn't close the worker directly - it should send this request to proxy, which should close the worker. All information about worker state should be current in proxy, workers themselves shouldn't have the freedom to stop.
Comment 6 Andrew Wilson 2009-08-28 12:31:58 PDT
I don't think that's a viable solution.

The semantics of WorkerGlobalScope::close() are such that no further async events (timers, message port messages, network requests, etc) should be processed after close() are processed.

Routing close requests through the proxy would mean we'd have to jump through hoops to short-circuit any further event activity/generation. Possibly doable, but so much simpler to just call m_runLoop.terminate().

Also, forcing the parent to get involved has some nasty implications with nested SharedWorkers where the parent worker may not be around any more, although we probably address that by routing close requests through the repository instead of the parent worker object.

What's the downside to allowing workers to shut themselves down (other than this bug, of course :)?
Comment 7 Alexey Proskuryakov 2009-08-28 12:55:00 PDT
See also: bug 25902 comment 3 (so, there was a race condition, after all!)

I think this proves that the current solution is difficult to understand - the code smells wrong, but we couldn't find the bug even after detailed analysis, not until it was exposed by a test. Simplifying cross-thread communication by not letting worker contexts close without proxy consent should be much more maintainable long-term.
Comment 8 Andrew Wilson 2009-08-28 21:22:22 PDT
Created attachment 38769 [details]
proposed patch

Per my earlier conversation with ap - we no longer invoke WorkerThread::stop() when WorkerContext::close() is called. Instead, the WorkerMessagingProxy and SharedWorkerProxy shutdown the thread themselves, to avoid race conditions with the thread shutting down while the main thread is still accessing it.
Comment 9 Andrew Wilson 2009-08-28 21:25:24 PDT
Created attachment 38770 [details]
patch with fix for typo in comment

Had a typo in a comment so figured I'd fix it up.
Comment 10 Alexey Proskuryakov 2009-08-31 10:43:56 PDT
Comment on attachment 38770 [details]
patch with fix for typo in comment

+        Error events should still be delivered even if the worker thread is closing.

This sounds like something worth a comment in code, ideally with an explanation of why they should be delivered. Is this something covered by existing tests?

+        Also fixed problem where error events were not dispatched if error handlers were added via addEventListener().

Please add a test for this bug fix.

 void SharedWorkerProxy::close()
 {
-    ASSERT(!isClosing());

So, this function can now be called repeatedly. Is it OK to call m_thread->stop() twice?

r=me on a condition there's a good answer to this last question.
Comment 11 Andrew Wilson 2009-08-31 11:05:55 PDT
(In reply to comment #10)
> (From update of attachment 38770 [details])
> +        Error events should still be delivered even if the worker thread is
> closing.
> 
> This sounds like something worth a comment in code, ideally with an explanation
> of why they should be delivered. Is this something covered by existing tests?

Yes, it's covered by existing tests (worker_close.html), which is why I made this change (the tests started breaking once I routed WorkerContext::close() through the WorkerObjectProxy, because now the m_askToTerminate flag is set.

I will add a comment, although it's weird to have a disembodied comment in the code about *why* you aren't introducing buggy behavior with unnecessary checks, which is why I put it in the ChangeLog instead :)

> 
> +        Also fixed problem where error events were not dispatched if error
> handlers were added via addEventListener().
> 
> Please add a test for this bug fix.

Will do.

> 
>  void SharedWorkerProxy::close()
>  {
> -    ASSERT(!isClosing());
> 
> So, this function can now be called repeatedly. Is it OK to call
> m_thread->stop() twice?

It is technically OK to call m_thread->stop() twice in this case (the case where it gets called twice is when the parent document closes while the worker thread is executing, then the worker thread calls WorkerContext::close(). The second call to m_thread->stop() would happen on the Worker thread itself, which is safe.

However, this seems fragile so I'm going to make the following change to clean it up:

1) Make SharedWorkerProxy::close() private (should have been already).
2) Re-institute the ASSERT(!isClosing()) to SharedWorkerProxy::close().
3) Add the following to workerContextClosed() to avoid calling close() on an already-closed worker:

if (isClosing())
    return;



> 
> r=me on a condition there's a good answer to this last question.
Comment 12 Alexey Proskuryakov 2009-08-31 11:18:14 PDT
> I will add a comment, although it's weird to have a disembodied comment in the
> code about *why* you aren't introducing buggy behavior with unnecessary checks,
> which is why I put it in the ChangeLog instead :)

This is a good way to think about it. But this change makes this version of performTask() different from MessageWorkerTask::performTask(), which can be confusing, and that was the reason for me to suggest having a comment in code.

> However, this seems fragile so I'm going to make the following change to clean
> it up:

Sounds good to me.
Comment 13 Andrew Wilson 2009-08-31 16:42:05 PDT
Committed as r47914.