28795 – fast/workers/worker-lifecycle.html crashes intermittently on snowleopard

RESOLVED FIXED 28795

fast/workers/worker-lifecycle.html crashes intermittently on snowleopard

https://bugs.webkit.org/show_bug.cgi?id=28795

Summary fast/workers/worker-lifecycle.html crashes intermittently on snowleopard

Andrew Wilson

Reported 2009-08-27 20:38:37 PDT

The worker-lifecycle layout test crashes intermittently on snowleopard. The stderr shows: ASSERTION FAILED: !result (/Volumes/Data/WebKit-BuildSlave/snowleopard-intel-leaks/build/JavaScriptCore/wtf/ThreadingPthreads.cpp:248 void WTF::Mutex::lock()) pthread_mutex_lock() is returning an error - this seems to imply a deadlock of some kind, but I don't see any way for that to happen given the simple uses of mutexes in the worker code.

Attachments
proposed patch (9.04 KB, patch) 2009-08-28 21:22 PDT, Andrew Wilson	no flags	Details Formatted Diff Diff
patch with fix for typo in comment (9.04 KB, patch) 2009-08-28 21:25 PDT, Andrew Wilson	ap: review+ ap: commit-queue-	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2009-08-27 22:14:58 PDT

> pthread_mutex_lock() is returning an error When I saw errors from this function, it usually meant that a garbage mutex object was passed in (likely because its containing object was already deallocated).

Andrew Wilson

Comment 2 2009-08-28 09:46:58 PDT

I was able to repro this locally on Leopard (not snowleopard) by running the test in a tight loop - seems to happen every few hundred iterations. Debugging now.

Andrew Wilson

Comment 3 2009-08-28 11:30:18 PDT

The race condition seems to be this: WorkerContext::close() invokes m_workerThread->stop(). This starts the process of shutting down the worker thread by calling runLoop.terminate(). Once the thread of execution unwinds and the runloop exits, the following actions occur: m_workerContext->clearScript(); // The below assignment will destroy the context, which will in turn notify messaging proxy. // We cannot let any objects survive past thread exit, because no other thread will run GC or otherwise destroy them. m_workerContext = 0; Note that we call clearScript() before freeing the workerContext - I'm not certain why, but I'm guessing it's to try to untangle some kind of circular reference. Now, while those steps are executing, imagine that the main thread decides to GC the Worker object (it can do this, because the WorkerThread went idle as part of shutting down). This invokes WorkerMessagingProxy::workerObjectDestroyed(): void WorkerMessagingProxy::workerObjectDestroyed() { m_workerObject = 0; if (m_workerThread) terminateWorkerContext(); else workerContextDestroyedInternal(); } Since the WorkerMessagingProxy doesn't know that the workerContext has been destroyed (hasn't happened yet), it happily invokes terminateWorkerContext() which invokes WorkerThread::stop() again. WorkerThread::stop() accesses the workerContext and WorkerScriptController to forbidExecution() (force the thread to exit), while those objects could be getting freed by the already-exiting thread, which generates this failure.

Andrew Wilson

Comment 4 2009-08-28 11:35:51 PDT

Looks like only two pieces of WorkerThread are accessed cross-thread: 1) WorkerThread->runLoop() 2) WorkerThread->stop() Since runLoop is already thread safe, we're OK with accessing that across threads. WorkerThread::stop() currently grabs the m_threadCreationMutex while doing its work. I'm wondering if we should also grab that mutex while cleaning up after the run loop exits, to protect access to workerContext. I'll try that fix locally and see how it works. ap - what do you think?

Alexey Proskuryakov

Comment 5 2009-08-28 12:12:51 PDT

I think that WorkerContext.close() shouldn't close the worker directly - it should send this request to proxy, which should close the worker. All information about worker state should be current in proxy, workers themselves shouldn't have the freedom to stop.

Andrew Wilson

Comment 6 2009-08-28 12:31:58 PDT

I don't think that's a viable solution. The semantics of WorkerGlobalScope::close() are such that no further async events (timers, message port messages, network requests, etc) should be processed after close() are processed. Routing close requests through the proxy would mean we'd have to jump through hoops to short-circuit any further event activity/generation. Possibly doable, but so much simpler to just call m_runLoop.terminate(). Also, forcing the parent to get involved has some nasty implications with nested SharedWorkers where the parent worker may not be around any more, although we probably address that by routing close requests through the repository instead of the parent worker object. What's the downside to allowing workers to shut themselves down (other than this bug, of course :)?

Alexey Proskuryakov

Comment 7 2009-08-28 12:55:00 PDT

See also: bug 25902 comment 3 (so, there was a race condition, after all!) I think this proves that the current solution is difficult to understand - the code smells wrong, but we couldn't find the bug even after detailed analysis, not until it was exposed by a test. Simplifying cross-thread communication by not letting worker contexts close without proxy consent should be much more maintainable long-term.

Andrew Wilson

Comment 8 2009-08-28 21:22:22 PDT

Created attachment 38769 [details] proposed patch Per my earlier conversation with ap - we no longer invoke WorkerThread::stop() when WorkerContext::close() is called. Instead, the WorkerMessagingProxy and SharedWorkerProxy shutdown the thread themselves, to avoid race conditions with the thread shutting down while the main thread is still accessing it.

Andrew Wilson

Comment 9 2009-08-28 21:25:24 PDT

Created attachment 38770 [details] patch with fix for typo in comment Had a typo in a comment so figured I'd fix it up.

Alexey Proskuryakov

Comment 10 2009-08-31 10:43:56 PDT

Comment on attachment 38770 [details] patch with fix for typo in comment + Error events should still be delivered even if the worker thread is closing. This sounds like something worth a comment in code, ideally with an explanation of why they should be delivered. Is this something covered by existing tests? + Also fixed problem where error events were not dispatched if error handlers were added via addEventListener(). Please add a test for this bug fix. void SharedWorkerProxy::close() { - ASSERT(!isClosing()); So, this function can now be called repeatedly. Is it OK to call m_thread->stop() twice? r=me on a condition there's a good answer to this last question.

Andrew Wilson

Comment 11 2009-08-31 11:05:55 PDT

(In reply to comment #10) > (From update of attachment 38770 [details]) > + Error events should still be delivered even if the worker thread is > closing. > > This sounds like something worth a comment in code, ideally with an explanation > of why they should be delivered. Is this something covered by existing tests? Yes, it's covered by existing tests (worker_close.html), which is why I made this change (the tests started breaking once I routed WorkerContext::close() through the WorkerObjectProxy, because now the m_askToTerminate flag is set. I will add a comment, although it's weird to have a disembodied comment in the code about *why* you aren't introducing buggy behavior with unnecessary checks, which is why I put it in the ChangeLog instead :) > > + Also fixed problem where error events were not dispatched if error > handlers were added via addEventListener(). > > Please add a test for this bug fix. Will do. > > void SharedWorkerProxy::close() > { > - ASSERT(!isClosing()); > > So, this function can now be called repeatedly. Is it OK to call > m_thread->stop() twice? It is technically OK to call m_thread->stop() twice in this case (the case where it gets called twice is when the parent document closes while the worker thread is executing, then the worker thread calls WorkerContext::close(). The second call to m_thread->stop() would happen on the Worker thread itself, which is safe. However, this seems fragile so I'm going to make the following change to clean it up: 1) Make SharedWorkerProxy::close() private (should have been already). 2) Re-institute the ASSERT(!isClosing()) to SharedWorkerProxy::close(). 3) Add the following to workerContextClosed() to avoid calling close() on an already-closed worker: if (isClosing()) return; > > r=me on a condition there's a good answer to this last question.

Alexey Proskuryakov

Comment 12 2009-08-31 11:18:14 PDT

> I will add a comment, although it's weird to have a disembodied comment in the > code about *why* you aren't introducing buggy behavior with unnecessary checks, > which is why I put it in the ChangeLog instead :) This is a good way to think about it. But this change makes this version of performTask() different from MessageWorkerTask::performTask(), which can be confusing, and that was the reason for me to suggest having a comment in code. > However, this seems fragile so I'm going to make the following change to clean > it up: Sounds good to me.

Andrew Wilson

Comment 13 2009-08-31 16:42:05 PDT

Committed as r47914.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware PC

OS OS X 10.5

Product WebKit

Component WebCore Misc.

Assignee

Andrew Wilson

Reported

2009-08-27 20:38 PDT

Modified

2009-08-31 16:42 PDT History

CC List

2 users Show

URL

Keywords

Depends on

Blocks