Bug 181761 - ASSERT(registration || isTerminating()) hit in SWServerWorker::skipWaiting()
Summary: ASSERT(registration || isTerminating()) hit in SWServerWorker::skipWaiting()
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Service Workers (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Chris Dumez
URL:
Keywords: InRadar
: 181763 (view as bug list)
Depends on:
Blocks:
 
Reported: 2018-01-17 13:29 PST by Chris Dumez
Modified: 2018-01-19 11:14 PST (History)
7 users (show)

See Also:


Attachments
Patch (6.43 KB, patch)
2018-01-17 14:17 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (10.01 KB, patch)
2018-01-18 12:09 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews116 for mac-sierra (2.92 MB, application/zip)
2018-01-18 13:42 PST, EWS Watchlist
no flags Details
Archive of layout-test-results from ews206 for win-future (11.97 MB, application/zip)
2018-01-18 14:05 PST, EWS Watchlist
no flags Details
Patch (9.80 KB, patch)
2018-01-18 16:39 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (10.01 KB, patch)
2018-01-18 17:01 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (8.84 KB, patch)
2018-01-19 09:42 PST, Chris Dumez
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Dumez 2018-01-17 13:29:11 PST
ASSERT(registration || isTerminating()) hit in SWServerWorker::skipWaiting() on the bots sometimes.
Comment 1 Radar WebKit Bug Importer 2018-01-17 13:29:56 PST
<rdar://problem/36594564>
Comment 2 Chris Dumez 2018-01-17 14:17:23 PST
Created attachment 331541 [details]
Patch
Comment 3 Chris Dumez 2018-01-17 14:18:11 PST
*** Bug 181763 has been marked as a duplicate of this bug. ***
Comment 4 youenn fablet 2018-01-18 01:18:56 PST
Comment on attachment 331541 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=331541&action=review

> Source/WebCore/workers/service/context/SWContextManager.cpp:113
> +    serviceWorker->thread().runLoop().postTask([identifier, serviceWorker = WTFMove(serviceWorker), completionHandler = WTFMove(completionHandler)](ScriptExecutionContext& context) mutable {

Isn't there a risk that the service worker thread might be kept busy (by JS say), will never execute that task and thus will never stop?

We could try making ServiceWorkerContainer::ensureSWClientConnection() asynchronous.
CacheStorageConnection is using the same pattern though so maybe this deadlock can also happen with regular workers using Cache API.
It seems to me we should try to not be service worker specific and handle this at WorkerThread level.

WorkerThread::start has a callback so we might be able to know if the worker is started and use that information to stop it now or just after being started.
The current callback is called after evaluation of the script which might not be good in case of runaway script.
Maybe we could do a callOnMainThread shortly after creation of the worker global scope to check whether worker thread is to be stopped?

> Source/WebCore/workers/service/server/SWServerRegistration.cpp:250
> +    }

Can there be cases where there is more than one such worker per registration? 
If not, we could break the loop when we found one.
Or maybe it is not too complex to keep a pointer to this worker in SWServerRegistration?
Comment 5 youenn fablet 2018-01-18 01:32:15 PST
(In reply to youenn fablet from comment #4)
> Comment on attachment 331541 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=331541&action=review
> 
> > Source/WebCore/workers/service/context/SWContextManager.cpp:113
> > +    serviceWorker->thread().runLoop().postTask([identifier, serviceWorker = WTFMove(serviceWorker), completionHandler = WTFMove(completionHandler)](ScriptExecutionContext& context) mutable {
> 
> Isn't there a risk that the service worker thread might be kept busy (by JS
> say), will never execute that task and thus will never stop?
> 
> We could try making ServiceWorkerContainer::ensureSWClientConnection()
> asynchronous.
> CacheStorageConnection is using the same pattern though so maybe this
> deadlock can also happen with regular workers using Cache API.

Looking at it again, Cache API will probably not block the worker thread at creation of the global scope.
Making ServiceWorkerContainer::ensureSWClientConnection() asynchronous or a related refactoring might so the trick.
Comment 6 Chris Dumez 2018-01-18 09:07:09 PST
(In reply to youenn fablet from comment #5)
> (In reply to youenn fablet from comment #4)
> > Comment on attachment 331541 [details]
> > Patch
> > 
> > View in context:
> > https://bugs.webkit.org/attachment.cgi?id=331541&action=review
> > 
> > > Source/WebCore/workers/service/context/SWContextManager.cpp:113
> > > +    serviceWorker->thread().runLoop().postTask([identifier, serviceWorker = WTFMove(serviceWorker), completionHandler = WTFMove(completionHandler)](ScriptExecutionContext& context) mutable {
> > 
> > Isn't there a risk that the service worker thread might be kept busy (by JS
> > say), will never execute that task and thus will never stop?
> > 

This is why I uploaded https://bugs.webkit.org/show_bug.cgi?id=181563.

> > We could try making ServiceWorkerContainer::ensureSWClientConnection()
> > asynchronous.
> > CacheStorageConnection is using the same pattern though so maybe this
> > deadlock can also happen with regular workers using Cache API.
> 
> Looking at it again, Cache API will probably not block the worker thread at
> creation of the global scope.
> Making ServiceWorkerContainer::ensureSWClientConnection() asynchronous or a
> related refactoring might so the trick.

Making it async may not be trivial though. I'd have to check.

(In reply to youenn fablet from comment #4)
> Comment on attachment 331541 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=331541&action=review
> 
> > Source/WebCore/workers/service/context/SWContextManager.cpp:113
> > +    serviceWorker->thread().runLoop().postTask([identifier, serviceWorker = WTFMove(serviceWorker), completionHandler = WTFMove(completionHandler)](ScriptExecutionContext& context) mutable {
> 
> Isn't there a risk that the service worker thread might be kept busy (by JS
> say), will never execute that task and thus will never stop?
> 
> We could try making ServiceWorkerContainer::ensureSWClientConnection()
> asynchronous.
> CacheStorageConnection is using the same pattern though so maybe this
> deadlock can also happen with regular workers using Cache API.
> It seems to me we should try to not be service worker specific and handle
> this at WorkerThread level.
> 
> WorkerThread::start has a callback so we might be able to know if the worker
> is started and use that information to stop it now or just after being
> started.
> The current callback is called after evaluation of the script which might
> not be good in case of runaway script.
> Maybe we could do a callOnMainThread shortly after creation of the worker
> global scope to check whether worker thread is to be stopped?
> 
> > Source/WebCore/workers/service/server/SWServerRegistration.cpp:250
> > +    }
> 
> Can there be cases where there is more than one such worker per
> registration? 
> If not, we could break the loop when we found one.
> Or maybe it is not too complex to keep a pointer to this worker in
> SWServerRegistration?

No, I don't think there could be more than 1 such worker. I could either break from the loop when finding one or store that worker on the registration somehow. I'll look into it.
Comment 7 youenn fablet 2018-01-18 09:17:19 PST
> > > Isn't there a risk that the service worker thread might be kept busy (by JS
> > > say), will never execute that task and thus will never stop?
> > > 
> 
> This is why I uploaded https://bugs.webkit.org/show_bug.cgi?id=181563.

I believe this is well handled for regular workers.
If there is something special in service workers that would make this exception-throwing-to-stop-spinning strategy to fail, I would hope that we can realign service workers with workers.
Comment 8 Chris Dumez 2018-01-18 09:21:51 PST
(In reply to youenn fablet from comment #7)
> > > > Isn't there a risk that the service worker thread might be kept busy (by JS
> > > > say), will never execute that task and thus will never stop?
> > > > 
> > 
> > This is why I uploaded https://bugs.webkit.org/show_bug.cgi?id=181563.
> 
> I believe this is well handled for regular workers.
> If there is something special in service workers that would make this
> exception-throwing-to-stop-spinning strategy to fail, I would hope that we
> can realign service workers with workers.

From what I can see from regular workers, WorkerThread::stop() is called from terminateWorkerGlobalScope(), which is always called on the background thread. Call sites look like:
m_scriptExecutionContext->postTask([this] (ScriptExecutionContext&) {
        m_mayBeDestroyed = true;
        if (m_workerThread)
            terminateWorkerGlobalScope();
        else
            workerGlobalScopeDestroyedInternal();
    });
Comment 9 Chris Dumez 2018-01-18 12:09:08 PST
Created attachment 331648 [details]
Patch
Comment 10 Chris Dumez 2018-01-18 12:09:33 PST
Improved approach based on offline discussion with Youenn and Brady.
Comment 11 EWS Watchlist 2018-01-18 13:42:43 PST
Comment on attachment 331648 [details]
Patch

Attachment 331648 [details] did not pass mac-debug-ews (mac):
Output: http://webkit-queues.webkit.org/results/6124606

New failing tests:
fast/workers/worker-terminate-forever.html
Comment 12 EWS Watchlist 2018-01-18 13:42:44 PST
Created attachment 331655 [details]
Archive of layout-test-results from ews116 for mac-sierra

The attached test failures were seen while running run-webkit-tests on the mac-debug-ews.
Bot: ews116  Port: mac-sierra  Platform: Mac OS X 10.12.6
Comment 13 Chris Dumez 2018-01-18 13:53:50 PST
Comment on attachment 331648 [details]
Patch

Well, this is unfortunate.
Comment 14 EWS Watchlist 2018-01-18 14:05:35 PST
Comment on attachment 331648 [details]
Patch

Attachment 331648 [details] did not pass win-ews (win):
Output: http://webkit-queues.webkit.org/results/6124993

New failing tests:
fast/workers/worker-terminate-forever.html
Comment 15 EWS Watchlist 2018-01-18 14:05:45 PST
Created attachment 331659 [details]
Archive of layout-test-results from ews206 for win-future

The attached test failures were seen while running run-webkit-tests on the win-ews.
Bot: ews206  Port: win-future  Platform: CYGWIN_NT-6.1-2.9.0-0.318-5-3-x86_64-64bit
Comment 16 Chris Dumez 2018-01-18 16:37:50 PST
I looked into making ensureSWConnection() async and this is a larger change and it may introduce bugs, which we really do not want at this point.

I think I'll go back to the original dispatching of the WorkerThread::stop() to the bg thread.
Comment 17 Chris Dumez 2018-01-18 16:39:22 PST
Created attachment 331670 [details]
Patch
Comment 18 Chris Dumez 2018-01-18 17:01:07 PST
Created attachment 331674 [details]
Patch
Comment 19 youenn fablet 2018-01-19 00:54:17 PST
The deadlock shown in Bug 181763 is probably fixed by making ServiceWorkerGlobalScope::m_registration a RefPtr<> and lazily create it outside the WorkerThread::start lock section, in WorkerScriptController::initScript() for instance.
Can we do that instead?
Comment 20 Chris Dumez 2018-01-19 07:07:33 PST
(In reply to youenn fablet from comment #19)
> The deadlock shown in Bug 181763 is probably fixed by making
> ServiceWorkerGlobalScope::m_registration a RefPtr<> and lazily create it
> outside the WorkerThread::start lock section, in
> WorkerScriptController::initScript() for instance.
> Can we do that instead?

I thought about this but if I am outside the locked section then the workerglobalscope may get destroyed at any point no? Because stop() is able to lock the mutex at this point.
Comment 21 youenn fablet 2018-01-19 07:25:20 PST
WorkerScriptController::initScript expects a valid global scope.
IIRC, it is called outside the lock section and before evaluating the script.
Comment 22 Chris Dumez 2018-01-19 08:57:13 PST
(In reply to youenn fablet from comment #21)
> WorkerScriptController::initScript expects a valid global scope.
> IIRC, it is called outside the lock section and before evaluating the script.

I guess it works out because WorkerThread::stop() post a task to destroy the workerglobalscope on the background thread, using m_runLoop.postTaskAndTerminate().

Still, seems odd to add new infrastructure (new virtual function and call after the lock is released), just to work around the fact that we are calling callOnMainThreadAndWait(). I plan to get rid of callOnMainThreadAndWait() a little bit longer term anyway. Also, dispatching to the bg thread the stop() request, as the current patch does seems fine to me. What's the drawback? If the thread is busy looping and does not process the request within 10 seconds, we will kill the SW process.
Comment 23 Chris Dumez 2018-01-19 09:42:16 PST
Created attachment 331745 [details]
Patch
Comment 24 Chris Dumez 2018-01-19 11:13:59 PST
Comment on attachment 331745 [details]
Patch

Clearing flags on attachment: 331745

Committed r227221: <https://trac.webkit.org/changeset/227221>
Comment 25 Chris Dumez 2018-01-19 11:14:01 PST
All reviewed patches have been landed.  Closing bug.