Bug 155678 - GuardMalloc crash on DFG::WorkList thread in JSC::Heap::isCollecting for destroyed Web Worker
Summary: GuardMalloc crash on DFG::WorkList thread in JSC::Heap::isCollecting for dest...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2016-03-18 21:30 PDT by Joseph Pecoraro
Modified: 2016-03-21 13:41 PDT (History)
5 users (show)

See Also:


Attachments
[PATCH] Proposed Fix (1.58 KB, patch)
2016-03-18 21:42 PDT, Joseph Pecoraro
fpizlo: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Joseph Pecoraro 2016-03-18 21:30:12 PDT
* SUMMARY
GuardMalloc crash on DFGWorkList thread in JSC::Heap::isCollecting for destroyed Web Worker.

* STEPS TO REPRODUCE
shell> run-webkit-tests --release --no-retry-failures --time-out-ms=30000 -g -1 --iterations=50 LayoutTests/inspector/heap/snapshot.html

* NOTES
- In Release it is very quick to crash (almost always the 2nd test run).
- In Debug it is very rare to crash, I got it after ~20 runs.
- The Web Inspector was recently made to use Web Workers here <http://trac.webkit.org/changeset/198353>

* CRASH SNIPPET
Crashed Thread:        15  FTL Worklist Worker Thread

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000358989b30
Exception Note:        EXC_CORPSE_NOTIFY

Application Specific Information:
CRASHING TEST: inspector/heap/snapshot.html
This process is running with libgmalloc.dylib (GuardMalloc) which may have forced the crash due to a memory access error.
 
Thread 15 Crashed:: FTL Worklist Worker Thread
0   com.apple.JavaScriptCore      	0x000000010bb7fc9e JSC::Heap::isCollecting() + 14
1   com.apple.JavaScriptCore      	0x000000010c1ab0c5 JSC::DFG::Worklist::runThread(JSC::DFG::ThreadData*) + 757
2   com.apple.JavaScriptCore      	0x000000010c1a9404 JSC::DFG::Worklist::threadFunction(void*) + 36
3   com.apple.JavaScriptCore      	0x000000010ca53dc9 WTF::createThread(void (*)(void*), void*, char const*)::$_0::operator()() const + 25
4   com.apple.JavaScriptCore      	0x000000010ca53d9d void std::__1::__invoke_void_return_wrapper<void>::__call<WTF::createThread(void (*)(void*), void*, char const*)::$_0&>(WTF::createThread(void (*)(void*), void*, char const*)::$_0&&&) + 45
5   com.apple.JavaScriptCore      	0x000000010ca53d3c std::__1::__function::__func<WTF::createThread(void (*)(void*), void*, char const*)::$_0, std::__1::allocator<WTF::createThread(void (*)(void*), void*, char const*)::$_0>, void ()>::operator()() + 44
6   com.apple.JavaScriptCore      	0x000000010c2eef5a std::__1::function<void ()>::operator()() const + 26
7   com.apple.JavaScriptCore      	0x000000010ca5299e WTF::threadEntryPoint(void*) + 158
8   com.apple.JavaScriptCore      	0x000000010ca54331 WTF::wtfThreadEntryPoint(void*) + 289
9   libsystem_pthread.dylib       	0x00007fff966b799d _pthread_body + 131
10  libsystem_pthread.dylib       	0x00007fff966b791a _pthread_start + 168
11  libsystem_pthread.dylib       	0x00007fff966b5351 thread_start + 13
Comment 1 Joseph Pecoraro 2016-03-18 21:30:28 PDT
<rdar://problem/25251439>
Comment 2 Joseph Pecoraro 2016-03-18 21:30:36 PDT
Caught in lldb I was able to confirm that this is a Cancelled DFG::Plan holding a reference to now gone Web Worker VM that has since destructed.

1. VM::~VM for the Web Worker on thread #289:

    (lldb)  bt
      thread #289: tid = 0x69ded, 0x0000000109170b10 JavaScriptCore`JSC::VM::~VM(this=0x0000001715b67a90) + 16 at VM.cpp:335, name = 'WebCore: Worker', stop reason = breakpoint 3.2
        frame #0: JavaScriptCore`JSC::VM::~VM(this=0x0000001715b67a90) + 16 at VM.cpp:335
        frame #1: JavaScriptCore`WTF::ThreadSafeRefCounted<JSC::VM>::deref
        frame #2: JavaScriptCore`void WTF::derefIfNotNull<JSC::VM>
        frame #3: JavaScriptCore`WTF::RefPtr<JSC::VM>::operator=
        frame #4: JavaScriptCore`JSC::JSLockHolder::~JSLockHolder
        frame #5: JavaScriptCore`JSC::JSLockHolder::~JSLockHolder
        frame #6: WebCore`WebCore::WorkerScriptController::~WorkerScriptController
        frame #7: WebCore`WebCore::WorkerScriptController::~WorkerScriptController
        frame #8: WebCore`WebCore::WorkerGlobalScope::clearScript
        frame #9: WebCore`WebCore::WorkerGlobalScope::clearScript
        frame #10: WebCore`WebCore::WorkerGlobalScope::clearScript
        frame #11: WebCore`WebCore::WorkerGlobalScope::clearScript
        frame #12: WebCore`WebCore::WorkerThread::stop
    (lldb) c


2. Hitting the Crash

    (lldb) Process 35098 stopped
    * thread #41: tid = 0x6934f, 0x000000010845c6ae JavaScriptCore`JSC::Heap::isCollecting(this=0x0000001715b67aa8) + 14 at HeapInlines.h:58, name = 'FTL Worklist Worker Thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1715b67b30)
        frame #0: 0x000000010845c6ae JavaScriptCore`JSC::Heap::isCollecting(this=0x0000001715b67aa8) + 14 at HeapInlines.h:58
       55  	
       56  	inline bool Heap::isCollecting()
       57  	{
    -> 58  	    return m_operationInProgress == FullCollection || m_operationInProgress == EdenCollection;
       59  	}
       60  	
       61  	inline Heap* Heap::heap(const JSCell* cell)

3. Inspect the DFG::Plan it contains the Web Worker's VM, and the Plan has been Cancelled, so none of its members are safe to use.


    (lldb) up
    frame #1: 0x0000000108a75e95 JavaScriptCore`JSC::DFG::Worklist::runThread(this=0x00000003c4c73ef0, data=0x00000003c4ee5fe0) + 757 at DFGWorklist.cpp:370
       367 	        
       368 	            RELEASE_ASSERT(!plan->vm.heap.isCollecting());
       369 	            plan->compileInThread(longLivedState, data);
    -> 370 	            RELEASE_ASSERT(!plan->vm.heap.isCollecting());
       371 	            
       372 	            {
       373 	                LockHolder locker(m_lock);

    (lldb) frame variable plan.m_ptr->vm
    (JSC::VM &) plan.m_ptr->vm = 0x0000001715b67a90: {...}

    (lldb) frame variable plan.m_ptr->stage
    (JSC::DFG::Plan::Stage) plan.m_ptr->stage = Cancelled
Comment 3 Joseph Pecoraro 2016-03-18 21:42:00 PDT
Created attachment 274500 [details]
[PATCH] Proposed Fix

Proposed fix.

This DFG::Plan is not seen by VM::~VM's attempt to remove all plans for the destructing VM because that path loops over `m_plans`. A Cancelled DFG::Plan would have already been removed from `m_plans`.
Comment 4 Filip Pizlo 2016-03-18 21:43:04 PDT
(In reply to comment #2)
> Caught in lldb I was able to confirm that this is a Cancelled DFG::Plan
> holding a reference to now gone Web Worker VM that has since destructed.
> 
> 1. VM::~VM for the Web Worker on thread #289:
> 
>     (lldb)  bt
>       thread #289: tid = 0x69ded, 0x0000000109170b10
> JavaScriptCore`JSC::VM::~VM(this=0x0000001715b67a90) + 16 at VM.cpp:335,
> name = 'WebCore: Worker', stop reason = breakpoint 3.2
>         frame #0: JavaScriptCore`JSC::VM::~VM(this=0x0000001715b67a90) + 16
> at VM.cpp:335
>         frame #1: JavaScriptCore`WTF::ThreadSafeRefCounted<JSC::VM>::deref
>         frame #2: JavaScriptCore`void WTF::derefIfNotNull<JSC::VM>
>         frame #3: JavaScriptCore`WTF::RefPtr<JSC::VM>::operator=
>         frame #4: JavaScriptCore`JSC::JSLockHolder::~JSLockHolder
>         frame #5: JavaScriptCore`JSC::JSLockHolder::~JSLockHolder
>         frame #6:
> WebCore`WebCore::WorkerScriptController::~WorkerScriptController
>         frame #7:
> WebCore`WebCore::WorkerScriptController::~WorkerScriptController
>         frame #8: WebCore`WebCore::WorkerGlobalScope::clearScript
>         frame #9: WebCore`WebCore::WorkerGlobalScope::clearScript
>         frame #10: WebCore`WebCore::WorkerGlobalScope::clearScript
>         frame #11: WebCore`WebCore::WorkerGlobalScope::clearScript
>         frame #12: WebCore`WebCore::WorkerThread::stop
>     (lldb) c
> 
> 
> 2. Hitting the Crash
> 
>     (lldb) Process 35098 stopped
>     * thread #41: tid = 0x6934f, 0x000000010845c6ae
> JavaScriptCore`JSC::Heap::isCollecting(this=0x0000001715b67aa8) + 14 at
> HeapInlines.h:58, name = 'FTL Worklist Worker Thread', stop reason =
> EXC_BAD_ACCESS (code=1, address=0x1715b67b30)
>         frame #0: 0x000000010845c6ae
> JavaScriptCore`JSC::Heap::isCollecting(this=0x0000001715b67aa8) + 14 at
> HeapInlines.h:58
>        55  	
>        56  	inline bool Heap::isCollecting()
>        57  	{
>     -> 58  	    return m_operationInProgress == FullCollection ||
> m_operationInProgress == EdenCollection;
>        59  	}
>        60  	
>        61  	inline Heap* Heap::heap(const JSCell* cell)
> 
> 3. Inspect the DFG::Plan it contains the Web Worker's VM, and the Plan has
> been Cancelled, so none of its members are safe to use.
> 
> 
>     (lldb) up
>     frame #1: 0x0000000108a75e95
> JavaScriptCore`JSC::DFG::Worklist::runThread(this=0x00000003c4c73ef0,
> data=0x00000003c4ee5fe0) + 757 at DFGWorklist.cpp:370
>        367 	        
>        368 	            RELEASE_ASSERT(!plan->vm.heap.isCollecting());
>        369 	            plan->compileInThread(longLivedState, data);
>     -> 370 	            RELEASE_ASSERT(!plan->vm.heap.isCollecting());
>        371 	            
>        372 	            {
>        373 	                LockHolder locker(m_lock);
> 
>     (lldb) frame variable plan.m_ptr->vm
>     (JSC::VM &) plan.m_ptr->vm = 0x0000001715b67a90: {...}
> 
>     (lldb) frame variable plan.m_ptr->stage
>     (JSC::DFG::Plan::Stage) plan.m_ptr->stage = Cancelled

Fascinating.

VM destruction should wait for all plans for that VM to be retired from the worklist.  I believe that this code is fine.

But note that the GC has a magical power, which I believe comes into play here: it can run concurrently to compilation and cancel a plan while that plan is compiling. We take advantage of the fact that some parts of the compilation pipeline, like the B3 prepareForGeneration() phase, are both long and have no need for the VM& or any JS-ey things.  In other words, we can run GC during this "safe" phase.  If the GC finds that the plan is compiling something that is garbage, the GC can cancel the plan. B3 will keep running, but when it finishes, we will "exit" the safepoint, which will wait for GC to finish, and then observe that the plan has been cancelled. The plan will then just tear down all of its data and return.

And then this awesomeness happens: the VM has also been destroyed in the meantime.  Normally, the VM would wait for compilation threads to finish.  But it doesn't wait for cancelled ones to finish.

So, the VM was destroyed and then this idiotic assertion happens!  It's probably the only darn thing on the cancelled-plan-gives-up-and-dies path that requires the VM!  Since it's a release assertion we always access the isCollecting bits, and so in this case we die.

The fix is to remove the assertion. :-)  That way, there won't be anything on the "give-up-and-die" path that needs the VM to still exist.  And yes, I just read the code for this, and it does look like everyone else just speedily bails without touching anything, except for these bogus assertions.  There are a handful of them.  I believe they should all be burned.
Comment 5 Filip Pizlo 2016-03-18 21:44:53 PDT
Comment on attachment 274500 [details]
[PATCH] Proposed Fix

Nice.  Better than removing the assertion.
Comment 6 Joseph Pecoraro 2016-03-20 15:43:16 PDT
http://trac.webkit.org/changeset/198477
Comment 7 David Kilzer (:ddkilzer) 2016-03-21 13:41:11 PDT
This is not a security issue since it resulted in a RELEASE_ASSERT().