160337 – Crash in JavaScriptCore GC when using JSC on dispatch queues (thread_get_state returns NULL stack pointer)

Ben Nham

Reported 2016-07-29 04:46:23 PDT

We recently enabled the use of JSC APIs in our app from dispatch queues. This lead to a number of crashes in the JSC GC timer so we had to remove this feature from our app. This implies that it's not safe to use JSC on a dispatch queue. The crashing stack looks like this on iOS 8 (EXC_BAD_ADDRESS at 0x0): Thread #15 Crashed: 0 JavaScriptCore JSC::ConservativeRoots::add(void*, void*, JSC::JITStubRoutineSet&, JSC::CodeBlockSet&) + 88 1 JavaScriptCore JSC::MachineThreads::gatherFromOtherThread(JSC::ConservativeRoots&, JSC::MachineThreads::Thread*, JSC::JITStubRoutineSet&, JSC::CodeBlockSet&) + 132 2 JavaScriptCore JSC::MachineThreads::gatherFromOtherThread(JSC::ConservativeRoots&, JSC::MachineThreads::Thread*, JSC::JITStubRoutineSet&, JSC::CodeBlockSet&) + 132 3 JavaScriptCore JSC::MachineThreads::gatherConservativeRoots(JSC::ConservativeRoots&, JSC::JITStubRoutineSet&, JSC::CodeBlockSet&, void*, int (&) [48]) + 384 4 JavaScriptCore JSC::Heap::markRoots(double) + 432 5 JavaScriptCore JSC::Heap::collect(JSC::HeapOperation) + 376 6 JavaScriptCore JSC::GCActivityCallback::doWork() + 144 7 JavaScriptCore JSC::HeapTimer::timerDidFire(__CFRunLoopTimer*, void*) + 196 8 CoreFoundation __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 24 9 CoreFoundation __CFRunLoopDoTimer + 884 10 CoreFoundation __CFRunLoopRun + 1368 11 CoreFoundation CFRunLoopRunSpecific + 392 12 Facebook +[RCTJSCExecutor runRunLoopThread] (RCTJSCExecutor.mm:280) 13 Foundation __NSThread__main__ + 1068 14 libsystem_pthread.dylib _pthread_body + 160 15 libsystem_pthread.dylib _pthread_start + 156 16 libsystem_pthread.dylib thread_start + 0 Basically, this is the marker segfaulting on trying to dereference a stack pointer of 0x0 for the target thread. (For this crash, we actually have a dump of the entire state of the stack in this crash, and were able to walk up the stack and find the Thread::Registers instance right on the stack clearly showing a SP of 0x0.) The crashing stack looks like this on iOS 9 (EXC_BAD_ACCESS at 0xbbadbeef): Thread #11 Crashed: 0 JavaScriptCore bmalloc::Heap::allocateXLarge(std::__1::lock_guard<bmalloc::StaticMutex>&, unsigned long) + 36 1 JavaScriptCore bmalloc::Heap::allocateXLarge(std::__1::lock_guard<bmalloc::StaticMutex>&, unsigned long) + 20 2 JavaScriptCore bmalloc::Allocator::allocateXLarge(unsigned long) + 92 3 JavaScriptCore JSC::MachineThreads::gatherConservativeRoots(JSC::ConservativeRoots&, JSC::JITStubRoutineSet&, JSC::CodeBlockSet&, void*, void*, int (&) [48]) + 176 4 JavaScriptCore JSC::Heap::markRoots(double, void*, void*, int (&) [48]) + 384 5 JavaScriptCore JSC::Heap::collectImpl(JSC::HeapOperation, void*, void*, int (&) [48]) + 612 6 JavaScriptCore JSC::Heap::collect(JSC::HeapOperation) + 92 7 JavaScriptCore JSC::GCActivityCallback::doWork() + 88 8 JavaScriptCore JSC::HeapTimer::timerDidFire(__CFRunLoopTimer*, void*) + 216 9 CoreFoundation __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 24 10 CoreFoundation __CFRunLoopDoTimer + 880 11 CoreFoundation __CFRunLoopRun + 1516 12 CoreFoundation CFRunLoopRunSpecific + 380 13 Facebook +[RCTJSCExecutor runRunLoopThread] (RCTJSCExecutor.mm:280) 14 Foundation __NSThread__start__ + 996 15 libsystem_pthread.dylib _pthread_body + 152 16 libsystem_pthread.dylib _pthread_start + 152 17 libsystem_pthread.dylib thread_start + 0 Basically, this is bmalloc crashing on purpose because vm_allocate can't allocate a large enough contiguous region to hold the target thread's stack, presumably because JSC thinks the stack is ginormous due to it having a SP of 0x0. It looks like the basic issue here is that JSC uses thread_get_state to the the stack pointer of the target thread. Unfortunately, thread_get_state sometimes returns a SP of 0x0 for a thread. See <rdar://27607384> and https://openradar.appspot.com/27607384 for more details. Basically, when the target thread is a dispatch queue thread that's just about to run with this call stack: Thread #32: 0 libsystem_pthread.dylib start_wqthread + 0 then thread_get_state may return 0x0 for the SP, causing the crashes in JSC GC outlined above. It's unclear to me whether this is actually a kernel bug or a JSC bug, but it seems like it might be a good idea for JSC to be defensive against a SP of 0x0 and have it abort marking if a thread is in this state. Additionally, it seems like MachineStackMarker.cpp should do some sort of sanity check on the size of the stack and not try to allocate a ginormous buffer to hold a thread's stack. For instance, on iOS I believe the maximum size of a stack is 1 MB, so JSC should never be allocating more than that amount of space to hold a single thread's stack.

Geoffrey Garen

Comment 1 2016-07-29 10:48:25 PDT

I don't think we want to set a stack size maximum because stack size is configurable. But it sounds like we do need a specific workaround for the case where thread_get_state returns a bogus value.

Radar WebKit Bug Importer

Comment 2 2016-07-29 10:48:44 PDT

<rdar://problem/27611733>

Ben Nham

Comment 3 2016-08-11 02:12:58 PDT

Core OS wrote back in <rdar://27607384>: Can you guys use thread_get_register_pointer_values() as an alternative? This seems like it would fix the issue because thread_get_register_pointer_values appears to be a wrapper function around thread_get_state that a) filters out any pointers pointing to the null page and b) properly takes into account the red zone beneath the stack pointer.

Ben Nham

Comment 4 2016-08-11 02:19:55 PDT

Actually looking at this a bit more, I still don't think thread_get_register_pointer_values will do the right thing. For instance, on x86_64, it returns SP - 128 (taking in to account the red zone underneath the stack pointer): #elif defined(__x86_64__) if (sp) *sp = state.__rsp - 128 /* redzone */; This means that thread_get_register_pointer_values would actually return -128 when state.__rsp is 0x0, which is also an invalid stack pointer.

Ben Nham

Comment 5 2016-08-11 02:23:18 PDT

Ugh, I'm commenting too fast. thread_get_register_pointer_values just leaves the sp output parameter untouched when state.__rsp (or the equivalent register on another arch) is 0x0 and also returns KERN_SUCCESS. I don't see how this behavior would fix this crash. Seems like the simplest workaround is to ignore threads with null stack pointers.

Shuan Zhao

Comment 6 2017-01-23 09:45:58 PST

Is anyone working on this? I have similar issue with iOS 9.

Mark Lam

Comment 7 2017-05-10 11:24:24 PDT

Created attachment 309621 [details] proposed patch. Let's try this on the EWS.

Filip Pizlo

Comment 8 2017-05-10 12:21:09 PDT

Comment on attachment 309621 [details] proposed patch. r=me. Did you mean to set r??

Mark Lam

Comment 9 2017-05-10 12:27:51 PDT

(In reply to Filip Pizlo from comment #8) > Comment on attachment 309621 [details] > proposed patch. > > r=me. Did you mean to set r?? Oops. Yes, I forgot to set the r? after the EWS results were satisfactory.

Geoffrey Garen

Comment 10 2017-05-10 12:30:47 PDT

Comment on attachment 309621 [details] proposed patch. r=me

Mark Lam

Comment 11 2017-05-10 13:07:40 PDT

Thanks for the reviews. Landed in r216608: <http://trac.webkit.org/r216608>.

Geoffrey Garen

Comment 12 2017-05-10 13:25:07 PDT

Comment on attachment 309621 [details] proposed patch. View in context: https://bugs.webkit.org/attachment.cgi?id=309621&action=review > Source/JavaScriptCore/heap/MachineStackMarker.cpp:316 > + // This is a workaround for <rdar://problem/27607384>. During thread initialization, > + // for some target platforms, thread state is momentarily set to 0 before being Thinking about this again, I don't think this explanation can be correct. JSC only scans threads after they register themselves by invoking a JSC API. So, it's impossible that JSC would try to scan a thread that had not yet initialized. Given that, I don't think you've fixed this bug.

Mark Lam

Comment 13 2017-05-10 13:50:28 PDT

Comment on attachment 309621 [details] proposed patch. View in context: https://bugs.webkit.org/attachment.cgi?id=309621&action=review >> Source/JavaScriptCore/heap/MachineStackMarker.cpp:316 >> + // for some target platforms, thread state is momentarily set to 0 before being > > Thinking about this again, I don't think this explanation can be correct. > > JSC only scans threads after they register themselves by invoking a JSC API. So, it's impossible that JSC would try to scan a thread that had not yet initialized. Given that, I don't think you've fixed this bug. Read on below: "This issue may manifest with workqueue threads where the OS may choose to recycle a thread for an expired task." A more conclusive fix is to ensure that workqueue threads will properly invoke the thread destructor code, but short of that, this is a workaround that should work.

Geoffrey Garen

Comment 14 2017-05-10 15:17:19 PDT

> Read on below: "This issue may manifest with workqueue threads where the OS > may choose to recycle a thread for an expired task." > > A more conclusive fix is to ensure that workqueue threads will properly > invoke the thread destructor code, but short of that, this is a workaround > that should work. If we believe that we have workqueue threads that terminate without invoking their destructors, then we may still have a bug where we scan those threads after they deallocate their stacks. Seems like we need closer investigation of the conditions that produce this null.

Geoffrey Garen

Comment 15 2017-05-10 16:08:03 PDT

Comment on attachment 309621 [details] proposed patch. View in context: https://bugs.webkit.org/attachment.cgi?id=309621&action=review > Source/JavaScriptCore/heap/MachineStackMarker.cpp:325 > + // This is a workaround for <rdar://problem/27607384>. During thread initialization, > + // for some target platforms, thread state is momentarily set to 0 before being > + // filled in with the target thread's real register values. As a result, there's > + // a race condition that may result in us getting a null stackPointer. > + // This issue may manifest with workqueue threads where the OS may choose to recycle > + // a thread for an expired task. > + // > + // The workaround is simply to indicate that there's nothing to copy and return. > + // This is correct because we will only ever observe a null pointer during thread > + // initialization. Hence, by definition, there's nothing there that we need to scan > + // yet, and therefore, nothing that needs to be copied. I would just say this: // This is a workaround for <rdar://problem/27607384>. libdispatch recycles work queue threads without running pthread exit destructors. This can cause us to scan a thread during work queue initialization, when the stack pointer is null.

Mark Lam

Comment 16 2017-05-10 16:44:29 PDT

(In reply to Geoffrey Garen from comment #15) > I would just say this: > > // This is a workaround for <rdar://problem/27607384>. libdispatch recycles > work queue threads without running pthread exit destructors. This can cause > us to scan a thread during work queue initialization, when the stack pointer > is null. Landed this comment change in r216637: <http://trac.webkit.org/r216637>.

Attachments
proposed patch. (3.01 KB, patch) 2017-05-10 11:24 PDT, Mark Lam	ggaren: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.