Bug 160337 - Crash in JavaScriptCore GC when using JSC on dispatch queues (thread_get_state returns NULL stack pointer)
Summary: Crash in JavaScriptCore GC when using JSC on dispatch queues (thread_get_stat...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: Other
Hardware: iPhone / iPad iOS 9.3
: P2 Normal
Assignee: Mark Lam
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2016-07-29 04:46 PDT by Ben Nham
Modified: 2017-05-10 16:44 PDT (History)
9 users (show)

See Also:


Attachments
proposed patch. (3.01 KB, patch)
2017-05-10 11:24 PDT, Mark Lam
ggaren: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ben Nham 2016-07-29 04:46:23 PDT
We recently enabled the use of JSC APIs in our app from dispatch queues. This lead to a number of crashes in the JSC GC timer so we had to remove this feature from our app. This implies that it's not safe to use JSC on a dispatch queue.

The crashing stack looks like this on iOS 8 (EXC_BAD_ADDRESS at 0x0):

  Thread #15 Crashed:
  0 JavaScriptCore JSC::ConservativeRoots::add(void*, void*, JSC::JITStubRoutineSet&, JSC::CodeBlockSet&) + 88
  1 JavaScriptCore JSC::MachineThreads::gatherFromOtherThread(JSC::ConservativeRoots&, JSC::MachineThreads::Thread*, JSC::JITStubRoutineSet&, JSC::CodeBlockSet&) + 132
  2 JavaScriptCore JSC::MachineThreads::gatherFromOtherThread(JSC::ConservativeRoots&, JSC::MachineThreads::Thread*, JSC::JITStubRoutineSet&, JSC::CodeBlockSet&) + 132
  3 JavaScriptCore JSC::MachineThreads::gatherConservativeRoots(JSC::ConservativeRoots&, JSC::JITStubRoutineSet&, JSC::CodeBlockSet&, void*, int (&) [48]) + 384
  4 JavaScriptCore JSC::Heap::markRoots(double) + 432
  5 JavaScriptCore JSC::Heap::collect(JSC::HeapOperation) + 376
  6 JavaScriptCore JSC::GCActivityCallback::doWork() + 144
  7 JavaScriptCore JSC::HeapTimer::timerDidFire(__CFRunLoopTimer*, void*) + 196
  8 CoreFoundation __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 24
  9 CoreFoundation __CFRunLoopDoTimer + 884
  10 CoreFoundation __CFRunLoopRun + 1368
  11 CoreFoundation CFRunLoopRunSpecific + 392
  12 Facebook +[RCTJSCExecutor runRunLoopThread] (RCTJSCExecutor.mm:280)
  13 Foundation __NSThread__main__ + 1068
  14 libsystem_pthread.dylib _pthread_body + 160
  15 libsystem_pthread.dylib _pthread_start + 156
  16 libsystem_pthread.dylib thread_start + 0

Basically, this is the marker segfaulting on trying to dereference a stack pointer of 0x0 for the target thread. (For this crash, we actually have a dump of the entire state of the stack in this crash, and were able to walk up the stack and find the Thread::Registers instance right on the stack clearly showing a SP of 0x0.)

The crashing stack looks like this on iOS 9 (EXC_BAD_ACCESS at 0xbbadbeef):

  Thread #11 Crashed:
  0 JavaScriptCore bmalloc::Heap::allocateXLarge(std::__1::lock_guard<bmalloc::StaticMutex>&, unsigned long) + 36
  1 JavaScriptCore bmalloc::Heap::allocateXLarge(std::__1::lock_guard<bmalloc::StaticMutex>&, unsigned long) + 20
  2 JavaScriptCore bmalloc::Allocator::allocateXLarge(unsigned long) + 92
  3 JavaScriptCore JSC::MachineThreads::gatherConservativeRoots(JSC::ConservativeRoots&, JSC::JITStubRoutineSet&, JSC::CodeBlockSet&, void*, void*, int (&) [48]) + 176
  4 JavaScriptCore JSC::Heap::markRoots(double, void*, void*, int (&) [48]) + 384
  5 JavaScriptCore JSC::Heap::collectImpl(JSC::HeapOperation, void*, void*, int (&) [48]) + 612
  6 JavaScriptCore JSC::Heap::collect(JSC::HeapOperation) + 92
  7 JavaScriptCore JSC::GCActivityCallback::doWork() + 88
  8 JavaScriptCore JSC::HeapTimer::timerDidFire(__CFRunLoopTimer*, void*) + 216
  9 CoreFoundation __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 24
  10 CoreFoundation __CFRunLoopDoTimer + 880
  11 CoreFoundation __CFRunLoopRun + 1516
  12 CoreFoundation CFRunLoopRunSpecific + 380
  13 Facebook +[RCTJSCExecutor runRunLoopThread] (RCTJSCExecutor.mm:280)
  14 Foundation __NSThread__start__ + 996
  15 libsystem_pthread.dylib _pthread_body + 152
  16 libsystem_pthread.dylib _pthread_start + 152
  17 libsystem_pthread.dylib thread_start + 0

Basically, this is bmalloc crashing on purpose because vm_allocate can't allocate a large enough contiguous region to hold the target thread's stack, presumably because JSC thinks the stack is ginormous due to it having a SP of 0x0.

It looks like the basic issue here is that JSC uses thread_get_state to the the stack pointer of the target thread. Unfortunately, thread_get_state sometimes returns a SP of 0x0 for a thread. See <rdar://27607384> and https://openradar.appspot.com/27607384 for more details. Basically, when the target thread is a dispatch queue thread that's just about to run with this call stack:

  Thread #32:
  0 libsystem_pthread.dylib start_wqthread + 0

then thread_get_state may return 0x0 for the SP, causing the crashes in JSC GC outlined above.

It's unclear to me whether this is actually a kernel bug or a JSC bug, but it seems like it might be a good idea for JSC to be defensive against a SP of 0x0 and have it abort marking if a thread is in this state. Additionally, it seems like MachineStackMarker.cpp should do some sort of sanity check on the size of the stack and not try to allocate a ginormous buffer to hold a thread's stack. For instance, on iOS I believe the maximum size of a stack is 1 MB, so JSC should never be allocating more than that amount of space to hold a single thread's stack.
Comment 1 Geoffrey Garen 2016-07-29 10:48:25 PDT
I don't think we want to set a stack size maximum because stack size is configurable.

But it sounds like we do need a specific workaround for the case where thread_get_state returns a bogus value.
Comment 2 Radar WebKit Bug Importer 2016-07-29 10:48:44 PDT
<rdar://problem/27611733>
Comment 3 Ben Nham 2016-08-11 02:12:58 PDT
Core OS  wrote back in <rdar://27607384>:

  Can you guys use thread_get_register_pointer_values() as an alternative?

This seems like it would fix the issue because thread_get_register_pointer_values appears to be a wrapper function around thread_get_state that a) filters out any pointers pointing to the null page and b) properly takes into account the red zone beneath the stack pointer.
Comment 4 Ben Nham 2016-08-11 02:19:55 PDT
Actually looking at this a bit more, I still don't think thread_get_register_pointer_values will do the right thing. For instance, on x86_64, it returns SP - 128 (taking in to account the red zone underneath the stack pointer):

#elif defined(__x86_64__)
    if (sp) *sp = state.__rsp - 128 /* redzone */;

This means that thread_get_register_pointer_values would actually return -128 when state.__rsp is 0x0, which is also an invalid stack pointer.
Comment 5 Ben Nham 2016-08-11 02:23:18 PDT
Ugh, I'm commenting too fast. thread_get_register_pointer_values just leaves the sp output parameter untouched when state.__rsp (or the equivalent register on another arch) is 0x0 and also returns KERN_SUCCESS. I don't see how this behavior would fix this crash. Seems like the simplest workaround is to ignore threads with null stack pointers.
Comment 6 Shuan Zhao 2017-01-23 09:45:58 PST
Is anyone working on this? I have similar issue with iOS 9.
Comment 7 Mark Lam 2017-05-10 11:24:24 PDT
Created attachment 309621 [details]
proposed patch.

Let's try this on the EWS.
Comment 8 Filip Pizlo 2017-05-10 12:21:09 PDT
Comment on attachment 309621 [details]
proposed patch.

r=me.  Did you mean to set r??
Comment 9 Mark Lam 2017-05-10 12:27:51 PDT
(In reply to Filip Pizlo from comment #8)
> Comment on attachment 309621 [details]
> proposed patch.
> 
> r=me.  Did you mean to set r??

Oops.  Yes, I forgot to set the r? after the EWS results were satisfactory.
Comment 10 Geoffrey Garen 2017-05-10 12:30:47 PDT
Comment on attachment 309621 [details]
proposed patch.

r=me
Comment 11 Mark Lam 2017-05-10 13:07:40 PDT
Thanks for the reviews.  Landed in r216608: <http://trac.webkit.org/r216608>.
Comment 12 Geoffrey Garen 2017-05-10 13:25:07 PDT
Comment on attachment 309621 [details]
proposed patch.

View in context: https://bugs.webkit.org/attachment.cgi?id=309621&action=review

> Source/JavaScriptCore/heap/MachineStackMarker.cpp:316
> +    // This is a workaround for <rdar://problem/27607384>. During thread initialization,
> +    // for some target platforms, thread state is momentarily set to 0 before being

Thinking about this again, I don't think this explanation can be correct.

JSC only scans threads after they register themselves by invoking a JSC API. So, it's impossible that JSC would try to scan a thread that had not yet initialized. Given that, I don't think you've fixed this bug.
Comment 13 Mark Lam 2017-05-10 13:50:28 PDT
Comment on attachment 309621 [details]
proposed patch.

View in context: https://bugs.webkit.org/attachment.cgi?id=309621&action=review

>> Source/JavaScriptCore/heap/MachineStackMarker.cpp:316
>> +    // for some target platforms, thread state is momentarily set to 0 before being
> 
> Thinking about this again, I don't think this explanation can be correct.
> 
> JSC only scans threads after they register themselves by invoking a JSC API. So, it's impossible that JSC would try to scan a thread that had not yet initialized. Given that, I don't think you've fixed this bug.

Read on below: "This issue may manifest with workqueue threads where the OS may choose to recycle a thread for an expired task."

A more conclusive fix is to ensure that workqueue threads will properly invoke the thread destructor code, but short of that, this is a workaround that should work.
Comment 14 Geoffrey Garen 2017-05-10 15:17:19 PDT
> Read on below: "This issue may manifest with workqueue threads where the OS
> may choose to recycle a thread for an expired task."
> 
> A more conclusive fix is to ensure that workqueue threads will properly
> invoke the thread destructor code, but short of that, this is a workaround
> that should work.

If we believe that we have workqueue threads that terminate without invoking their destructors, then we may still have a bug where we scan those threads after they deallocate their stacks. Seems like we need closer investigation of the conditions that produce this null.
Comment 15 Geoffrey Garen 2017-05-10 16:08:03 PDT
Comment on attachment 309621 [details]
proposed patch.

View in context: https://bugs.webkit.org/attachment.cgi?id=309621&action=review

> Source/JavaScriptCore/heap/MachineStackMarker.cpp:325
> +    // This is a workaround for <rdar://problem/27607384>. During thread initialization,
> +    // for some target platforms, thread state is momentarily set to 0 before being
> +    // filled in with the target thread's real register values. As a result, there's
> +    // a race condition that may result in us getting a null stackPointer.
> +    // This issue may manifest with workqueue threads where the OS may choose to recycle
> +    // a thread for an expired task.
> +    //
> +    // The workaround is simply to indicate that there's nothing to copy and return.
> +    // This is correct because we will only ever observe a null pointer during thread
> +    // initialization. Hence, by definition, there's nothing there that we need to scan
> +    // yet, and therefore, nothing that needs to be copied.

I would just say this:

// This is a workaround for <rdar://problem/27607384>. libdispatch recycles work queue threads without running pthread exit destructors. This can cause us to scan a thread during work queue initialization, when the stack pointer is null.
Comment 16 Mark Lam 2017-05-10 16:44:29 PDT
(In reply to Geoffrey Garen from comment #15)
> I would just say this:
> 
> // This is a workaround for <rdar://problem/27607384>. libdispatch recycles
> work queue threads without running pthread exit destructors. This can cause
> us to scan a thread during work queue initialization, when the stack pointer
> is null.

Landed this comment change in r216637: <http://trac.webkit.org/r216637>.