Bug 200983 - [Android] 64-bit JSC r245459 crashes in JSC::AccessCase::propagateTransitions(JSC::SlotVisitor&)
Summary: [Android] 64-bit JSC r245459 crashes in JSC::AccessCase::propagateTransitions...
Status: RESOLVED DUPLICATE of bug 202892
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Other Linux
: P2 Blocker
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-08-21 09:02 PDT by Pratik
Modified: 2019-12-18 18:19 PST (History)
5 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Pratik 2019-08-21 09:02:40 PDT
Hi folks, 
As part of google’s Support 64-bit architectures (https://developer.android.com/distribute/best-practices/develop/64-bit) requirement, Amazon Alexa (https://play.google.com/store/apps/details?id=com.amazon.dee.app&hl=en_US) launched 64-bit version of Android app on July-30, 2019 based on an O3-optimized JSC (r245459). 
While stability is vastly improved compared to our earlier attempts, there are still residual problems. We see a crash rate of 0.35% over cold starts from the 64-bit JSC seg faults alone. 
Following is the stack-trace and currently the reproduction steps are unknown to us as we are unable to reproduce this crash in our in-house testing.

SIGSEGV
MainActivity
Segmentation violation (invalid memory reference)
Aug 14th, 2019, 11:32:55 UTC

STACKTRACE

SIGSEGV: Segmentation violation (invalid memory reference)
        JSC::AccessCase::propagateTransitions(JSC::SlotVisitor&) const at sfp-exceptions.c:?
        JSC::PolymorphicAccess::propagateTransitions(JSC::SlotVisitor&) const at sfp-exceptions.c:?
        JSC::CodeBlock::propagateTransitions(JSC::ConcurrentJSLocker const&, JSC::SlotVisitor&) at sfp-exceptions.c:?
        JSC::ExecutableToCodeBlockEdge::runConstraint(JSC::ConcurrentJSLocker const&, JSC::VM&, JSC::SlotVisitor&) at sfp-exceptions.c:?
        JSC::ExecutableToCodeBlockEdge::visitChildren(JSC::JSCell*, JSC::SlotVisitor&) at sfp-exceptions.c:?
        JSC::SlotVisitor::drain(WTF::MonotonicTime)::$_3::operator()(JSC::MarkStackArray&) const at sfp-exceptions.c:?
        JSC::SlotVisitor::drain(WTF::MonotonicTime) at sfp-exceptions.c:?
        JSC::SlotVisitor::drainFromShared(JSC::SlotVisitor::SharedDrainMode, WTF::MonotonicTime) at sfp-exceptions.c:?
        WTF::SharedTaskFunctor<void (), JSC::Heap::runBeginPhase(JSC::GCConductor)::$_17>::run() at sfp-exceptions.c:?
        WTF::ParallelHelperClient::runTask(WTF::RefPtr<WTF::SharedTask<void ()>, WTF::DumbPtrTraits<WTF::SharedTask<void ()> > > const&) at sfp-exceptions.c:?
        WTF::ParallelHelperPool::Thread::work() at sfp-exceptions.c:?
        WTF::Function<void ()>::CallableWrapper<WTF::AutomaticThread::start(WTF::AbstractLocker const&)::$_0>::call() at sfp-exceptions.c:?
        WTF::Thread::entryPoint(WTF::Thread::NewThreadContext*) at sfp-exceptions.c:?
        WTF::wtfThreadEntryPoint(void*) at sfp-exceptions.c:?
        at 0x7f89700484(/system/lib64/libc.so:427140)
        at 0x7f896b5db4(/system/lib64/libc.so:122292)
        at 0x0(Unknown)

The React Native community reports the same issue in https://github.com/facebook/react-native/issues/25494. The stack-trace is identical to what we are observing, see https://github.com/facebook/react-native/issues/25494#issuecomment-514976930.

There are two options discussed in the RN community as a workaround.
1) Use JSC with JIT disabled: We tried this option but this has great performance impact that we cannot deploy in the field.
2) Use RN 0.60.x with Hermes: The current version of RN 0.60.x comes with its own stability issues and until those get resolved, we cannot move to RN 0.60.

We contacted ARM since this crash is specific to arm64 and happening across multiple devices. We got the following response:
“We had a few engineers looking at this and we do not see an obvious pattern. There is a mixture of CPUs ranging from A53 only to “big/little” multi-core systems using A53, A55, A73, A75, A76, and M4. Our open source software engineers primarily work on the V8 JS engine so, we looked for similar issues fixed there and it does not look like something we have fixed previously. 

Given the problem is across so many disparate devices, our engineers suspect missing barriers, cache maintenance, or both in the generated code. These are the first areas, along with CPU detection (since it only seems to occur on Arm), where we would examine if we had better knowledge of JSC. I hope this helps you frame the issue for the bug report.”


Thanks for your support,
Pratik Patel
Comment 1 Mark Lam 2019-09-03 16:41:41 PDT
Suggestion: see SigillCrashAnalyzer.h/cpp and see if you can adapt that to capture / log more details about your crash e.g. 

1. the PC offset into AccessCase::propagateTransitions(), so that you can determine which line of code is crashing.
2. a dump of registers so that you can see what values are being acted on that results in this crash.

Good luck.
Comment 2 Filip Pizlo 2019-09-03 16:43:18 PDT
I don't see how this is actionable for anyone on the JSC team.

- We don't have any of the CPUs you speak of.
- This is a relatively old revision, so maybe there was a bug there that we fixed already.

I recommend trying a newer revision of JSC before overthinking it too much.
Comment 3 Yusuke Suzuki 2019-09-20 18:08:29 PDT
Saam, Tadeu, and I discussed about https://bugs.webkit.org/show_bug.cgi?id=201986 and we are now thinking that https://bugs.webkit.org/show_bug.cgi?id=201986 possibly fixes this issue.
Comment 4 Pratik 2019-09-24 10:44:49 PDT
Hi Yusuke,

Thanks for taking time to look into this issue. 
I created the patch based on your suggestion from this commit https://trac.webkit.org/changeset/250184/webkit and applied to r245459 version of JSC (https://github.com/react-native-community/jsc-android-buildscripts) and we could still see the same crash.

We are able to reproduce this crash when we perform the stress test by opening and closing the app multiple times.

I have also been trying to follow the earlier suggestion pointed by Fillip and the current android-sdk's cmake 3.10.2 version does not work with the latest webkit 2.26.2 so unable to create latest version of JSC. 

Thanks,
Pratik Patel
Comment 5 Yusuke Suzuki 2019-09-24 13:16:05 PDT
(In reply to Pratik from comment #4)
> Hi Yusuke,
> 
> Thanks for taking time to look into this issue. 
> I created the patch based on your suggestion from this commit
> https://trac.webkit.org/changeset/250184/webkit and applied to r245459
> version of JSC
> (https://github.com/react-native-community/jsc-android-buildscripts) and we
> could still see the same crash.
> 
> We are able to reproduce this crash when we perform the stress test by
> opening and closing the app multiple times.
> 
> I have also been trying to follow the earlier suggestion pointed by Fillip
> and the current android-sdk's cmake 3.10.2 version does not work with the
> latest webkit 2.26.2 so unable to create latest version of JSC. 
> 
> Thanks,
> Pratik Patel

I think you need to update / work-around cmake issue.
We actively fixed various issues, so trying the latest one looks the first step for this issue :)
Comment 6 Pratik 2019-10-04 10:19:54 PDT
Hi Yusuke,

We are actively working on upgrading JSC to the latest version. 

In the meantime we back-ported the fix from this ticket https://bugs.webkit.org/show_bug.cgi?id=202122 as it was mark dup of https://bugs.webkit.org/show_bug.cgi?id=202150 which claims to reproduce the same crash, but we are still observing this crash. How confident are you that upgrading to latest version and applying this patch would fix this issue?
Comment 7 Yusuke Suzuki 2019-10-14 14:52:44 PDT
(In reply to Pratik from comment #6)
> Hi Yusuke,
> 
> We are actively working on upgrading JSC to the latest version. 
> 
> In the meantime we back-ported the fix from this ticket
> https://bugs.webkit.org/show_bug.cgi?id=202122 as it was mark dup of
> https://bugs.webkit.org/show_bug.cgi?id=202150 which claims to reproduce the
> same crash, but we are still observing this crash. How confident are you
> that upgrading to latest version and applying this patch would fix this
> issue?

We fixed bunch of crash issues. Many issues like OSR exit bug are shown as showing incorrect garbage in stack / registers, which leads to GC crashes later. So it is worth trying.

BTW, can you dump the crash stack traces with all the threads? The pasted crash trace is the one in the Heap thread. But if the issue is concurrency-related, the stack traces of the other threads when crashing is also important. (like, `bt all` in gdb).
Comment 8 Yusuke Suzuki 2019-10-18 14:03:18 PDT
And now, I'm guessing this crash is due to misconfiguration of the JSC in https://github.com/facebook/react-native/ binary. We can ensure whether this is correct or not by checking `bt all` results.
Here is my guess: The binary would not enable ConcurrentJS while the binary is enabling ConcurrentGC. Such a configuration is wrong and never tested in JSC.
Comment 9 Pratik 2019-10-18 14:13:52 PDT
Here is the original JSC repository that is used for building the JSC in react-native. Which uploaded to npm later. 
https://github.com/react-native-community/jsc-android-buildscripts
and here are all the customized options, https://github.com/react-native-community/jsc-android-buildscripts/blob/master/scripts/compile/jsc.sh.

I would appreciate if you take a quick look and spot something that you think is causing the issues.
Comment 10 Yusuke Suzuki 2019-10-18 14:18:05 PDT
(In reply to Pratik from comment #9)
> Here is the original JSC repository that is used for building the JSC in
> react-native. Which uploaded to npm later. 
> https://github.com/react-native-community/jsc-android-buildscripts
> and here are all the customized options,
> https://github.com/react-native-community/jsc-android-buildscripts/blob/
> master/scripts/compile/jsc.sh.
> 
> I would appreciate if you take a quick look and spot something that you
> think is causing the issues.

Can I get `bt all` crash trace instead of crash trace for one thread? If this is race condition issue, crash trace in all the threads are very helpful to us.
Comment 11 Yusuke Suzuki 2019-10-18 14:27:52 PDT
(In reply to Pratik from comment #9)
> Here is the original JSC repository that is used for building the JSC in
> react-native. Which uploaded to npm later. 
> https://github.com/react-native-community/jsc-android-buildscripts
> and here are all the customized options,
> https://github.com/react-native-community/jsc-android-buildscripts/blob/
> master/scripts/compile/jsc.sh.
> 
> I would appreciate if you take a quick look and spot something that you
> think is causing the issues.

I've checked the passed options, and now I'm slightly thinking that this options seem wrong.

>   -DENABLE_DFG_JIT=OFF \
>   -DENABLE_FTL_JIT=OFF \

These options could disable CONCURRENT_JS option while the listed option is not disabling ConcurrentGC. I'm not sure whether this is true, but it would be possible that this makes JSC binary with contradiction,

1. ConcurrentJS = OFF
2. ConcurrentGC = ON

If the above misconfiguration happens, this crash can easily happen. To ensure this, the best way is checking stacktrace of all the threads.

Please send us crash traces of all the threads, and we can easily say whether the above guess is correct or not. And if the above guess is correct, we can easily offer the way to fix this misconfiguration.
Comment 12 Yusuke Suzuki 2019-10-18 16:35:29 PDT
(In reply to Pratik from comment #9)
> Here is the original JSC repository that is used for building the JSC in
> react-native. Which uploaded to npm later. 
> https://github.com/react-native-community/jsc-android-buildscripts
> and here are all the customized options,
> https://github.com/react-native-community/jsc-android-buildscripts/blob/
> master/scripts/compile/jsc.sh.
> 
> I would appreciate if you take a quick look and spot something that you
> think is causing the issues.

Thanks Pratik for offering build options in jsc-android,
I think this option makes JSC with `CONCURRENT_JS=OFF` and `Options::useConcurrentGC() = true`, which can cause this type of crash :)
Can you cherry-pick https://trac.webkit.org/changeset/251307/webkit and test it?
And can you offer `bt all` crash traces?
Comment 13 Pratik 2019-10-24 17:24:23 PDT
Hi Yusuke,
I was able to build the latest JSC using webkit 2.26.1 (250230) and applied the patch (https://trac.webkit.org/changeset/251307/webkit) you mentioned and after running the stress test ~3400 times, we don't see the JSC crash which gives us some confidence to put this version of JSC out in the field and see how it performs against the larger scale and data.

The reason I could not get the stacktrace from all threads is, we are unable to reproduce it reliably but if we run stress tests multiple times, it is likely to report the JSC crash once or twice in 1000 automated runs and getting the stacktrace at this stage would not be possible.

Thanks for all your help and we are keeping eye on this new JSC (250230) version to see if it crashes. I will keep you posted.

--Pratik Patel
Comment 14 Yusuke Suzuki 2019-10-29 14:25:51 PDT
(In reply to Pratik from comment #13)
> Hi Yusuke,
> I was able to build the latest JSC using webkit 2.26.1 (250230) and applied
> the patch (https://trac.webkit.org/changeset/251307/webkit) you mentioned
> and after running the stress test ~3400 times, we don't see the JSC crash
> which gives us some confidence to put this version of JSC out in the field
> and see how it performs against the larger scale and data.
> 
> The reason I could not get the stacktrace from all threads is, we are unable
> to reproduce it reliably but if we run stress tests multiple times, it is
> likely to report the JSC crash once or twice in 1000 automated runs and
> getting the stacktrace at this stage would not be possible.
> 
> Thanks for all your help and we are keeping eye on this new JSC (250230)
> version to see if it crashes. I will keep you posted.
> 
> --Pratik Patel

Sounds nice!
Comment 15 Yusuke Suzuki 2019-12-18 18:19:28 PST
I think this is the good time to close this issue as fixed ;)

*** This bug has been marked as a duplicate of bug 202892 ***