Bug 213200 - REGRESSION(r262991): Broke stress/ensure-crash.js
Summary: REGRESSION(r262991): Broke stress/ensure-crash.js
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: PC Linux
: P2 Normal
Assignee: Mark Lam
URL:
Keywords: InRadar
: 213207 (view as bug list)
Depends on:
Blocks:
 
Reported: 2020-06-15 10:53 PDT by Michael Catanzaro
Modified: 2020-06-17 10:47 PDT (History)
12 users (show)

See Also:


Attachments
proposed patch. (3.33 KB, patch)
2020-06-15 15:04 PDT, Mark Lam
mcatanzaro: review+
Details | Formatted Diff | Diff
patch for landing. (4.33 KB, patch)
2020-06-15 15:23 PDT, Mark Lam
no flags Details | Formatted Diff | Diff
Patch (6.66 KB, patch)
2020-06-15 16:29 PDT, Michael Catanzaro
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Catanzaro 2020-06-15 10:53:52 PDT
Moving from bug #212479. stress/ensure-crash.js is broken since r262991 on Red Hat's internal cloop CI:

Running stress/ensure-crash.js.default
stress/ensure-crash.js.default: ERROR: Unexpected exit code: 134
FAIL: stress/ensure-crash.js.default
Running stress/ensure-crash.js.mini-mode
stress/ensure-crash.js.mini-mode: ERROR: Unexpected exit code: 134
FAIL: stress/ensure-crash.js.mini-mode

134 - 128 = 6 = SIGABRT on Linux, so the test is failing with SIGABRT. (SIGILL = 4 so if it were to exit with SIGILL its exit status would be 132.) Skimming the commit, I don't immediately see what could be causing a SIGABRT. I thought the most likely problem was that this change assumes HAVE(MACHINE_CONTEXT) is enabled, but looking in PlatformHave.h, that is not true on two of the four architectures we test (ppc64le and s390x). However, the test is failing in the same way on both x86_64 and aarch64, so that must not be the only problem here. I wonder if it might be related to ENABLE(SIGNAL_BASED_VM_TRAPS), which depends on DFG and is therefore disabled in cloop.
Comment 1 Michael Catanzaro 2020-06-15 11:13:46 PDT
Mark says bug #212543 might help.
Comment 2 Mark Lam 2020-06-15 13:11:51 PDT
(In reply to Michael Catanzaro from comment #1)
> Mark says bug #212543 might help.

I was wrong.  212542 won't help.  But I think I know what's happening.  Will provide details after I confirm things.
Comment 3 Mark Lam 2020-06-15 13:26:46 PDT
The root cause of this is because of https://trac.webkit.org/r238478 i.e. non-darwin ports chooses to implement CRASH() using abort.
Comment 4 Michael Catanzaro 2020-06-15 14:31:35 PDT
(In reply to Mark Lam from comment #3)
> The root cause of this is because of https://trac.webkit.org/r238478 i.e.
> non-darwin ports chooses to implement CRASH() using abort.

We want CRASH() to raise SIGABRT because that's the expected way to express an assertion failure crash on Linux. Before that commit, it would have been SIGSEGV (not SIGILL).
Comment 5 Mark Lam 2020-06-15 14:32:31 PDT
(In reply to Michael Catanzaro from comment #4)
> (In reply to Mark Lam from comment #3)
> > The root cause of this is because of https://trac.webkit.org/r238478 i.e.
> > non-darwin ports chooses to implement CRASH() using abort.
> 
> We want CRASH() to raise SIGABRT because that's the expected way to express
> an assertion failure crash on Linux. Before that commit, it would have been
> SIGSEGV (not SIGILL).

I understand.  I'm investigating a fix now.
Comment 6 Mark Lam 2020-06-15 15:04:34 PDT
Created attachment 401944 [details]
proposed patch.
Comment 7 Michael Catanzaro 2020-06-15 15:10:31 PDT
Comment on attachment 401944 [details]
proposed patch.

Thanks for investigating!

The change looks good, but I don't think it will fully fix the crash because HAVE(MACHINE_CONTEXT) is false on two of the four tested architectures.
Comment 8 Mark Lam 2020-06-15 15:14:58 PDT
(In reply to Michael Catanzaro from comment #7)
> Comment on attachment 401944 [details]
> proposed patch.
> 
> Thanks for investigating!
> 
> The change looks good, but I don't think it will fully fix the crash because
> HAVE(MACHINE_CONTEXT) is false on two of the four tested architectures.

I think you'll just have to skip the tests that expect crashes (this with `//@ crashOK!` at the top) for those platforms.  That is unless they can actually handle signals and for some reason is misclaiming that they !HAVE(MACHINE_CONTEXT).
Comment 9 Mark Lam 2020-06-15 15:23:19 PDT
Created attachment 401947 [details]
patch for landing.

Thanks for the review.  I'm also going to unskip the ensure-crash.js test for armv7 and mips.  Will let the EWS run on this patch before landing.
Comment 10 Michael Catanzaro 2020-06-15 16:20:37 PDT
(In reply to Mark Lam from comment #8)
> I think you'll just have to skip the tests that expect crashes (this with
> `//@ crashOK!` at the top) for those platforms.  That is unless they can
> actually handle signals and for some reason is misclaiming that they
> !HAVE(MACHINE_CONTEXT).

I can add per-architecture skips if need be, but hopefully there's a better way. :) TL;DR I think the affected code doesn't actually need to be guarded by HAVE(MACHINE_CONTEXT) at all?

Signals work the same on all architectures. I can't imagine how broken things would be if the OS couldn't handle UNIX signals. :D

PlatformHave.h has this:

#if OS(DARWIN) || OS(FUCHSIA) || ((OS(FREEBSD) || defined(__GLIBC__) || defined(__BIONIC__)) && (CPU(X86) || CPU(X86_64) || CPU(ARM) || CPU(ARM64) || CPU(MIPS)))
#define HAVE_MACHINE_CONTEXT 1
#endif

mcontext is actually available on all architectures when the C library is glibc; however, since mcontext is architecture-specific, any code that wants to use it needs to be written separately for each desired architecture. I see JSC's MachineContext.h includes platform-specific code for many architectures, but there's no way to make this work for arbitrary architectures. I *personally* only care about a few specific architectures (x86_64, aarch64, ppc64le, and s390x), but since signal handling is not platform-specific and doesn't require mcontext, ideally we wouldn't require mcontext at all. Also, I know Igalia cares about non-glibc too; they use either ulibc or muslc (I can never remember which) on some embedded systems. (I think(?) that mcontext is implemented by both of these, but ucontext is probably not, and our MachineContext.h does depend on ucontext.)

Anyway, it looks like the affected code in jsc.cpp does not depend on MachineContext.h at all. It only uses WTF's Signals.h, which is guarded by HAVE(MACHINE_CONTEXT), but I don't see why. Guess: it's guarded by HAVE(MACHINE_CONTEXT) because originally it was only needed for that case? I think we can just remove HAVE(MACHINE_CONTEXT) from the guard... or change it to #if !OS(WINDOWS) or something like that.

Your patch still looks good though, don't hesitate to land. I'll cook a patch to change the guards tomorrow.
Comment 11 Michael Catanzaro 2020-06-15 16:22:08 PDT
(In reply to Michael Catanzaro from comment #10)
> Your patch still looks good though, don't hesitate to land. I'll cook a
> patch to change the guards tomorrow.

Actually let me try uploading an alternate version of your patch with these guards changed and see if it passes EWS.
Comment 12 Michael Catanzaro 2020-06-15 16:29:57 PDT
Created attachment 401955 [details]
Patch
Comment 13 Michael Catanzaro 2020-06-15 16:30:45 PDT
(In reply to Michael Catanzaro from comment #12)
> Created attachment 401955 [details]
> Patch

Let me know what you think of this. I'll do a local build with HAVE(MACHINE_CONTEXT) disabled now to make sure it actually builds.
Comment 14 Mark Lam 2020-06-15 16:31:47 PDT
(In reply to Michael Catanzaro from comment #11)
> (In reply to Michael Catanzaro from comment #10)
> > Your patch still looks good though, don't hesitate to land. I'll cook a
> > patch to change the guards tomorrow.
> 
> Actually let me try uploading an alternate version of your patch with these
> guards changed and see if it passes EWS.

I don't think we should remove HAVE(MACHINE_CONTEXT).  Even on unixes, there's no guarantee that the struct for the signal context is the same, right?  The platform that opts in to HAVE(MACHINE_CONTEXT) must do the work to make it work first.  It doesn't make sense to opt all unixes in blindly.
Comment 15 Mark Lam 2020-06-15 16:33:22 PDT
(In reply to Mark Lam from comment #14)
> (In reply to Michael Catanzaro from comment #11)
> > (In reply to Michael Catanzaro from comment #10)
> > > Your patch still looks good though, don't hesitate to land. I'll cook a
> > > patch to change the guards tomorrow.
> > 
> > Actually let me try uploading an alternate version of your patch with these
> > guards changed and see if it passes EWS.
> 
> I don't think we should remove HAVE(MACHINE_CONTEXT).  Even on unixes,
> there's no guarantee that the struct for the signal context is the same,
> right?  The platform that opts in to HAVE(MACHINE_CONTEXT) must do the work
> to make it work first.  It doesn't make sense to opt all unixes in blindly.

And please don't just obsolete my patch and replace it with yours.  Now, my patch won't get the EWS validation that I put it there for.
Comment 16 Michael Catanzaro 2020-06-15 16:35:47 PDT
(In reply to Mark Lam from comment #14)
> I don't think we should remove HAVE(MACHINE_CONTEXT).  Even on unixes,
> there's no guarantee that the struct for the signal context is the same,
> right? 

The mcontext struct is going to be different on every architecture, but jsc.cpp and Signals.[cpp,h] don't use it anywhere, so it should be fine. (There is code in JSC that really does depend on mcontext, just not this code.)

Sorry for interfering with your EWS run. :(
Comment 17 Michael Catanzaro 2020-06-15 16:43:52 PDT
Comment on attachment 401955 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=401955&action=review

> Source/WTF/wtf/threads/Signals.h:28
> -#if USE(PTHREADS) && HAVE(MACHINE_CONTEXT)
> +#if USE(PTHREADS)

Note this code does not use pthreads either, so really these are not good guards... how about #if !OS(WINDOWS)?

I bet the original usage of these classes in JSC depends on pthreads and mcontext; that's probably how these guards were originally selected.
Comment 18 Michael Catanzaro 2020-06-15 16:48:18 PDT
Comment on attachment 401955 [details]
Patch

Obsoleting my patch, Mark asked me to use a separate bug to change the guards.
Comment 19 Michael Catanzaro 2020-06-15 20:01:34 PDT
Comment on attachment 401947 [details]
patch for landing.

View in context: https://bugs.webkit.org/attachment.cgi?id=401947&action=review

> Source/WTF/wtf/threads/Signals.h:53
> +#if !OS(DARWIN)
> +    Abort,
> +#endif

Looking this over again, the #if !OS(DARWIN) guards aren't really needed in Signals.h. The guards in jsc.cpp should suffice to avoid installing the signal handler on Apple systems. But even then, I guess it wouldn't hurt to install the SIGABRT handler unconditionally, since the code looks like it's trying to handle every possible crash signal, and it seems odd for only SIGABRT to be conspicuously missing. (Linux has a couple more crash signals, but they are truly obscure. Is SIGABRT really not common on macOS?)
Comment 20 Mark Lam 2020-06-15 20:08:31 PDT
Comment on attachment 401947 [details]
patch for landing.

View in context: https://bugs.webkit.org/attachment.cgi?id=401947&action=review

>> Source/WTF/wtf/threads/Signals.h:53
>> +#endif
> 
> Looking this over again, the #if !OS(DARWIN) guards aren't really needed in Signals.h. The guards in jsc.cpp should suffice to avoid installing the signal handler on Apple systems. But even then, I guess it wouldn't hurt to install the SIGABRT handler unconditionally, since the code looks like it's trying to handle every possible crash signal, and it seems odd for only SIGABRT to be conspicuously missing. (Linux has a couple more crash signals, but they are truly obscure. Is SIGABRT really not common on macOS?)

I added it because we deliberately don't need or want this for OS(DARWIN).  There should never be a test that generates a SIGABRT.  Only non-DARWIN ports will ever encounter SIGABRT by their own choice.  Alternatively, I can make it more restricted to Apple specific platforms.

The purpose of the handler is not to catch all crashes, only crashes that we think we will generate deliberately.  This is facility for testing that expected crash points will crash.
Comment 21 Mark Lam 2020-06-16 00:56:09 PDT
Comment on attachment 401947 [details]
patch for landing.

armv7 and mips bots are happy.  Landing now.
Comment 22 Mark Lam 2020-06-16 00:59:26 PDT
*** Bug 213207 has been marked as a duplicate of this bug. ***
Comment 23 EWS 2020-06-16 01:32:51 PDT
Committed r263082: <https://trac.webkit.org/changeset/263082>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 401947 [details].
Comment 24 Radar WebKit Bug Importer 2020-06-16 01:33:19 PDT
<rdar://problem/64397861>
Comment 25 Michael Catanzaro 2020-06-17 10:47:07 PDT
(In reply to Michael Catanzaro from comment #0)
> Moving from bug #212479. stress/ensure-crash.js is broken since r262991 on
> Red Hat's internal cloop CI

I've verified this is now fixed with the combination of r263082, r263074, and r263123.