Bug 171159

Summary: REGRESSION(r215638): [Linux] Several worker tests are crashing in Thread::signalHandlerSuspendResume after r215638
Product: WebKit Reporter: Carlos Garcia Campos <cgarcia>
Component: Web Template FrameworkAssignee: Yusuke Suzuki <ysuzuki>
Status: RESOLVED FIXED    
Severity: Normal CC: benjamin, buildbot, cdumez, cgarcia, clopez, cmarcelo, dbates, fpizlo, jfbastien, keith_miller, mark.lam, mcatanzaro, saam, ysuzuki
Priority: P2 Keywords: Gtk, LayoutTestFailure, Regression
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=170976
Bug Depends on: 170976    
Bug Blocks:    
Attachments:
Description Flags
Patch
none
Patch mcatanzaro: review+

Description Carlos Garcia Campos 2017-04-22 01:02:10 PDT
You can see backtraces in GTK+ bots:

https://build.webkit.org/results/GTK Linux 64-bit Release (Tests)/r215663 (693)/results.html

For example:

https://build.webkit.org/results/GTK Linux 64-bit Release (Tests)/r215663 (693)/fast/workers/worker-document-leak-crash-log.txt
Comment 1 Yusuke Suzuki 2017-04-22 02:19:23 PDT
Super curious thing is that there is no user of calling Thread::suspend / Thread::resume. I'm still investigating.
Comment 2 Yusuke Suzuki 2017-04-22 02:23:04 PDT
I guess this is related to bug 170976. Before starting using signaling APIs, WTF in Linux uses SIGUSR2 to suspend and resume threads. I guess this suspend/resume handler is incorrectly routed from jscSignalHandler. Investigating.
Comment 3 Yusuke Suzuki 2017-04-22 02:25:16 PDT
We should use Signaling APIs to set up suspend / resume signals I guess.
Comment 4 Yusuke Suzuki 2017-04-22 02:28:00 PDT
OK, signaling API uses SIGUSR2 for there purpose, but it is already used in Linux to suspend and resume threads.
Comment 5 Yusuke Suzuki 2017-04-22 03:20:49 PDT
(In reply to Yusuke Suzuki from comment #4)
> OK, signaling API uses SIGUSR2 for there purpose, but it is already used in
> Linux to suspend and resume threads.

suspend / resume is a bit tricky operation. We should use dedicated signal for this operation.
Now SIGUSR2 is used by ThreadMessage. But SIGUSR1 becomes empty. We should use it instead.
Comment 6 Yusuke Suzuki 2017-04-22 03:33:08 PDT
Created attachment 307904 [details]
Patch
Comment 7 Yusuke Suzuki 2017-04-22 03:35:04 PDT
Created attachment 307905 [details]
Patch
Comment 8 Michael Catanzaro 2017-04-22 06:28:07 PDT
Comment on attachment 307905 [details]
Patch

I don't think it's possible to use UNIX signals for any purpose without introducing subtle race conditions that are extremely hard to detect or reason about. Is it possible to eventually eliminate our use of UNIX signals for thread suspension, or is there just no other way to accomplish this than by using signals?
Comment 9 Yusuke Suzuki 2017-04-22 06:54:12 PDT
(In reply to Michael Catanzaro from comment #8)
> Comment on attachment 307905 [details]
> Patch
> 
> I don't think it's possible to use UNIX signals for any purpose without
> introducing subtle race conditions that are extremely hard to detect or
> reason about. Is it possible to eventually eliminate our use of UNIX signals
> for thread suspension, or is there just no other way to accomplish this than
> by using signals?

If we want to have the identical mechanism (ability to suspend and resume threads),
I believe we do not have a non-signal way in Linux. For non Linux env,
sometimes an alternative is offered, like pthread_suspend_np. But it also has a
problem that we cannot retrieve register sets from a suspended thread.

If we do not want to have the current strong mechanism, the simplest way is
something like safepoint approach. Maybe current CheckTrap op can be used.
safepoint repeatedly appears in JIT code and JS mutators will park if the ccollector requires. Then, the stop the world op works.
But to make it efficiently implemented, yet it involves SEGV signal handler :P
Wihout it, I think it's not as efficient as is...
Comment 10 Yusuke Suzuki 2017-04-22 07:07:38 PDT
Committed r215665: <http://trac.webkit.org/changeset/215665>
Comment 11 Filip Pizlo 2017-04-23 08:37:36 PDT
I have some thoughts:

- MachineStackMarker should switch to using this new mechanism.  You would use the functional message API to send a Function that scans the stack.  We should test whether this is perf-neutral on macOS/iOS/etc, and if it is, then we should get rid of the thread_suspend/resume code and just use this.  It's better if all OSes use the same approach because then we share test coverage, ideas, etc.  We should only use different APIs on different OSes if it gives us some measurable progression.

- You don't need the full CheckTraps mechanism, since that's more of a safepoint - you stop at a place where the JIT is able to vend additional information.  We don't really need safepoints for stack scanning.  One thread holds the JSLock, and that thread does need to be safepointed but it's already at a safepoint anytime the GC requests a stack scan.  None of the other threads that MachineStackMarker looks at hold the JSLock and so they don't need to be safepointed.  In fact, all we really need is a ragged safepoint, where the thread requesting scanning signals all of the threads and then waits for them to return with stack scan results. 

(In reply to Michael Catanzaro from comment #8)
> Comment on attachment 307905 [details]
> Patch
> 
> I don't think it's possible to use UNIX signals for any purpose without
> introducing subtle race conditions that are extremely hard to detect or
> reason about. 

That's not true.  Two sound uses are:

- pthread_kill, as we are doing.

- handling a signal that originated as a CPU trap, like SEGV, BUS, TRAP, ILL, FPE, etc.  

We use signal handling for those purposes.

> Is it possible to eventually eliminate our use of UNIX signals
> for thread suspension, or is there just no other way to accomplish this than
> by using signals?

It's not possible.
Comment 12 Yusuke Suzuki 2017-04-23 09:45:20 PDT
Interesting.

(In reply to Filip Pizlo from comment #11)
> I have some thoughts:
> 
> - MachineStackMarker should switch to using this new mechanism.  You would
> use the functional message API to send a Function that scans the stack.  We
> should test whether this is perf-neutral on macOS/iOS/etc, and if it is,
> then we should get rid of the thread_suspend/resume code and just use this. 
> It's better if all OSes use the same approach because then we share test
> coverage, ideas, etc.  We should only use different APIs on different OSes
> if it gives us some measurable progression.

From the point of cross-platform view, one obstacle is that ThreadMessage
is not implemented in Windows since it does not have identical mechanism to
pthread_kill. I'm not sure there is a good mechanism in Windows to implement
such semantics.

> - You don't need the full CheckTraps mechanism, since that's more of a
> safepoint - you stop at a place where the JIT is able to vend additional
> information.  We don't really need safepoints for stack scanning.  One
> thread holds the JSLock, and that thread does need to be safepointed but
> it's already at a safepoint anytime the GC requests a stack scan.  None of
> the other threads that MachineStackMarker looks at hold the JSLock and so
> they don't need to be safepointed.  In fact, all we really need is a ragged
> safepoint, where the thread requesting scanning signals all of the threads
> and then waits for them to return with stack scan results. 

Right, make sense. Mutator already has JSLock when performing GC, and other
ones do not have JSLock for that VM.

It sounds like similar to TLB shootdown implementation. One thread broadcasts
the request for the stack scan, threads perform it and return the result.
The requester waits for the results, then, all the threads start executing.

> (In reply to Michael Catanzaro from comment #8)
> > Comment on attachment 307905 [details]
> > Patch
> > 
> > I don't think it's possible to use UNIX signals for any purpose without
> > introducing subtle race conditions that are extremely hard to detect or
> > reason about. 
> 
> That's not true.  Two sound uses are:
> 
> - pthread_kill, as we are doing.
> 
> - handling a signal that originated as a CPU trap, like SEGV, BUS, TRAP,
> ILL, FPE, etc.  
> 
> We use signal handling for those purposes.
> 
> > Is it possible to eventually eliminate our use of UNIX signals
> > for thread suspension, or is there just no other way to accomplish this than
> > by using signals?
> 
> It's not possible.