Bug 178368 - webkitpy: Force workers to exit if they do not do so gracefully
Summary: webkitpy: Force workers to exit if they do not do so gracefully
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Jonathan Bedard
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2017-10-16 16:51 PDT by Jonathan Bedard
Modified: 2017-10-20 16:14 PDT (History)
CC: 7 users

See Also:


Attachments
Patch (2.16 KB, patch)
2017-10-16 17:01 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (2.39 KB, patch)
2017-10-17 10:07 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (2.79 KB, patch)
2017-10-17 13:33 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (2.92 KB, patch)
2017-10-17 15:40 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (2.90 KB, patch)
2017-10-19 11:20 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (2.91 KB, patch)
2017-10-19 13:06 PDT, Jonathan Bedard
jbedard: commit-queue-
Details | Formatted Diff | Diff

Description Jonathan Bedard 2017-10-16 16:51:06 PDT
If a worker process refuses to gracefully exit after receiving a SIGTERM, we should send it a SIGKILL so the main process doesn't hang.
Comment 1 Radar WebKit Bug Importer 2017-10-16 16:51:48 PDT
<rdar://problem/35018927>
Comment 2 Jonathan Bedard 2017-10-16 16:52:56 PDT
Note that we've seen this a number of times on EWS.  It's not the easiest problem to reproduce, and it indicates something is wrong with the workers.  We should investigate the underlying worker problem as well, but there is no reason to allow a process that has already been sent a SIGTERM to continue running.
Comment 3 Jonathan Bedard 2017-10-16 17:01:33 PDT
Created attachment 323959 [details]
Patch
Comment 4 Aakash Jain 2017-10-17 09:42:52 PDT
Comment on attachment 323959 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=323959&action=review

> Tools/Scripts/webkitpy/common/message_pool.py:-140
> -                worker.join()

Was this worker.join() waiting indefinitely in the cases we noticed? If so, simply providing a timeout parameter might help.

> Tools/Scripts/webkitpy/common/message_pool.py:144
> +        # If nice termination doesn't work, force worker to terminate.

Nit: replace "nice" with "above" or "regular". Nice makes me think if "nice" is a kind of termination defined somewhere.

> Tools/Scripts/webkitpy/common/message_pool.py:148
> +                worker.join(timeout=math.ceil(timeout_target - time.time()) if timeout_target > time.time() else 0)

This should be split across multiple lines; as written it is unreadable and non-obvious.

Also, adding a comment here would be useful, explaining that you are implementing an overall timeout of 60 seconds for all the workers combined, rather than the usual 60-second timeout for each worker.
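
(For illustration only, a minimal sketch of the multi-line form being requested. The helper name _join_with_shared_deadline and the constant WORKER_JOIN_TIMEOUT_SECONDS are hypothetical, not taken from the patch.)

    import math
    import time

    WORKER_JOIN_TIMEOUT_SECONDS = 60  # one budget shared by all workers combined

    def _join_with_shared_deadline(workers, budget_seconds=WORKER_JOIN_TIMEOUT_SECONDS):
        # Compute the deadline once; each join() only gets whatever remains of
        # the shared budget, so the total wait is bounded by budget_seconds.
        deadline = time.time() + budget_seconds
        for worker in workers:
            remaining = deadline - time.time()
            worker.join(timeout=math.ceil(remaining) if remaining > 0 else 0)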
Comment 5 Jonathan Bedard 2017-10-17 09:49:42 PDT
Comment on attachment 323959 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=323959&action=review

>> Tools/Scripts/webkitpy/common/message_pool.py:-140
>> -                worker.join()
> 
> was this worker.join() waiting indefinitely in the cases we noticed? If so, simply providing a timeout parameter might help.

Yes, the worker.join() was waiting indefinitely.

Simply providing a timeout would unblock the main process, but would then leave orphaned workers.  That's why I forcibly kill the workers below.
Comment 6 Jonathan Bedard 2017-10-17 10:07:39 PDT
Created attachment 324022 [details]
Patch
Comment 7 Jonathan Bedard 2017-10-17 10:08:46 PDT
Comment on attachment 324022 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review

> Tools/Scripts/webkitpy/common/message_pool.py:155
> +                os.kill(worker.pid, signal.SIGKILL)

We may want to throw an exception if we have to SIGKILL a worker.  The main thread should be fine, but this indicates something is wrong with the worker.
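
(A rough sketch of the fallback and exception being discussed here; the helper name, the did_kill flag, and the log message are illustrative assumptions rather than the patch's actual code.)

    import logging
    import os
    import signal

    _log = logging.getLogger(__name__)

    def _kill_stuck_workers(workers):
        # SIGKILL any worker that is still alive after the joins above, and
        # raise afterwards so the failure is loud rather than silently masked.
        did_kill = False
        for worker in workers:
            if worker.is_alive():
                _log.error('%s failed to exit after SIGTERM, sending SIGKILL', worker.name)
                os.kill(worker.pid, signal.SIGKILL)  # POSIX only
                did_kill = True
        if did_kill:
            raise Exception('Workers failed to terminate for 60 seconds after SIGTERM')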
Comment 8 Aakash Jain 2017-10-17 10:21:42 PDT
Comment on attachment 324022 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review

> Tools/Scripts/webkitpy/common/message_pool.py:144
> +        # Check if above termination was successful, force worker to terminate if it wasn't.

Nit: "worker" -> "workers"

> Tools/Scripts/webkitpy/common/message_pool.py:148
> +                # This sets a timeout of 60 seconds for all workers to close, instead of 60 seconds for each worker.

Thinking more about it, I think it might be better to keep a fixed timeout for each worker. Right now, if the first worker is stuck, it would take 60 seconds to time out, and all the remaining workers would get 0-second timeouts. A fixed per-worker timeout would increase the overall timeout, but a timeout of a few minutes is acceptable when something goes wrong.

> Tools/Scripts/webkitpy/common/message_pool.py:154
> +            if worker.is_alive():

Nit: would be good to add a newline before this to improve readability.

> Tools/Scripts/webkitpy/common/message_pool.py:156
> +

Can you at least add a log?
Comment 9 Jonathan Bedard 2017-10-17 10:37:17 PDT
Comment on attachment 324022 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review

>> Tools/Scripts/webkitpy/common/message_pool.py:148
>> +                # This sets a timeout of 60 seconds for all workers to close, instead of 60 seconds for each worker.
> 
> Thinking more about it, I think it might be better to keep a fixed timeout for each worker. Right now, if the first worker is stuck, it would take 60 seconds to time out, and all the remaining workers would get 0-second timeouts. A fixed per-worker timeout would increase the overall timeout, but a timeout of a few minutes is acceptable when something goes wrong.

I don't think we want a fixed timeout.

In your example, at least 60 seconds would have passed between worker 2's terminate() and the attempted join.  I can see a case for setting the timeout to 1 instead of 0 here, but if worker 2 isn't stuck, worker.is_alive() should return False after 60 seconds anyway.

When I've observed this problem, it's usually present in all the workers.  For EWS, this means we'd have to wait for 24 minutes before we kill all workers.
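
(For concreteness, assuming the 24-minute figure implies roughly 24 worker processes per EWS bot: a fixed 60-second timeout per worker gives a worst case of about 24 × 60 s = 1440 s, i.e. 24 minutes, when every worker is stuck, whereas a single shared 60-second deadline caps the total wait at one minute.)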
Comment 10 Alexey Proskuryakov 2017-10-17 12:52:08 PDT
Comment on attachment 324022 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review

> Tools/ChangeLog:3
> +        webkitpy: Force workers to exit if they do not do so gracefully

I'm not super keen on this idea. We've never hit this condition in the past, and when it happened recently, that was an actual recent regression that needed to be fixed (and would have been masked by the force kill).

Having defense in depth is generally good in my opinion, but the specifics of this case seem to suggest that increased code complexity may not be worth it.
Comment 11 Jonathan Bedard 2017-10-17 13:15:03 PDT
Comment on attachment 324022 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review

>> Tools/ChangeLog:3
>> +        webkitpy: Force workers to exit if they do not do so gracefully
> 
> I'm not super keen on this idea. We've never hit this condition in the past, and when it happened recently, that was an actual recent regression that needed to be fixed (and would have been masked by the force kill).
> 
> Having defense in depth is generally good in my opinion, but the specifics of this case seem to suggest that increased code complexity may not be worth it.

Because of this, I pointed out in comment 7 that we should consider throwing an exception when this happens.

While I agree that we should be cautious about masking problems in the worker processes, the parent process should be able to force workers to exit when abnormal circumstances are detected, even when a worker is misbehaving.
Comment 12 Jonathan Bedard 2017-10-17 13:33:32 PDT
Created attachment 324049 [details]
Patch
Comment 13 Daniel Bates 2017-10-17 14:42:41 PDT
Comment on attachment 324049 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324049&action=review

> Tools/Scripts/webkitpy/common/message_pool.py:157
> +                os.kill(worker.pid, signal.SIGKILL)

Could this put us in a bad state? I take it that we do not need to do any cleanup, especially for iOS simulator and iOS device testing?
Comment 14 Jonathan Bedard 2017-10-17 15:04:48 PDT
(In reply to Daniel Bates from comment #13)
> Comment on attachment 324049 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=324049&action=review
> 
> > Tools/Scripts/webkitpy/common/message_pool.py:157
> > +                os.kill(worker.pid, signal.SIGKILL)
> 
> Could this put us in a bad state? I take it that we do not need to do any
> cleanup, especially for iOS simulator and iOS device testing?

In theory, yes, it could. If a worker owned any sort of cleanup tied to SIGTERM, killing the process could leave things in a bad state.

In practice, we don't tie any cleanup to SIGTERM at the moment.  The two examples you gave are cleaned up by the parent process when the child processes receive SIGTERMs.
Comment 15 Daniel Bates 2017-10-17 15:17:58 PDT
I agree with the sentiments of comment 10. I am unclear how we came to the conclusion to file this bug from the current EWS situation. I am not at a computer with a checkout. I do recall many changes to the message pool code recently and we were not experiencing stuck EWS bots over the past six months (or at least not as many stuck bots) as we are now. What has changed?
Comment 16 Jonathan Bedard 2017-10-17 15:32:22 PDT
(In reply to Daniel Bates from comment #15)
> I agree with the sentiments of comment 10. I am unclear how we came to the
> conclusion to file this bug from the current EWS situation. I am not at a
> computer with a checkout. I do recall many changes to the message pool code
> recently and we were not experiencing stuck EWS bots over the past six
> months (or at least not as many stuck bots) as we are now. What has changed?

The specific issue which caused the EWS bug has already been addressed in <https://trac.webkit.org/changeset/223572>.

The purpose of this bug is to make sure that when workers do not respond gracefully to a SIGTERM, the parent process does not get stuck and there are no orphaned processes.

There were two problems that caused EWS to get into this state.  One of them was covered by <https://bugs.webkit.org/show_bug.cgi?id=178402>; it is the reason the workers were failing to exit gracefully.  That first problem is the regression.

The 2nd problem is this one: even if the workers fail to gracefully exit, the parent process should not hang forever.  The 2nd problem has been present in webkitpy for a long time, and is not a regression.
Comment 17 Jonathan Bedard 2017-10-17 15:40:28 PDT
Created attachment 324064 [details]
Patch
Comment 18 Daniel Bates 2017-10-17 15:45:47 PDT
(In reply to Jonathan Bedard from comment #16)
> The 2nd problem is this one: even if the workers fail to gracefully exit,
> the parent process should not hang forever.  The 2nd problem has been
> present in webkitpy for a long time, and is not a regression.

I do not think we should fix this "problem" as it represents a programming error. If we were to have fixed this "problem" then it would have masked the regression caused by bug #178402 and we likely would not have landed the fix for the root problem in <https://trac.webkit.org/changeset/223572>.
Comment 19 Daniel Bates 2017-10-17 15:47:47 PDT
(In reply to Daniel Bates from comment #18)
> (In reply to Jonathan Bedard from comment #16)
> > The 2nd problem is this one: even if the workers fail to gracefully exit,
> > the parent process should not hang forever.  The 2nd problem has been
> > present in webkitpy for a long time, and is not a regression.
> 
> I do not think we should fix this "problem" as it represents a programming
> error. If we were to have fixed this "problem" then it would have masked the
> regression caused by bug #178402 and we likely would not have landed the fix
> for the root problem in <https://trac.webkit.org/changeset/223572>.

*bug #176393
Comment 20 Jonathan Bedard 2017-10-17 15:51:59 PDT
(In reply to Daniel Bates from comment #18)
> (In reply to Jonathan Bedard from comment #16)
> > The 2nd problem is this one: even if the workers fail to gracefully exit,
> > the parent process should not hang forever.  The 2nd problem has been
> > present in webkitpy for a long time, and is not a regression.
> 
> I do not think we should fix this "problem" as it represents a programming
> error. If we were to have fixed this "problem" then it would have masked the
> regression caused by bug #178402 and we likely would not have landed the fix
> for the root problem in <https://trac.webkit.org/changeset/223572>.

The current iteration of the patch (attachment 324064 [details]) just makes it so that a regression like the one in bug #176393 is fatal rather than causing EWS to lock up.  This seems more in line with how WebKit infrastructure usually works.  Don't we prefer immediate and fatal errors to undiagnosed hangs?
Comment 21 Alexey Proskuryakov 2017-10-17 16:29:58 PDT
Comment on attachment 324064 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324064&action=review

> Tools/Scripts/webkitpy/common/message_pool.py:171
> +        if did_kill:
> +            raise Exception('Workers failed to terminate for 60 seconds after SIGTERM')

Seems like a good idea to fail quickly.

How did you test this change? Would the EWS bubble be red, and would an archive with results still be uploaded?
Comment 22 Jonathan Bedard 2017-10-19 11:20:15 PDT
Created attachment 324252 [details]
Patch
Comment 23 Jonathan Bedard 2017-10-19 11:21:03 PDT
I'm now more convinced that we need this (there was a small bug in the last patch; I've just updated it).

A more recent example from today: if we exit early because we run out of file handles, the SIGTERM handler on the workers may hang.  Unlike the earlier case, this doesn't necessarily constitute a programming error in webkitpy, since a bug in WebKit or even in one of the tools we rely on (like spindump or the simulator) could also cause it.
Comment 24 Aakash Jain 2017-10-19 11:29:28 PDT
I agree with this change. 

I think it needs more testing on EWS to make sure that it turns the bubbles red, that logs are uploaded when this issue is encountered, and that the bot is able to process the next build properly.
Comment 25 Jonathan Bedard 2017-10-19 11:53:00 PDT
(In reply to Alexey Proskuryakov from comment #21)
> Comment on attachment 324064 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=324064&action=review
> 
> > Tools/Scripts/webkitpy/common/message_pool.py:171
> > +        if did_kill:
> > +            raise Exception('Workers failed to terminate for 60 seconds after SIGTERM')
> 
> Seems like a good idea to fail quickly.
> 
> How did you test this change? Would the EWS bubble be red, and would an
> archive with results still be uploaded?

The EWS bubble would be red, and it would upload the archive, but the archive may not have much in it (depending on how far run-webkit-tests got).  EWS doesn't distinguish between failing exit codes.

I've tested this by running run-webkit-tests in various configurations.  This problem isn't unique to EWS.
Comment 26 Alexey Proskuryakov 2017-10-19 12:34:01 PDT
The bubbles shouldn't be red when there is an infrastructure failure.
Comment 27 Jonathan Bedard 2017-10-19 13:06:47 PDT
Created attachment 324266 [details]
Patch
Comment 28 Jonathan Bedard 2017-10-19 13:09:02 PDT
(In reply to Jonathan Bedard from comment #27)
> Created attachment 324266 [details]
> Patch

Sending this through to test EWS behavior when an exception is thrown.
Comment 29 Daniel Bates 2017-10-19 15:46:38 PDT
Comment on attachment 324266 [details]
Patch

I feel this patch (attachment #324266 [details]) is a bad idea. Ideally, we should not have runaway multiprocessing.Process processes, and I do not feel it is worth the code complexity to defend against this bug. If we can find a less complex solution, that would be great. By default, Python will terminate the process when it receives a SIGTERM, so the only way we can get into trouble is if we register a SIGTERM handler. At this time, only code in Tools/Scripts/webkitpy/common/interupt_debugging.py does this.

Any solution, like the one proposed, that subverts the multiprocessing module by calling os.kill() is suspicious and seems like a bad idea.
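
(As a standalone illustration of the failure mode described above, not webkitpy code: registering a SIGTERM handler replaces Python's default terminate-on-SIGTERM behavior, so on POSIX multiprocessing.Process.terminate() merely invokes the handler, and a handler that blocks, for instance because it cannot open a log file when the process has run out of file descriptors, leaves the child running.)

    import multiprocessing
    import signal
    import time

    def _worker_main():
        def handle_sigterm(signum, frame):
            # Stands in for a handler that blocks, e.g. while trying to write
            # a debugging log; SIGTERM no longer terminates the process.
            while True:
                time.sleep(1)

        signal.signal(signal.SIGTERM, handle_sigterm)
        time.sleep(3600)

    if __name__ == '__main__':
        child = multiprocessing.Process(target=_worker_main)
        child.start()
        time.sleep(1)      # give the child time to install its handler
        child.terminate()  # sends SIGTERM on POSIX; the handler swallows it
        child.join(timeout=5)
        print('still alive after terminate():', child.is_alive())  # prints True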
Comment 30 Jonathan Bedard 2017-10-19 16:42:56 PDT
(In reply to Daniel Bates from comment #29)
> Comment on attachment 324266 [details]
> Patch
> 
> I feel this patch (attachment #324266 [details]) is a bad idea. Ideally, we
> should not have runaway multiprocessing.Process processes, and I do not
> feel it is worth the code complexity to defend against this bug. If we can
> find a less complex solution, that would be great. By default, Python will
> terminate the process when it receives a SIGTERM, so the only way we can
> get into trouble is if we register a SIGTERM handler. At this time, only
> code in Tools/Scripts/webkitpy/common/interupt_debugging.py does this.
> 
> Any solution, like the one proposed, that subverts the multiprocessing
> module by calling os.kill() is suspicious and seems like a bad idea.

Please note that attachment #324266 [details] is not the patch I'm proposing; I sent it through EWS to confirm what EWS does to a patch when an exception is thrown.  Attachment #324064 [details] is the relevant one.

Removing the SIGTERM handler would prevent the workers from being stuck.  However, I added this code so that we can actually debug webkitpy when it's stuck.  Without these signal handlers, a problem like this one is difficult to solve.

I spoke with Aakash today about this problem.  It sounds like what has happened is that for a while, we've had a problem where our Mac EWS bots get into a state where they run out of file handles.  Since I added a SIGTERM handler that saves to a file, machines in this state now hang.  Currently, we restart our EWS bots every 10 patches to get machines out of this state.

After talking with Aakash, we will spend some time determining why we're leaking file handles before making any changes.
Comment 31 Aakash Jain 2017-10-20 16:14:24 PDT
Marked attachment 324266 as obsolete to get it out of the EWS queues.