Bug 178368 - webkitpy: Force workers to exit if they do not do so gracefully
Summary: webkitpy: Force workers to exit if they do not do so gracefully
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Jonathan Bedard
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2017-10-16 16:51 PDT by Jonathan Bedard
Modified: 2017-10-20 16:14 PDT (History)
CC: 7 users

See Also:


Attachments
Patch (2.16 KB, patch)
2017-10-16 17:01 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (2.39 KB, patch)
2017-10-17 10:07 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (2.79 KB, patch)
2017-10-17 13:33 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (2.92 KB, patch)
2017-10-17 15:40 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (2.90 KB, patch)
2017-10-19 11:20 PDT, Jonathan Bedard
no flags Details | Formatted Diff | Diff
Patch (2.91 KB, patch)
2017-10-19 13:06 PDT, Jonathan Bedard
jbedard: commit-queue-
Details | Formatted Diff | Diff

Description Jonathan Bedard 2017-10-16 16:51:06 PDT
If a worker process refuses to gracefully exit after receiving a SIGTERM, we should send it a SIGKILL so the main process doesn't hang.
Comment 1 Radar WebKit Bug Importer 2017-10-16 16:51:48 PDT
<rdar://problem/35018927>
Comment 2 Jonathan Bedard 2017-10-16 16:52:56 PDT
Note that we've seen this a number of times on EWS.  It's not the easiest problem to reproduce, and it indicates something is wrong with the workers.  We should investigate the underlying worker problem as well, but there is no reason to allow a process that has already been sent a SIGTERM to continue running.
Comment 3 Jonathan Bedard 2017-10-16 17:01:33 PDT
Created attachment 323959 [details]
Patch
Comment 4 Aakash Jain 2017-10-17 09:42:52 PDT
Comment on attachment 323959 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=323959&action=review

> Tools/Scripts/webkitpy/common/message_pool.py:-140
> -                worker.join()

Was this worker.join() waiting indefinitely in the cases we noticed? If so, simply providing a timeout parameter might help.

> Tools/Scripts/webkitpy/common/message_pool.py:144
> +        # If nice termination doesn't work, force worker to terminate.

Nit: replace "nice" with "above" or "regular". Nice makes me think if "nice" is a kind of termination defined somewhere.

> Tools/Scripts/webkitpy/common/message_pool.py:148
> +                worker.join(timeout=math.ceil(timeout_target - time.time()) if timeout_target > time.time() else 0)

This should be split across multiple lines; as written it is unreadable and non-obvious.

Also, adding a comment here would be useful, explaining that you are implementing an overall timeout of 60 seconds for all the workers combined, rather than the usual 60-second timeout for each worker.
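
(For illustration only, a minimal sketch of the multi-line form being requested. The helper name _join_with_shared_deadline and the constant WORKER_JOIN_TIMEOUT_SECONDS are hypothetical, not taken from the patch.)

    import math
    import time

    WORKER_JOIN_TIMEOUT_SECONDS = 60  # one budget shared by all workers combined

    def _join_with_shared_deadline(workers, budget_seconds=WORKER_JOIN_TIMEOUT_SECONDS):
        # Compute the deadline once; each join() only gets whatever remains of
        # the shared budget, so the total wait is bounded by budget_seconds.
        deadline = time.time() + budget_seconds
        for worker in workers:
            remaining = deadline - time.time()
            worker.join(timeout=math.ceil(remaining) if remaining > 0 else 0)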
Comment 5 Jonathan Bedard 2017-10-17 09:49:42 PDT
Comment on attachment 323959 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=323959&action=review

>> Tools/Scripts/webkitpy/common/message_pool.py:-140
>> -                worker.join()
> 
> was this worker.join() waiting indefinitely in the cases we noticed? If so, simply providing a timeout parameter might help.

Yes, the worker.join() was waiting indefinitely.

Simply providing a timeout would unblock the main process, but would then leave orphaned workers.  That's why I forcibly kill the workers below.
Comment 6 Jonathan Bedard 2017-10-17 10:07:39 PDT
Created attachment 324022 [details]
Patch
Comment 7 Jonathan Bedard 2017-10-17 10:08:46 PDT
Comment on attachment 324022 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review

> Tools/Scripts/webkitpy/common/message_pool.py:155
> +                os.kill(worker.pid, signal.SIGKILL)

We may want to throw an exception if we have to SIGKILL a worker.  The main thread should be fine, but this indicates something is wrong with the worker.
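
(A rough sketch of the fallback and exception being discussed here; the helper name, the did_kill flag, and the log message are illustrative assumptions rather than the patch's actual code.)

    import logging
    import os
    import signal

    _log = logging.getLogger(__name__)

    def _kill_stuck_workers(workers):
        # SIGKILL any worker that is still alive after the joins above, and
        # raise afterwards so the failure is loud rather than silently masked.
        did_kill = False
        for worker in workers:
            if worker.is_alive():
                _log.error('%s failed to exit after SIGTERM, sending SIGKILL', worker.name)
                os.kill(worker.pid, signal.SIGKILL)  # POSIX only
                did_kill = True
        if did_kill:
            raise Exception('Workers failed to terminate for 60 seconds after SIGTERM')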
Comment 8 Aakash Jain 2017-10-17 10:21:42 PDT
Comment on attachment 324022 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review

> Tools/Scripts/webkitpy/common/message_pool.py:144
> +        # Check if above termination was successful, force worker to terminate if it wasn't.

Nit: "worker" -> "workers"

> Tools/Scripts/webkitpy/common/message_pool.py:148
> +                # This sets a timeout of 60 seconds for all workers to close, instead of 60 seconds for each worker.

Thinking more about it, I think it might be better to keep a fixed timeout for each worker. Right now, if the first worker is stuck, it would take 60 seconds to time out, and all the remaining workers would get 0-second timeouts. A fixed per-worker timeout would increase the overall timeout, but a timeout of a few minutes is acceptable when something goes wrong.

> Tools/Scripts/webkitpy/common/message_pool.py:154
> +            if worker.is_alive():

Nit: would be good to add a newline before this to improve readability.

> Tools/Scripts/webkitpy/common/message_pool.py:156
> +

Can you at least add a log?
Comment 9 Jonathan Bedard 2017-10-17 10:37:17 PDT
Comment on attachment 324022 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review

>> Tools/Scripts/webkitpy/common/message_pool.py:148
>> +                # This sets a timeout of 60 seconds for all workers to close, instead of 60 seconds for each worker.
> 
> Thinking more about it, I think it might be better to keep a fixed timeout for each worker. Right now, if the first worker is stuck, it would take 60 seconds to time out, and all the remaining workers would get 0-second timeouts. A fixed per-worker timeout would increase the overall timeout, but a timeout of a few minutes is acceptable when something goes wrong.

I don't think we want a fixed timeout.

In your example, at least 60 seconds would have passed between worker 2's terminate() and the attempted join.  I can see a case for setting the timeout to 1 instead of 0 here, but if worker 2 isn't stuck, worker.is_alive() should return False after 60 seconds anyway.

When I've observed this problem, it's usually present in all the workers.  For EWS, this means we'd have to wait for 24 minutes before we kill all workers.
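
(For concreteness, assuming the 24-minute figure implies roughly 24 worker processes per EWS bot: a fixed 60-second timeout per worker gives a worst case of about 24 × 60 s = 1440 s, i.e. 24 minutes, when every worker is stuck, whereas a single shared 60-second deadline caps the total wait at one minute.)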
Comment 10 Alexey Proskuryakov 2017-10-17 12:52:08 PDT
Comment on attachment 324022 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review

> Tools/ChangeLog:3
> +        webkitpy: Force workers to exit if they do not do so gracefully

I'm not super keen on this idea. We've never hit this condition in the past, and when it happened recently, that was an actual recent regression that needed to be fixed (and would have been masked by the force kill).

Having defense in depth is generally good in my opinion, but the specifics of this case seem to suggest that increased code complexity may not be worth it.
Comment 11 Jonathan Bedard 2017-10-17 13:15:03 PDT
Comment on attachment 324022 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review

>> Tools/ChangeLog:3
>> +        webkitpy: Force workers to exit if they do not do so gracefully
> 
> I'm not super keen on this idea. We've never hit this condition in the past, and when it happened recently, that was an actual recent regression that needed to be fixed (and would have been masked by the force kill).
> 
> Having defense in depth is generally good in my opinion, but the specifics of this case seem to suggest that increased code complexity may not be worth it.

Because of this, I pointed out in comment 7 that we should consider throwing an exception when this happens.

While I agree that we should be cautious about masking problems in the worker processes, the parent process should be able to force workers to exit when abnormal circumstances are detected, even when a worker is misbehaving.
Comment 12 Jonathan Bedard 2017-10-17 13:33:32 PDT
Created attachment 324049 [details]
Patch
Comment 13 Daniel Bates 2017-10-17 14:42:41 PDT
Comment on attachment 324049 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324049&action=review

> Tools/Scripts/webkitpy/common/message_pool.py:157
> +                os.kill(worker.pid, signal.SIGKILL)

Could this put us in a bad state? I take it that we do not need to do any cleanup, especially for iOS simulator and iOS device testing?
Comment 14 Jonathan Bedard 2017-10-17 15:04:48 PDT
(In reply to Daniel Bates from comment #13)
> Comment on attachment 324049 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=324049&action=review
> 
> > Tools/Scripts/webkitpy/common/message_pool.py:157
> > +                os.kill(worker.pid, signal.SIGKILL)
> 
> Could this put us in a bad state? I take it that we do not need to do any
> cleanup, especially for iOS simulator and iOS device testing?

In theory, yes, it could. If a worker owned any sort of cleanup tied to SIGTERM, killing the process could leave things in a bad state.

In practice, we don't tie any cleanup to SIGTERM at the moment.  The two examples you gave are cleaned up by the parent process when the child processes receive SIGTERMs.
Comment 15 Daniel Bates 2017-10-17 15:17:58 PDT
I agree with the sentiments of comment 10. I am unclear how we came to the conclusion to file this bug from the current EWS situation. I am not at a computer with a checkout. I do recall many changes to the message pool code recently and we were not experiencing stuck EWS bots over the past six months (or at least not as many stuck bots) as we are now. What has changed?
Comment 16 Jonathan Bedard 2017-10-17 15:32:22 PDT
(In reply to Daniel Bates from comment #15)
> I agree with the sentiments of comment 10. I am unclear how we came to the
> conclusion to file this bug from the current EWS situation. I am not at a
> computer with a checkout. I do recall many changes to the message pool code
> recently and we were not experiencing stuck EWS bots over the past six
> months (or at least not as many stuck bots) as we are now. What has changed?

The specific issue which caused the EWS bug has already been addressed in <https://trac.webkit.org/changeset/223572>.

The purpose of this bug is to make sure that when workers do not respond gracefully to a SIGTERM, the parent process does not get stuck and there are no orphaned processes.

There were two problems that caused EWS to get into this state.  One of them was covered by <https://bugs.webkit.org/show_bug.cgi?id=178402>; it is the reason the workers were failing to exit gracefully.  That first problem is the regression.

The 2nd problem is this one: even if the workers fail to gracefully exit, the parent process should not hang forever.  The 2nd problem has been present in webkitpy for a long time, and is not a regression.
Comment 17 Jonathan Bedard 2017-10-17 15:40:28 PDT
Created attachment 324064 [details]
Patch
Comment 18 Daniel Bates 2017-10-17 15:45:47 PDT
(In reply to Jonathan Bedard from comment #16)
> The 2nd problem is this one: even if the workers fail to gracefully exit,
> the parent process should not hang forever.  The 2nd problem has been
> present in webkitpy for a long time, and is not a regression.

I do not think we should fix this "problem" as it represents a programming error. If we were to have fixed this "problem" then it would have masked the regression caused by bug #178402 and we likely would not have landed the fix for the root problem in <https://trac.webkit.org/changeset/223572>.
Comment 19 Daniel Bates 2017-10-17 15:47:47 PDT
(In reply to Daniel Bates from comment #18)
> (In reply to Jonathan Bedard from comment #16)
> > The 2nd problem is this one: even if the workers fail to gracefully exit,
> > the parent process should not hang forever.  The 2nd problem has been
> > present in webkitpy for a long time, and is not a regression.
> 
> I do not think we should fix this "problem" as it represents a programming
> error. If we were to have fixed this "problem" then it would have masked the
> regression caused by bug #178402 and we likely would not have landed the fix
> for the root problem in <https://trac.webkit.org/changeset/223572>.

*bug #176393
Comment 20 Jonathan Bedard 2017-10-17 15:51:59 PDT
(In reply to Daniel Bates from comment #18)
> (In reply to Jonathan Bedard from comment #16)
> > The 2nd problem is this one: even if the workers fail to gracefully exit,
> > the parent process should not hang forever.  The 2nd problem has been
> > present in webkitpy for a long time, and is not a regression.
> 
> I do not think we should fix this "problem" as it represents a programming
> error. If we were to have fixed this "problem" then it would have masked the
> regression caused by bug #178402 and we likely would not have landed the fix
> for the root problem in <https://trac.webkit.org/changeset/223572>.

The current iteration of the patch (attachment 324064 [details]) just makes it so that a regression like the one in bug #176393 is fatal rather than causing EWS to lock up.  This seems more in line with how WebKit infrastructure usually works.  Don't we prefer immediate and fatal errors to undiagnosed hangs?
Comment 21 Alexey Proskuryakov 2017-10-17 16:29:58 PDT
Comment on attachment 324064 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=324064&action=review

> Tools/Scripts/webkitpy/common/message_pool.py:171
> +        if did_kill:
> +            raise Exception('Workers failed to terminate for 60 seconds after SIGTERM')

Seems like a good idea to fail quickly.

How did you test this change? Would the EWS bubble be red, and would an archive with results still be uploaded?
Comment 22 Jonathan Bedard 2017-10-19 11:20:15 PDT
Created attachment 324252 [details]
Patch
Comment 23 Jonathan Bedard 2017-10-19 11:21:03 PDT
I'm now more convinced that we need this (there was a small bug in the last patch; I've just updated it).

A more recent example from today: if we exit early because we run out of file handles, the SIGTERM handler on the workers may hang.  Unlike the earlier case, this doesn't necessarily constitute a programming error in webkitpy, since a bug in WebKit or even in one of the tools we rely on (like spindump or the simulator) could also cause it.
Comment 24 Aakash Jain 2017-10-19 11:29:28 PDT
I agree with this change. 

I think it needs more testing on EWS to make sure that it turns the bubbles red, that logs are uploaded when this issue is encountered, and that the bot is able to process the next build properly.
Comment 25 Jonathan Bedard 2017-10-19 11:53:00 PDT
(In reply to Alexey Proskuryakov from comment #21)
> Comment on attachment 324064 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=324064&action=review
> 
> > Tools/Scripts/webkitpy/common/message_pool.py:171
> > +        if did_kill:
> > +            raise Exception('Workers failed to terminate for 60 seconds after SIGTERM')
> 
> Seems like a good idea to fail quickly.
> 
> How did you test this change? Would the EWS bubble be red, and would an
> archive with results still be uploaded?

The EWS bubble would be red, and it would upload the archive, but the archive may not have much in it (depending on how far run-webkit-tests got).  EWS doesn't distinguish between failing exit codes.

I've tested this by running run-webkit-tests in various configurations.  This problem isn't unique to EWS.
Comment 26 Alexey Proskuryakov 2017-10-19 12:34:01 PDT
The bubbles shouldn't be red when there is an infrastructure failure.
Comment 27 Jonathan Bedard 2017-10-19 13:06:47 PDT
Created attachment 324266 [details]
Patch
Comment 28 Jonathan Bedard 2017-10-19 13:09:02 PDT
(In reply to Jonathan Bedard from comment #27)
> Created attachment 324266 [details]
> Patch

Sending this through to test EWS behavior when an exception is thrown.
Comment 29 Daniel Bates 2017-10-19 15:46:38 PDT
Comment on attachment 324266 [details]
Patch

I feel this patch (attachment #324266 [details]) is a bad idea. Ideally, we should not have runaway multiprocessing.Process processes, and I do not feel it is worth the code complexity to defend against this bug. If we can find a less complex solution, that would be great. By default, Python will terminate the process when it receives a SIGTERM, so the only way we can get into trouble is if we register a SIGTERM handler. At this time, only code in Tools/Scripts/webkitpy/common/interupt_debugging.py does this.

Any solution, like the one proposed, that subverts the multiprocessing module by calling os.kill() is suspicious and seems like a bad idea.
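
(As a standalone illustration of the failure mode described above, not webkitpy code: registering a SIGTERM handler replaces Python's default terminate-on-SIGTERM behavior, so on POSIX multiprocessing.Process.terminate() merely invokes the handler, and a handler that blocks, for instance because it cannot open a log file when the process has run out of file descriptors, leaves the child running.)

    import multiprocessing
    import signal
    import time

    def _worker_main():
        def handle_sigterm(signum, frame):
            # Stands in for a handler that blocks, e.g. while trying to write
            # a debugging log; SIGTERM no longer terminates the process.
            while True:
                time.sleep(1)

        signal.signal(signal.SIGTERM, handle_sigterm)
        time.sleep(3600)

    if __name__ == '__main__':
        child = multiprocessing.Process(target=_worker_main)
        child.start()
        time.sleep(1)      # give the child time to install its handler
        child.terminate()  # sends SIGTERM on POSIX; the handler swallows it
        child.join(timeout=5)
        print('still alive after terminate():', child.is_alive())  # prints True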
Comment 30 Jonathan Bedard 2017-10-19 16:42:56 PDT
(In reply to Daniel Bates from comment #29)
> Comment on attachment 324266 [details]
> Patch
> 
> I feel this patch (attachment #324266 [details]) is a bad idea. Ideally, we
> should not have runaway multiprocessing.Process processes, and I do not
> feel it is worth the code complexity to defend against this bug. If we can
> find a less complex solution, that would be great. By default, Python will
> terminate the process when it receives a SIGTERM, so the only way we can
> get into trouble is if we register a SIGTERM handler. At this time, only
> code in Tools/Scripts/webkitpy/common/interupt_debugging.py does this.
> 
> Any solution, like the one proposed, that subverts the multiprocessing
> module by calling os.kill() is suspicious and seems like a bad idea.

Please note that attachment #324266 [details] is not the patch I'm proposing; I sent it through EWS to confirm what EWS does to a patch when an exception is thrown.  Attachment #324064 [details] is the relevant one.

Removing the SIGTERM handler would prevent the workers from being stuck.  However, I added this code so that we can actually debug webkitpy when it's stuck.  Without these signal handlers, a problem like this one is difficult to solve.

I spoke with Aakash today about this problem.  It sounds like what has happened is that for a while, we've had a problem where our Mac EWS bots get into a state where they run out of file handles.  Since I added a SIGTERM handler that saves to a file, machines in this state now hang.  Currently, we restart our EWS bots every 10 patches to get machines out of this state.

After talking with Aakash, we will spend some time determining why we're leaking file handles before making any changes.
Comment 31 Aakash Jain 2017-10-20 16:14:24 PDT
Marked attachment 324266 as obsolete to get it out of the EWS queues.