If a worker process refuses to gracefully exit after receiving a SIGTERM, we should send it a SIGKILL so the main process doesn't hang.
<rdar://problem/35018927>
Note that we're seen this a number of times on EWS. It's not the easiest problem to reproduce, and indicates something is wrong with the workers. We should investigate what the problem with the workers is as well, but there is no reason to allow a process which we have already sent a SIGTERM to to continue running.
Created attachment 323959 [details] Patch
Comment on attachment 323959 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=323959&action=review > Tools/Scripts/webkitpy/common/message_pool.py:-140 > - worker.join() was this worker.join() waiting indefinitely in the cases we noticed? If so, simply providing a timeout parameter might help. > Tools/Scripts/webkitpy/common/message_pool.py:144 > + # If nice termination doesn't work, force worker to terminate. Nit: replace "nice" with "above" or "regular". Nice makes me think if "nice" is a kind of termination defined somewhere. > Tools/Scripts/webkitpy/common/message_pool.py:148 > + worker.join(timeout=math.ceil(timeout_target - time.time()) if timeout_target > time.time() else 0) This should be done in multiple lines. This is unreadable and non-obvious. Also adding a comment here would be useful, explaining that you are implementing an overall timeout of 60 seconds (for all the workers combined), rather than usual 60s timeout for each worker.
Comment on attachment 323959 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=323959&action=review >> Tools/Scripts/webkitpy/common/message_pool.py:-140 >> - worker.join() > > was this worker.join() waiting indefinitely in the cases we noticed? If so, simply providing a timeout parameter might help. Yes, the worker.join() was waiting indefinitely. Simply providing a timeout would un-block the main process, but would then leave orphaned workers. That's why I forcibly kill the workers below.
Created attachment 324022 [details] Patch
Comment on attachment 324022 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review > Tools/Scripts/webkitpy/common/message_pool.py:155 > + os.kill(worker.pid, signal.SIGKILL) We may want to throw an exception if we have to SIGKILL a worker. The main thread should be fine, but this indicates something is wrong with the worker.
Comment on attachment 324022 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review > Tools/Scripts/webkitpy/common/message_pool.py:144 > + # Check if above termination was successful, force worker to terminate if it wasn't. Nit: "worker" -> "workers" > Tools/Scripts/webkitpy/common/message_pool.py:148 > + # This sets a timeout of 60 seconds for all workers to close, instead of 60 seconds for each worker. Thinking more about it, I think it might be better to keep fixed timeout for each worker. Right now, if first worker is stuck, it would take 60s to timeout, and rest all workers will have 0s timeouts. Having fixed timeout would increase overall timeout, but that's ok to have few minutes timeout in case of some problem. > Tools/Scripts/webkitpy/common/message_pool.py:154 > + if worker.is_alive(): Nit: would be good to add a newline before this to improve readability. > Tools/Scripts/webkitpy/common/message_pool.py:156 > + Can you at least add a log?
Comment on attachment 324022 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review >> Tools/Scripts/webkitpy/common/message_pool.py:148 >> + # This sets a timeout of 60 seconds for all workers to close, instead of 60 seconds for each worker. > > Thinking more about it, I think it might be better to keep fixed timeout for each worker. Right now, if first worker is stuck, it would take 60s to timeout, and rest all workers will have 0s timeouts. Having fixed timeout would increase overall timeout, but that's ok to have few minutes timeout in case of some problem. I don't think we want a fixed timeout. In your example, at least 60 seconds would have passed between worker 2's terminate() and the attempted join. I can see a case for setting the timeout to 1 instead of 0 in this case, but if worker 2 isn't stuck, worker.is_alive() should return false after 60 seconds anyways. When I've observed this problem, it's usually present in all the workers. For EWS, this means we'd have to wait for 24 minutes before we kill all workers.
Comment on attachment 324022 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review > Tools/ChangeLog:3 > + webkitpy: Force workers to exit if they do not do so gracefully I'm not super keen on this idea. We've never hit this condition in the past, and when it happened recently, that was an actual recent regression that needed to be fixed (and would have been masked by the force kill). Having defense in depth is generally good in my opinion, but the specifics of this case seem to suggest that increased code complexity may not be worth it.
Comment on attachment 324022 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=324022&action=review >> Tools/ChangeLog:3 >> + webkitpy: Force workers to exit if they do not do so gracefully > > I'm not super keen on this idea. We've never hit this condition in the past, and when it happened recently, that was an actual recent regression that needed to be fixed (and would have been masked by the force kill). > > Having defense in depth is generally good in my opinion, but the specifics of this case seem to suggest that increased code complexity may not be worth it. Because of this, I pointed out in comment 8 that we should consider throwing an exception when this happens. While I agree that we should be cautious of masking problems in the worker processes, the parent process should be able to force workers to exit when abnormal circumstances are detected, even if that worker is misbehaving.
Created attachment 324049 [details] Patch
Comment on attachment 324049 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=324049&action=review > Tools/Scripts/webkitpy/common/message_pool.py:157 > + os.kill(worker.pid, signal.SIGKILL) Could this put us in a bad state? I take it this we do not need to do any cleaning up, especially for iOS simulator and iOS device testing?
(In reply to Daniel Bates from comment #13) > Comment on attachment 324049 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=324049&action=review > > > Tools/Scripts/webkitpy/common/message_pool.py:157 > > + os.kill(worker.pid, signal.SIGKILL) > > Could this put us in a bad state? I take it this we do not need to do any > cleaning up, especially for iOS simulator and iOS device testing? In theory, yes, it could. If a worker owned any sort of cleanup tied to SIGTERM, killing the process could leave things in a bad state. In practice, we don't tie any cleanup to SIGTERM at the moment. The two examples you gave are cleaned up by the parent process when the children processes receive SIGTERMs.
I agree with the sentiments of comment 10. I am unclear how we came to the conclusion to file this bug from the current EWS situation. I am not at a computer with a checkout. I do recall many changes to the message pool code recently and we were not experiencing stuck EWS bots over the past six months (or at least not as many stuck bots) as we are now. What has changed?
(In reply to Daniel Bates from comment #15) > I agree with the sentiments of comment 10. I am unclear how we came to the > conclusion to file this bug from the current EWS situation. I am not at a > computer with a checkout. I do recall many changes to the message pool code > recently and we were not experiencing stuck EWS bots over the past six > months (or at least not as many stuck bots) as we are now. What has changed? The specific issue which caused the EWS bug has already been addressed in <https://trac.webkit.org/changeset/223572>. The purpose of this bug is to make sure that when workers do not respond gracefully to a SIGTERM, the parent process does not get stuck and there are no orphaned processes. There were 2 problems which caused EWS to get into this state. One of those problems was covered by <https://bugs.webkit.org/show_bug.cgi?id=178402>, this is the reason why the workers were failing to gracefully exit. The first problem is the regression. The 2nd problem is this one: even if the workers fail to gracefully exit, the parent process should not hang forever. The 2nd problem has been present in webkitpy for a long time, and is not a regression.
Created attachment 324064 [details] Patch
(In reply to Jonathan Bedard from comment #16) > The 2nd problem is this one: even if the workers fail to gracefully exit, > the parent process should not hang forever. The 2nd problem has been > present in webkitpy for a long time, and is not a regression. I do not think we should fix this "problem" as it represents a programming error. If we were to have fixed this "problem" then it would have masked the regression caused by bug #178402 and we likely would not have landed the fix for the root problem in <https://trac.webkit.org/changeset/223572>.
(In reply to Daniel Bates from comment #18) > (In reply to Jonathan Bedard from comment #16) > > The 2nd problem is this one: even if the workers fail to gracefully exit, > > the parent process should not hang forever. The 2nd problem has been > > present in webkitpy for a long time, and is not a regression. > > I do not think we should fix this "problem" as it represents a programming > error. If we were to have fixed this "problem" then it would have masked the > regression caused by bug #178402 and we likely would not have landed the fix > for the root problem in <https://trac.webkit.org/changeset/223572>. *bug #176393
(In reply to Daniel Bates from comment #18) > (In reply to Jonathan Bedard from comment #16) > > The 2nd problem is this one: even if the workers fail to gracefully exit, > > the parent process should not hang forever. The 2nd problem has been > > present in webkitpy for a long time, and is not a regression. > > I do not think we should fix this "problem" as it represents a programming > error. If we were to have fixed this "problem" then it would have masked the > regression caused by bug #178402 and we likely would not have landed the fix > for the root problem in <https://trac.webkit.org/changeset/223572>. The current iteration of the patch (attachment 324064 [details]) just makes it so that a regression like the one in bug #176393 is fatal rather than causing EWS to lock up. This seems more inline with how WebKit infrastructure usually works. Don't we prefer immediate and fatal errors to undiagnosed hangs?
Comment on attachment 324064 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=324064&action=review > Tools/Scripts/webkitpy/common/message_pool.py:171 > + if did_kill: > + raise Exception('Workers failed to terminate for 60 seconds after SIGTERM') Seems like a good idea to fail quickly. How did you test this change? Would the EWS bubble be red, and would an archive with results still be uploaded?
Created attachment 324252 [details] Patch
I'm more convinced that we need this (small bug in the last patch, just updated it) A more recent example from today, if we exit early because we run out of file handles, the SIGTERM handler on the workers may hang. Unlike previously, this problem doesn't necissarily constitute a programming error on the part of webkitpy, as a bug in WebKit or even one of the tools we rely on (like spindump, or the simulator) could also cause this.
I agree with this change. I think it needs more testing on EWS to make sure that it makes the bubbles red, logs are uploaded when this issue is encountered and the bot is able to process next build properly.
(In reply to Alexey Proskuryakov from comment #21) > Comment on attachment 324064 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=324064&action=review > > > Tools/Scripts/webkitpy/common/message_pool.py:171 > > + if did_kill: > > + raise Exception('Workers failed to terminate for 60 seconds after SIGTERM') > > Seems like a good idea to fail quickly. > > How did you test this change? Would the EWS bubble be red, and would an > archive with results still be uploaded? The EWS bubble would we red, it would upload the archive, but the archive may not have much in it (depending on how far run-webkit-tests has gotten). EWS doesn't make a distinction between failing exit codes. I've tested this run running run-webkit-tests in various configurations. This problem isn't unique to EWS.
The bubbles shouldn't be red when there is an infrastructure failure.
Created attachment 324266 [details] Patch
(In reply to Jonathan Bedard from comment #27) > Created attachment 324266 [details] > Patch Sending this through to test EWS behavior when an exception is thrown.
Comment on attachment 324266 [details] Patch I feel this patch (attachment #324266 [details]) is a bad idea. Ideally, we should not have run-away multiprocessing.Process processes. And I do not feel it is worth the code complexity to defend against this bug. If we can find a less complex solution that would be great. By default, Python will terminate the process when it receives a SIGTERM. Therefore, the only way we can get into trouble is whenever we register a SIGTERM handler. At the time, only code in Tools/Scripts/webkitpy/common/interupt_debugging.py does this. Any solution, like the one proposed, that subverts the multiprocessing module by calling os.kill() is suspicious and seems like a bad idea.
(In reply to Daniel Bates from comment #29) > Comment on attachment 324266 [details] > Patch > > I feel this patch (attachment #324266 [details]) is a bad idea. Ideally, we > should not have run-away multiprocessing.Process processes. And I do not > feel it is worth the code complexity to defend against this bug. If we can > find a less complex solution that would be great. By default, Python will > terminate the process when it receives a SIGTERM. Therefore, the only way we > can get into trouble is whenever we register a SIGTERM handler. At the time, > only code in Tools/Scripts/webkitpy/common/interupt_debugging.py does this. > > Any solution, like the one proposed, that subverts the multiprocessing > module by calling os.kill() is suspicious and seems like a bad idea. Please note that attachment #324266 [details] is not the patch I'm proposing, I was sending that through EWS to confirm what EWS does to a patch when an exception is thrown. Attachment #324064 [details] is the relevant one. Removing the SIGTERM handler would prevent the workers from being stuck. However, I added this code so that we can actually debug webkitpy when it's stuck. Without these signal handlers, a problem like this one is difficult to solve. I spoke with Aakash today about this problem. It sounds like what has happened is that for awhile, we've had a problem where our Mac EWS bots get into a state where they run out of file handles. When I attached a SIGTERM handler which saves to a file, machines in this state will now hang. Currently, we restart our EWS patches every 10 patches to get machines out of this state. After talking with Aakash, we will spend some time determining why we're leaking file handles before making any changes.
Marked patch 324266 as obsolete to get it out of EWS queues.