Bug 197205

Summary: webkitpy: Kill unresponsive worker processes
Product: WebKit
Component: Tools / Tests
Reporter: Jonathan Bedard <jbedard>
Assignee: Jonathan Bedard <jbedard>
Status: RESOLVED FIXED
Severity: Normal
Priority: P2
Keywords: InRadar
CC: ap, commit-queue, ews-watchlist, glenn, lforschler, ryanhaddad, simon.fraser, webkit-bug-importer
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified

Jonathan Bedard
Reported 2019-04-23 11:49:31 PDT
It's possible for us to get into a state, most common in EWS, where workers are deadlocked waiting on the message queue. In most cases, sending a termination to all workers before joining any workers is sufficient to fix the problem. In some cases, we might need to explicitly kill unresponsive workers. This isn't a great practice, but it's very important when run-webkit-tests has been interrupted, either with a SIGTERM or because the maximum number of failed tests has been hit.
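A rough sketch of the shutdown order described above, assuming worker objects that behave like multiprocessing.Process; the helper name, the 30-second deadline, and the POSIX SIGKILL are illustrative stand-ins rather than the actual message_pool.py change:

    import os
    import signal
    import time

    def shut_down_workers(workers, deadline_seconds=30):
        # Ask every worker to stop before joining any of them, so no worker
        # is left waiting on the message queue while the manager blocks.
        for worker in workers:
            worker.terminate()

        # Join everyone against one shared deadline; only force-kill the
        # workers that never respond to terminate().
        deadline = time.time() + deadline_seconds
        for worker in workers:
            remaining = deadline - time.time()
            if remaining > 0:
                worker.join(timeout=remaining)
            if worker.is_alive():
                os.kill(worker.pid, signal.SIGKILL)  # last resort
                worker.join()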
Attachments
Patch (1.77 KB, patch)
2019-04-23 11:56 PDT, Jonathan Bedard
no flags
Patch (1.81 KB, patch)
2019-04-23 12:05 PDT, Jonathan Bedard
no flags
Patch (2.59 KB, patch)
2019-04-23 15:18 PDT, Jonathan Bedard
no flags
Jonathan Bedard
Comment 1 2019-04-23 11:52:59 PDT
Jonathan Bedard
Comment 2 2019-04-23 11:56:11 PDT
Jonathan Bedard
Comment 3 2019-04-23 12:05:24 PDT
Lucas Forschler
Comment 4 2019-04-23 12:28:24 PDT
Comment on attachment 368048 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=368048&action=review

I'd like to think on this more, but here are my thoughts:

Why are we running into this issue now, and never before? There must be a root cause which gets us in a bad place. Fixing it with a workaround to kill stuck processes seems like we'll just be accepting that things are not great. What does this mean moving forward? Will we start marking these tests as crashing so they don't make the tree red? Will we lose any testing coverage? Will this compound and get even worse moving forward, and cause us to miss things which are really wrong?

I'd argue we should find the root cause and determine why our workers are hanging.

> Tools/Scripts/webkitpy/common/message_pool.py:143
> +        deadline = time.time() + 30

I'm not sure we want to hardcode a 30 second deadline... do we store this timeout value elsewhere we could use here?
Jonathan Bedard
Comment 5 2019-04-23 13:20:26 PDT
(In reply to Lucas Forschler from comment #4)
> Comment on attachment 368048 [details]
> Patch
>
> View in context: https://bugs.webkit.org/attachment.cgi?id=368048&action=review
>
> I'd like to think on this more, but here are my thoughts:
>
> Why are we running into this issue now, and never before? There must be a
> root cause which gets us in a bad place. Fixing it with a workaround to
> kill stuck processes seems like we'll just be accepting that things are
> not great. What does this mean moving forward? Will we start marking these
> tests as crashing so they don't make the tree red? Will we lose any
> testing coverage? Will this compound and get even worse moving forward,
> and cause us to miss things which are really wrong?
>
> I'd argue we should find the root cause and determine why our workers are
> hanging.

A few responses to this:

1) Why now: If my theory about workers being deadlocked on the message queue is correct (and stack traces corroborate, but do not confirm, this), this entire situation is a race condition. I suspect it might be tied to specific kinds of tests, but ultimately, it's not the sort of thing that we're going to be able to find a regression point for.

2) "Will we start marking these tests as crashing so they don't make the tree red?": That's not a great way to address this sort of problem. We aren't actually seeing this on trunk infrastructure anyway (those tests are mostly green); thus far, we've only seen this on EWS. The thing is, EWS is expected to take sub-par code and mostly still 'work.' We know there are some tooling changes that can hose EWS if you send them through, but the sorts of changes we've seen triggering this behavior are the kind of changes EWS is meant to test. Basically, EWS should be able to coherently handle 30 crashing tests.

3) "I'd argue we should find the root cause and determine why our workers are hanging": I'm pretty sure that we have addressed the root cause here, at least as far as the hanging goes. This change does two separate things, although it's not totally obvious that they are distinct.

The first thing this change does is make sure that all workers are terminated before we try to join any of them. This may seem like a trivial change, but from my testing on ews106, I'm pretty sure that THIS is actually the root cause. It's also possible that <https://trac.webkit.org/changeset/244064> manifested this bug by logging something in the termination handler, which would cause the workers to be posting a message to the manager from the termination handler.

The second thing this change does is guarantee that if a worker is hung, we kill it (this seems to be the part of the patch that is contentious). My rationale for this change is essentially this: once the manager determines that a worker is hung, what is the rational thing to do? Never mind the fact that the worker should never get hung (I agree, it shouldn't, hence the first part of this change); once it has already hung, what can be done? The previous code does nothing and waits forever. That is a horrible way to address the problem, especially in EWS. Even if the logging returned by EWS isn't entirely accurate, at least it's present. And at least run-webkit-tests returns with a non-zero exit code since some tests did not pass. It's not a perfect solution, but I'm also not sure it's a case we would encounter after this patch lands.

I've applied this patch to ews106 and have been testing for the past hour. While I have doctored the directory so we have more than 30 failures, I have yet to see a worker be killed: all have terminated gracefully because of the first part of this change.

> > Tools/Scripts/webkitpy/common/message_pool.py:143
> > +        deadline = time.time() + 30
>
> I'm not sure we want to hardcode a 30 second deadline... do we store this
> timeout value elsewhere we could use here?

We don't have a notion of timeouts in the message pool, although we could add one.
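To make the suspected failure mode concrete, here is a minimal, generic illustration (not webkitpy code; all names and sizes are illustrative) of how a worker that posts a large message and is then joined before the queue is drained can wedge shutdown:

    import multiprocessing
    import time

    def post_result(queue):
        # Stands in for a worker whose termination path posts a message to
        # the manager: a payload bigger than the OS pipe buffer means the
        # worker process cannot fully exit until someone drains the queue.
        queue.put("x" * (1 << 21))

    if __name__ == "__main__":
        queue = multiprocessing.Queue()
        worker = multiprocessing.Process(target=post_result, args=(queue,))
        worker.start()

        time.sleep(1)              # let the worker block on the full pipe
        worker.join(timeout=5)     # an unbounded join here would never return
        print("still alive after join:", worker.is_alive())  # True: wedged

        queue.get()                # draining the queue unblocks the worker...
        worker.join()              # ...and the join now completes

Draining the queue before joining (or, as in this patch, bounding the join and killing any straggler) is what gets the manager out of that state.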
Lucas Forschler
Comment 6 2019-04-23 15:15:00 PDT
Comment on attachment 368049 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=368049&action=review

> Tools/Scripts/webkitpy/common/message_pool.py:146
> +        worker.join(timeout=deadline - time.time())

In our chat, you discovered this should probably be join(max(deadline - time.time(), 0)). It might be worth seeing what join will do with a negative timeout... maybe it behaves in an ok way.
Jonathan Bedard
Comment 7 2019-04-23 15:18:04 PDT
Jonathan Bedard
Comment 8 2019-04-23 15:21:03 PDT
(In reply to Lucas Forschler from comment #6)
> Comment on attachment 368049 [details]
> Patch
>
> View in context: https://bugs.webkit.org/attachment.cgi?id=368049&action=review
>
> > Tools/Scripts/webkitpy/common/message_pool.py:146
> > +        worker.join(timeout=deadline - time.time())
>
> In our chat, you discovered this should probably be
> join(max(deadline - time.time(), 0)). It might be worth seeing what join
> will do with a negative timeout... maybe it behaves in an ok way.

According to <https://bugs.python.org/issue14753>, a negative timeout should be treated as 0... but I dislike relying on confusing behavior like that; let's use max(...).
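For reference, the guard being adopted amounts to something like the following hypothetical helper (join_with_deadline and its signature are illustrative, not part of the patch):

    import time

    def join_with_deadline(worker, deadline):
        # Clamp the remaining time to zero so a missed deadline becomes a
        # non-blocking join rather than relying on how join() happens to
        # treat a negative timeout.
        worker.join(timeout=max(deadline - time.time(), 0))
        return not worker.is_alive()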
WebKit Commit Bot
Comment 9 2019-04-23 16:52:48 PDT
Comment on attachment 368070 [details]
Patch

Clearing flags on attachment: 368070

Committed r244571: <https://trac.webkit.org/changeset/244571>
WebKit Commit Bot
Comment 10 2019-04-23 16:52:50 PDT
All reviewed patches have been landed. Closing bug.