The latest occurrence is: http://build.webkit.org/builders/Chromium%20Win%20Release%20%28Tests%29/builds/15243/steps/layout-test/logs/stdio
It looks like there's a buuuuuunch of orphan processes left over: python.exe, LayoutTestHelper.exe.
Right around

2011-06-08 09:36:14,887 8616 stack_utils.py:67 DEBUG raise e
2011-06-08 09:36:14,887 8616 worker.py:148 DEBUG worker/0 cleaning up
2011-06-08 09:36:14,887 8616 worker.py:114 DEBUG worker/0 exiting

you can see one of the threads bailing out, in this case because we tried to delete an old pywebsocket log file and failed. Because this was an unexpected exception, we bail out without trying to clean up, which has the result that nothing gets cleaned up properly and you have all of these stale processes around. The patch I've posted in bug 62180 will fix this particular issue; it's possible that we should do more to try and clean up on the way out, though.
We should certainly fix the bugs in the scripts if possible, but we should also make sure the buildbot can recover no matter what state we're in. This is what the task_kill step tries to do (clean up stray processes). This doesn't work for python processes because if we kill all python processes, we take down the buildbot process. The chromium win bots try to work around this by having a separate binary called python_slave.exe (I think it's just a copy of python.exe) and running the buildbot slave with that binary. Then it's safe to taskkill /f /im python.exe on the waterfall.
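For illustration, the python_slave.exe workaround could be sketched roughly as below. The function names are hypothetical, and the sketch assumes python_slave.exe really is a plain copy placed next to python.exe, which is only a guess about how the chromium bots are set up.

```python
import os
import shutil


def make_slave_interpreter(python_exe, slave_name="python_slave.exe"):
    """Copy the interpreter to a sibling file with a different image name.

    Assumption: a byte-for-byte copy in the same directory is enough for
    the copy to run (it finds the same DLLs and libraries as python.exe).
    """
    slave_exe = os.path.join(os.path.dirname(python_exe), slave_name)
    if not os.path.exists(slave_exe):
        shutil.copy2(python_exe, slave_exe)
    return slave_exe


def kill_stray_pythons_command(image_name="python.exe"):
    """Build the taskkill command line used by the cleanup step.

    taskkill matches on image name only, so a slave started from
    python_slave.exe is not matched by /im python.exe.
    """
    return ["taskkill", "/f", "/im", image_name]
```

The buildbot slave would then be launched with the path returned by make_slave_interpreter(), and the cleanup step would run the command from kill_stray_pythons_command() (for example via subprocess.call).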
(In reply to comment #3)
> We should certainly fix the bugs in the scripts if possible, but we should also make sure the buildbot can recover no matter what state we're in. This is what the task_kill step tries to do (clean up stray processes). This doesn't work for python processes because if we kill all python processes, we take down the buildbot process. The chromium win bots try to work around this by having a separate binary called python_slave.exe (I think it's just a copy of python.exe) and running the buildbot slave with that binary. Then it's safe to taskkill /f /im python.exe on the waterfall.

I agree with everything you wrote, and that's an interesting suggestion. Originally I didn't attempt to clean up the workers on an unexpected exception because I figured that might just make a bad thing worse; however, it would be hard to be worse than what seems to be happening on windows, so maybe I'll try changing the code to always try to clean up the workers and see if that helps.
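A minimal sketch of "always try to clean up the workers", assuming a worker object with run() and cleanup() methods. The Worker class and run_worker() below are hypothetical stand-ins, not the actual new-run-webkit-tests code.

```python
class Worker(object):
    """Hypothetical stand-in for a test worker that owns helper processes."""

    def __init__(self, name):
        self.name = name
        self.cleaned_up = False

    def run(self):
        raise NotImplementedError

    def cleanup(self):
        # The real code would stop helper processes and servers here.
        self.cleaned_up = True


def run_worker(worker, log):
    """Run a worker and clean up afterwards, even on unexpected errors."""
    try:
        worker.run()
    except Exception as e:
        log.append("%s failed: %r" % (worker.name, e))
        raise
    finally:
        # Runs on success and on failure. A failure during cleanup is
        # logged but not raised, so it cannot hide the original error.
        try:
            worker.cleanup()
        except Exception as e:
            log.append("%s cleanup failed: %r" % (worker.name, e))
```

The point of the finally block is that an exception such as the failed log-file delete still propagates to the caller, but only after cleanup() has had a chance to run.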
(In reply to comment #3)
> We should certainly fix the bugs in the scripts if possible, but we should also make sure the buildbot can recover no matter what state we're in. This is what the task_kill step tries to do (clean up stray processes). This doesn't work for python processes because if we kill all python processes, we take down the buildbot process. The chromium win bots try to work around this by having a separate binary called python_slave.exe (I think it's just a copy of python.exe) and running the buildbot slave with that binary. Then it's safe to taskkill /f /im python.exe on the waterfall.

That sounds like a great idea. But I wonder if we can achieve the same effect by using perl, ruby, or some other scripting language.
(In reply to comment #5)
> That sounds like a great idea. But I wonder if we can achieve the same effect by using perl, ruby, or some other scripting language.

I'm not sure I follow you. Buildbot, new-run-webkit-tests, and pywebsocket all are written in Python. Tony's point was that we can't simply kill all python processes from taskkill or one of these scripts without killing ourselves or our parents. Are you suggesting that we rewrite one of these so that it isn't in Python?
(In reply to comment #6)
> I'm not sure I follow you. Buildbot, new-run-webkit-tests, and pywebsocket all are written in Python. Tony's point was that we can't simply kill all python processes from taskkill or one of these scripts without killing ourselves or our parents. Are you suggesting that we rewrite one of these so that it isn't in Python?

Right. We can write a simple perl/ruby script that kills all python instances and starts a new python instance. That'll avoid having to duplicate python.exe and make it easier to deploy across ports.
(In reply to comment #7)
> (In reply to comment #6)
> > I'm not sure I follow you. Buildbot, new-run-webkit-tests, and pywebsocket all are written in Python. Tony's point was that we can't simply kill all python processes from taskkill or one of these scripts without killing ourselves or our parents. Are you suggesting that we rewrite one of these so that it isn't in Python?
>
> Right. We can write a simple perl/ruby script that kills all python instances and starts a new python instance. That'll avoid having to duplicate python.exe and make it easier to deploy across ports.

Since buildbot is in python, it can't call a script that kills all python processes, or it itself would be killed (causing the whole build to fail). You're not suggesting we rewrite buildbot, presumably, so I'm not sure how this would work?
(In reply to comment #8)
> Since buildbot is in python, it can't call a script that kills all python processes, or it itself would be killed (causing the whole build to fail). You're not suggesting we rewrite buildbot, presumably, so I'm not sure how this would work?

Ah, that's a good point. We can't kill buildslave.
Is it possible to figure out which python process is running buildslave and whitelist it?
(In reply to comment #10)
> Is it possible to figure out which python process is running buildslave and whitelist it?

The taskkill /im / "killall" utilities (which are systemwide tools that we didn't write) don't give us a way to say "kill everything named X except me if I'm named X" or anything like that kind of flexibility. It is presumably possible to reconstruct that logic ourselves in python or some other language, but we haven't (yet) done so. I have no idea exactly how much work it would be, but at least on windows I think it would be a decent amount.
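To make "kill everything named X except me" concrete, here is a rough sketch in Python. It assumes the process list comes from `tasklist /fo csv /nh` and that the PIDs to protect are already known; all three function names are hypothetical, and nothing like this exists in the tree as far as this thread says.

```python
import csv
import io


def parse_tasklist_csv(output):
    """Parse `tasklist /fo csv /nh` output into (image_name, pid) pairs."""
    rows = csv.reader(io.StringIO(output))
    # Image name is the first column and the PID is the second.
    return [(row[0], int(row[1])) for row in rows if len(row) >= 2]


def pids_to_kill(processes, image_name, protected_pids):
    """Return PIDs whose image name matches, skipping protected ones."""
    wanted = image_name.lower()
    return [pid for name, pid in processes
            if name.lower() == wanted and pid not in protected_pids]


def taskkill_commands(pids):
    """Build one `taskkill /f /pid N` command line per PID."""
    return [["taskkill", "/f", "/pid", str(pid)] for pid in pids]
```

A caller would pass something like {os.getpid(), os.getppid()} as protected_pids. That only protects the script and its direct parent; finding the buildslave further up the process tree is the part that would take real work on windows.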
We kill old processes on the build.webkit.org bots now, so I think we can close this. Please reopen if anyone disagrees.