RESOLVED FIXED 62297
nwrt: Chromium Win hangs frequently
https://bugs.webkit.org/show_bug.cgi?id=62297
Dimitri Glazkov (Google)
Comment 1 2011-06-08 10:11:37 PDT
It looks like there's a buuuuuunch of orphan processes left over: python.exe, LayoutTestHelper.exe.
Dirk Pranke
Comment 2 2011-06-08 13:49:53 PDT
Right around

  2011-06-08 09:36:14,887 8616 stack_utils.py:67 DEBUG raise e
  2011-06-08 09:36:14,887 8616 worker.py:148 DEBUG worker/0 cleaning up
  2011-06-08 09:36:14,887 8616 worker.py:114 DEBUG worker/0 exiting

you can see one of the threads bailing out, in this case because we tried to delete an old pywebsocket log file and failed. Because this was an unexpected exception, we bail out without trying to clean up, which has the result that nothing gets cleaned up properly and you have all of these stale processes around.

The patch I've posted in bug 62180 will fix this particular issue; it's possible that we should do more to try and clean up on the way out, though.
Tony Chang
Comment 3 2011-06-08 13:54:38 PDT
We should certainly fix the bugs in the scripts if possible, but we should also make sure the buildbot can recover no matter what state we're in. This is what the task_kill step tries to do (clean up stray processes). This doesn't work for python processes because if we kill all python processes, we take down the buildbot process. The chromium win bots try to work around this by having a separate binary called python_slave.exe (I think it's just a copy of python.exe) and running the buildbot slave with that binary. Then it's safe to taskkill /f /im python.exe on the waterfall.
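A minimal sketch of the idea, assuming the slave runs under a renamed interpreter. The helper name `build_taskkill_commands` and its parameters are hypothetical; only `taskkill /f /im <image>` comes from the comment above. It builds the commands without running them and refuses to target the slave's own image name.

```python
def build_taskkill_commands(stray_images, slave_image="python_slave.exe"):
    """Return one taskkill argument list per stray image name.

    slave_image is the (assumed) name of the interpreter copy that runs the
    buildbot slave; it must never appear in the kill list.
    """
    commands = []
    for image in stray_images:
        # Windows image names are case-insensitive, so compare lowercased.
        if image.lower() == slave_image.lower():
            raise ValueError("refusing to kill the buildbot slave image: " + image)
        commands.append(["taskkill", "/f", "/im", image])
    return commands
```

Each returned list could be handed to something like `subprocess.call` by a cleanup step; that part is left out here because it only makes sense on a Windows bot.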
Dirk Pranke
Comment 4 2011-06-08 13:57:34 PDT
(In reply to comment #3)
> We should certainly fix the bugs in the scripts if possible, but we should also make sure the buildbot can recover no matter what state we're in. This is what the task_kill step tries to do (clean up stray processes). This doesn't work for python processes because if we kill all python processes, we take down the buildbot process. The chromium win bots try to work around this by having a separate binary called python_slave.exe (I think it's just a copy of python.exe) and running the buildbot slave with that binary. Then it's safe to taskkill /f /im python.exe on the waterfall.

I agree with everything you wrote, and that's an interesting suggestion. Originally I didn't attempt to clean up the workers on an unexpected exception because I figured that might just make a bad thing worse; however, it would be hard to be worse than what seems to be happening on windows, so maybe I'll try changing the code to always try to clean up the workers and see if that helps.
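A rough sketch of what "always try to clean up the workers" could look like. The `Worker` class and `run_workers` function are hypothetical stand-ins, not the actual new-run-webkit-tests code; the point is only the try/finally shape.

```python
class Worker:
    """Hypothetical stand-in for a test worker that owns child processes."""

    def __init__(self, name):
        self.name = name
        self.cleaned_up = False

    def run(self):
        # Simulate an unexpected failure, such as failing to delete an old
        # log file.
        raise OSError("could not delete old log file")

    def cleanup(self):
        # A real worker would stop its helper processes here.
        self.cleaned_up = True


def run_workers(workers):
    """Run every worker; always attempt cleanup, even on unexpected errors."""
    try:
        for worker in workers:
            worker.run()
    finally:
        for worker in workers:
            try:
                worker.cleanup()
            except Exception:
                # Cleanup is best effort: keep going so one failing cleanup
                # does not leave the remaining workers uncleaned.
                pass
```

The original exception still propagates to the caller, so the run is still reported as failed; the only change is that cleanup is attempted on the way out.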
Ryosuke Niwa
Comment 5 2011-06-08 14:24:45 PDT
(In reply to comment #3)
> We should certainly fix the bugs in the scripts if possible, but we should also make sure the buildbot can recover no matter what state we're in. This is what the task_kill step tries to do (clean up stray processes). This doesn't work for python processes because if we kill all python processes, we take down the buildbot process. The chromium win bots try to work around this by having a separate binary called python_slave.exe (I think it's just a copy of python.exe) and running the buildbot slave with that binary. Then it's safe to taskkill /f /im python.exe on the waterfall.

That sounds like a great idea. But I wonder if we can achieve the same effect by using perl, ruby, or some other scripting language.
Dirk Pranke
Comment 6 2011-06-08 15:27:35 PDT
(In reply to comment #5)
> That sounds like a great idea. But I wonder if we can achieve the same effect by using perl, ruby, or some other scripting language.

I'm not sure I follow you. Buildbot, new-run-webkit-tests, and pywebsocket all are written in Python. Tony's point was that we can't simply kill all python processes from taskkill or one of these scripts without killing ourselves or our parents. Are you suggesting that we rewrite one of these so that it isn't in Python?
Ryosuke Niwa
Comment 7 2011-06-08 15:36:41 PDT
(In reply to comment #6)
> I'm not sure I follow you. Buildbot, new-run-webkit-tests, and pywebsocket all are written in Python. Tony's point was that we can't simply kill all python processes from taskkill or one of these scripts without killing ourselves or our parents. Are you suggesting that we rewrite one of these so that it isn't in Python?

Right. We can write a simple perl/ruby script that kills all python instances and starts new python instance. That'll avoid having to duplicate python.exe and makes it easier to be deployed across ports.
Dirk Pranke
Comment 8 2011-06-08 15:57:12 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > I'm not sure I follow you. Buildbot, new-run-webkit-tests, and pywebsocket all are written in Python. Tony's point was that we can't simply kill all python processes from taskkill or one of these scripts without killing ourselves or our parents. Are you suggesting that we rewrite one of these so that it isn't in Python?
>
> Right. We can write a simple perl/ruby script that kills all python instances and starts new python instance. That'll avoid having to duplicate python.exe and makes it easier to be deployed across ports.

Since buildbot is in python, it can't call a script that kills all python processes, or it itself would be killed (causing the whole build to fail). You're not suggesting we rewrite buildbot, presumably, so I'm not sure how this would work?
Ryosuke Niwa
Comment 9 2011-06-08 15:59:24 PDT
(In reply to comment #8)
> Since buildbot is in python, it can't call a script that kills all python processes, or it itself would be killed (causing the whole build to fail). You're not suggesting we rewrite buildbot, presumably, so I'm not sure how this would work?

Ah, that's a good point. We can't kill buildslave.
Ryosuke Niwa
Comment 10 2011-06-08 15:59:55 PDT
Is it possible to figure out which python process is running buildslave and whitelist it?
Dirk Pranke
Comment 11 2011-06-08 16:15:32 PDT
(In reply to comment #10)
> Is it possible to figure out which python process is running buildslave and whitelist it?

The taskkill /im / "killall" processes (which are systemwide utilities that we didn't write) don't give us a way to say "kill everything named X except me if I'm named X" or anything like that kind of flexibility. It is presumably possible to reconstruct that logic in python or some other language to do it ourselves, but we haven't (yet) done so, and I have no idea how much work it would be, but at least on windows, a decent amount, I think.
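For illustration only, the selection half of "kill everything named X except me" might look like the sketch below. The function name `pids_to_kill` and the `(pid, parent_pid, name)` tuple layout are assumptions made for this example. Gathering the process table and actually terminating processes are the platform-specific parts and are not shown, and they are likely where most of the work on Windows would be.

```python
def pids_to_kill(processes, image_name, protected_pid):
    """Pick PIDs named image_name, sparing protected_pid and its ancestors.

    processes is an iterable of (pid, parent_pid, name) tuples, however the
    caller obtained them.
    """
    processes = list(processes)
    parent_of = {pid: ppid for pid, ppid, _ in processes}

    # Walk up the parent chain so we never kill ourselves or our parents
    # (for example the buildbot slave that launched this script).
    protected = set()
    pid = protected_pid
    while pid in parent_of and pid not in protected:
        protected.add(pid)
        pid = parent_of[pid]

    wanted = image_name.lower()
    return sorted(
        pid for pid, _, name in processes
        if name.lower() == wanted and pid not in protected
    )
```

A cleanup script would pass its own PID (for instance from `os.getpid()`) as `protected_pid`, so that both it and the slave that started it survive while orphaned python.exe processes are selected.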
Dirk Pranke
Comment 12 2011-12-21 14:11:10 PST
We kill old processes on the build.webkit.org bots now, so I think we can close this. Please reopen if anyone disagrees.