Bug 62297 - nwrt: Chromium Win hangs frequently
Summary: nwrt: Chromium Win hangs frequently
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 62180
Blocks:
Reported: 2011-06-08 10:04 PDT by Dimitri Glazkov (Google)
Modified: 2011-12-21 14:11 PST

See Also:


Comment 1 Dimitri Glazkov (Google) 2011-06-08 10:11:37 PDT
It looks like there's a whole bunch of orphaned processes left over: python.exe, LayoutTestHelper.exe.
Comment 2 Dirk Pranke 2011-06-08 13:49:53 PDT
Right around


2011-06-08 09:36:14,887 8616 stack_utils.py:67 DEBUG       raise e
2011-06-08 09:36:14,887 8616 worker.py:148 DEBUG worker/0 cleaning up
2011-06-08 09:36:14,887 8616 worker.py:114 DEBUG worker/0 exiting

you can see one of the threads bailing out, in this case because we tried to delete an old pywebsocket log file and failed. Because this was an unexpected exception, we bail out without trying to clean up, so nothing gets cleaned up properly and all of these stale processes are left around.

The patch I've posted in bug 62180 will fix this particular issue; it's possible that we should do more to try and clean up on the way out, though.
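
For illustration only, "doing more to clean up on the way out" could look roughly like the sketch below. The Worker class, run_worker, and failing_task are hypothetical names made up for this example, not the actual worker.py code; the point is just that a try/finally runs the cleanup even when an unexpected exception propagates.

```python
class Worker:
    """Hypothetical stand-in for a test worker; not the real worker.py class."""

    def __init__(self, name):
        self.name = name
        self.cleaned_up = False

    def cleanup(self):
        # In a real worker this is where helper processes would be stopped.
        self.cleaned_up = True


def run_worker(worker, task):
    """Run task for worker, cleaning up even if task raises unexpectedly."""
    try:
        task()
    finally:
        # Runs on success and on any exception; the exception still propagates.
        worker.cleanup()


def failing_task():
    # Simulates the failure described above: an old log file cannot be deleted.
    raise OSError("could not delete old pywebsocket log file")
```

With this shape the caller still sees the original exception, so the failure is reported as before; the only difference is that cleanup is attempted first.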
Comment 3 Tony Chang 2011-06-08 13:54:38 PDT
We should certainly fix the bugs in the scripts if possible, but we should also make sure the buildbot can recover no matter what state we're in.  This is what the task_kill step tries to do (clean up stray processes).  This doesn't work for python processes because if we kill all python processes, we take down the buildbot process.  The chromium win bots try to work around this by having a separate binary called python_slave.exe (I think it's just a copy of python.exe) and running the buildbot slave with that binary.  Then it's safe to taskkill /f /im python.exe on the waterfall.
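
A rough sketch of why the renamed binary helps, under the assumption that the kill step matches processes purely by image name. The process table, PIDs, and helper names below are made up for illustration; only the taskkill /f /im python.exe command line comes from the description above.

```python
def matched_by_image_name(processes, image_name):
    """Return the PIDs that a kill-by-exact-image-name would match.

    processes is a list of (pid, image_name) pairs. Names are compared
    case-insensitively here, on the assumption that taskkill does the same.
    """
    wanted = image_name.lower()
    return [pid for pid, name in processes if name.lower() == wanted]


def taskkill_command(image_name):
    """Build the argument list for: taskkill /f /im <image_name>."""
    return ["taskkill", "/f", "/im", image_name]


# Hypothetical process table for a bot (all PIDs are invented).
processes = [
    (1200, "python_slave.exe"),      # buildbot slave, run from the renamed copy
    (3404, "python.exe"),            # stale test-runner process
    (3588, "python.exe"),            # stale pywebsocket process
    (4012, "LayoutTestHelper.exe"),  # stale helper
]

# Only the plain python.exe processes match; the slave is left alone.
doomed = matched_by_image_name(processes, "python.exe")
```

Because the slave runs under a different image name, killing python.exe by name never touches it.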
Comment 4 Dirk Pranke 2011-06-08 13:57:34 PDT
(In reply to comment #3)
> We should certainly fix the bugs in the scripts if possible, but we should also make sure the buildbot can recover no matter what state we're in.  This is what the task_kill step tries to do (clean up stray processes).  This doesn't work for python processes because if we kill all python processes, we take down the buildbot process.  The chromium win bots try to work around this by having a separate binary called python_slave.exe (I think it's just a copy of python.exe) and running the buildbot slave with that binary.  Then it's safe to taskkill /f /im python.exe on the waterfall.

I agree with everything you wrote, and that's an interesting suggestion. Originally I didn't attempt to clean up the workers on an unexpected exception because I figured that might just make a bad thing worse. However, it would be hard to be worse than what seems to be happening on Windows, so maybe I'll try changing the code to always try to clean up the workers and see if that helps.
Comment 5 Ryosuke Niwa 2011-06-08 14:24:45 PDT
(In reply to comment #3)
> We should certainly fix the bugs in the scripts if possible, but we should also make sure the buildbot can recover no matter what state we're in.  This is what the task_kill step tries to do (clean up stray processes).  This doesn't work for python processes because if we kill all python processes, we take down the buildbot process.  The chromium win bots try to work around this by having a separate binary called python_slave.exe (I think it's just a copy of python.exe) and running the buildbot slave with that binary.  Then it's safe to taskkill /f /im python.exe on the waterfall.

That sounds like a great idea. But I wonder if we can achieve the same effect by using perl, ruby, or some other scripting language.
Comment 6 Dirk Pranke 2011-06-08 15:27:35 PDT
(In reply to comment #5)
> That sounds like a great idea. But I wonder if we can achieve the same effect by using perl, ruby, or some other scripting language.

I'm not sure I follow you. Buildbot, new-run-webkit-tests, and pywebsocket all are written in Python. Tony's point was that we can't simply kill all python processes from taskkill or one of these scripts without killing ourselves or our parents. Are you suggesting that we rewrite one of these so that it isn't in Python?
Comment 7 Ryosuke Niwa 2011-06-08 15:36:41 PDT
(In reply to comment #6)
> I'm not sure I follow you. Buildbot, new-run-webkit-tests, and pywebsocket all are written in Python. Tony's point was that we can't simply kill all python processes from taskkill or one of these scripts without killing ourselves or our parents. Are you suggesting that we rewrite one of these so that it isn't in Python?

Right. We can write a simple perl/ruby script that kills all python instances and starts a new python instance.  That'll avoid having to duplicate python.exe and make it easier to deploy across ports.
Comment 8 Dirk Pranke 2011-06-08 15:57:12 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > I'm not sure I follow you. Buildbot, new-run-webkit-tests, and pywebsocket all are written in Python. Tony's point was that we can't simply kill all python processes from taskkill or one of these scripts without killing ourselves or our parents. Are you suggesting that we rewrite one of these so that it isn't in Python?
> 
> Right. We can write a simple perl/ruby script that kills all python instances and starts a new python instance.  That'll avoid having to duplicate python.exe and make it easier to deploy across ports.

Since buildbot is in python, it can't call a script that kills all python processes, or it itself would be killed (causing the whole build to fail). You're not suggesting we rewrite buildbot, presumably, so I'm not sure how this would work?
Comment 9 Ryosuke Niwa 2011-06-08 15:59:24 PDT
(In reply to comment #8)
> Since buildbot is in python, it can't call a script that kills all python processes, or it itself would be killed (causing the whole build to fail). You're not suggesting we rewrite buildbot, presumably, so I'm not sure how this would work?

Ah, that's a good point. We can't kill buildslave.
Comment 10 Ryosuke Niwa 2011-06-08 15:59:55 PDT
Is it possible to figure out which python process is running buildslave and whitelist it?
Comment 11 Dirk Pranke 2011-06-08 16:15:32 PDT
(In reply to comment #10)
> Is it possible to figure out which python process is running buildslave and whitelist it?

The taskkill /im and "killall" commands (which are systemwide utilities that we didn't write) don't give us a way to say "kill everything named X except me if I'm named X", or anything like that kind of flexibility.

It is presumably possible to reimplement that logic ourselves in Python or some other language, but we haven't (yet) done so. I have no idea how much work it would be; at least on Windows, a decent amount, I think.
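
To make the idea concrete, here is a minimal sketch of the selection logic only, assuming we already have a snapshot of the process table as (pid, parent pid, image name) tuples. Getting that snapshot, and doing the actual killing, is the platform-specific part and is not shown. All names and PIDs are hypothetical.

```python
def protected_pids(self_pid, parent_of):
    """Return self_pid plus every ancestor found by following parent links."""
    protected = set()
    pid = self_pid
    # Stop when the parent is unknown or when a link loops back on itself.
    while pid is not None and pid not in protected:
        protected.add(pid)
        pid = parent_of.get(pid)
    return protected


def pids_to_kill(processes, image_name, self_pid):
    """Pick processes named image_name, except ourselves and our ancestors.

    processes is a list of (pid, parent_pid, image_name) tuples.
    """
    parent_of = {pid: ppid for pid, ppid, _ in processes}
    protected = protected_pids(self_pid, parent_of)
    wanted = image_name.lower()
    return [
        pid
        for pid, _, name in processes
        if name.lower() == wanted and pid not in protected
    ]


# Hypothetical snapshot: 100 is the buildslave, 200 is the script doing the kill.
snapshot = [
    (100, 1, "python.exe"),             # buildslave (our parent)
    (200, 100, "python.exe"),           # this script
    (300, 50, "python.exe"),            # orphan from an earlier run
    (301, 50, "LayoutTestHelper.exe"),  # different image name, not selected
    (302, 999, "python.exe"),           # orphan whose parent is gone
]
victims = pids_to_kill(snapshot, "python.exe", self_pid=200)
```

One caveat with this approach: parent links are only as trustworthy as the snapshot, since a recorded parent PID may belong to a process that has already exited.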
Comment 12 Dirk Pranke 2011-12-21 14:11:10 PST
We kill old processes on the build.webkit.org bots now, so I think we can close this. Please reopen if anyone disagrees.