57724 – test_kill_process is hanging on the EWS bots

NEW 57724

test_kill_process is hanging on the EWS bots

https://bugs.webkit.org/show_bug.cgi?id=57724

Summary test_kill_process is hanging on the EWS bots

Eric Seidel (no email)

Reported 2011-04-03 02:09:40 PDT

webkitpy.test.main: INFO Excluding: webkitpy.common.checkout.scm_unittest (use --all to include) ....................................................................................................................^CTraceback (most recent call last): File "./Tools/Scripts/test-webkitpy", line 266, in <module> Tester().run_tests(sys.argv, external_package_paths) File "/mnt/git/webkit-cr-mac-ews/Tools/Scripts/webkitpy/test/main.py", line 157, in run_tests unittest.main(argv=sys_argv, module=None) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 817, in __init__ self.runTests() File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 861, in runTests result = testRunner.run(self.test) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 753, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 300, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 279, in run testMethod() File "/mnt/git/webkit-cr-mac-ews/Tools/Scripts/webkitpy/common/system/executive_unittest.py", line 159, in test_kill_process self.assertEqual(process.wait(), expected_exit_code) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/subprocess.py", line 1137, in wait pid, sts = os.waitpid(self.pid, 0) KeyboardInterrupt It only hangs occasionally. But this currently brings the bot down and requires a manual restart. Maybe we're hitting http://bugs.python.org/issue5898 ? Maybe we should try adding closefds=True?

Attachments
Add attachment proposed patch, testcase, etc.

Tony Chang

Comment 1 2011-04-04 09:31:50 PDT

(In reply to comment #0) > It only hangs occasionally. But this currently brings the bot down and requires a manual restart. > > Maybe we're hitting http://bugs.python.org/issue5898 ? Maybe we should try adding closefds=True? You mean just for the subprocess.Popen call in the test? That seems OK to try. Alternately, maybe we shouldn't redirect stdout in the test? Note that the documentation for close_fds says that on Windows, you can't set close_fds=True and redirect stdout.

Eric Seidel (no email)

Comment 2 2011-05-05 12:13:50 PDT

I just had test_cancel hang too: ====================================================================== FAIL: test_cancel (webkitpy.layout_tests.layout_package.manager_worker_broker_unittest.MultiProcessBrokerTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Projects/CommitQueue/Tools/Scripts/webkitpy/layout_tests/layout_package/manager_worker_broker_unittest.py", line 178, in test_cancel self.assertFalse(worker.is_alive()) AssertionError ---------------------------------------------------------------------- Ran 1032 tests in 34.302s FAILED (failures=1) ^CError in atexit._run_exitfuncs: Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/atexit.py", line 24, in _run_exitfuncs func(*targs, **kargs) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/util.py", line 281, in _exit_function p.join() File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 119, in join res = self._popen.wait(timeout) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 117, in wait return self.poll(0) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 106, in poll pid, sts = os.waitpid(self.pid, flag) KeyboardInterrupt Error in sys.exitfunc: Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/atexit.py", line 24, in _run_exitfuncs func(*targs, **kargs) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/util.py", line 281, in _exit_function p.join() File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 119, in join res = self._popen.wait(timeout) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 117, in wait return self.poll(0) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 106, in poll pid, sts = os.waitpid(self.pid, flag) KeyboardInterrupt

Dirk Pranke

Comment 3 2011-05-05 13:45:14 PDT

I assume this is just on linux? There is also bug 58202 ... I think there's a bug in linux python somewhere that causes .join() to return early sometimes, but I've not been able to reproduce it myself.

Eric Seidel (no email)

Comment 4 2011-05-05 13:48:38 PDT

No these were both on Mac. The second one was on eseidel-cq-sf. 10.6.6

Dirk Pranke

Comment 5 2011-05-05 13:52:29 PDT

Unrelated comment: why does this bring the bot down? I mentioned this to Adam yesterday, but the EWS needs to have a concept of a deadman timer, like buildbot does. It should not be possible for any spawned off task to hang or bring down the bots, no matter what goes wrong. Unless of course you're running into a bug in python where spawning off the test-webkitpy itself is somehow hosing the caller.

Adam Barth

Comment 6 2011-05-05 13:58:46 PDT

Regardless of what the bots do, this test is busted and should be fixed.

Eric Seidel (no email)

Comment 7 2011-05-05 14:05:25 PDT

Yes, all our bots need deadman timers. We haven't found an easy way to do that yet. Maybe there is a unix command I'm missing.

Eric Seidel (no email)

Comment 8 2011-05-05 14:11:10 PDT

I suspect wrapping possibly long-running calls in sig-alarms would do the trick. We could add some simple watchdogging to executive.py.

Adam Barth

Comment 9 2011-05-05 14:20:28 PDT

(In reply to comment #8) > I suspect wrapping possibly long-running calls in sig-alarms would do the trick. We could add some simple watchdogging to executive.py. Ideally we'd add this to https://trac.webkit.org/browser/trunk/Tools/EWSTools/start-queue.sh to ensure that we'll catch everything.

Dirk Pranke

Comment 10 2011-05-05 14:24:13 PDT

There probably isn't an "easy" way to do this, but here are some thoughts. sigalarms are dicey in the best of cases and won't work on windows at all. A "unix command" wouldn't really help either. Don't use subprocess.call() or communicate(), use Popen() or some other variant and implement an explicit timeout via sleep and poll(). It is unfortunate that subprocess doesn't give you a way to wait(timeout), but that wouldn't really help anyway, because ... you will need to distinguish between a process that is hung and a process that is just slow. I believe the way buildbot does this (at least on the Chromium bots) is to ensure that the process is generating output periodically, and we only kill the process if we've gotten no input for 10 minutes. You can probably steal the code from buildbot.

Adam Barth

Comment 11 2011-05-05 14:29:51 PDT

We can wait for an hour before pulling the plug. There's no rush. The key is to do this at as high a level as possible. The queues are all designed to be killable at any moment.

Eric Seidel (no email)

Comment 12 2011-05-05 14:39:23 PDT

A suggestion from a friend (as an example of how we might do this from start_queue.sh): foo & TARGET=$! (sleep $TIMEOUT && kill -9 $TARGET &) DEADMAN=$! wait $TARGET kill -9 $DEADMAN

Eric Seidel (no email)

Comment 13 2011-05-10 14:04:56 PDT

I'm making the EWS robust against hangs in bug 60579. But we should still fix/remove these tests.

Note You need to log in before you can comment on or make changes to this bug.

Status NEW

Resolution

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware PC

OS OS X 10.5

Product WebKit

Component Tools / Tests

Assignee

Nobody

Reported

2011-04-03 02:09 PDT

Modified

2011-05-10 14:04 PDT History

CC List

4 users Show

URL

Keywords

Depends on

54790 60579

Blocks

Dependencies

tree graph