webkitpy.test.main: INFO Excluding: webkitpy.common.checkout.scm_unittest (use --all to include) ....................................................................................................................^CTraceback (most recent call last): File "./Tools/Scripts/test-webkitpy", line 266, in <module> Tester().run_tests(sys.argv, external_package_paths) File "/mnt/git/webkit-cr-mac-ews/Tools/Scripts/webkitpy/test/main.py", line 157, in run_tests unittest.main(argv=sys_argv, module=None) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 817, in __init__ self.runTests() File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 861, in runTests result = testRunner.run(self.test) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 753, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 300, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 279, in run testMethod() File "/mnt/git/webkit-cr-mac-ews/Tools/Scripts/webkitpy/common/system/executive_unittest.py", line 159, in test_kill_process self.assertEqual(process.wait(), expected_exit_code) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/subprocess.py", line 1137, in wait pid, sts = os.waitpid(self.pid, 0) KeyboardInterrupt It only hangs occasionally. But this currently brings the bot down and requires a manual restart. Maybe we're hitting http://bugs.python.org/issue5898 ? Maybe we should try adding closefds=True?
(In reply to comment #0) > It only hangs occasionally. But this currently brings the bot down and requires a manual restart. > > Maybe we're hitting http://bugs.python.org/issue5898 ? Maybe we should try adding closefds=True? You mean just for the subprocess.Popen call in the test? That seems OK to try. Alternately, maybe we shouldn't redirect stdout in the test? Note that the documentation for close_fds says that on Windows, you can't set close_fds=True and redirect stdout.
I just had test_cancel hang too: ====================================================================== FAIL: test_cancel (webkitpy.layout_tests.layout_package.manager_worker_broker_unittest.MultiProcessBrokerTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Projects/CommitQueue/Tools/Scripts/webkitpy/layout_tests/layout_package/manager_worker_broker_unittest.py", line 178, in test_cancel self.assertFalse(worker.is_alive()) AssertionError ---------------------------------------------------------------------- Ran 1032 tests in 34.302s FAILED (failures=1) ^CError in atexit._run_exitfuncs: Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/atexit.py", line 24, in _run_exitfuncs func(*targs, **kargs) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/util.py", line 281, in _exit_function p.join() File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 119, in join res = self._popen.wait(timeout) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 117, in wait return self.poll(0) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 106, in poll pid, sts = os.waitpid(self.pid, flag) KeyboardInterrupt Error in sys.exitfunc: Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/atexit.py", line 24, in _run_exitfuncs func(*targs, **kargs) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/util.py", line 281, in _exit_function p.join() File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 119, in join res = self._popen.wait(timeout) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 117, in wait return self.poll(0) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 106, in poll pid, sts = os.waitpid(self.pid, flag) KeyboardInterrupt
I assume this is just on linux? There is also bug 58202 ... I think there's a bug in linux python somewhere that causes .join() to return early sometimes, but I've not been able to reproduce it myself.
No these were both on Mac. The second one was on eseidel-cq-sf. 10.6.6
Unrelated comment: why does this bring the bot down? I mentioned this to Adam yesterday, but the EWS needs to have a concept of a deadman timer, like buildbot does. It should not be possible for any spawned off task to hang or bring down the bots, no matter what goes wrong. Unless of course you're running into a bug in python where spawning off the test-webkitpy itself is somehow hosing the caller.
Regardless of what the bots do, this test is busted and should be fixed.
Yes, all our bots need deadman timers. We haven't found an easy way to do that yet. Maybe there is a unix command I'm missing.
I suspect wrapping possibly long-running calls in sig-alarms would do the trick. We could add some simple watchdogging to executive.py.
(In reply to comment #8) > I suspect wrapping possibly long-running calls in sig-alarms would do the trick. We could add some simple watchdogging to executive.py. Ideally we'd add this to https://trac.webkit.org/browser/trunk/Tools/EWSTools/start-queue.sh to ensure that we'll catch everything.
There probably isn't an "easy" way to do this, but here are some thoughts. sigalarms are dicey in the best of cases and won't work on windows at all. A "unix command" wouldn't really help either. Don't use subprocess.call() or communicate(), use Popen() or some other variant and implement an explicit timeout via sleep and poll(). It is unfortunate that subprocess doesn't give you a way to wait(timeout), but that wouldn't really help anyway, because ... you will need to distinguish between a process that is hung and a process that is just slow. I believe the way buildbot does this (at least on the Chromium bots) is to ensure that the process is generating output periodically, and we only kill the process if we've gotten no input for 10 minutes. You can probably steal the code from buildbot.
We can wait for an hour before pulling the plug. There's no rush. The key is to do this at as high a level as possible. The queues are all designed to be killable at any moment.
A suggestion from a friend (as an example of how we might do this from start_queue.sh): foo & TARGET=$! (sleep $TIMEOUT && kill -9 $TARGET &) DEADMAN=$! wait $TARGET kill -9 $DEADMAN
I'm making the EWS robust against hangs in bug 60579. But we should still fix/remove these tests.