WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
NEW
57724
test_kill_process is hanging on the EWS bots
https://bugs.webkit.org/show_bug.cgi?id=57724
Summary
test_kill_process is hanging on the EWS bots
Eric Seidel (no email)
Reported
2011-04-03 02:09:40 PDT
webkitpy.test.main: INFO Excluding: webkitpy.common.checkout.scm_unittest (use --all to include) ....................................................................................................................^CTraceback (most recent call last): File "./Tools/Scripts/test-webkitpy", line 266, in <module> Tester().run_tests(sys.argv, external_package_paths) File "/mnt/git/webkit-cr-mac-ews/Tools/Scripts/webkitpy/test/main.py", line 157, in run_tests unittest.main(argv=sys_argv, module=None) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 817, in __init__ self.runTests() File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 861, in runTests result = testRunner.run(self.test) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 753, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run test(result) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 300, in __call__ return self.run(*args, **kwds) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 279, in run testMethod() File "/mnt/git/webkit-cr-mac-ews/Tools/Scripts/webkitpy/common/system/executive_unittest.py", line 159, in test_kill_process self.assertEqual(process.wait(), expected_exit_code) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/subprocess.py", line 1137, in wait pid, sts = os.waitpid(self.pid, 0) KeyboardInterrupt It only hangs occasionally. But this currently brings the bot down and requires a manual restart. Maybe we're hitting
http://bugs.python.org/issue5898
? Maybe we should try adding closefds=True?
Attachments
Add attachment
proposed patch, testcase, etc.
Tony Chang
Comment 1
2011-04-04 09:31:50 PDT
(In reply to
comment #0
)
> It only hangs occasionally. But this currently brings the bot down and requires a manual restart. > > Maybe we're hitting
http://bugs.python.org/issue5898
? Maybe we should try adding closefds=True?
You mean just for the subprocess.Popen call in the test? That seems OK to try. Alternately, maybe we shouldn't redirect stdout in the test? Note that the documentation for close_fds says that on Windows, you can't set close_fds=True and redirect stdout.
Eric Seidel (no email)
Comment 2
2011-05-05 12:13:50 PDT
I just had test_cancel hang too: ====================================================================== FAIL: test_cancel (webkitpy.layout_tests.layout_package.manager_worker_broker_unittest.MultiProcessBrokerTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Projects/CommitQueue/Tools/Scripts/webkitpy/layout_tests/layout_package/manager_worker_broker_unittest.py", line 178, in test_cancel self.assertFalse(worker.is_alive()) AssertionError ---------------------------------------------------------------------- Ran 1032 tests in 34.302s FAILED (failures=1) ^CError in atexit._run_exitfuncs: Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/atexit.py", line 24, in _run_exitfuncs func(*targs, **kargs) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/util.py", line 281, in _exit_function p.join() File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 119, in join res = self._popen.wait(timeout) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 117, in wait return self.poll(0) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 106, in poll pid, sts = os.waitpid(self.pid, flag) KeyboardInterrupt Error in sys.exitfunc: Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/atexit.py", line 24, in _run_exitfuncs func(*targs, **kargs) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/util.py", line 281, in _exit_function p.join() File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 119, in join res = self._popen.wait(timeout) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 117, in wait return self.poll(0) File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 106, in poll pid, sts = os.waitpid(self.pid, flag) KeyboardInterrupt
Dirk Pranke
Comment 3
2011-05-05 13:45:14 PDT
I assume this is just on linux? There is also
bug 58202
... I think there's a bug in linux python somewhere that causes .join() to return early sometimes, but I've not been able to reproduce it myself.
Eric Seidel (no email)
Comment 4
2011-05-05 13:48:38 PDT
No these were both on Mac. The second one was on eseidel-cq-sf. 10.6.6
Dirk Pranke
Comment 5
2011-05-05 13:52:29 PDT
Unrelated comment: why does this bring the bot down? I mentioned this to Adam yesterday, but the EWS needs to have a concept of a deadman timer, like buildbot does. It should not be possible for any spawned off task to hang or bring down the bots, no matter what goes wrong. Unless of course you're running into a bug in python where spawning off the test-webkitpy itself is somehow hosing the caller.
Adam Barth
Comment 6
2011-05-05 13:58:46 PDT
Regardless of what the bots do, this test is busted and should be fixed.
Eric Seidel (no email)
Comment 7
2011-05-05 14:05:25 PDT
Yes, all our bots need deadman timers. We haven't found an easy way to do that yet. Maybe there is a unix command I'm missing.
Eric Seidel (no email)
Comment 8
2011-05-05 14:11:10 PDT
I suspect wrapping possibly long-running calls in sig-alarms would do the trick. We could add some simple watchdogging to executive.py.
Adam Barth
Comment 9
2011-05-05 14:20:28 PDT
(In reply to
comment #8
)
> I suspect wrapping possibly long-running calls in sig-alarms would do the trick. We could add some simple watchdogging to executive.py.
Ideally we'd add this to
https://trac.webkit.org/browser/trunk/Tools/EWSTools/start-queue.sh
to ensure that we'll catch everything.
Dirk Pranke
Comment 10
2011-05-05 14:24:13 PDT
There probably isn't an "easy" way to do this, but here are some thoughts. sigalarms are dicey in the best of cases and won't work on windows at all. A "unix command" wouldn't really help either. Don't use subprocess.call() or communicate(), use Popen() or some other variant and implement an explicit timeout via sleep and poll(). It is unfortunate that subprocess doesn't give you a way to wait(timeout), but that wouldn't really help anyway, because ... you will need to distinguish between a process that is hung and a process that is just slow. I believe the way buildbot does this (at least on the Chromium bots) is to ensure that the process is generating output periodically, and we only kill the process if we've gotten no input for 10 minutes. You can probably steal the code from buildbot.
Adam Barth
Comment 11
2011-05-05 14:29:51 PDT
We can wait for an hour before pulling the plug. There's no rush. The key is to do this at as high a level as possible. The queues are all designed to be killable at any moment.
Eric Seidel (no email)
Comment 12
2011-05-05 14:39:23 PDT
A suggestion from a friend (as an example of how we might do this from start_queue.sh): foo & TARGET=$! (sleep $TIMEOUT && kill -9 $TARGET &) DEADMAN=$! wait $TARGET kill -9 $DEADMAN
Eric Seidel (no email)
Comment 13
2011-05-10 14:04:56 PDT
I'm making the EWS robust against hangs in
bug 60579
. But we should still fix/remove these tests.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug