Bug 57724 - test_kill_process is hanging on the EWS bots
Summary: test_kill_process is hanging on the EWS bots
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC OS X 10.5
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 54790 60579
Blocks:
  Show dependency treegraph
 
Reported: 2011-04-03 02:09 PDT by Eric Seidel (no email)
Modified: 2011-05-10 14:04 PDT (History)
4 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Seidel (no email) 2011-04-03 02:09:40 PDT
webkitpy.test.main: INFO     Excluding: webkitpy.common.checkout.scm_unittest (use --all to include)
....................................................................................................................^CTraceback (most recent call last):
  File "./Tools/Scripts/test-webkitpy", line 266, in <module>
    Tester().run_tests(sys.argv, external_package_paths)
  File "/mnt/git/webkit-cr-mac-ews/Tools/Scripts/webkitpy/test/main.py", line 157, in run_tests
    unittest.main(argv=sys_argv, module=None)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 817, in __init__
    self.runTests()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 861, in runTests
    result = testRunner.run(self.test)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 753, in run
    test(result)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__
    return self.run(*args, **kwds)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run
    test(result)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__
    return self.run(*args, **kwds)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run
    test(result)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 464, in __call__
    return self.run(*args, **kwds)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 460, in run
    test(result)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 300, in __call__
    return self.run(*args, **kwds)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/unittest.py", line 279, in run
    testMethod()
  File "/mnt/git/webkit-cr-mac-ews/Tools/Scripts/webkitpy/common/system/executive_unittest.py", line 159, in test_kill_process
    self.assertEqual(process.wait(), expected_exit_code)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/subprocess.py", line 1137, in wait
    pid, sts = os.waitpid(self.pid, 0)
KeyboardInterrupt

It only hangs occasionally.  But this currently brings the bot down and requires a manual restart.

Maybe we're hitting http://bugs.python.org/issue5898 ?  Maybe we should try adding closefds=True?
Comment 1 Tony Chang 2011-04-04 09:31:50 PDT
(In reply to comment #0)
> It only hangs occasionally.  But this currently brings the bot down and requires a manual restart.
> 
> Maybe we're hitting http://bugs.python.org/issue5898 ?  Maybe we should try adding closefds=True?

You mean just for the subprocess.Popen call in the test?  That seems OK to try.  Alternately, maybe we shouldn't redirect stdout in the test?

Note that the documentation for close_fds says that on Windows, you can't set close_fds=True and redirect stdout.
Comment 2 Eric Seidel (no email) 2011-05-05 12:13:50 PDT
I just had test_cancel hang too:

======================================================================
FAIL: test_cancel (webkitpy.layout_tests.layout_package.manager_worker_broker_unittest.MultiProcessBrokerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Projects/CommitQueue/Tools/Scripts/webkitpy/layout_tests/layout_package/manager_worker_broker_unittest.py", line 178, in test_cancel
    self.assertFalse(worker.is_alive())
AssertionError

----------------------------------------------------------------------
Ran 1032 tests in 34.302s

FAILED (failures=1)
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/util.py", line 281, in _exit_function
    p.join()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 119, in join
    res = self._popen.wait(timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 117, in wait
    return self.poll(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 106, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/util.py", line 281, in _exit_function
    p.join()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 119, in join
    res = self._popen.wait(timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 117, in wait
    return self.poll(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/forking.py", line 106, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Comment 3 Dirk Pranke 2011-05-05 13:45:14 PDT
I assume this is just on linux? 

There is also bug 58202 ... I think there's a bug in linux python somewhere that causes .join() to return early sometimes, but I've not been able to reproduce it myself.
Comment 4 Eric Seidel (no email) 2011-05-05 13:48:38 PDT
No these were both on Mac.  The second one was on eseidel-cq-sf. 10.6.6
Comment 5 Dirk Pranke 2011-05-05 13:52:29 PDT
Unrelated comment: why does this bring the bot down?

I mentioned this to Adam yesterday, but the EWS needs to have a concept of a deadman timer, like buildbot does. It should not be possible for any spawned off task to hang or bring down the bots, no matter what goes wrong.

Unless of course you're running into a bug in python where spawning off the test-webkitpy itself is somehow hosing the caller.
Comment 6 Adam Barth 2011-05-05 13:58:46 PDT
Regardless of what the bots do, this test is busted and should be fixed.
Comment 7 Eric Seidel (no email) 2011-05-05 14:05:25 PDT
Yes, all our bots need deadman timers.  We haven't found an easy way to do that yet.   Maybe there is a unix command I'm missing.
Comment 8 Eric Seidel (no email) 2011-05-05 14:11:10 PDT
I suspect wrapping possibly long-running calls in sig-alarms would do the trick.  We could add some simple watchdogging to executive.py.
Comment 9 Adam Barth 2011-05-05 14:20:28 PDT
(In reply to comment #8)
> I suspect wrapping possibly long-running calls in sig-alarms would do the trick.  We could add some simple watchdogging to executive.py.

Ideally we'd add this to https://trac.webkit.org/browser/trunk/Tools/EWSTools/start-queue.sh to ensure that we'll catch everything.
Comment 10 Dirk Pranke 2011-05-05 14:24:13 PDT
There probably isn't an "easy" way to do this, but here are some thoughts.

sigalarms are dicey in the best of cases and won't work on windows at all. A "unix command" wouldn't really help either.

Don't use subprocess.call() or communicate(), use Popen() or some other variant and implement an explicit timeout via sleep and poll(). It is unfortunate that subprocess doesn't give you a way to wait(timeout), but that wouldn't really help anyway, because ...

you will need to distinguish between a process that is hung and a process that is just slow. I believe the way buildbot does this (at least on the Chromium bots) is to ensure that the process is generating output periodically, and we only kill the process if we've gotten no input for 10 minutes. You can probably steal the code from buildbot.
Comment 11 Adam Barth 2011-05-05 14:29:51 PDT
We can wait for an hour before pulling the plug.  There's no rush.  The key is to do this at as high a level as possible.  The queues are all designed to be killable at any moment.
Comment 12 Eric Seidel (no email) 2011-05-05 14:39:23 PDT
A suggestion from a friend (as an example of how we might do this from start_queue.sh):

foo &
TARGET=$!
(sleep $TIMEOUT && kill -9 $TARGET &)
DEADMAN=$!
wait $TARGET
kill -9 $DEADMAN
Comment 13 Eric Seidel (no email) 2011-05-10 14:04:56 PDT
I'm making the EWS robust against hangs in bug 60579.  But we should still fix/remove these tests.