Bug 61847

Summary: Worker may not be stopped after LT.
Product: WebKit Reporter: Hao Zheng <zhenghao>
Component: Tools / TestsAssignee: Dirk Pranke <dpranke>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, commit-queue, dpranke, tony
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 61965    
Bug Blocks:    
Attachments:
Description Flags
Proposed patch.
none
Patch
none
Patch none

Description Hao Zheng 2011-06-01 01:48:01 PDT
In worker.py, only call to cleanup() is in __del__, which is not guaranteed to be called when the python interpreter exits. And self._driver is not killed after LT. After many runs of LT, there will be many zombie processes of driver. I think we need to run cleanup() explicitly for worker.

I'm not sure where is better to put cleanup():

    def handle_stop(self, src):
        self._done = True
**      self.cleanup()

Or,

    def run(self, port):
        self.safe_init(port)

        exception_msg = ""
        _log.debug("%s starting" % self._name)

        try:
            self._worker_connection.run_message_loop()
            if not self.is_done():
                raise AssertionError("%s: ran out of messages in worker queue."
                                     % self._name)
        except KeyboardInterrupt:
            exception_msg = ", interrupted"
        except:
            exception_msg = ", exception raised"
        finally:
            _log.debug("%s done%s" % (self._name, exception_msg))
            if exception_msg:
                exception_type, exception_value, exception_traceback = sys.exc_info()
                stack_utils.log_traceback(_log.debug, exception_traceback)
                # FIXME: Figure out how to send a message with a traceback.
                self._worker_connection.post_message('exception',
                    (exception_type, exception_value, None))
            self._worker_connection.post_message('done')
**          self.cleanup()

I'd like to hear your comments.
Comment 1 Dirk Pranke 2011-06-01 10:34:28 PDT
cleanup() should be getting called at the end of run(). I'm pretty sure it did at some point; I wonder if I deleted it?

Feel free to add a patch for this, or I'll do so this afternoon.
Comment 2 Hao Zheng 2011-06-01 21:15:49 PDT
Created attachment 95716 [details]
Proposed patch.
Comment 3 WebKit Commit Bot 2011-06-02 12:58:55 PDT
The commit-queue encountered the following flaky tests while processing attachment 95716 [details]:

fast/history/history-subframe-with-name.html bug 51039 (author: mihaip@chromium.org)
http/tests/websocket/tests/error-detect.html bug 54012 (author: abarth@webkit.org)
The commit-queue is continuing to process your patch.
Comment 4 WebKit Commit Bot 2011-06-02 13:01:14 PDT
Comment on attachment 95716 [details]
Proposed patch.

Clearing flags on attachment: 95716

Committed r87946: <http://trac.webkit.org/changeset/87946>
Comment 5 WebKit Commit Bot 2011-06-02 13:01:18 PDT
All reviewed patches have been landed.  Closing bug.
Comment 6 Adam Barth 2011-06-02 14:40:34 PDT
This broke NRWT in multiple configurations.  Here's an example:

http://build.chromium.org/p/chromium.webkit/builders/Webkit%20Linux%20%28dbg%29%281%29/builds/6754/steps/webkit_tests/logs/stdio


2011-06-02 13:35:14,919 8291 manager.py:672 DEBUG No wedged threads
2011-06-02 13:35:15,419 8291 manager.py:690 INFO Exception raised, exiting
Traceback (most recent call last):
  File "/mnt/data/b/build/slave/Webkit_Linux__dbg__1_/build/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 460, in <module>
    sys.exit(main())
  File "/mnt/data/b/build/slave/Webkit_Linux__dbg__1_/build/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 455, in main
    return run(port_obj, options, args)
  File "/mnt/data/b/build/slave/Webkit_Linux__dbg__1_/build/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 118, in run
    num_unexpected_results = manager.run(result_summary)
  File "/mnt/data/b/build/slave/Webkit_Linux__dbg__1_/build/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py", line 779, in run
    self._run_tests(self._test_files_list, result_summary))
  File "/mnt/data/b/build/slave/Webkit_Linux__dbg__1_/build/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py", line 678, in _run_tests
    assert not worker_state.worker_connection.is_alive()
AssertionError
program finished with exit code 1
Comment 7 Dirk Pranke 2011-06-02 19:17:46 PDT
Created attachment 95845 [details]
Patch
Comment 8 Hao Zheng 2011-06-02 20:13:12 PDT
Thanks for your help. Sorry to break the bots.
Comment 9 Tony Chang 2011-06-03 09:53:28 PDT
Comment on attachment 95845 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=95845&action=review

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:681
> +                        _log.error('Worked %d did not exit in time. Aborting.' % worker_state.number)
> +                        raise AssertionError('Worker %d did not exit in time' % worker_state.number)

Should we raise an Assertion here or just log the error and try to keep going?

> Tools/Scripts/webkitpy/layout_tests/layout_package/worker.py:150
> +        self.kill_driver()
> +        self.stop_servers_with_lock()

Nit: We may want to remove the _driver and _has_http_lock checks in the other call sites (in a separate patch).
Comment 10 Dirk Pranke 2011-06-03 10:01:43 PDT
(In reply to comment #9)
> (From update of attachment 95845 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=95845&action=review
> 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:681
> > +                        _log.error('Worked %d did not exit in time. Aborting.' % worker_state.number)
> > +                        raise AssertionError('Worker %d did not exit in time' % worker_state.number)
> 
> Should we raise an Assertion here or just log the error and try to keep going?
> 

Good question. I'm not actually sure what the end result of the run will be in either case (hung process or process that eventually exits when the slow thread/process exits). I would prefer to leave it as an assertion for now just to help identify if it actually occurs.

> > Tools/Scripts/webkitpy/layout_tests/layout_package/worker.py:150
> > +        self.kill_driver()
> > +        self.stop_servers_with_lock()
> 
> Nit: We may want to remove the _driver and _has_http_lock checks in the other call sites (in a separate patch).

will do.
Comment 11 Tony Chang 2011-06-03 11:29:41 PDT
Comment on attachment 95845 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=95845&action=review

>>> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:681
>>> +                        raise AssertionError('Worker %d did not exit in time' % worker_state.number)
>> 
>> Should we raise an Assertion here or just log the error and try to keep going?
> 
> Good question. I'm not actually sure what the end result of the run will be in either case (hung process or process that eventually exits when the slow thread/process exits). I would prefer to leave it as an assertion for now just to help identify if it actually occurs.

I think this is OK, but In general I think we should try to have the bots keep going as much as possible.  If the script crashes, it causes problems for the gardener who then needs to rollback and wait for the bots to cycle an extra time.  The downside for you is much smaller: you just need to land the patch manually and watch the bots as they cycle to see if an error is logged.

You might want to try landing this during non-busy hours or coordinate the landing with the gardener.
Comment 12 Dirk Pranke 2011-06-03 12:32:23 PDT
(In reply to comment #11)
> (From update of attachment 95845 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=95845&action=review
> 
> >>> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:681
> >>> +                        raise AssertionError('Worker %d did not exit in time' % worker_state.number)
> >> 
> >> Should we raise an Assertion here or just log the error and try to keep going?
> > 
> > Good question. I'm not actually sure what the end result of the run will be in either case (hung process or process that eventually exits when the slow thread/process exits). I would prefer to leave it as an assertion for now just to help identify if it actually occurs.
> 
> I think this is OK, but In general I think we should try to have the bots keep going as much as possible.  If the script crashes, it causes problems for the gardener who then needs to rollback and wait for the bots to cycle an extra time.  The downside for you is much smaller: you just need to land the patch manually and watch the bots as they cycle to see if an error is logged.
> 
> You might want to try landing this during non-busy hours or coordinate the landing with the gardener.

Oh, I'm not worried about this patch in particular being broken, I'm worried about something happening down the road that causes this to fail.

That said, I just tested it, and raising the assert won't by itself cause the program to exit (since there are still child processes running), so there's not much effective difference here between asserting and logging an error. I'll change this to a log for now.

There's also two typos in that line that are real bugs, so I'll post another patch that actually works :)
Comment 13 Dirk Pranke 2011-06-03 12:34:24 PDT
Created attachment 95947 [details]
Patch
Comment 14 Dirk Pranke 2011-06-03 12:44:03 PDT
Committed r88040: <http://trac.webkit.org/changeset/88040>