Bug 54071 - nrwt multiprocessing: spawn multiple workers
Summary: nrwt multiprocessing: spawn multiple workers
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Other OS X 10.5
: P2 Normal
Assignee: Dirk Pranke
URL:
Keywords:
Depends on: 54070
Blocks: 49566 54072
  Show dependency treegraph
 
Reported: 2011-02-08 23:44 PST by Dirk Pranke
Modified: 2011-02-14 13:48 PST (History)
4 users (show)

See Also:


Attachments
Patch (5.31 KB, patch)
2011-02-08 23:46 PST, Dirk Pranke
no flags Details | Formatted Diff | Diff
reset state properly in _run_tests() (5.30 KB, patch)
2011-02-09 00:21 PST, Dirk Pranke
no flags Details | Formatted Diff | Diff
update w/ feedback from tony (6.77 KB, patch)
2011-02-11 14:44 PST, Dirk Pranke
no flags Details | Formatted Diff | Diff
rename fields per feedback from tony (6.91 KB, patch)
2011-02-11 17:42 PST, Dirk Pranke
tony: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dirk Pranke 2011-02-08 23:44:47 PST
nrwt multiprocessing: spawn multiple workers
Comment 1 Dirk Pranke 2011-02-08 23:46:53 PST
Created attachment 81760 [details]
Patch
Comment 2 Dirk Pranke 2011-02-09 00:21:00 PST
Created attachment 81762 [details]
reset state properly in _run_tests()
Comment 3 Tony Chang 2011-02-11 13:24:14 PST
Comment on attachment 81762 [details]
reset state properly in _run_tests()

View in context: https://bugs.webkit.org/attachment.cgi?id=81762&action=review

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:109
> +            w = manager_connection.start_worker(worker_number)
> +            ws = _WorkerState(worker_number, w)
> +            self._workers[w.name] = ws

w -> worker, ws -> worker_state

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:113
> +            # We sleep a bit in between workers to give the processes a chance
> +            # to start up without thrashing.
> +            time.sleep(0.1)

This seems prone to a race.  What type of thrashing happens if you don't sleep?

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:119
> -        manager_connection.post_message('stop')
> +        for i in xrange(num_workers):
> +            manager_connection.post_message('stop')

Why do we have to post multiple stop messages?

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:144
> +        for w in self._workers.values():
> +            if not w.done:
> +                done = False

Can you use any() + list comprehension here?

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:147
> +        # FIXME: is the following line safe?
> +        # self._done = done

Can you elaborate on this?

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:150
>      def handle_started_test(self, src, test_info, hang_timeout):

src -> source

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:156
> +        w = self._workers[src]
> +        w.done = True

w -> worker, src -> source

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:158
>      def handle_exception(self, src, exception_info):

src -> source
Comment 4 Dirk Pranke 2011-02-11 14:16:47 PST
(In reply to comment #3)
> (From update of attachment 81762 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=81762&action=review
> 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:109
> > +            w = manager_connection.start_worker(worker_number)
> > +            ws = _WorkerState(worker_number, w)
> > +            self._workers[w.name] = ws
> 
> w -> worker, ws -> worker_state
> 

Will do.

> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:113
> > +            # We sleep a bit in between workers to give the processes a chance
> > +            # to start up without thrashing.
> > +            time.sleep(0.1)
> 
> This seems prone to a race.  What type of thrashing happens if you don't sleep?
>

I don't really understand what is going on, but it appears that the DumpRenderTrees get into weird states and end up timing out on their first couple tests before starting to run smoothly.
 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:119
> > -        manager_connection.post_message('stop')
> > +        for i in xrange(num_workers):
> > +            manager_connection.post_message('stop')
> 
> Why do we have to post multiple stop messages?
> 

One for each worker. Because of the ordering of the messages, and the fact that a worker will stop reading messages after receiving a 'stop', this is sufficient to guarantee that every worker will receive a stop. I'll add a comment to this effect.

> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:144
> > +        for w in self._workers.values():
> > +            if not w.done:
> > +                done = False
> 
> Can you use any() + list comprehension here?
> 

Sure.

> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:147
> > +        # FIXME: is the following line safe?
> > +        # self._done = done
> 
> Can you elaborate on this?
> 

I will add a better comment. Basically, at the moment this code always returns False for TestRunner.is_done(), which means that broker_connect.run_message_loop() will never exit early. This is fine, since we run with timeouts, so in the worst case we bail out within a second. 

However, I think that if we set _done here, we will safely bail out as soon as the last "done" message is processed. I just hadn't convinced myself 100% that this was true.

> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:150
> >      def handle_started_test(self, src, test_info, hang_timeout):
> 
> src -> source
>

Will do.
 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:156
> > +        w = self._workers[src]
> > +        w.done = True
> 
> w -> worker, src -> source
>

Will do.
 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:158
> >      def handle_exception(self, src, exception_info):
> 
> src -> source

Will do.
Comment 5 Dirk Pranke 2011-02-11 14:35:02 PST
(In reply to comment #4)
> (In reply to comment #3)
> > (From update of attachment 81762 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=81762&action=review
> > 
> > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:109
> > > +            w = manager_connection.start_worker(worker_number)
> > > +            ws = _WorkerState(worker_number, w)
> > > +            self._workers[w.name] = ws
> > 
> > w -> worker, ws -> worker_state
> > 
> 
> Will do.
> 

Oh yeah, I used 'w' because 'worker' is already used to name the module. I'll use 'worker_obj' instead.

> > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:144
> > > +        for w in self._workers.values():
> > > +            if not w.done:
> > > +                done = False
> > 
> > Can you use any() + list comprehension here?
> > 
> 
> Sure.

Actually,  the block inside the loop gets more complicated when we check to see if the worker is wedged, but I think I can pull that into a worker_is_done_or_wedged() function.
Comment 6 Dirk Pranke 2011-02-11 14:44:43 PST
Created attachment 82183 [details]
update w/ feedback from tony
Comment 7 Tony Chang 2011-02-11 16:57:34 PST
Comment on attachment 82183 [details]
update w/ feedback from tony

View in context: https://bugs.webkit.org/attachment.cgi?id=82183&action=review

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:52
> +    def __init__(self, number, worker):
> +        self.worker = worker

Should we name this param something other than |worker| to avoid shadowing the module?

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:76
> +    def _worker_is_done(self, worker):
> +        # FIXME: check if the worker is wedged.
> +        return worker.done

What about this param?

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:120
> +            # FIXME: If we start workers up too quickly, DumpRenderTree appears
> +            # to thrash on something and time out its first few tests. Until
> +            # we can figure out what's going on, sleep a bit in between
> +            # workers.
> +            time.sleep(0.1)

This is fine for now, but it sounds like this could lead to flakiness.  Any ideas why the old code doesn't have this problem?

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:130
> +        # We post one 'stop' message for each worker. Because the stop message
> +        # are sent after all of the tests, and because each worker will stop
> +        # reading messsages after receiving a stop, we can be sure each
> +        # worker will get a stop message and hence they will all shut down.
> +        for i in xrange(num_workers):
> +            manager_connection.post_message('stop')

This is also fine, but maybe ManagerConnection should have a stop_all_workers method?

> Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:155
> +    def handle_done(self, source):
> +        worker = self._workers[source]
> +        worker.done = True

worker -> worker_state?
Comment 8 Dirk Pranke 2011-02-11 17:32:01 PST
(In reply to comment #7)
> (From update of attachment 82183 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=82183&action=review
> 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:52
> > +    def __init__(self, number, worker):
> > +        self.worker = worker
> 
> Should we name this param something other than |worker| to avoid shadowing the module?
> 

I can rename them to worker_connection to be a little clearer.

> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:76
> > +    def _worker_is_done(self, worker):
> > +        # FIXME: check if the worker is wedged.
> > +        return worker.done
> 
> What about this param?
>

Will rename to worker_state.
 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:120
> > +            # FIXME: If we start workers up too quickly, DumpRenderTree appears
> > +            # to thrash on something and time out its first few tests. Until
> > +            # we can figure out what's going on, sleep a bit in between
> > +            # workers.
> > +            time.sleep(0.1)
> 
> This is fine for now, but it sounds like this could lead to flakiness.  Any ideas why the old code doesn't have this problem?
> 

It might be being aggravated by using multiple python processes instead of threads so that there's much greater overall load on the system. I haven't really explored it much yet; it's still on my to-do list.

> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:130
> > +        # We post one 'stop' message for each worker. Because the stop message
> > +        # are sent after all of the tests, and because each worker will stop
> > +        # reading messsages after receiving a stop, we can be sure each
> > +        # worker will get a stop message and hence they will all shut down.
> > +        for i in xrange(num_workers):
> > +            manager_connection.post_message('stop')
> 
> This is also fine, but maybe ManagerConnection should have a stop_all_workers method?
> 

That would make ManagerConnection aware of the semantics of the messages and how many workers have been started, duplicating state in multiple places, so I'd prefer not to do that (the code was originally written that way, for what its worth, and it ended up being much cleaner this way).

> > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:155
> > +    def handle_done(self, source):
> > +        worker = self._workers[source]
> > +        worker.done = True
> 
> worker -> worker_state?

Will do.
Comment 9 Dirk Pranke 2011-02-11 17:42:06 PST
Created attachment 82215 [details]
rename fields per feedback from tony
Comment 10 Dirk Pranke 2011-02-14 13:48:35 PST
Committed r78502: <http://trac.webkit.org/changeset/78502>