54071 – nrwt multiprocessing: spawn multiple workers

RESOLVED FIXED 54071

nrwt multiprocessing: spawn multiple workers

https://bugs.webkit.org/show_bug.cgi?id=54071

Summary nrwt multiprocessing: spawn multiple workers

Dirk Pranke

Reported 2011-02-08 23:44:47 PST

nrwt multiprocessing: spawn multiple workers

Attachments
Patch (5.31 KB, patch) 2011-02-08 23:46 PST, Dirk Pranke	no flags	Details Formatted Diff Diff
reset state properly in _run_tests() (5.30 KB, patch) 2011-02-09 00:21 PST, Dirk Pranke	no flags	Details Formatted Diff Diff
update w/ feedback from tony (6.77 KB, patch) 2011-02-11 14:44 PST, Dirk Pranke	no flags	Details Formatted Diff Diff
rename fields per feedback from tony (6.91 KB, patch) 2011-02-11 17:42 PST, Dirk Pranke	tony: review+	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Dirk Pranke

Comment 1 2011-02-08 23:46:53 PST

Created attachment 81760 [details] Patch

Dirk Pranke

Comment 2 2011-02-09 00:21:00 PST

Created attachment 81762 [details] reset state properly in _run_tests()

Tony Chang

Comment 3 2011-02-11 13:24:14 PST

Comment on attachment 81762 [details] reset state properly in _run_tests() View in context: https://bugs.webkit.org/attachment.cgi?id=81762&action=review > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:109 > + w = manager_connection.start_worker(worker_number) > + ws = _WorkerState(worker_number, w) > + self._workers[w.name] = ws w -> worker, ws -> worker_state > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:113 > + # We sleep a bit in between workers to give the processes a chance > + # to start up without thrashing. > + time.sleep(0.1) This seems prone to a race. What type of thrashing happens if you don't sleep? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:119 > - manager_connection.post_message('stop') > + for i in xrange(num_workers): > + manager_connection.post_message('stop') Why do we have to post multiple stop messages? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:144 > + for w in self._workers.values(): > + if not w.done: > + done = False Can you use any() + list comprehension here? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:147 > + # FIXME: is the following line safe? > + # self._done = done Can you elaborate on this? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:150 > def handle_started_test(self, src, test_info, hang_timeout): src -> source > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:156 > + w = self._workers[src] > + w.done = True w -> worker, src -> source > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:158 > def handle_exception(self, src, exception_info): src -> source

Dirk Pranke

Comment 4 2011-02-11 14:16:47 PST

(In reply to comment #3) > (From update of attachment 81762 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=81762&action=review > > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:109 > > + w = manager_connection.start_worker(worker_number) > > + ws = _WorkerState(worker_number, w) > > + self._workers[w.name] = ws > > w -> worker, ws -> worker_state > Will do. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:113 > > + # We sleep a bit in between workers to give the processes a chance > > + # to start up without thrashing. > > + time.sleep(0.1) > > This seems prone to a race. What type of thrashing happens if you don't sleep? > I don't really understand what is going on, but it appears that the DumpRenderTrees get into weird states and end up timing out on their first couple tests before starting to run smoothly. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:119 > > - manager_connection.post_message('stop') > > + for i in xrange(num_workers): > > + manager_connection.post_message('stop') > > Why do we have to post multiple stop messages? > One for each worker. Because of the ordering of the messages, and the fact that a worker will stop reading messages after receiving a 'stop', this is sufficient to guarantee that every worker will receive a stop. I'll add a comment to this effect. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:144 > > + for w in self._workers.values(): > > + if not w.done: > > + done = False > > Can you use any() + list comprehension here? > Sure. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:147 > > + # FIXME: is the following line safe? > > + # self._done = done > > Can you elaborate on this? > I will add a better comment. Basically, at the moment this code always returns False for TestRunner.is_done(), which means that broker_connect.run_message_loop() will never exit early. This is fine, since we run with timeouts, so in the worst case we bail out within a second. However, I think that if we set _done here, we will safely bail out as soon as the last "done" message is processed. I just hadn't convinced myself 100% that this was true. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:150 > > def handle_started_test(self, src, test_info, hang_timeout): > > src -> source > Will do. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:156 > > + w = self._workers[src] > > + w.done = True > > w -> worker, src -> source > Will do. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:158 > > def handle_exception(self, src, exception_info): > > src -> source Will do.

Dirk Pranke

Comment 5 2011-02-11 14:35:02 PST

(In reply to comment #4) > (In reply to comment #3) > > (From update of attachment 81762 [details] [details]) > > View in context: https://bugs.webkit.org/attachment.cgi?id=81762&action=review > > > > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:109 > > > + w = manager_connection.start_worker(worker_number) > > > + ws = _WorkerState(worker_number, w) > > > + self._workers[w.name] = ws > > > > w -> worker, ws -> worker_state > > > > Will do. > Oh yeah, I used 'w' because 'worker' is already used to name the module. I'll use 'worker_obj' instead. > > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:144 > > > + for w in self._workers.values(): > > > + if not w.done: > > > + done = False > > > > Can you use any() + list comprehension here? > > > > Sure. Actually, the block inside the loop gets more complicated when we check to see if the worker is wedged, but I think I can pull that into a worker_is_done_or_wedged() function.

Dirk Pranke

Comment 6 2011-02-11 14:44:43 PST

Created attachment 82183 [details] update w/ feedback from tony

Tony Chang

Comment 7 2011-02-11 16:57:34 PST

Comment on attachment 82183 [details] update w/ feedback from tony View in context: https://bugs.webkit.org/attachment.cgi?id=82183&action=review > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:52 > + def __init__(self, number, worker): > + self.worker = worker Should we name this param something other than |worker| to avoid shadowing the module? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:76 > + def _worker_is_done(self, worker): > + # FIXME: check if the worker is wedged. > + return worker.done What about this param? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:120 > + # FIXME: If we start workers up too quickly, DumpRenderTree appears > + # to thrash on something and time out its first few tests. Until > + # we can figure out what's going on, sleep a bit in between > + # workers. > + time.sleep(0.1) This is fine for now, but it sounds like this could lead to flakiness. Any ideas why the old code doesn't have this problem? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:130 > + # We post one 'stop' message for each worker. Because the stop message > + # are sent after all of the tests, and because each worker will stop > + # reading messsages after receiving a stop, we can be sure each > + # worker will get a stop message and hence they will all shut down. > + for i in xrange(num_workers): > + manager_connection.post_message('stop') This is also fine, but maybe ManagerConnection should have a stop_all_workers method? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:155 > + def handle_done(self, source): > + worker = self._workers[source] > + worker.done = True worker -> worker_state?

Dirk Pranke

Comment 8 2011-02-11 17:32:01 PST

(In reply to comment #7) > (From update of attachment 82183 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=82183&action=review > > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:52 > > + def __init__(self, number, worker): > > + self.worker = worker > > Should we name this param something other than |worker| to avoid shadowing the module? > I can rename them to worker_connection to be a little clearer. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:76 > > + def _worker_is_done(self, worker): > > + # FIXME: check if the worker is wedged. > > + return worker.done > > What about this param? > Will rename to worker_state. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:120 > > + # FIXME: If we start workers up too quickly, DumpRenderTree appears > > + # to thrash on something and time out its first few tests. Until > > + # we can figure out what's going on, sleep a bit in between > > + # workers. > > + time.sleep(0.1) > > This is fine for now, but it sounds like this could lead to flakiness. Any ideas why the old code doesn't have this problem? > It might be being aggravated by using multiple python processes instead of threads so that there's much greater overall load on the system. I haven't really explored it much yet; it's still on my to-do list. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:130 > > + # We post one 'stop' message for each worker. Because the stop message > > + # are sent after all of the tests, and because each worker will stop > > + # reading messsages after receiving a stop, we can be sure each > > + # worker will get a stop message and hence they will all shut down. > > + for i in xrange(num_workers): > > + manager_connection.post_message('stop') > > This is also fine, but maybe ManagerConnection should have a stop_all_workers method? > That would make ManagerConnection aware of the semantics of the messages and how many workers have been started, duplicating state in multiple places, so I'd prefer not to do that (the code was originally written that way, for what its worth, and it ended up being much cleaner this way). > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:155 > > + def handle_done(self, source): > > + worker = self._workers[source] > > + worker.done = True > > worker -> worker_state? Will do.

Dirk Pranke

Comment 9 2011-02-11 17:42:06 PST

Created attachment 82215 [details] rename fields per feedback from tony

Dirk Pranke

Comment 10 2011-02-14 13:48:35 PST

Committed r78502: <http://trac.webkit.org/changeset/78502>

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Other

OS OS X 10.5

Product WebKit

Component New Bugs

Assignee

Dirk Pranke

Reported

2011-02-08 23:44 PST

Modified

2011-02-14 13:48 PST History

CC List

4 users Show

URL

Keywords

Depends on

54070

Blocks

49566 54072

Dependencies

tree graph