RESOLVED FIXED 54071
nrwt multiprocessing: spawn multiple workers
https://bugs.webkit.org/show_bug.cgi?id=54071
Summary nrwt multiprocessing: spawn multiple workers
Dirk Pranke
Reported 2011-02-08 23:44:47 PST
nrwt multiprocessing: spawn multiple workers
Attachments
Patch (5.31 KB, patch)
2011-02-08 23:46 PST, Dirk Pranke
no flags
reset state properly in _run_tests() (5.30 KB, patch)
2011-02-09 00:21 PST, Dirk Pranke
no flags
update w/ feedback from tony (6.77 KB, patch)
2011-02-11 14:44 PST, Dirk Pranke
no flags
rename fields per feedback from tony (6.91 KB, patch)
2011-02-11 17:42 PST, Dirk Pranke
tony: review+
Dirk Pranke
Comment 1 2011-02-08 23:46:53 PST
Dirk Pranke
Comment 2 2011-02-09 00:21:00 PST
Created attachment 81762 [details] reset state properly in _run_tests()
Tony Chang
Comment 3 2011-02-11 13:24:14 PST
Comment on attachment 81762 [details] reset state properly in _run_tests() View in context: https://bugs.webkit.org/attachment.cgi?id=81762&action=review > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:109 > + w = manager_connection.start_worker(worker_number) > + ws = _WorkerState(worker_number, w) > + self._workers[w.name] = ws w -> worker, ws -> worker_state > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:113 > + # We sleep a bit in between workers to give the processes a chance > + # to start up without thrashing. > + time.sleep(0.1) This seems prone to a race. What type of thrashing happens if you don't sleep? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:119 > - manager_connection.post_message('stop') > + for i in xrange(num_workers): > + manager_connection.post_message('stop') Why do we have to post multiple stop messages? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:144 > + for w in self._workers.values(): > + if not w.done: > + done = False Can you use any() + list comprehension here? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:147 > + # FIXME: is the following line safe? > + # self._done = done Can you elaborate on this? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:150 > def handle_started_test(self, src, test_info, hang_timeout): src -> source > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:156 > + w = self._workers[src] > + w.done = True w -> worker, src -> source > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:158 > def handle_exception(self, src, exception_info): src -> source
Dirk Pranke
Comment 4 2011-02-11 14:16:47 PST
(In reply to comment #3) > (From update of attachment 81762 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=81762&action=review > > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:109 > > + w = manager_connection.start_worker(worker_number) > > + ws = _WorkerState(worker_number, w) > > + self._workers[w.name] = ws > > w -> worker, ws -> worker_state > Will do. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:113 > > + # We sleep a bit in between workers to give the processes a chance > > + # to start up without thrashing. > > + time.sleep(0.1) > > This seems prone to a race. What type of thrashing happens if you don't sleep? > I don't really understand what is going on, but it appears that the DumpRenderTrees get into weird states and end up timing out on their first couple tests before starting to run smoothly. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:119 > > - manager_connection.post_message('stop') > > + for i in xrange(num_workers): > > + manager_connection.post_message('stop') > > Why do we have to post multiple stop messages? > One for each worker. Because of the ordering of the messages, and the fact that a worker will stop reading messages after receiving a 'stop', this is sufficient to guarantee that every worker will receive a stop. I'll add a comment to this effect. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:144 > > + for w in self._workers.values(): > > + if not w.done: > > + done = False > > Can you use any() + list comprehension here? > Sure. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:147 > > + # FIXME: is the following line safe? > > + # self._done = done > > Can you elaborate on this? > I will add a better comment. Basically, at the moment this code always returns False for TestRunner.is_done(), which means that broker_connect.run_message_loop() will never exit early. This is fine, since we run with timeouts, so in the worst case we bail out within a second. However, I think that if we set _done here, we will safely bail out as soon as the last "done" message is processed. I just hadn't convinced myself 100% that this was true. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:150 > > def handle_started_test(self, src, test_info, hang_timeout): > > src -> source > Will do. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:156 > > + w = self._workers[src] > > + w.done = True > > w -> worker, src -> source > Will do. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:158 > > def handle_exception(self, src, exception_info): > > src -> source Will do.
Dirk Pranke
Comment 5 2011-02-11 14:35:02 PST
(In reply to comment #4) > (In reply to comment #3) > > (From update of attachment 81762 [details] [details]) > > View in context: https://bugs.webkit.org/attachment.cgi?id=81762&action=review > > > > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:109 > > > + w = manager_connection.start_worker(worker_number) > > > + ws = _WorkerState(worker_number, w) > > > + self._workers[w.name] = ws > > > > w -> worker, ws -> worker_state > > > > Will do. > Oh yeah, I used 'w' because 'worker' is already used to name the module. I'll use 'worker_obj' instead. > > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:144 > > > + for w in self._workers.values(): > > > + if not w.done: > > > + done = False > > > > Can you use any() + list comprehension here? > > > > Sure. Actually, the block inside the loop gets more complicated when we check to see if the worker is wedged, but I think I can pull that into a worker_is_done_or_wedged() function.
Dirk Pranke
Comment 6 2011-02-11 14:44:43 PST
Created attachment 82183 [details] update w/ feedback from tony
Tony Chang
Comment 7 2011-02-11 16:57:34 PST
Comment on attachment 82183 [details] update w/ feedback from tony View in context: https://bugs.webkit.org/attachment.cgi?id=82183&action=review > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:52 > + def __init__(self, number, worker): > + self.worker = worker Should we name this param something other than |worker| to avoid shadowing the module? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:76 > + def _worker_is_done(self, worker): > + # FIXME: check if the worker is wedged. > + return worker.done What about this param? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:120 > + # FIXME: If we start workers up too quickly, DumpRenderTree appears > + # to thrash on something and time out its first few tests. Until > + # we can figure out what's going on, sleep a bit in between > + # workers. > + time.sleep(0.1) This is fine for now, but it sounds like this could lead to flakiness. Any ideas why the old code doesn't have this problem? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:130 > + # We post one 'stop' message for each worker. Because the stop message > + # are sent after all of the tests, and because each worker will stop > + # reading messsages after receiving a stop, we can be sure each > + # worker will get a stop message and hence they will all shut down. > + for i in xrange(num_workers): > + manager_connection.post_message('stop') This is also fine, but maybe ManagerConnection should have a stop_all_workers method? > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:155 > + def handle_done(self, source): > + worker = self._workers[source] > + worker.done = True worker -> worker_state?
Dirk Pranke
Comment 8 2011-02-11 17:32:01 PST
(In reply to comment #7) > (From update of attachment 82183 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=82183&action=review > > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:52 > > + def __init__(self, number, worker): > > + self.worker = worker > > Should we name this param something other than |worker| to avoid shadowing the module? > I can rename them to worker_connection to be a little clearer. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:76 > > + def _worker_is_done(self, worker): > > + # FIXME: check if the worker is wedged. > > + return worker.done > > What about this param? > Will rename to worker_state. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:120 > > + # FIXME: If we start workers up too quickly, DumpRenderTree appears > > + # to thrash on something and time out its first few tests. Until > > + # we can figure out what's going on, sleep a bit in between > > + # workers. > > + time.sleep(0.1) > > This is fine for now, but it sounds like this could lead to flakiness. Any ideas why the old code doesn't have this problem? > It might be being aggravated by using multiple python processes instead of threads so that there's much greater overall load on the system. I haven't really explored it much yet; it's still on my to-do list. > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:130 > > + # We post one 'stop' message for each worker. Because the stop message > > + # are sent after all of the tests, and because each worker will stop > > + # reading messsages after receiving a stop, we can be sure each > > + # worker will get a stop message and hence they will all shut down. > > + for i in xrange(num_workers): > > + manager_connection.post_message('stop') > > This is also fine, but maybe ManagerConnection should have a stop_all_workers method? > That would make ManagerConnection aware of the semantics of the messages and how many workers have been started, duplicating state in multiple places, so I'd prefer not to do that (the code was originally written that way, for what its worth, and it ended up being much cleaner this way). > > Tools/Scripts/webkitpy/layout_tests/layout_package/test_runner2.py:155 > > + def handle_done(self, source): > > + worker = self._workers[source] > > + worker.done = True > > worker -> worker_state? Will do.
Dirk Pranke
Comment 9 2011-02-11 17:42:06 PST
Created attachment 82215 [details] rename fields per feedback from tony
Dirk Pranke
Comment 10 2011-02-14 13:48:35 PST
Note You need to log in before you can comment on or make changes to this bug.