63103 – nrwt: move http locking code into manager

RESOLVED FIXED63103

nrwt: move http locking code into manager

https://bugs.webkit.org/show_bug.cgi?id=63103

Summary nrwt: move http locking code into manager

Dirk Pranke

Reported 2011-06-21 16:13:43 PDT

nrwt: move http locking code into manager

Attachments
Patch (6.46 KB, patch) 2011-06-21 17:16 PDT, Dirk Pranke	no flags	Details Formatted Diff Diff
change location of start/stop code to mesh with subsequent sharding better (6.83 KB, patch) 2011-06-21 19:45 PDT, Dirk Pranke	no flags	Details Formatted Diff Diff
remove dangling stop (6.69 KB, patch) 2011-06-21 20:05 PDT, Dirk Pranke	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Dirk Pranke

Comment 1 2011-06-21 17:16:04 PDT

Created attachment 98085 [details] Patch

Dirk Pranke

Comment 2 2011-06-21 17:49:58 PDT

This patch is the first of a (shortish) series of patches that will allow us to get better sharding and throughput of the http tests. A while ago we added the ability to new-run-webkit-tests to only run the http tests if we held a machine-wide file lock, so that we could run multiple NRWT's on a box at a time without needing to worry about resource conflicts. That change was fine, but the way it was implemented depended on a single worker thread getting all of the http tests, grabbing the lock, and then running the tests. This had the advantage that if some other process held the lock, NRWT could execute at least some tests in other threads. But, it had a couple of downsides. First, any failures related to starting and stopping the web server tended to get lost in the --verbose noise and the output from the other tests executing. Second, it was separate from all of the other configuration checking and status updates (which are done in the worker). Third, and by far the most important, it made it impossible to run the http tests in multiple shards amongst multiple workers. So, I am attempting to unwind the change slightly and implement it differently. The startup/shutdown logic will be done in the manager, and we will distribute the http tests into some (fixable) number of shards and hand them to workers, so that we can better control the load on the web server (e.g., we might want to have 12 child processes on a big Mac Pro, but only ever be doing 4 concurrent http test). It is unclear to me if anyone is actively running multiple NRWTs in parallel on a single machine, and if it is important that we get real multiplexing, as opposed to just correct semi-serialization. The HTTP tests when run in a single thread are easily the slowest shard and the long pole in an NRWT run, so I imagine that the lock is held for most if not all of the run, anyway. So, this patch just surrounds the entire test run with the lock. Subsequent patches will then split the tests into multiple shards, and modify the logic so that we drop the lock when all of the http-requiring shards have completed. If some users are actively running multiple test runs in parallel on the same machine, and it turns out that more finely-grained contention is really useful, we can change the logic in the master so that it does a non-blocking trylock on the file, and runs non-http tests until it can actually get the lock. I've copied at least a few people who might want to run multiple concurrent test runs. Can any of you say how important it is in practice right now to be able to get more finely-grained locking as I've described above?

Dirk Pranke

Comment 3 2011-06-21 19:45:38 PDT

Created attachment 98104 [details] change location of start/stop code to mesh with subsequent sharding better

Dirk Pranke

Comment 4 2011-06-21 20:05:27 PDT

Created attachment 98106 [details] remove dangling stop

Adam Barth

Comment 5 2011-06-21 22:23:08 PDT

Comment on attachment 98106 [details] remove dangling stop This patch caused trouble for the cr-linux-ews.

Dirk Pranke

Comment 6 2011-06-21 22:34:27 PDT

(In reply to comment #5) > (From update of attachment 98106 [details]) > This patch caused trouble for the cr-linux-ews. Can you go into any more specifics? Log files, etc.? Is there some other way I can get this reviewed that won't cause problems for the ews bot? Although, this patch shouldn't cause trouble for any bots, so it would be good to see the errors ...

Adam Barth

Comment 7 2011-06-22 10:13:44 PDT

> Can you go into any more specifics? Log files, etc.? > > Is there some other way I can get this reviewed that won't cause problems for the ews bot? Although, this patch shouldn't cause trouble for any bots, so it would be good to see the errors ... I'm not 100% clear whether this patch caused the problems, but the bots were spinning because the HTTP port was in use. Now that I read the email you send to the gardening list, I realize this might be related to that patch. In any case, I've rebooted the bots and this patch now has a green bubble, so perhaps everything is fine. Sorry for the distraction.

Tony Chang

Comment 8 2011-06-23 16:00:23 PDT

Comment on attachment 98106 [details] remove dangling stop Didn't we have unittests for these methods? Maybe they were lower level. _run_tests is a long function. It would be nice to break it into smaller pieces some day . . .

Dirk Pranke

Comment 9 2011-06-23 17:06:11 PDT

(In reply to comment #8) > (From update of attachment 98106 [details]) > Didn't we have unittests for these methods? Maybe they were lower level. > > _run_tests is a long function. It would be nice to break it into smaller pieces some day . . . We don't have unittests per se for these methods, but they are pretty well covered by the tests in run_webkit_tests_integrationtest.py, and we have tests for the locking and start/stop code now. I agree that _run_tests is too long now, but that'll be separate patches to fix.

Dirk Pranke

Comment 10 2011-06-23 17:33:49 PDT

Comment on attachment 98106 [details] remove dangling stop Clearing flags on attachment: 98106 Committed r89642: <http://trac.webkit.org/changeset/89642>

Dirk Pranke

Comment 11 2011-06-23 17:33:54 PDT

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware PC

OS All

Product WebKit

Component New Bugs

Assignee

Dirk Pranke

Reported

2011-06-21 16:13 PDT

Modified

2011-06-23 17:33 PDT History

CC List

9 users Show

URL

Keywords

Depends on

Blocks

63112

Dependencies

tree graph