Bug 63103 - nrwt: move http locking code into manager
Summary: nrwt: move http locking code into manager
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC All
: P2 Normal
Assignee: Dirk Pranke
URL:
Keywords:
Depends on:
Blocks: 63112
  Show dependency treegraph
 
Reported: 2011-06-21 16:13 PDT by Dirk Pranke
Modified: 2011-06-23 17:33 PDT (History)
9 users (show)

See Also:


Attachments
Patch (6.46 KB, patch)
2011-06-21 17:16 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
change location of start/stop code to mesh with subsequent sharding better (6.83 KB, patch)
2011-06-21 19:45 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
remove dangling stop (6.69 KB, patch)
2011-06-21 20:05 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dirk Pranke 2011-06-21 16:13:43 PDT
nrwt: move http locking code into manager
Comment 1 Dirk Pranke 2011-06-21 17:16:04 PDT
Created attachment 98085 [details]
Patch
Comment 2 Dirk Pranke 2011-06-21 17:49:58 PDT
This patch is the first of a (shortish) series of patches that will allow us to get better sharding and throughput of the http tests.

A while ago we added the ability to new-run-webkit-tests to only run the http tests if we held a machine-wide file lock, so that we could run multiple NRWT's on a box at a time without needing to worry about resource conflicts.

That change was fine, but the way it was implemented depended on a single worker thread getting all of the http tests, grabbing the lock, and then running the tests. This had the advantage that if some other process held the lock, NRWT could execute at least some tests in other threads.

But, it had a couple of downsides. First, any failures related to starting and stopping the web server tended to get lost in the --verbose noise and the output from the other tests executing. Second, it was separate from all of the other configuration checking and status updates (which are done in the worker). Third, and by far the most important, it made it impossible to run the http tests in multiple shards amongst multiple workers.

So, I am attempting to unwind the change slightly and implement it differently. The startup/shutdown logic will be done in the manager, and we will distribute the http tests into some (fixable) number of shards and hand them to workers, so that we can better control the load on the web server (e.g., we might want to have 12 child processes on a big Mac Pro, but only ever be doing 4 concurrent http test).

It is unclear to me if anyone is actively running multiple NRWTs in parallel on a single machine, and if it is important that we get real multiplexing, as opposed to just correct semi-serialization. The HTTP tests when run in a single thread are easily the slowest shard and the long pole in an NRWT run, so I imagine that the lock is held for most if not all of the run, anyway.

So, this patch just surrounds the entire test run with the lock.

Subsequent patches will then split the tests into multiple shards, and modify the logic so that we drop the lock when all of the http-requiring shards have completed.

If some users are actively running multiple test runs in parallel on the same machine, and it turns out that more finely-grained contention is really useful, we can change the logic in the master so that it does a non-blocking trylock on the file, and runs non-http tests until it can actually get the lock.

I've copied at least a few people who might want to run multiple concurrent test runs. Can any of you say how important it is in practice right now to be able to get more finely-grained locking as I've described above?
Comment 3 Dirk Pranke 2011-06-21 19:45:38 PDT
Created attachment 98104 [details]
change location of start/stop code to mesh with subsequent sharding better
Comment 4 Dirk Pranke 2011-06-21 20:05:27 PDT
Created attachment 98106 [details]
remove dangling stop
Comment 5 Adam Barth 2011-06-21 22:23:08 PDT
Comment on attachment 98106 [details]
remove dangling stop

This patch caused trouble for the cr-linux-ews.
Comment 6 Dirk Pranke 2011-06-21 22:34:27 PDT
(In reply to comment #5)
> (From update of attachment 98106 [details])
> This patch caused trouble for the cr-linux-ews.

Can you go into any more specifics? Log files, etc.? 

Is there some other way I can get this reviewed that won't cause problems for the ews bot? Although, this patch shouldn't cause trouble for any bots, so it would be good to see the errors ...
Comment 7 Adam Barth 2011-06-22 10:13:44 PDT
> Can you go into any more specifics? Log files, etc.? 
> 
> Is there some other way I can get this reviewed that won't cause problems for the ews bot? Although, this patch shouldn't cause trouble for any bots, so it would be good to see the errors ...

I'm not 100% clear whether this patch caused the problems, but the bots were spinning because the HTTP port was in use.  Now that I read the email you send to the gardening list, I realize this might be related to that patch.

In any case, I've rebooted the bots and this patch now has a green bubble, so perhaps everything is fine.  Sorry for the distraction.
Comment 8 Tony Chang 2011-06-23 16:00:23 PDT
Comment on attachment 98106 [details]
remove dangling stop

Didn't we have unittests for these methods?  Maybe they were lower level.

_run_tests is a long function.  It would be nice to break it into smaller pieces some day . . .
Comment 9 Dirk Pranke 2011-06-23 17:06:11 PDT
(In reply to comment #8)
> (From update of attachment 98106 [details])
> Didn't we have unittests for these methods?  Maybe they were lower level.
> 
> _run_tests is a long function.  It would be nice to break it into smaller pieces some day . . .

We don't have unittests per se for these methods, but they are pretty well covered by the tests in run_webkit_tests_integrationtest.py, and we have tests for the locking and start/stop code now.

I agree that _run_tests is too long now, but that'll be separate patches to fix.
Comment 10 Dirk Pranke 2011-06-23 17:33:49 PDT
Comment on attachment 98106 [details]
remove dangling stop

Clearing flags on attachment: 98106

Committed r89642: <http://trac.webkit.org/changeset/89642>
Comment 11 Dirk Pranke 2011-06-23 17:33:54 PDT
All reviewed patches have been landed.  Closing bug.