Bug 98562 - nrwt: [chromium] run http tests in parallel on bigger machines
Summary: nrwt: [chromium] run http tests in parallel on bigger machines
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Dirk Pranke
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-10-05 15:21 PDT by Dirk Pranke
Modified: 2012-10-09 12:31 PDT
7 users

See Also:


Attachments
Patch (8.39 KB, patch)
2012-10-05 15:31 PDT, Dirk Pranke
eric: review+

Description Dirk Pranke 2012-10-05 15:21:22 PDT
nrwt: [chromium] run http tests in parallel on bigger machines
Comment 1 Dirk Pranke 2012-10-05 15:31:29 PDT
Created attachment 167396 [details]
Patch
Comment 2 Dirk Pranke 2012-10-05 15:37:24 PDT
Specifically, this should only affect the Mac 10.7 bot (which has 8 workers, so we'll run 2 http tests in parallel) and the Linux dbg bot (which has 24 workers, so 6 http tests in parallel -> the debug bot is a beast).
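
(For reference, the sizing rule in the patch boils down to one locked shard per four workers, with a minimum of one. A minimal standalone sketch of that arithmetic — the real method lives on the port object and reads default_child_processes(), so the names here are illustrative:)

    def max_locked_shards(worker_count):
        # One "locked" shard (http tests, etc.) per four workers, minimum of one.
        return max(1, worker_count // 4)

    max_locked_shards(8)   # -> 2  (Mac 10.7 bot)
    max_locked_shards(24)  # -> 6  (Linux dbg bot)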
Comment 3 Dirk Pranke 2012-10-05 15:38:01 PDT
If we see any increased flakiness on those bots, we can back the change out; otherwise, we can look at adding more workers to the other debug bots.
Comment 4 Tony Chang 2012-10-05 15:47:01 PDT
A couple of weeks ago, I tried setting --max-locked-shards=2 on my Z620 and found that it made the perf tests flaky, since they run in the same shard.

Do you see any flakiness in the perf tests on your machine?
Comment 5 Dirk Pranke 2012-10-05 15:58:21 PDT
I am running some tests to see. 

In reality, I think we need to separate the concept of "test needs the lock" from "test should not be run while other tests are running", and we should strive not to have any tests fall into the latter category (although some might be necessary for periods of time). Otherwise we'll always have bottlenecks in the cycle time.
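
(One way that split could look — purely a sketch with hypothetical names, not anything that exists in webkitpy: make "needs the server lock" and "must run alone because it's load-sensitive" two separate properties that the scheduler can treat differently:)

    NEEDS_SERVER_LOCK = 'needs_server_lock'   # e.g. http/ and websocket/ tests
    RUN_IN_ISOLATION = 'run_in_isolation'     # e.g. load-sensitive perf/ tests

    def scheduling_constraint(test_name):
        # Hypothetical classifier; currently everything below is lumped under "locked".
        if test_name.startswith(('http/', 'websocket/')):
            return NEEDS_SERVER_LOCK
        if test_name.startswith('perf/'):
            return RUN_IN_ISOLATION
        return None

With that split, locked shards could still run a few at a time, while isolated shards would run one at a time with no other workers active.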
Comment 6 Vincent Scheib 2012-10-05 16:21:26 PDT
Comment on attachment 167396 [details]
Patch

Dirk made me!
Comment 7 Vincent Scheib 2012-10-05 16:21:58 PDT
Comment on attachment 167396 [details]
Patch

Test complete.
Comment 8 Dirk Pranke 2012-10-05 16:23:25 PDT
(we were testing if a non-reviewer can r+ bugs ...)
Comment 9 Dirk Pranke 2012-10-05 16:37:09 PDT
So, ironically, on Linux on my z600 in a release build, I see flakiness in http and storage, but not perf (and I always get flakiness in http and storage). I don't see any flakiness on the Mac at all (again in Release).
Comment 10 Eric Seidel (no email) 2012-10-08 11:50:58 PDT
Comment on attachment 167396 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=167396&action=review

LGTM.

> Tools/Scripts/webkitpy/layout_tests/port/chromium.py:120
> +    def default_max_locked_shards(self):
> +        """Return the number of "locked" shards to run in parallel (like the http tests)."""
> +        max_locked_shards = int(self.default_child_processes()) / 4
> +        if not max_locked_shards:
> +            return 1
> +        return max_locked_shards

I assume the plan is to move this logic down once tested in the field?
Comment 11 Dirk Pranke 2012-10-08 11:53:25 PDT
(In reply to comment #10)
> (From update of attachment 167396 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=167396&action=review
> 
> LGTM.
> 
> > Tools/Scripts/webkitpy/layout_tests/port/chromium.py:120
> > +    def default_max_locked_shards(self):
> > +        """Return the number of "locked" shards to run in parallel (like the http tests)."""
> > +        max_locked_shards = int(self.default_child_processes()) / 4
> > +        if not max_locked_shards:
> > +            return 1
> > +        return max_locked_shards
> 
> I assume the plan is to move this logic down once tested in the field?

If by "down" you mean into base.py and "tested in the field" you mean "works stably on other ports", then yes :).
Comment 12 Dirk Pranke 2012-10-08 15:06:00 PDT
Committed r130690: <http://trac.webkit.org/changeset/130690>
Comment 13 Ojan Vafai 2012-10-08 16:47:48 PDT
Looks like a couple perf tests started consistently failing after this.

http://test-results.appspot.com/dashboards/flakiness_dashboard.html#tests=%5Eperf
Comment 14 Dirk Pranke 2012-10-08 17:46:50 PDT
(In reply to comment #13)
> Looks like a couple perf tests started consistently failing after this.
> 
> http://test-results.appspot.com/dashboards/flakiness_dashboard.html#tests=%5Eperf

I'm not sure exactly what you're seeing here, so could you confirm you're seeing what I'm seeing? Namely, it looks like maybe perf/mouse-event started failing consistently on Linux (dbg), but it appears to be failing the same way that it failed on Linux and Linux 32, so maybe it should be marked as SLOW.

In the other cases, it just looks to me like the tests have been flaky, period, and are mostly marked as PASS FAIL. 

So, I'm not sure why we're even running these?
Comment 15 Ojan Vafai 2012-10-08 18:18:41 PDT
(In reply to comment #14)
> (In reply to comment #13)
> > Looks like a couple perf tests started consistently failing after this.
> > 
> > http://test-results.appspot.com/dashboards/flakiness_dashboard.html#tests=%5Eperf
> 
> I'm not sure exactly what you're seeing here, so could you confirm you're seeing what I'm seeing? Namely, it looks like maybe perf/mouse-event started failing consistently on Linux (dbg), but it appears to be failing the same way that it failed on Linux and Linux 32, so maybe it should be marked as SLOW.

They're not timing out, so marking them slow won't help. perf/mouse-event.html was consistently passing before this patch. The way these tests work is to run the operation at different magnitudes and compare the timings to see whether the growth is constant, linear, polynomial, etc. So, if they don't have a whole core to themselves, they're likely to be more flaky.
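
(To illustrate — a toy sketch of the idea, not the actual LayoutTests/perf harness: time the operation at increasing magnitudes and compare how the measurements grow, which is exactly the comparison that gets noisy when the test has to share a core:)

    import time

    def measure(run_operation, magnitude):
        # Time a single run of the operation at the given input size.
        start = time.time()
        run_operation(magnitude)
        return time.time() - start

    def looks_linear(run_operation, magnitudes=(1000, 2000, 4000)):
        # If doubling the magnitude roughly doubles the time, call it linear.
        times = [measure(run_operation, m) for m in magnitudes]
        ratios = [t2 / max(t1, 1e-9) for t1, t2 in zip(times, times[1:])]
        return all(1.5 < r < 2.5 for r in ratios)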

> In the other cases, it just looks to me like the tests have been flaky, period, and are mostly marked as PASS FAIL. 
> 
> So, I'm not sure why we're even running these?

It's a little confusing because I recently committed a couple patches to reduce flakiness. So, you can see that some of them haven't been flaky for the last couple dozen runs (e.g. perf/typing-at-end-of-line.html). I've been waiting for it to stabilize before removing things from TestExpectations.

In general, I'm not sure how I feel about the LayoutTests/perf tests. Some of them are never flaky, but ~30% or so are always flaky. They do give us a way to add regression tests when making order-of-magnitude performance improvements.
Comment 16 Dirk Pranke 2012-10-08 18:48:12 PDT
(In reply to comment #15)
> (In reply to comment #14)
> > (In reply to comment #13)
> > > Looks like a couple perf tests started consistently failing after this.
> > > 
> > > http://test-results.appspot.com/dashboards/flakiness_dashboard.html#tests=%5Eperf
> > 
> > I'm not sure exactly what you're seeing here, so could you confirm you're seeing what I'm seeing? Namely, it looks like maybe perf/mouse-event started failing consistently on Linux (dbg), but it appears to be failing the same way that it failed on Linux and Linux 32, so maybe it should be marked as SLOW.
> 
> They're not timing out, so marking them slow won't help. perf/mouse-event.html was consistently passing before this patch. 

Good point. Okay, so there weren't any other tests that started failing or became flakier? It only looked like the one to me.

> The way these tests work is to run at different magnitudes and try to compare to see if they're constant, linear, polynomial, etc. So, if they don't have a whole core to themselves, they're likely to be more flaky.
> 

Even without this change, assuming these tests have a dedicated core is dangerous. As I said in comment #5, I would rather approach this problem by splitting tests into "tests that are load-sensitive" vs. "tests that need the server lock" and try to solve the problem that way. It would not be hard to implement at least a coarse "run these tests by themselves" mechanism, and I think it would be generally useful.
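
(A coarse version might not need much machinery — e.g., something along these lines, purely a sketch with made-up names: partition out the load-sensitive tests and run them serially after the parallel pass, so they never share the machine with other workers:)

    def run_all(tests, run_in_parallel, run_serially, is_load_sensitive):
        # Hypothetical driver: full parallelism for normal tests, then the
        # load-sensitive ones one at a time with no other workers active.
        isolated = [t for t in tests if is_load_sensitive(t)]
        shared = [t for t in tests if not is_load_sensitive(t)]
        results = run_in_parallel(shared)
        results.update(run_serially(isolated))
        return results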

Do you think we should revert this change, or suppress the new failure in the meantime?
Comment 17 Dirk Pranke 2012-10-09 12:24:37 PDT
(In reply to comment #16)
> (In reply to comment #15)
> > (In reply to comment #14)
> > > (In reply to comment #13)
> > > > Looks like a couple perf tests started consistently failing after this.
> > > > 
> > > > http://test-results.appspot.com/dashboards/flakiness_dashboard.html#tests=%5Eperf
> > > 
> > > I'm not sure exactly what you're seeing here, so could you confirm you're seeing what I'm seeing? Namely, it looks like maybe perf/mouse-event started failing consistently on Linux (dbg), but it appears to be failing the same way that it failed on Linux and Linux 32, so maybe it should be marked as SLOW.
> > 
> > They're not timing out, so marking them slow won't help. perf/mouse-event.html was consistently passing before this patch. 

As an aside, we normally skip the perf tests in debug, but this one was getting run due to an overriding expectation of flakiness on Release that wasn't properly scoped. 

I've fixed the expectation in http://trac.webkit.org/changeset/130793 , but that doesn't change the general nature of the problem.
Comment 18 Ojan Vafai 2012-10-09 12:26:36 PDT
I didn't notice that this was a debug bot. Never mind; we can ignore it. I should probably update TestExpectations.
Comment 19 Dirk Pranke 2012-10-09 12:31:33 PDT
(In reply to comment #18)
> I didn't notice that this was a debug bot. Nevermind. We can ignore it. I should probably update TestExpectations.

I did already. See above :).