Bug 37738

Summary: new-run-webkit-tests low timeout is too sensitive to runaway processes or test crashes
Product: WebKit Reporter: Maciej Stachowiak <mjs>
Component: Tools / TestsAssignee: Dirk Pranke <dpranke>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, dpranke, eric, lforschler, ojan, tony
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: OS X 10.5   
Bug Depends on:    
Bug Blocks: 34984    
Attachments:
Description Flags
Patch tony: review+

Description Maciej Stachowiak 2010-04-16 17:30:37 PDT
new-run-webkit-tests has a much lower default timeout than run-webkit-tests. This makes it extremely sensitive to runaway processes, since anything going on in the background will make a bunch of tests time out. One extra bad factor is that crash logging for a crashed test (ReportCrash) can cause a system to get slow enough to time out a bunch of tests in a row, or even time out a DumpRenderTree process (whereupon it is restarted from scratch).

Thus, the low timeout can actually make overall test time longer instead of shorter.

I suggest matching the old timeout initially, and changing it over time once we solve these problems.
Comment 1 Eric Seidel (no email) 2010-04-16 17:45:21 PDT
We need bot data to confirm or deny how much of a problem this is in practice.  Low timeouts are a good thing as they allow us to run tests which expect to TIMEOUT every time w/ minimal delay on total test time.  It's certainly possible to detect load and increase timeouts for tests which are not expected to timeout when run under load.
Comment 2 Adam Barth 2010-04-17 01:18:03 PDT
Why not run tests with a TIMEOUT expectation with a lower timeout?
Comment 3 Ojan Vafai 2010-04-17 08:45:35 PDT
(In reply to comment #2)
> Why not run tests with a TIMEOUT expectation with a lower timeout?

I prefer that to the current model actually. I don't much like that we need to mark things as SLOW. It's confusing and complicated (mea culpa). The downside, which is relatively minor, is that the test may start passing with the regular timeout and we wouldn't know. The downside seems preferable to the confusion of the current model, namely, people have a lot of trouble understanding when they should mark something TIMEOUT vs SLOW.

Also, is there a way we can kill ReportCrash from NRWT? When I run the tests locally, I frequently will "sudo killall ReportCrash" in a loop. I'd much rather have NRWT do that for me. The sudo is the problem of course.
Comment 4 Maciej Stachowiak 2010-04-19 00:20:50 PDT
(In reply to comment #1)
> We need bot data to confirm or deny how much of a problem this is in practice. 
> Low timeouts are a good thing as they allow us to run tests which expect to
> TIMEOUT every time w/ minimal delay on total test time.  It's certainly
> possible to detect load and increase timeouts for tests which are not expected
> to timeout when run under load.

It's definitely a problem on my computer. I've gotten the failure cascade multiple times. I've also heard of other people experiencing it. It seems to me that this needs to be fixed whether or not we observe it on a buildbot.
Comment 5 Maciej Stachowiak 2010-04-19 01:55:42 PDT
(In reply to comment #3)
> (In reply to comment #2)
> > Why not run tests with a TIMEOUT expectation with a lower timeout?
> 
> I prefer that to the current model actually. I don't much like that we need to
> mark things as SLOW. It's confusing and complicated (mea culpa). The downside,
> which is relatively minor, is that the test may start passing with the regular
> timeout and we wouldn't know. The downside seems preferable to the confusion of
> the current model, namely, people have a lot of trouble understanding when they
> should mark something TIMEOUT vs SLOW.

Sounds like a neat idea. It would somewhat reduce the incentive to make your tests fast though. Maybe there could be a "soft" timeout that would result in a non-failure warning, to make slow tests more visible?

> Also, is there a way we can kill ReportCrash from NRWT? When I run the tests
> locally, I frequently will "sudo killall ReportCrash" in a loop. I'd much
> rather have NRWT do that for me. The sudo is the problem of course.

I don't think that would be a good thing to do. The crash logs are very useful for diagnosing what went wrong, especially on the buildbots.
Comment 6 Eric Seidel (no email) 2010-04-19 02:50:33 PDT
(In reply to comment #5)
> Sounds like a neat idea. It would somewhat reduce the incentive to make your
> tests fast though. Maybe there could be a "soft" timeout that would result in a
> non-failure warning, to make slow tests more visible?

All timeouts are currently "soft" in that all failures are re-run at the end.  Tests which timed out once due to machine load are unlikely to time out twice.

> > Also, is there a way we can kill ReportCrash from NRWT? When I run the tests
> > locally, I frequently will "sudo killall ReportCrash" in a loop. I'd much
> > rather have NRWT do that for me. The sudo is the problem of course.
> 
> I don't think that would be a good thing to do. The crash logs are very useful
> for diagnosing what went wrong, especially on the buildbots.

I've long wanted such an option to run-webkit-tests.  It's easy enough to do. Just requires adding a flag to DumpRenderTree to have it catch the necessary signals and exit(1) instead of letting the OS catch them for us and calling ReportCrash.  I don't think we'd want this mode on by default, but it would be useful.

Even if there was just an easy way to renice ReportCrash a few times that would make its behavior way easier to deal with.
Comment 7 Ojan Vafai 2010-04-19 07:35:56 PDT
(In reply to comment #6)
> (In reply to comment #5)
> > Sounds like a neat idea. It would somewhat reduce the incentive to make your
> > tests fast though. Maybe there could be a "soft" timeout that would result in a
> > non-failure warning, to make slow tests more visible?

This seems like a great idea. If we could take it one step further and show the number of slow tests on the waterfall, that might provide incentive to fix them. What I like about this is that we can be more flexible about the timeout. The current 6 second timeout (12 seconds on debug) is just what happened to seem reasonable given the data from the Chromium bots. 

We can start with something conservative (6 seconds?) and make it more aggressive as we fix the slower tests.

> All timeouts are currently "soft" in that all failures are re-run at the end. 
> Tests which timed out once due to machine load are unlikely to time out twice.

This is half true. If ReportCrash is the problem and your patch causes crashes, then crashes will happen when you retry and tests might timeout if you don't have sufficient extra cores on your machine.

> > > Also, is there a way we can kill ReportCrash from NRWT? When I run the tests
> > > locally, I frequently will "sudo killall ReportCrash" in a loop. I'd much
> > > rather have NRWT do that for me. The sudo is the problem of course.
> > 
> > I don't think that would be a good thing to do. The crash logs are very useful
> > for diagnosing what went wrong, especially on the buildbots.
> 
> I've long wanted such an option to run-webkit-tests.  It's easy enough to do.
> Just requires adding a flag to DumpRenderTree to have it catch the necessary
> signals and exit(1) instead of letting the OS catch them for us and calling
> ReportCrash.  I don't think we'd want this mode on by default, but it would be
> useful.

Filed bug 37797.
Comment 8 Dirk Pranke 2011-04-01 16:18:34 PDT
Okay, running against current time of tree (one year later), we have a grand total of *two* tests that take longer than 6 seconds on Mac Snow Leopard. I've marked them as SLOW in the test_expectations file.

I suggest that the defaults of 6 seconds for regular tests and 30 seconds for slow tests is good enough. We can always adjust this if it turns out to be a problem in practice.

For reference old-run-webkit-tests appears to default to 35 seconds and of course doesn't have a SLOW concept.

Maciej, what do you think?
Comment 9 Dirk Pranke 2011-04-04 15:47:24 PDT
Okay, it looks like the mac port of NRWT seems much less happy with the default of six seconds. I'm going to reopen this and change the default to be 35 seconds to match ORWT. 

We can gradually pull the time down as necessary in the future.
Comment 10 Dirk Pranke 2011-04-04 16:35:00 PDT
Created attachment 88154 [details]
Patch
Comment 11 Tony Chang 2011-04-04 18:14:56 PDT
Comment on attachment 88154 [details]
Patch

This doesn't impact chromium-mac, right?
Comment 12 Dirk Pranke 2011-04-04 18:36:02 PDT
(In reply to comment #11)
> (From update of attachment 88154 [details])
> This doesn't impact chromium-mac, right?

Correct.
Comment 13 Dirk Pranke 2011-04-06 18:47:40 PDT
Committed r83130: <http://trac.webkit.org/changeset/83130>