Bug 37738

Summary:

new-run-webkit-tests low timeout is too sensitive to runaway processes or test crashes

Product:

WebKit

Reporter:

Maciej Stachowiak <mjs>

Component:

Tools / Tests

Assignee:

Dirk Pranke <dpranke>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

abarth, dpranke, eric, lforschler, ojan, tony

Priority:

Version:

528+ (Nightly build)

Hardware:

OS:

OS X 10.5

Bug Depends on:

Bug Blocks:

34984

Attachments:

Description	Flags
Patch	tony: review+

Maciej Stachowiak

Reported 2010-04-16 17:30:37 PDT

new-run-webkit-tests has a much lower default timeout than run-webkit-tests. This makes it extremely sensitive to runaway processes, since anything going on in the background will make a bunch of tests time out. One extra bad factor is that crash logging for a crashed test (ReportCrash) can cause a system to get slow enough to time out a bunch of tests in a row, or even time out a DumpRenderTree process (whereupon it is restarted from scratch). Thus, the low timeout can actually make overall test time longer instead of shorter. I suggest matching the old timeout initially, and changing it over time once we solve these problems.

Attachments
Patch (1.37 KB, patch) 2011-04-04 16:35 PDT, Dirk Pranke	tony: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Eric Seidel (no email)

Comment 1 2010-04-16 17:45:21 PDT

We need bot data to confirm or deny how much of a problem this is in practice. Low timeouts are a good thing as they allow us to run tests which expect to TIMEOUT every time w/ minimal delay on total test time. It's certainly possible to detect load and increase timeouts for tests which are not expected to timeout when run under load.

Adam Barth

Comment 2 2010-04-17 01:18:03 PDT

Why not run tests with a TIMEOUT expectation with a lower timeout?

Ojan Vafai

Comment 3 2010-04-17 08:45:35 PDT

(In reply to comment #2) > Why not run tests with a TIMEOUT expectation with a lower timeout? I prefer that to the current model actually. I don't much like that we need to mark things as SLOW. It's confusing and complicated (mea culpa). The downside, which is relatively minor, is that the test may start passing with the regular timeout and we wouldn't know. The downside seems preferable to the confusion of the current model, namely, people have a lot of trouble understanding when they should mark something TIMEOUT vs SLOW. Also, is there a way we can kill ReportCrash from NRWT? When I run the tests locally, I frequently will "sudo killall ReportCrash" in a loop. I'd much rather have NRWT do that for me. The sudo is the problem of course.

Maciej Stachowiak

Comment 4 2010-04-19 00:20:50 PDT

(In reply to comment #1) > We need bot data to confirm or deny how much of a problem this is in practice. > Low timeouts are a good thing as they allow us to run tests which expect to > TIMEOUT every time w/ minimal delay on total test time. It's certainly > possible to detect load and increase timeouts for tests which are not expected > to timeout when run under load. It's definitely a problem on my computer. I've gotten the failure cascade multiple times. I've also heard of other people experiencing it. It seems to me that this needs to be fixed whether or not we observe it on a buildbot.

Maciej Stachowiak

Comment 5 2010-04-19 01:55:42 PDT

(In reply to comment #3) > (In reply to comment #2) > > Why not run tests with a TIMEOUT expectation with a lower timeout? > > I prefer that to the current model actually. I don't much like that we need to > mark things as SLOW. It's confusing and complicated (mea culpa). The downside, > which is relatively minor, is that the test may start passing with the regular > timeout and we wouldn't know. The downside seems preferable to the confusion of > the current model, namely, people have a lot of trouble understanding when they > should mark something TIMEOUT vs SLOW. Sounds like a neat idea. It would somewhat reduce the incentive to make your tests fast though. Maybe there could be a "soft" timeout that would result in a non-failure warning, to make slow tests more visible? > Also, is there a way we can kill ReportCrash from NRWT? When I run the tests > locally, I frequently will "sudo killall ReportCrash" in a loop. I'd much > rather have NRWT do that for me. The sudo is the problem of course. I don't think that would be a good thing to do. The crash logs are very useful for diagnosing what went wrong, especially on the buildbots.

Eric Seidel (no email)

Comment 6 2010-04-19 02:50:33 PDT

(In reply to comment #5) > Sounds like a neat idea. It would somewhat reduce the incentive to make your > tests fast though. Maybe there could be a "soft" timeout that would result in a > non-failure warning, to make slow tests more visible? All timeouts are currently "soft" in that all failures are re-run at the end. Tests which timed out once due to machine load are unlikely to time out twice. > > Also, is there a way we can kill ReportCrash from NRWT? When I run the tests > > locally, I frequently will "sudo killall ReportCrash" in a loop. I'd much > > rather have NRWT do that for me. The sudo is the problem of course. > > I don't think that would be a good thing to do. The crash logs are very useful > for diagnosing what went wrong, especially on the buildbots. I've long wanted such an option to run-webkit-tests. It's easy enough to do. Just requires adding a flag to DumpRenderTree to have it catch the necessary signals and exit(1) instead of letting the OS catch them for us and calling ReportCrash. I don't think we'd want this mode on by default, but it would be useful. Even if there was just an easy way to renice ReportCrash a few times that would make its behavior way easier to deal with.

Ojan Vafai

Comment 7 2010-04-19 07:35:56 PDT

(In reply to comment #6) > (In reply to comment #5) > > Sounds like a neat idea. It would somewhat reduce the incentive to make your > > tests fast though. Maybe there could be a "soft" timeout that would result in a > > non-failure warning, to make slow tests more visible? This seems like a great idea. If we could take it one step further and show the number of slow tests on the waterfall, that might provide incentive to fix them. What I like about this is that we can be more flexible about the timeout. The current 6 second timeout (12 seconds on debug) is just what happened to seem reasonable given the data from the Chromium bots. We can start with something conservative (6 seconds?) and make it more aggressive as we fix the slower tests. > All timeouts are currently "soft" in that all failures are re-run at the end. > Tests which timed out once due to machine load are unlikely to time out twice. This is half true. If ReportCrash is the problem and your patch causes crashes, then crashes will happen when you retry and tests might timeout if you don't have sufficient extra cores on your machine. > > > Also, is there a way we can kill ReportCrash from NRWT? When I run the tests > > > locally, I frequently will "sudo killall ReportCrash" in a loop. I'd much > > > rather have NRWT do that for me. The sudo is the problem of course. > > > > I don't think that would be a good thing to do. The crash logs are very useful > > for diagnosing what went wrong, especially on the buildbots. > > I've long wanted such an option to run-webkit-tests. It's easy enough to do. > Just requires adding a flag to DumpRenderTree to have it catch the necessary > signals and exit(1) instead of letting the OS catch them for us and calling > ReportCrash. I don't think we'd want this mode on by default, but it would be > useful. Filed bug 37797.

Dirk Pranke

Comment 8 2011-04-01 16:18:34 PDT

Okay, running against current time of tree (one year later), we have a grand total of *two* tests that take longer than 6 seconds on Mac Snow Leopard. I've marked them as SLOW in the test_expectations file. I suggest that the defaults of 6 seconds for regular tests and 30 seconds for slow tests is good enough. We can always adjust this if it turns out to be a problem in practice. For reference old-run-webkit-tests appears to default to 35 seconds and of course doesn't have a SLOW concept. Maciej, what do you think?

Dirk Pranke

Comment 9 2011-04-04 15:47:24 PDT

Okay, it looks like the mac port of NRWT seems much less happy with the default of six seconds. I'm going to reopen this and change the default to be 35 seconds to match ORWT. We can gradually pull the time down as necessary in the future.

Dirk Pranke

Comment 10 2011-04-04 16:35:00 PDT

Created attachment 88154 [details] Patch

Tony Chang

Comment 11 2011-04-04 18:14:56 PDT

Comment on attachment 88154 [details] Patch This doesn't impact chromium-mac, right?

Dirk Pranke

Comment 12 2011-04-04 18:36:02 PDT

(In reply to comment #11) > (From update of attachment 88154 [details]) > This doesn't impact chromium-mac, right? Correct.

Dirk Pranke

Comment 13 2011-04-06 18:47:40 PDT

Committed r83130: <http://trac.webkit.org/changeset/83130>

Note You need to log in before you can comment on or make changes to this bug.