Bug 82233 - Retry crashing tests serially at the end of NRWT on Apple Mac
Summary: Retry crashing tests serially at the end of NRWT on Apple Mac
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Mark Hahnenberg
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-03-26 12:18 PDT by Mark Hahnenberg
Modified: 2015-02-05 09:51 PST (History)
8 users (show)

See Also:


Attachments
Patch (1.71 KB, patch)
2012-03-26 13:05 PDT, Mark Hahnenberg
no flags Details | Formatted Diff | Diff
Patch (3.14 KB, patch)
2012-03-26 13:35 PDT, Mark Hahnenberg
dpranke: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Hahnenberg 2012-03-26 12:18:48 PDT
A few of the daemons/system services/frameworks on the Mac bots can't handle being pummeled by many threads simultaneously while we're running all the layout tests concurrently. This causes some of the tests to randomly fail occasionally when it isn't actually an issue with the test. If we were to run all of these potentially crashing tests serially at the end of NRWT (which we already do for non-crashing tests), this might alleviate the problem.
Comment 1 Mark Hahnenberg 2012-03-26 13:05:12 PDT
Created attachment 133873 [details]
Patch
Comment 2 Dirk Pranke 2012-03-26 13:19:42 PDT
Comment on attachment 133873 [details]
Patch

Note that the tests that are retried are still actually run concurrently, so this patch may not do what you think it should do. 

I'm sure I had a patch at one point that did run things serially, but I guess I never committed it; it would be easy to make that change and I can't really think of a good reason not to do it, so you could either add that to this or add a separate patch to do so, if you like, or you could file a bug and I could do it at some point.

All that aside, I'm a bit torn by this patch. I can see the argument in retrying the crashes to see if they're consistently crashing or just flaky under load; I don't really have a problem with that. 

However, I'm not sure that i think a flaky crash should be ignored (which is what would happen by default).

Is it hard (or impossible) to fix things so that DRT and WTR don't actually crash?

As an actual comment on the crash, I think I would prefer this to be a method on the port itself (like port.should_retry_crashes()), rather than a test against the port name.
Comment 3 Geoffrey Garen 2012-03-26 13:32:27 PDT
> Is it hard (or impossible) to fix things so that DRT and WTR don't actually crash?

Like the bug description says, the crash isn't in DRT or WTR (or WebKit, for that matter): It's a bug in certain "daemons/system services/frameworks on the Mac bots". Fixing those crashes it out of scope for the WebKit project.

> Note that the tests that are retried are still actually run concurrently, so this patch may not do what you think it should do. 

Mark, it sounds like you should write a follow-up patch to have retried tests run serially. (Possibly just change the concurrency setting before retrying?) Based on my testing a year ago, this would also fix flakiness in filesystem/database and animation tests, so it's probably appropriate for all ports.

> However, I'm not sure that i think a flaky crash should be ignored (which is what would happen by default).

What would you suggest instead?

We'd like a solution that can give an unambiguous "PASS" (green) or "FAIL" (red) result for a given test run.
Comment 4 Mark Hahnenberg 2012-03-26 13:35:35 PDT
Created attachment 133879 [details]
Patch
Comment 5 Dirk Pranke 2012-03-26 14:00:09 PDT
(In reply to comment #3)
> > Is it hard (or impossible) to fix things so that DRT and WTR don't actually crash?
> 
> Like the bug description says, the crash isn't in DRT or WTR (or WebKit, for that matter): It's a bug in certain "daemons/system services/frameworks on the Mac bots". Fixing those crashes it out of scope for the WebKit project.
> 

Well, for daemons and system services, presumably you could make DRT/WTR/WebKit resilient to crashes in other processes. For frameworks obviously that's harder; however, in either case I would be concerned that crashes represent potential security issues and shouldn't be ignored. 

Obviously, having tests crash all the time is a good way to make sure they aren't ignored, but I agree that randomly flaky tests don't help anything.

> > However, I'm not sure that i think a flaky crash should be ignored (which is what would happen by default).
> 
> What would you suggest instead?
> 

The alternative I was considering was that we could retry the test to see if it crashed twice, but still consider it a FAIL even if it only crashed once.

> We'd like a solution that can give an unambiguous "PASS" (green) or "FAIL" (red) result for a given test run.

Sure, wouldn't we all? :)
Comment 6 Dirk Pranke 2012-03-26 14:01:04 PDT
Comment on attachment 133879 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=133879&action=review

> Tools/Scripts/webkitpy/layout_tests/controllers/manager.py:901
> +        # On Apple Mac, we retry crashes due to https://bugs.webkit.org/show_bug.cgi?id=82233

this comment should move to mac.py now. Please fix that before landing ...
Comment 7 Mark Hahnenberg 2012-03-28 17:34:20 PDT
This was fixed in http://trac.webkit.org/changeset/112189.
Comment 8 Csaba Osztrogonác 2015-02-05 05:31:58 PST
Is it still needed? Apple buildbots run with --no-retry-failures for a while
without a regular flakiness. Should we still retry crashes for local users only?
Comment 9 Alexey Proskuryakov 2015-02-05 09:51:22 PST
> Apple buildbots run with --no-retry-failures for a while 

We still have retrying enabled on debug bots and on EWS.

But as far as retrying crashed tests goes, I do not understand why we do that. AFAIK we still report it as a hard failure when first try crashes, and second one doesn't.