Bug 220794 - run-jsc-stress-tests doesn't handle dead remotes in detectFailures
Status: RESOLVED DUPLICATE of bug 225803
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2021-01-21 06:36 PST by Angelos Oikonomopoulos
Modified: 2021-05-24 01:56 PDT

See Also:


Description Angelos Oikonomopoulos 2021-01-21 06:36:47 PST
When a remote board goes away while run-jsc-stress-tests is running, the --gnu-parallel-runner reschedules the tests properly, but detectFailures can still fail in a number of ways:

- if the board is down when detectFailures runs, it'll hit a connection error and fail the whole test run
- if the board has come back up, there's no guarantee that the failure files are still there. In fact, the MIPS boards will recreate the R/W filesystem if fsck detects any errors on boot, which means that all the machinery in the remoteDirectory is gone.

One way to handle this case would be to also restart jobs for which we weren't able to get the PASS/FAIL status, perhaps by including the fetch in the command invocation so that GNU parallel will transparently handle this for us. I guess this means we need to move away from detectFailures when using --gnu-parallel-runner.
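
To illustrate the "fetch in the command invocation" idea, here is a rough Ruby sketch. It is not code from run-jsc-stress-tests: the helper name, the per-test `.sh` wrapper, the `.status` file and the ssh/scp layout are all assumptions made for the example. It also assumes the wrapper records PASS/FAIL in the status file and exits 0 once it has done so, so that a non-zero exit of the combined command only signals an infrastructure problem.

```ruby
require "shellwords"

# Hypothetical helper (not part of run-jsc-stress-tests): builds one shell
# command that runs a test on the remote and then fetches its status file.
# If the remote dies at either step, the combined command exits non-zero.
# Assumed layout: <remote_dir>/<test>.sh writes <remote_dir>/<test>.status.
def run_and_fetch_command(host, remote_dir, test_name, local_status_dir)
  remote_status = File.join(remote_dir, "#{test_name}.status")
  local_status = File.join(local_status_dir, "#{test_name}.status")
  # The remote command is passed to ssh as a single, escaped argument.
  remote_cmd = "cd #{remote_dir.shellescape} && sh #{test_name.shellescape}.sh"
  run = ["ssh", host, remote_cmd]
  fetch = ["scp", "#{host}:#{remote_status}", local_status]
  # Only fetch if the run step itself succeeded.
  "#{run.shelljoin} && #{fetch.shelljoin}"
end

puts run_and_fetch_command("board", "/tmp/jsc", "test_1", "out")
```

With a command of this shape, a lost connection during either the run or the fetch makes the job itself fail, so GNU parallel could reschedule it (for example through its --retries option) without a separate detectFailures pass.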

Note that detectFailures is fundamentally flawed in any case: it should be actively confirming that the job finished successfully, not relying on the absence of a 'failure' file.
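
As a sketch of what "actively confirming" could look like (hypothetical names, not the existing detectFailures code): treat a job as passed only if a pass was explicitly reported, and put everything without a report into a separate "unknown" bucket instead of assuming success.

```ruby
require "set"

# Hypothetical tri-state classification. A job counts as :pass only when a
# pass was explicitly reported; a job with no report at all is :unknown
# rather than being silently treated as a success.
def classify_jobs(all_jobs, reported_passes, reported_failures)
  passes = Set.new(reported_passes)
  failures = Set.new(reported_failures)
  result = { pass: [], fail: [], unknown: [] }
  all_jobs.each do |job|
    if failures.include?(job)
      result[:fail] << job      # an explicit failure report wins
    elsif passes.include?(job)
      result[:pass] << job      # explicit confirmation of success
    else
      result[:unknown] << job   # no report: status was lost or never written
    end
  end
  result
end

p classify_jobs(%w[a b c], ["a"], ["b"])
```

The jobs in the :unknown bucket are the ones that would need to be rerun after a board went away.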
Comment 1 Radar WebKit Bug Importer 2021-01-28 06:37:13 PST
<rdar://problem/73706786>
Comment 2 Angelos Oikonomopoulos 2021-05-24 01:56:13 PDT
(In reply to Angelos Oikonomopoulos from comment #0)
This has been partly taken care of in https://bugs.webkit.org/show_bug.cgi?id=222601 (don't be so optimistic in detecting failures). Building on this, https://bugs.webkit.org/show_bug.cgi?id=225803 implements a retry loop within run-jsc-stress-tests. Closing this bug in favor of 225803.
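
For readers unfamiliar with the approach, the general shape of such a retry loop can be sketched as follows. This is an illustration only, not the patch from bug 225803: the method name, the max_attempts parameter and the block contract (it receives the pending job names and returns a hash of job => status for the jobs whose status could actually be fetched) are assumptions.

```ruby
# Hypothetical retry loop: keep rerunning the jobs whose status is still
# unknown until every job has a confirmed status or attempts run out.
# The block stands in for "run these jobs and fetch whatever statuses exist".
def run_with_retries(jobs, max_attempts: 3)
  statuses = {}
  pending = jobs.dup
  attempts = 0
  while !pending.empty? && attempts < max_attempts
    attempts += 1
    reported = yield(pending)
    # Record only statuses for jobs that were actually scheduled this round.
    reported.each do |job, status|
      statuses[job] = status if pending.include?(job)
    end
    pending = pending.reject { |job| statuses.key?(job) }
  end
  # Second element lists jobs that never produced a status.
  [statuses, pending]
end
```

A caller would treat any jobs left in the second return value as infrastructure failures rather than as passes.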
Comment 3 Angelos Oikonomopoulos 2021-05-24 01:56:43 PDT

*** This bug has been marked as a duplicate of bug 225803 ***