220794 – run-jsc-stress-tests doesn't handle dead remotes in detectFailures

RESOLVED DUPLICATE of bug 225803

https://bugs.webkit.org/show_bug.cgi?id=220794

Summary run-jsc-stress-tests doesn't handle dead remotes in detectFailures

Angelos Oikonomopoulos

Reported 2021-01-21 06:36:47 PST

When a remote board goes away while run-jsc-stress-tests is running, the --gnu-parallel-runner reschedules the tests properly, but detectFailures can fail in a number of ways:

- If the board is down when detectFailures runs, it fails the whole test run after getting a connection error.
- If the board has come back up, there is no guarantee that the failure files are still there. In fact, the mips boards recreate the R/W filesystem if fsck detects any errors on boot, which means that all the machinery in the remoteDirectory is gone.

One way to handle this case would be to also restart jobs for which we weren't able to get the PASS/FAIL status, perhaps by including the fetch in the command invocation, so that GNU parallel transparently handles this for us. I guess this means we need to move away from detectFailures on --gnu-parallel-runner.

Note that detectFailures is fundamentally flawed in any case: it should be actively confirming that the job finished successfully, not relying on the absence of a 'failure' file.
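To illustrate the last point, here is a minimal Python sketch of status detection that requires positive confirmation. It is not the run-jsc-stress-tests code (which is a Ruby script); the `.pass`/`.fail` marker file names, the `Status` enum and the `classify` helper are all hypothetical and exist only for this example. The key idea is a third state: a missing marker, or a failed fetch from the remote, means the status is unknown instead of being read as a pass.

```python
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"  # no evidence either way; the job should be rerun


def classify(test_name, remote_files):
    """Classify one test from the file names fetched from the remote.

    remote_files is a set of file names, or None if the fetch itself
    failed (for example, the board was down). The ".pass" and ".fail"
    marker names are assumptions made for this sketch.
    """
    if remote_files is None:
        return Status.UNKNOWN
    if f"{test_name}.fail" in remote_files:
        return Status.FAIL
    if f"{test_name}.pass" in remote_files:
        return Status.PASS
    # Neither marker exists (e.g. the filesystem was recreated on boot):
    # the absence of a failure file is not treated as success.
    return Status.UNKNOWN
```

With this shape, a wiped remoteDirectory yields `Status.UNKNOWN` for every test, so the caller can reschedule those jobs instead of reporting them as passed.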


Radar WebKit Bug Importer

Comment 1 2021-01-28 06:37:13 PST

<rdar://problem/73706786>

Angelos Oikonomopoulos

Comment 2 2021-05-24 01:56:13 PDT

(In reply to Angelos Oikonomopoulos from comment #0)
> When a remote board goes away while run-jsc-stress tests is running, the
> --gnu-parallel-runner reschedules the tests properly, but detectFailures can
> fail in a number of ways:
>
> - if the board is down when detectFailures runs, it'll fail the whole test
> run after getting a connection error
> - if the board has come up again, there's no guarantee that the failure
> files are still there. In fact, the mips boards will recreate the R/W
> filesystem if fsck detects any errors on boot, which means that all the
> machinery in the remoteDirectory isn't there anymore.
>
> One way to handle this case would be to also restart jobs for which we
> weren't able to get the PASS/FAIL status. Perhaps by including the fetch in
> the command invocation, so that GNU parallel will transparently handle this
> for us -- guess this means we need to move away from detectFailures on
> --gnu-parallel-runner.
>
> Note that detectFailures is fundamentally flawed in any case: it should be
> actively confirming that the job finished successfully, not relying on the
> absence of a 'failure' file.

This has been partly taken care of in https://bugs.webkit.org/show_bug.cgi?id=222601 (don't be so optimistic in detecting failures).

Building on this, https://bugs.webkit.org/show_bug.cgi?id=225803 implements a retry loop within run-jsc-stress-tests. Closing this bug in favor of 225803.
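For readers unfamiliar with the approach, the general shape of such a retry loop can be sketched in Python as follows. This is only an illustration of the idea, not the code from bug 225803: `run_with_retries`, the `run_batch` callback and `max_rounds` are hypothetical names, and the real script is written in Ruby.

```python
def run_with_retries(tests, run_batch, max_rounds=3):
    """Rerun tests whose status could not be determined.

    run_batch(batch) is a hypothetical callback that runs the given
    tests and returns a dict mapping test name to True (pass) or
    False (fail). A test missing from the dict has an unknown status,
    for instance because the remote died before it could report.
    Returns (results, unresolved).
    """
    results = {}
    pending = list(tests)
    for _ in range(max_rounds):
        if not pending:
            break
        outcome = run_batch(pending)
        for name in pending:
            if name in outcome:
                results[name] = outcome[name]
        # Only tests without a confirmed PASS/FAIL are retried.
        pending = [name for name in pending if name not in results]
    return results, pending
```

Tests still left in the second return value after `max_rounds` attempts never produced a confirmed status, so a caller would presumably report them as errors and not as passes.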

Angelos Oikonomopoulos

Comment 3 2021-05-24 01:56:43 PDT

*** This bug has been marked as a duplicate of bug 225803 ***

Status RESOLVED

Resolution DUPLICATE

of bug 225803

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component JavaScriptCore

Assignee

Nobody

Reported

2021-01-21 06:36 PST

Modified

2021-05-24 01:56 PDT

CC List

4 users

URL

Keywords InRadar

Depends on

Blocks