Bug 220794

Summary: run-jsc-stress-tests doesn't handle dead remotes in detectFailures
Product: WebKit
Reporter: Angelos Oikonomopoulos <angelos>
Component: JavaScriptCore
Assignee: Nobody <webkit-unassigned>
Status: RESOLVED DUPLICATE
Severity: Normal
CC: aakash_jain, angelos, clopez, webkit-bug-importer
Priority: P2
Keywords: InRadar
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified

Description Angelos Oikonomopoulos 2021-01-21 06:36:47 PST
When a remote board goes away while run-jsc-stress-tests is running, the --gnu-parallel-runner reschedules the tests properly, but detectFailures can fail in a number of ways:

- if the board is down when detectFailures runs, it'll fail the whole test run after getting a connection error
- if the board has come up again, there's no guarantee that the failure files are still there. In fact, the mips boards will recreate the R/W filesystem if fsck detects any errors on boot, which means that all the machinery in the remoteDirectory isn't there anymore.

One way to handle this case would be to also restart jobs for which we weren't able to get the PASS/FAIL status, perhaps by including the fetch in the command invocation so that GNU parallel transparently handles this for us. I guess this means we'd need to move away from detectFailures when using --gnu-parallel-runner.

Note that detectFailures is fundamentally flawed in any case: it should be actively confirming that the job finished successfully, not relying on the absence of a 'failure' file.
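
To illustrate the point about active confirmation, here is a minimal sketch in Python (not the Ruby code of run-jsc-stress-tests, and not what any patch actually does). It assumes a hypothetical scheme in which each job leaves either a pass marker or a failure marker on the remote; `classify`, `needs_rerun` and the marker sets are made-up names for illustration. A test with neither marker is treated as unknown rather than as passed, so it can be rescheduled.

```python
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"  # no confirmation either way; candidate for a rerun


def classify(test_names, pass_markers, fail_markers):
    """Map each test name to a Status.

    pass_markers / fail_markers are sets of test names for which a
    (hypothetical) marker file was fetched from the remote directory.
    """
    result = {}
    for name in test_names:
        if name in fail_markers:
            # An explicit failure marker always wins.
            result[name] = Status.FAIL
        elif name in pass_markers:
            # Only an explicit pass marker counts as success.
            result[name] = Status.PASS
        else:
            # Absence of a failure marker proves nothing: the board may
            # have gone down or recreated its filesystem.
            result[name] = Status.UNKNOWN
    return result


def needs_rerun(statuses):
    """Return the sorted names of tests whose outcome was never confirmed."""
    return sorted(name for name, status in statuses.items()
                  if status is Status.UNKNOWN)
```

With this shape, a board that lost its filesystem produces only UNKNOWN entries, which are rerun instead of being silently counted as passes.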
Comment 1 Radar WebKit Bug Importer 2021-01-28 06:37:13 PST
<rdar://problem/73706786>
Comment 2 Angelos Oikonomopoulos 2021-05-24 01:56:13 PDT
(In reply to Angelos Oikonomopoulos from comment #0)

This has been partly taken care of in https://bugs.webkit.org/show_bug.cgi?id=222601 (don't be so optimistic in detecting failures). Building on this, https://bugs.webkit.org/show_bug.cgi?id=225803 implements a retry loop within run-jsc-stress-tests. Closing this bug in favor of 225803.
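
For readers unfamiliar with the idea, a retry loop of this kind can be sketched roughly as follows. This is an illustrative Python sketch only, not the implementation in bug 225803: `run_with_retries`, `run_batch` and `max_attempts` are hypothetical names, and `run_batch` is assumed to return a confirmed result only for the tests whose status could actually be fetched.

```python
def run_with_retries(tests, run_batch, max_attempts=3):
    """Run tests, retrying those whose outcome could not be confirmed.

    run_batch(tests) is a hypothetical callable returning a dict that maps
    a test name to True (confirmed pass) or False (confirmed fail). Tests
    whose status could not be retrieved are simply absent from the dict.
    Returns (confirmed, pending), where pending lists unconfirmed tests.
    """
    confirmed = {}
    pending = list(tests)
    for _ in range(max_attempts):
        if not pending:
            break
        try:
            results = run_batch(pending)
        except ConnectionError:
            # The remote is down: nothing was confirmed in this attempt.
            results = {}
        confirmed.update(results)
        # Keep only the tests that still lack a confirmed outcome.
        pending = [t for t in pending if t not in confirmed]
    return confirmed, pending
```

Anything still in `pending` after the last attempt has no confirmed status and should be reported as such rather than assumed to have passed.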
Comment 3 Angelos Oikonomopoulos 2021-05-24 01:56:43 PDT

*** This bug has been marked as a duplicate of bug 225803 ***