WebKit Bugzilla
RESOLVED DUPLICATE of bug 225803
Bug 220794 - run-jsc-stress-tests doesn't handle dead remotes in detectFailures
https://bugs.webkit.org/show_bug.cgi?id=220794
Angelos Oikonomopoulos
Reported
2021-01-21 06:36:47 PST
When a remote board goes away while run-jsc-stress-tests is running, the --gnu-parallel-runner reschedules the tests properly, but detectFailures can fail in a number of ways:

- if the board is down when detectFailures runs, it'll fail the whole test run after getting a connection error
- if the board has come up again, there's no guarantee that the failure files are still there. In fact, the mips boards will recreate the R/W filesystem if fsck detects any errors on boot, which means that all the machinery in the remoteDirectory isn't there anymore.

One way to handle this case would be to also restart jobs for which we weren't able to get the PASS/FAIL status. Perhaps by including the fetch in the command invocation, so that GNU parallel will transparently handle this for us -- guess this means we need to move away from detectFailures on --gnu-parallel-runner.

Note that detectFailures is fundamentally flawed in any case: it should be actively confirming that the job finished successfully, not relying on the absence of a 'failure' file.
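To illustrate the last point, here is a minimal sketch in Python (run-jsc-stress-tests itself is a Ruby script, so this is not its code). It assumes a hypothetical `fetch_status` callback that returns an explicit "PASS" or "FAIL" marker for a test, returns None when no marker exists, and raises an OSError (for example a ConnectionError) when the remote is unreachable. A test only counts as passed when a positive marker was actually retrieved; everything else is reported as unknown instead of being silently treated as a success.

```python
from enum import Enum


class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNKNOWN = "UNKNOWN"  # no explicit marker could be retrieved


def classify(tests, fetch_status):
    """Classify each test from an explicit status marker.

    fetch_status is a hypothetical callback (an assumption of this
    sketch): it returns "PASS", "FAIL", or None, and may raise OSError
    when the remote board cannot be reached.
    """
    results = {}
    for test in tests:
        try:
            marker = fetch_status(test)
        except OSError:
            # Dead remote: we know nothing about this test, so do not
            # abort the whole run and do not assume success.
            marker = None
        if marker == "PASS":
            results[test] = Status.PASS
        elif marker == "FAIL":
            results[test] = Status.FAIL
        else:
            # A missing marker (e.g. the filesystem was recreated on
            # boot) is not evidence of a pass.
            results[test] = Status.UNKNOWN
    return results
```

Tests that end up as `Status.UNKNOWN` are the candidates for rescheduling, which matches the suggestion above to restart jobs whose PASS/FAIL status could not be obtained.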
Radar WebKit Bug Importer
Comment 1
2021-01-28 06:37:13 PST
<rdar://problem/73706786>
Angelos Oikonomopoulos
Comment 2
2021-05-24 01:56:13 PDT
(In reply to Angelos Oikonomopoulos from comment #0)
> When a remote board goes away while run-jsc-stress tests is running, the
> --gnu-parallel-runner reschedules the tests properly, but detectFailures can
> fail in a number of ways:
>
> - if the board is down when detectFailures runs, it'll fail the whole test
> run after getting a connection error
> - if the board has come up again, there's no guarantee that the failure
> files are still there. In fact, the mips boards will recreate the R/W
> filesystem if fsck detects any errors on boot, which means that all the
> machinery in the remoteDirectory isn't there anymore.
>
> One way to handle this case would be to also restart jobs for which we
> weren't able to get the PASS/FAIL status. Perhaps by including the fetch in
> the command invocation, so that GNU parallel will transparently handle this
> for us -- guess this means we need to move away from detectFailures on
> --gnu-parallel-runner.
>
> Note that detectFailures is fundamentally flawed in any case: it should be
> actively confirming that the job finished successfully, not relying on the
> absence of a 'failure' file.
This has been partly taken care of in https://bugs.webkit.org/show_bug.cgi?id=222601 (don't be so optimistic in detecting failures). Building on this, https://bugs.webkit.org/show_bug.cgi?id=225803 implements a retry loop within run-jsc-stress-tests. Closing this bug in favor of 225803.
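For readers unfamiliar with the idea, a retry loop of this general kind can be sketched as follows. This is a Python illustration under stated assumptions, not the code from bug 225803: `run_batch` is a hypothetical callback that runs the given tests and returns a dict containing only the tests for which an explicit "PASS" or "FAIL" status was retrieved, and `max_attempts` is an invented parameter.

```python
def run_with_retries(tests, run_batch, max_attempts=3):
    """Re-run tests until each has a confirmed status or attempts run out.

    run_batch (hypothetical, an assumption of this sketch) takes a list
    of tests and returns {test: "PASS" | "FAIL"} for the tests whose
    status could be confirmed; tests lost to a dead remote are absent.
    Returns (confirmed_results, tests_still_unconfirmed).
    """
    pending = list(tests)
    results = {}
    for _ in range(max_attempts):
        if not pending:
            break
        statuses = run_batch(pending)
        still_pending = []
        for test in pending:
            status = statuses.get(test)
            if status in ("PASS", "FAIL"):
                results[test] = status
            else:
                # No confirmed status: schedule the test again.
                still_pending.append(test)
        pending = still_pending
    return results, pending
```

Returning the unconfirmed tests separately lets the caller decide whether they fail the run, instead of treating a missing status as a pass.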
Angelos Oikonomopoulos
Comment 3
2021-05-24 01:56:43 PDT
*** This bug has been marked as a duplicate of bug 225803 ***