Bug 238256 - EWS queue gets stuck when patch causes tests to hang
Summary: EWS queue gets stuck when patch causes tests to hang
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2022-03-23 06:32 PDT by Angelos Oikonomopoulos
Modified: 2022-05-15 06:47 PDT

See Also:


Description Angelos Oikonomopoulos 2022-03-23 06:32:16 PDT
Starting with https://ews-build.webkit.org/#/builders/46/builds/21295, patch https://bugs.webkit.org/attachment.cgi?id=455076&action=prettypatch apparently caused enough of the JSC tests to hang that make(1) was stuck waiting for its jobs to finish. Since no process was producing any output any more, this resulted in the whole thing getting killed by buildbot:

command timed out: 1200 seconds without output running ['perl', 'Tools/Scripts/run-javascriptcore-tests', '--no-build', '--no-fail-fast', '--json-output=jsc_results.json', '--release', '--memory-limited', '--verbose', '--jsc-only', '--treat-failing-as-flaky=0.6,10,200'], attempting to kill

Even worse, this resulted in a RETRY (which failed in the same way, resulted in another RETRY and so on), causing the patch to get tested over and over for days.

Ideas:

1. Handle only this specific issue by running each test under `/usr/bin/timeout`, so that make doesn't get stuck. Clearly, that addresses only this particular cause, not the failure mode in general.

2. Somehow keep state and only allow a limited number of retries (perhaps just one?). If the tests without the patch consistently return results but the ones with the patch don't, then it's a good guess that this is not a transient infrastructure issue but a problem caused by the patch itself. The above patch would be an example of that, but any patch making changes to `run-jsc-stress-tests` could result in such a failure mode. That said, I've skimmed the docs and it doesn't look like buildbot offers a simple way to keep state between builds.

3. Have an explicit (and cheap!) checkThatTheInfrastructureWorks step and use it judiciously. If a test fails without even producing results, run checkThatTheInfrastructureWorks. If it fails, RETRY. If not, declare the patch a failure.
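
To make idea 1 concrete, here is a minimal sketch of a per-test time limit. It is written in Python purely for illustration and is not taken from the actual test scripts; the helper name `run_with_timeout` and the status strings are hypothetical. The point is only that a hung test turns into a reported timeout instead of silence, so the parent never waits indefinitely on a stuck job.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd with a wall-clock limit.

    Returns (status, exit_code), where status is "ok", "failed" or
    "timeout". Hypothetical helper, for illustration only.
    """
    try:
        proc = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the direct child and waits for it before
        # re-raising. Grandchildren are NOT killed, so a real harness
        # would likely need process groups on top of this.
        return ("timeout", None)
    return ("ok" if proc.returncode == 0 else "failed", proc.returncode)


if __name__ == "__main__":
    # A stand-in for a hanging test: sleeps far longer than the limit.
    hanging_test = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hanging_test, timeout_s=1))
```

As noted in idea 1, this only bounds the runtime of individual tests; it does nothing about the RETRY loop itself.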

I can come up with more schemes, but these seem like the cheapest ones.
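
The decision logic behind ideas 2 and 3 could look roughly like the sketch below. Everything in it is an assumption made for illustration: `Verdict`, `decide`, the `infra_ok` callable (standing in for checkThatTheInfrastructureWorks) and the `retries_so_far` counter are hypothetical names, not buildbot or EWS APIs, and where the retry count would actually be stored is exactly the open question from idea 2.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    RETRY = "retry"


def decide(produced_results, tests_passed, infra_ok, retries_so_far,
           max_retries=1):
    """Decide what to do with a finished (or killed) test run.

    infra_ok is a callable so that the infrastructure check only runs
    when a run ended without producing any results. All names here are
    hypothetical.
    """
    if produced_results:
        # Normal case: the test run reported something we can judge.
        return Verdict.PASS if tests_passed else Verdict.FAIL

    # No results at all (e.g. killed after 1200 seconds without output).
    if infra_ok():
        # Infrastructure looks healthy, so blame the patch (idea 3).
        return Verdict.FAIL

    if retries_so_far < max_retries:
        # Looks like an infrastructure problem; retry, but bounded (idea 2).
        return Verdict.RETRY

    # Assumption: once retries are exhausted, stop rather than loop for days.
    return Verdict.FAIL


if __name__ == "__main__":
    # A run that was killed without results on healthy infrastructure.
    print(decide(False, False, lambda: True, retries_so_far=0))
```

Whether an exhausted retry budget should be reported as a patch failure or as a separate "infrastructure" state is a design choice that this sketch does not try to settle.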
Comment 1 Radar WebKit Bug Importer 2022-03-24 16:07:20 PDT
<rdar://problem/90799559>
Comment 2 Angelos Oikonomopoulos 2022-05-15 06:47:36 PDT
There's currently a similar issue, though the cause may be different. Specifically, starting with https://ews-build.webkit.org/#/builders/46/builds/22930, PR https://github.com/WebKit/WebKit/pull/617 (commit https://github.com/WebKit/WebKit/pull/617/commits/b59d0400104cb703d6a6b3fea5eb378d8fd1a76a) keeps resulting in a RETRY, which causes another RETRY, etc.

Other patches seem to go through the same EWS queue (https://ews-build.webkit.org/#/builders/46) just fine.

Unlike the case in the bug description, the error message we get here is:

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]

However, I can't find any errors in the log of the buildbot worker or in the system logs and the memory usage on the host is barely visible in the plot. So I don't currently have an explanation for how the failure actually happens in this case.