238256 – EWS queue gets stuck when patch causes tests to hang

NEW 238256

https://bugs.webkit.org/show_bug.cgi?id=238256

Summary EWS queue gets stuck when patch causes tests to hang

Angelos Oikonomopoulos

Reported 2022-03-23 06:32:16 PDT

Starting with https://ews-build.webkit.org/#/builders/46/builds/21295, patch https://bugs.webkit.org/attachment.cgi?id=455076&action=prettypatch apparently caused enough of the JSC tests to hang that make(1) was stuck waiting for its jobs to finish. Since no process was producing any output any more, this resulted in the whole thing getting killed by buildbot:

command timed out: 1200 seconds without output running ['perl', 'Tools/Scripts/run-javascriptcore-tests', '--no-build', '--no-fail-fast', '--json-output=jsc_results.json', '--release', '--memory-limited', '--verbose', '--jsc-only', '--treat-failing-as-flaky=0.6,10,200'], attempting to kill

Even worse, this resulted in a RETRY (which failed in the same way, resulted in another RETRY and so on), causing the patch to get tested over and over for days.

Ideas:

1. Only handle this specific issue via means of `/usr/bin/timeout` for each test so that make doesn't get stuck. Clearly, that only addresses this specific cause, not the failure mode.

2. Somehow keep state and only allow a limited number of retries (perhaps just one?). If the tests without the patch consistently return results but the ones with the patch don't, then it's a good guess that this is not a transient infrastructure issue but a problem caused by the patch itself. The above patch would be an example of that, but any patch making changes to `run-jsc-stress-tests` could result in such a failure mode. That said, I've skimmed the docs and it doesn't look like buildbot offers a simple way to keep state between builds.

3. Have an explicit (and cheap!) checkThatTheInfrastructureWorks step and use it judiciously. If a test fails without even producing results, run checkThatTheInfrastructureWorks. If it fails, RETRY. If not, declare the patch a failure.

Can come up with more schemes but those seem like the cheapest ones.
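For illustration only, here is a minimal Python sketch of how the three ideas could fit together. Every name in it (`run_with_timeout`, `decide`, `Verdict`, `max_retries`) is hypothetical and not part of buildbot or the WebKit tooling. The per-test timeout uses Python's `subprocess` module as a stand-in for `/usr/bin/timeout`, and the choice to report a failure once the retry budget is used up is an assumption, not something the report settles.

```python
import subprocess
import sys
from enum import Enum


class Verdict(Enum):
    """Hypothetical outcomes for one EWS run of a patch."""
    PASS = "pass"
    FAILURE = "failure"
    RETRY = "retry"


def run_with_timeout(cmd, seconds):
    """Idea 1: run a single test command so that a hang cannot block the caller.

    Returns the exit code, or None if the command ran longer than `seconds`.
    subprocess.run() kills and reaps the direct child on timeout; any
    grandchildren the test spawned are not handled here.
    """
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None


def decide(produced_results, tests_passed, infrastructure_ok,
           retries_so_far, max_retries=1):
    """Ideas 2 and 3: pick a verdict for a finished (or killed) test run.

    `infrastructure_ok` is a zero-argument callable standing in for the
    proposed checkThatTheInfrastructureWorks step. It is only invoked when
    the run produced no results at all.
    """
    if produced_results:
        return Verdict.PASS if tests_passed else Verdict.FAILURE

    # No results at all: find out whether the infrastructure is to blame.
    if infrastructure_ok():
        # Infrastructure is healthy, so the patch itself is the likely cause.
        return Verdict.FAILURE

    # Infrastructure looks broken: retry, but only a bounded number of times.
    if retries_so_far < max_retries:
        return Verdict.RETRY

    # Assumption: once the retry budget is spent, stop looping and report.
    return Verdict.FAILURE
```

With this shape, a patch that makes the tests hang on healthy infrastructure is reported as a failure on the first run, and a broken worker can trigger at most `max_retries` retries. The sketch leaves open where `retries_so_far` would be stored, which is the state-keeping problem mentioned in idea 2.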

Radar WebKit Bug Importer

Comment 1 2022-03-24 16:07:20 PDT

<rdar://problem/90799559>

Angelos Oikonomopoulos

Comment 2 2022-05-15 06:47:36 PDT

There's currently a similar issue, though the cause may be different. Specifically, starting with https://ews-build.webkit.org/#/builders/46/builds/22930, PR https://github.com/WebKit/WebKit/pull/617 (commit https://github.com/WebKit/WebKit/pull/617/commits/b59d0400104cb703d6a6b3fea5eb378d8fd1a76a) keeps resulting in a RETRY, which causes another RETRY, etc. Other patches seem to go through the same EWS queue (https://ews-build.webkit.org/#/builders/46) just fine.

Contrary to the report in the bug description, in this case the error message we get is:

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ]

However, I can't find any errors in the log of the buildbot worker or in the system logs and the memory usage on the host is barely visible in the plot. So I don't currently have an explanation for how the failure actually happens in this case.

Status NEW

Resolution

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component Tools / Tests

Assignee Nobody

Reported 2022-03-23 06:32 PDT

Modified 2022-05-15 06:47 PDT

CC List 2 users

URL

Keywords InRadar

Depends on

Blocks