Bug 203858 - EWS should retry build in case of kill-old-processes failure
Summary: EWS should retry build in case of kill-old-processes failure
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: Other
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Aakash Jain
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2019-11-05 11:27 PST by Aakash Jain
Modified: 2020-02-20 08:06 PST
CC List: 5 users

See Also:


Attachments
Patch (2.33 KB, patch)
2019-11-05 11:33 PST, Aakash Jain
no flags

Description Aakash Jain 2019-11-05 11:27:53 PST
Sometimes a bot gets into a bad state and kill-old-processes starts failing. When that happens, the bot processes (and fails) each build very quickly, so all the pending builds end up failing.

For example, two weeks ago ews121 went into a bad state and many builds failed, e.g.:

https://ews-build.webkit.org/#/builders/24/builds/2671
https://ews-build.webkit.org/#/builders/24/builds/2672
https://ews-build.webkit.org/#/builders/24/builds/2673
https://ews-build.webkit.org/#/builders/24/builds/2674
https://ews-build.webkit.org/#/builders/24/builds/2675
https://ews-build.webkit.org/#/builders/24/builds/2677
https://ews-build.webkit.org/#/builders/24/builds/2680
https://ews-build.webkit.org/#/builders/24/builds/2684
https://ews-build.webkit.org/#/builders/24/builds/2690
https://ews-build.webkit.org/#/builders/24/builds/2691
https://ews-build.webkit.org/#/builders/24/builds/2693
https://ews-build.webkit.org/#/builders/24/builds/2694
https://ews-build.webkit.org/#/builders/24/builds/2697

We should retry the build when kill-old-processes fails, so that the bad bot does not burn through all the pending builds. The build will keep retrying until a different bot picks it up. This will make EWS robust against this kind of infrastructure failure.
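
To illustrate the proposal above, here is a minimal, standalone sketch of the retry decision. It is not the attached patch. The `Result` values mirror Buildbot's result codes but are defined locally so the sketch runs without Buildbot installed; `evaluate_kill_old_processes`, `run_build`, and the bot name `other-bot` are hypothetical names used only for this illustration.

```python
from enum import IntEnum


class Result(IntEnum):
    # Local stand-ins mirroring Buildbot's result codes (assumption:
    # defined here only so the sketch is self-contained).
    SUCCESS = 0
    FAILURE = 2
    EXCEPTION = 4
    RETRY = 5


def evaluate_kill_old_processes(step_result):
    """Turn a failed kill-old-processes step into a build-level RETRY."""
    if step_result in (Result.FAILURE, Result.EXCEPTION):
        return Result.RETRY
    return step_result


def run_build(pickups):
    """Simulate one build being picked up by a sequence of bots.

    `pickups` is a list of (bot_name, kill_old_processes_result) pairs in
    the order the bots pick up the build. Returns the bot that actually
    proceeds with the build and the number of attempts it took, or
    (None, attempts) if every pickup ended in a retry.
    """
    attempts = 0
    for bot_name, kill_result in pickups:
        attempts += 1
        if evaluate_kill_old_processes(kill_result) == Result.RETRY:
            continue  # build goes back to the queue instead of failing
        return bot_name, attempts
    return None, attempts


if __name__ == "__main__":
    # The bad bot may pick the build up more than once before another bot does.
    pickups = [
        ("ews121", Result.FAILURE),
        ("ews121", Result.FAILURE),
        ("other-bot", Result.SUCCESS),
    ]
    print(run_build(pickups))
```

In this sketch the build is never marked as failed by the bad bot; it simply returns to the queue until a bot whose kill-old-processes step succeeds picks it up.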
Comment 1 Aakash Jain 2019-11-05 11:33:54 PST
Created attachment 382837
Patch
Comment 2 Aakash Jain 2019-11-05 11:34:57 PST
Sample run: https://ews-build.webkit-uat.org/#/builders/3/builds/227
Comment 3 Jonathan Bedard 2019-11-06 14:23:07 PST
....if kill-old-processes fails, we should force reboot the machine. Do we have any evidence that a retry will actually help?
Comment 4 Aakash Jain 2019-11-10 05:29:40 PST
It just happened again in https://ews-build.webkit.org/#/builders/24/builds/4562 and https://ews-build.webkit.org/#/builders/24/builds/4566

> ....if kill-old-processes fails, we should force reboot the machine.
Yes, rebooting the machine is a better idea. However, it will take me a while to implement and test that. Meanwhile, can we land this (maybe with a FIXME), since it is clearly an improvement?

> Do we have any evidence that a retry will actually help?
Yes, we have already seen it work many times for RETRY on checkout failure (https://trac.webkit.org/changeset/247364/webkit). For example, when bot igalia1-gtk-wk2-ews ran out of space in https://ews-build.webkit.org/#/builders/4/builds/6684, instead of simply failing, the build was retried, picked up by a different bot, and passed in https://ews-build.webkit.org/#/builders/4/builds/6685
Comment 5 Jonathan Bedard 2019-11-11 07:41:21 PST
(In reply to Aakash Jain from comment #4)
> It just happened again in
> https://ews-build.webkit.org/#/builders/24/builds/4562 and
> https://ews-build.webkit.org/#/builders/24/builds/4566
> 
> > ....if kill-old-processes fails, we should force reboot the machine.
> Yes, rebooting the machine is a better idea. However, it will take me a
> while to implement and test that. Meanwhile, can we land this (maybe with a
> FIXME), since it is clearly an improvement?

You've convinced me this is an improvement often enough to be worth landing, although I remain somewhat skeptical of our ability to trust machines which fail kill-old-processes.

I'm actually not sure that rebooting takes much testing or additional code. I don't think we need to put effort into being delicate: I don't see a world where machines fail kill-old-processes frequently enough to find themselves in a crash loop. If we're really worried about that, we can just refuse to reboot unless a bot has been up longer than some amount of time (I'd say an hour, but that's sort of arbitrary).

> 
> ...
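
The uptime guard suggested in this comment could look roughly like the sketch below. It is only an illustration: `should_reboot` and `MIN_UPTIME_SECONDS` are hypothetical names, the one-hour threshold is the admittedly arbitrary figure from the comment, and how the uptime is obtained or how the reboot is triggered is deliberately left out.

```python
# One hour, the arbitrary threshold floated in the comment above.
MIN_UPTIME_SECONDS = 60 * 60


def should_reboot(kill_old_processes_failed, uptime_seconds,
                  min_uptime=MIN_UPTIME_SECONDS):
    """Decide whether a bot should be force rebooted.

    Reboot only if kill-old-processes failed AND the bot has been up
    longer than `min_uptime`, so a bot that keeps failing right after
    boot does not end up in a reboot loop.
    """
    return bool(kill_old_processes_failed) and uptime_seconds > min_uptime


if __name__ == "__main__":
    print(should_reboot(True, 2 * 60 * 60))   # up two hours, step failed
    print(should_reboot(True, 10 * 60))       # rebooted ten minutes ago
    print(should_reboot(False, 2 * 60 * 60))  # step succeeded
```

A guard like this keeps the reboot logic simple while still bounding how often a misbehaving bot can restart itself.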
Comment 6 WebKit Commit Bot 2019-11-11 08:23:33 PST
Comment on attachment 382837
Patch

Clearing flags on attachment: 382837

Committed r252324: <https://trac.webkit.org/changeset/252324>
Comment 7 WebKit Commit Bot 2019-11-11 08:23:34 PST
All reviewed patches have been landed.  Closing bug.
Comment 8 Radar WebKit Bug Importer 2019-11-11 16:57:21 PST
<rdar://problem/57099318>