WebKit Bugzilla
RESOLVED FIXED
203858
EWS should retry build in case of kill-old-processes failure
https://bugs.webkit.org/show_bug.cgi?id=203858
Aakash Jain
Reported
2019-11-05 11:27:53 PST
Sometimes a bot gets into a bad state and kill-old-processes starts failing. When that happens, the bot processes (and fails) each build very quickly. Because of this, all the pending builds end up failing. For example, two weeks ago ews121 went into a bad state and many builds failed, e.g.:
https://ews-build.webkit.org/#/builders/24/builds/2671
https://ews-build.webkit.org/#/builders/24/builds/2672
https://ews-build.webkit.org/#/builders/24/builds/2673
https://ews-build.webkit.org/#/builders/24/builds/2674
https://ews-build.webkit.org/#/builders/24/builds/2675
https://ews-build.webkit.org/#/builders/24/builds/2677
https://ews-build.webkit.org/#/builders/24/builds/2680
https://ews-build.webkit.org/#/builders/24/builds/2684
https://ews-build.webkit.org/#/builders/24/builds/2690
https://ews-build.webkit.org/#/builders/24/builds/2691
https://ews-build.webkit.org/#/builders/24/builds/2693
https://ews-build.webkit.org/#/builders/24/builds/2694
https://ews-build.webkit.org/#/builders/24/builds/2697
We should retry the build in case of a kill-old-processes failure, so that the bot does not burn through all the pending builds. The build will keep retrying until a different bot picks it up. This will make EWS robust against this kind of infrastructure failure.
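The proposal above can be illustrated with a minimal, self-contained sketch. It does not use Buildbot and is not the attached patch: `Result`, `build_result`, `run_queue`, the round-robin scheduling, the `max_attempts` cap, and the bot name `healthy-bot` are all assumptions made for illustration. It only shows the intended effect: an infrastructure failure yields a retry instead of a failure, so the build is requeued until a healthy bot picks it up.

```python
from collections import deque
from enum import Enum


class Result(Enum):
    # Hypothetical result codes for this sketch (not Buildbot's constants).
    SUCCESS = "success"
    FAILURE = "failure"
    RETRY = "retry"


def build_result(kill_old_processes_ok: bool, compile_ok: bool) -> Result:
    """Decide the overall build result.

    A kill-old-processes failure is treated as an infrastructure problem and
    yields RETRY; a genuine compile failure still yields FAILURE.
    """
    if not kill_old_processes_ok:
        return Result.RETRY
    return Result.SUCCESS if compile_ok else Result.FAILURE


def run_queue(builds, bots, max_attempts=10):
    """Hand pending builds to bots round-robin, requeueing any RETRY result.

    `bots` maps a bot name to whether kill-old-processes succeeds on it.
    Returns {build: (final_result, bot_that_finished_it)}.
    """
    pending = deque((build, 0) for build in builds)
    bot_names = list(bots)
    finished = {}
    turn = 0
    while pending:
        build, attempts = pending.popleft()
        bot = bot_names[turn % len(bot_names)]
        turn += 1
        result = build_result(bots[bot], compile_ok=True)
        if result is Result.RETRY and attempts + 1 < max_attempts:
            # Requeue instead of failing, so another bot can pick it up.
            pending.append((build, attempts + 1))
        else:
            finished[build] = (result, bot)
    return finished


if __name__ == "__main__":
    # ews121 is the bad bot from the report; "healthy-bot" is made up.
    outcome = run_queue([2671, 2672, 2673], {"ews121": False, "healthy-bot": True})
    for build, (result, bot) in sorted(outcome.items()):
        print(build, result.value, bot)
```

Without the RETRY branch, every build handed to the bad bot would simply be recorded as failed, which is the behavior described in the report.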
Attachments
Patch (2.33 KB, patch), 2019-11-05 11:33 PST, Aakash Jain, no flags
Aakash Jain
Comment 1
2019-11-05 11:33:54 PST
Created attachment 382837: Patch
Aakash Jain
Comment 2
2019-11-05 11:34:57 PST
Sample run:
https://ews-build.webkit-uat.org/#/builders/3/builds/227
Jonathan Bedard
Comment 3
2019-11-06 14:23:07 PST
...if kill-old-processes fails, we should force reboot the machine. Do we have any evidence that a retry will actually help?
Aakash Jain
Comment 4
2019-11-10 05:29:40 PST
It just happened again in https://ews-build.webkit.org/#/builders/24/builds/4562 and https://ews-build.webkit.org/#/builders/24/builds/4566
> ...if kill-old-processes fails, we should force reboot the machine.
Yes, rebooting the machine is a better idea. However, it will take me a while to implement and test that. Meanwhile, can we land this (maybe with a FIXME), since this is clearly an improvement?
> Do we have any evidence that a retry will actually help?
Yes, we have already seen it work many times for RETRY on checkout failure (https://trac.webkit.org/changeset/247364/webkit). For example, when bot igalia1-gtk-wk2-ews ran out of space in https://ews-build.webkit.org/#/builders/4/builds/6684, instead of simply failing, the build was retried, picked up by a different bot, and passed in https://ews-build.webkit.org/#/builders/4/builds/6685
Jonathan Bedard
Comment 5
2019-11-11 07:41:21 PST
(In reply to Aakash Jain from comment #4)
> It just happened again in
> https://ews-build.webkit.org/#/builders/24/builds/4562 and
> https://ews-build.webkit.org/#/builders/24/builds/4566
>
> > ...if kill-old-processes fails, we should force reboot the machine.
>
> Yes, rebooting the machine is a better idea. However, it will take me a
> while to implement and test that. Meanwhile, can we land this (maybe with a
> FIXME), since this is clearly an improvement?
You've convinced me this is an improvement often enough to be worth landing, although I remain somewhat skeptical of our ability to trust machines that fail kill-old-processes. I'm actually not sure that rebooting takes much testing or additional code. I don't think we need to put effort into being delicate: I don't see a world where machines fail kill-old-processes frequently enough to find themselves in a crash loop. If we're really worried about that, we can just refuse to reboot unless a bot has been up longer than some amount of time (I'd say an hour, but that's sort of arbitrary).
> > ...
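The uptime guard suggested in the comment above could look roughly like the following sketch. Nothing here comes from the EWS code: the function name, the returned action strings, and the one-hour default (taken from the "sort of arbitrary" figure in the comment) are assumptions, and how uptime is measured or how a reboot is triggered is deliberately left out.

```python
# Hypothetical threshold: the comment suggests "an hour" as an arbitrary choice.
MIN_UPTIME_BEFORE_REBOOT_SECONDS = 60 * 60


def action_after_kill_failure(
    uptime_seconds: float,
    min_uptime_seconds: float = MIN_UPTIME_BEFORE_REBOOT_SECONDS,
) -> str:
    """Choose what to do after kill-old-processes fails on a bot.

    Reboot only if the bot has been up longer than the threshold, so a bot
    that fails right after booting cannot end up in a reboot loop. In both
    cases the build itself is retried rather than failed.
    """
    if uptime_seconds > min_uptime_seconds:
        return "reboot-and-retry"
    return "retry-only"


if __name__ == "__main__":
    print(action_after_kill_failure(2 * 60 * 60))  # bot up for two hours
    print(action_after_kill_failure(5 * 60))       # bot up for five minutes
```

The guard trades a possibly unnecessary retry on a freshly booted bot for protection against repeated reboots.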
WebKit Commit Bot
Comment 6
2019-11-11 08:23:33 PST
Comment on attachment 382837: Patch
Clearing flags on attachment: 382837
Committed r252324: <https://trac.webkit.org/changeset/252324>
WebKit Commit Bot
Comment 7
2019-11-11 08:23:34 PST
All reviewed patches have been landed. Closing bug.
Radar WebKit Bug Importer
Comment 8
2019-11-11 16:57:21 PST
<rdar://problem/57099318>
Aakash Jain
Comment 9
2020-02-20 06:03:42 PST
This change seems to be working fine. A few examples where it helped:
https://ews-build.webkit.org/#/builders/22/builds/10002
https://ews-build.webkit.org/#/builders/3/builds/16167
https://ews-build.webkit.org/#/builders/3/builds/16230
https://ews-build.webkit.org/#/builders/9/builds/17233