Sometimes a bot gets into a bad state and kill-old-processes starts failing. When that happens, the bot processes (and fails) each build very quickly, so all the pending builds end up failing. For example, two weeks ago ews121 went into a bad state and many builds failed, e.g.:
https://ews-build.webkit.org/#/builders/24/builds/2671
https://ews-build.webkit.org/#/builders/24/builds/2672
https://ews-build.webkit.org/#/builders/24/builds/2673
https://ews-build.webkit.org/#/builders/24/builds/2674
https://ews-build.webkit.org/#/builders/24/builds/2675
https://ews-build.webkit.org/#/builders/24/builds/2677
https://ews-build.webkit.org/#/builders/24/builds/2680
https://ews-build.webkit.org/#/builders/24/builds/2684
https://ews-build.webkit.org/#/builders/24/builds/2690
https://ews-build.webkit.org/#/builders/24/builds/2691
https://ews-build.webkit.org/#/builders/24/builds/2693
https://ews-build.webkit.org/#/builders/24/builds/2694
https://ews-build.webkit.org/#/builders/24/builds/2697

We should retry the build when kill-old-processes fails, so that the bad bot does not burn through all the pending builds. The build will keep retrying until a different bot picks it up. This will make EWS robust against this kind of infrastructure failure.
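To illustrate the proposal, here is a minimal sketch in plain Python. It is a toy model, not the actual patch and not Buildbot code: the names `Result`, `run_build`, `drain_queue`, `bad-bot` and `good-bot` are hypothetical, bots are assumed to take builds in simple round-robin order, and `max_attempts` is an added guard so the simulation always terminates.

```python
from collections import deque
from enum import Enum


class Result(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    RETRY = "retry"


def run_build(bot_is_healthy, retry_on_kill_failure):
    """Model one build: kill-old-processes runs first, then the real work."""
    if not bot_is_healthy:
        # kill-old-processes failed: an infrastructure problem, not a patch problem.
        return Result.RETRY if retry_on_kill_failure else Result.FAILURE
    return Result.SUCCESS


def drain_queue(build_ids, bots, retry_on_kill_failure, max_attempts=10):
    """Hand builds to bots round-robin (an assumption); re-queue builds that end in RETRY."""
    pending = deque((build_id, 1) for build_id in build_ids)
    outcomes = {}
    turn = 0
    while pending:
        build_id, attempt = pending.popleft()
        bot_name, bot_is_healthy = bots[turn % len(bots)]
        turn += 1
        result = run_build(bot_is_healthy, retry_on_kill_failure)
        if result is Result.RETRY and attempt < max_attempts:
            # Put the build back in the queue so another bot can pick it up.
            pending.append((build_id, attempt + 1))
        else:
            outcomes[build_id] = (result, bot_name)
    return outcomes


# Hypothetical pool: one bot stuck in a bad state, one healthy bot.
BOTS = [("bad-bot", False), ("good-bot", True)]
without_retry = drain_queue([1, 2, 3, 4], BOTS, retry_on_kill_failure=False)
with_retry = drain_queue([1, 2, 3, 4], BOTS, retry_on_kill_failure=True)
```

In this model, every build the bad bot touches fails outright when retry is disabled, while with retry enabled those builds go back into the queue and eventually finish on the healthy bot.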
Created attachment 382837 [details] Patch
Sample run: https://ews-build.webkit-uat.org/#/builders/3/builds/227
....if kill-old-processes fails, we should force reboot the machine. Do we have any evidence that a retry will actually help?
It just happened again in https://ews-build.webkit.org/#/builders/24/builds/4562 and https://ews-build.webkit.org/#/builders/24/builds/4566

> ....if kill-old-processes fails, we should force reboot the machine.

Yes, rebooting the machine is a better idea. However, it will take me a while to implement and test that. Meanwhile, can we land this (maybe with a FIXME), since it is clearly an improvement?

> Do we have any evidence that a retry will actually help?

Yes, we have already seen it work many times for RETRY on checkout failure (https://trac.webkit.org/changeset/247364/webkit). For example, when bot igalia1-gtk-wk2-ews ran out of space in https://ews-build.webkit.org/#/builders/4/builds/6684, instead of simply failing, the build was retried, picked up by a different bot, and passed in https://ews-build.webkit.org/#/builders/4/builds/6685
(In reply to Aakash Jain from comment #4)
> It just happened again in
> https://ews-build.webkit.org/#/builders/24/builds/4562 and
> https://ews-build.webkit.org/#/builders/24/builds/4566
>
> > ....if kill-old-processes fails, we should force reboot the machine.
> Yes, rebooting the machine is a better idea. However, it will take me a
> while to implement and test that. Meanwhile can we land this (maybe with a
> FIXME), since this is clearly an improvement.

You've convinced me this is an improvement often enough to be worth landing, although I remain somewhat skeptical of our ability to trust machines which fail kill-old-processes.

I'm actually not sure that rebooting takes much testing or additional code. I don't think we need to put effort into being delicate. I don't see a world where machines fail kill-old-processes frequently enough to find themselves in a crash loop, and if we're really worried about that, we can just refuse to reboot unless a bot has been up longer than some amount of time (I'd say an hour, but that's sort of arbitrary).

> > ...
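The uptime guard suggested above could look roughly like the following. This is only a sketch under stated assumptions: `should_reboot` and `MIN_UPTIME_BEFORE_REBOOT_SECONDS` are hypothetical names, the one-hour threshold is the arbitrary value mentioned in the comment, and how the bot's uptime is measured and how the reboot is actually triggered are left out.

```python
# One hour, as floated in the comment above; the value is admittedly arbitrary.
MIN_UPTIME_BEFORE_REBOOT_SECONDS = 60 * 60


def should_reboot(kill_old_processes_failed, uptime_seconds,
                  min_uptime_seconds=MIN_UPTIME_BEFORE_REBOOT_SECONDS):
    """Reboot only after a kill-old-processes failure on a bot that has been up long enough.

    Refusing to reboot a recently booted bot keeps a machine that fails
    kill-old-processes right after startup from rebooting in a loop.
    """
    return kill_old_processes_failed and uptime_seconds > min_uptime_seconds
```

For example, `should_reboot(True, 2 * 60 * 60)` allows a reboot, while `should_reboot(True, 10 * 60)` refuses because the bot came up only ten minutes ago.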
Comment on attachment 382837 [details] Patch Clearing flags on attachment: 382837 Committed r252324: <https://trac.webkit.org/changeset/252324>
All reviewed patches have been landed. Closing bug.
<rdar://problem/57099318>
This change seems to be working fine. A few examples where it helped:
https://ews-build.webkit.org/#/builders/22/builds/10002
https://ews-build.webkit.org/#/builders/3/builds/16167
https://ews-build.webkit.org/#/builders/3/builds/16230
https://ews-build.webkit.org/#/builders/9/builds/17233