Bug 203698

Summary: [EWS] Layout testers can go in an infinite RETRY loop when there are 30+ failures on trunk
Product: WebKit Reporter: Aakash Jain <aakash_jain>
Component: Tools / TestsAssignee: Aakash Jain <aakash_jain>
Status: RESOLVED FIXED    
Severity: Normal CC: aakash_jain, ap, ddkilzer, jbedard, ryanhaddad, webkit-bot-watchers-bugzilla, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: Other   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=212755
https://bugs.webkit.org/show_bug.cgi?id=216046
https://bugs.webkit.org/show_bug.cgi?id=216487
Attachments:
Description Flags
Patch
none
Patch for landing none

Description Aakash Jain 2019-10-31 14:28:23 PDT
Layout testers can go in an infinite RETRY loop when there are 30+ build failures on trunk. 30 is the layout-test failure limit we use in ews (--exit-after-n-failures 30).

In case there are 30+ failures on trunk, layout-tests, re-run-layout-tests, and run-layout-tests-without-patch will all fail. EWS will simply retry the build, since in case of so many failures, EWS can't reliably tell if the patch introduce any new failures or not. 

However, the retried build will re-use the already built archive (built on a builder queue) and test using the same revision (on which builder queue built the patch). It will not build again with ToT. (This was done to save the building time)

However, since EWS will keep retrying the build on same revision, it will keep failing. Even if the failures are fixed in ToT later, the tester queue will keep retrying the same patch on a given revision, and keep failing, and wouldn't be able to automatically recover (until the patch becomes obsolete, r- or bug is closed).

For example, all of the following are retry of same patch on same revision:

https://ews-build.webkit.org/#/builders/24/builds/3152
https://ews-build.webkit.org/#/builders/24/builds/3162
https://ews-build.webkit.org/#/builders/24/builds/3168
https://ews-build.webkit.org/#/builders/24/builds/3174
https://ews-build.webkit.org/#/builders/24/builds/3181
https://ews-build.webkit.org/#/builders/24/builds/3189
https://ews-build.webkit.org/#/builders/24/builds/3195
Comment 2 Aakash Jain 2019-11-22 15:23:42 PST
Created attachment 384207 [details]
Patch
Comment 3 Jonathan Bedard 2019-11-22 17:44:26 PST
Comment on attachment 384207 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=384207&action=review

> Tools/BuildSlaveSupport/ews-build/loadConfig.py:142
> +            if not isTriggerUsedByAnyBuilder(config, scheduler['name']) and 'build' not in scheduler['name'].lower():

Is this saying that testers cannot trigger other testers?

> Tools/BuildSlaveSupport/ews-build/steps.py:494
> +        self.include_revision = include_revision

What is 'revision' for EWS? The git hash?

> Tools/BuildSlaveSupport/ews-build/steps.py:1412
> +            self.finished(SUCCESS)

Won't this turn the bubble green for a bit?

Also, we still have an infinite loop problem is someone broke the build, right?
Comment 4 Aakash Jain 2019-11-22 18:06:00 PST
Comment on attachment 384207 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=384207&action=review

>> Tools/BuildSlaveSupport/ews-build/loadConfig.py:142
>> +            if not isTriggerUsedByAnyBuilder(config, scheduler['name']) and 'build' not in scheduler['name'].lower():
> 
> Is this saying that testers cannot trigger other testers?

This was testing that we do not have any unused triggers in config.json. But now it's possible to have a valid case of triggers which no-one in config.json will use directly  (since these triggers would be triggered dynamically).

Effectively yes, we don't have such a scenario (testers triggering other testers) yet.

>> Tools/BuildSlaveSupport/ews-build/steps.py:494
>> +        self.include_revision = include_revision
> 
> What is 'revision' for EWS? The git hash?

yeah, it's git hash

>> Tools/BuildSlaveSupport/ews-build/steps.py:1412
>> +            self.finished(SUCCESS)
> 
> Won't this turn the bubble green for a bit?
> 
> Also, we still have an infinite loop problem is someone broke the build, right?

This will turn only the step green. Build would be red/FAILED because of previous layout-test steps failing.
Comment 5 Radar WebKit Bug Importer 2020-07-27 05:29:43 PDT
<rdar://problem/66157544>
Comment 6 Jonathan Bedard 2020-09-04 11:30:50 PDT
Comment on attachment 384207 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=384207&action=review

> Tools/BuildSlaveSupport/ews-build/config.json:349
> +      "triggered_by": ["ios-13-sim-build-ews"],

This makes sense to me, but I don't understand why this isn't already here, since ios-simulator-13 has been triggered by ios-13-sim-build-ews for months.
Comment 7 Aakash Jain 2020-09-04 11:40:55 PDT
Comment on attachment 384207 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=384207&action=review

>> Tools/BuildSlaveSupport/ews-build/config.json:349
>> +      "triggered_by": ["ios-13-sim-build-ews"],
> 
> This makes sense to me, but I don't understand why this isn't already here, since ios-simulator-13 has been triggered by ios-13-sim-build-ews for months.

This 'triggered_by' key is used just for passing information. I wasn't able to dynamically extract this information (about who triggered current build) using buildbot. So I am passing this information in a static manner. This information is used by tester build, to trigger back the builder.

Yes, iOS-13-Simulator-WK2-Tests-EWS is triggered by iOS-13-Simulator-Build-EWS using the 'triggers' key.
Comment 8 Jonathan Bedard 2020-09-04 12:43:13 PDT
Comment on attachment 384207 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=384207&action=review

>>> Tools/BuildSlaveSupport/ews-build/config.json:349
>>> +      "triggered_by": ["ios-13-sim-build-ews"],
>> 
>> This makes sense to me, but I don't understand why this isn't already here, since ios-simulator-13 has been triggered by ios-13-sim-build-ews for months.
> 
> This 'triggered_by' key is used just for passing information. I wasn't able to dynamically extract this information (about who triggered current build) using buildbot. So I am passing this information in a static manner. This information is used by tester build, to trigger back the builder.
> 
> Yes, iOS-13-Simulator-WK2-Tests-EWS is triggered by iOS-13-Simulator-Build-EWS using the 'triggers' key.

That's unfortunate, but does answer my question!
Comment 9 Aakash Jain 2020-09-14 09:42:24 PDT
Created attachment 408716 [details]
Patch for landing
Comment 10 Aakash Jain 2020-09-14 09:48:26 PDT
Comment on attachment 408716 [details]
Patch for landing

View in context: https://bugs.webkit.org/attachment.cgi?id=408716&action=review

Did some more testing, and updated patch accordingly. Also re-based it on ToT.

> Tools/BuildSlaveSupport/ews-app/ews/views/statusbubble.py:156
> +                bubble['state'] = 'provisional-fail'

Updated status-bubble code here to ensure that the status-bubble is orange when the build is waiting to be retried.

> Tools/BuildSlaveSupport/ews-build/factories_unittest.py:50
>      def test_generic_factory(self):

Updated factories_unittest.py accordingly.

> Tools/BuildSlaveSupport/ews-build/steps.py:2186
> +            self.setProperty('build_summary', message)

Added this line to ensure that the buildbot build summary is set properly.
Comment 11 Aakash Jain 2020-09-14 09:52:27 PDT
Sample run:

Builder build: https://ews-build.webkit-uat.org/#/builders/14/builds/1594
Corresponding tester build: https://ews-build.webkit-uat.org/#/builders/16/builds/122, 
it correctly triggered back the builder in step #30.

Re-triggered builder build: https://ews-build.webkit-uat.org/#/builders/14/builds/1607 , it correctly updated to ToT and triggered tester with ToT.
Comment 12 Aakash Jain 2020-09-14 09:55:33 PDT
Corresponding status-bubble for ios-wk2 is correctly orange and hover-over message is also correct in https://ews.webkit-uat.org/status-bubble/407969/
Comment 13 EWS 2020-09-14 10:47:38 PDT
Committed r267022: <https://trac.webkit.org/changeset/267022>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 408716 [details].
Comment 14 Aakash Jain 2020-09-18 10:14:04 PDT
This change seems to be working fine. 

e.g.: https://ews-build.webkit.org/#/builders/24/builds/26185 failed with 30+ failures. It correctly re-trigerred the parent builder in: https://ews-build.webkit.org/#/builders/23/builds/27400. This correctly triggered the tester again with ToT in https://ews-build.webkit.org/#/builders/24/builds/26223, which passed.
Comment 15 David Kilzer (:ddkilzer) 2020-09-22 10:03:46 PDT
(In reply to Aakash Jain from comment #14)
> This change seems to be working fine. 
> 
> e.g.: https://ews-build.webkit.org/#/builders/24/builds/26185 failed with
> 30+ failures. It correctly re-trigerred the parent builder in:
> https://ews-build.webkit.org/#/builders/23/builds/27400. This correctly
> triggered the tester again with ToT in
> https://ews-build.webkit.org/#/builders/24/builds/26223, which passed.

Awesome!  Thanks Aakash!