Bug 231999

Summary: [EWS][GTK][WPE] Add a new Class for running layout tests on the EWS for the GTK and WPE ports.
Product: WebKit
Reporter: Carlos Alberto Lopez Perez <clopez>
Component: Tools / Tests
Assignee: Carlos Alberto Lopez Perez <clopez>
Status: RESOLVED FIXED
Severity: Normal
CC: aakash_jain, ap, clopez, jbedard, lmoura, ryanhaddad, webkit-bug-importer
Priority: P2
Keywords: InRadar
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
See Also: https://bugs.webkit.org/show_bug.cgi?id=231790
https://bugs.webkit.org/show_bug.cgi?id=231265
https://bugs.webkit.org/show_bug.cgi?id=234270
https://bugs.webkit.org/show_bug.cgi?id=234378
https://bugs.webkit.org/show_bug.cgi?id=209104
https://bugs.webkit.org/show_bug.cgi?id=236705
https://bugs.webkit.org/show_bug.cgi?id=245329
https://bugs.webkit.org/show_bug.cgi?id=253048
https://bugs.webkit.org/show_bug.cgi?id=264837
Attachments (Description / Flags):
Patch / none
Patch / none
Patch / none
Patch / none

Description Carlos Alberto Lopez Perez 2021-10-19 20:11:09 PDT Comment hidden (obsolete)
Comment 1 Carlos Alberto Lopez Perez 2021-10-19 20:33:22 PDT
We would like to add a new class to the EWS for running layout tests on the GTK and WPE ports.

Initially this class will be used for the GTK port, and later for WPE.
Mac/iOS ports will continue to use the current default EWS, so no behaviour change for them.

This new class is specifically designed to:
  A. Work with a tree that is not always green (or that is even often quite red).
  B. Not report any false positives to the patch author.
  C. Allow patch authors to use this EWS to get new expectations for patches that need a lot of new rebaselines (like a WPT import).

The very simplified logic of how this works is the following:
  1. Run layout tests with the patch (abort early at 500 unexpected failures).
  2. Run layout tests with the patch 10 times for each test that failed consistently (non-flaky) in step 1.
  3. Run layout tests without the patch 10 times for each test that failed consistently (non-flaky) in step 2.

Then report to the patch author only the new consistent failures (tests that always failed with the patch and always passed without it across the 10 retries); see the sketch below.
Any flaky test found is only reported to the bot watchers.
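A minimal sketch of that reporting rule, using hypothetical helper names and plain sets of test names (not the actual code in Tools/CISupport/ews-build/steps.py):

```python
# Hypothetical sketch of the reporting rule described above; the real EWS
# steps are more involved (timeouts, retries, infrastructure errors, ...).

def classify(first_run_failures, with_patch_repeat_runs, without_patch_repeat_runs):
    """first_run_failures: tests that failed in the initial run with the patch.
    with_patch_repeat_runs / without_patch_repeat_runs: per-iteration sets of
    failing tests from the 10 repeat runs with and without the patch."""
    always_fail_with_patch = set(first_run_failures)
    for failed in with_patch_repeat_runs:
        always_fail_with_patch &= set(failed)      # drop tests that passed at least once (flaky)

    ever_fail_without_patch = set()
    for failed in without_patch_repeat_runs:
        ever_fail_without_patch |= set(failed)     # pre-existing failures or flakies on the clean tree

    # Report to the patch author only tests that always failed with the patch
    # and never failed without it; the rest is treated as flaky noise for bot watchers.
    report_to_author = always_fail_with_patch - ever_fail_without_patch
    flaky_for_bot_watchers = set(first_run_failures) - always_fail_with_patch
    return report_to_author, flaky_for_bot_watchers
```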

For an explainer about why this is needed, and more details about the flow diagram and several design considerations, please check: https://people.igalia.com/clopez/wkbug/231999/explainer.html
Comment 2 Carlos Alberto Lopez Perez 2021-10-19 20:37:55 PDT
Created attachment 441841 [details]
Patch
Comment 3 Carlos Alberto Lopez Perez 2021-10-19 20:42:48 PDT
I have tested this locally quite a lot and I'm happy with the results. I think it is robust and working as expected. I tested it with several patches, some of them adding failures and flaky tests, like the one in bug 231192, and others removing expectations that still fail, like the one in bug 231273.

For it to work as expected, a patch to run-webkit-tests fixing an existing issue is needed; see bug 231790.

I will be quite happy if we can run this on the UAT server for some time to test it before going into production on the real server.
Comment 4 Carlos Alberto Lopez Perez 2021-10-19 20:59:33 PDT
Created attachment 441844 [details]
Patch
Comment 5 Aakash Jain 2021-10-20 08:57:01 PDT
(In reply to Carlos Alberto Lopez Perez from comment #3)
> I will be quite happy if we can run this on the UAT server for some time to test it before going into production on the real server.
Applied attachment 441844 [details] to the UAT server. It is running builds now with the change. e.g.: https://ews-build.webkit-uat.org/#/builders/34/builds/50455
Comment 6 Lauro Moura 2021-10-20 20:49:38 PDT
Comment on attachment 441844 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=441844&action=review

Nice work! Just some minor inline comments.

> Tools/CISupport/ews-build/steps.py:63
> +        super(BufferLogHeaderObserver, self).__init__(**kwargs)

Given the buildbots are using Python3, maybe we could use just `super()`, without repeating the class name. (Ditto for the other classes added by the patch)
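For illustration, a minimal, self-contained example of the zero-argument form suggested here (the base class is a stand-in, not the actual buildbot class):

```python
# Illustration only: Python 3 allows calling super() without repeating the
# class name. 'Base' is a stand-in for whatever the observer class derives from.

class Base:
    def __init__(self, **kwargs):
        self.kwargs = kwargs

class BufferLogHeaderObserver(Base):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # instead of super(BufferLogHeaderObserver, self).__init__(**kwargs)
```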

> Tools/CISupport/ews-build/steps.py:2766
> +        timed_out_line_start = 'command timed out: {} seconds elapsed running'.format(self.MAX_SECONDS_STEP_RUN)

Should those timeouts "command timed out: TIMEOUT seconds without output..." be treated here too or would they follow another path?

> Tools/CISupport/ews-build/steps.py:2989
> +            return self.report_infrastructure_issue_and_maybe_retry_build('The layout-test run with patch generated no list of results but exited with error, and the clean_tree without patch run did the samething.')

Missing space in "same thing"
Comment 7 Carlos Alberto Lopez Perez 2021-10-22 11:30:38 PDT
Comment on attachment 441844 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=441844&action=review

>> Tools/CISupport/ews-build/steps.py:2766
>> +        timed_out_line_start = 'command timed out: {} seconds elapsed running'.format(self.MAX_SECONDS_STEP_RUN)
> 
> Should those timeouts "command timed out: TIMEOUT seconds without output..." be treated here too or would they follow another path?

There are two kinds of timeouts:
 1. The command doesn't produce any output for more than XX seconds.
 2. The command reaches the overall timeout even though it was producing output.

This code handles the second case. The first one is a different problem (likely run-webkit-tests hung) and would be treated like a general unknown failure, i.e. the case of "run-layout-tests returned an error code and didn't return a list of failures or flakies".
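A hedged sketch of that distinction, based only on the log messages quoted in this review (the exact wording of buildbot's timeout messages is an assumption here):

```python
# Sketch only: classify a buildbot timeout log line into the two kinds of
# timeouts described above. Message formats are taken from the quotes in this
# bug and may not match buildbot exactly.

MAX_SECONDS_STEP_RUN = 60 * 90  # hypothetical value, for illustration

def classify_timeout(log_line):
    elapsed_prefix = 'command timed out: {} seconds elapsed running'.format(MAX_SECONDS_STEP_RUN)
    if log_line.startswith(elapsed_prefix):
        return 'step-hit-max-time'      # case 2: handled by this code
    if 'seconds without output' in log_line:
        return 'hung-without-output'    # case 1: treated as a general unknown failure
    return 'other'
```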
Comment 8 Carlos Alberto Lopez Perez 2021-10-22 12:02:16 PDT
(In reply to Aakash Jain from comment #5)
> (In reply to Carlos Alberto Lopez Perez from comment #3)
> > I will be quite happy if we can run this on the UAT server for some time to test it before going into production on the real server.
> Applied attachment 441844 [details] to the UAT server. It is running builds
> now with the change. e.g.:
> https://ews-build.webkit-uat.org/#/builders/34/builds/50455

Thanks a lot for applying it!

The start has not been great: it is aborting early due to more than 500 unexpected failures, caused by a combination of:
 1. We currently have more than 230 unexpected failures on the tree due to bug 230797.
 2. On the EWS some tests are timing out that don't time out on the post-commit bots, which adds to those 230 failures. I have to investigate this.

But even with this quite broken tree, it is still able to do useful work and is even finding real new failures in patches, although the completeness of these new findings is (of course) limited by the early aborts. Another important thing is that it is not reporting false positives (it reported some at the start, I think because of a problem with the flatpak SDK version used on the build bot, but I triggered clean builds and that fixed it).

On top of that, there is now also a build failure (bug 232160) adding to the set of problems and causing the queue to grow more than is acceptable.

I will continue to monitor how it behaves and try to fine-tune some things.
Comment 9 Carlos Alberto Lopez Perez 2021-10-22 13:27:07 PDT
Comment on attachment 441844 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=441844&action=review

> Tools/CISupport/ews-build/steps.py:2868
> +        self.setProperty('xdebug_timeout_previous', with_patch_repeat_failures_timedout)

Note to self: remove this debug property.
Comment 10 Carlos Alberto Lopez Perez 2021-11-11 11:17:28 PST
Ok!

Sorry for the delay in replying; I have been on holidays and busy with other things.

Let's recap this:

1. This has been running for several weeks on the UAT.
- During the first weeks it worked badly because we had some problems with the builds (random crashes when running layout tests due to a bug involving ccache and the linker, which has since been solved).
- It then didn't start working really well until r284784 landed, which is needed for the reporting on the retry steps to be reliable.

So after this, it has been working as expected.

I did an analysis of the last 300 builds.

You can read the detailed analysis here: https://people.igalia.com/clopez/wkbug/231999/ews-results-analysis.txt

The conclusions are:


 * This new EWS is working as intended.
 * It finds real new failures in the patches.
 * It doesn't report false positives.
 * It works with a tree that is red (when this analysis was done there were around 100 unexpected failures on the clean tree of the GTK port).
 * It is slow compared to the Mac EWS testers, but performance will improve once we deploy it:
   * Having it running will help to reduce the number of unexpected failures that land. The number of unexpected failures on the clean tree has a big, direct impact on how long the testing takes, since each of those failures has to be retried 10 times with the patch and 10 times without it on the retry steps (for example, 100 clean-tree failures mean roughly 2000 extra test executions).
   * We plan to add more hardware resources.



I found two cases where it reported false positives; these will be corrected in the next version of the patch:

 - What happened: The step 'layout-tests-repeat-failures-without-patch' aborted due to Xvfb not starting, and the EWS thought it had finished clean (without producing failures).
 - Corrective action: We need to detect a general failure of the step in these cases and then repeat the testing.


 - What happened: The step 'layout-tests' aborted due to the WPT HTTP server not starting, so it caused a general failure (exit error but no results). The EWS then ran the whole layout test suite without the patch ('run-layout-tests-without-patch') and this time it finished (no general error, just a list of failures).
 - Corrective action: If we have a general failure on the step 'layout-tests' and then the step 'run-layout-tests-without-patch' produces unexpected results, we need to retry the whole testing (we can't conclude anything when that happens).
Comment 11 Carlos Alberto Lopez Perez 2021-11-11 11:23:25 PST
Created attachment 443981 [details]
Patch

v2 of the patch. diff to previous: http://sprunge.us/b79tdR?diff
Comment 12 Carlos Alberto Lopez Perez 2021-11-12 06:32:00 PST
BTW, the failures on the EWS (mac-wk1 and jsc) for this last version of the patch are totally unrelated.
Comment 13 Carlos Alberto Lopez Perez 2021-11-24 12:06:52 PST
ping reviewers?
Comment 14 Jonathan Bedard 2021-12-01 12:05:46 PST
Comment on attachment 443981 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=443981&action=review

> Tools/CISupport/ews-build/steps.py:2984
> +            if clean_tree_run_failures or clean_tree_run_flakies:

A bit confused why this case triggers a retry. If the run with the patch caused an exit code that didn't generate test results, but the clean-tree run did generate test results, doesn't that mean we can blame the patch for not generating test results? Or is that already covered because the return code of such a run is either SUCCESS or WARNINGS?

> Tools/CISupport/ews-build/steps.py:3026
> +            self.send_email_for_flaky_failure(flaky_failure, step_names_str)

It seems wrong to warn bot watchers about flakey failures that we only detect with a patch.
Comment 15 Carlos Alberto Lopez Perez 2021-12-01 12:28:38 PST
Comment on attachment 443981 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=443981&action=review

>> Tools/CISupport/ews-build/steps.py:2984
>> +            if clean_tree_run_failures or clean_tree_run_flakies:
> 
> A bit confused why this case triggers a retry. If the run with the patch caused an exit code that didn't generate test results, but the clean-tree run did generate test results, doesn't that mean we can blame the patch for not generating test results? Or is that already covered because the return code of such a run is either SUCCESS or WARNINGS?

This is a corner case that was detected here https://bugs.webkit.org/show_bug.cgi?id=231999#c10 when reviewing the results on the UAT:

 - What happened: The step 'layout-tests' aborted due to the WPT HTTP server not starting, so it caused a general failure (exit error but no results). The EWS then ran the whole layout test suite without the patch ('run-layout-tests-without-patch' aka 'clean-tree') and this time it finished (no general error, just a list of failures).

In other words: if the clean-tree run generates a list of failures, then the run with the patch should also generate a list of failures unless it exits with zero status. Otherwise it means something broke during the testing with the patch and caused a general failure (non-zero exit code but no list of failures/flakies), and then we can't conclude anything.

>> Tools/CISupport/ews-build/steps.py:3026
>> +            self.send_email_for_flaky_failure(flaky_failure, step_names_str)
> 
> It seems wrong to warn bot watchers about flakey failures that we only detect with a patch.

The idea of this is to help bot watchers catch flakies and start marking them in the TestExpectations files.
And I agree that the flakies of interest are those that happen without the patch.
But the way this works, most of the flakies are filtered out before reaching the run without the patch.
It does this:
 with-patch: run-layout-tests(all_tests) => failures1, flakies1
 with-patch: run-layout-tests-10times(failures1) => failures2, flakies2
 without-patch: run-layout-tests-10times(failures2) => wfailures3, wflakies3
So most of the flakies will not reach the testing without the patch, and wflakies3 will likely be empty or will miss flakies of interest that also happen without the patch.

The assumption here is that flakies are a general problem that will happen with the patch applied even when the patch has not caused them, so it reports everything found as flaky, with the hope that this will be useful info for the bot watcher.
The function that reports the flakies includes the step name where the flaky was found, so the bot watcher also knows whether the flaky was found with the patch or without it and can take that into account when investigating the issue.
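A small sketch of that reporting idea, with hypothetical names (the real reporting helper in steps.py is send_email_for_flaky_failure, which takes the step names as a string):

```python
# Hypothetical sketch: report every flaky found, tagging each one with the
# step(s) where it was seen so the bot watcher knows whether it happened
# with the patch, without it, or both.

def report_flakies(flakies_by_step, send_email):
    """flakies_by_step: dict mapping a step name (e.g.
    'layout-tests-repeat-failures-without-patch', or the corresponding
    with-patch repeat step) to a set of flaky test names.
    send_email: callable(test_name, step_names_str) that notifies bot watchers."""
    steps_per_test = {}
    for step_name, flakies in flakies_by_step.items():
        for test_name in flakies:
            steps_per_test.setdefault(test_name, []).append(step_name)
    for test_name, step_names in sorted(steps_per_test.items()):
        send_email(test_name, ', '.join(sorted(step_names)))
```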
Comment 16 Carlos Alberto Lopez Perez 2021-12-01 13:48:59 PST
Comment on attachment 443981 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=443981&action=review

>>> Tools/CISupport/ews-build/steps.py:2984
>>> +            if clean_tree_run_failures or clean_tree_run_flakies:
>> 
>> A bit confused why this case triggers a retry. If the run with the patch caused an exit code that didn't generate test results, but the clean-tree run did generate test results, doesn't that mean we can blame the patch for not generating test results? Or is that already covered because the return code of such a run is either SUCCESS or WARNINGS?
> 
> This is a corner case that was detected here https://bugs.webkit.org/show_bug.cgi?id=231999#c10 when reviewing the results on the UAT:
> 
>  - What happened: The step 'layout-tests' aborted due to the WPT http server not starting, so it caused a general failure (exit error but no results). So the EWS tried to run the whole layout tests without patch ('run-layout-tests-without-patch' aka 'clean-tree') and this time it finished  (no general error, simply a list of failures)
> 
> In other words: If the clean-tree generates a list of failures, then the run with patch should also generate a list of failures unless it exists with zero status. Otherwise it means something broke during the testing with patch that caused a general failure (non-zero exit code but no list of failures/flakies) and then we can't conclude anything.

But thinking again about this corner case, maybe the patch itself broke the run-webkit-tests script.
In reality we should not treat this corner case differently depending on the results of 'run-layout-tests-without-patch' aka 'clean-tree' (whether it passed green or generated a list of failures).
Detecting a patch that breaks the run-webkit-tests script shouldn't depend on the results of the 'clean-tree' run.
However, it is true that it is much more likely that a random error not caused by the patch happened on the first run than that the patch itself broke the script.
So I think we can do the following here:
- 1. If retry_count is less than self.MAX_RETRY and we reach the case of "non-zero exit status for first-run-with-patch but no results of failures or flakies", then retry the whole testing again (we assume some random infrastructure error happened).
- 2. Otherwise (retry_count is self.MAX_RETRY) we report the patch as bad with 'Found unexpected failure with patch'.

That way we give it a few tries, with the hope that it was a random error that will not happen on a retry, before reporting the patch as bad; a minimal sketch of this decision is below.
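A minimal sketch of that decision, under the assumption that retry_count and MAX_RETRY are tracked as described above (the function and its return values are hypothetical, not the actual steps.py code):

```python
# Hypothetical sketch of the corner-case handling proposed above.

def decide_when_run_with_patch_has_no_results(exit_code, failures, flakies, retry_count, max_retry):
    """Returns None when the corner case doesn't apply, otherwise 'retry' or 'blame-patch'."""
    if exit_code == 0 or failures or flakies:
        return None  # normal path: there is a clean exit or a list of results to analyze
    if retry_count < max_retry:
        # Most likely a random infrastructure error (e.g. the WPT HTTP server
        # or Xvfb failing to start): retry the whole testing.
        return 'retry'
    # After max_retry attempts, assume the patch itself broke run-webkit-tests
    # and report 'Found unexpected failure with patch'.
    return 'blame-patch'
```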
Comment 17 Carlos Alberto Lopez Perez 2021-12-01 14:38:27 PST
Created attachment 445617 [details]
Patch

v3 of the patch: 1. Implement the change from the comment above and 2. Add 3 more workers to speed up the queue. Diff to previous version: http://sprunge.us/TdLC6I?diff
Comment 18 Jonathan Bedard 2021-12-01 16:13:27 PST
(In reply to Carlos Alberto Lopez Perez from comment #15)
> ....
> 
> The idea of this is to help bot watchers to catch flakies and start marking them on the TestExpectation files.
> And I agree that flakies of interest are those that happen without patch.
> But the way this works, it causes that most of the flakies are filtered before reaching the run without patch.
> ....

This is the bit I wasn't thinking about.

I think the flakiness noise may be too high to be useful, but that's something we can address after we get this queue running.
Comment 19 Carlos Alberto Lopez Perez 2021-12-01 16:42:51 PST
Committed r286405 (244752@main): <https://commits.webkit.org/244752@main>
Comment 20 Radar WebKit Bug Importer 2021-12-01 16:43:24 PST
<rdar://problem/85945620>
Comment 21 Carlos Alberto Lopez Perez 2021-12-01 16:43:48 PST
Thanks! Landed it with some edits to the comments to make them clearer: http://sprunge.us/5DOchH?diff