231999 – [EWS][GTK][WPE] Add a new Class for running layout tests on the EWS for the GTK and WPE ports.

RESOLVED FIXED 231999

[EWS][GTK][WPE] Add a new Class for running layout tests on the EWS for the GTK and WPE ports.

https://bugs.webkit.org/show_bug.cgi?id=231999

Summary [EWS][GTK][WPE] Add a new Class for running layout tests on the EWS for the G...

Carlos Alberto Lopez Perez

Reported 2021-10-19 20:11:09 PDT Comment hidden (obsolete)

SSIA

Attachments
Patch (64.45 KB, patch) 2021-10-19 20:37 PDT, Carlos Alberto Lopez Perez	no flags	Details Formatted Diff Diff
Patch (65.74 KB, patch) 2021-10-19 20:59 PDT, Carlos Alberto Lopez Perez	no flags	Details Formatted Diff Diff
Patch (67.53 KB, patch) 2021-11-11 11:23 PST, Carlos Alberto Lopez Perez	no flags	Details Formatted Diff Diff
Patch (70.01 KB, patch) 2021-12-01 14:38 PST, Carlos Alberto Lopez Perez	no flags	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Carlos Alberto Lopez Perez

Comment 1 2021-10-19 20:33:22 PDT

We would like to add a new class to the EWS for running layout tests on the GTK and WPE ports. Initially this class will be used for the GTK port, and later for WPE. Mac/iOS ports will continue to use the current default EWS, so no behaviour change for them. This new class is specially designed to: A. Work with a tree that is not always green (or even that is often quite red). B. To not report any false positive to the patch author. C. To allow patch authors to use this EWS to get new expectations for patches that need lot of new re-baselines (like a WPT import) The very simplified logic of how this works is the following: 1. Run layout tests with patch (abort early at 500 unexpected failures) 2. Run layout tests with patch 10 times for each test that failed consistently (non-flaky) on step 1. 3. Run layout tests without patch 10 times for each test that failed consistently (non-flaky) on step 2. Then report to the patch author only the new consistent failures (tests that failed always with the patch and passed always without the patch, retrying 10 times) Any flaky test found is only reported to the bot watchers. For an explainer about why this is needed, and more details about the flow diagram and several design considerations, please check: https://people.igalia.com/clopez/wkbug/231999/explainer.html

Carlos Alberto Lopez Perez

Comment 2 2021-10-19 20:37:55 PDT

Created attachment 441841 [details] Patch

Carlos Alberto Lopez Perez

Comment 3 2021-10-19 20:42:48 PDT

I have tested this locally quite a lot and I'm happy with the results. I think is robust and working as expected. I tested it with several patches, some of them adding failures and flakies like the one on bug 231192 and other removing expectations that still fail like the one on bug 231273 For it to work as expected a patch on run-webkit-tests fixing an existent issue is needed, see bug 231790 I will be quite happy if we can run this on the UAT server for some time to test it before going into production on the real server.

Carlos Alberto Lopez Perez

Comment 4 2021-10-19 20:59:33 PDT

Created attachment 441844 [details] Patch

Aakash Jain

Comment 5 2021-10-20 08:57:01 PDT

(In reply to Carlos Alberto Lopez Perez from comment #3) > I will be quite happy if we can run this on the UAT server for some time to test it before going into production on the real server. Applied attachment 441844 [details] to the UAT server. It is running builds now with the change. e.g.: https://ews-build.webkit-uat.org/#/builders/34/builds/50455

Lauro Moura

Comment 6 2021-10-20 20:49:38 PDT

Comment on attachment 441844 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=441844&action=review Nice work! Just some minor inline comments. > Tools/CISupport/ews-build/steps.py:63 > + super(BufferLogHeaderObserver, self).__init__(**kwargs) Given the buildbots are using Python3, maybe we could use just `super()`, without repeating the class name. (Ditto for the other classes added by the patch) > Tools/CISupport/ews-build/steps.py:2766 > + timed_out_line_start = 'command timed out: {} seconds elapsed running'.format(self.MAX_SECONDS_STEP_RUN) Should those timeouts "command timed out: TIMEOUT seconds without output..." be treated here too or would they follow another path? > Tools/CISupport/ews-build/steps.py:2989 > + return self.report_infrastructure_issue_and_maybe_retry_build('The layout-test run with patch generated no list of results but exited with error, and the clean_tree without patch run did the samething.') Missing space in "same thing"

Carlos Alberto Lopez Perez

Comment 7 2021-10-22 11:30:38 PDT

Comment on attachment 441844 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=441844&action=review >> Tools/CISupport/ews-build/steps.py:2766 >> + timed_out_line_start = 'command timed out: {} seconds elapsed running'.format(self.MAX_SECONDS_STEP_RUN) > > Should those timeouts "command timed out: TIMEOUT seconds without output..." be treated here too or would they follow another path? There are two kinds of timeouts: 1. Command doesn't produce any output for more than XX seconds. 2. Command reaches the value of the timeout even when it was producing output. This code cares about the second case. The first one is a different problem (likely RWT hanged) and it would be treated like a general unknown failure on the case of "run-layout-tests returned error code and didn't returned a list of failures or flakies"

Carlos Alberto Lopez Perez

Comment 8 2021-10-22 12:02:16 PDT

(In reply to Aakash Jain from comment #5) > (In reply to Carlos Alberto Lopez Perez from comment #3) > > I will be quite happy if we can run this on the UAT server for some time to test it before going into production on the real server. > Applied attachment 441844 [details] to the UAT server. It is running builds > now with the change. e.g.: > https://ews-build.webkit-uat.org/#/builders/34/builds/50455 Thanks a lot for applying it! The start has not been great since it is aborting early due to more than 500 unexpected failures due to a combination of: 1. we currently have on the tree more than 230 unexpected failures due to bug 230797 2. on the EWS there are some tests timing out that don't happen on the post-commit bots that add to those 230 failures. I have to investigate this. But even with this quite broken tree, it is still being able to do work and is is even finding real new failures on bugs, but the completeness of this new findings is (of course) limited due to the early aborts. Another important thing is that is not reporting false positives (it reported false positives at the start because I think there were some problem with the flatpak SDK version used on the build bot, but I triggered clean builds and got fixed) On top of that there is also now a build failure with the build that is adding to the set of problems (bug 232160 ) and is causing the queue to grow more than acceptable. I will continue to monitor how it behaves and try to fine-tune some things.

Carlos Alberto Lopez Perez

Comment 9 2021-10-22 13:27:07 PDT

Comment on attachment 441844 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=441844&action=review > Tools/CISupport/ews-build/steps.py:2868 > + self.setProperty('xdebug_timeout_previous', with_patch_repeat_failures_timedout) Note to me: remove this debug thing

Carlos Alberto Lopez Perez

Comment 10 2021-11-11 11:17:28 PST

Ok! Sorry for the long time in replying.. I have been on a holidays and busy with other things... Let's recap this: 1. This has been running for several weeks on the UAT. - On the first weeks it worked badly because we had some problems with the builds (random crashes when running layout tests due to a bug in ccache and the linker that is already solved) - Then it didn't started to work really well until r284784 landed which is needed for the reporting on the retry steps to be reliable So after this, it has been working as expected. I did an analysis of the last 300 builds. You can read the detailed analysis here: https://people.igalia.com/clopez/wkbug/231999/ews-results-analysis.txt The conclusions are: * This new EWS is working as intended * It finds real new failues on the patches * It doesn't report false positives * It works with a tree that is red (when this analysis was done there were around 100 unexpected failures on the clean tree of the GTK port) * Is slow compared to the Mac EWS tester, but this performance will improve when we deploy it. * Having it working will help to reduce the number of unexpected failures landing. And the number of unexpected failures on the clean tree has an exponential direct impact on the time it takes to finish the testing (since it has to retry those unexpected failures from the clean tree 20 times) on the retry steps (with patch and without it). * We plan to add more hardware resources. I found two cases where it reported false positives, that will be corrected on a next version the patch - What happened: The step 'layout-tests-repeat-failures-without-patch' aborted due to Xvfb not starting and the EWS thought it finished clean (without producing failures) - Corrective action: We need to detect a general failure of the step on this cases and then repeat t - What happened: The step 'layout-tests' aborted due to the WPT http server not starting, so it caused a general failure (exit error but no results). So the EWS tried to run the whole layout tests without patch ('run-layout-tests-without-patch') and this time it finished (no general error, simply a list of failures) - Corrective action: if we have a general failure on the step 'layout-tests' and then the step 'run-layout-tests-without-patch' produces unexpected results then we need to retry the whole testing (we can't conclude anything when that happens)

Carlos Alberto Lopez Perez

Comment 11 2021-11-11 11:23:25 PST

Created attachment 443981 [details] Patch v2 of the patch. diff to previous: http://sprunge.us/b79tdR?diff

Carlos Alberto Lopez Perez

Comment 12 2021-11-12 06:32:00 PST

BTW, the failures on the EWS (mac-wk1 and jsc) for this last version of the patch are totally unrelated

Carlos Alberto Lopez Perez

Comment 13 2021-11-24 12:06:52 PST

ping reviewers?

Jonathan Bedard

Comment 14 2021-12-01 12:05:46 PST

Comment on attachment 443981 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=443981&action=review > Tools/CISupport/ews-build/steps.py:2984 > + if clean_tree_run_failures or clean_tree_run_flakies: A bit confused why this case triggers a retry. If the run with the patch caused an exit code that didn't generate test results, but the clean-tree run did generate test results, doesn't that mean we can blame the patch for not generating test results? Or is that already covered because the return code of such a run is either SUCCESS or WARNINGS? > Tools/CISupport/ews-build/steps.py:3026 > + self.send_email_for_flaky_failure(flaky_failure, step_names_str) It seems wrong to warn bot watchers about flakey failures that we only detect with a patch.

Carlos Alberto Lopez Perez

Comment 15 2021-12-01 12:28:38 PST

Comment on attachment 443981 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=443981&action=review >> Tools/CISupport/ews-build/steps.py:2984 >> + if clean_tree_run_failures or clean_tree_run_flakies: > > A bit confused why this case triggers a retry. If the run with the patch caused an exit code that didn't generate test results, but the clean-tree run did generate test results, doesn't that mean we can blame the patch for not generating test results? Or is that already covered because the return code of such a run is either SUCCESS or WARNINGS? This is a corner case that was detected here https://bugs.webkit.org/show_bug.cgi?id=231999#c10 when reviewing the results on the UAT: - What happened: The step 'layout-tests' aborted due to the WPT http server not starting, so it caused a general failure (exit error but no results). So the EWS tried to run the whole layout tests without patch ('run-layout-tests-without-patch' aka 'clean-tree') and this time it finished (no general error, simply a list of failures) In other words: If the clean-tree generates a list of failures, then the run with patch should also generate a list of failures unless it exists with zero status. Otherwise it means something broke during the testing with patch that caused a general failure (non-zero exit code but no list of failures/flakies) and then we can't conclude anything. >> Tools/CISupport/ews-build/steps.py:3026 >> + self.send_email_for_flaky_failure(flaky_failure, step_names_str) > > It seems wrong to warn bot watchers about flakey failures that we only detect with a patch. The idea of this is to help bot watchers to catch flakies and start marking them on the TestExpectation files. And I agree that flakies of interest are those that happen without patch. But the way this works, it causes that most of the flakies are filtered before reaching the run without patch. It Does this: with-patch: run-layout-tests(all_tests) => failures1, flakies1 with-patch: run-layout-tests-10times(failures1) => failures2, flakies2 without-patch: run-layout-tests-10times(failures2) => wfailures3, wflakies3 So most of the flakies will not reach the testing without-patch, and wflakies3 will likely be empty or will miss flakies of interest that are caused without the patch as well. The assumption here is that flakies are a general problem that will happen with patch even when the patch has not caused them so it reports everything found as flaky with the hope that it will be useful info for the bot watcher. The function to report the flakies includes the step name where the flaky was found so the bot watcher also knows if the flaky was found with patch or without it, so he can take that into account when investigating the issue.

Carlos Alberto Lopez Perez

Comment 16 2021-12-01 13:48:59 PST

Comment on attachment 443981 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=443981&action=review >>> Tools/CISupport/ews-build/steps.py:2984 >>> + if clean_tree_run_failures or clean_tree_run_flakies: >> >> A bit confused why this case triggers a retry. If the run with the patch caused an exit code that didn't generate test results, but the clean-tree run did generate test results, doesn't that mean we can blame the patch for not generating test results? Or is that already covered because the return code of such a run is either SUCCESS or WARNINGS? > > This is a corner case that was detected here https://bugs.webkit.org/show_bug.cgi?id=231999#c10 when reviewing the results on the UAT: > > - What happened: The step 'layout-tests' aborted due to the WPT http server not starting, so it caused a general failure (exit error but no results). So the EWS tried to run the whole layout tests without patch ('run-layout-tests-without-patch' aka 'clean-tree') and this time it finished (no general error, simply a list of failures) > > In other words: If the clean-tree generates a list of failures, then the run with patch should also generate a list of failures unless it exists with zero status. Otherwise it means something broke during the testing with patch that caused a general failure (non-zero exit code but no list of failures/flakies) and then we can't conclude anything. But thinking again about this corner case maybe the patch itself broke the script run-webkit-tests. In reality we should not treat different this corner case depending on the results of run-layout-tests-without-patch' aka 'clean-tree' (whether it passed green or generated a list of failures). Detecting a patch breaking the script run-webkit-tests shouldn't depend on the results of the 'clean-tree' run However it is true that is much more likely that a random error not caused by the patch happened on the first run, than the patch itself broke the script. So I think we can maybe do here the following: - 1. if retry_count is less than self.MAX_RETRY and we reach the case of "non-zero exist status for first-run-with-patch but no results of failure or flakies" then retry the whole testing again (we assume some random infrastructure error happened) - 2. otherwise (retry_count is self.MAX_RETRY) we report the patch as bad with 'Found unexpected failure with patch' That way we give it a few tries with the hope it was a random error that will not happen on a retry before reporting the patch as bad.

Carlos Alberto Lopez Perez

Comment 17 2021-12-01 14:38:27 PST

Created attachment 445617 [details] Patch v3 of the patch: 1. Implement the change from the comment above and 2. Add 3 more workers to speed up the queue. Diff to previous version: http://sprunge.us/TdLC6I?diff

Jonathan Bedard

Comment 18 2021-12-01 16:13:27 PST

(In reply to Carlos Alberto Lopez Perez from comment #15) > .... > > The idea of this is to help bot watchers to catch flakies and start marking them on the TestExpectation files. > And I agree that flakies of interest are those that happen without patch. > But the way this works, it causes that most of the flakies are filtered before reaching the run without patch. > .... This is the bit I wasn't thinking about. I think the flakiness noise may be too high to be useful, but that's something we can address after we get this queue running.

Carlos Alberto Lopez Perez

Comment 19 2021-12-01 16:42:51 PST

Committed r286405 (244752@main): <https://commits.webkit.org/244752@main>

Radar WebKit Bug Importer

Comment 20 2021-12-01 16:43:24 PST

<rdar://problem/85945620>

Carlos Alberto Lopez Perez

Comment 21 2021-12-01 16:43:48 PST

Thanks! landed it with some edits on the comments to make them more clear: http://sprunge.us/5DOchH?diff

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component Tools / Tests

Assignee

Carlos Alberto Lopez Perez

Reported

2021-10-19 20:11 PDT

Modified

2023-11-14 14:06 PST History

CC List

7 users Show

URL

Keywords InRadar

Depends on

Blocks