Bug 219500

Summary: [EWS] run-layout-tests-without-patch step should only retry the tests that failed on the previous steps
Product: WebKit Reporter: Carlos Alberto Lopez Perez <clopez>
Component: Tools / TestsAssignee: Carlos Alberto Lopez Perez <clopez>
Status: RESOLVED FIXED    
Severity: Normal CC: aakash_jain, ap, clopez, dewei_zhu, dpino, jbedard, lmoura, rniwa, ryanhaddad, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=231265
https://bugs.webkit.org/show_bug.cgi?id=245329
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch
none
Patch none

Description Carlos Alberto Lopez Perez 2020-12-03 11:53:17 PST
When a patch that introduces failures the EWS for layout tests currently does this:

1.  step run-layout-tests: run all the tests with the pach
2.  step re-run-layout-tests: re-run all the tests with the pach
3.  step run-layout-tests-without-patch: re-run all the tests without the pach


Then it picks the tests that failed both in 1. and 2. and compares that with the tests that failed on 3.

tests_that_consistently_failed = first_results_failing_tests.intersection(second_results_failing_tests)
flaky_failures = first_results_failing_tests.union(second_results_failing_tests) - first_results_failing_tests.intersection(second_results_failing_tests)
failures_introduced_by_patch = tests_that_consistently_failed - clean_tree_results_failing_tests


The problem with this is the time that it takes to run the whole layout tests (between 30 minutes and 1 hour). So when a patch introduces new failures the EWS takes a lot to process.

This problem is specially relevant for WebKitGTK, where we are trying to add an EWS that runs layout tests but our tree is not always green, so even when the patch doesn't introduce new failures the EWS has to run the 3 steps to compare with the clean tree (which is not clean).


I think it would be idea to change this logic to on steps 2 and 3 to only run the tests that failed on the first step.

See example:

1.  step run-layout-tests: run all the tests with the pach (30 minutes)
    - Failed tests: {'fast/css/failure1.html', 'fast/css/failure2.html', 'fast/css/flaky_test.html'}

2.  step re-run-layout-tests: re-run only this 3 failures (1 minute)
    - Failed tests: {'fast/css/failure1.html', 'fast/css/failure2.html'}


3. step run-layout-tests-without-patch: re-run only the 3 failures (1 minute=
   - Failed tests: {'fast/css/flaky_test.html'}

result:

tests_that_consistently_failed =  {'fast/css/failure1.html', 'fast/css/failure2.html'}
flaky_failures =   {'fast/css/flaky_test.html'}
failures_introduced_by_patch = {'fast/css/failure1.html', 'fast/css/failure2.html'}


So the end result is the same but much faster as we only run once the whole layout test step.
Comment 1 Carlos Alberto Lopez Perez 2020-12-03 12:01:26 PST
Created attachment 415318 [details]
Patch
Comment 2 Carlos Alberto Lopez Perez 2020-12-03 12:09:24 PST
Created attachment 415319 [details]
Patch
Comment 3 Alexey Proskuryakov 2020-12-03 13:26:18 PST
Comment on attachment 415319 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=415319&action=review

> Tools/ChangeLog:14
> +        This helps to speed up the test times on the EWS when the patch
> +        doesn't pass on the first try. This is specially useful when the
> +        clean-tree-without-patch is not 100% green or has some flakies.

Great idea!

I can't think of any major problem with running a subset of tests in clean-tree-without-patch, given that retries are enabled on EWS (so tests that only fail because of dependencies aren't detected anyway).

However, re-run-layout-tests needs to run the whole test suite, so that we could detect failures by count. E.g. if a patch introduces failures on a random subset totaling 1% of tests, we would detect 30 failures on the first run, and it's absolutely not sufficient to only re-run those 30 (chances are that we'll hit zero failures, and green-light the patch).

There is a small problem with run-layout-tests-without-patch too. What about tests that were just added with the patch, just marked in TestExpectations, or just unmarked? Forcing these tests via command line arguments will produce incorrect results in some of these cases.

> Tools/CISupport/ews-build/steps.py:1949
> +            self.setCommand(self.command + list(first_results_failing_tests))

Are we guaranteed to not run out of command line length limit?
Comment 4 Alexey Proskuryakov 2020-12-03 13:27:26 PST
At a higher level, I think that the solution to this problem is using flakiness dashboard data to ignore tests known to be flaky.
Comment 5 Carlos Alberto Lopez Perez 2020-12-03 13:48:36 PST
(In reply to Alexey Proskuryakov from comment #3)
> Comment on attachment 415319 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=415319&action=review
> 
> > Tools/ChangeLog:14
> > +        This helps to speed up the test times on the EWS when the patch
> > +        doesn't pass on the first try. This is specially useful when the
> > +        clean-tree-without-patch is not 100% green or has some flakies.
> 
> Great idea!
> 
> I can't think of any major problem with running a subset of tests in
> clean-tree-without-patch, given that retries are enabled on EWS (so tests
> that only fail because of dependencies aren't detected anyway).
> 
> However, re-run-layout-tests needs to run the whole test suite, so that we
> could detect failures by count. E.g. if a patch introduces failures on a
> random subset totaling 1% of tests, we would detect 30 failures on the first
> run, and it's absolutely not sufficient to only re-run those 30 (chances are
> that we'll hit zero failures, and green-light the patch).
> 


The current patch already takes that into account.

If on the first run we hit 30 failures then we exit early and first_results_did_exceed_test_failure_limit becomes True.

So in that case the second retry runs with the whole set of layout tests again.

> There is a small problem with run-layout-tests-without-patch too. What about
> tests that were just added with the patch, just marked in TestExpectations,
> or just unmarked? Forcing these tests via command line arguments will
> produce incorrect results in some of these cases.
> 

I'm not sure if I'm understanding this problem.
Is because passing a path to a test that doesn't exist when the patch is unaplied?
I have tested now to pass to run-webkit-tests a test path that doesn't exist and it simply ignores it.


> > Tools/CISupport/ews-build/steps.py:1949
> > +            self.setCommand(self.command + list(first_results_failing_tests))
> 
> Are we guaranteed to not run out of command line length limit?

We are guaranteed because the layout test step for the EWS runs with --exit-after-n-failures 30'. So only 29 tests will be passed in the worst case.
If it reaches 30 then first_results_did_exceed_test_failure_limit becomes True and the second retry runs the whole step as I said above.
Comment 6 Ryosuke Niwa 2020-12-03 13:54:43 PST
(In reply to Carlos Alberto Lopez Perez from comment #5)
> (In reply to Alexey Proskuryakov from comment #3)
> > Comment on attachment 415319 [details]
> > Patch
> > 
> > View in context:
> > https://bugs.webkit.org/attachment.cgi?id=415319&action=review
> > 
> > > Tools/ChangeLog:14
> > > +        This helps to speed up the test times on the EWS when the patch
> > > +        doesn't pass on the first try. This is specially useful when the
> > > +        clean-tree-without-patch is not 100% green or has some flakies.
> > 
> > Great idea!
> > 
> > I can't think of any major problem with running a subset of tests in
> > clean-tree-without-patch, given that retries are enabled on EWS (so tests
> > that only fail because of dependencies aren't detected anyway).
> > 
> > However, re-run-layout-tests needs to run the whole test suite, so that we
> > could detect failures by count. E.g. if a patch introduces failures on a
> > random subset totaling 1% of tests, we would detect 30 failures on the first
> > run, and it's absolutely not sufficient to only re-run those 30 (chances are
> > that we'll hit zero failures, and green-light the patch).
> > 
> 
> 
> The current patch already takes that into account.
> 
> If on the first run we hit 30 failures then we exit early and
> first_results_did_exceed_test_failure_limit becomes True.
> 
> So in that case the second retry runs with the whole set of layout tests
> again.

There are cases in which the number of failures is less than that limit. There are also cases in which a test failure happens because some states or conditions set by the proceeding tests so it's not always sufficient to just re-run the test that failed.
Comment 7 Alexey Proskuryakov 2020-12-03 17:00:19 PST
> If on the first run we hit 30 failures then we exit early and first_results_did_exceed_test_failure_limit becomes True.

OK - but as Ryosuke pointed out, it could also be 20, and then the proposed logic will make EWS work incorrectly.

> Is because passing a path to a test that doesn't exist when the patch is unaplied?
I have tested now to pass to run-webkit-tests a test path that doesn't exist and it simply ignores it.

One example is when a patch purportedly fixes a test and removes it from TestExpectations, but it actually still fails. Currently, clean-tree-without-patch wouldn't run this test, and it would be reported as a new failure. With the proposed change, it will be on the command line, will fail, and thus will be misreported as preexisting failure.

> So only 29 tests will be passed in the worst case.

Right, I was asking if you calculated how long 29 paths plus other options can be, and compared that to command line length limit.
Comment 8 Radar WebKit Bug Importer 2020-12-10 11:54:15 PST
<rdar://problem/72190937>
Comment 9 Carlos Alberto Lopez Perez 2021-03-03 17:55:04 PST
Created attachment 422168 [details]
Patch
Comment 10 Carlos Alberto Lopez Perez 2021-03-04 08:23:10 PST
(In reply to Alexey Proskuryakov from comment #7)
> > If on the first run we hit 30 failures then we exit early and first_results_did_exceed_test_failure_limit becomes True.
> 
> OK - but as Ryosuke pointed out, it could also be 20, and then the proposed
> logic will make EWS work incorrectly.
> 
> > Is because passing a path to a test that doesn't exist when the patch is unaplied?
> I have tested now to pass to run-webkit-tests a test path that doesn't exist
> and it simply ignores it.
> 
> One example is when a patch purportedly fixes a test and removes it from
> TestExpectations, but it actually still fails. Currently,
> clean-tree-without-patch wouldn't run this test, and it would be reported as
> a new failure. With the proposed change, it will be on the command line,
> will fail, and thus will be misreported as preexisting failure.
> 
> > So only 29 tests will be passed in the worst case.
> 
> Right, I was asking if you calculated how long 29 paths plus other options
> can be, and compared that to command line length limit.


Ok.

I implemented a new version with this suggestions. It only modified the step run-layout-tests-without-patch to run the subset of failed tests from the previous steps (only if those previous steps finished testing with all tests)


BTW, The red EWS from iOS is unrelated.
Comment 11 Alexey Proskuryakov 2021-03-05 16:48:17 PST
I think that this still has a problem with "tests that were just added with the patch, just marked in TestExpectations, or just unmarked". Sorry for not clarifying when you asked before.

Consider a patch that removes a [ Skip ] test expectation. If the test still fails in reality, then EWS will be green anyway! That's because all three runs will see it fail - passing the test on command line overrides [ Skip ]. Skipped directories would result in something similar, but I'm not even entirely sure what will happen.

Same thing with tests previously marked as [ Failure ], because EWS skips those.

This is a practical scenario, as people do rely on EWS to tell them what happens when unskipping tests.

Other than that, I cannot think of problems with the latest iteration.
Comment 12 Carlos Alberto Lopez Perez 2021-03-10 11:17:33 PST
(In reply to Alexey Proskuryakov from comment #11)
> I think that this still has a problem with "tests that were just added with
> the patch, just marked in TestExpectations, or just unmarked". Sorry for not
> clarifying when you asked before.
> 
> Consider a patch that removes a [ Skip ] test expectation. If the test still
> fails in reality, then EWS will be green anyway! That's because all three
> runs will see it fail - passing the test on command line overrides [ Skip ].
> Skipped directories would result in something similar, but I'm not even
> entirely sure what will happen.
> 
> Same thing with tests previously marked as [ Failure ], because EWS skips
> those.
> 
> This is a practical scenario, as people do rely on EWS to tell them what
> happens when unskipping tests.
> 
> Other than that, I cannot think of problems with the latest iteration.

I see. This is an interesting case. Thanks for pointing it out.

I have a new iteration of this:

    - Now it takes a look at the patch that is being tested, and if the patch modifies a TestExpectation file inside the directory LayoutTests then it avoids doing this optimization (it runs the whole test suite on the step run-layout-tests-without-patch)
Comment 13 Carlos Alberto Lopez Perez 2021-03-10 11:30:48 PST
Created attachment 422850 [details]
Patch
Comment 14 Alexey Proskuryakov 2021-03-10 16:17:58 PST
I think that this makes sense conceptually! Probably best for Aakash to review the code, as he has more recent hands on experience with EWS.
Comment 15 Aakash Jain 2021-03-15 09:26:11 PDT
Comment on attachment 422850 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=422850&action=review

> Tools/CISupport/ews-build/steps.py:2069
> +                        if ('LayoutTests/' in line and 'TestExpectations' in line) and (line.startswith('---') or line.startswith('+++')):

patch is stored by buildbot as bytes (see https://github.com/buildbot/buildbot/issues/5812#issuecomment-790175979). So this line should be changed to something like:

if (b'LayoutTests/' in line and b'TestExpectations' in line) and (line.startswith(b'---') or line.startswith(b'+++')):

Else it hits issue like: builtins.TypeError: a bytes-like object is required, not 'str' (https://ews-build.webkit-uat.org/#/builders/69/builds/4/steps/29/logs/err_text)

> Tools/CISupport/ews-build/steps_unittest.py:1983
> +        RunWebKitTests._get_patch = lambda x: '+++ LayoutTests/Changelog\n+++ LayoutTests/platform/gtk/TestExpectations\n'

Please make the patch content a byte object, since in reality patch is stored as byte object.
Comment 16 Aakash Jain 2021-03-15 09:28:17 PDT
Looks fine, tested on uat instance.
e.g.: 
Patch which modifies TestExpectations: https://ews-build.webkit-uat.org/#/builders/69/builds/44

Patch which doesn't modifies TestExpectations: https://ews-build.webkit-uat.org/#/builders/69/builds/45
Comment 17 Aakash Jain 2021-03-15 10:48:57 PDT
Comment on attachment 422850 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=422850&action=review

> Tools/CISupport/ews-build/steps.py:2054
> +        if self.name == 'run-layout-tests-without-patch':

can change this to: RunWebKitTestsWithoutPatch.name
Comment 18 Carlos Alberto Lopez Perez 2021-03-15 19:25:56 PDT
(In reply to Aakash Jain from comment #15)
> Comment on attachment 422850 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=422850&action=review
> 
> > Tools/CISupport/ews-build/steps.py:2069
> > +                        if ('LayoutTests/' in line and 'TestExpectations' in line) and (line.startswith('---') or line.startswith('+++')):
> 
> patch is stored by buildbot as bytes (see
> https://github.com/buildbot/buildbot/issues/5812#issuecomment-790175979). So
> this line should be changed to something like:
> 
> if (b'LayoutTests/' in line and b'TestExpectations' in line) and
> (line.startswith(b'---') or line.startswith(b'+++')):
> 
> Else it hits issue like: builtins.TypeError: a bytes-like object is
> required, not 'str'
> (https://ews-build.webkit-uat.org/#/builders/69/builds/4/steps/29/logs/
> err_text)
> 
> > Tools/CISupport/ews-build/steps_unittest.py:1983
> > +        RunWebKitTests._get_patch = lambda x: '+++ LayoutTests/Changelog\n+++ LayoutTests/platform/gtk/TestExpectations\n'
> 
> Please make the patch content a byte object, since in reality patch is
> stored as byte object.


Good catch!

I will fix it.


(In reply to Aakash Jain from comment #16)
> Looks fine, tested on uat instance.
> e.g.: 
> Patch which modifies TestExpectations:
> https://ews-build.webkit-uat.org/#/builders/69/builds/44
> 
> Patch which doesn't modifies TestExpectations:
> https://ews-build.webkit-uat.org/#/builders/69/builds/45


Great! :)
Comment 19 Carlos Alberto Lopez Perez 2021-03-15 19:29:28 PDT
Created attachment 423284 [details]
Patch

use bytes for the patch
Comment 20 Carlos Alberto Lopez Perez 2021-03-16 07:05:52 PDT
Committed r274475 (235331@main): <https://commits.webkit.org/235331@main>
Comment 21 Aakash Jain 2021-03-16 07:07:31 PDT
Restarted buildbot to pick up this change.
Comment 22 Carlos Alberto Lopez Perez 2021-10-05 17:04:59 PDT
(In reply to Alexey Proskuryakov from comment #11)
> I think that this still has a problem with "tests that were just added with
> the patch, just marked in TestExpectations, or just unmarked". Sorry for not
> clarifying when you asked before.
> 
> Consider a patch that removes a [ Skip ] test expectation. If the test still
> fails in reality, then EWS will be green anyway! That's because all three
> runs will see it fail - passing the test on command line overrides [ Skip ].
> Skipped directories would result in something similar, but I'm not even
> entirely sure what will happen.
> 
> Same thing with tests previously marked as [ Failure ], because EWS skips
> those.
> 
> This is a practical scenario, as people do rely on EWS to tell them what
> happens when unskipping tests.
> 
> Other than that, I cannot think of problems with the latest iteration.

I just discovered the option '--skipped=always' for run-webkit-tests. As far as I can test, with this option we can apply this optimization in this corner case without worries.

I uploaded a patch in bug 231265