Bug 212381 - [EWS] Add a special case for running the layout test step without aborting in case of many failures for WPT tests
Summary: [EWS] Add a special case for running the layout test step without aborting in case of many failures for WPT tests
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Carlos Alberto Lopez Perez
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-05-26 10:16 PDT by Carlos Alberto Lopez Perez
Modified: 2020-06-03 13:19 PDT
CC List: 8 users

See Also:


Attachments
Patch (15.64 KB, patch)
2020-05-26 10:22 PDT, Carlos Alberto Lopez Perez
no flags
Patch (15.64 KB, patch)
2020-05-27 11:45 PDT, Carlos Alberto Lopez Perez
no flags

Description Carlos Alberto Lopez Perez 2020-05-26 10:16:20 PDT
I want to prototype a bot that helps us update our WPT import checkout weekly <https://lists.webkit.org/pipermail/webkit-dev/2020-March/031121.html>

To do this, I want to (re-)use our EWS system to get the layout test results for all the platforms (Mac, iOS, etc.).

However, I'm finding that the EWS layout test results are not very useful when testing patches that touch a lot of tests (like a WPT import).

This is due to the EWS running the layout test step with:

 * "--exit-after-n-failures 30" -> usually a WPT import are thousands of new tests, if it aborts instead of giving back all the failures it is not really useful
 * "--skip-failing-tests" -> some tests may crash (or timeout) instead of fail in different platforms, if we don't try to re-recheck if they fail we miss that. See for example r262062 (I only found that after landing the patch, because I had marked that test as failing (since it fails on release), but the EWS debug was skipping it, therefore I couldn't know it was crashing on debug)

I understand these parameters are useful to make the EWS faster and are good enough for the general use case.

So I want to propose something like not passing those parameters to the EWS for patches uploaded by a specific Bugzilla user (the one this bot would use).
That way the EWS can be useful for this bot, while we keep the current behavior for everybody else.

It will also help (making everything faster and simpler) if, for this user, the layout test step only runs the tests inside the folder "imported/w3c/web-platform-tests/", since those are the only ones of interest to it anyway.
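
A minimal sketch of the proposal (hypothetical helper; the actual steps.py code may be structured differently):

    # Hypothetical sketch, not the actual steps.py code: pick run-webkit-tests
    # arguments based on who uploaded the patch.
    def layout_test_args(patch_author):
        if patch_author == 'webkit-wpt-import-bot@igalia.com':
            # WPT import bot: report every failure, re-run known-failing
            # tests, and only run the imported WPT directory.
            return ['imported/w3c/web-platform-tests']
        # Everybody else keeps the current fast-path behavior.
        return ['--exit-after-n-failures', '30', '--skip-failing-tests']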
Comment 1 Carlos Alberto Lopez Perez 2020-05-26 10:22:45 PDT
Created attachment 400257 [details]
Patch
Comment 2 Carlos Alberto Lopez Perez 2020-05-26 10:43:01 PDT
Note: I think the style check error is wrong. I followed the same style as the other tests above.
Comment 3 Jonathan Bedard 2020-05-26 11:15:16 PDT
Comment on attachment 400257 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=400257&action=review

> Tools/BuildSlaveSupport/ews-build/steps.py:1712
> +        if patch_author == 'webkit-wpt-import-bot@igalia.com':

This is an interesting idea. Aakash should weigh in.

It does feel like this should be a list; it seems likely that Apple will want a WPT import bot at some point in the future.
Comment 4 Alexey Proskuryakov 2020-05-26 11:45:17 PDT
Comment on attachment 400257 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=400257&action=review

> Tools/ChangeLog:3
> +        [EWS] Add a special case for running the layout test step without aborting in case of many failures for WPT tests

Not directly related to this patch, just a suggestion. I think that this EWS also needs to run a lot of iterations to detect flaky tests. That would likely be more intricate than simply passing an argument for iteration count.
Comment 5 Carlos Alberto Lopez Perez 2020-05-27 10:52:46 PDT
(In reply to Jonathan Bedard from comment #3)
> Comment on attachment 400257 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=400257&action=review
> 
> > Tools/BuildSlaveSupport/ews-build/steps.py:1712
> > +        if patch_author == 'webkit-wpt-import-bot@igalia.com':
> 
> This is an interesting idea. Aakash should weigh in.
> 

I talked with him on Slack; he was OK with this idea.

> It does feel like this should be a list, it seems likely that Apple will
> want a wpt import bot at some point in the future.

Yes! I will turn that into a list.
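
Something like this (hypothetical names; the actual patch may differ):

    # Accounts whose patches get the full, non-aborting WPT-only run.
    WPT_IMPORT_BOT_ACCOUNTS = [
        'webkit-wpt-import-bot@igalia.com',
        # Room here for other vendors' import bots in the future.
    ]

    def is_wpt_import_bot(patch_author):
        return patch_author in WPT_IMPORT_BOT_ACCOUNTS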
Comment 6 Carlos Alberto Lopez Perez 2020-05-27 11:32:08 PDT
(In reply to Alexey Proskuryakov from comment #4)
> Comment on attachment 400257 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=400257&action=review
> 
> > Tools/ChangeLog:3
> > +        [EWS] Add a special case for running the layout test step without aborting in case of many failures for WPT tests
> 
> Not directly related to this patch, just a suggestion. I think that this EWS
> also needs to run a lot of iterations to detect flaky tests. That would
> likely be more intricate than simply passing an argument for iteration count.

Great idea.

I was just playing with passing "--iterations N" to run-webkit-tests, and it doesn't behave as I was expecting. It seems it marks a test as failing if it failed on more than N/2 iterations, but otherwise marks it as passing. For example: over 3 iterations, if a test passes 2 and fails 1, it is simply reported as a pass.

So I think I would need something different. Possible ideas that come to mind:

 * Add a new "--detect-flaky-tests N" switch to run-webkit-tests that does the same as "--iterations" but reports all the different results it got for each test.
 * Have the bot upload the final patch N times and require everything green (on the EWS) all N times before asking for review.


Not sure yet which one would be the best approach.
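
To illustrate the first idea, a rough sketch of what a "--detect-flaky-tests N" mode could do (hypothetical; this switch doesn't exist in run-webkit-tests today):

    from collections import Counter

    # Run each test N times and report every distinct result it produced,
    # instead of reducing the iterations to a single pass/fail verdict.
    def detect_flaky(run_test, tests, iterations):
        # run_test(test) returns 'PASS', 'FAIL', 'TIMEOUT' or 'CRASH'.
        report = {}
        for test in tests:
            results = Counter(run_test(test) for _ in range(iterations))
            if len(results) > 1:  # more than one distinct result => flaky
                report[test] = dict(results)
        return report  # e.g. {'foo.html': {'PASS': 2, 'FAIL': 1}}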
Comment 7 Carlos Alberto Lopez Perez 2020-05-27 11:45:57 PDT
Created attachment 400359 [details]
Patch
Comment 8 Jonathan Bedard 2020-05-27 11:51:57 PDT
(In reply to Carlos Alberto Lopez Perez from comment #6)
> I was just playing with passing "--iterations N" to run-webkit-tests, and it
> doesn't behave as I was expecting. It seems it marks a test as failing if it
> failed on more than N/2 iterations, but otherwise marks it as passing. For
> example: over 3 iterations, if a test passes 2 and fails 1, it is simply
> reported as a pass.

To me, that feels like the wrong behavior. If you run with multiple iterations, we should use the worst result as the result of the test, but that's definitely a different bug.

Comment 9 Carlos Alberto Lopez Perez 2020-05-27 12:02:06 PDT
(In reply to Jonathan Bedard from comment #8)
> To me, that feels like the wrong behavior. If you run with multiple
> iterations, we should use the worst result as the result of the test, but
> that's definitely a different bug.

It's not always obvious which one is the worst result. For example, imagine a test that gives: Failure, Timeout, and Pass.

Which is worse: Failure or Timeout? The expectation would have to contain both cases (or be a Skip) to be OK.

So I see value in reporting all the results and letting the user/tool choose how to deal with that.
Comment 10 Alexey Proskuryakov 2020-05-27 12:17:18 PDT
My idea was that the logic would be in EWS, not in run-webkit-tests. That removes any API barriers that could complicate flaky test detection and reporting.

A counter-argument is that having the logic in EWS means that we need to restart the master when the code changes.
Comment 11 Jonathan Bedard 2020-05-27 12:48:56 PDT
(In reply to Carlos Alberto Lopez Perez from comment #9)
> It's not always obvious which one is the worst result. For example, imagine
> a test that gives: Failure, Timeout, and Pass.
> 
> Which is worse: Failure or Timeout? The expectation would have to contain
> both cases (or be a Skip) to be OK.
> 
> So I see value in reporting all the results and letting the user/tool choose
> how to deal with that.

There is actually a canonical hierarchy we defined for the results database (in Tools/Scripts/webkitpy/results/upload.py), but your point about needing multiple expectations is a good one.
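
For illustration, a toy version of picking a "worst" result with such a hierarchy (the ranking below is an assumption for the example, not the actual ordering from upload.py):

    # Assumed severity ranking for illustration only; see
    # Tools/Scripts/webkitpy/results/upload.py for the canonical one.
    SEVERITY = {'PASS': 0, 'FAIL': 1, 'TIMEOUT': 2, 'CRASH': 3}

    def worst_result(results):
        # Return the most severe result seen across iterations.
        return max(results, key=lambda result: SEVERITY.get(result, 0))

    print(worst_result(['PASS', 'FAIL', 'TIMEOUT']))  # -> TIMEOUT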
Comment 12 Carlos Alberto Lopez Perez 2020-05-27 13:29:45 PDT
(In reply to Alexey Proskuryakov from comment #10)
> My idea was that the logic would be in EWS, not in run-webkit-tests. That
> removes any API barriers that could be complicating flaky test detection and
> reporting.
> 
> A counter-argument is that having the logic in EWS means that we need to
> restart the master when code changes.

Having the logic in the EWS is a really interesting idea.

Currently the EWS does this:

  - run1 layout-tests with patch
  - run2 layout-tests with patch (only run if unexpected failures in run1)
  - run3 layout-tests without patch (only run if unexpected failures in run1)

* EWS status: green if new failures in both run1 and run2 (but not run3). If a new failure happens on run1 but not run2, it is ignored.


I'm thinking of modifying the logic of the EWS so that, for this special use case, it does this:

  - run1 layout-tests without patch
  - run2 layout-tests with patch (only run if _no_ unexpected failures in run1)
  - run3 layout-tests with patch (only run if _no_ unexpected failures in run2)
  - run4 layout-tests with patch (only run if _no_ unexpected failures in run3)


* EWS result: green if there are no unexpected failures in any of the runs and all 4 runs completed.

That way I can assert that:
  - No pre-existing failures were present (for example, due to the WebKit revision being tested having issues)
  - The patch passed 3 runs without any unexpected results.


What do you think?
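
In pseudo-Python, the proposed flow would look like this (hypothetical function names, just to make the gating explicit):

    # Each run only happens if the previous one was clean; the patch must
    # survive three clean runs after a clean baseline run without it.
    def ews_status_for_wpt_import(run_layout_tests):
        # run_layout_tests(with_patch) -> number of unexpected failures.
        if run_layout_tests(with_patch=False):
            return 'red'  # pre-existing failures: the tested revision has issues
        for _ in range(3):
            if run_layout_tests(with_patch=True):
                return 'red'
        return 'green'  # all 4 runs completed with no unexpected failures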
Comment 13 Carlos Alberto Lopez Perez 2020-05-27 13:31:28 PDT
(In reply to Carlos Alberto Lopez Perez from comment #12)
> 
> Currently the EWS does this:
> 
>   - run1 layout-tests with patch
>   - run2 layout-tests with patch (only run if unexpected failures in run1)
>   - run3 layout-tests without patch (only run if unexpected failures in run1)
> 
> * EWS status: green if new failures in both run1 and run2 (but not run3).
> If a new failure happens on run1 but not run2, it is ignored.
> 

Sorry, I mean:

 * EWS status: red if new failures in both run1 and run2 (but not run3).
Comment 14 Alexey Proskuryakov 2020-05-27 13:36:54 PDT
One approach would be to provide a .txt with new test expectations; red/green is not so meaningful for imported tests.

Ideally, EWS would understand which tests were modified, and only stress test those. Eventually, we'd like to do that even for regular patches, not just for imports.
Comment 15 Carlos Alberto Lopez Perez 2020-05-27 14:23:59 PDT
(In reply to Alexey Proskuryakov from comment #14)
> One approach would be to provide a .txt with new test expectations;
> red/green is not so meaningful for imported tests.
> 

The patch to be tested will contain modified TestExpectations files.

The bot for updating the tests would file new bugs on Bugzilla (for the failures detected by the EWS on the first tries) and mark those tests as failing in the TestExpectations files.

If the EWS bots are green after testing the final version of the patch generated by this bot, that would mean the patch with the newly imported tests doesn't cause any unexpected failures (all new failures have been triaged).

The goal is to automate updating the imported WPT tests without causing new unexpected failures on the bots, to avoid creating extra work for gardeners and QA people. 

And the reviewer approving the patch for landing could have great confidence in that (since it passed 3 rounds on every EWS without causing any new failures or flakiness).

> Ideally, EWS would understand which tests were modified, and only stress
> test those. Eventually, we'd like to do that even for regular patches, not
> just for imports.

That can be very difficult to do for patches touching support files. A general update of the imported WPT tests may modify (for example) the Python program used to start the HTTP servers needed for testing (./wpt serve), and that may cause issues in specific tests that are not being updated by the patch under test, but are part of the set of tests imported by a previous patch.

Running the whole set of imported WPT tests is affordable (this would not run daily) and gives me more peace of mind that this will now cause unexpected issues.
Comment 16 Carlos Alberto Lopez Perez 2020-05-27 14:25:03 PDT
(In reply to Carlos Alberto Lopez Perez from comment #15)
> Running the whole set of imported WPT tests is affordable (this would not
> run daily) and gives me more peace of mind that this will now cause
> unexpected issues.

sorry, I mean: "and gives me more peace of mind that this will not cause unexpected issues"
Comment 17 Carlos Alberto Lopez Perez 2020-05-29 07:44:33 PDT
Can I have r+ on this patch?

I would like to start testing the output the EWS gives me. In future patches I can implement what we discussed here (trying to modify the logic of the EWS), but for now this is good enough to get started.
Comment 18 Jonathan Bedard 2020-05-29 14:07:10 PDT
(In reply to Carlos Alberto Lopez Perez from comment #17)
> Can I have r+ on this patch?
> 
> I would like to start testing the output the EWS gives me. In future
> patches I can implement what we discussed here (trying to modify the logic
> of the EWS), but for now this is good enough to get started.

I was hoping we could get Aakash's input on this before landing, but he has been out this week, so it seems wrong to block landing this on Aakash.

I do think we should put off landing this until Monday, because it will require a buildbot restart, which I think is unwise on a Friday afternoon.
Comment 19 Aakash Jain 2020-05-29 18:49:26 PDT
It looks good to me.
Comment 20 EWS 2020-06-01 09:15:47 PDT
Committed r262379: <https://trac.webkit.org/changeset/262379>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 400359 [details].
Comment 21 Radar WebKit Bug Importer 2020-06-01 09:16:19 PDT
<rdar://problem/63831282>
Comment 22 Carlos Alberto Lopez Perez 2020-06-01 09:46:59 PDT
(In reply to Jonathan Bedard from comment #18)
> I do think we should put off landing this until Monday, because it will
> require a buildbot restart, which I think is unwise Friday afternoon.

Sure. I have landed it today (Monday).

If it's possible to restart the EWS at some point, that would be great.

Thanks!
Comment 23 Jonathan Bedard 2020-06-03 13:19:44 PDT
(In reply to Carlos Alberto Lopez Perez from comment #22)
> (In reply to Jonathan Bedard from comment #18)
> > I do think we should put off landing this until Monday, because it will
> > require a buildbot restart, which I think is unwise Friday afternoon.
> 
> Sure. I have landed it today (Monday).
> 
> If it's possible to restart the EWS at some point, that would be great.
> 
> Thanks!

Buildbot has been restarted.

Sorry it took so long to land this!