Bug 212381 - [EWS] Add a special case for running the layout test step without aborting in case of many failures for WPT tests
Summary: [EWS] Add a special case for running the layout test step without aborting in case of many failures for WPT tests
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Carlos Alberto Lopez Perez
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-05-26 10:16 PDT by Carlos Alberto Lopez Perez
Modified: 2020-06-03 13:19 PDT
CC List: 8 users

See Also:


Attachments
Patch (15.64 KB, patch)
2020-05-26 10:22 PDT, Carlos Alberto Lopez Perez
no flags
Patch (15.64 KB, patch)
2020-05-27 11:45 PDT, Carlos Alberto Lopez Perez
no flags

Description Carlos Alberto Lopez Perez 2020-05-26 10:16:20 PDT
I want to prototype a bot that helps us update our WPT import checkout weekly <https://lists.webkit.org/pipermail/webkit-dev/2020-March/031121.html>

To do this, I want to (re-)use our EWS system to get the layout test results for all the platforms (Mac, iOS, etc.).

However, I'm finding that the EWS layout test results are not very useful when testing patches that touch a lot of tests (like a WPT import).

This is due to the EWS running the layout test step with:

 * "--exit-after-n-failures 30" -> usually a WPT import are thousands of new tests, if it aborts instead of giving back all the failures it is not really useful
 * "--skip-failing-tests" -> some tests may crash (or timeout) instead of fail in different platforms, if we don't try to re-recheck if they fail we miss that. See for example r262062 (I only found that after landing the patch, because I had marked that test as failing (since it fails on release), but the EWS debug was skipping it, therefore I couldn't know it was crashing on debug)

I understand these parameters are useful to make the EWS faster and are good enough for the general use case.

So I want to propose something like not passing those parameters to the EWS for patches uploaded by a specific Bugzilla user (the one this bot would use).
That way the EWS can be useful for this bot, while we keep the current behavior for everybody else.

It will also help (making everything faster and simpler) if, for this user, the layout test step only runs the tests inside the folder "imported/w3c/web-platform-tests/", since those are the only ones of interest to it anyway.
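
A minimal sketch of the proposal (hypothetical helper; the actual steps.py code may be structured differently):

    # Hypothetical sketch, not the actual steps.py code: pick run-webkit-tests
    # arguments based on who uploaded the patch.
    def layout_test_args(patch_author):
        if patch_author == 'webkit-wpt-import-bot@igalia.com':
            # WPT import bot: report every failure, re-run known-failing
            # tests, and only run the imported WPT directory.
            return ['imported/w3c/web-platform-tests']
        # Everybody else keeps the current fast-path behavior.
        return ['--exit-after-n-failures', '30', '--skip-failing-tests']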
Comment 1 Carlos Alberto Lopez Perez 2020-05-26 10:22:45 PDT
Created attachment 400257 [details]
Patch
Comment 2 Carlos Alberto Lopez Perez 2020-05-26 10:43:01 PDT
Note: I think the style check error is wrong. I followed the same style as the other tests above.
Comment 3 Jonathan Bedard 2020-05-26 11:15:16 PDT
Comment on attachment 400257 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=400257&action=review

> Tools/BuildSlaveSupport/ews-build/steps.py:1712
> +        if patch_author == 'webkit-wpt-import-bot@igalia.com':

This is an interesting idea. Aakash should weigh in.

It does feel like this should be a list; it seems likely that Apple will want a WPT import bot at some point in the future.
Comment 4 Alexey Proskuryakov 2020-05-26 11:45:17 PDT
Comment on attachment 400257 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=400257&action=review

> Tools/ChangeLog:3
> +        [EWS] Add a special case for running the layout test step without aborting in case of many failures for WPT tests

Not directly related to this patch, just a suggestion. I think that this EWS also needs to run a lot of iterations to detect flaky tests. That would likely be more intricate than simply passing an argument for iteration count.
Comment 5 Carlos Alberto Lopez Perez 2020-05-27 10:52:46 PDT
(In reply to Jonathan Bedard from comment #3)
> Comment on attachment 400257 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=400257&action=review
> 
> > Tools/BuildSlaveSupport/ews-build/steps.py:1712
> > +        if patch_author == 'webkit-wpt-import-bot@igalia.com':
> 
> This is an interesting idea. Aakash should weigh in.
> 

I talked with him on Slack; he was OK with this idea.

> It does feel like this should be a list, it seems likely that Apple will
> want a wpt import bot at some point in the future.

Yes! I will turn that into a list.
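
Something like this (hypothetical names; the actual patch may differ):

    # Accounts whose patches get the full, non-aborting WPT-only run.
    WPT_IMPORT_BOT_ACCOUNTS = [
        'webkit-wpt-import-bot@igalia.com',
        # Room here for other vendors' import bots in the future.
    ]

    def is_wpt_import_bot(patch_author):
        return patch_author in WPT_IMPORT_BOT_ACCOUNTS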
Comment 6 Carlos Alberto Lopez Perez 2020-05-27 11:32:08 PDT
(In reply to Alexey Proskuryakov from comment #4)
> Comment on attachment 400257 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=400257&action=review
> 
> > Tools/ChangeLog:3
> > +        [EWS] Add a special case for running the layout test step without aborting in case of many failures for WPT tests
> 
> Not directly related to this patch, just a suggestion. I think that this EWS
> also needs to run a lot of iterations to detect flaky tests. That would
> likely be more intricate than simply passing an argument for iteration count.

Great idea.

I was just playing with passing "--iterations N" to run-webkit-tests, and it doesn't behave as I was expecting. It seems it marks a test as failing if it failed on more than N/2 iterations, but otherwise marks it as passing. For example: over 3 iterations, if a test passes 2 and fails 1, it is simply reported as a pass.

So I think I would need something different. Possible ideas that come to mind:

 * Add a new "--detect-flaky-tests N" switch to run-webkit-tests that does the same as "--iterations" but reports all the different results it got for each test.
 * Have the bot upload the final patch N times and require everything green (on the EWS) all N times before asking for review.


Not sure yet which one would be the best approach.
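
To illustrate the first idea, a rough sketch of what a "--detect-flaky-tests N" mode could do (hypothetical; this switch doesn't exist in run-webkit-tests today):

    from collections import Counter

    # Run each test N times and report every distinct result it produced,
    # instead of reducing the iterations to a single pass/fail verdict.
    def detect_flaky(run_test, tests, iterations):
        # run_test(test) returns 'PASS', 'FAIL', 'TIMEOUT' or 'CRASH'.
        report = {}
        for test in tests:
            results = Counter(run_test(test) for _ in range(iterations))
            if len(results) > 1:  # more than one distinct result => flaky
                report[test] = dict(results)
        return report  # e.g. {'foo.html': {'PASS': 2, 'FAIL': 1}}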
Comment 7 Carlos Alberto Lopez Perez 2020-05-27 11:45:57 PDT
Created attachment 400359 [details]
Patch
Comment 8 Jonathan Bedard 2020-05-27 11:51:57 PDT
(In reply to Carlos Alberto Lopez Perez from comment #6)
> I was just playing with passing "--iterations N" to run-webkit-tests, and it
> doesn't behave as I was expecting. It seems it marks a test as failing if it
> failed on more than N/2 iterations, but otherwise marks it as passing. For
> example: over 3 iterations, if a test passes 2 and fails 1, it is simply
> reported as a pass.

To me, that feels like the wrong behavior. If you run with multiple iterations, we should use the worst result as the result of the test, but that's definitely a different bug.

Comment 9 Carlos Alberto Lopez Perez 2020-05-27 12:02:06 PDT
(In reply to Jonathan Bedard from comment #8)
> To me, that feels like the wrong behavior. If you run with multiple
> iterations, we should use the worst result as the result of the test, but
> that's definitely a different bug.

It's not always obvious which one is the worst result. For example, imagine a test that gives: Failure, Timeout, and Pass.

Which is worse: Failure or Timeout? The expectation would have to contain both cases (or be a Skip) to be OK.

So I see value in reporting all the results and letting the user/tool choose how to deal with that.
Comment 10 Alexey Proskuryakov 2020-05-27 12:17:18 PDT
My idea was that the logic would be in EWS, not in run-webkit-tests. That removes any API barriers that could complicate flaky test detection and reporting.

A counter-argument is that having the logic in EWS means that we need to restart the master when the code changes.
Comment 11 Jonathan Bedard 2020-05-27 12:48:56 PDT
(In reply to Carlos Alberto Lopez Perez from comment #9)
> It's not always obvious which one is the worst result. For example, imagine
> a test that gives: Failure, Timeout, and Pass.
> 
> Which is worse: Failure or Timeout? The expectation would have to contain
> both cases (or be a Skip) to be OK.
> 
> So I see value in reporting all the results and letting the user/tool choose
> how to deal with that.

There is actually a canonical hierarchy we defined for the results database (in Tools/Scripts/webkitpy/results/upload.py), but your point about needing multiple expectations is a good one.
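
For illustration, a toy version of picking a "worst" result with such a hierarchy (the ranking below is an assumption for the example, not the actual ordering from upload.py):

    # Assumed severity ranking for illustration only; see
    # Tools/Scripts/webkitpy/results/upload.py for the canonical one.
    SEVERITY = {'PASS': 0, 'FAIL': 1, 'TIMEOUT': 2, 'CRASH': 3}

    def worst_result(results):
        # Return the most severe result seen across iterations.
        return max(results, key=lambda result: SEVERITY.get(result, 0))

    print(worst_result(['PASS', 'FAIL', 'TIMEOUT']))  # -> TIMEOUT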
Comment 12 Carlos Alberto Lopez Perez 2020-05-27 13:29:45 PDT
(In reply to Alexey Proskuryakov from comment #10)
> My idea was that the logic would be in EWS, not in run-webkit-tests. That
> removes any API barriers that could be complicating flaky test detection and
> reporting.
> 
> A counter-argument is that having the logic in EWS means that we need to
> restart the master when code changes.

Having the logic in the EWS is a really interesting idea.

Currently the EWS does this:

  - run1 layout-tests with patch
  - run2 layout-tests with patch (only run if unexpected failures in run1)
  - run3 layout-tests without patch (only run if unexpected failures in run1)

* EWS status: green if new failures in both run1 and run2 (but not run3). If a new failure happens on run1 but not run2, it is ignored.


I'm thinking of modifying the logic of the EWS so that, for this special use case, it does this:

  - run1 layout-tests without patch
  - run2 layout-tests with patch (only run if _no_ unexpected failures in run1)
  - run3 layout-tests with patch (only run if _no_ unexpected failures in run2)
  - run4 layout-tests with patch (only run if _no_ unexpected failures in run3)


* EWS result: green if there are no unexpected failures in any of the runs and all 4 runs completed.

That way I can assert that:
  - No pre-existing failures were present (for example, due to the WebKit revision being tested having issues)
  - The patch passed 3 runs without any unexpected results.


What do you think?
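
In pseudo-Python, the proposed flow would look like this (hypothetical function names, just to make the gating explicit):

    # Each run only happens if the previous one was clean; the patch must
    # survive three clean runs after a clean baseline run without it.
    def ews_status_for_wpt_import(run_layout_tests):
        # run_layout_tests(with_patch) -> number of unexpected failures.
        if run_layout_tests(with_patch=False):
            return 'red'  # pre-existing failures: the tested revision has issues
        for _ in range(3):
            if run_layout_tests(with_patch=True):
                return 'red'
        return 'green'  # all 4 runs completed with no unexpected failures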
Comment 13 Carlos Alberto Lopez Perez 2020-05-27 13:31:28 PDT
(In reply to Carlos Alberto Lopez Perez from comment #12)
> 
> Currently the EWS does this:
> 
>   - run1 layout-tests with patch
>   - run2 layout-tests with patch (only run if unexpected failures in run1)
>   - run3 layout-tests without patch (only run if unexpected failures in run1)
> 
> * EWS status: green if new failures in both run1 and run2 (but not run3).
> If a new failure happens on run1 but not run2, it is ignored.
> 

Sorry, I mean:

 * EWS status: red if new failures in both run1 and run2 (but not run3).
Comment 14 Alexey Proskuryakov 2020-05-27 13:36:54 PDT
One approach would be to provide a .txt with new test expectations; red/green is not so meaningful for imported tests.

Ideally, EWS would understand which tests were modified, and only stress test those. Eventually, we'd like to do that even for regular patches, not just for imports.
Comment 15 Carlos Alberto Lopez Perez 2020-05-27 14:23:59 PDT
(In reply to Alexey Proskuryakov from comment #14)
> One approach would be to provide a .txt with new test expectations;
> red/green is not so meaningful for imported tests.
> 

The patch to be tested will contain modified TestExpectations files.

The bot for updating the tests would file new bugs on Bugzilla (for the failures detected by the EWS on the first tries) and mark those tests as failing in the TestExpectations files.

If the EWS bots are green after testing the final version of the patch generated by this bot, that would mean the patch with the newly imported tests doesn't cause any unexpected failures (all new failures have been triaged).

The goal is to automate updating the imported WPT tests without causing new unexpected failures on the bots, to avoid creating extra work for gardeners and QA people. 

And the reviewer approving the patch for landing could have great confidence in that (since it passed 3 rounds on every EWS without causing any new failures or flakiness).

> Ideally, EWS would understand which tests were modified, and only stress
> test those. Eventually, we'd like to do that even for regular patches, not
> just for imports.

That can be very difficult to do for patches touching support files. A general update of the imported WPT tests may modify (for example) the Python program used to start the HTTP servers needed for testing (./wpt serve), and that may cause issues in specific tests that are not being updated by the patch under test, but are part of the set of tests imported by a previous patch.

Running the whole set of imported WPT tests is affordable (this would not run daily) and gives me more peace of mind that this will now cause unexpected issues.
Comment 16 Carlos Alberto Lopez Perez 2020-05-27 14:25:03 PDT
(In reply to Carlos Alberto Lopez Perez from comment #15)
> Running the whole set of imported WPT tests is affordable (this would not
> run daily) and gives me more peace of mind that this will now cause
> unexpected issues.

sorry, I mean: "and gives me more peace of mind that this will not cause unexpected issues"
Comment 17 Carlos Alberto Lopez Perez 2020-05-29 07:44:33 PDT
Can I have r+ on this patch?

I would like to start testing the output the EWS gives me. In future patches I can implement what we discussed here (trying to modify the logic of the EWS), but for now this is good enough to get started.
Comment 18 Jonathan Bedard 2020-05-29 14:07:10 PDT
(In reply to Carlos Alberto Lopez Perez from comment #17)
> Can I have r+ on this patch?
> 
> I would like to start testing the output the EWS gives me. In future
> patches I can implement what we discussed here (trying to modify the logic
> of the EWS), but for now this is good enough to get started.

I was hoping we could get Aakash's input on this before landing, but he has been out this week, so it seems wrong to block landing this on Aakash.

I do think we should put off landing this until Monday, because it will require a buildbot restart, which I think is unwise on a Friday afternoon.
Comment 19 Aakash Jain 2020-05-29 18:49:26 PDT
It looks good to me.
Comment 20 EWS 2020-06-01 09:15:47 PDT
Committed r262379: <https://trac.webkit.org/changeset/262379>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 400359 [details].
Comment 21 Radar WebKit Bug Importer 2020-06-01 09:16:19 PDT
<rdar://problem/63831282>
Comment 22 Carlos Alberto Lopez Perez 2020-06-01 09:46:59 PDT
(In reply to Jonathan Bedard from comment #18)
> I do think we should put off landing this until Monday, because it will
> require a buildbot restart, which I think is unwise Friday afternoon.

Sure. I have landed it today (Monday).

If it's possible to restart the EWS at some point, that would be great.

Thanks!
Comment 23 Jonathan Bedard 2020-06-03 13:19:44 PDT
(In reply to Carlos Alberto Lopez Perez from comment #22)
> (In reply to Jonathan Bedard from comment #18)
> > I do think we should put off landing this until Monday, because it will
> > require a buildbot restart, which I think is unwise Friday afternoon.
> 
> Sure. I have landed it today (Monday).
> 
> If it's possible to restart the EWS at some point, that would be great.
> 
> Thanks!

Buildbot has been restarted.

Sorry it took so long to land this!