WebKit Bugzilla
UNCONFIRMED
108288
On EWS bots, run tests on patch once instead of twice if some tests fail.
https://bugs.webkit.org/show_bug.cgi?id=108288
Summary
On EWS bots, run tests on patch once instead of twice if some tests fail.
Timothy Loh
Reported
2013-01-29 21:04:21 PST
On EWS bots, run tests on patch once instead of twice if some tests fail.
Attachments
Patch (20.74 KB, patch), 2013-01-29 21:14 PST, Timothy Loh. Flags: dpranke: review-, buildbot: commit-queue-
Timothy Loh
Comment 1
2013-01-29 21:14:50 PST
Created attachment 185388 [details]
Patch
Eric Seidel (no email)
Comment 2
2013-01-29 21:29:46 PST
Doesn't that mean that the EWS will have to rebuild between each try now?
Timothy Loh
Comment 3
2013-01-29 21:34:51 PST
(In reply to
comment #2
)
> Doesn't that mean that the EWS will have to rebuild between each try now?
Assuming this is what you're thinking of, the RETRY_FAILURES parameter only pertains to what NRWT does, which doesn't involve building at all. If you meant something else, you'll have to explain.
Adam Barth
Comment 4
2013-01-29 23:59:24 PST
Comment on attachment 185388 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=185388&action=review
> Tools/ChangeLog:13
> +        Currently the bots will re-run all the tests for a patch if some tests
> +        fail twice. Since we retry failures on NRWT, this functionality doesn't
> +        help us much. To compensate for not re-running everything, the number of
> +        retries is now a parameter to NRWT which we set as 3 in RunTests.run().
> +        Flakiness detection here is temporarily removed until we can read it
> +        from full_results.json.
How many times does a flaky test need to fail in a row in order to cause the EWS to report a false failure? Previously, the test would need to flake four times in a row.
Build Bot
Comment 5
2013-01-30 00:36:54 PST
Comment on attachment 185388 [details]
Patch

Attachment 185388 [details] did not pass mac-ews (mac): Output:
http://queues.webkit.org/results/16183539
New failing tests: http/tests/inspector/console-resource-errors.html http/tests/security/mixedContent/redirect-https-to-http-iframe-in-main-frame.html
Timothy Loh
Comment 6
2013-01-30 01:11:08 PST
(In reply to comment #4)
> (From update of attachment 185388 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=185388&action=review
>
> > Tools/ChangeLog:13
> > +        Currently the bots will re-run all the tests for a patch if some tests
> > +        fail twice. Since we retry failures on NRWT, this functionality doesn't
> > +        help us much. To compensate for not re-running everything, the number of
> > +        retries is now a parameter to NRWT which we set as 3 in RunTests.run().
> > +        Flakiness detection here is temporarily removed until we can read it
> > +        from full_results.json.
>
> How many times does a flaky test need to fail in a row in order to cause the EWS to report a false failure? Previously, the test would need to flake four times in a row.
Flaky tests will still need to fail four times in a row (one fail + three retries), but it's a trivial change to increase the number to (say) ten.
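For concreteness, a rough back-of-the-envelope sketch in Python (illustrative only; it assumes each invocation of the test fails independently with a fixed probability p, which is not something measured in this bug):

    # Illustrative model: a flaky test produces a false EWS failure only if it
    # fails the initial run and every retry.
    def false_failure_probability(p, retries):
        return p ** (1 + retries)

    print(false_failure_probability(0.1, 3))   # 4 runs in a row:  1e-4
    print(false_failure_probability(0.1, 9))   # 10 runs in a row: 1e-10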
Adam Barth
Comment 7
2013-01-30 01:25:49 PST
> Flaky tests will still need to fail four times in a row (one fail + three retries), but it's a trivial change to increase the number to (say) ten.
Ah, the number of retries is in addition to the first failure. By the way, this sounds like a great approach. What we're doing now is really silly.
Eric Seidel (no email)
Comment 8
2013-01-30 02:18:28 PST
Comment on attachment 185388 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=185388&action=review
> Tools/Scripts/webkitpy/layout_tests/run_webkit_tests_integrationtest.py:645
> -        res, err, _ = logging_run(['--no-retry-failures', '--clobber-old-results', 'failures/flaky'], tests_included=True, host=host)
> +        res, err, _ = logging_run(['--retry-failures=0', '--clobber-old-results', 'failures/flaky'], tests_included=True, host=host)
I think users use --no-retry-failures on NRWT, so this may be slightly confusing and worth announcing on webkit-dev.
> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:197
> +        # FIXME: We'll want to add LayoutTestResults.flaky_tests() first
> +        # Only report flaky tests if we were successful at parsing results.json and archiving results.
> +        # if first_results and first_results_archive:
> +        #     self._report_flaky_tests(first_results.failing_test_results(), first_results_archive)
So now we're going to depend on NRWT reporting the flaky tests to us? That sounds reasonable, but may not work for other methods of testing?
Build Bot
Comment 9
2013-01-30 02:59:46 PST
Comment on attachment 185388 [details]
Patch

Attachment 185388 [details] did not pass mac-wk2-ews (mac-wk2): Output:
http://queues.webkit.org/results/16202614
New failing tests:
canvas/philip/tests/2d.path.arc.scale.2.html
http/tests/security/mixedContent/redirect-https-to-http-iframe-in-main-frame.html
canvas/philip/tests/2d.path.stroke.skew.html
fast/canvas/webgl/read-pixels-test.html
canvas/philip/tests/2d.text.draw.fontface.notinpage.html
canvas/philip/tests/2d.path.arcTo.scale.html
canvas/philip/tests/2d.path.stroke.scale1.html
canvas/philip/tests/2d.path.arcTo.transformation.html
Dirk Pranke
Comment 10
2013-01-30 12:45:26 PST
Comment on attachment 185388 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=185388&action=review
> Tools/Scripts/webkitpy/layout_tests/controllers/manager.py:206
> +            repeat_each=retries, iterations=1, num_workers=1, retrying=True)
There are important differences between --retry-failures and --repeat-each that are getting glossed over here with your change (and with this patch as a whole).

Currently we save the results for both the initial run (into initial_results) and the retry run (into retry_results) and compute the aggregate results and return code downstream based on knowing what happens on each run. This means that, for example, the flakiness dashboard gets the results of both times the test is invoked (although we don't actually use that today).

When you use --repeat-each, each result gets overwritten, so you only know the result for the last invocation. In other words, if the test fails on the last run, you'll think the test is failing, even if it passed the previous six times and should be considered flaky.

Put differently, --repeat-each and --iterations are not designed to be used on the bots; I don't have any good sense of what sort of problems this'll expose.

Now, one could argue that we should combine --repeat-each and --retry-failures as you're trying to do, but I'd want to step back a bit first and understand *why* you want to combine them, i.e., what we're looking to gain from this.

For example, I'm not following what we're attempting to fix with this patch as a whole, apart from the fact that we have two layers both retrying things. Is the belief that NRWT's built-in retry logic is reporting too many flaky tests as actual failures (i.e., that EWS needs to retry more often in order to figure out what's really failing)?

When I initially added the retry logic to NRWT, I actually had the ability to retry N times. It turned out nobody really wanted to retry more than once and we didn't really use the feature, so I simplified it to the code that we have today. Of course, we weren't using this for EWS at the time ...
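To illustrate the bookkeeping difference described above, a minimal sketch (the dictionaries and test name are hypothetical, not the actual manager.py data structures):

    # With retry-failures, the outcome of each pass over the failing tests is
    # kept separately, so downstream code can tell a flake from a real failure.
    initial_results = {"fast/example.html": "FAIL"}
    retry_results = {"fast/example.html": "PASS"}
    is_flaky = (initial_results["fast/example.html"] == "FAIL" and
                retry_results["fast/example.html"] == "PASS")
    print(is_flaky)  # True

    # With --repeat-each, every iteration writes into the same slot, so only
    # the last outcome survives and the earlier passes are lost.
    results = {}
    for outcome in ["PASS", "PASS", "FAIL"]:  # three invocations of one test
        results["fast/example.html"] = outcome
    print(results["fast/example.html"])  # "FAIL" -- looks like a hard failure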
Timothy Loh
Comment 11
2013-01-30 17:00:17 PST
(In reply to comment #10)
> (From update of attachment 185388 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=185388&action=review
>
> > Tools/Scripts/webkitpy/layout_tests/controllers/manager.py:206
> > +            repeat_each=retries, iterations=1, num_workers=1, retrying=True)
>
> There are important differences between --retry-failures and --repeat-each that are getting glossed over here with your change (and with this patch as a whole).
>
> Currently we save the results for both the initial run (into initial_results) and the retry run (into retry_results) and compute the aggregate results and return code downstream based on knowing what happens on each run. This means that, for example, the flakiness dashboard gets the results of both times the test is invoked (although we don't actually use that today).
>
> When you use --repeat-each, each result gets overwritten, so you only know the result for the last invocation. In other words, if the test fails on the last run, you'll think the test is failing, even if it passed the previous six times and should be considered flaky.
>
> Put differently, --repeat-each and --iterations are not designed to be used on the bots; I don't have any good sense of what sort of problems this'll expose.
>
> Now, one could argue that we should combine --repeat-each and --retry-failures as you're trying to do, but I'd want to step back a bit first and understand *why* you want to combine them, i.e., what we're looking to gain from this.
>
> For example, I'm not following what we're attempting to fix with this patch as a whole, apart from the fact that we have two layers both retrying things. Is the belief that NRWT's built-in retry logic is reporting too many flaky tests as actual failures (i.e., that EWS needs to retry more often in order to figure out what's really failing)?
>
> When I initially added the retry logic to NRWT, I actually had the ability to retry N times. It turned out nobody really wanted to retry more than once and we didn't really use the feature, so I simplified it to the code that we have today. Of course, we weren't using this for EWS at the time ...
What I want to achieve with this patch is to simplify the logic for running tests and to increase the maximum throughput of the EWS bots (since the tests that behave as expected only get run once with a submitted patch). As you mention, we can also reduce the number of flaky tests reported as failures with negligible runtime penalties (although I kept the number of total runs the same here).

My main idea here, though, is to set us up for being better at actually reporting flaky tests by seeing what NRWT outputs regarding flakiness (dependent on getting the FlakyTestReporter working and replying to Eric's comment #8) -- note for example that if a test flakes with a small probability p, the current strategy would need about 1/p^2 runs to detect it as flaky, while the new strategy only needs about 1/p runs. It seems too wasteful to have flakiness information from NRWT which we then ignore.
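A rough model behind that 1/p^2 vs. 1/p estimate (illustrative only; it assumes each invocation fails independently with probability p, a single flaky test, and ignores the clean-tree baseline run):

    p = 0.05  # assumed per-invocation flake probability, chosen for illustration

    # Current strategy: NRWT retries a failure once, so a full run only reports
    # the test as failing if it fails twice in a row (probability p*p), and the
    # EWS only notices flakiness when its two full runs disagree.
    old_detect_per_patch = 2 * (p * p) * (1 - p * p)   # roughly 2*p^2
    # New strategy: NRWT itself flags the test as flaky as soon as it fails once
    # and then passes on a retry.
    new_detect_per_patch = p * (1 - p)                  # roughly p

    print(1 / old_detect_per_patch)  # ~200 patch runs before the flake is noticed
    print(1 / new_detect_per_patch)  # ~21 patch runs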
Dirk Pranke
Comment 12
2013-01-30 17:30:08 PST
(In reply to comment #11)
> What I want to achieve with this patch is to simplify the logic for running tests and to increase the maximum throughput of the EWS bots (since the tests that behave as expected only get run once with a submitted patch). As you mention, we can also reduce the number of flaky tests reported as failures with negligible runtime penalties (although I kept the number of total runs the same here).

I've unfortunately forgotten the details of how EWS views and treats flaky tests and how that's different from how NRWT and the flakiness dashboard view them, and so it's hard for me to respond.

In my ideal world, all of the currently flaky (layout) tests would be marked as such in TestExpectations files, and EWS would not try to have a separate concept of flakiness that it was filtering out. IIRC, we originally didn't do this at least because EWS was designed to work with old-run-webkit-tests, which didn't support marking tests as flaky. And, of course, you need some sort of equivalent concept for flaky C++ tests if you want to worry about them. I wouldn't worry about flaky Python tests; we don't have a big enough problem there.
> My main idea here, though, is to set us up for being better at actually reporting flaky tests by seeing what NRWT outputs regarding flakiness (dependent on getting the FlakyTestReporter working and replying to Eric's comment #8) -- note for example that if a test flakes with a small probability p, the current strategy would need about 1/p^2 runs to detect it as flaky, while the new strategy only needs about 1/p runs. It seems too wasteful to have flakiness information from NRWT which we then ignore.
I'm not sure what you mean by "actually reporting flaky tests". Reporting where? to EWS? to the flakiness dashboard? somewhere else?
Timothy Loh
Comment 13
2013-01-30 21:23:05 PST
(In reply to comment #12)
> I've unfortunately forgotten the details of how EWS views and treats flaky tests and how that's different from how NRWT and the flakiness dashboard view them, and so it's hard for me to respond.
The EWS currently re-runs all the tests when any of them fail (twice). The tests that fail on the first NRWT run and pass on the second run (and vice versa, but we're ignoring these at the moment) are considered flaky. The EWS currently lets a patch pass if one of the two runs is clean, or if the results are the same and are to be expected given the failures from the clean tree. If the two runs fail on different tests, we grab a new patch and leave the current one in the queue (this leads to the cycling behaviour we sometimes see, which we should probably get rid of); a rough sketch of this decision logic follows below.

The flakiness dashboard looks like it actually uses the flakiness output from NRWT. I think it takes the worst run (e.g. crash > text > pass) for the colour you see.
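A minimal, purely illustrative sketch of that EWS decision logic (function and argument names are hypothetical; the actual logic lives in webkitpy, e.g. the patchanalysistask.py code quoted above):

    # Each argument is assumed to be a set of failing test names.
    def ews_verdict(first_run_failures, second_run_failures, clean_tree_failures):
        if not first_run_failures or not second_run_failures:
            return "pass"   # at least one of the two full runs was clean
        if first_run_failures == second_run_failures:
            if first_run_failures <= clean_tree_failures:
                return "pass"   # the same failures already occur on a clean tree
            return "fail"       # consistent new failures: blame the patch
        return "defer"          # the runs disagree: requeue and try the next patch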
> In my ideal world, all of the currently flaky (layout) tests would be marked as such in TestExpectations files, and EWS would not try to have a separate concept of flakiness that it was filtering out. IIRC, we originally didn't do this at least because EWS was designed to work with old-run-webkit-tests, which didn't support marking tests as flaky. And, of course, you need some sort of equivalent concept for flaky C++ tests if you want to worry about them.
>
> I'm not sure what you mean by "actually reporting flaky tests". Reporting where? to EWS? to the flakiness dashboard? somewhere else?
The commit-queue used to report flaky tests via the FlakyTestReporter (see the master - Bug 50856). This isn't working at the moment for as-yet-unknown reasons, but what I'd want is for it to post patches that annotate the TestExpectations files for us (the CQ doesn't re-run tests if the EWS already ran, so we'd need to pass the info along). What I'd also like to happen is for the FlakyTestReporter to post a message on your patch if it finds new flaky tests, but we'd need the existing flakes to be marked in TestExpectations before this can happen (otherwise it'll get turned off :-( ).
Timothy Loh
Comment 14
2013-01-30 21:37:26 PST
(In reply to comment #8)
> (From update of attachment 185388 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=185388&action=review
>
> > Tools/Scripts/webkitpy/layout_tests/run_webkit_tests_integrationtest.py:645
> > -        res, err, _ = logging_run(['--no-retry-failures', '--clobber-old-results', 'failures/flaky'], tests_included=True, host=host)
> > +        res, err, _ = logging_run(['--retry-failures=0', '--clobber-old-results', 'failures/flaky'], tests_included=True, host=host)
>
> I think users use --no-retry-failures on NRWT, so this may be slightly confusing and worth announcing on webkit-dev.
I can of course leave the flag in, so as to not bother others.
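For example, a minimal sketch of keeping --no-retry-failures as an alias for --retry-failures=0 (the option wiring and default are assumptions based on the patch under review, not the final code):

    import optparse

    parser = optparse.OptionParser()
    parser.add_option("--retry-failures", type="int", dest="retry_failures",
                      default=1, help="Number of times NRWT retries failing tests.")
    # Keep the old flag working so existing users and scripts are unaffected.
    parser.add_option("--no-retry-failures", action="store_const", const=0,
                      dest="retry_failures", help="Same as --retry-failures=0.")

    options, _ = parser.parse_args(["--no-retry-failures"])
    assert options.retry_failures == 0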
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:197
> > +        # FIXME: We'll want to add LayoutTestResults.flaky_tests() first
> > +        # Only report flaky tests if we were successful at parsing results.json and archiving results.
> > +        # if first_results and first_results_archive:
> > +        #     self._report_flaky_tests(first_results.failing_test_results(), first_results_archive)
>
> So now we're going to depend on NRWT reporting the flaky tests to us? That sounds reasonable, but may not work for other methods of testing?
Good point. Do we expect that we'll be using the code with other tests though? The existing code already does some re-running to detect flaky tests, which I imagine probably won't end up used by the other subclasses of PatchAnalysisTask (StyleQueueTask and PerfalizerTask), but you might think otherwise.
Dirk Pranke
Comment 15
2013-01-31 14:29:19 PST
(In reply to comment #13)
> The flakiness dashboard looks like it actually uses the flakiness output from NRWT. I think it takes the worst run (e.g. crash > text > pass) for the colour you see.
Where are you seeing this? I didn't see anything obvious in the flakiness dashboard's JavaScript code, and it looks like we only upload the results from the initial run in manager.py's _upload_json_files. But I didn't look carefully, and maybe I've forgotten something.

Ojan, do you recall anything one way or another?
Ojan Vafai
Comment 16
2013-01-31 14:46:07 PST
(In reply to comment #15)
> (In reply to comment #13)
> > The flakiness dashboard looks like it actually uses the flakiness output from NRWT. I think it takes the worst run (e.g. crash > text > pass) for the colour you see.
>
> Where are you seeing this? I didn't see anything obvious in the flakiness dashboard's JavaScript code, and it looks like we only upload the results from the initial run in manager.py's _upload_json_files. But I didn't look carefully, and maybe I've forgotten something.
>
> Ojan, do you recall anything one way or another?
That sounds right to me. We only upload the first result, I believe. It's been a long-standing FIXME to make it so that we upload both and have the dashboard provide a UI for showing both.