Bug 138184 - CommitQueue and EWS should reject any patches that result in consistent test failures that aren't present on the tree.
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Jake Nielsen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-29 14:19 PDT by Jake Nielsen
Modified: 2014-10-30 11:33 PDT
CC: 6 users

See Also:


Attachments
Alters PatchAnalysisTask's behavior, and changes the unit tests accordingly. (14.10 KB, patch)
2014-10-29 14:24 PDT, Jake Nielsen
ap: review+
dbates: commit-queue-
Addresses comments (14.02 KB, patch)
2014-10-29 15:01 PDT, Jake Nielsen
no flags
Changes sets to frozen sets, and computes frozenset(first_results.failing_test_results()) only once. (14.16 KB, patch)
2014-10-29 16:09 PDT, Jake Nielsen
no flags

Description Jake Nielsen 2014-10-29 14:19:40 PDT
Example: https://bugs.webkit.org/show_bug.cgi?id=137132
Comment 1 Jake Nielsen 2014-10-29 14:24:52 PDT
Created attachment 240624 [details]
Alters PatchAnalysisTask's behavior, and changes the unit tests accordingly.
Comment 2 Alexey Proskuryakov 2014-10-29 14:49:12 PDT
Comment on attachment 240624 [details]
Alters PatchAnalysisTask's behavior, and changes the unit tests accordingly.

View in context: https://bugs.webkit.org/attachment.cgi?id=240624&action=review

> Tools/ChangeLog:3
> +        CommitQueue should reject any patches result in consistent test

Is it just the commit queue?

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:-231
> -            # we happen to hit the --exit-after-N-failures limit.

Do we have enough test coverage for when we hit the limit of failures?

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:228
> +            tests_that_only_failed_first = list(set(first_results.failing_test_results()) - set(second_results.failing_test_results()))

Why is it not always a set? It seems wasteful to convert every time.

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:243
> +                new_failures_introduced_by_patch = list(set(tests_that_consistently_failed) - set(tests_that_failed_on_tree))
> +                if new_failures_introduced_by_patch:
> +                    self.failure_status_id = first_failure_status_id
> +                    return self.report_failure(first_results_archive, new_failures_introduced_by_patch, first_script_error)

I don't think that this heuristic is quite right. If there is enough flakiness in the tree, we could reject a perfectly good patch. We should probably refuse to give a result when there is much flakiness.

However, I'm not sure what "enough" would be here.
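
To make the concern concrete, here is a purely illustrative sketch (the threshold and the helper are invented, not anything in the patch under review; the set names are borrowed from the final revision of the patch):

    # Hypothetical guard: if the two runs with the patch disagree on too many
    # tests, the results are too noisy to judge the patch either way.
    MAX_FLAKY_TESTS = 10  # invented threshold; "enough" is exactly the open question

    def too_flaky_to_judge(first_failing_results_set, second_failing_results_set):
        # Tests that failed in exactly one of the two runs (symmetric difference).
        flaky = first_failing_results_set ^ second_failing_results_set
        return len(flaky) > MAX_FLAKY_TESTS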

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:247
> +            # were introduced. This is a bit of a grey-zone, and for now I will leave this code path as it was,
> +            # AKA, defer.

I'd finish this after "grey-zone", as the rest is not very helpful to the reader.
Comment 3 Daniel Bates 2014-10-29 14:55:45 PDT
Comment on attachment 240624 [details]
Alters PatchAnalysisTask's behavior, and changes the unit tests accordingly.

View in context: https://bugs.webkit.org/attachment.cgi?id=240624&action=review

I didn't check this patch for correctness. I noticed some minor nits and had some questions.

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:234
> +            tests_that_only_failed_first = list(set(first_results.failing_test_results()) - set(second_results.failing_test_results()))
> +            self._report_flaky_tests(tests_that_only_failed_first, first_results_archive)
> +
> +            tests_that_only_failed_second = list(set(second_results.failing_test_results()) - set(first_results.failing_test_results()))
> +            self._report_flaky_tests(tests_that_only_failed_second, second_results_archive)
> +
> +            tests_that_consistently_failed = list(set(second_results.failing_test_results()) & set(first_results.failing_test_results()))

This is inefficient as we are computing set(first_results.failing_test_results()) and set(second_results.failing_test_results()) multiple times. We should compute these values once. Is it necessary to create mutable sets? Or can we use frozenset()?

As far as I can tell from briefly searching through the Python code, we prefer using set.intersection() over the overloaded binary-and operator (&) when computing the intersection of two sets/frozensets. I suggest we use set.intersection() for consistency.

See <https://docs.python.org/2/library/stdtypes.html#frozenset> for more details on frozenset().

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:240
> +                new_failures_introduced_by_patch = list(set(tests_that_consistently_failed) - set(tests_that_failed_on_tree))

Notice that we made tests_that_consistently_failed a list data type above (line 234) by explicitly passing the right-hand side set computation to the list constructor. Instead of converting tests_that_consistently_failed from a set to a list to a set, I suggest we let tests_that_consistently_failed be a set.
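
Spelled out, the suggested shape is roughly the following (a sketch only, using the names the final revision of the patch ended up with):

    # Compute each failing-results set exactly once, as an immutable frozenset.
    first_failing_results_set = frozenset(first_results.failing_test_results())
    second_failing_results_set = frozenset(second_results.failing_test_results())

    tests_that_only_failed_first = first_failing_results_set.difference(second_failing_results_set)
    tests_that_only_failed_second = second_failing_results_set.difference(first_failing_results_set)

    # Keep this a set; convert to a list at the last moment if a caller needs one.
    tests_that_consistently_failed = second_failing_results_set.intersection(first_failing_results_set)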

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:245
> +            # At this point we know that there is flakyness with the patch applied, but no consistent failures

Nit: flakyness => flakiness
Comment 4 Jake Nielsen 2014-10-29 15:01:53 PDT
Created attachment 240630 [details]
Addresses comments
Comment 5 Daniel Bates 2014-10-29 15:11:20 PDT
Comment on attachment 240630 [details]
Addresses comments

View in context: https://bugs.webkit.org/attachment.cgi?id=240630&action=review

> Tools/Scripts/webkitpy/tool/bot/commitqueuetask_unittest.py:178
> +        self.assertEqual(set(commit_queue.get_reported_flaky_tests()), set(expected_reported_flaky_tests))

Can you elaborate on this change?
Comment 6 Jake Nielsen 2014-10-29 15:19:46 PDT
I just stepped out for some food, but I'll address these comments when I get back.
Comment 7 Jake Nielsen 2014-10-29 16:05:49 PDT
(In reply to comment #5)
> Comment on attachment 240630 [details]
> Addresses comments
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=240630&action=review
> 
> > Tools/Scripts/webkitpy/tool/bot/commitqueuetask_unittest.py:178
> > +        self.assertEqual(set(commit_queue.get_reported_flaky_tests()), set(expected_reported_flaky_tests))
> 
> Can you elaborate on this change?

This just makes it less annoying to define which tests you expect to get reported. The order in which they appear doesn't matter, so checking set equality makes more sense.
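A minimal illustration of the difference (not from the patch, just Python semantics):

    # List equality is order-sensitive; set equality is not.
    assert ["a.html", "b.html"] != ["b.html", "a.html"]
    assert set(["a.html", "b.html"]) == set(["b.html", "a.html"])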
Comment 8 Jake Nielsen 2014-10-29 16:09:12 PDT
Created attachment 240637 [details]
Changes sets to frozen sets, and computes frozenset(first_results.failing_test_results()) only once.
Comment 9 WebKit Commit Bot 2014-10-29 22:02:30 PDT
Comment on attachment 240637 [details]
Changes sets to frozen sets, and computes frozenset(first_results.failing_test_results()) only once.

Clearing flags on attachment: 240637

Committed r175367: <http://trac.webkit.org/changeset/175367>
Comment 10 WebKit Commit Bot 2014-10-29 22:02:35 PDT
All reviewed patches have been landed.  Closing bug.
Comment 11 Daniel Bates 2014-10-30 09:52:04 PDT
Comment on attachment 240637 [details]
Changes sets to frozen sets, and computes frozenset(first_results.failing_test_results()) only once.

View in context: https://bugs.webkit.org/attachment.cgi?id=240637&action=review

I know that this patch already landed. I noticed some minor issues and had some questions about the correctness of this patch.

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:-232
> -            # See https://bugs.webkit.org/show_bug.cgi?id=51272

Out of curiosity, did you have a chance to look at the posted patches and discussion in bug #51272? If so, are the concerns raised about --exit-after-N-failures and test ordering non-issues? In particular, is <https://bugs.webkit.org/show_bug.cgi?id=51272#c3> a non-issue?

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:235
> +            tests_that_only_failed_second = second_failing_results_set.difference(first_failing_results_set)
> +            self._report_flaky_tests(tests_that_only_failed_second, second_results_archive)

The following assumes <https://bugs.webkit.org/show_bug.cgi?id=51272#c3> is still applicable (see my question above).

Take --exit-after-N-failures to be 2 and assume there are a total of 4 tests {A, B, C, D} and test D is a test that constantly fails (and we haven't added it to the skip list or rolled out the patch that caused the failure yet). Suppose the first and second test runs have the following results {(A, Pass), (B, Fail), (C, Fail)} and {(A, Fail), (B, Pass), (C, Pass), (D, Fail)}. Then we will report test D as a flaky test (since tests_that_only_failed_second = {A, D}). Is this acceptable?
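
Spelling out the set arithmetic in that scenario (illustrative values only):

    # Reviewer's hypothetical: --exit-after-N-failures=2, so each run stops early.
    first_failing_results_set = frozenset(["B", "C"])   # run 1: A passed; limit hit before D ran
    second_failing_results_set = frozenset(["A", "D"])  # run 2: B and C passed

    tests_that_only_failed_second = second_failing_results_set.difference(first_failing_results_set)
    assert tests_that_only_failed_second == frozenset(["A", "D"])
    # D, a constant failure, would be reported as flaky.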

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:246
> +                self._build_and_test_without_patch()
> +                self._clean_tree_results = self._delegate.test_results()
> +                tests_that_failed_on_tree = self._clean_tree_results.failing_test_results()
> +
> +                new_failures_introduced_by_patch = tests_that_consistently_failed.difference(set(tests_that_failed_on_tree))
> +                if new_failures_introduced_by_patch:
> +                    self.failure_status_id = first_failure_status_id
> +                    return self.report_failure(first_results_archive, new_failures_introduced_by_patch, first_script_error)

Is it necessary to compute all of this when self._build_and_test_without_patch() returns true?

On another note, this code looks similar to code we have below when self._results_failed_different_tests() returns false. I wish we could unify these code paths/share more code so as to avoid code duplication.

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:247
> +

If we get here, it means that the tests in tests_that_consistently_failed are also failing on trunk (i.e. the tree is red). So, we know the tests in tests_that_consistently_failed are not flaky. However, we don't know if the tests in tests_that_consistently_failed with the patch applied are failing for the same reason as when these tests failed without the patch. If the failure reason (diff) is the same for these tests with and without the patch, then the patch didn't cause these failures; => do not report such test failures for the patch. Otherwise, the patch has caused the tests to fail differently than with a clean build (maybe the patch actually fixes the tests, but due to human error the expected results are still not quite right); => we need to report these tests as failing.
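
A rough sketch of that idea (hypothetical throughout: the failure_reason attribute and the helper name are invented, and TestResult may not expose anything comparable):

    # Hypothetical: keep only failures whose failure reason differs from the
    # clean-tree run, i.e. failures plausibly caused by the patch.
    def failures_changed_by_patch(patched_failures, clean_failures):
        clean_by_name = dict((r.test_name, r) for r in clean_failures)
        return [r for r in patched_failures
                if r.test_name not in clean_by_name
                or r.failure_reason != clean_by_name[r.test_name].failure_reason]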

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:249
> +            # At this point we know that there is flakyness with the patch applied, but no consistent failures
> +            # were introduced. This is a bit of a grey-zone.

This comment is disingenuous. We may have reached this point in the execution flow because all the tests that failed in both runs with the patch also fail without the patch (i.e. tests_that_consistently_failed != [] and new_failures_introduced_by_patch == []).
Comment 12 Daniel Bates 2014-10-30 09:53:57 PDT
Comment on attachment 240637 [details]
Changes sets to frozen sets, and computes frozenset(first_results.failing_test_results()) only once.

View in context: https://bugs.webkit.org/attachment.cgi?id=240637&action=review

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:248
> +            # At this point we know that there is flakyness with the patch applied, but no consistent failures

As remarked in comment #3, did you intend to write flakiness as "flakyness"?
Comment 13 Alexey Proskuryakov 2014-10-30 10:04:50 PDT
In comment 2, I asked Jake to check into test coverage for when we hit the limit, as we need regression tests to fix those cases.

Thinking about it now, I have a different concern. We report flaky tests whenever the first and second runs differ. But how does that make sense? What if the flakiness was introduced by the patch itself? Also, I haven't seen the queues report flaky tests recently, so I suspect that the functionality is broken anyway.

> However, we don't know if the tests in tests_that_consistently_failed with the patch applied are failing for the same reason as when these tests failed without the patch.

This is an interesting improvement idea. I don't think that EWS ever compared actual results, or did it?
Comment 14 Jake Nielsen 2014-10-30 11:23:55 PDT
(In reply to comment #11)
> Comment on attachment 240637 [details]
> Changes sets to frozen sets, and computes
> frozenset(first_results.failing_test_results()) only once.
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=240637&action=review
> 
> I know that this patch already landed. I noticed some minor issues and had
> some questions about the correctness of this patch.
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:-232
> > -            # See https://bugs.webkit.org/show_bug.cgi?id=51272
> 
> Out of curiosity, did you have a chance to look at the posted patches and
> discussion in bug #51272? If so, are the concerns raised about
> --exit-after-N-failures and test ordering non-issues? In particular, is
> <https://bugs.webkit.org/show_bug.cgi?id=51272#c3> a non-issue?
I don't see which comment you mean regarding --exit-after-N-failures, but as for test ordering, order shouldn't matter since we're taking a set difference.
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:235
> > +            tests_that_only_failed_second = second_failing_results_set.difference(first_failing_results_set)
> > +            self._report_flaky_tests(tests_that_only_failed_second, second_results_archive)
> 
> The following assumes <https://bugs.webkit.org/show_bug.cgi?id=51272#c3> is
> still applicable (see my question above).
Maybe I'm not understanding what you mean; I don't see what this has to do with test ordering.
> 
> Take --exit-after-N-failures to be 2 and assume there are a total of 4 tests
> {A, B, C, D} and test D is a test that constantly fails (and we haven't
> added it to the skip list or rolled out the patch that caused the failure
> yet). Suppose the first and second test runs have the following results {(A,
> Pass), (B, Fail), (C, Fail)} and {(A, Fail), (B, Pass), (C, Pass), (D,
> Fail)}. Then we will report test D as a flaky test (since
> tests_that_only_failed_second = {A, D}). Is this acceptable?
It won't reach this code path. It will enter this code path instead:

    if first_results.did_exceed_test_failure_limit():
        return self._continue_testing_patch_that_exceeded_failure_limit_on_first_or_second_try(first_results, first_results_archive, first_script_error)
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:246
> > +                self._build_and_test_without_patch()
> > +                self._clean_tree_results = self._delegate.test_results()
> > +                tests_that_failed_on_tree = self._clean_tree_results.failing_test_results()
> > +
> > +                new_failures_introduced_by_patch = tests_that_consistently_failed.difference(set(tests_that_failed_on_tree))
> > +                if new_failures_introduced_by_patch:
> > +                    self.failure_status_id = first_failure_status_id
> > +                    return self.report_failure(first_results_archive, new_failures_introduced_by_patch, first_script_error)
> 
> Is it necessary to compute all of this when
> self._build_and_test_without_patch() returns true?
It's not, but special-casing it means more code duplication for microseconds of computing time gained. It's not worth it.
> 
> On another note, this code looks similar to code we have below when
> self._results_failed_different_tests() returns false. I wish we could
> unify these code paths/share more code so as to avoid code duplication.
You're right. I think we can. I'll do that.
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:247
> > +
> 
> If we get here, it means that the tests in
> tests_that_consistently_failed are also failing on trunk (i.e. the tree is
> red). So, we know the tests in tests_that_consistently_failed are not flaky.
> However, we don't know if the tests in tests_that_consistently_failed with
> the patch applied are failing for the same reason as when these tests failed
> without the patch. If the failure reason (diff) is the same for these tests
> with and without the patch then the patch didn't cause these failures; => do
> not report such test failures for the patch. Otherwise, the patch has caused
> the tests to fail differently than with a clean build (maybe the patch
> actually fixes the tests, but due to human error the expected results are
> still not quite right); => we need to report these tests as failing.
The failure type is already taken into account by the TestResult.__eq__ method. It would be a mistake to try to compare the failure messages though. Since there are stack traces in the messages, if line numbers changed and we were just doing a comparison of those strings, we would peg the failures as different, even if they weren't, and we would be unable to land a good patch.
Even beyond this, though, the behavior at this point in the code is exactly the same as without this patch, so any further issues at this point should be new bugs and don't really have to do with the problem at hand.
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:249
> > +            # At this point we know that there is flakyness with the patch applied, but no consistent failures
> > +            # were introduced. This is a bit of a grey-zone.
> 
> This comment is disingenuous. We may have reached this point in the
> execution flow because all the tests that failed in both runs with the
> patch also fail without the patch (i.e. tests_that_consistently_failed != []
> and new_failures_introduced_by_patch == [])
The comment says that no consistent failures were introduced. In this case, no consistent failures were introduced; they were there to begin with.
Comment 15 Jake Nielsen 2014-10-30 11:28:52 PDT
(In reply to comment #14)
> The failure type is already taken into account by the TestResult.__eq__
> method. It would be a mistake to try to compare the failure messages though.
> Since there are stack traces in the messages, if line numbers changed and we
> were just doing a comparison of those strings, we would peg the failures as
> different, even if they weren't, and we would be unable to land a good patch.
I lied, they don't contain stack traces. Brain fart.
Comment 16 Jake Nielsen 2014-10-30 11:33:16 PDT
(In reply to comment #12)
> Comment on attachment 240637 [details]
> Changes sets to frozen sets, and computes
> frozenset(first_results.failing_test_results()) only once.
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=240637&action=review
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:248
> > +            # At this point we know that there is flakyness with the patch applied, but no consistent failures
> 
> As remarked in comment #3, did you intend to write flakiness as "flakyness"?
Sorry I missed that. I'll make a note and change it in the next bug fix in this code.