Bug 138184 - CommitQueue and EWS should reject any patches that result in consistent test failures that aren't present on the tree.
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Jake Nielsen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-29 14:19 PDT by Jake Nielsen
Modified: 2014-10-30 11:33 PDT
CC: 6 users

See Also:


Attachments
Alters PatchAnalysisTask's behavior, and changes the unit tests accordingly. (14.10 KB, patch)
2014-10-29 14:24 PDT, Jake Nielsen
ap: review+
dbates: commit-queue-
Addresses comments (14.02 KB, patch)
2014-10-29 15:01 PDT, Jake Nielsen
no flags
Changes sets to frozen sets, and computes frozenset(first_results.failing_test_results()) only once. (14.16 KB, patch)
2014-10-29 16:09 PDT, Jake Nielsen
no flags

Description Jake Nielsen 2014-10-29 14:19:40 PDT
Example: https://bugs.webkit.org/show_bug.cgi?id=137132
Comment 1 Jake Nielsen 2014-10-29 14:24:52 PDT
Created attachment 240624 [details]
Alters PatchAnalysisTask's behavior, and changes the unit tests accordingly.
Comment 2 Alexey Proskuryakov 2014-10-29 14:49:12 PDT
Comment on attachment 240624 [details]
Alters PatchAnalysisTask's behavior, and changes the unit tests accordingly.

View in context: https://bugs.webkit.org/attachment.cgi?id=240624&action=review

> Tools/ChangeLog:3
> +        CommitQueue should reject any patches result in consistent test

Is it just the commit queue?

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:-231
> -            # we happen to hit the --exit-after-N-failures limit.

Do we have enough test coverage for when we hit the limit of failures?

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:228
> +            tests_that_only_failed_first = list(set(first_results.failing_test_results()) - set(second_results.failing_test_results()))

Why is it not always a set? It seems wasteful to convert every time.

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:243
> +                new_failures_introduced_by_patch = list(set(tests_that_consistently_failed) - set(tests_that_failed_on_tree))
> +                if new_failures_introduced_by_patch:
> +                    self.failure_status_id = first_failure_status_id
> +                    return self.report_failure(first_results_archive, new_failures_introduced_by_patch, first_script_error)

I don't think that this heuristic is quite right. If there is enough flakiness in the tree, we could reject a perfectly good patch. We should probably refuse to give a result when there is much flakiness.

However, I'm not sure what "enough" would be here.
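
To make the concern concrete, here is a purely illustrative sketch (the threshold and the helper are invented, not anything in the patch under review; the set names are borrowed from the final revision of the patch):

    # Hypothetical guard: if the two runs with the patch disagree on too many
    # tests, the results are too noisy to judge the patch either way.
    MAX_FLAKY_TESTS = 10  # invented threshold; "enough" is exactly the open question

    def too_flaky_to_judge(first_failing_results_set, second_failing_results_set):
        # Tests that failed in exactly one of the two runs (symmetric difference).
        flaky = first_failing_results_set ^ second_failing_results_set
        return len(flaky) > MAX_FLAKY_TESTS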

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:247
> +            # were introduced. This is a bit of a grey-zone, and for now I will leave this code path as it was,
> +            # AKA, defer.

I'd finish this after "grey-zone", as the rest is not very helpful to the reader.
Comment 3 Daniel Bates 2014-10-29 14:55:45 PDT
Comment on attachment 240624 [details]
Alters PatchAnalysisTask's behavior, and changes the unit tests accordingly.

View in context: https://bugs.webkit.org/attachment.cgi?id=240624&action=review

I didn't check this patch for correctness. I noticed some minor nits and had some questions.

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:234
> +            tests_that_only_failed_first = list(set(first_results.failing_test_results()) - set(second_results.failing_test_results()))
> +            self._report_flaky_tests(tests_that_only_failed_first, first_results_archive)
> +
> +            tests_that_only_failed_second = list(set(second_results.failing_test_results()) - set(first_results.failing_test_results()))
> +            self._report_flaky_tests(tests_that_only_failed_second, second_results_archive)
> +
> +            tests_that_consistently_failed = list(set(second_results.failing_test_results()) & set(first_results.failing_test_results()))

This is inefficient as we are computing set(first_results.failing_test_results()) and set(second_results.failing_test_results()) multiple times. We should compute these values once. Is it necessary to create mutable sets? Or can we use frozenset()?

As far as I can tell from briefly searching through the Python code, we prefer using set.intersection() over the overloaded binary-and operator (&) when computing the intersection of two sets/frozensets. I suggest we use set.intersection() for consistency.

See <https://docs.python.org/2/library/stdtypes.html#frozenset> for more details on frozenset().

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:240
> +                new_failures_introduced_by_patch = list(set(tests_that_consistently_failed) - set(tests_that_failed_on_tree))

Notice that we made tests_that_consistently_failed a list data type above (line 234) by explicitly passing the right-hand side set computation to the list constructor. Instead of converting tests_that_consistently_failed from a set to a list to a set, I suggest we let tests_that_consistently_failed be a set.
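
Spelled out, the suggested shape is roughly the following (a sketch only, using the names the final revision of the patch ended up with):

    # Compute each failing-results set exactly once, as an immutable frozenset.
    first_failing_results_set = frozenset(first_results.failing_test_results())
    second_failing_results_set = frozenset(second_results.failing_test_results())

    tests_that_only_failed_first = first_failing_results_set.difference(second_failing_results_set)
    tests_that_only_failed_second = second_failing_results_set.difference(first_failing_results_set)

    # Keep this a set; convert to a list at the last moment if a caller needs one.
    tests_that_consistently_failed = second_failing_results_set.intersection(first_failing_results_set)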

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:245
> +            # At this point we know that there is flakyness with the patch applied, but no consistent failures

Nit: flakyness => flakiness
Comment 4 Jake Nielsen 2014-10-29 15:01:53 PDT
Created attachment 240630 [details]
Addresses comments
Comment 5 Daniel Bates 2014-10-29 15:11:20 PDT
Comment on attachment 240630 [details]
Addresses comments

View in context: https://bugs.webkit.org/attachment.cgi?id=240630&action=review

> Tools/Scripts/webkitpy/tool/bot/commitqueuetask_unittest.py:178
> +        self.assertEqual(set(commit_queue.get_reported_flaky_tests()), set(expected_reported_flaky_tests))

Can you elaborate on this change?
Comment 6 Jake Nielsen 2014-10-29 15:19:46 PDT
I just stepped out for some food, but I'll address these comments when I get back.
Comment 7 Jake Nielsen 2014-10-29 16:05:49 PDT
(In reply to comment #5)
> Comment on attachment 240630 [details]
> Addresses comments
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=240630&action=review
> 
> > Tools/Scripts/webkitpy/tool/bot/commitqueuetask_unittest.py:178
> > +        self.assertEqual(set(commit_queue.get_reported_flaky_tests()), set(expected_reported_flaky_tests))
> 
> Can you elaborate on this change?

This just makes it less annoying to define which tests you expect to get reported. The order in which they appear doesn't matter, so checking set equality makes more sense.
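A minimal illustration of the difference (not from the patch, just Python semantics):

    # List equality is order-sensitive; set equality is not.
    assert ["a.html", "b.html"] != ["b.html", "a.html"]
    assert set(["a.html", "b.html"]) == set(["b.html", "a.html"])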
Comment 8 Jake Nielsen 2014-10-29 16:09:12 PDT
Created attachment 240637 [details]
Changes sets to frozen sets, and computes frozenset(first_results.failing_test_results()) only once.
Comment 9 WebKit Commit Bot 2014-10-29 22:02:30 PDT
Comment on attachment 240637 [details]
Changes sets to frozen sets, and computes frozenset(first_results.failing_test_results()) only once.

Clearing flags on attachment: 240637

Committed r175367: <http://trac.webkit.org/changeset/175367>
Comment 10 WebKit Commit Bot 2014-10-29 22:02:35 PDT
All reviewed patches have been landed.  Closing bug.
Comment 11 Daniel Bates 2014-10-30 09:52:04 PDT
Comment on attachment 240637 [details]
Changes sets to frozen sets, and computes frozenset(first_results.failing_test_results()) only once.

View in context: https://bugs.webkit.org/attachment.cgi?id=240637&action=review

I know that this patch already landed. I noticed some minor issues and had some questions about the correctness of this patch.

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:-232
> -            # See https://bugs.webkit.org/show_bug.cgi?id=51272

Out of curiosity, did you have a chance to look at the posted patches and discussion in bug #51272? If so, are the concerns raised about --exit-after-N-failures and test ordering non-issues? In particular, is <https://bugs.webkit.org/show_bug.cgi?id=51272#c3> a non-issue?

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:235
> +            tests_that_only_failed_second = second_failing_results_set.difference(first_failing_results_set)
> +            self._report_flaky_tests(tests_that_only_failed_second, second_results_archive)

The following assumes <https://bugs.webkit.org/show_bug.cgi?id=51272#c3> is still applicable (see my question above).

Take --exit-after-N-failures to be 2 and assume there are a total of 4 tests {A, B, C, D} and test D is a test that constantly fails (and we haven't added it to the skip list or rolled out the patch that caused the failure yet). Suppose the first and second test runs have the following results {(A, Pass), (B, Fail), (C, Fail)} and {(A, Fail), (B, Pass), (C, Pass), (D, Fail)}. Then we will report test D as a flaky test (since tests_that_only_failed_second = {A, D}). Is this acceptable?
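
Spelling out the set arithmetic in that scenario (illustrative values only):

    # Reviewer's hypothetical: --exit-after-N-failures=2, so each run stops early.
    first_failing_results_set = frozenset(["B", "C"])   # run 1: A passed; limit hit before D ran
    second_failing_results_set = frozenset(["A", "D"])  # run 2: B and C passed

    tests_that_only_failed_second = second_failing_results_set.difference(first_failing_results_set)
    assert tests_that_only_failed_second == frozenset(["A", "D"])
    # D, a constant failure, would be reported as flaky.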

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:246
> +                self._build_and_test_without_patch()
> +                self._clean_tree_results = self._delegate.test_results()
> +                tests_that_failed_on_tree = self._clean_tree_results.failing_test_results()
> +
> +                new_failures_introduced_by_patch = tests_that_consistently_failed.difference(set(tests_that_failed_on_tree))
> +                if new_failures_introduced_by_patch:
> +                    self.failure_status_id = first_failure_status_id
> +                    return self.report_failure(first_results_archive, new_failures_introduced_by_patch, first_script_error)

Is it necessary to compute all of this when self._build_and_test_without_patch() returns true?

On another note, this code looks similar to code we have below when self._results_failed_different_tests() returns false. I wish we could unify these code paths/share more code so as to avoid code duplication.

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:247
> +

If we get here, it means that the tests in tests_that_consistently_failed are also failing on trunk (i.e. the tree is red). So, we know the tests in tests_that_consistently_failed are not flaky. However, we don't know if the tests in tests_that_consistently_failed with the patch applied are failing for the same reason as when these tests failed without the patch. If the failure reason (diff) is the same for these tests with and without the patch, then the patch didn't cause these failures; => do not report such test failures for the patch. Otherwise, the patch has caused the tests to fail differently than with a clean build (maybe the patch actually fixes the tests, but due to human error the expected results are still not quite right); => we need to report these tests as failing.
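
A rough sketch of that idea (hypothetical throughout: the failure_reason attribute and the helper name are invented, and TestResult may not expose anything comparable):

    # Hypothetical: keep only failures whose failure reason differs from the
    # clean-tree run, i.e. failures plausibly caused by the patch.
    def failures_changed_by_patch(patched_failures, clean_failures):
        clean_by_name = dict((r.test_name, r) for r in clean_failures)
        return [r for r in patched_failures
                if r.test_name not in clean_by_name
                or r.failure_reason != clean_by_name[r.test_name].failure_reason]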

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:249
> +            # At this point we know that there is flakyness with the patch applied, but no consistent failures
> +            # were introduced. This is a bit of a grey-zone.

This comment is disingenuous. We may have reached this point in the execution flow because all the tests that failed in both runs with the patch also fail without the patch (i.e. tests_that_consistently_failed != [] and new_failures_introduced_by_patch == []).
Comment 12 Daniel Bates 2014-10-30 09:53:57 PDT
Comment on attachment 240637 [details]
Changes sets to frozen sets, and computes frozenset(first_results.failing_test_results()) only once.

View in context: https://bugs.webkit.org/attachment.cgi?id=240637&action=review

> Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:248
> +            # At this point we know that there is flakyness with the patch applied, but no consistent failures

As remarked in comment #3, did you intend to write flakiness as "flakyness"?
Comment 13 Alexey Proskuryakov 2014-10-30 10:04:50 PDT
In comment 2, I asked Jake to check into test coverage for when we hit the limit, as we need regression tests to fix those cases.

Thinking about it now, I have a different concern. We report flaky tests whenever the first and second runs differ. But how does that make sense? What if the flakiness was introduced by the patch itself? Also, I haven't seen the queues report flaky tests recently, so I suspect that the functionality is broken anyway.

> However, we don't know if the tests in tests_that_consistently_failed with the patch applied are failing for the same reason as when these tests failed without the patch.

This is an interesting improvement idea. I don't think that EWS ever compared actual results, or did it?
Comment 14 Jake Nielsen 2014-10-30 11:23:55 PDT
(In reply to comment #11)
> Comment on attachment 240637 [details]
> Changes sets to frozen sets, and computes
> frozenset(first_results.failing_test_results()) only once.
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=240637&action=review
> 
> I know that this patch already landed. I noticed some minor issues and had
> some questions about the correctness of this patch.
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:-232
> > -            # See https://bugs.webkit.org/show_bug.cgi?id=51272
> 
> Out of curiosity, did you have a chance to look at the posted patches and
> discussion in bug #51272? If so, are the concerns raised about
> --exit-after-N-failures and test ordering non-issues? In particular, is
> <https://bugs.webkit.org/show_bug.cgi?id=51272#c3> a non-issue?
I don't see which comment you mean regarding --exit-after-N-failures, but as for test ordering, order shouldn't matter since we're taking a set difference.
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:235
> > +            tests_that_only_failed_second = second_failing_results_set.difference(first_failing_results_set)
> > +            self._report_flaky_tests(tests_that_only_failed_second, second_results_archive)
> 
> The following assumes <https://bugs.webkit.org/show_bug.cgi?id=51272#c3> is
> still applicable (see my question above).
Maybe I'm not understanding what you mean; I don't see what this has to do with test ordering.
> 
> Take --exit-after-N-failures to be 2 and assume there are a total of 4 tests
> {A, B, C, D} and test D is a test that constantly fails (and we haven't
> added it to the skip list or rolled out the patch that caused the failure
> yet). Suppose the first and second test runs have the following results {(A,
> Pass), (B, Fail), (C, Fail)} and {(A, Fail), (B, Pass), (C, Pass), (D,
> Fail)}. Then we will report test D as a flaky test (since
> tests_that_only_failed_second = {A, D}). Is this acceptable?
It won't reach this code path. It will enter this code path instead:

    if first_results.did_exceed_test_failure_limit():
        return self._continue_testing_patch_that_exceeded_failure_limit_on_first_or_second_try(first_results, first_results_archive, first_script_error)
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:246
> > +                self._build_and_test_without_patch()
> > +                self._clean_tree_results = self._delegate.test_results()
> > +                tests_that_failed_on_tree = self._clean_tree_results.failing_test_results()
> > +
> > +                new_failures_introduced_by_patch = tests_that_consistently_failed.difference(set(tests_that_failed_on_tree))
> > +                if new_failures_introduced_by_patch:
> > +                    self.failure_status_id = first_failure_status_id
> > +                    return self.report_failure(first_results_archive, new_failures_introduced_by_patch, first_script_error)
> 
> Is it necessary to compute all of this when
> self._build_and_test_without_patch() returns true?
It's not, but special-casing it means more code duplication for microseconds of computing time gained. It's not worth it.
> 
> On another note, this code looks similar to code we have below when
> self._results_failed_different_tests() returns false. I wish we could
> unify these code paths/share more code so as to avoid code duplication.
You're right. I think we can. I'll do that.
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:247
> > +
> 
> If we get here, it means that the tests in
> tests_that_consistently_failed are also failing on trunk (i.e. the tree is
> red). So, we know the tests in tests_that_consistently_failed are not flaky.
> However, we don't know if the tests in tests_that_consistently_failed with
> the patch applied are failing for the same reason as when these tests failed
> without the patch. If the failure reason (diff) is the same for these tests
> with and without the patch then the patch didn't cause these failures; => do
> not report such test failures for the patch. Otherwise, the patch has caused
> the tests to fail differently than with a clean build (maybe the patch
> actually fixes the tests, but due to human error the expected results are
> still not quite right); => we need to report these tests as failing.
The failure type is already taken into account by the TestResult.__eq__ method. It would be a mistake to try to compare the failure messages though. Since there are stack traces in the messages, if line numbers changed and we were just doing a comparison of those strings, we would peg the failures as different, even if they weren't, and we would be unable to land a good patch.
Even beyond this, though, the behavior at this point in the code is exactly the same as without this patch, so any further issues at this point should be new bugs and don't really have to do with the problem at hand.
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:249
> > +            # At this point we know that there is flakyness with the patch applied, but no consistent failures
> > +            # were introduced. This is a bit of a grey-zone.
> 
> This comment is disingenuous. We may have reached this point in the
> execution flow because all the tests that failed in both runs with the
> patch also fail without the patch (i.e. tests_that_consistently_failed != []
> and new_failures_introduced_by_patch == [])
The comment says that no consistent failures were introduced. In this case, no consistent failures were introduced; they were there to begin with.
Comment 15 Jake Nielsen 2014-10-30 11:28:52 PDT
(In reply to comment #14)
> The failure type is already taken into account by the TestResult.__eq__
> method. It would be a mistake to try to compare the failure messages though.
> Since there are stack traces in the messages, if line numbers changed and we
> were just doing a comparison of those strings, we would peg the failures as
> different, even if they weren't, and we would be unable to land a good patch.
I lied, they don't contain stack traces. Brain fart.
Comment 16 Jake Nielsen 2014-10-30 11:33:16 PDT
(In reply to comment #12)
> Comment on attachment 240637 [details]
> Changes sets to frozen sets, and computes
> frozenset(first_results.failing_test_results()) only once.
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=240637&action=review
> 
> > Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py:248
> > +            # At this point we know that there is flakyness with the patch applied, but no consistent failures
> 
> As remarked in comment #3, did you intend to write flakiness as "flakyness"?
Sorry I missed that. I'll make a note and change it in the next bug fix in this code.