Bug 123385 - New flakiness dashboard shouldn't treat tests with right expectations as failing
Summary: New flakiness dashboard shouldn't treat tests with right expectations as failing
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKit Website
Version: 528+ (Nightly build)
Hardware: Unspecified OS: Unspecified
Importance: P2 Normal
Assignee: Ryosuke Niwa
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-10-25 23:29 PDT by Ryosuke Niwa
Modified: 2013-10-27 19:56 PDT
CC List: 6 users

See Also:


Attachments
Changes the behavior (2.24 KB, patch)
2013-10-25 23:33 PDT, Ryosuke Niwa

Description Ryosuke Niwa 2013-10-25 23:29:20 PDT
Right now, if you select "failing" tests on the builder pane, the new flakiness dashboard lists all failing tests, including ones that already have the right test expectation.
It should instead list only the tests that are failing without a matching expectation, i.e. the ones that are actually making bots red.
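The intended filtering can be sketched roughly as follows (a minimal illustration only, not the actual patch; the function name, result fields, and expectation values here are hypothetical):

```javascript
// Hypothetical sketch: a failing test whose actual result is already covered
// by its listed expectations is not "making bots red", so it is filtered out.
function testsMakingBotsRed(results) {
    return results.filter(function (result) {
        if (!result.isFailure)
            return false; // passing tests are never listed
        var expected = result.expectations || [];
        // Keep only failures whose actual result was NOT expected.
        return expected.indexOf(result.actual) === -1;
    });
}
```

Under this sketch, a test that is expected to fail (e.g. marked Failure in TestExpectations) and does fail would no longer appear in the "failing" list, while an unexpected Crash still would.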
Comment 1 Ryosuke Niwa 2013-10-25 23:33:19 PDT
Created attachment 215240 [details]
Changes the behavior
Comment 2 Alexey Proskuryakov 2013-10-26 16:52:45 PDT
Comment on attachment 215240 [details]
Changes the behavior

I've never used this feature on the old dashboard, so it's not clear to me if either behavior is useful. What are the use cases? If this is a replacement for regular dashboard, then we should consider just removing the duplicate functionality.

r=me
Comment 3 Ryosuke Niwa 2013-10-26 17:21:01 PDT
(In reply to comment #2)
> (From update of attachment 215240 [details])
> I've never used this feature on the old dashboard, so it's not clear to me if either behavior is useful. What are the use cases? If this is a replacement for regular dashboard, then we should consider just removing the duplicate functionality.

This shows the list of failing tests on the bots.
Comment 4 WebKit Commit Bot 2013-10-26 17:45:38 PDT
Comment on attachment 215240 [details]
Changes the behavior

Clearing flags on attachment: 215240

Committed r158093: <http://trac.webkit.org/changeset/158093>
Comment 5 WebKit Commit Bot 2013-10-26 17:45:40 PDT
All reviewed patches have been landed.  Closing bug.
Comment 6 Alexey Proskuryakov 2013-10-27 10:06:03 PDT
> This shows the list of failing tests on the bots.

I don't think that this answers my question about use cases. Listing tests that are currently failing is not a job for the dashboard, which is for historic analysis of results.
Comment 7 Ryosuke Niwa 2013-10-27 11:04:55 PDT
(In reply to comment #6)
> > This shows the list of failing tests on the bots.
> 
> I don't think that this answers my question about use cases. Listing tests that are currently failing is not a job for the dashboard, which is for historic analysis of results.

If you're talking about http://build.webkit.org/dashboard/, I find it impossible to use because it doesn't have links to the builders' pages and it applies -webkit-user-select: none, along with dozens of other problems.
Comment 8 Alexey Proskuryakov 2013-10-27 11:41:48 PDT
Can you please file bugs for those? That is the tool intended to be used for looking at immediate state of the bots, and adding duplicate functionality to other tools is not the best path forward. We'll just end up with a set of tools that no one but their creators understand or use.

build.webkit.org/dashboard is also meant to be the primary entry point into the regression test bot system for most people, because checking historic flakiness is an activity that is secondary to checking immediate state. Buildbot waterfall and console certainly have their use, but mostly for people who administer the system, not for WebKit developers in my opinion.

There are a bunch of bugs and enhancement requests filed already; you can find them by searching for "build.webkit.org/dashboard" in Bugzilla titles.

I encourage you to file bugs in terms of use cases that aren't addressed well (i.e. not simply "please remove user-select:none", but "I often need to do XXX when bot watching, and it's difficult to do now").
Comment 9 Ryosuke Niwa 2013-10-27 12:03:58 PDT
(In reply to comment #8)
> build.webkit.org/dashboard is also meant to be the primary entry point into the regression test bot system for most people, because checking historic flakiness is an activity that is secondary to checking immediate state. Buildbot waterfall and console certainly have their use, but mostly for people who administer the system, not for WebKit developers in my opinion.

I don't see a point in doing that, given that I'm satisfied with what build.webkit.org/waterfall and build.webkit.org/console provide.  Those two pages provide exactly the kind of information I need.
Comment 10 Alexey Proskuryakov 2013-10-27 19:19:43 PDT
> I'm satisfied with what build.webkit.org/waterfall and build.webkit.org/console provides

In this case, can we just get rid of the "failing" display in the new flakiness dashboard?
Comment 11 Ryosuke Niwa 2013-10-27 19:24:24 PDT
(In reply to comment #10)
> > I'm satisfied with what build.webkit.org/waterfall and build.webkit.org/console provides
> 
> In this case, can we just get rid of the "failing" display in the new flakiness dashboard?

Why?  The historical results of currently failing tests are exactly what bot watchers need to see to determine which patch caused a failure and whether the tests have been flaky.
Comment 12 Ryosuke Niwa 2013-10-27 19:56:29 PDT
I think I'm disagreeing with the statement that "checking historic flakiness is an activity that is secondary to checking immediate state".

In my experience, viewing the historical results of a test has been essential in determining the culprit and the correct test expectation to add.

Knowing how many tests are failing on a builder doesn't get me anywhere as a bot watcher because my primary job as a bot watcher (contacting the patch author, etc…) cannot be carried out until the culprit is determined.

I don't know what revision number http://build.webkit.org/dashboard/ is showing, but automatically determining the culprit has already been tried by TestFailures and garden-o-matic.  They have both failed miserably to deliver on that promise.  Tasks of this sort are best done by humans.