Bug 97045 - webkit-patch rebaseline does the wrong thing
Summary: webkit-patch rebaseline does the wrong thing
Status: NEW
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assigned To: Nobody
Depends on:
Blocks:
Reported: 2012-09-18 14:36 PDT by Ojan Vafai
Modified: 2012-09-19 14:14 PDT
CC: 3 users

See Also:



Description Ojan Vafai 2012-09-18 14:36:54 PDT
Rebaselined Windows-only failures with http://trac.webkit.org/changeset/128912, but that caused these tests to start failing on Linux, since Chromium Linux falls back to Chromium Windows and there was no existing Chromium Linux-specific result. Submitted http://trac.webkit.org/changeset/128931 to fix.

When we rebaseline a test, we also need to rebaseline all the ports whose results implicitly change, to make sure we don't cause new failures. In this case, we should notice that Chromium Linux is passing the tests and rebaseline both Chromium Linux and Chromium Win.
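The implicit-dependency check described above amounts to a walk over the baseline fallback mapping. The following is an illustrative sketch, not actual webkit-patch code; the FALLBACK table and the function name are hypothetical:

```python
# Illustrative sketch (not actual webkit-patch code): find ports that
# implicitly pick up a new baseline because they fall back to the
# rebaselined port and have no result of their own.

FALLBACK = {
    # port            : port whose baselines it falls back to
    "chromium-linux":   "chromium-win",
    "chromium-win-xp":  "chromium-win",
    "chromium-win":     None,
}

def implicitly_affected_ports(changed_port, ports_with_own_baseline):
    """Ports whose effective result silently changes when changed_port is
    rebaselined: direct fallback children without a port-specific baseline."""
    return {port for port, parent in FALLBACK.items()
            if parent == changed_port and port not in ports_with_own_baseline}

# The scenario from this bug: rebaselining chromium-win with no
# chromium-linux-specific result also changes what Linux expects.
print(sorted(implicitly_affected_ports("chromium-win", set())))
```

With the hypothetical table above, rebaselining chromium-win while neither Linux nor WinXP has its own result flags both of them as needing attention.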
Comment 1 Ojan Vafai 2012-09-18 14:37:56 PDT
Really this is a bug with webkit-patch rebaseline*, not garden-o-matic.
Comment 2 Dirk Pranke 2012-09-18 14:42:13 PDT
hm. I thought we had code that checked that.
Comment 3 Ojan Vafai 2012-09-18 14:48:08 PDT
We have code that lets you hardcode "platform_move_to" in builders.py to do this, but it doesn't automatically figure that out from the hypergraph. I think we just need to change that logic to do it automatically based on the hypergraph data.
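The hardcoded "platform_move_to" behaviour boils down to: before overwriting a shared baseline, preserve the current result for any dependent port that lacks its own copy. Deriving that automatically could look like this sketch; the helper name and signature are hypothetical, not builders.py code:

```python
import os
import shutil

def preserve_fallback_results(shared_dir, dependent_dirs, baseline_name):
    """Before a baseline in shared_dir is overwritten, copy the current file
    into each dependent port directory that lacks its own copy, so ports that
    fall back to shared_dir keep seeing the result they see today.

    This mimics the hardcoded "platform_move_to" behaviour, but driven by a
    computed list of dependent directories instead of per-builder config.
    Returns the paths that were created.
    """
    src = os.path.join(shared_dir, baseline_name)
    if not os.path.exists(src):
        return []
    created = []
    for dep_dir in dependent_dirs:
        dst = os.path.join(dep_dir, baseline_name)
        if not os.path.exists(dst):  # a port-specific result wins; don't clobber
            shutil.copyfile(src, dst)
            created.append(dst)
    return created
```

The list of dependent directories would itself come from the fallback (hypergraph) data rather than from a hand-maintained table.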
Comment 4 Dirk Pranke 2012-09-18 14:55:28 PDT
Turns out I'm thinking of the code in the baseline optimizer that checks to make sure optimizing doesn't change any results. You're right, we don't have any code that checks if other ports might need to be rebaselined if we change a baseline for one port.
Comment 5 Adam Barth 2012-09-19 11:20:27 PDT
I knew about this issue when I designed the algorithm, but it's not clear to me how to fix it.  The problem is that you often want to change the results for the other ports.

Consider the case where revision N changes the results of test X and previously all the Windows versions had the same results for test X.  For whatever reason, the WinXP bot hasn't processed revision N yet, so that bot still shows the old result.  The right solution here is to guess that the result is going to stay the same across Windows versions and to overwrite the chromium-win results with the results from the Win7 bot.  If we later discover that WinXP has a different result, we can then record that in the chromium-win-xp directory.

The algorithm is designed to be eventually correct.  If you keep rebaselining, you'll eventually get to the right state; you just might not get there in one step.  It's impossible to always know what the correct final configuration is, so we have some basic heuristics in place.  It's very likely there are ways to improve the heuristics.
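The two-step convergence described above can be illustrated with a toy baseline-resolution function. All names here are hypothetical; the real lookup is run-webkit-tests' baseline search path:

```python
# Toy illustration of "eventually correct" rebaselining: a port's result is
# the first one found along its fallback search path (most specific first).

def resolve_baseline(port, search_paths, baselines):
    """Return the first result found along the port's fallback path."""
    for directory in search_paths[port]:
        if directory in baselines:
            return baselines[directory]
    return None

search_paths = {
    "chromium-win-xp": ["chromium-win-xp", "chromium-win"],
    "chromium-win":    ["chromium-win"],
}

baselines = {}
# Step 1: the WinXP bot is lagging, so guess all Windows versions agree and
# write the Win7 bot's result to the shared chromium-win directory.
baselines["chromium-win"] = "result from Win7 bot"
# At this point WinXP inherits the guessed result:
assert resolve_baseline("chromium-win-xp", search_paths, baselines) == "result from Win7 bot"

# Step 2: WinXP turns out to differ, so a second rebaseline records its
# result specifically, converging on the correct final configuration.
baselines["chromium-win-xp"] = "result from WinXP bot"
print(resolve_baseline("chromium-win-xp", search_paths, baselines))
```

Neither step alone produces the final state, but each rebaseline moves the configuration closer to it, which is the property the comment describes.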
Comment 6 Ojan Vafai 2012-09-19 12:16:55 PDT
The issue with the current heuristics is that they break even if all the bots have run the tests. Maybe we should actually use different heuristics for the unexpected failures tab vs. the expected failures tab. In the former, the common case is that not all the bots have run; in the latter, it's the opposite. We could make rebaselines from the expected failures tab always do the 100% right thing, and the unexpected failures tab do the right thing most of the time (i.e. what it does now).

WDYT?
Comment 7 Dirk Pranke 2012-09-19 12:29:48 PDT
I think there's also potentially the problem that we're not giving enough information / direction to the rebaselining, e.g., it can be unclear whether we mean "change the win result and push the old one to linux" or "update the win result; we want linux to get the updated result also". I think the tooling lets you be explicit about this (i.e., we have the right infrastructure) but we may not have the best interface to it and users may not realize the side effects of their actions.
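The two intents distinguished above could be made explicit in the tool's interface. A toy sketch of what that distinction means, operating on a plain {port_dir: result} dict rather than real baseline files (the mode names and function are hypothetical):

```python
# Sketch of making the rebaselining intent explicit (hypothetical interface).

def rebaseline_win(mode, new_result, baselines):
    if mode == "move":
        # "Change the win result and push the old one to linux": preserve
        # today's inherited result for linux before win changes.
        baselines.setdefault("chromium-linux", baselines["chromium-win"])
    elif mode == "propagate":
        # "Update the win result; linux should get the updated result too":
        # drop any linux override so it falls back to the new win baseline.
        baselines.pop("chromium-linux", None)
    else:
        raise ValueError("unknown mode: %r" % mode)
    baselines["chromium-win"] = new_result
    return baselines

print(rebaseline_win("move", "new", {"chromium-win": "old"}))
print(rebaseline_win("propagate", "new",
                     {"chromium-win": "old", "chromium-linux": "old"}))
```

The point is that both outcomes are reachable with the existing infrastructure; what is missing is an interface that forces the user to say which one they mean.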

re: "it's impossible to always know what the correct final configuration is" ... I don't quite understand this; assuming all the bots have produced results, and you know which bots are failing and which aren't, you can determine what the correct configuration is, right? So you're saying just one or both of those assumptions might not hold? or am I missing something?
Comment 8 Adam Barth 2012-09-19 13:11:08 PDT
> Maybe we should actually use different heuristics for the unexpected failures tab vs the expected failures tab.

That makes sense.  The time-skew issue is much less likely to occur for the unexpected failures tab.

There's still the issue of configurations that don't have bots, but IMHO we should just delete those configurations.  That's mostly what we've been doing (e.g., the google-chrome configuration is gone).  Do we have any left?

> re: "it's impossible to always know what the correct final configuration is" ... I don't quite understand this; assuming all the bots have produced results, and you know which bots are failing and which aren't, you can determine what the correct configuration is, right? So you're saying just one or both of those assumptions might not hold? or am I missing something?

You can never know for certain that all the bots have produced a consistent set of results because the bots aren't synchronized.  Ojan's point is that the time-skew issue is less likely to be a problem on the expected failures tab.  It's definitely a problem on the unexpected failures tab.
Comment 9 Dirk Pranke 2012-09-19 13:16:18 PDT
(In reply to comment #8)
> 
> You can never know for certain that all the bots have produced a consistent set of results because the bots aren't synchronized.  

Got it, thanks.
Comment 10 Ojan Vafai 2012-09-19 14:13:16 PDT
(In reply to comment #8)
> There's still the issue of configurations that don't have bots, but IMHO we should just delete those configurations.

I agree. Any configurations without bots cannot be expected to have their expected results kept up to date by other ports. This includes, for example, keeping pixel results up to date for ports that don't run pixel tests.
Comment 11 Ojan Vafai 2012-09-19 14:14:13 PDT
In either case, sounds like we have consensus on a path forward. Just need to find someone with time to make it happen. :) I'm gardening next week, so maybe I'll have some time to spare.