Bug 97045 - webkit-patch rebaseline does the wrong thing
Summary: webkit-patch rebaseline does the wrong thing
Status: NEW
Product: WebKit
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assigned To:
Reported: 2012-09-18 14:36 PST
Modified: 2012-09-19 14:14 PST


Description From 2012-09-18 14:36:54 PST
Rebaselined Windows-only failures with http://trac.webkit.org/changeset/128912, but that caused these tests to start failing on Linux, since Chromium Linux falls back to Chromium Windows and there wasn't an existing Chromium Linux-specific result. Submitted http://trac.webkit.org/changeset/128931 to fix.

When we rebaseline a test, we need to rebaseline all the ports that are implicitly changing as well, to make sure we don't cause new failures. In this case, we should notice that Chromium Linux is passing the tests and rebaseline both Chromium Linux and Chromium Win.
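
To make the fallback concrete, here's a minimal sketch of how a baseline lookup walks a port's fallback path. The port names, fallback order, and helper below are illustrative only, not the actual webkit-patch code or builders.py data:

import os

# Illustrative fallback search paths: each port looks for
# <test>-expected.txt in these directories, in order, and uses the
# first file it finds.
FALLBACK_PATHS = {
    'chromium-linux': ['platform/chromium-linux', 'platform/chromium-win',
                       'platform/chromium'],
    'chromium-win':   ['platform/chromium-win', 'platform/chromium'],
}

def resolve_baseline(port, test_name, layout_tests_dir='LayoutTests'):
    """Return the baseline file the port will actually compare against."""
    expected = test_name.replace('.html', '-expected.txt')
    for directory in FALLBACK_PATHS[port]:
        candidate = '/'.join([layout_tests_dir, directory, expected])
        if os.path.exists(candidate):
            return candidate
    # No platform-specific result: fall back to the generic one next to the test.
    return '/'.join([layout_tests_dir, expected])

# Overwriting platform/chromium-win/<test>-expected.txt therefore changes what
# chromium-linux sees too, unless a chromium-linux-specific baseline already
# shadows it earlier in the search path.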
------- Comment #1 From 2012-09-18 14:37:56 PST -------
Really this is a bug with webkit-patch rebaseline*, not garden-o-matic.
------- Comment #2 From 2012-09-18 14:42:13 PST -------
hm. I thought we had code that checked that.
------- Comment #3 From 2012-09-18 14:48:08 PST -------
We have code that lets you hardcode "platform_move_to" in builders.py to do this. But it doesn't automatically figure that out from the hypergraph. I think we just need to change that logic to do it automatically based on the hypergraph data.
------- Comment #4 From 2012-09-18 14:55:28 PST -------
Turns out I'm thinking of the code in the baseline optimizer that checks to make sure optimizing doesn't change any results. You're right, we don't have any code that checks if other ports might need to be rebaselined if we change a baseline for one port.
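
A minimal sketch of the missing check, reusing the illustrative FALLBACK_PATHS and resolve_baseline from the sketch in the description (the helper below is hypothetical, not existing webkit-patch code): before overwriting a directory other ports fall back to, pin the current result for the ports that inherit it and are passing today.

import os
import shutil

def preserve_inheriting_ports(test_name, rebaselined_dir, passing_ports,
                              layout_tests_dir='LayoutTests'):
    """Copy each passing port's current effective baseline into its own
    platform directory if that port inherits it through rebaselined_dir,
    so overwriting rebaselined_dir cannot change the port's results."""
    expected = test_name.replace('.html', '-expected.txt')
    for port in passing_ports:
        search_path = FALLBACK_PATHS[port]   # from the sketch above
        if rebaselined_dir not in search_path:
            continue  # this port never looks in the directory being rewritten
        current = resolve_baseline(port, test_name)
        own_dir = search_path[0]             # the port's most specific directory
        if own_dir in current:
            continue  # the port already has its own baseline; it is unaffected
        dest = '/'.join([layout_tests_dir, own_dir, expected])
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copyfile(current, dest)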
------- Comment #5 From 2012-09-19 11:20:27 PST -------
I knew about this issue when I designed the algorithm, but it's not clear to me how to fix it.  The problem is that you often want to change the results for the other ports.

Consider the case where revision N changes the results of test X and previously all the Windows versions had the same results for test X.  For whatever reason, the WinXP bot hasn't processed revision N yet, so that bot still sees the old result.  The right solution here is to guess that the result is going to stay the same across Windows versions and to overwrite the chromium-win results with the results from the Win7 bot.  If we later discover that WinXP has a different result, we can then record that in the chromium-win-xp directory.

The algorithm is designed to be eventually correct.  If you keep rebaselining, you'll eventually get to the right state.  You just might not get there in one step.  It's impossible to always know what the correct final configuration is, so we have some basic heuristics in place.  It's very likely there are ways to improve the heuristics.
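
Concretely, the two-step convergence looks like this toy model, using an illustrative Windows fallback order (not the real builders.py data):

# Illustrative fallback order for the Windows ports in the example above.
WIN_FALLBACK = {
    'chromium-win-7':  ['platform/chromium-win', 'platform/chromium'],
    'chromium-win-xp': ['platform/chromium-win-xp', 'platform/chromium-win',
                        'platform/chromium'],
}

def effective_result(port, baselines):
    """baselines maps directory -> result text; first hit on the path wins."""
    for directory in WIN_FALLBACK[port]:
        if directory in baselines:
            return baselines[directory]
    return 'generic'

baselines = {'platform/chromium-win': 'old'}          # state before revision N

# Step 1: the Win7 bot has the new result, the XP bot is behind; guess the
# result is shared across Windows versions and overwrite the common directory.
baselines['platform/chromium-win'] = 'new-win7'
assert effective_result('chromium-win-xp', baselines) == 'new-win7'  # a guess

# Step 2: the XP bot catches up and genuinely differs; record its result in
# the version-specific directory, which shadows the shared one.
baselines['platform/chromium-win-xp'] = 'new-xp'
assert effective_result('chromium-win-xp', baselines) == 'new-xp'
assert effective_result('chromium-win-7', baselines) == 'new-win7'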
------- Comment #6 From 2012-09-19 12:16:55 PST -------
The issue with the current heuristics is that they break even if all the bots have run the tests. Maybe we should actually use different heuristics for the unexpected failures tab vs. the expected failures tab. In the former, the common case is that not all the bots have run; in the latter it's the opposite. We could make rebaselines from the expected failures tab always do the 100% right thing and the unexpected failures tab do the right thing most of the time (i.e. what it does now).

WDYT?
------- Comment #7 From 2012-09-19 12:29:48 PST -------
I think there's also potentially the problem that we're not giving enough information / direction to the rebaselining, e.g., it can be unclear whether we mean "change the win result and push the old one to linux" or "update the win result; we want linux to get the updated result also". I think the tooling lets you be explicit about this (i.e., we have the right infrastructure) but we may not have the best interface to it and users may not realize the side effects of their actions.

re: "it's impossible to always know what the correct final configuration is" ... I don't quite understand this; assuming all the bots have produced results, and you know which bots are failing and which aren't, you can determine what the correct configuration is, right? So you're saying just one or both of those assumptions might not hold? or am I missing something?
------- Comment #8 From 2012-09-19 13:11:08 PST -------
> Maybe we should actually use different heuristics for the unexpected failures tab vs the expected failures tab.

That makes sense.  The time-skew issue is much less likely to occur for the expected failures tab.

There's still the issue of configurations that don't have bots, but IMHO we should just delete those configurations.  That's mostly what we've been doing (e.g., the google-chrome configuration is gone).  Do we have any left?

> re: "it's impossible to always know what the correct final configuration is" ... I don't quite understand this; assuming all the bots have produced results, and you know which bots are failing and which aren't, you can determine what the correct configuration is, right? So you're saying just one or both of those assumptions might not hold? or am I missing something?

You can never know for certain that all the bots have produced a consistent set of results because the bots aren't synchronized.  Ojan's point is that the time-skew issue is less likely to be a problem on the expected failures tab.  It's definitely a problem on the unexpected failures tab.
------- Comment #9 From 2012-09-19 13:16:18 PST -------
(In reply to comment #8)
> 
> You can never know for certain that all the bots have produced a consistent set of results because the bots aren't synchronized.  

Got it, thanks.
------- Comment #10 From 2012-09-19 14:13:16 PST -------
(In reply to comment #8)
> There's still the issue of configurations that don't have bots, but IMHO we should just delete those configurations.

I agree. Any configurations without bots cannot be expected to have their expected results kept up to date by other ports. This includes, for example, keeping pixel results up to date for ports that don't run pixel tests.
------- Comment #11 From 2012-09-19 14:14:13 PST -------
In either case, sounds like we have consensus on a path forward. Just need to find someone with time to make it happen. :) I'm gardening next week, so maybe I'll have some time to spare.