Bug 72860 - proposal: separate baseline optimization from new baseline generation
Status: RESOLVED DUPLICATE of bug 69590
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: epoger
Reported: 2011-11-20 23:24 PST by epoger
Modified: 2012-03-05 17:54 PST

Description epoger 2011-11-20 23:24:59 PST
In conversation about https://bugs.webkit.org/show_bug.cgi?id=72746 ('new baselines for crbug 104128') and elsewhere, the following issue has come up:

If you try to create new baseline images "too soon" after tests start failing (when some but not all of the bots have completed a cycle), the rebaselining tools may create counterproductive results.  This happens because the tools optimize across platforms, while the bots for those platforms are always out of sync to some extent.

I think our lives would be simpler if the rebaselining tools just generated "dumb" new baselines (specific to each platform) and the optimization step were handled asynchronously, once the bots have calmed down.  This would allow us to greenify bots more quickly, thus improving everyone's productivity.

The optimization step could even be done as a batch process during a typically quiet time on the tree (overnight each night, or perhaps each weekend).

I suggest that we change the rebaselining tools such that, if run as they have been to date, they just generate the "dumb" (platform-specific) baselines; if run with an "optimize" flag or command, they will remove redundant images but not modify the actual expected results for any platform.

I volunteer to work on this, but I want to make sure that people think it's a good idea first...
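
A minimal sketch of the "optimize" step described above, assuming each platform searches an ordered list of directories for a baseline. The function names, the directory search paths, and the checksum strings are all hypothetical and are not taken from the existing rebaselining tools. The sketch only deletes baselines whose removal leaves every platform's expected result unchanged; it never moves files.

```python
# Hypothetical sketch: none of these names come from the WebKit tools.

def effective_result(platform, fallback, baselines):
    """Return the checksum of the first baseline found on a platform's search path."""
    for directory in fallback[platform]:
        if directory in baselines:
            return baselines[directory]
    return None


def remove_redundant(fallback, baselines):
    """Drop baselines whose removal leaves every platform's expected result unchanged.

    Files are only deleted, never moved between directories.
    """
    expected = {p: effective_result(p, fallback, baselines) for p in fallback}
    kept = dict(baselines)
    for directory in sorted(baselines):
        # Tentatively delete one baseline and re-check every platform.
        trial = {d: v for d, v in kept.items() if d != directory}
        if all(effective_result(p, fallback, trial) == expected[p] for p in fallback):
            kept = trial
    return kept


# Illustrative search paths only; the real fallback order may differ.
FALLBACK = {
    "chromium-win": ["chromium-win", "chromium", "generic"],
    "chromium-linux": ["chromium-linux", "chromium-win", "chromium", "generic"],
    "mac": ["mac", "generic"],
}

# "Dumb" rebaseline output: directory -> checksum of the baseline stored there.
dumb = {"chromium-win": "aaa", "chromium-linux": "aaa", "generic": "bbb"}
optimized = remove_redundant(FALLBACK, dumb)
```

In this example the chromium-linux baseline is dropped because that platform would fall back to the identical chromium-win one, while the other two baselines stay because removing either would change some platform's expected result.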
Comment 1 Ryosuke Niwa 2011-11-21 00:35:13 PST
Also see https://bugs.webkit.org/show_bug.cgi?id=69590.
Comment 2 Adam Barth 2011-11-21 00:53:19 PST
Ok.  I wonder if we should have two levels of optimization:

1) One that only affects the baselines you're applying right now (e.g., so you don't add purely redundant Windows/Linux baselines).

2) Another that optimizes all the baselines for a given test.
Comment 3 Adam Barth 2011-11-21 00:53:56 PST
Another possibility on the "light" optimization path is to just remove redundant baselines (and not do any moving around).
Comment 4 Peter Kasting 2011-11-21 09:48:40 PST
Frankly it would be fine with me if the tool optimized by default, had a switch to reduce the optimization to "only avoid committing redundant baselines", and most importantly, printed out which bots would be affected and whether they're currently passing or failing.
Comment 5 Dirk Pranke 2011-11-21 12:31:00 PST
Big fan of this idea, in particular the approach in comment #3 to just remove redundant baselines but otherwise not move things around; this is the approach the old rebaseline-c-w-t took.

Note that I do quite like optimize-baselines, just to be clear, but I'm increasingly thinking that it should rarely be run at the same time as when you are rebaselining. It's just too confusing.
Comment 6 Ojan Vafai 2011-11-21 14:48:52 PST
While only removing redundant baselines makes the rebaseline itself easier to make sense of, it makes generally managing layout test results more complicated. The more results there are for a given test, the harder it is to make sense of whether a new result is correct or not.

Also, moving tests around means we end up with considerably fewer tests in the tree, which improves checkout times, time to run the tests, etc.

I think all of these problems would be solved if pretty-diff had some special code to show you the effects of a rebaseline on the individual platforms instead of or in addition to the way the files are changing under the covers. It could have a UI kind of like the one in garden-o-matic, except instead of a tab per bot, it would have a tab per affected platform.

You shouldn't need to understand run-webkit-tests's fallback order in order to evaluate whether a rebaseline is correct.
Comment 7 Dirk Pranke 2011-11-21 15:04:52 PST
(In reply to comment #6)
> While only removing redundant baselines makes the rebaseline itself easier to make sense of, it makes generally managing layout test results more complicated. The more results there are for a given test, the harder it is to make sense of whether a new result is correct or not.

I'm not sure that this is true; saying "every platform produces a different result" is an easier thing to get your head wrapped around than "most platforms produce different results, but a couple are the same".

> 
> Also, moving tests around means we end up with considerably fewer tests in the tree, which improves checkout times, time to run the tests, etc.
>

As I tried to say above, I do believe that we should be trying to optimize the baselines to reduce the impact on the tree, all other things being equal.

I'm just saying that trying to judge whether or not your rebaselined baselines are correct can be harder when you have to figure out why the baselines have been moved into different directories, especially when optimizing can end up moving files you didn't even rebaseline.
 
> I think all of these problems would be solved if pretty-diff had some special code to show you the effects of a rebaseline on the individual platforms instead of or in addition to the way the files are changing under the covers. It could have a UI kind of like the one in garden-o-matic, except instead of a tab per bot, it would have a tab per affected platform.

I think it would be good to have a tool like this, but I'm not sure I'd want it to be pretty-diff. pretty-diff's functioning is very straightforward and I'd hate to have to complicate that up.

> 
> You shouldn't need to understand run-webkit-tests's fallback order in order to evaluate whether a rebaseline is correct.

I'm not sure that this is true, either. If you aren't comparing new baselines across ports, you may miss important things ... e.g., if a test that was previously producing the same result everywhere is now failing on Snow Leopard, that is worth distinguishing from a test that previously had a SL-specific result and the new result is different. How well you can figure this out without understanding the fallback order at least to some extent is not clear to me.
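
To make the distinction above concrete, here is a rough sketch that tells apart "this platform used the shared result and now needs its own" from "this platform already had its own result and it changed". It assumes the first directory on a platform's search path is that platform's own directory. The function names, the search paths, and the checksums are hypothetical, not the real run-webkit-tests fallback order.

```python
# Hypothetical sketch: names and search paths are illustrative only.

def effective_result(platform, fallback, baselines):
    """Return the checksum of the first baseline found on a platform's search path."""
    for directory in fallback[platform]:
        if directory in baselines:
            return baselines[directory]
    return None


def classify_change(platform, fallback, before, after):
    """Describe how a rebaseline changes one platform's expected result."""
    own_dir = fallback[platform][0]  # assumed to be the platform's own directory
    old = effective_result(platform, fallback, before)
    new = effective_result(platform, fallback, after)
    if old == new:
        return "unchanged"
    if own_dir in before:
        return "platform-specific result changed"
    if own_dir in after:
        return "newly platform-specific"
    return "shared result changed"


FALLBACK = {
    "mac-snowleopard": ["mac-snowleopard", "mac", "generic"],
    "mac": ["mac", "generic"],
}

# Case 1: the test used to produce the same result everywhere, now SL differs.
case1 = classify_change(
    "mac-snowleopard", FALLBACK,
    before={"generic": "aaa"},
    after={"generic": "aaa", "mac-snowleopard": "bbb"},
)

# Case 2: SL already had its own result, and that result changed.
case2 = classify_change(
    "mac-snowleopard", FALLBACK,
    before={"generic": "aaa", "mac-snowleopard": "ccc"},
    after={"generic": "aaa", "mac-snowleopard": "bbb"},
)
```

Both cases end with the same files on disk, so telling them apart requires looking at what the platform fell back to before the rebaseline.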
Comment 8 Ryosuke Niwa 2011-11-21 15:13:47 PST
(In reply to comment #7)
> (In reply to comment #6)
> > Also, moving tests around means we end up with considerably fewer tests in the tree, which improves checkout times, time to run the tests, etc.
> 
> As I tried to say above, I do believe that we should be trying to optimize the baselines to reduce the impact on the tree, all other things being equal.

> I'm just saying that trying to judge whether or not your rebaselined baselines are correct can be harder when you have to figure out why the baselines have been moved into different directories, especially, when optimizing can end up moving files you didn't even rebaseline.

Agreed.

> > I think all of these problems would be solved if pretty-diff had some special code to show you the effects of a rebaseline on the individual platforms instead of or in addition to the way the files are changing under the covers. It could have a UI kind of like the one in garden-o-matic, except instead of a tab per bot, it would have a tab per affected platform.
> 
> I think it would be good to have a tool like this, but I'm not sure I'd want it to be pretty-diff. pretty-diff's functioning is very straightforward and I'd hate to have to complicate that up.

Agreed. Furthermore, having to check the diff per platform will be quite tedious work even just for Chromium. If multiple platforms share the same result, it'll be nice not having to verify it for each platform.

I'll also add that I like simple plain old HTML over fancy UIs with lots of scripts.

> > You shouldn't need to understand run-webkit-tests's fallback order in order to evaluate whether a rebaseline is correct.
> 
> I'm not sure that this is true, either. If you aren't comparing new baselines across ports, you may miss important things ... e.g., if a test that was previously producing the same result everywhere is now failing on Snow Leopard, that is worth distinguishing from a test that previously had a SL-specific result and the new result is different. How well you can figure this out without understanding the fallback order at least to some extent is not clear to me.

Not sure. While I admit that knowing fallback paths is useful, I don't think we can expect gardeners or a random other contributor rebaselining a test to know fallback paths for every single platform.
Comment 9 Dirk Pranke 2011-11-21 15:19:29 PST
(In reply to comment #8)
> > > You shouldn't need to understand run-webkit-tests's fallback order in order to evaluate whether a rebaseline is correct.
> > 
> > I'm not sure that this is true, either. If you aren't comparing new baselines across ports, you may miss important things ... e.g., if a test that was previously producing the same result everywhere is now failing on Snow Leopard, that is worth distinguishing from a test that previously had a SL-specific result and the new result is different. How well you can figure this out without understanding the fallback order at least to some extent is not clear to me.
> 
> Not sure. While I admit that knowing fallback paths is useful, I don't think we can expect gardeners or a random other contributor rebaselining a test to know fallback paths for every single platform.

I don't think you have to have the fallback path memorized for every port. However, if a test is no longer producing a platform-specific result, or if it starts producing a platform-specific result (clearly the minority of new baselines), I think you should understand why that is before you rebaseline it. And I think understanding that probably requires some understanding of the fallback paths.
Comment 10 Ojan Vafai 2011-11-21 15:30:08 PST
> > > I think all of these problems would be solved if pretty-diff had some special code to show you the effects of a rebaseline on the individual platforms instead of or in addition to the way the files are changing under the covers. It could have a UI kind of like the one in garden-o-matic, except instead of a tab per bot, it would have a tab per affected platform.
> > 
> > I think it would be good to have a tool like this, but I'm not sure I'd want it to be pretty-diff. pretty-diff's functioning is very straightforward and I'd hate to have to complicate that up.
> 
> Agreed. Furthermore, having to check the diff per platform will be quite tedious work even just for Chromium. If multiple platforms share the same result, it'll be nice not having to verify it for each platform.

If it only shows you the diff for the platforms where the results are now actually different, then, if you are only rebaselining chromium ports, it would only list the chromium ports. Even though it may be moving around files for the other ports, you don't need to worry about it if that port ends up using an identical result.
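
A rough sketch of that filtering, assuming the tool can compute the result each platform resolves to before and after the rebaseline. The function names, search paths, and checksums are hypothetical. Platforms whose files moved but whose resolved result is identical are not listed.

```python
# Hypothetical sketch: names and search paths are illustrative only.

def effective_result(platform, fallback, baselines):
    """Return the checksum of the first baseline found on a platform's search path."""
    for directory in fallback[platform]:
        if directory in baselines:
            return baselines[directory]
    return None


def affected_platforms(fallback, before, after):
    """Map each platform whose resolved result changes to its (old, new) checksums."""
    changes = {}
    for platform in sorted(fallback):
        old = effective_result(platform, fallback, before)
        new = effective_result(platform, fallback, after)
        if old != new:
            changes[platform] = (old, new)
    return changes


FALLBACK = {
    "chromium-win": ["chromium-win", "chromium", "generic"],
    "chromium-linux": ["chromium-linux", "chromium-win", "chromium", "generic"],
    "mac": ["mac", "generic"],
}

before = {"chromium-win": "aaa", "chromium-linux": "aaa", "generic": "bbb"}
# After the rebaseline the optimizer has also moved the mac result from
# generic/ into mac/, but mac still resolves to the same checksum.
after = {"chromium": "ccc", "mac": "bbb"}

changes = affected_platforms(FALLBACK, before, after)
for platform, (old, new) in changes.items():
    print(f"{platform}: {old} -> {new}")
```

Only the two chromium platforms are reported here, even though the mac baseline changed directories.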
Comment 11 Dirk Pranke 2012-03-05 17:54:10 PST
Closing this as a duplicate of bug 69590 ... conversation can continue there if necessary.

The patch in that bug adds a --no-optimize step to 'webkit-patch rebaseline-expectations' but otherwise doesn't change the flow at all.
Comment 12 Dirk Pranke 2012-03-05 17:54:23 PST

*** This bug has been marked as a duplicate of bug 69590 ***