72860 – proposal: separate baseline optimization from new baseline generation

RESOLVED DUPLICATE of bug 69590

https://bugs.webkit.org/show_bug.cgi?id=72860

epoger

Reported 2011-11-20 23:24:59 PST

In conversation about https://bugs.webkit.org/show_bug.cgi?id=72746 ('new baselines for crbug 104128') and elsewhere, the following issue has come up:

If you try to create new baseline images "too soon" after tests start failing (when some but not all of the bots have completed a cycle), the rebaselining tools may create counterproductive results. This is due to them optimizing across different platforms, when the different platforms' bots are always out of sync to some extent.

I think our lives would be simpler if the rebaselining tools just generated "dumb" new baselines, specific to each platform, and then the optimization step was handled asynchronously, once the bots have calmed down. This would allow us to greenify bots more quickly, thus improving everyone's productivity. The optimization step could even be done as a batch process during a typically quiet time on the tree (overnight each night, or perhaps each weekend).

I suggest that we change the rebaselining tools such that, if run as they have been to date, they just generate the "dumb" (platform-specific) baselines; if run with an "optimize" flag or command, they will remove redundant images but not modify the actual expected results for any platform.

I volunteer to work on this, but I want to make sure that people think it's a good idea first...
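To make the "dumb" baseline idea concrete, here is a minimal sketch in Python. It is not the actual rebaselining tool: the `FALLBACK` table, the directory names, and both function names are hypothetical, and the tree is modeled as a plain dict from directory to result. It assumes each platform falls back directly to a shared `generic` directory.

```python
# Hypothetical fallback order (not WebKit's real one): directories are
# searched in order and the first one holding a baseline wins.
FALLBACK = {
    "win": ["platform/win", "generic"],
    "linux": ["platform/linux", "generic"],
    "mac": ["platform/mac", "generic"],
}


def effective_result(tree, platform):
    """Return the baseline a platform would use, or None if none exists."""
    for directory in FALLBACK[platform]:
        if directory in tree:
            return tree[directory]
    return None


def write_dumb_baselines(tree, new_results):
    """Store each reported result in that platform's own directory only."""
    updated = dict(tree)
    for platform, result in new_results.items():
        updated[FALLBACK[platform][0]] = result
    return updated


# Only the win and mac bots have cycled; linux has not reported yet.
tree = write_dumb_baselines({"generic": "old"}, {"win": "new-win", "mac": "new-mac"})
for platform in FALLBACK:
    print(platform, effective_result(tree, platform))
```

Under these assumptions, the platform whose bot has not cycled keeps its old expectation, because nothing was written outside the reporting platforms' own directories.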

Ryosuke Niwa

Comment 1 2011-11-21 00:35:13 PST

Also see https://bugs.webkit.org/show_bug.cgi?id=69590.

Adam Barth

Comment 2 2011-11-21 00:53:19 PST

Ok. I wonder if we should have two levels of optimization:

1) One that only affects the baselines you're applying right now (e.g., so you don't add purely redundant Windows/Linux baselines).
2) Another that optimizes all the baselines for a given test.

Adam Barth

Comment 3 2011-11-21 00:53:56 PST

Another possibility on the "light" optimization path is to just remove redundant baselines (and not do any moving around).
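A rough sketch of what "just remove redundant baselines" could mean, written in Python. This is an illustration, not the behavior of optimize-baselines or any real tool: the fallback table and directory names are made up, and a baseline is treated as redundant when deleting it leaves every platform's effective result unchanged. Nothing is moved between directories, and the last-resort `generic` directory is never deleted.

```python
# Hypothetical fallback order; first directory holding a baseline wins.
FALLBACK = {
    "win-xp": ["platform/win-xp", "platform/win", "generic"],
    "win": ["platform/win", "generic"],
    "mac": ["platform/mac", "generic"],
}


def effective_results(tree):
    """Map every platform to the baseline it would currently resolve to."""
    return {
        platform: next((tree[d] for d in directories if d in tree), None)
        for platform, directories in FALLBACK.items()
    }


def remove_redundant(tree, keep=("generic",)):
    """Delete baselines whose removal changes no platform's effective result."""
    pruned = dict(tree)
    expected = effective_results(pruned)
    progress = True
    while progress:
        progress = False
        for directory in sorted(pruned):
            if directory in keep:
                continue
            trial = {d: r for d, r in pruned.items() if d != directory}
            if effective_results(trial) == expected:
                pruned, progress = trial, True
                break  # rescan, since one removal can make another redundant
    return pruned


tree = {
    "generic": "A",
    "platform/win": "A",     # same as generic, so redundant
    "platform/win-xp": "B",  # differs, so it must stay
    "platform/mac": "A",     # same as generic, so redundant
}
print(remove_redundant(tree))
```

Checking the effective results of every platform after each trial deletion is slower than comparing a file with its immediate fallback, but it keeps the invariant obvious: no platform's expected result changes.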

Peter Kasting

Comment 4 2011-11-21 09:48:40 PST

Frankly it would be fine with me if the tool optimized by default, had a switch to reduce the optimization to "only avoid committing redundant baselines", and most importantly, printed out which bots would be affected and whether they're currently passing or failing.
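The "print which bots would be affected" part might look something like the Python sketch below. The bot names, fallback paths, and statuses are invented for illustration, and a real tool would have to fetch the current status from the build bots. The check is deliberately conservative: a bot is listed if a changed directory appears anywhere on its fallback path, even if a more specific baseline would shadow it.

```python
# Hypothetical bots: name -> (fallback path, current status).
BOTS = {
    "Win Tests": (["platform/win", "generic"], "failing"),
    "Linux Tests": (["platform/linux", "platform/win", "generic"], "passing"),
    "Mac Tests": (["platform/mac", "generic"], "passing"),
}


def affected_bots(changed_directories):
    """Return (bot, status) for each bot whose fallback path is touched."""
    changed = set(changed_directories)
    return [
        (name, status)
        for name, (path, status) in sorted(BOTS.items())
        if changed.intersection(path)
    ]


def format_report(changed_directories):
    """Render one line per possibly affected bot."""
    lines = [
        f"{name}: currently {status}"
        for name, status in affected_bots(changed_directories)
    ]
    return "\n".join(lines) or "no bots affected"


print(format_report(["platform/win"]))
```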

Dirk Pranke

Comment 5 2011-11-21 12:31:00 PST

Big fan of this idea, in particular the approach in comment #3 to just remove redundant baselines but otherwise not move things around; this is the approach the old rebaseline-c-w-t took. Note that I do quite like optimize-baselines, just to be clear, but I'm increasingly thinking that it should rarely be run at the same time as when you are rebaselining. It's just too confusing.

Ojan Vafai

Comment 6 2011-11-21 14:48:52 PST

While only removing redundant baselines makes the rebaseline itself easier to make sense of, it makes generally managing layout test results more complicated. The more results there are for a given test, the harder it is to make sense of whether a new result is correct or not.

Also, moving tests around means we end up with considerably fewer tests in the tree, which improves checkout times, time to run the tests, etc.

I think all of these problems would be solved if pretty-diff had some special code to show you the effects of a rebaseline on the individual platforms instead of or in addition to the way the files are changing under the covers. It could have a UI kind of like the one in garden-o-matic, except instead of a tab per bot, it would have a tab per affected platform.

You shouldn't need to understand run-webkit-tests's fallback order in order to evaluate whether a rebaseline is correct.
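The "effects on the individual platforms" view can be sketched as a comparison of each platform's effective result before and after the change, ignoring where the files physically live. This Python sketch is only an illustration: pretty-diff has no such feature in this thread, and the fallback table and function names are hypothetical.

```python
# Hypothetical fallback order; first directory holding a baseline wins.
FALLBACK = {
    "win": ["platform/win", "generic"],
    "mac": ["platform/mac", "generic"],
}


def effective_result(tree, platform):
    """Return the baseline a platform would use, or None if none exists."""
    for directory in FALLBACK[platform]:
        if directory in tree:
            return tree[directory]
    return None


def platforms_changed(before, after):
    """Map each platform whose effective result differs to (old, new)."""
    changes = {}
    for platform in FALLBACK:
        old = effective_result(before, platform)
        new = effective_result(after, platform)
        if old != new:
            changes[platform] = (old, new)
    return changes


# The rebaseline moves the shared result into generic and gives win a new one.
before = {"platform/mac": "A", "platform/win": "A"}
after = {"generic": "A", "platform/win": "C"}
print(platforms_changed(before, after))
```

In this example the mac baseline file moves to another directory, yet mac does not show up in the output because its effective result is identical; only win would need review.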

Dirk Pranke

Comment 7 2011-11-21 15:04:52 PST

(In reply to comment #6)
> While only removing redundant baselines makes the rebaseline itself easier to make sense of, it makes generally managing layout test results more complicated. The more results there are for a given test, the harder it is to make sense of whether a new result is correct or not.

I'm not sure that this is true; saying "every platform produces a different result" is an easier thing to get your head wrapped around than "most platforms produce different results, but a couple are the same".

> Also, moving tests around means we end up with considerably fewer tests in the tree, which improves checkout times, time to run the tests, etc.

As I tried to say above, I do believe that we should be trying to optimize the baselines to reduce the impact on the tree, all other things being equal. I'm just saying that trying to judge whether or not your rebaselined baselines are correct can be harder when you have to figure out why the baselines have been moved into different directories, especially when optimizing can end up moving files you didn't even rebaseline.

> I think all of these problems would be solved if pretty-diff had some special code to show you the effects of a rebaseline on the individual platforms instead of or in addition to the way the files are changing under the covers. It could have a UI kind of like the one in garden-o-matic, except instead of a tab per bot, it would have a tab per affected platform.

I think it would be good to have a tool like this, but I'm not sure I'd want it to be pretty-diff. pretty-diff's functioning is very straightforward and I'd hate to have to complicate that up.

> You shouldn't need to understand run-webkit-tests's fallback order in order to evaluate whether a rebaseline is correct.

I'm not sure that this is true, either. If you aren't comparing new baselines across ports, you may miss important things ... e.g., if a test that was previously producing the same result everywhere is now failing on Snow Leopard, that is worth distinguishing from a test that previously had a SL-specific result and the new result is different. How well you can figure this out without understanding the fallback order at least to some extent is not clear to me.
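The distinction drawn here (a shared result becoming platform-specific versus a platform-specific result simply changing) can be made mechanical by tracking which directory each result comes from. The Python sketch below is illustrative only; the fallback order shown for Snow Leopard is an assumption, not the real run-webkit-tests order, and the function names are hypothetical.

```python
# Hypothetical fallback order; first directory holding a baseline wins.
FALLBACK = {
    "mac-snowleopard": ["platform/mac-snowleopard", "platform/mac", "generic"],
    "mac": ["platform/mac", "generic"],
}


def source_directory(tree, platform):
    """Return the directory a platform's baseline is read from, or None."""
    for directory in FALLBACK[platform]:
        if directory in tree:
            return directory
    return None


def classify(before, after, platform):
    """Describe how a rebaseline changes where a platform gets its result."""
    own = FALLBACK[platform][0]
    was_specific = source_directory(before, platform) == own
    is_specific = source_directory(after, platform) == own
    if was_specific and is_specific:
        return "still platform-specific"
    if is_specific:
        return "newly platform-specific"
    if was_specific:
        return "no longer platform-specific"
    return "still shared"


shared = {"generic": "A"}
diverged = {"generic": "A", "platform/mac-snowleopard": "B"}
rebased = {"generic": "A", "platform/mac-snowleopard": "C"}
print(classify(shared, diverged, "mac-snowleopard"))
print(classify(diverged, rebased, "mac-snowleopard"))
```

A label like "newly platform-specific" is the case that arguably deserves a closer look before committing, since the test used to agree with the other ports.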

Ryosuke Niwa

Comment 8 2011-11-21 15:13:47 PST

(In reply to comment #7)
> (In reply to comment #6)
> > Also, moving tests around means we end up with considerably fewer tests in the tree, which improves checkout times, time to run the tests, etc.
>
> As I tried to say above, I do believe that we should be trying to optimize the baselines to reduce the impact on the tree, all other things being equal.
> I'm just saying that trying to judge whether or not your rebaselined baselines are correct can be harder when you have to figure out why the baselines have been moved into different directories, especially when optimizing can end up moving files you didn't even rebaseline.

Agreed.

> > I think all of these problems would be solved if pretty-diff had some special code to show you the effects of a rebaseline on the individual platforms instead of or in addition to the way the files are changing under the covers. It could have a UI kind of like the one in garden-o-matic, except instead of a tab per bot, it would have a tab per affected platform.
>
> I think it would be good to have a tool like this, but I'm not sure I'd want it to be pretty-diff. pretty-diff's functioning is very straightforward and I'd hate to have to complicate that up.

Agreed. Furthermore, having to check the diff per platform will be quite tedious work even just for Chromium. If multiple platforms share the same result, it'll be nice not having to verify it for each platform. I'll also add that I like simple plain old HTML over fancy UIs with lots of scripts.

> > You shouldn't need to understand run-webkit-tests's fallback order in order to evaluate whether a rebaseline is correct.
>
> I'm not sure that this is true, either. If you aren't comparing new baselines across ports, you may miss important things ... e.g., if a test that was previously producing the same result everywhere is now failing on Snow Leopard, that is worth distinguishing from a test that previously had a SL-specific result and the new result is different. How well you can figure this out without understanding the fallback order at least to some extent is not clear to me.

Not sure. While I admit that knowing fallback paths is useful, I don't think we can expect gardeners or a random other contributor rebaselining a test to know fallback paths for every single platform.

Dirk Pranke

Comment 9 2011-11-21 15:19:29 PST

(In reply to comment #8)
> > > You shouldn't need to understand run-webkit-tests's fallback order in order to evaluate whether a rebaseline is correct.
> >
> > I'm not sure that this is true, either. If you aren't comparing new baselines across ports, you may miss important things ... e.g., if a test that was previously producing the same result everywhere is now failing on Snow Leopard, that is worth distinguishing from a test that previously had a SL-specific result and the new result is different. How well you can figure this out without understanding the fallback order at least to some extent is not clear to me.
>
> Not sure. While I admit that knowing fallback paths is useful, I don't think we can expect gardeners or a random other contributor rebaselining a test to know fallback paths for every single platform.

I don't think you have to have the fallback path memorized for every port. However, I think if a test is no longer producing a platform-specific result, or if it starts producing a platform-specific result (clearly the minority of new baselines), I think you should understand why that is before you rebaseline it. And I think understanding that probably requires some understanding of the fallback paths.

Ojan Vafai

Comment 10 2011-11-21 15:30:08 PST

> > > I think all of these problems would be solved if pretty-diff had some special code to show you the effects of a rebaseline on the individual platforms instead of or in addition to the way the files are changing under the covers. It could have a UI kind of like the one in garden-o-matic, except instead of a tab per bot, it would have a tab per affected platform.
> >
> > I think it would be good to have a tool like this, but I'm not sure I'd want it to be pretty-diff. pretty-diff's functioning is very straightforward and I'd hate to have to complicate that up.
>
> Agreed. Furthermore, having to check the diff per platform will be quite tedious work even just for Chromium. If multiple platforms share the same result, it'll be nice not having to verify it for each platform.

If it only shows you the diff for the platforms where the results are now actually different, then, if you are only rebaselining chromium ports, it would only list the chromium ports. Even though it may be moving around files for the other ports, you don't need to worry about it if that port ends up using an identical result.

Dirk Pranke

Comment 11 2012-03-05 17:54:10 PST

Closing this as a duplicate of bug 69590 ... conversation can continue there if necessary. The patch in that bug adds a --no-optimize step to 'webkit-patch rebaseline-expectations' but otherwise doesn't change the flow at all.
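For readers unfamiliar with this style of opt-out flag, here is a hedged Python sketch of how a `--no-optimize` switch can leave the default flow unchanged. It is not the patch from bug 69590 and not webkit-patch's actual option handling; the parser, the `plan_steps` helper, and the step names are all hypothetical.

```python
import argparse


def build_parser():
    """Build a toy parser; optimization stays on unless explicitly disabled."""
    parser = argparse.ArgumentParser(prog="rebaseline-sketch")
    parser.add_argument(
        "--no-optimize",
        dest="optimize",
        action="store_false",
        default=True,
        help="write new baselines but skip the optimization pass",
    )
    return parser


def plan_steps(argv):
    """Return the ordered list of (hypothetical) steps for the given arguments."""
    options = build_parser().parse_args(argv)
    steps = ["fetch new results", "write baselines"]
    if options.optimize:
        steps.append("optimize baselines")
    return steps


print(plan_steps([]))
print(plan_steps(["--no-optimize"]))
```

Because the flag only switches a step off, anyone running the command as before gets the same flow as before.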

Dirk Pranke

Comment 12 2012-03-05 17:54:23 PST

*** This bug has been marked as a duplicate of bug 69590 ***

Status RESOLVED

Resolution DUPLICATE of bug 69590

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component Tools / Tests

Assignee epoger

Reported 2011-11-20 23:24 PST

Modified 2012-03-05 17:54 PST

CC List 5 users