Allow a test to specify a custom pixel tolerance in a ref test
Created attachment 262492 [details] Patch
Comment on attachment 262492 [details]
Patch

Attachment 262492 [details] did not pass mac-ews (mac):
Output: http://webkit-queues.webkit.org/results/249175

New failing tests:
compositing/overlap-blending/children-opacity-no-overlap.html
compositing/overlap-blending/children-opacity-huge.html
compositing/overlap-blending/nested-non-overlap-clipping.html
Created attachment 262496 [details]
Archive of layout-test-results from ews103 for mac-mavericks

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: ews103  Port: mac-mavericks  Platform: Mac OS X 10.9.5
Comment on attachment 262492 [details]
Patch

Attachment 262492 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-queues.webkit.org/results/249179

New failing tests:
fast/images/gif-loop-count.html
compositing/overlap-blending/children-opacity-no-overlap.html
compositing/overlap-blending/nested-non-overlap-clipping.html
compositing/overlap-blending/children-opacity-huge.html
fast/images/animated-gif-no-layout.html
Created attachment 262497 [details]
Archive of layout-test-results from ews106 for mac-mavericks-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: ews106  Port: mac-mavericks-wk2  Platform: Mac OS X 10.9.5
I like this a lot!
Comment on attachment 262492 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=262492&action=review

> Tools/ChangeLog:4
> + Allow a test to specify a custom pixel tolerance in a ref test
> + https://bugs.webkit.org/show_bug.cgi?id=149828

I still do not think that this is desirable.

It is very difficult or impossible to reason about what tolerance is acceptable for a test, or what it even means to have a particular tolerance. I have no idea what 0.004 means, and I expect that this default makes some existing tests ineffective.
(In reply to comment #7)
> It is very difficult or impossible to reason about what tolerance is
> acceptable for a test, or what it even means to have a particular
> tolerance. I have no idea what 0.004 means, and I expect that this default
> makes some existing tests ineffective.

So you're OK with the arbitrary small tolerance that ImageDiff currently allows?

The tolerance value here is the same value that ImageDiff returns, which we've been reporting in test results for ages. ImageDiff explains it as "The difference as a percentage combining both the number of different pixels and their difference amount i.e. the average distance over the entire image".

Without this, it's impossible to do some kinds of testing.
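[For readers unfamiliar with the metric: a minimal Python sketch of a "difference percentage" along the lines ImageDiff describes above. This illustrates the idea only; the real ImageDiff is C++ and its exact formula may differ.]

```python
import math

def difference_percentage(base, test, width, height):
    """base/test: flat RGBA byte sequences of equal size (4 bytes per pixel)."""
    total_distance = 0.0
    for i in range(0, width * height * 4, 4):
        # Euclidean distance between pixels, normalized to 0..1 per channel.
        d = math.sqrt(sum(((base[i + c] - test[i + c]) / 255.0) ** 2
                          for c in range(4)))
        total_distance += d
    # Average distance over the entire image, expressed as a percentage;
    # many identical pixels dilute a few very different ones.
    return 100.0 * total_distance / (width * height)
```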
> So you're OK with the arbitrary small tolerance that ImageDiff currently
> allows?

Not really. It's also confusing, and the formula used for calculating the difference doesn't seem to make sense.
Adding a way to manually control the tolerance is a step back, though, as it adds a new dimension to possible mistakes.

Let's discuss what kinds of testing are impossible; maybe there is another way to implement those.
(In reply to comment #10)
> Adding a way to manually control the tolerance is a step back, though, as
> it adds a new dimension to possible mistakes.
>
> Let's discuss what kinds of testing are impossible; maybe there is another
> way to implement those.

An idea that comes to mind is that we could add support for specifying a per-test tolerance in the TestExpectations files. That way, different ports can specify different tolerances, or no tolerance at all, for a test.
(In reply to comment #11)
> (In reply to comment #10)
> > Adding a way to manually control the tolerance is a step back, though,
> > as it adds a new dimension to possible mistakes.
> >
> > Let's discuss what kinds of testing are impossible; maybe there is
> > another way to implement those.
>
> An idea that comes to mind is that we could add support for specifying a
> per-test tolerance in the TestExpectations files. That way, different
> ports can specify different tolerances, or no tolerance at all, for a test.

That seems even more error-prone than the test author changing the tolerance, since a tree maintainer may have no idea why tolerance is relevant in the specific scenario. And increasing the tolerance enough will make the test pass regardless of whether there's a real bug, which is bad.

Also, TestExpectations suck as a way of annotating tests with runtime parameters. It's bad enough trying to differentiate image vs. text failures and other outcomes.
We need something like this for testing gradients. Floating-point rounding gives us tiny differences in ref tests that are perfectly acceptable. Our alternative is to effectively implement the tolerance in the test itself, the way we now do for gradients in the canvas/philip suite. That is much more error-prone and requires duplication.
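[The canvas/philip suite does its tolerance check in JavaScript inside the test page; this Python sketch just shows the idea of an in-test per-channel epsilon, with the epsilon value chosen arbitrarily for illustration.]

```python
def pixel_within_tolerance(actual, expected, epsilon=2):
    """actual/expected: (r, g, b, a) tuples; epsilon in 0-255 channel units.

    The test reads pixels back itself and allows each channel to deviate
    by a small epsilon, instead of requiring a pixel-perfect image match.
    """
    return all(abs(a - e) <= epsilon for a, e in zip(actual, expected))
```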
(In reply to Alexey Proskuryakov from comment #7)
> Comment on attachment 262492 [details]
> Patch
>
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=262492&action=review
>
> > Tools/ChangeLog:4
> > + Allow a test to specify a custom pixel tolerance in a ref test
> > + https://bugs.webkit.org/show_bug.cgi?id=149828
>
> I still do not think that this is desirable.
>
> It is very difficult or impossible to reason about what tolerance is
> acceptable for a test, or what it even means to have a particular
> tolerance. I have no idea what 0.004 means, and I expect that this default
> makes some existing tests ineffective.

Would you be OK with a solution that allows specifying an allowed pixel tolerance in the TestExpectations file of each port, for each test? Something like this:

imported/blink/fast/canvas/canvas-clip-stack-persistence-diffs.html [ PixelTolerance=0.05 ]
fast/gradients [ PixelTolerance=0.001 ]

That would mean that imported/blink/fast/canvas/canvas-clip-stack-persistence-diffs.html passes if the diff on the image is <= 0.05, and that all tests under fast/gradients pass if the diff is <= 0.001 (assuming no later line overrides that default for a specific test under that directory).

That way we can set the value specifically for each test and port, so whoever sets the value can reason about whether it makes sense for that very specific test.
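[A minimal sketch of how webkitpy might resolve such a modifier, assuming the hypothetical PixelTolerance syntax proposed above; TestExpectations has no such keyword, and pixel_tolerance_for is an invented helper for illustration.]

```python
import re

_TOLERANCE_RE = re.compile(r'PixelTolerance=([0-9.]+)')

def pixel_tolerance_for(test_path, expectation_lines):
    """Return the most specific tolerance for test_path, or None.

    expectation_lines is an iterable of (path_prefix, modifiers) tuples,
    e.g. ('fast/gradients', 'PixelTolerance=0.001').
    """
    best_prefix, best_tolerance = '', None
    for prefix, modifiers in expectation_lines:
        match = _TOLERANCE_RE.search(modifiers)
        # The longest matching prefix wins, so a per-test line overrides
        # a directory-wide default such as fast/gradients.
        if match and test_path.startswith(prefix) and len(prefix) >= len(best_prefix):
            best_prefix, best_tolerance = prefix, float(match.group(1))
    return best_tolerance
```

With the two example lines above, pixel_tolerance_for('fast/gradients/simple-gradient.html', ...) would return 0.001.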
We would need to clearly define what a tolerance value means.

There are many bugs where the problem manifests as a single line or even a single pixel of entirely wrong color. Then there are many bugs where the problem is a slight color variance.

The current algorithm for computing a distance doesn't account for that.
(In reply to Alexey Proskuryakov from comment #15)
> We would need to clearly define what a tolerance value means.
>
> There are many bugs where the problem manifests as a single line or even a
> single pixel of entirely wrong color. Then there are many bugs where the
> problem is a slight color variance.
>
> The current algorithm for computing a distance doesn't account for that.

Another idea is that a reftest result could be overridden by a PNG file. That way, the PNG result produced by the test suite can be dropped into the LayoutTests directory as needed:

LayoutTests/imported/blink/fast/canvas/canvas-clip-stack-persistence.html
LayoutTests/imported/blink/fast/canvas/canvas-clip-stack-persistence-expected.html
LayoutTests/platform/gtk/imported/blink/fast/canvas/canvas-clip-stack-persistence-expected.png

That would mean that on the GTK+ port the test imported/blink/fast/canvas/canvas-clip-stack-persistence.html would pass if it produces an image equal to the one generated by LayoutTests/imported/blink/fast/canvas/canvas-clip-stack-persistence-expected.html *or* an image equal to the one stored at LayoutTests/platform/gtk/imported/blink/fast/canvas/canvas-clip-stack-persistence-expected.png.
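[A hypothetical sketch of the lookup order that idea implies, assuming the layout described above; expected_results_for is an invented helper, not existing webkitpy code.]

```python
import os

def expected_results_for(layout_tests_dir, port_name, test_path):
    """Return the expectations that exist for test_path, most specific first.

    A platform-specific expected PNG, if present, would be accepted as an
    alternative baseline alongside the reference HTML.
    """
    base, _ = os.path.splitext(test_path)
    candidates = [
        os.path.join(layout_tests_dir, 'platform', port_name, base + '-expected.png'),
        os.path.join(layout_tests_dir, base + '-expected.html'),
    ]
    # The test passes if its output matches *any* existing expectation.
    return [path for path in candidates if os.path.exists(path)]
```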
(In reply to Alexey Proskuryakov from comment #15)
> We would need to clearly define what a tolerance value means.
>
> There are many bugs where the problem manifests as a single line or even a
> single pixel of entirely wrong color. Then there are many bugs where the
> problem is a slight color variance.
>
> The current algorithm for computing a distance doesn't account for that.

What I am asking for is NOT the ability to give an arbitrary test a tolerance. I'm talking about tests that by their nature cannot be replicated 100% in a ref test, e.g. due to floating-point rounding. Gradients are a good example of this.

I agree that it is not desirable for people to "fix" flaky or bad tests with a tolerance value.
<rdar://problem/62682186>
We should implement the WPT approach: http://web-platform-tests.org/writing-tests/reftests.html#fuzzy-matching
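[For context: WPT fuzzy matching lets a test declare bounds such as `<meta name="fuzzy" content="maxDifference=0-2;totalPixels=0-40">`, where maxDifference bounds the largest per-channel delta and totalPixels bounds how many pixels may differ. Below is a minimal Python sketch of that comparison, assuming RGBA pixels and bounds already parsed into inclusive (lo, hi) ranges; wptrunner's real implementation differs in detail.]

```python
def fuzzy_match(reference_pixels, test_pixels, max_difference, total_pixels):
    """reference_pixels/test_pixels: equal-length lists of (r, g, b, a) tuples;
    max_difference/total_pixels: (lo, hi) tuples of inclusive bounds."""
    largest_delta = 0
    differing = 0
    for ref, actual in zip(reference_pixels, test_pixels):
        delta = max(abs(r - a) for r, a in zip(ref, actual))  # largest per-channel delta
        if delta:
            differing += 1
            largest_delta = max(largest_delta, delta)
    return (max_difference[0] <= largest_delta <= max_difference[1]
            and total_pixels[0] <= differing <= total_pixels[1])
```

One nice property of ranges (as opposed to simple maxima): if I read the WPT docs correctly, a nonzero lower bound makes the test fail once it becomes pixel-perfect, which flags a stale fuzzy annotation for removal.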
Created attachment 442670 [details] Patch
Comment on attachment 442670 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=442670&action=review

Does this cause any new test passes? It seems like it should, at least in the WPT transforms directory.

> Tools/Scripts/webkitpy/layout_tests/controllers/single_test_runner.py:387
> +

Nit: I think this
Created attachment 443320 [details] Patch
Committed r287323 (245473@main): <https://commits.webkit.org/245473@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 443320 [details].