Bug 149828

Summary: Allow a way to specify a custom pixel tolerance in a ref test
Product: WebKit Reporter: Simon Fraser (smfr) <simon.fraser>
Component: New Bugs    Assignee: Simon Fraser (smfr) <simon.fraser>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, bburg, bjonesbe, buildbot, clopez, commit-queue, dbates, dino, ews-watchlist, fred.wang, glenn, gsnedders, jbedard, mrobinson, rniwa, ryanhaddad, simon.fraser, thorton, webkit-bug-importer, zalan
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=234908
https://bugs.webkit.org/show_bug.cgi?id=235084
https://bugs.webkit.org/show_bug.cgi?id=236680
Bug Depends on: 232399, 232410, 232522, 232523, 232525    
Bug Blocks: 142258, 146523, 211261, 213511, 215324    
Attachments:
Description                                                         Flags
Patch                                                               ap: review-, buildbot: commit-queue-
Archive of layout-test-results from ews103 for mac-mavericks        none
Archive of layout-test-results from ews106 for mac-mavericks-wk2    none
Patch                                                               mrobinson: review+, ews-feeder: commit-queue-
Patch                                                               none

Description Simon Fraser (smfr) 2015-10-05 20:45:40 PDT
Allow a test to specify a custom pixel tolerance in a ref test
Comment 1 Simon Fraser (smfr) 2015-10-05 21:07:05 PDT
Created attachment 262492 [details]
Patch
Comment 2 Build Bot 2015-10-05 21:39:32 PDT
Comment on attachment 262492 [details]
Patch

Attachment 262492 [details] did not pass mac-ews (mac):
Output: http://webkit-queues.webkit.org/results/249175

New failing tests:
compositing/overlap-blending/children-opacity-no-overlap.html
compositing/overlap-blending/children-opacity-huge.html
compositing/overlap-blending/nested-non-overlap-clipping.html
Comment 3 Build Bot 2015-10-05 21:39:36 PDT
Created attachment 262496 [details]
Archive of layout-test-results from ews103 for mac-mavericks

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: ews103  Port: mac-mavericks  Platform: Mac OS X 10.9.5
Comment 4 Build Bot 2015-10-05 21:44:17 PDT
Comment on attachment 262492 [details]
Patch

Attachment 262492 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-queues.webkit.org/results/249179

New failing tests:
fast/images/gif-loop-count.html
compositing/overlap-blending/children-opacity-no-overlap.html
compositing/overlap-blending/nested-non-overlap-clipping.html
compositing/overlap-blending/children-opacity-huge.html
fast/images/animated-gif-no-layout.html
Comment 5 Build Bot 2015-10-05 21:44:20 PDT
Created attachment 262497 [details]
Archive of layout-test-results from ews106 for mac-mavericks-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: ews106  Port: mac-mavericks-wk2  Platform: Mac OS X 10.9.5
Comment 6 zalan 2015-10-05 21:45:09 PDT
I like this a lot!
Comment 7 Alexey Proskuryakov 2015-10-05 21:53:26 PDT
Comment on attachment 262492 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=262492&action=review

> Tools/ChangeLog:4
> +        Allow a test to specify a custom pixel tolerance in a ref test
> +        https://bugs.webkit.org/show_bug.cgi?id=149828

I still do not think that this is desirable.

It is very difficult or impossible to reason about what tolerance is acceptable for a test, or what it even means to have a particular tolerance. I have no idea what 0.004 means, and I expect that this default makes some existing tests ineffective.
Comment 8 Simon Fraser (smfr) 2015-10-05 23:15:41 PDT
(In reply to comment #7)
> It is very difficult or impossible to reason about what tolerance is
> acceptable for a test, or what it even means to have a particular tolerance.
> I have no idea what 0.004 means, and I expect that this default makes some
> existing tests ineffective.

So you're OK with the arbitrary small tolerance that ImageDiff currently allows?

The tolerance value here is the same value that ImageDiff returns, which we've been reporting in test results for ages. ImageDiff explains it as "The difference as a percentage combining both the number of different pixels and their difference amount i.e. the average distance over the entire image".

Without this, it's impossible to do some kinds of testing.
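
For reference, a minimal sketch of what a combined difference metric of that kind can look like, assuming per-pixel RGBA distances averaged over the whole image; this is only an illustration, not the actual ImageDiff source:

    # Hypothetical sketch of an ImageDiff-style "difference percentage": the
    # per-pixel channel distance is normalized to [0, 1] and averaged over every
    # pixel, so both how many pixels differ and how far they differ contribute
    # to one number. Not the real ImageDiff code.
    def difference_percentage(actual_pixels, expected_pixels):
        """actual_pixels and expected_pixels are equal-length lists of (r, g, b, a) tuples in 0-255."""
        if not actual_pixels:
            return 0.0
        total_distance = 0.0
        for actual, expected in zip(actual_pixels, expected_pixels):
            # Euclidean distance in RGBA space, normalized so that the largest
            # possible per-pixel difference (all four channels off by 255) is 1.0.
            distance = sum((a - e) ** 2 for a, e in zip(actual, expected)) ** 0.5 / (2.0 * 255.0)
            total_distance += distance
        # Average distance over the entire image, expressed as a percentage.
        return 100.0 * total_distance / len(actual_pixels)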
Comment 9 Alexey Proskuryakov 2015-10-05 23:21:16 PDT
> So you're OK with the arbitrary small tolerance that ImageDiff currently allows?

Not really. It's also confusing, and the formula used for calculating the difference doesn't seem to make sense.
Comment 10 Alexey Proskuryakov 2015-10-05 23:22:41 PDT
Adding a way to manually control the tolerance is a step back though, as it adds a new dimension to possible mistakes.

Let's discuss what kinds of testing are impossible, maybe there is another way to implement those.
Comment 11 Carlos Alberto Lopez Perez 2015-11-13 10:09:03 PST
(In reply to comment #10)
> Adding a way to manually control the tolerance is a step back though, as it
> adds a new dimension to possible mistakes.
> 
> Let's discuss what kinds of testing are impossible, maybe there is another
> way to implement those.

An idea that comes to mind is that we could add support for specifying, in the TestExpectations files, a tolerance for a given test. That way different ports can specify different tolerances, or no tolerance at all, for a test.
Comment 12 BJ Burg 2015-11-18 23:26:25 PST
(In reply to comment #11)
> (In reply to comment #10)
> > Adding a way to manually control the tolerance is a step back though, as it
> > adds a new dimension to possible mistakes.
> > 
> > Let's discuss what kinds of testing are impossible, maybe there is another
> > way to implement those.
> 
> An idea that comes to mind is that we could add support for specifying, in
> the TestExpectations files, a tolerance for a given test. That way different
> ports can specify different tolerances, or no tolerance at all, for a test.

That seems even more error-prone than the test author changing the tolerance, since a tree maintainer may have no idea why tolerance is relevant in the specific scenario. And increasing the tolerance enough will make the test pass regardless of whether there's a real bug, which is bad.

Also, TestExpectations suck as a way of annotating tests with runtime parameters. It's bad enough trying to differentiate image vs. text failures and other outcomes.
Comment 13 Dean Jackson 2017-04-25 10:28:07 PDT
We need something like this for testing gradients. Floating point rounding gives us tiny differences in ref tests that are perfectly acceptable.

Our alternative is to effectively implement the tolerance in the test itself, the way we now do for gradients in the canvas/philip suite. This is much more error-prone and requires duplication.
Comment 14 Carlos Alberto Lopez Perez 2017-04-25 10:42:09 PDT
(In reply to Alexey Proskuryakov from comment #7)
> Comment on attachment 262492 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=262492&action=review
> 
> > Tools/ChangeLog:4
> > +        Allow a test to specify a custom pixel tolerance in a ref test
> > +        https://bugs.webkit.org/show_bug.cgi?id=149828
> 
> I still do not think that this is desirable.
> 
> It is very difficult or impossible to reason about what tolerance is
> acceptable for a test, or what it even means to have a particular tolerance.
> I have no idea what 0.004 means, and I expect that this default makes some
> existing tests ineffective.

Would you be OK with a solution that allows specifying an allowed pixel-tolerance value in each port's TestExpectations file, for each test?

Something like this:

imported/blink/fast/canvas/canvas-clip-stack-persistence-diffs.html [ PixelTolerance=0.05 ]
fast/gradients [ PixelTolerance=0.001 ]

That would mean that imported/blink/fast/canvas/canvas-clip-stack-persistence-diffs.html will pass if the image diff is <= 0.05, and that all tests under fast/gradients will pass if the diff is <= 0.001 (assuming there isn't a more specific line below overriding that default for a particular test under fast/gradients).

That way we can set the value specifically for each test and port, so whoever sets the value can reason about whether it makes sense for that very specific test.
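
A minimal sketch of how such a per-test annotation might be consumed when deciding whether a reftest comparison passes. The PixelTolerance modifier and the helper names below are hypothetical, not existing webkitpy API; this only illustrates looking up the most specific matching path:

    # Hypothetical sketch: look up a per-test tolerance parsed from TestExpectations
    # and compare it against the difference percentage ImageDiff reported. Neither a
    # PixelTolerance modifier nor tolerance_for_test() exists in webkitpy today.
    DEFAULT_TOLERANCE = 0.0

    def tolerance_for_test(tolerances, test_name):
        # tolerances: dict mapping a test or directory path to an allowed tolerance,
        # e.g. {'fast/gradients': 0.001}. The longest matching prefix wins, mirroring
        # how a more specific TestExpectations line would override a directory-wide one.
        best_match = None
        for path, tolerance in tolerances.items():
            if test_name == path or test_name.startswith(path.rstrip('/') + '/'):
                if best_match is None or len(path) > len(best_match[0]):
                    best_match = (path, tolerance)
        return best_match[1] if best_match else DEFAULT_TOLERANCE

    def reftest_images_match(difference_percentage, tolerances, test_name):
        return difference_percentage <= tolerance_for_test(tolerances, test_name)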
Comment 15 Alexey Proskuryakov 2017-04-25 10:53:44 PDT
We would need to clearly define what a tolerance value means.

There are many bugs where the problem manifests as a single line or even a single pixel of entirely wrong color. Then there are many bugs where the problem is a slight color variance.

The current algorithm for computing a distance doesn't account for that.
Comment 16 Carlos Alberto Lopez Perez 2017-04-25 11:07:41 PDT
(In reply to Alexey Proskuryakov from comment #15)
> We would need to clearly define what a tolerance value means.
> 
> There are many bugs where the problem manifests as a single line or even a
> single pixel of entirely wrong color. Then there are many bugs where the
> problem is a slight color variance.
> 
> The current algorithm for computing a distance doesn't account for that.

Another idea is that a reftest result could be overridden by a PNG file. That way the PNG result produced by the test suite can be dropped into the LayoutTests directory as needed:


LayoutTests/imported/blink/fast/canvas/canvas-clip-stack-persistence.html
LayoutTests/imported/blink/fast/canvas/canvas-clip-stack-persistence-expected.html
LayoutTests/platform/gtk/imported/blink/fast/canvas/canvas-clip-stack-persistence-expected.png


That would mean that on the GTK+ port the test imported/blink/fast/canvas/canvas-clip-stack-persistence.html would pass if it produces an image equal to the one generated by LayoutTests/imported/blink/fast/canvas/canvas-clip-stack-persistence-expected.html *or* if it produces an image equal to the one stored at LayoutTests/platform/gtk/imported/blink/fast/canvas/canvas-clip-stack-persistence-expected.png
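
A minimal sketch of that fallback logic, assuming (hypothetically) a helper in the results comparison that consults a port-specific -expected.png when the -expected.html reference does not match:

    # Hypothetical sketch of the fallback described above: a reftest passes if the
    # rendered output matches the image produced by the -expected.html reference or,
    # failing that, a port-specific -expected.png checked into LayoutTests/platform/.
    # Raw byte comparison is used here only for brevity; a real implementation would
    # compare decoded images through ImageDiff. The helper name is illustrative.
    import os

    def reftest_passes(actual_png, reference_png, port_specific_png_path):
        if actual_png == reference_png:
            return True
        if os.path.exists(port_specific_png_path):
            with open(port_specific_png_path, 'rb') as expected_file:
                return actual_png == expected_file.read()
        return False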
Comment 17 Dean Jackson 2017-04-26 10:23:24 PDT
(In reply to Alexey Proskuryakov from comment #15)
> We would need to clearly define what a tolerance value means.
> 
> There are many bugs where the problem manifests as a single line or even a
> single pixel of entirely wrong color. Then there are many bugs where the
> problem is a slight color variance.
> 
> The current algorithm for computing a distance doesn't account for that.

What I am asking for is NOT the ability to give an arbitrary test a tolerance. I'm talking about tests that, by their nature, cannot be replicated 100% in a ref test, e.g. due to floating-point rounding. Gradients are a good example of this.

I agree that it is not desirable if people fix flakey/bad tests with a tolerance value.
Comment 18 Radar WebKit Bug Importer 2020-04-30 16:13:33 PDT
<rdar://problem/62682186>
Comment 19 Simon Fraser (smfr) 2020-07-27 10:30:17 PDT
We should implement the WPT approach: http://web-platform-tests.org/writing-tests/reftests.html#fuzzy-matching
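
With the WPT approach, a test declares an allowed range for the maximum per-channel color difference and for the number of differing pixels (e.g. <meta name="fuzzy" content="maxDifference=0-2;totalPixels=0-300">). A rough sketch of that check, illustrative only and not the webkitpy implementation:

    # Rough sketch of the WPT fuzzy-matching check: a comparison passes when the
    # maximum per-channel difference of any pixel and the total number of differing
    # pixels both fall inside the ranges the test declares.
    def fuzzy_match(actual_pixels, expected_pixels, max_difference_range, total_pixels_range):
        """Pixels are (r, g, b, a) tuples in 0-255; ranges are inclusive (lo, hi) tuples."""
        max_channel_difference = 0
        differing_pixels = 0
        for actual, expected in zip(actual_pixels, expected_pixels):
            channel_difference = max(abs(a - e) for a, e in zip(actual, expected))
            if channel_difference:
                differing_pixels += 1
                max_channel_difference = max(max_channel_difference, channel_difference)
        return (max_difference_range[0] <= max_channel_difference <= max_difference_range[1]
                and total_pixels_range[0] <= differing_pixels <= total_pixels_range[1])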
Comment 20 Simon Fraser (smfr) 2021-10-27 21:04:05 PDT
Created attachment 442670 [details]
Patch
Comment 21 Martin Robinson 2021-10-28 10:20:25 PDT
Comment on attachment 442670 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=442670&action=review

Does this cause any new test passes? It seems like it should, at least in the WPT transforms directory.

> Tools/Scripts/webkitpy/layout_tests/controllers/single_test_runner.py:387
> +

Nit: I think this
Comment 22 Simon Fraser (smfr) 2021-11-04 10:54:57 PDT
Created attachment 443320 [details]
Patch
Comment 23 EWS 2021-12-21 10:35:58 PST
Committed r287323 (245473@main): <https://commits.webkit.org/245473@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 443320 [details].