Bug 161242 - [GTK][Threaded Compositor] Several flaky tests
Summary: [GTK][Threaded Compositor] Several flaky tests
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: WebKit Local Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
URL:
Keywords: Gtk
Depends on:
Blocks:
 
Reported: 2016-08-26 06:53 PDT by Carlos Garcia Campos
Modified: 2016-09-07 08:31 PDT (History)
5 users (show)

See Also:


Attachments
Patch (2.75 KB, patch)
2016-09-06 03:35 PDT, Carlos Garcia Campos
no flags Details | Formatted Diff | Diff
Patch (8.72 KB, patch)
2016-09-06 08:41 PDT, Carlos Garcia Campos
mcatanzaro: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Carlos Garcia Campos 2016-08-26 06:53:25 PDT
Even after fixing bug #160450 we still have a lot of laky tests sometimes in the bots. We no longer see the problem of scrollbars partially rendered, but we see scrollbars in some cases where they shouldn't be there at all (the contents are not bigger than the view size). That made me think that maybe there are tests leaving the view in an inconsistent state, most of the flaky tests are ref tests and all run by the same worker. But I tried to run all the tests using a single worker and I couldn't reproduce the problem so there must be something else. The other thing I can think of is that maybe in some cases the bot is more overloaded and a particular worker has less priority and things happen slower. We are still assuming that scheduling the task in the next loop iteration in the UI process after a force redraw ensures a redraw. The patch I initially posted in bug #160450 as a workaround tried to fix that assumption. I'm not sure that's the problem though, but we could try that patch and roll it out if it doesn't work. I'm a bit lost with this, because it only happens sometimes in the bots and I've never been able to reproduce it locally in any of machines I've tried.
Comment 1 Michael Catanzaro 2016-08-26 08:21:20 PDT
Doesn't hurt to try.
Comment 2 Carlos Alberto Lopez Perez 2016-08-26 10:19:58 PDT
(In reply to comment #0)
> Even after fixing bug #160450 we still have a lot of laky tests sometimes in
> the bots. We no longer see the problem of scrollbars partially rendered, but
> we see scrollbars in some cases where they shouldn't be there at all (the
> contents are not bigger than the view size). That made me think that maybe
> there are tests leaving the view in an inconsistent state, most of the flaky
> tests are ref tests and all run by the same worker. But I tried to run all
> the tests using a single worker and I couldn't reproduce the problem so
> there must be something else. The other thing I can think of is that maybe
> in some cases the bot is more overloaded and a particular worker has less
> priority and things happen slower. We are still assuming that scheduling the
> task in the next loop iteration in the UI process after a force redraw
> ensures a redraw. The patch I initially posted in bug #160450 as a
> workaround tried to fix that assumption. I'm not sure that's the problem
> though, but we could try that patch and roll it out if it doesn't work. I'm
> a bit lost with this, because it only happens sometimes in the bots and I've
> never been able to reproduce it locally in any of machines I've tried.

I think you will be able to reproduce this if you run the whole test suite instead a subset of it.

And this is something that I believe was happening before the threaded compositor with other kind of tests.

I have found many tests that are flaky when the whole test suite is run, but if you run then manually or you run a subset of the test suite (for example: run only the fast/ tests). Then its not longer reproducible.

I'm not sure if the problem is on webkit itself or is in the tooling and the way the tests are ran.
Comment 3 Carlos Garcia Campos 2016-08-26 23:42:33 PDT
(In reply to comment #2)
> (In reply to comment #0)
> > Even after fixing bug #160450 we still have a lot of laky tests sometimes in
> > the bots. We no longer see the problem of scrollbars partially rendered, but
> > we see scrollbars in some cases where they shouldn't be there at all (the
> > contents are not bigger than the view size). That made me think that maybe
> > there are tests leaving the view in an inconsistent state, most of the flaky
> > tests are ref tests and all run by the same worker. But I tried to run all
> > the tests using a single worker and I couldn't reproduce the problem so
> > there must be something else. The other thing I can think of is that maybe
> > in some cases the bot is more overloaded and a particular worker has less
> > priority and things happen slower. We are still assuming that scheduling the
> > task in the next loop iteration in the UI process after a force redraw
> > ensures a redraw. The patch I initially posted in bug #160450 as a
> > workaround tried to fix that assumption. I'm not sure that's the problem
> > though, but we could try that patch and roll it out if it doesn't work. I'm
> > a bit lost with this, because it only happens sometimes in the bots and I've
> > never been able to reproduce it locally in any of machines I've tried.
> 
> I think you will be able to reproduce this if you run the whole test suite
> instead a subset of it.

I did run all the tests using a single worker.

> And this is something that I believe was happening before the threaded
> compositor with other kind of tests.
> 
> I have found many tests that are flaky when the whole test suite is run, but
> if you run then manually or you run a subset of the test suite (for example:
> run only the fast/ tests). Then its not longer reproducible.
> 
> I'm not sure if the problem is on webkit itself or is in the tooling and the
> way the tests are ran.

I'm pretty sure the problem is in the tests infrastructure.
Comment 4 Carlos Garcia Campos 2016-08-27 02:14:32 PDT
(In reply to comment #1)
> Doesn't hurt to try.

Yes, I don't think it will fix it, to be honest, but actually ensuring a drawing in dispatchAfterEnsuringDrawing instead of assuming it will happen in the next run loop iteration makes sense in any case. So, let's see if it helps.
Comment 5 Carlos Garcia Campos 2016-08-27 02:20:53 PDT
Committed r205075: <http://trac.webkit.org/changeset/205075>
Comment 6 Michael Catanzaro 2016-08-27 17:58:43 PDT
(In reply to comment #4) 
> Yes, I don't think it will fix it, to be honest

Right as usual. :(

> but actually ensuring a
> drawing in dispatchAfterEnsuringDrawing instead of assuming it will happen
> in the next run loop iteration makes sense in any case.

Agreed.
Comment 7 Michael Catanzaro 2016-08-27 17:59:14 PDT
(Reopening.)
Comment 8 Carlos Garcia Campos 2016-08-28 03:38:01 PDT
(In reply to comment #6)
> (In reply to comment #4) 
> > Yes, I don't think it will fix it, to be honest
> 
> Right as usual. :(

However, it seems the number of flaky tests has been reduced a bit, and what is more important, the errors we see now are different too. Before the patch, there were tests in which we were showing scrollbars when we shouldn't. Now what I see is that contents are blurry and/or clipped, and in many of the cases both the expected and the actual rendering are wrong. So, I think there's a condition somewhere that can leave a different page scale or device scale factor in the web view that affects other tests.

> > but actually ensuring a
> > drawing in dispatchAfterEnsuringDrawing instead of assuming it will happen
> > in the next run loop iteration makes sense in any case.
> 
> Agreed.
Comment 9 Michael Catanzaro 2016-08-28 12:39:52 PDT
(In reply to comment #8)
> However, it seems the number of flaky tests has been reduced a bit

It was probably just luck: https://build.webkit.org/builders/GTK%20Linux%2064-bit%20Release%20%28Tests%29/builds/17930
Comment 10 Zan Dobersek 2016-08-28 23:43:09 PDT
Would it be possible to test on the bots without using the software rasterizer?

The clipping issues look as if the resizing of the underlying pixmap is not yet complete, or as if just a part of it is rendered.
Comment 11 Carlos Garcia Campos 2016-08-29 01:21:31 PDT
(In reply to comment #10)
> Would it be possible to test on the bots without using the software
> rasterizer?

Yes, please, swrast doesn't work with wayland either, because of eglCreateImage being dummy when using the software rasterizer.

> The clipping issues look as if the resizing of the underlying pixmap is not
> yet complete, or as if just a part of it is rendered.

Why are we using the software path, to not depend on the bot drivers? Maybe that's why I could never reproduce those issues. And that would explain why all this started when switched to the threaded compositor.
Comment 12 Carlos Alberto Lopez Perez 2016-08-30 06:31:09 PDT
(In reply to comment #11)
> (In reply to comment #10)
> > Would it be possible to test on the bots without using the software
> > rasterizer?
> 
> Yes, please, swrast doesn't work with wayland either, because of
> eglCreateImage being dummy when using the software rasterizer.
> 
> > The clipping issues look as if the resizing of the underlying pixmap is not
> > yet complete, or as if just a part of it is rendered.
> 
> Why are we using the software path, to not depend on the bot drivers? Maybe
> that's why I could never reproduce those issues. And that would explain why
> all this started when switched to the threaded compositor.

We use swrast because its the best way of ensuring that developers can get consistent results for the layout tests.

Otherwise it will be a nightmare. Think in developers having nvidia cards, or intel ones. With different versions of the drivers, etc.


I have tested the following:

1) Reproduce the problem. I was able to do this at the first try. The trick is run the whole test suite using the same workers as number of CPUs, which is the default. So if you want to reproduce this, don't force any number of workers and simply tun the full test suite.

I did this:
    $ Tools/Scripts/run-webkit-tests --no-show-results --no-new-test-results --no-sample-on-timeout --results-directory layout-test-results-xvfb-swrast --debug-rwt-logging --release --gtk

And I got this:
    https://people.igalia.com/clopez/wkbug/161242/xresults/layout-test-results-xvfb-swrast/results.html

From the stdout.log, you see that I got 122 unexpected image-only flakiness (122): 
    https://people.igalia.com/clopez/wkbug/161242/xresults/layout-test-results-xvfb-swrast/


Now I did the same but using my native x11 display with the opengl nvidia drivers. To do that, simply export the variable USE_NATIVE_XDISPLAY=1 before running the tests. I did several runs of this, storing each run in a different directory. This are the results:

 -> Run 1: Unexpected flakiness: image-only failures (159) https://people.igalia.com/clopez/wkbug/161242/xresults/layout-test-results-xvfb-xdisplay1/results.html
 -> Run 2: Unexpected flakiness: image-only failures (25) https://people.igalia.com/clopez/wkbug/161242/xresults/layout-test-results-xvfb-xdisplay2/results.html
 -> Run 3: Unexpected flakiness: image-only failures (22) https://people.igalia.com/clopez/wkbug/161242/xresults/layout-test-results-xvfb-xdisplay3/results.html


So, my conclusion is that is not caused by swrast. I can reproduce the flakiness also with the nvidia driver.


BTW, Check this results:

 - swrast: https://people.igalia.com/clopez/wkbug/161242/xresults/layout-test-results-xvfb-swrast/fast/flexbox/auto-height-with-flex-diffs.html
 - nvidia run 1: https://people.igalia.com/clopez/wkbug/161242/xresults/layout-test-results-xvfb-xdisplay1/fast/flexbox/auto-height-with-flex-diffs.html
 - nvidia run 2: https://people.igalia.com/clopez/wkbug/161242/xresults/layout-test-results-xvfb-xdisplay2/fast/flexbox/auto-height-with-flex-diffs.html


As you can see. Both the expected and the actual images are different on each one of the three cases.

On the nvidia run 2, the expected result didn't even reached the point to render anything than a blank viewport.

I think that using swrast instead of hardware-backed opengl is not the problem here.
Comment 13 Carlos Garcia Campos 2016-08-30 06:47:28 PDT
Thanks for the so detailed analysis, Carlos. So, the results are the same than in the bots, both the expected and actual files are incorrectly rendered and scale factors seem to have something to do there. Maybe there's a condition (test timing out or crashing) causing that any of the scale factors is note correctly reset between tests. The device scale factor is changed from the UI process, while all other are changed in the web process on every BeginTest message.
Comment 14 Carlos Garcia Campos 2016-09-01 05:06:43 PDT
Finally found the issue. The problem is the viewport controller, that is changed by the fixed layout tests, but never reset. There's also a problem with the fixed layout tests that could leave the view in an inconsistent state as I explained in bug #160117. I'm already working on a proper fix, but in the meantime I've skipped fixed layout tests (they are only two) to make sure we don't see all those flaky tests in the bots anymore. To easily reproduce it, launch run-webkit-tests with a single worker job and run fixed layout test first and then other ref tests, for example:

$ run-webkit-tests --gtk --release --child-processes=1 fast/fixed-layout/fixed-layout.html fast/regions
Comment 15 Michael Catanzaro 2016-09-01 10:33:52 PDT
(In reply to comment #14)
> Finally found the issue. The problem is the viewport controller, that is
> changed by the fixed layout tests, but never reset. There's also a problem
> with the fixed layout tests that could leave the view in an inconsistent
> state as I explained in bug #160117. I'm already working on a proper fix,
> but in the meantime I've skipped fixed layout tests (they are only two) to
> make sure we don't see all those flaky tests in the bots anymore. To easily
> reproduce it, launch run-webkit-tests with a single worker job and run fixed
> layout test first and then other ref tests, for example:
> 
> $ run-webkit-tests --gtk --release --child-processes=1
> fast/fixed-layout/fixed-layout.html fast/regions

It's clearly a big improvement, good job.

But it's still not completely fixed. We have 50 unexpected passes in build #18009, for example. These are tests that normally always fail, but occasionally pass now that we've enabled threaded compositor.
Comment 16 Carlos Garcia Campos 2016-09-01 22:40:04 PDT
(In reply to comment #15)
> (In reply to comment #14)
> > Finally found the issue. The problem is the viewport controller, that is
> > changed by the fixed layout tests, but never reset. There's also a problem
> > with the fixed layout tests that could leave the view in an inconsistent
> > state as I explained in bug #160117. I'm already working on a proper fix,
> > but in the meantime I've skipped fixed layout tests (they are only two) to
> > make sure we don't see all those flaky tests in the bots anymore. To easily
> > reproduce it, launch run-webkit-tests with a single worker job and run fixed
> > layout test first and then other ref tests, for example:
> > 
> > $ run-webkit-tests --gtk --release --child-processes=1
> > fast/fixed-layout/fixed-layout.html fast/regions
> 
> It's clearly a big improvement, good job.
> 
> But it's still not completely fixed. We have 50 unexpected passes in build
> #18009, for example. These are tests that normally always fail, but
> occasionally pass now that we've enabled threaded compositor.

You are assuming again that there's a single issue. This particular problem that was making both the expected and actual rendering of reftest fail, was caused by the fixed layout (it changes the viewport controller state that is never reset). Of course there are more issues, but I think the situation now is mostly the same to what we had before switching to the threaded compositor.
Comment 17 Michael Catanzaro 2016-09-02 04:52:41 PDT
(In reply to comment #16)
> You are assuming again that there's a single issue.

No, clearly there were many different issues, as you've fixed three different issues so far....

> I think the situation now
> is mostly the same to what we had before switching to the threaded
> compositor.

Yes, except for the fact that sometimes a bunch of tests pass when they usually fail. In build #18023 we have 84 unexpected passes; that never happened before. What's interesting is these tests either always pass or always fail in a particular run of run-webkit-tests; they are clearly flaky, but they don't contribute to the flakiness count.
Comment 18 Carlos Garcia Campos 2016-09-02 06:25:48 PDT
(In reply to comment #17)
> (In reply to comment #16)
> > You are assuming again that there's a single issue.
> 
> No, clearly there were many different issues, as you've fixed three
> different issues so far....
> 
> > I think the situation now
> > is mostly the same to what we had before switching to the threaded
> > compositor.
> 
> Yes, except for the fact that sometimes a bunch of tests pass when they
> usually fail. In build #18023 we have 84 unexpected passes; that never
> happened before. What's interesting is these tests either always pass or
> always fail in a particular run of run-webkit-tests; they are clearly flaky,
> but they don't contribute to the flakiness count.

We should probably mark them as pass, to see what's wrong when they fail
Comment 19 Carlos Alberto Lopez Perez 2016-09-02 08:32:44 PDT
(In reply to comment #18)
> (In reply to comment #17)
> > (In reply to comment #16)
> > > You are assuming again that there's a single issue.
> > 
> > No, clearly there were many different issues, as you've fixed three
> > different issues so far....
> > 
> > > I think the situation now
> > > is mostly the same to what we had before switching to the threaded
> > > compositor.
> > 
> > Yes, except for the fact that sometimes a bunch of tests pass when they
> > usually fail. In build #18023 we have 84 unexpected passes; that never
> > happened before. What's interesting is these tests either always pass or
> > always fail in a particular run of run-webkit-tests; they are clearly flaky,
> > but they don't contribute to the flakiness count.
> 
> We should probably mark them as pass, to see what's wrong when they fail

That don't makes much sense to me.

You can see the failure diff even if they are expected failures. You just have to navigate under https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20%28Tests%29/ to find them.

For example: 

1) Check the unexpected passes at end of build #18023 https://build.webkit.org/builders/GTK%20Linux%2064-bit%20Release%20%28Tests%29/builds/18023/steps/layout-test/logs/stdio and pick 1

2) I pick: imported/mozilla/svg/blend-difference-stacking.html

3) Check that on the next build that test failed. Log of the next build: https://build.webkit.org/builders/GTK%20Linux%2064-bit%20Release%20%28Tests%29/builds/18024/steps/layout-test/logs/stdio <--- yes it failed

4) Go to https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20%28Tests%29/ and find the results for build #18204 -> https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20%28Tests%29/r205335%20%2818024%29/

5) Now under that folder navigate to imported/mozilla/svg/ and look for the blend-difference-stacking diff

6) Here you have it: https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20%28Tests%29/r205335%20%2818024%29/imported/mozilla/svg/blend-difference-stacking-diffs.html
Comment 20 Carlos Garcia Campos 2016-09-02 22:44:41 PDT
(In reply to comment #19)
> (In reply to comment #18)
> > (In reply to comment #17)
> > > (In reply to comment #16)
> > > > You are assuming again that there's a single issue.
> > > 
> > > No, clearly there were many different issues, as you've fixed three
> > > different issues so far....
> > > 
> > > > I think the situation now
> > > > is mostly the same to what we had before switching to the threaded
> > > > compositor.
> > > 
> > > Yes, except for the fact that sometimes a bunch of tests pass when they
> > > usually fail. In build #18023 we have 84 unexpected passes; that never
> > > happened before. What's interesting is these tests either always pass or
> > > always fail in a particular run of run-webkit-tests; they are clearly flaky,
> > > but they don't contribute to the flakiness count.
> > 
> > We should probably mark them as pass, to see what's wrong when they fail
> 
> That don't makes much sense to me.
> 
> You can see the failure diff even if they are expected failures. You just
> have to navigate under
> https://build.webkit.org/results/GTK%20Linux%2064-
> bit%20Release%20%28Tests%29/ to find them.
> 
> For example: 
> 
> 1) Check the unexpected passes at end of build #18023
> https://build.webkit.org/builders/GTK%20Linux%2064-
> bit%20Release%20%28Tests%29/builds/18023/steps/layout-test/logs/stdio and
> pick 1
> 
> 2) I pick: imported/mozilla/svg/blend-difference-stacking.html
> 
> 3) Check that on the next build that test failed. Log of the next build:
> https://build.webkit.org/builders/GTK%20Linux%2064-
> bit%20Release%20%28Tests%29/builds/18024/steps/layout-test/logs/stdio <---
> yes it failed
> 
> 4) Go to
> https://build.webkit.org/results/GTK%20Linux%2064-
> bit%20Release%20%28Tests%29/ and find the results for build #18204 ->
> https://build.webkit.org/results/GTK%20Linux%2064-
> bit%20Release%20%28Tests%29/r205335%20%2818024%29/
> 
> 5) Now under that folder navigate to imported/mozilla/svg/ and look for the
> blend-difference-stacking diff
> 
> 6) Here you have it:
> https://build.webkit.org/results/GTK%20Linux%2064-
> bit%20Release%20%28Tests%29/r205335%20%2818024%29/imported/mozilla/svg/blend-
> difference-stacking-diffs.html

hmm, well, not very convenient, but still. I have theory, though. When they unexpectedly pass, they don't actually pass, we just fail to render both the actual and expected files in the same way (for example a fully white image in both cases). That's a problem of the reftests. So, more interesting to see what's failing, which can be also reproduced locally more easily, would be to see what we render when they pass.
Comment 21 Carlos Garcia Campos 2016-09-06 03:35:41 PDT
Created attachment 288009 [details]
Patch
Comment 22 Michael Catanzaro 2016-09-06 07:55:40 PDT
Comment on attachment 288009 [details]
Patch

(wrong bug)
Comment 23 Carlos Garcia Campos 2016-09-06 08:41:16 PDT
Created attachment 288025 [details]
Patch
Comment 24 Michael Catanzaro 2016-09-06 09:32:54 PDT
Comment on attachment 288025 [details]
Patch

r=me for the CoordinatedGraphics changes. I'll let you decide whether or not to wait for an owner on the iOS code you moved; it looks clearly fine to me, though.
Comment 25 Carlos Garcia Campos 2016-09-06 23:43:08 PDT
Committed r205537: <http://trac.webkit.org/changeset/205537>
Comment 26 Michael Catanzaro 2016-09-07 05:26:20 PDT
Reopening since we still have one bug causing flakiness left (per comment #17)
Comment 27 Carlos Garcia Campos 2016-09-07 06:09:20 PDT
Please no. Use new bug reports for remaining flaky tests. We have always had flaky tests, so remaining ones might not be related to threaded compositor.
Comment 28 Michael Catanzaro 2016-09-07 08:31:14 PDT
(In reply to comment #27)
> Please no. Use new bug reports for remaining flaky tests. We have always had
> flaky tests, so remaining ones might not be related to threaded compositor.

Bug #161688.

They are almost certainly related to threaded compositor; we've never had more than maybe 10 flaky tests appear in unexpected passes until threaded compositor, now we have dozens.