Bug 191002 - At least two dozen IntersectionObserver tests are flaky
Summary: At least two dozen IntersectionObserver tests are flaky
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKitGTK (show other bugs)
Version: Other
Hardware: PC Linux
: P2 Normal
Assignee: Ali Juma
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-27 17:43 PDT by Michael Catanzaro
Modified: 2019-05-14 13:40 PDT (History)
12 users (show)

See Also:


Attachments
Update expectations (4.39 KB, patch)
2018-10-29 10:07 PDT, Ali Juma
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Catanzaro 2018-10-27 17:43:47 PDT
https://build.webkit.org/results/GTK%20Linux%2064-bit%20Release%20(Tests)/r237491%20(8613)/results.html shows two dozen flaky IntersectionObserver tests. It looks like these tests all usually pass, but sometimes fail all together? The couple I cheked have been flaky since they were added in r234723 "Import WPTs for IntersectionObserver" and unfortunately not fixed by r237218 "Flaky IntersectionObserver web platform tests involving style updates".

I'm marking the entire imported/w3c/web-platform-tests/intersection-observer/ directory as flaky.
Comment 1 Michael Catanzaro 2018-10-27 17:45:43 PDT
Flakiness is observed also on Apple bots, so I'll do this in the global expectations file.
Comment 2 Michael Catanzaro 2018-10-27 17:47:22 PDT
(In reply to Michael Catanzaro from comment #1)
> Flakiness is observed also on Apple bots, so I'll do this in the global
> expectations file.

(That said, the GTK bots are *way* more flaky than the Apple bots.)
Comment 3 Michael Catanzaro 2018-10-28 18:20:08 PDT
Here's an example failure that occurred after r237218:

https://build.webkit.org/results/Apple%20High%20Sierra%20Debug%20WK2%20(Tests)/r237322%20(5225)/results.html

--- /Volumes/Data/slave/highsierra-debug-tests-wk2/build/layout-test-results/imported/w3c/web-platform-tests/intersection-observer/containing-block-expected.txt
+++ /Volumes/Data/slave/highsierra-debug-tests-wk2/build/layout-test-results/imported/w3c/web-platform-tests/intersection-observer/containing-block-actual.txt
@@ -1,7 +1,7 @@
 
 PASS IntersectionObserver should only report intersections if root is a containing block ancestor of target. 
-PASS In containing block and intersecting. 
-PASS In containing block and not intersecting. 
-PASS Not in containing block and intersecting. 
-PASS Not in containing block and not intersecting. 
+FAIL In containing block and intersecting. assert_equals: entries.length expected 1 but got 0
+FAIL In containing block and not intersecting. assert_equals: entries.length expected 2 but got 1
+FAIL Not in containing block and intersecting. assert_equals: entries.length expected 2 but got 1
+FAIL Not in containing block and not intersecting. assert_equals: entries.length expected 2 but got 1
Comment 4 Ali Juma 2018-10-29 06:47:01 PDT
The root cause of this flakiness is that the test suite assumes HTMLEventLoop timing, and WebKit doesn't implement the HTMLEventLoop spec (and, when I brought this up at the Contributors' Meeting, there wasn't consensus that we should implement that spec). More specifically, the tests assume that if changes to style/layout are made in a rAF callback, then intersection observations will be updated before the next rAF callback, and events will fire within one setTimeout of that second callback. In HTMLEventLoop-based engines, this is straightforward to implement -- rAF is immediately followed by style/layout, and immediately followed by an update of intersection observations. After spending a while considering different approaches for WebKit, the most predictable (and closest to the spec) approach was to update intersection observations at the same time as we flush layers.

On iOS, flushing and rAF are both driven by didUpdate messages received from the UIProcess. None of the tests are flaky on iOS.

Mac is slightly different, but rAF still gets throttled when we fall behind on flushes. There's still a bit of flakiness, but still, the vast majority of tests are not flaky. And those that are flaky, are only flaky on Debug bots (where we're more likely to fall behind on flushes), not on Release bots.

So I suspect that on GTK, there's perhaps weaker coupling between rAF and flushes, and so on some runs we get into a state where flushing/drawing is slower for some reason, so the tests fail because the observations they're expecting haven't been computed yet.

In any case, given the above, I think that marking the entire suite as flaky in the global expectations file is too broad. So I'm going to move that over to the GTK expectations file.
Comment 5 Michael Catanzaro 2018-10-29 07:42:42 PDT
Thanks for the detailed explanation! Sounds like some investigation of the GTK graphics stack would be needed here.

(In reply to Ali Juma from comment #4)
> In any case, given the above, I think that marking the entire suite as flaky
> in the global expectations file is too broad. So I'm going to move that over
> to the GTK expectations file.

FWIW I don't agree with this because the flakes are still occurring on several Apple bots so it's going to result in spurious complaints from the EWS.
Comment 6 Ali Juma 2018-10-29 07:47:05 PDT
(In reply to Michael Catanzaro from comment #5)
> Thanks for the detailed explanation! Sounds like some investigation of the
> GTK graphics stack would be needed here.
> 
> (In reply to Ali Juma from comment #4)
> > In any case, given the above, I think that marking the entire suite as flaky
> > in the global expectations file is too broad. So I'm going to move that over
> > to the GTK expectations file.
> 
> FWIW I don't agree with this because the flakes are still occurring on
> several Apple bots so it's going to result in spurious complaints from the
> EWS.

I'm also going to add expectations for the specific tests that are flaky on Mac Debug (rather than having a global expectation for the whole test suite).
Comment 7 Michael Catanzaro 2018-10-29 07:57:22 PDT
OK, sounds good!
Comment 8 Ali Juma 2018-10-29 10:07:45 PDT
Created attachment 353294 [details]
Update expectations
Comment 9 Ali Juma 2018-10-29 10:09:16 PDT
Comment on attachment 353294 [details]
Update expectations

Clearing flags on attachment: 353294

Committed r237556: <https://trac.webkit.org/changeset/237556>
Comment 10 Michael Catanzaro 2018-10-29 10:57:45 PDT
There were definitely some failures on WK1 bots too, which should be accounted for.
Comment 11 Ali Juma 2018-10-29 11:04:14 PDT
(In reply to Michael Catanzaro from comment #10)
> There were definitely some failures on WK1 bots too, which should be
> accounted for.

These tests were skipped entirely on mac-wk1 in r237218. And on the Win bots, all web platform tests are skipped.

Looking at https://webkit-test-results.webkit.org/dashboards/flakiness_dashboard.html#tests=intersection-observer,
the only flakiness I see is on Mac WK2 Debug and GTK. 

Are there others that I'm missing?
Comment 12 Michael Catanzaro 2018-10-30 06:43:02 PDT
Hmm, I don't know. I'm sure I saw them, but maybe they were too far back in the history? Not sure. Clearly it looks good now!
Comment 13 Ali Juma 2019-05-14 13:40:22 PDT
After recent changes to rAF timing to align with the spec, these tests are no longer flaky on Mac Debug, so I've updated the expectations in https://trac.webkit.org/changeset/245304.