Bug 201550

Summary: REGRESSION: http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html is frequently timing out on iOS EWS bots
Product: WebKit Reporter: Ryan Haddad <ryanhaddad>
Component: Tools / TestsAssignee: Chris Dumez <cdumez>
Status: RESOLVED FIXED    
Severity: Normal CC: aakash_jain, achristensen, ap, bfulgham, cdumez, commit-queue, dbates, ews-watchlist, japhet, thorton, webkit-bot-watchers-bugzilla, webkit-bug-importer, wilander
Priority: P2 Keywords: InRadar
Version: Other   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=198923
Attachments:
Description Flags
Patch
none
Patch none

Description Ryan Haddad 2019-09-06 11:00:31 PDT
http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html is frequently failing on iOS EWS bots and resulting in false positives

https://ews-build.webkit.org/#/builders/18
Comment 1 Ryan Haddad 2019-09-06 11:01:57 PDT
The test was a flaky timeout on trunk bots until 9/4/19, but it hasn't appeared again since then:
https://webkit-test-results.webkit.org/dashboards/flakiness_dashboard.html#tests=http%2Ftests%2FresourceLoadStatistics%2Fdo-not-capture-statistics-for-simple-top-navigations.html
Comment 2 Ryan Haddad 2019-09-06 11:03:47 PDT
Looking at the last 1000 EWS runs, the first timeout I see was 14 days ago:
https://ews-build.webkit.org/#/builders/18/builds/2379

...then the next one was 7 days ago:
https://ews-build.webkit.org/#/builders/18/builds/2803

Since then, it has been timing out very frequently.
Comment 3 Ryan Haddad 2019-09-06 11:06:28 PDT
The patch that presented the failure 7 days ago was Bug 201344 [PSON] We no longer need to throw away suspended pages in a process before using it for a navigation

That did introduce a few flaky adClickAttribution test timeouts, documented in:
https://bugs.webkit.org/show_bug.cgi?id=201440

That issue was fixed yesterday with https://trac.webkit.org/changeset/249557
Comment 4 Ryan Haddad 2019-09-06 11:13:32 PDT
Chris did mention that his change did change the timing of things, so it could possibly make things more / less flaky. I'm still not sure why we don't see this on trunk bots.

I'm not sure if this could have something to do with test order. We use '--skip-failing-tests' on EWS and also retry failures, so that could explain the different results.
Comment 5 Ryan Haddad 2019-09-06 11:46:25 PDT
I'm trying to reproduce this locally.
Comment 6 Tim Horton 2019-09-06 12:09:27 PDT
It reproduced locally for me; it didn't seem like the process was hung, just ... not doing anything.
Comment 7 Chris Dumez 2019-09-06 13:29:18 PDT
(In reply to Tim Horton from comment #6)
> It reproduced locally for me; it didn't seem like the process was hung, just
> ... not doing anything.

It reproduces after https://trac.webkit.org/changeset/249557 ?
Comment 8 Tim Horton 2019-09-06 13:43:32 PDT
Oh, hmm! I definitely didn't have that.
Comment 9 Ryan Haddad 2019-09-06 14:27:31 PDT
I can reproduce it with r249575, so r249557 did not resolve the issue with this test.
Comment 10 Chris Dumez 2019-09-06 14:28:40 PDT
(In reply to Ryan Haddad from comment #9)
> I can reproduce it with r249575, so r249557 did not resolve the issue with
> this test.

Ok, and do we think this regressed after r249350?
Comment 11 Ryan Haddad 2019-09-06 14:37:02 PDT
(In reply to Chris Dumez from comment #10)
> (In reply to Ryan Haddad from comment #9)
> > I can reproduce it with r249575, so r249557 did not resolve the issue with
> > this test.
> 
> Ok, and do we think this regressed after r249350?
I confirmed that I cannot reproduce it with r249349, but I can with r249350.
Comment 12 Chris Dumez 2019-09-06 14:47:48 PDT
(In reply to Ryan Haddad from comment #11)
> (In reply to Chris Dumez from comment #10)
> > (In reply to Ryan Haddad from comment #9)
> > > I can reproduce it with r249575, so r249557 did not resolve the issue with
> > > this test.
> > 
> > Ok, and do we think this regressed after r249350?
> I confirmed that I cannot reproduce it with r249349, but I can with r249350.

Ok then, I will investigate soon.
Comment 13 Chris Dumez 2019-09-06 15:00:30 PDT
(In reply to Chris Dumez from comment #12)
> (In reply to Ryan Haddad from comment #11)
> > (In reply to Chris Dumez from comment #10)
> > > (In reply to Ryan Haddad from comment #9)
> > > > I can reproduce it with r249575, so r249557 did not resolve the issue with
> > > > this test.
> > > 
> > > Ok, and do we think this regressed after r249350?
> > I confirmed that I cannot reproduce it with r249349, but I can with r249350.
> 
> Ok then, I will investigate soon.

Looking at this now, it does not seem to drop any IPC, nor does it seem to leverage the code paths impacted by r249350.
Comment 14 Chris Dumez 2019-09-06 15:04:15 PDT
The test seems to rely on TestRunner::installStatisticsDidScanDataRecordsCallback() / TestRunner::statisticsDidScanDataRecordsCallback(). Digging into it now.
Comment 15 Ryan Haddad 2019-09-06 15:20:16 PDT
Chris was able to reproduce the timeout by just running the test by itself. I was running the whole resourceLoadStatistics directory.

I tried running the one test with r249349 and still got a timeout, so perhaps something else is going on.
Comment 16 Chris Dumez 2019-09-06 15:28:07 PDT
The test calls TestRunner::statisticsNotifyObserver() but this ends up being a no-op because there are no statistics in the ResourceLoadObserver to send to the NetworkProcess. As a result, the test ends up waiting for a notification that will never arrive.

This has nothing to do with my change. I guess that when you run it with other tests, there may be leftover statistics in the ResourceLaodObserver to send and this is why the test usually passes when not run by itself.
Comment 17 Chris Dumez 2019-09-06 15:37:27 PDT
(In reply to Chris Dumez from comment #16)
> The test calls TestRunner::statisticsNotifyObserver() but this ends up being
> a no-op because there are no statistics in the ResourceLoadObserver to send
> to the NetworkProcess. As a result, the test ends up waiting for a
> notification that will never arrive.
> 
> This has nothing to do with my change. I guess that when you run it with
> other tests, there may be leftover statistics in the ResourceLaodObserver to
> send and this is why the test usually passes when not run by itself.

Potential regression from https://trac.webkit.org/r249056. Before r249056, the call to statisticsNotifyObserver() would not have been a no-op I think, since we would have actually notified the WebResourceLoadStatistics store (providing it an empty HashMap).
Comment 18 Chris Dumez 2019-09-06 15:53:58 PDT
Created attachment 378247 [details]
Patch
Comment 19 Chris Dumez 2019-09-06 15:54:36 PDT
Not sure if this is the best way to address the issue. John may have a better idea as he knows this code / test better.
Comment 20 Ryan Haddad 2019-09-06 16:36:13 PDT
Marked the test as flaky in https://trac.webkit.org/changeset/249595/webkit to speed up EWS.
Comment 21 Radar WebKit Bug Importer 2019-09-06 16:36:51 PDT
<rdar://problem/55133261>
Comment 22 Alex Christensen 2019-09-07 09:18:14 PDT
Comment on attachment 378247 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=378247&action=review

> LayoutTests/http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html:37
> +                    timerHandle = setTimeout(finishTest, 100);

Is 100ms enough?  What does this need to wait for?
Comment 23 Chris Dumez 2019-09-07 10:57:26 PDT
(In reply to Alex Christensen from comment #22)
> Comment on attachment 378247 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=378247&action=review
> 
> > LayoutTests/http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html:37
> > +                    timerHandle = setTimeout(finishTest, 100);
> 
> Is 100ms enough?  What does this need to wait for?

It does not need to wait for anything. There is supposed to be no statistics, this is just making sure that no new statistics come in. We are testing that something does not happen (I.e something is not captured). So no flakiness issue with a fairly short timer.
Comment 24 John Wilander 2019-09-09 15:59:55 PDT
Comment on attachment 378247 [details]
Patch

LGTM.
Comment 25 Chris Dumez 2019-09-09 19:01:02 PDT
Created attachment 378430 [details]
Patch
Comment 26 WebKit Commit Bot 2019-09-09 21:06:48 PDT
Comment on attachment 378430 [details]
Patch

Clearing flags on attachment: 378430

Committed r249702: <https://trac.webkit.org/changeset/249702>
Comment 27 WebKit Commit Bot 2019-09-09 21:06:50 PDT
All reviewed patches have been landed.  Closing bug.