Bug 201550

Summary: REGRESSION: http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html is frequently timing out on iOS EWS bots
Product: WebKit Reporter: Ryan Haddad <ryanhaddad>
Component: Tools / TestsAssignee: Chris Dumez <cdumez>
Status: RESOLVED FIXED    
Severity: Normal CC: aakash_jain, achristensen, ap, bfulgham, cdumez, commit-queue, dbates, ews-watchlist, japhet, thorton, webkit-bot-watchers-bugzilla, webkit-bug-importer, wilander
Priority: P2 Keywords: InRadar
Version: Other   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=198923
Attachments:
Description Flags
Patch
none
Patch none

Ryan Haddad
Reported 2019-09-06 11:00:31 PDT
http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html is frequently failing on iOS EWS bots and resulting in false positives https://ews-build.webkit.org/#/builders/18
Attachments
Patch (11.56 KB, patch)
2019-09-06 15:53 PDT, Chris Dumez
no flags
Patch (12.84 KB, patch)
2019-09-09 19:01 PDT, Chris Dumez
no flags
Ryan Haddad
Comment 1 2019-09-06 11:01:57 PDT
Ryan Haddad
Comment 2 2019-09-06 11:03:47 PDT
Looking at the last 1000 EWS runs, the first timeout I see was 14 days ago: https://ews-build.webkit.org/#/builders/18/builds/2379 ...then the next one was 7 days ago: https://ews-build.webkit.org/#/builders/18/builds/2803 Since then, it has been timing out very frequently.
Ryan Haddad
Comment 3 2019-09-06 11:06:28 PDT
The patch that presented the failure 7 days ago was Bug 201344 [PSON] We no longer need to throw away suspended pages in a process before using it for a navigation That did introduce a few flaky adClickAttribution test timeouts, documented in: https://bugs.webkit.org/show_bug.cgi?id=201440 That issue was fixed yesterday with https://trac.webkit.org/changeset/249557
Ryan Haddad
Comment 4 2019-09-06 11:13:32 PDT
Chris did mention that his change did change the timing of things, so it could possibly make things more / less flaky. I'm still not sure why we don't see this on trunk bots. I'm not sure if this could have something to do with test order. We use '--skip-failing-tests' on EWS and also retry failures, so that could explain the different results.
Ryan Haddad
Comment 5 2019-09-06 11:46:25 PDT
I'm trying to reproduce this locally.
Tim Horton
Comment 6 2019-09-06 12:09:27 PDT
It reproduced locally for me; it didn't seem like the process was hung, just ... not doing anything.
Chris Dumez
Comment 7 2019-09-06 13:29:18 PDT
(In reply to Tim Horton from comment #6) > It reproduced locally for me; it didn't seem like the process was hung, just > ... not doing anything. It reproduces after https://trac.webkit.org/changeset/249557 ?
Tim Horton
Comment 8 2019-09-06 13:43:32 PDT
Oh, hmm! I definitely didn't have that.
Ryan Haddad
Comment 9 2019-09-06 14:27:31 PDT
I can reproduce it with r249575, so r249557 did not resolve the issue with this test.
Chris Dumez
Comment 10 2019-09-06 14:28:40 PDT
(In reply to Ryan Haddad from comment #9) > I can reproduce it with r249575, so r249557 did not resolve the issue with > this test. Ok, and do we think this regressed after r249350?
Ryan Haddad
Comment 11 2019-09-06 14:37:02 PDT
(In reply to Chris Dumez from comment #10) > (In reply to Ryan Haddad from comment #9) > > I can reproduce it with r249575, so r249557 did not resolve the issue with > > this test. > > Ok, and do we think this regressed after r249350? I confirmed that I cannot reproduce it with r249349, but I can with r249350.
Chris Dumez
Comment 12 2019-09-06 14:47:48 PDT
(In reply to Ryan Haddad from comment #11) > (In reply to Chris Dumez from comment #10) > > (In reply to Ryan Haddad from comment #9) > > > I can reproduce it with r249575, so r249557 did not resolve the issue with > > > this test. > > > > Ok, and do we think this regressed after r249350? > I confirmed that I cannot reproduce it with r249349, but I can with r249350. Ok then, I will investigate soon.
Chris Dumez
Comment 13 2019-09-06 15:00:30 PDT
(In reply to Chris Dumez from comment #12) > (In reply to Ryan Haddad from comment #11) > > (In reply to Chris Dumez from comment #10) > > > (In reply to Ryan Haddad from comment #9) > > > > I can reproduce it with r249575, so r249557 did not resolve the issue with > > > > this test. > > > > > > Ok, and do we think this regressed after r249350? > > I confirmed that I cannot reproduce it with r249349, but I can with r249350. > > Ok then, I will investigate soon. Looking at this now, it does not seem to drop any IPC, nor does it seem to leverage the code paths impacted by r249350.
Chris Dumez
Comment 14 2019-09-06 15:04:15 PDT
The test seems to rely on TestRunner::installStatisticsDidScanDataRecordsCallback() / TestRunner::statisticsDidScanDataRecordsCallback(). Digging into it now.
Ryan Haddad
Comment 15 2019-09-06 15:20:16 PDT
Chris was able to reproduce the timeout by just running the test by itself. I was running the whole resourceLoadStatistics directory. I tried running the one test with r249349 and still got a timeout, so perhaps something else is going on.
Chris Dumez
Comment 16 2019-09-06 15:28:07 PDT
The test calls TestRunner::statisticsNotifyObserver() but this ends up being a no-op because there are no statistics in the ResourceLoadObserver to send to the NetworkProcess. As a result, the test ends up waiting for a notification that will never arrive. This has nothing to do with my change. I guess that when you run it with other tests, there may be leftover statistics in the ResourceLaodObserver to send and this is why the test usually passes when not run by itself.
Chris Dumez
Comment 17 2019-09-06 15:37:27 PDT
(In reply to Chris Dumez from comment #16) > The test calls TestRunner::statisticsNotifyObserver() but this ends up being > a no-op because there are no statistics in the ResourceLoadObserver to send > to the NetworkProcess. As a result, the test ends up waiting for a > notification that will never arrive. > > This has nothing to do with my change. I guess that when you run it with > other tests, there may be leftover statistics in the ResourceLaodObserver to > send and this is why the test usually passes when not run by itself. Potential regression from https://trac.webkit.org/r249056. Before r249056, the call to statisticsNotifyObserver() would not have been a no-op I think, since we would have actually notified the WebResourceLoadStatistics store (providing it an empty HashMap).
Chris Dumez
Comment 18 2019-09-06 15:53:58 PDT
Chris Dumez
Comment 19 2019-09-06 15:54:36 PDT
Not sure if this is the best way to address the issue. John may have a better idea as he knows this code / test better.
Ryan Haddad
Comment 20 2019-09-06 16:36:13 PDT
Marked the test as flaky in https://trac.webkit.org/changeset/249595/webkit to speed up EWS.
Radar WebKit Bug Importer
Comment 21 2019-09-06 16:36:51 PDT
Alex Christensen
Comment 22 2019-09-07 09:18:14 PDT
Comment on attachment 378247 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=378247&action=review > LayoutTests/http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html:37 > + timerHandle = setTimeout(finishTest, 100); Is 100ms enough? What does this need to wait for?
Chris Dumez
Comment 23 2019-09-07 10:57:26 PDT
(In reply to Alex Christensen from comment #22) > Comment on attachment 378247 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=378247&action=review > > > LayoutTests/http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html:37 > > + timerHandle = setTimeout(finishTest, 100); > > Is 100ms enough? What does this need to wait for? It does not need to wait for anything. There is supposed to be no statistics, this is just making sure that no new statistics come in. We are testing that something does not happen (I.e something is not captured). So no flakiness issue with a fairly short timer.
John Wilander
Comment 24 2019-09-09 15:59:55 PDT
Comment on attachment 378247 [details] Patch LGTM.
Chris Dumez
Comment 25 2019-09-09 19:01:02 PDT
WebKit Commit Bot
Comment 26 2019-09-09 21:06:48 PDT
Comment on attachment 378430 [details] Patch Clearing flags on attachment: 378430 Committed r249702: <https://trac.webkit.org/changeset/249702>
WebKit Commit Bot
Comment 27 2019-09-09 21:06:50 PDT
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.