Summary: | REGRESSION: http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html is frequently timing out on iOS EWS bots | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Ryan Haddad <ryanhaddad> | ||||||
Component: | Tools / Tests | Assignee: | Chris Dumez <cdumez> | ||||||
Status: | RESOLVED FIXED | ||||||||
Severity: | Normal | CC: | aakash_jain, achristensen, ap, bfulgham, cdumez, commit-queue, dbates, ews-watchlist, japhet, thorton, webkit-bot-watchers-bugzilla, webkit-bug-importer, wilander | ||||||
Priority: | P2 | Keywords: | InRadar | ||||||
Version: | Other | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
See Also: | https://bugs.webkit.org/show_bug.cgi?id=198923 | ||||||||
Attachments: |
|
Description
Ryan Haddad
2019-09-06 11:00:31 PDT
The test was a flaky timeout on trunk bots until 9/4/19, but it hasn't appeared again since then: https://webkit-test-results.webkit.org/dashboards/flakiness_dashboard.html#tests=http%2Ftests%2FresourceLoadStatistics%2Fdo-not-capture-statistics-for-simple-top-navigations.html Looking at the last 1000 EWS runs, the first timeout I see was 14 days ago: https://ews-build.webkit.org/#/builders/18/builds/2379 ...then the next one was 7 days ago: https://ews-build.webkit.org/#/builders/18/builds/2803 Since then, it has been timing out very frequently. The patch that presented the failure 7 days ago was Bug 201344 [PSON] We no longer need to throw away suspended pages in a process before using it for a navigation That did introduce a few flaky adClickAttribution test timeouts, documented in: https://bugs.webkit.org/show_bug.cgi?id=201440 That issue was fixed yesterday with https://trac.webkit.org/changeset/249557 Chris did mention that his change did change the timing of things, so it could possibly make things more / less flaky. I'm still not sure why we don't see this on trunk bots. I'm not sure if this could have something to do with test order. We use '--skip-failing-tests' on EWS and also retry failures, so that could explain the different results. I'm trying to reproduce this locally. It reproduced locally for me; it didn't seem like the process was hung, just ... not doing anything. (In reply to Tim Horton from comment #6) > It reproduced locally for me; it didn't seem like the process was hung, just > ... not doing anything. It reproduces after https://trac.webkit.org/changeset/249557 ? Oh, hmm! I definitely didn't have that. (In reply to Ryan Haddad from comment #9) > I can reproduce it with r249575, so r249557 did not resolve the issue with > this test. Ok, and do we think this regressed after r249350? (In reply to Chris Dumez from comment #10) > (In reply to Ryan Haddad from comment #9) > > I can reproduce it with r249575, so r249557 did not resolve the issue with > > this test. > > Ok, and do we think this regressed after r249350? I confirmed that I cannot reproduce it with r249349, but I can with r249350. (In reply to Ryan Haddad from comment #11) > (In reply to Chris Dumez from comment #10) > > (In reply to Ryan Haddad from comment #9) > > > I can reproduce it with r249575, so r249557 did not resolve the issue with > > > this test. > > > > Ok, and do we think this regressed after r249350? > I confirmed that I cannot reproduce it with r249349, but I can with r249350. Ok then, I will investigate soon. (In reply to Chris Dumez from comment #12) > (In reply to Ryan Haddad from comment #11) > > (In reply to Chris Dumez from comment #10) > > > (In reply to Ryan Haddad from comment #9) > > > > I can reproduce it with r249575, so r249557 did not resolve the issue with > > > > this test. > > > > > > Ok, and do we think this regressed after r249350? > > I confirmed that I cannot reproduce it with r249349, but I can with r249350. > > Ok then, I will investigate soon. Looking at this now, it does not seem to drop any IPC, nor does it seem to leverage the code paths impacted by r249350. The test seems to rely on TestRunner::installStatisticsDidScanDataRecordsCallback() / TestRunner::statisticsDidScanDataRecordsCallback(). Digging into it now. Chris was able to reproduce the timeout by just running the test by itself. I was running the whole resourceLoadStatistics directory. I tried running the one test with r249349 and still got a timeout, so perhaps something else is going on. The test calls TestRunner::statisticsNotifyObserver() but this ends up being a no-op because there are no statistics in the ResourceLoadObserver to send to the NetworkProcess. As a result, the test ends up waiting for a notification that will never arrive. This has nothing to do with my change. I guess that when you run it with other tests, there may be leftover statistics in the ResourceLaodObserver to send and this is why the test usually passes when not run by itself. (In reply to Chris Dumez from comment #16) > The test calls TestRunner::statisticsNotifyObserver() but this ends up being > a no-op because there are no statistics in the ResourceLoadObserver to send > to the NetworkProcess. As a result, the test ends up waiting for a > notification that will never arrive. > > This has nothing to do with my change. I guess that when you run it with > other tests, there may be leftover statistics in the ResourceLaodObserver to > send and this is why the test usually passes when not run by itself. Potential regression from https://trac.webkit.org/r249056. Before r249056, the call to statisticsNotifyObserver() would not have been a no-op I think, since we would have actually notified the WebResourceLoadStatistics store (providing it an empty HashMap). Created attachment 378247 [details]
Patch
Not sure if this is the best way to address the issue. John may have a better idea as he knows this code / test better. Marked the test as flaky in https://trac.webkit.org/changeset/249595/webkit to speed up EWS. Comment on attachment 378247 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=378247&action=review > LayoutTests/http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html:37 > + timerHandle = setTimeout(finishTest, 100); Is 100ms enough? What does this need to wait for? (In reply to Alex Christensen from comment #22) > Comment on attachment 378247 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=378247&action=review > > > LayoutTests/http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html:37 > > + timerHandle = setTimeout(finishTest, 100); > > Is 100ms enough? What does this need to wait for? It does not need to wait for anything. There is supposed to be no statistics, this is just making sure that no new statistics come in. We are testing that something does not happen (I.e something is not captured). So no flakiness issue with a fairly short timer. Comment on attachment 378247 [details]
Patch
LGTM.
Created attachment 378430 [details]
Patch
Comment on attachment 378430 [details] Patch Clearing flags on attachment: 378430 Committed r249702: <https://trac.webkit.org/changeset/249702> All reviewed patches have been landed. Closing bug. |