Bug 201550

Summary:

REGRESSION: http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html is frequently timing out on iOS EWS bots

Product:

WebKit

Reporter:

Ryan Haddad <ryanhaddad>

Component:

Tools / Tests

Assignee:

Chris Dumez <cdumez>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

aakash_jain, achristensen, ap, bfulgham, cdumez, commit-queue, dbates, ews-watchlist, japhet, thorton, webkit-bot-watchers-bugzilla, webkit-bug-importer, wilander

Priority:

Keywords:

InRadar

Version:

Other

Hardware:

Unspecified

OS:

Unspecified

See Also:

https://bugs.webkit.org/show_bug.cgi?id=198923

Attachments:

Description	Flags
Patch	none
Patch	none

Ryan Haddad

Reported 2019-09-06 11:00:31 PDT

http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html is frequently failing on iOS EWS bots and resulting in false positives https://ews-build.webkit.org/#/builders/18

Attachments
Patch (11.56 KB, patch) 2019-09-06 15:53 PDT, Chris Dumez	no flags	Details Formatted Diff Diff
Patch (12.84 KB, patch) 2019-09-09 19:01 PDT, Chris Dumez	no flags	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Ryan Haddad

Comment 1 2019-09-06 11:01:57 PDT

The test was a flaky timeout on trunk bots until 9/4/19, but it hasn't appeared again since then: https://webkit-test-results.webkit.org/dashboards/flakiness_dashboard.html#tests=http%2Ftests%2FresourceLoadStatistics%2Fdo-not-capture-statistics-for-simple-top-navigations.html

Ryan Haddad

Comment 2 2019-09-06 11:03:47 PDT

Looking at the last 1000 EWS runs, the first timeout I see was 14 days ago: https://ews-build.webkit.org/#/builders/18/builds/2379 ...then the next one was 7 days ago: https://ews-build.webkit.org/#/builders/18/builds/2803 Since then, it has been timing out very frequently.

Ryan Haddad

Comment 3 2019-09-06 11:06:28 PDT

The patch that presented the failure 7 days ago was Bug 201344 [PSON] We no longer need to throw away suspended pages in a process before using it for a navigation That did introduce a few flaky adClickAttribution test timeouts, documented in: https://bugs.webkit.org/show_bug.cgi?id=201440 That issue was fixed yesterday with https://trac.webkit.org/changeset/249557

Ryan Haddad

Comment 4 2019-09-06 11:13:32 PDT

Chris did mention that his change did change the timing of things, so it could possibly make things more / less flaky. I'm still not sure why we don't see this on trunk bots. I'm not sure if this could have something to do with test order. We use '--skip-failing-tests' on EWS and also retry failures, so that could explain the different results.

Ryan Haddad

Comment 5 2019-09-06 11:46:25 PDT

I'm trying to reproduce this locally.

Tim Horton

Comment 6 2019-09-06 12:09:27 PDT

It reproduced locally for me; it didn't seem like the process was hung, just ... not doing anything.

Chris Dumez

Comment 7 2019-09-06 13:29:18 PDT

(In reply to Tim Horton from comment #6) > It reproduced locally for me; it didn't seem like the process was hung, just > ... not doing anything. It reproduces after https://trac.webkit.org/changeset/249557 ?

Tim Horton

Comment 8 2019-09-06 13:43:32 PDT

Oh, hmm! I definitely didn't have that.

Ryan Haddad

Comment 9 2019-09-06 14:27:31 PDT

I can reproduce it with r249575, so r249557 did not resolve the issue with this test.

Chris Dumez

Comment 10 2019-09-06 14:28:40 PDT

(In reply to Ryan Haddad from comment #9) > I can reproduce it with r249575, so r249557 did not resolve the issue with > this test. Ok, and do we think this regressed after r249350?

Ryan Haddad

Comment 11 2019-09-06 14:37:02 PDT

(In reply to Chris Dumez from comment #10) > (In reply to Ryan Haddad from comment #9) > > I can reproduce it with r249575, so r249557 did not resolve the issue with > > this test. > > Ok, and do we think this regressed after r249350? I confirmed that I cannot reproduce it with r249349, but I can with r249350.

Chris Dumez

Comment 12 2019-09-06 14:47:48 PDT

(In reply to Ryan Haddad from comment #11) > (In reply to Chris Dumez from comment #10) > > (In reply to Ryan Haddad from comment #9) > > > I can reproduce it with r249575, so r249557 did not resolve the issue with > > > this test. > > > > Ok, and do we think this regressed after r249350? > I confirmed that I cannot reproduce it with r249349, but I can with r249350. Ok then, I will investigate soon.

Chris Dumez

Comment 13 2019-09-06 15:00:30 PDT

(In reply to Chris Dumez from comment #12) > (In reply to Ryan Haddad from comment #11) > > (In reply to Chris Dumez from comment #10) > > > (In reply to Ryan Haddad from comment #9) > > > > I can reproduce it with r249575, so r249557 did not resolve the issue with > > > > this test. > > > > > > Ok, and do we think this regressed after r249350? > > I confirmed that I cannot reproduce it with r249349, but I can with r249350. > > Ok then, I will investigate soon. Looking at this now, it does not seem to drop any IPC, nor does it seem to leverage the code paths impacted by r249350.

Chris Dumez

Comment 14 2019-09-06 15:04:15 PDT

The test seems to rely on TestRunner::installStatisticsDidScanDataRecordsCallback() / TestRunner::statisticsDidScanDataRecordsCallback(). Digging into it now.

Ryan Haddad

Comment 15 2019-09-06 15:20:16 PDT

Chris was able to reproduce the timeout by just running the test by itself. I was running the whole resourceLoadStatistics directory. I tried running the one test with r249349 and still got a timeout, so perhaps something else is going on.

Chris Dumez

Comment 16 2019-09-06 15:28:07 PDT

The test calls TestRunner::statisticsNotifyObserver() but this ends up being a no-op because there are no statistics in the ResourceLoadObserver to send to the NetworkProcess. As a result, the test ends up waiting for a notification that will never arrive. This has nothing to do with my change. I guess that when you run it with other tests, there may be leftover statistics in the ResourceLaodObserver to send and this is why the test usually passes when not run by itself.

Chris Dumez

Comment 17 2019-09-06 15:37:27 PDT

(In reply to Chris Dumez from comment #16) > The test calls TestRunner::statisticsNotifyObserver() but this ends up being > a no-op because there are no statistics in the ResourceLoadObserver to send > to the NetworkProcess. As a result, the test ends up waiting for a > notification that will never arrive. > > This has nothing to do with my change. I guess that when you run it with > other tests, there may be leftover statistics in the ResourceLaodObserver to > send and this is why the test usually passes when not run by itself. Potential regression from https://trac.webkit.org/r249056. Before r249056, the call to statisticsNotifyObserver() would not have been a no-op I think, since we would have actually notified the WebResourceLoadStatistics store (providing it an empty HashMap).

Chris Dumez

Comment 18 2019-09-06 15:53:58 PDT

Created attachment 378247 [details] Patch

Chris Dumez

Comment 19 2019-09-06 15:54:36 PDT

Not sure if this is the best way to address the issue. John may have a better idea as he knows this code / test better.

Ryan Haddad

Comment 20 2019-09-06 16:36:13 PDT

Marked the test as flaky in https://trac.webkit.org/changeset/249595/webkit to speed up EWS.

Radar WebKit Bug Importer

Comment 21 2019-09-06 16:36:51 PDT

<rdar://problem/55133261>

Alex Christensen

Comment 22 2019-09-07 09:18:14 PDT

Comment on attachment 378247 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=378247&action=review > LayoutTests/http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html:37 > + timerHandle = setTimeout(finishTest, 100); Is 100ms enough? What does this need to wait for?

Chris Dumez

Comment 23 2019-09-07 10:57:26 PDT

(In reply to Alex Christensen from comment #22) > Comment on attachment 378247 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=378247&action=review > > > LayoutTests/http/tests/resourceLoadStatistics/do-not-capture-statistics-for-simple-top-navigations.html:37 > > + timerHandle = setTimeout(finishTest, 100); > > Is 100ms enough? What does this need to wait for? It does not need to wait for anything. There is supposed to be no statistics, this is just making sure that no new statistics come in. We are testing that something does not happen (I.e something is not captured). So no flakiness issue with a fairly short timer.

John Wilander

Comment 24 2019-09-09 15:59:55 PDT

Comment on attachment 378247 [details] Patch LGTM.

Chris Dumez

Comment 25 2019-09-09 19:01:02 PDT

Created attachment 378430 [details] Patch

WebKit Commit Bot

Comment 26 2019-09-09 21:06:48 PDT

Comment on attachment 378430 [details] Patch Clearing flags on attachment: 378430 Committed r249702: <https://trac.webkit.org/changeset/249702>

WebKit Commit Bot

Comment 27 2019-09-09 21:06:50 PDT

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.