Bug 144155

Summary: fast/frames/flattening/iframe-flattening-resize-event-count.html times out on Yosemite WK2
Product: WebKit Reporter: Alexey Proskuryakov <ap>
Component: Tools / TestsAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: barraclough, cdumez, darin, kling
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=144308
Attachments:
Description Flags
add some logging
andersca: review+
disable App Nap
thorton: review+
unhide the window between tests kling: review+

Description Alexey Proskuryakov 2015-04-24 11:12:48 PDT
fast/frames/flattening/iframe-flattening-resize-event-count.html very frequently times out on Yosemite WK2 bot. It's quite fast when it doesn't time out.
Comment 1 Alexey Proskuryakov 2015-04-24 11:52:31 PDT
Created attachment 251563 [details]
add some logging
Comment 2 Alexey Proskuryakov 2015-04-24 11:56:52 PDT
Landed in r183272. This can't fix the issue though.
Comment 3 Alexey Proskuryakov 2015-04-25 12:25:57 PDT
The logging is interesting! Every time the test fails, it only fires the timer exactly 7 times instead of 100.

This makes me suspect that nested timer throttling goes wrong, and stops the timer completely. Note that the new 20-second watchdog timer fires anyway, so the process is not frozen, only the 0-delay timer is.

I cannot reproduce locally yet.
Comment 4 Alexey Proskuryakov 2015-04-25 13:21:57 PDT
Added more temporary logging for timers in r183312.
Comment 5 Alexey Proskuryakov 2015-04-25 19:25:19 PDT
The logging says:

CONSOLE MESSAGE: line 36: Test timed out. Resized 7 times, got 7 events.
CONSOLE MESSAGE: line 40: Timers were scheduled at: 37.944230949506164,556.3224649522454,1564.046156941913,2555.3537160158157,3563.0944420117885,4561.856752960011,20164.311336935498

When I test locally, and print the timing data even on success, the output is:

CONSOLE MESSAGE: line 31: Timers were scheduled at: 125.79913299850887,128.9805999986129,141.93081700068433,146.540253997955,152.07373299926985,155.34578100050567,162.11933999875328,177.24060999898938,183.70886299817357,190.41601500066463,197.25199999811593,203.73337699857075,210.3919910005061,218.73386399965966,225.48055100196507,250.12134900316596,258.89310300408397,265.5821590014966,272.1203940018313,278.9613470013137,285.7946400035871,292.5488300024881,303.734835004434,310.7598240021616,319.68921900261194,326.54779400036205,335.85378299903823,342.9574070032686,353.57211700465996,360.1231889988412,370.0214600030449,377.03421500191325,387.11607600271236,400.0187960045878,407.9218039987609,416.8596780000371,423.3271519988193,429.7157410037471,435.7981600041967,442.50050999835366,454.3809640017571,461.1590260028606,470.320073000039,477.19982400303707,487.3991450003814,494.21547100064345,503.9252940041479,510.3242820041487,520.7717220037011,527.4011189976591,537.5356279982952,544.4351370024378,553.9757229998941,560.8028060014476,570.4738330023247,577.4468419986079,587.4084050010424,594.3376850045752,603.7952970000333,610.804048999853,620.8505680042435,627.6742940026452,637.7976470030262,644.5893909985898,653.4063369981595,660.498393997841,671.2844819994643,678.3023910029442,687.6440220003133,694.384695001645,703.5507379987394,710.1908699987689,721.123487004661,728.1277980000596,738.0604470017715,745.0514209995163,754.0727090017754,761.0254570026882,771.0709820021293,777.9118559992639,788.0212250020122,794.822378004028,804.5823670036043,811.569422003231,821.8013090008753,828.3847080019768,837.5335410019034,844.372765001026,854.4278589979513,861.4545880045625,871.3724330009427,878.4461210016161,887.7621720021125,894.8069830003078,904.3407959979959,911.3547460001428,921.1714140037657,928.0214230020647,938.36195699987


This means that when the test fails, the 0-delay timers first get throttled to fire once a second, and then stall completely until the 20-second watchdog timer fires. At the same time, WebCore doesn't think that the timers are ever throttled.

Given that this happens on Yosemite only, and that WebCore doesn't know about the throttling, it smells like an issue below WebKit. At the same time, it is difficult to see how the 20-second timer could fire at exact time if the issue were below us - all these DOM timers are served by a single shared CF timer.
Comment 6 Alexey Proskuryakov 2015-04-25 20:20:06 PDT
Looks like the only way WebCore would clamp timers to 1 second is when it thinks that the timer causes a repaint of something that's not user visible (it won't do that when there is no repainting at all). And there seems to be no way that we'd clamp to 15 seconds, short of a timer heap bug.
Comment 7 Alexey Proskuryakov 2015-04-25 23:44:35 PDT
Found something interesting in system logs. Every time the test fails, there is this logging while it runs:

Apr 25 18:55:31 bot185.apple.com WindowServer[147]: WSGetSurfaceInWindow : Invalid surface 734273305 for window 38598
Apr 25 18:55:32 --- last message repeated 3 times ---
Apr 25 18:55:32 bot185.apple.com WindowServer[147]: WSGetSurfaceInWindow : Invalid surface 656102528 for window 38600
Apr 25 18:55:32 --- last message repeated 3 times ---

And when the test passes, the same logging occurs slightly later.
Comment 8 Alexey Proskuryakov 2015-04-26 00:17:12 PDT
I can reproduce this logging by running fast/events/popup-allowed-from-gesture-initiated-event.html (and there is another test that also causes it).

However, running it in parallel with fast/frames/flattening/iframe-flattening-resize-event-count.html doesn't reproduce the problem, so it's not directly related.
Comment 9 Alexey Proskuryakov 2015-04-27 12:14:59 PDT
Created attachment 251766 [details]
disable App Nap

Talked to Gavin, who suggested that this might be an effect of App Nap. The mechanism is not quite clear, but it certainly seems worth trying.
Comment 10 Alexey Proskuryakov 2015-04-27 13:36:06 PDT
Committed the patch for disabling App Nap in r183418, let's see what happens.
Comment 11 Alexey Proskuryakov 2015-04-27 17:34:43 PDT
Disabling App Nap made a difference, but the test is still flaky. We got rid of the super long stall, but the 0-delay timer is still taking 1 second to fire:

+CONSOLE MESSAGE: line 40: Timers were scheduled at: 43.86900598183274,548.0330429272726,1556.697802967392,2549.8486949363723,3549.5337379397824,4556.125045986846,5548.013246967457,6556.058387970552,7554.3883389327675,8552.839554962702,9547.597624943592,10548.691916977987,11550.223996979184,12547.692796913907,13549.164323951118,14547.859265003353,15556.478225975297,16549.75908191409,17550.117639941163,18548.79687190987,19555.410598986782
Comment 12 Alexey Proskuryakov 2015-04-27 17:41:14 PDT
Oh, I wonder if 1 second is this:

static double hiddenPageAlignmentInterval() { return 1.0; } // 1 second.
Comment 13 Alexey Proskuryakov 2015-04-27 23:34:45 PDT
Confirmed that WebKitTestRunner doesn't bring back the window if a preceding test calls testRunner.setPageVisibility("hidden") without a subsequent testRunner.resetPageVisibility() or testRunner.setPageVisibility("visible").

I couldn't find any direct evidence of this being the culprit in this case yet, but it seems like it has to be.
Comment 14 Darin Adler 2015-04-28 08:13:38 PDT
So does that mean we could fix this by adding code to TestController::resetStateToConsistentValues?
Comment 15 Alexey Proskuryakov 2015-04-28 10:40:19 PDT
Yes, probably just call TestController::setHidden(false). I'd like to look into a few things before that:

- which test leaves the window hidden (couldn't find it yet);

- how we get away with TestController::setHidden(false) calling -makeKeyAndOrderFront - why doesn't that steal focus?

- how DumpRenderTree works (it doesn't seem to ever hide or unhide windows, maybe it doesn't test page visibility properly?)
Comment 16 Alexey Proskuryakov 2015-04-29 00:46:29 PDT
Created attachment 251931 [details]
unhide the window between tests

> - which test leaves the window hidden (couldn't find it yet);

Still couldn't catch anything like this in action. If this patch doesn't help, my next theory is that the window is actually visible, but a CoreIPC issue akin to bug 141122 confuses WebProcess into getting out of sync. That would be pretty much impossible to verify without a locally reproducible case.

- how we get away with TestController::setHidden(false) calling -makeKeyAndOrderFront - why doesn't that steal focus?

A mystery. I patched WKTR to not move the window offscreen, and ran a test that hides/shows the window in a loop. The window does appear and disappear, but it always stays below the currently active window, not stealing focus and not obscuring it.

- how DumpRenderTree works (it doesn't seem to ever hide or unhide windows, maybe it doesn't test page visibility properly?)

DumpRenderTree calls directly into WebKit to fool it about the current visibility state.

[webView _setVisibilityState:WebPageVisibilityStateHidden isInitialState:NO];
Comment 17 Alexey Proskuryakov 2015-04-29 10:44:32 PDT
Committed the patch for unhiding the window in r183558, will keep watching.
Comment 18 Alexey Proskuryakov 2015-04-29 20:05:59 PDT
This is still happening, no change.

Added more logging in r183607, it will now print out visibility state as WebProcess sees it.
Comment 19 Alexey Proskuryakov 2015-04-29 23:50:06 PDT
Strange.

CONSOLE MESSAGE: line 40: Timers were scheduled at: 38.54244900867343,634.1223609633744,1630.6184619897977,2633.0562729854137,3635.5311119696125,4630.785787943751,5634.050983935595,6638.653066940606,7635.863823001273,8635.53286192473,9633.291222970001,10638.184183975682,11632.18399696052,12640.454213949852,13631.606793962419,14630.80570299644,15635.705446940847,16635.929689975455,17632.111273007467,18637.691960902885,19631.200731964782
CONSOLE MESSAGE: line 41: document.hidden = false; document.visibilityState = visible
Comment 20 Alexey Proskuryakov 2015-05-02 12:05:31 PDT
testRunner.setPageVisibility is somewhat broken in WebKitTestRunner, creating an opportunity for races. It is asynchronous, and there are tests that just call the function and terminate without waiting for a visibility change event, thus leaving the following test to handle the message in UI process.

But once again, there is no indication that this is somehow breaking this particular test - in a failing run on the bot, no preceding tests call setPageVisibility.
Comment 21 Alexey Proskuryakov 2015-05-02 12:20:56 PDT
Added some debug-only logging to WebCore in r183720, will roll out as soon as there are results from the bots.
Comment 22 Alexey Proskuryakov 2015-05-03 00:09:51 PDT
Yes, there are spurious calls to setDOMTimerAlignmentInterval. They sometimes happen during this test, which makes it fail, and sometimes during other tests. This doesn't look like a race though.

Will see if I can reproduce this slightly more broad condition locally.
Comment 23 Alexey Proskuryakov 2015-05-12 10:32:56 PDT
This is taking way too long to keep the bots red, marked as flaky in r184205.
Comment 24 Alexey Proskuryakov 2015-08-06 18:32:14 PDT
The remaining issues are quite likely fixed in bug 147718. Will keep an eye on the bots.
Comment 25 Alexey Proskuryakov 2015-08-28 15:58:23 PDT
fast/frames/flattening/iframe-flattening-resize-event-count.html hasn't failed since. So, these fixes plus bug 147718 took care of it.