Bug 144155

Summary: fast/frames/flattening/iframe-flattening-resize-event-count.html times out on Yosemite WK2
Product: WebKit Reporter: Alexey Proskuryakov <ap>
Component: Tools / TestsAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: barraclough, cdumez, darin, kling
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=144308
Attachments:
Description Flags
add some logging
andersca: review+
disable App Nap
thorton: review+
unhide the window between tests kling: review+

Alexey Proskuryakov
Reported 2015-04-24 11:12:48 PDT
fast/frames/flattening/iframe-flattening-resize-event-count.html very frequently times out on Yosemite WK2 bot. It's quite fast when it doesn't time out.
Attachments
add some logging (2.45 KB, patch)
2015-04-24 11:52 PDT, Alexey Proskuryakov
andersca: review+
disable App Nap (1.58 KB, patch)
2015-04-27 12:14 PDT, Alexey Proskuryakov
thorton: review+
unhide the window between tests (1.49 KB, patch)
2015-04-29 00:46 PDT, Alexey Proskuryakov
kling: review+
Alexey Proskuryakov
Comment 1 2015-04-24 11:52:31 PDT
Created attachment 251563 [details] add some logging
Alexey Proskuryakov
Comment 2 2015-04-24 11:56:52 PDT
Landed in r183272. This can't fix the issue though.
Alexey Proskuryakov
Comment 3 2015-04-25 12:25:57 PDT
The logging is interesting! Every time the test fails, it only fires the timer exactly 7 times instead of 100. This makes me suspect that nested timer throttling goes wrong, and stops the timer completely. Note that the new 20-second watchdog timer fires anyway, so the process is not frozen, only the 0-delay timer is. I cannot reproduce locally yet.
Alexey Proskuryakov
Comment 4 2015-04-25 13:21:57 PDT
Added more temporary logging for timers in r183312.
Alexey Proskuryakov
Comment 5 2015-04-25 19:25:19 PDT
The logging says: CONSOLE MESSAGE: line 36: Test timed out. Resized 7 times, got 7 events. CONSOLE MESSAGE: line 40: Timers were scheduled at: 37.944230949506164,556.3224649522454,1564.046156941913,2555.3537160158157,3563.0944420117885,4561.856752960011,20164.311336935498 When I test locally, and print the timing data even on success, the output is: CONSOLE MESSAGE: line 31: Timers were scheduled at: 125.79913299850887,128.9805999986129,141.93081700068433,146.540253997955,152.07373299926985,155.34578100050567,162.11933999875328,177.24060999898938,183.70886299817357,190.41601500066463,197.25199999811593,203.73337699857075,210.3919910005061,218.73386399965966,225.48055100196507,250.12134900316596,258.89310300408397,265.5821590014966,272.1203940018313,278.9613470013137,285.7946400035871,292.5488300024881,303.734835004434,310.7598240021616,319.68921900261194,326.54779400036205,335.85378299903823,342.9574070032686,353.57211700465996,360.1231889988412,370.0214600030449,377.03421500191325,387.11607600271236,400.0187960045878,407.9218039987609,416.8596780000371,423.3271519988193,429.7157410037471,435.7981600041967,442.50050999835366,454.3809640017571,461.1590260028606,470.320073000039,477.19982400303707,487.3991450003814,494.21547100064345,503.9252940041479,510.3242820041487,520.7717220037011,527.4011189976591,537.5356279982952,544.4351370024378,553.9757229998941,560.8028060014476,570.4738330023247,577.4468419986079,587.4084050010424,594.3376850045752,603.7952970000333,610.804048999853,620.8505680042435,627.6742940026452,637.7976470030262,644.5893909985898,653.4063369981595,660.498393997841,671.2844819994643,678.3023910029442,687.6440220003133,694.384695001645,703.5507379987394,710.1908699987689,721.123487004661,728.1277980000596,738.0604470017715,745.0514209995163,754.0727090017754,761.0254570026882,771.0709820021293,777.9118559992639,788.0212250020122,794.822378004028,804.5823670036043,811.569422003231,821.8013090008753,828.3847080019768,837.5335410019034,844.372765001026,854.4278589979513,861.4545880045625,871.3724330009427,878.4461210016161,887.7621720021125,894.8069830003078,904.3407959979959,911.3547460001428,921.1714140037657,928.0214230020647,938.36195699987 This means that when the test fails, the 0-delay timers first get throttled to fire once a second, and then stall completely until the 20-second watchdog timer fires. At the same time, WebCore doesn't think that the timers are ever throttled. Given that this happens on Yosemite only, and that WebCore doesn't know about the throttling, it smells like an issue below WebKit. At the same time, it is difficult to see how the 20-second timer could fire at exact time if the issue were below us - all these DOM timers are served by a single shared CF timer.
Alexey Proskuryakov
Comment 6 2015-04-25 20:20:06 PDT
Looks like the only way WebCore would clamp timers to 1 second is when it thinks that the timer causes a repaint of something that's not user visible (it won't do that when there is no repainting at all). And there seems to be no way that we'd clamp to 15 seconds, short of a timer heap bug.
Alexey Proskuryakov
Comment 7 2015-04-25 23:44:35 PDT
Found something interesting in system logs. Every time the test fails, there is this logging while it runs: Apr 25 18:55:31 bot185.apple.com WindowServer[147]: WSGetSurfaceInWindow : Invalid surface 734273305 for window 38598 Apr 25 18:55:32 --- last message repeated 3 times --- Apr 25 18:55:32 bot185.apple.com WindowServer[147]: WSGetSurfaceInWindow : Invalid surface 656102528 for window 38600 Apr 25 18:55:32 --- last message repeated 3 times --- And when the test passes, the same logging occurs slightly later.
Alexey Proskuryakov
Comment 8 2015-04-26 00:17:12 PDT
I can reproduce this logging by running fast/events/popup-allowed-from-gesture-initiated-event.html (and there is another test that also causes it). However, running it in parallel with fast/frames/flattening/iframe-flattening-resize-event-count.html doesn't reproduce the problem, so it's not directly related.
Alexey Proskuryakov
Comment 9 2015-04-27 12:14:59 PDT
Created attachment 251766 [details] disable App Nap Talked to Gavin, who suggested that this might be an effect of App Nap. The mechanism is not quite clear, but it certainly seems worth trying.
Alexey Proskuryakov
Comment 10 2015-04-27 13:36:06 PDT
Committed the patch for disabling App Nap in r183418, let's see what happens.
Alexey Proskuryakov
Comment 11 2015-04-27 17:34:43 PDT
Disabling App Nap made a difference, but the test is still flaky. We got rid of the super long stall, but the 0-delay timer is still taking 1 second to fire: +CONSOLE MESSAGE: line 40: Timers were scheduled at: 43.86900598183274,548.0330429272726,1556.697802967392,2549.8486949363723,3549.5337379397824,4556.125045986846,5548.013246967457,6556.058387970552,7554.3883389327675,8552.839554962702,9547.597624943592,10548.691916977987,11550.223996979184,12547.692796913907,13549.164323951118,14547.859265003353,15556.478225975297,16549.75908191409,17550.117639941163,18548.79687190987,19555.410598986782
Alexey Proskuryakov
Comment 12 2015-04-27 17:41:14 PDT
Oh, I wonder if 1 second is this: static double hiddenPageAlignmentInterval() { return 1.0; } // 1 second.
Alexey Proskuryakov
Comment 13 2015-04-27 23:34:45 PDT
Confirmed that WebKitTestRunner doesn't bring back the window if a preceding test calls testRunner.setPageVisibility("hidden") without a subsequent testRunner.resetPageVisibility() or testRunner.setPageVisibility("visible"). I couldn't find any direct evidence of this being the culprit in this case yet, but it seems like it has to be.
Darin Adler
Comment 14 2015-04-28 08:13:38 PDT
So does that mean we could fix this by adding code to TestController::resetStateToConsistentValues?
Alexey Proskuryakov
Comment 15 2015-04-28 10:40:19 PDT
Yes, probably just call TestController::setHidden(false). I'd like to look into a few things before that: - which test leaves the window hidden (couldn't find it yet); - how we get away with TestController::setHidden(false) calling -makeKeyAndOrderFront - why doesn't that steal focus? - how DumpRenderTree works (it doesn't seem to ever hide or unhide windows, maybe it doesn't test page visibility properly?)
Alexey Proskuryakov
Comment 16 2015-04-29 00:46:29 PDT
Created attachment 251931 [details] unhide the window between tests > - which test leaves the window hidden (couldn't find it yet); Still couldn't catch anything like this in action. If this patch doesn't help, my next theory is that the window is actually visible, but a CoreIPC issue akin to bug 141122 confuses WebProcess into getting out of sync. That would be pretty much impossible to verify without a locally reproducible case. - how we get away with TestController::setHidden(false) calling -makeKeyAndOrderFront - why doesn't that steal focus? A mystery. I patched WKTR to not move the window offscreen, and ran a test that hides/shows the window in a loop. The window does appear and disappear, but it always stays below the currently active window, not stealing focus and not obscuring it. - how DumpRenderTree works (it doesn't seem to ever hide or unhide windows, maybe it doesn't test page visibility properly?) DumpRenderTree calls directly into WebKit to fool it about the current visibility state. [webView _setVisibilityState:WebPageVisibilityStateHidden isInitialState:NO];
Alexey Proskuryakov
Comment 17 2015-04-29 10:44:32 PDT
Committed the patch for unhiding the window in r183558, will keep watching.
Alexey Proskuryakov
Comment 18 2015-04-29 20:05:59 PDT
This is still happening, no change. Added more logging in r183607, it will now print out visibility state as WebProcess sees it.
Alexey Proskuryakov
Comment 19 2015-04-29 23:50:06 PDT
Strange. CONSOLE MESSAGE: line 40: Timers were scheduled at: 38.54244900867343,634.1223609633744,1630.6184619897977,2633.0562729854137,3635.5311119696125,4630.785787943751,5634.050983935595,6638.653066940606,7635.863823001273,8635.53286192473,9633.291222970001,10638.184183975682,11632.18399696052,12640.454213949852,13631.606793962419,14630.80570299644,15635.705446940847,16635.929689975455,17632.111273007467,18637.691960902885,19631.200731964782 CONSOLE MESSAGE: line 41: document.hidden = false; document.visibilityState = visible
Alexey Proskuryakov
Comment 20 2015-05-02 12:05:31 PDT
testRunner.setPageVisibility is somewhat broken in WebKitTestRunner, creating an opportunity for races. It is asynchronous, and there are tests that just call the function and terminate without waiting for a visibility change event, thus leaving the following test to handle the message in UI process. But once again, there is no indication that this is somehow breaking this particular test - in a failing run on the bot, no preceding tests call setPageVisibility.
Alexey Proskuryakov
Comment 21 2015-05-02 12:20:56 PDT
Added some debug-only logging to WebCore in r183720, will roll out as soon as there are results from the bots.
Alexey Proskuryakov
Comment 22 2015-05-03 00:09:51 PDT
Yes, there are spurious calls to setDOMTimerAlignmentInterval. They sometimes happen during this test, which makes it fail, and sometimes during other tests. This doesn't look like a race though. Will see if I can reproduce this slightly more broad condition locally.
Alexey Proskuryakov
Comment 23 2015-05-12 10:32:56 PDT
This is taking way too long to keep the bots red, marked as flaky in r184205.
Alexey Proskuryakov
Comment 24 2015-08-06 18:32:14 PDT
The remaining issues are quite likely fixed in bug 147718. Will keep an eye on the bots.
Alexey Proskuryakov
Comment 25 2015-08-28 15:58:23 PDT
fast/frames/flattening/iframe-flattening-resize-event-count.html hasn't failed since. So, these fixes plus bug 147718 took care of it.
Note You need to log in before you can comment on or make changes to this bug.