Bug 36647

Summary: fast/loader/stateobjects/replacestate-then-pushstate.html fails intermittently on all bots
Product: WebKit
Reporter: Eric Seidel (no email) <eric>
Component: Tools / Tests
Assignee: Darin Fisher (:fishd, Google) <fishd>
Status: RESOLVED FIXED
Severity: Normal
CC: abarth, beidson, commit-queue, eric, fishd, sam, webkit.review.bot
Priority: P2
Version: 528+ (Nightly build)
Hardware: PC
OS: OS X 10.5
Bug Depends on: 36779
Bug Blocks:

Eric Seidel (no email)
Reported 2010-03-26 01:25:24 PDT
fast/loader/stateobjects/replacestate-then-pushstate.html failed on the Snow Leopard Release Bot. The dump looks like DRT is confused and dumping the wrong test (maybe the previous test?).

http://trac.webkit.org/browser/trunk/LayoutTests/fast/loader/stateobjects/replacestate-then-pushstate.html
http://build.webkit.org/results/SnowLeopard%20Intel%20Release%20(Tests)/r56612%20(7249)/fast/loader/stateobjects/replacestate-then-pushstate-diffs.txt

--- /Volumes/Data/WebKit-BuildSlave/snowleopard-intel-release-tests/build/layout-test-results/fast/loader/stateobjects/replacestate-then-pushstate-expected.txt	2010-03-26 01:10:32.000000000 -0700
+++ /Volumes/Data/WebKit-BuildSlave/snowleopard-intel-release-tests/build/layout-test-results/fast/loader/stateobjects/replacestate-then-pushstate-actual.txt	2010-03-26 01:10:32.000000000 -0700
@@ -1,13 +1,17 @@
-This test does the following:
--Makes a call to replaceState()
--Makes sure the history length is correct
--Makes a call to pushState()
--Makes sure the history length is correct
--Goes back, and makes sure the popstate event is correct
--Goes forward, and makes sure the popstate event is correct
-
-History length is 1
-History length is 2
-State popped - OriginalHistoryItem (type string)
-State popped - NewHistoryItem (type string)
-
+layer at (0,0) size 800x600
+  RenderView at (0,0) size 800x600
+layer at (0,0) size 800x600
+  RenderBlock {HTML} at (0,0) size 800x600
+    RenderBody {BODY} at (8,8) size 784x584
+      RenderText {#text} at (0,140) size 70x18
+        text run at (0,140) width 70: "default text"
+      RenderText {#text} at (70,140) size 4x18
+        text run at (70,140) width 4: " "
+      RenderPartObject {IFRAME} at (74,0) size 304x154 [border: (2px inset #000000)]
+        layer at (0,0) size 300x150
+          RenderView at (0,0) size 300x150
+        layer at (0,0) size 300x150
+          RenderBlock {HTML} at (0,0) size 300x150
+            RenderBody {BODY} at (8,8) size 284x134
+              RenderText {#text} at (0,0) size 21x18
+                text run at (0,0) width 21: "foo"
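For context, the History API behavior the test exercises can be modeled in a few lines. This is a hypothetical sketch (`HistoryModel` is illustrative, not a WebKit class): replaceState() swaps the current entry without growing the history, while pushState() truncates any forward entries and appends a new one, which matches the expected "History length is 1 / 2" output above.

```javascript
// Toy model of the session-history semantics the test checks.
// HistoryModel is a made-up name; only the semantics follow the spec.
class HistoryModel {
  constructor() {
    this.entries = [{ state: null }]; // every session starts with one entry
    this.index = 0;
  }
  get length() { return this.entries.length; }
  replaceState(state) {
    // Replaces the current entry in place; history.length is unchanged.
    this.entries[this.index] = { state };
  }
  pushState(state) {
    // Drops any forward entries, then appends; history.length grows.
    this.entries.splice(this.index + 1);
    this.entries.push({ state });
    this.index++;
  }
  back() {
    this.index--;
    return this.entries[this.index].state; // delivered via a popstate event
  }
  forward() {
    this.index++;
    return this.entries[this.index].state;
  }
}

const h = new HistoryModel();
h.replaceState('OriginalHistoryItem');
console.log('History length is ' + h.length);  // History length is 1
h.pushState('NewHistoryItem');
console.log('History length is ' + h.length);  // History length is 2
console.log('State popped - ' + h.back());     // State popped - OriginalHistoryItem
console.log('State popped - ' + h.forward());  // State popped - NewHistoryItem
```

The four logged lines correspond to the "-" side of the diff above, i.e. what the test should print when it runs cleanly.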
Attachments
v1 patch (4.26 KB, patch)
2010-04-02 12:22 PDT, Darin Fisher (:fishd, Google)
fishd: review-
v2 patch (4.55 KB, patch)
2010-04-02 20:16 PDT, Darin Fisher (:fishd, Google)
abarth: review+
Brady Eidson
Comment 1 2010-03-26 09:26:44 PDT
Yes, it is dumping the wrong test. Yes, this has been happening with scattered tests that I added, sometimes very recently and sometimes many months ago. No, I have no idea how to explore this especially since they're often very rare failures on the bots and I can never reproduce locally. In other words... *sigh*
Eric Seidel (no email)
Comment 3 2010-03-29 19:05:45 PDT
Failed on the Leopard Commit bot just now too: https://bugs.webkit.org/show_bug.cgi?id=30144#c66
Eric Seidel (no email)
Comment 4 2010-03-31 15:32:42 PDT
It appears to be failing on the Tiger bot as well:
http://build.webkit.org/results/Tiger%20Intel%20Release/r56871%20(10265)/fast/loader/stateobjects/replacestate-then-pushstate-pretty-diff.html
Seems we should skip/disable this test since it's failing on at least 3 bots.
Brady Eidson
Comment 5 2010-03-31 15:42:41 PDT
Until now, I wasn't aware this was failing every time on the bots.
Brady Eidson
Comment 6 2010-03-31 15:43:42 PDT
(In reply to comment #5)
> Until now, I wasn't aware this was failing every time on the bots.

Checking it myself, it appears it is not failing every time.
Eric Seidel (no email)
Comment 7 2010-03-31 16:01:33 PDT
Correct. It is not failing consistently on any machine to my knowledge. My apologies if my statements led you to believe such. This is however a common failure, just not consistent. :(
Eric Seidel (no email)
Comment 8 2010-04-01 18:20:21 PDT
The commit-queue just hit this too: https://bugs.webkit.org/show_bug.cgi?id=36369#c4
Eric Seidel (no email)
Comment 9 2010-04-02 01:09:38 PDT
Tiger hit this several times today. Seems we should disable this test for all bots. Unless you have a fix in the pipeline, Brady?
Brady Eidson
Comment 10 2010-04-02 09:37:58 PDT
(In reply to comment #9)
> Tiger hit this several times today. Seems we should disable this test for all
> bots. Unless you have a fix in the pipeline Brady?

Is this one of the tests that's "failing" by showing the output from the previous test? i.e., the test itself isn't necessarily failing but is being polluted by something prior.

Here's my dilemma. I would love to pour in the time to figure out what is wrong with the test harness, but I am not an expert at the test harness and would be spinning my wheels, and I don't have time right now to spin my wheels.

We can't just keep disabling tests that have only intermittent failures and (apparently) only on the bots. If I can never reproduce locally, I can never explore it. But if we just keep disabling test after test, then we lose the test coverage for when it really breaks 100% of the time, even for individual contributors on their own machines.

I would love to hear referrals for people who are truly experts at run-webkit-tests and DRT, and to work with them to figure out what is going on.
Adam Barth
Comment 11 2010-04-02 09:53:39 PDT
The flip side of that argument is that we have >13k tests. Because the bots (Tiger and Snow Leopard especially) are constantly red, we become blind to failures and miss real regressions. If we lose 0.01% of our tests, that's a net win for the project.

However, in many cases, we've been able to keep the test coverage and remove the flakiness, even with no access to the failures locally. For example, last night I fixed a test that was failing only on Tiger with no ability to reproduce the failure myself. It just requires effort.

I don't have visibility into the other demands on your time. It's entirely possible you have more important things to be working on. However, that doesn't mean we should impose these costs on the rest of the project.
Adam Barth
Comment 12 2010-04-02 09:57:52 PDT
> Is this one of the tests that's "failing" by showing the output from the
> previous test? ie - the test itself isn't necessarily failing but being
> polluted by something prior.

Here's the failure pretty diff:

http://build.webkit.org/results/Tiger%20Intel%20Release/r57003%20(10355)/fast/loader/stateobjects/replacestate-then-pushstate-pretty-diff.html
Adam Barth
Comment 13 2010-04-02 10:01:17 PDT
By way of context, this test has failed on 8 of the last 10 Tiger bot runs. It's one of three tests preventing this bot from going green and being put under the protection of SheriffBot (which will help prevent future real regressions). Three tests are 0.023% of our tests.
Brady Eidson
Comment 14 2010-04-02 10:13:23 PDT
(In reply to comment #11)
> The flip side of that argument is that we have >13k tests. Because the bots
> (Tiger and Snow Leopard especially) are constantly red, we become blind to
> failures and miss real regressions. If we lose 0.01% of our tests, that's a
> net win for the project.

Unless the feature is new with only an initial batch of tests, and disabling one of the tests would result in the loss of ~10% of the test coverage for a reasonably complicated mechanism.

> However, in many cases, we've been able to keep the test coverage and remove
> the flakiness, even with no access to the failures locally. For example, last
> night I fixed a test that was failing only on Tiger with no ability to
> reproduce the failure myself. It just requires effort.

Wildly varying amounts of effort - I have spent time on this already, and no one I've reached out to has decided it was worth helping.

(In reply to comment #12)
> > Is this one of the tests that's "failing" by showing the output from the
> > previous test? ie - the test itself isn't necessarily failing but being
> > polluted by something prior.
>
> Here's the failure pretty diff:
>
> http://build.webkit.org/results/Tiger%20Intel%20Release/r57003%20(10355)/fast/loader/stateobjects/replacestate-then-pushstate-pretty-diff.html

This diff appears to be the previous test (fast/loader/stateobjects/replacestate-in-iframe.html) dumping into the test that the bots see as failing. This appears to be a test harness problem that I would love help tracking down, because apparently every other test I've added lately has been affected by this problem somehow. Any ideas?

(CC'ing Darin Fisher, who added the test that is causing the forward failure)
Darin Fisher (:fishd, Google)
Comment 15 2010-04-02 10:21:23 PDT
This test is also failing intermittently on the chromium bots. I'll investigate.
Adam Barth
Comment 16 2010-04-02 10:23:39 PDT
I can read up on how this feature is supposed to work, but the first step I would take to black-box debug this issue is to insert a trivial test between the previous test and this one. That way we can see whether the previous test is leaking state forward or this test is pulling previous state forward.
Brady Eidson
Comment 17 2010-04-02 10:27:05 PDT
(In reply to comment #16)
> I can read up on how this feature is supposed to work, but the first step I
> would try to black-box debug this issue is to insert a trivial test between the
> previous test and this one. That way we can see whether the previous test is
> leaking state forward or this test is pulling previous state forward.

That is a good idea - except since I have never been able to reproduce this locally, and I haven't heard from any fellow WebKit developers nearby about this test failing for them, this appears to be a bot-only issue, and your experiment would require checking in a dummy test for experimentation on the bots. Which (I don't believe) there is any precedent for.
Adam Barth
Comment 18 2010-04-02 10:30:04 PDT
> Which (I don't believe) there is any precedent for.

Right, I'm proposing adding a new test to debug this on the bots. Given that it fails about 80% of the time, we should get our answer in a few cycles. I've done similar experiments before to solve bot-only issues. In any case, I don't think it's a big problem to add more tests.
Adam Barth
Comment 19 2010-04-02 10:31:03 PDT
Reading replacestate-then-pushstate.html, I wonder if the call to history.back() is causing problems, for example if the history timer is racing with some other asynchronous bit? An interesting clue is that the previous test results are dumped as a render tree and not as text.
Brady Eidson
Comment 20 2010-04-02 10:35:17 PDT
(In reply to comment #19)
> Reading replacestate-then-pushstate.html, I wonder if the call to
> history.back() is causing problems, for example if the history timer is racing
> with some other asynchronous bit? An interesting clue is that the previous
> test results are dumped as a render tree and not as text.

Which seems to always be the case in these types of failures that we've seen lately.

Because of the varying and async nature of the test it is - of course - a waitUntilDone test. And it seems to me that DRT should have already done its dumping for the previous test before the new test is started, then wiped that out. With those two understandings, I'm not sure what asynchronous bit might exist here.
Adam Barth
Comment 21 2010-04-02 10:39:24 PDT
> Which seems to always be the case in these types of failures that we've seen
> lately.

Do you have links to other failures of this kind? I'm new to the flaky test party.

> Because of the varying and async nature of the test it is - of course - a
> waitUntilDone test.

Both this test and the previous test are waitUntilDone.

> And it seems to me that DRT should have already done its dumping for the
> previous test before the new test is started, then wiped that out.

Right. The previous test passes, so DRT has decided it's happy with what happened there.

> With those two understandings, I'm not sure what asynchronous bit might exist
> here.

When I've seen leakage before, it's because a test is running code after DRT thinks the test is done. For example, if you call alert in your unload handler, the alert will be recorded for the next test. Maybe popstate has similar timing? Is it called when you navigate away from a page? I guess I should read up on how this API works.
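The leakage mechanism described above - output produced after DRT considers a test finished bleeding into the next test's dump - can be sketched as a toy model. `DumpQueue` is a hypothetical name, not DRT's real implementation; the point is only that a shared output buffer plus late writes yields exactly the observed pollution.

```javascript
// Toy model of per-test output buffering in a DRT-like harness.
class DumpQueue {
  constructor() { this.buffer = []; }
  write(line) { this.buffer.push(line); }
  dump() {
    // DRT-style: dump everything buffered so far, then reset for the next test.
    const out = this.buffer.join('\n');
    this.buffer = [];
    return out;
  }
}

const drt = new DumpQueue();
drt.write('test A: PASS');
const dumpA = drt.dump();                 // test A's dump looks fine
drt.write('test A: late unload output');  // runs after DRT thought A was done
drt.write('test B: PASS');
const dumpB = drt.dump();                 // test B's dump now contains A's residue
console.log(dumpB);
```

Under this model, test B "fails" even though its own output is correct, which is consistent with the render tree of the previous test showing up in replacestate-then-pushstate's actual results.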
Adam Barth
Comment 22 2010-04-02 10:48:40 PDT
Woah, I'm now convinced that http://trac.webkit.org/browser/trunk/LayoutTests/fast/loader/stateobjects/replacestate-in-iframe.html is the problem. That test is nuts. We should run it in a popup window.
Darin Fisher (:fishd, Google)
Comment 23 2010-04-02 10:50:43 PDT
I think this is probably related to the fact that we dispatch the PopStateEvent synchronously. In fast/loader/replacestate-in-iframe.html, the PopStateEvent occurs in an iframe, and it is from there that the test calls notifyDone(). I think I'm going to approach this problem by making the test less nutty ;-)
Brady Eidson
Comment 24 2010-04-02 11:40:25 PDT
(In reply to comment #22)
> Woah, I'm now convinced that
> http://trac.webkit.org/browser/trunk/LayoutTests/fast/loader/stateobjects/replacestate-in-iframe.html
> is the problem. That test is nuts. We should run it in popup window.

I agree that test is nutty, and am glad Darin's looking into it.

I'm still at a loss on this from the high-level run-webkit-tests + DRT perspective. Could you share your understanding of why Darin's test causes this to happen? And could we fix DRT + run-webkit-tests to handle Darin's test - as nutty as it is? :)

I ask because - like I mentioned - this has come up a couple of times with bizarre loader tests I've added over the last few months, and I *really* need to understand the root cause so I can go back and re-enable disabled tests and fix other intermittent weirdness!
Darin Fisher (:fishd, Google)
Comment 26 2010-04-02 12:22:41 PDT
Created attachment 52435 [details] v1 patch
Darin Fisher (:fishd, Google)
Comment 27 2010-04-02 12:26:51 PDT
(In reply to comment #24)
> And could we fix DRT + run-webkit-tests to handle Darin's test -as nutty as it
> is? :)

I looked into this, and I'm not sure it is possible. In this case, the root cause of the bug is that onpopstate sometimes runs before the page is fully loaded. During onpopstate, I was calling notifyDone. As a result, when the page finally finishes loading, it ends up dumping the test results again.

One way to avoid problems like this would be to create a new WebView for each test.
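One common way to make a test robust against this kind of race (a sketch of the general guard pattern, not the actual v2 patch) is to call notifyDone only once both the load event and the expected popstate have been observed, in whichever order they arrive. `makeTestHarness` and `testDone` are illustrative names standing in for the test's wiring around layoutTestController.notifyDone().

```javascript
// Sketch of a "wait for both events" guard; names are hypothetical.
function makeTestHarness(testDone) {
  let loaded = false;
  let popstateSeen = false;
  let doneCalled = false;
  function maybeDone() {
    // Finish only when both conditions hold, and never finish twice.
    if (loaded && popstateSeen && !doneCalled) {
      doneCalled = true;
      testDone();
    }
  }
  return {
    onLoad() { loaded = true; maybeDone(); },
    onPopState() { popstateSeen = true; maybeDone(); },
  };
}

// The bug: popstate can fire before load. With the guard, either event
// order results in exactly one "done", so nothing runs after the dump.
let doneCount = 0;
const t = makeTestHarness(() => doneCount++);
t.onPopState(); // early popstate, as seen on the flaky bots
t.onLoad();     // page finishes loading afterwards
console.log('done called ' + doneCount + ' time(s)'); // done called 1 time(s)
```

This avoids the double dump because notifyDone can no longer precede the load event, regardless of when onpopstate fires.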
Brady Eidson
Comment 28 2010-04-02 12:29:32 PDT
(In reply to comment #27)
> I looked into this, and I'm not sure it is possible. In this case, the root
> cause of the bug is that onpopstate sometimes runs before the page is fully
> loaded. During onpopstate, I was calling notifyDone. As a result, when the
> page finally finishes loading, it ends up dumping the test results again.
>
> One way to avoid problems like this would be to create a new WebView for each
> test.

Presumably we could make notifyDone() a synchronous message to DRT meaning "Hey, the test is over - RIGHT NOW," and DRT could cancel the load if it's not done yet. I don't know how many other tests this would affect, but one might argue that those other tests are "wrong."
Darin Fisher (:fishd, Google)
Comment 29 2010-04-02 12:29:46 PDT
Comment on attachment 52435 [details]
v1 patch

Whoops, I just realized that this no longer tests the fix I was intending it to test! Doh!
Adam Barth
Comment 30 2010-04-02 15:03:30 PDT
BTW, new-run-webkit-tests supports --run-singly, which I think uses a new DRT process for each test.
Eric Seidel (no email)
Comment 31 2010-04-02 15:11:01 PDT
"old" run-webkit-tests supports that too. :)
Darin Fisher (:fishd, Google)
Comment 32 2010-04-02 20:16:10 PDT
Created attachment 52475 [details] v2 patch
Adam Barth
Comment 33 2010-04-02 20:38:15 PDT
Comment on attachment 52475 [details]
v2 patch

LGTM, thanks. Sounds like a cool bug.
Darin Fisher (:fishd, Google)
Comment 34 2010-04-02 21:49:17 PDT
WebKit Review Bot
Comment 35 2010-04-02 22:59:32 PDT
http://trac.webkit.org/changeset/57042 might have broken Leopard Intel Debug (Tests)
Adam Barth
Comment 36 2010-04-02 23:00:40 PDT
It's really https://bugs.webkit.org/show_bug.cgi?id=36646. The failure rate from that test is insanely high. I've disabled it.