Bug 36647 - fast/loader/stateobjects/replacestate-then-pushstate.html fails intermittently on all bots
Summary: fast/loader/stateobjects/replacestate-then-pushstate.html fails intermittently on all bots
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: PC
OS: OS X 10.5
Importance: P2 Normal
Assignee: Darin Fisher (:fishd, Google)
Depends on: 36779
 
Reported: 2010-03-26 01:25 PDT by Eric Seidel (no email)
Modified: 2010-04-02 23:00 PDT
CC List: 7 users



Attachments
v1 patch (4.26 KB, patch)
2010-04-02 12:22 PDT, Darin Fisher (:fishd, Google)
fishd: review-
v2 patch (4.55 KB, patch)
2010-04-02 20:16 PDT, Darin Fisher (:fishd, Google)
abarth: review+

Description Eric Seidel (no email) 2010-03-26 01:25:24 PDT
fast/loader/stateobjects/replacestate-then-pushstate.html failed on Snow Leopard Release Bot

The dump looks like DRT is confused and is dumping the wrong test (maybe the previous test?).

http://trac.webkit.org/browser/trunk/LayoutTests/fast/loader/stateobjects/replacestate-then-pushstate.html

http://build.webkit.org/results/SnowLeopard%20Intel%20Release%20(Tests)/r56612%20(7249)/fast/loader/stateobjects/replacestate-then-pushstate-diffs.txt
--- /Volumes/Data/WebKit-BuildSlave/snowleopard-intel-release-tests/build/layout-test-results/fast/loader/stateobjects/replacestate-then-pushstate-expected.txt	2010-03-26 01:10:32.000000000 -0700
+++ /Volumes/Data/WebKit-BuildSlave/snowleopard-intel-release-tests/build/layout-test-results/fast/loader/stateobjects/replacestate-then-pushstate-actual.txt	2010-03-26 01:10:32.000000000 -0700
@@ -1,13 +1,17 @@
-This test does the following:
--Makes a call to replaceState()
--Makes sure the history length is correct
--Makes a call to pushState()
--Makes sure the history length is correct
--Goes back, and makes sure the popstate event is correct
--Goes forward, and makes sure the popstate event is correct
-
-History length is 1
-History length is 2
-State popped - OriginalHistoryItem (type string)
-State popped - NewHistoryItem (type string)
-
+layer at (0,0) size 800x600
+  RenderView at (0,0) size 800x600
+layer at (0,0) size 800x600
+  RenderBlock {HTML} at (0,0) size 800x600
+    RenderBody {BODY} at (8,8) size 784x584
+      RenderText {#text} at (0,140) size 70x18
+        text run at (0,140) width 70: "default text"
+      RenderText {#text} at (70,140) size 4x18
+        text run at (70,140) width 4: " "
+      RenderPartObject {IFRAME} at (74,0) size 304x154 [border: (2px inset #000000)]
+        layer at (0,0) size 300x150
+          RenderView at (0,0) size 300x150
+        layer at (0,0) size 300x150
+          RenderBlock {HTML} at (0,0) size 300x150
+            RenderBody {BODY} at (8,8) size 284x134
+              RenderText {#text} at (0,0) size 21x18
+                text run at (0,0) width 21: "foo"
Comment 1 Brady Eidson 2010-03-26 09:26:44 PDT
Yes, it is dumping the wrong test.

Yes, this has been happening with scattered tests that I added, sometimes very recently and sometimes many months ago.

No, I have no idea how to explore this, especially since these are often very rare failures on the bots and I can never reproduce them locally.

In other words...
*sigh*
Comment 3 Eric Seidel (no email) 2010-03-29 19:05:45 PDT
Failed on the Leopard Commit bot just now too:
https://bugs.webkit.org/show_bug.cgi?id=30144#c66
Comment 4 Eric Seidel (no email) 2010-03-31 15:32:42 PDT
It appears to be failing on the Tiger bot as well:
http://build.webkit.org/results/Tiger%20Intel%20Release/r56871%20(10265)/fast/loader/stateobjects/replacestate-then-pushstate-pretty-diff.html

Seems we should skip/disable this test since it's failing on at least 3 bots.
Comment 5 Brady Eidson 2010-03-31 15:42:41 PDT
Until now, I wasn't aware this was failing every time on the bots.
Comment 6 Brady Eidson 2010-03-31 15:43:42 PDT
(In reply to comment #5)
> Until now, I wasn't aware this was failing every time on the bots.

Checking it myself, it appears it is not failing every time.
Comment 7 Eric Seidel (no email) 2010-03-31 16:01:33 PDT
Correct.  It is not failing consistently on any machine, to my knowledge.  My apologies if my statements led you to believe it was.  It is, however, a common failure, just not a consistent one. :(
Comment 8 Eric Seidel (no email) 2010-04-01 18:20:21 PDT
The commit-queue just hit this too:
https://bugs.webkit.org/show_bug.cgi?id=36369#c4
Comment 9 Eric Seidel (no email) 2010-04-02 01:09:38 PDT
Tiger hit this several times today.  Seems we should disable this test for all bots.  Unless you have a fix in the pipeline Brady?
Comment 10 Brady Eidson 2010-04-02 09:37:58 PDT
(In reply to comment #9)
> Tiger hit this several times today.  Seems we should disable this test for all
> bots.  Unless you have a fix in the pipeline Brady?

Is this one of the tests that's "failing" by showing the output from the previous test?  ie - the test itself isn't necessarily failing but being polluted by something prior.

Here's my dilemma.  I would love to pour time into figuring out what is wrong with the test harness, but I am not an expert at the test harness, so I would be spinning my wheels, and I don't have time for that right now.  At the same time, we can't just keep disabling tests that fail only intermittently and (apparently) only on the bots.

If I can never reproduce it locally, I can never explore it.  But if we just keep disabling test after test, then we lose the test coverage for when it really breaks 100% of the time, even for individual contributors on their own machines.

I would love to hear referrals for people who are truly experts at run-webkit-tests and DRT, and to work with them to figure out what is going on.
Comment 11 Adam Barth 2010-04-02 09:53:39 PDT
The flip side of that argument is that we have >13k tests.  Because the bots (Tiger and Snow Leopard especially) are constantly red, we become blind to failures and miss real regressions.  If we lose 0.01% of our tests, that's a net win for the project.

However, in many cases, we've been able to keep the test coverage and remove the flakiness, even with no access to the failures locally.  For example, last night I fixed a test that was failing only on Tiger with no ability to reproduce the failure myself.  It just requires effort.

I don't have visibility into the other demands on your time.  It's entirely possible you have more important things to be working on.  However, that doesn't mean we should impose these costs on the rest of the project.
Comment 12 Adam Barth 2010-04-02 09:57:52 PDT
> Is this one of the tests that's "failing" by showing the output from the
> previous test?  ie - the test itself isn't necessarily failing but being
> polluted by something prior.

Here's the failure pretty diff:

http://build.webkit.org/results/Tiger%20Intel%20Release/r57003%20(10355)/fast/loader/stateobjects/replacestate-then-pushstate-pretty-diff.html
Comment 13 Adam Barth 2010-04-02 10:01:17 PDT
By way of context, this test has failed on 8 of the last 10 Tiger bot runs.  It's one of three tests preventing this bot from going green and being put under the protection of SheriffBot (which will help prevent future real regressions).  Three tests are 0.023% of our tests.
Comment 14 Brady Eidson 2010-04-02 10:13:23 PDT
(In reply to comment #11)
> The flip side of that argument is that we have >13k tests.  Because the bots
> (Tiger and Snow Leopard especially) are constantly red, we become blind to
> failures and miss real regressions.  If we lose 0.01% of our tests, that's a
> net win for the project.

Unless the feature is new, with only an initial batch of tests, and disabling one of them would mean losing ~10% of the test coverage for a reasonably complicated mechanism.

> However, in many cases, we've been able to keep the test coverage and remove
> the flakiness, even with no access to the failures locally.  For example, last
> night I fixed a test that was failing only on Tiger with no ability to
> reproduce the failure myself.  It just requires effort.

Wildly varying amounts of effort - I have spent time on this already, and no one I've reached out to has decided it was worth helping.

(In reply to comment #12)
> > Is this one of the tests that's "failing" by showing the output from the
> > previous test?  ie - the test itself isn't necessarily failing but being
> > polluted by something prior.
> 
> Here's the failure pretty diff:
> 
> http://build.webkit.org/results/Tiger%20Intel%20Release/r57003%20(10355)/fast/loader/stateobjects/replacestate-then-pushstate-pretty-diff.html

This diff appears to be the previous test (fast/loader/replacestate-in-iframe.html) dumping into the test that the bots see as failing.  This appears to be a test harness problem that I would love help tracking down, because apparently every other test I've added lately has been affected by it somehow.  Any ideas?

(CC'ing Darin Fisher, who added the test that is causing the forward failure)
Comment 15 Darin Fisher (:fishd, Google) 2010-04-02 10:21:23 PDT
This test is also failing intermittently on the chromium bots.  I'll investigate.
Comment 16 Adam Barth 2010-04-02 10:23:39 PDT
I can read up on how this feature is supposed to work, but the first step I
would try to black-box debug this issue is to insert a trivial test between the
previous test and this one.  That way we can see whether the previous test is
leaking state forward or this test is pulling previous state forward.
Comment 17 Brady Eidson 2010-04-02 10:27:05 PDT
(In reply to comment #16)
> I can read up on how this feature is supposed to work, but the first step I
> would try to black-box debug this issue is to insert a trivial test between the
> previous test and this one.  That way we can see whether the previous test is
> leaking state forward or this test is pulling previous state forward.

That is a good idea - except that since I have never been able to reproduce this locally, and I haven't heard from any fellow WebKit developers nearby that this test fails for them, this appears to be a bot-only issue, so your experiment would require checking in a dummy test purely for experimentation on the bots.

Which (I don't believe) there is any precedent for.
Comment 18 Adam Barth 2010-04-02 10:30:04 PDT
> Which (I don't believe) there is any precedent for.

Right, I'm proposing adding a new test to debug this on the bots.  Given that it fails about 80% of the time, we should get our answer in a few cycles.  I've done similar experiments before to solve bot-only issues.  In any case, I don't think it's a big problem to add more tests.
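
For illustration, here is a minimal sketch of the kind of trivial in-between test being proposed, assuming the layoutTestController API of the era; the file name and output text are made up:

  <!-- hypothetical: fast/loader/stateobjects/canary-between-tests.html -->
  <html>
  <body>
  <script>
  // Dump plain text only and touch nothing history- or navigation-related,
  // so any render-tree output attributed to this test must have leaked in
  // from elsewhere.
  if (window.layoutTestController)
      layoutTestController.dumpAsText();
  document.write("PASS if this is the only line in the output.");
  </script>
  </body>
  </html>

If such a canary then starts showing the stray render-tree dump, the previous test is leaking state forward; if replacestate-then-pushstate.html keeps failing with the canary in between, it is pulling stale state in itself.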
Comment 19 Adam Barth 2010-04-02 10:31:03 PDT
Reading replacestate-then-pushstate.html, I wonder if the call to
history.back() is causing problems, for example if the history timer is racing
with some other asynchronous bit?  An interesting clue is that the previous test results are dumped as a render tree and not as text.
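
For reference, the expected output quoted in the description implies the test has roughly this shape; the markup, logging helper, and handler below are an illustrative reconstruction, not copied from the actual file:

  <html>
  <body>
  <pre id="log"></pre>
  <script>
  if (window.layoutTestController) {
      layoutTestController.dumpAsText();
      layoutTestController.waitUntilDone();
  }

  function log(msg)
  {
      document.getElementById("log").appendChild(document.createTextNode(msg + "\n"));
  }

  history.replaceState("OriginalHistoryItem", null);
  log("History length is " + history.length);

  history.pushState("NewHistoryItem", null);
  log("History length is " + history.length);

  var pops = 0;
  window.onpopstate = function(event)
  {
      log("State popped - " + event.state + " (type " + typeof event.state + ")");
      if (++pops == 1)
          history.forward();                  // the first popstate comes from history.back() below
      else if (window.layoutTestController)
          layoutTestController.notifyDone();  // the second popstate ends the test
  };

  history.back();  // the call suspected of racing with something asynchronous
  </script>
  </body>
  </html>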
Comment 20 Brady Eidson 2010-04-02 10:35:17 PDT
(In reply to comment #19)
> Reading replacestate-then-pushstate.html, I wonder if the call to
> history.back() is causing problems, for example if the history timer is racing
> with some other asynchronous bit?  An interesting clue is that the previous
> test results are dumped as a render tree and not as text.

Which seems to always be the case in these types of failures that we've seen lately.

Because of the varying and async nature of the test it is - of course - a waitUntilDone test.

And it seems to me that DRT should have already done its dumping for the previous test before the new test is started, then wiped that out.

With those two understandings, I'm not sure what asynchronous bit might exist here.
Comment 21 Adam Barth 2010-04-02 10:39:24 PDT
> Which seems to always be the case in these types of failures that we've seen
> lately.

Do you have links to other failures of this kind?  I'm new to the flaky test party.

> Because of the varying and async nature of the test it is - of course - a
> waitUntilDone test.

Both this test and the previous test are waitUntilDone.

> And it seems to me that DRT should have already done its dumping for the
> previous test before the new test is started, then wiped that out.

Right.  The previous test passes, so DRT has decided it's happy with what happened there.

> With those two understandings, I'm not sure what asynchronous bit might exist
> here.

When I've seen leakages before, it's because a test is running code after DRT thinks that the test is done.  For example, if you call alert in your unload handler, the alert will be recorded for the next test.  Maybe popstate has similar timing?  Is that called when you navigate away from a page?  I guess I should read up on how this API works.
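
A tiny sketch of the leakage pattern being described; this snippet is illustrative, not taken from any real test in the tree:

  <script>
  // DRT decides this test is finished once it has captured the dump, but this
  // unload handler still runs when DRT navigates to the next test's URL, so
  // the alert text gets recorded against whichever test happens to run next.
  window.onunload = function()
  {
      alert("leaks into the next test's output");
  };
  </script>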
Comment 22 Adam Barth 2010-04-02 10:48:40 PDT
Woah, I'm now convinced that http://trac.webkit.org/browser/trunk/LayoutTests/fast/loader/stateobjects/replacestate-in-iframe.html is the problem.  That test is nuts.  We should run it in a popup window.
Comment 23 Darin Fisher (:fishd, Google) 2010-04-02 10:50:43 PDT
I think this is probably related to the fact that we dispatch the PopStateEvent synchronously.  In fast/loader/replacestate-in-iframe.html, the PopStateEvent occurs in an iframe, and it is from there that the test calls notifyDone().

I think I'm going to approach this problem by making the test less nutty ;-)
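
Roughly the shape being described, with hypothetical file names, and with the mechanics of how the popstate actually gets triggered left out:

  <!-- outer test page: its load is still in progress when the frame's popstate fires -->
  <iframe src="resources/state-frame.html"></iframe>  <!-- hypothetical path -->
  <img src="resources/slow-resource.png">             <!-- keeps the outer load alive; hypothetical -->

  <!-- inside resources/state-frame.html -->
  <script>
  window.onpopstate = function()
  {
      // PopStateEvent is dispatched synchronously, so this handler can run while
      // the outer page is still loading.  Calling notifyDone() here declares the
      // test over too early; the dump that happens when the outer load finally
      // completes then spills into the next test's results.
      if (window.layoutTestController)
          layoutTestController.notifyDone();
  };
  </script>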
Comment 24 Brady Eidson 2010-04-02 11:40:25 PDT
(In reply to comment #22)
> Woah, I'm now convinced that
> http://trac.webkit.org/browser/trunk/LayoutTests/fast/loader/stateobjects/replacestate-in-iframe.html
> is the problem.  That test is nuts.  We should run it in a popup window.

I agree that test is nutty, and am glad Darin's looking into it.

I'm still at a loss on this at the higher level of run-webkit-tests + DRT.

Could you share your understanding of why Darin's test causes this to happen?

And could we fix DRT + run-webkit-tests to handle Darin's test -as nutty as it is?  :)

I ask because - like I mentioned - this has come up a couple of times with bizarre loader tests I've added over the last few months, and I *really* need to understand the root cause so I can go back, reenable the disabled tests, and fix other intermittent weirdness!
Comment 26 Darin Fisher (:fishd, Google) 2010-04-02 12:22:41 PDT
Created attachment 52435 [details]
v1 patch
Comment 27 Darin Fisher (:fishd, Google) 2010-04-02 12:26:51 PDT
(In reply to comment #24)
> And could we fix DRT + run-webkit-tests to handle Darin's test -as nutty as it
> is?  :)

I looked into this, and I'm not sure it is possible.  In this case, the root cause of the bug is that onpopstate sometimes runs before the page is fully loaded.  During onpopstate, I was calling notifyDone.  As a result, when the page finally finishes loading, it ends up dumping the test results again.

One way to avoid problems like this would be to create a new WebView for each test.
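
As an illustration of one way a test could sidestep this particular race (not necessarily what the attached patch does): defer notifyDone() until both the load event and the awaited popstate have fired.

  <script>
  var loadDone = false;
  var popstateSeen = false;

  function maybeFinish()
  {
      // Only tell DRT we are done once the page has fully loaded AND the popstate
      // we were waiting for has been handled, so no further dump can follow.
      if (loadDone && popstateSeen && window.layoutTestController)
          layoutTestController.notifyDone();
  }

  window.addEventListener("load", function() { loadDone = true; maybeFinish(); }, false);
  window.addEventListener("popstate", function() { popstateSeen = true; maybeFinish(); }, false);
  </script>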
Comment 28 Brady Eidson 2010-04-02 12:29:32 PDT
(In reply to comment #27)
> (In reply to comment #24)
> > And could we fix DRT + run-webkit-tests to handle Darin's test -as nutty as it
> > is?  :)
> 
> I looked into this, and I'm not sure it is possible.  In this case, the root
> cause of the bug is that onpopstate sometimes runs before the page is fully
> loaded.  During onpopstate, I was calling notifyDone.  As a result, when the
> page finally finishes loading, it ends up dumping the test results again.
> 
> One way to avoid problems like this would be to create a new WebView for each
> test.

Presumably we could make notifyDone() a synchronous message to DRT meaning "Hey, the test is over - RIGHT NOW", and DRT could cancel the load if it's not done yet.

I don't know how many other tests this would affect, but one might argue that those other tests are "wrong".
Comment 29 Darin Fisher (:fishd, Google) 2010-04-02 12:29:46 PDT
Comment on attachment 52435 [details]
v1 patch

Whoops, I just realized that this no longer tests the fix I was intending it to test!  Doh!
Comment 30 Adam Barth 2010-04-02 15:03:30 PDT
BTW, new-run-webkit-tests supports --run-singly, which I think uses a new DRT process for each test.
Comment 31 Eric Seidel (no email) 2010-04-02 15:11:01 PDT
"old-" run webkit-tests supports that too. :)
Comment 32 Darin Fisher (:fishd, Google) 2010-04-02 20:16:10 PDT
Created attachment 52475 [details]
v2 patch
Comment 33 Adam Barth 2010-04-02 20:38:15 PDT
Comment on attachment 52475 [details]
v2 patch

LGTM, thanks.  Sounds like a cool bug.
Comment 34 Darin Fisher (:fishd, Google) 2010-04-02 21:49:17 PDT
Landed as http://trac.webkit.org/changeset/57042
Comment 35 WebKit Review Bot 2010-04-02 22:59:32 PDT
http://trac.webkit.org/changeset/57042 might have broken Leopard Intel Debug (Tests)
Comment 36 Adam Barth 2010-04-02 23:00:40 PDT
It's really https://bugs.webkit.org/show_bug.cgi?id=36646.  The failure rate from that test is insanely high.  I've disabled it.