Bug 29344 - REGRESSION: fast/workers/dedicated-worker-lifecycle.html failing intermittently on leopard bot
Summary: REGRESSION: fast/workers/dedicated-worker-lifecycle.html failing intermittent...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: Macintosh Intel OS X 10.5
: P2 Normal
Assignee: Andrew Wilson
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-09-17 13:39 PDT by Eric Seidel (no email)
Modified: 2009-12-11 13:04 PST (History)
10 users (show)

See Also:


Attachments
Proposal to skip the test until a fix can be found (1.26 KB, patch)
2009-09-28 16:24 PDT, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
proposed patch that may address the flakiness (3.43 KB, patch)
2009-09-28 18:32 PDT, Andrew Wilson
no flags Details | Formatted Diff | Diff
Crash log. (30.78 KB, text/plain)
2009-12-07 14:49 PST, David Levin
no flags Details
Normal crash log (-- had incorrect define for jsc zombies). (27.47 KB, text/plain)
2009-12-07 15:39 PST, David Levin
no flags Details
Another JSC_ZOMBIES crash log (in case it's helpful) (26.10 KB, text/plain)
2009-12-07 16:16 PST, Eric Seidel (no email)
no flags Details
Add dedicated-worker-lifecycle.html to the appropriate Skipped files (2.07 KB, patch)
2009-12-11 11:00 PST, Andrew Wilson
no flags Details | Formatted Diff | Diff
Disabling on leopard only (1.31 KB, patch)
2009-12-11 11:33 PST, Andrew Wilson
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Seidel (no email) 2009-09-17 13:39:06 PDT
fast/workers/dedicated-worker-lifecycle.html failing intermittently on leopard bot

http://build.webkit.org/results/Leopard%20Intel%20Release%20(Tests)/r48488%20(5093)/results.html

--- layout-test-results/fast/workers/dedicated-worker-lifecycle-expected.txt	2009-09-17 13:27:57.000000000 -0700
+++ layout-test-results/fast/workers/dedicated-worker-lifecycle-actual.txt	2009-09-17 13:27:57.000000000 -0700
@@ -1,12 +1,8 @@
+FAIL: Timed out waiting for notifyDone to be called
 This test checks whether orphaned workers exit under various conditions
 
 On success, you will see a series of "PASS" messages, followed by "TEST COMPLETE".
 
 
 PASS Orphaned worker thread created.
-PASS Orphaned worker thread exited.
-PASS Orphaned timeout worker thread created.
-PASS Orphaned timeout worker thread exited.
-
-TEST COMPLETE

This might be a one-time failure, but I'll file this so we have record for now.
Comment 1 Eric Seidel (no email) 2009-09-28 16:19:25 PDT
Sigh.  This just caused bug 29743 to fail to land too.  I'll cook up a patch to skip this test too.
Comment 2 Eric Seidel (no email) 2009-09-28 16:24:33 PDT
Created attachment 40269 [details]
Proposal to skip the test until a fix can be found
Comment 3 Andrew Wilson 2009-09-28 16:45:25 PDT
(In reply to comment #2)
> Created an attachment (id=40269) [details]
> Proposal to skip the test until a fix can be found

These lifecycle tests are all somewhat dodgy I suspect.

The problem is that the worker threads only exit once the Worker object itself is GC'd, and the worker object will not be GC'd if there's something on the stack that looks like a reference to it. I've seen odd errors in the past due to this (objects not getting GC'd even though there are no references). It's probably reasonable to just leave these disabled since there's no ironclad way for JS code to force something to get garbage collected.
Comment 4 David Levin 2009-09-28 16:50:35 PDT
Based on discussion with Drew, I r+ this.

It sounds like it may need to be removed completely.
Comment 5 Eric Seidel (no email) 2009-09-28 16:52:41 PDT
We can force a collection with GCController.collect() in DumpRenderTree.

If there are no references to it, it will be collected.  Would that help you make this non-flakey?
Comment 6 Maciej Stachowiak 2009-09-28 16:53:32 PDT
(In reply to comment #3)
> (In reply to comment #2)
> > Created an attachment (id=40269) [details] [details]
> > Proposal to skip the test until a fix can be found
> 
> These lifecycle tests are all somewhat dodgy I suspect.
> 
> The problem is that the worker threads only exit once the Worker object itself
> is GC'd, and the worker object will not be GC'd if there's something on the
> stack that looks like a reference to it. I've seen odd errors in the past due
> to this (objects not getting GC'd even though there are no references). It's
> probably reasonable to just leave these disabled since there's no ironclad way
> for JS code to force something to get garbage collected.

We have found ways before to get objects to be GC'd with 100% reliability in tests. Usually this involves doing lots of allocation and calling after the last reference should be dropped, and using a timer to return to the event loop and clear out any temporary stack.
Comment 7 Eric Seidel (no email) 2009-09-28 16:56:36 PDT
I'm in no hurry to check this in.  I'll wait until there is some amount of consensus about what should be done to this test.  If the vote is it should be Skipped for all platforms, or removed entirely, please just r- the existing patch and I'll be happy to post a different one (or someone else is welcome to).

Thanks again for looking!
Comment 8 Andrew Wilson 2009-09-28 17:02:44 PDT
(In reply to comment #6)

> We have found ways before to get objects to be GC'd with 100% reliability in
> tests. Usually this involves doing lots of allocation and calling after the
> last reference should be dropped, and using a timer to return to the event loop
> and clear out any temporary stack.

When I looked at this, I also tried those ways (including the exhaustive gc() function from js-test-pre.js, also happening from a setTimeout()):

function gc() {
    if (typeof GCController !== "undefined")
        GCController.collect();
    else {
        function gcRec(n) {
            if (n < 1)
                return {};
            var temp = {i: "ab" + i + (i / 100000)};
            temp += "foo";
            gcRec(n-1);
        }
        for (var i = 0; i < 1000; i++)
            gcRec(10)
    }
}

And yet the GC would not free the object. In my previous case, it turned out that allocating a Date() object (for whatever reason) would allow the object to get GC'd (so I know the problem really was an issue of GC, and not some kind of lingering reference) - if you look at the body of dedicated-worker-lifecycle.js, you'll see this:

        // For some reason, the worker object does not get GC'd unless we allocate a new object here.
        // The conjecture is that there's a value on the stack that appears to point to the worker which this clobbers.
        new Date();

Anyhow, I'd be happy to make whatever changes are necessary to keep this test live - but I've tried all the suggestions I've heard so far and none of them have worked :(
Comment 9 Andrew Wilson 2009-09-28 17:04:51 PDT
(In reply to comment #8)

> When I looked at this, I also tried those ways (including the exhaustive gc()
> function from js-test-pre.js, also happening from a setTimeout()):

To clarify, I was forcing the "slow/exhaustive" branch in the gc() routine - it wasn't just calling GCCollect.collect().

Anyhow, I could definitely re-enable that code again - maybe at least it would make it stable enough that we could keep the test enabled...
Comment 10 Andrew Wilson 2009-09-28 17:25:52 PDT
Here's my proposal:

I'll re-enable the code that does what Maciej suggests. We'll leave the test disabled on mac-leopard, and watch to see if we see failures on any other platform. If not, we can try re-enabling on mac-leopard as well.
Comment 11 Andrew Wilson 2009-09-28 18:32:44 PDT
Created attachment 40277 [details]
proposed patch that may address the flakiness

Here's a patch that makes the changes Maciej suggests. We can either land this patch and wait and see if it addresses the flakiness, or land both this and Eric's patch.
Comment 12 Eric Seidel (no email) 2009-09-30 15:51:21 PDT
Comment on attachment 40277 [details]
proposed patch that may address the flakiness

Where did you get the gc() changes?

I'd love to try something since this is still flakey on the bots:
http://build.webkit.org/results/Leopard%20Intel%20Release%20(Tests)/r48919%20(5566)/results.html
Comment 13 Andrew Wilson 2009-09-30 16:38:31 PDT
(In reply to comment #12)
> (From update of attachment 40277 [details])
> Where did you get the gc() changes?
> 
From fast/js/resources/js_test_pre.js, but with minor logic changes to force them to happen even if GCController is defined.
Comment 14 Eric Seidel (no email) 2009-10-01 12:15:46 PDT
Comment on attachment 40277 [details]
proposed patch that may address the flakiness

I certainly think this is worth a try.  I'm not your best reviewer for this, but the others have been silent and this looks non-harmful.
Comment 15 Eric Seidel (no email) 2009-10-01 12:16:06 PDT
Comment on attachment 40269 [details]
Proposal to skip the test until a fix can be found

Removing Levin's r+ while we wait to see if Andrew's fix works.
Comment 16 Andrew Wilson 2009-10-01 13:28:57 PDT
Committed as r48996.

I'll keep this bug open for a bit to see if this addresses the flakiness at all.
Comment 17 Eric Seidel (no email) 2009-10-05 11:09:01 PDT
Comment on attachment 40277 [details]
proposed patch that may address the flakiness

Clearing my r+ from this patch now that it's been committed.  (This way the to-be-committed query doesn't list it.)
Comment 18 Eric Seidel (no email) 2009-10-11 10:18:39 PDT
This doesn't seem to be gone, sadly:
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r49424%20(5891)/results.html
Comment 19 Alexey Proskuryakov 2009-10-11 15:33:53 PDT
On my home machine (and a debug build), this test isn't just flaky - it's always failing. Worker activity management appears to be working well, so this looks like a general GC failure. I see the worker wrapper marked by Heap::markConservatively(), as if there were references to it remaining on the stack.

The test already attempts black magic to fight this. As far as I know, it shouldn't be necessary, and it doesn't work anyway.

        // Orphan our worker (no more references to it) and wait for it to exit.
        worker.onmessage = 0;
        worker = 0;
        // For some reason, the worker object does not get GC'd unless we allocate a new object here.
        // The conjecture is that there's a value on the stack that appears to point to the worker which this clobbers.
        new Date();
        waitUntilWorkerThreadsExit(orphanedTimeoutWorkerExited);

Here, waitUntilWorkerThreadsExit invokes garbage collector, each 10 ms, waiting for GC to collect the unused Worker object, which never happens.
Comment 20 Andrew Wilson 2009-10-11 15:40:50 PDT
OK, it sounds like the only thing to do here is just skip the test (eric's original patch) since my black magic to try to force the object off the stack isn't working.

This really befuddles me, since retrying via setTimeout() should leave the stack totally empty.
Comment 24 Eric Seidel (no email) 2009-12-03 14:13:09 PST
Crashed again:
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51657%20(7847)/results.html

I'll see if I can find cycles to find a locally reproducible crash.
Comment 25 Oliver Hunt 2009-12-03 14:18:46 PST
(In reply to comment #24)
> Crashed again:
> http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51657%20(7847)/results.html
> 
> I'll see if I can find cycles to find a locally reproducible crash.

try enabling JSC_ZOMBIES
Comment 26 Eric Seidel (no email) 2009-12-07 12:00:00 PST
Another crash on the bots:
http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51782%20(7961)/results.html

I would like to disable/skip this test soon if it remains flakey like this.  No sense in keeping red tests around.
Comment 27 David Levin 2009-12-07 14:49:51 PST
Created attachment 44434 [details]
Crash log.

I was unable to repro the crash when I ran the worker directory 1000 times but I ran the fast directory 10 times ("run-webkit-tests --iterations 10  fast/") and I got the crash once.
Comment 28 David Levin 2009-12-07 15:39:28 PST
Created attachment 44438 [details]
Normal crash log (-- had incorrect define for jsc zombies).

It still looks like some memory corruption issue (that JSC_ZOMBIES didn't detect). Time to run with malloc guard on.
Comment 29 Eric Seidel (no email) 2009-12-07 16:16:58 PST
Created attachment 44442 [details]
Another JSC_ZOMBIES crash log (in case it's helpful)

Thanks for looking dave!  I also ran a JSC_ZOMBIES run this morning, and now that I'm finally back at my desk, I see that it too crashed.
Comment 30 Eric Seidel (no email) 2009-12-07 16:19:29 PST
The ASSERT that it crashed at was:
ASSERTION FAILED: primaryHeap.numLiveObjects == primaryHeap.numZombies
(/Users/eseidel/Projects/WebKit/JavaScriptCore/runtime/Collector.cpp:199 void JSC::Heap::destroy())
Comment 31 David Levin 2009-12-09 11:22:12 PST
Notes on the crash:
* The assert at JavaScriptCore/runtime/Collector.cpp:199 isn't too useful, so it is good to comment it out (I think there may be a bug in the counting done in that function of num live objects.)  

* Use r51657 for repros. I couldn't repro with tip of tree,  r51782 had build errors for me, r51657 built and repros.

* The smallest repro that I got so far is two test:
   run-webkit-tests  --iterations 20 LayoutTests/fast/websockets/ fast/workers/dedicated-worker-lifecycle.html 

  If you run ~20 times, you'll very likely see the crash.

* Running with malloc guard causes the test to time out but if you turn off the brute force gc in worker-util.js, the test will pass (and the above crashes were done like this).

* It turns out that the crash was due to WebSockets and was recently fixed with https://bugs.webkit.org/show_bug.cgi?id=32226 in r51790. (Note no more buildbot crashes have been listed after that revision).

Case closed for the crashes!
Comment 32 Andrew Wilson 2009-12-11 10:02:38 PST
Now that Dave has fixed the crash, I'm going to disable this flaky test unless someone objects in short order.
Comment 33 Andrew Wilson 2009-12-11 11:00:42 PST
Created attachment 44697 [details]
Add dedicated-worker-lifecycle.html to the appropriate Skipped files

Don't need to disable this in Qt because it's already disabled there (missing DRT functionality).

The similarly named shared-worker-lifecycle.html and worker-lifecycle.html tests don't rely on GC (they test whether explicitly calling close() shuts down the thread), so dedicated-worker-lifecycle.html is the only one that needs to be disabled.
Comment 34 WebKit Review Bot 2009-12-11 11:04:17 PST
style-queue ran check-webkit-style on attachment 44697 [details] without any errors.
Comment 35 Darin Adler 2009-12-11 11:11:51 PST
Comment on attachment 44697 [details]
Add dedicated-worker-lifecycle.html to the appropriate Skipped files

If a test is not working on any platform, we normally disable it by renaming the file to add a "-disabled" suffix. Try this command to see the 70 tests currently disabled in this fashion:

    find LayoutTests -name '*-disabled'

Would you consider disabling this test that way? Or is this test still useful on some platform?
Comment 36 David Levin 2009-12-11 11:15:58 PST
Why are disabling this as opposed to deleting it completely?

From what I've seen fixing this seems like it depends on getting gc of a *particular* object to happen in a more deterministic manner which despite repeated attempts hasn't been possible (and it is hard to see what is going to change in that regard).
Comment 37 Andrew Wilson 2009-12-11 11:17:53 PST
Because the test itself passes 95%+ of the time (it always passes locally for me - it just fails sporadically on the bots).

I'm resubmitting this with the -disabled rename shortly, although just putting it in the skipped file makes it easier to run the test (with --skipped=ignore).
Comment 38 Andrew Wilson 2009-12-11 11:23:03 PST
Rethinking this, I'm just going to skip it on leopard for now, since that's the only platform we've been seeing flakiness on. It could, in theory, fail anywhere since it relies on GC, but in practice it isn't failing anywhere but leopard and the test itself has significant value as it's the only thing that checks to make sure that the complex worker reachability mechanism actually works.
Comment 39 Andrew Wilson 2009-12-11 11:33:00 PST
Created attachment 44699 [details]
Disabling on leopard only
Comment 40 Darin Adler 2009-12-11 11:35:08 PST
Comment on attachment 44699 [details]
Disabling on leopard only

Seems OK as a short term measure. I think we can probably find a way to force the GC to be more deterministic in the test harness.

If we start seeing failures on other platforms we might need to come up with a better solution.
Comment 41 WebKit Review Bot 2009-12-11 11:35:20 PST
style-queue ran check-webkit-style on attachment 44699 [details] without any errors.
Comment 42 WebKit Commit Bot 2009-12-11 12:38:38 PST
Comment on attachment 44699 [details]
Disabling on leopard only

Clearing flags on attachment: 44699

Committed r52014: <http://trac.webkit.org/changeset/52014>
Comment 43 WebKit Commit Bot 2009-12-11 12:38:49 PST
All reviewed patches have been landed.  Closing bug.
Comment 44 Darin Adler 2009-12-11 13:04:28 PST
(In reply to comment #37)
> just putting it in the skipped file makes it easier to run the test (with --skipped=ignore).

It might be good to replace the -disabled technique with a global Skipped file.
Although that means there will be 70 other tests that --skipped=ignore will now
include, which might not be good.