fast/workers/dedicated-worker-lifecycle.html failing intermittently on leopard bot http://build.webkit.org/results/Leopard%20Intel%20Release%20(Tests)/r48488%20(5093)/results.html --- layout-test-results/fast/workers/dedicated-worker-lifecycle-expected.txt 2009-09-17 13:27:57.000000000 -0700 +++ layout-test-results/fast/workers/dedicated-worker-lifecycle-actual.txt 2009-09-17 13:27:57.000000000 -0700 @@ -1,12 +1,8 @@ +FAIL: Timed out waiting for notifyDone to be called This test checks whether orphaned workers exit under various conditions On success, you will see a series of "PASS" messages, followed by "TEST COMPLETE". PASS Orphaned worker thread created. -PASS Orphaned worker thread exited. -PASS Orphaned timeout worker thread created. -PASS Orphaned timeout worker thread exited. - -TEST COMPLETE This might be a one-time failure, but I'll file this so we have record for now.
Sigh. This just caused bug 29743 to fail to land too. I'll cook up a patch to skip this test too.
Created attachment 40269 [details] Proposal to skip the test until a fix can be found
(In reply to comment #2) > Created an attachment (id=40269) [details] > Proposal to skip the test until a fix can be found These lifecycle tests are all somewhat dodgy I suspect. The problem is that the worker threads only exit once the Worker object itself is GC'd, and the worker object will not be GC'd if there's something on the stack that looks like a reference to it. I've seen odd errors in the past due to this (objects not getting GC'd even though there are no references). It's probably reasonable to just leave these disabled since there's no ironclad way for JS code to force something to get garbage collected.
Based on discussion with Drew, I r+ this. It sounds like it may need to be removed completely.
We can force a collection with GCController.collect() in DumpRenderTree. If there are no references to it, it will be collected. Would that help you make this non-flakey?
(In reply to comment #3) > (In reply to comment #2) > > Created an attachment (id=40269) [details] [details] > > Proposal to skip the test until a fix can be found > > These lifecycle tests are all somewhat dodgy I suspect. > > The problem is that the worker threads only exit once the Worker object itself > is GC'd, and the worker object will not be GC'd if there's something on the > stack that looks like a reference to it. I've seen odd errors in the past due > to this (objects not getting GC'd even though there are no references). It's > probably reasonable to just leave these disabled since there's no ironclad way > for JS code to force something to get garbage collected. We have found ways before to get objects to be GC'd with 100% reliability in tests. Usually this involves doing lots of allocation and calling after the last reference should be dropped, and using a timer to return to the event loop and clear out any temporary stack.
I'm in no hurry to check this in. I'll wait until there is some amount of consensus about what should be done to this test. If the vote is it should be Skipped for all platforms, or removed entirely, please just r- the existing patch and I'll be happy to post a different one (or someone else is welcome to). Thanks again for looking!
(In reply to comment #6) > We have found ways before to get objects to be GC'd with 100% reliability in > tests. Usually this involves doing lots of allocation and calling after the > last reference should be dropped, and using a timer to return to the event loop > and clear out any temporary stack. When I looked at this, I also tried those ways (including the exhaustive gc() function from js-test-pre.js, also happening from a setTimeout()): function gc() { if (typeof GCController !== "undefined") GCController.collect(); else { function gcRec(n) { if (n < 1) return {}; var temp = {i: "ab" + i + (i / 100000)}; temp += "foo"; gcRec(n-1); } for (var i = 0; i < 1000; i++) gcRec(10) } } And yet the GC would not free the object. In my previous case, it turned out that allocating a Date() object (for whatever reason) would allow the object to get GC'd (so I know the problem really was an issue of GC, and not some kind of lingering reference) - if you look at the body of dedicated-worker-lifecycle.js, you'll see this: // For some reason, the worker object does not get GC'd unless we allocate a new object here. // The conjecture is that there's a value on the stack that appears to point to the worker which this clobbers. new Date(); Anyhow, I'd be happy to make whatever changes are necessary to keep this test live - but I've tried all the suggestions I've heard so far and none of them have worked :(
(In reply to comment #8) > When I looked at this, I also tried those ways (including the exhaustive gc() > function from js-test-pre.js, also happening from a setTimeout()): To clarify, I was forcing the "slow/exhaustive" branch in the gc() routine - it wasn't just calling GCCollect.collect(). Anyhow, I could definitely re-enable that code again - maybe at least it would make it stable enough that we could keep the test enabled...
Here's my proposal: I'll re-enable the code that does what Maciej suggests. We'll leave the test disabled on mac-leopard, and watch to see if we see failures on any other platform. If not, we can try re-enabling on mac-leopard as well.
Created attachment 40277 [details] proposed patch that may address the flakiness Here's a patch that makes the changes Maciej suggests. We can either land this patch and wait and see if it addresses the flakiness, or land both this and Eric's patch.
Comment on attachment 40277 [details] proposed patch that may address the flakiness Where did you get the gc() changes? I'd love to try something since this is still flakey on the bots: http://build.webkit.org/results/Leopard%20Intel%20Release%20(Tests)/r48919%20(5566)/results.html
(In reply to comment #12) > (From update of attachment 40277 [details]) > Where did you get the gc() changes? > From fast/js/resources/js_test_pre.js, but with minor logic changes to force them to happen even if GCController is defined.
Comment on attachment 40277 [details] proposed patch that may address the flakiness I certainly think this is worth a try. I'm not your best reviewer for this, but the others have been silent and this looks non-harmful.
Comment on attachment 40269 [details] Proposal to skip the test until a fix can be found Removing Levin's r+ while we wait to see if Andrew's fix works.
Committed as r48996. I'll keep this bug open for a bit to see if this addresses the flakiness at all.
Comment on attachment 40277 [details] proposed patch that may address the flakiness Clearing my r+ from this patch now that it's been committed. (This way the to-be-committed query doesn't list it.)
This doesn't seem to be gone, sadly: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r49424%20(5891)/results.html
On my home machine (and a debug build), this test isn't just flaky - it's always failing. Worker activity management appears to be working well, so this looks like a general GC failure. I see the worker wrapper marked by Heap::markConservatively(), as if there were references to it remaining on the stack. The test already attempts black magic to fight this. As far as I know, it shouldn't be necessary, and it doesn't work anyway. // Orphan our worker (no more references to it) and wait for it to exit. worker.onmessage = 0; worker = 0; // For some reason, the worker object does not get GC'd unless we allocate a new object here. // The conjecture is that there's a value on the stack that appears to point to the worker which this clobbers. new Date(); waitUntilWorkerThreadsExit(orphanedTimeoutWorkerExited); Here, waitUntilWorkerThreadsExit invokes garbage collector, each 10 ms, waiting for GC to collect the unused Worker object, which never happens.
OK, it sounds like the only thing to do here is just skip the test (eric's original patch) since my black magic to try to force the object off the stack isn't working. This really befuddles me, since retrying via setTimeout() should leave the stack totally empty.
Crashed tonight: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51524%20(7729)/results.html
Looks like it failed earlier tonight as well: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51523%20(7728)/fast/workers/dedicated-worker-lifecycle-pretty-diff.html
Failed again just now: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51610%20(7808)/fast/workers/dedicated-worker-lifecycle-pretty-diff.html
Crashed again: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51657%20(7847)/results.html I'll see if I can find cycles to find a locally reproducible crash.
(In reply to comment #24) > Crashed again: > http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51657%20(7847)/results.html > > I'll see if I can find cycles to find a locally reproducible crash. try enabling JSC_ZOMBIES
Another crash on the bots: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51782%20(7961)/results.html I would like to disable/skip this test soon if it remains flakey like this. No sense in keeping red tests around.
Created attachment 44434 [details] Crash log. I was unable to repro the crash when I ran the worker directory 1000 times but I ran the fast directory 10 times ("run-webkit-tests --iterations 10 fast/") and I got the crash once.
Created attachment 44438 [details] Normal crash log (-- had incorrect define for jsc zombies). It still looks like some memory corruption issue (that JSC_ZOMBIES didn't detect). Time to run with malloc guard on.
Created attachment 44442 [details] Another JSC_ZOMBIES crash log (in case it's helpful) Thanks for looking dave! I also ran a JSC_ZOMBIES run this morning, and now that I'm finally back at my desk, I see that it too crashed.
The ASSERT that it crashed at was: ASSERTION FAILED: primaryHeap.numLiveObjects == primaryHeap.numZombies (/Users/eseidel/Projects/WebKit/JavaScriptCore/runtime/Collector.cpp:199 void JSC::Heap::destroy())
Notes on the crash: * The assert at JavaScriptCore/runtime/Collector.cpp:199 isn't too useful, so it is good to comment it out (I think there may be a bug in the counting done in that function of num live objects.) * Use r51657 for repros. I couldn't repro with tip of tree, r51782 had build errors for me, r51657 built and repros. * The smallest repro that I got so far is two test: run-webkit-tests --iterations 20 LayoutTests/fast/websockets/ fast/workers/dedicated-worker-lifecycle.html If you run ~20 times, you'll very likely see the crash. * Running with malloc guard causes the test to time out but if you turn off the brute force gc in worker-util.js, the test will pass (and the above crashes were done like this). * It turns out that the crash was due to WebSockets and was recently fixed with https://bugs.webkit.org/show_bug.cgi?id=32226 in r51790. (Note no more buildbot crashes have been listed after that revision). Case closed for the crashes!
Now that Dave has fixed the crash, I'm going to disable this flaky test unless someone objects in short order.
Created attachment 44697 [details] Add dedicated-worker-lifecycle.html to the appropriate Skipped files Don't need to disable this in Qt because it's already disabled there (missing DRT functionality). The similarly named shared-worker-lifecycle.html and worker-lifecycle.html tests don't rely on GC (they test whether explicitly calling close() shuts down the thread), so dedicated-worker-lifecycle.html is the only one that needs to be disabled.
style-queue ran check-webkit-style on attachment 44697 [details] without any errors.
Comment on attachment 44697 [details] Add dedicated-worker-lifecycle.html to the appropriate Skipped files If a test is not working on any platform, we normally disable it by renaming the file to add a "-disabled" suffix. Try this command to see the 70 tests currently disabled in this fashion: find LayoutTests -name '*-disabled' Would you consider disabling this test that way? Or is this test still useful on some platform?
Why are disabling this as opposed to deleting it completely? From what I've seen fixing this seems like it depends on getting gc of a *particular* object to happen in a more deterministic manner which despite repeated attempts hasn't been possible (and it is hard to see what is going to change in that regard).
Because the test itself passes 95%+ of the time (it always passes locally for me - it just fails sporadically on the bots). I'm resubmitting this with the -disabled rename shortly, although just putting it in the skipped file makes it easier to run the test (with --skipped=ignore).
Rethinking this, I'm just going to skip it on leopard for now, since that's the only platform we've been seeing flakiness on. It could, in theory, fail anywhere since it relies on GC, but in practice it isn't failing anywhere but leopard and the test itself has significant value as it's the only thing that checks to make sure that the complex worker reachability mechanism actually works.
Created attachment 44699 [details] Disabling on leopard only
Comment on attachment 44699 [details] Disabling on leopard only Seems OK as a short term measure. I think we can probably find a way to force the GC to be more deterministic in the test harness. If we start seeing failures on other platforms we might need to come up with a better solution.
style-queue ran check-webkit-style on attachment 44699 [details] without any errors.
Comment on attachment 44699 [details] Disabling on leopard only Clearing flags on attachment: 44699 Committed r52014: <http://trac.webkit.org/changeset/52014>
All reviewed patches have been landed. Closing bug.
(In reply to comment #37) > just putting it in the skipped file makes it easier to run the test (with --skipped=ignore). It might be good to replace the -disabled technique with a global Skipped file. Although that means there will be 70 other tests that --skipped=ignore will now include, which might not be good.