29344 – REGRESSION: fast/workers/dedicated-worker-lifecycle.html failing intermittently on leopard bot

RESOLVED FIXED 29344

REGRESSION: fast/workers/dedicated-worker-lifecycle.html failing intermittently on leopard bot

https://bugs.webkit.org/show_bug.cgi?id=29344

Summary REGRESSION: fast/workers/dedicated-worker-lifecycle.html failing intermittent...

Eric Seidel (no email)

Reported 2009-09-17 13:39:06 PDT

fast/workers/dedicated-worker-lifecycle.html failing intermittently on leopard bot http://build.webkit.org/results/Leopard%20Intel%20Release%20(Tests)/r48488%20(5093)/results.html --- layout-test-results/fast/workers/dedicated-worker-lifecycle-expected.txt 2009-09-17 13:27:57.000000000 -0700 +++ layout-test-results/fast/workers/dedicated-worker-lifecycle-actual.txt 2009-09-17 13:27:57.000000000 -0700 @@ -1,12 +1,8 @@ +FAIL: Timed out waiting for notifyDone to be called This test checks whether orphaned workers exit under various conditions On success, you will see a series of "PASS" messages, followed by "TEST COMPLETE". PASS Orphaned worker thread created. -PASS Orphaned worker thread exited. -PASS Orphaned timeout worker thread created. -PASS Orphaned timeout worker thread exited. - -TEST COMPLETE This might be a one-time failure, but I'll file this so we have record for now.

Attachments
Proposal to skip the test until a fix can be found (1.26 KB, patch) 2009-09-28 16:24 PDT, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
proposed patch that may address the flakiness (3.43 KB, patch) 2009-09-28 18:32 PDT, Andrew Wilson	no flags	Details Formatted Diff Diff
Crash log. (30.78 KB, text/plain) 2009-12-07 14:49 PST, David Levin	no flags	Details
Normal crash log (-- had incorrect define for jsc zombies). (27.47 KB, text/plain) 2009-12-07 15:39 PST, David Levin	no flags	Details
Another JSC_ZOMBIES crash log (in case it's helpful) (26.10 KB, text/plain) 2009-12-07 16:16 PST, Eric Seidel (no email)	no flags	Details
Add dedicated-worker-lifecycle.html to the appropriate Skipped files (2.07 KB, patch) 2009-12-11 11:00 PST, Andrew Wilson	no flags	Details Formatted Diff Diff
Disabling on leopard only (1.31 KB, patch) 2009-12-11 11:33 PST, Andrew Wilson	no flags	Details Formatted Diff Diff
Show Obsolete (6) View All Add attachment proposed patch, testcase, etc.

Eric Seidel (no email)

Comment 1 2009-09-28 16:19:25 PDT

Sigh. This just caused bug 29743 to fail to land too. I'll cook up a patch to skip this test too.

Eric Seidel (no email)

Comment 2 2009-09-28 16:24:33 PDT

Created attachment 40269 [details] Proposal to skip the test until a fix can be found

Andrew Wilson

Comment 3 2009-09-28 16:45:25 PDT

(In reply to comment #2) > Created an attachment (id=40269) [details] > Proposal to skip the test until a fix can be found These lifecycle tests are all somewhat dodgy I suspect. The problem is that the worker threads only exit once the Worker object itself is GC'd, and the worker object will not be GC'd if there's something on the stack that looks like a reference to it. I've seen odd errors in the past due to this (objects not getting GC'd even though there are no references). It's probably reasonable to just leave these disabled since there's no ironclad way for JS code to force something to get garbage collected.

David Levin

Comment 4 2009-09-28 16:50:35 PDT

Based on discussion with Drew, I r+ this. It sounds like it may need to be removed completely.

Eric Seidel (no email)

Comment 5 2009-09-28 16:52:41 PDT

We can force a collection with GCController.collect() in DumpRenderTree. If there are no references to it, it will be collected. Would that help you make this non-flakey?

Maciej Stachowiak

Comment 6 2009-09-28 16:53:32 PDT

(In reply to comment #3) > (In reply to comment #2) > > Created an attachment (id=40269) [details] [details] > > Proposal to skip the test until a fix can be found > > These lifecycle tests are all somewhat dodgy I suspect. > > The problem is that the worker threads only exit once the Worker object itself > is GC'd, and the worker object will not be GC'd if there's something on the > stack that looks like a reference to it. I've seen odd errors in the past due > to this (objects not getting GC'd even though there are no references). It's > probably reasonable to just leave these disabled since there's no ironclad way > for JS code to force something to get garbage collected. We have found ways before to get objects to be GC'd with 100% reliability in tests. Usually this involves doing lots of allocation and calling after the last reference should be dropped, and using a timer to return to the event loop and clear out any temporary stack.

Eric Seidel (no email)

Comment 7 2009-09-28 16:56:36 PDT

I'm in no hurry to check this in. I'll wait until there is some amount of consensus about what should be done to this test. If the vote is it should be Skipped for all platforms, or removed entirely, please just r- the existing patch and I'll be happy to post a different one (or someone else is welcome to). Thanks again for looking!

Andrew Wilson

Comment 8 2009-09-28 17:02:44 PDT

(In reply to comment #6) > We have found ways before to get objects to be GC'd with 100% reliability in > tests. Usually this involves doing lots of allocation and calling after the > last reference should be dropped, and using a timer to return to the event loop > and clear out any temporary stack. When I looked at this, I also tried those ways (including the exhaustive gc() function from js-test-pre.js, also happening from a setTimeout()): function gc() { if (typeof GCController !== "undefined") GCController.collect(); else { function gcRec(n) { if (n < 1) return {}; var temp = {i: "ab" + i + (i / 100000)}; temp += "foo"; gcRec(n-1); } for (var i = 0; i < 1000; i++) gcRec(10) } } And yet the GC would not free the object. In my previous case, it turned out that allocating a Date() object (for whatever reason) would allow the object to get GC'd (so I know the problem really was an issue of GC, and not some kind of lingering reference) - if you look at the body of dedicated-worker-lifecycle.js, you'll see this: // For some reason, the worker object does not get GC'd unless we allocate a new object here. // The conjecture is that there's a value on the stack that appears to point to the worker which this clobbers. new Date(); Anyhow, I'd be happy to make whatever changes are necessary to keep this test live - but I've tried all the suggestions I've heard so far and none of them have worked :(

Andrew Wilson

Comment 9 2009-09-28 17:04:51 PDT

(In reply to comment #8) > When I looked at this, I also tried those ways (including the exhaustive gc() > function from js-test-pre.js, also happening from a setTimeout()): To clarify, I was forcing the "slow/exhaustive" branch in the gc() routine - it wasn't just calling GCCollect.collect(). Anyhow, I could definitely re-enable that code again - maybe at least it would make it stable enough that we could keep the test enabled...

Andrew Wilson

Comment 10 2009-09-28 17:25:52 PDT

Here's my proposal: I'll re-enable the code that does what Maciej suggests. We'll leave the test disabled on mac-leopard, and watch to see if we see failures on any other platform. If not, we can try re-enabling on mac-leopard as well.

Andrew Wilson

Comment 11 2009-09-28 18:32:44 PDT

Created attachment 40277 [details] proposed patch that may address the flakiness Here's a patch that makes the changes Maciej suggests. We can either land this patch and wait and see if it addresses the flakiness, or land both this and Eric's patch.

Eric Seidel (no email)

Comment 12 2009-09-30 15:51:21 PDT

Comment on attachment 40277 [details] proposed patch that may address the flakiness Where did you get the gc() changes? I'd love to try something since this is still flakey on the bots: http://build.webkit.org/results/Leopard%20Intel%20Release%20(Tests)/r48919%20(5566)/results.html

Andrew Wilson

Comment 13 2009-09-30 16:38:31 PDT

(In reply to comment #12) > (From update of attachment 40277 [details]) > Where did you get the gc() changes? > From fast/js/resources/js_test_pre.js, but with minor logic changes to force them to happen even if GCController is defined.

Eric Seidel (no email)

Comment 14 2009-10-01 12:15:46 PDT

Comment on attachment 40277 [details] proposed patch that may address the flakiness I certainly think this is worth a try. I'm not your best reviewer for this, but the others have been silent and this looks non-harmful.

Eric Seidel (no email)

Comment 15 2009-10-01 12:16:06 PDT

Comment on attachment 40269 [details] Proposal to skip the test until a fix can be found Removing Levin's r+ while we wait to see if Andrew's fix works.

Andrew Wilson

Comment 16 2009-10-01 13:28:57 PDT

Committed as r48996. I'll keep this bug open for a bit to see if this addresses the flakiness at all.

Eric Seidel (no email)

Comment 17 2009-10-05 11:09:01 PDT

Comment on attachment 40277 [details] proposed patch that may address the flakiness Clearing my r+ from this patch now that it's been committed. (This way the to-be-committed query doesn't list it.)

Eric Seidel (no email)

Comment 18 2009-10-11 10:18:39 PDT

This doesn't seem to be gone, sadly: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r49424%20(5891)/results.html

Alexey Proskuryakov

Comment 19 2009-10-11 15:33:53 PDT

On my home machine (and a debug build), this test isn't just flaky - it's always failing. Worker activity management appears to be working well, so this looks like a general GC failure. I see the worker wrapper marked by Heap::markConservatively(), as if there were references to it remaining on the stack. The test already attempts black magic to fight this. As far as I know, it shouldn't be necessary, and it doesn't work anyway. // Orphan our worker (no more references to it) and wait for it to exit. worker.onmessage = 0; worker = 0; // For some reason, the worker object does not get GC'd unless we allocate a new object here. // The conjecture is that there's a value on the stack that appears to point to the worker which this clobbers. new Date(); waitUntilWorkerThreadsExit(orphanedTimeoutWorkerExited); Here, waitUntilWorkerThreadsExit invokes garbage collector, each 10 ms, waiting for GC to collect the unused Worker object, which never happens.

Andrew Wilson

Comment 20 2009-10-11 15:40:50 PDT

OK, it sounds like the only thing to do here is just skip the test (eric's original patch) since my black magic to try to force the object off the stack isn't working. This really befuddles me, since retrying via setTimeout() should leave the stack totally empty.

Eric Seidel (no email)

Comment 21 2009-11-30 22:14:59 PST

Crashed tonight: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51524%20(7729)/results.html

Eric Seidel (no email)

Comment 22 2009-11-30 22:16:17 PST

Looks like it failed earlier tonight as well: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51523%20(7728)/fast/workers/dedicated-worker-lifecycle-pretty-diff.html

Eric Seidel (no email)

Comment 23 2009-12-02 13:04:49 PST

Failed again just now: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51610%20(7808)/fast/workers/dedicated-worker-lifecycle-pretty-diff.html

Eric Seidel (no email)

Comment 24 2009-12-03 14:13:09 PST

Crashed again: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51657%20(7847)/results.html I'll see if I can find cycles to find a locally reproducible crash.

Oliver Hunt

Comment 25 2009-12-03 14:18:46 PST

(In reply to comment #24) > Crashed again: > http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51657%20(7847)/results.html > > I'll see if I can find cycles to find a locally reproducible crash. try enabling JSC_ZOMBIES

Eric Seidel (no email)

Comment 26 2009-12-07 12:00:00 PST

Another crash on the bots: http://build.webkit.org/results/Leopard%20Intel%20Debug%20(Tests)/r51782%20(7961)/results.html I would like to disable/skip this test soon if it remains flakey like this. No sense in keeping red tests around.

David Levin

Comment 27 2009-12-07 14:49:51 PST

Created attachment 44434 [details] Crash log. I was unable to repro the crash when I ran the worker directory 1000 times but I ran the fast directory 10 times ("run-webkit-tests --iterations 10 fast/") and I got the crash once.

David Levin

Comment 28 2009-12-07 15:39:28 PST

Created attachment 44438 [details] Normal crash log (-- had incorrect define for jsc zombies). It still looks like some memory corruption issue (that JSC_ZOMBIES didn't detect). Time to run with malloc guard on.

Eric Seidel (no email)

Comment 29 2009-12-07 16:16:58 PST

Created attachment 44442 [details] Another JSC_ZOMBIES crash log (in case it's helpful) Thanks for looking dave! I also ran a JSC_ZOMBIES run this morning, and now that I'm finally back at my desk, I see that it too crashed.

Eric Seidel (no email)

Comment 30 2009-12-07 16:19:29 PST

The ASSERT that it crashed at was: ASSERTION FAILED: primaryHeap.numLiveObjects == primaryHeap.numZombies (/Users/eseidel/Projects/WebKit/JavaScriptCore/runtime/Collector.cpp:199 void JSC::Heap::destroy())

David Levin

Comment 31 2009-12-09 11:22:12 PST

Notes on the crash: * The assert at JavaScriptCore/runtime/Collector.cpp:199 isn't too useful, so it is good to comment it out (I think there may be a bug in the counting done in that function of num live objects.) * Use r51657 for repros. I couldn't repro with tip of tree, r51782 had build errors for me, r51657 built and repros. * The smallest repro that I got so far is two test: run-webkit-tests --iterations 20 LayoutTests/fast/websockets/ fast/workers/dedicated-worker-lifecycle.html If you run ~20 times, you'll very likely see the crash. * Running with malloc guard causes the test to time out but if you turn off the brute force gc in worker-util.js, the test will pass (and the above crashes were done like this). * It turns out that the crash was due to WebSockets and was recently fixed with https://bugs.webkit.org/show_bug.cgi?id=32226 in r51790. (Note no more buildbot crashes have been listed after that revision). Case closed for the crashes!

Andrew Wilson

Comment 32 2009-12-11 10:02:38 PST

Now that Dave has fixed the crash, I'm going to disable this flaky test unless someone objects in short order.

Andrew Wilson

Comment 33 2009-12-11 11:00:42 PST

Created attachment 44697 [details] Add dedicated-worker-lifecycle.html to the appropriate Skipped files Don't need to disable this in Qt because it's already disabled there (missing DRT functionality). The similarly named shared-worker-lifecycle.html and worker-lifecycle.html tests don't rely on GC (they test whether explicitly calling close() shuts down the thread), so dedicated-worker-lifecycle.html is the only one that needs to be disabled.

WebKit Review Bot

Comment 34 2009-12-11 11:04:17 PST

style-queue ran check-webkit-style on attachment 44697 [details] without any errors.

Darin Adler

Comment 35 2009-12-11 11:11:51 PST

Comment on attachment 44697 [details] Add dedicated-worker-lifecycle.html to the appropriate Skipped files If a test is not working on any platform, we normally disable it by renaming the file to add a "-disabled" suffix. Try this command to see the 70 tests currently disabled in this fashion: find LayoutTests -name '*-disabled' Would you consider disabling this test that way? Or is this test still useful on some platform?

David Levin

Comment 36 2009-12-11 11:15:58 PST

Why are disabling this as opposed to deleting it completely? From what I've seen fixing this seems like it depends on getting gc of a *particular* object to happen in a more deterministic manner which despite repeated attempts hasn't been possible (and it is hard to see what is going to change in that regard).

Andrew Wilson

Comment 37 2009-12-11 11:17:53 PST

Because the test itself passes 95%+ of the time (it always passes locally for me - it just fails sporadically on the bots). I'm resubmitting this with the -disabled rename shortly, although just putting it in the skipped file makes it easier to run the test (with --skipped=ignore).

Andrew Wilson

Comment 38 2009-12-11 11:23:03 PST

Rethinking this, I'm just going to skip it on leopard for now, since that's the only platform we've been seeing flakiness on. It could, in theory, fail anywhere since it relies on GC, but in practice it isn't failing anywhere but leopard and the test itself has significant value as it's the only thing that checks to make sure that the complex worker reachability mechanism actually works.

Andrew Wilson

Comment 39 2009-12-11 11:33:00 PST

Created attachment 44699 [details] Disabling on leopard only

Darin Adler

Comment 40 2009-12-11 11:35:08 PST

Comment on attachment 44699 [details] Disabling on leopard only Seems OK as a short term measure. I think we can probably find a way to force the GC to be more deterministic in the test harness. If we start seeing failures on other platforms we might need to come up with a better solution.

WebKit Review Bot

Comment 41 2009-12-11 11:35:20 PST

style-queue ran check-webkit-style on attachment 44699 [details] without any errors.

WebKit Commit Bot

Comment 42 2009-12-11 12:38:38 PST

Comment on attachment 44699 [details] Disabling on leopard only Clearing flags on attachment: 44699 Committed r52014: <http://trac.webkit.org/changeset/52014>

WebKit Commit Bot

Comment 43 2009-12-11 12:38:49 PST

All reviewed patches have been landed. Closing bug.

Darin Adler

Comment 44 2009-12-11 13:04:28 PST

(In reply to comment #37) > just putting it in the skipped file makes it easier to run the test (with --skipped=ignore). It might be good to replace the -disabled technique with a global Skipped file. Although that means there will be 70 other tests that --skipped=ignore will now include, which might not be good.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Mac (Intel)

OS OS X 10.5

Product WebKit

Component JavaScriptCore

Assignee

Andrew Wilson

Reported

2009-09-17 13:39 PDT

Modified

2009-12-11 13:04 PST History

CC List

10 users Show

URL

Keywords

Depends on

Blocks