Bug 71695

Summary:

Regression(r53595): Sync xhr requests in workers aren't terminated on worker close.

Product:

WebKit

Reporter:

Csaba Osztrogonác <ossy>

Component:

Tools / Tests

Assignee:

David Levin <levin>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

ap, dimich, ericu, hnandor, levin, levin+threading, ossy, webkit.review.bot, zherczeg

Priority:

Keywords:

Qt, QtTriaged

Version:

528+ (Nightly build)

Hardware:

All

OS:

All

Bug Depends on:

Bug Blocks:

64002, 70619

Attachments:

Description	Flags
Patch	zherczeg: review+

Csaba Osztrogonác

Reported 2011-11-07 08:35:11 PST

To reproduce these fail, use: $ Tools/Scripts/old-run-webkit-tests --wait-for-httpd fast/workers/shared-worker-lifecycle.html fast/workers/shared-worker-frame-lifecycle.html fast/workers/worker-close-more.html fast/workers/storage/interrupt-database.html fast/workers/dedicated-worker-lifecycle.html fast/workers/worker-lifecycle.html

Attachments
Patch (12.76 KB, patch) 2011-12-07 15:30 PST, David Levin	zherczeg: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Csaba Osztrogonác

Comment 1 2011-11-07 08:41:08 PST

I meant: $ Tools/Scripts/old-run-webkit-tests --wait-for-httpd http/tests/xmlhttprequest/workers/abort-exception-assert.html fast/workers/shared-worker-lifecycle.html fast/workers/shared-worker-frame-lifecycle.html fast/workers/worker-close-more.html fast/workers/storage/interrupt-database.html fast/workers/dedicated-worker-lifecycle.html fast/workers/worker-lifecycle.html fails: notifydone timeout after 30 secs +FAIL: Timed out waiting for notifyDone to be called

Nandor Huszka

Comment 2 2011-11-24 00:00:02 PST

In abort-exception-assert.html Worker is created which execute the LayoutTests/http/tests/xmlhttprequest/workers/resources/endless-sync-xhr.js script. There can be found the following line: xhr.open("GET", "endless-response.php", false); If I overwrite the third parameter from false to true, the other 6 tests will be correct. As a result, I think the problem is about synchronous request. After more examining, in the same file, after that request is executed there is a layoutTestController.notifyDone(); calling. This command hasn't effect at all, what's more in my opinion it isn't called, however the preventing statement is true ( if (window.layoutTestController) ). When we use the mentioned synchronous request, for example in interrupt-database.html function finishTest() { log("PASS: database operations interrupted."); if (window.layoutTestController) layoutTestController.notifyDone(); } the window.layoutTestController will be null, that's why this test is fail with timeout. I have no idea yet what's the problem exactly, but I'm on it.

Nandor Huszka

Comment 3 2011-11-24 00:54:30 PST

In abort-exception-assert.html, the mentioned Worker's terminating is called with 100 milliseconds latency. This Worker does an endless cycle: sleeping for 10 seconds and then send a synchronous request. If we try to terminate this Worker with 2 seconds latency, and sleeping it for 1 second in every iteration before sending the request, the layoutTestController.notifyDone(); will be called, so the Worker will terminate successfully. In this case the other 6 tests will be passed too. Before terminating the Worker, there can be read a comment, it says: "This is an (unlikely) race condition here in that the worker may not have started the sync xhr call at this point, but it has been greatly lessened by the 100ms delay." In spite of giving the 100ms delay, I think it's a race condition yet, it must be the cause of the bug...

Nandor Huszka

Comment 4 2011-11-25 06:18:41 PST

Problem isn't that I have mentioned... It seems it's that the worker in abort-exception-assert.html doesn't terminate, but the other 6 tests use the waitUntilWorkerThreadsExit(runTests); function, which can be found in worker-utils.js. This method calls the function they got, after there are no more workers. It try to fill the stack with lots of object (in and endless cycle), then remove them and the possible stucked workers with GCController.collect(); call. But this call doesn't remove the worker, that's why the 6 tests stopped, and fail with timeout. Therefore I think, the worker isn't seems garbage for the GC for some reason. I tried to kill the worker from itself, but it didn't solve the problem.

Zoltan Herczeg

Comment 5 2011-12-05 06:59:34 PST

Hi Alexey, David your patch (http://trac.webkit.org/changeset/41216) introduced a complicated bug in the abort-exception-assert.html. You have added a mode string called "loadResourceSynchronouslyMode", and the worker created by the test above does not stop until it receives a message with this mode string. See WorkerRunLoop.cpp:155: OwnPtr<WorkerRunLoop::Task> task = m_messageQueue.waitForMessageFilteredWithTimeout(result, predicate, absoluteTime); The predicate is the following: 81 bool operator()(WorkerRunLoop::Task* task) const 82 { 83 return m_defaultMode || m_mode == task->mode(); 84 } and m_defaultMode is false in or case. The abort-exception-assert.html sends a terminate message to this worker, and the worker should stop. However, this message does not have loadResourceSynchronouslyMode mode (its mode string is actually NULL), so the "predicate" rule filters it out which makes the worker run infinitely. The purpose of this test is testing the terminate message. But the story does not end here. The test calls terminate, but it does not wait until it is successful (thus the abort-exception-assert.html seems successul). However, the next test waits until all workers quit. Since this particular worker does never stop, it cause a timeout error during the next test (which makes the next test flakey).

Zoltan Herczeg

Comment 6 2011-12-05 07:41:37 PST

Well, I try to simplify what is happening here TEST1: workerA = new worker [ do thing here ] workerA.terminate() <-- does not check whether the terminate is successful exit_with("PASS") <-- workerA still waits here TEST2: workerB = new worker <-- workerB exits here on its own <-- workerA still waits wait_until_ALL_worker_finish <-- wait forever (until timeout reached) <-- it become a falkey test since it runs well on its own exit_with("PASS") Solution: 1) terminate message should override ANY filters. Add a new "bool nofilter = false" argument to message creaton. 2) DRT should check whether all workers are quit before starting a new test. I don't think any test should leave garbage after them.

David Levin

Comment 7 2011-12-05 08:02:16 PST

If I'm reading this correctly, I believe that the MessageQueue should terminate with MessageQueueTerminated due to being killed in this case. btw, why is this just failing now when the patch referenced is about 3 years old?

Csaba Osztrogonác

Comment 8 2011-12-05 08:08:28 PST

(In reply to comment #7) > If I'm reading this correctly, I believe that the MessageQueue should terminate with MessageQueueTerminated due to being killed in this case. > > btw, why is this just failing now when the patch referenced is about 3 years old? As the name of the bug says http/tests/xmlhttprequest/workers/abort-exception-assert.html makes 6 fast/workers test fail. We didn't catch this bug previously, because ORWT runs http tests at the end. But we use NRWT from this summer and NRWT starts testing with https tests. So this bug was revealed by ORWT/NRWT switch.

David Levin

Comment 9 2011-12-05 18:07:43 PST

(In reply to comment #8) > (In reply to comment #7) > > If I'm reading this correctly, I believe that the MessageQueue should terminate with MessageQueueTerminated due to being killed in this case. > > > > btw, why is this just failing now when the patch referenced is about 3 years old? > > As the name of the bug says http/tests/xmlhttprequest/workers/abort-exception-assert.html makes 6 fast/workers test fail. We didn't catch this bug previously, because ORWT runs http tests at the end. But we use NRWT from this summer and NRWT starts testing with https tests. So this bug was revealed by ORWT/NRWT switch. I'm checking out what happens on OS X to see if there is the same issue or how it is addressed. Thanks for investigating this issue down to this level!

David Levin

Comment 10 2011-12-05 18:52:30 PST

It looks like http://trac.webkit.org/changeset/53595 broke this. Before that patch, WorkerThread::stop would always call m_runLoop.terminate(); which would break out of these nested loops. After that change, it no longer did. I think this is because "Worker thread shutdown got changed a bunch to handle Database cleanup" but this was unfortunately a broken way to do this. Ideally the http/tests/xmlhttprequest/workers/abort-exception-assert.html or some other test would have caught this but I guess not.

Zoltan Herczeg

Comment 11 2011-12-05 23:38:55 PST

Do you have any idea how to fix it? We spent days for searching this bug and we would really like to fix it.

David Levin

Comment 12 2011-12-06 11:35:16 PST

(In reply to comment #11) > Do you have any idea how to fix it? We spent days for searching this bug and we would really like to fix it. I'm discussing it with ericu. The idea we're discussing is moving http://trac.webkit.org/changeset/53595/trunk/WebCore/workers/WorkerThread.cpp::performTask (http://trac.webkit.org/changeset/53595/trunk/WebCore/workers/WorkerThread.cpp) into the normal code path of WorkerContext shutdown and undo'ing the changes done to WorkerThread::stop() in this patch.

David Levin

Comment 13 2011-12-06 15:22:09 PST

(In reply to comment #11) > Do you have any idea how to fix it? We spent days for searching this bug and we would really like to fix it. If you'd like to help the fix along, a single test which repros the issue would be nice as that will be needed with the final fix.

David Levin

Comment 14 2011-12-07 15:30:02 PST

Created attachment 118282 [details] Patch

Zoltan Herczeg

Comment 15 2011-12-08 00:15:09 PST

Comment on attachment 118282 [details] Patch r=me. Looks like we were not fast enough. You should add more text to the changelog to describe the issue and why your change fix it. I still think DRT should check if there is a hanging worker and throw some error message. But that coudl go into a differnt patch (I hope Nandi is up for the task).

Nandor Huszka

Comment 16 2011-12-08 00:48:11 PST

(In reply to comment #15) > (From update of attachment 118282 [details]) > I still think DRT should check if there is a hanging worker and throw some error message. But that coudl go into a differnt patch (I hope Nandi is up for the task). I'm working on it.

David Levin

Comment 17 2011-12-08 06:52:46 PST

ap or dimich, any comments?

Alexey Proskuryakov

Comment 18 2011-12-08 08:50:46 PST

Comment on attachment 118282 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118282&action=review I don't remember how this works sufficiently well to comment. > Source/WebCore/workers/WorkerThread.cpp:-257 > - // FIXME: Rudely killing the thread won't work when we allow nested workers, because they will try to post notifications of their destruction. > - // This can likely use the same mechanism as used for databases above. Why is this no longer an issue?

David Levin

Comment 19 2011-12-09 10:49:51 PST

(In reply to comment #18) > (From update of attachment 118282 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=118282&action=review > > I don't remember how this works sufficiently well to comment. > > > Source/WebCore/workers/WorkerThread.cpp:-257 > > - // FIXME: Rudely killing the thread won't work when we allow nested workers, because they will try to post notifications of their destruction. > > - // This can likely use the same mechanism as used for databases above. > > Why is this no longer an issue? The database comment didn't make sense to me (and given that the change that added the comment broke the shutdown process for workers, I didn't trust it as well). The other comment likely still applies. I was thinking there are lots of issues to tackle and examine when one does nested workers and this would naturally be one of them, so I'd remove the FIXME... but I'll leave it in.

Alexey Proskuryakov

Comment 20 2011-12-09 13:14:01 PST

> I was thinking there are lots of issues to tackle and examine when one does nested workers and this > would naturally be one of them, so I'd remove the FIXME... That makes sense to me. The comment was written at a time when nested workers seemed imminent, and now it's unclear if they will ever materialize. A line in ChangeLog can explain the rationale for removal.

David Levin

Comment 21 2011-12-09 14:06:04 PST

Committed as http://trac.webkit.org/changeset/102473

Csaba Osztrogonác

Comment 22 2012-01-04 03:35:47 PST

(In reply to comment #21) > Committed as http://trac.webkit.org/changeset/102473 I unskipped the test: http://trac.webkit.org/changeset/104024

Note You need to log in before you can comment on or make changes to this bug.