Bug 71695

Summary: Regression(r53595): Sync xhr requests in workers aren't terminated on worker close.
Product: WebKit Reporter: Csaba Osztrogonác <ossy>
Component: Tools / TestsAssignee: David Levin <levin>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, dimich, ericu, hnandor, levin, levin+threading, ossy, webkit.review.bot, zherczeg
Priority: P2 Keywords: Qt, QtTriaged
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 64002, 70619    
Attachments:
Description Flags
Patch zherczeg: review+

Description Csaba Osztrogonác 2011-11-07 08:35:11 PST
To reproduce these fail, use:
$ Tools/Scripts/old-run-webkit-tests --wait-for-httpd fast/workers/shared-worker-lifecycle.html fast/workers/shared-worker-frame-lifecycle.html fast/workers/worker-close-more.html fast/workers/storage/interrupt-database.html fast/workers/dedicated-worker-lifecycle.html fast/workers/worker-lifecycle.html
Comment 1 Csaba Osztrogonác 2011-11-07 08:41:08 PST
I meant: $ Tools/Scripts/old-run-webkit-tests --wait-for-httpd http/tests/xmlhttprequest/workers/abort-exception-assert.html fast/workers/shared-worker-lifecycle.html fast/workers/shared-worker-frame-lifecycle.html fast/workers/worker-close-more.html fast/workers/storage/interrupt-database.html fast/workers/dedicated-worker-lifecycle.html fast/workers/worker-lifecycle.html

fails: notifydone timeout after 30 secs
+FAIL: Timed out waiting for notifyDone to be called
Comment 2 Nandor Huszka 2011-11-24 00:00:02 PST
In abort-exception-assert.html Worker is created which execute the 
LayoutTests/http/tests/xmlhttprequest/workers/resources/endless-sync-xhr.js 
script. There can be found the following line:

xhr.open("GET", "endless-response.php", false);

If I overwrite the third parameter from false to true, the other 6 tests will be correct. As a result, I think the problem is about synchronous request.

After more examining, in the same file, after that request is executed there is a
layoutTestController.notifyDone();
calling. This command hasn't effect at all, what's more in my opinion it isn't called, however the preventing statement is true ( if (window.layoutTestController) ).

When we use the mentioned synchronous request, for example in interrupt-database.html
function finishTest()
{
    log("PASS: database operations interrupted.");
    if (window.layoutTestController)
        layoutTestController.notifyDone();
}
the window.layoutTestController will be null, that's why this test is fail with timeout.
I have no idea yet what's the problem exactly, but I'm on it.
Comment 3 Nandor Huszka 2011-11-24 00:54:30 PST
In abort-exception-assert.html, the mentioned Worker's terminating is called with 100 milliseconds latency. This Worker does an endless cycle: sleeping for 10 seconds and then send a synchronous request.

If we try to terminate this Worker with 2 seconds latency, and sleeping it for 1 second in every iteration before sending the request, the
layoutTestController.notifyDone();
will be called, so the Worker will terminate successfully. In this case the other 6 tests will be passed too.

Before terminating the Worker, there can be read a comment, it says:
"This is an (unlikely) race condition here in that the worker may not have started the sync xhr call at this point, but it has been greatly lessened by the 100ms delay."
In spite of giving the 100ms delay, I think it's a race condition yet, it must be the cause of the bug...
Comment 4 Nandor Huszka 2011-11-25 06:18:41 PST
Problem isn't that I have mentioned... It seems it's that the worker in abort-exception-assert.html doesn't terminate, but the other 6 tests use the 

waitUntilWorkerThreadsExit(runTests);

function, which can be found in worker-utils.js.
This method calls the function they got, after there are no more workers. It try to fill the stack with lots of object (in and endless cycle), then remove them and the possible stucked workers with

GCController.collect();

call. But this call doesn't remove the worker, that's why the 6 tests stopped, and fail with timeout.
Therefore I think, the worker isn't seems garbage for the GC for some reason. I tried to kill the worker from itself, but it didn't solve the problem.
Comment 5 Zoltan Herczeg 2011-12-05 06:59:34 PST
Hi Alexey, David

your patch (http://trac.webkit.org/changeset/41216) introduced a complicated bug in the abort-exception-assert.html. You have added a mode string called "loadResourceSynchronouslyMode", and the worker created by the test above does not stop until it receives a message with this mode string.

See WorkerRunLoop.cpp:155:
OwnPtr<WorkerRunLoop::Task> task = m_messageQueue.waitForMessageFilteredWithTimeout(result, predicate, absoluteTime);

The predicate is the following:

81	    bool operator()(WorkerRunLoop::Task* task) const
82	    {
83	        return m_defaultMode || m_mode == task->mode();
84	    }

and m_defaultMode is false in or case.

The abort-exception-assert.html sends a terminate message to this worker, and the worker should stop. However, this message does not have loadResourceSynchronouslyMode mode (its mode string is actually NULL), so the "predicate" rule filters it out which makes the worker run infinitely. The purpose of this test is testing the terminate message.

But the story does not end here. The test calls terminate, but it does not wait until it is successful (thus the abort-exception-assert.html seems successul). However, the next test waits until all workers quit. Since this particular worker does never stop, it cause a timeout error during the next test (which makes the next test flakey).
Comment 6 Zoltan Herczeg 2011-12-05 07:41:37 PST
Well, I try to simplify what is happening here

TEST1:

workerA = new worker
[ do thing here ]
workerA.terminate()
<-- does not check whether the terminate is successful
exit_with("PASS")

<-- workerA still waits here

TEST2:

workerB = new worker
<-- workerB exits here on its own
<-- workerA still waits
wait_until_ALL_worker_finish <-- wait forever (until timeout reached)
<-- it become a falkey test since it runs well on its own
exit_with("PASS")

Solution:
1) terminate message should override ANY filters. Add a new "bool nofilter = false" argument to message creaton.

2) DRT should check whether all workers are quit before starting a new test. I don't think any test should leave garbage after them.
Comment 7 David Levin 2011-12-05 08:02:16 PST
If I'm reading this correctly, I believe that the MessageQueue should terminate with MessageQueueTerminated due to being killed in this case.

btw, why is this just failing now when the patch referenced is about 3 years old?
Comment 8 Csaba Osztrogonác 2011-12-05 08:08:28 PST
(In reply to comment #7)
> If I'm reading this correctly, I believe that the MessageQueue should terminate with MessageQueueTerminated due to being killed in this case.
> 
> btw, why is this just failing now when the patch referenced is about 3 years old?

As the name of the bug says http/tests/xmlhttprequest/workers/abort-exception-assert.html makes 6 fast/workers test fail. We didn't catch this bug previously, because ORWT runs http tests at the end. But we use NRWT from this summer and NRWT starts testing with https tests. So this bug was revealed by ORWT/NRWT switch.
Comment 9 David Levin 2011-12-05 18:07:43 PST
(In reply to comment #8)
> (In reply to comment #7)
> > If I'm reading this correctly, I believe that the MessageQueue should terminate with MessageQueueTerminated due to being killed in this case.
> > 
> > btw, why is this just failing now when the patch referenced is about 3 years old?
> 
> As the name of the bug says http/tests/xmlhttprequest/workers/abort-exception-assert.html makes 6 fast/workers test fail. We didn't catch this bug previously, because ORWT runs http tests at the end. But we use NRWT from this summer and NRWT starts testing with https tests. So this bug was revealed by ORWT/NRWT switch.

I'm checking out what happens on OS X to see if there is the same issue or how it is addressed.

Thanks for investigating this issue down to this level!
Comment 10 David Levin 2011-12-05 18:52:30 PST
It looks like http://trac.webkit.org/changeset/53595 broke this.

Before that patch, WorkerThread::stop would always call m_runLoop.terminate();  which would break out of these nested loops. After that change, it no longer did.

I think this is because "Worker thread shutdown got changed a bunch to handle Database cleanup" but this was unfortunately a broken way to do this. Ideally the http/tests/xmlhttprequest/workers/abort-exception-assert.html or some other test would have caught this but I guess not.
Comment 11 Zoltan Herczeg 2011-12-05 23:38:55 PST
Do you have any idea how to fix it? We spent days for searching this bug and we would really like to fix it.
Comment 12 David Levin 2011-12-06 11:35:16 PST
(In reply to comment #11)
> Do you have any idea how to fix it? We spent days for searching this bug and we would really like to fix it.

I'm discussing it with ericu. The idea we're discussing is moving http://trac.webkit.org/changeset/53595/trunk/WebCore/workers/WorkerThread.cpp::performTask (http://trac.webkit.org/changeset/53595/trunk/WebCore/workers/WorkerThread.cpp) into the normal code path of WorkerContext shutdown and undo'ing the changes done to WorkerThread::stop() in this patch.
Comment 13 David Levin 2011-12-06 15:22:09 PST
(In reply to comment #11)
> Do you have any idea how to fix it? We spent days for searching this bug and we would really like to fix it.

If you'd like to help the fix along, a single test which repros the issue would be nice as that will be needed with the final fix.
Comment 14 David Levin 2011-12-07 15:30:02 PST
Created attachment 118282 [details]
Patch
Comment 15 Zoltan Herczeg 2011-12-08 00:15:09 PST
Comment on attachment 118282 [details]
Patch

r=me.

Looks like we were not fast enough. You should add more text to the changelog to describe the issue and why your change fix it.

I still think DRT should check if there is a hanging worker and throw some error message. But that coudl go into a differnt patch (I hope Nandi is up for the task).
Comment 16 Nandor Huszka 2011-12-08 00:48:11 PST
(In reply to comment #15)
> (From update of attachment 118282 [details])

> I still think DRT should check if there is a hanging worker and throw some error message. But that coudl go into a differnt patch (I hope Nandi is up for the task).

I'm working on it.
Comment 17 David Levin 2011-12-08 06:52:46 PST
ap or dimich, any comments?
Comment 18 Alexey Proskuryakov 2011-12-08 08:50:46 PST
Comment on attachment 118282 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118282&action=review

I don't remember how this works sufficiently well to comment.

> Source/WebCore/workers/WorkerThread.cpp:-257
> -    // FIXME: Rudely killing the thread won't work when we allow nested workers, because they will try to post notifications of their destruction.
> -    // This can likely use the same mechanism as used for databases above.

Why is this no longer an issue?
Comment 19 David Levin 2011-12-09 10:49:51 PST
(In reply to comment #18)
> (From update of attachment 118282 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=118282&action=review
> 
> I don't remember how this works sufficiently well to comment.
> 
> > Source/WebCore/workers/WorkerThread.cpp:-257
> > -    // FIXME: Rudely killing the thread won't work when we allow nested workers, because they will try to post notifications of their destruction.
> > -    // This can likely use the same mechanism as used for databases above.
> 
> Why is this no longer an issue?

The database comment didn't make sense to me (and given that the change that added the comment broke the shutdown process for workers, I didn't trust it as well). The other comment likely still applies. I was thinking there are lots of issues to tackle and examine when one does nested workers and this would naturally be one of them, so I'd remove the FIXME... but I'll leave it in.
Comment 20 Alexey Proskuryakov 2011-12-09 13:14:01 PST
> I was thinking there are lots of issues to tackle and examine when one does nested workers and this
> would naturally be one of them, so I'd remove the FIXME...

That makes sense to me. The comment was written at a time when nested workers seemed imminent, and now it's unclear if they will ever materialize.

A line in ChangeLog can explain the rationale for removal.
Comment 21 David Levin 2011-12-09 14:06:04 PST
Committed as http://trac.webkit.org/changeset/102473
Comment 22 Csaba Osztrogonác 2012-01-04 03:35:47 PST
(In reply to comment #21)
> Committed as http://trac.webkit.org/changeset/102473
I unskipped the test: http://trac.webkit.org/changeset/104024