Bug 37554

Summary: REGRESSION: fast/workers/wrapper-map-gc.html crashes on Snow Leopard Release Bot
Product: WebKit Reporter: Eric Seidel (no email) <eric>
Component: Tools / TestsAssignee: Dmitry Titov <dimich>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, ap, dimich, ggaren, sam, webkit.review.bot
Priority: P1    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: OS X 10.5   
Attachments:
Description Flags
crash log
none
Patch.
ggaren: review-, dimich: commit-queue-
Added check in ScheduledAction
none
Patch.
ggaren: review-, dimich: commit-queue-
Patch with a more explicit check. ggaren: review+, dimich: commit-queue-

Description Eric Seidel (no email) 2010-04-13 23:42:02 PDT
fast/workers/wrapper-map-gc.html crashed on Snow Leopard Release Bot

http://build.webkit.org/results/SnowLeopard%20Intel%20Release%20(Tests)/r57566%20(8248)/results.html

Sorry I don't have a stack trace for you (yet).  I believe I've seen this one crash once before but I couldn't find a bug on it.

http://trac.webkit.org/browser/trunk/LayoutTests/fast/workers/wrapper-map-gc.html
Comment 2 Eric Seidel (no email) 2010-04-14 12:20:34 PDT
Struck again:
http://build.webkit.org/results/SnowLeopard%20Intel%20Release%20(Tests)/r57593%20(8279)/results.html

Given this has reproduced 3 times just this morning, I expect it should be easy to repro.
Comment 3 Eric Seidel (no email) 2010-04-14 14:36:59 PDT
Yeah, I think something might have recently regressed to cause this to get so flaky.
Again just now:
http://build.webkit.org/results/SnowLeopard%20Intel%20Release%20(Tests)/r57604%20(8293)/results.html
Comment 6 Eric Seidel (no email) 2010-04-18 14:20:24 PDT
Created attachment 53635 [details]
crash log

This crash is super-easy to repro locally.

run-webkit-tests --debug --iterations 100  fast/workers/wrapper-map-gc.html

did it for me.  Crash log attached.
Comment 7 Eric Seidel (no email) 2010-04-18 14:23:09 PDT
Comment on attachment 53635 [details]
crash log

Looks like it hit this ASSERT:
    ASSERT(isUndefinedOrNull());
http://trac.webkit.org/browser/trunk/JavaScriptCore/runtime/JSValue.cpp#L78

That ASSERT has been there forever (added by ggaren).
Comment 8 Eric Seidel (no email) 2010-04-18 14:23:38 PDT
Maybe Geoff can offer some insight as to what Workers is doing wrong here.
Comment 9 Geoffrey Garen 2010-04-19 16:28:10 PDT
It looks like the JavaScript wrapper being used as 'this' on this test is garbage. Not sure why.
Comment 10 Dmitry Titov 2010-04-20 20:41:42 PDT
It was broken by introducing the possibility for a JS wrapper of WorkerContext be returned as 0 in case of terminated worker, to prevent the re-entry into JS (r57349). It works fine if all the places that get the wrapper before invoking JS check for the value. JSEventListener did not, so if worker's JS execution was terminated between pulling globalObject (returns 0 for terminated worker, there is a check) and getting a wrapper for the 'this' (returns 0 for terminated worker, no check) - the subsequent attempt to run a listener function crashes.

Note there is no need to synchronize checks here, even though the termination is requested on one thread and the value of a flag is used on another - once the JS execution on worker thread is set up properly, it's ok to go into it since it will be terminated by JSC::Terminator causing an exception eventually.

So, I broke it :-(. My only excuse is that I wrote the test that caught the borkage and I now have a fix :-) Patch coming.
Comment 11 Dmitry Titov 2010-04-20 20:49:54 PDT
Created attachment 53914 [details]
Patch.
Comment 12 Eric Seidel (no email) 2010-04-21 04:27:36 PDT
Looks sane to me, but someone with more recent experience in JSC should probably r+ this.
Comment 13 Geoffrey Garen 2010-04-21 09:32:50 PDT
Thanks Dmitry, I think I understand what's going on here now.

The ASSERT you ran into reflects a JSC philosophy that it's an error for the runtime ever to encounter a JSValue that is not a standard JavaScript value. We allow clients to construct an empty JSValue() / null C pointer, for their convenience, but we don't allow clients to pass an empty JSValue() / null C pointer into JSC.

So, your change in JSWorkerContextBase.cpp is the right thing to do.

However, I have two questions about the rest of the patch:

1. If a WorkerContext wants to prevent events from firing, isn't encoding that information in JSEventListener::handleEvent pretty far afield? It would be better if we could separate concerns, so that JSEventListener didn't have to know about the internal implementation details of WorkerContext. I wonder if WorkerContext could just override "virtual bool dispatchEvent(PassRefPtr<Event>);", and have it do nothing in the case where the WorkerContext's m_executionForbidden flag has been set.

2. Doesn't ScheduledAction::execute need to be patched too? It seems to assume that workerContextWrapper() will not return 0.
Comment 14 Geoffrey Garen 2010-04-21 09:33:13 PDT
Comment on attachment 53914 [details]
Patch.

r- because I think there's a better way to do this.
Comment 15 Dmitry Titov 2010-04-21 14:53:15 PDT
(In reply to comment #13)
> However, I have two questions about the rest of the patch:
> 
> 1. If a WorkerContext wants to prevent events from firing, isn't encoding that
> information in JSEventListener::handleEvent pretty far afield? It would be
> better if we could separate concerns, so that JSEventListener didn't have to
> know about the internal implementation details of WorkerContext. I wonder if
> WorkerContext could just override "virtual bool
> dispatchEvent(PassRefPtr<Event>);", and have it do nothing in the case where
> the WorkerContext's m_executionForbidden flag has been set.
> 
> 2. Doesn't ScheduledAction::execute need to be patched too? It seems to assume
> that workerContextWrapper() will not return 0.

#1 - that was the first idea, for both JSC and V8 implementations - to check some flag on WorkerScriptController early on.  However, it translated into need to add this check to so many places.. We have many EventTargets, even in Workers, various evaluate() methods, callbacks (like in databases), even things like conversion of JS types to string can reenter JS (at least in V8 they do). All of those places pull the global object or context in some form in order to enter JS - and many places already check for null, using this pattern (or similar):

JSDOMGlobalObject* globalObject = toJSDOMGlobalObject(scriptExecutionContext, m_isolatedWorld.get());
if (!globalObject)
    return;
    
So it seems logical to avoid lots of extra piping (to pull WorkerScriptController or some Worker-independent way to figure out the scripting engine was terminated) and just make it official that an attempt to pull the global object wrapper before running some JS can return no wrapper and we need checks for that.

Also, it seems a better trap for errors, because returning 0 globalObject will likely cause crash, while silent eventual executing of some event handlers may go unnoticed for a while and create more problems down the road. It could also be harder to figure out since the crashes could happen later in the code and vary in signature.

In addition to being more centralized, returning null context/globalObject also makes code for JCS and V8 be similar. In case of V8, there are more "js entry" points that has to be guarded against reentry, so having to pull WorkerScriptController (or even more ubiquitous ScriptExecutionContext) to all necessary places would be hard and would simply add another check to existing checks for null context.

There is a bit more discussion in bug 37053, which introduced returning 0.

It's possible to remove returning 0 from WorkerScriptController::workerContextWrapper() and add a virtual dispatchEvent() as you propose but it immediately breaks test from bug 37053 and perhaps introduces more issues not yet covered by tests. I'd prefer to keep the crash we have now, at least we will be able to recognize the root issue (running JS in terminated worker) because those crashes, while timing-dependent, would look similar.

#2 is currently impossible in Workers (due to the way timers work there, via WorkerRunLoop) and I left it w/o check intentionally to make sure it crashes if there is a regression in the way WorkerRunLoop processes tasks.

Could you please weight this and let me know what you think? I could have added more info in the beginning.
Comment 16 Dmitry Titov 2010-04-22 11:40:04 PDT
Created attachment 54081 [details]
Added check in ScheduledAction

I've added a check in Scheduled action, it makes sense to have this checks in all places that may be in worker's code .

I'm still feeling the check for the global object wrapper performed down the hierarchy is the right thing. For example, dispatchEvent() invokes a non-virtual fireEventListeners which has a loop over all registered event handler functions and invoke them one after another. If the execution was terminated in the middle of one, we want to avoid invoking the rest.
Comment 17 WebKit Review Bot 2010-04-22 11:41:12 PDT
Attachment 54081 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style']" exit_code: 1
WebCore/bindings/js/JSEventListener.cpp:118:  One space before end of line comments  [whitespace/comments] [5]
Total errors found: 1 in 4 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 18 Dmitry Titov 2010-04-22 11:44:25 PDT
Created attachment 54082 [details]
Patch.

Fixed style issue.
Comment 19 Dmitry Titov 2010-04-26 12:54:12 PDT
I think this is ready for another look. I've changed ScheduledAction. Also decided to leave the check for 0 context low in the call hierarchy and tried to explain the reasons above. Lets iterate?

This test crashes ~1 out of 3 runs on SL lately.
Comment 21 Adam Barth 2010-04-27 13:47:31 PDT
Comment on attachment 54082 [details]
Patch.

I'd like ggaren to take another look, but this crash is happening really often.
Comment 22 Dmitry Titov 2010-04-27 16:45:56 PDT
It's hard to get ggaren for a chat :-)

From what I could glean from a brief IRC conversation, he would really like to see more explicit check rather then a null check on a pointer. I tend to agree since there is no precedent of checking a result of toJS(..) for 0. On another hand, it's best not to pull some high-level objects like WorkerScriptController down to JS level to do checks.

I'm experimenting with another patch, will upload soon.
Comment 23 Geoffrey Garen 2010-04-27 17:01:06 PDT
It's not good that workerContextWrapper() can magically become 0 at any time from another thread. This means that all clients of workerContextWrapper(), implicit and explicit, must constantly check for 0, or whatever it might transmute into, like jsNull(), even within the subcomponents of a given expression. That's crazy!

In this patch, you've missed at least three other place that need to check the magic crash bit:

    JSValue handleEventFunction = jsFunction->get(exec, Identifier(exec, "handleEvent"));

    JSC::call(exec, handleEventFunction, callType, callData, jsFunction, args)

    event->storeResult(ustringToString(retval.toString(exec)));

I'm not sure if you've missed other places, too. It's very difficult to reason about a magic multithreaded bit that can cause your program to crash at any time.
Comment 24 Geoffrey Garen 2010-04-27 17:02:16 PDT
Comment on attachment 54082 [details]
Patch.

Please add tests for the cases that you missed, and fix those cases.

I think the right fix would be to move away from a magic threaded bit that can cause WebKit to crash at any time, but maybe someone else has an even better solution.
Comment 25 Dmitry Titov 2010-04-27 17:34:06 PDT
Created attachment 54484 [details]
Patch with a more explicit check.

I hope this is better. I've added ScriptExecutionContext::isJSExecutionTerminated(), which is always 'false' for Document and becomes 'true' for WorkerContext once Worker.terminate() or WorkerGlobalScope.clsoe() is invoked.

Apparently, JSC does not crash on reentry while execution is terminated, it will just cause another exception and almost immediate exit from the JS fragment.
Comment 26 Dmitry Titov 2010-04-27 18:03:09 PDT
(In reply to comment #24)
> (From update of attachment 54082 [details])
> Please add tests for the cases that you missed, and fix those cases.
> 
> I think the right fix would be to move away from a magic threaded bit that can
> cause WebKit to crash at any time, but maybe someone else has an even better
> solution.

I think there are no more missed cases, this fixes wrapper-map-gc.html for JSC. There is no more 0 pointers returned from toJS. My understanding of the code is that GlobalData::terminator will raise a new exception on reentry and exit execution.

As I understand your recommended to have an explicit check, avoid context pointer to be returned as 0, and you also suggested that it is not necessarily to check before every JS entry point, since JSC (unlike v8) tolerates reentry. The attached patch does this.

I don't know how to find all the entry points into JS, how to write tests for that etc. While there is a remote chance of something breaking once JS execution starts throwing termination exception, it doesn't seem practical to try to add checks everywhere. It's better to write more tests with time on using Workers in various modes, find those issues and fix them.
Comment 27 Geoffrey Garen 2010-04-28 10:47:56 PDT
Comment on attachment 54484 [details]
Patch with a more explicit check.

r=me
Comment 28 Dmitry Titov 2010-04-28 12:40:39 PDT
Landed: http://trac.webkit.org/changeset/58421