Bug 131082

Summary: watchdog m_didFire state erroneously retained
Product: WebKit Reporter: abaldeva
Component: JavaScriptCoreAssignee: Mark Lam <mark.lam>
Status: RESOLVED FIXED    
Severity: Normal CC: commit-queue, ggaren, mark.lam, p.jacquemart
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 147968    
Bug Blocks:    
Attachments:
Description Flags
work in progress: let's try this on the EWS bots.
none
the fix.
mark.lam: review-
fix 2.
none
fix 3
none
fix 4: nearly there.
none
fix 5: latest greatest with Web Workers now able to terminate().
ggaren: review+
fix 6 ggaren: review+

Description abaldeva 2014-04-01 16:51:22 PDT
Hi,

Looking at the watchdog.cpp implementation, once the m_didFire is set to true in the WatchDog::didFire(ExecState), it is never set to false. Thus, if the user decided to terminate the script as a result of timeout but keep the page around(WebKit 1 architecture), the scripts would suffer from the side effect of earlier m_didFire state.

It seems like one possible solution would be to set m_didFire to false when arming the watchdog. 

--Arpit
Comment 1 Mark Lam 2015-07-29 23:13:40 PDT
The existing watchdog implementation requires that the user call Watchdog::setTimeLimit() to reset the watchdog.  Granted, it would be more intuitive if we had provided a Watchdog::reset() to make it clear.

As for why I didn’t just reset the watchdog automatically on exit from the VM or re-entry, it was because a rogue script could call setTimeout() and register itself for callback before the watchdog times out.  Then, if we automatically reset the watchdog on exiting the VM, or when we re-enter the VM for the setTimeout() callback, the script will get to hog the CPU again.  And it can repeat this ad infinitum, effectively defeating the watchdog time out.

The solution was to not allow any scripts to run until the user of the VM is ready and explicitly resets the watchdog.  “Ready” in this case would probably mean that the user has purged all the setTimeout callbacks (and any other mechanisms by which the rogue script can get more execution time). 

That said, the user could ask to be notified of the time out, and do the purging before it re-entere the VM to execute any script again.  So, maybe the explicit reset is not so necessary after all.

I’ll give this some more thought.  If anyone has suggestions of a better solution (and why), please chime in.
Comment 2 Mark Lam 2015-07-30 18:30:32 PDT
I'll fix the Watchdog to clear the didfire state as soon as the VM has thrown the TerminateExecutionException.  Starting on the work now.
Comment 3 Mark Lam 2015-08-12 01:08:06 PDT
Created attachment 258815 [details]
work in progress: let's try this on the EWS bots.
Comment 4 WebKit Commit Bot 2015-08-12 01:12:21 PDT
Attachment 258815 [details] did not pass style-queue:


ERROR: Source/WebCore/ChangeLog:8:  You should remove the 'No new tests' and either add and list tests, or explain why no new tests were possible.  [changelog/nonewtests] [5]
Total errors found: 1 in 11 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 5 Mark Lam 2015-08-12 22:36:40 PDT
Created attachment 258879 [details]
the fix.
Comment 6 Mark Lam 2015-08-13 06:49:51 PDT
Comment on attachment 258879 [details]
the fix.

Cancelling this review.  I thought of a better way I should take to implement the fix.
Comment 7 Mark Lam 2015-08-13 10:37:46 PDT
Created attachment 258899 [details]
fix 2.
Comment 8 Geoffrey Garen 2015-08-13 13:45:25 PDT
Comment on attachment 258899 [details]
fix 2.

View in context: https://bugs.webkit.org/attachment.cgi?id=258899&action=review

> Source/JavaScriptCore/runtime/VM.cpp:819
> +void VM::scheduleTermination()
> +{
> +    m_hasScheduledTermination = true;
> +    WTF::storeStoreFence();
> +    if (watchdog)
> +        watchdog->interruptScript();
> +}

Why do we want two different ways to indicate interruption (m_hasScheduledTermination and m_timerDidFire)?

> Source/JavaScriptCore/runtime/Watchdog.cpp:257
> +void Watchdog::resetForTermination()
> +{
> +    stopTimer();
> +}

It is an anti-pattern to have a function whose only job is to call another function of a different name.
Comment 9 Mark Lam 2015-08-13 14:35:11 PDT
Comment on attachment 258899 [details]
fix 2.

View in context: https://bugs.webkit.org/attachment.cgi?id=258899&action=review

>> Source/JavaScriptCore/runtime/VM.cpp:819
>> +}
> 
> Why do we want two different ways to indicate interruption (m_hasScheduledTermination and m_timerDidFire)?

Because the Watchdog::didFireSlow() will reject m_timerDidFire as a spurious wake up.  scheduleTermination() is used to request termination even if the watchdog's CPU deadline has not expired.  

The alternative is to have scheduleTermination modify the watchdog's m_cpuDeadline and m_wallClockDeadline to make it look like the timer fired.  Note that the caller of scheduleTermination() is usually a thread different than the script thread that we wish to terminate.  I had thought about this approach but previously thought that it wasn't good to have a separate thread modify the deadline values.  But on second thought, I see no reason why we can't do that.  I will make it so.

>> Source/JavaScriptCore/runtime/Watchdog.cpp:257
>> +}
> 
> It is an anti-pattern to have a function whose only job is to call another function of a different name.

Will fix.
Comment 10 Mark Lam 2015-08-13 15:39:14 PDT
Created attachment 258940 [details]
fix 3
Comment 11 Geoffrey Garen 2015-08-13 16:27:56 PDT
Comment on attachment 258940 [details]
fix 3

View in context: https://bugs.webkit.org/attachment.cgi?id=258940&action=review

> Source/JavaScriptCore/runtime/VM.cpp:817
> +void VM::scheduleTermination()
> +{
> +    if (watchdog)
> +        watchdog->interruptScript();
> +}

A helper function that silently ignores you if you have forgotten to hook up the watchdog is an invitation to make mistakes. One can only schedule termination if one has set up the VM to be able to do so by hooking up the watchdog and compiling watchdog hooks into all code. 

I don't think we should have a helper function like this. It only serves to hide an important limitation.

> Source/JavaScriptCore/runtime/Watchdog.cpp:249
> +void Watchdog::interruptScript()
> +{
> +    // Currently, the best that we can do to interrupt the script thread is to simulate the watchdog
> +    // firing on it.

I would call this function "fire" or "fireManually" since its action is tested by "didFire".

I don't think this comment belongs here. Calling this "the best we can do" implies that this function is somehow lacking. But if you ask the watchdog to fire, and it does, it is not lacking.

> Source/WebCore/bindings/js/WorkerScriptController.cpp:155
> +    m_vm->scheduleTermination();

It looks like this code won't do anything, since WorkerScriptController does not ensure the creation of the watchdog. Is this a regression, or has it always been this way? Do we have tests that cover worker cancellation?

This seems to be an example of how VM::scheduleTermination can hide a problem. If you only read this line of code, it appears that the VM will take care of termination for you. But it won't unless you have created a watchdog.
Comment 12 Mark Lam 2015-08-14 07:03:57 PDT
Comment on attachment 258940 [details]
fix 3

View in context: https://bugs.webkit.org/attachment.cgi?id=258940&action=review

>> Source/WebCore/bindings/js/WorkerScriptController.cpp:155
>> +    m_vm->scheduleTermination();
> 
> It looks like this code won't do anything, since WorkerScriptController does not ensure the creation of the watchdog. Is this a regression, or has it always been this way? Do we have tests that cover worker cancellation?
> 
> This seems to be an example of how VM::scheduleTermination can hide a problem. If you only read this line of code, it appears that the VM will take care of termination for you. But it won't unless you have created a watchdog.

Regarding whether WorkerScriptController not working completely, it has always been this way.  More specifically, we can terminate workers as long as they don't have an infinite loop.  After the introduction of the watchdog, we can also terminate workers with infinite loops but only if there's a time limit set on the watchdog.

Regarding tests that cover worker cancellation, a survey of the worker tests shows that they do test termination but most of them do not test with an infinite loop.  The sole exception is fast/workers/worker-terminate-forever.html.  However, there's no good way to confirm that the worker thread actually terminated (at least, we don't have one yet).  So, the failure to terminate the infinite loop in the worker goes undetected.
Comment 13 Mark Lam 2015-08-14 10:24:40 PDT
(In reply to comment #12)
> Regarding tests that cover worker cancellation, a survey of the worker tests
> shows that they do test termination but most of them do not test with an
> infinite loop.  The sole exception is
> fast/workers/worker-terminate-forever.html.  However, there's no good way to
> confirm that the worker thread actually terminated (at least, we don't have
> one yet). ...

I've found a way using internals.workerThreadCount.  Working on the modifying the test now.
Comment 14 Mark Lam 2015-08-14 16:10:00 PDT
Created attachment 259054 [details]
fix 4: nearly there.

Archiving this version while I work on adding CheckWatchdog support to the FTL.  After that, we can update the patch to enable the fixed worker with runaway loop test.
Comment 15 Mark Lam 2015-08-18 16:41:58 PDT
Created attachment 259319 [details]
fix 5: latest greatest with Web Workers now able to terminate().
Comment 16 Geoffrey Garen 2015-08-19 14:26:11 PDT
Comment on attachment 259319 [details]
fix 5: latest greatest with Web Workers now able to terminate().

View in context: https://bugs.webkit.org/attachment.cgi?id=259319&action=review

r=me with some changes below

> Source/JavaScriptCore/runtime/Watchdog.cpp:104
> +    // To prevent a race condition with fireManually(), this is the only place where we're allowed
> +    // to clear m_timerDidFire. That is because by getting here, we're committed to check
> +    // hasScheduledTermination() below.

You should remove this comment.

It is wrong to explain this behavior as a defense against a race condition. The only way to defend against a race condition is to add a synchronization primitive, and there is none here.

It is wrong to explain this behavior as required by threads. We would still need this behavior if there were no threads. It is always true that one must only set m_timerDidFire to false if one subsequently tests for termination because, by clearing m_timerDidFire, we cancel any further checks.

> Source/JavaScriptCore/runtime/Watchdog.cpp:111
> +    // We need an explicit check here because if we have a scheduled termination, we don't want to
> +    // give the watchdog callback function (if available) the chance to prevent termination.
> +    if (hasScheduledTermination())
> +        return true;

Let's not do this. We have no clients who want this behavior, and it is bad to add special cases when they are not needed, since they add complication, and they may contradict what we discover clients do want down the road. For example, we may discover that workers prefer to allow themselves a short amount of slack time before honoring a termination request.

If we did want to do this, we would need a better name. The API name "fireManually" does not convey "guarantee termination regardless of what the time limit delegate says". It only conveys bringing JavaScript execution to a safe point, which will either terminate immediately or ask the delegate if it wants to terminate, depending on how the client has configured the VM. A better name would be "terminateExecutionIgnoringTimeLimitDelegate".

You should remove this test and change fireManually set the time limit and the wall clock and CPU deadlines to ensure termination, if the time limit delegate agrees.
Comment 17 Mark Lam 2015-08-19 14:59:31 PDT
I discussed the multi-threading issues with Geoff offline, and with the addition of a missing storeLoadFence, we are convinced that the code works.  However, the correctness of this code depends on a carefully orchestrated sequencing of load and stores of the watchdog values.  Hence, it is fragile and may be easily broken if someone modifies it in the future without a full understanding of the orchestration.  Since this code is not in the critical path for performance, we agreed that it would be simpler if we just switch to using a lock instead.  I will make that change before landing.
Comment 18 Mark Lam 2015-08-20 18:01:29 PDT
Created attachment 259543 [details]
fix 6
Comment 19 Geoffrey Garen 2015-08-26 12:48:04 PDT
Comment on attachment 259543 [details]
fix 6

r=me
Comment 20 Mark Lam 2015-08-26 19:51:30 PDT
Thanks for the review and all the feedback.  Landed in r189009: <http://trac.webkit.org/r189009>.