Bug 219232 - [GLib] Leaked RunLoop objects on worker threads
Summary: [GLib] Leaked RunLoop objects on worker threads
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Zan Dobersek
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-11-21 00:44 PST by Zan Dobersek
Modified: 2021-02-21 17:01 PST (History)
10 users (show)

See Also:


Attachments
RunLoop::Holder cleanup (433 bytes, patch)
2020-11-21 00:47 PST, Zan Dobersek
no flags Details | Formatted Diff | Diff
Patch (1.77 KB, patch)
2020-11-28 01:16 PST, Zan Dobersek
no flags Details | Formatted Diff | Diff
Patch (4.57 KB, patch)
2020-12-03 01:30 PST, Zan Dobersek
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Zan Dobersek 2020-11-21 00:44:27 PST
On JS worker threads, the thread-specific RunLoop object remains alive, along with all its resources, even after the thread is destroyed.

I've only observed this with WPE on JetStream2.0, subtest bomb-workers. For GLib implementation of RunLoop, the internal GMainContext manages an eventfd file descriptor. With large amounts of workers and corresponding RunLoops, these file descriptors grow in number until they hit the user limit for the number of allowed file descriptors, at which point the executable is terminated.

It could still be a problem on other ports, but like said, I haven't tested there.


Once the worker thread is terminated, the thread-specific RunLoop object outlives that thread with one reference remaining. This reference is stuck in a RunLoop::Timer object that is spawned in the per-VM data for JSRunLoopTimer. When the owner VM is being unregistered and the per-VM data destroyed, the unique_ptr owning the timer is taken and pushed into the RunLoop's dispatch queue in order to be destroyed on the thread it is associated with. At that point, the thread is already being shut down, and the dispatch of the clean-up functor is never performed, leaving the RunLoop::Timer dangling inside the queue owned by the RunLoop instance, which never gets destroyed because of the remaining reference on the RunLoop::Timer.
https://trac.webkit.org/browser/webkit/trunk/Source/JavaScriptCore/runtime/JSRunLoopTimer.cpp#L61
Comment 1 Zan Dobersek 2020-11-21 00:47:47 PST
Created attachment 414759 [details]
RunLoop::Holder cleanup

One possible fix to this issue is explicitly clearing out the dispatch queues in RunLoop::Holder destructor, which is called when the TLS storage for this thread-specific RunLoop is being destroyed.
Comment 2 Radar WebKit Bug Importer 2020-11-28 00:45:18 PST
<rdar://problem/71772277>
Comment 3 Zan Dobersek 2020-11-28 01:16:43 PST
Created attachment 414975 [details]
Patch
Comment 4 Geoffrey Garen 2020-11-30 11:13:40 PST
Comment on attachment 414975 [details]
Patch

Nice catch!

This kind of feels like the wrong level at which to fix this bug.

I wonder if TimerBase can hold a weak pointer to RunLoop instead.
Comment 5 Geoffrey Garen 2020-11-30 12:24:13 PST
I guess we have a few related problems:

1. DispatchTimer retains RunLoop. If a thread exits before a scheduled dispatchAfter fires, we leak.

2. RunLoop::dispatch() lambdas may retain anything. If a thread exits before a dispatch() lambda dequeues, we leak.

3. The function body of a timer or dispatch may be responsible for releasing something. If a thread exits before a function body executes, we leak. (This is the general case of #1.)

For #1, maybe RunLoop should keep an explicit retained set of DispatchTimers waiting to fire (instead of DispatchTimer owning itself). dispatchAfter() would add to the set, and firing the timer would remove from the set. That way, if the RunLoop is destroyed, all DispatchTimers in the set are freed.

For #2, I guess it's right to clear the queues. But can move that clearing into a member function named something like RunLoop::threadWillExit()? I'm torn about whether we should clear the queues or ASSERT that they're already empty. See below.

For #3, maybe there's nothing we can do about it, and it's just programmer error to allow a thread to exit when it has more work to do.
Comment 6 Zan Dobersek 2020-12-01 08:16:40 PST
I think DispatchTimer problem would be resolved if queues were emptied out upon thread destruction and the DispatchTimer ownership was tied to the lambda lifetime and not its execution.

I'm not sure how much people could be blamed for dispatches still pending at the point of thread determination. It's certainly problematic if the queued dispatches were supposed to manually release resources that otherwise weren't tied to the dispatch functor lifetime. There could be a helper class made available, instances of which could be included in lambdas and would have to be manually marked upon dispatch, and asserting upon destruction of those instances if marking was not done.
Comment 7 Zan Dobersek 2020-12-01 08:19:18 PST
(In reply to Geoffrey Garen from comment #4)
> Comment on attachment 414975 [details]
> Patch
> 
> Nice catch!
> 
> This kind of feels like the wrong level at which to fix this bug.
> 
> I wonder if TimerBase can hold a weak pointer to RunLoop instead.

A WeakPtr would work for the GLib implementation. Beyond the TimerBase constructor, the only place where RunLoop is used is during the dispatch of the RunLoop::dispatch() source, which wouldn't happen after the RunLoop is already destroyed.
Comment 8 Geoffrey Garen 2020-12-01 09:28:37 PST
Another option is to remove the m_runLoop member entirely from TimberBase, and instead require the caller to pass in a RunLoop when scheduling a timer into a RunLoop. That seems to be the way that the CoreFoundation API avoids a circular reference between timer and runloop.
Comment 9 Geoffrey Garen 2020-12-01 09:32:23 PST
> I think DispatchTimer problem would be resolved if queues were emptied out
> upon thread destruction and the DispatchTimer ownership was tied to the
> lambda lifetime and not its execution.

I don't think emptying WTF::RunLoop queues is sufficient to address DispatchTimer because DispatchTimer is not in any WTF::RunLoop queue; instead it is registered with lower-level OS timer primitives (GSource in the Glib case).
Comment 10 Zan Dobersek 2020-12-02 03:21:45 PST
(In reply to Geoffrey Garen from comment #9)
> > I think DispatchTimer problem would be resolved if queues were emptied out
> > upon thread destruction and the DispatchTimer ownership was tied to the
> > lambda lifetime and not its execution.
> 
> I don't think emptying WTF::RunLoop queues is sufficient to address
> DispatchTimer because DispatchTimer is not in any WTF::RunLoop queue;
> instead it is registered with lower-level OS timer primitives (GSource in
> the Glib case).

Yes, this is correct. I was wrong before.
Comment 11 Zan Dobersek 2020-12-03 01:30:17 PST
Created attachment 415284 [details]
Patch
Comment 12 Zan Dobersek 2020-12-03 01:31:56 PST
(In reply to Zan Dobersek from comment #11)
> Created attachment 415284 [details]
> Patch

This still only addresses the original bug, i.e. case #2. It now adds and uses RunLoop::threadWillExit(), and a unit test is provided.
Comment 13 Zan Dobersek 2020-12-03 04:48:46 PST
Removing (In reply to Geoffrey Garen from comment #8)
> Another option is to remove the m_runLoop member entirely from TimberBase,
> and instead require the caller to pass in a RunLoop when scheduling a timer
> into a RunLoop. That seems to be the way that the CoreFoundation API avoids
> a circular reference between timer and runloop.

At least for GLib, it could be possible to drop the RunLoop reference from RunLoop::TimerBase. We don't need it beyond the GSource creation in the constructor.
Comment 14 Geoffrey Garen 2020-12-04 12:05:16 PST
Comment on attachment 415284 [details]
Patch

r=me

Would be nice to do a follow-up for timers. Still, this is clearly an improvement over the status quo.
Comment 15 Zan Dobersek 2020-12-07 01:02:29 PST
I'll be opening separate bugs for the general dispatchAfter() leaks and the RunLoop ref removal from TimerBase.
Comment 16 EWS 2020-12-07 01:19:56 PST
Committed r270496: <https://trac.webkit.org/changeset/270496>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 415284 [details].
Comment 17 Fujii Hironori 2021-02-21 17:01:22 PST
Comment on attachment 415284 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=415284&action=review

> Source/WTF/wtf/RunLoop.cpp:177
> +    m_nextIteration.clear();

This causes a problem.
Filed: Bug 221115 – RunLoop::threadWillExit is doing m_nextIteration.clear() without locking m_nextIterationLock