Bug 219232

Summary:

[GLib] Leaked RunLoop objects on worker threads

Product:

WebKit

Reporter:

Zan Dobersek <zan>

Component:

New Bugs

Assignee:

Zan Dobersek <zan>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

aperez, benjamin, cdumez, cgarcia, cmarcelo, ews-watchlist, fujii, ggaren, webkit-bug-importer, ysuzuki

Priority:

Keywords:

InRadar

Version:

WebKit Nightly Build

Hardware:

Unspecified

OS:

Unspecified

See Also:

https://bugs.webkit.org/show_bug.cgi?id=219591
https://bugs.webkit.org/show_bug.cgi?id=219593
https://bugs.webkit.org/show_bug.cgi?id=221115

Attachments:

Description	Flags
RunLoop::Holder cleanup	none
Patch	none
Patch	none

Zan Dobersek

Reported 2020-11-21 00:44:27 PST

On JS worker threads, the thread-specific RunLoop object remains alive, along with all its resources, even after the thread is destroyed. I've only observed this with WPE on JetStream2.0, subtest bomb-workers. For GLib implementation of RunLoop, the internal GMainContext manages an eventfd file descriptor. With large amounts of workers and corresponding RunLoops, these file descriptors grow in number until they hit the user limit for the number of allowed file descriptors, at which point the executable is terminated. It could still be a problem on other ports, but like said, I haven't tested there. Once the worker thread is terminated, the thread-specific RunLoop object outlives that thread with one reference remaining. This reference is stuck in a RunLoop::Timer object that is spawned in the per-VM data for JSRunLoopTimer. When the owner VM is being unregistered and the per-VM data destroyed, the unique_ptr owning the timer is taken and pushed into the RunLoop's dispatch queue in order to be destroyed on the thread it is associated with. At that point, the thread is already being shut down, and the dispatch of the clean-up functor is never performed, leaving the RunLoop::Timer dangling inside the queue owned by the RunLoop instance, which never gets destroyed because of the remaining reference on the RunLoop::Timer. https://trac.webkit.org/browser/webkit/trunk/Source/JavaScriptCore/runtime/JSRunLoopTimer.cpp#L61

Attachments
RunLoop::Holder cleanup (433 bytes, patch) 2020-11-21 00:47 PST, Zan Dobersek	no flags	Details Formatted Diff Diff
Patch (1.77 KB, patch) 2020-11-28 01:16 PST, Zan Dobersek	no flags	Details Formatted Diff Diff
Patch (4.57 KB, patch) 2020-12-03 01:30 PST, Zan Dobersek	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Zan Dobersek

Comment 1 2020-11-21 00:47:47 PST

Created attachment 414759 [details] RunLoop::Holder cleanup One possible fix to this issue is explicitly clearing out the dispatch queues in RunLoop::Holder destructor, which is called when the TLS storage for this thread-specific RunLoop is being destroyed.

Radar WebKit Bug Importer

Comment 2 2020-11-28 00:45:18 PST

<rdar://problem/71772277>

Zan Dobersek

Comment 3 2020-11-28 01:16:43 PST

Created attachment 414975 [details] Patch

Geoffrey Garen

Comment 4 2020-11-30 11:13:40 PST

Comment on attachment 414975 [details] Patch Nice catch! This kind of feels like the wrong level at which to fix this bug. I wonder if TimerBase can hold a weak pointer to RunLoop instead.

Geoffrey Garen

Comment 5 2020-11-30 12:24:13 PST

I guess we have a few related problems: 1. DispatchTimer retains RunLoop. If a thread exits before a scheduled dispatchAfter fires, we leak. 2. RunLoop::dispatch() lambdas may retain anything. If a thread exits before a dispatch() lambda dequeues, we leak. 3. The function body of a timer or dispatch may be responsible for releasing something. If a thread exits before a function body executes, we leak. (This is the general case of #1.) For #1, maybe RunLoop should keep an explicit retained set of DispatchTimers waiting to fire (instead of DispatchTimer owning itself). dispatchAfter() would add to the set, and firing the timer would remove from the set. That way, if the RunLoop is destroyed, all DispatchTimers in the set are freed. For #2, I guess it's right to clear the queues. But can move that clearing into a member function named something like RunLoop::threadWillExit()? I'm torn about whether we should clear the queues or ASSERT that they're already empty. See below. For #3, maybe there's nothing we can do about it, and it's just programmer error to allow a thread to exit when it has more work to do.

Zan Dobersek

Comment 6 2020-12-01 08:16:40 PST

I think DispatchTimer problem would be resolved if queues were emptied out upon thread destruction and the DispatchTimer ownership was tied to the lambda lifetime and not its execution. I'm not sure how much people could be blamed for dispatches still pending at the point of thread determination. It's certainly problematic if the queued dispatches were supposed to manually release resources that otherwise weren't tied to the dispatch functor lifetime. There could be a helper class made available, instances of which could be included in lambdas and would have to be manually marked upon dispatch, and asserting upon destruction of those instances if marking was not done.

Zan Dobersek

Comment 7 2020-12-01 08:19:18 PST

(In reply to Geoffrey Garen from comment #4) > Comment on attachment 414975 [details] > Patch > > Nice catch! > > This kind of feels like the wrong level at which to fix this bug. > > I wonder if TimerBase can hold a weak pointer to RunLoop instead. A WeakPtr would work for the GLib implementation. Beyond the TimerBase constructor, the only place where RunLoop is used is during the dispatch of the RunLoop::dispatch() source, which wouldn't happen after the RunLoop is already destroyed.

Geoffrey Garen

Comment 8 2020-12-01 09:28:37 PST

Another option is to remove the m_runLoop member entirely from TimberBase, and instead require the caller to pass in a RunLoop when scheduling a timer into a RunLoop. That seems to be the way that the CoreFoundation API avoids a circular reference between timer and runloop.

Geoffrey Garen

Comment 9 2020-12-01 09:32:23 PST

> I think DispatchTimer problem would be resolved if queues were emptied out > upon thread destruction and the DispatchTimer ownership was tied to the > lambda lifetime and not its execution. I don't think emptying WTF::RunLoop queues is sufficient to address DispatchTimer because DispatchTimer is not in any WTF::RunLoop queue; instead it is registered with lower-level OS timer primitives (GSource in the Glib case).

Zan Dobersek

Comment 10 2020-12-02 03:21:45 PST

(In reply to Geoffrey Garen from comment #9) > > I think DispatchTimer problem would be resolved if queues were emptied out > > upon thread destruction and the DispatchTimer ownership was tied to the > > lambda lifetime and not its execution. > > I don't think emptying WTF::RunLoop queues is sufficient to address > DispatchTimer because DispatchTimer is not in any WTF::RunLoop queue; > instead it is registered with lower-level OS timer primitives (GSource in > the Glib case). Yes, this is correct. I was wrong before.

Zan Dobersek

Comment 11 2020-12-03 01:30:17 PST

Created attachment 415284 [details] Patch

Zan Dobersek

Comment 12 2020-12-03 01:31:56 PST

(In reply to Zan Dobersek from comment #11) > Created attachment 415284 [details] > Patch This still only addresses the original bug, i.e. case #2. It now adds and uses RunLoop::threadWillExit(), and a unit test is provided.

Zan Dobersek

Comment 13 2020-12-03 04:48:46 PST

Removing (In reply to Geoffrey Garen from comment #8) > Another option is to remove the m_runLoop member entirely from TimberBase, > and instead require the caller to pass in a RunLoop when scheduling a timer > into a RunLoop. That seems to be the way that the CoreFoundation API avoids > a circular reference between timer and runloop. At least for GLib, it could be possible to drop the RunLoop reference from RunLoop::TimerBase. We don't need it beyond the GSource creation in the constructor.

Geoffrey Garen

Comment 14 2020-12-04 12:05:16 PST

Comment on attachment 415284 [details] Patch r=me Would be nice to do a follow-up for timers. Still, this is clearly an improvement over the status quo.

Zan Dobersek

Comment 15 2020-12-07 01:02:29 PST

I'll be opening separate bugs for the general dispatchAfter() leaks and the RunLoop ref removal from TimerBase.

EWS

Comment 16 2020-12-07 01:19:56 PST

Committed r270496: <https://trac.webkit.org/changeset/270496> All reviewed patches have been landed. Closing bug and clearing flags on attachment 415284 [details].

Fujii Hironori

Comment 17 2021-02-21 17:01:22 PST

Comment on attachment 415284 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=415284&action=review > Source/WTF/wtf/RunLoop.cpp:177 > + m_nextIteration.clear(); This causes a problem. Filed: Bug 221115 – RunLoop::threadWillExit is doing m_nextIteration.clear() without locking m_nextIterationLock

Note You need to log in before you can comment on or make changes to this bug.