Bug 264824 - [GStreamer] Crash after 10 seconds on watchdog thread due to loop when destroying ~ImageDecoderGStreamerSample
Summary: [GStreamer] Crash after 10 seconds on watchdog thread due to loop when destro...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Media (show other bugs)
Version: WebKit Nightly Build
Hardware: PC Linux
: P2 Normal
Assignee: Philippe Normand
URL:
Keywords: InRadar
: 270698 (view as bug list)
Depends on:
Blocks:
 
Reported: 2023-11-14 10:46 PST by Michael Catanzaro
Modified: 2024-03-08 09:40 PST (History)
4 users (show)

See Also:


Attachments
backtrace (94.92 KB, text/plain)
2023-11-14 10:46 PST, Michael Catanzaro
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Catanzaro 2023-11-14 10:46:35 PST
Created attachment 468593 [details]
backtrace

Similar to bug #257551, bug #259504, and bug #260796, here's another crash after 10 seconds when the main thread has deadlocked deep inside GStreamer. The main thread is stuck here:

hread 2 (Thread 0x7f115d3e7f40 (LWP 2)):
#0  0x00007f11655113cb in __GI_sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007f11610b1279 in g_thread_yield () at ../glib/gthread-posix.c:1329
#2  0x00007f1161d55ad5 in default_stop (pool=0x7f0e941fcf70 [GstVideoBufferPool|videobufferpool1602]) at ../gst/gstbufferpool.c:415
#3  0x00007f1161d4eb53 in do_stop (pool=pool@entry=0x7f0e941fcf70 [GstVideoBufferPool|videobufferpool1602]) at ../gst/gstbufferpool.c:440
#4  0x00007f1161d56b70 in dec_outstanding (pool=0x7f0e941fcf70 [GstVideoBufferPool|videobufferpool1602]) at ../gst/gstbufferpool.c:1207
#5  dec_outstanding (pool=0x7f0e941fcf70 [GstVideoBufferPool|videobufferpool1602]) at ../gst/gstbufferpool.c:1197
#6  gst_buffer_pool_release_buffer (buffer=0x7f0e941fff80 [GstBuffer], pool=0x7f0e941fcf70 [GstVideoBufferPool|videobufferpool1602]) at ../gst/gstbufferpool.c:1393
#7  gst_buffer_pool_release_buffer (pool=0x7f0e941fcf70 [GstVideoBufferPool|videobufferpool1602], buffer=0x7f0e941fff80 [GstBuffer]) at ../gst/gstbufferpool.c:1372
#8  0x00007f1161d56bb8 in _gst_buffer_dispose (buffer=0x7f0e941fff80 [GstBuffer]) at ../gst/gstbuffer.c:776
#9  0x00007f1161d8c0d4 in gst_mini_object_unref (mini_object=0x7f0e941fff80 [GstBuffer]) at ../gst/gstminiobject.c:669
#10 0x00007f1161dbefca in _gst_sample_free (sample=0x55b845127710 [GstSample]) at ../gst/gstsample.c:87
#11 0x00007f1167d89570 in WTF::GRefPtr<_GstSample>::clear() (this=0x7f109a9f07b8) at WTF/Headers/wtf/glib/GRefPtr.h:90
#12 WebCore::ImageGStreamer::~ImageGStreamer() (this=0x7f109a9f07b0) at /buildstream/gnome/sdk/webkitgtk-6.0.bst/Source/WebCore/platform/graphics/gstreamer/ImageGStreamerCairo.cpp:175


All-threads backtrace attached.
Comment 1 Philippe Normand 2023-11-15 07:00:03 PST
Deadlock or infinite loop?
Comment 2 Michael Catanzaro 2023-11-15 07:05:28 PST
Hm, I guess it's indeed an infinite (or excessively slow) loop inside default_stop().
Comment 3 Philippe Normand 2023-11-15 07:18:32 PST
It's a bit weird, maybe something buggy in gst_video_convert_sample()... The returned sample is ref'ed in the decoder buffer pool, which was disposed before returning from gst_video_convert_sample() ...

We might need to do a deep copy of the sample I think...
Comment 4 Philippe Normand 2023-11-15 07:41:48 PST
I filed a GStreamer bug about this: https://gitlab.freedesktop.org/gstreamer/gstreamer/-/issues/3131
Comment 5 Michael Catanzaro 2023-11-15 08:04:45 PST
I just hit this again when viewing:

https://news.sky.com/story/plane-forced-to-return-to-airport-after-horse-gets-loose-on-board-13008688

This crash occurred shortly after a different web process died to bug #264674 when attempting to display the same page. Notably, the UI process died as well, although there is no core dump for it, so presumably it quit itself and we'll never learn why. :/

I also found a huge spam of warnings in my journal. The first look like this:

Nov 15 09:55:54 chargestone-cave org.gnome.Epiphany.Devel.desktop[35157]: (WebKitWebProcess:2): GStreamer-CRITICAL **: 09:55:54.594: gst_poll_write_control: assertion 'set != NULL' failed
Nov 15 09:55:54 chargestone-cave org.gnome.Epiphany.Devel.desktop[35157]: (WebKitWebProcess:2): GStreamer-CRITICAL **: 09:55:54.594: gst_poll_write_control: assertion 'set != NULL' failed
Nov 15 09:55:54 chargestone-cave org.gnome.Epiphany.Devel.desktop[35157]: (WebKitWebProcess:2): GStreamer-CRITICAL **: 09:55:54.594: gst_poll_read_control: assertion 'set != NULL' failed

So that explains why the loop never quits:

while (!gst_poll_read_control (priv->poll))

It will always return FALSE due to the precondition violation.
Comment 6 Philippe Normand 2023-11-15 10:00:42 PST
This makes no sense and I am unable to reproduce this issue. Sorry.
Comment 7 Michael Catanzaro 2024-01-01 06:50:34 PST
I believe I've effectively lost my system journal to this bug; I now have only a few days of history in my journal instead of the several years of history that I would expect. It's all spammed with:

gst_poll_read_control: assertion 'set != NULL' failed

I see about 30,000 of these in the first two seconds of my journal; after that, there's a 30 second gap and then more of the same. I haven't checked beyond that but I think it's safe to assume that my system journal is probably >99% this. Can we turn this into a g_assert() to crash the process instead of spam the journal?

(Arguably there is a systemd bug here for even allowing one process to create this much spam, but oh well.)
Comment 8 Michael Catanzaro 2024-01-25 07:23:26 PST
How about we add an assertion to crash the process when this bug is hit, so it doesn't spam the journal? I'm trying to investigate a hardware error that happened yesterday at the same time that I hit this bug, and having to wade through thousands of lines of this is not helpful. I gave up on hitting PageDown after 100,000 lines of this error message, and I suspect that might be only a small fraction of the spam; it's so much that journalctl isn't even able to page to the end.
Comment 9 Philippe Normand 2024-01-25 10:36:12 PST
I'm not sure to follow. iiuc:

gst_poll_read_control: assertion 'set != NULL' failed

happens because we run out of descriptors. In any case that's deep in GStreamer. So not sure what you want to do about this.


Obviously the proper fix would be to fix the leaks :)
Comment 10 Michael Catanzaro 2024-01-25 11:39:43 PST
What I'm suggesting as a stopgap measure until this can be properly investigated and fixed. Something like:

diff --git a/subprojects/gstreamer/gst/gstbufferpool.c b/subprojects/gstreamer/gst/gstbufferpool.c
index d7da0cd4ad..d06c78fd6a 100644
--- a/subprojects/gstreamer/gst/gstbufferpool.c
+++ b/subprojects/gstreamer/gst/gstbufferpool.c
@@ -408,6 +408,7 @@ default_stop (GstBufferPool * pool)
 
   /* clear the pool */
   while ((buffer = gst_atomic_queue_pop (priv->queue))) {
+    g_assert (priv->poll != NULL);
     while (!gst_poll_read_control (priv->poll)) {
       if (errno == EWOULDBLOCK) {
         /* We put the buffer into the queue but did not finish writing control

While a crash of course is not great, it would be *much* less inconvenient than a denial of service attack against the system journal. :)
Comment 11 Philippe Normand 2024-02-16 08:39:14 PST
(In reply to Michael Catanzaro from comment #5)
> I just hit this again when viewing:
> 
> https://news.sky.com/story/plane-forced-to-return-to-airport-after-horse-
> gets-loose-on-board-13008688
> 

After disabling my pihole I can reproduce the warning (GStreamer-CRITICAL **: 16:30:29.127: gst_poll_write_control: assertion 'set != NULL' failed) in TP (gotta scroll down to the bottom and wait).

I can't reproduce this with current git main, so ... shrug. Maybe this was fixed then? There were a bunch of leak fixes in the last couple months.
Comment 12 Philippe Normand 2024-02-16 09:24:52 PST
OK I can reproduce the issue after setting autoplay-policy=deny and enabling leak tracing...
Comment 13 Philippe Normand 2024-02-17 05:11:11 PST
Pull request: https://github.com/WebKit/WebKit/pull/24684
Comment 14 EWS 2024-02-20 03:50:15 PST
Committed 275032@main (e653f5a19d03): <https://commits.webkit.org/275032@main>

Reviewed commits have been landed. Closing PR #24684 and removing active labels.
Comment 15 Radar WebKit Bug Importer 2024-02-20 03:51:20 PST
<rdar://problem/123304014>
Comment 16 Michael Catanzaro 2024-03-01 06:28:18 PST
*** Bug 266573 has been marked as a duplicate of this bug. ***
Comment 17 Philippe Normand 2024-03-08 09:40:24 PST
*** Bug 270698 has been marked as a duplicate of this bug. ***