Bug 272975

Summary: [MSE][GStreamer] Early flushes may deadlock WebkitMediaSrc
Product: WebKit Reporter: Carlos Bentzen <cadubentzen>
Component: MediaAssignee: Nobody <webkit-unassigned>
Status: NEW ---    
Severity: Normal CC: bandou.yacine, kdwkleung, philn
Priority: P2    
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=256428
https://bugs.webkit.org/show_bug.cgi?id=230028

Description Carlos Bentzen 2024-04-19 06:20:59 PDT
We have some tests that timeout while sending FLUSH_STOP in webKitMediaSrcStreamFlush.

- media/media-source/media-source-abort-resets-parser.html: 100% for me in release mode)
- media/media-source/media-source-seek-twice.html: 20% of the time when running that test with --fully-parallel --repeat-each=1000
- media/media-source/media-source-current-time.html: 20% of the time when running that test with --fully-parallel --repeat-each=1000)

They have in common the fact that flushes are requested very early, before the playbin pipeline is fully created and connected.
Comment 1 Carlos Bentzen 2024-04-19 06:54:45 PDT
Pull request: https://github.com/WebKit/WebKit/pull/27517
Comment 2 Carlos Bentzen 2024-04-19 07:09:04 PDT
I have a WIP patch, but needs a lot of refinement before proposing to merge it.

What I have found out so far:

1. In media-source-abort-resets-parser.html, we deadlock because FLUSH_START (which travels out-of-band), arrives at the playsink earlier than STREAM_START. STREAM_START is only handled when the sticky events are pushed after sending FLUSH_STOP. When handling that, it in turn blocks the video pad and waits for FLUSH_START, but FLUSH_START arrived at the pad before and thus it deadlocks.

In gstreamer code, STREAM_START will block in gstpad.c do_probe_callbacks() on GST_PAD_BLOCK_WAIT(pad); It would have been unblocked by a FLUSH_START in gst_pad_push_event_unchecked() via GST_PAD_BLOCK_BROADCAST(pad); (but remember, in the test we get the broadcast call first and then the block call).

Another thing that is important to trigger this is the fact that the media segment contains both video and audio, but no audio samples are pushed. The test is pushing half the bytes of the segment and aborting, but the audio samples are in the final 2% of the file. Why is that important? Because the audio stream's STREAM_START never arrives and thus doesn't trigger a reconfiguration in the playsink (see gstplaysink.c sinkpad_blocked_cb()), which would unblock the pads as well. To confirm this, simply change the manifest in the test from 'test-fragmented-manifest.json' to 'test-fragmented-video-manifest.json' and the issue goes away.

For fixing that, I added a synchronization point before FLUSH_START to wait for STREAM_START to have been handled in the playsink for all streams that pushed buffers. This was achieved with a probe in the playsink pad. Very ugly, but it works.

2. For media/media-source/media-source-current-time.html, the case above happens, but also the second level of racing: sending a FLUSH_START after all STREAM_START's have been handled, but before playsink has caps for the streams on its pads.

In this case
Comment 3 Carlos Bentzen 2024-04-19 07:12:38 PDT
Continuing... 

2. For media/media-source/media-source-current-time.html, the case above happens, but also the second level of racing: sending a FLUSH_START after all STREAM_START's have been handled, but before playsink has caps for the streams on its pads.

In this case, I don't fully understand it yet, but essentially the pipeline will reject buffers returning flushing indefinitely, even though FLUSH_START and FLUSH_STOP are sent.

Added another synchronization point for, if we have received all STREAM_START's, to only flush after playsink has all the caps. Achieved that by connecting to the "notify::caps" signal on the playsink pads. Again, very ugly but it worked as well.
Comment 4 Carlos Bentzen 2024-04-19 07:14:55 PDT
Now knowing this I'd like to see if this synchronization could be moved to MediaPlayerPrivateGStreamer[MSE]. If very weird having to add API and condition variables in the source element for a pipeline issue. And see if there's any way to actually fix this in playsink or playbin3 elements themselves, if I can reduce this to a GStreamer-only test case.
Comment 5 Carlos Bentzen 2024-04-19 07:15:25 PDT
(In reply to Carlos Bentzen from comment #4)
> If very weird having to add API and

It's very weird*
Comment 6 Yacine Bandou 2024-04-26 17:14:17 PDT
According to my analysis, we lose the FLUSH-START event (which is not a sticky event) in the multiqueue source pad which is not yet linked to the avdec_h264 sink pad, so when the multiqueue is linked to avdec_h264 and then to playsink, the sticky events are sent, the first being STREAM-START, immediately afterwards we receive the FLUSH-STOP in playsink without FLUSH-START which is lost !

See this traces:

0:00:00.588103102 [36m   83[00m 0x564b9de40210 [37mDEBUG  [00m [00m          multiqueue gstmultiqueue.c:2680:gst_multi_queue_sink_event:<multiqueue0:queue_0>[00m Received flush start event
0:00:00.588106251 [36m   83[00m 0x564b9de40210 [37mDEBUG  [00m [00;01;31;44m            GST_PADS gstpad.c:5657:gst_pad_push_event_unchecked:<multiqueue0:src_0>[00m Dropping event flush-start because pad is not linked
0:00:00.588108455 [36m   83[00m 0x564b9de40210 [37mDEBUG  [00m [00m          multiqueue gstmultiqueue.c:1382:gst_single_queue_flush:<multiqueue0:queue_0>[00m flush start
0:00:00.588110632 [36m   83[00m 0x564b9de40210 [33;01mLOG    [00m [00m          multiqueue gstmultiqueue.c:1392:gst_single_queue_flush:<multiqueue0:queue_0>[00m Waking up eventually waiting task
Comment 7 Yacine Bandou 2024-04-28 14:18:27 PDT
Of course, this analysis is about the test:
- media/media-source/media-source-abort-resets-parser.html
Comment 8 Philippe Normand 2024-05-05 11:56:13 PDT
Related, I suppose? https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/6763
Comment 9 Yacine Bandou 2024-05-05 14:07:39 PDT
Yes, this MR fixes one of some cases of deadlock. but it doesn't fixe the timeout of this LayoutTest "media/media-source/media-source-abort-resets-parser.html".
Comment 10 Philippe Normand 2024-05-16 04:53:36 PDT
*** Bug 274190 has been marked as a duplicate of this bug. ***
Comment 11 Philippe Normand 2024-07-18 02:47:57 PDT
*** Bug 276716 has been marked as a duplicate of this bug. ***