Bug 197558

Summary: REGRESSION(r243197): [GStreamer] Web process hangs when scrolling twitter timeline which contains HLS videos
Product: WebKit Reporter: Jérémy Lal <kapouer>
Component: WebKitGTKAssignee: Philippe Normand <pnormand>
Status: RESOLVED FIXED    
Severity: Normal CC: andrea.giammarchi, bugs-noreply, calvaris, commit-queue, cturner, ews-watchlist, mcatanzaro, pnormand, richcoe2, svillar
Priority: P2    
Version: Other   
Hardware: Unspecified   
OS: Linux   
Attachments:
Description Flags
Patch
calvaris: commit-queue-
Archive of layout-test-results from ews215 for win-future
none
WiP patch
none
Patch
none
Patch for landing
none
Patch for landing none

Description Jérémy Lal 2019-05-03 06:58:03 PDT
On debian/testing, webkit2gtk 2.24, scrolling down a timeline containing video(s) result in web process hanging.
I suspect it happens when the video is destroyed after scrolling past it, but i might be wrong.

I tried removing some extra gstreamer 1.14 plugins but i got the same result.
Comment 1 Jérémy Lal 2019-05-03 06:59:08 PDT
Also it did not happen with webkit2gtk 2.22.7.
Comment 2 Philippe Normand 2019-05-07 03:37:04 PDT
Please provide a URL test-case.
By 2.24, do you mean 2.24.0 or 2.24.1 ?
Comment 3 Jérémy Lal 2019-05-07 03:40:11 PDT
2.24.1, the one currently in debian/testing.

https://twitter.com/search?q=video&src=typd
and scroll down until it hangs.
Comment 4 Philippe Normand 2019-05-07 04:12:54 PDT
Ok, this is a deadlock between adaptivedemux and the webkithttpsrc element used for HLS manifest or fragments downloading. I'm investigating this issue.
Comment 5 Philippe Normand 2019-05-09 04:31:09 PDT
Created attachment 369484 [details]
Patch
Comment 6 Xabier Rodríguez Calvar 2019-05-09 06:31:18 PDT
Comment on attachment 369484 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=369484&action=review

> Source/WebCore/platform/graphics/gstreamer/MainThreadNotifier.h:73
> +                GST_CAT_LEVEL_LOG(webkit_media_player_debug, GST_LEVEL_WARNING, m_object.get(), "Notifier was invalidated, bailing out");

Do you think this notification should never happen? I don't think it's so weird that a task has been enqueued, notifier becomes invalid and the task is run. Wouldn't it be interesting to make this at least INFO?

> Source/WebCore/platform/graphics/gstreamer/MainThreadNotifier.h:97
> +            m_synchronousNotificationCondition.wait(m_synchronousNotificationMutex);

Considering we are waking up everybody below, shouldn't we check for the notification flag to be still active? Maybe a conditional wait?

> Source/WebCore/platform/graphics/gstreamer/MainThreadNotifier.h:158
> +    GRefPtr<GstObject> m_object;

I think this should be called m_gstObject.

> Source/WebCore/platform/graphics/gstreamer/WebKitWebSourceGStreamer.cpp:614
> +    GST_DEBUG_OBJECT(src, "Closing session, loader: %p, resource: %p", priv->loader.get(), priv->resource.get());

You could put this as else of the next if

> Source/WebCore/platform/graphics/gstreamer/WebKitWebSourceGStreamer.cpp:649
> +        return TRUE;

Shouldn't we unref the event before this here?
Comment 7 Philippe Normand 2019-05-09 06:38:58 PDT
Comment on attachment 369484 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=369484&action=review

>> Source/WebCore/platform/graphics/gstreamer/MainThreadNotifier.h:73
>> +                GST_CAT_LEVEL_LOG(webkit_media_player_debug, GST_LEVEL_WARNING, m_object.get(), "Notifier was invalidated, bailing out");
> 
> Do you think this notification should never happen? I don't think it's so weird that a task has been enqueued, notifier becomes invalid and the task is run. Wouldn't it be interesting to make this at least INFO?

Yes, makes sense :)

>> Source/WebCore/platform/graphics/gstreamer/MainThreadNotifier.h:97
>> +            m_synchronousNotificationCondition.wait(m_synchronousNotificationMutex);
> 
> Considering we are waking up everybody below, shouldn't we check for the notification flag to be still active? Maybe a conditional wait?

Hum, I'm not sure ... I'll try.

>> Source/WebCore/platform/graphics/gstreamer/WebKitWebSourceGStreamer.cpp:649
>> +        return TRUE;
> 
> Shouldn't we unref the event before this here?

No, gst_base_src_event() does it for us.
Comment 8 Philippe Normand 2019-05-09 07:20:49 PDT
Comment on attachment 369484 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=369484&action=review

>>> Source/WebCore/platform/graphics/gstreamer/MainThreadNotifier.h:97
>>> +            m_synchronousNotificationCondition.wait(m_synchronousNotificationMutex);
>> 
>> Considering we are waking up everybody below, shouldn't we check for the notification flag to be still active? Maybe a conditional wait?
> 
> Hum, I'm not sure ... I'll try.

Actually I'm not sure I understood what you meant. By notification flag, do you mean m_pendingNotifications? If so, I think the existing !isMainThread() test before the wait is more than enough already :)
Comment 9 EWS Watchlist 2019-05-09 08:33:09 PDT
Comment on attachment 369484 [details]
Patch

Attachment 369484 [details] did not pass win-ews (win):
Output: https://webkit-queues.webkit.org/results/12143904

New failing tests:
security/contentSecurityPolicy/video-with-file-url-allowed-by-media-src-star-with-AllowContentSecurityPolicySourceStarToMatchAnyProtocol-enabled.html
storage/indexeddb/modern/gc-closes-database.html
http/tests/css/filters-on-iframes.html
Comment 10 EWS Watchlist 2019-05-09 08:33:11 PDT
Created attachment 369493 [details]
Archive of layout-test-results from ews215 for win-future

The attached test failures were seen while running run-webkit-tests on the win-ews.
Bot: ews215  Port: win-future  Platform: CYGWIN_NT-10.0-17763-3.0.5-338.x86_64-x86_64-64bit
Comment 11 Philippe Normand 2019-05-09 08:44:26 PDT
False positive (again).
Comment 12 Xabier Rodríguez Calvar 2019-05-09 23:06:06 PDT
Comment on attachment 369484 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=369484&action=review

> Source/WebCore/platform/graphics/gstreamer/MainThreadNotifier.h:91
> +            m_synchronousNotificationCondition.notifyOne();

While answering your comment below I realized this could be an issue as well because of having now shared lock and condition. Imagine you have two pending notifications, you could be waking the wrong one here, couldn't you? I think a conditional wait below could fix this as well, couldn't it?

>>>> Source/WebCore/platform/graphics/gstreamer/MainThreadNotifier.h:97
>>>> +            m_synchronousNotificationCondition.wait(m_synchronousNotificationMutex);
>>> 
>>> Considering we are waking up everybody below, shouldn't we check for the notification flag to be still active? Maybe a conditional wait?
>> 
>> Hum, I'm not sure ... I'll try.
> 
> Actually I'm not sure I understood what you meant. By notification flag, do you mean m_pendingNotifications? If so, I think the existing !isMainThread() test before the wait is more than enough already :)

Yes, I mean to check the m_pendingNotifications flags. The use case I am thinking of is you cancel a certain notification type but then you get notified for all other types and wake them here and shouldn't they be still waiting?
Comment 13 Philippe Normand 2019-05-10 02:09:25 PDT
There's another issue to address, 2 webkitwebsrc operating from different threads, both requiring access to the main thread at the same time, leading to another deadlock...
Comment 14 Philippe Normand 2019-05-31 03:52:48 PDT
Created attachment 371048 [details]
WiP patch
Comment 15 Charlie Turner 2019-06-05 13:02:14 PDT
I spent a lot of time debugging this, and it seems like we can't achieve deadlock freedom with the design of the adaptive demuxer, and the constraint that all our network operations happen on the main thread.

I tried using the AbortableTaskQueue abstraction instead of the MainThreadNotifier, which seemed in the same vain as philn's changes to the MainThreadNotifier, but there's no clear time when a startAborting phase is going to guarantee adaptivedemux doesn't kick off new get_range requests at an unfortunate time for us. It's not just in the destruction phase of the player when this can happen, it can also happen in changePipelineState calls to READY as well.

I tried making the WebSrc behave asynchronously, such that no state change will block (see gst_base_src_set_async). That solves the MediaPlayer taking the main thread during shutdown and blocking on the fragment downloader threads, but there are other issues with this approach that caused it to be a dead-end that are internal to GStreamer.

I tried adding signal handlers to collect all created websrcs inside the player, such that at shutdown time, before we issue a change_state to NULL, we can forcibly close all network sessions on the main thread before shutting the rest of the pipeline down. This helps a lot, but it doesn't stop the tricky case of adaptivedemux's EOS handling. When the websrc completes the download of a manifest, adaptive demux uses the EOS on the manifest file to signal it should start new srcs to download the fragments. This can happen, due to bad timing, at shutdown and render the careful collection of websrcs in the player pointless.

I tried a few variations on philn's idea of adding timeouts to some of the conditional waits. These kinds of solutions can get real tricky, and indeed I didn't manage to eliminate the deadlocks, they were just a little more flakey rather than easy to trigger.

I tried sending EOS's and/or GST_MESSAGE_ERRORS to the uridownloader, to try and get it to unblock itself in the range requests at the correct times. These are the only ways to signal that element that it should interrupt it's blocking operations. Again, improvements, but not complete solutions.

The fundamental problem here is that the EOS handler takes an API_LOCK in the manifest, which blocks any state changes happening until fragments are downloaded (state changes also need the API lock). If, during the network request window, the player is destroyed, which can happen whenever with a scrolling view of players, the main thread is taken over and can not complete it's state change, and the network request completion signal never makes it back to the main thread. Deadlock.

So, I only see three options now,
  1) Someone else takes a look and hopefully spots a functional approach to this issue
  2) We revert using webkitwebsrc in the adaptivedemuxers (bye sandboxing)
  3) We stop advertising support for application/x-hls.

I'd like for 1) to happen :), 2) is sad times, and 3) is not that crazy but seemingly a bit impractical. Chrome & FF don't support HLS natively AFAICT, what happens on Twitter is a JS library does the HLS protocol and exposes it as MSE blobs. Unfortunately, for reasons I didn't debug, when you fake the user agent as FF, no videos play. I suspect either an issue in our MSE player, or something FF specific in the HLS library twitter is using. Unfortunately because we're a WebKit, most sites will want to using the native HLS support, because Safari supports it well I assume.
Comment 16 Michael Catanzaro 2019-06-06 07:35:27 PDT
(In reply to Charlie Turner from comment #15)
>   2) We revert using webkitwebsrc in the adaptivedemuxers (bye sandboxing)

That would also reintroduce CVE-2019-11070.
Comment 17 Michael Catanzaro 2019-06-14 06:36:06 PDT
*** Bug 198857 has been marked as a duplicate of this bug. ***
Comment 18 Charlie Turner 2019-06-18 03:22:59 PDT
*** Bug 198958 has been marked as a duplicate of this bug. ***
Comment 19 Philippe Normand 2019-06-27 04:14:38 PDT
*** Bug 199217 has been marked as a duplicate of this bug. ***
Comment 20 Philippe Normand 2019-06-29 01:33:52 PDT
*** Bug 199287 has been marked as a duplicate of this bug. ***
Comment 21 Charlie Turner 2019-07-04 08:35:55 PDT
Created attachment 373468 [details]
Patch
Comment 22 Charlie Turner 2019-07-04 08:36:37 PDT
Above patch also a bit of a WIP, since the upstream GStreamer review has not begun yet.
Comment 23 Michael Catanzaro 2019-07-04 09:01:09 PDT
Charlie, you're a hero! But what is the plan for older versions of GStreamer?

I think it's OK to regress functionality if GStreamer isn't new enough, but of course not OK to deadlock.
Comment 24 Charlie Turner 2019-07-04 09:08:46 PDT
(In reply to Michael Catanzaro from comment #23)
> Charlie, you're a hero! But what is the plan for older versions of GStreamer?
> 
> I think it's OK to regress functionality if GStreamer isn't new enough, but
> of course not OK to deadlock.

Thanks :blush: but lets see how it settles with users, hopefully I can get some testers here with the patch.

I think we have to just disable HLS support for now until the GStreamer patch makes its way to users through the distro channels. Especially on Twitter this is going to be painful for those users who enjoy watching videos on that site.
Comment 25 Charlie Turner 2019-07-08 03:34:37 PDT
Created attachment 373621 [details]
Patch for landing
Comment 26 Charlie Turner 2019-07-08 03:36:11 PDT
Created attachment 373622 [details]
Patch for landing
Comment 27 WebKit Commit Bot 2019-07-08 11:16:14 PDT
Comment on attachment 373622 [details]
Patch for landing

Clearing flags on attachment: 373622

Committed r247215: <https://trac.webkit.org/changeset/247215>
Comment 28 WebKit Commit Bot 2019-07-08 11:16:16 PDT
All reviewed patches have been landed.  Closing bug.
Comment 29 Rich Coe 2019-07-14 08:44:50 PDT
the patch is working for me.