Bug 219715 - [GPUProcess] Cut in half amount of IPC needed to do WebAudio
Summary: [GPUProcess] Cut in half amount of IPC needed to do WebAudio
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Audio (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Chris Dumez
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-12-09 15:41 PST by Chris Dumez
Modified: 2020-12-10 18:16 PST (History)
12 users (show)

See Also:


Attachments
Patch (24.20 KB, patch)
2020-12-09 16:04 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (24.20 KB, patch)
2020-12-09 16:05 PST, Chris Dumez
ews-feeder: commit-queue-
Details | Formatted Diff | Diff
Patch (24.16 KB, patch)
2020-12-09 16:17 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (24.16 KB, patch)
2020-12-09 16:26 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (55.52 KB, patch)
2020-12-10 13:22 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (51.78 KB, patch)
2020-12-10 13:54 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (51.78 KB, patch)
2020-12-10 14:01 PST, Chris Dumez
no flags Details | Formatted Diff | Diff
Patch (51.76 KB, patch)
2020-12-10 14:37 PST, Chris Dumez
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Dumez 2020-12-09 15:41:54 PST
Cut in half amount of IPC needed to do WebAudio.
Comment 1 Chris Dumez 2020-12-09 16:04:14 PST
Created attachment 415805 [details]
Patch
Comment 2 Chris Dumez 2020-12-09 16:05:41 PST
Created attachment 415806 [details]
Patch
Comment 3 Geoffrey Garen 2020-12-09 16:17:27 PST
GTK failure:

../../Source/WebKit/WebProcess/GPU/media/RemoteAudioDestinationProxy.cpp:35:10: fatal error: SharedRingBufferFrameBounds.h: No such file or directory
Comment 4 Chris Dumez 2020-12-09 16:17:36 PST
Created attachment 415809 [details]
Patch
Comment 5 Geoffrey Garen 2020-12-09 16:24:49 PST
> Source/WebKit/ChangeLog:14
> +        This patch gets rid of IPC 2 why storing the RingBuffer bounds in shared

why -> by

> Source/WebKit/ChangeLog:20
> +        This is still too much IPC but this is a step in the right directly. We

directly -> direction

> Source/WebKit/GPUProcess/media/RemoteAudioDestinationManager.cpp:157
> +        if (auto ringBufferBounds = m_ringBufferFrameBounds->tryGet())
> +            m_ringBuffer->setCurrentFrameBounds(ringBufferBounds->first, ringBufferBounds->second);

I'm not super familiar with this code. In the case where the opportunistic tryGet() fails, what ensures that we'll eventually tryGet() again and succeed?
Comment 6 Chris Dumez 2020-12-09 16:26:43 PST
Created attachment 415811 [details]
Patch
Comment 7 Chris Dumez 2020-12-09 16:30:14 PST
(In reply to Geoffrey Garen from comment #5)
> > Source/WebKit/ChangeLog:14
> > +        This patch gets rid of IPC 2 why storing the RingBuffer bounds in shared
> 
> why -> by
> 
> > Source/WebKit/ChangeLog:20
> > +        This is still too much IPC but this is a step in the right directly. We
> 
> directly -> direction
> 
> > Source/WebKit/GPUProcess/media/RemoteAudioDestinationManager.cpp:157
> > +        if (auto ringBufferBounds = m_ringBufferFrameBounds->tryGet())
> > +            m_ringBuffer->setCurrentFrameBounds(ringBufferBounds->first, ringBufferBounds->second);
> 
> I'm not super familiar with this code. In the case where the opportunistic
> tryGet() fails, what ensures that we'll eventually tryGet() again and
> succeed?

Are you worried about the case where we get very bad luck and every time the GPUProcess needs sample the WebProcess happens to writing the frame bounds (and thus holding the lock). I guess this is a possibility but it sounds not very likely?

I used tryGet() because:
1. It is generally frowned upon to have the audio thread waiting on a lock
2. Even though we don't have the latest bounds, the RingBuffer may already have enough samples to provide an answer.

That said, if our media people say otherwise, I can switch to a get() and actually wait for the lock.
Comment 8 Chris Dumez 2020-12-09 17:00:18 PST
(In reply to Chris Dumez from comment #7)
> (In reply to Geoffrey Garen from comment #5)
> > > Source/WebKit/ChangeLog:14
> > > +        This patch gets rid of IPC 2 why storing the RingBuffer bounds in shared
> > 
> > why -> by
> > 
> > > Source/WebKit/ChangeLog:20
> > > +        This is still too much IPC but this is a step in the right directly. We
> > 
> > directly -> direction
> > 
> > > Source/WebKit/GPUProcess/media/RemoteAudioDestinationManager.cpp:157
> > > +        if (auto ringBufferBounds = m_ringBufferFrameBounds->tryGet())
> > > +            m_ringBuffer->setCurrentFrameBounds(ringBufferBounds->first, ringBufferBounds->second);
> > 
> > I'm not super familiar with this code. In the case where the opportunistic
> > tryGet() fails, what ensures that we'll eventually tryGet() again and
> > succeed?
> 
> Are you worried about the case where we get very bad luck and every time the
> GPUProcess needs sample the WebProcess happens to writing the frame bounds
> (and thus holding the lock). I guess this is a possibility but it sounds not
> very likely?
> 
> I used tryGet() because:
> 1. It is generally frowned upon to have the audio thread waiting on a lock
> 2. Even though we don't have the latest bounds, the RingBuffer may already
> have enough samples to provide an answer.
> 
> That said, if our media people say otherwise, I can switch to a get() and
> actually wait for the lock.

Jer confirmed on Slack that we should NEVER grab a lock on the CA audio thread (try lock is ok though).
Comment 9 Peng Liu 2020-12-09 20:11:19 PST
I believe this patch works, but it may not be the best approach.

To move audio samples from a web process to the GPU process, we use a pair of CARingBuffers with shared memory (SharedRingBufferStorage) that stores audio samples. We need to set bounds information in the GPU Process (reader) because the shared memory only stores audio samples, but not the internal data structure to implement the lockless operations. That works for two CARingBuffers in one process but different threads. But when the two CARingBuffers are in different processes, we need to synchronize the internal data structures in the reader. Fortunately, only synchronizing the bounds information will make the CARingBuffer in the GPU process functional.

If we can modify the CARingBuffer implementation to store other information in the shared memory in addition to audio samples, we may be able to use the CARingBuffer in two processes as in two threads (don't need to set the bounds information in a reader).

Since audio capture also uses CARingBuffer in the same manner, Youenn and Eric may have some comments.
Comment 10 Darin Adler 2020-12-10 08:40:33 PST
Comment on attachment 415811 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=415811&action=review

> Source/WebKit/Shared/Cocoa/SharedRingBufferFrameBounds.cpp:36
> +static inline void yield()

This will almost certainly get inlined without the inline keyword. Iā€™m not sure we should use the keyword in cases like this.

> Source/WebKit/Shared/Cocoa/SharedRingBufferFrameBounds.h:38
> +    WTF_MAKE_FAST_ALLOCATED;
> +    WTF_MAKE_NONCOPYABLE(SharedRingBufferFrameBounds)

Not sure why the first has a semicolon and the second does not.
Comment 11 Chris Dumez 2020-12-10 08:47:50 PST
(In reply to Peng Liu from comment #9)
> I believe this patch works, but it may not be the best approach.
> 
> To move audio samples from a web process to the GPU process, we use a pair
> of CARingBuffers with shared memory (SharedRingBufferStorage) that stores
> audio samples. We need to set bounds information in the GPU Process (reader)
> because the shared memory only stores audio samples, but not the internal
> data structure to implement the lockless operations. That works for two
> CARingBuffers in one process but different threads. But when the two
> CARingBuffers are in different processes, we need to synchronize the
> internal data structures in the reader. Fortunately, only synchronizing the
> bounds information will make the CARingBuffer in the GPU process functional.

I agree it would be nicer if CARingBuffer did it for us. I will investigate to see how much work that would be.
Comment 12 Geoffrey Garen 2020-12-10 10:03:58 PST
> Are you worried about the case where we get very bad luck and every time the
> GPUProcess needs sample the WebProcess happens to writing the frame bounds
> (and thus holding the lock). I guess this is a possibility but it sounds not
> very likely?

Yes, I guess that's my worry. Or, at least, not *every* time, but enough times in a row that we run out of samples.

I agree that this case is unlikely, but it does seem possible.

A related worry is that we would fail the tryGet() on the very last update, possibly cutting off the end of the audio stream. (I'm not sure how much we would cut off in that case.)

How frequently will the audio thread run the tryGet() operation?

What happens if tryGet() fails on the last attempt?

> I used tryGet() because:
> 1. It is generally frowned upon to have the audio thread waiting on a lock
> 2. Even though we don't have the latest bounds, the RingBuffer may already
> have enough samples to provide an answer.
> 
> That said, if our media people say otherwise, I can switch to a get() and
> actually wait for the lock.

Definitely agree that we should not lock on the audio thread.

Another option is, instead of { isLocked, pair }, you can use { pair*, std::array<pair>, 2> }. When the writing thread wants to update the bounds, it writes the new pair value to the un-pointed-to pair and then atomically swaps the pair* (with a release memory barrier). When the reading thread wants to read the bounds, it reads using the pair* (with an acquire memory barrier). This ensures no locking, and also no pathological failure. However, it is still possible to miss the very last update if you never get another chance to check.
Comment 13 Chris Dumez 2020-12-10 10:14:53 PST
(In reply to Geoffrey Garen from comment #12)
> > Are you worried about the case where we get very bad luck and every time the
> > GPUProcess needs sample the WebProcess happens to writing the frame bounds
> > (and thus holding the lock). I guess this is a possibility but it sounds not
> > very likely?
> 
> Yes, I guess that's my worry. Or, at least, not *every* time, but enough
> times in a row that we run out of samples.
> 
> I agree that this case is unlikely, but it does seem possible.
> 
> A related worry is that we would fail the tryGet() on the very last update,
> possibly cutting off the end of the audio stream. (I'm not sure how much we
> would cut off in that case.)
> 
> How frequently will the audio thread run the tryGet() operation?

That depends on the sample rate. A usual sample rate is 44khz and render is called for 128 samples each time.
I think this means render would get called ~340 per seconds in general. Media people can confirm.

> 
> What happens if tryGet() fails on the last attempt?

I am personally not too worried about this part. If Audio is still audible by the time by stop rendering then it will always be cut off in general.
Since requests to stop come on the main thread and since audio is being rendered on the audio thread, the exact point we stop rendering is always undefined.

> 
> > I used tryGet() because:
> > 1. It is generally frowned upon to have the audio thread waiting on a lock
> > 2. Even though we don't have the latest bounds, the RingBuffer may already
> > have enough samples to provide an answer.
> > 
> > That said, if our media people say otherwise, I can switch to a get() and
> > actually wait for the lock.
> 
> Definitely agree that we should not lock on the audio thread.
> 
> Another option is, instead of { isLocked, pair }, you can use { pair*,
> std::array<pair>, 2> }. When the writing thread wants to update the bounds,
> it writes the new pair value to the un-pointed-to pair and then atomically
> swaps the pair* (with a release memory barrier). When the reading thread
> wants to read the bounds, it reads using the pair* (with an acquire memory
> barrier). This ensures no locking, and also no pathological failure.
> However, it is still possible to miss the very last update if you never get
> another chance to check.

This sounds like a good idea. I will see if I can make it work.
Comment 14 Chris Dumez 2020-12-10 13:22:09 PST
Created attachment 415921 [details]
Patch
Comment 15 Chris Dumez 2020-12-10 13:24:18 PST
(In reply to Chris Dumez from comment #14)
> Created attachment 415921 [details]
> Patch

New approach based on feedback from Peng and Geoff:
1. I updated the CARingBuffer class so that it now stores its frame bounds in its memory backing (which may be shared memory). This way, we no longer need a separate SharedRingBufferFrameBounds class to get this functionality.
2. I no longer use a lock for the frame bounds in shared memory and instead use the approach suggested by Geoff (buffer of bounds + atomic index).
Comment 16 Chris Dumez 2020-12-10 13:42:42 PST
Comment on attachment 415921 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=415921&action=review

> Source/WebKit/GPUProcess/media/RemoteAudioDestinationManager.cpp:82
> +        auto memory = SharedMemory::map(ipcHandle.handle, SharedMemory::Protection::ReadWrite);

Now that I no longer have a lock, I can keep this ReadOnly. Will update the patch accordingly.
Comment 17 Chris Dumez 2020-12-10 13:54:09 PST
Created attachment 415925 [details]
Patch
Comment 18 Chris Dumez 2020-12-10 14:01:58 PST
Created attachment 415927 [details]
Patch
Comment 19 Geoffrey Garen 2020-12-10 14:33:33 PST
Comment on attachment 415927 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=415927&action=review

r=me

> Source/WebCore/platform/audio/cocoa/CARingBuffer.cpp:332
> +    // When the RingBuffer is backed by shared memory, make sure we pull frame bounds from shared memory before fetching.
> +    updateFrameBounds();
> +
> +    uint64_t start, end;
> +    getCurrentFrameBoundsWithoutUpdate(start, end);

Should this just be getCurrentFrameBounds()? Seems like the behavior would be equivalent.

> Source/WebKit/Shared/Cocoa/SharedRingBufferStorage.h:73
> +    static constexpr unsigned boundsBufferSize { 32 };
> +    struct FrameBounds {
> +        std::pair<uint64_t, uint64_t> boundsBuffer[boundsBufferSize];
> +        Atomic<unsigned> boundsBufferIndex { 0 };
> +    };

The logic of this data structure and its related code seem right to me, but I don't fully understand why it's OK for this class and the related CARingBuffer class to assume that the reader will not fall behind by more than 32 frames. I get it that falling behind is very rare and will definitely cause audio distortion. But this logic (and the pre-existing kGeneralRingTimeBoundsQueueSize in CARingBuffer) seem to take the audio distortion bug and embiggen it into a buffer overflow bug. Seems like something we should follow up on with our audio experts. Maybe they can explain why 32 is always enough. Or maybe we need to adjust this logic in both places. (We could use a strategy similar to your previous tryLock() strategy; or we could have the *reader* update the boundsBufferIndex and have the writer keep writing to the same index until the reader is ready for the next one).
Comment 20 Chris Dumez 2020-12-10 14:37:02 PST
Created attachment 415931 [details]
Patch
Comment 21 Chris Dumez 2020-12-10 14:40:56 PST
(In reply to Geoffrey Garen from comment #19)
> Comment on attachment 415927 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=415927&action=review
> 
> r=me
> 
> > Source/WebCore/platform/audio/cocoa/CARingBuffer.cpp:332
> > +    // When the RingBuffer is backed by shared memory, make sure we pull frame bounds from shared memory before fetching.
> > +    updateFrameBounds();
> > +
> > +    uint64_t start, end;
> > +    getCurrentFrameBoundsWithoutUpdate(start, end);
> 
> Should this just be getCurrentFrameBounds()? Seems like the behavior would
> be equivalent.

Yes, I was trying to be more explicit about the update but I made the change.

> > Source/WebKit/Shared/Cocoa/SharedRingBufferStorage.h:73
> > +    static constexpr unsigned boundsBufferSize { 32 };
> > +    struct FrameBounds {
> > +        std::pair<uint64_t, uint64_t> boundsBuffer[boundsBufferSize];
> > +        Atomic<unsigned> boundsBufferIndex { 0 };
> > +    };
> 
> The logic of this data structure and its related code seem right to me, but
> I don't fully understand why it's OK for this class and the related
> CARingBuffer class to assume that the reader will not fall behind by more
> than 32 frames. I get it that falling behind is very rare and will
> definitely cause audio distortion. But this logic (and the pre-existing
> kGeneralRingTimeBoundsQueueSize in CARingBuffer) seem to take the audio
> distortion bug and embiggen it into a buffer overflow bug. Seems like
> something we should follow up on with our audio experts. Maybe they can
> explain why 32 is always enough. Or maybe we need to adjust this logic in
> both places. (We could use a strategy similar to your previous tryLock()
> strategy; or we could have the *reader* update the boundsBufferIndex and
> have the writer keep writing to the same index until the reader is ready for
> the next one).

Yes, I am discussing this with Jer. It seem CARingBuffer::frameOffset() always make sure we read inside the RingBuffer by doing a modulo BufferSize.
Comment 22 EWS 2020-12-10 18:15:08 PST
Committed r270663: <https://trac.webkit.org/changeset/270663>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 415931 [details].
Comment 23 Radar WebKit Bug Importer 2020-12-10 18:16:20 PST
<rdar://problem/72206083>