Bug 219818

Summary: [GPUProcess] Avoid doing an IPC per rendering quantum when using WebAudio
Product: WebKit
Reporter: Chris Dumez <cdumez>
Component: Web Audio
Assignee: Chris Dumez <cdumez>
Status: RESOLVED FIXED
Severity: Normal
CC: annulen, cdumez, eric.carlson, ews-watchlist, ggaren, glenn, gyuyoung.kim, jer.noble, peng.liu6, philipj, ryuan.choi, sergio, webkit-bug-importer
Priority: P2
Keywords: InRadar
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
See Also: https://bugs.webkit.org/show_bug.cgi?id=219964
Bug Depends on: 219809, 219867, 219873
Bug Blocks:
Attachments (flags in parentheses):
WIP Patch (ews-feeder: commit-queue-)
WIP Patch (none)
WIP Patch (none)
Patch (ews-feeder: commit-queue-)
Patch (none)
Patch (none)
Patch (none)
Patch (none)
Patch (none)
Patch (ews-feeder: commit-queue-)

Description Chris Dumez 2020-12-11 17:23:07 PST
Avoid doing an IPC per rendering quantum when using WebAudio.
Comment 1 Chris Dumez 2020-12-11 17:32:45 PST
Created attachment 416076 [details]
WIP Patch
Comment 2 Geoffrey Garen 2020-12-14 11:42:49 PST
Note: We probably need to make sure to avoid timer coalescing, since coalescing happens at millisecond granularity and audio happens at sub-millisecond granularity.

I don't think ThreadType::Audio affects priority or timer coalescing.

Maybe setCurrentThreadIsUserInteractive() would get the job done?
Comment 3 Chris Dumez 2020-12-14 14:22:30 PST
(In reply to Geoffrey Garen from comment #2)
> Note: We probably need to make sure to avoid timer coalescing, since
> coalescing happens at millisecond granularity and audio happens at
> sub-millisecond granularity.
> 
> I don't think ThreadType::Audio affects priority or timer coalescing.
> 
> Maybe setCurrentThreadIsUserInteractive() would get the job done?

Yes, based on [1], it seems important to use UserInteractive QoS for this real-time audio thread. Addressing this via Bug 219873.

[1] https://developer.apple.com/library/archive/documentation/Performance/Conceptual/power_efficiency_guidelines_osx/PrioritizeWorkAtTheTaskLevel.html#//apple_ref/doc/uid/TP40013929-CH35-SW1
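For reference, a minimal sketch of what opting the current thread into the UserInteractive QoS class looks like with the underlying Darwin API; this is presumably the primitive that WTF's setCurrentThreadIsUserInteractive() builds on (WebKit plumbing omitted, function name illustrative):

    // Hedged sketch: promote the current thread to UserInteractive QoS so the
    // scheduler treats its work as latency-sensitive.
    #include <pthread/qos.h>

    void promoteCurrentThreadForRealTimeAudio()
    {
        // The second argument is a relative priority offset within the class.
        pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0);
    }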
Comment 4 Chris Dumez 2020-12-15 09:46:21 PST
Created attachment 416257 [details]
WIP Patch
Comment 5 Chris Dumez 2020-12-15 09:47:43 PST
Created attachment 416258 [details]
WIP Patch
Comment 6 Chris Dumez 2020-12-15 10:52:31 PST
Created attachment 416264 [details]
Patch
Comment 7 Chris Dumez 2020-12-15 11:22:57 PST
Created attachment 416268 [details]
Patch
Comment 8 Chris Dumez 2020-12-15 12:29:44 PST
Comment on attachment 416268 [details]
Patch

Seems like the Mac-debug-ews just needs a clean build.
Comment 9 Chris Dumez 2020-12-15 13:25:03 PST
Created attachment 416286 [details]
Patch
Comment 10 Peng Liu 2020-12-15 14:06:14 PST
Comment on attachment 416286 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=416286&action=review

> Source/WebKit/ChangeLog:12
> +        To address the issue, we now start a rendering timer on the WebProcess size when rendering

s/size/side/

> Source/WebKit/ChangeLog:15
> +        it shares with the GPUProcess.

Looks like this patch changes the "pull-based" audio rendering approach to a "push-based" one.
Comment 11 Geoffrey Garen 2020-12-15 14:13:19 PST
Comment on attachment 416286 [details]
Patch

r=me

Do we need to consider the possibility that timer drift might eventually get WebContent and GPU out of sync enough to underflow or overflow the CARingBuffer?
Comment 12 Chris Dumez 2020-12-15 14:17:21 PST
(In reply to Geoffrey Garen from comment #11)
> Comment on attachment 416286 [details]
> Patch
> 
> r=me
> 
> Do we need to consider the possibility that timer drift might eventually get
> WebContent and GPU out of sync enough to underflow or overflow the
> CARingBuffer?

I don't see how this is an issue currently. The boundaries of the RingBuffer are sync'd cross process so the GPU Process knows if there is nothing to pull from the buffer.
Comment 13 Chris Dumez 2020-12-15 14:30:45 PST
(In reply to Chris Dumez from comment #12)
> (In reply to Geoffrey Garen from comment #11)
> > Comment on attachment 416286 [details]
> > Patch
> > 
> > r=me
> > 
> > Do we need to consider the possibility that timer drift might eventually get
> > WebContent and GPU out of sync enough to underflow or overflow the
> > CARingBuffer?
> 
> I don't see how this is an issue currently. The boundaries of the RingBuffer
> are sync'd cross process so the GPU Process knows if there is nothing to
> pull from the buffer.

Maybe you are referring to the fact that if the timer in the WebProcess runs slightly faster than the speed the GPUProcess is reading, then eventually the WebProcess will start overwriting frames in the RingBuffer that the GPUProcess has not read yet.

I do believe this is theoretically possible with the current design. I am not sure if we need to address this and how to best achieve it. Currently, the WebProcess has no idea where the GPU Process is at in its reading... I guess this would require some kind of synchronization, like the GPUProcess telling the WebProcess it is going too fast and asking it to stop for a while so it can catch up..
Comment 14 Chris Dumez 2020-12-15 14:34:57 PST
Created attachment 416295 [details]
Patch
Comment 15 Chris Dumez 2020-12-15 14:38:54 PST
(In reply to Chris Dumez from comment #13)
> (In reply to Chris Dumez from comment #12)
> > (In reply to Geoffrey Garen from comment #11)
> > > Comment on attachment 416286 [details]
> > > Patch
> > > 
> > > r=me
> > > 
> > > Do we need to consider the possibility that timer drift might eventually get
> > > WebContent and GPU out of sync enough to underflow or overflow the
> > > CARingBuffer?
> > 
> > I don't see how this is an issue currently. The boundaries of the RingBuffer
> > are sync'd cross process so the GPU Process knows if there is nothing to
> > pull from the buffer.
> 
> Maybe you are referring to the fact that if the timer in the WebProcess runs
> slightly faster than the speed the GPUProcess is reading, then eventually
> the WebProcess will start overwriting frames in the RingBuffer that the
> GPUProcess has not read yet.
> 
> I do believe this is theoretically possible with the current design. I am
> not sure if we need to address this and how to best achieve it. Currently,
> the WebProcess has no idea where the GPU Process is at in its reading... I
> guess this would require some kind of synchronization, like the GPUProcess
> telling the WebProcess it is going too fast and asking it to stop for a
> while so it can catch up..

Thinking about this more, I think one way we could achieve this without extra IPC would be to save the last "readFrame" in the shared memory (which would require giving a read-write shared memory handle to the GPUProcess). Then the WebProcess could check this readFrame before writing and skip processing if it would overwrite the readFrame.
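A rough sketch of that write-side check; the names (SharedRingState, canWriteQuantum) are hypothetical, not actual WebKit API:

    #include <atomic>
    #include <cstdint>

    struct SharedRingState {
        std::atomic<uint64_t> lastReadFrame { 0 }; // updated by the GPUProcess reader
    };

    // Called on the WebProcess rendering timer before writing one quantum.
    bool canWriteQuantum(const SharedRingState& state, uint64_t nextWriteFrame,
        uint64_t quantumSize, uint64_t ringCapacityInFrames)
    {
        uint64_t readFrame = state.lastReadFrame.load(std::memory_order_acquire);
        // If writing this quantum would lap the reader and clobber unread
        // frames, skip this timer fire and let the GPUProcess catch up.
        return nextWriteFrame + quantumSize - readFrame <= ringCapacityInFrames;
    }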
Comment 16 Chris Dumez 2020-12-15 14:40:12 PST
(In reply to Chris Dumez from comment #15)
> (In reply to Chris Dumez from comment #13)
> > (In reply to Chris Dumez from comment #12)
> > > (In reply to Geoffrey Garen from comment #11)
> > > > Comment on attachment 416286 [details]
> > > > Patch
> > > > 
> > > > r=me
> > > > 
> > > > Do we need to consider the possibility that timer drift might eventually get
> > > > WebContent and GPU out of sync enough to underflow or overflow the
> > > > CARingBuffer?
> > > 
> > > I don't see how this is an issue currently. The boundaries of the RingBuffer
> > > are sync'd cross process so the GPU Process knows if there is nothing to
> > > pull from the buffer.
> > 
> > Maybe you are referring to the fact that if the timer in the WebProcess runs
> > slightly faster than the speed the GPUProcess is reading, then eventually
> > the WebProcess will start overwriting frames in the RingBuffer that the
> > GPUProcess has not read yet.
> > 
> > I do believe this is theoretically possible with the current design. I am
> > not sure if we need to address this and how to best achieve it. Currently,
> > the WebProcess has no idea where the GPU Process is at in its reading... I
> > guess this would require some kind of synchronization, like the GPUProcess
> > telling the WebProcess it is going too fast and asking it to stop for a
> > while so it can catch up..
> 
> Thinking about this more, I think one way we could achieve this without extra
> IPC would be to save the last "readFrame" in the shared memory (which would
> require giving a read-write shared memory handle to the GPUProcess). Then the
> WebProcess could check this readFrame before writing and skip processing if
> it would overwrite the readFrame.

Jer/Eric. What do you think? Should I do this? In this patch or in a follow-up?
Comment 17 Geoffrey Garen 2020-12-15 14:53:34 PST
I think my concern was not warranted:

<<<

<https://developer.apple.com/documentation/foundation/nstimer?language=objc>

A repeating timer always schedules itself based on the scheduled firing time, as opposed to the actual firing time. For example, if a timer is scheduled to fire at a particular time and every 5 seconds after that, the scheduled firing time will always fall on the original 5-second time intervals, even if the actual firing time gets delayed. If the firing time is delayed so far that it passes one or more of the scheduled firing times, the timer is fired only once for that time period; the timer is then rescheduled, after firing, for the next scheduled firing time in the future.

>>>

So, a timer never drifts / accumulates error; it may miss a firing deadline (and drop an audio frame), but it will always re-schedule for the next appropriate firing deadline.
Comment 18 Chris Dumez 2020-12-15 15:11:57 PST
(In reply to Geoffrey Garen from comment #17)
> I think my concern was not warranted:
> 
> <<<
> 
> <https://developer.apple.com/documentation/foundation/nstimer?language=objc>
> 
> A repeating timer always schedules itself based on the scheduled firing
> time, as opposed to the actual firing time. For example, if a timer is
> scheduled to fire at a particular time and every 5 seconds after that, the
> scheduled firing time will always fall on the original 5-second time
> intervals, even if the actual firing time gets delayed. If the firing time
> is delayed so far that it passes one or more of the scheduled firing times,
> the timer is fired only once for that time period; the timer is then
> rescheduled, after firing, for the next scheduled firing time in the future.
> 
> >>>
> 
> So, a timer never drifts / accumulates error; it may miss a firing deadline
> (and drop an audio frame), but it will always re-schedule for the next
> appropriate firing deadline.

Jer said on Slack he is worried about the timer drifting in either direction compared to how fast CoreAudio is reading data. He thinks we do need an extra mechanism to ensure the WebProcess stays a quantum ahead of the GPUProcess.
Comment 19 Geoffrey Garen 2020-12-15 15:48:44 PST
I do think it would be nice to give ourselves a few quanta of write-ahead. Because WebContent might get scheduler-unlucky and drop a frame. (It seems like the current approach provides 128 samples at a time, but no further write-ahead.) Maybe we should fill half the ring buffer at a time (reserving the other half for the GPU Process getting scheduler-unlucky).

But to be clear, I think that _drift_ is not possible -- where drift means accumulating small timer fire imprecisions until they add up to a big imprecision. The timer implementation takes care of this. If you schedule a repeating 1ms timer, and it fires after 1.1ms, the next fire will schedule for 0.9ms automatically.
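To illustrate, a minimal sketch of such drift-free rescheduling (illustrative only; not WebKit's actual timer implementation):

    // The next deadline is derived from the previous *scheduled* time, not the
    // actual fire time, so per-fire imprecision never accumulates.
    #include <chrono>
    #include <thread>

    void runFixedCadence(std::chrono::microseconds period, int fireCount)
    {
        auto nextDeadline = std::chrono::steady_clock::now() + period;
        for (int i = 0; i < fireCount; ++i) {
            std::this_thread::sleep_until(nextDeadline);
            // renderOneQuantum() would go here; even if this fire is 0.1ms
            // late, the next deadline stays anchored to the original schedule.
            nextDeadline += period;
        }
    }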
Comment 20 Chris Dumez 2020-12-15 16:41:46 PST
Created attachment 416302 [details]
Patch
Comment 21 Jer Noble 2020-12-15 21:55:17 PST
Comment on attachment 416302 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=416302&action=review

> Source/WebKit/WebProcess/GPU/media/RemoteAudioDestinationProxy.cpp:50
> +constexpr uint64_t framesAheadOfReader = 4 * WebCore::AudioUtilities::renderQuantumSize;

At a 48 kHz sample rate, and a 128-frame render quantum, this adds 10ms of latency to WebAudio rendering, which is 2/3 of a full rAF cycle at 60 Hz, or on the ragged edge of being audibly out-of-sync. Is 2 quanta ahead (and 6ms of latency) enough here?
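For concreteness, the arithmetic behind these figures, assuming the 48 kHz sample rate and 128-frame quantum named above:

    // Write-ahead latency contributed by N quanta of buffering.
    constexpr double sampleRate = 48000;   // Hz (assumed)
    constexpr double quantumSize = 128;    // frames per render quantum
    constexpr double msPerQuantum = quantumSize / sampleRate * 1000.0; // ~2.67 ms
    constexpr double latencyAt4Quanta = 4 * msPerQuantum; // ~10.7 ms, the "10ms" above
    constexpr double latencyAt2Quanta = 2 * msPerQuantum; // ~5.3 ms, roughly the "6ms" above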
Comment 22 Chris Dumez 2020-12-16 07:31:48 PST
(In reply to Jer Noble from comment #21)
> Comment on attachment 416302 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=416302&action=review
> 
> > Source/WebKit/WebProcess/GPU/media/RemoteAudioDestinationProxy.cpp:50
> > +constexpr uint64_t framesAheadOfReader = 4 * WebCore::AudioUtilities::renderQuantumSize;
> 
> At a 48 kHz sample rate, and a 128-frame render quantum, this adds 10ms of
> latency to WebAudio rendering, which is 2/3 of a full rAF cycle at 60 Hz, or
> on the ragged edge of being audibly out-of-sync. Is 2 quanta ahead (and
> 6ms of latency) enough here?

Probably, I will do some testing. I did not realize that buffering more was bad, so I chose a fairly large value arbitrarily.
Comment 23 Chris Dumez 2020-12-16 08:43:16 PST
Hmmm.. If I play for long enough (may take a while), I eventually end up in a state where we miss a lot of frames.

For a while, everything looks perfect, the WebProcess does not need to skip timer fires and almost never needs to do more than one rendering quantum per timer fire. However, after a while, the WebProcess keeps doing extra rendering quanta per timer fire (2 or 3) and despite that, the GPUProcess starts frequently missing frames. I am trying to figure out why this happens.
Comment 24 Chris Dumez 2020-12-16 09:30:21 PST
(In reply to Chris Dumez from comment #23)
> Hmmm.. If I play for long enough (may take a while), I eventually end up in
> a state where we miss a lot of frames.
> 
> For a while, everything looks perfect, the WebProcess does not need to skip
> timer fires and almost never needs to do more than one rendering quantum per
> timer fire. However, after a while, the WebProcess keeps doing extra
> rendering quanta per timer fire (2 or 3) and despite that, the GPUProcess
> starts frequently missing frames. I am trying to figure out why this happens.

Oh never mind, it happens without my patch as well on this particular demo site :/ Looks like we may have an issue where audio rendering becomes more and more expensive to compute (Profiler says most time is spent in the animation timeline). After a while, CPU usage reaches 100% and we start to miss frames. I will investigate this separately.

In the meantime, I will find a WebAudio demo which does not involve an animation timeline to validate this patch.
Comment 25 Geoffrey Garen 2020-12-16 09:41:52 PST
> But to be clear, I think that _drift_ is not possible -- where drift means
> accumulating small timer fire imprecisions until they add up to a big
> imprecision.

Jer mentioned on Slack that drift may still be possible because the CPU and the audio hardware have distinct hardware clocks that are not synchronized.
Comment 26 Chris Dumez 2020-12-16 09:53:44 PST
Created attachment 416343 [details]
Patch
Comment 27 Chris Dumez 2020-12-16 09:55:16 PST
(In reply to Chris Dumez from comment #24)
> (In reply to Chris Dumez from comment #23)
> > Hmmm.. If I play for long enough (may take a while), I eventually end up in
> > a state where we miss a lot of frames.
> > 
> > For a while, everything looks perfect, the WebProcess does not need to skip
> > timer fires and almost never needs to do more than one rendering quantum per
> > timer fire. However, after a while, the WebProcess keeps doing extra
> > rendering quantums per timer fire (2 or 3) and despite that, the GPUProcess
> > starts frequently missing frames. I am trying to figure out why this happens.
> 
> Oh never mind, it happens without my patch as well on this particular demo
> site :/ Looks like we may have an issue where audio rendering becomes more
> and more expensive to compute (Profiler says most time is spent in the
> animation timeline). After a while, CPU usage reaches 100% and we start to
> miss frames. I will investigate this separately.
> 
> In the meantime, I will find a WebAudio demo which does not involve an
> animation timeline to validate this patch.

Ok, I validated on another demo which does not use an animation timeline. No skipped frames even after playing for a long time, keeping the writer only 2 rendering quanta ahead as Jer suggested.
Comment 28 Geoffrey Garen 2020-12-16 09:57:34 PST
Here's an idea: Let's try to get the lastReadFrame patch land-able, rendering only 2 quanta in advance. That's 5.8ms in advance, which is below the 10ms threshold at which most humans perceive audio to be out of sync with graphics and user action. So, a probably shippable result.

Then, let's try to change from a timer to a cross-process semaphore. If the GPU Process can wake the audio generation thread using a semaphore, then we can run in sync with the audio hardware clock, and we can probably drop latency back down to one quantum (2.9ms).

An additional benefit of doing the semaphore patch is that we probably need it for our graphics work in GPU Process: https://bugs.webkit.org/show_bug.cgi?id=219586. But it's a little tricky to send a semaphore across processes. So we need to figure out how to do it.

(Darwin also supports the "eventlink" API, which is a special semaphore that reduces scheduler cost when doing sync audio IPC. That could bring latency down even further. But I think there's more value right now in figuring out the semaphore solution, so we can reuse it for graphics.)
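A hedged sketch of the Darwin primitives such a patch could build on; a Mach semaphore is backed by a Mach port right, which is what makes sending it to another process possible (the IPC plumbing and error handling are omitted, and the function names are illustrative):

    #include <mach/mach.h>

    semaphore_t createRenderSemaphore()
    {
        semaphore_t semaphore = MACH_PORT_NULL; // semaphore_t is a Mach port
        semaphore_create(mach_task_self(), &semaphore, SYNC_POLICY_FIFO, 0);
        return semaphore;
    }

    // GPUProcess side: request one more quantum from the WebProcess.
    void signalRender(semaphore_t semaphore) { semaphore_signal(semaphore); }

    // WebProcess side: block the rendering thread until a quantum is requested.
    void waitForRenderRequest(semaphore_t semaphore) { semaphore_wait(semaphore); }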
Comment 29 Chris Dumez 2020-12-16 10:04:49 PST
(In reply to Geoffrey Garen from comment #28)
> Here's an idea: Let's try to get the lastReadFrame patch land-able,
> rendering only 2 quanta in advance. That's 5.8ms in advance, which is below
> the 10ms threshold at which most humans perceive audio to be out of sync
> with graphics and user action. So, a probably shippable result.

I believe this last patch iteration works and uses 2 quanta.

> 
> Then, let's try to change from a timer to a cross process semaphore. If the
> GPU Process can wake the audio generation thread using a semaphore, then we
> can run in sync with the audio hardware clock, and we can probably drop
> latency back down to one quantum (2.9ms).
> 
> An additional benefit of doing the semaphore patch is that we probably need
> it for our graphics work in GPU Process:
> https://bugs.webkit.org/show_bug.cgi?id=219586. But it's a little tricky to
> send a semaphore across processes. So we need to figure out how to do it.
> 
> (Darwin also supports the "eventlink" API, which is a special semaphore that
> reduces scheduler cost when doing sync audio IPC. That could bring latency
> down even further. But I think there's more value right now in figuring out
> the semaphore solution, so we can reuse it for graphics.)

Using a cross-process semaphore was actually my initial proposal, but the timer seemed simpler for the following reasons:
1. We don't yet have such a cross-process semaphore utility in WebKit, although I am confident I could add one.
2. Presumably the idea would be to have the Audio thread on the WebContent process side blocked on the semaphore and the GPUProcess's render() signal the semaphore. What happens if render() signals the semaphore *WHILE* the WebProcess is rendering a quantum? We have to somehow make sure the WebProcess processes another quantum right after, without blocking on the semaphore.
3. When an AudioWorklet is active, the thread is not 100% dedicated to audio rendering and I don't think we can truly block the thread on a semaphore. The WorkletThread needs to process its run loop and do things like resolve JS promises.
Comment 30 Geoffrey Garen 2020-12-16 11:24:22 PST
> 1. We don't yet have such a cross-process semaphore utility in WebKit,
> although I am confident I could add one.

Right.

> 2. Presumably the idea would be to have the Audio thread on the WebContent
> process side blocked on the semaphore and the GPUProcess's render() signal
> the semaphore. What happens if render() signals the semaphore *WHILE* the
> WebProcess is rendering a quantum? We have to somehow make sure the
> WebProcess processes another quantum right after, without blocking on the
> semaphore.

FYI, that's automatic. Mach and BSD semaphores count. So, if you semaphore_wait, semaphore_signal, and then semaphore_signal again, the next call to semaphore_wait will return right away.

> 3. When an AudioWorklet is active, the thread is not 100% dedicated to audio
> rendering and I don't think we can truly block the thread on a semaphore.
> The WorkletThread needs to process its run loop and do things like resolve
> JS promises.

Good point. Resolving promises seems manageable. But if worklets can schedule timers or postMessage or similar, then we've got issues.
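A tiny illustration of the counting behavior Geoff describes above (a sketch, not WebKit code):

    #include <mach/mach.h>

    void renderLoop(semaphore_t renderSemaphore)
    {
        for (;;) {
            // Blocks only while the count is zero. A signal sent while we were
            // busy rendering just increments the count, so the next wait
            // returns immediately and no render request is ever lost.
            semaphore_wait(renderSemaphore);
            // renderOneQuantum();
        }
    }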
Comment 31 Chris Dumez 2020-12-16 11:29:59 PST
(In reply to Geoffrey Garen from comment #30)
> > 1. We don't yet have such a cross-process semaphore utility in WebKit,
> > although I am confident I could add one.
> 
> Right.
> 
> > 2. Presumably the idea would be to have the Audio thread on the WebContent
> > process side blocked on the semaphore and the GPUProcess's render() signal
> > the semaphore. What happens if render() signals the semaphore *WHILE* the
> > WebProcess is rendering a quantum? We have to somehow make sure the
> > WebProcess processes another quantum right after, without blocking on the
> > semaphore.
> 
> FYI, that's automatic. Mach and BSD semaphores count. So, if you
> semaphore_wait, semaphore_signal, and then semaphore_signal again, the next
> call to semaphore_wait will return right away.
> 
> > 3. When an AudioWorklet is active, the thread is not 100% dedicated to audio
> > rendering and I don't think we can truly block the thread on a semaphore.
> > The WorkletThread needs to process its run loop and do things like resolve
> > JS promises.
> 
> Good point. Resolving promises seems manageable. But if worklets can
> schedule timers or postMessage or similar, then we've got issues.

Worklets cannot schedule timers (worklets have a very limited API available). They do, however, have a message port via which they can postMessage() or receive messages posted by the main thread's JS.
Comment 32 Geoffrey Garen 2020-12-16 11:43:10 PST
Comment on attachment 416343 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=416343&action=review

r=me

Seems OK to land as-is, but I think there might be room for improvement.

> Source/WebCore/platform/audio/cocoa/CARingBuffer.cpp:182
> +    for (unsigned i = 0; i < kGeneralRingTimeBoundsQueueSize; ++i)
> +        m_lastReadFrameBuffer[i] = 0;

If you use std::array, you can just call fill() here instead.

> Source/WebCore/platform/audio/cocoa/CARingBuffer.h:64
> +    static constexpr uint32_t kGeneralRingTimeBoundsQueueSize = 32;

Should this be 2 now (1 current reader frame + 1 write-ahead frame = 5.8ms maximum total audio in the queue)?

> Source/WebCore/platform/audio/cocoa/CARingBuffer.h:94
> +    uint64_t m_lastReadFrameBuffer[kGeneralRingTimeBoundsQueueSize] = { 0 };

I have a slight preference for std::array. Because array-to-pointer type masquerading breaks my brain.

> Source/WebCore/platform/audio/cocoa/CARingBuffer.h:95
> +    uint64_t m_lastReadFrameBuffer[kGeneralRingTimeBoundsQueueSize] = { 0 };
> +    std::atomic<unsigned> m_lastReadFrameBufferIndex { 0 };

It seems like you're writing the "last read frame" as a stream of values in a ring buffer (m_lastReadFrameBuffer).

Is that necessary?

I think that only the GPU Process updates this value, and I think that the WebContent Process is only interested in the most up-to-date value. If so, we can use a single std::atomic<uint64_t> instead.
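A sketch of the shape these two review suggestions point to; the constant and buffer names come from the patch context, everything else is illustrative:

    #include <array>
    #include <atomic>
    #include <cstdint>

    struct LastReadFrameSketch {
        static constexpr uint32_t kGeneralRingTimeBoundsQueueSize = 32;

        // std::array version of the patch's buffer: fill() replaces the loop.
        std::array<uint64_t, kGeneralRingTimeBoundsQueueSize> m_lastReadFrameBuffer { };

        void clear() { m_lastReadFrameBuffer.fill(0); }

        // Simpler still, per the final note: if only the GPUProcess writes the
        // value and the WebContent process only wants the freshest one, a
        // single atomic can replace the whole ring of values.
        std::atomic<uint64_t> m_lastReadFrame { 0 };
    };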
Comment 33 Chris Dumez 2020-12-16 12:29:20 PST
Created attachment 416356 [details]
Patch
Comment 34 EWS 2020-12-16 13:08:25 PST
Committed r270907: <https://trac.webkit.org/changeset/270907>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 416356 [details].
Comment 35 Radar WebKit Bug Importer 2020-12-16 13:09:21 PST
<rdar://problem/72396623>
Comment 36 Peng Liu 2020-12-16 13:52:32 PST
*** Bug 217969 has been marked as a duplicate of this bug. ***