Bug 219818 (RESOLVED FIXED)
[GPUProcess] Avoid doing an IPC per rendering quantum when using WebAudio
https://bugs.webkit.org/show_bug.cgi?id=219818
Chris Dumez
Reported 2020-12-11 17:23:07 PST
Avoid doing an IPC per rendering quantum when using WebAudio.
Attachments
WIP Patch (21.20 KB, patch)
2020-12-11 17:32 PST, Chris Dumez
ews-feeder: commit-queue-
WIP Patch (21.18 KB, patch)
2020-12-15 09:46 PST, Chris Dumez
no flags
WIP Patch (20.67 KB, patch)
2020-12-15 09:47 PST, Chris Dumez
no flags
Patch (25.15 KB, patch)
2020-12-15 10:52 PST, Chris Dumez
ews-feeder: commit-queue-
Patch (24.91 KB, patch)
2020-12-15 11:22 PST, Chris Dumez
no flags
Patch (25.05 KB, patch)
2020-12-15 13:25 PST, Chris Dumez
no flags
Patch (25.05 KB, patch)
2020-12-15 14:34 PST, Chris Dumez
no flags
Patch (36.68 KB, patch)
2020-12-15 16:41 PST, Chris Dumez
no flags
Patch (37.08 KB, patch)
2020-12-16 09:53 PST, Chris Dumez
no flags
Patch (34.94 KB, patch)
2020-12-16 12:29 PST, Chris Dumez
ews-feeder: commit-queue-
Chris Dumez
Comment 1 2020-12-11 17:32:45 PST
Created attachment 416076 [details] WIP Patch
Geoffrey Garen
Comment 2 2020-12-14 11:42:49 PST
Note: We probably need to make sure to avoid timer coalescing, since coalescing happens at millisecond granularity and audio happens at sub-millisecond granularity.

I don't think ThreadType::Audio affects priority or timer coalescing.

Maybe setCurrentThreadIsUserInteractive() would get the job done?
Chris Dumez
Comment 3 2020-12-14 14:22:30 PST
(In reply to Geoffrey Garen from comment #2)
> Note: We probably need to make sure to avoid timer coalescing, since
> coalescing happens at millisecond granularity and audio happens at
> sub-millisecond granularity.
>
> I don't think ThreadType::Audio affects priority or timer coalescing.
>
> Maybe setCurrentThreadIsUserInteractive() would get the job done?

Yes, based on [1], it seems important to use UserInteractive QoS for this real-time audio thread. Addressing this via Bug 219873.

[1] https://developer.apple.com/library/archive/documentation/Performance/Conceptual/power_efficiency_guidelines_osx/PrioritizeWorkAtTheTaskLevel.html#//apple_ref/doc/uid/TP40013929-CH35-SW1
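For reference, a minimal sketch of what opting a thread into the UserInteractive QoS class looks like on Darwin. This assumes setCurrentThreadIsUserInteractive() ultimately wraps the pthread QoS API; the sketch is illustrative, not WebKit's actual implementation:

    #include <pthread/qos.h>

    void markCurrentThreadUserInteractive()
    {
        // Opt the calling thread into the UserInteractive QoS class so the
        // scheduler treats it as latency-sensitive; per the discussion above,
        // this should also avoid aggressive timer coalescing for this thread.
        pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0 /* relative priority */);
    }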
Chris Dumez
Comment 4 2020-12-15 09:46:21 PST
Created attachment 416257 [details] WIP Patch
Chris Dumez
Comment 5 2020-12-15 09:47:43 PST
Created attachment 416258 [details] WIP Patch
Chris Dumez
Comment 6 2020-12-15 10:52:31 PST
Chris Dumez
Comment 7 2020-12-15 11:22:57 PST
Chris Dumez
Comment 8 2020-12-15 12:29:44 PST
Comment on attachment 416268 [details]
Patch

Seems like the Mac-debug-ews just needs a clean build.
Chris Dumez
Comment 9 2020-12-15 13:25:03 PST
Peng Liu
Comment 10 2020-12-15 14:06:14 PST
Comment on attachment 416286 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=416286&action=review

> Source/WebKit/ChangeLog:12
> +    To address the issue, we now start a rendering timer on the WebProcess size when rendering

s/size/side/

> Source/WebKit/ChangeLog:15
> +    it shares with the GPUProcess.

Looks like this patch changes the "pull-based" audio rendering approach to a "push-based" one.
Geoffrey Garen
Comment 11 2020-12-15 14:13:19 PST
Comment on attachment 416286 [details]
Patch

r=me

Do we need to consider the possibility that timer drift might eventually get WebContent and GPU out of sync enough to underflow or overflow the CARingBuffer?
Chris Dumez
Comment 12 2020-12-15 14:17:21 PST
(In reply to Geoffrey Garen from comment #11)
> Do we need to consider the possibility that timer drift might eventually get
> WebContent and GPU out of sync enough to underflow or overflow the
> CARingBuffer?

I don't see how this is an issue currently. The boundaries of the RingBuffer are sync'd cross process so the GPU Process knows if there is nothing to pull from the buffer.
Chris Dumez
Comment 13 2020-12-15 14:30:45 PST
(In reply to Chris Dumez from comment #12)
> I don't see how this is an issue currently. The boundaries of the RingBuffer
> are sync'd cross process so the GPU Process knows if there is nothing to
> pull from the buffer.

Maybe you are referring to the fact that if the timer in the WebProcess runs slightly faster than the speed at which the GPUProcess is reading, then eventually the WebProcess will start overwriting frames in the RingBuffer that the GPUProcess has not read yet.

I do believe this is theoretically possible with the current design. I am not sure if we need to address this and how best to achieve it. Currently, the WebProcess has no idea where the GPU Process is at in its reading... I guess this would require some kind of synchronization, like the GPUProcess telling the WebProcess it is going too fast and asking it to stop for a while so it can catch up.
Chris Dumez
Comment 14 2020-12-15 14:34:57 PST
Chris Dumez
Comment 15 2020-12-15 14:38:54 PST
(In reply to Chris Dumez from comment #13)
> I guess this would require some kind of synchronization, like the GPUProcess
> telling the WebProcess it is going too fast and asking it to stop for a
> while so it can catch up.

Thinking about this more, I think one way we could achieve this without extra IPC would be to save the last "readFrame" in the shared memory (which would require giving a read-write shared memory handle to the GPUProcess). Then the WebProcess could check this readFrame before writing, and skip processing if it would overwrite the readFrame.
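A minimal sketch of the shared "readFrame" idea under discussion, with hypothetical names (the real CARingBuffer layout differs; this only illustrates the writer-side check):

    #include <atomic>
    #include <cstdint>

    // Lives in memory shared by the WebProcess (writer) and GPUProcess (reader).
    struct SharedRingState {
        std::atomic<uint64_t> lastReadFrame { 0 }; // GPUProcess updates this after each pull.
    };

    // WebProcess side: skip rendering if writing one more quantum would
    // overwrite frames the GPUProcess has not read yet.
    bool canRenderQuantum(const SharedRingState& state, uint64_t writeFrame, uint64_t capacityInFrames, uint64_t quantumSize)
    {
        uint64_t readFrame = state.lastReadFrame.load(std::memory_order_acquire);
        return writeFrame + quantumSize <= readFrame + capacityInFrames;
    }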
Chris Dumez
Comment 16 2020-12-15 14:40:12 PST
(In reply to Chris Dumez from comment #15)
> Thinking about this more, I think one way we could achieve this without
> extra IPC would be to save the last "readFrame" in the shared memory.

Jer/Eric, what do you think? Should I do this? In this patch or in a follow-up?
Geoffrey Garen
Comment 17 2020-12-15 14:53:34 PST
I think my concern was not warranted:

<<<
<https://developer.apple.com/documentation/foundation/nstimer?language=objc>

A repeating timer always schedules itself based on the scheduled firing time, as opposed to the actual firing time. For example, if a timer is scheduled to fire at a particular time and every 5 seconds after that, the scheduled firing time will always fall on the original 5-second time intervals, even if the actual firing time gets delayed. If the firing time is delayed so far that it passes one or more of the scheduled firing times, the timer is fired only once for that time period; the timer is then rescheduled, after firing, for the next scheduled firing time in the future.
>>>

So, a timer never drifts / accumulates error; it may miss a firing deadline (and drop an audio frame), but it will always re-schedule for the next appropriate firing deadline.
Chris Dumez
Comment 18 2020-12-15 15:11:57 PST
(In reply to Geoffrey Garen from comment #17)
> So, a timer never drifts / accumulates error; it may miss a firing deadline
> (and drop an audio frame), but it will always re-schedule for the next
> appropriate firing deadline.

Jer said on Slack that he is worried about timer drift in either direction compared to how fast CoreAudio is reading data. He thinks we do need an extra mechanism to ensure the WebProcess stays a quantum ahead of the GPUProcess.
Geoffrey Garen
Comment 19 2020-12-15 15:48:44 PST
I do think it would be nice to give ourselves a few quanta of write-ahead, because WebContent might get scheduler-unlucky and drop a frame. (It seems like the current approach provides 128 samples at a time, but no further write-ahead.) Maybe we should fill half the ring buffer at a time (reserving the other half for the GPU Process getting scheduler-unlucky).

But to be clear, I think that _drift_ is not possible -- where drift means accumulating small timer fire imprecisions until they add up to a big imprecision. The timer implementation takes care of this. If you schedule a repeating 1ms timer, and it fires after 1.1ms, the next fire will schedule for 0.9ms automatically.
Chris Dumez
Comment 20 2020-12-15 16:41:46 PST
Jer Noble
Comment 21 2020-12-15 21:55:17 PST
Comment on attachment 416302 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=416302&action=review

> Source/WebKit/WebProcess/GPU/media/RemoteAudioDestinationProxy.cpp:50
> +constexpr uint64_t framesAheadOfReader = 4 * WebCore::AudioUtilities::renderQuantumSize;

At a 48 kHz sample rate and a 128-frame render quantum, this adds 10ms of latency to WebAudio rendering, which is 2/3 of a full rAF cycle at 60 Hz, or on the ragged edge of being audibly out-of-sync. Is 2 quanta ahead (and 6ms of latency) enough here?
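The latency figures in this thread follow directly from frames / sampleRate. A quick sanity check; note the 44.1 kHz rate is an assumption inferred from the 2.9ms/5.8ms figures quoted later in this thread:

    constexpr double quantum = 128;
    constexpr double msAt48k = quantum / 48000 * 1000; // ~2.67 ms per quantum
    constexpr double msAt44k = quantum / 44100 * 1000; // ~2.9 ms per quantum
    // 4 quanta at 48 kHz ≈ 10.7 ms (the "10ms" above)
    // 2 quanta at 48 kHz ≈ 5.3 ms (the "6ms" above)
    // 2 quanta at 44.1 kHz ≈ 5.8 ms (the figure in comment 28 below)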
Chris Dumez
Comment 22 2020-12-16 07:31:48 PST
(In reply to Jer Noble from comment #21)
> At a 48 kHz sample rate and a 128-frame render quantum, this adds 10ms of
> latency to WebAudio rendering, which is 2/3 of a full rAF cycle at 60 Hz, or
> on the ragged edge of being audibly out-of-sync. Is 2 quanta ahead (and 6ms
> of latency) enough here?

Probably; I will do some testing. I did not realize that buffering more was bad, so I chose a fairly large value arbitrarily.
Chris Dumez
Comment 23 2020-12-16 08:43:16 PST
Hmmm... If I play for long enough (it may take a while), I eventually end up in a state where we miss a lot of frames.

For a while, everything looks perfect: the WebProcess does not need to skip timer fires and almost never needs to do more than one rendering quantum per timer fire. However, after a while, the WebProcess keeps doing extra rendering quanta per timer fire (2 or 3) and, despite that, the GPUProcess starts frequently missing frames. I am trying to figure out why this happens.
Chris Dumez
Comment 24 2020-12-16 09:30:21 PST
(In reply to Chris Dumez from comment #23)
> However, after a while, the WebProcess keeps doing extra rendering quanta
> per timer fire (2 or 3) and, despite that, the GPUProcess starts frequently
> missing frames. I am trying to figure out why this happens.

Oh, never mind, it happens without my patch as well on this particular demo site :/ Looks like we may have an issue where audio rendering becomes more and more expensive to compute (the profiler says most time is spent in the animation timeline). After a while, CPU usage reaches 100% and we start to miss frames. I will investigate this separately.

In the meantime, I will find a WebAudio demo which does not involve an animation timeline to validate this patch.
Geoffrey Garen
Comment 25 2020-12-16 09:41:52 PST
(In reply to Geoffrey Garen from comment #19)
> But to be clear, I think that _drift_ is not possible -- where drift means
> accumulating small timer fire imprecisions until they add up to a big
> imprecision.

Jer mentioned on Slack that drift may still be possible because the CPU and the audio hardware have distinct hardware clocks that are not synchronized.
Chris Dumez
Comment 26 2020-12-16 09:53:44 PST
Chris Dumez
Comment 27 2020-12-16 09:55:16 PST
(In reply to Chris Dumez from comment #24)
> In the meantime, I will find a WebAudio demo which does not involve an
> animation timeline to validate this patch.

Ok, I validated on another demo which does not use an animation timeline. No skipped frames even after playing for a long time, keeping the writer only 2 rendering quanta ahead as Jer suggested.
Geoffrey Garen
Comment 28 2020-12-16 09:57:34 PST
Here's an idea: Let's try to get the lastReadFrame patch land-able, rendering only 2 quanta in advance. That's 5.8ms in advance, which is below the 10ms threshold at which most humans perceive audio to be out of sync with graphics and user action. So, a probably shippable result.

Then, let's try to change from a timer to a cross-process semaphore. If the GPU Process can wake the audio generation thread using a semaphore, then we can run in sync with the audio hardware clock, and we can probably drop latency back down to one quantum (2.9ms).

An additional benefit of doing the semaphore patch is that we probably need it for our graphics work in GPU Process: https://bugs.webkit.org/show_bug.cgi?id=219586. But it's a little tricky to send a semaphore across processes. So we need to figure out how to do it.

(Darwin also supports the "eventlink" API, which is a special semaphore that reduces scheduler cost when doing sync audio IPC. That could bring latency down even further. But I think there's more value right now in figuring out the semaphore solution, so we can reuse it for graphics.)
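A rough sketch of the Mach primitives this would involve; semaphore_create/semaphore_signal/semaphore_wait are real Mach calls, but how the semaphore's port right gets transferred to the other process over IPC is elided here, which is the tricky part mentioned above:

    #include <mach/mach.h>

    // GPU Process: create the semaphore; its send right would then be sent
    // to the WebProcess over the existing IPC connection (not shown).
    semaphore_t makeRenderSemaphore()
    {
        semaphore_t semaphore = MACH_PORT_NULL;
        semaphore_create(mach_task_self(), &semaphore, SYNC_POLICY_FIFO, 0);
        return semaphore;
    }

    // GPU Process, from the CoreAudio render callback:
    //     semaphore_signal(semaphore); // "one quantum consumed, render another"
    // WebProcess, on the audio rendering thread:
    //     semaphore_wait(semaphore);   // wakes once per hardware rendering quantum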
Chris Dumez
Comment 29 2020-12-16 10:04:49 PST
(In reply to Geoffrey Garen from comment #28)
> Here's an idea: Let's try to get the lastReadFrame patch land-able,
> rendering only 2 quanta in advance.

I believe this last patch iteration works and uses 2 quanta.

> Then, let's try to change from a timer to a cross-process semaphore.

Using a cross-process semaphore was actually my initial proposal, but the timer seemed simpler for the following reasons:

1. We don't yet have such a cross-process semaphore utility in WebKit, although I am confident I could add one.

2. Presumably the idea would be to have the Audio thread on the WebContent process side blocked on the semaphore and the GPUProcess's render() signal the semaphore. What happens if render() signals the semaphore *WHILE* the WebProcess is rendering a quantum? We have to somehow make sure the WebProcess processes another quantum right after, without blocking on the semaphore.

3. When an AudioWorklet is active, the thread is not 100% dedicated to audio rendering and I don't think we can truly block the thread on a semaphore. The WorkletThread needs to process its run loop and do things like resolve JS promises.
Geoffrey Garen
Comment 30 2020-12-16 11:24:22 PST
(In reply to Chris Dumez from comment #29)
> 1. We don't yet have such a cross-process semaphore utility in WebKit,
> although I am confident I could add one.

Right.

> 2. Presumably the idea would be to have the Audio thread on the WebContent
> process side blocked on the semaphore and the GPUProcess's render() signal
> the semaphore. What happens if render() signals the semaphore *WHILE* the
> WebProcess is rendering a quantum?

FYI, that's automatic. Mach and BSD semaphores count. So, if you semaphore_wait, semaphore_signal, and then semaphore_signal again, the next call to semaphore_wait will return right away.

> 3. When an AudioWorklet is active, the thread is not 100% dedicated to audio
> rendering and I don't think we can truly block the thread on a semaphore.
> The WorkletThread needs to process its run loop and do things like resolve
> JS promises.

Good point. Resolving promises seems manageable. But if worklets can schedule timers or postMessage or similar, then we've got issues.
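A tiny single-process illustration of that counting behavior (real Mach calls, purely illustrative):

    semaphore_t semaphore = MACH_PORT_NULL;
    semaphore_create(mach_task_self(), &semaphore, SYNC_POLICY_FIFO, 0);
    semaphore_signal(semaphore); // count: 1
    semaphore_signal(semaphore); // count: 2
    semaphore_wait(semaphore);   // returns immediately; count: 1
    semaphore_wait(semaphore);   // returns immediately; count: 0
    // A third wait would block until the next signal, so a signal that
    // arrives while the WebProcess is mid-quantum is not lost.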
Chris Dumez
Comment 31 2020-12-16 11:29:59 PST
(In reply to Geoffrey Garen from comment #30)
> Good point. Resolving promises seems manageable. But if worklets can
> schedule timers or postMessage or similar, then we've got issues.

Worklets cannot schedule timers (worklets have a super limited API available). They do, however, have a message port via which they can postMessage() or receive messages posted by the main thread's JS.
Geoffrey Garen
Comment 32 2020-12-16 11:43:10 PST
Comment on attachment 416343 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=416343&action=review

r=me

Seems OK to land as-is, but I think there might be room for improvement.

> Source/WebCore/platform/audio/cocoa/CARingBuffer.cpp:182
> +    for (unsigned i = 0; i < kGeneralRingTimeBoundsQueueSize; ++i)
> +        m_lastReadFrameBuffer[i] = 0;

If you use std::array, you can just call fill() here instead.

> Source/WebCore/platform/audio/cocoa/CARingBuffer.h:64
> +    static constexpr uint32_t kGeneralRingTimeBoundsQueueSize = 32;

Should this be 2 now (1 current reader frame + 1 write-ahead frame = 5.8ms maximum total audio in the queue)?

> Source/WebCore/platform/audio/cocoa/CARingBuffer.h:94
> +    uint64_t m_lastReadFrameBuffer[kGeneralRingTimeBoundsQueueSize] = { 0 };

I have a slight preference for std::array. Because array-to-pointer type masquerading breaks my brain.

> Source/WebCore/platform/audio/cocoa/CARingBuffer.h:95
> +    uint64_t m_lastReadFrameBuffer[kGeneralRingTimeBoundsQueueSize] = { 0 };
> +    std::atomic<unsigned> m_lastReadFrameBufferIndex { 0 };

It seems like you're writing the "last read frame" as a stream of values in a ring buffer (m_lastReadFrameBuffer). Is that necessary? I think that only the GPU Process updates this value, and I think that the WebContent Process is only interested in the most up-to-date value. If so, we can use a single std::atomic<uint64_t> instead.
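A sketch of the single-atomic alternative suggested in the last review comment above (hypothetical member names; assumes the GPU Process is the only writer of the value):

    #include <atomic>
    #include <cstdint>

    // Replaces the m_lastReadFrameBuffer ring and its index with one value,
    // since only the most recent read position matters to the WebProcess.
    std::atomic<uint64_t> m_lastReadFrame { 0 };

    // GPU Process, after consuming a quantum:
    //     m_lastReadFrame.store(readFrame, std::memory_order_release);
    // WebContent Process, before rendering the next quantum:
    //     uint64_t readFrame = m_lastReadFrame.load(std::memory_order_acquire);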
Chris Dumez
Comment 33 2020-12-16 12:29:20 PST
EWS
Comment 34 2020-12-16 13:08:25 PST
Committed r270907: <https://trac.webkit.org/changeset/270907> All reviewed patches have been landed. Closing bug and clearing flags on attachment 416356 [details].
Radar WebKit Bug Importer
Comment 35 2020-12-16 13:09:21 PST
Peng Liu
Comment 36 2020-12-16 13:52:32 PST
*** Bug 217969 has been marked as a duplicate of this bug. ***