75564 – ConvolverNode should not incur processing latency

Chris Rogers

Reported 2012-01-04 12:52:36 PST

Currently the ConvolverNode in the Web Audio API incurs a 128 sample-frame processing latency. This corresponds to around 3ms at a 44.1KHz sample-rate. For many/most applications this small latency is not noticeable because the impulse response is often set to a "room-effect" which itself involves substantial amounts of delays/reflections/diffusion. However, for some technical uses of convolution (e.g. linear-phase crossover filters) this extra delay becomes problematic. The current implementation uses FFTs everywhere. To achieve zero latency, the leading portion of the impulse response can be handled with "direct" convolution.

Xingnan Wang

Comment 1 2012-02-24 21:18:57 PST

Created attachment 128848 [details] Patch

Xingnan Wang

Comment 2 2012-02-27 17:05:00 PST

(In reply to comment #1) > Created an attachment (id=128848) [details] > Patch Provide a fix for the latency issue of ConvolverNode. Please review.

Chris Rogers

Comment 3 2012-02-27 18:11:05 PST

Comment on attachment 128848 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128848&action=review Great patch! I'm surprised somebody has gotten to this so quickly. > Source/WebCore/WebCore.gypi:2865 > + 'platform/audio/DirectConvolver.h', You'll also need to add these files to the GTK makefile: WebCore/GNUmakefile.list.am and to the "mac" port build file (do you have a Macintosh with Xcode? If not I can help you with that part of the patch): WebCore/WebCore.xcodeproj > Source/WebCore/platform/audio/DirectConvolver.cpp:2 > + * Copyright (C) 2010 Google Inc. All rights reserved. 2010 -> 2012 You may wish to use Intel's version of the copyright file... > Source/WebCore/platform/audio/DirectConvolver.cpp:43 > +DirectConvolver::DirectConvolver(size_t convolveSize) Could we instead pass in "kernelSize" > Source/WebCore/platform/audio/DirectConvolver.cpp:45 > + , m_inputBuffer(convolveSize) And then initialize these two as (2 * kernelSize) > Source/WebCore/platform/audio/DirectConvolver.cpp:46 > + , m_lastOverlapBuffer(convolveSize / 2) and this one as "kernelSize" > Source/WebCore/platform/audio/DirectConvolver.cpp:51 > +{ Could you please add an ASSERT (and early return if fail) checking for framesToProcess matching the correct size according to your comment in the header file: // Only support convolveSize = 2 * framesToProcess > Source/WebCore/platform/audio/DirectConvolver.cpp:52 > + int kernelSize = convolveKernel->size(); Should add ASSERT that (kernelSize == m_kernelSize) (assuming we change the constructor as I recommend above) > Source/WebCore/platform/audio/DirectConvolver.cpp:54 > + int halfSize = m_convolveSize / 2; I think we can get rid of this variable "halfSize" and simply use kernelSize > Source/WebCore/platform/audio/DirectConvolver.cpp:60 > + // kernelSize must equal to halfSize here. Can get rid of this check if we add something similar above as I recommend > Source/WebCore/platform/audio/DirectConvolver.cpp:61 > + // FIXME: Better solution is needed for variable kernelSize. can remove FIXME if you simply use kernelSize (and get rid of halfSize) > Source/WebCore/platform/audio/DirectConvolver.cpp:70 > + for (int i = 0; i < halfSize; i++) { Could you do some basic scalar optimization here? For example loop unrolling similar in style to SincResampler.cpp if (n == 32) { CONVOLVE_ONE_SAMPLE // 1 CONVOLVE_ONE_SAMPLE // 2 CONVOLVE_ONE_SAMPLE // 3 CONVOLVE_ONE_SAMPLE // 4 ... Don't worry about SSE -- for that can add a FIXME > Source/WebCore/platform/audio/DirectConvolver.cpp:85 > + for (int i = 0; i < halfSize; i++) { Same comment as above about basic scalar optimization -- loop unrolling > Source/WebCore/platform/audio/DirectConvolver.h:2 > + * Copyright (C) 2010 Google Inc. All rights reserved. Same comment about copyright header as before > Source/WebCore/platform/audio/DirectConvolver.h:39 > + DirectConvolver(size_t convolveSize); I recommend you pass in "kernelSize" as I mention in the .cpp file > Source/WebCore/platform/audio/DirectConvolver.h:45 > + size_t convolveSize() const { return m_convolveSize; } I'd change this method to "kernelSize()" > Source/WebCore/platform/audio/DirectConvolver.h:48 > + size_t m_convolveSize; m_kernelSize

Xingnan Wang

Comment 4 2012-02-28 00:52:25 PST

Created attachment 129211 [details] Patch

Xingnan Wang

Comment 5 2012-02-28 01:24:50 PST

Comment on attachment 128848 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128848&action=review >> Source/WebCore/WebCore.gypi:2865 >> + 'platform/audio/DirectConvolver.h', > > You'll also need to add these files to the GTK makefile: > WebCore/GNUmakefile.list.am > > and to the "mac" port build file (do you have a Macintosh with Xcode? If not I can help you with that part of the patch): > WebCore/WebCore.xcodeproj Add files to WebCore/GNUmakefile.list.am. I don`t have Macintosh now, maybe I will get one in one or two months. So it`s very nice if you can help add the mac part, thank you. >> Source/WebCore/platform/audio/DirectConvolver.cpp:2 >> + * Copyright (C) 2010 Google Inc. All rights reserved. > > 2010 -> 2012 > > You may wish to use Intel's version of the copyright file... Done. >> Source/WebCore/platform/audio/DirectConvolver.cpp:43 >> +DirectConvolver::DirectConvolver(size_t convolveSize) > > Could we instead pass in "kernelSize" Done. >> Source/WebCore/platform/audio/DirectConvolver.cpp:45 >> + , m_inputBuffer(convolveSize) > > And then initialize these two as (2 * kernelSize) Done. >> Source/WebCore/platform/audio/DirectConvolver.cpp:46 >> + , m_lastOverlapBuffer(convolveSize / 2) > > and this one as "kernelSize" Done. >> Source/WebCore/platform/audio/DirectConvolver.cpp:51 >> +{ > > Could you please add an ASSERT (and early return if fail) checking for framesToProcess matching the correct size according to your comment in the header file: > // Only support convolveSize = 2 * framesToProcess Done. >> Source/WebCore/platform/audio/DirectConvolver.cpp:60 >> + // kernelSize must equal to halfSize here. > > Can get rid of this check if we add something similar above as I recommend Done. >> Source/WebCore/platform/audio/DirectConvolver.cpp:70 >> + for (int i = 0; i < halfSize; i++) { > > Could you do some basic scalar optimization here? For example loop unrolling similar in style to SincResampler.cpp > > if (n == 32) { > CONVOLVE_ONE_SAMPLE // 1 > CONVOLVE_ONE_SAMPLE // 2 > CONVOLVE_ONE_SAMPLE // 3 > CONVOLVE_ONE_SAMPLE // 4 > ... > > Don't worry about SSE -- for that can add a FIXME Added FIXME about SSE optimization. Also added FIXME about IPP optimization. >> Source/WebCore/platform/audio/DirectConvolver.cpp:85 >> + for (int i = 0; i < halfSize; i++) { > > Same comment as above about basic scalar optimization -- loop unrolling Done. >> Source/WebCore/platform/audio/DirectConvolver.h:2 >> + * Copyright (C) 2010 Google Inc. All rights reserved. > > Same comment about copyright header as before Done. >> Source/WebCore/platform/audio/DirectConvolver.h:39 >> + DirectConvolver(size_t convolveSize); > > I recommend you pass in "kernelSize" as I mention in the .cpp file Done. >> Source/WebCore/platform/audio/DirectConvolver.h:45 >> + size_t convolveSize() const { return m_convolveSize; } > > I'd change this method to "kernelSize()" Done. >> Source/WebCore/platform/audio/DirectConvolver.h:48 >> + size_t m_convolveSize; > > m_kernelSize Done.

Xingnan Wang

Comment 6 2012-02-28 01:54:08 PST

Chris, I have a question about the performance test. It seems that there is not any benchmark for the web audio(maybe there is but I didn`t find it), but it`s truly needed a way to measure the improvement or optimization of web audio. For some simple scenarios(like SSE optimization of the vector-math) it may be enough to count the functions` executing time. But for more complicated scenarios(like measure the convolution effect) we may need some test cases or typical app examples to measure the cpu usage, memory consuming, the latency or other aspects, right? And I want to know how do you do the performance test? Whether some methods or cases could be shared?

Chris Rogers

Comment 7 2012-02-29 16:18:53 PST

(In reply to comment #6) > Chris, > > I have a question about the performance test. > It seems that there is not any benchmark for the web audio(maybe there is but I didn`t find it), but it`s truly needed a way to measure the improvement or optimization of web audio. For some simple scenarios(like SSE optimization of the vector-math) it may be enough to count the functions` executing time. But for more complicated scenarios(like measure the convolution effect) we may need some test cases or typical app examples to measure the cpu usage, memory consuming, the latency or other aspects, right? > And I want to know how do you do the performance test? Whether some methods or cases could be shared? Xingnan, that is a very good question. I think the best thing we could do is add some extra powers to the layout test tool (DumpRenderTree) so that it is able to measure the total time spent in rendering in an OfflineAudioContext (please see the while() loop in OfflineAudioDestinationNode::render()) There are people in the chromium project who are now looking into high-resolution timers: Mac OS X: mach_absolute_time() Windows: QueryPerformanceCounter()??? Linux: ??? Currently (as far as I know) WebKit doesn't have proper high-resolution timers as an abstraction... If we did, we can measure the start time and end time of rendering in a specific test. I'm not quite sure how easy it would be to add these tests as part of the automated testing system, since these tests run on a wide variety of hardware and operating systems. Some of the testing bots even run in virtual machines!! But, still we could have the tests available to be run manually by real humans to help test these types of performance improvements (given a specific machine with specific exact hardware)...

Chris Rogers

Comment 8 2012-02-29 16:55:42 PST

Comment on attachment 129211 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=129211&action=review > Source/WebCore/platform/audio/DirectConvolver.cpp:79 > +#define FIRST_HALF_CONVOLVE_ONE_SAMPLE \ The macros and loops are "inside-out" I think you're going to have to reverse these macros so that there is no loop 81:82 inside here, but instead there's a loop (to generate each destination sample). Something like this: for (size_t i = 0; i < framesToProcess; ++i) { sum = 0; CONVOLVE_ONE_SAMPLE; CONVOLVE_ONE_SAMPLE; CONVOLVE_ONE_SAMPLE; CONVOLVE_ONE_SAMPLE; // .... *destP++ = sum; } Please see SincResampler for the general idea. > Source/WebCore/platform/audio/DirectConvolver.cpp:243 > +#define SECOND_HALF_CONVOLVE_ONE_SAMPLE \ ditto: see above comment

Xingnan Wang

Comment 9 2012-02-29 22:46:06 PST

(In reply to comment #7) > (In reply to comment #6) > > Chris, > > > > I have a question about the performance test. > > It seems that there is not any benchmark for the web audio(maybe there is but I didn`t find it), but it`s truly needed a way to measure the improvement or optimization of web audio. For some simple scenarios(like SSE optimization of the vector-math) it may be enough to count the functions` executing time. But for more complicated scenarios(like measure the convolution effect) we may need some test cases or typical app examples to measure the cpu usage, memory consuming, the latency or other aspects, right? > > And I want to know how do you do the performance test? Whether some methods or cases could be shared? > > Xingnan, that is a very good question. I think the best thing we could do is add some extra powers to the layout test tool (DumpRenderTree) so that it is able to measure the total time spent in rendering in an OfflineAudioContext (please see the while() loop in OfflineAudioDestinationNode::render()) > There are people in the chromium project who are now looking into high-resolution timers: > Mac OS X: mach_absolute_time() > Windows: QueryPerformanceCounter()??? > Linux: ??? > > Currently (as far as I know) WebKit doesn't have proper high-resolution timers as an abstraction... > > If we did, we can measure the start time and end time of rendering in a specific test. I'm not quite sure how easy it would be to add these tests as part of the automated testing system, since these tests run on a wide variety of hardware and operating systems. Some of the testing bots even run in virtual machines!! > But, still we could have the tests available to be run manually by real humans to help test these types of performance improvements (given a specific machine with specific exact hardware)... Chris, great thanks for your answer and it`s really helpful for my performance test. Hope such tests could be in layout test.

Xingnan Wang

Comment 10 2012-02-29 22:54:08 PST

Comment on attachment 129211 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=129211&action=review >> Source/WebCore/platform/audio/DirectConvolver.cpp:79 >> +#define FIRST_HALF_CONVOLVE_ONE_SAMPLE \ > > The macros and loops are "inside-out" > > I think you're going to have to reverse these macros so that there is no loop 81:82 inside here, but instead there's a loop (to generate each destination sample). > Something like this: > > for (size_t i = 0; i < framesToProcess; ++i) { > sum = 0; > CONVOLVE_ONE_SAMPLE; > CONVOLVE_ONE_SAMPLE; > CONVOLVE_ONE_SAMPLE; > CONVOLVE_ONE_SAMPLE; > // .... > > *destP++ = sum; > } > > Please see SincResampler for the general idea. It seems the case here is a little different from in SincResampler. If we want to unroll the inner loop, it must have a fixed loop count. But In DirectConvolver the loop count is variable with i. I`m not sure whether we can unroll here?

Chris Rogers

Comment 11 2012-02-29 23:19:48 PST

(In reply to comment #10) > (From update of attachment 129211 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=129211&action=review > > >> Source/WebCore/platform/audio/DirectConvolver.cpp:79 > >> +#define FIRST_HALF_CONVOLVE_ONE_SAMPLE \ > > > > The macros and loops are "inside-out" > > > > I think you're going to have to reverse these macros so that there is no loop 81:82 inside here, but instead there's a loop (to generate each destination sample). > > Something like this: > > > > for (size_t i = 0; i < framesToProcess; ++i) { > > sum = 0; > > CONVOLVE_ONE_SAMPLE; > > CONVOLVE_ONE_SAMPLE; > > CONVOLVE_ONE_SAMPLE; > > CONVOLVE_ONE_SAMPLE; > > // .... > > > > *destP++ = sum; > > } > > > > Please see SincResampler for the general idea. > > It seems the case here is a little different from in SincResampler. > If we want to unroll the inner loop, it must have a fixed loop count. But In DirectConvolver the loop count is variable with i. I`m not sure whether we can unroll here? I'm quite sure it can be done. Look at SincResampler where specific cases are hard-coded: if (n == 32) { ... } else if (n == 64) { ... } else { } I believe the case we specially want to optimize is 128. Generally, the kernelSize and the framesToProcess will often both be 128, so should be easy to special-case that. Actually, in a way, the SincResampler code is just a more complex version of how the DirectConvolver could be coded. If you look at how the CONVOLVE_ONE_SAMPLE macro is defined in SincResampler it actually has to accumulate two sums (sum1, sum2) then later it effectively performs a linear interpolation between these two. But DirectConvolver can have a similar macro accumulating just a single sum...

Xingnan Wang

Comment 12 2012-03-01 00:24:08 PST

Chris, Here I get another question about SincResampler. Does the current the layout tests cover the SincResampler? I am now investigating whether the IPP can be used to optimize the SincResampler, also do you think it`s valuable to do such optimization or whether the SincResampler is hot enough and needs this?

Xingnan Wang

Comment 13 2012-03-01 00:25:16 PST

Comment on attachment 129211 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=129211&action=review >>>> Source/WebCore/platform/audio/DirectConvolver.cpp:79 >>>> +#define FIRST_HALF_CONVOLVE_ONE_SAMPLE \ >>> >>> The macros and loops are "inside-out" >>> >>> I think you're going to have to reverse these macros so that there is no loop 81:82 inside here, but instead there's a loop (to generate each destination sample). >>> Something like this: >>> >>> for (size_t i = 0; i < framesToProcess; ++i) { >>> sum = 0; >>> CONVOLVE_ONE_SAMPLE; >>> CONVOLVE_ONE_SAMPLE; >>> CONVOLVE_ONE_SAMPLE; >>> CONVOLVE_ONE_SAMPLE; >>> // .... >>> >>> *destP++ = sum; >>> } >>> >>> Please see SincResampler for the general idea. >> >> It seems the case here is a little different from in SincResampler. >> If we want to unroll the inner loop, it must have a fixed loop count. But In DirectConvolver the loop count is variable with i. I`m not sure whether we can unroll here? > > I'm quite sure it can be done. Look at SincResampler where specific cases are hard-coded: > > if (n == 32) { > ... > } else if (n == 64) { > ... > } else { > } > > I believe the case we specially want to optimize is 128. Generally, the kernelSize and the framesToProcess will often both be 128, so should be easy to special-case that. > > Actually, in a way, the SincResampler code is just a more complex version of how the DirectConvolver could be coded. If you look at how the CONVOLVE_ONE_SAMPLE macro is defined in SincResampler it actually has to accumulate two sums (sum1, sum2) then later it effectively performs a linear interpolation between these two. But DirectConvolver can have a similar macro accumulating just a single sum... In the DirectConvolver the convolution is while (i < kernelSize) { for (size_t j = 0; j <= i; j++) sum += inputP[i - j] * kernelP[j]; } And from my understanding, it could use the unrolling like in SincResampler, only when it`s the follow format: while (i < kernelSize) { for (size_t j = 0; j < kernelSize; j++)//unroll here when kernekSize is 128, not j<=i because i is variable in the while loop sum += inputP[i - j] * kernelP[j]; // Here use macro instead } But here I exclude the " sum += inputP[i - j] * kernelP[j];" when j>i&&j<kernelSize, because they are all zeros. So if we include the " sum += inputP[i - j] * kernelP[j];" when j>i&&j<kernelSize and unroll the loop, I am doubt about that it will get faster than not do that. Maybe what I understand is wrong, if that correct me.

Chris Rogers

Comment 14 2012-03-01 12:33:21 PST

Hi Xingnan, think about the buffering in a *different* way than you currently are. In the following algorithm there will not be a 1st and 2nd half of convolving as in your current code, but the m_inputBuffer will be split into two halves. Consider this (simplified for kernelSize==128 and framesToProcess==128) Keep m_inputBuffer (which is of size 256 and zero-initialized) m_lastOverlapBuffer is not needed and can be removed from the code. Then follow these steps: 1) memcpy() 128 input frames to 2nd half of m_inputBuffer 2) Perform convolution of 128 output frames (each frame requiring 128 multiply-add instructions) placing results in the destination buffer. These 128 multiply-add instructions correspond to 128 consecutive calls of the CONVOLVE_ONE_SAMPLE macro 3) memcpy() the 128 input frames (or 2nd half of m_inputBuffer which is the same) to the 1st half of m_inputBuffer 4) return from function The steps 1-4 will be performed each time process() is called Note that in step (2) there will always be 128 multiply-add operations

Xingnan Wang

Comment 15 2012-03-01 17:39:45 PST

(In reply to comment #14) > Hi Xingnan, think about the buffering in a *different* way than you currently are. In the following algorithm there will not be a 1st and 2nd half of convolving as in your current code, but the m_inputBuffer will be split into two halves. > > Consider this (simplified for kernelSize==128 and framesToProcess==128) > > Keep m_inputBuffer (which is of size 256 and zero-initialized) > m_lastOverlapBuffer is not needed and can be removed from the code. > > Then follow these steps: > > 1) memcpy() 128 input frames to 2nd half of m_inputBuffer > 2) Perform convolution of 128 output frames (each frame requiring 128 multiply-add instructions) placing results in the destination buffer. These 128 multiply-add instructions correspond to 128 consecutive calls of the CONVOLVE_ONE_SAMPLE macro > 3) memcpy() the 128 input frames (or 2nd half of m_inputBuffer which is the same) to the 1st half of m_inputBuffer > 4) return from function > > The steps 1-4 will be performed each time process() is called > > Note that in step (2) there will always be 128 multiply-add operations Got it, thank you Chris.

Xingnan Wang

Comment 16 2012-03-01 23:22:15 PST

Created attachment 129828 [details] Patch

Xingnan Wang

Comment 17 2012-03-01 23:24:50 PST

(In reply to comment #16) > Created an attachment (id=129828) [details] > Patch Updated the patch.

Build Bot

Comment 18 2012-03-02 00:03:18 PST

Comment on attachment 129828 [details] Patch Attachment 129828 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/11769811

Chris Rogers

Comment 19 2012-03-04 18:32:33 PST

Comment on attachment 129828 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=129828&action=review Overall, this is looking really good! As you'll see in the comments below, I think we should support smaller kernel sizes (such as 32,64) and should consider using 32 by default. In order to do this, the MinFFTSize needs to be adjusted (to twice kernelSize) from its current value: const size_t MinFFTSize = 256; > Source/WebCore/platform/audio/DirectConvolver.cpp:44 > + : m_kernelSize(kernelSize) Looking at this a little more closely, do we even need to pass the kernel size here? The reason I ask is because we don't really use the m_kernelSize variable except to check that it's equal to convolveKernel->size(). I think it would be more useful to pass in "inputBlockSize" instead (which would normally be 128 and equal to the value "framesToProcess" passed into process() and initialize a member variable called "m_inputBlockSize" and get rid of m_kernelSize > Source/WebCore/platform/audio/DirectConvolver.cpp:45 > + , m_inputBuffer(kernelSize * 2) Then this would be m_inputBuffer(inputBlockSize * 2) > Source/WebCore/platform/audio/DirectConvolver.cpp:49 > +void DirectConvolver::process(AudioFloatArray* convolveKernel, const float* sourceP, float* destP, size_t framesToProcess) I would change the name slightly: convolveKernel -> convolutionKernel > Source/WebCore/platform/audio/DirectConvolver.cpp:51 > + nit: please remove extra blank line > Source/WebCore/platform/audio/DirectConvolver.cpp:52 > + // FIXME: Optimize for IPP, reverb function in ipp library can be used here. Please also add a FIXME to use the "conv()" function in Accelerate.framework for Darwin (Apple) Ideally, you could write a bug for each of these FIXMEs and include the bug URLs here. > Source/WebCore/platform/audio/DirectConvolver.cpp:56 > + if (framesToProcess != m_kernelSize) We should change these checks to check against "m_inputBlockSize" instead of "m_kernelSize" > Source/WebCore/platform/audio/DirectConvolver.cpp:59 > + size_t kernelSize = convolveKernel->size(); This line can be moved up above line 54 and use this variable instead of "m_kernelSize" > Source/WebCore/platform/audio/DirectConvolver.cpp:61 > + // kernelSize must euqal to m_kernelSize spelling nit: "euqal" -> "equal" > Source/WebCore/platform/audio/DirectConvolver.cpp:64 > + return; We can remove these checks, but instead check that (kernelSize <= m_inputBlockSize) > Source/WebCore/platform/audio/DirectConvolver.cpp:68 > + float* inputP = m_inputBuffer.data() + m_kernelSize; m_kernelSize -> m_inputBlockSize (or framesToProcess which should be the same) > Source/WebCore/platform/audio/DirectConvolver.cpp:81 > + j++; nit: should use four spaces indentation I suspect this macro can be further optimized to avoid pipeline stalls, etc. for the scalar case. We should add FIXME here. One possibility for further optimization might be to maintain 4 separate sums: sum1, sum2, sum3, sum4 and change the macro to CONVOLVE_FOUR_SAMPLES But let's not worry about that right now - FIXME is fine. > Source/WebCore/platform/audio/DirectConvolver.cpp:89 > + if (kernelSize == 128) { I highly recommend adding the kernelSize == 64 and kernelSize == 32 here too: if (kernelSize == 32) { else if (kernelSize == 64) { else if (kernelSize == 128) { } The reason is because I'm worried that changing ReverbConvolver to do the DirectConvolver with kernelSize==128 may be too inefficient especially on slow machines (we lose performance compared with FFT) So it would be great to handle the smaller sizes (32,64) and then pick the 32 size in ReverbConvolver > Source/WebCore/platform/audio/DirectConvolver.cpp:231 > + while (i < kernelSize) { shouldn't this be: j < kernelSize > Source/WebCore/platform/audio/ReverbConvolver.cpp:110 > + stage = adoptPtr(new ReverbConvolverStage(response, totalResponseLength, reverbTotalLatency, stageOffset, stageSize, fftSize, renderPhase, renderSliceSize, &m_accumulationBuffer, true)); Instead of duplicating this line twice: stage = adoptPtr(new ReverbConvolverStage(response, totalResponseLength, reverbTotalLatency, stageOffset, stageSize, fftSize, renderPhase, renderSliceSize, &m_accumulationBuffer)); Please initialize "stage" on a single line, passing in (stageOffset==0) as the final argument Or, even better, create a local "bool useDirectConvolver = (stageOffset==0);" and pass in useDirectConvolver so that the code is easier to understand > Source/WebCore/platform/audio/ReverbConvolverStage.cpp:60 > + m_FFTconvolver = adoptPtr(new FFTConvolver(fftSize)); I know this is a bit strange, but to be consistent with the naming of "m_fftKernel" we should call this member variable "m_fftConvolver" > Source/WebCore/platform/audio/ReverbConvolverStage.cpp:64 > + m_directConvolver = adoptPtr(new DirectConvolver(stageLength)); stageLength -> renderSliceSize if the changes are made to the DirectConvolver constructor

Chris Rogers

Comment 20 2012-03-04 18:34:57 PST

Please also note that if we reduce MinFFTSize, then FFTConvolver::process() will need to be adjusted to handle framesToProcess > fftSize/2 (see FIXME inside that method)

Xingnan Wang

Comment 21 2012-03-05 02:21:37 PST

Created attachment 130083 [details] Patch

Xingnan Wang

Comment 22 2012-03-05 02:26:26 PST

Comment on attachment 129828 [details] Patch Update the patch as your comments, thanks. View in context: https://bugs.webkit.org/attachment.cgi?id=129828&action=review >> Source/WebCore/platform/audio/DirectConvolver.cpp:89 >> + if (kernelSize == 128) { > > I highly recommend adding the kernelSize == 64 and kernelSize == 32 here too: > > if (kernelSize == 32) { > else if (kernelSize == 64) { > else if (kernelSize == 128) { > } > > The reason is because I'm worried that changing ReverbConvolver to do the DirectConvolver with kernelSize==128 may be too inefficient especially on slow machines (we lose performance compared with FFT) > So it would be great to handle the smaller sizes (32,64) and then pick the 32 size in ReverbConvolver One question: whether the FFT performance is still better than direct way here when kernelSize is small like 32?

Build Bot

Comment 23 2012-03-05 02:28:00 PST

Comment on attachment 130083 [details] Patch Attachment 130083 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/11802897

Chris Rogers

Comment 24 2012-03-05 09:23:09 PST

(In reply to comment #22) > (From update of attachment 129828 [details]) > Update the patch as your comments, thanks. > View in context: https://bugs.webkit.org/attachment.cgi?id=129828&action=review > > >> Source/WebCore/platform/audio/DirectConvolver.cpp:89 > >> + if (kernelSize == 128) { > > > > I highly recommend adding the kernelSize == 64 and kernelSize == 32 here too: > > > > if (kernelSize == 32) { > > else if (kernelSize == 64) { > > else if (kernelSize == 128) { > > } > > > > The reason is because I'm worried that changing ReverbConvolver to do the DirectConvolver with kernelSize==128 may be too inefficient especially on slow machines (we lose performance compared with FFT) > > So it would be great to handle the smaller sizes (32,64) and then pick the 32 size in ReverbConvolver > > One question: whether the FFT performance is still better than direct way here when kernelSize is small like 32? Good question, is this something you can measure so we can know how to tune? Just remember that in order to test kernelSize<128 you'll need to add a loop inside FFTConvolver::process() since the FFT size will be small (see FIXME in that function)

Chris Rogers

Comment 25 2012-03-12 15:14:35 PDT

I've tested this on a 2-year old Mac desktop machine (2.26 GHz Quad-Core Intel Xeon) and the current direct convolution (kernelSize==128) takes about 20 microseconds (compared with ~ 5 microseconds) for FFTConvolver version. So, on this machine with a stereo impulse response this is about an extra 1% CPU load. But a slower machine (ARM, etc.) would have a worse CPU hit. Ideally, we'd move down to kernelSize==64 or 32 since I'm pretty sure we'll get better performance. Xingnan, would you mind finishing that last part (fixing FFTConvolver::process() to handle smaller processing sizes...) so we can switch to 32?? Jer, it should be fairly trivial to call into the vecLib.framework conv() function here which should be super-optimized.

Xingnan Wang

Comment 26 2012-03-12 19:29:20 PDT

(In reply to comment #25) > I've tested this on a 2-year old Mac desktop machine (2.26 GHz Quad-Core Intel Xeon) and the current direct convolution (kernelSize==128) takes about 20 microseconds (compared with ~ 5 microseconds) for FFTConvolver version. So, on this machine with a stereo impulse response this is about an extra 1% CPU load. But a slower machine (ARM, etc.) would have a worse CPU hit. > > Ideally, we'd move down to kernelSize==64 or 32 since I'm pretty sure we'll get better performance. Xingnan, would you mind finishing that last part (fixing FFTConvolver::process() to handle smaller processing sizes...) so we can switch to 32?? > Of cause not, I would like to handle this. Should it be filed another bug? I have a question that does the first convolverStage( or first several stages) is the key part to impact the performance? As the left stages have increasing fft size, they still process with large kernelSize. > Jer, it should be fairly trivial to call into the vecLib.framework conv() function here which should be super-optimized.

Chris Rogers

Comment 27 2012-03-12 20:55:19 PDT

(In reply to comment #26) > (In reply to comment #25) > > I've tested this on a 2-year old Mac desktop machine (2.26 GHz Quad-Core Intel Xeon) and the current direct convolution (kernelSize==128) takes about 20 microseconds (compared with ~ 5 microseconds) for FFTConvolver version. So, on this machine with a stereo impulse response this is about an extra 1% CPU load. But a slower machine (ARM, etc.) would have a worse CPU hit. > > > > Ideally, we'd move down to kernelSize==64 or 32 since I'm pretty sure we'll get better performance. Xingnan, would you mind finishing that last part (fixing FFTConvolver::process() to handle smaller processing sizes...) so we can switch to 32?? > > > Of cause not, I would like to handle this. Should it be filed another bug? > I have a question that does the first convolverStage( or first several stages) is the key part to impact the performance? As the left stages have increasing fft size, they still process with large kernelSize. I think we should handle the smaller size in this patch, since it's a performance regression. It's true that the tail of the impulse response has the large FFTs, but much of that happens in a different thread, so doesn't influence the real-time audio thread's performance (on a multi-core machine). Even if it's a fraction of the overall CPU load, we should still make a best effort to minimize the impact of the performance regression. It's pretty easy to make the changes to FFTConvolver to divide the processing up into smaller pieces (probably 5-10 lines of code). I can explain more about that if you're not clear how to do that.

Xingnan Wang

Comment 28 2012-03-13 00:00:20 PDT

Created attachment 131559 [details] Patch

Xingnan Wang

Comment 29 2012-03-13 00:05:49 PDT

(In reply to comment #28) > Created an attachment (id=131559) [details] > Patch Update the patch, (In reply to comment #27) > (In reply to comment #26) > > (In reply to comment #25) > > > I've tested this on a 2-year old Mac desktop machine (2.26 GHz Quad-Core Intel Xeon) and the current direct convolution (kernelSize==128) takes about 20 microseconds (compared with ~ 5 microseconds) for FFTConvolver version. So, on this machine with a stereo impulse response this is about an extra 1% CPU load. But a slower machine (ARM, etc.) would have a worse CPU hit. > > > > > > Ideally, we'd move down to kernelSize==64 or 32 since I'm pretty sure we'll get better performance. Xingnan, would you mind finishing that last part (fixing FFTConvolver::process() to handle smaller processing sizes...) so we can switch to 32?? > > > > > Of cause not, I would like to handle this. Should it be filed another bug? > > I have a question that does the first convolverStage( or first several stages) is the key part to impact the performance? As the left stages have increasing fft size, they still process with large kernelSize. > > I think we should handle the smaller size in this patch, since it's a performance regression. It's true that the tail of the impulse response has the large FFTs, but much of that happens in a different thread, so doesn't influence the real-time audio thread's performance (on a multi-core machine). Even if it's a fraction of the overall CPU load, we should still make a best effort to minimize the impact of the performance regression. It's pretty easy to make the changes to FFTConvolver to divide the processing up into smaller pieces (probably 5-10 lines of code). I can explain more about that if you're not clear how to do that. Chris, thanks your answer, I updated the patch, to add the 32&64 kernelSize support in FFTConvolver as my understanding, please review.

Build Bot

Comment 30 2012-03-13 00:35:01 PDT

Comment on attachment 131559 [details] Patch Attachment 131559 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/11947387

Chris Rogers

Comment 31 2012-03-13 15:12:22 PDT

Comment on attachment 131559 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=131559&action=review > Source/WebCore/platform/audio/DirectConvolver.cpp:51 > + // FIXME: Optimize for IPP, reverb function in ipp library can be used here. nit: you use "IPP" then "ipp" later in the sentence. > Source/WebCore/platform/audio/DirectConvolver.cpp:76 > + // Copy samples to 2nd half of input buffer nit: period at end of sentence > Source/WebCore/platform/audio/DirectConvolver.cpp:81 > + sum += inputP[i - j] * kernelP[j]; \ nit: indentation should be 4 spaces > Source/WebCore/platform/audio/FFTConvolver.cpp:82 > + size_t kernelSize = fftKernel->fftSize() / 2; I'm not sure why you define a separate kernelSize variable as compared with halfSize above -- they should always be the same, so we might as well just use the existing "halfSize" > Source/WebCore/platform/audio/FFTConvolver.cpp:84 > + for (size_t i = 0; i < halfSize; i += kernelSize) { I believe you'll need to move this for() loop higher up in this function - probably all the way very near the top of this function and then simply change "framesToProcess" -> "halfSize" in the 5 places it's used. size_t numberOfDivisions = halfSize <= framesToProcess ? (framesToProcess / halfSize) : 1; for (size_t i = 0; i < numberOfDivisions; ++i, sourceP += halfSize, destP += halfSize) { And we also need to have an ASSERT that framesToProcess is an exact multiple of halfSize (or if halfSize > framesToProcess then halfSize is a multiple of framesToProcess) That should make the change pretty simple... > Source/WebCore/platform/audio/ReverbConvolver.cpp:86 > Please change MinFFTSize to 64 from 256 in order to make sure DirectConvolver has kernelSize==32

Xingnan Wang

Comment 32 2012-03-13 22:42:24 PDT

Created attachment 131789 [details] Patch

Xingnan Wang

Comment 33 2012-03-13 22:52:24 PDT

(In reply to comment #31) > (From update of attachment 131559 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=131559&action=review > > > Source/WebCore/platform/audio/DirectConvolver.cpp:51 > > + // FIXME: Optimize for IPP, reverb function in ipp library can be used here. > > nit: you use "IPP" then "ipp" later in the sentence. > > > Source/WebCore/platform/audio/DirectConvolver.cpp:76 > > + // Copy samples to 2nd half of input buffer > > nit: period at end of sentence > > > Source/WebCore/platform/audio/DirectConvolver.cpp:81 > > + sum += inputP[i - j] * kernelP[j]; \ > > nit: indentation should be 4 spaces > > > Source/WebCore/platform/audio/FFTConvolver.cpp:82 > > + size_t kernelSize = fftKernel->fftSize() / 2; > > I'm not sure why you define a separate kernelSize variable as compared with halfSize above -- they should always be the same, so we might as well just use the existing "halfSize" > > > Source/WebCore/platform/audio/FFTConvolver.cpp:84 > > + for (size_t i = 0; i < halfSize; i += kernelSize) { > > I believe you'll need to move this for() loop higher up in this function - probably all the way very near the top of this function and then simply change "framesToProcess" -> "halfSize" in the 5 places it's used. > It seems we cannot simply change "framesToProcess" to "halfSize", there will be a problem when "halfSize" > "framesToProcess". So I use a "divisionSize = numberOfDivisions == 1 ? framesToProcess : halfSize;" instead. Patch is updated as your other comments. > size_t numberOfDivisions = halfSize <= framesToProcess ? (framesToProcess / halfSize) : 1; > for (size_t i = 0; i < numberOfDivisions; ++i, sourceP += halfSize, destP += halfSize) { > > And we also need to have an ASSERT that framesToProcess is an exact multiple of halfSize (or if halfSize > framesToProcess then halfSize is a multiple of framesToProcess) > > That should make the change pretty simple... > > > Source/WebCore/platform/audio/ReverbConvolver.cpp:86 > > > > Please change MinFFTSize to 64 from 256 in order to make sure DirectConvolver has kernelSize==32

Chris Rogers

Comment 34 2012-03-15 14:55:48 PDT

Comment on attachment 131789 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=131789&action=review Xingnan, this looks almost perfect. I have a few comments below. Also, please adjust the FIXME comment in FFTConvolver.h since the buffering is more flexible now. I'm applying the patch and testing with it now. Assuming all is good, I can handle the last detail of adding the new files to the mac-port .xcodeproj build file. > Source/WebCore/GNUmakefile.list.am:5471 > + Source/WebCore/platform/audio/DirectConvolver.h \ Please put in alphabetical order > Source/WebCore/WebCore.gypi:3041 > + 'platform/audio/DirectConvolver.h', Please put in alphabetical order > Source/WebCore/platform/audio/FFTConvolver.cpp:52 > // FIXME: make so framesToProcess is not required to fit evenly into fftSize/2 We can change this comment now since framesToProcess may now be larger than (fftSize/2). Probably best to simply remove this comment since you've added a different comment on line 56. > Source/WebCore/platform/audio/ReverbConvolver.cpp:-88 > - size_t reverbTotalLatency = m_minFFTSize / 2; Might as well have comment here saying that total latency is zero because we're using direct-convolution in the leading portion.

Chris Rogers

Comment 35 2012-03-15 18:12:59 PDT

Xingnan, I've applied the patch and played with different values for MinFFTSize. I've found that a value of 128 gives the best performance in my testing, so please update MinFFTSize=128

Xingnan Wang

Comment 36 2012-03-15 20:45:42 PDT

Created attachment 132185 [details] Patch

Xingnan Wang

Comment 37 2012-03-15 20:49:52 PDT

(In reply to comment #34) > (From update of attachment 131789 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=131789&action=review > > Xingnan, this looks almost perfect. I have a few comments below. Also, please adjust the FIXME comment in FFTConvolver.h since the buffering is more flexible now. I'm applying the patch and testing with it now. Assuming all is good, I can handle the last detail of adding the new files to the mac-port .xcodeproj build file. Thanks Chris, updated the patch as your comments and changed MinFFTSize to 128. > > > Source/WebCore/GNUmakefile.list.am:5471 > > + Source/WebCore/platform/audio/DirectConvolver.h \ > > Please put in alphabetical order > > > Source/WebCore/WebCore.gypi:3041 > > + 'platform/audio/DirectConvolver.h', > > Please put in alphabetical order > > > Source/WebCore/platform/audio/FFTConvolver.cpp:52 > > // FIXME: make so framesToProcess is not required to fit evenly into fftSize/2 > > We can change this comment now since framesToProcess may now be larger than (fftSize/2). Probably best to simply remove this comment since you've added a different comment on line 56. > > > Source/WebCore/platform/audio/ReverbConvolver.cpp:-88 > > - size_t reverbTotalLatency = m_minFFTSize / 2; > > Might as well have comment here saying that total latency is zero because we're using direct-convolution in the leading portion.

Build Bot

Comment 38 2012-03-15 21:21:50 PDT

Comment on attachment 132185 [details] Patch Attachment 132185 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/11962038

Xingnan Wang

Comment 39 2012-03-18 19:10:53 PDT

Created attachment 132521 [details] Patch

Xingnan Wang

Comment 40 2012-03-18 19:11:58 PDT

(In reply to comment #39) > Created an attachment (id=132521) [details] > Patch Hi Chris, updated the "mac" part.

Chris Rogers

Comment 41 2012-03-19 13:11:13 PDT

Comment on attachment 132521 [details] Patch Thanks Xingnan, looks good...

WebKit Review Bot

Comment 42 2012-03-19 13:20:03 PDT

Comment on attachment 132521 [details] Patch Clearing flags on attachment: 132521 Committed r111227: <http://trac.webkit.org/changeset/111227>

WebKit Review Bot

Comment 43 2012-03-19 13:20:10 PDT

All reviewed patches have been landed. Closing bug.

Attachments
Patch (20.86 KB, patch) 2012-02-24 21:18 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (34.04 KB, patch) 2012-02-28 00:52 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (26.40 KB, patch) 2012-03-01 23:22 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (30.26 KB, patch) 2012-03-05 02:21 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (32.30 KB, patch) 2012-03-13 00:00 PDT, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (35.60 KB, patch) 2012-03-13 22:42 PDT, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (35.83 KB, patch) 2012-03-15 20:45 PDT, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (40.51 KB, patch) 2012-03-18 19:10 PDT, Xingnan Wang	no flags	Details Formatted Diff Diff
Show Obsolete (7) View All Add attachment proposed patch, testcase, etc.