UNCONFIRMED 75528
Optimize the multiply-add in Biquad.cpp::process
https://bugs.webkit.org/show_bug.cgi?id=75528
Summary Optimize the multiply-add in Biquad.cpp::process
Xingnan Wang
Reported 2012-01-04 01:04:07 PST
Optimize the multiply-add by piplining with SSE2 instructions.
Attachments
Patch (4.30 KB, patch)
2012-01-04 01:17 PST, Xingnan Wang
no flags
Biquad layout test script (4.08 KB, text/html)
2012-01-05 10:45 PST, Raymond Toy
no flags
Patch (4.10 KB, patch)
2012-02-09 23:44 PST, Xingnan Wang
webkit.review.bot: commit-queue-
Patch (4.11 KB, patch)
2012-02-13 00:03 PST, Xingnan Wang
no flags
Updated patch to handle 32 and 64-bit builds (4.31 KB, patch)
2012-03-15 11:51 PDT, Raymond Toy
no flags
Patch (3.17 KB, patch)
2012-03-18 20:18 PDT, Xingnan Wang
no flags
Patch (3.63 KB, patch)
2012-03-19 22:34 PDT, Xingnan Wang
no flags
Patch (3.68 KB, patch)
2012-03-20 20:12 PDT, Xingnan Wang
no flags
Xingnan Wang
Comment 1 2012-01-04 01:17:37 PST
Chris Rogers
Comment 2 2012-01-04 18:04:48 PST
Thanks for the optimization! Any idea how much improvement this gives? I'm adding Raymond as reviewer here, and think we should get at least a very simple Biquad layout test written to validate both the existing code, and your new code. The layout test can do the following: 1. create a BiquadFilterNode, configure it as lowpass at a certain cutoff frequency 2. feed an impulse into the filter in an OfflineAudioContext 3. compare the result with the expected output using JavaScript to generate the expected response (we can borrow the math from Biquad.cpp and convert to JS.
Chris Rogers
Comment 3 2012-01-04 18:06:30 PST
Xingnan do you know how to write a layout test. If not, maybe Raymond can help you there, since he's just starting to write them himself.
Xingnan Wang
Comment 4 2012-01-04 18:12:11 PST
(In reply to comment #2) > Thanks for the optimization! Any idea how much improvement this gives? > From my test, it boosted at least 20%(for 128 frames size), and reached 40% with larger size(like 1280). > I'm adding Raymond as reviewer here, and think we should get at least a very simple Biquad layout test written to validate both the existing code, and your new code. The layout test can do the following: > > 1. create a BiquadFilterNode, configure it as lowpass at a certain cutoff frequency > 2. feed an impulse into the filter in an OfflineAudioContext > 3. compare the result with the expected output using JavaScript to generate the expected response (we can borrow the math from Biquad.cpp and convert to JS.
Xingnan Wang
Comment 5 2012-01-04 18:16:41 PST
(In reply to comment #3) > Xingnan do you know how to write a layout test. If not, maybe Raymond can help you there, since he's just starting to write them himself. I did not write layout test before but very interest in it, I would like to run the layout test, it`s very pleasure if Raymond can help. Thanks.
Raymond Toy
Comment 6 2012-01-05 08:53:24 PST
(In reply to comment #5) > (In reply to comment #3) > > Xingnan do you know how to write a layout test. If not, maybe Raymond can help you there, since he's just starting to write them himself. > > I did not write layout test before but very interest in it, I would like to run the layout test, it`s very pleasure if Raymond can help. Thanks. I would be my pleasure to write a simple layout test that you can start with.
Raymond Toy
Comment 7 2012-01-05 10:45:40 PST
Created attachment 121300 [details] Biquad layout test script Here is a start at a test of the biquad (lowpass) filter. Place this file in the third_party/WebKit/LayoutTests/webaudio directory. A biquad-test-expected.txt file needs to be created that contains the expected output. To run the layout test, see http://www.chromium.org/developers/testing/webkit-layout-tests. Perhaps it is enough to test just the lowpass filter for now. But eventually, I think we need a test for all the filters. In that case biquad-test.html should be broken up into a common piece that contains the test infrastructure and a probably several tests that create the appropriate filter tests. Should probably follow the style of convolution-mono-mono.html tests.
Raymond Toy
Comment 8 2012-01-05 11:07:10 PST
Comment on attachment 121082 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=121082&action=review > Source/WebCore/platform/audio/Biquad.cpp:90 > +#ifdef __SSE2__ Did you test this on windows? I think the asm syntax below only works for gcc, so this conditionalization should test for gcc too. (But I guess on windows, __SSE2__ is probably never defined.) > Source/WebCore/platform/audio/Biquad.cpp:93 > + __asm__( I am far from an expert in sse2, but this seems rather complex. Could we do something like this? Create array yy of length 2 (or maybe use the destination array directly?): yy[0] = y yy[1] = y1 load xmm0 with (b0 b1 b2 0) load xmm1 with (-a0 -a1 0 0) load xmm2 from *source to get (x[0] x[1] x[2] x[3]) xmm2 = xmm2 * xmm0 to get (b0*x0 b1*x1 b2*x2 0) load xmm3 from yy to get (y0 y1 junk junk) xmm3 = xmm3 * xmm1 to get (-a0*y0 a1*y1 0 0) xmm3 = xmm3 + xmm2 to get (b0*x0-a0*y0, b1*x1-a1*y1, b2*x2, 0) Extract each part of xmm3 and add them together. (We could gain something here if we had SSE3 to do the add, I think.) yy[0] = yy[1] yy[1] = result of sum. Don't know if this is faster or slower. This will change results slightly because we do everything in single precision.
Raymond Toy
Comment 9 2012-01-05 11:20:26 PST
Oops. In the test script, you need to uncomment the first 5 commented lines in runTest() when running the layout test. You can use the script as is when running in a browser (provided the other scripts are available).
Chris Rogers
Comment 10 2012-01-05 11:34:28 PST
(In reply to comment #8) > Don't know if this is faster or slower. This will change results slightly because we do everything in single precision. I'd like to keep everything in double-precision because of filter stability problems with certain types of filters.
Xingnan Wang
Comment 11 2012-01-05 17:31:21 PST
(In reply to comment #6) > (In reply to comment #5) > > (In reply to comment #3) > > > Xingnan do you know how to write a layout test. If not, maybe Raymond can help you there, since he's just starting to write them himself. > > > > I did not write layout test before but very interest in it, I would like to run the layout test, it`s very pleasure if Raymond can help. Thanks. > > I would be my pleasure to write a simple layout test that you can start with. That`s so nice if there is a sample, thanks Raymond.
Xingnan Wang
Comment 12 2012-01-05 17:59:50 PST
(In reply to comment #8) > (From update of attachment 121082 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=121082&action=review > > > Source/WebCore/platform/audio/Biquad.cpp:90 > > +#ifdef __SSE2__ > > Did you test this on windows? I think the asm syntax below only works for gcc, so this conditionalization should test for gcc too. (But I guess on windows, __SSE2__ is probably never defined.) Yes, there is another macro in windows. I just started with linux support only, in case of any potential issues. I can add the windows support and do the test. > > > Source/WebCore/platform/audio/Biquad.cpp:93 > > + __asm__( > > I am far from an expert in sse2, but this seems rather complex. Could we do something like this? > > Create array yy of length 2 (or maybe use the destination array directly?): > yy[0] = y > yy[1] = y1 > > load xmm0 with (b0 b1 b2 0) > load xmm1 with (-a0 -a1 0 0) > load xmm2 from *source to get (x[0] x[1] x[2] x[3]) > xmm2 = xmm2 * xmm0 to get (b0*x0 b1*x1 b2*x2 0) > load xmm3 from yy to get (y0 y1 junk junk) > xmm3 = xmm3 * xmm1 to get (-a0*y0 a1*y1 0 0) > xmm3 = xmm3 + xmm2 to get (b0*x0-a0*y0, b1*x1-a1*y1, b2*x2, 0) > > Extract each part of xmm3 and add them together. (We could gain something here if we had SSE3 to do the add, I think.) > > yy[0] = yy[1] > yy[1] = result of sum. > > Don't know if this is faster or slower. This will change results slightly because we do everything in single precision. Your solution may work in single precision and I didn`t try. I have tried this way in double precision but no too much improvement. Then I changed to use inline asm so that I can pipeline the instructions, also get the benefit from SSE2.
Raymond Toy
Comment 13 2012-01-11 09:54:53 PST
Comment on attachment 121082 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=121082&action=review >>> Source/WebCore/platform/audio/Biquad.cpp:93 >>> + __asm__( >> >> I am far from an expert in sse2, but this seems rather complex. Could we do something like this? >> >> Create array yy of length 2 (or maybe use the destination array directly?): >> yy[0] = y >> yy[1] = y1 >> >> load xmm0 with (b0 b1 b2 0) >> load xmm1 with (-a0 -a1 0 0) >> load xmm2 from *source to get (x[0] x[1] x[2] x[3]) >> xmm2 = xmm2 * xmm0 to get (b0*x0 b1*x1 b2*x2 0) >> load xmm3 from yy to get (y0 y1 junk junk) >> xmm3 = xmm3 * xmm1 to get (-a0*y0 a1*y1 0 0) >> xmm3 = xmm3 + xmm2 to get (b0*x0-a0*y0, b1*x1-a1*y1, b2*x2, 0) >> >> Extract each part of xmm3 and add them together. (We could gain something here if we had SSE3 to do the add, I think.) >> >> yy[0] = yy[1] >> yy[1] = result of sum. >> >> Don't know if this is faster or slower. This will change results slightly because we do everything in single precision. > > Your solution may work in single precision and I didn`t try. > I have tried this way in double precision but no too much improvement. Then I changed to use inline asm so that I can pipeline the instructions, also get the benefit from SSE2. This looks fine then especially considering Chris's comment about keeping things in double for stability reasons.
Raymond Toy
Comment 14 2012-01-12 11:10:25 PST
If you are not in a hurry, bug 71413 will include a bunch of tests for the biquad filters so you do not have to write any.
Xingnan Wang
Comment 15 2012-01-13 00:51:05 PST
(In reply to comment #14) > If you are not in a hurry, bug 71413 will include a bunch of tests for the biquad filters so you do not have to write any. That will be nice. I was going to update the patch with the low pass filter layout test, but if bug 71413 provides all the tests, I will wait for it.
Xingnan Wang
Comment 16 2012-02-09 23:44:37 PST
Created attachment 126465 [details] Patch Update the patch as layout tests of biquad filter landed, it`s OK to pass all these tests.
WebKit Review Bot
Comment 17 2012-02-10 01:20:22 PST
Comment on attachment 126465 [details] Patch Attachment 126465 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/11486599 New failing tests: webaudio/biquad-highpass.html webaudio/biquad-lowpass.html webaudio/biquad-lowshelf.html webaudio/biquad-notch.html webaudio/biquad-highshelf.html webaudio/biquad-peaking.html webaudio/biquad-bandpass.html webaudio/biquad-allpass.html
Raymond Toy
Comment 18 2012-02-10 13:56:13 PST
Comment on attachment 126465 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=126465&action=review There also appears to be test failures on linux. Can you investigate why that is happening? > Source/WebCore/platform/audio/Biquad.cpp:121 > + "addsd %%xmm4, %%xmm5\n\t" // x*b0 + (x2*b2+x*b0) Doesn't xmm4 contain a1*y1 here so the result is a1*y1 + (x2*b2+x*b0)? > Source/WebCore/platform/audio/Biquad.cpp:137 > + :"eax", "edx", "ecx" Is it necessary to add all of the xmm registers to this clobber list?
Xingnan Wang
Comment 19 2012-02-12 23:26:19 PST
Comment on attachment 126465 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=126465&action=review >> Source/WebCore/platform/audio/Biquad.cpp:121 >> + "addsd %%xmm4, %%xmm5\n\t" // x*b0 + (x2*b2+x*b0) > > Doesn't xmm4 contain a1*y1 here so the result is a1*y1 + (x2*b2+x*b0)? Sorry for that, it should be a1*y1 + (x2*b2+x*b0). >> Source/WebCore/platform/audio/Biquad.cpp:137 >> + :"eax", "edx", "ecx" > > Is it necessary to add all of the xmm registers to this clobber list? OK, it`s better to add all the xmm registers to the clobber-list, through there is little possibility of conflict of xmm registers here.
Xingnan Wang
Comment 20 2012-02-12 23:32:34 PST
(In reply to comment #18) > (From update of attachment 126465 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=126465&action=review > > There also appears to be test failures on linux. Can you investigate why that is happening? > It`s strange that I can pass the layout tests locally, and little message can be found from the error log http://queues.webkit.org/results/11486599. I`ll try to check my patch in different machine.
Xingnan Wang
Comment 21 2012-02-13 00:03:59 PST
Raymond Toy
Comment 22 2012-02-13 11:04:32 PST
(In reply to comment #21) > Created an attachment (id=126729) [details] > Patch Looks good, except for the test failures that happened in the previous patch. I didn't see any test failures from the bots for this patch, though. Perhaps they've been fixed?
Xingnan Wang
Comment 23 2012-02-13 17:23:21 PST
(In reply to comment #22) > (In reply to comment #21) > > Created an attachment (id=126729) [details] [details] > > Patch > > Looks good, except for the test failures that happened in the previous patch. I didn't see any test failures from the bots for this patch, though. Perhaps they've been fixed? Actually I never reproduced the failures at all, just updated the patch as your comments. Perhaps the last patch would pass the tests if run it again. Whatever, does that mean the patch is OK now? BTW, how can I check the building details including these failures from the bots? Thanks.
Raymond Toy
Comment 24 2012-02-14 09:39:06 PST
(In reply to comment #23) > (In reply to comment #22) > > (In reply to comment #21) > > > Created an attachment (id=126729) [details] [details] [details] > > > Patch > > > > Looks good, except for the test failures that happened in the previous patch. I didn't see any test failures from the bots for this patch, though. Perhaps they've been fixed? > > Actually I never reproduced the failures at all, just updated the patch as your comments. Perhaps the last patch would pass the tests if run it again. Whatever, does that mean the patch is OK now? Since there are no messages from the webkit review bot with your latest change, I think it's ok now. Not sure what changes made this work.
Xingnan Wang
Comment 25 2012-02-21 05:46:42 PST
(In reply to comment #24) > (In reply to comment #23) > > (In reply to comment #22) > > > (In reply to comment #21) > > > > Created an attachment (id=126729) [details] [details] [details] [details] > > > > Patch > > > > > > Looks good, except for the test failures that happened in the previous patch. I didn't see any test failures from the bots for this patch, though. Perhaps they've been fixed? > > > > Actually I never reproduced the failures at all, just updated the patch as your comments. Perhaps the last patch would pass the tests if run it again. Whatever, does that mean the patch is OK now? > > Since there are no messages from the webkit review bot with your latest change, I think it's ok now. Not sure what changes made this work. Since the patch is OK, can it be submitted now?
Raymond Toy
Comment 26 2012-02-22 11:35:37 PST
Comment on attachment 126729 [details] Patch LGTM
Xingnan Wang
Comment 27 2012-02-27 20:58:20 PST
(In reply to comment #26) > (From update of attachment 126729 [details]) > LGTM Chris, does the patch need more review?
Xingnan Wang
Comment 28 2012-02-29 20:32:41 PST
(In reply to comment #27) > (In reply to comment #26) > > (From update of attachment 126729 [details] [details]) > > LGTM > > Chris, does the patch need more review? Whether the patch is OK to commit?
WebKit Review Bot
Comment 29 2012-03-14 14:01:51 PDT
Comment on attachment 126729 [details] Patch Clearing flags on attachment: 126729 Committed r110744: <http://trac.webkit.org/changeset/110744>
WebKit Review Bot
Comment 30 2012-03-14 14:01:57 PDT
All reviewed patches have been landed. Closing bug.
Adam Barth
Comment 31 2012-03-14 16:25:32 PDT
This patch seems to crash in debug: [49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11 base::debug::StackTrace::StackTrace() [0x851b56] base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739] 0x7f352f9d1af0 WebCore::Biquad::process() [0x256d73e] WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4] WebCore::HRTFKernel::create() [0x25ab549] WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703] WebCore::HRTFElevation::createForSubject() [0x25aa9db] WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e] WebCore::HRTFDatabase::create() [0x25a9630] WebCore::HRTFDatabaseLoader::load() [0x257382d] WebCore::databaseLoaderEntry() [0x25737ac] WTF::threadEntryPoint() [0x1f509e5] WTF::wtfThreadEntryPoint() [0x793cb8] start_thread [0x7f3531e369ca] 0x7f352fa8470d None
Adam Barth
Comment 32 2012-03-14 16:34:51 PDT
Xingnan Wang
Comment 33 2012-03-15 00:33:53 PDT
(In reply to comment #31) > This patch seems to crash in debug: > > [49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11 > base::debug::StackTrace::StackTrace() [0x851b56] > base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739] > 0x7f352f9d1af0 > WebCore::Biquad::process() [0x256d73e] > WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4] > WebCore::HRTFKernel::create() [0x25ab549] > WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703] > WebCore::HRTFElevation::createForSubject() [0x25aa9db] > WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e] > WebCore::HRTFDatabase::create() [0x25a9630] > WebCore::HRTFDatabaseLoader::load() [0x257382d] > WebCore::databaseLoaderEntry() [0x25737ac] > WTF::threadEntryPoint() [0x1f509e5] > WTF::wtfThreadEntryPoint() [0x793cb8] > start_thread [0x7f3531e369ca] > 0x7f352fa8470d > None Hi, That`s strange that I cannot reproduce the issue (by several different machines). Could you provide more details about your platform?
Adam Barth
Comment 34 2012-03-15 01:03:40 PDT
It was the chromium-mac-linux configuration, which is Ubuntu Lucid.
Xingnan Wang
Comment 35 2012-03-15 08:38:09 PDT
(In reply to comment #34) > It was the chromium-mac-linux configuration, which is Ubuntu Lucid. I am confused by chromium-mac-linux, should not it be either chromium-linux or chromium-mac? The patch only acts in linux, I`ve tested in Ubuntu oneiric and it works very well.
Raymond Toy
Comment 36 2012-03-15 09:26:08 PDT
(In reply to comment #33) > (In reply to comment #31) > > This patch seems to crash in debug: > > > > [49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11 > > base::debug::StackTrace::StackTrace() [0x851b56] > > base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739] > > 0x7f352f9d1af0 > > WebCore::Biquad::process() [0x256d73e] > > WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4] > > WebCore::HRTFKernel::create() [0x25ab549] > > WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703] > > WebCore::HRTFElevation::createForSubject() [0x25aa9db] > > WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e] > > WebCore::HRTFDatabase::create() [0x25a9630] > > WebCore::HRTFDatabaseLoader::load() [0x257382d] > > WebCore::databaseLoaderEntry() [0x25737ac] > > WTF::threadEntryPoint() [0x1f509e5] > > WTF::wtfThreadEntryPoint() [0x793cb8] > > start_thread [0x7f3531e369ca] > > 0x7f352fa8470d > > None > > Hi, > That`s strange that I cannot reproduce the issue (by several different machines). Could you provide more details about your platform? The stack trace looks totally bogus. (I'm pretty sure the HRTFKernel doesn't call Biquad). But I can reproduce the crash on my system. The stack trace is totally trash too. More work needed....
Raymond Toy
Comment 37 2012-03-15 10:39:35 PDT
(In reply to comment #36) > (In reply to comment #33) > > (In reply to comment #31) > > > This patch seems to crash in debug: > > > > > > [49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11 > > > base::debug::StackTrace::StackTrace() [0x851b56] > > > base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739] > > > 0x7f352f9d1af0 > > > WebCore::Biquad::process() [0x256d73e] > > > WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4] > > > WebCore::HRTFKernel::create() [0x25ab549] > > > WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703] > > > WebCore::HRTFElevation::createForSubject() [0x25aa9db] > > > WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e] > > > WebCore::HRTFDatabase::create() [0x25a9630] > > > WebCore::HRTFDatabaseLoader::load() [0x257382d] > > > WebCore::databaseLoaderEntry() [0x25737ac] > > > WTF::threadEntryPoint() [0x1f509e5] > > > WTF::wtfThreadEntryPoint() [0x793cb8] > > > start_thread [0x7f3531e369ca] > > > 0x7f352fa8470d > > > None > > > > Hi, > > That`s strange that I cannot reproduce the issue (by several different machines). Could you provide more details about your platform? > > The stack trace looks totally bogus. (I'm pretty sure the HRTFKernel doesn't call Biquad). > > But I can reproduce the crash on my system. The stack trace is totally trash too. More work needed.... Ok. The issue is that on my debug build, I get a 64-bit build so loading the source address into edx loses the top 32 bits, which are non-zero. We then try to load from that non-existent address and crash. Don't understand why this doesn't happen on a release build, but this is definitely a bug.
Raymond Toy
Comment 38 2012-03-15 11:51:01 PDT
Created attachment 132095 [details] Updated patch to handle 32 and 64-bit builds Here is an updated patch. This passes all of the webaudio tests in debug mode. (Release mode test in progress.) This basically replaces all uses of the edx and ecx registers with their 64-bit counterparts, rdx and rcx. However, the complexity of maintaining this assembly code is getting rather large, so I'm inclined not to do this. It would be much better if intrinsics could be used, or, even better, do what the FIXME comment says and unroll the while loop with carefully scheduled arithmetic (in C) if that can achieve the desired performance gains.
Adam Barth
Comment 39 2012-03-15 13:33:56 PDT
> I am confused by chromium-mac-linux, should not it be either chromium-linux or chromium-mac? The patch only acts in linux, I`ve tested in Ubuntu oneiric and it works very well. Sorry. I meant chromium-linux. :)
Xingnan Wang
Comment 40 2012-03-15 19:17:59 PDT
(In reply to comment #38) > Created an attachment (id=132095) [details] > Updated patch to handle 32 and 64-bit builds > > Here is an updated patch. This passes all of the webaudio tests in debug mode. (Release mode test in progress.) > > This basically replaces all uses of the edx and ecx registers with their 64-bit counterparts, rdx and rcx. > > However, the complexity of maintaining this assembly code is getting rather large, so I'm inclined not to do this. It would be much better if intrinsics could be used, or, even better, do what the FIXME comment says and unroll the while loop with carefully scheduled arithmetic (in C) if that can achieve the desired performance gains. Hi Ray, thank you very much for fixing this bug. I agree that we should not make the code here to get too large. I have tried some ways to use intrinsics and C code to unroll the loop, but they could not achieve the performance increased by assembly code, because it was not easy to pipeline and reschedule the executing order of the code as assembly did. But I can make a try to find whether there is a better way to optimize the code with intrinsics.
Xingnan Wang
Comment 41 2012-03-18 20:18:12 PDT
Xingnan Wang
Comment 42 2012-03-18 20:23:38 PDT
(In reply to comment #41) > Created an attachment (id=132532) [details] > Patch Re-implemented the pipeline by intrinsics. What surprised me is the performance is much better than assembly code, ~45% with original ~20%(for 128 frames size).
Raymond Toy
Comment 43 2012-03-19 12:13:27 PDT
Comment on attachment 132532 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=132532&action=review This looks much nicer than the assembly code, and it's faster too! Code looks good, but adding some comments would help to understand what's going on. > Source/WebCore/platform/audio/Biquad.cpp:113 > + __m128d mma1 = _mm_set_sd(-a1); mma1 = (0, -a1) > Source/WebCore/platform/audio/Biquad.cpp:117 > + mm6 = mm1; Add comment mm6 = (y2, x) > Source/WebCore/platform/audio/Biquad.cpp:118 > + mm1 = _mm_shuffle_pd(mm1, mm4, 0); // y2 = y1 Add comment that mm1 = (y1, x) > Source/WebCore/platform/audio/Biquad.cpp:119 > + mm5 = _mm_mul_pd(mm2, mm0); // (x1*b1 x2*b2) Insert comma: (x1 * b1, x2 * b2) Webkit style includes a space around arithmetic operators. Do the same for next line. > Source/WebCore/platform/audio/Biquad.cpp:121 > + mm0 = _mm_shuffle_pd(mm0, mm1, 1); // (x1 = x, x2 = x1) Comment is confusing. Maybe say mm0 = (x, x1) > Source/WebCore/platform/audio/Biquad.cpp:122 > + mm4 = _mm_mul_sd(mm4, mma1); // -y1*a1 Be more explicit about contents of mm4. mm4 = (0, -y1 * a1) > Source/WebCore/platform/audio/Biquad.cpp:123 > + mm5 = _mm_add_pd(mm5, mm6); // (x1*b1-y2*a2 x2*b2+x*b0) Insert comma, add space around operators: (x1 * b1 - y2 * a2, x2 * b2 + x * b0) > Source/WebCore/platform/audio/Biquad.cpp:124 > + n--; Why decrement n here? Is this a pipeline issue? If not, can we put the decrement back into while (n--)? > Source/WebCore/platform/audio/Biquad.cpp:125 > + mm6 = mm5; mm6 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0) > Source/WebCore/platform/audio/Biquad.cpp:126 > + mm7 = _mm_load_ss(sourceP); mm7 = (0, 0, 0, *sourceP) And we're loading the next x value. > Source/WebCore/platform/audio/Biquad.cpp:127 > + mm1 = _mm_cvtss_sd(mm1, mm7); // x = *sourceP Maybe say exactly what is in mm1? mm1 = (y1, x) > Source/WebCore/platform/audio/Biquad.cpp:128 > + mm5 = _mm_add_pd(mm4, mm5); // (x1*b1-y2*a2 x2*b2+x*b0-y1*a1) Same comment as for line 123, 119. Any advantage in doing mm_add_pd? It looks like we don't care about the high part of mm5. Is that right? > Source/WebCore/platform/audio/Biquad.cpp:129 > + mm6 = _mm_shuffle_pd(mm6, mm6, 1); mm6 = (x2 * b2 + x * b0, x1 * b1 - y2 * a2) > Source/WebCore/platform/audio/Biquad.cpp:130 > + mm5 = _mm_add_pd(mm5, mm6); // y mm5 = (x1*b1-y2*a2+x2*b2+x*b0, y = x2*b2+x*b0-y1*a1+x1*b1-y2*a2), and we ignore high part of mm5, so maybe just mm_add_sd? > Source/WebCore/platform/audio/Biquad.cpp:131 > + mm7 = _mm_cvtsd_ss(mm7, mm5); mm7 = (0, 0, 0, y) > Source/WebCore/platform/audio/Biquad.cpp:133 > + mm4 = mm5; // y1 = y mm4 = (x1*b1-y2*a2+x2*b2+x*b0, y)
Xingnan Wang
Comment 44 2012-03-19 22:34:45 PDT
Xingnan Wang
Comment 45 2012-03-19 22:44:34 PDT
Comment on attachment 132532 [details] Patch Hi Ray, patch is updated as your comments. View in context: https://bugs.webkit.org/attachment.cgi?id=132532&action=review >> Source/WebCore/platform/audio/Biquad.cpp:113 >> + __m128d mma1 = _mm_set_sd(-a1); > > mma1 = (0, -a1) Done. >> Source/WebCore/platform/audio/Biquad.cpp:117 >> + mm6 = mm1; > > Add comment mm6 = (y2, x) Done. >> Source/WebCore/platform/audio/Biquad.cpp:118 >> + mm1 = _mm_shuffle_pd(mm1, mm4, 0); // y2 = y1 > > Add comment that mm1 = (y1, x) Done. >> Source/WebCore/platform/audio/Biquad.cpp:119 >> + mm5 = _mm_mul_pd(mm2, mm0); // (x1*b1 x2*b2) > > Insert comma: (x1 * b1, x2 * b2) > Webkit style includes a space around arithmetic operators. Do the same for next line. Done. >> Source/WebCore/platform/audio/Biquad.cpp:121 >> + mm0 = _mm_shuffle_pd(mm0, mm1, 1); // (x1 = x, x2 = x1) > > Comment is confusing. Maybe say mm0 = (x, x1) Done. >> Source/WebCore/platform/audio/Biquad.cpp:122 >> + mm4 = _mm_mul_sd(mm4, mma1); // -y1*a1 > > Be more explicit about contents of mm4. > > mm4 = (0, -y1 * a1) Done. >> Source/WebCore/platform/audio/Biquad.cpp:123 >> + mm5 = _mm_add_pd(mm5, mm6); // (x1*b1-y2*a2 x2*b2+x*b0) > > Insert comma, add space around operators: > > (x1 * b1 - y2 * a2, x2 * b2 + x * b0) Done. >> Source/WebCore/platform/audio/Biquad.cpp:124 >> + n--; > > Why decrement n here? Is this a pipeline issue? If not, can we put the decrement back into while (n--)? This will impact the pipeline, using "while(n--)" pulls down the performance a little, 45% to 43%. >> Source/WebCore/platform/audio/Biquad.cpp:125 >> + mm6 = mm5; > > mm6 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0) Done. >> Source/WebCore/platform/audio/Biquad.cpp:126 >> + mm7 = _mm_load_ss(sourceP); > > mm7 = (0, 0, 0, *sourceP) > > And we're loading the next x value. Done. >> Source/WebCore/platform/audio/Biquad.cpp:128 >> + mm5 = _mm_add_pd(mm4, mm5); // (x1*b1-y2*a2 x2*b2+x*b0-y1*a1) > > Same comment as for line 123, 119. > > Any advantage in doing mm_add_pd? It looks like we don't care about the high part of mm5. Is that right? Yes, _mm_add_sd is better. >> Source/WebCore/platform/audio/Biquad.cpp:129 >> + mm6 = _mm_shuffle_pd(mm6, mm6, 1); > > mm6 = (x2 * b2 + x * b0, x1 * b1 - y2 * a2) Done. >> Source/WebCore/platform/audio/Biquad.cpp:130 >> + mm5 = _mm_add_pd(mm5, mm6); // y > > mm5 = (x1*b1-y2*a2+x2*b2+x*b0, y = x2*b2+x*b0-y1*a1+x1*b1-y2*a2), and we ignore high part of mm5, so maybe just mm_add_sd? Done. >> Source/WebCore/platform/audio/Biquad.cpp:133 >> + mm4 = mm5; // y1 = y > > mm4 = (x1*b1-y2*a2+x2*b2+x*b0, y) Done.
Raymond Toy
Comment 46 2012-03-20 10:00:27 PDT
Comment on attachment 132760 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review Code looks good; just a few more minor items to look at. > Source/WebCore/platform/audio/Biquad.cpp:109 > + __m128d mm4 = _mm_set_pd(0, y1); // mm4 = (0, y1) It looks like we only care about the the low part of mm4. Maybe add comment that we only use the low part of mm4. Same goes for mma1 in line 113. > Source/WebCore/platform/audio/Biquad.cpp:122 > + mm4 = _mm_mul_sd(mm4, mma1); // mm4 = (0, -y1 * a1) I think the comment should be mm4=-y1*a1 since we're doing mul_sd. We don't care what's in the high part of mm4. > Source/WebCore/platform/audio/Biquad.cpp:125 > + mm6 = mm5; // mm6 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0) See note on for line 130. Shuffle mm5 to the right parts of mm6 now instead of later? > Source/WebCore/platform/audio/Biquad.cpp:126 > + mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x) We load mm7 from memory here, but use it in the very next instruction. Could this be moved further up to lessen any pipeline issues? We're not using the value of mm7 until line 127. > Source/WebCore/platform/audio/Biquad.cpp:128 > + mm5 = _mm_add_sd(mm4, mm5); // mm5 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0 - y1 * a1) Since we're doing mm_add_sd, we don't care what's in the high part of mm5. Maybe the comment should be mm5 = x2*b2 + x*b0-y1*a1. > Source/WebCore/platform/audio/Biquad.cpp:130 > + mm5 = _mm_add_sd(mm5, mm6); // mm5 = (x1 * b1 - y2 * a2, x * b0 + x1 * b1 + x2 * b2 - y1 * a1 - y2 * a2) = (*, y) Same comment here as for line 128. We don't care what's in the high part of mm5. Also, in line 129, we swap the high and low parts of mm6. Isn't it possible to do that in line 125 where we assign mm5 to mm6? Just shuffle mm5 to the right parts of mm6? That's one less instruction, but I don't know what the pipeline affects will be. > Source/WebCore/platform/audio/Biquad.cpp:131 > + mm7 = _mm_cvtsd_ss(mm7, mm5); // mm7 = (0, 0, 0, y) Actually, I think the comment is wrong. (Sorry about that.) cvtsd_ss doesn't modify the other parts of mm7. Plus we're only using the low part, maybe just say mm7 = static_cast<float>(y).
Xingnan Wang
Comment 47 2012-03-20 20:12:57 PDT
Xingnan Wang
Comment 48 2012-03-20 20:36:18 PDT
Comment on attachment 132760 [details] Patch Thanks, Ray. Update the patch. View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review >> Source/WebCore/platform/audio/Biquad.cpp:126 >> + mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x) > > We load mm7 from memory here, but use it in the very next instruction. Could this be moved further up to lessen any pipeline issues? We're not using the value of mm7 until line 127. There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline. >> Source/WebCore/platform/audio/Biquad.cpp:130 >> + mm5 = _mm_add_sd(mm5, mm6); // mm5 = (x1 * b1 - y2 * a2, x * b0 + x1 * b1 + x2 * b2 - y1 * a1 - y2 * a2) = (*, y) > > Same comment here as for line 128. We don't care what's in the high part of mm5. > > Also, in line 129, we swap the high and low parts of mm6. Isn't it possible to do that in line 125 where we assign mm5 to mm6? Just shuffle mm5 to the right parts of mm6? That's one less instruction, but I don't know what the pipeline affects will be. It seems we should do that and make us to save one instruction, but it pulls down the performance after that, strange~ I guess the compiler does much for us when we use intrinsic, the actual assembly code looks much different from what I expect for after "gcc -s " .
Raymond Toy
Comment 49 2012-03-21 09:17:07 PDT
(In reply to comment #48) > (From update of attachment 132760 [details]) > Thanks, Ray. Update the patch. > > View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review > > >> Source/WebCore/platform/audio/Biquad.cpp:126 > >> + mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x) > > > > We load mm7 from memory here, but use it in the very next instruction. Could this be moved further up to lessen any pipeline issues? We're not using the value of mm7 until line 127. > > There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline. > > >> Source/WebCore/platform/audio/Biquad.cpp:130 > >> + mm5 = _mm_add_sd(mm5, mm6); // mm5 = (x1 * b1 - y2 * a2, x * b0 + x1 * b1 + x2 * b2 - y1 * a1 - y2 * a2) = (*, y) > > > > Same comment here as for line 128. We don't care what's in the high part of mm5. > > > > Also, in line 129, we swap the high and low parts of mm6. Isn't it possible to do that in line 125 where we assign mm5 to mm6? Just shuffle mm5 to the right parts of mm6? That's one less instruction, but I don't know what the pipeline affects will be. > > It seems we should do that and make us to save one instruction, but it pulls down the performance after that, strange~ > I guess the compiler does much for us when we use intrinsic, the actual assembly code looks much different from what I expect for after "gcc -s " . Ok. I'm happy with this patch as it is now. Looks good. I do wonder, though, if you moved load_ss to be after line 124 and changed line 125 to do the shuffle (removing the shuffle at line 129) would improve performance. Or maybe move the (new) shuffle down a few lines since we don't need mm6 until line 130. Perhaps these changes will help the pipeline. You don't have to do this experiment if you don't want to. I'm just kind of curious. Otherwise, the patch looks good to me and is ready to go.
Chris Rogers
Comment 50 2012-03-21 12:38:13 PDT
Does this latest patch pass the layout test (and not crash) on the Debug build?
Xingnan Wang
Comment 51 2012-03-22 02:41:26 PDT
Comment on attachment 132760 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review >>>> Source/WebCore/platform/audio/Biquad.cpp:126 >>>> + mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x) >>> >>> We load mm7 from memory here, but use it in the very next instruction. Could this be moved further up to lessen any pipeline issues? We're not using the value of mm7 until line 127. >> >> There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline. > > Ok. I'm happy with this patch as it is now. Looks good. > > I do wonder, though, if you moved load_ss to be after line 124 and changed line 125 to do the shuffle (removing the shuffle at line 129) would improve performance. Or maybe move the (new) shuffle down a few lines since we don't need mm6 until line 130. Perhaps these changes will help the pipeline. > > You don't have to do this experiment if you don't want to. I'm just kind of curious. Otherwise, the patch looks good to me and is ready to go. It does not matter, I did it. But it seems it does not work, a tiny slowing down. Hard to explain, we cannot exactly schedule the pipeline by intrinsics as the way by assembly, I guess.
Xingnan Wang
Comment 52 2012-03-22 02:52:28 PDT
(In reply to comment #50) > Does this latest patch pass the layout test (and not crash) on the Debug build? The layout test passed in both debug and release build from my environment (32-bit). Last crash issue`s root cause is I used 32-bit registers for the address and it will failed in 64-bit system. I use intrinsics this time and suppose it would not meet this issue. Certainly more tests are needed, but I don`t have 64-bit system right now. Ray, would you mind to help me to check whether the patch is OK in your 64-bit system?
Raymond Toy
Comment 53 2012-03-22 10:28:13 PDT
(In reply to comment #52) > (In reply to comment #50) > > Does this latest patch pass the layout test (and not crash) on the Debug build? > > The layout test passed in both debug and release build from my environment (32-bit). > Last crash issue`s root cause is I used 32-bit registers for the address and it will failed in 64-bit system. I use intrinsics this time and suppose it would not meet this issue. Certainly more tests are needed, but I don`t have 64-bit system right now. I don't expect any issues with 64-bit builds. The compiler should use the right sized registers for the intrinsic operations. > > Ray, would you mind to help me to check whether the patch is OK in your 64-bit system? No problem. The debug build passes all tests, except for the convolution-mono-mono test, but I think that's a bug in the convolver that needs to be fixed. The release build passes all tests too. (convolution-mono-mono passes.)
Raymond Toy
Comment 54 2012-03-22 10:30:29 PDT
(In reply to comment #51) > (From update of attachment 132760 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review > > >>>> Source/WebCore/platform/audio/Biquad.cpp:126 > >>>> + mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x) > >>> > >>> We load mm7 from memory here, but use it in the very next instruction. Could this be moved further up to lessen any pipeline issues? We're not using the value of mm7 until line 127. > >> > >> There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline. > > > > Ok. I'm happy with this patch as it is now. Looks good. > > > > I do wonder, though, if you moved load_ss to be after line 124 and changed line 125 to do the shuffle (removing the shuffle at line 129) would improve performance. Or maybe move the (new) shuffle down a few lines since we don't need mm6 until line 130. Perhaps these changes will help the pipeline. > > > > You don't have to do this experiment if you don't want to. I'm just kind of curious. Otherwise, the patch looks good to me and is ready to go. > > It does not matter, I did it. But it seems it does not work, a tiny slowing down. Hard to explain, we cannot exactly schedule the pipeline by intrinsics as the way by assembly, I guess. Thanks for doing the test. I would have thought there might be some small gain, but the compiler can rearrange things as you say. So I think this new patch is ready to go as is. Thanks for doing the new version!
Xingnan Wang
Comment 55 2012-03-22 19:58:56 PDT
(In reply to comment #53) > (In reply to comment #52) > > (In reply to comment #50) > > > Does this latest patch pass the layout test (and not crash) on the Debug build? > > > > The layout test passed in both debug and release build from my environment (32-bit). > > Last crash issue`s root cause is I used 32-bit registers for the address and it will failed in 64-bit system. I use intrinsics this time and suppose it would not meet this issue. Certainly more tests are needed, but I don`t have 64-bit system right now. > > I don't expect any issues with 64-bit builds. The compiler should use the right sized registers for the intrinsic operations. > > > > > Ray, would you mind to help me to check whether the patch is OK in your 64-bit system? > > No problem. The debug build passes all tests, except for the convolution-mono-mono test, but I think that's a bug in the convolver that needs to be fixed. > > The release build passes all tests too. (convolution-mono-mono passes.) Ray, thank you for doing the tests.
Brent Fulgham
Comment 56 2012-11-14 17:42:33 PST
If this patch looked good back in March, why hasn't it been approved or landed?
Brent Fulgham
Comment 57 2012-11-15 17:10:38 PST
Comment on attachment 132954 [details] Patch I'm landing based on prior reviewers comments. It seems like it was simply lost in the shuffle at some point.
WebKit Review Bot
Comment 58 2012-11-15 17:20:24 PST
Comment on attachment 132954 [details] Patch Clearing flags on attachment: 132954 Committed r134867: <http://trac.webkit.org/changeset/134867>
WebKit Review Bot
Comment 59 2012-11-15 17:20:31 PST
All reviewed patches have been landed. Closing bug.
WebKit Review Bot
Comment 60 2012-11-16 11:43:33 PST
Re-opened since this is blocked by bug 102544
Brent Fulgham
Comment 61 2012-11-16 11:45:41 PST
Reopened because this change broke one of the security tests. See https://bugs.webkit.org/show_bug.cgi?id=102524 for details.
Brent Fulgham
Comment 62 2012-11-16 11:47:12 PST
danceoffwithyourpantsoff
Comment 63 2013-07-09 10:24:55 PDT
I might be completely wrong (I just happened to skim past this patch), but shouldn't the SSE2 code be a runtime check? The #ifdef SSE2 is only good for knowing if you can use the intrinsics or not (if the compiler supports SSE2). I don't know the current standing expectation of having SSE2 support is, but last I remember it wasn't considered a requirement: https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/66p8sJtluJA
Raymond Toy
Comment 64 2013-07-09 10:43:54 PDT
(In reply to comment #63) > I might be completely wrong (I just happened to skim past this patch), but shouldn't the SSE2 code be a runtime check? The #ifdef SSE2 is only good for knowing if you can use the intrinsics or not (if the compiler supports SSE2). I don't know the current standing expectation of having SSE2 support is, but last I remember it wasn't considered a requirement: > > https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/66p8sJtluJA I think you are correct. It's been a while since I looked, but I think on Linux and Windows the compiler flag that enables SSE2 support is not set, so the SSE2 code is never compiled in. On Mac, SSE2 has always been available on x86 machines and I think SSE2 code is compiled by default. There should be no issues with this code. In any case, this particular patch was rolled out. Ideally, we would like a runtime check and use the SSE2 code when available, falling back to the C reference when SSE2 is not available.
Note You need to log in before you can comment on or make changes to this bug.