Bug 75528 - Optimize the multiply-add in Biquad.cpp::process
: Optimize the multiply-add in Biquad.cpp::process
Status: UNCONFIRMED
: WebKit
Web Audio
: 528+ (Nightly build)
: Unspecified Unspecified
: P2 Normal
Assigned To:
:
:
: 81168 102524 102544
:
  Show dependency treegraph
 
Reported: 2012-01-04 01:04 PST by
Modified: 2013-07-09 10:43 PST (History)


Attachments
Patch (4.30 KB, patch)
2012-01-04 01:17 PST, Xingnan Wang
no flags Review Patch | Details | Formatted Diff | Diff
Biquad layout test script (4.08 KB, text/html)
2012-01-05 10:45 PST, Raymond Toy
no flags Details
Patch (4.10 KB, patch)
2012-02-09 23:44 PST, Xingnan Wang
webkit.review.bot: commit‑queue-
Review Patch | Details | Formatted Diff | Diff
Patch (4.11 KB, patch)
2012-02-13 00:03 PST, Xingnan Wang
no flags Review Patch | Details | Formatted Diff | Diff
Updated patch to handle 32 and 64-bit builds (4.31 KB, patch)
2012-03-15 11:51 PST, Raymond Toy
no flags Review Patch | Details | Formatted Diff | Diff
Patch (3.17 KB, patch)
2012-03-18 20:18 PST, Xingnan Wang
no flags Review Patch | Details | Formatted Diff | Diff
Patch (3.63 KB, patch)
2012-03-19 22:34 PST, Xingnan Wang
no flags Review Patch | Details | Formatted Diff | Diff
Patch (3.68 KB, patch)
2012-03-20 20:12 PST, Xingnan Wang
no flags Review Patch | Details | Formatted Diff | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2012-01-04 01:04:07 PST
Optimize the multiply-add by piplining with SSE2 instructions.
------- Comment #1 From 2012-01-04 01:17:37 PST -------
Created an attachment (id=121082) [details]
Patch
------- Comment #2 From 2012-01-04 18:04:48 PST -------
Thanks for the optimization!  Any idea how much improvement this gives?

I'm adding Raymond as reviewer here, and think we should get at least a very simple Biquad layout test written to validate both the existing code, and your new code.  The layout test can do the following:

1. create a BiquadFilterNode, configure it as lowpass at a certain cutoff frequency
2. feed an impulse into the filter in an OfflineAudioContext
3. compare the result with the expected output using JavaScript to generate the expected response (we can borrow the math from Biquad.cpp and convert to JS.
------- Comment #3 From 2012-01-04 18:06:30 PST -------
Xingnan do you know how to write a layout test.  If not, maybe Raymond can help you there, since he's just starting to write them himself.
------- Comment #4 From 2012-01-04 18:12:11 PST -------
(In reply to comment #2)
> Thanks for the optimization!  Any idea how much improvement this gives?
> 
From my test, it boosted at least 20%(for 128 frames size), and reached 40% with larger size(like 1280).
> I'm adding Raymond as reviewer here, and think we should get at least a very simple Biquad layout test written to validate both the existing code, and your new code.  The layout test can do the following:
> 
> 1. create a BiquadFilterNode, configure it as lowpass at a certain cutoff frequency
> 2. feed an impulse into the filter in an OfflineAudioContext
> 3. compare the result with the expected output using JavaScript to generate the expected response (we can borrow the math from Biquad.cpp and convert to JS.
------- Comment #5 From 2012-01-04 18:16:41 PST -------
(In reply to comment #3)
> Xingnan do you know how to write a layout test.  If not, maybe Raymond can help you there, since he's just starting to write them himself.

I did not write layout test before but very interest in it, I would like to run the layout test, it`s very pleasure if Raymond can help. Thanks.
------- Comment #6 From 2012-01-05 08:53:24 PST -------
(In reply to comment #5)
> (In reply to comment #3)
> > Xingnan do you know how to write a layout test.  If not, maybe Raymond can help you there, since he's just starting to write them himself.
> 
> I did not write layout test before but very interest in it, I would like to run the layout test, it`s very pleasure if Raymond can help. Thanks.

I would be my pleasure to write a simple layout test that you can start with.
------- Comment #7 From 2012-01-05 10:45:40 PST -------
Created an attachment (id=121300) [details]
Biquad layout test script

Here is a start at a test of the biquad (lowpass) filter.

Place this file in the third_party/WebKit/LayoutTests/webaudio directory.  A biquad-test-expected.txt file needs to be created that contains the expected output.

To run the layout test, see http://www.chromium.org/developers/testing/webkit-layout-tests.

Perhaps it is enough to test just the lowpass filter for now.  But eventually, I think we need a test for all the filters.  In that case biquad-test.html should be broken up into a common piece that contains the test infrastructure and a probably several tests that create the appropriate filter tests.

Should probably follow the style of convolution-mono-mono.html tests.
------- Comment #8 From 2012-01-05 11:07:10 PST -------
(From update of attachment 121082 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=121082&action=review

> Source/WebCore/platform/audio/Biquad.cpp:90
> +#ifdef __SSE2__

Did you test this on windows?  I think the asm syntax below only works for gcc, so this conditionalization should test for gcc too.  (But I guess on windows, __SSE2__ is probably never defined.)

> Source/WebCore/platform/audio/Biquad.cpp:93
> +    __asm__(

I am far from an expert in sse2, but this seems rather complex.  Could we do something like this?

Create array yy of length 2 (or maybe use the destination array directly?):
yy[0] = y
yy[1] = y1

load xmm0 with (b0 b1 b2 0)
load xmm1 with (-a0 -a1 0 0)
load xmm2 from *source to get (x[0] x[1] x[2] x[3])
xmm2 = xmm2 * xmm0 to get (b0*x0 b1*x1 b2*x2 0)
load xmm3 from yy to get (y0 y1 junk junk)
xmm3 = xmm3 * xmm1 to get (-a0*y0 a1*y1 0 0)
xmm3 = xmm3 + xmm2 to get (b0*x0-a0*y0, b1*x1-a1*y1, b2*x2, 0)

Extract each part of xmm3 and add them together.  (We could gain something here if we had SSE3 to do the add, I think.)

yy[0] = yy[1]
yy[1] = result of sum.

Don't know if this is faster or slower.  This will change results slightly because we do everything in single precision.
------- Comment #9 From 2012-01-05 11:20:26 PST -------
Oops.  In the test script, you need to uncomment the first 5 commented lines in runTest() when running the layout test.  You can use the script as is when running in a browser (provided the other scripts are available).
------- Comment #10 From 2012-01-05 11:34:28 PST -------
(In reply to comment #8)
> Don't know if this is faster or slower.  This will change results slightly because we do everything in single precision.

I'd like to keep everything in double-precision because of filter stability problems with certain types of filters.
------- Comment #11 From 2012-01-05 17:31:21 PST -------
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #3)
> > > Xingnan do you know how to write a layout test.  If not, maybe Raymond can help you there, since he's just starting to write them himself.
> > 
> > I did not write layout test before but very interest in it, I would like to run the layout test, it`s very pleasure if Raymond can help. Thanks.
> 
> I would be my pleasure to write a simple layout test that you can start with.

That`s so nice if there is a sample, thanks Raymond.
------- Comment #12 From 2012-01-05 17:59:50 PST -------
(In reply to comment #8)
> (From update of attachment 121082 [details] [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=121082&action=review
> 
> > Source/WebCore/platform/audio/Biquad.cpp:90
> > +#ifdef __SSE2__
> 
> Did you test this on windows?  I think the asm syntax below only works for gcc, so this conditionalization should test for gcc too.  (But I guess on windows, __SSE2__ is probably never defined.)
Yes, there is another macro in windows. I just started with linux support only, in case of any potential issues. I can add the windows support and do the test.
> 
> > Source/WebCore/platform/audio/Biquad.cpp:93
> > +    __asm__(
> 
> I am far from an expert in sse2, but this seems rather complex.  Could we do something like this?
> 
> Create array yy of length 2 (or maybe use the destination array directly?):
> yy[0] = y
> yy[1] = y1
> 
> load xmm0 with (b0 b1 b2 0)
> load xmm1 with (-a0 -a1 0 0)
> load xmm2 from *source to get (x[0] x[1] x[2] x[3])
> xmm2 = xmm2 * xmm0 to get (b0*x0 b1*x1 b2*x2 0)
> load xmm3 from yy to get (y0 y1 junk junk)
> xmm3 = xmm3 * xmm1 to get (-a0*y0 a1*y1 0 0)
> xmm3 = xmm3 + xmm2 to get (b0*x0-a0*y0, b1*x1-a1*y1, b2*x2, 0)
> 
> Extract each part of xmm3 and add them together.  (We could gain something here if we had SSE3 to do the add, I think.)
> 
> yy[0] = yy[1]
> yy[1] = result of sum.
> 
> Don't know if this is faster or slower.  This will change results slightly because we do everything in single precision.

Your solution may work in single precision and I didn`t try.
I have tried this way in double precision but no too much improvement. Then I changed to use inline asm so that I can pipeline the instructions, also get the benefit from SSE2.
------- Comment #13 From 2012-01-11 09:54:53 PST -------
(From update of attachment 121082 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=121082&action=review

>>> Source/WebCore/platform/audio/Biquad.cpp:93
>>> +    __asm__(
>> 
>> I am far from an expert in sse2, but this seems rather complex.  Could we do something like this?
>> 
>> Create array yy of length 2 (or maybe use the destination array directly?):
>> yy[0] = y
>> yy[1] = y1
>> 
>> load xmm0 with (b0 b1 b2 0)
>> load xmm1 with (-a0 -a1 0 0)
>> load xmm2 from *source to get (x[0] x[1] x[2] x[3])
>> xmm2 = xmm2 * xmm0 to get (b0*x0 b1*x1 b2*x2 0)
>> load xmm3 from yy to get (y0 y1 junk junk)
>> xmm3 = xmm3 * xmm1 to get (-a0*y0 a1*y1 0 0)
>> xmm3 = xmm3 + xmm2 to get (b0*x0-a0*y0, b1*x1-a1*y1, b2*x2, 0)
>> 
>> Extract each part of xmm3 and add them together.  (We could gain something here if we had SSE3 to do the add, I think.)
>> 
>> yy[0] = yy[1]
>> yy[1] = result of sum.
>> 
>> Don't know if this is faster or slower.  This will change results slightly because we do everything in single precision.
> 
> Your solution may work in single precision and I didn`t try.
> I have tried this way in double precision but no too much improvement. Then I changed to use inline asm so that I can pipeline the instructions, also get the benefit from SSE2.

This looks fine then especially considering Chris's comment about keeping things in double for stability reasons.
------- Comment #14 From 2012-01-12 11:10:25 PST -------
If you are not in a hurry, bug 71413 will include a bunch of tests for the biquad filters so you do not have to write any.
------- Comment #15 From 2012-01-13 00:51:05 PST -------
(In reply to comment #14)
> If you are not in a hurry, bug 71413 will include a bunch of tests for the biquad filters so you do not have to write any.

That will be nice. I was going to update the patch with the low pass filter layout test, but if bug 71413 provides all the tests, I will wait for it.
------- Comment #16 From 2012-02-09 23:44:37 PST -------
Created an attachment (id=126465) [details]
Patch

Update the patch as layout tests of biquad filter landed, it`s OK to pass all these tests.
------- Comment #17 From 2012-02-10 01:20:22 PST -------
(From update of attachment 126465 [details])
Attachment 126465 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/11486599

New failing tests:
webaudio/biquad-highpass.html
webaudio/biquad-lowpass.html
webaudio/biquad-lowshelf.html
webaudio/biquad-notch.html
webaudio/biquad-highshelf.html
webaudio/biquad-peaking.html
webaudio/biquad-bandpass.html
webaudio/biquad-allpass.html
------- Comment #18 From 2012-02-10 13:56:13 PST -------
(From update of attachment 126465 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=126465&action=review

There also appears to be test failures on linux.  Can you investigate why that is happening?

> Source/WebCore/platform/audio/Biquad.cpp:121
> +        "addsd    %%xmm4,  %%xmm5\n\t" // x*b0 + (x2*b2+x*b0)

Doesn't xmm4 contain a1*y1 here so the result is a1*y1 + (x2*b2+x*b0)?

> Source/WebCore/platform/audio/Biquad.cpp:137
> +        :"eax", "edx", "ecx"

Is it necessary to add all of the xmm registers to this clobber list?
------- Comment #19 From 2012-02-12 23:26:19 PST -------
(From update of attachment 126465 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=126465&action=review

>> Source/WebCore/platform/audio/Biquad.cpp:121
>> +        "addsd    %%xmm4,  %%xmm5\n\t" // x*b0 + (x2*b2+x*b0)
> 
> Doesn't xmm4 contain a1*y1 here so the result is a1*y1 + (x2*b2+x*b0)?

Sorry for that, it should be a1*y1 + (x2*b2+x*b0).

>> Source/WebCore/platform/audio/Biquad.cpp:137
>> +        :"eax", "edx", "ecx"
> 
> Is it necessary to add all of the xmm registers to this clobber list?

OK, it`s better to add all the xmm registers to the clobber-list, through there is little possibility of conflict of xmm registers here.
------- Comment #20 From 2012-02-12 23:32:34 PST -------
(In reply to comment #18)
> (From update of attachment 126465 [details] [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=126465&action=review
> 
> There also appears to be test failures on linux.  Can you investigate why that is happening?
> 

It`s strange that I can pass the layout tests locally, and little message can be found from the error log  http://queues.webkit.org/results/11486599. I`ll try to check my patch in different machine.
------- Comment #21 From 2012-02-13 00:03:59 PST -------
Created an attachment (id=126729) [details]
Patch
------- Comment #22 From 2012-02-13 11:04:32 PST -------
(In reply to comment #21)
> Created an attachment (id=126729) [details] [details]
> Patch

Looks good, except for the test failures that happened in the previous patch.  I didn't see any test failures from the bots for this patch, though.  Perhaps they've been fixed?
------- Comment #23 From 2012-02-13 17:23:21 PST -------
(In reply to comment #22)
> (In reply to comment #21)
> > Created an attachment (id=126729) [details] [details] [details]
> > Patch
> 
> Looks good, except for the test failures that happened in the previous patch.  I didn't see any test failures from the bots for this patch, though.  Perhaps they've been fixed?

Actually I never reproduced the failures at all, just updated the patch as your comments. Perhaps the last patch would pass the tests if run it again. Whatever, does that mean the patch is OK now?

BTW, how can I check the building details including these failures from the bots? Thanks.
------- Comment #24 From 2012-02-14 09:39:06 PST -------
(In reply to comment #23)
> (In reply to comment #22)
> > (In reply to comment #21)
> > > Created an attachment (id=126729) [details] [details] [details] [details]
> > > Patch
> > 
> > Looks good, except for the test failures that happened in the previous patch.  I didn't see any test failures from the bots for this patch, though.  Perhaps they've been fixed?
> 
> Actually I never reproduced the failures at all, just updated the patch as your comments. Perhaps the last patch would pass the tests if run it again. Whatever, does that mean the patch is OK now?

Since there are no messages from the webkit review bot with your latest change, I think it's ok now.  Not sure what changes made this work.
------- Comment #25 From 2012-02-21 05:46:42 PST -------
(In reply to comment #24)
> (In reply to comment #23)
> > (In reply to comment #22)
> > > (In reply to comment #21)
> > > > Created an attachment (id=126729) [details] [details] [details] [details] [details]
> > > > Patch
> > > 
> > > Looks good, except for the test failures that happened in the previous patch.  I didn't see any test failures from the bots for this patch, though.  Perhaps they've been fixed?
> > 
> > Actually I never reproduced the failures at all, just updated the patch as your comments. Perhaps the last patch would pass the tests if run it again. Whatever, does that mean the patch is OK now?
> 
> Since there are no messages from the webkit review bot with your latest change, I think it's ok now.  Not sure what changes made this work.

Since the patch is OK, can it be submitted now?
------- Comment #26 From 2012-02-22 11:35:37 PST -------
(From update of attachment 126729 [details])
LGTM
------- Comment #27 From 2012-02-27 20:58:20 PST -------
(In reply to comment #26)
> (From update of attachment 126729 [details] [details])
> LGTM

Chris, does the patch need more review?
------- Comment #28 From 2012-02-29 20:32:41 PST -------
(In reply to comment #27)
> (In reply to comment #26)
> > (From update of attachment 126729 [details] [details] [details])
> > LGTM
> 
> Chris, does the patch need more review?

Whether the patch is OK to commit?
------- Comment #29 From 2012-03-14 14:01:51 PST -------
(From update of attachment 126729 [details])
Clearing flags on attachment: 126729

Committed r110744: <http://trac.webkit.org/changeset/110744>
------- Comment #30 From 2012-03-14 14:01:57 PST -------
All reviewed patches have been landed.  Closing bug.
------- Comment #31 From 2012-03-14 16:25:32 PST -------
This patch seems to crash in debug:

[49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11
    base::debug::StackTrace::StackTrace() [0x851b56]
    base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739]
    0x7f352f9d1af0
    WebCore::Biquad::process() [0x256d73e]
    WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4]
    WebCore::HRTFKernel::create() [0x25ab549]
    WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703]
    WebCore::HRTFElevation::createForSubject() [0x25aa9db]
    WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e]
    WebCore::HRTFDatabase::create() [0x25a9630]
    WebCore::HRTFDatabaseLoader::load() [0x257382d]
    WebCore::databaseLoaderEntry() [0x25737ac]
    WTF::threadEntryPoint() [0x1f509e5]
    WTF::wtfThreadEntryPoint() [0x793cb8]
    start_thread [0x7f3531e369ca]
    0x7f352fa8470d
None
------- Comment #32 From 2012-03-14 16:34:51 PST -------
Rolled out in http://trac.webkit.org/changeset/110782
------- Comment #33 From 2012-03-15 00:33:53 PST -------
(In reply to comment #31)
> This patch seems to crash in debug:
> 
> [49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11
>     base::debug::StackTrace::StackTrace() [0x851b56]
>     base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739]
>     0x7f352f9d1af0
>     WebCore::Biquad::process() [0x256d73e]
>     WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4]
>     WebCore::HRTFKernel::create() [0x25ab549]
>     WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703]
>     WebCore::HRTFElevation::createForSubject() [0x25aa9db]
>     WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e]
>     WebCore::HRTFDatabase::create() [0x25a9630]
>     WebCore::HRTFDatabaseLoader::load() [0x257382d]
>     WebCore::databaseLoaderEntry() [0x25737ac]
>     WTF::threadEntryPoint() [0x1f509e5]
>     WTF::wtfThreadEntryPoint() [0x793cb8]
>     start_thread [0x7f3531e369ca]
>     0x7f352fa8470d
> None

Hi,
That`s strange that I cannot reproduce the issue (by several different machines). Could you provide more details about your platform?
------- Comment #34 From 2012-03-15 01:03:40 PST -------
It was the chromium-mac-linux configuration, which is Ubuntu Lucid.
------- Comment #35 From 2012-03-15 08:38:09 PST -------
(In reply to comment #34)
> It was the chromium-mac-linux configuration, which is Ubuntu Lucid.

I am confused by chromium-mac-linux, should not it be either chromium-linux or chromium-mac? The patch only acts in linux, I`ve tested in Ubuntu oneiric and it works very well.
------- Comment #36 From 2012-03-15 09:26:08 PST -------
(In reply to comment #33)
> (In reply to comment #31)
> > This patch seems to crash in debug:
> > 
> > [49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11
> >     base::debug::StackTrace::StackTrace() [0x851b56]
> >     base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739]
> >     0x7f352f9d1af0
> >     WebCore::Biquad::process() [0x256d73e]
> >     WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4]
> >     WebCore::HRTFKernel::create() [0x25ab549]
> >     WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703]
> >     WebCore::HRTFElevation::createForSubject() [0x25aa9db]
> >     WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e]
> >     WebCore::HRTFDatabase::create() [0x25a9630]
> >     WebCore::HRTFDatabaseLoader::load() [0x257382d]
> >     WebCore::databaseLoaderEntry() [0x25737ac]
> >     WTF::threadEntryPoint() [0x1f509e5]
> >     WTF::wtfThreadEntryPoint() [0x793cb8]
> >     start_thread [0x7f3531e369ca]
> >     0x7f352fa8470d
> > None
> 
> Hi,
> That`s strange that I cannot reproduce the issue (by several different machines). Could you provide more details about your platform?

The stack trace looks totally bogus. (I'm pretty sure the HRTFKernel doesn't call Biquad).

But I can reproduce the crash on my system.  The stack trace is totally trash too.  More work needed....
------- Comment #37 From 2012-03-15 10:39:35 PST -------
(In reply to comment #36)
> (In reply to comment #33)
> > (In reply to comment #31)
> > > This patch seems to crash in debug:
> > > 
> > > [49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11
> > >     base::debug::StackTrace::StackTrace() [0x851b56]
> > >     base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739]
> > >     0x7f352f9d1af0
> > >     WebCore::Biquad::process() [0x256d73e]
> > >     WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4]
> > >     WebCore::HRTFKernel::create() [0x25ab549]
> > >     WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703]
> > >     WebCore::HRTFElevation::createForSubject() [0x25aa9db]
> > >     WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e]
> > >     WebCore::HRTFDatabase::create() [0x25a9630]
> > >     WebCore::HRTFDatabaseLoader::load() [0x257382d]
> > >     WebCore::databaseLoaderEntry() [0x25737ac]
> > >     WTF::threadEntryPoint() [0x1f509e5]
> > >     WTF::wtfThreadEntryPoint() [0x793cb8]
> > >     start_thread [0x7f3531e369ca]
> > >     0x7f352fa8470d
> > > None
> > 
> > Hi,
> > That`s strange that I cannot reproduce the issue (by several different machines). Could you provide more details about your platform?
> 
> The stack trace looks totally bogus. (I'm pretty sure the HRTFKernel doesn't call Biquad).
> 
> But I can reproduce the crash on my system.  The stack trace is totally trash too.  More work needed....

Ok.  The issue is that on my debug build, I get a 64-bit build so loading the source address into edx loses the top 32 bits, which are non-zero.  We then try to load from that non-existent address and crash.

Don't understand why this doesn't happen on a release build, but this is definitely a bug.
------- Comment #38 From 2012-03-15 11:51:01 PST -------
Created an attachment (id=132095) [details]
Updated patch to handle 32 and 64-bit builds

Here is an updated patch.  This passes all of the webaudio tests in debug mode. (Release mode test in progress.)

This basically replaces all uses of the edx and ecx registers with their 64-bit counterparts, rdx and rcx.

However, the complexity of maintaining this assembly code is getting rather large, so I'm inclined not to do this.  It would be much better if intrinsics could be used, or, even better, do what the FIXME comment says and unroll the while loop with carefully scheduled arithmetic (in C) if that can achieve the desired performance gains.
------- Comment #39 From 2012-03-15 13:33:56 PST -------
> I am confused by chromium-mac-linux, should not it be either chromium-linux or chromium-mac? The patch only acts in linux, I`ve tested in Ubuntu oneiric and it works very well.

Sorry.  I meant chromium-linux.  :)
------- Comment #40 From 2012-03-15 19:17:59 PST -------
(In reply to comment #38)
> Created an attachment (id=132095) [details] [details]
> Updated patch to handle 32 and 64-bit builds
> 
> Here is an updated patch.  This passes all of the webaudio tests in debug mode. (Release mode test in progress.)
> 
> This basically replaces all uses of the edx and ecx registers with their 64-bit counterparts, rdx and rcx.
> 
> However, the complexity of maintaining this assembly code is getting rather large, so I'm inclined not to do this.  It would be much better if intrinsics could be used, or, even better, do what the FIXME comment says and unroll the while loop with carefully scheduled arithmetic (in C) if that can achieve the desired performance gains.

Hi Ray, thank you very much for fixing this bug. I agree that we should not make the code here to get too large.
I have tried some ways to use intrinsics and C code to unroll the loop, but they could not achieve the performance increased by assembly code, because it was not easy to pipeline and reschedule the executing order of the code as assembly did. 
But I can make a try to find whether there is a better way to optimize the code with intrinsics.
------- Comment #41 From 2012-03-18 20:18:12 PST -------
Created an attachment (id=132532) [details]
Patch
------- Comment #42 From 2012-03-18 20:23:38 PST -------
(In reply to comment #41)
> Created an attachment (id=132532) [details] [details]
> Patch

Re-implemented the pipeline by intrinsics. What surprised me is the performance is much better than assembly code, ~45% with original ~20%(for 128 frames size).
------- Comment #43 From 2012-03-19 12:13:27 PST -------
(From update of attachment 132532 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=132532&action=review

This looks much nicer than the assembly code, and it's faster too!  Code looks good, but adding some comments would help to understand what's going on.

> Source/WebCore/platform/audio/Biquad.cpp:113
> +    __m128d mma1 = _mm_set_sd(-a1);

mma1 = (0, -a1)

> Source/WebCore/platform/audio/Biquad.cpp:117
> +        mm6 = mm1;

Add comment mm6 = (y2, x)

> Source/WebCore/platform/audio/Biquad.cpp:118
> +        mm1 = _mm_shuffle_pd(mm1, mm4, 0); // y2 = y1

Add comment that mm1 = (y1, x)

> Source/WebCore/platform/audio/Biquad.cpp:119
> +        mm5 = _mm_mul_pd(mm2, mm0); // (x1*b1 x2*b2)

Insert comma:  (x1 * b1, x2 * b2)
Webkit style includes a space around arithmetic operators.  Do the same for next line.

> Source/WebCore/platform/audio/Biquad.cpp:121
> +        mm0 = _mm_shuffle_pd(mm0, mm1, 1); // (x1 = x, x2 = x1)

Comment is confusing.  Maybe say mm0 = (x, x1)

> Source/WebCore/platform/audio/Biquad.cpp:122
> +        mm4 = _mm_mul_sd(mm4, mma1); // -y1*a1

Be more explicit about contents of mm4.

mm4 = (0, -y1 * a1)

> Source/WebCore/platform/audio/Biquad.cpp:123
> +        mm5 = _mm_add_pd(mm5, mm6); // (x1*b1-y2*a2 x2*b2+x*b0)

Insert comma, add space around operators:

(x1 * b1 - y2 * a2, x2 * b2 + x * b0)

> Source/WebCore/platform/audio/Biquad.cpp:124
> +        n--;

Why decrement n here?  Is this a pipeline issue?  If not, can we put the decrement back into while (n--)?

> Source/WebCore/platform/audio/Biquad.cpp:125
> +        mm6 = mm5;

mm6 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0)

> Source/WebCore/platform/audio/Biquad.cpp:126
> +        mm7 = _mm_load_ss(sourceP);

mm7 = (0, 0, 0, *sourceP)

And we're loading the next x value.

> Source/WebCore/platform/audio/Biquad.cpp:127
> +        mm1 = _mm_cvtss_sd(mm1, mm7); // x = *sourceP

Maybe say exactly what is in mm1?  mm1 = (y1, x)

> Source/WebCore/platform/audio/Biquad.cpp:128
> +        mm5 = _mm_add_pd(mm4, mm5); // (x1*b1-y2*a2 x2*b2+x*b0-y1*a1) 

Same comment as for line 123, 119.

Any advantage in doing mm_add_pd?  It looks like we don't care about the high part of mm5.  Is that right?

> Source/WebCore/platform/audio/Biquad.cpp:129
> +        mm6 = _mm_shuffle_pd(mm6, mm6, 1);

mm6 = (x2 * b2 + x * b0, x1 * b1 - y2 * a2)

> Source/WebCore/platform/audio/Biquad.cpp:130
> +        mm5 = _mm_add_pd(mm5, mm6); // y

mm5 = (x1*b1-y2*a2+x2*b2+x*b0, y = x2*b2+x*b0-y1*a1+x1*b1-y2*a2), and we ignore high part of mm5, so maybe just mm_add_sd?

> Source/WebCore/platform/audio/Biquad.cpp:131
> +        mm7 = _mm_cvtsd_ss(mm7, mm5);

mm7 = (0, 0, 0, y)

> Source/WebCore/platform/audio/Biquad.cpp:133
> +        mm4 = mm5; // y1 = y

mm4 = (x1*b1-y2*a2+x2*b2+x*b0, y)
------- Comment #44 From 2012-03-19 22:34:45 PST -------
Created an attachment (id=132760) [details]
Patch
------- Comment #45 From 2012-03-19 22:44:34 PST -------
(From update of attachment 132532 [details])
Hi Ray, patch is updated as your comments.

View in context: https://bugs.webkit.org/attachment.cgi?id=132532&action=review

>> Source/WebCore/platform/audio/Biquad.cpp:113
>> +    __m128d mma1 = _mm_set_sd(-a1);
> 
> mma1 = (0, -a1)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:117
>> +        mm6 = mm1;
> 
> Add comment mm6 = (y2, x)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:118
>> +        mm1 = _mm_shuffle_pd(mm1, mm4, 0); // y2 = y1
> 
> Add comment that mm1 = (y1, x)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:119
>> +        mm5 = _mm_mul_pd(mm2, mm0); // (x1*b1 x2*b2)
> 
> Insert comma:  (x1 * b1, x2 * b2)
> Webkit style includes a space around arithmetic operators.  Do the same for next line.

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:121
>> +        mm0 = _mm_shuffle_pd(mm0, mm1, 1); // (x1 = x, x2 = x1)
> 
> Comment is confusing.  Maybe say mm0 = (x, x1)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:122
>> +        mm4 = _mm_mul_sd(mm4, mma1); // -y1*a1
> 
> Be more explicit about contents of mm4.
> 
> mm4 = (0, -y1 * a1)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:123
>> +        mm5 = _mm_add_pd(mm5, mm6); // (x1*b1-y2*a2 x2*b2+x*b0)
> 
> Insert comma, add space around operators:
> 
> (x1 * b1 - y2 * a2, x2 * b2 + x * b0)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:124
>> +        n--;
> 
> Why decrement n here?  Is this a pipeline issue?  If not, can we put the decrement back into while (n--)?

This will impact the pipeline, using "while(n--)" pulls down the performance a little, 45% to 43%.

>> Source/WebCore/platform/audio/Biquad.cpp:125
>> +        mm6 = mm5;
> 
> mm6 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:126
>> +        mm7 = _mm_load_ss(sourceP);
> 
> mm7 = (0, 0, 0, *sourceP)
> 
> And we're loading the next x value.

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:128
>> +        mm5 = _mm_add_pd(mm4, mm5); // (x1*b1-y2*a2 x2*b2+x*b0-y1*a1) 
> 
> Same comment as for line 123, 119.
> 
> Any advantage in doing mm_add_pd?  It looks like we don't care about the high part of mm5.  Is that right?

Yes, _mm_add_sd is better.

>> Source/WebCore/platform/audio/Biquad.cpp:129
>> +        mm6 = _mm_shuffle_pd(mm6, mm6, 1);
> 
> mm6 = (x2 * b2 + x * b0, x1 * b1 - y2 * a2)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:130
>> +        mm5 = _mm_add_pd(mm5, mm6); // y
> 
> mm5 = (x1*b1-y2*a2+x2*b2+x*b0, y = x2*b2+x*b0-y1*a1+x1*b1-y2*a2), and we ignore high part of mm5, so maybe just mm_add_sd?

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:133
>> +        mm4 = mm5; // y1 = y
> 
> mm4 = (x1*b1-y2*a2+x2*b2+x*b0, y)

Done.
------- Comment #46 From 2012-03-20 10:00:27 PST -------
(From update of attachment 132760 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review

Code looks good; just a few more minor items to look at.

> Source/WebCore/platform/audio/Biquad.cpp:109
> +    __m128d mm4 = _mm_set_pd(0, y1); // mm4 = (0, y1)

It looks like we only care about the the low part of mm4.  Maybe add comment that we only use the low part of mm4.

Same goes for mma1 in line 113.

> Source/WebCore/platform/audio/Biquad.cpp:122
> +        mm4 = _mm_mul_sd(mm4, mma1); // mm4 = (0, -y1 * a1)

I think the comment should be mm4=-y1*a1 since we're doing mul_sd.  We don't care what's in the high part of mm4.

> Source/WebCore/platform/audio/Biquad.cpp:125
> +        mm6 = mm5; // mm6 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0)

See note on for line 130.  Shuffle mm5 to the right parts of mm6 now instead of later?

> Source/WebCore/platform/audio/Biquad.cpp:126
> +        mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x)

We load mm7 from memory here, but use it in the very next instruction.  Could this be moved further up to lessen any pipeline issues?  We're not using the value of mm7 until line 127.

> Source/WebCore/platform/audio/Biquad.cpp:128
> +        mm5 = _mm_add_sd(mm4, mm5); // mm5 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0 - y1 * a1)

Since we're doing mm_add_sd, we don't care what's in the high part of mm5.  Maybe the comment should be mm5 = x2*b2 + x*b0-y1*a1.

> Source/WebCore/platform/audio/Biquad.cpp:130
> +        mm5 = _mm_add_sd(mm5, mm6); // mm5 = (x1 * b1 - y2 * a2, x * b0 + x1 * b1 + x2 * b2 - y1 * a1 - y2 * a2) = (*, y)

Same comment here as for line 128.  We don't care what's in the high part of mm5.

Also, in line 129, we swap the high and low parts of mm6.  Isn't it possible to do that in line 125 where we assign mm5 to mm6?  Just shuffle mm5 to the right parts of mm6?  That's one less instruction, but I don't know what the pipeline affects will be.

> Source/WebCore/platform/audio/Biquad.cpp:131
> +        mm7 = _mm_cvtsd_ss(mm7, mm5); // mm7 = (0, 0, 0, y)

Actually, I think the comment is wrong. (Sorry about that.)  cvtsd_ss doesn't modify the other parts of mm7.  Plus we're only using the low part, maybe just say mm7 = static_cast<float>(y).
------- Comment #47 From 2012-03-20 20:12:57 PST -------
Created an attachment (id=132954) [details]
Patch
------- Comment #48 From 2012-03-20 20:36:18 PST -------
(From update of attachment 132760 [details])
Thanks, Ray. Update the patch.

View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review

>> Source/WebCore/platform/audio/Biquad.cpp:126
>> +        mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x)
> 
> We load mm7 from memory here, but use it in the very next instruction.  Could this be moved further up to lessen any pipeline issues?  We're not using the value of mm7 until line 127.

There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline.

>> Source/WebCore/platform/audio/Biquad.cpp:130
>> +        mm5 = _mm_add_sd(mm5, mm6); // mm5 = (x1 * b1 - y2 * a2, x * b0 + x1 * b1 + x2 * b2 - y1 * a1 - y2 * a2) = (*, y)
> 
> Same comment here as for line 128.  We don't care what's in the high part of mm5.
> 
> Also, in line 129, we swap the high and low parts of mm6.  Isn't it possible to do that in line 125 where we assign mm5 to mm6?  Just shuffle mm5 to the right parts of mm6?  That's one less instruction, but I don't know what the pipeline affects will be.

It seems we should do that and make us to save one instruction, but it pulls down the performance after that, strange~ 
I guess the compiler does much for us when we use intrinsic, the actual assembly code looks much different from what I expect for after  "gcc -s " .
------- Comment #49 From 2012-03-21 09:17:07 PST -------
(In reply to comment #48)
> (From update of attachment 132760 [details] [details])
> Thanks, Ray. Update the patch.
> 
> View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review
> 
> >> Source/WebCore/platform/audio/Biquad.cpp:126
> >> +        mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x)
> > 
> > We load mm7 from memory here, but use it in the very next instruction.  Could this be moved further up to lessen any pipeline issues?  We're not using the value of mm7 until line 127.
> 
> There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline.
> 
> >> Source/WebCore/platform/audio/Biquad.cpp:130
> >> +        mm5 = _mm_add_sd(mm5, mm6); // mm5 = (x1 * b1 - y2 * a2, x * b0 + x1 * b1 + x2 * b2 - y1 * a1 - y2 * a2) = (*, y)
> > 
> > Same comment here as for line 128.  We don't care what's in the high part of mm5.
> > 
> > Also, in line 129, we swap the high and low parts of mm6.  Isn't it possible to do that in line 125 where we assign mm5 to mm6?  Just shuffle mm5 to the right parts of mm6?  That's one less instruction, but I don't know what the pipeline affects will be.
> 
> It seems we should do that and make us to save one instruction, but it pulls down the performance after that, strange~ 
> I guess the compiler does much for us when we use intrinsic, the actual assembly code looks much different from what I expect for after  "gcc -s " .

Ok.  I'm happy with this patch as it is now.  Looks good.

I do wonder, though, if you moved load_ss to be after line 124 and changed line 125 to do the shuffle (removing the shuffle at line 129) would improve performance.  Or maybe move the (new) shuffle down a few lines since we don't need mm6 until line 130.  Perhaps these changes will help the pipeline.

You don't have to do this experiment if you don't want to.  I'm just kind of curious.  Otherwise, the patch looks good to me and is ready to go.
------- Comment #50 From 2012-03-21 12:38:13 PST -------
Does this latest patch pass the layout test (and not crash) on the Debug build?
------- Comment #51 From 2012-03-22 02:41:26 PST -------
(From update of attachment 132760 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review

>>>> Source/WebCore/platform/audio/Biquad.cpp:126
>>>> +        mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x)
>>> 
>>> We load mm7 from memory here, but use it in the very next instruction.  Could this be moved further up to lessen any pipeline issues?  We're not using the value of mm7 until line 127.
>> 
>> There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline.
> 
> Ok.  I'm happy with this patch as it is now.  Looks good.
> 
> I do wonder, though, if you moved load_ss to be after line 124 and changed line 125 to do the shuffle (removing the shuffle at line 129) would improve performance.  Or maybe move the (new) shuffle down a few lines since we don't need mm6 until line 130.  Perhaps these changes will help the pipeline.
> 
> You don't have to do this experiment if you don't want to.  I'm just kind of curious.  Otherwise, the patch looks good to me and is ready to go.

It does not matter, I did it. But it seems it does not work, a tiny slowing down. Hard to explain, we cannot exactly schedule the pipeline by intrinsics as the way by assembly, I guess.
------- Comment #52 From 2012-03-22 02:52:28 PST -------
(In reply to comment #50)
> Does this latest patch pass the layout test (and not crash) on the Debug build?

The layout test passed in both debug and release build from my environment (32-bit).
Last crash issue`s root cause is I used 32-bit registers for the address and it will failed in 64-bit system. I use intrinsics this time and suppose it would not meet this issue. Certainly more tests are needed, but I don`t have 64-bit system right now.

Ray, would you mind to help me to check whether the patch is OK in your 64-bit system?
------- Comment #53 From 2012-03-22 10:28:13 PST -------
(In reply to comment #52)
> (In reply to comment #50)
> > Does this latest patch pass the layout test (and not crash) on the Debug build?
> 
> The layout test passed in both debug and release build from my environment (32-bit).
> Last crash issue`s root cause is I used 32-bit registers for the address and it will failed in 64-bit system. I use intrinsics this time and suppose it would not meet this issue. Certainly more tests are needed, but I don`t have 64-bit system right now.

I don't expect any issues with 64-bit builds.  The compiler should use the right sized registers for the intrinsic operations.

> 
> Ray, would you mind to help me to check whether the patch is OK in your 64-bit system?

No problem.  The debug build passes all tests, except for the convolution-mono-mono test, but I think that's a bug in the convolver that needs to be fixed.

The release build passes all tests too. (convolution-mono-mono passes.)
------- Comment #54 From 2012-03-22 10:30:29 PST -------
(In reply to comment #51)
> (From update of attachment 132760 [details] [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review
> 
> >>>> Source/WebCore/platform/audio/Biquad.cpp:126
> >>>> +        mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x)
> >>> 
> >>> We load mm7 from memory here, but use it in the very next instruction.  Could this be moved further up to lessen any pipeline issues?  We're not using the value of mm7 until line 127.
> >> 
> >> There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline.
> > 
> > Ok.  I'm happy with this patch as it is now.  Looks good.
> > 
> > I do wonder, though, if you moved load_ss to be after line 124 and changed line 125 to do the shuffle (removing the shuffle at line 129) would improve performance.  Or maybe move the (new) shuffle down a few lines since we don't need mm6 until line 130.  Perhaps these changes will help the pipeline.
> > 
> > You don't have to do this experiment if you don't want to.  I'm just kind of curious.  Otherwise, the patch looks good to me and is ready to go.
> 
> It does not matter, I did it. But it seems it does not work, a tiny slowing down. Hard to explain, we cannot exactly schedule the pipeline by intrinsics as the way by assembly, I guess.

Thanks for doing the test.  I would have thought there might be some small gain, but the compiler can rearrange things as you say.

So I think this new patch is ready to go as is.  Thanks for doing the new version!
------- Comment #55 From 2012-03-22 19:58:56 PST -------
(In reply to comment #53)
> (In reply to comment #52)
> > (In reply to comment #50)
> > > Does this latest patch pass the layout test (and not crash) on the Debug build?
> > 
> > The layout test passed in both debug and release build from my environment (32-bit).
> > Last crash issue`s root cause is I used 32-bit registers for the address and it will failed in 64-bit system. I use intrinsics this time and suppose it would not meet this issue. Certainly more tests are needed, but I don`t have 64-bit system right now.
> 
> I don't expect any issues with 64-bit builds.  The compiler should use the right sized registers for the intrinsic operations.
> 
> > 
> > Ray, would you mind to help me to check whether the patch is OK in your 64-bit system?
> 
> No problem.  The debug build passes all tests, except for the convolution-mono-mono test, but I think that's a bug in the convolver that needs to be fixed.
> 
> The release build passes all tests too. (convolution-mono-mono passes.)

Ray, thank you for doing the tests.
------- Comment #56 From 2012-11-14 17:42:33 PST -------
If this patch looked good back in March, why hasn't it been approved or landed?
------- Comment #57 From 2012-11-15 17:10:38 PST -------
(From update of attachment 132954 [details])
I'm landing based on prior reviewers comments.  It seems like it was simply lost in the shuffle at some point.
------- Comment #58 From 2012-11-15 17:20:24 PST -------
(From update of attachment 132954 [details])
Clearing flags on attachment: 132954

Committed r134867: <http://trac.webkit.org/changeset/134867>
------- Comment #59 From 2012-11-15 17:20:31 PST -------
All reviewed patches have been landed.  Closing bug.
------- Comment #60 From 2012-11-16 11:43:33 PST -------
Re-opened since this is blocked by bug 102544
------- Comment #61 From 2012-11-16 11:45:41 PST -------
Reopened because this change broke one of the security tests.  See https://bugs.webkit.org/show_bug.cgi?id=102524 for details.
------- Comment #62 From 2012-11-16 11:47:12 PST -------
Rollout was done via https://bugs.webkit.org/show_bug.cgi?id=102544.
------- Comment #63 From 2013-07-09 10:24:55 PST -------
I might be completely wrong (I just happened to skim past this patch), but shouldn't the SSE2 code be a runtime check?  The #ifdef SSE2 is only good for knowing if you can use the intrinsics or not (if the compiler supports SSE2).  I don't know the current standing expectation of having SSE2 support is, but last I remember it wasn't considered a requirement:

https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/66p8sJtluJA
------- Comment #64 From 2013-07-09 10:43:54 PST -------
(In reply to comment #63)
> I might be completely wrong (I just happened to skim past this patch), but shouldn't the SSE2 code be a runtime check?  The #ifdef SSE2 is only good for knowing if you can use the intrinsics or not (if the compiler supports SSE2).  I don't know the current standing expectation of having SSE2 support is, but last I remember it wasn't considered a requirement:
> 
> https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/66p8sJtluJA

I think you are correct.  It's been a while since I looked, but I think on Linux and Windows the compiler flag that enables SSE2 support is not set, so the SSE2 code is never compiled in.  On Mac, SSE2 has always been available on x86 machines and I think SSE2 code is compiled by default.  

There should be no issues with this code.  In any case, this particular patch was rolled out.

Ideally, we would like a runtime check and use the SSE2 code when available, falling back to the C reference when SSE2 is not available.