Bug 75528

Summary: Optimize the multiply-add in Biquad.cpp::process
Product: WebKit Reporter: Xingnan Wang <xingnan.wang>
Component: Web AudioAssignee: Nobody <webkit-unassigned>
Status: UNCONFIRMED ---    
Severity: Normal CC: abarth, bfulgham, crogers, danceoffwithyourpantsoff, dglazkov, rtoy, rtoy, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 81168, 102524, 102544    
Bug Blocks:    
Attachments:
Description Flags
Patch
none
Biquad layout test script
none
Patch
webkit.review.bot: commit-queue-
Patch
none
Updated patch to handle 32 and 64-bit builds
none
Patch
none
Patch
none
Patch none

Description Xingnan Wang 2012-01-04 01:04:07 PST
Optimize the multiply-add by piplining with SSE2 instructions.
Comment 1 Xingnan Wang 2012-01-04 01:17:37 PST
Created attachment 121082 [details]
Patch
Comment 2 Chris Rogers 2012-01-04 18:04:48 PST
Thanks for the optimization!  Any idea how much improvement this gives?

I'm adding Raymond as reviewer here, and think we should get at least a very simple Biquad layout test written to validate both the existing code, and your new code.  The layout test can do the following:

1. create a BiquadFilterNode, configure it as lowpass at a certain cutoff frequency
2. feed an impulse into the filter in an OfflineAudioContext
3. compare the result with the expected output using JavaScript to generate the expected response (we can borrow the math from Biquad.cpp and convert to JS.
Comment 3 Chris Rogers 2012-01-04 18:06:30 PST
Xingnan do you know how to write a layout test.  If not, maybe Raymond can help you there, since he's just starting to write them himself.
Comment 4 Xingnan Wang 2012-01-04 18:12:11 PST
(In reply to comment #2)
> Thanks for the optimization!  Any idea how much improvement this gives?
> 
From my test, it boosted at least 20%(for 128 frames size), and reached 40% with larger size(like 1280).
> I'm adding Raymond as reviewer here, and think we should get at least a very simple Biquad layout test written to validate both the existing code, and your new code.  The layout test can do the following:
> 
> 1. create a BiquadFilterNode, configure it as lowpass at a certain cutoff frequency
> 2. feed an impulse into the filter in an OfflineAudioContext
> 3. compare the result with the expected output using JavaScript to generate the expected response (we can borrow the math from Biquad.cpp and convert to JS.
Comment 5 Xingnan Wang 2012-01-04 18:16:41 PST
(In reply to comment #3)
> Xingnan do you know how to write a layout test.  If not, maybe Raymond can help you there, since he's just starting to write them himself.

I did not write layout test before but very interest in it, I would like to run the layout test, it`s very pleasure if Raymond can help. Thanks.
Comment 6 Raymond Toy 2012-01-05 08:53:24 PST
(In reply to comment #5)
> (In reply to comment #3)
> > Xingnan do you know how to write a layout test.  If not, maybe Raymond can help you there, since he's just starting to write them himself.
> 
> I did not write layout test before but very interest in it, I would like to run the layout test, it`s very pleasure if Raymond can help. Thanks.

I would be my pleasure to write a simple layout test that you can start with.
Comment 7 Raymond Toy 2012-01-05 10:45:40 PST
Created attachment 121300 [details]
Biquad layout test script

Here is a start at a test of the biquad (lowpass) filter.

Place this file in the third_party/WebKit/LayoutTests/webaudio directory.  A biquad-test-expected.txt file needs to be created that contains the expected output.

To run the layout test, see http://www.chromium.org/developers/testing/webkit-layout-tests.

Perhaps it is enough to test just the lowpass filter for now.  But eventually, I think we need a test for all the filters.  In that case biquad-test.html should be broken up into a common piece that contains the test infrastructure and a probably several tests that create the appropriate filter tests.

Should probably follow the style of convolution-mono-mono.html tests.
Comment 8 Raymond Toy 2012-01-05 11:07:10 PST
Comment on attachment 121082 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=121082&action=review

> Source/WebCore/platform/audio/Biquad.cpp:90
> +#ifdef __SSE2__

Did you test this on windows?  I think the asm syntax below only works for gcc, so this conditionalization should test for gcc too.  (But I guess on windows, __SSE2__ is probably never defined.)

> Source/WebCore/platform/audio/Biquad.cpp:93
> +    __asm__(

I am far from an expert in sse2, but this seems rather complex.  Could we do something like this?

Create array yy of length 2 (or maybe use the destination array directly?):
yy[0] = y
yy[1] = y1

load xmm0 with (b0 b1 b2 0)
load xmm1 with (-a0 -a1 0 0)
load xmm2 from *source to get (x[0] x[1] x[2] x[3])
xmm2 = xmm2 * xmm0 to get (b0*x0 b1*x1 b2*x2 0)
load xmm3 from yy to get (y0 y1 junk junk)
xmm3 = xmm3 * xmm1 to get (-a0*y0 a1*y1 0 0)
xmm3 = xmm3 + xmm2 to get (b0*x0-a0*y0, b1*x1-a1*y1, b2*x2, 0)

Extract each part of xmm3 and add them together.  (We could gain something here if we had SSE3 to do the add, I think.)

yy[0] = yy[1]
yy[1] = result of sum.

Don't know if this is faster or slower.  This will change results slightly because we do everything in single precision.
Comment 9 Raymond Toy 2012-01-05 11:20:26 PST
Oops.  In the test script, you need to uncomment the first 5 commented lines in runTest() when running the layout test.  You can use the script as is when running in a browser (provided the other scripts are available).
Comment 10 Chris Rogers 2012-01-05 11:34:28 PST
(In reply to comment #8)
> Don't know if this is faster or slower.  This will change results slightly because we do everything in single precision.

I'd like to keep everything in double-precision because of filter stability problems with certain types of filters.
Comment 11 Xingnan Wang 2012-01-05 17:31:21 PST
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #3)
> > > Xingnan do you know how to write a layout test.  If not, maybe Raymond can help you there, since he's just starting to write them himself.
> > 
> > I did not write layout test before but very interest in it, I would like to run the layout test, it`s very pleasure if Raymond can help. Thanks.
> 
> I would be my pleasure to write a simple layout test that you can start with.

That`s so nice if there is a sample, thanks Raymond.
Comment 12 Xingnan Wang 2012-01-05 17:59:50 PST
(In reply to comment #8)
> (From update of attachment 121082 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=121082&action=review
> 
> > Source/WebCore/platform/audio/Biquad.cpp:90
> > +#ifdef __SSE2__
> 
> Did you test this on windows?  I think the asm syntax below only works for gcc, so this conditionalization should test for gcc too.  (But I guess on windows, __SSE2__ is probably never defined.)
Yes, there is another macro in windows. I just started with linux support only, in case of any potential issues. I can add the windows support and do the test.
> 
> > Source/WebCore/platform/audio/Biquad.cpp:93
> > +    __asm__(
> 
> I am far from an expert in sse2, but this seems rather complex.  Could we do something like this?
> 
> Create array yy of length 2 (or maybe use the destination array directly?):
> yy[0] = y
> yy[1] = y1
> 
> load xmm0 with (b0 b1 b2 0)
> load xmm1 with (-a0 -a1 0 0)
> load xmm2 from *source to get (x[0] x[1] x[2] x[3])
> xmm2 = xmm2 * xmm0 to get (b0*x0 b1*x1 b2*x2 0)
> load xmm3 from yy to get (y0 y1 junk junk)
> xmm3 = xmm3 * xmm1 to get (-a0*y0 a1*y1 0 0)
> xmm3 = xmm3 + xmm2 to get (b0*x0-a0*y0, b1*x1-a1*y1, b2*x2, 0)
> 
> Extract each part of xmm3 and add them together.  (We could gain something here if we had SSE3 to do the add, I think.)
> 
> yy[0] = yy[1]
> yy[1] = result of sum.
> 
> Don't know if this is faster or slower.  This will change results slightly because we do everything in single precision.

Your solution may work in single precision and I didn`t try.
I have tried this way in double precision but no too much improvement. Then I changed to use inline asm so that I can pipeline the instructions, also get the benefit from SSE2.
Comment 13 Raymond Toy 2012-01-11 09:54:53 PST
Comment on attachment 121082 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=121082&action=review

>>> Source/WebCore/platform/audio/Biquad.cpp:93
>>> +    __asm__(
>> 
>> I am far from an expert in sse2, but this seems rather complex.  Could we do something like this?
>> 
>> Create array yy of length 2 (or maybe use the destination array directly?):
>> yy[0] = y
>> yy[1] = y1
>> 
>> load xmm0 with (b0 b1 b2 0)
>> load xmm1 with (-a0 -a1 0 0)
>> load xmm2 from *source to get (x[0] x[1] x[2] x[3])
>> xmm2 = xmm2 * xmm0 to get (b0*x0 b1*x1 b2*x2 0)
>> load xmm3 from yy to get (y0 y1 junk junk)
>> xmm3 = xmm3 * xmm1 to get (-a0*y0 a1*y1 0 0)
>> xmm3 = xmm3 + xmm2 to get (b0*x0-a0*y0, b1*x1-a1*y1, b2*x2, 0)
>> 
>> Extract each part of xmm3 and add them together.  (We could gain something here if we had SSE3 to do the add, I think.)
>> 
>> yy[0] = yy[1]
>> yy[1] = result of sum.
>> 
>> Don't know if this is faster or slower.  This will change results slightly because we do everything in single precision.
> 
> Your solution may work in single precision and I didn`t try.
> I have tried this way in double precision but no too much improvement. Then I changed to use inline asm so that I can pipeline the instructions, also get the benefit from SSE2.

This looks fine then especially considering Chris's comment about keeping things in double for stability reasons.
Comment 14 Raymond Toy 2012-01-12 11:10:25 PST
If you are not in a hurry, bug 71413 will include a bunch of tests for the biquad filters so you do not have to write any.
Comment 15 Xingnan Wang 2012-01-13 00:51:05 PST
(In reply to comment #14)
> If you are not in a hurry, bug 71413 will include a bunch of tests for the biquad filters so you do not have to write any.

That will be nice. I was going to update the patch with the low pass filter layout test, but if bug 71413 provides all the tests, I will wait for it.
Comment 16 Xingnan Wang 2012-02-09 23:44:37 PST
Created attachment 126465 [details]
Patch

Update the patch as layout tests of biquad filter landed, it`s OK to pass all these tests.
Comment 17 WebKit Review Bot 2012-02-10 01:20:22 PST
Comment on attachment 126465 [details]
Patch

Attachment 126465 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/11486599

New failing tests:
webaudio/biquad-highpass.html
webaudio/biquad-lowpass.html
webaudio/biquad-lowshelf.html
webaudio/biquad-notch.html
webaudio/biquad-highshelf.html
webaudio/biquad-peaking.html
webaudio/biquad-bandpass.html
webaudio/biquad-allpass.html
Comment 18 Raymond Toy 2012-02-10 13:56:13 PST
Comment on attachment 126465 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=126465&action=review

There also appears to be test failures on linux.  Can you investigate why that is happening?

> Source/WebCore/platform/audio/Biquad.cpp:121
> +        "addsd    %%xmm4,  %%xmm5\n\t" // x*b0 + (x2*b2+x*b0)

Doesn't xmm4 contain a1*y1 here so the result is a1*y1 + (x2*b2+x*b0)?

> Source/WebCore/platform/audio/Biquad.cpp:137
> +        :"eax", "edx", "ecx"

Is it necessary to add all of the xmm registers to this clobber list?
Comment 19 Xingnan Wang 2012-02-12 23:26:19 PST
Comment on attachment 126465 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=126465&action=review

>> Source/WebCore/platform/audio/Biquad.cpp:121
>> +        "addsd    %%xmm4,  %%xmm5\n\t" // x*b0 + (x2*b2+x*b0)
> 
> Doesn't xmm4 contain a1*y1 here so the result is a1*y1 + (x2*b2+x*b0)?

Sorry for that, it should be a1*y1 + (x2*b2+x*b0).

>> Source/WebCore/platform/audio/Biquad.cpp:137
>> +        :"eax", "edx", "ecx"
> 
> Is it necessary to add all of the xmm registers to this clobber list?

OK, it`s better to add all the xmm registers to the clobber-list, through there is little possibility of conflict of xmm registers here.
Comment 20 Xingnan Wang 2012-02-12 23:32:34 PST
(In reply to comment #18)
> (From update of attachment 126465 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=126465&action=review
> 
> There also appears to be test failures on linux.  Can you investigate why that is happening?
> 

It`s strange that I can pass the layout tests locally, and little message can be found from the error log  http://queues.webkit.org/results/11486599. I`ll try to check my patch in different machine.
Comment 21 Xingnan Wang 2012-02-13 00:03:59 PST
Created attachment 126729 [details]
Patch
Comment 22 Raymond Toy 2012-02-13 11:04:32 PST
(In reply to comment #21)
> Created an attachment (id=126729) [details]
> Patch

Looks good, except for the test failures that happened in the previous patch.  I didn't see any test failures from the bots for this patch, though.  Perhaps they've been fixed?
Comment 23 Xingnan Wang 2012-02-13 17:23:21 PST
(In reply to comment #22)
> (In reply to comment #21)
> > Created an attachment (id=126729) [details] [details]
> > Patch
> 
> Looks good, except for the test failures that happened in the previous patch.  I didn't see any test failures from the bots for this patch, though.  Perhaps they've been fixed?

Actually I never reproduced the failures at all, just updated the patch as your comments. Perhaps the last patch would pass the tests if run it again. Whatever, does that mean the patch is OK now?

BTW, how can I check the building details including these failures from the bots? Thanks.
Comment 24 Raymond Toy 2012-02-14 09:39:06 PST
(In reply to comment #23)
> (In reply to comment #22)
> > (In reply to comment #21)
> > > Created an attachment (id=126729) [details] [details] [details]
> > > Patch
> > 
> > Looks good, except for the test failures that happened in the previous patch.  I didn't see any test failures from the bots for this patch, though.  Perhaps they've been fixed?
> 
> Actually I never reproduced the failures at all, just updated the patch as your comments. Perhaps the last patch would pass the tests if run it again. Whatever, does that mean the patch is OK now?

Since there are no messages from the webkit review bot with your latest change, I think it's ok now.  Not sure what changes made this work.
Comment 25 Xingnan Wang 2012-02-21 05:46:42 PST
(In reply to comment #24)
> (In reply to comment #23)
> > (In reply to comment #22)
> > > (In reply to comment #21)
> > > > Created an attachment (id=126729) [details] [details] [details] [details]
> > > > Patch
> > > 
> > > Looks good, except for the test failures that happened in the previous patch.  I didn't see any test failures from the bots for this patch, though.  Perhaps they've been fixed?
> > 
> > Actually I never reproduced the failures at all, just updated the patch as your comments. Perhaps the last patch would pass the tests if run it again. Whatever, does that mean the patch is OK now?
> 
> Since there are no messages from the webkit review bot with your latest change, I think it's ok now.  Not sure what changes made this work.

Since the patch is OK, can it be submitted now?
Comment 26 Raymond Toy 2012-02-22 11:35:37 PST
Comment on attachment 126729 [details]
Patch

LGTM
Comment 27 Xingnan Wang 2012-02-27 20:58:20 PST
(In reply to comment #26)
> (From update of attachment 126729 [details])
> LGTM

Chris, does the patch need more review?
Comment 28 Xingnan Wang 2012-02-29 20:32:41 PST
(In reply to comment #27)
> (In reply to comment #26)
> > (From update of attachment 126729 [details] [details])
> > LGTM
> 
> Chris, does the patch need more review?

Whether the patch is OK to commit?
Comment 29 WebKit Review Bot 2012-03-14 14:01:51 PDT
Comment on attachment 126729 [details]
Patch

Clearing flags on attachment: 126729

Committed r110744: <http://trac.webkit.org/changeset/110744>
Comment 30 WebKit Review Bot 2012-03-14 14:01:57 PDT
All reviewed patches have been landed.  Closing bug.
Comment 31 Adam Barth 2012-03-14 16:25:32 PDT
This patch seems to crash in debug:

[49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11
	base::debug::StackTrace::StackTrace() [0x851b56]
	base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739]
	0x7f352f9d1af0
	WebCore::Biquad::process() [0x256d73e]
	WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4]
	WebCore::HRTFKernel::create() [0x25ab549]
	WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703]
	WebCore::HRTFElevation::createForSubject() [0x25aa9db]
	WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e]
	WebCore::HRTFDatabase::create() [0x25a9630]
	WebCore::HRTFDatabaseLoader::load() [0x257382d]
	WebCore::databaseLoaderEntry() [0x25737ac]
	WTF::threadEntryPoint() [0x1f509e5]
	WTF::wtfThreadEntryPoint() [0x793cb8]
	start_thread [0x7f3531e369ca]
	0x7f352fa8470d
None
Comment 32 Adam Barth 2012-03-14 16:34:51 PDT
Rolled out in http://trac.webkit.org/changeset/110782
Comment 33 Xingnan Wang 2012-03-15 00:33:53 PDT
(In reply to comment #31)
> This patch seems to crash in debug:
> 
> [49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11
>     base::debug::StackTrace::StackTrace() [0x851b56]
>     base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739]
>     0x7f352f9d1af0
>     WebCore::Biquad::process() [0x256d73e]
>     WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4]
>     WebCore::HRTFKernel::create() [0x25ab549]
>     WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703]
>     WebCore::HRTFElevation::createForSubject() [0x25aa9db]
>     WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e]
>     WebCore::HRTFDatabase::create() [0x25a9630]
>     WebCore::HRTFDatabaseLoader::load() [0x257382d]
>     WebCore::databaseLoaderEntry() [0x25737ac]
>     WTF::threadEntryPoint() [0x1f509e5]
>     WTF::wtfThreadEntryPoint() [0x793cb8]
>     start_thread [0x7f3531e369ca]
>     0x7f352fa8470d
> None

Hi,
That`s strange that I cannot reproduce the issue (by several different machines). Could you provide more details about your platform?
Comment 34 Adam Barth 2012-03-15 01:03:40 PDT
It was the chromium-mac-linux configuration, which is Ubuntu Lucid.
Comment 35 Xingnan Wang 2012-03-15 08:38:09 PDT
(In reply to comment #34)
> It was the chromium-mac-linux configuration, which is Ubuntu Lucid.

I am confused by chromium-mac-linux, should not it be either chromium-linux or chromium-mac? The patch only acts in linux, I`ve tested in Ubuntu oneiric and it works very well.
Comment 36 Raymond Toy 2012-03-15 09:26:08 PDT
(In reply to comment #33)
> (In reply to comment #31)
> > This patch seems to crash in debug:
> > 
> > [49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11
> >     base::debug::StackTrace::StackTrace() [0x851b56]
> >     base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739]
> >     0x7f352f9d1af0
> >     WebCore::Biquad::process() [0x256d73e]
> >     WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4]
> >     WebCore::HRTFKernel::create() [0x25ab549]
> >     WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703]
> >     WebCore::HRTFElevation::createForSubject() [0x25aa9db]
> >     WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e]
> >     WebCore::HRTFDatabase::create() [0x25a9630]
> >     WebCore::HRTFDatabaseLoader::load() [0x257382d]
> >     WebCore::databaseLoaderEntry() [0x25737ac]
> >     WTF::threadEntryPoint() [0x1f509e5]
> >     WTF::wtfThreadEntryPoint() [0x793cb8]
> >     start_thread [0x7f3531e369ca]
> >     0x7f352fa8470d
> > None
> 
> Hi,
> That`s strange that I cannot reproduce the issue (by several different machines). Could you provide more details about your platform?

The stack trace looks totally bogus. (I'm pretty sure the HRTFKernel doesn't call Biquad).

But I can reproduce the crash on my system.  The stack trace is totally trash too.  More work needed....
Comment 37 Raymond Toy 2012-03-15 10:39:35 PDT
(In reply to comment #36)
> (In reply to comment #33)
> > (In reply to comment #31)
> > > This patch seems to crash in debug:
> > > 
> > > [49220:49236:17280644243061:ERROR:process_util_posix.cc(142)] Received signal 11
> > >     base::debug::StackTrace::StackTrace() [0x851b56]
> > >     base::(anonymous namespace)::StackDumpSignalHandler() [0x80b739]
> > >     0x7f352f9d1af0
> > >     WebCore::Biquad::process() [0x256d73e]
> > >     WebCore::HRTFKernel::HRTFKernel() [0x25ac2e4]
> > >     WebCore::HRTFKernel::create() [0x25ab549]
> > >     WebCore::HRTFElevation::calculateKernelsForAzimuthElevation() [0x25aa703]
> > >     WebCore::HRTFElevation::createForSubject() [0x25aa9db]
> > >     WebCore::HRTFDatabase::HRTFDatabase() [0x25a987e]
> > >     WebCore::HRTFDatabase::create() [0x25a9630]
> > >     WebCore::HRTFDatabaseLoader::load() [0x257382d]
> > >     WebCore::databaseLoaderEntry() [0x25737ac]
> > >     WTF::threadEntryPoint() [0x1f509e5]
> > >     WTF::wtfThreadEntryPoint() [0x793cb8]
> > >     start_thread [0x7f3531e369ca]
> > >     0x7f352fa8470d
> > > None
> > 
> > Hi,
> > That`s strange that I cannot reproduce the issue (by several different machines). Could you provide more details about your platform?
> 
> The stack trace looks totally bogus. (I'm pretty sure the HRTFKernel doesn't call Biquad).
> 
> But I can reproduce the crash on my system.  The stack trace is totally trash too.  More work needed....

Ok.  The issue is that on my debug build, I get a 64-bit build so loading the source address into edx loses the top 32 bits, which are non-zero.  We then try to load from that non-existent address and crash.

Don't understand why this doesn't happen on a release build, but this is definitely a bug.
Comment 38 Raymond Toy 2012-03-15 11:51:01 PDT
Created attachment 132095 [details]
Updated patch to handle 32 and 64-bit builds

Here is an updated patch.  This passes all of the webaudio tests in debug mode. (Release mode test in progress.)

This basically replaces all uses of the edx and ecx registers with their 64-bit counterparts, rdx and rcx.

However, the complexity of maintaining this assembly code is getting rather large, so I'm inclined not to do this.  It would be much better if intrinsics could be used, or, even better, do what the FIXME comment says and unroll the while loop with carefully scheduled arithmetic (in C) if that can achieve the desired performance gains.
Comment 39 Adam Barth 2012-03-15 13:33:56 PDT
> I am confused by chromium-mac-linux, should not it be either chromium-linux or chromium-mac? The patch only acts in linux, I`ve tested in Ubuntu oneiric and it works very well.

Sorry.  I meant chromium-linux.  :)
Comment 40 Xingnan Wang 2012-03-15 19:17:59 PDT
(In reply to comment #38)
> Created an attachment (id=132095) [details]
> Updated patch to handle 32 and 64-bit builds
> 
> Here is an updated patch.  This passes all of the webaudio tests in debug mode. (Release mode test in progress.)
> 
> This basically replaces all uses of the edx and ecx registers with their 64-bit counterparts, rdx and rcx.
> 
> However, the complexity of maintaining this assembly code is getting rather large, so I'm inclined not to do this.  It would be much better if intrinsics could be used, or, even better, do what the FIXME comment says and unroll the while loop with carefully scheduled arithmetic (in C) if that can achieve the desired performance gains.

Hi Ray, thank you very much for fixing this bug. I agree that we should not make the code here to get too large.
I have tried some ways to use intrinsics and C code to unroll the loop, but they could not achieve the performance increased by assembly code, because it was not easy to pipeline and reschedule the executing order of the code as assembly did. 
But I can make a try to find whether there is a better way to optimize the code with intrinsics.
Comment 41 Xingnan Wang 2012-03-18 20:18:12 PDT
Created attachment 132532 [details]
Patch
Comment 42 Xingnan Wang 2012-03-18 20:23:38 PDT
(In reply to comment #41)
> Created an attachment (id=132532) [details]
> Patch

Re-implemented the pipeline by intrinsics. What surprised me is the performance is much better than assembly code, ~45% with original ~20%(for 128 frames size).
Comment 43 Raymond Toy 2012-03-19 12:13:27 PDT
Comment on attachment 132532 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=132532&action=review

This looks much nicer than the assembly code, and it's faster too!  Code looks good, but adding some comments would help to understand what's going on.

> Source/WebCore/platform/audio/Biquad.cpp:113
> +    __m128d mma1 = _mm_set_sd(-a1);

mma1 = (0, -a1)

> Source/WebCore/platform/audio/Biquad.cpp:117
> +        mm6 = mm1;

Add comment mm6 = (y2, x)

> Source/WebCore/platform/audio/Biquad.cpp:118
> +        mm1 = _mm_shuffle_pd(mm1, mm4, 0); // y2 = y1

Add comment that mm1 = (y1, x)

> Source/WebCore/platform/audio/Biquad.cpp:119
> +        mm5 = _mm_mul_pd(mm2, mm0); // (x1*b1 x2*b2)

Insert comma:  (x1 * b1, x2 * b2)
Webkit style includes a space around arithmetic operators.  Do the same for next line.

> Source/WebCore/platform/audio/Biquad.cpp:121
> +        mm0 = _mm_shuffle_pd(mm0, mm1, 1); // (x1 = x, x2 = x1)

Comment is confusing.  Maybe say mm0 = (x, x1)

> Source/WebCore/platform/audio/Biquad.cpp:122
> +        mm4 = _mm_mul_sd(mm4, mma1); // -y1*a1

Be more explicit about contents of mm4.

mm4 = (0, -y1 * a1)

> Source/WebCore/platform/audio/Biquad.cpp:123
> +        mm5 = _mm_add_pd(mm5, mm6); // (x1*b1-y2*a2 x2*b2+x*b0)

Insert comma, add space around operators:

(x1 * b1 - y2 * a2, x2 * b2 + x * b0)

> Source/WebCore/platform/audio/Biquad.cpp:124
> +        n--;

Why decrement n here?  Is this a pipeline issue?  If not, can we put the decrement back into while (n--)?

> Source/WebCore/platform/audio/Biquad.cpp:125
> +        mm6 = mm5;

mm6 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0)

> Source/WebCore/platform/audio/Biquad.cpp:126
> +        mm7 = _mm_load_ss(sourceP);

mm7 = (0, 0, 0, *sourceP)

And we're loading the next x value.

> Source/WebCore/platform/audio/Biquad.cpp:127
> +        mm1 = _mm_cvtss_sd(mm1, mm7); // x = *sourceP

Maybe say exactly what is in mm1?  mm1 = (y1, x)

> Source/WebCore/platform/audio/Biquad.cpp:128
> +        mm5 = _mm_add_pd(mm4, mm5); // (x1*b1-y2*a2 x2*b2+x*b0-y1*a1) 

Same comment as for line 123, 119.

Any advantage in doing mm_add_pd?  It looks like we don't care about the high part of mm5.  Is that right?

> Source/WebCore/platform/audio/Biquad.cpp:129
> +        mm6 = _mm_shuffle_pd(mm6, mm6, 1);

mm6 = (x2 * b2 + x * b0, x1 * b1 - y2 * a2)

> Source/WebCore/platform/audio/Biquad.cpp:130
> +        mm5 = _mm_add_pd(mm5, mm6); // y

mm5 = (x1*b1-y2*a2+x2*b2+x*b0, y = x2*b2+x*b0-y1*a1+x1*b1-y2*a2), and we ignore high part of mm5, so maybe just mm_add_sd?

> Source/WebCore/platform/audio/Biquad.cpp:131
> +        mm7 = _mm_cvtsd_ss(mm7, mm5);

mm7 = (0, 0, 0, y)

> Source/WebCore/platform/audio/Biquad.cpp:133
> +        mm4 = mm5; // y1 = y

mm4 = (x1*b1-y2*a2+x2*b2+x*b0, y)
Comment 44 Xingnan Wang 2012-03-19 22:34:45 PDT
Created attachment 132760 [details]
Patch
Comment 45 Xingnan Wang 2012-03-19 22:44:34 PDT
Comment on attachment 132532 [details]
Patch

Hi Ray, patch is updated as your comments.

View in context: https://bugs.webkit.org/attachment.cgi?id=132532&action=review

>> Source/WebCore/platform/audio/Biquad.cpp:113
>> +    __m128d mma1 = _mm_set_sd(-a1);
> 
> mma1 = (0, -a1)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:117
>> +        mm6 = mm1;
> 
> Add comment mm6 = (y2, x)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:118
>> +        mm1 = _mm_shuffle_pd(mm1, mm4, 0); // y2 = y1
> 
> Add comment that mm1 = (y1, x)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:119
>> +        mm5 = _mm_mul_pd(mm2, mm0); // (x1*b1 x2*b2)
> 
> Insert comma:  (x1 * b1, x2 * b2)
> Webkit style includes a space around arithmetic operators.  Do the same for next line.

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:121
>> +        mm0 = _mm_shuffle_pd(mm0, mm1, 1); // (x1 = x, x2 = x1)
> 
> Comment is confusing.  Maybe say mm0 = (x, x1)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:122
>> +        mm4 = _mm_mul_sd(mm4, mma1); // -y1*a1
> 
> Be more explicit about contents of mm4.
> 
> mm4 = (0, -y1 * a1)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:123
>> +        mm5 = _mm_add_pd(mm5, mm6); // (x1*b1-y2*a2 x2*b2+x*b0)
> 
> Insert comma, add space around operators:
> 
> (x1 * b1 - y2 * a2, x2 * b2 + x * b0)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:124
>> +        n--;
> 
> Why decrement n here?  Is this a pipeline issue?  If not, can we put the decrement back into while (n--)?

This will impact the pipeline, using "while(n--)" pulls down the performance a little, 45% to 43%.

>> Source/WebCore/platform/audio/Biquad.cpp:125
>> +        mm6 = mm5;
> 
> mm6 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:126
>> +        mm7 = _mm_load_ss(sourceP);
> 
> mm7 = (0, 0, 0, *sourceP)
> 
> And we're loading the next x value.

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:128
>> +        mm5 = _mm_add_pd(mm4, mm5); // (x1*b1-y2*a2 x2*b2+x*b0-y1*a1) 
> 
> Same comment as for line 123, 119.
> 
> Any advantage in doing mm_add_pd?  It looks like we don't care about the high part of mm5.  Is that right?

Yes, _mm_add_sd is better.

>> Source/WebCore/platform/audio/Biquad.cpp:129
>> +        mm6 = _mm_shuffle_pd(mm6, mm6, 1);
> 
> mm6 = (x2 * b2 + x * b0, x1 * b1 - y2 * a2)

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:130
>> +        mm5 = _mm_add_pd(mm5, mm6); // y
> 
> mm5 = (x1*b1-y2*a2+x2*b2+x*b0, y = x2*b2+x*b0-y1*a1+x1*b1-y2*a2), and we ignore high part of mm5, so maybe just mm_add_sd?

Done.

>> Source/WebCore/platform/audio/Biquad.cpp:133
>> +        mm4 = mm5; // y1 = y
> 
> mm4 = (x1*b1-y2*a2+x2*b2+x*b0, y)

Done.
Comment 46 Raymond Toy 2012-03-20 10:00:27 PDT
Comment on attachment 132760 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review

Code looks good; just a few more minor items to look at.

> Source/WebCore/platform/audio/Biquad.cpp:109
> +    __m128d mm4 = _mm_set_pd(0, y1); // mm4 = (0, y1)

It looks like we only care about the the low part of mm4.  Maybe add comment that we only use the low part of mm4.

Same goes for mma1 in line 113.

> Source/WebCore/platform/audio/Biquad.cpp:122
> +        mm4 = _mm_mul_sd(mm4, mma1); // mm4 = (0, -y1 * a1)

I think the comment should be mm4=-y1*a1 since we're doing mul_sd.  We don't care what's in the high part of mm4.

> Source/WebCore/platform/audio/Biquad.cpp:125
> +        mm6 = mm5; // mm6 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0)

See note on for line 130.  Shuffle mm5 to the right parts of mm6 now instead of later?

> Source/WebCore/platform/audio/Biquad.cpp:126
> +        mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x)

We load mm7 from memory here, but use it in the very next instruction.  Could this be moved further up to lessen any pipeline issues?  We're not using the value of mm7 until line 127.

> Source/WebCore/platform/audio/Biquad.cpp:128
> +        mm5 = _mm_add_sd(mm4, mm5); // mm5 = (x1 * b1 - y2 * a2, x2 * b2 + x * b0 - y1 * a1)

Since we're doing mm_add_sd, we don't care what's in the high part of mm5.  Maybe the comment should be mm5 = x2*b2 + x*b0-y1*a1.

> Source/WebCore/platform/audio/Biquad.cpp:130
> +        mm5 = _mm_add_sd(mm5, mm6); // mm5 = (x1 * b1 - y2 * a2, x * b0 + x1 * b1 + x2 * b2 - y1 * a1 - y2 * a2) = (*, y)

Same comment here as for line 128.  We don't care what's in the high part of mm5.

Also, in line 129, we swap the high and low parts of mm6.  Isn't it possible to do that in line 125 where we assign mm5 to mm6?  Just shuffle mm5 to the right parts of mm6?  That's one less instruction, but I don't know what the pipeline affects will be.

> Source/WebCore/platform/audio/Biquad.cpp:131
> +        mm7 = _mm_cvtsd_ss(mm7, mm5); // mm7 = (0, 0, 0, y)

Actually, I think the comment is wrong. (Sorry about that.)  cvtsd_ss doesn't modify the other parts of mm7.  Plus we're only using the low part, maybe just say mm7 = static_cast<float>(y).
Comment 47 Xingnan Wang 2012-03-20 20:12:57 PDT
Created attachment 132954 [details]
Patch
Comment 48 Xingnan Wang 2012-03-20 20:36:18 PDT
Comment on attachment 132760 [details]
Patch

Thanks, Ray. Update the patch.

View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review

>> Source/WebCore/platform/audio/Biquad.cpp:126
>> +        mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x)
> 
> We load mm7 from memory here, but use it in the very next instruction.  Could this be moved further up to lessen any pipeline issues?  We're not using the value of mm7 until line 127.

There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline.

>> Source/WebCore/platform/audio/Biquad.cpp:130
>> +        mm5 = _mm_add_sd(mm5, mm6); // mm5 = (x1 * b1 - y2 * a2, x * b0 + x1 * b1 + x2 * b2 - y1 * a1 - y2 * a2) = (*, y)
> 
> Same comment here as for line 128.  We don't care what's in the high part of mm5.
> 
> Also, in line 129, we swap the high and low parts of mm6.  Isn't it possible to do that in line 125 where we assign mm5 to mm6?  Just shuffle mm5 to the right parts of mm6?  That's one less instruction, but I don't know what the pipeline affects will be.

It seems we should do that and make us to save one instruction, but it pulls down the performance after that, strange~ 
I guess the compiler does much for us when we use intrinsic, the actual assembly code looks much different from what I expect for after  "gcc -s " .
Comment 49 Raymond Toy 2012-03-21 09:17:07 PDT
(In reply to comment #48)
> (From update of attachment 132760 [details])
> Thanks, Ray. Update the patch.
> 
> View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review
> 
> >> Source/WebCore/platform/audio/Biquad.cpp:126
> >> +        mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x)
> > 
> > We load mm7 from memory here, but use it in the very next instruction.  Could this be moved further up to lessen any pipeline issues?  We're not using the value of mm7 until line 127.
> 
> There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline.
> 
> >> Source/WebCore/platform/audio/Biquad.cpp:130
> >> +        mm5 = _mm_add_sd(mm5, mm6); // mm5 = (x1 * b1 - y2 * a2, x * b0 + x1 * b1 + x2 * b2 - y1 * a1 - y2 * a2) = (*, y)
> > 
> > Same comment here as for line 128.  We don't care what's in the high part of mm5.
> > 
> > Also, in line 129, we swap the high and low parts of mm6.  Isn't it possible to do that in line 125 where we assign mm5 to mm6?  Just shuffle mm5 to the right parts of mm6?  That's one less instruction, but I don't know what the pipeline affects will be.
> 
> It seems we should do that and make us to save one instruction, but it pulls down the performance after that, strange~ 
> I guess the compiler does much for us when we use intrinsic, the actual assembly code looks much different from what I expect for after  "gcc -s " .

Ok.  I'm happy with this patch as it is now.  Looks good.

I do wonder, though, if you moved load_ss to be after line 124 and changed line 125 to do the shuffle (removing the shuffle at line 129) would improve performance.  Or maybe move the (new) shuffle down a few lines since we don't need mm6 until line 130.  Perhaps these changes will help the pipeline.

You don't have to do this experiment if you don't want to.  I'm just kind of curious.  Otherwise, the patch looks good to me and is ready to go.
Comment 50 Chris Rogers 2012-03-21 12:38:13 PDT
Does this latest patch pass the layout test (and not crash) on the Debug build?
Comment 51 Xingnan Wang 2012-03-22 02:41:26 PDT
Comment on attachment 132760 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review

>>>> Source/WebCore/platform/audio/Biquad.cpp:126
>>>> +        mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x)
>>> 
>>> We load mm7 from memory here, but use it in the very next instruction.  Could this be moved further up to lessen any pipeline issues?  We're not using the value of mm7 until line 127.
>> 
>> There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline.
> 
> Ok.  I'm happy with this patch as it is now.  Looks good.
> 
> I do wonder, though, if you moved load_ss to be after line 124 and changed line 125 to do the shuffle (removing the shuffle at line 129) would improve performance.  Or maybe move the (new) shuffle down a few lines since we don't need mm6 until line 130.  Perhaps these changes will help the pipeline.
> 
> You don't have to do this experiment if you don't want to.  I'm just kind of curious.  Otherwise, the patch looks good to me and is ready to go.

It does not matter, I did it. But it seems it does not work, a tiny slowing down. Hard to explain, we cannot exactly schedule the pipeline by intrinsics as the way by assembly, I guess.
Comment 52 Xingnan Wang 2012-03-22 02:52:28 PDT
(In reply to comment #50)
> Does this latest patch pass the layout test (and not crash) on the Debug build?

The layout test passed in both debug and release build from my environment (32-bit).
Last crash issue`s root cause is I used 32-bit registers for the address and it will failed in 64-bit system. I use intrinsics this time and suppose it would not meet this issue. Certainly more tests are needed, but I don`t have 64-bit system right now.

Ray, would you mind to help me to check whether the patch is OK in your 64-bit system?
Comment 53 Raymond Toy 2012-03-22 10:28:13 PDT
(In reply to comment #52)
> (In reply to comment #50)
> > Does this latest patch pass the layout test (and not crash) on the Debug build?
> 
> The layout test passed in both debug and release build from my environment (32-bit).
> Last crash issue`s root cause is I used 32-bit registers for the address and it will failed in 64-bit system. I use intrinsics this time and suppose it would not meet this issue. Certainly more tests are needed, but I don`t have 64-bit system right now.

I don't expect any issues with 64-bit builds.  The compiler should use the right sized registers for the intrinsic operations.

> 
> Ray, would you mind to help me to check whether the patch is OK in your 64-bit system?

No problem.  The debug build passes all tests, except for the convolution-mono-mono test, but I think that's a bug in the convolver that needs to be fixed.

The release build passes all tests too. (convolution-mono-mono passes.)
Comment 54 Raymond Toy 2012-03-22 10:30:29 PDT
(In reply to comment #51)
> (From update of attachment 132760 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=132760&action=review
> 
> >>>> Source/WebCore/platform/audio/Biquad.cpp:126
> >>>> +        mm7 = _mm_load_ss(sourceP); // Load next x value, mm7 = (0, 0, 0, x)
> >>> 
> >>> We load mm7 from memory here, but use it in the very next instruction.  Could this be moved further up to lessen any pipeline issues?  We're not using the value of mm7 until line 127.
> >> 
> >> There is little difference of performance after moving this line up, I think it is not in the key path of the pipeline.
> > 
> > Ok.  I'm happy with this patch as it is now.  Looks good.
> > 
> > I do wonder, though, if you moved load_ss to be after line 124 and changed line 125 to do the shuffle (removing the shuffle at line 129) would improve performance.  Or maybe move the (new) shuffle down a few lines since we don't need mm6 until line 130.  Perhaps these changes will help the pipeline.
> > 
> > You don't have to do this experiment if you don't want to.  I'm just kind of curious.  Otherwise, the patch looks good to me and is ready to go.
> 
> It does not matter, I did it. But it seems it does not work, a tiny slowing down. Hard to explain, we cannot exactly schedule the pipeline by intrinsics as the way by assembly, I guess.

Thanks for doing the test.  I would have thought there might be some small gain, but the compiler can rearrange things as you say.

So I think this new patch is ready to go as is.  Thanks for doing the new version!
Comment 55 Xingnan Wang 2012-03-22 19:58:56 PDT
(In reply to comment #53)
> (In reply to comment #52)
> > (In reply to comment #50)
> > > Does this latest patch pass the layout test (and not crash) on the Debug build?
> > 
> > The layout test passed in both debug and release build from my environment (32-bit).
> > Last crash issue`s root cause is I used 32-bit registers for the address and it will failed in 64-bit system. I use intrinsics this time and suppose it would not meet this issue. Certainly more tests are needed, but I don`t have 64-bit system right now.
> 
> I don't expect any issues with 64-bit builds.  The compiler should use the right sized registers for the intrinsic operations.
> 
> > 
> > Ray, would you mind to help me to check whether the patch is OK in your 64-bit system?
> 
> No problem.  The debug build passes all tests, except for the convolution-mono-mono test, but I think that's a bug in the convolver that needs to be fixed.
> 
> The release build passes all tests too. (convolution-mono-mono passes.)

Ray, thank you for doing the tests.
Comment 56 Brent Fulgham 2012-11-14 17:42:33 PST
If this patch looked good back in March, why hasn't it been approved or landed?
Comment 57 Brent Fulgham 2012-11-15 17:10:38 PST
Comment on attachment 132954 [details]
Patch

I'm landing based on prior reviewers comments.  It seems like it was simply lost in the shuffle at some point.
Comment 58 WebKit Review Bot 2012-11-15 17:20:24 PST
Comment on attachment 132954 [details]
Patch

Clearing flags on attachment: 132954

Committed r134867: <http://trac.webkit.org/changeset/134867>
Comment 59 WebKit Review Bot 2012-11-15 17:20:31 PST
All reviewed patches have been landed.  Closing bug.
Comment 60 WebKit Review Bot 2012-11-16 11:43:33 PST
Re-opened since this is blocked by bug 102544
Comment 61 Brent Fulgham 2012-11-16 11:45:41 PST
Reopened because this change broke one of the security tests.  See https://bugs.webkit.org/show_bug.cgi?id=102524 for details.
Comment 62 Brent Fulgham 2012-11-16 11:47:12 PST
Rollout was done via https://bugs.webkit.org/show_bug.cgi?id=102544.
Comment 63 danceoffwithyourpantsoff 2013-07-09 10:24:55 PDT
I might be completely wrong (I just happened to skim past this patch), but shouldn't the SSE2 code be a runtime check?  The #ifdef SSE2 is only good for knowing if you can use the intrinsics or not (if the compiler supports SSE2).  I don't know the current standing expectation of having SSE2 support is, but last I remember it wasn't considered a requirement:

https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/66p8sJtluJA
Comment 64 Raymond Toy 2013-07-09 10:43:54 PDT
(In reply to comment #63)
> I might be completely wrong (I just happened to skim past this patch), but shouldn't the SSE2 code be a runtime check?  The #ifdef SSE2 is only good for knowing if you can use the intrinsics or not (if the compiler supports SSE2).  I don't know the current standing expectation of having SSE2 support is, but last I remember it wasn't considered a requirement:
> 
> https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/66p8sJtluJA

I think you are correct.  It's been a while since I looked, but I think on Linux and Windows the compiler flag that enables SSE2 support is not set, so the SSE2 code is never compiled in.  On Mac, SSE2 has always been available on x86 machines and I think SSE2 code is compiled by default.  

There should be no issues with this code.  In any case, this particular patch was rolled out.

Ideally, we would like a runtime check and use the SSE2 code when available, falling back to the C reference when SSE2 is not available.