Bug 73789

Summary: Need SSE optimization for SincResampler::Process()
Product: WebKit Reporter: Xingnan Wang <xingnan.wang>
Component: Web AudioAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: benjamin, crogers, kbr, rtoy, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
benjamin: review-
Patch
benjamin: review-
Patch
none
Patch
benjamin: review+, benjamin: commit-queue-
Patch
benjamin: review+, benjamin: commit-queue-
Patch none

Description Xingnan Wang 2011-12-04 18:04:03 PST
Here is a FIXME about SIMD optimizations on a piece of loop code which optimized by unrolling, use SSE will get greater enhancement.
Comment 1 Xingnan Wang 2011-12-04 21:26:19 PST
Created attachment 117838 [details]
Patch
Comment 2 Xingnan Wang 2011-12-04 21:29:02 PST
From my test this optimization got about 45% performance improvement.
Comment 3 Benjamin Poulain 2011-12-04 22:30:52 PST
Comment on attachment 117838 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review

> Source/WebCore/platform/audio/SincResampler.cpp:-249
> -            // FIXME: add SIMD optimizations for the following. The scalar code-path can probably also be optimized better.

This could be changed to
// FIXME: add ARM NEON optimizations for the following. The scalar code-path can probably also be optimized better.
:)

> Source/WebCore/platform/audio/SincResampler.cpp:262
> +                input = *inputP++;
> +                sum1 += input * *k1;
> +                sum2 += input * *k2;
> +                ++k1;
> +                ++k2;

You should put your code under the #define CONVOLVE_ONE_SAMPLE and use the macro here.

> Source/WebCore/platform/audio/SincResampler.cpp:267
> +            int group = n / 4;

Typically you can just compute the end pointer so you do not have to keep track of that.

> Source/WebCore/platform/audio/SincResampler.cpp:355
> +                input = *inputP++;
> +                sum1 += input * *k1;
> +                sum2 += input * *k2;
> +                ++k1;
> +                ++k2;

Same regarding CONVOLVE_ONE_SAMPLE.
Comment 4 Benjamin Poulain 2011-12-04 22:31:27 PST
(In reply to comment #2)
> From my test this optimization got about 45% performance improvement.

Any idea why such little improvement?
Comment 5 Benjamin Poulain 2011-12-04 22:34:05 PST
Comment on attachment 117838 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review

> Source/WebCore/platform/audio/SincResampler.cpp:287
> +                    pInput = reinterpret_cast<__m128*>(const_cast<float*>(inputP));
> +                    pK1 = reinterpret_cast<__m128*>(const_cast<float*>(k1));
> +                    pK2 = reinterpret_cast<__m128*>(const_cast<float*>(k2));

Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess...
Comment 6 Xingnan Wang 2011-12-04 22:49:17 PST
Thanks for your fast response and comments.
(In reply to comment #3)
> (From update of attachment 117838 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review
> 
> > Source/WebCore/platform/audio/SincResampler.cpp:-249
> > -            // FIXME: add SIMD optimizations for the following. The scalar code-path can probably also be optimized better.
> 
> This could be changed to
> // FIXME: add ARM NEON optimizations for the following. The scalar code-path can probably also be optimized better.
> :)
> 
OK, I`ll add it.
> > Source/WebCore/platform/audio/SincResampler.cpp:262
> > +                input = *inputP++;
> > +                sum1 += input * *k1;
> > +                sum2 += input * *k2;
> > +                ++k1;
> > +                ++k2;
> 
> You should put your code under the #define CONVOLVE_ONE_SAMPLE and use the macro here.
> 
All right.
> > Source/WebCore/platform/audio/SincResampler.cpp:267
> > +            int group = n / 4;
> 
> Typically you can just compute the end pointer so you do not have to keep track of that.
> 
I don`t quite understand here, do you mean we just use end pointer to count the loop? 
> > Source/WebCore/platform/audio/SincResampler.cpp:355
> > +                input = *inputP++;
> > +                sum1 += input * *k1;
> > +                sum2 += input * *k2;
> > +                ++k1;
> > +                ++k2;
> 
> Same regarding CONVOLVE_ONE_SAMPLE.
All right.
Comment 7 Xingnan Wang 2011-12-04 22:55:47 PST
(In reply to comment #4)
> (In reply to comment #2)
> > From my test this optimization got about 45% performance improvement.
> 
> Any idea why such little improvement?

The data is compared with the original impl with unrolling which also has some improvement. And for sse we must handle the address alignment and it produces the memory copy so it pulls down the performance. For the vadd (https://bugs.webkit.org/show_bug.cgi?id=73182) , using similar way, we just get the 100% performance improvement.
Comment 8 Xingnan Wang 2011-12-04 23:35:20 PST
(In reply to comment #5)
> (From update of attachment 117838 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review
> 
> > Source/WebCore/platform/audio/SincResampler.cpp:287
> > +                    pInput = reinterpret_cast<__m128*>(const_cast<float*>(inputP));
> > +                    pK1 = reinterpret_cast<__m128*>(const_cast<float*>(k1));
> > +                    pK2 = reinterpret_cast<__m128*>(const_cast<float*>(k2));
> 
> Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess...

I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it.
Comment 9 Benjamin Poulain 2011-12-05 00:43:47 PST
> > Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess...
> 
> I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it.

That is really strange. Intrinsics should not result in function calls, they are supposed to be always inlined. Are you sure you are not testing a debug build?
Comment 10 Xingnan Wang 2011-12-05 02:31:56 PST
(In reply to comment #9)
> > > Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess...
> > 
> > I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it.
> 
> That is really strange. Intrinsics should not result in function calls, they are supposed to be always inlined. Are you sure you are not testing a debug build?

Yes, you are right, there is no function call, so I did some investigation and found there were additional 2 movaps and 1 movl when using _mm_load_ps, while the reinterpret_cast is executed more directly.

Here is some details of test:
 === 
sse              run time: 10 s 430768981 ns
sse with mm_load run time: 11 s 386974860 ns
std              run time: 15 s 747296354 ns
sse              speed up = 1.509697
sse with mm_load speed up = 1.382922
Comment 11 Xingnan Wang 2011-12-05 02:43:44 PST
(In reply to comment #10)
> (In reply to comment #9)
> > > > Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess...
> > > 
> > > I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it.
> > 
> > That is really strange. Intrinsics should not result in function calls, they are supposed to be always inlined. Are you sure you are not testing a debug build?
> 
> Yes, you are right, there is no function call, so I did some investigation and found there were additional 2 movaps and 1 movl when using _mm_load_ps, while the reinterpret_cast is executed more directly.
> 
> Here is some details of test:
>  === 
> sse              run time: 10 s 430768981 ns
> sse with mm_load run time: 11 s 386974860 ns
> std              run time: 15 s 747296354 ns
> sse              speed up = 1.509697
> sse with mm_load speed up = 1.382922

More update:

It should be the gcc optimization problem, if using -O2 I got the same result.
Comment 12 Xingnan Wang 2011-12-05 06:52:07 PST
Created attachment 117871 [details]
Patch
Comment 13 Xingnan Wang 2011-12-05 07:00:41 PST
Hi Poulain,
    Thanks your review. I updated the patch as your comments and get the new test result. The improvement is up to about 70% when complied with -O2.
Comment 14 Benjamin Poulain 2011-12-05 10:16:33 PST
Comment on attachment 117871 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=117871&action=review

> Source/WebCore/ChangeLog:7
> +        Implement the SSE optimization in SincResampler::process()
> +        https://bugs.webkit.org/show_bug.cgi?id=73789
> +
> +        Reviewed by NOBODY (OOPS!).
> +

You should add a little description, mentioning the performance gain of the change.

> Source/WebCore/ChangeLog:10
> +        * platform/audio/SincResampler.cpp:
> +        (WebCore::SincResampler::process):
>  2011-12-05  Roland Steiner  <rolandsteiner@chromium.org>

Missing space at the end of the ChangeLog.

> Source/WebCore/platform/audio/SincResampler.cpp:264
> +            while ((reinterpret_cast<size_t>(inputP) & 0x0F) && n) {

This should be uintptr_t instead of size_t.

> Source/WebCore/platform/audio/SincResampler.cpp:270
> +            int group = n / 4;

Instead of tracking this, you could find the end, and then loop until the end: 
while (mInput < end) {
   ...
}
The advantage is you remove the computation of group-- from a tight loop.

> Source/WebCore/platform/audio/SincResampler.cpp:282
> +            sums1 = _mm_setzero_ps();
> +            sums2 = _mm_setzero_ps();

You can move those to the lines where you declare sums1 and sums2.

> Source/WebCore/platform/audio/SincResampler.cpp:284
> +            k1Aligned = !(reinterpret_cast<size_t>(k1) & 0x0F);
> +            k2Aligned = !(reinterpret_cast<size_t>(k2) & 0x0F);

This should be uintptr_t instead of size_t.

> Source/WebCore/platform/audio/SincResampler.cpp:298
> +                    mul1 = _mm_mul_ps(mInput, mK1);
> +                    mul2 = _mm_mul_ps(mInput, mK2);
> +                    sums1 = _mm_add_ps(sums1, mul1);
> +                    sums2 = _mm_add_ps(sums2, mul2);
> +
> +                    inputP += 4;
> +                    k1 += 4;
> +                    k2 += 4;

This code is copied three time. It could be moved to a macro CONVOLVE_4_SAMPLE_SSE.
You can try an inline function but I think that will fail with MSVC compiler due to a bug when passing vectors as arguments.
Comment 15 Benjamin Poulain 2011-12-05 10:17:38 PST
> Hi Poulain,

My first name is Benjamin :)
Comment 16 Xingnan Wang 2011-12-06 00:33:36 PST
Created attachment 118001 [details]
Patch
Comment 17 Xingnan Wang 2011-12-06 00:43:18 PST
(In reply to comment #15)
> > Hi Poulain,
> 
> My first name is Benjamin :)

Hi Benjamin, 
    So sorry for misusing your name -_-!
    I updated the patch as your comments, please help to review.

    Also I have a question that I tried to use script webkit-patch to upload the patch, but it showed me "Failed to assign bug to you (can't find assigned_to) control" and upload failed, does that mean I cannot upload by script unless I`m the assignee?

    Thank you~
Comment 18 Benjamin Poulain 2011-12-06 01:43:41 PST
(In reply to comment #17)
>     I updated the patch as your comments, please help to review.

Please also update the Changelog.

>     Also I have a question that I tried to use script webkit-patch to upload the patch, but it showed me "Failed to assign bug to you (can't find assigned_to) control" and upload failed, does that mean I cannot upload by script unless I`m the assignee?

It is because you do not have yet the EditBug privileges. Please file a bug for webkit-patch, in my opinion it should work regardless, and just ignore that step.
Comment 19 Xingnan Wang 2011-12-06 01:58:58 PST
Created attachment 118008 [details]
Patch
Comment 20 Xingnan Wang 2011-12-06 02:00:52 PST
(In reply to comment #18)
> (In reply to comment #17)
> >     I updated the patch as your comments, please help to review.
> 
> Please also update the Changelog.
> 
> >     Also I have a question that I tried to use script webkit-patch to upload the patch, but it showed me "Failed to assign bug to you (can't find assigned_to) control" and upload failed, does that mean I cannot upload by script unless I`m the assignee?
> 
> It is because you do not have yet the EditBug privileges. Please file a bug for webkit-patch, in my opinion it should work regardless, and just ignore that step.

OK, I`ll file a bug.
And Changelog updated, thanks.
Comment 21 Raymond Toy 2011-12-06 10:14:56 PST
Comment on attachment 118008 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118008&action=review

> Source/WebCore/platform/audio/SincResampler.cpp:260
> +#ifdef __SSE2__

I think visually, it might make more sense to stick the entire #ifdef __SSE2__ code inside the original
 {
   float input;
   ...
 }

So we get something like
 {
   float input;
#ifdef __SSE2__
   ...
#else
   <original code>
#endif
 }

> Source/WebCore/platform/audio/SincResampler.cpp:282
> +#define CONVOLVE_4_SAMPLE                    \

Nit:  Probably should be named CONVOLVE_4_SAMPLES (plural, not singular).

> Source/WebCore/platform/audio/SincResampler.cpp:293
> +                    mInput = _mm_load_ps(inputP);

I might consider adding a new macro

#define LOAD_DATA(l1, l2) \
  mInput = mm_load_ps(inputP); \
  mK1 = _mm_##l1_ps(k1); \
  mK2 = _mm_##l2_ps(k2)

Then here we can say
  LOAD_DATA(load, load);

And below, we can have
  LOAD_DATA(loadu, load);
  LOAD_DATA(load, loadu);
  LOAD_DATA(loadu, loadu);

This makes it easy to see that they're all the same except whether we do aligned or unaligned loads.

(I didn't test this code; some minor changes might be needed.)

> Source/WebCore/platform/audio/SincResampler.cpp:327
> +            groupSumP = reinterpret_cast<float*>(&sums2);

I think I'd write the above as

sum1 += groupSumP[0];
sum1 += groupSumP[1];
sum1 += groupSumP[2];
sum1 += groupSumP[3];

Or maybe just sum1 += groupSumP[0] + groupSumP[1] + groupSumP[2] + groupSumP[3]

Same for sum2 below.
Comment 22 Xingnan Wang 2011-12-06 17:19:39 PST
(In reply to comment #21)
> (From update of attachment 118008 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=118008&action=review
> 
> > Source/WebCore/platform/audio/SincResampler.cpp:260
> > +#ifdef __SSE2__
> 
> I think visually, it might make more sense to stick the entire #ifdef __SSE2__ code inside the original
>  {
>    float input;
>    ...
>  }
> 
> So we get something like
>  {
>    float input;
> #ifdef __SSE2__
>    ...
> #else
>    <original code>
> #endif
>  }
> 
> > Source/WebCore/platform/audio/SincResampler.cpp:282
> > +#define CONVOLVE_4_SAMPLE                    \
> 
> Nit:  Probably should be named CONVOLVE_4_SAMPLES (plural, not singular).
> 
> > Source/WebCore/platform/audio/SincResampler.cpp:293
> > +                    mInput = _mm_load_ps(inputP);
> 
> I might consider adding a new macro
> 
> #define LOAD_DATA(l1, l2) \
>   mInput = mm_load_ps(inputP); \
>   mK1 = _mm_##l1_ps(k1); \
>   mK2 = _mm_##l2_ps(k2)
> 
> Then here we can say
>   LOAD_DATA(load, load);
> 
> And below, we can have
>   LOAD_DATA(loadu, load);
>   LOAD_DATA(load, loadu);
>   LOAD_DATA(loadu, loadu);
> 
> This makes it easy to see that they're all the same except whether we do aligned or unaligned loads.
> 
> (I didn't test this code; some minor changes might be needed.)
> 
> > Source/WebCore/platform/audio/SincResampler.cpp:327
> > +            groupSumP = reinterpret_cast<float*>(&sums2);
> 
> I think I'd write the above as
> 
> sum1 += groupSumP[0];
> sum1 += groupSumP[1];
> sum1 += groupSumP[2];
> sum1 += groupSumP[3];
> 
> Or maybe just sum1 += groupSumP[0] + groupSumP[1] + groupSumP[2] + groupSumP[3]
> 
> Same for sum2 below.

Hi Raymond,
    Thanks your good comments, I`ll update it.
Comment 23 Benjamin Poulain 2011-12-06 19:45:35 PST
> Then here we can say
>   LOAD_DATA(load, load);
> 
> And below, we can have
>   LOAD_DATA(loadu, load);
>   LOAD_DATA(load, loadu);
>   LOAD_DATA(loadu, loadu);
> 
> This makes it easy to see that they're all the same except whether we do aligned or unaligned loads.

I think this would be much harder to follow than the existing code...
Comment 24 Raymond Toy 2011-12-07 11:01:24 PST
(In reply to comment #23)
> > Then here we can say
> >   LOAD_DATA(load, load);
> > 
> > And below, we can have
> >   LOAD_DATA(loadu, load);
> >   LOAD_DATA(load, loadu);
> >   LOAD_DATA(loadu, loadu);
> > 
> > This makes it easy to see that they're all the same except whether we do aligned or unaligned loads.
> 
> I think this would be much harder to follow than the existing code...

That's why I said "maybe add".  But it took me a bit of time to see what was different between the three branches.  Just one character difference...

I won't mind of the macro is not added.
Comment 25 Benjamin Poulain 2011-12-07 13:18:46 PST
The patch looks correct to me but I'd like the comments of Raymond to be fixed before landing.

I think "load" and "loadu" are common enough to avoid a LOAD_DATA macro but I leave that up to you.
Comment 26 Xingnan Wang 2011-12-07 18:48:14 PST
Created attachment 118305 [details]
Patch
Comment 27 Xingnan Wang 2011-12-07 18:56:09 PST
(In reply to comment #26)
> Created an attachment (id=118305) [details]
> Patch

Patch is updated with keeping LOAD_DATA macro.
Comment 28 Benjamin Poulain 2011-12-07 20:08:34 PST
Comment on attachment 118305 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118305&action=review

> Source/WebCore/platform/audio/SincResampler.cpp:319
> +                // Summarize the SSE results to sum1 and sum2

Make comments look like sentences by starting with a capital letter and ending with a period (punctation). One exception may be end of line comments like this "if (x == y) // false for NaN".
Comment 29 Xingnan Wang 2011-12-07 20:32:35 PST
Created attachment 118320 [details]
Patch

Add the period for the comment.
Comment 30 Benjamin Poulain 2011-12-07 20:37:44 PST
Comment on attachment 118320 [details]
Patch

Looks correct.

Sorry it took so many iterations to find all the little issues.
Comment 31 Xingnan Wang 2011-12-07 20:42:12 PST
(In reply to comment #30)
> (From update of attachment 118320 [details])
> Looks correct.
> 
> Sorry it took so many iterations to find all the little issues.

That`s all right, it could make me to pay more attention to the details in the feature, thanks all your review.
Comment 32 WebKit Review Bot 2011-12-07 21:23:57 PST
Comment on attachment 118320 [details]
Patch

Clearing flags on attachment: 118320

Committed r102312: <http://trac.webkit.org/changeset/102312>
Comment 33 WebKit Review Bot 2011-12-07 21:24:03 PST
All reviewed patches have been landed.  Closing bug.