RESOLVED FIXED73789
Need SSE optimization for SincResampler::Process()
https://bugs.webkit.org/show_bug.cgi?id=73789
Summary Need SSE optimization for SincResampler::Process()
Xingnan Wang
Reported 2011-12-04 18:04:03 PST
Here is a FIXME about SIMD optimizations on a piece of loop code which optimized by unrolling, use SSE will get greater enhancement.
Attachments
Patch (6.27 KB, patch)
2011-12-04 21:26 PST, Xingnan Wang
benjamin: review-
Patch (5.80 KB, patch)
2011-12-05 06:52 PST, Xingnan Wang
benjamin: review-
Patch (5.09 KB, patch)
2011-12-06 00:33 PST, Xingnan Wang
no flags
Patch (5.18 KB, patch)
2011-12-06 01:58 PST, Xingnan Wang
benjamin: review+
benjamin: commit-queue-
Patch (5.04 KB, patch)
2011-12-07 18:48 PST, Xingnan Wang
benjamin: review+
benjamin: commit-queue-
Patch (5.04 KB, patch)
2011-12-07 20:32 PST, Xingnan Wang
no flags
Xingnan Wang
Comment 1 2011-12-04 21:26:19 PST
Xingnan Wang
Comment 2 2011-12-04 21:29:02 PST
From my test this optimization got about 45% performance improvement.
Benjamin Poulain
Comment 3 2011-12-04 22:30:52 PST
Comment on attachment 117838 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review > Source/WebCore/platform/audio/SincResampler.cpp:-249 > - // FIXME: add SIMD optimizations for the following. The scalar code-path can probably also be optimized better. This could be changed to // FIXME: add ARM NEON optimizations for the following. The scalar code-path can probably also be optimized better. :) > Source/WebCore/platform/audio/SincResampler.cpp:262 > + input = *inputP++; > + sum1 += input * *k1; > + sum2 += input * *k2; > + ++k1; > + ++k2; You should put your code under the #define CONVOLVE_ONE_SAMPLE and use the macro here. > Source/WebCore/platform/audio/SincResampler.cpp:267 > + int group = n / 4; Typically you can just compute the end pointer so you do not have to keep track of that. > Source/WebCore/platform/audio/SincResampler.cpp:355 > + input = *inputP++; > + sum1 += input * *k1; > + sum2 += input * *k2; > + ++k1; > + ++k2; Same regarding CONVOLVE_ONE_SAMPLE.
Benjamin Poulain
Comment 4 2011-12-04 22:31:27 PST
(In reply to comment #2) > From my test this optimization got about 45% performance improvement. Any idea why such little improvement?
Benjamin Poulain
Comment 5 2011-12-04 22:34:05 PST
Comment on attachment 117838 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review > Source/WebCore/platform/audio/SincResampler.cpp:287 > + pInput = reinterpret_cast<__m128*>(const_cast<float*>(inputP)); > + pK1 = reinterpret_cast<__m128*>(const_cast<float*>(k1)); > + pK2 = reinterpret_cast<__m128*>(const_cast<float*>(k2)); Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess...
Xingnan Wang
Comment 6 2011-12-04 22:49:17 PST
Thanks for your fast response and comments. (In reply to comment #3) > (From update of attachment 117838 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review > > > Source/WebCore/platform/audio/SincResampler.cpp:-249 > > - // FIXME: add SIMD optimizations for the following. The scalar code-path can probably also be optimized better. > > This could be changed to > // FIXME: add ARM NEON optimizations for the following. The scalar code-path can probably also be optimized better. > :) > OK, I`ll add it. > > Source/WebCore/platform/audio/SincResampler.cpp:262 > > + input = *inputP++; > > + sum1 += input * *k1; > > + sum2 += input * *k2; > > + ++k1; > > + ++k2; > > You should put your code under the #define CONVOLVE_ONE_SAMPLE and use the macro here. > All right. > > Source/WebCore/platform/audio/SincResampler.cpp:267 > > + int group = n / 4; > > Typically you can just compute the end pointer so you do not have to keep track of that. > I don`t quite understand here, do you mean we just use end pointer to count the loop? > > Source/WebCore/platform/audio/SincResampler.cpp:355 > > + input = *inputP++; > > + sum1 += input * *k1; > > + sum2 += input * *k2; > > + ++k1; > > + ++k2; > > Same regarding CONVOLVE_ONE_SAMPLE. All right.
Xingnan Wang
Comment 7 2011-12-04 22:55:47 PST
(In reply to comment #4) > (In reply to comment #2) > > From my test this optimization got about 45% performance improvement. > > Any idea why such little improvement? The data is compared with the original impl with unrolling which also has some improvement. And for sse we must handle the address alignment and it produces the memory copy so it pulls down the performance. For the vadd (https://bugs.webkit.org/show_bug.cgi?id=73182) , using similar way, we just get the 100% performance improvement.
Xingnan Wang
Comment 8 2011-12-04 23:35:20 PST
(In reply to comment #5) > (From update of attachment 117838 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review > > > Source/WebCore/platform/audio/SincResampler.cpp:287 > > + pInput = reinterpret_cast<__m128*>(const_cast<float*>(inputP)); > > + pK1 = reinterpret_cast<__m128*>(const_cast<float*>(k1)); > > + pK2 = reinterpret_cast<__m128*>(const_cast<float*>(k2)); > > Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess... I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it.
Benjamin Poulain
Comment 9 2011-12-05 00:43:47 PST
> > Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess... > > I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it. That is really strange. Intrinsics should not result in function calls, they are supposed to be always inlined. Are you sure you are not testing a debug build?
Xingnan Wang
Comment 10 2011-12-05 02:31:56 PST
(In reply to comment #9) > > > Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess... > > > > I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it. > > That is really strange. Intrinsics should not result in function calls, they are supposed to be always inlined. Are you sure you are not testing a debug build? Yes, you are right, there is no function call, so I did some investigation and found there were additional 2 movaps and 1 movl when using _mm_load_ps, while the reinterpret_cast is executed more directly. Here is some details of test: === sse run time: 10 s 430768981 ns sse with mm_load run time: 11 s 386974860 ns std run time: 15 s 747296354 ns sse speed up = 1.509697 sse with mm_load speed up = 1.382922
Xingnan Wang
Comment 11 2011-12-05 02:43:44 PST
(In reply to comment #10) > (In reply to comment #9) > > > > Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess... > > > > > > I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it. > > > > That is really strange. Intrinsics should not result in function calls, they are supposed to be always inlined. Are you sure you are not testing a debug build? > > Yes, you are right, there is no function call, so I did some investigation and found there were additional 2 movaps and 1 movl when using _mm_load_ps, while the reinterpret_cast is executed more directly. > > Here is some details of test: > === > sse run time: 10 s 430768981 ns > sse with mm_load run time: 11 s 386974860 ns > std run time: 15 s 747296354 ns > sse speed up = 1.509697 > sse with mm_load speed up = 1.382922 More update: It should be the gcc optimization problem, if using -O2 I got the same result.
Xingnan Wang
Comment 12 2011-12-05 06:52:07 PST
Xingnan Wang
Comment 13 2011-12-05 07:00:41 PST
Hi Poulain, Thanks your review. I updated the patch as your comments and get the new test result. The improvement is up to about 70% when complied with -O2.
Benjamin Poulain
Comment 14 2011-12-05 10:16:33 PST
Comment on attachment 117871 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=117871&action=review > Source/WebCore/ChangeLog:7 > + Implement the SSE optimization in SincResampler::process() > + https://bugs.webkit.org/show_bug.cgi?id=73789 > + > + Reviewed by NOBODY (OOPS!). > + You should add a little description, mentioning the performance gain of the change. > Source/WebCore/ChangeLog:10 > + * platform/audio/SincResampler.cpp: > + (WebCore::SincResampler::process): > 2011-12-05 Roland Steiner <rolandsteiner@chromium.org> Missing space at the end of the ChangeLog. > Source/WebCore/platform/audio/SincResampler.cpp:264 > + while ((reinterpret_cast<size_t>(inputP) & 0x0F) && n) { This should be uintptr_t instead of size_t. > Source/WebCore/platform/audio/SincResampler.cpp:270 > + int group = n / 4; Instead of tracking this, you could find the end, and then loop until the end: while (mInput < end) { ... } The advantage is you remove the computation of group-- from a tight loop. > Source/WebCore/platform/audio/SincResampler.cpp:282 > + sums1 = _mm_setzero_ps(); > + sums2 = _mm_setzero_ps(); You can move those to the lines where you declare sums1 and sums2. > Source/WebCore/platform/audio/SincResampler.cpp:284 > + k1Aligned = !(reinterpret_cast<size_t>(k1) & 0x0F); > + k2Aligned = !(reinterpret_cast<size_t>(k2) & 0x0F); This should be uintptr_t instead of size_t. > Source/WebCore/platform/audio/SincResampler.cpp:298 > + mul1 = _mm_mul_ps(mInput, mK1); > + mul2 = _mm_mul_ps(mInput, mK2); > + sums1 = _mm_add_ps(sums1, mul1); > + sums2 = _mm_add_ps(sums2, mul2); > + > + inputP += 4; > + k1 += 4; > + k2 += 4; This code is copied three time. It could be moved to a macro CONVOLVE_4_SAMPLE_SSE. You can try an inline function but I think that will fail with MSVC compiler due to a bug when passing vectors as arguments.
Benjamin Poulain
Comment 15 2011-12-05 10:17:38 PST
> Hi Poulain, My first name is Benjamin :)
Xingnan Wang
Comment 16 2011-12-06 00:33:36 PST
Xingnan Wang
Comment 17 2011-12-06 00:43:18 PST
(In reply to comment #15) > > Hi Poulain, > > My first name is Benjamin :) Hi Benjamin, So sorry for misusing your name -_-! I updated the patch as your comments, please help to review. Also I have a question that I tried to use script webkit-patch to upload the patch, but it showed me "Failed to assign bug to you (can't find assigned_to) control" and upload failed, does that mean I cannot upload by script unless I`m the assignee? Thank you~
Benjamin Poulain
Comment 18 2011-12-06 01:43:41 PST
(In reply to comment #17) > I updated the patch as your comments, please help to review. Please also update the Changelog. > Also I have a question that I tried to use script webkit-patch to upload the patch, but it showed me "Failed to assign bug to you (can't find assigned_to) control" and upload failed, does that mean I cannot upload by script unless I`m the assignee? It is because you do not have yet the EditBug privileges. Please file a bug for webkit-patch, in my opinion it should work regardless, and just ignore that step.
Xingnan Wang
Comment 19 2011-12-06 01:58:58 PST
Xingnan Wang
Comment 20 2011-12-06 02:00:52 PST
(In reply to comment #18) > (In reply to comment #17) > > I updated the patch as your comments, please help to review. > > Please also update the Changelog. > > > Also I have a question that I tried to use script webkit-patch to upload the patch, but it showed me "Failed to assign bug to you (can't find assigned_to) control" and upload failed, does that mean I cannot upload by script unless I`m the assignee? > > It is because you do not have yet the EditBug privileges. Please file a bug for webkit-patch, in my opinion it should work regardless, and just ignore that step. OK, I`ll file a bug. And Changelog updated, thanks.
Raymond Toy
Comment 21 2011-12-06 10:14:56 PST
Comment on attachment 118008 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118008&action=review > Source/WebCore/platform/audio/SincResampler.cpp:260 > +#ifdef __SSE2__ I think visually, it might make more sense to stick the entire #ifdef __SSE2__ code inside the original { float input; ... } So we get something like { float input; #ifdef __SSE2__ ... #else <original code> #endif } > Source/WebCore/platform/audio/SincResampler.cpp:282 > +#define CONVOLVE_4_SAMPLE \ Nit: Probably should be named CONVOLVE_4_SAMPLES (plural, not singular). > Source/WebCore/platform/audio/SincResampler.cpp:293 > + mInput = _mm_load_ps(inputP); I might consider adding a new macro #define LOAD_DATA(l1, l2) \ mInput = mm_load_ps(inputP); \ mK1 = _mm_##l1_ps(k1); \ mK2 = _mm_##l2_ps(k2) Then here we can say LOAD_DATA(load, load); And below, we can have LOAD_DATA(loadu, load); LOAD_DATA(load, loadu); LOAD_DATA(loadu, loadu); This makes it easy to see that they're all the same except whether we do aligned or unaligned loads. (I didn't test this code; some minor changes might be needed.) > Source/WebCore/platform/audio/SincResampler.cpp:327 > + groupSumP = reinterpret_cast<float*>(&sums2); I think I'd write the above as sum1 += groupSumP[0]; sum1 += groupSumP[1]; sum1 += groupSumP[2]; sum1 += groupSumP[3]; Or maybe just sum1 += groupSumP[0] + groupSumP[1] + groupSumP[2] + groupSumP[3] Same for sum2 below.
Xingnan Wang
Comment 22 2011-12-06 17:19:39 PST
(In reply to comment #21) > (From update of attachment 118008 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=118008&action=review > > > Source/WebCore/platform/audio/SincResampler.cpp:260 > > +#ifdef __SSE2__ > > I think visually, it might make more sense to stick the entire #ifdef __SSE2__ code inside the original > { > float input; > ... > } > > So we get something like > { > float input; > #ifdef __SSE2__ > ... > #else > <original code> > #endif > } > > > Source/WebCore/platform/audio/SincResampler.cpp:282 > > +#define CONVOLVE_4_SAMPLE \ > > Nit: Probably should be named CONVOLVE_4_SAMPLES (plural, not singular). > > > Source/WebCore/platform/audio/SincResampler.cpp:293 > > + mInput = _mm_load_ps(inputP); > > I might consider adding a new macro > > #define LOAD_DATA(l1, l2) \ > mInput = mm_load_ps(inputP); \ > mK1 = _mm_##l1_ps(k1); \ > mK2 = _mm_##l2_ps(k2) > > Then here we can say > LOAD_DATA(load, load); > > And below, we can have > LOAD_DATA(loadu, load); > LOAD_DATA(load, loadu); > LOAD_DATA(loadu, loadu); > > This makes it easy to see that they're all the same except whether we do aligned or unaligned loads. > > (I didn't test this code; some minor changes might be needed.) > > > Source/WebCore/platform/audio/SincResampler.cpp:327 > > + groupSumP = reinterpret_cast<float*>(&sums2); > > I think I'd write the above as > > sum1 += groupSumP[0]; > sum1 += groupSumP[1]; > sum1 += groupSumP[2]; > sum1 += groupSumP[3]; > > Or maybe just sum1 += groupSumP[0] + groupSumP[1] + groupSumP[2] + groupSumP[3] > > Same for sum2 below. Hi Raymond, Thanks your good comments, I`ll update it.
Benjamin Poulain
Comment 23 2011-12-06 19:45:35 PST
> Then here we can say > LOAD_DATA(load, load); > > And below, we can have > LOAD_DATA(loadu, load); > LOAD_DATA(load, loadu); > LOAD_DATA(loadu, loadu); > > This makes it easy to see that they're all the same except whether we do aligned or unaligned loads. I think this would be much harder to follow than the existing code...
Raymond Toy
Comment 24 2011-12-07 11:01:24 PST
(In reply to comment #23) > > Then here we can say > > LOAD_DATA(load, load); > > > > And below, we can have > > LOAD_DATA(loadu, load); > > LOAD_DATA(load, loadu); > > LOAD_DATA(loadu, loadu); > > > > This makes it easy to see that they're all the same except whether we do aligned or unaligned loads. > > I think this would be much harder to follow than the existing code... That's why I said "maybe add". But it took me a bit of time to see what was different between the three branches. Just one character difference... I won't mind of the macro is not added.
Benjamin Poulain
Comment 25 2011-12-07 13:18:46 PST
The patch looks correct to me but I'd like the comments of Raymond to be fixed before landing. I think "load" and "loadu" are common enough to avoid a LOAD_DATA macro but I leave that up to you.
Xingnan Wang
Comment 26 2011-12-07 18:48:14 PST
Xingnan Wang
Comment 27 2011-12-07 18:56:09 PST
(In reply to comment #26) > Created an attachment (id=118305) [details] > Patch Patch is updated with keeping LOAD_DATA macro.
Benjamin Poulain
Comment 28 2011-12-07 20:08:34 PST
Comment on attachment 118305 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118305&action=review > Source/WebCore/platform/audio/SincResampler.cpp:319 > + // Summarize the SSE results to sum1 and sum2 Make comments look like sentences by starting with a capital letter and ending with a period (punctation). One exception may be end of line comments like this "if (x == y) // false for NaN".
Xingnan Wang
Comment 29 2011-12-07 20:32:35 PST
Created attachment 118320 [details] Patch Add the period for the comment.
Benjamin Poulain
Comment 30 2011-12-07 20:37:44 PST
Comment on attachment 118320 [details] Patch Looks correct. Sorry it took so many iterations to find all the little issues.
Xingnan Wang
Comment 31 2011-12-07 20:42:12 PST
(In reply to comment #30) > (From update of attachment 118320 [details]) > Looks correct. > > Sorry it took so many iterations to find all the little issues. That`s all right, it could make me to pay more attention to the details in the feature, thanks all your review.
WebKit Review Bot
Comment 32 2011-12-07 21:23:57 PST
Comment on attachment 118320 [details] Patch Clearing flags on attachment: 118320 Committed r102312: <http://trac.webkit.org/changeset/102312>
WebKit Review Bot
Comment 33 2011-12-07 21:24:03 PST
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.