Summary: | Need SSE optimization for SincResampler::Process() | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Xingnan Wang <xingnan.wang> | ||||||||||||||
Component: | Web Audio | Assignee: | Nobody <webkit-unassigned> | ||||||||||||||
Status: | RESOLVED FIXED | ||||||||||||||||
Severity: | Normal | CC: | benjamin, crogers, kbr, rtoy, webkit.review.bot | ||||||||||||||
Priority: | P2 | ||||||||||||||||
Version: | 528+ (Nightly build) | ||||||||||||||||
Hardware: | Unspecified | ||||||||||||||||
OS: | Unspecified | ||||||||||||||||
Attachments: |
|
Description
Xingnan Wang
2011-12-04 18:04:03 PST
Created attachment 117838 [details]
Patch
From my test this optimization got about 45% performance improvement. Comment on attachment 117838 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review > Source/WebCore/platform/audio/SincResampler.cpp:-249 > - // FIXME: add SIMD optimizations for the following. The scalar code-path can probably also be optimized better. This could be changed to // FIXME: add ARM NEON optimizations for the following. The scalar code-path can probably also be optimized better. :) > Source/WebCore/platform/audio/SincResampler.cpp:262 > + input = *inputP++; > + sum1 += input * *k1; > + sum2 += input * *k2; > + ++k1; > + ++k2; You should put your code under the #define CONVOLVE_ONE_SAMPLE and use the macro here. > Source/WebCore/platform/audio/SincResampler.cpp:267 > + int group = n / 4; Typically you can just compute the end pointer so you do not have to keep track of that. > Source/WebCore/platform/audio/SincResampler.cpp:355 > + input = *inputP++; > + sum1 += input * *k1; > + sum2 += input * *k2; > + ++k1; > + ++k2; Same regarding CONVOLVE_ONE_SAMPLE. (In reply to comment #2) > From my test this optimization got about 45% performance improvement. Any idea why such little improvement? Comment on attachment 117838 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review > Source/WebCore/platform/audio/SincResampler.cpp:287 > + pInput = reinterpret_cast<__m128*>(const_cast<float*>(inputP)); > + pK1 = reinterpret_cast<__m128*>(const_cast<float*>(k1)); > + pK2 = reinterpret_cast<__m128*>(const_cast<float*>(k2)); Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess... Thanks for your fast response and comments. (In reply to comment #3) > (From update of attachment 117838 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review > > > Source/WebCore/platform/audio/SincResampler.cpp:-249 > > - // FIXME: add SIMD optimizations for the following. The scalar code-path can probably also be optimized better. > > This could be changed to > // FIXME: add ARM NEON optimizations for the following. The scalar code-path can probably also be optimized better. > :) > OK, I`ll add it. > > Source/WebCore/platform/audio/SincResampler.cpp:262 > > + input = *inputP++; > > + sum1 += input * *k1; > > + sum2 += input * *k2; > > + ++k1; > > + ++k2; > > You should put your code under the #define CONVOLVE_ONE_SAMPLE and use the macro here. > All right. > > Source/WebCore/platform/audio/SincResampler.cpp:267 > > + int group = n / 4; > > Typically you can just compute the end pointer so you do not have to keep track of that. > I don`t quite understand here, do you mean we just use end pointer to count the loop? > > Source/WebCore/platform/audio/SincResampler.cpp:355 > > + input = *inputP++; > > + sum1 += input * *k1; > > + sum2 += input * *k2; > > + ++k1; > > + ++k2; > > Same regarding CONVOLVE_ONE_SAMPLE. All right. (In reply to comment #4) > (In reply to comment #2) > > From my test this optimization got about 45% performance improvement. > > Any idea why such little improvement? The data is compared with the original impl with unrolling which also has some improvement. And for sse we must handle the address alignment and it produces the memory copy so it pulls down the performance. For the vadd (https://bugs.webkit.org/show_bug.cgi?id=73182) , using similar way, we just get the 100% performance improvement. (In reply to comment #5) > (From update of attachment 117838 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=117838&action=review > > > Source/WebCore/platform/audio/SincResampler.cpp:287 > > + pInput = reinterpret_cast<__m128*>(const_cast<float*>(inputP)); > > + pK1 = reinterpret_cast<__m128*>(const_cast<float*>(k1)); > > + pK2 = reinterpret_cast<__m128*>(const_cast<float*>(k2)); > > Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess... I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it. > > Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess...
>
> I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it.
That is really strange. Intrinsics should not result in function calls, they are supposed to be always inlined. Are you sure you are not testing a debug build?
(In reply to comment #9) > > > Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess... > > > > I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it. > > That is really strange. Intrinsics should not result in function calls, they are supposed to be always inlined. Are you sure you are not testing a debug build? Yes, you are right, there is no function call, so I did some investigation and found there were additional 2 movaps and 1 movl when using _mm_load_ps, while the reinterpret_cast is executed more directly. Here is some details of test: === sse run time: 10 s 430768981 ns sse with mm_load run time: 11 s 386974860 ns std run time: 15 s 747296354 ns sse speed up = 1.509697 sse with mm_load speed up = 1.382922 (In reply to comment #10) > (In reply to comment #9) > > > > Personally I prefer to have _mm_load_ps explicitly but that is a question of style I guess... > > > > > > I tried to use _mm_load_ps instead of _cast way and got some regression, from about 45% to 38%. I think the additional ops of function call resulted it. > > > > That is really strange. Intrinsics should not result in function calls, they are supposed to be always inlined. Are you sure you are not testing a debug build? > > Yes, you are right, there is no function call, so I did some investigation and found there were additional 2 movaps and 1 movl when using _mm_load_ps, while the reinterpret_cast is executed more directly. > > Here is some details of test: > === > sse run time: 10 s 430768981 ns > sse with mm_load run time: 11 s 386974860 ns > std run time: 15 s 747296354 ns > sse speed up = 1.509697 > sse with mm_load speed up = 1.382922 More update: It should be the gcc optimization problem, if using -O2 I got the same result. Created attachment 117871 [details]
Patch
Hi Poulain, Thanks your review. I updated the patch as your comments and get the new test result. The improvement is up to about 70% when complied with -O2. Comment on attachment 117871 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=117871&action=review > Source/WebCore/ChangeLog:7 > + Implement the SSE optimization in SincResampler::process() > + https://bugs.webkit.org/show_bug.cgi?id=73789 > + > + Reviewed by NOBODY (OOPS!). > + You should add a little description, mentioning the performance gain of the change. > Source/WebCore/ChangeLog:10 > + * platform/audio/SincResampler.cpp: > + (WebCore::SincResampler::process): > 2011-12-05 Roland Steiner <rolandsteiner@chromium.org> Missing space at the end of the ChangeLog. > Source/WebCore/platform/audio/SincResampler.cpp:264 > + while ((reinterpret_cast<size_t>(inputP) & 0x0F) && n) { This should be uintptr_t instead of size_t. > Source/WebCore/platform/audio/SincResampler.cpp:270 > + int group = n / 4; Instead of tracking this, you could find the end, and then loop until the end: while (mInput < end) { ... } The advantage is you remove the computation of group-- from a tight loop. > Source/WebCore/platform/audio/SincResampler.cpp:282 > + sums1 = _mm_setzero_ps(); > + sums2 = _mm_setzero_ps(); You can move those to the lines where you declare sums1 and sums2. > Source/WebCore/platform/audio/SincResampler.cpp:284 > + k1Aligned = !(reinterpret_cast<size_t>(k1) & 0x0F); > + k2Aligned = !(reinterpret_cast<size_t>(k2) & 0x0F); This should be uintptr_t instead of size_t. > Source/WebCore/platform/audio/SincResampler.cpp:298 > + mul1 = _mm_mul_ps(mInput, mK1); > + mul2 = _mm_mul_ps(mInput, mK2); > + sums1 = _mm_add_ps(sums1, mul1); > + sums2 = _mm_add_ps(sums2, mul2); > + > + inputP += 4; > + k1 += 4; > + k2 += 4; This code is copied three time. It could be moved to a macro CONVOLVE_4_SAMPLE_SSE. You can try an inline function but I think that will fail with MSVC compiler due to a bug when passing vectors as arguments. > Hi Poulain,
My first name is Benjamin :)
Created attachment 118001 [details]
Patch
(In reply to comment #15) > > Hi Poulain, > > My first name is Benjamin :) Hi Benjamin, So sorry for misusing your name -_-! I updated the patch as your comments, please help to review. Also I have a question that I tried to use script webkit-patch to upload the patch, but it showed me "Failed to assign bug to you (can't find assigned_to) control" and upload failed, does that mean I cannot upload by script unless I`m the assignee? Thank you~ (In reply to comment #17) > I updated the patch as your comments, please help to review. Please also update the Changelog. > Also I have a question that I tried to use script webkit-patch to upload the patch, but it showed me "Failed to assign bug to you (can't find assigned_to) control" and upload failed, does that mean I cannot upload by script unless I`m the assignee? It is because you do not have yet the EditBug privileges. Please file a bug for webkit-patch, in my opinion it should work regardless, and just ignore that step. Created attachment 118008 [details]
Patch
(In reply to comment #18) > (In reply to comment #17) > > I updated the patch as your comments, please help to review. > > Please also update the Changelog. > > > Also I have a question that I tried to use script webkit-patch to upload the patch, but it showed me "Failed to assign bug to you (can't find assigned_to) control" and upload failed, does that mean I cannot upload by script unless I`m the assignee? > > It is because you do not have yet the EditBug privileges. Please file a bug for webkit-patch, in my opinion it should work regardless, and just ignore that step. OK, I`ll file a bug. And Changelog updated, thanks. Comment on attachment 118008 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118008&action=review > Source/WebCore/platform/audio/SincResampler.cpp:260 > +#ifdef __SSE2__ I think visually, it might make more sense to stick the entire #ifdef __SSE2__ code inside the original { float input; ... } So we get something like { float input; #ifdef __SSE2__ ... #else <original code> #endif } > Source/WebCore/platform/audio/SincResampler.cpp:282 > +#define CONVOLVE_4_SAMPLE \ Nit: Probably should be named CONVOLVE_4_SAMPLES (plural, not singular). > Source/WebCore/platform/audio/SincResampler.cpp:293 > + mInput = _mm_load_ps(inputP); I might consider adding a new macro #define LOAD_DATA(l1, l2) \ mInput = mm_load_ps(inputP); \ mK1 = _mm_##l1_ps(k1); \ mK2 = _mm_##l2_ps(k2) Then here we can say LOAD_DATA(load, load); And below, we can have LOAD_DATA(loadu, load); LOAD_DATA(load, loadu); LOAD_DATA(loadu, loadu); This makes it easy to see that they're all the same except whether we do aligned or unaligned loads. (I didn't test this code; some minor changes might be needed.) > Source/WebCore/platform/audio/SincResampler.cpp:327 > + groupSumP = reinterpret_cast<float*>(&sums2); I think I'd write the above as sum1 += groupSumP[0]; sum1 += groupSumP[1]; sum1 += groupSumP[2]; sum1 += groupSumP[3]; Or maybe just sum1 += groupSumP[0] + groupSumP[1] + groupSumP[2] + groupSumP[3] Same for sum2 below. (In reply to comment #21) > (From update of attachment 118008 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=118008&action=review > > > Source/WebCore/platform/audio/SincResampler.cpp:260 > > +#ifdef __SSE2__ > > I think visually, it might make more sense to stick the entire #ifdef __SSE2__ code inside the original > { > float input; > ... > } > > So we get something like > { > float input; > #ifdef __SSE2__ > ... > #else > <original code> > #endif > } > > > Source/WebCore/platform/audio/SincResampler.cpp:282 > > +#define CONVOLVE_4_SAMPLE \ > > Nit: Probably should be named CONVOLVE_4_SAMPLES (plural, not singular). > > > Source/WebCore/platform/audio/SincResampler.cpp:293 > > + mInput = _mm_load_ps(inputP); > > I might consider adding a new macro > > #define LOAD_DATA(l1, l2) \ > mInput = mm_load_ps(inputP); \ > mK1 = _mm_##l1_ps(k1); \ > mK2 = _mm_##l2_ps(k2) > > Then here we can say > LOAD_DATA(load, load); > > And below, we can have > LOAD_DATA(loadu, load); > LOAD_DATA(load, loadu); > LOAD_DATA(loadu, loadu); > > This makes it easy to see that they're all the same except whether we do aligned or unaligned loads. > > (I didn't test this code; some minor changes might be needed.) > > > Source/WebCore/platform/audio/SincResampler.cpp:327 > > + groupSumP = reinterpret_cast<float*>(&sums2); > > I think I'd write the above as > > sum1 += groupSumP[0]; > sum1 += groupSumP[1]; > sum1 += groupSumP[2]; > sum1 += groupSumP[3]; > > Or maybe just sum1 += groupSumP[0] + groupSumP[1] + groupSumP[2] + groupSumP[3] > > Same for sum2 below. Hi Raymond, Thanks your good comments, I`ll update it. > Then here we can say
> LOAD_DATA(load, load);
>
> And below, we can have
> LOAD_DATA(loadu, load);
> LOAD_DATA(load, loadu);
> LOAD_DATA(loadu, loadu);
>
> This makes it easy to see that they're all the same except whether we do aligned or unaligned loads.
I think this would be much harder to follow than the existing code...
(In reply to comment #23) > > Then here we can say > > LOAD_DATA(load, load); > > > > And below, we can have > > LOAD_DATA(loadu, load); > > LOAD_DATA(load, loadu); > > LOAD_DATA(loadu, loadu); > > > > This makes it easy to see that they're all the same except whether we do aligned or unaligned loads. > > I think this would be much harder to follow than the existing code... That's why I said "maybe add". But it took me a bit of time to see what was different between the three branches. Just one character difference... I won't mind of the macro is not added. The patch looks correct to me but I'd like the comments of Raymond to be fixed before landing. I think "load" and "loadu" are common enough to avoid a LOAD_DATA macro but I leave that up to you. Created attachment 118305 [details]
Patch
(In reply to comment #26) > Created an attachment (id=118305) [details] > Patch Patch is updated with keeping LOAD_DATA macro. Comment on attachment 118305 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118305&action=review > Source/WebCore/platform/audio/SincResampler.cpp:319 > + // Summarize the SSE results to sum1 and sum2 Make comments look like sentences by starting with a capital letter and ending with a period (punctation). One exception may be end of line comments like this "if (x == y) // false for NaN". Created attachment 118320 [details]
Patch
Add the period for the comment.
Comment on attachment 118320 [details]
Patch
Looks correct.
Sorry it took so many iterations to find all the little issues.
(In reply to comment #30) > (From update of attachment 118320 [details]) > Looks correct. > > Sorry it took so many iterations to find all the little issues. That`s all right, it could make me to pay more attention to the details in the feature, thanks all your review. Comment on attachment 118320 [details] Patch Clearing flags on attachment: 118320 Committed r102312: <http://trac.webkit.org/changeset/102312> All reviewed patches have been landed. Closing bug. |