As suggested in bug id 100737, very soon proposing here optimized approach or patch over following bug for VectorMath.cpp: https://bugs.webkit.org/show_bug.cgi?id=100737 Thanks and regards, kdj
Raymond, can you please have a look.
Created attachment 190700 [details] Patch with more optimised approach Please let me know the review flags are correct or not.
Comment on attachment 190700 [details] Patch with more optimised approach Rejecting attachment 190700 [details] from review queue. kaustubh.j@samsung.com does not have reviewer permissions according to http://trac.webkit.org/browser/trunk/Tools/Scripts/webkitpy/common/config/committers.py. - If you do not have reviewer rights please read http://webkit.org/coding/contributing.html for instructions on how to use bugzilla flags. - If you have reviewer rights please correct the error in Tools/Scripts/webkitpy/common/config/committers.py by adding yourself to the file (no review needed). The commit-queue restarts itself every 2 hours. After restart the commit-queue will correctly respect your reviewer rights.
Comment on attachment 190700 [details] Patch with more optimised approach corrected the flags for review.
Comment on attachment 190700 [details] Patch with more optimised approach View in context: https://bugs.webkit.org/attachment.cgi?id=190700&action=review You should follow the WebKit this coding style: http://www.webkit.org/coding/coding-style.html Run Tools/Scripts/check-webkit-style to check it. Do you have some info about the speedup of this patch? > Source/WebCore/platform/audio/VectorMath.cpp:180 > + unsigned loopCount = n >> 4; > + unsigned residueCount = n & 15; > + float32x4_t num1, num2, result1, scaleNum; > + float32x4_t num3, num4, result2; > + float32x4_t num5, num6, result3; > + float32x4_t num7, num8, result4; > + > + scaleNum = vdupq_n_f32(*scale); > + > + for (;loopCount>0;loopCount--) { Why don't you define the loopCount here? > Source/WebCore/platform/audio/VectorMath.cpp:188 > + num3 = vld1q_f32(dataSource+4); // load 4xfloat values > + num4 = vld1q_f32(dataDest+4); // load 4xfloat values > + num5 = vld1q_f32(dataSource+8); // load 4xfloat values > + num6 = vld1q_f32(dataDest+8); // load 4xfloat values > + num7 = vld1q_f32(dataSource+12); // load 4xfloat values > + num8 = vld1q_f32(dataDest+12); // load 4xfloat values You should put spaces before and after the + sign everywhere and I think these comments are unnecessary here. > Source/WebCore/platform/audio/VectorMath.cpp:204 > + if (residueCount) { > + for (;residueCount>0;residueCount--) { These checks are unnecessary because the for cycles don't run if residueCount equals 0 and again why don't you define residueCount here? > Source/WebCore/platform/audio/VectorMath.cpp:216 > + } else { > + while (n) { > + *destP += *sourceP * *scale; > + sourceP += sourceStride; > + destP += destStride; > + n--; > } That's a code duplication. Why don't you reuse the original solution for the tail frames. > Source/WebCore/platform/audio/VectorMath.cpp:315 > + *dataDest = *dataSource * (*scale); Sometimes you use parenthesis sometimes not. You should do the same everywhere.
Comment on attachment 190700 [details] Patch with more optimised approach View in context: https://bugs.webkit.org/attachment.cgi?id=190700&action=review Basic changes look good, but it's really important to know how much faster this new code is. > Source/WebCore/ChangeLog:8 > + No new tests? Did you run the webaudio layout tests? It would be good if the changelog indicated what kind of speed up you get by manually unrolling these loops. > Source/WebCore/platform/audio/VectorMath.cpp:176 > + float32x4_t num7, num8, result4; Move the definition of num1-num8 and result1-result4 into the body of the for loop. Similar comments for the other routines below. >> Source/WebCore/platform/audio/VectorMath.cpp:180 >> + for (;loopCount>0;loopCount--) { > > Why don't you define the loopCount here? gcc -O is supposed to support loop unrolling. Is unrolling not occurring here? clang might support unrolling too. It seems that getting the compiler to do this would be better than doing it by hand. > Source/WebCore/platform/audio/VectorMath.cpp:199 > + Since you're trying to optimize for speed, as there any gain to be had by interleaving some of the computation with the loads? That is, move some of the vmlaq_f32 calls in between the vld1q_f32 calls. Same with storing the result. Only do this if the gain is significant since it makes it harder to understand the code. > Source/WebCore/platform/audio/VectorMath.cpp:443 > + num4= vld1q_f32(source2+4); Fix up spacing for num4= and num8= below. > Source/WebCore/platform/audio/VectorMath.cpp:642 > + unsigned loopCount = framesToProcess >> 2; In the other loops above, you processed 16 floats at a time. Here, you only do 4 floats (2 complexes) at a time. Why? Is there no significant gain from doing 4 or 8 complexes at a time? > Source/WebCore/platform/audio/VectorMath.cpp:644 > + float32x4_t real1, real2, img1, img2; Use imag1 and imag2 instead of img1, img2, to be consistent with names for the imaginary parts. Same for imgSource1. > Source/WebCore/platform/audio/VectorMath.cpp:651 > + img2= vld1q_f32(imgSource2); // load 4xfloat values Space before "=".
Hi, 1. I have executed the Tools/Scripts/check-webkit-style result came with 0 errors in 1 file. But still some comments are not required shall be removed in next patch 2. Regarding performance gain i have seen 9 to 10% of gain over previous implementation in webkit.org. Definitely the gain difference would be not as more as over pure c code. 3. Regarding the line no.204 if residueCount < 128 then this loop would get executed; this code is written with all possible generic value taken into consideration. 4. Code duplication at 206; i will recheck on this
(In reply to comment #7) Corrected gain > 2. Regarding performance gain i have seen 9% to 30% of gain over previous implementation in webkit.org. Definitely the gain difference would be not as more as over pure c code.
5. Comment 642: here main the objective of avoiding the stalls due immediate load and any mathematical operations has been achieved already so not necessary to still unroll by 2 again. 6. Changes related to space would be changed in next patch. These kind of loop unrolling are part of commonly suggested optimization over GCC compiler options.
Created attachment 190921 [details] Second patch with suggested modifications Second patch with suggested modifications
Enable to see EWS test option for second patch. Please let me know how to get it.
Unable to see EWS test option for second patch. Please let me know how to get it.
Comment on attachment 190921 [details] Second patch with suggested modifications Attachment 190921 [details] did not pass chromium-ews (chromium-xvfb): Output: http://webkit-commit-queue.appspot.com/results/16778699
Created attachment 190927 [details] #if Macro set properly. #if Macro set properly.
(In reply to comment #8) > (In reply to comment #7) > Corrected gain > > 2. Regarding performance gain i have seen 9% to 30% of gain over previous implementation in webkit.org. Definitely the gain difference would be not as more as over pure c code. Can you break this down to the gain for each of the functions?
Comment on attachment 190927 [details] #if Macro set properly. View in context: https://bugs.webkit.org/attachment.cgi?id=190927&action=review > Source/WebCore/platform/audio/VectorMath.cpp:180 > + for (;loopCount>0;loopCount--) { Gabor suggested moving the initialization of loopCount to the for loop. The spacing in the for loop is wrong. I previously suggested moving the definitions of num1-num8 into the body of the for loop. > Source/WebCore/platform/audio/VectorMath.cpp:203 > + if (residueCount) { Can we set n = residueCount and fall through to the code at line 186, as the original code did? Same comment applies for all routines below that use the residueCount. > Source/WebCore/platform/audio/VectorMath.cpp:204 > + for (;residueCount>0;residueCount--) { Fix spacing. > Source/WebCore/platform/audio/VectorMath.cpp:210 > + } else { // If strides are not 1, rollback to normal algorithm. This else wasn't needed before. Why is it needed now? Can't we just fall through? > Source/WebCore/platform/audio/VectorMath.cpp:285 > + float32x4_t num4, result4; Move declaration of num1-num4 and result1-result4 into body of for loop. > Source/WebCore/platform/audio/VectorMath.cpp:287 > + scaleNum = vdupq_n_f32(*scale); float scaleNum = vdupq_n_f32(*scale); > Source/WebCore/platform/audio/VectorMath.cpp:289 > + for (;loopCount>0;loopCount--) { Fix spacing. > Source/WebCore/platform/audio/VectorMath.cpp:309 > + for (;residueCount>0;residueCount--) { Fix spacing. > Source/WebCore/platform/audio/VectorMath.cpp:426 > + float32x4_t num7, num8, sum4; Move these into the body of the for loop. > Source/WebCore/platform/audio/VectorMath.cpp:428 > + for (;loopCount>0;loopCount--) { Fix spacing. > Source/WebCore/platform/audio/VectorMath.cpp:460 > + } else { // If strides are not 1, rollback to normal algorithm. Don't think this is necessary and we can just fall through as in the original code. > Source/WebCore/platform/audio/VectorMath.cpp:535 > + float32x4_t num7, num8, result4; Move these declarations into the body of the for loop. > Source/WebCore/platform/audio/VectorMath.cpp:537 > + for (;loopCount>0;loopCount--) { Fix spacing. > Source/WebCore/platform/audio/VectorMath.cpp:-452 > - while (n) { Doesn't this mean the SSE2 code no longer has this loop to handle the tail frames? > Source/WebCore/platform/audio/VectorMath.cpp:622 > + float32x4_t result1, result2; Move declarations of real1, real2, imag1, imag2, result1, result2 into the body of the for loop. > Source/WebCore/platform/audio/VectorMath.cpp:624 > + for (;loopCount>0;loopCount--) { Fix spacing. > Source/WebCore/platform/audio/VectorMath.cpp:709 > + float32x4_t num4, result4; Move declaration of num1-num4 into the body of the for loop. > Source/WebCore/platform/audio/VectorMath.cpp:711 > + result1 = result2 = result3 = result4 = vdupq_n_f32(0); I think the style guide says this should be 4 separate assignments. (But it's not completely clear.) > Source/WebCore/platform/audio/VectorMath.cpp:713 > + for (;loopCount>0;loopCount--) { Fix spacing.
(In reply to comment #12) > Unable to see EWS test option for second patch. Please let me know how to get it. In the attachment area, there should be a button to submit to EWS. You can press it to submit the patch to the EWS bots.
(In reply to comment #7) > Hi, > 1. I have executed the Tools/Scripts/check-webkit-style result came with 0 errors in 1 file. But still some comments are not required shall be removed in next patch > 2. Regarding performance gain i have seen 9 to 10% of gain over previous implementation in webkit.org. Definitely the gain difference would be not as more as over pure c code. > 3. Regarding the line no.204 if residueCount < 128 then this loop would get executed; this code is written with all possible generic value taken into consideration. I think Gabor is saying that you don't need the if (residueCount) because the for loop won't be executed when residueCount is 0.
Hi, Let us first discuss the technical details; Here is the figures: 1. vadd-29% 2. vmul-29% 3. vsma-30% 4. vsmul-28% 5. vsvesq-25% 6. zvmul-9% The spacing, comments etc. i will check script in detail and line by line from next check-in.
Hi, i request Praveen to upload the next patch. thanks, kdj
Hi Raymond and Chris, please find the numbers above and my comments in 'Review Patch'. please let me know if you have any more code execution doubts or suggestions. br, kdj
(In reply to comment #21) > Hi Raymond and Chris, > > please find the numbers above and my comments in 'Review Patch'. > please let me know if you have any more code execution doubts or suggestions. > > br, > kdj The numbers look quite nice. Thanks for providing them. I don't see any comments from you on the latest patch. Can you send me a link to them?
Hi i have few comments like this. Raymond Toy: Gabor suggested moving the initialization of loopCount to the for loop. The spacing in the for loop is wrong. I previously suggested moving the definitions of num1-num8 into the body of the for loop. kdj: Is it due to style guide ? it is better to keep outside for optimizations; in case compilers takes any variable to stack it would be very costly operation. Raymond Toy: Can we set n = residueCount and fall through to the code at line 186, as the original code did? kdj: Same comment applies for all routines below that use the residueCount. Here for case loopCount > 0 and residueCount > 0 : the loop for residue members should star from sourceP; if we merge both loops we have to set sourceP to processed members; but it is declared as "const float *"; so that is why i have kept separately. Raymond Toy: Move declaration of num1-num4 into the body of the for loop. kdj: Is it due to style guide ? it is better to keep outside for optimizations;so moved the variables outside loop. If you have any more comments, please let me know. Other coding styles suggestion will be incorporated in next patch.
(In reply to comment #23) > Hi i have few comments like this. I cannot find them in the reviews. > > Raymond Toy: > Gabor suggested moving the initialization of loopCount to the for loop. > The spacing in the for loop is wrong. > I previously suggested moving the definitions of num1-num8 into the body of the for loop. > kdj: > Is it due to style guide ? it is better to keep outside for optimizations; > in case compilers takes any variable to stack it would be very costly operation. I guess it's the typical style used in other code. The variable is declared and initialized together when possible. Those variables are only used inside the loop so I would think it would make the compiler's job easier since the scope is limited. If you move them inside, is the generated code worse? > > Raymond Toy: > Can we set n = residueCount and fall through to the code at line 186, as the original code did? > kdj: > Same comment applies for all routines below that use the residueCount. > Here for case loopCount > 0 and residueCount > 0 : the loop for residue members > should star from sourceP; if we merge both loops we have to set sourceP to processed members; > but it is declared as "const float *"; so that is why i have kept separately. I don't understand your comment here. In the SSE2 code, n = tailFrames (your residueCount) and the while loop at 212 is run. Isn't this the same as what your code does? > > Raymond Toy: > Move declaration of num1-num4 into the body of the for loop. > kdj: > Is it due to style guide ? it is better to keep outside for optimizations;so moved the variables outside loop. What optimization is enabled by keeping the variable declaration outside the loop? > > If you have any more comments, please let me know. > Other coding styles suggestion will be incorporated in next patch.
Created attachment 195022 [details] Third patch with suggested modifications Hi Raymond, Please find the patch with modifications. Regarding keeping variables inside loop; i will generate assembly with my compiler tool-chain, and update on same. Meanwhile, please review the patch. Thanks, kdj
Comment on attachment 195022 [details] Third patch with suggested modifications View in context: https://bugs.webkit.org/attachment.cgi?id=195022&action=review Looks good, with just a few minor comments about renaming some variables for clarity. I'll wait for your final test with your compilers. > Source/WebCore/platform/audio/VectorMath.cpp:184 > + num8 = vld1q_f32(sourceP + 12); I think the code would be a little clearer if you used better names than num1-num8. Maybe rename num1 to dest0, num2 to source0, num3 to dest1, num4 to source1, and so on. > Source/WebCore/platform/audio/VectorMath.cpp:405 > + num8 = vld1q_f32(source2P + 12); As above, maybe rename num1-num8 with better names to reflect what the source is. Maybe rename num1 to source1P0, num2 to source2P0, num3 to source1P1, num4 to source2P1, and so on. > Source/WebCore/platform/audio/VectorMath.cpp:500 > + num8 = vld1q_f32(source2P + 12); Same comment about renaming num1-num8
Created attachment 195231 [details] Patch with variable name changes Hi Raymond, 1. As the number of Neon and ARM register are within count, nothing has gone on stack. I checked assembly also keeping variables in loop and out-loop in .S file. 2. Renamed the variables as you mentioned. Please check the patch. Thanks, kdj
Comment on attachment 195231 [details] Patch with variable name changes View in context: https://bugs.webkit.org/attachment.cgi?id=195231&action=review > Source/WebCore/ChangeLog:8 > + Modifications are done to unsure overcoming stalls and Loop unrolling mechanism Typo: "unsure" -> "ensure" > Source/WebCore/platform/audio/VectorMath.cpp:270 > + float32x4_t sour0, result0; Why not use "source0" instead of "sour0"? Whole words are preferred. Same comment applies to all code below that uses "sour".
Created attachment 195506 [details] Modified the patch Modified the patch.
Comment on attachment 195506 [details] Modified the patch View in context: https://bugs.webkit.org/attachment.cgi?id=195506&action=review Looks good except for one last issue. > Source/WebCore/platform/audio/VectorMath.cpp:174 > + float32x4_t dest3, source3, result3; Since you mentioned that there was no difference with declaring these variables inside the loop, I think they should be declared in the loop. So something like for (...) { float32x4_t dest0 = vld1q_f32(destP); ... }
Created attachment 196121 [details] Variable declaration inside Loop please check.
(In reply to comment #31) > Created an attachment (id=196121) [details] > Variable declaration inside Loop > > please check. This looks good. Thank you for this optimized implementation. One last question: Have you run your tests on this last patch?
I checked the gain it is same; except for zvmul; as we are not using unrolling there, we are using same code.
Since long no updation about this patch. Please share views.
Please share the status, we would like to do more contributions on audio filters also.
Comment on attachment 196121 [details] Variable declaration inside Loop Unfortunately, this patch seems to have been ignored and no longer applies to the source tree. Could you please rebase the patch against the current source archive?
This needs rebased (several variables have got renamed over the past decade, but broadly it looks like it should still (manually) apply), and performance tested against modern Neon implementations.
<rdar://problem/96339063>