Bug 110744 - More Optimize approach for NEON in VectorMath
Summary: More Optimize approach for NEON in VectorMath
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Audio (show other bugs)
Version: 528+ (Nightly build)
Hardware: Other Linux
: P2 Normal
Assignee: kdj
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2013-02-25 03:48 PST by kdj
Modified: 2022-07-02 05:55 PDT (History)
15 users (show)

See Also:


Attachments
Patch with more optimised approach (16.37 KB, patch)
2013-02-28 04:24 PST, kdj
no flags Details | Formatted Diff | Diff
Second patch with suggested modifications (16.48 KB, patch)
2013-03-01 03:13 PST, kdj
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
#if Macro set properly. (16.50 KB, patch)
2013-03-01 03:56 PST, kdj
no flags Details | Formatted Diff | Diff
Third patch with suggested modifications (10.06 KB, patch)
2013-03-26 00:38 PDT, kdj
no flags Details | Formatted Diff | Diff
Patch with variable name changes (10.02 KB, patch)
2013-03-26 23:40 PDT, kdj
no flags Details | Formatted Diff | Diff
Modified the patch (10.45 KB, patch)
2013-03-28 02:19 PDT, kdj
no flags Details | Formatted Diff | Diff
Variable declaration inside Loop (10.18 KB, patch)
2013-04-02 04:46 PDT, kdj
bfulgham: review-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description kdj 2013-02-25 03:48:57 PST
As suggested in bug id 100737, very soon proposing here optimized approach or patch over following bug for VectorMath.cpp: 
https://bugs.webkit.org/show_bug.cgi?id=100737

Thanks and regards,
kdj
Comment 1 Chris Rogers 2013-02-25 10:30:42 PST
Raymond, can you please have a look.
Comment 2 kdj 2013-02-28 04:24:37 PST
Created attachment 190700 [details]
Patch with more optimised approach

Please let me know the review flags are correct or not.
Comment 3 WebKit Review Bot 2013-02-28 04:27:12 PST
Comment on attachment 190700 [details]
Patch with more optimised approach

Rejecting attachment 190700 [details] from review queue.

kaustubh.j@samsung.com does not have reviewer permissions according to http://trac.webkit.org/browser/trunk/Tools/Scripts/webkitpy/common/config/committers.py.

- If you do not have reviewer rights please read http://webkit.org/coding/contributing.html for instructions on how to use bugzilla flags.

- If you have reviewer rights please correct the error in Tools/Scripts/webkitpy/common/config/committers.py by adding yourself to the file (no review needed).  The commit-queue restarts itself every 2 hours.  After restart the commit-queue will correctly respect your reviewer rights.
Comment 4 kdj 2013-02-28 04:32:55 PST
Comment on attachment 190700 [details]
Patch with more optimised approach

corrected the flags for review.
Comment 5 Gabor Rapcsanyi 2013-02-28 05:14:52 PST
Comment on attachment 190700 [details]
Patch with more optimised approach

View in context: https://bugs.webkit.org/attachment.cgi?id=190700&action=review

You should follow the WebKit this coding style: http://www.webkit.org/coding/coding-style.html
Run Tools/Scripts/check-webkit-style to check it.

Do you have some info about the speedup of this patch?

> Source/WebCore/platform/audio/VectorMath.cpp:180
> +        unsigned loopCount = n >> 4;
> +        unsigned residueCount = n & 15;
> +        float32x4_t num1, num2, result1, scaleNum;
> +        float32x4_t num3, num4, result2;
> +        float32x4_t num5, num6, result3;
> +        float32x4_t num7, num8, result4;
> +
> +        scaleNum = vdupq_n_f32(*scale);
> +
> +        for (;loopCount>0;loopCount--) {

Why don't you define the loopCount here?

> Source/WebCore/platform/audio/VectorMath.cpp:188
> +            num3 = vld1q_f32(dataSource+4); // load 4xfloat values
> +            num4 = vld1q_f32(dataDest+4); // load 4xfloat values
> +            num5 = vld1q_f32(dataSource+8); // load 4xfloat values
> +            num6 = vld1q_f32(dataDest+8); // load 4xfloat values
> +            num7 = vld1q_f32(dataSource+12); // load 4xfloat values
> +            num8 = vld1q_f32(dataDest+12); // load 4xfloat values

You should put spaces before and after the + sign everywhere and I think these comments are unnecessary here.

> Source/WebCore/platform/audio/VectorMath.cpp:204
> +        if (residueCount) {
> +            for (;residueCount>0;residueCount--) {

These checks are unnecessary because the for cycles don't run if residueCount equals 0 and again why don't you define residueCount here?

> Source/WebCore/platform/audio/VectorMath.cpp:216
> +    } else {
> +        while (n) {
> +            *destP += *sourceP * *scale;
> +            sourceP += sourceStride;
> +            destP += destStride;
> +            n--;
>          }

That's a code duplication. Why don't you reuse the original solution for the tail frames.

> Source/WebCore/platform/audio/VectorMath.cpp:315
> +                *dataDest = *dataSource * (*scale);

Sometimes you use parenthesis sometimes not. You should do the same everywhere.
Comment 6 Raymond Toy 2013-02-28 09:55:55 PST
Comment on attachment 190700 [details]
Patch with more optimised approach

View in context: https://bugs.webkit.org/attachment.cgi?id=190700&action=review

Basic changes look good, but it's really important to know how much faster this new code is.

> Source/WebCore/ChangeLog:8
> +

No new tests?  Did you run the webaudio layout tests?

It would be good if the changelog indicated what kind of speed up you get by manually unrolling these loops.

> Source/WebCore/platform/audio/VectorMath.cpp:176
> +        float32x4_t num7, num8, result4;

Move the definition of num1-num8 and result1-result4 into the body of the for loop.  Similar comments for the other routines below.

>> Source/WebCore/platform/audio/VectorMath.cpp:180
>> +        for (;loopCount>0;loopCount--) {
> 
> Why don't you define the loopCount here?

gcc -O is supposed to support loop unrolling.  Is unrolling not occurring here? clang might support unrolling too.  It seems that getting the compiler to do this would be better than doing it by hand.

> Source/WebCore/platform/audio/VectorMath.cpp:199
> +

Since you're trying to optimize for speed, as there any gain to be had by interleaving some of the computation with the loads?  That is, move some of the vmlaq_f32 calls in between the vld1q_f32 calls.  Same with storing the result.

Only do this if the gain is significant since it makes it harder to understand the code.

> Source/WebCore/platform/audio/VectorMath.cpp:443
> +            num4= vld1q_f32(source2+4);

Fix up spacing for num4= and num8= below.

> Source/WebCore/platform/audio/VectorMath.cpp:642
> +    unsigned loopCount = framesToProcess >> 2;

In the other loops above, you processed 16 floats at a time.  Here, you only do 4 floats (2 complexes) at a time.  Why?  Is there no significant gain from doing 4 or 8 complexes at a time?

> Source/WebCore/platform/audio/VectorMath.cpp:644
> +    float32x4_t real1, real2, img1, img2;

Use imag1 and imag2 instead of img1, img2, to be consistent with names for the imaginary parts.  Same for imgSource1.

> Source/WebCore/platform/audio/VectorMath.cpp:651
> +        img2= vld1q_f32(imgSource2); // load 4xfloat values

Space before "=".
Comment 7 kdj 2013-02-28 21:52:32 PST
Hi,
1. I have executed the Tools/Scripts/check-webkit-style result came with 0 errors in 1 file. But still some comments are not required shall be removed in next patch
2. Regarding performance gain i have seen 9 to 10% of gain over previous implementation in webkit.org. Definitely the gain difference would be not as more as over pure c code. 
3. Regarding the line no.204 if residueCount < 128 then this loop would get executed; this code is written with all possible generic value taken into consideration.
4. Code duplication at 206; i will recheck on this
Comment 8 kdj 2013-02-28 21:55:44 PST
(In reply to comment #7)
Corrected gain
> 2. Regarding performance gain i have seen 9% to 30% of gain over previous implementation in webkit.org. Definitely the gain difference would be not as more as over pure c code.
Comment 9 kdj 2013-02-28 22:13:42 PST
5. Comment 642:  here main the objective of avoiding the stalls due immediate load and any mathematical operations has been achieved already so not necessary to still unroll by 2 again.
6. Changes related to space would be changed in next patch.

These kind of loop unrolling are part of commonly suggested optimization over GCC compiler options.
Comment 10 kdj 2013-03-01 03:13:32 PST
Created attachment 190921 [details]
Second patch with suggested modifications

Second patch with suggested modifications
Comment 11 kdj 2013-03-01 03:22:57 PST
Enable to see EWS test option for second patch. Please let me know how to get it.
Comment 12 kdj 2013-03-01 03:23:42 PST
Unable to see EWS test option for second patch. Please let me know how to get it.
Comment 13 WebKit Review Bot 2013-03-01 03:38:56 PST
Comment on attachment 190921 [details]
Second patch with suggested modifications

Attachment 190921 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://webkit-commit-queue.appspot.com/results/16778699
Comment 14 kdj 2013-03-01 03:56:16 PST
Created attachment 190927 [details]
#if Macro set properly.

#if Macro set properly.
Comment 15 Raymond Toy 2013-03-01 09:15:45 PST
(In reply to comment #8)
> (In reply to comment #7)
> Corrected gain
> > 2. Regarding performance gain i have seen 9% to 30% of gain over previous implementation in webkit.org. Definitely the gain difference would be not as more as over pure c code.

Can you break this down to the gain for each of the functions?
Comment 16 Raymond Toy 2013-03-01 09:46:53 PST
Comment on attachment 190927 [details]
#if Macro set properly.

View in context: https://bugs.webkit.org/attachment.cgi?id=190927&action=review

> Source/WebCore/platform/audio/VectorMath.cpp:180
> +        for (;loopCount>0;loopCount--) {

Gabor suggested moving the initialization of loopCount to the for loop.

The spacing in the for loop is wrong.

I previously suggested moving the definitions of num1-num8 into the body of the for loop.

> Source/WebCore/platform/audio/VectorMath.cpp:203
> +        if (residueCount) {

Can we set n = residueCount and fall through to the code at line 186, as the original code did?

Same comment applies for all routines below that use the residueCount.

> Source/WebCore/platform/audio/VectorMath.cpp:204
> +            for (;residueCount>0;residueCount--) {

Fix spacing.

> Source/WebCore/platform/audio/VectorMath.cpp:210
> +    } else { // If strides are not 1, rollback to normal algorithm.

This else wasn't needed before.  Why is it needed now?  Can't we just fall through?

> Source/WebCore/platform/audio/VectorMath.cpp:285
> +        float32x4_t num4, result4;

Move declaration of num1-num4 and result1-result4 into body of for loop.

> Source/WebCore/platform/audio/VectorMath.cpp:287
> +        scaleNum = vdupq_n_f32(*scale);

float scaleNum = vdupq_n_f32(*scale);

> Source/WebCore/platform/audio/VectorMath.cpp:289
> +        for (;loopCount>0;loopCount--) {

Fix spacing.

> Source/WebCore/platform/audio/VectorMath.cpp:309
> +            for (;residueCount>0;residueCount--) {

Fix spacing.

> Source/WebCore/platform/audio/VectorMath.cpp:426
> +        float32x4_t num7, num8, sum4;

Move these into the body of the for loop.

> Source/WebCore/platform/audio/VectorMath.cpp:428
> +        for (;loopCount>0;loopCount--) {

Fix spacing.

> Source/WebCore/platform/audio/VectorMath.cpp:460
> +    } else { // If strides are not 1, rollback to normal algorithm.

Don't think this is necessary and we can just fall through as in the original code.

> Source/WebCore/platform/audio/VectorMath.cpp:535
> +        float32x4_t num7, num8, result4;

Move these declarations into the body of the for loop.

> Source/WebCore/platform/audio/VectorMath.cpp:537
> +        for (;loopCount>0;loopCount--) {

Fix spacing.

> Source/WebCore/platform/audio/VectorMath.cpp:-452
> -    while (n) {

Doesn't this mean the SSE2 code no longer has this loop to handle the tail frames?

> Source/WebCore/platform/audio/VectorMath.cpp:622
> +    float32x4_t result1, result2;

Move declarations of real1, real2, imag1, imag2, result1, result2 into the body of the for loop.

> Source/WebCore/platform/audio/VectorMath.cpp:624
> +    for (;loopCount>0;loopCount--) {

Fix spacing.

> Source/WebCore/platform/audio/VectorMath.cpp:709
> +        float32x4_t num4, result4;

Move declaration of num1-num4 into the body of the for loop.

> Source/WebCore/platform/audio/VectorMath.cpp:711
> +        result1 = result2 = result3 = result4 = vdupq_n_f32(0);

I think the style guide says this should be 4 separate assignments.  (But it's not completely clear.)

> Source/WebCore/platform/audio/VectorMath.cpp:713
> +        for (;loopCount>0;loopCount--) {

Fix spacing.
Comment 17 Raymond Toy 2013-03-01 09:49:14 PST
(In reply to comment #12)
> Unable to see EWS test option for second patch. Please let me know how to get it.

In the attachment area, there should be a button to submit to EWS.  You can press it to submit the patch to the EWS bots.
Comment 18 Raymond Toy 2013-03-01 09:55:36 PST
(In reply to comment #7)
> Hi,
> 1. I have executed the Tools/Scripts/check-webkit-style result came with 0 errors in 1 file. But still some comments are not required shall be removed in next patch
> 2. Regarding performance gain i have seen 9 to 10% of gain over previous implementation in webkit.org. Definitely the gain difference would be not as more as over pure c code. 
> 3. Regarding the line no.204 if residueCount < 128 then this loop would get executed; this code is written with all possible generic value taken into consideration.

I think Gabor is saying that you don't need the if (residueCount) because the for loop won't be executed when residueCount is 0.
Comment 19 kdj 2013-03-11 23:23:58 PDT
Hi,
Let us first discuss the technical details; 
Here is the figures:
1. vadd-29%
2. vmul-29%
3. vsma-30%
4. vsmul-28%
5. vsvesq-25%
6. zvmul-9%


The spacing, comments etc. i will check script in detail and line by line from next check-in.
Comment 20 kdj 2013-03-18 00:38:39 PDT
Hi,
i request Praveen to upload the next patch.

thanks,
kdj
Comment 21 kdj 2013-03-20 02:41:25 PDT
Hi Raymond and Chris,

please find the numbers above and my comments in 'Review Patch'.
please let me know if you have any more code execution doubts or suggestions.

br,
kdj
Comment 22 Raymond Toy 2013-03-20 09:37:26 PDT
(In reply to comment #21)
> Hi Raymond and Chris,
> 
> please find the numbers above and my comments in 'Review Patch'.
> please let me know if you have any more code execution doubts or suggestions.
> 
> br,
> kdj

The numbers look quite nice.  Thanks for providing them.

I don't see any comments from you on the latest patch.  Can you send me a link to them?
Comment 23 kdj 2013-03-20 23:34:12 PDT
Hi i have few comments like this.

Raymond Toy:
Gabor suggested moving the initialization of loopCount to the for loop.
The spacing in the for loop is wrong.
I previously suggested moving the definitions of num1-num8 into the body of the for loop.
kdj:
Is it due to style guide ? it is better to keep outside for optimizations; 
in case compilers takes any variable to stack it would be very costly operation.

Raymond Toy:
Can we set n = residueCount and fall through to the code at line 186, as the original code did?
kdj:
Same comment applies for all routines below that use the residueCount.
Here for case loopCount > 0 and residueCount > 0 : the loop for residue members 
should star from sourceP; if we merge both loops we have to set sourceP to processed members; 
but it is declared as "const float *"; so that is why i have kept separately.

Raymond Toy:
Move declaration of num1-num4 into the body of the for loop.
kdj:
Is it due to style guide ? it is better to keep outside for optimizations;so moved the variables outside loop.

If you have any more comments, please let me know.
Other coding styles suggestion will be incorporated in next patch.
Comment 24 Raymond Toy 2013-03-25 09:35:24 PDT
(In reply to comment #23)
> Hi i have few comments like this.

I cannot find them in the reviews.
> 
> Raymond Toy:
> Gabor suggested moving the initialization of loopCount to the for loop.
> The spacing in the for loop is wrong.
> I previously suggested moving the definitions of num1-num8 into the body of the for loop.
> kdj:
> Is it due to style guide ? it is better to keep outside for optimizations; 
> in case compilers takes any variable to stack it would be very costly operation.

I guess it's the typical style used in other code.  The variable is declared and initialized together when possible.  Those variables are only used inside the loop so I would think it would make the compiler's job easier since the scope is limited.  If you move them inside, is the generated code worse?
> 
> Raymond Toy:
> Can we set n = residueCount and fall through to the code at line 186, as the original code did?
> kdj:
> Same comment applies for all routines below that use the residueCount.
> Here for case loopCount > 0 and residueCount > 0 : the loop for residue members 
> should star from sourceP; if we merge both loops we have to set sourceP to processed members; 
> but it is declared as "const float *"; so that is why i have kept separately.

I don't understand your comment here.  In the SSE2 code, n = tailFrames (your residueCount) and the while loop at 212 is run.  Isn't this the same as what your code does?
> 
> Raymond Toy:
> Move declaration of num1-num4 into the body of the for loop.
> kdj:
> Is it due to style guide ? it is better to keep outside for optimizations;so moved the variables outside loop.

What optimization is enabled by keeping the variable declaration outside the loop?

> 
> If you have any more comments, please let me know.
> Other coding styles suggestion will be incorporated in next patch.
Comment 25 kdj 2013-03-26 00:38:05 PDT
Created attachment 195022 [details]
Third patch with suggested modifications

Hi Raymond,
Please find the patch with modifications.

Regarding keeping variables inside loop; i will generate assembly with my compiler tool-chain, and update on same.
Meanwhile, please review the patch.

Thanks,
kdj
Comment 26 Raymond Toy 2013-03-26 09:15:32 PDT
Comment on attachment 195022 [details]
Third patch with suggested modifications

View in context: https://bugs.webkit.org/attachment.cgi?id=195022&action=review

Looks good, with just a few minor comments about renaming some variables for clarity.  I'll wait for your final test with your compilers.

> Source/WebCore/platform/audio/VectorMath.cpp:184
> +            num8 = vld1q_f32(sourceP + 12);

I think the code would be a little clearer if you used better names than num1-num8.  Maybe rename num1 to dest0, num2 to source0, num3 to dest1, num4 to source1, and so on.

> Source/WebCore/platform/audio/VectorMath.cpp:405
> +            num8 = vld1q_f32(source2P + 12);

As above, maybe rename num1-num8 with better names to reflect what the source is.  Maybe rename num1 to source1P0, num2 to source2P0, num3 to source1P1, num4 to source2P1, and so on.

> Source/WebCore/platform/audio/VectorMath.cpp:500
> +            num8 = vld1q_f32(source2P + 12);

Same comment about renaming num1-num8
Comment 27 kdj 2013-03-26 23:40:46 PDT
Created attachment 195231 [details]
Patch with variable name changes

Hi Raymond,

1. As the number of Neon and ARM register are within count, nothing has gone on stack. I checked assembly also keeping variables in loop and out-loop in .S file.
2. Renamed the variables as you mentioned.

Please check the patch.

Thanks,
kdj
Comment 28 Raymond Toy 2013-03-27 08:57:04 PDT
Comment on attachment 195231 [details]
Patch with variable name changes

View in context: https://bugs.webkit.org/attachment.cgi?id=195231&action=review

> Source/WebCore/ChangeLog:8
> +        Modifications are done to unsure overcoming stalls and Loop unrolling mechanism

Typo:  "unsure" -> "ensure"

> Source/WebCore/platform/audio/VectorMath.cpp:270
> +        float32x4_t sour0, result0;

Why not use "source0" instead of "sour0"?  Whole words are preferred.  Same comment applies to all code below that uses "sour".
Comment 29 kdj 2013-03-28 02:19:31 PDT
Created attachment 195506 [details]
Modified the patch

Modified the patch.
Comment 30 Raymond Toy 2013-03-28 11:38:12 PDT
Comment on attachment 195506 [details]
Modified the patch

View in context: https://bugs.webkit.org/attachment.cgi?id=195506&action=review

Looks good except for one last issue.

> Source/WebCore/platform/audio/VectorMath.cpp:174
> +        float32x4_t dest3, source3, result3;

Since you mentioned that there was no difference with declaring these variables inside the loop, I think they should be declared in the loop.  So something like

for (...) {
  float32x4_t dest0 = vld1q_f32(destP);
  ...
}
Comment 31 kdj 2013-04-02 04:46:39 PDT
Created attachment 196121 [details]
Variable declaration inside Loop

please check.
Comment 32 Raymond Toy 2013-04-02 11:08:23 PDT
(In reply to comment #31)
> Created an attachment (id=196121) [details]
> Variable declaration inside Loop
> 
> please check.

This looks good.  Thank you for this optimized implementation.

One last question:  Have you run your tests on this last patch?
Comment 33 kdj 2013-04-04 03:27:43 PDT
I checked the gain it is same; except for zvmul; as we are not using unrolling there, we are using same code.
Comment 34 kdj 2013-05-09 20:49:50 PDT
Since long no updation about this patch. Please share views.
Comment 35 kdj 2013-06-10 21:53:08 PDT
Please share the status, we would like to do more contributions on audio filters also.
Comment 36 Brent Fulgham 2016-03-14 12:11:25 PDT
Comment on attachment 196121 [details]
Variable declaration inside Loop

Unfortunately, this patch seems to have been ignored and no longer applies to the source tree. Could you please rebase the patch against the current source archive?
Comment 37 Sam Sneddon [:gsnedders] 2022-07-02 05:55:15 PDT
This needs rebased (several variables have got renamed over the past decade, but broadly it looks like it should still (manually) apply), and performance tested against modern Neon implementations.
Comment 38 Radar WebKit Bug Importer 2022-07-02 05:55:24 PDT
<rdar://problem/96339063>