Bug 77950

Summary: SSE optimization for vsvesq and vmaxmgv
Product: WebKit Reporter: Xingnan Wang <xingnan.wang>
Component: Web AudioAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: crogers, rtoy, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch
none
Patch none

Description Xingnan Wang 2012-02-07 02:00:40 PST
Optimize the function vsvesq and vmaxmgv in VectorMath by SSE2.
Comment 1 Xingnan Wang 2012-02-07 02:12:53 PST
Created attachment 125790 [details]
Patch

Achieved 3.7x on vsvesq and 4.1x on vmaxmgv.
Comment 2 Chris Rogers 2012-02-07 15:22:07 PST
Comment on attachment 125790 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=125790&action=review

> Source/WebCore/platform/audio/VectorMath.cpp:439
> + 

nit: extra blank line - please remove here and elsewhere in similar places in this file

> Source/WebCore/platform/audio/VectorMath.cpp:440
> +        // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. 

spelling nit: "seperately" -> "separately"

please fix here and elsewhere in the file

> Source/WebCore/platform/audio/VectorMath.cpp:448
> +        // Now the sourceP address aligned and start to apply SSE. 

grammar nit: I'd re-word this comment as:

// Now that sourceP is aligned, use SSE.

> Source/WebCore/platform/audio/VectorMath.cpp:450
> +        float* endP = (float *)sourceP + n - tailFrames; 

shouldn't use ordinary cast here.

Can't we make endP "const float":

const float* endP = sourceP + n - tailFrames;

> Source/WebCore/platform/audio/VectorMath.cpp:462
> +        float* groupSumP = reinterpret_cast<float*>(&mSum); 

nit: I think you can use "const float*" instead of "float*"

> Source/WebCore/platform/audio/VectorMath.cpp:487
> +

nit: extra blank line

> Source/WebCore/platform/audio/VectorMath.cpp:488
> +        // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately.

nit: "seperately" spelling

> Source/WebCore/platform/audio/VectorMath.cpp:495
> +        // Now the sourceP address aligned and start to apply SSE.

nit: see comment above about re-wording

> Source/WebCore/platform/audio/VectorMath.cpp:497
> +        float* endP = (float*)sourceP + n - tailFrames;

nit: see comment above

> Source/WebCore/platform/audio/VectorMath.cpp:509
> +        // Caculate abs of min, then campare with max.

typo: "Caculate" -> "Calculate"
typo: "campare" -> "compare"

Is this math correct??
Comment 3 Xingnan Wang 2012-02-08 02:17:34 PST
Created attachment 126033 [details]
Patch
Comment 4 Xingnan Wang 2012-02-08 02:26:27 PST
(In reply to comment #2)
> (From update of attachment 125790 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=125790&action=review
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:439
> > + 
> 
> nit: extra blank line - please remove here and elsewhere in similar places in this file
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:440
> > +        // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. 
> 
> spelling nit: "seperately" -> "separately"
> 
> please fix here and elsewhere in the file
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:448
> > +        // Now the sourceP address aligned and start to apply SSE. 
> 
> grammar nit: I'd re-word this comment as:
> 
> // Now that sourceP is aligned, use SSE.
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:450
> > +        float* endP = (float *)sourceP + n - tailFrames; 
> 
> shouldn't use ordinary cast here.
> 
> Can't we make endP "const float":
> 
> const float* endP = sourceP + n - tailFrames;
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:462
> > +        float* groupSumP = reinterpret_cast<float*>(&mSum); 
> 
> nit: I think you can use "const float*" instead of "float*"
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:487
> > +
> 
> nit: extra blank line
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:488
> > +        // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately.
> 
> nit: "seperately" spelling
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:495
> > +        // Now the sourceP address aligned and start to apply SSE.
> 
> nit: see comment above about re-wording
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:497
> > +        float* endP = (float*)sourceP + n - tailFrames;
> 
> nit: see comment above
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:509
> > +        // Caculate abs of min, then campare with max.
> 
> typo: "Caculate" -> "Calculate"
> typo: "campare" -> "compare"
>
Chris, update the patch as your comments. 
> Is this math correct??
Do you mean whether vmaxmgv is correnct?
I think the maximum abstract value in a vector is either the maximum positive value or the minimum negative value, so select one of them which has the maximum abstract value.
I did the functional test and it`s OK.
Comment 5 Chris Rogers 2012-02-15 12:04:37 PST
Raymond, could you please scrutinize this patch, especially the vmaxmgv() part?
Comment 6 Raymond Toy 2012-02-15 13:18:54 PST
Comment on attachment 126033 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=126033&action=review

> Source/WebCore/ChangeLog:8
> +        Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv.

Out of curiosity is this speed up measured using the typical 128-length vectors?  Or did you use longer vectors so that the initial setup is in the noise?

> Source/WebCore/platform/audio/VectorMath.cpp:500
> +            mMin = _mm_min_ps(mMin, source);

I think it would be more in line with the original code to compute the absolute value before computing the max.  Something like

source = _mm_and_ps(source, mask);
mMax = _mm_max_ps(mMax, source);

where mask is #x7fffffff (4 copies, one for each float).  I think this would be as fast as the current code.

If this is done, then lines 505-507 can be deleted.

> Source/WebCore/platform/audio/VectorMath.cpp:514
> +

What is the reason for the temp variable?  Is this an optimization so that we're not reading and writing to the same max variable?
Comment 7 Xingnan Wang 2012-02-15 19:01:22 PST
Comment on attachment 126033 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=126033&action=review

>> Source/WebCore/ChangeLog:8
>> +        Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv.
> 
> Out of curiosity is this speed up measured using the typical 128-length vectors?  Or did you use longer vectors so that the initial setup is in the noise?

Yes, the result above is measured with 128-length vectors, and also variable lengths are measured. For example, with 1280-length vectors the speed up is up to 3.88x and 5.1x.
I think the 128-length result is most valuable for us now, so just present the 128-length result.

>> Source/WebCore/platform/audio/VectorMath.cpp:500
>> +            mMin = _mm_min_ps(mMin, source);
> 
> I think it would be more in line with the original code to compute the absolute value before computing the max.  Something like
> 
> source = _mm_and_ps(source, mask);
> mMax = _mm_max_ps(mMax, source);
> 
> where mask is #x7fffffff (4 copies, one for each float).  I think this would be as fast as the current code.
> 
> If this is done, then lines 505-507 can be deleted.

Good suggestion, thanks.

>> Source/WebCore/platform/audio/VectorMath.cpp:514
>> +
> 
> What is the reason for the temp variable?  Is this an optimization so that we're not reading and writing to the same max variable?

I just used the temp variable for the readability without considering the optimization here because it`s not in the loop so it`s not quite necessary, I think. If the opt is truly needed some SSE function may be applied here.
Comment 8 Xingnan Wang 2012-02-22 19:17:39 PST
Created attachment 128369 [details]
Patch

Ray, 
Thanks your comments, I update the patch and use mask 0x7fffffff.
There is less affect of the performance.
Comment 9 Raymond Toy 2012-02-23 09:25:45 PST
Comment on attachment 128369 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review

Looks good overall, except for the comment at line 508.

> Source/WebCore/ChangeLog:8
> +        Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv.

You mentioned that the udpated version of vmaxmgv is now slower(?) than before.  Can you update the numbers here?

> Source/WebCore/platform/audio/VectorMath.cpp:500
> +            source = _mm_and_ps(source, mMask);

You mentioned that the performance with and_ps is less than the previous version with min_ps.  Is this because and_ps is slower or maybe because of some pipelining isues?  I'm just curious as to why this is slower.

> Source/WebCore/platform/audio/VectorMath.cpp:508
> +        max = std::max(groupMaxP[2], groupMaxP[3]);

The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned.  This needs to be rearranged a little so as not to destroy that.
Comment 10 Xingnan Wang 2012-02-23 22:17:15 PST
Comment on attachment 128369 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review

>> Source/WebCore/platform/audio/VectorMath.cpp:500
>> +            source = _mm_and_ps(source, mMask);
> 
> You mentioned that the performance with and_ps is less than the previous version with min_ps.  Is this because and_ps is slower or maybe because of some pipelining isues?  I'm just curious as to why this is slower.

It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated.

>> Source/WebCore/platform/audio/VectorMath.cpp:508
>> +        max = std::max(groupMaxP[2], groupMaxP[3]);
> 
> The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned.  This needs to be rearranged a little so as not to destroy that.

I`ll fix it.
Comment 11 Xingnan Wang 2012-02-24 21:39:51 PST
Created attachment 128852 [details]
Patch
Comment 12 Xingnan Wang 2012-02-24 21:41:43 PST
(In reply to comment #10)
> (From update of attachment 128369 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review
> 
> >> Source/WebCore/platform/audio/VectorMath.cpp:500
> >> +            source = _mm_and_ps(source, mMask);
> > 
> > You mentioned that the performance with and_ps is less than the previous version with min_ps.  Is this because and_ps is slower or maybe because of some pipelining isues?  I'm just curious as to why this is slower.
> 
> It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated.
> 
> >> Source/WebCore/platform/audio/VectorMath.cpp:508
> >> +        max = std::max(groupMaxP[2], groupMaxP[3]);
> > 
> > The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned.  This needs to be rearranged a little so as not to destroy that.
> 
> I`ll fix it.
Fixed.
Comment 13 Raymond Toy 2012-02-27 09:19:55 PST
Comment on attachment 128852 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=128852&action=review

Except for the nit, looks good.

> Source/WebCore/platform/audio/VectorMath.cpp:500
> +            source = _mm_and_ps(source, mMask);

nit: Perhaps add a comment here to say that anding the source with the mask computes the absolute value by making the sign bit 0 (positive).
Comment 14 Raymond Toy 2012-02-27 09:21:17 PST
(In reply to comment #10)
> (From update of attachment 128369 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review
> 
> >> Source/WebCore/platform/audio/VectorMath.cpp:500
> >> +            source = _mm_and_ps(source, mMask);
> > 
> > You mentioned that the performance with and_ps is less than the previous version with min_ps.  Is this because and_ps is slower or maybe because of some pipelining isues?  I'm just curious as to why this is slower.
> 
> It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated.

Ok.  No need to change anything then.  I'm glad the performance didn't change.
Comment 15 Xingnan Wang 2012-02-27 21:01:39 PST
(In reply to comment #14)
> (In reply to comment #10)
> > (From update of attachment 128369 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review
> > 
> > >> Source/WebCore/platform/audio/VectorMath.cpp:500
> > >> +            source = _mm_and_ps(source, mMask);
> > > 
> > > You mentioned that the performance with and_ps is less than the previous version with min_ps.  Is this because and_ps is slower or maybe because of some pipelining isues?  I'm just curious as to why this is slower.
> > 
> > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated.
> 
> Ok.  No need to change anything then.  I'm glad the performance didn't change.

Chris, does the patch need more review?
Comment 16 Xingnan Wang 2012-02-27 21:14:59 PST
Created attachment 129183 [details]
Patch
Comment 17 Xingnan Wang 2012-02-27 21:16:00 PST
(In reply to comment #16)
> Created an attachment (id=129183) [details]
> Patch

(In reply to comment #13)
> (From update of attachment 128852 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=128852&action=review
> 
> Except for the nit, looks good.
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:500
> > +            source = _mm_and_ps(source, mMask);
> 
> nit: Perhaps add a comment here to say that anding the source with the mask computes the absolute value by making the sign bit 0 (positive).

The comment is add, thanks.
Comment 18 Raymond Toy 2012-02-28 10:14:20 PST
Comment on attachment 129183 [details]
Patch

LGTM
Comment 19 Xingnan Wang 2012-02-29 20:30:26 PST
Comment on attachment 129183 [details]
Patch

Submit to the commit-queue as it`s "LGTM".
Comment 20 Chris Rogers 2012-03-01 12:45:41 PST
Comment on attachment 129183 [details]
Patch

Looks good.
Comment 21 WebKit Review Bot 2012-03-01 12:55:09 PST
Comment on attachment 129183 [details]
Patch

Rejecting attachment 129183 [details] from commit-queue.

Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 1

ERROR: /mnt/git/webkit-commit-queue/ChangeLog neither lists a valid reviewer nor contains the string "Unreviewed" or "Rubber stamp" (case insensitive).

Full output: http://queues.webkit.org/results/11757580
Comment 22 Chris Rogers 2012-03-01 17:38:36 PST
Comment on attachment 129183 [details]
Patch

trying commit queue again
Comment 23 WebKit Review Bot 2012-03-01 18:00:21 PST
Comment on attachment 129183 [details]
Patch

Clearing flags on attachment: 129183

Committed r109476: <http://trac.webkit.org/changeset/109476>
Comment 24 WebKit Review Bot 2012-03-01 18:00:26 PST
All reviewed patches have been landed.  Closing bug.