WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
77950
SSE optimization for vsvesq and vmaxmgv
https://bugs.webkit.org/show_bug.cgi?id=77950
Summary
SSE optimization for vsvesq and vmaxmgv
Xingnan Wang
Reported
2012-02-07 02:00:40 PST
Optimize the function vsvesq and vmaxmgv in VectorMath by SSE2.
Attachments
Patch
(3.76 KB, patch)
2012-02-07 02:12 PST
,
Xingnan Wang
no flags
Details
Formatted Diff
Diff
Patch
(6.95 KB, patch)
2012-02-08 02:17 PST
,
Xingnan Wang
no flags
Details
Formatted Diff
Diff
Patch
(6.83 KB, patch)
2012-02-22 19:17 PST
,
Xingnan Wang
no flags
Details
Formatted Diff
Diff
Patch
(7.31 KB, patch)
2012-02-24 21:39 PST
,
Xingnan Wang
no flags
Details
Formatted Diff
Diff
Patch
(7.38 KB, patch)
2012-02-27 21:14 PST
,
Xingnan Wang
no flags
Details
Formatted Diff
Diff
Show Obsolete
(4)
View All
Add attachment
proposed patch, testcase, etc.
Xingnan Wang
Comment 1
2012-02-07 02:12:53 PST
Created
attachment 125790
[details]
Patch Achieved 3.7x on vsvesq and 4.1x on vmaxmgv.
Chris Rogers
Comment 2
2012-02-07 15:22:07 PST
Comment on
attachment 125790
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=125790&action=review
> Source/WebCore/platform/audio/VectorMath.cpp:439 > +
nit: extra blank line - please remove here and elsewhere in similar places in this file
> Source/WebCore/platform/audio/VectorMath.cpp:440 > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately.
spelling nit: "seperately" -> "separately" please fix here and elsewhere in the file
> Source/WebCore/platform/audio/VectorMath.cpp:448 > + // Now the sourceP address aligned and start to apply SSE.
grammar nit: I'd re-word this comment as: // Now that sourceP is aligned, use SSE.
> Source/WebCore/platform/audio/VectorMath.cpp:450 > + float* endP = (float *)sourceP + n - tailFrames;
shouldn't use ordinary cast here. Can't we make endP "const float": const float* endP = sourceP + n - tailFrames;
> Source/WebCore/platform/audio/VectorMath.cpp:462 > + float* groupSumP = reinterpret_cast<float*>(&mSum);
nit: I think you can use "const float*" instead of "float*"
> Source/WebCore/platform/audio/VectorMath.cpp:487 > +
nit: extra blank line
> Source/WebCore/platform/audio/VectorMath.cpp:488 > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately.
nit: "seperately" spelling
> Source/WebCore/platform/audio/VectorMath.cpp:495 > + // Now the sourceP address aligned and start to apply SSE.
nit: see comment above about re-wording
> Source/WebCore/platform/audio/VectorMath.cpp:497 > + float* endP = (float*)sourceP + n - tailFrames;
nit: see comment above
> Source/WebCore/platform/audio/VectorMath.cpp:509 > + // Caculate abs of min, then campare with max.
typo: "Caculate" -> "Calculate" typo: "campare" -> "compare" Is this math correct??
Xingnan Wang
Comment 3
2012-02-08 02:17:34 PST
Created
attachment 126033
[details]
Patch
Xingnan Wang
Comment 4
2012-02-08 02:26:27 PST
(In reply to
comment #2
)
> (From update of
attachment 125790
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=125790&action=review
> > > Source/WebCore/platform/audio/VectorMath.cpp:439 > > + > > nit: extra blank line - please remove here and elsewhere in similar places in this file > > > Source/WebCore/platform/audio/VectorMath.cpp:440 > > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. > > spelling nit: "seperately" -> "separately" > > please fix here and elsewhere in the file > > > Source/WebCore/platform/audio/VectorMath.cpp:448 > > + // Now the sourceP address aligned and start to apply SSE. > > grammar nit: I'd re-word this comment as: > > // Now that sourceP is aligned, use SSE. > > > Source/WebCore/platform/audio/VectorMath.cpp:450 > > + float* endP = (float *)sourceP + n - tailFrames; > > shouldn't use ordinary cast here. > > Can't we make endP "const float": > > const float* endP = sourceP + n - tailFrames; > > > Source/WebCore/platform/audio/VectorMath.cpp:462 > > + float* groupSumP = reinterpret_cast<float*>(&mSum); > > nit: I think you can use "const float*" instead of "float*" > > > Source/WebCore/platform/audio/VectorMath.cpp:487 > > + > > nit: extra blank line > > > Source/WebCore/platform/audio/VectorMath.cpp:488 > > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. > > nit: "seperately" spelling > > > Source/WebCore/platform/audio/VectorMath.cpp:495 > > + // Now the sourceP address aligned and start to apply SSE. > > nit: see comment above about re-wording > > > Source/WebCore/platform/audio/VectorMath.cpp:497 > > + float* endP = (float*)sourceP + n - tailFrames; > > nit: see comment above > > > Source/WebCore/platform/audio/VectorMath.cpp:509 > > + // Caculate abs of min, then campare with max. > > typo: "Caculate" -> "Calculate" > typo: "campare" -> "compare"
> Chris, update the patch as your comments.
> Is this math correct??
Do you mean whether vmaxmgv is correnct? I think the maximum abstract value in a vector is either the maximum positive value or the minimum negative value, so select one of them which has the maximum abstract value. I did the functional test and it`s OK.
Chris Rogers
Comment 5
2012-02-15 12:04:37 PST
Raymond, could you please scrutinize this patch, especially the vmaxmgv() part?
Raymond Toy
Comment 6
2012-02-15 13:18:54 PST
Comment on
attachment 126033
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=126033&action=review
> Source/WebCore/ChangeLog:8 > + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv.
Out of curiosity is this speed up measured using the typical 128-length vectors? Or did you use longer vectors so that the initial setup is in the noise?
> Source/WebCore/platform/audio/VectorMath.cpp:500 > + mMin = _mm_min_ps(mMin, source);
I think it would be more in line with the original code to compute the absolute value before computing the max. Something like source = _mm_and_ps(source, mask); mMax = _mm_max_ps(mMax, source); where mask is #x7fffffff (4 copies, one for each float). I think this would be as fast as the current code. If this is done, then lines 505-507 can be deleted.
> Source/WebCore/platform/audio/VectorMath.cpp:514 > +
What is the reason for the temp variable? Is this an optimization so that we're not reading and writing to the same max variable?
Xingnan Wang
Comment 7
2012-02-15 19:01:22 PST
Comment on
attachment 126033
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=126033&action=review
>> Source/WebCore/ChangeLog:8 >> + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv. > > Out of curiosity is this speed up measured using the typical 128-length vectors? Or did you use longer vectors so that the initial setup is in the noise?
Yes, the result above is measured with 128-length vectors, and also variable lengths are measured. For example, with 1280-length vectors the speed up is up to 3.88x and 5.1x. I think the 128-length result is most valuable for us now, so just present the 128-length result.
>> Source/WebCore/platform/audio/VectorMath.cpp:500 >> + mMin = _mm_min_ps(mMin, source); > > I think it would be more in line with the original code to compute the absolute value before computing the max. Something like > > source = _mm_and_ps(source, mask); > mMax = _mm_max_ps(mMax, source); > > where mask is #x7fffffff (4 copies, one for each float). I think this would be as fast as the current code. > > If this is done, then lines 505-507 can be deleted.
Good suggestion, thanks.
>> Source/WebCore/platform/audio/VectorMath.cpp:514 >> + > > What is the reason for the temp variable? Is this an optimization so that we're not reading and writing to the same max variable?
I just used the temp variable for the readability without considering the optimization here because it`s not in the loop so it`s not quite necessary, I think. If the opt is truly needed some SSE function may be applied here.
Xingnan Wang
Comment 8
2012-02-22 19:17:39 PST
Created
attachment 128369
[details]
Patch Ray, Thanks your comments, I update the patch and use mask 0x7fffffff. There is less affect of the performance.
Raymond Toy
Comment 9
2012-02-23 09:25:45 PST
Comment on
attachment 128369
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=128369&action=review
Looks good overall, except for the comment at line 508.
> Source/WebCore/ChangeLog:8 > + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv.
You mentioned that the udpated version of vmaxmgv is now slower(?) than before. Can you update the numbers here?
> Source/WebCore/platform/audio/VectorMath.cpp:500 > + source = _mm_and_ps(source, mMask);
You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower.
> Source/WebCore/platform/audio/VectorMath.cpp:508 > + max = std::max(groupMaxP[2], groupMaxP[3]);
The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that.
Xingnan Wang
Comment 10
2012-02-23 22:17:15 PST
Comment on
attachment 128369
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=128369&action=review
>> Source/WebCore/platform/audio/VectorMath.cpp:500 >> + source = _mm_and_ps(source, mMask); > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower.
It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated.
>> Source/WebCore/platform/audio/VectorMath.cpp:508 >> + max = std::max(groupMaxP[2], groupMaxP[3]); > > The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that.
I`ll fix it.
Xingnan Wang
Comment 11
2012-02-24 21:39:51 PST
Created
attachment 128852
[details]
Patch
Xingnan Wang
Comment 12
2012-02-24 21:41:43 PST
(In reply to
comment #10
)
> (From update of
attachment 128369
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=128369&action=review
> > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > >> + source = _mm_and_ps(source, mMask); > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. > > >> Source/WebCore/platform/audio/VectorMath.cpp:508 > >> + max = std::max(groupMaxP[2], groupMaxP[3]); > > > > The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that. > > I`ll fix it.
Fixed.
Raymond Toy
Comment 13
2012-02-27 09:19:55 PST
Comment on
attachment 128852
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=128852&action=review
Except for the nit, looks good.
> Source/WebCore/platform/audio/VectorMath.cpp:500 > + source = _mm_and_ps(source, mMask);
nit: Perhaps add a comment here to say that anding the source with the mask computes the absolute value by making the sign bit 0 (positive).
Raymond Toy
Comment 14
2012-02-27 09:21:17 PST
(In reply to
comment #10
)
> (From update of
attachment 128369
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=128369&action=review
> > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > >> + source = _mm_and_ps(source, mMask); > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated.
Ok. No need to change anything then. I'm glad the performance didn't change.
Xingnan Wang
Comment 15
2012-02-27 21:01:39 PST
(In reply to
comment #14
)
> (In reply to
comment #10
) > > (From update of
attachment 128369
[details]
[details]) > > View in context:
https://bugs.webkit.org/attachment.cgi?id=128369&action=review
> > > > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > > >> + source = _mm_and_ps(source, mMask); > > > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. > > Ok. No need to change anything then. I'm glad the performance didn't change.
Chris, does the patch need more review?
Xingnan Wang
Comment 16
2012-02-27 21:14:59 PST
Created
attachment 129183
[details]
Patch
Xingnan Wang
Comment 17
2012-02-27 21:16:00 PST
(In reply to
comment #16
)
> Created an attachment (id=129183) [details] > Patch
(In reply to
comment #13
)
> (From update of
attachment 128852
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=128852&action=review
> > Except for the nit, looks good. > > > Source/WebCore/platform/audio/VectorMath.cpp:500 > > + source = _mm_and_ps(source, mMask); > > nit: Perhaps add a comment here to say that anding the source with the mask computes the absolute value by making the sign bit 0 (positive).
The comment is add, thanks.
Raymond Toy
Comment 18
2012-02-28 10:14:20 PST
Comment on
attachment 129183
[details]
Patch LGTM
Xingnan Wang
Comment 19
2012-02-29 20:30:26 PST
Comment on
attachment 129183
[details]
Patch Submit to the commit-queue as it`s "LGTM".
Chris Rogers
Comment 20
2012-03-01 12:45:41 PST
Comment on
attachment 129183
[details]
Patch Looks good.
WebKit Review Bot
Comment 21
2012-03-01 12:55:09 PST
Comment on
attachment 129183
[details]
Patch Rejecting
attachment 129183
[details]
from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 1 ERROR: /mnt/git/webkit-commit-queue/ChangeLog neither lists a valid reviewer nor contains the string "Unreviewed" or "Rubber stamp" (case insensitive). Full output:
http://queues.webkit.org/results/11757580
Chris Rogers
Comment 22
2012-03-01 17:38:36 PST
Comment on
attachment 129183
[details]
Patch trying commit queue again
WebKit Review Bot
Comment 23
2012-03-01 18:00:21 PST
Comment on
attachment 129183
[details]
Patch Clearing flags on attachment: 129183 Committed
r109476
: <
http://trac.webkit.org/changeset/109476
>
WebKit Review Bot
Comment 24
2012-03-01 18:00:26 PST
All reviewed patches have been landed. Closing bug.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug