77950 – SSE optimization for vsvesq and vmaxmgv

RESOLVED FIXED77950

SSE optimization for vsvesq and vmaxmgv

https://bugs.webkit.org/show_bug.cgi?id=77950

Summary SSE optimization for vsvesq and vmaxmgv

Xingnan Wang

Reported 2012-02-07 02:00:40 PST

Optimize the function vsvesq and vmaxmgv in VectorMath by SSE2.

Attachments
Patch (3.76 KB, patch) 2012-02-07 02:12 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (6.95 KB, patch) 2012-02-08 02:17 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (6.83 KB, patch) 2012-02-22 19:17 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (7.31 KB, patch) 2012-02-24 21:39 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (7.38 KB, patch) 2012-02-27 21:14 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Show Obsolete (4) View All Add attachment proposed patch, testcase, etc.

Xingnan Wang

Comment 1 2012-02-07 02:12:53 PST

Created attachment 125790 [details] Patch Achieved 3.7x on vsvesq and 4.1x on vmaxmgv.

Chris Rogers

Comment 2 2012-02-07 15:22:07 PST

Comment on attachment 125790 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=125790&action=review > Source/WebCore/platform/audio/VectorMath.cpp:439 > + nit: extra blank line - please remove here and elsewhere in similar places in this file > Source/WebCore/platform/audio/VectorMath.cpp:440 > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. spelling nit: "seperately" -> "separately" please fix here and elsewhere in the file > Source/WebCore/platform/audio/VectorMath.cpp:448 > + // Now the sourceP address aligned and start to apply SSE. grammar nit: I'd re-word this comment as: // Now that sourceP is aligned, use SSE. > Source/WebCore/platform/audio/VectorMath.cpp:450 > + float* endP = (float *)sourceP + n - tailFrames; shouldn't use ordinary cast here. Can't we make endP "const float": const float* endP = sourceP + n - tailFrames; > Source/WebCore/platform/audio/VectorMath.cpp:462 > + float* groupSumP = reinterpret_cast<float*>(&mSum); nit: I think you can use "const float*" instead of "float*" > Source/WebCore/platform/audio/VectorMath.cpp:487 > + nit: extra blank line > Source/WebCore/platform/audio/VectorMath.cpp:488 > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. nit: "seperately" spelling > Source/WebCore/platform/audio/VectorMath.cpp:495 > + // Now the sourceP address aligned and start to apply SSE. nit: see comment above about re-wording > Source/WebCore/platform/audio/VectorMath.cpp:497 > + float* endP = (float*)sourceP + n - tailFrames; nit: see comment above > Source/WebCore/platform/audio/VectorMath.cpp:509 > + // Caculate abs of min, then campare with max. typo: "Caculate" -> "Calculate" typo: "campare" -> "compare" Is this math correct??

Xingnan Wang

Comment 3 2012-02-08 02:17:34 PST

Created attachment 126033 [details] Patch

Xingnan Wang

Comment 4 2012-02-08 02:26:27 PST

(In reply to comment #2) > (From update of attachment 125790 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=125790&action=review > > > Source/WebCore/platform/audio/VectorMath.cpp:439 > > + > > nit: extra blank line - please remove here and elsewhere in similar places in this file > > > Source/WebCore/platform/audio/VectorMath.cpp:440 > > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. > > spelling nit: "seperately" -> "separately" > > please fix here and elsewhere in the file > > > Source/WebCore/platform/audio/VectorMath.cpp:448 > > + // Now the sourceP address aligned and start to apply SSE. > > grammar nit: I'd re-word this comment as: > > // Now that sourceP is aligned, use SSE. > > > Source/WebCore/platform/audio/VectorMath.cpp:450 > > + float* endP = (float *)sourceP + n - tailFrames; > > shouldn't use ordinary cast here. > > Can't we make endP "const float": > > const float* endP = sourceP + n - tailFrames; > > > Source/WebCore/platform/audio/VectorMath.cpp:462 > > + float* groupSumP = reinterpret_cast<float*>(&mSum); > > nit: I think you can use "const float*" instead of "float*" > > > Source/WebCore/platform/audio/VectorMath.cpp:487 > > + > > nit: extra blank line > > > Source/WebCore/platform/audio/VectorMath.cpp:488 > > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. > > nit: "seperately" spelling > > > Source/WebCore/platform/audio/VectorMath.cpp:495 > > + // Now the sourceP address aligned and start to apply SSE. > > nit: see comment above about re-wording > > > Source/WebCore/platform/audio/VectorMath.cpp:497 > > + float* endP = (float*)sourceP + n - tailFrames; > > nit: see comment above > > > Source/WebCore/platform/audio/VectorMath.cpp:509 > > + // Caculate abs of min, then campare with max. > > typo: "Caculate" -> "Calculate" > typo: "campare" -> "compare" > Chris, update the patch as your comments. > Is this math correct?? Do you mean whether vmaxmgv is correnct? I think the maximum abstract value in a vector is either the maximum positive value or the minimum negative value, so select one of them which has the maximum abstract value. I did the functional test and it`s OK.

Chris Rogers

Comment 5 2012-02-15 12:04:37 PST

Raymond, could you please scrutinize this patch, especially the vmaxmgv() part?

Raymond Toy

Comment 6 2012-02-15 13:18:54 PST

Comment on attachment 126033 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=126033&action=review > Source/WebCore/ChangeLog:8 > + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv. Out of curiosity is this speed up measured using the typical 128-length vectors? Or did you use longer vectors so that the initial setup is in the noise? > Source/WebCore/platform/audio/VectorMath.cpp:500 > + mMin = _mm_min_ps(mMin, source); I think it would be more in line with the original code to compute the absolute value before computing the max. Something like source = _mm_and_ps(source, mask); mMax = _mm_max_ps(mMax, source); where mask is #x7fffffff (4 copies, one for each float). I think this would be as fast as the current code. If this is done, then lines 505-507 can be deleted. > Source/WebCore/platform/audio/VectorMath.cpp:514 > + What is the reason for the temp variable? Is this an optimization so that we're not reading and writing to the same max variable?

Xingnan Wang

Comment 7 2012-02-15 19:01:22 PST

Comment on attachment 126033 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=126033&action=review >> Source/WebCore/ChangeLog:8 >> + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv. > > Out of curiosity is this speed up measured using the typical 128-length vectors? Or did you use longer vectors so that the initial setup is in the noise? Yes, the result above is measured with 128-length vectors, and also variable lengths are measured. For example, with 1280-length vectors the speed up is up to 3.88x and 5.1x. I think the 128-length result is most valuable for us now, so just present the 128-length result. >> Source/WebCore/platform/audio/VectorMath.cpp:500 >> + mMin = _mm_min_ps(mMin, source); > > I think it would be more in line with the original code to compute the absolute value before computing the max. Something like > > source = _mm_and_ps(source, mask); > mMax = _mm_max_ps(mMax, source); > > where mask is #x7fffffff (4 copies, one for each float). I think this would be as fast as the current code. > > If this is done, then lines 505-507 can be deleted. Good suggestion, thanks. >> Source/WebCore/platform/audio/VectorMath.cpp:514 >> + > > What is the reason for the temp variable? Is this an optimization so that we're not reading and writing to the same max variable? I just used the temp variable for the readability without considering the optimization here because it`s not in the loop so it`s not quite necessary, I think. If the opt is truly needed some SSE function may be applied here.

Xingnan Wang

Comment 8 2012-02-22 19:17:39 PST

Created attachment 128369 [details] Patch Ray, Thanks your comments, I update the patch and use mask 0x7fffffff. There is less affect of the performance.

Raymond Toy

Comment 9 2012-02-23 09:25:45 PST

Comment on attachment 128369 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review Looks good overall, except for the comment at line 508. > Source/WebCore/ChangeLog:8 > + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv. You mentioned that the udpated version of vmaxmgv is now slower(?) than before. Can you update the numbers here? > Source/WebCore/platform/audio/VectorMath.cpp:500 > + source = _mm_and_ps(source, mMask); You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > Source/WebCore/platform/audio/VectorMath.cpp:508 > + max = std::max(groupMaxP[2], groupMaxP[3]); The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that.

Xingnan Wang

Comment 10 2012-02-23 22:17:15 PST

Comment on attachment 128369 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review >> Source/WebCore/platform/audio/VectorMath.cpp:500 >> + source = _mm_and_ps(source, mMask); > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. >> Source/WebCore/platform/audio/VectorMath.cpp:508 >> + max = std::max(groupMaxP[2], groupMaxP[3]); > > The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that. I`ll fix it.

Xingnan Wang

Comment 11 2012-02-24 21:39:51 PST

Created attachment 128852 [details] Patch

Xingnan Wang

Comment 12 2012-02-24 21:41:43 PST

(In reply to comment #10) > (From update of attachment 128369 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review > > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > >> + source = _mm_and_ps(source, mMask); > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. > > >> Source/WebCore/platform/audio/VectorMath.cpp:508 > >> + max = std::max(groupMaxP[2], groupMaxP[3]); > > > > The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that. > > I`ll fix it. Fixed.

Raymond Toy

Comment 13 2012-02-27 09:19:55 PST

Comment on attachment 128852 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128852&action=review Except for the nit, looks good. > Source/WebCore/platform/audio/VectorMath.cpp:500 > + source = _mm_and_ps(source, mMask); nit: Perhaps add a comment here to say that anding the source with the mask computes the absolute value by making the sign bit 0 (positive).

Raymond Toy

Comment 14 2012-02-27 09:21:17 PST

(In reply to comment #10) > (From update of attachment 128369 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review > > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > >> + source = _mm_and_ps(source, mMask); > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. Ok. No need to change anything then. I'm glad the performance didn't change.

Xingnan Wang

Comment 15 2012-02-27 21:01:39 PST

(In reply to comment #14) > (In reply to comment #10) > > (From update of attachment 128369 [details] [details]) > > View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review > > > > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > > >> + source = _mm_and_ps(source, mMask); > > > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. > > Ok. No need to change anything then. I'm glad the performance didn't change. Chris, does the patch need more review?

Xingnan Wang

Comment 16 2012-02-27 21:14:59 PST

Created attachment 129183 [details] Patch

Xingnan Wang

Comment 17 2012-02-27 21:16:00 PST

(In reply to comment #16) > Created an attachment (id=129183) [details] > Patch (In reply to comment #13) > (From update of attachment 128852 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=128852&action=review > > Except for the nit, looks good. > > > Source/WebCore/platform/audio/VectorMath.cpp:500 > > + source = _mm_and_ps(source, mMask); > > nit: Perhaps add a comment here to say that anding the source with the mask computes the absolute value by making the sign bit 0 (positive). The comment is add, thanks.

Raymond Toy

Comment 18 2012-02-28 10:14:20 PST

Comment on attachment 129183 [details] Patch LGTM

Xingnan Wang

Comment 19 2012-02-29 20:30:26 PST

Comment on attachment 129183 [details] Patch Submit to the commit-queue as it`s "LGTM".

Chris Rogers

Comment 20 2012-03-01 12:45:41 PST

Comment on attachment 129183 [details] Patch Looks good.

WebKit Review Bot

Comment 21 2012-03-01 12:55:09 PST

Comment on attachment 129183 [details] Patch Rejecting attachment 129183 [details] from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 1 ERROR: /mnt/git/webkit-commit-queue/ChangeLog neither lists a valid reviewer nor contains the string "Unreviewed" or "Rubber stamp" (case insensitive). Full output: http://queues.webkit.org/results/11757580

Chris Rogers

Comment 22 2012-03-01 17:38:36 PST

Comment on attachment 129183 [details] Patch trying commit queue again

WebKit Review Bot

Comment 23 2012-03-01 18:00:21 PST

Comment on attachment 129183 [details] Patch Clearing flags on attachment: 129183 Committed r109476: <http://trac.webkit.org/changeset/109476>

WebKit Review Bot

Comment 24 2012-03-01 18:00:26 PST

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component Web Audio

Assignee

Nobody

Reported

2012-02-07 02:00 PST

Modified

2012-03-01 18:00 PST History

CC List

3 users Show

URL

Keywords

Depends on

Blocks