RESOLVED FIXED 77950
SSE optimization for vsvesq and vmaxmgv
https://bugs.webkit.org/show_bug.cgi?id=77950
Summary SSE optimization for vsvesq and vmaxmgv
Xingnan Wang
Reported 2012-02-07 02:00:40 PST
Optimize the function vsvesq and vmaxmgv in VectorMath by SSE2.
Attachments
Patch (3.76 KB, patch)
2012-02-07 02:12 PST, Xingnan Wang
no flags
Patch (6.95 KB, patch)
2012-02-08 02:17 PST, Xingnan Wang
no flags
Patch (6.83 KB, patch)
2012-02-22 19:17 PST, Xingnan Wang
no flags
Patch (7.31 KB, patch)
2012-02-24 21:39 PST, Xingnan Wang
no flags
Patch (7.38 KB, patch)
2012-02-27 21:14 PST, Xingnan Wang
no flags
Xingnan Wang
Comment 1 2012-02-07 02:12:53 PST
Created attachment 125790 [details] Patch Achieved 3.7x on vsvesq and 4.1x on vmaxmgv.
Chris Rogers
Comment 2 2012-02-07 15:22:07 PST
Comment on attachment 125790 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=125790&action=review > Source/WebCore/platform/audio/VectorMath.cpp:439 > + nit: extra blank line - please remove here and elsewhere in similar places in this file > Source/WebCore/platform/audio/VectorMath.cpp:440 > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. spelling nit: "seperately" -> "separately" please fix here and elsewhere in the file > Source/WebCore/platform/audio/VectorMath.cpp:448 > + // Now the sourceP address aligned and start to apply SSE. grammar nit: I'd re-word this comment as: // Now that sourceP is aligned, use SSE. > Source/WebCore/platform/audio/VectorMath.cpp:450 > + float* endP = (float *)sourceP + n - tailFrames; shouldn't use ordinary cast here. Can't we make endP "const float": const float* endP = sourceP + n - tailFrames; > Source/WebCore/platform/audio/VectorMath.cpp:462 > + float* groupSumP = reinterpret_cast<float*>(&mSum); nit: I think you can use "const float*" instead of "float*" > Source/WebCore/platform/audio/VectorMath.cpp:487 > + nit: extra blank line > Source/WebCore/platform/audio/VectorMath.cpp:488 > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. nit: "seperately" spelling > Source/WebCore/platform/audio/VectorMath.cpp:495 > + // Now the sourceP address aligned and start to apply SSE. nit: see comment above about re-wording > Source/WebCore/platform/audio/VectorMath.cpp:497 > + float* endP = (float*)sourceP + n - tailFrames; nit: see comment above > Source/WebCore/platform/audio/VectorMath.cpp:509 > + // Caculate abs of min, then campare with max. typo: "Caculate" -> "Calculate" typo: "campare" -> "compare" Is this math correct??
Xingnan Wang
Comment 3 2012-02-08 02:17:34 PST
Xingnan Wang
Comment 4 2012-02-08 02:26:27 PST
(In reply to comment #2) > (From update of attachment 125790 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=125790&action=review > > > Source/WebCore/platform/audio/VectorMath.cpp:439 > > + > > nit: extra blank line - please remove here and elsewhere in similar places in this file > > > Source/WebCore/platform/audio/VectorMath.cpp:440 > > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. > > spelling nit: "seperately" -> "separately" > > please fix here and elsewhere in the file > > > Source/WebCore/platform/audio/VectorMath.cpp:448 > > + // Now the sourceP address aligned and start to apply SSE. > > grammar nit: I'd re-word this comment as: > > // Now that sourceP is aligned, use SSE. > > > Source/WebCore/platform/audio/VectorMath.cpp:450 > > + float* endP = (float *)sourceP + n - tailFrames; > > shouldn't use ordinary cast here. > > Can't we make endP "const float": > > const float* endP = sourceP + n - tailFrames; > > > Source/WebCore/platform/audio/VectorMath.cpp:462 > > + float* groupSumP = reinterpret_cast<float*>(&mSum); > > nit: I think you can use "const float*" instead of "float*" > > > Source/WebCore/platform/audio/VectorMath.cpp:487 > > + > > nit: extra blank line > > > Source/WebCore/platform/audio/VectorMath.cpp:488 > > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. > > nit: "seperately" spelling > > > Source/WebCore/platform/audio/VectorMath.cpp:495 > > + // Now the sourceP address aligned and start to apply SSE. > > nit: see comment above about re-wording > > > Source/WebCore/platform/audio/VectorMath.cpp:497 > > + float* endP = (float*)sourceP + n - tailFrames; > > nit: see comment above > > > Source/WebCore/platform/audio/VectorMath.cpp:509 > > + // Caculate abs of min, then campare with max. > > typo: "Caculate" -> "Calculate" > typo: "campare" -> "compare" > Chris, update the patch as your comments. > Is this math correct?? Do you mean whether vmaxmgv is correnct? I think the maximum abstract value in a vector is either the maximum positive value or the minimum negative value, so select one of them which has the maximum abstract value. I did the functional test and it`s OK.
Chris Rogers
Comment 5 2012-02-15 12:04:37 PST
Raymond, could you please scrutinize this patch, especially the vmaxmgv() part?
Raymond Toy
Comment 6 2012-02-15 13:18:54 PST
Comment on attachment 126033 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=126033&action=review > Source/WebCore/ChangeLog:8 > + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv. Out of curiosity is this speed up measured using the typical 128-length vectors? Or did you use longer vectors so that the initial setup is in the noise? > Source/WebCore/platform/audio/VectorMath.cpp:500 > + mMin = _mm_min_ps(mMin, source); I think it would be more in line with the original code to compute the absolute value before computing the max. Something like source = _mm_and_ps(source, mask); mMax = _mm_max_ps(mMax, source); where mask is #x7fffffff (4 copies, one for each float). I think this would be as fast as the current code. If this is done, then lines 505-507 can be deleted. > Source/WebCore/platform/audio/VectorMath.cpp:514 > + What is the reason for the temp variable? Is this an optimization so that we're not reading and writing to the same max variable?
Xingnan Wang
Comment 7 2012-02-15 19:01:22 PST
Comment on attachment 126033 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=126033&action=review >> Source/WebCore/ChangeLog:8 >> + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv. > > Out of curiosity is this speed up measured using the typical 128-length vectors? Or did you use longer vectors so that the initial setup is in the noise? Yes, the result above is measured with 128-length vectors, and also variable lengths are measured. For example, with 1280-length vectors the speed up is up to 3.88x and 5.1x. I think the 128-length result is most valuable for us now, so just present the 128-length result. >> Source/WebCore/platform/audio/VectorMath.cpp:500 >> + mMin = _mm_min_ps(mMin, source); > > I think it would be more in line with the original code to compute the absolute value before computing the max. Something like > > source = _mm_and_ps(source, mask); > mMax = _mm_max_ps(mMax, source); > > where mask is #x7fffffff (4 copies, one for each float). I think this would be as fast as the current code. > > If this is done, then lines 505-507 can be deleted. Good suggestion, thanks. >> Source/WebCore/platform/audio/VectorMath.cpp:514 >> + > > What is the reason for the temp variable? Is this an optimization so that we're not reading and writing to the same max variable? I just used the temp variable for the readability without considering the optimization here because it`s not in the loop so it`s not quite necessary, I think. If the opt is truly needed some SSE function may be applied here.
Xingnan Wang
Comment 8 2012-02-22 19:17:39 PST
Created attachment 128369 [details] Patch Ray, Thanks your comments, I update the patch and use mask 0x7fffffff. There is less affect of the performance.
Raymond Toy
Comment 9 2012-02-23 09:25:45 PST
Comment on attachment 128369 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review Looks good overall, except for the comment at line 508. > Source/WebCore/ChangeLog:8 > + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv. You mentioned that the udpated version of vmaxmgv is now slower(?) than before. Can you update the numbers here? > Source/WebCore/platform/audio/VectorMath.cpp:500 > + source = _mm_and_ps(source, mMask); You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > Source/WebCore/platform/audio/VectorMath.cpp:508 > + max = std::max(groupMaxP[2], groupMaxP[3]); The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that.
Xingnan Wang
Comment 10 2012-02-23 22:17:15 PST
Comment on attachment 128369 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review >> Source/WebCore/platform/audio/VectorMath.cpp:500 >> + source = _mm_and_ps(source, mMask); > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. >> Source/WebCore/platform/audio/VectorMath.cpp:508 >> + max = std::max(groupMaxP[2], groupMaxP[3]); > > The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that. I`ll fix it.
Xingnan Wang
Comment 11 2012-02-24 21:39:51 PST
Xingnan Wang
Comment 12 2012-02-24 21:41:43 PST
(In reply to comment #10) > (From update of attachment 128369 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review > > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > >> + source = _mm_and_ps(source, mMask); > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. > > >> Source/WebCore/platform/audio/VectorMath.cpp:508 > >> + max = std::max(groupMaxP[2], groupMaxP[3]); > > > > The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that. > > I`ll fix it. Fixed.
Raymond Toy
Comment 13 2012-02-27 09:19:55 PST
Comment on attachment 128852 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128852&action=review Except for the nit, looks good. > Source/WebCore/platform/audio/VectorMath.cpp:500 > + source = _mm_and_ps(source, mMask); nit: Perhaps add a comment here to say that anding the source with the mask computes the absolute value by making the sign bit 0 (positive).
Raymond Toy
Comment 14 2012-02-27 09:21:17 PST
(In reply to comment #10) > (From update of attachment 128369 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review > > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > >> + source = _mm_and_ps(source, mMask); > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. Ok. No need to change anything then. I'm glad the performance didn't change.
Xingnan Wang
Comment 15 2012-02-27 21:01:39 PST
(In reply to comment #14) > (In reply to comment #10) > > (From update of attachment 128369 [details] [details]) > > View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review > > > > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > > >> + source = _mm_and_ps(source, mMask); > > > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. > > Ok. No need to change anything then. I'm glad the performance didn't change. Chris, does the patch need more review?
Xingnan Wang
Comment 16 2012-02-27 21:14:59 PST
Xingnan Wang
Comment 17 2012-02-27 21:16:00 PST
(In reply to comment #16) > Created an attachment (id=129183) [details] > Patch (In reply to comment #13) > (From update of attachment 128852 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=128852&action=review > > Except for the nit, looks good. > > > Source/WebCore/platform/audio/VectorMath.cpp:500 > > + source = _mm_and_ps(source, mMask); > > nit: Perhaps add a comment here to say that anding the source with the mask computes the absolute value by making the sign bit 0 (positive). The comment is add, thanks.
Raymond Toy
Comment 18 2012-02-28 10:14:20 PST
Comment on attachment 129183 [details] Patch LGTM
Xingnan Wang
Comment 19 2012-02-29 20:30:26 PST
Comment on attachment 129183 [details] Patch Submit to the commit-queue as it`s "LGTM".
Chris Rogers
Comment 20 2012-03-01 12:45:41 PST
Comment on attachment 129183 [details] Patch Looks good.
WebKit Review Bot
Comment 21 2012-03-01 12:55:09 PST
Comment on attachment 129183 [details] Patch Rejecting attachment 129183 [details] from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 1 ERROR: /mnt/git/webkit-commit-queue/ChangeLog neither lists a valid reviewer nor contains the string "Unreviewed" or "Rubber stamp" (case insensitive). Full output: http://queues.webkit.org/results/11757580
Chris Rogers
Comment 22 2012-03-01 17:38:36 PST
Comment on attachment 129183 [details] Patch trying commit queue again
WebKit Review Bot
Comment 23 2012-03-01 18:00:21 PST
Comment on attachment 129183 [details] Patch Clearing flags on attachment: 129183 Committed r109476: <http://trac.webkit.org/changeset/109476>
WebKit Review Bot
Comment 24 2012-03-01 18:00:26 PST
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.