Optimize the function vsvesq and vmaxmgv in VectorMath by SSE2.
Created attachment 125790 [details] Patch Achieved 3.7x on vsvesq and 4.1x on vmaxmgv.
Comment on attachment 125790 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=125790&action=review > Source/WebCore/platform/audio/VectorMath.cpp:439 > + nit: extra blank line - please remove here and elsewhere in similar places in this file > Source/WebCore/platform/audio/VectorMath.cpp:440 > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. spelling nit: "seperately" -> "separately" please fix here and elsewhere in the file > Source/WebCore/platform/audio/VectorMath.cpp:448 > + // Now the sourceP address aligned and start to apply SSE. grammar nit: I'd re-word this comment as: // Now that sourceP is aligned, use SSE. > Source/WebCore/platform/audio/VectorMath.cpp:450 > + float* endP = (float *)sourceP + n - tailFrames; shouldn't use ordinary cast here. Can't we make endP "const float": const float* endP = sourceP + n - tailFrames; > Source/WebCore/platform/audio/VectorMath.cpp:462 > + float* groupSumP = reinterpret_cast<float*>(&mSum); nit: I think you can use "const float*" instead of "float*" > Source/WebCore/platform/audio/VectorMath.cpp:487 > + nit: extra blank line > Source/WebCore/platform/audio/VectorMath.cpp:488 > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. nit: "seperately" spelling > Source/WebCore/platform/audio/VectorMath.cpp:495 > + // Now the sourceP address aligned and start to apply SSE. nit: see comment above about re-wording > Source/WebCore/platform/audio/VectorMath.cpp:497 > + float* endP = (float*)sourceP + n - tailFrames; nit: see comment above > Source/WebCore/platform/audio/VectorMath.cpp:509 > + // Caculate abs of min, then campare with max. typo: "Caculate" -> "Calculate" typo: "campare" -> "compare" Is this math correct??
Created attachment 126033 [details] Patch
(In reply to comment #2) > (From update of attachment 125790 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=125790&action=review > > > Source/WebCore/platform/audio/VectorMath.cpp:439 > > + > > nit: extra blank line - please remove here and elsewhere in similar places in this file > > > Source/WebCore/platform/audio/VectorMath.cpp:440 > > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. > > spelling nit: "seperately" -> "separately" > > please fix here and elsewhere in the file > > > Source/WebCore/platform/audio/VectorMath.cpp:448 > > + // Now the sourceP address aligned and start to apply SSE. > > grammar nit: I'd re-word this comment as: > > // Now that sourceP is aligned, use SSE. > > > Source/WebCore/platform/audio/VectorMath.cpp:450 > > + float* endP = (float *)sourceP + n - tailFrames; > > shouldn't use ordinary cast here. > > Can't we make endP "const float": > > const float* endP = sourceP + n - tailFrames; > > > Source/WebCore/platform/audio/VectorMath.cpp:462 > > + float* groupSumP = reinterpret_cast<float*>(&mSum); > > nit: I think you can use "const float*" instead of "float*" > > > Source/WebCore/platform/audio/VectorMath.cpp:487 > > + > > nit: extra blank line > > > Source/WebCore/platform/audio/VectorMath.cpp:488 > > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. > > nit: "seperately" spelling > > > Source/WebCore/platform/audio/VectorMath.cpp:495 > > + // Now the sourceP address aligned and start to apply SSE. > > nit: see comment above about re-wording > > > Source/WebCore/platform/audio/VectorMath.cpp:497 > > + float* endP = (float*)sourceP + n - tailFrames; > > nit: see comment above > > > Source/WebCore/platform/audio/VectorMath.cpp:509 > > + // Caculate abs of min, then campare with max. > > typo: "Caculate" -> "Calculate" > typo: "campare" -> "compare" > Chris, update the patch as your comments. > Is this math correct?? Do you mean whether vmaxmgv is correnct? I think the maximum abstract value in a vector is either the maximum positive value or the minimum negative value, so select one of them which has the maximum abstract value. I did the functional test and it`s OK.
Raymond, could you please scrutinize this patch, especially the vmaxmgv() part?
Comment on attachment 126033 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=126033&action=review > Source/WebCore/ChangeLog:8 > + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv. Out of curiosity is this speed up measured using the typical 128-length vectors? Or did you use longer vectors so that the initial setup is in the noise? > Source/WebCore/platform/audio/VectorMath.cpp:500 > + mMin = _mm_min_ps(mMin, source); I think it would be more in line with the original code to compute the absolute value before computing the max. Something like source = _mm_and_ps(source, mask); mMax = _mm_max_ps(mMax, source); where mask is #x7fffffff (4 copies, one for each float). I think this would be as fast as the current code. If this is done, then lines 505-507 can be deleted. > Source/WebCore/platform/audio/VectorMath.cpp:514 > + What is the reason for the temp variable? Is this an optimization so that we're not reading and writing to the same max variable?
Comment on attachment 126033 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=126033&action=review >> Source/WebCore/ChangeLog:8 >> + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv. > > Out of curiosity is this speed up measured using the typical 128-length vectors? Or did you use longer vectors so that the initial setup is in the noise? Yes, the result above is measured with 128-length vectors, and also variable lengths are measured. For example, with 1280-length vectors the speed up is up to 3.88x and 5.1x. I think the 128-length result is most valuable for us now, so just present the 128-length result. >> Source/WebCore/platform/audio/VectorMath.cpp:500 >> + mMin = _mm_min_ps(mMin, source); > > I think it would be more in line with the original code to compute the absolute value before computing the max. Something like > > source = _mm_and_ps(source, mask); > mMax = _mm_max_ps(mMax, source); > > where mask is #x7fffffff (4 copies, one for each float). I think this would be as fast as the current code. > > If this is done, then lines 505-507 can be deleted. Good suggestion, thanks. >> Source/WebCore/platform/audio/VectorMath.cpp:514 >> + > > What is the reason for the temp variable? Is this an optimization so that we're not reading and writing to the same max variable? I just used the temp variable for the readability without considering the optimization here because it`s not in the loop so it`s not quite necessary, I think. If the opt is truly needed some SSE function may be applied here.
Created attachment 128369 [details] Patch Ray, Thanks your comments, I update the patch and use mask 0x7fffffff. There is less affect of the performance.
Comment on attachment 128369 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review Looks good overall, except for the comment at line 508. > Source/WebCore/ChangeLog:8 > + Achieved the performance of 3.7x on vsvesq and 4.1x on vmaxmgv. You mentioned that the udpated version of vmaxmgv is now slower(?) than before. Can you update the numbers here? > Source/WebCore/platform/audio/VectorMath.cpp:500 > + source = _mm_and_ps(source, mMask); You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > Source/WebCore/platform/audio/VectorMath.cpp:508 > + max = std::max(groupMaxP[2], groupMaxP[3]); The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that.
Comment on attachment 128369 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review >> Source/WebCore/platform/audio/VectorMath.cpp:500 >> + source = _mm_and_ps(source, mMask); > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. >> Source/WebCore/platform/audio/VectorMath.cpp:508 >> + max = std::max(groupMaxP[2], groupMaxP[3]); > > The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that. I`ll fix it.
Created attachment 128852 [details] Patch
(In reply to comment #10) > (From update of attachment 128369 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review > > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > >> + source = _mm_and_ps(source, mMask); > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. > > >> Source/WebCore/platform/audio/VectorMath.cpp:508 > >> + max = std::max(groupMaxP[2], groupMaxP[3]); > > > > The assignment to max here at line 508 overwrites the max computed in line 485 when the source is not aligned. This needs to be rearranged a little so as not to destroy that. > > I`ll fix it. Fixed.
Comment on attachment 128852 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128852&action=review Except for the nit, looks good. > Source/WebCore/platform/audio/VectorMath.cpp:500 > + source = _mm_and_ps(source, mMask); nit: Perhaps add a comment here to say that anding the source with the mask computes the absolute value by making the sign bit 0 (positive).
(In reply to comment #10) > (From update of attachment 128369 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review > > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > >> + source = _mm_and_ps(source, mMask); > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. Ok. No need to change anything then. I'm glad the performance didn't change.
(In reply to comment #14) > (In reply to comment #10) > > (From update of attachment 128369 [details] [details]) > > View in context: https://bugs.webkit.org/attachment.cgi?id=128369&action=review > > > > >> Source/WebCore/platform/audio/VectorMath.cpp:500 > > >> + source = _mm_and_ps(source, mMask); > > > > > > You mentioned that the performance with and_ps is less than the previous version with min_ps. Is this because and_ps is slower or maybe because of some pipelining isues? I'm just curious as to why this is slower. > > > > It`s not slower with and_ps, my expression may not be clear so sorry for misleading you. I mean the performance between using and_ps and min_ps is not too much different, so I think the performance data is no needed to be updated. > > Ok. No need to change anything then. I'm glad the performance didn't change. Chris, does the patch need more review?
Created attachment 129183 [details] Patch
(In reply to comment #16) > Created an attachment (id=129183) [details] > Patch (In reply to comment #13) > (From update of attachment 128852 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=128852&action=review > > Except for the nit, looks good. > > > Source/WebCore/platform/audio/VectorMath.cpp:500 > > + source = _mm_and_ps(source, mMask); > > nit: Perhaps add a comment here to say that anding the source with the mask computes the absolute value by making the sign bit 0 (positive). The comment is add, thanks.
Comment on attachment 129183 [details] Patch LGTM
Comment on attachment 129183 [details] Patch Submit to the commit-queue as it`s "LGTM".
Comment on attachment 129183 [details] Patch Looks good.
Comment on attachment 129183 [details] Patch Rejecting attachment 129183 [details] from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 1 ERROR: /mnt/git/webkit-commit-queue/ChangeLog neither lists a valid reviewer nor contains the string "Unreviewed" or "Rubber stamp" (case insensitive). Full output: http://queues.webkit.org/results/11757580
Comment on attachment 129183 [details] Patch trying commit queue again
Comment on attachment 129183 [details] Patch Clearing flags on attachment: 129183 Committed r109476: <http://trac.webkit.org/changeset/109476>
All reviewed patches have been landed. Closing bug.