74048 – A function of vector multiply with SSE optimization is needed in VectorMath.cpp

RESOLVED FIXED 74048

A function of vector multiply with SSE optimization is needed in VectorMath.cpp

https://bugs.webkit.org/show_bug.cgi?id=74048

Summary A function of vector multiply with SSE optimization is needed in VectorMath.cpp

Xingnan Wang

Reported 2011-12-07 19:18:06 PST

Here are only vadd and vsmul in vector lib, and a vmmul for two vectors multiply is widely used and it makes sense to add it.

Attachments
Patch (5.02 KB, patch) 2011-12-07 21:08 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (5.73 KB, patch) 2011-12-11 21:38 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Patch (7.01 KB, patch) 2011-12-11 23:16 PST, Xingnan Wang	webkit.review.bot: commit-queue-	Details Formatted Diff Diff
Patch (6.38 KB, patch) 2011-12-11 23:44 PST, Xingnan Wang	benjamin: review-	Details Formatted Diff Diff
Patch (6.20 KB, patch) 2011-12-12 21:38 PST, Xingnan Wang	benjamin: review+ benjamin: commit-queue-	Details Formatted Diff Diff
Patch (6.13 KB, patch) 2011-12-13 01:03 PST, Xingnan Wang	no flags	Details Formatted Diff Diff
Show Obsolete (5) View All Add attachment proposed patch, testcase, etc.

Xingnan Wang

Comment 1 2011-12-07 21:08:15 PST

Created attachment 118323 [details] Patch It achieves about 3.4x compared with usual way.

Raymond Toy

Comment 2 2011-12-09 10:06:28 PST

Comment on attachment 118323 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118323&action=review Please include a Changelog entry too. > Source/WebCore/platform/audio/VectorMath.cpp:231 > Add comment saying vmmul performs an element-by-element multiply of two vectors, with user-specified strides for the sources and destination. > Source/WebCore/platform/audio/VectorMath.cpp:232 > +void vmmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess) Why is it named vmmul instead of vmul? What does the extra m mean? > Source/WebCore/platform/audio/VectorMath.cpp:237 > + int n = framesToProcess; Since this is common to both SSE and non-SSE, move it to the top? Also, why do we need the extra variable n? Can't we use framesToProcess? > Source/WebCore/platform/audio/VectorMath.cpp:265 > + destP += 4; As mentioned in another patch, it might be useful to add a macro to encapsulate the very similar code here and the branches below. Something like #define SSE2_MULT(s1Load, s2Load, dstore) \ pSource1 = _mm_##s1Load##_ps(source1P); \ pSource2 = _mm_##s2Load##_ps(source2P); \ dest = __mm_mul_ps(pSource1, pSource2); \ _mm_##dstore##_ps(destP, dest); \ source1P += 4; \ source2P += 4; \ destP += 4\ Then use it: SSE2_MULT(load, load, store); And SSE2_MULT(load,load, storeu); below and so on. I think it makes it easier to see the similarity and differences between the branches. Up to you do include this or not.

Xingnan Wang

Comment 3 2011-12-10 05:18:08 PST

(In reply to comment #2) > (From update of attachment 118323 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=118323&action=review > > Please include a Changelog entry too. > > > Source/WebCore/platform/audio/VectorMath.cpp:231 > > > > Add comment saying vmmul performs an element-by-element multiply of two vectors, with user-specified strides for the sources and destination. > > > Source/WebCore/platform/audio/VectorMath.cpp:232 > > +void vmmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess) > > Why is it named vmmul instead of vmul? What does the extra m mean? > > > Source/WebCore/platform/audio/VectorMath.cpp:237 > > + int n = framesToProcess; > > Since this is common to both SSE and non-SSE, move it to the top? Also, why do we need the extra variable n? Can't we use framesToProcess? > > > Source/WebCore/platform/audio/VectorMath.cpp:265 > > + destP += 4; > > As mentioned in another patch, it might be useful to add a macro to encapsulate the very similar code here and the branches below. Something like > > #define SSE2_MULT(s1Load, s2Load, dstore) \ > pSource1 = _mm_##s1Load##_ps(source1P); \ > pSource2 = _mm_##s2Load##_ps(source2P); \ > dest = __mm_mul_ps(pSource1, pSource2); \ > _mm_##dstore##_ps(destP, dest); \ > source1P += 4; \ > source2P += 4; \ > destP += 4\ > > Then use it: > SSE2_MULT(load, load, store); > > And SSE2_MULT(load,load, storeu); below and so on. > > I think it makes it easier to see the similarity and differences between the branches. Up to you do include this or not. Hi Raymond, Do you have the privilege to assign the bug to me? Because I cannot upload the patch by webkit-patch script if I am not the assignee. Thank you.

Benjamin Poulain

Comment 4 2011-12-10 11:43:50 PST

Comment on attachment 118323 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118323&action=review This change needs a ChangeLog. > Source/WebCore/platform/audio/VectorMath.cpp:318 > + // Non-SSE handling for left frames which are less than 4. > + n %= 4; > + while (n) { > + *destP = *source1P * *source2P; > + source1P++; > + source2P++; > + destP++; > + n--; > + } > + } else { // If strides are not 1, rollback to normal algorithm. > +#endif > + int n = framesToProcess; > + while (n--) { > + *destP = *source1P * *source2P; > + source1P += sourceStride1; > + source2P += sourceStride2; > + destP += destStride; > + } You can unify those two path and avoid the else {} branch. Just declare "int n = framesToProcess;" above the block.

Xingnan Wang

Comment 5 2011-12-11 21:38:18 PST

Created attachment 118731 [details] Patch

Xingnan Wang

Comment 6 2011-12-11 21:46:48 PST

Comment on attachment 118323 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118323&action=review >>> Source/WebCore/platform/audio/VectorMath.cpp:231 >>> >> >> Add comment saying vmmul performs an element-by-element multiply of two vectors, with user-specified strides for the sources and destination. > > Hi Raymond, > Do you have the privilege to assign the bug to me? Because I cannot upload the patch by webkit-patch script if I am not the assignee. Thank you. Done. >> Source/WebCore/platform/audio/VectorMath.cpp:232 >> +void vmmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess) > > Why is it named vmmul instead of vmul? What does the extra m mean? Yes, It should be vmul. >> Source/WebCore/platform/audio/VectorMath.cpp:237 >> + int n = framesToProcess; > > Since this is common to both SSE and non-SSE, move it to the top? Also, why do we need the extra variable n? Can't we use framesToProcess? I think it makes the code clear use a "n" in loop and easy to get the value of framesToProcess when n changed, also it not much different in the function, so both are OK in my opinion. >> Source/WebCore/platform/audio/VectorMath.cpp:265 >> + destP += 4; > > As mentioned in another patch, it might be useful to add a macro to encapsulate the very similar code here and the branches below. Something like > > #define SSE2_MULT(s1Load, s2Load, dstore) \ > pSource1 = _mm_##s1Load##_ps(source1P); \ > pSource2 = _mm_##s2Load##_ps(source2P); \ > dest = __mm_mul_ps(pSource1, pSource2); \ > _mm_##dstore##_ps(destP, dest); \ > source1P += 4; \ > source2P += 4; \ > destP += 4\ > > Then use it: > SSE2_MULT(load, load, store); > > And SSE2_MULT(load,load, storeu); below and so on. > > I think it makes it easier to see the similarity and differences between the branches. Up to you do include this or not. Used it, thanks. >> Source/WebCore/platform/audio/VectorMath.cpp:318 >> + } > > You can unify those two path and avoid the else {} branch. Just declare "int n = framesToProcess;" above the block. Good comments, thanks.

Chris Rogers

Comment 7 2011-12-11 22:45:59 PST

Thanks for the patch. I've only had a quick look, but it seems like you're only adding the vmul() function to the non-OS(DARWIN) section. I see that vecLib.framework already has a vmul(), but don't we need to do something similar to what we're doing for vsmul() and vadd() in the OS(DARWIN) section to deal with the 32bit vs. 64bit way the macros are defined?

Xingnan Wang

Comment 8 2011-12-11 23:16:23 PST

Created attachment 118739 [details] Patch

Xingnan Wang

Comment 9 2011-12-11 23:19:58 PST

(In reply to comment #7) > Thanks for the patch. I've only had a quick look, but it seems like you're only adding the vmul() function to the non-OS(DARWIN) section. I see that vecLib.framework already has a vmul(), but don't we need to do something similar to what we're doing for vsmul() and vadd() in the OS(DARWIN) section to deal with the 32bit vs. 64bit way the macros are defined? Yes, we should handle the vmul with the same way of vadd and vsmul for OS(DARWIN). Patch is updated as your comments, thank you.

WebKit Review Bot

Comment 10 2011-12-11 23:33:18 PST

Comment on attachment 118739 [details] Patch Attachment 118739 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10839015

Xingnan Wang

Comment 11 2011-12-11 23:44:24 PST

Created attachment 118744 [details] Patch Sorry for some noise code in the patch, update it.

Benjamin Poulain

Comment 12 2011-12-12 16:11:06 PST

Comment on attachment 118744 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118744&action=review > Source/WebCore/ChangeLog:12 > + - It`s a function for an element-by-element multiply of two float vectors and we get > + about 3.4x performance improvement with SSE2 optimization compared with the common > + multiply. > + > + - Use the function in AudioBus::copyWithSampleAccurateGainValuesFrom(). You should not start the sentences by "- ". "It`s" -> "It's" + the subject "It" is not clear in that sentence. > Source/WebCore/platform/audio/VectorMath.cpp:246 > + // About 3.4x performance improvement with SSE2 optimization for the very common case of all the strides are 1. This should not be a comment. The commit message is enough information. > Source/WebCore/platform/audio/VectorMath.cpp:248 > + if ((sourceStride1 ==1) && (sourceStride2 == 1) && (destStride == 1)) { Coding style: "sourceStride1 ==1" -> "sourceStride1 == 1" > Source/WebCore/platform/audio/VectorMath.cpp:250 > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. sourceP -> source1P > Source/WebCore/platform/audio/VectorMath.cpp:268 > +#define SSE2_MULT(l, s) \ please use "loadInstr" and "storeInstr" instead of l, and s. Explanatory variable names are better for readability. > Source/WebCore/platform/audio/VectorMath.cpp:293 > + // Non-SSE handling for left frames which is less than 4. " which is less than 4" in the sentence is unclear without reading the code above. Maybe you should get rid of this part of the comment, and have a variable for "n % 4" when you compute endP.

Xingnan Wang

Comment 13 2011-12-12 21:38:51 PST

Created attachment 118951 [details] Patch

Xingnan Wang

Comment 14 2011-12-12 21:40:03 PST

(In reply to comment #12) > (From update of attachment 118744 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=118744&action=review > > > Source/WebCore/ChangeLog:12 > > + - It`s a function for an element-by-element multiply of two float vectors and we get > > + about 3.4x performance improvement with SSE2 optimization compared with the common > > + multiply. > > + > > + - Use the function in AudioBus::copyWithSampleAccurateGainValuesFrom(). > > You should not start the sentences by "- ". > "It`s" -> "It's" + the subject "It" is not clear in that sentence. > > > Source/WebCore/platform/audio/VectorMath.cpp:246 > > + // About 3.4x performance improvement with SSE2 optimization for the very common case of all the strides are 1. > > This should not be a comment. The commit message is enough information. > > > Source/WebCore/platform/audio/VectorMath.cpp:248 > > + if ((sourceStride1 ==1) && (sourceStride2 == 1) && (destStride == 1)) { > > Coding style: "sourceStride1 ==1" -> "sourceStride1 == 1" > > > Source/WebCore/platform/audio/VectorMath.cpp:250 > > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. > > sourceP -> source1P > > > Source/WebCore/platform/audio/VectorMath.cpp:268 > > +#define SSE2_MULT(l, s) \ > > please use "loadInstr" and "storeInstr" instead of l, and s. Explanatory variable names are better for readability. > > > Source/WebCore/platform/audio/VectorMath.cpp:293 > > + // Non-SSE handling for left frames which is less than 4. > > " which is less than 4" in the sentence is unclear without reading the code above. Maybe you should get rid of this part of the comment, and have a variable for "n % 4" when you compute endP. Thanks very much and patch is updated.

Benjamin Poulain

Comment 15 2011-12-12 23:14:36 PST

Comment on attachment 118951 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=118951&action=review > Source/WebCore/platform/audio/VectorMath.cpp:278 > +#define SSE2_MULT(loadInstr, storeInstr) \ > + { \ > + pSource1 = _mm_load_ps(source1P); \ > + pSource2 = _mm_##loadInstr##_ps(source2P); \ > + dest = _mm_mul_ps(pSource1, pSource2); \ > + _mm_##storeInstr##_ps(destP, dest); \ > + source1P += 4; \ > + source2P += 4; \ > + destP += 4; \ > + } > + You could even put the while (destP < endP) in here.

Benjamin Poulain

Comment 16 2011-12-12 23:17:13 PST

Comment on attachment 118951 [details] Patch If you can just update the macro, I will land the updated version.

Xingnan Wang

Comment 17 2011-12-13 00:59:41 PST

Comment on attachment 118951 [details] Patch diff --git a/Source/WebCore/ChangeLog b/Source/WebCore/ChangeLog index f123fbf..8cb44bb 100644 --- a/Source/WebCore/ChangeLog +++ b/Source/WebCore/ChangeLog @@ -1,3 +1,22 @@ +2011-12-13 Xingnan Wang <xingnan.wang@intel.com> + + Implement a function of vector multiply with SSE2 optimization in VectorMath.cpp. + https://bugs.webkit.org/show_bug.cgi?id=74048 + + Reviewed by NOBODY (OOPS!). + + The vmul is a function for an element-by-element multiply of two float vectors and we + get about 3.4x performance improvement with SSE2 optimization compared with the common + multiply. + + Use vmul in AudioBus::copyWithSampleAccurateGainValuesFrom(). + + * platform/audio/AudioBus.cpp: + (WebCore::AudioBus::copyWithSampleAccurateGainValuesFrom): + * platform/audio/VectorMath.cpp: + (WebCore::VectorMath::vmul): + * platform/audio/VectorMath.h: + 2011-12-13 Yosifumi Inoue <yosin@chromium.org> RenderTheme should have a function for disabled text color adjustment diff --git a/Source/WebCore/platform/audio/AudioBus.cpp b/Source/WebCore/platform/audio/AudioBus.cpp index 6c076b7..b5ec246 100644 --- a/Source/WebCore/platform/audio/AudioBus.cpp +++ b/Source/WebCore/platform/audio/AudioBus.cpp @@ -382,15 +382,13 @@ void AudioBus::copyWithSampleAccurateGainValuesFrom(const AudioBus &sourceBus, f return; } - // FIXME: this can potentially use SIMD optimizations with vector libraries. // We handle both the 1 -> N and N -> N case here. const float* source = sourceBus.channel(0)->data(); for (unsigned channelIndex = 0; channelIndex < numberOfChannels(); ++channelIndex) { if (sourceBus.numberOfChannels() == numberOfChannels()) source = sourceBus.channel(channelIndex)->data(); float* destination = channel(channelIndex)->data(); - for (unsigned i = 0; i < numberOfGainValues; ++i) - destination[i] = source[i] * gainValues[i]; + vmul(source, 1, gainValues, 1, destination, 1, numberOfGainValues); } } diff --git a/Source/WebCore/platform/audio/VectorMath.cpp b/Source/WebCore/platform/audio/VectorMath.cpp index ff4406a..2e6d1e1 100644 --- a/Source/WebCore/platform/audio/VectorMath.cpp +++ b/Source/WebCore/platform/audio/VectorMath.cpp @@ -63,6 +63,15 @@ void vadd(const float* source1P, int sourceStride1, const float* source2P, int s #endif } +void vmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess) +{ +#if defined(__ppc__) || defined(__i386__) + ::vmul(source1P, sourceStride1, source2P, sourceStride2, destP, destStride, framesToProcess); +#else + vDSP_vmul(source1P, sourceStride1, source2P, sourceStride2, destP, destStride, framesToProcess); +#endif +} + #else void vsmul(const float* sourceP, int sourceStride, const float* scale, float* destP, int destStride, size_t framesToProcess) @@ -229,6 +238,66 @@ void vadd(const float* source1P, int sourceStride1, const float* source2P, int s #endif } +void vmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess) +{ + + int n = framesToProcess; + +#ifdef __SSE2__ + if ((sourceStride1 == 1) && (sourceStride2 == 1) && (destStride == 1)) { + + // If the source1P address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. + while ((reinterpret_cast<uintptr_t>(source1P) & 0x0F) && n) { + *destP = *source1P * *source2P; + source1P++; + source2P++; + destP++; + n--; + } + + // Now the source1P address aligned and start to apply SSE. + int tailFrames = n % 4; + float* endP = destP + n - tailFrames; + __m128 pSource1; + __m128 pSource2; + __m128 dest; + + bool source2Aligned = !(reinterpret_cast<uintptr_t>(source2P) & 0x0F); + bool destAligned = !(reinterpret_cast<uintptr_t>(destP) & 0x0F); + +#define SSE2_MULT(loadInstr, storeInstr) \ + while (destP < endP) \ + { \ + pSource1 = _mm_load_ps(source1P); \ + pSource2 = _mm_##loadInstr##_ps(source2P); \ + dest = _mm_mul_ps(pSource1, pSource2); \ + _mm_##storeInstr##_ps(destP, dest); \ + source1P += 4; \ + source2P += 4; \ + destP += 4; \ + } + + if (source2Aligned && destAligned) // Both aligned. + SSE2_MULT(load, store) + else if (source2Aligned && !destAligned) // Source2 is aligned but dest not. + SSE2_MULT(load, storeu) + else if (!source2Aligned && destAligned) // Dest is aligned but source2 not. + SSE2_MULT(loadu, store) + else // Neither aligned. + SSE2_MULT(loadu, storeu) + + n = tailFrames; + } +#endif + while (n) { + *destP = *source1P * *source2P; + source1P += sourceStride1; + source2P += sourceStride2; + destP += destStride; + n--; + } +} + #endif // OS(DARWIN) } // namespace VectorMath diff --git a/Source/WebCore/platform/audio/VectorMath.h b/Source/WebCore/platform/audio/VectorMath.h index 0360bdb..a3a92e5 100644 --- a/Source/WebCore/platform/audio/VectorMath.h +++ b/Source/WebCore/platform/audio/VectorMath.h @@ -34,6 +34,9 @@ namespace VectorMath { void vsmul(const float* sourceP, int sourceStride, const float* scale, float* destP, int destStride, size_t framesToProcess); void vadd(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess); +// For an element-by-element multiply of two float vectors. +void vmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess); + } // namespace VectorMath } // namespace WebCore

Xingnan Wang

Comment 18 2011-12-13 01:03:39 PST

Created attachment 118971 [details] Patch

Xingnan Wang

Comment 19 2011-12-13 01:04:35 PST

(In reply to comment #16) > (From update of attachment 118951 [details]) > If you can just update the macro, I will land the updated version. Just updated the macro, thanks.

WebKit Review Bot

Comment 20 2011-12-13 13:42:56 PST

Comment on attachment 118971 [details] Patch Clearing flags on attachment: 118971 Committed r102702: <http://trac.webkit.org/changeset/102702>

WebKit Review Bot

Comment 21 2011-12-13 13:43:05 PST

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component Web Audio

Assignee

Xingnan Wang

Reported

2011-12-07 19:18 PST

Modified

2011-12-13 13:43 PST History

CC List

5 users Show

URL

Keywords

Depends on

Blocks