WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
74048
A function of vector multiply with SSE optimization is needed in VectorMath.cpp
https://bugs.webkit.org/show_bug.cgi?id=74048
Summary
A function of vector multiply with SSE optimization is needed in VectorMath.cpp
Xingnan Wang
Reported
2011-12-07 19:18:06 PST
Here are only vadd and vsmul in vector lib, and a vmmul for two vectors multiply is widely used and it makes sense to add it.
Attachments
Patch
(5.02 KB, patch)
2011-12-07 21:08 PST
,
Xingnan Wang
no flags
Details
Formatted Diff
Diff
Patch
(5.73 KB, patch)
2011-12-11 21:38 PST
,
Xingnan Wang
no flags
Details
Formatted Diff
Diff
Patch
(7.01 KB, patch)
2011-12-11 23:16 PST
,
Xingnan Wang
webkit.review.bot
: commit-queue-
Details
Formatted Diff
Diff
Patch
(6.38 KB, patch)
2011-12-11 23:44 PST
,
Xingnan Wang
benjamin
: review-
Details
Formatted Diff
Diff
Patch
(6.20 KB, patch)
2011-12-12 21:38 PST
,
Xingnan Wang
benjamin
: review+
benjamin
: commit-queue-
Details
Formatted Diff
Diff
Patch
(6.13 KB, patch)
2011-12-13 01:03 PST
,
Xingnan Wang
no flags
Details
Formatted Diff
Diff
Show Obsolete
(5)
View All
Add attachment
proposed patch, testcase, etc.
Xingnan Wang
Comment 1
2011-12-07 21:08:15 PST
Created
attachment 118323
[details]
Patch It achieves about 3.4x compared with usual way.
Raymond Toy
Comment 2
2011-12-09 10:06:28 PST
Comment on
attachment 118323
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=118323&action=review
Please include a Changelog entry too.
> Source/WebCore/platform/audio/VectorMath.cpp:231 >
Add comment saying vmmul performs an element-by-element multiply of two vectors, with user-specified strides for the sources and destination.
> Source/WebCore/platform/audio/VectorMath.cpp:232 > +void vmmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess)
Why is it named vmmul instead of vmul? What does the extra m mean?
> Source/WebCore/platform/audio/VectorMath.cpp:237 > + int n = framesToProcess;
Since this is common to both SSE and non-SSE, move it to the top? Also, why do we need the extra variable n? Can't we use framesToProcess?
> Source/WebCore/platform/audio/VectorMath.cpp:265 > + destP += 4;
As mentioned in another patch, it might be useful to add a macro to encapsulate the very similar code here and the branches below. Something like #define SSE2_MULT(s1Load, s2Load, dstore) \ pSource1 = _mm_##s1Load##_ps(source1P); \ pSource2 = _mm_##s2Load##_ps(source2P); \ dest = __mm_mul_ps(pSource1, pSource2); \ _mm_##dstore##_ps(destP, dest); \ source1P += 4; \ source2P += 4; \ destP += 4\ Then use it: SSE2_MULT(load, load, store); And SSE2_MULT(load,load, storeu); below and so on. I think it makes it easier to see the similarity and differences between the branches. Up to you do include this or not.
Xingnan Wang
Comment 3
2011-12-10 05:18:08 PST
(In reply to
comment #2
)
> (From update of
attachment 118323
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=118323&action=review
> > Please include a Changelog entry too. > > > Source/WebCore/platform/audio/VectorMath.cpp:231 > > > > Add comment saying vmmul performs an element-by-element multiply of two vectors, with user-specified strides for the sources and destination. > > > Source/WebCore/platform/audio/VectorMath.cpp:232 > > +void vmmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess) > > Why is it named vmmul instead of vmul? What does the extra m mean? > > > Source/WebCore/platform/audio/VectorMath.cpp:237 > > + int n = framesToProcess; > > Since this is common to both SSE and non-SSE, move it to the top? Also, why do we need the extra variable n? Can't we use framesToProcess? > > > Source/WebCore/platform/audio/VectorMath.cpp:265 > > + destP += 4; > > As mentioned in another patch, it might be useful to add a macro to encapsulate the very similar code here and the branches below. Something like > > #define SSE2_MULT(s1Load, s2Load, dstore) \ > pSource1 = _mm_##s1Load##_ps(source1P); \ > pSource2 = _mm_##s2Load##_ps(source2P); \ > dest = __mm_mul_ps(pSource1, pSource2); \ > _mm_##dstore##_ps(destP, dest); \ > source1P += 4; \ > source2P += 4; \ > destP += 4\ > > Then use it: > SSE2_MULT(load, load, store); > > And SSE2_MULT(load,load, storeu); below and so on. > > I think it makes it easier to see the similarity and differences between the branches. Up to you do include this or not.
Hi Raymond, Do you have the privilege to assign the bug to me? Because I cannot upload the patch by webkit-patch script if I am not the assignee. Thank you.
Benjamin Poulain
Comment 4
2011-12-10 11:43:50 PST
Comment on
attachment 118323
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=118323&action=review
This change needs a ChangeLog.
> Source/WebCore/platform/audio/VectorMath.cpp:318 > + // Non-SSE handling for left frames which are less than 4. > + n %= 4; > + while (n) { > + *destP = *source1P * *source2P; > + source1P++; > + source2P++; > + destP++; > + n--; > + } > + } else { // If strides are not 1, rollback to normal algorithm. > +#endif > + int n = framesToProcess; > + while (n--) { > + *destP = *source1P * *source2P; > + source1P += sourceStride1; > + source2P += sourceStride2; > + destP += destStride; > + }
You can unify those two path and avoid the else {} branch. Just declare "int n = framesToProcess;" above the block.
Xingnan Wang
Comment 5
2011-12-11 21:38:18 PST
Created
attachment 118731
[details]
Patch
Xingnan Wang
Comment 6
2011-12-11 21:46:48 PST
Comment on
attachment 118323
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=118323&action=review
>>> Source/WebCore/platform/audio/VectorMath.cpp:231 >>> >> >> Add comment saying vmmul performs an element-by-element multiply of two vectors, with user-specified strides for the sources and destination. > > Hi Raymond, > Do you have the privilege to assign the bug to me? Because I cannot upload the patch by webkit-patch script if I am not the assignee. Thank you.
Done.
>> Source/WebCore/platform/audio/VectorMath.cpp:232 >> +void vmmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess) > > Why is it named vmmul instead of vmul? What does the extra m mean?
Yes, It should be vmul.
>> Source/WebCore/platform/audio/VectorMath.cpp:237 >> + int n = framesToProcess; > > Since this is common to both SSE and non-SSE, move it to the top? Also, why do we need the extra variable n? Can't we use framesToProcess?
I think it makes the code clear use a "n" in loop and easy to get the value of framesToProcess when n changed, also it not much different in the function, so both are OK in my opinion.
>> Source/WebCore/platform/audio/VectorMath.cpp:265 >> + destP += 4; > > As mentioned in another patch, it might be useful to add a macro to encapsulate the very similar code here and the branches below. Something like > > #define SSE2_MULT(s1Load, s2Load, dstore) \ > pSource1 = _mm_##s1Load##_ps(source1P); \ > pSource2 = _mm_##s2Load##_ps(source2P); \ > dest = __mm_mul_ps(pSource1, pSource2); \ > _mm_##dstore##_ps(destP, dest); \ > source1P += 4; \ > source2P += 4; \ > destP += 4\ > > Then use it: > SSE2_MULT(load, load, store); > > And SSE2_MULT(load,load, storeu); below and so on. > > I think it makes it easier to see the similarity and differences between the branches. Up to you do include this or not.
Used it, thanks.
>> Source/WebCore/platform/audio/VectorMath.cpp:318 >> + } > > You can unify those two path and avoid the else {} branch. Just declare "int n = framesToProcess;" above the block.
Good comments, thanks.
Chris Rogers
Comment 7
2011-12-11 22:45:59 PST
Thanks for the patch. I've only had a quick look, but it seems like you're only adding the vmul() function to the non-OS(DARWIN) section. I see that vecLib.framework already has a vmul(), but don't we need to do something similar to what we're doing for vsmul() and vadd() in the OS(DARWIN) section to deal with the 32bit vs. 64bit way the macros are defined?
Xingnan Wang
Comment 8
2011-12-11 23:16:23 PST
Created
attachment 118739
[details]
Patch
Xingnan Wang
Comment 9
2011-12-11 23:19:58 PST
(In reply to
comment #7
)
> Thanks for the patch. I've only had a quick look, but it seems like you're only adding the vmul() function to the non-OS(DARWIN) section. I see that vecLib.framework already has a vmul(), but don't we need to do something similar to what we're doing for vsmul() and vadd() in the OS(DARWIN) section to deal with the 32bit vs. 64bit way the macros are defined?
Yes, we should handle the vmul with the same way of vadd and vsmul for OS(DARWIN). Patch is updated as your comments, thank you.
WebKit Review Bot
Comment 10
2011-12-11 23:33:18 PST
Comment on
attachment 118739
[details]
Patch
Attachment 118739
[details]
did not pass chromium-ews (chromium-xvfb): Output:
http://queues.webkit.org/results/10839015
Xingnan Wang
Comment 11
2011-12-11 23:44:24 PST
Created
attachment 118744
[details]
Patch Sorry for some noise code in the patch, update it.
Benjamin Poulain
Comment 12
2011-12-12 16:11:06 PST
Comment on
attachment 118744
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=118744&action=review
> Source/WebCore/ChangeLog:12 > + - It`s a function for an element-by-element multiply of two float vectors and we get > + about 3.4x performance improvement with SSE2 optimization compared with the common > + multiply. > + > + - Use the function in AudioBus::copyWithSampleAccurateGainValuesFrom().
You should not start the sentences by "- ". "It`s" -> "It's" + the subject "It" is not clear in that sentence.
> Source/WebCore/platform/audio/VectorMath.cpp:246 > + // About 3.4x performance improvement with SSE2 optimization for the very common case of all the strides are 1.
This should not be a comment. The commit message is enough information.
> Source/WebCore/platform/audio/VectorMath.cpp:248 > + if ((sourceStride1 ==1) && (sourceStride2 == 1) && (destStride == 1)) {
Coding style: "sourceStride1 ==1" -> "sourceStride1 == 1"
> Source/WebCore/platform/audio/VectorMath.cpp:250 > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately.
sourceP -> source1P
> Source/WebCore/platform/audio/VectorMath.cpp:268 > +#define SSE2_MULT(l, s) \
please use "loadInstr" and "storeInstr" instead of l, and s. Explanatory variable names are better for readability.
> Source/WebCore/platform/audio/VectorMath.cpp:293 > + // Non-SSE handling for left frames which is less than 4.
" which is less than 4" in the sentence is unclear without reading the code above. Maybe you should get rid of this part of the comment, and have a variable for "n % 4" when you compute endP.
Xingnan Wang
Comment 13
2011-12-12 21:38:51 PST
Created
attachment 118951
[details]
Patch
Xingnan Wang
Comment 14
2011-12-12 21:40:03 PST
(In reply to
comment #12
)
> (From update of
attachment 118744
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=118744&action=review
> > > Source/WebCore/ChangeLog:12 > > + - It`s a function for an element-by-element multiply of two float vectors and we get > > + about 3.4x performance improvement with SSE2 optimization compared with the common > > + multiply. > > + > > + - Use the function in AudioBus::copyWithSampleAccurateGainValuesFrom(). > > You should not start the sentences by "- ". > "It`s" -> "It's" + the subject "It" is not clear in that sentence. > > > Source/WebCore/platform/audio/VectorMath.cpp:246 > > + // About 3.4x performance improvement with SSE2 optimization for the very common case of all the strides are 1. > > This should not be a comment. The commit message is enough information. > > > Source/WebCore/platform/audio/VectorMath.cpp:248 > > + if ((sourceStride1 ==1) && (sourceStride2 == 1) && (destStride == 1)) { > > Coding style: "sourceStride1 ==1" -> "sourceStride1 == 1" > > > Source/WebCore/platform/audio/VectorMath.cpp:250 > > + // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. > > sourceP -> source1P > > > Source/WebCore/platform/audio/VectorMath.cpp:268 > > +#define SSE2_MULT(l, s) \ > > please use "loadInstr" and "storeInstr" instead of l, and s. Explanatory variable names are better for readability. > > > Source/WebCore/platform/audio/VectorMath.cpp:293 > > + // Non-SSE handling for left frames which is less than 4. > > " which is less than 4" in the sentence is unclear without reading the code above. Maybe you should get rid of this part of the comment, and have a variable for "n % 4" when you compute endP.
Thanks very much and patch is updated.
Benjamin Poulain
Comment 15
2011-12-12 23:14:36 PST
Comment on
attachment 118951
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=118951&action=review
> Source/WebCore/platform/audio/VectorMath.cpp:278 > +#define SSE2_MULT(loadInstr, storeInstr) \ > + { \ > + pSource1 = _mm_load_ps(source1P); \ > + pSource2 = _mm_##loadInstr##_ps(source2P); \ > + dest = _mm_mul_ps(pSource1, pSource2); \ > + _mm_##storeInstr##_ps(destP, dest); \ > + source1P += 4; \ > + source2P += 4; \ > + destP += 4; \ > + } > +
You could even put the while (destP < endP) in here.
Benjamin Poulain
Comment 16
2011-12-12 23:17:13 PST
Comment on
attachment 118951
[details]
Patch If you can just update the macro, I will land the updated version.
Xingnan Wang
Comment 17
2011-12-13 00:59:41 PST
Comment on
attachment 118951
[details]
Patch diff --git a/Source/WebCore/ChangeLog b/Source/WebCore/ChangeLog index f123fbf..8cb44bb 100644 --- a/Source/WebCore/ChangeLog +++ b/Source/WebCore/ChangeLog @@ -1,3 +1,22 @@ +2011-12-13 Xingnan Wang <
xingnan.wang@intel.com
> + + Implement a function of vector multiply with SSE2 optimization in VectorMath.cpp. +
https://bugs.webkit.org/show_bug.cgi?id=74048
+ + Reviewed by NOBODY (OOPS!). + + The vmul is a function for an element-by-element multiply of two float vectors and we + get about 3.4x performance improvement with SSE2 optimization compared with the common + multiply. + + Use vmul in AudioBus::copyWithSampleAccurateGainValuesFrom(). + + * platform/audio/AudioBus.cpp: + (WebCore::AudioBus::copyWithSampleAccurateGainValuesFrom): + * platform/audio/VectorMath.cpp: + (WebCore::VectorMath::vmul): + * platform/audio/VectorMath.h: + 2011-12-13 Yosifumi Inoue <
yosin@chromium.org
> RenderTheme should have a function for disabled text color adjustment diff --git a/Source/WebCore/platform/audio/AudioBus.cpp b/Source/WebCore/platform/audio/AudioBus.cpp index 6c076b7..b5ec246 100644 --- a/Source/WebCore/platform/audio/AudioBus.cpp +++ b/Source/WebCore/platform/audio/AudioBus.cpp @@ -382,15 +382,13 @@ void AudioBus::copyWithSampleAccurateGainValuesFrom(const AudioBus &sourceBus, f return; } - // FIXME: this can potentially use SIMD optimizations with vector libraries. // We handle both the 1 -> N and N -> N case here. const float* source = sourceBus.channel(0)->data(); for (unsigned channelIndex = 0; channelIndex < numberOfChannels(); ++channelIndex) { if (sourceBus.numberOfChannels() == numberOfChannels()) source = sourceBus.channel(channelIndex)->data(); float* destination = channel(channelIndex)->data(); - for (unsigned i = 0; i < numberOfGainValues; ++i) - destination[i] = source[i] * gainValues[i]; + vmul(source, 1, gainValues, 1, destination, 1, numberOfGainValues); } } diff --git a/Source/WebCore/platform/audio/VectorMath.cpp b/Source/WebCore/platform/audio/VectorMath.cpp index ff4406a..2e6d1e1 100644 --- a/Source/WebCore/platform/audio/VectorMath.cpp +++ b/Source/WebCore/platform/audio/VectorMath.cpp @@ -63,6 +63,15 @@ void vadd(const float* source1P, int sourceStride1, const float* source2P, int s #endif } +void vmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess) +{ +#if defined(__ppc__) || defined(__i386__) + ::vmul(source1P, sourceStride1, source2P, sourceStride2, destP, destStride, framesToProcess); +#else + vDSP_vmul(source1P, sourceStride1, source2P, sourceStride2, destP, destStride, framesToProcess); +#endif +} + #else void vsmul(const float* sourceP, int sourceStride, const float* scale, float* destP, int destStride, size_t framesToProcess) @@ -229,6 +238,66 @@ void vadd(const float* source1P, int sourceStride1, const float* source2P, int s #endif } +void vmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess) +{ + + int n = framesToProcess; + +#ifdef __SSE2__ + if ((sourceStride1 == 1) && (sourceStride2 == 1) && (destStride == 1)) { + + // If the source1P address is not 16-byte aligned, the first several frames (at most three) should be processed seperately. + while ((reinterpret_cast<uintptr_t>(source1P) & 0x0F) && n) { + *destP = *source1P * *source2P; + source1P++; + source2P++; + destP++; + n--; + } + + // Now the source1P address aligned and start to apply SSE. + int tailFrames = n % 4; + float* endP = destP + n - tailFrames; + __m128 pSource1; + __m128 pSource2; + __m128 dest; + + bool source2Aligned = !(reinterpret_cast<uintptr_t>(source2P) & 0x0F); + bool destAligned = !(reinterpret_cast<uintptr_t>(destP) & 0x0F); + +#define SSE2_MULT(loadInstr, storeInstr) \ + while (destP < endP) \ + { \ + pSource1 = _mm_load_ps(source1P); \ + pSource2 = _mm_##loadInstr##_ps(source2P); \ + dest = _mm_mul_ps(pSource1, pSource2); \ + _mm_##storeInstr##_ps(destP, dest); \ + source1P += 4; \ + source2P += 4; \ + destP += 4; \ + } + + if (source2Aligned && destAligned) // Both aligned. + SSE2_MULT(load, store) + else if (source2Aligned && !destAligned) // Source2 is aligned but dest not. + SSE2_MULT(load, storeu) + else if (!source2Aligned && destAligned) // Dest is aligned but source2 not. + SSE2_MULT(loadu, store) + else // Neither aligned. + SSE2_MULT(loadu, storeu) + + n = tailFrames; + } +#endif + while (n) { + *destP = *source1P * *source2P; + source1P += sourceStride1; + source2P += sourceStride2; + destP += destStride; + n--; + } +} + #endif // OS(DARWIN) } // namespace VectorMath diff --git a/Source/WebCore/platform/audio/VectorMath.h b/Source/WebCore/platform/audio/VectorMath.h index 0360bdb..a3a92e5 100644 --- a/Source/WebCore/platform/audio/VectorMath.h +++ b/Source/WebCore/platform/audio/VectorMath.h @@ -34,6 +34,9 @@ namespace VectorMath { void vsmul(const float* sourceP, int sourceStride, const float* scale, float* destP, int destStride, size_t framesToProcess); void vadd(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess); +// For an element-by-element multiply of two float vectors. +void vmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess); + } // namespace VectorMath } // namespace WebCore
Xingnan Wang
Comment 18
2011-12-13 01:03:39 PST
Created
attachment 118971
[details]
Patch
Xingnan Wang
Comment 19
2011-12-13 01:04:35 PST
(In reply to
comment #16
)
> (From update of
attachment 118951
[details]
) > If you can just update the macro, I will land the updated version.
Just updated the macro, thanks.
WebKit Review Bot
Comment 20
2011-12-13 13:42:56 PST
Comment on
attachment 118971
[details]
Patch Clearing flags on attachment: 118971 Committed
r102702
: <
http://trac.webkit.org/changeset/102702
>
WebKit Review Bot
Comment 21
2011-12-13 13:43:05 PST
All reviewed patches have been landed. Closing bug.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug