Bug 74048

Summary: A function of vector multiply with SSE optimization is needed in VectorMath.cpp
Product: WebKit Reporter: Xingnan Wang <xingnan.wang>
Component: Web AudioAssignee: Xingnan Wang <xingnan.wang>
Status: RESOLVED FIXED    
Severity: Normal CC: benjamin, crogers, dglazkov, peter, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch
none
Patch
webkit.review.bot: commit-queue-
Patch
benjamin: review-
Patch
benjamin: review+, benjamin: commit-queue-
Patch none

Description Xingnan Wang 2011-12-07 19:18:06 PST
Here are only vadd and vsmul in vector lib, and a vmmul for two vectors multiply is widely used and it makes sense to add it.
Comment 1 Xingnan Wang 2011-12-07 21:08:15 PST
Created attachment 118323 [details]
Patch

It achieves about 3.4x compared with usual way.
Comment 2 Raymond Toy 2011-12-09 10:06:28 PST
Comment on attachment 118323 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118323&action=review

Please include a Changelog entry too.

> Source/WebCore/platform/audio/VectorMath.cpp:231
>  

Add comment saying vmmul performs an element-by-element multiply of two vectors, with user-specified strides for the sources and destination.

> Source/WebCore/platform/audio/VectorMath.cpp:232
> +void vmmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess)

Why is it named vmmul instead of vmul?  What does the extra m mean?

> Source/WebCore/platform/audio/VectorMath.cpp:237
> +        int n = framesToProcess;

Since this is common to both SSE and non-SSE, move it to the top?  Also, why do we need the extra variable n?  Can't we use framesToProcess?

> Source/WebCore/platform/audio/VectorMath.cpp:265
> +                destP += 4;

As mentioned in another patch, it might be useful to add a macro to encapsulate the very similar code here and the branches below.  Something like

#define SSE2_MULT(s1Load, s2Load, dstore) \
  pSource1 = _mm_##s1Load##_ps(source1P); \
  pSource2 = _mm_##s2Load##_ps(source2P); \
  dest = __mm_mul_ps(pSource1, pSource2); \
  _mm_##dstore##_ps(destP, dest); \
  source1P += 4; \
  source2P += 4; \
  destP += 4\

Then use it:
  SSE2_MULT(load, load, store);

And SSE2_MULT(load,load, storeu); below and so on.

I think it makes it easier to see the similarity and differences between the branches.  Up to you do include this or not.
Comment 3 Xingnan Wang 2011-12-10 05:18:08 PST
(In reply to comment #2)
> (From update of attachment 118323 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=118323&action=review
> 
> Please include a Changelog entry too.
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:231
> >  
> 
> Add comment saying vmmul performs an element-by-element multiply of two vectors, with user-specified strides for the sources and destination.
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:232
> > +void vmmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess)
> 
> Why is it named vmmul instead of vmul?  What does the extra m mean?
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:237
> > +        int n = framesToProcess;
> 
> Since this is common to both SSE and non-SSE, move it to the top?  Also, why do we need the extra variable n?  Can't we use framesToProcess?
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:265
> > +                destP += 4;
> 
> As mentioned in another patch, it might be useful to add a macro to encapsulate the very similar code here and the branches below.  Something like
> 
> #define SSE2_MULT(s1Load, s2Load, dstore) \
>   pSource1 = _mm_##s1Load##_ps(source1P); \
>   pSource2 = _mm_##s2Load##_ps(source2P); \
>   dest = __mm_mul_ps(pSource1, pSource2); \
>   _mm_##dstore##_ps(destP, dest); \
>   source1P += 4; \
>   source2P += 4; \
>   destP += 4\
> 
> Then use it:
>   SSE2_MULT(load, load, store);
> 
> And SSE2_MULT(load,load, storeu); below and so on.
> 
> I think it makes it easier to see the similarity and differences between the branches.  Up to you do include this or not.
Hi Raymond,
    Do you have the privilege to assign the bug to me? Because I cannot upload the patch by webkit-patch script if I am not the assignee. Thank you.
Comment 4 Benjamin Poulain 2011-12-10 11:43:50 PST
Comment on attachment 118323 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118323&action=review

This change needs a ChangeLog.

> Source/WebCore/platform/audio/VectorMath.cpp:318
> +        // Non-SSE handling for left frames which are less than 4.
> +        n %= 4;
> +        while (n) {
> +            *destP = *source1P * *source2P;
> +            source1P++;
> +            source2P++;
> +            destP++;
> +            n--;
> +        }
> +    } else { // If strides are not 1, rollback to normal algorithm.
> +#endif
> +    int n = framesToProcess;
> +    while (n--) {
> +        *destP = *source1P * *source2P;
> +        source1P += sourceStride1;
> +        source2P += sourceStride2;
> +        destP += destStride;
> +    }

You can unify those two path and avoid the else {} branch. Just declare "int n = framesToProcess;" above the block.
Comment 5 Xingnan Wang 2011-12-11 21:38:18 PST
Created attachment 118731 [details]
Patch
Comment 6 Xingnan Wang 2011-12-11 21:46:48 PST
Comment on attachment 118323 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118323&action=review

>>> Source/WebCore/platform/audio/VectorMath.cpp:231
>>>  
>> 
>> Add comment saying vmmul performs an element-by-element multiply of two vectors, with user-specified strides for the sources and destination.
> 
> Hi Raymond,
>     Do you have the privilege to assign the bug to me? Because I cannot upload the patch by webkit-patch script if I am not the assignee. Thank you.

Done.

>> Source/WebCore/platform/audio/VectorMath.cpp:232
>> +void vmmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess)
> 
> Why is it named vmmul instead of vmul?  What does the extra m mean?

Yes, It should be vmul.

>> Source/WebCore/platform/audio/VectorMath.cpp:237
>> +        int n = framesToProcess;
> 
> Since this is common to both SSE and non-SSE, move it to the top?  Also, why do we need the extra variable n?  Can't we use framesToProcess?

I think it makes the code clear use a "n" in loop and easy to get the value of framesToProcess when n changed, also it not much different in the function, so both are OK in my opinion.

>> Source/WebCore/platform/audio/VectorMath.cpp:265
>> +                destP += 4;
> 
> As mentioned in another patch, it might be useful to add a macro to encapsulate the very similar code here and the branches below.  Something like
> 
> #define SSE2_MULT(s1Load, s2Load, dstore) \
>   pSource1 = _mm_##s1Load##_ps(source1P); \
>   pSource2 = _mm_##s2Load##_ps(source2P); \
>   dest = __mm_mul_ps(pSource1, pSource2); \
>   _mm_##dstore##_ps(destP, dest); \
>   source1P += 4; \
>   source2P += 4; \
>   destP += 4\
> 
> Then use it:
>   SSE2_MULT(load, load, store);
> 
> And SSE2_MULT(load,load, storeu); below and so on.
> 
> I think it makes it easier to see the similarity and differences between the branches.  Up to you do include this or not.

Used it, thanks.

>> Source/WebCore/platform/audio/VectorMath.cpp:318
>> +    }
> 
> You can unify those two path and avoid the else {} branch. Just declare "int n = framesToProcess;" above the block.

Good comments, thanks.
Comment 7 Chris Rogers 2011-12-11 22:45:59 PST
Thanks for the patch.  I've only had a quick look, but it seems like you're only adding the vmul() function to the non-OS(DARWIN) section.  I see that vecLib.framework already has a vmul(), but don't we need to do something similar to what we're doing for vsmul() and vadd() in the OS(DARWIN) section to deal with the 32bit vs. 64bit way the macros are defined?
Comment 8 Xingnan Wang 2011-12-11 23:16:23 PST
Created attachment 118739 [details]
Patch
Comment 9 Xingnan Wang 2011-12-11 23:19:58 PST
(In reply to comment #7)
> Thanks for the patch.  I've only had a quick look, but it seems like you're only adding the vmul() function to the non-OS(DARWIN) section.  I see that vecLib.framework already has a vmul(), but don't we need to do something similar to what we're doing for vsmul() and vadd() in the OS(DARWIN) section to deal with the 32bit vs. 64bit way the macros are defined?

Yes, we should handle the vmul with the same way of vadd and vsmul for OS(DARWIN). Patch is updated as your comments, thank you.
Comment 10 WebKit Review Bot 2011-12-11 23:33:18 PST
Comment on attachment 118739 [details]
Patch

Attachment 118739 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/10839015
Comment 11 Xingnan Wang 2011-12-11 23:44:24 PST
Created attachment 118744 [details]
Patch

Sorry for some noise code in the patch, update it.
Comment 12 Benjamin Poulain 2011-12-12 16:11:06 PST
Comment on attachment 118744 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118744&action=review

> Source/WebCore/ChangeLog:12
> +        - It`s a function for an element-by-element multiply of two float vectors and we get 
> +        about 3.4x performance improvement with SSE2 optimization compared with the common 
> +        multiply.
> +
> +        - Use the function in AudioBus::copyWithSampleAccurateGainValuesFrom().

You should not start the sentences by "- ". 
"It`s" -> "It's" + the subject "It" is not clear in that sentence.

> Source/WebCore/platform/audio/VectorMath.cpp:246
> +    // About 3.4x performance improvement with SSE2 optimization for the very common case of all the strides are 1.

This should not be a comment. The commit message is enough information.

> Source/WebCore/platform/audio/VectorMath.cpp:248
> +    if ((sourceStride1 ==1) && (sourceStride2 == 1) && (destStride == 1)) {

Coding style: "sourceStride1 ==1" -> "sourceStride1 == 1"

> Source/WebCore/platform/audio/VectorMath.cpp:250
> +        // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately.

sourceP -> source1P

> Source/WebCore/platform/audio/VectorMath.cpp:268
> +#define SSE2_MULT(l, s)                                \

please use "loadInstr" and "storeInstr" instead of l, and s. Explanatory variable names are better for readability.

> Source/WebCore/platform/audio/VectorMath.cpp:293
> +        // Non-SSE handling for left frames which is less than 4.

" which is less than 4" in the sentence is unclear without reading the code above. Maybe you should get rid of this part of the comment, and have a variable for "n % 4" when you compute endP.
Comment 13 Xingnan Wang 2011-12-12 21:38:51 PST
Created attachment 118951 [details]
Patch
Comment 14 Xingnan Wang 2011-12-12 21:40:03 PST
(In reply to comment #12)
> (From update of attachment 118744 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=118744&action=review
> 
> > Source/WebCore/ChangeLog:12
> > +        - It`s a function for an element-by-element multiply of two float vectors and we get 
> > +        about 3.4x performance improvement with SSE2 optimization compared with the common 
> > +        multiply.
> > +
> > +        - Use the function in AudioBus::copyWithSampleAccurateGainValuesFrom().
> 
> You should not start the sentences by "- ". 
> "It`s" -> "It's" + the subject "It" is not clear in that sentence.
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:246
> > +    // About 3.4x performance improvement with SSE2 optimization for the very common case of all the strides are 1.
> 
> This should not be a comment. The commit message is enough information.
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:248
> > +    if ((sourceStride1 ==1) && (sourceStride2 == 1) && (destStride == 1)) {
> 
> Coding style: "sourceStride1 ==1" -> "sourceStride1 == 1"
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:250
> > +        // If the sourceP address is not 16-byte aligned, the first several frames (at most three) should be processed seperately.
> 
> sourceP -> source1P
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:268
> > +#define SSE2_MULT(l, s)                                \
> 
> please use "loadInstr" and "storeInstr" instead of l, and s. Explanatory variable names are better for readability.
> 
> > Source/WebCore/platform/audio/VectorMath.cpp:293
> > +        // Non-SSE handling for left frames which is less than 4.
> 
> " which is less than 4" in the sentence is unclear without reading the code above. Maybe you should get rid of this part of the comment, and have a variable for "n % 4" when you compute endP.

Thanks very much and patch is updated.
Comment 15 Benjamin Poulain 2011-12-12 23:14:36 PST
Comment on attachment 118951 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118951&action=review

> Source/WebCore/platform/audio/VectorMath.cpp:278
> +#define SSE2_MULT(loadInstr, storeInstr)                   \
> +            {                                              \
> +                pSource1 = _mm_load_ps(source1P);          \
> +                pSource2 = _mm_##loadInstr##_ps(source2P); \
> +                dest = _mm_mul_ps(pSource1, pSource2);     \
> +                _mm_##storeInstr##_ps(destP, dest);        \
> +                source1P += 4;                             \
> +                source2P += 4;                             \
> +                destP += 4;                                \
> +            }
> +

You could even put the while  (destP < endP) in here.
Comment 16 Benjamin Poulain 2011-12-12 23:17:13 PST
Comment on attachment 118951 [details]
Patch

If you can just update the macro, I will land the updated version.
Comment 17 Xingnan Wang 2011-12-13 00:59:41 PST
Comment on attachment 118951 [details]
Patch

diff --git a/Source/WebCore/ChangeLog b/Source/WebCore/ChangeLog
index f123fbf..8cb44bb 100644
--- a/Source/WebCore/ChangeLog
+++ b/Source/WebCore/ChangeLog
@@ -1,3 +1,22 @@
+2011-12-13  Xingnan Wang  <xingnan.wang@intel.com>
+
+        Implement a function of vector multiply with SSE2 optimization in VectorMath.cpp.
+        https://bugs.webkit.org/show_bug.cgi?id=74048
+
+        Reviewed by NOBODY (OOPS!).
+
+        The vmul is a function for an element-by-element multiply of two float vectors and we 
+        get about 3.4x performance improvement with SSE2 optimization compared with the common 
+        multiply.
+
+        Use vmul in AudioBus::copyWithSampleAccurateGainValuesFrom().
+
+        * platform/audio/AudioBus.cpp:
+        (WebCore::AudioBus::copyWithSampleAccurateGainValuesFrom):
+        * platform/audio/VectorMath.cpp:
+        (WebCore::VectorMath::vmul):
+        * platform/audio/VectorMath.h:
+
 2011-12-13  Yosifumi Inoue  <yosin@chromium.org>
 
         RenderTheme should have a function for disabled text color adjustment
diff --git a/Source/WebCore/platform/audio/AudioBus.cpp b/Source/WebCore/platform/audio/AudioBus.cpp
index 6c076b7..b5ec246 100644
--- a/Source/WebCore/platform/audio/AudioBus.cpp
+++ b/Source/WebCore/platform/audio/AudioBus.cpp
@@ -382,15 +382,13 @@ void AudioBus::copyWithSampleAccurateGainValuesFrom(const AudioBus &sourceBus, f
         return;
     }
 
-    // FIXME: this can potentially use SIMD optimizations with vector libraries.
     // We handle both the 1 -> N and N -> N case here.
     const float* source = sourceBus.channel(0)->data();
     for (unsigned channelIndex = 0; channelIndex < numberOfChannels(); ++channelIndex) {
         if (sourceBus.numberOfChannels() == numberOfChannels())
             source = sourceBus.channel(channelIndex)->data();
         float* destination = channel(channelIndex)->data();
-        for (unsigned i = 0; i < numberOfGainValues; ++i)
-            destination[i] = source[i] * gainValues[i];
+        vmul(source, 1, gainValues, 1, destination, 1, numberOfGainValues);
     }
 }
 
diff --git a/Source/WebCore/platform/audio/VectorMath.cpp b/Source/WebCore/platform/audio/VectorMath.cpp
index ff4406a..2e6d1e1 100644
--- a/Source/WebCore/platform/audio/VectorMath.cpp
+++ b/Source/WebCore/platform/audio/VectorMath.cpp
@@ -63,6 +63,15 @@ void vadd(const float* source1P, int sourceStride1, const float* source2P, int s
 #endif
 }
 
+void vmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess)
+{
+#if defined(__ppc__) || defined(__i386__)
+    ::vmul(source1P, sourceStride1, source2P, sourceStride2, destP, destStride, framesToProcess);
+#else
+    vDSP_vmul(source1P, sourceStride1, source2P, sourceStride2, destP, destStride, framesToProcess);
+#endif
+}
+
 #else
 
 void vsmul(const float* sourceP, int sourceStride, const float* scale, float* destP, int destStride, size_t framesToProcess)
@@ -229,6 +238,66 @@ void vadd(const float* source1P, int sourceStride1, const float* source2P, int s
 #endif
 }
 
+void vmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess)
+{
+
+    int n = framesToProcess;
+
+#ifdef __SSE2__
+    if ((sourceStride1 == 1) && (sourceStride2 == 1) && (destStride == 1)) {
+
+        // If the source1P address is not 16-byte aligned, the first several frames (at most three) should be processed seperately.
+        while ((reinterpret_cast<uintptr_t>(source1P) & 0x0F) && n) {
+            *destP = *source1P * *source2P;
+            source1P++;
+            source2P++;
+            destP++;
+            n--;
+        }
+
+        // Now the source1P address aligned and start to apply SSE.
+        int tailFrames = n % 4;
+        float* endP = destP + n - tailFrames;
+        __m128 pSource1;
+        __m128 pSource2;
+        __m128 dest;
+
+        bool source2Aligned = !(reinterpret_cast<uintptr_t>(source2P) & 0x0F);
+        bool destAligned = !(reinterpret_cast<uintptr_t>(destP) & 0x0F);
+
+#define SSE2_MULT(loadInstr, storeInstr)                   \
+            while (destP < endP)                           \
+            {                                              \
+                pSource1 = _mm_load_ps(source1P);          \
+                pSource2 = _mm_##loadInstr##_ps(source2P); \
+                dest = _mm_mul_ps(pSource1, pSource2);     \
+                _mm_##storeInstr##_ps(destP, dest);        \
+                source1P += 4;                             \
+                source2P += 4;                             \
+                destP += 4;                                \
+            }
+
+        if (source2Aligned && destAligned) // Both aligned.
+            SSE2_MULT(load, store)
+        else if (source2Aligned && !destAligned) // Source2 is aligned but dest not.
+            SSE2_MULT(load, storeu)
+        else if (!source2Aligned && destAligned) // Dest is aligned but source2 not.
+            SSE2_MULT(loadu, store)
+        else // Neither aligned.
+            SSE2_MULT(loadu, storeu)
+
+        n = tailFrames;
+    }
+#endif
+    while (n) {
+        *destP = *source1P * *source2P;
+        source1P += sourceStride1;
+        source2P += sourceStride2;
+        destP += destStride;
+        n--;
+    }
+}
+
 #endif // OS(DARWIN)
 
 } // namespace VectorMath
diff --git a/Source/WebCore/platform/audio/VectorMath.h b/Source/WebCore/platform/audio/VectorMath.h
index 0360bdb..a3a92e5 100644
--- a/Source/WebCore/platform/audio/VectorMath.h
+++ b/Source/WebCore/platform/audio/VectorMath.h
@@ -34,6 +34,9 @@ namespace VectorMath {
 void vsmul(const float* sourceP, int sourceStride, const float* scale, float* destP, int destStride, size_t framesToProcess);
 void vadd(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess);
 
+// For an element-by-element multiply of two float vectors.
+void vmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess);
+
 } // namespace VectorMath
 
 } // namespace WebCore
Comment 18 Xingnan Wang 2011-12-13 01:03:39 PST
Created attachment 118971 [details]
Patch
Comment 19 Xingnan Wang 2011-12-13 01:04:35 PST
(In reply to comment #16)
> (From update of attachment 118951 [details])
> If you can just update the macro, I will land the updated version.

Just updated the macro, thanks.
Comment 20 WebKit Review Bot 2011-12-13 13:42:56 PST
Comment on attachment 118971 [details]
Patch

Clearing flags on attachment: 118971

Committed r102702: <http://trac.webkit.org/changeset/102702>
Comment 21 WebKit Review Bot 2011-12-13 13:43:05 PST
All reviewed patches have been landed.  Closing bug.