Bug 74842

Summary: SSE optimization for FFTFrameFFMPEG.cpp::multiply()
Product: WebKit Reporter: Xingnan Wang <xingnan.wang>
Component: Web AudioAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: crogers, dglazkov, hclam, kbr, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch
none
Patch none

Description Xingnan Wang 2011-12-19 00:10:33 PST
We will get a better performance with SSE2 optimization due to it is a vector related multiply in FFTFrameFFMPEG.cpp::multiply().
Comment 1 Xingnan Wang 2011-12-19 00:18:42 PST
Created attachment 119829 [details]
Patch
Comment 2 Chris Rogers 2011-12-19 10:31:48 PST
Thanks for doing this.  You got to it first before I could recommend it to you!

One thing I would suggest is to take this code out of FFTFrameFFMPEG and move it to a place where other FFTFrame implementations could potentially use it (or other code needing complex multiply).  Maybe it makes sense to make a function for this in VectorMath?  In FFTFrameMac.cpp there's a call to a vecLib function called vDSP_zvmul(), so zvmul()??
Comment 3 Xingnan Wang 2011-12-20 01:44:55 PST
(In reply to comment #2)
> Thanks for doing this.  You got to it first before I could recommend it to you!
> 
> One thing I would suggest is to take this code out of FFTFrameFFMPEG and move it to a place where other FFTFrame implementations could potentially use it (or other code needing complex multiply).  Maybe it makes sense to make a function for this in VectorMath?  In FFTFrameMac.cpp there's a call to a vecLib function called vDSP_zvmul(), so zvmul()??

Hi Roger,

I agree that we should add a zvmul into VectorMath, and in my mind it`s also very meaning to add more basic vector calculation functions such as vDSP_deq22D in Biquad.cpp to VectorMath and thus we can get more benefits from SSE optimization in audio processing. If so I am very glad to contribute more such optimization codes to enlarge the VectorMath.

For the zvmul, I would like to add two with different types:
//1. To be consistent with vDSP_zvmul
void zvmul(const float* source1P, int sourceStride1, const float* source2P, int sourceStride2, float* destP, int destStride, size_t framesToProcess, int conjugate);
//2. Used in FFTFrame, avoiding splitting and merging of complex_data
void zvmul(const float* real1P, const float* imag1P, const float* real2P, const* imag2P, float* realDestP, float* imagDestP, size_t framesToProcess); 

They have different implementations of SSE optimization.
If the two function`s definitions are OK, patch coming soon.
Thanks for any comments.
Comment 4 Chris Rogers 2011-12-20 10:28:41 PST
great to hear about the Biquad optimization!

I'll have to look more closely, but I'm pretty sure we only need the zvmul() version which operates on non-interleaved (planar) complex data like we use in FFTFrameFFMPEG.  I *thought* that was the same as the one used in FFTFrameMac.
Comment 5 Xingnan Wang 2011-12-20 18:14:37 PST
(In reply to comment #4)
> great to hear about the Biquad optimization!
> 
Now I am doing some work on the Biquad optimization, I will submit another patch when finishing it.
> I'll have to look more closely, but I'm pretty sure we only need the zvmul() version which operates on non-interleaved (planar) complex data like we use in FFTFrameFFMPEG.  I *thought* that was the same as the one used in FFTFrameMac.

Yes you are right , sorry for misunderstanding the vDSP_zvmul, which surely uses non-interleaved complex data. I`ll update the patch soon.

BTW, do you think it makes sense to add two functions like vDSP_ctoz and vDSP_ztoc in VectorMath? Maybe we can improve the performance of packing and unpacking complex data after some optimization on them.
Comment 6 Chris Rogers 2011-12-20 18:33:26 PST
vDSP_ctoz and vDSP_ztoc might be useful.  But the FFT algorithm itself is *really* the hotspot.  Do you think it would make sense to create an FFTFrameIPP which calls into an appropriate Intel IPP library FFT function?  We already have one for MKL (FFTFrameMKL), but I've since learned that the IPP library may be more appropriate for media/audio applications.
Comment 7 Xingnan Wang 2011-12-20 19:05:59 PST
(In reply to comment #6)
> vDSP_ctoz and vDSP_ztoc might be useful.  But the FFT algorithm itself is *really* the hotspot.  Do you think it would make sense to create an FFTFrameIPP which calls into an appropriate Intel IPP library FFT function?  We already have one for MKL (FFTFrameMKL), but I've since learned that the IPP library may be more appropriate for media/audio applications.

Agree, IPP is more suitable than MKL for FFT in audio processing and it makes sense to leverage the IPP to do FFT. So I could have a try to add an FFTFrameIPP and enable the library.
Comment 8 Xingnan Wang 2011-12-20 21:52:57 PST
Created attachment 120142 [details]
Patch

Patch updated as comments.
Add zvmul in VectorMath and use zvmul and vsmul in multiply().
Thanks.
Comment 9 Chris Rogers 2011-12-21 15:44:28 PST
Comment on attachment 120142 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=120142&action=review

Looks good overall, with a few comments:

> Source/WebCore/platform/audio/VectorMath.cpp:301
> +void zvmul(const float* real1P, const float* imag1P, const float* real2P, const float* imag2P, float* realDestP, float* imagDestP, size_t framesToProcess)

We need to provide a version of this function in the DARWIN section using vDSP_zvmul.  It shouldn't be too hard.  You can look in FFTFrameMac.cpp for an example usage.

It will involve using three local variables of type "DSPSplitComplex"

/* Complex-split vector multiply, single-precision.*/
/*
 *  vDSP_zvmul()
 *  
 *  Availability:
 *    Mac OS X:         in version 10.0 and later in vecLib.framework
 *    CarbonLib:        not in Carbon, but vecLib is compatible with CarbonLib
 *    Non-Carbon CFM:   in vecLib 1.0 and later
 */
extern void vDSP_zvmul(
	const DSPSplitComplex *__vDSP_input1, vDSP_Stride __vDSP_stride1,
	const DSPSplitComplex *__vDSP_input2, vDSP_Stride __vDSP_stride2,
	const DSPSplitComplex *__vDSP_result, vDSP_Stride __vDSP_strideResult,
	vDSP_Length __vDSP_size,
	int __vDSP_conjugate)
		AVAILABLE_MAC_OS_X_VERSION_10_0_AND_LATER;

> Source/WebCore/platform/audio/VectorMath.cpp:305
> +    // Only handle the very common case that all addresses are 16-byte aligned.

I would elaborate on this comment a little by saying something like:

// Only use the SSE optimization in the very common case that all addresses are 16-byte aligned.
// Otherwise, fall through to the scalar code below.

As the comment is right now, it seems to imply that we don't handle "at all" the non-aligned case.

> Source/WebCore/platform/audio/ffmpeg/FFTFrameFFMPEG.cpp:-138
> -    }

I would restore the comment:
// Multiply the packed DC/nyquist component
Comment 10 Xingnan Wang 2011-12-21 19:23:39 PST
Created attachment 120259 [details]
Patch

Update the patch as comments.
Change the FFTFrameMac::multiply() to the same way of FFTFrameFFMPEG, due to the real1P, imag1P, real2P and imag2P are all 16-byte aligned and it`s better to start process from 0 of buffer than 1.
Comment 11 Chris Rogers 2011-12-21 20:35:52 PST
Comment on attachment 120259 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=120259&action=review

> Source/WebCore/platform/audio/VectorMath.cpp:79
> +#else

You need to invoke ::zvmul exactly with the same arguments as vDSP_zvmul()

and move lines 77:79 down to just after line 88

In other words ::zvmul() also needs to take DSPSplitComplex arguments (as opposed to our new VectorMath::zvmul() function...)

> Source/WebCore/platform/audio/ffmpeg/FFTFrameFFMPEG.cpp:122
>      float scale = 0.5f;

Please move lines 121:122 to just before first use, which is now just before line 134

> Source/WebCore/platform/audio/mac/FFTFrameMac.cpp:113
>      float scale = 0.5f;

Please move lines 112:113 to just right before first use, which is now just before line 126
Comment 12 Xingnan Wang 2011-12-21 21:04:35 PST
Comment on attachment 120259 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=120259&action=review

>> Source/WebCore/platform/audio/VectorMath.cpp:79
>> +#else
> 
> You need to invoke ::zvmul exactly with the same arguments as vDSP_zvmul()
> 
> and move lines 77:79 down to just after line 88
> 
> In other words ::zvmul() also needs to take DSPSplitComplex arguments (as opposed to our new VectorMath::zvmul() function...)

Not quite understand it, do you mean that the ::zvmul should be re-defined as VDSP_zvmul and import DSPSplitComplex into non-OS(DARWIN)?
or
add another zvmul definition below in VectorMath.h?
#if OS(DARWIN)
void vDSP_zvmul(const DSPSplitComplex *, vDSP_Stride ,const DSPSplitComplex *, vDSP_Stride, const DSPSplitComplex *, vDSP_Stride, vDSP_Length , int );
#endif
Comment 13 Chris Rogers 2011-12-21 21:29:00 PST
(In reply to comment #12)
> Not quite understand it, do you mean that the ::zvmul should be re-defined as VDSP_zvmul and import DSPSplitComplex into non-OS(DARWIN)?
> or
> add another zvmul definition below in VectorMath.h?
> #if OS(DARWIN)
> void vDSP_zvmul(const DSPSplitComplex *, vDSP_Stride ,const DSPSplitComplex *, vDSP_Stride, const DSPSplitComplex *, vDSP_Stride, vDSP_Length , int );
> #endif

It's easier if I just show you how the function should be defined:

void zvmul(const float* real1P, const float* imag1P, const float* real2P, const float* imag2P, float* realDestP, float* imagDestP, size_t framesToProcess)
{
    DSPSplitComplex sc1;
    DSPSplitComplex sc2;
    DSPSplitComplex dest;
    sc1.realp = real1P;
    sc1.imagp = imag1P;
    sc2.realp = real2P;
    sc2.imagp = imag2P;
    dest.realp = realDestP;
    dest.imagp = imagDestP;
#if defined(__ppc__) || defined(__i386__)
    ::zvmul(&sc1, 1, &sc2, 1, &dest, 1, framesToProcess, 1);
#else
    vDSP_zvmul(&sc1, 1, &sc2, 1, &dest, 1, framesToProcess, 1);
#endif
}
Comment 14 Chris Rogers 2011-12-21 21:30:53 PST
It's too bad we have to do this.  It's really just a hack/work-around for the way some macros are defined in <Accelerate/Accelerat.h>  The reason is explained in this comment in VectorMath.cpp:

// On the Mac we use the highly optimized versions in Accelerate.framework
// In 32-bit mode (__ppc__ or __i386__) <Accelerate/Accelerate.h> includes <vecLib/vDSP_translate.h> which defines macros of the same name as
// our namespaced function names, so we must handle this case differently. Other architectures (64bit, ARM, etc.) do not include this header file.
Comment 15 Xingnan Wang 2011-12-21 21:50:29 PST
Created attachment 120270 [details]
Patch
Comment 16 Xingnan Wang 2011-12-21 21:51:38 PST
(In reply to comment #14)
> It's too bad we have to do this.  It's really just a hack/work-around for the way some macros are defined in <Accelerate/Accelerat.h>  The reason is explained in this comment in VectorMath.cpp:
> 
> // On the Mac we use the highly optimized versions in Accelerate.framework
> // In 32-bit mode (__ppc__ or __i386__) <Accelerate/Accelerate.h> includes <vecLib/vDSP_translate.h> which defines macros of the same name as
> // our namespaced function names, so we must handle this case differently. Other architectures (64bit, ARM, etc.) do not include this header file.

Got it. Thanks for explanation.
Patch is updated.
Comment 17 WebKit Review Bot 2011-12-22 00:39:14 PST
Comment on attachment 120270 [details]
Patch

Attachment 120270 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/10986368

New failing tests:
fast/workers/storage/use-same-database-in-page-and-workers.html
Comment 18 Xingnan Wang 2011-12-22 02:11:10 PST
(In reply to comment #17)
> (From update of attachment 120270 [details])
> Attachment 120270 [details] did not pass chromium-ews (chromium-xvfb):
> Output: http://queues.webkit.org/results/10986368
> 
> New failing tests:
> fast/workers/storage/use-same-database-in-page-and-workers.html

Hi Roger,

My patch is failed to pass the tests in buildbots, but I am confused that these tests seems not have not too much relation with the change from my patch. I am not familiar with the chromium test framework so could you help me to check what`s the root cause of the building failure?
 
Thank you very much.
Comment 19 Chris Rogers 2011-12-22 11:50:51 PST
It's unlikely that it's related to your patch, and is probably some other recent change.  You can try re-uploading your patch to have it try again.
Comment 20 Xingnan Wang 2012-01-03 18:06:47 PST
Created attachment 121031 [details]
Patch

Hi Roger,
  I re-upload the patch, please help to submit to commit-queue again, thank you~
Comment 21 Chris Rogers 2012-01-04 17:40:39 PST
Looks good.
Comment 22 Kenneth Russell 2012-01-04 17:42:22 PST
Comment on attachment 121031 [details]
Patch

rs=me based on Chris's review.
Comment 23 WebKit Review Bot 2012-01-05 04:54:20 PST
Comment on attachment 121031 [details]
Patch

Clearing flags on attachment: 121031

Committed r104143: <http://trac.webkit.org/changeset/104143>
Comment 24 WebKit Review Bot 2012-01-05 04:54:25 PST
All reviewed patches have been landed.  Closing bug.
Comment 25 Hin-Chung Lam 2012-01-05 05:27:37 PST
Looks like this patch breaks the tree. See the following compilation error:

/Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:80:15:{80:17-80:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
     sc1.realp = real1P;
               ^ ~~~~~~
/Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:81:15:{81:17-81:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
     sc1.imagp = imag1P;
               ^ ~~~~~~
/Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:82:15:{82:17-82:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
     sc2.realp = real2P;
               ^ ~~~~~~
/Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:83:15:{83:17-83:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
     sc2.imagp = imag2P;
               ^ ~~~~~~

Any fix for this?
Comment 26 Kenneth Russell 2012-01-05 10:18:10 PST
(In reply to comment #25)
> Looks like this patch breaks the tree. See the following compilation error:
> 
> /Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:80:15:{80:17-80:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
>      sc1.realp = real1P;
>                ^ ~~~~~~
> /Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:81:15:{81:17-81:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
>      sc1.imagp = imag1P;
>                ^ ~~~~~~
> /Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:82:15:{82:17-82:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
>      sc2.realp = real2P;
>                ^ ~~~~~~
> /Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:83:15:{83:17-83:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
>      sc2.imagp = imag2P;
>                ^ ~~~~~~
> 
> Any fix for this?

Alpha: thanks for patching up these compilation failures in http://trac.webkit.org/changeset/104147 .
Comment 27 Xingnan Wang 2012-01-05 17:27:21 PST
(In reply to comment #25)
> Looks like this patch breaks the tree. See the following compilation error:
> 
> /Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:80:15:{80:17-80:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
>      sc1.realp = real1P;
>                ^ ~~~~~~
> /Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:81:15:{81:17-81:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
>      sc1.imagp = imag1P;
>                ^ ~~~~~~
> /Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:82:15:{82:17-82:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
>      sc2.realp = real2P;
>                ^ ~~~~~~
> /Volumes/Data/slave/lion-intel-leaks/build/Source/WebCore/platform/audio/VectorMath.cpp:83:15:{83:17-83:23}: error: assigning to 'float *' from incompatible type 'const float *' [3]
>      sc2.imagp = imag2P;
>                ^ ~~~~~~
> 
> Any fix for this?

Thanks for your patch.