RESOLVED FIXED 91398
Fix blend filter for autovectorizing
https://bugs.webkit.org/show_bug.cgi?id=91398
Summary Fix blend filter for autovectorizing
Allan Sandfeld Jensen
Reported 2012-07-16 09:05:32 PDT
The blend filter is currently written in such a way that it can not be autovectorized by the GCC auto-vectorizer, by removing conditions from the inner of the loop, GCC will be able to auto-vectorizing giving a large potential speedup on all platforms with SIMD instructions.
Attachments
Patch (10.74 KB, patch)
2012-07-16 09:12 PDT, Allan Sandfeld Jensen
no flags
Patch (10.22 KB, patch)
2012-07-24 02:44 PDT, Allan Sandfeld Jensen
no flags
Patch (10.30 KB, patch)
2012-07-24 04:56 PDT, Allan Sandfeld Jensen
no flags
Patch (6.66 KB, patch)
2012-07-25 01:48 PDT, Allan Sandfeld Jensen
no flags
Patch (6.93 KB, patch)
2012-07-25 02:53 PDT, Allan Sandfeld Jensen
no flags
Allan Sandfeld Jensen
Comment 1 2012-07-16 09:12:30 PDT
Allan Sandfeld Jensen
Comment 2 2012-07-16 09:29:57 PDT
Bonus info I missed in the ChangeLog, even without auto-vectorization the improvements themselves yield a 1.5x speed-up.
Nikolas Zimmermann
Comment 3 2012-07-16 10:48:10 PDT
Comment on attachment 152547 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=152547&action=review > Source/WebCore/ChangeLog:10 > + To support auto-vectorizing, the loop had to be unswitched, and regular arrays used that > + did not do boundary-checks in the inner loop. Finally the integer division by 255 > + was optimized not use integer division intrinsics. Nice speed-up, I'd like to hear Zoltans opinion on the patch as well, he's our master optimizer... > Source/WebCore/platform/graphics/filters/FEBlend.cpp:68 > + // This is an approximate algorithm for division by 255, but it gives accurate results for 16bit values. Perfect mixture of a what & why comment :-) > Source/WebCore/platform/graphics/filters/FEBlend.cpp:114 > + for (unsigned pixelOffset = 0; pixelOffset < len; pixelOffset++) { > + unsigned char alphaA = sourcePixelA[3]; > + unsigned char alphaB = sourcePixelB[3]; > + for (unsigned channel = 0; channel < 3; ++channel) > + destinationPixel[channel] = normal(sourcePixelA[channel], sourcePixelB[channel], alphaA, alphaB); > + destinationPixel[3] = 255 - div255((255 - alphaA) * (255 - alphaB)); > + sourcePixelA += 4; > + sourcePixelB += 4; > + destinationPixel += 4; > + } It should still be possible to share this in a static inline helper function w/o loss of performance, if you pass in the operation as function pointer (&normal, &multiply, etc.)
Allan Sandfeld Jensen
Comment 4 2012-07-16 11:08:04 PDT
(In reply to comment #3) > > Source/WebCore/platform/graphics/filters/FEBlend.cpp:114 > > + for (unsigned pixelOffset = 0; pixelOffset < len; pixelOffset++) { > > + unsigned char alphaA = sourcePixelA[3]; > > + unsigned char alphaB = sourcePixelB[3]; > > + for (unsigned channel = 0; channel < 3; ++channel) > > + destinationPixel[channel] = normal(sourcePixelA[channel], sourcePixelB[channel], alphaA, alphaB); > > + destinationPixel[3] = 255 - div255((255 - alphaA) * (255 - alphaB)); > > + sourcePixelA += 4; > > + sourcePixelB += 4; > > + destinationPixel += 4; > > + } > > It should still be possible to share this in a static inline helper function w/o loss of performance, if you pass in the operation as function pointer (&normal, &multiply, etc.) Maybe, I thought it would result in a call by function-pointer which would make vectorization impossible, but if the functions is fully inlined first it might work. Too bad we can not instantiate templates with functions (without by using wrapper classes).
Zoltan Herczeg
Comment 5 2012-07-16 11:13:07 PDT
Nice speedup. I agree with Niko's comments.
Allan Sandfeld Jensen
Comment 6 2012-07-24 02:44:56 PDT
Nikolas Zimmermann
Comment 7 2012-07-24 03:37:19 PDT
Comment on attachment 153999 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=153999&action=review Looks better, I still have some suggestions though: > Source/WebCore/platform/graphics/filters/FEBlend.cpp:66 > +static inline unsigned char div255(uint16_t value) nitpick: fastDivideBy255, to avoid abbreviations. > Source/WebCore/platform/graphics/filters/FEBlend.cpp:124 > +template<typename BlendFunctor> > +static void platformApply(unsigned char* sourcePixelA, unsigned char* sourcePixelB, > + unsigned char* destinationPixel, unsigned pixelArrayLength) I'm not sure why you need a templatified solution. I would have defined: typedef void (*BlendModefunction)(uchar colorA, uchar colorB, uchar alphaA, uchar alphaB) for the use in static void platformApply(uchar* sourcePixelA, uchar* sourcePixelB, uchar* destPixel, unsigned pixelArrayLength, BlendModeFunction blendModeFunction) { ... destinationPixel[0] = (*blendModeFunction)(sourcePixelA.... } This way you don't need to wrap the existing darken/lighten/.. methods in dummy classes. Please elaborate if I'm overlooking something which makes this impossible. > Source/WebCore/platform/graphics/filters/FEBlend.cpp:160 > + platformApply<BlendUnknown>(sourcePixelA, sourcePixelB, destinationPixel, pixelArrayLength); If you go with my proposed solution above, you can pass 0 as function-pointer here for the unknown case.
Allan Sandfeld Jensen
Comment 8 2012-07-24 04:18:25 PDT
(In reply to comment #7) > (From update of attachment 153999 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=153999&action=review > > Looks better, I still have some suggestions though: > > > Source/WebCore/platform/graphics/filters/FEBlend.cpp:66 > > +static inline unsigned char div255(uint16_t value) > > nitpick: fastDivideBy255, to avoid abbreviations. > Will do. > I'm not sure why you need a templatified solution. Because I want the functions to be inlined, if they are not inlined the loop that calls them can not be vectorized, and a function-pointer is a run-time parameter, while template parameters are compile-time parameters. We need the function to be fully defined at compile time for the compiler to be able to take advantage of every possible optimization. > > Source/WebCore/platform/graphics/filters/FEBlend.cpp:160 > > + platformApply<BlendUnknown>(sourcePixelA, sourcePixelB, destinationPixel, pixelArrayLength); > > If you go with my proposed solution above, you can pass 0 as function-pointer here for the unknown case. I don't see what that would gain. We would need to special case the null-pointer function anyway or it would segfault when called.
Nikolas Zimmermann
Comment 9 2012-07-24 04:27:45 PDT
(In reply to comment #8) > Because I want the functions to be inlined, if they are not inlined the loop that calls them can not be vectorized, and a function-pointer is a run-time parameter, while template parameters are compile-time parameters. We need the function to be fully defined at compile time for the compiler to be able to take advantage of every possible optimization. Ouch, you're right, I take back the criticism. If you upload a new patch I'll r+ it. > I don't see what that would gain. We would need to special case the null-pointer function anyway or it would segfault when called. I thought of special casing, but it's really not worth it, especially if you can't use it anyways.
Allan Sandfeld Jensen
Comment 10 2012-07-24 04:56:22 PDT
Nikolas Zimmermann
Comment 11 2012-07-24 05:33:14 PDT
Comment on attachment 154021 [details] Patch Thanks Allan, r=me.
WebKit Review Bot
Comment 12 2012-07-24 06:38:05 PDT
Comment on attachment 154021 [details] Patch Clearing flags on attachment: 154021 Committed r123467: <http://trac.webkit.org/changeset/123467>
WebKit Review Bot
Comment 13 2012-07-24 06:38:09 PDT
All reviewed patches have been landed. Closing bug.
Nikolas Zimmermann
Comment 14 2012-07-25 00:15:58 PDT
(In reply to comment #9) > (In reply to comment #8) > > Because I want the functions to be inlined, if they are not inlined the loop that calls them can not be vectorized, and a function-pointer is a run-time parameter, while template parameters are compile-time parameters. We need the function to be fully defined at compile time for the compiler to be able to take advantage of every possible optimization. > Ouch, you're right, I take back the criticism. If you upload a new patch I'll r+ it. Hm, I slept a night over this. What you could have done is: template<void myFunction()> void doSomething() { myFunction(); } This still allows the compile-time optimizations, but avoids the classes wrapping apply() functions. What do you think?
Allan Sandfeld Jensen
Comment 15 2012-07-25 00:43:41 PDT
(In reply to comment #14) > (In reply to comment #9) > > (In reply to comment #8) > > > Because I want the functions to be inlined, if they are not inlined the loop that calls them can not be vectorized, and a function-pointer is a run-time parameter, while template parameters are compile-time parameters. We need the function to be fully defined at compile time for the compiler to be able to take advantage of every possible optimization. > > Ouch, you're right, I take back the criticism. If you upload a new patch I'll r+ it. > > Hm, I slept a night over this. What you could have done is: > > template<void myFunction()> > void doSomething() { myFunction(); } > > This still allows the compile-time optimizations, but avoids the classes wrapping apply() functions. > What do you think? This is what I wanted to do, but I have tried the same approach before in other contexts, and it appears functions are not allowed as template argument for some reason. Maybe it would work in C++11 though.
Nikolas Zimmermann
Comment 16 2012-07-25 00:51:33 PDT
(In reply to comment #15) > This is what I wanted to do, but I have tried the same approach before in other contexts, and it appears functions are not allowed as template argument for some reason. Maybe it would work in C++11 though. I'm quite sure that is a valid c++ construct, even before c++11. A quick grep shows this is actually used in our HTML5 parser: class HTMLTreeBuilder::ExternalCharacterTokenBuffer { WTF_MAKE_NONCOPYABLE(ExternalCharacterTokenBuffer); public: ... String takeLeadingWhitespace() { return takeLeading<isHTMLSpace>(); } .. template<bool characterPredicate(UChar)> void skipLeading() { ASSERT(!isEmpty()); while (characterPredicate(*m_current)) { if (++m_current == m_end) return; } } template<bool characterPredicate(UChar)> String takeLeading() { ASSERT(!isEmpty()); const UChar* start = m_current; skipLeading<characterPredicate>(); if (start == m_current) return String(); return String(start, m_current - start); } ... Can you give it another try?
Allan Sandfeld Jensen
Comment 17 2012-07-25 01:48:46 PDT
Created attachment 154292 [details] Patch You are right, this actually works. I wonder what has changed, or what I did wrong last time I tried this.
Allan Sandfeld Jensen
Comment 18 2012-07-25 01:50:29 PDT
(In reply to comment #17) > Created an attachment (id=154292) [details] > Patch > > You are right, this actually works. I wonder what has changed, or what I did wrong last time I tried this. Well, static functions are not allowed. So I had to removed the static argument from the functions. This might could cause trouble with all-in-one compilations, if it does we need to add an extra namespace to these functions.
Nikolas Zimmermann
Comment 19 2012-07-25 02:22:52 PDT
Comment on attachment 154292 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=154292&action=review I'd r+ this as-is, though it doesn't apply on the bots. Still r=me, please watch the bots to make sure it builds everywhere. Thanks Allan, I'm now fully satisfied with your solution! > Source/WebCore/platform/graphics/filters/FEBlend.cpp:74 > +inline unsigned char normal(unsigned char colorA, unsigned char colorB, unsigned char alphaA, unsigned char) You could just rename this to "feBlendNormal", this will reduce the likelihood of clashes in all-in-one builds a lot.
Nikolas Zimmermann
Comment 20 2012-07-25 02:23:27 PDT
(In reply to comment #17) > Created an attachment (id=154292) [details] > Patch > > You are right, this actually works. I wonder what has changed, or what I did wrong last time I tried this. Your patch is 6.66k, evil, but works ;-)
Allan Sandfeld Jensen
Comment 21 2012-07-25 02:49:50 PDT
Reopen to get the EWS to check the patch.
Allan Sandfeld Jensen
Comment 22 2012-07-25 02:53:48 PDT
Allan Sandfeld Jensen
Comment 23 2012-07-25 04:01:05 PDT
Comment on attachment 154307 [details] Patch Builds on all platforms, so please land it.
WebKit Review Bot
Comment 24 2012-07-25 04:16:10 PDT
Comment on attachment 154307 [details] Patch Clearing flags on attachment: 154307 Committed r123602: <http://trac.webkit.org/changeset/123602>
WebKit Review Bot
Comment 25 2012-07-25 04:16:15 PDT
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.