RESOLVED FIXED 116151
Make PNGImageDecoder::rowAvailable auto-vectorizable
https://bugs.webkit.org/show_bug.cgi?id=116151
Summary Make PNGImageDecoder::rowAvailable auto-vectorizable
Allan Sandfeld Jensen
Reported 2013-05-15 07:12:42 PDT
After analyzing PNG decoding performance, I noticed that depending on the image somewhere between 25-50% of the decoding time is spend under PNGImageDecoder::rowAvailable simply populating the ImageFrame. This loop should be vectorizable giving at least 2x speedup in PNGImageDecoder::rowAvailable and thus 12-25% speed-up on PNG decoding in total.
Attachments
Patch (7.15 KB, patch)
2013-05-15 07:18 PDT, Allan Sandfeld Jensen
no flags
Patch (7.39 KB, patch)
2013-05-16 02:34 PDT, Allan Sandfeld Jensen
benjamin: review+
benjamin: commit-queue-
Allan Sandfeld Jensen
Comment 1 2013-05-15 07:18:26 PDT
WebKit Commit Bot
Comment 2 2013-05-15 07:20:10 PDT
Attachment 201825 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/ChangeLog', u'Source/WTF/wtf/MathExtras.h', u'Source/WebCore/ChangeLog', u'Source/WebCore/platform/graphics/filters/FEBlend.cpp', u'Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp']" exit_code: 1 Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:417: setPixelRGBA_Premultiplied is incorrectly named. Don't use underscores in your identifier names. [readability/naming/underscores] [4] Total errors found: 1 in 5 files If any of these errors are false positives, please file a bug against check-webkit-style.
Benjamin Poulain
Comment 3 2013-05-15 14:12:34 PDT
Comment on attachment 201825 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=201825&action=review This looks like a step in the right direction. I have a couple of comments: > Source/WebCore/ChangeLog:13 > + Together with automatic vectorization by the compiler this provides slightly > + over 2x speed-up on PNGImageDecoder::rowAvailable and shaves off 12-25% on > + PNG decoding in general. This is rather unimpressive. Vectorized premultiplication is usually > 4 times faster (vector gain + preload gain + math gain). Are you sure the compiler is not fucking up somewhere? > Source/WTF/wtf/MathExtras.h:463 > +inline unsigned char fastDivideBy255(uint16_t value) > +{ > + // This is an approximate algorithm for division by 255, but it gives accurate results for 16bit values. > + uint16_t approximation = value >> 8; > + uint16_t remainder = value - (approximation * 255) + 1; > + return approximation + (remainder >> 8); > +} I think that does not belong to MathExtras but you are not the first one to abuse this header. :) Let's keep the typing consistent: unsigned char -> uint16_t. > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:408 > +static inline void setPixelRGB(ImageFrame::PixelData* dest, png_bytep pixel) > +{ > + *dest = 0xFF000000 | pixel[0] << 16 | pixel[1] << 8 | pixel[2]; > +} You should check but I am pretty sure the compiler will do a bad job at vectorizing this. If it is the case, I have written optimized version for Qt that you are free to steal. > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:414 > + nonTrivialAlpha |= ~(a ^ 255U); Let's make this cleaner. > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:425 > + nonTrivialAlpha |= ~(a ^ 255U); Let's make this cleaner. > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:527 > + unsigned char nonTrivialAlpha = 0; I would either keep this as a boolean or give it a proper name for a mask. > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:535 > + unsigned alpha = hasAlpha ? pixel[3] : 255; > + buffer.setRGBA(address++, pixel[0], pixel[1], pixel[2], alpha); > + nonTrivialAlpha |= alpha ^ 255U; No need to move the "hasAlpha" test out of the loop and use setPixelRGB().
Allan Sandfeld Jensen
Comment 4 2013-05-15 14:57:09 PDT
(In reply to comment #3) > (From update of attachment 201825 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=201825&action=review > > This looks like a step in the right direction. I have a couple of comments: > > > Source/WebCore/ChangeLog:13 > > + Together with automatic vectorization by the compiler this provides slightly > > + over 2x speed-up on PNGImageDecoder::rowAvailable and shaves off 12-25% on > > + PNG decoding in general. > > This is rather unimpressive. Vectorized premultiplication is usually > 4 times faster (vector gain + preload gain + math gain). > Are you sure the compiler is not fucking up somewhere? > This was using a generic x64 target, so the vectorization is limited to SSE2. Additional the 2+x speed-up is for the entire rowAvailable method, including the unoptimized combining of existing row data with the new. It was pretty much what I expected in this situation. > > Source/WTF/wtf/MathExtras.h:463 > > +inline unsigned char fastDivideBy255(uint16_t value) > > +{ > > + // This is an approximate algorithm for division by 255, but it gives accurate results for 16bit values. > > + uint16_t approximation = value >> 8; > > + uint16_t remainder = value - (approximation * 255) + 1; > > + return approximation + (remainder >> 8); > > +} > > I think that does not belong to MathExtras but you are not the first one to abuse this header. :) > Yes. It is not quite a perfect fit (and I think I might have been one those trying to abuse that header before), but where else should methods like this be placed if they are to be reused? > > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:408 > > +static inline void setPixelRGB(ImageFrame::PixelData* dest, png_bytep pixel) > > +{ > > + *dest = 0xFF000000 | pixel[0] << 16 | pixel[1] << 8 | pixel[2]; > > +} > > You should check but I am pretty sure the compiler will do a bad job at vectorizing this. > I didn't study the generated code, but GCC reports the loop vectorized. > > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:535 > > + unsigned alpha = hasAlpha ? pixel[3] : 255; > > + buffer.setRGBA(address++, pixel[0], pixel[1], pixel[2], alpha); > > + nonTrivialAlpha |= alpha ^ 255U; > > No need to move the "hasAlpha" test out of the loop and use setPixelRGB(). This is the scaled case. It doesn't vectorize anyway because it has table look-up for each x.
Allan Sandfeld Jensen
Comment 5 2013-05-16 02:34:16 PDT
Benjamin Poulain
Comment 6 2013-05-16 23:57:38 PDT
Comment on attachment 201936 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=201936&action=review Looks good. I cq- because I would like you to clarify why you handle nonTrivialAlpha in this weird way. > Source/WebCore/platform/graphics/Color.h:211 > +inline uint16_t fastDivideBy255(uint16_t value) > +{ > + // This is an approximate algorithm for division by 255, but it gives accurate results for 16bit values. > + uint16_t approximation = value >> 8; > + uint16_t remainder = value - (approximation * 255) + 1; > + return approximation + (remainder >> 8); > +} Good idea! > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:415 > + nonTrivialAlpha |= ~(a ^ 255U); // Arithmetic calculation of (a != 255). Because the compiler has to set bool to 1 or 0, it will actually add a branch here. So using nonTrivialAlpha |= (a == 255). If you change "bool nonTrivialAlpha" to "unsigned char nonTrivialAlphaMask", you will avoid that. I don't get why you don't just write nonTrivialAlphaMask |= (a - 0xff) and check nonTrivialAlpha at the end. > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:426 > + nonTrivialAlpha |= ~(a ^ 255U); // Arithmetic calculation of (a != 255). Ditto. > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:536 > + nonTrivialAlpha |= ~(alpha ^ 255U); // Arithmetic calculation of (alpha != 255). Ditto.
Allan Sandfeld Jensen
Comment 7 2013-05-17 02:16:39 PDT
(In reply to comment #6) > > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:415 > > + nonTrivialAlpha |= ~(a ^ 255U); // Arithmetic calculation of (a != 255). > > Because the compiler has to set bool to 1 or 0, it will actually add a branch here. > So using > nonTrivialAlpha |= (a == 255). > > If you change "bool nonTrivialAlpha" to "unsigned char nonTrivialAlphaMask", you will avoid that. That is what I was doing in the first patch. I tried using a bool and checked if it was still auto-vectorized, and since it was I am assuming gcc is doing something more clever with it. The generated code is unreadable to me though, so I am not sure what exactly is transformed into. I will return it to a nonTrivialAlphaMask before landing, just to be nice to different compilers and architectures. (alpha - 255) is a good idea, I hadn't thought of that. It is probably the same performance wise, but much easier to read.
Allan Sandfeld Jensen
Comment 8 2013-05-17 05:41:03 PDT
Note You need to log in before you can comment on or make changes to this bug.