Bug 116151 - Make PNGImageDecoder::rowAvailable auto-vectorizable
Summary: Make PNGImageDecoder::rowAvailable auto-vectorizable
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Images (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Allan Sandfeld Jensen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-05-15 07:12 PDT by Allan Sandfeld Jensen
Modified: 2013-05-17 05:41 PDT (History)
5 users (show)

See Also:


Attachments
Patch (7.15 KB, patch)
2013-05-15 07:18 PDT, Allan Sandfeld Jensen
no flags Details | Formatted Diff | Diff
Patch (7.39 KB, patch)
2013-05-16 02:34 PDT, Allan Sandfeld Jensen
benjamin: review+
benjamin: commit-queue-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Allan Sandfeld Jensen 2013-05-15 07:12:42 PDT
After analyzing PNG decoding performance, I noticed that depending on the image somewhere between 25-50% of the decoding time is spend under PNGImageDecoder::rowAvailable simply populating the ImageFrame.

This loop should be vectorizable giving at least 2x speedup in PNGImageDecoder::rowAvailable and thus 12-25% speed-up on PNG decoding in total.
Comment 1 Allan Sandfeld Jensen 2013-05-15 07:18:26 PDT
Created attachment 201825 [details]
Patch
Comment 2 WebKit Commit Bot 2013-05-15 07:20:10 PDT
Attachment 201825 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/ChangeLog', u'Source/WTF/wtf/MathExtras.h', u'Source/WebCore/ChangeLog', u'Source/WebCore/platform/graphics/filters/FEBlend.cpp', u'Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp']" exit_code: 1
Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:417:  setPixelRGBA_Premultiplied is incorrectly named. Don't use underscores in your identifier names.  [readability/naming/underscores] [4]
Total errors found: 1 in 5 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 3 Benjamin Poulain 2013-05-15 14:12:34 PDT
Comment on attachment 201825 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=201825&action=review

This looks like a step in the right direction. I have a couple of comments:

> Source/WebCore/ChangeLog:13
> +        Together with automatic vectorization by the compiler this provides slightly
> +        over 2x speed-up on PNGImageDecoder::rowAvailable and shaves off 12-25% on
> +        PNG decoding in general.

This is rather unimpressive. Vectorized premultiplication is usually > 4 times faster (vector gain + preload gain + math gain).
Are you sure the compiler is not fucking up somewhere?

> Source/WTF/wtf/MathExtras.h:463
> +inline unsigned char fastDivideBy255(uint16_t value)
> +{
> +    // This is an approximate algorithm for division by 255, but it gives accurate results for 16bit values.
> +    uint16_t approximation = value >> 8;
> +    uint16_t remainder = value - (approximation * 255) + 1;
> +    return approximation + (remainder >> 8);
> +}

I think that does not belong to MathExtras but you are not the first one to abuse this header. :)

Let's keep the typing consistent: unsigned char -> uint16_t.

> Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:408
> +static inline void setPixelRGB(ImageFrame::PixelData* dest, png_bytep pixel)
> +{
> +    *dest = 0xFF000000 | pixel[0] << 16 | pixel[1] << 8 | pixel[2];
> +}

You should check but I am pretty sure the compiler will do a bad job at vectorizing this.

If it is the case, I have written optimized version for Qt that you are free to steal.

> Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:414
> +    nonTrivialAlpha |= ~(a ^ 255U);

Let's make this cleaner.

> Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:425
> +    nonTrivialAlpha |= ~(a ^ 255U);

Let's make this cleaner.

> Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:527
> +    unsigned char nonTrivialAlpha = 0;

I would either keep this as a boolean or give it a proper name for a mask.

> Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:535
> +            unsigned alpha = hasAlpha ? pixel[3] : 255;
> +            buffer.setRGBA(address++, pixel[0], pixel[1], pixel[2], alpha);
> +            nonTrivialAlpha |= alpha ^ 255U;

No need to move the "hasAlpha" test out of the loop and use setPixelRGB().
Comment 4 Allan Sandfeld Jensen 2013-05-15 14:57:09 PDT
(In reply to comment #3)
> (From update of attachment 201825 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=201825&action=review
> 
> This looks like a step in the right direction. I have a couple of comments:
> 
> > Source/WebCore/ChangeLog:13
> > +        Together with automatic vectorization by the compiler this provides slightly
> > +        over 2x speed-up on PNGImageDecoder::rowAvailable and shaves off 12-25% on
> > +        PNG decoding in general.
> 
> This is rather unimpressive. Vectorized premultiplication is usually > 4 times faster (vector gain + preload gain + math gain).
> Are you sure the compiler is not fucking up somewhere?
> 
This was using a generic x64 target, so the vectorization is limited to SSE2. Additional the 2+x speed-up is for the entire rowAvailable method, including the unoptimized combining of existing row data with the new. It was pretty much what I expected in this situation.

> > Source/WTF/wtf/MathExtras.h:463
> > +inline unsigned char fastDivideBy255(uint16_t value)
> > +{
> > +    // This is an approximate algorithm for division by 255, but it gives accurate results for 16bit values.
> > +    uint16_t approximation = value >> 8;
> > +    uint16_t remainder = value - (approximation * 255) + 1;
> > +    return approximation + (remainder >> 8);
> > +}
> 
> I think that does not belong to MathExtras but you are not the first one to abuse this header. :)
> 
Yes. It is not quite a perfect fit (and I think I might have been one those trying to abuse that header before), but where else should methods like this be placed if they are to be reused?

> > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:408
> > +static inline void setPixelRGB(ImageFrame::PixelData* dest, png_bytep pixel)
> > +{
> > +    *dest = 0xFF000000 | pixel[0] << 16 | pixel[1] << 8 | pixel[2];
> > +}
> 
> You should check but I am pretty sure the compiler will do a bad job at vectorizing this.
> 

I didn't study the generated code, but GCC reports the loop vectorized.

> > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:535
> > +            unsigned alpha = hasAlpha ? pixel[3] : 255;
> > +            buffer.setRGBA(address++, pixel[0], pixel[1], pixel[2], alpha);
> > +            nonTrivialAlpha |= alpha ^ 255U;
> 
> No need to move the "hasAlpha" test out of the loop and use setPixelRGB().

This is the scaled case. It doesn't vectorize anyway because it has table look-up for each x.
Comment 5 Allan Sandfeld Jensen 2013-05-16 02:34:16 PDT
Created attachment 201936 [details]
Patch
Comment 6 Benjamin Poulain 2013-05-16 23:57:38 PDT
Comment on attachment 201936 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=201936&action=review

Looks good.

I cq- because I would like you to clarify why you handle nonTrivialAlpha in this weird way.

> Source/WebCore/platform/graphics/Color.h:211
> +inline uint16_t fastDivideBy255(uint16_t value)
> +{
> +    // This is an approximate algorithm for division by 255, but it gives accurate results for 16bit values.
> +    uint16_t approximation = value >> 8;
> +    uint16_t remainder = value - (approximation * 255) + 1;
> +    return approximation + (remainder >> 8);
> +}

Good idea!

> Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:415
> +    nonTrivialAlpha |= ~(a ^ 255U); // Arithmetic calculation of (a != 255).

Because the compiler has to set bool to 1 or 0, it will actually add a branch here.
So using
    nonTrivialAlpha |= (a == 255).

If you change "bool nonTrivialAlpha" to "unsigned char nonTrivialAlphaMask", you will avoid that.
I don't get why you don't just write 
   nonTrivialAlphaMask |= (a - 0xff)
and check nonTrivialAlpha at the end.

> Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:426
> +    nonTrivialAlpha |= ~(a ^ 255U); // Arithmetic calculation of (a != 255).

Ditto.

> Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:536
> +            nonTrivialAlpha |= ~(alpha ^ 255U); // Arithmetic calculation of (alpha != 255).

Ditto.
Comment 7 Allan Sandfeld Jensen 2013-05-17 02:16:39 PDT
(In reply to comment #6)
> > Source/WebCore/platform/image-decoders/png/PNGImageDecoder.cpp:415
> > +    nonTrivialAlpha |= ~(a ^ 255U); // Arithmetic calculation of (a != 255).
> 
> Because the compiler has to set bool to 1 or 0, it will actually add a branch here.
> So using
>     nonTrivialAlpha |= (a == 255).
> 
> If you change "bool nonTrivialAlpha" to "unsigned char nonTrivialAlphaMask", you will avoid that.

That is what I was doing in the first patch. I tried using a bool and checked if it was still auto-vectorized, and since it was I am assuming gcc is doing something more clever with it. The generated code is unreadable to me though, so I am not sure what  exactly is transformed into.

I will return it to a nonTrivialAlphaMask before landing, just to be nice to different compilers and architectures. (alpha - 255) is a good idea, I hadn't thought of that. It is probably the same performance wise, but much easier to read.
Comment 8 Allan Sandfeld Jensen 2013-05-17 05:41:03 PDT
Committed r150252: <http://trac.webkit.org/changeset/150252>