Bug 103216 - PNG decode performance: avoid using frame buffer.setRGBA(x,y)
Summary: PNG decode performance: avoid using frame buffer.setRGBA(x,y)
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Images (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: noel gordon
URL:
Keywords:
Depends on:
Blocks: 88424
  Show dependency treegraph
 
Reported: 2012-11-25 23:01 PST by noel gordon
Modified: 2012-11-27 16:51 PST (History)
3 users (show)

See Also:


Attachments
Patch (5.68 KB, patch)
2012-11-25 23:05 PST, noel gordon
no flags Details | Formatted Diff | Diff
Patch for landing (5.62 KB, patch)
2012-11-26 17:21 PST, noel gordon
no flags Details | Formatted Diff | Diff
png.write.time.git.diff (3.59 KB, text/plain)
2012-11-26 18:41 PST, noel gordon
no flags Details
png.decoder.time.git.diff (4.33 KB, text/plain)
2012-11-26 18:44 PST, noel gordon
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description noel gordon 2012-11-25 23:01:07 PST
Writing decoded row pixels to the frame buffer with buffer.setRGBA(x,y) is slow compared to writing direct to the frame buffer. 

I disabled color profiles in the PNG decoder, and instrumented the total decode time and time taken to write row data to the frame buffer.  The amount in ()'s is the percentage of decode time spent writing row data.

http://duke.kenai.com/eco/DukeTreeHuggerSmall.png
           row write time 0.002204
           PNG decode ends 0.016935 (13%)

http://duke.kenai.com/eco/DukeTreeHuggerSmall.png
           row write time 0.022714
           PNG decode ends 0.141308 (16%)

http://trac.webkit.org/browser/trunk/LayoutTests/fast/images/resources/red-at-12-oclock-with-color-profile.png
           row write time 0.001901
           PNG decode ends 0.012777 (14%)
Comment 1 noel gordon 2012-11-25 23:05:23 PST
Created attachment 175930 [details]
Patch
Comment 2 noel gordon 2012-11-25 23:12:25 PST
The patch avoids using buffer.setRGBA(x,y), and writes direct to the frame buffer.  This reduces decoding time 2% or more for these images. 

http://duke.kenai.com/eco/DukeTreeHuggerSmall.png
          row write time 0.001829
          PNG decode ends 0.016632 (10%)

http://duke.kenai.com/eco/DukeTreeHuggerSmall.png
          row write time 0.017736
          PNG decode ends 0.136413 (13%)

http://trac.webkit.org/browser/trunk/LayoutTests/fast/images/resources/red-at-12-oclock-with-color-profile.png
          row write time 0.001005
          PNG decode ends 0.012086 (8%)
Comment 3 Brent Fulgham 2012-11-26 00:26:01 PST
Comment on attachment 175930 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=175930&action=review

Looks good. I'll r+ after the EWS finishes running.

> Source/WebCore/platform/image-decoders/ImageDecoder.h:134
> +        void reportMemoryUsage(MemoryObjectInfo*) const;

Why was this moved?

> Source/WebCore/platform/image-decoders/ImageDecoder.h:170
> +                float alphaPercent = a / 255.0f;

Divides are costly. It might save ~20 cycles to multiply a by a precomputed "1/255.0f" I'm not sure if this would create any significant inaccuracies due to the vagaries of floating point math...
Comment 4 noel gordon 2012-11-26 02:47:24 PST
(In reply to comment #3)

> > Source/WebCore/platform/image-decoders/ImageDecoder.h:134
> > +        void reportMemoryUsage(MemoryObjectInfo*) const;
> 
> Why was this moved?

It was in the middle of routines dealing with frame buffer writing.  Seemed misplaced to me, so I moved it elsewhere.

> > Source/WebCore/platform/image-decoders/ImageDecoder.h:170
> > +                float alphaPercent = a / 255.0f;
> 
> Divides are costly. It might save ~20 cycles to multiply a by a precomputed "1/255.0f" I'm not sure if this would create any significant inaccuracies due to the vagaries of floating point math...

Indeed.  We have inaccuracies as the code stands for other reasons, and a change here might introduce more.  Not looking to change behavior on this bug.
Comment 5 Viatcheslav Ostapenko 2012-11-26 07:29:51 PST
This is basically duplicate of bug 88424 and recent patch there does the same for all image formats with some extra.
Comment 6 Viatcheslav Ostapenko 2012-11-26 07:53:22 PST
(In reply to comment #3)
> (From update of attachment 175930 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=175930&action=review
> 
> Looks good. I'll r+ after the EWS finishes running.

I would ask you too look at the complete patch which has this as a part: https://bugs.webkit.org/attachment.cgi?id=175328&action=review
Comment 7 Brent Fulgham 2012-11-26 10:01:13 PST
(In reply to comment #6)
> (In reply to comment #3)
> > (From update of attachment 175930 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=175930&action=review
> > 
> > Looks good. I'll r+ after the EWS finishes running.
> 
> I would ask you too look at the complete patch which has this as a part: https://bugs.webkit.org/attachment.cgi?id=175328&action=review

Battle of the competing patches!

I must admit that I prefer Viatcheslav's handling of the advancement of the address position, but otherwise I think that committing these as individual changes is better from a review standpoint.

Noel, do you have committer rights?
Comment 8 Viatcheslav Ostapenko 2012-11-26 11:20:40 PST
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #3)
> > > (From update of attachment 175930 [details] [details] [details])
> > > View in context: https://bugs.webkit.org/attachment.cgi?id=175930&action=review
> > > 
> > > Looks good. I'll r+ after the EWS finishes running.
> > 
> > I would ask you too look at the complete patch which has this as a part: https://bugs.webkit.org/attachment.cgi?id=175328&action=review
> 
> Battle of the competing patches!
> 
> I must admit that I prefer Viatcheslav's 

Actually it is not my patch. I just rebaselined original patch from a person who submitted it half a year ago, but doesn't work on webkit anymore. This upstreamed patch showed great improvement on mobile devices where CPU branch prediction is not so good as on desktop processors.
Comment 9 noel gordon 2012-11-26 15:39:19 PST
(In reply to comment #5)
> This is basically duplicate of bug 88424 and recent patch there does the same for all image formats with some extra.

(In reply to comment #6)
> 
> I would ask you too look at the complete patch which has this as a part: https://bugs.webkit.org/attachment.cgi?id=175328&action=review

I would r- that patch: insufficient ChangeLog, it breaks JPEG on some ports, and is poorly tested, yet you claim it greatly improves performance on mobile mobile devices.

We are shipping code in a production browser environments: production is the wrong place to test webkit patches.
Comment 10 noel gordon 2012-11-26 15:41:44 PST
(In reply to comment #7)

> I must admit that I prefer Viatcheslav's handling of the advancement of the address position, but otherwise I think that committing these as individual changes is better from a review standpoint.

Moreover, individual changes are easier to test, and roll out.

> Noel, do you have committer rights?

Yes.
Comment 11 Brent Fulgham 2012-11-26 16:38:15 PST
Comment on attachment 175930 [details]
Patch

Let's get this started!
Comment 12 Brent Fulgham 2012-11-26 16:40:17 PST
(In reply to comment #10)
> (In reply to comment #7)
> 
> > I must admit that I prefer Viatcheslav's handling of the advancement of the address position, but otherwise I think that committing these as individual changes is better from a review standpoint.
> 
> Moreover, individual changes are easier to test, and roll out.
> 
> > Noel, do you have committer rights?
> 
> Yes.

Okay. I marked this as r+, and cq+.  I'll update 88424 to rebaseline after this lands, as those changes will no longer be necessary.

Incidentally, how are you instrumenting the code to run your performance checks? If this is documented somewhere I'd like to get up-to-speed with it for some future changes I'd like to try.
Comment 13 WebKit Review Bot 2012-11-26 16:58:45 PST
Comment on attachment 175930 [details]
Patch

Rejecting attachment 175930 [details] from commit-queue.

Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 2

Last 500 characters of output:
 (content): Merge conflict in Source/WebKit/chromium/ChangeLog
Failed to merge in the changes.
Patch failed at 0001 RenderStyle: Move 'list-style-image' to rare inherited data.

When you have resolved this problem run "git rebase --continue".
If you would prefer to skip this patch, instead run "git rebase --skip".
To restore the original branch and stop rebasing run "git rebase --abort".

rebase refs/remotes/origin/master: command returned error: 1

Died at Tools/Scripts/update-webkit line 154.

Full output: http://queues.webkit.org/results/14989805
Comment 14 noel gordon 2012-11-26 17:21:21 PST
Created attachment 176121 [details]
Patch for landing
Comment 15 Brent Fulgham 2012-11-26 17:25:20 PST
Comment on attachment 176121 [details]
Patch for landing

Thanks for rebaselining.
Comment 16 WebKit Review Bot 2012-11-26 18:11:04 PST
Comment on attachment 176121 [details]
Patch for landing

Clearing flags on attachment: 176121

Committed r135799: <http://trac.webkit.org/changeset/135799>
Comment 17 WebKit Review Bot 2012-11-26 18:11:08 PST
All reviewed patches have been landed.  Closing bug.
Comment 18 noel gordon 2012-11-26 18:15:57 PST
(In reply to comment #0)

The second image URL was wrong: should be http://duke.kenai.com/eco/GreenThumbLarge.png

> Writing decoded row pixels to the frame buffer with buffer.setRGBA(x,y) is slow compared to writing direct to the frame buffer. 
> 
> I disabled color profiles in the PNG decoder, and instrumented the total decode time and time taken to write row data to the frame buffer.  The amount in ()'s is the percentage of decode time spent writing row data.

http://duke.kenai.com/eco/DukeTreeHuggerSmall.png
           row write time 0.002204
           PNG decode ends 0.016935 (13%)

http://duke.kenai.com/eco/GreenThumbLarge.png
           row write time 0.022714
           PNG decode ends 0.141308 (16%)

http://trac.webkit.org/browser/trunk/LayoutTests/fast/images/resources/red-at-12-oclock-with-color-profile.png
           row write time 0.001901
           PNG decode ends 0.012777 (14%)

(In reply to comment #2)

> The patch avoids using buffer.setRGBA(x,y), and writes direct to the frame buffer.  This reduces decoding time 2% or more for these images. 
 
http://duke.kenai.com/eco/DukeTreeHuggerSmall.png
           row write time 0.001829
           PNG decode ends 0.016632 (10%)
 
http://duke.kenai.com/eco/GreenThumbLarge.png
           row write time 0.017736
           PNG decode ends 0.136413 (13%)
 
http://trac.webkit.org/browser/trunk/LayoutTests/fast/images/resources/red-at-12-oclock-with-color-profile.png
           row write time 0.001005
           PNG decode ends 0.012086 (8%)
Comment 19 noel gordon 2012-11-26 18:39:50 PST
(In reply to comment #12)

> Incidentally, how are you instrumenting the code to run your performance checks? If this is documented somewhere I'd like to get up-to-speed with it for some future changes I'd like to try.

My only documentation is a patch.
Comment 20 noel gordon 2012-11-26 18:41:28 PST
Created attachment 176140 [details]
png.write.time.git.diff
Comment 21 noel gordon 2012-11-26 18:44:02 PST
Created attachment 176143 [details]
png.decoder.time.git.diff

This is the timing patch I used for the test results on this bug.  Enjoy.
Comment 22 noel gordon 2012-11-27 03:22:22 PST
(In reply to comment #3)

> Divides are costly. It might save ~20 cycles to multiply a by a precomputed "1/255.0f" I'm not sure if this would create any significant inaccuracies due to the vagaries of floating point math...

Given the above timing patch, one can try multiplying by the constant factor 1/255.0 to avoid the division.  My result on Chromium Mac is no speed improvement :/
Comment 23 Brent Fulgham 2012-11-27 10:43:27 PST
(In reply to comment #22)
> (In reply to comment #3)
> 
> > Divides are costly. It might save ~20 cycles to multiply a by a precomputed "1/255.0f" I'm not sure if this would create any significant inaccuracies due to the vagaries of floating point math...
> 
> Given the above timing patch, one can try multiplying by the constant factor 1/255.0 to avoid the division.  My result on Chromium Mac is no speed improvement :/

Thanks for checking -- it was worth a try.  I've seen this shave a few percent off of tight loops in the past, but we must not be hitting this hard enough for typical use cases.

There are other places where we do this kind of pass over the entire contents of a buffer.  Converting to multiplies across the board would probably help some, and taking advantage of SSE vectorizing would potentially be an even bigger win, assuming alignment is dealt with properly.
Comment 24 noel gordon 2012-11-27 14:25:55 PST
(In reply to comment #23)

I've similarly seen a few percent shaved off tight loops by converting a div to a mul (example below).  Perhaps Clang automatically computes the inverse.

> Converting to multiplies across the board would probably help some, and taking advantage of SSE vectorizing would potentially be an even bigger win ...

Indeed.  Skia has alpha pre-multiplication converters that avoid the division by using shifts 

http://code.google.com/searchframe#OAMlx_jo-ck/src/third_party/skia/src/core/SkMathPriv.h&exact_package=chromium&q=SkMulDiv255&type=cs&l=72

and SSE would be faster still.  However, doing it right either way requires a  understanding of whether to round, truncate, or ceiling the result.
Comment 25 noel gordon 2012-11-27 14:31:18 PST
Also you asked for ImageFrame width() and height() to be inlined in this bug and I prepared a separate patch for that: bug 103401 -- did I misunderstand https://bugs.webkit.org/show_bug.cgi?id=88424#c38 ??
Comment 26 Brent Fulgham 2012-11-27 14:49:03 PST
(In reply to comment #25)
> Also you asked for ImageFrame width() and height() to be inlined in this bug and I prepared a separate patch for that: bug 103401 -- did I misunderstand https://bugs.webkit.org/show_bug.cgi?id=88424#c38 ??

No -- you didn't misunderstand, though it may not be worth doing in its own patch.  I think ostap's current patch  (rebaselined after landing this bug) includes the inlining.

I think it would be good to combine the width/height inlining with a possible templatized setRGBA implementation that gets rid of the branching for premultiplication, since we know outside the main loop whether we need to apply those transformations or not.  No need to test for each pixel.
Comment 27 noel gordon 2012-11-27 16:05:33 PST
(In reply to comment #26)
> (In reply to comment #25)
> > Also you asked for ImageFrame width() and height() to be inlined in this bug and I prepared a separate patch for that: bug 103401 -- did I misunderstand https://bugs.webkit.org/show_bug.cgi?id=88424#c38 ??
> 
> No -- you didn't misunderstand, though it may not be worth doing in its own patch.  I think ostap's current patch  (rebaselined after landing this bug) includes the inlining.

But as public symbols.  Can we please not do that.

> I think it would be good to combine the width/height inlining with a possible templatized setRGBA implementation that gets rid of the branching for premultiplication, since we know outside the main loop whether we need to apply those transformations or not.  No need to test for each pixel.

Noting that the 3d canvas (WebGL) performance and tests will need to be looked as a result.  Perhaps one simple step at a time.
Comment 28 Brent Fulgham 2012-11-27 16:51:53 PST
(In reply to comment #27)
> But as public symbols.  Can we please not do that.

Oh!  Right, they don't need to be exposed at all.

> Noting that the 3d canvas (WebGL) performance and tests will need to be looked as a result.  Perhaps one simple step at a time.

Sounds good.  By the way, I've created a patch to track adding ostap's profiling utility (https://bugs.webkit.org/show_bug.cgi?id=103463).