RESOLVED FIXED Bug 103216
PNG decode performance: avoid using frame buffer.setRGBA(x,y)
https://bugs.webkit.org/show_bug.cgi?id=103216
Summary PNG decode performance: avoid using frame buffer.setRGBA(x,y)
noel gordon
Reported 2012-11-25 23:01:07 PST
Writing decoded row pixels to the frame buffer with buffer.setRGBA(x,y) is slow compared to writing direct to the frame buffer. I disabled color profiles in the PNG decoder, and instrumented the total decode time and time taken to write row data to the frame buffer. The amount in ()'s is the percentage of decode time spent writing row data. http://duke.kenai.com/eco/DukeTreeHuggerSmall.png row write time 0.002204 PNG decode ends 0.016935 (13%) http://duke.kenai.com/eco/DukeTreeHuggerSmall.png row write time 0.022714 PNG decode ends 0.141308 (16%) http://trac.webkit.org/browser/trunk/LayoutTests/fast/images/resources/red-at-12-oclock-with-color-profile.png row write time 0.001901 PNG decode ends 0.012777 (14%)
Attachments
Patch (5.68 KB, patch)
2012-11-25 23:05 PST, noel gordon
no flags
Patch for landing (5.62 KB, patch)
2012-11-26 17:21 PST, noel gordon
no flags
png.write.time.git.diff (3.59 KB, text/plain)
2012-11-26 18:41 PST, noel gordon
no flags
png.decoder.time.git.diff (4.33 KB, text/plain)
2012-11-26 18:44 PST, noel gordon
no flags
noel gordon
Comment 1 2012-11-25 23:05:23 PST
noel gordon
Comment 2 2012-11-25 23:12:25 PST
The patch avoids using buffer.setRGBA(x,y), and writes direct to the frame buffer. This reduces decoding time 2% or more for these images. http://duke.kenai.com/eco/DukeTreeHuggerSmall.png row write time 0.001829 PNG decode ends 0.016632 (10%) http://duke.kenai.com/eco/DukeTreeHuggerSmall.png row write time 0.017736 PNG decode ends 0.136413 (13%) http://trac.webkit.org/browser/trunk/LayoutTests/fast/images/resources/red-at-12-oclock-with-color-profile.png row write time 0.001005 PNG decode ends 0.012086 (8%)
Brent Fulgham
Comment 3 2012-11-26 00:26:01 PST
Comment on attachment 175930 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=175930&action=review Looks good. I'll r+ after the EWS finishes running. > Source/WebCore/platform/image-decoders/ImageDecoder.h:134 > + void reportMemoryUsage(MemoryObjectInfo*) const; Why was this moved? > Source/WebCore/platform/image-decoders/ImageDecoder.h:170 > + float alphaPercent = a / 255.0f; Divides are costly. It might save ~20 cycles to multiply a by a precomputed "1/255.0f" I'm not sure if this would create any significant inaccuracies due to the vagaries of floating point math...
noel gordon
Comment 4 2012-11-26 02:47:24 PST
(In reply to comment #3) > > Source/WebCore/platform/image-decoders/ImageDecoder.h:134 > > + void reportMemoryUsage(MemoryObjectInfo*) const; > > Why was this moved? It was in the middle of routines dealing with frame buffer writing. Seemed misplaced to me, so I moved it elsewhere. > > Source/WebCore/platform/image-decoders/ImageDecoder.h:170 > > + float alphaPercent = a / 255.0f; > > Divides are costly. It might save ~20 cycles to multiply a by a precomputed "1/255.0f" I'm not sure if this would create any significant inaccuracies due to the vagaries of floating point math... Indeed. We have inaccuracies as the code stands for other reasons, and a change here might introduce more. Not looking to change behavior on this bug.
Viatcheslav Ostapenko
Comment 5 2012-11-26 07:29:51 PST
This is basically duplicate of bug 88424 and recent patch there does the same for all image formats with some extra.
Viatcheslav Ostapenko
Comment 6 2012-11-26 07:53:22 PST
(In reply to comment #3) > (From update of attachment 175930 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=175930&action=review > > Looks good. I'll r+ after the EWS finishes running. I would ask you too look at the complete patch which has this as a part: https://bugs.webkit.org/attachment.cgi?id=175328&action=review
Brent Fulgham
Comment 7 2012-11-26 10:01:13 PST
(In reply to comment #6) > (In reply to comment #3) > > (From update of attachment 175930 [details] [details]) > > View in context: https://bugs.webkit.org/attachment.cgi?id=175930&action=review > > > > Looks good. I'll r+ after the EWS finishes running. > > I would ask you too look at the complete patch which has this as a part: https://bugs.webkit.org/attachment.cgi?id=175328&action=review Battle of the competing patches! I must admit that I prefer Viatcheslav's handling of the advancement of the address position, but otherwise I think that committing these as individual changes is better from a review standpoint. Noel, do you have committer rights?
Viatcheslav Ostapenko
Comment 8 2012-11-26 11:20:40 PST
(In reply to comment #7) > (In reply to comment #6) > > (In reply to comment #3) > > > (From update of attachment 175930 [details] [details] [details]) > > > View in context: https://bugs.webkit.org/attachment.cgi?id=175930&action=review > > > > > > Looks good. I'll r+ after the EWS finishes running. > > > > I would ask you too look at the complete patch which has this as a part: https://bugs.webkit.org/attachment.cgi?id=175328&action=review > > Battle of the competing patches! > > I must admit that I prefer Viatcheslav's Actually it is not my patch. I just rebaselined original patch from a person who submitted it half a year ago, but doesn't work on webkit anymore. This upstreamed patch showed great improvement on mobile devices where CPU branch prediction is not so good as on desktop processors.
noel gordon
Comment 9 2012-11-26 15:39:19 PST
(In reply to comment #5) > This is basically duplicate of bug 88424 and recent patch there does the same for all image formats with some extra. (In reply to comment #6) > > I would ask you too look at the complete patch which has this as a part: https://bugs.webkit.org/attachment.cgi?id=175328&action=review I would r- that patch: insufficient ChangeLog, it breaks JPEG on some ports, and is poorly tested, yet you claim it greatly improves performance on mobile mobile devices. We are shipping code in a production browser environments: production is the wrong place to test webkit patches.
noel gordon
Comment 10 2012-11-26 15:41:44 PST
(In reply to comment #7) > I must admit that I prefer Viatcheslav's handling of the advancement of the address position, but otherwise I think that committing these as individual changes is better from a review standpoint. Moreover, individual changes are easier to test, and roll out. > Noel, do you have committer rights? Yes.
Brent Fulgham
Comment 11 2012-11-26 16:38:15 PST
Comment on attachment 175930 [details] Patch Let's get this started!
Brent Fulgham
Comment 12 2012-11-26 16:40:17 PST
(In reply to comment #10) > (In reply to comment #7) > > > I must admit that I prefer Viatcheslav's handling of the advancement of the address position, but otherwise I think that committing these as individual changes is better from a review standpoint. > > Moreover, individual changes are easier to test, and roll out. > > > Noel, do you have committer rights? > > Yes. Okay. I marked this as r+, and cq+. I'll update 88424 to rebaseline after this lands, as those changes will no longer be necessary. Incidentally, how are you instrumenting the code to run your performance checks? If this is documented somewhere I'd like to get up-to-speed with it for some future changes I'd like to try.
WebKit Review Bot
Comment 13 2012-11-26 16:58:45 PST
Comment on attachment 175930 [details] Patch Rejecting attachment 175930 [details] from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 2 Last 500 characters of output: (content): Merge conflict in Source/WebKit/chromium/ChangeLog Failed to merge in the changes. Patch failed at 0001 RenderStyle: Move 'list-style-image' to rare inherited data. When you have resolved this problem run "git rebase --continue". If you would prefer to skip this patch, instead run "git rebase --skip". To restore the original branch and stop rebasing run "git rebase --abort". rebase refs/remotes/origin/master: command returned error: 1 Died at Tools/Scripts/update-webkit line 154. Full output: http://queues.webkit.org/results/14989805
noel gordon
Comment 14 2012-11-26 17:21:21 PST
Created attachment 176121 [details] Patch for landing
Brent Fulgham
Comment 15 2012-11-26 17:25:20 PST
Comment on attachment 176121 [details] Patch for landing Thanks for rebaselining.
WebKit Review Bot
Comment 16 2012-11-26 18:11:04 PST
Comment on attachment 176121 [details] Patch for landing Clearing flags on attachment: 176121 Committed r135799: <http://trac.webkit.org/changeset/135799>
WebKit Review Bot
Comment 17 2012-11-26 18:11:08 PST
All reviewed patches have been landed. Closing bug.
noel gordon
Comment 18 2012-11-26 18:15:57 PST
(In reply to comment #0) The second image URL was wrong: should be http://duke.kenai.com/eco/GreenThumbLarge.png > Writing decoded row pixels to the frame buffer with buffer.setRGBA(x,y) is slow compared to writing direct to the frame buffer. > > I disabled color profiles in the PNG decoder, and instrumented the total decode time and time taken to write row data to the frame buffer. The amount in ()'s is the percentage of decode time spent writing row data. http://duke.kenai.com/eco/DukeTreeHuggerSmall.png row write time 0.002204 PNG decode ends 0.016935 (13%) http://duke.kenai.com/eco/GreenThumbLarge.png row write time 0.022714 PNG decode ends 0.141308 (16%) http://trac.webkit.org/browser/trunk/LayoutTests/fast/images/resources/red-at-12-oclock-with-color-profile.png row write time 0.001901 PNG decode ends 0.012777 (14%) (In reply to comment #2) > The patch avoids using buffer.setRGBA(x,y), and writes direct to the frame buffer. This reduces decoding time 2% or more for these images. http://duke.kenai.com/eco/DukeTreeHuggerSmall.png row write time 0.001829 PNG decode ends 0.016632 (10%) http://duke.kenai.com/eco/GreenThumbLarge.png row write time 0.017736 PNG decode ends 0.136413 (13%) http://trac.webkit.org/browser/trunk/LayoutTests/fast/images/resources/red-at-12-oclock-with-color-profile.png row write time 0.001005 PNG decode ends 0.012086 (8%)
noel gordon
Comment 19 2012-11-26 18:39:50 PST
(In reply to comment #12) > Incidentally, how are you instrumenting the code to run your performance checks? If this is documented somewhere I'd like to get up-to-speed with it for some future changes I'd like to try. My only documentation is a patch.
noel gordon
Comment 20 2012-11-26 18:41:28 PST
Created attachment 176140 [details] png.write.time.git.diff
noel gordon
Comment 21 2012-11-26 18:44:02 PST
Created attachment 176143 [details] png.decoder.time.git.diff This is the timing patch I used for the test results on this bug. Enjoy.
noel gordon
Comment 22 2012-11-27 03:22:22 PST
(In reply to comment #3) > Divides are costly. It might save ~20 cycles to multiply a by a precomputed "1/255.0f" I'm not sure if this would create any significant inaccuracies due to the vagaries of floating point math... Given the above timing patch, one can try multiplying by the constant factor 1/255.0 to avoid the division. My result on Chromium Mac is no speed improvement :/
Brent Fulgham
Comment 23 2012-11-27 10:43:27 PST
(In reply to comment #22) > (In reply to comment #3) > > > Divides are costly. It might save ~20 cycles to multiply a by a precomputed "1/255.0f" I'm not sure if this would create any significant inaccuracies due to the vagaries of floating point math... > > Given the above timing patch, one can try multiplying by the constant factor 1/255.0 to avoid the division. My result on Chromium Mac is no speed improvement :/ Thanks for checking -- it was worth a try. I've seen this shave a few percent off of tight loops in the past, but we must not be hitting this hard enough for typical use cases. There are other places where we do this kind of pass over the entire contents of a buffer. Converting to multiplies across the board would probably help some, and taking advantage of SSE vectorizing would potentially be an even bigger win, assuming alignment is dealt with properly.
noel gordon
Comment 24 2012-11-27 14:25:55 PST
(In reply to comment #23) I've similarly seen a few percent shaved off tight loops by converting a div to a mul (example below). Perhaps Clang automatically computes the inverse. > Converting to multiplies across the board would probably help some, and taking advantage of SSE vectorizing would potentially be an even bigger win ... Indeed. Skia has alpha pre-multiplication converters that avoid the division by using shifts http://code.google.com/searchframe#OAMlx_jo-ck/src/third_party/skia/src/core/SkMathPriv.h&exact_package=chromium&q=SkMulDiv255&type=cs&l=72 and SSE would be faster still. However, doing it right either way requires a understanding of whether to round, truncate, or ceiling the result.
noel gordon
Comment 25 2012-11-27 14:31:18 PST
Also you asked for ImageFrame width() and height() to be inlined in this bug and I prepared a separate patch for that: bug 103401 -- did I misunderstand https://bugs.webkit.org/show_bug.cgi?id=88424#c38 ??
Brent Fulgham
Comment 26 2012-11-27 14:49:03 PST
(In reply to comment #25) > Also you asked for ImageFrame width() and height() to be inlined in this bug and I prepared a separate patch for that: bug 103401 -- did I misunderstand https://bugs.webkit.org/show_bug.cgi?id=88424#c38 ?? No -- you didn't misunderstand, though it may not be worth doing in its own patch. I think ostap's current patch (rebaselined after landing this bug) includes the inlining. I think it would be good to combine the width/height inlining with a possible templatized setRGBA implementation that gets rid of the branching for premultiplication, since we know outside the main loop whether we need to apply those transformations or not. No need to test for each pixel.
noel gordon
Comment 27 2012-11-27 16:05:33 PST
(In reply to comment #26) > (In reply to comment #25) > > Also you asked for ImageFrame width() and height() to be inlined in this bug and I prepared a separate patch for that: bug 103401 -- did I misunderstand https://bugs.webkit.org/show_bug.cgi?id=88424#c38 ?? > > No -- you didn't misunderstand, though it may not be worth doing in its own patch. I think ostap's current patch (rebaselined after landing this bug) includes the inlining. But as public symbols. Can we please not do that. > I think it would be good to combine the width/height inlining with a possible templatized setRGBA implementation that gets rid of the branching for premultiplication, since we know outside the main loop whether we need to apply those transformations or not. No need to test for each pixel. Noting that the 3d canvas (WebGL) performance and tests will need to be looked as a result. Perhaps one simple step at a time.
Brent Fulgham
Comment 28 2012-11-27 16:51:53 PST
(In reply to comment #27) > But as public symbols. Can we please not do that. Oh! Right, they don't need to be exposed at all. > Noting that the 3d canvas (WebGL) performance and tests will need to be looked as a result. Perhaps one simple step at a time. Sounds good. By the way, I've created a patch to track adding ostap's profiling utility (https://bugs.webkit.org/show_bug.cgi?id=103463).
Note You need to log in before you can comment on or make changes to this bug.