68353 – [Qt] Apply ParallelJobs for ImageBuffer::platformTransformColorSpace method

RESOLVED INVALID 68353

[Qt] Apply ParallelJobs for ImageBuffer::platformTransformColorSpace method

https://bugs.webkit.org/show_bug.cgi?id=68353

Summary [Qt] Apply ParallelJobs for ImageBuffer::platformTransformColorSpace method

Andras Piroska

Reported 2011-09-19 06:10:08 PDT

Based on the measurements of the oprofile, the ImageBuffer::platformTransformColorSpace method used often by such functions like FEGaussian filter. The computation can be distributed to multiple core if the architecture supports.

Attachments
Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method. (4.39 KB, patch) 2011-09-19 06:32 PDT, Andras Piroska	no flags	Details Formatted Diff Diff
Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method (4.38 KB, patch) 2011-09-20 05:37 PDT, Andras Piroska	no flags	Details Formatted Diff Diff
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method (4.39 KB, patch) 2011-09-20 06:22 PDT, Andras Piroska	kling: review-	Details Formatted Diff Diff
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method (5.31 KB, patch) 2011-10-05 00:03 PDT, Andras Piroska	no flags	Details Formatted Diff Diff
[Qt] A test svg to the ImageBuffer::platformTransformColorSpace method (634 bytes, image/svg+xml) 2011-10-05 00:49 PDT, Andras Piroska	no flags	Details
[Qt] A test svg to the ImageBuffer::platformTransformColorSpace method (605 bytes, image/svg+xml) 2011-10-05 01:46 PDT, Andras Piroska	no flags	Details
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method (6.32 KB, patch) 2011-10-10 06:42 PDT, Andras Piroska	no flags	Details Formatted Diff Diff
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method (6.32 KB, patch) 2011-10-12 04:50 PDT, Andras Piroska	hausmann: review- hausmann: commit-queue-	Details Formatted Diff Diff
Show Obsolete (6) View All Add attachment proposed patch, testcase, etc.

Andras Piroska

Comment 1 2011-09-19 06:32:10 PDT

Created attachment 107845 [details] Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method.

Andras Piroska

Comment 2 2011-09-20 05:37:01 PDT

Created attachment 107987 [details] Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method

Csaba Osztrogonác

Comment 3 2011-09-20 05:38:54 PDT

Comment on attachment 107845 [details] Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method. You shouldn't set r-, but remove r? and set obsolete flag

Andras Piroska

Comment 4 2011-09-20 06:22:08 PDT

Created attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method

Andreas Kling

Comment 5 2011-09-20 06:44:26 PDT

Comment on attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method View in context: https://bugs.webkit.org/attachment.cgi?id=107994&action=review > Source/WebCore/ChangeLog:14 > + No new tests / no behavioral changes. So how much faster is it? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:169 > +static const int minimalAreaOfImage = 128 * 128; // Empirical data limit for parallel jobs Empirical data limit? Where does this value come from? The name 'minimalAreaOfImage' doesn't sufficiently explain what the constant does. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:171 > +struct ThreadParam { We try not to abbreviate unnecessarily in WebKit. ThreadParameters? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:172 > + int yStart, yEnd; One declaration per line. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:179 > +static void worker(ThreadParam* tp) 'tp' is a bad variable name. parameters? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:189 > + for (int x = 0; x < tp->width; ++x) { > + QRgb& pixel = scanLine[x]; > + pixel = qRgba((*tp->lookUpTable)[qRed(pixel)], > + (*tp->lookUpTable)[qGreen(pixel)], > + (*tp->lookUpTable)[qBlue(pixel)], > + qAlpha(pixel)); > + } First off, we should share this code with the non-parallel path. Second, this is a terrible algorithm, performance wise. I'm sure you can make it a *lot* faster by manipulating the color byte fields directly instead of going through qRgba and qRed/qGreen/qBlue/qAlpha. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:208 > + int areaOfImage = m_size.width() * m_size.height(); > + int optimalThreadCount = areaOfImage / minimalAreaOfImage; Granted, I haven't studied the math behind this, but I would think that the optimal thread counts relates more to the number of scanlines, not so much to the X*Y area. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:210 > + ParallelJobs<ThreadParam> parallelJobs(&worker, 2); Why hard-code 2 threads if we know the optimal thread count? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:211 > + int threadCount = parallelJobs.numberOfJobs(); You don't need this variable. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:213 > + int dy = m_size.height() / threadCount; 'dy' is a bad variable name. scanLinesPerThread? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:215 > + ThreadParam& tp = parallelJobs.parameter(i); 'tp' is a bad variable name. parameters? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:218 > + tp.yEnd = (i == threadCount-1) ? m_size.height() : yStart; Coding style, threadCount-1 should be threadCount - 1

Balazs Kelemen

Comment 6 2011-09-20 06:48:44 PDT

Comment on attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method View in context: https://bugs.webkit.org/attachment.cgi?id=107994&action=review Informal review - just some style nit, I didn't check it algoritmically. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:170 > +static const int minimalAreaOfImage = 128 * 128; // Empirical data limit for parallel jobs > + This seems to be a confusing name. minimalAreaForWorkerThread? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:180 > +static void worker(ThreadParam* tp) > +{ 1. worker does not say too much. Please chose a name for the function that suggests it's task. 2. tp is not a valid name, please use non-abbreviated variable names. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:209 > + if (1 != optimalThreadCount) { It's not common in WebKit to write the constant first, please transpose the expression.

Andreas Kling

Comment 7 2011-09-20 07:47:18 PDT

Comment on attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Resetting r-, not sure why Balazs removed it.

Csaba Osztrogonác

Comment 8 2011-09-20 08:10:19 PDT

(In reply to comment #7) > (From update of attachment 107994 [details]) > Resetting r-, not sure why Balazs removed it. I think it was a conflict between your reviews ...

Andras Piroska

Comment 9 2011-09-20 23:40:28 PDT

Sorry, I will fix it.

Balazs Kelemen

Comment 10 2011-09-21 04:13:15 PDT

(In reply to comment #8) > (In reply to comment #7) > > (From update of attachment 107994 [details] [details]) > > Resetting r-, not sure why Balazs removed it. > > I think it was a conflict between your reviews ... Yes, it was not intentional, sorry.

Andras Piroska

Comment 11 2011-10-05 00:03:26 PDT

Created attachment 109750 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method At the websites not cause regression. The performance growth is approximately 3% and this patch only influence to SVG filters exclusively. So I don't know that result is enough to accept the patch.

Andras Piroska

Comment 12 2011-10-05 00:15:24 PDT

Err, I just wanted to say that is this speed-up progression enough for that compexity?

Andreas Kling

Comment 13 2011-10-05 00:42:33 PDT

(In reply to comment #12) > Err, I just wanted to say that is this speed-up progression enough for that compexity? I'm surprised it's only 3%. How are you measuring it exactly?

Andreas Kling

Comment 14 2011-10-05 00:49:39 PDT

Comment on attachment 109750 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method View in context: https://bugs.webkit.org/attachment.cgi?id=109750&action=review > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:208 > + const int optimalThreadCount = min(m_size.height(), areaOfImage / areaLimitForParallelJobs); Could you explain this to me? Why is this better than deriving the optimal thread count from the number of scanlines total? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:213 > + Unnecessary newline. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:216 > + for (int i = 0; i < parallelJobs.numberOfJobs(); i++) { numberOfJobs() returns a size_t, consequently 'i' should be a size_t. Also, WebKit style prefers prefix increment (++i) to postfix (i++.) > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:239 > + ThreadParameters parameters; > + parameters.yStart = 0; > + parameters.yEnd = m_size.height(); > + parameters.width = m_size.width(); > + parameters.bits = bits; > + parameters.bytesPerLine = bytesPerLine; > + parameters.lookUpTable = &lookUpTable; > + transformColorWorker(&parameters); I like that you're sharing the code between the threaded and non-threaded paths. Could we refactor it slightly to look less thread-specific? For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object. Then the non-threading code doesn't need to know about ThreadParameters.

Andras Piroska

Comment 15 2011-10-05 00:49:42 PDT

Created attachment 109755 [details] [Qt] A test svg to the ImageBuffer::platformTransformColorSpace method This is an svg example which could show the performance improvement for the patch. A similar result can be measured with the following filters: FEBlend, FEColorMatrix, FEComposite, FEDisplacement, FEDropshadow, FEGauss, FELightning, FEMerge, FEMorph, FEOffset, FETile...

Andreas Kling

Comment 16 2011-10-05 00:54:41 PDT

(In reply to comment #15) > Created an attachment (id=109755) [details] > [Qt] A test svg to the ImageBuffer::platformTransformColorSpace method > > This is an svg example which could show the performance improvement for the patch. > A similar result can be measured with the following filters: > FEBlend, FEColorMatrix, FEComposite, FEDisplacement, FEDropshadow, > FEGauss, FELightning, FEMerge, FEMorph, FEOffset, FETile... This svg gives me "Uncaught TypeError: Object [object DOMWindow] has no method 'methanol_next'" I'm more interested in how you arrived in the 3% number though. 3% total runtime improvement? 3% less time spent in platformTransformColorSpace()? Statistical profiler? Callgrind? Etc :)

Gabor Loki

Comment 17 2011-10-05 01:13:21 PDT

> This svg gives me "Uncaught TypeError: Object [object DOMWindow] has no method 'methanol_next'" I guess Andras forgot to remove that onload event. > I'm more interested in how you arrived in the 3% number though. 3% total runtime improvement? 3% less time spent in platformTransformColorSpace()? Statistical profiler? Callgrind? Etc :) As the onload event said the measurement engine was the Methanol (http://gitorious.org/methanol) ;) This engine measures the page loads. So the 3% is the total runtime improvement on those SVG filters.

Andreas Kling

Comment 18 2011-10-05 01:25:19 PDT

(In reply to comment #17) > As the onload event said the measurement engine was the Methanol (http://gitorious.org/methanol) ;) > This engine measures the page loads. So the 3% is the total runtime improvement on those SVG filters. Right, I cloned that repo and opened the fire-svg.html and fire-smp.html, and I get this error: "Uncaught TypeError: Property 'methanol_next' of object [object DOMWindow] is not a function" Maybe I'm supposed to open something else? There are no instructions AFAICT. But never mind that. So what we're talking about here is a 3% overall improvement on a test suite that exercises a myriad of functionality, not necessarily even hitting platformTransformColorSpace() in all tests etc? That's not a bad thing in itself, but we should know how much faster the new platformTransformColorSpace() is compared to the old version.

Andras Piroska

Comment 19 2011-10-05 01:30:43 PDT

> Could you explain this to me? > Why is this better than deriving the optimal thread count from the number of scanlines total? If the image is 5x120 the algorithm will separate the picture (at 4 cores) to four 5x30 image, but for so small images the overhead much bigger then the gain. Our experience showed the overhead disappears below approx. 256x256 pixel per thread. You should also consider there 120x5 image which is also not a good candidate for parallelization so we should check the if the image has the minimum area(width * height) and a minimum scaleline as well. And must be sure that the count of threads not be more than the height of the image. > I like that you're sharing the code between the threaded and non-threaded paths. > Could we refactor it slightly to look less thread-specific? > For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object. > Then the non-threading code doesn't need to know about ThreadParameters. Not, the worker function must be get only one parameter. That is a limitation of Parallel_Jobs API.

Andras Piroska

Comment 20 2011-10-05 01:46:03 PDT

Created attachment 109764 [details] [Qt] A test svg to the ImageBuffer::platformTransformColorSpace method I removed the "onload="top.methanol_next()" code from the svg, but I don't know how this bug came out.

Andreas Kling

Comment 21 2011-10-05 01:46:41 PDT

(In reply to comment #19) > > Could you explain this to me? > > Why is this better than deriving the optimal thread count from the number of scanlines total? > If the image is 5x120 the algorithm will separate the picture (at 4 cores) to four 5x30 image, but for so small images the overhead much bigger then the gain. > Our experience showed the overhead disappears below approx. 256x256 pixel per thread. > You should also consider there 120x5 image which is also not a good candidate for parallelization so we should check the if the image has the minimum area(width * height) and a minimum scaleline as well. > And must be sure that the count of threads not be more than the height of the image. Fair enough. So what about a 100x1000 image? That would use a single thread with this formula, and seems like an awesome candidate for more parallelization. (Please correct me if I'm wrong. :) > > I like that you're sharing the code between the threaded and non-threaded paths. > > Could we refactor it slightly to look less thread-specific? > > For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object. > > Then the non-threading code doesn't need to know about ThreadParameters. > Not, the worker function must be get only one parameter. That is a limitation of Parallel_Jobs API. The worker function could be a one-liner that calls an inline function with the appropriate parameters.

Andras Piroska

Comment 22 2011-10-05 02:13:59 PDT

> Fair enough. So what about a 100x1000 image? That would use a single thread with this formula, and seems like an awesome candidate for more parallelization. (Please correct me if I'm wrong. :) In this case the image slices are too small (at 2 cores: 50x1000, that implies so many pixels than a 200x250 sized image) > The worker function could be a one-liner that calls an inline function with the appropriate parameters. Possible to write an additional method, that take out the fields of this struct and pass these parameters to the actual worker function.

Andras Piroska

Comment 23 2011-10-10 06:42:06 PDT

Created attachment 110354 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method. The worker function calls now in one-line.

WebKit Review Bot

Comment 24 2011-10-11 07:45:32 PDT

Attachment 110354 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:198: Extra space before ) [whitespace/parens] [2] Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:244: Missing space after , [whitespace/comma] [3] Total errors found: 2 in 3 files If any of these errors are false positives, please file a bug against check-webkit-style.

Zoltan Horvath

Comment 25 2011-10-12 03:10:11 PDT

I modified the summary from paralell to parallel, please modify it in your next patch also!

Andras Piroska

Comment 26 2011-10-12 04:50:58 PDT

Created attachment 110668 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Sorry, I fixed the style errors.

Simon Hausmann

Comment 27 2011-10-18 00:01:37 PDT

(In reply to comment #18) [...] > But never mind that. So what we're talking about here is a 3% overall improvement on a test suite that exercises a myriad of functionality, not necessarily even hitting platformTransformColorSpace() in all tests etc? That's not a bad thing in itself, but we should know how much faster the new platformTransformColorSpace() is compared to the old version. I share a bit of Andreas' sentiment here. It would be good to know how much this particular function improved and much more importantly it would be good to know _why_. "Blind" optimizations are dangerous :) It is not apparent to me why distributing the function over different cores improves performance. From looking at the function and its lack of a computational kernel I _suspect_ that the bottleneck here is actually memory throughput. If that is the case, then it may be worth researching other ways of improving the performance because they may lead to even bigger wins with less heuristics and the thread start-up overhead. I'm saying this not to criticize your code, but I'd just like to stress the importance of explaining what you saw in oprofile and why your patch works. And if it is a memory throughput problem, then I recommend trying some other cache optimization techniques in comparision to see which one is best.

Simon Hausmann

Comment 28 2011-11-01 14:07:17 PDT

Comment on attachment 110668 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Taking this out of the review queue. No response for two weeks and it's not clear what the patch improves (cache efficiency? cpu load distribution?) that leads to a lower run-time and how.

Andras Piroska

Comment 29 2011-11-03 02:16:43 PDT

Sorry, I forgot this bug because I received some new task. Soon I want to upload some measurements.

Andras Piroska

Comment 30 2011-11-08 04:41:35 PST

I made some measurements. The average acceleration of SVG filters is approx 3%. This is very little, doesn't cause considerable acceleration. In addition, I tested the platformTransformColorSpace() function, so I queried the time at the begin and at the end of this function and compared the runtime of the original code and the modified code. The result of this (function only) test in the best case is ~312sec for the original code ~281sec for the optimalized code This is also too little. I hoped the 3% speed-up enough to apply the patch and I think that it is not possible to accelerate it better.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution INVALID

Priority P2

Severity Enhancement

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component Platform

Assignee

Nobody

Reported

2011-09-19 06:10 PDT

Modified

2011-11-15 04:41 PST History

CC List

9 users Show

URL

Keywords Performance, Qt, QtTriaged

Depends on

Blocks