RESOLVED INVALID 68353
[Qt] Apply ParallelJobs for ImageBuffer::platformTransformColorSpace method
https://bugs.webkit.org/show_bug.cgi?id=68353
Summary [Qt] Apply ParallelJobs for ImageBuffer::platformTransformColorSpace method
Andras Piroska
Reported 2011-09-19 06:10:08 PDT
Based on the measurements of the oprofile, the ImageBuffer::platformTransformColorSpace method used often by such functions like FEGaussian filter. The computation can be distributed to multiple core if the architecture supports.
Attachments
Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method. (4.39 KB, patch)
2011-09-19 06:32 PDT, Andras Piroska
no flags
Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method (4.38 KB, patch)
2011-09-20 05:37 PDT, Andras Piroska
no flags
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method (4.39 KB, patch)
2011-09-20 06:22 PDT, Andras Piroska
kling: review-
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method (5.31 KB, patch)
2011-10-05 00:03 PDT, Andras Piroska
no flags
[Qt] A test svg to the ImageBuffer::platformTransformColorSpace method (634 bytes, image/svg+xml)
2011-10-05 00:49 PDT, Andras Piroska
no flags
[Qt] A test svg to the ImageBuffer::platformTransformColorSpace method (605 bytes, image/svg+xml)
2011-10-05 01:46 PDT, Andras Piroska
no flags
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method (6.32 KB, patch)
2011-10-10 06:42 PDT, Andras Piroska
no flags
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method (6.32 KB, patch)
2011-10-12 04:50 PDT, Andras Piroska
hausmann: review-
hausmann: commit-queue-
Andras Piroska
Comment 1 2011-09-19 06:32:10 PDT
Created attachment 107845 [details] Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method.
Andras Piroska
Comment 2 2011-09-20 05:37:01 PDT
Created attachment 107987 [details] Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method
Csaba Osztrogonác
Comment 3 2011-09-20 05:38:54 PDT
Comment on attachment 107845 [details] Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method. You shouldn't set r-, but remove r? and set obsolete flag
Andras Piroska
Comment 4 2011-09-20 06:22:08 PDT
Created attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method
Andreas Kling
Comment 5 2011-09-20 06:44:26 PDT
Comment on attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method View in context: https://bugs.webkit.org/attachment.cgi?id=107994&action=review > Source/WebCore/ChangeLog:14 > + No new tests / no behavioral changes. So how much faster is it? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:169 > +static const int minimalAreaOfImage = 128 * 128; // Empirical data limit for parallel jobs Empirical data limit? Where does this value come from? The name 'minimalAreaOfImage' doesn't sufficiently explain what the constant does. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:171 > +struct ThreadParam { We try not to abbreviate unnecessarily in WebKit. ThreadParameters? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:172 > + int yStart, yEnd; One declaration per line. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:179 > +static void worker(ThreadParam* tp) 'tp' is a bad variable name. parameters? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:189 > + for (int x = 0; x < tp->width; ++x) { > + QRgb& pixel = scanLine[x]; > + pixel = qRgba((*tp->lookUpTable)[qRed(pixel)], > + (*tp->lookUpTable)[qGreen(pixel)], > + (*tp->lookUpTable)[qBlue(pixel)], > + qAlpha(pixel)); > + } First off, we should share this code with the non-parallel path. Second, this is a terrible algorithm, performance wise. I'm sure you can make it a *lot* faster by manipulating the color byte fields directly instead of going through qRgba and qRed/qGreen/qBlue/qAlpha. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:208 > + int areaOfImage = m_size.width() * m_size.height(); > + int optimalThreadCount = areaOfImage / minimalAreaOfImage; Granted, I haven't studied the math behind this, but I would think that the optimal thread counts relates more to the number of scanlines, not so much to the X*Y area. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:210 > + ParallelJobs<ThreadParam> parallelJobs(&worker, 2); Why hard-code 2 threads if we know the optimal thread count? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:211 > + int threadCount = parallelJobs.numberOfJobs(); You don't need this variable. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:213 > + int dy = m_size.height() / threadCount; 'dy' is a bad variable name. scanLinesPerThread? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:215 > + ThreadParam& tp = parallelJobs.parameter(i); 'tp' is a bad variable name. parameters? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:218 > + tp.yEnd = (i == threadCount-1) ? m_size.height() : yStart; Coding style, threadCount-1 should be threadCount - 1
Balazs Kelemen
Comment 6 2011-09-20 06:48:44 PDT
Comment on attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method View in context: https://bugs.webkit.org/attachment.cgi?id=107994&action=review Informal review - just some style nit, I didn't check it algoritmically. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:170 > +static const int minimalAreaOfImage = 128 * 128; // Empirical data limit for parallel jobs > + This seems to be a confusing name. minimalAreaForWorkerThread? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:180 > +static void worker(ThreadParam* tp) > +{ 1. worker does not say too much. Please chose a name for the function that suggests it's task. 2. tp is not a valid name, please use non-abbreviated variable names. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:209 > + if (1 != optimalThreadCount) { It's not common in WebKit to write the constant first, please transpose the expression.
Andreas Kling
Comment 7 2011-09-20 07:47:18 PDT
Comment on attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Resetting r-, not sure why Balazs removed it.
Csaba Osztrogonác
Comment 8 2011-09-20 08:10:19 PDT
(In reply to comment #7) > (From update of attachment 107994 [details]) > Resetting r-, not sure why Balazs removed it. I think it was a conflict between your reviews ...
Andras Piroska
Comment 9 2011-09-20 23:40:28 PDT
Sorry, I will fix it.
Balazs Kelemen
Comment 10 2011-09-21 04:13:15 PDT
(In reply to comment #8) > (In reply to comment #7) > > (From update of attachment 107994 [details] [details]) > > Resetting r-, not sure why Balazs removed it. > > I think it was a conflict between your reviews ... Yes, it was not intentional, sorry.
Andras Piroska
Comment 11 2011-10-05 00:03:26 PDT
Created attachment 109750 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method At the websites not cause regression. The performance growth is approximately 3% and this patch only influence to SVG filters exclusively. So I don't know that result is enough to accept the patch.
Andras Piroska
Comment 12 2011-10-05 00:15:24 PDT
Err, I just wanted to say that is this speed-up progression enough for that compexity?
Andreas Kling
Comment 13 2011-10-05 00:42:33 PDT
(In reply to comment #12) > Err, I just wanted to say that is this speed-up progression enough for that compexity? I'm surprised it's only 3%. How are you measuring it exactly?
Andreas Kling
Comment 14 2011-10-05 00:49:39 PDT
Comment on attachment 109750 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method View in context: https://bugs.webkit.org/attachment.cgi?id=109750&action=review > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:208 > + const int optimalThreadCount = min(m_size.height(), areaOfImage / areaLimitForParallelJobs); Could you explain this to me? Why is this better than deriving the optimal thread count from the number of scanlines total? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:213 > + Unnecessary newline. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:216 > + for (int i = 0; i < parallelJobs.numberOfJobs(); i++) { numberOfJobs() returns a size_t, consequently 'i' should be a size_t. Also, WebKit style prefers prefix increment (++i) to postfix (i++.) > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:239 > + ThreadParameters parameters; > + parameters.yStart = 0; > + parameters.yEnd = m_size.height(); > + parameters.width = m_size.width(); > + parameters.bits = bits; > + parameters.bytesPerLine = bytesPerLine; > + parameters.lookUpTable = &lookUpTable; > + transformColorWorker(&parameters); I like that you're sharing the code between the threaded and non-threaded paths. Could we refactor it slightly to look less thread-specific? For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object. Then the non-threading code doesn't need to know about ThreadParameters.
Andras Piroska
Comment 15 2011-10-05 00:49:42 PDT
Created attachment 109755 [details] [Qt] A test svg to the ImageBuffer::platformTransformColorSpace method This is an svg example which could show the performance improvement for the patch. A similar result can be measured with the following filters: FEBlend, FEColorMatrix, FEComposite, FEDisplacement, FEDropshadow, FEGauss, FELightning, FEMerge, FEMorph, FEOffset, FETile...
Andreas Kling
Comment 16 2011-10-05 00:54:41 PDT
(In reply to comment #15) > Created an attachment (id=109755) [details] > [Qt] A test svg to the ImageBuffer::platformTransformColorSpace method > > This is an svg example which could show the performance improvement for the patch. > A similar result can be measured with the following filters: > FEBlend, FEColorMatrix, FEComposite, FEDisplacement, FEDropshadow, > FEGauss, FELightning, FEMerge, FEMorph, FEOffset, FETile... This svg gives me "Uncaught TypeError: Object [object DOMWindow] has no method 'methanol_next'" I'm more interested in how you arrived in the 3% number though. 3% total runtime improvement? 3% less time spent in platformTransformColorSpace()? Statistical profiler? Callgrind? Etc :)
Gabor Loki
Comment 17 2011-10-05 01:13:21 PDT
> This svg gives me "Uncaught TypeError: Object [object DOMWindow] has no method 'methanol_next'" I guess Andras forgot to remove that onload event. > I'm more interested in how you arrived in the 3% number though. 3% total runtime improvement? 3% less time spent in platformTransformColorSpace()? Statistical profiler? Callgrind? Etc :) As the onload event said the measurement engine was the Methanol (http://gitorious.org/methanol) ;) This engine measures the page loads. So the 3% is the total runtime improvement on those SVG filters.
Andreas Kling
Comment 18 2011-10-05 01:25:19 PDT
(In reply to comment #17) > As the onload event said the measurement engine was the Methanol (http://gitorious.org/methanol) ;) > This engine measures the page loads. So the 3% is the total runtime improvement on those SVG filters. Right, I cloned that repo and opened the fire-svg.html and fire-smp.html, and I get this error: "Uncaught TypeError: Property 'methanol_next' of object [object DOMWindow] is not a function" Maybe I'm supposed to open something else? There are no instructions AFAICT. But never mind that. So what we're talking about here is a 3% overall improvement on a test suite that exercises a myriad of functionality, not necessarily even hitting platformTransformColorSpace() in all tests etc? That's not a bad thing in itself, but we should know how much faster the new platformTransformColorSpace() is compared to the old version.
Andras Piroska
Comment 19 2011-10-05 01:30:43 PDT
> Could you explain this to me? > Why is this better than deriving the optimal thread count from the number of scanlines total? If the image is 5x120 the algorithm will separate the picture (at 4 cores) to four 5x30 image, but for so small images the overhead much bigger then the gain. Our experience showed the overhead disappears below approx. 256x256 pixel per thread. You should also consider there 120x5 image which is also not a good candidate for parallelization so we should check the if the image has the minimum area(width * height) and a minimum scaleline as well. And must be sure that the count of threads not be more than the height of the image. > I like that you're sharing the code between the threaded and non-threaded paths. > Could we refactor it slightly to look less thread-specific? > For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object. > Then the non-threading code doesn't need to know about ThreadParameters. Not, the worker function must be get only one parameter. That is a limitation of Parallel_Jobs API.
Andras Piroska
Comment 20 2011-10-05 01:46:03 PDT
Created attachment 109764 [details] [Qt] A test svg to the ImageBuffer::platformTransformColorSpace method I removed the "onload="top.methanol_next()" code from the svg, but I don't know how this bug came out.
Andreas Kling
Comment 21 2011-10-05 01:46:41 PDT
(In reply to comment #19) > > Could you explain this to me? > > Why is this better than deriving the optimal thread count from the number of scanlines total? > If the image is 5x120 the algorithm will separate the picture (at 4 cores) to four 5x30 image, but for so small images the overhead much bigger then the gain. > Our experience showed the overhead disappears below approx. 256x256 pixel per thread. > You should also consider there 120x5 image which is also not a good candidate for parallelization so we should check the if the image has the minimum area(width * height) and a minimum scaleline as well. > And must be sure that the count of threads not be more than the height of the image. Fair enough. So what about a 100x1000 image? That would use a single thread with this formula, and seems like an awesome candidate for more parallelization. (Please correct me if I'm wrong. :) > > I like that you're sharing the code between the threaded and non-threaded paths. > > Could we refactor it slightly to look less thread-specific? > > For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object. > > Then the non-threading code doesn't need to know about ThreadParameters. > Not, the worker function must be get only one parameter. That is a limitation of Parallel_Jobs API. The worker function could be a one-liner that calls an inline function with the appropriate parameters.
Andras Piroska
Comment 22 2011-10-05 02:13:59 PDT
> Fair enough. So what about a 100x1000 image? That would use a single thread with this formula, and seems like an awesome candidate for more parallelization. (Please correct me if I'm wrong. :) In this case the image slices are too small (at 2 cores: 50x1000, that implies so many pixels than a 200x250 sized image) > The worker function could be a one-liner that calls an inline function with the appropriate parameters. Possible to write an additional method, that take out the fields of this struct and pass these parameters to the actual worker function.
Andras Piroska
Comment 23 2011-10-10 06:42:06 PDT
Created attachment 110354 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method. The worker function calls now in one-line.
WebKit Review Bot
Comment 24 2011-10-11 07:45:32 PDT
Attachment 110354 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:198: Extra space before ) [whitespace/parens] [2] Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:244: Missing space after , [whitespace/comma] [3] Total errors found: 2 in 3 files If any of these errors are false positives, please file a bug against check-webkit-style.
Zoltan Horvath
Comment 25 2011-10-12 03:10:11 PDT
I modified the summary from paralell to parallel, please modify it in your next patch also!
Andras Piroska
Comment 26 2011-10-12 04:50:58 PDT
Created attachment 110668 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Sorry, I fixed the style errors.
Simon Hausmann
Comment 27 2011-10-18 00:01:37 PDT
(In reply to comment #18) [...] > But never mind that. So what we're talking about here is a 3% overall improvement on a test suite that exercises a myriad of functionality, not necessarily even hitting platformTransformColorSpace() in all tests etc? That's not a bad thing in itself, but we should know how much faster the new platformTransformColorSpace() is compared to the old version. I share a bit of Andreas' sentiment here. It would be good to know how much this particular function improved and much more importantly it would be good to know _why_. "Blind" optimizations are dangerous :) It is not apparent to me why distributing the function over different cores improves performance. From looking at the function and its lack of a computational kernel I _suspect_ that the bottleneck here is actually memory throughput. If that is the case, then it may be worth researching other ways of improving the performance because they may lead to even bigger wins with less heuristics and the thread start-up overhead. I'm saying this not to criticize your code, but I'd just like to stress the importance of explaining what you saw in oprofile and why your patch works. And if it is a memory throughput problem, then I recommend trying some other cache optimization techniques in comparision to see which one is best.
Simon Hausmann
Comment 28 2011-11-01 14:07:17 PDT
Comment on attachment 110668 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Taking this out of the review queue. No response for two weeks and it's not clear what the patch improves (cache efficiency? cpu load distribution?) that leads to a lower run-time and how.
Andras Piroska
Comment 29 2011-11-03 02:16:43 PDT
Sorry, I forgot this bug because I received some new task. Soon I want to upload some measurements.
Andras Piroska
Comment 30 2011-11-08 04:41:35 PST
I made some measurements. The average acceleration of SVG filters is approx 3%. This is very little, doesn't cause considerable acceleration. In addition, I tested the platformTransformColorSpace() function, so I queried the time at the begin and at the end of this function and compared the runtime of the original code and the modified code. The result of this (function only) test in the best case is ~312sec for the original code ~281sec for the optimalized code This is also too little. I hoped the 3% speed-up enough to apply the patch and I think that it is not possible to accelerate it better.
Note You need to log in before you can comment on or make changes to this bug.