Based on the measurements of the oprofile, the ImageBuffer::platformTransformColorSpace method used often by such functions like FEGaussian filter. The computation can be distributed to multiple core if the architecture supports.
Created attachment 107845 [details] Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method.
Created attachment 107987 [details] Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method
Comment on attachment 107845 [details] Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method. You shouldn't set r-, but remove r? and set obsolete flag
Created attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method
Comment on attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method View in context: https://bugs.webkit.org/attachment.cgi?id=107994&action=review > Source/WebCore/ChangeLog:14 > + No new tests / no behavioral changes. So how much faster is it? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:169 > +static const int minimalAreaOfImage = 128 * 128; // Empirical data limit for parallel jobs Empirical data limit? Where does this value come from? The name 'minimalAreaOfImage' doesn't sufficiently explain what the constant does. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:171 > +struct ThreadParam { We try not to abbreviate unnecessarily in WebKit. ThreadParameters? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:172 > + int yStart, yEnd; One declaration per line. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:179 > +static void worker(ThreadParam* tp) 'tp' is a bad variable name. parameters? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:189 > + for (int x = 0; x < tp->width; ++x) { > + QRgb& pixel = scanLine[x]; > + pixel = qRgba((*tp->lookUpTable)[qRed(pixel)], > + (*tp->lookUpTable)[qGreen(pixel)], > + (*tp->lookUpTable)[qBlue(pixel)], > + qAlpha(pixel)); > + } First off, we should share this code with the non-parallel path. Second, this is a terrible algorithm, performance wise. I'm sure you can make it a *lot* faster by manipulating the color byte fields directly instead of going through qRgba and qRed/qGreen/qBlue/qAlpha. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:208 > + int areaOfImage = m_size.width() * m_size.height(); > + int optimalThreadCount = areaOfImage / minimalAreaOfImage; Granted, I haven't studied the math behind this, but I would think that the optimal thread counts relates more to the number of scanlines, not so much to the X*Y area. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:210 > + ParallelJobs<ThreadParam> parallelJobs(&worker, 2); Why hard-code 2 threads if we know the optimal thread count? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:211 > + int threadCount = parallelJobs.numberOfJobs(); You don't need this variable. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:213 > + int dy = m_size.height() / threadCount; 'dy' is a bad variable name. scanLinesPerThread? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:215 > + ThreadParam& tp = parallelJobs.parameter(i); 'tp' is a bad variable name. parameters? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:218 > + tp.yEnd = (i == threadCount-1) ? m_size.height() : yStart; Coding style, threadCount-1 should be threadCount - 1
Comment on attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method View in context: https://bugs.webkit.org/attachment.cgi?id=107994&action=review Informal review - just some style nit, I didn't check it algoritmically. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:170 > +static const int minimalAreaOfImage = 128 * 128; // Empirical data limit for parallel jobs > + This seems to be a confusing name. minimalAreaForWorkerThread? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:180 > +static void worker(ThreadParam* tp) > +{ 1. worker does not say too much. Please chose a name for the function that suggests it's task. 2. tp is not a valid name, please use non-abbreviated variable names. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:209 > + if (1 != optimalThreadCount) { It's not common in WebKit to write the constant first, please transpose the expression.
Comment on attachment 107994 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Resetting r-, not sure why Balazs removed it.
(In reply to comment #7) > (From update of attachment 107994 [details]) > Resetting r-, not sure why Balazs removed it. I think it was a conflict between your reviews ...
Sorry, I will fix it.
(In reply to comment #8) > (In reply to comment #7) > > (From update of attachment 107994 [details] [details]) > > Resetting r-, not sure why Balazs removed it. > > I think it was a conflict between your reviews ... Yes, it was not intentional, sorry.
Created attachment 109750 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method At the websites not cause regression. The performance growth is approximately 3% and this patch only influence to SVG filters exclusively. So I don't know that result is enough to accept the patch.
Err, I just wanted to say that is this speed-up progression enough for that compexity?
(In reply to comment #12) > Err, I just wanted to say that is this speed-up progression enough for that compexity? I'm surprised it's only 3%. How are you measuring it exactly?
Comment on attachment 109750 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method View in context: https://bugs.webkit.org/attachment.cgi?id=109750&action=review > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:208 > + const int optimalThreadCount = min(m_size.height(), areaOfImage / areaLimitForParallelJobs); Could you explain this to me? Why is this better than deriving the optimal thread count from the number of scanlines total? > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:213 > + Unnecessary newline. > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:216 > + for (int i = 0; i < parallelJobs.numberOfJobs(); i++) { numberOfJobs() returns a size_t, consequently 'i' should be a size_t. Also, WebKit style prefers prefix increment (++i) to postfix (i++.) > Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:239 > + ThreadParameters parameters; > + parameters.yStart = 0; > + parameters.yEnd = m_size.height(); > + parameters.width = m_size.width(); > + parameters.bits = bits; > + parameters.bytesPerLine = bytesPerLine; > + parameters.lookUpTable = &lookUpTable; > + transformColorWorker(¶meters); I like that you're sharing the code between the threaded and non-threaded paths. Could we refactor it slightly to look less thread-specific? For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object. Then the non-threading code doesn't need to know about ThreadParameters.
Created attachment 109755 [details] [Qt] A test svg to the ImageBuffer::platformTransformColorSpace method This is an svg example which could show the performance improvement for the patch. A similar result can be measured with the following filters: FEBlend, FEColorMatrix, FEComposite, FEDisplacement, FEDropshadow, FEGauss, FELightning, FEMerge, FEMorph, FEOffset, FETile...
(In reply to comment #15) > Created an attachment (id=109755) [details] > [Qt] A test svg to the ImageBuffer::platformTransformColorSpace method > > This is an svg example which could show the performance improvement for the patch. > A similar result can be measured with the following filters: > FEBlend, FEColorMatrix, FEComposite, FEDisplacement, FEDropshadow, > FEGauss, FELightning, FEMerge, FEMorph, FEOffset, FETile... This svg gives me "Uncaught TypeError: Object [object DOMWindow] has no method 'methanol_next'" I'm more interested in how you arrived in the 3% number though. 3% total runtime improvement? 3% less time spent in platformTransformColorSpace()? Statistical profiler? Callgrind? Etc :)
> This svg gives me "Uncaught TypeError: Object [object DOMWindow] has no method 'methanol_next'" I guess Andras forgot to remove that onload event. > I'm more interested in how you arrived in the 3% number though. 3% total runtime improvement? 3% less time spent in platformTransformColorSpace()? Statistical profiler? Callgrind? Etc :) As the onload event said the measurement engine was the Methanol (http://gitorious.org/methanol) ;) This engine measures the page loads. So the 3% is the total runtime improvement on those SVG filters.
(In reply to comment #17) > As the onload event said the measurement engine was the Methanol (http://gitorious.org/methanol) ;) > This engine measures the page loads. So the 3% is the total runtime improvement on those SVG filters. Right, I cloned that repo and opened the fire-svg.html and fire-smp.html, and I get this error: "Uncaught TypeError: Property 'methanol_next' of object [object DOMWindow] is not a function" Maybe I'm supposed to open something else? There are no instructions AFAICT. But never mind that. So what we're talking about here is a 3% overall improvement on a test suite that exercises a myriad of functionality, not necessarily even hitting platformTransformColorSpace() in all tests etc? That's not a bad thing in itself, but we should know how much faster the new platformTransformColorSpace() is compared to the old version.
> Could you explain this to me? > Why is this better than deriving the optimal thread count from the number of scanlines total? If the image is 5x120 the algorithm will separate the picture (at 4 cores) to four 5x30 image, but for so small images the overhead much bigger then the gain. Our experience showed the overhead disappears below approx. 256x256 pixel per thread. You should also consider there 120x5 image which is also not a good candidate for parallelization so we should check the if the image has the minimum area(width * height) and a minimum scaleline as well. And must be sure that the count of threads not be more than the height of the image. > I like that you're sharing the code between the threaded and non-threaded paths. > Could we refactor it slightly to look less thread-specific? > For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object. > Then the non-threading code doesn't need to know about ThreadParameters. Not, the worker function must be get only one parameter. That is a limitation of Parallel_Jobs API.
Created attachment 109764 [details] [Qt] A test svg to the ImageBuffer::platformTransformColorSpace method I removed the "onload="top.methanol_next()" code from the svg, but I don't know how this bug came out.
(In reply to comment #19) > > Could you explain this to me? > > Why is this better than deriving the optimal thread count from the number of scanlines total? > If the image is 5x120 the algorithm will separate the picture (at 4 cores) to four 5x30 image, but for so small images the overhead much bigger then the gain. > Our experience showed the overhead disappears below approx. 256x256 pixel per thread. > You should also consider there 120x5 image which is also not a good candidate for parallelization so we should check the if the image has the minimum area(width * height) and a minimum scaleline as well. > And must be sure that the count of threads not be more than the height of the image. Fair enough. So what about a 100x1000 image? That would use a single thread with this formula, and seems like an awesome candidate for more parallelization. (Please correct me if I'm wrong. :) > > I like that you're sharing the code between the threaded and non-threaded paths. > > Could we refactor it slightly to look less thread-specific? > > For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object. > > Then the non-threading code doesn't need to know about ThreadParameters. > Not, the worker function must be get only one parameter. That is a limitation of Parallel_Jobs API. The worker function could be a one-liner that calls an inline function with the appropriate parameters.
> Fair enough. So what about a 100x1000 image? That would use a single thread with this formula, and seems like an awesome candidate for more parallelization. (Please correct me if I'm wrong. :) In this case the image slices are too small (at 2 cores: 50x1000, that implies so many pixels than a 200x250 sized image) > The worker function could be a one-liner that calls an inline function with the appropriate parameters. Possible to write an additional method, that take out the fields of this struct and pass these parameters to the actual worker function.
Created attachment 110354 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method. The worker function calls now in one-line.
Attachment 110354 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:198: Extra space before ) [whitespace/parens] [2] Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:244: Missing space after , [whitespace/comma] [3] Total errors found: 2 in 3 files If any of these errors are false positives, please file a bug against check-webkit-style.
I modified the summary from paralell to parallel, please modify it in your next patch also!
Created attachment 110668 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Sorry, I fixed the style errors.
(In reply to comment #18) [...] > But never mind that. So what we're talking about here is a 3% overall improvement on a test suite that exercises a myriad of functionality, not necessarily even hitting platformTransformColorSpace() in all tests etc? That's not a bad thing in itself, but we should know how much faster the new platformTransformColorSpace() is compared to the old version. I share a bit of Andreas' sentiment here. It would be good to know how much this particular function improved and much more importantly it would be good to know _why_. "Blind" optimizations are dangerous :) It is not apparent to me why distributing the function over different cores improves performance. From looking at the function and its lack of a computational kernel I _suspect_ that the bottleneck here is actually memory throughput. If that is the case, then it may be worth researching other ways of improving the performance because they may lead to even bigger wins with less heuristics and the thread start-up overhead. I'm saying this not to criticize your code, but I'd just like to stress the importance of explaining what you saw in oprofile and why your patch works. And if it is a memory throughput problem, then I recommend trying some other cache optimization techniques in comparision to see which one is best.
Comment on attachment 110668 [details] [Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method Taking this out of the review queue. No response for two weeks and it's not clear what the patch improves (cache efficiency? cpu load distribution?) that leads to a lower run-time and how.
Sorry, I forgot this bug because I received some new task. Soon I want to upload some measurements.
I made some measurements. The average acceleration of SVG filters is approx 3%. This is very little, doesn't cause considerable acceleration. In addition, I tested the platformTransformColorSpace() function, so I queried the time at the begin and at the end of this function and compared the runtime of the original code and the modified code. The result of this (function only) test in the best case is ~312sec for the original code ~281sec for the optimalized code This is also too little. I hoped the 3% speed-up enough to apply the patch and I think that it is not possible to accelerate it better.