Bug 68353

Summary: [Qt] Apply ParallelJobs for ImageBuffer::platformTransformColorSpace method
Product: WebKit Reporter: Andras Piroska <pandras>
Component: PlatformAssignee: Nobody <webkit-unassigned>
Status: RESOLVED INVALID    
Severity: Enhancement CC: hausmann, kbalazs, kling, loki, ossy, thorton, webkit.review.bot, zherczeg, zoltan
Priority: P2 Keywords: Performance, Qt, QtTriaged
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Attachments:
Description Flags
Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method.
none
Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method
none
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method
kling: review-
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method
none
[Qt] A test svg to the ImageBuffer::platformTransformColorSpace method
none
[Qt] A test svg to the ImageBuffer::platformTransformColorSpace method
none
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method
none
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method hausmann: review-, hausmann: commit-queue-

Description Andras Piroska 2011-09-19 06:10:08 PDT
Based on the measurements of the oprofile,
the ImageBuffer::platformTransformColorSpace method used
often by such functions like FEGaussian filter.
The computation can be distributed to multiple core
if the architecture supports.
Comment 1 Andras Piroska 2011-09-19 06:32:10 PDT
Created attachment 107845 [details]
Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method.
Comment 2 Andras Piroska 2011-09-20 05:37:01 PDT
Created attachment 107987 [details]
Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method
Comment 3 Csaba Osztrogonác 2011-09-20 05:38:54 PDT
Comment on attachment 107845 [details]
Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method.

You shouldn't set r-, but remove r? and set obsolete flag
Comment 4 Andras Piroska 2011-09-20 06:22:08 PDT
Created attachment 107994 [details]
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method
Comment 5 Andreas Kling 2011-09-20 06:44:26 PDT
Comment on attachment 107994 [details]
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method

View in context: https://bugs.webkit.org/attachment.cgi?id=107994&action=review

> Source/WebCore/ChangeLog:14
> +        No new tests / no behavioral changes.

So how much faster is it?

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:169
> +static const int minimalAreaOfImage = 128 * 128; // Empirical data limit for parallel jobs

Empirical data limit?
Where does this value come from?
The name 'minimalAreaOfImage' doesn't sufficiently explain what the constant does.

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:171
> +struct ThreadParam {

We try not to abbreviate unnecessarily in WebKit. ThreadParameters?

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:172
> +    int yStart, yEnd;

One declaration per line.

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:179
> +static void worker(ThreadParam* tp)

'tp' is a bad variable name. parameters?

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:189
> +        for (int x = 0; x < tp->width; ++x) {
> +            QRgb& pixel = scanLine[x];
> +            pixel = qRgba((*tp->lookUpTable)[qRed(pixel)],
> +                          (*tp->lookUpTable)[qGreen(pixel)],
> +                          (*tp->lookUpTable)[qBlue(pixel)],
> +                          qAlpha(pixel));
> +        }

First off, we should share this code with the non-parallel path.
Second, this is a terrible algorithm, performance wise. I'm sure you can make it a *lot* faster by manipulating the color byte fields directly instead of going through qRgba and  qRed/qGreen/qBlue/qAlpha.

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:208
> +    int areaOfImage = m_size.width() * m_size.height();
> +    int optimalThreadCount = areaOfImage / minimalAreaOfImage;

Granted, I haven't studied the math behind this, but I would think that the optimal thread counts relates more to the number of scanlines, not so much to the X*Y area.

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:210
> +        ParallelJobs<ThreadParam> parallelJobs(&worker, 2);

Why hard-code 2 threads if we know the optimal thread count?

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:211
> +        int threadCount = parallelJobs.numberOfJobs();

You don't need this variable.

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:213
> +        int dy = m_size.height() / threadCount;

'dy' is a bad variable name. scanLinesPerThread?

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:215
> +            ThreadParam& tp = parallelJobs.parameter(i);

'tp' is a bad variable name. parameters?

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:218
> +            tp.yEnd = (i == threadCount-1) ? m_size.height() : yStart;

Coding style, threadCount-1 should be threadCount - 1
Comment 6 Balazs Kelemen 2011-09-20 06:48:44 PDT
Comment on attachment 107994 [details]
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method

View in context: https://bugs.webkit.org/attachment.cgi?id=107994&action=review

Informal review - just some style nit, I didn't check it algoritmically.

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:170
> +static const int minimalAreaOfImage = 128 * 128; // Empirical data limit for parallel jobs
> +

This seems to be a confusing name. minimalAreaForWorkerThread?

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:180
> +static void worker(ThreadParam* tp)
> +{

1. worker does not say too much. Please chose a name for the function that suggests it's task.
2. tp is not a valid name, please use non-abbreviated variable names.

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:209
> +    if (1 != optimalThreadCount) {

It's not common in WebKit to write the constant first, please transpose the expression.
Comment 7 Andreas Kling 2011-09-20 07:47:18 PDT
Comment on attachment 107994 [details]
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method

Resetting r-, not sure why Balazs removed it.
Comment 8 Csaba Osztrogonác 2011-09-20 08:10:19 PDT
(In reply to comment #7)
> (From update of attachment 107994 [details])
> Resetting r-, not sure why Balazs removed it.

I think it was a conflict between your reviews ...
Comment 9 Andras Piroska 2011-09-20 23:40:28 PDT
Sorry, I will fix it.
Comment 10 Balazs Kelemen 2011-09-21 04:13:15 PDT
(In reply to comment #8)
> (In reply to comment #7)
> > (From update of attachment 107994 [details] [details])
> > Resetting r-, not sure why Balazs removed it.
> 
> I think it was a conflict between your reviews ...

Yes, it was not intentional, sorry.
Comment 11 Andras Piroska 2011-10-05 00:03:26 PDT
Created attachment 109750 [details]
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method

At the websites not cause regression.
The performance growth is approximately 3%
and this patch only influence to SVG filters exclusively.
So I don't know that result is enough to accept the patch.
Comment 12 Andras Piroska 2011-10-05 00:15:24 PDT
Err, I just wanted to say that is this speed-up progression enough for that compexity?
Comment 13 Andreas Kling 2011-10-05 00:42:33 PDT
(In reply to comment #12)
> Err, I just wanted to say that is this speed-up progression enough for that compexity?

I'm surprised it's only 3%. How are you measuring it exactly?
Comment 14 Andreas Kling 2011-10-05 00:49:39 PDT
Comment on attachment 109750 [details]
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method

View in context: https://bugs.webkit.org/attachment.cgi?id=109750&action=review

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:208
> +    const int optimalThreadCount = min(m_size.height(), areaOfImage / areaLimitForParallelJobs);

Could you explain this to me?
Why is this better than deriving the optimal thread count from the number of scanlines total?

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:213
> +

Unnecessary newline.

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:216
> +            for (int i = 0; i < parallelJobs.numberOfJobs(); i++) {

numberOfJobs() returns a size_t, consequently 'i' should be a size_t.
Also, WebKit style prefers prefix increment (++i) to postfix (i++.)

> Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:239
> +    ThreadParameters parameters;
> +    parameters.yStart = 0;
> +    parameters.yEnd = m_size.height();
> +    parameters.width = m_size.width();
> +    parameters.bits = bits;
> +    parameters.bytesPerLine = bytesPerLine;
> +    parameters.lookUpTable = &lookUpTable;
> +    transformColorWorker(&parameters);

I like that you're sharing the code between the threaded and non-threaded paths.
Could we refactor it slightly to look less thread-specific?
For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object.
Then the non-threading code doesn't need to know about ThreadParameters.
Comment 15 Andras Piroska 2011-10-05 00:49:42 PDT
Created attachment 109755 [details]
[Qt] A test svg to the ImageBuffer::platformTransformColorSpace method

This is an svg example which could show the performance improvement for the patch.
A similar result can be measured with the following filters:
FEBlend, FEColorMatrix, FEComposite, FEDisplacement, FEDropshadow,
FEGauss, FELightning, FEMerge, FEMorph, FEOffset, FETile...
Comment 16 Andreas Kling 2011-10-05 00:54:41 PDT
(In reply to comment #15)
> Created an attachment (id=109755) [details]
> [Qt] A test svg to the ImageBuffer::platformTransformColorSpace method
> 
> This is an svg example which could show the performance improvement for the patch.
> A similar result can be measured with the following filters:
> FEBlend, FEColorMatrix, FEComposite, FEDisplacement, FEDropshadow,
> FEGauss, FELightning, FEMerge, FEMorph, FEOffset, FETile...

This svg gives me "Uncaught TypeError: Object [object DOMWindow] has no method 'methanol_next'"

I'm more interested in how you arrived in the 3% number though. 3% total runtime improvement? 3% less time spent in platformTransformColorSpace()? Statistical profiler? Callgrind? Etc :)
Comment 17 Gabor Loki 2011-10-05 01:13:21 PDT
> This svg gives me "Uncaught TypeError: Object [object DOMWindow] has no method 'methanol_next'"

I guess Andras forgot to remove that onload event.

> I'm more interested in how you arrived in the 3% number though. 3% total runtime improvement? 3% less time spent in platformTransformColorSpace()? Statistical profiler? Callgrind? Etc :)

As the onload event said the measurement engine was the Methanol (http://gitorious.org/methanol) ;)
This engine measures the page loads. So the 3% is the total runtime improvement on those SVG filters.
Comment 18 Andreas Kling 2011-10-05 01:25:19 PDT
(In reply to comment #17)
> As the onload event said the measurement engine was the Methanol (http://gitorious.org/methanol) ;)
> This engine measures the page loads. So the 3% is the total runtime improvement on those SVG filters.

Right, I cloned that repo and opened the fire-svg.html and fire-smp.html, and I get this error: "Uncaught TypeError: Property 'methanol_next' of object [object DOMWindow] is not a function" Maybe I'm supposed to open something else? There are no instructions AFAICT.

But never mind that. So what we're talking about here is a 3% overall improvement on a test suite that exercises a myriad of functionality, not necessarily even hitting platformTransformColorSpace() in all tests etc? That's not a bad thing in itself, but we should know how much faster the new platformTransformColorSpace() is compared to the old version.
Comment 19 Andras Piroska 2011-10-05 01:30:43 PDT
> Could you explain this to me?
> Why is this better than deriving the optimal thread count from the number of scanlines total?
If the image is 5x120 the algorithm will separate the picture (at 4 cores) to four 5x30 image, but for so small images the overhead much bigger then the gain.
Our experience showed the overhead disappears below approx. 256x256 pixel per thread.
You should also consider there 120x5 image which is also not a good candidate for parallelization so we should check the if the image has the minimum area(width * height) and a minimum scaleline as well.
And must be sure that the count of threads not be more than the height of the image.

> I like that you're sharing the code between the threaded and non-threaded paths.
> Could we refactor it slightly to look less thread-specific?
> For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object.
> Then the non-threading code doesn't need to know about ThreadParameters.
Not, the worker function must be get only one parameter. That is a limitation of Parallel_Jobs API.
Comment 20 Andras Piroska 2011-10-05 01:46:03 PDT
Created attachment 109764 [details]
[Qt] A test svg to the ImageBuffer::platformTransformColorSpace method

I removed the "onload="top.methanol_next()" code from the svg,
but I don't know how this bug came out.
Comment 21 Andreas Kling 2011-10-05 01:46:41 PDT
(In reply to comment #19)
> > Could you explain this to me?
> > Why is this better than deriving the optimal thread count from the number of scanlines total?
> If the image is 5x120 the algorithm will separate the picture (at 4 cores) to four 5x30 image, but for so small images the overhead much bigger then the gain.
> Our experience showed the overhead disappears below approx. 256x256 pixel per thread.
> You should also consider there 120x5 image which is also not a good candidate for parallelization so we should check the if the image has the minimum area(width * height) and a minimum scaleline as well.
> And must be sure that the count of threads not be more than the height of the image.

Fair enough. So what about a 100x1000 image? That would use a single thread with this formula, and seems like an awesome candidate for more parallelization. (Please correct me if I'm wrong. :)

> > I like that you're sharing the code between the threaded and non-threaded paths.
> > Could we refactor it slightly to look less thread-specific?
> > For example having a transformColorSpace(yStart, yEnd, width, bits, bytesPerLine, lookupTable) method instead of passing a ThreadParameters object.
> > Then the non-threading code doesn't need to know about ThreadParameters.
> Not, the worker function must be get only one parameter. That is a limitation of Parallel_Jobs API.

The worker function could be a one-liner that calls an inline function with the appropriate parameters.
Comment 22 Andras Piroska 2011-10-05 02:13:59 PDT
> Fair enough. So what about a 100x1000 image? That would use a single thread with this formula, and seems like an awesome candidate for more parallelization. (Please correct me if I'm wrong. :)
In this case the image slices are too small (at 2 cores: 50x1000, that implies so many pixels than a 200x250 sized image)

> The worker function could be a one-liner that calls an inline function with the appropriate parameters.
Possible to write an additional method, that take out the fields of this struct and pass these parameters to the actual worker function.
Comment 23 Andras Piroska 2011-10-10 06:42:06 PDT
Created attachment 110354 [details]
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method

Apply the ParallelJobs support to ImageBuffer::platformTransformColorSpace method.
The worker function calls now in one-line.
Comment 24 WebKit Review Bot 2011-10-11 07:45:32 PDT
Attachment 110354 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:198:  Extra space before )  [whitespace/parens] [2]
Source/WebCore/platform/graphics/qt/ImageBufferQt.cpp:244:  Missing space after ,  [whitespace/comma] [3]
Total errors found: 2 in 3 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 25 Zoltan Horvath 2011-10-12 03:10:11 PDT
I modified the summary from paralell to parallel, please modify it in your next patch also!
Comment 26 Andras Piroska 2011-10-12 04:50:58 PDT
Created attachment 110668 [details]
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method

Sorry, I fixed the style errors.
Comment 27 Simon Hausmann 2011-10-18 00:01:37 PDT
(In reply to comment #18)
[...]
> But never mind that. So what we're talking about here is a 3% overall improvement on a test suite that exercises a myriad of functionality, not necessarily even hitting platformTransformColorSpace() in all tests etc? That's not a bad thing in itself, but we should know how much faster the new platformTransformColorSpace() is compared to the old version.

I share a bit of Andreas' sentiment here.

It would be good to know how much this particular function improved and much more importantly it would be good to know _why_. "Blind" optimizations are dangerous :)

It is not apparent to me why distributing the function over different cores improves performance. From looking at the function and its lack of a computational kernel I _suspect_ that the bottleneck here is actually memory throughput. If that is the case, then it may be worth researching other ways of improving the performance because they may lead to even bigger wins with less heuristics and the thread start-up overhead.

I'm saying this not to criticize your code, but I'd just like to stress the importance of explaining what you saw in oprofile and why your patch works. And if it is a memory throughput problem, then I recommend trying some other cache optimization techniques in comparision to see which one is best.
Comment 28 Simon Hausmann 2011-11-01 14:07:17 PDT
Comment on attachment 110668 [details]
[Qt] Apply ParalellJobs for ImageBuffer::platformTransformColorSpace method

Taking this out of the review queue. No response for two weeks and it's not clear what the patch improves (cache efficiency? cpu load distribution?) that leads to a lower run-time and how.
Comment 29 Andras Piroska 2011-11-03 02:16:43 PDT
Sorry, I forgot this bug because I received some new task.
Soon I want to upload some measurements.
Comment 30 Andras Piroska 2011-11-08 04:41:35 PST
I made some measurements.
The average acceleration of SVG filters is approx 3%.
This is very little, doesn't cause considerable acceleration.
In addition, I tested the platformTransformColorSpace() function,
so I queried the time at the begin and at the end of this function
and compared the runtime of the original code and the modified code.
The result of this (function only) test in the best case is
 ~312sec for the original code
 ~281sec for the optimalized code
This is also too little.
I hoped the 3% speed-up enough to apply the patch and
I think that it is not possible to accelerate it better.