Bug 43903 - Speeding up SVG filters with multicore (SMP) support
Summary: Speeding up SVG filters with multicore (SMP) support
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: SVG (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Enhancement
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-12 02:46 PDT by Zoltan Herczeg
Modified: 2011-04-26 10:47 PDT (History)
22 users (show)

See Also:


Attachments
parallel svg light filter (9.38 KB, patch)
2010-08-12 02:49 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
updated patch (17.22 KB, patch)
2010-11-30 05:07 PST, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
Speeding up FETurbulence with SMP (23.58 KB, patch)
2011-02-22 07:06 PST, Gabor Loki
ossy: review-
Details | Formatted Diff | Diff
A big SVG to test turbulence filter (625 bytes, image/svg+xml)
2011-03-02 03:44 PST, Gabor Loki
no flags Details
A simple test case to benchmark the SMP-based turbulence filter (2.54 KB, image/svg+xml)
2011-03-02 04:24 PST, Gabor Loki
no flags Details
Measurement result with Methanol (1.12 KB, text/plain)
2011-03-09 07:41 PST, Gabor Loki
no flags Details
Speeding up FETurbulence with SMP (30.44 KB, patch)
2011-04-07 01:06 PDT, Gabor Loki
no flags Details | Formatted Diff | Diff
Speeding up FETurbulence with SMP (30.46 KB, patch)
2011-04-07 05:19 PDT, Gabor Loki
no flags Details | Formatted Diff | Diff
Speeding up FETurbulence with SMP (41.75 KB, patch)
2011-04-12 00:41 PDT, Gabor Loki
no flags Details | Formatted Diff | Diff
Speeding up FETurbulence with SMP (42.72 KB, patch)
2011-04-12 06:26 PDT, Gabor Loki
no flags Details | Formatted Diff | Diff
Speeding up FETurbulence with SMP (42.15 KB, patch)
2011-04-26 01:15 PDT, Gabor Loki
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Zoltan Herczeg 2010-08-12 02:46:50 PDT
This patch is just show new ways of exploiting multicore cpus in WebKit. Filters usually requires a huge amount of CPU work, especially for animated SVGs. Any feedback would be welcome.
Comment 1 Zoltan Herczeg 2010-08-12 02:49:59 PDT
Created attachment 64198 [details]
parallel svg light filter

Light filter is easy to split up for multiple CPUs. The initial plan is to add one helper thread, and two functions (execute / waitForCompletition) to the Filter base class.
Comment 2 Zoltan Herczeg 2010-11-30 05:07:48 PST
Created attachment 75125 [details]
updated patch
Comment 3 Nikolas Zimmermann 2010-11-30 05:48:36 PST
Comment on attachment 75125 [details]
updated patch

View in context: https://bugs.webkit.org/attachment.cgi?id=75125&action=review

The idea looks great, the video you showed is also very convincing! :-)
Some early review comments:

> WebCore/platform/graphics/filters/FELighting.cpp:240
> +            data.interior(offset, normalVector);

Hm, not related to this patch, but 'interior' is a bad name, what does it do?

> WebCore/platform/graphics/filters/FELighting.cpp:246
> +#ifdef FILTER_THREAD

This should use the USE(FILTER_THREADS) or ENABLE(FILTER_THREADS) logic. (not sure which one to use, need to read Maciejs mail again regarding that topic, when to use USE or ENABLE, but I guess USE is approriate here).

> WebCore/platform/graphics/filters/FELighting.cpp:249
> +    PaintThreadData& data = *reinterpret_cast<PaintThreadData*>(parameter);

How about adding ASSERT(parameter) as first argument.

> WebCore/platform/graphics/filters/FELighting.cpp:280
> +    if (width >= 3 && height >= 10 && width * height > 4000) {

I'd like a comment here, that this is emperical data.

> WebCore/platform/graphics/filters/FELighting.h:79
> +        {

An ASSERT(filter) can't hurt.
It looks awkward to use filter(filter) and lightingData(lightingData) - is that guarenteed to work? I was not aware that this is possible!

> WebCore/platform/graphics/filters/FETurbulence.cpp:335
> +#ifdef FILTER_THREAD

Use USE(FILTER_THREADS).

> WebCore/platform/graphics/filters/FETurbulence.cpp:338
> +    PaintThreadData& data = *reinterpret_cast<PaintThreadData*>(parameter);

Maybe ASSERT(parameter) first.

> WebCore/platform/graphics/filters/FETurbulence.cpp:360
> +#ifdef FILTER_THREAD

Same comment, use USE(FILTER_THREADS), won't repeat it for the other occourences.

> WebCore/platform/graphics/filters/FETurbulence.cpp:362
> +    if (imageRect.height() >= 6 && imageRect.width() * imageRect.height() > 4000) {

Emperical data comment...

> WebCore/platform/graphics/filters/FETurbulence.h:95
> +        {

ASSERT(filter)

> WebCore/platform/graphics/filters/FilterEffect.cpp:33
> +#ifdef FILTER_THREAD
> +ThreadIdentifier FilterEffect::s_threadID;
> +FilterEffect::ThreadShared* FilterEffect::s_threadShared;
> +bool FilterEffect::s_stopThread;
> +#endif

We generally disallow globals, so how about putting these in functions?
static ThreadIdentifier threadIdentifier()
{
    static ThreadIdentifier s_threadID;
    return s_threadID;
}

Are you avoiding the use of DEFINE_STATIC_LOCAL on purpose?
I have never used threads in WebCore so far, forgive my ignorance if I'm overlooking something obvious.

> WebCore/platform/graphics/filters/FilterEffect.cpp:141
> +        s_threadShared->m_cond.wait(s_threadShared->m_mutex);

Is this prefereedd style when using threads? Are shall we expose accesors for m_cond, m_start, m_finish, m_function, m_mutex etc.?
Comment 4 Gabor Loki 2011-02-22 07:06:48 PST
Created attachment 83307 [details]
Speeding up FETurbulence with SMP

I have redesigned the previous concept of threading from Zoltan's suggestion and introduced a framework for parallel tasks/jobs.

The ParallelJobs framework is based on WebKit's threading infrastructure and/or OpenMP's API.

The usage of ParallelJobs is the following:
1.- Create ParallelJobs object - which allocates the worker threads and necessary objects.
2.- Fill the parameter array - allocated by ParallelJobs, assign different sub-tasks for the workers
3.- Execute the workers
4.- Combine the necessary information [optional]

This patch contains:
1.- ParallelJobs framework
2.- SMP version of FETurbulence

One more note:
I have measured 60%-80% performance progression on dual-core systems on ARM and x86 as well.
Comment 5 Benjamin Poulain 2011-02-22 07:34:42 PST
May I suggest to add benchmarks (even if only Qt benchmarks) to see how does that fly?

I am also not a fan of OpenMP for having had some strange performance numbers out of it (like x1.1 boost where direct threading with pthread gave me x1.9). If we want to use more SMP inside WebKit, I would really prefer something more like TBB, because you have a lot more control over the way things are parallelized.
Comment 6 Build Bot 2011-02-22 07:49:08 PST
Attachment 83307 [details] did not build on win:
Build output: http://queues.webkit.org/results/7946220
Comment 7 WebKit Review Bot 2011-02-22 08:36:41 PST
Attachment 83307 [details] did not build on mac:
Build output: http://queues.webkit.org/results/7946234
Comment 8 Gabor Loki 2011-02-22 08:42:56 PST
> May I suggest to add benchmarks (even if only Qt benchmarks) to see how does that fly?

Where to add benchmarks? If you send me some pointers I will happily measure the effect of this patch, but I have to mention that in general the benchmarks mainly give an overall insight about the performance. So, if you really would like to measure the effect of this kind of parallelization, you should create a basic svg with turbulence filter. For example: you can use the Source/WebCore/manual-tests/svg-filter-animation.svg as a simple benchmark after all non-related filters are removed from the SVG.


> I would really prefer something more like TBB, because you have a lot more control over the way things are parallelized.

Good idea, but I feel this could be another patch where we will introduce TBB for another threading model of the WebKit's threading interface.
Comment 9 Benjamin Poulain 2011-02-22 08:58:43 PST
(In reply to comment #8)
> > May I suggest to add benchmarks (even if only Qt benchmarks) to see how does that fly?
> 
> Where to add benchmarks? If you send me some pointers I will happily measure the effect of this patch, but I have to mention that in general the benchmarks mainly give an overall insight about the performance. So, if you really would like to measure the effect of this kind of parallelization, you should create a basic svg with turbulence filter. For example: you can use the Source/WebCore/manual-tests/svg-filter-animation.svg as a simple benchmark after all non-related filters are removed from the SVG.

The benchmarks for Qt are in Source/WebKit/qt/tests/benchmarks/

This is really a performance optimization. You don't need SMP to make the code cleaner/simpler, quite the contrary. So I think the performance gain is very relevant for this task.

If I understant correctly, you are doing research. So getting accurate numbers is probably interesting for your papers as well.
When we considered parallelization for Qt's CPU rasterization, we looked for the boundaries of input for which SMP is helpful VS is harmful. I think this kind of analysis in both interesting for the WebKit project, and for research. (and note that we ended up not using parallelization for most primitives of Qt because it was more harmful than helpful for general rendering).
Comment 10 Gabor Loki 2011-02-24 00:19:50 PST
> If I understant correctly, you are doing research. So getting accurate numbers is probably interesting for your papers as well.

I would be the most happiest people if this topic was a research.

> When we considered parallelization for Qt's CPU rasterization, we looked for the boundaries of input for which SMP is helpful VS is harmful. I think this kind of analysis in both interesting for the WebKit project, and for research. (and note that we ended up not using parallelization for most primitives of Qt because it was more harmful than helpful for general rendering).

I agree with you about the importance of boundaries. That's why I left the s_minimalRectDimension for platforms. All other code depends on the number of requested threads. So, if the platform/OS/architecture is able to decide its boundaries this will be an easy task to implement their needs. Of course lots of benchmarking are needed to achieve this, but for introducing this parallelization approach the current limit is fine. Fine tuning of parallel turbulence filter should be done later with the help of benchmarks.

Anyway I will check Qt's benchmark environment. Is the time measuring method of this environment based on clock_t or wall-time (for example: clock_gettime)?
Comment 11 Benjamin Poulain 2011-02-24 02:07:14 PST
(In reply to comment #10)
> Anyway I will check Qt's benchmark environment. Is the time measuring method of this environment based on clock_t or wall-time (for example: clock_gettime)?

You choose the kind of timing with a parameter when running the benchmark: http://doc.qt.nokia.com/latest/qtestlib-manual.html#creating-a-benchmark
Comment 12 Eric Seidel (no email) 2011-02-24 02:20:48 PST
crazy.
Comment 13 Eric Seidel (no email) 2011-02-24 02:21:43 PST
If we're going to use this parallel jobs thing widely, it's eventually going to need a GCG implementation:
http://developer.apple.com/technologies/mac/snowleopard/gcd.html
Comment 14 Simon Fraser (smfr) 2011-02-24 22:17:21 PST
Comment on attachment 83307 [details]
Speeding up FETurbulence with SMP

View in context: https://bugs.webkit.org/attachment.cgi?id=83307&action=review

> Source/JavaScriptCore/ChangeLog:10
> +        too complex. Using the power of Simmetric Multi Processing (SMP) we

symmetric
Comment 15 Simon Fraser (smfr) 2011-02-24 22:17:55 PST
I think it would be better to try to speed things up with SSE before we go to multithreading.
Comment 16 Zoltan Herczeg 2011-02-24 23:13:07 PST
I think some template magic could make this more user frendly. We could remove some static_casts and sizeof operations in that way.
Comment 17 Zoltan Herczeg 2011-02-24 23:15:19 PST
(In reply to comment #15)
> I think it would be better to try to speed things up with SSE before we go to multithreading.

See https://bugs.webkit.org/show_bug.cgi?id=54456 (ARM Neon is similar to x86 SSE). Could you help me reviewing that?
Comment 18 Gabor Loki 2011-02-25 01:06:18 PST
(In reply to comment #15)
> I think it would be better to try to speed things up with SSE before we go to multithreading.

The SSE and NEON are very cool features which should be used in WebKit directly or indirectly, but they are architecture dependent features. We can use them to speed up something directly, like Zoltan's patch for FELighting, or design a common framework (something like JSC's macro assembler) which drops many performance opportunities because of the high level abstraction - like in JSC. Anyhow, we should consider SIMD extensions, but the scope of SIMD and SMP are different. Any of them can live without the other, and can work with the other.
Comment 19 Balazs Kelemen 2011-03-01 01:27:07 PST
(In reply to comment #11)
> (In reply to comment #10)
> > Anyway I will check Qt's benchmark environment. Is the time measuring method of this environment based on clock_t or wall-time (for example: clock_gettime)?
> 
> You choose the kind of timing with a parameter when running the benchmark: http://doc.qt.nokia.com/latest/qtestlib-manual.html#creating-a-benchmark

You can find another version of the benchmarks on gitorious: http://gitorious.org/qtwebkit/performance.
In this repo the benchmarks based on wall-time with higher accuracy.
Comment 20 Gabor Loki 2011-03-02 00:55:30 PST
> You can find another version of the benchmarks on gitorious: http://gitorious.org/qtwebkit/performance.

It looks like this performance tool has some issue running as a benchmark. I used several tools from this project, but they provided strange values to me.

I create a simple svg with only one filter (on a BIG area). It takes about 5 sec to compute the filter with QtWebKit. And here is a sample result of cycler test:

********* Start testing of tst_Cycler *********
Config: Using QTest library 4.7.0, Qt 4.7.0
PASS   : tst_Cycler::initTestCase()
PASS   : tst_Cycler::load()
PASS   : tst_Cycler::cleanupTestCase()
Totals: 3 passed, 0 failed, 0 skipped
********* Finished testing of tst_Cycler *********
Printing result for: total
	benchmark: cycler::load	http://127.0.0.1/~smp/svg/turbulence.svg
		mean: 3 msecs +/- 3239 msecs, +/- 107967 %
		avg:  1685 msecs
			5741 3 3 3 2 2 2 5464 5636 3 
Total mean: 3 msecs per test, avg: 1685


And another one:
		mean: 3 msecs +/- 3850 msecs, +/- 128333 %
		avg:  2312 msecs
			5875 5682 5875 5679 3 2 2 2 2 2 

Btw, using WebKit's threading, the results:
		mean: 2 msecs +/- 1588 msecs, +/- 79400 %
		avg:  676 msecs
			3408 3336 3 3 2 2 2 2 2 2 

I think there is some kind of caching issue. I feel this environment cannot be used as a benchmark now.

Benjamin, do you have any idea what is happening here? Why did the environment produce this strange numbers?

Btw, the derivation of the valid numbers are also too high.
Comment 21 Benjamin Poulain 2011-03-02 03:00:57 PST
> Benjamin, do you have any idea what is happening here? Why did the environment produce this strange numbers?

Nope, I have no idea what is going on here. The deviation is just insane, it is usually around 1%.

But I would not use that particular benchmark. I would use tst_Painting::fullPagePaint(). Because what you really care about is the rendering speed. The cycler will also take into account loading and layout time.
Comment 22 Gabor Loki 2011-03-02 03:44:45 PST
Created attachment 84394 [details]
A big SVG to test turbulence filter
Comment 23 Gabor Loki 2011-03-02 03:47:12 PST
> I would use tst_Painting::fullPagePaint(). Because what you really care about is the rendering speed.

In this test the calculation of SVG is not included.

********* Finished testing of tst_Painting *********
Printing result for: total
	benchmark: painting::paint	http://127.0.0.1/~smp/svg/turbulence.svg
		mean: 46 msecs +/- 0 msecs, +/- 0 %
		avg:  46 msecs
			47 47 47 46 46 46 46 46 46 46 
	benchmark: painting::fullPagePaint	http://127.0.0.1/~smp/svg/turbulence.svg
		mean: 58 msecs +/- 0 msecs, +/- 0 %
		avg:  58 msecs
			58 58 58 58 58 58 58 58 59 58 
Total mean: 52 msecs per test, avg: 52
Comment 24 Gabor Loki 2011-03-02 04:24:18 PST
Created attachment 84399 [details]
A simple test case to benchmark the SMP-based turbulence filter

This test case is able to show the effect of SMP-based turbulence filter. I know the JS is also taken account in this test, but its effect for the execution time is not significant (according to filter calculation). It is not an ultimate benchmark, it is just a simple test. If anyone has a good test or a benchmark for SVG, please let me know!
Comment 25 Balazs Kelemen 2011-03-02 04:40:51 PST
(In reply to comment #21)
> > Benjamin, do you have any idea what is happening here? Why did the environment produce this strange numbers?
> 
> Nope, I have no idea what is going on here. The deviation is just insane, it is usually around 1%.
> 

Not belongs strictly to this bug (should we create another?) but
I have also tried those tests (the gitorious version) and the deviation
was not acceptable. I guess the trunk version also suffering from this
in wall-time mode. I do not think this is only a problem with SVG related
pages but it is caused by the event based approach of our benchmark framework.
The really good thing in this framework is that one can measure
the performance of specific subsystems like layouting or painting
but we cannot use it to validate optimizations as long as we cannot make it
producing stable results.
Comment 26 Gabor Loki 2011-03-09 07:41:32 PST
Created attachment 85174 [details]
Measurement result with Methanol

I have measured the performance progression of this SMP framework with our page loading benchmark (called Methanol: http://gitorious.org/methanol).

Two measurement scenarios were used to validate the framework (an artificial test-suite and a collection of real websites).

Scenario 1:
 - Real websites:
   - Contains the first 25 unique websites from Alexa Top 500.
   - Does not contain any SVG.
 - Result
   - Measured on x86 and on ARM
   - It is not surprising that there is no difference between the reference and the two SMP-based approaches.

Scenario 2:
 - Artificial test-suite:
   - Methanol contains a small test-suite for SVG. Some test cases contain only one SVG filter with simple shapes, others are using combined filters.
 - Result
   - Measured on x86 and ARM
   - Methanol reported 30% - 80% performance progression on turbulence-related tests. Others were not changed. See the attached file.
 - An important though
   - The 'turbSmall' test contains several small elements. The turbulence filter is called on these elements independently. The drawing area is for each elements is 100x100 (which is under the initial SMP-limit for turbulence). You can see the performance numbers were the same for each measurements. While the 'turbBig' test performs better with SMP (in this case the elements' size are 400x400).

If someone has another SVG-related benchmark, I would be happy to measure the SMP framework with it.
Comment 27 Gabor Loki 2011-03-28 23:53:27 PDT
> Created an attachment (id=83307) [details]
> Speeding up FETurbulence with SMP
> ...
>    - Methanol reported 30% - 80% performance progression on turbulence-related tests. Others were not changed. See the attached file.

Any other thoughts or suggestions?
Comment 28 Csaba Osztrogonác 2011-04-04 08:51:22 PDT
Comment on attachment 83307 [details]
Speeding up FETurbulence with SMP

View in context: https://bugs.webkit.org/attachment.cgi?id=83307&action=review

On the whole looks good to me, but please fix the following things:

- I agree with Zoltan ( https://bugs.webkit.org/show_bug.cgi?id=43903#c16 ), we need 
template for parameter passing to avoid ugly reinterpret_casts and using sizof.

- Please fix the typo detected by smfr. (s/Simmetric/Symmetric)

- And what about the other build systems? You should add the new files
for all build system. ( Or at least disable PARALLEL_JOBS if !PLATFORM(QT) )
For example it doesn't build on Windows because of missing include path.

r- for the present due to my comments above and below.

> Source/JavaScriptCore/wtf/ParallelJobs.cpp:135
> +    ASSERT(m_numberOfJobs > 0);

m_numberOfJobs is always greater than 0 if !ENABLE(OPENMP)

> Source/JavaScriptCore/wtf/ParallelJobs.h:44
> +//     ParallelJobs parallelJobs(&worker, sizeof(Parameters));

s/worker/workerFunction

> Source/JavaScriptCore/wtf/Platform.h:1162
> +#if !defined(ENABLE_PARALLEL_JOBS) && ENABLE(OPENMP)
> +#define ENABLE_PARALLEL_JOBS 1
> +#endif

We shouldn't enable parallel jobs if ENABLE(SINGLE_THREADED) is 
true, so we need an additional !ENABLE(SINGLE_THREADED) guard.
Comment 29 Gabor Loki 2011-04-05 06:26:21 PDT
> On the whole looks good to me, but please fix the following things:
> 
> - I agree with Zoltan ( https://bugs.webkit.org/show_bug.cgi?id=43903#c16 ), we need 
> template for parameter passing to avoid ugly reinterpret_casts and using sizof.

Well, it seems an easy task, but the ParallelJobs has a static s_threadPool member in case of using WebKit's threading. If ParallelJobs is changed to template<typename Type>ParallelJobs, we will end up in more s_threadPool members (one for each different template type). This is not what we want, we should have only one thread pool. A possible solution is to create a base class (for example ParallelEnvironment) which contains the s_threadPool member and the related ThreadPrivate class.

I am still not convinced which solution is better - a template class with a base pool, or passing the size of the parameter.
Comment 30 Zoltan Herczeg 2011-04-05 06:31:58 PDT
> I am still not convinced which solution is better - a template class with a base pool, or passing the size of the parameter.

I vote for the base class. You have to do it only once. The other one is ugly.
Comment 31 Alexey Proskuryakov 2011-04-05 08:50:46 PDT
Please consider that inheritance should only be used to express "is a" relationships, and never to share code or data.
Comment 32 Gabor Loki 2011-04-07 01:06:10 PDT
Created attachment 88589 [details]
Speeding up FETurbulence with SMP

Well, I redesigned the patch using template (and fixed the other issues as well). The difference of the usage of ParallelJobs looks very small, but the framework itself is getting more complex.
Comment 33 Gabor Loki 2011-04-07 05:19:11 PDT
Created attachment 88609 [details]
Speeding up FETurbulence with SMP

Speculative fix for vcproj file
Comment 34 Alexey Proskuryakov 2011-04-07 08:58:14 PDT
+        programming. The framework is based on WebKit's threading infrastructure
+        and Open Multi-Processing's (OpenMP) API.

While this is certainly an interesting experiment and an extremely promising direction, I'm not sure that adding an OpenMP-inspired threading code to WebKit is a good idea.

The code doesn't fit with the rest of WebKit, and it doesn't fit well with some target platforms (e.g. on OS X, one should use libdispatch for parallel jobs).
Comment 35 Zoltan Herczeg 2011-04-07 10:25:45 PDT
> The code doesn't fit with the rest of WebKit, and it doesn't fit well with some target platforms (e.g. on OS X, one should use libdispatch for parallel jobs).

As far as I see the central question of the patch is the generic WebKit interface for parallel jobs. The technical details (which platfom how implements them) is secondary. The header contains a simple, portable interface, which has 3 steps:
 - request threads, passing the optimal number of threads
 - system returns with the number of allocated threads (always <= optimal, but >= l), 
 - fill the thread local data
 - execute the jobs, and wait for finish

Alexey, would you do that a different way? Again, the purpose of this work seems to find a simple, yet powerful parallel processing interface. The openmp thing is just an implementation, nothing more.
Comment 36 Alexey Proskuryakov 2011-04-07 11:11:39 PDT
Libdispatch is different in recognizing that no single application knows what happens with the rest of the system, so no application can know how many concurrent threads to use. Instead, you supply small tasks, and it makes sure to optimize parallelism globally and dynamically.

Since it's open source, you can use it on other operating systems, although the benefits are smaller without kernel support.
Comment 37 Zoltan Herczeg 2011-04-07 11:21:15 PDT
(In reply to comment #36)
> Libdispatch is different in recognizing that no single application knows what happens with the rest of the system, so no application can know how many concurrent threads to use. Instead, you supply small tasks, and it makes sure to optimize parallelism globally and dynamically.
> 
> Since it's open source, you can use it on other operating systems, although the benefits are smaller without kernel support.

This is good news, it fits to the current approach. Basically we would return the optimal number of threads and let the system execute them in any order.
Comment 38 Alexey Proskuryakov 2011-04-07 11:36:06 PDT
What I'm saying is that libdispatch changes the number of threads dynamically. It's not determined at the time tasks are scheduled.
Comment 39 Zoltan Herczeg 2011-04-07 11:52:08 PDT
(In reply to comment #38)
> What I'm saying is that libdispatch changes the number of threads dynamically. It's not determined at the time tasks are scheduled.

I understand that, still I don't see a problem here. Basically even with libdispatch, you need to create those small tasks, isn't it? And their number is fixed when you created them, and they were executed somehow (doesn't matter how).

http://developer.apple.com/library/mac/#documentation/General/Conceptual/ConcurrencyProgrammingGuide/ConcurrencyandApplicationDesign/ConcurrencyandApplicationDesign.html#//apple_ref/doc/uid/TP40008091-CH100-SW1

Even with libdispatch, you determine the number of sub-tasks (since you create them), and their number is never infinte. Thus, the ParrallelJobs returns the number you passed as maxThreads, you create those sub-tasks, and libdispatch execute them.
Comment 40 Alexey Proskuryakov 2011-04-07 12:14:33 PDT
So, are you saying that we'd just pass INT_MAX as optimal number of threads? Even if that works, it sounds like an abuse.

I think that using parallelism more would be a good topic for WebKit contributor meeting discussion this month.
Comment 41 Zoltan Herczeg 2011-04-07 12:48:32 PDT
(In reply to comment #40)
> So, are you saying that we'd just pass INT_MAX as optimal number of threads? Even if that works, it sounds like an abuse.

hmmmm, not exactly. I would rather say that the optimal number would depend on the input, like we want to start a job for each 5000 pixels (don't think every pixel should have an own thread...), so the optimal would be round_up(width * height / 5000)

> I think that using parallelism more would be a good topic for WebKit contributor meeting discussion this month.

Absolutely true. Unfortunately we can't be there in person. I hope we could follow you somehow, like mails, or something...
Comment 42 Nikolas Zimmermann 2011-04-08 23:51:08 PDT
(In reply to comment #35)
> > The code doesn't fit with the rest of WebKit, and it doesn't fit well with some target platforms (e.g. on OS X, one should use libdispatch for parallel jobs).
> 
> As far as I see the central question of the patch is the generic WebKit interface for parallel jobs. The technical details (which platfom how implements them) is secondary. The header contains a simple, portable interface, which has 3 steps:
>  - request threads, passing the optimal number of threads
>  - system returns with the number of allocated threads (always <= optimal, but >= l), 
>  - fill the thread local data
>  - execute the jobs, and wait for finish
> 
> Alexey, would you do that a different way? Again, the purpose of this work seems to find a simple, yet powerful parallel processing interface. The openmp thing is just an implementation, nothing more.

Can we maybe split up everything openmp related into a different header, to really make it a detail of the implementation.

I think the patch is already great as-is, though I'm absolutely no expert in this area, and like the hear from others how to go on :-)
Comment 43 Zoltan Herczeg 2011-04-09 00:27:42 PDT
> Can we maybe split up everything openmp related into a different header, to really make it a detail of the implementation.

The patch needs to be redesigned :(, because the ^{ block } construction required by libdispatch combined with templates results an GCC internal compiler error. Perhaps someone should report this to Apple as it seems ^{ block } is a special extension introduced by them.
Comment 44 Gabor Loki 2011-04-09 02:39:08 PDT
It looks like everything is discussed without me. ;) That's great!

First of all the ParallelJobs framework should look like the Threading one. So splitting it into different headers is a good idea.

About OpenMP, libdispatch and other parallel processing API. Every platform has its own favorite parallel processing method/API. I have only introduced one extra (OpenMP) beside the generic WebKit's threading infrastructure. The libdispatch is also a very good feature, but I am not familiar with it. So, It would be nice if someone helps in this area.

About the framework. The WebKit's ParallelJobs framework should be simple and powerful for each parallel processing method/API. It is simple if the developers think it is ;) . It is powerful if it is as close to the platform dependent parallel processing API as possible. My solution is close to satisfy those two requirement. Well, I do not think I invented a brand new method for parallel processing. The basics are the same as you can see in any API.
 - Allocate / request some threads (depending on the input)
 - Fill the threads' local data
 - Execute the threads and wait for finish

This is a very basic method using parallel processing and every API provides a way to execute tasks based on the above topics. As I see libdispatch also supports this way (http://libdispatch.macosforge.org/trac/wiki/tutorial#Puttingworkinthepipeline:Queues)

About the technical details. Using template makes a big mess. Especially when I try to add libdispatch support. As Zoltan said libdispatch is not supported in a template. The GCC 4.2 fails with internal compiler error on Mac. So, I have to redesign the whole framework.
- First of all, I would like to create separate headers for each parallel processing method/API and a common one which will contain the public API of WebKit's ParallelJobs framework (thanks to Nikolas).
- The other important thing should be the base class (so-called ParallelEnviroment). It will be something like in my first patch, will contain a constructor for initialization, functions to be able to fill the local data and an execute function.
- The third part should be the template class. The purpose of this class is to make the framework be as simple as possible.

Any more thoughts?
Comment 45 Geoffrey Garen 2011-04-11 11:22:09 PDT
Is it really a good idea to add the complexity of threading to the mix when so many basic bugs and crashes in SVG filters remain?
Comment 46 Zoltan Herczeg 2011-04-11 12:26:47 PDT
(In reply to comment #45)
> Is it really a good idea to add the complexity of threading to the mix when so many basic bugs and crashes in SVG filters remain?

So many is a little exaggerated :) SVG is getting more attention than in the past, and the code improves rapidly.
Comment 47 Brent Fulgham 2011-04-11 13:36:18 PDT
Comment on attachment 88609 [details]
Speeding up FETurbulence with SMP

View in context: https://bugs.webkit.org/attachment.cgi?id=88609&action=review

This is a very interesting project, and I'd love to be able to benchmark it under WinCairo to see what improvements it might provide.  It would be nice (in the future) to allow GCD/OpenMP backends for generic parallelization (perhaps in a 'platform/libdispatch' and 'platform/openmp' structure).

> Source/JavaScriptCore/wtf/ParallelJobs.cpp:135
> +

Couldn't the m_numberOfJobs calculation and the threadpool setup be done in a method, hiding the knowledge of OPENMP use?  Ditto for maxNumberOfThreads (above), though probably less important.

> Source/JavaScriptCore/wtf/ParallelJobs.h:73
> +static const int maxParallelThreads = 2;

Can't this be hidden inside the implementation file?  Is it accessed by other compilation units?

> Source/JavaScriptCore/wtf/ParallelJobs.h:149
> +    void executeJobs()

For clarity, could this implementation be hidden inside the cpp file?

> Source/JavaScriptCore/wtf/ParallelJobs.h:158
> +        int i;

Why is this declared outside the for?
Comment 48 Geoffrey Garen 2011-04-11 14:56:10 PDT
> So many is a little exaggerated :) SVG is getting more attention than in the past, and the code improves rapidly.

Given the rapid improvement you're expecting, would you be willing to hold off on changes like this until all reproducible crashers in SVG filters have been fixed?
Comment 49 Dirk Schulze 2011-04-12 00:01:40 PDT
(In reply to comment #48)
> > So many is a little exaggerated :) SVG is getting more attention than in the past, and the code improves rapidly.
> 
> Given the rapid improvement you're expecting, would you be willing to hold off on changes like this until all reproducible crashers in SVG filters have been fixed?

I just could just find of two crashes in bugzilla. IIRC Zoltan is working on one of these crashes. The second one is more a implementation problem of SVGImage, that affects SVG filter. So SVGImage should get fixed instead of filter. If you know of more bugs, please let me know. If you know of more crashes, file a bug and add me and Zoltan please.
Comment 50 Gabor Loki 2011-04-12 00:41:58 PDT
Created attachment 89166 [details]
Speeding up FETurbulence with SMP

Well, I have updated the path as I promised; using separate headers, base classes try to be close to the platform dependent parallel processing APIs, the template class is as generic as possible.

Although I am not familiar with Mac, I have introduced a possible usage of libdistpatch API. I hope there will be someone else who is able to fine tune this part for Mac.
Comment 51 Gabor Loki 2011-04-12 00:46:00 PDT
> For clarity, could this implementation be hidden inside the cpp file?

Hmm. Why do you want to hide the implementation details? For example, in JavaScriptCore almost the whole assembler is public.
Comment 52 Gabor Loki 2011-04-12 03:00:30 PDT
Comment on attachment 89166 [details]
Speeding up FETurbulence with SMP

I am going to update the build systems to make svn apply happy.
Comment 53 Gabor Loki 2011-04-12 06:26:06 PDT
Created attachment 89195 [details]
Speeding up FETurbulence with SMP
Comment 54 WebKit Review Bot 2011-04-12 06:27:33 PDT
Attachment 89195 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:75:  More than one command on the same line  [whitespace/newline] [4]
Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:75:  Missing space before {  [whitespace/braces] [5]
Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:78:  More than one command on the same line  [whitespace/newline] [4]
Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:78:  Missing space before {  [whitespace/braces] [5]
Total errors found: 4 in 16 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 55 Gabor Loki 2011-04-26 01:15:08 PDT
Created attachment 91082 [details]
Speeding up FETurbulence with SMP

I got a feedback about the container of parameters in private; the destructor of parameters should be called during the destruction of their container.

Well, in order to solve this issue the ByteArray is changed to Vector, the container of the parameters and the ParallelEnviroment become the members of ParallelJobs class.
Comment 56 WebKit Review Bot 2011-04-26 01:19:27 PDT
Attachment 91082 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:67:  More than one command on the same line  [whitespace/newline] [4]
Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:67:  Missing space before {  [whitespace/braces] [5]
Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:72:  More than one command on the same line  [whitespace/newline] [4]
Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:72:  Missing space before {  [whitespace/braces] [5]
Total errors found: 4 in 16 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 57 Csaba Osztrogonác 2011-04-26 04:59:16 PDT
Comment on attachment 91082 [details]
Speeding up FETurbulence with SMP

You fixed all things what other asked, and there weren't 
any other objections, so it is ready for landing now. r=me
Comment 58 Eric Seidel (no email) 2011-04-26 06:57:53 PDT
Style?
Comment 59 Gabor Loki 2011-04-26 07:02:43 PDT
> Style?

:) I guess style bot is not prepared for the 'newly-supported feature of the C language'. See here: http://libdispatch.macosforge.org/trac/wiki/tutorial#Definingworkitems:FunctionsandBlocks
Comment 60 Eric Seidel (no email) 2011-04-26 07:36:13 PDT
Please file a bug against the style-bot.  :)
Comment 61 WebKit Commit Bot 2011-04-26 07:38:07 PDT
The commit-queue encountered the following flaky tests while processing attachment 91082 [details]:

http/tests/misc/favicon-loads-with-icon-loading-override.html bug 58412 (author: alice.liu@apple.com)
The commit-queue is continuing to process your patch.
Comment 62 WebKit Commit Bot 2011-04-26 07:41:04 PDT
Comment on attachment 91082 [details]
Speeding up FETurbulence with SMP

Clearing flags on attachment: 91082

Committed r84911: <http://trac.webkit.org/changeset/84911>
Comment 63 WebKit Commit Bot 2011-04-26 07:41:13 PDT
All reviewed patches have been landed.  Closing bug.
Comment 64 WebKit Review Bot 2011-04-26 09:24:31 PDT
http://trac.webkit.org/changeset/84911 might have broken SnowLeopard Intel Release (WebKit2 Tests)
Comment 65 Nikolas Zimmermann 2011-04-26 10:47:37 PDT
Excellent work, looking forward to test it :-)