43903 – Speeding up SVG filters with multicore (SMP) support

RESOLVED FIXED 43903

Speeding up SVG filters with multicore (SMP) support

https://bugs.webkit.org/show_bug.cgi?id=43903

Summary Speeding up SVG filters with multicore (SMP) support

Zoltan Herczeg

Reported 2010-08-12 02:46:50 PDT

This patch is just show new ways of exploiting multicore cpus in WebKit. Filters usually requires a huge amount of CPU work, especially for animated SVGs. Any feedback would be welcome.

Attachments
parallel svg light filter (9.38 KB, patch) 2010-08-12 02:49 PDT, Zoltan Herczeg	no flags	Details Formatted Diff Diff
updated patch (17.22 KB, patch) 2010-11-30 05:07 PST, Zoltan Herczeg	no flags	Details Formatted Diff Diff
Speeding up FETurbulence with SMP (23.58 KB, patch) 2011-02-22 07:06 PST, Gabor Loki	ossy: review-	Details Formatted Diff Diff
A big SVG to test turbulence filter (625 bytes, image/svg+xml) 2011-03-02 03:44 PST, Gabor Loki	no flags	Details
A simple test case to benchmark the SMP-based turbulence filter (2.54 KB, image/svg+xml) 2011-03-02 04:24 PST, Gabor Loki	no flags	Details
Measurement result with Methanol (1.12 KB, text/plain) 2011-03-09 07:41 PST, Gabor Loki	no flags	Details
Speeding up FETurbulence with SMP (30.44 KB, patch) 2011-04-07 01:06 PDT, Gabor Loki	no flags	Details Formatted Diff Diff
Speeding up FETurbulence with SMP (30.46 KB, patch) 2011-04-07 05:19 PDT, Gabor Loki	no flags	Details Formatted Diff Diff
Speeding up FETurbulence with SMP (41.75 KB, patch) 2011-04-12 00:41 PDT, Gabor Loki	no flags	Details Formatted Diff Diff
Speeding up FETurbulence with SMP (42.72 KB, patch) 2011-04-12 06:26 PDT, Gabor Loki	no flags	Details Formatted Diff Diff
Speeding up FETurbulence with SMP (42.15 KB, patch) 2011-04-26 01:15 PDT, Gabor Loki	no flags	Details Formatted Diff Diff
Show Obsolete (7) View All Add attachment proposed patch, testcase, etc.

Zoltan Herczeg

Comment 1 2010-08-12 02:49:59 PDT

Created attachment 64198 [details] parallel svg light filter Light filter is easy to split up for multiple CPUs. The initial plan is to add one helper thread, and two functions (execute / waitForCompletition) to the Filter base class.

Zoltan Herczeg

Comment 2 2010-11-30 05:07:48 PST

Created attachment 75125 [details] updated patch

Nikolas Zimmermann

Comment 3 2010-11-30 05:48:36 PST

Comment on attachment 75125 [details] updated patch View in context: https://bugs.webkit.org/attachment.cgi?id=75125&action=review The idea looks great, the video you showed is also very convincing! :-) Some early review comments: > WebCore/platform/graphics/filters/FELighting.cpp:240 > + data.interior(offset, normalVector); Hm, not related to this patch, but 'interior' is a bad name, what does it do? > WebCore/platform/graphics/filters/FELighting.cpp:246 > +#ifdef FILTER_THREAD This should use the USE(FILTER_THREADS) or ENABLE(FILTER_THREADS) logic. (not sure which one to use, need to read Maciejs mail again regarding that topic, when to use USE or ENABLE, but I guess USE is approriate here). > WebCore/platform/graphics/filters/FELighting.cpp:249 > + PaintThreadData& data = *reinterpret_cast<PaintThreadData*>(parameter); How about adding ASSERT(parameter) as first argument. > WebCore/platform/graphics/filters/FELighting.cpp:280 > + if (width >= 3 && height >= 10 && width * height > 4000) { I'd like a comment here, that this is emperical data. > WebCore/platform/graphics/filters/FELighting.h:79 > + { An ASSERT(filter) can't hurt. It looks awkward to use filter(filter) and lightingData(lightingData) - is that guarenteed to work? I was not aware that this is possible! > WebCore/platform/graphics/filters/FETurbulence.cpp:335 > +#ifdef FILTER_THREAD Use USE(FILTER_THREADS). > WebCore/platform/graphics/filters/FETurbulence.cpp:338 > + PaintThreadData& data = *reinterpret_cast<PaintThreadData*>(parameter); Maybe ASSERT(parameter) first. > WebCore/platform/graphics/filters/FETurbulence.cpp:360 > +#ifdef FILTER_THREAD Same comment, use USE(FILTER_THREADS), won't repeat it for the other occourences. > WebCore/platform/graphics/filters/FETurbulence.cpp:362 > + if (imageRect.height() >= 6 && imageRect.width() * imageRect.height() > 4000) { Emperical data comment... > WebCore/platform/graphics/filters/FETurbulence.h:95 > + { ASSERT(filter) > WebCore/platform/graphics/filters/FilterEffect.cpp:33 > +#ifdef FILTER_THREAD > +ThreadIdentifier FilterEffect::s_threadID; > +FilterEffect::ThreadShared* FilterEffect::s_threadShared; > +bool FilterEffect::s_stopThread; > +#endif We generally disallow globals, so how about putting these in functions? static ThreadIdentifier threadIdentifier() { static ThreadIdentifier s_threadID; return s_threadID; } Are you avoiding the use of DEFINE_STATIC_LOCAL on purpose? I have never used threads in WebCore so far, forgive my ignorance if I'm overlooking something obvious. > WebCore/platform/graphics/filters/FilterEffect.cpp:141 > + s_threadShared->m_cond.wait(s_threadShared->m_mutex); Is this prefereedd style when using threads? Are shall we expose accesors for m_cond, m_start, m_finish, m_function, m_mutex etc.?

Gabor Loki

Comment 4 2011-02-22 07:06:48 PST

Created attachment 83307 [details] Speeding up FETurbulence with SMP I have redesigned the previous concept of threading from Zoltan's suggestion and introduced a framework for parallel tasks/jobs. The ParallelJobs framework is based on WebKit's threading infrastructure and/or OpenMP's API. The usage of ParallelJobs is the following: 1.- Create ParallelJobs object - which allocates the worker threads and necessary objects. 2.- Fill the parameter array - allocated by ParallelJobs, assign different sub-tasks for the workers 3.- Execute the workers 4.- Combine the necessary information [optional] This patch contains: 1.- ParallelJobs framework 2.- SMP version of FETurbulence One more note: I have measured 60%-80% performance progression on dual-core systems on ARM and x86 as well.

Benjamin Poulain

Comment 5 2011-02-22 07:34:42 PST

May I suggest to add benchmarks (even if only Qt benchmarks) to see how does that fly? I am also not a fan of OpenMP for having had some strange performance numbers out of it (like x1.1 boost where direct threading with pthread gave me x1.9). If we want to use more SMP inside WebKit, I would really prefer something more like TBB, because you have a lot more control over the way things are parallelized.

Build Bot

Comment 6 2011-02-22 07:49:08 PST

Attachment 83307 [details] did not build on win: Build output: http://queues.webkit.org/results/7946220

WebKit Review Bot

Comment 7 2011-02-22 08:36:41 PST

Attachment 83307 [details] did not build on mac: Build output: http://queues.webkit.org/results/7946234

Gabor Loki

Comment 8 2011-02-22 08:42:56 PST

> May I suggest to add benchmarks (even if only Qt benchmarks) to see how does that fly? Where to add benchmarks? If you send me some pointers I will happily measure the effect of this patch, but I have to mention that in general the benchmarks mainly give an overall insight about the performance. So, if you really would like to measure the effect of this kind of parallelization, you should create a basic svg with turbulence filter. For example: you can use the Source/WebCore/manual-tests/svg-filter-animation.svg as a simple benchmark after all non-related filters are removed from the SVG. > I would really prefer something more like TBB, because you have a lot more control over the way things are parallelized. Good idea, but I feel this could be another patch where we will introduce TBB for another threading model of the WebKit's threading interface.

Benjamin Poulain

Comment 9 2011-02-22 08:58:43 PST

(In reply to comment #8) > > May I suggest to add benchmarks (even if only Qt benchmarks) to see how does that fly? > > Where to add benchmarks? If you send me some pointers I will happily measure the effect of this patch, but I have to mention that in general the benchmarks mainly give an overall insight about the performance. So, if you really would like to measure the effect of this kind of parallelization, you should create a basic svg with turbulence filter. For example: you can use the Source/WebCore/manual-tests/svg-filter-animation.svg as a simple benchmark after all non-related filters are removed from the SVG. The benchmarks for Qt are in Source/WebKit/qt/tests/benchmarks/ This is really a performance optimization. You don't need SMP to make the code cleaner/simpler, quite the contrary. So I think the performance gain is very relevant for this task. If I understant correctly, you are doing research. So getting accurate numbers is probably interesting for your papers as well. When we considered parallelization for Qt's CPU rasterization, we looked for the boundaries of input for which SMP is helpful VS is harmful. I think this kind of analysis in both interesting for the WebKit project, and for research. (and note that we ended up not using parallelization for most primitives of Qt because it was more harmful than helpful for general rendering).

Gabor Loki

Comment 10 2011-02-24 00:19:50 PST

> If I understant correctly, you are doing research. So getting accurate numbers is probably interesting for your papers as well. I would be the most happiest people if this topic was a research. > When we considered parallelization for Qt's CPU rasterization, we looked for the boundaries of input for which SMP is helpful VS is harmful. I think this kind of analysis in both interesting for the WebKit project, and for research. (and note that we ended up not using parallelization for most primitives of Qt because it was more harmful than helpful for general rendering). I agree with you about the importance of boundaries. That's why I left the s_minimalRectDimension for platforms. All other code depends on the number of requested threads. So, if the platform/OS/architecture is able to decide its boundaries this will be an easy task to implement their needs. Of course lots of benchmarking are needed to achieve this, but for introducing this parallelization approach the current limit is fine. Fine tuning of parallel turbulence filter should be done later with the help of benchmarks. Anyway I will check Qt's benchmark environment. Is the time measuring method of this environment based on clock_t or wall-time (for example: clock_gettime)?

Benjamin Poulain

Comment 11 2011-02-24 02:07:14 PST

(In reply to comment #10) > Anyway I will check Qt's benchmark environment. Is the time measuring method of this environment based on clock_t or wall-time (for example: clock_gettime)? You choose the kind of timing with a parameter when running the benchmark: http://doc.qt.nokia.com/latest/qtestlib-manual.html#creating-a-benchmark

Eric Seidel (no email)

Comment 12 2011-02-24 02:20:48 PST

crazy.

Eric Seidel (no email)

Comment 13 2011-02-24 02:21:43 PST

If we're going to use this parallel jobs thing widely, it's eventually going to need a GCG implementation: http://developer.apple.com/technologies/mac/snowleopard/gcd.html

Simon Fraser (smfr)

Comment 14 2011-02-24 22:17:21 PST

Comment on attachment 83307 [details] Speeding up FETurbulence with SMP View in context: https://bugs.webkit.org/attachment.cgi?id=83307&action=review > Source/JavaScriptCore/ChangeLog:10 > + too complex. Using the power of Simmetric Multi Processing (SMP) we symmetric

Simon Fraser (smfr)

Comment 15 2011-02-24 22:17:55 PST

I think it would be better to try to speed things up with SSE before we go to multithreading.

Zoltan Herczeg

Comment 16 2011-02-24 23:13:07 PST

I think some template magic could make this more user frendly. We could remove some static_casts and sizeof operations in that way.

Zoltan Herczeg

Comment 17 2011-02-24 23:15:19 PST

(In reply to comment #15) > I think it would be better to try to speed things up with SSE before we go to multithreading. See https://bugs.webkit.org/show_bug.cgi?id=54456 (ARM Neon is similar to x86 SSE). Could you help me reviewing that?

Gabor Loki

Comment 18 2011-02-25 01:06:18 PST

(In reply to comment #15) > I think it would be better to try to speed things up with SSE before we go to multithreading. The SSE and NEON are very cool features which should be used in WebKit directly or indirectly, but they are architecture dependent features. We can use them to speed up something directly, like Zoltan's patch for FELighting, or design a common framework (something like JSC's macro assembler) which drops many performance opportunities because of the high level abstraction - like in JSC. Anyhow, we should consider SIMD extensions, but the scope of SIMD and SMP are different. Any of them can live without the other, and can work with the other.

Balazs Kelemen

Comment 19 2011-03-01 01:27:07 PST

(In reply to comment #11) > (In reply to comment #10) > > Anyway I will check Qt's benchmark environment. Is the time measuring method of this environment based on clock_t or wall-time (for example: clock_gettime)? > > You choose the kind of timing with a parameter when running the benchmark: http://doc.qt.nokia.com/latest/qtestlib-manual.html#creating-a-benchmark You can find another version of the benchmarks on gitorious: http://gitorious.org/qtwebkit/performance. In this repo the benchmarks based on wall-time with higher accuracy.

Gabor Loki

Comment 20 2011-03-02 00:55:30 PST

> You can find another version of the benchmarks on gitorious: http://gitorious.org/qtwebkit/performance. It looks like this performance tool has some issue running as a benchmark. I used several tools from this project, but they provided strange values to me. I create a simple svg with only one filter (on a BIG area). It takes about 5 sec to compute the filter with QtWebKit. And here is a sample result of cycler test: ********* Start testing of tst_Cycler ********* Config: Using QTest library 4.7.0, Qt 4.7.0 PASS : tst_Cycler::initTestCase() PASS : tst_Cycler::load() PASS : tst_Cycler::cleanupTestCase() Totals: 3 passed, 0 failed, 0 skipped ********* Finished testing of tst_Cycler ********* Printing result for: total benchmark: cycler::load http://127.0.0.1/~smp/svg/turbulence.svg mean: 3 msecs +/- 3239 msecs, +/- 107967 % avg: 1685 msecs 5741 3 3 3 2 2 2 5464 5636 3 Total mean: 3 msecs per test, avg: 1685 And another one: mean: 3 msecs +/- 3850 msecs, +/- 128333 % avg: 2312 msecs 5875 5682 5875 5679 3 2 2 2 2 2 Btw, using WebKit's threading, the results: mean: 2 msecs +/- 1588 msecs, +/- 79400 % avg: 676 msecs 3408 3336 3 3 2 2 2 2 2 2 I think there is some kind of caching issue. I feel this environment cannot be used as a benchmark now. Benjamin, do you have any idea what is happening here? Why did the environment produce this strange numbers? Btw, the derivation of the valid numbers are also too high.

Benjamin Poulain

Comment 21 2011-03-02 03:00:57 PST

> Benjamin, do you have any idea what is happening here? Why did the environment produce this strange numbers? Nope, I have no idea what is going on here. The deviation is just insane, it is usually around 1%. But I would not use that particular benchmark. I would use tst_Painting::fullPagePaint(). Because what you really care about is the rendering speed. The cycler will also take into account loading and layout time.

Gabor Loki

Comment 22 2011-03-02 03:44:45 PST

Created attachment 84394 [details] A big SVG to test turbulence filter

Gabor Loki

Comment 23 2011-03-02 03:47:12 PST

> I would use tst_Painting::fullPagePaint(). Because what you really care about is the rendering speed. In this test the calculation of SVG is not included. ********* Finished testing of tst_Painting ********* Printing result for: total benchmark: painting::paint http://127.0.0.1/~smp/svg/turbulence.svg mean: 46 msecs +/- 0 msecs, +/- 0 % avg: 46 msecs 47 47 47 46 46 46 46 46 46 46 benchmark: painting::fullPagePaint http://127.0.0.1/~smp/svg/turbulence.svg mean: 58 msecs +/- 0 msecs, +/- 0 % avg: 58 msecs 58 58 58 58 58 58 58 58 59 58 Total mean: 52 msecs per test, avg: 52

Gabor Loki

Comment 24 2011-03-02 04:24:18 PST

Created attachment 84399 [details] A simple test case to benchmark the SMP-based turbulence filter This test case is able to show the effect of SMP-based turbulence filter. I know the JS is also taken account in this test, but its effect for the execution time is not significant (according to filter calculation). It is not an ultimate benchmark, it is just a simple test. If anyone has a good test or a benchmark for SVG, please let me know!

Balazs Kelemen

Comment 25 2011-03-02 04:40:51 PST

(In reply to comment #21) > > Benjamin, do you have any idea what is happening here? Why did the environment produce this strange numbers? > > Nope, I have no idea what is going on here. The deviation is just insane, it is usually around 1%. > Not belongs strictly to this bug (should we create another?) but I have also tried those tests (the gitorious version) and the deviation was not acceptable. I guess the trunk version also suffering from this in wall-time mode. I do not think this is only a problem with SVG related pages but it is caused by the event based approach of our benchmark framework. The really good thing in this framework is that one can measure the performance of specific subsystems like layouting or painting but we cannot use it to validate optimizations as long as we cannot make it producing stable results.

Gabor Loki

Comment 26 2011-03-09 07:41:32 PST

Created attachment 85174 [details] Measurement result with Methanol I have measured the performance progression of this SMP framework with our page loading benchmark (called Methanol: http://gitorious.org/methanol). Two measurement scenarios were used to validate the framework (an artificial test-suite and a collection of real websites). Scenario 1: - Real websites: - Contains the first 25 unique websites from Alexa Top 500. - Does not contain any SVG. - Result - Measured on x86 and on ARM - It is not surprising that there is no difference between the reference and the two SMP-based approaches. Scenario 2: - Artificial test-suite: - Methanol contains a small test-suite for SVG. Some test cases contain only one SVG filter with simple shapes, others are using combined filters. - Result - Measured on x86 and ARM - Methanol reported 30% - 80% performance progression on turbulence-related tests. Others were not changed. See the attached file. - An important though - The 'turbSmall' test contains several small elements. The turbulence filter is called on these elements independently. The drawing area is for each elements is 100x100 (which is under the initial SMP-limit for turbulence). You can see the performance numbers were the same for each measurements. While the 'turbBig' test performs better with SMP (in this case the elements' size are 400x400). If someone has another SVG-related benchmark, I would be happy to measure the SMP framework with it.

Gabor Loki

Comment 27 2011-03-28 23:53:27 PDT

> Created an attachment (id=83307) [details] > Speeding up FETurbulence with SMP > ... > - Methanol reported 30% - 80% performance progression on turbulence-related tests. Others were not changed. See the attached file. Any other thoughts or suggestions?

Csaba Osztrogonác

Comment 28 2011-04-04 08:51:22 PDT

Comment on attachment 83307 [details] Speeding up FETurbulence with SMP View in context: https://bugs.webkit.org/attachment.cgi?id=83307&action=review On the whole looks good to me, but please fix the following things: - I agree with Zoltan ( https://bugs.webkit.org/show_bug.cgi?id=43903#c16 ), we need template for parameter passing to avoid ugly reinterpret_casts and using sizof. - Please fix the typo detected by smfr. (s/Simmetric/Symmetric) - And what about the other build systems? You should add the new files for all build system. ( Or at least disable PARALLEL_JOBS if !PLATFORM(QT) ) For example it doesn't build on Windows because of missing include path. r- for the present due to my comments above and below. > Source/JavaScriptCore/wtf/ParallelJobs.cpp:135 > + ASSERT(m_numberOfJobs > 0); m_numberOfJobs is always greater than 0 if !ENABLE(OPENMP) > Source/JavaScriptCore/wtf/ParallelJobs.h:44 > +// ParallelJobs parallelJobs(&worker, sizeof(Parameters)); s/worker/workerFunction > Source/JavaScriptCore/wtf/Platform.h:1162 > +#if !defined(ENABLE_PARALLEL_JOBS) && ENABLE(OPENMP) > +#define ENABLE_PARALLEL_JOBS 1 > +#endif We shouldn't enable parallel jobs if ENABLE(SINGLE_THREADED) is true, so we need an additional !ENABLE(SINGLE_THREADED) guard.

Gabor Loki

Comment 29 2011-04-05 06:26:21 PDT

> On the whole looks good to me, but please fix the following things: > > - I agree with Zoltan ( https://bugs.webkit.org/show_bug.cgi?id=43903#c16 ), we need > template for parameter passing to avoid ugly reinterpret_casts and using sizof. Well, it seems an easy task, but the ParallelJobs has a static s_threadPool member in case of using WebKit's threading. If ParallelJobs is changed to template<typename Type>ParallelJobs, we will end up in more s_threadPool members (one for each different template type). This is not what we want, we should have only one thread pool. A possible solution is to create a base class (for example ParallelEnvironment) which contains the s_threadPool member and the related ThreadPrivate class. I am still not convinced which solution is better - a template class with a base pool, or passing the size of the parameter.

Zoltan Herczeg

Comment 30 2011-04-05 06:31:58 PDT

> I am still not convinced which solution is better - a template class with a base pool, or passing the size of the parameter. I vote for the base class. You have to do it only once. The other one is ugly.

Alexey Proskuryakov

Comment 31 2011-04-05 08:50:46 PDT

Please consider that inheritance should only be used to express "is a" relationships, and never to share code or data.

Gabor Loki

Comment 32 2011-04-07 01:06:10 PDT

Created attachment 88589 [details] Speeding up FETurbulence with SMP Well, I redesigned the patch using template (and fixed the other issues as well). The difference of the usage of ParallelJobs looks very small, but the framework itself is getting more complex.

Gabor Loki

Comment 33 2011-04-07 05:19:11 PDT

Created attachment 88609 [details] Speeding up FETurbulence with SMP Speculative fix for vcproj file

Alexey Proskuryakov

Comment 34 2011-04-07 08:58:14 PDT

+ programming. The framework is based on WebKit's threading infrastructure + and Open Multi-Processing's (OpenMP) API. While this is certainly an interesting experiment and an extremely promising direction, I'm not sure that adding an OpenMP-inspired threading code to WebKit is a good idea. The code doesn't fit with the rest of WebKit, and it doesn't fit well with some target platforms (e.g. on OS X, one should use libdispatch for parallel jobs).

Zoltan Herczeg

Comment 35 2011-04-07 10:25:45 PDT

> The code doesn't fit with the rest of WebKit, and it doesn't fit well with some target platforms (e.g. on OS X, one should use libdispatch for parallel jobs). As far as I see the central question of the patch is the generic WebKit interface for parallel jobs. The technical details (which platfom how implements them) is secondary. The header contains a simple, portable interface, which has 3 steps: - request threads, passing the optimal number of threads - system returns with the number of allocated threads (always <= optimal, but >= l), - fill the thread local data - execute the jobs, and wait for finish Alexey, would you do that a different way? Again, the purpose of this work seems to find a simple, yet powerful parallel processing interface. The openmp thing is just an implementation, nothing more.

Alexey Proskuryakov

Comment 36 2011-04-07 11:11:39 PDT

Libdispatch is different in recognizing that no single application knows what happens with the rest of the system, so no application can know how many concurrent threads to use. Instead, you supply small tasks, and it makes sure to optimize parallelism globally and dynamically. Since it's open source, you can use it on other operating systems, although the benefits are smaller without kernel support.

Zoltan Herczeg

Comment 37 2011-04-07 11:21:15 PDT

(In reply to comment #36) > Libdispatch is different in recognizing that no single application knows what happens with the rest of the system, so no application can know how many concurrent threads to use. Instead, you supply small tasks, and it makes sure to optimize parallelism globally and dynamically. > > Since it's open source, you can use it on other operating systems, although the benefits are smaller without kernel support. This is good news, it fits to the current approach. Basically we would return the optimal number of threads and let the system execute them in any order.

Alexey Proskuryakov

Comment 38 2011-04-07 11:36:06 PDT

What I'm saying is that libdispatch changes the number of threads dynamically. It's not determined at the time tasks are scheduled.

Zoltan Herczeg

Comment 39 2011-04-07 11:52:08 PDT

(In reply to comment #38) > What I'm saying is that libdispatch changes the number of threads dynamically. It's not determined at the time tasks are scheduled. I understand that, still I don't see a problem here. Basically even with libdispatch, you need to create those small tasks, isn't it? And their number is fixed when you created them, and they were executed somehow (doesn't matter how). http://developer.apple.com/library/mac/#documentation/General/Conceptual/ConcurrencyProgrammingGuide/ConcurrencyandApplicationDesign/ConcurrencyandApplicationDesign.html#//apple_ref/doc/uid/TP40008091-CH100-SW1 Even with libdispatch, you determine the number of sub-tasks (since you create them), and their number is never infinte. Thus, the ParrallelJobs returns the number you passed as maxThreads, you create those sub-tasks, and libdispatch execute them.

Alexey Proskuryakov

Comment 40 2011-04-07 12:14:33 PDT

So, are you saying that we'd just pass INT_MAX as optimal number of threads? Even if that works, it sounds like an abuse. I think that using parallelism more would be a good topic for WebKit contributor meeting discussion this month.

Zoltan Herczeg

Comment 41 2011-04-07 12:48:32 PDT

(In reply to comment #40) > So, are you saying that we'd just pass INT_MAX as optimal number of threads? Even if that works, it sounds like an abuse. hmmmm, not exactly. I would rather say that the optimal number would depend on the input, like we want to start a job for each 5000 pixels (don't think every pixel should have an own thread...), so the optimal would be round_up(width * height / 5000) > I think that using parallelism more would be a good topic for WebKit contributor meeting discussion this month. Absolutely true. Unfortunately we can't be there in person. I hope we could follow you somehow, like mails, or something...

Nikolas Zimmermann

Comment 42 2011-04-08 23:51:08 PDT

(In reply to comment #35) > > The code doesn't fit with the rest of WebKit, and it doesn't fit well with some target platforms (e.g. on OS X, one should use libdispatch for parallel jobs). > > As far as I see the central question of the patch is the generic WebKit interface for parallel jobs. The technical details (which platfom how implements them) is secondary. The header contains a simple, portable interface, which has 3 steps: > - request threads, passing the optimal number of threads > - system returns with the number of allocated threads (always <= optimal, but >= l), > - fill the thread local data > - execute the jobs, and wait for finish > > Alexey, would you do that a different way? Again, the purpose of this work seems to find a simple, yet powerful parallel processing interface. The openmp thing is just an implementation, nothing more. Can we maybe split up everything openmp related into a different header, to really make it a detail of the implementation. I think the patch is already great as-is, though I'm absolutely no expert in this area, and like the hear from others how to go on :-)

Zoltan Herczeg

Comment 43 2011-04-09 00:27:42 PDT

> Can we maybe split up everything openmp related into a different header, to really make it a detail of the implementation. The patch needs to be redesigned :(, because the ^{ block } construction required by libdispatch combined with templates results an GCC internal compiler error. Perhaps someone should report this to Apple as it seems ^{ block } is a special extension introduced by them.

Gabor Loki

Comment 44 2011-04-09 02:39:08 PDT

It looks like everything is discussed without me. ;) That's great! First of all the ParallelJobs framework should look like the Threading one. So splitting it into different headers is a good idea. About OpenMP, libdispatch and other parallel processing API. Every platform has its own favorite parallel processing method/API. I have only introduced one extra (OpenMP) beside the generic WebKit's threading infrastructure. The libdispatch is also a very good feature, but I am not familiar with it. So, It would be nice if someone helps in this area. About the framework. The WebKit's ParallelJobs framework should be simple and powerful for each parallel processing method/API. It is simple if the developers think it is ;) . It is powerful if it is as close to the platform dependent parallel processing API as possible. My solution is close to satisfy those two requirement. Well, I do not think I invented a brand new method for parallel processing. The basics are the same as you can see in any API. - Allocate / request some threads (depending on the input) - Fill the threads' local data - Execute the threads and wait for finish This is a very basic method using parallel processing and every API provides a way to execute tasks based on the above topics. As I see libdispatch also supports this way (http://libdispatch.macosforge.org/trac/wiki/tutorial#Puttingworkinthepipeline:Queues) About the technical details. Using template makes a big mess. Especially when I try to add libdispatch support. As Zoltan said libdispatch is not supported in a template. The GCC 4.2 fails with internal compiler error on Mac. So, I have to redesign the whole framework. - First of all, I would like to create separate headers for each parallel processing method/API and a common one which will contain the public API of WebKit's ParallelJobs framework (thanks to Nikolas). - The other important thing should be the base class (so-called ParallelEnviroment). It will be something like in my first patch, will contain a constructor for initialization, functions to be able to fill the local data and an execute function. - The third part should be the template class. The purpose of this class is to make the framework be as simple as possible. Any more thoughts?

Geoffrey Garen

Comment 45 2011-04-11 11:22:09 PDT

Is it really a good idea to add the complexity of threading to the mix when so many basic bugs and crashes in SVG filters remain?

Zoltan Herczeg

Comment 46 2011-04-11 12:26:47 PDT

(In reply to comment #45) > Is it really a good idea to add the complexity of threading to the mix when so many basic bugs and crashes in SVG filters remain? So many is a little exaggerated :) SVG is getting more attention than in the past, and the code improves rapidly.

Brent Fulgham

Comment 47 2011-04-11 13:36:18 PDT

Comment on attachment 88609 [details] Speeding up FETurbulence with SMP View in context: https://bugs.webkit.org/attachment.cgi?id=88609&action=review This is a very interesting project, and I'd love to be able to benchmark it under WinCairo to see what improvements it might provide. It would be nice (in the future) to allow GCD/OpenMP backends for generic parallelization (perhaps in a 'platform/libdispatch' and 'platform/openmp' structure). > Source/JavaScriptCore/wtf/ParallelJobs.cpp:135 > + Couldn't the m_numberOfJobs calculation and the threadpool setup be done in a method, hiding the knowledge of OPENMP use? Ditto for maxNumberOfThreads (above), though probably less important. > Source/JavaScriptCore/wtf/ParallelJobs.h:73 > +static const int maxParallelThreads = 2; Can't this be hidden inside the implementation file? Is it accessed by other compilation units? > Source/JavaScriptCore/wtf/ParallelJobs.h:149 > + void executeJobs() For clarity, could this implementation be hidden inside the cpp file? > Source/JavaScriptCore/wtf/ParallelJobs.h:158 > + int i; Why is this declared outside the for?

Geoffrey Garen

Comment 48 2011-04-11 14:56:10 PDT

> So many is a little exaggerated :) SVG is getting more attention than in the past, and the code improves rapidly. Given the rapid improvement you're expecting, would you be willing to hold off on changes like this until all reproducible crashers in SVG filters have been fixed?

Dirk Schulze

Comment 49 2011-04-12 00:01:40 PDT

(In reply to comment #48) > > So many is a little exaggerated :) SVG is getting more attention than in the past, and the code improves rapidly. > > Given the rapid improvement you're expecting, would you be willing to hold off on changes like this until all reproducible crashers in SVG filters have been fixed? I just could just find of two crashes in bugzilla. IIRC Zoltan is working on one of these crashes. The second one is more a implementation problem of SVGImage, that affects SVG filter. So SVGImage should get fixed instead of filter. If you know of more bugs, please let me know. If you know of more crashes, file a bug and add me and Zoltan please.

Gabor Loki

Comment 50 2011-04-12 00:41:58 PDT

Created attachment 89166 [details] Speeding up FETurbulence with SMP Well, I have updated the path as I promised; using separate headers, base classes try to be close to the platform dependent parallel processing APIs, the template class is as generic as possible. Although I am not familiar with Mac, I have introduced a possible usage of libdistpatch API. I hope there will be someone else who is able to fine tune this part for Mac.

Gabor Loki

Comment 51 2011-04-12 00:46:00 PDT

> For clarity, could this implementation be hidden inside the cpp file? Hmm. Why do you want to hide the implementation details? For example, in JavaScriptCore almost the whole assembler is public.

Gabor Loki

Comment 52 2011-04-12 03:00:30 PDT

Comment on attachment 89166 [details] Speeding up FETurbulence with SMP I am going to update the build systems to make svn apply happy.

Gabor Loki

Comment 53 2011-04-12 06:26:06 PDT

Created attachment 89195 [details] Speeding up FETurbulence with SMP

WebKit Review Bot

Comment 54 2011-04-12 06:27:33 PDT

Attachment 89195 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:75: More than one command on the same line [whitespace/newline] [4] Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:75: Missing space before { [whitespace/braces] [5] Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:78: More than one command on the same line [whitespace/newline] [4] Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:78: Missing space before { [whitespace/braces] [5] Total errors found: 4 in 16 files If any of these errors are false positives, please file a bug against check-webkit-style.

Gabor Loki

Comment 55 2011-04-26 01:15:08 PDT

Created attachment 91082 [details] Speeding up FETurbulence with SMP I got a feedback about the container of parameters in private; the destructor of parameters should be called during the destruction of their container. Well, in order to solve this issue the ByteArray is changed to Vector, the container of the parameters and the ParallelEnviroment become the members of ParallelJobs class.

WebKit Review Bot

Comment 56 2011-04-26 01:19:27 PDT

Attachment 91082 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:67: More than one command on the same line [whitespace/newline] [4] Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:67: Missing space before { [whitespace/braces] [5] Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:72: More than one command on the same line [whitespace/newline] [4] Source/JavaScriptCore/wtf/ParallelJobsLibdispatch.h:72: Missing space before { [whitespace/braces] [5] Total errors found: 4 in 16 files If any of these errors are false positives, please file a bug against check-webkit-style.

Csaba Osztrogonác

Comment 57 2011-04-26 04:59:16 PDT

Comment on attachment 91082 [details] Speeding up FETurbulence with SMP You fixed all things what other asked, and there weren't any other objections, so it is ready for landing now. r=me

Eric Seidel (no email)

Comment 58 2011-04-26 06:57:53 PDT

Style?

Gabor Loki

Comment 59 2011-04-26 07:02:43 PDT

> Style? :) I guess style bot is not prepared for the 'newly-supported feature of the C language'. See here: http://libdispatch.macosforge.org/trac/wiki/tutorial#Definingworkitems:FunctionsandBlocks

Eric Seidel (no email)

Comment 60 2011-04-26 07:36:13 PDT

Please file a bug against the style-bot. :)

WebKit Commit Bot

Comment 61 2011-04-26 07:38:07 PDT

The commit-queue encountered the following flaky tests while processing attachment 91082 [details]: http/tests/misc/favicon-loads-with-icon-loading-override.html bug 58412 (author: alice.liu@apple.com) The commit-queue is continuing to process your patch.

WebKit Commit Bot

Comment 62 2011-04-26 07:41:04 PDT

Comment on attachment 91082 [details] Speeding up FETurbulence with SMP Clearing flags on attachment: 91082 Committed r84911: <http://trac.webkit.org/changeset/84911>

WebKit Commit Bot

Comment 63 2011-04-26 07:41:13 PDT

All reviewed patches have been landed. Closing bug.

WebKit Review Bot

Comment 64 2011-04-26 09:24:31 PDT

http://trac.webkit.org/changeset/84911 might have broken SnowLeopard Intel Release (WebKit2 Tests)

Nikolas Zimmermann

Comment 65 2011-04-26 10:47:37 PDT

Excellent work, looking forward to test it :-)

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Enhancement

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component SVG

Assignee

Nobody

Reported

2010-08-12 02:46 PDT

Modified

2011-04-26 10:47 PDT History

CC List

22 users Show

URL

Keywords

Depends on

Blocks

abarth
ap
benjamin
buildbot
commit-queue
dimich
eric.carlson
eric
ggaren
kbalazs
krit
levin
loki
mdelaney7
mjs
oliver
ossy
pandras
sam
simon.fraser
webkit.review.bot
zimmermann