48212 – SVG FELighting performance issues

RESOLVED FIXED 48212

SVG FELighting performance issues

https://bugs.webkit.org/show_bug.cgi?id=48212

Summary SVG FELighting performance issues

Dirk Schulze

Reported 2010-10-24 14:28:30 PDT

FELighting has some performance issues. The most obvious issue is the -> operator for the get and set calls on CanvasPixelArray. The ByteArray is ref counted, so we have to use this operator on every read or write operation. This can be avoided by using a ByteArray pointer instead of the CanvasPixelArray like on the other effects. The other big perf issue if FloatPoint3D. The most time is spend on setPixel and mainly in the normalize() function of FloatPoint3D. normalize() makes three divisions and a dot product of itself. The dot product seems really slow and calls x() y() and z() on every call. Can we speed this up? Can't we write a selfDot function? Is it possible to avoid the call of normalize() more? Other issues depend on the kind of Lighting effect and are located at the update functions.

Attachments
draft patch (28.40 KB, patch) 2010-10-27 00:07 PDT, Zoltan Herczeg	krit: review-	Details Formatted Diff Diff
Example (1.02 KB, image/svg+xml) 2010-10-27 00:07 PDT, Zoltan Herczeg	no flags	Details
Draft patch 2 (29.37 KB, patch) 2010-10-27 05:10 PDT, Zoltan Herczeg	no flags	Details Formatted Diff Diff
ByteArray rewriting (29.33 KB, patch) 2010-10-27 05:12 PDT, Zoltan Herczeg	no flags	Details Formatted Diff Diff
final patch (31.37 KB, patch) 2010-11-02 07:51 PDT, Zoltan Herczeg	krit: review+	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Zoltan Herczeg

Comment 1 2010-10-27 00:07:03 PDT

Created attachment 71991 [details] draft patch Faster light filter. I will upload an example, where the speed is improved from 55 to 73 fps (on a mac leopard)! The patch contains heavy optimizations, which you may not like, that is why it is a draft atm.

Zoltan Herczeg

Comment 2 2010-10-27 00:07:50 PDT

Created attachment 71992 [details] Example

Dirk Schulze

Comment 3 2010-10-27 01:02:35 PDT

Comment on attachment 71991 [details] draft patch View in context: https://bugs.webkit.org/attachment.cgi?id=71991&action=review Have you measured the performance with ByteArray*? I couldn't see a difference between usigned char* and ByteArray* on feGaussianBlur with shark. Of course you save some address calculations on using unsigned char* (pixels[0] = ... pixels[1]), but would still interest me you see a difference. The changes on FloatPoint3D are ok. > WebCore/platform/graphics/filters/FELighting.cpp:95 > +inline void FELighting::LightingData::topRight(int offset, IntPair& intPair) > { > - return static_cast<int>(pixels->get(offset - widthMultipliedByPixelSize + cPixelSize + cAlphaChannelOffset)); > + unsigned char* pixel = pixels + offset; > + int left = static_cast<int>(pixel[-cPixelSize + cAlphaChannelOffset]); > + int center = static_cast<int>(pixel[cAlphaChannelOffset]); > + pixel += widthMultipliedByPixelSize; > + int bottomLeft = static_cast<int>(pixel[-cPixelSize + cAlphaChannelOffset]); > + int bottom = static_cast<int>(pixel[cAlphaChannelOffset]); > + intPair.normalX = -(left << 1) + (center << 1) - bottomLeft + bottom; > + intPair.normalY = -left - (center << 1) + bottomLeft + (bottom << 1); > } Is there no way to share more of the code with the other edge functions? > WebCore/platform/graphics/filters/FELighting.cpp:196 > + if (m_specularExponent == 1.0f) == 1 > WebCore/platform/graphics/filters/FELighting.cpp:205 > + normalVector.setZ(1.0f); just ...setZ(1); > WebCore/platform/graphics/filters/FELighting.cpp:214 > + if (m_specularExponent == 1.0f) == 1 > WebCore/platform/graphics/filters/FELighting.cpp:-128 > - if (data.lightStrength > 1.0f) > - data.lightStrength = 1.0f; > - if (data.lightStrength < 0.0f) > - data.lightStrength = 0.0f; 1 and 0 > WebCore/platform/graphics/filters/FELighting.cpp:237 > + IntPair intPair; > + int offset; Please initialize these values. And declare them on the first use. > WebCore/platform/graphics/filters/FELighting.cpp:258 > + setPixel(offset, data, paintingData, 0, 0, -2.0f / 3.0f, -2.0f / 3.0f, intPair); Normally the compiler should optimize it, but can you add a static const for -2 / 3.f ? This is reused multiple times. The same for some other constant calculations, a name is also easier to understand. > WebCore/platform/graphics/filters/FELighting.cpp:346 > RefPtr<ImageData> srcImageData(in->resultImage()->getUnmultipliedImageData(effectDrawingRect)); change srcImageData(...) to srcImageData = ... please > WebCore/platform/graphics/filters/FELighting.h:57 > + struct IntPair { > + int normalX; > + int normalY; > + }; Do you really see a noticeable performance win compared to IntPoint?

Zoltan Herczeg

Comment 4 2010-10-27 05:10:43 PDT

Created attachment 72014 [details] Draft patch 2 This one contains the requested changes except the ByteArray* rewriting. This is the fastest so far, can achive 74 fps.

Zoltan Herczeg

Comment 5 2010-10-27 05:12:34 PDT

Created attachment 72015 [details] ByteArray rewriting This one contains the ByteArray rewriting. It is slower, only 72 fps.

Dirk Schulze

Comment 6 2010-10-27 05:44:45 PDT

Thanks for taking the bother to check ByteArray, have you used callgrind (kcachegrind) or shark to look why it is slower on ByteArray? Is it the get and set method? If so, we might follow kling's suggestion and add a unsafeGet/Set method. I'd also like to know what takes most of the time after your changes.

Zoltan Herczeg

Comment 7 2010-10-27 06:35:26 PDT

(In reply to comment #6) > Thanks for taking the bother to check ByteArray, have you used callgrind (kcachegrind) or shark to look why it is slower on ByteArray? Is it the get and set method? If so, we might follow kling's suggestion and add a unsafeGet/Set method. I'd also like to know what takes most of the time after your changes. Assembly magic. The pixels->get(offset) is turned to indexed addressing modes [reg + reg], while the other [reg, immediate] is supported by more architectures, and results a simplified machine code. Valgrind is not useful for performance measurements, since it totally recompiles the code, even if you don't touch the instruction stream. Valgrind is a great tool, and excellent for searching memory leaks (and other things unrealted to performance). There is no unsafeGet method for ByteArray. Get is unsafe enough :) unsigned char get(unsigned index) const { ASSERT(index < m_size); return m_data[index]; }

Dirk Schulze

Comment 8 2010-10-27 07:15:52 PDT

(In reply to comment #7) > (In reply to comment #6) > > Thanks for taking the bother to check ByteArray, have you used callgrind (kcachegrind) or shark to look why it is slower on ByteArray? Is it the get and set method? If so, we might follow kling's suggestion and add a unsafeGet/Set method. I'd also like to know what takes most of the time after your changes. > > Assembly magic. The pixels->get(offset) is turned to indexed addressing modes [reg + reg], while the other [reg, immediate] is supported by more architectures, and results a simplified machine code. > > Valgrind is not useful for performance measurements, since it totally recompiles the code, even if you don't touch the instruction stream. Valgrind is a great tool, and excellent for searching memory leaks (and other things unrealted to performance). > > There is no unsafeGet method for ByteArray. Get is unsafe enough :) > > unsigned char get(unsigned index) const > { > ASSERT(index < m_size); > return m_data[index]; > } You don't need to recompile for using callgrind, and it is very useful to see how much time is spend in which function. Valgrind is a tool chain, and consists of different tools, not only cachegrind (I guess you were just thinking about this). kcachegrind2 on the other hand is just a graphical tool and makes it quiet easy to visualize the output of callgrind. Yes, get is pretty simple, but set has a bit more "complexity", that might get visible on the massive calls of it for pixel manipulation.

Zoltan Herczeg

Comment 9 2010-10-27 09:47:13 PDT

> You don't need to recompile for using callgrind, and it is very useful to see how much time is spend in which function. Valgrind is a tool chain, and consists of different tools, not only cachegrind (I guess you were just thinking about this). kcachegrind2 on the other hand is just a graphical tool and makes it quiet easy to visualize the output of callgrind. I like valgrind, I even wrote valgrind tools myself. Valgrind using a recompiler called libVEX. It reads the machine instructions from the binary, convert them to a high level assembly language, where a tool can modify this stream (thus it is really easy to insert calls into the stream, that is what most tools do). Later, this high level assembly language is compiled back to binary form on-the-fly by libVEX again. If your tool does not touch the stream, you get a 3 times slower binary code. Hence it is useless for performance related things. But you can follow the control flow (inserting calls before function entries, and such), you can insert checks before each memory reads/writes (memcheck), and so on. If you need info, but you don't care about speed -> valgring (libVEX) is your tool. If speed matters, use cycle accurate simulators (we also have one for arm: http://www.inf.u-szeged.hu/xeemu/)

Zoltan Herczeg

Comment 10 2010-10-29 04:44:24 PDT

Krit, which version you choose? (Unfortunately the lighting filter is a big chunk of code because of inline functions, and cannot be analized properly)

Dirk Schulze

Comment 11 2010-10-29 05:19:59 PDT

(In reply to comment #10) > Krit, which version you choose? (Unfortunately the lighting filter is a big chunk of code because of inline functions, and cannot be analized properly) Oh sorry, for consistency to other effects and given the fact that the difference is not that high, Id prefer ByteArray for the moment. Is it possible to share more code on bottomLeft, borromRight,... functions?

Zoltan Herczeg

Comment 12 2010-11-02 07:51:59 PDT

Created attachment 72664 [details] final patch

Zoltan Herczeg

Comment 13 2010-11-02 07:53:37 PDT

> Oh sorry, for consistency to other effects and given the fact that the difference is not that high, Id prefer ByteArray for the moment. > Is it possible to share more code on bottomLeft, borromRight,... functions? Did it. There maybe some template magic which could reduce the number of functions without performance loss, but it looks difficult.

Dirk Schulze

Comment 14 2010-11-02 14:25:54 PDT

Comment on attachment 72664 [details] final patch View in context: https://bugs.webkit.org/attachment.cgi?id=72664&action=review r=me with little style changes. > WebCore/platform/graphics/filters/FELighting.cpp:178 > +inline void FELighting::inlineSetPixel(int offset, LightingData& data, LightSource::PaintingData& paintingData, > + int lightX, int lightY, float factorX, float factorY, IntPoint& normal2DVector) align the int's > WebCore/platform/graphics/filters/FELighting.cpp:227 > +void FELighting::setPixel(int offset, LightingData& data, LightSource::PaintingData& paintingData, > + int lightX, int lightY, float factorX, float factorY, IntPoint& normalVector) align the int's. > WebCore/platform/graphics/filters/FELighting.h:76 > + inline void inlineSetPixel(int offset, LightingData&, LightSource::PaintingData&, > + int lightX, int lightY, float factorX, float factorY, IntPoint& normalVector); please align the ints > WebCore/platform/graphics/filters/FELighting.h:80 > + void setPixel(int offset, LightingData&, LightSource::PaintingData&, > + int lightX, int lightY, float factorX, float factorY, IntPoint& normalVector); Ditto.

Zoltan Herczeg

Comment 15 2010-11-03 01:44:19 PDT

Landed in http://trac.webkit.org/changeset/71222 Closing bug.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware PC

OS OS X 10.5

Product WebKit

Component SVG

Assignee

Nobody

Reported

2010-10-24 14:28 PDT

Modified

2014-05-12 05:54 PDT History

CC List

4 users Show

URL

Keywords

Depends on

Blocks

68469 26389 48174

Dependencies

tree graph