Bug 48212

Summary: SVG FELighting performance issues
Product: WebKit Reporter: Dirk Schulze <krit>
Component: SVGAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: krit, mdelaney7, zherczeg, zimmermann
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: OS X 10.5   
Bug Depends on:    
Bug Blocks: 68469, 26389, 48174    
Attachments:
Description Flags
draft patch
krit: review-
Example
none
Draft patch 2
none
ByteArray rewriting
none
final patch krit: review+

Dirk Schulze
Reported 2010-10-24 14:28:30 PDT
FELighting has some performance issues. The most obvious issue is the -> operator for the get and set calls on CanvasPixelArray. The ByteArray is ref counted, so we have to use this operator on every read or write operation. This can be avoided by using a ByteArray pointer instead of the CanvasPixelArray like on the other effects. The other big perf issue if FloatPoint3D. The most time is spend on setPixel and mainly in the normalize() function of FloatPoint3D. normalize() makes three divisions and a dot product of itself. The dot product seems really slow and calls x() y() and z() on every call. Can we speed this up? Can't we write a selfDot function? Is it possible to avoid the call of normalize() more? Other issues depend on the kind of Lighting effect and are located at the update functions.
Attachments
draft patch (28.40 KB, patch)
2010-10-27 00:07 PDT, Zoltan Herczeg
krit: review-
Example (1.02 KB, image/svg+xml)
2010-10-27 00:07 PDT, Zoltan Herczeg
no flags
Draft patch 2 (29.37 KB, patch)
2010-10-27 05:10 PDT, Zoltan Herczeg
no flags
ByteArray rewriting (29.33 KB, patch)
2010-10-27 05:12 PDT, Zoltan Herczeg
no flags
final patch (31.37 KB, patch)
2010-11-02 07:51 PDT, Zoltan Herczeg
krit: review+
Zoltan Herczeg
Comment 1 2010-10-27 00:07:03 PDT
Created attachment 71991 [details] draft patch Faster light filter. I will upload an example, where the speed is improved from 55 to 73 fps (on a mac leopard)! The patch contains heavy optimizations, which you may not like, that is why it is a draft atm.
Zoltan Herczeg
Comment 2 2010-10-27 00:07:50 PDT
Created attachment 71992 [details] Example
Dirk Schulze
Comment 3 2010-10-27 01:02:35 PDT
Comment on attachment 71991 [details] draft patch View in context: https://bugs.webkit.org/attachment.cgi?id=71991&action=review Have you measured the performance with ByteArray*? I couldn't see a difference between usigned char* and ByteArray* on feGaussianBlur with shark. Of course you save some address calculations on using unsigned char* (pixels[0] = ... pixels[1]), but would still interest me you see a difference. The changes on FloatPoint3D are ok. > WebCore/platform/graphics/filters/FELighting.cpp:95 > +inline void FELighting::LightingData::topRight(int offset, IntPair& intPair) > { > - return static_cast<int>(pixels->get(offset - widthMultipliedByPixelSize + cPixelSize + cAlphaChannelOffset)); > + unsigned char* pixel = pixels + offset; > + int left = static_cast<int>(pixel[-cPixelSize + cAlphaChannelOffset]); > + int center = static_cast<int>(pixel[cAlphaChannelOffset]); > + pixel += widthMultipliedByPixelSize; > + int bottomLeft = static_cast<int>(pixel[-cPixelSize + cAlphaChannelOffset]); > + int bottom = static_cast<int>(pixel[cAlphaChannelOffset]); > + intPair.normalX = -(left << 1) + (center << 1) - bottomLeft + bottom; > + intPair.normalY = -left - (center << 1) + bottomLeft + (bottom << 1); > } Is there no way to share more of the code with the other edge functions? > WebCore/platform/graphics/filters/FELighting.cpp:196 > + if (m_specularExponent == 1.0f) == 1 > WebCore/platform/graphics/filters/FELighting.cpp:205 > + normalVector.setZ(1.0f); just ...setZ(1); > WebCore/platform/graphics/filters/FELighting.cpp:214 > + if (m_specularExponent == 1.0f) == 1 > WebCore/platform/graphics/filters/FELighting.cpp:-128 > - if (data.lightStrength > 1.0f) > - data.lightStrength = 1.0f; > - if (data.lightStrength < 0.0f) > - data.lightStrength = 0.0f; 1 and 0 > WebCore/platform/graphics/filters/FELighting.cpp:237 > + IntPair intPair; > + int offset; Please initialize these values. And declare them on the first use. > WebCore/platform/graphics/filters/FELighting.cpp:258 > + setPixel(offset, data, paintingData, 0, 0, -2.0f / 3.0f, -2.0f / 3.0f, intPair); Normally the compiler should optimize it, but can you add a static const for -2 / 3.f ? This is reused multiple times. The same for some other constant calculations, a name is also easier to understand. > WebCore/platform/graphics/filters/FELighting.cpp:346 > RefPtr<ImageData> srcImageData(in->resultImage()->getUnmultipliedImageData(effectDrawingRect)); change srcImageData(...) to srcImageData = ... please > WebCore/platform/graphics/filters/FELighting.h:57 > + struct IntPair { > + int normalX; > + int normalY; > + }; Do you really see a noticeable performance win compared to IntPoint?
Zoltan Herczeg
Comment 4 2010-10-27 05:10:43 PDT
Created attachment 72014 [details] Draft patch 2 This one contains the requested changes except the ByteArray* rewriting. This is the fastest so far, can achive 74 fps.
Zoltan Herczeg
Comment 5 2010-10-27 05:12:34 PDT
Created attachment 72015 [details] ByteArray rewriting This one contains the ByteArray rewriting. It is slower, only 72 fps.
Dirk Schulze
Comment 6 2010-10-27 05:44:45 PDT
Thanks for taking the bother to check ByteArray, have you used callgrind (kcachegrind) or shark to look why it is slower on ByteArray? Is it the get and set method? If so, we might follow kling's suggestion and add a unsafeGet/Set method. I'd also like to know what takes most of the time after your changes.
Zoltan Herczeg
Comment 7 2010-10-27 06:35:26 PDT
(In reply to comment #6) > Thanks for taking the bother to check ByteArray, have you used callgrind (kcachegrind) or shark to look why it is slower on ByteArray? Is it the get and set method? If so, we might follow kling's suggestion and add a unsafeGet/Set method. I'd also like to know what takes most of the time after your changes. Assembly magic. The pixels->get(offset) is turned to indexed addressing modes [reg + reg], while the other [reg, immediate] is supported by more architectures, and results a simplified machine code. Valgrind is not useful for performance measurements, since it totally recompiles the code, even if you don't touch the instruction stream. Valgrind is a great tool, and excellent for searching memory leaks (and other things unrealted to performance). There is no unsafeGet method for ByteArray. Get is unsafe enough :) unsigned char get(unsigned index) const { ASSERT(index < m_size); return m_data[index]; }
Dirk Schulze
Comment 8 2010-10-27 07:15:52 PDT
(In reply to comment #7) > (In reply to comment #6) > > Thanks for taking the bother to check ByteArray, have you used callgrind (kcachegrind) or shark to look why it is slower on ByteArray? Is it the get and set method? If so, we might follow kling's suggestion and add a unsafeGet/Set method. I'd also like to know what takes most of the time after your changes. > > Assembly magic. The pixels->get(offset) is turned to indexed addressing modes [reg + reg], while the other [reg, immediate] is supported by more architectures, and results a simplified machine code. > > Valgrind is not useful for performance measurements, since it totally recompiles the code, even if you don't touch the instruction stream. Valgrind is a great tool, and excellent for searching memory leaks (and other things unrealted to performance). > > There is no unsafeGet method for ByteArray. Get is unsafe enough :) > > unsigned char get(unsigned index) const > { > ASSERT(index < m_size); > return m_data[index]; > } You don't need to recompile for using callgrind, and it is very useful to see how much time is spend in which function. Valgrind is a tool chain, and consists of different tools, not only cachegrind (I guess you were just thinking about this). kcachegrind2 on the other hand is just a graphical tool and makes it quiet easy to visualize the output of callgrind. Yes, get is pretty simple, but set has a bit more "complexity", that might get visible on the massive calls of it for pixel manipulation.
Zoltan Herczeg
Comment 9 2010-10-27 09:47:13 PDT
> You don't need to recompile for using callgrind, and it is very useful to see how much time is spend in which function. Valgrind is a tool chain, and consists of different tools, not only cachegrind (I guess you were just thinking about this). kcachegrind2 on the other hand is just a graphical tool and makes it quiet easy to visualize the output of callgrind. I like valgrind, I even wrote valgrind tools myself. Valgrind using a recompiler called libVEX. It reads the machine instructions from the binary, convert them to a high level assembly language, where a tool can modify this stream (thus it is really easy to insert calls into the stream, that is what most tools do). Later, this high level assembly language is compiled back to binary form on-the-fly by libVEX again. If your tool does not touch the stream, you get a 3 times slower binary code. Hence it is useless for performance related things. But you can follow the control flow (inserting calls before function entries, and such), you can insert checks before each memory reads/writes (memcheck), and so on. If you need info, but you don't care about speed -> valgring (libVEX) is your tool. If speed matters, use cycle accurate simulators (we also have one for arm: http://www.inf.u-szeged.hu/xeemu/)
Zoltan Herczeg
Comment 10 2010-10-29 04:44:24 PDT
Krit, which version you choose? (Unfortunately the lighting filter is a big chunk of code because of inline functions, and cannot be analized properly)
Dirk Schulze
Comment 11 2010-10-29 05:19:59 PDT
(In reply to comment #10) > Krit, which version you choose? (Unfortunately the lighting filter is a big chunk of code because of inline functions, and cannot be analized properly) Oh sorry, for consistency to other effects and given the fact that the difference is not that high, Id prefer ByteArray for the moment. Is it possible to share more code on bottomLeft, borromRight,... functions?
Zoltan Herczeg
Comment 12 2010-11-02 07:51:59 PDT
Created attachment 72664 [details] final patch
Zoltan Herczeg
Comment 13 2010-11-02 07:53:37 PDT
> Oh sorry, for consistency to other effects and given the fact that the difference is not that high, Id prefer ByteArray for the moment. > Is it possible to share more code on bottomLeft, borromRight,... functions? Did it. There maybe some template magic which could reduce the number of functions without performance loss, but it looks difficult.
Dirk Schulze
Comment 14 2010-11-02 14:25:54 PDT
Comment on attachment 72664 [details] final patch View in context: https://bugs.webkit.org/attachment.cgi?id=72664&action=review r=me with little style changes. > WebCore/platform/graphics/filters/FELighting.cpp:178 > +inline void FELighting::inlineSetPixel(int offset, LightingData& data, LightSource::PaintingData& paintingData, > + int lightX, int lightY, float factorX, float factorY, IntPoint& normal2DVector) align the int's > WebCore/platform/graphics/filters/FELighting.cpp:227 > +void FELighting::setPixel(int offset, LightingData& data, LightSource::PaintingData& paintingData, > + int lightX, int lightY, float factorX, float factorY, IntPoint& normalVector) align the int's. > WebCore/platform/graphics/filters/FELighting.h:76 > + inline void inlineSetPixel(int offset, LightingData&, LightSource::PaintingData&, > + int lightX, int lightY, float factorX, float factorY, IntPoint& normalVector); please align the ints > WebCore/platform/graphics/filters/FELighting.h:80 > + void setPixel(int offset, LightingData&, LightSource::PaintingData&, > + int lightX, int lightY, float factorX, float factorY, IntPoint& normalVector); Ditto.
Zoltan Herczeg
Comment 15 2010-11-03 01:44:19 PDT
Note You need to log in before you can comment on or make changes to this bug.