Bug 48212

Summary: SVG FELighting performance issues
Product: WebKit Reporter: Dirk Schulze <krit>
Component: SVGAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: krit, mdelaney7, zherczeg, zimmermann
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: OS X 10.5   
Bug Depends on:    
Bug Blocks: 68469, 26389, 48174    
Attachments:
Description Flags
draft patch
krit: review-
Example
none
Draft patch 2
none
ByteArray rewriting
none
final patch krit: review+

Description Dirk Schulze 2010-10-24 14:28:30 PDT
FELighting has some performance issues.

The most obvious issue is the -> operator for the get and set calls on CanvasPixelArray. The ByteArray is ref counted, so we have to use this operator on every read or write operation. This can be avoided by using a ByteArray pointer instead of the CanvasPixelArray like on the other effects.

The other big perf issue if FloatPoint3D.
The most time is spend on setPixel and mainly in the normalize() function of FloatPoint3D. normalize() makes three divisions and a dot product of itself. The dot product seems really slow and calls x() y() and z() on every call. Can we speed this up? Can't we write a selfDot function? Is it possible to avoid the call of normalize() more?

Other issues depend on the kind of Lighting effect and are located at the update functions.
Comment 1 Zoltan Herczeg 2010-10-27 00:07:03 PDT
Created attachment 71991 [details]
draft patch

Faster light filter. I will upload an example, where the speed is improved from 55 to 73 fps (on a mac leopard)! The patch contains heavy optimizations, which you may not like, that is why it is a draft atm.
Comment 2 Zoltan Herczeg 2010-10-27 00:07:50 PDT
Created attachment 71992 [details]
Example
Comment 3 Dirk Schulze 2010-10-27 01:02:35 PDT
Comment on attachment 71991 [details]
draft patch

View in context: https://bugs.webkit.org/attachment.cgi?id=71991&action=review

Have you measured the performance with ByteArray*? I couldn't see a difference between usigned char* and ByteArray* on feGaussianBlur with shark. Of course you save some address calculations on using unsigned char* (pixels[0] = ... pixels[1]), but would still interest me you see a difference.
The changes on FloatPoint3D are ok.

> WebCore/platform/graphics/filters/FELighting.cpp:95
> +inline void FELighting::LightingData::topRight(int offset, IntPair& intPair)
>  {
> -    return static_cast<int>(pixels->get(offset - widthMultipliedByPixelSize + cPixelSize + cAlphaChannelOffset));
> +    unsigned char* pixel = pixels + offset;
> +    int left = static_cast<int>(pixel[-cPixelSize + cAlphaChannelOffset]);
> +    int center = static_cast<int>(pixel[cAlphaChannelOffset]);
> +    pixel += widthMultipliedByPixelSize;
> +    int bottomLeft = static_cast<int>(pixel[-cPixelSize + cAlphaChannelOffset]);
> +    int bottom = static_cast<int>(pixel[cAlphaChannelOffset]);
> +    intPair.normalX = -(left << 1) + (center << 1) - bottomLeft + bottom;
> +    intPair.normalY = -left - (center << 1) + bottomLeft + (bottom << 1);
>  }

Is there no way to share more of the code with the other edge functions?

> WebCore/platform/graphics/filters/FELighting.cpp:196
> +            if (m_specularExponent == 1.0f)

== 1

> WebCore/platform/graphics/filters/FELighting.cpp:205
> +        normalVector.setZ(1.0f);

just ...setZ(1);

> WebCore/platform/graphics/filters/FELighting.cpp:214
> +            if (m_specularExponent == 1.0f)

== 1

> WebCore/platform/graphics/filters/FELighting.cpp:-128
> -    if (data.lightStrength > 1.0f)
> -        data.lightStrength = 1.0f;
> -    if (data.lightStrength < 0.0f)
> -        data.lightStrength = 0.0f;

1 and 0

> WebCore/platform/graphics/filters/FELighting.cpp:237
> +    IntPair intPair;
> +    int offset;

Please initialize these values. And declare them on the first use.

> WebCore/platform/graphics/filters/FELighting.cpp:258
> +    setPixel(offset, data, paintingData, 0, 0, -2.0f / 3.0f, -2.0f / 3.0f, intPair);

Normally the compiler should optimize it, but can you add a static const for -2 / 3.f ? This is reused multiple times. The same for some other constant calculations, a name is also easier to understand.

> WebCore/platform/graphics/filters/FELighting.cpp:346
>      RefPtr<ImageData> srcImageData(in->resultImage()->getUnmultipliedImageData(effectDrawingRect));

change srcImageData(...) to srcImageData = ... please

> WebCore/platform/graphics/filters/FELighting.h:57
> +    struct IntPair {
> +        int normalX;
> +        int normalY;
> +    };

Do you really see a noticeable performance win compared to IntPoint?
Comment 4 Zoltan Herczeg 2010-10-27 05:10:43 PDT
Created attachment 72014 [details]
Draft patch 2

This one contains the requested changes except the ByteArray* rewriting. This is the fastest so far, can achive 74 fps.
Comment 5 Zoltan Herczeg 2010-10-27 05:12:34 PDT
Created attachment 72015 [details]
ByteArray rewriting

This one contains the ByteArray rewriting. It is slower, only 72 fps.
Comment 6 Dirk Schulze 2010-10-27 05:44:45 PDT
Thanks for taking the bother to check ByteArray, have you used callgrind (kcachegrind) or shark to look why it is slower on ByteArray? Is it the get and set method? If so, we might follow kling's suggestion and add a unsafeGet/Set method. I'd also like to know what takes most of the time after your changes.
Comment 7 Zoltan Herczeg 2010-10-27 06:35:26 PDT
(In reply to comment #6)
> Thanks for taking the bother to check ByteArray, have you used callgrind (kcachegrind) or shark to look why it is slower on ByteArray? Is it the get and set method? If so, we might follow kling's suggestion and add a unsafeGet/Set method. I'd also like to know what takes most of the time after your changes.

Assembly magic. The pixels->get(offset) is turned to indexed addressing modes  [reg + reg], while the other [reg, immediate] is supported by more architectures, and results a simplified machine code.

Valgrind is not useful for performance measurements, since it totally recompiles the code, even if you don't touch the instruction stream. Valgrind is a great tool, and excellent for searching memory leaks (and other things unrealted to performance).

There is no unsafeGet method for ByteArray. Get is unsafe enough :)

unsigned char get(unsigned index) const
{
    ASSERT(index < m_size);
    return m_data[index];
}
Comment 8 Dirk Schulze 2010-10-27 07:15:52 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > Thanks for taking the bother to check ByteArray, have you used callgrind (kcachegrind) or shark to look why it is slower on ByteArray? Is it the get and set method? If so, we might follow kling's suggestion and add a unsafeGet/Set method. I'd also like to know what takes most of the time after your changes.
> 
> Assembly magic. The pixels->get(offset) is turned to indexed addressing modes  [reg + reg], while the other [reg, immediate] is supported by more architectures, and results a simplified machine code.
> 
> Valgrind is not useful for performance measurements, since it totally recompiles the code, even if you don't touch the instruction stream. Valgrind is a great tool, and excellent for searching memory leaks (and other things unrealted to performance).
> 
> There is no unsafeGet method for ByteArray. Get is unsafe enough :)
> 
> unsigned char get(unsigned index) const
> {
>     ASSERT(index < m_size);
>     return m_data[index];
> }

You don't need to recompile for using callgrind, and it is very useful to see how much time is spend in which function. Valgrind is a tool chain, and consists of different tools, not only cachegrind (I guess you were just thinking about this). kcachegrind2 on the other hand is just a graphical tool and makes it quiet easy to visualize the output of callgrind.

Yes, get is pretty simple, but set has a bit more "complexity", that might get visible on the massive calls of it for pixel manipulation.
Comment 9 Zoltan Herczeg 2010-10-27 09:47:13 PDT
> You don't need to recompile for using callgrind, and it is very useful to see how much time is spend in which function. Valgrind is a tool chain, and consists of different tools, not only cachegrind (I guess you were just thinking about this). kcachegrind2 on the other hand is just a graphical tool and makes it quiet easy to visualize the output of callgrind.

I like valgrind, I even wrote valgrind tools myself. Valgrind using a recompiler called libVEX. It reads the machine instructions from the binary, convert them to a high level assembly language, where a tool can modify this stream (thus it is really easy to insert calls into the stream, that is what most tools do). Later, this high level assembly language is compiled back to binary form on-the-fly by libVEX again. If your tool does not touch the stream, you get a 3 times slower binary code. Hence it is useless for performance related things. But you can follow the control flow (inserting calls before function entries, and such), you can insert checks before each memory reads/writes (memcheck), and so on. If you need info, but you don't care about speed -> valgring (libVEX) is your tool. If speed matters, use cycle accurate simulators (we also have one for arm: http://www.inf.u-szeged.hu/xeemu/)
Comment 10 Zoltan Herczeg 2010-10-29 04:44:24 PDT
Krit, which version you choose? (Unfortunately the lighting filter is a big chunk of code because of inline functions, and cannot be analized properly)
Comment 11 Dirk Schulze 2010-10-29 05:19:59 PDT
(In reply to comment #10)
> Krit, which version you choose? (Unfortunately the lighting filter is a big chunk of code because of inline functions, and cannot be analized properly)

Oh sorry, for consistency to other effects and given the fact that the difference is not that high, Id prefer ByteArray for the moment.
Is it possible to share more code on bottomLeft, borromRight,... functions?
Comment 12 Zoltan Herczeg 2010-11-02 07:51:59 PDT
Created attachment 72664 [details]
final patch
Comment 13 Zoltan Herczeg 2010-11-02 07:53:37 PDT
> Oh sorry, for consistency to other effects and given the fact that the difference is not that high, Id prefer ByteArray for the moment.
> Is it possible to share more code on bottomLeft, borromRight,... functions?

Did it. There maybe some template magic which could reduce the number of functions without performance loss, but it looks difficult.
Comment 14 Dirk Schulze 2010-11-02 14:25:54 PDT
Comment on attachment 72664 [details]
final patch

View in context: https://bugs.webkit.org/attachment.cgi?id=72664&action=review

r=me with little style changes.

> WebCore/platform/graphics/filters/FELighting.cpp:178
> +inline void FELighting::inlineSetPixel(int offset, LightingData& data, LightSource::PaintingData& paintingData,
> +    int lightX, int lightY, float factorX, float factorY, IntPoint& normal2DVector)

align the int's

> WebCore/platform/graphics/filters/FELighting.cpp:227
> +void FELighting::setPixel(int offset, LightingData& data, LightSource::PaintingData& paintingData,
> +    int lightX, int lightY, float factorX, float factorY, IntPoint& normalVector)

align the int's.

> WebCore/platform/graphics/filters/FELighting.h:76
> +    inline void inlineSetPixel(int offset, LightingData&, LightSource::PaintingData&,
> +        int lightX, int lightY, float factorX, float factorY, IntPoint& normalVector);

please align the ints

> WebCore/platform/graphics/filters/FELighting.h:80
> +    void setPixel(int offset, LightingData&, LightSource::PaintingData&,
> +        int lightX, int lightY, float factorX, float factorY, IntPoint& normalVector);

Ditto.
Comment 15 Zoltan Herczeg 2010-11-03 01:44:19 PDT
Landed in http://trac.webkit.org/changeset/71222

Closing bug.