Bug 180020 - Optimize FEMorphology
Summary: Optimize FEMorphology
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Simon Fraser (smfr)
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2017-11-26 15:44 PST by Simon Fraser (smfr)
Modified: 2017-12-06 16:29 PST (History)
6 users (show)

See Also:


Attachments
Patch (17.38 KB, patch)
2017-11-26 15:52 PST, Simon Fraser (smfr)
sam: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Simon Fraser (smfr) 2017-11-26 15:44:10 PST
Optimize FEMorphology
Comment 1 Simon Fraser (smfr) 2017-11-26 15:52:21 PST
Created attachment 327591 [details]
Patch
Comment 2 Sam Weinig 2017-11-26 16:34:25 PST
Comment on attachment 327591 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=327591&action=review

> Source/WebCore/ChangeLog:16
> +        This is about a 4x speedup before the parallel jobs tweaking. With fixed parallelism,
> +        a 200x200 filter went from 15ms to about 1ms with these changes.

Noice.

> Source/WebCore/platform/graphics/ColorUtilities.h:100
> +    static ColorComponents fromRGBA(unsigned pixel)
> +    {
> +        return ColorComponents((pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF);
> +    }

In other places, we represent a pixel value like this with the RGBA class from Color.h.  Would be good to be consistent if it makes sense.

> Source/WebCore/platform/graphics/ColorUtilities.h:113
> +    unsigned toRGBA() const
> +    {
> +        return components[0] << 24 | components[1] << 16 | components[2] << 8 | components[3];
> +    }

Same as above.

> Source/WebCore/platform/graphics/filters/FEMorphology.cpp:98
> +ALWAYS_INLINE ColorComponents minOrMax(ColorComponents a, ColorComponents b)
>  {
> -    return (y * width + x) * 4 + colorChannel;
> +    if (type == FEMORPHOLOGY_OPERATOR_ERODE)
> +        return perComponentMin(a, b);
> +
> +    return perComponentMax(a, b);

perComponentMin/perComponentMax both take const ColorComponents&, but this takes it by value. Any idea which produces better code?

> Source/WebCore/platform/graphics/filters/FEMorphology.cpp:104
> +    ColorComponents extremum = ColorComponents::fromRGBA(*reinterpret_cast<const unsigned*>(srcPixelArray.data() + pixelArrayIndex(x, yStart, width)));

Could use auto here.

> Source/WebCore/platform/graphics/filters/FEMorphology.cpp:107
> +        ColorComponents pixel = ColorComponents::fromRGBA(*reinterpret_cast<const unsigned*>(srcPixelArray.data() + pixelArrayIndex(x, y, width)));

Auto?

> Source/WebCore/platform/graphics/filters/FEMorphology.cpp:126
> +    ColorComponents extremum = kernel[0];

auto?

> Source/WebCore/platform/graphics/filters/FEMorphology.cpp:144
> +    const Uint8ClampedArray& srcPixelArray = *paintingData.srcPixelArray;
> +    Uint8ClampedArray& dstPixelArray = *paintingData.dstPixelArray;

auto?

> Source/WebCore/platform/graphics/filters/FEMorphology.cpp:155
> +    ColumnExtrema extrema;
> +    extrema.reserveCapacity(2 * radiusX + 1);

You can use extrema.reserveInitialCapacity() here.

> Source/WebCore/platform/graphics/filters/FEMorphology.cpp:161
> +        extrema.resize(0);

This should be extrema.shrink(0);

> Source/WebCore/platform/graphics/filters/FEMorphology.cpp:173
> +            if (x > radiusX)
> +                extrema.remove(0);

I can't tell how common this is. But in many cases where we need to remove from the beginning, a Deque can be be the better choice.

> Source/WebCore/platform/graphics/filters/FEMorphology.cpp:188
> +    // Empirically, runtime is approximately linear over reasonable kernel sizes with a slope of about 0.65.

Might want to document how you determined this empirical data.

> Source/WebCore/platform/graphics/filters/FEMorphology.cpp:191
> +    static const int minimalArea = (160 * 160); // Empirical data limit for parallel jobs

Same.

> Source/WebCore/platform/graphics/filters/FEMorphology.cpp:195
>          ParallelJobs<PlatformApplyParameters> parallelJobs(&WebCore::FEMorphology::platformApplyWorker, optimalThreadNumber);

No strictly related to this patch, but I checked what we were using for ParallelJobs, saw that ParallelJobsLibdispatch (which I assume we are using) uses dispatch_apply with a global queue. I believe using DISPATCH_APPLY_AUTO for the queue will work better.
Comment 3 Simon Fraser (smfr) 2017-11-27 08:20:03 PST
(In reply to Sam Weinig from comment #2)
> Comment on attachment 327591 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=327591&action=review
> 
> > Source/WebCore/ChangeLog:16
> > +        This is about a 4x speedup before the parallel jobs tweaking. With fixed parallelism,
> > +        a 200x200 filter went from 15ms to about 1ms with these changes.
> 
> Noice.
> 
> > Source/WebCore/platform/graphics/ColorUtilities.h:100
> > +    static ColorComponents fromRGBA(unsigned pixel)
> > +    {
> > +        return ColorComponents((pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF);
> > +    }
> 
> In other places, we represent a pixel value like this with the RGBA class
> from Color.h.  Would be good to be consistent if it makes sense.

Sadly the layout of RGBA belies its name. It actually represents ARGB:

inline uint8_t RGBA::alpha() const
{
    return m_integer >> 24;
}

> > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:98
> > +ALWAYS_INLINE ColorComponents minOrMax(ColorComponents a, ColorComponents b)
> >  {
> > -    return (y * width + x) * 4 + colorChannel;
> > +    if (type == FEMORPHOLOGY_OPERATOR_ERODE)
> > +        return perComponentMin(a, b);
> > +
> > +    return perComponentMax(a, b);
> 
> perComponentMin/perComponentMax both take const ColorComponents&, but this
> takes it by value. Any idea which produces better code?

Will see if there's any perf difference.

> > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:195
> >          ParallelJobs<PlatformApplyParameters> parallelJobs(&WebCore::FEMorphology::platformApplyWorker, optimalThreadNumber);
> 
> No strictly related to this patch, but I checked what we were using for
> ParallelJobs, saw that ParallelJobsLibdispatch (which I assume we are using)
> uses dispatch_apply with a global queue. I believe using DISPATCH_APPLY_AUTO
> for the queue will work better.

Noted in bug 180024.
Comment 4 Simon Fraser (smfr) 2017-11-27 08:30:00 PST
(In reply to Sam Weinig from comment #2)

> perComponentMin/perComponentMax both take const ColorComponents&, but this
> takes it by value. Any idea which produces better code?

Passing by value was slower.
Comment 5 Simon Fraser (smfr) 2017-11-27 08:38:05 PST
https://trac.webkit.org/changeset/225172/webkit
Comment 6 Radar WebKit Bug Importer 2017-12-06 16:29:37 PST
<rdar://problem/35896093>
Comment 7 Radar WebKit Bug Importer 2017-12-06 16:29:38 PST
<rdar://problem/35896095>