Bug 180020

Summary:

Optimize FEMorphology

Product:

WebKit

Reporter:

Simon Fraser (smfr) <simon.fraser>

Component:

New Bugs

Assignee:

Simon Fraser (smfr) <simon.fraser>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

dino, ews-watchlist, kondapallykalyan, sam, simon.fraser, webkit-bug-importer

Priority:

Keywords:

InRadar

Version:

WebKit Nightly Build

Hardware:

Unspecified

OS:

Unspecified

Attachments:

Description	Flags
Patch	sam: review+

Simon Fraser (smfr)

Reported 2017-11-26 15:44:10 PST

Optimize FEMorphology

Attachments
Patch (17.38 KB, patch) 2017-11-26 15:52 PST, Simon Fraser (smfr)	sam: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Simon Fraser (smfr)

Comment 1 2017-11-26 15:52:21 PST

Created attachment 327591 [details] Patch

Sam Weinig

Comment 2 2017-11-26 16:34:25 PST

Comment on attachment 327591 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=327591&action=review > Source/WebCore/ChangeLog:16 > + This is about a 4x speedup before the parallel jobs tweaking. With fixed parallelism, > + a 200x200 filter went from 15ms to about 1ms with these changes. Noice. > Source/WebCore/platform/graphics/ColorUtilities.h:100 > + static ColorComponents fromRGBA(unsigned pixel) > + { > + return ColorComponents((pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF); > + } In other places, we represent a pixel value like this with the RGBA class from Color.h. Would be good to be consistent if it makes sense. > Source/WebCore/platform/graphics/ColorUtilities.h:113 > + unsigned toRGBA() const > + { > + return components[0] << 24 | components[1] << 16 | components[2] << 8 | components[3]; > + } Same as above. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:98 > +ALWAYS_INLINE ColorComponents minOrMax(ColorComponents a, ColorComponents b) > { > - return (y * width + x) * 4 + colorChannel; > + if (type == FEMORPHOLOGY_OPERATOR_ERODE) > + return perComponentMin(a, b); > + > + return perComponentMax(a, b); perComponentMin/perComponentMax both take const ColorComponents&, but this takes it by value. Any idea which produces better code? > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:104 > + ColorComponents extremum = ColorComponents::fromRGBA(*reinterpret_cast<const unsigned*>(srcPixelArray.data() + pixelArrayIndex(x, yStart, width))); Could use auto here. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:107 > + ColorComponents pixel = ColorComponents::fromRGBA(*reinterpret_cast<const unsigned*>(srcPixelArray.data() + pixelArrayIndex(x, y, width))); Auto? > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:126 > + ColorComponents extremum = kernel[0]; auto? > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:144 > + const Uint8ClampedArray& srcPixelArray = *paintingData.srcPixelArray; > + Uint8ClampedArray& dstPixelArray = *paintingData.dstPixelArray; auto? > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:155 > + ColumnExtrema extrema; > + extrema.reserveCapacity(2 * radiusX + 1); You can use extrema.reserveInitialCapacity() here. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:161 > + extrema.resize(0); This should be extrema.shrink(0); > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:173 > + if (x > radiusX) > + extrema.remove(0); I can't tell how common this is. But in many cases where we need to remove from the beginning, a Deque can be be the better choice. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:188 > + // Empirically, runtime is approximately linear over reasonable kernel sizes with a slope of about 0.65. Might want to document how you determined this empirical data. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:191 > + static const int minimalArea = (160 * 160); // Empirical data limit for parallel jobs Same. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:195 > ParallelJobs<PlatformApplyParameters> parallelJobs(&WebCore::FEMorphology::platformApplyWorker, optimalThreadNumber); No strictly related to this patch, but I checked what we were using for ParallelJobs, saw that ParallelJobsLibdispatch (which I assume we are using) uses dispatch_apply with a global queue. I believe using DISPATCH_APPLY_AUTO for the queue will work better.

Simon Fraser (smfr)

Comment 3 2017-11-27 08:20:03 PST

(In reply to Sam Weinig from comment #2) > Comment on attachment 327591 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=327591&action=review > > > Source/WebCore/ChangeLog:16 > > + This is about a 4x speedup before the parallel jobs tweaking. With fixed parallelism, > > + a 200x200 filter went from 15ms to about 1ms with these changes. > > Noice. > > > Source/WebCore/platform/graphics/ColorUtilities.h:100 > > + static ColorComponents fromRGBA(unsigned pixel) > > + { > > + return ColorComponents((pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF); > > + } > > In other places, we represent a pixel value like this with the RGBA class > from Color.h. Would be good to be consistent if it makes sense. Sadly the layout of RGBA belies its name. It actually represents ARGB: inline uint8_t RGBA::alpha() const { return m_integer >> 24; } > > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:98 > > +ALWAYS_INLINE ColorComponents minOrMax(ColorComponents a, ColorComponents b) > > { > > - return (y * width + x) * 4 + colorChannel; > > + if (type == FEMORPHOLOGY_OPERATOR_ERODE) > > + return perComponentMin(a, b); > > + > > + return perComponentMax(a, b); > > perComponentMin/perComponentMax both take const ColorComponents&, but this > takes it by value. Any idea which produces better code? Will see if there's any perf difference. > > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:195 > > ParallelJobs<PlatformApplyParameters> parallelJobs(&WebCore::FEMorphology::platformApplyWorker, optimalThreadNumber); > > No strictly related to this patch, but I checked what we were using for > ParallelJobs, saw that ParallelJobsLibdispatch (which I assume we are using) > uses dispatch_apply with a global queue. I believe using DISPATCH_APPLY_AUTO > for the queue will work better. Noted in bug 180024.

Simon Fraser (smfr)

Comment 4 2017-11-27 08:30:00 PST

(In reply to Sam Weinig from comment #2) > perComponentMin/perComponentMax both take const ColorComponents&, but this > takes it by value. Any idea which produces better code? Passing by value was slower.

Simon Fraser (smfr)

Comment 5 2017-11-27 08:38:05 PST

https://trac.webkit.org/changeset/225172/webkit

Radar WebKit Bug Importer

Comment 6 2017-12-06 16:29:37 PST

<rdar://problem/35896093>

Radar WebKit Bug Importer

Comment 7 2017-12-06 16:29:38 PST

<rdar://problem/35896095>

Note You need to log in before you can comment on or make changes to this bug.