Summary: | Optimize FEMorphology | ||||||
---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Simon Fraser (smfr) <simon.fraser> | ||||
Component: | New Bugs | Assignee: | Simon Fraser (smfr) <simon.fraser> | ||||
Status: | RESOLVED FIXED | ||||||
Severity: | Normal | CC: | dino, ews-watchlist, kondapallykalyan, sam, simon.fraser, webkit-bug-importer | ||||
Priority: | P2 | Keywords: | InRadar | ||||
Version: | WebKit Nightly Build | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Attachments: |
|
Description
Simon Fraser (smfr)
2017-11-26 15:44:10 PST
Created attachment 327591 [details]
Patch
Comment on attachment 327591 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=327591&action=review > Source/WebCore/ChangeLog:16 > + This is about a 4x speedup before the parallel jobs tweaking. With fixed parallelism, > + a 200x200 filter went from 15ms to about 1ms with these changes. Noice. > Source/WebCore/platform/graphics/ColorUtilities.h:100 > + static ColorComponents fromRGBA(unsigned pixel) > + { > + return ColorComponents((pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF); > + } In other places, we represent a pixel value like this with the RGBA class from Color.h. Would be good to be consistent if it makes sense. > Source/WebCore/platform/graphics/ColorUtilities.h:113 > + unsigned toRGBA() const > + { > + return components[0] << 24 | components[1] << 16 | components[2] << 8 | components[3]; > + } Same as above. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:98 > +ALWAYS_INLINE ColorComponents minOrMax(ColorComponents a, ColorComponents b) > { > - return (y * width + x) * 4 + colorChannel; > + if (type == FEMORPHOLOGY_OPERATOR_ERODE) > + return perComponentMin(a, b); > + > + return perComponentMax(a, b); perComponentMin/perComponentMax both take const ColorComponents&, but this takes it by value. Any idea which produces better code? > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:104 > + ColorComponents extremum = ColorComponents::fromRGBA(*reinterpret_cast<const unsigned*>(srcPixelArray.data() + pixelArrayIndex(x, yStart, width))); Could use auto here. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:107 > + ColorComponents pixel = ColorComponents::fromRGBA(*reinterpret_cast<const unsigned*>(srcPixelArray.data() + pixelArrayIndex(x, y, width))); Auto? > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:126 > + ColorComponents extremum = kernel[0]; auto? > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:144 > + const Uint8ClampedArray& srcPixelArray = *paintingData.srcPixelArray; > + Uint8ClampedArray& dstPixelArray = *paintingData.dstPixelArray; auto? > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:155 > + ColumnExtrema extrema; > + extrema.reserveCapacity(2 * radiusX + 1); You can use extrema.reserveInitialCapacity() here. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:161 > + extrema.resize(0); This should be extrema.shrink(0); > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:173 > + if (x > radiusX) > + extrema.remove(0); I can't tell how common this is. But in many cases where we need to remove from the beginning, a Deque can be be the better choice. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:188 > + // Empirically, runtime is approximately linear over reasonable kernel sizes with a slope of about 0.65. Might want to document how you determined this empirical data. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:191 > + static const int minimalArea = (160 * 160); // Empirical data limit for parallel jobs Same. > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:195 > ParallelJobs<PlatformApplyParameters> parallelJobs(&WebCore::FEMorphology::platformApplyWorker, optimalThreadNumber); No strictly related to this patch, but I checked what we were using for ParallelJobs, saw that ParallelJobsLibdispatch (which I assume we are using) uses dispatch_apply with a global queue. I believe using DISPATCH_APPLY_AUTO for the queue will work better. (In reply to Sam Weinig from comment #2) > Comment on attachment 327591 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=327591&action=review > > > Source/WebCore/ChangeLog:16 > > + This is about a 4x speedup before the parallel jobs tweaking. With fixed parallelism, > > + a 200x200 filter went from 15ms to about 1ms with these changes. > > Noice. > > > Source/WebCore/platform/graphics/ColorUtilities.h:100 > > + static ColorComponents fromRGBA(unsigned pixel) > > + { > > + return ColorComponents((pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF); > > + } > > In other places, we represent a pixel value like this with the RGBA class > from Color.h. Would be good to be consistent if it makes sense. Sadly the layout of RGBA belies its name. It actually represents ARGB: inline uint8_t RGBA::alpha() const { return m_integer >> 24; } > > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:98 > > +ALWAYS_INLINE ColorComponents minOrMax(ColorComponents a, ColorComponents b) > > { > > - return (y * width + x) * 4 + colorChannel; > > + if (type == FEMORPHOLOGY_OPERATOR_ERODE) > > + return perComponentMin(a, b); > > + > > + return perComponentMax(a, b); > > perComponentMin/perComponentMax both take const ColorComponents&, but this > takes it by value. Any idea which produces better code? Will see if there's any perf difference. > > Source/WebCore/platform/graphics/filters/FEMorphology.cpp:195 > > ParallelJobs<PlatformApplyParameters> parallelJobs(&WebCore::FEMorphology::platformApplyWorker, optimalThreadNumber); > > No strictly related to this patch, but I checked what we were using for > ParallelJobs, saw that ParallelJobsLibdispatch (which I assume we are using) > uses dispatch_apply with a global queue. I believe using DISPATCH_APPLY_AUTO > for the queue will work better. Noted in bug 180024. (In reply to Sam Weinig from comment #2) > perComponentMin/perComponentMax both take const ColorComponents&, but this > takes it by value. Any idea which produces better code? Passing by value was slower. |