Summary: | Optimizing RGBA16, RGB16, ARGB16, BGRA16 unpacking functions with NEON intrinsics | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Gabor Rapcsanyi <rgabor> | ||||||||
Component: | WebGL | Assignee: | Nobody <webkit-unassigned> | ||||||||
Status: | RESOLVED FIXED | ||||||||||
Severity: | Normal | CC: | dino, laszlo.gombos, webkit.review.bot, zherczeg | ||||||||
Priority: | P2 | ||||||||||
Version: | 528+ (Nightly build) | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Attachments: |
|
Description
Gabor Rapcsanyi
2012-11-29 01:04:36 PST
Created attachment 176708 [details]
patch
I tried it on Pandaboard with Linaro 12.10 Ubuntu.
unpackOneRowOfRGBA16LittleToRGBA8: 2.3x faster
unpackOneRowOfRGB16LittleToRGBA8: 1.97x faster
unpackOneRowOfARGB16LittleToRGBA8: 3.07x faster
unpackOneRowOfBGRA16LittleToRGBA8: 2.93x faster
Comment on attachment 176708 [details] patch View in context: https://bugs.webkit.org/attachment.cgi?id=176708&action=review > Source/WebCore/platform/graphics/GraphicsContext3D.cpp:377 > +#if HAVE(ARM_NEON_INTRINSICS) > + unsigned componentsPerRow = pixelsPerRow * 4; > + unsigned tailComponents = componentsPerRow % 8; > + unsigned componentsSize = componentsPerRow - tailComponents; > + > + ARM::unpackOneRowOfRGBA16LittleToRGBA8NEON(source, destination, componentsSize); > + > + source += componentsSize; > + destination += componentsSize; > + pixelsPerRow = tailComponents / 4; > +#endif I realized that I don't really like in this approach. The modification of the common code path is way too big. And too ARM specific. I would prefer: #if HAVE(ARM_NEON_INTRINSICS) optionally other SIMDS connected with || operator SIMD::unpackOneRowOfRGBA16LittleToRGBA8(source, destination, pixelsPerRow) $endif And the SIMD class (namespace) would define the folowing interface: inline void SIMD::unpackOneRowOfRGBA16LittleToRGBA8(const uint16_t*& source, uint8_t*& destination, unsigned int& pixelsPerRow) Advantages: 1) Modifications of the common code path is much shorter. 2) Can modify the arguments, since they passed as reference (usually SIMD process a group of pixels, but not necessary all if the length is not divisible by a certain value). 3) Can be extended to support other SIMD-s, not just NEON 4) Still it can be seen that a certain function is supported by the current CPU (So it is not necessary to support all of these functions on all CPUs which have SIMD support) (In reply to comment #2) > (From update of attachment 176708 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=176708&action=review > > > Source/WebCore/platform/graphics/GraphicsContext3D.cpp:377 > > +#if HAVE(ARM_NEON_INTRINSICS) > > + unsigned componentsPerRow = pixelsPerRow * 4; > > + unsigned tailComponents = componentsPerRow % 8; > > + unsigned componentsSize = componentsPerRow - tailComponents; > > + > > + ARM::unpackOneRowOfRGBA16LittleToRGBA8NEON(source, destination, componentsSize); > > + > > + source += componentsSize; > > + destination += componentsSize; > > + pixelsPerRow = tailComponents / 4; > > +#endif > > I realized that I don't really like in this approach. The modification of the common code path is way too big. And too ARM specific. > > I would prefer: > > #if HAVE(ARM_NEON_INTRINSICS) optionally other SIMDS connected with || operator > SIMD::unpackOneRowOfRGBA16LittleToRGBA8(source, destination, pixelsPerRow) > $endif > > And the SIMD class (namespace) would define the folowing interface: > inline void SIMD::unpackOneRowOfRGBA16LittleToRGBA8(const uint16_t*& source, uint8_t*& destination, unsigned int& pixelsPerRow) > > Advantages: > 1) Modifications of the common code path is much shorter. > 2) Can modify the arguments, since they passed as reference (usually SIMD process a group of pixels, but not necessary all if the length is not divisible by a certain value). > 3) Can be extended to support other SIMD-s, not just NEON > 4) Still it can be seen that a certain function is supported by the current CPU (So it is not necessary to support all of these functions on all CPUs which have SIMD support) Yes, sounds logical to me so I made a bug to change those optimizations: https://bugs.webkit.org/show_bug.cgi?id=105086 Created attachment 179710 [details]
modified patch
Comment on attachment 179710 [details] modified patch View in context: https://bugs.webkit.org/attachment.cgi?id=179710&action=review > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:46 > + uint16x8_t eightComponents = vld1q_u16(source + i); > + eightComponents = vshrq_n_u16(eightComponents, 8); > + vst1_u8(destination + i, vqmovn_u16(eightComponents)); I think this could be simplified to a simple read/write method without vshr. Just read an interleaved low/high component data, and write back the high component. Similar algorithm can be created to the other cases. Created attachment 180126 [details] patch2 (In reply to comment #5) > (From update of attachment 179710 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=179710&action=review > > > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:46 > > + uint16x8_t eightComponents = vld1q_u16(source + i); > > + eightComponents = vshrq_n_u16(eightComponents, 8); > > + vst1_u8(destination + i, vqmovn_u16(eightComponents)); > > I think this could be simplified to a simple read/write method without vshr. Just read an interleaved low/high component data, and write back the high component. Similar algorithm can be created to the other cases. Yes thanks I changed it. unpackOneRowOfRGBA16LittleToRGBA8: 3.19x faster now I tried the same with unpackOneRowOfARGB16LittleToRGBA8: uint8x16x2_t components = vld2q_u8(src + i * 2); uint32x4_t ARGB = vreinterpretq_u32_u8(components.val[1]); uint32x4_t RGBA = vorrq_u32(vshrq_n_u32(ARGB, 24), vshlq_n_u32(ARGB, 8)); vst1q_u8(destination + i, vreinterpretq_u8_u32(RGBA)); It was a little bit slower than my original solution. Comment on attachment 180126 [details]
patch2
r=me
Comment on attachment 180126 [details] patch2 Clearing flags on attachment: 180126 Committed r138936: <http://trac.webkit.org/changeset/138936> All reviewed patches have been landed. Closing bug. |