103614 – Optimizing RGBA16, RGB16, ARGB16, BGRA16 unpacking functions with NEON intrinsics

RESOLVED FIXED103614

Optimizing RGBA16, RGB16, ARGB16, BGRA16 unpacking functions with NEON intrinsics

https://bugs.webkit.org/show_bug.cgi?id=103614

Summary Optimizing RGBA16, RGB16, ARGB16, BGRA16 unpacking functions with NEON intrin...

Gabor Rapcsanyi

Reported 2012-11-29 01:04:36 PST

Optimizing RGBA16LittleToRGBA8, RGB16LittleToRGBA8, ARGB16LittleToRGBA8, BGRA16LittleToRGBA8 unpacking functions in GraphicsContext3D with ARM NEON intrinsics.

Attachments
patch (7.24 KB, patch) 2012-11-29 05:17 PST, Gabor Rapcsanyi	zherczeg: review- zherczeg: commit-queue-	Details Formatted Diff Diff
modified patch (7.23 KB, patch) 2012-12-17 03:29 PST, Gabor Rapcsanyi	zherczeg: review-	Details Formatted Diff Diff
patch2 (7.23 KB, patch) 2012-12-19 03:33 PST, Gabor Rapcsanyi	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Gabor Rapcsanyi

Comment 1 2012-11-29 05:17:47 PST

Created attachment 176708 [details] patch I tried it on Pandaboard with Linaro 12.10 Ubuntu. unpackOneRowOfRGBA16LittleToRGBA8: 2.3x faster unpackOneRowOfRGB16LittleToRGBA8: 1.97x faster unpackOneRowOfARGB16LittleToRGBA8: 3.07x faster unpackOneRowOfBGRA16LittleToRGBA8: 2.93x faster

Zoltan Herczeg

Comment 2 2012-12-06 00:35:24 PST

Comment on attachment 176708 [details] patch View in context: https://bugs.webkit.org/attachment.cgi?id=176708&action=review > Source/WebCore/platform/graphics/GraphicsContext3D.cpp:377 > +#if HAVE(ARM_NEON_INTRINSICS) > + unsigned componentsPerRow = pixelsPerRow * 4; > + unsigned tailComponents = componentsPerRow % 8; > + unsigned componentsSize = componentsPerRow - tailComponents; > + > + ARM::unpackOneRowOfRGBA16LittleToRGBA8NEON(source, destination, componentsSize); > + > + source += componentsSize; > + destination += componentsSize; > + pixelsPerRow = tailComponents / 4; > +#endif I realized that I don't really like in this approach. The modification of the common code path is way too big. And too ARM specific. I would prefer: #if HAVE(ARM_NEON_INTRINSICS) optionally other SIMDS connected with || operator SIMD::unpackOneRowOfRGBA16LittleToRGBA8(source, destination, pixelsPerRow) $endif And the SIMD class (namespace) would define the folowing interface: inline void SIMD::unpackOneRowOfRGBA16LittleToRGBA8(const uint16_t*& source, uint8_t*& destination, unsigned int& pixelsPerRow) Advantages: 1) Modifications of the common code path is much shorter. 2) Can modify the arguments, since they passed as reference (usually SIMD process a group of pixels, but not necessary all if the length is not divisible by a certain value). 3) Can be extended to support other SIMD-s, not just NEON 4) Still it can be seen that a certain function is supported by the current CPU (So it is not necessary to support all of these functions on all CPUs which have SIMD support)

Gabor Rapcsanyi

Comment 3 2012-12-15 07:24:58 PST

(In reply to comment #2) > (From update of attachment 176708 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=176708&action=review > > > Source/WebCore/platform/graphics/GraphicsContext3D.cpp:377 > > +#if HAVE(ARM_NEON_INTRINSICS) > > + unsigned componentsPerRow = pixelsPerRow * 4; > > + unsigned tailComponents = componentsPerRow % 8; > > + unsigned componentsSize = componentsPerRow - tailComponents; > > + > > + ARM::unpackOneRowOfRGBA16LittleToRGBA8NEON(source, destination, componentsSize); > > + > > + source += componentsSize; > > + destination += componentsSize; > > + pixelsPerRow = tailComponents / 4; > > +#endif > > I realized that I don't really like in this approach. The modification of the common code path is way too big. And too ARM specific. > > I would prefer: > > #if HAVE(ARM_NEON_INTRINSICS) optionally other SIMDS connected with || operator > SIMD::unpackOneRowOfRGBA16LittleToRGBA8(source, destination, pixelsPerRow) > $endif > > And the SIMD class (namespace) would define the folowing interface: > inline void SIMD::unpackOneRowOfRGBA16LittleToRGBA8(const uint16_t*& source, uint8_t*& destination, unsigned int& pixelsPerRow) > > Advantages: > 1) Modifications of the common code path is much shorter. > 2) Can modify the arguments, since they passed as reference (usually SIMD process a group of pixels, but not necessary all if the length is not divisible by a certain value). > 3) Can be extended to support other SIMD-s, not just NEON > 4) Still it can be seen that a certain function is supported by the current CPU (So it is not necessary to support all of these functions on all CPUs which have SIMD support) Yes, sounds logical to me so I made a bug to change those optimizations: https://bugs.webkit.org/show_bug.cgi?id=105086

Gabor Rapcsanyi

Comment 4 2012-12-17 03:29:37 PST

Created attachment 179710 [details] modified patch

Zoltan Herczeg

Comment 5 2012-12-17 03:54:44 PST

Comment on attachment 179710 [details] modified patch View in context: https://bugs.webkit.org/attachment.cgi?id=179710&action=review > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:46 > + uint16x8_t eightComponents = vld1q_u16(source + i); > + eightComponents = vshrq_n_u16(eightComponents, 8); > + vst1_u8(destination + i, vqmovn_u16(eightComponents)); I think this could be simplified to a simple read/write method without vshr. Just read an interleaved low/high component data, and write back the high component. Similar algorithm can be created to the other cases.

Gabor Rapcsanyi

Comment 6 2012-12-19 03:33:47 PST

Created attachment 180126 [details] patch2 (In reply to comment #5) > (From update of attachment 179710 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=179710&action=review > > > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:46 > > + uint16x8_t eightComponents = vld1q_u16(source + i); > > + eightComponents = vshrq_n_u16(eightComponents, 8); > > + vst1_u8(destination + i, vqmovn_u16(eightComponents)); > > I think this could be simplified to a simple read/write method without vshr. Just read an interleaved low/high component data, and write back the high component. Similar algorithm can be created to the other cases. Yes thanks I changed it. unpackOneRowOfRGBA16LittleToRGBA8: 3.19x faster now I tried the same with unpackOneRowOfARGB16LittleToRGBA8: uint8x16x2_t components = vld2q_u8(src + i * 2); uint32x4_t ARGB = vreinterpretq_u32_u8(components.val[1]); uint32x4_t RGBA = vorrq_u32(vshrq_n_u32(ARGB, 24), vshlq_n_u32(ARGB, 8)); vst1q_u8(destination + i, vreinterpretq_u8_u32(RGBA)); It was a little bit slower than my original solution.

Zoltan Herczeg

Comment 7 2013-01-07 06:28:54 PST

Comment on attachment 180126 [details] patch2 r=me

WebKit Review Bot

Comment 8 2013-01-07 06:32:53 PST

Comment on attachment 180126 [details] patch2 Clearing flags on attachment: 180126 Committed r138936: <http://trac.webkit.org/changeset/138936>

WebKit Review Bot

Comment 9 2013-01-07 06:32:57 PST

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component WebGL

Assignee

Nobody

Reported

2012-11-29 01:04 PST

Modified

2013-01-07 06:32 PST History

CC List

4 users Show

URL

Keywords

Depends on

Blocks