WebKit Bugzilla
New
Browse
Search+
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
103614
Optimizing RGBA16, RGB16, ARGB16, BGRA16 unpacking functions with NEON intrinsics
https://bugs.webkit.org/show_bug.cgi?id=103614
Summary
Optimizing RGBA16, RGB16, ARGB16, BGRA16 unpacking functions with NEON intrin...
Gabor Rapcsanyi
Reported
2012-11-29 01:04:36 PST
Optimizing RGBA16LittleToRGBA8, RGB16LittleToRGBA8, ARGB16LittleToRGBA8, BGRA16LittleToRGBA8 unpacking functions in GraphicsContext3D with ARM NEON intrinsics.
Attachments
patch
(7.24 KB, patch)
2012-11-29 05:17 PST
,
Gabor Rapcsanyi
zherczeg
: review-
zherczeg
: commit-queue-
Details
Formatted Diff
Diff
modified patch
(7.23 KB, patch)
2012-12-17 03:29 PST
,
Gabor Rapcsanyi
zherczeg
: review-
Details
Formatted Diff
Diff
patch2
(7.23 KB, patch)
2012-12-19 03:33 PST
,
Gabor Rapcsanyi
no flags
Details
Formatted Diff
Diff
Show Obsolete
(2)
View All
Add attachment
proposed patch, testcase, etc.
Gabor Rapcsanyi
Comment 1
2012-11-29 05:17:47 PST
Created
attachment 176708
[details]
patch I tried it on Pandaboard with Linaro 12.10 Ubuntu. unpackOneRowOfRGBA16LittleToRGBA8: 2.3x faster unpackOneRowOfRGB16LittleToRGBA8: 1.97x faster unpackOneRowOfARGB16LittleToRGBA8: 3.07x faster unpackOneRowOfBGRA16LittleToRGBA8: 2.93x faster
Zoltan Herczeg
Comment 2
2012-12-06 00:35:24 PST
Comment on
attachment 176708
[details]
patch View in context:
https://bugs.webkit.org/attachment.cgi?id=176708&action=review
> Source/WebCore/platform/graphics/GraphicsContext3D.cpp:377 > +#if HAVE(ARM_NEON_INTRINSICS) > + unsigned componentsPerRow = pixelsPerRow * 4; > + unsigned tailComponents = componentsPerRow % 8; > + unsigned componentsSize = componentsPerRow - tailComponents; > + > + ARM::unpackOneRowOfRGBA16LittleToRGBA8NEON(source, destination, componentsSize); > + > + source += componentsSize; > + destination += componentsSize; > + pixelsPerRow = tailComponents / 4; > +#endif
I realized that I don't really like in this approach. The modification of the common code path is way too big. And too ARM specific. I would prefer: #if HAVE(ARM_NEON_INTRINSICS) optionally other SIMDS connected with || operator SIMD::unpackOneRowOfRGBA16LittleToRGBA8(source, destination, pixelsPerRow) $endif And the SIMD class (namespace) would define the folowing interface: inline void SIMD::unpackOneRowOfRGBA16LittleToRGBA8(const uint16_t*& source, uint8_t*& destination, unsigned int& pixelsPerRow) Advantages: 1) Modifications of the common code path is much shorter. 2) Can modify the arguments, since they passed as reference (usually SIMD process a group of pixels, but not necessary all if the length is not divisible by a certain value). 3) Can be extended to support other SIMD-s, not just NEON 4) Still it can be seen that a certain function is supported by the current CPU (So it is not necessary to support all of these functions on all CPUs which have SIMD support)
Gabor Rapcsanyi
Comment 3
2012-12-15 07:24:58 PST
(In reply to
comment #2
)
> (From update of
attachment 176708
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=176708&action=review
> > > Source/WebCore/platform/graphics/GraphicsContext3D.cpp:377 > > +#if HAVE(ARM_NEON_INTRINSICS) > > + unsigned componentsPerRow = pixelsPerRow * 4; > > + unsigned tailComponents = componentsPerRow % 8; > > + unsigned componentsSize = componentsPerRow - tailComponents; > > + > > + ARM::unpackOneRowOfRGBA16LittleToRGBA8NEON(source, destination, componentsSize); > > + > > + source += componentsSize; > > + destination += componentsSize; > > + pixelsPerRow = tailComponents / 4; > > +#endif > > I realized that I don't really like in this approach. The modification of the common code path is way too big. And too ARM specific. > > I would prefer: > > #if HAVE(ARM_NEON_INTRINSICS) optionally other SIMDS connected with || operator > SIMD::unpackOneRowOfRGBA16LittleToRGBA8(source, destination, pixelsPerRow) > $endif > > And the SIMD class (namespace) would define the folowing interface: > inline void SIMD::unpackOneRowOfRGBA16LittleToRGBA8(const uint16_t*& source, uint8_t*& destination, unsigned int& pixelsPerRow) > > Advantages: > 1) Modifications of the common code path is much shorter. > 2) Can modify the arguments, since they passed as reference (usually SIMD process a group of pixels, but not necessary all if the length is not divisible by a certain value). > 3) Can be extended to support other SIMD-s, not just NEON > 4) Still it can be seen that a certain function is supported by the current CPU (So it is not necessary to support all of these functions on all CPUs which have SIMD support)
Yes, sounds logical to me so I made a bug to change those optimizations:
https://bugs.webkit.org/show_bug.cgi?id=105086
Gabor Rapcsanyi
Comment 4
2012-12-17 03:29:37 PST
Created
attachment 179710
[details]
modified patch
Zoltan Herczeg
Comment 5
2012-12-17 03:54:44 PST
Comment on
attachment 179710
[details]
modified patch View in context:
https://bugs.webkit.org/attachment.cgi?id=179710&action=review
> Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:46 > + uint16x8_t eightComponents = vld1q_u16(source + i); > + eightComponents = vshrq_n_u16(eightComponents, 8); > + vst1_u8(destination + i, vqmovn_u16(eightComponents));
I think this could be simplified to a simple read/write method without vshr. Just read an interleaved low/high component data, and write back the high component. Similar algorithm can be created to the other cases.
Gabor Rapcsanyi
Comment 6
2012-12-19 03:33:47 PST
Created
attachment 180126
[details]
patch2 (In reply to
comment #5
)
> (From update of
attachment 179710
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=179710&action=review
> > > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:46 > > + uint16x8_t eightComponents = vld1q_u16(source + i); > > + eightComponents = vshrq_n_u16(eightComponents, 8); > > + vst1_u8(destination + i, vqmovn_u16(eightComponents)); > > I think this could be simplified to a simple read/write method without vshr. Just read an interleaved low/high component data, and write back the high component. Similar algorithm can be created to the other cases.
Yes thanks I changed it. unpackOneRowOfRGBA16LittleToRGBA8: 3.19x faster now I tried the same with unpackOneRowOfARGB16LittleToRGBA8: uint8x16x2_t components = vld2q_u8(src + i * 2); uint32x4_t ARGB = vreinterpretq_u32_u8(components.val[1]); uint32x4_t RGBA = vorrq_u32(vshrq_n_u32(ARGB, 24), vshlq_n_u32(ARGB, 8)); vst1q_u8(destination + i, vreinterpretq_u8_u32(RGBA)); It was a little bit slower than my original solution.
Zoltan Herczeg
Comment 7
2013-01-07 06:28:54 PST
Comment on
attachment 180126
[details]
patch2 r=me
WebKit Review Bot
Comment 8
2013-01-07 06:32:53 PST
Comment on
attachment 180126
[details]
patch2 Clearing flags on attachment: 180126 Committed
r138936
: <
http://trac.webkit.org/changeset/138936
>
WebKit Review Bot
Comment 9
2013-01-07 06:32:57 PST
All reviewed patches have been landed. Closing bug.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug