Bug 103614

Summary: Optimizing RGBA16, RGB16, ARGB16, BGRA16 unpacking functions with NEON intrinsics
Product: WebKit Reporter: Gabor Rapcsanyi <rgabor>
Component: WebGLAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: dino, laszlo.gombos, webkit.review.bot, zherczeg
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
patch
zherczeg: review-, zherczeg: commit-queue-
modified patch
zherczeg: review-
patch2 none

Description Gabor Rapcsanyi 2012-11-29 01:04:36 PST
Optimizing RGBA16LittleToRGBA8, RGB16LittleToRGBA8, ARGB16LittleToRGBA8, BGRA16LittleToRGBA8 unpacking functions in GraphicsContext3D with ARM NEON intrinsics.
Comment 1 Gabor Rapcsanyi 2012-11-29 05:17:47 PST
Created attachment 176708 [details]
patch

I tried it on Pandaboard with Linaro 12.10 Ubuntu.

unpackOneRowOfRGBA16LittleToRGBA8: 2.3x faster
unpackOneRowOfRGB16LittleToRGBA8: 1.97x faster
unpackOneRowOfARGB16LittleToRGBA8: 3.07x faster
unpackOneRowOfBGRA16LittleToRGBA8: 2.93x faster
Comment 2 Zoltan Herczeg 2012-12-06 00:35:24 PST
Comment on attachment 176708 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=176708&action=review

> Source/WebCore/platform/graphics/GraphicsContext3D.cpp:377
> +#if HAVE(ARM_NEON_INTRINSICS)
> +    unsigned componentsPerRow = pixelsPerRow * 4;
> +    unsigned tailComponents = componentsPerRow % 8;
> +    unsigned componentsSize = componentsPerRow - tailComponents;
> +
> +    ARM::unpackOneRowOfRGBA16LittleToRGBA8NEON(source, destination, componentsSize);
> +
> +    source += componentsSize;
> +    destination += componentsSize;
> +    pixelsPerRow = tailComponents / 4;
> +#endif

I realized that I don't really like in this approach. The modification of the common code path is way too big. And too ARM specific.

I would prefer:

#if HAVE(ARM_NEON_INTRINSICS) optionally other SIMDS connected with || operator
    SIMD::unpackOneRowOfRGBA16LittleToRGBA8(source, destination, pixelsPerRow)
$endif

And the SIMD class (namespace) would define the folowing interface:
inline void SIMD::unpackOneRowOfRGBA16LittleToRGBA8(const uint16_t*& source, uint8_t*& destination, unsigned int& pixelsPerRow)

Advantages:
1) Modifications of the common code path is much shorter.
2) Can modify the arguments, since they passed as reference (usually SIMD process a group of pixels, but not necessary all if the length is not divisible by a certain value).
3) Can be extended to support other SIMD-s, not just NEON
4) Still it can be seen that a certain function is supported by the current CPU (So it is not necessary to support all of these functions on all CPUs which have SIMD support)
Comment 3 Gabor Rapcsanyi 2012-12-15 07:24:58 PST
(In reply to comment #2)
> (From update of attachment 176708 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=176708&action=review
> 
> > Source/WebCore/platform/graphics/GraphicsContext3D.cpp:377
> > +#if HAVE(ARM_NEON_INTRINSICS)
> > +    unsigned componentsPerRow = pixelsPerRow * 4;
> > +    unsigned tailComponents = componentsPerRow % 8;
> > +    unsigned componentsSize = componentsPerRow - tailComponents;
> > +
> > +    ARM::unpackOneRowOfRGBA16LittleToRGBA8NEON(source, destination, componentsSize);
> > +
> > +    source += componentsSize;
> > +    destination += componentsSize;
> > +    pixelsPerRow = tailComponents / 4;
> > +#endif
> 
> I realized that I don't really like in this approach. The modification of the common code path is way too big. And too ARM specific.
> 
> I would prefer:
> 
> #if HAVE(ARM_NEON_INTRINSICS) optionally other SIMDS connected with || operator
>     SIMD::unpackOneRowOfRGBA16LittleToRGBA8(source, destination, pixelsPerRow)
> $endif
> 
> And the SIMD class (namespace) would define the folowing interface:
> inline void SIMD::unpackOneRowOfRGBA16LittleToRGBA8(const uint16_t*& source, uint8_t*& destination, unsigned int& pixelsPerRow)
> 
> Advantages:
> 1) Modifications of the common code path is much shorter.
> 2) Can modify the arguments, since they passed as reference (usually SIMD process a group of pixels, but not necessary all if the length is not divisible by a certain value).
> 3) Can be extended to support other SIMD-s, not just NEON
> 4) Still it can be seen that a certain function is supported by the current CPU (So it is not necessary to support all of these functions on all CPUs which have SIMD support)

Yes, sounds logical to me so I made a bug to change those optimizations: https://bugs.webkit.org/show_bug.cgi?id=105086
Comment 4 Gabor Rapcsanyi 2012-12-17 03:29:37 PST
Created attachment 179710 [details]
modified patch
Comment 5 Zoltan Herczeg 2012-12-17 03:54:44 PST
Comment on attachment 179710 [details]
modified patch

View in context: https://bugs.webkit.org/attachment.cgi?id=179710&action=review

> Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:46
> +        uint16x8_t eightComponents = vld1q_u16(source + i);
> +        eightComponents = vshrq_n_u16(eightComponents, 8);
> +        vst1_u8(destination + i, vqmovn_u16(eightComponents));

I think this could be simplified to a simple read/write method without vshr. Just read an interleaved low/high component data, and write back the high component. Similar algorithm can be created to the other cases.
Comment 6 Gabor Rapcsanyi 2012-12-19 03:33:47 PST
Created attachment 180126 [details]
patch2

(In reply to comment #5)
> (From update of attachment 179710 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=179710&action=review
> 
> > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:46
> > +        uint16x8_t eightComponents = vld1q_u16(source + i);
> > +        eightComponents = vshrq_n_u16(eightComponents, 8);
> > +        vst1_u8(destination + i, vqmovn_u16(eightComponents));
> 
> I think this could be simplified to a simple read/write method without vshr. Just read an interleaved low/high component data, and write back the high component. Similar algorithm can be created to the other cases.

Yes thanks I changed it.
unpackOneRowOfRGBA16LittleToRGBA8: 3.19x faster now

I tried the same with unpackOneRowOfARGB16LittleToRGBA8:
  uint8x16x2_t components = vld2q_u8(src + i * 2);
  uint32x4_t ARGB = vreinterpretq_u32_u8(components.val[1]);
  uint32x4_t RGBA = vorrq_u32(vshrq_n_u32(ARGB, 24), vshlq_n_u32(ARGB, 8));
  vst1q_u8(destination + i, vreinterpretq_u8_u32(RGBA));

It was a little bit slower than my original solution.
Comment 7 Zoltan Herczeg 2013-01-07 06:28:54 PST
Comment on attachment 180126 [details]
patch2

r=me
Comment 8 WebKit Review Bot 2013-01-07 06:32:53 PST
Comment on attachment 180126 [details]
patch2

Clearing flags on attachment: 180126

Committed r138936: <http://trac.webkit.org/changeset/138936>
Comment 9 WebKit Review Bot 2013-01-07 06:32:57 PST
All reviewed patches have been landed.  Closing bug.