Bug 101473

Summary: Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON intrinsics in GraphicsContext3D
Product: WebKit Reporter: Gabor Rapcsanyi <rgabor>
Component: WebGLAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: dglazkov, dino, d-r, gtk-ews, gustavo, gyuyoung.kim, peter+ews, rakuco, webkit.review.bot, xan.lopez, zherczeg
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
patch
gtk-ews: commit-queue-
patch_v2
zherczeg: review-
patch_v3
none
patch_v4 none

Description Gabor Rapcsanyi 2012-11-07 07:37:23 PST
This is the first but I would like to optimize the others as well.
Comment 1 Gabor Rapcsanyi 2012-11-07 08:21:14 PST
Created attachment 172810 [details]
patch

I tested on Pandaboard with Linaro 12.10 Ubuntu.

unpackOneRowOfRGBA4444ToRGBA8: 2.87x faster
packOneRowOfRGBA8ToUnsignedShort4444: 3.11x faster

With WebGl gl.texImage2D() it was 1.18x faster.
Comment 2 kov's GTK+ EWS bot 2012-11-07 08:28:41 PST
Comment on attachment 172810 [details]
patch

Attachment 172810 [details] did not pass gtk-ews (gtk):
Output: http://queues.webkit.org/results/14759398
Comment 3 WebKit Review Bot 2012-11-07 09:21:25 PST
Comment on attachment 172810 [details]
patch

Attachment 172810 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/14758395
Comment 4 Peter Beverloo (cr-android ews) 2012-11-07 09:36:46 PST
Comment on attachment 172810 [details]
patch

Attachment 172810 [details] did not pass cr-android-ews (chromium-android):
Output: http://queues.webkit.org/results/14744817
Comment 5 Gabor Rapcsanyi 2012-11-08 05:59:28 PST
Created attachment 173024 [details]
patch_v2

Include paths added.
Comment 6 Zoltan Herczeg 2012-11-12 03:38:00 PST
Comment on attachment 173024 [details]
patch_v2

View in context: https://bugs.webkit.org/attachment.cgi?id=173024&action=review

> Source/WebCore/WebCore.pri:56
> +    $$SOURCE_DIR/platform/graphics/arm \

Since we have a gpu directory, I think a cpu/arm directory would be better. All ARM specific optimizations could go here eventually (instead of creating subdirectories, so the filter specific optimizations could be moved here later).

> Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44
> +        uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12));
> +        uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant));
> +        uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant));
> +        uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant));

This takes 6 instructions. You can do it using only four, by deinterleaving the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to extract the components.

> Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49
> +        componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR);
> +        componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG);
> +        componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB);
> +        componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA);

Hm even better idea:
componentR8 = component R4G4 << 4
componentG8 = component R4G4 & 0xf0
So you don't even nned to extract the components!
NEON is beautiful magic!

> Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74
> +        uint8x8x2_t tmp = vzip_u8(componentBA, componentRG);
> +        uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]);
> +
> +        vst1q_u16(destination, vreinterpretq_u16_u8(result));

You can simply use a deinterleaved write here.
Comment 7 Gabor Rapcsanyi 2012-11-12 08:55:33 PST
Created attachment 173654 [details]
patch_v3

(In reply to comment #6)
> (From update of attachment 173024 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=173024&action=review
> 
> > Source/WebCore/WebCore.pri:56
> > +    $$SOURCE_DIR/platform/graphics/arm \
> 
> Since we have a gpu directory, I think a cpu/arm directory would be better. All ARM specific optimizations could go here eventually (instead of creating subdirectories, so the filter specific optimizations could be moved here later).
> 

Yes that makes sense. I put this arm directory into cpu.

> > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44
> > +        uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12));
> > +        uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant));
> > +        uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant));
> > +        uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant));
> 
> This takes 6 instructions. You can do it using only four, by deinterleaving the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to extract the components.
> 
> > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49
> > +        componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR);
> > +        componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG);
> > +        componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB);
> > +        componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA);
> 
> Hm even better idea:
> componentR8 = component R4G4 << 4
> componentG8 = component R4G4 & 0xf0
> So you don't even nned to extract the components!
> NEON is beautiful magic!
> 

I tried it but surprisingly it was slower a little bit than my solution. As I saw vld2_u8() is slower than vld1q_u16() so its not worth to change it.


> > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74
> > +        uint8x8x2_t tmp = vzip_u8(componentBA, componentRG);
> > +        uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]);
> > +
> > +        vst1q_u16(destination, vreinterpretq_u16_u8(result));
> 
> You can simply use a deinterleaved write here.

Good catch, I have changed it and now this function is 3.93x faster than the original.
Comment 8 Zoltan Herczeg 2012-11-12 09:55:51 PST
Comment on attachment 173654 [details]
patch_v3

Nice. Few more things, and this patch is ready:

View in context: https://bugs.webkit.org/attachment.cgi?id=173654&action=review

> Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:2
> + * Copyright (C) 2012 University of Szeged

You could also mention your name here.

> Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:74
> +        uint8x8x2_t RGBA;
> +        RGBA.val[0] = vorr_u8(componentB, componentA);
> +        RGBA.val[1] = vorr_u8(componentR, componentG);
> +        vst2_u8(dst, RGBA);

For me the "components" and "RGBA" names are not exactly consistent. Perhaps you could use RGBA4 and RGBA8 instead of them.
Comment 9 Gabor Rapcsanyi 2012-11-12 23:14:35 PST
Created attachment 173830 [details]
patch_v4

Corrected patch
Comment 10 Zoltan Herczeg 2012-11-12 23:28:04 PST
Comment on attachment 173830 [details]
patch_v4

Nice work! r=me
Comment 11 WebKit Review Bot 2012-11-13 00:33:43 PST
Comment on attachment 173830 [details]
patch_v4

Clearing flags on attachment: 173830

Committed r134378: <http://trac.webkit.org/changeset/134378>
Comment 12 WebKit Review Bot 2012-11-13 00:33:48 PST
All reviewed patches have been landed.  Closing bug.