101473 – Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON intrinsics in GraphicsContext3D

RESOLVED FIXED 101473

Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON intrinsics in GraphicsContext3D

https://bugs.webkit.org/show_bug.cgi?id=101473

Summary Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON intrinsics in ...

Gabor Rapcsanyi

Reported 2012-11-07 07:37:23 PST

This is the first but I would like to optimize the others as well.

Attachments
patch (11.58 KB, patch) 2012-11-07 08:21 PST, Gabor Rapcsanyi	gtk-ews: commit-queue-	Details Formatted Diff Diff
patch_v2 (12.82 KB, patch) 2012-11-08 05:59 PST, Gabor Rapcsanyi	zherczeg: review-	Details Formatted Diff Diff
patch_v3 (13.10 KB, patch) 2012-11-12 08:55 PST, Gabor Rapcsanyi	no flags	Details Formatted Diff Diff
patch_v4 (13.12 KB, patch) 2012-11-12 23:14 PST, Gabor Rapcsanyi	no flags	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Gabor Rapcsanyi

Comment 1 2012-11-07 08:21:14 PST

Created attachment 172810 [details] patch I tested on Pandaboard with Linaro 12.10 Ubuntu. unpackOneRowOfRGBA4444ToRGBA8: 2.87x faster packOneRowOfRGBA8ToUnsignedShort4444: 3.11x faster With WebGl gl.texImage2D() it was 1.18x faster.

kov's GTK+ EWS bot

Comment 2 2012-11-07 08:28:41 PST

Comment on attachment 172810 [details] patch Attachment 172810 [details] did not pass gtk-ews (gtk): Output: http://queues.webkit.org/results/14759398

WebKit Review Bot

Comment 3 2012-11-07 09:21:25 PST

Comment on attachment 172810 [details] patch Attachment 172810 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/14758395

Peter Beverloo (cr-android ews)

Comment 4 2012-11-07 09:36:46 PST

Comment on attachment 172810 [details] patch Attachment 172810 [details] did not pass cr-android-ews (chromium-android): Output: http://queues.webkit.org/results/14744817

Gabor Rapcsanyi

Comment 5 2012-11-08 05:59:28 PST

Created attachment 173024 [details] patch_v2 Include paths added.

Zoltan Herczeg

Comment 6 2012-11-12 03:38:00 PST

Comment on attachment 173024 [details] patch_v2 View in context: https://bugs.webkit.org/attachment.cgi?id=173024&action=review > Source/WebCore/WebCore.pri:56 > + $$SOURCE_DIR/platform/graphics/arm \ Since we have a gpu directory, I think a cpu/arm directory would be better. All ARM specific optimizations could go here eventually (instead of creating subdirectories, so the filter specific optimizations could be moved here later). > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44 > + uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12)); > + uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant)); > + uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant)); > + uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant)); This takes 6 instructions. You can do it using only four, by deinterleaving the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to extract the components. > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49 > + componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR); > + componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG); > + componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB); > + componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA); Hm even better idea: componentR8 = component R4G4 << 4 componentG8 = component R4G4 & 0xf0 So you don't even nned to extract the components! NEON is beautiful magic! > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74 > + uint8x8x2_t tmp = vzip_u8(componentBA, componentRG); > + uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]); > + > + vst1q_u16(destination, vreinterpretq_u16_u8(result)); You can simply use a deinterleaved write here.

Gabor Rapcsanyi

Comment 7 2012-11-12 08:55:33 PST

Created attachment 173654 [details] patch_v3 (In reply to comment #6) > (From update of attachment 173024 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=173024&action=review > > > Source/WebCore/WebCore.pri:56 > > + $$SOURCE_DIR/platform/graphics/arm \ > > Since we have a gpu directory, I think a cpu/arm directory would be better. All ARM specific optimizations could go here eventually (instead of creating subdirectories, so the filter specific optimizations could be moved here later). > Yes that makes sense. I put this arm directory into cpu. > > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44 > > + uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12)); > > + uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant)); > > + uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant)); > > + uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant)); > > This takes 6 instructions. You can do it using only four, by deinterleaving the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to extract the components. > > > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49 > > + componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR); > > + componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG); > > + componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB); > > + componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA); > > Hm even better idea: > componentR8 = component R4G4 << 4 > componentG8 = component R4G4 & 0xf0 > So you don't even nned to extract the components! > NEON is beautiful magic! > I tried it but surprisingly it was slower a little bit than my solution. As I saw vld2_u8() is slower than vld1q_u16() so its not worth to change it. > > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74 > > + uint8x8x2_t tmp = vzip_u8(componentBA, componentRG); > > + uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]); > > + > > + vst1q_u16(destination, vreinterpretq_u16_u8(result)); > > You can simply use a deinterleaved write here. Good catch, I have changed it and now this function is 3.93x faster than the original.

Zoltan Herczeg

Comment 8 2012-11-12 09:55:51 PST

Comment on attachment 173654 [details] patch_v3 Nice. Few more things, and this patch is ready: View in context: https://bugs.webkit.org/attachment.cgi?id=173654&action=review > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:2 > + * Copyright (C) 2012 University of Szeged You could also mention your name here. > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:74 > + uint8x8x2_t RGBA; > + RGBA.val[0] = vorr_u8(componentB, componentA); > + RGBA.val[1] = vorr_u8(componentR, componentG); > + vst2_u8(dst, RGBA); For me the "components" and "RGBA" names are not exactly consistent. Perhaps you could use RGBA4 and RGBA8 instead of them.

Gabor Rapcsanyi

Comment 9 2012-11-12 23:14:35 PST

Created attachment 173830 [details] patch_v4 Corrected patch

Zoltan Herczeg

Comment 10 2012-11-12 23:28:04 PST

Comment on attachment 173830 [details] patch_v4 Nice work! r=me

WebKit Review Bot

Comment 11 2012-11-13 00:33:43 PST

Comment on attachment 173830 [details] patch_v4 Clearing flags on attachment: 173830 Committed r134378: <http://trac.webkit.org/changeset/134378>

WebKit Review Bot

Comment 12 2012-11-13 00:33:48 PST

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component WebGL

Assignee

Nobody

Reported

2012-11-07 07:37 PST

Modified

2012-11-13 00:33 PST History

CC List

11 users Show

URL

Keywords

Depends on

Blocks