Bug 101473

Summary:

Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON intrinsics in GraphicsContext3D

Product:

WebKit

Reporter:

Gabor Rapcsanyi <rgabor>

Component:

WebGL

Assignee:

Nobody <webkit-unassigned>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

dglazkov, dino, d-r, gtk-ews, gustavo, gyuyoung.kim, peter+ews, rakuco, webkit.review.bot, xan.lopez, zherczeg

Priority:

Version:

528+ (Nightly build)

Hardware:

Unspecified

OS:

Unspecified

Attachments:

Description	Flags
patch	gtk-ews: commit-queue-
patch_v2	zherczeg: review-
patch_v3	none
patch_v4	none

Gabor Rapcsanyi

Reported 2012-11-07 07:37:23 PST

This is the first but I would like to optimize the others as well.

Attachments
patch (11.58 KB, patch) 2012-11-07 08:21 PST, Gabor Rapcsanyi	gtk-ews: commit-queue-	Details Formatted Diff Diff
patch_v2 (12.82 KB, patch) 2012-11-08 05:59 PST, Gabor Rapcsanyi	zherczeg: review-	Details Formatted Diff Diff
patch_v3 (13.10 KB, patch) 2012-11-12 08:55 PST, Gabor Rapcsanyi	no flags	Details Formatted Diff Diff
patch_v4 (13.12 KB, patch) 2012-11-12 23:14 PST, Gabor Rapcsanyi	no flags	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.

Gabor Rapcsanyi

Comment 1 2012-11-07 08:21:14 PST

Created attachment 172810 [details] patch I tested on Pandaboard with Linaro 12.10 Ubuntu. unpackOneRowOfRGBA4444ToRGBA8: 2.87x faster packOneRowOfRGBA8ToUnsignedShort4444: 3.11x faster With WebGl gl.texImage2D() it was 1.18x faster.

kov's GTK+ EWS bot

Comment 2 2012-11-07 08:28:41 PST

Comment on attachment 172810 [details] patch Attachment 172810 [details] did not pass gtk-ews (gtk): Output: http://queues.webkit.org/results/14759398

WebKit Review Bot

Comment 3 2012-11-07 09:21:25 PST

Comment on attachment 172810 [details] patch Attachment 172810 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/14758395

Peter Beverloo (cr-android ews)

Comment 4 2012-11-07 09:36:46 PST

Comment on attachment 172810 [details] patch Attachment 172810 [details] did not pass cr-android-ews (chromium-android): Output: http://queues.webkit.org/results/14744817

Gabor Rapcsanyi

Comment 5 2012-11-08 05:59:28 PST

Created attachment 173024 [details] patch_v2 Include paths added.

Zoltan Herczeg

Comment 6 2012-11-12 03:38:00 PST

Comment on attachment 173024 [details] patch_v2 View in context: https://bugs.webkit.org/attachment.cgi?id=173024&action=review > Source/WebCore/WebCore.pri:56 > + $$SOURCE_DIR/platform/graphics/arm \ Since we have a gpu directory, I think a cpu/arm directory would be better. All ARM specific optimizations could go here eventually (instead of creating subdirectories, so the filter specific optimizations could be moved here later). > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44 > + uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12)); > + uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant)); > + uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant)); > + uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant)); This takes 6 instructions. You can do it using only four, by deinterleaving the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to extract the components. > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49 > + componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR); > + componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG); > + componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB); > + componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA); Hm even better idea: componentR8 = component R4G4 << 4 componentG8 = component R4G4 & 0xf0 So you don't even nned to extract the components! NEON is beautiful magic! > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74 > + uint8x8x2_t tmp = vzip_u8(componentBA, componentRG); > + uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]); > + > + vst1q_u16(destination, vreinterpretq_u16_u8(result)); You can simply use a deinterleaved write here.

Gabor Rapcsanyi

Comment 7 2012-11-12 08:55:33 PST

Created attachment 173654 [details] patch_v3 (In reply to comment #6) > (From update of attachment 173024 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=173024&action=review > > > Source/WebCore/WebCore.pri:56 > > + $$SOURCE_DIR/platform/graphics/arm \ > > Since we have a gpu directory, I think a cpu/arm directory would be better. All ARM specific optimizations could go here eventually (instead of creating subdirectories, so the filter specific optimizations could be moved here later). > Yes that makes sense. I put this arm directory into cpu. > > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44 > > + uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12)); > > + uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant)); > > + uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant)); > > + uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant)); > > This takes 6 instructions. You can do it using only four, by deinterleaving the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to extract the components. > > > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49 > > + componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR); > > + componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG); > > + componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB); > > + componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA); > > Hm even better idea: > componentR8 = component R4G4 << 4 > componentG8 = component R4G4 & 0xf0 > So you don't even nned to extract the components! > NEON is beautiful magic! > I tried it but surprisingly it was slower a little bit than my solution. As I saw vld2_u8() is slower than vld1q_u16() so its not worth to change it. > > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74 > > + uint8x8x2_t tmp = vzip_u8(componentBA, componentRG); > > + uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]); > > + > > + vst1q_u16(destination, vreinterpretq_u16_u8(result)); > > You can simply use a deinterleaved write here. Good catch, I have changed it and now this function is 3.93x faster than the original.

Zoltan Herczeg

Comment 8 2012-11-12 09:55:51 PST

Comment on attachment 173654 [details] patch_v3 Nice. Few more things, and this patch is ready: View in context: https://bugs.webkit.org/attachment.cgi?id=173654&action=review > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:2 > + * Copyright (C) 2012 University of Szeged You could also mention your name here. > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:74 > + uint8x8x2_t RGBA; > + RGBA.val[0] = vorr_u8(componentB, componentA); > + RGBA.val[1] = vorr_u8(componentR, componentG); > + vst2_u8(dst, RGBA); For me the "components" and "RGBA" names are not exactly consistent. Perhaps you could use RGBA4 and RGBA8 instead of them.

Gabor Rapcsanyi

Comment 9 2012-11-12 23:14:35 PST

Created attachment 173830 [details] patch_v4 Corrected patch

Zoltan Herczeg

Comment 10 2012-11-12 23:28:04 PST

Comment on attachment 173830 [details] patch_v4 Nice work! r=me

WebKit Review Bot

Comment 11 2012-11-13 00:33:43 PST

Comment on attachment 173830 [details] patch_v4 Clearing flags on attachment: 173830 Committed r134378: <http://trac.webkit.org/changeset/134378>

WebKit Review Bot

Comment 12 2012-11-13 00:33:48 PST

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.