WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
101473
Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON intrinsics in GraphicsContext3D
https://bugs.webkit.org/show_bug.cgi?id=101473
Summary
Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON intrinsics in ...
Gabor Rapcsanyi
Reported
2012-11-07 07:37:23 PST
This is the first but I would like to optimize the others as well.
Attachments
patch
(11.58 KB, patch)
2012-11-07 08:21 PST
,
Gabor Rapcsanyi
gtk-ews
: commit-queue-
Details
Formatted Diff
Diff
patch_v2
(12.82 KB, patch)
2012-11-08 05:59 PST
,
Gabor Rapcsanyi
zherczeg
: review-
Details
Formatted Diff
Diff
patch_v3
(13.10 KB, patch)
2012-11-12 08:55 PST
,
Gabor Rapcsanyi
no flags
Details
Formatted Diff
Diff
patch_v4
(13.12 KB, patch)
2012-11-12 23:14 PST
,
Gabor Rapcsanyi
no flags
Details
Formatted Diff
Diff
Show Obsolete
(3)
View All
Add attachment
proposed patch, testcase, etc.
Gabor Rapcsanyi
Comment 1
2012-11-07 08:21:14 PST
Created
attachment 172810
[details]
patch I tested on Pandaboard with Linaro 12.10 Ubuntu. unpackOneRowOfRGBA4444ToRGBA8: 2.87x faster packOneRowOfRGBA8ToUnsignedShort4444: 3.11x faster With WebGl gl.texImage2D() it was 1.18x faster.
kov's GTK+ EWS bot
Comment 2
2012-11-07 08:28:41 PST
Comment on
attachment 172810
[details]
patch
Attachment 172810
[details]
did not pass gtk-ews (gtk): Output:
http://queues.webkit.org/results/14759398
WebKit Review Bot
Comment 3
2012-11-07 09:21:25 PST
Comment on
attachment 172810
[details]
patch
Attachment 172810
[details]
did not pass chromium-ews (chromium-xvfb): Output:
http://queues.webkit.org/results/14758395
Peter Beverloo (cr-android ews)
Comment 4
2012-11-07 09:36:46 PST
Comment on
attachment 172810
[details]
patch
Attachment 172810
[details]
did not pass cr-android-ews (chromium-android): Output:
http://queues.webkit.org/results/14744817
Gabor Rapcsanyi
Comment 5
2012-11-08 05:59:28 PST
Created
attachment 173024
[details]
patch_v2 Include paths added.
Zoltan Herczeg
Comment 6
2012-11-12 03:38:00 PST
Comment on
attachment 173024
[details]
patch_v2 View in context:
https://bugs.webkit.org/attachment.cgi?id=173024&action=review
> Source/WebCore/WebCore.pri:56 > + $$SOURCE_DIR/platform/graphics/arm \
Since we have a gpu directory, I think a cpu/arm directory would be better. All ARM specific optimizations could go here eventually (instead of creating subdirectories, so the filter specific optimizations could be moved here later).
> Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44 > + uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12)); > + uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant)); > + uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant)); > + uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant));
This takes 6 instructions. You can do it using only four, by deinterleaving the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to extract the components.
> Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49 > + componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR); > + componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG); > + componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB); > + componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA);
Hm even better idea: componentR8 = component R4G4 << 4 componentG8 = component R4G4 & 0xf0 So you don't even nned to extract the components! NEON is beautiful magic!
> Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74 > + uint8x8x2_t tmp = vzip_u8(componentBA, componentRG); > + uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]); > + > + vst1q_u16(destination, vreinterpretq_u16_u8(result));
You can simply use a deinterleaved write here.
Gabor Rapcsanyi
Comment 7
2012-11-12 08:55:33 PST
Created
attachment 173654
[details]
patch_v3 (In reply to
comment #6
)
> (From update of
attachment 173024
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=173024&action=review
> > > Source/WebCore/WebCore.pri:56 > > + $$SOURCE_DIR/platform/graphics/arm \ > > Since we have a gpu directory, I think a cpu/arm directory would be better. All ARM specific optimizations could go here eventually (instead of creating subdirectories, so the filter specific optimizations could be moved here later). >
Yes that makes sense. I put this arm directory into cpu.
> > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44 > > + uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12)); > > + uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant)); > > + uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant)); > > + uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant)); > > This takes 6 instructions. You can do it using only four, by deinterleaving the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to extract the components. > > > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49 > > + componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR); > > + componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG); > > + componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB); > > + componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA); > > Hm even better idea: > componentR8 = component R4G4 << 4 > componentG8 = component R4G4 & 0xf0 > So you don't even nned to extract the components! > NEON is beautiful magic! >
I tried it but surprisingly it was slower a little bit than my solution. As I saw vld2_u8() is slower than vld1q_u16() so its not worth to change it.
> > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74 > > + uint8x8x2_t tmp = vzip_u8(componentBA, componentRG); > > + uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]); > > + > > + vst1q_u16(destination, vreinterpretq_u16_u8(result)); > > You can simply use a deinterleaved write here.
Good catch, I have changed it and now this function is 3.93x faster than the original.
Zoltan Herczeg
Comment 8
2012-11-12 09:55:51 PST
Comment on
attachment 173654
[details]
patch_v3 Nice. Few more things, and this patch is ready: View in context:
https://bugs.webkit.org/attachment.cgi?id=173654&action=review
> Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:2 > + * Copyright (C) 2012 University of Szeged
You could also mention your name here.
> Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:74 > + uint8x8x2_t RGBA; > + RGBA.val[0] = vorr_u8(componentB, componentA); > + RGBA.val[1] = vorr_u8(componentR, componentG); > + vst2_u8(dst, RGBA);
For me the "components" and "RGBA" names are not exactly consistent. Perhaps you could use RGBA4 and RGBA8 instead of them.
Gabor Rapcsanyi
Comment 9
2012-11-12 23:14:35 PST
Created
attachment 173830
[details]
patch_v4 Corrected patch
Zoltan Herczeg
Comment 10
2012-11-12 23:28:04 PST
Comment on
attachment 173830
[details]
patch_v4 Nice work! r=me
WebKit Review Bot
Comment 11
2012-11-13 00:33:43 PST
Comment on
attachment 173830
[details]
patch_v4 Clearing flags on attachment: 173830 Committed
r134378
: <
http://trac.webkit.org/changeset/134378
>
WebKit Review Bot
Comment 12
2012-11-13 00:33:48 PST
All reviewed patches have been landed. Closing bug.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug