RESOLVED FIXED 101473
Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON intrinsics in GraphicsContext3D
https://bugs.webkit.org/show_bug.cgi?id=101473
Summary Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON intrinsics in ...
Gabor Rapcsanyi
Reported 2012-11-07 07:37:23 PST
This is the first but I would like to optimize the others as well.
Attachments
patch (11.58 KB, patch)
2012-11-07 08:21 PST, Gabor Rapcsanyi
gtk-ews: commit-queue-
patch_v2 (12.82 KB, patch)
2012-11-08 05:59 PST, Gabor Rapcsanyi
zherczeg: review-
patch_v3 (13.10 KB, patch)
2012-11-12 08:55 PST, Gabor Rapcsanyi
no flags
patch_v4 (13.12 KB, patch)
2012-11-12 23:14 PST, Gabor Rapcsanyi
no flags
Gabor Rapcsanyi
Comment 1 2012-11-07 08:21:14 PST
Created attachment 172810 [details] patch I tested on Pandaboard with Linaro 12.10 Ubuntu. unpackOneRowOfRGBA4444ToRGBA8: 2.87x faster packOneRowOfRGBA8ToUnsignedShort4444: 3.11x faster With WebGl gl.texImage2D() it was 1.18x faster.
kov's GTK+ EWS bot
Comment 2 2012-11-07 08:28:41 PST
WebKit Review Bot
Comment 3 2012-11-07 09:21:25 PST
Comment on attachment 172810 [details] patch Attachment 172810 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/14758395
Peter Beverloo (cr-android ews)
Comment 4 2012-11-07 09:36:46 PST
Comment on attachment 172810 [details] patch Attachment 172810 [details] did not pass cr-android-ews (chromium-android): Output: http://queues.webkit.org/results/14744817
Gabor Rapcsanyi
Comment 5 2012-11-08 05:59:28 PST
Created attachment 173024 [details] patch_v2 Include paths added.
Zoltan Herczeg
Comment 6 2012-11-12 03:38:00 PST
Comment on attachment 173024 [details] patch_v2 View in context: https://bugs.webkit.org/attachment.cgi?id=173024&action=review > Source/WebCore/WebCore.pri:56 > + $$SOURCE_DIR/platform/graphics/arm \ Since we have a gpu directory, I think a cpu/arm directory would be better. All ARM specific optimizations could go here eventually (instead of creating subdirectories, so the filter specific optimizations could be moved here later). > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44 > + uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12)); > + uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant)); > + uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant)); > + uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant)); This takes 6 instructions. You can do it using only four, by deinterleaving the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to extract the components. > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49 > + componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR); > + componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG); > + componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB); > + componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA); Hm even better idea: componentR8 = component R4G4 << 4 componentG8 = component R4G4 & 0xf0 So you don't even nned to extract the components! NEON is beautiful magic! > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74 > + uint8x8x2_t tmp = vzip_u8(componentBA, componentRG); > + uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]); > + > + vst1q_u16(destination, vreinterpretq_u16_u8(result)); You can simply use a deinterleaved write here.
Gabor Rapcsanyi
Comment 7 2012-11-12 08:55:33 PST
Created attachment 173654 [details] patch_v3 (In reply to comment #6) > (From update of attachment 173024 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=173024&action=review > > > Source/WebCore/WebCore.pri:56 > > + $$SOURCE_DIR/platform/graphics/arm \ > > Since we have a gpu directory, I think a cpu/arm directory would be better. All ARM specific optimizations could go here eventually (instead of creating subdirectories, so the filter specific optimizations could be moved here later). > Yes that makes sense. I put this arm directory into cpu. > > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44 > > + uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12)); > > + uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant)); > > + uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant)); > > + uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant)); > > This takes 6 instructions. You can do it using only four, by deinterleaving the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to extract the components. > > > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49 > > + componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR); > > + componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG); > > + componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB); > > + componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA); > > Hm even better idea: > componentR8 = component R4G4 << 4 > componentG8 = component R4G4 & 0xf0 > So you don't even nned to extract the components! > NEON is beautiful magic! > I tried it but surprisingly it was slower a little bit than my solution. As I saw vld2_u8() is slower than vld1q_u16() so its not worth to change it. > > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74 > > + uint8x8x2_t tmp = vzip_u8(componentBA, componentRG); > > + uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]); > > + > > + vst1q_u16(destination, vreinterpretq_u16_u8(result)); > > You can simply use a deinterleaved write here. Good catch, I have changed it and now this function is 3.93x faster than the original.
Zoltan Herczeg
Comment 8 2012-11-12 09:55:51 PST
Comment on attachment 173654 [details] patch_v3 Nice. Few more things, and this patch is ready: View in context: https://bugs.webkit.org/attachment.cgi?id=173654&action=review > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:2 > + * Copyright (C) 2012 University of Szeged You could also mention your name here. > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:74 > + uint8x8x2_t RGBA; > + RGBA.val[0] = vorr_u8(componentB, componentA); > + RGBA.val[1] = vorr_u8(componentR, componentG); > + vst2_u8(dst, RGBA); For me the "components" and "RGBA" names are not exactly consistent. Perhaps you could use RGBA4 and RGBA8 instead of them.
Gabor Rapcsanyi
Comment 9 2012-11-12 23:14:35 PST
Created attachment 173830 [details] patch_v4 Corrected patch
Zoltan Herczeg
Comment 10 2012-11-12 23:28:04 PST
Comment on attachment 173830 [details] patch_v4 Nice work! r=me
WebKit Review Bot
Comment 11 2012-11-13 00:33:43 PST
Comment on attachment 173830 [details] patch_v4 Clearing flags on attachment: 173830 Committed r134378: <http://trac.webkit.org/changeset/134378>
WebKit Review Bot
Comment 12 2012-11-13 00:33:48 PST
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.