Bug 235002 - readPixels directly to ArrayBuffer has very high CPU usage
Summary: readPixels directly to ArrayBuffer has very high CPU usage
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: ANGLE
Version: Safari 15
Hardware: All
OS: iOS 15
Priority: P2
Severity: Normal
Assignee: Nobody
URL:
Keywords: InRadar
Depends on:
Blocks: webgl2perfproblem
Reported: 2022-01-08 00:42 PST by Simon Taylor
Modified: 2022-03-30 14:39 PDT
CC: 6 users

See Also:


Attachments
Annotated System Trace for a frame of the demo content (140.09 KB, image/png)
2022-01-08 02:57 PST, Simon Taylor
no flags

Description Simon Taylor 2022-01-08 00:42:05 PST
WebGL1's readPixels only supports reading directly into an ArrayBuffer, and blocks until the GPU work is completed.

WebGL2 added the ability to target a PIXEL_PACK_BUFFER with readPixels, and then copy the pixels back into an ArrayBuffer with getBufferSubData. The usual recommendation is to insert a fence after the readPixels, and wait until it is signalled before calling getBufferSubData, which prevents the main thread from blocking on GPU completion.

My expectation was that there would be no performance benefit to using a PIXEL_PACK_BUFFER with immediate readback, where getBufferSubData will also block until GPU completion.

i.e. I expected the WebGL2-only route:

gl.bindBuffer(gl.PIXEL_PACK_BUFFER, pixelBuffer);
gl.readPixels(0, 0, 640, 360, gl.RGBA, gl.UNSIGNED_BYTE, 0);
gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, pixels);

to have no major performance advantage over the equivalent direct route:

gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
gl.readPixels(0, 0, 640, 360, gl.RGBA, gl.UNSIGNED_BYTE, pixels);
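
For reference, the fence-based recommendation mentioned above might look roughly like the following sketch. The function name and the rAF-based polling strategy are mine, not from the test page; it is just one way to wire up fenceSync/clientWaitSync so the getBufferSubData only runs once the GPU has finished:

```javascript
// Illustrative sketch: asynchronous readback via PIXEL_PACK_BUFFER + fence.
function readPixelsAsync(gl, x, y, w, h, onPixels) {
  // Queue the readback into a GPU-side buffer; this call returns immediately.
  const buf = gl.createBuffer();
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, buf);
  gl.bufferData(gl.PIXEL_PACK_BUFFER, w * h * 4, gl.STREAM_READ);
  gl.readPixels(x, y, w, h, gl.RGBA, gl.UNSIGNED_BYTE, 0);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);

  // Insert a fence after the readPixels and flush so the GPU sees it.
  const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
  gl.flush();

  const poll = () => {
    // Non-blocking status check (timeout of 0).
    if (gl.clientWaitSync(sync, 0, 0) === gl.TIMEOUT_EXPIRED) {
      requestAnimationFrame(poll); // GPU not finished yet - check next frame
      return;
    }
    gl.deleteSync(sync);
    // The fence has signalled, so this copy should not block on the GPU.
    const pixels = new Uint8Array(w * h * 4);
    gl.bindBuffer(gl.PIXEL_PACK_BUFFER, buf);
    gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, pixels);
    gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
    gl.deleteBuffer(buf);
    onPixels(pixels);
  };
  poll();
}
```

With this pattern the main thread never waits on GPU completion; the copy into the ArrayBuffer happens only after the fence reports the work done.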

However, on both iOS (tested on 15.2) and macOS (12.1, Safari 15.2) the direct option is many times slower, and a System Trace in Instruments shows far more CPU activity continuing long after the GPU has finished.

Test case coming up next, along with some System Trace findings from Instruments.
Comment 1 Simon Taylor 2022-01-08 01:51:11 PST
Here is an example page:
https://tango-bravo.net/webkit-bug-235002/index.html

A video element is uploaded to a texture with texImage2D, then rendered to fill the first canvas.

Two different readPixels approaches are then applied: first via a PIXEL_PACK_BUFFER, then directly to an ArrayBuffer.

To confirm the data (from the first, faster method) is correct, it is uploaded with texImage2D to a second WebGL context and rendered there too.

In Safari 15.2 (both iOS and macOS) the direct readPixels is significantly slower.

The page shows average timings every 60 frames - the "readPixels (with buffer)" timings separate out the (very fast) readPixels and the (blocking) getBufferSubData.

Note the PIXEL_PACK_BUFFER readPixels happens first, so all of the time spent waiting for GPU completion is accounted for in the getBufferSubData timings. Even with that, the direct readPixels is typically 3x slower on both iPhone 12 and M1 Pro MacBook.

Typical iPhone 12 timings:

Total Draw Time: 8.97 ms 
Upload from video: 0.67 ms 
readPixels (with buffer): 0.12 + 1.92 ms 
readPixels (direct): 6.07 ms 
Render loop: 59.88 fps
Comment 2 Simon Taylor 2022-01-08 02:57:48 PST
Created attachment 448667 [details]
Annotated System Trace for a frame of the demo content

I've attached an annotated System Trace screenshot for one frame of this demo.

The flags in the timeline show the console.log calls at different checkpoints (when Web Inspector is open these result in syscalls, so they can be accurately located on the trace). The full set of 8 logs is:
  start
  upload complete
  draw complete
  readPixels (via buffer) complete
  getBufferSubData complete
  readPixels (direct) complete
  texImage from ArrayBuffer complete
  second draw complete

The three annotated regions dominate the timings:
 1. texImage2D from the video element (this is a frame where the video has a new frame, so it is the heavier workload)
 2. getBufferSubData - note the vast majority of this time is spent blocked waiting for the GPU (the syscall stack shows it is within [_MTLCommandBuffer waitUntilCompleted]), and the CPU is essentially idle for this time.
 3. readPixels directly to ArrayBuffer - note the CPU is pegged throughout.

During the slow readPixels you can see a large number of syscalls, which look like repeated calls to IOSurfaceClientLock and IOSurfaceClientUnlock. The relevant part of the stack from one of these is shown below. I wonder if this is some "robust" fallback path (ReadnPixelsRobustANGLE is in the trace) that perhaps fetches one row at a time?

It feels like there are very significant gains to be made here by switching to a backend implementation similar to the PIXEL_PACK_BUFFER readback.

   0 libsystem_kernel.dylib mach_msg_trap
   1 IOKit io_connect_method
   2 IOKit IOConnectCallMethod
   3 IOSurface IOSurfaceClientLock
   4 AGXMetalA14 0x1f3082ee4
   5 AGXMetalA14 0x1f3082e80
   6 libANGLE-shared.dylib rx::mtl::Texture::getBytes(rx::ContextMtl*, unsigned long, unsigned long, MTLRegion const&, gl::LevelIndexWrapper<unsigned int> const&, unsigned int, unsigned char*)
   7 libANGLE-shared.dylib rx::mtl::ReadTexturePerSliceBytes(gl::Context const*, std::__1::shared_ptr<rx::mtl::Texture> const&, unsigned long, gl::Rectangle const&, gl::LevelIndexWrapper<unsigned int> const&, unsigned int, unsigned char*)
   8 libANGLE-shared.dylib rx::FramebufferMtl::readPixelsImpl(gl::Context const*, gl::Rectangle const&, rx::PackPixelsParams const&, rx::RenderTargetMtl const*, unsigned char*) const
   9 libANGLE-shared.dylib rx::FramebufferMtl::readPixels(gl::Context const*, gl::Rectangle const&, unsigned int, unsigned int, gl::PixelPackState const&, gl::Buffer*, void*)
  10 libANGLE-shared.dylib gl::Framebuffer::readPixels(gl::Context const*, gl::Rectangle const&, unsigned int, unsigned int, gl::PixelPackState const&, gl::Buffer*, void*)
  11 libANGLE-shared.dylib gl::Context::readPixels(int, int, int, int, unsigned int, unsigned int, void*)
  12 libANGLE-shared.dylib gl::ReadnPixelsRobustANGLE(int, int, int, int, unsigned int, unsigned int, int, int*, int*, int*, void*)
  13 WebCore WebCore::GraphicsContextGLOpenGL::readnPixelsImpl(int, int, int, int, unsigned int, unsigned int, int, int*, int*, int*, void*, bool)
  14 WebCore WebCore::GraphicsContextGLOpenGL::readnPixels(int, int, int, int, unsigned int, unsigned int, GCGLSpan<void, 18446744073709551615ul>)
  15 WebCore WebCore::WebGLRenderingContextBase::readPixels(int, int, int, int, unsigned int, unsigned int, JSC::ArrayBufferView&)
  16 WebCore WebCore::jsWebGL2RenderingContextPrototypeFunction_readPixels(JSC::JSGlobalObject*, JSC::CallFrame*)
Comment 3 Kimmo Kinnunen 2022-01-10 05:18:55 PST
Thanks for the detailed analysis!

Yes, readPixels is reading the underlying Metal texture row by row.
The underlying reason is that the general case of readPixels has a lot of edge cases.
The general implementation also trades speed for memory in order to stay simple.

The ANGLE Metal backend does not have any special cases for the simpler situations that could be handled more optimally.

Other ANGLE backends seem to mostly use the buffer approach. For consistency, I think it would make sense for the non-pack-buffer implementation to have performance characteristics similar to the pack buffer implementation.

readPixels is a known slow operation, and the natural recommendation is to process the data with a shader instead. Are you working around some other WebKit issue with readPixels, or is it just that readPixels should really be faster? I.e., when considering what to fix, would you benefit from some other feature working better, or just from readPixels being faster?
Comment 4 Kimmo Kinnunen 2022-01-10 05:41:57 PST
NOTE about the test case measurements:
The test case times are wrong, since the calls might not have completed when the caller-side measurement is made.

Consider this abstraction of the test case:
        var s = MEASURE();
        gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, video);
        var upload = MEASURE() - s;

        s = MEASURE();
        gl.clear(gl.COLOR_BUFFER_BIT);
        gl.drawElements(gl.TRIANGLES, 6, gl.UNSIGNED_SHORT, 0);
        var draw = MEASURE() - s;

        s = MEASURE();
        gl.bindBuffer(gl.PIXEL_PACK_BUFFER, pixelBuffer);
        gl.readPixels(0, 0, 640, 360, gl.RGBA, gl.UNSIGNED_BYTE, 0);
        gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, pixels);
        var pack = MEASURE() - s;

        s = MEASURE();
        gl.readPixels(0, 0, 640, 360, gl.RGBA, gl.UNSIGNED_BYTE, pixelsDirect);
        var direct = MEASURE() - s;


The first blocking call is getBufferSubData, so the "pack" duration measurement might include delay from all of the work submitted prior to that call.

The second measurement, the direct readPixels, contains only the readPixels payload, because it comes after a blocking call.

So a benchmark measuring individual GL sequences needs to call finish() before taking the start timestamp, end in a blocking call, and take the end timestamp after it.

        gl.finish();
        var s = MEASURE();
        gl.op();
        gl.op2();
        gl.readPixels();
        var dt = MEASURE() - s;
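
A concrete version of this pattern might look like the sketch below (the helper name measureBlocking is illustrative, and performance.now() stands in for MEASURE()):

```javascript
// Illustrative sketch of the measurement pattern described above.
function measureBlocking(gl, ops, w, h) {
  const pixels = new Uint8Array(w * h * 4);
  gl.finish();                    // drain previously queued GPU work first,
                                  // so it can't leak into the measurement
  const start = performance.now();
  ops();                          // the GL calls being measured
  // End on a blocking call so the measured ops have actually completed
  // before the end timestamp is taken.
  gl.readPixels(0, 0, w, h, gl.RGBA, gl.UNSIGNED_BYTE, pixels);
  return performance.now() - start;
}
```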
Comment 5 Simon Taylor 2022-01-10 07:30:20 PST
(In reply to Kimmo Kinnunen from comment #3)
> Thanks for the detailed analysis!

You're welcome! I've been getting slightly obsessed by System Trace...

> 
> Yes, the readPixels is reading the underlying Metal texture row-by-row.
> The underlying reason is that the general case of readPixels has a lot of
> edge-cases.
> Also the general implementation has a memory-speed tradeoff characteristic
> for the simple implementation.
> 
> The ANGLE Metal backend does not have any special cases for the simpler
> cases that could be more optimal.

Indeed, I was just taking a look this morning. I noticed the row-by-row readback and the lack of a fast path, unlike the buffer read implementation. There is also a per-row call to convert to the target type, when I guess in the common case this would be a no-op, and tasks like ensureResourceReadyForCPU are called per row too.

> Other ANGLE backends seem to mostly use the buffer approach. I think it
> would make sense for consistency towards the pack buffer implementation that
> the non-pack buffer implementation would have similar perf characteristics.

That makes the most sense to me, and it looks like it would benefit from doing any required type conversions in a shader rather than on the CPU, and from the blit-encoder fast path for the common case.

> The readPixels is a known slow operation and naturally recommendation is
> that the data would be processed with a shader. Are you working around some
> other WebKit issue with readPixels or is it just that readPixels should
> really be faster? E.g. when considering what to fix, would you benefit some
> other feature working better or just that readPixels would be faster?

Thanks for asking!

My day job (when not diving too deep into WebKit code...) involves doing Computer Vision in WebAssembly on video data (generally from getUserMedia). So for our workloads the performance-critical paths are texImage2D from video, and readPixels back into an ArrayBuffer (greyscale conversion happens in a shader). Wasm processing generally occurs on a worker, and the results are then rendered again in WebGL. I've also tested 2D canvas drawImage and getImageData, with greyscale conversion in wasm, but WebGL seems the faster route on iOS. Also, after our processing, users will generally want to render the video frame in WebGL, so having it already uploaded to a texture is helpful.

A nice-to-have would be a quick way to obtain a "snapshot" of the current frame of a video as an object that we can (cheaply) upload to multiple contexts, with a guarantee that the frame is the same in both. Then we could use a separate WebGL context for our greyscale conversion shader, and allow the final content rendering to be completely independent (whilst delaying the texture update there until processing is complete). I had high hopes for createImageBitmap in iOS 15 but found disappointing performance there - see Bug 234920. Having done this investigation, though, the PIXEL_PACK_BUFFER readPixels seems like it may fit the bill there too - especially if we make proper use of fences to avoid blocking on GPU completion before doing the readPixels.

We generally target WebGL1 for iOS, and though I knew PIXEL_PACK_BUFFER was available in WebGL2 I'd never tried it out. As I mentioned in the first comment, my assumption was there wouldn't be much performance benefit when trying to immediately read back the data - I also had "readPixels is slow" and "...because it has to block for GPU completion" in my head, and put all the timing down to that. So it was a big surprise to see a 3x performance improvement even for a blocking read simply by binding a PIXEL_PACK_BUFFER, and to find that the majority of the time spent in the direct readPixels wasn't due to waiting on the GPU at all.

For our code I'm happy to refactor to WebGL2 (and fallback to webgl1 for older iOS), the performance win definitely justifies that work and will benefit current iOS 15. So fixing this isn't that critical to me personally but it seems like a nice win for any other code on the web that uses readPixels directly to an ArrayBuffer.

> The test case times are wrong, since the calls might not be completed when the caller side measurement is made.

As I noted in Comment 1, the getBufferSubData timings will also include accounting for time spent waiting for the GPU to complete, and indeed that can be seen clearly in the trace. The usual goal of a render function would be to avoid any blocking calls at all, but for this test case I was specifically wanting to compare two functionally-equivalent approaches to a blocking readPixels so I was happy enough with that.

One of the most critical resources in our use case is main thread CPU time, so how long the actual JS calls take remains of more interest to me than how long the GPU takes to fully execute the commands.
Comment 6 Simon Taylor 2022-01-12 04:35:18 PST
I've made some updates to the test page.
https://tango-bravo.net/webkit-bug-235002/index.html

I took Kimmo's suggestion to add gl.finish() after submitting drawing commands, and have pulled out a few more individual timing measurements. NB: Anything <1ms should be taken with a pinch of salt as that's the performance.now() granularity.

There's now a checkbox to busy-loop at the end of the rAF callback, which ensures the callback takes 20ms - this prompts the scheduler to keep the content thread on a Performance core with high clocks, and makes the timings more comparable between different options (otherwise more efficient options end up taking longer, as the device can lower clocks whilst maintaining 60 FPS).

There's also a microbenchmark for how many calls to Math.random() can be made in a single performance.now() tick (i.e. 1ms) - the idea is to give a proxy for CPU frequency, and to give additional confidence that the busy loop is successfully triggering consistently high clocks. [iOS 15.2 has a little extra weirdness there, as that number increases a lot when switching away from the tab and back again - from ~11k with the busy loop enabled to ~30k. 14.8.1 gives ~30k from the start. Expect a new bug soon on that...]
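
The clock-speed proxy could be sketched roughly like this (randomCallsPerTick is an illustrative name, not from the test page; on iOS the performance.now() tick is ~1ms, so counts in the thousands are expected there):

```javascript
// Illustrative sketch: count Math.random() calls in one performance.now() tick
// as a rough proxy for CPU clock speed.
function randomCallsPerTick() {
  // Wait for a fresh tick first, so we measure a whole tick, not a partial one.
  const t0 = performance.now();
  let start = performance.now();
  while (start === t0) start = performance.now();

  let count = 0;
  while (performance.now() === start) {
    Math.random();
    count++;
  }
  return count;
}
```

Note the count depends heavily on the environment's timer granularity; where performance.now() has sub-millisecond precision the count will be much smaller.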

With those intros out of the way, on to the main finding: having a texture-backed framebuffer bound makes the direct-to-ArrayBuffer readPixels MUCH faster on the iOS 15.2 Metal backend (from ~4.5ms to <0.5ms). It looks like the row-by-row readPixels implementation is avoided in that case.

The page now also works on browsers that only have WebGL1, to allow investigating iOS 14 performance too - I tried with 14.8.1, which I believe uses ANGLE but with the GL backend. Only direct-to-ArrayBuffer readPixels is supported there. Timings appear to be the other way around - a readPixels from the default framebuffer is a bit faster (1.5x or so) than from a texture-backed framebuffer, but both are much slower than the iOS 15 backend.

In summary, readPixels is absolutely smoking fast on iOS 15, as long as it isn't a direct-to-ArrayBuffer readPixels from the default canvas framebuffer. Nice work everyone!
Comment 7 Simon Taylor 2022-01-12 08:20:32 PST
> There's also a microbenchmark for how many calls to Math.random() can be
> made in a single performance.now() tick (ie 1ms) - the idea is to  gives a
> proxy for CPU frequency, and to give additional confidence the busy loop is
> successful at triggering consistently high clocks. [iOS 15.2 has a little
> extra weirdness there as that number increases a lot when switching away
> from the tab and back again - from ~11k with busy loop enabled to ~30k.
> 14.8.1 gives ~30k from the start. Expect a new bug soon on that...]

Filed Bug 235114 with more details and a specific test case on the weird micro-benchmark results.
Comment 8 Radar WebKit Bug Importer 2022-01-15 00:43:18 PST
<rdar://problem/87634138>
Comment 9 Gregg Tavares 2022-03-30 13:20:09 PDT
FYI: in my experiments, if you render to your own texture (via a framebuffer) you'll get your speed back. Based on the trace above, my guess is that it's the IOSurfaceLock/Unlock causing the slowdown. That's only needed for reading the canvas itself, not for reading your own texture.

Of course we should optimize reading the canvas too, just pointing out a possible workaround.
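
The workaround might be set up along these lines (a sketch; the helper name createReadbackTarget and the parameter choices are illustrative):

```javascript
// Illustrative sketch: attach your own texture to a framebuffer, render into
// that, and readPixels from it instead of the default (canvas) framebuffer.
function createReadbackTarget(gl, w, h) {
  const tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, tex);
  // Allocate RGBA8 storage for the render target.
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, w, h, 0,
                gl.RGBA, gl.UNSIGNED_BYTE, null);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);

  const fbo = gl.createFramebuffer();
  gl.bindFramebuffer(gl.FRAMEBUFFER, fbo);
  gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                          gl.TEXTURE_2D, tex, 0);
  gl.bindFramebuffer(gl.FRAMEBUFFER, null);
  return { tex, fbo };
}

// Usage: bind `fbo`, draw, then call gl.readPixels(...) while it is still
// bound, so the read comes from your own texture rather than the canvas.
```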
Comment 10 Kenneth Russell 2022-03-30 13:39:45 PDT
Related ANGLE-side bug:

Metal: readPixels is ~4x slower than OpenGL
https://bugs.chromium.org/p/angleproject/issues/detail?id=7117
Comment 11 Simon Taylor 2022-03-30 14:39:21 PDT
(In reply to Gregg Tavares from comment #9)
> fyi: In my experiments if you render to your own texture (via a framebuffer)
> you'll get your speed back. Based on the trace above my guess it's the
> IOSurfaceLock/Unlock causing the slowdown. That's only needed for reading
> the canvas itself, not for reading your own texture.
> 
> Of course we should optimize reading the canvas too, just pointing out a
> possible workaround.

Thanks Gregg. I did eventually discover that myself in Comment 6.