Bug 234920 - ImageBitmap created from a video element has poor performance
Summary: ImageBitmap created from a video element has poor performance
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebGL
Version: Safari 15
Hardware: iPhone / iPad iOS 15
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords: InRadar
Depends on: 235043 235044
Blocks:
Reported: 2022-01-06 06:33 PST by Simon Taylor
Modified: 2022-05-05 08:44 PDT

See Also:


Attachments
System Trace for texImage2d(imageBitmap) (1.21 MB, image/png)
2022-01-06 07:07 PST, Simon Taylor

Description Simon Taylor 2022-01-06 06:33:54 PST
My goal is to "snapshot" the current frame of a video element into an object that can then be efficiently used in other canvas contexts (multiple WebGL contexts for example).

ImageBitmap feels like the right API to be leveraging for this - it seems the intention is that ImageBitmap is the way to pass around the content of canvases and consume them cross-context (at least the use of it for OffscreenCanvas suggests that to me).

Consider a frame loop something like this:

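function drawFrame() {
  if (latestFrame) {
    // Upload the same snapshot into a texture on each WebGL context.
    gl1.texImage2D(..., latestFrame);
    gl2.texImage2D(..., latestFrame);
    latestFrame.close();
    latestFrame = null;
  }

  // Request a new snapshot; it is consumed by a later drawFrame call.
  createImageBitmap(video).then(ib => {
    if (latestFrame) latestFrame.close();
    latestFrame = ib;
  });
}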

Unfortunately, the iOS ImageBitmap implementation makes this route much slower than a direct texImage2D of the video element.

I've also tried using drawImage in a 2d canvas to obtain the "snapshot" and then texImage2D(..., canvas2d) to consume it in WebGL, but that's also slow. It's probably best to limit this bug to ImageBitmap, as it does feel like the correct API for this use-case, and getting it performant will also be key to having OffscreenCanvas work well.
Comment 1 Simon Taylor 2022-01-06 06:55:07 PST
Here's a test case:
https://tango-bravo.net/webkit-bug-234920/index.html

Using an iPhone 12 Pro on iOS 15.2, the two upload calls combined keep the main JS thread busy for around 25ms, and the overall render loop runs at around 40 fps.

This test is similar to my test for Bug 203148 that uses texImage2d with the video element directly.

Looking in Instruments, using the video directly is pretty performant - 60 FPS is easily maintained, and everything's running on the efficiency cores (also not fully loaded so probably not at max clocks). This ImageBitmap one at 40 FPS has also spun up the performance cores and is keeping them pretty much fully loaded, so I'm guessing clocks are likely high too. In short, there seems to be a very significant performance difference between the methods.
Comment 2 Simon Taylor 2022-01-06 07:07:12 PST
Created attachment 448498
System Trace for texImage2d(imageBitmap)

System trace shows the content process blocked for 14ms for each upload call, in RemoteRenderingBackendProxy::getShareableBitmap. It looks to have the same cost each time, even when the ImageBitmap hasn't been changed (just from the naming I would have expected the "shareable bitmap" to be pretty quick to share if it hasn't changed...).

On the GPU process, the stack looks to be GraphicsContextCG::drawNativeImage inside getShareableBitmapForImageBuffer where all the time is taken. There's also a huge number of virtual memory zero-fill operations during that time (perhaps one per scanline?)
Comment 3 Simon Taylor 2022-01-06 07:15:30 PST
My hope as a mere web developer is that createImageBitmap will consume minimal time in the JS main thread, kick off any conversion necessary to effectively bake the source into an RGBA texture / IOSurface / whatever, and resolve the promise when it has finished that. The resulting ImageBitmap should then be able to be efficiently consumed anywhere - WebGL context, drawImage in a 2d context, or for direct display in a "bitmaprenderer" context.
Comment 4 Kenneth Russell 2022-01-07 18:07:57 PST
It looks like the performance is acceptable on macOS; is that correct?

I wonder what the essential difference is between macOS and iOS in ImageBitmap handling.
Comment 5 Kimmo Kinnunen 2022-01-10 07:33:14 PST
ImageBitmap seems the correct abstraction for this use-case, but the use-case does not seem to make sense to me just based on this explanation. 

Currently in WebKit ImageBitmap is not implemented to be an optimisation across multiple Context2D and WebGL elements.

Technically, currently, each texImage2D causes the ImageBitmap to be copied multiple times and then sent across the IPC. This is mainly due to Context2D and ImageBitmap existing in GPU process while WebGL is in WebContent process. Additionally "texImage2D" doesn't establish a handle for the converted temporary object that would naturally be simple to use to persist the temporary (converted and shared bitmap data).

It does sound like quite a lot of the problems come from the fact that you want to support an arbitrary number of WebGL contexts. This is probably not very good, as it uses excessive amounts of memory. E.g. in your example, you are uploading the same video to multiple textures for no real reason other than WebGL restrictions.

Most use-cases can be organised around having one global WebGL and mostly use the WebGL frame buffer for displaying the drawing. For sharing images of WebGL rendering to multiple viewports, currently the best option is to draw the image to the frame buffer and then draw the webgl element to the multiple Context2Ds. It will not perform terribly well.

>Goal... the current frame of a video element into an object that can then be efficiently used in other canvas contexts (multiple WebGL contexts for example)

Everything of course should be optimal but in practice not everything will be made optimal. It would be useful to understand what is the source of the requirement. 

Without more info, I'd imagine the above is not a source requirement? It is counterintuitive that you'd want to upload, for example, a 4K bitmap 2, 3 or 87 times to different textures efficiently. The textures would just duplicate the bitmap pointlessly.

To me it sounds that probably what you want in the end is a "shared WebGL context" feature that can show webgl images from one context in multiple places? I.e. a feature where the video can be uploaded once to a texture but be used in rendering pictures of multiple canvases. I.e. 1 context, n canvases.
Comment 6 Simon Taylor 2022-01-10 09:29:31 PST
(In reply to Kenneth Russell from comment #4)
> It looks like the performance is acceptable on macOS; is that correct?

I had only really tested on iOS as we primarily target mobile platforms.

macOS looks pretty slow too: my 2017 Intel MBP gives around 50 FPS in the test case with 20ms of "upload" time per frame, while my new M1 Pro MBP maintains 60 FPS with around 10ms total upload time. Significant actual CPU work was observed on all platforms.

(In reply to Kimmo Kinnunen from comment #5)
> ImageBitmap seems the correct abstraction for this use-case, but the
> use-case does not seem to make sense to me just based on this explanation.
> [...]
> Everything of course should be optimal but in practice not everything will
> be made optimal. It would be useful to understand what is the source of the
> requirement. 

In more detail, the actual use case is wanting to do some image processing on video frames (in WebAssembly, after a read back to CPU), and later rendering that processed frame along with data from that processing. Effectively it's a pipeline where we want to kick off processing on a new frame but keep the old one around for rendering.

Right now we do use a single shared WebGL context and a pool of textures. That way we can upload the current frame to a fresh texture from the pool, process it with our shaders / wasm, then flip to using the new texture in the rendering when processing is finished. That allows the renderer to keep hold of the previously processed frame until processing is ready on the new one. Bug 203148 is a bit of a problem in that case; right now we just ensure sufficient time has passed that we expect there to be a new frame available and hope for the best.
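
To make the texture pool idea concrete, here is a minimal sketch, assuming a single shared context `gl` and a video element as the upload source. The `TexturePool` class and its method names are hypothetical, invented for illustration; they are not taken from our actual code.

```javascript
// Hypothetical helper: a fixed-size pool of textures on one shared WebGL context.
// Filtering and wrap parameters are omitted to keep the sketch short.
class TexturePool {
  constructor(gl, size) {
    this.gl = gl;
    this.free = [];
    for (let i = 0; i < size; i++) this.free.push(gl.createTexture());
  }

  // Take a free texture and upload the current frame of `source` into it.
  // Returns null when every texture is still in use.
  acquireWithFrame(source) {
    const gl = this.gl;
    const tex = this.free.pop();
    if (!tex) return null;
    gl.bindTexture(gl.TEXTURE_2D, tex);
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, source);
    return tex;
  }

  // Hand a texture back once the renderer has moved on to a newer frame.
  release(tex) {
    this.free.push(tex);
  }
}
```

The processing step would call `acquireWithFrame(video)`, and once its results are ready the renderer would switch to that texture and `release` the one it was showing before.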

I don't have a need for "arbitrary" numbers of contexts, but using a separate one for the processing would have some advantages:

- Our code uses the WebGL API directly, but our users want to make use of WebGL engines (Three, Babylon, PlayCanvas etc) for their rendering. Engines often maintain a cache of underlying WebGL state to avoid un-necessarily resetting bits that are unchanged. They don't always provide public APIs to invalidate or query those caches, so integrating other WebGL code into a shared context is not really well-supported. One possible solution would be using "gl.get*" to query the WebGL state we might alter before resetting after our code, but that has performance implications on some implementations. The other would be to wrap our low-level code into "Program" abstractions for each engine, which is a lot of work and maintenance burden for a small team.

- On browsers that support OffscreenCanvas, our processing context can run on a worker. The frame is only needed on one context at a time, so can be transferred to the worker and back again for the renderer. Converting video -> ImageBitmap can only happen on the main thread, hence this seems the correct "intermediate" representation.

- With MediaStreamTrackProcessor + OffscreenCanvas, the video frames can be delivered to the worker directly.
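
As an illustration of the "gl.get*" option from the first point above, a minimal sketch might look like the following. The helper name `withSavedState` is hypothetical, and it only covers four bindings; real code sharing a context with an engine would have to save every piece of state it touches. The synchronous `getParameter` queries are where the performance cost on some implementations comes from.

```javascript
// Hypothetical helper: run `fn`, then restore a small set of WebGL bindings
// so that an engine's cached view of the context state stays valid.
function withSavedState(gl, fn) {
  // Each getParameter call is a synchronous query of the context state.
  const program = gl.getParameter(gl.CURRENT_PROGRAM);
  const texture = gl.getParameter(gl.TEXTURE_BINDING_2D); // active unit only
  const framebuffer = gl.getParameter(gl.FRAMEBUFFER_BINDING);
  const arrayBuffer = gl.getParameter(gl.ARRAY_BUFFER_BINDING);
  try {
    return fn();
  } finally {
    gl.useProgram(program);
    gl.bindTexture(gl.TEXTURE_2D, texture);
    gl.bindFramebuffer(gl.FRAMEBUFFER, framebuffer);
    gl.bindBuffer(gl.ARRAY_BUFFER, arrayBuffer);
  }
}
```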

> Currently in WebKit ImageBitmap is not implemented to be an optimisation
> across multiple Context2D and WebGL elements.

Without wishing to sound rude - what's the intention of the current implementation then? Is it more about supporting the various color space conversion options and less about performance?

I guess it's natural to see APIs how you want them to be, but for me it feels the intention of ImageBitmap is to keep hold of potentially-large, uncompressed images so they can be easily consumed in various places. It's up to the developer not to over-use them and to close() when finished, and in return they should be quick to consume. The availability of a "bitmaprenderer" context for canvas and transferToImageBitmap() for OffscreenCanvas both indicate efficient transfers are one of the main use cases.

For me createImageBitmap means "please do any prep work / decoding / etc to get this source ready for efficient use in other web APIs - and off the main thread please, just let me know when it's ready".

I've recently discovered WebGL2 readPixels to a PIXEL_PACK_BUFFER and then getBufferSubData (ideally after a sync object has been signalled to avoid blocking the GPU) has a really nice and efficient implementation in current versions of Safari (great work guys!).

That effectively gives an RGBA ArrayBuffer that can be uploaded pretty efficiently to other WebGL contexts with texImage2D, and I guess is pretty efficient in Canvas2D contexts with putImageData too. It can also be transferred efficiently to / from workers, so basically fulfils most of the hopes I had for ImageBitmap.
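
For reference, the readback described above can be sketched roughly as follows, assuming a WebGL2 context `gl` whose current read framebuffer holds the frame. The function names `startReadback` and `tryFinishReadback` are hypothetical; the WebGL2 calls themselves are standard.

```javascript
// Start an asynchronous readback of the current read framebuffer (WebGL2 only).
function startReadback(gl, width, height) {
  const byteLength = width * height * 4;
  const pbo = gl.createBuffer();
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, pbo);
  gl.bufferData(gl.PIXEL_PACK_BUFFER, byteLength, gl.STREAM_READ);
  // With a PIXEL_PACK_BUFFER bound, the last argument is a byte offset into it.
  gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, 0);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
  // The fence signals once the GPU has finished the commands issued so far.
  const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
  gl.flush();
  return { pbo, sync, byteLength };
}

// Poll once without blocking; returns a Uint8Array of RGBA bytes when the
// fence has signalled, or null if the GPU has not finished yet.
function tryFinishReadback(gl, job) {
  const status = gl.clientWaitSync(job.sync, 0, 0);
  if (status === gl.WAIT_FAILED) throw new Error("clientWaitSync failed");
  if (status !== gl.ALREADY_SIGNALED && status !== gl.CONDITION_SATISFIED) return null;
  gl.deleteSync(job.sync);
  const pixels = new Uint8Array(job.byteLength);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, job.pbo);
  gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, pixels);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
  gl.deleteBuffer(job.pbo);
  return pixels;
}
```

One way to use it is to call `tryFinishReadback` once per animation frame until it returns the pixels, so the main thread never waits on the GPU.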

In browsers with GPU-based Canvas 2D it makes sense ImageBitmap would map to some sort of GPU texture handle. As Safari uses CPU-based Canvas 2D, a CPU-side blob of pixels seems reasonable too. Right now it seems consuming an ImageBitmap is significantly more costly than a JS-side ArrayBuffer of pixels, which felt pretty unexpected to me.
Comment 7 Kimmo Kinnunen 2022-01-10 12:26:44 PST
(In reply to Simon Taylor from comment #6)
> - Our code uses the WebGL API directly, but our users want to make use of
> WebGL engines (Three, Babylon, PlayCanvas etc) for their rendering. Engines
> often maintain a cache of underlying WebGL state to avoid un-necessarily
> resetting bits that are unchanged.

But this doesn't explain why you want to snapshot video to ImageBitmap and then use the ImageBitmap in two different contexts.

Currently you can snapshot the video to your processing context via just texImage2D.

 
> - On browsers that support OffscreenCanvas, our processing context can run
> on a worker. The frame is only needed on one context at a time, so can be
> transferred to the worker and back again for the renderer. Converting video
> -> ImageBitmap can only happen on the main thread, hence this seems the
> correct "intermediate" representation.

IIRC WebKit does not currently support OffscreenCanvas, so strictly speaking, converting video to ImageBitmap because you want to send the ImageBitmap to an offscreen canvas is not that valid a reason?
 
> > Currently in WebKit ImageBitmap is not implemented to be an optimisation
> > across multiple Context2D and WebGL elements.
> 
> Without wishing to sound rude - what's the intention of the current
> implementation then?

I think the main use-case is to convert a blob to an image to be drawn in WebGL or Context2D?
As in, it's not feasible to convert a blob to an image otherwise.
As in, it is possible to convert a video element to a texture otherwise by just directly via texImage2D.

There's of course a notion that a general concept like ImageBitmap should work consistently with different objects that serve similar purposes. However, as explained the implementations are not perfect until they're made perfect. If some implementation is made perfect it most likely means that other implementation somewhere else remains imperfect.

> I guess it's natural to see APIs how you want them to be, but for me it
> feels the intention of ImageBitmap is to keep hold of potentially-large,
> uncompressed images so they can be easily consumed in various places.

Sure, in abstract it can be that. Currently WebKit is not there, though.

And since we are not there, I'm trying to understand what is the use-case, e.g. is getting there the only way to solve the use-case.

> For me createImageBitmap means "please do any prep work / decoding / etc to
> get this source ready for efficient use in other web APIs - and off the main
> thread please, just let me know when it's ready".

Right. But from WebGL perspective that's what video element is -- for the simple case the prep work is already "done" and it's efficient to use already.

From WebGL perspective you can upload the same video element 1,2 or 77 times in different contexts and textures and it's going to be observably as fast as it ever is going to be..

> Right now it seems consuming
> an ImageBitmap is significantly more costly than a JS-side ArrayBuffer of
> pixels, which felt pretty unexpected to me.

Yes, due to various reasons, mostly that not all components are in GPU Process, ImageBitmap is not equivalent to a buffer that could be mapped to various GPU-based implementations (Context2D or WebGL). We're working on this part.

However, to prioritise the work, it would still be useful to understand if a zero-overhead ImageBitmap is a must-have or a nice-to-have for implementing the feature.

I still do not understand this:
1) Uploading a video to a WebGL texture is fairly fast. Can it be used, or can it not be used?
2) In which concrete webby use-cases is it useful to have a handle to an ImageBitmap and use this handle twice?
3) In which concrete webby use-cases is it useful to have a handle to an ImageBitmap and use it in different WebGL contexts?
4) In which concrete webby use-cases is it useful to have a handle to an ImageBitmap and use it in both a Context2D and a WebGL context?

Not listing these as such that you'd need to answer these all, but these are just the questions I try to use to understand the prioritisation across all the things needing fixing.
Comment 8 Simon Taylor 2022-01-11 04:58:37 PST
(In reply to Kimmo Kinnunen from comment #7)
> (In reply to Simon Taylor from comment #6)
> > - Our code uses the WebGL API directly, but our users want to make use of
> > WebGL engines (Three, Babylon, PlayCanvas etc) for their rendering. Engines
> > often maintain a cache of underlying WebGL state to avoid un-necessarily
> > resetting bits that are unchanged.
> 
> But this doesn't explain why you want to snapshot video to ImageBitmap and
> then use the ImageBitmap in two different contexts.
> 
> Currently you can snapshot the video to your processing context via just
> texImage2D.

Yes. But then later once processing is complete we want to render that frame in a rendering context and want to guarantee it's the exact same frame that has been processed so the processed results are in sync.

One straightforward way as you suggest is to use a single WebGL context for both processing and rendering, and a texture pool to allow a new frame to be processed whilst still keeping an older frame around for rendering. That's what we do now and it is of course a strategy that works.

The motivation for wanting to split into two contexts on Safari right now is really all around encapsulation and ease-of-integration with third party engines.

Imagine a library like TensorFlow that might want to implement a WebGL backend to speed up some ML inference operation - let's say something like human pose estimation. A user then wants to run that inference on video frames, and then use Three.js to render a virtual skeleton on top of the most recent frame that has results available, so the video frame and results appear perfectly synchronised.

Three.js abstracts away the underlying WebGL context and internally caches the state. There are no public APIs to allow accessing underlying WebGL objects or inform Three that its state cache may be outdated. 

If the user wants to use a single WebGL context that's used for both TensorFlow's backend and the Three.js rendering, the only really supported way to do that with Three.js would be for TensorFlow to write all its WebGL code against the Three.js abstractions for shaders / programs / renderbuffers etc so that Three.js can then remain solely responsible for the overall context state. Of course that wouldn't help a user who wants to write a page that uses TensorFlow but renders with Babylon. Nor does a dependency on a specific engine really make sense for a library project like TensorFlow.

Hopefully that helps to explain the justification for using a separate context for processing, even on current Safari where OffscreenCanvas doesn't exist.

Using a separate context is easy, the only real requirement is a primitive to allow quickly getting the same video frame as a texture in both contexts.

Just doing a separate texImage2D on the rendering context from the same video element doesn't guarantee it will be the same frame (the video is playing, and might have a new frame by the second texImage2D call).

ImageBitmap seemed the right primitive, but isn't suitably performant with the current implementation.

It does look like readPixels to a PIXEL_PACK_BUFFER and then into an ArrayBuffer might fit the bill with iOS 15 though. With that approach the processing context would texImage2D from the video and readPixels to copy it back to JS (we need it there for processing anyway, although not necessarily as full RGBA). The rendering context would texImage2D from the RGBA ArrayBuffer.
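
The rendering-context half of that approach could look something like this minimal sketch. It assumes `pixels` is a Uint8Array of tightly packed RGBA bytes that the processing context read back earlier; the helper name `uploadRgbaFrame` is hypothetical.

```javascript
// Hypothetical helper: upload tightly packed RGBA bytes into `texture` on `gl`.
function uploadRgbaFrame(gl, texture, pixels, width, height) {
  if (pixels.byteLength !== width * height * 4) {
    throw new Error("pixel buffer does not match width * height * 4");
  }
  gl.bindTexture(gl.TEXTURE_2D, texture);
  // Rows of 4-byte pixels always satisfy the default UNPACK_ALIGNMENT of 4.
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, width, height, 0,
                gl.RGBA, gl.UNSIGNED_BYTE, pixels);
  // No mipmaps are generated, so use a non-mipmap minification filter.
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.LINEAR);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
}
```

Because both contexts consume the same bytes, the renderer is guaranteed to show exactly the frame that was processed.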

There's no strict need for ImageBitmap.

> > > Currently in WebKit ImageBitmap is not implemented to be an optimisation
> > > across multiple Context2D and WebGL elements.
> > 
> > Without wishing to sound rude - what's the intention of the current
> > implementation then?
> 
> I think the main use-case is to convert a blob to an image to be drawn in
> WebGL or Context2D?
> As in, it's not feasible to convert a blob to an image otherwise.
> As in, it is possible to convert a video element to a texture otherwise by
> just directly via texImage2D.

From a blob it's possible to do something like:

var im = document.createElement('img');
im.src = URL.createObjectURL(blob);
im.onload = () => {gl.texImage2D(..., im)};

createImageBitmap is definitely a cleaner API and offers more options (like specifying a smaller resolution up-front).
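
For comparison, a minimal sketch of the createImageBitmap route with a reduced size requested up-front. The helper name `blobToTexture` is hypothetical. `resizeWidth` and `resizeHeight` are standard ImageBitmapOptions, but support for the options argument varies between browsers, so treat this as an assumption rather than something verified on Safari.

```javascript
// Hypothetical helper: decode `blob` at a reduced size and upload it to a new texture.
// Passing both resizeWidth and resizeHeight does not preserve the aspect ratio.
function blobToTexture(gl, blob, width, height) {
  return createImageBitmap(blob, { resizeWidth: width, resizeHeight: height })
    .then(bitmap => {
      const texture = gl.createTexture();
      gl.bindTexture(gl.TEXTURE_2D, texture);
      gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, bitmap);
      bitmap.close(); // release the decoded pixels once they are uploaded
      return texture;
    });
}
```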

> > I guess it's natural to see APIs how you want them to be, but for me it
> > feels the intention of ImageBitmap is to keep hold of potentially-large,
> > uncompressed images so they can be easily consumed in various places.
> 
> Sure, in abstract it can be that. Currently WebKit is not there, though.
> 
> And since we are not there, I'm trying to understand what is the use-case,
> e.g. is getting there the only way to solve the use-case.

Good to know, thanks. Hopefully I've justified why an efficient way to transfer images between different canvas contexts is a useful primitive to have.

That primitive doesn't need to be ImageBitmap, it's just what I thought was the main purpose for it.

On Safari I'm actually pretty happy that PIXEL_PACK_BUFFER readPixels can solve the need.

> > For me createImageBitmap means "please do any prep work / decoding / etc to
> > get this source ready for efficient use in other web APIs - and off the main
> > thread please, just let me know when it's ready".
> 
> Right. But from WebGL perspective that's what video element is -- for the
> simple case the prep work is already "done" and it's efficient to use
> already.
> 
> From WebGL perspective you can upload the same video element 1,2 or 77 times
> in different contexts and textures and it's going to be observably as fast
> as it ever is going to be..

Performance of direct texImage2D(video) is pretty good as you say. The main issue of just using that with separate processing and rendering contexts is that the contexts may end up with different frames (the last one uploaded might be a later frame).

There is also some "conversion" work that goes on, and blocks the JS thread, in every texImage2D call. It is acceptably performant (< 2ms when on a Performance core at high clocks) so it's not a major concern. However on an efficiency core it can be over 5ms - if that work happened off the main thread before createImageBitmap resolved the promise, it's more likely the rest of the WebGL workload could fit in the main thread without needing to move to a Performance core.

So the hope for me with createImageBitmap was twofold - that Metal conversion into an RGBA texture would happen without blocking the main thread, and the resulting ImageBitmap would be so quick to consume in WebGL that there'd be effectively no CPU overhead in splitting the processing into a dedicated context.

There's no spec requirement for that of course so it's still just a nice-to-have wishlist thing.

> Yes, due to various reasons, mostly that not all components are in GPU
> Process, ImageBitmap is not equivalent to a buffer that could be mapped to
> various GPU-based implementations (Context2D or WebGL). We're working on
> this part.

That's great to hear!

> However, to prioritise the work, it would still be useful to
> understand if a zero-overhead ImageBitmap is something that is a must for
> implementing the feature or a nice to have for implementing the feature.

It's definitely not a requirement, and actually readPixels with PIXEL_PACK_BUFFER is a pretty good fit for us anyway. This one is not really a major concern, I just wanted to flag that the performance of the current ImageBitmap implementation didn't really match my expectations for the API.

In terms of my personal priorities, the rAF weirdness is the biggest issue I have with current iOS Safari (Bug 234923). Not WebGL-related but the catastrophic iOS 15 performance regression in <video> playback when data is in a blob or data URI (Bug 232076) has also caused us some headaches.
Comment 9 Sam Sneddon [:gsnedders] 2022-05-05 08:44:32 PDT
rdar://92797516

(Sorry for the earlier spam, something's gone wrong somewhere.)