Bug 311598

Summary: Limitations on submitted compute WebGPU CommandBuffers on iOS
Product: WebKit
Reporter: reeselevine1
Component: WebGPU
Assignee: Nobody <webkit-unassigned>
Status: NEW
Severity: Normal
CC: mwyrzykowski, tzagallo, webkit-bug-importer
Priority: P2
Keywords: InRadar
Version: Safari 26
Hardware: iPhone / iPad
OS: iOS 26

reeselevine1
Reported 2026-04-06 16:12:44 PDT
I've been working on a WebGPU backend for llama.cpp, and have a demo of it running in the browser (compiled through Emscripten to WASM) here: https://reeselevine.github.io/wllama/ Right now, a forward pass of a model consists of ~400-600 individual compute shaders, which I batch into groups of 16-32 and submit to the queue. On most systems supporting WebGPU (including some Android devices), I then wait on queue submission once at the end of the forward pass.

However, on iOS, this breaks. From what I've been able to tell through debugging, I don't see device lost or any other surfaced errors; instead, the wait on the queue gets stuck, repeatedly timing out. I've also tried attaching remotely to the Safari Web Inspector, but doing so actually stops the issue from occurring, presumably due to overhead introduced by the inspector. To work around this, I introduced throttling for Safari on iOS, where I wait after every CommandBuffer submission. From what I've seen, even having two CommandBuffers in flight is enough to trigger the issue.

So, my question is whether this is a known/expected consequence of submitting multiple CommandBuffers in WebGPU (in which case I just have to be more careful about batching in order to fully pipeline encoding/submission), or whether this is a bug in WebKit's WebGPU implementation. Right now, the deployed version on my website includes the throttling, but if it's useful for debugging, I can deploy a version without the throttling or include more instructions on how to build/deploy your own version of the code (https://github.com/reeselevine/wllama).

Also, for what it's worth, we were testing on an iOS device running 26.3.1, and even with the throttling we were still experiencing issues; with the update to 26.4 it seems to be more stable. So maybe this issue is part of some other long-running updates to WebKit that haven't made it to shipped OS versions yet?
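The batching and iOS throttling described above can be sketched roughly as follows. This is a minimal sketch, not the actual wllama code: `makeBatches`, `runForwardPass`, and the pass-encoding callbacks are hypothetical names.

```javascript
// Hypothetical sketch of the batching/throttling described above; names
// are illustrative, not taken from the wllama source.
// Splitting a list of pass-encoding callbacks into batches is plain JS:
function makeBatches(items, batchSize) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Each batch becomes one command buffer. The iOS workaround is to await
// completion after every submission rather than once at the end:
async function runForwardPass(device, encodeFns, batchSize, throttle) {
  for (const batch of makeBatches(encodeFns, batchSize)) {
    const encoder = device.createCommandEncoder();
    for (const encodePass of batch) encodePass(encoder); // records one compute pass each
    device.queue.submit([encoder.finish()]);
    if (throttle) await device.queue.onSubmittedWorkDone(); // iOS throttling workaround
  }
  await device.queue.onSubmittedWorkDone(); // final GPU/CPU sync point
}
```

Without the `throttle` flag, all command buffers are in flight at once and only the final `onSubmittedWorkDone()` blocks, which is the configuration that hangs on iOS.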
Radar WebKit Bug Importer
Comment 1 2026-04-08 08:50:42 PDT
Mike Wyrzykowski
Comment 2 2026-04-08 14:46:07 PDT
Recommended practice is 1 command buffer, or at the very least, each submitted command buffer should take several milliseconds of work. We don't coalesce command buffers, so this might be creating GPU resource exhaustion by submitting too many command buffers.
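Following this recommendation, a batch of passes would be recorded into a single command buffer rather than one buffer per small group. A minimal sketch, assuming a list of dispatch descriptions; `graphOps` and `submitAsOneCommandBuffer` are hypothetical names, not from the project's code:

```javascript
// Record every compute pass for the batch into ONE encoder, then submit a
// single command buffer, instead of one command buffer per small group.
// (graphOps is a hypothetical array of { pipeline, bindGroup, x, y, z }.)
function submitAsOneCommandBuffer(device, graphOps) {
  const encoder = device.createCommandEncoder();
  for (const op of graphOps) {
    const pass = encoder.beginComputePass();
    pass.setPipeline(op.pipeline);
    pass.setBindGroup(0, op.bindGroup);
    pass.dispatchWorkgroups(op.x, op.y ?? 1, op.z ?? 1);
    pass.end();
  }
  device.queue.submit([encoder.finish()]); // one submit for the whole batch
}
```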
Mike Wyrzykowski
Comment 3 2026-04-08 14:47:07 PDT
Can you please clarify what
> ~400-600 individual compute shaders
means? Is there a reason so many command buffers are being submitted? Normally it's only necessary to split when you have a GPU/CPU sync point.
reeselevine1
Comment 4 2026-04-12 21:42:52 PDT
Thanks for the update, Mike. I should have been more specific: by "compute shaders" I really meant 400-600 compute passes before we need to synchronize the GPU/CPU. I did play around with the batch size, and that does help tune the performance on iOS while maintaining stability, as long as I keep the number of command buffers submitted at a time to one.

I guess I was hoping that I could batch on the application side at some reasonable size, and then let the WebGPU queue handle dispatching them without blocking on each one. But it sounds like on iOS, at least, it'll be best to find a batch size that allows good CPU encode -> GPU dispatch pipelining (which I realize may differ even within iOS devices), and block on each submission.

Otherwise, I was struggling a little to even diagnose this issue, since it wasn't reproducing with Safari Web Inspector attached, and I wasn't seeing any other messages surfaced when I tried logging device errors or uncaptured errors. Overall, though, I think we can work with this going forward, so I'm good with this issue being closed!
Mike Wyrzykowski
Comment 5 2026-04-13 08:11:00 PDT
Is it possible to combine compute passes? That number of passes might be creating unnecessary overhead, and WebKit encodes immediately, so we don't coalesce passes either. Switching compute pipeline state in the middle of a pass would be preferable to ending the pass and starting a new one.
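The pipeline-switching suggestion could look roughly like this. A sketch only, assuming the dispatches can share one pass; `graphOps` and `encodeAsSinglePass` are hypothetical names:

```javascript
// One compute pass for the whole batch: switch pipelines between dispatches
// rather than ending the pass and beginning a new one each time.
function encodeAsSinglePass(encoder, graphOps) {
  const pass = encoder.beginComputePass();
  for (const op of graphOps) {
    pass.setPipeline(op.pipeline);        // pipeline state change within the pass
    pass.setBindGroup(0, op.bindGroup);
    pass.dispatchWorkgroups(op.x, op.y ?? 1, op.z ?? 1);
  }
  pass.end(); // the pass ends once, after all dispatches
}
```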
reeselevine1
Comment 6 2026-04-15 09:29:43 PDT
Yep, it is possible. I tried it out, and it does seem much more stable (combining 64 compute passes per command buffer). With this change it also seems like having multiple command buffers submitted is stable too (even though I know it's recommended to still only have 1).

The one caveat is that since timestamp queries only work at compute pass boundaries, with this change I can no longer get a profiling breakdown of individual shaders/pipelines. So I currently include an option to disable the compute pass batching so I can get profiling information on other systems. From my testing, on systems besides iOS there isn't any significant difference in performance or stability between combining compute passes and not combining them.
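The profiling caveat follows from `timestampWrites` living on the compute pass descriptor, so per-shader timing needs one pass per dispatch. A sketch of the described toggle; `encodeOps`, `graphOps`, and the `profile` flag are hypothetical names, and it assumes a `querySet` of type "timestamp" and a device with the "timestamp-query" feature:

```javascript
// Toggle between merged passes (stable on iOS, no per-shader timing) and
// one pass per dispatch (timestamps bracket each shader at pass boundaries).
function encodeOps(encoder, graphOps, querySet, profile) {
  if (profile) {
    graphOps.forEach((op, i) => {
      const pass = encoder.beginComputePass({
        timestampWrites: {
          querySet,
          beginningOfPassWriteIndex: 2 * i,   // timestamp before this shader
          endOfPassWriteIndex: 2 * i + 1,     // timestamp after this shader
        },
      });
      pass.setPipeline(op.pipeline);
      pass.setBindGroup(0, op.bindGroup);
      pass.dispatchWorkgroups(op.x);
      pass.end();
    });
  } else {
    const pass = encoder.beginComputePass(); // one merged pass, no per-op timestamps
    for (const op of graphOps) {
      pass.setPipeline(op.pipeline);
      pass.setBindGroup(0, op.bindGroup);
      pass.dispatchWorkgroups(op.x);
    }
    pass.end();
  }
}
```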