Bug 311598

Summary: Limitations on submitted compute WebGPU CommandBuffers on iOS
Product: WebKit
Reporter: reeselevine1
Component: WebGPU
Assignee: Nobody <webkit-unassigned>
Status: NEW
Severity: Normal
CC: mwyrzykowski, tzagallo, webkit-bug-importer
Priority: P2
Keywords: InRadar
Version: Safari 26
Hardware: iPhone / iPad
OS: iOS 26

reeselevine1
Reported 2026-04-06 16:12:44 PDT
I've been working on a WebGPU backend for llama.cpp, and have a demo of it running in the browser (compiled through Emscripten to WASM) here: https://reeselevine.github.io/wllama/ Right now, a forward pass of a model consists of ~400-600 individual compute shaders, which I batch into groups of 16-32 and submit to the queue. On most systems supporting WebGPU (including some Android devices), I then wait on queue submission once at the end of the forward pass.

However, on iOS, this breaks. From what I've been able to tell through debugging, I don't see device lost or any other surfaced errors; instead, the wait on the queue gets stuck, repeatedly timing out. I've also tried attaching remotely to the Safari Web Inspector, but doing so actually stops the issue from occurring, presumably due to overhead introduced by the inspector. To work around this, I introduced throttling for Safari on iOS, where I wait after every CommandBuffer submission. From what I've seen, even having two CommandBuffers in flight is enough to trigger the issue.

So, my question is whether this is a known/expected consequence of submitting multiple CommandBuffers in WebGPU (in which case I just have to be more careful about batching in order to fully pipeline encoding/submission), or whether this is a bug in WebKit's WebGPU implementation. Right now, the deployed version on my website includes the throttling, but if it's useful for debugging, I can deploy a version without the throttling or include more instructions on how to build/deploy your own version of the code (https://github.com/reeselevine/wllama).

Also, for what it's worth, we were testing on an iOS device running 26.3.1, and even with the throttling we were still experiencing issues; with the update to 26.4 it seems to be more stable. So maybe this issue is part of some other long-running updates to WebKit that haven't made it to shipped OS versions yet?
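The batching and iOS throttling described above can be sketched roughly as follows. This is a minimal sketch, not the actual wllama code: `makeBatches`, `runForwardPass`, and the pass-encoding callbacks are hypothetical names.

```javascript
// Hypothetical sketch of the batching/throttling described above; names
// are illustrative, not taken from the wllama source.
// Splitting a list of pass-encoding callbacks into batches is plain JS:
function makeBatches(items, batchSize) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Each batch becomes one command buffer. The iOS workaround is to await
// completion after every submission rather than once at the end:
async function runForwardPass(device, encodeFns, batchSize, throttle) {
  for (const batch of makeBatches(encodeFns, batchSize)) {
    const encoder = device.createCommandEncoder();
    for (const encodePass of batch) encodePass(encoder); // records one compute pass each
    device.queue.submit([encoder.finish()]);
    if (throttle) await device.queue.onSubmittedWorkDone(); // iOS throttling workaround
  }
  await device.queue.onSubmittedWorkDone(); // final GPU/CPU sync point
}
```

Without the `throttle` flag, all command buffers are in flight at once and only the final `onSubmittedWorkDone()` blocks, which is the configuration that hangs on iOS.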
Radar WebKit Bug Importer
Comment 1 2026-04-08 08:50:42 PDT
Mike Wyrzykowski
Comment 2 2026-04-08 14:46:07 PDT
Recommended practice is 1 command buffer, or at the very least, each submitted command buffer should take several milliseconds of work. We don't coalesce command buffers, so this might be creating GPU resource exhaustion by submitting too many command buffers.
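Following this recommendation, a batch of passes would be recorded into a single command buffer rather than one buffer per small group. A minimal sketch, assuming a list of dispatch descriptions; `graphOps` and `submitAsOneCommandBuffer` are hypothetical names, not from the project's code:

```javascript
// Record every compute pass for the batch into ONE encoder, then submit a
// single command buffer, instead of one command buffer per small group.
// (graphOps is a hypothetical array of { pipeline, bindGroup, x, y, z }.)
function submitAsOneCommandBuffer(device, graphOps) {
  const encoder = device.createCommandEncoder();
  for (const op of graphOps) {
    const pass = encoder.beginComputePass();
    pass.setPipeline(op.pipeline);
    pass.setBindGroup(0, op.bindGroup);
    pass.dispatchWorkgroups(op.x, op.y ?? 1, op.z ?? 1);
    pass.end();
  }
  device.queue.submit([encoder.finish()]); // one submit for the whole batch
}
```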
Mike Wyrzykowski
Comment 3 2026-04-08 14:47:07 PDT
Can you please clarify what
> ~400-600 individual compute shaders
means? Is there a reason so many command buffers are being submitted? Normally it's only necessary to split when you have a GPU/CPU sync point.
reeselevine1
Comment 4 2026-04-12 21:42:52 PDT
Thanks for the update, Mike. I should have been more specific: by "compute shaders" I really meant 400-600 compute passes before we need to synchronize the GPU/CPU. I did play around with the batch size, and that does help tune the performance on iOS while maintaining stability, as long as I keep the number of command buffers submitted at a time to one.

I guess I was hoping that I could batch on the application side at some reasonable size, and then let the WebGPU queue handle dispatching them without blocking on each one. But it sounds like on iOS, at least, it'll be best to find a batch size that allows good CPU encode -> GPU dispatch pipelining (which I realize may differ even within iOS devices), and block on each submission.

Otherwise, I was struggling a little to even diagnose this issue, since it wasn't reproducing with Safari Web Inspector attached, and I wasn't seeing any other messages surfaced when I tried logging device errors or uncaptured errors. Overall, though, I think we can work with this going forward, so I'm good with this issue being closed!
Mike Wyrzykowski
Comment 5 2026-04-13 08:11:00 PDT
Is it possible to combine compute passes? That number of passes might be creating unnecessary overhead, and WebKit encodes immediately, so we don't coalesce passes either. Switching compute pipeline state in the middle of a pass would be preferable to ending the pass and starting a new one.
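The pipeline-switching suggestion could look roughly like this. A sketch only, assuming the dispatches can share one pass; `graphOps` and `encodeAsSinglePass` are hypothetical names:

```javascript
// One compute pass for the whole batch: switch pipelines between dispatches
// rather than ending the pass and beginning a new one each time.
function encodeAsSinglePass(encoder, graphOps) {
  const pass = encoder.beginComputePass();
  for (const op of graphOps) {
    pass.setPipeline(op.pipeline);        // pipeline state change within the pass
    pass.setBindGroup(0, op.bindGroup);
    pass.dispatchWorkgroups(op.x, op.y ?? 1, op.z ?? 1);
  }
  pass.end(); // the pass ends once, after all dispatches
}
```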
reeselevine1
Comment 6 2026-04-15 09:29:43 PDT
Yep, it is possible. I tried it out, and it does seem much more stable (combining 64 compute passes per command buffer). With this change it also seems like having multiple command buffers submitted is stable too (even though I know it's recommended to still only have 1).

The one caveat is that since timestamp queries only work at compute pass boundaries, with this change I can no longer get a profiling breakdown of individual shaders/pipelines. So I currently include an option to disable the compute pass batching so I can get profiling information on other systems. From my testing, on systems besides iOS there isn't any significant difference in performance or stability between combining compute passes and not combining them.
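The profiling caveat follows from `timestampWrites` living on the compute pass descriptor, so per-shader timing needs one pass per dispatch. A sketch of the described toggle; `encodeOps`, `graphOps`, and the `profile` flag are hypothetical names, and it assumes a `querySet` of type "timestamp" and a device with the "timestamp-query" feature:

```javascript
// Toggle between merged passes (stable on iOS, no per-shader timing) and
// one pass per dispatch (timestamps bracket each shader at pass boundaries).
function encodeOps(encoder, graphOps, querySet, profile) {
  if (profile) {
    graphOps.forEach((op, i) => {
      const pass = encoder.beginComputePass({
        timestampWrites: {
          querySet,
          beginningOfPassWriteIndex: 2 * i,   // timestamp before this shader
          endOfPassWriteIndex: 2 * i + 1,     // timestamp after this shader
        },
      });
      pass.setPipeline(op.pipeline);
      pass.setBindGroup(0, op.bindGroup);
      pass.dispatchWorkgroups(op.x);
      pass.end();
    });
  } else {
    const pass = encoder.beginComputePass(); // one merged pass, no per-op timestamps
    for (const op of graphOps) {
      pass.setPipeline(op.pipeline);
      pass.setBindGroup(0, op.bindGroup);
      pass.dispatchWorkgroups(op.x);
    }
    pass.end();
  }
}
```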