Bug 271756 — REOPENED
Can new Worker() be made to properly operate on the background (maybe when the URL is an in-memory Blob)?
https://bugs.webkit.org/show_bug.cgi?id=271756
jujjyl
Reported 2024-03-27 04:30:59 PDT
Would it be possible to make the following code not deadlock the browser?

`a.html`
```html
<html><body><script>
fetch('a.js').then(response => response.blob()).then(blob => {
  let worker = new Worker(URL.createObjectURL(blob));
  let sab = new Uint8Array(new SharedArrayBuffer(16));
  worker.postMessage(sab);
  console.log('Waiting for Worker to finish');
  while(sab[0] != 1) /*wait to join with the result*/;
  console.log(`Worker finished. SAB: ${sab[0]}`);
});
</script></body></html>
```

`a.js`
```js
onmessage = (e) => {
  console.log('Received SAB');
  e.data[0] = 1;
}
```

Observed: the browser hangs in the `while()` loop.

Expected: `new Worker()` makes forward progress asynchronously, and the code prints

```
Worker finished. SAB: 1
```

Some background: this deadlock causes headaches for users of SharedArrayBuffer/WebAssembly, and leads to poor startup time and over-subscription of web resources on shipped web sites. If the above code could work without deadlocking, it would greatly improve the startup time and performance of SharedArrayBuffer-utilizing web sites.

Note that we are not asking for `new Worker(arbitraryUrl)` to necessarily have this forward-progress guarantee, but at least that `new Worker(blobUrlInMemory)` would (as illustrated in the code above). Would that be feasible?
jujjyl
Comment 1 2024-03-27 04:35:01 PDT
Note that when testing the above code, the COOP+COEP HTTP headers are required, otherwise the SharedArrayBuffer object will not be available. A quick way to get those headers is to download the ad hoc [emrun.py](https://raw.githubusercontent.com/emscripten-core/emscripten/main/emrun.py) web server, and run

```
emrun.py --no_browser --port 8000 .
```

in the directory of `a.html` and `a.js`. That will launch a web server that sends the relevant COOP+COEP headers.
Alexey Proskuryakov
Comment 2 2024-03-27 17:07:34 PDT
This is an interesting idea, but probably not. It would be quite tricky, as Worker code generally expects the main thread to be available, and there can be quite a bit of work happening behind the scenes (Web Inspector delegates are one example that comes to mind quickly). Perhaps more importantly, I don't think that we'd want to encourage blocking the main thread while waiting for a background thread operation; the whole point of workers is that they work without blocking the main thread.
jujjyl
Comment 3 2024-03-28 01:26:25 PDT
> I don't think that we'd be encouraging blocking main thread while waiting for background thread operation, the whole point of workers is that they work without blocking the main thread.

This is a fine sentiment, although it is good to recognize that there are several real-world use cases where blocking the main thread on worker threads is not just the correct thing to do, but the only possible thing to do.

One example is multithreaded WebGL/WebGPU scene traversal and update. A typical interactive real-time rendering application processes frames in its requestAnimationFrame() callback. To improve performance, a common technique is to hand off the scene traversal and update to a number of background Workers, and then wait until those Workers finish before rendering the scene contents. There is no way to achieve this other than making the main thread synchronously block on the Worker threads. This is a hardened, battle-tested algorithm in hundreds of native game and 3D applications.

A second example where synchronously blocking the main thread is the right and most performant thing to do is multithreaded mark-and-sweep garbage collection. In https://github.com/juj/emgc you can find an example of a multithreaded GC, to be used for example when compiling a C#, Java or Python VM into WebAssembly. In such a scenario, when the heap runs out of memory on a malloc, the system may need to trigger an on-demand GC to reclaim memory. To improve the overall performance of the GC, instead of doing the GC marking phase only on the main thread, it is desirable to do it with the help of multiple Workers.
But the inability to launch such GC Workers on demand means that they must be preallocated at site startup, which is pessimistic for several reasons:

1) the site startup time is slowed down,
2) the site may not even need to GC in all cases (yet the GC Workers need to be there and consume memory),
3) the site has no way to let go of the GC Workers and free up page memory (they may need to be reused suddenly in the future depending on user access patterns),
4) when composing software from multiple libraries, each library needs to independently pre-create Workers for its own purposes, since coordinating the needed Worker counts across unrelated libraries is impossible (it would essentially require the developer to know ahead of time how many pthread_create()s their application would ever do in the worst case).

If there was a way to launch Workers synchronously from an in-memory Blob URL, all of the above inefficiencies would be gone: multithreaded sites could launch faster, they wouldn't have to pool up Workers in advance, and the performance of the abovementioned use cases would improve. I can appreciate the concern that this might be complex to implement, though the rationale "sync blocking is bad" is not at all accurate: in many scenarios, synchronously blocking the main thread can improve both the throughput and the responsiveness of a site.
Alexey Proskuryakov
Comment 4 2024-03-29 11:33:18 PDT
Adding a few other folks for visibility, so that your feedback is well considered. Detecting a frozen main thread is one of the things we do to help users out of misbehaving websites, and while your use cases don't directly contradict the current implementation, encouraging blocking the main thread may limit what can be done to improve that further.
jujjyl
Comment 5 2024-03-29 12:31:02 PDT
> Adding a few other folks for visibility, so that your feedback is well considered.

Thanks!

> encouraging blocking the main thread may limit what can be done to further improve.

I would stress that this issue is not about blocking the main thread per se. Rather, it is about enabling a site to allocate resources only when they are actually needed, instead of needing to (wastefully) preallocate Workers earlier. To illustrate, one can already write code like this today:

`b.html`
```html
<html><body><script>
fetch('b.js').then(response => response.blob()).then(blob => {
  let worker = new Worker(URL.createObjectURL(blob));
  worker.postMessage('init');
  worker.onmessage = () => {
    let sab = new Uint8Array(new SharedArrayBuffer(16));
    worker.postMessage(sab);
    console.log('Waiting for Worker to finish');
    while(sab[0] != 1) /*wait to join with the result*/;
    console.log(`Worker finished. SAB: ${sab[0]}`);
  };
});
</script></body></html>
```

`b.js`
```js
onmessage = (e) => {
  if (e.data == 'init') {
    console.log('Worker received SAB');
    postMessage(0);
  } else {
    console.log('Received SAB');
    e.data[0] = 1;
  }
}
```

The above code does not hang; it works correctly. Both a.html and b.html block the main thread equally; in other words, the root issue at the heart of this problem is not the "blocking the main thread" part. The workaround shown in b.html is what WebAssembly/SharedArrayBuffer users use today, since a.html does not work. The difference between b.html and a.html is that in b.html, `worker.postMessage()` **is** able to make forward progress even while the main thread is spin-waiting for the worker, whereas `new Worker()` in a.html is not. But the trouble with the workaround in b.html is that the Worker must be preallocated up front. In real-world programs, this must happen before the code necessarily knows whether it will need the Worker in the first place.
For example, in https://github.com/juj/emgc I have implemented a multithreaded garbage collector, to be used in C#/Java/Python VMs compiled to multithreaded WebAssembly. In that GC, I would like to perform the GC marking step more quickly by using a pool of background Workers. But I would also like to spawn that GC marking Worker pool only on demand when necessary, instead of requiring the whole WebAssembly site to delay its page startup until I first manage to spin up all the GC Workers (which may or may not ever fire, depending on what the user does on the site!). If the code example in a.html worked, I would be able to spawn the GC Workers synchronously on the first occasion that I need to GC, which would lead to a kind of "only-pay-if-you-use-it" allocation of site resources.

This is just one example. Similar needs occur for Emscripten multithreaded WebAssembly users in other scenarios too, e.g. when implementing multithreaded WebGPU rendering, multithreaded parallel for() constructs, and similar. So ideally, if `new Worker(inMemoryBlob)` was able to complete without needing to yield back to the main JS event loop, all of that wasteful preallocation of `new Worker()`s could be avoided, and multithreaded WebAssembly sites would not need to start up by creating an avalanche of Workers that they might only potentially ever need. That would be a big help to WebAssembly site startup performance overall!
Anne van Kesteren
Comment 6 2024-04-10 07:49:34 PDT
Reopening this for further consideration given https://github.com/web-platform-tests/wpt/pull/45502 and the discussion in the HTML issue. In particular, there is nothing in the specification that says comment 0 should not work. It shouldn't necessarily happen synchronously, but the worker should be able to make forward progress independently of what happens on the main thread. Meanwhile the main thread can either hang (and the browser can decide to stop running the script with a "slow script dialog" or some such) or run the script long enough for everything to succeed.
jujjyl
Comment 7 2026-03-05 04:50:05 PST
Hi, friendly ping. I wonder if this could be something to bump the importance of? This item is critical to being able to ship the multithreaded Unity3D game engine on the Web, and many other Emscripten-compiled multithreaded pages. Without this ability, Emscripten-compiled pages often need to estimate at startup time the total number of Workers that the application's thread pools will ever spawn. At the scale of Unity3D, estimating that total is near impossible, and even when it is estimated successfully, page startup is greatly slowed down since it involves spawning dozens of Workers up front and sharing the WebAssembly Module with each. With this capability, parallel WebAssembly algorithms would be able to spawn Worker threads as needed, and then deallocate those Workers when they are no longer needed, to save memory.
Radar WebKit Bug Importer
Comment 8 2026-03-05 08:29:41 PST
Kimmo Kinnunen
Comment 9 2026-03-05 09:25:28 PST
Not necessarily opposed to the SAB spin working (it makes sense), but commenting on the perf premise.

> Rather, this issue is about enabling a site to allocate resources only when they are actually needed, instead of needing to (wastefully) preallocate Workers earlier.

How is it not requiring preallocating?

> To improve performance of such operation, a common technique is to hand off the scene traversal and update to a number of background Workers, and then wait until these Workers finish, to render the scene contents.

It would sound like spawning actual OS threads for workers per traversal would hinder the "improve the performance" goal? Which in turn would mean that, to support the case, the browser would leave the threads initialized and running after the initial scene traversals. Which in turn would mean the browser would preallocate instead of the page?

> the total number of Workers that the application's thread pools will ever spawn.

Note that the whole reason "thread pools" exist in the first place is that, in the arbitrary case, it has historically been impossible to spawn threads in a performant manner the moment they're needed in order to off-thread work. What's the underlying real-world mechanism we're discussing here that would make the perf be good?
Kimmo Kinnunen
Comment 10 2026-03-05 09:31:10 PST
Is the proposal that the browser preallocates ncpus + x OS threads, and then schedules Workers onto them "manually"? E.g. the page could have 10000 Workers and they'd be scheduled on 20 threads? I'd imagine the problem with that is that the scheduling is non-trivial (already implemented in the OS for 40 years) and that switching away from blocking/busy-looping workers is hard.
Kimmo Kinnunen
Comment 11 2026-03-05 09:37:07 PST
> start up by creating an avalanche of Workers that they might only potentially need to ever use

Isn't the contender here the case where the framework adds workers to the pool as needed, and cleans up the pool after the workers have idled for a certain amount of time? Or is the concern that adding workers as needed is slow, janking during the worker startup? If so, the proposal looks odd, as that'd be the default mode for a naive implementation of the proposal, for every worker task execution?
Kimmo Kinnunen
Comment 12 2026-03-05 09:40:25 PST
To put it in other words: how does the Unity3D native build handle this case? Is the native runtime able to spawn OS threads for each parallel for separately?
jujjyl
Comment 13 2026-03-05 11:51:52 PST
> It would sound like spawning actual OS threads for workers per traversal would hinder the "improve the performance" goal?
> In turn which would mean to support the case, the browser would leave the threads initialised and running after initial scene traversals
> In turn which would mean the browser would preallocate instead of the page?

By faster performance, I am referring to web page startup performance, not performance during the time when these threads are actually used. So maybe startup time, or load time, is the better phrase here. I.e. when a site can start up quickly without needing to preallocate threads up front, it can get to displaying the main site contents quicker, as opposed to having to preallocate a large thread pool before starting WebAssembly code execution on the site. Then, when any subsystems that actually need threads are executed, they can spawn their thread pools on demand the first time they are used, and either leave them alive, or not. Whether such thread pools are needed at all may depend on the user; in many cases the user might opt to do something else altogether, and never access a site feature that needs the thread pool functionality.

> Is the proposal that the browser preallocates ncpus + x os threads

No, the browser does not need to preallocate threads. The request is that launching a Worker can occur in parallel with the main thread, without the main thread needing to keep yielding an indeterminate number of times to the main loop before that Worker launch becomes observable.
Kimmo Kinnunen
Comment 14 2026-03-06 13:16:50 PST
> they can spawn their thread pools on-demand the first time they are used, and either leave them alive, or not.

Yeah, now I see it. So you'd like:

```
let w = getOrCreateWorker();
let s = createBinarySemaphore();
postTaskAndSignalSemaphore(w, getNextTask(), s);
s.wait();
```

The problem here isn't really the Worker creation; it's every piece of browser code in general. That's why the communication with workers is asynchronous. Dropping to the run loop, especially in main context -> worker communication, is essential for existing codebases to work. If the worker needs something from the implementation main thread, the implementation main thread cannot block. Thus in practice you cannot roll your own wait primitives by busy looping, as that prevents the worker from progressing. Even if the Worker startup was fixed, the same issue exists in the likely need to block to complete your parallel for.

In the case of WebKit, you can imagine for example the worker doing WebGL. If the context needs to be re-created after a context loss, the current implementation has to run code in the main context. Thus your semaphore block (SAB busy loop) would hang the browser. So in general, unless the browser guarantees that the implementation is such that workers never need to run code on the main thread, the request is not sound. Another aspect is that for the main content, web APIs have moved towards async-only APIs, so there it's unnatural to request the semaphore API. For worker -> sub-worker it would make sense.
jujjyl
Comment 15 2026-03-06 14:42:03 PST
I agree with the description. That is what this request is after. Indeed, not only would new Worker() need to be synchronous, but postMessage()ing to that Worker should also be able to take place synchronously.

The situation you describe, with for example WebGL requiring a call back into the main context, is not uncommon. For example, Chrome has been working on resolving such issues, which they consider a problem to fix: https://issues.chromium.org/issues/425160329

It was not until the advent of SharedArrayBuffer that these types of lock priority inversion problems became observable in the Web platform at all, since that is what enables programs to create synchronous locks between the main thread and a worker thread in the first place. Without SharedArrayBuffer, any internal browser behavior of a Worker taking a lock held by the main thread was not observable. In past years, such problems have been reported and fixed by browsers one at a time. E.g. in Firefox:

- console.log()ing would, a long time ago, cause lock priority inversion, https://bugzilla.mozilla.org/show_bug.cgi?id=1049091, since fixed.
- calling performance.now() in a Worker would require the main thread to yield back to the event loop, https://bugzilla.mozilla.org/show_bug.cgi?id=1131757, since fixed.

To motivate why this is so important for WebAssembly programs, let me provide a concrete example of what we are facing. Consider OpenMP, which is a popular native parallelization architecture:

- https://curc.readthedocs.io/en/latest/programming/OpenMP-C.html
- https://github.com/abrown/wasm-openmp-examples

OpenMP enables one to create parallel for constructs easily.
For example:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
  int a[10] = {1,2,3,4,5,6,7,8,9,10};
  int sum = 0;
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < 10; i++)
    sum += a[i];
  printf("sum = %d\n", sum);
  return 0;
}
```

Parallel algorithms that follow a synchronous fork+join like this appear everywhere in native codebases. In native code, the OpenMP runtime would typically allocate the thread pool for the parallel for() on demand, the first time a parallel for loop is executed. But on the Web, since such on-demand thread pool creation is not possible today, such thread pools must be created ahead of time, typically with the "-sPTHREAD_POOL_SIZE=navigator.hardwareConcurrency" linker flag.

This works at first, but only until the application requires another thread elsewhere. Say it calls pthread_create() to instantiate a background PNG loading thread. This will "eat away" one of the dormant WebAssembly system Worker pool threads, and as a result, the OpenMP parallel for loop will deadlock, since it can no longer synchronously fork-join spin up its own thread pool: there are no pre-created Workers in the pool. So one must conclude that such an OpenMP thread pool must be "earmarked" to be pre-created only for OpenMP, and a hypothetical PNG loading thread must create its own. OK, go with that design. But then, as the program grows, there are other sources of thread pool requirements. Say, Unity3D's own job thread pool, which works not unlike the OpenMP thread pool. Because both require their own fork-join, they both need their own earmarked thread pools preallocated. Hence one will then have to run with "-sPTHREAD_POOL_SIZE=2*navigator.hardwareConcurrency" to make sure both subsystems have their Workers available. But this slows down the startup time further, since a large pool of Workers must be prepared long before control flow even reaches either subsystem that needs the threads.
Then one realizes that one of the functions containing an OpenMP parallel for is actually called from two separate threads simultaneously, so the OpenMP subsystem might attempt to grow the pool size dynamically so that both parallel for loops can make progress. And as the codebase grows to tens of millions of lines of code and hundreds of composed sub-libraries, as Unity3D for example has, more and more subsystems are discovered to have implemented their own on-demand thread pools. It becomes practically impossible to centrally coordinate these, or to track how large a thread pool one must create in advance.

Also, attempting to refactor all the subsystem algorithms to use asynchronously spawned thread pools is infeasible. As with the synchronous parallel array sum above, it is infeasible to require developers to refactor such fundamentally "low-level" operations into async tasks. We may not be talking about gigabyte-scale parallel data set operations; in a real-time engine, such parallel fork-join data sets may be tuned to complete well within a < 16 msec timeframe, which would never pose a slow-script-dialog problem, for example. If such parallel fork-join algorithms are recast as async, any such async operation will require dropping the rendering algorithm out of the requestAnimationFrame() event handler, breaking the vsync presentation composition design and frame timing of WebGL/WebGPU (yielding back == present). In the best case, even if such parallel algorithms could be made async by design, it would mean several milliseconds of slowdown from reverting to async postMessage() join communication, as opposed to the fast synchronous work slices or job-stealing queues that are typical for thread pools.

So maybe this gives some light on the pickle that shipping a large multithreaded 3D renderer codebase is in, when synchronous Worker spawning is not available.
Composability of "global oracle" Worker pools is infeasible, and async-refactoring low-level parallel fork-join algorithms is infeasible as well. I appreciate the push-back on the sync WebGL context loss scenario, which may make it more difficult to refactor `new Worker()` and postMessage() to be able to make progress independently. If the conclusion here is that this will be impossible to achieve in the browser, then it would be good to loop in the spec body, and ask whether https://github.com/whatwg/html/issues/10228 is unimplementable.