Bug 274043 - REGRESSION (277125@main): 10% performance regression in Web Assembly
Summary: REGRESSION (277125@main): 10% performance regression in Web Assembly
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebAssembly (show other bugs)
Version: WebKit Nightly Build
Hardware: Mac (Apple Silicon) macOS 14
: P2 Normal
Assignee: Nobody
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2024-05-11 15:08 PDT by Christian Speckner
Modified: 2024-05-29 02:05 PDT (History)
5 users (show)

See Also:


Attachments
Gzipped SD card image with the game that I use for performance testing (684.22 KB, application/x-gzip)
2024-05-11 15:08 PDT, Christian Speckner
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Christian Speckner 2024-05-11 15:08:04 PDT
Created attachment 471373 [details]
Gzipped SD card image with the game that I use for performance testing

I see a performance regression in my WASM code in Webkit of about 10%. I first noticed the regression in Safari Technology Preview and then confirmed that the regression is still present in the current git (... at the time of writing). I bisected the regression down to commit dc794ece05e682e4fe156bdea696620bc3727261 (Add stub for new jit-less js->wasm entrypoint)

My codebase is an emulator for ARM-based PalmOS devices and can be found on github here: https://github.com/cloudpilot-emu/cp-uarm

In order to reproduce the regression you can use a build of the current version of my code that can be found here: https://cloudpilot-emu.github.io/uarm-preview/

To launch the emulator you need to download and open the following files by clicking the respective buttons:

* NOR image: https://palmdb.net/content/files/archive-rom/palm-roms-complete/Palm-Tungsten-E2-nor.bin
* NAND image: https://palmdb.net/content/files/archive-rom/palm-roms-complete/Palm-Tungsten-E2-nand.bin
* SD card image: use the attached file (after gunzipping it)

The emulator will boot, and if you click "Start audio" you'll hear the boot sound. Once it has booted the PalmOS preferences panel is displayed. In order to reproduce my performance test

* Click the "home" icon below the "Done" button
* Click the dropdown in the upper right corner to open it and select "CARD" at the bottom
* Click the "Bike or Die!" icon in order to start the game
* Click "Continue in trial mode"

The game will load and start a realtime 3D intro sequence. Observe the "Emulation speed" display --- the second number ("limit MIPS") is the number of emulated ARM instructions that the program can emulate per second and thus a direct measure of the performance. With the version of Webkit currently shipped with Safari 17.4.1 I get about 83 - 85 MIPS on my Macbook M1 Max. Starting with the commit I identified above the speed drops by about 10% to 75 - 77 MIPS.

My code heavily relies on calling function pointers in C (one for every emulated ARM instruction) which compiles down to "call_indirect", and I suppose the commit referenced above introduces additional overhead in these calls.
Comment 1 Christian Speckner 2024-05-11 15:10:08 PDT
Ups, I forgot to fill in the latest git commit that I tested against before submitting ;) The latest git commit that I tested is 794b345789d622c60a60e4b29b264c838b177279 which still exhibits the regression.
Comment 2 Radar WebKit Bug Importer 2024-05-13 10:13:44 PDT
<rdar://problem/128010259>
Comment 3 Christian Speckner 2024-05-13 11:57:44 PDT
Another minor correction: you can now find the version of uArm for reproduction here: https://cloudpilot-emu.github.io/uarm-indirect/

The other page (https://cloudpilot-emu.github.io/uarm-preview/) now serves an updated version which trades the indirect calls that dispatch the instructions executed by the emulator for a large switch (which generate and swap in after building the WASM module with emscripten). This version performs slightly better and does not suffer from the aforementioned performance regression, which supports my suspicion that the regression is due to additional overhead in call_indirect.
Comment 4 Marcus Plutowski 2024-05-16 12:02:10 PDT
Thanks for the detailed description. I think I was able to reproduce the slowdown, but interestingly enough, while dc794ece05e682e4fe156bdea696620bc3727261 does run about 10% more slowly than my system Safari, so does the commit immediately prior (5247798c3d0e528e13aa722328f63a74ce2d3d4d [JSC] Enable wasm fast-memory on iOS), so it's not clear that the listed commit is the issue —— either that or I'm not actually reproducing the issue correctly.

Could you confirm whether 5247798c3d0e5 exhibits the slowdown on your machine, and otherwise what commits you were able to build that did not exhibit the slowdown?
Comment 5 Christian Speckner 2024-05-19 14:44:13 PDT
Thanks a lot for looking into to this! I double checked and built both 5247798c3d0e528e13aa722328f63a74ce2d3d4d (Enable fast memory on iOS) and dc794ece05e682e4fe156bdea696620bc3727261 (add stub for new...) from a clean repository. I still get about 10% better limit MIPS for 5247798c3 (about 83% vs. 75%).

Mind that you have to use https://cloudpilot-emu.github.io/uarm-indirect/, and that the second number (limit MIPS) is the relevant one.

Hope this helps!
Comment 6 Marcus Plutowski 2024-05-22 10:38:06 PDT
I got a hold of an M1 MacBook and was able to reproduce a solid 8% performance drop across the listed commit, with about the same performance baseline as you described -- I'll start doing some differential testing to see where this could be coming from.

It is odd that I wasn't able to reproduce this on my M3. To make sure I'm doing this the exact same way you are: I know you were first looking at the Safari Technology Preview at first, but thereafter how exactly did you measure performance for a specific hash? Details on A) how you built, B) what you ran (MiniBrowser? something else?) would help a lot.
Comment 7 Keith Miller 2024-05-22 10:48:58 PDT
From skimming the change in question the most suspicious thing is that GPRInfo::metadataTableRegister was added to RegisterSetBuilder::wasmPinnedRegisters even though it's not actually a pinned register. I think it could just be added to `JSEntrypointInterpreterCallee::calleeSaveRegistersImpl` and maybe some other places.
Comment 8 Keith Miller 2024-05-22 10:50:33 PDT
(In reply to Keith Miller from comment #7)
> From skimming the change in question the most suspicious thing is that
> GPRInfo::metadataTableRegister was added to
> RegisterSetBuilder::wasmPinnedRegisters even though it's not actually a
> pinned register. I think it could just be added to
> `JSEntrypointInterpreterCallee::calleeSaveRegistersImpl` and maybe some
> other places.

Also createJSToWasmWrapper would need to be updated too.
Comment 9 Christian Speckner 2024-05-22 13:18:18 PDT
(In reply to Marcus Plutowski from comment #6)
> It is odd that I wasn't able to reproduce this on my M3. To make sure I'm
> doing this the exact same way you are: I know you were first looking at the
> Safari Technology Preview at first, but thereafter how exactly did you
> measure performance for a specific hash? Details on A) how you built, B)
> what you ran (MiniBrowser? something else?) would help a lot.

Hm, it is possible that the M3 is simply too fast to show a noticeable performance drop, although I would have expected an effect on the limit MIPS at any rate (the actual MIPS are capped at 100). I have added code to make the cap adjustable to the emulator though which I can backport to the version that uses call_indirect for dispatch if this helps. I also have another benchmark that does mp4 video decoding and that might give cleaner results if the cap is increased, but that comes with a 64MB SD card image that I cannot upload to bugzilla ;)

As for how I built and tested: I checked out the repsective SHA, did a 'git clean -xffd', then './Tools/Scripts/build-web' to build, and then './Tools/Scripts/run-safari' to run.
Comment 10 Marcus Plutowski 2024-05-22 14:06:59 PDT
(In reply to Christian Speckner from comment #9)

> I have added code to make
> the cap adjustable to the emulator though which I can backport to the
> version that uses call_indirect for dispatch if this helps.
This would definitely help, thanks!
Comment 11 Justin Michaud 2024-05-23 10:17:09 PDT
Pull request: https://github.com/WebKit/WebKit/pull/28990
Comment 12 Christian Speckner 2024-05-23 13:51:02 PDT
(In reply to Justin Michaud from comment #11)
> Pull request: https://github.com/WebKit/WebKit/pull/28990

Just built this PR, and the slowdown is gone 🎉 Speed is now on par again with current Safari, both with the call_indirect heavy version of the emulator and with the current codebase.

In case you want or need to do further benchmarking, I have extended https://cloudpilot-emu.github.io/uarm-indirect/ to include a slider to increase the MIPS cap. Note that you'll have to load the SD / ROM image files again. I have also uploaded a SD card image with two more benchmarks to https://cspeckner.de/benchmark.img .

The first new benchmark is yCPUBench. If you start it and click "Benchmark armlet" you'll get a number that can be compared (provided you set the MIPS to 300 in order to avoid the cap).

The other benchmark is TCPMP which is a video player. Start it (takes a few seconds), click the title to open the "File" menu, select "open files" from the menu, then select "slot 2" and open "ED.avi". If you then select "Benchmark" from the file menu it'll start decoding as fast as possible, and if you click the video after a few seconds you'll get a quantitative benchmark that can be compared (again, set MIPS to 300 to avoid hitting the cap).
Comment 13 EWS 2024-05-29 02:05:02 PDT
Committed 279435@main (7a1fe31ef5a7): <https://commits.webkit.org/279435@main>

Reviewed commits have been landed. Closing PR #28990 and removing active labels.