NEW 174819
WebAssembly: generate even smaller binaries
https://bugs.webkit.org/show_bug.cgi?id=174819
JF Bastien
Reported 2017-07-25 09:47:08 PDT
Created attachment 316368
allocations-graph.py

This is a follow-up to #174818; there are plenty more size gains to be had! A few ideas in random order:

1. Don't make callee a patchpoint: it doesn't need to be one, and it uses up 4 instructions in every ARM64 function prologue. This is expensive with binaries that have 100k+ small functions, because we round up allocations and this patchpoint pushes us over the 32 byte jitAllocationGranule. Just encoding the pointer still leaves a big immediate, but we could add an indirection to use small values that encode well.

2. The jitAllocationGranule isn't needed for WebAssembly. It is required for GC to determine how to mark JIT stub routines, but WebAssembly doesn’t need alignment. That’s tricky to do though, because wasm and JS share the same executable allocator and the granule is passed at construction time. Mixing aligned and non-aligned allocations may have unexpected effects on JavaScript code (some μarchs like aligned branches, etc). Maybe we should use a separate pool?

3. We don't use load and store pair on ARM, except in limited places (such as prologue / epilogue for fp and lr). Teaching B3 / Air about them would take quite a while. I'm especially sad about the OMG tier up function's spills.

4. The bounds checks are bloaty on ARM. We need signaling memory.

5. If there’s no call, and no trapping op (memory, etc), then no fault can occur. No need to have the callee and codeblock on the stack. We also have a separate TODO to just get rid of the codeblock outright for WebAssembly.

6. In register-only cases without spills, there's no need to save / restore fp / lr. We know whether that's the case in the stackcheck patchpoint, so it could also handle the fp / lr save / restore.

7. The js -> wasm entry point is bloaty and represents ~5% of all allocations on large games. Only one thing ever differs (the immediate), yet we generate 96 or 128 bytes for each of them. We could tail-call to a common trampoline from it after the immediate.
We could also append that code to each wasm function which is exported instead of having them separate.

8. A few of the functions see a pretty big size increase when compared to the original wasm binary.

Here's some data (pre-#174818):

BBQ blowup (executable allocated / wasm):
        | ARM64 BBQ 1 | ARM64 BBQ 2 | x86 BBQ 1 | x86 BBQ 2
average |       8.858 |       8.060 |     6.174 |     5.669
stddev  |       6.406 |       5.654 |     5.239 |     5.064
min     |       0.033 |       0.028 |     0.024 |     0.024
max     |      72.000 |      57.600 |    56.000 |    56.000

Allocation types (bytes):
        | ARM64 BBQ 1 | ARM64 BBQ 2 | x86 BBQ 1 | x86 BBQ 2
unknown |      142880 |      154720 |    146144 |    146144
JS2wasm |     5285184 |     5285184 |   4844160 |   4844160
BBQ     |   113392032 |   102043168 |  68447168 |  57660800

Allocation types (percentage of total allocation):
        | ARM64 BBQ 1 | ARM64 BBQ 2 | x86 BBQ 1 | x86 BBQ 2
unknown |          0% |          0% |        0% |        0%
JS2wasm |          4% |          5% |        7% |        8%
BBQ     |         95% |         95% |       93% |       92%

Biggest outliers compared to original (bytes):
function index | ARM64 BBQ 1 | ARM64 BBQ 2 | x86 BBQ 1 | x86 BBQ 2 | original | blowup vs original
         12586 |         288 |         224 |       224 |       224 |        4 | 72.000 x
        103738 |         256 |         288 |       224 |       256 |        5 | 57.600 x
        107915 |         256 |         288 |       224 |       256 |        5 | 57.600 x
         65775 |         160 |         128 |        96 |        96 |        3 | 53.333 x
        109396 |         160 |         128 |        96 |        96 |        3 | 53.333 x
        109428 |         160 |         128 |        96 |        96 |        3 | 53.333 x
         12322 |         256 |         256 |       224 |       256 |        5 | 51.200 x
        111459 |         256 |         288 |       224 |       256 |        6 | 48.000 x
        111462 |         256 |         288 |       224 |       256 |        6 | 48.000 x
        111466 |         256 |         288 |       224 |       256 |        6 | 48.000 x

Biggest outliers compared to original, with original > 100 (bytes):
function index | ARM64 BBQ 1 | ARM64 BBQ 2 | x86 BBQ 1 | x86 BBQ 2 | original | blowup vs original
         84672 |        4800 |       17472 |      3488 |      9824 |      832 | 21.000 x
         12589 |        2624 |       51264 |      1344 |     49952 |     2994 | 17.122 x
         12941 |        2656 |       51264 |      1376 |     49984 |     3012 | 17.020 x
         12947 |        2656 |       51264 |      1376 |     49984 |     3012 | 17.020 x
         12934 |        2688 |       51296 |      1376 |     49984 |     3020 | 16.985 x
         36468 |        1824 |        1568 |      1408 |      1152 |      124 | 14.710 x
         90394 |        1824 |        1600 |      1440 |      1184 |      129 | 14.140 x
         90651 |        1824 |        1600 |      1440 |      1184 |      129 | 14.140 x
         31620 |        1504 |        1312 |      1152 |       960 |      109 | 13.798 x
         37056 |        1600 |        1408 |      1216 |       960 |      116 | 13.793 x

Biggest difference between compiled functions (bytes):
function index | ARM64 BBQ 1 | ARM64 BBQ 2 | x86 BBQ 1 | x86 BBQ 2 | original | compiled blowup
         12589 |        2624 |       51264 |      1344 |     49952 |     2994 | 38.143 x
         12934 |        2688 |       51296 |      1376 |     49984 |     3020 | 37.279 x
         12941 |        2656 |       51264 |      1376 |     49984 |     3012 | 37.256 x
         12947 |        2656 |       51264 |      1376 |     49984 |     3012 | 37.256 x
        107548 |        3680 |       14144 |      2624 |     17952 |     2281 | 6.841 x
         61194 |        3424 |        3104 |       768 |       576 |      479 | 5.944 x
         11710 |        2560 |        2304 |       576 |       448 |      380 | 5.714 x
         76474 |        1984 |        1792 |       448 |       352 |      249 | 5.636 x
         94775 |        1984 |        1824 |       448 |       352 |      254 | 5.636 x
         94920 |        4608 |        4128 |      1120 |       832 |      663 | 5.538 x

(see attached script for how it was generated)
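
The two "blowup" columns in the outlier tables above are not defined in the text. The listed figures are consistent with the reading sketched below: "blowup vs original" looks like the largest compiled size across the four configurations divided by the original wasm size, and "compiled blowup" looks like the largest compiled size divided by the smallest. This is an inference from the numbers, not taken from allocations-graph.py, and the function names are hypothetical.

```python
# Rows copied from the tables in this report:
# (function index, ARM64 BBQ 1, ARM64 BBQ 2, x86 BBQ 1, x86 BBQ 2, original)
ROWS = [
    (12586, 288, 224, 224, 224, 4),
    (84672, 4800, 17472, 3488, 9824, 832),
    (12589, 2624, 51264, 1344, 49952, 2994),
    (61194, 3424, 3104, 768, 576, 479),
]


def blowup_vs_original(compiled, original):
    """Assumed definition: worst-case compiled size over original wasm size."""
    return max(compiled) / original


def compiled_blowup(compiled):
    """Assumed definition: largest compiled size over smallest compiled size."""
    return max(compiled) / min(compiled)


for index, *compiled, original in ROWS:
    print(f"{index:>6} | "
          f"{blowup_vs_original(compiled, original):7.3f} x | "
          f"{compiled_blowup(compiled):7.3f} x")
```

Under this reading, a function such as 12589 ranks high in the last table because one configuration on each architecture compiles it to roughly 50 KB while the other stays under 3 KB.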
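
Idea 1 depends on how rounding to the allocation granule interacts with small functions. The sketch below illustrates the arithmetic only, assuming 4-byte ARM64 instructions (so the 4-instruction patchpoint costs 16 bytes) and the 32 byte jitAllocationGranule mentioned above. The function body sizes are made-up illustrations, not measurements, and `round_up_to_granule` is a hypothetical helper rather than the allocator's actual code.

```python
JIT_ALLOCATION_GRANULE = 32   # bytes, as stated in this report
ARM64_INSTRUCTION_SIZE = 4    # bytes; ARM64 uses fixed-width instructions
PATCHPOINT_INSTRUCTIONS = 4   # callee patchpoint cost, as stated in this report


def round_up_to_granule(size, granule=JIT_ALLOCATION_GRANULE):
    """Round a byte size up to the next multiple of the granule."""
    return (size + granule - 1) // granule * granule


patchpoint_bytes = PATCHPOINT_INSTRUCTIONS * ARM64_INSTRUCTION_SIZE

# Hypothetical code sizes (bytes) for small functions, excluding the patchpoint.
for body in (20, 48, 60):
    with_patchpoint = round_up_to_granule(body + patchpoint_bytes)
    without_patchpoint = round_up_to_granule(body)
    print(f"body={body:3}: {with_patchpoint:3} bytes with patchpoint, "
          f"{without_patchpoint:3} without")
```

Only functions whose size sits within 16 bytes of a granule boundary pay an extra granule, but with 100k+ small functions that can add up. The per-function sizes in the outlier tables are all multiples of 32, which is consistent with this kind of rounding.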
Attachments
allocations-graph.py (8.17 KB, text/x-python-script)
2017-07-25 09:47 PDT, JF Bastien
no flags