Bug 174819 - WebAssembly: generate even smaller binaries
Summary: WebAssembly: generate even smaller binaries
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 174818
Blocks:
Reported: 2017-07-25 09:47 PDT by JF Bastien
Modified: 2020-01-24 09:18 PST
CC List: 6 users

See Also:


Attachments
allocations-graph.py (8.17 KB, text/x-python-script)
2017-07-25 09:47 PDT, JF Bastien

Description JF Bastien 2017-07-25 09:47:08 PDT
Created attachment 316368
allocations-graph.py

This is a follow-up to #174818: there are plenty more size gains to be had!

A few ideas in random order:

1. Don't make the callee a patchpoint: it doesn't need to be one, and it uses up 4 instructions in every ARM64 function prologue. This is expensive with binaries that have 100k+ small functions, because we round up allocations and this patchpoint pushes us over the 32-byte jitAllocationGranule. Encoding the pointer directly still makes it a big immediate, but we could add an indirection to use small values that encode well.

2. The jitAllocationGranule isn't needed for WebAssembly. It is required for GC to determine how to mark JIT stub routines, but WebAssembly doesn’t need alignment. That’s tricky to do though, because wasm and JS share the same executable allocator and the granule is passed at construction time. Mixing aligned and non-aligned allocations may have unexpected effects on JavaScript code (some microarchitectures like aligned branches, etc). Maybe we should use a separate pool?

3. We don't use load and store pair on ARM, except in limited places (such as prologue / epilogue for fp and lr). This would take quite a while to teach B3 / Air about. I'm especially sad about the OMG tier up function's spills.

4. The bounds checks are bloaty on ARM. We need signaling memory.

5. If there’s no call, and no trapping op (memory, etc), then no fault can occur. No need to have the callee and codeblock on stack. We also have a separate TODO to just get rid of the codeblock outright for WebAssembly.

6. In register-only cases without spills, no need to save / restore fp / lr. We know whether that's the case in the stackcheck patchpoint, it could also handle fp / lr save / restore.

7. The js -> wasm entry point is bloaty and represents ~5% of all allocations on large games. Only one thing ever differs (the immediate), yet we generate 96 or 128 bytes for each of them. We could tail-call to a common trampoline from it after the immediate. We could also append that code to each exported wasm function instead of keeping the entry points separate.

8. A few of the functions see a pretty big size increase when compared to the original wasm binary. Here's some data (pre-#174818):

BBQ blowup (executable allocated / wasm):
	             | ARM64 BBQ 1 | ARM64 BBQ 2 |   x86 BBQ 1 |   x86 BBQ 2
	     average |       8.858 |       8.060 |       6.174 |       5.669
	      stddev |       6.406 |       5.654 |       5.239 |       5.064
	         min |       0.033 |       0.028 |       0.024 |       0.024
	         max |      72.000 |      57.600 |      56.000 |      56.000
Allocation types (bytes):
	             | ARM64 BBQ 1 | ARM64 BBQ 2 |   x86 BBQ 1 |   x86 BBQ 2
	     unknown |      142880 |      154720 |      146144 |      146144
	     JS2wasm |     5285184 |     5285184 |     4844160 |     4844160
	         BBQ |   113392032 |   102043168 |    68447168 |    57660800
Allocation types (percentage of total allocation):
	             | ARM64 BBQ 1 | ARM64 BBQ 2 |   x86 BBQ 1 |   x86 BBQ 2
	     unknown |          0% |          0% |          0% |          0%
	     JS2wasm |          4% |          5% |          7% |          8%
	         BBQ |         95% |         95% |         93% |         92%
Biggest outliers compared to original (bytes):
	      function index | ARM64 BBQ 1 | ARM64 BBQ 2 |   x86 BBQ 1 |   x86 BBQ 2 |    original | blowup vs original
	               12586 |         288 |         224 |         224 |         224 |           4 |      72.000 x
	              103738 |         256 |         288 |         224 |         256 |           5 |      57.600 x
	              107915 |         256 |         288 |         224 |         256 |           5 |      57.600 x
	               65775 |         160 |         128 |          96 |          96 |           3 |      53.333 x
	              109396 |         160 |         128 |          96 |          96 |           3 |      53.333 x
	              109428 |         160 |         128 |          96 |          96 |           3 |      53.333 x
	               12322 |         256 |         256 |         224 |         256 |           5 |      51.200 x
	              111459 |         256 |         288 |         224 |         256 |           6 |      48.000 x
	              111462 |         256 |         288 |         224 |         256 |           6 |      48.000 x
	              111466 |         256 |         288 |         224 |         256 |           6 |      48.000 x
Biggest outliers compared to original, with original > 100 (bytes):
	      function index | ARM64 BBQ 1 | ARM64 BBQ 2 |   x86 BBQ 1 |   x86 BBQ 2 |    original | blowup vs original
	               84672 |        4800 |       17472 |        3488 |        9824 |         832 |      21.000 x
	               12589 |        2624 |       51264 |        1344 |       49952 |        2994 |      17.122 x
	               12941 |        2656 |       51264 |        1376 |       49984 |        3012 |      17.020 x
	               12947 |        2656 |       51264 |        1376 |       49984 |        3012 |      17.020 x
	               12934 |        2688 |       51296 |        1376 |       49984 |        3020 |      16.985 x
	               36468 |        1824 |        1568 |        1408 |        1152 |         124 |      14.710 x
	               90394 |        1824 |        1600 |        1440 |        1184 |         129 |      14.140 x
	               90651 |        1824 |        1600 |        1440 |        1184 |         129 |      14.140 x
	               31620 |        1504 |        1312 |        1152 |         960 |         109 |      13.798 x
	               37056 |        1600 |        1408 |        1216 |         960 |         116 |      13.793 x
Biggest difference between compiled functions (bytes):
	      function index | ARM64 BBQ 1 | ARM64 BBQ 2 |   x86 BBQ 1 |   x86 BBQ 2 |    original | compiled blowup
	               12589 |        2624 |       51264 |        1344 |       49952 |        2994 |      38.143 x
	               12934 |        2688 |       51296 |        1376 |       49984 |        3020 |      37.279 x
	               12941 |        2656 |       51264 |        1376 |       49984 |        3012 |      37.256 x
	               12947 |        2656 |       51264 |        1376 |       49984 |        3012 |      37.256 x
	              107548 |        3680 |       14144 |        2624 |       17952 |        2281 |       6.841 x
	               61194 |        3424 |        3104 |         768 |         576 |         479 |       5.944 x
	               11710 |        2560 |        2304 |         576 |         448 |         380 |       5.714 x
	               76474 |        1984 |        1792 |         448 |         352 |         249 |       5.636 x
	               94775 |        1984 |        1824 |         448 |         352 |         254 |       5.636 x
	               94920 |        4608 |        4128 |        1120 |         832 |         663 |       5.538 x

(see attached script for how it was generated)
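
For readers without the attachment, here is a minimal sketch of the arithmetic behind ideas 1 and 2 and behind the ratio columns above. It is not the attached script. The function names are hypothetical, and the two ratio formulas are inferred from the table rows (largest compiled size over original size, and largest over smallest compiled size), so treat them as assumptions. The 24-byte function body is made up for illustration.

```python
# Sketch only: names are hypothetical, not taken from allocations-graph.py.

JIT_ALLOCATION_GRANULE = 32   # bytes, as stated in idea 1
ARM64_INSTRUCTION_BYTES = 4   # A64 instructions are fixed-width


def round_up_to_granule(size, granule=JIT_ALLOCATION_GRANULE):
    """Round an allocation size up to the next multiple of the granule."""
    return (size + granule - 1) // granule * granule


def blowup_vs_original(compiled_sizes, original):
    """Assumed formula: largest compiled size divided by the wasm size."""
    return max(compiled_sizes) / original


def compiled_blowup(compiled_sizes):
    """Assumed formula: largest compiled size divided by the smallest."""
    return max(compiled_sizes) / min(compiled_sizes)


# Idea 1: a 4-instruction patchpoint costs 16 bytes on ARM64. For a
# made-up 24-byte function body, that is enough to cross a granule.
patchpoint = 4 * ARM64_INSTRUCTION_BYTES
body = 24  # hypothetical size, for illustration only
print(round_up_to_granule(body))               # 32
print(round_up_to_granule(body + patchpoint))  # 64

# Rows copied from the tables above:
# (ARM64 BBQ 1, ARM64 BBQ 2, x86 BBQ 1, x86 BBQ 2), original.
print(f"{blowup_vs_original([288, 224, 224, 224], 4):.3f}")   # 72.000
print(f"{compiled_blowup([2624, 51264, 1344, 49952]):.3f}")   # 38.143
```

Every compiled size in the tables is a multiple of 32, which is consistent with allocations being rounded up to the granule.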