Bug 225848 - Make AirAllocateRegistersByGraphColoring use less memory
Summary: Make AirAllocateRegistersByGraphColoring use less memory
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Robin Morisset
URL:
Keywords: InRadar
Depends on:
Blocks: 225977 226027
  Show dependency treegraph
 
Reported: 2021-05-15 18:46 PDT by Robin Morisset
Modified: 2021-05-25 23:58 PDT (History)
15 users (show)

See Also:


Attachments
WIP (62.72 KB, patch)
2021-05-15 18:47 PDT, Robin Morisset
ews-feeder: commit-queue-
Details | Formatted Diff | Diff
WIP (62.81 KB, patch)
2021-05-15 19:16 PDT, Robin Morisset
no flags Details | Formatted Diff | Diff
Patch (84.17 KB, patch)
2021-05-17 10:26 PDT, Robin Morisset
ews-feeder: commit-queue-
Details | Formatted Diff | Diff
Patch (84.17 KB, patch)
2021-05-17 11:29 PDT, Robin Morisset
ews-feeder: commit-queue-
Details | Formatted Diff | Diff
Patch (84.25 KB, patch)
2021-05-17 11:40 PDT, Robin Morisset
ews-feeder: commit-queue-
Details | Formatted Diff | Diff
Patch (84.25 KB, patch)
2021-05-17 11:45 PDT, Robin Morisset
no flags Details | Formatted Diff | Diff
Patch (84.25 KB, patch)
2021-05-17 11:50 PDT, Robin Morisset
no flags Details | Formatted Diff | Diff
Patch (84.89 KB, patch)
2021-05-17 14:36 PDT, Robin Morisset
rmorisset: review-
rmorisset: commit-queue-
Details | Formatted Diff | Diff
Patch (110.08 KB, patch)
2021-05-18 16:26 PDT, Robin Morisset
no flags Details | Formatted Diff | Diff
memory use of the interference graphs in JetStream2 (12.04 KB, text/plain)
2021-05-18 16:26 PDT, Robin Morisset
no flags Details
memory use of the interference graphs while visiting mruby-wasm.aotoki.dev (1.37 KB, text/plain)
2021-05-18 16:27 PDT, Robin Morisset
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Robin Morisset 2021-05-15 18:46:29 PDT
The main benefit would be to allow us to increase Options::maximumTmpsForGraphColoring without causing more jetsams.
It might also help our memory usage in general, and is likely to make the register allocator somewhat faster as well.
Comment 1 Robin Morisset 2021-05-15 18:47:07 PDT
Created attachment 428760 [details]
WIP

for EWS
Comment 2 Robin Morisset 2021-05-15 19:16:44 PDT
Created attachment 428764 [details]
WIP

Fix unused variable (and style nits)
Comment 3 Robin Morisset 2021-05-17 10:26:36 PDT
Created attachment 428844 [details]
Patch
Comment 4 Robin Morisset 2021-05-17 11:29:00 PDT
Created attachment 428849 [details]
Patch

Fixing nits in the test file.
Comment 5 Robin Morisset 2021-05-17 11:40:22 PDT
Created attachment 428850 [details]
Patch

Trying to fix the test file again.
Comment 6 Robin Morisset 2021-05-17 11:45:00 PDT
Comment on attachment 428850 [details]
Patch

test file still has warnings
Comment 7 Robin Morisset 2021-05-17 11:45:52 PDT
Created attachment 428852 [details]
Patch
Comment 8 Robin Morisset 2021-05-17 11:50:06 PDT
Created attachment 428853 [details]
Patch
Comment 9 Robin Morisset 2021-05-17 14:31:56 PDT
This patch is unfortunately not enough to run mruby-wasm.aotoki.dev with an increased Options::maxTmpsForGraphColoring on a phone (it still jetsams).

I've added some instrumentation and found out why: that webpage has two functions with roughly 30k Tmps (which is OK, tsf-wasm has functions that big with no issues) and >20M edges in the interference graph (which is the source of the problem, for comparison tsf-wasm only has roughly 1M edges in its largest functions).

This patch reduces the memory consumption of the interference graph for each of these functions from roughly 250MB to 85MB (assuming an average occupancy rate for the relevant HashSet / SmallSet), and saves a few more MB from the rest of the datastructures in the register allocator (e.g. UseCounts). But considering that both functions are compiled in parallel, and that we keep alive during that time both the Air code and the B3 procedure (with its roughly 450k B3::Value per function..), it is not surprising that we still jetsam.

I have some ideas for how to save some more memory in this case, but eventually we might just have to detect while building the interference graph when things are getting out of hand and aborting to the faster/cheaper register allocator.

Oh, also this pretty dense interference graph leads to a ton of spilling: after the first iteration the number of edges is down to 3M, but the number of Tmps has increased to 300k. So it is not even clear that we are gaining that much by trying to do fancy graph coloring in this case.
Comment 10 Robin Morisset 2021-05-17 14:36:20 PDT
Created attachment 428871 [details]
Patch
Comment 11 Robin Morisset 2021-05-18 10:32:54 PDT
Comment on attachment 428871 [details]
Patch

Removing from review as I've got a better version coming up.
Comment 12 Robin Morisset 2021-05-18 16:26:14 PDT
Created attachment 428992 [details]
Patch

Now with an even better datastructure for the interference graph.
Comment 13 Robin Morisset 2021-05-18 16:26:57 PDT
Created attachment 428993 [details]
memory use of the interference graphs in JetStream2
Comment 14 Robin Morisset 2021-05-18 16:27:34 PDT
Created attachment 428994 [details]
memory use of the interference graphs while visiting mruby-wasm.aotoki.dev
Comment 15 Robin Morisset 2021-05-18 19:58:33 PDT
Comment on attachment 428992 [details]
Patch

Thanks for the review!
I'm sending this to the commit queue, but I'll post more performance numbers here when I get them.
Comment 16 EWS 2021-05-18 20:24:21 PDT
Committed r277714 (237893@main): <https://commits.webkit.org/237893@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 428992 [details].
Comment 17 Radar WebKit Bug Importer 2021-05-18 20:25:17 PDT
<rdar://problem/78186936>
Comment 18 Robin Morisset 2021-05-20 13:46:05 PDT
I am starting to get numbers for this patch combined with the typo fix in https://bugs.webkit.org/show_bug.cgi?id=225977.

On an iPhone7, I've got RAMification numbers. JITTests as a whole got better by 0.63% (p<0.02). Per test, here are the statistically significant improvements, with the score improvement first (based on end and peak memory), and followed by the peak memory improvement:
- 3d-raytrace-SP: 1.03%, 2.51%
- Box2D: 1.85, 2.23
- OfflineAssembler: 11.21, 19.70
- UniPoker: 0.66, 0.83
- crypto: 0.76, 0.82
- crypto-aes-SP: 1.85, 3.58
- crypto-md5-SP: 5.46, 10.93
- crypto-sha1-SP: 1.01, 1.99
- date-format-tofte-SP: 0.99, 1.63
- date-formap-xparb-SP: 0.59, 0.77
- delta-blue: 0.94, 1.28
- gaussian-blur: 0.37, 0.37
- gcc-loops-wasm: 2.71, 4.83
- lebab-wtb: 4.15, 4.36
- body-SP: 0.49, 0.49
- quicksort-wasm: 4.24, 4.24
- raytrace: 2.97, 5.54
- regexp: 3.92, 7.67
- stanford-crypto-aes: 0.91, 1.62
- stanford-crypto-sha256: 0.12, 0.31
- tsf-wasm: not significant at the score level because of a massive outlier in end memory, 13.52% for peak memory
All of these are p<0.02 except for UniPoker which is p<0.05, and there is not one significant memory regression (there is a large one in typescript, but at p>0.1 it looks like pure noise).

As expected we see that the wins are mostly on peak memory, and are largest on those subtests which have especially large interference graphs (e.g. tsf-wasm and OfflineAssembler).
Comment 19 Robin Morisset 2021-05-21 00:18:56 PDT
Here are the numbers for JetStream2 on an iPhone 7. This patch (+ the typo fix) increases the total score by 1.27% (p<0.02). It has a significant (p<0.05) effect on the following subtests:
- FlightPlanner: regression of 3.73% !
- HashSetWasm: speedup of 7.15%
- ML: speedup of 2.80% (p<0.02)
- gaussian-blur: speedup of 3.72% (p<0.02)
- lelab-wtb: speedup of 1.40%
- n-body-SP: speedup of 2.51%
- tagcloud-SP: speedup of 23.27% (p<0.02)
- tsf-wasm: the runtime is 17.69% faster (p<0.02), but there is no significant effect on the score as a whole because of a gigantic outlier in startup time.

While I don't fully understand why some benchmarks win so much more than others, and the FlightPlanner regression is a head-scratcher, overall it looks like a very clear gain.

I don't have good numbers for other machines at this point, but it unfortunately looks like more powerful machines don't benefit as much in JetStream2, even though they get the same memory reduction.
Comment 20 Robin Morisset 2021-05-21 23:49:43 PDT
Last batch of performance numbers for this patch: I've tested Speedometer 2 on both an iPhone 7 and an M1 Mac (MacBookAir10,1).

The iPhone 7 shows a 0.53% improvement overall (p<0.02), with the following significant effects on subtests:
- 1.35% better on Elm
- 0.61% better on EmberJS-Debug
- 0.81% better on Inferno
- 1.17% better on React-Redux
- 1.05% worse (!) on Vanilla-ES2015
All of these are statistically significant at p<0.02, and there is no other result significant at p<0.05.
I conclude that while the effect is less dramatic than on JetStream2, there is still a clear benefit to minimizing the memory pressure caused by the compiler threads in that case.

Unfortunately, the picture on the M1 Mac is different: no effect on the overall score, and the only statistically significant effects on subtests are regressions:
- 0.27% worse on EmberJS-Debug (p<0.05)
- 1.58% worse on Vanilla-ES2015-Babel-WebPack (p<0.02)
- 1.20% worse on Vanilla-ES2015 (p<0.02)
I have no clue of what can explain these results.