Bug 225848

Summary: Make AirAllocateRegistersByGraphColoring use less memory
Product: WebKit Reporter: Robin Morisset <rmorisset>
Component: JavaScriptCoreAssignee: Robin Morisset <rmorisset>
Status: RESOLVED FIXED    
Severity: Normal CC: annulen, benjamin, cdumez, cmarcelo, ews-watchlist, fpizlo, gyuyoung.kim, keith_miller, mark.lam, msaboff, ryuan.choi, saam, sergio, tzagallo, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=226258
Bug Depends on:    
Bug Blocks: 225977, 226027    
Attachments:
Description Flags
WIP
ews-feeder: commit-queue-
WIP
none
Patch
ews-feeder: commit-queue-
Patch
ews-feeder: commit-queue-
Patch
ews-feeder: commit-queue-
Patch
none
Patch
none
Patch
rmorisset: review-, rmorisset: commit-queue-
Patch
none
memory use of the interference graphs in JetStream2
none
memory use of the interference graphs while visiting mruby-wasm.aotoki.dev none

Robin Morisset
Reported 2021-05-15 18:46:29 PDT
The main benefit would be to allow us to increase Options::maximumTmpsForGraphColoring without causing more jetsams. It might also help our memory usage in general, and is likely to make the register allocator somewhat faster as well.
Attachments
WIP (62.72 KB, patch)
2021-05-15 18:47 PDT, Robin Morisset
ews-feeder: commit-queue-
WIP (62.81 KB, patch)
2021-05-15 19:16 PDT, Robin Morisset
no flags
Patch (84.17 KB, patch)
2021-05-17 10:26 PDT, Robin Morisset
ews-feeder: commit-queue-
Patch (84.17 KB, patch)
2021-05-17 11:29 PDT, Robin Morisset
ews-feeder: commit-queue-
Patch (84.25 KB, patch)
2021-05-17 11:40 PDT, Robin Morisset
ews-feeder: commit-queue-
Patch (84.25 KB, patch)
2021-05-17 11:45 PDT, Robin Morisset
no flags
Patch (84.25 KB, patch)
2021-05-17 11:50 PDT, Robin Morisset
no flags
Patch (84.89 KB, patch)
2021-05-17 14:36 PDT, Robin Morisset
rmorisset: review-
rmorisset: commit-queue-
Patch (110.08 KB, patch)
2021-05-18 16:26 PDT, Robin Morisset
no flags
memory use of the interference graphs in JetStream2 (12.04 KB, text/plain)
2021-05-18 16:26 PDT, Robin Morisset
no flags
memory use of the interference graphs while visiting mruby-wasm.aotoki.dev (1.37 KB, text/plain)
2021-05-18 16:27 PDT, Robin Morisset
no flags
Robin Morisset
Comment 1 2021-05-15 18:47:07 PDT
Created attachment 428760 [details] WIP for EWS
Robin Morisset
Comment 2 2021-05-15 19:16:44 PDT
Created attachment 428764 [details] WIP Fix unused variable (and style nits)
Robin Morisset
Comment 3 2021-05-17 10:26:36 PDT
Robin Morisset
Comment 4 2021-05-17 11:29:00 PDT
Created attachment 428849 [details] Patch Fixing nits in the test file.
Robin Morisset
Comment 5 2021-05-17 11:40:22 PDT
Created attachment 428850 [details] Patch Trying to fix the test file again.
Robin Morisset
Comment 6 2021-05-17 11:45:00 PDT
Comment on attachment 428850 [details] Patch test file still has warnings
Robin Morisset
Comment 7 2021-05-17 11:45:52 PDT
Robin Morisset
Comment 8 2021-05-17 11:50:06 PDT
Robin Morisset
Comment 9 2021-05-17 14:31:56 PDT
This patch is unfortunately not enough to run mruby-wasm.aotoki.dev with an increased Options::maxTmpsForGraphColoring on a phone (it still jetsams). I've added some instrumentation and found out why: that webpage has two functions with roughly 30k Tmps (which is OK, tsf-wasm has functions that big with no issues) and >20M edges in the interference graph (which is the source of the problem, for comparison tsf-wasm only has roughly 1M edges in its largest functions). This patch reduces the memory consumption of the interference graph for each of these functions from roughly 250MB to 85MB (assuming an average occupancy rate for the relevant HashSet / SmallSet), and saves a few more MB from the rest of the datastructures in the register allocator (e.g. UseCounts). But considering that both functions are compiled in parallel, and that we keep alive during that time both the Air code and the B3 procedure (with its roughly 450k B3::Value per function..), it is not surprising that we still jetsam. I have some ideas for how to save some more memory in this case, but eventually we might just have to detect while building the interference graph when things are getting out of hand and aborting to the faster/cheaper register allocator. Oh, also this pretty dense interference graph leads to a ton of spilling: after the first iteration the number of edges is down to 3M, but the number of Tmps has increased to 300k. So it is not even clear that we are gaining that much by trying to do fancy graph coloring in this case.
Robin Morisset
Comment 10 2021-05-17 14:36:20 PDT
Robin Morisset
Comment 11 2021-05-18 10:32:54 PDT
Comment on attachment 428871 [details] Patch Removing from review as I've got a better version coming up.
Robin Morisset
Comment 12 2021-05-18 16:26:14 PDT
Created attachment 428992 [details] Patch Now with an even better datastructure for the interference graph.
Robin Morisset
Comment 13 2021-05-18 16:26:57 PDT
Created attachment 428993 [details] memory use of the interference graphs in JetStream2
Robin Morisset
Comment 14 2021-05-18 16:27:34 PDT
Created attachment 428994 [details] memory use of the interference graphs while visiting mruby-wasm.aotoki.dev
Robin Morisset
Comment 15 2021-05-18 19:58:33 PDT
Comment on attachment 428992 [details] Patch Thanks for the review! I'm sending this to the commit queue, but I'll post more performance numbers here when I get them.
EWS
Comment 16 2021-05-18 20:24:21 PDT
Committed r277714 (237893@main): <https://commits.webkit.org/237893@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 428992 [details].
Radar WebKit Bug Importer
Comment 17 2021-05-18 20:25:17 PDT
Robin Morisset
Comment 18 2021-05-20 13:46:05 PDT
I am starting to get numbers for this patch combined with the typo fix in https://bugs.webkit.org/show_bug.cgi?id=225977. On an iPhone7, I've got RAMification numbers. JITTests as a whole got better by 0.63% (p<0.02). Per test, here are the statistically significant improvements, with the score improvement first (based on end and peak memory), and followed by the peak memory improvement: - 3d-raytrace-SP: 1.03%, 2.51% - Box2D: 1.85, 2.23 - OfflineAssembler: 11.21, 19.70 - UniPoker: 0.66, 0.83 - crypto: 0.76, 0.82 - crypto-aes-SP: 1.85, 3.58 - crypto-md5-SP: 5.46, 10.93 - crypto-sha1-SP: 1.01, 1.99 - date-format-tofte-SP: 0.99, 1.63 - date-formap-xparb-SP: 0.59, 0.77 - delta-blue: 0.94, 1.28 - gaussian-blur: 0.37, 0.37 - gcc-loops-wasm: 2.71, 4.83 - lebab-wtb: 4.15, 4.36 - body-SP: 0.49, 0.49 - quicksort-wasm: 4.24, 4.24 - raytrace: 2.97, 5.54 - regexp: 3.92, 7.67 - stanford-crypto-aes: 0.91, 1.62 - stanford-crypto-sha256: 0.12, 0.31 - tsf-wasm: not significant at the score level because of a massive outlier in end memory, 13.52% for peak memory All of these are p<0.02 except for UniPoker which is p<0.05, and there is not one significant memory regression (there is a large one in typescript, but at p>0.1 it looks like pure noise). As expected we see that the wins are mostly on peak memory, and are largest on those subtests which have especially large interference graphs (e.g. tsf-wasm and OfflineAssembler).
Robin Morisset
Comment 19 2021-05-21 00:18:56 PDT
Here are the numbers for JetStream2 on an iPhone 7. This patch (+ the typo fix) increases the total score by 1.27% (p<0.02). It has a significant (p<0.05) effect on the following subtests: - FlightPlanner: regression of 3.73% ! - HashSetWasm: speedup of 7.15% - ML: speedup of 2.80% (p<0.02) - gaussian-blur: speedup of 3.72% (p<0.02) - lelab-wtb: speedup of 1.40% - n-body-SP: speedup of 2.51% - tagcloud-SP: speedup of 23.27% (p<0.02) - tsf-wasm: the runtime is 17.69% faster (p<0.02), but there is no significant effect on the score as a whole because of a gigantic outlier in startup time. While I don't fully understand why some benchmarks win so much more than others, and the FlightPlanner regression is a head-scratcher, overall it looks like a very clear gain. I don't have good numbers for other machines at this point, but it unfortunately looks like more powerful machines don't benefit as much in JetStream2, even though they get the same memory reduction.
Robin Morisset
Comment 20 2021-05-21 23:49:43 PDT
Last batch of performance numbers for this patch: I've tested Speedometer 2 on both an iPhone 7 and an M1 Mac (MacBookAir10,1). The iPhone 7 shows a 0.53% improvement overall (p<0.02), with the following significant effects on subtests: - 1.35% better on Elm - 0.61% better on EmberJS-Debug - 0.81% better on Inferno - 1.17% better on React-Redux - 1.05% worse (!) on Vanilla-ES2015 All of these are statistically significant at p<0.02, and there is no other result significant at p<0.05. I conclude that while the effect is less dramatic than on JetStream2, there is still a clear benefit to minimizing the memory pressure caused by the compiler threads in that case. Unfortunately, the picture on the M1 Mac is different: no effect on the overall score, and the only statistically significant effects on subtests are regressions: - 0.27% worse on EmberJS-Debug (p<0.05) - 1.58% worse on Vanilla-ES2015-Babel-WebPack (p<0.02) - 1.20% worse on Vanilla-ES2015 (p<0.02) I have no clue of what can explain these results.
Note You need to log in before you can comment on or make changes to this bug.