| Summary: | Make AirAllocateRegistersByGraphColoring use less memory | ||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | WebKit | Reporter: | Robin Morisset <rmorisset> | ||||||||||||||||||||||||
| Component: | JavaScriptCore | Assignee: | Robin Morisset <rmorisset> | ||||||||||||||||||||||||
| Status: | RESOLVED FIXED | ||||||||||||||||||||||||||
| Severity: | Normal | CC: | annulen, benjamin, cdumez, cmarcelo, ews-watchlist, fpizlo, gyuyoung.kim, keith_miller, mark.lam, msaboff, ryuan.choi, saam, sergio, tzagallo, webkit-bug-importer | ||||||||||||||||||||||||
| Priority: | P2 | Keywords: | InRadar | ||||||||||||||||||||||||
| Version: | WebKit Nightly Build | ||||||||||||||||||||||||||
| Hardware: | Unspecified | ||||||||||||||||||||||||||
| OS: | Unspecified | ||||||||||||||||||||||||||
| See Also: | https://bugs.webkit.org/show_bug.cgi?id=226258 | ||||||||||||||||||||||||||
| Bug Depends on: | |||||||||||||||||||||||||||
| Bug Blocks: | 225977, 226027 | ||||||||||||||||||||||||||
| Attachments: |
|
||||||||||||||||||||||||||
|
Description
Robin Morisset
2021-05-15 18:46:29 PDT
Created attachment 428760 [details]
WIP
for EWS
Created attachment 428764 [details]
WIP
Fix unused variable (and style nits)
Created attachment 428844 [details]
Patch
Created attachment 428849 [details]
Patch
Fixing nits in the test file.
Created attachment 428850 [details]
Patch
Trying to fix the test file again.
Comment on attachment 428850 [details]
Patch
test file still has warnings
Created attachment 428852 [details]
Patch
Created attachment 428853 [details]
Patch
This patch is unfortunately not enough to run mruby-wasm.aotoki.dev with an increased Options::maxTmpsForGraphColoring on a phone (it still jetsams). I've added some instrumentation and found out why: that webpage has two functions with roughly 30k Tmps (which is OK, tsf-wasm has functions that big with no issues) and >20M edges in the interference graph (which is the source of the problem, for comparison tsf-wasm only has roughly 1M edges in its largest functions). This patch reduces the memory consumption of the interference graph for each of these functions from roughly 250MB to 85MB (assuming an average occupancy rate for the relevant HashSet / SmallSet), and saves a few more MB from the rest of the datastructures in the register allocator (e.g. UseCounts). But considering that both functions are compiled in parallel, and that we keep alive during that time both the Air code and the B3 procedure (with its roughly 450k B3::Value per function..), it is not surprising that we still jetsam. I have some ideas for how to save some more memory in this case, but eventually we might just have to detect while building the interference graph when things are getting out of hand and aborting to the faster/cheaper register allocator. Oh, also this pretty dense interference graph leads to a ton of spilling: after the first iteration the number of edges is down to 3M, but the number of Tmps has increased to 300k. So it is not even clear that we are gaining that much by trying to do fancy graph coloring in this case. Created attachment 428871 [details]
Patch
Comment on attachment 428871 [details]
Patch
Removing from review as I've got a better version coming up.
Created attachment 428992 [details]
Patch
Now with an even better datastructure for the interference graph.
Created attachment 428993 [details]
memory use of the interference graphs in JetStream2
Created attachment 428994 [details]
memory use of the interference graphs while visiting mruby-wasm.aotoki.dev
Comment on attachment 428992 [details]
Patch
Thanks for the review!
I'm sending this to the commit queue, but I'll post more performance numbers here when I get them.
Committed r277714 (237893@main): <https://commits.webkit.org/237893@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 428992 [details]. I am starting to get numbers for this patch combined with the typo fix in https://bugs.webkit.org/show_bug.cgi?id=225977. On an iPhone7, I've got RAMification numbers. JITTests as a whole got better by 0.63% (p<0.02). Per test, here are the statistically significant improvements, with the score improvement first (based on end and peak memory), and followed by the peak memory improvement: - 3d-raytrace-SP: 1.03%, 2.51% - Box2D: 1.85, 2.23 - OfflineAssembler: 11.21, 19.70 - UniPoker: 0.66, 0.83 - crypto: 0.76, 0.82 - crypto-aes-SP: 1.85, 3.58 - crypto-md5-SP: 5.46, 10.93 - crypto-sha1-SP: 1.01, 1.99 - date-format-tofte-SP: 0.99, 1.63 - date-formap-xparb-SP: 0.59, 0.77 - delta-blue: 0.94, 1.28 - gaussian-blur: 0.37, 0.37 - gcc-loops-wasm: 2.71, 4.83 - lebab-wtb: 4.15, 4.36 - body-SP: 0.49, 0.49 - quicksort-wasm: 4.24, 4.24 - raytrace: 2.97, 5.54 - regexp: 3.92, 7.67 - stanford-crypto-aes: 0.91, 1.62 - stanford-crypto-sha256: 0.12, 0.31 - tsf-wasm: not significant at the score level because of a massive outlier in end memory, 13.52% for peak memory All of these are p<0.02 except for UniPoker which is p<0.05, and there is not one significant memory regression (there is a large one in typescript, but at p>0.1 it looks like pure noise). As expected we see that the wins are mostly on peak memory, and are largest on those subtests which have especially large interference graphs (e.g. tsf-wasm and OfflineAssembler). Here are the numbers for JetStream2 on an iPhone 7. This patch (+ the typo fix) increases the total score by 1.27% (p<0.02). It has a significant (p<0.05) effect on the following subtests: - FlightPlanner: regression of 3.73% ! - HashSetWasm: speedup of 7.15% - ML: speedup of 2.80% (p<0.02) - gaussian-blur: speedup of 3.72% (p<0.02) - lelab-wtb: speedup of 1.40% - n-body-SP: speedup of 2.51% - tagcloud-SP: speedup of 23.27% (p<0.02) - tsf-wasm: the runtime is 17.69% faster (p<0.02), but there is no significant effect on the score as a whole because of a gigantic outlier in startup time. While I don't fully understand why some benchmarks win so much more than others, and the FlightPlanner regression is a head-scratcher, overall it looks like a very clear gain. I don't have good numbers for other machines at this point, but it unfortunately looks like more powerful machines don't benefit as much in JetStream2, even though they get the same memory reduction. Last batch of performance numbers for this patch: I've tested Speedometer 2 on both an iPhone 7 and an M1 Mac (MacBookAir10,1). The iPhone 7 shows a 0.53% improvement overall (p<0.02), with the following significant effects on subtests: - 1.35% better on Elm - 0.61% better on EmberJS-Debug - 0.81% better on Inferno - 1.17% better on React-Redux - 1.05% worse (!) on Vanilla-ES2015 All of these are statistically significant at p<0.02, and there is no other result significant at p<0.05. I conclude that while the effect is less dramatic than on JetStream2, there is still a clear benefit to minimizing the memory pressure caused by the compiler threads in that case. Unfortunately, the picture on the M1 Mac is different: no effect on the overall score, and the only statistically significant effects on subtests are regressions: - 0.27% worse on EmberJS-Debug (p<0.05) - 1.58% worse on Vanilla-ES2015-Babel-WebPack (p<0.02) - 1.20% worse on Vanilla-ES2015 (p<0.02) I have no clue of what can explain these results. |