225848 – Make AirAllocateRegistersByGraphColoring use less memory

RESOLVED FIXED 225848

Make AirAllocateRegistersByGraphColoring use less memory

https://bugs.webkit.org/show_bug.cgi?id=225848

Summary Make AirAllocateRegistersByGraphColoring use less memory

Robin Morisset

Reported 2021-05-15 18:46:29 PDT

The main benefit would be to allow us to increase Options::maximumTmpsForGraphColoring without causing more jetsams. It might also help our memory usage in general, and is likely to make the register allocator somewhat faster as well.

Attachments
WIP (62.72 KB, patch) 2021-05-15 18:47 PDT, Robin Morisset	ews-feeder: commit-queue-	Details Formatted Diff Diff
WIP (62.81 KB, patch) 2021-05-15 19:16 PDT, Robin Morisset	no flags	Details Formatted Diff Diff
Patch (84.17 KB, patch) 2021-05-17 10:26 PDT, Robin Morisset	ews-feeder: commit-queue-	Details Formatted Diff Diff
Patch (84.17 KB, patch) 2021-05-17 11:29 PDT, Robin Morisset	ews-feeder: commit-queue-	Details Formatted Diff Diff
Patch (84.25 KB, patch) 2021-05-17 11:40 PDT, Robin Morisset	ews-feeder: commit-queue-	Details Formatted Diff Diff
Patch (84.25 KB, patch) 2021-05-17 11:45 PDT, Robin Morisset	no flags	Details Formatted Diff Diff
Patch (84.25 KB, patch) 2021-05-17 11:50 PDT, Robin Morisset	no flags	Details Formatted Diff Diff
Patch (84.89 KB, patch) 2021-05-17 14:36 PDT, Robin Morisset	rmorisset: review- rmorisset: commit-queue-	Details Formatted Diff Diff
Patch (110.08 KB, patch) 2021-05-18 16:26 PDT, Robin Morisset	no flags	Details Formatted Diff Diff
memory use of the interference graphs in JetStream2 (12.04 KB, text/plain) 2021-05-18 16:26 PDT, Robin Morisset	no flags	Details
memory use of the interference graphs while visiting mruby-wasm.aotoki.dev (1.37 KB, text/plain) 2021-05-18 16:27 PDT, Robin Morisset	no flags	Details
Show Obsolete (8) View All Add attachment proposed patch, testcase, etc.

Robin Morisset

Comment 1 2021-05-15 18:47:07 PDT

Created attachment 428760 [details] WIP for EWS

Robin Morisset

Comment 2 2021-05-15 19:16:44 PDT

Created attachment 428764 [details] WIP Fix unused variable (and style nits)

Robin Morisset

Comment 3 2021-05-17 10:26:36 PDT

Created attachment 428844 [details] Patch

Robin Morisset

Comment 4 2021-05-17 11:29:00 PDT

Created attachment 428849 [details] Patch Fixing nits in the test file.

Robin Morisset

Comment 5 2021-05-17 11:40:22 PDT

Created attachment 428850 [details] Patch Trying to fix the test file again.

Robin Morisset

Comment 6 2021-05-17 11:45:00 PDT

Comment on attachment 428850 [details] Patch test file still has warnings

Robin Morisset

Comment 7 2021-05-17 11:45:52 PDT

Created attachment 428852 [details] Patch

Robin Morisset

Comment 8 2021-05-17 11:50:06 PDT

Created attachment 428853 [details] Patch

Robin Morisset

Comment 9 2021-05-17 14:31:56 PDT

This patch is unfortunately not enough to run mruby-wasm.aotoki.dev with an increased Options::maxTmpsForGraphColoring on a phone (it still jetsams). I've added some instrumentation and found out why: that webpage has two functions with roughly 30k Tmps (which is OK, tsf-wasm has functions that big with no issues) and >20M edges in the interference graph (which is the source of the problem, for comparison tsf-wasm only has roughly 1M edges in its largest functions). This patch reduces the memory consumption of the interference graph for each of these functions from roughly 250MB to 85MB (assuming an average occupancy rate for the relevant HashSet / SmallSet), and saves a few more MB from the rest of the datastructures in the register allocator (e.g. UseCounts). But considering that both functions are compiled in parallel, and that we keep alive during that time both the Air code and the B3 procedure (with its roughly 450k B3::Value per function..), it is not surprising that we still jetsam. I have some ideas for how to save some more memory in this case, but eventually we might just have to detect while building the interference graph when things are getting out of hand and aborting to the faster/cheaper register allocator. Oh, also this pretty dense interference graph leads to a ton of spilling: after the first iteration the number of edges is down to 3M, but the number of Tmps has increased to 300k. So it is not even clear that we are gaining that much by trying to do fancy graph coloring in this case.

Robin Morisset

Comment 10 2021-05-17 14:36:20 PDT

Created attachment 428871 [details] Patch

Robin Morisset

Comment 11 2021-05-18 10:32:54 PDT

Comment on attachment 428871 [details] Patch Removing from review as I've got a better version coming up.

Robin Morisset

Comment 12 2021-05-18 16:26:14 PDT

Created attachment 428992 [details] Patch Now with an even better datastructure for the interference graph.

Robin Morisset

Comment 13 2021-05-18 16:26:57 PDT

Created attachment 428993 [details] memory use of the interference graphs in JetStream2

Robin Morisset

Comment 14 2021-05-18 16:27:34 PDT

Created attachment 428994 [details] memory use of the interference graphs while visiting mruby-wasm.aotoki.dev

Robin Morisset

Comment 15 2021-05-18 19:58:33 PDT

Comment on attachment 428992 [details] Patch Thanks for the review! I'm sending this to the commit queue, but I'll post more performance numbers here when I get them.

EWS

Comment 16 2021-05-18 20:24:21 PDT

Committed r277714 (237893@main): <https://commits.webkit.org/237893@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 428992 [details].

Radar WebKit Bug Importer

Comment 17 2021-05-18 20:25:17 PDT

<rdar://problem/78186936>

Robin Morisset

Comment 18 2021-05-20 13:46:05 PDT

I am starting to get numbers for this patch combined with the typo fix in https://bugs.webkit.org/show_bug.cgi?id=225977. On an iPhone7, I've got RAMification numbers. JITTests as a whole got better by 0.63% (p<0.02). Per test, here are the statistically significant improvements, with the score improvement first (based on end and peak memory), and followed by the peak memory improvement: - 3d-raytrace-SP: 1.03%, 2.51% - Box2D: 1.85, 2.23 - OfflineAssembler: 11.21, 19.70 - UniPoker: 0.66, 0.83 - crypto: 0.76, 0.82 - crypto-aes-SP: 1.85, 3.58 - crypto-md5-SP: 5.46, 10.93 - crypto-sha1-SP: 1.01, 1.99 - date-format-tofte-SP: 0.99, 1.63 - date-formap-xparb-SP: 0.59, 0.77 - delta-blue: 0.94, 1.28 - gaussian-blur: 0.37, 0.37 - gcc-loops-wasm: 2.71, 4.83 - lebab-wtb: 4.15, 4.36 - body-SP: 0.49, 0.49 - quicksort-wasm: 4.24, 4.24 - raytrace: 2.97, 5.54 - regexp: 3.92, 7.67 - stanford-crypto-aes: 0.91, 1.62 - stanford-crypto-sha256: 0.12, 0.31 - tsf-wasm: not significant at the score level because of a massive outlier in end memory, 13.52% for peak memory All of these are p<0.02 except for UniPoker which is p<0.05, and there is not one significant memory regression (there is a large one in typescript, but at p>0.1 it looks like pure noise). As expected we see that the wins are mostly on peak memory, and are largest on those subtests which have especially large interference graphs (e.g. tsf-wasm and OfflineAssembler).

Robin Morisset

Comment 19 2021-05-21 00:18:56 PDT

Here are the numbers for JetStream2 on an iPhone 7. This patch (+ the typo fix) increases the total score by 1.27% (p<0.02). It has a significant (p<0.05) effect on the following subtests: - FlightPlanner: regression of 3.73% ! - HashSetWasm: speedup of 7.15% - ML: speedup of 2.80% (p<0.02) - gaussian-blur: speedup of 3.72% (p<0.02) - lelab-wtb: speedup of 1.40% - n-body-SP: speedup of 2.51% - tagcloud-SP: speedup of 23.27% (p<0.02) - tsf-wasm: the runtime is 17.69% faster (p<0.02), but there is no significant effect on the score as a whole because of a gigantic outlier in startup time. While I don't fully understand why some benchmarks win so much more than others, and the FlightPlanner regression is a head-scratcher, overall it looks like a very clear gain. I don't have good numbers for other machines at this point, but it unfortunately looks like more powerful machines don't benefit as much in JetStream2, even though they get the same memory reduction.

Robin Morisset

Comment 20 2021-05-21 23:49:43 PDT

Last batch of performance numbers for this patch: I've tested Speedometer 2 on both an iPhone 7 and an M1 Mac (MacBookAir10,1). The iPhone 7 shows a 0.53% improvement overall (p<0.02), with the following significant effects on subtests: - 1.35% better on Elm - 0.61% better on EmberJS-Debug - 0.81% better on Inferno - 1.17% better on React-Redux - 1.05% worse (!) on Vanilla-ES2015 All of these are statistically significant at p<0.02, and there is no other result significant at p<0.05. I conclude that while the effect is less dramatic than on JetStream2, there is still a clear benefit to minimizing the memory pressure caused by the compiler threads in that case. Unfortunately, the picture on the M1 Mac is different: no effect on the overall score, and the only statistically significant effects on subtests are regressions: - 0.27% worse on EmberJS-Debug (p<0.05) - 1.58% worse on Vanilla-ES2015-Babel-WebPack (p<0.02) - 1.20% worse on Vanilla-ES2015 (p<0.02) I have no clue of what can explain these results.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component JavaScriptCore

Assignee

Robin Morisset

Reported

2021-05-15 18:46 PDT

Modified

2021-05-25 23:58 PDT History

CC List

15 users Show

URL

Keywords InRadar

Depends on

Blocks

225977 226027

Dependencies

tree graph