Summary: | Using WASM function size as the cap for choosing a register allocator causes performance regressions. | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Alekandr Guryanov <caiiiycuk> | ||||||||||||||||||
Component: | WebAssembly | Assignee: | Robin Morisset <rmorisset> | ||||||||||||||||||
Status: | RESOLVED FIXED | ||||||||||||||||||||
Severity: | Normal | CC: | alonzakai, bfulgham, chi187, ews-watchlist, fpizlo, keith_miller, mark.lam, msaboff, rmorisset, saam, tzagallo, webkit-bug-importer, ysuzuki | ||||||||||||||||||
Priority: | P2 | Keywords: | InRadar | ||||||||||||||||||
Version: | Safari Technology Preview | ||||||||||||||||||||
Hardware: | Unspecified | ||||||||||||||||||||
OS: | Unspecified | ||||||||||||||||||||
Attachments: |
|
Description
Alekandr Guryanov
2020-10-04 01:29:54 PDT
Created attachment 410456 [details]
Safari 14.0.0
Created attachment 410457 [details]
Safari 14.0.1
Created attachment 410458 [details]
Safari TP
Minimal test: https://js-dos.com/v7/build/examples/dhry2.html There are 2 buttons: 1) Start worker - run test in worker 2) Start direct - run test without worker Results are nearly same. Thanks for your report. Will try to reproduce My guess is that this is related to: https://bugs.webkit.org/show_bug.cgi?id=212105 (In reply to Keith Miller from comment #7) > My guess is that this is related to: > https://bugs.webkit.org/show_bug.cgi?id=212105 I was able to bisect this to d8a0fb4bf380350af94c8cf7c399471e1186b93c..4fbc651e99458aeb44610e57d25d88fe7c1c498a and that bug looks like the most likely culprit. Created attachment 412207 [details]
Patch
Comment on attachment 412207 [details]
Patch
r=me
Committed r268942: <https://trac.webkit.org/changeset/268942> All reviewed patches have been landed. Closing bug and clearing flags on attachment 412207 [details]. Comment on attachment 412207 [details]
Patch
Can we get a microbenchmark for this?
(In reply to Saam Barati from comment #12) > Comment on attachment 412207 [details] > Patch > > Can we get a microbenchmark for this? I could try to create one but it might not be worth the time. Alekandr, would you be willing to let us copy your Wasm module into our repo as a test case? (In reply to Keith Miller from comment #13) > (In reply to Saam Barati from comment #12) > > Comment on attachment 412207 [details] > > Patch > > > > Can we get a microbenchmark for this? > > I could try to create one but it might not be worth the time. Alekandr, > would you be willing to let us copy your Wasm module into our repo as a test > case? Sure! Do you need only a wasm module? If you want I can create an all-in-one wasm file with dhry2 test embedded, that you need to simply run. (In reply to Alekandr Guryanov from comment #14) > (In reply to Keith Miller from comment #13) > > (In reply to Saam Barati from comment #12) > > > Comment on attachment 412207 [details] > > > Patch > > > > > > Can we get a microbenchmark for this? > > > > I could try to create one but it might not be worth the time. Alekandr, > > would you be willing to let us copy your Wasm module into our repo as a test > > case? > > Sure! Do you need only a wasm module? If you want I can create an all-in-one > wasm file with dhry2 test embedded, that you need to simply run. Sure, if it's not too much work, an all-in-one is probably a bit easier to work with. > Sure, if it's not too much work, an all-in-one is probably a bit easier to
> work with.
Thanks!
Created attachment 412294 [details]
all-in-one-test
This is a wasm file with an embedded dhry2 test (function names are not stripped).
Even on Safari 14.4 performance is still very bad. Created attachment 421600 [details]
Performance on Safari 14.4
Sorry for taking so long to come back to this bug. The problem is that 25k for the option --maximumTmpsForGraphColoring is too small for some benchmarks (including the Dhry2 in js-dosbox example in the original comment), causing the performance regressions; but anything significantly above 25k caused jetsams on some other webpages (which was the original reason for this limit to exist). The jetsams were coming from two data-structures: the interference graph used by AirAllocateRegistersByGraphColoring.cpp and the interference graph used by AirAllocateStackByGraphColoring.cpp. I massively reduced their respective memory consumption in: - https://trac.webkit.org/changeset/277714/webkit - https://trac.webkit.org/changeset/278186/webkit One of the pages that used to jetsam with a larger value of --maximumTmpsForGraphColoring was mruby-wasm.aotoki.dev: - the first of the above patches reduced the memory consumption of the register allocator from 2 x 262MB to 2 x 20MB (2x because of trying to compile two functions in parallel) - the second patch reduced the memory consumption of the stack allocator on the same functions from 221MB + 399MB to 6.5MB + 10.5MB Since I saw similar reductions on every benchmark I tried, I believed it is now safe to try to increase --maximumTmpsForGraphColoring. I verified on the example you provided, there are exactly two functions with > 25k temporaries (54845 and 54856). Increasing --maximumTmpsForGraphColoring from 25k to 60k increased the "VAX" score on my machine (ToT minibrowser on an M1 MBP 2020) from about 27 to about 70. Created attachment 443607 [details]
Patch
Comment on attachment 443607 [details]
Patch
r=me
(In reply to Robin Morisset from comment #20) > I verified on the example you provided, there are exactly two functions with > > 25k temporaries (54845 and 54856). Increasing > --maximumTmpsForGraphColoring from 25k to 60k increased the "VAX" score on > my machine (ToT minibrowser on an M1 MBP 2020) from about 27 to about 70. So this also means if I split this two functions into smaller somehow (like split them) it will increase performance on same value, correct? Comment on attachment 443607 [details]
Patch
Thanks Yusuke for the review.
(In reply to Alekandr Guryanov from comment #23) > (In reply to Robin Morisset from comment #20) > > > I verified on the example you provided, there are exactly two functions with > > > 25k temporaries (54845 and 54856). Increasing > > --maximumTmpsForGraphColoring from 25k to 60k increased the "VAX" score on > > my machine (ToT minibrowser on an M1 MBP 2020) from about 27 to about 70. > > So this also means if I split this two functions into smaller somehow (like > split them) it will increase performance on same value, correct? Correct. If you want to improve the performance on the currently shipping version of Safari, keeping these functions split should help a lot. I believe these giant functions typically occur because of very aggressive inlining by LLVM/Emscripten, so preventing some of it somehow might be the easiest way to achieve this. Committed r285533 (244050@main): <https://commits.webkit.org/244050@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 443607 [details]. This change should be present in STP 139, iOS 15.4 Beta, and macOS 12.3 Beta. |