Bug 217290

Summary: Using WASM function size as the cap for choosing a register allocator causes performance regressions.
Product: WebKit Reporter: Alekandr Guryanov <caiiiycuk>
Component: WebAssemblyAssignee: Robin Morisset <rmorisset>
Status: RESOLVED FIXED    
Severity: Normal CC: alonzakai, bfulgham, chi187, ews-watchlist, fpizlo, keith_miller, mark.lam, msaboff, rmorisset, saam, tzagallo, webkit-bug-importer, ysuzuki
Priority: P2 Keywords: InRadar
Version: Safari Technology Preview   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Safari 13.1.1
none
Safari 14.0.0
none
Safari 14.0.1
none
Safari TP
none
Patch
none
all-in-one-test
none
Performance on Safari 14.4
none
Patch none

Description Alekandr Guryanov 2020-10-04 01:29:54 PDT
Created attachment 410455 [details]
Safari 13.1.1

Hi. I am the author of the js-dos project (dosbox in js). After updating to Safari 14 I see a dramatic performance loss in js-dos dhry2 test. It reproduced on all devices: iphone 6, iphone 7, iphone 8, iphone x, macos catalina. All Safari starting from 14 affected (I tested 14.0, 14.0.1 and Technology Preview).

The test is a variation of dhry 2 test for dos. It runs inside js-dosbox and reports operation count. Time is measured with Date.now(). So it only does math operations and estimate time. 

You can run this test using link. 
https://dos.zone/en/play/https%3A%2F%2Fdoszone-uploads.s3.dualstack.eu-central-1.amazonaws.com%2Foriginal%2F2X%2Fb%2Fb4b5275904d86a4ab8a20917b2b7e34f0df47bf7.jsdos

My results are in attached screenshots.

Results from (MacBook Pro (13-inch, 2017, Two Thunderbolt 3 ports, Catalina 10.15.17):
Safari 13.1.1 - (640_000, VAX: 48)
Safari 14.0.0 - (80_000, VAX: 5.5)
Safari 14.0.1 - (80_000, VAX: 5.5)
Safari TP     - (80_000, VAX: 5.9)
Firefox       - (640_000, VAX: 36.9)

--
I also tried to test in recent builds of webkit (267874, 267919, 267940). But, always have this error:
MacBook-Pro-Alexander:~ caiiiycuk$ /Users/caiiiycuk/Downloads/267940/run-webkit-archive ; exit;
Setting DYLD FRAMEWORK and LIBRARY paths to /Users/caiiiycuk/Downloads/267940/Release
2020-10-04 15:21:50.936 SafariForWebKitDevelopment[4346:301066] -[BrowserWKView _setInspectorDelegate:]: unrecognized selector sent to instance 0x7fef794c9c30
2020-10-04 15:21:50.936 SafariForWebKitDevelopment[4346:301066] -[BrowserWKView _setInspectorDelegate:]: unrecognized selector sent to instance 0x7fef794c9c30
2020-10-04 15:21:50.947 SafariForWebKitDevelopment[4346:301066] .sdef warning for argument 'FileType' of command 'save' in suite 'Standard Suite': 'saveable file format' is not a valid type name.
2020-10-04 15:21:50.995 SafariForWebKitDevelopment[4346:301134] +[NSXPCSharedListener endpointForReply:withListenerName:]: an error occurred while attempting to obtain endpoint for listener 'SandboxBroker': Connection invalid
2020-10-04 15:21:51.083 com.apple.WebKit.WebContent.Development[4348:301125] Error loading /Library/Apple/System/Library/StagedFrameworks/Safari/Safari.framework/Safari:  dlopen(/Library/Apple/System/Library/StagedFrameworks/Safari/Safari.framework/Safari, 265): Symbol not found: _OBJC_CLASS_$_WBSContentBlockerStatisticsSQLiteStore
  Referenced from: /Library/Apple/System/Library/StagedFrameworks/Safari/SafariSharedUI.framework/Versions/A/SafariSharedUI
  Expected in: /System/Library/PrivateFrameworks/SafariShared.framework/Versions/A/SafariShared
 in /Library/Apple/System/Library/StagedFrameworks/Safari/SafariSharedUI.framework/Versions/A/SafariSharedUI
InjectedBundle::load failed - Could not load the executable from the bundle.
2020-10-04 15:22:01.457 SafariForWebKitDevelopment[4346:301066] -[BrowserWKView _setInspectorDelegate:]: unrecognized selector sent to instance 0x7fef7d529cc0
Comment 1 Alekandr Guryanov 2020-10-04 01:30:24 PDT
Created attachment 410456 [details]
Safari 14.0.0
Comment 2 Alekandr Guryanov 2020-10-04 01:30:49 PDT
Created attachment 410457 [details]
Safari 14.0.1
Comment 3 Alekandr Guryanov 2020-10-04 01:31:12 PDT
Created attachment 410458 [details]
Safari TP
Comment 4 Alekandr Guryanov 2020-10-04 02:14:27 PDT
Minimal test:
https://js-dos.com/v7/build/examples/dhry2.html

There are 2 buttons:
1) Start worker - run test in worker
2) Start direct - run test without worker

Results are nearly same.
Comment 5 Radar WebKit Bug Importer 2020-10-04 14:23:54 PDT
<rdar://problem/69934870>
Comment 6 Saam Barati 2020-10-15 17:06:41 PDT
Thanks for your report. Will try to reproduce
Comment 7 Keith Miller 2020-10-23 11:16:18 PDT
My guess is that this is related to: https://bugs.webkit.org/show_bug.cgi?id=212105
Comment 8 Keith Miller 2020-10-23 11:29:13 PDT
(In reply to Keith Miller from comment #7)
> My guess is that this is related to:
> https://bugs.webkit.org/show_bug.cgi?id=212105

I was able to bisect this to d8a0fb4bf380350af94c8cf7c399471e1186b93c..4fbc651e99458aeb44610e57d25d88fe7c1c498a and that bug looks like the most likely culprit.
Comment 9 Keith Miller 2020-10-23 13:22:01 PDT
Created attachment 412207 [details]
Patch
Comment 10 Michael Saboff 2020-10-23 13:36:58 PDT
Comment on attachment 412207 [details]
Patch

r=me
Comment 11 EWS 2020-10-23 15:53:34 PDT
Committed r268942: <https://trac.webkit.org/changeset/268942>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 412207 [details].
Comment 12 Saam Barati 2020-10-23 17:50:39 PDT
Comment on attachment 412207 [details]
Patch

Can we get a microbenchmark for this?
Comment 13 Keith Miller 2020-10-23 20:29:03 PDT
(In reply to Saam Barati from comment #12)
> Comment on attachment 412207 [details]
> Patch
> 
> Can we get a microbenchmark for this?

I could try to create one but it might not be worth the time. Alekandr, would you be willing to let us copy your Wasm module into our repo as a test case?
Comment 14 Alekandr Guryanov 2020-10-24 06:00:45 PDT
(In reply to Keith Miller from comment #13)
> (In reply to Saam Barati from comment #12)
> > Comment on attachment 412207 [details]
> > Patch
> > 
> > Can we get a microbenchmark for this?
> 
> I could try to create one but it might not be worth the time. Alekandr,
> would you be willing to let us copy your Wasm module into our repo as a test
> case?

Sure! Do you need only a wasm module? If you want I can create an all-in-one wasm file with dhry2 test embedded, that you need to simply run.
Comment 15 Keith Miller 2020-10-24 21:13:35 PDT
(In reply to Alekandr Guryanov from comment #14)
> (In reply to Keith Miller from comment #13)
> > (In reply to Saam Barati from comment #12)
> > > Comment on attachment 412207 [details]
> > > Patch
> > > 
> > > Can we get a microbenchmark for this?
> > 
> > I could try to create one but it might not be worth the time. Alekandr,
> > would you be willing to let us copy your Wasm module into our repo as a test
> > case?
> 
> Sure! Do you need only a wasm module? If you want I can create an all-in-one
> wasm file with dhry2 test embedded, that you need to simply run.

Sure, if it's not too much work, an all-in-one is probably a bit easier to work with.
Comment 16 Keith Miller 2020-10-24 21:13:54 PDT
> Sure, if it's not too much work, an all-in-one is probably a bit easier to
> work with.

Thanks!
Comment 17 Alekandr Guryanov 2020-10-26 02:02:43 PDT
Created attachment 412294 [details]
all-in-one-test

This is a wasm file with an embedded dhry2 test (function names are not stripped).
Comment 18 Alekandr Guryanov 2021-02-25 19:25:12 PST
Even on Safari 14.4 performance is still very bad.
Comment 19 Alekandr Guryanov 2021-02-25 19:30:35 PST
Created attachment 421600 [details]
Performance on Safari 14.4
Comment 20 Robin Morisset 2021-11-08 14:01:38 PST
Sorry for taking so long to come back to this bug.

The problem is that 25k for the option --maximumTmpsForGraphColoring is too small for some benchmarks (including the Dhry2 in js-dosbox example in the original comment), causing the performance regressions; but anything significantly above 25k caused jetsams on some other webpages (which was the original reason for this limit to exist).

The jetsams were coming from two data-structures: the interference graph used by AirAllocateRegistersByGraphColoring.cpp and the interference graph used by AirAllocateStackByGraphColoring.cpp.

I massively reduced their respective memory consumption in:
- https://trac.webkit.org/changeset/277714/webkit
- https://trac.webkit.org/changeset/278186/webkit

One of the pages that used to jetsam with a larger value of --maximumTmpsForGraphColoring was mruby-wasm.aotoki.dev:
- the first of the above patches reduced the memory consumption of the register allocator from 2 x 262MB to 2 x 20MB (2x because of trying to compile two functions in parallel)
- the second patch reduced the memory consumption of the stack allocator on the same functions from 221MB + 399MB to 6.5MB + 10.5MB

Since I saw similar reductions on every benchmark I tried, I believed it is now safe to try to increase --maximumTmpsForGraphColoring.

I verified on the example you provided, there are exactly two functions with > 25k temporaries (54845 and 54856). Increasing --maximumTmpsForGraphColoring from 25k to 60k increased the "VAX" score on my machine (ToT minibrowser on an M1 MBP 2020) from about 27 to about 70.
Comment 21 Robin Morisset 2021-11-08 14:17:23 PST
Created attachment 443607 [details]
Patch
Comment 22 Yusuke Suzuki 2021-11-08 15:39:53 PST
Comment on attachment 443607 [details]
Patch

r=me
Comment 23 Alekandr Guryanov 2021-11-08 22:01:52 PST
(In reply to Robin Morisset from comment #20)

> I verified on the example you provided, there are exactly two functions with
> > 25k temporaries (54845 and 54856). Increasing
> --maximumTmpsForGraphColoring from 25k to 60k increased the "VAX" score on
> my machine (ToT minibrowser on an M1 MBP 2020) from about 27 to about 70.

So this also means if I split this two functions into smaller somehow (like split them) it will increase performance on same value, correct?
Comment 24 Robin Morisset 2021-11-09 13:19:11 PST
Comment on attachment 443607 [details]
Patch

Thanks Yusuke for the review.
Comment 25 Robin Morisset 2021-11-09 13:22:27 PST
(In reply to Alekandr Guryanov from comment #23)
> (In reply to Robin Morisset from comment #20)
> 
> > I verified on the example you provided, there are exactly two functions with
> > > 25k temporaries (54845 and 54856). Increasing
> > --maximumTmpsForGraphColoring from 25k to 60k increased the "VAX" score on
> > my machine (ToT minibrowser on an M1 MBP 2020) from about 27 to about 70.
> 
> So this also means if I split this two functions into smaller somehow (like
> split them) it will increase performance on same value, correct?

Correct. If you want to improve the performance on the currently shipping version of Safari, keeping these functions split should help a lot.

I believe these giant functions typically occur because of very aggressive inlining by LLVM/Emscripten, so preventing some of it somehow might be the easiest way to achieve this.
Comment 26 EWS 2021-11-09 13:56:14 PST
Committed r285533 (244050@main): <https://commits.webkit.org/244050@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 443607 [details].
Comment 27 Brent Fulgham 2022-02-04 13:36:57 PST
This change should be present in STP 139, iOS 15.4 Beta, and macOS 12.3 Beta.