Bug 178389 - Enable tier-up in loops created by recursive tail call optimizations.
Summary: Enable tier-up in loops created by recursive tail call optimizations.
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified
Importance: P2 Normal
Assignee: Robin Morisset
URL:
Keywords:
Depends on: 176601
Blocks:
Reported: 2017-10-17 08:59 PDT by Robin Morisset
Modified: 2018-06-06 14:28 PDT (History)
8 users

See Also:


Attachments
Patch (4.18 KB, patch)
2017-10-23 15:55 PDT, Robin Morisset
no flags
Patch (5.88 KB, patch)
2018-05-03 06:37 PDT, Robin Morisset
sbarati: review+
rmorisset: commit-queue?

Description Robin Morisset 2017-10-17 08:59:28 PDT
See the comments on https://bugs.webkit.org/show_bug.cgi?id=176601
Comment 1 Robin Morisset 2017-10-23 08:33:24 PDT
I ran some tests between no optimization, optimization in both DFG and FTL, and optimization only in FTL:
                         Disabled                   Normal                   FTLOnly               FTLOnly v. Disabled    

n-body              679.4277+-12.1793    ^    608.0302+-17.0313    ^    556.8847+-9.6934        ^ definitely 1.2201x faster
merge-sort         1056.9295+-13.0481        1025.3372+-19.3403    ^    764.4767+-28.6580       ^ definitely 1.3826x faster

<geometric>         847.3528+-10.6054    ^    789.3578+-12.1466    ^    652.3523+-16.1727       ^ definitely 1.2989x faster

Two conclusions:
- the win between no optimization and unconditional optimization is a lot smaller than what I had measured earlier (I had got about 20% in a less rigorous benchmark, before TailBench was supported by the run-jsc-benchmarks script)
- there is clearly a huge gain in running it only in FTL mode, which confirms our suspicion that the optimization is preventing tier-up from DFG to FTL.
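For context on what "loops created by recursive tail call optimizations" means here: when a function's recursive call is in tail position, the DFG can replace the call with a jump back to the top of the function, turning the recursion into a loop. A minimal JavaScript sketch of the affected pattern (illustrative only, not taken from the benchmarks):

```javascript
// A self-recursive tail call: the recursive call is the last thing the
// function does, so the compiler can rewrite it as a jump back to the
// function entry, i.e. a loop.
function sumTo(n, acc) {
    if (n === 0)
        return acc;
    return sumTo(n - 1, acc + n); // tail position
}

// Once rewritten as a loop, tier-up from DFG to FTL depends on the back
// edge being recognized as a loop (hence the LoopHint experiments below).
sumTo(1000, 0); // 500500
```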
Comment 2 Saam Barati 2017-10-23 09:16:07 PDT
(In reply to Robin Morisset from comment #1)
> [...]

I'd be okay with saying we only do this in FTL until we figure out how to OSR using this loop you introduce.
Comment 3 Robin Morisset 2017-10-23 15:55:52 PDT
Created attachment 324603 [details]
Patch
Comment 4 Robin Morisset 2017-10-23 16:00:12 PDT
Some notes about that patch:
- I am not sure that resetting m_exitOK to false at the end of handleRecursiveTailCall is necessary.
- The second setting of m_exitOK to true before placing the LoopHint might also be unnecessary.

Here are some performance numbers:
"Normal" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/Baseline/Release/jsc
"FTLOnly" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/TierUp/Release/jsc
"LoopHint" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/TierUp2/Release/jsc

Collected 20 samples per benchmark/VM, with 20 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to
get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                          Normal                   FTLOnly                   LoopHint               LoopHint v. Normal    

n-body              562.0193+-4.9093     ^    521.3055+-1.6532     !    559.7732+-4.8271        
merge-sort          963.8272+-3.6327     ^    711.8453+-6.1834          708.6563+-6.2171        ^ definitely 1.3601x faster

<geometric>         735.9597+-3.4519     ^    609.1411+-2.7327     !    629.7776+-3.7344        ^ definitely 1.1686x faster

We see that adding the LoopHint improves performance over the baseline, but on n-body it somehow gains less than running the optimization only in FTL.
Comment 5 Robin Morisset 2018-05-03 06:22:48 PDT
I've revisited this topic, and I am even more confused than before.

First, here are the numbers for TailBench with the optimisation completely disabled (and with a small fix to merge-sort-cps so it takes a similar amount of time to run as merge-sort), as a kind of baseline:
n-body              614.1399+-2.6044    
merge-sort          476.6205+-2.4212    
merge-sort-cps      462.0143+-1.4312    
bf-interpreter      401.7443+-3.4122    
<geometric>         482.7646+-0.9180
(all numbers are from running Tools/Scripts/run-jsc-benchmarks WebKitBuild/Release/jsc --inner 2 --outer 10 --tail-bench on my iMac, on top-of-tree as of May 2nd, 2018).
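Of these benchmarks, merge-sort-cps is the continuation-passing-style variant: its tail calls mostly target continuation closures rather than the function itself, which presumably explains why it responds differently to the optimisation than the direct-recursion benchmarks. A toy sketch of the CPS pattern (illustrative only, not the actual benchmark code):

```javascript
// Toy continuation-passing style: every call is in tail position, but
// the callee is usually a closure (the continuation), not the function
// itself, so a simple self-recursive call-to-jump rewrite does not
// directly apply.
function sumCps(n, k) {
    if (n === 0)
        return k(0);
    return sumCps(n - 1, function (rest) { return k(n + rest); });
}

sumCps(100, function (result) { return result; }); // 5050
```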

Next, here are the numbers for the default configuration, where the optimisation runs, but presumably interferes with tier up:
n-body              574.8102+-2.9423    
merge-sort          399.7921+-4.3145    
merge-sort-cps      460.7351+-5.7754    
bf-interpreter      264.6061+-3.6597    
<geometric>         409.0464+-2.4496

So far so good: we improve all programs except merge-sort-cps, which at least we don't regress. Now let's try several different ways to deal with the tier-up problem. With the patch that I had proposed before (inserting a LoopHint at the right place):
n-body              572.7142+-2.7052    
merge-sort          436.1483+-1.5764    
merge-sort-cps      443.3050+-11.2810   
bf-interpreter      260.9475+-4.1062    
<geometric>         412.1350+-2.9962

We finally improve merge-sort-cps a tiny bit, but at the cost of a significant regression on merge-sort. Is it due to some overhead from the way the patch works (since we add a No-op at the beginning of potential target blocks to have a place to easily put the LoopHint)? No: I tested a patch with just the transformation of No-op to LoopHint disabled, and it performs like the default (current) configuration:
n-body              575.9674+-2.9741    
merge-sort          400.1163+-4.3708    
merge-sort-cps      460.9874+-5.5892    
bf-interpreter      266.0997+-3.7222    
<geometric>         409.9563+-1.9411

Is it maybe due to a problem with OSR-entry at this specific point in the program? I tried making the patch insert a LoopHint but not marking it as a potential OSR entry point:
n-body              573.6579+-3.0651    
merge-sort          435.2522+-1.5020    
merge-sort-cps      446.3558+-10.7560   
bf-interpreter      263.8143+-3.2532    
<geometric>         413.9548+-2.7296 
No significant difference from the normal patch.

So at this point I tried a completely different approach: marking the jump back (by putting 1 into its m_opInfo2), and teaching CheckTierUpInjectionPhase to treat such a jump like a return or tail call and insert a CheckTierUpAtReturn just before it:
n-body              548.8219+-5.1428    
merge-sort          428.9825+-1.4452    
merge-sort-cps      432.5123+-4.8355    
bf-interpreter      256.7629+-2.6664    
<geometric>         402.0644+-2.0024
Great news: it not only improves merge-sort-cps, it also improves bf-interpreter a tiny bit, and n-body a slightly larger amount. But it still makes merge-sort slower than the default!

At last, I tried running the optimisation only in FTL mode, and it proved even better for merge-sort-cps, same as that last patch for the rest:
n-body              548.3229+-5.8444    
merge-sort          427.2701+-1.7264    
merge-sort-cps      411.6702+-1.6944    
bf-interpreter      254.7185+-2.1882    
<geometric>         395.8613+-1.1237

So, here are my temporary conclusions:
- the patch I first proposed is strictly dominated by only running the optimisation in FTL
- the alternate patch that marks the jump is very similar to only running the optimisation in FTL, but is slower for merge-sort-cps
- the current (default) situation is only faster than optimizing purely in FTL for merge-sort.
- the optimisation is a win for all 4 benchmarks in all cases.

I will thus submit a patch that makes the optimisation only run in FTL, even though I am still confused as to what is actually going on.
Comment 6 Robin Morisset 2018-05-03 06:37:49 PDT
Created attachment 339404 [details]
Patch
Comment 7 Robin Morisset 2018-05-09 11:17:32 PDT
Comment on attachment 339404 [details]
Patch

I've tested the optimization with useFTLJIT=false, and it appears to have a positive impact in that case. So just running in FTL is probably not ideal, but I still recommend doing it as a simple performance improvement in the meantime.

Results (on the same machine as the previous results), without FTL and without the optimisation:
n-body              964.5178+-4.1374
merge-sort          550.7901+-4.0016
merge-sort-cps      514.8657+-2.3150
bf-interpreter      600.9773+-1.5814
<geometric>         636.7378+-0.6850

Results, without FTL but with the optimization:
n-body              901.5132+-1.7837
merge-sort          526.0939+-1.4784
merge-sort-cps      484.5323+-4.2516
bf-interpreter      515.0320+-5.5436
<geometric>         586.5377+-2.4870
Comment 8 Saam Barati 2018-06-06 14:28:50 PDT
Comment on attachment 339404 [details]
Patch

r=me
Not CQ+ing since we should make sure this still applies cleanly to ToT