See the comments on https://bugs.webkit.org/show_bug.cgi?id=176601
I ran some tests between no optimization, optimization in both DFG and FTL, and optimization only in FTL:

                   Disabled                   Normal                     FTLOnly                   FTLOnly v. Disabled

n-body             679.4277+-12.1793   ^      608.0302+-17.0313   ^      556.8847+-9.6934    ^     definitely 1.2201x faster
merge-sort        1056.9295+-13.0481         1025.3372+-19.3403   ^      764.4767+-28.6580   ^     definitely 1.3826x faster

<geometric>        847.3528+-10.6054   ^      789.3578+-12.1466   ^      652.3523+-16.1727   ^     definitely 1.2989x faster

Two conclusions:
- the win between no optimization and unconditional optimization is a lot smaller than what I had measured earlier (I had gotten about 20% in a less rigorous benchmark, before I got TailBench supported in the run-jsc-benchmarks script)
- there is clearly a huge gain in only running it in FTL mode, which confirms our suspicion that the optimization is preventing tier-up from DFG to FTL.
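For readers unfamiliar with the optimization being benchmarked here: it rewrites a self-recursive call in tail position into a jump back into the same function, so no new call is performed. Below is a minimal, purely illustrative sketch of the shape of code it targets (the function name is mine; the actual TailBench programs such as merge-sort and n-body are larger):

```js
"use strict"; // proper tail calls only apply in strict mode

// Illustrative only: a self-recursive tail call.
// With the recursive tail call optimization, the JIT can turn the call
// below into a jump back to the start of the function instead of
// performing an actual (tail) call.
function sumTo(n, acc) {
    if (n === 0)
        return acc;
    return sumTo(n - 1, acc + n); // tail position: the call's result is returned directly
}

sumTo(100000, 0);
```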
(In reply to Robin Morisset from comment #1)
> [benchmark results and conclusions quoted above]

I'd be okay with saying we only do this in FTL until we figure out how to OSR using this loop you introduce.
Created attachment 324603 [details]
Patch
Some notes about that patch:
- I am not sure of the necessity of resetting m_exitOK to false at the end of handleRecursiveTailCall
- The second setting of m_exitOK to true before the placement of the LoopHint might also be unnecessary.

Here are some performance numbers:

"Normal" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/Baseline/Release/jsc
"FTLOnly" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/TierUp/Release/jsc
"LoopHint" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/TierUp2/Release/jsc

Collected 20 samples per benchmark/VM, with 20 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                   Normal                  FTLOnly                 LoopHint                LoopHint v. Normal

n-body             562.0193+-4.9093   ^    521.3055+-1.6532   !    559.7732+-4.8271
merge-sort         963.8272+-3.6327   ^    711.8453+-6.1834        708.6563+-6.2171   ^    definitely 1.3601x faster

<geometric>        735.9597+-3.4519   ^    609.1411+-2.7327   !    629.7776+-3.7344   ^    definitely 1.1686x faster

We see that adding the LoopHint improves performance over the baseline, but for n-body it improves it less than running the optimization only in FTL, for reasons I don't understand.
I've revisited this topic, and I am even more confused than before.

First, here are the numbers for TailBench with the optimisation completely disabled (and with a small fix to merge-sort-cps so it takes a similar amount of time to run as merge-sort), as a kind of baseline:

n-body            614.1399+-2.6044
merge-sort        476.6205+-2.4212
merge-sort-cps    462.0143+-1.4312
bf-interpreter    401.7443+-3.4122

<geometric>       482.7646+-0.9180

(All numbers are from running Tools/Scripts/run-jsc-benchmarks WebKitBuild/Release/jsc --inner 2 --outer 10 --tail-bench on my iMac, on top-of-tree as of May 2nd, 2018.)

Next, here are the numbers for the default configuration, where the optimisation runs, but presumably interferes with tier-up:

n-body            574.8102+-2.9423
merge-sort        399.7921+-4.3145
merge-sort-cps    460.7351+-5.7754
bf-interpreter    264.6061+-3.6597

<geometric>       409.0464+-2.4496

So far so good: we improve all programs except merge-sort-cps, which at least we don't regress.

Now let's try several different ways to deal with the tier-up problem.

With the patch that I had proposed before (inserting a LoopHint at the right place):

n-body            572.7142+-2.7052
merge-sort        436.1483+-1.5764
merge-sort-cps    443.3050+-11.2810
bf-interpreter    260.9475+-4.1062

<geometric>       412.1350+-2.9962

We finally improve merge-sort-cps a tiny bit, but at the cost of a significant regression on merge-sort.

Is it due to some overhead from the way the patch works (since we add a No-op at the beginning of potential target blocks to have a place to easily put the LoopHint)? No: I tested a patch with just the transformation of No-op to LoopHint disabled, and it performs like the default (current) configuration:

n-body            575.9674+-2.9741
merge-sort        400.1163+-4.3708
merge-sort-cps    460.9874+-5.5892
bf-interpreter    266.0997+-3.7222

<geometric>       409.9563+-1.9411

Is it maybe due to a problem with OSR entry at this specific point in the program? I tried making the patch insert a LoopHint without marking it as a potential OSR entry point:

n-body            573.6579+-3.0651
merge-sort        435.2522+-1.5020
merge-sort-cps    446.3558+-10.7560
bf-interpreter    263.8143+-3.2532

<geometric>       413.9548+-2.7296

No significant difference from the normal patch.

So at this point I tried a completely different approach: marking the jump back (by putting 1 into its m_opInfo2), and teaching CheckTierUpInjectionPhase to treat such a jump like a return or tail call and insert a CheckTierUpAtReturn just before it:

n-body            548.8219+-5.1428
merge-sort        428.9825+-1.4452
merge-sort-cps    432.5123+-4.8355
bf-interpreter    256.7629+-2.6664

<geometric>       402.0644+-2.0024

Great news: not only does it improve merge-sort-cps, it also improves bf-interpreter a tiny bit, and n-body by a slightly larger amount. But it still makes merge-sort slower than the default!

Finally, I tried running the optimisation only in FTL mode, and it proved even better for merge-sort-cps, and the same as that last patch for the rest:

n-body            548.3229+-5.8444
merge-sort        427.2701+-1.7264
merge-sort-cps    411.6702+-1.6944
bf-interpreter    254.7185+-2.1882

<geometric>       395.8613+-1.1237

So, here are my temporary conclusions:
- the patch I first proposed is strictly dominated by only running the optimisation in FTL
- the alternate patch that marks the jump is very similar to only running the optimisation in FTL, but is slower for merge-sort-cps
- the current (default) situation is only faster than optimising purely in FTL for merge-sort
- the optimisation is a win for all 4 benchmarks in all cases.
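To make the tier-up discussion above easier to follow, here is a hand-written JavaScript analogue of what the transformation effectively produces; this is a sketch only, with names and structure of my own invention, not what the DFG actually emits. Once the recursive tail call becomes a back edge, the function body behaves like a loop, and the question is where to count executions for tier-up: at the head of that loop (the LoopHint approach) or just before the jump back (the approach that counts like a return or tail call):

```js
"use strict";

// Hypothetical source-level analogue of a self-recursive tail call
// after it has been turned into a back edge within the same function.
function sumToAsLoop(n, acc) {
    while (true) {
        // <- the first patch places a LoopHint here, at the head of the
        //    target block, so the usual loop tier-up machinery sees it
        if (n === 0)
            return acc;
        [n, acc] = [n - 1, acc + n]; // re-bind the arguments for the "recursive call"
        // <- the alternative patch instead counts here, just before the
        //    jump back, the way a return or a tail call would be counted
    }
}

sumToAsLoop(100000, 0);
```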
I will thus submit a patch that makes the optimisation only run in FTL, even though I am still confused as to what is actually going on.
Created attachment 339404 [details]
Patch
Comment on attachment 339404 [details]
Patch

I've tested the optimization with useFTLJIT=false, and it appears to have a positive impact in that case. So running it only in the FTL is probably not ideal, but I still recommend doing it as a simple performance improvement in the meantime.

Results (on the same machine as the previous results), without FTL and without the optimisation:

n-body            964.5178+-4.1374
merge-sort        550.7901+-4.0016
merge-sort-cps    514.8657+-2.3150
bf-interpreter    600.9773+-1.5814

<geometric>       636.7378+-0.6850

Results, without FTL but with the optimization:

n-body            901.5132+-1.7837
merge-sort        526.0939+-1.4784
merge-sort-cps    484.5323+-4.2516
bf-interpreter    515.0320+-5.5436

<geometric>       586.5377+-2.4870
Comment on attachment 339404 [details]
Patch

r=me

Not CQ+ing since we should make sure this still applies cleanly to ToT.
Created attachment 454036 [details]
Patch

Just rebased for landing. Thanks, Mark, for reminding me of this super old patch.
Comment on attachment 454036 [details]
Patch

Trying to land this, as all bots are green except for iOS-wk2, which appears to be down for all patches.
Committed r291026 (248200@main): <https://commits.webkit.org/248200@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 454036 [details].
<rdar://problem/90004728>