See the comments on https://bugs.webkit.org/show_bug.cgi?id=176601
I ran some tests between no optimization, optimization in both DFG and FTL, and optimization only in FTL:
Disabled Normal FTLOnly FTLOnly v. Disabled
n-body 679.4277+-12.1793 ^ 608.0302+-17.0313 ^ 556.8847+-9.6934 ^ definitely 1.2201x faster
merge-sort 1056.9295+-13.0481 1025.3372+-19.3403 ^ 764.4767+-28.6580 ^ definitely 1.3826x faster
<geometric> 847.3528+-10.6054 ^ 789.3578+-12.1466 ^ 652.3523+-16.1727 ^ definitely 1.2989x faster
Two conclusions:
- the win between no optimization and unconditional optimization is a lot smaller than what I had measured earlier (I had got about 20% in a less rigorous benchmark, before I got TailBench support into the run-jsc-benchmarks script)
- there is clearly a huge gain in only running it in FTL mode, which confirms our suspicion that the optimization is preventing tier-up from DFG to FTL.
(In reply to Robin Morisset from comment #1)
> - there is clearly a huge gain in only running it in FTL mode, which
> confirms our suspicion that the optimization is preventing tier-up from DFG
> to FTL.
I’d be okay with saying we only do this in FTL until we figure out how to OSR using this loop you introduce.
Created attachment 324603 [details]
Some notes about that patch:
- I am not sure whether resetting m_exitOK to false at the end of handleRecursiveTailCall is necessary.
- The second setting of m_exitOK to true, just before placing the LoopHint, might also be unnecessary.
Here are some performance numbers:
"Normal" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/Baseline/Release/jsc
"FTLOnly" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/TierUp/Release/jsc
"LoopHint" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/TierUp2/Release/jsc
Collected 20 samples per benchmark/VM, with 20 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to
get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.
Normal FTLOnly LoopHint LoopHint v. Normal
n-body 562.0193+-4.9093 ^ 521.3055+-1.6532 ! 559.7732+-4.8271
merge-sort 963.8272+-3.6327 ^ 711.8453+-6.1834 708.6563+-6.2171 ^ definitely 1.3601x faster
<geometric> 735.9597+-3.4519 ^ 609.1411+-2.7327 ! 629.7776+-3.7344 ^ definitely 1.1686x faster
We see that adding the LoopHint improves performance over the baseline, but on n-body it is, for some reason, less effective than running the optimization only in FTL.
I've revisited this topic, and I am even more confused than before.
First, as a baseline, here are the numbers for TailBench with the optimisation completely disabled (and with a small fix to merge-sort-cps so that it takes a similar amount of time to run as merge-sort):
(all numbers are from running Tools/Scripts/run-jsc-benchmarks WebKitBuild/Release/jsc --inner 2 --outer 10 --tail-bench on my iMac, on top-of-tree as of May 2nd, 2018).
Next, here are the numbers for the default configuration, where the optimisation runs, but presumably interferes with tier up:
So far so good, we improve all programs except for merge-sort-cps that we don't regress at least. Now let's try several different ways to deal with the tier-up problem. With the patch that I had proposed before (inserting a LoopHint at the right place):
We finally improve merge-sort-cps a tiny bit, but at the cost of a significant regression on merge-sort. Is it due to some overhead from the way the patch works (since we add a No-op at the beginning of potential target blocks to have a place to easily put the LoopHint)? No: I tested a patch with just the transformation of No-op to LoopHint disabled, and it performs the same as the default (current) configuration:
Is it maybe due to a problem with OSR-entry at this specific point in the program? I tried making the patch insert a LoopHint but not marking it as a potential OSR entry point:
No significant difference from the normal patch.
So at this point I tried a completely different approach: marking the jump back (by putting 1 into its m_opInfo2), and teaching CheckTierUpInjectionPhase to treat such a jump like a return or tail call and insert a CheckTierUpAtReturn just before it:
Great news: it not only improves merge-sort-cps, it also improves bf-interpreter a tiny bit, and n-body by a slightly larger amount. But it still makes merge-sort slower than the default!
At last, I tried running the optimisation only in FTL mode, and it proved even better for merge-sort-cps, same as that last patch for the rest:
So, here are my temporary conclusions:
- the patch I first proposed is strictly dominated by only running the optimisation in FTL
- the alternate patch that marks the jump is very similar to only running the optimisation in FTL, but is slower for merge-sort-cps
- the current (default) situation is only faster than optimizing purely in FTL for merge-sort.
- the optimisation is a win for all 4 benchmarks in all cases.
I will thus submit a patch that makes the optimisation only run in FTL, even though I am still confused as to what is actually going on.
Created attachment 339404 [details]
Comment on attachment 339404 [details]
I've tested the optimization with useFTLJIT=false, and it appears to have a positive impact in that case. So only running it in FTL is probably not ideal, but I still recommend doing it as a simple performance improvement in the meantime.
Results (on the same machine as the previous results), without FTL and without the optimisation:
Results, without FTL but with the optimization:
Comment on attachment 339404 [details]
Not CQ+ing since we should make sure this still applies cleanly to ToT