Bug 178389 - Enable tier-up in loops created by recursive tail call optimizations.
Summary: Enable tier-up in loops created by recursive tail call optimizations.
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified
Importance: P2 Normal
Assignee: Robin Morisset
URL:
Keywords:
Depends on: 176601
Blocks:
Reported: 2017-10-17 08:59 PDT by Robin Morisset
Modified: 2018-06-06 14:28 PDT (History)
8 users

See Also:


Attachments
Patch (4.18 KB, patch)
2017-10-23 15:55 PDT, Robin Morisset
no flags
Patch (5.88 KB, patch)
2018-05-03 06:37 PDT, Robin Morisset
sbarati: review+
rmorisset: commit-queue?

Description Robin Morisset 2017-10-17 08:59:28 PDT
See the comments on https://bugs.webkit.org/show_bug.cgi?id=176601
Comment 1 Robin Morisset 2017-10-23 08:33:24 PDT
I ran some tests between no optimization, optimization in both DFG and FTL, and optimization only in FTL:
                         Disabled                   Normal                   FTLOnly               FTLOnly v. Disabled    

n-body              679.4277+-12.1793    ^    608.0302+-17.0313    ^    556.8847+-9.6934        ^ definitely 1.2201x faster
merge-sort         1056.9295+-13.0481        1025.3372+-19.3403    ^    764.4767+-28.6580       ^ definitely 1.3826x faster

<geometric>         847.3528+-10.6054    ^    789.3578+-12.1466    ^    652.3523+-16.1727       ^ definitely 1.2989x faster

Two conclusions:
- the win between no optimization and unconditional optimization is a lot smaller than what I had measured earlier (I had got about 20% in a less rigorous benchmark, before TailBench was supported by the run-jsc-benchmarks script)
- there is clearly a huge gain in running it only in FTL mode, which confirms our suspicion that the optimization is preventing tier-up from DFG to FTL.
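For context on what "loops created by recursive tail call optimizations" means here: when a function's recursive call is in tail position, the DFG can replace the call with a jump back to the top of the function, turning the recursion into a loop. A minimal JavaScript sketch of the affected pattern (illustrative only, not taken from the benchmarks):

```javascript
// A self-recursive tail call: the recursive call is the last thing the
// function does, so the compiler can rewrite it as a jump back to the
// function entry, i.e. a loop.
function sumTo(n, acc) {
    if (n === 0)
        return acc;
    return sumTo(n - 1, acc + n); // tail position
}

// Once rewritten as a loop, tier-up from DFG to FTL depends on the back
// edge being recognized as a loop (hence the LoopHint experiments below).
sumTo(1000, 0); // 500500
```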
Comment 2 Saam Barati 2017-10-23 09:16:07 PDT
(In reply to Robin Morisset from comment #1)
> [...]

I'd be okay with saying we only do this in FTL until we figure out how to OSR using this loop you introduce.
Comment 3 Robin Morisset 2017-10-23 15:55:52 PDT
Created attachment 324603 [details]
Patch
Comment 4 Robin Morisset 2017-10-23 16:00:12 PDT
Some notes about that patch:
- I am not sure that resetting m_exitOK to false at the end of handleRecursiveTailCall is necessary.
- The second setting of m_exitOK to true before placing the LoopHint might also be unnecessary.

Here are some performance numbers:
"Normal" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/Baseline/Release/jsc
"FTLOnly" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/TierUp/Release/jsc
"LoopHint" at /Users/rmorisset/Webkit2/OpenSource/WebKitBuild/TierUp2/Release/jsc

Collected 20 samples per benchmark/VM, with 20 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to
get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                          Normal                   FTLOnly                   LoopHint               LoopHint v. Normal    

n-body              562.0193+-4.9093     ^    521.3055+-1.6532     !    559.7732+-4.8271        
merge-sort          963.8272+-3.6327     ^    711.8453+-6.1834          708.6563+-6.2171        ^ definitely 1.3601x faster

<geometric>         735.9597+-3.4519     ^    609.1411+-2.7327     !    629.7776+-3.7344        ^ definitely 1.1686x faster

We see that adding the LoopHint improves performance over the baseline, but on n-body it somehow gains less than running the optimization only in FTL.
Comment 5 Robin Morisset 2018-05-03 06:22:48 PDT
I've revisited this topic, and I am even more confused than before.

First, here are the numbers for TailBench with the optimisation completely disabled (and with a small fix to merge-sort-cps so it takes a similar amount of time to run as merge-sort), as a kind of baseline:
n-body              614.1399+-2.6044    
merge-sort          476.6205+-2.4212    
merge-sort-cps      462.0143+-1.4312    
bf-interpreter      401.7443+-3.4122    
<geometric>         482.7646+-0.9180
(all numbers are from running Tools/Scripts/run-jsc-benchmarks WebKitBuild/Release/jsc --inner 2 --outer 10 --tail-bench on my iMac, on top-of-tree as of May 2nd, 2018).
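Of these benchmarks, merge-sort-cps is the continuation-passing-style variant: its tail calls mostly target continuation closures rather than the function itself, which presumably explains why it responds differently to the optimisation than the direct-recursion benchmarks. A toy sketch of the CPS pattern (illustrative only, not the actual benchmark code):

```javascript
// Toy continuation-passing style: every call is in tail position, but
// the callee is usually a closure (the continuation), not the function
// itself, so a simple self-recursive call-to-jump rewrite does not
// directly apply.
function sumCps(n, k) {
    if (n === 0)
        return k(0);
    return sumCps(n - 1, function (rest) { return k(n + rest); });
}

sumCps(100, function (result) { return result; }); // 5050
```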

Next, here are the numbers for the default configuration, where the optimisation runs, but presumably interferes with tier up:
n-body              574.8102+-2.9423    
merge-sort          399.7921+-4.3145    
merge-sort-cps      460.7351+-5.7754    
bf-interpreter      264.6061+-3.6597    
<geometric>         409.0464+-2.4496

So far so good: we improve all programs except merge-sort-cps, which at least we don't regress. Now let's try several different ways to deal with the tier-up problem. With the patch that I had proposed before (inserting a LoopHint at the right place):
n-body              572.7142+-2.7052    
merge-sort          436.1483+-1.5764    
merge-sort-cps      443.3050+-11.2810   
bf-interpreter      260.9475+-4.1062    
<geometric>         412.1350+-2.9962

We finally improve merge-sort-cps a tiny bit, but at the cost of a significant regression on merge-sort. Is it due to some overhead from the way the patch works (since we add a No-op at the beginning of potential target blocks to have a place to easily put the LoopHint)? No: I tested a patch with just the transformation of No-op to LoopHint disabled, and it performs like the default (current) configuration:
n-body              575.9674+-2.9741    
merge-sort          400.1163+-4.3708    
merge-sort-cps      460.9874+-5.5892    
bf-interpreter      266.0997+-3.7222    
<geometric>         409.9563+-1.9411

Is it maybe due to a problem with OSR-entry at this specific point in the program? I tried making the patch insert a LoopHint but not marking it as a potential OSR entry point:
n-body              573.6579+-3.0651    
merge-sort          435.2522+-1.5020    
merge-sort-cps      446.3558+-10.7560   
bf-interpreter      263.8143+-3.2532    
<geometric>         413.9548+-2.7296 
No significant difference from the normal patch.

So at this point I tried a completely different approach: marking the jump back (by putting 1 into its m_opInfo2), and teaching CheckTierUpInjectionPhase to treat such a jump like a return or tail call and insert a CheckTierUpAtReturn just before it:
n-body              548.8219+-5.1428    
merge-sort          428.9825+-1.4452    
merge-sort-cps      432.5123+-4.8355    
bf-interpreter      256.7629+-2.6664    
<geometric>         402.0644+-2.0024
Great news: it not only improves merge-sort-cps, it also improves bf-interpreter a tiny bit, and n-body a slightly larger amount. But it still makes merge-sort slower than the default!

At last, I tried running the optimisation only in FTL mode, and it proved even better for merge-sort-cps, same as that last patch for the rest:
n-body              548.3229+-5.8444    
merge-sort          427.2701+-1.7264    
merge-sort-cps      411.6702+-1.6944    
bf-interpreter      254.7185+-2.1882    
<geometric>         395.8613+-1.1237

So, here are my temporary conclusions:
- the patch I first proposed is strictly dominated by only running the optimisation in FTL
- the alternate patch that marks the jump is very similar to only running the optimisation in FTL, but is slower for merge-sort-cps
- the current (default) situation is only faster than optimizing purely in FTL for merge-sort.
- the optimisation is a win for all 4 benchmarks in all cases.

I will thus submit a patch that makes the optimisation only run in FTL, even though I am still confused as to what is actually going on.
Comment 6 Robin Morisset 2018-05-03 06:37:49 PDT
Created attachment 339404 [details]
Patch
Comment 7 Robin Morisset 2018-05-09 11:17:32 PDT
Comment on attachment 339404 [details]
Patch

I've tested the optimization with useFTLJIT=false, and it appears to have a positive impact in that case. So just running in FTL is probably not ideal, but I still recommend doing it as a simple performance improvement in the meantime.

Results (on the same machine as the previous results), without FTL and without the optimisation:
n-body              964.5178+-4.1374
merge-sort          550.7901+-4.0016
merge-sort-cps      514.8657+-2.3150
bf-interpreter      600.9773+-1.5814
<geometric>         636.7378+-0.6850

Results, without FTL but with the optimization:
n-body              901.5132+-1.7837
merge-sort          526.0939+-1.4784
merge-sort-cps      484.5323+-4.2516
bf-interpreter      515.0320+-5.5436
<geometric>         586.5377+-2.4870
Comment 8 Saam Barati 2018-06-06 14:28:50 PDT
Comment on attachment 339404 [details]
Patch

r=me
Not CQ+ing since we should make sure this still applies cleanly to ToT