Bug 167149 - OSR entry: delay outer-loop compilation when at inner-loop
Summary: OSR entry: delay outer-loop compilation when at inner-loop
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: JF Bastien
URL:
Keywords: InRadar
Depends on: 167479 167721
Blocks: 167741 167811 167460
  Show dependency treegraph
 
Reported: 2017-01-17 18:39 PST by JF Bastien
Modified: 2017-02-03 17:19 PST (History)
9 users (show)

See Also:


Attachments
patch (5.39 MB, patch)
2017-01-17 18:58 PST, JF Bastien
fpizlo: review-
Details | Formatted Diff | Diff
patch (3.68 MB, patch)
2017-01-19 21:33 PST, JF Bastien
jfbastien: commit-queue-
Details | Formatted Diff | Diff
patch (3.68 MB, patch)
2017-01-20 12:00 PST, JF Bastien
jfbastien: commit-queue-
Details | Formatted Diff | Diff
patch (31.73 KB, patch)
2017-01-20 14:17 PST, JF Bastien
jfbastien: commit-queue-
Details | Formatted Diff | Diff
patch (32.60 KB, patch)
2017-01-20 15:26 PST, JF Bastien
jfbastien: commit-queue-
Details | Formatted Diff | Diff
patch (47.00 KB, patch)
2017-01-24 15:19 PST, JF Bastien
jfbastien: commit-queue-
Details | Formatted Diff | Diff
pach (46.98 KB, patch)
2017-01-24 20:37 PST, JF Bastien
jfbastien: commit-queue-
Details | Formatted Diff | Diff
patch (46.98 KB, patch)
2017-01-25 10:30 PST, JF Bastien
jfbastien: commit-queue-
Details | Formatted Diff | Diff
patch (3.62 MB, patch)
2017-01-25 17:14 PST, JF Bastien
fpizlo: review+
Details | Formatted Diff | Diff
patch (3.62 MB, patch)
2017-01-26 10:56 PST, JF Bastien
no flags Details | Formatted Diff | Diff
patch (3.62 MB, patch)
2017-01-26 11:15 PST, JF Bastien
no flags Details | Formatted Diff | Diff
patch (3.63 MB, patch)
2017-01-31 10:33 PST, JF Bastien
fpizlo: review-
Details | Formatted Diff | Diff
Results + profiles (88.50 KB, application/octet-stream)
2017-01-31 10:33 PST, JF Bastien
no flags Details
patch (11.19 KB, patch)
2017-01-31 15:30 PST, JF Bastien
no flags Details | Formatted Diff | Diff
Benchmark results (89.70 KB, application/octet-stream)
2017-01-31 15:31 PST, JF Bastien
no flags Details
patch (8.01 KB, patch)
2017-01-31 16:20 PST, JF Bastien
no flags Details | Formatted Diff | Diff
patch (3.39 KB, patch)
2017-02-02 12:39 PST, JF Bastien
jfbastien: commit-queue-
Details | Formatted Diff | Diff
patch (8.62 KB, patch)
2017-02-02 13:08 PST, JF Bastien
jfbastien: commit-queue-
Details | Formatted Diff | Diff
benchmark results for imprecise mustHandleValues (82.28 KB, application/octet-stream)
2017-02-02 13:40 PST, JF Bastien
no flags Details
collect mustHandleValues eagerly (6.91 KB, patch)
2017-02-02 17:05 PST, JF Bastien
jfbastien: commit-queue-
Details | Formatted Diff | Diff
patch (9.01 KB, patch)
2017-02-03 10:27 PST, JF Bastien
no flags Details | Formatted Diff | Diff
Performance results (81.47 KB, application/octet-stream)
2017-02-03 10:29 PST, JF Bastien
no flags Details
patch (17.31 KB, patch)
2017-02-03 14:08 PST, JF Bastien
no flags Details | Formatted Diff | Diff
patch (17.40 KB, patch)
2017-02-03 16:00 PST, JF Bastien
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description JF Bastien 2017-01-17 18:39:58 PST
As of https://bugs.webkit.org/show_bug.cgi?id=155217 OSR compilation can be kicked off for an entry into an outer-loop, while executing an inner-loop. This is desirable because often the codegen from an inner-entry isn't as good as the codegen from an outer-entry, but execution from an inner-loop is often pretty hot and likely to kick off compilation. This approach provided nice speedups on Kraken because we'd select to enter to the outer-loop very reliably, which reduces variability (the inner-loop was selected roughly 1/5 times from my unscientific measurements).

When compilation starts we take a snapshot of the JSValues at the current execution state using OSR's recovery mechanism. These values are passed to the compiler and are used as way to perform type profiling, and could be used to observe cell types as well as to perform predictions such as through constant propagation.

It's therefore desired to enter from the outer-loop when we can, but we need to be executing from that location to capture the right JSValues, otherwise we're confusing the compiler and giving it inaccurate JSValues which can lead it to predict the wrong things, leading to suboptimal code or recompilation due to misprediction.

These effects are pretty hard to measure: Fil points out that marsalis-osr-entry really needs mustHandleValues (the JSValues from the point of execution) because right now it just happens to correctly guess int32. I tried removing mustHandleValues entirely and saw no slowdowns, but our benchmarks probably aren't sufficient to reliably find issues, sometimes because we happen to have sufficient mitigations.

DFG tier-up was added here: https://bugs.webkit.org/show_bug.cgi?id=112838

<rdar://problem/29144126>
Comment 1 JF Bastien 2017-01-17 18:58:38 PST
Created attachment 299107 [details]
patch

Should address <rdar://problem/29144126>

Here are some benchmark results:

$ run-jsc-benchmarks original:original/bin/jsc delay:current-release/bin/jsc --kraken --outer 20
                                           original                   delay                                       

ai-astar                                67.957+-1.211             67.896+-1.411         
audio-beat-detection                    28.329+-1.439      ?      28.507+-1.279         ?
audio-dft                               59.496+-0.586      ?      59.804+-0.611         ?
audio-fft                               21.121+-0.106      ?      21.168+-0.115         ?
audio-oscillator                        45.449+-1.932             44.702+-0.187           might be 1.0167x faster
imaging-darkroom                        49.720+-0.295      ?      49.832+-0.357         ?
imaging-desaturate                      43.173+-0.864             42.325+-0.775           might be 1.0200x faster
imaging-gaussian-blur                   64.513+-9.577             60.548+-7.272           might be 1.0655x faster
json-parse-financial                    29.491+-0.126      ?      29.521+-0.132         ?
json-stringify-tinderbox                18.723+-0.186             18.607+-0.184         
stanford-crypto-aes                     30.532+-0.240             30.424+-0.124         
stanford-crypto-ccm                     28.714+-0.854             28.599+-0.896         
stanford-crypto-pbkdf2                  95.847+-0.485             95.605+-0.445         
stanford-crypto-sha256-iterative        30.506+-0.080      ?      30.534+-0.131         ?

<arithmetic>                            43.827+-0.718             43.434+-0.557           might be 1.0090x faster

$ run-jsc-benchmarks original:original/bin/jsc delay:current-release/bin/jsc --kraken --benchmark gaussian-blur --outer 200
                                original                   delay                                       

imaging-gaussian-blur        63.372+-2.660             61.678+-2.618           might be 1.0275x faster

<arithmetic>                 63.372+-2.660             61.678+-2.618           might be 1.0275x faster

$ run-jsc-benchmarks original:original/bin/jsc delay:current-release/bin/jsc --benchmark marsaglia --outer 200
                                original                   delay                                       

marsaglia-larger-ints       71.3672+-0.0630     ?     71.4541+-0.1207        ?
marsaglia-osr-entry         15.8131+-0.0794     ?     15.8814+-0.0861        ?

<geometric>                 33.5883+-0.0823     ?     33.6799+-0.0910        ? might be 1.0027x slower

$ run-jsc-benchmarks original:original/bin/jsc delay:current-release/bin/jsc --benchmark mandelbrot --outer 20
                         original                   delay                                       

mandelbrot          376.3381+-2.3982          374.4935+-0.6074        

<geometric>         376.3381+-2.3982          374.4935+-0.6074          might be 1.0049x faster

$ run-jsc-benchmarks original:original/bin/jsc delay:current-release/bin/jsc --benchmark nonude --outer 60
                         original                   delay                                       

nonude               67.6165+-0.6550     ?     67.9252+-0.7012        ?

<geometric>          67.6165+-0.6550     ?     67.9252+-0.7012        ? might be 1.0046x slower
Comment 2 JF Bastien 2017-01-17 19:03:28 PST
In the mandelbrot.js test I've got Unicode characters that display a fancy Mandelbrot image. Bugzilla doesn't like Unicode, but my terminal does (you can also print on each iteration to see a neat animation!).

nonude.js has baked-in constant array, I figure that's more reliable than randomly generated. I could avoid line-wrap and put everything in a single line if you think that would be better. It needs to be about this size to tier up.
Comment 3 Filip Pizlo 2017-01-17 19:30:27 PST
Comment on attachment 299107 [details]
patch

I'd like to provide feedback on this patch, but I can't because the mandelbrot microbenchmark is too big for the patch review tool.

Also, it seems to be fairly long-running.  We don't want microbenchmarks that run for 100ms or more.
Comment 4 Filip Pizlo 2017-01-17 19:34:54 PST
I don't think this is the right approach.  It seems to defeat the purpose of having a target loop trigger.  When you try to compile and you're not at the loop header, you just try again.  The whole point of the trigger is so that we don't have to wait and hope that we enter in the loop.
Comment 5 JF Bastien 2017-01-19 21:33:59 PST
Created attachment 299313 [details]
patch

Update with a different approach: mark the outer loop for entry, and when we force its compilation then unmark it.

I still need to handle the cases where:

 - outer-loop compilation fails (can it fail for outer but not inner? Do we care?)
 - outer-loop doesn't trigger for too long (maybe we should try to trigger one loop in, and so on, until we reach the innermost loop)

I ran performance number but they look weird. I'll close things and re-run them.
Comment 6 JF Bastien 2017-01-20 12:00:13 PST
Created attachment 299364 [details]
patch

Here's a sketch of what I think back off could be like.

I still need to:

 - Verify benchmarks, and tune this backoff thing.
 - Make new tests shorter.
 - Update changelog.
Comment 7 Filip Pizlo 2017-01-20 12:22:55 PST
(In reply to comment #6)
> Created attachment 299364 [details]
> patch
> 
> Here's a sketch of what I think back off could be like.
> 
> I still need to:
> 
>  - Verify benchmarks, and tune this backoff thing.
>  - Make new tests shorter.
>  - Update changelog.

Can you please upload a smaller patch?  You should remove the benchmarks from the patch and land them anyway when you commit.  We don't review the benchmark contents usually.
Comment 8 Geoffrey Garen 2017-01-20 12:49:26 PST
Comment on attachment 299364 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=299364&action=review

> Source/JavaScriptCore/dfg/DFGTierUpEntryTrigger.h:35
> +    DontEnter = 0, // Must stay zero: JIT-compiled code enters when non-zero.
> +    TryToCompileCode,
> +    CodeCompiledSuccessfully,

s/enters/takes the triggerOSREntryNow slow path

Maybe need to rename triggerOSREntryNow to triggerTierUpNow

Maybe:

None,
Compile,
OSREnter
Comment 9 JF Bastien 2017-01-20 14:17:08 PST
Created attachment 299381 [details]
patch

Update patch with Geoff's comments, and drop tests for now.
Comment 10 JF Bastien 2017-01-20 15:26:40 PST
Created attachment 299395 [details]
patch

Add some logging, and clean up a minor issue with entry location.
Comment 11 Filip Pizlo 2017-01-23 12:40:08 PST
Comment on attachment 299395 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=299395&action=review

> Source/JavaScriptCore/bytecode/ExecutionCounter.cpp:75
>  template<CountingVariant countingVariant>
> +void ExecutionCounter<countingVariant>::deferForAWhile()
> +{
> +    ASSERT(m_activeThreshold != std::numeric_limits<int32_t>::max());
> +    m_totalCount = count();
> +    m_counter = -static_cast<int32_t>(m_activeThreshold);
> +}

This function is kinda weird.  It's totally different from how we usually ask the execution counter to trigger soon.

> Source/JavaScriptCore/dfg/DFGJITCode.cpp:144
> +void JITCode::dontOptimizeForAWhile(CodeBlock* codeBlock)
> +{
> +    ASSERT(codeBlock->jitType() == JITCode::DFGJIT);
> +    if (Options::verboseOSR())
> +        dataLog(*codeBlock, ": Not FTL-optimizing for a while.\n");
> +    tierUpCounter.deferForAWhile();
> +}
> +

This seems like it's almost like optimizeAfterWarmUp or optimizeSoon. You should preferably choose one of those rather than introduce a new thing.
Comment 12 JF Bastien 2017-01-24 15:19:30 PST
Created attachment 299639 [details]
patch

Substantially change the logic after in-person talk with pizlo.

I still need to:

 - Test this a bunch.
 - Make the tests run faster.

The basic idea should be there though.
Comment 13 Filip Pizlo 2017-01-24 15:24:21 PST
Comment on attachment 299639 [details]
patch

This looks pretty solid to me.
Comment 14 JF Bastien 2017-01-24 20:37:24 PST
Created attachment 299676 [details]
pach

Fixed a few bugs found by tests. They're still running, I'll then benchmark and shorten the two new tests.
Comment 15 JF Bastien 2017-01-24 22:25:30 PST
Tests seem happy.

First run of the benchmarks seem OK, pretty much perf neutral without regressing variance on kraken/imagine-gaussian-blur.

I'm kicking off a longer benchmark run and signing out for today, will look a the results tomorrow and report back.
Comment 16 JF Bastien 2017-01-25 10:30:35 PST
Created attachment 299713 [details]
patch

Typo fix.

Benchmarks finished, ruby is crunching on them.
Comment 17 JF Bastien 2017-01-25 17:14:39 PST
Created attachment 299774 [details]
patch

This looks good to go. I rebased the changes, made the new benchmarks faster, and ran benchmarks (everything looks pretty neutral).

I'll file a separate bug to investigate mandelbrot's lack of FTL-ing. It's important to fix, but not as much as other things.
Comment 18 WebKit Commit Bot 2017-01-25 17:17:09 PST
Attachment 299774 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/dfg/DFGOperations.cpp:2431:  An else statement can be removed when the prior "if" concludes with a return, break, continue or goto statement.  [readability/control_flow] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOperations.cpp:2613:  An else statement can be removed when the prior "if" concludes with a return, break, continue or goto statement.  [readability/control_flow] [4]
Total errors found: 2 in 18 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 19 Saam Barati 2017-01-25 17:49:07 PST
Comment on attachment 299774 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=299774&action=review

> JSTests/microbenchmarks/nonude.js:1
> +const makeFaster = 8;

Does this test actually exercise the crash?

> Source/JavaScriptCore/ChangeLog:29
> +        right JSValues, otherwise we're confusing the compiler and giving
> +        it inaccurate JSValues which can lead it to predict the wrong
> +        things, leading to suboptimal code or recompilation due to
> +        misprediction.

Also, crash

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2390
> +            // We're already compiling, and can't OSR enter. We therefore must have set our slow-path trigger as well as another slow path trigger, hoping one of these would kick off a compilation. Sure enough another bytecode location did kick off a compilation before this one got a chance to do so. There's nothing for us to do but wait.

Nit: make this multiline.

> Source/JavaScriptCore/dfg/DFGSpeculativeJIT64.cpp:5787
> +        static_assert(!static_cast<int>(TierUpEntryTrigger::None), "NonZero test below depends on this");

nit: I'd use uint8_t instead of int
Comment 20 Saam Barati 2017-01-25 18:01:17 PST
Comment on attachment 299774 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=299774&action=review

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2625
> +        switch (osrEntryPreparation.error()) {
> +        case FTL::OSREntryFail::StackGrowthFailed:
> +            // It's not clear what we should do. We signal to try again after a while if that happens.
> +            return nullptr;
> +        case FTL::OSREntryFail::WrongBytecode:
> +            RELEASE_ASSERT_NOT_REACHED();
> +        }

Why not just RELEASE_ASSERT(error == StackGrowthFailed)?
Also, we tend not to have "else" statements for "if"s that end in returns.
Comment 21 Filip Pizlo 2017-01-25 20:58:19 PST
Comment on attachment 299774 [details]
patch

LGTM, but I think you should address Saam's advice.
Comment 22 JF Bastien 2017-01-26 10:56:57 PST
Created attachment 299816 [details]
patch

Address Saam's comments except one as noted below.

Fix GTK build, it was sad about missing return because it wasn't smart enough to see there was no missing return (or spoon for that matter).


> > JSTests/microbenchmarks/nonude.js:1
> > +const makeFaster = 8;
> 
> Does this test actually exercise the crash?

It's complicated... The code is super finicky, and I can add asserts at specific spots to observe unwanted things that turn out to be innocuous. As-is the benchmark pretty much doesn't exhibit bad behavior.

I was thinking about adding a few asserts as sanity checks, but I'd do that separately.


> > Source/JavaScriptCore/ChangeLog:29
> > +        right JSValues, otherwise we're confusing the compiler and giving
> > +        it inaccurate JSValues which can lead it to predict the wrong
> > +        things, leading to suboptimal code or recompilation due to
> > +        misprediction.
> 
> Also, crash

I was purposefully not using that word, but pizlo says it's fine.


> View in context:
> https://bugs.webkit.org/attachment.cgi?id=299774&action=review
> 
> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2625
> > +        switch (osrEntryPreparation.error()) {
> > +        case FTL::OSREntryFail::StackGrowthFailed:
> > +            // It's not clear what we should do. We signal to try again after a while if that happens.
> > +            return nullptr;
> > +        case FTL::OSREntryFail::WrongBytecode:
> > +            RELEASE_ASSERT_NOT_REACHED();
> > +        }
> 
> Why not just RELEASE_ASSERT(error == StackGrowthFailed)?
> Also, we tend not to have "else" statements for "if"s that end in returns.

I like this approach because if / when we add new reasons for OSR entry failure we *have* to edit this code (because the switch must cover all enum cases, otherwise compiler warns), think about what it means, and document it as above (what do we do about stack growth? Dunno... But wrong bytecode definitely can't happen here!).

Removing the "else" would make this code invalid: osrEntryPreparation wouldn't be in scope anymore (it's only in scope in the if / else). I can hoist it above the "if", but then IMO the code is ugly :-)

Happy to change it if you want, I just like this style quite a bit because IMO it's less likely to break in the future.
Comment 23 WebKit Commit Bot 2017-01-26 11:00:17 PST
Attachment 299816 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/dfg/DFGOperations.cpp:2439:  An else statement can be removed when the prior "if" concludes with a return, break, continue or goto statement.  [readability/control_flow] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOperations.cpp:2637:  An else statement can be removed when the prior "if" concludes with a return, break, continue or goto statement.  [readability/control_flow] [4]
Total errors found: 2 in 18 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 24 Filip Pizlo 2017-01-26 11:04:44 PST
(In reply to comment #22)
> Created attachment 299816 [details]
> patch
> 
> Address Saam's comments except one as noted below.
> 
> Fix GTK build, it was sad about missing return because it wasn't smart
> enough to see there was no missing return (or spoon for that matter).
> 
> 
> > > JSTests/microbenchmarks/nonude.js:1
> > > +const makeFaster = 8;
> > 
> > Does this test actually exercise the crash?
> 
> It's complicated... The code is super finicky, and I can add asserts at
> specific spots to observe unwanted things that turn out to be innocuous.
> As-is the benchmark pretty much doesn't exhibit bad behavior.
> 
> I was thinking about adding a few asserts as sanity checks, but I'd do that
> separately.
> 
> 
> > > Source/JavaScriptCore/ChangeLog:29
> > > +        right JSValues, otherwise we're confusing the compiler and giving
> > > +        it inaccurate JSValues which can lead it to predict the wrong
> > > +        things, leading to suboptimal code or recompilation due to
> > > +        misprediction.
> > 
> > Also, crash
> 
> I was purposefully not using that word, but pizlo says it's fine.
> 
> 
> > View in context:
> > https://bugs.webkit.org/attachment.cgi?id=299774&action=review
> > 
> > > Source/JavaScriptCore/dfg/DFGOperations.cpp:2625
> > > +        switch (osrEntryPreparation.error()) {
> > > +        case FTL::OSREntryFail::StackGrowthFailed:
> > > +            // It's not clear what we should do. We signal to try again after a while if that happens.
> > > +            return nullptr;
> > > +        case FTL::OSREntryFail::WrongBytecode:
> > > +            RELEASE_ASSERT_NOT_REACHED();
> > > +        }
> > 
> > Why not just RELEASE_ASSERT(error == StackGrowthFailed)?
> > Also, we tend not to have "else" statements for "if"s that end in returns.
> 
> I like this approach because if / when we add new reasons for OSR entry
> failure we *have* to edit this code (because the switch must cover all enum
> cases, otherwise compiler warns), think about what it means, and document it
> as above (what do we do about stack growth? Dunno... But wrong bytecode
> definitely can't happen here!).

I agree with that approach. I also like to use switch statements this way. We do it all the time in the compiler guts.

> 
> Removing the "else" would make this code invalid: osrEntryPreparation
> wouldn't be in scope anymore (it's only in scope in the if / else). I can
> hoist it above the "if", but then IMO the code is ugly :-)

That would be the WebKit style.

> 
> Happy to change it if you want, I just like this style quite a bit because
> IMO it's less likely to break in the future.

We're all very used to reading code where else is only used when 'then' might fall through. It's annoying when code breaks this tradition.
Comment 25 JF Bastien 2017-01-26 11:15:38 PST
Created attachment 299823 [details]
patch

Un-else the code, pizlo makes a good case that it's WebKit style.

CQ+
Comment 26 WebKit Commit Bot 2017-01-26 11:53:59 PST
Comment on attachment 299823 [details]
patch

Clearing flags on attachment: 299823

Committed r211224: <http://trac.webkit.org/changeset/211224>
Comment 27 WebKit Commit Bot 2017-01-26 11:54:05 PST
All reviewed patches have been landed.  Closing bug.
Comment 28 JF Bastien 2017-01-26 12:06:21 PST
I'll follow-up on mandelbrot not FTL-ing here:
https://bugs.webkit.org/show_bug.cgi?id=167460

Not now though, it's not super high priority.
Comment 29 WebKit Commit Bot 2017-01-26 17:30:46 PST
Re-opened since this is blocked by bug 167479
Comment 30 JF Bastien 2017-01-31 10:33:14 PST
Created attachment 300231 [details]
patch

This patch does away with tierUpEntrySeen which seems to have been the cause of the regression on kraken/ai-astar. That benchmark has nested loops through inlining, and the innermost-loop enters through triggerTierUpNowInLoop which wasn't tracking tierUpEntrySeen, thereby preventing the trigger loop from setting any enclosing loop's trigger.

Interestingly, my local build didn't spot this as a regression because the generated code was perf-neutral but had significantly different Base/DFG/FTL/FTLOSR profile numbers. The same was true of other builds, on multiple machines (different profiles, but perf-neutral). In the end the performance regression only showed up on builds using older SDKs, which had different profiles *and* had significantly worst runtime performance (~80% regression on kraken/ai-astar). This is a troubling sign of this code's brittleness (which I'll ignore for now, and look into it later, better fix this now).


For this second fix I benchmarked approaches for the following combinations:

* Whether to drop tierUpEntrySeen for triggerTierUpNowInLoop only, or for checkTierUpAndOSREnterNow as well.
* Whether to set the triggers from outer-loop to inner-loop, or the reverse (again, trying both approaches for triggerTierUpNowInLoop and checkTierUpAndOSREnterNow).

The benchmarks tell me that dropping tierUpEntrySeen entirely and always doing outer-loop to inner-loop is a win. This patch is exactly the same as the last one, except for dropping tierUpEntrySeen.


Results (and profiles) attached separately.
Comment 31 JF Bastien 2017-01-31 10:33:51 PST
Created attachment 300232 [details]
Results + profiles

Results + profiles
Comment 32 Filip Pizlo 2017-01-31 10:49:58 PST
Comment on attachment 300231 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300231&action=review

> Source/JavaScriptCore/ChangeLog:9
> +        Re-land of r211224, but without tierUpEntrySeen which seems to
> +        have been the cause of the kraken/ai-astar regression.

tierUpEntrySeen has been around for a while.  Can you provide a lot more details about why that's not needed anymore, and why it caused the regression in ai-astar?

I think that if we can't answer that question, then wrapping this fix up together with a rewrite of a large chunk DFG->FTL OSR entry is not the right way to go.  We need to find what is the minimal change to the code that fixes the bug.
Comment 33 JF Bastien 2017-01-31 11:06:28 PST
(In reply to comment #32)
> Comment on attachment 300231 [details]
> patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=300231&action=review
> 
> > Source/JavaScriptCore/ChangeLog:9
> > +        Re-land of r211224, but without tierUpEntrySeen which seems to
> > +        have been the cause of the kraken/ai-astar regression.
> 
> tierUpEntrySeen has been around for a while.  Can you provide a lot more
> details about why that's not needed anymore, and why it caused the
> regression in ai-astar?
> 
> I think that if we can't answer that question, then wrapping this fix up
> together with a rewrite of a large chunk DFG->FTL OSR entry is not the right
> way to go.  We need to find what is the minimal change to the code that
> fixes the bug.

The following code is in triggerOSREntryNow but not in triggerTierUpNowInLoop:

  jitCode->tierUpEntrySeen.add(bytecodeIndex);

The loop-traversal code then does:

             // Traverse the loop hierarchy from the outer-most loop, to the inner-most one.
             for (auto iterator = tierUpHierarchyEntry->value.rbegin(), end = tierUpHierarchyEntry->value.rend(); iterator != end; ++iterator) {
                 unsigned osrEntryCandidate = *iterator;
                 if (jitCode->tierUpEntrySeen.contains(osrEntryCandidate)) {
                     ....

That's the source of the bug in the patch which was previously committed, and then reverted.


I agree with your suggestion: I'll create a minimally-disruptive fix, upload+commit after review, and then submit this code as a bigger refactoring in a separate patch. That'll make sure that I separate bug fix from performance+refactoring fix.
Comment 34 JF Bastien 2017-01-31 15:30:30 PST
Created attachment 300271 [details]
patch

Here's a very minimal patch as discussed. If it sticks I'll commit the bigger refactor+tuning in a follow-up.
Comment 35 JF Bastien 2017-01-31 15:31:01 PST
Created attachment 300272 [details]
Benchmark results

Benchmark.
Comment 36 WebKit Commit Bot 2017-01-31 15:32:54 PST
Attachment 300271 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/dfg/DFGOperations.cpp:2431:  Multi line control clauses should use braces.  [whitespace/braces] [4]
Total errors found: 1 in 2 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 37 Filip Pizlo 2017-01-31 15:36:36 PST
Comment on attachment 300271 [details]
patch

I don't understand this change. Why does every conditional do nothing when triggeredSlowPath?  Doesn't that mean that when you do get force triggered at the outer loop at which you could compile, then you'll end up not compiling?  Shouldn't you be predicating on canOSRFromHere?
Comment 38 Filip Pizlo 2017-01-31 15:43:27 PST
Comment on attachment 300271 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300271&action=review

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2419
> +    if (!triggeredSlowPath) {

Not sure if this should be here.  If we already have an osrEntryBlock, as the code this is guarding checks, then we have to count retry.
Comment 39 Filip Pizlo 2017-01-31 15:46:08 PST
Comment on attachment 300271 [details]
patch

OK - the only part of this that I'm uncomfortable about is not doing the existing-osrEntryBlock dance when triggeredSlowPath.  What do you think?
Comment 40 JF Bastien 2017-01-31 15:54:58 PST
(In reply to comment #37)
> Comment on attachment 300271 [details]
> patch
> 
> I don't understand this change. Why does every conditional do nothing when
> triggeredSlowPath?  Doesn't that mean that when you do get force triggered
> at the outer loop at which you could compile, then you'll end up not
> compiling?  Shouldn't you be predicating on canOSRFromHere?

(reposting here from IRC discussion).

canOSRFromHere only happens when entering from triggerTierUpNowInLoop, which gets generated by CheckTierUpInLoop, which itself gets generated by performTierUpCheckInjection when canOSREnterAtLoopHint(level, block, nodeIndex) says "can't". That's mostly orthogonal to triggeredSlowPath.

triggeredSlowPath basically skips all the places we can return from, and goes directly to the compile step instead. That's why it does all the shortcuts (but only after testing the condition, which may have side-effects). The intent is that triggeredSlowPath  does all the required bookkeeping, and takes you straight down to jitCode->reconstruct, which then triggers a compile.



Separately though, Fil pointed out that the entire section under "It's time to try to compile code for OSR entry" didn't need to be guarded by `! triggeredSlowPath`. I'll revert that part of the change, benchmark, and upload a new patch.
Comment 41 JF Bastien 2017-01-31 16:20:40 PST
Created attachment 300280 [details]
patch

Updated patch. Perf numbers seem to be the same.
Comment 42 WebKit Commit Bot 2017-01-31 17:27:24 PST
Comment on attachment 300280 [details]
patch

Clearing flags on attachment: 300280

Committed r211461: <http://trac.webkit.org/changeset/211461>
Comment 43 WebKit Commit Bot 2017-01-31 17:27:31 PST
All reviewed patches have been landed.  Closing bug.
Comment 44 JF Bastien 2017-02-01 09:30:16 PST
This seems to regress *some* kraken/ai-astar builds again by ~80%. I'm looking at why, may revert again.
Comment 45 JF Bastien 2017-02-01 10:03:39 PST
This is odd. In the previous patch I had perf-neutral results with different tiering behavior because of a bug. This patch has the same tiering behavior as far as I can tell.

I ran the following to create a csv file, before and after my change:

for i in `seq 1 100`; do ./current/bin/jsc ./JSTests/ChakraCore/test/benchmarks/Kraken/ai-astar.js -p a 2>&1 > /dev/null && printf "f\nq" | ./Tools/Scripts/display-profiler-output a 2>&1 | grep findGraphNode | sed 's| *[^ ]* *[0-9]* *\([^ ]*\).*|\1|' | sed 's|/|,|g' ; done

Before my change:

	Base	DFG	FTL	FTLOSR
Average	1360	39254117	2175101	5277488
stddev	392	11810751	233459	11788088

After my change:

	Base	DFG	FTL	FTLOSR
Average	1279	40142662	2128971	4435185
stddev	242	10385349	229923	10371637

There are two modes when running the code: sometimes we hit straight FTL for findGraphNode, and sometimes we hit FTLOSR. Both before as well as after, straight FTL occurs about 80% of the time, and FTLOSR 20%. Hitting FTLOSR is a 10%-20% perf uptick.


But this is all with my local build, not with the build on the bots that show a slowdown. I need to dig some more.
Comment 46 JF Bastien 2017-02-01 11:00:40 PST
Interesting data point: using an older compiler and comparing before / after my change (again 100 profiled runs) I get the following results:

Before:

	Runtime	Base	DFG	FTL	FTLOSR
Average	183	1424	4002246	2317610	40386834
Stddev	5	456	452149	310826	594810

After:

	Runtime	Base	DFG	FTL	FTLOSR
Average	268	1525	40450595	2265327	3990661
Stddev	28	1065	10648728	471780	10704523

Somehow, using an older C++ compiler drastically changes jsc's behavior. It used to go out of DFG pretty quickly, and each run does *both* FTL ad FTLOSR.

After my change it's modal: most runs do FTL only, and a few do FTL+FTLOSR. Even the few that do FTL+FTLOSR stay in DFG for longer, so their performance is worse than before my change (and the FTL only ones have way worse performance).


I'll try out a build with a newer compiler (instead of whatever I have locally), and compare to this.
Comment 47 JF Bastien 2017-02-01 13:48:34 PST
Odd: a bot build with newer compiler has similar results! I'm probably perturbing compilation enough (by waiting until we hit the outer loop) that we enter FTL from the function instead of through the OSR entry.

Digging some more...
Comment 48 JF Bastien 2017-02-01 14:05:44 PST
OK the data problem stems from my cmake build *not* exhibiting this issue (before / after my patch performance and behavior are the same), but the make build exhibits this issue (behavior and performance change, code becomes modal FTL mostly and FTLOSR sometimes).

I'm not sure *why* there's a difference, but now I know *where* the difference is.
Comment 49 WebKit Commit Bot 2017-02-01 17:55:47 PST
Re-opened since this is blocked by bug 167721
Comment 50 JF Bastien 2017-02-02 12:39:22 PST
Created attachment 300434 [details]
patch

This patch tries to only create mustHandleValues when the byte code location allows it. Contrary to what I'd measured before (with a different approach) it's a disaster:

ai-astar                                67.960+-0.933      !      78.297+-1.001         ! definitely 1.1521x slower
audio-beat-detection                    29.127+-0.044      ?      29.759+-0.616         ? might be 1.0217x slower
audio-dft                               60.107+-1.480             59.898+-0.281         
audio-fft                               21.444+-0.360             21.253+-0.086         
audio-oscillator                        44.575+-0.124      ?      44.627+-0.072         ?
imaging-darkroom                        49.453+-0.107             49.391+-0.095         
imaging-desaturate                      42.325+-0.314      ?      42.503+-0.343         ?
imaging-gaussian-blur                   52.743+-0.439      !     196.948+-48.231        ! definitely 3.7341x slower
json-parse-financial                    28.857+-0.080      ?      29.336+-0.467         ? might be 1.0166x slower
json-stringify-tinderbox                18.620+-0.198             18.599+-0.195         
stanford-crypto-aes                     30.074+-0.094      ?      30.164+-0.367         ?
stanford-crypto-ccm                     26.892+-0.281      ?      26.936+-0.524         ?
stanford-crypto-pbkdf2                  92.042+-0.862             91.473+-0.273         
stanford-crypto-sha256-iterative        29.186+-0.077      ?      29.214+-0.126         ?

<arithmetic>                            42.386+-0.150      !      53.457+-3.464         ! definitely 1.2612x slower


Looking at ai-astar, there's more FTLOSR exit going on:

      CodeBlock        #Instr            Source Counts                     Machine Counts          #Compil  Inlines  #Exits   Last Opts
                                      Base/DFG/FTL/FTLOSR               Base/DFG/FTL/FTLOSR                Src/Total         Get/Put/Call
 findGraphNode#EBmwVP    83       140738/8742609/37824773/15         140738/1803397/37824773/0        7       1/9      15       2/0/0
    search#DTtNtq       686            253327/49301/0/15                253327/6873500/0/15          10       0/0       0       33/4/9
removeGraphNode#DqCfaQ  102          8094/155404/134382/0               8094/106372/134382/0          3       1/6       0       3/0/0

I'll explore other options.
Comment 51 JF Bastien 2017-02-02 13:08:20 PST
Created attachment 300437 [details]
patch

This patch goes back to one of my original ideas of imprecise recovery. It seems to dampen the pain a bit:

                                           original                    new                                        

ai-astar                                67.288+-1.586      !      74.064+-1.935         ! definitely 1.1007x slower
audio-beat-detection                    29.172+-0.201      ?      29.256+-0.248         ?
audio-dft                               60.255+-1.880      ?      61.046+-1.703         ? might be 1.0131x slower
audio-fft                               21.212+-0.124      ?      21.370+-0.147         ?
audio-oscillator                        44.652+-0.222      ?      44.848+-0.159         ?
imaging-darkroom                        49.880+-0.140      ?      50.416+-0.816         ? might be 1.0107x slower
imaging-desaturate                      43.052+-1.005             43.049+-1.091         
imaging-gaussian-blur                   52.471+-1.869      ?      53.245+-1.872         ? might be 1.0148x slower
json-parse-financial                    29.343+-0.346             29.313+-0.239         
json-stringify-tinderbox                18.963+-0.167      ?      18.988+-0.232         ?
stanford-crypto-aes                     30.038+-0.175      ?      30.156+-0.212         ?
stanford-crypto-ccm                     26.529+-0.861      ?      27.998+-0.710         ? might be 1.0553x slower
stanford-crypto-pbkdf2                  91.727+-0.497      ?      92.433+-0.530         ?
stanford-crypto-sha256-iterative        29.734+-0.982             29.671+-0.279         

<arithmetic>                            42.451+-0.264      !      43.275+-0.223         ! definitely 1.0194x slower



      CodeBlock        #Instr            Source Counts                     Machine Counts          #Compil  Inlines  #Exits   Last Opts
                                      Base/DFG/FTL/FTLOSR               Base/DFG/FTL/FTLOSR                Src/Total         Get/Put/Call
 findGraphNode#EBmwVP    83       161883/7457230/39088698/19         161883/3149734/39088698/0        7       1/9      19       2/0/0
    search#DTtNtq       686            264380/37702/0/19                264380/4255896/0/19          10       0/0       0       33/4/9
removeGraphNode#DqCfaQ  102          4214/142140/151396/0               4214/104520/151396/0          3       1/6       0       3/0/0


I'll science this some more.
Comment 52 JF Bastien 2017-02-02 13:13:32 PST
I just realized why I couldn't figure out the exit reasons: `findGraphNode` gets inlined so the exit attributions go to it, but exit events to `search`.

With this current patch (imprecise recovery) exits in kraken/ai-astar are due to BadType. With the one just before (no mustHandleValue when byte code location is bad) it's due to InadequateCoverage.
Comment 53 JF Bastien 2017-02-02 13:40:35 PST
Created attachment 300445 [details]
benchmark results for imprecise mustHandleValues

benchmark results for imprecise mustHandleValues when the bytecode location isn't suitable for entry. This looks mostly neutral, though kraken is a slight regression.
Comment 54 Filip Pizlo 2017-02-02 13:54:10 PST
We can't land a statistically significant Kraken regression, no matter how small.

Also, we consider 2% on Kraken to be a medium-to-big regression, it takes a big amount of work to get a 2% Kraken speed-up.
Comment 55 Filip Pizlo 2017-02-02 14:03:20 PST
Comment on attachment 300437 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300437&action=review

> Source/JavaScriptCore/bytecode/ValueRecovery.cpp:76
> +        return jsUndefined();

This is probably your problem.

If you tell the DFG that something could be Undefined and there is no other signal telling the DFG otherwise, then the DFG will assume that it should speculate that it is Undefined.

It's not clear to me what you could return here that would be safe.  Our type inference system is not designed to allow for this kind of imprecision.
Comment 56 Filip Pizlo 2017-02-02 14:08:24 PST
Comment on attachment 300437 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300437&action=review

> Source/JavaScriptCore/bytecode/ValueRecovery.cpp:75
> +        if (v.isInt32())
> +            return JSValue(static_cast<int32_t>(v.asInt32()));
> +        if (v.isDouble())
> +            return jsNumber(purifyNaN(v.asDouble()));

This is a possible security vulnerability.  Say that `v` was actually a pointer.  Now the compiler is constant-propagating a constant.  Probably someone could extract this integer.  I think that we want to actively defend against leaking pointers.
Comment 57 JF Bastien 2017-02-02 17:05:27 PST
Created attachment 300476 [details]
collect mustHandleValues eagerly

Here's a separate approach where we eagerly keep a side-collection of mustHandleValues. Anything which ends up in tierUpCommon will reconstruct the mustHandleValues for this bytecode, and store them in a hash table which lives on jitCode. I tried to do a few different things (see below).

Neither ai-astar nor imagining-gaussian-blur are exiting.

Here's imaging-gaussian-blur:

Tip of tree:
     CodeBlock      #Instr            Source Counts                     Machine Counts          #Compil  Inlines  #Exits   Last Opts
                                   Base/DFG/FTL/FTLOSR               Base/DFG/FTL/FTLOSR                Src/Total         Get/Put/Call
gaussianBlur#AVtXjn  1335        36827/629333/0/17161183           36827/629333/0/17161183         4       0/0       0       8/0/8
buildKernel#BJnJ58   758                 0/0/0/0                           0/0/0/0                 0       0/0       0        N/A
  <global>#AzIPSP    382                 0/0/0/0                           0/0/0/0                 0       0/0       0        N/A

Snapshot on every entry:

gaussianBlur#AVtXjn  1335       40170/4244885/0/13542113           40170/4244885/0/13542113        4       0/0       0       8/0/8
buildKernel#BJnJ58   758                 0/0/0/0                           0/0/0/0                 0       0/0       0        N/A
  <global>#AzIPSP    382                 0/0/0/0                           0/0/0/0                 0       0/0       0        N/A

Here's ai-astar:


Tip of tree:
      CodeBlock        #Instr            Source Counts                     Machine Counts          #Compil  Inlines  #Exits   Last Opts
                                      Base/DFG/FTL/FTLOSR               Base/DFG/FTL/FTLOSR                Src/Total         Get/Put/Call
 findGraphNode#EBmwVP    83      1223/4263839/2462385/39980673         1223/381831/2462385/0          3       1/3       0       2/0/0
    search#DTtNtq       686          44722/29270/0/228626             44722/3848207/0/39679633        4       0/0       1       33/4/9
removeGraphNode#DqCfaQ  102           8168/64194/0/225518                  8168/35557/0/0             2       1/3       0       3/0/0
    isWall#ESQBmN        21             328/4319/0/7717                     328/2689/0/0              2       1/3       0       1/0/0
   GraphNode#BmXric      86              116/9411/0/0                       116/9411/0/0              2       0/0       0       0/7/0
     init#Ct2Vq3        168               9511/0/0/0                         9511/0/0/0               2       0/0       0       0/4/0
   heuristic#DCkysN     125             374/1472/0/3127                     374/847/0/0               2       1/3       0       6/0/2
   neighbors#CZggLo     349              1335/3531/0/0                     1335/3531/0/0              2       0/0      50       8/0/4
      go#A9jRvV         100                 0/0/0/0                           0/0/0/0                 0       0/0       0        N/A
 getEndingCell#EIKyLA   113                 0/0/0/0                           0/0/0/0                 0       0/0       0        N/A
getStartingCell#DPYZfh  133                 0/0/0/0                           0/0/0/0                 0       0/0       0        N/A
   <global>#CnBI62     311876               0/0/0/0                           0/0/0/0                 0       0/0       0        N/A

Snapshot on every entry:

 findGraphNode#EBmwVP    83      2192/40538040/1801301/4366274         2192/620297/1801301/0          3       1/7       0       2/0/0
    search#DTtNtq       686          36652/228678/0/35763             36652/39611751/0/4329914        8       0/0       1       33/4/9
removeGraphNode#DqCfaQ  102           1441/260938/0/35371                  1441/35700/0/0             2       1/7       0       3/0/0
    isWall#ESQBmN        21             236/11525/0/602                     236/2536/0/0              2       1/7       0       1/0/0
   <global>#CnBI62     311876             9511/0/0/0                         9511/0/0/0               1       0/0       0       0/0/0
   GraphNode#BmXric      86              8805/701/0/0                       8805/701/0/0              2       0/0       0       0/7/0
     init#Ct2Vq3        168               8605/0/0/0                         8605/0/0/0               2       0/0       0       0/4/0
   heuristic#DCkysN     125             280/4465/0/213                      280/837/0/0               2       1/7       0       6/0/2
   neighbors#CZggLo     349              1266/3614/0/0                     1266/3614/0/0              2       0/0      65       8/0/4
getStartingCell#DPYZfh  133                 0/0/0/0                           0/0/0/0                 0       0/0       0        N/A
 getEndingCell#EIKyLA   113                 0/0/0/0                           0/0/0/0                 0       0/0       0        N/A
      go#A9jRvV         100                 0/0/0/0                           0/0/0/0                 0       0/0       0        N/A



Snapshot on every entry that canOSRFromHere (attached patch):

                                           original                    new                                        

ai-astar                                68.971+-0.810      !     122.015+-10.873        ! definitely 1.7691x slower
audio-beat-detection                    30.100+-0.494             29.857+-0.234         
audio-dft                               70.446+-0.501      ?      71.522+-0.954         ? might be 1.0153x slower
audio-fft                               22.214+-1.279             21.557+-0.058           might be 1.0305x faster
audio-oscillator                        44.986+-0.178      ?      45.233+-0.140         ?
imaging-darkroom                        50.246+-0.178      ?      50.995+-1.570         ? might be 1.0149x slower
imaging-desaturate                      46.750+-0.492      ?      46.863+-0.725         ?
imaging-gaussian-blur                   54.106+-0.972      !      85.167+-13.275        ! definitely 1.5741x slower
json-parse-financial                    29.485+-0.062      ^      29.339+-0.049         ^ definitely 1.0050x faster
json-stringify-tinderbox                19.100+-0.076      ?      19.238+-0.089         ?
stanford-crypto-aes                     30.627+-0.137      ?      30.683+-0.146         ?
stanford-crypto-ccm                     28.089+-0.631             26.967+-0.564           might be 1.0416x faster
stanford-crypto-pbkdf2                  93.139+-0.430      ?      94.900+-3.599         ? might be 1.0189x slower
stanford-crypto-sha256-iterative        29.866+-0.079      ?      30.381+-0.875         ? might be 1.0172x slower

<arithmetic>                            44.152+-0.128      !      50.337+-1.148         ! definitely 1.1401x slower


With snapshot refresh every time tierUpEntryTriggers->value is non-zero (not when the threshold causes entry):

                                           original                    new                                        

ai-astar                                68.432+-0.934      !     129.185+-8.533         ! definitely 1.8878x slower
audio-beat-detection                    29.719+-0.078      ?      30.212+-0.435         ? might be 1.0166x slower
audio-dft                               71.023+-0.540             70.820+-0.490         
audio-fft                               21.629+-0.052             21.588+-0.062         
audio-oscillator                        45.109+-0.110      ?      45.236+-0.182         ?
imaging-darkroom                        50.273+-0.223             50.272+-0.220         
imaging-desaturate                      47.295+-0.650      ?      47.364+-0.885         ?
imaging-gaussian-blur                   54.633+-1.209      !      94.046+-13.214        ! definitely 1.7214x slower
json-parse-financial                    29.578+-0.079             29.514+-0.149         
json-stringify-tinderbox                19.259+-0.078             19.145+-0.068         
stanford-crypto-aes                     30.612+-0.113      ?      31.585+-2.217         ? might be 1.0318x slower
stanford-crypto-ccm                     27.001+-0.822             26.943+-0.737         
stanford-crypto-pbkdf2                  94.911+-2.870             93.123+-0.145           might be 1.0192x faster
stanford-crypto-sha256-iterative        30.470+-1.105             30.468+-1.097         

<arithmetic>                            44.282+-0.237      !      51.393+-1.039         ! definitely 1.1606x slower


With snapshot creation only on first entry (not on every trigger, and not when the threshold causes entry):

                                           original                    new                                        

ai-astar                                69.160+-0.928      !     131.462+-13.933        ! definitely 1.9008x slower
audio-beat-detection                    29.831+-0.121      ?      29.856+-0.129         ?
audio-dft                               71.850+-1.703             70.949+-0.512           might be 1.0127x faster
audio-fft                               21.573+-0.047      ?      21.635+-0.072         ?
audio-oscillator                        45.208+-0.140      ?      45.211+-0.128         ?
imaging-darkroom                        50.431+-0.221      ?      50.452+-0.172         ?
imaging-desaturate                      51.604+-9.816             46.967+-0.597           might be 1.0987x faster
imaging-gaussian-blur                   53.568+-1.192      !      88.972+-13.724        ! definitely 1.6609x slower
json-parse-financial                    29.618+-0.087      ^      28.988+-0.102         ^ definitely 1.0217x faster
json-stringify-tinderbox                19.201+-0.096      ^      18.511+-0.101         ^ definitely 1.0373x faster
stanford-crypto-aes                     30.747+-0.260      ?      31.444+-1.533         ? might be 1.0227x slower
stanford-crypto-ccm                     27.170+-0.658      ?      27.527+-0.765         ? might be 1.0131x slower
stanford-crypto-pbkdf2                  93.346+-0.389             93.088+-0.147         
stanford-crypto-sha256-iterative        29.946+-0.084             29.930+-0.103         

<arithmetic>                            44.518+-0.736      !      51.071+-1.430         ! definitely 1.1472x slower
Comment 58 JF Bastien 2017-02-03 10:27:50 PST
Created attachment 300546 [details]
patch

This patch is perf neutral, and is minimally-disruptive. I think this will stick! 🎉
Comment 59 JF Bastien 2017-02-03 10:29:12 PST
Created attachment 300547 [details]
Performance results

Of special interest are the kraken results:

Kraken:
   ai-astar                                           68.933+-1.456             68.404+-1.343         
   audio-beat-detection                               29.792+-0.137      ?      29.965+-0.765         ?
   audio-dft                                          71.272+-0.797             70.672+-0.700         
   audio-fft                                          23.429+-2.957             21.546+-0.103           might be 1.0874x faster
   audio-oscillator                                   44.972+-0.203      ?      45.022+-0.296         ?
   imaging-darkroom                                   50.153+-0.296             50.057+-0.407         
   imaging-desaturate                                 46.187+-0.685             46.068+-1.034         
   imaging-gaussian-blur                              54.450+-1.002             53.633+-1.700           might be 1.0152x faster
   json-parse-financial                               30.566+-1.894             28.952+-0.161           might be 1.0557x faster
   json-stringify-tinderbox                           19.054+-0.142      ^      18.637+-0.167         ^ definitely 1.0224x faster
   stanford-crypto-aes                                31.273+-1.453             30.708+-0.460           might be 1.0184x faster
   stanford-crypto-ccm                                28.895+-2.198             27.556+-0.912           might be 1.0486x faster
   stanford-crypto-pbkdf2                             92.832+-0.457      ?      92.892+-0.516         ?
   stanford-crypto-sha256-iterative                   29.809+-0.130      ?      30.369+-1.425         ? might be 1.0188x slower

   <arithmetic>                                       44.401+-0.297             43.892+-0.216           might be 1.0116x faster
Comment 60 Saam Barati 2017-02-03 13:30:19 PST
Comment on attachment 300546 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300546&action=review

Seems mostly good to me. Just some questions, some of which may stem from my shallow understanding of this code.

> Source/JavaScriptCore/ChangeLog:23
> +        entering. It therefore makes the trigger a tri-state (0: don't
> +        enter, 1: compilation complete, 2: please compile). Only 2 gets
> +        reset to zero, 1 does not.

Does it get reset to zero if the OSR entry compilation is jettisoned?

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2359
> +    bool triggeredSlowPath = false;

This name confuses me. I think it really means that we want a compilation elsewhere

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2363
> +        case 0:

Maybe it's worth giving these numbers some names in an enum?

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2470
> +        jitCode->dontOptimizeAnytimeSoon(codeBlock);

Why this?

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2479
> +            for (unsigned osrEntryCandidate : tierUpHierarchyEntry->value) {

Don't you want to try to compile the outermost loop first? If so, I think you want to iterate this vector backwards.

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2485
> +                    jitCode->dontOptimizeAnytimeSoon(codeBlock);

Why this?
Comment 61 JF Bastien 2017-02-03 13:35:03 PST
Fil asked me to check what the FTL compilation is doing to see when we FTL versus FTLOSR compile.

$ VM=./WebKitBuild/Release/ && DYLD_FRAMEWORK_PATH=$VM $VM/jsc ./JSTests/ChakraCore/test/benchmarks/Kraken/ai-astar.js --reportFTLCompileTimes=1

Before my patch:

  Optimized findGraphNode#EBmwVP:[0x10756ac60->0x107569ae0->0x1075b7480, NoneFunctionCall, 83 (DidTryToEnterInLoop)] using FTLMode with FTL into 731 bytes in 18.484423 ms (DFG: 6.894247, B3: 11.590176).
  Optimized search#DTtNtq:[0x10756ae90->0x107568ff0->0x1075b7660, NoneFunctionCall, 686 (DidTryToEnterInLoop)] using FTLForOSREntryMode with FTL into 3318 bytes in 16.226048 ms (DFG: 6.966971, B3: 9.259077).
  Optimized search#DTtNtq:[0x10756b0c0->0x107568ff0->0x1075b7660, NoneFunctionCall, 686 (DidTryToEnterInLoop)] using FTLMode with FTL into 4111 bytes in 16.218263 ms (DFG: 7.276756, B3: 8.941507).


After my patch:

  Optimized GraphNode#BmXric:[0x106568730->0x1065682d0->0x1065b6e40, NoneFunctionConstruct, 86] using FTLMode with FTL into 981 bytes in 1.618677 ms (DFG: 0.628264, B3: 0.990413).
  Optimized findGraphNode#EBmwVP:[0x10656ae90->0x106569d10->0x1065b7480, NoneFunctionCall, 83] using FTLMode with FTL into 731 bytes in 1.043065 ms (DFG: 0.535857, B3: 0.507208).
  Optimized search#DTtNtq:[0x10656b0c0->0x106569220->0x1065b7660, NoneFunctionCall, 686 (DidTryToEnterInLoop)] using FTLForOSREntryMode with FTL into 3433 bytes in 11.175271 ms (DFG: 6.220067, B3: 4.955204).
  Optimized search#DTtNtq:[0x10656b2f0->0x106569220->0x1065b7660, NoneFunctionCall, 686 (DidTryToEnterInLoop)] using FTLMode with FTL into 4261 bytes in 11.527336 ms (DFG: 6.404499, B3: 5.122837).

So search starts with FTLOSR before FTL. The window is between the two (which is where unsetting the trigger was causing perf to tank, because we didn't enter DFG->FTLOSR more than once) is pretty small. Adding more printing, we start FTL and FTLOSR compilations within 0.01 ms of each other, but FTLOSR finishes 0.5 ms earlier. There's just a tiny window, considering the benchmark itself runs for ~160 ms.


Inlining looks like this:

Compilation search#DTtNtq-2-DFG:
    bc#297 --> removeGraphNode#DqCfaQ
    bc#407 --> findGraphNode#EBmwVP
    bc#431 --> isWall#ESQBmN
    bc#475 --> findGraphNode#EBmwVP
    bc#535 --> heuristic#DCkysN
Compilation search#DTtNtq-3-FTLForOSREntry:
    bc#297 --> removeGraphNode#DqCfaQ
    bc#297 --> removeGraphNode#DqCfaQ
    bc#407 --> findGraphNode#EBmwVP
    bc#431 --> isWall#ESQBmN
    bc#475 --> findGraphNode#EBmwVP
    bc#535 --> heuristic#DCkysN
Comment 62 JF Bastien 2017-02-03 14:08:53 PST
Created attachment 300563 [details]
patch

Patch updated.

> > Source/JavaScriptCore/ChangeLog:23
> > +        entering. It therefore makes the trigger a tri-state (0: don't
> > +        enter, 1: compilation complete, 2: please compile). Only 2 gets
> > +        reset to zero, 1 does not.
> 
> Does it get reset to zero if the OSR entry compilation is jettisoned?

Oh yeah I should mention that!


> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2359
> > +    bool triggeredSlowPath = false;
> 
> This name confuses me. I think it really means that we want a compilation
> elsewhere

True, I renamed it. My refactoring will also make that much clearer.


> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2363
> > +        case 0:
> 
> Maybe it's worth giving these numbers some names in an enum?

Definitely! The refactoring patch did this. I was trying to keep this patch as small as I can (using an enum touches other files). But since it's not changing any behavior I'll add the enum here.


> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2470
> > +        jitCode->dontOptimizeAnytimeSoon(codeBlock);
> 
> Why this?

The idea here is to bump the threshold so that this function's loops stop entering this OSR handling code. We decided than another loop is a good place to optimize, and we've set their trigger accordingly, so they'll come into this OSR handling code, but we don't want the counter to do so as well.


> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2479
> > +            for (unsigned osrEntryCandidate : tierUpHierarchyEntry->value) {
> 
> Don't you want to try to compile the outermost loop first? If so, I think
> you want to iterate this vector backwards.

Correct, that's what yields better perf IIUC, but I'm leaving the old code as-is to not change its behavior. I'm just delaying the compile for correctness, not changing which loop gets compiled (as I did in my refactoring). I'll validate that going the other way is better separately.


> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2485
> > +                    jitCode->dontOptimizeAnytimeSoon(codeBlock);
> 
> Why this?

Same as above.
Comment 63 Filip Pizlo 2017-02-03 15:33:05 PST
Comment on attachment 300563 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300563&action=review

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2470
> +        jitCode->dontOptimizeAnytimeSoon(codeBlock);

This should be optimizeSoon.

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2485
> +                    jitCode->dontOptimizeAnytimeSoon(codeBlock);

This should be optimizeSoon.
Comment 64 Filip Pizlo 2017-02-03 15:38:11 PST
Comment on attachment 300563 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300563&action=review

>> Source/JavaScriptCore/dfg/DFGOperations.cpp:2470
>> +        jitCode->dontOptimizeAnytimeSoon(codeBlock);
> 
> This should be optimizeSoon.

Make that setOptimizationBlahBlah(Deferred)
Comment 65 JF Bastien 2017-02-03 16:00:15 PST
Created attachment 300574 [details]
patch

As we discussed on IRC I tried the 3 different approaches. setOptimizationThresholdBasedOnCompilationResult is the least different from existing code (it currently compiles and calls that function, we now set the trigger and then we want to back off a bit). The worry Saam expressed was that I was delaying regular FTL compile too much. Benchmarks don't show a difference, but we may just be getting lucky.

This patch calls setOptimizationThresholdBasedOnCompilationResult.

Here are kraken results. I'll compare setOptimizationThresholdBasedOnCompilationResult in a full benchmark run and post those results.

$ run-jsc-benchmarks original:./WebKitBuild/current/jsc dontSoon:./WebKitBuild/dontOptimizeAnytimeSoon/jsc optimizeSoon:./WebKitBuild/optimizeSoon/jsc setBlah:./WebKitBuild/setOptimizationThresholdBasedOnCompilationResult/jsc --kraken --outer 20

                                           original                  dontSoon                optimizeSoon                setBlah               setBlah v. original    

ai-astar                                67.627+-1.053             67.544+-0.798      ?      67.953+-0.978             67.816+-0.893         ?
audio-beat-detection                    29.133+-0.075      ?      29.167+-0.195             29.132+-0.082      ?      29.444+-0.602         ? might be 1.0107x slower
audio-dft                               60.809+-0.943             60.250+-0.500      ?      60.580+-0.707             60.182+-0.612           might be 1.0104x faster
audio-fft                               21.700+-0.663             21.397+-0.116      ?      21.871+-0.689             21.791+-0.882         ?
audio-oscillator                        44.702+-0.137      ?      44.817+-0.178      ?      44.823+-0.352      ?      44.930+-0.291         ?
imaging-darkroom                        50.023+-0.183      ?      50.038+-0.235             49.932+-0.170             49.903+-0.215         
imaging-desaturate                      42.867+-0.705      ?      44.027+-0.812             43.403+-0.818             43.248+-0.686         ?
imaging-gaussian-blur                   53.495+-0.845      ?      53.639+-0.928             53.342+-0.993             52.668+-1.117           might be 1.0157x faster
json-parse-financial                    29.381+-0.133      ^      27.807+-0.093             27.804+-0.128      !      28.143+-0.180         ^ definitely 1.0440x faster
json-stringify-tinderbox                17.856+-0.060      !      18.469+-0.080      ?      18.482+-0.116             18.456+-0.134         ! definitely 1.0336x slower
stanford-crypto-aes                     30.170+-0.145             30.113+-0.180      ?      30.128+-0.187             29.999+-0.192         
stanford-crypto-ccm                     26.464+-0.905      ?      26.932+-0.658      ?      27.611+-1.037             26.562+-0.881         ?
stanford-crypto-pbkdf2                  94.567+-3.453             92.140+-0.546      ?      92.339+-0.588             92.136+-0.507           might be 1.0264x faster
stanford-crypto-sha256-iterative        29.667+-0.616      ?      29.714+-0.638      ?      30.296+-1.051             29.510+-0.134         

<arithmetic>                            42.747+-0.281             42.575+-0.110      ?      42.693+-0.165             42.485+-0.168           might be 1.0062x faster
Comment 66 Saam Barati 2017-02-03 16:24:01 PST
Comment on attachment 300574 [details]
patch

r=me too
Comment 67 JF Bastien 2017-02-03 16:53:14 PST
I ran the full benchmark suite. Everything looks perf-neutral, including Sunspider and Octane!
Comment 68 WebKit Commit Bot 2017-02-03 17:19:06 PST
Comment on attachment 300574 [details]
patch

Clearing flags on attachment: 300574

Committed r211658: <http://trac.webkit.org/changeset/211658>
Comment 69 WebKit Commit Bot 2017-02-03 17:19:14 PST
All reviewed patches have been landed.  Closing bug.