Summary: OSR entry: delay outer-loop compilation when at inner-loop
Product: WebKit
Component: JavaScriptCore
Reporter: JF Bastien <jfbastien>
Assignee: JF Bastien <jfbastien>
Status: RESOLVED FIXED
Severity: Normal
Priority: P2
Keywords: InRadar
CC: commit-queue, fpizlo, ggaren, jfbastien, keith_miller, mark.lam, msaboff, saam, webkit-bug-importer
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Bug Depends on: 167479, 167721
Bug Blocks: 167741, 167811, 167460
Description
JF Bastien
2017-01-17 18:39:58 PST
Created attachment 299107 [details]
patch

Should address <rdar://problem/29144126>

Here are some benchmark results:

$ run-jsc-benchmarks original:original/bin/jsc delay:current-release/bin/jsc --kraken --outer 20

                                     original          delay
ai-astar                          67.957+-1.211     67.896+-1.411
audio-beat-detection              28.329+-1.439 ?   28.507+-1.279 ?
audio-dft                         59.496+-0.586 ?   59.804+-0.611 ?
audio-fft                         21.121+-0.106 ?   21.168+-0.115 ?
audio-oscillator                  45.449+-1.932     44.702+-0.187     might be 1.0167x faster
imaging-darkroom                  49.720+-0.295 ?   49.832+-0.357 ?
imaging-desaturate                43.173+-0.864     42.325+-0.775     might be 1.0200x faster
imaging-gaussian-blur             64.513+-9.577     60.548+-7.272     might be 1.0655x faster
json-parse-financial              29.491+-0.126 ?   29.521+-0.132 ?
json-stringify-tinderbox          18.723+-0.186     18.607+-0.184
stanford-crypto-aes               30.532+-0.240     30.424+-0.124
stanford-crypto-ccm               28.714+-0.854     28.599+-0.896
stanford-crypto-pbkdf2            95.847+-0.485     95.605+-0.445
stanford-crypto-sha256-iterative  30.506+-0.080 ?   30.534+-0.131 ?

<arithmetic>                      43.827+-0.718     43.434+-0.557     might be 1.0090x faster

$ run-jsc-benchmarks original:original/bin/jsc delay:current-release/bin/jsc --kraken --benchmark gaussian-blur --outer 200

                                     original          delay
imaging-gaussian-blur             63.372+-2.660     61.678+-2.618     might be 1.0275x faster
<arithmetic>                      63.372+-2.660     61.678+-2.618     might be 1.0275x faster

$ run-jsc-benchmarks original:original/bin/jsc delay:current-release/bin/jsc --benchmark marsaglia --outer 200

                                     original          delay
marsaglia-larger-ints             71.3672+-0.0630 ? 71.4541+-0.1207 ?
marsaglia-osr-entry               15.8131+-0.0794 ? 15.8814+-0.0861 ?
<geometric>                       33.5883+-0.0823 ? 33.6799+-0.0910 ? might be 1.0027x slower

$ run-jsc-benchmarks original:original/bin/jsc delay:current-release/bin/jsc --benchmark mandelbrot --outer 20

                                     original           delay
mandelbrot                        376.3381+-2.3982  374.4935+-0.6074
<geometric>                       376.3381+-2.3982  374.4935+-0.6074  might be 1.0049x faster

$ run-jsc-benchmarks original:original/bin/jsc delay:current-release/bin/jsc --benchmark nonude --outer 60

                                     original          delay
nonude                            67.6165+-0.6550 ? 67.9252+-0.7012 ?
<geometric>                       67.6165+-0.6550 ? 67.9252+-0.7012 ? might be 1.0046x slower

In the mandelbrot.js test I've got Unicode characters that display a fancy Mandelbrot image. Bugzilla doesn't like Unicode, but my terminal does (you can also print on each iteration to see a neat animation!). nonude.js has a baked-in constant array; I figure that's more reliable than randomly generated data. I could avoid line-wrap and put everything on a single line if you think that would be better. It needs to be about this size to tier up.

Comment on attachment 299107 [details]
patch
I'd like to provide feedback on this patch, but I can't because the mandelbrot microbenchmark is too big for the patch review tool.
Also, it seems to be fairly long-running. We don't want microbenchmarks that run for 100ms or more.
I don't think this is the right approach. It seems to defeat the purpose of having a target loop trigger. When you try to compile and you're not at the loop header, you just try again. The whole point of the trigger is so that we don't have to wait and hope that we enter the loop.

Created attachment 299313 [details]
patch
Update with a different approach: mark the outer loop for entry, and when we force its compilation then unmark it.
I still need to handle the cases where:
- outer-loop compilation fails (can it fail for outer but not inner? Do we care?)
- outer-loop doesn't trigger for too long (maybe we should try to trigger one loop in, and so on, until we reach the innermost loop)
I ran performance numbers but they look weird. I'll close things and re-run them.
Created attachment 299364 [details]
patch
Here's a sketch of what I think backoff could look like.
I still need to:
- Verify benchmarks, and tune this backoff thing.
- Make new tests shorter.
- Update changelog.
(In reply to comment #6)
> Created attachment 299364 [details]
> patch
>
> Here's a sketch of what I think back off could be like.
>
> I still need to:
> - Verify benchmarks, and tune this backoff thing.
> - Make new tests shorter.
> - Update changelog.

Can you please upload a smaller patch? You should remove the benchmarks from the patch and land them anyway when you commit. We don't review the benchmark contents usually.

Comment on attachment 299364 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=299364&action=review

> Source/JavaScriptCore/dfg/DFGTierUpEntryTrigger.h:35
> +    DontEnter = 0, // Must stay zero: JIT-compiled code enters when non-zero.
> +    TryToCompileCode,
> +    CodeCompiledSuccessfully,

s/enters/takes the triggerOSREntryNow slow path

Maybe need to rename triggerOSREntryNow to triggerTierUpNow

Maybe: None, Compile, OSREnter

Created attachment 299381 [details]
patch
Update patch with Geoff's comments, and drop tests for now.
Created attachment 299395 [details]
patch
Add some logging, and clean up a minor issue with entry location.
Comment on attachment 299395 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=299395&action=review

> Source/JavaScriptCore/bytecode/ExecutionCounter.cpp:75
> template<CountingVariant countingVariant>
> +void ExecutionCounter<countingVariant>::deferForAWhile()
> +{
> +    ASSERT(m_activeThreshold != std::numeric_limits<int32_t>::max());
> +    m_totalCount = count();
> +    m_counter = -static_cast<int32_t>(m_activeThreshold);
> +}

This function is kinda weird. It's totally different from how we usually ask the execution counter to trigger soon.

> Source/JavaScriptCore/dfg/DFGJITCode.cpp:144
> +void JITCode::dontOptimizeForAWhile(CodeBlock* codeBlock)
> +{
> +    ASSERT(codeBlock->jitType() == JITCode::DFGJIT);
> +    if (Options::verboseOSR())
> +        dataLog(*codeBlock, ": Not FTL-optimizing for a while.\n");
> +    tierUpCounter.deferForAWhile();
> +}

This seems like it's almost like optimizeAfterWarmUp or optimizeSoon. You should preferably choose one of those rather than introduce a new thing.

Created attachment 299639 [details]
patch
Substantially change the logic after an in-person talk with pizlo.
I still need to:
- Test this a bunch.
- Make the tests run faster.
The basic idea should be there though.
Comment on attachment 299639 [details]
patch
This looks pretty solid to me.
Created attachment 299676 [details]
patch
Fixed a few bugs found by tests. They're still running, I'll then benchmark and shorten the two new tests.
Tests seem happy. First run of the benchmarks seems OK: pretty much perf-neutral without regressing variance on kraken/imaging-gaussian-blur. I'm kicking off a longer benchmark run and signing out for today; I'll look at the results tomorrow and report back.

Created attachment 299713 [details]
patch
Typo fix.
Benchmarks finished, ruby is crunching on them.
Created attachment 299774 [details]
patch
This looks good to go. I rebased the changes, made the new benchmarks faster, and ran benchmarks (everything looks pretty neutral).
I'll file a separate bug to investigate mandelbrot's lack of FTL-ing. It's important to fix, but not as much as other things.
Attachment 299774 [details] did not pass style-queue:
ERROR: Source/JavaScriptCore/dfg/DFGOperations.cpp:2431: An else statement can be removed when the prior "if" concludes with a return, break, continue or goto statement. [readability/control_flow] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOperations.cpp:2613: An else statement can be removed when the prior "if" concludes with a return, break, continue or goto statement. [readability/control_flow] [4]
Total errors found: 2 in 18 files
If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 299774 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=299774&action=review

> JSTests/microbenchmarks/nonude.js:1
> +const makeFaster = 8;

Does this test actually exercise the crash?

> Source/JavaScriptCore/ChangeLog:29
> +        right JSValues, otherwise we're confusing the compiler and giving
> +        it inaccurate JSValues which can lead it to predict the wrong
> +        things, leading to suboptimal code or recompilation due to
> +        misprediction.

Also, crash

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2390
> +    // We're already compiling, and can't OSR enter. We therefore must have set our slow-path trigger as well as another slow path trigger, hoping one of these would kick off a compilation. Sure enough another bytecode location did kick off a compilation before this one got a chance to do so. There's nothing for us to do but wait.

Nit: make this multiline.

> Source/JavaScriptCore/dfg/DFGSpeculativeJIT64.cpp:5787
> +    static_assert(!static_cast<int>(TierUpEntryTrigger::None), "NonZero test below depends on this");

nit: I'd use uint8_t instead of int

Comment on attachment 299774 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=299774&action=review

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2625
> +    switch (osrEntryPreparation.error()) {
> +    case FTL::OSREntryFail::StackGrowthFailed:
> +        // It's not clear what we should do. We signal to try again after a while if that happens.
> +        return nullptr;
> +    case FTL::OSREntryFail::WrongBytecode:
> +        RELEASE_ASSERT_NOT_REACHED();
> +    }

Why not just RELEASE_ASSERT(error == StackGrowthFailed)?

Also, we tend not to have "else" statements for "if"s that end in returns.

Comment on attachment 299774 [details]
patch
LGTM, but I think you should address Saam's advice.
Created attachment 299816 [details]
patch

Address Saam's comments except one, as noted below.

Fix GTK build; it was sad about a missing return because it wasn't smart enough to see there was no missing return (or spoon, for that matter).

> > JSTests/microbenchmarks/nonude.js:1
> > +const makeFaster = 8;
>
> Does this test actually exercise the crash?

It's complicated... The code is super finicky, and I can add asserts at specific spots to observe unwanted things that turn out to be innocuous. As-is the benchmark pretty much doesn't exhibit bad behavior.

I was thinking about adding a few asserts as sanity checks, but I'd do that separately.

> > Source/JavaScriptCore/ChangeLog:29
> > +        right JSValues, otherwise we're confusing the compiler and giving
> > +        it inaccurate JSValues which can lead it to predict the wrong
> > +        things, leading to suboptimal code or recompilation due to
> > +        misprediction.
>
> Also, crash

I was purposefully not using that word, but pizlo says it's fine.

> View in context:
> https://bugs.webkit.org/attachment.cgi?id=299774&action=review
>
> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2625
> > +    switch (osrEntryPreparation.error()) {
> > +    case FTL::OSREntryFail::StackGrowthFailed:
> > +        // It's not clear what we should do. We signal to try again after a while if that happens.
> > +        return nullptr;
> > +    case FTL::OSREntryFail::WrongBytecode:
> > +        RELEASE_ASSERT_NOT_REACHED();
> > +    }
>
> Why not just RELEASE_ASSERT(error == StackGrowthFailed)?
> Also, we tend not to have "else" statements for "if"s that end in returns.

I like this approach because if / when we add new reasons for OSR entry failure we *have* to edit this code (because the switch must cover all enum cases, otherwise the compiler warns), think about what it means, and document it as above (what do we do about stack growth? Dunno... But wrong bytecode definitely can't happen here!).
Removing the "else" would make this code invalid: osrEntryPreparation wouldn't be in scope anymore (it's only in scope in the if / else). I can hoist it above the "if", but then IMO the code is ugly :-)

Happy to change it if you want; I just like this style quite a bit because IMO it's less likely to break in the future.

Attachment 299816 [details] did not pass style-queue:
ERROR: Source/JavaScriptCore/dfg/DFGOperations.cpp:2439: An else statement can be removed when the prior "if" concludes with a return, break, continue or goto statement. [readability/control_flow] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOperations.cpp:2637: An else statement can be removed when the prior "if" concludes with a return, break, continue or goto statement. [readability/control_flow] [4]
Total errors found: 2 in 18 files
If any of these errors are false positives, please file a bug against check-webkit-style.
(In reply to comment #22)
> Created attachment 299816 [details]
> patch
>
> Address Saam's comments except one as noted below.
> [...]
> I like this approach because if / when we add new reasons for OSR entry
> failure we *have* to edit this code (because the switch must cover all enum
> cases, otherwise compiler warns), think about what it means, and document it
> as above (what do we do about stack growth? Dunno... But wrong bytecode
> definitely can't happen here!).

I agree with that approach. I also like to use switch statements this way. We do it all the time in the compiler guts.

> Removing the "else" would make this code invalid: osrEntryPreparation
> wouldn't be in scope anymore (it's only in scope in the if / else). I can
> hoist it above the "if", but then IMO the code is ugly :-)

That would be the WebKit style.

> Happy to change it if you want, I just like this style quite a bit because
> IMO it's less likely to break in the future.

We're all very used to reading code where else is only used when 'then' might fall through. It's annoying when code breaks this tradition.

Created attachment 299823 [details]
patch
Un-else the code, pizlo makes a good case that it's WebKit style.
CQ+
Comment on attachment 299823 [details]
patch

Clearing flags on attachment: 299823

Committed r211224: <http://trac.webkit.org/changeset/211224>

All reviewed patches have been landed. Closing bug.

I'll follow up on mandelbrot not FTL-ing here: https://bugs.webkit.org/show_bug.cgi?id=167460. Not now though, it's not super high priority.

Re-opened since this is blocked by bug 167479

Created attachment 300231 [details]
patch
This patch does away with tierUpEntrySeen, which seems to have been the cause of the regression on kraken/ai-astar. That benchmark has nested loops through inlining, and the innermost loop enters through triggerTierUpNowInLoop, which wasn't tracking tierUpEntrySeen, thereby preventing the trigger loop from setting any enclosing loop's trigger.
Interestingly, my local build didn't spot this as a regression because the generated code was perf-neutral but had significantly different Base/DFG/FTL/FTLOSR profile numbers. The same was true of other builds, on multiple machines (different profiles, but perf-neutral). In the end the performance regression only showed up on builds using older SDKs, which had different profiles *and* significantly worse runtime performance (~80% regression on kraken/ai-astar). This is a troubling sign of this code's brittleness (which I'll ignore for now and look into later; better to fix this bug now).
For this second fix I benchmarked approaches for the following combinations:
* Whether to drop tierUpEntrySeen for triggerTierUpNowInLoop only, or for checkTierUpAndOSREnterNow as well.
* Whether to set the triggers from outer-loop to inner-loop, or the reverse (again, trying both approaches for triggerTierUpNowInLoop and checkTierUpAndOSREnterNow).
The benchmarks tell me that dropping tierUpEntrySeen entirely and always doing outer-loop to inner-loop is a win. This patch is exactly the same as the last one, except for dropping tierUpEntrySeen.
Results (and profiles) attached separately.
Created attachment 300232 [details]
Results + profiles
Results + profiles
Comment on attachment 300231 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300231&action=review

> Source/JavaScriptCore/ChangeLog:9
> +        Re-land of r211224, but without tierUpEntrySeen which seems to
> +        have been the cause of the kraken/ai-astar regression.

tierUpEntrySeen has been around for a while. Can you provide a lot more details about why that's not needed anymore, and why it caused the regression in ai-astar?

I think that if we can't answer that question, then wrapping this fix up together with a rewrite of a large chunk of DFG->FTL OSR entry is not the right way to go. We need to find what is the minimal change to the code that fixes the bug.

(In reply to comment #32)
> tierUpEntrySeen has been around for a while. Can you provide a lot more
> details about why that's not needed anymore, and why it caused the
> regression in ai-astar?
> [...]

The following code is in triggerOSREntryNow but not in triggerTierUpNowInLoop:

    jitCode->tierUpEntrySeen.add(bytecodeIndex);

The loop-traversal code then does:

    // Traverse the loop hierarchy from the outer-most loop, to the inner-most one.
    for (auto iterator = tierUpHierarchyEntry->value.rbegin(), end = tierUpHierarchyEntry->value.rend(); iterator != end; ++iterator) {
        unsigned osrEntryCandidate = *iterator;
        if (jitCode->tierUpEntrySeen.contains(osrEntryCandidate)) {
            ....
That's the source of the bug in the patch which was previously committed and then reverted.

I agree with your suggestion: I'll create a minimally-disruptive fix, upload + commit after review, and then submit this code as a bigger refactoring in a separate patch. That'll make sure I separate the bug fix from the performance + refactoring fix.

Created attachment 300271 [details]
patch
Here's a very minimal patch as discussed. If it sticks I'll commit the bigger refactor+tuning in a follow-up.
Created attachment 300272 [details]
Benchmark results
Benchmark.
Attachment 300271 [details] did not pass style-queue:
ERROR: Source/JavaScriptCore/dfg/DFGOperations.cpp:2431: Multi line control clauses should use braces. [whitespace/braces] [4]
Total errors found: 1 in 2 files
If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 300271 [details]
patch
I don't understand this change. Why does every conditional do nothing when triggeredSlowPath? Doesn't that mean that when you do get force triggered at the outer loop at which you could compile, then you'll end up not compiling? Shouldn't you be predicating on canOSRFromHere?
Comment on attachment 300271 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300271&action=review

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2419
> +    if (!triggeredSlowPath) {

Not sure if this should be here. If we already have an osrEntryBlock, as the code this is guarding checks, then we have to count retry.

Comment on attachment 300271 [details]
patch
OK - the only part of this that I'm uncomfortable about is not doing the existing-osrEntryBlock dance when triggeredSlowPath. What do you think?
(In reply to comment #37)
> Comment on attachment 300271 [details]
> patch
>
> I don't understand this change. Why does every conditional do nothing when
> triggeredSlowPath? Doesn't that mean that when you do get force triggered
> at the outer loop at which you could compile, then you'll end up not
> compiling? Shouldn't you be predicating on canOSRFromHere?

(Reposting here from IRC discussion.)

canOSRFromHere only happens when entering from triggerTierUpNowInLoop, which gets generated by CheckTierUpInLoop, which itself gets generated by performTierUpCheckInjection when canOSREnterAtLoopHint(level, block, nodeIndex) says "can't". That's mostly orthogonal to triggeredSlowPath.

triggeredSlowPath basically skips all the places we can return from, and goes directly to the compile step instead. That's why it does all the shortcuts (but only after testing the condition, which may have side-effects). The intent is that triggeredSlowPath does all the required bookkeeping, and takes you straight down to jitCode->reconstruct, which then triggers a compile.

Separately though, Fil pointed out that the entire section under "It's time to try to compile code for OSR entry" didn't need to be guarded by `!triggeredSlowPath`. I'll revert that part of the change, benchmark, and upload a new patch.

Created attachment 300280 [details]
patch
Updated patch. Perf numbers seem to be the same.
Comment on attachment 300280 [details]
patch

Clearing flags on attachment: 300280

Committed r211461: <http://trac.webkit.org/changeset/211461>

All reviewed patches have been landed. Closing bug.

This seems to regress *some* kraken/ai-astar builds again by ~80%. I'm looking at why, may revert again.

This is odd. In the previous patch I had perf-neutral results with different tiering behavior because of a bug. This patch has the same tiering behavior as far as I can tell. I ran the following to create a csv file, before and after my change:

    for i in `seq 1 100`; do ./current/bin/jsc ./JSTests/ChakraCore/test/benchmarks/Kraken/ai-astar.js -p a 2>&1 > /dev/null && printf "f\nq" | ./Tools/Scripts/display-profiler-output a 2>&1 | grep findGraphNode | sed 's| *[^ ]* *[0-9]* *\([^ ]*\).*|\1|' | sed 's|/|,|g' ; done

Before my change:

            Base   DFG        FTL      FTLOSR
    Average 1360   39254117   2175101  5277488
    stddev  392    11810751   233459   11788088

After my change:

            Base   DFG        FTL      FTLOSR
    Average 1279   40142662   2128971  4435185
    stddev  242    10385349   229923   10371637

There are two modes when running the code: sometimes we hit straight FTL for findGraphNode, and sometimes we hit FTLOSR. Both before and after, straight FTL occurs about 80% of the time, and FTLOSR 20%. Hitting FTLOSR is a 10%-20% perf uptick. But this is all with my local build, not with the builds on the bots that show a slowdown. I need to dig some more.

Interesting data point: using an older compiler and comparing before / after my change (again 100 profiled runs) I get the following results:

Before:

            Runtime  Base  DFG       FTL      FTLOSR
    Average 183      1424  4002246   2317610  40386834
    Stddev  5        456   452149    310826   594810

After:

            Runtime  Base  DFG       FTL       FTLOSR
    Average 268      1525  40450595  2265327   3990661
    Stddev  28       1065  10648728  471780    10704523

Somehow, using an older C++ compiler drastically changes jsc's behavior. It used to go out of the DFG pretty quickly, and each run did *both* FTL and FTLOSR.
After my change it's modal: most runs do FTL only, and a few do FTL+FTLOSR. Even the few that do FTL+FTLOSR stay in the DFG for longer, so their performance is worse than before my change (and the FTL-only ones have far worse performance). I'll try out a build with a newer compiler (instead of whatever I have locally), and compare to this.

Odd: a bot build with a newer compiler has similar results! I'm probably perturbing compilation enough (by waiting until we hit the outer loop) that we enter the FTL from the function instead of through the OSR entry. Digging some more...

OK, the data problem stems from my cmake build *not* exhibiting this issue (before / after my patch, performance and behavior are the same), while the make build does exhibit it (behavior and performance change; code becomes modal, FTL mostly and FTLOSR sometimes). I'm not sure *why* there's a difference, but now I know *where* the difference is.

Re-opened since this is blocked by bug 167721

Created attachment 300434 [details]
patch
This patch tries to only create mustHandleValues when the bytecode location allows it. Contrary to what I'd measured before (with a different approach), it's a disaster:
ai-astar 67.960+-0.933 ! 78.297+-1.001 ! definitely 1.1521x slower
audio-beat-detection 29.127+-0.044 ? 29.759+-0.616 ? might be 1.0217x slower
audio-dft 60.107+-1.480 59.898+-0.281
audio-fft 21.444+-0.360 21.253+-0.086
audio-oscillator 44.575+-0.124 ? 44.627+-0.072 ?
imaging-darkroom 49.453+-0.107 49.391+-0.095
imaging-desaturate 42.325+-0.314 ? 42.503+-0.343 ?
imaging-gaussian-blur 52.743+-0.439 ! 196.948+-48.231 ! definitely 3.7341x slower
json-parse-financial 28.857+-0.080 ? 29.336+-0.467 ? might be 1.0166x slower
json-stringify-tinderbox 18.620+-0.198 18.599+-0.195
stanford-crypto-aes 30.074+-0.094 ? 30.164+-0.367 ?
stanford-crypto-ccm 26.892+-0.281 ? 26.936+-0.524 ?
stanford-crypto-pbkdf2 92.042+-0.862 91.473+-0.273
stanford-crypto-sha256-iterative 29.186+-0.077 ? 29.214+-0.126 ?
<arithmetic> 42.386+-0.150 ! 53.457+-3.464 ! definitely 1.2612x slower
Looking at ai-astar, there's more FTLOSR exit going on:
CodeBlock #Instr Source Counts Machine Counts #Compil Inlines #Exits Last Opts
Base/DFG/FTL/FTLOSR Base/DFG/FTL/FTLOSR Src/Total Get/Put/Call
findGraphNode#EBmwVP 83 140738/8742609/37824773/15 140738/1803397/37824773/0 7 1/9 15 2/0/0
search#DTtNtq 686 253327/49301/0/15 253327/6873500/0/15 10 0/0 0 33/4/9
removeGraphNode#DqCfaQ 102 8094/155404/134382/0 8094/106372/134382/0 3 1/6 0 3/0/0
I'll explore other options.
Created attachment 300437 [details]
patch
This patch goes back to one of my original ideas of imprecise recovery. It seems to dampen the pain a bit:
original new
ai-astar 67.288+-1.586 ! 74.064+-1.935 ! definitely 1.1007x slower
audio-beat-detection 29.172+-0.201 ? 29.256+-0.248 ?
audio-dft 60.255+-1.880 ? 61.046+-1.703 ? might be 1.0131x slower
audio-fft 21.212+-0.124 ? 21.370+-0.147 ?
audio-oscillator 44.652+-0.222 ? 44.848+-0.159 ?
imaging-darkroom 49.880+-0.140 ? 50.416+-0.816 ? might be 1.0107x slower
imaging-desaturate 43.052+-1.005 43.049+-1.091
imaging-gaussian-blur 52.471+-1.869 ? 53.245+-1.872 ? might be 1.0148x slower
json-parse-financial 29.343+-0.346 29.313+-0.239
json-stringify-tinderbox 18.963+-0.167 ? 18.988+-0.232 ?
stanford-crypto-aes 30.038+-0.175 ? 30.156+-0.212 ?
stanford-crypto-ccm 26.529+-0.861 ? 27.998+-0.710 ? might be 1.0553x slower
stanford-crypto-pbkdf2 91.727+-0.497 ? 92.433+-0.530 ?
stanford-crypto-sha256-iterative 29.734+-0.982 29.671+-0.279
<arithmetic> 42.451+-0.264 ! 43.275+-0.223 ! definitely 1.0194x slower
CodeBlock #Instr Source Counts Machine Counts #Compil Inlines #Exits Last Opts
Base/DFG/FTL/FTLOSR Base/DFG/FTL/FTLOSR Src/Total Get/Put/Call
findGraphNode#EBmwVP 83 161883/7457230/39088698/19 161883/3149734/39088698/0 7 1/9 19 2/0/0
search#DTtNtq 686 264380/37702/0/19 264380/4255896/0/19 10 0/0 0 33/4/9
removeGraphNode#DqCfaQ 102 4214/142140/151396/0 4214/104520/151396/0 3 1/6 0 3/0/0
I'll science this some more.
I just realized why I couldn't figure out the exit reasons: `findGraphNode` gets inlined, so the exit attributions go to it but the exit events go to `search`. With this current patch (imprecise recovery), exits in kraken/ai-astar are due to BadType. With the one just before (no mustHandleValues when the bytecode location is bad), they're due to InadequateCoverage.

Created attachment 300445 [details]
benchmark results for imprecise mustHandleValues
benchmark results for imprecise mustHandleValues when the bytecode location isn't suitable for entry. This looks mostly neutral, though kraken is a slight regression.
We can't land a statistically significant Kraken regression, no matter how small. Also, we consider 2% on Kraken to be a medium-to-big regression; it takes a big amount of work to get a 2% Kraken speed-up.

Comment on attachment 300437 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300437&action=review

> Source/JavaScriptCore/bytecode/ValueRecovery.cpp:76
> +        return jsUndefined();

This is probably your problem. If you tell the DFG that something could be Undefined and there is no other signal telling the DFG otherwise, then the DFG will assume that it should speculate that it is Undefined. It's not clear to me what you could return here that would be safe. Our type inference system is not designed to allow for this kind of imprecision.

Comment on attachment 300437 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300437&action=review

> Source/JavaScriptCore/bytecode/ValueRecovery.cpp:75
> +    if (v.isInt32())
> +        return JSValue(static_cast<int32_t>(v.asInt32()));
> +    if (v.isDouble())
> +        return jsNumber(purifyNaN(v.asDouble()));

This is a possible security vulnerability. Say that `v` was actually a pointer. Now the compiler is constant-propagating a constant. Probably someone could extract this integer. I think that we want to actively defend against leaking pointers.

Created attachment 300476 [details]
collect mustHandleValues eagerly
Here's a separate approach where we eagerly keep a side-collection of mustHandleValues. Anything which ends up in tierUpCommon will reconstruct the mustHandleValues for this bytecode, and store them in a hash table which lives on jitCode. I tried to do a few different things (see below).
Neither ai-astar nor imaging-gaussian-blur is exiting.
Here's imaging-gaussian-blur:
Tip of tree:
CodeBlock #Instr Source Counts Machine Counts #Compil Inlines #Exits Last Opts
Base/DFG/FTL/FTLOSR Base/DFG/FTL/FTLOSR Src/Total Get/Put/Call
gaussianBlur#AVtXjn 1335 36827/629333/0/17161183 36827/629333/0/17161183 4 0/0 0 8/0/8
buildKernel#BJnJ58 758 0/0/0/0 0/0/0/0 0 0/0 0 N/A
<global>#AzIPSP 382 0/0/0/0 0/0/0/0 0 0/0 0 N/A
Snapshot on every entry:
gaussianBlur#AVtXjn 1335 40170/4244885/0/13542113 40170/4244885/0/13542113 4 0/0 0 8/0/8
buildKernel#BJnJ58 758 0/0/0/0 0/0/0/0 0 0/0 0 N/A
<global>#AzIPSP 382 0/0/0/0 0/0/0/0 0 0/0 0 N/A
Here's ai-astar:
Tip of tree:
CodeBlock #Instr Source Counts Machine Counts #Compil Inlines #Exits Last Opts
Base/DFG/FTL/FTLOSR Base/DFG/FTL/FTLOSR Src/Total Get/Put/Call
findGraphNode#EBmwVP 83 1223/4263839/2462385/39980673 1223/381831/2462385/0 3 1/3 0 2/0/0
search#DTtNtq 686 44722/29270/0/228626 44722/3848207/0/39679633 4 0/0 1 33/4/9
removeGraphNode#DqCfaQ 102 8168/64194/0/225518 8168/35557/0/0 2 1/3 0 3/0/0
isWall#ESQBmN 21 328/4319/0/7717 328/2689/0/0 2 1/3 0 1/0/0
GraphNode#BmXric 86 116/9411/0/0 116/9411/0/0 2 0/0 0 0/7/0
init#Ct2Vq3 168 9511/0/0/0 9511/0/0/0 2 0/0 0 0/4/0
heuristic#DCkysN 125 374/1472/0/3127 374/847/0/0 2 1/3 0 6/0/2
neighbors#CZggLo 349 1335/3531/0/0 1335/3531/0/0 2 0/0 50 8/0/4
go#A9jRvV 100 0/0/0/0 0/0/0/0 0 0/0 0 N/A
getEndingCell#EIKyLA 113 0/0/0/0 0/0/0/0 0 0/0 0 N/A
getStartingCell#DPYZfh 133 0/0/0/0 0/0/0/0 0 0/0 0 N/A
<global>#CnBI62 311876 0/0/0/0 0/0/0/0 0 0/0 0 N/A
Snapshot on every entry:
findGraphNode#EBmwVP 83 2192/40538040/1801301/4366274 2192/620297/1801301/0 3 1/7 0 2/0/0
search#DTtNtq 686 36652/228678/0/35763 36652/39611751/0/4329914 8 0/0 1 33/4/9
removeGraphNode#DqCfaQ 102 1441/260938/0/35371 1441/35700/0/0 2 1/7 0 3/0/0
isWall#ESQBmN 21 236/11525/0/602 236/2536/0/0 2 1/7 0 1/0/0
<global>#CnBI62 311876 9511/0/0/0 9511/0/0/0 1 0/0 0 0/0/0
GraphNode#BmXric 86 8805/701/0/0 8805/701/0/0 2 0/0 0 0/7/0
init#Ct2Vq3 168 8605/0/0/0 8605/0/0/0 2 0/0 0 0/4/0
heuristic#DCkysN 125 280/4465/0/213 280/837/0/0 2 1/7 0 6/0/2
neighbors#CZggLo 349 1266/3614/0/0 1266/3614/0/0 2 0/0 65 8/0/4
getStartingCell#DPYZfh 133 0/0/0/0 0/0/0/0 0 0/0 0 N/A
getEndingCell#EIKyLA 113 0/0/0/0 0/0/0/0 0 0/0 0 N/A
go#A9jRvV 100 0/0/0/0 0/0/0/0 0 0/0 0 N/A
Snapshot on every entry that canOSRFromHere (attached patch):
original new
ai-astar 68.971+-0.810 ! 122.015+-10.873 ! definitely 1.7691x slower
audio-beat-detection 30.100+-0.494 29.857+-0.234
audio-dft 70.446+-0.501 ? 71.522+-0.954 ? might be 1.0153x slower
audio-fft 22.214+-1.279 21.557+-0.058 might be 1.0305x faster
audio-oscillator 44.986+-0.178 ? 45.233+-0.140 ?
imaging-darkroom 50.246+-0.178 ? 50.995+-1.570 ? might be 1.0149x slower
imaging-desaturate 46.750+-0.492 ? 46.863+-0.725 ?
imaging-gaussian-blur 54.106+-0.972 ! 85.167+-13.275 ! definitely 1.5741x slower
json-parse-financial 29.485+-0.062 ^ 29.339+-0.049 ^ definitely 1.0050x faster
json-stringify-tinderbox 19.100+-0.076 ? 19.238+-0.089 ?
stanford-crypto-aes 30.627+-0.137 ? 30.683+-0.146 ?
stanford-crypto-ccm 28.089+-0.631 26.967+-0.564 might be 1.0416x faster
stanford-crypto-pbkdf2 93.139+-0.430 ? 94.900+-3.599 ? might be 1.0189x slower
stanford-crypto-sha256-iterative 29.866+-0.079 ? 30.381+-0.875 ? might be 1.0172x slower
<arithmetic> 44.152+-0.128 ! 50.337+-1.148 ! definitely 1.1401x slower
With snapshot refresh every time tierUpEntryTriggers->value is non-zero (not when the threshold causes entry):
original new
ai-astar 68.432+-0.934 ! 129.185+-8.533 ! definitely 1.8878x slower
audio-beat-detection 29.719+-0.078 ? 30.212+-0.435 ? might be 1.0166x slower
audio-dft 71.023+-0.540 70.820+-0.490
audio-fft 21.629+-0.052 21.588+-0.062
audio-oscillator 45.109+-0.110 ? 45.236+-0.182 ?
imaging-darkroom 50.273+-0.223 50.272+-0.220
imaging-desaturate 47.295+-0.650 ? 47.364+-0.885 ?
imaging-gaussian-blur 54.633+-1.209 ! 94.046+-13.214 ! definitely 1.7214x slower
json-parse-financial 29.578+-0.079 29.514+-0.149
json-stringify-tinderbox 19.259+-0.078 19.145+-0.068
stanford-crypto-aes 30.612+-0.113 ? 31.585+-2.217 ? might be 1.0318x slower
stanford-crypto-ccm 27.001+-0.822 26.943+-0.737
stanford-crypto-pbkdf2 94.911+-2.870 93.123+-0.145 might be 1.0192x faster
stanford-crypto-sha256-iterative 30.470+-1.105 30.468+-1.097
<arithmetic> 44.282+-0.237 ! 51.393+-1.039 ! definitely 1.1606x slower
With snapshot creation only on first entry (not on every trigger, and not when the threshold causes entry):
original new
ai-astar 69.160+-0.928 ! 131.462+-13.933 ! definitely 1.9008x slower
audio-beat-detection 29.831+-0.121 ? 29.856+-0.129 ?
audio-dft 71.850+-1.703 70.949+-0.512 might be 1.0127x faster
audio-fft 21.573+-0.047 ? 21.635+-0.072 ?
audio-oscillator 45.208+-0.140 ? 45.211+-0.128 ?
imaging-darkroom 50.431+-0.221 ? 50.452+-0.172 ?
imaging-desaturate 51.604+-9.816 46.967+-0.597 might be 1.0987x faster
imaging-gaussian-blur 53.568+-1.192 ! 88.972+-13.724 ! definitely 1.6609x slower
json-parse-financial 29.618+-0.087 ^ 28.988+-0.102 ^ definitely 1.0217x faster
json-stringify-tinderbox 19.201+-0.096 ^ 18.511+-0.101 ^ definitely 1.0373x faster
stanford-crypto-aes 30.747+-0.260 ? 31.444+-1.533 ? might be 1.0227x slower
stanford-crypto-ccm 27.170+-0.658 ? 27.527+-0.765 ? might be 1.0131x slower
stanford-crypto-pbkdf2 93.346+-0.389 93.088+-0.147
stanford-crypto-sha256-iterative 29.946+-0.084 29.930+-0.103
<arithmetic> 44.518+-0.736 ! 51.071+-1.430 ! definitely 1.1472x slower
Created attachment 300546 [details]
patch
This patch is perf-neutral, and is minimally disruptive. I think this will stick! 🎉
Created attachment 300547 [details]
Performance results
Of special interest are the Kraken results:
Kraken:
ai-astar 68.933+-1.456 68.404+-1.343
audio-beat-detection 29.792+-0.137 ? 29.965+-0.765 ?
audio-dft 71.272+-0.797 70.672+-0.700
audio-fft 23.429+-2.957 21.546+-0.103 might be 1.0874x faster
audio-oscillator 44.972+-0.203 ? 45.022+-0.296 ?
imaging-darkroom 50.153+-0.296 50.057+-0.407
imaging-desaturate 46.187+-0.685 46.068+-1.034
imaging-gaussian-blur 54.450+-1.002 53.633+-1.700 might be 1.0152x faster
json-parse-financial 30.566+-1.894 28.952+-0.161 might be 1.0557x faster
json-stringify-tinderbox 19.054+-0.142 ^ 18.637+-0.167 ^ definitely 1.0224x faster
stanford-crypto-aes 31.273+-1.453 30.708+-0.460 might be 1.0184x faster
stanford-crypto-ccm 28.895+-2.198 27.556+-0.912 might be 1.0486x faster
stanford-crypto-pbkdf2 92.832+-0.457 ? 92.892+-0.516 ?
stanford-crypto-sha256-iterative 29.809+-0.130 ? 30.369+-1.425 ? might be 1.0188x slower
<arithmetic> 44.401+-0.297 43.892+-0.216 might be 1.0116x faster
Comment on attachment 300546 [details] patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300546&action=review

Seems mostly good to me. Just some questions, some of which may stem from my shallow understanding of this code.

> Source/JavaScriptCore/ChangeLog:23
> +    entering. It therefore makes the trigger a tri-state (0: don't
> +    enter, 1: compilation complete, 2: please compile). Only 2 gets
> +    reset to zero, 1 does not.

Does it get reset to zero if the OSR entry compilation is jettisoned?

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2359
> +    bool triggeredSlowPath = false;

This name confuses me. I think it really means that we want a compilation elsewhere.

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2363
> +    case 0:

Maybe it's worth giving these numbers some names in an enum?

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2470
> +    jitCode->dontOptimizeAnytimeSoon(codeBlock);

Why this?

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2479
> +    for (unsigned osrEntryCandidate : tierUpHierarchyEntry->value) {

Don't you want to try to compile the outermost loop first? If so, I think you want to iterate this vector backwards.

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2485
> +    jitCode->dontOptimizeAnytimeSoon(codeBlock);

Why this?

Fil asked me to check what the FTL compilation is doing to see when we FTL versus FTLOSR compile.

$ VM=./WebKitBuild/Release/ && DYLD_FRAMEWORK_PATH=$VM $VM/jsc ./JSTests/ChakraCore/test/benchmarks/Kraken/ai-astar.js --reportFTLCompileTimes=1

Before my patch:

Optimized findGraphNode#EBmwVP:[0x10756ac60->0x107569ae0->0x1075b7480, NoneFunctionCall, 83 (DidTryToEnterInLoop)] using FTLMode with FTL into 731 bytes in 18.484423 ms (DFG: 6.894247, B3: 11.590176).
Optimized search#DTtNtq:[0x10756ae90->0x107568ff0->0x1075b7660, NoneFunctionCall, 686 (DidTryToEnterInLoop)] using FTLForOSREntryMode with FTL into 3318 bytes in 16.226048 ms (DFG: 6.966971, B3: 9.259077).
Optimized search#DTtNtq:[0x10756b0c0->0x107568ff0->0x1075b7660, NoneFunctionCall, 686 (DidTryToEnterInLoop)] using FTLMode with FTL into 4111 bytes in 16.218263 ms (DFG: 7.276756, B3: 8.941507).

After my patch:

Optimized GraphNode#BmXric:[0x106568730->0x1065682d0->0x1065b6e40, NoneFunctionConstruct, 86] using FTLMode with FTL into 981 bytes in 1.618677 ms (DFG: 0.628264, B3: 0.990413).
Optimized findGraphNode#EBmwVP:[0x10656ae90->0x106569d10->0x1065b7480, NoneFunctionCall, 83] using FTLMode with FTL into 731 bytes in 1.043065 ms (DFG: 0.535857, B3: 0.507208).
Optimized search#DTtNtq:[0x10656b0c0->0x106569220->0x1065b7660, NoneFunctionCall, 686 (DidTryToEnterInLoop)] using FTLForOSREntryMode with FTL into 3433 bytes in 11.175271 ms (DFG: 6.220067, B3: 4.955204).
Optimized search#DTtNtq:[0x10656b2f0->0x106569220->0x1065b7660, NoneFunctionCall, 686 (DidTryToEnterInLoop)] using FTLMode with FTL into 4261 bytes in 11.527336 ms (DFG: 6.404499, B3: 5.122837).

So search starts with FTLOSR before FTL. The window between the two (which is where unsetting the trigger was causing perf to tank, because we didn't enter DFG->FTLOSR more than once) is pretty small. Adding more printing, we start FTL and FTLOSR compilations within 0.01 ms of each other, but FTLOSR finishes 0.5 ms earlier. There's just a tiny window, considering the benchmark itself runs for ~160 ms.

Inlining looks like this:

Compilation search#DTtNtq-2-DFG:
    bc#297 --> removeGraphNode#DqCfaQ
    bc#407 --> findGraphNode#EBmwVP
    bc#431 --> isWall#ESQBmN
    bc#475 --> findGraphNode#EBmwVP
    bc#535 --> heuristic#DCkysN
Compilation search#DTtNtq-3-FTLForOSREntry:
    bc#297 --> removeGraphNode#DqCfaQ
    bc#297 --> removeGraphNode#DqCfaQ
    bc#407 --> findGraphNode#EBmwVP
    bc#431 --> isWall#ESQBmN
    bc#475 --> findGraphNode#EBmwVP
    bc#535 --> heuristic#DCkysN

Created attachment 300563 [details]
patch

Patch updated.

> > Source/JavaScriptCore/ChangeLog:23
> > +    entering. It therefore makes the trigger a tri-state (0: don't
> > +    enter, 1: compilation complete, 2: please compile). Only 2 gets
> > +    reset to zero, 1 does not.
>
> Does it get reset to zero if the OSR entry compilation is jettisoned?

Oh yeah, I should mention that!

> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2359
> > +    bool triggeredSlowPath = false;
>
> This name confuses me. I think it really means that we want a compilation elsewhere.

True, I renamed it. My refactoring will also make that much clearer.

> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2363
> > +    case 0:
>
> Maybe it's worth giving these numbers some names in an enum?

Definitely! The refactoring patch did this. I was trying to keep this patch as small as I can (using an enum touches other files). But since it's not changing any behavior I'll add the enum here.

> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2470
> > +    jitCode->dontOptimizeAnytimeSoon(codeBlock);
>
> Why this?

The idea here is to bump the threshold so that this function's loops stop entering this OSR handling code. We decided that another loop is a good place to optimize, and we've set its trigger accordingly, so it'll come into this OSR handling code, but we don't want the counter to do so as well.

> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2479
> > +    for (unsigned osrEntryCandidate : tierUpHierarchyEntry->value) {
>
> Don't you want to try to compile the outermost loop first? If so, I think you want to iterate this vector backwards.

Correct, that's what yields better perf IIUC, but I'm leaving the old code as-is to not change its behavior. I'm just delaying the compile for correctness, not changing which loop gets compiled (as I did in my refactoring). I'll validate that going the other way is better separately.

> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2485
> > +    jitCode->dontOptimizeAnytimeSoon(codeBlock);
>
> Why this?

Same as above.
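For what it's worth, the tri-state trigger described in the ChangeLog excerpt could be named roughly like this (illustrative names and a simplified consume rule, not the enum that actually landed):

```cpp
#include <cstdint>

// Illustrative names for the three trigger states quoted from the
// ChangeLog; the enum in the real patch may use different names.
enum class TriggerReason : uint8_t {
    DontTrigger = 0,      // 0: don't enter
    CompilationDone = 1,  // 1: compilation complete (sticky)
    StartCompilation = 2, // 2: please compile
};

// Simplified consume rule: only StartCompilation is reset to DontTrigger;
// CompilationDone persists so the loop keeps attempting OSR entry.
inline TriggerReason consumeTrigger(TriggerReason& trigger) {
    TriggerReason observed = trigger;
    if (observed == TriggerReason::StartCompilation)
        trigger = TriggerReason::DontTrigger;
    return observed;
}
```

The sticky `CompilationDone` state is what keeps the DFG loop re-entering the OSR path after the FTL-for-OSR-entry code is ready, instead of triggering a fresh compile each time.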
Comment on attachment 300563 [details] patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300563&action=review

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2470
> +    jitCode->dontOptimizeAnytimeSoon(codeBlock);

This should be optimizeSoon.

> Source/JavaScriptCore/dfg/DFGOperations.cpp:2485
> +    jitCode->dontOptimizeAnytimeSoon(codeBlock);

This should be optimizeSoon.

Comment on attachment 300563 [details] patch

View in context: https://bugs.webkit.org/attachment.cgi?id=300563&action=review

> > Source/JavaScriptCore/dfg/DFGOperations.cpp:2470
> > +    jitCode->dontOptimizeAnytimeSoon(codeBlock);
>
> This should be optimizeSoon.

Make that setOptimizationBlahBlah(Deferred)

Created attachment 300574 [details]
patch
As we discussed on IRC, I tried the three different approaches. setOptimizationThresholdBasedOnCompilationResult is the least different from existing code (it currently compiles and calls that function; we now set the trigger and then want to back off a bit). The worry Saam expressed was that I was delaying the regular FTL compile too much. Benchmarks don't show a difference, but we may just be getting lucky.
This patch calls setOptimizationThresholdBasedOnCompilationResult.
Here are kraken results. I'll compare setOptimizationThresholdBasedOnCompilationResult in a full benchmark run and post those results.
$ run-jsc-benchmarks original:./WebKitBuild/current/jsc dontSoon:./WebKitBuild/dontOptimizeAnytimeSoon/jsc optimizeSoon:./WebKitBuild/optimizeSoon/jsc setBlah:./WebKitBuild/setOptimizationThresholdBasedOnCompilationResult/jsc --kraken --outer 20
original dontSoon optimizeSoon setBlah setBlah v. original
ai-astar 67.627+-1.053 67.544+-0.798 ? 67.953+-0.978 67.816+-0.893 ?
audio-beat-detection 29.133+-0.075 ? 29.167+-0.195 29.132+-0.082 ? 29.444+-0.602 ? might be 1.0107x slower
audio-dft 60.809+-0.943 60.250+-0.500 ? 60.580+-0.707 60.182+-0.612 might be 1.0104x faster
audio-fft 21.700+-0.663 21.397+-0.116 ? 21.871+-0.689 21.791+-0.882 ?
audio-oscillator 44.702+-0.137 ? 44.817+-0.178 ? 44.823+-0.352 ? 44.930+-0.291 ?
imaging-darkroom 50.023+-0.183 ? 50.038+-0.235 49.932+-0.170 49.903+-0.215
imaging-desaturate 42.867+-0.705 ? 44.027+-0.812 43.403+-0.818 43.248+-0.686 ?
imaging-gaussian-blur 53.495+-0.845 ? 53.639+-0.928 53.342+-0.993 52.668+-1.117 might be 1.0157x faster
json-parse-financial 29.381+-0.133 ^ 27.807+-0.093 27.804+-0.128 ! 28.143+-0.180 ^ definitely 1.0440x faster
json-stringify-tinderbox 17.856+-0.060 ! 18.469+-0.080 ? 18.482+-0.116 18.456+-0.134 ! definitely 1.0336x slower
stanford-crypto-aes 30.170+-0.145 30.113+-0.180 ? 30.128+-0.187 29.999+-0.192
stanford-crypto-ccm 26.464+-0.905 ? 26.932+-0.658 ? 27.611+-1.037 26.562+-0.881 ?
stanford-crypto-pbkdf2 94.567+-3.453 92.140+-0.546 ? 92.339+-0.588 92.136+-0.507 might be 1.0264x faster
stanford-crypto-sha256-iterative 29.667+-0.616 ? 29.714+-0.638 ? 30.296+-1.051 29.510+-0.134
<arithmetic> 42.747+-0.281 42.575+-0.110 ? 42.693+-0.165 42.485+-0.168 might be 1.0062x faster
Comment on attachment 300574 [details]
patch
r=me too
I ran the full benchmark suite. Everything looks perf-neutral, including SunSpider and Octane!

Comment on attachment 300574 [details] patch

Clearing flags on attachment: 300574

Committed r211658: <http://trac.webkit.org/changeset/211658>

All reviewed patches have been landed. Closing bug.