175144 – Use JIT probes for DFG OSR exit.

REOPENED 175144

Use JIT probes for DFG OSR exit.

https://bugs.webkit.org/show_bug.cgi?id=175144

Summary Use JIT probes for DFG OSR exit.

Mark Lam

Reported 2017-08-03 11:39:29 PDT

<rdar://problem/33437050>.

Attachments
work in progress: won't build yet. still need a few edits + self code review. (110.87 KB, patch) 2017-08-23 17:26 PDT, Mark Lam	no flags	Details Formatted Diff Diff
work in progress: it builds but crashes in the tests. (110.02 KB, patch) 2017-08-24 15:06 PDT, Mark Lam	no flags	Details Formatted Diff Diff
work in progress: fixed some bugs but still have exotic failures. (130.98 KB, patch) 2017-08-25 17:27 PDT, Mark Lam	no flags	Details Formatted Diff Diff
proposed patch. (110.49 KB, patch) 2017-08-26 12:32 PDT, Mark Lam	no flags	Details Formatted Diff Diff
x86_64 benchmark results. (96.83 KB, text/plain) 2017-08-26 14:07 PDT, Mark Lam	no flags	Details
proposed patch w/ Windows build fix. (110.86 KB, patch) 2017-08-26 14:25 PDT, Mark Lam	no flags	Details Formatted Diff Diff
x86_64 benchmark results 2. (97.24 KB, text/plain) 2017-08-26 16:21 PDT, Mark Lam	no flags	Details
x86_64 benchmark results 3: no ethernet, no wifi (as quiet as can be). (96.86 KB, text/plain) 2017-08-26 17:20 PDT, Mark Lam	no flags	Details
work in progress. (126.14 KB, patch) 2017-08-31 07:01 PDT, Mark Lam	no flags	Details Formatted Diff Diff
archive of code that compares the probe based OSR results vs the jit based one. (107.95 KB, patch) 2017-09-01 13:25 PDT, Mark Lam	no flags	Details Formatted Diff Diff
archive of code that compares the probe based OSR results vs the jit based one v.2 (116.67 KB, patch) 2017-09-01 17:28 PDT, Mark Lam	no flags	Details Formatted Diff Diff
proposed patch. (126.36 KB, patch) 2017-09-06 16:14 PDT, Mark Lam	no flags	Details Formatted Diff Diff
x86_64 benchmark results (run 1) (95.40 KB, text/plain) 2017-09-06 16:21 PDT, Mark Lam	no flags	Details
x86_64 benchmark results (run 2) (96.46 KB, text/plain) 2017-09-06 16:21 PDT, Mark Lam	no flags	Details
proposed patch (w/ 32-bit fixes). (126.62 KB, patch) 2017-09-07 11:49 PDT, Mark Lam	no flags	Details Formatted Diff Diff
proposed patch (removed some debugging code). (123.03 KB, patch) 2017-09-07 11:54 PDT, Mark Lam	saam: review+	Details Formatted Diff Diff
patch for re-landing. (120.62 KB, patch) 2017-09-09 17:20 PDT, Mark Lam	no flags	Details Formatted Diff Diff
Show Obsolete (12) View All Add attachment proposed patch, testcase, etc.

Mark Lam

Comment 1 2017-08-10 10:23:18 PDT

After working through the code and thinking about this, I think I'll go with an OSR exit thunk that execute 2 JIT probes: the first to ensure that there's enough stack space to do in-place OSR exit frame re-write, and the second to do the in-place OSR exit frame re-write. When the second probe returns, we should resume execution in the baseline codeBlock. This approach is simplest (maps well from the current implementation, which makes the patch easier to review), and most efficient (less memcpying to and from side buffers).

Lucas Forschler

Comment 2 2017-08-10 14:01:31 PDT

<rdar://problem/33437050>

Mark Lam

Comment 3 2017-08-11 11:24:22 PDT

(In reply to Mark Lam from comment #1) > After working through the code and thinking about this, I think I'll go with > an OSR exit thunk that execute 2 JIT probes: the first to ensure that > there's enough stack space to do in-place OSR exit frame re-write, and the > second to do the in-place OSR exit frame re-write. When the second probe > returns, we should resume execution in the baseline codeBlock. > > This approach is simplest (maps well from the current implementation, which > makes the patch easier to review), and most efficient (less memcpying to and > from side buffers). Change of plans: we can make the probe even more powerful so that it provides APIs to allow for in place stack modifications. This eliminates the need for the client to be aware that the probe stores state on the stack and where it's stored.

Mark Lam

Comment 4 2017-08-23 17:26:52 PDT

Created attachment 318947 [details] work in progress: won't build yet. still need a few edits + self code review.

Mark Lam

Comment 5 2017-08-24 15:06:40 PDT

Created attachment 319025 [details] work in progress: it builds but crashes in the tests.

Mark Lam

Comment 6 2017-08-25 17:27:37 PDT

Created attachment 319121 [details] work in progress: fixed some bugs but still have exotic failures.

Mark Lam

Comment 7 2017-08-26 12:32:58 PDT

Created attachment 319137 [details] proposed patch. Tests are passing locally for me now. Let's get some feedback and EWS testing while I do some benchmarking and get some memory measurements.

Mark Lam

Comment 8 2017-08-26 14:07:20 PDT

Created attachment 319138 [details] x86_64 benchmark results.

Mark Lam

Comment 9 2017-08-26 14:17:56 PDT

Benchmark results shows that on aggregate, perf is neutral. However, on specific benchmark components: - Octane's pdfjs is 1.1595x slower - SixSpeed's spread-literal.es5 is 1.0524x slower - SixSpeed's classes.es5 is 1.0446x faster - SixSpeed's map-set-object.es6 is 1.0336x faster ... and a some microbenchmarks are a lot slower: - delta-blue-try-catch ! definitely 1.3200x slower - int-or-other-neg-then-get-by-val ! definitely 11.5349x slower - super-get-by-val-with-this-polymorphic ! definitely 1.0516x slower - math-with-out-of-bounds-array-values ! definitely 1.0676x slower - direct-tail-call-inlined-caller ^ definitely 1.0539x faster - v8-raytrace-with-try-catch ! definitely 1.1061x slower - super-get-by-val-with-this-monomorphic ! definitely 1.0718x slower - string-concat-long-convert ^ definitely 1.0428x faster - fake-iterators-that-throw-when-finished ! definitely 1.5331x slower - double-pollution-putbyoffset ! definitely 6.7236x slower - string-transcoding ! definitely 1.2092x slower - to-number-constructor-string-number-string-number ! definitely 1.2452x slower - function-dot-apply ^ definitely 1.1369x faster - bigswitch-indirect-symbol-or-undefined ^ definitely 1.9525x faster - getter-richards-try-catch ! definitely 1.0995x slower

Mark Lam

Comment 10 2017-08-26 14:25:44 PDT

Created attachment 319139 [details] proposed patch w/ Windows build fix.

Filip Pizlo

Comment 11 2017-08-26 15:33:55 PDT

(In reply to Mark Lam from comment #9) > Benchmark results shows that on aggregate, perf is neutral. > > However, on specific benchmark components: > - Octane's pdfjs is 1.1595x slower > - SixSpeed's spread-literal.es5 is 1.0524x slower > - SixSpeed's classes.es5 is 1.0446x faster > - SixSpeed's map-set-object.es6 is 1.0336x faster > > ... and a some microbenchmarks are a lot slower: > - delta-blue-try-catch ! definitely 1.3200x slower > - int-or-other-neg-then-get-by-val ! definitely 11.5349x slower > - super-get-by-val-with-this-polymorphic ! definitely 1.0516x slower > - math-with-out-of-bounds-array-values ! definitely 1.0676x slower > - direct-tail-call-inlined-caller ^ definitely 1.0539x faster > - v8-raytrace-with-try-catch ! definitely 1.1061x slower > - super-get-by-val-with-this-monomorphic ! definitely 1.0718x slower > - string-concat-long-convert ^ definitely 1.0428x faster > - fake-iterators-that-throw-when-finished ! definitely 1.5331x slower > - double-pollution-putbyoffset ! definitely 6.7236x slower > - string-transcoding ! definitely 1.2092x slower > - to-number-constructor-string-number-string-number ! definitely 1.2452x > slower > - function-dot-apply ^ definitely 1.1369x faster > - bigswitch-indirect-symbol-or-undefined ^ definitely 1.9525x faster > - getter-richards-try-catch ! definitely 1.0995x slower Something is up with your results. Octane shows a “maybe 2%” slower in the geomean. Normally, we can detect a 1% change in that metric. So, I think that this means that your system was quite noisy when you ran this test. Or you did not run for enough iterations. I like to run for —outer 6 and leave —inner alone.

Mark Lam

Comment 12 2017-08-26 16:19:36 PDT

(In reply to Filip Pizlo from comment #11) > (In reply to Mark Lam from comment #9) > > Benchmark results shows that on aggregate, perf is neutral. > > > > However, on specific benchmark components: > > - Octane's pdfjs is 1.1595x slower > > - SixSpeed's spread-literal.es5 is 1.0524x slower > > - SixSpeed's classes.es5 is 1.0446x faster > > - SixSpeed's map-set-object.es6 is 1.0336x faster > > > > ... and a some microbenchmarks are a lot slower: > > - delta-blue-try-catch ! definitely 1.3200x slower > > - int-or-other-neg-then-get-by-val ! definitely 11.5349x slower > > - super-get-by-val-with-this-polymorphic ! definitely 1.0516x slower > > - math-with-out-of-bounds-array-values ! definitely 1.0676x slower > > - direct-tail-call-inlined-caller ^ definitely 1.0539x faster > > - v8-raytrace-with-try-catch ! definitely 1.1061x slower > > - super-get-by-val-with-this-monomorphic ! definitely 1.0718x slower > > - string-concat-long-convert ^ definitely 1.0428x faster > > - fake-iterators-that-throw-when-finished ! definitely 1.5331x slower > > - double-pollution-putbyoffset ! definitely 6.7236x slower > > - string-transcoding ! definitely 1.2092x slower > > - to-number-constructor-string-number-string-number ! definitely 1.2452x > > slower > > - function-dot-apply ^ definitely 1.1369x faster > > - bigswitch-indirect-symbol-or-undefined ^ definitely 1.9525x faster > > - getter-richards-try-catch ! definitely 1.0995x slower > > Something is up with your results. Octane shows a “maybe 2%” slower in the > geomean. Normally, we can detect a 1% change in that metric. So, I think > that this means that your system was quite noisy when you ran this test. Or > you did not run for enough iterations. I like to run for —outer 6 and leave > —inner alone. I ran with outer 8 on a 13" MacBook Pro. There's a chance that there's network / random background activity but the machine is as quiet as I can make it. I just finished another run with no ethernet attached (but just realized that wifi is till on). I'll upload the new results, and do a 3rd run with wifi off as well. With this 2nd run, Octane is "might be 1.0118x faster" with pdfjs "definitely 1.1681x slower.

Mark Lam

Comment 13 2017-08-26 16:21:26 PDT

Created attachment 319141 [details] x86_64 benchmark results 2.

Saam Barati

Comment 14 2017-08-26 16:50:25 PDT

It’s not entirely surprising to see the try/catch micro benchmarks get slower. Most of those tests are stressing throwing an exception, which is an OSR exit in the DFG/FTL. What is somewhat surprising is how important the actual speed of the exit itself is in those microbenchmarks. I’m not too sure what to make of this yet. Maybe there are trivial things we can do to speed up the OSR exit C code?

Mark Lam

Comment 15 2017-08-26 16:56:56 PDT

(In reply to Saam Barati from comment #14) > It’s not entirely surprising to see the try/catch micro benchmarks get > slower. Most of those tests are stressing throwing an exception, which is an > OSR exit in the DFG/FTL. What is somewhat surprising is how important the > actual speed of the exit itself is in those microbenchmarks. I’m not too > sure what to make of this yet. Maybe there are trivial things we can do to > speed up the OSR exit C code? I can try to do some profiling on OSR exits on pdfjs and see if any section of the code stands out as being expensive. One area I'm wary of is the amount of memcpying done. I also use std::memcpy to workaround strict aliasing issues. If the compiler is not smart enough to turn these into straight memory word copies, that can hurt as lot as well. I'll take a close look.

Mark Lam

Comment 16 2017-08-26 17:20:33 PDT

Created attachment 319142 [details] x86_64 benchmark results 3: no ethernet, no wifi (as quiet as can be).

Saam Barati

Comment 17 2017-08-26 17:32:23 PDT

(In reply to Mark Lam from comment #15) > (In reply to Saam Barati from comment #14) > > It’s not entirely surprising to see the try/catch micro benchmarks get > > slower. Most of those tests are stressing throwing an exception, which is an > > OSR exit in the DFG/FTL. What is somewhat surprising is how important the > > actual speed of the exit itself is in those microbenchmarks. I’m not too > > sure what to make of this yet. Maybe there are trivial things we can do to > > speed up the OSR exit C code? > > I can try to do some profiling on OSR exits on pdfjs and see if any section > of the code stands out as being expensive. One area I'm wary of is the > amount of memcpying done. > > I also use std::memcpy to workaround strict aliasing issues. If the > compiler is not smart enough to turn these into straight memory word copies, > that can hurt as lot as well. I'll take a close look. Your second run seems better in terms of numbers. When I say I’m not sure, I mean I’m not sure if this will hurt real world things that use try/catch. I think in practice probably not. But the slowdown indicates to me that there are perhaps ways to make the interpreter faster. Maybe these can be follow ups. The interesting thing about an exception OSR exit is that it won’t cause us to jettison and recompile because it’s expected behavior. Maybe because of this, there is a slight argument to keep them compiled because it might execute many times for programs that throw. That said, I don’t think this should stop your work from going forward. I feel like we should be able to make the interpreter fast enough.

Filip Pizlo

Comment 18 2017-08-26 20:53:45 PDT

The latest results seem to say that your patch is a regression. You should profile to see if time spent in OSR exit is the problem in pdfjs. It could also be that some value profiling changed. You could try reducing osrExitCountForReoptimization. Making that smaller will reduce the number of times we take any particular exit.

Mark Lam

Comment 19 2017-08-31 07:01:27 PDT

Created attachment 319459 [details] work in progress. Here's the current work in progress patch: fixed a few bugs, and made optimization changes to not repeat redundant work. The result: microbenchmarks run faster (validating that the optimization is working as intended), but real world benchmarks actually appear to run slower. At this point, it is difficult to see what is going wrong when running the real world benchmarks. So, I'm going to pull out the big gun and do a full verification run comparing my probe OSR exit results with the old one generated by the old code. I'll achieve this by inserting probes into the old code to compute the probe OSR exit register and stack state on the side, and then compare these results against the register and stack computed by the JIT offramps. Since the 2 methods will be feeding on the same real data, they should produce identical result data. If there are any differences in the restored values at all, then we'll have a lead to chase down for resolving this perf issue.

Build Bot

Comment 20 2017-08-31 11:16:22 PDT

Attachment 319459 [details] did not pass style-queue: ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:105: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4] ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:106: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4] ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:107: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4] ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:108: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4] ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:109: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4] ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:110: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4] ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:111: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4] ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:112: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4] ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:113: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4] ERROR: Source/JavaScriptCore/bytecode/SuperSampler.cpp:52: Should have a space between // and comment [whitespace/comments] [4] Total errors found: 10 in 34 files If any of these errors are false positives, please file a bug against check-webkit-style.

Mark Lam

Comment 21 2017-08-31 23:05:34 PDT

Something is strange: comparing the stack of the probe based OSR exit and the jit based OSR exit, I detected a few stack operands that are different. However, when I chased down the difference, it is because the jit based OSR exit was recovering operand values from a stack location that is below the stack pointer. As a result, that value changed mid-way between the start of the ramp and the point where the recovery was done. The probe based OSR exit does not have this vulnerability because the true machine pointer is set much lower to accommodate probe data while the probe is executing. This appears to be the sole source of the difference between the stack values of the 2 OSR exit versions. Note: this means that either the jit version (1) is restoring a value it does not need, or (2) is restoring a corrupted value. My bet is on (1). I'll do a test to scribble a bad value into that restored value if the address if below the stack pointer. If this causes a crash, then we're dealing with scenario (2), else (1).

Mark Lam

Comment 22 2017-08-31 23:37:36 PDT

(In reply to Mark Lam from comment #21) > Note: this means that either the jit version (1) is restoring a value it > does not need, or (2) is restoring a corrupted value. My bet is on (1). > I'll do a test to scribble a bad value into that restored value if the > address if below the stack pointer. If this causes a crash, then we're > dealing with scenario (2), else (1). Indeed, (1) is correct. I scribbled those values with 0xbadbeef0, and there were no crashed.

Mark Lam

Comment 23 2017-09-01 12:36:06 PDT

I've fixed another difference where we have converted a double JSValue to an int32 JSValue. In some cases, the value fits in an int32, which is what triggered the conversion. I now force it to be a double JSValue if came from a double source. This is what the old jit base OSR exit ramp does. With that fix plus discounting the values written below the stack pointer (by scribbling 0xbadbeef0 into those slots), I now see that the stack is identical after the jit and probe based OSR exit ramps. There are register differences, but only for arguments, temps, and scratch registers. This is to be expected since the jit based OSR exit ramp calls C++ functions which uses those registers. So, the register differences can be disregarded as well. The only remaining possible difference between the 2 OSR exit ramps is in changes to VM / profiling state. I'll try to validate that next.

Mark Lam

Comment 24 2017-09-01 12:38:08 PDT

(In reply to Mark Lam from comment #23) > I've fixed another difference where we have converted a double JSValue to an > int32 JSValue. In some cases, the value fits in an int32, which is what > triggered the conversion. I now force it to be a double JSValue if came > from a double source. This is what the old jit base OSR exit ramp does. > > With that fix plus discounting the values written below the stack pointer > (by scribbling 0xbadbeef0 into those slots), I now see that the stack is > identical after the jit and probe based OSR exit ramps. > > There are register differences, but only for arguments, temps, and scratch > registers. This is to be expected since the jit based OSR exit ramp calls > C++ functions which uses those registers. So, the register differences can > be disregarded as well. > > The only remaining possible difference between the 2 OSR exit ramps is in > changes to VM / profiling state. I'll try to validate that next. FYI, with double JSValue fix (which is the only real difference), pdfjs is even more slower (from my quick benchmark run). This means the JSValue encoding of the double value is not the cause of our perf difference.

Mark Lam

Comment 25 2017-09-01 13:25:51 PDT

Created attachment 319641 [details] archive of code that compares the probe based OSR results vs the jit based one.

Mark Lam

Comment 26 2017-09-01 17:28:51 PDT

Created attachment 319676 [details] archive of code that compares the probe based OSR results vs the jit based one v.2

Mark Lam

Comment 27 2017-09-06 12:02:44 PDT

The reason for the pdfjs perf regression is https://bugs.webkit.org/show_bug.cgi?id=176473. I fixed the regression by doing what the jit OSR Exit code did i.e. not doing what arrayModeFromStructure() does. I'll clean up the patch, do some more testing, and prepare for a code review.

Filip Pizlo

Comment 28 2017-09-06 12:03:13 PDT

(In reply to Mark Lam from comment #27) > The reason for the pdfjs perf regression is > https://bugs.webkit.org/show_bug.cgi?id=176473. > > I fixed the regression by doing what the jit OSR Exit code did i.e. not > doing what arrayModeFromStructure() does. I'll clean up the patch, do some > more testing, and prepare for a code review. Nice!!

Mark Lam

Comment 29 2017-09-06 16:14:09 PDT

Created attachment 320072 [details] proposed patch. Let's try the patch on the EWS.

Mark Lam

Comment 30 2017-09-06 16:21:19 PDT

Created attachment 320075 [details] x86_64 benchmark results (run 1)

Mark Lam

Comment 31 2017-09-06 16:21:41 PDT

Created attachment 320076 [details] x86_64 benchmark results (run 2)

Mark Lam

Comment 32 2017-09-06 16:24:22 PDT

Benchmark results show that, on aggregate, perf is neutral.

Mark Lam

Comment 33 2017-09-07 11:49:40 PDT

Created attachment 320149 [details] proposed patch (w/ 32-bit fixes).

Mark Lam

Comment 34 2017-09-07 11:54:44 PDT

Created attachment 320151 [details] proposed patch (removed some debugging code).

Saam Barati

Comment 35 2017-09-07 15:21:43 PDT

Comment on attachment 320151 [details] proposed patch (removed some debugging code). View in context: https://bugs.webkit.org/attachment.cgi?id=320151&action=review r=me with comments > Source/JavaScriptCore/assembler/ProbeFrame.h:44 > + T getArgument(int argument) style nit: here and below, you preface getters with get, but I don't think we do that elsewhere, e.g, Operands. What about just calling this: argument(...), and below, operand(...), etc. > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:80 > + RegisterSet dontRestoreRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs()); style nit: This same thing is used here and below, might be worth making it a function to call. > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:135 > + context.gpr(entry.reg().gpr()) = stack.get<uintptr_t>(calleeSaveBuffer, entry.offset()); > + else > + context.gpr(entry.reg().gpr()) = stack.get<double>(calleeSaveBuffer, entry.offset()); why do these gets need to use the stack API? > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:152 > + ASSERT(is64Bit()); seems like this could be a static assert? > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:160 > + if (entry.reg().isGPR()) > + stack.set(calleeSaveBuffer, entry.offset(), context.gpr<uintptr_t>(entry.reg().gpr())); > + else > + stack.set(calleeSaveBuffer, entry.offset(), context.fpr<uintptr_t>(entry.reg().fpr())); I think it'd be nice to have context expose a register API so it does this branch for you. You also write the branch in a few places, but if context handled it, you could just do: stack.set(calleeSaveBuffer, entry.offset(), context.reg<uintptr_t>(entry.reg())); > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:171 > + RegisterSet dontSaveRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs()); ditto about method abstracting this instead of copy-pasting. > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:176 > +#if USE(JSVALUE64) > + RegisterSet baselineCalleeSaves = RegisterSet::llintBaselineCalleeSaveRegisters(); > +#endif seems like this is guaranteed given NUMBER_OF_CALLEE_SAVES_REGISTERS. > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:187 > +#if USE(JSVALUE32_64) > + UNUSED_PARAM(wasCalledViaTailCall); > +#else seems like this is dead given: NUMBER_OF_CALLEE_SAVES_REGISTERS Maybe inside your: #if NUMBER_OF_CALLEE_SAVES_REGISTERS > 0 you can add: static_assert(is64bit(), "") > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:326 > +void OSRExit::executeOSRExit(Context& context) Should you set top call frame inside of here? I think that'd help the sampling profiler at least. Maybe other things need it to. > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:353 > + if (UNLIKELY(!exit.exitState)) { suggestion: Let's cache the various calls of JITCode::dfg() here since it's a virtual function > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:366 > + // Compute the value recoveries. > + Operands<ValueRecovery> operands; > + codeBlock->jitCode()->dfg()->variableEventStream.reconstruct(codeBlock, exit.m_codeOrigin, codeBlock->jitCode()->dfg()->minifiedDFG, exit.m_streamIndex, operands); This is an interesting tradeoff. I think it's the right one. However, I know we use the event stream to reconstruct on demand because storing the reconstructed data for each exit has a high memory cost. That said, we're only doing this here for *executed* exits. That's why I think this is ok and the right choice. But it's worth making sure we truly believe this is the best choice. > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:489 > + Structure* structure = jsValueFor(cpu, exit.m_jsValueSource).asCell()->structure(vm); how do we know the incoming value is a cell? > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:745 > + saveOrCopyCalleeSavesFor(context, baselineCodeBlock, static_cast<VirtualRegister>(inlineCallFrame->stackOffset), !trueCaller); Style: This static cast here seems weird, why not just: VirtualRegister(inlineCallFrame->stackOffset) ?

Mark Lam

Comment 36 2017-09-07 16:44:54 PDT

Thanks for the review. I've addressed your feedback as follows. Will retest now before I land. (In reply to Saam Barati from comment #35) > Comment on attachment 320151 [details] > proposed patch (removed some debugging code). > > View in context: > https://bugs.webkit.org/attachment.cgi?id=320151&action=review > > r=me with comments > > > Source/JavaScriptCore/assembler/ProbeFrame.h:44 > > + T getArgument(int argument) > > style nit: here and below, you preface getters with get, but I don't think > we do that elsewhere, e.g, Operands. What about just calling this: > argument(...), and below, operand(...), etc. Fixed. > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:80 > > + RegisterSet dontRestoreRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs()); > > style nit: This same thing is used here and below, might be worth making it > a function to call. This is not a lot of code, and the original code that these functions are based on (AssemblyHelpers::restoreCalleeSavesFromVMEntryFrameCalleeSavesBuffer(), AssemblyHelpers::copyCalleeSavesToVMEntryFrameCalleeSavesBuffer(), and AssemblyHelpers::copyCalleeSavesToVMEntryFrameCalleeSavesBuffer()) also decided to just duplicate this line. I'll keep this as is for now to keep these functions consistent with the AssemblyHelpers ones they are based on. I can do a refactor later on to adda calleeSavesRegistersToOmitFromRestoration() and change all sites to use that. > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:135 > > + context.gpr(entry.reg().gpr()) = stack.get<uintptr_t>(calleeSaveBuffer, entry.offset()); > > + else > > + context.gpr(entry.reg().gpr()) = stack.get<double>(calleeSaveBuffer, entry.offset()); > > why do these gets need to use the stack API? They don't need to be. I fixed it to read from the stack. There's also a bug here where the fpr case (the else case) is using gpr values. I've fixed that also. > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:152 > > + ASSERT(is64Bit()); > > seems like this could be a static assert? Fixed. > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:160 > > + if (entry.reg().isGPR()) > > + stack.set(calleeSaveBuffer, entry.offset(), context.gpr<uintptr_t>(entry.reg().gpr())); > > + else > > + stack.set(calleeSaveBuffer, entry.offset(), context.fpr<uintptr_t>(entry.reg().fpr())); > > I think it'd be nice to have context expose a register API so it does this > branch for you. You also write the branch in a few places, but if context > handled it, you could just do: > stack.set(calleeSaveBuffer, entry.offset(), > context.reg<uintptr_t>(entry.reg())); There are 3 issues with your suggestion: 1. It's a layer violation. Probe::Context is defined at the MacroAssembler level, and Reg (which we query isGPR() on) is defined at the JIT level. It'd be strange to have Probe::Context depend on Reg. We could move Reg into the MacroAssembler level, but that's more refactoring work. 2. Your suggestion only works for 64-bit where sizeof GPR and FPR are both 64-bit. Yes, currently, only 64-bit uses callee saves, and hence, things should work out. But it's still kind of gross to rely on that. 3. The original AssemblyHelpers code that these functions are based on does it this way. I'd rather keep this version of the code consistent with the AssemblyHelpers version. For these reasons, I'd like to keep this code as is. > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:171 > > + RegisterSet dontSaveRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs()); > > ditto about method abstracting this instead of copy-pasting. Answered above. > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:176 > > +#if USE(JSVALUE64) > > + RegisterSet baselineCalleeSaves = RegisterSet::llintBaselineCalleeSaveRegisters(); > > +#endif > > seems like this is guaranteed given NUMBER_OF_CALLEE_SAVES_REGISTERS. Fixed. I removed the #if+#endif. > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:187 > > +#if USE(JSVALUE32_64) > > + UNUSED_PARAM(wasCalledViaTailCall); > > +#else > > seems like this is dead given: NUMBER_OF_CALLEE_SAVES_REGISTERS Fixed. > Maybe inside your: > #if NUMBER_OF_CALLEE_SAVES_REGISTERS > 0 > > you can add: static_assert(is64bit(), "") Done. > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:326 > > +void OSRExit::executeOSRExit(Context& context) > > Should you set top call frame inside of here? I think that'd help the > sampling profiler at least. Maybe other things need it to. I was already setting it in adjustAndJumpToTarget() after the frame has been fully reconstructed. I think that would be the better time to set it. > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:353 > > + if (UNLIKELY(!exit.exitState)) { > > suggestion: Let's cache the various calls of JITCode::dfg() here since it's > a virtual function The DFG::JITCode* value is only used in 2 functions, and I can't avoid computing it at least once on every OSR exit in order to get the OSRExit record. So, I just cache it in a local and reuse that (didn't need to cache it in the exitState). > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:366 > > + // Compute the value recoveries. > > + Operands<ValueRecovery> operands; > > + codeBlock->jitCode()->dfg()->variableEventStream.reconstruct(codeBlock, exit.m_codeOrigin, codeBlock->jitCode()->dfg()->minifiedDFG, exit.m_streamIndex, operands); > > This is an interesting tradeoff. I think it's the right one. However, I know > we use the event stream to reconstruct on demand because storing the > reconstructed data for each exit has a high memory cost. That said, we're > only doing this here for *executed* exits. That's why I think this is ok and > the right choice. But it's worth making sure we truly believe this is the > best choice. This caching accounted for about half of the speed up on the int-or-other-neg-then-get-by-val microbenchmark (relative to my very first patch). It is also correct to cache these values and use them. The old jit based OSR Exit caches the value recovery data in the form of the emitted OSR off-ramp jit code. We're effectively doing the same thing (in spirit) here. > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:489 > > + Structure* structure = jsValueFor(cpu, exit.m_jsValueSource).asCell()->structure(vm); > > how do we know the incoming value is a cell? We can only get here if exit.m_kind == BadCache || exit.m_kind == BadIndexingType. I presume that those exit kinds means that the value is a cell. Anyway, I'm not introducing new logic here. This is exactly what the old jit based OSR exit code does in OSRExit::compileExit(): it loads the value and proceeds to load its structureID without first confirming that it's a cell. > > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:745 > > + saveOrCopyCalleeSavesFor(context, baselineCodeBlock, static_cast<VirtualRegister>(inlineCallFrame->stackOffset), !trueCaller); > > Style: This static cast here seems weird, why not just: > VirtualRegister(inlineCallFrame->stackOffset) ? Fixed.

Mark Lam

Comment 37 2017-09-07 18:16:34 PDT

Thanks again for the review. Landed in r221774: <http://trac.webkit.org/r221774>.

Saam Barati

Comment 38 2017-09-08 16:22:05 PDT

Comment on attachment 320151 [details] proposed patch (removed some debugging code). View in context: https://bugs.webkit.org/attachment.cgi?id=320151&action=review >>> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:135 >>> + context.gpr(entry.reg().gpr()) = stack.get<double>(calleeSaveBuffer, entry.offset()); >> >> why do these gets need to use the stack API? > > They don't need to be. I fixed it to read from the stack. > > There's also a bug here where the fpr case (the else case) is using gpr values. I've fixed that also. It's always fun to learn when we don't have tests for things :) Could be worth filing a bug saying this is the case. >>> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:160 >>> + stack.set(calleeSaveBuffer, entry.offset(), context.fpr<uintptr_t>(entry.reg().fpr())); >> >> I think it'd be nice to have context expose a register API so it does this branch for you. You also write the branch in a few places, but if context handled it, you could just do: >> stack.set(calleeSaveBuffer, entry.offset(), context.reg<uintptr_t>(entry.reg())); > > There are 3 issues with your suggestion: > 1. It's a layer violation. > > Probe::Context is defined at the MacroAssembler level, and Reg (which we query isGPR() on) is defined at the JIT level. It'd be strange to have Probe::Context depend on Reg. We could move Reg into the MacroAssembler level, but that's more refactoring work. > > 2. Your suggestion only works for 64-bit where sizeof GPR and FPR are both 64-bit. > > Yes, currently, only 64-bit uses callee saves, and hence, things should work out. But it's still kind of gross to rely on that. > > 3. The original AssemblyHelpers code that these functions are based on does it this way. > > I'd rather keep this version of the code consistent with the AssemblyHelpers version. > > For these reasons, I'd like to keep this code as is. I see. I didn't think about this layering violation. I think it'd be worth having a helper in this file alone. It could even be under the JSVALUE64. >>> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:326 >>> +void OSRExit::executeOSRExit(Context& context) >> >> Should you set top call frame inside of here? I think that'd help the sampling profiler at least. Maybe other things need it to. > > I was already setting it in adjustAndJumpToTarget() after the frame has been fully reconstructed. I think that would be the better time to set it. Let me restate: The way this code is written, the sampling profiler can not say time is spent here. If something is OSR exitting a lot, there is a good chance time is spent at the bytecode operation. Your patch makes the sampling profiler not ascribe time to that bytecode anymore. That's unfortunate. To make this work, we'd need to: 1. Set call site index in the exitting frame 2. Set top call frame I think it's definitely worth doing this. Maybe we can open a bug to do this in a follow up.

Ryan Haddad

Comment 39 2017-09-09 09:50:20 PDT

(In reply to Mark Lam from comment #37) > Thanks again for the review. Landed in r221774: > <http://trac.webkit.org/r221774>. This change introduced 3 JSC test timeouts. Details in https://bugs.webkit.org/show_bug.cgi?id=176630

Ryan Haddad

Comment 40 2017-09-09 09:54:47 PDT

Reverted r221774 for reason: This change introduced three debug JSC test timeouts. Committed r221823: <http://trac.webkit.org/changeset/221823>

Filip Pizlo

Comment 41 2017-09-09 15:10:41 PDT

(In reply to Ryan Haddad from comment #40) > Reverted r221774 for reason: > > This change introduced three debug JSC test timeouts. > > Committed r221823: <http://trac.webkit.org/changeset/221823> I think that we should skip tests in debug when they timeout, instead of rolling out the change that caused it.

Mark Lam

Comment 42 2017-09-09 17:20:33 PDT

Created attachment 320367 [details] patch for re-landing.

Mark Lam

Comment 43 2017-09-09 17:23:15 PDT

Re-landed in r221832: <http://trac.webkit.org/r221832>.

Mark Lam

Comment 44 2017-09-13 21:22:57 PDT

(In reply to Mark Lam from comment #43) > Re-landed in r221832: <http://trac.webkit.org/r221832>. This was rolled out in r222009: <http://trac.webkit.org/r222009> due to https://bugs.webkit.org/show_bug.cgi?id=176888.

Yusuke Suzuki

Comment 45 2017-09-14 03:47:28 PDT

(In reply to Mark Lam from comment #44) > (In reply to Mark Lam from comment #43) > > Re-landed in r221832: <http://trac.webkit.org/r221832>. > > This was rolled out in r222009: <http://trac.webkit.org/r222009> due to > https://bugs.webkit.org/show_bug.cgi?id=176888. It seems that OSR patch had a bit regression Octane/gbemu. Reverting it gives 4.3% recovery. https://arewefastyet.com/#machine=29&view=single&suite=octane&subtest=Gameboy

Mark Lam

Comment 46 2017-09-14 10:12:56 PDT

*** Bug 176874 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.

Status REOPENED

Resolution

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component JavaScriptCore

Assignee

Mark Lam

Reported

2017-08-03 11:39 PDT

Modified

2017-10-03 15:59 PDT History

CC List

11 users Show

URL

Keywords InRadar

Depends on

175208 175214 175228 176888 177844

Blocks

174645 176703 176874

Dependencies

tree graph