Bug 175144

Summary: Use JIT probes for DFG OSR exit.
Product: WebKit
Component: JavaScriptCore
Status: REOPENED
Severity: Normal
Priority: P2
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Reporter: Mark Lam <mark.lam>
Assignee: Mark Lam <mark.lam>
CC: buildbot, fpizlo, hodovan, jfbastien, keith_miller, lforschler, msaboff, ryanhaddad, saam, webkit-bug-importer, ysuzuki
Keywords: InRadar
Bug Depends on: 175208, 175214, 175228, 176888, 177844
Bug Blocks: 174645, 176703, 176874
Attachments (description, flags):
  work in progress: won't build yet. still need a few edits + self code review. (none)
  work in progress: it builds but crashes in the tests. (none)
  work in progress: fixed some bugs but still have exotic failures. (none)
  proposed patch. (none)
  x86_64 benchmark results. (none)
  proposed patch w/ Windows build fix. (none)
  x86_64 benchmark results 2. (none)
  x86_64 benchmark results 3: no ethernet, no wifi (as quiet as can be). (none)
  work in progress. (none)
  archive of code that compares the probe based OSR results vs the jit based one. (none)
  archive of code that compares the probe based OSR results vs the jit based one v.2 (none)
  proposed patch. (none)
  x86_64 benchmark results (run 1) (none)
  x86_64 benchmark results (run 2) (none)
  proposed patch (w/ 32-bit fixes). (none)
  proposed patch (removed some debugging code). (saam: review+)
  patch for re-landing. (none)

Description Mark Lam 2017-08-03 11:39:29 PDT
<rdar://problem/33437050>.
Comment 1 Mark Lam 2017-08-10 10:23:18 PDT
After working through the code and thinking about this, I think I'll go with an OSR exit thunk that executes 2 JIT probes: the first to ensure that there's enough stack space to do in-place OSR exit frame re-write, and the second to do the in-place OSR exit frame re-write.  When the second probe returns, we should resume execution in the baseline codeBlock.

This approach is simplest (maps well from the current implementation, which makes the patch easier to review), and most efficient (less memcpying to and from side buffers).
Comment 2 Lucas Forschler 2017-08-10 14:01:31 PDT
<rdar://problem/33437050>
Comment 3 Mark Lam 2017-08-11 11:24:22 PDT
(In reply to Mark Lam from comment #1)
> After working through the code and thinking about this, I think I'll go with
> an OSR exit thunk that executes 2 JIT probes: the first to ensure that
> there's enough stack space to do in-place OSR exit frame re-write, and the
> second to do the in-place OSR exit frame re-write.  When the second probe
> returns, we should resume execution in the baseline codeBlock.
> 
> This approach is simplest (maps well from the current implementation, which
> makes the patch easier to review), and most efficient (less memcpying to and
> from side buffers).

Change of plans: we can make the probe even more powerful so that it provides APIs to allow for in-place stack modifications.  This eliminates the need for the client to be aware that the probe stores state on the stack and where it's stored.
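
In outline, the probe mechanism works like this: a probe trampoline snapshots the machine state into a context object, calls a C++ handler that may read and rewrite registers and stack slots in place, then restores the (possibly modified) state and resumes execution. Below is a self-contained toy model of that flow; all names are illustrative rather than taken from the patch (JSC's real type is Probe::Context, whose accessors are quoted in the review later in this bug).

    #include <cstdint>

    // Toy model only; not JSC code. The trampoline fills this in before
    // calling the handler and restores from it afterward, so writes here
    // change the machine state that execution resumes with.
    struct ToyProbeContext {
        uint64_t gpr[16];        // general-purpose registers at the probe site
        uint64_t* stackPointer;  // the true machine SP; probe data lives below it

        uint64_t get(int offset) const { return stackPointer[offset]; }
        void set(int offset, uint64_t value) { stackPointer[offset] = value; }
    };

    // A handler that rewrites the frame in place, in the spirit of an OSR
    // exit off-ramp: recover a value, store it into an operand slot, and
    // redirect a register, all without the client managing side buffers.
    static void toyExitHandler(ToyProbeContext& context)
    {
        uint64_t recovered = context.gpr[0]; // pretend gpr[0] holds a recovered value
        context.set(2, recovered);           // rewrite a stack operand in place
        context.gpr[4] = context.gpr[5];     // registers may be rewritten too
    }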
Comment 4 Mark Lam 2017-08-23 17:26:52 PDT
Created attachment 318947 [details]
work in progress: won't build yet. still need a few edits + self code review.
Comment 5 Mark Lam 2017-08-24 15:06:40 PDT
Created attachment 319025 [details]
work in progress: it builds but crashes in the tests.
Comment 6 Mark Lam 2017-08-25 17:27:37 PDT
Created attachment 319121 [details]
work in progress: fixed some bugs but still have exotic failures.
Comment 7 Mark Lam 2017-08-26 12:32:58 PDT
Created attachment 319137 [details]
proposed patch.

Tests are passing locally for me now.  Let's get some feedback and EWS testing while I do some benchmarking and get some memory measurements.
Comment 8 Mark Lam 2017-08-26 14:07:20 PDT
Created attachment 319138 [details]
x86_64 benchmark results.
Comment 9 Mark Lam 2017-08-26 14:17:56 PDT
Benchmark results show that, on aggregate, perf is neutral.

However, on specific benchmark components:
- Octane's pdfjs is 1.1595x slower
- SixSpeed's spread-literal.es5 is 1.0524x slower
- SixSpeed's classes.es5 is 1.0446x faster
- SixSpeed's map-set-object.es6 is 1.0336x faster

... and some microbenchmarks are a lot slower:
- delta-blue-try-catch ! definitely 1.3200x slower
- int-or-other-neg-then-get-by-val ! definitely 11.5349x slower
- super-get-by-val-with-this-polymorphic ! definitely 1.0516x slower
- math-with-out-of-bounds-array-values ! definitely 1.0676x slower
- direct-tail-call-inlined-caller ^ definitely 1.0539x faster
- v8-raytrace-with-try-catch ! definitely 1.1061x slower
- super-get-by-val-with-this-monomorphic ! definitely 1.0718x slower
- string-concat-long-convert ^ definitely 1.0428x faster
- fake-iterators-that-throw-when-finished ! definitely 1.5331x slower
- double-pollution-putbyoffset ! definitely 6.7236x slower
- string-transcoding ! definitely 1.2092x slower
- to-number-constructor-string-number-string-number ! definitely 1.2452x slower
- function-dot-apply ^ definitely 1.1369x faster
- bigswitch-indirect-symbol-or-undefined ^ definitely 1.9525x faster
- getter-richards-try-catch ! definitely 1.0995x slower
Comment 10 Mark Lam 2017-08-26 14:25:44 PDT
Created attachment 319139 [details]
proposed patch w/ Windows build fix.
Comment 11 Filip Pizlo 2017-08-26 15:33:55 PDT
(In reply to Mark Lam from comment #9)
> Benchmark results show that, on aggregate, perf is neutral.
> 
> However, on specific benchmark components:
> - Octane's pdfjs is 1.1595x slower
> - SixSpeed's spread-literal.es5 is 1.0524x slower
> - SixSpeed's classes.es5 is 1.0446x faster
> - SixSpeed's map-set-object.es6 is 1.0336x faster
> 
> ... and some microbenchmarks are a lot slower:
> - delta-blue-try-catch ! definitely 1.3200x slower
> - int-or-other-neg-then-get-by-val ! definitely 11.5349x slower
> - super-get-by-val-with-this-polymorphic ! definitely 1.0516x slower
> - math-with-out-of-bounds-array-values ! definitely 1.0676x slower
> - direct-tail-call-inlined-caller ^ definitely 1.0539x faster
> - v8-raytrace-with-try-catch ! definitely 1.1061x slower
> - super-get-by-val-with-this-monomorphic ! definitely 1.0718x slower
> - string-concat-long-convert ^ definitely 1.0428x faster
> - fake-iterators-that-throw-when-finished ! definitely 1.5331x slower
> - double-pollution-putbyoffset ! definitely 6.7236x slower
> - string-transcoding ! definitely 1.2092x slower
> - to-number-constructor-string-number-string-number ! definitely 1.2452x
> slower
> - function-dot-apply ^ definitely 1.1369x faster
> - bigswitch-indirect-symbol-or-undefined ^ definitely 1.9525x faster
> - getter-richards-try-catch ! definitely 1.0995x slower

Something is up with your results. Octane shows a “maybe 2%” slower in the geomean. Normally, we can detect a 1% change in that metric. So, I think that this means that your system was quite noisy when you ran this test. Or you did not run for enough iterations. I like to run with --outer 6 and leave --inner alone.
Comment 12 Mark Lam 2017-08-26 16:19:36 PDT
(In reply to Filip Pizlo from comment #11)
> (In reply to Mark Lam from comment #9)
> > Benchmark results show that, on aggregate, perf is neutral.
> > 
> > However, on specific benchmark components:
> > - Octane's pdfjs is 1.1595x slower
> > - SixSpeed's spread-literal.es5 is 1.0524x slower
> > - SixSpeed's classes.es5 is 1.0446x faster
> > - SixSpeed's map-set-object.es6 is 1.0336x faster
> > 
> > ... and some microbenchmarks are a lot slower:
> > - delta-blue-try-catch ! definitely 1.3200x slower
> > - int-or-other-neg-then-get-by-val ! definitely 11.5349x slower
> > - super-get-by-val-with-this-polymorphic ! definitely 1.0516x slower
> > - math-with-out-of-bounds-array-values ! definitely 1.0676x slower
> > - direct-tail-call-inlined-caller ^ definitely 1.0539x faster
> > - v8-raytrace-with-try-catch ! definitely 1.1061x slower
> > - super-get-by-val-with-this-monomorphic ! definitely 1.0718x slower
> > - string-concat-long-convert ^ definitely 1.0428x faster
> > - fake-iterators-that-throw-when-finished ! definitely 1.5331x slower
> > - double-pollution-putbyoffset ! definitely 6.7236x slower
> > - string-transcoding ! definitely 1.2092x slower
> > - to-number-constructor-string-number-string-number ! definitely 1.2452x
> > slower
> > - function-dot-apply ^ definitely 1.1369x faster
> > - bigswitch-indirect-symbol-or-undefined ^ definitely 1.9525x faster
> > - getter-richards-try-catch ! definitely 1.0995x slower
> 
> Something is up with your results. Octane shows a “maybe 2%” slower in the
> geomean. Normally, we can detect a 1% change in that metric. So, I think
> that this means that your system was quite noisy when you ran this test. Or
> you did not run for enough iterations. I like to run with --outer 6 and leave
> --inner alone.

I ran with --outer 8 on a 13" MacBook Pro.  There's a chance that there's network / random background activity, but the machine is as quiet as I can make it.  I just finished another run with no ethernet attached (but just realized that wifi is still on).  I'll upload the new results, and do a 3rd run with wifi off as well.  With this 2nd run, Octane is "might be 1.0118x faster" with pdfjs "definitely 1.1681x slower".
Comment 13 Mark Lam 2017-08-26 16:21:26 PDT
Created attachment 319141 [details]
x86_64 benchmark results 2.
Comment 14 Saam Barati 2017-08-26 16:50:25 PDT
It’s not entirely surprising to see the try/catch micro benchmarks get slower. Most of those tests are stressing throwing an exception, which is an OSR exit in the DFG/FTL. What is somewhat surprising is how important the actual speed of the exit itself is in those microbenchmarks. I’m not too sure what to make of this yet. Maybe there are trivial things we can do to speed up the OSR exit C code?
Comment 15 Mark Lam 2017-08-26 16:56:56 PDT
(In reply to Saam Barati from comment #14)
> It’s not entirely surprising to see the try/catch micro benchmarks get
> slower. Most of those tests are stressing throwing an exception, which is an
> OSR exit in the DFG/FTL. What is somewhat surprising is how important the
> actual speed of the exit itself is in those microbenchmarks. I’m not too
> sure what to make of this yet. Maybe there are trivial things we can do to
> speed up the OSR exit C code?

I can try to do some profiling on OSR exits on pdfjs and see if any section of the code stands out as being expensive.  One area I'm wary of is the amount of memcpying done.

I also use std::memcpy to work around strict aliasing issues.  If the compiler is not smart enough to turn these into straight memory word copies, that can hurt a lot as well.  I'll take a close look.
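
For context, the memcpy idiom referred to here is the standard strict-aliasing workaround: reading a double's bits through a differently-typed pointer is undefined behavior, whereas copying the bytes with std::memcpy is well defined, and compilers usually lower a fixed-size memcpy to a single register move. A generic sketch of the idiom (not the patch's code):

    #include <cstdint>
    #include <cstring>

    // Reinterpret the bits of a double as a 64-bit integer without violating
    // strict aliasing. A fixed-size memcpy like this normally compiles to a
    // single register move; it only hurts if the optimizer fails to see
    // through it, which is the concern raised above.
    static inline uint64_t bitsOfDouble(double value)
    {
        static_assert(sizeof(uint64_t) == sizeof(double), "sizes must match");
        uint64_t bits;
        std::memcpy(&bits, &value, sizeof(bits));
        return bits;
    }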
Comment 16 Mark Lam 2017-08-26 17:20:33 PDT
Created attachment 319142 [details]
x86_64 benchmark results 3: no ethernet, no wifi (as quiet as can be).
Comment 17 Saam Barati 2017-08-26 17:32:23 PDT
(In reply to Mark Lam from comment #15)
> (In reply to Saam Barati from comment #14)
> > It’s not entirely surprising to see the try/catch micro benchmarks get
> > slower. Most of those tests are stressing throwing an exception, which is an
> > OSR exit in the DFG/FTL. What is somewhat surprising is how important the
> > actual speed of the exit itself is in those microbenchmarks. I’m not too
> > sure what to make of this yet. Maybe there are trivial things we can do to
> > speed up the OSR exit C code?
> 
> I can try to do some profiling on OSR exits on pdfjs and see if any section
> of the code stands out as being expensive.  One area I'm wary of is the
> amount of memcpying done.
> 
> I also use std::memcpy to work around strict aliasing issues.  If the
> compiler is not smart enough to turn these into straight memory word copies,
> that can hurt a lot as well.  I'll take a close look.

Your second run seems better in terms of numbers. When I say I’m not sure, I mean I’m not sure if this will hurt real world things that use try/catch. I think in practice probably not. But the slowdown indicates to me that there are perhaps ways to make the interpreter faster. Maybe these can be follow ups.

The interesting thing about an exception OSR exit is that it won’t cause us to jettison and recompile because it’s expected behavior. Maybe because of this, there is a slight argument to keep them compiled because it might execute many times for programs that throw. That said, I don’t think this should stop your work from going forward. I feel like we should be able to make the interpreter fast enough.
Comment 18 Filip Pizlo 2017-08-26 20:53:45 PDT
The latest results seem to say that your patch is a regression. 

You should profile to see if time spent in OSR exit is the problem in pdfjs. It could also be that some value profiling changed.

You could try reducing osrExitCountForReoptimization. Making that smaller will reduce the number of times we take any particular exit.
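
For reference, osrExitCountForReoptimization is a JSC runtime option; the mechanism it tunes, paraphrased (this is not JSC's actual code), is that each exit increments a per-code-block counter, and crossing the threshold jettisons the optimized block for recompilation, so a smaller threshold caps how often any one exit ramp can run:

    #include <cstdint>

    // Paraphrase of the knob described above, not JSC's actual code.
    // Lowering the threshold makes a hot exit trigger reoptimization sooner.
    struct ToyCodeBlock {
        uint32_t osrExitCount = 0;
        bool jettisoned = false;
    };

    static void recordOSRExit(ToyCodeBlock& block, uint32_t osrExitCountForReoptimization)
    {
        if (++block.osrExitCount >= osrExitCountForReoptimization)
            block.jettisoned = true; // triggers recompilation in the real VM
    }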
Comment 19 Mark Lam 2017-08-31 07:01:27 PDT
Created attachment 319459 [details]
work in progress.

Here's the current work in progress patch: fixed a few bugs, and made optimization changes to not repeat redundant work.  The result: microbenchmarks run faster (validating that the optimization is working as intended), but real world benchmarks actually appear to run slower.

At this point, it is difficult to see what is going wrong when running the real world benchmarks.  So, I'm going to pull out the big gun and do a full verification run comparing my probe OSR exit results with the ones generated by the old jit code.  I'll achieve this by inserting probes into the old code to compute the probe OSR exit register and stack state on the side, and then compare those results against the register and stack state computed by the JIT off-ramps.  Since the 2 methods will be feeding on the same real data, they should produce identical result data.  If there are any differences in the restored values at all, then we'll have a lead to chase down for resolving this perf issue.
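
The comparison itself is conceptually simple: capture the stack and register state produced by each off-ramp for the same exit, then diff them word by word. A hypothetical sketch of the idea follows; none of these names come from the attached archive.

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical sketch of the verification described above.
    struct CapturedExitState {
        std::vector<uintptr_t> stackWords; // frame contents after the off-ramp
        std::array<uintptr_t, 16> gprs;    // 16 GPRs, as on x86_64
    };

    static void compareExitResults(const CapturedExitState& probe, const CapturedExitState& jit)
    {
        size_t stackWords = std::min(probe.stackWords.size(), jit.stackWords.size());
        for (size_t i = 0; i < stackWords; ++i) {
            if (probe.stackWords[i] != jit.stackWords[i])
                std::printf("stack mismatch at word %zu\n", i);
        }
        for (size_t i = 0; i < probe.gprs.size(); ++i) {
            if (probe.gprs[i] != jit.gprs[i])
                std::printf("gpr mismatch at register %zu\n", i);
        }
    }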
Comment 20 Build Bot 2017-08-31 11:16:22 PDT
Attachment 319459 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:105:  Wrong number of spaces before statement. (expected: 12)  [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:106:  Wrong number of spaces before statement. (expected: 12)  [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:107:  Wrong number of spaces before statement. (expected: 12)  [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:108:  Wrong number of spaces before statement. (expected: 12)  [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:109:  Wrong number of spaces before statement. (expected: 12)  [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:110:  Wrong number of spaces before statement. (expected: 12)  [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:111:  Wrong number of spaces before statement. (expected: 12)  [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:112:  Wrong number of spaces before statement. (expected: 12)  [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:113:  Wrong number of spaces before statement. (expected: 12)  [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/bytecode/SuperSampler.cpp:52:  Should have a space between // and comment  [whitespace/comments] [4]
Total errors found: 10 in 34 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 21 Mark Lam 2017-08-31 23:05:34 PDT
Something is strange: comparing the stack of the probe based OSR exit and the jit based OSR exit, I detected a few stack operands that are different.  

However, when I chased down the difference, it is because the jit based OSR exit was recovering operand values from a stack location that is below the stack pointer.  As a result, that value changed midway between the start of the ramp and the point where the recovery was done.  The probe based OSR exit does not have this vulnerability because the true machine stack pointer is set much lower to accommodate probe data while the probe is executing.  This appears to be the sole source of the difference between the stack values of the 2 OSR exit versions.

Note: this means that either the jit version (1) is restoring a value it does not need, or (2) is restoring a corrupted value.  My bet is on (1).  I'll do a test to scribble a bad value into that restored value if the address is below the stack pointer.  If this causes a crash, then we're dealing with scenario (2), else (1).
Comment 22 Mark Lam 2017-08-31 23:37:36 PDT
(In reply to Mark Lam from comment #21)
> Note: this means that either the jit version (1) is restoring a value it
> does not need, or (2) is restoring a corrupted value.  My bet is on (1). 
> I'll do a test to scribble a bad value into that restored value if the
> address is below the stack pointer.  If this causes a crash, then we're
> dealing with scenario (2), else (1).

Indeed, (1) is correct.  I scribbled those values with 0xbadbeef0, and there were no crashes.
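
For the record, the scribbling experiment looks something like this (names hypothetical; 0xbadbeef0 is the poison value used above): any slot the ramp restored from below the stack pointer gets overwritten, so a crash or wrong result in the tests would prove the restored value was actually consumed, i.e. scenario (2).

    #include <cstdint>

    // Hypothetical illustration of the scribbling test described above.
    // Poison any restored slot that lies below the current stack pointer;
    // if the test suite still passes, nothing consumed the value (scenario 1).
    static void scribbleIfBelowStackPointer(uintptr_t* slot, const uintptr_t* stackPointer)
    {
        constexpr uintptr_t scribbleValue = 0xbadbeef0;
        if (slot < stackPointer)
            *slot = scribbleValue;
    }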
Comment 23 Mark Lam 2017-09-01 12:36:06 PDT
I've fixed another difference where we have converted a double JSValue to an int32 JSValue.  In some cases, the value fits in an int32, which is what triggered the conversion.  I now force it to be a double JSValue if it came from a double source.  This is what the old jit based OSR exit ramp does.

With that fix plus discounting the values written below the stack pointer (by scribbling 0xbadbeef0 into those slots), I now see that the stack is identical after the jit and probe based OSR exit ramps.

There are register differences, but only for arguments, temps, and scratch registers.  This is to be expected since the jit based OSR exit ramp calls C++ functions which use those registers.  So, the register differences can be disregarded as well.

The only remaining possible difference between the 2 OSR exit ramps is in changes to VM / profiling state.  I'll try to validate that next.
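
The encoding distinction behind the double JSValue fix above maps onto JSC's number constructors: jsNumber(double) canonicalizes a double that is exactly representable as an int32 into an int32 JSValue, while jsDoubleNumber() keeps the double encoding. A sketch of the shape of the fix (the constructors are JSC's, from JSCJSValue.h; the plumbing around them is paraphrased, not the patch's code):

    // Sketch only; assumes JSC's JSValue API. The boolean plumbing is
    // paraphrased from the description above, not the actual patch.
    static JSValue recoverNumber(double value, bool recoveredFromDoubleSource)
    {
        if (recoveredFromDoubleSource)
            return jsDoubleNumber(value); // keep the double encoding, as the jit ramp did
        return jsNumber(value);           // may canonicalize to an int32 JSValue
    }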
Comment 24 Mark Lam 2017-09-01 12:38:08 PDT
(In reply to Mark Lam from comment #23)
> I've fixed another difference where we have converted a double JSValue to an
> int32 JSValue.  In some cases, the value fits in an int32, which is what
> triggered the conversion.  I now force it to be a double JSValue if it came
> from a double source.  This is what the old jit based OSR exit ramp does.
> 
> With that fix plus discounting the values written below the stack pointer
> (by scribbling 0xbadbeef0 into those slots), I now see that the stack is
> identical after the jit and probe based OSR exit ramps.
> 
> There are register differences, but only for arguments, temps, and scratch
> registers.  This is to be expected since the jit based OSR exit ramp calls
> C++ functions which use those registers.  So, the register differences can
> be disregarded as well.
> 
> The only remaining possible difference between the 2 OSR exit ramps is in
> changes to VM / profiling state.  I'll try to validate that next.

FYI, with the double JSValue fix (which is the only real difference), pdfjs is even slower (from my quick benchmark run).  This means the JSValue encoding of the double value is not the cause of our perf difference.
Comment 25 Mark Lam 2017-09-01 13:25:51 PDT
Created attachment 319641 [details]
archive of code that compares the probe based OSR results vs the jit based one.
Comment 26 Mark Lam 2017-09-01 17:28:51 PDT
Created attachment 319676 [details]
archive of code that compares the probe based OSR results vs the jit based one v.2
Comment 27 Mark Lam 2017-09-06 12:02:44 PDT
The reason for the pdfjs perf regression is https://bugs.webkit.org/show_bug.cgi?id=176473.

I fixed the regression by doing what the jit OSR Exit code did, i.e. not doing what arrayModeFromStructure() does.  I'll clean up the patch, do some more testing, and prepare for a code review.
Comment 28 Filip Pizlo 2017-09-06 12:03:13 PDT
(In reply to Mark Lam from comment #27)
> The reason for the pdfjs perf regression is
> https://bugs.webkit.org/show_bug.cgi?id=176473.
> 
> I fixed the regression by doing what the jit OSR Exit code did, i.e. not
> doing what arrayModeFromStructure() does.  I'll clean up the patch, do some
> more testing, and prepare for a code review.

Nice!!
Comment 29 Mark Lam 2017-09-06 16:14:09 PDT
Created attachment 320072 [details]
proposed patch.

Let's try the patch on the EWS.
Comment 30 Mark Lam 2017-09-06 16:21:19 PDT
Created attachment 320075 [details]
x86_64 benchmark results (run 1)
Comment 31 Mark Lam 2017-09-06 16:21:41 PDT
Created attachment 320076 [details]
x86_64 benchmark results (run 2)
Comment 32 Mark Lam 2017-09-06 16:24:22 PDT
Benchmark results show that, on aggregate, perf is neutral.
Comment 33 Mark Lam 2017-09-07 11:49:40 PDT
Created attachment 320149 [details]
proposed patch (w/ 32-bit fixes).
Comment 34 Mark Lam 2017-09-07 11:54:44 PDT
Created attachment 320151 [details]
proposed patch (removed some debugging code).
Comment 35 Saam Barati 2017-09-07 15:21:43 PDT
Comment on attachment 320151 [details]
proposed patch (removed some debugging code).

View in context: https://bugs.webkit.org/attachment.cgi?id=320151&action=review

r=me with comments

> Source/JavaScriptCore/assembler/ProbeFrame.h:44
> +    T getArgument(int argument)

style nit: here and below, you preface getters with get, but I don't think we do that elsewhere, e.g., Operands. What about just calling this:
argument(...), and below, operand(...), etc.

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:80
> +    RegisterSet dontRestoreRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs());

style nit: This same thing is used here and below, might be worth making it a function to call.

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:135
> +            context.gpr(entry.reg().gpr()) = stack.get<uintptr_t>(calleeSaveBuffer, entry.offset());
> +        else
> +            context.gpr(entry.reg().gpr()) = stack.get<double>(calleeSaveBuffer, entry.offset());

why do these gets need to use the stack API?

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:152
> +    ASSERT(is64Bit());

seems like this could be a static assert?

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:160
> +        if (entry.reg().isGPR())
> +            stack.set(calleeSaveBuffer, entry.offset(), context.gpr<uintptr_t>(entry.reg().gpr()));
> +        else
> +            stack.set(calleeSaveBuffer, entry.offset(), context.fpr<uintptr_t>(entry.reg().fpr()));

I think it'd be nice to have context expose a register API so it does this branch for you. You also write the branch in a few places, but if context handled it, you could just do:
stack.set(calleeSaveBuffer, entry.offset(), context.reg<uintptr_t>(entry.reg()));

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:171
> +    RegisterSet dontSaveRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs());

ditto about method abstracting this instead of copy-pasting.

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:176
> +#if USE(JSVALUE64)
> +    RegisterSet baselineCalleeSaves = RegisterSet::llintBaselineCalleeSaveRegisters();
> +#endif

seems like this is guaranteed given NUMBER_OF_CALLEE_SAVES_REGISTERS.

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:187
> +#if USE(JSVALUE32_64)
> +        UNUSED_PARAM(wasCalledViaTailCall);
> +#else

seems like this is dead given: NUMBER_OF_CALLEE_SAVES_REGISTERS

Maybe inside your:
#if NUMBER_OF_CALLEE_SAVES_REGISTERS > 0
 
you can add: static_assert(is64bit(), "")

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:326
> +void OSRExit::executeOSRExit(Context& context)

Should you set top call frame inside of here? I think that'd help the sampling profiler at least. Maybe other things need it too.

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:353
> +    if (UNLIKELY(!exit.exitState)) {

suggestion: Let's cache the various calls of JITCode::dfg() here since it's a virtual function

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:366
> +        // Compute the value recoveries.
> +        Operands<ValueRecovery> operands;
> +        codeBlock->jitCode()->dfg()->variableEventStream.reconstruct(codeBlock, exit.m_codeOrigin, codeBlock->jitCode()->dfg()->minifiedDFG, exit.m_streamIndex, operands);

This is an interesting tradeoff. I think it's the right one. However, I know we use the event stream to reconstruct on demand because storing the reconstructed data for each exit has a high memory cost. That said, we're only doing this here for *executed* exits. That's why I think this is ok and the right choice. But it's worth making sure we truly believe this is the best choice.

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:489
> +                Structure* structure = jsValueFor(cpu, exit.m_jsValueSource).asCell()->structure(vm);

how do we know the incoming value is a cell?

> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:745
> +        saveOrCopyCalleeSavesFor(context, baselineCodeBlock, static_cast<VirtualRegister>(inlineCallFrame->stackOffset), !trueCaller);

Style: This static cast here seems weird, why not just: VirtualRegister(inlineCallFrame->stackOffset) ?
Comment 36 Mark Lam 2017-09-07 16:44:54 PDT
Thanks for the review.  I've addressed your feedback as follows.  Will retest now before I land.

(In reply to Saam Barati from comment #35)
> Comment on attachment 320151 [details]
> proposed patch (removed some debugging code).
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=320151&action=review
> 
> r=me with comments
> 
> > Source/JavaScriptCore/assembler/ProbeFrame.h:44
> > +    T getArgument(int argument)
> 
> style nit: here and below, you preface getters with get, but I don't think
> we do that elsewhere, e.g., Operands. What about just calling this:
> argument(...), and below, operand(...), etc.

Fixed.

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:80
> > +    RegisterSet dontRestoreRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs());
> 
> style nit: This same thing is used here and below, might be worth making it
> a function to call.

This is not a lot of code, and the original code that these functions are based on (AssemblyHelpers::restoreCalleeSavesFromVMEntryFrameCalleeSavesBuffer(), AssemblyHelpers::copyCalleeSavesToVMEntryFrameCalleeSavesBuffer(), and AssemblyHelpers::copyCalleeSavesToVMEntryFrameCalleeSavesBuffer()) also decided to just duplicate this line.  I'll keep this as is for now to keep these functions consistent with the AssemblyHelpers ones they are based on.  I can do a refactor later on to add a calleeSavesRegistersToOmitFromRestoration() and change all sites to use that.

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:135
> > +            context.gpr(entry.reg().gpr()) = stack.get<uintptr_t>(calleeSaveBuffer, entry.offset());
> > +        else
> > +            context.gpr(entry.reg().gpr()) = stack.get<double>(calleeSaveBuffer, entry.offset());
> 
> why do these gets need to use the stack API?

They don't need to be.  I fixed it to read from the stack.

There's also a bug here where the fpr case (the else case) is using gpr values.  I've fixed that also.

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:152
> > +    ASSERT(is64Bit());
> 
> seems like this could be a static assert?

Fixed.

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:160
> > +        if (entry.reg().isGPR())
> > +            stack.set(calleeSaveBuffer, entry.offset(), context.gpr<uintptr_t>(entry.reg().gpr()));
> > +        else
> > +            stack.set(calleeSaveBuffer, entry.offset(), context.fpr<uintptr_t>(entry.reg().fpr()));
> 
> I think it'd be nice to have context expose a register API so it does this
> branch for you. You also write the branch in a few places, but if context
> handled it, you could just do:
> stack.set(calleeSaveBuffer, entry.offset(),
> context.reg<uintptr_t>(entry.reg()));

There are 3 issues with your suggestion:
1. It's a layer violation.

   Probe::Context is defined at the MacroAssembler level, and Reg (which we query isGPR() on) is defined at the JIT level.  It'd be strange to have Probe::Context depend on Reg.  We could move Reg into the MacroAssembler level, but that's more refactoring work.

2. Your suggestion only works for 64-bit where sizeof GPR and FPR are both 64-bit.

   Yes, currently, only 64-bit uses callee saves, and hence, things should work out.  But it's still kind of gross to rely on that.

3. The original AssemblyHelpers code that these functions are based on does it this way.

   I'd rather keep this version of the code consistent with the AssemblyHelpers version.

For these reasons, I'd like to keep this code as is.

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:171
> > +    RegisterSet dontSaveRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs());
> 
> ditto about method abstracting this instead of copy-pasting.

Answered above.

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:176
> > +#if USE(JSVALUE64)
> > +    RegisterSet baselineCalleeSaves = RegisterSet::llintBaselineCalleeSaveRegisters();
> > +#endif
> 
> seems like this is guaranteed given NUMBER_OF_CALLEE_SAVES_REGISTERS.

Fixed.  I removed the #if+#endif.

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:187
> > +#if USE(JSVALUE32_64)
> > +        UNUSED_PARAM(wasCalledViaTailCall);
> > +#else
> 
> seems like this is dead given: NUMBER_OF_CALLEE_SAVES_REGISTERS

Fixed.

> Maybe inside your:
> #if NUMBER_OF_CALLEE_SAVES_REGISTERS > 0
>  
> you can add: static_assert(is64bit(), "")

Done.

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:326
> > +void OSRExit::executeOSRExit(Context& context)
> 
> Should you set top call frame inside of here? I think that'd help the
> sampling profiler at least. Maybe other things need it too.

I was already setting it in adjustAndJumpToTarget() after the frame has been fully reconstructed.  I think that would be the better time to set it.

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:353
> > +    if (UNLIKELY(!exit.exitState)) {
> 
> suggestion: Let's cache the various calls of JITCode::dfg() here since it's
> a virtual function

The DFG::JITCode* value is only used in 2 functions, and I can't avoid computing it at least once on every OSR exit in order to get the OSRExit record.  So, I just cache it in a local and reuse that (didn't need to cache it in the exitState).

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:366
> > +        // Compute the value recoveries.
> > +        Operands<ValueRecovery> operands;
> > +        codeBlock->jitCode()->dfg()->variableEventStream.reconstruct(codeBlock, exit.m_codeOrigin, codeBlock->jitCode()->dfg()->minifiedDFG, exit.m_streamIndex, operands);
> 
> This is an interesting tradeoff. I think it's the right one. However, I know
> we use the event stream to reconstruct on demand because storing the
> reconstructed data for each exit has a high memory cost. That said, we're
> only doing this here for *executed* exits. That's why I think this is ok and
> the right choice. But it's worth making sure we truly believe this is the
> best choice.

This caching accounted for about half of the speed up on the int-or-other-neg-then-get-by-val microbenchmark (relative to my very first patch).  It is also correct to cache these values and use them.  The old jit based OSR Exit caches the value recovery data in the form of the emitted OSR off-ramp jit code.  We're effectively doing the same thing (in spirit) here.

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:489
> > +                Structure* structure = jsValueFor(cpu, exit.m_jsValueSource).asCell()->structure(vm);
> 
> how do we know the incoming value is a cell?

We can only get here if exit.m_kind == BadCache || exit.m_kind == BadIndexingType.  I presume that those exit kinds mean that the value is a cell.  Anyway, I'm not introducing new logic here.  This is exactly what the old jit based OSR exit code does in OSRExit::compileExit(): it loads the value and proceeds to load its structureID without first confirming that it's a cell.

> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:745
> > +        saveOrCopyCalleeSavesFor(context, baselineCodeBlock, static_cast<VirtualRegister>(inlineCallFrame->stackOffset), !trueCaller);
> 
> Style: This static cast here seems weird, why not just:
> VirtualRegister(inlineCallFrame->stackOffset) ?

Fixed.
Comment 37 Mark Lam 2017-09-07 18:16:34 PDT
Thanks again for the review.  Landed in r221774: <http://trac.webkit.org/r221774>.
Comment 38 Saam Barati 2017-09-08 16:22:05 PDT
Comment on attachment 320151 [details]
proposed patch (removed some debugging code).

View in context: https://bugs.webkit.org/attachment.cgi?id=320151&action=review

>>> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:135
>>> +            context.gpr(entry.reg().gpr()) = stack.get<double>(calleeSaveBuffer, entry.offset());
>> 
>> why do these gets need to use the stack API?
> 
> They don't need to be.  I fixed it to read from the stack.
> 
> There's also a bug here where the fpr case (the else case) is using gpr values.  I've fixed that also.

It's always fun to learn when we don't have tests for things :)
Could be worth filing a bug saying this is the case.

>>> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:160
>>> +            stack.set(calleeSaveBuffer, entry.offset(), context.fpr<uintptr_t>(entry.reg().fpr()));
>> 
>> I think it'd be nice to have context expose a register API so it does this branch for you. You also write the branch in a few places, but if context handled it, you could just do:
>> stack.set(calleeSaveBuffer, entry.offset(), context.reg<uintptr_t>(entry.reg()));
> 
> There are 3 issues with your suggestion:
> 1. It's a layer violation.
> 
>    Probe::Context is defined at the MacroAssembler level, and Reg (which we query isGPR() on) is defined at the JIT level.  It'd be strange to have Probe::Context depend on Reg.  We could move Reg into the MacroAssembler level, but that's more refactoring work.
> 
> 2. Your suggestion only works for 64-bit where sizeof GPR and FPR are both 64-bit.
> 
>    Yes, currently, only 64-bit uses callee saves, and hence, things should work out.  But it's still kind of gross to rely on that.
> 
> 3. The original AssemblyHelpers code that these functions are based on does it this way.
> 
>    I'd rather keep this version of the code consistent with the AssemblyHelpers version.
> 
> For these reasons, I'd like to keep this code as is.

I see. I didn't think about this layering violation. I think it'd be worth having a helper in this file alone. It could even be under the JSVALUE64.

>>> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:326
>>> +void OSRExit::executeOSRExit(Context& context)
>> 
>> Should you set top call frame inside of here? I think that'd help the sampling profiler at least. Maybe other things need it too.
> 
> I was already setting it in adjustAndJumpToTarget() after the frame has been fully reconstructed.  I think that would be the better time to set it.

Let me restate:
The way this code is written, the sampling profiler cannot say time is spent here. If something is OSR exiting a lot, there is a good chance time is spent at the bytecode operation. Your patch makes the sampling profiler not ascribe time to that bytecode anymore. That's unfortunate.

To make this work, we'd need to:
1. Set the call site index in the exiting frame
2. Set top call frame

I think it's definitely worth doing this. Maybe we can open a bug to do this in a follow up.
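
A sketch of the follow-up Saam describes, using JSC's VM::topCallFrame and CallFrame::setCallSiteIndex (the wrapper function and its names are hypothetical; the real change would live at the top of executeOSRExit()):

    // Hypothetical wrapper around the two steps listed above (assumes JSC's
    // VM, CallFrame, and CallSiteIndex types): record the exit site before
    // running the slow exit ramp, so a sampling-profiler sample taken
    // mid-exit is attributed to the exiting bytecode.
    static void markExitingFrame(VM& vm, CallFrame* exitingFrame, CallSiteIndex exitSite)
    {
        exitingFrame->setCallSiteIndex(exitSite); // step 1: call site index in the exiting frame
        vm.topCallFrame = exitingFrame;           // step 2: publish as the top call frame
    }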
Comment 39 Ryan Haddad 2017-09-09 09:50:20 PDT
(In reply to Mark Lam from comment #37)
> Thanks again for the review.  Landed in r221774:
> <http://trac.webkit.org/r221774>.

This change introduced 3 JSC test timeouts. Details in https://bugs.webkit.org/show_bug.cgi?id=176630
Comment 40 Ryan Haddad 2017-09-09 09:54:47 PDT
Reverted r221774 for reason:

This change introduced three debug JSC test timeouts.

Committed r221823: <http://trac.webkit.org/changeset/221823>
Comment 41 Filip Pizlo 2017-09-09 15:10:41 PDT
(In reply to Ryan Haddad from comment #40)
> Reverted r221774 for reason:
> 
> This change introduced three debug JSC test timeouts.
> 
> Committed r221823: <http://trac.webkit.org/changeset/221823>

I think that we should skip tests in debug when they time out, instead of rolling out the change that caused it.
Comment 42 Mark Lam 2017-09-09 17:20:33 PDT
Created attachment 320367 [details]
patch for re-landing.
Comment 43 Mark Lam 2017-09-09 17:23:15 PDT
Re-landed in r221832: <http://trac.webkit.org/r221832>.
Comment 44 Mark Lam 2017-09-13 21:22:57 PDT
(In reply to Mark Lam from comment #43)
> Re-landed in r221832: <http://trac.webkit.org/r221832>.

This was rolled out in r222009: <http://trac.webkit.org/r222009> due to https://bugs.webkit.org/show_bug.cgi?id=176888.
Comment 45 Yusuke Suzuki 2017-09-14 03:47:28 PDT
(In reply to Mark Lam from comment #44)
> (In reply to Mark Lam from comment #43)
> > Re-landed in r221832: <http://trac.webkit.org/r221832>.
> 
> This was rolled out in r222009: <http://trac.webkit.org/r222009> due to
> https://bugs.webkit.org/show_bug.cgi?id=176888.

It seems that the OSR patch caused a small regression on Octane/gbemu. Reverting it gives a 4.3% recovery.
https://arewefastyet.com/#machine=29&view=single&suite=octane&subtest=Gameboy
Comment 46 Mark Lam 2017-09-14 10:12:56 PDT
*** Bug 176874 has been marked as a duplicate of this bug. ***