WebKit Bugzilla
REOPENED
175144
Use JIT probes for DFG OSR exit.
https://bugs.webkit.org/show_bug.cgi?id=175144
Summary
Use JIT probes for DFG OSR exit.
Mark Lam
Reported
2017-08-03 11:39:29 PDT
<rdar://problem/33437050>.
Attachments
- work in progress: won't build yet. still need a few edits + self code review. (110.87 KB, patch) 2017-08-23 17:26 PDT, Mark Lam, no flags
- work in progress: it builds but crashes in the tests. (110.02 KB, patch) 2017-08-24 15:06 PDT, Mark Lam, no flags
- work in progress: fixed some bugs but still have exotic failures. (130.98 KB, patch) 2017-08-25 17:27 PDT, Mark Lam, no flags
- proposed patch. (110.49 KB, patch) 2017-08-26 12:32 PDT, Mark Lam, no flags
- x86_64 benchmark results. (96.83 KB, text/plain) 2017-08-26 14:07 PDT, Mark Lam, no flags
- proposed patch w/ Windows build fix. (110.86 KB, patch) 2017-08-26 14:25 PDT, Mark Lam, no flags
- x86_64 benchmark results 2. (97.24 KB, text/plain) 2017-08-26 16:21 PDT, Mark Lam, no flags
- x86_64 benchmark results 3: no ethernet, no wifi (as quiet as can be). (96.86 KB, text/plain) 2017-08-26 17:20 PDT, Mark Lam, no flags
- work in progress. (126.14 KB, patch) 2017-08-31 07:01 PDT, Mark Lam, no flags
- archive of code that compares the probe based OSR results vs the jit based one. (107.95 KB, patch) 2017-09-01 13:25 PDT, Mark Lam, no flags
- archive of code that compares the probe based OSR results vs the jit based one v.2 (116.67 KB, patch) 2017-09-01 17:28 PDT, Mark Lam, no flags
- proposed patch. (126.36 KB, patch) 2017-09-06 16:14 PDT, Mark Lam, no flags
- x86_64 benchmark results (run 1) (95.40 KB, text/plain) 2017-09-06 16:21 PDT, Mark Lam, no flags
- x86_64 benchmark results (run 2) (96.46 KB, text/plain) 2017-09-06 16:21 PDT, Mark Lam, no flags
- proposed patch (w/ 32-bit fixes). (126.62 KB, patch) 2017-09-07 11:49 PDT, Mark Lam, no flags
- proposed patch (removed some debugging code). (123.03 KB, patch) 2017-09-07 11:54 PDT, Mark Lam, saam: review+
- patch for re-landing. (120.62 KB, patch) 2017-09-09 17:20 PDT, Mark Lam, no flags
Mark Lam
Comment 1
2017-08-10 10:23:18 PDT
After working through the code and thinking about this, I think I'll go with an OSR exit thunk that executes 2 JIT probes: the first to ensure that there's enough stack space to do the in-place OSR exit frame re-write, and the second to do the in-place OSR exit frame re-write. When the second probe returns, we should resume execution in the baseline codeBlock.

This approach is simplest (it maps well from the current implementation, which makes the patch easier to review), and most efficient (less memcpying to and from side buffers).
Lucas Forschler
Comment 2
2017-08-10 14:01:31 PDT
<rdar://problem/33437050>
Mark Lam
Comment 3
2017-08-11 11:24:22 PDT
(In reply to Mark Lam from
comment #1
)
> After working through the code and thinking about this, I think I'll go with
> an OSR exit thunk that executes 2 JIT probes: the first to ensure that
> there's enough stack space to do in-place OSR exit frame re-write, and the
> second to do the in-place OSR exit frame re-write. When the second probe
> returns, we should resume execution in the baseline codeBlock.
>
> This approach is simplest (maps well from the current implementation, which
> makes the patch easier to review), and most efficient (less memcpying to and
> from side buffers).
Change of plans: we can make the probe even more powerful so that it provides APIs to allow for in place stack modifications. This eliminates the need for the client to be aware that the probe stores state on the stack and where it's stored.
Mark Lam
Comment 4
2017-08-23 17:26:52 PDT
Created
attachment 318947
[details]
work in progress: won't build yet. still need a few edits + self code review.
Mark Lam
Comment 5
2017-08-24 15:06:40 PDT
Created
attachment 319025
[details]
work in progress: it builds but crashes in the tests.
Mark Lam
Comment 6
2017-08-25 17:27:37 PDT
Created
attachment 319121
[details]
work in progress: fixed some bugs but still have exotic failures.
Mark Lam
Comment 7
2017-08-26 12:32:58 PDT
Created
attachment 319137
[details]
proposed patch. Tests are passing locally for me now. Let's get some feedback and EWS testing while I do some benchmarking and get some memory measurements.
Mark Lam
Comment 8
2017-08-26 14:07:20 PDT
Created
attachment 319138
[details]
x86_64 benchmark results.
Mark Lam
Comment 9
2017-08-26 14:17:56 PDT
Benchmark results show that, on aggregate, perf is neutral.

However, on specific benchmark components:
- Octane's pdfjs is 1.1595x slower
- SixSpeed's spread-literal.es5 is 1.0524x slower
- SixSpeed's classes.es5 is 1.0446x faster
- SixSpeed's map-set-object.es6 is 1.0336x faster

... and some microbenchmarks are a lot slower:
- delta-blue-try-catch ! definitely 1.3200x slower
- int-or-other-neg-then-get-by-val ! definitely 11.5349x slower
- super-get-by-val-with-this-polymorphic ! definitely 1.0516x slower
- math-with-out-of-bounds-array-values ! definitely 1.0676x slower
- direct-tail-call-inlined-caller ^ definitely 1.0539x faster
- v8-raytrace-with-try-catch ! definitely 1.1061x slower
- super-get-by-val-with-this-monomorphic ! definitely 1.0718x slower
- string-concat-long-convert ^ definitely 1.0428x faster
- fake-iterators-that-throw-when-finished ! definitely 1.5331x slower
- double-pollution-putbyoffset ! definitely 6.7236x slower
- string-transcoding ! definitely 1.2092x slower
- to-number-constructor-string-number-string-number ! definitely 1.2452x slower
- function-dot-apply ^ definitely 1.1369x faster
- bigswitch-indirect-symbol-or-undefined ^ definitely 1.9525x faster
- getter-richards-try-catch ! definitely 1.0995x slower
Mark Lam
Comment 10
2017-08-26 14:25:44 PDT
Created
attachment 319139
[details]
proposed patch w/ Windows build fix.
Filip Pizlo
Comment 11
2017-08-26 15:33:55 PDT
(In reply to Mark Lam from
comment #9
)
> Benchmark results show that, on aggregate, perf is neutral.
>
> However, on specific benchmark components:
> - Octane's pdfjs is 1.1595x slower
> - SixSpeed's spread-literal.es5 is 1.0524x slower
> - SixSpeed's classes.es5 is 1.0446x faster
> - SixSpeed's map-set-object.es6 is 1.0336x faster
>
> ... and some microbenchmarks are a lot slower:
> - delta-blue-try-catch ! definitely 1.3200x slower
> - int-or-other-neg-then-get-by-val ! definitely 11.5349x slower
> - super-get-by-val-with-this-polymorphic ! definitely 1.0516x slower
> - math-with-out-of-bounds-array-values ! definitely 1.0676x slower
> - direct-tail-call-inlined-caller ^ definitely 1.0539x faster
> - v8-raytrace-with-try-catch ! definitely 1.1061x slower
> - super-get-by-val-with-this-monomorphic ! definitely 1.0718x slower
> - string-concat-long-convert ^ definitely 1.0428x faster
> - fake-iterators-that-throw-when-finished ! definitely 1.5331x slower
> - double-pollution-putbyoffset ! definitely 6.7236x slower
> - string-transcoding ! definitely 1.2092x slower
> - to-number-constructor-string-number-string-number ! definitely 1.2452x slower
> - function-dot-apply ^ definitely 1.1369x faster
> - bigswitch-indirect-symbol-or-undefined ^ definitely 1.9525x faster
> - getter-richards-try-catch ! definitely 1.0995x slower
Something is up with your results. Octane shows a “maybe 2%” slower in the geomean. Normally, we can detect a 1% change in that metric. So, I think that this means that your system was quite noisy when you ran this test, or you did not run for enough iterations. I like to run with --outer 6 and leave --inner alone.
Mark Lam
Comment 12
2017-08-26 16:19:36 PDT
(In reply to Filip Pizlo from
comment #11
)
> (In reply to Mark Lam from comment #9)
>
> Something is up with your results. Octane shows a “maybe 2%” slower in the
> geomean. Normally, we can detect a 1% change in that metric. So, I think
> that this means that your system was quite noisy when you ran this test. Or
> you did not run for enough iterations. I like to run with --outer 6 and leave
> --inner alone.
I ran with --outer 8 on a 13" MacBook Pro. There's a chance that there's network / random background activity, but the machine is as quiet as I can make it. I just finished another run with no ethernet attached (but just realized that wifi is still on). I'll upload the new results, and do a 3rd run with wifi off as well.

With this 2nd run, Octane is "might be 1.0118x faster" with pdfjs "definitely 1.1681x slower".
Mark Lam
Comment 13
2017-08-26 16:21:26 PDT
Created
attachment 319141
[details]
x86_64 benchmark results 2.
Saam Barati
Comment 14
2017-08-26 16:50:25 PDT
It’s not entirely surprising to see the try/catch micro benchmarks get slower. Most of those tests are stressing throwing an exception, which is an OSR exit in the DFG/FTL. What is somewhat surprising is how important the actual speed of the exit itself is in those microbenchmarks. I’m not too sure what to make of this yet. Maybe there are trivial things we can do to speed up the OSR exit C code?
Mark Lam
Comment 15
2017-08-26 16:56:56 PDT
(In reply to Saam Barati from
comment #14
)
> It’s not entirely surprising to see the try/catch micro benchmarks get
> slower. Most of those tests are stressing throwing an exception, which is an
> OSR exit in the DFG/FTL. What is somewhat surprising is how important the
> actual speed of the exit itself is in those microbenchmarks. I’m not too
> sure what to make of this yet. Maybe there are trivial things we can do to
> speed up the OSR exit C code?
I can try to do some profiling of OSR exits on pdfjs and see if any section of the code stands out as being expensive. One area I'm wary of is the amount of memcpying done.

I also use std::memcpy to work around strict aliasing issues. If the compiler is not smart enough to turn these into straight memory word copies, that can hurt a lot as well. I'll take a close look.
Mark Lam
Comment 16
2017-08-26 17:20:33 PDT
Created
attachment 319142
[details]
x86_64 benchmark results 3: no ethernet, no wifi (as quiet as can be).
Saam Barati
Comment 17
2017-08-26 17:32:23 PDT
(In reply to Mark Lam from
comment #15
)
> (In reply to Saam Barati from
comment #14
)
> I can try to do some profiling on OSR exits on pdfjs and see if any section
> of the code stands out as being expensive. One area I'm wary of is the
> amount of memcpying done.
>
> I also use std::memcpy to work around strict aliasing issues. If the
> compiler is not smart enough to turn these into straight memory word copies,
> that can hurt a lot as well. I'll take a close look.
Your second run seems better in terms of numbers.

When I say I’m not sure, I mean I’m not sure if this will hurt real world things that use try/catch. I think in practice probably not. But the slowdown indicates to me that there are perhaps ways to make the interpreter faster. Maybe these can be follow ups.

The interesting thing about an exception OSR exit is that it won’t cause us to jettison and recompile because it’s expected behavior. Maybe because of this, there is a slight argument to keep them compiled because it might execute many times for programs that throw.

That said, I don’t think this should stop your work from going forward. I feel like we should be able to make the interpreter fast enough.
Filip Pizlo
Comment 18
2017-08-26 20:53:45 PDT
The latest results seem to say that your patch is a regression. You should profile to see if time spent in OSR exit is the problem in pdfjs. It could also be that some value profiling changed. You could try reducing osrExitCountForReoptimization. Making that smaller will reduce the number of times we take any particular exit.
Mark Lam
Comment 19
2017-08-31 07:01:27 PDT
Created
attachment 319459
[details]
work in progress.

Here's the current work in progress patch: fixed a few bugs, and made optimization changes to not repeat redundant work. The result: microbenchmarks run faster (validating that the optimization is working as intended), but real world benchmarks actually appear to run slower.

At this point, it is difficult to see what is going wrong when running the real world benchmarks. So, I'm going to pull out the big gun and do a full verification run comparing my probe OSR exit results with the ones generated by the old code. I'll achieve this by inserting probes into the old code to compute the probe OSR exit register and stack state on the side, and then compare these results against the register and stack state computed by the JIT offramps. Since the 2 methods will be feeding on the same real data, they should produce identical result data. If there are any differences in the restored values at all, then we'll have a lead to chase down for resolving this perf issue.
Build Bot
Comment 20
2017-08-31 11:16:22 PDT
Attachment 319459
[details]
did not pass style-queue:

ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:105: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:106: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:107: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:108: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:109: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:110: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:111: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:112: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/dfg/DFGOSRExit.h:113: Wrong number of spaces before statement. (expected: 12) [whitespace/indent] [4]
ERROR: Source/JavaScriptCore/bytecode/SuperSampler.cpp:52: Should have a space between // and comment [whitespace/comments] [4]

Total errors found: 10 in 34 files

If any of these errors are false positives, please file a bug against check-webkit-style.
Mark Lam
Comment 21
2017-08-31 23:05:34 PDT
Something is strange: comparing the stack of the probe based OSR exit and the jit based OSR exit, I detected a few stack operands that are different. However, when I chased down the difference, it is because the jit based OSR exit was recovering operand values from a stack location that is below the stack pointer. As a result, that value changed mid-way between the start of the ramp and the point where the recovery was done. The probe based OSR exit does not have this vulnerability because the true machine stack pointer is set much lower to accommodate probe data while the probe is executing. This appears to be the sole source of the difference between the stack values of the 2 OSR exit versions.

Note: this means that either the jit version (1) is restoring a value it does not need, or (2) is restoring a corrupted value. My bet is on (1). I'll do a test to scribble a bad value into that restored value if the address is below the stack pointer. If this causes a crash, then we're dealing with scenario (2), else (1).
Mark Lam
Comment 22
2017-08-31 23:37:36 PDT
(In reply to Mark Lam from
comment #21
)
> Note: this means that either the jit version (1) is restoring a value it
> does not need, or (2) is restoring a corrupted value. My bet is on (1).
> I'll do a test to scribble a bad value into that restored value if the
> address is below the stack pointer. If this causes a crash, then we're
> dealing with scenario (2), else (1).
Indeed, (1) is correct. I scribbled those values with 0xbadbeef0, and there were no crashes.
Mark Lam
Comment 23
2017-09-01 12:36:06 PDT
I've fixed another difference where we had converted a double JSValue to an int32 JSValue. In some cases, the value fits in an int32, which is what triggered the conversion. I now force it to be a double JSValue if it came from a double source. This is what the old jit based OSR exit ramp does.

With that fix, plus discounting the values written below the stack pointer (by scribbling 0xbadbeef0 into those slots), I now see that the stack is identical after the jit and probe based OSR exit ramps.

There are register differences, but only for arguments, temps, and scratch registers. This is to be expected since the jit based OSR exit ramp calls C++ functions which use those registers. So, the register differences can be disregarded as well.

The only remaining possible difference between the 2 OSR exit ramps is in changes to VM / profiling state. I'll try to validate that next.
Mark Lam
Comment 24
2017-09-01 12:38:08 PDT
(In reply to Mark Lam from
comment #23
)
> I've fixed another difference where we have converted a double JSValue to an
> int32 JSValue. In some cases, the value fits in an int32, which is what
> triggered the conversion. I now force it to be a double JSValue if it came
> from a double source. This is what the old jit based OSR exit ramp does.
>
> With that fix plus discounting the values written below the stack pointer
> (by scribbling 0xbadbeef0 into those slots), I now see that the stack is
> identical after the jit and probe based OSR exit ramps.
>
> There are register differences, but only for arguments, temps, and scratch
> registers. This is to be expected since the jit based OSR exit ramp calls
> C++ functions which use those registers. So, the register differences can
> be disregarded as well.
>
> The only remaining possible difference between the 2 OSR exit ramps is in
> changes to VM / profiling state. I'll try to validate that next.
FYI, with the double JSValue fix (which is the only real difference), pdfjs is even slower (from my quick benchmark run). This means the JSValue encoding of the double value is not the cause of our perf difference.
Mark Lam
Comment 25
2017-09-01 13:25:51 PDT
Created
attachment 319641
[details]
archive of code that compares the probe based OSR results vs the jit based one.
Mark Lam
Comment 26
2017-09-01 17:28:51 PDT
Created
attachment 319676
[details]
archive of code that compares the probe based OSR results vs the jit based one v.2
Mark Lam
Comment 27
2017-09-06 12:02:44 PDT
The reason for the pdfjs perf regression is
https://bugs.webkit.org/show_bug.cgi?id=176473
. I fixed the regression by doing what the jit OSR exit code did, i.e., not doing what arrayModeFromStructure() does. I'll clean up the patch, do some more testing, and prepare for a code review.
Filip Pizlo
Comment 28
2017-09-06 12:03:13 PDT
(In reply to Mark Lam from
comment #27
)
> The reason for the pdfjs perf regression is >
https://bugs.webkit.org/show_bug.cgi?id=176473
.
>
> I fixed the regression by doing what the jit OSR exit code did, i.e., not
> doing what arrayModeFromStructure() does. I'll clean up the patch, do some
> more testing, and prepare for a code review.
Nice!!
Mark Lam
Comment 29
2017-09-06 16:14:09 PDT
Created
attachment 320072
[details]
proposed patch. Let's try the patch on the EWS.
Mark Lam
Comment 30
2017-09-06 16:21:19 PDT
Created
attachment 320075
[details]
x86_64 benchmark results (run 1)
Mark Lam
Comment 31
2017-09-06 16:21:41 PDT
Created
attachment 320076
[details]
x86_64 benchmark results (run 2)
Mark Lam
Comment 32
2017-09-06 16:24:22 PDT
Benchmark results show that, on aggregate, perf is neutral.
Mark Lam
Comment 33
2017-09-07 11:49:40 PDT
Created
attachment 320149
[details]
proposed patch (w/ 32-bit fixes).
Mark Lam
Comment 34
2017-09-07 11:54:44 PDT
Created
attachment 320151
[details]
proposed patch (removed some debugging code).
Saam Barati
Comment 35
2017-09-07 15:21:43 PDT
Comment on
attachment 320151
[details]
proposed patch (removed some debugging code). View in context:
https://bugs.webkit.org/attachment.cgi?id=320151&action=review
r=me with comments
> Source/JavaScriptCore/assembler/ProbeFrame.h:44
> + T getArgument(int argument)
style nit: here and below, you preface getters with get, but I don't think we do that elsewhere, e.g., Operands. What about just calling this: argument(...), and below, operand(...), etc.
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:80
> + RegisterSet dontRestoreRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs());
style nit: This same thing is used here and below, might be worth making it a function to call.
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:135
> + context.gpr(entry.reg().gpr()) = stack.get<uintptr_t>(calleeSaveBuffer, entry.offset());
> + else
> + context.gpr(entry.reg().gpr()) = stack.get<double>(calleeSaveBuffer, entry.offset());
why do these gets need to use the stack API?
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:152
> + ASSERT(is64Bit());
seems like this could be a static assert?
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:160
> + if (entry.reg().isGPR())
> + stack.set(calleeSaveBuffer, entry.offset(), context.gpr<uintptr_t>(entry.reg().gpr()));
> + else
> + stack.set(calleeSaveBuffer, entry.offset(), context.fpr<uintptr_t>(entry.reg().fpr()));
I think it'd be nice to have context expose a register API so it does this branch for you. You also write the branch in a few places, but if context handled it, you could just do: stack.set(calleeSaveBuffer, entry.offset(), context.reg<uintptr_t>(entry.reg()));
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:171
> + RegisterSet dontSaveRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs());
ditto about method abstracting this instead of copy-pasting.
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:176
> +#if USE(JSVALUE64)
> + RegisterSet baselineCalleeSaves = RegisterSet::llintBaselineCalleeSaveRegisters();
> +#endif
seems like this is guaranteed given NUMBER_OF_CALLEE_SAVES_REGISTERS.
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:187
> +#if USE(JSVALUE32_64)
> + UNUSED_PARAM(wasCalledViaTailCall);
> +#else
seems like this is dead given: NUMBER_OF_CALLEE_SAVES_REGISTERS

Maybe inside your:
#if NUMBER_OF_CALLEE_SAVES_REGISTERS > 0

you can add: static_assert(is64bit(), "")
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:326
> +void OSRExit::executeOSRExit(Context& context)
Should you set top call frame inside of here? I think that'd help the sampling profiler at least. Maybe other things need it too.
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:353
> + if (UNLIKELY(!exit.exitState)) {
suggestion: Let's cache the various calls of JITCode::dfg() here since it's a virtual function
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:366
> + // Compute the value recoveries.
> + Operands<ValueRecovery> operands;
> + codeBlock->jitCode()->dfg()->variableEventStream.reconstruct(codeBlock, exit.m_codeOrigin, codeBlock->jitCode()->dfg()->minifiedDFG, exit.m_streamIndex, operands);
This is an interesting tradeoff. I think it's the right one. However, I know we use the event stream to reconstruct on demand because storing the reconstructed data for each exit has a high memory cost. That said, we're only doing this here for *executed* exits. That's why I think this is ok and the right choice. But it's worth making sure we truly believe this is the best choice.
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:489
> + Structure* structure = jsValueFor(cpu, exit.m_jsValueSource).asCell()->structure(vm);
how do we know the incoming value is a cell?
> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:745
> + saveOrCopyCalleeSavesFor(context, baselineCodeBlock, static_cast<VirtualRegister>(inlineCallFrame->stackOffset), !trueCaller);
Style: This static cast here seems weird, why not just: VirtualRegister(inlineCallFrame->stackOffset) ?
Mark Lam
Comment 36
2017-09-07 16:44:54 PDT
Thanks for the review. I've addressed your feedback as follows. Will retest now before I land.

(In reply to Saam Barati from
comment #35
)
> Comment on
attachment 320151
[details]
> proposed patch (removed some debugging code).
>
> View in context:
>
https://bugs.webkit.org/attachment.cgi?id=320151&action=review
>
> r=me with comments
>
> > Source/JavaScriptCore/assembler/ProbeFrame.h:44
> > + T getArgument(int argument)
>
> style nit: here and below, you preface getters with get, but I don't think
> we do that elsewhere, e.g., Operands. What about just calling this:
> argument(...), and below, operand(...), etc.
Fixed.
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:80
> > + RegisterSet dontRestoreRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs());
>
> style nit: This same thing is used here and below, might be worth making it
> a function to call.
This is not a lot of code, and the original code that these functions are based on (AssemblyHelpers::restoreCalleeSavesFromVMEntryFrameCalleeSavesBuffer(), AssemblyHelpers::copyCalleeSavesToVMEntryFrameCalleeSavesBuffer(), and AssemblyHelpers::copyCalleeSavesToVMEntryFrameCalleeSavesBuffer()) also decided to just duplicate this line. I'll keep this as is for now to keep these functions consistent with the AssemblyHelpers ones they are based on. I can do a refactor later on to add a calleeSavesRegistersToOmitFromRestoration() and change all sites to use that.
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:135
> > + context.gpr(entry.reg().gpr()) = stack.get<uintptr_t>(calleeSaveBuffer, entry.offset());
> > + else
> > + context.gpr(entry.reg().gpr()) = stack.get<double>(calleeSaveBuffer, entry.offset());
>
> why do these gets need to use the stack API?
They don't need to be. I fixed it to read from the stack. There's also a bug here where the fpr case (the else case) is using gpr values. I've fixed that also.
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:152
> > + ASSERT(is64Bit());
>
> seems like this could be a static assert?
Fixed.
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:160
> > + if (entry.reg().isGPR())
> > + stack.set(calleeSaveBuffer, entry.offset(), context.gpr<uintptr_t>(entry.reg().gpr()));
> > + else
> > + stack.set(calleeSaveBuffer, entry.offset(), context.fpr<uintptr_t>(entry.reg().fpr()));
>
> I think it'd be nice to have context expose a register API so it does this
> branch for you. You also write the branch in a few places, but if context
> handled it, you could just do:
> stack.set(calleeSaveBuffer, entry.offset(), context.reg<uintptr_t>(entry.reg()));
There are 3 issues with your suggestion:

1. It's a layer violation.

Probe::Context is defined at the MacroAssembler level, and Reg (which we query isGPR() on) is defined at the JIT level. It'd be strange to have Probe::Context depend on Reg. We could move Reg into the MacroAssembler level, but that's more refactoring work.

2. Your suggestion only works for 64-bit where sizeof GPR and FPR are both 64-bit.

Yes, currently, only 64-bit uses callee saves, and hence, things should work out. But it's still kind of gross to rely on that.

3. The original AssemblyHelpers code that these functions are based on does it this way.

I'd rather keep this version of the code consistent with the AssemblyHelpers version.

For these reasons, I'd like to keep this code as is.
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:171
> > + RegisterSet dontSaveRegisters = RegisterSet(RegisterSet::stackRegisters(), RegisterSet::allFPRs());
>
> ditto about method abstracting this instead of copy-pasting.
Answered above.
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:176
> > +#if USE(JSVALUE64)
> > + RegisterSet baselineCalleeSaves = RegisterSet::llintBaselineCalleeSaveRegisters();
> > +#endif
>
> seems like this is guaranteed given NUMBER_OF_CALLEE_SAVES_REGISTERS.
Fixed. I removed the #if+#endif.
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:187
> > +#if USE(JSVALUE32_64)
> > + UNUSED_PARAM(wasCalledViaTailCall);
> > +#else
>
> seems like this is dead given: NUMBER_OF_CALLEE_SAVES_REGISTERS
Fixed.
> Maybe inside your:
> #if NUMBER_OF_CALLEE_SAVES_REGISTERS > 0
>
> you can add: static_assert(is64bit(), "")
Done.
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:326
> > +void OSRExit::executeOSRExit(Context& context)
>
> Should you set top call frame inside of here? I think that'd help the
> sampling profiler at least. Maybe other things need it too.
I was already setting it in adjustAndJumpToTarget() after the frame has been fully reconstructed. I think that would be the better time to set it.
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:353
> > + if (UNLIKELY(!exit.exitState)) {
>
> suggestion: Let's cache the various calls of JITCode::dfg() here since it's
> a virtual function
The DFG::JITCode* value is only used in 2 functions, and I can't avoid computing it at least once on every OSR exit in order to get the OSRExit record. So, I just cache it in a local and reuse that (didn't need to cache it in the exitState).
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:366
> > + // Compute the value recoveries.
> > + Operands<ValueRecovery> operands;
> > + codeBlock->jitCode()->dfg()->variableEventStream.reconstruct(codeBlock, exit.m_codeOrigin, codeBlock->jitCode()->dfg()->minifiedDFG, exit.m_streamIndex, operands);
>
> This is an interesting tradeoff. I think it's the right one. However, I know
> we use the event stream to reconstruct on demand because storing the
> reconstructed data for each exit has a high memory cost. That said, we're
> only doing this here for *executed* exits. That's why I think this is ok and
> the right choice. But it's worth making sure we truly believe this is the
> best choice.
This caching accounted for about half of the speed up on the int-or-other-neg-then-get-by-val microbenchmark (relative to my very first patch). It is also correct to cache these values and use them. The old jit based OSR Exit caches the value recovery data in the form of the emitted OSR off-ramp jit code. We're effectively doing the same thing (in spirit) here.
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:489
> > + Structure* structure = jsValueFor(cpu, exit.m_jsValueSource).asCell()->structure(vm);
>
> how do we know the incoming value is a cell?
We can only get here if exit.m_kind == BadCache || exit.m_kind == BadIndexingType. I presume that those exit kinds mean that the value is a cell. Anyway, I'm not introducing new logic here. This is exactly what the old JIT-based OSR exit code does in OSRExit::compileExit(): it loads the value and proceeds to load its structureID without first confirming that it's a cell.
> > Source/JavaScriptCore/dfg/DFGOSRExit.cpp:745
> > + saveOrCopyCalleeSavesFor(context, baselineCodeBlock, static_cast<VirtualRegister>(inlineCallFrame->stackOffset), !trueCaller);
>
> Style: This static cast here seems weird, why not just:
> VirtualRegister(inlineCallFrame->stackOffset)?
Fixed.
Mark Lam
Comment 37
2017-09-07 18:16:34 PDT
Thanks again for the review. Landed in r221774: <http://trac.webkit.org/r221774>.
Saam Barati
Comment 38
2017-09-08 16:22:05 PDT
Comment on attachment 320151 [details]
proposed patch (removed some debugging code).

View in context: https://bugs.webkit.org/attachment.cgi?id=320151&action=review
>>> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:135
>>> + context.gpr(entry.reg().gpr()) = stack.get<double>(calleeSaveBuffer, entry.offset());
>>
>> why do these gets need to use the stack API?
>
> They don't need to be. I fixed it to read from the stack.
>
> There's also a bug here where the fpr case (the else case) is using gpr values. I've fixed that also.
It's always fun to learn when we don't have tests for things :) Could be worth filing a bug saying this is the case.
>>> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:160
>>> + stack.set(calleeSaveBuffer, entry.offset(), context.fpr<uintptr_t>(entry.reg().fpr()));
>>
>> I think it'd be nice to have context expose a register API so it does this branch for you. You also write the branch in a few places, but if context handled it, you could just do:
>> stack.set(calleeSaveBuffer, entry.offset(), context.reg<uintptr_t>(entry.reg()));
>
> There are 3 issues with your suggestion:
>
> 1. It's a layer violation.
>
> Probe::Context is defined at the MacroAssembler level, and Reg (which we query isGPR() on) is defined at the JIT level. It'd be strange to have Probe::Context depend on Reg. We could move Reg into the MacroAssembler level, but that's more refactoring work.
>
> 2. Your suggestion only works for 64-bit where sizeof GPR and FPR are both 64-bit.
>
> Yes, currently, only 64-bit uses callee saves, and hence, things should work out. But it's still kind of gross to rely on that.
>
> 3. The original AssemblyHelpers code that these functions are based on does it this way.
>
> I'd rather keep this version of the code consistent with the AssemblyHelpers version.
>
> For these reasons, I'd like to keep this code as is.
I see. I didn't think about this layering violation. I think it'd be worth having a helper in this file alone. It could even be under the JSVALUE64 guard.
>>> Source/JavaScriptCore/dfg/DFGOSRExit.cpp:326
>>> +void OSRExit::executeOSRExit(Context& context)
>>
>> Should you set top call frame inside of here? I think that'd help the sampling profiler at least. Maybe other things need it too.
>
> I was already setting it in adjustAndJumpToTarget() after the frame has been fully reconstructed. I think that would be the better time to set it.
Let me restate: the way this code is written, the sampling profiler cannot say time is spent here. If something is OSR exiting a lot, there is a good chance time is spent at that bytecode operation. Your patch makes the sampling profiler not ascribe time to that bytecode anymore. That's unfortunate. To make this work, we'd need to:
1. Set the call site index in the exiting frame.
2. Set the top call frame.
I think it's definitely worth doing this. Maybe we can open a bug to do this in a follow-up.
Ryan Haddad
Comment 39
2017-09-09 09:50:20 PDT
(In reply to Mark Lam from comment #37)
> Thanks again for the review. Landed in r221774: <http://trac.webkit.org/r221774>.
This change introduced 3 JSC test timeouts. Details in https://bugs.webkit.org/show_bug.cgi?id=176630
Ryan Haddad
Comment 40
2017-09-09 09:54:47 PDT
Reverted r221774 for reason:

This change introduced three debug JSC test timeouts.

Committed r221823: <http://trac.webkit.org/changeset/221823>
Filip Pizlo
Comment 41
2017-09-09 15:10:41 PDT
(In reply to Ryan Haddad from comment #40)
> Reverted r221774 for reason:
>
> This change introduced three debug JSC test timeouts.
>
> Committed r221823: <http://trac.webkit.org/changeset/221823>
I think that we should skip tests in debug when they time out, instead of rolling out the change that caused it.
Mark Lam
Comment 42
2017-09-09 17:20:33 PDT
Created attachment 320367 [details]
patch for re-landing.
Mark Lam
Comment 43
2017-09-09 17:23:15 PDT
Re-landed in r221832: <http://trac.webkit.org/r221832>.
Mark Lam
Comment 44
2017-09-13 21:22:57 PDT
(In reply to Mark Lam from comment #43)
> Re-landed in r221832: <http://trac.webkit.org/r221832>.

This was rolled out in r222009: <http://trac.webkit.org/r222009> due to https://bugs.webkit.org/show_bug.cgi?id=176888.
Yusuke Suzuki
Comment 45
2017-09-14 03:47:28 PDT
(In reply to Mark Lam from comment #44)
> (In reply to Mark Lam from comment #43)
> > Re-landed in r221832: <http://trac.webkit.org/r221832>.
>
> This was rolled out in r222009: <http://trac.webkit.org/r222009> due to https://bugs.webkit.org/show_bug.cgi?id=176888.
It seems that the OSR exit patch caused a slight regression on Octane/gbemu. Reverting it gives a 4.3% recovery.
https://arewefastyet.com/#machine=29&view=single&suite=octane&subtest=Gameboy
Mark Lam
Comment 46
2017-09-14 10:12:56 PDT
*** Bug 176874 has been marked as a duplicate of this bug. ***