Bug 90057 - llint regresses performance on some v8 benchmarks
Summary: llint regresses performance on some v8 benchmarks
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-06-27 02:58 PDT by Andy Wingo
Modified: 2012-07-11 08:33 PDT
CC List: 3 users

See Also:


Attachments

Description Andy Wingo 2012-06-27 02:58:00 PDT
The LLInt is faster than the classic interpreter, overall, but it is not faster on all benchmarks.

I configured a build with runtime heuristics and the classic interpreter, like this:

CPPFLAGS='-DENABLE_RUN_TIME_HEURISTICS=1 -DENABLE_LLINT=0 -DENABLE_CLASSIC_INTERPRETER=1' Tools/Scripts/build-jsc --gtk

I configured another build with runtime heuristics and the LLInt:

CPPFLAGS='-DENABLE_RUN_TIME_HEURISTICS=1 -DENABLE_LLINT=1 -DENABLE_CLASSIC_INTERPRETER=0' Tools/Scripts/build-jsc --gtk

For the first build (classic interpreter):

~/src/v8/benchmarks$ JSC_useJIT=false ~/src/WebKit/WebKitBuild/Release/Programs/jsc-3 run.js 
Richards: 969
DeltaBlue: 1004
Crypto: 762
RayTrace: 2821
EarleyBoyer: 3342
RegExp: 396
Splay: 4458
NavierStokes: 1248
----
Score (version 7): 1407

For the second build (LLInt):

~/src/v8/benchmarks$ JSC_useJIT=false ~/src/WebKit/WebKitBuild/Release/Programs/jsc-3 run.js 
Richards: 783
DeltaBlue: 713
Crypto: 1397
RayTrace: 2415
EarleyBoyer: 3344
RegExp: 389
Splay: 3936
NavierStokes: 2015
----
Score (version 7): 1449

You can see that the LLInt does really well on the math-intensive benchmarks (Crypto, NavierStokes) but not so well on the benchmarks that stress property access (DeltaBlue, Richards, Splay).
Comment 1 Andy Wingo 2012-06-27 03:09:36 PDT
I guess the reason here is obvious, that llint_op_get_by_id doesn't know all the tricks that tryCacheGetByID knows.  I assume this is just not implemented yet, though do let me know if you are against more cleverness there (perhaps due to icache concerns).
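
For context, the caching that llint_op_get_by_id already does is essentially a monomorphic self-property cache, roughly this shape (a simplified sketch with illustrative types, not the actual JSC code):

struct Structure;                  // stand-in for JSC::Structure
struct JSObject {
    Structure* structure;
    long long* propertyStorage;    // inline or out-of-line property storage
};

struct GetByIdSelfCache {
    Structure* expectedStructure;  // structure recorded by the slow path
    unsigned offset;               // property offset recorded by the slow path
};

inline bool tryGetByIdSelf(const GetByIdSelfCache& cache, JSObject* base, long long& result)
{
    if (base->structure != cache.expectedStructure)
        return false;              // miss: take the slow path, which may re-cache
    result = base->propertyStorage[cache.offset];
    return true;
}

The tricks in question are the cases beyond this one, for example when the property actually lives on a prototype.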
Comment 2 Filip Pizlo 2012-06-27 11:23:56 PDT
(In reply to comment #1)
> I guess the reason here is obvious, that llint_op_get_by_id doesn't know all the tricks that tryCacheGetByID knows.  I assume this is just not implemented yet, though do let me know if you are against more cleverness there (perhaps due to icache concerns).

This is bizarre.  Most of the tricks for polymorphism in the old interpreter are disabled (like proto, list, or chain accesses).

I agree that this is worth looking into.
Comment 3 Andy Wingo 2012-07-09 08:41:57 PDT
(In reply to comment #2)
> (In reply to comment #1)
> > I guess the reason here is obvious, that llint_op_get_by_id doesn't know all the tricks that tryCacheGetByID knows.  I assume this is just not implemented yet, though do let me know if you are against more cleverness there (perhaps due to icache concerns).
> 
> This is bizarre.  Most of the tricks for polymorphism in the old interpreter are disabled (like proto, list, or chain accesses).
> 
> I agree that this is worth looking into.

I made a first pass at looking into it, adding:

_llint_op_get_array_length:
    traceExecution()
    loadp CodeBlock[cfr], t1
    loadis 16[PB, PC, 8], t2                # operand 2: base
    loadp CodeBlock::m_globalData[t1], t1
    loadConstantOrVariableCell(t2, t0, .opGetArrayLengthMiss)
    loadp JSGlobalData::jsArrayClassInfo[t1], t2
    bpneq [t0], t2, .opGetArrayLengthMiss   # miss unless the base is a JSArray
    loadis 8[PB, PC, 8], t2                 # operand 1: dst
    loadp JSArray::m_storage[t0], t0
    loadi ArrayStorage::m_length[t0], t0
    # FIXME: array lengths >= 2**31?
    orp tagTypeNumber, t0                   # box the length as an int32 JSValue
    storep t0, [cfr, t2, 8]
    dispatch(9)
.opGetArrayLengthMiss:
    callSlowPath(_llint_slow_path_get_by_id)
    dispatch(9)

_llint_op_get_string_length:
    traceExecution()
    loadis 16[PB, PC, 8], t1                    # operand 2: base
    loadis 8[PB, PC, 8], t2                     # operand 1: dst
    loadConstantOrVariable(t1, t0)
    btpnz t0, tagMask, .opGetStringLengthMiss   # miss unless the base is a cell
    loadp JSCell::m_structure[t0], t1
    bpneq Structure::m_typeInfo + TypeInfo::m_type[t1], StringType, .opGetStringLengthMiss
    loadi JSString::m_length[t0], t0
    # FIXME: string lengths >= 2**31?
    orp tagTypeNumber, t0                       # box the length as an int32 JSValue
    storep t0, [cfr, t2, 8]
    dispatch(9)
.opGetStringLengthMiss:
    callSlowPath(_llint_slow_path_get_by_id)
    dispatch(9)

and adding a couple cases to llint_slow_path_get_by_id:

    if (baseValue.isCell()
        && isJSArray(baseValue)
        && ident == exec->propertyNames().length) {
        pc[0].u.opcode = bitwise_cast<void*>(&llint_op_get_array_length);
        goto done;
    }

    if (baseValue.isCell()
        && isJSString(baseValue)
        && ident == exec->propertyNames().length) {
        pc[0].u.opcode = bitwise_cast<void*>(&llint_op_get_string_length);
        goto done;
    }

But this was no help on V8, and actually had a negative effect on DeltaBlue:

Richards: 750
DeltaBlue: 668
Crypto: 1545
RayTrace: 2359
EarleyBoyer: 3334
RegExp: 388
Splay: 4273
NavierStokes: 1815
----
Score (version 7): 1439

I guess we're hitting the slow path all the time, and these checks slow it down.  Is the big win to detect flip-flopping and just move to get_by_id_generic?
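
By flip-flop detection I mean something of roughly this shape (a sketch only; the names are made up, not existing JSC code):

struct GetByIdCallSite {
    void* opcode;               // handler currently installed for this instruction
    unsigned missCount;         // how many times this site has re-cached
};

const unsigned maxRecaches = 2; // made-up threshold

inline void recordGetByIdMiss(GetByIdCallSite& site, void* cachingHandler, void* genericHandler)
{
    if (++site.missCount > maxRecaches) {
        site.opcode = genericHandler;   // flip-flopping: give up and go generic
        return;
    }
    site.opcode = cachingHandler;       // otherwise re-cache as usual
}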
Comment 4 Andy Wingo 2012-07-09 10:14:13 PDT
I instrumented the LLInt to record how property accesses were dispatched.  While running the V8 test suite, this is what I get:

372328822 : Total
232485832 : InlineCacheHit
66597673  : OutOfLineCacheHit
56115209  : NotCacheable
6326596   : PingPong
6195533   : Inline
4433881   : ArrayLength
132714    : OutOfLine
40557     : StringLength
827       : NotCell

The cache hits are the cases in which no slow path is called at all.  NotCacheable indicates that one of these conditions is false:

        slot.isCacheable()
        && slot.slotBase() == baseValue
        && slot.cachedPropertyType() == PropertySlot::Value
        && !structure->isUncacheableDictionary()
        && !structure->typeInfo().prohibitsPropertyCaching()

PingPong indicates a cache that was previously valid; my instrumentation only lets a ping-pong dispatch succeed on the second attempt.  Inline indicates a transition to the inline cache state, and OutOfLine a transition to the out-of-line cache state.  ArrayLength and StringLength indicate the transitions that would occur if those opcodes existed.  NotCell means the base value is not a cell.
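
The instrumentation was roughly of this shape (a sketch, not the actual patch; the counter names follow the table above):

#include <cstdio>

enum DispatchKind {
    InlineCacheHit, OutOfLineCacheHit, NotCacheable, PingPong,
    Inline, OutOfLine, ArrayLength, StringLength, NotCell, NumDispatchKinds
};

static unsigned long long dispatchCounts[NumDispatchKinds];

inline void countDispatch(DispatchKind kind)    // bumped from the get_by_id paths
{
    ++dispatchCounts[kind];
}

static void dumpDispatchCounts()                // e.g. registered with atexit()
{
    static const char* const names[NumDispatchKinds] = {
        "InlineCacheHit", "OutOfLineCacheHit", "NotCacheable", "PingPong",
        "Inline", "OutOfLine", "ArrayLength", "StringLength", "NotCell"
    };
    unsigned long long total = 0;
    for (int i = 0; i < NumDispatchKinds; ++i)
        total += dispatchCounts[i];
    std::printf("%llu : Total\n", total);
    for (int i = 0; i < NumDispatchKinds; ++i)
        std::printf("%llu : %s\n", dispatchCounts[i], names[i]);
}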

It seems the thing to work on is the NotCacheable set, to see what is in there.
Comment 5 Andy Wingo 2012-07-10 04:15:46 PDT
(In reply to comment #4)
> 372328822 : Total
> 232485832 : InlineCacheHit
> 66597673  : OutOfLineCacheHit
> 56115209  : NotCacheable

They're basically all base mismatches: slot.slotBase() != baseValue.  I'll see if adding a get_by_id_proto will help.
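
The extra checks a get_by_id_proto case needs are roughly these (a sketch with illustrative types and names, not the actual JSC code):

struct Structure;
struct JSObject {
    Structure* structure;
    long long* propertyStorage;
};

struct GetByIdProtoCache {
    Structure* baseStructure;        // structure of the base when we cached
    Structure* prototypeStructure;   // structure of the prototype (the slot base)
    JSObject* prototype;
    unsigned offset;
};

inline bool tryGetByIdProto(const GetByIdProtoCache& c, JSObject* base, long long& result)
{
    if (base->structure != c.baseStructure)
        return false;                // base shape changed
    if (c.prototype->structure != c.prototypeStructure)
        return false;                // prototype shape changed
    result = c.prototype->propertyStorage[c.offset];
    return true;
}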
Comment 6 Filip Pizlo 2012-07-10 12:14:01 PDT
(In reply to comment #5)
> (In reply to comment #4)
> > 372328822 : Total
> > 232485832 : InlineCacheHit
> > 66597673  : OutOfLineCacheHit
> > 56115209  : NotCacheable
> 
> They're basically all base mismatches: slot.slotBase() != baseValue.  I'll see if adding a get_by_id_proto will help.

Why not add a llint_op_get_by_id_stub like:

_llint_op_get_by_id_stub:
    traceExecution()
    loadis 16[PB, PC, 8], t0
    loadConstantOrVariableCell(t0, t1, .opGetByIdSlow)
    jmp something[PB, PC, 8]

_llint_op_get_by_id_stub_slow:
    callSlowPath(_llint_slow_path_get_by_id)
    dispatch(9)

and modify the DFGRepatch get_by_id/put_by_id patching code to be able to work with LLInt as well?  The DFG patching code is really powerful in that it is parameterized by the set of registers free, the set of registers that hold relevant info, etc.  So you'd just tell it that GPRInfo::regT1 holds the object base and all other registers (except PB, PC, cfr) are free.  Essentially you'll be telling it that you're in "registersFlushed" as the DFG calls it.  Then, just modify it so that (1) its fast path jump is replaced by an LLInt dispatch, (2) its patched calls (for getter accesses) save PB, PC appropriately for stack walking (instead of saving CodeOrigin as the DFG would have done), and (3) its slow path jump jumps to llint_op_get_by_id_stub_slow.

And voilà, you'll have full polymorphic caching in LLInt!  Though of course it will rely on the JIT being enabled.
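
Very roughly, the handoff from the LLInt call site to the shared stub builder would look like this (purely a sketch; none of these names are existing JSC APIs, they just show the shape of the idea):

struct Structure;

struct LLIntGetByIdStubSite {
    void** patchableJumpSlot;   // the operand slot the stub opcode jumps through
    void* slowPathEntry;        // address of _llint_op_get_by_id_stub_slow
    // Register contract for the generated stub: base object in regT1;
    // everything except PB, PC and cfr is scratch ("registersFlushed").
};

// Hypothetical entry point into the shared (DFGRepatch-style) stub builder;
// a real implementation would emit machine code with the JIT's assembler.
void* buildGetByIdAccessStub(const LLIntGetByIdStubSite& site, Structure* seenStructure);

inline void cacheGetByIdStubAtCallSite(LLIntGetByIdStubSite& site, Structure* seenStructure)
{
    *site.patchableJumpSlot = buildGetByIdAccessStub(site, seenStructure);
}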
Comment 7 Andy Wingo 2012-07-10 12:54:23 PDT
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > 372328822 : Total
> > > 232485832 : InlineCacheHit
> > > 66597673  : OutOfLineCacheHit
> > > 56115209  : NotCacheable
> > 
> > They're basically all base mismatches: slot.slotBase() != baseValue.  I'll see if adding a get_by_id_proto will help.
> 
> Why not add a llint_op_get_by_id_stub like:
> 
> _llint_op_get_by_id_stub:
>     traceExecution()
>     loadis 16[PB, PC, 8], t0
>     loadConstantOrVariableCell(t0, t1, .opGetByIdSlow)
>     jmp something[PB, PC, 8]
> 
> _llint_op_get_by_id_stub_slow:
>     callSlowPath(_llint_slow_path_get_by_id)
>     dispatch(9)
> 
> and modify the DFGRepatch get_by_id/put_by_id patching code to be able to work with LLInt as well?  The DFG patching code is really powerful in that it is parameterized by the set of registers free, the set of registers that hold relevant info, etc.  So you'd just tell it that GPRInfo::regT1 holds the object base and all other registers (except PB, PC, cfr) are free.  Essentially you'll be telling it that you're in "registersFlushed" as the DFG calls it.  Then, just modify it so that (1) its fast path jump is replaced by an LLInt dispatch, (2) its patched calls (for getter accesses) save PB, PC appropriately for stack walking (instead of saving CodeOrigin as the DFG would have done), and (3) its slow path jump jumps to llint_op_get_by_id_stub_slow.
> 
> And voilà, you'll have full polymorphic caching in LLInt!  Though of course it will rely on the JIT being enabled.

Very interesting!  I will look into this tomorrow.

FWIW, it seems that the biggest gain on the V8 tests is having an op_get_by_id_proto.  With that and some other strategies similar to what the classic interpreter does, I got it up to:

Richards: 858
DeltaBlue: 787
Crypto: 1573
RayTrace: 2557
EarleyBoyer: 3404
RegExp: 397
Splay: 4319
NavierStokes: 1968
----
Score (version 7): 1537

Not a startling improvement, but it is better for this one test.
Comment 8 Filip Pizlo 2012-07-10 13:19:19 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #5)
> > > (In reply to comment #4)
> > > > 372328822 : Total
> > > > 232485832 : InlineCacheHit
> > > > 66597673  : OutOfLineCacheHit
> > > > 56115209  : NotCacheable
> > > 
> > > They're basically all base mismatches: slot.slotBase() != baseValue.  I'll see if adding a get_by_id_proto will help.
> > 
> > Why not add a llint_op_get_by_id_stub like:
> > 
> > _llint_op_get_by_id_stub:
> >     traceExecution()
> >     loadis 16[PB, PC, 8], t0
> >     loadConstantOrVariableCell(t0, t1, .opGetByIdSlow)
> >     jmp something[PB, PC, 8]
> > 
> > _llint_op_get_by_id_stub_slow:
> >     callSlowPath(_llint_slow_path_get_by_id)
> >     dispatch(9)
> > 
> > and modify the DFGRepatch get_by_id/put_by_id patching code to be able to work with LLInt as well?  The DFG patching code is really powerful in that it is parameterized by the set of registers free, the set of registers that hold relevant info, etc.  So you'd just tell it that GPRInfo::regT1 holds the object base and all other registers (except PB, PC, cfr) are free.  Essentially you'll be telling it that you're in "registersFlushed" as the DFG calls it.  Then, just modify it so that (1) its fast path jump is replaced by an LLInt dispatch, (2) its patched calls (for getter accesses) save PB, PC appropriately for stack walking (instead of saving CodeOrigin as the DFG would have done), and (3) its slow path jump jumps to llint_op_get_by_id_stub_slow.
> > 
> > And voilà, you'll have full polymorphic caching in LLInt!  Though of course it will rely on the JIT being enabled.
> 
> Very interesting!  I will look into this tomorrow.
> 
> FWIW, it seems that the biggest gain on the V8 tests is having an op_get_by_id_proto.  With that and some other strategies similar to what the classic interpreter does, I got it up to:
> 
> Richards: 858
> DeltaBlue: 787
> Crypto: 1573
> RayTrace: 2557
> EarleyBoyer: 3404
> RegExp: 397
> Splay: 4319
> NavierStokes: 1968
> ----
> Score (version 7): 1537
> 
> Not a startling improvement, but it is better for this one test.

If it looks like a speedup and smells like a speedup, put a patch up for review!

I have no objection to having the LLInt do get_by_id_proto caching, particularly since it should be easy to maintain.  So we can do both DFG-based patching and your get_by_id_proto.
Comment 9 Andy Wingo 2012-07-11 08:33:24 PDT
(In reply to comment #8)

> If it looks like a speedup and smells like a speedup, put a patch up for review!

Bug 90982 :)

Cheers,

A