Bug 90057 - llint regresses performance on some v8 benchmarks
Summary: llint regresses performance on some v8 benchmarks
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-06-27 02:58 PDT by Andy Wingo
Modified: 2012-07-11 08:33 PDT
CC List: 3 users

See Also:


Attachments

Description Andy Wingo 2012-06-27 02:58:00 PDT
The LLInt is faster than the classic interpreter, overall, but it is not faster on all benchmarks.

I configured a build with runtime heuristics and the classic interpreter, like this:

CPPFLAGS='-DENABLE_RUN_TIME_HEURISTICS=1 -DENABLE_LLINT=0 -DENABLE_CLASSIC_INTERPRETER=1' Tools/Scripts/build-jsc --gtk

I configured another build with runtime heuristics and the LLInt:

CPPFLAGS='-DENABLE_RUN_TIME_HEURISTICS=1 -DENABLE_LLINT=1 -DENABLE_CLASSIC_INTERPRETER=0' Tools/Scripts/build-jsc --gtk

For the first build (classic interpreter):

~/src/v8/benchmarks$ JSC_useJIT=false ~/src/WebKit/WebKitBuild/Release/Programs/jsc-3 run.js 
Richards: 969
DeltaBlue: 1004
Crypto: 762
RayTrace: 2821
EarleyBoyer: 3342
RegExp: 396
Splay: 4458
NavierStokes: 1248
----
Score (version 7): 1407

For the second build (LLInt):

~/src/v8/benchmarks$ JSC_useJIT=false ~/src/WebKit/WebKitBuild/Release/Programs/jsc-3 run.js 
Richards: 783
DeltaBlue: 713
Crypto: 1397
RayTrace: 2415
EarleyBoyer: 3344
RegExp: 389
Splay: 3936
NavierStokes: 2015
----
Score (version 7): 1449

You can see that the LLInt does really well on the math-intensive benchmarks (Crypto, NavierStokes) but not so well on the benchmarks that stress property access (DeltaBlue, Richards, Splay).
Comment 1 Andy Wingo 2012-06-27 03:09:36 PDT
I guess the reason here is obvious, that llint_op_get_by_id doesn't know all the tricks that tryCacheGetByID knows.  I assume this is just not implemented yet, though do let me know if you are against more cleverness there (perhaps due to icache concerns).
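
For context, the caching that llint_op_get_by_id already does is essentially a monomorphic self-property cache, roughly this shape (a simplified sketch with illustrative types, not the actual JSC code):

struct Structure;                  // stand-in for JSC::Structure
struct JSObject {
    Structure* structure;
    long long* propertyStorage;    // inline or out-of-line property storage
};

struct GetByIdSelfCache {
    Structure* expectedStructure;  // structure recorded by the slow path
    unsigned offset;               // property offset recorded by the slow path
};

inline bool tryGetByIdSelf(const GetByIdSelfCache& cache, JSObject* base, long long& result)
{
    if (base->structure != cache.expectedStructure)
        return false;              // miss: take the slow path, which may re-cache
    result = base->propertyStorage[cache.offset];
    return true;
}

The tricks in question are the cases beyond this one, for example when the property actually lives on a prototype.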
Comment 2 Filip Pizlo 2012-06-27 11:23:56 PDT
(In reply to comment #1)
> I guess the reason here is obvious, that llint_op_get_by_id doesn't know all the tricks that tryCacheGetByID knows.  I assume this is just not implemented yet, though do let me know if you are against more cleverness there (perhaps due to icache concerns).

This is bizarre.  Most of the tricks for polymorphism in the old interpreter are disabled (like proto, list, or chain accesses).

I agree that this is worth looking into.
Comment 3 Andy Wingo 2012-07-09 08:41:57 PDT
(In reply to comment #2)
> (In reply to comment #1)
> > I guess the reason here is obvious, that llint_op_get_by_id doesn't know all the tricks that tryCacheGetByID knows.  I assume this is just not implemented yet, though do let me know if you are against more cleverness there (perhaps due to icache concerns).
> 
> This is bizarre.  Most of the tricks for polymorphism in the old interpreter are disabled (like proto, list, or chain accesses).
> 
> I agree that this is worth looking into.

I made a first pass at looking into it, adding:

_llint_op_get_array_length:
    traceExecution()
    loadp CodeBlock[cfr], t1
    loadis 16[PB, PC, 8], t2                # operand 2: base
    loadp CodeBlock::m_globalData[t1], t1
    loadConstantOrVariableCell(t2, t0, .opGetArrayLengthMiss)
    loadp JSGlobalData::jsArrayClassInfo[t1], t2
    bpneq [t0], t2, .opGetArrayLengthMiss   # miss unless the base is a JSArray
    loadis 8[PB, PC, 8], t2                 # operand 1: dst
    loadp JSArray::m_storage[t0], t0
    loadi ArrayStorage::m_length[t0], t0
    # FIXME: array lengths >= 2**31?
    orp tagTypeNumber, t0                   # box the length as an int32 JSValue
    storep t0, [cfr, t2, 8]
    dispatch(9)
.opGetArrayLengthMiss:
    callSlowPath(_llint_slow_path_get_by_id)
    dispatch(9)

_llint_op_get_string_length:
    traceExecution()
    loadis 16[PB, PC, 8], t1                    # operand 2: base
    loadis 8[PB, PC, 8], t2                     # operand 1: dst
    loadConstantOrVariable(t1, t0)
    btpnz t0, tagMask, .opGetStringLengthMiss   # miss unless the base is a cell
    loadp JSCell::m_structure[t0], t1
    bpneq Structure::m_typeInfo + TypeInfo::m_type[t1], StringType, .opGetStringLengthMiss
    loadi JSString::m_length[t0], t0
    # FIXME: string lengths >= 2**31?
    orp tagTypeNumber, t0                       # box the length as an int32 JSValue
    storep t0, [cfr, t2, 8]
    dispatch(9)
.opGetStringLengthMiss:
    callSlowPath(_llint_slow_path_get_by_id)
    dispatch(9)

and adding a couple cases to llint_slow_path_get_by_id:

    if (baseValue.isCell()
        && isJSArray(baseValue)
        && ident == exec->propertyNames().length) {
        pc[0].u.opcode = bitwise_cast<void*>(&llint_op_get_array_length);
        goto done;
    }

    if (baseValue.isCell()
        && isJSString(baseValue)
        && ident == exec->propertyNames().length) {
        pc[0].u.opcode = bitwise_cast<void*>(&llint_op_get_string_length);
        goto done;
    }

But this was no help on V8, and actually had a negative effect on DeltaBlue:

Richards: 750
DeltaBlue: 668
Crypto: 1545
RayTrace: 2359
EarleyBoyer: 3334
RegExp: 388
Splay: 4273
NavierStokes: 1815
----
Score (version 7): 1439

I guess we're hitting the slow path all the time, and these checks slow it down.  Is the big win to detect flip-flopping and just move to get_by_id_generic?
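
By flip-flop detection I mean something of roughly this shape (a sketch only; the names are made up, not existing JSC code):

struct GetByIdCallSite {
    void* opcode;               // handler currently installed for this instruction
    unsigned missCount;         // how many times this site has re-cached
};

const unsigned maxRecaches = 2; // made-up threshold

inline void recordGetByIdMiss(GetByIdCallSite& site, void* cachingHandler, void* genericHandler)
{
    if (++site.missCount > maxRecaches) {
        site.opcode = genericHandler;   // flip-flopping: give up and go generic
        return;
    }
    site.opcode = cachingHandler;       // otherwise re-cache as usual
}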
Comment 4 Andy Wingo 2012-07-09 10:14:13 PDT
I instrumented the LLInt to record how property accesses were dispatched.  While running the V8 test suite, this is what I get:

372328822 : Total
232485832 : InlineCacheHit
66597673  : OutOfLineCacheHit
56115209  : NotCacheable
6326596   : PingPong
6195533   : Inline
4433881   : ArrayLength
132714    : OutOfLine
40557     : StringLength
827       : NotCell

The cache hits are the cases in which no slow path is called at all.  NotCacheable indicates that one of these conditions is false:

        slot.isCacheable()
        && slot.slotBase() == baseValue
        && slot.cachedPropertyType() == PropertySlot::Value
        && !structure->isUncacheableDictionary()
        && !structure->typeInfo().prohibitsPropertyCaching()

PingPong indicates a cache that was previously valid; my instrumentation only lets a ping-pong dispatch succeed on the second attempt.  Inline indicates a transition to the inline cache state, and OutOfLine a transition to the out-of-line cache state.  ArrayLength and StringLength indicate the transitions that would occur if those opcodes existed.  NotCell means the base value is not a cell.
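
The instrumentation was roughly of this shape (a sketch, not the actual patch; the counter names follow the table above):

#include <cstdio>

enum DispatchKind {
    InlineCacheHit, OutOfLineCacheHit, NotCacheable, PingPong,
    Inline, OutOfLine, ArrayLength, StringLength, NotCell, NumDispatchKinds
};

static unsigned long long dispatchCounts[NumDispatchKinds];

inline void countDispatch(DispatchKind kind)    // bumped from the get_by_id paths
{
    ++dispatchCounts[kind];
}

static void dumpDispatchCounts()                // e.g. registered with atexit()
{
    static const char* const names[NumDispatchKinds] = {
        "InlineCacheHit", "OutOfLineCacheHit", "NotCacheable", "PingPong",
        "Inline", "OutOfLine", "ArrayLength", "StringLength", "NotCell"
    };
    unsigned long long total = 0;
    for (int i = 0; i < NumDispatchKinds; ++i)
        total += dispatchCounts[i];
    std::printf("%llu : Total\n", total);
    for (int i = 0; i < NumDispatchKinds; ++i)
        std::printf("%llu : %s\n", dispatchCounts[i], names[i]);
}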

It seems the thing to work on is the NotCacheable set, to see what is in there.
Comment 5 Andy Wingo 2012-07-10 04:15:46 PDT
(In reply to comment #4)
> 372328822 : Total
> 232485832 : InlineCacheHit
> 66597673  : OutOfLineCacheHit
> 56115209  : NotCacheable

They're basically all base mismatches: slot.slotBase() != baseValue.  I'll see if adding a get_by_id_proto will help.
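
The extra checks a get_by_id_proto case needs are roughly these (a sketch with illustrative types and names, not the actual JSC code):

struct Structure;
struct JSObject {
    Structure* structure;
    long long* propertyStorage;
};

struct GetByIdProtoCache {
    Structure* baseStructure;        // structure of the base when we cached
    Structure* prototypeStructure;   // structure of the prototype (the slot base)
    JSObject* prototype;
    unsigned offset;
};

inline bool tryGetByIdProto(const GetByIdProtoCache& c, JSObject* base, long long& result)
{
    if (base->structure != c.baseStructure)
        return false;                // base shape changed
    if (c.prototype->structure != c.prototypeStructure)
        return false;                // prototype shape changed
    result = c.prototype->propertyStorage[c.offset];
    return true;
}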
Comment 6 Filip Pizlo 2012-07-10 12:14:01 PDT
(In reply to comment #5)
> (In reply to comment #4)
> > 372328822 : Total
> > 232485832 : InlineCacheHit
> > 66597673  : OutOfLineCacheHit
> > 56115209  : NotCacheable
> 
> They're basically all base mismatches: slot.slotBase() != baseValue.  I'll see if adding a get_by_id_proto will help.

Why not add a llint_op_get_by_id_stub like:

_llint_op_get_by_id_stub:
    traceExecution()
    loadis 16[PB, PC, 8], t0
    loadConstantOrVariableCell(t0, t1, .opGetByIdSlow)
    jmp something[PB, PC, 8]

_llint_op_get_by_id_stub_slow:
    callSlowPath(_llint_slow_path_get_by_id)
    dispatch(9)

and modify the DFGRepatch get_by_id/put_by_id patching code to be able to work with LLInt as well?  The DFG patching code is really powerful in that it is parameterized by the set of registers free, the set of registers that hold relevant info, etc.  So you'd just tell it that GPRInfo::regT1 holds the object base and all other registers (except PB, PC, cfr) are free.  Essentially you'll be telling it that you're in "registersFlushed" as the DFG calls it.  Then, just modify it so that (1) its fast path jump is replaced by an LLInt dispatch, (2) its patched calls (for getter accesses) save PB, PC appropriately for stack walking (instead of saving CodeOrigin as the DFG would have done), and (3) its slow path jump jumps to llint_op_get_by_id_stub_slow.

And voilà, you'll have full polymorphic caching in LLInt!  Though of course it will rely on the JIT being enabled.
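
Very roughly, the handoff from the LLInt call site to the shared stub builder would look like this (purely a sketch; none of these names are existing JSC APIs, they just show the shape of the idea):

struct Structure;

struct LLIntGetByIdStubSite {
    void** patchableJumpSlot;   // the operand slot the stub opcode jumps through
    void* slowPathEntry;        // address of _llint_op_get_by_id_stub_slow
    // Register contract for the generated stub: base object in regT1;
    // everything except PB, PC and cfr is scratch ("registersFlushed").
};

// Hypothetical entry point into the shared (DFGRepatch-style) stub builder;
// a real implementation would emit machine code with the JIT's assembler.
void* buildGetByIdAccessStub(const LLIntGetByIdStubSite& site, Structure* seenStructure);

inline void cacheGetByIdStubAtCallSite(LLIntGetByIdStubSite& site, Structure* seenStructure)
{
    *site.patchableJumpSlot = buildGetByIdAccessStub(site, seenStructure);
}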
Comment 7 Andy Wingo 2012-07-10 12:54:23 PDT
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > 372328822 : Total
> > > 232485832 : InlineCacheHit
> > > 66597673  : OutOfLineCacheHit
> > > 56115209  : NotCacheable
> > 
> > They're basically all base mismatches: slot.slotBase() != baseValue.  I'll see if adding a get_by_id_proto will help.
> 
> Why not add a llint_op_get_by_id_stub like:
> 
> _llint_op_get_by_id_stub:
>     traceExecution()
>     loadis 16[PB, PC, 8], t0
>     loadConstantOrVariableCell(t0, t1, .opGetByIdSlow)
>     jmp something[PB, PC, 8]
> 
> _llint_op_get_by_id_stub_slow:
>     callSlowPath(_llint_slow_path_get_by_id)
>     dispatch(9)
> 
> and modify the DFGRepatch get_by_id/put_by_id patching code to be able to work with LLInt as well?  The DFG patching code is really powerful in that it is parameterized by the set of registers free, the set of registers that hold relevant info, etc.  So you'd just tell it that GPRInfo::regT1 holds the object base and all other registers (except PB, PC, cfr) are free.  Essentially you'll be telling it that you're in "registersFlushed" as the DFG calls it.  Then, just modify it so that (1) its fast path jump is replaced by an LLInt dispatch, (2) its patched calls (for getter accesses) save PB, PC appropriately for stack walking (instead of saving CodeOrigin as the DFG would have done), and (3) its slow path jump jumps to llint_op_get_by_id_stub_slow.
> 
> And voilà, you'll have full polymorphic caching in LLInt!  Though of course it will rely on the JIT being enabled.

Very interesting!  I will look into this tomorrow.

FWIW, it seems that the biggest gain on the V8 tests is having an op_get_by_id_proto.  With that and some other strategies similar to what the classic interpreter does, I got it up to:

Richards: 858
DeltaBlue: 787
Crypto: 1573
RayTrace: 2557
EarleyBoyer: 3404
RegExp: 397
Splay: 4319
NavierStokes: 1968
----
Score (version 7): 1537

Not a startling improvement, but it is better for this one test.
Comment 8 Filip Pizlo 2012-07-10 13:19:19 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #5)
> > > (In reply to comment #4)
> > > > 372328822 : Total
> > > > 232485832 : InlineCacheHit
> > > > 66597673  : OutOfLineCacheHit
> > > > 56115209  : NotCacheable
> > > 
> > > They're basically all base mismatches: slot.slotBase() != baseValue.  I'll see if adding a get_by_id_proto will help.
> > 
> > Why not add a llint_op_get_by_id_stub like:
> > 
> > _llint_op_get_by_id_stub:
> >     traceExecution()
> >     loadis 16[PB, PC, 8], t0
> >     loadConstantOrVariableCell(t0, t1, .opGetByIdSlow)
> >     jmp something[PB, PC, 8]
> > 
> > _llint_op_get_by_id_stub_slow:
> >     callSlowPath(_llint_slow_path_get_by_id)
> >     dispatch(9)
> > 
> > and modify the DFGRepatch get_by_id/put_by_id patching code to be able to work with LLInt as well?  The DFG patching code is really powerful in that it is parameterized by the set of registers free, the set of registers that hold relevant info, etc.  So you'd just tell it that GPRInfo::regT1 holds the object base and all other registers (except PB, PC, cfr) are free.  Essentially you'll be telling it that you're in "registersFlushed" as the DFG calls it.  Then, just modify it so that (1) its fast path jump is replaced by an LLInt dispatch, (2) its patched calls (for getter accesses) save PB, PC appropriately for stack walking (instead of saving CodeOrigin as the DFG would have done), and (3) its slow path jump jumps to llint_op_get_by_id_stub_slow.
> > 
> > And voilà, you'll have full polymorphic caching in LLInt!  Though of course it will rely on the JIT being enabled.
> 
> Very interesting!  I will look into this tomorrow.
> 
> FWIW, it seems that the biggest gain on the V8 tests is having an op_get_by_id_proto.  With that and some other strategies similar to what the classic interpreter does, I got it up to:
> 
> Richards: 858
> DeltaBlue: 787
> Crypto: 1573
> RayTrace: 2557
> EarleyBoyer: 3404
> RegExp: 397
> Splay: 4319
> NavierStokes: 1968
> ----
> Score (version 7): 1537
> 
> Not a startling improvement, but it is better for this one test.

If it looks like a speedup and smells like a speedup, put a patch up for review!

I have no objection to having the LLInt do get_by_id_proto caching, particularly since it should be easy to maintain.  So we can do both DFG-based patching and your get_by_id_proto.
Comment 9 Andy Wingo 2012-07-11 08:33:24 PDT
(In reply to comment #8)

> If it looks like a speedup and smells like a speedup, put a patch up for review!

Bug 90982 :)

Cheers,

A