Bug 129212

Summary: Refine DFG+FTL inlining and compilation limits
Product: WebKit
Reporter: Filip Pizlo <fpizlo>
Component: JavaScriptCore
Assignee: Filip Pizlo <fpizlo>
Status: RESOLVED FIXED
Severity: Normal
Priority: P2
Version: 528+ (Nightly build)
Hardware: All
OS: All
CC: atrick, barraclough, ggaren, mark.lam, mhahnenberg, mmirman, msaboff, nrotem, oliver, sam
Bug Depends on:
Bug Blocks: 112840
Attachments:
the patch (mhahnenberg: review+)

Description Filip Pizlo 2014-02-22 11:08:59 PST
Patch forthcoming.
Comment 1 Filip Pizlo 2014-02-22 11:13:31 PST
Created attachment 224978 [details]
the patch
Comment 2 Mark Hahnenberg 2014-02-23 08:40:08 PST
Comment on attachment 224978 [details]
the patch

r=me
Comment 3 Geoffrey Garen 2014-02-23 09:05:26 PST
Comment on attachment 224978 [details]
the patch

View in context: https://bugs.webkit.org/attachment.cgi?id=224978&action=review

> Source/JavaScriptCore/bytecode/CodeBlock.cpp:2793
> +            dataLog("    Marking SABI because caller is too large.\n");

Don't you mean "Marking !SABI"?

> Source/JavaScriptCore/runtime/Options.h:188
> +    /* Maximum size of a caller for enabling inlining. This is purely to protect us */\
> +    /* super long compiles. */\

"from super long compiles"?
Comment 4 Geoffrey Garen 2014-02-23 09:17:36 PST
Comment on attachment 224978 [details]
the patch

GCC has a "caller too big" heuristic for inlining, and back when we compiled with GCC it caused very bad performance pathologies. A lot of critical functions in WebCore/JavaScriptCore are indeed big -- HTML parser, CSS parser, JavaScript parser, Interpreter::execute when it was written in C -- and the "caller too big" heuristic forced trivial functions like JSValue::isCell() and StringImpl::isAtomic() not to inline, even though inlining them is a pure win, reducing both code size and compile time.

When those pathologies present, they're also horrible to debug, since you can add an innocuous line of code somewhere in a function's slow path, and mysteriously make it 2X slower.

If "SABI" doesn't actually mean "should always be inlined", and instead means "should maybe inline depending on caller's size", then I think it needs a better name, and we need to introduce a new "SABI" that truly means "I'm so small that I should always inline".
Comment 5 Filip Pizlo 2014-02-23 10:00:10 PST
(In reply to comment #4)
> (From update of attachment 224978 [details])
> GCC has a "caller too big" heuristic for inlining, and back when we compiled with GCC it caused very bad performance pathologies. A lot of critical functions in WebCore/JavaScriptCore are indeed big -- HTML parser, CSS parser, JavaScript parser, Interpreter::execute when it was written in C -- and the "caller too big" heuristic forced trivial functions like JSValue::isCell() and StringImpl::isAtomic() not to inline, even though inlining them is a pure win, reducing both code size and compile time.
> 
> When those pathologies present, they're also horrible to debug, since you can add an innocuous line of code somewhere in a function's slow path, and mysteriously make it 2X slower.

I'm well aware of these performance pathologies, and GCC is not the only compiler that has had this issue in its history.  But horrible-to-debug performance pathologies are better than chain-crashing because LLVM ran out of memory.  If we ever suspect that the performance of a program is bad because of these heuristics, we should loosen the heuristics and see if we get a speed-up.

> 
> If "SABI" doesn't actually mean "should always be inlined", and instead means "should maybe inline depending on caller's size", then I think it needs a better name, and we need to introduce a new "SABI" that truly means "I'm so small that I should always inline".

I think you're confusing two different heuristics.

There is the too-big-to-inline-into heuristic, which has nothing to do with SABI.  And then there's SABI, which doesn't mean "I'm so small that I should always inline" - it means that we believe, based on current heuristics, that all of this function's callers either have already been compiled with the DFG/FTL and have inlined it, or will be compiled with the DFG/FTL and will inline it.  Notice that we mark !SABI if we see that you're being called from a too-big-to-inline-into function, because we can be pretty sure that such a caller will never inline anything.
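
To make the interaction concrete, here's a minimal sketch (illustrative names and values, not actual JSC code):

    // Sketch only -- hypothetical names; the real logic lives in CodeBlock
    // and the DFG's capability checks.
    struct CodeBlockInfo {
        unsigned instructionCount;   // bytecode size of this function
        bool shouldAlwaysBeInlined;  // the "SABI" bit
    };

    // Heuristic 1: a caller above this size never inlines anything.
    static const unsigned maximumInliningCallerSize = 10000; // illustrative

    // Heuristic 2: SABI is a prediction about the callee's future. Clear it
    // once we learn that some caller can never inline this function.
    void noticeCall(CodeBlockInfo& caller, CodeBlockInfo& callee)
    {
        if (caller.instructionCount > maximumInliningCallerSize)
            callee.shouldAlwaysBeInlined = false;
    }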
Comment 6 Filip Pizlo 2014-02-23 10:39:15 PST
(In reply to comment #3)
> (From update of attachment 224978 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=224978&action=review
> 
> > Source/JavaScriptCore/bytecode/CodeBlock.cpp:2793
> > +            dataLog("    Marking SABI because caller is too large.\n");
> 
> Don't you mean "Marking !SABI"?

Yes, fixed!  Changed to "Clearing SABI".

> 
> > Source/JavaScriptCore/runtime/Options.h:188
> > +    /* Maximum size of a caller for enabling inlining. This is purely to protect us */\
> > +    /* super long compiles. */\
> 
> "from super long compiles"?

Yes, fixed!

Thanks!
Comment 7 Filip Pizlo 2014-02-23 10:44:06 PST
Landed in http://trac.webkit.org/changeset/164558
Comment 8 Geoffrey Garen 2014-02-23 12:52:11 PST
> I'm well aware of these performance pathologies and GCC is not the only compiler that has had this issue in its history.  But, horrible-to-debug performance pathologies are better than chain-crashing because LLVM ran out of memory.

Won't maximumOptimizationCandidateInstructionCount protect us from chain-crashing? Assuming that a function is a net win to inline, inlining it will never make the difference between crashing and not.

> If we ever suspect that the performance of a program is bad because of these heuristics, we should loosen the heuristics and see if we get a speed-up.

Given that we both agree, based on lots of experience with existing programs and compilers, that real programs do suffer bad performance because of these heuristics, I don't think it makes sense to treat this problem as a theoretical one that might never happen, or one that needs more investigation.

Instead, I think we should implement the solution that LLVM implemented: If a function is small/trivial enough, we should allow it to inline always, even if its caller is large.

> I think you're confusing two different heuristics.

Yes, I guess I am.

(1) I propose that we have a real concept of "should always be inlined" that means "I should always be inlined because it is always net profitable to do so regardless of caller size".

(2) We have an existing concept named "should always be inlined" which means something more like "I predict that I will usually be inlined before optimizing me is necessary".
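
To keep the two concepts straight, a sketch with made-up names (not a proposed API):

    // Two distinct notions that currently share the name "should always be inlined".
    struct InliningBits {
        // (1) Proposed: inlining is profitable regardless of caller size,
        // e.g. the callee is estimated to be no bigger than a call site.
        bool mustInline;

        // (2) Existing "SABI": a prediction that every caller will have
        // inlined this function before it needs to optimize on its own.
        bool shouldAlwaysBeInlined;
    };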
Comment 9 Filip Pizlo 2014-02-23 13:48:57 PST
(In reply to comment #8)
> > I'm well aware of these performance pathologies and GCC is not the only compiler that has had this issue in its history.  But, horrible-to-debug performance pathologies are better than chain-crashing because LLVM ran out of memory.
> 
> Won't maximumOptimizationCandidateInstructionCount protect us from chain-crashing? Assuming that a function is net-win to inline, inlining it will never make the difference between crashing and not.

OK - before we proceed further with this thread, let's make some things clear:

- Inlining heuristics are not the longest pole in the tent right now.  Every time that I play with the thresholds, I see negligible performance differences.  The purpose of maximumInliningCallerSize is to protect us from a code size and compile time pathology that I saw on real JS code, and I set it to the smallest value that didn't adversely affect performance on the benchmarks.

- Your proposals require substantially more work than this 6KB patch.  It's possible that we'll do it at some later time, but it won't help us meet our current goals, because after this patch, inlining is no longer the longest pole in the tent.

- This 6KB patch is a net progression and doesn't hurt any benchmark that we have in our repertoire.

Now that I've gotten that out of the way, to answer your question about maximumOptimizationCandidateInstructionCount:

Currently, if the caller is small enough, we do inlining even if the callee is larger than the size of a call site.  That size threshold was derived from a grid search that optimized performance on a bunch of benchmarks, so it's clear that we definitely want to continue to allow inlining even when it increases code size.  But that of course also means that this code-size-increasing inlining must sometimes be disabled even when the caller was small enough to compile and the callee otherwise met the inliner's heuristics.  That's what maximumInliningCallerSize is for.

If we give the inliner the ability to handle must-inline functions (i.e. functions that are estimated to be smaller than a call site, so that inlining them cannot increase code size), we will still have to keep around a heuristic like maximumInliningCallerSize, though we would probably rename it to maximumCodeSizeIncreasingInliningCallerSize, and it would apply only to functions that are larger than a call site but are otherwise inlineable under our current rules.
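
A condensed sketch of that arrangement (illustrative names and values; not the actual JSC checks):

    // Sketch only. "estimatedCallsiteSize" is the estimated code size of
    // leaving the call as a real call.
    static const unsigned estimatedCallsiteSize = 10;                          // illustrative
    static const unsigned maximumCodeSizeIncreasingInliningCallerSize = 10000; // illustrative

    bool mayInline(unsigned callerSize, unsigned estimatedCalleeSize, bool calleeOtherwiseInlineable)
    {
        if (!calleeOtherwiseInlineable)
            return false;
        // "Must inline": inlining cannot increase code size, so caller size
        // is irrelevant.
        if (estimatedCalleeSize <= estimatedCallsiteSize)
            return true;
        // Code-size-increasing inlining: gate on caller size to bound code
        // size growth and compile time.
        return callerSize <= maximumCodeSizeIncreasingInliningCallerSize;
    }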

> 
> > If we ever suspect that the performance of a program is bad because of these heuristics, we should loosen the heuristics and see if we get a speed-up.
> 
> Given that we both agree, based on lots of experience with existing programs and compilers, that real programs do achieve bad performance because of these heuristics, I don't think it makes sense to treat this problem as a theoretical one that might never happen, or that needs more investigation.

It's true that if our inliner was a lot more sophisticated, then the maximumInliningCallerSize limit would either be completely unnecessary or it would go by a different name.  But for now, we need that limit, and we will probably continue to need that limit for the foreseeable future.

> 
> Instead, I think we should implement the solution that LLVM implemented: If a function is small/trivial enough, we should allow it to inline always, even if its caller is large.

Comparing LLVM's inliner and our inliner is like comparing apples and oranges.

- LLVM's inliner runs after LLVM has run much of its optimization pipeline on both the caller and the callee, so LLVM has a much better estimate of the callee's size.  By contrast, for a typical call site, at the time our inliner makes its decision we only see the callee's bytecode and have to estimate size from that.  Bytecode isn't a good starting point for making precise code size estimates.  This is probably one of the reasons why our current thresholds often allow inlining that increases code size - we just don't have a good way of predicting whether a code size increase will actually happen.

- LLVM has a detailed cost model for predicting the size and performance of each LLVM instruction.  It's true that after a lot of work we could probably derive such a cost model for DFG IR, but we haven't done it yet.

I can understand that it's tempting to argue that we should just pick up some heuristics from LLVM and drop them into the DFG, but we need to understand that LLVM's heuristics are architected around a radically different inliner - one that sees a hell of a lot more static information.  To just reuse LLVM's inlining heuristics we would need to:

        - Change our inliner to see the callee's optimized DFG IR as an input instead of using bytecode.  This would be a huge rewrite of the inliner and right now I can't think of a way of doing this without regressing memory use.
        - Add a cost model to DFG IR.  We currently have no such model.  For example, a GetById(Untyped:) is ~10-20x bigger than an ArithAdd(Int32:, Int32:, Unchecked), but right now there is no way to *know* this, and adding that knowledge is probably >3000 lines of code.  (A sketch of what such a table might look like follows below.)
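
For illustration only, such a table might look like this (entirely hypothetical node names and numbers; DFG IR has nothing like it today, which is the point):

    // Hypothetical per-node code-size table for DFG IR -- does not exist.
    enum class NodeKind { ArithAddInt32Unchecked, GetByIdUntyped /*, ... */ };

    unsigned estimatedMachineCodeBytes(NodeKind kind)
    {
        switch (kind) {
        case NodeKind::ArithAddInt32Unchecked:
            return 4;  // roughly one machine instruction
        case NodeKind::GetByIdUntyped:
            return 60; // ~10-20x bigger: type checks, load chains, slow path glue
        }
        return 0;
    }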

Of course, it's theoretically possible that you could build heuristics that are as good as LLVM's on average but that don't rely on seeing the optimized IR of the caller and callee or on having a super precise cost model.

But whatever you do, it'll be more work than this patch, and even if you did all of that work, you'd still keep most of the current heuristics as hard fall-backs.  For example, it's very profitable to have a quick "definitely don't inline this function" rule based on that function's bytecode size, since you can evaluate it in O(1) time and it filters out a lot of obvious no-nos.  Also, while it's true that eventually we might set maximumInliningCallerSize to be larger than maximumFTLCandidateInstructionCount or maximumOptimizationCandidateInstructionCount - effectively obviating the need for that heuristic - we're definitely not there yet, and it won't happen for a while.
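
That fast path amounts to something like this sketch (illustrative name and value):

    // Sketch: an O(1) early reject based only on the callee's bytecode
    // size, which is known without any analysis.
    bool worthConsideringForInlining(unsigned calleeBytecodeSize)
    {
        static const unsigned definitelyDontInlineSize = 1000; // illustrative
        if (calleeBytecodeSize > definitelyDontInlineSize)
            return false; // obvious no-no, rejected immediately
        return true;      // proceed to the detailed heuristics
    }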

> 
> > I think you're confusing two different heuristics.
> 
> Yes, I guess I am.
> 
> (1) I propose that we have a real concept of "should always be inlined" that means "I should always be inlined because it is always net profitable to do so regardless of caller size".

I usually call this "must inline".

> 
> (2) We have an existing concept named "should always be inlined" which means something more like "I predict that I will usually be inlined before optimizing me is necessary".

I would summarize my point as: on all of the benchmarks we track, from what I can tell based on profiling, inlining is good enough that the path of least resistance to getting further speed-ups probably involves leaving the inliner alone.  Inlining heuristics are a black art and you can always do better, and it's great that we have ideas for how to improve it.  But it shouldn't be a top priority.