Bug 47230 - Avoid inlining large and/or virtual functions in widely included header files
Summary: Avoid inlining large and/or virtual functions in widely included header files
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Other OS X 10.5
: P2 Normal
Assignee: James Robinson
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-10-05 16:58 PDT by James Robinson
Modified: 2010-10-08 19:00 PDT (History)
7 users (show)

See Also:


Attachments
Patch (66.79 KB, patch)
2010-10-05 17:07 PDT, James Robinson
no flags Details | Formatted Diff | Diff
applies to ToT (64.51 KB, patch)
2010-10-05 17:36 PDT, James Robinson
no flags Details | Formatted Diff | Diff
include WebCore.exp.in changes so this builds on mac (65.57 KB, patch)
2010-10-06 18:20 PDT, James Robinson
no flags Details | Formatted Diff | Diff
Patch (66.04 KB, patch)
2010-10-08 18:17 PDT, James Robinson
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description James Robinson 2010-10-05 16:58:37 PDT
Avoid inlining large and/or virtual functions in widely included header files
Comment 1 James Robinson 2010-10-05 17:07:54 PDT
Created attachment 69868 [details]
Patch
Comment 2 James Robinson 2010-10-05 17:12:12 PDT
Based on some work initially done by Elliot Glaysher from the Chrome team.

One particularly nasty anti-pattern I've seen a lot in our codebase is this:

Foo.h:
class Foo {
public:
    virtual ~Foo();
};

Bar.h:
class Bar : public Foo {
  HashSet<String> m_set;
};


With this sort of construct, the implicitly defined destructor for Foo is placed in the header and inlines the very large HashSet destructor into every object file that #includes Bar.h.  Since Bar::~Bar is virtual the compiler can't inline this code even if it wanted to.  Thanks to this we end up with 800+ copies of common destructors spread across object files across the system.  I'd like it if we had some sort of automated system to detect this antipattern, if possible.
Comment 3 James Robinson 2010-10-05 17:26:59 PDT
Comment on attachment 69868 [details]
Patch

Patch doesn't apply, will resolve and reupload.
Comment 4 Adam Barth 2010-10-05 17:32:01 PDT
Nice.
Comment 5 James Robinson 2010-10-05 17:36:20 PDT
Created attachment 69869 [details]
applies to ToT
Comment 6 Adam Barth 2010-10-05 17:47:52 PDT
Comment on attachment 69869 [details]
applies to ToT

View in context: https://bugs.webkit.org/attachment.cgi?id=69869&action=review

<3

> WebCore/dom/QualifiedName.h:63
> -    ~QualifiedName() { deref(); }
> +    ~QualifiedName();

You sure there's no perf hit from making this non-inline?

> WebCore/platform/network/HTTPHeaderMap.cpp:80
> +

Extra blank lines?
Comment 7 James Robinson 2010-10-05 18:09:12 PDT
(In reply to comment #6)
> (From update of attachment 69869 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=69869&action=review
> 
> <3
> 
> > WebCore/dom/QualifiedName.h:63
> > -    ~QualifiedName() { deref(); }
> > +    ~QualifiedName();
> 
> You sure there's no perf hit from making this non-inline?

QualifiedName::deref is not defined inline (either with this patch or without it) so the compiler can't do any useful inlining here anyway.

> 
> > WebCore/platform/network/HTTPHeaderMap.cpp:80
> > +
> 
> Extra blank lines?

Whoops!
Comment 8 Adam Barth 2010-10-05 18:12:58 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > (From update of attachment 69869 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=69869&action=review
> > 
> > <3
> > 
> > > WebCore/dom/QualifiedName.h:63
> > > -    ~QualifiedName() { deref(); }
> > > +    ~QualifiedName();
> > 
> > You sure there's no perf hit from making this non-inline?
> 
> QualifiedName::deref is not defined inline (either with this patch or without it) so the compiler can't do any useful inlining here anyway.

This doesn't add an extra jmp?  QualifiedNames get created and destroyed very, very often.
Comment 9 James Robinson 2010-10-05 18:22:52 PDT
The compiler is free to inline the body of QN::deref() into the destructor in QualifiedName.o since they are in the same translation unit.  External callers (i.e. those in different translation units) will still do one jump, either to ~QualifiedName or to deref.  I'll double-check this change in isolation on the html parser benchmark to be sure.
Comment 10 Eric Seidel (no email) 2010-10-05 18:48:23 PDT
Attachment 69869 [details] did not build on mac:
Build output: http://queues.webkit.org/results/4218089
Comment 11 Darin Adler 2010-10-06 08:26:23 PDT
(In reply to comment #9)
> The compiler is free to inline the body of QN::deref() into the destructor in QualifiedName.o since they are in the same translation unit.  External callers (i.e. those in different translation units) will still do one jump, either to ~QualifiedName or to deref.  I'll double-check this change in isolation on the html parser benchmark to be sure.

I’m pretty sure we inlined that because of a measurable speedup.
Comment 12 Darin Adler 2010-10-06 08:27:20 PDT
Comment on attachment 69869 [details]
applies to ToT

The added explicit keywords are great, but unrelated.

I think this goes overboard. Some of these inlined functions are more efficient than if they were compiled separately. Measuring the benefit by looking at object file sizes and not at end-user phenomena such as the total executable size or the speed of WebKit means you are seeing only part of the picture.

There’s a good chance this hurt performance, and I don’t see any evidence you measured performance.

It’s also inelegant to declare lots of empty destructors when the compiler will already do the right thing by default. There needs to be a significant benefit, because the explicit destructors make the classes more wordy, which hurts readability.
Comment 13 Darin Adler 2010-10-06 08:30:23 PDT
(In reply to comment #2)
> One particularly nasty anti-pattern I've seen a lot in our codebase is this:
> 
> Foo.h:
> class Foo {
> public:
>     virtual ~Foo();
> };
> 
> Bar.h:
> class Bar : public Foo {
>   HashSet<String> m_set;
> };
> 
> 
> With this sort of construct, the implicitly defined destructor for Foo is placed in the header and inlines the very large HashSet destructor into every object file that #includes Bar.h.  Since Bar::~Bar is virtual the compiler can't inline this code even if it wanted to.  Thanks to this we end up with 800+ copies of common destructors spread across object files across the system.  I'd like it if we had some sort of automated system to detect this antipattern, if possible.

But this is not an anti-pattern. It’s normal elegant C++ usage. However, if the compilers deal particularly poorly with it, then it might be OK to forbid it; that is a new development though, and not a well known anti-pattern.

Further, if Bar is a base class and others derive from it, then compiler *can* inline the code for the destructor in the destructors for those derived classes. So it’s wrong to say that “it’s virtual therefore can never be inlined”.

But leaving an explicit destructor out is not a request to inline. It’s just a normal use of the language. It’s unfortunate that the development tools are penalizing us for using the more-elegant, more-normal syntax for declaring a class that requires not additional special destruction code. Most classes do not require special destruction code, and it’s an anti-pattern to explicitly define empty constructors and destructors in all those cases.
Comment 14 Darin Adler 2010-10-06 10:48:12 PDT
Comment on attachment 69869 [details]
applies to ToT

View in context: https://bugs.webkit.org/attachment.cgi?id=69869&action=review

>>>> WebCore/dom/QualifiedName.h:63
>>>> +    ~QualifiedName();
>>> 
>>> You sure there's no perf hit from making this non-inline?
>> 
>> QualifiedName::deref is not defined inline (either with this patch or without it) so the compiler can't do any useful inlining here anyway.
> 
> This doesn't add an extra jmp?  QualifiedNames get created and destroyed very, very often.

Yes, this does add an extra level function call and return, which has some performance cost.
Comment 15 Darin Adler 2010-10-06 10:49:40 PDT
(In reply to comment #9)
> The compiler is free to inline the body of QN::deref() into the destructor in QualifiedName.o since they are in the same translation unit.  External callers (i.e. those in different translation units) will still do one jump, either to ~QualifiedName or to deref.  I'll double-check this change in isolation on the html parser benchmark to be sure.

Some compilers do this sort of optimization and others don’t. I am pretty sure that the Mac OS X version of WebKit does not invoke gcc at the optimization level where it inlines functions that are not marked inline.

So for this particular change there may be no performance difference for Chromium but there may be a slowdown on other platforms because of tool chain differences.
Comment 16 James Robinson 2010-10-06 15:13:24 PDT
(In reply to comment #13)
> (In reply to comment #2)
> > One particularly nasty anti-pattern I've seen a lot in our codebase is this:
> > 
> > Foo.h:
> > class Foo {
> > public:
> >     virtual ~Foo();
> > };
> > 
> > Bar.h:
> > class Bar : public Foo {
> >   HashSet<String> m_set;
> > };
> > 
> > 
> > With this sort of construct, the implicitly defined destructor for Foo is placed in the header and inlines the very large HashSet destructor into every object file that #includes Bar.h.  Since Bar::~Bar is virtual the compiler can't inline this code even if it wanted to.  Thanks to this we end up with 800+ copies of common destructors spread across object files across the system.  I'd like it if we had some sort of automated system to detect this antipattern, if possible.
> 
> But this is not an anti-pattern. It’s normal elegant C++ usage. However, if the compilers deal particularly poorly with it, then it might be OK to forbid it; that is a new development though, and not a well known anti-pattern.
> 
> Further, if Bar is a base class and others derive from it, then compiler *can* inline the code for the destructor in the destructors for those derived classes. So it’s wrong to say that “it’s virtual therefore can never be inlined”.

In WebKit our type hierarchies tend to be very wide but not very deep.  For example, there are lots of subclasses of Event but relatively few subclasses of subclasses of Event.

> 
> But leaving an explicit destructor out is not a request to inline. It’s just a normal use of the language. It’s unfortunate that the development tools are penalizing us for using the more-elegant, more-normal syntax for declaring a class that requires not additional special destruction code. Most classes do not require special destruction code, and it’s an anti-pattern to explicitly define empty constructors and destructors in all those cases.

The issue is where the definition of the destructor ends up (in the header or in the cpp).  We happen to have lots of very large destructors (thanks mostly to our WTF types having very large destructors defined in header files) and we tend to include our headers in many translation units.  This means the default behavior in this case is to put the definition of these large destructors into hundreds of object files.  It could be that this is fine in most cases (in fact the linker is good at discarding duplicate copies of the same function) but for WebKit it's causing us trouble since we can't actually link at all in some 32-bit toolchains.
Comment 17 James Robinson 2010-10-06 15:14:51 PDT
(In reply to comment #15)
> (In reply to comment #9)
> > The compiler is free to inline the body of QN::deref() into the destructor in QualifiedName.o since they are in the same translation unit.  External callers (i.e. those in different translation units) will still do one jump, either to ~QualifiedName or to deref.  I'll double-check this change in isolation on the html parser benchmark to be sure.
> 
> Some compilers do this sort of optimization and others don’t. I am pretty sure that the Mac OS X version of WebKit does not invoke gcc at the optimization level where it inlines functions that are not marked inline.
> 
> So for this particular change there may be no performance difference for Chromium but there may be a slowdown on other platforms because of tool chain differences.

I'm running performance tests on the Mac platform as well for this particular change to see if there's any difference (although I'm doing them in Release instead of Production as I can't figure out how to make the Production version work).  If there is a concern about ~QualifiedName in particular I'm happy to leave that change out, as it's a relatively small impact compared to the other changes
Comment 18 James Robinson 2010-10-06 15:26:25 PDT
(In reply to comment #12)
> (From update of attachment 69869 [details])
> The added explicit keywords are great, but unrelated.
> 
> I think this goes overboard. Some of these inlined functions are more efficient than if they were compiled separately. Measuring the benefit by looking at object file sizes and not at end-user phenomena such as the total executable size or the speed of WebKit means you are seeing only part of the picture.
> 

For context we are having serious issues in Chromium trying to successfully link on systems with a 32-bit toolchain without running out of virtual address space.  The biggest issue with linking is object file size and the largest source of object file bloat in the chromium link is WebCore.  We gave up on trying to link on Leopard a few months back since Snow Leopard provides a 64-bit toolchain, but there is no 64-bit toolchain available on windows and it's not clear when or if there ever will be.

Other ports may not be hitting hard limits yet, but the memory and time requirements are not fun for anyone.

I'm specifically making changes that have little or no impact on the final binary, only on the resource requirements in order to compile.  These are easy wins.

> There’s a good chance this hurt performance, and I don’t see any evidence you measured performance.

I've been looking at benchmarks as well as disassembly of the object files in question to verify that I'm not regressing performance.  If there was something to report I'd note it in the bug but I haven't seen anything change yet (nor would I expect to).

> 
> It’s also inelegant to declare lots of empty destructors when the compiler will already do the right thing by default. There needs to be a significant benefit, because the explicit destructors make the classes more wordy, which hurts readability.

Well, it's C++ - rarely accused of being overly elegant.
Comment 19 Darin Adler 2010-10-06 16:01:08 PDT
(In reply to comment #17)
> If there is a concern about ~QualifiedName in particular I'm happy to leave that change out, as it's a relatively small impact compared to the other changes

Which are the changes with the biggest impact? Can we do these changes in multiple smaller patches? That will be helpful if later we find they cause performance regressions, so we can pinpoint which one had that effect.
Comment 20 James Robinson 2010-10-06 17:01:47 PDT
(In reply to comment #19)
> (In reply to comment #17)
> > If there is a concern about ~QualifiedName in particular I'm happy to leave that change out, as it's a relatively small impact compared to the other changes
> 
> Which are the changes with the biggest impact? Can we do these changes in multiple smaller patches? That will be helpful if later we find they cause performance regressions, so we can pinpoint which one had that effect.

The most impactful changes are those to EventTarget and Element - moving those four virtual event handler functions from being defined in the header to being declared in the header and defined in the .cpp is a ~4.5mb improvement in debug object size.

I compared the performance with and without this patch on Safari with a release (not production) WebKit build on my slightly crappy core 2 duo Leopard MacBook Pro.  The results are actually a little surprising:

with patch:

avg 4016.6
stdev 65.28506720529589

avg 4280.1
stdev 228.23165862780738

avg 4021.25
stdev 60.233607728576246

avg 4025.05
stdev 70.40488264318037

avg 4015.05
stdev 71.23445444446108

avg 4007.95
stdev 63.84158127740885

overall avg (excluding outlier): 4017.7

without patch:

avg 4144.9
stdev 66.55741882014357

avg 4131.95
stdev 71.22041491033312

avg 4128.5
stdev 56.01383757608472

avg 4145.15
stdev 72.71126116359144

avg 4132.2
stdev 68.45845455456907

overall avg: 4136.54

diff: 188, % diff = 2.87


As you can see excluding one outlier with the patch (notice the stdev, something hiccuped on my box on that run) this patch is a nearly 3% speedup.  The speedup is quite consistent and well within the error range of this test, so I believe it is real.

I have to admit I don't fully understand why this is.  My best guess it that the linker on this system is not very good at dealing with the duplicate symbols and is either failing to discard the duplicates or (more likely) that the duplicates are hurting code locality.  I'll try to investigate further as I have time as it's pretty interesting that we can get such a speedup.  Maybe we can get more.

I don't know for certain if the production build settings will produce a similar speedup or not and I'm not sure how to test that locally (or even if I can).  Either way, though, I think it's pretty safe to say that this won't be a perf regression.
Comment 21 James Robinson 2010-10-06 18:20:45 PDT
Created attachment 70022 [details]
include WebCore.exp.in changes so this builds on mac
Comment 22 Adam Barth 2010-10-06 22:24:45 PDT
Comment on attachment 70022 [details]
include WebCore.exp.in changes so this builds on mac

I'm still in favor of this change.  It's hard to argue with a 3% speedup that also reduces code size and makes it possible to build the project on a 32 bit machine.
Comment 23 WebKit Commit Bot 2010-10-07 14:20:58 PDT
Comment on attachment 70022 [details]
include WebCore.exp.in changes so this builds on mac

Rejecting patch 70022 from commit-queue.

Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', 'apply-attachment', '--force-clean', '--non-interactive', '--quiet', 70022]" exit_code: 2
Cleaning working directory
Updating working directory
Logging in as commit-queue@webkit.org...
Fetching: https://bugs.webkit.org/attachment.cgi?id=70022&action=edit
Fetching: https://bugs.webkit.org/show_bug.cgi?id=47230&ctype=xml
Processing 1 patch from 1 bug.
Processing patch 70022 from bug 47230.
Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--reviewer', u'Adam Barth', u'--force']" exit_code: 1

Full output: http://queues.webkit.org/results/4216136
Comment 24 James Robinson 2010-10-08 18:17:31 PDT
Created attachment 70327 [details]
Patch
Comment 25 James Robinson 2010-10-08 18:18:14 PDT
Comment on attachment 70327 [details]
Patch

Merged up to ToT, gonna give the commit queue one more shot.
Comment 26 Darin Adler 2010-10-08 18:21:50 PDT
If I was making this change and had measured a speedup when most of the patch should be entirely performance-neutral, I would want to separate the part that definitely has no effect on performance from any part that might be speeding things up. If we find out what is making things faster, then perhaps we can take advantage of it elsewhere. There’s also the chance that part of the patch makes things faster and part makes it slower, in which case we want to keep the faster part.
Comment 27 WebKit Commit Bot 2010-10-08 19:00:18 PDT
Comment on attachment 70327 [details]
Patch

Clearing flags on attachment: 70327

Committed r69437: <http://trac.webkit.org/changeset/69437>
Comment 28 WebKit Commit Bot 2010-10-08 19:00:25 PDT
All reviewed patches have been landed.  Closing bug.