Bug 184062 - MSVC __forceinline slows down JSC release build fivefold after r229391
Summary: MSVC __forceinline slows down JSC release build fivefold after r229391
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Ross Kirsling
URL:
Keywords: InRadar
Depends on: 184128
Blocks:
  Show dependency treegraph
 
Reported: 2018-03-27 17:14 PDT by Ross Kirsling
Modified: 2018-03-29 19:47 PDT (History)
20 users (show)

See Also:


Attachments
Patch (1.20 KB, patch)
2018-03-27 17:17 PDT, Ross Kirsling
no flags Details | Formatted Diff | Diff
Patch (4.84 KB, patch)
2018-03-28 16:20 PDT, Ross Kirsling
no flags Details | Formatted Diff | Diff
Patch for landing (4.74 KB, patch)
2018-03-29 17:27 PDT, Ross Kirsling
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ross Kirsling 2018-03-27 17:14:41 PDT
After the TMP refactor in r229391 (https://bugs.webkit.org/show_bug.cgi?id=183263), the WinCairo Release build takes 50 minutes to compile JSC alone, compared to only 10 minutes prior to the patch.

This is an MSVC-specific issue -- their optimizer has trouble dealing with "too liberal a use of __forceinline".
(https://blogs.msdn.microsoft.com/vcblog/2018/01/04/visual-studio-2017-throughput-improvements-and-advice/)

We have reported the issue to Microsoft, but in the meantime it would seem wise to define ALWAYS_INLINE as simply "inline" on Windows, which brings the compile time back down to 10 minutes.
Comment 1 Ross Kirsling 2018-03-27 17:17:34 PDT
Created attachment 336631 [details]
Patch
Comment 2 JF Bastien 2018-03-27 21:32:05 PDT
That seems like a pretty wide net to cast to avoid the force-inline problem with the TMP code. Could you instead have a TMP_ALWAYS_INLINE or something like that which only applies to that code?

On the other hand, I don't know how much tuning has been done for the Windows build and inlining, so the heuristics we've applied may be totally wrong. Do we have developers who measure whether this changes pessimizes performance?
Comment 3 Ross Kirsling 2018-03-28 09:45:24 PDT
(In reply to JF Bastien from comment #2)
> That seems like a pretty wide net to cast to avoid the force-inline problem
> with the TMP code. Could you instead have a TMP_ALWAYS_INLINE or something
> like that which only applies to that code?

That might be a possibility. I guess there's a similar thing going on here:
https://github.com/WebKit/webkit/blob/master/Source/WebCore/platform/graphics/FormatConverter.cpp#L39-L43
(...actually I wonder if this one's still necessary in VS2017, hehe.)

> On the other hand, I don't know how much tuning has been done for the
> Windows build and inlining, so the heuristics we've applied may be totally
> wrong. Do we have developers who measure whether this changes pessimizes
> performance?

That would be great to know, especially as this should apply equally to AppleWin. I could look into performance test comparisons over here as well.
Comment 4 JF Bastien 2018-03-28 10:47:44 PDT
(In reply to Ross Kirsling from comment #3)
> (In reply to JF Bastien from comment #2)
> > That seems like a pretty wide net to cast to avoid the force-inline problem
> > with the TMP code. Could you instead have a TMP_ALWAYS_INLINE or something
> > like that which only applies to that code?
> 
> That might be a possibility. I guess there's a similar thing going on here:
> https://github.com/WebKit/webkit/blob/master/Source/WebCore/platform/
> graphics/FormatConverter.cpp#L39-L43
> (...actually I wonder if this one's still necessary in VS2017, hehe.)
> 
> > On the other hand, I don't know how much tuning has been done for the
> > Windows build and inlining, so the heuristics we've applied may be totally
> > wrong. Do we have developers who measure whether this changes pessimizes
> > performance?
> 
> That would be great to know, especially as this should apply equally to
> AppleWin. I could look into performance test comparisons over here as well.

Alex / Per / Brent are more informed about WebKit on Windows than I am, so I'll defer to them.
Comment 5 Alex Christensen 2018-03-28 10:52:29 PDT
Comment on attachment 336631 [details]
Patch

I agree with JF that this is too drastic a change.  I think we should instead do something like I needed to do in https://bugs.webkit.org/show_bug.cgi?id=125808 and only remove the __forceinline on only the problematic inlines.  There are a lot of instances of ALWAYS_INLINE that we probably want to keep.
Comment 6 Ross Kirsling 2018-03-28 11:01:08 PDT
(In reply to Alex Christensen from comment #5)
> Comment on attachment 336631 [details]
> Patch
> 
> I agree with JF that this is too drastic a change.  I think we should
> instead do something like I needed to do in
> https://bugs.webkit.org/show_bug.cgi?id=125808 and only remove the
> __forceinline on only the problematic inlines.  There are a lot of instances
> of ALWAYS_INLINE that we probably want to keep.

Sounds good. Just thought I'd kick off the conversation from the naive approach. :)
Comment 7 Ross Kirsling 2018-03-28 14:01:35 PDT
It appears that the issue is indeed quite localized and must have to do with MSVC getting overwhelmed by recursive templates in particular.

I've confirmed that:

1. Using non-forced inline for every `setupArgumentsImpl` overload results in a 9.5-minute build.
https://github.com/WebKit/webkit/blob/master/Source/JavaScriptCore/jit/CCallHelpers.h#L318-L464

2. Using non-forced inline for `marshallArgumentRegister` alone results in an 11-minute build.
https://github.com/WebKit/webkit/blob/master/Source/JavaScriptCore/jit/CCallHelpers.h#L299

These are both potentially acceptable -- does anyone have a clear preference?
Perhaps a recommendation of which test(s) would be most useful for perf impact?
Comment 8 Alex Christensen 2018-03-28 15:09:44 PDT
The least invasive change would be to make marshallArgumentRegister non-forced-inline on MSVC only.
Comment 9 Ross Kirsling 2018-03-28 15:31:59 PDT
(In reply to Alex Christensen from comment #8)
> The least invasive change would be to make marshallArgumentRegister
> non-forced-inline on MSVC only.

Sounds good.

FWIW, I also checked Sunspider benchmarks, which (unsurprisingly) support the same conclusion:

Collected 4 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between
sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific
preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95%
confidence intervals in milliseconds.

                          marshallArgumentRegister      setupArgumentsImpl

3d-cube                        17.1135+-1.2752     ?     17.4554+-1.8218        ? might be 1.0200x slower
3d-morph                       12.1973+-0.8931     ?     13.6648+-2.0364        ? might be 1.1203x slower
3d-raytrace                    23.5151+-2.4159     ?     25.1138+-7.1721        ? might be 1.0680x slower
access-binary-trees             8.3303+-0.5779     ?      8.7787+-2.4330        ? might be 1.0538x slower
access-fannkuch                14.8218+-2.2912     ?     15.4262+-7.4340        ? might be 1.0408x slower
access-nbody                   11.0309+-1.1147           10.1383+-0.2915          might be 1.0880x faster
access-nsieve                   6.6365+-0.8363            6.3742+-0.4838          might be 1.0411x faster
bitops-3bit-bits-in-byte        5.1899+-0.1658     ?      5.8102+-1.2525        ? might be 1.1195x slower
bitops-bits-in-byte             7.3887+-2.3269            6.6924+-0.1452          might be 1.1040x faster
bitops-bitwise-and              9.1475+-1.1505     ?      9.3076+-1.3356        ? might be 1.0175x slower
bitops-nsieve-bits              7.7549+-1.3521            7.7289+-0.6407
controlflow-recursive           7.2060+-0.4298     !      8.4953+-0.8057        ! definitely 1.1789x slower
crypto-aes                     19.2986+-8.2190           18.0129+-3.0947          might be 1.0714x faster
crypto-md5                      8.2644+-1.2072            7.7612+-1.1438          might be 1.0648x faster
crypto-sha1                     8.1423+-1.5990            7.9152+-0.6251          might be 1.0287x faster
date-format-tofte              35.1650+-0.6483     ?     36.1490+-2.2741        ? might be 1.0280x slower
date-format-xparb              21.6351+-2.0519           21.2189+-0.1581          might be 1.0196x faster
math-cordic                    11.2579+-2.4418            9.3053+-0.4511          might be 1.2098x faster
math-partial-sums              13.2822+-1.8823           13.0983+-2.1215          might be 1.0140x faster
math-spectral-norm              7.7226+-0.4115            7.6494+-0.2857
regexp-dna                     11.4542+-0.2199     ?     12.5857+-3.0776        ? might be 1.0988x slower
string-base64                  13.4223+-1.0573     ?     15.1614+-2.0637        ? might be 1.1296x slower
string-fasta                   20.6008+-2.9576     ?     21.6987+-6.0305        ? might be 1.0533x slower
string-tagcloud                32.4288+-0.5497     ?     34.3539+-3.1244        ? might be 1.0594x slower
string-unpack-code             84.0179+-8.3223     ?     90.4980+-27.9551       ? might be 1.0771x slower
string-validate-input          23.8311+-1.8061     ?     24.0583+-1.7425        ?

<arithmetic>                   16.9560+-0.5638     ?     17.4789+-0.6725        ? might be 1.0308x slower
Comment 10 Ross Kirsling 2018-03-28 16:20:42 PDT
Created attachment 336731 [details]
Patch
Comment 11 Ross Kirsling 2018-03-28 16:48:42 PDT
Additionally, it looks like that "C1063 Fatal Error" no longer occurs for FormatConverter.cpp, so we should be able to restore the ALWAYS_INLINE there in a follow-up patch.
Comment 12 Alex Christensen 2018-03-28 17:10:25 PDT
(In reply to Ross Kirsling from comment #11)
> Additionally, it looks like that "C1063 Fatal Error" no longer occurs for
> FormatConverter.cpp, so we should be able to restore the ALWAYS_INLINE there
> in a follow-up patch.
I've been meaning to look into whether we should just remove the ALWAYS_INLINE there.  It could make the binary significantly smaller while de-optiizing the outer loop, which would probably be a performance gain.
Comment 13 WebKit Commit Bot 2018-03-28 17:48:01 PDT
Comment on attachment 336731 [details]
Patch

Clearing flags on attachment: 336731

Committed r230062: <https://trac.webkit.org/changeset/230062>
Comment 14 WebKit Commit Bot 2018-03-28 17:48:03 PDT
All reviewed patches have been landed.  Closing bug.
Comment 15 Radar WebKit Bug Importer 2018-03-28 17:49:24 PDT
<rdar://problem/38980095>
Comment 16 Ryosuke Niwa 2018-03-29 00:33:08 PDT
Somehow this patch caused Mac port to break. We can't load apple.com now :(
Comment 17 WebKit Commit Bot 2018-03-29 00:33:14 PDT
Re-opened since this is blocked by bug 184128
Comment 18 Alex Christensen 2018-03-29 10:39:45 PDT
(In reply to Ryosuke Niwa from comment #16)
> Somehow this patch caused Mac port to break. We can't load apple.com now :(

This change makes no change on Mac.  You have almost certainly misdiagnosed
Comment 19 Ryosuke Niwa 2018-03-29 13:48:55 PDT
(In reply to Alex Christensen from comment #18)
> (In reply to Ryosuke Niwa from comment #16)
> > Somehow this patch caused Mac port to break. We can't load apple.com now :(
> 
> This change makes no change on Mac.  You have almost certainly misdiagnosed

You say that but Safari builds before this patch loads web pages file, and Safari builds after this patch cause web content process to instantly crash. The crash reproduces at r230069 but doesn't after the rollout in r230070.

So your assertion that this change makes no change on Mac is obviously wrong. It certainly affects Mac even though the code change doesn't seem like it should.
Comment 20 Ross Kirsling 2018-03-29 14:10:09 PDT
At r230075 (after the rollout, running Xcode 9.2 on Sierra), I can confirm that:
- `run-minibrowser` succeeds with or without the patch (both release and debug)
- the "play" button in Xcode fails with or without the patch (WebProcess crash on launch)

I'm not sure what's causing the issue in Xcode, but it's not this patch; I will plan to reland accordingly.
Comment 21 Alex Christensen 2018-03-29 14:10:55 PDT
Woah, let Ryosuke double check before relanding.
Comment 22 Alex Christensen 2018-03-29 14:12:29 PDT
(In reply to Ross Kirsling from comment #20)
> At r230075 (after the rollout, running Xcode 9.2 on Sierra), I can confirm
> that:
What about r230069 before the rollout?
Comment 23 Ross Kirsling 2018-03-29 14:15:50 PDT
(In reply to Alex Christensen from comment #21)
> Woah, let Ryosuke double check before relanding.

Whoops, okay. Standing by.

(In reply to Alex Christensen from comment #22)
> What about r230069 before the rollout?

Not sure. Builds are a half-hour apiece so I've been working on this since 9:15 AM.
Comment 24 Ross Kirsling 2018-03-29 14:24:11 PDT
(In reply to Ross Kirsling from comment #20)
> - `run-minibrowser` succeeds with or without the patch (both release and
> debug)

Er, to be more clear, there's no reproduction with `build-webkit [--debug]` followed by `run-minibrowser [--debug]` or `run-safari [--debug]` on the command line. I don't really understand what differs between this and the build-and-run from within Xcode.
Comment 25 Alex Christensen 2018-03-29 14:26:05 PDT
The definition of ALWAYS_INLINE depends on NDEBUG, so you'll need to do a release build.
Comment 26 Ross Kirsling 2018-03-29 14:27:50 PDT
(In reply to Alex Christensen from comment #25)
> The definition of ALWAYS_INLINE depends on NDEBUG, so you'll need to do a
> release build.

Sure, I did both across the board to be extra careful.
Comment 27 Ross Kirsling 2018-03-29 15:28:51 PDT
Alright, one more scenario just to make things perfectly unambiguous. :D

1. git checkout dd7a49d554b8583f61cfda3f0a42ecdf3ed5b6de # i.e. r230061, the revision immediately before this patch landed

2. git clean -ffdx # because why not

3. Open Xcode, Product > Clean

4. Hit the Play button

=> WebContent process crashes immediately.
Comment 28 Ross Kirsling 2018-03-29 16:12:35 PDT
(In reply to Ross Kirsling from comment #27)
> Alright, one more scenario just to make things perfectly unambiguous. :D
> 
> 1. git checkout dd7a49d554b8583f61cfda3f0a42ecdf3ed5b6de # i.e. r230061, the
> revision immediately before this patch landed
> 
> 2. git clean -ffdx # because why not
> 
> 3. Open Xcode, Product > Clean
> 
> 4. Hit the Play button
> 
> => WebContent process crashes immediately.

...or apparently this doesn't count as a thorough clean, from Xcode's perspective.

Evidently there's a file:
/Users/.../Library/Developer/Xcode/DerivedData/WebKit-.../Build/Products/Debug/WebKit.framework/Resources/com.apple.WebProcess.sb
To which my comment from this patch was added:
> // Sometimes needed due to MSVC optimizer sensitivities.
And that's syntactically invalid amid s-expressions.

This leaves me with many questions, but I'm guessing that there's a reason why only block comments are used in WTF/Compiler.h. :(
Comment 29 Alex Christensen 2018-03-29 16:14:23 PDT
Aha!  Change the comment style and resubmit!
Comment 30 Alex Christensen 2018-03-29 16:14:39 PDT
Actually, just remove the comment.
Comment 31 Ross Kirsling 2018-03-29 17:27:33 PDT
Created attachment 336823 [details]
Patch for landing
Comment 32 Alex Christensen 2018-03-29 17:33:55 PDT
(In reply to Alex Christensen from comment #18)
> (In reply to Ryosuke Niwa from comment #16)
> This change makes no change on Mac.  You have almost certainly misdiagnosed
For the record, you did not misdiagnose, Ryosuke.
Comment 33 Ross Kirsling 2018-03-29 17:48:48 PDT
(In reply to Alex Christensen from comment #32)
> (In reply to Alex Christensen from comment #18)
> > (In reply to Ryosuke Niwa from comment #16)
> > This change makes no change on Mac.  You have almost certainly misdiagnosed
> For the record, you did not misdiagnose, Ryosuke.

Indeed, what a trip. I've created https://bugs.webkit.org/show_bug.cgi?id=184166 to track the real bug here. :)
Comment 34 WebKit Commit Bot 2018-03-29 19:47:38 PDT
Comment on attachment 336823 [details]
Patch for landing

Clearing flags on attachment: 336823

Committed r230091: <https://trac.webkit.org/changeset/230091>
Comment 35 WebKit Commit Bot 2018-03-29 19:47:40 PDT
All reviewed patches have been landed.  Closing bug.