Bug 199929 - [JSC] Make DFG Local CSE and AI conservative for huge basic block
Summary: [JSC] Make DFG Local CSE and AI conservative for huge basic block
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2019-07-18 20:17 PDT by Yusuke Suzuki
Modified: 2019-07-22 16:07 PDT
CC List: 10 users

See Also:


Attachments
Patch (5.24 KB, patch)
2019-07-18 21:14 PDT, Yusuke Suzuki
no flags
Patch (3.94 KB, patch)
2019-07-18 21:16 PDT, Yusuke Suzuki
no flags
Patch (3.44 KB, patch)
2019-07-18 21:33 PDT, Yusuke Suzuki
no flags
Patch (3.94 KB, patch)
2019-07-18 21:58 PDT, Yusuke Suzuki
no flags
Patch (12.79 KB, patch)
2019-07-19 16:45 PDT, Yusuke Suzuki
no flags
Patch (12.91 KB, patch)
2019-07-19 19:03 PDT, Yusuke Suzuki
no flags
Patch (11.88 KB, patch)
2019-07-20 03:54 PDT, Yusuke Suzuki
no flags
Patch (11.91 KB, patch)
2019-07-20 03:55 PDT, Yusuke Suzuki
no flags
Patch (11.97 KB, patch)
2019-07-20 03:57 PDT, Yusuke Suzuki
fpizlo: review+
Patch for landing (14.71 KB, patch)
2019-07-22 16:02 PDT, Yusuke Suzuki
no flags

Description Yusuke Suzuki 2019-07-18 20:17:26 PDT
Precise structure transition tracking has quadratic complexity in N, where N is the number of nodes in a basic block.
We should give up on ridiculously large basic blocks.

I would also like to adjust the threshold for DFG optimization, but that is probably better done in a separate patch.
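For readers unfamiliar with the issue, here is a hedged, illustrative sketch (Python; all names invented, not JSC code) of why per-node clobbering over an ever-growing set of tracked facts is quadratic in the block size:

```python
# Illustrative model only -- not JSC code. Names are invented.
def analyze_block(nodes):
    """Walk a basic block, tracking one 'impure' fact per node visited.

    Each clobbering node scans every fact recorded so far, so a block
    of N clobbering nodes performs ~N*(N-1)/2 scans: O(N^2) overall.
    """
    facts = {}  # fact id -> value; grows as we visit nodes
    scans = 0
    for i, node in enumerate(nodes):
        if node == "clobber":
            # Precise invalidation: inspect every tracked fact.
            for key in list(facts):
                scans += 1
                # (a real analysis would drop only the affected facts)
        facts[i] = object()  # each node may add a new tracked fact
    return scans

# A block of 1000 clobbering nodes already needs ~500k scans.
print(analyze_block(["clobber"] * 1000))  # -> 499500
```

This is why the total time grows far faster than the block size: the CNN page's 43297-node block pays this cost on every pass that drives the analysis.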
Comment 1 Yusuke Suzuki 2019-07-18 21:03:40 PDT
<rdar://problem/49309924>
Comment 2 Yusuke Suzuki 2019-07-18 21:14:48 PDT
Created attachment 374438 [details]
Patch

WIP
Comment 3 Yusuke Suzuki 2019-07-18 21:16:34 PDT
Created attachment 374439 [details]
Patch

WIP
Comment 4 Yusuke Suzuki 2019-07-18 21:33:30 PDT
Created attachment 374441 [details]
Patch

WIP
Comment 5 Yusuke Suzuki 2019-07-18 21:58:00 PDT
Created attachment 374442 [details]
Patch

WIP
Comment 6 EWS Watchlist 2019-07-18 23:55:03 PDT
Comment on attachment 374442 [details]
Patch

Attachment 374442 [details] did not pass jsc-ews (mac):
Output: https://webkit-queues.webkit.org/results/12770865

New failing tests:
mozilla-tests.yaml/js1_5/Array/regress-101964.js.mozilla-no-ftl
apiTests
Comment 7 Yusuke Suzuki 2019-07-19 13:38:42 PDT
Discussed with Keith, and we now have an idea for reducing this quadratic complexity.
But I think reducing the threshold is also worthwhile anyway.

Note that the AI mitigation above seems to show some regression in Speedometer, while reducing the threshold shows some improvement in Speedometer.

I'll do the threshold-reducing change first; the current patch is invalid now anyway.
After that, I'll do the AI improvement. In many cases it would not produce a speedup, but it definitely prevents pathological cases.

So, for now, just reducing the threshold.
Comment 8 Yusuke Suzuki 2019-07-19 14:08:49 PDT
The maximum bytecode cost in JetStream2 is 36065.
Comment 9 Yusuke Suzuki 2019-07-19 14:35:30 PDT
The maximum bytecode cost in Speedometer2 is 5671.
Comment 10 Yusuke Suzuki 2019-07-19 14:48:16 PDT
BTW, the bytecode cost of CNN's problematic function is 67904; it is ridiculously large...
Comment 11 Yusuke Suzuki 2019-07-19 16:45:37 PDT
Created attachment 374524 [details]
Patch
Comment 12 Mark Lam 2019-07-19 16:59:24 PDT
Comment on attachment 374524 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374524&action=review

> Source/JavaScriptCore/ChangeLog:44
> +        So I think 60000 seems reasonable threshold while avoiding pathologically large functions (as we have a threshold for # of DFG nodes in FTL). While we avoid compiling incredibly large functions,
> +        I think this should be covered by introducing DFG tracelet compiler once it is introduced. And it is also possible that we can compile small hot functions while avoiding such an large function.

I suggest rephrasing "While we avoid compiling incredibly large functions, I think this should be covered by introducing DFG tracelet compiler once it is introduced." ...
...as "While we would like incredibly large functions to also be able to run fast, I think this should be covered later by introducing a DFG tracelet compiler to handle it."
Comment 13 Yusuke Suzuki 2019-07-19 17:08:27 PDT
I'll check whether it is still laggy with the O(1) approach on an iPhone 7. But for now, reducing the threshold is the safest option. I would like to implement the O(N) algorithm after this, and revisit this threshold later. If the O(N) algorithm removes the laggy behavior, we can increase this threshold again.
Comment 14 Yusuke Suzuki 2019-07-19 18:58:21 PDT
(In reply to Yusuke Suzuki from comment #13)
> I'll check whether it is laggy with O(1) thing in iPhone 7. But anyway, for
> now, reducing the threshold is the safest. I would like to implement O(N)
> algorithm after this, and I would like to revisit this threshold later. If
> O(N) algorithm removes the laggy behavior, we can increase this threshold
> again.

I checked, and it is still laggy unless we avoid DFG compilation with that threshold. I'll update the patch with an additional description and put it up for review.
Comment 15 Yusuke Suzuki 2019-07-19 19:01:21 PDT
Comment on attachment 374524 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374524&action=review

>> Source/JavaScriptCore/ChangeLog:44
>> +        I think this should be covered by introducing DFG tracelet compiler once it is introduced. And it is also possible that we can compile small hot functions while avoiding such an large function.
> 
> I suggest rephrasing "While we avoid compiling incredibly large functions, I think this should be covered by introducing DFG tracelet compiler once it is introduced." ...
> ...as "While we would like incredibly large functions to also be able to run fast, I think this should be covered later by introducing a DFG tracelet compiler to handle it."

Sounds nice. Fixed.
Comment 16 Yusuke Suzuki 2019-07-19 19:03:37 PDT
Created attachment 374536 [details]
Patch
Comment 17 Yusuke Suzuki 2019-07-19 19:13:26 PDT
Comment on attachment 374536 [details]
Patch

No, I missed one function in JetStream2. Adjusting the threshold now.
Comment 18 Yusuke Suzuki 2019-07-19 21:57:56 PDT
(In reply to Yusuke Suzuki from comment #17)
> Comment on attachment 374536 [details]
> Patch
> 
> No, I missed one function in JetStream2. Adjusting the threshold now.

Hmmm, it seems that JetStream2/octane-zlib has a function with a bytecode cost of 61949, and it is later compiled in the DFG.
JetStream2 also has larger functions (77617, 115144, 120248, 65567, 64881), but they never reach the DFG.
While octane-zlib's 61949 is really large, this function is critical for performance. The sad thing is that compiling this function in the DFG takes 354.434433 ms anyway, lol.
Comment 19 Yusuke Suzuki 2019-07-19 23:18:26 PDT
(In reply to Yusuke Suzuki from comment #18)
> While JetStream/octane-zlib's 61949 is really large, this is critical for
> performance. Sad thing is that anyway compiling this function in DFG takes
> 354.434433 ms, lol.

Maybe the lagging seen with the AI mitigation is not due to the DFG.
I would like to try the AI mitigation for now.
Comment 20 Yusuke Suzuki 2019-07-19 23:53:25 PDT
(In reply to Yusuke Suzuki from comment #19)
> Maybe, I think that the lagging with AI mitigation thing is not due to DFG.
> I would like to try with AI mitigation for now.

AI mitigation + CFAPhase safepoint insertion seems working well. Let's go with this direction.
Comment 21 Yusuke Suzuki 2019-07-19 23:56:24 PDT
(In reply to Yusuke Suzuki from comment #20)
> AI mitigation + CFAPhase safepoint insertion seems working well. Let's go
> with this direction.

Ah, hmm, not so much. Still, it is much better than the current ToT.
Comment 22 Yusuke Suzuki 2019-07-20 03:54:17 PDT
Created attachment 374544 [details]
Patch
Comment 23 Yusuke Suzuki 2019-07-20 03:55:48 PDT
Created attachment 374545 [details]
Patch
Comment 24 Yusuke Suzuki 2019-07-20 03:57:08 PDT
Created attachment 374546 [details]
Patch
Comment 25 Keith Miller 2019-07-20 10:43:09 PDT
(In reply to Yusuke Suzuki from comment #18)
> While JetStream/octane-zlib's 61949 is really large, this is critical for
> performance. Sad thing is that anyway compiling this function in DFG takes
> 354.434433 ms, lol.

Yikes!

I'm not sure if you already tried this, but have you considered making the bytecode cost equivalent to the old instruction size for any given instruction? I think this would just be bytecode cost = 1 + number of inline data entries + number of out-of-line metadata entries.
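Keith's suggested metric could be modeled roughly like this (a hypothetical sketch; `inline_entries` and `metadata_entries` are invented stand-ins for the real per-opcode layout, not actual JSC structures):

```python
# Hypothetical sketch of the suggested cost metric -- not JSC code.
def bytecode_cost(instruction):
    """Approximate an instruction's old 'size' as:
    1 (for the opcode itself)
    + number of inline operand/data entries
    + number of out-of-line metadata entries.
    """
    return (1
            + len(instruction["inline_entries"])
            + len(instruction["metadata_entries"]))

def function_cost(instructions):
    """A function's cost is the sum over its instructions."""
    return sum(bytecode_cost(op) for op in instructions)

ops = [
    # e.g. a simple move: opcode + two inline operands, no metadata
    {"inline_entries": ["dst", "src"], "metadata_entries": []},
    # e.g. a property access: opcode + three operands + one metadata entry
    {"inline_entries": ["dst", "base", "prop"], "metadata_entries": ["ic"]},
]
print(function_cost(ops))  # -> 3 + 5 = 8
```

As the later discussion notes, any whole-function cost metric like this still cannot distinguish one huge basic block from many small ones, which is why the landed fix keys off basic-block size instead.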
Comment 26 Geoffrey Garen 2019-07-20 17:32:44 PDT
Comment on attachment 374546 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review

> Source/JavaScriptCore/ChangeLog:13
> +        DFG compiler stops at GC safepoints, and they are inserted between DFG phases. So, if some of DFG phases take very long time, the main thread is blocked during that.
> +        As a result, the main thread is blocked due to this pathological compilation.

For phases that might take a long time, should we consider inserting GC safe points in the middle?

If it's not practical for a GC safe point to pause and continue a phase, could we insert a special GC safe point that canceled the compilation and rescheduled it for later?

GC happens regularly, so it's likely that any long compilation will eventually intersect GC. It's frustrating that compilation stops being concurrent when the concurrency would benefit you the most! :P
Comment 27 Yusuke Suzuki 2019-07-21 02:15:06 PDT
Comment on attachment 374546 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review

>> Source/JavaScriptCore/ChangeLog:13
>> +        As a result, the main thread is blocked due to this pathological compilation.
> 
> For phases that might take a long time, should we consider inserting GC safe points in the middle?
> 
> If it's not practical for a GC safe point to pause and continue a phase, could we insert a special GC safe point that canceled the compilation and rescheduled it for later?
> 
> GC happens regularly, so it's likely that any long compilation will eventually intersect GC. It's frustrating that compilation stops being concurrent when the concurrency would benefit you the most! :P

Yeah! This is a good point. I agree: I think we should eventually add GC safepoints between iterations inside DFG phases to offer fine-grained GC responsiveness too :)
For other scripts that take much longer to compile, this would give more chances to perform GC in a responsive manner. I guess it would be beneficial for the JetStream2/octane-zlib case (but I am not sure).

However, while this improves general responsiveness, it does not work well for this CNN page case :(
At first, I tried inserting a GC safepoint in the middle of the CFA phase, between iterations of the CFA analysis, because this is the phase that drives AI with iterations. However, IIRC, I could not get good responsiveness on the CNN page.
This is because this script has one incredibly large basic block (crazily, it has 43297 DFG nodes in one basic block!), and even executing the block-local analysis on this block once takes a significant amount of time because of the O(N^2) complexity.
For example, the DFGSpeculativeJIT machine code generation phase also takes a very long time because it drives AI, even though it has no iterations inside it.

Sadly, inserting GC safepoints between basic block analyses also does not solve the problem, because the problematic graph has one huge basic block, and almost all the time is consumed by that one block.
Inserting a GC safepoint every N DFG nodes does not work either, because processing a DFG node at the end of the basic block takes more time than processing one at the beginning (the cost is proportional to the number of preceding DFG nodes).

But I think that after solving this CNN page problem with this patch (avoiding the O(N^2) complexity), inserting GC safepoints would be more effective for the other time-consuming compilations :)

I think this bug is another good reason why we should have the global GC feature. If global GC is implemented, concurrent compilers can just use GC-managed values with write barriers, and they do not need GC safepoints: the GC can stop concurrent compilers at an arbitrary point by using `Thread::suspend()`.
Comment 28 Yusuke Suzuki 2019-07-21 02:29:50 PDT
(In reply to Keith Miller from comment #25)
> I'm not sure if you already tried this, but have you considered making the
> bytecode cost equivalent to the old instruction size for any given
> instruction? I think this would just be bytecode cost = 1 + number of
> inline data entries + number of out-of-line metadata entries.

Interesting! I didn't try it. Sadly, this blocking problem happens even in a WebKit revision with the old bytecode format (I tried that to confirm this is not a regression). It is possible that the old bytecode cost calculation would produce different values for octane-zlib and the CNN page.
However, since the bytecode cost represents the size of the whole graph, it does not capture how large each basic block is. The time-consuming operation on the CNN page is related to the size of one basic block, not the size of the whole graph: other functions with similarly large graphs can be compiled within 100-300 ms. So I think using the basic block size as a threshold is the more direct solution.

On the other hand, we may have a problem with this bytecode cost threshold, because we did not revisit this value or the inlining cost heuristics when we introduced the new bytecode format. It seems that a typical bytecode cost is 50-70% of the old bytecode cost. It is possible that we now start compiling incredibly large functions that were not selected before, blocking subsequent DFG compilation requests. Revising this threshold, or how we calculate the cost, could dramatically improve the JetStream2 / Speedometer2 scores.
Comment 29 Geoffrey Garen 2019-07-22 10:16:54 PDT
> I think this bug is another good reason why we should have the global GC
> feature. If the global GC is implemented, concurrent compilers just uses
> GC-managed values with write barriers and they do not need to have GC
> safepoints. GC can stop concurrent compilers at arbitrary point by using
> `Thread::suspend()`.

I like this idea!
Comment 30 Saam Barati 2019-07-22 13:01:12 PDT
Comment on attachment 374546 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review

> Source/JavaScriptCore/ChangeLog:40
> +        Local CSE's clobbering iterates all the impure heap values to remove the clobbered one. Since # of impure heap values tend to be proportional to # of DFG nodes we visited,
> +        each CSE for a basic block gets O(N^2) complexity. To avoid this, we introduce HugeMap. This has the same interface to LargeMap and SmallMap in CSE, but its clobbering
> +        implementation just clears the map completely.

Should we switch to using epochs? Would that help?

> Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4098
> +    // If it is too large, we conservatively clobber the structures.

"clobber the" => "clobber all the"

> Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4121
> +    // If it is too large, we conservatively clobber the structures.

ditto

> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:446
> +    // This is used only for huge basic block. Our usual CSE is quadratic complexity for # of DFG nodes in a basic block.

"huge basic block" => "huge basic blocks"

> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:447
> +    // This HugeMaps tells conservative results instead of avoiding O(N^2) way.

I would say something like:
"HugeMaps models results conservatively to avoid an O(N^2) algorithm."

Might also be worth saying how results are conservative.
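The HugeMap behavior described in the ChangeLog can be sketched as follows (illustrative Python with invented names; the real structures are C++ maps keyed by abstract heap locations):

```python
# Illustrative contrast between precise clobbering (cost proportional to
# map size) and the conservative O(1) clobbering described for HugeMap.
# Not JSC code; heap kinds and locations are invented examples.

class PreciseMap:
    """Precise: on clobber, drop only the entries for that heap kind."""
    def __init__(self):
        self.entries = {}  # (heap_kind, location) -> value node

    def add(self, heap_kind, location, value):
        self.entries[(heap_kind, location)] = value

    def clobber(self, heap_kind):
        # Scans every entry: cost grows with map size, hence O(N^2)
        # over a block whose node count and map size both grow with N.
        self.entries = {k: v for k, v in self.entries.items()
                        if k[0] != heap_kind}

    def lookup(self, heap_kind, location):
        return self.entries.get((heap_kind, location))

class HugeMap(PreciseMap):
    """Conservative: clobbering forgets everything, in O(1).
    This loses valid entries (so CSE finds fewer matches) but never
    keeps a stale one, which is the safe direction to be wrong in."""
    def clobber(self, heap_kind):
        self.entries.clear()

m = HugeMap()
m.add("NamedProperties", "o.x", "node1")
m.add("Butterfly", "a[0]", "node2")
m.clobber("NamedProperties")           # a precise map would keep a[0]
print(m.lookup("Butterfly", "a[0]"))   # -> None (conservative)
```

The "conservative" direction matters: forgetting a usable entry only costs a missed CSE opportunity, whereas keeping a clobbered entry would be a correctness bug.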
Comment 31 Yusuke Suzuki 2019-07-22 13:42:35 PDT
Comment on attachment 374546 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review

Thanks!

>> Source/JavaScriptCore/ChangeLog:40
>> +        implementation just clears the map completely.
> 
> Should we switch to using epochs? Would that help?

Discussed with Saam. It sounds nice; I'll track this in a separate patch.
https://bugs.webkit.org/show_bug.cgi?id=200014
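The epoch idea could look roughly like this (a hedged sketch with invented names, not the actual bug 200014 design): each heap kind records the epoch of its last clobber, and lookups ignore entries recorded before that epoch, making clobbering O(1) without discarding unrelated entries.

```python
# Hedged sketch of epoch-based clobbering -- illustrative only.
class EpochMap:
    def __init__(self):
        self.entries = {}        # (heap_kind, location) -> (epoch, value)
        self.clobber_epoch = {}  # heap_kind -> epoch of its last clobber
        self.epoch = 0

    def add(self, heap_kind, location, value):
        self.epoch += 1
        self.entries[(heap_kind, location)] = (self.epoch, value)

    def clobber(self, heap_kind):
        # O(1): just bump the clobber epoch for this heap kind.
        self.epoch += 1
        self.clobber_epoch[heap_kind] = self.epoch

    def lookup(self, heap_kind, location):
        hit = self.entries.get((heap_kind, location))
        if hit is None:
            return None
        recorded, value = hit
        # Entries recorded at or before the last clobber are stale.
        if recorded <= self.clobber_epoch.get(heap_kind, 0):
            return None
        return value

m = EpochMap()
m.add("NamedProperties", "o.x", "node1")
m.add("Butterfly", "a[0]", "node2")
m.clobber("NamedProperties")
print(m.lookup("NamedProperties", "o.x"))  # -> None (clobbered)
print(m.lookup("Butterfly", "a[0]"))       # -> node2 (preserved)
```

Unlike the clear-everything approach, this keeps unrelated entries alive while staying O(1) per clobber; the trade-off is that stale entries linger in the map until they are overwritten or the map is reset.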

>> Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4098
>> +    // If it is too large, we conservatively clobber the structures.
> 
> "clobber the" => "clobber all the"

Fixed.

>> Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4121
>> +    // If it is too large, we conservatively clobber the structures.
> 
> ditto

Fixed.

>> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:446
>> +    // This is used only for huge basic block. Our usual CSE is quadratic complexity for # of DFG nodes in a basic block.
> 
> "huge basic block" => "huge basic blocks"

Fixed.

>> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:447
>> +    // This HugeMaps tells conservative results instead of avoiding O(N^2) way.
> 
> I would say something like:
> "HugeMaps models results conservatively to avoid an O(N^2) algorithm."
> 
> Might also be worth saying how results are conservative.

Fixed.
Comment 32 Yusuke Suzuki 2019-07-22 14:30:48 PDT
I also added a configuration to run-jsc-stress-tests to run tests with maxDFGNodesInBasicBlockForPreciseAnalysis=small-value.
Comment 33 Yusuke Suzuki 2019-07-22 15:26:44 PDT
Comment on attachment 374546 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review

> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:171
> +            if (conservative)
> +                return heap.kind() == pair.key.heap().kind();

Ah, I think this needs to take m_abstractHeapStackMap's contents into account too. I'll fix this part to maintain m_debugImpureData; it does not affect the correctness of this algorithm change.
Comment 34 Yusuke Suzuki 2019-07-22 16:02:22 PDT
Created attachment 374648 [details]
Patch for landing
Comment 35 Yusuke Suzuki 2019-07-22 16:07:16 PDT
Committed r247703: <https://trac.webkit.org/changeset/247703>