Bug 199929 - [JSC] Make DFG Local CSE and AI conservative for huge basic block
Summary: [JSC] Make DFG Local CSE and AI conservative for huge basic block
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2019-07-18 20:17 PDT by Yusuke Suzuki
Modified: 2019-07-22 16:07 PDT
CC List: 10 users

See Also:


Attachments
Patch (5.24 KB, patch)
2019-07-18 21:14 PDT, Yusuke Suzuki
no flags
Patch (3.94 KB, patch)
2019-07-18 21:16 PDT, Yusuke Suzuki
no flags
Patch (3.44 KB, patch)
2019-07-18 21:33 PDT, Yusuke Suzuki
no flags
Patch (3.94 KB, patch)
2019-07-18 21:58 PDT, Yusuke Suzuki
no flags
Patch (12.79 KB, patch)
2019-07-19 16:45 PDT, Yusuke Suzuki
no flags
Patch (12.91 KB, patch)
2019-07-19 19:03 PDT, Yusuke Suzuki
no flags
Patch (11.88 KB, patch)
2019-07-20 03:54 PDT, Yusuke Suzuki
no flags
Patch (11.91 KB, patch)
2019-07-20 03:55 PDT, Yusuke Suzuki
no flags
Patch (11.97 KB, patch)
2019-07-20 03:57 PDT, Yusuke Suzuki
fpizlo: review+
Patch for landing (14.71 KB, patch)
2019-07-22 16:02 PDT, Yusuke Suzuki
no flags

Description Yusuke Suzuki 2019-07-18 20:17:26 PDT
Precise structure transition tracking has quadratic complexity in N, where N is the number of nodes in a basic block.
We should give up on ridiculously large basic blocks.

I would also like to adjust the threshold for DFG optimization, but that is probably better done in a separate patch.
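For readers unfamiliar with the issue, here is a hedged, illustrative sketch (Python; all names invented, not JSC code) of why per-node clobbering over an ever-growing set of tracked facts is quadratic in the block size:

```python
# Illustrative model only -- not JSC code. Names are invented.
def analyze_block(nodes):
    """Walk a basic block, tracking one 'impure' fact per node visited.

    Each clobbering node scans every fact recorded so far, so a block
    of N clobbering nodes performs ~N*(N-1)/2 scans: O(N^2) overall.
    """
    facts = {}  # fact id -> value; grows as we visit nodes
    scans = 0
    for i, node in enumerate(nodes):
        if node == "clobber":
            # Precise invalidation: inspect every tracked fact.
            for key in list(facts):
                scans += 1
                # (a real analysis would drop only the affected facts)
        facts[i] = object()  # each node may add a new tracked fact
    return scans

# A block of 1000 clobbering nodes already needs ~500k scans.
print(analyze_block(["clobber"] * 1000))  # -> 499500
```

This is why the total time grows far faster than the block size: the CNN page's 43297-node block pays this cost on every pass that drives the analysis.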
Comment 1 Yusuke Suzuki 2019-07-18 21:03:40 PDT
<rdar://problem/49309924>
Comment 2 Yusuke Suzuki 2019-07-18 21:14:48 PDT
Created attachment 374438 [details]
Patch

WIP
Comment 3 Yusuke Suzuki 2019-07-18 21:16:34 PDT
Created attachment 374439 [details]
Patch

WIP
Comment 4 Yusuke Suzuki 2019-07-18 21:33:30 PDT
Created attachment 374441 [details]
Patch

WIP
Comment 5 Yusuke Suzuki 2019-07-18 21:58:00 PDT
Created attachment 374442 [details]
Patch

WIP
Comment 6 EWS Watchlist 2019-07-18 23:55:03 PDT
Comment on attachment 374442 [details]
Patch

Attachment 374442 [details] did not pass jsc-ews (mac):
Output: https://webkit-queues.webkit.org/results/12770865

New failing tests:
mozilla-tests.yaml/js1_5/Array/regress-101964.js.mozilla-no-ftl
apiTests
Comment 7 Yusuke Suzuki 2019-07-19 13:38:42 PDT
Discussed with Keith, and we now have an idea for reducing this quadratic complexity.
But I think reducing the threshold is also worthwhile anyway.

Note that the AI mitigation above seems to show some regression in Speedometer, while reducing the threshold shows some improvement in Speedometer.

I'll do the threshold-reducing change first; the current patch is invalid now anyway.
After that, I'll do the AI improvement. In many cases it would not produce a speedup, but it definitely prevents pathological cases.

So, for now, just reducing the threshold.
Comment 8 Yusuke Suzuki 2019-07-19 14:08:49 PDT
The maximum bytecode cost in JetStream2 is 36065.
Comment 9 Yusuke Suzuki 2019-07-19 14:35:30 PDT
The maximum bytecode cost in Speedometer2 is 5671.
Comment 10 Yusuke Suzuki 2019-07-19 14:48:16 PDT
BTW, the bytecode cost of CNN's problematic function is 67904; it is ridiculously large...
Comment 11 Yusuke Suzuki 2019-07-19 16:45:37 PDT
Created attachment 374524 [details]
Patch
Comment 12 Mark Lam 2019-07-19 16:59:24 PDT
Comment on attachment 374524 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374524&action=review

> Source/JavaScriptCore/ChangeLog:44
> +        So I think 60000 seems reasonable threshold while avoiding pathologically large functions (as we have a threshold for # of DFG nodes in FTL). While we avoid compiling incredibly large functions,
> +        I think this should be covered by introducing DFG tracelet compiler once it is introduced. And it is also possible that we can compile small hot functions while avoiding such an large function.

I suggest rephrasing "While we avoid compiling incredibly large functions, I think this should be covered by introducing DFG tracelet compiler once it is introduced." ...
...as "While we would like incredibly large functions to also be able to run fast, I think this should be covered later by introducing a DFG tracelet compiler to handle it."
Comment 13 Yusuke Suzuki 2019-07-19 17:08:27 PDT
I'll check whether it is still laggy with the O(1) approach on an iPhone 7. But for now, reducing the threshold is the safest option. I would like to implement the O(N) algorithm after this, and revisit this threshold later. If the O(N) algorithm removes the laggy behavior, we can increase this threshold again.
Comment 14 Yusuke Suzuki 2019-07-19 18:58:21 PDT
(In reply to Yusuke Suzuki from comment #13)
> I'll check whether it is laggy with O(1) thing in iPhone 7. But anyway, for
> now, reducing the threshold is the safest. I would like to implement O(N)
> algorithm after this, and I would like to revisit this threshold later. If
> O(N) algorithm removes the laggy behavior, we can increase this threshold
> again.

I checked, and it is still laggy unless we avoid DFG compilation with that threshold. I'll update the patch with an additional description and put it up for review.
Comment 15 Yusuke Suzuki 2019-07-19 19:01:21 PDT
Comment on attachment 374524 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374524&action=review

>> Source/JavaScriptCore/ChangeLog:44
>> +        I think this should be covered by introducing DFG tracelet compiler once it is introduced. And it is also possible that we can compile small hot functions while avoiding such an large function.
> 
> I suggest rephrasing "While we avoid compiling incredibly large functions, I think this should be covered by introducing DFG tracelet compiler once it is introduced." ...
> ...as "While we would like incredibly large functions to also be able to run fast, I think this should be covered later by introducing a DFG tracelet compiler to handle it."

Sounds nice. Fixed.
Comment 16 Yusuke Suzuki 2019-07-19 19:03:37 PDT
Created attachment 374536 [details]
Patch
Comment 17 Yusuke Suzuki 2019-07-19 19:13:26 PDT
Comment on attachment 374536 [details]
Patch

No, I missed one function in JetStream2. Adjusting the threshold now.
Comment 18 Yusuke Suzuki 2019-07-19 21:57:56 PDT
(In reply to Yusuke Suzuki from comment #17)
> Comment on attachment 374536 [details]
> Patch
> 
> No, I missed one function in JetStream2. Adjusting the threshold now.

Hmmm, it seems that JetStream2/octane-zlib has a function with a bytecode cost of 61949, and it is later compiled in the DFG.
JetStream2 also has larger functions (77617, 115144, 120248, 65567, 64881), but they never reach the DFG.
While octane-zlib's 61949 is really large, this function is critical for performance. The sad thing is that compiling this function in the DFG takes 354.434433 ms anyway, lol.
Comment 19 Yusuke Suzuki 2019-07-19 23:18:26 PDT
(In reply to Yusuke Suzuki from comment #18)
> While JetStream/octane-zlib's 61949 is really large, this is critical for
> performance. Sad thing is that anyway compiling this function in DFG takes
> 354.434433 ms, lol.

Maybe the lagging seen with the AI mitigation is not due to the DFG.
I would like to try the AI mitigation for now.
Comment 20 Yusuke Suzuki 2019-07-19 23:53:25 PDT
(In reply to Yusuke Suzuki from comment #19)
> Maybe, I think that the lagging with AI mitigation thing is not due to DFG.
> I would like to try with AI mitigation for now.

AI mitigation + CFAPhase safepoint insertion seems working well. Let's go with this direction.
Comment 21 Yusuke Suzuki 2019-07-19 23:56:24 PDT
(In reply to Yusuke Suzuki from comment #20)
> AI mitigation + CFAPhase safepoint insertion seems working well. Let's go
> with this direction.

Ah, hmm, not so much. Still, it is much better than the current ToT.
Comment 22 Yusuke Suzuki 2019-07-20 03:54:17 PDT
Created attachment 374544 [details]
Patch
Comment 23 Yusuke Suzuki 2019-07-20 03:55:48 PDT
Created attachment 374545 [details]
Patch
Comment 24 Yusuke Suzuki 2019-07-20 03:57:08 PDT
Created attachment 374546 [details]
Patch
Comment 25 Keith Miller 2019-07-20 10:43:09 PDT
(In reply to Yusuke Suzuki from comment #18)
> While JetStream/octane-zlib's 61949 is really large, this is critical for
> performance. Sad thing is that anyway compiling this function in DFG takes
> 354.434433 ms, lol.

Yikes!

I'm not sure if you already tried this, but have you considered making the bytecode cost equivalent to the old instruction size for any given instruction? I think this would just be bytecode cost = 1 + number of inline data entries + number of out-of-line metadata entries.
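Keith's suggested metric could be modeled roughly like this (a hypothetical sketch; `inline_entries` and `metadata_entries` are invented stand-ins for the real per-opcode layout, not actual JSC structures):

```python
# Hypothetical sketch of the suggested cost metric -- not JSC code.
def bytecode_cost(instruction):
    """Approximate an instruction's old 'size' as:
    1 (for the opcode itself)
    + number of inline operand/data entries
    + number of out-of-line metadata entries.
    """
    return (1
            + len(instruction["inline_entries"])
            + len(instruction["metadata_entries"]))

def function_cost(instructions):
    """A function's cost is the sum over its instructions."""
    return sum(bytecode_cost(op) for op in instructions)

ops = [
    # e.g. a simple move: opcode + two inline operands, no metadata
    {"inline_entries": ["dst", "src"], "metadata_entries": []},
    # e.g. a property access: opcode + three operands + one metadata entry
    {"inline_entries": ["dst", "base", "prop"], "metadata_entries": ["ic"]},
]
print(function_cost(ops))  # -> 3 + 5 = 8
```

As the later discussion notes, any whole-function cost metric like this still cannot distinguish one huge basic block from many small ones, which is why the landed fix keys off basic-block size instead.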
Comment 26 Geoffrey Garen 2019-07-20 17:32:44 PDT
Comment on attachment 374546 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review

> Source/JavaScriptCore/ChangeLog:13
> +        DFG compiler stops at GC safepoints, and they are inserted between DFG phases. So, if some of DFG phases take very long time, the main thread is blocked during that.
> +        As a result, the main thread is blocked due to this pathological compilation.

For phases that might take a long time, should we consider inserting GC safe points in the middle?

If it's not practical for a GC safe point to pause and continue a phase, could we insert a special GC safe point that canceled the compilation and rescheduled it for later?

GC happens regularly, so it's likely that any long compilation will eventually intersect GC. It's frustrating that compilation stops being concurrent when the concurrency would benefit you the most! :P
Comment 27 Yusuke Suzuki 2019-07-21 02:15:06 PDT
Comment on attachment 374546 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review

>> Source/JavaScriptCore/ChangeLog:13
>> +        As a result, the main thread is blocked due to this pathological compilation.
> 
> For phases that might take a long time, should we consider inserting GC safe points in the middle?
> 
> If it's not practical for a GC safe point to pause and continue a phase, could we insert a special GC safe point that canceled the compilation and rescheduled it for later?
> 
> GC happens regularly, so it's likely that any long compilation will eventually intersect GC. It's frustrating that compilation stops being concurrent when the concurrency would benefit you the most! :P

Yeah! This is a good point. I agree: I think we should eventually add GC safepoints between iterations inside DFG phases to offer fine-grained GC responsiveness too :)
For other scripts that take much longer to compile, this would give more chances to perform GC in a responsive manner. I guess it would be beneficial for the JetStream2/octane-zlib case (but I am not sure).

However, while this improves general responsiveness, it does not work well for this CNN page case :(
At first, I tried inserting a GC safepoint in the middle of the CFA phase, between iterations of the CFA analysis, because this is the phase that drives AI with iterations. However, IIRC, I could not get good responsiveness on the CNN page.
This is because this script has one incredibly large basic block (crazily, it has 43297 DFG nodes in one basic block!), and even executing the block-local analysis on this block once takes a significant amount of time because of the O(N^2) complexity.
For example, the DFGSpeculativeJIT machine code generation phase also takes a very long time because it drives AI, even though it has no iterations inside it.

Sadly, inserting GC safepoints between basic block analyses also does not solve the problem, because the problematic graph has one huge basic block, and almost all the time is consumed by that one block.
Inserting a GC safepoint every N DFG nodes does not work either, because processing a DFG node at the end of the basic block takes more time than processing one at the beginning (the cost is proportional to the number of preceding DFG nodes).

But I think that after solving this CNN page problem with this patch (avoiding the O(N^2) complexity), inserting GC safepoints would be more effective for the other time-consuming compilations :)

I think this bug is another good reason why we should have the global GC feature. If global GC is implemented, concurrent compilers can just use GC-managed values with write barriers, and they do not need GC safepoints: the GC can stop concurrent compilers at an arbitrary point by using `Thread::suspend()`.
Comment 28 Yusuke Suzuki 2019-07-21 02:29:50 PDT
(In reply to Keith Miller from comment #25)
> I'm not sure if you already tried this, but have you considered making the
> bytecode cost equivalent to the old instruction size for any given
> instruction? I think this would just be bytecode cost = 1 + number of
> inline data entries + number of out-of-line metadata entries.

Interesting! I didn't try it. Sadly, this blocking problem happens even in a WebKit revision with the old bytecode format (I tried that to confirm this is not a regression). It is possible that the old bytecode cost calculation would produce different values for octane-zlib and the CNN page.
However, since the bytecode cost represents the size of the whole graph, it does not capture how large each basic block is. The time-consuming operation on the CNN page is related to the size of one basic block, not the size of the whole graph: other functions with similarly large graphs can be compiled within 100-300 ms. So I think using the basic block size as a threshold is the more direct solution.

On the other hand, we may have a problem with this bytecode cost threshold, because we did not revisit this value or the inlining cost heuristics when we introduced the new bytecode format. It seems that a typical bytecode cost is 50-70% of the old bytecode cost. It is possible that we now start compiling incredibly large functions that were not selected before, blocking subsequent DFG compilation requests. Revising this threshold, or how we calculate the cost, could dramatically improve the JetStream2 / Speedometer2 scores.
Comment 29 Geoffrey Garen 2019-07-22 10:16:54 PDT
> I think this bug is another good reason why we should have the global GC
> feature. If the global GC is implemented, concurrent compilers just uses
> GC-managed values with write barriers and they do not need to have GC
> safepoints. GC can stop concurrent compilers at arbitrary point by using
> `Thread::suspend()`.

I like this idea!
Comment 30 Saam Barati 2019-07-22 13:01:12 PDT
Comment on attachment 374546 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review

> Source/JavaScriptCore/ChangeLog:40
> +        Local CSE's clobbering iterates all the impure heap values to remove the clobbered one. Since # of impure heap values tend to be proportional to # of DFG nodes we visited,
> +        each CSE for a basic block gets O(N^2) complexity. To avoid this, we introduce HugeMap. This has the same interface to LargeMap and SmallMap in CSE, but its clobbering
> +        implementation just clears the map completely.

Should we switch to using epochs? Would that help?

> Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4098
> +    // If it is too large, we conservatively clobber the structures.

"clobber the" => "clobber all the"

> Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4121
> +    // If it is too large, we conservatively clobber the structures.

ditto

> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:446
> +    // This is used only for huge basic block. Our usual CSE is quadratic complexity for # of DFG nodes in a basic block.

"huge basic block" => "huge basic blocks"

> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:447
> +    // This HugeMaps tells conservative results instead of avoiding O(N^2) way.

I would say something like:
"HugeMaps models results conservatively to avoid an O(N^2) algorithm."

Might also be worth saying how results are conservative.
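The HugeMap behavior described in the ChangeLog can be sketched as follows (illustrative Python with invented names; the real structures are C++ maps keyed by abstract heap locations):

```python
# Illustrative contrast between precise clobbering (cost proportional to
# map size) and the conservative O(1) clobbering described for HugeMap.
# Not JSC code; heap kinds and locations are invented examples.

class PreciseMap:
    """Precise: on clobber, drop only the entries for that heap kind."""
    def __init__(self):
        self.entries = {}  # (heap_kind, location) -> value node

    def add(self, heap_kind, location, value):
        self.entries[(heap_kind, location)] = value

    def clobber(self, heap_kind):
        # Scans every entry: cost grows with map size, hence O(N^2)
        # over a block whose node count and map size both grow with N.
        self.entries = {k: v for k, v in self.entries.items()
                        if k[0] != heap_kind}

    def lookup(self, heap_kind, location):
        return self.entries.get((heap_kind, location))

class HugeMap(PreciseMap):
    """Conservative: clobbering forgets everything, in O(1).
    This loses valid entries (so CSE finds fewer matches) but never
    keeps a stale one, which is the safe direction to be wrong in."""
    def clobber(self, heap_kind):
        self.entries.clear()

m = HugeMap()
m.add("NamedProperties", "o.x", "node1")
m.add("Butterfly", "a[0]", "node2")
m.clobber("NamedProperties")           # a precise map would keep a[0]
print(m.lookup("Butterfly", "a[0]"))   # -> None (conservative)
```

The "conservative" direction matters: forgetting a usable entry only costs a missed CSE opportunity, whereas keeping a clobbered entry would be a correctness bug.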
Comment 31 Yusuke Suzuki 2019-07-22 13:42:35 PDT
Comment on attachment 374546 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review

Thanks!

>> Source/JavaScriptCore/ChangeLog:40
>> +        implementation just clears the map completely.
> 
> Should we switch to using epochs? Would that help?

Discussed with Saam. It sounds nice; I'll track this in a separate patch.
https://bugs.webkit.org/show_bug.cgi?id=200014
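The epoch idea could look roughly like this (a hedged sketch with invented names, not the actual bug 200014 design): each heap kind records the epoch of its last clobber, and lookups ignore entries recorded before that epoch, making clobbering O(1) without discarding unrelated entries.

```python
# Hedged sketch of epoch-based clobbering -- illustrative only.
class EpochMap:
    def __init__(self):
        self.entries = {}        # (heap_kind, location) -> (epoch, value)
        self.clobber_epoch = {}  # heap_kind -> epoch of its last clobber
        self.epoch = 0

    def add(self, heap_kind, location, value):
        self.epoch += 1
        self.entries[(heap_kind, location)] = (self.epoch, value)

    def clobber(self, heap_kind):
        # O(1): just bump the clobber epoch for this heap kind.
        self.epoch += 1
        self.clobber_epoch[heap_kind] = self.epoch

    def lookup(self, heap_kind, location):
        hit = self.entries.get((heap_kind, location))
        if hit is None:
            return None
        recorded, value = hit
        # Entries recorded at or before the last clobber are stale.
        if recorded <= self.clobber_epoch.get(heap_kind, 0):
            return None
        return value

m = EpochMap()
m.add("NamedProperties", "o.x", "node1")
m.add("Butterfly", "a[0]", "node2")
m.clobber("NamedProperties")
print(m.lookup("NamedProperties", "o.x"))  # -> None (clobbered)
print(m.lookup("Butterfly", "a[0]"))       # -> node2 (preserved)
```

Unlike the clear-everything approach, this keeps unrelated entries alive while staying O(1) per clobber; the trade-off is that stale entries linger in the map until they are overwritten or the map is reset.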

>> Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4098
>> +    // If it is too large, we conservatively clobber the structures.
> 
> "clobber the" => "clobber all the"

Fixed.

>> Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4121
>> +    // If it is too large, we conservatively clobber the structures.
> 
> ditto

Fixed.

>> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:446
>> +    // This is used only for huge basic block. Our usual CSE is quadratic complexity for # of DFG nodes in a basic block.
> 
> "huge basic block" => "huge basic blocks"

Fixed.

>> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:447
>> +    // This HugeMaps tells conservative results instead of avoiding O(N^2) way.
> 
> I would say something like:
> "HugeMaps models results conservatively to avoid an O(N^2) algorithm."
> 
> Might also be worth saying how results are conservative.

Fixed.
Comment 32 Yusuke Suzuki 2019-07-22 14:30:48 PDT
I also added a configuration to run-jsc-stress-tests to run tests with maxDFGNodesInBasicBlockForPreciseAnalysis=small-value.
Comment 33 Yusuke Suzuki 2019-07-22 15:26:44 PDT
Comment on attachment 374546 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review

> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:171
> +            if (conservative)
> +                return heap.kind() == pair.key.heap().kind();

Ah, I think this needs to take m_abstractHeapStackMap's contents into account too. I'll fix this part to maintain m_debugImpureData; it does not affect the correctness of this algorithm change.
Comment 34 Yusuke Suzuki 2019-07-22 16:02:22 PDT
Created attachment 374648 [details]
Patch for landing
Comment 35 Yusuke Suzuki 2019-07-22 16:07:16 PDT
Committed r247703: <https://trac.webkit.org/changeset/247703>