Precise structure transition tracking has O(N^2) complexity, where N is the number of nodes in a basic block. We should give up on ridiculously large basic blocks. I would like to adjust the threshold for DFG optimization too, but that is better done in a separate patch.
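As a rough illustration of the problem above (all names here are hypothetical, not JSC's real classes): the precise path touches every previously visited node each time structures are clobbered, so a block with N such nodes does O(N^2) work. Giving up above a node-count threshold makes each clobber O(1), at the cost of over-conservative answers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of bailing out of precise structure-transition
// tracking for huge basic blocks. In precise mode, each clobber walks all
// nodes seen so far (O(N) per clobber, O(N^2) per block). In conservative
// mode, a clobber just sets one flag.
struct TrackedNode {
    bool structureClobbered { false };
};

class StructureTransitionTracker {
public:
    StructureTransitionTracker(size_t nodesInBlock, size_t threshold)
        : m_precise(nodesInBlock <= threshold) { }

    void visitNode() { m_nodes.push_back(TrackedNode()); }

    void clobberStructures()
    {
        if (!m_precise) {
            m_allClobbered = true; // conservative: O(1) instead of O(N)
            return;
        }
        for (auto& node : m_nodes) // precise: O(N) per clobber
            node.structureClobbered = true;
    }

    bool isClobbered(size_t index) const
    {
        return m_allClobbered || m_nodes[index].structureClobbered;
    }

private:
    std::vector<TrackedNode> m_nodes;
    bool m_precise;
    bool m_allClobbered { false };
};
```

The trade-off is visible for a node visited after a clobber: precise mode still knows it is unclobbered, while conservative mode reports everything as clobbered.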
<rdar://problem/49309924>
Created attachment 374438 [details] Patch WIP
Created attachment 374439 [details] Patch WIP
Created attachment 374441 [details] Patch WIP
Created attachment 374442 [details] Patch WIP
Comment on attachment 374442 [details] Patch Attachment 374442 [details] did not pass jsc-ews (mac): Output: https://webkit-queues.webkit.org/results/12770865 New failing tests: mozilla-tests.yaml/js1_5/Array/regress-101964.js.mozilla-no-ftl apiTests
Discussed with Keith, and now we have an idea for reducing this quadratic complexity. But I think reducing the threshold is worthwhile anyway. Note that the AI mitigation above seems to show some regression in Speedometer, while reducing the threshold shows some improvement in Speedometer. I'll do the threshold reduction first (the current patch is invalid now anyway), and after that I'll do the AI improvement. In many cases it may not produce a speedup, but it definitely prevents pathological cases. So, for now, just reducing the threshold.
The maximum bytecode cost in JetStream2 is 36065.
The maximum bytecode cost in Speedometer2 is 5671.
BTW, CNN's problematic function's bytecode cost is 67904, which is ridiculously large...
Created attachment 374524 [details] Patch
Comment on attachment 374524 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=374524&action=review > Source/JavaScriptCore/ChangeLog:44 > + So I think 60000 seems reasonable threshold while avoiding pathologically large functions (as we have a threshold for # of DFG nodes in FTL). While we avoid compiling incredibly large functions, > + I think this should be covered by introducing DFG tracelet compiler once it is introduced. And it is also possible that we can compile small hot functions while avoiding such an large function. I suggest rephrasing "While we avoid compiling incredibly large functions, I think this should be covered by introducing DFG tracelet compiler once it is introduced." ... ...as "While we would like incredibly large functions to also be able to run fast, I think this should be covered later by introducing a DFG tracelet compiler to handle it."
I'll check whether it is still laggy with the O(1) approach on an iPhone 7. But anyway, for now, reducing the threshold is the safest option. I would like to implement the O(N) algorithm after this, and I would like to revisit this threshold later. If the O(N) algorithm removes the laggy behavior, we can increase this threshold again.
(In reply to Yusuke Suzuki from comment #13) > I'll check whether it is laggy with O(1) thing in iPhone 7. But anyway, for > now, reducing the threshold is the safest. I would like to implement O(N) > algorithm after this, and I would like to revisit this threshold later. If > O(N) algorithm removes the laggy behavior, we can increase this threshold > again. I checked, and it does seem laggy without avoiding DFG compilation at that threshold. I'll update the patch with an additional description and put it up for review.
Comment on attachment 374524 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=374524&action=review >> Source/JavaScriptCore/ChangeLog:44 >> + I think this should be covered by introducing DFG tracelet compiler once it is introduced. And it is also possible that we can compile small hot functions while avoiding such an large function. > > I suggest rephrasing "While we avoid compiling incredibly large functions, I think this should be covered by introducing DFG tracelet compiler once it is introduced." ... > ...as "While we would like incredibly large functions to also be able to run fast, I think this should be covered later by introducing a DFG tracelet compiler to handle it." Sounds nice. Fixed.
Created attachment 374536 [details] Patch
Comment on attachment 374536 [details] Patch No, I missed one function in JetStream2. Adjusting the threshold now.
(In reply to Yusuke Suzuki from comment #17) > Comment on attachment 374536 [details] > Patch > > No, I missed one function in JetStream2. Adjusting the threshold now. Hmmm, it seems that JetStream/octane-zlib has a function with a bytecode cost of 61949, and it is later compiled in DFG. While JetStream also has even larger functions (77617, 115144, 120248, 65567, 64881), those do not get DFG compiled. While octane-zlib's 61949 is really large, this function is critical for performance. The sad thing is that compiling this function in DFG takes 354.434433 ms anyway, lol.
(In reply to Yusuke Suzuki from comment #18) > (In reply to Yusuke Suzuki from comment #17) > > Comment on attachment 374536 [details] > > Patch > > > > No, I missed one function in JetStream2. Adjusting the threshold now. > > Hmmm, it seems that JetStream/octane-zlib has 61949 bytecode cost function > and it is later compiled in DFG. > While JetStream also has larger functions (77617, 115144, 120248, 65567, > 64881), they do not get DFG. > While JetStream/octane-zlib's 61949 is really large, this is critical for > performance. Sad thing is that anyway compiling this function in DFG takes > 354.434433 ms, lol. Maybe the lag seen with the AI mitigation is not due to DFG. I would like to try the AI mitigation approach for now.
(In reply to Yusuke Suzuki from comment #19) > (In reply to Yusuke Suzuki from comment #18) > > (In reply to Yusuke Suzuki from comment #17) > > > Comment on attachment 374536 [details] > > > Patch > > > > > > No, I missed one function in JetStream2. Adjusting the threshold now. > > > > Hmmm, it seems that JetStream/octane-zlib has 61949 bytecode cost function > > and it is later compiled in DFG. > > While JetStream also has larger functions (77617, 115144, 120248, 65567, > > 64881), they do not get DFG. > > While JetStream/octane-zlib's 61949 is really large, this is critical for > > performance. Sad thing is that anyway compiling this function in DFG takes > > 354.434433 ms, lol. > > Maybe, I think that the lagging with AI mitigation thing is not due to DFG. > I would like to try with AI mitigation for now. AI mitigation + CFAPhase safepoint insertion seems to be working well. Let's go in this direction.
(In reply to Yusuke Suzuki from comment #20) > (In reply to Yusuke Suzuki from comment #19) > > (In reply to Yusuke Suzuki from comment #18) > > > (In reply to Yusuke Suzuki from comment #17) > > > > Comment on attachment 374536 [details] > > > > Patch > > > > > > > > No, I missed one function in JetStream2. Adjusting the threshold now. > > > > > > Hmmm, it seems that JetStream/octane-zlib has 61949 bytecode cost function > > > and it is later compiled in DFG. > > > While JetStream also has larger functions (77617, 115144, 120248, 65567, > > > 64881), they do not get DFG. > > > While JetStream/octane-zlib's 61949 is really large, this is critical for > > > performance. Sad thing is that anyway compiling this function in DFG takes > > > 354.434433 ms, lol. > > > > Maybe, I think that the lagging with AI mitigation thing is not due to DFG. > > I would like to try with AI mitigation for now. > > AI mitigation + CFAPhase safepoint insertion seems working well. Let's go > with this direction. Ah, hmm, not quite. Anyway, it is much better than the current ToT.
Created attachment 374544 [details] Patch
Created attachment 374545 [details] Patch
Created attachment 374546 [details] Patch
(In reply to Yusuke Suzuki from comment #18) > (In reply to Yusuke Suzuki from comment #17) > > Comment on attachment 374536 [details] > > Patch > > > > No, I missed one function in JetStream2. Adjusting the threshold now. > > Hmmm, it seems that JetStream/octane-zlib has 61949 bytecode cost function > and it is later compiled in DFG. > While JetStream also has larger functions (77617, 115144, 120248, 65567, > 64881), they do not get DFG. > While JetStream/octane-zlib's 61949 is really large, this is critical for > performance. Sad thing is that anyway compiling this function in DFG takes > 354.434433 ms, lol. Yikes! I’m not sure if you already tried this, but have you considered making the bytecode cost equivalent to the old instruction size for any given instruction? I think this would just be bytecode cost = 1 + number of inline data entries + number of out-of-line metadata entries.
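A minimal sketch of the cost formula suggested above, with made-up types (the real implementation would walk JSC's instruction stream and its metadata tables):

```cpp
#include <vector>

// Hypothetical instruction descriptor; in JSC the entries would come from
// the bytecode's operand layout and out-of-line metadata.
struct InstructionInfo {
    unsigned inlineDataEntries;
    unsigned outOfLineMetadataEntries;
};

// Approximate the old per-instruction size: 1 for the opcode itself, plus
// the instruction's inline operand data and out-of-line metadata entries.
unsigned bytecodeCost(const std::vector<InstructionInfo>& instructions)
{
    unsigned cost = 0;
    for (auto& instruction : instructions)
        cost += 1 + instruction.inlineDataEntries + instruction.outOfLineMetadataEntries;
    return cost;
}
```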
Comment on attachment 374546 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review > Source/JavaScriptCore/ChangeLog:13 > + DFG compiler stops at GC safepoints, and they are inserted between DFG phases. So, if some of DFG phases take very long time, the main thread is blocked during that. > + As a result, the main thread is blocked due to this pathological compilation. For phases that might take a long time, should we consider inserting GC safe points in the middle? If it's not practical for a GC safe point to pause and continue a phase, could we insert a special GC safe point that canceled the compilation and rescheduled it for later? GC happens regularly, so it's likely that any long compilation will eventually intersect GC. It's frustrating that compilation stops being concurrent when the concurrency would benefit you the most! :P
Comment on attachment 374546 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review >> Source/JavaScriptCore/ChangeLog:13 >> + As a result, the main thread is blocked due to this pathological compilation. > > For phases that might take a long time, should we consider inserting GC safe points in the middle? > > If it's not practical for a GC safe point to pause and continue a phase, could we insert a special GC safe point that canceled the compilation and rescheduled it for later? > > GC happens regularly, so it's likely that any long compilation will eventually intersect GC. It's frustrating that compilation stops being concurrent when the concurrency would benefit you the most! :P Yeah! This is a good point. I agree: I think we should eventually add GCSafepoints between the iterations inside a DFG phase to offer fine-grained GC responsiveness too :) For other scripts that take a long time to compile, this would give the GC more chances to run in a responsive manner. I guess it would be beneficial for the JetStream2/Octane-zlib case (but I'm not sure). However, while this improves general responsiveness, it does not work well for this CNN page case :( At first, I tried inserting a GCSafepoint in the middle of the CFA phase, between iterations of the CFA analysis, because this is the phase that drives AI with iterations. However, IIRC, I could not get good responsiveness on this CNN page. This is because the script has one incredibly large basic block (crazily, it has 43297 DFG nodes in one basic block!), and even running the block-local analysis once over this block can take a long time because of the O(N^2) complexity. For example, the DFGSpeculativeJIT machine code generation phase also takes a very long time because it drives AI, even though it has no iterations inside it.
Sadly, inserting a GCSafepoint between basic block analyses also does not solve the problem, because the problematic graph has one huge basic block and almost all the time is consumed by that one block. Inserting a GCSafepoint every N DFG nodes does not work either, because processing a DFG node at the end of the basic block takes more time than processing one at the beginning (the complexity is proportional to the number of preceding DFG nodes). But I think that, after solving this CNN page problem with this patch (avoiding the O(N^2) complexity), inserting GCSafepoints would be more effective for the other time-consuming compilations :) I think this bug is another good reason why we should have the global GC feature. If the global GC is implemented, concurrent compilers just use GC-managed values with write barriers and do not need GC safepoints: the GC can stop concurrent compilers at an arbitrary point by using `Thread::suspend()`.
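A sketch of the safepoint-between-iterations idea discussed above (hypothetical signatures, not JSC's actual Safepoint API): the GC can only run inside the safepoint callback, so the worst-case pause is bounded by the cost of one iteration — which is exactly why this does not help when a single iteration is itself O(N^2) over one huge block.

```cpp
#include <functional>
#include <vector>

// Hypothetical driver for an iterative analysis (e.g. CFA running to a
// fixpoint). analyzeOnce returns true while the abstract state is still
// changing; gcSafepoint stands in for yielding to the collector between
// iterations. The GC can never interrupt analyzeOnce itself, so one very
// expensive iteration still blocks collection for its full duration.
void runAnalysisToFixpoint(std::vector<int>& state,
                           const std::function<bool(std::vector<int>&)>& analyzeOnce,
                           const std::function<void()>& gcSafepoint)
{
    bool changed = true;
    while (changed) {
        changed = analyzeOnce(state); // may be very long for a huge basic block
        gcSafepoint(); // give the GC a chance between iterations
    }
}
```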
(In reply to Keith Miller from comment #25) > (In reply to Yusuke Suzuki from comment #18) > > (In reply to Yusuke Suzuki from comment #17) > > > Comment on attachment 374536 [details] > > > Patch > > > > > > No, I missed one function in JetStream2. Adjusting the threshold now. > > > > Hmmm, it seems that JetStream/octane-zlib has 61949 bytecode cost function > > and it is later compiled in DFG. > > While JetStream also has larger functions (77617, 115144, 120248, 65567, > > 64881), they do not get DFG. > > While JetStream/octane-zlib's 61949 is really large, this is critical for > > performance. Sad thing is that anyway compiling this function in DFG takes > > 354.434433 ms, lol. > > Yikes! > > I’m not sure if you already tried this but have you considered making > bytecode cost be equivalent to the old instruction size for any given > instruction? I think this would just be bytecode cost = 1 + number of inline > inline data entries + number of out of line metadata entries. Interesting! I didn't try it. Sadly, this blocking problem happens even in a WebKit revision with the old bytecode format (I tried this to ensure it is not a regression). It is possible that the old bytecode cost calculation would differentiate the values between octane-zlib and the CNN page. However, since the bytecode cost represents the whole graph size, it does not represent how large each basic block is. The time-consuming operation on the CNN page is related to the size of one basic block, not to the size of the whole graph, since other functions with graphs that large can be compiled within 100-300 ms. So I think using the basic block size as a threshold is the more direct solution. On the other hand, I think we may have a problem with this bytecode cost threshold, because we did not change this value, nor revisit the inlining cost heuristics, when we introduced the new bytecode format. It seems that the typical new bytecode cost is 50-70% of the old bytecode cost.
It is possible that we are now starting to compile incredibly large functions that were not selected before, and that subsequent DFG compilation requests get blocked behind them. And it is possible that revising this threshold, or how we calculate the cost, could dramatically improve the scores of JetStream2, Speedometer2, etc.
> I think this bug is another good reason why we should have the global GC > feature. If the global GC is implemented, concurrent compilers just uses > GC-managed values with write barriers and they do not need to have GC > safepoints. GC can stop concurrent compilers at arbitrary point by using > `Thread::suspend()`. I like this idea!
Comment on attachment 374546 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review > Source/JavaScriptCore/ChangeLog:40 > + Local CSE's clobbering iterates all the impure heap values to remove the clobbered one. Since # of impure heap values tend to be proportional to # of DFG nodes we visited, > + each CSE for a basic block gets O(N^2) complexity. To avoid this, we introduce HugeMap. This has the same interface to LargeMap and SmallMap in CSE, but its clobbering > + implementation just clears the map completely. Should we switch to using epochs? Would that help? > Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4098 > + // If it is too large, we conservatively clobber the structures. "clobber the" => "clobber all the" > Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4121 > + // If it is too large, we conservatively clobber the structures. ditto > Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:446 > + // This is used only for huge basic block. Our usual CSE is quadratic complexity for # of DFG nodes in a basic block. "huge basic block" => "huge basic blocks" > Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:447 > + // This HugeMaps tells conservative results instead of avoiding O(N^2) way. I would say something like: "HugeMaps models results conservatively to avoid an O(N^2) algorithm." Might also be worth saying how results are conservative.
Comment on attachment 374546 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review Thanks! >> Source/JavaScriptCore/ChangeLog:40 >> + implementation just clears the map completely. > > Should we switch to using epochs? Would that help? Discussed with Saam. It sounds nice, I'll track this in a separate patch. https://bugs.webkit.org/show_bug.cgi?id=200014 >> Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4098 >> + // If it is too large, we conservatively clobber the structures. > > "clobber the" => "clobber all the" Fixed. >> Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:4121 >> + // If it is too large, we conservatively clobber the structures. > > ditto Fixed. >> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:446 >> + // This is used only for huge basic block. Our usual CSE is quadratic complexity for # of DFG nodes in a basic block. > > "huge basic block" => "huge basic blocks" Fixed. >> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:447 >> + // This HugeMaps tells conservative results instead of avoiding O(N^2) way. > > I would say something like: > "HugeMaps models results conservatively to avoid an O(N^2) algorithm." > > Might also be worth saying how results are conservative. Fixed.
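A sketch of the epoch idea from the review above (to be tracked in bug 200014; the names here are illustrative, not JSC's HugeMap): instead of clearing the map on clobber, bump an epoch and treat entries written under an older epoch as absent, so clobbering stays O(1) without losing every entry's storage.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical epoch-based impure-value map. clobber() is O(1): it only
// increments the current epoch. Lookups ignore entries whose epoch is older
// than the current one, which is equivalent to the map having been cleared.
class EpochedImpureMap {
public:
    void add(const std::string& key, int value)
    {
        m_map[key] = Entry { value, m_epoch };
    }

    bool lookup(const std::string& key, int& value) const
    {
        auto it = m_map.find(key);
        if (it == m_map.end() || it->second.epoch != m_epoch)
            return false; // stale entry: written before the last clobber
        value = it->second.value;
        return true;
    }

    void clobber() { m_epoch++; } // O(1) instead of clearing the table

private:
    struct Entry {
        int value;
        uint64_t epoch;
    };
    std::unordered_map<std::string, Entry> m_map;
    uint64_t m_epoch { 0 };
};
```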
I also added a configuration to run-jsc-stress-tests to run tests with maxDFGNodesInBasicBlockForPreciseAnalysis=small-value.
Comment on attachment 374546 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=374546&action=review > Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:171 > + if (conservative) > + return heap.kind() == pair.key.heap().kind(); Ah, I think this needs to take m_abstractHeapStackMap's contents into account too. I'll fix this part to maintain m_debugImpureData, though this does not affect the correctness of this algorithm change.
Created attachment 374648 [details] Patch for landing
Committed r247703: <https://trac.webkit.org/changeset/247703>