RESOLVED FIXED 79899
DFG BasicBlocks should not require that their nodes have contiguous indices in the graph
https://bugs.webkit.org/show_bug.cgi?id=79899
Yuqiang Xian
Reported 2012-02-29 02:19:36 PST
As discussed in bug 76770, we should make it more convenient to insert nodes into the DFG, so a BasicBlock may not have contiguous node indices.
Attachments
WIP patch (37.73 KB, patch)
2012-02-29 07:31 PST, Yuqiang Xian
no flags
Patch for review (37.45 KB, patch)
2012-02-29 19:20 PST, Yuqiang Xian
fpizlo: review-
patch updated (46.65 KB, patch)
2012-02-29 21:23 PST, Yuqiang Xian
fpizlo: review+
Yuqiang Xian
Comment 1 2012-02-29 07:31:21 PST
Created attachment 129448 [details]
WIP patch

WIP. There seems to be a 1~2% performance regression on SunSpider. Will investigate it later.
Filip Pizlo
Comment 2 2012-02-29 14:50:24 PST
(In reply to comment #1)
> Created an attachment (id=129448) [details]
> WIP patch
>
> WIP.
>
> There seems to be 1~2% performance regression on SunSpider. Will investigate it later.

My guess: the HashMap you added to AbstractState.

My original implementation had a Vector<AbstractValue> indexed by NodeIndex rather than NodeIndex - m_block->begin. But my original implementation also had loads of regressions, which turned out to be because of a combination of clobberStructures() (that's why I have that m_haveStructures and the hacks to make clobberStructures() walk over fewer things). In the process, I also changed the Vector to be indexed by NodeIndex offset from block begin, but I have no idea if that change improved performance at all.

You might want to turn on SAMPLING_REGIONS to see what's going on. That should quickly tell you if you've increased DFG compile time.

But in the long run, we should probably tweak the DFG to run concurrently rather than synchronously. That will be a glorious hack, and will probably require some thread-safety changes elsewhere in the runtime, since the DFG parser pokes JS heap objects that are somewhat mutable (like Structure). But it will reduce the likelihood of us having to worry about regressions from compile times in the future.
Yuqiang Xian
Comment 3 2012-02-29 18:26:00 PST
(In reply to comment #2)
> My guess: the HashMap you added to AbstractState.
>
> My original implementation had a Vector<AbstractValue> indexed by NodeIndex rather than NodeIndex - m_block->begin. But my original implementation also had loads of regressions, which turned out to be because of a combination of clobberStructures() (that's why I have that m_haveStructures and the hacks to make clobberStructures() walk over fewer things). In the process, I also changed the Vector to be indexed by NodeIndex offset from block begin, but I have no idea if that change improved performance at all.

Yes, you're right. If we still maintain the abstract values in a Vector and index them by NodeIndex, then there's no such performance regression. However, it may use more memory, as the vector size is equal to the graph size. Do you think that's an acceptable tradeoff?
Filip Pizlo
Comment 4 2012-02-29 18:39:12 PST
(In reply to comment #3)
> (In reply to comment #2)
> >
> > My guess: the HashMap you added to AbstractState.
> >
> > My original implementation had a Vector<AbstractValue> indexed by NodeIndex rather than NodeIndex - m_block->begin. But my original implementation also had loads of regressions, which turned out to be because of a combination of clobberStructures() (that's why I have that m_haveStructures and the hacks to make clobberStructures() walk over fewer things). In the process, I also changed the Vector to be indexed by NodeIndex offset from block begin, but I have no idea if that change improved performance at all.
>
> Yes, you're right. If we still maintain the abstract values in a Vector and index them by NodeIndex then there's no such performance regression. However it may use more memory as the vector size is equal to the graph size. Do you think it's an acceptable tradeoff?

That's great! I think that this is a perfectly acceptable tradeoff. Here's my specific thinking:

- The old JIT emits ~100 bytes of machine code for basic arithmetic instructions, and probably no less than ~20 bytes for any other instruction. Some instructions emit drastically more than 100 bytes. So long as the DFG's footprint per bytecode instruction (with on average two DFG nodes - the operation and the SetLocal - per bytecode instruction) is less than 100 bytes, we probably shouldn't worry. I think we're already over this, by the way, but probably not by much.

- The DFG::Node class currently has ~8 32-bit words of junk. Here we're adding 2 64-bit words to represent the AbstractValue. So this is a small regression.

- The DFG should be allowed to make a space/time trade-off if needed, and a space/team-productivity trade-off as well. We only invoke the DFG rarely, and only for code that is not absurdly big. And when we do invoke it, there is only one instance of the DFG running at a time. Hence it should be allowed to use some extra memory if that's beneficial for performance, or if it makes the code easier for us to hack.
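[Editorial note] The budget argument above works out as a quick back-of-envelope check. All figures below are the approximations quoted in the comment, not measured values:

```cpp
// Rough per-node and per-bytecode sizes quoted in the discussion.
constexpr int oldNodeBytes       = 8 * 4;  // ~8 32-bit words already in DFG::Node
constexpr int abstractValueBytes = 2 * 8;  // 2 new 64-bit words per node for AbstractValue
constexpr int nodesPerBytecode   = 2;      // operation + SetLocal, on average
constexpr int oldJitBytesPerOp   = 100;    // ~machine code the old JIT emits per arithmetic op

// Per-bytecode DFG footprint after the change:
// 2 * (32 + 16) = 96 bytes, still at or under the ~100-byte baseline.
constexpr int dfgBytesPerBytecode = nodesPerBytecode * (oldNodeBytes + abstractValueBytes);
```

So even with the extra 16 bytes per node, the per-bytecode DFG footprint stays roughly within the machine-code budget of the old JIT.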
Yuqiang Xian
Comment 5 2012-02-29 19:20:42 PST
Created attachment 129614 [details]
Patch for review

Filip, thanks for the clarification. Here's the patch for review. The performance result is here as well (almost neutral):

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark.
Emitted a call to gc() between sample measurements.
Used 1 benchmark iteration per VM invocation for warm-up.
Used the jsc-specific preciseTime() function to get microsecond-level timing.
Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                                     ToT                      79899

SunSpider:
   3d-cube                        8.1317+-0.0753       ?      8.1812+-0.0359       ?
   3d-morph                       9.6125+-0.0396       ?      9.6248+-0.0347       ?
   3d-raytrace                   10.4898+-0.1458             10.4073+-0.0569
   access-binary-trees            2.4043+-0.0633       ?      2.4274+-0.0725       ?
   access-fannkuch               10.7972+-0.9117             10.3027+-0.0398            might be 1.0480x faster
   access-nbody                   5.4502+-0.0471       ?      5.4977+-0.0513       ?
   access-nsieve                  4.0119+-0.0407              4.0100+-0.0545
   bitops-3bit-bits-in-byte       1.2555+-0.0202              1.2482+-0.0243
   bitops-bits-in-byte            4.6557+-0.0646              4.6193+-0.0646
   bitops-bitwise-and             4.2758+-0.0490       ?      4.3005+-0.0636       ?
   bitops-nsieve-bits             4.2680+-0.0539              4.2642+-0.0477
   controlflow-recursive          2.8596+-0.0376              2.8366+-0.0324
   crypto-aes                     9.5863+-0.1921       ?      9.6874+-0.1197       ?    might be 1.0105x slower
   crypto-md5                     3.3988+-0.0716       ?      3.4228+-0.0502       ?
   crypto-sha1                    2.7206+-0.0500       ?      2.7903+-0.0403       ?    might be 1.0256x slower
   date-format-tofte             13.6587+-1.9276             12.7911+-0.0741            might be 1.0678x faster
   date-format-xparb             11.9622+-0.1335       ?     12.1260+-0.1427       ?    might be 1.0137x slower
   math-cordic                    4.0385+-0.0675       ?      4.0777+-0.0784       ?
   math-partial-sums             14.8533+-0.0333       ?     15.5325+-1.5644       ?    might be 1.0457x slower
   math-spectral-norm             2.7100+-0.0501              2.6963+-0.0447
   regexp-dna                     9.3486+-0.0752       ?      9.4206+-0.0760       ?
   string-base64                  5.6182+-0.0711       ?      5.7111+-0.1056       ?    might be 1.0165x slower
   string-fasta                   9.4383+-0.0563              9.4094+-0.0468
   string-tagcloud               16.9364+-1.9261             16.1189+-0.0514            might be 1.0507x faster
   string-unpack-code            27.5010+-0.2195             27.4682+-0.0478
   string-validate-input          8.3231+-0.0630       ?      8.3302+-0.0378       ?

   <arithmetic> *                 8.0118+-0.1202              7.9732+-0.0610            might be 1.0048x faster
   <geometric>                    6.3050+-0.0562              6.3023+-0.0375            might be 1.0004x faster
   <harmonic>                     4.8941+-0.0416       ?      4.9019+-0.0366       ?    might be 1.0016x slower

V8:
   crypto                        94.0062+-0.5066       ?     95.0840+-1.2626       ?    might be 1.0115x slower
   deltablue                    165.9663+-1.3598       ?    166.3503+-1.6042       ?
   earley-boyer                 121.4067+-2.5646            119.8530+-2.6862            might be 1.0130x faster
   raytrace                      58.6499+-0.4516       ?     59.4065+-0.4719       ?    might be 1.0129x slower
   regexp                       111.1548+-0.4702            110.9294+-0.4109
   richards                     191.6663+-1.2047            190.4287+-0.8902
   splay                         77.4138+-0.9005             77.3018+-0.2461

   <arithmetic>                 117.1806+-0.5576            117.0505+-0.6485            might be 1.0011x faster
   <geometric> *                109.0251+-0.5826       ?    109.0822+-0.6494       ?    might be 1.0005x slower
   <harmonic>                   101.3128+-0.6036       ?    101.5676+-0.6212       ?    might be 1.0025x slower

Kraken:
   ai-astar                     789.5488+-4.9142            788.0722+-5.0229
   audio-beat-detection         237.0390+-0.3647       ?    238.9648+-1.5770       ?
   audio-dft                    367.7708+-2.5980       ?    378.8661+-9.5683       ?    might be 1.0302x slower
   audio-fft                    151.2529+-0.0775       ?    151.2913+-0.1593       ?
   audio-oscillator             344.1697+-1.3283       ?    345.5862+-2.6765       ?
   imaging-darkroom             377.7554+-9.3130            375.5139+-9.5351
   imaging-desaturate           303.6464+-0.6332            303.4512+-0.6833
   imaging-gaussian-blur        506.0675+-0.1486       ?    506.2780+-0.2269       ?
   json-parse-financial          85.7818+-0.3590       !     86.9759+-0.7928       !    definitely 1.0139x slower
   json-stringify-tinderbox     106.0411+-0.7759       ?    106.0652+-0.6543       ?
   stanford-crypto-aes          105.6489+-0.5791       ?    106.1828+-0.4288       ?
   stanford-crypto-ccm          100.3177+-0.8778             99.5192+-0.5648
   stanford-crypto-pbkdf2       235.7453+-0.5540       ^    234.9382+-0.1006       ^    definitely 1.0034x faster
   stanford-crypto-sha256-iterative 102.8190+-0.5031   ?    103.1907+-0.5652       ?

   <arithmetic> *               272.4003+-1.0940       ?    273.2068+-1.3736       ?    might be 1.0030x slower
   <geometric>                  216.1398+-0.7343       ?    216.8310+-0.8875       ?    might be 1.0032x slower
   <harmonic>                   174.7069+-0.5262       ?    175.2619+-0.5780       ?    might be 1.0032x slower

All benchmarks:
   <arithmetic>                 103.0250+-0.4143       ?    103.2245+-0.4864       ?    might be 1.0019x slower
   <geometric>                   27.6239+-0.1723       ?     27.6462+-0.1356       ?    might be 1.0008x slower
   <harmonic>                     8.6053+-0.0719       ?      8.6194+-0.0633       ?    might be 1.0016x slower

Geomean of preferred means:
   <scaled-result>               61.9631+-0.4216             61.9369+-0.2838            might be 1.0004x faster
Filip Pizlo
Comment 6 2012-02-29 19:42:24 PST
Comment on attachment 129614 [details]
Patch for review

View in context: https://bugs.webkit.org/attachment.cgi?id=129614&action=review

This is really awesome. R- but only because I want to make sure we do CSE right.

> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:720
>
>     void performBlockCSE(BasicBlock& block)
>     {
> -       m_start = block.begin;
> -       NodeIndex end = block.end;
> -       for (m_compileIndex = m_start; m_compileIndex < end; ++m_compileIndex)
> +       m_start = block[block.startExcludingPhis];
> +       for (unsigned i = block.startExcludingPhis; i < block.size(); ++i) {
> +           m_compileIndex = block[i];
>             performNodeCSE(m_graph[m_compileIndex]);
> +       }
>     }
>
>     NodeIndex m_start;

I'm somewhat surprised that these are the only changes in CSE! CSE has this whole performance hack where it uses the NodeIndex of the instruction that is a candidate for elimination, and the NodeIndices of its children, to determine the range that it will search.

Seems like that relies on the NodeIndices of things in a basic block being monotonically increasing.

Would be better to instead make all of the search routines break early when they encounter a NodeIndex at block[i] that matches the NodeIndex in one of the children of the node from which the search originated.
Yuqiang Xian
Comment 7 2012-02-29 19:47:19 PST
(In reply to comment #6)
> (From update of attachment 129614 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=129614&action=review
>
> This is really awesome. R- but only because I want to make sure we do CSE right.
>
> > Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:720
> >
> >     void performBlockCSE(BasicBlock& block)
> >     {
> > -       m_start = block.begin;
> > -       NodeIndex end = block.end;
> > -       for (m_compileIndex = m_start; m_compileIndex < end; ++m_compileIndex)
> > +       m_start = block[block.startExcludingPhis];
> > +       for (unsigned i = block.startExcludingPhis; i < block.size(); ++i) {
> > +           m_compileIndex = block[i];
> >             performNodeCSE(m_graph[m_compileIndex]);
> > +       }
> >     }
> >
> >     NodeIndex m_start;
>
> I'm somewhat surprised that these are the only changes in CSE! CSE has this whole performance hack where it uses the NodeIndex of the instruction that is a candidate for elimination, and the NodeIndices of its children, to determine the range that it will search.
>
> Seems like that relies on NodeIndices of things in a basic block being monotonically increasing.
>
> Would be better to instead make all of the search things instead break early when they encounter a NodeIndex at block[i] that matches the NodeIndex in one of the children of the node from which the search originated.

Yes, you're absolutely right. I didn't touch CSE for now - actually, it still relies on the assumption that the node indices are contiguous, which is true at the current stage because we exclude the inserted Phi nodes. But yes, we should fix it.
Filip Pizlo
Comment 8 2012-02-29 19:50:05 PST
> Yes. You're absolutely right. I didn't touch CSE for now - actually it still relies on the assumption that the node indicies are continuous, which is true in current stage as we exclude the inserted Phi nodes. But yes we should fix it.

Yeah, I think this is one of those things that we want to fix in this patch. It feels like even though it will make for a bigger patch, the result will be cleaner and more coherent.
Yuqiang Xian
Comment 9 2012-02-29 21:23:50 PST
Created attachment 129627 [details]
patch updated

Correct CSE. Performance result here:

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark.
Emitted a call to gc() between sample measurements.
Used 1 benchmark iteration per VM invocation for warm-up.
Used the jsc-specific preciseTime() function to get microsecond-level timing.
Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                                     ToT                      79899

SunSpider:
   3d-cube                        8.2464+-0.0825       ?      8.8842+-1.3563       ?    might be 1.0773x slower
   3d-morph                       9.7461+-0.1194              9.6008+-0.0471            might be 1.0151x faster
   3d-raytrace                   10.3699+-0.0746       ?     10.4985+-0.0897       ?    might be 1.0124x slower
   access-binary-trees            2.4174+-0.0491       ?      2.4208+-0.0299       ?
   access-fannkuch               10.2097+-0.0340       !     10.3671+-0.0738       !    definitely 1.0154x slower
   access-nbody                   5.4205+-0.0502       ?      5.4971+-0.0408       ?    might be 1.0141x slower
   access-nsieve                  4.0139+-0.0354              3.9855+-0.0549
   bitops-3bit-bits-in-byte       1.2879+-0.1218              1.2674+-0.0267            might be 1.0162x faster
   bitops-bits-in-byte            4.6899+-0.0529       ?      4.6899+-0.0680       ?
   bitops-bitwise-and             4.3005+-0.0526              4.2784+-0.0348
   bitops-nsieve-bits             4.2527+-0.0555       ?      4.2611+-0.0592       ?
   controlflow-recursive          2.8832+-0.0560              2.8495+-0.0330            might be 1.0118x faster
   crypto-aes                     9.9004+-0.8040              9.7395+-0.1091            might be 1.0165x faster
   crypto-md5                     3.7363+-0.7201              3.5368+-0.0762            might be 1.0564x faster
   crypto-sha1                    2.7769+-0.1318       ?      2.8187+-0.0382       ?    might be 1.0151x slower
   date-format-tofte             12.7713+-0.1900             12.6657+-0.0813
   date-format-xparb             11.9215+-0.0611             11.8989+-0.0899
   math-cordic                    4.0575+-0.0668              4.0442+-0.0729
   math-partial-sums             14.8560+-0.0503       ?     14.8976+-0.0510       ?
   math-spectral-norm             2.6806+-0.0414              2.6710+-0.0367
   regexp-dna                     9.4064+-0.0848              9.3769+-0.0829
   string-base64                  5.5638+-0.0453       ?      5.6208+-0.0781       ?    might be 1.0102x slower
   string-fasta                   9.4595+-0.0784       ?      9.4883+-0.0626       ?
   string-tagcloud               16.2971+-0.1989       ?     16.6938+-1.4490       ?    might be 1.0243x slower
   string-unpack-code            27.5464+-0.1902             27.2705+-0.1486            might be 1.0101x faster
   string-validate-input          8.6276+-0.3451              8.4289+-0.0586            might be 1.0236x faster

   <arithmetic> *                 7.9784+-0.0601       ?      7.9905+-0.0691       ?    might be 1.0015x slower
   <geometric>                    6.3202+-0.0628       ?      6.3234+-0.0358       ?    might be 1.0005x slower
   <harmonic>                     4.9249+-0.0723              4.9247+-0.0266            might be 1.0000x faster

V8:
   crypto                        95.0226+-1.0047             94.3739+-0.3426
   deltablue                    165.2617+-0.6624       ?    165.7057+-0.6259       ?
   earley-boyer                 120.6456+-2.7215       ?    122.9765+-2.5056       ?    might be 1.0193x slower
   raytrace                      58.9135+-0.3810       ?     59.0790+-0.4627       ?
   regexp                       111.1890+-0.4246            110.9795+-0.4306
   richards                     190.9017+-0.7440       ^    189.6743+-0.2815       ^    definitely 1.0065x faster
   splay                         76.9456+-0.1664             76.8783+-0.1922

   <arithmetic>                 116.9828+-0.5396       ?    117.0953+-0.4770       ?    might be 1.0010x slower
   <geometric> *                108.9471+-0.5280       ?    109.0833+-0.4856       ?    might be 1.0013x slower
   <harmonic>                   101.3365+-0.4907       ?    101.4683+-0.4743       ?    might be 1.0013x slower

Kraken:
   ai-astar                     785.0024+-0.6913       ?    785.6353+-1.8885       ?
   audio-beat-detection         237.7250+-1.4305       ?    238.6727+-0.6064       ?
   audio-dft                    368.7606+-7.9661            366.7011+-3.3630
   audio-fft                    152.0918+-1.0270            151.3073+-0.0661
   audio-oscillator             343.2595+-1.3591       ?    344.1854+-1.4707       ?
   imaging-darkroom             375.6221+-9.2659            374.0356+-8.6721
   imaging-desaturate           303.7351+-0.8658            303.5210+-0.5746
   imaging-gaussian-blur        506.6966+-1.0673            506.6839+-0.5188
   json-parse-financial          86.3672+-0.7082       ?     87.4680+-0.3961       ?    might be 1.0127x slower
   json-stringify-tinderbox     106.0603+-0.8037            106.0054+-0.6595
   stanford-crypto-aes          106.1432+-0.8878            105.0881+-0.3653            might be 1.0100x faster
   stanford-crypto-ccm           99.5565+-0.6853       ?    100.0071+-0.7250       ?
   stanford-crypto-pbkdf2       235.4754+-0.3006       !    238.4268+-1.2700       !    definitely 1.0125x slower
   stanford-crypto-sha256-iterative 102.9337+-0.3225        102.6497+-0.5483

   <arithmetic> *               272.1021+-1.1641       ?    272.1705+-0.9473       ?    might be 1.0003x slower
   <geometric>                  216.1712+-0.8053       ?    216.3056+-0.6672       ?    might be 1.0006x slower
   <harmonic>                   174.8804+-0.5656       ?    175.0663+-0.5010       ?    might be 1.0011x slower
Filip Pizlo
Comment 10 2012-02-29 21:44:23 PST
Comment on attachment 129627 [details]
patch updated

View in context: https://bugs.webkit.org/attachment.cgi?id=129627&action=review

Awesome!

Question: does this mean that the start-index performance hack is no longer in effect at all? I.e., you're not even stopping the CSE when it gets to the earliest possible child of the node being replaced? That's not a problem - in fact, it's great if that's true. I had added that optimization prematurely, and it's somewhat comforting to know that it isn't necessary.

R=me.

> Source/JavaScriptCore/dfg/DFGByteCodeParser.cpp:616
> +    block->prepend(resultIndex);

Ha! I find it quite awesome that we see no regression even though prepend is an O(n) operation. Or am I missing something?
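[Editorial note] For context on the O(n) remark: prepending to a contiguous Vector-backed block shifts every existing element one slot to the right. This is a toy illustration of the cost, not the actual WTF::Vector or BasicBlock code:

```cpp
#include <vector>

using NodeIndex = unsigned;

// Inserting at the front of a contiguous array moves all n existing
// elements -- O(n) per prepend. It stays cheap in practice here because
// Phi prepends are rare relative to typical block sizes.
void prepend(std::vector<NodeIndex>& block, NodeIndex node)
{
    block.insert(block.begin(), node);
}
```

This is why keeping Phi nodes in a separate list (as discussed in the next comment) was also tried as an alternative.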
James Robinson
Comment 11 2012-02-29 21:45:39 PST
Sorry about the cc's and spam, the watchlist bot seems to be a bit off kilter today.
Yuqiang Xian
Comment 12 2012-02-29 22:38:07 PST
(In reply to comment #10)
> (From update of attachment 129627 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=129627&action=review
>
> Awesome!
>
> Question: does this mean that the start index performance hack is no longer in effect at all? I.e. you're not even stopping the CSE when it gets to the earliest possible child of the node being replaced? That's not a problem - in fact it's great if that's true. I had added that optimization prematurely and it's somewhat comforting to know that it isn't necessary.

According to the data, that may be true. But yes, you're right that it's possible to terminate the look-back process earlier. Maybe we can do that in a separate patch?

> R=me.
>
> > Source/JavaScriptCore/dfg/DFGByteCodeParser.cpp:616
> > +    block->prepend(resultIndex);
>
> Ha! I find it to be quite awesome that we see no regression even though prepend is an O(n) operation. Or am I missing something?

I originally had the same concern. I had another version that kept the Phi nodes in a separate block (just as they were originally handled), but I found no obvious performance difference compared to the current approach, so I threw that version out.
Yuqiang Xian
Comment 13 2012-02-29 23:40:57 PST