Bug 76770

Summary: Loop invariant code motion in JSC DFG
Product: WebKit
Reporter: Yuqiang Xian <yuqiang.xian>
Component: JavaScriptCore
Assignee: Yuqiang Xian <yuqiang.xian>
Status: NEW
Severity: Normal
CC: barraclough, fpizlo, rakuco, webkit.review.bot
Priority: P2
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Bug Depends on: 79899, 80415
Bug Blocks:
Attachments (flags: none):
- WIP
- WIP patch
- WIP patch
- WIP patch
- WIP patch
- WIP patch
- WIP patch
- Rebase

Description Yuqiang Xian 2012-01-20 22:26:33 PST
The current JSC DFG JIT lacks loop invariant code motion (LICM). This bug tracks an attempt to assess its feasibility.
A WIP patch has been made, but it definitely still has many problems to be solved.
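For readers unfamiliar with the optimization, here is a hand-written JavaScript illustration of what LICM achieves (this is a conceptual sketch, not JSC's actual IR transformation):

```javascript
// A hypothetical example: `a.length * scale` does not change across
// iterations, so an optimizer can hoist it out of the loop body into
// the loop pre-header, computing it once instead of every iteration.
function sumScaledBefore(a, scale) {
    var total = 0;
    for (var i = 0; i < a.length; ++i)
        total += a[i] + a.length * scale; // invariant recomputed each iteration
    return total;
}

// Equivalent code after hoisting the invariant expression:
function sumScaledAfter(a, scale) {
    var total = 0;
    var hoisted = a.length * scale;       // computed once, in the pre-header
    for (var i = 0; i < a.length; ++i)
        total += a[i] + hoisted;
    return total;
}
```

Both functions compute the same result; the second simply avoids redundant work per iteration.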
Comment 1 Yuqiang Xian 2012-01-20 23:07:19 PST
Created attachment 123434 [details]
WIP

WIP patch. NOT ready for review.
Comment 2 Yuqiang Xian 2012-01-31 01:16:57 PST
Created attachment 124682 [details]
WIP patch

This patch passes the tests (tested on Linux IA32) and demonstrates a performance benefit on several benchmark cases. However, on some cases (especially SunSpider) obvious regressions are observed. Further investigation is ongoing.
Comment 3 Yuqiang Xian 2012-02-01 01:37:36 PST
Created attachment 124904 [details]
WIP patch

This patch fixes the performance regression on the SunSpider benchmark, which was caused by LICM revealing some CFA problems that led to OSR failures. With this fix, no obvious performance regression is observed on Kraken or V8 either. However, two cases appear to have correctness problems: v8-crypto and Kraken imaging-gaussian-blur. Still under investigation.
Comment 4 Yuqiang Xian 2012-02-01 06:28:06 PST
Created attachment 124942 [details]
WIP patch

Fixed the OSR entry problem: since we adjusted the entry point to the loop pre-header, we need to rearrange the placement of op_loop_hint, which is used to trigger loop optimizations and code replacement. With this fix the v8-crypto case passes.
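To illustrate the issue described above (a hand-written sketch of the concept, not JSC internals): op_loop_hint sits at the loop head and is where the VM decides whether to replace the running code with optimized code. Once LICM gives the loop a pre-header holding hoisted values, OSR must enter through that pre-header so those values exist before the loop body runs.

```javascript
// Conceptual sketch of a loop hot enough to trigger tier-up/OSR.
// The comments mark where the bytecode-level events would occur.
function hotLoop(n) {
    var invariant = n * 2;       // conceptually lives in the loop pre-header;
                                 // must run before any OSR entry into the loop
    var sum = 0;
    for (var i = 0; i < n; ++i)  // loop head: op_loop_hint fires here
        sum += invariant;
    return sum;
}
```

If OSR entered directly at the loop head, `invariant` would never have been computed, which is the failure mode this fix addresses by moving the entry point.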
Comment 5 Yuqiang Xian 2012-02-02 01:26:08 PST
Created attachment 125098 [details]
WIP patch

Fixed OSR exit issues after LICM. The hoisted loop-invariant operations should not impact OSR exits inside the loop, so some tweaks are made to ensure the state is correct at OSR exit.
Also fixed some issues related to nested loops. We have to copy the loop invariants of the outer loops into the inner loops: because an inner loop can be an OSR entry point, the hoisted nodes of the outer loop must be evaluated as well when we perform OSR at the inner loop. This, however, makes LICM for non-leaf (non-innermost) loops not just pointless but wasteful, unless we figure out a better solution in the future. So currently we only perform LICM for the innermost loops.
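A hand-written sketch of the nested-loop situation described above (an illustration of the concept, not DFG IR):

```javascript
// `outerInv` is invariant for the outer loop and `innerInv` for the
// inner loop. If OSR enters execution at the inner loop, the hoisted
// outer invariant must already have been evaluated -- which is why
// outer-loop invariants have to be copied into inner-loop pre-headers,
// making LICM on the non-innermost loops redundant.
function nested(n, m) {
    var outerInv = n + m;            // hoisted before the outer loop
    var result = 0;
    for (var i = 0; i < n; ++i) {
        var innerInv = outerInv * i; // hoisted into the inner pre-header
        for (var j = 0; j < m; ++j)  // inner loop: a possible OSR entry point
            result += innerInv + j;
    }
    return result;
}
```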

This passes the JavaScriptCore tests and the 3 standard benchmarks. Here's the performance result tested on Linux IA32 (5% improvement on Kraken):

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                               ToT                     LICM
SunSpider:
   3d-cube                                6.7024+-0.0844          6.6755+-0.0673
   3d-morph                              11.3632+-0.0167    ^    11.2263+-0.0446       ^ definitely 1.0122x faster
   3d-raytrace                            9.5948+-0.0627    ?     9.5970+-0.0621       ?
   access-binary-trees                    1.8342+-0.0272    ?     1.8512+-0.0423       ?
   access-fannkuch                       10.6713+-0.0564    ?    10.7571+-0.0341       ?
   access-nbody                           5.4339+-0.0592          5.4261+-0.0802
   access-nsieve                          3.9822+-0.0382    ?     4.0274+-0.0529       ? might be 1.0113x slower
   bitops-3bit-bits-in-byte               1.1977+-0.0334    ?     1.2104+-0.0367       ? might be 1.0107x slower
   bitops-bits-in-byte                    4.6181+-0.0481    ?     4.6247+-0.0422       ?
   bitops-bitwise-and                     4.2526+-0.0515    ?     4.2663+-0.0442       ?
   bitops-nsieve-bits                     6.7337+-0.1113    ?     6.7368+-0.1024       ?
   controlflow-recursive                  2.8045+-0.0335          2.7933+-0.0401
   crypto-aes                             9.3455+-0.1455    !     9.6886+-0.1714       ! definitely 1.0367x slower
   crypto-md5                             3.0062+-0.0348    !     3.0934+-0.0524       ! definitely 1.0290x slower
   crypto-sha1                            2.6093+-0.0355    !     2.7390+-0.0361       ! definitely 1.0497x slower
   date-format-tofte                     12.2485+-0.0833         12.1951+-0.0920
   date-format-xparb                     11.6320+-0.1265         11.6023+-0.1296
   math-cordic                            9.1956+-0.0535          9.1807+-0.0562
   math-partial-sums                     13.9002+-0.0539         13.8929+-0.0490
   math-spectral-norm                     2.6203+-0.0485    ?     2.6545+-0.0430       ? might be 1.0131x slower
   regexp-dna                             8.6963+-0.1054    ?     8.7113+-0.0790       ?
   string-base64                          5.6049+-0.1043    !     5.9174+-0.0527       ! definitely 1.0558x slower
   string-fasta                           9.0395+-0.0556    ?     9.0681+-0.0532       ?
   string-tagcloud                       15.5883+-0.0835    ?    15.6098+-0.0709       ?
   string-unpack-code                    24.9881+-0.0437    !    25.1207+-0.0666       ! definitely 1.0053x slower
   string-validate-input                  7.2133+-0.0404    !     7.5982+-0.0787       ! definitely 1.0534x slower

   <arithmetic> *                         7.8799+-0.0311    ?     7.9332+-0.0308       ? might be 1.0068x slower
   <geometric>                            6.2946+-0.0264    !     6.3572+-0.0291       ! definitely 1.0099x slower
   <harmonic>                             4.8063+-0.0290    ?     4.8670+-0.0343       ? might be 1.0126x slower

                                               ToT                     LICM
V8:
   crypto                                95.9089+-0.4916    ?    96.6358+-0.3024       ?
   deltablue                            161.3032+-1.3059    !   169.7393+-0.6635       ! definitely 1.0523x slower
   earley-boyer                         111.0482+-0.1805    !   111.5194+-0.2230       ! definitely 1.0042x slower
   raytrace                              53.5053+-0.3817    ?    53.8410+-0.4282       ?
   regexp                               101.7867+-0.7919        101.7171+-0.5630
   richards                             178.0875+-0.3938        176.8360+-0.8624
   splay                                 79.9467+-0.3718    ?    80.1138+-0.5397       ?
 
   <arithmetic>                         111.6552+-0.2536    !   112.9146+-0.1850       ! definitely 1.0113x slower
   <geometric> *                        104.1782+-0.2344    !   105.1263+-0.2347       ! definitely 1.0091x slower
   <harmonic>                            96.8544+-0.2634    !    97.5583+-0.3083       ! definitely 1.0073x slower

                                               ToT                     LICM
Kraken:
   ai-astar                             782.8823+-1.0568    ^   633.1314+-0.8283       ^ definitely 1.2365x faster
   audio-beat-detection                 239.2905+-0.9386    !   244.7953+-0.2533       ! definitely 1.0230x slower
   audio-dft                            367.3076+-5.3766    ^   334.5632+-1.9583       ^ definitely 1.0979x faster
   audio-fft                            156.4270+-1.7812    ^   152.5042+-0.2279       ^ definitely 1.0257x faster
   audio-oscillator                     344.2403+-1.5722    ^   340.8509+-1.6079       ^ definitely 1.0099x faster
   imaging-darkroom                     381.9911+-8.0960        381.6292+-8.9160
   imaging-desaturate                   304.5713+-0.6926    ?   304.8006+-0.7142       ?
   imaging-gaussian-blur                591.3785+-3.0221    ^   585.7728+-1.4615       ^ definitely 1.0096x faster
   json-parse-financial                  68.3235+-0.5355    !    69.5250+-0.3486       ! definitely 1.0176x slower
   json-stringify-tinderbox             103.2334+-1.0090        103.1331+-1.0498
   stanford-crypto-aes                  115.8342+-0.3116        115.7750+-0.5165
   stanford-crypto-ccm                  117.2265+-0.7123    ?   117.2419+-0.6766       ?
   stanford-crypto-pbkdf2               231.5518+-0.2709    !   233.3329+-0.6611       ! definitely 1.0077x slower
   stanford-crypto-sha256-iterative     100.2526+-0.1269    ?   100.3860+-0.2462       ?

   <arithmetic> *                       278.8936+-1.2346    ^   265.5315+-0.8722       ^ definitely 1.0503x faster
   <geometric>                          218.5252+-0.9039    ^   213.8505+-0.6057       ^ definitely 1.0219x faster
   <harmonic>                           173.1445+-0.6771        172.3007+-0.4488         might be 1.0049x faster

                                               ToT                     LICM
All benchmarks:
   <arithmetic>                         104.0633+-0.3907    ^   100.3002+-0.2844       ^ definitely 1.0375x faster
   <geometric>                           27.5025+-0.0963    ?    27.5131+-0.0953       ? might be 1.0004x slower
   <harmonic>                             8.4492+-0.0499    ?     8.5531+-0.0588       ? might be 1.0123x slower

                                               ToT                     LICM
Geomean of preferred means:
   <scaled-result>                       61.1753+-0.1828    ^    60.5004+-0.1703       ^ definitely 1.0112x faster
Comment 6 Yuqiang Xian 2012-02-03 01:27:29 PST
Created attachment 125286 [details]
WIP patch

Updated patch.

Generate new Phi nodes for the new GetLocals created by LICM, and update the block variable information as well. This also requires Phi stack processing after LICM, so the Phi stack processing logic is moved from DFGByteCodeParser to DFGGraph. With this we can correctly propagate abstract values through blocks.

Performance results are below (tested on Linux IA32): still a 5% improvement on Kraken, but a ~1% loss on V8 (mostly due to a 5% regression on deltablue, under investigation).

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                               ToT                     LICM
SunSpider:
   3d-cube                                6.7185+-0.0922          6.6918+-0.0762
   3d-morph                              11.3808+-0.0780         11.2993+-0.1254
   3d-raytrace                            9.5044+-0.0562    ?     9.5743+-0.0985       ?
   access-binary-trees                    1.9286+-0.0692          1.9212+-0.0414
   access-fannkuch                       10.7387+-0.0590    ?    10.8040+-0.0870       ?
   access-nbody                           5.4588+-0.0946          5.4444+-0.0900
   access-nsieve                          3.9970+-0.0378    ?     4.0700+-0.0391       ? might be 1.0183x slower
   bitops-3bit-bits-in-byte               1.1835+-0.0353    ?     1.2043+-0.0374       ? might be 1.0176x slower
   bitops-bits-in-byte                    4.6397+-0.0533    ?     4.6637+-0.0567       ?
   bitops-bitwise-and                     4.2493+-0.0640    ?     4.3000+-0.0913       ? might be 1.0119x slower
   bitops-nsieve-bits                     6.8605+-0.0976          6.8071+-0.1341
   controlflow-recursive                  2.8164+-0.0562          2.7959+-0.0489
   crypto-aes                             9.3053+-0.1653    !     9.6577+-0.1640       ! definitely 1.0379x slower
   crypto-md5                             3.0033+-0.0699    ?     3.0361+-0.0668       ? might be 1.0109x slower
   crypto-sha1                            2.6658+-0.0808    ?     2.7173+-0.0983       ? might be 1.0193x slower
   date-format-tofte                     12.2253+-0.1308         12.1408+-0.0981
   date-format-xparb                     11.8267+-0.1292    ?    11.9841+-0.1144       ? might be 1.0133x slower
   math-cordic                            9.2232+-0.0671          9.2085+-0.0893
   math-partial-sums                     13.9079+-0.0590         13.8841+-0.0717
   math-spectral-norm                     2.6601+-0.0649          2.6535+-0.0245
   regexp-dna                             8.6890+-0.0961    ?     8.7138+-0.1023       ?
   string-base64                          5.5912+-0.1123    ?     5.6279+-0.1022       ?
   string-fasta                           9.0830+-0.0981          9.0396+-0.0898
   string-tagcloud                       15.6098+-0.1030         15.5510+-0.0992
   string-unpack-code                    25.1136+-0.1364    ?    25.3173+-0.1788       ?
   string-validate-input                  7.1972+-0.0982    !     7.3726+-0.0725       ! definitely 1.0244x slower

   <arithmetic> *                         7.9068+-0.0292    ?     7.9416+-0.0340       ? might be 1.0044x slower
   <geometric>                            6.3250+-0.0232    ?     6.3579+-0.0324       ? might be 1.0052x slower
   <harmonic>                             4.8379+-0.0248    ?     4.8711+-0.0402       ? might be 1.0069x slower

                                               ToT                     LICM
V8:
   crypto                                96.5485+-0.4716    ?    97.1088+-0.3119       ?
   deltablue                            161.8844+-1.1480    !   169.9394+-0.6498       ! definitely 1.0498x slower
   earley-boyer                         110.8333+-0.2826    ?   111.4147+-0.3258       ?
   raytrace                              53.8090+-0.3647    ?    54.4755+-0.8030       ? might be 1.0124x slower
   regexp                               102.4384+-0.5489        101.6221+-0.3126
   richards                             178.8249+-0.6137    ^   177.3008+-0.6436       ^ definitely 1.0086x faster
   splay                                 80.6917+-0.8164         80.3112+-0.4530

   <arithmetic>                         112.1472+-0.1864    !   113.1675+-0.1549       ! definitely 1.0091x slower
   <geometric> *                        104.6812+-0.2472    !   105.4402+-0.2689       ! definitely 1.0073x slower
   <harmonic>                            97.3662+-0.3320    ?    97.9646+-0.4274       ? might be 1.0061x slower

                                               ToT                     LICM
Kraken:
   ai-astar                             788.3627+-1.2834    ^   638.5571+-1.1142       ^ definitely 1.2346x faster
   audio-beat-detection                 240.0238+-1.4097    ^   238.3298+-0.2238       ^ definitely 1.0071x faster
   audio-dft                            368.1261+-3.4572    ^   331.3670+-3.5713       ^ definitely 1.1109x faster
   audio-fft                            154.1395+-0.1606    ^   153.0288+-0.1039       ^ definitely 1.0073x faster
   audio-oscillator                     347.4398+-2.4043    ^   339.9551+-1.3986       ^ definitely 1.0220x faster
   imaging-darkroom                     378.8966+-9.4751        377.5207+-8.5424
   imaging-desaturate                   305.9519+-0.7818    ?   307.2358+-1.8061       ?
   imaging-gaussian-blur                593.0994+-1.0883    ^   591.0285+-0.7286       ^ definitely 1.0035x faster
   json-parse-financial                  67.7004+-0.4985    !    69.3780+-0.6205       ! definitely 1.0248x slower
   json-stringify-tinderbox             102.5560+-0.5984        101.4284+-0.5632         might be 1.0111x faster
   stanford-crypto-aes                  115.9111+-0.4111    ^   114.9546+-0.1903       ^ definitely 1.0083x faster
   stanford-crypto-ccm                  116.3800+-0.8529    ?   116.6237+-1.0234       ?
   stanford-crypto-pbkdf2               234.4627+-1.4396    ?   235.7790+-1.5338       ?
   stanford-crypto-sha256-iterative     100.5601+-0.0904    !   101.0498+-0.2687       ! definitely 1.0049x slower

   <arithmetic> *                       279.5436+-1.0670    ^   265.4454+-0.9724       ^ definitely 1.0531x faster
   <geometric>                          218.5138+-0.7245    ^   213.3139+-0.7691       ^ definitely 1.0244x faster
   <harmonic>                           172.6914+-0.5036        171.6888+-0.6281         might be 1.0058x faster

                                               ToT                     LICM
All benchmarks:
   <arithmetic>                         104.3451+-0.3442    ^   100.3168+-0.3114       ^ definitely 1.0402x faster
   <geometric>                           27.5953+-0.0789         27.5063+-0.1031         might be 1.0032x faster
   <harmonic>                             8.5033+-0.0422    ?     8.5600+-0.0688       ? might be 1.0067x slower

                                               ToT                     LICM
Geomean of preferred means:
   <scaled-result>                       61.3911+-0.1767    ^    60.5751+-0.1816       ^ definitely 1.0135x faster
Comment 7 Yuqiang Xian 2012-02-03 01:30:47 PST
CC JSC experts. :)
Comment 8 Filip Pizlo 2012-02-03 10:41:46 PST
This looks awesome!  I'll dive in to try to understand the details shortly...
Comment 9 Filip Pizlo 2012-02-03 12:23:49 PST
Comment on attachment 125286 [details]
WIP patch

View in context: https://bugs.webkit.org/attachment.cgi?id=125286&action=review

This is a really awesome piece of work.  But I have the following concerns:

1) The fact that the DFG compiler uses - and relies heavily on! - indices of nodes and blocks is a flaw that we will probably correct.  We should probably move to a world where blocks have no fixed ordering, and nodes are ordered but not indexed, so that nodes can be interjected between other nodes.  Imagine that the DFG::Graph was a linked list of nodes rather than an array - that would be one way to do it.  In this patch you are adding more code that fiddles with our weird indexing approach, both as a simplification to your approach and as a necessity.  You rely on block indices to find loops, for instance - even though your patch breaks the very indexing that it is relying on.  You then have loads of code to deal with the craziness that is our Node vector - probably that code would have been simpler if you could have just inserted nodes between other nodes.

2) The fact that the DFG compiler has SetLocal/GetLocal nodes and no notion of direct data flow links from one basic block to another is a flaw that we will probably also correct.  The DFG should be an SSA compiler; the fact that it's not that way yet is reflective of our development strategy (get the tiered compilation and basic code gen right before moving to harder things) rather than our technical vision.  Your patch both relies on, and is made more complicated by, the lack of SSA and the use of these "local variables".

3) The phase introduced in this patch relies on properties of the DFG graph (like the meaning conveyed by the ordering of blocks) that it then promptly breaks.  Hence I fear that if we were to take this, we would then not be able to implement any other optimizations.

I'll have to think about this patch more, but right now I'm leaning towards the following suggestion: it would likely be far more valuable if we went to implementing SSA, and got rid of the mess that is SetArgument/SetLocal/GetLocal/Phi/Flush, than trying to teach the current beast these kinds of highly technical optimizations.  My guess is that LICM would be far cleaner, easier to implement, and more maintainable if we could implement it on the DFG *after* we've moved to something like SSA.  But maybe I could be convinced otherwise…  it's a tough call.
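For readers following along, the IR distinction drawn in point 2 can be sketched on a simple join point (a hand-written illustration; the IR fragments in the comments are schematic, not actual DFG or SSA output):

```javascript
// In the current DFG, a value crossing a block boundary goes through
// named local slots via SetLocal/GetLocal pairs; in SSA form, every
// definition gets a unique name and the join point uses a Phi node:
//
//   current DFG (locals):          SSA:
//     block A: SetLocal(x, 1)        block A: x1 = 1
//     block B: SetLocal(x, 2)        block B: x2 = 2
//     block C: GetLocal(x)           block C: x3 = Phi(x1, x2)
function join(cond) {
    var x;
    if (cond)
        x = 1;   // block A
    else
        x = 2;   // block B
    return x;    // block C: value merged from both predecessors
}
```

Under SSA, LICM could rewire uses of a hoisted node directly; with locals, the hoist must thread results through SetLocal/GetLocal slots, which is the complication discussed in this patch.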

> Source/JavaScriptCore/dfg/DFGGraph.cpp:127
> +    bool mustGenerate = refCount && node.mustGenerate();

Is this change necessary?  We use the mustGenerate bit to print "!" to indicate that regardless of ref count the node must be generated.  Separately, we marked nodes "skipped" if their refCount is zero.

> Source/JavaScriptCore/dfg/DFGJITCompiler.h:324
> +            if (!basicBlock.cfaHasVisited || basicBlock.variablesAtHead.argument(argument) == NoNode)

This looks very dangerous.  Do we ever want to enter into a block for which there is no CFA information?  Do loop pre headers do all speculation checks that the CFA assumes would have been performed?

> Source/JavaScriptCore/dfg/DFGPropagator.cpp:2091
> +                if (replacement < m_graph.size() - 1 && m_graph[replacement + 1].op == SetLocal)

It is not reliable to assume that the next node being a SetLocal means something.  Examples where this might go wrong include, but are not limited to, unsigned bit shifts or post-inc and post-dec.

> Source/JavaScriptCore/dfg/DFGPropagator.cpp:2304
> +                    performHoist(m_compileIndex, preHeaderBlock, hoisted);
> +                else if (shouldGenerate && !hoistedChildren.isEmpty()) {
> +                    // This node cannot be hoisted but it contains some references to the hoisted nodes.
> +                    // Adjust such references by utilizing SetLocal/GetLocal for cross block references.
> +                    for (unsigned i = 0; i < hoistedChildren.size(); ++i)
> +                        handleHoistedChild(blockIndex, hoistedChildren[i], loopInfo);

Let me try to understand: you are either hoisting (which just copies the node) or you are fixing references from the node to its children.  The latter case may effectively "unhoist" previously hoisted nodes, if you find that they did not have SetLocal nodes that you could easily find.  Right?
Comment 10 Yuqiang Xian 2012-02-04 01:59:36 PST
(In reply to comment #9)

Filip,

Thanks a lot for the comments. Yes, I actually agree with you completely on the first two major concerns, and they did trouble me a lot, but I was not aggressive enough to try to change the IR form, so I tried to live with the current DFG IR, which makes the optimization not as clean and easy to understand as it should be. This is the reason I want you to notice this work at the current stage, and to discuss how we should proceed.
For the 3rd major concern, I guess I don't fully understand it and maybe I need to think about it more carefully.

The responses to the specific code change concerns are inlined below.

> 
> > Source/JavaScriptCore/dfg/DFGGraph.cpp:127
> > +    bool mustGenerate = refCount && node.mustGenerate();
> 
> Is this change necessary?  We use the mustGenerate bit to print "!" to indicate that regardless of ref count the node must be generated.  Separately, we marked nodes "skipped" if their refCount is zero.
> 

Because the hoisted loop invariants can have a refCount of 0 even if they're marked "MustGenerate", in which case we're not going to generate code for them. This is actually caused by the "unhoisting" you mentioned in the last comment.

> > Source/JavaScriptCore/dfg/DFGJITCompiler.h:324
> > +            if (!basicBlock.cfaHasVisited || basicBlock.variablesAtHead.argument(argument) == NoNode)
> 
> This looks very dangerous.  Do we ever want to enter into a block for which there is no CFA information?  Do loop pre headers do all speculation checks that the CFA assumes would have been performed?
> 

This is mainly to fix the performance regression on SunSpider. We actually do sometimes enter a block without CFA information (due to loop OSR). A typical case is bitops-bitwise-and in SunSpider, where CFA is terminated before the loop because of a ForceOSRExit. And yes, I think the speculation checks should have been performed in codegen. Please correct me if I missed something.

> > Source/JavaScriptCore/dfg/DFGPropagator.cpp:2091
> > +                if (replacement < m_graph.size() - 1 && m_graph[replacement + 1].op == SetLocal)
> 
> It is not reliable to assume that the next node being a SetLocal means something.  Examples where this might go wrong include, but are not limited to, unsigned bit shifts or post-inc and post-dec.
> 

Sorry, I'm not quite sure of the specific concern here. I check the next node (a hoisted node in the pre-header) for a SetLocal simply to find a slot for saving the result of the current node, which will be used by a node in the loop body.

> > Source/JavaScriptCore/dfg/DFGPropagator.cpp:2304
> > +                    performHoist(m_compileIndex, preHeaderBlock, hoisted);
> > +                else if (shouldGenerate && !hoistedChildren.isEmpty()) {
> > +                    // This node cannot be hoisted but it contains some references to the hoisted nodes.
> > +                    // Adjust such references by utilizing SetLocal/GetLocal for cross block references.
> > +                    for (unsigned i = 0; i < hoistedChildren.size(); ++i)
> > +                        handleHoistedChild(blockIndex, hoistedChildren[i], loopInfo);
> 
> Let me try to understand: you are either hoisting (which just copies the node) or you are fixing references from the node to its children.  The latter case may effectively "unhoist" previously hoisted nodes, if you find that they did not have SetLocal nodes that you could easily find.  Right?

Exactly.
Comment 11 Yuqiang Xian 2012-02-05 23:12:42 PST
Created attachment 125578 [details]
WIP patch

Updated patch.

Fixed a performance issue on 64-bit. Below are the current performance results on 64-bit (Mac OS X x64): a 6% improvement on Kraken, neutral on SunSpider and V8, although a regression on bitops-bitwise-and is observed.

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                                1                       2
SunSpider:
   3d-cube                                5.8913+-0.0639          5.7923+-0.0798         might be 1.0171x faster
   3d-morph                               9.0526+-0.1103          8.8874+-0.0843         might be 1.0186x faster
   3d-raytrace                            8.2938+-0.0847          8.2594+-0.0928
   access-binary-trees                    1.9285+-0.0400    ?     1.9730+-0.1197       ? might be 1.0231x slower
   access-fannkuch                        7.3069+-0.2342    ?     7.3645+-0.1027       ?
   access-nbody                           3.9577+-0.0472          3.9405+-0.0286
   access-nsieve                          3.6470+-0.1076          3.6145+-0.0584
   bitops-3bit-bits-in-byte               1.2537+-0.0261    ^     1.1556+-0.0041       ^ definitely 1.0850x faster
   bitops-bits-in-byte                    5.0807+-0.0462    ?     5.1419+-0.0429       ? might be 1.0120x slower
   bitops-bitwise-and                     3.1656+-0.0106    !     3.6648+-0.0345       ! definitely 1.1577x slower
   bitops-nsieve-bits                     5.5132+-0.0480          5.5014+-0.0275
   controlflow-recursive                  2.2947+-0.0261          2.2853+-0.0230
   crypto-aes                             8.3225+-0.0566    !     8.6046+-0.0619       ! definitely 1.0339x slower
   crypto-md5                             2.5425+-0.0281    ?     2.5443+-0.0252       ?
   crypto-sha1                            2.6004+-0.3497          2.3593+-0.0270         might be 1.1022x faster
   date-format-tofte                     11.6510+-0.0687    ^    11.0868+-0.0504       ^ definitely 1.0509x faster
   date-format-xparb                     10.4985+-0.2820         10.2481+-0.3466         might be 1.0244x faster
   math-cordic                            7.5448+-0.7898          7.4206+-0.5456         might be 1.0167x faster
   math-partial-sums                     10.5491+-0.0809    ^    10.3707+-0.0560       ^ definitely 1.0172x faster
   math-spectral-norm                     2.6146+-0.0112    !     2.7388+-0.1040       ! definitely 1.0475x slower
   regexp-dna                             8.9686+-0.0631          8.8430+-0.0749         might be 1.0142x faster
   string-base64                          4.9485+-0.0286    ?     4.9680+-0.0241       ?
   string-fasta                           7.8256+-0.0639    ?     7.9127+-0.0954       ? might be 1.0111x slower
   string-tagcloud                       13.2538+-0.1385         13.1118+-0.1419         might be 1.0108x faster
   string-unpack-code                    21.7408+-0.2276    ?    22.1923+-0.5740       ? might be 1.0208x slower
   string-validate-input                  6.2486+-0.0706    ?     6.2526+-0.0283       ?

   <arithmetic> *                         6.7960+-0.0357          6.7782+-0.0337         might be 1.0026x faster
   <geometric>                            5.5197+-0.0369          5.5096+-0.0249         might be 1.0018x faster
   <harmonic>                             4.3784+-0.0444          4.3500+-0.0315         might be 1.0065x faster

                                                1                       2
V8:
   crypto                                77.1976+-0.3123    ^    76.6700+-0.1480       ^ definitely 1.0069x faster
   deltablue                            150.1328+-1.1223    !   153.9579+-0.8058       ! definitely 1.0255x slower
   earley-boyer                          96.8689+-3.0327         96.2447+-3.0111
   raytrace                              50.0781+-0.1724    ^    49.7800+-0.1110       ^ definitely 1.0060x faster
   regexp                                96.1729+-0.3955    ?    96.3467+-0.3682       ?
   richards                             133.8968+-0.2056    !   136.5811+-0.7893       ! definitely 1.0200x slower
   splay                                 69.9721+-0.6299         69.4475+-0.7878

   <arithmetic>                          96.3313+-0.4623    ?    97.0040+-0.5518       ? might be 1.0070x slower
   <geometric> *                         90.7276+-0.4666    ?    90.9880+-0.5441       ? might be 1.0029x slower
   <harmonic>                            85.2748+-0.4300         85.2214+-0.4890         might be 1.0006x faster

                                                1                       2
Kraken:
   ai-astar                             795.3290+-0.5676    ^   636.2741+-0.7378       ^ definitely 1.2500x faster
   audio-beat-detection                 193.7537+-9.1111        185.6565+-0.4666         might be 1.0436x faster
   audio-dft                            456.0760+-0.8572    ^   443.2001+-0.8911       ^ definitely 1.0291x faster
   audio-fft                            122.3903+-7.5545        116.6115+-0.1681         might be 1.0496x faster
   audio-oscillator                     281.1231+-3.1010    ^   267.4259+-3.1168       ^ definitely 1.0512x faster
   imaging-darkroom                     295.2771+-6.0876        294.8097+-7.3344
   imaging-desaturate                   228.0868+-0.2582    ?   229.5187+-3.8357       ?
   imaging-gaussian-blur                501.9180+-0.1475    ^   499.6097+-0.2014       ^ definitely 1.0046x faster
   json-parse-financial                  60.5922+-3.3409         59.2392+-0.0472         might be 1.0228x faster
   json-stringify-tinderbox              83.4101+-0.1794    ^    78.7011+-0.2180       ^ definitely 1.0598x faster
   stanford-crypto-aes                   99.6793+-0.4575    ?   101.0411+-1.0317       ? might be 1.0137x slower
   stanford-crypto-ccm                  100.3830+-0.8947    ?   101.0337+-0.5908       ?
   stanford-crypto-pbkdf2               200.0138+-2.1655        199.5086+-3.1952
   stanford-crypto-sha256-iterative     103.3791+-0.5305        103.2712+-0.3110

   <arithmetic> *                       251.5294+-1.3646    ^   236.8501+-0.7498       ^ definitely 1.0620x faster
   <geometric>                          190.7950+-1.5686    ^   184.7663+-0.4824       ^ definitely 1.0326x faster
   <harmonic>                           149.8594+-1.5736    ^   146.7945+-0.2371       ^ definitely 1.0209x faster
Comment 12 Yuqiang Xian 2012-02-06 01:11:31 PST
(In reply to comment #9)
> (From update of attachment 125286 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=125286&action=review
> 
> 1) You rely on block indices to find loops, for instance - even though your patch breaks the very indexing that it is relying on.  
> 
> 3) The phase introduced in this patch relies on properties of the DFG graph (like the meaning conveyed by the ordering of blocks) that it then promptly breaks. 
> 

Sorry, but I guess I forgot one response. IMHO I'm not breaking the block indexing in the patch. I'm not creating or reordering the blocks, nor changing the control flow. The pre-header blocks are reserved at the very beginning; before LICM each contains only a single Jump to the loop header. Thanks.
Comment 13 Filip Pizlo 2012-02-06 01:13:59 PST
(In reply to comment #12)
> (In reply to comment #9)
> > (From update of attachment 125286 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=125286&action=review
> > 
> > 1) You rely on block indices to find loops, for instance - even though your patch breaks the very indexing that it is relying on.  
> > 
> > 3) The phase introduced in this patch relies on properties of the DFG graph (like the meaning conveyed by the ordering of blocks) that it then promptly breaks. 
> > 
> 
> Sorry, but I guess I forgot one response. IMHO I'm not breaking the block indexing in the patch. I'm not creating or reordering the blocks, nor changing the control flow. The pre-header blocks are reserved at the very beginning; before LICM each contains only a single Jump to the loop header. Thanks.

Ah, that's a good point.  I was confusing node ordering (you add nodes at the end) with block ordering.
Comment 14 Filip Pizlo 2012-02-06 01:18:02 PST
> > > Source/JavaScriptCore/dfg/DFGJITCompiler.h:324
> > > +            if (!basicBlock.cfaHasVisited || basicBlock.variablesAtHead.argument(argument) == NoNode)
> > 
> > This looks very dangerous.  Do we ever want to enter into a block for which there is no CFA information?  Do loop pre headers do all speculation checks that the CFA assumes would have been performed?
> > 
> 
> This is mainly to fix the performance regression on SunSpider. Actually we sometimes do enter a block without CFA information (due to loop OSR). A typical case is bitops-bitwise-and in SunSpider, where CFA is terminated before the loop because of a ForceOSRExit. And yes, I think the speculation checks should have been performed in codegen. Please correct me if I have missed something.

If we do enter a loop through OSR without CFA information then it is simply a bug.  We should never do that.  If we do it then we should fix it!

Separate bug filed: https://bugs.webkit.org/show_bug.cgi?id=77858
Comment 15 Filip Pizlo 2012-02-06 01:28:32 PST
The main issue I have with this patch is that it feels premature to implement this optimization given the state of the DFG IR.  It feels to me like we should proceed as follows:

- Fix register allocation and speculation bugs.  For example, register allocation and operand speculation will cause terminateSpeculativeExecution() to be called if we use an int-predicted variable in a double context, and then after that try to use it in an int context.  The double use will convert it to a double and we'll forget that it was actually an integer!

- Get rid of reference counting of nodes, at least in the early stages of graph building and optimization.  We had *lots* of bugs come in on random web pages because of bad handling of node reference counts in the fixup and CSE phases.  It would be great if we didn't do any reference counting at all, and instead relied on traditional liveness analysis for DCE.

- Switch to SSA.  This should be easy.  We already have all of the right things in place: we have an IR in which using a value means referring to its definition.  All that this will take is implementing a fast dominator tree builder (I was thinking of just using Lengauer-Tarjan) and then using it to place Phi nodes.

- Get rid of the notion that a basic block consists of a contiguous set of node indices.  This is probably the worst part of our IR, since it means that inserting a node in between nodes is really hard.

- Introduce a CFG simplification pass, probably formulated using sparse conditional constant propagation so that it does two things at once.  We might even be able to fold this into the CFA.
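As an aside on the dominator-tree step above: here is a rough, hypothetical sketch of the simple iterative fixpoint alternative to Lengauer-Tarjan. This is not JSC code; the function and types are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical sketch (not JSC code): compute dominator sets for a small CFG
// with the classic iterative dataflow fixpoint. Block 0 is the entry;
// preds[i] lists the predecessors of block i.
std::vector<std::set<size_t>> computeDominators(const std::vector<std::vector<size_t>>& preds)
{
    size_t n = preds.size();
    std::set<size_t> all;
    for (size_t i = 0; i < n; ++i)
        all.insert(i);

    std::vector<std::set<size_t>> dom(n, all); // Start from "everything dominates everything".
    dom[0] = { 0 }; // The entry is dominated only by itself.

    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 1; i < n; ++i) {
            // dom(i) = {i} union (intersection of dom(p) over all predecessors p).
            std::set<size_t> next = all;
            for (size_t p : preds[i]) {
                std::set<size_t> intersection;
                for (size_t d : next) {
                    if (dom[p].count(d))
                        intersection.insert(d);
                }
                next = intersection;
            }
            next.insert(i);
            if (next != dom[i]) {
                dom[i] = next;
                changed = true;
            }
        }
    }
    return dom;
}
```

For a diamond-shaped CFG (block 0 branches to 1 and 2, which both jump to 3), this yields dom(3) = {0, 3}, since neither arm dominates the join. Lengauer-Tarjan is asymptotically better, but the fixpoint is hard to get wrong.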

It feels to me like if we accomplish the above - which should not be terribly difficult - then we'd be in a far better position to implement awesome optimizations, particularly since it would be easier to compartmentalize them since the inter-phase invariants of the IR would be crystal clear.

As another thought, I'm kind of curious why you didn't formulate this using loop peeling rather than LICM.  Peeling is far less error-prone and would likely give you the same speed-ups.  You could even do peeling during bytecode parsing using a similar set of tricks as what we used for inlining...  But even there, I feel like those tricks are too messy.  So many bugs I had to fix because of basic block linking in the parser!  If we had a CFG simplification pass - which we currently can't do, because the lack of SSA means that we can't easily slam two blocks together without then having to do painful SetLocal/GetLocal fixup - then several hundred lines of awful code in the parser could be killed off.
Comment 16 Yuqiang Xian 2012-02-07 00:22:54 PST
(In reply to comment #15)
> 
> As another thought, I'm kind of curious why you didn't formulate this using loop peeling rather than LICM.  Peeling is far less error-prone and would likely give you the same speed-ups.  

I'm not sure if I fully understand it, but IMO without global CSE to eliminate the redundant (that is, loop-invariant) expressions in the loop body, we cannot get the benefit of LICM from peeling, but will cause code bloat.
Comment 17 Filip Pizlo 2012-02-07 00:38:23 PST
(In reply to comment #16)
> (In reply to comment #15)
> > 
> > As another thought, I'm kind of curious why you didn't formulate this using loop peeling rather than LICM.  Peeling is far less error-prone and would likely give you the same speed-ups.  
> 
> I'm not sure if I fully understand it, but IMO without global CSE to eliminate the redundant (that is, loop-invariant) expressions in the loop body, we cannot get the benefit of LICM from peeling, but will cause code bloat.

You're mostly right, except loop peeling by itself would buy you more speculation guard eliminations through CFA.  So it might not be a bad first step.  Of course, as with any loop peeling implementation, you'd want to tune it to target loops that were relatively small to begin with.

But also, GCSE is *far* easier to implement than LICM.  Particularly if you tune it right.  You could imagine optimizing for loops whose body consists of <=N (for some small N) basic blocks.  This makes the control flow really easy to deal with in GCSE: essentially the CSE look-back has to consider, if it gets to the top of the block, whether there is either (a) just one predecessor or (b) one of the predecessors that dominates the others.
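A rough sketch of that look-back rule, with illustrative names only (not the real CSE phase), assuming dominator sets have already been computed:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical sketch: decide which predecessor, if any, a CSE look-back may
// continue into once it reaches the top of a block.
struct BlockSketch {
    std::vector<size_t> predecessors;
};

const size_t notFound = static_cast<size_t>(-1);

// dominators[i] is the set of blocks that dominate block i.
size_t lookBackPredecessor(const std::vector<BlockSketch>& blocks, size_t blockIndex,
    const std::vector<std::set<size_t>>& dominators)
{
    const std::vector<size_t>& preds = blocks[blockIndex].predecessors;
    if (preds.empty())
        return notFound;
    if (preds.size() == 1)
        return preds[0]; // Case (a): just one predecessor.
    for (size_t candidate : preds) {
        bool dominatesAll = true;
        for (size_t other : preds) {
            if (!dominators[other].count(candidate))
                dominatesAll = false;
        }
        if (dominatesAll)
            return candidate; // Case (b): one predecessor dominates the others.
    }
    return notFound; // Otherwise the scan must stop at the block boundary.
}
```

In a diamond CFG the look-back stops at the join (neither arm dominates the other), but it continues through straight-line chains of blocks.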

Moreover, GCSE+peeling would probably give us more wins on other benchmarks that LICM by itself would not benefit.
Comment 18 Yuqiang Xian 2012-02-07 01:12:11 PST
(In reply to comment #17)
> But also, GCSE is *far* easier to implement than LICM.  Particularly if you tune it right.  You could imagine optimizing for loops whose body consisted of <=N (for some small N) basic blocks.  This makes the control flow really easy to deal with in GCSE - essentially the CSE look-back has to consider, if it gets to the top of the block, whether there is either (a) just one predecessor or (b) one of the predecessors dominates the others.

Based on the current DFG IR, I wonder whether peeling+GCSE is really easier to implement than LICM... But with the improved IR (e.g., SSA) you mentioned, GCSE might be easier, and in that case LICM should be simplified as well. :)

> 
> Moreover, GCSE+peeling would probably give us more wins on other benchmarks that LICM by itself would not benefit.

Agree.
Comment 19 Filip Pizlo 2012-02-07 10:32:02 PST
(In reply to comment #18)
> (In reply to comment #17)
> > But also, GCSE is *far* easier to implement than LICM.  Particularly if you tune it right.  You could imagine optimizing for loops whose body consisted of <=N (for some small N) basic blocks.  This makes the control flow really easy to deal with in GCSE - essentially the CSE look-back has to consider, if it gets to the top of the block, whether there is either (a) just one predecessor or (b) one of the predecessors dominates the others.
> 
> If based on current DFG IR, I wonder whether peeling+GCSE is really easier than LICM to implement... But if with the improved IR (e.g., SSA) as you mentioned, GCSE might be easier - and in such case LICM should be simplified as well. :)

There are two other issues on my mind:

1) This patch's reliance on the peculiarities of the current IR will make changing IRs more difficult. And I am working on changing the IR. With any luck I'll have something that resembles SSA within a month. And then we can do GVN+GCSE+peeling+unrolling. 

2) I find it hard to visualize how to get all of the cases of OSR (both entry and exit, but mostly exit) right in the presence of *any* code motion. Moving code means that we no longer have a clear OSR exit target in the old JIT. I'm not sure your patch gets this right, and even if it does get it mostly right, it *will* have a nasty bug tail. Since code motion (and hoisting in particular) is not particularly effective in general for effectful languages that also have loads of safety checks (I remember that in Java it gave you nothing), and since our only LICM wins are in Kraken, which is a questionable and mostly unrealistic benchmark that we already do OK on, I'm not sure it would be a wise cost-benefit trade-off.



> > 
> > Moreover, GCSE+peeling would probably give us more wins on other benchmarks that LICM by itself would not benefit.
> 
> Agree.
Comment 20 Yuqiang Xian 2012-02-07 18:07:06 PST
(In reply to comment #19)
> 
> 1) And I am working on changing the IR. With any luck I'll have something that resembles SSA within a month. And then we can do GVN+GCSE+peeling+unrolling. 

That's cool...

> 
> 2) I find it hard to visualize how to get all of the cases of OSR (both entry and exit but mostly exit) right in the presence of *any* code motion. Moving code means that we no longer have a clear OSR exit target in the old JIT. I'm not sure your patch gets this right, and even if it does get it mostly right, it *will* have a nasty bug tail. 

The patch tries to handle an OSR exit as if no hoisted expression had been evaluated, whether the OSR exit happens in the pre-header or in the loop body.

It (supposedly) achieves this by:
1) Not calling compileMovHint or setting the operand value source for the SetLocals in the pre-header, which means we're not changing the value source and lastSetOperand in the pre-header;
2) Making the codeOriginForOSR of *any* DFG node in the pre-header the origin of the original single node in the pre-header before LICM ("loop_preheader");
3) Keeping the hoisted nodes in the loop body rather than replacing them, but marking them "NotGenerate" (refcount 0), which means the SetLocals in the loop body are kept so that compileMovHint correctly records the operand value source and lastSetOperand.

These tweaks may result in some "wasteful" operations if the OSR exit happens in the pre-header (the loop-invariant evaluation before the OSR exit is useless, but on the other hand we exit earlier) or in the loop body (some operand values may already be in the register file thanks to the SetLocals in the pre-header, but we will (redundantly) re-store the value of the child node, actually a GetLocal, to the register file). But they are just meant to make the OSR exit correct.

But maybe I still have something missed...

> Since code motion (and hoisting in particular) is not particularly effective in general for effectful languages that also have loads of safety checks (I remember that in Java it gave you nothing), and since our only LICM wins are in Kraken, which is a questionable and mostly unrealistic benchmark that we already do OK on, I'm not sure it would be a wise cost-benefit trade-off.

I think one reason why we don't get a bigger improvement from LICM is that we have to store the result of the loop-invariant expression into memory first, and then read it back from memory in the loop body when it is used. If we had SSA, got rid of the SetLocal/GetLocals, and then had a global register allocator, the LICM improvement alone might be more promising. Also, we currently hoist nodes rather conservatively; with deeper side-effect analysis we might find more loop invariants.

But anyway, I'm open to this. It's your call. :)
Comment 21 Filip Pizlo 2012-02-07 18:15:55 PST
> The patch tries to handle an OSR exit as if no hoisted expression had been evaluated, whether the OSR exit happens in the pre-header or in the loop body.
> 
> It (supposedly) achieves this by:
> 1) Not calling compileMovHint or setting the operand value source for the SetLocals in the pre-header, which means we're not changing the value source and lastSetOperand in the pre-header;
> 2) Making the codeOriginForOSR of *any* DFG node in the pre-header the origin of the original single node in the pre-header before LICM ("loop_preheader");
> 3) Keeping the hoisted nodes in the loop body rather than replacing them, but marking them "NotGenerate" (refcount 0), which means the SetLocals in the loop body are kept so that compileMovHint correctly records the operand value source and lastSetOperand.
> 
> These tweaks may result in some "wasteful" operations if the OSR exit happens in the pre-header (the loop-invariant evaluation before the OSR exit is useless, but on the other hand we exit earlier) or in the loop body (some operand values may already be in the register file thanks to the SetLocals in the pre-header, but we will (redundantly) re-store the value of the child node, actually a GetLocal, to the register file). But they are just meant to make the OSR exit correct.
> 
> But maybe I still have something missed...

It seems like you probably covered everything.

But these mechanisms are already really confusing.  Also, you're adding to the already confusing "code origin" notion for a node: first we just had a bytecode index for a node, then we had bytecode index + inline stack, and then we had bytecode index + bytecode-index offset for the set operation + inline stack.

Now we will have bytecode index for debugging + bytecode index for set operation + inline stack + bytecode index for OSR exit target, the latter being conveyed implicitly by whether the node is in the pre header.

Short story: there will be bugs, and awesome ones at that. :-)

> 
> > Since code motion (and hoisting in particular) is not particularly effective in general for effectful languages that also have loads of safety checks (I remember that in Java it gave you nothing), and since our only LICM wins are in Kraken, which is a questionable and mostly unrealistic benchmark that we already do OK on, I'm not sure it would be a wise cost-benefit trade-off.
> 
> I think one reason why we don't get a bigger improvement from LICM is that we have to store the result of the loop-invariant expression into memory first, and then read it back from memory in the loop body when it is used. If we had SSA, got rid of the SetLocal/GetLocals, and then had a global register allocator, the LICM improvement alone might be more promising. Also, we currently hoist nodes rather conservatively; with deeper side-effect analysis we might find more loop invariants.

I've not seen deeper side effect analysis be a win, in the past.

And you're right - SSA will be a huge benefit for this code.

> 
> But anyway, I'm open to this. It's your call. :)

Still thinking about it.  It's a tough call.  This is an awesome patch; I'm just super worried about the bug parade that will ensue. ;-)
Comment 22 Yuqiang Xian 2012-02-27 21:41:00 PST
Created attachment 129185 [details]
Rebase

Don't worry... simply for back-up purpose. ;)
Comment 23 Filip Pizlo 2012-02-27 23:12:17 PST
(In reply to comment #22)
> Created an attachment (id=129185) [details]
> Rebase
> 
> Don't worry... simply for back-up purpose. ;)

I would be worried if you didn't back up a rebase! :-)

I've been thinking a lot about how to integrate your work while still maintaining my goals: maintainability and not impeding future optimizations.

I think we could do this by making the following specific refactorings, and then landing this patch, with modifications to take advantage of those refactorings:

1) We shouldn't have to do dominator-based natural loop inference.  That makes sense when the source language is gotos and labels.  In our case the source language is JS, which is structured.  I don't think it's even possible to have a back edge that isn't part of a natural loop.  The bytecode generator always knows when it's emitting a back edge, always knows what loop it's part of, and always knows all of the instructions that are part of the loop, and always knows where the header of the loop is.  It would be simpler, I think, if the bytecode generator simply recorded this information in CodeBlock and the DFG could just use it.  Better yet, we should preserve that information in the DFG's Graph and BasicBlocks and require all passes to preserve it as well.

2) BasicBlocks should not require that their nodes have contiguous indices in the graph.  A BasicBlock should instead have a Vector<NodeIndex> that references all of the block's nodes, in the order in which they are to be executed.  It should be possible to add nodes to a BasicBlock by (1) appending them to the graph and (2) appending their indices to the Vector<NodeIndex> in BasicBlock.

3) CodeOrigin should be able to say something like "this code originated from X but for OSR exit should be treated as if it was at Y".  This is in addition to also saying the point Z that corresponds to the code's value profiling site.  Currently CodeOrigin stores X and Z and assumes Y == X.  But allowing for a differentiation would make hoisting really easy and robust - the OSR exit code for code hoisted by LICM would simply know that the code should exit to the loop header because Y would be set to the loop header.

4) We should at least get rid of the conflation of SetLocal as a hint to OSR and SetLocal as a mechanism for preserving state across basic blocks.  The "ideal" way of doing this would be to go straight to SSA.  But it may be that we don't need to do something so drastic to adopt this patch.  The main thing I worry about - and let me know if your patch handles this gracefully - is that you could have an expression in the loop like "a:Foo(...); SetLocal(@a, r<blah>)" where @a is loop-invariant but r<blah> is a variable that is live at the loop header because it contained other things.  Hoisting this combination could cause badness.  What we really want is to hoist just @a and have some elegant way of referring to it.  If we really wanted to, we could just create a new temporary and use SetLocal/GetLocal on that.  But that feels yucky.
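Refactoring (3) above could be sketched roughly like this. The field and type names are made up for illustration; the real CodeOrigin differs.

```cpp
#include <cassert>

// Hypothetical sketch of a CodeOrigin that separates "where the code came
// from" (X), "where OSR exit should resume" (Y), and the value-profiling
// site (Z). Not the actual JSC CodeOrigin.
struct CodeOriginSketch {
    unsigned bytecodeIndex;  // X: the node's source location.
    unsigned osrExitTarget;  // Y: where an OSR exit should land.
    unsigned profilingSite;  // Z: the value-profiling site.

    explicit CodeOriginSketch(unsigned index)
        : bytecodeIndex(index)
        , osrExitTarget(index) // By default, Y == X.
        , profilingSite(index)
    {
    }

    // A node hoisted by LICM keeps its origin and profiling site, but its
    // OSR exit retargets to the loop header.
    CodeOriginSketch hoistedTo(unsigned loopHeaderIndex) const
    {
        CodeOriginSketch result(*this);
        result.osrExitTarget = loopHeaderIndex;
        return result;
    }
};
```

With something like this, the OSR exit machinery never needs to infer the exit target from whether a node happens to sit in the pre-header; the origin says so explicitly.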

It may be that (4) really does require SSA.  If so then maybe we should just do it.  It really shouldn't be that hard! :-)

But (1), (2), and (3) are totally doable.  I was thinking about doing them but I thought I'd throw it out here in case you could get around to it before me.  I'm bogged down with 64-bit LLInt right now, and some bizarre DFG register allocation pathologies. :-|

Please let me know if you like this development plan.  If so, feel free to create bugs for the subtasks I listed above - or any others you can think of - and make them blockers for this bug.

To be clear, I really like your LICM implementation; I just want to make sure that we integrate it in a way that makes it most profitable and least risky! :-)
Comment 24 Yuqiang Xian 2012-02-28 05:47:08 PST
(In reply to comment #23)
> 
> I would be worried if you didn't back up a rebase! :-)
> 
> I've been thinking a lot about how to integrate your work while still maintaining my goals: maintainability and not impeding future optimizations.

Thanks for taking care of this. :)

> 
> I think we could do this by making the following specific refactorings, and then landing this patch, with modifications to take advantage of those refactorings:
> 
> 1) We shouldn't have to do dominator-based natural loop inference.  That makes sense when the source language is gotos and labels.  In our case the source language is JS, which is structured.  I don't think it's even possible to have a back edge that isn't part of a natural loop.  The bytecode generator always knows when it's emitting a back edge, always knows what loop it's part of, and always knows all of the instructions that are part of the loop, and always knows where the header of the loop is.  It would be simpler, I think, if the bytecode generator simply recorded this information in CodeBlock and the DFG could just use it.  Better yet, we should preserve that information in the DFG's Graph and BasicBlocks and require all passes to preserve it as well.

Let me try to understand your concern, and please correct me if I misunderstand it... IMHO we're not doing loop inference. For now we record the loop pre-headers in the DFG when parsing the bytecode. The dominator analysis is added only to find the dominators of the loop exit, to filter out those blocks which may not be reached in the loop due to branches (we don't want to hoist any nodes in such blocks).

> 
> 2) BasicBlocks should not require that their nodes have contiguous indices in the graph.  A BasicBlock should instead have a Vector<NodeIndex> that references all of the block's nodes, in the order in which they are to be executed.  It should be possible to add nodes to a BasicBlock by (1) appending them to the graph and (2) appending their indices to the Vector<NodeIndex> in BasicBlock.
> 
> 3) CodeOrigin should be able to say something like "this code originated from X but for OSR exit should be treated as if it was at Y".  This is in addition to also saying the point Z that corresponds to the code's value profiling site.  Currently CodeOrigin stores X and Z and assumes Y == X.  But allowing for a differentiation would make hoisting really easy and robust - the OSR exit code for code hoisted by LICM would simply know that the code should exit to the loop header because Y would be set to the loop header.

Certainly, those two are doable.

> 
> 4) We should at least get rid of the conflation of SetLocal as a hint to OSR and SetLocal as a mechanism for preserving state across basic blocks.  

I can't agree with you anymore on this...
In fact, I realized a bug related to OSR exit in current LICM implementation some days ago. I must admit that we have to do something with... although some tweaks/workarounds could be made based on current SetLocal mechanism, they look dirty and are really not performance friendly.

>
> The main thing I worry about - and let me know if your patch handles this gracefully - is that you could have an expression in the loop like "a:Foo(...); SetLocal(@a, r<blah>)" where @a is loop-invariant but r<blah> is a variable that is live at the loop header because it contained other things.  Hoisting this combination could cause badness.  What we really want is to hoist just @a and have some elegant way of referring to it.  If we really wanted to, we could just create a new temporary and use SetLocal/GetLocal on that.  But that feels yucky.

For such a case, the current implementation will simply (conservatively) _not_ hoist either @a or the SetLocal: we track the live variables in the loop.

> 
> It may be that (4) really does require SSA.  If so then maybe we should just do it.  It really shouldn't be that hard! :-)
> 
> But (1), (2), and (3) are totally doable.  I was thinking about doing them but I thought I'd throw it out here in case you could get around to it before me.  I'm bogged down with 64-bit LLInt right now, and some bizarre DFG register allocation pathologies. :-|
> 
> Please let me know if you like this development plan.  If so, feel free to create bugs for the subtasks I listed above - or any others you can think of - and make them blockers for this bug.

Sure, I like it. :) And never mind, if you're busy and I get some time, I can start some of the listed items.

> 
> To be clear, I really like your LICM implementation; I just want to make sure that we integrate it in a way that makes it most profitable and least risky! :-)

Thanks!
Comment 25 Yuqiang Xian 2012-02-29 02:26:03 PST
Bug 79899 has been created to allow BasicBlocks to have non-contiguous node indices.

(In reply to comment #24)
> 
> I can't agree with you anymore on this...
> In fact, I realized a bug related to OSR exit in current LICM implementation some days ago. I must admit that we have to do something with... although some tweaks/workarounds could be made based on current SetLocal mechanism, they look dirty and are really not performance friendly.

Ah... Sorry for a HUGE typo. I meant to say "I couldn't agree with you more..." - forgive my poor English please.
Comment 26 Filip Pizlo 2012-02-29 14:44:25 PST
> > 1) We shouldn't have to do dominator-based natural loop inference.  That makes sense when the source language is gotos and labels.  In our case the source language is JS, which is structured.  I don't think it's even possible to have a back edge that isn't part of a natural loop.  The bytecode generator always knows when it's emitting a back edge, always knows what loop it's part of, and always knows all of the instructions that are part of the loop, and always knows where the header of the loop is.  It would be simpler, I think, if the bytecode generator simply recorded this information in CodeBlock and the DFG could just use it.  Better yet, we should preserve that information in the DFG's Graph and BasicBlocks and require all passes to preserve it as well.
> 
> Let me try to understand your concern, and please correct me if I mis-understand it... IMHO we're not doing loop inference. For now we record the loop pre-headers in DFG when parsing the bytecode. The dominator analysis is added only for finding out the dominators of the loop exit - to filter out those blocks which may not be reached in the loop due to branches (we don't want to hoist any nodes in such blocks).

Ahhhh yeah you're at least more than half right. ;-)  We could do this in the bytecode generator, but it might be complicated.  Here's what I was thinking:

If you're in a loop the bytecode generator could indicate this in its context.  If it hits a conditional while in a loop, it can indicate that it's still in the loop but no longer in code that dominates the exit.  Except that this isn't quite right if you have:

loop {
    if (foo) {
        stuff;
        break;
    }
}

So then you'd have to add additional checks to see if there's a break statement at the end of the conditional block, but then you'd have to further handle the case of:

loop {
    if (foo) {
        stuff;
        if (bar) {
            other stuff;
            break;
        } else {
            some other stuff;
            break;
        }
    }
}

Or maybe your dominator analysis doesn't catch this either?

Just something to think about.  I definitely think that if the bytecode generator could give us this information, then it would be cheaper than computing it in the DFG.  On the other hand, having a loop analysis in the DFG carries the benefit that it will work even if it is done after the DFG has already done other control flow changes.

In short, I think you're right, but I figure I'd throw this out there in case you have ideas on how to make this better.

> >
> > The main thing I worry about - and let me know if your patch handles this gracefully - is that you could have an expression in the loop like "a:Foo(...); SetLocal(@a, r<blah>)" where @a is loop-invariant but r<blah> is a variable that is live at the loop header because it contained other things.  Hoisting this combination could cause badness.  What we really want is to hoist just @a and have some elegant way of referring to it.  If we really wanted to, we could just create a new temporary and use SetLocal/GetLocal on that.  But that feels yucky.
> 
> For such case, the current implementation will simply (conservatively) _not_ hoist either @a or the SetLocal - We track the live variables in the loop.

Right.  That's probably good enough, but it makes me sad.

> 
> > 
> > It may be that (4) really does require SSA.  If so then maybe we should just do it.  It really shouldn't be that hard! :-)
> > 
> > But (1), (2), and (3) are totally doable.  I was thinking about doing them but I thought I'd throw it out here in case you could get around to it before me.  I'm bogged down with 64-bit LLInt right now, and some bizarre DFG register allocation pathologies. :-|
> > 
> > Please let me know if you like this development plan.  If so, feel free to create bugs for the subtasks I listed above - or any others you can think of - and make them blockers for this bug.
> 
> Sure, I like it. :). And never mind, if you're busy and I get some time I can start some of the listed items.

That's great!
Comment 27 Filip Pizlo 2012-02-29 14:45:42 PST
(In reply to comment #25)
> Bug 79899 is created trying to allow BasicBlocks have non-continuous node indices.
> 
> (In reply to comment #24)
> > 
> > I can't agree with you anymore on this...
> > In fact, I realized a bug related to OSR exit in current LICM implementation some days ago. I must admit that we have to do something with... although some tweaks/workarounds could be made based on current SetLocal mechanism, they look dirty and are really not performance friendly.
> 
> Ah... Sorry for a HUGE typo. I tended to mean "I couldn't agree with you more..." - forgive my poor English please.

I fail at English often as well.  It's actually not my first language, either (Polish is).
Comment 28 Yuqiang Xian 2012-02-29 19:31:26 PST
(In reply to comment #26)
> 
> Ahhhh yeah you're at least more than half right. ;-)  We could do this in the bytecode generator, but it might be complicated.  Here's what I was thinking:
> 
> If you're in a loop the bytecode generator could indicate this in its context.  If it hits a conditional while in a loop, it can indicate that it's still in the loop but no longer in code that dominates the exit.  Except that this isn't quite right if you have:
> 
> loop {
>     if (foo) {
>         stuff;
>         break;
>     }
> }
> 
> So then you'd have to add additional checks to see if there's a break statement at the end of the conditional block, but then you'd have to further handle the case of:
> 
> loop {
>     if (foo) {
>         stuff;
>         if (bar) {
>             other stuff;
>             break;
>         } else {
>             some other stuff;
>             break;
>         }
>     }
> }
> 
> Or maybe your dominator analysis doesn't catch this either?

Right. There's a bug in the dominator analysis in the current implementation regarding this scenario. We should analyze the dominators of the loop exit's successor instead of the loop exit itself. I think this catches the breaks.