Bug 76770

Summary: Loop invariant code motion in JSC DFG
Product: WebKit
Reporter: Yuqiang Xian <yuqiang.xian>
Component: JavaScriptCore
Assignee: Yuqiang Xian <yuqiang.xian>
Status: NEW
Severity: Normal
CC: barraclough, fpizlo, rakuco, webkit.review.bot
Priority: P2
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Bug Depends on: 79899, 80415
Bug Blocks:
Attachments (flags: none):
- WIP
- WIP patch
- WIP patch
- WIP patch
- WIP patch
- WIP patch
- WIP patch
- Rebase

Description Yuqiang Xian 2012-01-20 22:26:33 PST
The current JSC DFG JIT lacks loop invariant code motion (LICM). This bug tracks an attempt to assess its feasibility.
A WIP patch has been made, but it definitely still has many problems to be solved.
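For readers unfamiliar with the optimization, here is a hand-written JavaScript illustration of what LICM achieves (this is a conceptual sketch, not JSC's actual IR transformation):

```javascript
// A hypothetical example: `a.length * scale` does not change across
// iterations, so an optimizer can hoist it out of the loop body into
// the loop pre-header, computing it once instead of every iteration.
function sumScaledBefore(a, scale) {
    var total = 0;
    for (var i = 0; i < a.length; ++i)
        total += a[i] + a.length * scale; // invariant recomputed each iteration
    return total;
}

// Equivalent code after hoisting the invariant expression:
function sumScaledAfter(a, scale) {
    var total = 0;
    var hoisted = a.length * scale;       // computed once, in the pre-header
    for (var i = 0; i < a.length; ++i)
        total += a[i] + hoisted;
    return total;
}
```

Both functions compute the same result; the second simply avoids redundant work per iteration.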
Comment 1 Yuqiang Xian 2012-01-20 23:07:19 PST
Created attachment 123434 [details]
WIP

WIP patch. NOT ready for review.
Comment 2 Yuqiang Xian 2012-01-31 01:16:57 PST
Created attachment 124682 [details]
WIP patch

This patch passes the tests (tested on Linux IA32) and demonstrates a performance benefit on several benchmark cases. However, on some cases (especially SunSpider) obvious regressions are observed. Further investigation is ongoing.
Comment 3 Yuqiang Xian 2012-02-01 01:37:36 PST
Created attachment 124904 [details]
WIP patch

This patch fixes the performance regression on the SunSpider benchmark, which was caused by LICM revealing some CFA problems that led to OSR failures. With this fix, no obvious performance regression is observed on Kraken or V8 either. However, two cases appear to have correctness problems: v8-crypto and Kraken imaging-gaussian-blur. Still under investigation.
Comment 4 Yuqiang Xian 2012-02-01 06:28:06 PST
Created attachment 124942 [details]
WIP patch

Fixed the OSR entry problem: since we adjusted the entry point to the loop pre-header, we need to rearrange the placement of op_loop_hint, which is used to trigger loop optimizations and code replacement. With this fix the v8-crypto case passes.
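To illustrate the issue described above (a hand-written sketch of the concept, not JSC internals): op_loop_hint sits at the loop head and is where the VM decides whether to replace the running code with optimized code. Once LICM gives the loop a pre-header holding hoisted values, OSR must enter through that pre-header so those values exist before the loop body runs.

```javascript
// Conceptual sketch of a loop hot enough to trigger tier-up/OSR.
// The comments mark where the bytecode-level events would occur.
function hotLoop(n) {
    var invariant = n * 2;       // conceptually lives in the loop pre-header;
                                 // must run before any OSR entry into the loop
    var sum = 0;
    for (var i = 0; i < n; ++i)  // loop head: op_loop_hint fires here
        sum += invariant;
    return sum;
}
```

If OSR entered directly at the loop head, `invariant` would never have been computed, which is the failure mode this fix addresses by moving the entry point.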
Comment 5 Yuqiang Xian 2012-02-02 01:26:08 PST
Created attachment 125098 [details]
WIP patch

Fixed OSR exit issues after LICM. The hoisted loop-invariant operations should not impact OSR exits inside the loop, so some tweaks are made to ensure the state is correct at OSR exit.
Also fixed some issues related to nested loops. We have to copy the loop invariants of the outer loops into the inner loops: because an inner loop can be an OSR entry point, the hoisted nodes of the outer loop must be evaluated as well when we perform OSR at the inner loop. This, however, makes LICM for non-leaf (non-innermost) loops not just pointless but wasteful, unless we figure out a better solution in the future. So currently we only perform LICM for the innermost loops.
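A hand-written sketch of the nested-loop situation described above (an illustration of the concept, not DFG IR):

```javascript
// `outerInv` is invariant for the outer loop and `innerInv` for the
// inner loop. If OSR enters execution at the inner loop, the hoisted
// outer invariant must already have been evaluated -- which is why
// outer-loop invariants have to be copied into inner-loop pre-headers,
// making LICM on the non-innermost loops redundant.
function nested(n, m) {
    var outerInv = n + m;            // hoisted before the outer loop
    var result = 0;
    for (var i = 0; i < n; ++i) {
        var innerInv = outerInv * i; // hoisted into the inner pre-header
        for (var j = 0; j < m; ++j)  // inner loop: a possible OSR entry point
            result += innerInv + j;
    }
    return result;
}
```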

This passes the JavaScriptCore tests and the 3 standard benchmarks. Here's the performance result tested on Linux IA32 (5% improvement on Kraken):

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                               ToT                     LICM
SunSpider:
   3d-cube                                6.7024+-0.0844          6.6755+-0.0673
   3d-morph                              11.3632+-0.0167    ^    11.2263+-0.0446       ^ definitely 1.0122x faster
   3d-raytrace                            9.5948+-0.0627    ?     9.5970+-0.0621       ?
   access-binary-trees                    1.8342+-0.0272    ?     1.8512+-0.0423       ?
   access-fannkuch                       10.6713+-0.0564    ?    10.7571+-0.0341       ?
   access-nbody                           5.4339+-0.0592          5.4261+-0.0802
   access-nsieve                          3.9822+-0.0382    ?     4.0274+-0.0529       ? might be 1.0113x slower
   bitops-3bit-bits-in-byte               1.1977+-0.0334    ?     1.2104+-0.0367       ? might be 1.0107x slower
   bitops-bits-in-byte                    4.6181+-0.0481    ?     4.6247+-0.0422       ?
   bitops-bitwise-and                     4.2526+-0.0515    ?     4.2663+-0.0442       ?
   bitops-nsieve-bits                     6.7337+-0.1113    ?     6.7368+-0.1024       ?
   controlflow-recursive                  2.8045+-0.0335          2.7933+-0.0401
   crypto-aes                             9.3455+-0.1455    !     9.6886+-0.1714       ! definitely 1.0367x slower
   crypto-md5                             3.0062+-0.0348    !     3.0934+-0.0524       ! definitely 1.0290x slower
   crypto-sha1                            2.6093+-0.0355    !     2.7390+-0.0361       ! definitely 1.0497x slower
   date-format-tofte                     12.2485+-0.0833         12.1951+-0.0920
   date-format-xparb                     11.6320+-0.1265         11.6023+-0.1296
   math-cordic                            9.1956+-0.0535          9.1807+-0.0562
   math-partial-sums                     13.9002+-0.0539         13.8929+-0.0490
   math-spectral-norm                     2.6203+-0.0485    ?     2.6545+-0.0430       ? might be 1.0131x slower
   regexp-dna                             8.6963+-0.1054    ?     8.7113+-0.0790       ?
   string-base64                          5.6049+-0.1043    !     5.9174+-0.0527       ! definitely 1.0558x slower
   string-fasta                           9.0395+-0.0556    ?     9.0681+-0.0532       ?
   string-tagcloud                       15.5883+-0.0835    ?    15.6098+-0.0709       ?
   string-unpack-code                    24.9881+-0.0437    !    25.1207+-0.0666       ! definitely 1.0053x slower
   string-validate-input                  7.2133+-0.0404    !     7.5982+-0.0787       ! definitely 1.0534x slower

   <arithmetic> *                         7.8799+-0.0311    ?     7.9332+-0.0308       ? might be 1.0068x slower
   <geometric>                            6.2946+-0.0264    !     6.3572+-0.0291       ! definitely 1.0099x slower
   <harmonic>                             4.8063+-0.0290    ?     4.8670+-0.0343       ? might be 1.0126x slower

                                               ToT                     LICM
V8:
   crypto                                95.9089+-0.4916    ?    96.6358+-0.3024       ?
   deltablue                            161.3032+-1.3059    !   169.7393+-0.6635       ! definitely 1.0523x slower
   earley-boyer                         111.0482+-0.1805    !   111.5194+-0.2230       ! definitely 1.0042x slower
   raytrace                              53.5053+-0.3817    ?    53.8410+-0.4282       ?
   regexp                               101.7867+-0.7919        101.7171+-0.5630
   richards                             178.0875+-0.3938        176.8360+-0.8624
   splay                                 79.9467+-0.3718    ?    80.1138+-0.5397       ?
 
   <arithmetic>                         111.6552+-0.2536    !   112.9146+-0.1850       ! definitely 1.0113x slower
   <geometric> *                        104.1782+-0.2344    !   105.1263+-0.2347       ! definitely 1.0091x slower
   <harmonic>                            96.8544+-0.2634    !    97.5583+-0.3083       ! definitely 1.0073x slower

                                               ToT                     LICM
Kraken:
   ai-astar                             782.8823+-1.0568    ^   633.1314+-0.8283       ^ definitely 1.2365x faster
   audio-beat-detection                 239.2905+-0.9386    !   244.7953+-0.2533       ! definitely 1.0230x slower
   audio-dft                            367.3076+-5.3766    ^   334.5632+-1.9583       ^ definitely 1.0979x faster
   audio-fft                            156.4270+-1.7812    ^   152.5042+-0.2279       ^ definitely 1.0257x faster
   audio-oscillator                     344.2403+-1.5722    ^   340.8509+-1.6079       ^ definitely 1.0099x faster
   imaging-darkroom                     381.9911+-8.0960        381.6292+-8.9160
   imaging-desaturate                   304.5713+-0.6926    ?   304.8006+-0.7142       ?
   imaging-gaussian-blur                591.3785+-3.0221    ^   585.7728+-1.4615       ^ definitely 1.0096x faster
   json-parse-financial                  68.3235+-0.5355    !    69.5250+-0.3486       ! definitely 1.0176x slower
   json-stringify-tinderbox             103.2334+-1.0090        103.1331+-1.0498
   stanford-crypto-aes                  115.8342+-0.3116        115.7750+-0.5165
   stanford-crypto-ccm                  117.2265+-0.7123    ?   117.2419+-0.6766       ?
   stanford-crypto-pbkdf2               231.5518+-0.2709    !   233.3329+-0.6611       ! definitely 1.0077x slower
   stanford-crypto-sha256-iterative     100.2526+-0.1269    ?   100.3860+-0.2462       ?

   <arithmetic> *                       278.8936+-1.2346    ^   265.5315+-0.8722       ^ definitely 1.0503x faster
   <geometric>                          218.5252+-0.9039    ^   213.8505+-0.6057       ^ definitely 1.0219x faster
   <harmonic>                           173.1445+-0.6771        172.3007+-0.4488         might be 1.0049x faster

                                               ToT                     LICM
All benchmarks:
   <arithmetic>                         104.0633+-0.3907    ^   100.3002+-0.2844       ^ definitely 1.0375x faster
   <geometric>                           27.5025+-0.0963    ?    27.5131+-0.0953       ? might be 1.0004x slower
   <harmonic>                             8.4492+-0.0499    ?     8.5531+-0.0588       ? might be 1.0123x slower

                                               ToT                     LICM
Geomean of preferred means:
   <scaled-result>                       61.1753+-0.1828    ^    60.5004+-0.1703       ^ definitely 1.0112x faster
Comment 6 Yuqiang Xian 2012-02-03 01:27:29 PST
Created attachment 125286 [details]
WIP patch

Updated patch.

Generate new Phi nodes for the new GetLocals created by LICM, and update the block variable information as well. This also requires Phi stack processing after LICM, so the Phi stack processing logic is moved from DFGByteCodeParser to DFGGraph. With this we can correctly propagate abstract values through blocks.

Performance results are below (tested on Linux IA32): still a 5% improvement on Kraken, but a ~1% loss on V8 (mostly due to a 5% regression on deltablue, under investigation).

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                               ToT                     LICM
SunSpider:
   3d-cube                                6.7185+-0.0922          6.6918+-0.0762
   3d-morph                              11.3808+-0.0780         11.2993+-0.1254
   3d-raytrace                            9.5044+-0.0562    ?     9.5743+-0.0985       ?
   access-binary-trees                    1.9286+-0.0692          1.9212+-0.0414
   access-fannkuch                       10.7387+-0.0590    ?    10.8040+-0.0870       ?
   access-nbody                           5.4588+-0.0946          5.4444+-0.0900
   access-nsieve                          3.9970+-0.0378    ?     4.0700+-0.0391       ? might be 1.0183x slower
   bitops-3bit-bits-in-byte               1.1835+-0.0353    ?     1.2043+-0.0374       ? might be 1.0176x slower
   bitops-bits-in-byte                    4.6397+-0.0533    ?     4.6637+-0.0567       ?
   bitops-bitwise-and                     4.2493+-0.0640    ?     4.3000+-0.0913       ? might be 1.0119x slower
   bitops-nsieve-bits                     6.8605+-0.0976          6.8071+-0.1341
   controlflow-recursive                  2.8164+-0.0562          2.7959+-0.0489
   crypto-aes                             9.3053+-0.1653    !     9.6577+-0.1640       ! definitely 1.0379x slower
   crypto-md5                             3.0033+-0.0699    ?     3.0361+-0.0668       ? might be 1.0109x slower
   crypto-sha1                            2.6658+-0.0808    ?     2.7173+-0.0983       ? might be 1.0193x slower
   date-format-tofte                     12.2253+-0.1308         12.1408+-0.0981
   date-format-xparb                     11.8267+-0.1292    ?    11.9841+-0.1144       ? might be 1.0133x slower
   math-cordic                            9.2232+-0.0671          9.2085+-0.0893
   math-partial-sums                     13.9079+-0.0590         13.8841+-0.0717
   math-spectral-norm                     2.6601+-0.0649          2.6535+-0.0245
   regexp-dna                             8.6890+-0.0961    ?     8.7138+-0.1023       ?
   string-base64                          5.5912+-0.1123    ?     5.6279+-0.1022       ?
   string-fasta                           9.0830+-0.0981          9.0396+-0.0898
   string-tagcloud                       15.6098+-0.1030         15.5510+-0.0992
   string-unpack-code                    25.1136+-0.1364    ?    25.3173+-0.1788       ?
   string-validate-input                  7.1972+-0.0982    !     7.3726+-0.0725       ! definitely 1.0244x slower

   <arithmetic> *                         7.9068+-0.0292    ?     7.9416+-0.0340       ? might be 1.0044x slower
   <geometric>                            6.3250+-0.0232    ?     6.3579+-0.0324       ? might be 1.0052x slower
   <harmonic>                             4.8379+-0.0248    ?     4.8711+-0.0402       ? might be 1.0069x slower

                                               ToT                     LICM
V8:
   crypto                                96.5485+-0.4716    ?    97.1088+-0.3119       ?
   deltablue                            161.8844+-1.1480    !   169.9394+-0.6498       ! definitely 1.0498x slower
   earley-boyer                         110.8333+-0.2826    ?   111.4147+-0.3258       ?
   raytrace                              53.8090+-0.3647    ?    54.4755+-0.8030       ? might be 1.0124x slower
   regexp                               102.4384+-0.5489        101.6221+-0.3126
   richards                             178.8249+-0.6137    ^   177.3008+-0.6436       ^ definitely 1.0086x faster
   splay                                 80.6917+-0.8164         80.3112+-0.4530

   <arithmetic>                         112.1472+-0.1864    !   113.1675+-0.1549       ! definitely 1.0091x slower
   <geometric> *                        104.6812+-0.2472    !   105.4402+-0.2689       ! definitely 1.0073x slower
   <harmonic>                            97.3662+-0.3320    ?    97.9646+-0.4274       ? might be 1.0061x slower

                                               ToT                     LICM
Kraken:
   ai-astar                             788.3627+-1.2834    ^   638.5571+-1.1142       ^ definitely 1.2346x faster
   audio-beat-detection                 240.0238+-1.4097    ^   238.3298+-0.2238       ^ definitely 1.0071x faster
   audio-dft                            368.1261+-3.4572    ^   331.3670+-3.5713       ^ definitely 1.1109x faster
   audio-fft                            154.1395+-0.1606    ^   153.0288+-0.1039       ^ definitely 1.0073x faster
   audio-oscillator                     347.4398+-2.4043    ^   339.9551+-1.3986       ^ definitely 1.0220x faster
   imaging-darkroom                     378.8966+-9.4751        377.5207+-8.5424
   imaging-desaturate                   305.9519+-0.7818    ?   307.2358+-1.8061       ?
   imaging-gaussian-blur                593.0994+-1.0883    ^   591.0285+-0.7286       ^ definitely 1.0035x faster
   json-parse-financial                  67.7004+-0.4985    !    69.3780+-0.6205       ! definitely 1.0248x slower
   json-stringify-tinderbox             102.5560+-0.5984        101.4284+-0.5632         might be 1.0111x faster
   stanford-crypto-aes                  115.9111+-0.4111    ^   114.9546+-0.1903       ^ definitely 1.0083x faster
   stanford-crypto-ccm                  116.3800+-0.8529    ?   116.6237+-1.0234       ?
   stanford-crypto-pbkdf2               234.4627+-1.4396    ?   235.7790+-1.5338       ?
   stanford-crypto-sha256-iterative     100.5601+-0.0904    !   101.0498+-0.2687       ! definitely 1.0049x slower

   <arithmetic> *                       279.5436+-1.0670    ^   265.4454+-0.9724       ^ definitely 1.0531x faster
   <geometric>                          218.5138+-0.7245    ^   213.3139+-0.7691       ^ definitely 1.0244x faster
   <harmonic>                           172.6914+-0.5036        171.6888+-0.6281         might be 1.0058x faster

                                               ToT                     LICM
All benchmarks:
   <arithmetic>                         104.3451+-0.3442    ^   100.3168+-0.3114       ^ definitely 1.0402x faster
   <geometric>                           27.5953+-0.0789         27.5063+-0.1031         might be 1.0032x faster
   <harmonic>                             8.5033+-0.0422    ?     8.5600+-0.0688       ? might be 1.0067x slower

                                               ToT                     LICM
Geomean of preferred means:
   <scaled-result>                       61.3911+-0.1767    ^    60.5751+-0.1816       ^ definitely 1.0135x faster
Comment 7 Yuqiang Xian 2012-02-03 01:30:47 PST
CC JSC experts. :)
Comment 8 Filip Pizlo 2012-02-03 10:41:46 PST
This looks awesome!  I'll dive in to try to understand the details shortly...
Comment 9 Filip Pizlo 2012-02-03 12:23:49 PST
Comment on attachment 125286 [details]
WIP patch

View in context: https://bugs.webkit.org/attachment.cgi?id=125286&action=review

This is a really awesome piece of work.  But I have the following concerns:

1) The fact that the DFG compiler uses - and relies heavily on! - indices of nodes and blocks is a flaw that we will probably correct.  We should probably move to a world where blocks have no fixed ordering, and nodes are ordered but not indexed, so that nodes can be interjected between other nodes.  Imagine that the DFG::Graph was a linked list of nodes rather than an array - that would be one way to do it.  In this patch you are adding more code that fiddles with our weird indexing approach, both as a simplification to your approach and as a necessity.  You rely on block indices to find loops, for instance - even though your patch breaks the very indexing that it is relying on.  You then have loads of code to deal with the craziness that is our Node vector - probably that code would have been simpler if you could have just inserted nodes between other nodes.

2) The fact that the DFG compiler has SetLocal/GetLocal nodes and no notion of direct data flow links from one basic block to another is a flaw that we will probably also correct.  The DFG should be an SSA compiler; the fact that it's not that way yet is reflective of our development strategy (get the tiered compilation and basic code gen right before moving to harder things) rather than our technical vision.  Your patch both relies on, and is made more complicated by, the lack of SSA and the use of these "local variables".

3) The phase introduced in this patch relies on properties of the DFG graph (like the meaning conveyed by the ordering of blocks) that it then promptly breaks.  Hence I fear that if we were to take this, we would then not be able to implement any other optimizations.

I'll have to think about this patch more, but right now I'm leaning towards the following suggestion: it would likely be far more valuable if we went to implementing SSA, and got rid of the mess that is SetArgument/SetLocal/GetLocal/Phi/Flush, than trying to teach the current beast these kinds of highly technical optimizations.  My guess is that LICM would be far cleaner, easier to implement, and more maintainable if we could implement it on the DFG *after* we've moved to something like SSA.  But maybe I could be convinced otherwise…  it's a tough call.
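For readers following along, the IR distinction drawn in point 2 can be sketched on a simple join point (a hand-written illustration; the IR fragments in the comments are schematic, not actual DFG or SSA output):

```javascript
// In the current DFG, a value crossing a block boundary goes through
// named local slots via SetLocal/GetLocal pairs; in SSA form, every
// definition gets a unique name and the join point uses a Phi node:
//
//   current DFG (locals):          SSA:
//     block A: SetLocal(x, 1)        block A: x1 = 1
//     block B: SetLocal(x, 2)        block B: x2 = 2
//     block C: GetLocal(x)           block C: x3 = Phi(x1, x2)
function join(cond) {
    var x;
    if (cond)
        x = 1;   // block A
    else
        x = 2;   // block B
    return x;    // block C: value merged from both predecessors
}
```

Under SSA, LICM could rewire uses of a hoisted node directly; with locals, the hoist must thread results through SetLocal/GetLocal slots, which is the complication discussed in this patch.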

> Source/JavaScriptCore/dfg/DFGGraph.cpp:127
> +    bool mustGenerate = refCount && node.mustGenerate();

Is this change necessary?  We use the mustGenerate bit to print "!" to indicate that regardless of ref count the node must be generated.  Separately, we marked nodes "skipped" if their refCount is zero.

> Source/JavaScriptCore/dfg/DFGJITCompiler.h:324
> +            if (!basicBlock.cfaHasVisited || basicBlock.variablesAtHead.argument(argument) == NoNode)

This looks very dangerous.  Do we ever want to enter into a block for which there is no CFA information?  Do loop pre headers do all speculation checks that the CFA assumes would have been performed?

> Source/JavaScriptCore/dfg/DFGPropagator.cpp:2091
> +                if (replacement < m_graph.size() - 1 && m_graph[replacement + 1].op == SetLocal)

It is not reliable to assume that the next node being a SetLocal means something.  Examples where this might go wrong include, but are not limited to, unsigned bit shifts or post-inc and post-dec.

> Source/JavaScriptCore/dfg/DFGPropagator.cpp:2304
> +                    performHoist(m_compileIndex, preHeaderBlock, hoisted);
> +                else if (shouldGenerate && !hoistedChildren.isEmpty()) {
> +                    // This node cannot be hoisted but it contains some references to the hoisted nodes.
> +                    // Adjust such references by utilizing SetLocal/GetLocal for cross block references.
> +                    for (unsigned i = 0; i < hoistedChildren.size(); ++i)
> +                        handleHoistedChild(blockIndex, hoistedChildren[i], loopInfo);

Let me try to understand: you are either hoisting (which just copies the node) or you are fixing references from the node to its children.  The latter case may effectively "unhoist" previously hoisted nodes, if you find that they did not have SetLocal nodes that you could easily find.  Right?
Comment 10 Yuqiang Xian 2012-02-04 01:59:36 PST
(In reply to comment #9)

Filip,

Thanks a lot for the comments. Yes, I actually agree with you completely on the first two major concerns, and they did trouble me a lot, but I was not aggressive enough to try to change the IR form, so I tried to live with the current DFG IR, which makes the optimization not as clean and easy to understand as it should be. This is the reason I want you to notice this work at the current stage, and to discuss how we should proceed.
For the 3rd major concern, I guess I don't fully understand it and maybe I need to think about it more carefully.

The responses to the specific code change concerns are inlined below.

> 
> > Source/JavaScriptCore/dfg/DFGGraph.cpp:127
> > +    bool mustGenerate = refCount && node.mustGenerate();
> 
> Is this change necessary?  We use the mustGenerate bit to print "!" to indicate that regardless of ref count the node must be generated.  Separately, we marked nodes "skipped" if their refCount is zero.
> 

Because the hoisted loop invariants can have a refCount of 0 even if they're marked "MustGenerate", in which case we're not going to generate code for them. This is actually caused by the "unhoisting" you mentioned in the last comment.

> > Source/JavaScriptCore/dfg/DFGJITCompiler.h:324
> > +            if (!basicBlock.cfaHasVisited || basicBlock.variablesAtHead.argument(argument) == NoNode)
> 
> This looks very dangerous.  Do we ever want to enter into a block for which there is no CFA information?  Do loop pre headers do all speculation checks that the CFA assumes would have been performed?
> 

This is mainly to fix the performance regression on SunSpider. We actually do sometimes enter a block without CFA information (due to loop OSR). A typical case is bitops-bitwise-and in SunSpider, where CFA is terminated before the loop because of a ForceOSRExit. And yes, I think the speculation checks should have been performed in codegen. Please correct me if I missed something.

> > Source/JavaScriptCore/dfg/DFGPropagator.cpp:2091
> > +                if (replacement < m_graph.size() - 1 && m_graph[replacement + 1].op == SetLocal)
> 
> It is not reliable to assume that the next node being a SetLocal means something.  Examples where this might go wrong include, but are not limited to, unsigned bit shifts or post-inc and post-dec.
> 

Sorry, I'm not quite sure of the specific concern here. I check the next node (a hoisted node in the pre-header) for a SetLocal simply to find a slot for saving the result of the current node, which will be used by a node in the loop body.

> > Source/JavaScriptCore/dfg/DFGPropagator.cpp:2304
> > +                    performHoist(m_compileIndex, preHeaderBlock, hoisted);
> > +                else if (shouldGenerate && !hoistedChildren.isEmpty()) {
> > +                    // This node cannot be hoisted but it contains some references to the hoisted nodes.
> > +                    // Adjust such references by utilizing SetLocal/GetLocal for cross block references.
> > +                    for (unsigned i = 0; i < hoistedChildren.size(); ++i)
> > +                        handleHoistedChild(blockIndex, hoistedChildren[i], loopInfo);
> 
> Let me try to understand: you are either hoisting (which just copies the node) or you are fixing references from the node to its children.  The latter case may effectively "unhoist" previously hoisted nodes, if you find that they did not have SetLocal nodes that you could easily find.  Right?

Exactly.
Comment 11 Yuqiang Xian 2012-02-05 23:12:42 PST
Created attachment 125578 [details]
WIP patch

Updated patch.

Fixed a performance issue on 64-bit. Below are the current performance results on 64-bit (Mac OS X x64): a 6% improvement on Kraken, neutral on SunSpider and V8, although a regression on bitops-bitwise-and is observed.

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                                1                       2
SunSpider:
   3d-cube                                5.8913+-0.0639          5.7923+-0.0798         might be 1.0171x faster
   3d-morph                               9.0526+-0.1103          8.8874+-0.0843         might be 1.0186x faster
   3d-raytrace                            8.2938+-0.0847          8.2594+-0.0928
   access-binary-trees                    1.9285+-0.0400    ?     1.9730+-0.1197       ? might be 1.0231x slower
   access-fannkuch                        7.3069+-0.2342    ?     7.3645+-0.1027       ?
   access-nbody                           3.9577+-0.0472          3.9405+-0.0286
   access-nsieve                          3.6470+-0.1076          3.6145+-0.0584
   bitops-3bit-bits-in-byte               1.2537+-0.0261    ^     1.1556+-0.0041       ^ definitely 1.0850x faster
   bitops-bits-in-byte                    5.0807+-0.0462    ?     5.1419+-0.0429       ? might be 1.0120x slower
   bitops-bitwise-and                     3.1656+-0.0106    !     3.6648+-0.0345       ! definitely 1.1577x slower
   bitops-nsieve-bits                     5.5132+-0.0480          5.5014+-0.0275
   controlflow-recursive                  2.2947+-0.0261          2.2853+-0.0230
   crypto-aes                             8.3225+-0.0566    !     8.6046+-0.0619       ! definitely 1.0339x slower
   crypto-md5                             2.5425+-0.0281    ?     2.5443+-0.0252       ?
   crypto-sha1                            2.6004+-0.3497          2.3593+-0.0270         might be 1.1022x faster
   date-format-tofte                     11.6510+-0.0687    ^    11.0868+-0.0504       ^ definitely 1.0509x faster
   date-format-xparb                     10.4985+-0.2820         10.2481+-0.3466         might be 1.0244x faster
   math-cordic                            7.5448+-0.7898          7.4206+-0.5456         might be 1.0167x faster
   math-partial-sums                     10.5491+-0.0809    ^    10.3707+-0.0560       ^ definitely 1.0172x faster
   math-spectral-norm                     2.6146+-0.0112    !     2.7388+-0.1040       ! definitely 1.0475x slower
   regexp-dna                             8.9686+-0.0631          8.8430+-0.0749         might be 1.0142x faster
   string-base64                          4.9485+-0.0286    ?     4.9680+-0.0241       ?
   string-fasta                           7.8256+-0.0639    ?     7.9127+-0.0954       ? might be 1.0111x slower
   string-tagcloud                       13.2538+-0.1385         13.1118+-0.1419         might be 1.0108x faster
   string-unpack-code                    21.7408+-0.2276    ?    22.1923+-0.5740       ? might be 1.0208x slower
   string-validate-input                  6.2486+-0.0706    ?     6.2526+-0.0283       ?

   <arithmetic> *                         6.7960+-0.0357          6.7782+-0.0337         might be 1.0026x faster
   <geometric>                            5.5197+-0.0369          5.5096+-0.0249         might be 1.0018x faster
   <harmonic>                             4.3784+-0.0444          4.3500+-0.0315         might be 1.0065x faster

                                                1                       2
V8:
   crypto                                77.1976+-0.3123    ^    76.6700+-0.1480       ^ definitely 1.0069x faster
   deltablue                            150.1328+-1.1223    !   153.9579+-0.8058       ! definitely 1.0255x slower
   earley-boyer                          96.8689+-3.0327         96.2447+-3.0111
   raytrace                              50.0781+-0.1724    ^    49.7800+-0.1110       ^ definitely 1.0060x faster
   regexp                                96.1729+-0.3955    ?    96.3467+-0.3682       ?
   richards                             133.8968+-0.2056    !   136.5811+-0.7893       ! definitely 1.0200x slower
   splay                                 69.9721+-0.6299         69.4475+-0.7878

   <arithmetic>                          96.3313+-0.4623    ?    97.0040+-0.5518       ? might be 1.0070x slower
   <geometric> *                         90.7276+-0.4666    ?    90.9880+-0.5441       ? might be 1.0029x slower
   <harmonic>                            85.2748+-0.4300         85.2214+-0.4890         might be 1.0006x faster

                                                1                       2
Kraken:
   ai-astar                             795.3290+-0.5676    ^   636.2741+-0.7378       ^ definitely 1.2500x faster
   audio-beat-detection                 193.7537+-9.1111        185.6565+-0.4666         might be 1.0436x faster
   audio-dft                            456.0760+-0.8572    ^   443.2001+-0.8911       ^ definitely 1.0291x faster
   audio-fft                            122.3903+-7.5545        116.6115+-0.1681         might be 1.0496x faster
   audio-oscillator                     281.1231+-3.1010    ^   267.4259+-3.1168       ^ definitely 1.0512x faster
   imaging-darkroom                     295.2771+-6.0876        294.8097+-7.3344
   imaging-desaturate                   228.0868+-0.2582    ?   229.5187+-3.8357       ?
   imaging-gaussian-blur                501.9180+-0.1475    ^   499.6097+-0.2014       ^ definitely 1.0046x faster
   json-parse-financial                  60.5922+-3.3409         59.2392+-0.0472         might be 1.0228x faster
   json-stringify-tinderbox              83.4101+-0.1794    ^    78.7011+-0.2180       ^ definitely 1.0598x faster
   stanford-crypto-aes                   99.6793+-0.4575    ?   101.0411+-1.0317       ? might be 1.0137x slower
   stanford-crypto-ccm                  100.3830+-0.8947    ?   101.0337+-0.5908       ?
   stanford-crypto-pbkdf2               200.0138+-2.1655        199.5086+-3.1952
   stanford-crypto-sha256-iterative     103.3791+-0.5305        103.2712+-0.3110

   <arithmetic> *                       251.5294+-1.3646    ^   236.8501+-0.7498       ^ definitely 1.0620x faster
   <geometric>                          190.7950+-1.5686    ^   184.7663+-0.4824       ^ definitely 1.0326x faster
   <harmonic>                           149.8594+-1.5736    ^   146.7945+-0.2371       ^ definitely 1.0209x faster
Comment 12 Yuqiang Xian 2012-02-06 01:11:31 PST
(In reply to comment #9)
> (From update of attachment 125286 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=125286&action=review
> 
> 1) You rely on block indices to find loops, for instance - even though your patch breaks the very indexing that it is relying on.  
> 
> 3) The phase introduced in this patch relies on properties of the DFG graph (like the meaning conveyed by the ordering of blocks) that it then promptly breaks. 
> 

Sorry, but I guess I forgot one response. IMHO I'm not breaking the block indexing in the patch. I'm not creating or reordering the blocks, nor changing the control flow. The pre-header blocks are reserved at the very beginning; before LICM each contains only a single Jump to the loop header. Thanks.
Comment 13 Filip Pizlo 2012-02-06 01:13:59 PST
(In reply to comment #12)
> (In reply to comment #9)
> > (From update of attachment 125286 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=125286&action=review
> > 
> > 1) You rely on block indices to find loops, for instance - even though your patch breaks the very indexing that it is relying on.  
> > 
> > 3) The phase introduced in this patch relies on properties of the DFG graph (like the meaning conveyed by the ordering of blocks) that it then promptly breaks. 
> > 
> 
> Sorry, but I guess I forgot one response. IMHO I'm not breaking the block indexing in the patch. I'm not creating or reordering the blocks, nor changing the control flow. The pre-header blocks are reserved at the very beginning; before LICM each contains only a single Jump to the loop header. Thanks.

Ah, that's a good point.  I was confusing node ordering (you add nodes at the end) with block ordering.
Comment 14 Filip Pizlo 2012-02-06 01:18:02 PST
> > > Source/JavaScriptCore/dfg/DFGJITCompiler.h:324
> > > +            if (!basicBlock.cfaHasVisited || basicBlock.variablesAtHead.argument(argument) == NoNode)
> > 
> > This looks very dangerous.  Do we ever want to enter into a block for which there is no CFA information?  Do loop pre headers do all speculation checks that the CFA assumes would have been performed?
> > 
> 
> This is mainly to fix the performance regression on SunSpider. Actually we sometimes do enter a block without CFA information (due to loop OSR). A typical case is bitops-bitwise-and in SunSpider, where CFA is terminated before the loop because of a ForceOSRExit. And yes, I think the speculation checks should have been performed in codegen. Please correct me if I have missed something.

If we do enter a loop through OSR without CFA information then it is simply a bug.  We should never do that.  If we do it then we should fix it!

Separate bug filed: https://bugs.webkit.org/show_bug.cgi?id=77858
Comment 15 Filip Pizlo 2012-02-06 01:28:32 PST
The main issue I have with this patch is that it feels premature to implement this optimization given the state of the DFG IR.  It feels to me like we should proceed as follows:

- Fix register allocation and speculation bugs.  For example, register allocation and operand speculation will cause terminateSpeculativeExecution() to be called if we use an int-predicted variable in a double context, and then after that try to use it in an int context.  The double use will convert it to a double and we'll forget that it was actually an integer!

- Get rid of reference counting of nodes, at least in the early stages of graph building and optimization.  We had *lots* of bugs come in on random web pages because of bad handling of node reference counts in the fixup and CSE phases.  It would be great if we didn't do any reference counting at all, and instead relied on traditional liveness analysis for DCE.

- Switch to SSA.  This should be easy.  We already have all of the right things in place: we have an IR in which using a value means referring to its definition.  All that this will take is implementing a fast dominator tree builder (I was thinking of just using Lengauer-Tarjan) and then using it to place Phi nodes.

- Get rid of the notion that a basic block consists of a contiguous set of node indices.  This is probably the worst part of our IR, since it means that inserting a node in between nodes is really hard.

- Introduce a CFG simplification pass, probably formulated using sparse conditional constant propagation so that it does two things at once.  We might even be able to fold this into the CFA.
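As an aside on the dominator-tree step above: here is a rough, hypothetical sketch of the simple iterative fixpoint alternative to Lengauer-Tarjan. This is not JSC code; the function and types are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical sketch (not JSC code): compute dominator sets for a small CFG
// with the classic iterative dataflow fixpoint. Block 0 is the entry;
// preds[i] lists the predecessors of block i.
std::vector<std::set<size_t>> computeDominators(const std::vector<std::vector<size_t>>& preds)
{
    size_t n = preds.size();
    std::set<size_t> all;
    for (size_t i = 0; i < n; ++i)
        all.insert(i);

    std::vector<std::set<size_t>> dom(n, all); // Start from "everything dominates everything".
    dom[0] = { 0 }; // The entry is dominated only by itself.

    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 1; i < n; ++i) {
            // dom(i) = {i} union (intersection of dom(p) over all predecessors p).
            std::set<size_t> next = all;
            for (size_t p : preds[i]) {
                std::set<size_t> intersection;
                for (size_t d : next) {
                    if (dom[p].count(d))
                        intersection.insert(d);
                }
                next = intersection;
            }
            next.insert(i);
            if (next != dom[i]) {
                dom[i] = next;
                changed = true;
            }
        }
    }
    return dom;
}
```

For a diamond-shaped CFG (block 0 branches to 1 and 2, which both jump to 3), this yields dom(3) = {0, 3}, since neither arm dominates the join. Lengauer-Tarjan is asymptotically better, but the fixpoint is hard to get wrong.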

It feels to me like if we accomplish the above - which should not be terribly difficult - then we'd be in a far better position to implement awesome optimizations, particularly since it would be easier to compartmentalize them since the inter-phase invariants of the IR would be crystal clear.

As another thought, I'm kind of curious why you didn't formulate this using loop peeling rather than LICM.  Peeling is far less error-prone and would likely give you the same speed-ups.  You could even do peeling during bytecode parsing using a similar set of tricks as what we used for inlining...  But even there, I feel like those tricks are too messy.  So many bugs I had to fix because of basic block linking in the parser!  If we had a CFG simplification pass - which we currently can't do, because the lack of SSA means that we can't easily slam two blocks together without then having to do painful SetLocal/GetLocal fixup - then several hundred lines of awful code in the parser could be killed off.
Comment 16 Yuqiang Xian 2012-02-07 00:22:54 PST
(In reply to comment #15)
> 
> As another thought, I'm kind of curious why you didn't formulate this using loop peeling rather than LICM.  Peeling is far less error-prone and would likely give you the same speed-ups.  

I'm not sure if I fully understand it, but IMO without global CSE to eliminate the redundant (that is, loop-invariant) expressions in the loop body, we cannot get the benefit of LICM from peeling, but will cause code bloat.
Comment 17 Filip Pizlo 2012-02-07 00:38:23 PST
(In reply to comment #16)
> (In reply to comment #15)
> > 
> > As another thought, I'm kind of curious why you didn't formulate this using loop peeling rather than LICM.  Peeling is far less error-prone and would likely give you the same speed-ups.  
> 
> I'm not sure if I fully understand it, but IMO without global CSE to eliminate the redundant (that is, loop-invariant) expressions in the loop body, we cannot get the benefit of LICM from peeling, but will cause code bloat.

You're mostly right, except loop peeling by itself would buy you more speculation guard eliminations through CFA.  So it might not be a bad first step.  Of course, as with any loop peeling implementation, you'd want to tune it to target loops that were relatively small to begin with.

But also, GCSE is *far* easier to implement than LICM.  Particularly if you tune it right.  You could imagine optimizing for loops whose body consists of <=N (for some small N) basic blocks.  This makes the control flow really easy to deal with in GCSE: essentially the CSE look-back has to consider, if it gets to the top of the block, whether there is either (a) just one predecessor or (b) one of the predecessors that dominates the others.
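A rough sketch of that look-back rule, with illustrative names only (not the real CSE phase), assuming dominator sets have already been computed:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical sketch: decide which predecessor, if any, a CSE look-back may
// continue into once it reaches the top of a block.
struct BlockSketch {
    std::vector<size_t> predecessors;
};

const size_t notFound = static_cast<size_t>(-1);

// dominators[i] is the set of blocks that dominate block i.
size_t lookBackPredecessor(const std::vector<BlockSketch>& blocks, size_t blockIndex,
    const std::vector<std::set<size_t>>& dominators)
{
    const std::vector<size_t>& preds = blocks[blockIndex].predecessors;
    if (preds.empty())
        return notFound;
    if (preds.size() == 1)
        return preds[0]; // Case (a): just one predecessor.
    for (size_t candidate : preds) {
        bool dominatesAll = true;
        for (size_t other : preds) {
            if (!dominators[other].count(candidate))
                dominatesAll = false;
        }
        if (dominatesAll)
            return candidate; // Case (b): one predecessor dominates the others.
    }
    return notFound; // Otherwise the scan must stop at the block boundary.
}
```

In a diamond CFG the look-back stops at the join (neither arm dominates the other), but it continues through straight-line chains of blocks.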

Moreover, GCSE+peeling would probably give us more wins on other benchmarks that LICM by itself would not benefit.
Comment 18 Yuqiang Xian 2012-02-07 01:12:11 PST
(In reply to comment #17)
> But also, GCSE is *far* easier to implement than LICM.  Particularly if you tune it right.  You could imagine optimizing for loops whose body consisted of <=N (for some small N) basic blocks.  This makes the control flow really easy to deal with in GCSE - essentially the CSE look-back has to consider, if it gets to the top of the block, whether there is either (a) just one predecessor or (b) one of the predecessors dominates the others.

Based on the current DFG IR, I wonder whether peeling+GCSE is really easier to implement than LICM... But with the improved IR (e.g., SSA) you mentioned, GCSE might be easier, and in that case LICM should be simplified as well. :)

> 
> Moreover, GCSE+peeling would probably give us more wins on other benchmarks that LICM by itself would not benefit.

Agree.
Comment 19 Filip Pizlo 2012-02-07 10:32:02 PST
(In reply to comment #18)
> (In reply to comment #17)
> > But also, GCSE is *far* easier to implement than LICM.  Particularly if you tune it right.  You could imagine optimizing for loops whose body consisted of <=N (for some small N) basic blocks.  This makes the control flow really easy to deal with in GCSE - essentially the CSE look-back has to consider, if it gets to the top of the block, whether there is either (a) just one predecessor or (b) one of the predecessors dominates the others.
> 
> If based on current DFG IR, I wonder whether peeling+GCSE is really easier than LICM to implement... But if with the improved IR (e.g., SSA) as you mentioned, GCSE might be easier - and in such case LICM should be simplified as well. :)

There are two other issues on my mind:

1) This patch's reliance on the peculiarities of the current IR will make changing IRs more difficult. And I am working on changing the IR. With any luck I'll have something that resembles SSA within a month. And then we can do GVN+GCSE+peeling+unrolling. 

2) I find it hard to visualize how to get all of the cases of OSR (both entry and exit, but mostly exit) right in the presence of *any* code motion. Moving code means that we no longer have a clear OSR exit target in the old JIT. I'm not sure your patch gets this right, and even if it does get it mostly right, it *will* have a nasty bug tail. Since code motion (and hoisting in particular) is not particularly effective in general for effectful languages that also have loads of safety checks (I remember that in Java it gave you nothing), and since our only LICM wins are in Kraken, which is a questionable and mostly unrealistic benchmark that we already do OK on, I'm not sure it would be a wise cost-benefit trade-off.



> > 
> > Moreover, GCSE+peeling would probably give us more wins on other benchmarks that LICM by itself would not benefit.
> 
> Agree.
Comment 20 Yuqiang Xian 2012-02-07 18:07:06 PST
(In reply to comment #19)
> 
> 1) And I am working on changing the IR. With any luck I'll have something that resembles SSA within a month. And then we can do GVN+GCSE+peeling+unrolling. 

That's cool...

> 
> 2) I find it hard to visualize how to get all of the cases of OSR (both entry and exit but mostly exit) right in the presence of *any* code motion. Moving code means that we no longer have a clear OSR exit target in the old JIT. I'm not sure your patch gets this right, and even if it does get it mostly right, it *will* have a nasty bug tail. 

The patch tries to handle an OSR exit as if no hoisted expression had been evaluated, whether the OSR exit happens in the pre-header or in the loop body.

It (supposedly) achieves this by:
1) Not calling compileMovHint or setting the operand value source for the SetLocals in the pre-header, which means we're not changing the value source and lastSetOperand in the pre-header;
2) Making the codeOriginForOSR of *any* DFG node in the pre-header the origin of the original single node in the pre-header before LICM ("loop_preheader");
3) Keeping the hoisted nodes in the loop body rather than replacing them, but marking them "NotGenerate" (refcount 0), which means the SetLocals in the loop body are kept so that compileMovHint correctly records the operand value source and lastSetOperand.

These tweaks may result in some "wasteful" operations if the OSR exit happens in the pre-header (the loop-invariant evaluation before the OSR exit is useless, but on the other hand we exit earlier) or in the loop body (some operand values may already be in the register file thanks to the SetLocals in the pre-header, but we will (redundantly) re-store the value of the child node, actually a GetLocal, to the register file). But they are just meant to make the OSR exit correct.

But maybe I still have something missed...

> Since code motion (and hoisting in particular) is not particularly effective in general for effectful languages that also have loads of safety checks (I remember that in Java it gave you nothing), and since our only LICM wins are in Kraken, which is a questionable and mostly unrealistic benchmark that we already do OK on, I'm not sure it would be a wise cost-benefit trade-off.

I think one reason why we don't get a bigger improvement from LICM is that we have to store the result of the loop-invariant expression into memory first, and then read it back from memory in the loop body when it is used. If we had SSA, got rid of the SetLocal/GetLocals, and then had a global register allocator, the LICM improvement alone might be more promising. Also, we currently hoist nodes rather conservatively; with deeper side-effect analysis we might find more loop invariants.

But anyway, I'm open to this. It's your call. :)
Comment 21 Filip Pizlo 2012-02-07 18:15:55 PST
> The patch tries to handle an OSR exit as if no hoisted expression had been evaluated, whether the OSR exit happens in the pre-header or in the loop body.
> 
> It (supposedly) achieves this by:
> 1) Not calling compileMovHint or setting the operand value source for the SetLocals in the pre-header, which means we're not changing the value source and lastSetOperand in the pre-header;
> 2) Making the codeOriginForOSR of *any* DFG node in the pre-header the origin of the original single node in the pre-header before LICM ("loop_preheader");
> 3) Keeping the hoisted nodes in the loop body rather than replacing them, but marking them "NotGenerate" (refcount 0), which means the SetLocals in the loop body are kept so that compileMovHint correctly records the operand value source and lastSetOperand.
> 
> These tweaks may result in some "wasteful" operations if the OSR exit happens in the pre-header (the loop-invariant evaluation before the OSR exit is useless, but on the other hand we exit earlier) or in the loop body (some operand values may already be in the register file thanks to the SetLocals in the pre-header, but we will (redundantly) re-store the value of the child node, actually a GetLocal, to the register file). But they are just meant to make the OSR exit correct.
> 
> But maybe I still have something missed...

It seems like you probably covered everything.

But these mechanisms are already really confusing.  Also, you're adding to the already confusing "code origin" notion for a node: first we just had a bytecode index for a node, then we had bytecode index + inline stack, and then we had bytecode index + bytecode-index offset for the set operation + inline stack.

Now we will have bytecode index for debugging + bytecode index for set operation + inline stack + bytecode index for OSR exit target, the latter being conveyed implicitly by whether the node is in the pre header.

Short story: there will be bugs, and awesome ones at that. :-)

> 
> > Since code motion (and hoisting in particular) is not particularly effective in general for effectful languages that also have loads of safety checks (I remember that in Java it gave you nothing), and since our only LICM wins are in Kraken, which is a questionable and mostly unrealistic benchmark that we already do OK on, I'm not sure it would be a wise cost-benefit trade-off.
> 
> I think one reason why we don't get a bigger improvement from LICM is that we have to store the result of the loop-invariant expression into memory first, and then read it back from memory in the loop body when it is used. If we had SSA, got rid of the SetLocal/GetLocals, and then had a global register allocator, the LICM improvement alone might be more promising. Also, we currently hoist nodes rather conservatively; with deeper side-effect analysis we might find more loop invariants.

I've not seen deeper side effect analysis be a win, in the past.

And you're right - SSA will be a huge benefit for this code.

> 
> But anyway, I'm open to this. It's your call. :)

Still thinking about it.  It's a tough call.  This is an awesome patch; I'm just super worried about the bug parade that will ensue. ;-)
Comment 22 Yuqiang Xian 2012-02-27 21:41:00 PST
Created attachment 129185 [details]
Rebase

Don't worry... simply for back-up purpose. ;)
Comment 23 Filip Pizlo 2012-02-27 23:12:17 PST
(In reply to comment #22)
> Created an attachment (id=129185) [details]
> Rebase
> 
> Don't worry... simply for back-up purpose. ;)

I would be worried if you didn't back up a rebase! :-)

I've been thinking a lot about how to integrate your work while still maintaining my goals: maintainability and not impeding future optimizations.

I think we could do this by making the following specific refactorings, and then landing this patch, with modifications to take advantage of those refactorings:

1) We shouldn't have to do dominator-based natural loop inference.  That makes sense when the source language is gotos and labels.  In our case the source language is JS, which is structured.  I don't think it's even possible to have a back edge that isn't part of a natural loop.  The bytecode generator always knows when it's emitting a back edge, always knows what loop it's part of, and always knows all of the instructions that are part of the loop, and always knows where the header of the loop is.  It would be simpler, I think, if the bytecode generator simply recorded this information in CodeBlock and the DFG could just use it.  Better yet, we should preserve that information in the DFG's Graph and BasicBlocks and require all passes to preserve it as well.

2) BasicBlocks should not require that their nodes have contiguous indices in the graph.  A BasicBlock should instead have a Vector<NodeIndex> that references all of the block's nodes, in the order in which they are to be executed.  It should be possible to add nodes to a BasicBlock by (1) appending them to the graph and (2) appending their indices to the Vector<NodeIndex> in BasicBlock.

3) CodeOrigin should be able to say something like "this code originated from X but for OSR exit should be treated as if it was at Y".  This is in addition to also saying the point Z that corresponds to the code's value profiling site.  Currently CodeOrigin stores X and Z and assumes Y == X.  But allowing for a differentiation would make hoisting really easy and robust - the OSR exit code for code hoisted by LICM would simply know that the code should exit to the loop header because Y would be set to the loop header.

4) We should at least get rid of the conflation of SetLocal as a hint to OSR and SetLocal as a mechanism for preserving state across basic blocks.  The "ideal" way of doing this would be to go straight to SSA.  But it may be that we don't need to do something so drastic to adopt this patch.  The main thing I worry about - and let me know if your patch handles this gracefully - is that you could have an expression in the loop like "a:Foo(...); SetLocal(@a, r<blah>)" where @a is loop-invariant but r<blah> is a variable that is live at the loop header because it contained other things.  Hoisting this combination could cause badness.  What we really want is to hoist just @a and have some elegant way of referring to it.  If we really wanted to, we could just create a new temporary and use SetLocal/GetLocal on that.  But that feels yucky.
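Refactoring (3) above could be sketched roughly like this. The field and type names are made up for illustration; the real CodeOrigin differs.

```cpp
#include <cassert>

// Hypothetical sketch of a CodeOrigin that separates "where the code came
// from" (X), "where OSR exit should resume" (Y), and the value-profiling
// site (Z). Not the actual JSC CodeOrigin.
struct CodeOriginSketch {
    unsigned bytecodeIndex;  // X: the node's source location.
    unsigned osrExitTarget;  // Y: where an OSR exit should land.
    unsigned profilingSite;  // Z: the value-profiling site.

    explicit CodeOriginSketch(unsigned index)
        : bytecodeIndex(index)
        , osrExitTarget(index) // By default, Y == X.
        , profilingSite(index)
    {
    }

    // A node hoisted by LICM keeps its origin and profiling site, but its
    // OSR exit retargets to the loop header.
    CodeOriginSketch hoistedTo(unsigned loopHeaderIndex) const
    {
        CodeOriginSketch result(*this);
        result.osrExitTarget = loopHeaderIndex;
        return result;
    }
};
```

With something like this, the OSR exit machinery never needs to infer the exit target from whether a node happens to sit in the pre-header; the origin says so explicitly.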

It may be that (4) really does require SSA.  If so then maybe we should just do it.  It really shouldn't be that hard! :-)

But (1), (2), and (3) are totally doable.  I was thinking about doing them but I thought I'd throw it out here in case you could get around to it before me.  I'm bogged down with 64-bit LLInt right now, and some bizarre DFG register allocation pathologies. :-|

Please let me know if you like this development plan.  If so, feel free to create bugs for the subtasks I listed above - or any others you can think of - and make them blockers for this bug.

To be clear, I really like your LICM implementation; I just want to make sure that we integrate it in a way that makes it most profitable and least risky! :-)
Comment 24 Yuqiang Xian 2012-02-28 05:47:08 PST
(In reply to comment #23)
> 
> I would be worried if you didn't back up a rebase! :-)
> 
> I've been thinking a lot about how to integrate your work while still maintaining my goals: maintainability and not impeding future optimizations.

Thanks for taking care of this. :)

> 
> I think we could do this by making the following specific refactorings, and then landing this patch, with modifications to take advantage of those refactorings:
> 
> 1) We shouldn't have to do dominator-based natural loop inference.  That makes sense when the source language is gotos and labels.  In our case the source language is JS, which is structured.  I don't think it's even possible to have a back edge that isn't part of a natural loop.  The bytecode generator always knows when it's emitting a back edge, always knows what loop it's part of, and always knows all of the instructions that are part of the loop, and always knows where the header of the loop is.  It would be simpler, I think, if the bytecode generator simply recorded this information in CodeBlock and the DFG could just use it.  Better yet, we should preserve that information in the DFG's Graph and BasicBlocks and require all passes to preserve it as well.

Let me try to understand your concern, and please correct me if I misunderstand it... IMHO we're not doing loop inference. For now we record the loop pre-headers in the DFG when parsing the bytecode. The dominator analysis is added only to find the dominators of the loop exit, to filter out those blocks which may not be reached in the loop due to branches (we don't want to hoist any nodes in such blocks).

> 
> 2) BasicBlocks should not require that their nodes have contiguous indices in the graph.  A BasicBlock should instead have a Vector<NodeIndex> that references all of the block's nodes, in the order in which they are to be executed.  It should be possible to add nodes to a BasicBlock by (1) appending them to the graph and (2) appending their indices to the Vector<NodeIndex> in BasicBlock.
> 
> 3) CodeOrigin should be able to say something like "this code originated from X but for OSR exit should be treated as if it was at Y".  This is in addition to also saying the point Z that corresponds to the code's value profiling site.  Currently CodeOrigin stores X and Z and assumes Y == X.  But allowing for a differentiation would make hoisting really easy and robust - the OSR exit code for code hoisted by LICM would simply know that the code should exit to the loop header because Y would be set to the loop header.

Certainly, those two are doable.

> 
> 4) We should at least get rid of the conflation of SetLocal as a hint to OSR and SetLocal as a mechanism for preserving state across basic blocks.  

I can't agree with you anymore on this...
In fact, I realized a bug related to OSR exit in current LICM implementation some days ago. I must admit that we have to do something with... although some tweaks/workarounds could be made based on current SetLocal mechanism, they look dirty and are really not performance friendly.

>
> The main thing I worry about - and let me know if your patch handles this gracefully - is that you could have an expression in the loop like "a:Foo(...); SetLocal(@a, r<blah>)" where @a is loop-invariant but r<blah> is a variable that is live at the loop header because it contained other things.  Hoisting this combination could cause badness.  What we really want is to hoist just @a and have some elegant way of referring to it.  If we really wanted to, we could just create a new temporary and use SetLocal/GetLocal on that.  But that feels yucky.

For such a case, the current implementation will simply (conservatively) _not_ hoist either @a or the SetLocal: we track the live variables in the loop.

> 
> It may be that (4) really does require SSA.  If so then maybe we should just do it.  It really shouldn't be that hard! :-)
> 
> But (1), (2), and (3) are totally doable.  I was thinking about doing them but I thought I'd throw it out here in case you could get around to it before me.  I'm bogged down with 64-bit LLInt right now, and some bizarre DFG register allocation pathologies. :-|
> 
> Please let me know if you like this development plan.  If so, feel free to create bugs for the subtasks I listed above - or any others you can think of - and make them blockers for this bug.

Sure, I like it. :) And never mind, if you're busy and I get some time, I can start some of the listed items.

> 
> To be clear, I really like your LICM implementation; I just want to make sure that we integrate it in a way that makes it most profitable and least risky! :-)

Thanks!
Comment 25 Yuqiang Xian 2012-02-29 02:26:03 PST
Bug 79899 has been created to allow BasicBlocks to have non-contiguous node indices.

(In reply to comment #24)
> 
> I can't agree with you anymore on this...
> In fact, I realized a bug related to OSR exit in current LICM implementation some days ago. I must admit that we have to do something with... although some tweaks/workarounds could be made based on current SetLocal mechanism, they look dirty and are really not performance friendly.

Ah... Sorry for a HUGE typo. I meant to say "I couldn't agree with you more..." - forgive my poor English please.
Comment 26 Filip Pizlo 2012-02-29 14:44:25 PST
> > 1) We shouldn't have to do dominator-based natural loop inference.  That makes sense when the source language is gotos and labels.  In our case the source language is JS, which is structured.  I don't think it's even possible to have a back edge that isn't part of a natural loop.  The bytecode generator always knows when it's emitting a back edge, always knows what loop it's part of, and always knows all of the instructions that are part of the loop, and always knows where the header of the loop is.  It would be simpler, I think, if the bytecode generator simply recorded this information in CodeBlock and the DFG could just use it.  Better yet, we should preserve that information in the DFG's Graph and BasicBlocks and require all passes to preserve it as well.
> 
> Let me try to understand your concern, and please correct me if I mis-understand it... IMHO we're not doing loop inference. For now we record the loop pre-headers in DFG when parsing the bytecode. The dominator analysis is added only for finding out the dominators of the loop exit - to filter out those blocks which may not be reached in the loop due to branches (we don't want to hoist any nodes in such blocks).

Ahhhh yeah you're at least more than half right. ;-)  We could do this in the bytecode generator, but it might be complicated.  Here's what I was thinking:

If you're in a loop the bytecode generator could indicate this in its context.  If it hits a conditional while in a loop, it can indicate that it's still in the loop but no longer in code that dominates the exit.  Except that this isn't quite right if you have:

loop {
    if (foo) {
        stuff;
        break;
    }
}

So then you'd have to add additional checks to see if there's a break statement at the end of the conditional block, but then you'd have to further handle the case of:

loop {
    if (foo) {
        stuff;
        if (bar) {
            other stuff;
            break;
        } else {
            some other stuff;
            break;
        }
    }
}

Or maybe your dominator analysis doesn't catch this either?

Just something to think about.  I definitely think that if the bytecode generator could give us this information, then it would be cheaper than computing it in the DFG.  On the other hand, having a loop analysis in the DFG carries the benefit that it will work even if it is done after the DFG has already done other control flow changes.

In short, I think you're right, but I figure I'd throw this out there in case you have ideas on how to make this better.

> >
> > The main thing I worry about - and let me know if your patch handles this gracefully - is that you could have an expression in the loop like "a:Foo(...); SetLocal(@a, r<blah>)" where @a is loop-invariant but r<blah> is a variable that is live at the loop header because it contained other things.  Hoisting this combination could cause badness.  What we really want is to hoist just @a and have some elegant way of referring to it.  If we really wanted to, we could just create a new temporary and use SetLocal/GetLocal on that.  But that feels yucky.
> 
> For such case, the current implementation will simply (conservatively) _not_ hoist either @a or the SetLocal - We track the live variables in the loop.

Right.  That's probably good enough, but it makes me sad.

> 
> > 
> > It may be that (4) really does require SSA.  If so then maybe we should just do it.  It really shouldn't be that hard! :-)
> > 
> > But (1), (2), and (3) are totally doable.  I was thinking about doing them but I thought I'd throw it out here in case you could get around to it before me.  I'm bogged down with 64-bit LLInt right now, and some bizarre DFG register allocation pathologies. :-|
> > 
> > Please let me know if you like this development plan.  If so, feel free to create bugs for the subtasks I listed above - or any others you can think of - and make them blockers for this bug.
> 
> Sure, I like it. :). And never mind, if you're busy and I get some time I can start some of the listed items.

That's great!
Comment 27 Filip Pizlo 2012-02-29 14:45:42 PST
(In reply to comment #25)
> Bug 79899 is created trying to allow BasicBlocks have non-continuous node indices.
> 
> (In reply to comment #24)
> > 
> > I can't agree with you anymore on this...
> > In fact, I realized a bug related to OSR exit in current LICM implementation some days ago. I must admit that we have to do something with... although some tweaks/workarounds could be made based on current SetLocal mechanism, they look dirty and are really not performance friendly.
> 
> Ah... Sorry for a HUGE typo. I tended to mean "I couldn't agree with you more..." - forgive my poor English please.

I fail at English often as well.  It's actually not my first language, either (Polish is).
Comment 28 Yuqiang Xian 2012-02-29 19:31:26 PST
(In reply to comment #26)
> 
> Ahhhh yeah you're at least more than half right. ;-)  We could do this in the bytecode generator, but it might be complicated.  Here's what I was thinking:
> 
> If you're in a loop the bytecode generator could indicate this in its context.  If it hits a conditional while in a loop, it can indicate that it's still in the loop but no longer in code that dominates the exit.  Except that this isn't quite right if you have:
> 
> loop {
>     if (foo) {
>         stuff;
>         break;
>     }
> }
> 
> So then you'd have to add additional checks to see if there's a break statement at the end of the conditional block, but then you'd have to further handle the case of:
> 
> loop {
>     if (foo) {
>         stuff;
>         if (bar) {
>             other stuff;
>             break;
>         } else {
>             some other stuff;
>             break;
>         }
>     }
> }
> 
> Or maybe your dominator analysis doesn't catch this either?

Right. There's a bug in the dominator analysis in the current implementation regarding this scenario. We should analyze the dominators of the loop exit's successor instead of the loop exit itself. I think this catches the breaks.