Bug 79899 - DFG BasicBlocks should not require that their nodes have contiguous indices in the graph
Summary: DFG BasicBlocks should not require that their nodes have contiguous indices in the graph
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks: 76770
Reported: 2012-02-29 02:19 PST by Yuqiang Xian
Modified: 2012-02-29 23:40 PST

See Also:


Attachments
WIP patch (37.73 KB, patch)
2012-02-29 07:31 PST, Yuqiang Xian
no flags
Patch for review (37.45 KB, patch)
2012-02-29 19:20 PST, Yuqiang Xian
fpizlo: review-
patch updated (46.65 KB, patch)
2012-02-29 21:23 PST, Yuqiang Xian
fpizlo: review+

Description Yuqiang Xian 2012-02-29 02:19:36 PST
As discussed in bug 76770, we should make it more convenient to insert nodes into the DFG, so a BasicBlock may not have contiguous Node indices.
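
For concreteness, here is a minimal sketch of that representational change, using hypothetical types rather than the actual JavaScriptCore classes:

    // Old assumption: a block's nodes occupy a contiguous [begin, end) range of
    // graph indices, so nothing can be inserted mid-block without renumbering.
    // New shape: the block owns an explicit, ordered list of node indices.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    typedef uint32_t NodeIndex;

    struct ContiguousBlock {           // hypothetical name for the old layout
        NodeIndex begin;               // nodes live at [begin, end) in the graph
        NodeIndex end;
    };

    struct ListBackedBlock {           // hypothetical name for the new layout
        std::vector<NodeIndex> nodes;  // indices into the graph, in program order
        size_t startExcludingPhis;     // Phi nodes, if stored here, come first

        NodeIndex operator[](size_t i) const { return nodes[i]; }
        size_t size() const { return nodes.size(); }
        void append(NodeIndex n) { nodes.push_back(n); }
        void prepend(NodeIndex n) { nodes.insert(nodes.begin(), n); } // O(n)
    };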
Comment 1 Yuqiang Xian 2012-02-29 07:31:21 PST
Created attachment 129448 [details]
WIP patch

WIP.

There seems to be 1~2% performance regression on SunSpider. Will investigate it later.
Comment 2 Filip Pizlo 2012-02-29 14:50:24 PST
(In reply to comment #1)
> Created an attachment (id=129448) [details]
> WIP patch
> 
> WIP.
> 
> There seems to be 1~2% performance regression on SunSpider. Will investigate it later.

My guess: the HashMap you added to AbstractState.

My original implementation had a Vector<AbstractValue> indexed by NodeIndex rather than NodeIndex - m_block->begin.  But my original implementation also had loads of regressions, which turned out to be because of a combination of clobberStructures() (that's why I have that m_haveStructures and the hacks to make clobberStructures() walk over fewer things).  In the process, I also changed the Vector to be indexed by NodeIndex offset from block begin, but I have no idea if that change improved performance at all.

You might want to turn on SAMPLING_REGIONS to see what's going on.  That should quickly tell you if you've increased DFG compile time.

But in the long run, we should probably tweak the DFG to run concurrently rather than synchronously.  That will be a glorious hack, and will probably require some thread safety changes elsewhere in the runtime since the DFG parser pokes JS heap objects that are somewhat mutable (like Structure).  But it will reduce the likelihood of us having to worry about regressions from compile times in the future.
Comment 3 Yuqiang Xian 2012-02-29 18:26:00 PST
(In reply to comment #2)
> 
> My guess: the HashMap you added to AbstractState.
> 
> My original implementation had a Vector<AbstractValue> indexed by NodeIndex rather than NodeIndex - m_block->begin.  But my original implementation also had loads of regressions, which turned out to be because of a combination of clobberStructures() (that's why I have that m_haveStructures and the hacks to make clobberStructures() walk over fewer things).  In the process, I also changed the Vector to be indexed by NodeIndex offset from block begin, but I have no idea if that change improved performance at all.
> 

Yes, you're right. If we still maintain the abstract values in a Vector and index them by NodeIndex, then there's no such performance regression. However, it may use more memory, as the vector size is equal to the graph size. Do you think it's an acceptable tradeoff?
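
For concreteness, a hedged sketch of the two storage strategies under discussion (hypothetical types, not the real AbstractState): indexing by NodeIndex costs one slot per node in the whole graph, but avoids a hash lookup on every node the analysis touches.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    typedef uint32_t NodeIndex;
    struct AbstractValue { uint64_t type; uint64_t structures; }; // ~2 64-bit words

    struct MapBackedState {            // slower: hash lookup on every access
        std::unordered_map<NodeIndex, AbstractValue> values;
        AbstractValue& at(NodeIndex i) { return values[i]; }
    };

    struct VectorBackedState {         // faster: O(1) indexing, graph-sized storage
        std::vector<AbstractValue> values;
        void initialize(size_t graphSize) { values.resize(graphSize); }
        AbstractValue& at(NodeIndex i) { return values[i]; }
    };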
Comment 4 Filip Pizlo 2012-02-29 18:39:12 PST
(In reply to comment #3)
> (In reply to comment #2)
> > 
> > My guess: the HashMap you added to AbstractState.
> > 
> > My original implementation had a Vector<AbstractValue> indexed by NodeIndex rather than NodeIndex - m_block->begin.  But my original implementation also had loads of regressions, which turned out to be because of a combination of clobberStructures() (that's why I have that m_haveStructures and the hacks to make clobberStructures() walk over fewer things).  In the process, I also changed the Vector to be indexed by NodeIndex offset from block begin, but I have no idea if that change improved performance at all.
> > 
> 
> Yes, you're right. If we still maintain the abstract values in a Vector and index them by NodeIndex, then there's no such performance regression. However, it may use more memory, as the vector size is equal to the graph size. Do you think it's an acceptable tradeoff?

That's great!  I think that this is a perfectly acceptable tradeoff.

Here's my specific thinking:

- The old JIT is emitting ~100 bytes of machine code for basic arithmetic instructions, and probably no less than ~20 bytes of any other instruction.  Some instructions emit drastically more than 100 bytes.  So long as the DFG's footprint per bytecode instruction (with on average two DFG nodes - the operation and the SetLocal - per bytecode instruction) is less than 100, we probably shouldn't worry.  I think we're already over this by the way, but probably not by much.

- The DFG::Node class currently has ~8 32-bit words of junk.  Here we're adding 2 64-bit words to represent the AbstractValue.  So this is a small regression.

- The DFG should be allowed to make a space-time trade-off if needed, and a space vs. team-productivity trade-off as well.  We only invoke the DFG rarely.  We only invoke the DFG for code that is not absurdly big.  And when we do invoke it, there is only one instance of the DFG running at a time.  Hence it should be allowed to use some extra memory if it's beneficial for performance, or if it makes the code easier for us to hack.
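
(Rough arithmetic from the figures above: the AbstractValue adds 2 64-bit words, i.e. 16 bytes per node; at roughly two DFG nodes per bytecode instruction that is ~32 extra bytes per bytecode instruction, compared with the ~100 bytes of machine code the old JIT emits for a basic arithmetic instruction.)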
Comment 5 Yuqiang Xian 2012-02-29 19:20:42 PST
Created attachment 129614 [details]
Patch for review

Filip, thanks for the clarification.

Here's the patch for review.

Performance result is here as well (almost neutral):

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                               ToT                    79899                                      
SunSpider:
   3d-cube                                8.1317+-0.0753    ?     8.1812+-0.0359       ?
   3d-morph                               9.6125+-0.0396    ?     9.6248+-0.0347       ?
   3d-raytrace                           10.4898+-0.1458         10.4073+-0.0569       
   access-binary-trees                    2.4043+-0.0633    ?     2.4274+-0.0725       ?
   access-fannkuch                       10.7972+-0.9117         10.3027+-0.0398         might be 1.0480x faster
   access-nbody                           5.4502+-0.0471    ?     5.4977+-0.0513       ?
   access-nsieve                          4.0119+-0.0407          4.0100+-0.0545       
   bitops-3bit-bits-in-byte               1.2555+-0.0202          1.2482+-0.0243       
   bitops-bits-in-byte                    4.6557+-0.0646          4.6193+-0.0646       
   bitops-bitwise-and                     4.2758+-0.0490    ?     4.3005+-0.0636       ?
   bitops-nsieve-bits                     4.2680+-0.0539          4.2642+-0.0477       
   controlflow-recursive                  2.8596+-0.0376          2.8366+-0.0324       
   crypto-aes                             9.5863+-0.1921    ?     9.6874+-0.1197       ? might be 1.0105x slower
   crypto-md5                             3.3988+-0.0716    ?     3.4228+-0.0502       ?
   crypto-sha1                            2.7206+-0.0500    ?     2.7903+-0.0403       ? might be 1.0256x slower
   date-format-tofte                     13.6587+-1.9276         12.7911+-0.0741         might be 1.0678x faster
   date-format-xparb                     11.9622+-0.1335    ?    12.1260+-0.1427       ? might be 1.0137x slower
   math-cordic                            4.0385+-0.0675    ?     4.0777+-0.0784       ?
   math-partial-sums                     14.8533+-0.0333    ?    15.5325+-1.5644       ? might be 1.0457x slower
   math-spectral-norm                     2.7100+-0.0501          2.6963+-0.0447       
   regexp-dna                             9.3486+-0.0752    ?     9.4206+-0.0760       ?
   string-base64                          5.6182+-0.0711    ?     5.7111+-0.1056       ? might be 1.0165x slower
   string-fasta                           9.4383+-0.0563          9.4094+-0.0468       
   string-tagcloud                       16.9364+-1.9261         16.1189+-0.0514         might be 1.0507x faster
   string-unpack-code                    27.5010+-0.2195         27.4682+-0.0478       
   string-validate-input                  8.3231+-0.0630    ?     8.3302+-0.0378       ?

   <arithmetic> *                         8.0118+-0.1202          7.9732+-0.0610         might be 1.0048x faster
   <geometric>                            6.3050+-0.0562          6.3023+-0.0375         might be 1.0004x faster
   <harmonic>                             4.8941+-0.0416    ?     4.9019+-0.0366       ? might be 1.0016x slower

                                               ToT                    79899                                      
V8:
   crypto                                94.0062+-0.5066    ?    95.0840+-1.2626       ? might be 1.0115x slower
   deltablue                            165.9663+-1.3598    ?   166.3503+-1.6042       ?
   earley-boyer                         121.4067+-2.5646        119.8530+-2.6862         might be 1.0130x faster
   raytrace                              58.6499+-0.4516    ?    59.4065+-0.4719       ? might be 1.0129x slower
   regexp                               111.1548+-0.4702        110.9294+-0.4109       
   richards                             191.6663+-1.2047        190.4287+-0.8902       
   splay                                 77.4138+-0.9005         77.3018+-0.2461       

   <arithmetic>                         117.1806+-0.5576        117.0505+-0.6485         might be 1.0011x faster
   <geometric> *                        109.0251+-0.5826    ?   109.0822+-0.6494       ? might be 1.0005x slower
   <harmonic>                           101.3128+-0.6036    ?   101.5676+-0.6212       ? might be 1.0025x slower

                                               ToT                    79899                                      
Kraken:
   ai-astar                             789.5488+-4.9142        788.0722+-5.0229       
   audio-beat-detection                 237.0390+-0.3647    ?   238.9648+-1.5770       ?
   audio-dft                            367.7708+-2.5980    ?   378.8661+-9.5683       ? might be 1.0302x slower
   audio-fft                            151.2529+-0.0775    ?   151.2913+-0.1593       ?
   audio-oscillator                     344.1697+-1.3283    ?   345.5862+-2.6765       ?
   imaging-darkroom                     377.7554+-9.3130        375.5139+-9.5351       
   imaging-desaturate                   303.6464+-0.6332        303.4512+-0.6833       
   imaging-gaussian-blur                506.0675+-0.1486    ?   506.2780+-0.2269       ?
   json-parse-financial                  85.7818+-0.3590    !    86.9759+-0.7928       ! definitely 1.0139x slower
   json-stringify-tinderbox             106.0411+-0.7759    ?   106.0652+-0.6543       ?
   stanford-crypto-aes                  105.6489+-0.5791    ?   106.1828+-0.4288       ?
   stanford-crypto-ccm                  100.3177+-0.8778         99.5192+-0.5648       
   stanford-crypto-pbkdf2               235.7453+-0.5540    ^   234.9382+-0.1006       ^ definitely 1.0034x faster
   stanford-crypto-sha256-iterative     102.8190+-0.5031    ?   103.1907+-0.5652       ?

   <arithmetic> *                       272.4003+-1.0940    ?   273.2068+-1.3736       ? might be 1.0030x slower
   <geometric>                          216.1398+-0.7343    ?   216.8310+-0.8875       ? might be 1.0032x slower
   <harmonic>                           174.7069+-0.5262    ?   175.2619+-0.5780       ? might be 1.0032x slower

                                               ToT                    79899                                      
All benchmarks:
   <arithmetic>                         103.0250+-0.4143    ?   103.2245+-0.4864       ? might be 1.0019x slower
   <geometric>                           27.6239+-0.1723    ?    27.6462+-0.1356       ? might be 1.0008x slower
   <harmonic>                             8.6053+-0.0719    ?     8.6194+-0.0633       ? might be 1.0016x slower

                                               ToT                    79899                                      
Geomean of preferred means:
   <scaled-result>                       61.9631+-0.4216         61.9369+-0.2838         might be 1.0004x faster
Comment 6 Filip Pizlo 2012-02-29 19:42:24 PST
Comment on attachment 129614 [details]
Patch for review

View in context: https://bugs.webkit.org/attachment.cgi?id=129614&action=review

This is really awesome.  R- but only because I want to make sure we do CSE right.

> Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:720
>      
>      void performBlockCSE(BasicBlock& block)
>      {
> -        m_start = block.begin;
> -        NodeIndex end = block.end;
> -        for (m_compileIndex = m_start; m_compileIndex < end; ++m_compileIndex)
> +        m_start = block[block.startExcludingPhis];
> +        for (unsigned i = block.startExcludingPhis; i < block.size(); ++i) {
> +            m_compileIndex = block[i];
>              performNodeCSE(m_graph[m_compileIndex]);
> +        }
>      }
>      
>      NodeIndex m_start;

I'm somewhat surprised that these are the only changes in CSE!  CSE has this whole performance hack where it uses the NodeIndex of the instruction that is a candidate for elimination, and the NodeIndices of its children, to determine the range that it will search.

Seems like that relies on NodeIndices of things in a basic block being monotonically increasing.

Would be better to instead make all of the search things break early when they encounter a NodeIndex at block[i] that matches the NodeIndex in one of the children of the node from which the search originated.
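
A hedged sketch of that early-termination idea, using hypothetical helpers rather than the actual DFGCSEPhase code: walk backwards through the block's node list and stop once one of the candidate's children is reached, which recovers the old start-index hack without assuming contiguous indices.

    #include <cstdint>
    #include <vector>

    typedef uint32_t NodeIndex;
    const NodeIndex NoNode = UINT32_MAX;       // hypothetical "not found" sentinel

    // Hypothetical predicate; the real CSE compares opcodes and children.
    bool matchesCandidate(NodeIndex) { return false; }

    NodeIndex findReplacement(const std::vector<NodeIndex>& blockNodes,
                              unsigned candidatePosition,
                              NodeIndex child1, NodeIndex child2)
    {
        for (unsigned i = candidatePosition; i--;) {
            NodeIndex index = blockNodes[i];
            if (index == child1 || index == child2)
                break;                         // reached an operand; stop searching
            if (matchesCandidate(index))
                return index;
        }
        return NoNode;
    }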
Comment 7 Yuqiang Xian 2012-02-29 19:47:19 PST
(In reply to comment #6)
> (From update of attachment 129614 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=129614&action=review
> 
> This is really awesome.  R- but only because I want to make sure we do CSE right.
> 
> > Source/JavaScriptCore/dfg/DFGCSEPhase.cpp:720
> >      
> >      void performBlockCSE(BasicBlock& block)
> >      {
> > -        m_start = block.begin;
> > -        NodeIndex end = block.end;
> > -        for (m_compileIndex = m_start; m_compileIndex < end; ++m_compileIndex)
> > +        m_start = block[block.startExcludingPhis];
> > +        for (unsigned i = block.startExcludingPhis; i < block.size(); ++i) {
> > +            m_compileIndex = block[i];
> >              performNodeCSE(m_graph[m_compileIndex]);
> > +        }
> >      }
> >      
> >      NodeIndex m_start;
> 
> I'm somewhat surprised that these are the only changes in CSE!  CSE has this whole performance hack where it uses the NodeIndex of the instruction that is a candidate for elimination, and the NodeIndices of its children, to determine the range that it will search.
> 
> Seems like that relies on NodeIndices of things in a basic block being monotonically increasing.
> 
> Would be better to instead make all of the search things break early when they encounter a NodeIndex at block[i] that matches the NodeIndex in one of the children of the node from which the search originated.

Yes, you're absolutely right. I didn't touch CSE for now - actually it still relies on the assumption that the node indices are contiguous, which is true at the current stage because we exclude the inserted Phi nodes. But yes, we should fix it.
Comment 8 Filip Pizlo 2012-02-29 19:50:05 PST
> Yes, you're absolutely right. I didn't touch CSE for now - actually it still relies on the assumption that the node indices are contiguous, which is true at the current stage because we exclude the inserted Phi nodes. But yes, we should fix it.

Yeah, I think this is one of those things that we want to fix in this patch.  Feels like even though it will make for a bigger patch, the result will be cleaner and more coherent.
Comment 9 Yuqiang Xian 2012-02-29 21:23:50 PST
Created attachment 129627 [details]
patch updated

Corrected CSE.

Performance result here:

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                               ToT                    79899
SunSpider:
   3d-cube                                8.2464+-0.0825    ?     8.8842+-1.3563       ? might be 1.0773x slower
   3d-morph                               9.7461+-0.1194          9.6008+-0.0471         might be 1.0151x faster
   3d-raytrace                           10.3699+-0.0746    ?    10.4985+-0.0897       ? might be 1.0124x slower
   access-binary-trees                    2.4174+-0.0491    ?     2.4208+-0.0299       ?
   access-fannkuch                       10.2097+-0.0340    !    10.3671+-0.0738       ! definitely 1.0154x slower
   access-nbody                           5.4205+-0.0502    ?     5.4971+-0.0408       ? might be 1.0141x slower
   access-nsieve                          4.0139+-0.0354          3.9855+-0.0549
   bitops-3bit-bits-in-byte               1.2879+-0.1218          1.2674+-0.0267         might be 1.0162x faster
   bitops-bits-in-byte                    4.6899+-0.0529    ?     4.6899+-0.0680       ?
   bitops-bitwise-and                     4.3005+-0.0526          4.2784+-0.0348
   bitops-nsieve-bits                     4.2527+-0.0555    ?     4.2611+-0.0592       ?
   controlflow-recursive                  2.8832+-0.0560          2.8495+-0.0330         might be 1.0118x faster
   crypto-aes                             9.9004+-0.8040          9.7395+-0.1091         might be 1.0165x faster
   crypto-md5                             3.7363+-0.7201          3.5368+-0.0762         might be 1.0564x faster
   crypto-sha1                            2.7769+-0.1318    ?     2.8187+-0.0382       ? might be 1.0151x slower
   date-format-tofte                     12.7713+-0.1900         12.6657+-0.0813
   date-format-xparb                     11.9215+-0.0611         11.8989+-0.0899
   math-cordic                            4.0575+-0.0668          4.0442+-0.0729
   math-partial-sums                     14.8560+-0.0503    ?    14.8976+-0.0510       ?
   math-spectral-norm                     2.6806+-0.0414          2.6710+-0.0367
   regexp-dna                             9.4064+-0.0848          9.3769+-0.0829
   string-base64                          5.5638+-0.0453    ?     5.6208+-0.0781       ? might be 1.0102x slower
   string-fasta                           9.4595+-0.0784    ?     9.4883+-0.0626       ?
   string-tagcloud                       16.2971+-0.1989    ?    16.6938+-1.4490       ? might be 1.0243x slower
   string-unpack-code                    27.5464+-0.1902         27.2705+-0.1486         might be 1.0101x faster
   string-validate-input                  8.6276+-0.3451          8.4289+-0.0586         might be 1.0236x faster

   <arithmetic> *                         7.9784+-0.0601    ?     7.9905+-0.0691       ? might be 1.0015x slower
   <geometric>                            6.3202+-0.0628    ?     6.3234+-0.0358       ? might be 1.0005x slower
   <harmonic>                             4.9249+-0.0723          4.9247+-0.0266         might be 1.0000x faster

                                               ToT                    79899
V8:
   crypto                                95.0226+-1.0047         94.3739+-0.3426
   deltablue                            165.2617+-0.6624    ?   165.7057+-0.6259       ?
   earley-boyer                         120.6456+-2.7215    ?   122.9765+-2.5056       ? might be 1.0193x slower
   raytrace                              58.9135+-0.3810    ?    59.0790+-0.4627       ?
   regexp                               111.1890+-0.4246        110.9795+-0.4306
   richards                             190.9017+-0.7440    ^   189.6743+-0.2815       ^ definitely 1.0065x faster
   splay                                 76.9456+-0.1664         76.8783+-0.1922

   <arithmetic>                         116.9828+-0.5396    ?   117.0953+-0.4770       ? might be 1.0010x slower
   <geometric> *                        108.9471+-0.5280    ?   109.0833+-0.4856       ? might be 1.0013x slower
   <harmonic>                           101.3365+-0.4907    ?   101.4683+-0.4743       ? might be 1.0013x slower

                                               ToT                    79899
Kraken:
   ai-astar                             785.0024+-0.6913    ?   785.6353+-1.8885       ?
   audio-beat-detection                 237.7250+-1.4305    ?   238.6727+-0.6064       ?
   audio-dft                            368.7606+-7.9661        366.7011+-3.3630
   audio-fft                            152.0918+-1.0270        151.3073+-0.0661
   audio-oscillator                     343.2595+-1.3591    ?   344.1854+-1.4707       ?
   imaging-darkroom                     375.6221+-9.2659        374.0356+-8.6721
   imaging-desaturate                   303.7351+-0.8658        303.5210+-0.5746
   imaging-gaussian-blur                506.6966+-1.0673        506.6839+-0.5188
   json-parse-financial                  86.3672+-0.7082    ?    87.4680+-0.3961       ? might be 1.0127x slower
   json-stringify-tinderbox             106.0603+-0.8037        106.0054+-0.6595
   stanford-crypto-aes                  106.1432+-0.8878        105.0881+-0.3653         might be 1.0100x faster
   stanford-crypto-ccm                   99.5565+-0.6853    ?   100.0071+-0.7250       ?
   stanford-crypto-pbkdf2               235.4754+-0.3006    !   238.4268+-1.2700       ! definitely 1.0125x slower
   stanford-crypto-sha256-iterative     102.9337+-0.3225        102.6497+-0.5483

   <arithmetic> *                       272.1021+-1.1641    ?   272.1705+-0.9473       ? might be 1.0003x slower
   <geometric>                          216.1712+-0.8053    ?   216.3056+-0.6672       ? might be 1.0006x slower
   <harmonic>                           174.8804+-0.5656    ?   175.0663+-0.5010       ? might be 1.0011x slower
Comment 10 Filip Pizlo 2012-02-29 21:44:23 PST
Comment on attachment 129627 [details]
patch updated

View in context: https://bugs.webkit.org/attachment.cgi?id=129627&action=review

Awesome!

Question: does this mean that the start index performance hack is no longer in effect at all?  I.e. you're not even stopping the CSE when it gets to the earliest possible child of the node being replaced?  That's not a problem - in fact it's great if that's true.  I had added that optimization prematurely and it's somewhat comforting to know that it isn't necessary.

R=me.

> Source/JavaScriptCore/dfg/DFGByteCodeParser.cpp:616
> +        block->prepend(resultIndex);

Ha!  I find it to be quite awesome that we see no regression even though prepend is an O(n) operation.  Or am I missing something?
Comment 11 James Robinson 2012-02-29 21:45:39 PST
Sorry about the cc's and spam, the watchlist bot seems to be a bit off kilter today.
Comment 12 Yuqiang Xian 2012-02-29 22:38:07 PST
(In reply to comment #10)
> (From update of attachment 129627 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=129627&action=review
> 
> Awesome!
> 
> Question: does this mean that the start index performance hack is no longer in effect at all?  I.e. you're not even stopping the CSE when it gets to the earliest possible child of the node being replaced?  That's not a problem - in fact it's great if that's true.  I had added that optimization prematurely and it's somewhat comforting to know that it isn't necessary.
> 

According to the data it may be true. But yes, you're right that it's possible to terminate the look-back process earlier. Maybe we can do this in a separate patch?

> R=me.
> 
> > Source/JavaScriptCore/dfg/DFGByteCodeParser.cpp:616
> > +        block->prepend(resultIndex);
> 
> Ha!  I find it to be quite awesome that we see no regression even though prepend is an O(n) operation.  Or am I missing something?

I originally had the same concern. So I had another version which just kept the Phi nodes in a separate block (just as they're originally handled), but I found no obvious performance difference compared to the current approach, so I threw that version out.
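
For reference, a hedged sketch of that discarded alternative (hypothetical layout, not the patch as landed): keeping the Phi nodes in their own container makes Phi insertion amortized O(1) instead of an O(n) prepend.

    #include <cstdint>
    #include <vector>

    typedef uint32_t NodeIndex;

    struct BlockWithSeparatePhis {
        std::vector<NodeIndex> phis;   // filled during Phi insertion
        std::vector<NodeIndex> nodes;  // non-Phi nodes, in program order

        void addPhi(NodeIndex n) { phis.push_back(n); }   // amortized O(1)
        void append(NodeIndex n) { nodes.push_back(n); }  // amortized O(1)
    };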
Comment 13 Yuqiang Xian 2012-02-29 23:40:57 PST
Committed as http://trac.webkit.org/changeset/109318