Bug 162309

Summary: MarkedBlock should know what objects are live during marking
Product: WebKit Reporter: Filip Pizlo <fpizlo>
Component: JavaScriptCoreAssignee: Filip Pizlo <fpizlo>
Status: RESOLVED FIXED    
Severity: Normal CC: barraclough, benjamin, cdumez, cmarcelo, commit-queue, dbates, ggaren, jfbastien, keith_miller, mark.lam, mhahnenb, msaboff, oliver, sam, sbarati
Priority: P2    
Version: WebKit Nightly Build   
Hardware: All   
OS: All   
Bug Depends on: 162310, 162842    
Bug Blocks: 149432    
Attachments:
Description Flags
it's a start
none
looks pretty close
none
might be done
none
more
none
it compiles!
none
it seems to run simple programs without crashing
none
it can run splay without crashing
none
passes tests
none
the ptach
none
the patch
none
the patch
buildbot: commit-queue-
Archive of layout-test-results from ews117 for mac-yosemite
none
more
none
perf neutral, space neutral
none
the patch
ggaren: review+
patch for landing none

Description Filip Pizlo 2016-09-20 11:26:45 PDT
...
Comment 1 Filip Pizlo 2016-10-01 10:47:16 PDT
Created attachment 290437 [details]
it's a start
Comment 2 Filip Pizlo 2016-10-01 17:26:01 PDT
Created attachment 290452 [details]
looks pretty close
Comment 3 Filip Pizlo 2016-10-01 18:09:07 PDT
Created attachment 290456 [details]
might be done
Comment 4 Filip Pizlo 2016-10-02 19:55:30 PDT
Created attachment 290477 [details]
more
Comment 5 Filip Pizlo 2016-10-02 20:15:31 PDT
Created attachment 290478 [details]
it compiles!
Comment 6 Filip Pizlo 2016-10-03 12:10:27 PDT
Created attachment 290502 [details]
it seems to run simple programs without crashing
Comment 7 Filip Pizlo 2016-10-03 14:59:56 PDT
This crashes on splay during the second GC.  I need to do carnage to CodeBlockSet.
Comment 8 Filip Pizlo 2016-10-03 15:03:44 PDT
Created attachment 290526 [details]
it can run splay without crashing

Wow!
Comment 9 Filip Pizlo 2016-10-03 17:09:58 PDT
Created attachment 290538 [details]
passes tests

I still need to gather perf data.
Comment 10 Filip Pizlo 2016-10-03 17:47:10 PDT
Perf looks pretty good!  I'll run browser benchmarks next.


Benchmark report for SunSpider, LongSpider, Octane, Kraken, and AsmBench on murderface (MacBookPro11,5).

VMs tested:
"TipOfTree" at /Volumes/Data/secondary/OpenSource/WebKitBuild/Release/jsc (r206755)
"Things" at /Volumes/Data/primary/OpenSource/WebKitBuild/Release/jsc (r206755)

Collected 10 samples per benchmark/VM, with 10 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to
get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                                                TipOfTree                   Things                                      
SunSpider:
   3d-cube                                    4.7598+-0.1405     ?      4.8667+-0.2664        ? might be 1.0225x slower
   3d-morph                                   4.6804+-0.0819     ?      4.6835+-0.0481        ?
   3d-raytrace                                4.7986+-0.1959     ?      4.8248+-0.1859        ?
   access-binary-trees                        2.1087+-0.2056            1.9962+-0.0779          might be 1.0564x faster
   access-fannkuch                            4.8807+-0.1860            4.8249+-0.1278          might be 1.0116x faster
   access-nbody                               2.4643+-0.1026            2.3922+-0.0495          might be 1.0301x faster
   access-nsieve                              3.1765+-0.0274     ?      3.2465+-0.0844        ? might be 1.0220x slower
   bitops-3bit-bits-in-byte                   1.0927+-0.0281            1.0918+-0.0308        
   bitops-bits-in-byte                        2.6825+-0.0295     ?      2.7149+-0.0480        ? might be 1.0121x slower
   bitops-bitwise-and                         1.9481+-0.0490     ?      1.9744+-0.0821        ? might be 1.0135x slower
   bitops-nsieve-bits                         2.9989+-0.0317            2.9987+-0.0151        
   controlflow-recursive                      2.3278+-0.1394            2.2805+-0.0352          might be 1.0207x faster
   crypto-aes                                 4.2656+-0.1250     ?      4.2810+-0.1564        ?
   crypto-md5                                 2.6939+-0.1401            2.5977+-0.0554          might be 1.0370x faster
   crypto-sha1                                2.6480+-0.0578     ?      2.7426+-0.1215        ? might be 1.0357x slower
   date-format-tofte                          6.5634+-0.0371     ?      6.6671+-0.2436        ? might be 1.0158x slower
   date-format-xparb                          4.4092+-0.0485     ?      4.5115+-0.1514        ? might be 1.0232x slower
   math-cordic                                2.7159+-0.0757            2.6758+-0.1109          might be 1.0150x faster
   math-partial-sums                          3.8538+-0.0760     ?      3.8570+-0.0633        ?
   math-spectral-norm                         1.9476+-0.0290            1.9388+-0.0289        
   regexp-dna                                 6.4481+-0.3728            6.0793+-0.1339          might be 1.0607x faster
   string-base64                              4.5570+-0.1821            4.4215+-0.0251          might be 1.0307x faster
   string-fasta                               5.4202+-0.2102            5.3374+-0.0631          might be 1.0155x faster
   string-tagcloud                            8.0917+-0.4335            7.9968+-0.0990          might be 1.0119x faster
   string-unpack-code                        18.5043+-0.7824           18.3719+-0.6089        
   string-validate-input                      4.2391+-0.1710     ?      4.2530+-0.1024        ?

   <arithmetic>                               4.3953+-0.0459            4.3702+-0.0314          might be 1.0057x faster

                                                TipOfTree                   Things                                      
LongSpider:
   3d-cube                                  784.9753+-9.2362     ?    784.9796+-7.2979        ?
   3d-morph                                 564.4469+-1.8892          564.3682+-3.1053        
   3d-raytrace                              452.3591+-3.3894     ?    452.5263+-3.2539        ?
   access-binary-trees                      775.1539+-4.2585     !    785.4648+-4.5978        ! definitely 1.0133x slower
   access-fannkuch                          229.8188+-2.8471     ?    231.8483+-6.6854        ?
   access-nbody                             501.6161+-2.1568     ?    505.2909+-5.6832        ?
   access-nsieve                            280.6295+-6.1650     ?    286.2854+-7.0494        ? might be 1.0202x slower
   bitops-3bit-bits-in-byte                  31.3121+-0.3761     ?     31.7073+-0.8497        ? might be 1.0126x slower
   bitops-bits-in-byte                       82.5584+-1.8129           82.1645+-1.4429        
   bitops-nsieve-bits                       375.3969+-7.2505     ?    377.4139+-5.5289        ?
   controlflow-recursive                    430.0355+-3.4339          428.6742+-1.2196        
   crypto-aes                               531.1279+-2.0121          529.3739+-3.8091        
   crypto-md5                               454.5531+-2.4910     ?    455.7840+-4.5393        ?
   crypto-sha1                              591.1619+-4.8358     !    605.2210+-5.5402        ! definitely 1.0238x slower
   date-format-tofte                        329.8180+-4.8883     ?    335.7514+-6.1675        ? might be 1.0180x slower
   date-format-xparb                        611.0077+-3.1081     !    646.9916+-30.9931       ! definitely 1.0589x slower
   hash-map                                 135.6740+-2.8968     ?    139.2200+-2.9138        ? might be 1.0261x slower
   math-cordic                              425.7973+-9.6438          417.1100+-9.3656          might be 1.0208x faster
   math-partial-sums                        283.5224+-3.9142     ?    285.3476+-1.8994        ?
   math-spectral-norm                       516.9096+-2.5078     ?    518.4771+-2.2593        ?
   string-base64                            480.0868+-2.3306          478.7450+-2.1636        
   string-fasta                             328.3009+-0.9268     ?    331.9004+-4.4371        ? might be 1.0110x slower
   string-tagcloud                          163.9715+-1.4029          163.9625+-1.6930        

   <geometric>                              336.4531+-0.7115     !    339.0557+-1.3763        ! definitely 1.0077x slower

                                                TipOfTree                   Things                                      
Octane:
   encrypt                                   0.14959+-0.00209    ?     0.15031+-0.00311       ?
   decrypt                                   2.72169+-0.01214          2.70828+-0.00985       
   deltablue                        x2       0.11700+-0.00141    ?     0.11757+-0.00108       ?
   earley                                    0.23785+-0.00106          0.23764+-0.00140       
   boyer                                     4.12999+-0.06310    ?     4.14662+-0.05801       ?
   navier-stokes                    x2       4.61573+-0.01927          4.61221+-0.00971       
   raytrace                         x2       0.65478+-0.00466    ?     0.65531+-0.00276       ?
   richards                         x2       0.07932+-0.00136          0.07838+-0.00088         might be 1.0120x faster
   splay                            x2       0.31519+-0.00284          0.31218+-0.00247       
   regexp                           x2      15.85437+-0.26740    ?    15.93102+-0.59643       ?
   pdfjs                            x2      38.82551+-0.33882    ?    38.99885+-0.18927       ?
   mandreel                         x2      39.59546+-0.27219         39.59124+-0.29396       
   gbemu                            x2      28.72420+-0.27889    ?    28.74073+-0.10597       ?
   closure                                   0.47185+-0.00337          0.46830+-0.00230       
   jquery                                    6.43143+-0.03022          6.39458+-0.01449       
   box2d                            x2       8.71967+-0.05439    ?     8.72430+-0.05734       ?
   zlib                             x2     336.48324+-5.68523    ?   336.49135+-5.35980       ?
   typescript                       x2     604.82722+-10.05660   ?   610.12378+-5.38330       ?

   <geometric>                               4.70041+-0.01119          4.69945+-0.01477         might be 1.0002x faster

                                                TipOfTree                   Things                                      
Kraken:
   ai-astar                                   90.062+-2.375             88.968+-1.726           might be 1.0123x faster
   audio-beat-detection                       36.031+-0.871             35.823+-0.582         
   audio-dft                                  95.067+-1.552             94.269+-0.742         
   audio-fft                                  27.691+-0.684      ?      27.695+-0.851         ?
   audio-oscillator                           42.905+-0.296      ?      43.279+-1.073         ?
   imaging-darkroom                           57.087+-0.377             56.879+-0.328         
   imaging-desaturate                         41.833+-0.868      ?      42.040+-0.664         ?
   imaging-gaussian-blur                      59.686+-2.614             59.220+-2.165         
   json-parse-financial                       31.907+-1.183      ?      32.017+-0.471         ?
   json-stringify-tinderbox                   21.679+-0.925             20.965+-0.751           might be 1.0341x faster
   stanford-crypto-aes                        34.679+-1.212             34.222+-0.148           might be 1.0133x faster
   stanford-crypto-ccm                        32.347+-1.444      ?      32.987+-2.118         ? might be 1.0198x slower
   stanford-crypto-pbkdf2                     89.092+-0.363      ?      89.867+-1.349         ?
   stanford-crypto-sha256-iterative           29.307+-0.218             29.123+-0.197         

   <arithmetic>                               49.241+-0.279             49.097+-0.314           might be 1.0029x faster

                                                TipOfTree                   Things                                      
AsmBench:
   bigfib.cpp                               405.1739+-2.6410          404.2104+-1.9409        
   cray.c                                   357.2352+-2.9135     ?    357.2662+-2.7586        ?
   dry.c                                    395.2482+-4.1978     ?    414.1439+-42.2403       ? might be 1.0478x slower
   FloatMM.c                                665.1913+-13.4518    ?    673.4448+-4.6110        ? might be 1.0124x slower
   gcc-loops.cpp                           3379.0649+-9.6737     ?   3380.4471+-8.8589        ?
   n-body.c                                 748.3112+-3.9571          746.5452+-2.4560        
   Quicksort.c                              379.4353+-4.1063          377.0033+-2.7371        
   stepanov_container.cpp                  3168.5075+-22.4127    ?   3185.6080+-20.5172       ?
   Towers.c                                 237.2488+-3.8634     ?    239.0891+-4.0924        ?

   <geometric>                              670.5841+-1.4917     ?    674.7199+-7.3719        ? might be 1.0062x slower

                                                TipOfTree                   Things                                      
Geomean of preferred means:
   <scaled-result>                           47.0064+-0.1481     ?     47.0528+-0.1442        ? might be 1.0010x slower
Comment 11 Filip Pizlo 2016-10-03 18:31:33 PDT
Created attachment 290545 [details]
the ptach
Comment 12 Filip Pizlo 2016-10-03 19:45:24 PDT
Created attachment 290556 [details]
the patch

Further improved the handling of version wrap-around.
Comment 13 Filip Pizlo 2016-10-04 08:39:27 PDT
This is a 0.35% PLT regression with 84% confidence.
Comment 14 Filip Pizlo 2016-10-04 08:56:56 PDT
(In reply to comment #13)
> This is a 0.35% PLT regression with 84% confidence.

Or it could be a 0.9% PLT regression with 99% confidence.  The previous result had four runs of baseline and three runs of this patch.  I had forgotten the extra run.  After doing that, I got a result that was significantly slower, which throws the result a lot.

Anyway, I have some ideas for how to make this cheaper.  I'll try them.
Comment 15 Filip Pizlo 2016-10-04 11:18:07 PDT
Latest perf.


Benchmark report for SunSpider, LongSpider, Octane, Kraken, and AsmBench on pizlo (MacPro5,1).

VMs tested:
"TipOfTree" at /Volumes/Data/secondary/OpenSource/WebKitBuild/Release/jsc (r206755)
"Things" at /Volumes/Data/primary/OpenSource/WebKitBuild/Release/jsc (r206755)

Collected 10 samples per benchmark/VM, with 10 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to
get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                                                TipOfTree                   Things                                      
SunSpider:
   3d-cube                                    8.0235+-0.3621     ?      8.3052+-0.2384        ? might be 1.0351x slower
   3d-morph                                   8.1279+-0.1226     ?      8.1917+-0.0854        ?
   3d-raytrace                                8.4992+-0.2040            8.4723+-0.1616        
   access-binary-trees                        3.1203+-0.1078            3.0816+-0.1068          might be 1.0125x faster
   access-fannkuch                            8.5027+-0.2279            8.4424+-0.3140        
   access-nbody                               4.4228+-0.0787            4.3797+-0.1055        
   access-nsieve                              4.8087+-0.1750            4.7793+-0.1374        
   bitops-3bit-bits-in-byte                   1.7660+-0.0822            1.7055+-0.0541          might be 1.0355x faster
   bitops-bits-in-byte                        5.3679+-0.1043     ?      5.4113+-0.0835        ?
   bitops-bitwise-and                         2.8738+-0.0478     ?      2.8918+-0.1019        ?
   bitops-nsieve-bits                         4.7063+-0.0625     ?      4.7643+-0.0575        ? might be 1.0123x slower
   controlflow-recursive                      3.5873+-0.0647            3.5592+-0.0620        
   crypto-aes                                 7.0385+-0.1865            6.9646+-0.1515          might be 1.0106x faster
   crypto-md5                                 4.4600+-0.1392            4.3811+-0.1328          might be 1.0180x faster
   crypto-sha1                                4.5745+-0.0961            4.5502+-0.0736        
   date-format-tofte                         13.4292+-0.3363     ?     13.4452+-0.3493        ?
   date-format-xparb                          7.5737+-0.1182     ?      7.5955+-0.1118        ?
   math-cordic                                4.1582+-0.0785     ?      4.1888+-0.0811        ?
   math-partial-sums                          8.9864+-0.1432            8.9397+-0.1546        
   math-spectral-norm                         3.2642+-0.0827            3.1665+-0.0496          might be 1.0308x faster
   regexp-dna                                10.3010+-0.1137     ?     10.3179+-0.2243        ?
   string-base64                              6.9089+-0.1414     ?      7.1008+-0.1023        ? might be 1.0278x slower
   string-fasta                               9.1714+-0.1150     ?      9.2841+-0.1289        ? might be 1.0123x slower
   string-tagcloud                           13.4654+-0.2029           13.2572+-0.3336          might be 1.0157x faster
   string-unpack-code                        26.7415+-0.3726     ?     27.5089+-2.1735        ? might be 1.0287x slower
   string-validate-input                      6.5980+-0.0956            6.5756+-0.0805        

   <arithmetic>                               7.3260+-0.0301     ?      7.3562+-0.0951        ? might be 1.0041x slower

                                                TipOfTree                   Things                                      
LongSpider:
   3d-cube                                 1202.1618+-4.3588     ?   1211.1989+-5.4097        ?
   3d-morph                                1051.7163+-3.1291     ?   1054.7417+-3.7847        ?
   3d-raytrace                              846.1904+-5.0345     ^    832.9858+-4.0975        ^ definitely 1.0159x faster
   access-binary-trees                     1311.1012+-12.4107    ?   1314.3566+-6.3766        ?
   access-fannkuch                          414.2702+-13.0253    ?    415.8683+-14.5161       ?
   access-nbody                            1008.3697+-7.5767         1005.0294+-0.5716        
   access-nsieve                            599.4214+-1.5117          598.0627+-1.2544        
   bitops-3bit-bits-in-byte                  44.0405+-0.0957           44.0119+-0.1087        
   bitops-bits-in-byte                      309.3904+-2.7643          309.1396+-2.1327        
   bitops-nsieve-bits                       604.6160+-4.7808          601.3614+-3.7744        
   controlflow-recursive                    806.7600+-0.3109     ?    807.5056+-0.8618        ?
   crypto-aes                               920.0754+-4.4205          915.6011+-2.0631        
   crypto-md5                               712.2925+-2.1893          711.7632+-3.3018        
   crypto-sha1                             1028.0490+-2.7507         1027.5773+-5.9827        
   date-format-tofte                        862.2874+-7.6796     ?    864.2536+-22.7555       ?
   date-format-xparb                        998.2711+-11.3190    ^    982.3115+-4.2447        ^ definitely 1.0162x faster
   hash-map                                 228.1379+-1.2474          226.8719+-3.0762        
   math-cordic                              621.2927+-4.5599     ?    621.8403+-3.9488        ?
   math-partial-sums                        903.6336+-1.2139     ?    904.6057+-1.4168        ?
   math-spectral-norm                      1045.8192+-0.5604         1045.7521+-0.4860        
   string-base64                            678.0794+-0.8927          676.1299+-2.2146        
   string-fasta                             581.9292+-6.6014          580.7636+-3.1994        
   string-tagcloud                          271.8647+-1.3777     ^    267.3205+-1.5961        ^ definitely 1.0170x faster

   <geometric>                              623.9462+-1.2138          622.4183+-1.5229          might be 1.0025x faster

                                                TipOfTree                   Things                                      
Octane:
   encrypt                                   0.27504+-0.00110    ?     0.27589+-0.00044       ?
   decrypt                                   5.12324+-0.05780          5.07856+-0.05851       
   deltablue                        x2       0.22472+-0.00984    ?     0.23279+-0.00854       ? might be 1.0359x slower
   earley                                    0.44779+-0.00160    !     0.45199+-0.00193       ! definitely 1.0094x slower
   boyer                                     8.05465+-0.02511    ?     8.06552+-0.06843       ?
   navier-stokes                    x2       6.45853+-0.00695    ?     6.45901+-0.00837       ?
   raytrace                         x2       1.24906+-0.01144    ?     1.25735+-0.01483       ?
   richards                         x2       0.14393+-0.00175          0.14353+-0.00138       
   splay                            x2       0.54480+-0.00167    ^     0.54039+-0.00215       ^ definitely 1.0081x faster
   regexp                           x2      28.43787+-0.26215         28.30601+-0.12242       
   pdfjs                            x2      65.02654+-0.34598         64.69731+-0.14130       
   mandreel                         x2      68.35391+-0.39557    ?    68.70450+-0.31458       ?
   gbemu                            x2      68.19290+-8.41274         64.05341+-6.80604         might be 1.0646x faster
   closure                                   0.80387+-0.00183    !     0.80849+-0.00150       ! definitely 1.0058x slower
   jquery                                   10.53694+-0.02784         10.48505+-0.02524       
   box2d                            x2      16.72151+-0.05996    ^    16.53741+-0.05892       ^ definitely 1.0111x faster
   zlib                             x2     554.40618+-11.52526       547.51534+-13.83813        might be 1.0126x faster
   typescript                       x2    1137.35742+-6.32767       1137.06296+-6.20274       

   <geometric>                               8.47498+-0.07318          8.44480+-0.04927         might be 1.0036x faster

                                                TipOfTree                   Things                                      
Kraken:
   ai-astar                                  140.903+-1.017      ?     141.470+-0.815         ?
   audio-beat-detection                       63.972+-0.343      ?      64.045+-0.229         ?
   audio-dft                                 128.646+-0.913      ?     128.777+-0.604         ?
   audio-fft                                  51.808+-0.344      ?      51.961+-0.237         ?
   audio-oscillator                           76.304+-0.174      ?      76.421+-0.220         ?
   imaging-darkroom                           96.350+-0.116             96.324+-0.168         
   imaging-desaturate                         90.894+-0.231             90.802+-0.179         
   imaging-gaussian-blur                     113.875+-4.287            109.129+-4.292           might be 1.0435x faster
   json-parse-financial                       60.357+-0.302      ^      59.852+-0.130         ^ definitely 1.0084x faster
   json-stringify-tinderbox                   39.800+-0.148             39.783+-0.103         
   stanford-crypto-aes                        62.260+-0.365             62.256+-0.849         
   stanford-crypto-ccm                        57.612+-2.688             57.002+-3.128           might be 1.0107x faster
   stanford-crypto-pbkdf2                    150.094+-4.087            147.819+-3.843           might be 1.0154x faster
   stanford-crypto-sha256-iterative           48.793+-0.273      ?      48.837+-0.499         ?

   <arithmetic>                               84.405+-0.523             83.891+-0.668           might be 1.0061x faster

                                                TipOfTree                   Things                                      
AsmBench:
   bigfib.cpp                               665.0508+-15.0276         662.8626+-11.7344       
   cray.c                                   597.5950+-1.8130          597.2118+-2.7995        
   dry.c                                    670.4390+-17.7186         657.1325+-16.6841         might be 1.0202x faster
   FloatMM.c                                945.2271+-32.7938         927.8857+-31.7631         might be 1.0187x faster
   gcc-loops.cpp                           6313.6921+-66.3733        6298.8269+-80.0661       
   n-body.c                                1555.4205+-25.1477        1532.9256+-33.8577         might be 1.0147x faster
   Quicksort.c                              606.6469+-3.7198     ?    608.5187+-5.4834        ?
   stepanov_container.cpp                  4480.0198+-11.9496    ?   4482.6665+-13.2934       ?
   Towers.c                                 375.6496+-0.4232     ?    376.2430+-4.4923        ?

   <geometric>                             1108.1300+-6.3161         1101.4505+-5.3150          might be 1.0061x faster

                                                TipOfTree                   Things                                      
Geomean of preferred means:
   <scaled-result>                           81.6232+-0.2539           81.3925+-0.3046          might be 1.0028x faster
Comment 16 Filip Pizlo 2016-10-04 11:18:31 PDT
Created attachment 290624 [details]
the patch

Still running perf tests.
Comment 17 Filip Pizlo 2016-10-04 13:06:22 PDT
On PLT, this looks like a 0.8% regression with 99% confidence, even after doing some things to try to reduce the overhead.
Comment 18 Build Bot 2016-10-04 15:51:31 PDT
Comment on attachment 290624 [details]
the patch

Attachment 290624 [details] did not pass mac-debug-ews (mac):
Output: http://webkit-queues.webkit.org/results/2220150

Number of test failures exceeded the failure limit.
Comment 19 Build Bot 2016-10-04 15:51:38 PDT
Created attachment 290658 [details]
Archive of layout-test-results from ews117 for mac-yosemite

The attached test failures were seen while running run-webkit-tests on the mac-debug-ews.
Bot: ews117  Port: mac-yosemite  Platform: Mac OS X 10.10.5
Comment 20 Filip Pizlo 2016-10-04 20:13:52 PDT
I think that I fixed the debug failure locally.  I also think that I'm experiencing some kind of flukes in my benchmark runs.  On the one hand, I'm seeing a 1% regression on membuster.  But this regression appears entirely unaffected by the fact that we're allocating newlyAllocated more frequently, which is the only change in this patch that looks like it could affect memory usage.  The only other possibility is that conservative roots have become more conservative, but I don't think this could be the case, since we are going to great pains to ensure that the isLive() query returns the same thing during conservative scans now as it did before.  If this wasn't true - if isLive() returned true more often, for example - then this would manifest as a crash.  I'm not seeing those (except in the last patch I uploaded - it was a separate goof and I fixed it).

So, I'm going to dig into the membuster regression more, this time trying to prove that it is indeed a fluke.  Maybe I'll try to run it on some other machines or do some clean builds.

I should also say that the differences in membuster results between baseline and trunk and smaller than the differences I get between different runs of the same build, spaced out in time.  A few hours can make the same build use 5% more or less memory!  It's crazy!
Comment 21 Filip Pizlo 2016-10-04 20:24:46 PDT
Actually, I can test putting the conservative scan at the beginning.  If this shows different memory overheads, then this would definitely take the focus off of newlyAllocated.

I should also say that I measured the amount of memory used by newlyAllocated relative to the amount of memory used by the heap's capacity. It looked like there were fewer than 1 bytes of newlyAllocated for every 1000 bytes of marked block. That's just not enough to explain the membuster regressions that I'm seeing (0.7% and 1.2%, the 1.2% regression happening when I *reduced* the number of newlyAllocated allocations).
Comment 22 Filip Pizlo 2016-10-05 21:18:09 PDT
It looks likely that the membuster regression goes away if we do the stack scan before starting marking, rather than after.  That's very surprising.  I'm going to do more tests to confirm.
Comment 23 Filip Pizlo 2016-10-05 23:25:55 PDT
(In reply to comment #22)
> It looks likely that the membuster regression goes away if we do the stack
> scan before starting marking, rather than after.  That's very surprising. 
> I'm going to do more tests to confirm.

But, in that configuration, I also get a 1% PLT regression with 99% certainty.  So weird.  I'm going to try some other hardware configurations I'm going to run membuster some more to make sure that this isn't a fluke.
Comment 24 Filip Pizlo 2016-10-06 17:31:26 PDT
I'm getting membuster neutrality and a smaller PLT regression if I simply inline newlyAllocated into Handle instead of allocating it on demand.
Comment 25 Filip Pizlo 2016-10-06 17:31:57 PDT
Created attachment 290875 [details]
more
Comment 26 Filip Pizlo 2016-10-06 18:11:25 PDT
Created attachment 290880 [details]
perf neutral, space neutral
Comment 27 Filip Pizlo 2016-10-07 11:22:11 PDT
I did more experiments.  With 10 samples of membuster for baseline and this patch, I get a 0.76% regression with 76% confidence.  I'll take that to mean it's neutral.
Comment 28 Filip Pizlo 2016-10-07 13:38:52 PDT
This still looks like a 0.96% PLT regression with 99% confidence.  I'm going to profile more.  This regression makes no sense, since the only change here is to the GC and all other GC benchmarks are happy.
Comment 29 Filip Pizlo 2016-10-08 16:30:47 PDT
Created attachment 291030 [details]
the patch
Comment 30 Geoffrey Garen 2016-10-11 14:31:25 PDT
Comment on attachment 291030 [details]
the patch

View in context: https://bugs.webkit.org/attachment.cgi?id=291030&action=review

r=me

I think there might be a clearer way to handle the wrap-around situation.

> Source/JavaScriptCore/heap/Heap.cpp:1214
>  void Heap::writeBarrierCurrentlyExecutingCodeBlocks()
>  {
> -    m_codeBlocks->writeBarrierCurrentlyExecutingCodeBlocks(this);
> +    m_codeBlocks->writeBarrierCurrentlyExecuting(this);
> +}
> +
> +void Heap::clearCurrentlyExecutingCodeBlocks()
> +{
> +    m_codeBlocks->clearCurrentlyExecuting();
>  }

These probably don't need to be functions.

> Source/JavaScriptCore/heap/MarkedBlock.cpp:57
> +    , m_allocatedVersion(MarkedSpace::nullVersion)

I would call this m_newlyAllocatedVersion since it guards m_newlyAllocated.

Later, we should come up with a better name for m_newlyAllocated. Maybe m_liveAtGCStart or m_liveSnapshot. (I don't think m_newlyAllocated will include things allocated during GC, right? And of course, for now, we don't allocate during GC.)

> Source/JavaScriptCore/heap/MarkedBlock.h:291
> +    void setNeedsDestruction();

I would call this "updateNeedsDestruction" to distinguish from a setter.

> Source/JavaScriptCore/heap/MarkedBlockInlines.h:49
> +    // - We did ~2^32 collections and rotated the version back to null, so we needed to hard-reset
> +    //   everything. If the marks had been stale, we would have cleared them. So, we can be sure that
> +    //   any set mark bit reflects objects marked during last GC, i.e. "live" objects.

It would help if you mentioned the resetAllocated() function by name here. (I think that's what you're talking about.)

Is it really true that wrap-around can wrap-around back to null? I thought it wrapped around to firstVersion.

If possible, I'd prefer to think in terms of some elements of state that are snapshotted at the start of GC, and then consulted later after the current GC has started to change some things. Is that possible here? This function almost amounts to the inverse of a snapshot of areMarksStale() at the start of GC. 

// True if areMarksStale() was true at the start of GC -- which means no object in this block was marked at the end of the last GC.
inline bool MarkedBlock:areMarksStaleSnapshot() // Or maybe areMarksStaleAtGCStart()
{
    return m_markingVersion == MarkedSpace::nullVersion || MarkedSpace::nextVersion(m_markingVersion) != markingVersion;
}

This changes the behavior of nullVersion slightly: now nullVersion says "don't check m_marks". It's possible to think of nullVersion as having up-to-date m_marks -- in the sense that they tell the (all zero) truth -- but I think it's more instructive to call this state not up-to-date -- in the sense that the GC mark phase never got a shot at setting m_marks, and so we know they were logically zero regardless of their bit values. Put another way, I would like to be able to ASSERT(areMarksStale() || m_marks.count()).

> Source/JavaScriptCore/heap/MarkedBlockInlines.h:52
> +    // It would be absurd to use this method when not collecting, since this special "one version
> +    // back" state only makes sense when we're in a concurrent collection and have to be
> +    // conservative.

Can we assert we're collecting here?

> Source/JavaScriptCore/heap/MarkedSpace.cpp:495
> +    if (UNLIKELY(nextVersion(m_allocatedVersion) == initialVersion)) {
> +        forEachBlock(
> +            [&] (MarkedBlock::Handle* handle) {
> +                handle->resetAllocated();
> +            });
> +    }

This is super subtle. I think you should mention in a comment how this is needed by marksConveyLivenessDuringMarking. And you should also describe the problem of marksConveyLivenessDuringMarking eventually becoming (incorrectly) true due to wrap-around.

Perhaps a clearer thing to do upon wrap-around is to reset all blocks to nullVersion at the start of a wrap-around GC?
Comment 31 Filip Pizlo 2016-10-11 14:59:09 PDT
(In reply to comment #30)
> Comment on attachment 291030 [details]
> the patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=291030&action=review
> 
> r=me
> 
> I think there might be a clearer way to handle the wrap-around situation.
> 
> > Source/JavaScriptCore/heap/Heap.cpp:1214
> >  void Heap::writeBarrierCurrentlyExecutingCodeBlocks()
> >  {
> > -    m_codeBlocks->writeBarrierCurrentlyExecutingCodeBlocks(this);
> > +    m_codeBlocks->writeBarrierCurrentlyExecuting(this);
> > +}
> > +
> > +void Heap::clearCurrentlyExecutingCodeBlocks()
> > +{
> > +    m_codeBlocks->clearCurrentlyExecuting();
> >  }
> 
> These probably don't need to be functions.

Done.

> 
> > Source/JavaScriptCore/heap/MarkedBlock.cpp:57
> > +    , m_allocatedVersion(MarkedSpace::nullVersion)
> 
> I would call this m_newlyAllocatedVersion since it guards m_newlyAllocated.

Done.

> 
> Later, we should come up with a better name for m_newlyAllocated. Maybe
> m_liveAtGCStart or m_liveSnapshot. (I don't think m_newlyAllocated will
> include things allocated during GC, right? And of course, for now, we don't
> allocate during GC.)

No, we will set newlyAllocated during GC.  We will scan stack at the end of GC and during every GC fixpoint stop-the-world increment.  During that time, we will stopAllocating(), which will put things in newlyAllocated.

> 
> > Source/JavaScriptCore/heap/MarkedBlock.h:291
> > +    void setNeedsDestruction();
> 
> I would call this "updateNeedsDestruction" to distinguish from a setter.

Done.

> 
> > Source/JavaScriptCore/heap/MarkedBlockInlines.h:49
> > +    // - We did ~2^32 collections and rotated the version back to null, so we needed to hard-reset
> > +    //   everything. If the marks had been stale, we would have cleared them. So, we can be sure that
> > +    //   any set mark bit reflects objects marked during last GC, i.e. "live" objects.
> 
> It would help if you mentioned the resetAllocated() function by name here.
> (I think that's what you're talking about.)

No, that function is for newlyAllocatedVersion.

> 
> Is it really true that wrap-around can wrap-around back to null? I thought
> it wrapped around to firstVersion.
> 
> If possible, I'd prefer to think in terms of some elements of state that are
> snapshotted at the start of GC, and then consulted later after the current
> GC has started to change some things. Is that possible here? 

That's exactly what this code is trying to do.

> This function
> almost amounts to the inverse of a snapshot of areMarksStale() at the start
> of GC. 
> 
> // True if areMarksStale() was true at the start of GC -- which means no
> object in this block was marked at the end of the last GC.
> inline bool MarkedBlock:areMarksStaleSnapshot() // Or maybe
> areMarksStaleAtGCStart()
> {
>     return m_markingVersion == MarkedSpace::nullVersion ||
> MarkedSpace::nextVersion(m_markingVersion) != markingVersion;
> }
> 
> This changes the behavior of nullVersion slightly: now nullVersion says
> "don't check m_marks". It's possible to think of nullVersion as having
> up-to-date m_marks -- in the sense that they tell the (all zero) truth --
> but I think it's more instructive to call this state not up-to-date -- in
> the sense that the GC mark phase never got a shot at setting m_marks, and so
> we know they were logically zero regardless of their bit values. Put another
> way, I would like to be able to ASSERT(areMarksStale() || m_marks.count()).

MarkedBlock::aboutToMarkSlow() needs to know what the mark bits were at the start of the GC.  Currently we achieve it like this:

1) If aboutToMarkSlow() sees a null marking version then the marks reflect the truth at start of GC.
2) If aboutToMarkSlow() sees a one-version-old version then the marks reflect the truth at the start of GC.
3) In all other cases, the marks should have been cleared, but hadn't been yet - so ignore their contents.

(1) arises when the MarkedBlock is null or if we had a wrap-around.  Before the wrap-around bumps the version, it makes sure that the marks are either clear or not depending on what their version had been just before the wrap-around happened.  Then it sets the version to null.  Then it can wrap the version around.

You're proposing a change to marksConveyLivenessDuringMarking(), but you haven't proposed a corresponding change to how aboutToMarkSlow() should use this to figure out if it should transfer the contents of marks into newlyAllocated or not.

I don't think that what you're proposing handles this corner case, so I'll keep my version for now.

> 
> > Source/JavaScriptCore/heap/MarkedBlockInlines.h:52
> > +    // It would be absurd to use this method when not collecting, since this special "one version
> > +    // back" state only makes sense when we're in a concurrent collection and have to be
> > +    // conservative.
> 
> Can we assert we're collecting here?

Done.

The new way to do this is MarkedSpace::isMarking(), which is more precise for this case.

> 
> > Source/JavaScriptCore/heap/MarkedSpace.cpp:495
> > +    if (UNLIKELY(nextVersion(m_allocatedVersion) == initialVersion)) {
> > +        forEachBlock(
> > +            [&] (MarkedBlock::Handle* handle) {
> > +                handle->resetAllocated();
> > +            });
> > +    }
> 
> This is super subtle. I think you should mention in a comment how this is
> needed by marksConveyLivenessDuringMarking. And you should also describe the
> problem of marksConveyLivenessDuringMarking eventually becoming
> (incorrectly) true due to wrap-around.

marksConveyLivenessDuringMarking() does not check the newlyAllocatedVersion.  It checks the markingVersion, which is different.  I don't think that this is code is needed by marksConveyLivenessDuringMarking(), which doesn't know that newlyAllocated is also versioned.

> 
> Perhaps a clearer thing to do upon wrap-around is to reset all blocks to
> nullVersion at the start of a wrap-around GC?

But that's what resetAllocated() does already.  It sets the newlyAllocatedVersion to nullVersion.  It also clears newlyAllocated, but I think that's not necessary.
Comment 32 Filip Pizlo 2016-10-11 15:37:22 PDT
Created attachment 291308 [details]
patch for landing
Comment 33 Geoffrey Garen 2016-10-11 16:25:57 PDT
> > > Source/JavaScriptCore/heap/MarkedBlockInlines.h:49
> > > +    // - We did ~2^32 collections and rotated the version back to null, so we needed to hard-reset
> > > +    //   everything. If the marks had been stale, we would have cleared them. So, we can be sure that
> > > +    //   any set mark bit reflects objects marked during last GC, i.e. "live" objects.
> > 
> > It would help if you mentioned the resetAllocated() function by name here.
> > (I think that's what you're talking about.)
> 
> No, that function is for newlyAllocatedVersion.

What function clears marks if stale upon wrap-around? That's the function I would point to here.

> You're proposing a change to marksConveyLivenessDuringMarking(), but you
> haven't proposed a corresponding change to how aboutToMarkSlow() should use
> this to figure out if it should transfer the contents of marks into
> newlyAllocated or not.

if (handle().allocator()->isAllocated(&handle()) || areMarksStaleSnapshot())
    m_marks.clearAll();
else {
    // Transfer m_marks to m_newlyAllocated
    ...
}
Comment 34 Filip Pizlo 2016-10-11 16:55:05 PDT
Landed in https://trac.webkit.org/changeset/207179
Comment 35 Geoffrey Garen 2016-10-11 17:18:33 PDT
A related observation:

If each GC took 1ms, you would need to GC continuously for (2^32 / 1000 / 60 / 60 / 24) = ~50 consecutive days before achieving wrap-around.

And of course nobody actually GCs continuously per 1ms, so the real-world worst case is probably more like one GC per 10s (2^32 / 6 / 60 / 24) = ~500,000 days.

Can we just RELEASE_ASSERT that doesn't happen?

The odds that the user will crash, navigate, quit, or install a software update before wrapping around seem to be effectively 100%.
Comment 36 Filip Pizlo 2016-10-11 19:18:06 PDT
(In reply to comment #35)
> A related observation:
> 
> If each GC took 1ms, you would need to GC continuously for (2^32 / 1000 / 60
> / 60 / 24) = ~50 consecutive days before achieving wrap-around.

The shortest GC I see in earley is 227us.

> 
> And of course nobody actually GCs continuously per 1ms, so the real-world
> worst case is probably more like one GC per 10s (2^32 / 6 / 60 / 24) =
> ~500,000 days.

Earley does 237 GCs in 2.689 seconds, so 88 GCs/sec.

You would need 2^32/88 seconds, so 48806446 seconds, or 564 days.

> 
> Can we just RELEASE_ASSERT that doesn't happen?
> 
> The odds that the user will crash, navigate, quit, or install a software
> update before wrapping around seem to be effectively 100%.

Not 100%.  Someone could use JavaScriptCore to run a two-year compute task on their desktop.  I've heard of two-year compute tasks, and I've heard of two-year uptimes.

I would agree if handling the wrap-around was hard or untestable, and if I was confident that we never wanted the size of the version to be smaller.  The version stuff probably looks more confusing in patch form than it does when you just look at trunk.  I plan to add tests by having an option that forces the wrap-around to happen sooner, and run just earley and splay in that mode, for three seconds each (in my experience that catches >95% of GC bugs).  Our GC would perform statistically as well if the versions were a byte, so if we ever needed space in Handle or MarkedBlock for any reason then these versions would be prime candidates; this would cause us to definitely need a wrap-around handler.  That's why when I reduced the version from 64-bit to 32-bit, I added the wrap-around fallback.  Sure "nobody" will need it, but if somebody does need it and we don't have it then I would feel bad.
Comment 37 Filip Pizlo 2016-10-11 19:23:44 PDT
(In reply to comment #33)
> > > > Source/JavaScriptCore/heap/MarkedBlockInlines.h:49
> > > > +    // - We did ~2^32 collections and rotated the version back to null, so we needed to hard-reset
> > > > +    //   everything. If the marks had been stale, we would have cleared them. So, we can be sure that
> > > > +    //   any set mark bit reflects objects marked during last GC, i.e. "live" objects.
> > > 
> > > It would help if you mentioned the resetAllocated() function by name here.
> > > (I think that's what you're talking about.)
> > 
> > No, that function is for newlyAllocatedVersion.
> 
> What function clears marks if stale upon wrap-around? That's the function I
> would point to here.

Gotcha.  You meant resetMarks().

> 
> > You're proposing a change to marksConveyLivenessDuringMarking(), but you
> > haven't proposed a corresponding change to how aboutToMarkSlow() should use
> > this to figure out if it should transfer the contents of marks into
> > newlyAllocated or not.
> 
> if (handle().allocator()->isAllocated(&handle()) || areMarksStaleSnapshot())
>     m_marks.clearAll();
> else {
>     // Transfer m_marks to m_newlyAllocated
>     ...
> }

Your areMarksStaleSnapshot is:

m_markingVersion == MarkedSpace::nullVersion || MarkedSpace::nextVersion(m_markingVersion) != markingVersion;

So, we would clear the marks on null version.  That's not correct for wrap-around, where the marks may have been the correct thing just prior to wrap-around.  See resetMarks():

void MarkedBlock::resetMarks()
{
    // We want aboutToMarkSlow() to see what the mark bits were after the last collection. It uses
    // the version number to distinguish between the marks having already been stale before
    // beginMarking(), or just stale now that beginMarking() bumped the version. If we have a version
    // wraparound, then we will call this method before resetting the version to null. When the
    // version is null, aboutToMarkSlow() will assume that the marks were not stale as of before
    // beginMarking(). Hence the need to whip the marks into shape.
    if (areMarksStale())
        m_marks.clearAll();
    m_markingVersion = MarkedSpace::nullVersion;
}
Comment 38 Geoffrey Garen 2016-10-12 11:55:29 PDT
> Not 100%.  Someone could use JavaScriptCore to run a two-year compute task
> on their desktop.  I've heard of two-year compute tasks, and I've heard of
> two-year uptime.

You tried this with a benchmark bot and it literally melted the hardware.

Also, two-year compute tasks probably run on 64bit machines, so we could use 'long' instead of uint32_t in order to accommodate them.

> I would agree if handling the wrap-around was hard or untestable, and if I
> was confident that we never wanted the size of the version to be smaller. 
> The version stuff probably looks more confusing in patch form than it does
> when you just look at trunk.  I plan to add tests by having an option that
> forces the wrap-around to happen sooner, and run just earley and splay in
> that mode, for three seconds each (in my experience that catches >95% of GC
> bugs).  Our GC would perform statistically as well if the versions were a
> byte, so if we ever needed space in Handle or MarkedBlock for any reason
> then these versions would be prime candidates; this would cause us to
> definitely need a wrap-around handler.  That's why when I reduced the
> version from 64-bit to 32-bit, I added the wrap-around fallback.  Sure
> "nobody" will need it, but if somebody does need it and we don't have it
> then I would feel bad.

These are some pretty hardcore theoretical arguments to justify real-world complexity that will impact all of our users and developers.
Comment 39 Filip Pizlo 2016-10-12 12:08:00 PDT
(In reply to comment #38)
> > Not 100%.  Someone could use JavaScriptCore to run a two-year compute task
> > on their desktop.  I've heard of two-year compute tasks, and I've heard of
> > two-year uptime.
> 
> You tried this with a benchmark bot and it literally melted the hardware.

It was a laptop, not a desktop.

> 
> Also, two-year compute tasks probably run on 64bit machines, so we could use
> 'long' instead of uint32_t in order to accommodate them.

You're right, we could make HeapVersion have different sizes depending on hardware.  That seems like a different kind of complexity.

> 
> > I would agree if handling the wrap-around was hard or untestable, and if I
> > was confident that we never wanted the size of the version to be smaller. 
> > The version stuff probably looks more confusing in patch form than it does
> > when you just look at trunk.  I plan to add tests by having an option that
> > forces the wrap-around to happen sooner, and run just earley and splay in
> > that mode, for three seconds each (in my experience that catches >95% of GC
> > bugs).  Our GC would perform statistically as well if the versions were a
> > byte, so if we ever needed space in Handle or MarkedBlock for any reason
> > then these versions would be prime candidates; this would cause us to
> > definitely need a wrap-around handler.  That's why when I reduced the
> > version from 64-bit to 32-bit, I added the wrap-around fallback.  Sure
> > "nobody" will need it, but if somebody does need it and we don't have it
> > then I would feel bad.
> 
> These are some pretty hardcore theoretical arguments to justify real-world
> complexity that will impact all of our users and developers.

How will this complexity impact all of our users and developers?
Comment 40 Filip Pizlo 2016-10-12 12:40:44 PDT
Just to be clear, removal of the wraparound handler would reduce complexity as follows:

- Remove resetMarks().  The body of this method is three lines of code.
- Remove resetNewlyAllocated().  The body of this method is two lines of code.
- Remove marksConveyLivenessDuringMarking() and have aboutToMarkSlow() just check version directly.  The body of this method is three lines of code, including the line that has the non-wraparound version check (MarkedSpace::nextVersion(m_markingVersion) == markingVersion).
- Remove the loops that call resetMarks() and resetNewlyAllocated().  Those are six lines of code each using my style of lambda indentation and the surrounding if {}.

None of the code that handles the wraparound handler needs to be correct for when wraparound does not happen:
- resetMarks() won't be called.
- resetNewlyAllocated() won't be called.
- marksConveyLivenessDuringMarking() could return either true or false for nullVersion and everything would be fine.

I'm not counting comments.  If we had no handling of wraparound, I would instead have a comment explaining why we can ignore wraparound.  We'd still have most of the huge comment in marksConveyLivenessDuringMarking(); we'd probably put it in aboutToMarkSlow().

So, the complexity being removed is 3 + 2 + 3 + 6 + 6 = 20 lines of code, all of which is live only when the bad thing happens.  That's quite small.

C programs can run for more than two years.  I want to be able to say that JS programs can also run for two years.  20 lines of code is a small price to pay for continuing our philosophy that JS programs shouldn't have to suffer unnecessary limitations.
Comment 41 Filip Pizlo 2016-10-12 19:11:41 PDT
As far as I can see, this landed without causing any regressions.  Hooray!