162309 – MarkedBlock should know what objects are live during marking

RESOLVED FIXED162309

MarkedBlock should know what objects are live during marking

https://bugs.webkit.org/show_bug.cgi?id=162309

Summary MarkedBlock should know what objects are live during marking

Filip Pizlo

Reported 2016-09-20 11:26:45 PDT

...

Attachments
it's a start (9.70 KB, patch) 2016-10-01 10:47 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
looks pretty close (21.92 KB, patch) 2016-10-01 17:26 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
might be done (27.12 KB, patch) 2016-10-01 18:09 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
more (31.47 KB, patch) 2016-10-02 19:55 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
it compiles! (31.45 KB, patch) 2016-10-02 20:15 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
it seems to run simple programs without crashing (35.46 KB, patch) 2016-10-03 12:10 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
it can run splay without crashing (48.49 KB, patch) 2016-10-03 15:03 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
passes tests (48.49 KB, patch) 2016-10-03 17:09 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
the ptach (56.21 KB, patch) 2016-10-03 18:31 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
the patch (57.37 KB, patch) 2016-10-03 19:45 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
the patch (58.66 KB, patch) 2016-10-04 11:18 PDT, Filip Pizlo	buildbot: commit-queue-	Details Formatted Diff Diff
Archive of layout-test-results from ews117 for mac-yosemite (409.45 KB, application/zip) 2016-10-04 15:51 PDT, Build Bot	no flags	Details
more (59.88 KB, patch) 2016-10-06 17:31 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
perf neutral, space neutral (58.74 KB, patch) 2016-10-06 18:11 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
the patch (57.38 KB, patch) 2016-10-08 16:30 PDT, Filip Pizlo	ggaren: review+	Details Formatted Diff Diff
patch for landing (57.58 KB, patch) 2016-10-11 15:37 PDT, Filip Pizlo	no flags	Details Formatted Diff Diff
Show Obsolete (15) View All Add attachment proposed patch, testcase, etc.

Filip Pizlo

Comment 1 2016-10-01 10:47:16 PDT

Created attachment 290437 [details] it's a start

Filip Pizlo

Comment 2 2016-10-01 17:26:01 PDT

Created attachment 290452 [details] looks pretty close

Filip Pizlo

Comment 3 2016-10-01 18:09:07 PDT

Created attachment 290456 [details] might be done

Filip Pizlo

Comment 4 2016-10-02 19:55:30 PDT

Created attachment 290477 [details] more

Filip Pizlo

Comment 5 2016-10-02 20:15:31 PDT

Created attachment 290478 [details] it compiles!

Filip Pizlo

Comment 6 2016-10-03 12:10:27 PDT

Created attachment 290502 [details] it seems to run simple programs without crashing

Filip Pizlo

Comment 7 2016-10-03 14:59:56 PDT

This crashes on splay during the second GC. I need to do carnage to CodeBlockSet.

Filip Pizlo

Comment 8 2016-10-03 15:03:44 PDT

Created attachment 290526 [details] it can run splay without crashing Wow!

Filip Pizlo

Comment 9 2016-10-03 17:09:58 PDT

Created attachment 290538 [details] passes tests I still need to gather perf data.

Filip Pizlo

Comment 10 2016-10-03 17:47:10 PDT

Perf looks pretty good! I'll run browser benchmarks next. Benchmark report for SunSpider, LongSpider, Octane, Kraken, and AsmBench on murderface (MacBookPro11,5). VMs tested: "TipOfTree" at /Volumes/Data/secondary/OpenSource/WebKitBuild/Release/jsc (r206755) "Things" at /Volumes/Data/primary/OpenSource/WebKitBuild/Release/jsc (r206755) Collected 10 samples per benchmark/VM, with 10 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. TipOfTree Things SunSpider: 3d-cube 4.7598+-0.1405 ? 4.8667+-0.2664 ? might be 1.0225x slower 3d-morph 4.6804+-0.0819 ? 4.6835+-0.0481 ? 3d-raytrace 4.7986+-0.1959 ? 4.8248+-0.1859 ? access-binary-trees 2.1087+-0.2056 1.9962+-0.0779 might be 1.0564x faster access-fannkuch 4.8807+-0.1860 4.8249+-0.1278 might be 1.0116x faster access-nbody 2.4643+-0.1026 2.3922+-0.0495 might be 1.0301x faster access-nsieve 3.1765+-0.0274 ? 3.2465+-0.0844 ? might be 1.0220x slower bitops-3bit-bits-in-byte 1.0927+-0.0281 1.0918+-0.0308 bitops-bits-in-byte 2.6825+-0.0295 ? 2.7149+-0.0480 ? might be 1.0121x slower bitops-bitwise-and 1.9481+-0.0490 ? 1.9744+-0.0821 ? might be 1.0135x slower bitops-nsieve-bits 2.9989+-0.0317 2.9987+-0.0151 controlflow-recursive 2.3278+-0.1394 2.2805+-0.0352 might be 1.0207x faster crypto-aes 4.2656+-0.1250 ? 4.2810+-0.1564 ? crypto-md5 2.6939+-0.1401 2.5977+-0.0554 might be 1.0370x faster crypto-sha1 2.6480+-0.0578 ? 2.7426+-0.1215 ? might be 1.0357x slower date-format-tofte 6.5634+-0.0371 ? 6.6671+-0.2436 ? might be 1.0158x slower date-format-xparb 4.4092+-0.0485 ? 4.5115+-0.1514 ? might be 1.0232x slower math-cordic 2.7159+-0.0757 2.6758+-0.1109 might be 1.0150x faster math-partial-sums 3.8538+-0.0760 ? 3.8570+-0.0633 ? math-spectral-norm 1.9476+-0.0290 1.9388+-0.0289 regexp-dna 6.4481+-0.3728 6.0793+-0.1339 might be 1.0607x faster string-base64 4.5570+-0.1821 4.4215+-0.0251 might be 1.0307x faster string-fasta 5.4202+-0.2102 5.3374+-0.0631 might be 1.0155x faster string-tagcloud 8.0917+-0.4335 7.9968+-0.0990 might be 1.0119x faster string-unpack-code 18.5043+-0.7824 18.3719+-0.6089 string-validate-input 4.2391+-0.1710 ? 4.2530+-0.1024 ? <arithmetic> 4.3953+-0.0459 4.3702+-0.0314 might be 1.0057x faster TipOfTree Things LongSpider: 3d-cube 784.9753+-9.2362 ? 784.9796+-7.2979 ? 3d-morph 564.4469+-1.8892 564.3682+-3.1053 3d-raytrace 452.3591+-3.3894 ? 452.5263+-3.2539 ? access-binary-trees 775.1539+-4.2585 ! 785.4648+-4.5978 ! definitely 1.0133x slower access-fannkuch 229.8188+-2.8471 ? 231.8483+-6.6854 ? access-nbody 501.6161+-2.1568 ? 505.2909+-5.6832 ? access-nsieve 280.6295+-6.1650 ? 286.2854+-7.0494 ? might be 1.0202x slower bitops-3bit-bits-in-byte 31.3121+-0.3761 ? 31.7073+-0.8497 ? might be 1.0126x slower bitops-bits-in-byte 82.5584+-1.8129 82.1645+-1.4429 bitops-nsieve-bits 375.3969+-7.2505 ? 377.4139+-5.5289 ? controlflow-recursive 430.0355+-3.4339 428.6742+-1.2196 crypto-aes 531.1279+-2.0121 529.3739+-3.8091 crypto-md5 454.5531+-2.4910 ? 455.7840+-4.5393 ? crypto-sha1 591.1619+-4.8358 ! 605.2210+-5.5402 ! definitely 1.0238x slower date-format-tofte 329.8180+-4.8883 ? 335.7514+-6.1675 ? might be 1.0180x slower date-format-xparb 611.0077+-3.1081 ! 646.9916+-30.9931 ! definitely 1.0589x slower hash-map 135.6740+-2.8968 ? 139.2200+-2.9138 ? might be 1.0261x slower math-cordic 425.7973+-9.6438 417.1100+-9.3656 might be 1.0208x faster math-partial-sums 283.5224+-3.9142 ? 285.3476+-1.8994 ? math-spectral-norm 516.9096+-2.5078 ? 518.4771+-2.2593 ? string-base64 480.0868+-2.3306 478.7450+-2.1636 string-fasta 328.3009+-0.9268 ? 331.9004+-4.4371 ? might be 1.0110x slower string-tagcloud 163.9715+-1.4029 163.9625+-1.6930 <geometric> 336.4531+-0.7115 ! 339.0557+-1.3763 ! definitely 1.0077x slower TipOfTree Things Octane: encrypt 0.14959+-0.00209 ? 0.15031+-0.00311 ? decrypt 2.72169+-0.01214 2.70828+-0.00985 deltablue x2 0.11700+-0.00141 ? 0.11757+-0.00108 ? earley 0.23785+-0.00106 0.23764+-0.00140 boyer 4.12999+-0.06310 ? 4.14662+-0.05801 ? navier-stokes x2 4.61573+-0.01927 4.61221+-0.00971 raytrace x2 0.65478+-0.00466 ? 0.65531+-0.00276 ? richards x2 0.07932+-0.00136 0.07838+-0.00088 might be 1.0120x faster splay x2 0.31519+-0.00284 0.31218+-0.00247 regexp x2 15.85437+-0.26740 ? 15.93102+-0.59643 ? pdfjs x2 38.82551+-0.33882 ? 38.99885+-0.18927 ? mandreel x2 39.59546+-0.27219 39.59124+-0.29396 gbemu x2 28.72420+-0.27889 ? 28.74073+-0.10597 ? closure 0.47185+-0.00337 0.46830+-0.00230 jquery 6.43143+-0.03022 6.39458+-0.01449 box2d x2 8.71967+-0.05439 ? 8.72430+-0.05734 ? zlib x2 336.48324+-5.68523 ? 336.49135+-5.35980 ? typescript x2 604.82722+-10.05660 ? 610.12378+-5.38330 ? <geometric> 4.70041+-0.01119 4.69945+-0.01477 might be 1.0002x faster TipOfTree Things Kraken: ai-astar 90.062+-2.375 88.968+-1.726 might be 1.0123x faster audio-beat-detection 36.031+-0.871 35.823+-0.582 audio-dft 95.067+-1.552 94.269+-0.742 audio-fft 27.691+-0.684 ? 27.695+-0.851 ? audio-oscillator 42.905+-0.296 ? 43.279+-1.073 ? imaging-darkroom 57.087+-0.377 56.879+-0.328 imaging-desaturate 41.833+-0.868 ? 42.040+-0.664 ? imaging-gaussian-blur 59.686+-2.614 59.220+-2.165 json-parse-financial 31.907+-1.183 ? 32.017+-0.471 ? json-stringify-tinderbox 21.679+-0.925 20.965+-0.751 might be 1.0341x faster stanford-crypto-aes 34.679+-1.212 34.222+-0.148 might be 1.0133x faster stanford-crypto-ccm 32.347+-1.444 ? 32.987+-2.118 ? might be 1.0198x slower stanford-crypto-pbkdf2 89.092+-0.363 ? 89.867+-1.349 ? stanford-crypto-sha256-iterative 29.307+-0.218 29.123+-0.197 <arithmetic> 49.241+-0.279 49.097+-0.314 might be 1.0029x faster TipOfTree Things AsmBench: bigfib.cpp 405.1739+-2.6410 404.2104+-1.9409 cray.c 357.2352+-2.9135 ? 357.2662+-2.7586 ? dry.c 395.2482+-4.1978 ? 414.1439+-42.2403 ? might be 1.0478x slower FloatMM.c 665.1913+-13.4518 ? 673.4448+-4.6110 ? might be 1.0124x slower gcc-loops.cpp 3379.0649+-9.6737 ? 3380.4471+-8.8589 ? n-body.c 748.3112+-3.9571 746.5452+-2.4560 Quicksort.c 379.4353+-4.1063 377.0033+-2.7371 stepanov_container.cpp 3168.5075+-22.4127 ? 3185.6080+-20.5172 ? Towers.c 237.2488+-3.8634 ? 239.0891+-4.0924 ? <geometric> 670.5841+-1.4917 ? 674.7199+-7.3719 ? might be 1.0062x slower TipOfTree Things Geomean of preferred means: <scaled-result> 47.0064+-0.1481 ? 47.0528+-0.1442 ? might be 1.0010x slower

Filip Pizlo

Comment 11 2016-10-03 18:31:33 PDT

Created attachment 290545 [details] the ptach

Filip Pizlo

Comment 12 2016-10-03 19:45:24 PDT

Created attachment 290556 [details] the patch Further improved the handling of version wrap-around.

Filip Pizlo

Comment 13 2016-10-04 08:39:27 PDT

This is a 0.35% PLT regression with 84% confidence.

Filip Pizlo

Comment 14 2016-10-04 08:56:56 PDT

(In reply to comment #13) > This is a 0.35% PLT regression with 84% confidence. Or it could be a 0.9% PLT regression with 99% confidence. The previous result had four runs of baseline and three runs of this patch. I had forgotten the extra run. After doing that, I got a result that was significantly slower, which throws the result a lot. Anyway, I have some ideas for how to make this cheaper. I'll try them.

Filip Pizlo

Comment 15 2016-10-04 11:18:07 PDT

Latest perf. Benchmark report for SunSpider, LongSpider, Octane, Kraken, and AsmBench on pizlo (MacPro5,1). VMs tested: "TipOfTree" at /Volumes/Data/secondary/OpenSource/WebKitBuild/Release/jsc (r206755) "Things" at /Volumes/Data/primary/OpenSource/WebKitBuild/Release/jsc (r206755) Collected 10 samples per benchmark/VM, with 10 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. TipOfTree Things SunSpider: 3d-cube 8.0235+-0.3621 ? 8.3052+-0.2384 ? might be 1.0351x slower 3d-morph 8.1279+-0.1226 ? 8.1917+-0.0854 ? 3d-raytrace 8.4992+-0.2040 8.4723+-0.1616 access-binary-trees 3.1203+-0.1078 3.0816+-0.1068 might be 1.0125x faster access-fannkuch 8.5027+-0.2279 8.4424+-0.3140 access-nbody 4.4228+-0.0787 4.3797+-0.1055 access-nsieve 4.8087+-0.1750 4.7793+-0.1374 bitops-3bit-bits-in-byte 1.7660+-0.0822 1.7055+-0.0541 might be 1.0355x faster bitops-bits-in-byte 5.3679+-0.1043 ? 5.4113+-0.0835 ? bitops-bitwise-and 2.8738+-0.0478 ? 2.8918+-0.1019 ? bitops-nsieve-bits 4.7063+-0.0625 ? 4.7643+-0.0575 ? might be 1.0123x slower controlflow-recursive 3.5873+-0.0647 3.5592+-0.0620 crypto-aes 7.0385+-0.1865 6.9646+-0.1515 might be 1.0106x faster crypto-md5 4.4600+-0.1392 4.3811+-0.1328 might be 1.0180x faster crypto-sha1 4.5745+-0.0961 4.5502+-0.0736 date-format-tofte 13.4292+-0.3363 ? 13.4452+-0.3493 ? date-format-xparb 7.5737+-0.1182 ? 7.5955+-0.1118 ? math-cordic 4.1582+-0.0785 ? 4.1888+-0.0811 ? math-partial-sums 8.9864+-0.1432 8.9397+-0.1546 math-spectral-norm 3.2642+-0.0827 3.1665+-0.0496 might be 1.0308x faster regexp-dna 10.3010+-0.1137 ? 10.3179+-0.2243 ? string-base64 6.9089+-0.1414 ? 7.1008+-0.1023 ? might be 1.0278x slower string-fasta 9.1714+-0.1150 ? 9.2841+-0.1289 ? might be 1.0123x slower string-tagcloud 13.4654+-0.2029 13.2572+-0.3336 might be 1.0157x faster string-unpack-code 26.7415+-0.3726 ? 27.5089+-2.1735 ? might be 1.0287x slower string-validate-input 6.5980+-0.0956 6.5756+-0.0805 <arithmetic> 7.3260+-0.0301 ? 7.3562+-0.0951 ? might be 1.0041x slower TipOfTree Things LongSpider: 3d-cube 1202.1618+-4.3588 ? 1211.1989+-5.4097 ? 3d-morph 1051.7163+-3.1291 ? 1054.7417+-3.7847 ? 3d-raytrace 846.1904+-5.0345 ^ 832.9858+-4.0975 ^ definitely 1.0159x faster access-binary-trees 1311.1012+-12.4107 ? 1314.3566+-6.3766 ? access-fannkuch 414.2702+-13.0253 ? 415.8683+-14.5161 ? access-nbody 1008.3697+-7.5767 1005.0294+-0.5716 access-nsieve 599.4214+-1.5117 598.0627+-1.2544 bitops-3bit-bits-in-byte 44.0405+-0.0957 44.0119+-0.1087 bitops-bits-in-byte 309.3904+-2.7643 309.1396+-2.1327 bitops-nsieve-bits 604.6160+-4.7808 601.3614+-3.7744 controlflow-recursive 806.7600+-0.3109 ? 807.5056+-0.8618 ? crypto-aes 920.0754+-4.4205 915.6011+-2.0631 crypto-md5 712.2925+-2.1893 711.7632+-3.3018 crypto-sha1 1028.0490+-2.7507 1027.5773+-5.9827 date-format-tofte 862.2874+-7.6796 ? 864.2536+-22.7555 ? date-format-xparb 998.2711+-11.3190 ^ 982.3115+-4.2447 ^ definitely 1.0162x faster hash-map 228.1379+-1.2474 226.8719+-3.0762 math-cordic 621.2927+-4.5599 ? 621.8403+-3.9488 ? math-partial-sums 903.6336+-1.2139 ? 904.6057+-1.4168 ? math-spectral-norm 1045.8192+-0.5604 1045.7521+-0.4860 string-base64 678.0794+-0.8927 676.1299+-2.2146 string-fasta 581.9292+-6.6014 580.7636+-3.1994 string-tagcloud 271.8647+-1.3777 ^ 267.3205+-1.5961 ^ definitely 1.0170x faster <geometric> 623.9462+-1.2138 622.4183+-1.5229 might be 1.0025x faster TipOfTree Things Octane: encrypt 0.27504+-0.00110 ? 0.27589+-0.00044 ? decrypt 5.12324+-0.05780 5.07856+-0.05851 deltablue x2 0.22472+-0.00984 ? 0.23279+-0.00854 ? might be 1.0359x slower earley 0.44779+-0.00160 ! 0.45199+-0.00193 ! definitely 1.0094x slower boyer 8.05465+-0.02511 ? 8.06552+-0.06843 ? navier-stokes x2 6.45853+-0.00695 ? 6.45901+-0.00837 ? raytrace x2 1.24906+-0.01144 ? 1.25735+-0.01483 ? richards x2 0.14393+-0.00175 0.14353+-0.00138 splay x2 0.54480+-0.00167 ^ 0.54039+-0.00215 ^ definitely 1.0081x faster regexp x2 28.43787+-0.26215 28.30601+-0.12242 pdfjs x2 65.02654+-0.34598 64.69731+-0.14130 mandreel x2 68.35391+-0.39557 ? 68.70450+-0.31458 ? gbemu x2 68.19290+-8.41274 64.05341+-6.80604 might be 1.0646x faster closure 0.80387+-0.00183 ! 0.80849+-0.00150 ! definitely 1.0058x slower jquery 10.53694+-0.02784 10.48505+-0.02524 box2d x2 16.72151+-0.05996 ^ 16.53741+-0.05892 ^ definitely 1.0111x faster zlib x2 554.40618+-11.52526 547.51534+-13.83813 might be 1.0126x faster typescript x2 1137.35742+-6.32767 1137.06296+-6.20274 <geometric> 8.47498+-0.07318 8.44480+-0.04927 might be 1.0036x faster TipOfTree Things Kraken: ai-astar 140.903+-1.017 ? 141.470+-0.815 ? audio-beat-detection 63.972+-0.343 ? 64.045+-0.229 ? audio-dft 128.646+-0.913 ? 128.777+-0.604 ? audio-fft 51.808+-0.344 ? 51.961+-0.237 ? audio-oscillator 76.304+-0.174 ? 76.421+-0.220 ? imaging-darkroom 96.350+-0.116 96.324+-0.168 imaging-desaturate 90.894+-0.231 90.802+-0.179 imaging-gaussian-blur 113.875+-4.287 109.129+-4.292 might be 1.0435x faster json-parse-financial 60.357+-0.302 ^ 59.852+-0.130 ^ definitely 1.0084x faster json-stringify-tinderbox 39.800+-0.148 39.783+-0.103 stanford-crypto-aes 62.260+-0.365 62.256+-0.849 stanford-crypto-ccm 57.612+-2.688 57.002+-3.128 might be 1.0107x faster stanford-crypto-pbkdf2 150.094+-4.087 147.819+-3.843 might be 1.0154x faster stanford-crypto-sha256-iterative 48.793+-0.273 ? 48.837+-0.499 ? <arithmetic> 84.405+-0.523 83.891+-0.668 might be 1.0061x faster TipOfTree Things AsmBench: bigfib.cpp 665.0508+-15.0276 662.8626+-11.7344 cray.c 597.5950+-1.8130 597.2118+-2.7995 dry.c 670.4390+-17.7186 657.1325+-16.6841 might be 1.0202x faster FloatMM.c 945.2271+-32.7938 927.8857+-31.7631 might be 1.0187x faster gcc-loops.cpp 6313.6921+-66.3733 6298.8269+-80.0661 n-body.c 1555.4205+-25.1477 1532.9256+-33.8577 might be 1.0147x faster Quicksort.c 606.6469+-3.7198 ? 608.5187+-5.4834 ? stepanov_container.cpp 4480.0198+-11.9496 ? 4482.6665+-13.2934 ? Towers.c 375.6496+-0.4232 ? 376.2430+-4.4923 ? <geometric> 1108.1300+-6.3161 1101.4505+-5.3150 might be 1.0061x faster TipOfTree Things Geomean of preferred means: <scaled-result> 81.6232+-0.2539 81.3925+-0.3046 might be 1.0028x faster

Filip Pizlo

Comment 16 2016-10-04 11:18:31 PDT

Created attachment 290624 [details] the patch Still running perf tests.

Filip Pizlo

Comment 17 2016-10-04 13:06:22 PDT

On PLT, this looks like a 0.8% regression with 99% confidence, even after doing some things to try to reduce the overhead.

Build Bot

Comment 18 2016-10-04 15:51:31 PDT

Comment on attachment 290624 [details] the patch Attachment 290624 [details] did not pass mac-debug-ews (mac): Output: http://webkit-queues.webkit.org/results/2220150 Number of test failures exceeded the failure limit.

Build Bot

Comment 19 2016-10-04 15:51:38 PDT

Created attachment 290658 [details] Archive of layout-test-results from ews117 for mac-yosemite The attached test failures were seen while running run-webkit-tests on the mac-debug-ews. Bot: ews117 Port: mac-yosemite Platform: Mac OS X 10.10.5

Filip Pizlo

Comment 20 2016-10-04 20:13:52 PDT

I think that I fixed the debug failure locally. I also think that I'm experiencing some kind of flukes in my benchmark runs. On the one hand, I'm seeing a 1% regression on membuster. But this regression appears entirely unaffected by the fact that we're allocating newlyAllocated more frequently, which is the only change in this patch that looks like it could affect memory usage. The only other possibility is that conservative roots have become more conservative, but I don't think this could be the case, since we are going to great pains to ensure that the isLive() query returns the same thing during conservative scans now as it did before. If this wasn't true - if isLive() returned true more often, for example - then this would manifest as a crash. I'm not seeing those (except in the last patch I uploaded - it was a separate goof and I fixed it). So, I'm going to dig into the membuster regression more, this time trying to prove that it is indeed a fluke. Maybe I'll try to run it on some other machines or do some clean builds. I should also say that the differences in membuster results between baseline and trunk and smaller than the differences I get between different runs of the same build, spaced out in time. A few hours can make the same build use 5% more or less memory! It's crazy!

Filip Pizlo

Comment 21 2016-10-04 20:24:46 PDT

Actually, I can test putting the conservative scan at the beginning. If this shows different memory overheads, then this would definitely take the focus off of newlyAllocated. I should also say that I measured the amount of memory used by newlyAllocated relative to the amount of memory used by the heap's capacity. It looked like there were fewer than 1 bytes of newlyAllocated for every 1000 bytes of marked block. That's just not enough to explain the membuster regressions that I'm seeing (0.7% and 1.2%, the 1.2% regression happening when I *reduced* the number of newlyAllocated allocations).

Filip Pizlo

Comment 22 2016-10-05 21:18:09 PDT

It looks likely that the membuster regression goes away if we do the stack scan before starting marking, rather than after. That's very surprising. I'm going to do more tests to confirm.

Filip Pizlo

Comment 23 2016-10-05 23:25:55 PDT

(In reply to comment #22) > It looks likely that the membuster regression goes away if we do the stack > scan before starting marking, rather than after. That's very surprising. > I'm going to do more tests to confirm. But, in that configuration, I also get a 1% PLT regression with 99% certainty. So weird. I'm going to try some other hardware configurations I'm going to run membuster some more to make sure that this isn't a fluke.

Filip Pizlo

Comment 24 2016-10-06 17:31:26 PDT

I'm getting membuster neutrality and a smaller PLT regression if I simply inline newlyAllocated into Handle instead of allocating it on demand.

Filip Pizlo

Comment 25 2016-10-06 17:31:57 PDT

Created attachment 290875 [details] more

Filip Pizlo

Comment 26 2016-10-06 18:11:25 PDT

Created attachment 290880 [details] perf neutral, space neutral

Filip Pizlo

Comment 27 2016-10-07 11:22:11 PDT

I did more experiments. With 10 samples of membuster for baseline and this patch, I get a 0.76% regression with 76% confidence. I'll take that to mean it's neutral.

Filip Pizlo

Comment 28 2016-10-07 13:38:52 PDT

This still looks like a 0.96% PLT regression with 99% confidence. I'm going to profile more. This regression makes no sense, since the only change here is to the GC and all other GC benchmarks are happy.

Filip Pizlo

Comment 29 2016-10-08 16:30:47 PDT

Created attachment 291030 [details] the patch

Geoffrey Garen

Comment 30 2016-10-11 14:31:25 PDT

Comment on attachment 291030 [details] the patch View in context: https://bugs.webkit.org/attachment.cgi?id=291030&action=review r=me I think there might be a clearer way to handle the wrap-around situation. > Source/JavaScriptCore/heap/Heap.cpp:1214 > void Heap::writeBarrierCurrentlyExecutingCodeBlocks() > { > - m_codeBlocks->writeBarrierCurrentlyExecutingCodeBlocks(this); > + m_codeBlocks->writeBarrierCurrentlyExecuting(this); > +} > + > +void Heap::clearCurrentlyExecutingCodeBlocks() > +{ > + m_codeBlocks->clearCurrentlyExecuting(); > } These probably don't need to be functions. > Source/JavaScriptCore/heap/MarkedBlock.cpp:57 > + , m_allocatedVersion(MarkedSpace::nullVersion) I would call this m_newlyAllocatedVersion since it guards m_newlyAllocated. Later, we should come up with a better name for m_newlyAllocated. Maybe m_liveAtGCStart or m_liveSnapshot. (I don't think m_newlyAllocated will include things allocated during GC, right? And of course, for now, we don't allocate during GC.) > Source/JavaScriptCore/heap/MarkedBlock.h:291 > + void setNeedsDestruction(); I would call this "updateNeedsDestruction" to distinguish from a setter. > Source/JavaScriptCore/heap/MarkedBlockInlines.h:49 > + // - We did ~2^32 collections and rotated the version back to null, so we needed to hard-reset > + // everything. If the marks had been stale, we would have cleared them. So, we can be sure that > + // any set mark bit reflects objects marked during last GC, i.e. "live" objects. It would help if you mentioned the resetAllocated() function by name here. (I think that's what you're talking about.) Is it really true that wrap-around can wrap-around back to null? I thought it wrapped around to firstVersion. If possible, I'd prefer to think in terms of some elements of state that are snapshotted at the start of GC, and then consulted later after the current GC has started to change some things. Is that possible here? This function almost amounts to the inverse of a snapshot of areMarksStale() at the start of GC. // True if areMarksStale() was true at the start of GC -- which means no object in this block was marked at the end of the last GC. inline bool MarkedBlock:areMarksStaleSnapshot() // Or maybe areMarksStaleAtGCStart() { return m_markingVersion == MarkedSpace::nullVersion || MarkedSpace::nextVersion(m_markingVersion) != markingVersion; } This changes the behavior of nullVersion slightly: now nullVersion says "don't check m_marks". It's possible to think of nullVersion as having up-to-date m_marks -- in the sense that they tell the (all zero) truth -- but I think it's more instructive to call this state not up-to-date -- in the sense that the GC mark phase never got a shot at setting m_marks, and so we know they were logically zero regardless of their bit values. Put another way, I would like to be able to ASSERT(areMarksStale() || m_marks.count()). > Source/JavaScriptCore/heap/MarkedBlockInlines.h:52 > + // It would be absurd to use this method when not collecting, since this special "one version > + // back" state only makes sense when we're in a concurrent collection and have to be > + // conservative. Can we assert we're collecting here? > Source/JavaScriptCore/heap/MarkedSpace.cpp:495 > + if (UNLIKELY(nextVersion(m_allocatedVersion) == initialVersion)) { > + forEachBlock( > + [&] (MarkedBlock::Handle* handle) { > + handle->resetAllocated(); > + }); > + } This is super subtle. I think you should mention in a comment how this is needed by marksConveyLivenessDuringMarking. And you should also describe the problem of marksConveyLivenessDuringMarking eventually becoming (incorrectly) true due to wrap-around. Perhaps a clearer thing to do upon wrap-around is to reset all blocks to nullVersion at the start of a wrap-around GC?

Filip Pizlo

Comment 31 2016-10-11 14:59:09 PDT

(In reply to comment #30) > Comment on attachment 291030 [details] > the patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=291030&action=review > > r=me > > I think there might be a clearer way to handle the wrap-around situation. > > > Source/JavaScriptCore/heap/Heap.cpp:1214 > > void Heap::writeBarrierCurrentlyExecutingCodeBlocks() > > { > > - m_codeBlocks->writeBarrierCurrentlyExecutingCodeBlocks(this); > > + m_codeBlocks->writeBarrierCurrentlyExecuting(this); > > +} > > + > > +void Heap::clearCurrentlyExecutingCodeBlocks() > > +{ > > + m_codeBlocks->clearCurrentlyExecuting(); > > } > > These probably don't need to be functions. Done. > > > Source/JavaScriptCore/heap/MarkedBlock.cpp:57 > > + , m_allocatedVersion(MarkedSpace::nullVersion) > > I would call this m_newlyAllocatedVersion since it guards m_newlyAllocated. Done. > > Later, we should come up with a better name for m_newlyAllocated. Maybe > m_liveAtGCStart or m_liveSnapshot. (I don't think m_newlyAllocated will > include things allocated during GC, right? And of course, for now, we don't > allocate during GC.) No, we will set newlyAllocated during GC. We will scan stack at the end of GC and during every GC fixpoint stop-the-world increment. During that time, we will stopAllocating(), which will put things in newlyAllocated. > > > Source/JavaScriptCore/heap/MarkedBlock.h:291 > > + void setNeedsDestruction(); > > I would call this "updateNeedsDestruction" to distinguish from a setter. Done. > > > Source/JavaScriptCore/heap/MarkedBlockInlines.h:49 > > + // - We did ~2^32 collections and rotated the version back to null, so we needed to hard-reset > > + // everything. If the marks had been stale, we would have cleared them. So, we can be sure that > > + // any set mark bit reflects objects marked during last GC, i.e. "live" objects. > > It would help if you mentioned the resetAllocated() function by name here. > (I think that's what you're talking about.) No, that function is for newlyAllocatedVersion. > > Is it really true that wrap-around can wrap-around back to null? I thought > it wrapped around to firstVersion. > > If possible, I'd prefer to think in terms of some elements of state that are > snapshotted at the start of GC, and then consulted later after the current > GC has started to change some things. Is that possible here? That's exactly what this code is trying to do. > This function > almost amounts to the inverse of a snapshot of areMarksStale() at the start > of GC. > > // True if areMarksStale() was true at the start of GC -- which means no > object in this block was marked at the end of the last GC. > inline bool MarkedBlock:areMarksStaleSnapshot() // Or maybe > areMarksStaleAtGCStart() > { > return m_markingVersion == MarkedSpace::nullVersion || > MarkedSpace::nextVersion(m_markingVersion) != markingVersion; > } > > This changes the behavior of nullVersion slightly: now nullVersion says > "don't check m_marks". It's possible to think of nullVersion as having > up-to-date m_marks -- in the sense that they tell the (all zero) truth -- > but I think it's more instructive to call this state not up-to-date -- in > the sense that the GC mark phase never got a shot at setting m_marks, and so > we know they were logically zero regardless of their bit values. Put another > way, I would like to be able to ASSERT(areMarksStale() || m_marks.count()). MarkedBlock::aboutToMarkSlow() needs to know what the mark bits were at the start of the GC. Currently we achieve it like this: 1) If aboutToMarkSlow() sees a null marking version then the marks reflect the truth at start of GC. 2) If aboutToMarkSlow() sees a one-version-old version then the marks reflect the truth at the start of GC. 3) In all other cases, the marks should have been cleared, but hadn't been yet - so ignore their contents. (1) arises when the MarkedBlock is null or if we had a wrap-around. Before the wrap-around bumps the version, it makes sure that the marks are either clear or not depending on what their version had been just before the wrap-around happened. Then it sets the version to null. Then it can wrap the version around. You're proposing a change to marksConveyLivenessDuringMarking(), but you haven't proposed a corresponding change to how aboutToMarkSlow() should use this to figure out if it should transfer the contents of marks into newlyAllocated or not. I don't think that what you're proposing handles this corner case, so I'll keep my version for now. > > > Source/JavaScriptCore/heap/MarkedBlockInlines.h:52 > > + // It would be absurd to use this method when not collecting, since this special "one version > > + // back" state only makes sense when we're in a concurrent collection and have to be > > + // conservative. > > Can we assert we're collecting here? Done. The new way to do this is MarkedSpace::isMarking(), which is more precise for this case. > > > Source/JavaScriptCore/heap/MarkedSpace.cpp:495 > > + if (UNLIKELY(nextVersion(m_allocatedVersion) == initialVersion)) { > > + forEachBlock( > > + [&] (MarkedBlock::Handle* handle) { > > + handle->resetAllocated(); > > + }); > > + } > > This is super subtle. I think you should mention in a comment how this is > needed by marksConveyLivenessDuringMarking. And you should also describe the > problem of marksConveyLivenessDuringMarking eventually becoming > (incorrectly) true due to wrap-around. marksConveyLivenessDuringMarking() does not check the newlyAllocatedVersion. It checks the markingVersion, which is different. I don't think that this is code is needed by marksConveyLivenessDuringMarking(), which doesn't know that newlyAllocated is also versioned. > > Perhaps a clearer thing to do upon wrap-around is to reset all blocks to > nullVersion at the start of a wrap-around GC? But that's what resetAllocated() does already. It sets the newlyAllocatedVersion to nullVersion. It also clears newlyAllocated, but I think that's not necessary.

Filip Pizlo

Comment 32 2016-10-11 15:37:22 PDT

Created attachment 291308 [details] patch for landing

Geoffrey Garen

Comment 33 2016-10-11 16:25:57 PDT

> > > Source/JavaScriptCore/heap/MarkedBlockInlines.h:49 > > > + // - We did ~2^32 collections and rotated the version back to null, so we needed to hard-reset > > > + // everything. If the marks had been stale, we would have cleared them. So, we can be sure that > > > + // any set mark bit reflects objects marked during last GC, i.e. "live" objects. > > > > It would help if you mentioned the resetAllocated() function by name here. > > (I think that's what you're talking about.) > > No, that function is for newlyAllocatedVersion. What function clears marks if stale upon wrap-around? That's the function I would point to here. > You're proposing a change to marksConveyLivenessDuringMarking(), but you > haven't proposed a corresponding change to how aboutToMarkSlow() should use > this to figure out if it should transfer the contents of marks into > newlyAllocated or not. if (handle().allocator()->isAllocated(&handle()) || areMarksStaleSnapshot()) m_marks.clearAll(); else { // Transfer m_marks to m_newlyAllocated ... }

Filip Pizlo

Comment 34 2016-10-11 16:55:05 PDT

Landed in https://trac.webkit.org/changeset/207179

Geoffrey Garen

Comment 35 2016-10-11 17:18:33 PDT

A related observation: If each GC took 1ms, you would need to GC continuously for (2^32 / 1000 / 60 / 60 / 24) = ~50 consecutive days before achieving wrap-around. And of course nobody actually GCs continuously per 1ms, so the real-world worst case is probably more like one GC per 10s (2^32 / 6 / 60 / 24) = ~500,000 days. Can we just RELEASE_ASSERT that doesn't happen? The odds that the user will crash, navigate, quit, or install a software update before wrapping around seem to be effectively 100%.

Filip Pizlo

Comment 36 2016-10-11 19:18:06 PDT

(In reply to comment #35) > A related observation: > > If each GC took 1ms, you would need to GC continuously for (2^32 / 1000 / 60 > / 60 / 24) = ~50 consecutive days before achieving wrap-around. The shortest GC I see in earley is 227us. > > And of course nobody actually GCs continuously per 1ms, so the real-world > worst case is probably more like one GC per 10s (2^32 / 6 / 60 / 24) = > ~500,000 days. Earley does 237 GCs in 2.689 seconds, so 88 GCs/sec. You would need 2^32/88 seconds, so 48806446 seconds, or 564 days. > > Can we just RELEASE_ASSERT that doesn't happen? > > The odds that the user will crash, navigate, quit, or install a software > update before wrapping around seem to be effectively 100%. Not 100%. Someone could use JavaScriptCore to run a two-year compute task on their desktop. I've heard of two-year compute tasks, and I've heard of two-year uptimes. I would agree if handling the wrap-around was hard or untestable, and if I was confident that we never wanted the size of the version to be smaller. The version stuff probably looks more confusing in patch form than it does when you just look at trunk. I plan to add tests by having an option that forces the wrap-around to happen sooner, and run just earley and splay in that mode, for three seconds each (in my experience that catches >95% of GC bugs). Our GC would perform statistically as well if the versions were a byte, so if we ever needed space in Handle or MarkedBlock for any reason then these versions would be prime candidates; this would cause us to definitely need a wrap-around handler. That's why when I reduced the version from 64-bit to 32-bit, I added the wrap-around fallback. Sure "nobody" will need it, but if somebody does need it and we don't have it then I would feel bad.

Filip Pizlo

Comment 37 2016-10-11 19:23:44 PDT

(In reply to comment #33) > > > > Source/JavaScriptCore/heap/MarkedBlockInlines.h:49 > > > > + // - We did ~2^32 collections and rotated the version back to null, so we needed to hard-reset > > > > + // everything. If the marks had been stale, we would have cleared them. So, we can be sure that > > > > + // any set mark bit reflects objects marked during last GC, i.e. "live" objects. > > > > > > It would help if you mentioned the resetAllocated() function by name here. > > > (I think that's what you're talking about.) > > > > No, that function is for newlyAllocatedVersion. > > What function clears marks if stale upon wrap-around? That's the function I > would point to here. Gotcha. You meant resetMarks(). > > > You're proposing a change to marksConveyLivenessDuringMarking(), but you > > haven't proposed a corresponding change to how aboutToMarkSlow() should use > > this to figure out if it should transfer the contents of marks into > > newlyAllocated or not. > > if (handle().allocator()->isAllocated(&handle()) || areMarksStaleSnapshot()) > m_marks.clearAll(); > else { > // Transfer m_marks to m_newlyAllocated > ... > } Your areMarksStaleSnapshot is: m_markingVersion == MarkedSpace::nullVersion || MarkedSpace::nextVersion(m_markingVersion) != markingVersion; So, we would clear the marks on null version. That's not correct for wrap-around, where the marks may have been the correct thing just prior to wrap-around. See resetMarks(): void MarkedBlock::resetMarks() { // We want aboutToMarkSlow() to see what the mark bits were after the last collection. It uses // the version number to distinguish between the marks having already been stale before // beginMarking(), or just stale now that beginMarking() bumped the version. If we have a version // wraparound, then we will call this method before resetting the version to null. When the // version is null, aboutToMarkSlow() will assume that the marks were not stale as of before // beginMarking(). Hence the need to whip the marks into shape. if (areMarksStale()) m_marks.clearAll(); m_markingVersion = MarkedSpace::nullVersion; }

Geoffrey Garen

Comment 38 2016-10-12 11:55:29 PDT

> Not 100%. Someone could use JavaScriptCore to run a two-year compute task > on their desktop. I've heard of two-year compute tasks, and I've heard of > two-year uptime. You tried this with a benchmark bot and it literally melted the hardware. Also, two-year compute tasks probably run on 64bit machines, so we could use 'long' instead of uint32_t in order to accommodate them. > I would agree if handling the wrap-around was hard or untestable, and if I > was confident that we never wanted the size of the version to be smaller. > The version stuff probably looks more confusing in patch form than it does > when you just look at trunk. I plan to add tests by having an option that > forces the wrap-around to happen sooner, and run just earley and splay in > that mode, for three seconds each (in my experience that catches >95% of GC > bugs). Our GC would perform statistically as well if the versions were a > byte, so if we ever needed space in Handle or MarkedBlock for any reason > then these versions would be prime candidates; this would cause us to > definitely need a wrap-around handler. That's why when I reduced the > version from 64-bit to 32-bit, I added the wrap-around fallback. Sure > "nobody" will need it, but if somebody does need it and we don't have it > then I would feel bad. These are some pretty hardcore theoretical arguments to justify real-world complexity that will impact all of our users and developers.

Filip Pizlo

Comment 39 2016-10-12 12:08:00 PDT

(In reply to comment #38) > > Not 100%. Someone could use JavaScriptCore to run a two-year compute task > > on their desktop. I've heard of two-year compute tasks, and I've heard of > > two-year uptime. > > You tried this with a benchmark bot and it literally melted the hardware. It was a laptop, not a desktop. > > Also, two-year compute tasks probably run on 64bit machines, so we could use > 'long' instead of uint32_t in order to accommodate them. You're right, we could make HeapVersion have different sizes depending on hardware. That seems like a different kind of complexity. > > > I would agree if handling the wrap-around was hard or untestable, and if I > > was confident that we never wanted the size of the version to be smaller. > > The version stuff probably looks more confusing in patch form than it does > > when you just look at trunk. I plan to add tests by having an option that > > forces the wrap-around to happen sooner, and run just earley and splay in > > that mode, for three seconds each (in my experience that catches >95% of GC > > bugs). Our GC would perform statistically as well if the versions were a > > byte, so if we ever needed space in Handle or MarkedBlock for any reason > > then these versions would be prime candidates; this would cause us to > > definitely need a wrap-around handler. That's why when I reduced the > > version from 64-bit to 32-bit, I added the wrap-around fallback. Sure > > "nobody" will need it, but if somebody does need it and we don't have it > > then I would feel bad. > > These are some pretty hardcore theoretical arguments to justify real-world > complexity that will impact all of our users and developers. How will this complexity impact all of our users and developers?

Filip Pizlo

Comment 40 2016-10-12 12:40:44 PDT

Just to be clear, removal of the wraparound handler would reduce complexity as follows: - Remove resetMarks(). The body of this method is three lines of code. - Remove resetNewlyAllocated(). The body of this method is two lines of code. - Remove marksConveyLivenessDuringMarking() and have aboutToMarkSlow() just check version directly. The body of this method is three lines of code, including the line that has the non-wraparound version check (MarkedSpace::nextVersion(m_markingVersion) == markingVersion). - Remove the loops that call resetMarks() and resetNewlyAllocated(). Those are six lines of code each using my style of lambda indentation and the surrounding if {}. None of the code that handles the wraparound handler needs to be correct for when wraparound does not happen: - resetMarks() won't be called. - resetNewlyAllocated() won't be called. - marksConveyLivenessDuringMarking() could return either true or false for nullVersion and everything would be fine. I'm not counting comments. If we had no handling of wraparound, I would instead have a comment explaining why we can ignore wraparound. We'd still have most of the huge comment in marksConveyLivenessDuringMarking(); we'd probably put it in aboutToMarkSlow(). So, the complexity being removed is 3 + 2 + 3 + 6 + 6 = 20 lines of code, all of which is live only when the bad thing happens. That's quite small. C programs can run for more than two years. I want to be able to say that JS programs can also run for two years. 20 lines of code is a small price to pay for continuing our philosophy that JS programs shouldn't have to suffer unnecessary limitations.

Filip Pizlo

Comment 41 2016-10-12 19:11:41 PDT

As far as I can see, this landed without causing any regressions. Hooray!

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware All

OS All

Product WebKit

Component JavaScriptCore

Assignee

Filip Pizlo

Reported

2016-09-20 11:26 PDT

Modified

2016-10-12 19:11 PDT History

CC List

15 users Show

URL

Keywords

Depends on

162310 162842

Blocks

149432

Dependencies

tree graph