Bug 70995

Summary: The GC should be parallel
Product: WebKit Reporter: Filip Pizlo <fpizlo>
Component: JavaScriptCoreAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: ggaren, gustavo, levin+threading, sam, webkit.review.bot, xan.lopez
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Attachments:
Description Flags
work in progress
none
the patch
none
the patch
none
the patch
none
the patch
gustavo: commit-queue-
the patch
ggaren: review+
the patch - checking EWS with changes to address review
ggaren: review+
the patch
none
the patch none

Filip Pizlo
Reported 2011-10-27 00:35:43 PDT
Garbage collection is embarrassingly parallel. There is nothing in our marking code that isn't thread-safe other than the flipping of mark bits, which can be done with a single CAS instruction.
Attachments
work in progress (36.71 KB, patch)
2011-10-27 00:41 PDT, Filip Pizlo
no flags
the patch (50.99 KB, patch)
2011-10-29 22:27 PDT, Filip Pizlo
no flags
the patch (50.97 KB, patch)
2011-10-29 22:33 PDT, Filip Pizlo
no flags
the patch (57.25 KB, patch)
2011-10-30 00:19 PDT, Filip Pizlo
no flags
the patch (57.58 KB, patch)
2011-10-30 00:27 PDT, Filip Pizlo
gustavo: commit-queue-
the patch (57.85 KB, patch)
2011-10-30 01:11 PDT, Filip Pizlo
ggaren: review+
the patch - checking EWS with changes to address review (59.49 KB, patch)
2011-10-31 19:04 PDT, Filip Pizlo
ggaren: review+
the patch (60.12 KB, patch)
2011-10-31 22:31 PDT, Filip Pizlo
no flags
the patch (60.16 KB, patch)
2011-10-31 23:16 PDT, Filip Pizlo
no flags
Filip Pizlo
Comment 1 2011-10-27 00:41:51 PDT
Created attachment 112647 [details] work in progress This works, and produces speed-ups, but still needs clean-ups. Here's the performance using my harness tuned to measure GC behavior. Benchmark report for V8. VMs tested: "TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc "ParallelGC" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc Collected 8 samples per benchmark/VM, with 4 VM invocations per benchmark. Ran 5 benchmark iterations, and measured the total time of those iterations, for each sample. Emitted a call to gc() between sample measurements. Used 5 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. TipOfTree ParallelGC crypto 369.6983+-2.7221 ? 370.0403+-1.5868 ? deltablue 903.7007+-8.3329 ? 911.5094+-5.1716 ? earley-boyer 462.6758+-1.3969 461.6687+-1.3534 raytrace 315.5462+-1.9768 ? 318.6208+-1.9752 ? regexp 549.5400+-4.7576 ^ 541.9117+-1.7295 ^ definitely 1.0141x faster richards 632.6940+-2.0461 ? 634.8623+-2.2790 ? splay 414.3850+-2.1458 ^ 325.4297+-17.9843 ^ definitely 1.2733x faster <arithmetic> 521.1771+-1.3120 ^ 509.1490+-2.7390 ^ definitely 1.0236x faster <geometric> * 492.4826+-1.0591 ^ 476.1071+-3.8258 ^ definitely 1.0344x faster <harmonic> 468.0389+-1.1209 ^ 448.5087+-4.9235 ^ definitely 1.0435x faster
Filip Pizlo
Comment 2 2011-10-27 00:43:20 PDT
5% speed-up in the V8 harness. [pizlo@minime benchmarks] /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc run.js Richards: 10005 DeltaBlue: 6367 Crypto: 15714 RayTrace: 8493 EarleyBoyer: 10495 RegExp: 2126 Splay: 7236 ---- Score (version 6): 7530 [pizlo@minime benchmarks] /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc run.js Richards: 10149 DeltaBlue: 6499 Crypto: 15935 RayTrace: 8844 EarleyBoyer: 10821 RegExp: 2180 Splay: 9013 ---- Score (version 6): 7932 [pizlo@minime benchmarks] irb >> 7932/7530.0 => 1.05338645418327
WebKit Review Bot
Comment 3 2011-10-27 00:44:37 PDT
Attachment 112647 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/heap/SlotVisitor.h:38: The parameter name "heap" adds no information, so it should be removed. [readability/parameter_name] [5] Source/JavaScriptCore/heap/MarkStack.cpp:165: Should have a space between // and comment [whitespace/comments] [4] Source/JavaScriptCore/heap/MarkStack.cpp:183: Should have a space between // and comment [whitespace/comments] [4] Source/JavaScriptCore/heap/MarkStack.cpp:189: This { should be at the end of the previous line [whitespace/braces] [4] Source/JavaScriptCore/heap/MarkStack.cpp:212: Should have a space between // and comment [whitespace/comments] [4] Source/JavaScriptCore/heap/MarkStack.cpp:220: Should have a space between // and comment [whitespace/comments] [4] Source/JavaScriptCore/heap/MarkStack.cpp:229: Should have a space between // and comment [whitespace/comments] [4] Source/JavaScriptCore/heap/MarkStack.cpp:234: Should have a space between // and comment [whitespace/comments] [4] Source/JavaScriptCore/heap/MarkStack.h:125: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5] Source/JavaScriptCore/heap/Heap.cpp:695: Should have a space between // and comment [whitespace/comments] [4] Source/JavaScriptCore/heap/Heap.cpp:697: Should have a space between // and comment [whitespace/comments] [4] Source/JavaScriptCore/heap/Heap.cpp:801: Should have a space between // and comment [whitespace/comments] [4] Source/JavaScriptCore/heap/Heap.cpp:803: Should have a space between // and comment [whitespace/comments] [4] Total errors found: 13 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.
Sam Weinig
Comment 4 2011-10-27 11:06:07 PDT
I am not sure how relevant it is, but you may want to look at wtf/ParallelJobs.h. On the Mac it will use libdispatch, which can be fun.
Filip Pizlo
Comment 5 2011-10-27 13:50:11 PDT
Here's the overall performance impact using my harness: Benchmark report for SunSpider, V8, and Kraken. VMs tested: "TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc "ParallelGC" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. TipOfTree ParallelGC SunSpider: 3d-cube 7.2180+-0.1808 ? 7.4524+-0.1367 ? might be 1.0325x slower 3d-morph 7.6799+-0.1029 7.5913+-0.1302 might be 1.0117x faster 3d-raytrace 7.5534+-0.2183 7.4831+-0.1791 access-binary-trees 1.6424+-0.0414 1.6249+-0.0357 might be 1.0107x faster access-fannkuch 6.3541+-0.1033 ? 6.4410+-0.1137 ? might be 1.0137x slower access-nbody 3.8256+-0.1068 3.7492+-0.0887 might be 1.0204x faster access-nsieve 2.6080+-0.0739 2.5943+-0.0592 bitops-3bit-bits-in-byte 1.3032+-0.0421 ? 1.3417+-0.0224 ? might be 1.0295x slower bitops-bits-in-byte 2.3928+-0.0527 2.3785+-0.0843 bitops-bitwise-and 3.3468+-0.1038 3.3233+-0.0834 bitops-nsieve-bits 5.2992+-0.1325 ? 5.3240+-0.0675 ? controlflow-recursive 2.1437+-0.0404 2.1308+-0.0296 crypto-aes 7.3174+-0.2212 ? 7.3442+-0.1498 ? crypto-md5 2.7344+-0.0748 2.6672+-0.0543 might be 1.0252x faster crypto-sha1 2.5011+-0.0744 2.4450+-0.0696 might be 1.0229x faster date-format-tofte 9.8872+-0.2085 ? 10.0584+-0.3533 ? might be 1.0173x slower date-format-xparb 9.2605+-0.2615 9.1278+-0.1729 might be 1.0145x faster math-cordic 6.5256+-0.0875 6.4593+-0.1045 might be 1.0103x faster math-partial-sums 7.5841+-0.1001 7.5316+-0.0998 math-spectral-norm 2.5835+-0.0451 ? 2.6250+-0.0915 ? might be 1.0160x slower regexp-dna 11.4311+-0.1952 ? 11.6339+-0.2408 ? might be 1.0177x slower string-base64 4.3517+-0.1414 ? 4.5075+-0.1156 ? might be 1.0358x slower string-fasta 6.3099+-0.0896 ? 6.3234+-0.1080 ? string-tagcloud 11.5226+-0.2577 ? 11.5569+-0.2916 ? string-unpack-code 20.3920+-0.3430 ? 20.5870+-0.6153 ? string-validate-input 5.1896+-0.1096 ? 5.3466+-0.1446 ? might be 1.0303x slower <arithmetic> * 6.1138+-0.0290 ? 6.1403+-0.0386 ? <geometric> 4.9488+-0.0213 ? 4.9601+-0.0298 ? <harmonic> 3.9705+-0.0217 ? 3.9761+-0.0319 ? TipOfTree ParallelGC V8: crypto 73.8359+-0.2957 ? 74.4488+-0.7109 ? deltablue 175.5006+-1.3144 ? 176.8104+-1.2502 ? earley-boyer 91.1516+-0.8585 90.1037+-0.2475 might be 1.0116x faster raytrace 62.9210+-0.4437 ? 63.1005+-0.7763 ? regexp 105.2908+-1.3755 104.8327+-0.9881 richards 126.8941+-0.5265 ? 127.2614+-1.0312 ? splay 91.3876+-0.6759 ^ 72.8390+-0.9795 ^ definitely 1.2547x faster <arithmetic> 103.8545+-0.1819 ^ 101.3424+-0.2595 ^ definitely 1.0248x faster <geometric> * 98.6156+-0.2071 ^ 95.5441+-0.3157 ^ definitely 1.0321x faster <harmonic> 94.0416+-0.2326 ^ 90.7043+-0.3916 ^ definitely 1.0368x faster TipOfTree ParallelGC Kraken: ai-astar 499.2170+-3.3908 ? 506.4470+-10.1727 ? might be 1.0145x slower audio-beat-detection 191.3056+-1.0695 ? 191.4846+-0.6155 ? audio-dft 271.6365+-6.4902 268.4316+-1.7440 might be 1.0119x faster audio-fft 125.1429+-0.8487 ? 125.8270+-1.1635 ? audio-oscillator 252.0302+-1.9938 ? 252.9450+-2.0986 ? imaging-darkroom 400.8323+-1.7930 ? 401.2702+-2.5584 ? imaging-desaturate 226.0704+-1.1277 225.7703+-0.7585 imaging-gaussian-blur 556.3116+-1.9274 ? 560.3647+-2.2416 ? json-parse-financial 57.9881+-1.3226 ! 59.8796+-0.2561 ! definitely 1.0326x slower json-stringify-tinderbox 68.2823+-0.7350 67.9300+-0.5355 stanford-crypto-aes 134.2209+-1.9479 133.6694+-1.6344 stanford-crypto-ccm 100.1903+-1.0490 99.9488+-0.4803 stanford-crypto-pbkdf2 196.7415+-3.0909 193.2500+-0.9395 might be 1.0181x faster stanford-crypto-sha256-iterative 71.2779+-1.0248 70.7566+-0.4374 <arithmetic> * 225.0891+-0.6866 ? 225.5696+-0.6807 ? <geometric> 177.4647+-0.6886 ? 177.6658+-0.3348 ? <harmonic> 139.9202+-0.8330 ? 140.3430+-0.2271 ? TipOfTree ParallelGC All benchmarks: <arithmetic> 85.8976+-0.2217 85.6813+-0.2179 <geometric> 22.4445+-0.0697 22.3746+-0.0873 <harmonic> 6.9912+-0.0374 ? 6.9982+-0.0548 ? TipOfTree ParallelGC Geomean of preferred means: <scaled-result> 51.3886+-0.1261 ^ 50.9589+-0.1603 ^ definitely 1.0084x faster
Filip Pizlo
Comment 6 2011-10-27 13:54:55 PDT
Even when running with parallel GC enabled but with only 1 worker - so the load balancer is doing completely pointless work - this patch is basically performance-neutral. It's a slow-down on v8-splay (4%, ish) but this does not show up in the grand averages. Benchmark report for SunSpider, V8, and Kraken. VMs tested: "TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc "ParGC_1worker" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. TipOfTree ParGC_1worker SunSpider: 3d-cube 7.2949+-0.1575 ? 7.3727+-0.1966 ? might be 1.0107x slower 3d-morph 7.6382+-0.1124 7.5984+-0.1009 3d-raytrace 7.4934+-0.1674 ? 7.5545+-0.2226 ? access-binary-trees 1.5628+-0.0328 ! 1.6664+-0.0407 ! definitely 1.0662x slower access-fannkuch 6.3769+-0.0957 6.3468+-0.1113 access-nbody 3.8433+-0.0674 3.8371+-0.0816 access-nsieve 2.6024+-0.0664 ? 2.6081+-0.0886 ? bitops-3bit-bits-in-byte 1.3050+-0.0218 1.3031+-0.0404 bitops-bits-in-byte 2.3791+-0.0536 ? 2.4168+-0.0887 ? might be 1.0159x slower bitops-bitwise-and 3.3935+-0.1110 3.3504+-0.0702 might be 1.0129x faster bitops-nsieve-bits 5.2880+-0.0830 ? 5.3889+-0.0710 ? might be 1.0191x slower controlflow-recursive 2.0956+-0.0343 ? 2.1096+-0.0380 ? crypto-aes 7.4149+-0.1537 ? 7.5127+-0.1670 ? might be 1.0132x slower crypto-md5 2.6900+-0.0664 ? 2.7824+-0.0428 ? might be 1.0343x slower crypto-sha1 2.4659+-0.0728 2.4191+-0.0656 might be 1.0194x faster date-format-tofte 9.9838+-0.3736 ? 10.0864+-0.1818 ? might be 1.0103x slower date-format-xparb 9.2507+-0.2275 9.2164+-0.2176 math-cordic 6.4024+-0.1527 ? 6.5353+-0.1568 ? might be 1.0208x slower math-partial-sums 7.5328+-0.1274 7.4789+-0.1229 math-spectral-norm 2.6021+-0.0688 2.5571+-0.0609 might be 1.0176x faster regexp-dna 11.5334+-0.2701 11.4021+-0.1800 might be 1.0115x faster string-base64 4.4235+-0.1934 4.4043+-0.1079 string-fasta 6.3310+-0.1557 ? 6.3362+-0.1281 ? string-tagcloud 11.8947+-0.3594 11.6337+-0.2537 might be 1.0224x faster string-unpack-code 20.4274+-0.3439 20.1890+-0.3354 might be 1.0118x faster string-validate-input 5.3787+-0.1406 5.3504+-0.1414 <arithmetic> * 6.1386+-0.0371 6.1330+-0.0328 <geometric> 4.9505+-0.0321 ? 4.9654+-0.0321 ? <harmonic> 3.9508+-0.0324 ? 3.9787+-0.0356 ? TipOfTree ParGC_1worker V8: crypto 74.6573+-0.9355 ? 74.6671+-1.2695 ? deltablue 176.7681+-1.9078 ? 178.8428+-1.3773 ? might be 1.0117x slower earley-boyer 90.6198+-0.4111 ? 91.2420+-0.9776 ? raytrace 63.4282+-0.7518 62.6713+-0.4769 might be 1.0121x faster regexp 105.9695+-1.1261 ^ 104.3170+-0.4879 ^ definitely 1.0158x faster richards 126.5331+-0.4297 ? 126.9162+-0.6510 ? splay 91.4826+-0.6302 ! 95.4367+-0.4899 ! definitely 1.0432x slower <arithmetic> 104.2084+-0.3281 ? 104.8705+-0.3554 ? <geometric> * 98.9672+-0.2982 ? 99.4817+-0.3438 ? <harmonic> 94.4207+-0.3735 ? 94.7775+-0.3760 ? TipOfTree ParGC_1worker Kraken: ai-astar 497.9819+-1.5014 ? 501.5006+-4.3832 ? audio-beat-detection 193.8519+-2.2030 191.8244+-1.9925 might be 1.0106x faster audio-dft 267.2113+-3.2929 ? 271.1404+-4.5041 ? might be 1.0147x slower audio-fft 125.1287+-1.0431 ? 125.7332+-0.9478 ? audio-oscillator 250.9315+-1.1048 ? 253.3135+-1.5058 ? imaging-darkroom 397.6901+-1.3490 ! 402.9572+-1.8083 ! definitely 1.0132x slower imaging-desaturate 228.5683+-2.8265 226.6920+-1.4972 imaging-gaussian-blur 560.2089+-2.9211 556.4905+-1.8611 json-parse-financial 57.2933+-0.3284 ! 58.8656+-0.6922 ! definitely 1.0274x slower json-stringify-tinderbox 70.1123+-1.4993 ^ 67.9328+-0.3743 ^ definitely 1.0321x faster stanford-crypto-aes 132.2364+-1.5831 ? 133.8309+-1.5803 ? might be 1.0121x slower stanford-crypto-ccm 100.2620+-0.6544 ? 100.6823+-1.2382 ? stanford-crypto-pbkdf2 198.5007+-3.3325 195.2135+-2.0347 might be 1.0168x faster stanford-crypto-sha256-iterative 70.9986+-0.3678 70.7538+-0.2947 <arithmetic> * 225.0697+-0.3483 ? 225.4951+-0.7427 ? <geometric> 177.5370+-0.4508 ? 177.7344+-0.4956 ? <harmonic> 140.0540+-0.5592 ? 140.1788+-0.4364 ? TipOfTree ParGC_1worker All benchmarks: <arithmetic> 85.9583+-0.1007 ? 86.1804+-0.2358 ? <geometric> 22.4631+-0.0890 ? 22.5254+-0.0871 ? <harmonic> 6.9577+-0.0557 ? 7.0059+-0.0611 ? TipOfTree ParGC_1worker Geomean of preferred means: <scaled-result> 51.5176+-0.1332 ? 51.6232+-0.1352 ?
Filip Pizlo
Comment 7 2011-10-27 15:03:41 PDT
And here's the performance with ENABLE_PARALLEL_GC set to 0. Benchmark report for SunSpider, V8, and Kraken. VMs tested: "TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc "ParGC_off" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. TipOfTree ParGC_off SunSpider: 3d-cube 7.3494+-0.1367 7.3081+-0.1812 3d-morph 7.7531+-0.1294 7.6432+-0.0939 might be 1.0144x faster 3d-raytrace 7.6150+-0.2149 ? 7.7728+-0.2126 ? might be 1.0207x slower access-binary-trees 1.6171+-0.0601 ? 1.6446+-0.0589 ? might be 1.0170x slower access-fannkuch 6.4249+-0.1303 ? 6.4619+-0.0996 ? access-nbody 3.7535+-0.0800 3.7456+-0.0676 access-nsieve 2.6276+-0.0681 2.6007+-0.0638 might be 1.0103x faster bitops-3bit-bits-in-byte 1.3021+-0.0327 ? 1.3056+-0.0557 ? bitops-bits-in-byte 2.3702+-0.0626 ? 2.3866+-0.0608 ? bitops-bitwise-and 3.3120+-0.0931 ? 3.3925+-0.0882 ? might be 1.0243x slower bitops-nsieve-bits 5.4025+-0.1042 5.3258+-0.1046 might be 1.0144x faster controlflow-recursive 2.1068+-0.0363 ? 2.1186+-0.0378 ? crypto-aes 7.3538+-0.2270 ? 7.4445+-0.1805 ? might be 1.0123x slower crypto-md5 2.6951+-0.0557 ? 2.7680+-0.0968 ? might be 1.0270x slower crypto-sha1 2.4814+-0.0528 ? 2.5605+-0.0755 ? might be 1.0319x slower date-format-tofte 9.8457+-0.2383 ? 9.9146+-0.2241 ? date-format-xparb 9.3008+-0.3754 9.0405+-0.1338 might be 1.0288x faster math-cordic 6.3706+-0.0808 ! 6.7802+-0.0821 ! definitely 1.0643x slower math-partial-sums 7.6198+-0.1122 7.4890+-0.1256 might be 1.0175x faster math-spectral-norm 2.6323+-0.0791 ? 2.6375+-0.0568 ? regexp-dna 11.5339+-0.1717 11.5240+-0.1524 string-base64 4.3541+-0.0758 4.3309+-0.0496 string-fasta 6.3641+-0.1345 ? 6.4228+-0.1290 ? string-tagcloud 11.4752+-0.3121 ? 11.6327+-0.1632 ? might be 1.0137x slower string-unpack-code 20.3497+-0.1953 ? 20.5273+-0.2408 ? string-validate-input 5.3539+-0.1222 5.3345+-0.1329 <arithmetic> * 6.1294+-0.0245 ? 6.1582+-0.0245 ? <geometric> 4.9538+-0.0255 ? 4.9834+-0.0308 ? <harmonic> 3.9629+-0.0352 ? 3.9923+-0.0422 ? TipOfTree ParGC_off V8: crypto 74.8595+-1.4741 74.5039+-0.3569 deltablue 180.4290+-1.1220 ^ 176.9639+-1.4984 ^ definitely 1.0196x faster earley-boyer 90.8161+-0.5091 ? 90.9153+-0.4368 ? raytrace 64.3589+-0.9302 ^ 62.9611+-0.4291 ^ definitely 1.0222x faster regexp 104.8355+-0.5815 ? 105.7245+-0.6124 ? richards 127.7546+-0.7860 ? 128.0517+-0.7380 ? splay 91.3341+-0.7922 ? 92.4157+-0.5433 ? might be 1.0118x slower <arithmetic> 104.9125+-0.2718 104.5052+-0.3167 <geometric> * 99.4909+-0.3100 99.1800+-0.2328 <harmonic> 94.8698+-0.3791 94.5322+-0.2052 TipOfTree ParGC_off Kraken: ai-astar 507.1634+-4.5211 501.2228+-3.2661 might be 1.0119x faster audio-beat-detection 190.7445+-0.8232 ! 194.8461+-2.6200 ! definitely 1.0215x slower audio-dft 269.4079+-2.6398 ? 271.1714+-2.7456 ? audio-fft 126.3637+-0.9614 126.2289+-1.0185 audio-oscillator 251.2230+-1.2274 ? 252.4837+-1.5019 ? imaging-darkroom 400.4058+-1.5128 ? 402.1017+-2.4870 ? imaging-desaturate 226.6919+-0.7861 ? 227.7904+-2.0344 ? imaging-gaussian-blur 561.7409+-3.6798 ? 565.0966+-4.2142 ? json-parse-financial 58.2345+-0.4456 57.9125+-0.3687 json-stringify-tinderbox 68.2407+-0.2725 ! 70.1338+-1.4605 ! definitely 1.0277x slower stanford-crypto-aes 133.7770+-1.6480 133.2023+-1.4945 stanford-crypto-ccm 102.7287+-2.1131 100.9507+-0.6909 might be 1.0176x faster stanford-crypto-pbkdf2 191.7112+-0.8882 ? 194.8814+-3.1410 ? might be 1.0165x slower stanford-crypto-sha256-iterative 71.2590+-0.5433 70.4723+-0.2436 might be 1.0112x faster <arithmetic> * 225.6923+-0.6509 ? 226.3210+-0.7703 ? <geometric> 177.7540+-0.4853 ? 178.2620+-0.6906 ? <harmonic> 140.2630+-0.3780 ? 140.5231+-0.6601 ? TipOfTree ParGC_off All benchmarks: <arithmetic> 86.2435+-0.2081 ? 86.3860+-0.2201 ? <geometric> 22.4974+-0.0707 ? 22.5802+-0.0928 ? <harmonic> 6.9790+-0.0604 ? 7.0293+-0.0725 ? TipOfTree ParGC_off Geomean of preferred means: <scaled-result> 51.6299+-0.0804 ? 51.7048+-0.1116 ?
Filip Pizlo
Comment 8 2011-10-29 22:27:05 PDT
Created attachment 112985 [details] the patch Implemented a bunch of stuff: - Better memory management for mark stack segments. - Better load balancing. It can now scale to 8 processors. - Autodetection of number of processors. It still caps itself at 4 threads, though, since that seems to be where we get peak efficiency. Benchmark report for SunSpider, V8, and Kraken. VMs tested: "TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc "ParallelGC" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. TipOfTree ParallelGC SunSpider: 3d-cube 7.2847+-0.1599 7.2000+-0.1390 might be 1.0118x faster 3d-morph 7.6836+-0.1079 7.6069+-0.1119 might be 1.0101x faster 3d-raytrace 7.5123+-0.1952 7.4649+-0.1720 access-binary-trees 1.6509+-0.0469 1.6238+-0.0729 might be 1.0167x faster access-fannkuch 6.5595+-0.0676 6.4896+-0.1053 might be 1.0108x faster access-nbody 3.7379+-0.0998 3.7356+-0.0627 access-nsieve 2.5926+-0.0849 ? 2.6093+-0.0612 ? bitops-3bit-bits-in-byte 1.2904+-0.0295 ? 1.3312+-0.0385 ? might be 1.0316x slower bitops-bits-in-byte 2.4042+-0.0579 2.3689+-0.0429 might be 1.0149x faster bitops-bitwise-and 3.3795+-0.1010 ? 3.3876+-0.1226 ? bitops-nsieve-bits 5.4380+-0.0857 ? 5.5171+-0.1328 ? might be 1.0145x slower controlflow-recursive 2.1209+-0.0514 ? 2.1224+-0.0579 ? crypto-aes 7.2702+-0.1745 ? 7.3951+-0.1885 ? might be 1.0172x slower crypto-md5 2.7203+-0.0754 ? 2.7662+-0.0691 ? might be 1.0169x slower crypto-sha1 2.4430+-0.0617 2.4351+-0.0454 date-format-tofte 10.0079+-0.1732 9.9916+-0.2204 date-format-xparb 8.8878+-0.1903 ! 9.3501+-0.2024 ! definitely 1.0520x slower math-cordic 6.4491+-0.1291 6.2971+-0.0644 might be 1.0241x faster math-partial-sums 7.5129+-0.1028 ? 7.5139+-0.1127 ? math-spectral-norm 2.5601+-0.0628 ? 2.6399+-0.0440 ? might be 1.0312x slower regexp-dna 11.5483+-0.1454 11.4922+-0.1442 string-base64 4.2972+-0.0686 ? 4.4867+-0.1473 ? might be 1.0441x slower string-fasta 6.2632+-0.1289 ? 6.3459+-0.1575 ? might be 1.0132x slower string-tagcloud 11.4555+-0.3263 ? 11.4960+-0.2517 ? string-unpack-code 20.6781+-0.2345 ? 20.7067+-0.3981 ? string-validate-input 5.2387+-0.0995 5.1660+-0.0626 might be 1.0141x faster <arithmetic> * 6.1149+-0.0322 ? 6.1361+-0.0250 ? <geometric> 4.9388+-0.0250 ? 4.9601+-0.0212 ? <harmonic> 3.9553+-0.0259 ? 3.9773+-0.0249 ? TipOfTree ParallelGC V8: crypto 74.1679+-0.3164 74.0380+-0.3816 deltablue 179.7767+-1.9744 179.3566+-1.2107 earley-boyer 91.6448+-0.5219 90.9476+-0.4252 raytrace 62.5295+-0.4596 ! 64.3819+-0.9545 ! definitely 1.0296x slower regexp 105.5681+-0.4315 105.4897+-0.4094 richards 126.3819+-0.6091 ? 127.1708+-0.3652 ? splay 93.6141+-0.7341 ^ 71.9968+-0.5395 ^ definitely 1.3003x faster <arithmetic> 104.8119+-0.2897 ^ 101.9116+-0.3152 ^ definitely 1.0285x faster <geometric> * 99.3299+-0.2377 ^ 95.9869+-0.3252 ^ definitely 1.0348x faster <harmonic> 94.5656+-0.2492 ^ 91.0851+-0.3703 ^ definitely 1.0382x faster TipOfTree ParallelGC Kraken: ai-astar 496.4468+-1.2667 ? 499.0402+-3.6723 ? audio-beat-detection 192.5907+-1.4966 191.6971+-1.2168 audio-dft 266.8459+-3.4843 ? 267.9069+-3.7898 ? audio-fft 125.3709+-0.6697 ? 125.7723+-1.2540 ? audio-oscillator 252.4701+-1.2692 251.8329+-1.2081 imaging-darkroom 404.3208+-1.1456 ? 405.2122+-1.1299 ? imaging-desaturate 226.4488+-0.4880 225.4735+-1.0376 imaging-gaussian-blur 558.3593+-1.6383 558.0886+-2.0891 json-parse-financial 57.1418+-0.4478 ? 58.0770+-0.6443 ? might be 1.0164x slower json-stringify-tinderbox 68.5476+-0.3060 68.2372+-0.3896 stanford-crypto-aes 134.3287+-1.8809 131.6633+-1.4008 might be 1.0202x faster stanford-crypto-ccm 100.6053+-0.8729 100.2639+-0.5752 stanford-crypto-pbkdf2 194.3023+-1.7189 192.6526+-1.6629 stanford-crypto-sha256-iterative 71.1339+-0.3512 ? 71.1519+-0.6564 ? <arithmetic> * 224.9224+-0.3267 224.7907+-0.6364 <geometric> 177.2358+-0.3153 177.0139+-0.4504 <harmonic> 139.6362+-0.3532 139.6204+-0.3786 TipOfTree ParallelGC All benchmarks: <arithmetic> 85.9911+-0.1177 ^ 85.5317+-0.2086 ^ definitely 1.0054x faster <geometric> 22.4348+-0.0689 22.3655+-0.0666 <harmonic> 6.9654+-0.0445 ? 7.0000+-0.0428 ? TipOfTree ParallelGC Geomean of preferred means: <scaled-result> 51.5027+-0.1124 ^ 50.9673+-0.1145 ^ definitely 1.0105x faster
WebKit Review Bot
Comment 9 2011-10-29 22:29:40 PDT
Attachment 112985 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/heap/SlotVisitor.h:38: The parameter name "shared" adds no information, so it should be removed. [readability/parameter_name] [5] Source/JavaScriptCore/heap/MarkStack.cpp:367: This { should be at the end of the previous line [whitespace/braces] [4] Source/JavaScriptCore/heap/MarkStack.cpp:372: This { should be at the end of the previous line [whitespace/braces] [4] Source/JavaScriptCore/heap/MarkStack.h:81: The parameter name "segment" adds no information, so it should be removed. [readability/parameter_name] [5] Source/JavaScriptCore/heap/MarkStack.h:143: Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons. [readability/comparison_to_zero] [5] Total errors found: 5 in 12 files If any of these errors are false positives, please file a bug against check-webkit-style.
Filip Pizlo
Comment 10 2011-10-29 22:33:41 PDT
Created attachment 112986 [details] the patch Fixed style. Should also note that this patch, and the previous one, also fixes the weak reference harvester and opaque roots situation. Opaque roots are kept locally in each thread but merged whenever it's profitable to do so, as well as at the end of any quantum of tracing. Weak reference harvesters are merged eagerly, since they are currently unused, and are only intended for CodeBlocks. There shouldn't be enough CodeBlocks for it to matter if we sometimes have to grab a global lock to insert a weak reference harvester to the global marking data.
Filip Pizlo
Comment 11 2011-10-29 22:35:36 PDT
(In reply to comment #4) > I am not sure how relevant it is, but you may want to look at wtf/ParallelJobs.h. On the Mac it will use libdispatch, which can be fun. One big reason why not to use libdispatch or other such mechanisms: if anyone else uses them and we stop a WebKit thread due to GC whilst it is sitting inside of some libdispatch lock (or a lock of some dependency of libdispatch, like, say, malloc), then we deadlock the whole process. I think that even starting pthreads during a GC could be a dangerous game to play. Hence this code makes sure that it has a pool of threads started before any GC can happen.
WebKit Review Bot
Comment 12 2011-10-29 22:38:16 PDT
Attachment 112986 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/heap/MarkStack.cpp:367: This { should be at the end of the previous line [whitespace/braces] [4] Source/JavaScriptCore/heap/MarkStack.cpp:372: This { should be at the end of the previous line [whitespace/braces] [4] Total errors found: 2 in 12 files If any of these errors are false positives, please file a bug against check-webkit-style.
Filip Pizlo
Comment 13 2011-10-30 00:19:12 PDT
Created attachment 112987 [details] the patch Attempting to fix weird interactions between parallel GC and webcore.
WebKit Review Bot
Comment 14 2011-10-30 00:20:43 PDT
Attachment 112987 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/heap/MarkStack.cpp:369: This { should be at the end of the previous line [whitespace/braces] [4] Source/JavaScriptCore/heap/MarkStack.cpp:374: This { should be at the end of the previous line [whitespace/braces] [4] Total errors found: 2 in 18 files If any of these errors are false positives, please file a bug against check-webkit-style.
Filip Pizlo
Comment 15 2011-10-30 00:27:02 PDT
Created attachment 112988 [details] the patch The fixing continues...
WebKit Review Bot
Comment 16 2011-10-30 00:29:50 PDT
Attachment 112988 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/heap/MarkStack.cpp:369: This { should be at the end of the previous line [whitespace/braces] [4] Source/JavaScriptCore/heap/MarkStack.cpp:374: This { should be at the end of the previous line [whitespace/braces] [4] Total errors found: 2 in 18 files If any of these errors are false positives, please file a bug against check-webkit-style.
Gustavo Noronha (kov)
Comment 17 2011-10-30 00:38:08 PDT
Filip Pizlo
Comment 18 2011-10-30 00:41:05 PDT
Looks like this passes tests and alows me to browse the web in debug mode.
Early Warning System Bot
Comment 19 2011-10-30 00:43:29 PDT
Gyuyoung Kim
Comment 20 2011-10-30 00:47:52 PDT
Filip Pizlo
Comment 21 2011-10-30 01:11:10 PDT
Created attachment 112990 [details] the patch Attempting to fix build failures on non-Mac platforms.
WebKit Review Bot
Comment 22 2011-10-30 01:13:02 PDT
Attachment 112990 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/heap/MarkStack.cpp:371: This { should be at the end of the previous line [whitespace/braces] [4] Source/JavaScriptCore/heap/MarkStack.cpp:376: This { should be at the end of the previous line [whitespace/braces] [4] Total errors found: 2 in 18 files If any of these errors are false positives, please file a bug against check-webkit-style.
Filip Pizlo
Comment 23 2011-10-30 01:18:11 PDT
Performance retested after merging/fixing/etc. Benchmark report for SunSpider, V8, and Kraken. VMs tested: "TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc "ParallelGC" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. TipOfTree ParallelGC SunSpider: 3d-cube 7.6899+-0.2978 7.5047+-0.1976 might be 1.0247x faster 3d-morph 7.9344+-0.2444 7.6294+-0.1314 might be 1.0400x faster 3d-raytrace 7.7543+-0.3009 7.4545+-0.1448 might be 1.0402x faster access-binary-trees 1.6432+-0.0460 1.5978+-0.0653 might be 1.0284x faster access-fannkuch 6.5333+-0.0883 ? 6.5941+-0.1215 ? access-nbody 3.8733+-0.1735 3.6594+-0.0649 might be 1.0585x faster access-nsieve 2.6750+-0.0760 2.5717+-0.0545 might be 1.0402x faster bitops-3bit-bits-in-byte 1.2764+-0.0264 ! 1.3604+-0.0318 ! definitely 1.0657x slower bitops-bits-in-byte 2.3180+-0.0668 ? 2.3830+-0.0528 ? might be 1.0280x slower bitops-bitwise-and 3.4243+-0.0911 3.3971+-0.1047 bitops-nsieve-bits 5.3360+-0.0961 ? 5.4479+-0.1615 ? might be 1.0210x slower controlflow-recursive 2.1010+-0.0347 ? 2.1506+-0.0343 ? might be 1.0236x slower crypto-aes 7.4222+-0.2491 ? 7.6064+-0.1827 ? might be 1.0248x slower crypto-md5 2.7823+-0.0877 2.7117+-0.0968 might be 1.0260x faster crypto-sha1 2.4500+-0.0612 ? 2.4609+-0.0447 ? date-format-tofte 10.1750+-0.3408 ? 10.3077+-0.3219 ? might be 1.0130x slower date-format-xparb 9.0613+-0.2235 ! 9.8251+-0.3201 ! definitely 1.0843x slower math-cordic 6.4947+-0.1313 6.4855+-0.1292 math-partial-sums 7.5811+-0.2072 7.5517+-0.2192 math-spectral-norm 2.6240+-0.0661 ? 2.6337+-0.0613 ? regexp-dna 11.6862+-0.2992 11.6826+-0.3405 string-base64 4.1872+-0.0378 ? 4.4519+-0.2612 ? might be 1.0632x slower string-fasta 6.6351+-0.2360 6.5851+-0.2751 string-tagcloud 11.7555+-0.4311 ? 12.0805+-0.3561 ? might be 1.0276x slower string-unpack-code 20.5635+-0.4911 ? 21.3864+-0.8229 ? might be 1.0400x slower string-validate-input 5.3534+-0.0821 ^ 5.1113+-0.0723 ^ definitely 1.0474x faster <arithmetic> * 6.2050+-0.0507 ? 6.2550+-0.0435 ? <geometric> 4.9988+-0.0332 ? 5.0137+-0.0282 ? <harmonic> 3.9802+-0.0270 ? 3.9967+-0.0305 ? TipOfTree ParallelGC V8: crypto 73.1870+-0.3514 ? 73.4561+-0.4061 ? deltablue 178.5365+-0.8240 ^ 174.2394+-0.5655 ^ definitely 1.0247x faster earley-boyer 91.9127+-0.3644 91.5211+-0.4322 raytrace 62.8774+-0.3963 ? 63.4041+-0.6165 ? regexp 106.1395+-0.6146 ? 107.0600+-0.6550 ? richards 126.5740+-0.4962 126.5667+-0.2941 splay 93.7210+-0.4231 ^ 72.9378+-1.0055 ^ definitely 1.2849x faster <arithmetic> 104.7069+-0.2350 ^ 101.3122+-0.2225 ^ definitely 1.0335x faster <geometric> * 99.2797+-0.2067 ^ 95.6737+-0.2585 ^ definitely 1.0377x faster <harmonic> 94.5382+-0.1999 ^ 90.8940+-0.2924 ^ definitely 1.0401x faster TipOfTree ParallelGC Kraken: ai-astar 493.6872+-1.0811 ! 498.7911+-2.1791 ! definitely 1.0103x slower audio-beat-detection 193.5298+-1.4174 ^ 190.9533+-0.8591 ^ definitely 1.0135x faster audio-dft 267.3048+-1.8658 ? 269.0041+-2.7896 ? audio-fft 124.6659+-0.8884 ? 124.9870+-1.0028 ? audio-oscillator 251.0656+-1.6563 ? 251.5320+-1.1005 ? imaging-darkroom 403.1887+-2.5335 400.9468+-1.4467 imaging-desaturate 225.4055+-0.4488 ? 225.9516+-1.6251 ? imaging-gaussian-blur 554.4614+-1.3730 ? 555.1623+-2.2363 ? json-parse-financial 58.4412+-0.8444 ^ 57.0616+-0.3266 ^ definitely 1.0242x faster json-stringify-tinderbox 69.5005+-0.6899 ? 70.3635+-1.0300 ? might be 1.0124x slower stanford-crypto-aes 133.2301+-1.1865 132.4162+-1.3045 stanford-crypto-ccm 100.1346+-0.6399 ? 100.1434+-0.6771 ? stanford-crypto-pbkdf2 196.4266+-3.2542 ^ 191.2882+-1.0304 ^ definitely 1.0269x faster stanford-crypto-sha256-iterative 71.3867+-0.5641 71.2631+-0.2770 <arithmetic> * 224.4592+-0.3648 224.2760+-0.4353 <geometric> 177.3997+-0.4300 176.8961+-0.3648 <harmonic> 140.3279+-0.6257 139.7143+-0.4019 TipOfTree ParallelGC All benchmarks: <arithmetic> 85.8874+-0.1416 ^ 85.3549+-0.1542 ^ definitely 1.0062x faster <geometric> 22.5895+-0.0949 22.4835+-0.0785 <harmonic> 7.0084+-0.0466 ? 7.0333+-0.0525 ? TipOfTree ParallelGC Geomean of preferred means: <scaled-result> 51.7098+-0.1562 ^ 51.1993+-0.1660 ^ definitely 1.0100x faster
Sam Weinig
Comment 24 2011-10-30 13:18:28 PDT
(In reply to comment #11) > (In reply to comment #4) > > I am not sure how relevant it is, but you may want to look at wtf/ParallelJobs.h. On the Mac it will use libdispatch, which can be fun. > > One big reason why not to use libdispatch or other such mechanisms: if anyone else uses them and we stop a WebKit thread due to GC whilst it is sitting inside of some libdispatch lock (or a lock of some dependency of libdispatch, like, say, malloc), then we deadlock the whole process. > > I think that even starting pthreads during a GC could be a dangerous game to play. Hence this code makes sure that it has a pool of threads started before any GC can happen. That makes sense.
Sam Weinig
Comment 25 2011-10-30 13:19:58 PDT
"It also appears that this reduces GC pause times on real websites by more than half." Well that seems quite nice. How are you measuring this? Is this a metric that is cheap enough to compute that we could consider putting it in our increasingly poorly named "Caches" debug window?
Filip Pizlo
Comment 26 2011-10-30 13:22:17 PDT
(In reply to comment #25) > "It also appears that this reduces GC pause times on real websites by more than half." > > Well that seems quite nice. How are you measuring this? Is this a metric that is cheap enough to compute that we could consider putting it in our increasingly poorly named "Caches" debug window? It's really cheap. Just calls to currentTimeMS() before and after collect(). We should probably be tracking GC pause times, all the time, and allow users to see statistics on recent pause times whenver they want.
Filip Pizlo
Comment 27 2011-10-31 02:06:28 PDT
Performance after merging up to r98832. Benchmark report for SunSpider, V8, and Kraken. VMs tested: "TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc "ParallelGC" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. TipOfTree ParallelGC SunSpider: 3d-cube 7.3539+-0.2026 7.3229+-0.2011 3d-morph 7.5077+-0.1106 ? 7.5947+-0.1384 ? might be 1.0116x slower 3d-raytrace 7.5171+-0.2375 7.4688+-0.1699 access-binary-trees 1.6129+-0.0674 ? 1.6750+-0.0729 ? might be 1.0385x slower access-fannkuch 6.5770+-0.1457 6.4955+-0.0908 might be 1.0125x faster access-nbody 3.8775+-0.1856 3.7505+-0.0898 might be 1.0339x faster access-nsieve 2.6215+-0.0916 2.5791+-0.0562 might be 1.0164x faster bitops-3bit-bits-in-byte 1.3197+-0.0558 ? 1.3384+-0.0396 ? might be 1.0142x slower bitops-bits-in-byte 2.4372+-0.0661 2.3905+-0.0493 might be 1.0195x faster bitops-bitwise-and 3.3222+-0.1106 ? 3.3613+-0.0778 ? might be 1.0118x slower bitops-nsieve-bits 5.5133+-0.1198 ? 5.5630+-0.1057 ? controlflow-recursive 2.1531+-0.0352 ? 2.1555+-0.0519 ? crypto-aes 7.5198+-0.2329 ? 7.8458+-0.3238 ? might be 1.0433x slower crypto-md5 2.7414+-0.0706 ? 2.7590+-0.0731 ? crypto-sha1 2.4408+-0.0702 ? 2.4827+-0.0397 ? might be 1.0172x slower date-format-tofte 10.3145+-0.2630 10.3016+-0.3416 date-format-xparb 8.9655+-0.2076 ? 9.4674+-0.3290 ? might be 1.0560x slower math-cordic 6.5463+-0.1256 6.5147+-0.1471 math-partial-sums 7.4301+-0.1194 7.3979+-0.1266 math-spectral-norm 2.6345+-0.0829 2.6160+-0.0461 regexp-dna 11.7134+-0.2211 ? 11.7141+-0.2021 ? string-base64 4.5294+-0.2703 4.5244+-0.2536 string-fasta 6.2156+-0.0909 ? 6.4610+-0.1575 ? might be 1.0395x slower string-tagcloud 11.7648+-0.4143 11.7532+-0.3231 string-unpack-code 20.7230+-0.6054 ? 20.7778+-0.5743 ? string-validate-input 5.5511+-0.3397 5.3827+-0.3042 might be 1.0313x faster <arithmetic> * 6.1886+-0.0355 ? 6.2190+-0.0290 ? <geometric> 4.9970+-0.0312 ? 5.0181+-0.0257 ? <harmonic> 3.9955+-0.0353 ? 4.0172+-0.0281 ? TipOfTree ParallelGC V8: crypto 73.3812+-0.5024 73.0859+-0.5110 deltablue 166.2654+-0.5110 ^ 163.4370+-0.9970 ^ definitely 1.0173x faster earley-boyer 90.1067+-0.3181 ? 90.4178+-0.4563 ? raytrace 62.3816+-0.5405 ? 62.7093+-0.5034 ? regexp 104.9189+-0.5704 ! 106.1353+-0.6386 ! definitely 1.0116x slower richards 124.1908+-0.4864 ? 125.3265+-0.6551 ? splay 92.5096+-0.6078 ^ 72.3544+-0.7515 ^ definitely 1.2786x faster <arithmetic> 101.9649+-0.1643 ^ 99.0666+-0.2054 ^ definitely 1.0293x faster <geometric> * 97.3140+-0.1850 ^ 94.0647+-0.2235 ^ definitely 1.0345x faster <harmonic> 93.1237+-0.2246 ^ 89.7004+-0.2567 ^ definitely 1.0382x faster TipOfTree ParallelGC Kraken: ai-astar 494.2374+-2.0178 ! 502.3527+-3.7023 ! definitely 1.0164x slower audio-beat-detection 190.6462+-1.7206 ? 192.2360+-1.1167 ? audio-dft 264.3711+-3.1062 ? 266.2583+-1.7913 ? audio-fft 123.2559+-0.7364 ? 125.3206+-1.3437 ? might be 1.0168x slower audio-oscillator 251.6934+-1.6194 250.6419+-1.6296 imaging-darkroom 400.7391+-1.7695 399.8753+-2.3719 imaging-desaturate 224.3837+-1.3417 224.2029+-1.1894 imaging-gaussian-blur 551.7855+-2.0391 ? 552.5716+-1.6117 ? json-parse-financial 57.6361+-0.3248 ? 57.6641+-0.3721 ? json-stringify-tinderbox 68.7668+-0.7192 67.9271+-0.3880 might be 1.0124x faster stanford-crypto-aes 133.5143+-2.3777 131.7421+-2.0223 might be 1.0135x faster stanford-crypto-ccm 99.6615+-0.6328 ? 99.7472+-0.8025 ? stanford-crypto-pbkdf2 194.0549+-0.6564 192.8904+-1.6639 stanford-crypto-sha256-iterative 70.6565+-0.3140 70.5744+-0.2863 <arithmetic> * 223.2430+-0.5268 ? 223.8575+-0.4149 ? <geometric> 176.1541+-0.5141 ? 176.2938+-0.2608 ? <harmonic> 139.1137+-0.4508 138.9795+-0.2621 TipOfTree ParallelGC All benchmarks: <arithmetic> 85.1077+-0.1693 84.8758+-0.1338 <geometric> 22.4708+-0.0868 22.4149+-0.0649 <harmonic> 7.0325+-0.0606 ? 7.0666+-0.0481 ? TipOfTree ParallelGC Geomean of preferred means: <scaled-result> 51.2284+-0.1171 ^ 50.7811+-0.0811 ^ definitely 1.0088x faster
Geoffrey Garen
Comment 28 2011-10-31 17:53:27 PDT
Comment on attachment 112990 [details] the patch View in context: https://bugs.webkit.org/attachment.cgi?id=112990&action=review r=me, with some comments below. Windows build issue: 3>JavaScriptCore.exp : error LNK2001: unresolved external symbol "public: void __thiscall JSC::MarkStackArray::expand(void)" (?expand@MarkStackArray@JSC@@QAEXXZ) > Source/JavaScriptCore/heap/Heap.cpp:618 > + visitor.donateAndDrain(); Not for this patch, but it's really becoming apparent to me that draining should be automatic in the visitor: Once you append N items, the visitor should drain. The visitor has the best information about how full it is at any given time. Plus, this change would seriously declutter-ify calling code. > Source/JavaScriptCore/heap/MarkStack.cpp:348 > + for (unsigned countdown = Heuristics::minimumNumberOfScansBetweenRebalance; m_stack.canRemoveLast() && countdown-- > 0;) You can elide "> 0" here. > Source/JavaScriptCore/heap/MarkStack.cpp:370 > + if (Heuristics::numberOfGCMarkers > 1) { You can change this to an early return, to reduce the indentation level of the rest of the code. WebKit coding style prefers early return where possible for that reason. > Source/JavaScriptCore/heap/MarkStack.cpp:399 > + // If we reached termination, then return. > + if (!m_shared.m_numberOfActiveParallelMarkers && m_shared.m_sharedMarkStack.isEmpty()) > + return; You can just change the condition above to "return" instead of "break", and remove this code. > Source/JavaScriptCore/heap/MarkStack.h:166 > + class MarkStackSharedData { I think you should call out the threadiness of this code: "MarkStackThreadSharedData". > Source/JavaScriptCore/heap/MarkStack.h:260 > + if ((unsigned)m_opaqueRoots.size() < Heuristics::opaqueRootMergeThreshold) C++-style static_cast, please. Slightly better to upcast opaqueRootMergeThreshold, rather than down casting m_opaqueRoots.size(), to avoid loss of information. > Source/JavaScriptCore/heap/MarkStack.h:311 > + inline void MarkStack::addOpaqueRoot(void* root) > { > - return m_opaqueRoots.add(root).second; > +#if ENABLE(PARALLEL_GC) > + mergeOpaqueRootsIfProfitable(); > +#endif > + m_opaqueRoots.add(root); > } I'm surprised that's profitable to merge opaque roots eagerly. Nothing can be done with the set of opaque roots until the end of parallel work, anyway. Could be simpler just to omit eager merging. Do you know why this is profitable? > Source/JavaScriptCore/runtime/Heuristics.cpp:32 > +#if PLATFORM(MAC) && ENABLE(PARALLEL_GC) OS(UNIX) or OS(DARWIN) is better than PLATFORM(MAC) here. > Source/JavaScriptCore/runtime/Heuristics.cpp:177 > +#if PLATFORM(MAC) && ENABLE(PARALLEL_GC) > + int name[2]; > + size_t valueSize = sizeof(cpusToUse); > + name[0] = CTL_HW; > + name[1] = HW_AVAILCPU; > + sysctl(name, 2, &cpusToUse, &valueSize, 0, 0); OS(UNIX) or OS(DARWIN) is better than PLATFORM(MAC) here. > Source/JavaScriptCore/wtf/Bitmap.h:40 > +#if PLATFORM(MAC) CPU(X86) is better than PLATFORM(MAC) here. > Source/JavaScriptCore/wtf/Platform.h:1078 > +#if !ENABLE(PARALLEL_GC) && PLATFORM(MAC) > +#define ENABLE_PARALLEL_GC 1 This should be "#!defined(ENABLE_PARALLEL_GC)" instead of "!ENABLE(PARALLEL_GC)" (so you can manually define to disable parallel GC).
Filip Pizlo
Comment 29 2011-10-31 18:22:14 PDT
> > Source/JavaScriptCore/heap/MarkStack.h:311 > > + inline void MarkStack::addOpaqueRoot(void* root) > > { > > - return m_opaqueRoots.add(root).second; > > +#if ENABLE(PARALLEL_GC) > > + mergeOpaqueRootsIfProfitable(); > > +#endif > > + m_opaqueRoots.add(root); > > } > > I'm surprised that's profitable to merge opaque roots eagerly. Nothing can be done with the set of opaque roots until the end of parallel work, anyway. Could be simpler just to omit eager merging. > > Do you know why this is profitable? It's not about profitability in the execution time sense. It's profitability in the space usage sense. If we did happen to have a lot of opaque roots, then we don't want the worst case of each worker thread holding onto its copy of the same hash set. So merging when it gets big ensures that we minimize space usage. I also have a call to mergeIfNecessary() in drain(), for a plainly dumb reason: the threads currently do not know about GC phase because whenever they go into idle state they will have merged all of their information into the shared data. That keeps things simple. > > > Source/JavaScriptCore/runtime/Heuristics.cpp:32 > > +#if PLATFORM(MAC) && ENABLE(PARALLEL_GC) > > OS(UNIX) or OS(DARWIN) is better than PLATFORM(MAC) here. OS(DARWIN) probably, since this particular incantation of sysctl is missing on some unixes. > > > Source/JavaScriptCore/runtime/Heuristics.cpp:177 > > +#if PLATFORM(MAC) && ENABLE(PARALLEL_GC) > > + int name[2]; > > + size_t valueSize = sizeof(cpusToUse); > > + name[0] = CTL_HW; > > + name[1] = HW_AVAILCPU; > > + sysctl(name, 2, &cpusToUse, &valueSize, 0, 0); > > OS(UNIX) or OS(DARWIN) is better than PLATFORM(MAC) here. > > > Source/JavaScriptCore/wtf/Bitmap.h:40 > > +#if PLATFORM(MAC) > > CPU(X86) is better than PLATFORM(MAC) here. (CPU(X86) || CPU(X86_64)) && OS(UNIX) at least, right? Or some manner of check for whether the compiler supports GCC-style inline asm? > > > Source/JavaScriptCore/wtf/Platform.h:1078 > > +#if !ENABLE(PARALLEL_GC) && PLATFORM(MAC) > > +#define ENABLE_PARALLEL_GC 1 > > This should be "#!defined(ENABLE_PARALLEL_GC)" instead of "!ENABLE(PARALLEL_GC)" (so you can manually define to disable parallel GC). Oops! And I guess this should also be guarded with some variant of CPU(X86) to get it to work on PPC.
Filip Pizlo
Comment 30 2011-10-31 19:04:53 PDT
Created attachment 113121 [details] the patch - checking EWS with changes to address review Addressed Geoff's concerns, using a more precise set of preprocessor guards for CAS, # of processors detection, and parallel GC.
WebKit Review Bot
Comment 31 2011-10-31 19:06:06 PDT
Attachment 113121 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/heap/MarkStack.cpp:394: This { should be at the end of the previous line [whitespace/braces] [4] Total errors found: 1 in 19 files If any of these errors are false positives, please file a bug against check-webkit-style.
Geoffrey Garen
Comment 32 2011-10-31 19:18:09 PDT
> It's not about profitability in the execution time sense. It's profitability in the space usage sense. Makes sense. > > CPU(X86) is better than PLATFORM(MAC) here. > > (CPU(X86) || CPU(X86_64)) && OS(UNIX) at least, right? Or some manner of check for whether the compiler supports GCC-style inline asm? Oh, right. This should do the trick: (CPU(X86) || CPU(X86_64)) && (COMPILER(GCC) || COMPILER(CLANG)). > > This should be "#!defined(ENABLE_PARALLEL_GC)" instead of "!ENABLE(PARALLEL_GC)" (so you can manually define to disable parallel GC). > > Oops! And I guess this should also be guarded with some variant of CPU(X86) to get it to work on PPC. Good point.
Filip Pizlo
Comment 33 2011-10-31 19:24:51 PDT
> > > CPU(X86) is better than PLATFORM(MAC) here. > > > > (CPU(X86) || CPU(X86_64)) && OS(UNIX) at least, right? Or some manner of check for whether the compiler supports GCC-style inline asm? > > Oh, right. This should do the trick: (CPU(X86) || CPU(X86_64)) && (COMPILER(GCC) || COMPILER(CLANG)). I'll throw COMPILER(CLANG) in there. Though, oddly, we don't do this in other cases of inline asm - I guess COMPILER(GCC) returns true of clang?
Geoffrey Garen
Comment 34 2011-10-31 19:27:14 PDT
Comment on attachment 113121 [details] the patch - checking EWS with changes to address review View in context: https://bugs.webkit.org/attachment.cgi?id=113121&action=review r=me, with build fixed. > Source/JavaScriptCore/wtf/Atomics.h:123 > +#if ENABLE(COMPARE_AND_SWAP) Maybe better just to put the #if around the whole function definition. That way, if you try to use this on the wrong platform, you fail at compile time. FWIW, TCSpinLock.h has some multi-platform CAS assembly code that could be merged into this. Doesn't need to hold up this patch, though.
Filip Pizlo
Comment 35 2011-10-31 19:41:49 PDT
(In reply to comment #34) > (From update of attachment 113121 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=113121&action=review > > r=me, with build fixed. > > > Source/JavaScriptCore/wtf/Atomics.h:123 > > +#if ENABLE(COMPARE_AND_SWAP) > > Maybe better just to put the #if around the whole function definition. That way, if you try to use this on the wrong platform, you fail at compile time. Can't quite do it. Bitmap calls weakCompareAndSwap conditionally, and even though with CAS/PARALLEL_GC disabled Bitmap doesn't get instantiated in a way that would result in calls to weakCompareAndSwap, the compiler will still complain. I suppose you could have Bitmap accept some manner of template CAS functor ... but that seems like brutal overkill. Or I could #ifdef out the calls to weakCompareAndSwap in Bitmap, but that seems like more bloat as well. Do you have anything against having weakCompareAndSwap always declared, as it is now, with the CRASH() thingy?
Geoffrey Garen
Comment 36 2011-10-31 20:06:50 PDT
> Do you have anything against having weakCompareAndSwap always declared, as it is now, with the CRASH() thingy? No.
Filip Pizlo
Comment 37 2011-10-31 22:31:50 PDT
Created attachment 113127 [details] the patch Checking to see if this fixes Windows.
WebKit Review Bot
Comment 38 2011-10-31 22:34:38 PDT
Attachment 113127 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/heap/MarkStack.cpp:394: This { should be at the end of the previous line [whitespace/braces] [4] Total errors found: 1 in 19 files If any of these errors are false positives, please file a bug against check-webkit-style.
Filip Pizlo
Comment 39 2011-10-31 23:16:31 PDT
Created attachment 113131 [details] the patch Another Windows fix attempt.
WebKit Review Bot
Comment 40 2011-10-31 23:19:27 PDT
Attachment 113131 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1 Source/JavaScriptCore/heap/MarkStack.cpp:394: This { should be at the end of the previous line [whitespace/braces] [4] Total errors found: 1 in 20 files If any of these errors are false positives, please file a bug against check-webkit-style.
Filip Pizlo
Comment 41 2011-10-31 23:43:58 PDT
Geoffrey Garen
Comment 42 2011-11-01 11:55:57 PDT
> Though, oddly, we don't do this in other cases of inline asm - I guess COMPILER(GCC) returns true of clang? Yeah, I guess CLANG is a superset of GCC.
Note You need to log in before you can comment on or make changes to this bug.