Bug 70995 - The GC should be parallel
Summary: The GC should be parallel
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-10-27 00:35 PDT by Filip Pizlo
Modified: 2011-11-01 11:55 PDT (History)
6 users (show)

See Also:


Attachments
work in progress (36.71 KB, patch)
2011-10-27 00:41 PDT, Filip Pizlo
no flags Details | Formatted Diff | Diff
the patch (50.99 KB, patch)
2011-10-29 22:27 PDT, Filip Pizlo
no flags Details | Formatted Diff | Diff
the patch (50.97 KB, patch)
2011-10-29 22:33 PDT, Filip Pizlo
no flags Details | Formatted Diff | Diff
the patch (57.25 KB, patch)
2011-10-30 00:19 PDT, Filip Pizlo
no flags Details | Formatted Diff | Diff
the patch (57.58 KB, patch)
2011-10-30 00:27 PDT, Filip Pizlo
gustavo: commit-queue-
Details | Formatted Diff | Diff
the patch (57.85 KB, patch)
2011-10-30 01:11 PDT, Filip Pizlo
ggaren: review+
Details | Formatted Diff | Diff
the patch - checking EWS with changes to address review (59.49 KB, patch)
2011-10-31 19:04 PDT, Filip Pizlo
ggaren: review+
Details | Formatted Diff | Diff
the patch (60.12 KB, patch)
2011-10-31 22:31 PDT, Filip Pizlo
no flags Details | Formatted Diff | Diff
the patch (60.16 KB, patch)
2011-10-31 23:16 PDT, Filip Pizlo
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Filip Pizlo 2011-10-27 00:35:43 PDT
Garbage collection is embarrassingly parallel.  There is nothing in our marking code that isn't thread-safe other than the flipping of mark bits, which can be done with a single CAS instruction.
Comment 1 Filip Pizlo 2011-10-27 00:41:51 PDT
Created attachment 112647 [details]
work in progress

This works, and produces speed-ups, but still needs clean-ups.  Here's the performance using my harness tuned to measure GC behavior.


Benchmark report for V8.

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc
"ParallelGC" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc

Collected 8 samples per benchmark/VM, with 4 VM invocations per benchmark. Ran 5 benchmark
iterations, and measured the total time of those iterations, for each sample. Emitted a call
to gc() between sample measurements. Used 5 benchmark iteration per VM invocation for
warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing.
Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                       TipOfTree               ParallelGC                                   

crypto             369.6983+-2.7221    ?   370.0403+-1.5868       ?
deltablue          903.7007+-8.3329    ?   911.5094+-5.1716       ?
earley-boyer       462.6758+-1.3969        461.6687+-1.3534       
raytrace           315.5462+-1.9768    ?   318.6208+-1.9752       ?
regexp             549.5400+-4.7576    ^   541.9117+-1.7295       ^ definitely 1.0141x faster
richards           632.6940+-2.0461    ?   634.8623+-2.2790       ?
splay              414.3850+-2.1458    ^   325.4297+-17.9843      ^ definitely 1.2733x faster

<arithmetic>       521.1771+-1.3120    ^   509.1490+-2.7390       ^ definitely 1.0236x faster
<geometric> *      492.4826+-1.0591    ^   476.1071+-3.8258       ^ definitely 1.0344x faster
<harmonic>         468.0389+-1.1209    ^   448.5087+-4.9235       ^ definitely 1.0435x faster
Comment 2 Filip Pizlo 2011-10-27 00:43:20 PDT
5% speed-up in the V8 harness.



[pizlo@minime benchmarks] /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc run.js 
Richards: 10005
DeltaBlue: 6367
Crypto: 15714
RayTrace: 8493
EarleyBoyer: 10495
RegExp: 2126
Splay: 7236
----
Score (version 6): 7530
[pizlo@minime benchmarks] /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc run.js 
Richards: 10149
DeltaBlue: 6499
Crypto: 15935
RayTrace: 8844
EarleyBoyer: 10821
RegExp: 2180
Splay: 9013
----
Score (version 6): 7932
[pizlo@minime benchmarks] irb
>> 7932/7530.0
=> 1.05338645418327
Comment 3 WebKit Review Bot 2011-10-27 00:44:37 PDT
Attachment 112647 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/heap/SlotVisitor.h:38:  The parameter name "heap" adds no information, so it should be removed.  [readability/parameter_name] [5]
Source/JavaScriptCore/heap/MarkStack.cpp:165:  Should have a space between // and comment  [whitespace/comments] [4]
Source/JavaScriptCore/heap/MarkStack.cpp:183:  Should have a space between // and comment  [whitespace/comments] [4]
Source/JavaScriptCore/heap/MarkStack.cpp:189:  This { should be at the end of the previous line  [whitespace/braces] [4]
Source/JavaScriptCore/heap/MarkStack.cpp:212:  Should have a space between // and comment  [whitespace/comments] [4]
Source/JavaScriptCore/heap/MarkStack.cpp:220:  Should have a space between // and comment  [whitespace/comments] [4]
Source/JavaScriptCore/heap/MarkStack.cpp:229:  Should have a space between // and comment  [whitespace/comments] [4]
Source/JavaScriptCore/heap/MarkStack.cpp:234:  Should have a space between // and comment  [whitespace/comments] [4]
Source/JavaScriptCore/heap/MarkStack.h:125:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Source/JavaScriptCore/heap/Heap.cpp:695:  Should have a space between // and comment  [whitespace/comments] [4]
Source/JavaScriptCore/heap/Heap.cpp:697:  Should have a space between // and comment  [whitespace/comments] [4]
Source/JavaScriptCore/heap/Heap.cpp:801:  Should have a space between // and comment  [whitespace/comments] [4]
Source/JavaScriptCore/heap/Heap.cpp:803:  Should have a space between // and comment  [whitespace/comments] [4]
Total errors found: 13 in 11 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 4 Sam Weinig 2011-10-27 11:06:07 PDT
I am not sure how relevant it is, but you may want to look at wtf/ParallelJobs.h.  On the Mac it will use libdispatch, which can be fun.
Comment 5 Filip Pizlo 2011-10-27 13:50:11 PDT
Here's the overall performance impact using my harness:


Benchmark report for SunSpider, V8, and Kraken.

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc
"ParallelGC" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                            TipOfTree               ParallelGC                                   
SunSpider:
   3d-cube                                7.2180+-0.1808    ?     7.4524+-0.1367       ? might be 1.0325x slower
   3d-morph                               7.6799+-0.1029          7.5913+-0.1302         might be 1.0117x faster
   3d-raytrace                            7.5534+-0.2183          7.4831+-0.1791       
   access-binary-trees                    1.6424+-0.0414          1.6249+-0.0357         might be 1.0107x faster
   access-fannkuch                        6.3541+-0.1033    ?     6.4410+-0.1137       ? might be 1.0137x slower
   access-nbody                           3.8256+-0.1068          3.7492+-0.0887         might be 1.0204x faster
   access-nsieve                          2.6080+-0.0739          2.5943+-0.0592       
   bitops-3bit-bits-in-byte               1.3032+-0.0421    ?     1.3417+-0.0224       ? might be 1.0295x slower
   bitops-bits-in-byte                    2.3928+-0.0527          2.3785+-0.0843       
   bitops-bitwise-and                     3.3468+-0.1038          3.3233+-0.0834       
   bitops-nsieve-bits                     5.2992+-0.1325    ?     5.3240+-0.0675       ?
   controlflow-recursive                  2.1437+-0.0404          2.1308+-0.0296       
   crypto-aes                             7.3174+-0.2212    ?     7.3442+-0.1498       ?
   crypto-md5                             2.7344+-0.0748          2.6672+-0.0543         might be 1.0252x faster
   crypto-sha1                            2.5011+-0.0744          2.4450+-0.0696         might be 1.0229x faster
   date-format-tofte                      9.8872+-0.2085    ?    10.0584+-0.3533       ? might be 1.0173x slower
   date-format-xparb                      9.2605+-0.2615          9.1278+-0.1729         might be 1.0145x faster
   math-cordic                            6.5256+-0.0875          6.4593+-0.1045         might be 1.0103x faster
   math-partial-sums                      7.5841+-0.1001          7.5316+-0.0998       
   math-spectral-norm                     2.5835+-0.0451    ?     2.6250+-0.0915       ? might be 1.0160x slower
   regexp-dna                            11.4311+-0.1952    ?    11.6339+-0.2408       ? might be 1.0177x slower
   string-base64                          4.3517+-0.1414    ?     4.5075+-0.1156       ? might be 1.0358x slower
   string-fasta                           6.3099+-0.0896    ?     6.3234+-0.1080       ?
   string-tagcloud                       11.5226+-0.2577    ?    11.5569+-0.2916       ?
   string-unpack-code                    20.3920+-0.3430    ?    20.5870+-0.6153       ?
   string-validate-input                  5.1896+-0.1096    ?     5.3466+-0.1446       ? might be 1.0303x slower

   <arithmetic> *                         6.1138+-0.0290    ?     6.1403+-0.0386       ?
   <geometric>                            4.9488+-0.0213    ?     4.9601+-0.0298       ?
   <harmonic>                             3.9705+-0.0217    ?     3.9761+-0.0319       ?

                                            TipOfTree               ParallelGC                                   
V8:
   crypto                                73.8359+-0.2957    ?    74.4488+-0.7109       ?
   deltablue                            175.5006+-1.3144    ?   176.8104+-1.2502       ?
   earley-boyer                          91.1516+-0.8585         90.1037+-0.2475         might be 1.0116x faster
   raytrace                              62.9210+-0.4437    ?    63.1005+-0.7763       ?
   regexp                               105.2908+-1.3755        104.8327+-0.9881       
   richards                             126.8941+-0.5265    ?   127.2614+-1.0312       ?
   splay                                 91.3876+-0.6759    ^    72.8390+-0.9795       ^ definitely 1.2547x faster

   <arithmetic>                         103.8545+-0.1819    ^   101.3424+-0.2595       ^ definitely 1.0248x faster
   <geometric> *                         98.6156+-0.2071    ^    95.5441+-0.3157       ^ definitely 1.0321x faster
   <harmonic>                            94.0416+-0.2326    ^    90.7043+-0.3916       ^ definitely 1.0368x faster

                                            TipOfTree               ParallelGC                                   
Kraken:
   ai-astar                             499.2170+-3.3908    ?   506.4470+-10.1727      ? might be 1.0145x slower
   audio-beat-detection                 191.3056+-1.0695    ?   191.4846+-0.6155       ?
   audio-dft                            271.6365+-6.4902        268.4316+-1.7440         might be 1.0119x faster
   audio-fft                            125.1429+-0.8487    ?   125.8270+-1.1635       ?
   audio-oscillator                     252.0302+-1.9938    ?   252.9450+-2.0986       ?
   imaging-darkroom                     400.8323+-1.7930    ?   401.2702+-2.5584       ?
   imaging-desaturate                   226.0704+-1.1277        225.7703+-0.7585       
   imaging-gaussian-blur                556.3116+-1.9274    ?   560.3647+-2.2416       ?
   json-parse-financial                  57.9881+-1.3226    !    59.8796+-0.2561       ! definitely 1.0326x slower
   json-stringify-tinderbox              68.2823+-0.7350         67.9300+-0.5355       
   stanford-crypto-aes                  134.2209+-1.9479        133.6694+-1.6344       
   stanford-crypto-ccm                  100.1903+-1.0490         99.9488+-0.4803       
   stanford-crypto-pbkdf2               196.7415+-3.0909        193.2500+-0.9395         might be 1.0181x faster
   stanford-crypto-sha256-iterative      71.2779+-1.0248         70.7566+-0.4374       

   <arithmetic> *                       225.0891+-0.6866    ?   225.5696+-0.6807       ?
   <geometric>                          177.4647+-0.6886    ?   177.6658+-0.3348       ?
   <harmonic>                           139.9202+-0.8330    ?   140.3430+-0.2271       ?

                                            TipOfTree               ParallelGC                                   
All benchmarks:
   <arithmetic>                          85.8976+-0.2217         85.6813+-0.2179       
   <geometric>                           22.4445+-0.0697         22.3746+-0.0873       
   <harmonic>                             6.9912+-0.0374    ?     6.9982+-0.0548       ?

                                            TipOfTree               ParallelGC                                   
Geomean of preferred means:
   <scaled-result>                       51.3886+-0.1261    ^    50.9589+-0.1603       ^ definitely 1.0084x faster
Comment 6 Filip Pizlo 2011-10-27 13:54:55 PDT
Even when running with parallel GC enabled but with only 1 worker - so the load balancer is doing completely pointless work - this patch is basically performance-neutral.  It's a slow-down on v8-splay (4%, ish) but this does not show up in the grand averages.



Benchmark report for SunSpider, V8, and Kraken.

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc
"ParGC_1worker" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                            TipOfTree             ParGC_1worker                                  
SunSpider:
   3d-cube                                7.2949+-0.1575    ?     7.3727+-0.1966       ? might be 1.0107x slower
   3d-morph                               7.6382+-0.1124          7.5984+-0.1009       
   3d-raytrace                            7.4934+-0.1674    ?     7.5545+-0.2226       ?
   access-binary-trees                    1.5628+-0.0328    !     1.6664+-0.0407       ! definitely 1.0662x slower
   access-fannkuch                        6.3769+-0.0957          6.3468+-0.1113       
   access-nbody                           3.8433+-0.0674          3.8371+-0.0816       
   access-nsieve                          2.6024+-0.0664    ?     2.6081+-0.0886       ?
   bitops-3bit-bits-in-byte               1.3050+-0.0218          1.3031+-0.0404       
   bitops-bits-in-byte                    2.3791+-0.0536    ?     2.4168+-0.0887       ? might be 1.0159x slower
   bitops-bitwise-and                     3.3935+-0.1110          3.3504+-0.0702         might be 1.0129x faster
   bitops-nsieve-bits                     5.2880+-0.0830    ?     5.3889+-0.0710       ? might be 1.0191x slower
   controlflow-recursive                  2.0956+-0.0343    ?     2.1096+-0.0380       ?
   crypto-aes                             7.4149+-0.1537    ?     7.5127+-0.1670       ? might be 1.0132x slower
   crypto-md5                             2.6900+-0.0664    ?     2.7824+-0.0428       ? might be 1.0343x slower
   crypto-sha1                            2.4659+-0.0728          2.4191+-0.0656         might be 1.0194x faster
   date-format-tofte                      9.9838+-0.3736    ?    10.0864+-0.1818       ? might be 1.0103x slower
   date-format-xparb                      9.2507+-0.2275          9.2164+-0.2176       
   math-cordic                            6.4024+-0.1527    ?     6.5353+-0.1568       ? might be 1.0208x slower
   math-partial-sums                      7.5328+-0.1274          7.4789+-0.1229       
   math-spectral-norm                     2.6021+-0.0688          2.5571+-0.0609         might be 1.0176x faster
   regexp-dna                            11.5334+-0.2701         11.4021+-0.1800         might be 1.0115x faster
   string-base64                          4.4235+-0.1934          4.4043+-0.1079       
   string-fasta                           6.3310+-0.1557    ?     6.3362+-0.1281       ?
   string-tagcloud                       11.8947+-0.3594         11.6337+-0.2537         might be 1.0224x faster
   string-unpack-code                    20.4274+-0.3439         20.1890+-0.3354         might be 1.0118x faster
   string-validate-input                  5.3787+-0.1406          5.3504+-0.1414       

   <arithmetic> *                         6.1386+-0.0371          6.1330+-0.0328       
   <geometric>                            4.9505+-0.0321    ?     4.9654+-0.0321       ?
   <harmonic>                             3.9508+-0.0324    ?     3.9787+-0.0356       ?

                                            TipOfTree             ParGC_1worker                                  
V8:
   crypto                                74.6573+-0.9355    ?    74.6671+-1.2695       ?
   deltablue                            176.7681+-1.9078    ?   178.8428+-1.3773       ? might be 1.0117x slower
   earley-boyer                          90.6198+-0.4111    ?    91.2420+-0.9776       ?
   raytrace                              63.4282+-0.7518         62.6713+-0.4769         might be 1.0121x faster
   regexp                               105.9695+-1.1261    ^   104.3170+-0.4879       ^ definitely 1.0158x faster
   richards                             126.5331+-0.4297    ?   126.9162+-0.6510       ?
   splay                                 91.4826+-0.6302    !    95.4367+-0.4899       ! definitely 1.0432x slower

   <arithmetic>                         104.2084+-0.3281    ?   104.8705+-0.3554       ?
   <geometric> *                         98.9672+-0.2982    ?    99.4817+-0.3438       ?
   <harmonic>                            94.4207+-0.3735    ?    94.7775+-0.3760       ?

                                            TipOfTree             ParGC_1worker                                  
Kraken:
   ai-astar                             497.9819+-1.5014    ?   501.5006+-4.3832       ?
   audio-beat-detection                 193.8519+-2.2030        191.8244+-1.9925         might be 1.0106x faster
   audio-dft                            267.2113+-3.2929    ?   271.1404+-4.5041       ? might be 1.0147x slower
   audio-fft                            125.1287+-1.0431    ?   125.7332+-0.9478       ?
   audio-oscillator                     250.9315+-1.1048    ?   253.3135+-1.5058       ?
   imaging-darkroom                     397.6901+-1.3490    !   402.9572+-1.8083       ! definitely 1.0132x slower
   imaging-desaturate                   228.5683+-2.8265        226.6920+-1.4972       
   imaging-gaussian-blur                560.2089+-2.9211        556.4905+-1.8611       
   json-parse-financial                  57.2933+-0.3284    !    58.8656+-0.6922       ! definitely 1.0274x slower
   json-stringify-tinderbox              70.1123+-1.4993    ^    67.9328+-0.3743       ^ definitely 1.0321x faster
   stanford-crypto-aes                  132.2364+-1.5831    ?   133.8309+-1.5803       ? might be 1.0121x slower
   stanford-crypto-ccm                  100.2620+-0.6544    ?   100.6823+-1.2382       ?
   stanford-crypto-pbkdf2               198.5007+-3.3325        195.2135+-2.0347         might be 1.0168x faster
   stanford-crypto-sha256-iterative      70.9986+-0.3678         70.7538+-0.2947       

   <arithmetic> *                       225.0697+-0.3483    ?   225.4951+-0.7427       ?
   <geometric>                          177.5370+-0.4508    ?   177.7344+-0.4956       ?
   <harmonic>                           140.0540+-0.5592    ?   140.1788+-0.4364       ?

                                            TipOfTree             ParGC_1worker                                  
All benchmarks:
   <arithmetic>                          85.9583+-0.1007    ?    86.1804+-0.2358       ?
   <geometric>                           22.4631+-0.0890    ?    22.5254+-0.0871       ?
   <harmonic>                             6.9577+-0.0557    ?     7.0059+-0.0611       ?

                                            TipOfTree             ParGC_1worker                                  
Geomean of preferred means:
   <scaled-result>                       51.5176+-0.1332    ?    51.6232+-0.1352       ?
Comment 7 Filip Pizlo 2011-10-27 15:03:41 PDT
And here's the performance with ENABLE_PARALLEL_GC set to 0.



Benchmark report for SunSpider, V8, and Kraken.

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc
"ParGC_off" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                            TipOfTree               ParGC_off                                    
SunSpider:
   3d-cube                                7.3494+-0.1367          7.3081+-0.1812       
   3d-morph                               7.7531+-0.1294          7.6432+-0.0939         might be 1.0144x faster
   3d-raytrace                            7.6150+-0.2149    ?     7.7728+-0.2126       ? might be 1.0207x slower
   access-binary-trees                    1.6171+-0.0601    ?     1.6446+-0.0589       ? might be 1.0170x slower
   access-fannkuch                        6.4249+-0.1303    ?     6.4619+-0.0996       ?
   access-nbody                           3.7535+-0.0800          3.7456+-0.0676       
   access-nsieve                          2.6276+-0.0681          2.6007+-0.0638         might be 1.0103x faster
   bitops-3bit-bits-in-byte               1.3021+-0.0327    ?     1.3056+-0.0557       ?
   bitops-bits-in-byte                    2.3702+-0.0626    ?     2.3866+-0.0608       ?
   bitops-bitwise-and                     3.3120+-0.0931    ?     3.3925+-0.0882       ? might be 1.0243x slower
   bitops-nsieve-bits                     5.4025+-0.1042          5.3258+-0.1046         might be 1.0144x faster
   controlflow-recursive                  2.1068+-0.0363    ?     2.1186+-0.0378       ?
   crypto-aes                             7.3538+-0.2270    ?     7.4445+-0.1805       ? might be 1.0123x slower
   crypto-md5                             2.6951+-0.0557    ?     2.7680+-0.0968       ? might be 1.0270x slower
   crypto-sha1                            2.4814+-0.0528    ?     2.5605+-0.0755       ? might be 1.0319x slower
   date-format-tofte                      9.8457+-0.2383    ?     9.9146+-0.2241       ?
   date-format-xparb                      9.3008+-0.3754          9.0405+-0.1338         might be 1.0288x faster
   math-cordic                            6.3706+-0.0808    !     6.7802+-0.0821       ! definitely 1.0643x slower
   math-partial-sums                      7.6198+-0.1122          7.4890+-0.1256         might be 1.0175x faster
   math-spectral-norm                     2.6323+-0.0791    ?     2.6375+-0.0568       ?
   regexp-dna                            11.5339+-0.1717         11.5240+-0.1524       
   string-base64                          4.3541+-0.0758          4.3309+-0.0496       
   string-fasta                           6.3641+-0.1345    ?     6.4228+-0.1290       ?
   string-tagcloud                       11.4752+-0.3121    ?    11.6327+-0.1632       ? might be 1.0137x slower
   string-unpack-code                    20.3497+-0.1953    ?    20.5273+-0.2408       ?
   string-validate-input                  5.3539+-0.1222          5.3345+-0.1329       

   <arithmetic> *                         6.1294+-0.0245    ?     6.1582+-0.0245       ?
   <geometric>                            4.9538+-0.0255    ?     4.9834+-0.0308       ?
   <harmonic>                             3.9629+-0.0352    ?     3.9923+-0.0422       ?

                                            TipOfTree               ParGC_off                                    
V8:
   crypto                                74.8595+-1.4741         74.5039+-0.3569       
   deltablue                            180.4290+-1.1220    ^   176.9639+-1.4984       ^ definitely 1.0196x faster
   earley-boyer                          90.8161+-0.5091    ?    90.9153+-0.4368       ?
   raytrace                              64.3589+-0.9302    ^    62.9611+-0.4291       ^ definitely 1.0222x faster
   regexp                               104.8355+-0.5815    ?   105.7245+-0.6124       ?
   richards                             127.7546+-0.7860    ?   128.0517+-0.7380       ?
   splay                                 91.3341+-0.7922    ?    92.4157+-0.5433       ? might be 1.0118x slower

   <arithmetic>                         104.9125+-0.2718        104.5052+-0.3167       
   <geometric> *                         99.4909+-0.3100         99.1800+-0.2328       
   <harmonic>                            94.8698+-0.3791         94.5322+-0.2052       

                                            TipOfTree               ParGC_off                                    
Kraken:
   ai-astar                             507.1634+-4.5211        501.2228+-3.2661         might be 1.0119x faster
   audio-beat-detection                 190.7445+-0.8232    !   194.8461+-2.6200       ! definitely 1.0215x slower
   audio-dft                            269.4079+-2.6398    ?   271.1714+-2.7456       ?
   audio-fft                            126.3637+-0.9614        126.2289+-1.0185       
   audio-oscillator                     251.2230+-1.2274    ?   252.4837+-1.5019       ?
   imaging-darkroom                     400.4058+-1.5128    ?   402.1017+-2.4870       ?
   imaging-desaturate                   226.6919+-0.7861    ?   227.7904+-2.0344       ?
   imaging-gaussian-blur                561.7409+-3.6798    ?   565.0966+-4.2142       ?
   json-parse-financial                  58.2345+-0.4456         57.9125+-0.3687       
   json-stringify-tinderbox              68.2407+-0.2725    !    70.1338+-1.4605       ! definitely 1.0277x slower
   stanford-crypto-aes                  133.7770+-1.6480        133.2023+-1.4945       
   stanford-crypto-ccm                  102.7287+-2.1131        100.9507+-0.6909         might be 1.0176x faster
   stanford-crypto-pbkdf2               191.7112+-0.8882    ?   194.8814+-3.1410       ? might be 1.0165x slower
   stanford-crypto-sha256-iterative      71.2590+-0.5433         70.4723+-0.2436         might be 1.0112x faster

   <arithmetic> *                       225.6923+-0.6509    ?   226.3210+-0.7703       ?
   <geometric>                          177.7540+-0.4853    ?   178.2620+-0.6906       ?
   <harmonic>                           140.2630+-0.3780    ?   140.5231+-0.6601       ?

                                            TipOfTree               ParGC_off                                    
All benchmarks:
   <arithmetic>                          86.2435+-0.2081    ?    86.3860+-0.2201       ?
   <geometric>                           22.4974+-0.0707    ?    22.5802+-0.0928       ?
   <harmonic>                             6.9790+-0.0604    ?     7.0293+-0.0725       ?

                                            TipOfTree               ParGC_off                                    
Geomean of preferred means:
   <scaled-result>                       51.6299+-0.0804    ?    51.7048+-0.1116       ?
Comment 8 Filip Pizlo 2011-10-29 22:27:05 PDT
Created attachment 112985 [details]
the patch

Implemented a bunch of stuff:

- Better memory management for mark stack segments.
- Better load balancing.  It can now scale to 8 processors.
- Autodetection of number of processors.  It still caps itself at 4 threads, though, since that seems to be where we get peak efficiency.



Benchmark report for SunSpider, V8, and Kraken.

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc
"ParallelGC" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                            TipOfTree               ParallelGC                                   
SunSpider:
   3d-cube                                7.2847+-0.1599          7.2000+-0.1390         might be 1.0118x faster
   3d-morph                               7.6836+-0.1079          7.6069+-0.1119         might be 1.0101x faster
   3d-raytrace                            7.5123+-0.1952          7.4649+-0.1720       
   access-binary-trees                    1.6509+-0.0469          1.6238+-0.0729         might be 1.0167x faster
   access-fannkuch                        6.5595+-0.0676          6.4896+-0.1053         might be 1.0108x faster
   access-nbody                           3.7379+-0.0998          3.7356+-0.0627       
   access-nsieve                          2.5926+-0.0849    ?     2.6093+-0.0612       ?
   bitops-3bit-bits-in-byte               1.2904+-0.0295    ?     1.3312+-0.0385       ? might be 1.0316x slower
   bitops-bits-in-byte                    2.4042+-0.0579          2.3689+-0.0429         might be 1.0149x faster
   bitops-bitwise-and                     3.3795+-0.1010    ?     3.3876+-0.1226       ?
   bitops-nsieve-bits                     5.4380+-0.0857    ?     5.5171+-0.1328       ? might be 1.0145x slower
   controlflow-recursive                  2.1209+-0.0514    ?     2.1224+-0.0579       ?
   crypto-aes                             7.2702+-0.1745    ?     7.3951+-0.1885       ? might be 1.0172x slower
   crypto-md5                             2.7203+-0.0754    ?     2.7662+-0.0691       ? might be 1.0169x slower
   crypto-sha1                            2.4430+-0.0617          2.4351+-0.0454       
   date-format-tofte                     10.0079+-0.1732          9.9916+-0.2204       
   date-format-xparb                      8.8878+-0.1903    !     9.3501+-0.2024       ! definitely 1.0520x slower
   math-cordic                            6.4491+-0.1291          6.2971+-0.0644         might be 1.0241x faster
   math-partial-sums                      7.5129+-0.1028    ?     7.5139+-0.1127       ?
   math-spectral-norm                     2.5601+-0.0628    ?     2.6399+-0.0440       ? might be 1.0312x slower
   regexp-dna                            11.5483+-0.1454         11.4922+-0.1442       
   string-base64                          4.2972+-0.0686    ?     4.4867+-0.1473       ? might be 1.0441x slower
   string-fasta                           6.2632+-0.1289    ?     6.3459+-0.1575       ? might be 1.0132x slower
   string-tagcloud                       11.4555+-0.3263    ?    11.4960+-0.2517       ?
   string-unpack-code                    20.6781+-0.2345    ?    20.7067+-0.3981       ?
   string-validate-input                  5.2387+-0.0995          5.1660+-0.0626         might be 1.0141x faster

   <arithmetic> *                         6.1149+-0.0322    ?     6.1361+-0.0250       ?
   <geometric>                            4.9388+-0.0250    ?     4.9601+-0.0212       ?
   <harmonic>                             3.9553+-0.0259    ?     3.9773+-0.0249       ?

                                            TipOfTree               ParallelGC                                   
V8:
   crypto                                74.1679+-0.3164         74.0380+-0.3816       
   deltablue                            179.7767+-1.9744        179.3566+-1.2107       
   earley-boyer                          91.6448+-0.5219         90.9476+-0.4252       
   raytrace                              62.5295+-0.4596    !    64.3819+-0.9545       ! definitely 1.0296x slower
   regexp                               105.5681+-0.4315        105.4897+-0.4094       
   richards                             126.3819+-0.6091    ?   127.1708+-0.3652       ?
   splay                                 93.6141+-0.7341    ^    71.9968+-0.5395       ^ definitely 1.3003x faster

   <arithmetic>                         104.8119+-0.2897    ^   101.9116+-0.3152       ^ definitely 1.0285x faster
   <geometric> *                         99.3299+-0.2377    ^    95.9869+-0.3252       ^ definitely 1.0348x faster
   <harmonic>                            94.5656+-0.2492    ^    91.0851+-0.3703       ^ definitely 1.0382x faster

                                            TipOfTree               ParallelGC                                   
Kraken:
   ai-astar                             496.4468+-1.2667    ?   499.0402+-3.6723       ?
   audio-beat-detection                 192.5907+-1.4966        191.6971+-1.2168       
   audio-dft                            266.8459+-3.4843    ?   267.9069+-3.7898       ?
   audio-fft                            125.3709+-0.6697    ?   125.7723+-1.2540       ?
   audio-oscillator                     252.4701+-1.2692        251.8329+-1.2081       
   imaging-darkroom                     404.3208+-1.1456    ?   405.2122+-1.1299       ?
   imaging-desaturate                   226.4488+-0.4880        225.4735+-1.0376       
   imaging-gaussian-blur                558.3593+-1.6383        558.0886+-2.0891       
   json-parse-financial                  57.1418+-0.4478    ?    58.0770+-0.6443       ? might be 1.0164x slower
   json-stringify-tinderbox              68.5476+-0.3060         68.2372+-0.3896       
   stanford-crypto-aes                  134.3287+-1.8809        131.6633+-1.4008         might be 1.0202x faster
   stanford-crypto-ccm                  100.6053+-0.8729        100.2639+-0.5752       
   stanford-crypto-pbkdf2               194.3023+-1.7189        192.6526+-1.6629       
   stanford-crypto-sha256-iterative      71.1339+-0.3512    ?    71.1519+-0.6564       ?

   <arithmetic> *                       224.9224+-0.3267        224.7907+-0.6364       
   <geometric>                          177.2358+-0.3153        177.0139+-0.4504       
   <harmonic>                           139.6362+-0.3532        139.6204+-0.3786       

                                            TipOfTree               ParallelGC                                   
All benchmarks:
   <arithmetic>                          85.9911+-0.1177    ^    85.5317+-0.2086       ^ definitely 1.0054x faster
   <geometric>                           22.4348+-0.0689         22.3655+-0.0666       
   <harmonic>                             6.9654+-0.0445    ?     7.0000+-0.0428       ?

                                            TipOfTree               ParallelGC                                   
Geomean of preferred means:
   <scaled-result>                       51.5027+-0.1124    ^    50.9673+-0.1145       ^ definitely 1.0105x faster
Comment 9 WebKit Review Bot 2011-10-29 22:29:40 PDT
Attachment 112985 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/heap/SlotVisitor.h:38:  The parameter name "shared" adds no information, so it should be removed.  [readability/parameter_name] [5]
Source/JavaScriptCore/heap/MarkStack.cpp:367:  This { should be at the end of the previous line  [whitespace/braces] [4]
Source/JavaScriptCore/heap/MarkStack.cpp:372:  This { should be at the end of the previous line  [whitespace/braces] [4]
Source/JavaScriptCore/heap/MarkStack.h:81:  The parameter name "segment" adds no information, so it should be removed.  [readability/parameter_name] [5]
Source/JavaScriptCore/heap/MarkStack.h:143:  Tests for true/false, null/non-null, and zero/non-zero should all be done without equality comparisons.  [readability/comparison_to_zero] [5]
Total errors found: 5 in 12 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 10 Filip Pizlo 2011-10-29 22:33:41 PDT
Created attachment 112986 [details]
the patch

Fixed style.

Should also note that this patch, and the previous one, also fixes the weak reference harvester and opaque roots situation.  Opaque roots are kept locally in each thread but merged whenever it's profitable to do so, as well as at the end of any quantum of tracing.  Weak reference harvesters are merged eagerly, since they are currently unused, and are only intended for CodeBlocks.  There shouldn't be enough CodeBlocks for it to matter if we sometimes have to grab a global lock to insert a weak reference harvester to the global marking data.
Comment 11 Filip Pizlo 2011-10-29 22:35:36 PDT
(In reply to comment #4)
> I am not sure how relevant it is, but you may want to look at wtf/ParallelJobs.h.  On the Mac it will use libdispatch, which can be fun.

One big reason why not to use libdispatch or other such mechanisms: if anyone else uses them and we stop a WebKit thread due to GC whilst it is sitting inside of some libdispatch lock (or a lock of some dependency of libdispatch, like, say, malloc), then we deadlock the whole process.

I think that even starting pthreads during a GC could be a dangerous game to play.  Hence this code makes sure that it has a pool of threads started before any GC can happen.
Comment 12 WebKit Review Bot 2011-10-29 22:38:16 PDT
Attachment 112986 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/heap/MarkStack.cpp:367:  This { should be at the end of the previous line  [whitespace/braces] [4]
Source/JavaScriptCore/heap/MarkStack.cpp:372:  This { should be at the end of the previous line  [whitespace/braces] [4]
Total errors found: 2 in 12 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 13 Filip Pizlo 2011-10-30 00:19:12 PDT
Created attachment 112987 [details]
the patch

Attempting to fix weird interactions between parallel GC and webcore.
Comment 14 WebKit Review Bot 2011-10-30 00:20:43 PDT
Attachment 112987 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/heap/MarkStack.cpp:369:  This { should be at the end of the previous line  [whitespace/braces] [4]
Source/JavaScriptCore/heap/MarkStack.cpp:374:  This { should be at the end of the previous line  [whitespace/braces] [4]
Total errors found: 2 in 18 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 15 Filip Pizlo 2011-10-30 00:27:02 PDT
Created attachment 112988 [details]
the patch

The fixing continues...
Comment 16 WebKit Review Bot 2011-10-30 00:29:50 PDT
Attachment 112988 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/heap/MarkStack.cpp:369:  This { should be at the end of the previous line  [whitespace/braces] [4]
Source/JavaScriptCore/heap/MarkStack.cpp:374:  This { should be at the end of the previous line  [whitespace/braces] [4]
Total errors found: 2 in 18 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 17 Gustavo Noronha (kov) 2011-10-30 00:38:08 PDT
Comment on attachment 112988 [details]
the patch

Attachment 112988 [details] did not pass gtk-ews (gtk):
Output: http://queues.webkit.org/results/10242485
Comment 18 Filip Pizlo 2011-10-30 00:41:05 PDT
Looks like this passes tests and alows me to browse the web in debug mode.
Comment 19 Early Warning System Bot 2011-10-30 00:43:29 PDT
Comment on attachment 112988 [details]
the patch

Attachment 112988 [details] did not pass qt-ews (qt):
Output: http://queues.webkit.org/results/10245059
Comment 20 Gyuyoung Kim 2011-10-30 00:47:52 PDT
Comment on attachment 112988 [details]
the patch

Attachment 112988 [details] did not pass efl-ews (efl):
Output: http://queues.webkit.org/results/10238667
Comment 21 Filip Pizlo 2011-10-30 01:11:10 PDT
Created attachment 112990 [details]
the patch

Attempting to fix build failures on non-Mac platforms.
Comment 22 WebKit Review Bot 2011-10-30 01:13:02 PDT
Attachment 112990 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/heap/MarkStack.cpp:371:  This { should be at the end of the previous line  [whitespace/braces] [4]
Source/JavaScriptCore/heap/MarkStack.cpp:376:  This { should be at the end of the previous line  [whitespace/braces] [4]
Total errors found: 2 in 18 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 23 Filip Pizlo 2011-10-30 01:18:11 PDT
Performance retested after merging/fixing/etc.



Benchmark report for SunSpider, V8, and Kraken.

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc
"ParallelGC" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                            TipOfTree               ParallelGC                                   
SunSpider:
   3d-cube                                7.6899+-0.2978          7.5047+-0.1976         might be 1.0247x faster
   3d-morph                               7.9344+-0.2444          7.6294+-0.1314         might be 1.0400x faster
   3d-raytrace                            7.7543+-0.3009          7.4545+-0.1448         might be 1.0402x faster
   access-binary-trees                    1.6432+-0.0460          1.5978+-0.0653         might be 1.0284x faster
   access-fannkuch                        6.5333+-0.0883    ?     6.5941+-0.1215       ?
   access-nbody                           3.8733+-0.1735          3.6594+-0.0649         might be 1.0585x faster
   access-nsieve                          2.6750+-0.0760          2.5717+-0.0545         might be 1.0402x faster
   bitops-3bit-bits-in-byte               1.2764+-0.0264    !     1.3604+-0.0318       ! definitely 1.0657x slower
   bitops-bits-in-byte                    2.3180+-0.0668    ?     2.3830+-0.0528       ? might be 1.0280x slower
   bitops-bitwise-and                     3.4243+-0.0911          3.3971+-0.1047       
   bitops-nsieve-bits                     5.3360+-0.0961    ?     5.4479+-0.1615       ? might be 1.0210x slower
   controlflow-recursive                  2.1010+-0.0347    ?     2.1506+-0.0343       ? might be 1.0236x slower
   crypto-aes                             7.4222+-0.2491    ?     7.6064+-0.1827       ? might be 1.0248x slower
   crypto-md5                             2.7823+-0.0877          2.7117+-0.0968         might be 1.0260x faster
   crypto-sha1                            2.4500+-0.0612    ?     2.4609+-0.0447       ?
   date-format-tofte                     10.1750+-0.3408    ?    10.3077+-0.3219       ? might be 1.0130x slower
   date-format-xparb                      9.0613+-0.2235    !     9.8251+-0.3201       ! definitely 1.0843x slower
   math-cordic                            6.4947+-0.1313          6.4855+-0.1292       
   math-partial-sums                      7.5811+-0.2072          7.5517+-0.2192       
   math-spectral-norm                     2.6240+-0.0661    ?     2.6337+-0.0613       ?
   regexp-dna                            11.6862+-0.2992         11.6826+-0.3405       
   string-base64                          4.1872+-0.0378    ?     4.4519+-0.2612       ? might be 1.0632x slower
   string-fasta                           6.6351+-0.2360          6.5851+-0.2751       
   string-tagcloud                       11.7555+-0.4311    ?    12.0805+-0.3561       ? might be 1.0276x slower
   string-unpack-code                    20.5635+-0.4911    ?    21.3864+-0.8229       ? might be 1.0400x slower
   string-validate-input                  5.3534+-0.0821    ^     5.1113+-0.0723       ^ definitely 1.0474x faster

   <arithmetic> *                         6.2050+-0.0507    ?     6.2550+-0.0435       ?
   <geometric>                            4.9988+-0.0332    ?     5.0137+-0.0282       ?
   <harmonic>                             3.9802+-0.0270    ?     3.9967+-0.0305       ?

                                            TipOfTree               ParallelGC                                   
V8:
   crypto                                73.1870+-0.3514    ?    73.4561+-0.4061       ?
   deltablue                            178.5365+-0.8240    ^   174.2394+-0.5655       ^ definitely 1.0247x faster
   earley-boyer                          91.9127+-0.3644         91.5211+-0.4322       
   raytrace                              62.8774+-0.3963    ?    63.4041+-0.6165       ?
   regexp                               106.1395+-0.6146    ?   107.0600+-0.6550       ?
   richards                             126.5740+-0.4962        126.5667+-0.2941       
   splay                                 93.7210+-0.4231    ^    72.9378+-1.0055       ^ definitely 1.2849x faster

   <arithmetic>                         104.7069+-0.2350    ^   101.3122+-0.2225       ^ definitely 1.0335x faster
   <geometric> *                         99.2797+-0.2067    ^    95.6737+-0.2585       ^ definitely 1.0377x faster
   <harmonic>                            94.5382+-0.1999    ^    90.8940+-0.2924       ^ definitely 1.0401x faster

                                            TipOfTree               ParallelGC                                   
Kraken:
   ai-astar                             493.6872+-1.0811    !   498.7911+-2.1791       ! definitely 1.0103x slower
   audio-beat-detection                 193.5298+-1.4174    ^   190.9533+-0.8591       ^ definitely 1.0135x faster
   audio-dft                            267.3048+-1.8658    ?   269.0041+-2.7896       ?
   audio-fft                            124.6659+-0.8884    ?   124.9870+-1.0028       ?
   audio-oscillator                     251.0656+-1.6563    ?   251.5320+-1.1005       ?
   imaging-darkroom                     403.1887+-2.5335        400.9468+-1.4467       
   imaging-desaturate                   225.4055+-0.4488    ?   225.9516+-1.6251       ?
   imaging-gaussian-blur                554.4614+-1.3730    ?   555.1623+-2.2363       ?
   json-parse-financial                  58.4412+-0.8444    ^    57.0616+-0.3266       ^ definitely 1.0242x faster
   json-stringify-tinderbox              69.5005+-0.6899    ?    70.3635+-1.0300       ? might be 1.0124x slower
   stanford-crypto-aes                  133.2301+-1.1865        132.4162+-1.3045       
   stanford-crypto-ccm                  100.1346+-0.6399    ?   100.1434+-0.6771       ?
   stanford-crypto-pbkdf2               196.4266+-3.2542    ^   191.2882+-1.0304       ^ definitely 1.0269x faster
   stanford-crypto-sha256-iterative      71.3867+-0.5641         71.2631+-0.2770       

   <arithmetic> *                       224.4592+-0.3648        224.2760+-0.4353       
   <geometric>                          177.3997+-0.4300        176.8961+-0.3648       
   <harmonic>                           140.3279+-0.6257        139.7143+-0.4019       

                                            TipOfTree               ParallelGC                                   
All benchmarks:
   <arithmetic>                          85.8874+-0.1416    ^    85.3549+-0.1542       ^ definitely 1.0062x faster
   <geometric>                           22.5895+-0.0949         22.4835+-0.0785       
   <harmonic>                             7.0084+-0.0466    ?     7.0333+-0.0525       ?

                                            TipOfTree               ParallelGC                                   
Geomean of preferred means:
   <scaled-result>                       51.7098+-0.1562    ^    51.1993+-0.1660       ^ definitely 1.0100x faster
Comment 24 Sam Weinig 2011-10-30 13:18:28 PDT
(In reply to comment #11)
> (In reply to comment #4)
> > I am not sure how relevant it is, but you may want to look at wtf/ParallelJobs.h.  On the Mac it will use libdispatch, which can be fun.
> 
> One big reason why not to use libdispatch or other such mechanisms: if anyone else uses them and we stop a WebKit thread due to GC whilst it is sitting inside of some libdispatch lock (or a lock of some dependency of libdispatch, like, say, malloc), then we deadlock the whole process.
> 
> I think that even starting pthreads during a GC could be a dangerous game to play.  Hence this code makes sure that it has a pool of threads started before any GC can happen.

 That makes sense.
Comment 25 Sam Weinig 2011-10-30 13:19:58 PDT
"It also appears that this reduces GC pause times on real websites by more than half."

Well that seems quite nice.  How are you measuring this?  Is this a metric that is cheap enough to compute that we could consider putting it in our increasingly poorly named "Caches" debug window?
Comment 26 Filip Pizlo 2011-10-30 13:22:17 PDT
(In reply to comment #25)
> "It also appears that this reduces GC pause times on real websites by more than half."
> 
> Well that seems quite nice.  How are you measuring this?  Is this a metric that is cheap enough to compute that we could consider putting it in our increasingly poorly named "Caches" debug window?

It's really cheap.  Just calls to currentTimeMS() before and after collect().  We should probably be tracking GC pause times, all the time, and allow users to see statistics on recent pause times whenver they want.
Comment 27 Filip Pizlo 2011-10-31 02:06:28 PDT
Performance after merging up to r98832.


Benchmark report for SunSpider, V8, and Kraken.

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/quinary/OpenSource/WebKitBuild/Release/jsc
"ParallelGC" at /Volumes/Data/pizlo/senary/OpenSource/WebKitBuild/Release/jsc

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime()
function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in
milliseconds.

                                            TipOfTree               ParallelGC                                   
SunSpider:
   3d-cube                                7.3539+-0.2026          7.3229+-0.2011       
   3d-morph                               7.5077+-0.1106    ?     7.5947+-0.1384       ? might be 1.0116x slower
   3d-raytrace                            7.5171+-0.2375          7.4688+-0.1699       
   access-binary-trees                    1.6129+-0.0674    ?     1.6750+-0.0729       ? might be 1.0385x slower
   access-fannkuch                        6.5770+-0.1457          6.4955+-0.0908         might be 1.0125x faster
   access-nbody                           3.8775+-0.1856          3.7505+-0.0898         might be 1.0339x faster
   access-nsieve                          2.6215+-0.0916          2.5791+-0.0562         might be 1.0164x faster
   bitops-3bit-bits-in-byte               1.3197+-0.0558    ?     1.3384+-0.0396       ? might be 1.0142x slower
   bitops-bits-in-byte                    2.4372+-0.0661          2.3905+-0.0493         might be 1.0195x faster
   bitops-bitwise-and                     3.3222+-0.1106    ?     3.3613+-0.0778       ? might be 1.0118x slower
   bitops-nsieve-bits                     5.5133+-0.1198    ?     5.5630+-0.1057       ?
   controlflow-recursive                  2.1531+-0.0352    ?     2.1555+-0.0519       ?
   crypto-aes                             7.5198+-0.2329    ?     7.8458+-0.3238       ? might be 1.0433x slower
   crypto-md5                             2.7414+-0.0706    ?     2.7590+-0.0731       ?
   crypto-sha1                            2.4408+-0.0702    ?     2.4827+-0.0397       ? might be 1.0172x slower
   date-format-tofte                     10.3145+-0.2630         10.3016+-0.3416       
   date-format-xparb                      8.9655+-0.2076    ?     9.4674+-0.3290       ? might be 1.0560x slower
   math-cordic                            6.5463+-0.1256          6.5147+-0.1471       
   math-partial-sums                      7.4301+-0.1194          7.3979+-0.1266       
   math-spectral-norm                     2.6345+-0.0829          2.6160+-0.0461       
   regexp-dna                            11.7134+-0.2211    ?    11.7141+-0.2021       ?
   string-base64                          4.5294+-0.2703          4.5244+-0.2536       
   string-fasta                           6.2156+-0.0909    ?     6.4610+-0.1575       ? might be 1.0395x slower
   string-tagcloud                       11.7648+-0.4143         11.7532+-0.3231       
   string-unpack-code                    20.7230+-0.6054    ?    20.7778+-0.5743       ?
   string-validate-input                  5.5511+-0.3397          5.3827+-0.3042         might be 1.0313x faster

   <arithmetic> *                         6.1886+-0.0355    ?     6.2190+-0.0290       ?
   <geometric>                            4.9970+-0.0312    ?     5.0181+-0.0257       ?
   <harmonic>                             3.9955+-0.0353    ?     4.0172+-0.0281       ?

                                            TipOfTree               ParallelGC                                   
V8:
   crypto                                73.3812+-0.5024         73.0859+-0.5110       
   deltablue                            166.2654+-0.5110    ^   163.4370+-0.9970       ^ definitely 1.0173x faster
   earley-boyer                          90.1067+-0.3181    ?    90.4178+-0.4563       ?
   raytrace                              62.3816+-0.5405    ?    62.7093+-0.5034       ?
   regexp                               104.9189+-0.5704    !   106.1353+-0.6386       ! definitely 1.0116x slower
   richards                             124.1908+-0.4864    ?   125.3265+-0.6551       ?
   splay                                 92.5096+-0.6078    ^    72.3544+-0.7515       ^ definitely 1.2786x faster

   <arithmetic>                         101.9649+-0.1643    ^    99.0666+-0.2054       ^ definitely 1.0293x faster
   <geometric> *                         97.3140+-0.1850    ^    94.0647+-0.2235       ^ definitely 1.0345x faster
   <harmonic>                            93.1237+-0.2246    ^    89.7004+-0.2567       ^ definitely 1.0382x faster

                                            TipOfTree               ParallelGC                                   
Kraken:
   ai-astar                             494.2374+-2.0178    !   502.3527+-3.7023       ! definitely 1.0164x slower
   audio-beat-detection                 190.6462+-1.7206    ?   192.2360+-1.1167       ?
   audio-dft                            264.3711+-3.1062    ?   266.2583+-1.7913       ?
   audio-fft                            123.2559+-0.7364    ?   125.3206+-1.3437       ? might be 1.0168x slower
   audio-oscillator                     251.6934+-1.6194        250.6419+-1.6296       
   imaging-darkroom                     400.7391+-1.7695        399.8753+-2.3719       
   imaging-desaturate                   224.3837+-1.3417        224.2029+-1.1894       
   imaging-gaussian-blur                551.7855+-2.0391    ?   552.5716+-1.6117       ?
   json-parse-financial                  57.6361+-0.3248    ?    57.6641+-0.3721       ?
   json-stringify-tinderbox              68.7668+-0.7192         67.9271+-0.3880         might be 1.0124x faster
   stanford-crypto-aes                  133.5143+-2.3777        131.7421+-2.0223         might be 1.0135x faster
   stanford-crypto-ccm                   99.6615+-0.6328    ?    99.7472+-0.8025       ?
   stanford-crypto-pbkdf2               194.0549+-0.6564        192.8904+-1.6639       
   stanford-crypto-sha256-iterative      70.6565+-0.3140         70.5744+-0.2863       

   <arithmetic> *                       223.2430+-0.5268    ?   223.8575+-0.4149       ?
   <geometric>                          176.1541+-0.5141    ?   176.2938+-0.2608       ?
   <harmonic>                           139.1137+-0.4508        138.9795+-0.2621       

                                            TipOfTree               ParallelGC                                   
All benchmarks:
   <arithmetic>                          85.1077+-0.1693         84.8758+-0.1338       
   <geometric>                           22.4708+-0.0868         22.4149+-0.0649       
   <harmonic>                             7.0325+-0.0606    ?     7.0666+-0.0481       ?

                                            TipOfTree               ParallelGC                                   
Geomean of preferred means:
   <scaled-result>                       51.2284+-0.1171    ^    50.7811+-0.0811       ^ definitely 1.0088x faster
Comment 28 Geoffrey Garen 2011-10-31 17:53:27 PDT
Comment on attachment 112990 [details]
the patch

View in context: https://bugs.webkit.org/attachment.cgi?id=112990&action=review

r=me, with some comments below.

Windows build issue:


3>JavaScriptCore.exp : error LNK2001: unresolved external symbol "public: void __thiscall JSC::MarkStackArray::expand(void)" (?expand@MarkStackArray@JSC@@QAEXXZ)

> Source/JavaScriptCore/heap/Heap.cpp:618
> +            visitor.donateAndDrain();

Not for this patch, but it's really becoming apparent to me that draining should be automatic in the visitor: Once you append N items, the visitor should drain. The visitor has the best information about how full it is at any given time. Plus, this change would seriously declutter-ify calling code.

> Source/JavaScriptCore/heap/MarkStack.cpp:348
> +            for (unsigned countdown = Heuristics::minimumNumberOfScansBetweenRebalance; m_stack.canRemoveLast() && countdown-- > 0;)

You can elide "> 0" here.

> Source/JavaScriptCore/heap/MarkStack.cpp:370
> +    if (Heuristics::numberOfGCMarkers > 1) {

You can change this to an early return, to reduce the indentation level of the rest of the code. WebKit coding style prefers early return where possible for that reason.

> Source/JavaScriptCore/heap/MarkStack.cpp:399
> +                    // If we reached termination, then return.
> +                    if (!m_shared.m_numberOfActiveParallelMarkers && m_shared.m_sharedMarkStack.isEmpty())
> +                        return;

You can just change the condition above to "return" instead of "break", and remove this code.

> Source/JavaScriptCore/heap/MarkStack.h:166
> +    class MarkStackSharedData {

I think you should call out the threadiness of this code: "MarkStackThreadSharedData".

> Source/JavaScriptCore/heap/MarkStack.h:260
> +            if ((unsigned)m_opaqueRoots.size() < Heuristics::opaqueRootMergeThreshold)

C++-style static_cast, please.

Slightly better to upcast opaqueRootMergeThreshold, rather than down casting m_opaqueRoots.size(), to avoid loss of information.

> Source/JavaScriptCore/heap/MarkStack.h:311
> +    inline void MarkStack::addOpaqueRoot(void* root)
>      {
> -        return m_opaqueRoots.add(root).second;
> +#if ENABLE(PARALLEL_GC)
> +        mergeOpaqueRootsIfProfitable();
> +#endif
> +        m_opaqueRoots.add(root);
>      }

I'm surprised that's profitable to merge opaque roots eagerly. Nothing can be done with the set of opaque roots until the end of parallel work, anyway. Could be simpler just to omit eager merging.

Do you know why this is profitable?

> Source/JavaScriptCore/runtime/Heuristics.cpp:32
> +#if PLATFORM(MAC) && ENABLE(PARALLEL_GC)

OS(UNIX) or OS(DARWIN) is better than PLATFORM(MAC) here.

> Source/JavaScriptCore/runtime/Heuristics.cpp:177
> +#if PLATFORM(MAC) && ENABLE(PARALLEL_GC)
> +    int name[2];
> +    size_t valueSize = sizeof(cpusToUse);
> +    name[0] = CTL_HW;
> +    name[1] = HW_AVAILCPU;
> +    sysctl(name, 2, &cpusToUse, &valueSize, 0, 0);

OS(UNIX) or OS(DARWIN) is better than PLATFORM(MAC) here.

> Source/JavaScriptCore/wtf/Bitmap.h:40
> +#if PLATFORM(MAC)

CPU(X86) is better than PLATFORM(MAC) here.

> Source/JavaScriptCore/wtf/Platform.h:1078
> +#if !ENABLE(PARALLEL_GC) && PLATFORM(MAC)
> +#define ENABLE_PARALLEL_GC 1

This should be "#!defined(ENABLE_PARALLEL_GC)" instead of "!ENABLE(PARALLEL_GC)" (so you can manually define to disable parallel GC).
Comment 29 Filip Pizlo 2011-10-31 18:22:14 PDT
> > Source/JavaScriptCore/heap/MarkStack.h:311
> > +    inline void MarkStack::addOpaqueRoot(void* root)
> >      {
> > -        return m_opaqueRoots.add(root).second;
> > +#if ENABLE(PARALLEL_GC)
> > +        mergeOpaqueRootsIfProfitable();
> > +#endif
> > +        m_opaqueRoots.add(root);
> >      }
> 
> I'm surprised that's profitable to merge opaque roots eagerly. Nothing can be done with the set of opaque roots until the end of parallel work, anyway. Could be simpler just to omit eager merging.
> 
> Do you know why this is profitable?

It's not about profitability in the execution time sense.  It's profitability in the space usage sense.  If we did happen to have a lot of opaque roots, then we don't want the worst case of each worker thread holding onto its copy of the same hash set.  So merging when it gets big ensures that we minimize space usage.

I also have a call to mergeIfNecessary() in drain(), for a plainly dumb reason: the threads currently do not know about GC phase because whenever they go into idle state they will have merged all of their information into the shared data.  That keeps things simple.

> 
> > Source/JavaScriptCore/runtime/Heuristics.cpp:32
> > +#if PLATFORM(MAC) && ENABLE(PARALLEL_GC)
> 
> OS(UNIX) or OS(DARWIN) is better than PLATFORM(MAC) here.

OS(DARWIN) probably, since this particular incantation of sysctl is missing on some unixes.

> 
> > Source/JavaScriptCore/runtime/Heuristics.cpp:177
> > +#if PLATFORM(MAC) && ENABLE(PARALLEL_GC)
> > +    int name[2];
> > +    size_t valueSize = sizeof(cpusToUse);
> > +    name[0] = CTL_HW;
> > +    name[1] = HW_AVAILCPU;
> > +    sysctl(name, 2, &cpusToUse, &valueSize, 0, 0);
> 
> OS(UNIX) or OS(DARWIN) is better than PLATFORM(MAC) here.
> 
> > Source/JavaScriptCore/wtf/Bitmap.h:40
> > +#if PLATFORM(MAC)
> 
> CPU(X86) is better than PLATFORM(MAC) here.

(CPU(X86) || CPU(X86_64)) && OS(UNIX) at least, right?  Or some manner of check for whether the compiler supports GCC-style inline asm?

> 
> > Source/JavaScriptCore/wtf/Platform.h:1078
> > +#if !ENABLE(PARALLEL_GC) && PLATFORM(MAC)
> > +#define ENABLE_PARALLEL_GC 1
> 
> This should be "#!defined(ENABLE_PARALLEL_GC)" instead of "!ENABLE(PARALLEL_GC)" (so you can manually define to disable parallel GC).

Oops!  And I guess this should also be guarded with some variant of CPU(X86) to get it to work on PPC.
Comment 30 Filip Pizlo 2011-10-31 19:04:53 PDT
Created attachment 113121 [details]
the patch - checking EWS with changes to address review

Addressed Geoff's concerns, using a more precise set of preprocessor guards for CAS, # of processors detection, and parallel GC.
Comment 31 WebKit Review Bot 2011-10-31 19:06:06 PDT
Attachment 113121 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/heap/MarkStack.cpp:394:  This { should be at the end of the previous line  [whitespace/braces] [4]
Total errors found: 1 in 19 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 32 Geoffrey Garen 2011-10-31 19:18:09 PDT
> It's not about profitability in the execution time sense.  It's profitability in the space usage sense.

Makes sense. 

> > CPU(X86) is better than PLATFORM(MAC) here.
> 
> (CPU(X86) || CPU(X86_64)) && OS(UNIX) at least, right?  Or some manner of check for whether the compiler supports GCC-style inline asm?

Oh, right. This should do the trick: (CPU(X86) || CPU(X86_64)) && (COMPILER(GCC) || COMPILER(CLANG)).

> > This should be "#!defined(ENABLE_PARALLEL_GC)" instead of "!ENABLE(PARALLEL_GC)" (so you can manually define to disable parallel GC).
> 
> Oops!  And I guess this should also be guarded with some variant of CPU(X86) to get it to work on PPC.

Good point.
Comment 33 Filip Pizlo 2011-10-31 19:24:51 PDT
> > > CPU(X86) is better than PLATFORM(MAC) here.
> > 
> > (CPU(X86) || CPU(X86_64)) && OS(UNIX) at least, right?  Or some manner of check for whether the compiler supports GCC-style inline asm?
> 
> Oh, right. This should do the trick: (CPU(X86) || CPU(X86_64)) && (COMPILER(GCC) || COMPILER(CLANG)).

I'll throw COMPILER(CLANG) in there.

Though, oddly, we don't do this in other cases of inline asm - I guess COMPILER(GCC) returns true of clang?
Comment 34 Geoffrey Garen 2011-10-31 19:27:14 PDT
Comment on attachment 113121 [details]
the patch - checking EWS with changes to address review

View in context: https://bugs.webkit.org/attachment.cgi?id=113121&action=review

r=me, with build fixed.

> Source/JavaScriptCore/wtf/Atomics.h:123
> +#if ENABLE(COMPARE_AND_SWAP)

Maybe better just to put the #if around the whole function definition. That way, if you try to use this on the wrong platform, you fail at compile time.

FWIW, TCSpinLock.h has some multi-platform CAS assembly code that could be merged into this. Doesn't need to hold up this patch, though.
Comment 35 Filip Pizlo 2011-10-31 19:41:49 PDT
(In reply to comment #34)
> (From update of attachment 113121 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=113121&action=review
> 
> r=me, with build fixed.
> 
> > Source/JavaScriptCore/wtf/Atomics.h:123
> > +#if ENABLE(COMPARE_AND_SWAP)
> 
> Maybe better just to put the #if around the whole function definition. That way, if you try to use this on the wrong platform, you fail at compile time.

Can't quite do it.  Bitmap calls weakCompareAndSwap conditionally, and even though with CAS/PARALLEL_GC disabled Bitmap doesn't get instantiated in a way that would result in calls to weakCompareAndSwap, the compiler will still complain.

I suppose you could have Bitmap accept some manner of template CAS functor ... but that seems like brutal overkill.  Or I could #ifdef out the calls to weakCompareAndSwap in Bitmap, but that seems like more bloat as well.

Do you have anything against having weakCompareAndSwap always declared, as it is now, with the CRASH() thingy?
Comment 36 Geoffrey Garen 2011-10-31 20:06:50 PDT
> Do you have anything against having weakCompareAndSwap always declared, as it is now, with the CRASH() thingy?

No.
Comment 37 Filip Pizlo 2011-10-31 22:31:50 PDT
Created attachment 113127 [details]
the patch

Checking to see if this fixes Windows.
Comment 38 WebKit Review Bot 2011-10-31 22:34:38 PDT
Attachment 113127 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/heap/MarkStack.cpp:394:  This { should be at the end of the previous line  [whitespace/braces] [4]
Total errors found: 1 in 19 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 39 Filip Pizlo 2011-10-31 23:16:31 PDT
Created attachment 113131 [details]
the patch

Another Windows fix attempt.
Comment 40 WebKit Review Bot 2011-10-31 23:19:27 PDT
Attachment 113131 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/JavaScriptCore/heap/MarkStack.cpp:394:  This { should be at the end of the previous line  [whitespace/braces] [4]
Total errors found: 1 in 20 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 41 Filip Pizlo 2011-10-31 23:43:58 PDT
Landed in http://trac.webkit.org/changeset/98937.
Comment 42 Geoffrey Garen 2011-11-01 11:55:57 PDT
> Though, oddly, we don't do this in other cases of inline asm - I guess COMPILER(GCC) returns true of clang?

Yeah, I guess CLANG is a superset of GCC.