Summary: fourthTier: all inline caches should be thread-safe enough to allow a concurrent compilation thread to read them safely
Product: WebKit
Component: JavaScriptCore
Status: RESOLVED FIXED
Severity: Normal
Priority: P2
Version: 528+ (Nightly build)
Hardware: All
OS: All
Reporter: Filip Pizlo <fpizlo>
Assignee: Filip Pizlo <fpizlo>
CC: barraclough, ggaren, mark.lam, mhahnenberg, msaboff, oliver, sam, xinchao.peng
Bug Depends on:
Bug Blocks: 112839
Attachments: the patch (attachment 198609), the patch (attachment 198611)
Description
Filip Pizlo
2013-04-17 13:37:22 PDT
Created attachment 198609 [details]
the patch
Not yet ready for review since I'm still running perf tests.
Comment on attachment 198609 [details]
the patch

View in context: https://bugs.webkit.org/attachment.cgi?id=198609&action=review

> Source/JavaScriptCore/jit/JITStubs.cpp:2178
> +        else {

I will revert this style goof.

Created attachment 198611 [details]
the patch
Still not ready for review. I just did a few fixes, like implementing the generic version of weakCompareAndSwap over bytes.
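[Editorial note: the "generic version of weakCompareAndSwap over bytes" mentioned above is not shown in this thread. As a rough illustration only, not WTF's actual implementation, a byte-wide weak compare-and-swap can be sketched with std::atomic; the function name here is hypothetical.]

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of a byte-wide weak compare-and-swap, in the spirit of
// WTF::weakCompareAndSwap (not the actual WebKit code, which operates on raw
// pointers and uses platform-specific atomics). "Weak" means the operation may
// fail spuriously even when the expected value matches, so callers must loop.
inline bool weakCompareAndSwapByte(std::atomic<uint8_t>& location, uint8_t expected, uint8_t newValue)
{
    // compare_exchange_weak updates `expected` on failure; we take it by value
    // so each call compares against the caller's original expectation.
    return location.compare_exchange_weak(expected, newValue);
}
```

A caller would retry in a loop, e.g. `while (!weakCompareAndSwapByte(flag, 0, 1)) { }`, which is why spurious failure is acceptable: the loop simply tries again.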
Looks sufficiently neutralish.

Benchmark report for SunSpider, V8Spider, Octane, Kraken, and JSRegress on oldmac (MacPro4,1).

VMs tested:
"TipOfTree" at /Volumes/Data/pizlo/fourthTier/OpenSource/WebKitBuild/Release/jsc (r148622)
"Conf#2" at /Volumes/Data/fromMiniMe/fourthTier/primary/OpenSource/WebKitBuild/Release/jsc (r148622)

Collected 12 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

TipOfTree vs. Conf#2

SunSpider:
3d-cube 10.7732+-0.7529 ? 10.8990+-0.9224 ? might be 1.0117x slower
3d-morph 9.9823+-0.8885 ? 10.5479+-0.8095 ? might be 1.0567x slower
3d-raytrace 12.7587+-0.9596 12.3760+-0.8274 might be 1.0309x faster
access-binary-trees 2.4409+-0.5366 2.3562+-0.5066 might be 1.0360x faster
access-fannkuch 9.4951+-0.7698 8.5674+-0.5061 might be 1.1083x faster
access-nbody 4.9198+-0.3752 ? 5.5967+-0.6018 ? might be 1.1376x slower
access-nsieve 5.1594+-0.3121 ? 5.7071+-0.6323 ? might be 1.1061x slower
bitops-3bit-bits-in-byte 2.2412+-0.3776 2.1458+-0.4400 might be 1.0444x faster
bitops-bits-in-byte 7.7304+-0.4993 ? 7.7334+-0.5442 ?
bitops-bitwise-and 3.2975+-0.6362 ? 3.4661+-0.6822 ? might be 1.0511x slower
bitops-nsieve-bits 4.8339+-0.5006 ? 5.2027+-0.7069 ? might be 1.0763x slower
controlflow-recursive 3.5442+-0.4818 ? 4.3113+-0.7373 ? might be 1.2164x slower
crypto-aes 8.9056+-0.6071 ? 9.4226+-0.6975 ? might be 1.0581x slower
crypto-md5 4.2070+-0.0690 ! 5.1098+-0.5744 ! definitely 1.2146x slower
crypto-sha1 3.9989+-0.6133 3.9405+-0.6392 might be 1.0148x faster
date-format-tofte 17.3577+-1.3181 ? 18.3363+-1.5610 ? might be 1.0564x slower
date-format-xparb 10.6231+-0.7370 ? 11.1372+-1.2858 ? might be 1.0484x slower
math-cordic 4.5687+-0.5623 4.1780+-0.2841 might be 1.0935x faster
math-partial-sums 14.0246+-1.3121 ? 15.5887+-0.7774 ? might be 1.1115x slower
math-spectral-norm 3.7217+-0.5433 3.6799+-0.4566 might be 1.0114x faster
regexp-dna 13.4874+-0.9232 13.2055+-0.8186 might be 1.0213x faster
string-base64 5.8741+-0.7326 5.5280+-0.7045 might be 1.0626x faster
string-fasta 12.5830+-0.8348 ? 12.6457+-0.9810 ?
string-tagcloud 17.0785+-1.2532 ? 17.3610+-0.8267 ? might be 1.0165x slower
string-unpack-code 32.5443+-2.2888 30.3825+-1.9038 might be 1.0712x faster
string-validate-input 9.5832+-0.5710 ^ 8.1781+-0.6308 ^ definitely 1.1718x faster
<arithmetic> * 9.0667+-0.3284 ? 9.1386+-0.2873 ? might be 1.0079x slower
<geometric> 7.2184+-0.2548 ? 7.3315+-0.2477 ? might be 1.0157x slower
<harmonic> 5.7588+-0.2265 ? 5.8350+-0.2167 ? might be 1.0132x slower

V8Spider:
crypto 98.2067+-2.2626 ! 104.2140+-2.8089 ! definitely 1.0612x slower
deltablue 142.8194+-9.2890 ? 150.9483+-8.1493 ? might be 1.0569x slower
earley-boyer 89.3803+-5.0535 ? 90.9872+-4.8304 ? might be 1.0180x slower
raytrace 72.6136+-3.5272 69.5225+-3.9035 might be 1.0445x faster
regexp 113.4736+-6.3801 ? 113.7722+-5.4376 ?
richards 134.9424+-6.5422 133.4757+-5.8802 might be 1.0110x faster
splay 50.5034+-2.0854 ? 52.7329+-2.6118 ? might be 1.0441x slower
<arithmetic> 100.2771+-2.6839 ? 102.2361+-1.9171 ? might be 1.0195x slower
<geometric> * 94.9037+-2.3333 ? 96.5785+-1.5952 ? might be 1.0176x slower
<harmonic> 89.1866+-2.0660 ? 90.6574+-1.5417 ? might be 1.0165x slower

Octane and V8v7:
encrypt 0.52861+-0.01990 0.49864+-0.01720 might be 1.0601x faster
decrypt 10.77387+-0.16534 ? 10.90941+-0.17941 ? might be 1.0126x slower
deltablue x2 0.68750+-0.01796 ? 0.68863+-0.01681 ?
earley 0.93213+-0.01255 ? 0.96996+-0.03889 ? might be 1.0406x slower
boyer 13.65013+-0.65114 12.94433+-0.15999 might be 1.0545x faster
raytrace x2 4.68156+-0.21498 ? 4.74780+-0.24781 ? might be 1.0141x slower
regexp x2 36.74121+-0.83399 ? 37.24425+-1.13091 ? might be 1.0137x slower
richards x2 0.37891+-0.00634 ? 0.38478+-0.00909 ? might be 1.0155x slower
splay x2 0.65758+-0.02612 ? 0.66659+-0.03335 ? might be 1.0137x slower
navier-stokes x2 12.46268+-0.15495 ? 12.80540+-0.31373 ? might be 1.0275x slower
closure 0.35169+-0.05558 ? 0.35659+-0.05839 ? might be 1.0139x slower
jquery 5.01616+-0.69583 ? 5.10675+-0.76303 ? might be 1.0181x slower
gbemu x2 287.18190+-17.21174 285.78391+-17.74520
box2d x2 36.43667+-0.94683 36.22425+-0.84636

V8v7:
<arithmetic> 8.56898+-0.13079 ? 8.64983+-0.12523 ? might be 1.0094x slower
<geometric> * 2.73193+-0.02610 ? 2.75016+-0.02116 ? might be 1.0067x slower
<harmonic> 1.06501+-0.01425 ? 1.06880+-0.01532 ? might be 1.0036x slower

Octane including V8v7:
<arithmetic> 35.89585+-1.63302 35.81259+-1.66363 might be 1.0023x faster
<geometric> * 4.93247+-0.09883 ? 4.95755+-0.09884 ? might be 1.0051x slower
<harmonic> 1.20386+-0.03107 ? 1.21003+-0.03595 ? might be 1.0051x slower

Kraken:
ai-astar 564.016+-22.741 548.852+-26.473 might be 1.0276x faster
audio-beat-detection 271.087+-13.166 267.935+-15.064 might be 1.0118x faster
audio-dft 359.598+-13.298 ? 387.805+-29.153 ? might be 1.0784x slower
audio-fft 163.107+-6.692 ? 164.352+-4.663 ?
audio-oscillator 262.675+-12.257 258.578+-10.941 might be 1.0158x faster
imaging-darkroom 322.852+-14.686 ? 331.091+-13.220 ? might be 1.0255x slower
imaging-desaturate 189.341+-7.705 180.871+-10.651 might be 1.0468x faster
imaging-gaussian-blur 449.468+-31.862 438.416+-16.551 might be 1.0252x faster
json-parse-financial 89.767+-4.353 ? 91.340+-4.658 ? might be 1.0175x slower
json-stringify-tinderbox 117.635+-2.998 ? 118.585+-3.430 ?
stanford-crypto-aes 274.929+-9.222 263.918+-16.025 might be 1.0417x faster
stanford-crypto-ccm 119.097+-6.197 ? 124.912+-4.557 ? might be 1.0488x slower
stanford-crypto-pbkdf2 289.861+-13.921 ? 291.108+-13.613 ?
stanford-crypto-sha256-iterative 127.304+-7.241 ? 130.271+-4.828 ? might be 1.0233x slower
<arithmetic> * 257.196+-3.358 257.002+-3.752 might be 1.0008x faster
<geometric> 223.973+-2.605 ? 224.630+-2.758 ? might be 1.0029x slower
<harmonic> 194.187+-2.128 ? 195.746+-2.284 ? might be 1.0080x slower

JSRegress:
adapt-to-double-divide 25.3017+-2.0167 ? 25.5738+-2.0607 ? might be 1.0108x slower
aliased-arguments-getbyval 1.0736+-0.2077 1.0026+-0.1462 might be 1.0708x faster
allocate-big-object 3.1486+-0.5895 2.8882+-0.4687 might be 1.0902x faster
arity-mismatch-inlining 0.9315+-0.2269 0.8802+-0.2071 might be 1.0582x faster
array-access-polymorphic-structure 8.2992+-0.7990 7.9141+-0.7346 might be 1.0487x faster
array-with-double-add 6.8133+-0.7338 6.5764+-0.6513 might be 1.0360x faster
array-with-double-increment 4.7634+-0.6554 4.6154+-0.4589 might be 1.0321x faster
array-with-double-mul-add 8.1105+-0.5581 ? 8.5930+-0.3939 ? might be 1.0595x slower
array-with-double-sum 8.9950+-0.7045 ? 9.0182+-0.5771 ?
array-with-int32-add-sub 12.6781+-0.9069 12.3052+-0.4419 might be 1.0303x faster
array-with-int32-or-double-sum 8.7711+-0.7163 ? 8.7902+-0.5434 ?
big-int-mul 5.5825+-0.6763 5.4768+-0.6071 might be 1.0193x faster
boolean-test 5.0977+-0.5875 ? 5.1502+-0.5915 ? might be 1.0103x slower
cast-int-to-double 14.8425+-1.1230 ? 16.0399+-1.0950 ? might be 1.0807x slower
cell-argument 16.0172+-1.1938 15.6231+-0.9089 might be 1.0252x faster
cfg-simplify 4.2933+-0.4263 ? 4.7271+-0.6268 ? might be 1.1010x slower
cmpeq-obj-to-obj-other 11.8893+-0.9601 ? 12.7418+-0.8596 ? might be 1.0717x slower
constant-test 9.7388+-0.5613 9.6625+-0.6959
direct-arguments-getbyval 1.0562+-0.3271 0.9729+-0.1991 might be 1.0856x faster
double-pollution-getbyval 12.8150+-1.1972 12.0079+-0.7601 might be 1.0672x faster
double-pollution-putbyoffset 6.0374+-0.7701 6.0309+-0.6917
empty-string-plus-int 12.6062+-0.8203 ! 14.3878+-0.8777 ! definitely 1.1413x slower
external-arguments-getbyval 2.7412+-0.5916 2.4407+-0.3284 might be 1.1231x faster
external-arguments-putbyval 3.8867+-0.5553 ? 4.0535+-0.5437 ? might be 1.0429x slower
Float32Array-matrix-mult 15.8385+-0.6888 ? 16.3619+-1.1122 ? might be 1.0331x slower
fold-double-to-int 24.7382+-1.2906 23.5455+-1.3652 might be 1.0507x faster
function-dot-apply 3.9800+-0.6518 3.7141+-0.6022 might be 1.0716x faster
function-test 5.7699+-0.6511 ? 5.9686+-0.7408 ? might be 1.0344x slower
get-by-id-chain-from-try-block 8.0133+-0.5624 ? 8.9817+-0.6651 ? might be 1.1208x slower
HashMap-put-get-iterate-keys 102.2048+-5.0722 ? 109.6164+-6.3441 ? might be 1.0725x slower
HashMap-put-get-iterate 107.1619+-6.1954 ? 110.7891+-4.5439 ? might be 1.0338x slower
HashMap-string-put-get-iterate 82.7761+-4.5444 80.2078+-4.8661 might be 1.0320x faster
indexed-properties-in-objects 5.2805+-0.6568 4.9070+-0.4738 might be 1.0761x faster
inline-arguments-access 1.4816+-0.2234 ? 1.6259+-0.4115 ? might be 1.0974x slower
inline-arguments-local-escape 25.8086+-1.9293 ? 26.8575+-1.4942 ? might be 1.0406x slower
inline-get-scoped-var 7.7334+-0.5407 7.2062+-0.7153 might be 1.0732x faster
inlined-put-by-id-transition 16.6553+-0.2585 ? 17.0550+-0.7717 ? might be 1.0240x slower
int-or-other-abs-then-get-by-val 10.8286+-0.8712 9.7244+-0.7090 might be 1.1136x faster
int-or-other-abs-zero-then-get-by-val 40.5876+-2.7099 ? 41.7003+-2.9785 ? might be 1.0274x slower
int-or-other-add-then-get-by-val 10.6907+-0.5346 ! 12.5845+-0.4341 ! definitely 1.1772x slower
int-or-other-add 12.1626+-0.7466 12.1310+-0.8930
int-or-other-div-then-get-by-val 8.3704+-0.5576 ? 8.7899+-0.6522 ? might be 1.0501x slower
int-or-other-max-then-get-by-val 11.1460+-0.6125 ? 11.3571+-0.8223 ? might be 1.0189x slower
int-or-other-min-then-get-by-val 9.5575+-0.6996 ? 9.9255+-0.6154 ? might be 1.0385x slower
int-or-other-mod-then-get-by-val 9.2831+-0.4987 9.0938+-0.3757 might be 1.0208x faster
int-or-other-mul-then-get-by-val 8.2409+-0.5250 8.1759+-0.6603
int-or-other-neg-then-get-by-val 9.8055+-0.4325 9.5058+-0.6032 might be 1.0315x faster
int-or-other-neg-zero-then-get-by-val 40.9331+-2.5917 40.6453+-2.9694
int-or-other-sub-then-get-by-val 11.0429+-0.7352 ? 11.1979+-0.8853 ? might be 1.0140x slower
int-or-other-sub 9.3172+-0.7569 9.2757+-0.7517
int-overflow-local 15.1200+-1.1562 ? 15.6340+-0.7867 ? might be 1.0340x slower
Int16Array-bubble-sort 54.2952+-2.9049 ? 57.5772+-0.9861 ? might be 1.0604x slower
Int16Array-load-int-mul 2.4510+-0.5453 ? 2.6050+-0.6645 ? might be 1.0628x slower
Int8Array-load 5.4719+-0.6333 ? 5.8526+-0.6967 ? might be 1.0696x slower
integer-divide 16.3649+-0.8985 ? 16.8948+-1.0491 ? might be 1.0324x slower
integer-modulo 2.5957+-0.4867 2.4641+-0.4221 might be 1.0534x faster
make-indexed-storage 4.5676+-0.6015 4.4355+-0.4995 might be 1.0298x faster
method-on-number 26.8215+-1.8931 24.9067+-1.1704 might be 1.0769x faster
nested-function-parsing-random 420.2076+-27.4093 ? 424.9766+-29.3867 ? might be 1.0113x slower
nested-function-parsing 52.3087+-3.4263 ? 53.2750+-4.5216 ? might be 1.0185x slower
new-array-buffer-dead 4.3082+-0.5686 4.0262+-0.5919 might be 1.0700x faster
new-array-buffer-push 12.3439+-1.0000 11.8319+-1.1610 might be 1.0433x faster
new-array-dead 33.4362+-0.9490 32.5353+-2.6809 might be 1.0277x faster
new-array-push 8.2143+-0.6605 ? 8.7602+-0.6099 ? might be 1.0665x slower
number-test 5.0777+-0.6558 4.9625+-0.6016 might be 1.0232x faster
object-closure-call 8.6807+-0.4051 ! 10.1376+-0.8897 ! definitely 1.1678x slower
object-test 6.0050+-0.6591 ? 6.1622+-0.7019 ? might be 1.0262x slower
poly-stricteq 110.2450+-2.0018 109.8285+-3.3010
polymorphic-structure 21.7346+-1.3504 ? 23.7902+-1.1342 ? might be 1.0946x slower
polyvariant-monomorphic-get-by-id 15.4141+-0.9650 14.8617+-0.6667 might be 1.0372x faster
rare-osr-exit-on-local 22.0829+-1.5925 ! 24.4667+-0.6956 ! definitely 1.1080x slower
register-pressure-from-osr 36.8713+-1.5757 ^ 32.8851+-1.4277 ^ definitely 1.1212x faster
simple-activation-demo 38.5390+-2.7533 ? 41.2185+-1.2459 ? might be 1.0695x slower
slow-array-profile-convergence 5.0100+-0.6991 ? 5.1050+-0.6707 ? might be 1.0190x slower
slow-convergence 4.7688+-0.7602 4.1473+-0.4916 might be 1.1499x faster
sparse-conditional 1.5460+-0.3089 ? 1.6908+-0.2650 ? might be 1.0937x slower
splice-to-remove 57.7990+-3.7626 ? 59.0822+-2.8042 ? might be 1.0222x slower
string-concat-object 2.9805+-0.4187 ? 3.0939+-0.2808 ? might be 1.0380x slower
string-concat-pair-object 3.1770+-0.5265 2.7198+-0.0424 might be 1.1681x faster
string-concat-pair-simple 18.6781+-1.3768 18.0079+-1.3392 might be 1.0372x faster
string-concat-simple 17.3103+-0.1950 ! 18.9641+-1.3489 ! definitely 1.0955x slower
string-cons-repeat 11.6338+-0.9880 11.1892+-0.8615 might be 1.0397x faster
string-cons-tower 12.1271+-0.8269 ? 12.8342+-1.0835 ? might be 1.0583x slower
string-equality 120.8965+-9.1981 ? 125.2224+-3.3817 ? might be 1.0358x slower
string-hash 2.8263+-0.1812 ? 3.1405+-0.4257 ? might be 1.1112x slower
string-repeat-arith 53.5491+-0.6551 53.5349+-2.8296
string-sub 95.1613+-4.3599 ! 104.7923+-4.7190 ! definitely 1.1012x slower
string-test 5.0047+-0.6573 4.9646+-0.6174
structure-hoist-over-transitions 3.8305+-0.5905 3.7142+-0.4848 might be 1.0313x faster
tear-off-arguments-simple 2.2737+-0.5667 2.1894+-0.4130 might be 1.0385x faster
tear-off-arguments 3.6298+-0.3333 ? 4.2440+-0.6852 ? might be 1.1692x slower
temporal-structure 23.2880+-1.0497 ? 23.7900+-1.2156 ? might be 1.0216x slower
to-int32-boolean 34.1202+-2.4273 ? 34.1966+-1.5382 ?
undefined-test 5.1342+-0.6350 4.9934+-0.5581 might be 1.0282x faster
<arithmetic> 23.4340+-0.5584 ? 23.8736+-0.5201 ? might be 1.0188x slower
<geometric> * 10.6473+-0.3040 ? 10.7392+-0.2830 ? might be 1.0086x slower
<harmonic> 5.8342+-0.1597 5.8334+-0.1966 might be 1.0001x faster

All benchmarks:
<arithmetic> 46.2020+-0.6356 ? 46.5232+-0.7133 ? might be 1.0070x slower
<geometric> 12.8693+-0.3133 ? 12.9877+-0.2960 ? might be 1.0092x slower
<harmonic> 4.1725+-0.1058 ? 4.1878+-0.1152 ? might be 1.0037x slower

Geomean of preferred means:
<scaled-result> 25.8772+-0.4952 ? 26.0788+-0.4600 ? might be 1.0078x slower

Comment on attachment 198611 [details]
the patch
r=me
Landed in http://trac.webkit.org/changeset/148703

Comment #7:
On 32-bit ARM_THUMB2, how should I implement the concurrent compilation thread? Do I need to add a lock around writes to the profiler? Where?

Comment #8:
(In reply to comment #7)
I don't think that concurrent compilation would be a speed-up on 32-bit systems. You'd have to add locking to value profiles. That's almost certainly a terrible idea.

Comment #9:
(In reply to comment #8)
I tried to add locking to the value profile, in CodeBlock::CodeBlock(). I changed

    m_instructions = WTF::RefCountedArray<Instruction>(instructions);

to

    {
        ConcurrentJITLocker locker(m_lock);
        m_instructions = WTF::RefCountedArray<Instruction>(instructions);
    }

but it still crashes. Where else do I need to make modifications?

Comment #10:
(In reply to comment #9)
I'm really confused. How does your fix resolve the word-tearing problems when generated assembly code in the LLInt and baseline JIT attempts to store a 64-bit value into a value profile using two 32-bit stores?

Comment #11:
(In reply to comment #10)
My goal is to make the DFG compilation thread work on 32-bit ARM_THUMB2. I thought I needed to add a lock to value profiles, so that is what I did. Is there a problem with my approach?

Comment #12:
(In reply to comment #11)
Yes. The issue is that the concurrent JIT depends on being on a 64-bit system, or more broadly, systems where doing atomic 64-bit stores to memory is cheap. The 32-bit JIT and 32-bit LLInt have code that stores to value profiles using two 32-bit stores. You would need to fix that, either by placing a lock in JITed code (and the LLInt) around stores to value profiles, or by converting those stores to 64-bit atomic stores (probably double strex would do it). But that's probably not a good idea: the cost of doing locked/atomic 64-bit stores on 32-bit systems probably outweighs any benefit of doing concurrent JITing. But you're welcome to experiment with this; maybe it will work for you. Certainly, your proposed fix doesn't accomplish what I'm suggesting.

An alternative would be to drastically refactor the concurrent JIT to scrape all value profiles synchronously. This would be a huge change, and I don't recommend it: again, the cost of doing this would probably outweigh the benefits of the concurrent JIT.
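[Editorial note: to make the word-tearing hazard from comment #10 concrete, here is a deterministic simulation (hypothetical values, not actual JSC code) of a 64-bit value-profile bucket being updated with two 32-bit stores while a reader interleaves between them.]

```cpp
#include <cstdint>

// Deterministic illustration of word tearing (not actual JSC code). A 64-bit
// value-profile bucket updated with two 32-bit stores can be observed
// half-old/half-new by a concurrently scanning compilation thread.
inline uint64_t readBucket(const uint32_t bucket[2])
{
    // What a 64-bit reader (e.g. the concurrent compilation thread) would see.
    return (uint64_t(bucket[1]) << 32) | bucket[0];
}

inline uint64_t simulateTornRead()
{
    uint32_t bucket[2] = { 0x11111111u, 0x22222222u }; // old value 0x2222222211111111

    bucket[0] = 0x33333333u;            // first 32-bit store of the new value...
    uint64_t torn = readBucket(bucket); // ...a reader interleaves right here
    bucket[1] = 0x44444444u;            // ...second store completes the update

    return torn; // 0x2222222233333333: neither the old value nor the new one
}
```

On a real 32-bit target the interleaving is nondeterministic, but the torn result is the same kind of value: a bit pattern that was never actually stored, which is why the crash cannot be fixed by locking only the C++ side.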
Comment #13:
(In reply to comment #12)
If I make the modifications you describe, I would guess that the performance of the modified JavaScriptCore would be no better than it is now. Is my idea unnecessary? Do you think the modification is worth making?

Comment #14:
(In reply to comment #13)
Let me put it this way: the concurrent JIT is a modest speed-up (around 8%) on SunSpider. It's about an 11% speed-up when you run V8v6 under the SunSpider harness. It's not a speed-up on Octane or V8v6/v7 when run using Google's harness. It's not a speed-up on other benchmarks to my knowledge. It's probably not a good idea to do major architectural work entirely based on benchmarks, but what this tells us is that you're at best looking at a 10% speed-up. That's great, but it's not worth it if you have to introduce horrible bugs in the process or put locks around basic operations in the LLInt and baseline JIT.

But I may be wrong, and attempting to switch the baseline JIT to use two-word ldrex/strex for value profiles would be easy. If that turns out to have low overhead, then maybe this whole thing would be worth a shot. I recommend performing that experiment, but only if you think that 10% speed-up is worth it for your purposes, and only if you're willing to deal with the bug fallout.
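[Editorial note: the "double strex" option from comment #12 corresponds, in portable C++, to making the bucket a std::atomic<uint64_t>; on ARMv7, compilers implement the 64-bit atomic store as an ldrexd/strexd loop. The struct below is a hypothetical sketch, not JSC's actual ValueProfile.]

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical tear-free value-profile bucket (not JSC's actual ValueProfile).
// On 32-bit ARMv7 the store below lowers to an ldrexd/strexd loop -- the
// "double strex" approach -- so a concurrent compilation thread can never
// observe a half-written 64-bit value. The cost is exactly the atomic-store
// overhead the discussion above worries about.
struct TearFreeProfileBucket {
    std::atomic<uint64_t> m_value { 0 };

    void record(uint64_t encodedJSValue)
    {
        // Relaxed ordering suffices: we only need the store to be indivisible,
        // not to order surrounding memory accesses.
        m_value.store(encodedJSValue, std::memory_order_relaxed);
    }

    uint64_t observe() const
    {
        return m_value.load(std::memory_order_relaxed);
    }
};
```

The equivalent change in the LLInt and baseline JIT would mean emitting the exclusive-store loop in generated code, which is where the per-store cost on 32-bit systems comes from.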