157787 – sqrt and pow should produce consistent results even in SSE2 available x86 32bit environment

Yusuke Suzuki

Reported 2016-05-16 23:32:56 PDT

In an x86 32-bit environment where SSE2 is available, a strange situation occurs.

1. The C runtime is compiled with x87.
SSE2 is not assumed at build time, because binary packages (e.g. Debian i686) have to treat processor capabilities conservatively.

2. The DFG JIT emits floating-point operations in SSE2.
The DFG JIT consults CPUID at run time to determine SSE2 availability.

As a result, while the C runtime is compiled with x87, DFG JIT code uses SSE2 operations (such as sqrtsd and mulsd). Since x87 computes with 80-bit precision while SSE2 computes with 64-bit precision, the C runtime and DFG JIT code produce inconsistent results. Either 80-bit or 64-bit precision is acceptable on its own, but the same arguments must produce the same result in all the JIT tiers. Currently the DFG JIT produces 64-bit precision values while the C runtime produces 80-bit precision values.
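The mismatch between what the compiler assumed and what the CPU reports can be probed with a few lines of C++. This is only a sketch: the function names are made up for illustration, `FLT_EVAL_METHOD == 2` is used as a stand-in for "compiled for x87-style extended evaluation", and `__builtin_cpu_supports` is a GCC/Clang builtin that is assumed to be available only on x86 targets.

```cpp
#include <cassert>
#include <cfloat>

// True when the compiler evaluates double expressions in long double,
// which is what an x87-only build (no -mfpmath=sse) typically does.
bool compiledForExtendedEvaluation()
{
    return FLT_EVAL_METHOD == 2;
}

// True when the compiler was allowed to emit SSE2 instructions.
bool compiledWithSSE2()
{
#if defined(__SSE2__)
    return true;
#else
    return false;
#endif
}

// Run-time check, analogous to a JIT consulting CPUID.
bool cpuReportsSSE2()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

// The situation described in this bug: statically compiled code rounds
// through 80-bit intermediates while a JIT could pick SSE2 at run time.
bool mayMixPrecisions()
{
    // A binary built with SSE2 instructions can only run on SSE2 hardware.
    assert(!compiledWithSSE2() || cpuReportsSSE2());
    return compiledForExtendedEvaluation() && cpuReportsSSE2();
}
```

On a typical x86-64 build `mayMixPrecisions()` would be expected to return false, since SSE2 is part of the baseline there. An i686 build without `-mfpmath=sse`, run on an SSE2-capable CPU, is the case this bug is about.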

Yusuke Suzuki

Comment 1 2016-05-17 07:26:48 PDT

If my guess is correct, the result of sqrt stays in an x87 register, so 1.0 / sqrt(value) is calculated in 80-bit precision. The solution is to store the sqrt result in a double before dividing: DoubleValue result = sqrt(value); return 1.0 / result; Anyway, this is crazy.
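A minimal C++ sketch of that guess, not the code from the patch. `long double` is used here to model an 80-bit x87 register (an assumption that holds on x86 with GCC/Clang, but not on every platform), and `volatile` is an illustrative way to force the intermediate value to be rounded to a 64-bit double. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Models the suspected x87 behavior: the square root stays in extended
// precision, the division is done in extended precision, and the value
// is rounded to double only once, at the end.
double inverseSqrtExtended(double value)
{
    assert(value > 0);
    long double root = std::sqrt(static_cast<long double>(value));
    return static_cast<double>(1.0L / root);
}

// The pattern from the comment above: materialize the square root as a
// double first, so the division starts from a 64-bit value, as it does
// when SSE2 instructions such as sqrtsd are used.
double inverseSqrtRounded(double value)
{
    assert(value > 0);
    volatile double root = std::sqrt(value); // forced rounding to double
    return 1.0 / root;
}
```

For inputs whose square root is exactly representable, such as 4.0, both functions agree. For other inputs they may differ in the last bit, which is the kind of difference the stress test reports.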

Yusuke Suzuki

Comment 2 2016-05-17 08:56:52 PDT

Created attachment 279126 [details] Patch

Yusuke Suzuki

Comment 3 2016-05-17 08:57:48 PDT

I hope this will fix the remaining failures. (I still cannot reproduce the failures).

Yusuke Suzuki

Comment 4 2016-05-17 09:03:09 PDT

Failing tests are
https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61021

stress/math-pow-stable-results.js.default: Exception: Failed opaquePow with base = 0.6931471805599453 exponent = 999 expected (9.65240607012971e-160) got (9.652406070129542e-160)
stress/math-pow-stable-results.js.default: ERROR: Unexpected exit code: 3

Yusuke Suzuki

Comment 5 2016-05-17 09:04:00 PDT

(In reply to comment #4)
> Failing tests are
> https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61021
>
> stress/math-pow-stable-results.js.default: Exception: Failed opaquePow with
> base = 0.6931471805599453 exponent = 999 expected (9.65240607012971e-160)
> got (9.652406070129542e-160)
> stress/math-pow-stable-results.js.default: ERROR: Unexpected exit code: 3

Not correct. Failing tests are
https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61357

stress/math-pow-stable-results.js.always-trigger-copy-phase: Exception: Failed constantExponentFunctions with base = 1.4142135623730951 exponent = -0.5 expected (0.8408964152537145) got (0.8408964152537146)
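To put the size of these mismatches in perspective, the distance between "expected" and "got" can be counted in representable doubles (ULPs). The helper below is a hypothetical illustration, not part of the test suite. It assumes IEEE 754 binary64 doubles and positive finite inputs, for which the bit patterns are ordered the same way as the values.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::int64_t), "expects 64-bit doubles");

// Number of representable doubles separating a and b.
// Only valid for positive, finite inputs.
std::int64_t ulpDistance(double a, double b)
{
    assert(a > 0 && b > 0);
    std::int64_t bitsA;
    std::int64_t bitsB;
    std::memcpy(&bitsA, &a, sizeof a); // reinterpret without aliasing issues
    std::memcpy(&bitsB, &b, sizeof b);
    return bitsA > bitsB ? bitsA - bitsB : bitsB - bitsA;
}
```

With this helper, `ulpDistance(0.8408964152537145, 0.8408964152537146)` is 1: the two values from the constantExponentFunctions failure are adjacent doubles. The opaquePow values from the earlier build are more than one ULP apart.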

Yusuke Suzuki

Comment 6 2016-05-17 09:06:45 PDT

Created attachment 279128 [details] Patch

Mark Lam

Comment 7 2016-05-17 09:45:22 PDT

Please run benchmarks also to make sure that there are no performance regressions. Thanks.

Mark Lam

Comment 8 2016-05-17 09:46:32 PDT

(In reply to comment #7)
> Please run benchmarks also to make sure that there are no performance
> regressions. Thanks.

For both 64-bit and 32-bit x86, since the files you changed touch both. Thanks.

Yusuke Suzuki

Comment 9 2016-05-18 17:01:59 PDT

Created attachment 279321 [details] Patch

Yusuke Suzuki

Comment 10 2016-05-18 17:04:11 PDT

Created attachment 279322 [details] Patch

Yusuke Suzuki

Comment 11 2016-05-18 17:04:55 PDT

x64 env perf results. Benchmark report for SunSpider, Octane, Kraken, and AsmBench on hanayamata. VMs tested: "baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu-master/Release/bin/jsc "patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu/Release/bin/jsc Collected 4 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. baseline patched SunSpider: 3d-cube 5.7039+-0.0483 ? 5.7140+-0.0152 ? 3d-morph 25.8652+-0.6922 ? 28.8619+-9.8995 ? might be 1.1159x slower 3d-raytrace 6.1587+-0.0398 6.1484+-0.1346 access-binary-trees 2.0796+-0.1432 2.0728+-0.0925 access-fannkuch 7.2028+-1.5473 ? 7.3298+-1.7196 ? might be 1.0176x slower access-nbody 2.7382+-0.0314 ? 2.7653+-0.0805 ? access-nsieve 3.1129+-0.2039 3.0450+-0.0766 might be 1.0223x faster bitops-3bit-bits-in-byte 1.1774+-0.2119 1.0483+-0.0365 might be 1.1231x faster bitops-bits-in-byte 2.7760+-0.5369 2.4948+-0.0598 might be 1.1127x faster bitops-bitwise-and 1.9236+-0.0109 ? 1.9527+-0.1065 ? might be 1.0151x slower bitops-nsieve-bits 3.0204+-0.0286 ? 3.6378+-1.1363 ? might be 1.2044x slower controlflow-recursive 2.5072+-0.3621 ? 2.5960+-0.4151 ? might be 1.0354x slower crypto-aes 4.8640+-0.0651 4.8064+-0.0414 might be 1.0120x faster crypto-md5 2.5438+-0.2512 2.5033+-0.0910 might be 1.0162x faster crypto-sha1 2.5356+-0.5588 2.3947+-0.0579 might be 1.0588x faster date-format-tofte 10.0705+-1.6971 9.9487+-0.5134 might be 1.0122x faster date-format-xparb 5.8340+-0.0633 5.7159+-0.0678 might be 1.0207x faster math-cordic 2.9111+-0.0637 2.8884+-0.0794 math-partial-sums 10.4175+-0.0931 10.3715+-0.0223 math-spectral-norm 2.1515+-0.2492 2.0825+-0.1674 might be 1.0331x faster regexp-dna 7.2617+-0.0914 ? 7.2688+-0.1228 ? 
string-base64 3.9536+-0.0656 ? 3.9791+-0.0283 ? string-fasta 6.5701+-1.4645 6.0922+-0.2123 might be 1.0784x faster string-tagcloud 9.0508+-0.2073 8.9454+-0.1207 might be 1.0118x faster string-unpack-code 18.7492+-0.1751 ? 18.9697+-0.3352 ? might be 1.0118x slower string-validate-input 4.0176+-0.0196 ? 4.4243+-0.9320 ? might be 1.1012x slower <arithmetic> 5.9691+-0.0412 ? 6.0791+-0.3760 ? might be 1.0184x slower baseline patched Octane: encrypt 0.19001+-0.02718 0.18367+-0.00637 might be 1.0345x faster decrypt 3.20132+-0.54592 3.05386+-0.05815 might be 1.0483x faster deltablue x2 0.14845+-0.00094 ? 0.14898+-0.00174 ? earley 0.33373+-0.00041 0.33326+-0.00149 boyer 5.29425+-0.04791 5.28735+-0.02679 navier-stokes x2 4.82915+-0.02802 ? 4.83875+-0.05857 ? raytrace x2 0.93219+-0.01116 0.92986+-0.01973 richards x2 0.09581+-0.00267 ? 0.09629+-0.00140 ? splay x2 0.39077+-0.00360 ? 0.39307+-0.00116 ? regexp x2 18.37962+-0.21040 ? 18.53484+-0.29797 ? pdfjs x2 41.24259+-1.64031 40.60014+-0.20735 might be 1.0158x faster mandreel x2 48.21639+-0.24653 ? 48.47195+-0.83248 ? gbemu x2 34.03689+-1.03936 33.64170+-1.09649 might be 1.0117x faster closure 0.59235+-0.02495 0.58497+-0.00374 might be 1.0126x faster jquery 7.43578+-0.07755 ? 7.46192+-0.16254 ? box2d x2 14.57743+-0.73278 14.43941+-0.21710 zlib x2 363.95774+-10.46326 337.86078+-18.95572 might be 1.0772x faster typescript x2 763.78058+-34.76278 758.51379+-22.00349 <geometric> 5.78664+-0.05987 5.73540+-0.03937 might be 1.0089x faster baseline patched Kraken: ai-astar 97.593+-4.001 97.556+-5.906 audio-beat-detection 45.228+-0.118 ? 45.330+-0.109 ? audio-dft 123.409+-0.472 ? 123.467+-0.626 ? audio-fft 37.225+-0.104 37.207+-0.009 audio-oscillator 53.366+-0.015 ? 53.366+-0.059 ? imaging-darkroom 88.425+-0.053 88.368+-0.163 imaging-desaturate 56.225+-0.377 ? 56.709+-0.756 ? imaging-gaussian-blur 78.331+-13.432 74.993+-10.464 might be 1.0445x faster json-parse-financial 41.831+-0.072 ! 42.112+-0.145 ! 
definitely 1.0067x slower json-stringify-tinderbox 25.189+-0.123 24.962+-0.114 stanford-crypto-aes 43.227+-0.888 ? 43.456+-1.228 ? stanford-crypto-ccm 41.631+-1.823 40.955+-1.638 might be 1.0165x faster stanford-crypto-pbkdf2 104.177+-0.448 ? 105.022+-2.521 ? stanford-crypto-sha256-iterative 37.937+-0.102 37.937+-0.200 <arithmetic> 62.414+-0.934 62.246+-0.986 might be 1.0027x faster baseline patched AsmBench: towers.c 273.5890+-0.8769 272.9515+-1.7687 n-body.c 910.5629+-19.0203 ? 916.0066+-12.5231 ? float-mm.c 727.9011+-3.8672 ? 730.2297+-1.7173 ? container.cpp 3028.1337+-64.6283 ? 3037.6306+-56.0767 ? quicksort.c 430.5352+-1.1251 430.1847+-1.6893 gcc-loops.cpp 4149.4720+-97.7748 ? 4165.3242+-161.5086 ? bigfib.cpp 449.1030+-5.6904 ? 452.6222+-18.4055 ? hash-map 153.5330+-2.7146 151.2027+-1.3597 might be 1.0154x faster dry.c 483.3406+-45.4044 467.7460+-24.7893 might be 1.0333x faster <geometric> 683.7046+-6.3584 681.6894+-9.6798 might be 1.0030x faster baseline patched Geomean of preferred means: <scaled-result> 34.8431+-0.2258 ? 34.8718+-0.5628 ? might be 1.0008x slower

Yusuke Suzuki

Comment 12 2016-05-18 17:06:07 PDT

x86 32bit SSE2 enabled build (-mfpmath=sse and -msse2 or later are passed to the compiler) Benchmark report for SunSpider, Octane, Kraken, and AsmBench on 32bit. VMs tested: "baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu32-master/Release/bin/jsc "patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu32/Release/bin/jsc Collected 4 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. baseline patched SunSpider: 3d-cube 17.2505+-0.0981 ? 17.2927+-0.1872 ? 3d-morph 13.8463+-0.0365 13.8152+-0.0438 3d-raytrace 11.4933+-3.5854 10.3903+-0.0686 might be 1.1062x faster access-binary-trees 2.6045+-0.0438 2.5560+-0.0936 might be 1.0190x faster access-fannkuch 7.0265+-0.1793 6.9701+-0.0622 access-nbody 4.8819+-0.0138 4.8812+-0.0109 access-nsieve 3.3465+-0.7865 3.0980+-0.0206 might be 1.0802x faster bitops-3bit-bits-in-byte 1.6490+-0.0351 1.6237+-0.0127 might be 1.0156x faster bitops-bits-in-byte 2.6247+-0.0709 ? 2.6371+-0.0761 ? bitops-bitwise-and 1.9934+-0.1417 1.9446+-0.0067 might be 1.0251x faster bitops-nsieve-bits 3.7030+-0.0094 ? 3.7060+-0.0205 ? controlflow-recursive 3.4000+-0.5472 3.2137+-0.0197 might be 1.0580x faster crypto-aes 10.4076+-6.0480 8.4325+-0.0389 might be 1.2342x faster crypto-md5 5.1667+-0.0305 5.1293+-0.0105 crypto-sha1 4.3883+-0.0141 4.3865+-0.0208 date-format-tofte 11.6183+-0.1375 ? 11.6298+-0.4792 ? date-format-xparb 10.0130+-0.0627 10.0063+-0.1523 math-cordic 3.6779+-0.7567 ? 3.9900+-1.7219 ? might be 1.0849x slower math-partial-sums 13.1448+-0.3463 13.0922+-0.0298 math-spectral-norm 2.3122+-0.0176 ? 2.3129+-0.1113 ? regexp-dna 7.4440+-0.1518 ? 7.4880+-0.2611 ? string-base64 4.7435+-0.0288 ? 4.7772+-0.0339 ? string-fasta 10.8648+-0.0833 ? 11.0375+-0.5606 ? 
might be 1.0159x slower string-tagcloud 13.7762+-0.2045 ? 13.8945+-0.0665 ? string-unpack-code 24.2105+-0.1533 ? 24.8459+-1.6690 ? might be 1.0262x slower string-validate-input 9.0511+-0.0272 8.9495+-0.3411 might be 1.0114x faster <arithmetic> 7.8707+-0.2654 7.7731+-0.0579 might be 1.0126x faster baseline patched Octane: encrypt 0.50798+-0.00791 0.49960+-0.01261 might be 1.0168x faster decrypt 17.11489+-0.42694 16.96305+-1.04268 deltablue x2 0.34733+-0.00346 0.34548+-0.00075 earley 0.60628+-0.00075 ? 0.60648+-0.00218 ? boyer 8.15865+-0.01372 8.15443+-0.02336 navier-stokes x2 6.63522+-1.07711 6.30414+-0.02100 might be 1.0525x faster raytrace x2 3.16797+-0.02804 ? 3.20975+-0.05828 ? might be 1.0132x slower richards x2 0.17076+-0.00079 ? 0.18036+-0.03107 ? might be 1.0562x slower splay x2 0.54739+-0.00203 ? 0.55064+-0.00407 ? regexp x2 20.53952+-0.10831 ? 20.81623+-0.18394 ? might be 1.0135x slower pdfjs x2 50.81172+-0.31701 ? 51.22708+-0.42641 ? mandreel x2 95.21398+-2.40420 ? 95.81830+-2.24623 ? gbemu x2 56.51416+-1.01726 56.48069+-0.19904 closure 0.63658+-0.02393 0.62312+-0.00578 might be 1.0216x faster jquery 8.10269+-0.13363 8.07504+-0.13617 box2d x2 19.52336+-0.55404 ? 19.56877+-0.58168 ? zlib x2 649.45215+-12.76747 649.21053+-8.40949 typescript x2 1190.06122+-64.58508 1175.71649+-28.84543 might be 1.0122x faster <geometric> 9.89802+-0.09712 ? 9.90406+-0.16350 ? might be 1.0006x slower baseline patched Kraken: ai-astar 194.487+-0.926 ? 195.598+-2.302 ? audio-beat-detection 72.518+-0.322 ? 72.559+-0.149 ? audio-dft 121.121+-0.516 ? 121.390+-0.868 ? audio-fft 60.038+-0.703 59.870+-0.171 audio-oscillator 93.826+-0.363 93.383+-0.282 imaging-darkroom 170.330+-1.036 ? 170.883+-2.320 ? imaging-desaturate 94.928+-1.026 94.546+-0.246 imaging-gaussian-blur 188.115+-0.274 ? 191.241+-5.520 ? might be 1.0166x slower json-parse-financial 65.857+-0.174 65.785+-0.215 json-stringify-tinderbox 29.373+-0.071 ? 29.420+-0.099 ? stanford-crypto-aes 65.178+-0.828 ? 65.435+-0.379 ? 
stanford-crypto-ccm 49.151+-0.351 49.094+-1.002 stanford-crypto-pbkdf2 128.655+-0.426 ? 129.064+-0.315 ? stanford-crypto-sha256-iterative 46.690+-0.166 46.304+-1.147 <arithmetic> 98.591+-0.128 ? 98.898+-0.274 ? might be 1.0031x slower baseline patched AsmBench: towers.c 402.2408+-5.2536 399.8250+-1.3425 n-body.c 1281.5928+-16.6677 ? 1294.8248+-47.9531 ? might be 1.0103x slower float-mm.c 1169.2465+-7.4986 1162.4531+-15.9992 container.cpp 4386.6112+-42.5811 ? 4401.5446+-94.5153 ? quicksort.c 716.2153+-0.1322 715.4910+-2.1336 gcc-loops.cpp 9724.9072+-444.4705 ? 10052.8919+-1323.6404 ? might be 1.0337x slower bigfib.cpp 905.2594+-18.0411 ? 906.2895+-7.8578 ? hash-map 197.6210+-3.3906 196.5555+-5.1078 dry.c 944.8693+-12.7314 ? 949.3920+-4.6332 ? <geometric> 1134.4236+-6.5980 ? 1138.4928+-12.8227 ? might be 1.0036x slower baseline patched Geomean of preferred means: <scaled-result> 54.3285+-0.4786 54.2594+-0.1834 might be 1.0013x faster

Yusuke Suzuki

Comment 13 2016-05-18 17:07:24 PDT

x86 32bit without SSE2 build option. In this case, operationMathPow is compiled in x87 code. But JIT can use SSE2. (This is typical i686 binary package build configuration) Benchmark report for SunSpider, Octane, Kraken, and AsmBench on 32bit. VMs tested: "baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpulegacy32-master/Release/bin/jsc "patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpulegacy32/Release/bin/jsc Collected 4 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds. baseline patched SunSpider: 3d-cube 17.1483+-0.0787 ? 17.1778+-0.1090 ? 3d-morph 13.8268+-0.0273 ? 13.8322+-0.0388 ? 3d-raytrace 10.2777+-0.0496 ? 10.3127+-0.0880 ? access-binary-trees 2.5810+-0.0111 ? 3.0062+-1.2568 ? might be 1.1648x slower access-fannkuch 6.9170+-0.0566 ? 8.2241+-4.0918 ? might be 1.1890x slower access-nbody 4.8837+-0.0224 4.8760+-0.0316 access-nsieve 3.4854+-1.1250 3.1277+-0.0189 might be 1.1144x faster bitops-3bit-bits-in-byte 1.6288+-0.0100 ? 1.6328+-0.0284 ? bitops-bits-in-byte 2.5909+-0.0625 ? 2.6307+-0.0490 ? might be 1.0154x slower bitops-bitwise-and 1.9580+-0.0194 1.9511+-0.0253 bitops-nsieve-bits 3.7618+-0.2372 3.6875+-0.0121 might be 1.0202x faster controlflow-recursive 3.1976+-0.0115 ? 3.2242+-0.0443 ? crypto-aes 8.3799+-0.0268 ? 8.4008+-0.0311 ? crypto-md5 5.2190+-0.0194 ? 5.7831+-1.8926 ? might be 1.1081x slower crypto-sha1 4.4047+-0.0199 4.3888+-0.0380 date-format-tofte 12.0606+-0.2065 12.0417+-0.4251 date-format-xparb 10.6373+-0.1105 ? 11.1366+-1.2784 ? might be 1.0469x slower math-cordic 3.4521+-0.0054 ? 3.6575+-0.5546 ? might be 1.0595x slower math-partial-sums 13.3063+-0.0259 ? 13.3249+-0.0434 ? math-spectral-norm 2.2966+-0.0126 ? 2.3030+-0.0158 ? 
regexp-dna 7.2940+-0.0697 ? 7.9797+-2.2196 ? might be 1.0940x slower string-base64 4.5597+-0.0303 ? 4.5825+-0.0529 ? string-fasta 10.8441+-0.1349 ? 10.9771+-0.0616 ? might be 1.0123x slower string-tagcloud 13.9378+-0.5208 13.6143+-0.1270 might be 1.0238x faster string-unpack-code 24.5435+-0.5817 24.3825+-0.1327 string-validate-input 8.9327+-0.0204 ? 8.9642+-0.0741 ? <arithmetic> 7.7741+-0.0543 ? 7.8931+-0.2313 ? might be 1.0153x slower baseline patched Octane: encrypt 0.50270+-0.00820 ? 0.50358+-0.00986 ? decrypt 17.03379+-0.58921 16.94320+-0.22933 deltablue x2 0.34613+-0.00268 ? 0.34985+-0.01265 ? might be 1.0108x slower earley 0.60980+-0.00353 0.60855+-0.00187 boyer 8.17677+-0.05106 ? 8.24389+-0.25497 ? navier-stokes x2 6.34773+-0.16032 6.33005+-0.06629 raytrace x2 3.18275+-0.02789 3.16339+-0.01329 richards x2 0.17084+-0.00047 ? 0.17128+-0.00186 ? splay x2 0.55192+-0.00319 ? 0.55711+-0.00999 ? regexp x2 20.47882+-0.03197 ? 20.50719+-0.04801 ? pdfjs x2 49.78446+-0.71499 ? 50.03719+-0.57657 ? mandreel x2 96.24993+-2.36668 95.15272+-3.73660 might be 1.0115x faster gbemu x2 57.17923+-0.84760 ? 57.83685+-3.38938 ? might be 1.0115x slower closure 0.61193+-0.00831 0.61152+-0.00440 jquery 8.04075+-0.07221 ? 8.05180+-0.27270 ? box2d x2 19.65805+-0.55065 ? 19.67844+-0.54372 ? zlib x2 649.54014+-18.33616 ? 651.51095+-12.07980 ? typescript x2 1212.65424+-137.30415 1176.06445+-36.52192 might be 1.0311x faster <geometric> 9.87533+-0.10715 9.87267+-0.06147 might be 1.0003x faster baseline patched Kraken: ai-astar 196.674+-1.291 195.400+-1.959 audio-beat-detection 72.875+-0.507 72.713+-0.376 audio-dft 121.780+-2.100 121.727+-0.928 audio-fft 60.188+-0.406 ? 60.355+-0.326 ? audio-oscillator 93.959+-0.172 93.870+-0.190 imaging-darkroom 171.704+-2.335 171.427+-2.946 imaging-desaturate 94.476+-0.146 ? 94.891+-0.369 ? imaging-gaussian-blur 223.107+-102.893 188.462+-0.801 might be 1.1838x faster json-parse-financial 63.387+-0.144 ? 67.392+-8.681 ? 
might be 1.0632x slower json-stringify-tinderbox 30.261+-0.371 ^ 29.797+-0.039 ^ definitely 1.0156x faster stanford-crypto-aes 64.352+-0.881 64.080+-1.158 stanford-crypto-ccm 53.295+-0.659 52.984+-0.797 stanford-crypto-pbkdf2 122.665+-7.550 120.230+-2.143 might be 1.0203x faster stanford-crypto-sha256-iterative 44.667+-0.286 ? 44.683+-0.284 ? <arithmetic> 100.956+-7.020 98.429+-0.637 might be 1.0257x faster baseline patched AsmBench: towers.c 404.3062+-22.0399 400.0767+-4.9144 might be 1.0106x faster n-body.c 1293.0051+-89.0780 1286.7296+-22.7946 float-mm.c 1158.9196+-20.8813 1158.6651+-26.1376 container.cpp 4395.9618+-78.0860 4355.1048+-26.0590 quicksort.c 706.4705+-18.6609 ? 716.0643+-0.1293 ? might be 1.0136x slower gcc-loops.cpp 9314.9860+-109.9418 ? 9570.0922+-609.9214 ? might be 1.0274x slower bigfib.cpp 918.0082+-4.9582 902.3481+-23.1366 might be 1.0174x faster hash-map 196.8177+-0.7598 195.6277+-4.1308 dry.c 939.7477+-14.6277 937.1147+-16.0467 <geometric> 1128.6581+-10.1489 1127.3911+-8.0006 might be 1.0011x faster baseline patched Geomean of preferred means: <scaled-result> 54.3769+-0.9751 54.2259+-0.3772 might be 1.0028x faster

Yusuke Suzuki

Comment 14 2016-05-18 17:08:50 PDT

SunSpider in 32bit (LXC container) is noisy, so I took the --outer=30 version.

x86 32bit SSE2 enabled build.

Benchmark report for SunSpider on 32bit.

VMs tested:
"baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu32-master/Release/bin/jsc
"patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu32/Release/bin/jsc

Collected 30 samples per benchmark/VM, with 30 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                            baseline             patched

3d-cube                     17.2396+-0.0167      17.2353+-0.0128
3d-morph                    13.8706+-0.0619   ?  13.8931+-0.0261   ?
3d-raytrace                 10.3938+-0.0130      10.3787+-0.0184
access-binary-trees         2.5792+-0.0058    ?  2.6315+-0.0775    ?  might be 1.0203x slower
access-fannkuch             7.0246+-0.0466       6.9819+-0.0127
access-nbody                4.8804+-0.0056       4.8764+-0.0071
access-nsieve               3.1030+-0.0059    ?  3.1043+-0.0042    ?
bitops-3bit-bits-in-byte    1.6326+-0.0041       1.6314+-0.0037
bitops-bits-in-byte         2.6133+-0.0136    ?  2.6357+-0.0444    ?
bitops-bitwise-and          1.9511+-0.0079       1.9494+-0.0056
bitops-nsieve-bits          3.7137+-0.0063    ?  3.7677+-0.1055    ?  might be 1.0145x slower
controlflow-recursive       3.2185+-0.0057       3.2125+-0.0055
crypto-aes                  8.4629+-0.0335       8.4601+-0.0343
crypto-md5                  5.1449+-0.0076       5.1401+-0.0070
crypto-sha1                 4.3817+-0.0137       4.3725+-0.0129
date-format-tofte           11.6912+-0.0603      11.6471+-0.0725
date-format-xparb           10.0545+-0.0462      10.0262+-0.0283
math-cordic                 3.4548+-0.0082    ?  3.4631+-0.0070    ?
math-partial-sums           13.0911+-0.0131   ?  13.0988+-0.0160   ?
math-spectral-norm          2.3083+-0.0042    ?  2.3115+-0.0049    ?
regexp-dna                  7.4086+-0.0186       7.3988+-0.0103
string-base64               4.7659+-0.0158       4.7647+-0.0164
string-fasta                10.8456+-0.0387   ?  10.8967+-0.0279   ?
string-tagcloud             13.8135+-0.0283   !  13.9128+-0.0212   !  definitely 1.0072x slower
string-unpack-code          24.2019+-0.0394   ?  24.2541+-0.1327   ?
string-validate-input       9.0753+-0.0238       9.0702+-0.0253

<arithmetic>                7.7277+-0.0058    ?  7.7352+-0.0067    ?  might be 1.0010x slower

x86 32bit SSE2 disabled build.

Benchmark report for SunSpider on 32bit.

VMs tested:
"baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpulegacy32-master/Release/bin/jsc
"patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpulegacy32/Release/bin/jsc

Collected 30 samples per benchmark/VM, with 30 VM invocations per benchmark. Emitted a call to gc() between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                            baseline             patched

3d-cube                     17.1167+-0.0184   ?  17.2956+-0.2045   ?  might be 1.0105x slower
3d-morph                    13.8662+-0.0280   ?  13.8866+-0.0272   ?
3d-raytrace                 10.2790+-0.0118   !  10.3195+-0.0136   !  definitely 1.0039x slower
access-binary-trees         2.5875+-0.0073    ?  2.5943+-0.0061    ?
access-fannkuch             6.9522+-0.0400       6.9369+-0.0161
access-nbody                4.8806+-0.0076       4.8690+-0.0068
access-nsieve               3.1309+-0.0100    ?  3.1375+-0.0055    ?
bitops-3bit-bits-in-byte    1.6320+-0.0043       1.6298+-0.0035
bitops-bits-in-byte         2.5993+-0.0135    ?  2.6180+-0.0237    ?
bitops-bitwise-and          1.9474+-0.0061    ?  1.9474+-0.0056    ?
bitops-nsieve-bits          3.6956+-0.0060    ?  3.7466+-0.0747    ?  might be 1.0138x slower
controlflow-recursive       3.2078+-0.0054    ?  3.2161+-0.0064    ?
crypto-aes                  8.4773+-0.0772       8.4586+-0.0379
crypto-md5                  5.2203+-0.0110    ^  5.1957+-0.0075    ^  definitely 1.0047x faster
crypto-sha1                 4.3942+-0.0110       4.3941+-0.0081
date-format-tofte           12.0185+-0.0851      11.9945+-0.0805
date-format-xparb           10.6577+-0.0333   !  10.7290+-0.0211   !  definitely 1.0067x slower
math-cordic                 3.4558+-0.0076    ?  3.4978+-0.0785    ?  might be 1.0121x slower
math-partial-sums           13.2987+-0.0164      13.2937+-0.0364
math-spectral-norm          2.3081+-0.0057    ?  2.3091+-0.0083    ?
regexp-dna                  7.2854+-0.0155       7.2758+-0.0152
string-base64               4.5988+-0.0283       4.5922+-0.0140
string-fasta                10.8793+-0.0352   ?  11.2162+-0.4723   ?  might be 1.0310x slower
string-tagcloud             13.6521+-0.0294   ?  13.6553+-0.0377   ?
string-unpack-code          24.5895+-0.2590      24.3984+-0.0852
string-validate-input       8.9315+-0.0178    ?  8.9489+-0.0273    ?

<arithmetic>                7.7562+-0.0110    ?  7.7753+-0.0199    ?  might be 1.0025x slower

Yusuke Suzuki

Comment 15 2016-05-18 17:10:31 PDT

The benchmark most likely to be affected is SunSpider math-partial-sums, which depends heavily on Math.pow performance. In the results above, its performance is neutral between baseline and patched.

Build Bot

Comment 16 2016-05-18 18:03:46 PDT

Comment on attachment 279322 [details] Patch Attachment 279322 [details] did not pass mac-ews (mac): Output: http://webkit-queues.webkit.org/results/1345218 New failing tests: storage/websql/database-lock-after-reload.html media/track/track-in-band.html

Build Bot

Comment 17 2016-05-18 18:03:50 PDT

Created attachment 279329 [details] Archive of layout-test-results from ews103 for mac-yosemite The attached test failures were seen while running run-webkit-tests on the mac-ews. Bot: ews103 Port: mac-yosemite Platform: Mac OS X 10.10.5

Alexey Proskuryakov

Comment 18 2016-05-18 22:57:16 PDT

> storage/websql/database-lock-after-reload.html
> media/track/track-in-band.html

That's a lot of EWS flakiness for one patch :(

Yusuke Suzuki

Comment 19 2016-05-19 21:08:22 PDT

(In reply to comment #18)
> > storage/websql/database-lock-after-reload.html
> > media/track/track-in-band.html
>
> That's a lot of EWS flakiness for one patch :(

I think this error is not related to this patch. The mac bot failed twice, but the crashing tests are random.

Yusuke Suzuki

Comment 20 2016-05-26 02:51:19 PDT

Comment on attachment 279322 [details] Patch

I thought setting the FPU mode within this function's scope would be simpler than the current implementation. Setting the FPU mode within a limited function is in fact what glibc's sqrt does: to give std::sqrt 64-bit precision, glibc changes the FPU mode inside the function.

Of course, if this causes a significant performance degradation, we will need to fall back to the current patch. But I'll try it.
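For illustration, scoping the x87 precision control to one function could look roughly like the sketch below. It is neither the code from any attachment here nor glibc's implementation. It assumes glibc's `<fpu_control.h>` macros (`_FPU_GETCW`, `_FPU_SETCW`, `_FPU_EXTENDED`, `_FPU_DOUBLE`) on an x86 target and compiles to a no-op elsewhere. The class and function names are hypothetical, and `long double` again stands in for values held in x87 registers.

```cpp
#include <cassert>
#include <cmath>

#if defined(__GLIBC__) && (defined(__i386__) || defined(__x86_64__))
#include <fpu_control.h>
#define HAS_X87_CONTROL_WORD 1
#else
#define HAS_X87_CONTROL_WORD 0
#endif

// Reads the x87 control word, or returns 0 where it is not available.
unsigned currentX87ControlWord()
{
#if HAS_X87_CONTROL_WORD
    fpu_control_t word;
    _FPU_GETCW(word);
    return word;
#else
    return 0;
#endif
}

// While an instance is alive, x87 results are rounded to a 53-bit
// significand. The previous control word is restored on destruction.
class ScopedX87DoublePrecision {
public:
    ScopedX87DoublePrecision()
    {
#if HAS_X87_CONTROL_WORD
        _FPU_GETCW(m_saved);
        // _FPU_EXTENDED covers both precision-control bits.
        fpu_control_t narrowed = static_cast<fpu_control_t>((m_saved & ~_FPU_EXTENDED) | _FPU_DOUBLE);
        _FPU_SETCW(narrowed);
#endif
    }

    ~ScopedX87DoublePrecision()
    {
#if HAS_X87_CONTROL_WORD
        _FPU_SETCW(m_saved);
#endif
    }

    ScopedX87DoublePrecision(const ScopedX87DoublePrecision&) = delete;
    ScopedX87DoublePrecision& operator=(const ScopedX87DoublePrecision&) = delete;

private:
#if HAS_X87_CONTROL_WORD
    fpu_control_t m_saved { 0 };
#endif
};

double inverseSqrtWithScopedPrecision(double value)
{
    assert(value > 0);
    ScopedX87DoublePrecision scope;
    // volatile keeps the arithmetic between the two control-word writes.
    volatile long double input = value;
    volatile long double root = std::sqrt(input);
    volatile long double result = 1.0L / root;
    return static_cast<double>(result);
}
```

The two control-word writes on every call are the extra cost this approach would have to pay on a hot path such as Math.pow.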

Yusuke Suzuki

Comment 21 2016-05-26 21:46:01 PDT

(In reply to comment #20)
> Comment on attachment 279322 [details]
> Patch
>
> I thought setting FPU mode in this function scope is rather simple than the
> current implementation.
> And setting FPU mode in the limited function is actually used way in glibc's
> sqrt.
> In glibc, to make std::sqrt 64bit precision, it changes the FPU mode in the
> function.
>
> Of course, if it causes significant performance degradation, we need to
> reconsider about the current patch.
> But I'll try it.

It turned out that this approach causes a 2% regression in the x86 x87 environment. So let's go with the current patch.

Yusuke Suzuki

Comment 22 2016-05-26 21:48:08 PDT

(In reply to comment #21)
> (In reply to comment #20)
> > Comment on attachment 279322 [details]
> > Patch
> >
> > I thought setting FPU mode in this function scope is rather simple than the
> > current implementation.
> > And setting FPU mode in the limited function is actually used way in glibc's
> > sqrt.
> > In glibc, to make std::sqrt 64bit precision, it changes the FPU mode in the
> > function.
> >
> > Of course, if it causes significant performance degradation, we need to
> > reconsider about the current patch.
> > But I'll try it.
>
> This approach figured out that it causes 2% regression in x86 x87
> environment.
> So go with the current patch.

That regression was measured in SunSpider/LongSpider's math-partial-sums.

Attachments
Patch (17.54 KB, patch) 2016-05-17 08:56 PDT, Yusuke Suzuki, no flags
Patch (17.55 KB, patch) 2016-05-17 09:06 PDT, Yusuke Suzuki, no flags
Patch (20.03 KB, patch) 2016-05-18 17:01 PDT, Yusuke Suzuki, no flags
Patch (19.94 KB, patch) 2016-05-18 17:04 PDT, Yusuke Suzuki, no flags
Archive of layout-test-results from ews103 for mac-yosemite (870.43 KB, application/zip) 2016-05-18 18:03 PDT, Build Bot, no flags