Bug 157787 - sqrt and pow should produce consistent results even in SSE2 available x86 32bit environment
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Yusuke Suzuki
URL:
Keywords:
Depends on:
Blocks: 157168
Reported: 2016-05-16 23:32 PDT by Yusuke Suzuki
Modified: 2016-05-28 05:22 PDT (History)

See Also:


Attachments
Patch (17.54 KB, patch)
2016-05-17 08:56 PDT, Yusuke Suzuki
Patch (17.55 KB, patch)
2016-05-17 09:06 PDT, Yusuke Suzuki
Patch (20.03 KB, patch)
2016-05-18 17:01 PDT, Yusuke Suzuki
Patch (19.94 KB, patch)
2016-05-18 17:04 PDT, Yusuke Suzuki
Archive of layout-test-results from ews103 for mac-yosemite (870.43 KB, application/zip)
2016-05-18 18:03 PDT, Build Bot

Description Yusuke Suzuki 2016-05-16 23:32:56 PDT
In an x86 32-bit environment where SSE2 is available, the following inconsistency occurs.

1. The C runtime is compiled for x87

The C runtime is compiled to use x87 because the build does not assume that SSE2 is present.
This is expected: binary packages (e.g. Debian i686) have to be conservative about processor capabilities.

2. The DFG JIT emits floating-point operations in SSE2

The DFG JIT, however, consults CPUID at runtime to determine whether SSE2 is available.
As a result, while the C runtime is compiled for x87, DFG JIT code uses SSE2 operations (such as sqrtsd and mulsd).

x87 computes with 80-bit precision while SSE2 computes with 64-bit precision, so the C runtime and DFG JIT code produce inconsistent results.
Either 80-bit or 64-bit precision would be acceptable, but the same arguments must at least produce the same result in every JIT tier.
Currently, DFG JIT code produces 64-bit precision values while the C runtime produces 80-bit precision values.
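
The effect described above can be emulated on a machine that only uses SSE2 math. The following C++ sketch is illustrative and is not taken from the patch. It uses long double as a stand-in for the x87 registers, which is an assumption that holds for GCC and Clang on x86 (where long double is the 80-bit format) but not for every compiler. All function names are hypothetical.

```cpp
#include <cmath>

// Hypothetical stand-in for SSE2-style code: on an SSE2 build every
// intermediate result is rounded to a 64-bit double. (On an x87 build the
// compiler may still keep `root` in an 80-bit register, which is exactly
// the situation this bug is about.)
double inverseSqrtRoundedEachStep(double value)
{
    double root = std::sqrt(value);
    return 1.0 / root;
}

// Hypothetical stand-in for x87-style code: the intermediate stays in
// long double and is rounded to double only once, at the end.
double inverseSqrtExtended(double value)
{
    long double root = std::sqrt(static_cast<long double>(value));
    return static_cast<double>(1.0L / root);
}

// Counts how many of the inputs start, start + step, ... (count values)
// give different results in the two versions above.
int countMismatches(double start, double step, int count)
{
    int mismatches = 0;
    for (int i = 0; i < count; ++i) {
        double value = start + i * step;
        if (inverseSqrtRoundedEachStep(value) != inverseSqrtExtended(value))
            ++mismatches;
    }
    return mismatches;
}
```

Calling countMismatches(1.0, 0.001, 1000) gives a rough idea of how often the last bit differs. The count depends on the platform's long double format, so no particular value is claimed here.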
Comment 1 Yusuke Suzuki 2016-05-17 07:26:48 PDT
If my guess is correct, the result of sqrt stays in an x87 register, so 1.0 / sqrt(value) is calculated in 80-bit precision.
A solution is to store the result in a double first:

DoubleValue result = sqrt(value);
return 1.0 / result;
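
One way to make that intent explicit in C++ is to force each intermediate through a 64-bit memory slot. This is only a sketch of the idea in this comment, not the code in the attached patch. roundToDouble and inverseSqrt are hypothetical names, and the use of volatile is an assumption about how to stop the compiler from keeping the value in an x87 register.

```cpp
#include <cmath>

// Hypothetical helper: writing to a volatile double forces a real 64-bit
// store, so any excess (80-bit) precision is rounded away at this point.
static inline double roundToDouble(double value)
{
    volatile double stored = value;
    return stored;
}

// Hypothetical example: both the square root and the division are rounded
// to 64 bits, which is the rounding SSE2 sqrtsd/divsd perform.
double inverseSqrt(double value)
{
    double root = roundToDouble(std::sqrt(value));
    return roundToDouble(1.0 / root);
}
```

The volatile store costs a memory round trip per operation, which is one reason benchmark numbers matter for a change like this.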

Anyway, this is crazy.
Comment 2 Yusuke Suzuki 2016-05-17 08:56:52 PDT
Created attachment 279126 [details]
Patch
Comment 3 Yusuke Suzuki 2016-05-17 08:57:48 PDT
I hope this will fix the remaining failures. (I still cannot reproduce the failures).
Comment 4 Yusuke Suzuki 2016-05-17 09:03:09 PDT
Failing tests are https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61021

stress/math-pow-stable-results.js.default: Exception: Failed opaquePow with base = 0.6931471805599453 exponent = 999 expected (9.65240607012971e-160) got (9.652406070129542e-160)
stress/math-pow-stable-results.js.default: ERROR: Unexpected exit code: 3
Comment 5 Yusuke Suzuki 2016-05-17 09:04:00 PDT
(In reply to comment #4)
> Failing tests are
> https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61021
> 
> stress/math-pow-stable-results.js.default: Exception: Failed opaquePow with
> base = 0.6931471805599453 exponent = 999 expected (9.65240607012971e-160)
> got (9.652406070129542e-160)
> stress/math-pow-stable-results.js.default: ERROR: Unexpected exit code: 3

That was not correct. The failing tests are https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61357

stress/math-pow-stable-results.js.always-trigger-copy-phase: Exception: Failed constantExponentFunctions with base = 1.4142135623730951 exponent = -0.5 expected (0.8408964152537145) got (0.8408964152537146)
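
The expected and actual values in this failure differ only in the last printed digit. A small helper that measures the distance between two doubles in units in the last place (ULPs) makes such differences easier to reason about. The sketch below is illustrative, ulpDistance is a hypothetical name, and it assumes both inputs are finite, positive IEEE 754 doubles.

```cpp
#include <cstdint>
#include <cstring>

// Returns how many representable doubles lie between a and b (0 if equal).
// Assumes finite, positive IEEE 754 doubles, for which the raw bit patterns
// are ordered the same way as the values themselves.
std::int64_t ulpDistance(double a, double b)
{
    std::int64_t bitsA;
    std::int64_t bitsB;
    std::memcpy(&bitsA, &a, sizeof a); // well-defined way to read the bits
    std::memcpy(&bitsB, &b, sizeof b);
    return bitsA > bitsB ? bitsA - bitsB : bitsB - bitsA;
}
```

For example, ulpDistance(0.8408964152537145, 0.8408964152537146) reports how far apart the two results in the failure above are.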
Comment 6 Yusuke Suzuki 2016-05-17 09:06:45 PDT
Created attachment 279128 [details]
Patch
Comment 7 Mark Lam 2016-05-17 09:45:22 PDT
Please run benchmarks also to make sure that there are no performance regressions.  Thanks.
Comment 8 Mark Lam 2016-05-17 09:46:32 PDT
(In reply to comment #7)
> Please run benchmarks also to make sure that there are no performance
> regressions.  Thanks.

For both 64-bit and 32-bit x86, since the files you changed touch both.  Thanks.
Comment 9 Yusuke Suzuki 2016-05-18 17:01:59 PDT
Created attachment 279321 [details]
Patch
Comment 10 Yusuke Suzuki 2016-05-18 17:04:11 PDT
Created attachment 279322 [details]
Patch
Comment 11 Yusuke Suzuki 2016-05-18 17:04:55 PDT
x64 env perf results.

Benchmark report for SunSpider, Octane, Kraken, and AsmBench on hanayamata.

VMs tested:
"baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu-master/Release/bin/jsc
"patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu/Release/bin/jsc

Collected 4 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to
get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                                                 baseline                  patched                                      
SunSpider:
   3d-cube                                    5.7039+-0.0483     ?      5.7140+-0.0152        ?
   3d-morph                                  25.8652+-0.6922     ?     28.8619+-9.8995        ? might be 1.1159x slower
   3d-raytrace                                6.1587+-0.0398            6.1484+-0.1346        
   access-binary-trees                        2.0796+-0.1432            2.0728+-0.0925        
   access-fannkuch                            7.2028+-1.5473     ?      7.3298+-1.7196        ? might be 1.0176x slower
   access-nbody                               2.7382+-0.0314     ?      2.7653+-0.0805        ?
   access-nsieve                              3.1129+-0.2039            3.0450+-0.0766          might be 1.0223x faster
   bitops-3bit-bits-in-byte                   1.1774+-0.2119            1.0483+-0.0365          might be 1.1231x faster
   bitops-bits-in-byte                        2.7760+-0.5369            2.4948+-0.0598          might be 1.1127x faster
   bitops-bitwise-and                         1.9236+-0.0109     ?      1.9527+-0.1065        ? might be 1.0151x slower
   bitops-nsieve-bits                         3.0204+-0.0286     ?      3.6378+-1.1363        ? might be 1.2044x slower
   controlflow-recursive                      2.5072+-0.3621     ?      2.5960+-0.4151        ? might be 1.0354x slower
   crypto-aes                                 4.8640+-0.0651            4.8064+-0.0414          might be 1.0120x faster
   crypto-md5                                 2.5438+-0.2512            2.5033+-0.0910          might be 1.0162x faster
   crypto-sha1                                2.5356+-0.5588            2.3947+-0.0579          might be 1.0588x faster
   date-format-tofte                         10.0705+-1.6971            9.9487+-0.5134          might be 1.0122x faster
   date-format-xparb                          5.8340+-0.0633            5.7159+-0.0678          might be 1.0207x faster
   math-cordic                                2.9111+-0.0637            2.8884+-0.0794        
   math-partial-sums                         10.4175+-0.0931           10.3715+-0.0223        
   math-spectral-norm                         2.1515+-0.2492            2.0825+-0.1674          might be 1.0331x faster
   regexp-dna                                 7.2617+-0.0914     ?      7.2688+-0.1228        ?
   string-base64                              3.9536+-0.0656     ?      3.9791+-0.0283        ?
   string-fasta                               6.5701+-1.4645            6.0922+-0.2123          might be 1.0784x faster
   string-tagcloud                            9.0508+-0.2073            8.9454+-0.1207          might be 1.0118x faster
   string-unpack-code                        18.7492+-0.1751     ?     18.9697+-0.3352        ? might be 1.0118x slower
   string-validate-input                      4.0176+-0.0196     ?      4.4243+-0.9320        ? might be 1.1012x slower

   <arithmetic>                               5.9691+-0.0412     ?      6.0791+-0.3760        ? might be 1.0184x slower

                                                 baseline                  patched                                      
Octane:
   encrypt                                   0.19001+-0.02718          0.18367+-0.00637         might be 1.0345x faster
   decrypt                                   3.20132+-0.54592          3.05386+-0.05815         might be 1.0483x faster
   deltablue                        x2       0.14845+-0.00094    ?     0.14898+-0.00174       ?
   earley                                    0.33373+-0.00041          0.33326+-0.00149       
   boyer                                     5.29425+-0.04791          5.28735+-0.02679       
   navier-stokes                    x2       4.82915+-0.02802    ?     4.83875+-0.05857       ?
   raytrace                         x2       0.93219+-0.01116          0.92986+-0.01973       
   richards                         x2       0.09581+-0.00267    ?     0.09629+-0.00140       ?
   splay                            x2       0.39077+-0.00360    ?     0.39307+-0.00116       ?
   regexp                           x2      18.37962+-0.21040    ?    18.53484+-0.29797       ?
   pdfjs                            x2      41.24259+-1.64031         40.60014+-0.20735         might be 1.0158x faster
   mandreel                         x2      48.21639+-0.24653    ?    48.47195+-0.83248       ?
   gbemu                            x2      34.03689+-1.03936         33.64170+-1.09649         might be 1.0117x faster
   closure                                   0.59235+-0.02495          0.58497+-0.00374         might be 1.0126x faster
   jquery                                    7.43578+-0.07755    ?     7.46192+-0.16254       ?
   box2d                            x2      14.57743+-0.73278         14.43941+-0.21710       
   zlib                             x2     363.95774+-10.46326       337.86078+-18.95572        might be 1.0772x faster
   typescript                       x2     763.78058+-34.76278       758.51379+-22.00349      

   <geometric>                               5.78664+-0.05987          5.73540+-0.03937         might be 1.0089x faster

                                                 baseline                  patched                                      
Kraken:
   ai-astar                                   97.593+-4.001             97.556+-5.906         
   audio-beat-detection                       45.228+-0.118      ?      45.330+-0.109         ?
   audio-dft                                 123.409+-0.472      ?     123.467+-0.626         ?
   audio-fft                                  37.225+-0.104             37.207+-0.009         
   audio-oscillator                           53.366+-0.015      ?      53.366+-0.059         ?
   imaging-darkroom                           88.425+-0.053             88.368+-0.163         
   imaging-desaturate                         56.225+-0.377      ?      56.709+-0.756         ?
   imaging-gaussian-blur                      78.331+-13.432            74.993+-10.464          might be 1.0445x faster
   json-parse-financial                       41.831+-0.072      !      42.112+-0.145         ! definitely 1.0067x slower
   json-stringify-tinderbox                   25.189+-0.123             24.962+-0.114         
   stanford-crypto-aes                        43.227+-0.888      ?      43.456+-1.228         ?
   stanford-crypto-ccm                        41.631+-1.823             40.955+-1.638           might be 1.0165x faster
   stanford-crypto-pbkdf2                    104.177+-0.448      ?     105.022+-2.521         ?
   stanford-crypto-sha256-iterative           37.937+-0.102             37.937+-0.200         

   <arithmetic>                               62.414+-0.934             62.246+-0.986           might be 1.0027x faster

                                                 baseline                  patched                                      
AsmBench:
   towers.c                                 273.5890+-0.8769          272.9515+-1.7687        
   n-body.c                                 910.5629+-19.0203    ?    916.0066+-12.5231       ?
   float-mm.c                               727.9011+-3.8672     ?    730.2297+-1.7173        ?
   container.cpp                           3028.1337+-64.6283    ?   3037.6306+-56.0767       ?
   quicksort.c                              430.5352+-1.1251          430.1847+-1.6893        
   gcc-loops.cpp                           4149.4720+-97.7748    ?   4165.3242+-161.5086      ?
   bigfib.cpp                               449.1030+-5.6904     ?    452.6222+-18.4055       ?
   hash-map                                 153.5330+-2.7146          151.2027+-1.3597          might be 1.0154x faster
   dry.c                                    483.3406+-45.4044         467.7460+-24.7893         might be 1.0333x faster

   <geometric>                              683.7046+-6.3584          681.6894+-9.6798          might be 1.0030x faster

                                                 baseline                  patched                                      
Geomean of preferred means:
   <scaled-result>                           34.8431+-0.2258     ?     34.8718+-0.5628        ? might be 1.0008x slower
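
For reading the tables: the "Nx slower/faster" figure is the ratio of the two means, and the rows marked "definitely" in these reports are the ones whose 95% confidence intervals do not overlap. The C++ sketch below reproduces that reading. It is an assumption inferred from the rows shown here, not the benchmark tool's actual code, and all names are hypothetical.

```cpp
// Hypothetical representation of one table cell: mean +- half-width (ms).
struct Measurement {
    double mean;
    double halfWidth;
};

// Ratio of patched to baseline time; values above 1 mean "slower".
double slowdownRatio(const Measurement& baseline, const Measurement& patched)
{
    return patched.mean / baseline.mean;
}

// True when the two confidence intervals share at least one point, in which
// case the difference is treated here as not statistically conclusive.
bool intervalsOverlap(const Measurement& a, const Measurement& b)
{
    return a.mean - a.halfWidth <= b.mean + b.halfWidth
        && b.mean - b.halfWidth <= a.mean + a.halfWidth;
}
```

With the Kraken json-parse-financial row, Measurement{41.831, 0.072} and Measurement{42.112, 0.145} do not overlap and the ratio is about 1.0067, which matches the "definitely 1.0067x slower" annotation.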
Comment 12 Yusuke Suzuki 2016-05-18 17:06:07 PDT
x86 32bit SSE2 enabled build (-mfpmath=sse and -msse2 or later are passed to the compiler)
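
Whether a given C++ translation unit was built with x87 or SSE2 math can be checked from inside the code. The sketch below is illustrative: FLT_EVAL_METHOD is a standard macro from <cfloat>, __SSE2_MATH__ is a GCC/Clang predefined macro and is assumed to be unavailable on other compilers, and describeDoubleMath is a hypothetical name.

```cpp
#include <cfloat>

// Reports, at compile time, how this translation unit evaluates double
// arithmetic. Only the C++ code is described; JIT-generated code chooses
// its instructions at runtime and is not covered by these macros.
const char* describeDoubleMath()
{
#if defined(__SSE2_MATH__)
    return "SSE2 scalar math: double operations round to 64 bits";
#elif defined(FLT_EVAL_METHOD) && FLT_EVAL_METHOD == 2
    return "extended evaluation: intermediates may be kept in long double";
#elif defined(FLT_EVAL_METHOD) && FLT_EVAL_METHOD == 0
    return "double operations are evaluated in double precision";
#else
    return "unknown or indeterminate evaluation method";
#endif
}
```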

Benchmark report for SunSpider, Octane, Kraken, and AsmBench on 32bit.

VMs tested:
"baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu32-master/Release/bin/jsc
"patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu32/Release/bin/jsc

Collected 4 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to
get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                                                 baseline                  patched                                      
SunSpider:
   3d-cube                                   17.2505+-0.0981     ?     17.2927+-0.1872        ?
   3d-morph                                  13.8463+-0.0365           13.8152+-0.0438        
   3d-raytrace                               11.4933+-3.5854           10.3903+-0.0686          might be 1.1062x faster
   access-binary-trees                        2.6045+-0.0438            2.5560+-0.0936          might be 1.0190x faster
   access-fannkuch                            7.0265+-0.1793            6.9701+-0.0622        
   access-nbody                               4.8819+-0.0138            4.8812+-0.0109        
   access-nsieve                              3.3465+-0.7865            3.0980+-0.0206          might be 1.0802x faster
   bitops-3bit-bits-in-byte                   1.6490+-0.0351            1.6237+-0.0127          might be 1.0156x faster
   bitops-bits-in-byte                        2.6247+-0.0709     ?      2.6371+-0.0761        ?
   bitops-bitwise-and                         1.9934+-0.1417            1.9446+-0.0067          might be 1.0251x faster
   bitops-nsieve-bits                         3.7030+-0.0094     ?      3.7060+-0.0205        ?
   controlflow-recursive                      3.4000+-0.5472            3.2137+-0.0197          might be 1.0580x faster
   crypto-aes                                10.4076+-6.0480            8.4325+-0.0389          might be 1.2342x faster
   crypto-md5                                 5.1667+-0.0305            5.1293+-0.0105        
   crypto-sha1                                4.3883+-0.0141            4.3865+-0.0208        
   date-format-tofte                         11.6183+-0.1375     ?     11.6298+-0.4792        ?
   date-format-xparb                         10.0130+-0.0627           10.0063+-0.1523        
   math-cordic                                3.6779+-0.7567     ?      3.9900+-1.7219        ? might be 1.0849x slower
   math-partial-sums                         13.1448+-0.3463           13.0922+-0.0298        
   math-spectral-norm                         2.3122+-0.0176     ?      2.3129+-0.1113        ?
   regexp-dna                                 7.4440+-0.1518     ?      7.4880+-0.2611        ?
   string-base64                              4.7435+-0.0288     ?      4.7772+-0.0339        ?
   string-fasta                              10.8648+-0.0833     ?     11.0375+-0.5606        ? might be 1.0159x slower
   string-tagcloud                           13.7762+-0.2045     ?     13.8945+-0.0665        ?
   string-unpack-code                        24.2105+-0.1533     ?     24.8459+-1.6690        ? might be 1.0262x slower
   string-validate-input                      9.0511+-0.0272            8.9495+-0.3411          might be 1.0114x faster

   <arithmetic>                               7.8707+-0.2654            7.7731+-0.0579          might be 1.0126x faster

                                                 baseline                  patched                                      
Octane:
   encrypt                                   0.50798+-0.00791          0.49960+-0.01261         might be 1.0168x faster
   decrypt                                  17.11489+-0.42694         16.96305+-1.04268       
   deltablue                        x2       0.34733+-0.00346          0.34548+-0.00075       
   earley                                    0.60628+-0.00075    ?     0.60648+-0.00218       ?
   boyer                                     8.15865+-0.01372          8.15443+-0.02336       
   navier-stokes                    x2       6.63522+-1.07711          6.30414+-0.02100         might be 1.0525x faster
   raytrace                         x2       3.16797+-0.02804    ?     3.20975+-0.05828       ? might be 1.0132x slower
   richards                         x2       0.17076+-0.00079    ?     0.18036+-0.03107       ? might be 1.0562x slower
   splay                            x2       0.54739+-0.00203    ?     0.55064+-0.00407       ?
   regexp                           x2      20.53952+-0.10831    ?    20.81623+-0.18394       ? might be 1.0135x slower
   pdfjs                            x2      50.81172+-0.31701    ?    51.22708+-0.42641       ?
   mandreel                         x2      95.21398+-2.40420    ?    95.81830+-2.24623       ?
   gbemu                            x2      56.51416+-1.01726         56.48069+-0.19904       
   closure                                   0.63658+-0.02393          0.62312+-0.00578         might be 1.0216x faster
   jquery                                    8.10269+-0.13363          8.07504+-0.13617       
   box2d                            x2      19.52336+-0.55404    ?    19.56877+-0.58168       ?
   zlib                             x2     649.45215+-12.76747       649.21053+-8.40949       
   typescript                       x2    1190.06122+-64.58508      1175.71649+-28.84543        might be 1.0122x faster

   <geometric>                               9.89802+-0.09712    ?     9.90406+-0.16350       ? might be 1.0006x slower

                                                 baseline                  patched                                      
Kraken:
   ai-astar                                  194.487+-0.926      ?     195.598+-2.302         ?
   audio-beat-detection                       72.518+-0.322      ?      72.559+-0.149         ?
   audio-dft                                 121.121+-0.516      ?     121.390+-0.868         ?
   audio-fft                                  60.038+-0.703             59.870+-0.171         
   audio-oscillator                           93.826+-0.363             93.383+-0.282         
   imaging-darkroom                          170.330+-1.036      ?     170.883+-2.320         ?
   imaging-desaturate                         94.928+-1.026             94.546+-0.246         
   imaging-gaussian-blur                     188.115+-0.274      ?     191.241+-5.520         ? might be 1.0166x slower
   json-parse-financial                       65.857+-0.174             65.785+-0.215         
   json-stringify-tinderbox                   29.373+-0.071      ?      29.420+-0.099         ?
   stanford-crypto-aes                        65.178+-0.828      ?      65.435+-0.379         ?
   stanford-crypto-ccm                        49.151+-0.351             49.094+-1.002         
   stanford-crypto-pbkdf2                    128.655+-0.426      ?     129.064+-0.315         ?
   stanford-crypto-sha256-iterative           46.690+-0.166             46.304+-1.147         

   <arithmetic>                               98.591+-0.128      ?      98.898+-0.274         ? might be 1.0031x slower

                                                 baseline                  patched                                      
AsmBench:
   towers.c                                 402.2408+-5.2536          399.8250+-1.3425        
   n-body.c                                1281.5928+-16.6677    ?   1294.8248+-47.9531       ? might be 1.0103x slower
   float-mm.c                              1169.2465+-7.4986         1162.4531+-15.9992       
   container.cpp                           4386.6112+-42.5811    ?   4401.5446+-94.5153       ?
   quicksort.c                              716.2153+-0.1322          715.4910+-2.1336        
   gcc-loops.cpp                           9724.9072+-444.4705   ?  10052.8919+-1323.6404     ? might be 1.0337x slower
   bigfib.cpp                               905.2594+-18.0411    ?    906.2895+-7.8578        ?
   hash-map                                 197.6210+-3.3906          196.5555+-5.1078        
   dry.c                                    944.8693+-12.7314    ?    949.3920+-4.6332        ?

   <geometric>                             1134.4236+-6.5980     ?   1138.4928+-12.8227       ? might be 1.0036x slower

                                                 baseline                  patched                                      
Geomean of preferred means:
   <scaled-result>                           54.3285+-0.4786           54.2594+-0.1834          might be 1.0013x faster
Comment 13 Yusuke Suzuki 2016-05-18 17:07:24 PDT
x86 32-bit build without the SSE2 build options. In this case, operationMathPow is compiled to x87 code, but the JIT can still use SSE2. (This is the typical i686 binary package build configuration.)

Benchmark report for SunSpider, Octane, Kraken, and AsmBench on 32bit.

VMs tested:
"baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpulegacy32-master/Release/bin/jsc
"patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpulegacy32/Release/bin/jsc

Collected 4 samples per benchmark/VM, with 4 VM invocations per benchmark. Emitted a call to gc() between sample
measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the jsc-specific preciseTime() function to
get microsecond-level timing. Reporting benchmark execution times with 95% confidence intervals in milliseconds.

                                                 baseline                  patched                                      
SunSpider:
   3d-cube                                   17.1483+-0.0787     ?     17.1778+-0.1090        ?
   3d-morph                                  13.8268+-0.0273     ?     13.8322+-0.0388        ?
   3d-raytrace                               10.2777+-0.0496     ?     10.3127+-0.0880        ?
   access-binary-trees                        2.5810+-0.0111     ?      3.0062+-1.2568        ? might be 1.1648x slower
   access-fannkuch                            6.9170+-0.0566     ?      8.2241+-4.0918        ? might be 1.1890x slower
   access-nbody                               4.8837+-0.0224            4.8760+-0.0316        
   access-nsieve                              3.4854+-1.1250            3.1277+-0.0189          might be 1.1144x faster
   bitops-3bit-bits-in-byte                   1.6288+-0.0100     ?      1.6328+-0.0284        ?
   bitops-bits-in-byte                        2.5909+-0.0625     ?      2.6307+-0.0490        ? might be 1.0154x slower
   bitops-bitwise-and                         1.9580+-0.0194            1.9511+-0.0253        
   bitops-nsieve-bits                         3.7618+-0.2372            3.6875+-0.0121          might be 1.0202x faster
   controlflow-recursive                      3.1976+-0.0115     ?      3.2242+-0.0443        ?
   crypto-aes                                 8.3799+-0.0268     ?      8.4008+-0.0311        ?
   crypto-md5                                 5.2190+-0.0194     ?      5.7831+-1.8926        ? might be 1.1081x slower
   crypto-sha1                                4.4047+-0.0199            4.3888+-0.0380        
   date-format-tofte                         12.0606+-0.2065           12.0417+-0.4251        
   date-format-xparb                         10.6373+-0.1105     ?     11.1366+-1.2784        ? might be 1.0469x slower
   math-cordic                                3.4521+-0.0054     ?      3.6575+-0.5546        ? might be 1.0595x slower
   math-partial-sums                         13.3063+-0.0259     ?     13.3249+-0.0434        ?
   math-spectral-norm                         2.2966+-0.0126     ?      2.3030+-0.0158        ?
   regexp-dna                                 7.2940+-0.0697     ?      7.9797+-2.2196        ? might be 1.0940x slower
   string-base64                              4.5597+-0.0303     ?      4.5825+-0.0529        ?
   string-fasta                              10.8441+-0.1349     ?     10.9771+-0.0616        ? might be 1.0123x slower
   string-tagcloud                           13.9378+-0.5208           13.6143+-0.1270          might be 1.0238x faster
   string-unpack-code                        24.5435+-0.5817           24.3825+-0.1327        
   string-validate-input                      8.9327+-0.0204     ?      8.9642+-0.0741        ?

   <arithmetic>                               7.7741+-0.0543     ?      7.8931+-0.2313        ? might be 1.0153x slower

                                                 baseline                  patched                                      
Octane:
   encrypt                                   0.50270+-0.00820    ?     0.50358+-0.00986       ?
   decrypt                                  17.03379+-0.58921         16.94320+-0.22933       
   deltablue                        x2       0.34613+-0.00268    ?     0.34985+-0.01265       ? might be 1.0108x slower
   earley                                    0.60980+-0.00353          0.60855+-0.00187       
   boyer                                     8.17677+-0.05106    ?     8.24389+-0.25497       ?
   navier-stokes                    x2       6.34773+-0.16032          6.33005+-0.06629       
   raytrace                         x2       3.18275+-0.02789          3.16339+-0.01329       
   richards                         x2       0.17084+-0.00047    ?     0.17128+-0.00186       ?
   splay                            x2       0.55192+-0.00319    ?     0.55711+-0.00999       ?
   regexp                           x2      20.47882+-0.03197    ?    20.50719+-0.04801       ?
   pdfjs                            x2      49.78446+-0.71499    ?    50.03719+-0.57657       ?
   mandreel                         x2      96.24993+-2.36668         95.15272+-3.73660         might be 1.0115x faster
   gbemu                            x2      57.17923+-0.84760    ?    57.83685+-3.38938       ? might be 1.0115x slower
   closure                                   0.61193+-0.00831          0.61152+-0.00440       
   jquery                                    8.04075+-0.07221    ?     8.05180+-0.27270       ?
   box2d                            x2      19.65805+-0.55065    ?    19.67844+-0.54372       ?
   zlib                             x2     649.54014+-18.33616   ?   651.51095+-12.07980      ?
   typescript                       x2    1212.65424+-137.30415     1176.06445+-36.52192        might be 1.0311x faster

   <geometric>                               9.87533+-0.10715          9.87267+-0.06147         might be 1.0003x faster

                                                 baseline                  patched                                      
Kraken:
   ai-astar                                  196.674+-1.291            195.400+-1.959         
   audio-beat-detection                       72.875+-0.507             72.713+-0.376         
   audio-dft                                 121.780+-2.100            121.727+-0.928         
   audio-fft                                  60.188+-0.406      ?      60.355+-0.326         ?
   audio-oscillator                           93.959+-0.172             93.870+-0.190         
   imaging-darkroom                          171.704+-2.335            171.427+-2.946         
   imaging-desaturate                         94.476+-0.146      ?      94.891+-0.369         ?
   imaging-gaussian-blur                     223.107+-102.893          188.462+-0.801           might be 1.1838x faster
   json-parse-financial                       63.387+-0.144      ?      67.392+-8.681         ? might be 1.0632x slower
   json-stringify-tinderbox                   30.261+-0.371      ^      29.797+-0.039         ^ definitely 1.0156x faster
   stanford-crypto-aes                        64.352+-0.881             64.080+-1.158         
   stanford-crypto-ccm                        53.295+-0.659             52.984+-0.797         
   stanford-crypto-pbkdf2                    122.665+-7.550            120.230+-2.143           might be 1.0203x faster
   stanford-crypto-sha256-iterative           44.667+-0.286      ?      44.683+-0.284         ?

   <arithmetic>                              100.956+-7.020             98.429+-0.637           might be 1.0257x faster

                                                 baseline                  patched                                      
AsmBench:
   towers.c                                 404.3062+-22.0399         400.0767+-4.9144          might be 1.0106x faster
   n-body.c                                1293.0051+-89.0780        1286.7296+-22.7946       
   float-mm.c                              1158.9196+-20.8813        1158.6651+-26.1376       
   container.cpp                           4395.9618+-78.0860        4355.1048+-26.0590       
   quicksort.c                              706.4705+-18.6609    ?    716.0643+-0.1293        ? might be 1.0136x slower
   gcc-loops.cpp                           9314.9860+-109.9418   ?   9570.0922+-609.9214      ? might be 1.0274x slower
   bigfib.cpp                               918.0082+-4.9582          902.3481+-23.1366         might be 1.0174x faster
   hash-map                                 196.8177+-0.7598          195.6277+-4.1308        
   dry.c                                    939.7477+-14.6277         937.1147+-16.0467       

   <geometric>                             1128.6581+-10.1489        1127.3911+-8.0006          might be 1.0011x faster

                                                 baseline                  patched                                      
Geomean of preferred means:
   <scaled-result>                           54.3769+-0.9751           54.2259+-0.3772          might be 1.0028x faster
Comment 14 Yusuke Suzuki 2016-05-18 17:08:50 PDT
SunSpider in 32-bit (LXC container) is noisy, so I also collected results with --outer=30.

x86 32bit SSE2 enabled build.

Benchmark report for SunSpider on 32bit.

VMs tested:
"baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu32-master/Release/bin/jsc
"patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpu32/Release/bin/jsc

Collected 30 samples per benchmark/VM, with 30 VM invocations per benchmark. Emitted a call to gc()
between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the
jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times
with 95% confidence intervals in milliseconds.

                                   baseline                  patched                                      

3d-cube                        17.2396+-0.0167           17.2353+-0.0128        
3d-morph                       13.8706+-0.0619     ?     13.8931+-0.0261        ?
3d-raytrace                    10.3938+-0.0130           10.3787+-0.0184        
access-binary-trees             2.5792+-0.0058     ?      2.6315+-0.0775        ? might be 1.0203x slower
access-fannkuch                 7.0246+-0.0466            6.9819+-0.0127        
access-nbody                    4.8804+-0.0056            4.8764+-0.0071        
access-nsieve                   3.1030+-0.0059     ?      3.1043+-0.0042        ?
bitops-3bit-bits-in-byte        1.6326+-0.0041            1.6314+-0.0037        
bitops-bits-in-byte             2.6133+-0.0136     ?      2.6357+-0.0444        ?
bitops-bitwise-and              1.9511+-0.0079            1.9494+-0.0056        
bitops-nsieve-bits              3.7137+-0.0063     ?      3.7677+-0.1055        ? might be 1.0145x slower
controlflow-recursive           3.2185+-0.0057            3.2125+-0.0055        
crypto-aes                      8.4629+-0.0335            8.4601+-0.0343        
crypto-md5                      5.1449+-0.0076            5.1401+-0.0070        
crypto-sha1                     4.3817+-0.0137            4.3725+-0.0129        
date-format-tofte              11.6912+-0.0603           11.6471+-0.0725        
date-format-xparb              10.0545+-0.0462           10.0262+-0.0283        
math-cordic                     3.4548+-0.0082     ?      3.4631+-0.0070        ?
math-partial-sums              13.0911+-0.0131     ?     13.0988+-0.0160        ?
math-spectral-norm              2.3083+-0.0042     ?      2.3115+-0.0049        ?
regexp-dna                      7.4086+-0.0186            7.3988+-0.0103        
string-base64                   4.7659+-0.0158            4.7647+-0.0164        
string-fasta                   10.8456+-0.0387     ?     10.8967+-0.0279        ?
string-tagcloud                13.8135+-0.0283     !     13.9128+-0.0212        ! definitely 1.0072x slower
string-unpack-code             24.2019+-0.0394     ?     24.2541+-0.1327        ?
string-validate-input           9.0753+-0.0238            9.0702+-0.0253        

<arithmetic>                    7.7277+-0.0058     ?      7.7352+-0.0067        ? might be 1.0010x slower


x86 32bit SSE2 disabled build.

Benchmark report for SunSpider on 32bit.

VMs tested:
"baseline" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpulegacy32-master/Release/bin/jsc
"patched" at /home/yusukesuzuki/dev/WebKit/WebKitBuild/fpulegacy32/Release/bin/jsc

Collected 30 samples per benchmark/VM, with 30 VM invocations per benchmark. Emitted a call to gc()
between sample measurements. Used 1 benchmark iteration per VM invocation for warm-up. Used the
jsc-specific preciseTime() function to get microsecond-level timing. Reporting benchmark execution times
with 95% confidence intervals in milliseconds.

                                   baseline                  patched                                      

3d-cube                        17.1167+-0.0184     ?     17.2956+-0.2045        ? might be 1.0105x slower
3d-morph                       13.8662+-0.0280     ?     13.8866+-0.0272        ?
3d-raytrace                    10.2790+-0.0118     !     10.3195+-0.0136        ! definitely 1.0039x slower
access-binary-trees             2.5875+-0.0073     ?      2.5943+-0.0061        ?
access-fannkuch                 6.9522+-0.0400            6.9369+-0.0161        
access-nbody                    4.8806+-0.0076            4.8690+-0.0068        
access-nsieve                   3.1309+-0.0100     ?      3.1375+-0.0055        ?
bitops-3bit-bits-in-byte        1.6320+-0.0043            1.6298+-0.0035        
bitops-bits-in-byte             2.5993+-0.0135     ?      2.6180+-0.0237        ?
bitops-bitwise-and              1.9474+-0.0061     ?      1.9474+-0.0056        ?
bitops-nsieve-bits              3.6956+-0.0060     ?      3.7466+-0.0747        ? might be 1.0138x slower
controlflow-recursive           3.2078+-0.0054     ?      3.2161+-0.0064        ?
crypto-aes                      8.4773+-0.0772            8.4586+-0.0379        
crypto-md5                      5.2203+-0.0110     ^      5.1957+-0.0075        ^ definitely 1.0047x faster
crypto-sha1                     4.3942+-0.0110            4.3941+-0.0081        
date-format-tofte              12.0185+-0.0851           11.9945+-0.0805        
date-format-xparb              10.6577+-0.0333     !     10.7290+-0.0211        ! definitely 1.0067x slower
math-cordic                     3.4558+-0.0076     ?      3.4978+-0.0785        ? might be 1.0121x slower
math-partial-sums              13.2987+-0.0164           13.2937+-0.0364        
math-spectral-norm              2.3081+-0.0057     ?      2.3091+-0.0083        ?
regexp-dna                      7.2854+-0.0155            7.2758+-0.0152        
string-base64                   4.5988+-0.0283            4.5922+-0.0140        
string-fasta                   10.8793+-0.0352     ?     11.2162+-0.4723        ? might be 1.0310x slower
string-tagcloud                13.6521+-0.0294     ?     13.6553+-0.0377        ?
string-unpack-code             24.5895+-0.2590           24.3984+-0.0852        
string-validate-input           8.9315+-0.0178     ?      8.9489+-0.0273        ?

<arithmetic>                    7.7562+-0.0110     ?      7.7753+-0.0199        ? might be 1.0025x slower
Comment 15 Yusuke Suzuki 2016-05-18 17:10:31 PDT
The most affected benchmark should be SunSpider math-partial-sums, since it depends heavily on Math.pow performance. Performance there appears neutral with the patch (13.0911 vs. 13.0988 ms in the SSE2 enabled build, 13.2987 vs. 13.2937 ms in the SSE2 disabled build).
Comment 16 Build Bot 2016-05-18 18:03:46 PDT
Comment on attachment 279322 [details]
Patch

Attachment 279322 [details] did not pass mac-ews (mac):
Output: http://webkit-queues.webkit.org/results/1345218

New failing tests:
storage/websql/database-lock-after-reload.html
media/track/track-in-band.html
Comment 17 Build Bot 2016-05-18 18:03:50 PDT
Created attachment 279329 [details]
Archive of layout-test-results from ews103 for mac-yosemite

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: ews103  Port: mac-yosemite  Platform: Mac OS X 10.10.5
Comment 18 Alexey Proskuryakov 2016-05-18 22:57:16 PDT
> storage/websql/database-lock-after-reload.html
> media/track/track-in-band.html

That's a lot of EWS flakiness for one patch :(
Comment 19 Yusuke Suzuki 2016-05-19 21:08:22 PDT
(In reply to comment #18)
> > storage/websql/database-lock-after-reload.html
> > media/track/track-in-band.html
> 
> That's a lot of EWS flakiness for one patch :(

I don't think these failures are related to this patch. The mac bot failed twice, but the tests that fail are random.
Comment 20 Yusuke Suzuki 2016-05-26 02:51:19 PDT
Comment on attachment 279322 [details]
Patch

I thought that setting the FPU mode within this function's scope would be simpler than the current implementation.
Setting the FPU mode only inside a limited function is the approach actually used in glibc's sqrt.
To give std::sqrt 64bit precision, glibc changes the FPU mode inside the function.

Of course, if it causes significant performance degradation, we will need to reconsider and keep the current patch.
But I'll try it.
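
The function-scoped idea above can be sketched roughly as follows. This is only an illustration, not code from the patch or from glibc: the names `ScopedX87DoublePrecision`, `readX87ControlWord`, `writeX87ControlWord`, and `sqrtWithDoublePrecision` are hypothetical, and the sketch assumes GCC/Clang inline assembly on x86. On other targets the helpers compile to no-ops. It narrows the x87 precision-control field to a 53-bit significand for the duration of one call and restores the previous control word afterwards.

```cpp
#include <cmath>
#include <cstdint>

#if (defined(__i386__) || defined(__x86_64__)) && (defined(__GNUC__) || defined(__clang__))
#define HAS_X87_CONTROL_WORD 1
#else
#define HAS_X87_CONTROL_WORD 0
#endif

// Hypothetical helper: reads the x87 control word (returns 0 where unsupported).
static inline uint16_t readX87ControlWord()
{
    uint16_t word = 0;
#if HAS_X87_CONTROL_WORD
    __asm__ __volatile__("fnstcw %0" : "=m"(word));
#endif
    return word;
}

// Hypothetical helper: writes the x87 control word (no-op where unsupported).
static inline void writeX87ControlWord(uint16_t word)
{
#if HAS_X87_CONTROL_WORD
    __asm__ __volatile__("fldcw %0" : : "m"(word));
#else
    (void)word;
#endif
}

// Hypothetical RAII guard: while alive, x87 arithmetic rounds to a 53-bit
// significand (double) instead of the 64-bit significand of the 80bit format.
class ScopedX87DoublePrecision {
public:
    ScopedX87DoublePrecision()
        : m_saved(readX87ControlWord())
    {
        // Precision control is bits 8-9 of the control word; 10b selects 53 bits.
        uint16_t narrowed = static_cast<uint16_t>((m_saved & ~0x0300) | 0x0200);
        writeX87ControlWord(narrowed);
    }

    ~ScopedX87DoublePrecision() { writeX87ControlWord(m_saved); }

    ScopedX87DoublePrecision(const ScopedX87DoublePrecision&) = delete;
    ScopedX87DoublePrecision& operator=(const ScopedX87DoublePrecision&) = delete;

private:
    uint16_t m_saved;
};

// The mode change is limited to this one function, as described above.
double sqrtWithDoublePrecision(double x)
{
    ScopedX87DoublePrecision scope;
    return std::sqrt(x);
}
```

The cost of this design is one control-word read and two writes on every call, which is the kind of overhead that has to be measured before choosing it over the current patch.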
Comment 21 Yusuke Suzuki 2016-05-26 21:46:01 PDT
(In reply to comment #20)
> Comment on attachment 279322 [details]
> Patch
> 
> I thought setting FPU mode in this function scope is rather simple than the
> current implementation.
> And setting FPU mode in the limited function is actually used way in glibc's
> sqrt.
> In glibc, to make std::sqrt 64bit precision, it changes the FPU mode in the
> function.
> 
> Of course, if it causes significant performance degradation, we need to
> reconsider about the current patch.
> But I'll try it.

It turned out that this approach causes a 2% regression in the x86 x87 environment.
So I'll go with the current patch.
Comment 22 Yusuke Suzuki 2016-05-26 21:48:08 PDT
(In reply to comment #21)
> (In reply to comment #20)
> > Comment on attachment 279322 [details]
> > Patch
> > 
> > I thought setting FPU mode in this function scope is rather simple than the
> > current implementation.
> > And setting FPU mode in the limited function is actually used way in glibc's
> > sqrt.
> > In glibc, to make std::sqrt 64bit precision, it changes the FPU mode in the
> > function.
> > 
> > Of course, if it causes significant performance degradation, we need to
> > reconsider about the current patch.
> > But I'll try it.
> 
> This approach figured out that it causes 2% regression in x86 x87
> environment.
> So go with the current patch.

In SunSpider/LongSpider's math-partial-sums