157168 – REGRESSION(r200208): It made 2 JSC stress tests fail on x86

Csaba Osztrogonác

Reported 2016-04-29 02:25:59 PDT

ARMv7 Thumb2 (JSCOnly port): ----------------------------- before: https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/builds/415 after: https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/builds/416 GTK x86: --------- before: https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/60939 after: https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/60940 new failures: stress/math-pow-stable-results.js.always-trigger-copy-phase stress/math-pow-stable-results.js.default stress/math-pow-stable-results.js.dfg-eager stress/math-pow-stable-results.js.dfg-eager-no-cjit-validate stress/math-pow-stable-results.js.dfg-maximal-flush-validate-no-cjit stress/math-pow-stable-results.js.no-cjit-validate-phases stress/math-pow-stable-results.js.no-llint stress/math-pow-with-constants.js.always-trigger-copy-phase stress/math-pow-with-constants.js.default stress/math-pow-with-constants.js.dfg-maximal-flush-validate-no-cjit stress/math-pow-with-constants.js.no-cjit-validate-phases stress/math-pow-with-constants.js.no-llint It's strange that no regression on Apple Mac x86 bots, GTK ARMv7 Thumb2 and JSCOnly ARMv7 with ARM instruction set.

Radar WebKit Bug Importer

Comment 1 2016-04-29 09:15:32 PDT

<rdar://problem/26004839>

Benjamin Poulain

Comment 2 2016-04-29 14:20:04 PDT

Landed http://trac.webkit.org/changeset/200263 to get more information. Is it possible to get all the compile flags of those Linux build? Something like fast-math could cause something like this. Ossy, do you have access to the hardware? Getting those binaries would be a start. If you could also disassemble the test that would be even better.

Benjamin Poulain

Comment 3 2016-04-29 15:23:54 PDT

I cannot reproduce on my Mac with a 32bits build. Yusuke, can you reproduce by any chance? Can you have a quick look?

Yusuke Suzuki

Comment 4 2016-04-30 03:03:33 PDT

(In reply to comment #3) > I cannot reproduce on my Mac with a 32bits build. > > Yusuke, can you reproduce by any chance? Can you have a quick look? OK, I'll check this on Linux x86 hardware in GTK port :)

Yusuke Suzuki

Comment 5 2016-04-30 05:38:16 PDT

Setting up the 32bit LXC container. I'll build JSC on it.

Yusuke Suzuki

Comment 6 2016-04-30 06:40:24 PDT

I've just built and tested. And I cannot reproduce the results. Currently, I think this regression is due to the processor's capabilities. Seeing the original patch, I've found that we use sqrtDouble without seeing supportsFloatingPointSqrt(). sqrtDouble is only allowed in SSE2 environment in x86. I'll create the "attempt-to-fix" patch. ossy@, could you try it when I submit it here?

Yusuke Suzuki

Comment 7 2016-04-30 06:45:06 PDT

(In reply to comment #6) > I've just built and tested. > And I cannot reproduce the results. > > Currently, I think this regression is due to the processor's capabilities. > Seeing the original patch, I've found that we use sqrtDouble without seeing > supportsFloatingPointSqrt(). sqrtDouble is only allowed in SSE2 environment > in x86. > I'll create the "attempt-to-fix" patch. ossy@, could you try it when I > submit it here? Hmm, but, in this case, DFG is not enabled (supportsFloatingPoint() also relies on SSE2).

Yusuke Suzuki

Comment 8 2016-04-30 07:09:25 PDT

ossy@, could you dump the /proc/cpuinfo of this machine?

Csaba Osztrogonác

Comment 9 2016-05-02 02:27:54 PDT

(In reply to comment #8) > ossy@, could you dump the /proc/cpuinfo of this machine? It is an Arndale Octa board - 4 x Cortex A7 + 4x Cortex A15 (But only same cores can run at the same time) http://www.arndaleboard.org/wiki/index.php/Main_Page $ cat /proc/cpuinfo processor : 0 model name : ARMv7 Processor rev 3 (v7l) Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x2 CPU part : 0xc0f CPU revision : 3 [snip] (same for all processors: 0,1,2,3) Hardware : SAMSUNG EXYNOS5 (Flattened Device Tree) Revision : 0000 Serial : 0000000000000000

Yusuke Suzuki

Comment 10 2016-05-02 02:42:36 PDT

(In reply to comment #9) > (In reply to comment #8) > > ossy@, could you dump the /proc/cpuinfo of this machine? > > It is an Arndale Octa board - 4 x Cortex A7 + 4x Cortex A15 > (But only same cores can run at the same time) > > http://www.arndaleboard.org/wiki/index.php/Main_Page > > $ cat /proc/cpuinfo > processor : 0 > model name : ARMv7 Processor rev 3 (v7l) > Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 > idiva idivt vfpd32 lpae > CPU implementer : 0x41 > CPU architecture: 7 > CPU variant : 0x2 > CPU part : 0xc0f > CPU revision : 3 > > [snip] (same for all processors: 0,1,2,3) > > Hardware : SAMSUNG EXYNOS5 (Flattened Device Tree) > Revision : 0000 > Serial : 0000000000000000 How about x86 32bit machine? Are sse (and others) enabled? My tested x86 LXC 32bit container (i686 Ubuntu 14.04) is 8 cores (hyperthreading), SSE, SSE2, SSE3, SSE4.1, SSE4.2, and AVX enabled. $ cat /proc/cpuinfo processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i7-2860QM CPU @ 2.50GHz stepping : 7 microcode : 0x1a cpu MHz : 800.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 3 cpu cores : 4 apicid : 7 initial apicid : 7 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4999.71 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: $ uname -a Linux my32bitbox 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016 i686 i686 i686 GNU/Linux

Csaba Osztrogonác

Comment 11 2016-05-02 02:46:38 PDT

(In reply to comment #6) > I've just built and tested. > And I cannot reproduce the results. > > Currently, I think this regression is due to the processor's capabilities. > Seeing the original patch, I've found that we use sqrtDouble without seeing > supportsFloatingPointSqrt(). sqrtDouble is only allowed in SSE2 environment > in x86. > I'll create the "attempt-to-fix" patch. ossy@, could you try it when I > submit it here? I can try patches only on our ARM board, I don't have any x86 hardware. GTK guys can help, since they maintain the GTK x86 bot.

Csaba Osztrogonác

Comment 12 2016-05-02 03:07:16 PDT

(In reply to comment #2) > Landed http://trac.webkit.org/changeset/200263 to get more information. It seems it is a rounding issue: https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/builds/452 Running stress/math-pow-stable-results.js.always-trigger-copy-phase stress/math-pow-stable-results.js.default: Exception: Failed opaquePow with base = 1.4426950408889634 exponent = -0.5 expected (0.8325546111576977) got (0.8325546111576978) stress/math-pow-stable-results.js.default: ERROR: Unexpected exit code: 3 FAIL: stress/math-pow-stable-results.js.default stress/math-pow-stable-results.js.always-trigger-copy-phase: Exception: Failed opaquePow with base = 1.4426950408889634 exponent = -0.5 expected (0.8325546111576977) got (0.8325546111576978) Running stress/math-pow-stable-results.js.no-llint stress/math-pow-stable-results.js.always-trigger-copy-phase: ERROR: Unexpected exit code: 3 FAIL: stress/math-pow-stable-results.js.always-trigger-copy-phase Running stress/math-pow-stable-results.js.no-cjit-validate-phases stress/math-pow-stable-results.js.no-llint: Exception: Failed opaquePow with base = 2 exponent = -0.5 expected (0.7071067811865475) got (0.7071067811865476) stress/math-pow-stable-results.js.no-llint: ERROR: Unexpected exit code: 3 FAIL: stress/math-pow-stable-results.js.no-llint Running stress/math-pow-stable-results.js.dfg-eager stress/math-pow-stable-results.js.no-cjit-validate-phases: Exception: Failed opaquePow with base = 2 exponent = -0.5 expected (0.7071067811865475) got (0.7071067811865476) stress/math-pow-stable-results.js.no-cjit-validate-phases: ERROR: Unexpected exit code: 3 FAIL: stress/math-pow-stable-results.js.no-cjit-validate-phases Running stress/math-pow-stable-results.js.dfg-eager-no-cjit-validate stress/math-pow-stable-results.js.dfg-eager: Exception: Failed opaquePow with base = 2 exponent = -0.5 expected (0.7071067811865475) got (0.7071067811865476) stress/math-pow-stable-results.js.dfg-eager: ERROR: Unexpected exit code: 3 FAIL: stress/math-pow-stable-results.js.dfg-eager Running stress/math-pow-stable-results.js.dfg-maximal-flush-validate-no-cjit stress/math-pow-stable-results.js.dfg-eager-no-cjit-validate: Exception: Failed opaquePow with base = 2 exponent = -0.5 expected (0.7071067811865475) got (0.7071067811865476) stress/math-pow-stable-results.js.dfg-eager-no-cjit-validate: ERROR: Unexpected exit code: 3 FAIL: stress/math-pow-stable-results.js.dfg-eager-no-cjit-validate Running stress/regexp-exec-effect-after-exception.js.default stress/math-pow-stable-results.js.dfg-maximal-flush-validate-no-cjit: Exception: Failed opaquePow with base = 2 exponent = -0.5 expected (0.7071067811865475) got (0.7071067811865476) stress/math-pow-stable-results.js.dfg-maximal-flush-validate-no-cjit: ERROR: Unexpected exit code: 3 FAIL: stress/math-pow-stable-results.js.dfg-maximal-flush-validate-no-cjit > Is it possible to get all the compile flags of those Linux build? Something > like fast-math could cause something like this. > Ossy, do you have access to the hardware? Getting those binaries would be a > start. If you could also disassemble the test that would be even better. Sure, I'll upload the disassembly soon.

Csaba Osztrogonác

Comment 13 2016-05-02 03:22:39 PDT

Created attachment 277902 [details] disassembly of stress/math-pow-stable-results.js.default on ARMv7 Thumb2 working directory: .tests/stress $ ../../.vm/JavaScriptCore.framework/Resources/jsc --dumpDisassembly=true --useFTLJIT\=false --useFunctionDotArguments\=true math-pow-stable-results.js

Csaba Osztrogonác

Comment 14 2016-05-02 03:23:41 PDT

Created attachment 277903 [details] a preprocessed JSC source to be able to check all build time flags

Csaba Osztrogonác

Comment 15 2016-05-02 06:52:55 PDT

A clean build fixed it on the ARMv7 Thumb2 JSCOnly bot: https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/builds/453 Let's see if clean build fixes it on x86 too: https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61021

Yusuke Suzuki

Comment 16 2016-05-02 09:10:54 PDT

(In reply to comment #15) > A clean build fixed it on the ARMv7 Thumb2 JSCOnly bot: > https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/ > builds/453 > > Let's see if clean build fixes it on x86 too: > https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61021 Nice. OK.

Yusuke Suzuki

Comment 17 2016-05-02 10:08:14 PDT

(In reply to comment #16) > (In reply to comment #15) > > A clean build fixed it on the ARMv7 Thumb2 JSCOnly bot: > > https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/ > > builds/453 > > > > Let's see if clean build fixes it on x86 too: > > https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61021 > > Nice. OK. x86 32bit is still failing. It seems that this error is due to the precision. I'm now considering about x87.

Csaba Osztrogonác

Comment 18 2016-05-10 02:00:10 PDT

It would be great to know if GTK x86 bot has SSE2 support or not.

Yusuke Suzuki

Comment 19 2016-05-15 14:44:41 PDT

I've just extracted C runtime code (operationMathPow()), and executed it. https://gist.github.com/Constellation/2ae6dc8ff7bc4d0bb7870503d654848a Interestingly, I've got the 9.652406070129542e-160 in 32/64 environment. Even if we enable/disable sse in 32bit (80bit double is used in x87), we got the same result. I think, there is some bugs in JIT code. I'll investigate it.

Yusuke Suzuki

Comment 20 2016-05-15 16:19:13 PDT

Ah, OK. I think I've found the cause of this. This is due to x87's 80bit double precision. The result depends on fldl / fstpl. They perform 80 => 64 / 64 => 80 bit rouding. https://gist.github.com/Constellation/2ae6dc8ff7bc4d0bb7870503d654848a We got the following disasm from the above code. One is compiled without any options, one is compiled with -O3. No option: https://gist.github.com/Constellation/ff034fa6929c2f3570108214df5fe30b O3: https://gist.github.com/Constellation/c161ff854cd3615a7226527d83e9cbe5 In particular, we should focus on the following part. 93 // If the exponent is a small positive int32 integer, we do a fast exponentiation 94 double result = 1; 95 while (yAsInt) { 96 if (yAsInt & 1) 97 result *= x; 98 x *= x; 99 yAsInt >>= 1; 100 } 101 return result; O0: 80486da: d9 e8 fld1 80486dc: dd 5d e8 fstpl -0x18(%ebp) 80486df: 83 7d e4 00 cmpl $0x0,-0x1c(%ebp) 80486e3: 74 21 je 8048706 <_Z16operationMathPowdd+0x1eb> 80486e5: 8b 45 e4 mov -0x1c(%ebp),%eax 80486e8: 83 e0 01 and $0x1,%eax 80486eb: 85 c0 test %eax,%eax 80486ed: 74 09 je 80486f8 <_Z16operationMathPowdd+0x1dd> 80486ef: dd 45 e8 fldl -0x18(%ebp) 80486f2: dc 4d d0 fmull -0x30(%ebp) 80486f5: dd 5d e8 fstpl -0x18(%ebp) 80486f8: dd 45 d0 fldl -0x30(%ebp) 80486fb: dc 4d d0 fmull -0x30(%ebp) 80486fe: dd 5d d0 fstpl -0x30(%ebp) 8048701: d1 7d e4 sarl -0x1c(%ebp) 8048704: eb d9 jmp 80486df <_Z16operationMathPowdd+0x1c4> 8048706: dd 45 e8 fldl -0x18(%ebp) 8048709: eb 16 jmp 8048721 <_Z16operationMathPowdd+0x206> O3: 80486c9: dd d8 fstp %st(0) 80486cb: d9 e8 fld1 80486cd: eb 03 jmp 80486d2 <_Z16operationMathPowdd+0x142> 80486cf: 90 nop 80486d0: d9 c9 fxch %st(1) 80486d2: a8 01 test $0x1,%al 80486d4: 74 0a je 80486e0 <_Z16operationMathPowdd+0x150> 80486d6: d8 c9 fmul %st(1),%st 80486d8: d9 c9 fxch %st(1) 80486da: eb 06 jmp 80486e2 <_Z16operationMathPowdd+0x152> 80486dc: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi 80486e0: d9 c9 fxch %st(1) 80486e2: d1 f8 sar %eax 80486e4: d8 c8 fmul %st(0),%st 80486e6: 75 e8 jne 80486d0 <_Z16operationMathPowdd+0x140> 80486e8: dd d8 fstp %st(0) 80486ea: e9 33 ff ff ff jmp 8048622 <_Z16operationMathPowdd+0x92> In O0 ver, we always perform fstpl / fldl onto the result of multiplying. This performs 80bit => 64bit rounding. But, in O3 ver, we don't do that. It means that multiplying is calculated in 80bit in x87. As a result, it causes the precision problem. If operationMathPow is compiled like O3 cases, C runtime: It performs 80bit multiplying. And we round 80bit => 64bit at the end of operationMathPow. DFG: It always uses SSE. So there is no rounding problem. (SSE registers are 64bit).

Yusuke Suzuki

Comment 21 2016-05-15 17:27:50 PDT

The default flag for mfpmath in clang is sse in OS X. That's why 32bit OS X port does not show this problem.

Yusuke Suzuki

Comment 22 2016-05-15 21:16:43 PDT

Created attachment 278998 [details] Patch

Yusuke Suzuki

Comment 23 2016-05-15 22:02:46 PDT

Created attachment 279000 [details] Patch

Yusuke Suzuki

Comment 24 2016-05-15 23:38:07 PDT

Created attachment 279002 [details] Patch

Yusuke Suzuki

Comment 25 2016-05-15 23:48:49 PDT

Comment on attachment 279002 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=279002&action=review Added comments. > Source/JavaScriptCore/ChangeLog:18 > + occur in OS X clang 32bit environment. This is because `-mfpmath=sse` is enabled by default in OS X clang 32bit. Currently, in this patch, I use `volatile double` approach. But there are several alternative designs. (1) Do not use this fast path in operationMathPow & DFG JIT if `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`. This is more reliable than using `volatile double`. But it incurs some performance penalty in DFG JIT compared to the current case, since DFG JIT gives up emitting this fast path even if SSE2 is available. (2) Emit x87 in DFG when `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`. I think it's not good.

Yusuke Suzuki

Comment 26 2016-05-16 00:04:45 PDT

Comment on attachment 279002 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=279002&action=review >> Source/JavaScriptCore/ChangeLog:18 >> + occur in OS X clang 32bit environment. This is because `-mfpmath=sse` is enabled by default in OS X clang 32bit. > > Currently, in this patch, I use `volatile double` approach. But there are several alternative designs. > > (1) Do not use this fast path in operationMathPow & DFG JIT if `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`. > > This is more reliable than using `volatile double`. But it incurs some performance penalty in DFG JIT compared to the current case, > since DFG JIT gives up emitting this fast path even if SSE2 is available. > > (2) Emit x87 in DFG when `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`. > > I think it's not good. Reading this[1], `volatile double` seems OK. [1]: https://gcc.gnu.org/wiki/x87note

Yusuke Suzuki

Comment 27 2016-05-16 00:07:29 PDT

Comment on attachment 279002 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=279002&action=review >>> Source/JavaScriptCore/ChangeLog:18 >>> + occur in OS X clang 32bit environment. This is because `-mfpmath=sse` is enabled by default in OS X clang 32bit. >> >> Currently, in this patch, I use `volatile double` approach. But there are several alternative designs. >> >> (1) Do not use this fast path in operationMathPow & DFG JIT if `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`. >> >> This is more reliable than using `volatile double`. But it incurs some performance penalty in DFG JIT compared to the current case, >> since DFG JIT gives up emitting this fast path even if SSE2 is available. >> >> (2) Emit x87 in DFG when `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`. >> >> I think it's not good. > > Reading this[1], `volatile double` seems OK. > > [1]: https://gcc.gnu.org/wiki/x87note And this https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323

Alex Christensen

Comment 28 2016-05-16 10:50:42 PDT

Comment on attachment 279002 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=279002&action=review > Source/JavaScriptCore/runtime/MathCommon.cpp:438 > if (static_cast<double>(yAsInt) == y && yAsInt > 0 && yAsInt <= maxExponentForIntegerMathPow) { This should work if yAsInt == 0, too. It would correctly return 1, and it would do so faster than mathPowInternal. We could probably also do special optimizations for yAsInt < 0, but there would probably be lots of edge cases and precision problems to worry about. Same with non-{-0.5, 0.5} multiples of 0.5, which probably aren't very common. > Source/JavaScriptCore/runtime/MathCommon.cpp:448 > + typedef volatile double DoubleValue; Will we have to do this elsewhere, too? I've solved 80-bit precision (actually 64-bit mantissa) rounding issues on Windows with _control_fp and _PC_53, but that changes the rounding mode globally for the application.

Benjamin Poulain

Comment 29 2016-05-16 17:23:20 PDT

Comment on attachment 279002 [details] Patch Good job finding the issue! I wish there was a way for gcc to use SSE for the entire JSC. I am sure there are plenty of similar bugs that are not covered by tests :(

Yusuke Suzuki

Comment 30 2016-05-16 18:02:55 PDT

Created attachment 279077 [details] Patch

Yusuke Suzuki

Comment 31 2016-05-16 18:04:16 PDT

Comment on attachment 279002 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=279002&action=review >> Source/JavaScriptCore/runtime/MathCommon.cpp:438 >> if (static_cast<double>(yAsInt) == y && yAsInt > 0 && yAsInt <= maxExponentForIntegerMathPow) { > > This should work if yAsInt == 0, too. It would correctly return 1, and it would do so faster than mathPowInternal. > We could probably also do special optimizations for yAsInt < 0, but there would probably be lots of edge cases and precision problems to worry about. Same with non-{-0.5, 0.5} multiples of 0.5, which probably aren't very common. Nice catch!. DFG / FTL already handles this 0 case. So operationMathPow should be aware of that since operationMathPow's processing should be aligned to DFG / FTL's ones. >> Source/JavaScriptCore/runtime/MathCommon.cpp:448 >> + typedef volatile double DoubleValue; > > Will we have to do this elsewhere, too? I've solved 80-bit precision (actually 64-bit mantissa) rounding issues on Windows with _control_fp and _PC_53, but that changes the rounding mode globally for the application. I think this should be limited. The requirement is that the generated value is consistent in all the tiers (from LLInt to FTL). If the value is calculated in 80bit, it's ok if the DFG also calculate it in 80bit. So, what we should take care is that the mathematical value calculated in DFG like this. Yeah, as you said, _control_fp in Win (And _FPU_SETCW & _FPU_DOUBLE in the other environments) changes the FPU precision globally. So in this patch, I take the small changes that effect is limited. However, if changing the configuration is acceptable, changing it can be a good alternative.

Yusuke Suzuki

Comment 32 2016-05-16 18:24:37 PDT

(In reply to comment #29) > Comment on attachment 279002 [details] > Patch > > Good job finding the issue! > > I wish there was a way for gcc to use SSE for the entire JSC. I am sure > there are plenty of similar bugs that are not covered by tests :( Yeah. -mfpmath=sse & -march=native can enforce gcc to use SSE2 (muld etc.) if they are supported. However, in the environment that does not support SSE2, we need to take the conservative approach. And I think we still need to consider about such environments since the binary packages for i686 (e.g. Debian's i686 port) need to consider the CPU capabilities conservatively. Linux drops i386 support, but i686 support still remains. And i686 (Pentium Pro ~) does not include SSE2 (Pentium 4 ~). After reconsidering about the probelm, I think Alex's suggestion - changing FPU precision globally - seems nice in WebKit globally. Pros: - We don't need to consider about x87 80bit precision anymore. - In Windows, FPU precision is configured 64bit by default. (And in Linux, this is configued 80bit). So I don't think enforcing 64bit precision incurs some critical problems. Cons: - The change affects globally. So all the x87 processing in the process is now enforced to 64bit precision. - Even if FPU config is changed, libc's math function may return the value in 80bit precision (For example, I've checked that std::pow() still returns 80bit precision value even if we changed x87 precision to 64bit in x86). Potentially, this incurs some inconsistency problems. - We need to take care of the other libraries that may change the FPU configuration to keep this 64bit. For example, DirectX is known that it changes FPU configuration (without D3DCREATE_FPU_PRESERVE). Alex, Benjamin, what do you think of?

Benjamin Poulain

Comment 33 2016-05-16 18:41:21 PDT

Comment on attachment 279077 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=279077&action=review > Source/JavaScriptCore/b3/B3MathExtras.cpp:53 > - start->appendNew<Const32Value>(procedure, origin, 1000)); > + start->appendNew<Const32Value>(procedure, origin, maxExponentForIntegerMathPow)); Missing header for this.

Benjamin Poulain

Comment 34 2016-05-16 18:42:05 PDT

Changing the FPU configuration seems fragile to me. We would see bugs appearing if any library decides to change it too. Alex, what do you think?

Yusuke Suzuki

Comment 35 2016-05-16 18:45:33 PDT

(In reply to comment #34) > Changing the FPU configuration seems fragile to me. We would see bugs > appearing if any library decides to change it too. > > Alex, what do you think? (In reply to comment #32) > (In reply to comment #29) > > Comment on attachment 279002 [details] > > Patch > > > > Good job finding the issue! > > > > I wish there was a way for gcc to use SSE for the entire JSC. I am sure > > there are plenty of similar bugs that are not covered by tests :( > > Yeah. -mfpmath=sse & -march=native can enforce gcc to use SSE2 (muld etc.) > if they are supported. However, in the environment that does not support > SSE2, we need to take the conservative approach. And I think we still need > to consider about such environments since the binary packages for i686 (e.g. > Debian's i686 port) need to consider the CPU capabilities conservatively. > Linux drops i386 support, but i686 support still remains. And i686 (Pentium > Pro ~) does not include SSE2 (Pentium 4 ~). > > After reconsidering about the probelm, I think Alex's suggestion - changing > FPU precision globally - seems nice in WebKit globally. > > Pros: > - We don't need to consider about x87 80bit precision anymore. > - In Windows, FPU precision is configured 64bit by default. (And in Linux, > this is configued 80bit). So I don't think enforcing 64bit precision incurs > some critical problems. Ah, and potential concern is that there are so many Linux libraries that is not used in Windows; gtk, etc. > > Cons: > - The change affects globally. So all the x87 processing in the process is > now enforced to 64bit precision. > - Even if FPU config is changed, libc's math function may return the value > in 80bit precision (For example, I've checked that std::pow() still returns > 80bit precision value even if we changed x87 precision to 64bit in x86). > Potentially, this incurs some inconsistency problems. > - We need to take care of the other libraries that may change the FPU > configuration to keep this 64bit. For example, DirectX is known that it > changes FPU configuration (without D3DCREATE_FPU_PRESERVE). > > Alex, Benjamin, what do you think of?

Alex Christensen

Comment 36 2016-05-16 21:06:14 PDT

Comment on attachment 279077 [details] Patch I think this fix is good, but there are certainly more, and if we found all of them, the code would look messy everywhere and do lots of unnecessary storing to the stack. Requiring all JSC applications to use 53-bit mantissa mode would be VERY bad, because some of them intentionally change it for more precision in fancy math libraries. I had to debug such a problem once and that was no fun.

Yusuke Suzuki

Comment 37 2016-05-16 21:25:55 PDT

(In reply to comment #36) > Comment on attachment 279077 [details] > Patch > > I think this fix is good, but there are certainly more, and if we found all > of them, the code would look messy everywhere and do lots of unnecessary > storing to the stack. Requiring all JSC applications to use 53-bit mantissa > mode would be VERY bad, because some of them intentionally change it for > more precision in fancy math libraries. I had to debug such a problem once > and that was no fun. Sounds reasonable. OK, I'll go with this patch and watch the build bot. Thanks :)

Yusuke Suzuki

Comment 38 2016-05-16 21:26:22 PDT

Comment on attachment 279077 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=279077&action=review >> Source/JavaScriptCore/b3/B3MathExtras.cpp:53 >> + start->appendNew<Const32Value>(procedure, origin, maxExponentForIntegerMathPow)); > > Missing header for this. Oops, thanks. Fixed.

Yusuke Suzuki

Comment 39 2016-05-16 21:35:17 PDT

Committed r200996: <http://trac.webkit.org/changeset/200996>

Yusuke Suzuki

Comment 40 2016-05-16 22:00:17 PDT

(In reply to comment #36) > Comment on attachment 279077 [details] > Patch > > I think this fix is good, but there are certainly more, and if we found all > of them, the code would look messy everywhere and do lots of unnecessary > storing to the stack. Requiring all JSC applications to use 53-bit mantissa > mode would be VERY bad, because some of them intentionally change it for > more precision in fancy math libraries. I had to debug such a problem once > and that was no fun. Once the new code like this is found, we need to extract this DoubleValue type and use it in such code to ensure 64bit precision. I believe such code is rare; the problematic pattern is that the same floating point calculation is done in C runtime and Baseline / DFG (not FTL) JIT SSE2 code separately. FTL does not have a problem since FTL is not enabled in x86 32bit.

Yusuke Suzuki

Comment 41 2016-05-16 22:43:44 PDT

operations with base = 0.6931471805599453 exponent = 999 are fixed. But still failing due to exponent = 0.5 / -0.5 special path cases.

Yusuke Suzuki

Comment 42 2016-05-16 22:53:02 PDT

Ah, OK. math-pow-with-constants.js cases are fixed. But math-pow-stable-results.js cases remain. This is because of the similar issue in sqrt. operationMathPow uses sqrt. This c sqrt function is implementation-dependent, but, actually, the result is calculated in 80bit precision in glibc. However, DFG JIT fast path for 0.5 / -0.5 uses sqrt SSE operation. Of course, they have 64bit precision. I think the similar issue should occur in ArithSqrt and std::sqrt. Seeing the ArithXXX nodes at glance, the other nodes do not have the problems.

Yusuke Suzuki

Comment 43 2016-05-16 23:33:16 PDT

I've just opened the new bug to handle sqrt and pow. https://bugs.webkit.org/show_bug.cgi?id=157787

Yusuke Suzuki

Comment 44 2016-05-17 06:42:38 PDT

Committed r201014: <http://trac.webkit.org/changeset/201014>

Attachments
disassembly of stress/math-pow-stable-results.js.default on ARMv7 Thumb2 (1.14 MB, text/plain) 2016-05-02 03:22 PDT, Csaba Osztrogonác	no flags	Details
a preprocessed JSC source (3.70 MB, text/plain) 2016-05-02 03:23 PDT, Csaba Osztrogonác	no flags	Details
Patch (2.68 KB, patch) 2016-05-15 21:16 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (2.68 KB, patch) 2016-05-15 22:02 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (3.23 KB, patch) 2016-05-15 23:38 PDT, Yusuke Suzuki	no flags	Details Formatted Diff Diff
Patch (4.13 KB, patch) 2016-05-16 18:02 PDT, Yusuke Suzuki	benjamin: review+	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.