Bug 157168 - REGRESSION(r200208): It made 2 JSC stress tests fail on x86
Summary: REGRESSION(r200208): It made 2 JSC stress tests fail on x86
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: Other
Hardware: Unspecified Unspecified
: P1 Critical
Assignee: Yusuke Suzuki
URL:
Keywords: InRadar
Depends on: 157787
Blocks: 157121
  Show dependency treegraph
 
Reported: 2016-04-29 02:25 PDT by Csaba Osztrogonác
Modified: 2016-05-17 06:42 PDT (History)
13 users (show)

See Also:


Attachments
disassembly of stress/math-pow-stable-results.js.default on ARMv7 Thumb2 (1.14 MB, text/plain)
2016-05-02 03:22 PDT, Csaba Osztrogonác
no flags Details
a preprocessed JSC source (3.70 MB, text/plain)
2016-05-02 03:23 PDT, Csaba Osztrogonác
no flags Details
Patch (2.68 KB, patch)
2016-05-15 21:16 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (2.68 KB, patch)
2016-05-15 22:02 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (3.23 KB, patch)
2016-05-15 23:38 PDT, Yusuke Suzuki
no flags Details | Formatted Diff | Diff
Patch (4.13 KB, patch)
2016-05-16 18:02 PDT, Yusuke Suzuki
benjamin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Csaba Osztrogonác 2016-04-29 02:25:59 PDT
ARMv7 Thumb2 (JSCOnly port):
-----------------------------
before: https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/builds/415
after: https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/builds/416

GTK x86:
---------
before: https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/60939
after: https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/60940

new failures:
	stress/math-pow-stable-results.js.always-trigger-copy-phase
	stress/math-pow-stable-results.js.default
	stress/math-pow-stable-results.js.dfg-eager
	stress/math-pow-stable-results.js.dfg-eager-no-cjit-validate
	stress/math-pow-stable-results.js.dfg-maximal-flush-validate-no-cjit
	stress/math-pow-stable-results.js.no-cjit-validate-phases
	stress/math-pow-stable-results.js.no-llint
	stress/math-pow-with-constants.js.always-trigger-copy-phase
	stress/math-pow-with-constants.js.default
	stress/math-pow-with-constants.js.dfg-maximal-flush-validate-no-cjit
	stress/math-pow-with-constants.js.no-cjit-validate-phases
	stress/math-pow-with-constants.js.no-llint

It's strange that no regression on Apple Mac x86 bots, GTK ARMv7 Thumb2 and JSCOnly ARMv7 with ARM instruction set.
Comment 1 Radar WebKit Bug Importer 2016-04-29 09:15:32 PDT
<rdar://problem/26004839>
Comment 2 Benjamin Poulain 2016-04-29 14:20:04 PDT
Landed http://trac.webkit.org/changeset/200263 to get more information.

Is it possible to get all the compile flags of those Linux build? Something like fast-math could cause something like this.

Ossy, do you have access to the hardware? Getting those binaries would be a start. If you could also disassemble the test that would be even better.
Comment 3 Benjamin Poulain 2016-04-29 15:23:54 PDT
I cannot reproduce on my Mac with a 32bits build.

Yusuke, can you reproduce by any chance? Can you have a quick look?
Comment 4 Yusuke Suzuki 2016-04-30 03:03:33 PDT
(In reply to comment #3)
> I cannot reproduce on my Mac with a 32bits build.
> 
> Yusuke, can you reproduce by any chance? Can you have a quick look?

OK, I'll check this on Linux x86 hardware in GTK port :)
Comment 5 Yusuke Suzuki 2016-04-30 05:38:16 PDT
Setting up the 32bit LXC container. I'll build JSC on it.
Comment 6 Yusuke Suzuki 2016-04-30 06:40:24 PDT
I've just built and tested.
And I cannot reproduce the results.

Currently, I think this regression is due to the processor's capabilities.
Seeing the original patch, I've found that we use sqrtDouble without seeing supportsFloatingPointSqrt(). sqrtDouble is only allowed in SSE2 environment in x86.
I'll create the "attempt-to-fix" patch. ossy@, could you try it when I submit it here?
Comment 7 Yusuke Suzuki 2016-04-30 06:45:06 PDT
(In reply to comment #6)
> I've just built and tested.
> And I cannot reproduce the results.
> 
> Currently, I think this regression is due to the processor's capabilities.
> Seeing the original patch, I've found that we use sqrtDouble without seeing
> supportsFloatingPointSqrt(). sqrtDouble is only allowed in SSE2 environment
> in x86.
> I'll create the "attempt-to-fix" patch. ossy@, could you try it when I
> submit it here?

Hmm, but, in this case, DFG is not enabled (supportsFloatingPoint() also relies on SSE2).
Comment 8 Yusuke Suzuki 2016-04-30 07:09:25 PDT
ossy@, could you dump the /proc/cpuinfo of this machine?
Comment 9 Csaba Osztrogonác 2016-05-02 02:27:54 PDT
(In reply to comment #8)
> ossy@, could you dump the /proc/cpuinfo of this machine?

It is an Arndale Octa board - 4 x Cortex A7 + 4x Cortex A15
(But only same cores can run at the same time)

http://www.arndaleboard.org/wiki/index.php/Main_Page

$ cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 3 (v7l)
Features        : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc0f
CPU revision    : 3

[snip] (same for all processors: 0,1,2,3)

Hardware        : SAMSUNG EXYNOS5 (Flattened Device Tree)
Revision        : 0000
Serial          : 0000000000000000
Comment 10 Yusuke Suzuki 2016-05-02 02:42:36 PDT
(In reply to comment #9)
> (In reply to comment #8)
> > ossy@, could you dump the /proc/cpuinfo of this machine?
> 
> It is an Arndale Octa board - 4 x Cortex A7 + 4x Cortex A15
> (But only same cores can run at the same time)
> 
> http://www.arndaleboard.org/wiki/index.php/Main_Page
> 
> $ cat /proc/cpuinfo
> processor       : 0
> model name      : ARMv7 Processor rev 3 (v7l)
> Features        : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
> idiva idivt vfpd32 lpae
> CPU implementer : 0x41
> CPU architecture: 7
> CPU variant     : 0x2
> CPU part        : 0xc0f
> CPU revision    : 3
> 
> [snip] (same for all processors: 0,1,2,3)
> 
> Hardware        : SAMSUNG EXYNOS5 (Flattened Device Tree)
> Revision        : 0000
> Serial          : 0000000000000000

How about x86 32bit machine? Are sse (and others) enabled?

My tested x86 LXC 32bit container (i686 Ubuntu 14.04) is 8 cores (hyperthreading), SSE, SSE2, SSE3, SSE4.1, SSE4.2, and AVX enabled.


$ cat /proc/cpuinfo
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i7-2860QM CPU @ 2.50GHz
stepping        : 7
microcode       : 0x1a
cpu MHz         : 800.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4999.71
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

$ uname -a
Linux my32bitbox 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016 i686 i686 i686 GNU/Linux
Comment 11 Csaba Osztrogonác 2016-05-02 02:46:38 PDT
(In reply to comment #6)
> I've just built and tested.
> And I cannot reproduce the results.
> 
> Currently, I think this regression is due to the processor's capabilities.
> Seeing the original patch, I've found that we use sqrtDouble without seeing
> supportsFloatingPointSqrt(). sqrtDouble is only allowed in SSE2 environment
> in x86.
> I'll create the "attempt-to-fix" patch. ossy@, could you try it when I
> submit it here?

I can try patches only on our ARM board, I don't have any x86 hardware.
GTK guys can help, since they maintain the GTK x86 bot.
Comment 12 Csaba Osztrogonác 2016-05-02 03:07:16 PDT
(In reply to comment #2)
> Landed http://trac.webkit.org/changeset/200263 to get more information.

It seems it is a rounding issue:
https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/builds/452

Running stress/math-pow-stable-results.js.always-trigger-copy-phase
stress/math-pow-stable-results.js.default: Exception: Failed opaquePow with base = 1.4426950408889634 exponent = -0.5 expected (0.8325546111576977) got (0.8325546111576978)
stress/math-pow-stable-results.js.default: ERROR: Unexpected exit code: 3
FAIL: stress/math-pow-stable-results.js.default
stress/math-pow-stable-results.js.always-trigger-copy-phase: Exception: Failed opaquePow with base = 1.4426950408889634 exponent = -0.5 expected (0.8325546111576977) got (0.8325546111576978)
Running stress/math-pow-stable-results.js.no-llint
stress/math-pow-stable-results.js.always-trigger-copy-phase: ERROR: Unexpected exit code: 3
FAIL: stress/math-pow-stable-results.js.always-trigger-copy-phase
Running stress/math-pow-stable-results.js.no-cjit-validate-phases
stress/math-pow-stable-results.js.no-llint: Exception: Failed opaquePow with base = 2 exponent = -0.5 expected (0.7071067811865475) got (0.7071067811865476)
stress/math-pow-stable-results.js.no-llint: ERROR: Unexpected exit code: 3
FAIL: stress/math-pow-stable-results.js.no-llint
Running stress/math-pow-stable-results.js.dfg-eager
stress/math-pow-stable-results.js.no-cjit-validate-phases: Exception: Failed opaquePow with base = 2 exponent = -0.5 expected (0.7071067811865475) got (0.7071067811865476)
stress/math-pow-stable-results.js.no-cjit-validate-phases: ERROR: Unexpected exit code: 3
FAIL: stress/math-pow-stable-results.js.no-cjit-validate-phases
Running stress/math-pow-stable-results.js.dfg-eager-no-cjit-validate
stress/math-pow-stable-results.js.dfg-eager: Exception: Failed opaquePow with base = 2 exponent = -0.5 expected (0.7071067811865475) got (0.7071067811865476)
stress/math-pow-stable-results.js.dfg-eager: ERROR: Unexpected exit code: 3
FAIL: stress/math-pow-stable-results.js.dfg-eager
Running stress/math-pow-stable-results.js.dfg-maximal-flush-validate-no-cjit
stress/math-pow-stable-results.js.dfg-eager-no-cjit-validate: Exception: Failed opaquePow with base = 2 exponent = -0.5 expected (0.7071067811865475) got (0.7071067811865476)
stress/math-pow-stable-results.js.dfg-eager-no-cjit-validate: ERROR: Unexpected exit code: 3
FAIL: stress/math-pow-stable-results.js.dfg-eager-no-cjit-validate
Running stress/regexp-exec-effect-after-exception.js.default
stress/math-pow-stable-results.js.dfg-maximal-flush-validate-no-cjit: Exception: Failed opaquePow with base = 2 exponent = -0.5 expected (0.7071067811865475) got (0.7071067811865476)
stress/math-pow-stable-results.js.dfg-maximal-flush-validate-no-cjit: ERROR: Unexpected exit code: 3
FAIL: stress/math-pow-stable-results.js.dfg-maximal-flush-validate-no-cjit

 
> Is it possible to get all the compile flags of those Linux build? Something
> like fast-math could cause something like this.
 
> Ossy, do you have access to the hardware? Getting those binaries would be a
> start. If you could also disassemble the test that would be even better.

Sure, I'll upload the disassembly soon.
Comment 13 Csaba Osztrogonác 2016-05-02 03:22:39 PDT
Created attachment 277902 [details]
disassembly of stress/math-pow-stable-results.js.default on ARMv7 Thumb2

working directory: .tests/stress

$ ../../.vm/JavaScriptCore.framework/Resources/jsc  --dumpDisassembly=true --useFTLJIT\=false --useFunctionDotArguments\=true math-pow-stable-results.js
Comment 14 Csaba Osztrogonác 2016-05-02 03:23:41 PDT
Created attachment 277903 [details]
a preprocessed JSC source

to be able to check all build time flags
Comment 15 Csaba Osztrogonác 2016-05-02 06:52:55 PDT
A clean build fixed it on the ARMv7 Thumb2 JSCOnly bot:
https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/builds/453

Let's see if clean build fixes it on x86 too:
https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61021
Comment 16 Yusuke Suzuki 2016-05-02 09:10:54 PDT
(In reply to comment #15)
> A clean build fixed it on the ARMv7 Thumb2 JSCOnly bot:
> https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/
> builds/453
> 
> Let's see if clean build fixes it on x86 too:
> https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61021

Nice. OK.
Comment 17 Yusuke Suzuki 2016-05-02 10:08:14 PDT
(In reply to comment #16)
> (In reply to comment #15)
> > A clean build fixed it on the ARMv7 Thumb2 JSCOnly bot:
> > https://build.webkit.org/builders/JSCOnly%20Linux%20ARMv7%20Thumb2%20Release/
> > builds/453
> > 
> > Let's see if clean build fixes it on x86 too:
> > https://build.webkit.org/builders/GTK%20Linux%2032-bit%20Release/builds/61021
> 
> Nice. OK.

x86 32bit is still failing. It seems that this error is due to the precision.
I'm now considering about x87.
Comment 18 Csaba Osztrogonác 2016-05-10 02:00:10 PDT
It would be great to know if GTK x86 bot has SSE2 support or not.
Comment 19 Yusuke Suzuki 2016-05-15 14:44:41 PDT
I've just extracted C runtime code (operationMathPow()), and executed it.
https://gist.github.com/Constellation/2ae6dc8ff7bc4d0bb7870503d654848a

Interestingly, I've got the 9.652406070129542e-160 in 32/64 environment.
Even if we enable/disable sse in 32bit (80bit double is used in x87), we got the same result. I think, there is some bugs in JIT code. I'll investigate it.
Comment 20 Yusuke Suzuki 2016-05-15 16:19:13 PDT
Ah, OK. I think I've found the cause of this.
This is due to x87's 80bit double precision. The result depends on fldl / fstpl. They perform 80 => 64 / 64 => 80 bit rouding.

https://gist.github.com/Constellation/2ae6dc8ff7bc4d0bb7870503d654848a

We got the following disasm from the above code. One is compiled without any options, one is compiled with -O3.

No option:
https://gist.github.com/Constellation/ff034fa6929c2f3570108214df5fe30b

O3:
https://gist.github.com/Constellation/c161ff854cd3615a7226527d83e9cbe5

In particular, we should focus on the following part.

 93         // If the exponent is a small positive int32 integer, we do a fast exponentiation
 94         double result = 1;
 95         while (yAsInt) {
 96             if (yAsInt & 1)
 97                 result *= x;
 98             x *= x;
 99             yAsInt >>= 1;
100         }
101         return result;

O0:
 80486da:	d9 e8                	fld1   
 80486dc:	dd 5d e8             	fstpl  -0x18(%ebp)
 80486df:	83 7d e4 00          	cmpl   $0x0,-0x1c(%ebp)
 80486e3:	74 21                	je     8048706 <_Z16operationMathPowdd+0x1eb>
 80486e5:	8b 45 e4             	mov    -0x1c(%ebp),%eax
 80486e8:	83 e0 01             	and    $0x1,%eax
 80486eb:	85 c0                	test   %eax,%eax
 80486ed:	74 09                	je     80486f8 <_Z16operationMathPowdd+0x1dd>
 80486ef:	dd 45 e8             	fldl   -0x18(%ebp)
 80486f2:	dc 4d d0             	fmull  -0x30(%ebp)
 80486f5:	dd 5d e8             	fstpl  -0x18(%ebp)
 80486f8:	dd 45 d0             	fldl   -0x30(%ebp)
 80486fb:	dc 4d d0             	fmull  -0x30(%ebp)
 80486fe:	dd 5d d0             	fstpl  -0x30(%ebp)
 8048701:	d1 7d e4             	sarl   -0x1c(%ebp)
 8048704:	eb d9                	jmp    80486df <_Z16operationMathPowdd+0x1c4>
 8048706:	dd 45 e8             	fldl   -0x18(%ebp)
 8048709:	eb 16                	jmp    8048721 <_Z16operationMathPowdd+0x206>

O3:
 80486c9:	dd d8                	fstp   %st(0)
 80486cb:	d9 e8                	fld1   
 80486cd:	eb 03                	jmp    80486d2 <_Z16operationMathPowdd+0x142>
 80486cf:	90                   	nop
 80486d0:	d9 c9                	fxch   %st(1)
 80486d2:	a8 01                	test   $0x1,%al
 80486d4:	74 0a                	je     80486e0 <_Z16operationMathPowdd+0x150>
 80486d6:	d8 c9                	fmul   %st(1),%st
 80486d8:	d9 c9                	fxch   %st(1)
 80486da:	eb 06                	jmp    80486e2 <_Z16operationMathPowdd+0x152>
 80486dc:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
 80486e0:	d9 c9                	fxch   %st(1)
 80486e2:	d1 f8                	sar    %eax
 80486e4:	d8 c8                	fmul   %st(0),%st
 80486e6:	75 e8                	jne    80486d0 <_Z16operationMathPowdd+0x140>
 80486e8:	dd d8                	fstp   %st(0)
 80486ea:	e9 33 ff ff ff       	jmp    8048622 <_Z16operationMathPowdd+0x92>


In O0 ver, we always perform fstpl / fldl onto the result of multiplying. This performs 80bit => 64bit rounding.
But, in O3 ver, we don't do that. It means that multiplying is calculated in 80bit in x87.

As a result, it causes the precision problem. If operationMathPow is compiled like O3 cases,

C runtime: It performs 80bit multiplying. And we round 80bit => 64bit at the end of operationMathPow.
DFG: It always uses SSE. So there is no rounding problem. (SSE registers are 64bit).
Comment 21 Yusuke Suzuki 2016-05-15 17:27:50 PDT
The default flag for mfpmath in clang is sse in OS X.
That's why 32bit OS X port does not show this problem.
Comment 22 Yusuke Suzuki 2016-05-15 21:16:43 PDT
Created attachment 278998 [details]
Patch
Comment 23 Yusuke Suzuki 2016-05-15 22:02:46 PDT
Created attachment 279000 [details]
Patch
Comment 24 Yusuke Suzuki 2016-05-15 23:38:07 PDT
Created attachment 279002 [details]
Patch
Comment 25 Yusuke Suzuki 2016-05-15 23:48:49 PDT
Comment on attachment 279002 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=279002&action=review

Added comments.

> Source/JavaScriptCore/ChangeLog:18
> +        occur in OS X clang 32bit environment. This is because `-mfpmath=sse` is enabled by default in OS X clang 32bit.

Currently, in this patch, I use `volatile double` approach. But there are several alternative designs.

(1) Do not use this fast path in operationMathPow & DFG JIT if `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`.

This is more reliable than using `volatile double`. But it incurs some performance penalty in DFG JIT compared to the current case,
since DFG JIT gives up emitting this fast path even if SSE2 is available.

(2) Emit x87 in DFG when `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`.

I think it's not good.
Comment 26 Yusuke Suzuki 2016-05-16 00:04:45 PDT
Comment on attachment 279002 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=279002&action=review

>> Source/JavaScriptCore/ChangeLog:18
>> +        occur in OS X clang 32bit environment. This is because `-mfpmath=sse` is enabled by default in OS X clang 32bit.
> 
> Currently, in this patch, I use `volatile double` approach. But there are several alternative designs.
> 
> (1) Do not use this fast path in operationMathPow & DFG JIT if `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`.
> 
> This is more reliable than using `volatile double`. But it incurs some performance penalty in DFG JIT compared to the current case,
> since DFG JIT gives up emitting this fast path even if SSE2 is available.
> 
> (2) Emit x87 in DFG when `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`.
> 
> I think it's not good.

Reading this[1], `volatile double` seems OK.

[1]: https://gcc.gnu.org/wiki/x87note
Comment 27 Yusuke Suzuki 2016-05-16 00:07:29 PDT
Comment on attachment 279002 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=279002&action=review

>>> Source/JavaScriptCore/ChangeLog:18
>>> +        occur in OS X clang 32bit environment. This is because `-mfpmath=sse` is enabled by default in OS X clang 32bit.
>> 
>> Currently, in this patch, I use `volatile double` approach. But there are several alternative designs.
>> 
>> (1) Do not use this fast path in operationMathPow & DFG JIT if `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`.
>> 
>> This is more reliable than using `volatile double`. But it incurs some performance penalty in DFG JIT compared to the current case,
>> since DFG JIT gives up emitting this fast path even if SSE2 is available.
>> 
>> (2) Emit x87 in DFG when `CPU(X86) && (!defined(__SSE2_MATH__) || !defined(__SSE2__))`.
>> 
>> I think it's not good.
> 
> Reading this[1], `volatile double` seems OK.
> 
> [1]: https://gcc.gnu.org/wiki/x87note

And this https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
Comment 28 Alex Christensen 2016-05-16 10:50:42 PDT
Comment on attachment 279002 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=279002&action=review

> Source/JavaScriptCore/runtime/MathCommon.cpp:438
>      if (static_cast<double>(yAsInt) == y && yAsInt > 0 && yAsInt <= maxExponentForIntegerMathPow) {

This should work if yAsInt == 0, too.  It would correctly return 1, and it would do so faster than mathPowInternal.
We could probably also do special optimizations for yAsInt < 0, but there would probably be lots of edge cases and precision problems to worry about.  Same with non-{-0.5, 0.5} multiples of 0.5, which probably aren't very common.

> Source/JavaScriptCore/runtime/MathCommon.cpp:448
> +        typedef volatile double DoubleValue;

Will we have to do this elsewhere, too?  I've solved 80-bit precision (actually 64-bit mantissa) rounding issues on Windows with _control_fp and _PC_53, but that changes the rounding mode globally for the application.
Comment 29 Benjamin Poulain 2016-05-16 17:23:20 PDT
Comment on attachment 279002 [details]
Patch

Good job finding the issue!

I wish there was a way for gcc to use SSE for the entire JSC. I am sure there are plenty of similar bugs that are not covered by tests :(
Comment 30 Yusuke Suzuki 2016-05-16 18:02:55 PDT
Created attachment 279077 [details]
Patch
Comment 31 Yusuke Suzuki 2016-05-16 18:04:16 PDT
Comment on attachment 279002 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=279002&action=review

>> Source/JavaScriptCore/runtime/MathCommon.cpp:438
>>      if (static_cast<double>(yAsInt) == y && yAsInt > 0 && yAsInt <= maxExponentForIntegerMathPow) {
> 
> This should work if yAsInt == 0, too.  It would correctly return 1, and it would do so faster than mathPowInternal.
> We could probably also do special optimizations for yAsInt < 0, but there would probably be lots of edge cases and precision problems to worry about.  Same with non-{-0.5, 0.5} multiples of 0.5, which probably aren't very common.

Nice catch!. DFG / FTL already handles this 0 case. So operationMathPow should be aware of that since operationMathPow's processing should be aligned to DFG / FTL's ones.

>> Source/JavaScriptCore/runtime/MathCommon.cpp:448
>> +        typedef volatile double DoubleValue;
> 
> Will we have to do this elsewhere, too?  I've solved 80-bit precision (actually 64-bit mantissa) rounding issues on Windows with _control_fp and _PC_53, but that changes the rounding mode globally for the application.

I think this should be limited. The requirement is that the generated value is consistent in all the tiers (from LLInt to FTL).
If the value is calculated in 80bit, it's ok if the DFG also calculate it in 80bit.
So, what we should take care is that the mathematical value calculated in DFG like this.

Yeah, as you said, _control_fp in Win (And _FPU_SETCW & _FPU_DOUBLE in the other environments) changes the FPU precision globally.
So in this patch, I take the small changes that effect is limited. However, if changing the configuration is acceptable, changing it can be a good alternative.
Comment 32 Yusuke Suzuki 2016-05-16 18:24:37 PDT
(In reply to comment #29)
> Comment on attachment 279002 [details]
> Patch
> 
> Good job finding the issue!
> 
> I wish there was a way for gcc to use SSE for the entire JSC. I am sure
> there are plenty of similar bugs that are not covered by tests :(

Yeah. -mfpmath=sse & -march=native can enforce gcc to use SSE2 (muld etc.) if they are supported. However, in the environment that does not support SSE2, we need to take the conservative approach. And I think we still need to consider about such environments since the binary packages for i686 (e.g. Debian's i686 port) need to consider the CPU capabilities conservatively. Linux drops i386 support, but i686 support still remains. And i686 (Pentium Pro ~) does not include SSE2 (Pentium 4 ~).

After reconsidering about the probelm, I think Alex's suggestion - changing FPU precision globally - seems nice in WebKit globally.

Pros:
- We don't need to consider about x87 80bit precision anymore.
- In Windows, FPU precision is configured 64bit by default. (And in Linux, this is configued 80bit). So I don't think enforcing 64bit precision incurs some critical problems.

Cons:
- The change affects globally. So all the x87 processing in the process is now enforced to 64bit precision.
- Even if FPU config is changed, libc's math function may return the value in 80bit precision (For example, I've checked that std::pow() still returns 80bit precision value even if we changed x87 precision to 64bit in x86). Potentially, this incurs some inconsistency problems.
- We need to take care of the other libraries that may change the FPU configuration to keep this 64bit. For example, DirectX is known that it changes FPU configuration (without D3DCREATE_FPU_PRESERVE).

Alex, Benjamin, what do you think of?
Comment 33 Benjamin Poulain 2016-05-16 18:41:21 PDT
Comment on attachment 279077 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=279077&action=review

> Source/JavaScriptCore/b3/B3MathExtras.cpp:53
> -        start->appendNew<Const32Value>(procedure, origin, 1000));
> +        start->appendNew<Const32Value>(procedure, origin, maxExponentForIntegerMathPow));

Missing header for this.
Comment 34 Benjamin Poulain 2016-05-16 18:42:05 PDT
Changing the FPU configuration seems fragile to me. We would see bugs appearing if any library decides to change it too.

Alex, what do you think?
Comment 35 Yusuke Suzuki 2016-05-16 18:45:33 PDT
(In reply to comment #34)
> Changing the FPU configuration seems fragile to me. We would see bugs
> appearing if any library decides to change it too.
> 
> Alex, what do you think?

(In reply to comment #32)
> (In reply to comment #29)
> > Comment on attachment 279002 [details]
> > Patch
> > 
> > Good job finding the issue!
> > 
> > I wish there was a way for gcc to use SSE for the entire JSC. I am sure
> > there are plenty of similar bugs that are not covered by tests :(
> 
> Yeah. -mfpmath=sse & -march=native can enforce gcc to use SSE2 (muld etc.)
> if they are supported. However, in the environment that does not support
> SSE2, we need to take the conservative approach. And I think we still need
> to consider about such environments since the binary packages for i686 (e.g.
> Debian's i686 port) need to consider the CPU capabilities conservatively.
> Linux drops i386 support, but i686 support still remains. And i686 (Pentium
> Pro ~) does not include SSE2 (Pentium 4 ~).
> 
> After reconsidering about the probelm, I think Alex's suggestion - changing
> FPU precision globally - seems nice in WebKit globally.
> 
> Pros:
> - We don't need to consider about x87 80bit precision anymore.
> - In Windows, FPU precision is configured 64bit by default. (And in Linux,
> this is configued 80bit). So I don't think enforcing 64bit precision incurs
> some critical problems.

Ah, and potential concern is that there are so many Linux libraries that is not used in Windows; gtk, etc.

> 
> Cons:
> - The change affects globally. So all the x87 processing in the process is
> now enforced to 64bit precision.
> - Even if FPU config is changed, libc's math function may return the value
> in 80bit precision (For example, I've checked that std::pow() still returns
> 80bit precision value even if we changed x87 precision to 64bit in x86).
> Potentially, this incurs some inconsistency problems.
> - We need to take care of the other libraries that may change the FPU
> configuration to keep this 64bit. For example, DirectX is known that it
> changes FPU configuration (without D3DCREATE_FPU_PRESERVE).
> 
> Alex, Benjamin, what do you think of?
Comment 36 Alex Christensen 2016-05-16 21:06:14 PDT
Comment on attachment 279077 [details]
Patch

I think this fix is good, but there are certainly more, and if we found all of them, the code would look messy everywhere and do lots of unnecessary storing to the stack.  Requiring all JSC applications to use 53-bit mantissa mode would be VERY bad, because some of them intentionally change it for more precision in fancy math libraries.  I had to debug such a problem once and that was no fun.
Comment 37 Yusuke Suzuki 2016-05-16 21:25:55 PDT
(In reply to comment #36)
> Comment on attachment 279077 [details]
> Patch
> 
> I think this fix is good, but there are certainly more, and if we found all
> of them, the code would look messy everywhere and do lots of unnecessary
> storing to the stack.  Requiring all JSC applications to use 53-bit mantissa
> mode would be VERY bad, because some of them intentionally change it for
> more precision in fancy math libraries.  I had to debug such a problem once
> and that was no fun.

Sounds reasonable. OK, I'll go with this patch and watch the build bot. Thanks :)
Comment 38 Yusuke Suzuki 2016-05-16 21:26:22 PDT
Comment on attachment 279077 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=279077&action=review

>> Source/JavaScriptCore/b3/B3MathExtras.cpp:53
>> +        start->appendNew<Const32Value>(procedure, origin, maxExponentForIntegerMathPow));
> 
> Missing header for this.

Oops, thanks. Fixed.
Comment 39 Yusuke Suzuki 2016-05-16 21:35:17 PDT
Committed r200996: <http://trac.webkit.org/changeset/200996>
Comment 40 Yusuke Suzuki 2016-05-16 22:00:17 PDT
(In reply to comment #36)
> Comment on attachment 279077 [details]
> Patch
> 
> I think this fix is good, but there are certainly more, and if we found all
> of them, the code would look messy everywhere and do lots of unnecessary
> storing to the stack.  Requiring all JSC applications to use 53-bit mantissa
> mode would be VERY bad, because some of them intentionally change it for
> more precision in fancy math libraries.  I had to debug such a problem once
> and that was no fun.

Once the new code like this is found, we need to extract this DoubleValue type and use it in such code to ensure 64bit precision.
I believe such code is rare; the problematic pattern is that the same floating point calculation is done in C runtime and Baseline / DFG (not FTL) JIT SSE2 code separately.
FTL does not have a problem since FTL is not enabled in x86 32bit.
Comment 41 Yusuke Suzuki 2016-05-16 22:43:44 PDT
operations with base = 0.6931471805599453 exponent = 999 are fixed.
But still failing due to exponent = 0.5 / -0.5 special path cases.
Comment 42 Yusuke Suzuki 2016-05-16 22:53:02 PDT
Ah, OK. math-pow-with-constants.js cases are fixed.
But math-pow-stable-results.js cases remain.
This is because of the similar issue in sqrt. operationMathPow uses sqrt. This c sqrt function is implementation-dependent, but, actually, the result is calculated in 80bit precision in glibc.
However, DFG JIT fast path for 0.5 / -0.5 uses sqrt SSE operation. Of course, they have 64bit precision.

I think the similar issue should occur in ArithSqrt and std::sqrt.
Seeing the ArithXXX nodes at glance, the other nodes do not have the problems.
Comment 43 Yusuke Suzuki 2016-05-16 23:33:16 PDT
I've just opened the new bug to handle sqrt and pow.
https://bugs.webkit.org/show_bug.cgi?id=157787
Comment 44 Yusuke Suzuki 2016-05-17 06:42:38 PDT
Committed r201014: <http://trac.webkit.org/changeset/201014>