Bug 197797

Summary: Tail calls are broken on ARM_THUMB2 and MIPS
Product: WebKit Reporter: Caio Lima <ticaiolima>
Component: JavaScriptCoreAssignee: Caio Lima <ticaiolima>
Status: RESOLVED FIXED    
Severity: Normal CC: commit-queue, ews-watchlist, guijemont, guijemont+jsc-armv7-ews, keith_miller, mark.lam, msaboff, rmorisset, saam, tzagallo, webkit-bug-importer, ysuzuki
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on:    
Bug Blocks: 196399    
Attachments:
Description Flags
WIP - Patch
guijemont+jsc-armv7-ews: commit-queue-
Archive of layout-test-results from ews215 for win-future
none
WIP - Patch
guijemont+jsc-armv7-ews: commit-queue-
WIP - Patch
none
WIP - Patch
none
WIP - Patch
none
WIP - Patch
none
ARMv7 Benchmark results
none
Archive of layout-test-results from ews211 for win-future
none
WIP - Patch
none
MIPS performance benchmarks
none
Patch
none
Archive of layout-test-results from ews211 for win-future
none
Patch
none
Patch none

Description Caio Lima 2019-05-10 13:59:06 PDT
...
Comment 1 Caio Lima 2019-05-10 15:12:58 PDT
Created attachment 369606 [details]
WIP - Patch

WIP
Comment 2 jsc-armv7 EWS 2019-05-10 17:46:30 PDT
Comment on attachment 369606 [details]
WIP - Patch

Attachment 369606 [details] did not pass jsc-armv7-ews (jsc-only):
Output: https://webkit-queues.webkit.org/results/12157526

New failing tests:
jsc-layout-tests.yaml/js/script-tests/regress-150580.js.layout
jsc-layout-tests.yaml/js/script-tests/regress-150580.js.layout-dfg-eager-no-cjit
jsc-layout-tests.yaml/js/script-tests/regress-150580.js.layout-no-cjit
stress/tail-call-with-spilled-registers.js.--useConcurrentJIT=false
stress/call-link-info-osrexit-repatch.js.ftl-eager
jsc-layout-tests.yaml/js/script-tests/regress-150580.js.layout-no-llint
apiTests
Comment 3 EWS Watchlist 2019-05-11 04:02:22 PDT
Comment on attachment 369606 [details]
WIP - Patch

Attachment 369606 [details] did not pass win-ews (win):
Output: https://webkit-queues.webkit.org/results/12161417

New failing tests:
security/contentSecurityPolicy/video-with-file-url-allowed-by-media-src-star-with-AllowContentSecurityPolicySourceStarToMatchAnyProtocol-enabled.html
Comment 4 EWS Watchlist 2019-05-11 04:02:24 PDT
Created attachment 369650 [details]
Archive of layout-test-results from ews215 for win-future

The attached test failures were seen while running run-webkit-tests on the win-ews.
Bot: ews215  Port: win-future  Platform: CYGWIN_NT-10.0-17763-3.0.5-338.x86_64-x86_64-64bit
Comment 5 Caio Lima 2019-05-14 14:37:01 PDT
Created attachment 369895 [details]
WIP - Patch
Comment 6 jsc-armv7 EWS 2019-05-14 17:03:40 PDT
Comment on attachment 369895 [details]
WIP - Patch

Attachment 369895 [details] did not pass jsc-armv7-ews (jsc-only):
Output: https://webkit-queues.webkit.org/results/12191921

New failing tests:
jsc-layout-tests.yaml/js/script-tests/regress-150580.js.layout-dfg-eager-no-cjit
es6.yaml/es6/proper_tail_calls_tail_call_optimisation_mutual_recursion.js.default
stress/shadow-chicken-enabled.js.shadow-chicken
jsc-layout-tests.yaml/js/script-tests/regress-150580.js.layout-no-llint
stress/mutual-tail-call-no-stack-overflow.js.dfg-eager
stress/dfg-tail-calls.js.no-llint
jsc-layout-tests.yaml/js/script-tests/regress-150580.js.layout
stress/mutual-tail-call-no-stack-overflow.js.no-llint
stress/dfg-tail-calls.js.default
stress/mutual-tail-call-no-stack-overflow.js.no-cjit-collect-continuously
stress/dfg-tail-calls.js.no-cjit-validate-phases
stress/mutual-tail-call-no-stack-overflow.js.dfg-maximal-flush-validate-no-cjit
jsc-layout-tests.yaml/js/script-tests/regress-150580.js.layout-no-cjit
stress/dfg-tail-calls.js.dfg-maximal-flush-validate-no-cjit
stress/mutual-tail-call-no-stack-overflow.js.no-cjit-validate-phases
stress/dfg-tail-calls.js.dfg-eager-no-cjit-validate
stress/mutual-tail-call-no-stack-overflow.js.dfg-eager-no-cjit-validate
stress/dfg-tail-calls.js.dfg-eager
stress/mutual-tail-call-no-stack-overflow.js.default
stress/dfg-tail-calls.js.no-cjit-collect-continuously
stress/call-link-info-osrexit-repatch.js.ftl-eager
apiTests
Comment 7 Caio Lima 2019-05-15 12:10:53 PDT
Created attachment 369981 [details]
WIP - Patch

Test to verify EWS.
Comment 8 Caio Lima 2019-05-15 19:19:41 PDT
Created attachment 370019 [details]
WIP - Patch

I want to test EWS.
Comment 9 EWS Watchlist 2019-05-15 19:21:42 PDT
Attachment 370019 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/runtime/StackAlignment.h:35:  Place brace on its own line for function definitions.  [whitespace/braces] [4]
Total errors found: 1 in 4 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 10 Caio Lima 2019-05-15 19:25:13 PDT
Created attachment 370021 [details]
WIP - Patch

Testing EWS.
Comment 11 EWS Watchlist 2019-05-15 19:27:22 PDT
Attachment 370021 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/runtime/StackAlignment.h:35:  Place brace on its own line for function definitions.  [whitespace/braces] [4]
Total errors found: 1 in 4 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 12 Caio Lima 2019-05-26 11:13:30 PDT
Created attachment 370650 [details]
WIP - Patch
Comment 13 Caio Lima 2019-05-26 11:33:01 PDT
Created attachment 370651 [details]
ARMv7 Benchmark results

This change seems to be neutral on ARMv7.
Comment 14 EWS Watchlist 2019-05-26 17:06:41 PDT
Comment on attachment 370650 [details]
WIP - Patch

Attachment 370650 [details] did not pass win-ews (win):
Output: https://webkit-queues.webkit.org/results/12296122

New failing tests:
svg/text/textpath-reference-update.html
storage/indexeddb/modern/gc-closes-database.html
Comment 15 EWS Watchlist 2019-05-26 17:06:44 PDT
Created attachment 370661 [details]
Archive of layout-test-results from ews211 for win-future

The attached test failures were seen while running run-webkit-tests on the win-ews.
Bot: ews211  Port: win-future  Platform: CYGWIN_NT-10.0-17763-3.0.5-338.x86_64-x86_64-64bit
Comment 16 Yusuke Suzuki 2019-05-28 02:11:38 PDT
I don't think this is the right fix. In x64 and ARM64, super actively maintained JIT architectures, we are strongly assuming this requirement. I think the right fix is pushing something or doing something, and adjusting the misaligned stack pointer to the aligned one in these architectures.
Comment 17 Caio Lima 2019-05-28 08:12:21 PDT
(In reply to Yusuke Suzuki from comment #16)
> I don't think this is the right fix. In x64 and ARM64, super actively
> maintained JIT architectures, we are strongly assuming this requirement. I
> think the right fix is pushing something or doing something, and adjusting
> the misaligned stack pointer to the aligned one in these architectures.

Thx for the comment! I'm going to resuse part of the text I was adding to the ChangeLog.

Let's consider the broken case with the following script:

```
let o = {
    get x() {
        return tail_call();
    }
}

let foo = (o) => {
    ...
    return o.x == o.y;
}
```

Also, lets suppose that we compiled `foo` into DFG and we generated IC
code for `o.x` and `o.y`. The register `r1` is assigned to `o`. Since the IC code of `o.x` is a getter case, this code
will spill `r1` to the stack to proper setup the JS call and restore
its value when the getter returns from execution (see third line of stack below).
The getter IC code allocates 32 bytes to call the JS getter, and during the
getter's prologue, `lr` and `cfr` are pushed to the stack, ending up
with a new frame pointed by `cfr1`. During tail call preparation, we
then calculate the top of stack using the following operations:

```
muli SlotSize, argc # Slot size is 8
# StackAlignment = 16
# CallFrameHeaderSize = 32
addi StackAlignment - 1 + CallFrameHeaderSize, argc, argcInBytes
andi ~StackAlignmentMask, argcInBytes

move cfr1, newCFR
addp argcInBytes, newCFR
```

Cosidering `argc = 1` in the operation above, the result is then `argcInBytes = 48`. Adding 48 bytes to `newCFR` will make the top of the stack off by 8 bytes, since current stack frame is 40 bytes (32 from caller allocation + 4 of `lr` + 4 of `cfr`).
In the end, when we start the copy
operation of the stack, the address where `r1` is spilled will be
cloberred, generating unexpected results. 

+----------------+ <-- Stack of program execution. Each cell is 4 bytes.
|      ...       |
+----------------+ <-- newCFR (where the calculation of top of stack will result)
|      ...       |
+----------------+  
|      r1        |
+----------------+ <-- Start of stack allocated to call getter
|      ...       | |
+----------------+ | 32 bytes
|      ...       | |
+----------------+ <-- End of stack allocated to call getter
|      lr        |
+----------------+
|      cfr       |
+----------------+ <-- cfr1

Now that I gave more context about the issue, lets move to possible solutions.

The first solution, as you mentioned, would be to adjust `cfr` to be 16-byte aligned and calculate top of stack properly. The problem of this approach is that every time we execute `prepareForTailCall`, we will be increasing the stack 8-bytes to adjust `cfr`, causing Stack Overflow at some point. That's what caused the errors on https://bugs.webkit.org/attachment.cgi?id=369606.
Pushing things to the stack to align it is pretty much the same as first approach. The problem triggered specifically when there is an IC getter and tail calls involved, but I have the feeling that this stack alignment of 16 bytes can cause crashes on other cases. The problem is that they are very hard to reproduce.

Changing the stack alignment for those architectures seems to be a better fix IMHO,  because it will reflect the real alignment of their stack. Also, it seems to be performance neutral on ARMv7 (I'm collecting numbers from MIPS now). Maybe I'm missing some context of the rest of the code, but I don't understand what would be the problem of changing this requirement. If it makes the code harder to maintain or cause any kind of regression, I agree we should look for another solution, but I don't know if that is the case.
Comment 18 Caio Lima 2019-05-28 14:36:33 PDT
Created attachment 370784 [details]
WIP - Patch
Comment 19 Caio Lima 2019-05-28 14:38:35 PDT
Created attachment 370785 [details]
MIPS performance benchmarks

This patch also seems to be performance neutral (to slightly better) on MIPS.
Comment 20 Caio Lima 2019-06-06 23:57:34 PDT
Created attachment 371567 [details]
Patch
Comment 21 EWS Watchlist 2019-06-07 02:57:17 PDT
Comment on attachment 371567 [details]
Patch

Attachment 371567 [details] did not pass win-ews (win):
Output: https://webkit-queues.webkit.org/results/12403988

New failing tests:
imported/blink/fast/canvas/bug382588.html
Comment 22 EWS Watchlist 2019-06-07 02:57:20 PDT
Created attachment 371580 [details]
Archive of layout-test-results from ews211 for win-future

The attached test failures were seen while running run-webkit-tests on the win-ews.
Bot: ews211  Port: win-future  Platform: CYGWIN_NT-10.0-17763-3.0.5-338.x86_64-x86_64-64bit
Comment 23 Caio Lima 2019-06-11 10:25:07 PDT
Ping review.
Comment 24 Caio Lima 2019-06-11 10:27:24 PDT
Comment on attachment 371580 [details]
Archive of layout-test-results from ews211 for win-future

Failure seems not related with changes.
Comment 25 Caio Lima 2019-06-17 13:22:14 PDT
Ping review.
Comment 26 Michael Saboff 2019-06-17 15:22:58 PDT
Comment on attachment 371567 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=371567&action=review

r-

> Source/JavaScriptCore/bytecode/AccessCase.cpp:899
> +            // For ARMv7 and MIPS architectures, we need 8 extra bytes to
> +            // guarantee that stack size have enough space to be reused by a
> +            // tail call. Since sizeof(CallerFrameAndPC) == 8 for those architectures,
> +            // we only need to calculate `numberOfBytesForCall = numberOfRegsForCall * sizeof(Register)`.
> +            unsigned numberOfBytesForCall = numberOfRegsForCall * sizeof(Register);

This doesn't seem like a reasonable fix.  It "works" by creating extra stack space, but that would suggest that our stack allocation code when making calls is suspect, which I strongly doubt.  It is more likely that the call frame shuffler doesn't know that r1 got spilled or is still alive, or there is another call frame shuffler bug.
Comment 27 Caio Lima 2019-06-18 05:48:42 PDT
(In reply to Michael Saboff from comment #26)
> Comment on attachment 371567 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=371567&action=review
> 
> r-
> 
> > Source/JavaScriptCore/bytecode/AccessCase.cpp:899
> > +            // For ARMv7 and MIPS architectures, we need 8 extra bytes to
> > +            // guarantee that stack size have enough space to be reused by a
> > +            // tail call. Since sizeof(CallerFrameAndPC) == 8 for those architectures,
> > +            // we only need to calculate `numberOfBytesForCall = numberOfRegsForCall * sizeof(Register)`.
> > +            unsigned numberOfBytesForCall = numberOfRegsForCall * sizeof(Register);
> 
> This doesn't seem like a reasonable fix.  It "works" by creating extra stack
> space, but that would suggest that our stack allocation code when making
> calls is suspect, which I strongly doubt.  It is more likely that the call
> frame shuffler doesn't know that r1 got spilled or is still alive, or there
> is another call frame shuffler bug.

I think I was not clear enough in my previous explanation, so let me give deeper details and more concrete data.
There is no problem with call frame shuffler and spilled registers. See the generated code for the crashing test on ARMv7:

```
Generated JIT code for Access stub for c1#AVpIb9:[0x7289c2a0->0x7289c0e0->0x728ce1c0, DFGFunctionCall, 54 (NeverInline) (StrictMode)] bc#10 with return point CodePtr(executable = 0x73dfe731, dataLocation = 0x73dfe730):
Getter:(Generated, offset = 0, structure = 0x728ff110:[Object, {c2:0}, NonArray, Proto:0x728c4000, Leaf], viaProxy = false, additionalSet = (nil), customSlotBase = (nil), callLinkInfo = 0x73ea8400, customAccessor = (nil)):
    Code at [0x73dfe061, 0x73dfe121):
          0x73dfe060: ldr r6, [r0]
          0x73dfe062: movw ip, #0xf110
          0x73dfe066: movt ip, #0x728f
          0x73dfe06a: cmp.w r6, ip
          0x73dfe06e: ittt ne
          0x73dfe070: nopne
          0x73dfe072: nopne.w
          0x73dfe076: bne.w #0x73dfe8f8
          0x73dfe07a: ldr r3, [r0, #0x10]
          0x73dfe07c: sub sp, #0x20
          0x73dfe07e: str r0, [sp]       <----------------------------- Spilling `r0`
          0x73dfe080: str r1, [sp, #8]
          0x73dfe082: str r2, [sp, #0x10]
          0x73dfe084: mov.w ip, #0
          0x73dfe088: str.w ip, [r7, #0x1c]
          0x73dfe08c: ldr r3, [r3, #0x10]
          0x73dfe08e: tst r3, r3
          0x73dfe090: beq #0x73dfe0ec
          0x73dfe092: sub sp, #0x20   <----------------------------- Stack allocation to call getter
          0x73dfe094: mov.w ip, #1
          0x73dfe098: str.w ip, [sp, #0x10]
          0x73dfe09c: str r3, [sp, #8]
          0x73dfe09e: mvn ip, #4
          0x73dfe0a2: str.w ip, [sp, #0xc]
          0x73dfe0a6: str r0, [sp, #0x18]
          0x73dfe0a8: mvn ip, #4
          0x73dfe0ac: str.w ip, [sp, #0x1c]
          0x73dfe0b0: movw ip, #0
          0x73dfe0b4: movt ip, #0
          0x73dfe0b8: cmp.w r3, ip
          0x73dfe0bc: bne #0x73dfe0ce
          0x73dfe0be: movw ip, #0
          0x73dfe0c2: movt ip, #0
          0x73dfe0c6: blx ip
          0x73dfe0c8: mov r2, r1
          0x73dfe0ca: mov r1, r0
          0x73dfe0cc: b #0x73dfe0f2
          0x73dfe0ce: mov r0, r3
          0x73dfe0d0: mvn r1, #4
          0x73dfe0d4: movw r2, #0x8400
          0x73dfe0d8: movt r2, #0x73ea
          0x73dfe0dc: movw ip, #0xf501
          0x73dfe0e0: movt ip, #0x72df
          0x73dfe0e4: blx ip
          0x73dfe0e6: mov r2, r1
          0x73dfe0e8: mov r1, r0
          0x73dfe0ea: b #0x73dfe0f2
          0x73dfe0ec: mvn r2, #3
          0x73dfe0f0: movs r1, #0
          0x73dfe0f2: mvn ip, #0x87
          0x73dfe0f6: add ip, r7
          0x73dfe0f8: mov sp, ip
          0x73dfe0fa: ldr r0, [sp]         <----------------------------- Restoring `r0`
          0x73dfe0fc: add sp, #0x20
          0x73dfe0fe: nop
          0x73dfe100: nop.w
          0x73dfe104: b.w #0x73dfe730
          0x73dfe108: nop
          0x73dfe10a: nop.w
          0x73dfe10e: b.w #0x73dfe8f8
          0x73dfe112: bkpt #0
          0x73dfe114: bkpt #0                                                                                                                                                                                                                                                            
          0x73dfe116: bkpt #0
          0x73dfe118: bkpt #0
          0x73dfe11a: bkpt #0
```

In this example, the corrupted value is in `r0`. We can see on instruction `0x73dfe07e` that this register is being spilled to `[sp]` and being restored on `0x73dfe0fa`. This same generated code shows why we are having corrupted value on `r0` after the execution of getter. Instruction `0x73dfe092` is allocating 32 bytes to call the getter `c2`. The JS code of this getter is the following:

```
function c2() {
    let b = Math.random();
    let c = Math.random();
    let d = Math.random();
    let e = Math.random();
    return c3('test', b, c, d, e);
}
```

In this case, this getter contains a tail call and that’s where the problem is. If we look into LLInt operation to calculate the top of stack during `prepareForTailCall`, we can see that the minimum size of a stack frame when there is 1 argument is 48 bytes:

```
loadi PayloadOffset + ArgumentCount[cfr], temp2
loadp CodeBlock[cfr], temp1
loadi CodeBlock::m_numParameters[temp1], temp1
bilteq temp1, temp2, .noArityFixup
move temp1, temp2

.noArityFixup:
# We assume < 2^28 arguments
# At this point, temp2 is 1, since the getter contains only `this` as parameter.
# SlotSize is 8.
muli SlotSize, temp2
# StackAlignment - 1 here is 15 and CallFrameHeaderSize is 32 on MIPS and ARM.
# This will result in temp2 + 47. In our example, we will then have
# temp2 = 8 + 47 = 55
addi StackAlignment - 1 + CallFrameHeaderSize, temp2
andi ~StackAlignmentMask, temp2
# After applying the mask temp2 == 48

 # Then calculate top of stack
 move cfr, temp1
 addp temp2, temp1
```

When the program reaches this point, the current call frame size is 40 bytes (32 allocated by the caller plus 4 bytes of `lr` and 4 bytes of `cfr` pushed to stack during prologue). This means that the top of stack is 8 bytes off from its real position and the memory position where `r0` is spilled will be clobbered during copy operation. This is also going to be the case for “prepare for tail call” on Baseline and DFG.
In the end, when `r0` is restored from stack, it will then contain garbage and next access to it can potentially crash.  This is not a problem on 64-bit architectures, because their stack frame will have at least 48 bytes in this case.
Given this analysis, I proposed 2 fixes in mind (and I also uploaded both of them in this bug). One of them is allocating +8 bytes during getter IC call and avoid clobbering on spilled registers (that’s what the current patch is doing). Another solution is to change StackAlignment to 8 bytes on those architectures (https://bugs.webkit.org/attachment.cgi?id=370784&action=diff), since it will then make the calculation of top of stack work properly. IMHO, the later solution sounds better because it will reflect the real stack alignment  of ARMv7 and MIPS. Maybe there is something behind the scenes that makes such change not correct (see comment #16), but, TBH, I can’t find any evidence that changing the alignment requirement can cause issues. Also, I’ve already ran performance benchmarks to check if changing the stack alignment will cause any regression, but the results are neutral (they are also uploaded to the bug). Anyway, I’m fine with any of these solutions.


The final question is: do these solutions still looks wrong fix for the bug?

PS: One more important thing to share is the reason we can’t see such issues on regular calls. In this case, we allocate at least 48 bytes, even if there is no argument for the call.
Comment 28 Caio Lima 2019-06-25 10:06:55 PDT
@Yusuke and @Michael Saboff Does your opinions prevail after my comment #27?
Comment 29 Caio Lima 2019-06-27 12:43:22 PDT
(In reply to Caio Lima from comment #28)
> @Yusuke and @Michael Saboff do your opinions prevail after my comment #27?

Ping.
Comment 30 Caio Lima 2019-07-03 12:10:30 PDT
(In reply to Caio Lima from comment #28)
> @Yusuke and @Michael Saboff Does your opinions prevail after my comment #27?

ping
Comment 31 Caio Lima 2020-03-03 11:04:22 PST
Created attachment 392300 [details]
Patch
Comment 32 Caio Lima 2020-03-05 09:50:12 PST
Ping
Comment 33 Yusuke Suzuki 2020-03-09 00:17:01 PDT
Comment on attachment 392300 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=392300&action=review

r=me

> Source/JavaScriptCore/ChangeLog:9
> +        `prepareForTailCall` operation expects that header size + parameters
> +        size is aligned with stack (alignment is 16-bytes for every architecture).

Can we debug-assert it?
Comment 34 Caio Lima 2020-03-09 07:55:04 PDT
Created attachment 393034 [details]
Patch
Comment 35 Caio Lima 2020-03-09 07:55:57 PDT
Comment on attachment 392300 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=392300&action=review

Thank you very much for the review!

>> Source/JavaScriptCore/ChangeLog:9
>> +        size is aligned with stack (alignment is 16-bytes for every architecture).
> 
> Can we debug-assert it?

Done.
Comment 36 WebKit Commit Bot 2020-03-09 08:32:58 PDT
Comment on attachment 393034 [details]
Patch

Clearing flags on attachment: 393034

Committed r258143: <https://trac.webkit.org/changeset/258143>
Comment 37 WebKit Commit Bot 2020-03-09 08:33:00 PDT
All reviewed patches have been landed.  Closing bug.
Comment 38 Radar WebKit Bug Importer 2020-03-09 08:33:22 PDT
<rdar://problem/60224122>