Bug 149061

Summary: [ARM] REGRESSION(r189575): It made 2860 tests fail/crash on AArch64 Linux
Product: WebKit Reporter: Csaba Osztrogonác <ossy>
Component: JavaScriptCore Assignee: Nobody <webkit-unassigned>
Status: RESOLVED DUPLICATE    
Severity: Normal CC: clopez, fpizlo, mark.lam, msaboff, ossy
Priority: P2    
Version: Other   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on:    
Bug Blocks: 108645, 148666    
Attachments:
Description Flags
Patch used for X86-64 Callee Saves debugging none

Description Csaba Osztrogonác 2015-09-11 04:32:12 PDT
Unfortunately I can't add the details to the bug report, because 
https://build.webkit.org/waterfall is out of order again and again. :-/
Comment 1 Csaba Osztrogonác 2015-09-11 04:37:37 PDT
Ah, build.webkit.org works again, so here is the link about this regression:
https://build.webkit.org/builders/EFL%20Linux%20AArch64%20Release/builds/3270

I tested manually, everything works fine on r189574, but there 
are 2860 failures/crashes on r189575 (with its build fix r189588).

I'm going to investigate this issue and try to provide
debug build logs and/or other useful information.
Comment 2 Csaba Osztrogonác 2015-09-11 06:39:20 PDT
Unfortunately I can't reproduce this bug in debug mode. :(
I will try to reproduce it on a release build with debug symbols.
Comment 3 Michael Saboff 2015-09-11 09:48:29 PDT
While debugging the callee saves work, I would run into failures on release builds that wouldn't reproduce with debug builds.  Typically this was due to the optimizer making use of callee saves registers in the compiled C++ code.  If JSC inadvertently stepped on one of those registers, it would only cause a problem on release builds.

The first place I would look is in the FTL code.  For example, I didn't test any of the changes to the Linux specific code in FTLUnwindInfo.cpp.  See if failing tests work when the FTL is turned off.

One technique that I used to track down these kinds of problems was to add back in the saving and restoring of callee saves to the pushCalleeSaves() / popCalleeSaves() macros in LowLevelInterpreter.asm and then in LowLevelInterpreter64.asm:doVMEntry, write sentinel numeric values to the callee saves registers, e.g. 0x1019 to x19, 0x1020 to x20, ... After "makeCall()" in doVMEntry and at the beginning of _handleUncaughtException, compare the values with a breakpoint on mismatch.  I made a macro to do the testing.  That did two things.  First, it allowed building with debug.  But probably more useful was that at any point executing in the JavaScript VMs I could look at the registers to see that they had the sentinel values where they should.  I could also check in the CallFrames that we saved the sentinel values where appropriate.  I'll post a patch with this technique that I used for X86-64 debugging.
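
The sentinel check described above can be modeled outside of JSC. What follows is only a minimal sketch, not the attached patch and not offlineasm: it treats the AArch64 callee saves registers x19..x28 as a plain dictionary, and every name in it (CALLEE_SAVED, sentinel_for, checked_call, the two sample callees) is hypothetical. It also assumes the 0x1019 / 0x1020 pattern simply continues for the remaining registers.

```python
# Hypothetical model of the sentinel technique; none of these names exist in JSC.
# The AArch64 callee-saved general purpose registers are x19..x28.
CALLEE_SAVED = [f"x{n}" for n in range(19, 29)]


def sentinel_for(reg):
    # Assumed convention, extended from the thread: 0x1019 for x19, 0x1020 for x20, ...
    return int("10" + reg[1:], 16)


def write_sentinels(regs):
    # Stands in for writing the sentinels in doVMEntry before the call.
    for reg in CALLEE_SAVED:
        regs[reg] = sentinel_for(reg)


def find_mismatches(regs):
    # Stands in for the compare-and-break step after the call returns.
    return [(reg, regs[reg]) for reg in CALLEE_SAVED if regs[reg] != sentinel_for(reg)]


def checked_call(callee):
    regs = {}
    write_sentinels(regs)
    callee(regs)
    return find_mismatches(regs)


def well_behaved(regs):
    # Uses x19 but saves and restores it, as a correct callee must.
    saved = regs["x19"]
    regs["x19"] = 42
    regs["x19"] = saved


def clobbering(regs):
    # Overwrites x20 without restoring it; this is the kind of bug being hunted.
    regs["x20"] = 0xDEAD
```

Calling checked_call(clobbering) reports x20 as the clobbered register, while checked_call(well_behaved) reports nothing. In the real setup the comparison would run after "makeCall()" and at the start of _handleUncaughtException, with a breakpoint on mismatch instead of a returned list.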
Comment 4 Michael Saboff 2015-09-11 09:50:48 PDT
Created attachment 261007 [details]
Patch used for X86-64 Callee Saves debugging
Comment 5 Csaba Osztrogonác 2015-09-15 02:59:46 PDT
(In reply to comment #3)
> While debugging the callee saves work, I would run into failures on release
> builds that wouldn't reproduce with debug builds.  Typically this was due to
> the optimizer making use of callee saves registers in the compiled C++ code.
> If JSC inadvertently stepped on one of those registers, it would only cause
> a problem on release builds.
> 
> The first place I would look is in the FTL code.  For example, I didn't test
> any of the changes to the Linux specific code in FTLUnwindInfo.cpp.  See if
> failing tests work when the FTL is turned off.
> 
> One technique that I used to track down these kinds of problems was to add
> back in the saving and restoring of callee saves to the pushCalleeSaves() /
> popCalleeSaves() macros in LowLevelInterpreter.asm and then in
> LowLevelInterpreter64.asm:doVMEntry, write sentinel numeric values to the
> callee saves registers, e.g. 0x1019 to x19, 0x1020 to x20, ... After
> "makeCall()" in doVMEntry and at the beginning of _handleUncaughtException,
> compare the values with a breakpoint on mismatch.  I made a macro to do the
> testing.  That did 2 things, first it allowed building with debug.  But
> probably more useful was that at any point executing in the JavaScript VMs I
> could look at the registers to see that they had the sentinel values where
> they should.  I could also check the CallFrames that we saved the sentinel
> values where appropriate.  I'll post a patch with this technique that I used
> for X86-64 debugging.

Thanks for the ideas and the patch for debugging.

I haven't checked the FTL code yet, because it is disabled by default on Linux.
I don't know if it works at all, I haven't checked it in the last 4-5 months.

But it seems the bug is somewhere in the DFG tier, because the tests pass
(except ~20 tests) when the DFG is disabled at build time. And I already
managed to catch register mismatches with the idea you suggested. I'll
continue debugging in the near future.
Comment 6 Csaba Osztrogonác 2015-11-12 01:58:01 PST
https://trac.webkit.org/changeset/192352 already fixed this issue.
Sometimes Linux failures point out a real but hidden failure on iOS. ;)

before: https://build.webkit.org/builders/EFL%20Linux%20AArch64%20Release/builds/4313 - 3227 failures
after: https://build.webkit.org/builders/EFL%20Linux%20AArch64%20Release/builds/4314 - 52 failures

The remaining failures might be related to this issue or might be a 
different issue, who knows what else happened in the last 2 months.

I'm going to file a new bug report for the remaining failures.

*** This bug has been marked as a duplicate of bug 150936 ***