Bug 149061 - [ARM] REGRESSION(r189575): It made 2860 tests fail/crash on AArch64 Linux
Summary: [ARM] REGRESSION(r189575): It made 2860 tests fail/crash on AArch64 Linux
Status: RESOLVED DUPLICATE of bug 150936
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: Other
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks: 108645 148666
Reported: 2015-09-11 04:32 PDT by Csaba Osztrogonác
Modified: 2015-11-12 01:58 PST
CC List: 5 users

See Also:


Attachments
Patch used for X86-64 Callee Saves debugging (3.74 KB, patch)
2015-09-11 09:50 PDT, Michael Saboff

Description Csaba Osztrogonác 2015-09-11 04:32:12 PDT
Unfortunately I can't add the details to the bug report, because 
https://build.webkit.org/waterfall is out of order again and again. :-/
Comment 1 Csaba Osztrogonác 2015-09-11 04:37:37 PDT
Ah, build.webkit.org works again, so here is the link about this regression:
https://build.webkit.org/builders/EFL%20Linux%20AArch64%20Release/builds/3270

I tested manually: everything works fine on r189574, but there
are 2860 failures/crashes on r189575 (with its buildfix r189588).

I'm going to investigate this issue and try to provide
debug build logs and/or other useful information.
Comment 2 Csaba Osztrogonác 2015-09-11 06:39:20 PDT
Unfortunately I can't reproduce this bug in debug mode. :(
I will try to reproduce it on a release build with debug symbols.
Comment 3 Michael Saboff 2015-09-11 09:48:29 PDT
While debugging the callee saves work, I would run into failures on release builds that wouldn't reproduce with debug builds.  Typically this was due to the optimizer making use of callee saves registers in the compiled C++ code.  If JSC inadvertently stepped on one of those registers, it would only cause a problem on release builds.

The first place I would look is in the FTL code.  For example, I didn't test any of the changes to the Linux specific code in FTLUnwindInfo.cpp.  See if failing tests work when the FTL is turned off.

One technique that I used to track down these kinds of problems was to add back in the saving and restoring of callee saves to the pushCalleeSaves() / popCalleeSaves() macros in LowLevelInterpreter.asm and then, in LowLevelInterpreter64.asm:doVMEntry, write sentinel numeric values to the callee saves registers, e.g. 0x1019 to x19, 0x1020 to x20, ... After "makeCall()" in doVMEntry and at the beginning of _handleUncaughtException, compare the values with a breakpoint on mismatch.  I made a macro to do the testing.  That did two things.  First, it allowed building with debug.  But probably more useful was that at any point executing in the JavaScript VMs I could look at the registers to see that they had the sentinel values where they should.  I could also check the CallFrames to see that we saved the sentinel values where appropriate.  I'll post a patch with this technique that I used for X86-64 debugging.
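
The sentinel check described above can be modeled in plain C++. This is only an illustrative sketch under stated assumptions, not the attached patch and not JSC code: the real technique lives in the offlineasm files and operates on actual machine registers, whereas here an array stands in for x19..x28 and every name (RegisterFile, sentinelFor, writeSentinels, findClobbered, runWithSentinelCheck) is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical model, not JSC code: slot i stands for AArch64 register x(19 + i).
// x19..x28 are the general-purpose callee-saved registers in the AArch64 ABI.
constexpr int kFirstCalleeSave = 19;
constexpr std::size_t kCalleeSaveCount = 10;
using RegisterFile = std::array<std::uint64_t, kCalleeSaveCount>;

// Spell the register number in the low hex digits: x19 -> 0x1019, x20 -> 0x1020.
constexpr std::uint64_t sentinelFor(int reg)
{
    return 0x1000u | (static_cast<std::uint64_t>(reg / 10) << 4) | static_cast<std::uint64_t>(reg % 10);
}
static_assert(sentinelFor(19) == 0x1019 && sentinelFor(20) == 0x1020, "matches the values in the comment");

// Fill every modeled callee-save register with its sentinel (the step done on VM entry).
void writeSentinels(RegisterFile& regs)
{
    for (std::size_t i = 0; i < regs.size(); ++i)
        regs[i] = sentinelFor(kFirstCalleeSave + static_cast<int>(i));
}

// Return the numbers of the registers whose sentinel was overwritten.
std::vector<int> findClobbered(const RegisterFile& regs)
{
    std::vector<int> clobbered;
    for (std::size_t i = 0; i < regs.size(); ++i) {
        int reg = kFirstCalleeSave + static_cast<int>(i);
        if (regs[i] != sentinelFor(reg))
            clobbered.push_back(reg);
    }
    return clobbered;
}

// Write sentinels, run the code under test, then compare (the check made after the call returns).
std::vector<int> runWithSentinelCheck(RegisterFile& regs, const std::function<void(RegisterFile&)>& callee)
{
    writeSentinels(regs);
    callee(regs);
    return findClobbered(regs);
}
```

In this model, a callee that overwrites the slot for x20 makes runWithSentinelCheck return a list containing 20; in the real technique that mismatch is where the breakpoint would fire.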
Comment 4 Michael Saboff 2015-09-11 09:50:48 PDT
Created attachment 261007
Patch used for X86-64 Callee Saves debugging
Comment 5 Csaba Osztrogonác 2015-09-15 02:59:46 PDT
(In reply to comment #3)
> While debugging the callee saves work, I would run into failures on release
> builds that wouldn't reproduce with debug builds.  Typically this was due to
> the optimizer making use of callee saves registers in the compiled C++ code.
> If JSC inadvertently stepped on one of those registers, it would only cause
> a problem on release builds.
> 
> The first place I would look is in the FTL code.  For example, I didn't test
> any of the changes to the Linux specific code in FTLUnwindInfo.cpp.  See if
> failing tests work when the FTL is turned off.
> 
> One technique that I used to track down these kinds of problems was to add
> back in the saving and restoring of callee saves to the pushCalleeSaves() /
> popCalleeSaves() macros in LowLevelInterpreter.asm and then in
> LowLevelInterpreter64.asm:doVMEntry, write sentinel numeric values to the
> callee saves registers, e.g. 0x1019 to x19, 0x1020 to x20, ... After
> "makeCall()" in doVMEntry and at the beginning of _handleUncaughtException,
> compare the values with a breakpoint on mismatch.  I made a macro to do the
> testing.  That did 2 things, first it allowed building with debug.  But
> probably more useful was that at any point executing in the JavaScript VMs I
> could look at the registers to see that they had the sentinel values where
> they should.  I could also check the CallFrames that we saved the sentinel
> values where appropriate.  I'll post a patch with this technique that I used
> for X86-64 debugging.

Thanks for the ideas and the patch for debugging.

I haven't checked the FTL code yet, because it is disabled by default on Linux.
I don't know if it works at all; I haven't checked it in the last 4-5 months.

But it seems the bug is somewhere in the DFG tier, because the tests pass
(except ~20 of them) when the DFG is disabled at build time. And I have already
managed to catch register mismatches with the idea you suggested. I'll continue
debugging in the near future.
Comment 6 Csaba Osztrogonác 2015-11-12 01:58:01 PST
https://trac.webkit.org/changeset/192352 already fixed this issue.
Sometimes Linux failures point out a real but hidden failure on iOS. ;)

before: https://build.webkit.org/builders/EFL%20Linux%20AArch64%20Release/builds/4313 - 3227 failures
after: https://build.webkit.org/builders/EFL%20Linux%20AArch64%20Release/builds/4314 - 52 failures

The remaining failures might be related to this issue or might be a
different issue; who knows what else happened in the last 2 months.

I'm going to file a new bug report for the remaining failures.

*** This bug has been marked as a duplicate of bug 150936 ***