Bug 100117

Summary: REGRESSION(r132143): It made fast/xmlhttprequest/xmlhttprequest-recursive-sync-event.html and editing/style/iframe-onload-crash-mac.html flakey
Product: WebKit Reporter: Csaba Osztrogonác <ossy>
Component: New BugsAssignee: Nobody <webkit-unassigned>
Status: REOPENED ---    
Severity: Critical CC: ap, cdumez, fpizlo, ggaren, mark.lam, ossy
Priority: P1 Keywords: Qt, QtTriaged
Version: 420+   
Hardware: All   
OS: All   
See Also: https://bugs.webkit.org/show_bug.cgi?id=139840
Bug Depends on:    
Bug Blocks: 79666, 99872    

Description Csaba Osztrogonác 2012-10-23 06:41:54 PDT
After r132143 fast/xmlhttprequest/xmlhttprequest-recursive-sync-event.html
became flakey. Sometimes passes, sometimes flakey (fail/pass), sometimes fail (fail/fail).

--- /ramdisk/qt-linux-64-release/build/layout-test-results/fast/xmlhttprequest/xmlhttprequest-recursive-sync-event-expected.txt 
+++ /ramdisk/qt-linux-64-release/build/layout-test-results/fast/xmlhttprequest/xmlhttprequest-recursive-sync-event-actual.txt 
@@ -1,3 +1,4 @@
+CONSOLE MESSAGE: RangeError: Maximum call stack size exceeded.
 CONSOLE MESSAGE: RangeError: Maximum call stack size exceeded.
 This tests that having infinite recursion in XMLHttpRequest event handler does not crash. 
 PASS
Comment 1 Csaba Osztrogonác 2012-10-23 07:11:20 PDT
I skipped it on Qt - r132218. Please unskip it with the proper fix.
Comment 2 Mark Lam 2012-10-23 08:25:51 PDT
See comment https://bugs.webkit.org/show_bug.cgi?id=100109#c3 in bug 100109.  This test flakiness may actually be due to a bug where PLATFORM(EFL)'s StackBounds implementation is not setting up a meaningful stack size/limit.  This will have to be tested and debugged on EFL after 100109 is fixed.  Unfortunately, I don't have the means to test on EFL.
Comment 3 Mark Lam 2012-10-23 08:37:12 PDT
(In reply to comment #2)
> See comment https://bugs.webkit.org/show_bug.cgi?id=100109#c3 in bug 100109.  This test flakiness may actually be due to a bug where PLATFORM(EFL)'s StackBounds implementation is not setting up a meaningful stack size/limit.  This will have to be tested and debugged on EFL after 100109 is fixed.  Unfortunately, I don't have the means to test on EFL.

... or on Qt. :)
Comment 4 Mark Lam 2012-10-23 09:19:50 PDT
I stand corrected.  This is not a stack size issue.  I can reproduce this failure using a debug build on mac (which has a reasonable stack size).  Still investigating ...
Comment 5 Mark Lam 2012-10-23 22:03:40 PDT
This test itself is intrinsically flaky to begin with.  The test failure has nothing to with bug 100109 nor 99872.  Will remove 100109 as a dependency as it is unrelated, but will leave 99872 because this latent issue was exposed when 99872 landed.    Here are the details:

This is what the test does:

function test()
{
    ...
    var xhr = new XMLHttpRequest();
    xhr.onreadystatechange = function() {
        if (xhr.readyState == 4) {
            xhr.open("GET", "recurse.html", false);
            xhr.send(null);
        }
    };
    xhr.open("GET", "recurse.html", false);
    xhr.send(null);
    ...
}

Note that the XMLHttpRequest is a synchronous one.  Hence, the above code is set up as a means to achieve infinite recursion with the following pattern:

test()
   calls xhr.send() 
      which calls xhr.onreadstatechange() with readyState = 1
      and calls xhr.onreadstatechange() with readyState = 4
         which calls xhr.send()
            which calls xhr.onreadstatechange() with readyState = 1
            and calls xhr.onreadstatechange() with readyState = 4
               which calls xhr.send()
                  which calls xhr.onreadstatechange() with readyState = 1
                  and calls xhr.onreadstatechange() with readyState = 4
                     which calls xhr.send()
                        which ...
                        and ...
                           ...
                           which calls xhr.onreadstatechange() with readyState = 1  // case 1
                           and calls xhr.onreadstatechange() with readyState = 4 // case 2
                           ...

At some point, a stack overflow will be detected.  If the stack overflows during the execution right after case 1, the test won't get to call send() again.  So, a stack overflow gets thrown, and no more recursion take place.  In this case, the test will report only 1 StackOverflowError.

If the stack overflows during the execution right after case 2, the test would have just received a state change with readyState 4 and has just sent another request.  The invocation of case 2 causes a StackOverflowError to be thrown and is uncaught all the way out to the WebCore event dispatcher.  But the send() queues another event which causes the dispatcher to re-enter the interpreter again.  This in turn causes triggers a 2nd StackOverflowError.

I haven't isolated the exact root cause yet, but whether the test sees 1 or 2 StackOverflowErrors seem to depend on some race condition.  Here are 2 observations which supports that:
1. If I add/remove a printf to JSEventListener::handleEvent() (which is in the event dispatch code path), it can alter the result between the 2 outcomes.
2. A release build of DumpRenderTree that passes the test when run via run-webkit-test will fail the test if I run DumpRenderTree on the individual test case from the command line.

It doesn't look like the test failure has to do with the amount of stack space available, and hence is not really caused by 99872.
Comment 6 Mark Lam 2012-10-23 22:07:27 PDT
(In reply to comment #5)
> This test itself is intrinsically flaky to begin with. 

I neglected to point out that the test is flaky because it can not guarantee how much native stack is available before a stack overflow is detected.  Hence, it cannot control whether we detect the overflow right after case 1 in the test, or right after case 2.  As a result of that, the test cannot guarantee whether the result will see 1 or 2 StackOverflowError.
Comment 7 Csaba Osztrogonác 2012-10-24 03:03:05 PDT
Additionally one more test started to fail with same error,
but only with parallel NRWT:

--- /home/webkitbuildbot/slaves/release32bit-NRWT/buildslave/qt-linux-32-release-NRWT/build/layout-test-results/editing/style/iframe-onload-crash-mac-expected.txt 
+++ /home/webkitbuildbot/slaves/release32bit-NRWT/buildslave/qt-linux-32-release-NRWT/build/layout-test-results/editing/style/iframe-onload-crash-mac-actual.txt 
@@ -1,1 +1,2 @@
+CONSOLE MESSAGE: RangeError: Maximum call stack size exceeded.
 PASS
Comment 8 Mark Lam 2012-10-24 09:47:08 PDT
Adding some instrumentation to inspect how much stack is available at the time the stack overflows are detected, and correlating that with whether we last saw xhr.onreadstatechange() with readyState 1 or readyState 4, I confirmed that the only determining factor as to whether whether the test produces 1 StackOverflowError or 2 is based on whether we run out of stack with readyState 1 or readyState 2.

The reason that this test did not seem flaky before 99872 is because previously, we did not do stack checks based on a measurement of available stack size, but instead estimated it based on depth of reentry into the interpreter.  With reentry depth, stack usage can differ between builds and platforms, but we will still be guaranteed the same number of interpreter/VM reentries and hence guarantee that we declare a stack overflow at one of the readyState 1 or readyState 4 point, and baseline the expected result accordingly.  This gives the illusion that the test is stable independent of stack usage. 

In summary, this means that the change set in 99872 only exposed a latent flakiness that was in the test to begin with.

Can we stabilize the test?
===============
No.  This is because the stack overflow is handled at the native level in WebCore.  The test (in JS) is not even aware that a stack overflow error has occurred.  Also, the test (in JS) has no way to determine how much native stack is available, nor how much stack usage per VM reentry is needed to ensure that we only see 1 StackOverflowError.  Hence, we will have to expect that the test will remain flaky.

Is this test still valid?
============
Yes.  The goal of the test is to verify that the test did not crash.  I believe leaving the test marked as flaky should serve this purpose adequately, and we only need to pay attention to it when it crashes.

Hence, I think there is nothing to fix here.

I will take a look at Csaba's other test failure before closing this as WONT FIX.
Comment 9 Mark Lam 2012-10-24 10:19:40 PDT
Csaba, I can't reproduce the editing/style/iframe-onload-crash-mac.html failure on NWRT on mac.  I'm closing this bug as WONT FIX.  If the editing/style/iframe-onload-crash-mac issue persists, please file another bug for it.  Thanks.
Comment 10 Mark Lam 2012-10-24 10:20:53 PDT
The test is intrinsically flaky and cannot be fixed.  Closing the bug as WONT FIX.
Comment 11 Csaba Osztrogonác 2012-11-04 02:53:04 PST
(In reply to comment #9)
> Csaba, I can't reproduce the editing/style/iframe-onload-crash-mac.html failure on NWRT on mac.  I'm closing this bug as WONT FIX.  If the editing/style/iframe-onload-crash-mac issue persists, please file another bug for it.  Thanks.

editing/style/iframe-onload-crash-mac.html almost always fail on Qt NRWT bot (with 4 parallel jobs) with different number of dumped error lines:

--- /home/webkitbuildbot/slaves/release32bit-NRWT/buildslave/qt-linux-32-release-NRWT/build/layout-test-results/editing/style/iframe-onload-crash-mac-expected.txt 
+++ /home/webkitbuildbot/slaves/release32bit-NRWT/buildslave/qt-linux-32-release-NRWT/build/layout-test-results/editing/style/iframe-onload-crash-mac-actual.txt 
@@ -1,1 +1,2 @@
+CONSOLE MESSAGE: RangeError: Maximum call stack size exceeded.
 PASS


--- /home/webkitbuildbot/slaves/release32bit-NRWT/buildslave/qt-linux-32-release-NRWT/build/layout-test-results/editing/style/iframe-onload-crash-mac-expected.txt 
+++ /home/webkitbuildbot/slaves/release32bit-NRWT/buildslave/qt-linux-32-release-NRWT/build/layout-test-results/editing/style/iframe-onload-crash-mac-actual.txt 
@@ -1,1 +1,4 @@
+CONSOLE MESSAGE: RangeError: Maximum call stack size exceeded.
+CONSOLE MESSAGE: RangeError: Maximum call stack size exceeded.
+CONSOLE MESSAGE: RangeError: Maximum call stack size exceeded.
 PASS


I won't file a new bug report for it, because it is same bug.
Comment 12 Csaba Osztrogonác 2012-11-04 02:57:36 PST
I skipped the failing editing test too - r133418.
Comment 13 Mark Lam 2012-11-05 15:38:13 PST
I thought of a way to fix fast/xmlhttprequest/xmlhttprequest-recursive-sync-event.html to not be flaky:
https://bugs.webkit.org/show_bug.cgi?id=101268