Bug 88404 - [WTR] process unresponsiveness is reported to the wrong test
Summary: [WTR] process unresponsiveness is reported to the wrong test
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Balazs Kelemen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-06-06 02:45 PDT by Balazs Kelemen
Modified: 2012-08-03 05:55 PDT (History)
7 users (show)

See Also:


Attachments
Patch (4.21 KB, patch)
2012-06-06 02:53 PDT, Balazs Kelemen
no flags Details | Formatted Diff | Diff
Patch (9.35 KB, patch)
2012-06-12 09:55 PDT, Balazs Kelemen
no flags Details | Formatted Diff | Diff
Patch (12.34 KB, patch)
2012-06-15 08:15 PDT, Balazs Kelemen
no flags Details | Formatted Diff | Diff
Patch (9.55 KB, patch)
2012-06-18 07:38 PDT, Balazs Kelemen
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ec2-cr-linux-03 (984.72 KB, application/zip)
2012-06-18 13:58 PDT, WebKit Review Bot
no flags Details
refactored patch (9.58 KB, patch)
2012-06-19 02:23 PDT, Balazs Kelemen
zherczeg: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Balazs Kelemen 2012-06-06 02:45:03 PDT
If the web process become blocked for some reason with a test the test times out. After that TestController::resetStateToConsistentValues is called which sends another message to the web process and waits for the answer. If the web process is still blocked - which is likely the case - it will also times out and the harness reports unresponsiveness for the test. It's clear that the timeout is the result of the previous test so this is not a correct behavior. I think restarting the driver is the correct solution on timeout.
Comment 1 Balazs Kelemen 2012-06-06 02:47:34 PDT
This can be reproduced with a simple endless loop. The next test will trigger reporting unresponsiveness.
Comment 2 Balazs Kelemen 2012-06-06 02:53:39 PDT
Created attachment 145971 [details]
Patch
Comment 3 Balazs Kelemen 2012-06-11 16:31:03 PDT
Comment on attachment 145971 [details]
Patch

It's actually not compatible with nrwt as it does not expect the driver to exit without commanded to if there is no crash/timeout.
Comment 4 Balazs Kelemen 2012-06-12 09:55:33 PDT
Created attachment 147104 [details]
Patch
Comment 5 Balazs Kelemen 2012-06-12 09:57:18 PDT
(In reply to comment #3)
> (From update of attachment 145971 [details])
> It's actually not compatible with nrwt as it does not expect the driver to exit without commanded to if there is no crash/timeout.

No problem after changed to return 1 on error, and no false-positive timeout.
Comment 6 Balazs Kelemen 2012-06-15 08:11:20 PDT
(In reply to comment #5)
> (In reply to comment #3)
> > (From update of attachment 145971 [details] [details])
> > It's actually not compatible with nrwt as it does not expect the driver to exit without commanded to if there is no crash/timeout.
> 
> No problem after changed to return 1 on error, and no false-positive timeout.

After one more test round I realized that it was still not good.
Comment 7 Balazs Kelemen 2012-06-15 08:15:20 PDT
Created attachment 147824 [details]
Patch
Comment 8 Balazs Kelemen 2012-06-18 07:18:38 PDT
Comment on attachment 147824 [details]
Patch

Err, I'm doing it wrong. Thinked over again, it's not consistent with the behavior we did so far, both in drt and wtr.
Comment 9 Balazs Kelemen 2012-06-18 07:38:46 PDT
Created attachment 148098 [details]
Patch
Comment 10 WebKit Review Bot 2012-06-18 13:58:52 PDT
Comment on attachment 148098 [details]
Patch

Attachment 148098 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/12982148

New failing tests:
svg/css/getComputedStyle-basic.xhtml
Comment 11 WebKit Review Bot 2012-06-18 13:58:57 PDT
Created attachment 148168 [details]
Archive of layout-test-results from ec2-cr-linux-03

The attached test failures were seen while running run-webkit-tests on the chromium-ews.
Bot: ec2-cr-linux-03  Port: <class 'webkitpy.common.config.ports.ChromiumXVFBPort'>  Platform: Linux-2.6.35-28-virtual-x86_64-with-Ubuntu-10.10-maverick
Comment 12 Balazs Kelemen 2012-06-19 02:23:04 PDT
Created attachment 148292 [details]
refactored patch
Comment 13 Zoltan Herczeg 2012-07-10 04:31:31 PDT
Comment on attachment 148292 [details]
refactored patch

The patch looks sane to me. Let's wait a few more days, and if there will be no objection, I can r+ this.

View in context: https://bugs.webkit.org/attachment.cgi?id=148292&action=review

> Tools/WebKitTestRunner/TestInvocation.cpp:208
> +    char buffer[256];
> +    if (!resetDone) {
> +        pid_t pid = WKPageGetProcessIdentifier(m_mainWebView->page());
> +        sprintf(buffer, "#PROCESS UNRESPONSIVE - WebProcess (pid %ld)\n", static_cast<long>(pid));
> +        errorMessageToStderr = buffer;
> +    }

Can't we make this nicer?
Comment 14 Balazs Kelemen 2012-07-10 05:31:20 PDT
(In reply to comment #13)
> (From update of attachment 148292 [details])
> The patch looks sane to me. Let's wait a few more days, and if there will be no objection, I can r+ this.
> 
> View in context: https://bugs.webkit.org/attachment.cgi?id=148292&action=review
> 
> > Tools/WebKitTestRunner/TestInvocation.cpp:208
> > +    char buffer[256];
> > +    if (!resetDone) {
> > +        pid_t pid = WKPageGetProcessIdentifier(m_mainWebView->page());
> > +        sprintf(buffer, "#PROCESS UNRESPONSIVE - WebProcess (pid %ld)\n", static_cast<long>(pid));
> > +        errorMessageToStderr = buffer;
> > +    }
> 
> Can't we make this nicer?

I'm not sure. I want to keep the dump logic separated in a function and not duplicate it. It's important to end the streams as the test harness expects, with #EOF at the end. So, avoiding the constant sized char buffer seems a bit overkill, I could not imagine something much better than this:

const char* msgParts[] = {"#PROCESS UNRESPONSIVE - WebProcess (pid", ")\n"};
unsigned length = strlen(msgParts[0]) + strlen(msgPargs[1]) + 1 + sizeof(pid);
OwnPtr<char> msg = adoptPtr(new char[length]);
strcpy(msg.get(), msgPargs[0]);
sprintf(msg.get() + strlen(msg.get()), " %ld", static_cast<long>(pid));
strcat(msg.get() + strlen(msg.get()), msgParts[1]);

but it's definitely more ugly. Also stack allocating seems to be ok here, this is testing code, not an engine hotspot. So, IMHO trying to make it nicer would be only overengineering :)
Comment 15 Dirk Pranke 2012-07-11 12:20:25 PDT
I'm confused ... I believe NRWT already restarts WTR/DRT when a test times out. Are you not seeing this?
Comment 16 Balazs Kelemen 2012-07-12 00:51:20 PDT
(In reply to comment #15)
> I'm confused ... I believe NRWT already restarts WTR/DRT when a test times out. Are you not seeing this?

There is no problem in the general case, and anyway the bug is not in NRWT but WTR.

There is at least two kinds of timeout with WTR:
 * WTR don't produce output before timer expires in the harness. it's the same as with DRT, and the harness restarts the driver
 * web process times out but the driver (the child process in NRWT's point of view, which is the UI process) is still responsive, so WTR reports the timeout

In the latter case there is a chance that the web process is not only runs the test slowly but it actually became unresponsive. There is a message pass after every test between UI proc - web proc that tell the web process to reset it's values to a clean state. If we don't get answer to this message we know that the web process is unresponsive, so we output a special message to stderr ("WEB PROCESS UNRESPONSIVE ...") and NRWT knows that it needs to restart the driver (since this is the most reliable way to handle an unresponsive web process). Otherwise, if the resetting was done we don't restart the driver since there is no reason to do that. So far so good.

The actual bug is that WTR sends the restart message after it dumped the output for the test that make the web process unresponsive and writes the stderr message that time, so the test harness believes that the next test timed out.

I hope I have clarified it. :)
Comment 17 Dirk Pranke 2012-07-12 12:48:36 PDT
(In reply to comment #16)
> (In reply to comment #15)
> > I'm confused ... I believe NRWT already restarts WTR/DRT when a test times out. Are you not seeing this?
> 
> There is no problem in the general case, and anyway the bug is not in NRWT but WTR.
> 
> There is at least two kinds of timeout with WTR:
>  * WTR don't produce output before timer expires in the harness. it's the same as with DRT, and the harness restarts the driver
>  * web process times out but the driver (the child process in NRWT's point of view, which is the UI process) is still responsive, so WTR reports the timeout
> 
> In the latter case there is a chance that the web process is not only runs the test slowly but it actually became unresponsive. There is a message pass after every test between UI proc - web proc that tell the web process to reset it's values to a clean state. If we don't get answer to this message we know that the web process is unresponsive, so we output a special message to stderr ("WEB PROCESS UNRESPONSIVE ...") and NRWT knows that it needs to restart the driver (since this is the most reliable way to handle an unresponsive web process). Otherwise, if the resetting was done we don't restart the driver since there is no reason to do that. So far so good.
> 
> The actual bug is that WTR sends the restart message after it dumped the output for the test that make the web process unresponsive and writes the stderr message that time, so the test harness believes that the next test timed out.
> 
> I hope I have clarified it. :)

I see. Yes,  thank you. I had thought the situation was that the first test was reported as timeout.
Comment 18 Zoltan Herczeg 2012-08-03 04:34:02 PDT
Comment on attachment 148292 [details]
refactored patch

View in context: https://bugs.webkit.org/attachment.cgi?id=148292&action=review

> Tools/WebKitTestRunner/TestInvocation.cpp:203
> +    char buffer[256];

64 should be neough
Comment 19 Balazs Kelemen 2012-08-03 05:02:20 PDT
Landed in http://trac.webkit.org/changeset/124596. Sorry, I forget about the 64 after merge :) I will fix it but first let's see if bots are happy with the patch.