Bug 95906 - WKTR often reports an unresponsive WebProcess on Mac bots
Summary: WKTR often reports an unresponsive WebProcess on Mac bots
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Stephanie Lewis
URL:
Keywords: InRadar, MakingBotsRed
: 87720 (view as bug list)
Depends on:
Blocks:
 
Reported: 2012-09-05 15:45 PDT by Tim Horton
Modified: 2017-07-18 08:29 PDT (History)
11 users (show)

See Also:


Attachments
increase all the WKTR timeouts (2.23 KB, patch)
2012-09-10 11:46 PDT, Tim Horton
no flags Details | Formatted Diff | Diff
reduces the number wk2 testers running (2.60 KB, patch)
2012-09-12 16:31 PDT, Stephanie Lewis
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tim Horton 2012-09-05 15:45:48 PDT
Some output from after my logging improvement yesterday (http://trac.webkit.org/changeset/127547):

http://build.webkit.org/results/Apple%20MountainLion%20Debug%20WK2%20(Tests)/r127647%20(647)/fast/js/dfg-int32array-crash-log.txt

> No crash log found for WebProcess:16532.
> Process failed to become responsive before timing out.
>
> stdout:
> Timed out waiting for final message from web process
>
> stderr:
> #PROCESS UNRESPONSIVE - WebProcess (pid 16532)
> #EOF

http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r127645%20(2940)/svg/animations/mozilla/animateMotion-by-1-crash-log.txt

> No crash log found for WebProcess:33091.
> Process failed to become responsive before timing out.
>
> stdout:
> Failed to reset to consistent state before the first test
> stderr:
> #PROCESS UNRESPONSIVE - WebProcess (pid 33091)
> #EOF

There are dozens upon dozens of examples of this happening, on Lion and Mountain Lion, across many machines. It seems to come in bunches - 20 tests from different shards will fail nearly simultaneously, and the harness will then give up.

I've never reproduced this locally, but it is one of the most significant issues blocking us from greenness on the Mac WK2 test bots, and has been for some time now.

Questions I have:

Is this the same (effectively) as a timeout for a test that doesn't call waitUntilDone (i.e. is this what you get if you have a non-waitUntilDone test, but then it hangs for some reason (infinite loop in JS, WebCore bug, whatever)?)?

If so, why don't we treat it as a timeout instead of a crash? (not that that would really help)

Could this be caused simply by slow/overloaded machines? (this is something I want to test locally)

Why does it come in batches, across many processes? (see http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r127645%20(2940)/results.html)
Comment 1 Stephanie Lewis 2012-09-05 16:08:59 PDT
I took a blind stab at fixing this by doubling the timeout length. https://bugs.webkit.org/show_bug.cgi?id=87720.  It didn't have any effect on the frequency, so I think this is more complicated than slow machines.

I can reproduce this locally by running with a lot of parallel processes, more than my machine is supposed to support.  Whether that is just reproducing the symptom instead of the root cause I don't know.

One theory I had was that this was related to plugin process crashes which aren't caught by the harness.

Another theory I have is that this is some kind of disk contention.  It could be that resetting the state is holding a lock on a file and that when multiple processes try to access it they time out.

Currently when WKTR notices the WebProcess didn't come back from this call it kills the WebProcess.  If we wanted to treat this as a timeout WKTR would need to modified to not kill the WebProcess.  If we did that restructuring we could get samples which might help diagnose the issue.
Comment 2 Tim Horton 2012-09-05 16:18:00 PDT
(In reply to comment #1)
> I took a blind stab at fixing this by doubling the timeout length. https://bugs.webkit.org/show_bug.cgi?id=87720.  It didn't have any effect on the frequency, so I think this is more complicated than slow machines.
> 
> I can reproduce this locally by running with a lot of parallel processes, more than my machine is supposed to support.  Whether that is just reproducing the symptom instead of the root cause I don't know.

Ah, I'll have to try that.

> One theory I had was that this was related to plugin process crashes which aren't caught by the harness.

PluginProcess crashes shouldn't hang the WebProcess. We don't see that normally, do we?
 
> Another theory I have is that this is some kind of disk contention.  It could be that resetting the state is holding a lock on a file and that when multiple processes try to access it they time out.

Most of the timeouts are not during the reset settings to consistent state bit, they're mostly "during the test" i.e. after the page load started, but before didFinishLoad.

> Currently when WKTR notices the WebProcess didn't come back from this call it kills the WebProcess.  If we wanted to treat this as a timeout WKTR would need to modified to not kill the WebProcess.  If we did that restructuring we could get samples which might help diagnose the issue.

It ... has to kill the WebProcess, the WebProcess is *not responding*. (at least if by "not kill", you mean "use it for subsequent tests"). But yes, if we could sample it first that would be nice.
Comment 3 Stephanie Lewis 2012-09-05 17:13:57 PDT
(In reply to comment #2)
> (In reply to comment #1)
> > I took a blind stab at fixing this by doubling the timeout length. https://bugs.webkit.org/show_bug.cgi?id=87720.  It didn't have any effect on the frequency, so I think this is more complicated than slow machines.
> > 
> > I can reproduce this locally by running with a lot of parallel processes, more than my machine is supposed to support.  Whether that is just reproducing the symptom instead of the root cause I don't know.
> 
> Ah, I'll have to try that.
> 
> > One theory I had was that this was related to plugin process crashes which aren't caught by the harness.
> 
> PluginProcess crashes shouldn't hang the WebProcess. We don't see that normally, do we?

No, that's the whole point of the plugin process.  I just noticed that a few of the tests where this happened matched up to tests where the plugin process was crashing. 

> 
> > Another theory I have is that this is some kind of disk contention.  It could be that resetting the state is holding a lock on a file and that when multiple processes try to access it they time out.
> 
> Most of the timeouts are not during the reset settings to consistent state bit, they're mostly "during the test" i.e. after the page load started, but before didFinishLoad.

OK, that's more information than I had.

> 
> > Currently when WKTR notices the WebProcess didn't come back from this call it kills the WebProcess.  If we wanted to treat this as a timeout WKTR would need to modified to not kill the WebProcess.  If we did that restructuring we could get samples which might help diagnose the issue.
> 
> It ... has to kill the WebProcess, the WebProcess is *not responding*. (at least if by "not kill", you mean "use it for subsequent tests"). But yes, if we could sample it first that would be nice.

Yes, I meant to keep it alive long enough to sample, because currently by the time we get the message the WebProcess has already been killed.

The next step I was going to take was to try to get samples.  My next bot watching cycle comes around next week so hopefully I'll be able to take a look at it then.
Comment 4 Tim Horton 2012-09-06 11:42:10 PDT
I just sampled, it looks like the bots might be getting stuck doing I/O? So maybe you weren't far off with your disk contention theory!
Comment 5 Radar WebKit Bug Importer 2012-09-06 13:25:57 PDT
<rdar://problem/12248711>
Comment 6 Stephanie Lewis 2012-09-06 13:30:06 PDT
*** Bug 87720 has been marked as a duplicate of this bug. ***
Comment 7 Tim Horton 2012-09-06 22:50:23 PDT
News:

1. Yes, the bots are extremely strained for resources (I/O mostly), though it's not totally clear why. Having a billion WebProcesses, WebKitTestRunners, and Python processes probably isn't helping, though.

2. r110382 made WKTR no longer wait for LongTimeout while waiting for the final message from the WebProcess (except in the case of waitUntilDone tests). This probably made the situation a lot worse. I need to understand why this change was made, and what we can do to reintroduce a timeout there without breaking run-perf-tests.
Comment 8 Tim Horton 2012-09-06 22:57:12 PDT
(In reply to comment #7)
> News:
> 
> 1. Yes, the bots are extremely strained for resources (I/O mostly), though it's not totally clear why. Having a billion WebProcesses, WebKitTestRunners, and Python processes probably isn't helping, though.
> 
> 2. r110382 made WKTR no longer wait for LongTimeout while waiting for the final message from the WebProcess (except in the case of waitUntilDone tests). This probably made the situation a lot worse. I need to understand why this change was made, and what we can do to reintroduce a timeout there without breaking run-perf-tests.

Oh, hey.  The intent of "NoTimeout" is that we wait forever, not that we wait for no time. However, there's a FIXME that notes that this isn't implemented on Mac, and is instead causing it to wait for no time (because the timeout is -1).

I'm not sure we want to wait forever (what if the WebProcess gets stuck forever!?), but this also explains what they were going for.
Comment 9 Tim Horton 2012-09-06 22:58:28 PDT
(In reply to comment #8)

Oh no! I misunderstood m_useWaitToDumpWatchdogTimer, I don't think this is relevant at all :(
Comment 10 Tim Horton 2012-09-10 10:07:16 PDT
In an attempt to work around this, Stephanie doubled the shortTimeout duration (https://bugs.webkit.org/show_bug.cgi?id=87720). However, the failures we are seeing are generally against longTimeout runUntils. I'm going to post a patch to double longTimeout (to 60 seconds); we should see what happens with that.
Comment 11 Tim Horton 2012-09-10 10:10:02 PDT
(In reply to comment #10)
> In an attempt to work around this, Stephanie doubled the shortTimeout duration (https://bugs.webkit.org/show_bug.cgi?id=87720). However, the failures we are seeing are generally against longTimeout runUntils. I'm going to post a patch to double longTimeout (to 60 seconds); we should see what happens with that.

Ugh, can't double it because of:

// defaultLongTimeout + defaultShortTimeout should be less than 50,
// the default timeout value of the test harness so we can detect an
// unresponsive web process.

(unless we also crank the harness timeout... or maybe it was already increased? I'll check.)
Comment 12 Tim Horton 2012-09-10 11:46:27 PDT
Created attachment 163174 [details]
increase all the WKTR timeouts

This is a total shot in the dark (and we should look at fixing the hardware), but it would be interesting to see what happens.
Comment 13 Tim Horton 2012-09-10 14:10:30 PDT
Rate of timeouts decreased but they're still happening.
Comment 14 Dirk Pranke 2012-09-10 15:55:54 PDT
I think it would probably be better to reduce the number of children you're running instead of increasing the timeout. In my experience, the slowdown from thrashing and timeout is usually much worse than the slowdown from decreased parallelism.

At any rate, I think it's probably at least worth trying fewer children to get a data point. FWIW, I will also see if I can reproduce this locally, since I have a mac pro w/ a regular hard drive ...
Comment 15 Stephanie Lewis 2012-09-12 16:31:40 PDT
Created attachment 163731 [details]
reduces the number wk2 testers running
Comment 16 Stephanie Lewis 2012-09-12 18:34:19 PDT
Committed http://trac.webkit.org/projects/webkit/changeset/128391
Comment 17 Stephanie Lewis 2012-09-12 19:07:15 PDT
16 -> 12 didn't cut it.  I'm going to wait and get a couple more data points.  If the bottleneck is the disk then the number of acceptable processes is probably not a function of the number of available cores.

http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r128393%20(3199)/results.html
Comment 18 Eric Seidel (no email) 2013-01-04 00:53:01 PST
Attachment 163731 [details] was posted by a committer and has review+, assigning to Stephanie Lewis for commit.
Comment 19 Stephanie Lewis 2013-01-04 14:06:00 PST
All patches submitted, issue still occurs
Comment 20 Eric Seidel (no email) 2013-03-01 10:38:18 PST
Comment on attachment 163174 [details]
increase all the WKTR timeouts

Cleared Dan Bernstein's review+ from obsolete attachment 163174 [details] so that this bug does not appear in http://webkit.org/pending-commit.
Comment 21 Eric Seidel (no email) 2013-03-01 10:38:21 PST
Comment on attachment 163731 [details]
reduces the number wk2 testers running

Cleared Dirk Pranke's review+ from obsolete attachment 163731 [details] so that this bug does not appear in http://webkit.org/pending-commit.