Some output from after my logging improvement yesterday (http://trac.webkit.org/changeset/127547):

http://build.webkit.org/results/Apple%20MountainLion%20Debug%20WK2%20(Tests)/r127647%20(647)/fast/js/dfg-int32array-crash-log.txt

> No crash log found for WebProcess:16532.
> Process failed to become responsive before timing out.
>
> stdout:
> Timed out waiting for final message from web process
>
> stderr:
> #PROCESS UNRESPONSIVE - WebProcess (pid 16532)
> #EOF

http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r127645%20(2940)/svg/animations/mozilla/animateMotion-by-1-crash-log.txt

> No crash log found for WebProcess:33091.
> Process failed to become responsive before timing out.
>
> stdout:
> Failed to reset to consistent state before the first test
>
> stderr:
> #PROCESS UNRESPONSIVE - WebProcess (pid 33091)
> #EOF

There are dozens upon dozens of examples of this happening, on Lion and Mountain Lion, across many machines. It seems to come in bunches - 20 tests from different shards will fail nearly simultaneously, and the harness will then give up. I've never reproduced this locally, but it is one of the most significant issues blocking us from greenness on the Mac WK2 test bots, and has been for some time now.

Questions I have:

Is this effectively the same as a timeout for a test that doesn't call waitUntilDone - i.e. is this what you get if a non-waitUntilDone test hangs for some reason (infinite loop in JS, WebCore bug, whatever)? If so, why don't we treat it as a timeout instead of a crash? (Not that that would really help.)

Could this be caused simply by slow/overloaded machines? (This is something I want to test locally.)

Why does it come in batches, across many processes? (See http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r127645%20(2940)/results.html)
I took a blind stab at fixing this by doubling the timeout length: https://bugs.webkit.org/show_bug.cgi?id=87720. It didn't have any effect on the frequency, so I think this is more complicated than slow machines.

I can reproduce this locally by running with a lot of parallel processes, more than my machine is supposed to support. Whether that is just reproducing the symptom rather than the root cause, I don't know.

One theory I had was that this was related to plugin process crashes, which aren't caught by the harness. Another theory is that this is some kind of disk contention: resetting the state could be holding a lock on a file, and when multiple processes try to access it they time out.

Currently, when WKTR notices the WebProcess didn't come back from this call, it kills the WebProcess. If we wanted to treat this as a timeout, WKTR would need to be modified to not kill the WebProcess. If we did that restructuring we could get samples, which might help diagnose the issue.
(In reply to comment #1)
> I took a blind stab at fixing this by doubling the timeout length: https://bugs.webkit.org/show_bug.cgi?id=87720. It didn't have any effect on the frequency, so I think this is more complicated than slow machines.
>
> I can reproduce this locally by running with a lot of parallel processes, more than my machine is supposed to support. Whether that is just reproducing the symptom rather than the root cause, I don't know.

Ah, I'll have to try that.

> One theory I had was that this was related to plugin process crashes, which aren't caught by the harness.

PluginProcess crashes shouldn't hang the WebProcess. We don't see that normally, do we?

> Another theory is that this is some kind of disk contention: resetting the state could be holding a lock on a file, and when multiple processes try to access it they time out.

Most of the timeouts are not during the reset-to-consistent-state step; they're mostly "during the test", i.e. after the page load started, but before didFinishLoad.

> Currently, when WKTR notices the WebProcess didn't come back from this call, it kills the WebProcess. If we wanted to treat this as a timeout, WKTR would need to be modified to not kill the WebProcess. If we did that restructuring we could get samples, which might help diagnose the issue.

It ... has to kill the WebProcess; the WebProcess is *not responding*. (At least if by "not kill" you mean "use it for subsequent tests".) But yes, if we could sample it first, that would be nice.
(In reply to comment #2)
> (In reply to comment #1)
> > One theory I had was that this was related to plugin process crashes, which aren't caught by the harness.
>
> PluginProcess crashes shouldn't hang the WebProcess. We don't see that normally, do we?

No, that's the whole point of the plugin process. I just noticed that a few of the tests where this happened matched up with tests where the plugin process was crashing.

> > Another theory is that this is some kind of disk contention: resetting the state could be holding a lock on a file, and when multiple processes try to access it they time out.
>
> Most of the timeouts are not during the reset-to-consistent-state step; they're mostly "during the test", i.e. after the page load started, but before didFinishLoad.

OK, that's more information than I had.

> > Currently, when WKTR notices the WebProcess didn't come back from this call, it kills the WebProcess. If we wanted to treat this as a timeout, WKTR would need to be modified to not kill the WebProcess. If we did that restructuring we could get samples, which might help diagnose the issue.
>
> It ... has to kill the WebProcess; the WebProcess is *not responding*. (At least if by "not kill" you mean "use it for subsequent tests".) But yes, if we could sample it first, that would be nice.

Yes, I meant to keep it alive long enough to sample, because currently by the time we get the message the WebProcess has already been killed. The next step I was going to take was to try to get samples. My next bot-watching cycle comes around next week, so hopefully I'll be able to take a look at it then.
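For illustration, here is a minimal sketch of the "sample before kill" idea, independent of WKTR's actual code: run the macOS sample tool against the hung pid, wait for it to finish, and only then kill the process. The function name, output path, and 10-second duration are all made up for this sketch, not anything WKTR defines.

// Sketch only: capture a sample from a hung child before killing it (macOS).
#include <signal.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void sampleThenKill(pid_t webProcessPID)
{
    std::string pidString = std::to_string(webProcessPID);
    std::string outputPath = "/tmp/WebProcess-" + pidString + "-sample.txt";

    pid_t samplerPID = fork();
    if (samplerPID < 0) {
        kill(webProcessPID, SIGKILL); // Couldn't fork a sampler; just kill.
        return;
    }
    if (!samplerPID) {
        // Child: run `sample <pid> 10 -file <path>` to capture stacks
        // from the unresponsive WebProcess for 10 seconds.
        execl("/usr/bin/sample", "sample", pidString.c_str(), "10",
              "-file", outputPath.c_str(), static_cast<char*>(nullptr));
        _exit(1);
    }

    int status = 0;
    waitpid(samplerPID, &status, 0); // Let the sampler finish first.
    kill(webProcessPID, SIGKILL);    // Only then terminate the WebProcess.
}

The design point is simply ordering: the sampler has to attach while the WebProcess is still alive, which is why WKTR would need to delay the kill.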
I just sampled; it looks like the bots might be getting stuck doing I/O? So maybe you weren't far off with your disk contention theory!
<rdar://problem/12248711>
*** Bug 87720 has been marked as a duplicate of this bug. ***
News:

1. Yes, the bots are extremely strained for resources (I/O mostly), though it's not totally clear why. Having a billion WebProcesses, WebKitTestRunners, and Python processes probably isn't helping, though.

2. r110382 made WKTR no longer wait for LongTimeout while waiting for the final message from the WebProcess (except in the case of waitUntilDone tests). This probably made the situation a lot worse. I need to understand why this change was made, and what we can do to reintroduce a timeout there without breaking run-perf-tests.
(In reply to comment #7)
> 2. r110382 made WKTR no longer wait for LongTimeout while waiting for the final message from the WebProcess (except in the case of waitUntilDone tests). This probably made the situation a lot worse. I need to understand why this change was made, and what we can do to reintroduce a timeout there without breaking run-perf-tests.

Oh, hey. The intent of "NoTimeout" is that we wait forever, not that we wait for no time. However, there's a FIXME noting that this isn't implemented on Mac, and instead it waits for no time (because the timeout is -1). I'm not sure we want to wait forever (what if the WebProcess gets stuck forever!?), but this at least explains what they were going for.
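To make the NoTimeout confusion concrete (this is an illustration, not WKTR's actual wait code): if a sentinel like -1 seconds is fed straight into a deadline-based wait, the deadline is already in the past and the wait returns immediately, whereas the intended meaning is to block indefinitely. A sketch of the two interpretations, assuming a condition-variable style wait:

// Illustrative only: how a NoTimeout sentinel of -1 behaves very differently
// depending on whether the wait path special-cases it.
#include <chrono>
#include <condition_variable>
#include <mutex>

static std::mutex messageMutex;
static std::condition_variable messageCondition;
static bool messageArrived = false;

// Buggy interpretation: a -1 second timeout puts the deadline in the past,
// so wait_until() returns immediately -- "wait for no time".
bool waitForFinalMessageBuggy(double timeoutSeconds)
{
    std::unique_lock<std::mutex> locker(messageMutex);
    auto deadline = std::chrono::steady_clock::now()
        + std::chrono::duration<double>(timeoutSeconds);
    return messageCondition.wait_until(locker, deadline, [] { return messageArrived; });
}

// Intended interpretation: a negative "NoTimeout" value means wait forever.
bool waitForFinalMessageIntended(double timeoutSeconds)
{
    std::unique_lock<std::mutex> locker(messageMutex);
    if (timeoutSeconds < 0) {
        messageCondition.wait(locker, [] { return messageArrived; });
        return true;
    }
    auto deadline = std::chrono::steady_clock::now()
        + std::chrono::duration<double>(timeoutSeconds);
    return messageCondition.wait_until(locker, deadline, [] { return messageArrived; });
}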
(In reply to comment #8)

Oh no! I misunderstood m_useWaitToDumpWatchdogTimer; I don't think this is relevant at all :(
In an attempt to work around this, Stephanie doubled the shortTimeout duration (https://bugs.webkit.org/show_bug.cgi?id=87720). However, the failures we are seeing are generally against longTimeout runUntils. I'm going to post a patch to double longTimeout (to 60 seconds); we should see what happens with that.
(In reply to comment #10)
> In an attempt to work around this, Stephanie doubled the shortTimeout duration (https://bugs.webkit.org/show_bug.cgi?id=87720). However, the failures we are seeing are generally against longTimeout runUntils. I'm going to post a patch to double longTimeout (to 60 seconds); we should see what happens with that.

Ugh, can't double it because of:

// defaultLongTimeout + defaultShortTimeout should be less than 50,
// the default timeout value of the test harness so we can detect an
// unresponsive web process.

(Unless we also crank the harness timeout... or maybe it was already increased? I'll check.)
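For reference, the relationship that code comment describes, with illustrative numbers (the 50-second harness timeout comes from the comment above; the short/long values here are assumptions for the sketch, not necessarily the shipped constants):

// Illustrative numbers only -- the real constants live in WKTR's
// TestController and the Python harness, and may differ.
constexpr double harnessTimeout = 50;      // default per-test timeout in the test harness (seconds)
constexpr double defaultShortTimeout = 15; // e.g. waiting for reset-to-consistent-state
constexpr double defaultLongTimeout = 30;  // e.g. waiting for the final message from the WebProcess

// WKTR's timeouts have to fit inside the harness timeout, or the harness
// kills WKTR before WKTR can report the WebProcess as unresponsive.
// Doubling defaultLongTimeout to 60 would violate this.
static_assert(defaultLongTimeout + defaultShortTimeout < harnessTimeout,
              "WKTR must time out before the harness does, so an unresponsive web process can be detected");

So doubling longTimeout only works if the harness timeout is raised in step (or was already raised).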
Created attachment 163174 [details]
increase all the WKTR timeouts

This is a total shot in the dark (and we should look at fixing the hardware), but it would be interesting to see what happens.
Rate of timeouts decreased but they're still happening.
I think it would probably be better to reduce the number of children you're running instead of increasing the timeout. In my experience, the slowdown from thrashing and timeout is usually much worse than the slowdown from decreased parallelism. At any rate, I think it's probably at least worth trying fewer children to get a data point. FWIW, I will also see if I can reproduce this locally, since I have a mac pro w/ a regular hard drive ...
Created attachment 163731 [details]
reduces the number of wk2 testers running
Committed http://trac.webkit.org/projects/webkit/changeset/128391
Going from 16 down to 12 didn't cut it. I'm going to wait and get a couple more data points. If the bottleneck is the disk, then the acceptable number of processes is probably not a function of the number of available cores.

http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r128393%20(3199)/results.html
Attachment 163731 [details] was posted by a committer and has review+, assigning to Stephanie Lewis for commit.
All patches submitted, issue still occurs
Comment on attachment 163174 [details]
increase all the WKTR timeouts

Cleared Dan Bernstein's review+ from obsolete attachment 163174 [details] so that this bug does not appear in http://webkit.org/pending-commit.
Comment on attachment 163731 [details]
reduces the number of wk2 testers running

Cleared Dirk Pranke's review+ from obsolete attachment 163731 [details] so that this bug does not appear in http://webkit.org/pending-commit.