55164 – Long-sleeping php jobs may be causing the server to time out requests

Eric Seidel (no email)

Reported 2011-02-24 11:24:54 PST

Long-sleeping php jobs may be causing the server to time out requests

Eric Seidel (no email)

Comment 1 2011-02-24 11:31:47 PST

Created attachment 83687 [details] Patch

Eric Seidel (no email)

Comment 2 2011-02-24 11:34:01 PST

If this theory is correct, we can basically dup every single one of bugs about http tests listed from https://bugs.webkit.org/show_bug.cgi?id=50856 to this one.

Eric Seidel (no email)

Comment 3 2011-02-24 11:38:11 PST

The large CC list was generated by webkit-patch --suggest-reviewers. I thought that many of the folks who wrote these tests might want to see this change go by.

Tony Gentilcore

Comment 4 2011-02-24 11:39:19 PST

Comment on attachment 83687 [details] Patch This seems reasonable. Aside from the hangs, it also seems great to speed up these tests. The one reason this is scary is that presumably the tests were designed to trigger a bug that only a delay triggers. Without a delay, they would pass. So with a smaller delay, the passing may be for the right reason or it may be because the delay isn't long enough and we are losing coverage. I don't have any ideas for an efficient way to verify which is the case. View in context: https://bugs.webkit.org/attachment.cgi?id=83687&action=review > LayoutTests/http/tests/history/resources/slow-image.php:2 > +sleep(1); You use 2 seconds everywhere but here. Any reason? > LayoutTests/http/tests/xmlhttprequest/xmlhttprequest-50ms-download-dispatch.html:44 > + // delay is in ms. Why not just rename the variable delayMs.

Eric Seidel (no email)

Comment 5 2011-02-24 11:40:21 PST

I should note that the xmlhttprequest/web-apps tests aren't even being run. They're all renamed -disabled. Presumably because they all use such ridiculously long timeouts (10 minutes?!) and were hanging the server.

Brady Eidson

Comment 6 2011-02-24 11:40:34 PST

Comment on attachment 83687 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=83687&action=review > LayoutTests/http/tests/xmlhttprequest/web-apps/016-test.cgi:5 > -sleep 600; # ten minutes > +# The orginal test waited 10 minutes, however doing so would > +# lock apache for too long. > +sleep 2; # ten minutes Then the "#ten minutes" comment is no longer valid.

Eric Seidel (no email)

Comment 7 2011-02-24 11:42:55 PST

(In reply to comment #4) > (From update of attachment 83687 [details]) > This seems reasonable. Aside from the hangs, it also seems great to speed up these tests. > > The one reason this is scary is that presumably the tests were designed to trigger a bug that only a delay triggers. Without a delay, they would pass. So with a smaller delay, the passing may be for the right reason or it may be because the delay isn't long enough and we are losing coverage. I don't have any ideas for an efficient way to verify which is the case. I manually inspected almost all the uses of these files. Basically for all of them (except for some of the slow images, etc.) they're just trying to delay long enough for a js setTimeout to run. > View in context: https://bugs.webkit.org/attachment.cgi?id=83687&action=review > > > LayoutTests/http/tests/history/resources/slow-image.php:2 > > +sleep(1); > > You use 2 seconds everywhere but here. Any reason? IIRC this is one of the resources which is actually expected to complete, no sense in making the test run any longer than it has to. > > LayoutTests/http/tests/xmlhttprequest/xmlhttprequest-50ms-download-dispatch.html:44 > > + // delay is in ms. > > Why not just rename the variable delayMs. I was trying to avoid really touching the code, but could do.

Eric Seidel (no email)

Comment 8 2011-02-24 11:43:15 PST

(In reply to comment #6) > (From update of attachment 83687 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=83687&action=review > > > LayoutTests/http/tests/xmlhttprequest/web-apps/016-test.cgi:5 > > -sleep 600; # ten minutes > > +# The orginal test waited 10 minutes, however doing so would > > +# lock apache for too long. > > +sleep 2; # ten minutes > > Then the "#ten minutes" comment is no longer valid. Yup. We should probably just remove the tests since we don't run them.

Eric Seidel (no email)

Comment 9 2011-02-24 11:58:23 PST

It's possible my theory is bad, since I would think that navigating away from a page would abort all requests for that page. Upon closing the connection you would think Apache would be smart enough to terminate long running php/perl/python jobs.

Eric Seidel (no email)

Comment 10 2011-02-24 11:59:23 PST

https://bugs.webkit.org/show_bug.cgi?id=54791#c2 is the log which got me started on this theory. Note that both of the late-to-complete resources are php. It's possible that Apache doens't know how to abort php tasks when the connection closes.

Eric Seidel (no email)

Comment 11 2011-02-24 12:02:12 PST

This thread about php + apache + sleep may be related: http://stackoverflow.com/questions/21411/php-sleep-silently-hogs-cpu

Eric Seidel (no email)

Comment 12 2011-02-24 12:03:25 PST

Also possibly related: http://forums.digitalpoint.com/showthread.php?t=1132873&s=5db8c42ac9a74a25202df9c6b6db815f#post9949058

Eric Seidel (no email)

Comment 13 2011-02-24 12:06:48 PST

http://groups.google.com/group/mailing.unix.bugtraq/browse_thread/thread/90aef7b2a1df3b49/9b3f949cb001bb03?pli=1 also talks about php sleep() not counting towards max_execution_time and thus not being killed off after a timeout as expected. This may or may not be related. I could see it possible that we could have 150 (our current MAX_CHILD) outstanding requests all in a php sleep call if the tests are running fast enough. Again, it's possible there is some other bottleneck limit which is lower than that. This is all just a theory at this point.

Alexey Proskuryakov

Comment 14 2011-02-24 12:21:13 PST

This is an interesting theory! It's surprising if we have over 150 tests with sleep(). Is there anything we can do to test that Apache successfully serves 150 concurrent requests on the buildbot slaves? Maybe open 100 connections to a slow running script with curl, and see if other requests come through? We might also be running into an open file descriptor limit or something. It does seem that using a short sleep interval could increase flakiness, or make some tests ineffective.

Alexey Proskuryakov

Comment 15 2011-02-24 12:24:59 PST

I think that it would be good to remove disabled tests from xmlhttprequest/webapps. These were one of the first WebKit http tests to be added, so every XHR test that we could possibly make work in the future meant much more than now.

Eric Seidel (no email)

Comment 16 2011-02-24 12:40:22 PST

I would have thought that running: run-webkit-tests http/tests/history/back-during-onload-triggered-by-back.html --iterations 200 would confirm this theory. But that ran w/o any trouble. Hmm.. Maybe php only "leaks" these sleeping jobs in certain circumstances. Still testing.

Eric Seidel (no email)

Comment 17 2011-02-24 12:43:14 PST

Interesting. If you run: run-webkit-tests --iterations=500 http/tests/multipart/multipart-wait-before-boundary.html the run times are very uneven. I've not yet seen it hang long enough to timeout, but there is definitely something bad going on there. The run times for each iteration should be roughly the same.

Eric Seidel (no email)

Comment 18 2011-02-24 12:44:29 PST

Btw, when I run: run-webkit-tests --iterations=500 http/tests/multipart/multipart-wait-before-boundary.html I get a zillion httpd instances. Suggesting that my theory is at least partially correct. These httpd instances are leaking between runs, thus possibly causing us to hit the MaxClientLimit of 150.

Eric Seidel (no email)

Comment 19 2011-02-24 12:55:07 PST

run-webkit-tests --iterations=200 http/tests/history/popstate-fires-with-pending-requests.html confirms the httpd process leaking as well. I suspect it's worth investigating if there is a way to get apache to kill any process goes past some timeout. Maybe that's already happening (but we should probably lower the timeout). It seems the httpd leaking can certainly slow testing down. I believe it can explain these timeouts as well, but I've not reproduced that yet.

Eric Seidel (no email)

Comment 20 2011-02-24 13:37:26 PST

I can definitely see a theoretical case in which this demonstrable httpd process leak could cause tests to timeout. If you start a test when you are 1 away from the limit, the main resource will start, but the subresource won't and you'll time out. :)

Tony Gentilcore

Comment 21 2011-02-24 13:47:52 PST

Regardless of whether this is causing timeouts, lowering these timers where possible seems like a good idea. I'm still concerned about the hang connection resource. There is no way 2 seconds simulates a hang. > sleep(2); # 2 seconds should be long enough to simulate a hang. And I'd either revert xmlhttprequest-50ms-download-dispatch.html or rename the variable. Perhaps this change should just lower the timeouts that seem arbitrary and too high (like the 5 seconds ones) and then handle the hang tests (10 minute ones) separately.

Alexey Proskuryakov

Comment 22 2011-02-24 13:58:47 PST

> These httpd instances are leaking between runs Leaking between runs is certainly wrong. We shut down Apache after every run-webkit-tests run, and if that doesn't close child processes reasonably soon, that's unexpected. However, even 10 minutes shouldn't be a problem in practice, because non-http tests take much longer to run. It's not clear if long-lived Apache processes do us any harm at all.

Alexey Proskuryakov

Comment 23 2011-02-24 14:02:17 PST

Is 5 seconds really a too high timeout? As long as tests don't wait for the resource to finish loading, it seems best to make timeouts as long as possible to avoid flakiness.

Eric Seidel (no email)

Comment 24 2011-02-24 14:03:48 PST

(In reply to comment #15) > I think that it would be good to remove disabled tests from xmlhttprequest/webapps. These were one of the first WebKit http tests to be added, so every XHR test that we could possibly make work in the future meant much more than now. Done. bug 55181.

Eric Seidel (no email)

Comment 25 2011-02-24 14:05:42 PST

(In reply to comment #23) > Is 5 seconds really a too high timeout? As long as tests don't wait for the resource to finish loading, it seems best to make timeouts as long as possible to avoid flakiness. Longer timeouts can only reduce the probability of flakiness. Tests which are flaky can flake at any timeout. :) But yes I can extend the timeouts to 5s. That just increase the overlap window with other tests. I'm still investigating to see if there is a PHP, Apache or WebKIt bug at play here. We shouldn't be leaking httpd processes in any case (since Apache should detect the connection close and shoot the process in the head?) and I'm not sure who is at fault.

Eric Seidel (no email)

Comment 26 2011-02-24 14:09:44 PST

(In reply to comment #22) > > These httpd instances are leaking between runs > > Leaking between runs is certainly wrong. We shut down Apache after every run-webkit-tests run, and if that doesn't close child processes reasonably soon, that's unexpected. They leak between tests, not between apache restarts. Shutting down apache will kill all the httpd children. > However, even 10 minutes shouldn't be a problem in practice, because non-http tests take much longer to run. When an httpd child outlives the test it was trying to run, it holds one of the 150 MaxChildLimit slots. If we ran 150 tests in a row, all of which ran in a few milliseconds (like most tests do), but managed to get php to sleep for 1 second, on the 151st test it would have to wait for the first test's sleep to finish before Apache would have a child to service it. Now imagine that instead of 1 second, we sleep 30 seconds (the DRT timeout). Now suddenly the 151 test might timeout having never been serviced by apache at all (or more realisticly, the main resource was serviced but any subresources it needed weren't.) All because a previous test was "leaking" an httpd process by sleeping from php.

Eric Seidel (no email)

Comment 27 2011-02-24 14:22:22 PST

I'm trying to determine if WebKit, Apache or PHP is at fault here. When running: run-webkit-tests --iterations=500 http/tests/multipart/multipart-wait-before-boundary.html I see a 151 httpd processes (this is expected, and means we're hitting the MaxChild limit). I'm able to reproduce the timeouts we see on the bots! Will running these 500 iterations, using netstat, I see 150 sockets in CLOSE_WAIT and 150 in FIN_WAIT_2. CLOSE_WAIT: The socket connection has been closed by the remote peer, and the system is waiting for the local application to close its half of the connection. FIN_WAIT_2: The socket connection has been closed by the local application, the remote peer has acknowledged the close, and the system is waiting for it to close its half of the connection. This still seems inconclusive to me. The CLOSE_WAITs may suggest that the httpd children are stuck in sleep and not acknowledging the connection abort from WebKit. But I don't understand the FIN_WAIT_2s.

Eric Seidel (no email)

Comment 28 2011-02-24 14:22:52 PST

Created attachment 83720 [details] netstat output during run-webkit-tests --iterations=500 http/tests/multipart/multipart-wait-before-boundary.html

Eric Seidel (no email)

Comment 29 2011-02-24 14:25:38 PST

I upped the timout in multipart-wait-before-boundary.php to 60 seconds (from 30) and can confirm from the logs from my --iterations 500 run (see above) that the php processes are not exiting until 60 seconds after they started: 127.0.0.1 - - [24/Feb/2011:14:20:41 -0800] "GET /multipart/multipart-wait-before-boundary.html HTTP/1.1" 200 718 127.0.0.1 - - [24/Feb/2011:14:19:41 -0800] "GET /multipart/resources/multipart-wait-before-boundary.php HTTP/1.1" 200 6353 127.0.0.1 - - [24/Feb/2011:14:20:41 -0800] "GET /multipart/resources/multipart-wait-before-boundary.php?done=1 HTTP/1.1" 200 33 127.0.0.1 - - [24/Feb/2011:14:20:41 -0800] "GET /multipart/multipart-wait-before-boundary.html HTTP/1.1" 200 718 127.0.0.1 - - [24/Feb/2011:14:19:41 -0800] "GET /multipart/resources/multipart-wait-before-boundary.php HTTP/1.1" 200 6353 127.0.0.1 - - [24/Feb/2011:14:20:41 -0800] "GET /multipart/resources/multipart-wait-before-boundary.php?done=1 HTTP/1.1" 200 33 This means that any "10 minute" timeouts in PHP would actually hang one of the httpd children for 10 minutes, or until the httpd master shuts down (whichever comes first).

Eric Seidel (no email)

Comment 30 2011-02-24 14:32:51 PST

Sad. There is a hole in my theory. When I'm geting the timeouts from my --iterations 500 runs, I'm seeing this in the error log: [Thu Feb 24 14:29:25 2011] [error] server reached MaxClients setting, consider raising the MaxClients setting I see no such message in the error log seen on https://bugs.webkit.org/show_bug.cgi?id=54791.

Eric Seidel (no email)

Comment 31 2011-02-24 14:35:44 PST

I tried changing the "Timeout" value in http/conf/apache2-httpd.conf (which I believe to be the correct config file for Snow Leopard). I changed it from 300 seconds (default) to 30 seconds. This did not change how long any individual httpd process stayed around when calling sleep from php. I would have expected these processes to be killed at 30 seconds but they weren't. :(

Eric Seidel (no email)

Comment 32 2011-02-24 14:37:33 PST

PHP's own documentation (or at least the comments) confirm that sleep can bypass the automatic limiting (which defaults to 30 seconds): http://php.net/manual/en/function.set-time-limit.php

Alexey Proskuryakov

Comment 33 2011-02-24 14:41:19 PST

> Longer timeouts can only reduce the probability of flakiness. Yes, that's what I also said. If you intended to object, then I'll counter-object :) A short timeout will make it more likely that a resource that's not expected to finish loading in a given test will actually finish loading, causing havoc. > They leak between tests, not between apache restarts. OK - I misunderstood what you were talking about. It seems unlikely that the dozen tests with longish sleep() that we have causes trouble then. > Apache should detect the connection close and shoot the process in the head Possibly, but I'm not sure about that. It might be considered unsafe to force kill external scripts that are doing work. > But I don't understand the FIN_WAIT_2s Socket are kept in FIN_WAIT state by the kernel regardless of whether a process is still running. I don't remember enough detail to tell how FIN_WAIT is different from FIN_WAIT_2. > This means that any "10 minute" timeouts in PHP would actually hang one of the httpd children for 10 minutes It seems harmless if we only have a few of those.

Eric Seidel (no email)

Comment 34 2011-02-24 14:57:32 PST

I just ran "run-webkit-tests http" while monitoring the number of httpd processes with "while true; do ps waux | grep httpd | wc; sleep .5; done". The high-water mark was 22 processes, further evidence that we're not hitting the MaxChild limit (at least not on my machine).

Eric Seidel (no email)

Comment 35 2011-02-24 14:59:25 PST

(In reply to comment #33) > > But I don't understand the FIN_WAIT_2s > > Socket are kept in FIN_WAIT state by the kernel regardless of whether a process is still running. I don't remember enough detail to tell how FIN_WAIT is different from FIN_WAIT_2. I now understand what netstat is saying here. The Apache side are waiting in CLOSE_WAIT and the WebKit side are waiting in FIN_WAIT_2. They're two sides of the same coin and this is expected behavior since php is not servicing the connection while sleeping. I didn't initially realize that netstat would show two entries for each connection. One for the Apache socket and one for the WebKit/DRT socket.

Eric Seidel (no email)

Comment 36 2011-02-24 15:01:12 PST

(In reply to comment #33) > > Apache should detect the connection close and shoot the process in the head > > Possibly, but I'm not sure about that. It might be considered unsafe to force kill external scripts that are doing work. I realize now that Apache would have no way to do this. it's already handed off the socket to the child process and the child has called sleep(). Ideally php's "sleep" call would actually be some sort of poll/select call which serviced any open sockets while "sleeping".

Adam Barth

Comment 37 2011-10-14 17:20:11 PDT

Comment on attachment 83687 [details] Patch Booting from pending-review due to inactivity.

Adam Barth

Comment 38 2011-10-14 17:20:27 PDT

Eric, do you still think this is a problem?

Tony Chang

Comment 39 2011-11-10 11:17:00 PST

(In reply to comment #38) > Eric, do you still think this is a problem? I think this is still a problem: http://test-results.appspot.com/dashboards/flakiness_dashboard.html#tests=http%2Ftests%2Fmultipart%2Fpolicy-ignore-crash.php%2Chttp%2Ftests%2Fmultipart%2Fmultipart-wait-before-boundary.html On the Chromium SnowLeopard debug bots, there appears to be an inverse relationship between http/tests/multipart/multipart-wait-before-boundary.html timing out and http/tests/multipart/policy-ignore-crash.php not running properly (policy-ignore-crash runs right after multipart-wait-before-boundary.html). On the Chromium Leopard debug bots, policy-ignore-crash.php was getting some weird errors, but there doesn't appear to be problems with multipart-wait-before-boundary.html timing out. When the errors were occurring, I tried changing some settings in /private/etc/php.ini (set the session.bug_compat_42 and session.bug_compat_warn to 0) and it seemed to policy-ignore-crash.php. Maybe there's some php setting that's causing a problem? One way to test this theory would be to rewrite multipart-wait-before-boundary.html and policy-ignore-crash.php from php to perl and see if the problems persist.

Attachments
Patch (12.77 KB, patch) 2011-02-24 11:31 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
netstat output during run-webkit-tests --iterations=500 http/tests/multipart/multipart-wait-before-boundary.html (24.07 KB, text/plain) 2011-02-24 14:22 PST, Eric Seidel (no email)	no flags	Details
View All Add attachment proposed patch, testcase, etc.