Bug 55164 - Long-sleeping php jobs may be causing the server to time out requests
Summary: Long-sleeping php jobs may be causing the server to time out requests
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Other OS X 10.5
: P2 Normal
Assignee: Eric Seidel (no email)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-02-24 11:24 PST by Eric Seidel (no email)
Modified: 2011-11-10 11:17 PST (History)
15 users (show)

See Also:


Attachments
Patch (12.77 KB, patch)
2011-02-24 11:31 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
netstat output during run-webkit-tests --iterations=500 http/tests/multipart/multipart-wait-before-boundary.html (24.07 KB, text/plain)
2011-02-24 14:22 PST, Eric Seidel (no email)
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Seidel (no email) 2011-02-24 11:24:54 PST
Long-sleeping php jobs may be causing the server to time out requests
Comment 1 Eric Seidel (no email) 2011-02-24 11:31:47 PST
Created attachment 83687 [details]
Patch
Comment 2 Eric Seidel (no email) 2011-02-24 11:34:01 PST
If this theory is correct, we can basically dup every single one of bugs about http tests listed from https://bugs.webkit.org/show_bug.cgi?id=50856 to this one.
Comment 3 Eric Seidel (no email) 2011-02-24 11:38:11 PST
The large CC list was generated by webkit-patch --suggest-reviewers. I thought that many of the folks who wrote these tests might want to see this change go by.
Comment 4 Tony Gentilcore 2011-02-24 11:39:19 PST
Comment on attachment 83687 [details]
Patch

This seems reasonable. Aside from the hangs, it also seems great to speed up these tests.

The one reason this is scary is that presumably the tests were designed to trigger a bug that only a delay triggers. Without a delay, they would pass. So with a smaller delay, the passing may be for the right reason or it may be because the delay isn't long enough and we are losing coverage. I don't have any ideas for an efficient way to verify which is the case.

View in context: https://bugs.webkit.org/attachment.cgi?id=83687&action=review

> LayoutTests/http/tests/history/resources/slow-image.php:2
> +sleep(1);

You use 2 seconds everywhere but here. Any reason?

> LayoutTests/http/tests/xmlhttprequest/xmlhttprequest-50ms-download-dispatch.html:44
> +    // delay is in ms.

Why not just rename the variable delayMs.
Comment 5 Eric Seidel (no email) 2011-02-24 11:40:21 PST
I should note that the xmlhttprequest/web-apps tests aren't even being run.  They're all renamed -disabled.  Presumably because they all use such ridiculously long timeouts (10 minutes?!) and were hanging the server.
Comment 6 Brady Eidson 2011-02-24 11:40:34 PST
Comment on attachment 83687 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=83687&action=review

> LayoutTests/http/tests/xmlhttprequest/web-apps/016-test.cgi:5
> -sleep 600; # ten minutes
> +# The orginal test waited 10 minutes, however doing so would
> +# lock apache for too long.
> +sleep 2; # ten minutes

Then the "#ten minutes" comment is no longer valid.
Comment 7 Eric Seidel (no email) 2011-02-24 11:42:55 PST
(In reply to comment #4)
> (From update of attachment 83687 [details])
> This seems reasonable. Aside from the hangs, it also seems great to speed up these tests.
> 
> The one reason this is scary is that presumably the tests were designed to trigger a bug that only a delay triggers. Without a delay, they would pass. So with a smaller delay, the passing may be for the right reason or it may be because the delay isn't long enough and we are losing coverage. I don't have any ideas for an efficient way to verify which is the case.

I manually inspected almost all the uses of these files.  Basically for all of them (except for some of the slow images, etc.) they're just trying to delay long enough for a js setTimeout to run.

> View in context: https://bugs.webkit.org/attachment.cgi?id=83687&action=review
> 
> > LayoutTests/http/tests/history/resources/slow-image.php:2
> > +sleep(1);
> 
> You use 2 seconds everywhere but here. Any reason?

IIRC this is one of the resources which is actually expected to complete, no sense in making the test run any longer than it has to.

> > LayoutTests/http/tests/xmlhttprequest/xmlhttprequest-50ms-download-dispatch.html:44
> > +    // delay is in ms.
> 
> Why not just rename the variable delayMs.

I was trying to avoid really touching the code, but could do.
Comment 8 Eric Seidel (no email) 2011-02-24 11:43:15 PST
(In reply to comment #6)
> (From update of attachment 83687 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=83687&action=review
> 
> > LayoutTests/http/tests/xmlhttprequest/web-apps/016-test.cgi:5
> > -sleep 600; # ten minutes
> > +# The orginal test waited 10 minutes, however doing so would
> > +# lock apache for too long.
> > +sleep 2; # ten minutes
> 
> Then the "#ten minutes" comment is no longer valid.

Yup.  We should probably just remove the tests since we don't run them.
Comment 9 Eric Seidel (no email) 2011-02-24 11:58:23 PST
It's possible my theory is bad, since I would think that navigating away from a page would abort all requests for that page.  Upon closing the connection you would think Apache would be smart enough to terminate long running php/perl/python jobs.
Comment 10 Eric Seidel (no email) 2011-02-24 11:59:23 PST
https://bugs.webkit.org/show_bug.cgi?id=54791#c2 is the log which got me started on this theory.  Note that both of the late-to-complete resources are php.  It's possible that Apache doens't know how to abort php tasks when the connection closes.
Comment 11 Eric Seidel (no email) 2011-02-24 12:02:12 PST
This thread about php + apache + sleep may be related:
http://stackoverflow.com/questions/21411/php-sleep-silently-hogs-cpu
Comment 13 Eric Seidel (no email) 2011-02-24 12:06:48 PST
http://groups.google.com/group/mailing.unix.bugtraq/browse_thread/thread/90aef7b2a1df3b49/9b3f949cb001bb03?pli=1 also talks about php sleep() not counting towards max_execution_time and thus not being killed off after a timeout as expected.  This may or may not be related.

I could see it possible that we could have 150 (our current MAX_CHILD) outstanding requests all in a php sleep call if the tests are running fast enough.  Again, it's possible there is some other bottleneck limit which is lower than that.  This is all just a theory at this point.
Comment 14 Alexey Proskuryakov 2011-02-24 12:21:13 PST
This is an interesting theory!

It's surprising if we have over 150 tests with sleep(). Is there anything we can do to test that Apache successfully serves 150 concurrent requests on the buildbot slaves? Maybe open 100 connections to a slow running script with curl, and see if other requests come through?

We might also be running into an open file descriptor limit or something.

It does seem that using a short sleep interval could increase flakiness, or make some tests ineffective.
Comment 15 Alexey Proskuryakov 2011-02-24 12:24:59 PST
I think that it would be good to remove disabled tests from xmlhttprequest/webapps. These were one of the first WebKit http tests to be added, so every XHR test that we could possibly make work in the future meant much more than now.
Comment 16 Eric Seidel (no email) 2011-02-24 12:40:22 PST
I would have thought that running:
run-webkit-tests http/tests/history/back-during-onload-triggered-by-back.html --iterations 200
would confirm this theory. But that ran w/o any trouble.  Hmm..  Maybe php only "leaks" these sleeping jobs in certain circumstances.  Still testing.
Comment 17 Eric Seidel (no email) 2011-02-24 12:43:14 PST
Interesting.  If you run:
run-webkit-tests --iterations=500 http/tests/multipart/multipart-wait-before-boundary.html 
the run times are very uneven.  I've not yet seen it hang long enough to timeout, but there is definitely something bad going on there.  The run times for each iteration should be roughly the same.
Comment 18 Eric Seidel (no email) 2011-02-24 12:44:29 PST
Btw, when I run:
run-webkit-tests --iterations=500 http/tests/multipart/multipart-wait-before-boundary.html

I get a zillion httpd instances.  Suggesting that my theory is at least partially correct.  These httpd instances are leaking between runs, thus possibly causing us to hit the MaxClientLimit of 150.
Comment 19 Eric Seidel (no email) 2011-02-24 12:55:07 PST
run-webkit-tests --iterations=200 http/tests/history/popstate-fires-with-pending-requests.html
confirms the httpd process leaking as well.

I suspect it's worth investigating if there is a way to get apache to kill any process goes past some timeout.  Maybe that's already happening (but we should probably lower the timeout).

It seems the httpd leaking can certainly slow testing down.  I believe it can explain these timeouts as well, but I've not reproduced that yet.
Comment 20 Eric Seidel (no email) 2011-02-24 13:37:26 PST
I can definitely see a theoretical case in which this demonstrable httpd process leak could cause tests to timeout.

If you start a test when you are 1 away from the limit, the main resource will start, but the subresource won't and you'll time out. :)
Comment 21 Tony Gentilcore 2011-02-24 13:47:52 PST
Regardless of whether this is causing timeouts, lowering these timers where possible seems like a good idea.

I'm still concerned about the hang connection resource. There is no way 2 seconds simulates a hang.
> sleep(2); # 2 seconds should be long enough to simulate a hang.

And I'd either revert xmlhttprequest-50ms-download-dispatch.html or rename the variable.

Perhaps this change should just lower the timeouts that seem arbitrary and too high (like the 5 seconds ones) and then handle the hang tests (10 minute ones) separately.
Comment 22 Alexey Proskuryakov 2011-02-24 13:58:47 PST
> These httpd instances are leaking between runs

Leaking between runs is certainly wrong. We shut down Apache after every run-webkit-tests run, and if that doesn't close child processes reasonably soon, that's unexpected.

However, even 10 minutes shouldn't be a problem in practice, because non-http tests take much longer to run.

It's not clear if long-lived Apache processes do us any harm at all.
Comment 23 Alexey Proskuryakov 2011-02-24 14:02:17 PST
Is 5 seconds really a too high timeout? As long as tests don't wait for the resource to finish loading, it seems best to make timeouts as long as possible to avoid flakiness.
Comment 24 Eric Seidel (no email) 2011-02-24 14:03:48 PST
(In reply to comment #15)
> I think that it would be good to remove disabled tests from xmlhttprequest/webapps. These were one of the first WebKit http tests to be added, so every XHR test that we could possibly make work in the future meant much more than now.

Done.  bug 55181.
Comment 25 Eric Seidel (no email) 2011-02-24 14:05:42 PST
(In reply to comment #23)
> Is 5 seconds really a too high timeout? As long as tests don't wait for the resource to finish loading, it seems best to make timeouts as long as possible to avoid flakiness.

Longer timeouts can only reduce the probability of flakiness.  Tests which are flaky can flake at any timeout. :)  But yes I can extend the timeouts to 5s.  That just increase the overlap window with other tests.

I'm still investigating to see if there is a PHP, Apache or WebKIt bug at play here.  We shouldn't be leaking httpd processes in any case (since Apache should detect the connection close and shoot the process in the head?) and I'm not sure who is at fault.
Comment 26 Eric Seidel (no email) 2011-02-24 14:09:44 PST
(In reply to comment #22)
> > These httpd instances are leaking between runs
> 
> Leaking between runs is certainly wrong. We shut down Apache after every run-webkit-tests run, and if that doesn't close child processes reasonably soon, that's unexpected.

They leak between tests, not between apache restarts.  Shutting down apache will kill all the httpd children.

> However, even 10 minutes shouldn't be a problem in practice, because non-http tests take much longer to run.

When an httpd child outlives the test it was trying to run, it holds one of the 150 MaxChildLimit slots.  If we ran 150 tests in a row, all of which ran in a few milliseconds (like most tests do), but managed to get php to sleep for 1 second, on the 151st test it would have to wait for the first test's sleep to finish before Apache would have a child to service it.  Now imagine that instead of 1 second, we sleep 30 seconds (the DRT timeout).  Now suddenly the 151 test might timeout having never been serviced by apache at all (or more realisticly, the main resource was serviced but any subresources it needed weren't.)  All because a previous test was "leaking" an httpd process by sleeping from php.
Comment 27 Eric Seidel (no email) 2011-02-24 14:22:22 PST
I'm trying to determine if WebKit, Apache or PHP is at fault here.

When running:
run-webkit-tests --iterations=500 http/tests/multipart/multipart-wait-before-boundary.html

I see a 151 httpd processes (this is expected, and means we're hitting the MaxChild limit).

I'm able to reproduce the timeouts we see on the bots!

Will running these 500 iterations, using netstat, I see 150 sockets in CLOSE_WAIT and 150 in FIN_WAIT_2.

     CLOSE_WAIT:  The socket connection has been closed by the remote peer,
     and the system is waiting for the local application to close its half of
     the connection.

     FIN_WAIT_2:  The socket connection has been closed by the local
     application, the remote peer has acknowledged the close, and the system
     is waiting for it to close its half of the connection.

This still seems inconclusive to me.  The CLOSE_WAITs may suggest that the httpd children are stuck in sleep and not acknowledging the connection abort from WebKit.  But I don't understand the FIN_WAIT_2s.
Comment 28 Eric Seidel (no email) 2011-02-24 14:22:52 PST
Created attachment 83720 [details]
netstat output during run-webkit-tests --iterations=500 http/tests/multipart/multipart-wait-before-boundary.html
Comment 29 Eric Seidel (no email) 2011-02-24 14:25:38 PST
I upped the timout in multipart-wait-before-boundary.php to 60 seconds (from 30) and can confirm from the logs from my --iterations 500 run (see above) that the php processes are not exiting until 60 seconds after they started:

127.0.0.1 - - [24/Feb/2011:14:20:41 -0800] "GET /multipart/multipart-wait-before-boundary.html HTTP/1.1" 200 718
127.0.0.1 - - [24/Feb/2011:14:19:41 -0800] "GET /multipart/resources/multipart-wait-before-boundary.php HTTP/1.1" 200 6353
127.0.0.1 - - [24/Feb/2011:14:20:41 -0800] "GET /multipart/resources/multipart-wait-before-boundary.php?done=1 HTTP/1.1" 200 33
127.0.0.1 - - [24/Feb/2011:14:20:41 -0800] "GET /multipart/multipart-wait-before-boundary.html HTTP/1.1" 200 718
127.0.0.1 - - [24/Feb/2011:14:19:41 -0800] "GET /multipart/resources/multipart-wait-before-boundary.php HTTP/1.1" 200 6353
127.0.0.1 - - [24/Feb/2011:14:20:41 -0800] "GET /multipart/resources/multipart-wait-before-boundary.php?done=1 HTTP/1.1" 200 33

This means that any "10 minute" timeouts in PHP would actually hang one of the httpd children for 10 minutes, or until the httpd master shuts down (whichever comes first).
Comment 30 Eric Seidel (no email) 2011-02-24 14:32:51 PST
Sad.  There is a hole in my theory.  When I'm geting the timeouts from my --iterations 500 runs, I'm seeing this in the error log:

[Thu Feb 24 14:29:25 2011] [error] server reached MaxClients setting, consider raising the MaxClients setting

I see no such message in the error log seen on https://bugs.webkit.org/show_bug.cgi?id=54791.
Comment 31 Eric Seidel (no email) 2011-02-24 14:35:44 PST
I tried changing the "Timeout" value in http/conf/apache2-httpd.conf (which I believe to be the correct config file for Snow Leopard).  I changed it from 300 seconds (default) to 30 seconds.  This did not change how long any individual httpd process stayed around when calling sleep from php.  I would have expected these processes to be killed at 30 seconds but they weren't. :(
Comment 32 Eric Seidel (no email) 2011-02-24 14:37:33 PST
PHP's own documentation (or at least the comments) confirm that sleep can bypass the automatic limiting (which defaults to 30 seconds):  http://php.net/manual/en/function.set-time-limit.php
Comment 33 Alexey Proskuryakov 2011-02-24 14:41:19 PST
> Longer timeouts can only reduce the probability of flakiness.

Yes, that's what I also said. If you intended to object, then I'll counter-object :)

A short timeout will make it more likely that a resource that's not expected to finish loading in a given test will actually finish loading, causing havoc.

> They leak between tests, not between apache restarts.

OK - I misunderstood what you were talking about. It seems unlikely that the dozen tests with longish sleep() that we have causes trouble then.

> Apache should detect the connection close and shoot the process in the head

Possibly, but I'm not sure about that. It might be considered unsafe to force kill external scripts that are doing work.

> But I don't understand the FIN_WAIT_2s

Socket are kept in FIN_WAIT state by the kernel regardless of whether a process is still running. I don't remember enough detail to tell how FIN_WAIT is different from FIN_WAIT_2.

> This means that any "10 minute" timeouts in PHP would actually hang one of the httpd children for 10 minutes

It seems harmless if we only have a few of those.
Comment 34 Eric Seidel (no email) 2011-02-24 14:57:32 PST
I just ran "run-webkit-tests http" while monitoring the number of httpd processes with "while true; do ps waux | grep httpd | wc; sleep .5; done".

The high-water mark was 22 processes, further evidence that we're not hitting the MaxChild limit (at least not on my machine).
Comment 35 Eric Seidel (no email) 2011-02-24 14:59:25 PST
(In reply to comment #33)
> > But I don't understand the FIN_WAIT_2s
> 
> Socket are kept in FIN_WAIT state by the kernel regardless of whether a process is still running. I don't remember enough detail to tell how FIN_WAIT is different from FIN_WAIT_2.

I now understand what netstat is saying here.  The Apache side are waiting in CLOSE_WAIT and the WebKit side are waiting in FIN_WAIT_2.  They're two sides of the same coin and this is expected behavior since php is not servicing the connection while sleeping.  I didn't initially realize that netstat would show two entries for each connection. One for the Apache socket and one for the WebKit/DRT socket.
Comment 36 Eric Seidel (no email) 2011-02-24 15:01:12 PST
(In reply to comment #33)
> > Apache should detect the connection close and shoot the process in the head
> 
> Possibly, but I'm not sure about that. It might be considered unsafe to force kill external scripts that are doing work.

I realize now that Apache would have no way to do this.  it's already handed off the socket to the child process and the child has called sleep().  Ideally php's "sleep" call would actually be some sort of poll/select call which serviced any open sockets while "sleeping".
Comment 37 Adam Barth 2011-10-14 17:20:11 PDT
Comment on attachment 83687 [details]
Patch

Booting from pending-review due to inactivity.
Comment 38 Adam Barth 2011-10-14 17:20:27 PDT
Eric, do you still think this is a problem?
Comment 39 Tony Chang 2011-11-10 11:17:00 PST
(In reply to comment #38)
> Eric, do you still think this is a problem?

I think this is still a problem:
http://test-results.appspot.com/dashboards/flakiness_dashboard.html#tests=http%2Ftests%2Fmultipart%2Fpolicy-ignore-crash.php%2Chttp%2Ftests%2Fmultipart%2Fmultipart-wait-before-boundary.html

On the Chromium SnowLeopard debug bots, there appears to be an inverse relationship between http/tests/multipart/multipart-wait-before-boundary.html timing out and http/tests/multipart/policy-ignore-crash.php not running properly (policy-ignore-crash runs right after multipart-wait-before-boundary.html).

On the Chromium Leopard debug bots, policy-ignore-crash.php was getting some weird errors, but there doesn't appear to be problems with multipart-wait-before-boundary.html timing out.  When the errors were occurring, I tried changing some settings in /private/etc/php.ini (set the session.bug_compat_42 and session.bug_compat_warn to 0) and it seemed to policy-ignore-crash.php.

Maybe there's some php setting that's causing a problem?  One way to test this theory would be to rewrite multipart-wait-before-boundary.html and policy-ignore-crash.php from php to perl and see if the problems persist.