Bug 62829 - nrwt: attempt #3 at fixing server startup/shutdown
Summary: nrwt: attempt #3 at fixing server startup/shutdown
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Dirk Pranke
URL:
Keywords:
: 62180 62256 (view as bug list)
Depends on:
Blocks:
 
Reported: 2011-06-16 15:23 PDT by Dirk Pranke
Modified: 2011-06-22 11:05 PDT (History)
6 users (show)

See Also:


Attachments
Patch as landed in r88995 and r88999, later reverted (minus the changelog entries) (47.20 KB, patch)
2011-06-16 15:23 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
Patch (53.20 KB, patch)
2011-06-20 20:01 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ec2-cr-linux-01 (1.21 MB, application/zip)
2011-06-20 20:09 PDT, WebKit Review Bot
no flags Details
lint, fix win error (53.38 KB, patch)
2011-06-20 20:15 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ec2-cr-linux-03 (1.20 MB, application/zip)
2011-06-20 20:24 PDT, WebKit Review Bot
no flags Details
incremental diff from the version of the patch that was landed and then reverted. (21.56 KB, patch)
2011-06-20 20:27 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
update w/ review feedback from tony (53.54 KB, patch)
2011-06-21 13:33 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dirk Pranke 2011-06-16 15:23:36 PDT
nrwt: attempt #3 at fixing server startup/shutdown
Comment 1 Dirk Pranke 2011-06-16 15:23:59 PDT
Created attachment 97510 [details]
Patch as landed in r88995 and r88999, later reverted (minus the changelog entries)
Comment 2 Dirk Pranke 2011-06-16 15:26:43 PDT
Comment on attachment 97510 [details]
Patch as landed in r88995 and r88999, later reverted (minus the changelog entries)

This is the rolled-up version of the patches in bug 62251, 62256, bug 62180, and bug 62615 . This version didn't work; apache didn't get shut down correctly on either linux or Mac. I'm not sure why my tests didn't catch this yet.
Comment 3 Dirk Pranke 2011-06-20 20:01:04 PDT
Created attachment 97914 [details]
Patch
Comment 4 WebKit Review Bot 2011-06-20 20:09:39 PDT
Comment on attachment 97914 [details]
Patch

Attachment 97914 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/8910768
Comment 5 WebKit Review Bot 2011-06-20 20:09:44 PDT
Created attachment 97916 [details]
Archive of layout-test-results from ec2-cr-linux-01

The attached test failures were seen while running run-webkit-tests on the chromium-ews.
Bot: ec2-cr-linux-01  Port: Chromium  Platform: Linux-2.6.35-28-virtual-x86_64-with-Ubuntu-10.10-maverick
Comment 6 Dirk Pranke 2011-06-20 20:15:57 PDT
Created attachment 97917 [details]
lint, fix win error
Comment 7 WebKit Review Bot 2011-06-20 20:23:57 PDT
Comment on attachment 97917 [details]
lint, fix win error

Attachment 97917 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/8913767
Comment 8 WebKit Review Bot 2011-06-20 20:24:02 PDT
Created attachment 97918 [details]
Archive of layout-test-results from ec2-cr-linux-03

The attached test failures were seen while running run-webkit-tests on the chromium-ews.
Bot: ec2-cr-linux-03  Port: Chromium  Platform: Linux-2.6.35-28-virtual-x86_64-with-Ubuntu-10.10-maverick
Comment 9 Dirk Pranke 2011-06-20 20:26:49 PDT
Sheesh ... hopefully third time's the charm.

Okay, the previous attempt to fix the startup and shutdown of the http servers didn't work because apparently Apache lies to the caller about when it has started up or shut down.

For starters, it turns out the the process that you fork off to start apache completes immediately, and so you really need to look at the pid written to the apache pidfile, not the pid returned from the popen() call.

Next, you would normally expect to be able to do the equivalent of:

% apache $ARGS -k start
% cat $PIDFILE

to get the running pid. However, $PIDFILE may not have been written yet when the process completes, so you need to actually slow-spin waiting for the pid file (note also that if the pid file doesn't exist, the servers may not have been started). Similarly, when you call '-k stop', the pid file may not have been deleted yet, and the server not actually shut down.

This patch revises the code to address the above issues. I will also post an incremental diff from the version that landed previously and was reverted, because the overall patch is still quite large. 

I should probably follow up with the apache community at some point to figure out if things are supposed to work this way (although I can't imagine  why that would be correct), or if these are bugs. 

Note that the behavior reproduces on both Mac and Linux.

Now, you might ask why, given that there were tests written to make sure things were actually starting and stopping, why the tests were still passing. Welll, it appears that the python code is fast enough that it actually beats those races in such a way that the assertions about the servers being up or down are also "wrong", in that it would say the server was up, and then the server would shut down (or vice versa).

I have revised the tests to put in some random sleeps to catch this behavior, and added an integration test that starts and stops the servers in a loop just to be safe.
Comment 10 Dirk Pranke 2011-06-20 20:27:30 PDT
Created attachment 97919 [details]
incremental diff from the version of the patch that was landed and then reverted.
Comment 11 Dirk Pranke 2011-06-20 20:29:55 PDT
Note that the main change in this patch (apart from the logic fixes, obviously) is that self._process in the http_server is no longer particularly useful, so that stops being an instance variable, and instead there is a self._pid variable that tracks the pid that actually matters. In addition. self._spawn_process is now responsible for both launching the server and ensuring it is written to self._pid_file, since Apache and LigHTTPd/PyWebSocket accomplish these things differently.
Comment 12 Tony Chang 2011-06-21 11:36:57 PDT
Comment on attachment 97919 [details]
incremental diff from the version of the patch that was landed and then reverted.

View in context: https://bugs.webkit.org/attachment.cgi?id=97919&action=review

> Tools/Scripts/webkitpy/layout_tests/port/apache_http_server.py:-57
> -        self._name = 'apache'

Why is this line being removed?

> Tools/Scripts/webkitpy/layout_tests/port/apache_http_server.py:187
> +        if not self._wait_for_action(lambda: self._filesystem.exists(self._pid_file)):
> +            raise http_server_base.ServerError('Failed to start %s: no pid file found' % self._name)

Would it be possible to grab the pid after we do the socket connect check?  That would avoid having to wait twice.  I guess we would have to restructure the code quite a bit since that's not how the web socket server code works.  I guess this is fine since I don't have a better suggestion.

> Tools/Scripts/webkitpy/layout_tests/port/http_server_base.py:110
> +            _log.warning('Failed to stop %s: pid file contains %d, not %d' %
> +                         (self._name, actual_pid, self._pid))
> +            return

Should we stop some of the pids here?  If we hit this, it would be nice to not leave stray processes around.

> Tools/Scripts/webkitpy/layout_tests/port/http_server_base.py:149
> +    def _wait_for_action(self, action, wait_secs=2.0, sleep_secs=0.05):

2 seconds seems too short for starting apache.  How about something in the middle like 10s?
Comment 13 Tony Chang 2011-06-21 11:38:57 PDT
Comment on attachment 97917 [details]
lint, fix win error

Do you know why the cr-linux bots failed?  It looks like some stray server processes.  That makes me hesitant to r+ this change.
Comment 14 Dirk Pranke 2011-06-21 13:13:23 PDT
(In reply to comment #12)
> (From update of attachment 97919 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=97919&action=review
> 
> > Tools/Scripts/webkitpy/layout_tests/port/apache_http_server.py:-57
> > -        self._name = 'apache'
> 
> Why is this line being removed?
> 

The version that was committed before had this instance variable being assigned twice.

> > Tools/Scripts/webkitpy/layout_tests/port/apache_http_server.py:187
> > +        if not self._wait_for_action(lambda: self._filesystem.exists(self._pid_file)):
> > +            raise http_server_base.ServerError('Failed to start %s: no pid file found' % self._name)
> 
> Would it be possible to grab the pid after we do the socket connect check?  That would avoid having to wait twice.  I guess we would have to restructure the code quite a bit since that's not how the web socket server code works.  I guess this is fine since I don't have a better suggestion.
> 

Yeah, it's not easy to restructure this. However, both waits are very quick, in my experience, so I don't think it's worth worrying about.

> > Tools/Scripts/webkitpy/layout_tests/port/http_server_base.py:110
> > +            _log.warning('Failed to stop %s: pid file contains %d, not %d' %
> > +                         (self._name, actual_pid, self._pid))
> > +            return
> 
> Should we stop some of the pids here?  If we hit this, it would be nice to not leave stray processes around.

Hm. Good question. The only way this could happen is if the port object thinks it started something, but  the pid file points to something else, presumably because another port object killed the server and started a new one. In this case, normally, this port's server should already be dead, and we wouldn't want to kill the other one since we didn't start it.

I suppose we could kill self._pid directly just to be safe. I'll add that, and we can see if anything weird happens.

> 
> > Tools/Scripts/webkitpy/layout_tests/port/http_server_base.py:149
> > +    def _wait_for_action(self, action, wait_secs=2.0, sleep_secs=0.05):
> 
> 2 seconds seems too short for starting apache.  How about something in the middle like 10s?

Okay. I've never seen it take more than a second or so on my machine, but I suppose it's conceivable that there are much slower or more heavily loaded machines.
Comment 15 Dirk Pranke 2011-06-21 13:16:29 PDT
(In reply to comment #13)
> (From update of attachment 97917 [details])
> Do you know why the cr-linux bots failed?  It looks like some stray server processes.  That makes me hesitant to r+ this change.

It seems possible or likely that the cr-linux bots already had stray servers running, which would be a fatal error in the new code. We may have to clean up the EWS bots just like I'm expecting to have to clean up all of the others in order to bring them into a known good state.
Comment 16 Dirk Pranke 2011-06-21 13:33:59 PDT
Created attachment 98051 [details]
update w/ review feedback from tony
Comment 17 Tony Chang 2011-06-21 13:54:43 PDT
Comment on attachment 98051 [details]
update w/ review feedback from tony

cr-linux passed this time.  weird.
Comment 18 Dirk Pranke 2011-06-21 19:20:34 PDT
Comment on attachment 98051 [details]
update w/ review feedback from tony

Clearing flags on attachment: 98051

Committed r89400: <http://trac.webkit.org/changeset/89400>
Comment 19 Dirk Pranke 2011-06-21 19:20:40 PDT
All reviewed patches have been landed.  Closing bug.
Comment 20 Dirk Pranke 2011-06-22 11:04:05 PDT
*** Bug 62256 has been marked as a duplicate of this bug. ***
Comment 21 Dirk Pranke 2011-06-22 11:05:03 PDT
*** Bug 62180 has been marked as a duplicate of this bug. ***