Bug 62829

Summary: nrwt: attempt #3 at fixing server startup/shutdown
Product: WebKit Reporter: Dirk Pranke <dpranke>
Component: New BugsAssignee: Dirk Pranke <dpranke>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, dglazkov, eric, ojan, tony, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch as landed in r88995 and r88999, later reverted (minus the changelog entries)
none
Patch
none
Archive of layout-test-results from ec2-cr-linux-01
none
lint, fix win error
none
Archive of layout-test-results from ec2-cr-linux-03
none
incremental diff from the version of the patch that was landed and then reverted.
none
update w/ review feedback from tony none

Dirk Pranke
Reported 2011-06-16 15:23:36 PDT
nrwt: attempt #3 at fixing server startup/shutdown
Attachments
Patch as landed in r88995 and r88999, later reverted (minus the changelog entries) (47.20 KB, patch)
2011-06-16 15:23 PDT, Dirk Pranke
no flags
Patch (53.20 KB, patch)
2011-06-20 20:01 PDT, Dirk Pranke
no flags
Archive of layout-test-results from ec2-cr-linux-01 (1.21 MB, application/zip)
2011-06-20 20:09 PDT, WebKit Review Bot
no flags
lint, fix win error (53.38 KB, patch)
2011-06-20 20:15 PDT, Dirk Pranke
no flags
Archive of layout-test-results from ec2-cr-linux-03 (1.20 MB, application/zip)
2011-06-20 20:24 PDT, WebKit Review Bot
no flags
incremental diff from the version of the patch that was landed and then reverted. (21.56 KB, patch)
2011-06-20 20:27 PDT, Dirk Pranke
no flags
update w/ review feedback from tony (53.54 KB, patch)
2011-06-21 13:33 PDT, Dirk Pranke
no flags
Dirk Pranke
Comment 1 2011-06-16 15:23:59 PDT
Created attachment 97510 [details] Patch as landed in r88995 and r88999, later reverted (minus the changelog entries)
Dirk Pranke
Comment 2 2011-06-16 15:26:43 PDT
Comment on attachment 97510 [details] Patch as landed in r88995 and r88999, later reverted (minus the changelog entries) This is the rolled-up version of the patches in bug 62251, 62256, bug 62180, and bug 62615 . This version didn't work; apache didn't get shut down correctly on either linux or Mac. I'm not sure why my tests didn't catch this yet.
Dirk Pranke
Comment 3 2011-06-20 20:01:04 PDT
WebKit Review Bot
Comment 4 2011-06-20 20:09:39 PDT
Comment on attachment 97914 [details] Patch Attachment 97914 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/8910768
WebKit Review Bot
Comment 5 2011-06-20 20:09:44 PDT
Created attachment 97916 [details] Archive of layout-test-results from ec2-cr-linux-01 The attached test failures were seen while running run-webkit-tests on the chromium-ews. Bot: ec2-cr-linux-01 Port: Chromium Platform: Linux-2.6.35-28-virtual-x86_64-with-Ubuntu-10.10-maverick
Dirk Pranke
Comment 6 2011-06-20 20:15:57 PDT
Created attachment 97917 [details] lint, fix win error
WebKit Review Bot
Comment 7 2011-06-20 20:23:57 PDT
Comment on attachment 97917 [details] lint, fix win error Attachment 97917 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/8913767
WebKit Review Bot
Comment 8 2011-06-20 20:24:02 PDT
Created attachment 97918 [details] Archive of layout-test-results from ec2-cr-linux-03 The attached test failures were seen while running run-webkit-tests on the chromium-ews. Bot: ec2-cr-linux-03 Port: Chromium Platform: Linux-2.6.35-28-virtual-x86_64-with-Ubuntu-10.10-maverick
Dirk Pranke
Comment 9 2011-06-20 20:26:49 PDT
Sheesh ... hopefully third time's the charm. Okay, the previous attempt to fix the startup and shutdown of the http servers didn't work because apparently Apache lies to the caller about when it has started up or shut down. For starters, it turns out the the process that you fork off to start apache completes immediately, and so you really need to look at the pid written to the apache pidfile, not the pid returned from the popen() call. Next, you would normally expect to be able to do the equivalent of: % apache $ARGS -k start % cat $PIDFILE to get the running pid. However, $PIDFILE may not have been written yet when the process completes, so you need to actually slow-spin waiting for the pid file (note also that if the pid file doesn't exist, the servers may not have been started). Similarly, when you call '-k stop', the pid file may not have been deleted yet, and the server not actually shut down. This patch revises the code to address the above issues. I will also post an incremental diff from the version that landed previously and was reverted, because the overall patch is still quite large. I should probably follow up with the apache community at some point to figure out if things are supposed to work this way (although I can't imagine why that would be correct), or if these are bugs. Note that the behavior reproduces on both Mac and Linux. Now, you might ask why, given that there were tests written to make sure things were actually starting and stopping, why the tests were still passing. Welll, it appears that the python code is fast enough that it actually beats those races in such a way that the assertions about the servers being up or down are also "wrong", in that it would say the server was up, and then the server would shut down (or vice versa). I have revised the tests to put in some random sleeps to catch this behavior, and added an integration test that starts and stops the servers in a loop just to be safe.
Dirk Pranke
Comment 10 2011-06-20 20:27:30 PDT
Created attachment 97919 [details] incremental diff from the version of the patch that was landed and then reverted.
Dirk Pranke
Comment 11 2011-06-20 20:29:55 PDT
Note that the main change in this patch (apart from the logic fixes, obviously) is that self._process in the http_server is no longer particularly useful, so that stops being an instance variable, and instead there is a self._pid variable that tracks the pid that actually matters. In addition. self._spawn_process is now responsible for both launching the server and ensuring it is written to self._pid_file, since Apache and LigHTTPd/PyWebSocket accomplish these things differently.
Tony Chang
Comment 12 2011-06-21 11:36:57 PDT
Comment on attachment 97919 [details] incremental diff from the version of the patch that was landed and then reverted. View in context: https://bugs.webkit.org/attachment.cgi?id=97919&action=review > Tools/Scripts/webkitpy/layout_tests/port/apache_http_server.py:-57 > - self._name = 'apache' Why is this line being removed? > Tools/Scripts/webkitpy/layout_tests/port/apache_http_server.py:187 > + if not self._wait_for_action(lambda: self._filesystem.exists(self._pid_file)): > + raise http_server_base.ServerError('Failed to start %s: no pid file found' % self._name) Would it be possible to grab the pid after we do the socket connect check? That would avoid having to wait twice. I guess we would have to restructure the code quite a bit since that's not how the web socket server code works. I guess this is fine since I don't have a better suggestion. > Tools/Scripts/webkitpy/layout_tests/port/http_server_base.py:110 > + _log.warning('Failed to stop %s: pid file contains %d, not %d' % > + (self._name, actual_pid, self._pid)) > + return Should we stop some of the pids here? If we hit this, it would be nice to not leave stray processes around. > Tools/Scripts/webkitpy/layout_tests/port/http_server_base.py:149 > + def _wait_for_action(self, action, wait_secs=2.0, sleep_secs=0.05): 2 seconds seems too short for starting apache. How about something in the middle like 10s?
Tony Chang
Comment 13 2011-06-21 11:38:57 PDT
Comment on attachment 97917 [details] lint, fix win error Do you know why the cr-linux bots failed? It looks like some stray server processes. That makes me hesitant to r+ this change.
Dirk Pranke
Comment 14 2011-06-21 13:13:23 PDT
(In reply to comment #12) > (From update of attachment 97919 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=97919&action=review > > > Tools/Scripts/webkitpy/layout_tests/port/apache_http_server.py:-57 > > - self._name = 'apache' > > Why is this line being removed? > The version that was committed before had this instance variable being assigned twice. > > Tools/Scripts/webkitpy/layout_tests/port/apache_http_server.py:187 > > + if not self._wait_for_action(lambda: self._filesystem.exists(self._pid_file)): > > + raise http_server_base.ServerError('Failed to start %s: no pid file found' % self._name) > > Would it be possible to grab the pid after we do the socket connect check? That would avoid having to wait twice. I guess we would have to restructure the code quite a bit since that's not how the web socket server code works. I guess this is fine since I don't have a better suggestion. > Yeah, it's not easy to restructure this. However, both waits are very quick, in my experience, so I don't think it's worth worrying about. > > Tools/Scripts/webkitpy/layout_tests/port/http_server_base.py:110 > > + _log.warning('Failed to stop %s: pid file contains %d, not %d' % > > + (self._name, actual_pid, self._pid)) > > + return > > Should we stop some of the pids here? If we hit this, it would be nice to not leave stray processes around. Hm. Good question. The only way this could happen is if the port object thinks it started something, but the pid file points to something else, presumably because another port object killed the server and started a new one. In this case, normally, this port's server should already be dead, and we wouldn't want to kill the other one since we didn't start it. I suppose we could kill self._pid directly just to be safe. I'll add that, and we can see if anything weird happens. > > > Tools/Scripts/webkitpy/layout_tests/port/http_server_base.py:149 > > + def _wait_for_action(self, action, wait_secs=2.0, sleep_secs=0.05): > > 2 seconds seems too short for starting apache. How about something in the middle like 10s? Okay. I've never seen it take more than a second or so on my machine, but I suppose it's conceivable that there are much slower or more heavily loaded machines.
Dirk Pranke
Comment 15 2011-06-21 13:16:29 PDT
(In reply to comment #13) > (From update of attachment 97917 [details]) > Do you know why the cr-linux bots failed? It looks like some stray server processes. That makes me hesitant to r+ this change. It seems possible or likely that the cr-linux bots already had stray servers running, which would be a fatal error in the new code. We may have to clean up the EWS bots just like I'm expecting to have to clean up all of the others in order to bring them into a known good state.
Dirk Pranke
Comment 16 2011-06-21 13:33:59 PDT
Created attachment 98051 [details] update w/ review feedback from tony
Tony Chang
Comment 17 2011-06-21 13:54:43 PDT
Comment on attachment 98051 [details] update w/ review feedback from tony cr-linux passed this time. weird.
Dirk Pranke
Comment 18 2011-06-21 19:20:34 PDT
Comment on attachment 98051 [details] update w/ review feedback from tony Clearing flags on attachment: 98051 Committed r89400: <http://trac.webkit.org/changeset/89400>
Dirk Pranke
Comment 19 2011-06-21 19:20:40 PDT
All reviewed patches have been landed. Closing bug.
Dirk Pranke
Comment 20 2011-06-22 11:04:05 PDT
*** Bug 62256 has been marked as a duplicate of this bug. ***
Dirk Pranke
Comment 21 2011-06-22 11:05:03 PDT
*** Bug 62180 has been marked as a duplicate of this bug. ***
Note You need to log in before you can comment on or make changes to this bug.