RESOLVED FIXED 64886
new-run-webkit-tests hung while acquiring http lock on snow leopard bots
https://bugs.webkit.org/show_bug.cgi?id=64886
Summary new-run-webkit-tests hung while acquiring http lock on snow leopard bots
Eric Seidel (no email)
Reported 2011-07-20 13:02:17 PDT
new-run-webkit-tests hung while acquiring http lock on snow leopard bots Example: http://build.webkit.org/builders/SnowLeopard%20Intel%20Release%20%28Tests%29/builds/31580/steps/layout-test/logs/stdio 2011-07-20 09:44:53,076 67417 printing.py:469 INFO Acquiring http lock ... 2011-07-20 09:44:53,100 67417 http_lock.py:123 DEBUG Creating lock file: /var/folders/g7/g7uo9NaMFTqOewnSEAm43U+++TI/-Tmp-/WebKitHttpd.lock.21 command timed out: 1200 seconds without output, killing pid 67415 process killed by signal 9 program finished with exit code -1 elapsedTime=1240.833448 I suspect that we're hanging in this while loop: def wait_for_httpd_lock(self): """Create a lock file and wait until it's turn comes. If something goes wrong it wont do any locking.""" if not self._create_lock_file(): _log.debug("Warning, http locking failed!") return while self._curent_lock_pid() != os.getpid(): time.sleep(1) It's unclear why we need to loop there, since we just wrote the file. :) Ossy wrote the code, so perhaps he has some idea what it's trying to do. At the least that while loop should exit after some timeout.
Attachments
Patch (1.99 KB, patch)
2011-08-18 18:53 PDT, Dirk Pranke
no flags
snipped of twistd.log from apple-xserve-8 buildbot log showing a master restart (3.49 KB, application/octet-stream)
2011-08-24 15:32 PDT, Dirk Pranke
no flags
Eric Seidel (no email)
Comment 1 2011-07-20 13:10:36 PDT
I don't think our filelock code is correct. def acquire_lock(self): self._lock_file_descriptor = os.open(self._lock_file_path, os.O_TRUNC | os.O_CREAT) start_time = time.time() while True: try: self._create_lock() return True except IOError: if time.time() - start_time > self._max_wait_time_sec: _log.debug("File locking failed: %s" % str(sys.exc_info())) os.close(self._lock_file_descriptor) self._lock_file_descriptor = None return False Notice how we're calling os.open, with trunc/create *OUTSIDE* of any lock. I think we should be opening with "append" or some non-harmful mode, since it's very easy for two processes to try and grab the same lock, and blast away each other's lock just by opening the file, no?
Eric Seidel (no email)
Comment 2 2011-07-20 13:22:16 PDT
I'm not sure my diagnosis is correct. flock doesnt' write anything to the file as far as I can tell, so it shouldn't matter what mode we opened it with.
Tony Chang
Comment 4 2011-07-20 15:07:01 PDT
Looks like this might be causing a similar problem to the commit queue? http://webkit-commit-queue.appspot.com/results/9193426
Eric Seidel (no email)
Comment 5 2011-07-20 15:07:41 PDT
Nope, those are unrelated. They were caused by bug 64885.
Dirk Pranke
Comment 6 2011-08-18 18:53:52 PDT
Ryosuke Niwa
Comment 7 2011-08-18 18:55:30 PDT
Comment on attachment 104435 [details] Patch Thanks for doing this!
Ryosuke Niwa
Comment 8 2011-08-18 18:55:53 PDT
Comment on attachment 104435 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=104435&action=review > Tools/ChangeLog:9 > + Nit: Missing Reviewed By here.
Ryosuke Niwa
Comment 9 2011-08-18 18:56:28 PDT
Comment on attachment 104435 [details] Patch cq- because of the above nit.
Dirk Pranke
Comment 10 2011-08-18 18:57:38 PDT
Okay, I"ve posted a patch that disables the file locking. In theory, what'll happen is that we'll kill the old http and websocket servers, and the new tests will run fine. Supposedly we're also seeing a lot of stale NRWT processes themselves; this patch will not fix that. However, in theory (again), those processes will be done doing whatever they were doing and be idle, so there shouldn't be too much downside. This isn't a real fix, we shouldn't close this bug, and we still need to figure out what's going on. Unfortunately, I can't even really test this patch at the moment because test-webkitpy is hanging on my machine in a rebaselining test, and someone apparently has broken all of the integration tests that I wrote for these functions (but which aren't tested by default on the bots, so I can't blame them too much). If someone can get me the buildbot logs for when these things are getting wedged, that would be helpful.
Dirk Pranke
Comment 11 2011-08-18 18:58:16 PDT
(In reply to comment #8) > (From update of attachment 104435 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=104435&action=review > > > Tools/ChangeLog:9 > > + > > Nit: Missing Reviewed By here. Interesting. The changelog was generated from webkit-patch upload ... I wonder what happened to the OOPS line?
Dirk Pranke
Comment 12 2011-08-18 18:59:06 PDT
Dirk Pranke
Comment 13 2011-08-18 18:59:45 PDT
Okay, patch landed. Feel free to revert this if it causes other problems.
Dirk Pranke
Comment 14 2011-08-24 15:25:11 PDT
I took a brief look at some of the buildbot logs from when a master restarts (I've attached a snippet from one such log). It looks like the buildbot slave is hard-killing the current command to abort it, and that means that we're probably not getting a chance to clean up the subprocesses we've spawned. Given that, I'm not sure that there's much we can do to shut down cleanly, which means we need to be able to clean up on startup. Chromium has a step that the bot runs (on Windows only) that kills stray processes. Maybe the thing to do is to take that, make it portable (if necessary), and introduce it here?
Dirk Pranke
Comment 15 2011-08-24 15:32:14 PDT
Created attachment 105083 [details] snipped of twistd.log from apple-xserve-8 buildbot log showing a master restart
Eric Seidel (no email)
Comment 16 2011-10-11 13:17:58 PDT
I thought we disabled http locking on Mac? Can we close this?
Ryosuke Niwa
Comment 17 2011-10-11 14:08:33 PDT
(In reply to comment #16) > I thought we disabled http locking on Mac? Can we close this? Yes, I think so. But it'll be nice if we could file another bug to deal with stable child process issue (i.e. only manager.py is killed) on bots.
Eric Seidel (no email)
Comment 18 2011-12-21 14:34:27 PST
Attachment 104435 [details] was posted by a committer and has review+, assigning to Dirk Pranke for commit.
Dirk Pranke
Comment 19 2012-01-04 15:44:54 PST
The patch was landed in r93383, so I'm clearing the flags. Also, we updated the kill-old-processes script to address the other concerns in r97341 (bug 63651) so I'm closing this as fixed.
Note You need to log in before you can comment on or make changes to this bug.