On Mac crashreporter will roll over logs after a certain number of crashes. This can cause the layout tests to fail if looking for a log that doesn't exist anymore. We're tracking bugs internally to fix that problem, but the the try/catch block should include the stat and catch OSError. backtrace: 18:01:45.278 79451 worker/5 raised OSError('[Errno 2] No such file or directory: '/Users/buildbot/Library/Logs/DiagnosticReports/WebProcess_2012-05-29-161118_buildbots-Mac-Pro-40.crash''): 18:01:45.278 79492 worker/7 mathml/presentation/fenced-mi.xhtml passed 18:01:45.278 79451 layout_tests/controllers/manager_worker_broker.py:290 (in run) 18:01:45.278 79451 self._worker_connection.run_message_loop() 18:01:45.278 79451 layout_tests/controllers/manager_worker_broker.py:243 (in run_message_loop) 18:01:45.278 79451 self._broker.run_message_loop(self._run_topic, self._client, delay_secs) 18:01:45.278 79451 layout_tests/controllers/manager_worker_broker.py:175 (in run_message_loop) 18:01:45.279 79451 self._run_loop(topic_name, client, block=True, delay_secs=delay_secs) 18:01:45.279 79451 layout_tests/controllers/manager_worker_broker.py:189 (in _run_loop) 18:01:45.279 79451 self._dispatch_message(msg, client) 18:01:45.279 79451 layout_tests/controllers/manager_worker_broker.py:198 (in _dispatch_message) 18:01:45.279 79451 message_handler(message.src, *optargs) 18:01:45.279 79451 layout_tests/controllers/worker.py:144 (in handle_test_list) 18:01:45.279 79451 self._run_test(test_input) 18:01:45.279 79451 layout_tests/controllers/worker.py:171 (in _run_test) 18:01:45.279 79451 result = self.run_test_with_timeout(test_input, test_timeout_sec) 18:01:45.279 79451 layout_tests/controllers/worker.py:213 (in run_test_with_timeout) 18:01:45.279 79451 return self._run_test_in_this_thread(test_input) 18:01:45.279 79451 layout_tests/controllers/worker.py:298 (in _run_test_in_this_thread) 18:01:45.279 79451 return self.run_single_test(self._driver, test_input) 18:01:45.279 79451 layout_tests/controllers/worker.py:302 (in run_single_test) 18:01:45.279 79451 test_input, driver, self._name) 18:01:45.279 79451 layout_tests/controllers/single_test_runner.py:46 (in run_single_test) 18:01:45.279 79451 return runner.run() 18:01:45.279 79451 layout_tests/controllers/single_test_runner.py:101 (in run) 18:01:45.280 79451 return self._run_reftest() 18:01:45.280 79451 layout_tests/controllers/single_test_runner.py:291 (in _run_reftest) 18:01:45.280 79451 reference_output = self._driver.run_test(DriverInput(reference_test_name, self._timeout, test_output.image_hash, should_run_pixel_test=True)) 18:01:45.280 79451 layout_tests/port/driver.py:217 (in run_test) 18:01:45.280 79451 return self._running_drivers[cmd_line_key].run_test(driver_input) 18:01:45.280 79451 layout_tests/port/webkit.py:579 (in run_test) 18:01:45.280 79451 newer_than=start_time) 18:01:45.280 79451 layout_tests/port/mac.py:194 (in _get_crash_log) 18:01:45.280 79451 crash_log = crash_logs.find_newest_log(name, pid, include_errors=True, newer_than=newer_than) 18:01:45.280 79451 common/system/crashlogs.py:38 (in find_newest_log) 18:01:45.280 79451 return self._find_newest_log_darwin(process_name, pid, include_errors, newer_than) 18:01:45.280 79451 common/system/crashlogs.py:59 (in _find_newest_log_darwin) 18:01:45.280 79451 if not newer_than or self._host.filesystem.mtime(path) > newer_than: 18:01:45.280 79451 common/system/filesystem.py:183 (in mtime) 18:01:45.280 79451 return os.stat(path).st_mtime
the list of logs is only obtained 4 lines above ... doesn't that imply that we're rolling over the logs in the time it takes to actually walk the list (presumably less than a second or so)? How many things do you have crashing at once, and what is the limit on how many logs crash reporter will keep?
hmm, It isn't the crash log rollover causing the problem either. I just looked at the bot now, and there are older logs than the one it is looking for. I wonder how it got the name of a file that doesn't exist.
(In reply to comment #2) > hmm, It isn't the crash log rollover causing the problem either. I just looked at the bot now, and there are older logs than the one it is looking for. I wonder how it got the name of a file that doesn't exist. Interesting. That's a good question. There would either have to be a bug in our file-finding code (which I just checked and it looks correct - that's certainly a plausible looking filename, at least, but maybe I'm missing something), or the file would've had to have been created and then deleted, or you're seeing some sort of a filesystem / python inconsistency between os.walk() and os.stat().
It looks like ReportCrash doesn't necessarily remove by oldest logs first. I just had the logs disappear out from underneath me. It must use some sort of heuristic and we are getting bitten by it.
I see. Well, catching the exception would certainly be defensively safe and correct; did you have a patch in the works, or do you want me to fix it?
Created attachment 144683 [details] patch
Comment on attachment 144683 [details] patch This needs a unittest.
Created attachment 144973 [details] patch and test
committed http://trac.webkit.org/projects/webkit/changeset/119030
I'm not sure why this is still open ... closing.