Bug 81600 - webkitpy: get ServerProcess out of the reportcrash business
Summary: webkitpy: get ServerProcess out of the reportcrash business
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Dirk Pranke
URL:
Keywords:
Depends on:
Blocks: 71380
  Show dependency treegraph
 
Reported: 2012-03-19 17:28 PDT by Dirk Pranke
Modified: 2012-03-21 13:29 PDT (History)
6 users (show)

See Also:


Attachments
Patch (12.64 KB, patch)
2012-03-19 17:32 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
Patch (12.41 KB, patch)
2012-03-20 15:39 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
fix another crash (12.74 KB, patch)
2012-03-21 13:22 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dirk Pranke 2012-03-19 17:28:29 PDT
webkitpy: get ServerProcess out of the reportcrash business
Comment 1 Dirk Pranke 2012-03-19 17:32:16 PDT
Created attachment 132732 [details]
Patch
Comment 2 Adam Barth 2012-03-19 17:39:47 PDT
Comment on attachment 132732 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=132732&action=review

> Tools/Scripts/webkitpy/layout_tests/port/server_process.py:106
> +        if self._proc.returncode in (-1073741510, -signal.SIGINT):

-1073741510   <-   Can we name this constant?
Comment 3 Dirk Pranke 2012-03-19 17:52:55 PDT
Comment on attachment 132732 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=132732&action=review

>> Tools/Scripts/webkitpy/layout_tests/port/server_process.py:106
>> +        if self._proc.returncode in (-1073741510, -signal.SIGINT):
> 
> -1073741510   <-   Can we name this constant?

I don't actually know what a good name for this constant would be; given the FIXME above, I suspect we can actually delete the constant altogether, and even if we can't, I need to figure out if this value correlates to something. 

I'd like to defer doing that and removing the fixme to a subsequent patch if that's okay, since getting to the root of this is kind of a different effort.
Comment 4 Adam Barth 2012-03-19 17:54:28 PDT
Comment on attachment 132732 [details]
Patch

I'm not sure I understand.  Where did the code that waits for ReportCrash to finish dumping the stack go?
Comment 5 Adam Barth 2012-03-19 17:55:09 PDT
Do we not need to wait for ReportCrash anymore?
Comment 6 Dirk Pranke 2012-03-19 18:04:13 PDT
(In reply to comment #4)
> (From update of attachment 132732 [details])
> I'm not sure I understand.  Where did the code that waits for ReportCrash to finish dumping the stack go?

That logic moves into the Port classes (see bug 81603), but I should probably move some of it back here.

The reason is slightly complicated. 

It turns out that waiting for ReportCrash to exit is the wrong thing to do. Under normal circumstances (things are crashing rarely), ReportCrash actually writes the crash log before python is even notified that the subprocess has crashed. When lots of things are crashing, however, the write may be somewhat delayed (I don't really know what's going on, but it's like there is a delay before the file gets flushed to disk). So, in some cases, we actually do want to wait a while longer to look for the log - hence, in the patch in bug 81603, there's a slow spin for a period of time derived from how many workers are running in parallel.

However, in any case, the operating system appears to keep ReportCrash around for a long (~30-40second) time just so it doesn't have to re-launch it if there is another crash. Also, there is only ever one ReportCrash running, regardless of how many processes are crashing at once (it appears that the single ReportCrash handles multiple crashes simultaneously).

Most of the time, not waiting at all seems to be fine. Sometimes - if the world is broken and so every test crashes - you want to wait for a while. 

Ideally (for me) we'd land this patch as-is and then get "more correct" when 81603 lands. However, if you like I can rework the division of labor between the two patches a bit so that we even slow-spin as part of this patch as well.
Comment 7 Dirk Pranke 2012-03-19 18:04:41 PDT
(In reply to comment #6)
> That logic moves into the Port classes (see bug 81603), but I should probably move some of it back here.

s/should probably/could if necessary/. :)
Comment 8 Adam Barth 2012-03-19 18:18:04 PDT
I see.  I think I misunderstood how ReportCrash worked previously.  I thought there was one for each crash dump.
Comment 9 Adam Barth 2012-03-19 18:19:02 PDT
Comment on attachment 132732 [details]
Patch

I'm not sure I understand 100% of the details in this patch, but I like the overall direction.
Comment 10 Dirk Pranke 2012-03-19 18:33:10 PDT
Committed r111293: <http://trac.webkit.org/changeset/111293>
Comment 11 Jessie Berlin 2012-03-19 20:11:26 PDT
This is still breaking the layout tests on the bots, even after http://trac.webkit.org/changeset/111307

http://build.webkit.org/builders/Lion%20Intel%20Debug%20%28WebKit2%20Tests%29/builds/5070/steps/layout-test/logs/stdio

-03-19 19:51:42,660 43121 single_test_runner.py:209 DEBUG   /Volumes/Data/slave/lion-intel-debug/build/Source/WebCore/loader/appcache/ApplicationCacheStorage.cpp(557) : bool WebCore::ApplicationCacheStorage::executeSQLCommand(const WTF::String &)
2012-03-19 19:51:42,660 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/manager_worker_broker.py", line 280, in run
2012-03-19 19:51:42,661 43123 stack_utils.py:67 DEBUG       self._worker_connection.run_message_loop()
2012-03-19 19:51:42,661 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/manager_worker_broker.py", line 232, in run_message_loop
2012-03-19 19:51:42,661 43123 stack_utils.py:67 DEBUG       self._broker.run_message_loop(self._run_topic, self._client, delay_secs)
2012-03-19 19:51:42,661 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/manager_worker_broker.py", line 168, in run_message_loop
2012-03-19 19:51:42,661 43123 stack_utils.py:67 DEBUG       self._run_loop(topic_name, client, block=True, delay_secs=delay_secs)
2012-03-19 19:51:42,661 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/manager_worker_broker.py", line 182, in _run_loop
2012-03-19 19:51:42,662 43121 worker.py:206 DEBUG worker/5 fast/media/media-query-invalid-value.html passed
2012-03-19 19:51:42,662 43123 stack_utils.py:67 DEBUG       self._dispatch_message(msg, client)
2012-03-19 19:51:42,662 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/manager_worker_broker.py", line 191, in _dispatch_message
2012-03-19 19:51:42,662 43123 stack_utils.py:67 DEBUG       message_handler(message.src, *optargs)
2012-03-19 19:51:42,662 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/worker.py", line 128, in handle_test_list
2012-03-19 19:51:42,662 43123 stack_utils.py:67 DEBUG       self._run_test(test_input)
2012-03-19 19:51:42,663 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/worker.py", line 143, in _run_test
2012-03-19 19:51:42,663 43123 stack_utils.py:67 DEBUG       result = self.run_test_with_timeout(test_input, test_timeout_sec)
2012-03-19 19:51:42,663 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/worker.py", line 185, in run_test_with_timeout
2012-03-19 19:51:42,663 43123 stack_utils.py:67 DEBUG       return self._run_test_in_this_thread(test_input)
2012-03-19 19:51:42,663 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/worker.py", line 270, in _run_test_in_this_thread
2012-03-19 19:51:42,663 43123 stack_utils.py:67 DEBUG       return self.run_single_test(self._driver, test_input)
2012-03-19 19:51:42,663 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/worker.py", line 274, in run_single_test
2012-03-19 19:51:42,664 43123 stack_utils.py:67 DEBUG       test_input, driver, self._name)
2012-03-19 19:51:42,664 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/single_test_runner.py", line 46, in run_single_test
2012-03-19 19:51:42,664 43123 stack_utils.py:67 DEBUG       return runner.run()
2012-03-19 19:51:42,664 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/single_test_runner.py", line 106, in run
2012-03-19 19:51:42,664 43123 stack_utils.py:67 DEBUG       return self._run_compare_test()
2012-03-19 19:51:42,664 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/single_test_runner.py", line 119, in _run_compare_test
2012-03-19 19:51:42,664 43123 stack_utils.py:67 DEBUG       test_result_writer.write_test_result(self._filesystem, self._port, self._test_name, driver_output, expected_driver_output, test_result.failures)
2012-03-19 19:51:42,665 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/test_result_writer.py", line 66, in write_test_result
2012-03-19 19:51:42,665 43123 stack_utils.py:67 DEBUG       writer.write_crash_report(crashed_driver_output.crashed_process_name, crashed_driver_output.error)
2012-03-19 19:51:42,665 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/layout_tests/controllers/test_result_writer.py", line 165, in write_crash_report
2012-03-19 19:51:42,665 43123 stack_utils.py:67 DEBUG       log = crash_logs.find_newest_log(crashed_process_name)
2012-03-19 19:51:42,665 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/common/system/crashlogs.py", line 40, in find_newest_log
2012-03-19 19:51:42,665 43123 stack_utils.py:67 DEBUG       return self._find_newest_log_darwin(process_name, pid, include_errors)
2012-03-19 19:51:42,665 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/common/system/crashlogs.py", line 57, in _find_newest_log_darwin
2012-03-19 19:51:42,666 43123 stack_utils.py:67 DEBUG       logs = self._filesystem.files_under(log_directory, file_filter=is_crash_log)
2012-03-19 19:51:42,666 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/common/system/filesystem.py", line 111, in files_under
2012-03-19 19:51:42,666 43123 stack_utils.py:67 DEBUG       if file_filter(self, dirpath, filename):
2012-03-19 19:51:42,666 43123 stack_utils.py:67 DEBUG     File "/Volumes/Data/slave/lion-intel-debug-tests-wk2/build/Tools/Scripts/webkitpy/common/system/crashlogs.py", line 54, in is_crash_log
2012-03-19 19:51:42,666 43123 stack_utils.py:67 DEBUG       return basename.startswith(process_name + "_") and basename.endswith(".crash")
2012-03-19 19:51:42,668 43123 manager_worker_broker.py:291 DEBUG worker/7 done with message loop, exception raised
2012-03-19 19:51:42,669 42883 manager.py:1424 ERROR worker/7 raised TypeError('unsupported operand type(s) for +: 'NoneType' and 'str''):
2012-03-19 19:51:42,669 42883 manager.py:1458 ERROR   layout_tests/controllers/manager_worker_broker.py:280 (in run)
2012-03-19 19:51:42,669 42883 manager.py:1459 ERROR     self._worker_connection.run_message_loop()
2012-03-19 19:51:42,669 42883 manager.py:1458 ERROR   layout_tests/controllers/manager_worker_broker.py:232 (in run_message_loop)
2012-03-19 19:51:42,669 42883 manager.py:1459 ERROR     self._broker.run_message_loop(self._run_topic, self._client, delay_secs)
2012-03-19 19:51:42,669 42883 manager.py:1458 ERROR   layout_tests/controllers/manager_worker_broker.py:168 (in run_message_loop)
2012-03-19 19:51:42,669 42883 manager.py:1459 ERROR     self._run_loop(topic_name, client, block=True, delay_secs=delay_secs)
2012-03-19 19:51:42,670 42883 manager.py:1458 ERROR   layout_tests/controllers/manager_worker_broker.py:182 (in _run_loop)
2012-03-19 19:51:42,670 42883 manager.py:1459 ERROR     self._dispatch_message(msg, client)
2012-03-19 19:51:42,670 42883 manager.py:1458 ERROR   layout_tests/controllers/manager_worker_broker.py:191 (in _dispatch_message)
2012-03-19 19:51:42,670 42883 manager.py:1459 ERROR     message_handler(message.src, *optargs)
2012-03-19 19:51:42,670 42883 manager.py:1458 ERROR   layout_tests/controllers/worker.py:128 (in handle_test_list)
2012-03-19 19:51:42,670 42883 manager.py:1459 ERROR     self._run_test(test_input)
2012-03-19 19:51:42,670 42883 manager.py:1458 ERROR   layout_tests/controllers/worker.py:143 (in _run_test)
2012-03-19 19:51:42,670 42883 manager.py:1459 ERROR     result = self.run_test_with_timeout(test_input, test_timeout_sec)
2012-03-19 19:51:42,670 42883 manager.py:1458 ERROR   layout_tests/controllers/worker.py:185 (in run_test_with_timeout)
2012-03-19 19:51:42,670 42883 manager.py:1459 ERROR     return self._run_test_in_this_thread(test_input)
2012-03-19 19:51:42,670 42883 manager.py:1458 ERROR   layout_tests/controllers/worker.py:270 (in _run_test_in_this_thread)
2012-03-19 19:51:42,670 42883 manager.py:1459 ERROR     return self.run_single_test(self._driver, test_input)
2012-03-19 19:51:42,670 42883 manager.py:1458 ERROR   layout_tests/controllers/worker.py:274 (in run_single_test)
2012-03-19 19:51:42,671 42883 manager.py:1459 ERROR     test_input, driver, self._name)
2012-03-19 19:51:42,671 42883 manager.py:1458 ERROR   layout_tests/controllers/single_test_runner.py:46 (in run_single_test)
2012-03-19 19:51:42,671 42883 manager.py:1459 ERROR     return runner.run()
2012-03-19 19:51:42,671 42883 manager.py:1458 ERROR   layout_tests/controllers/single_test_runner.py:106 (in run)
2012-03-19 19:51:42,671 42883 manager.py:1459 ERROR     return self._run_compare_test()
2012-03-19 19:51:42,671 42883 manager.py:1458 ERROR   layout_tests/controllers/single_test_runner.py:119 (in _run_compare_test)
2012-03-19 19:51:42,671 42883 manager.py:1459 ERROR     test_result_writer.write_test_result(self._filesystem, self._port, self._test_name, driver_output, expected_driver_output, test_result.failures)
2012-03-19 19:51:42,671 43123 worker.py:151 DEBUG worker/7 cleaning up
2012-03-19 19:51:42,671 42883 manager.py:1458 ERROR   layout_tests/controllers/test_result_writer.py:66 (in write_test_result)
2012-03-19 19:51:42,671 42883 manager.py:1459 ERROR     writer.write_crash_report(crashed_driver_output.crashed_process_name, crashed_driver_output.error)
2012-03-19 19:51:42,671 42883 manager.py:1458 ERROR   layout_tests/controllers/test_result_writer.py:165 (in write_crash_report)
2012-03-19 19:51:42,671 42883 manager.py:1459 ERROR     log = crash_logs.find_newest_log(crashed_process_name)
2012-03-19 19:51:42,671 43123 worker.py:178 DEBUG worker/7 killing driver
2012-03-19 19:51:42,671 42883 manager.py:1458 ERROR   common/system/crashlogs.py:40 (in find_newest_log)
2012-03-19 19:51:42,672 42883 manager.py:1459 ERROR     return self._find_newest_log_darwin(process_name, pid, include_errors)
2012-03-19 19:51:42,672 42883 manager.py:1458 ERROR   common/system/crashlogs.py:57 (in _find_newest_log_darwin)
2012-03-19 19:51:42,672 42883 manager.py:1459 ERROR     logs = self._filesystem.files_under(log_directory, file_filter=is_crash_log)
2012-03-19 19:51:42,672 42883 manager.py:1458 ERROR   common/system/filesystem.py:111 (in files_under)
2012-03-19 19:51:42,672 42883 manager.py:1459 ERROR     if file_filter(self, dirpath, filename):
2012-03-19 19:51:42,672 42883 manager.py:1458 ERROR   common/system/crashlogs.py:54 (in is_crash_log)
2012-03-19 19:51:42,672 42883 manager.py:1459 ERROR     return basename.startswith(process_name + "_") and basename.endswith(".crash")
Comment 12 Jessie Berlin 2012-03-19 20:12:19 PDT
I think I may have to roll out those patches soon if this isn't addressed.
Comment 13 Jessie Berlin 2012-03-19 20:48:06 PDT
I had to roll out the changes in http://trac.webkit.org/changeset/111343
Comment 14 Maciej Stachowiak 2012-03-20 00:52:34 PDT
Comment on attachment 132732 [details]
Patch

Unflagging patch, since it got rolled out.
Comment 15 Dirk Pranke 2012-03-20 13:56:32 PDT
Thanks for rolling this out, and I'm sorry I wasn't around to debug/fix/roll it out myself!
Comment 16 Dirk Pranke 2012-03-20 15:39:02 PDT
Created attachment 132911 [details]
Patch
Comment 17 Dirk Pranke 2012-03-20 15:39:48 PDT
Comment on attachment 132911 [details]
Patch

The actual fix:


layout_tests $ git diff port/webkit.py
494c494,497
<         return self._server_process.has_crashed()
---
>         if self._server_process.has_crashed():
>             self._crashed_subprocess_name = self._port.driver_name()
>             return True
>         return False
501c504,505
<         elif error_line == "#CRASHED - WebProcess\n":
---
>             return True
>         if error_line == "#CRASHED - WebProcess\n":
503a508
>             return True
Comment 18 Dirk Pranke 2012-03-20 15:40:19 PDT
Committed r111453: <http://trac.webkit.org/changeset/111453>
Comment 19 mitz 2012-03-20 16:52:53 PDT
(In reply to comment #18)
> Committed r111453: <http://trac.webkit.org/changeset/111453>

This change broke Lion Intel Release (WebKit2 Tests), as seen here: <http://build.webkit.org/builders/Lion%20Intel%20Release%20%28WebKit2%20Tests%29/builds/5535/steps/layout-test/logs/stdio/text>.
Comment 20 Dirk Pranke 2012-03-20 17:00:17 PDT
Reverted r111453 for reason:

still broken

Committed r111470: <http://trac.webkit.org/changeset/111470>
Comment 21 Philippe Normand 2012-03-21 02:27:29 PDT
(In reply to comment #20)
> Reverted r111453 for reason:
> 
> still broken
> 
> Committed r111470: <http://trac.webkit.org/changeset/111470>

I still see this in GTK Debug when a crash happens:

2012-03-21 02:10:45,581 1482 manager.py:1424 ERROR worker/1 raised UnicodeDecodeError(''ascii' codec can't decode byte 0xce in position 235: ordinal not in range(128)'):
2012-03-21 02:10:45,582 1482 manager.py:1458 ERROR   layout_tests/controllers/manager_worker_broker.py:280 (in run)
2012-03-21 02:10:45,582 1482 manager.py:1459 ERROR     self._worker_connection.run_message_loop()
...
2012-03-21 02:10:45,584 1482 manager.py:1458 ERROR   layout_tests/controllers/single_test_runner.py:119 (in _run_compare_test)
2012-03-21 02:10:45,584 1482 manager.py:1459 ERROR     test_result_writer.write_test_result(self._filesystem, self._port, self._test_name, driver_output, expected_driver_output, test_result.failures)
2012-03-21 02:10:45,584 1482 manager.py:1458 ERROR   layout_tests/controllers/test_result_writer.py:66 (in write_test_result)
2012-03-21 02:10:45,584 1482 manager.py:1459 ERROR     writer.write_crash_report(crashed_driver_output.crashed_process_name, crashed_driver_output.error)
2012-03-21 02:10:45,584 1482 manager.py:1458 ERROR   layout_tests/controllers/test_result_writer.py:168 (in write_crash_report)
2012-03-21 02:10:45,584 1482 manager.py:1459 ERROR     fs.write_text_file(filename, log if log else error)
2012-03-21 02:10:45,584 1482 manager.py:1458 ERROR   common/system/filesystem.py:223 (in write_text_file)
2012-03-21 02:10:45,584 1482 manager.py:1459 ERROR     f.write(contents)
2012-03-21 02:10:45,584 1482 manager.py:1458 ERROR   /usr/lib/python2.6/codecs.py:691 (in write)
2012-03-21 02:10:45,584 1482 manager.py:1459 ERROR     return self.writer.write(data)
2012-03-21 02:10:45,585 1482 manager.py:1458 ERROR   /usr/lib/python2.6/codecs.py:351 (in write)
2012-03-21 02:10:45,585 1482 manager.py:1459 ERROR     data, consumed = self.encode(object, self.errors)

The error variable needs to be: errors.decode('utf8')
Comment 22 Dirk Pranke 2012-03-21 13:11:57 PDT
(In reply to comment #21)
> The error variable needs to be: errors.decode('utf8')

Okay, will fix with the next attempt :).
Comment 23 Dirk Pranke 2012-03-21 13:22:14 PDT
Created attachment 133102 [details]
fix another crash
Comment 24 Dirk Pranke 2012-03-21 13:23:28 PDT
Actual fix in patch #3:


$ diff --git a/Tools/Scripts/webkitpy/layout_tests/port/webkit.py b/Tools/Scripts/webkitpy/layout_tests/port/webkit.py
index 32a7a25..b998c47 100644
--- a/Tools/Scripts/webkitpy/layout_tests/port/webkit.py
+++ b/Tools/Scripts/webkitpy/layout_tests/port/webkit.py
@@ -588,7 +588,7 @@ class WebKitDriver(Driver):
         block = ContentBlock()
         out_seen_eof = False
 
-        while True:
+        while not self.has_crashed():
             if out_seen_eof and (self.err_seen_eof or not wait_for_stderr_eof):
                 break
$

I need to revamp ServerProcess to be a little more bulletproof, but I'll do that in a separate patch.
Comment 25 Dirk Pranke 2012-03-21 13:29:45 PDT
Committed r111599: <http://trac.webkit.org/changeset/111599>