Bug 63495 - NRWT should wait for ReportCrash
Summary: NRWT should wait for ReportCrash
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Adam Barth
URL:
Keywords:
Depends on: 63513
Blocks:
 
Reported: 2011-06-27 17:08 PDT by Adam Barth
Modified: 2011-06-28 00:30 PDT (History)
CC List: 5 users

See Also:


Attachments
Patch (7.06 KB, patch)
2011-06-27 17:24 PDT, Adam Barth
no flags
Patch (7.07 KB, patch)
2011-06-27 17:27 PDT, Adam Barth
eric: review+

Description Adam Barth 2011-06-27 17:08:04 PDT
NRWT should wait for ReportCrash
Comment 1 Adam Barth 2011-06-27 17:24:21 PDT
Created attachment 98825 [details]
Patch
Comment 2 Adam Barth 2011-06-27 17:27:16 PDT
Created attachment 98826 [details]
Patch
Comment 3 Eric Seidel (no email) 2011-06-27 17:27:35 PDT
Comment on attachment 98825 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=98825&action=review

> Tools/Scripts/webkitpy/layout_tests/port/server_process.py:172
> +            if self._executive.running_pids(self._port.is_crash_reporter):
> +                _log.warning('Waiting for crash reporter...')
> +            self._executive.wait_newest(self._port.is_crash_reporter)

So is this the central controller for all servers?  Or is this a single worker?   What happens if we start nrwt while a CrashReporter process is running?  What if CrashReporter never stops?
Comment 4 Eric Seidel (no email) 2011-06-27 17:28:33 PDT
Comment on attachment 98826 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=98826&action=review

> Tools/Scripts/webkitpy/layout_tests/port/server_process.py:171
> +                _log.warning('Waiting for crash reporter...')

So if I have 24 workers on my machine, I'll get this printed 24 times?
Comment 5 Eric Seidel (no email) 2011-06-27 17:32:51 PDT
Comment on attachment 98826 [details]
Patch

I'm certain you will get complaints about this spamming the console.
Comment 6 Adam Barth 2011-06-27 17:33:31 PDT
Committed r89881: <http://trac.webkit.org/changeset/89881>
Comment 7 Adam Barth 2011-06-27 17:42:51 PDT
These are the answers I gave Eric in person:

> So is this the central controller for all servers?  Or is this a single worker?

It's for a single worker.

> What happens if we start nrwt while a CrashReporter process is running?

NRWT will wait for CrashReporter to stop before doing anything meaningful.

> What if CrashReporter never stops?

Then NRWT will never stop.

> So if I have 24 workers on my machine, I'll get this printed 24 times?

Yes.  I've added the worker name to make it more clear what's going on.  Eric suggested that the master process could print this message, which is probably a good idea, but not a blocking issue.
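
As a standalone illustration only (not the webkitpy code; the helper names, the pgrep call, and the poll interval are assumptions), the per-worker wait amounts to roughly this on a Mac host, where the crash reporter process is named ReportCrash. The actual patch does the equivalent through Executive.running_pids() and wait_newest() with Port.is_crash_reporter as the filter:

import subprocess
import time


def crash_reporter_running():
    # pgrep exits 0 when a matching process exists, non-zero otherwise.
    return subprocess.call(['pgrep', '-x', 'ReportCrash'],
                           stdout=subprocess.DEVNULL) == 0


def wait_for_crash_reporter(poll_interval=0.25):
    # Warn once, then block until ReportCrash has finished writing its
    # crash log and exited.
    if crash_reporter_running():
        print('Waiting for crash reporter...')
    while crash_reporter_running():
        time.sleep(poll_interval)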
Comment 8 Dirk Pranke 2011-06-27 17:46:43 PDT
(In reply to comment #7)
> These are the answers I gave Eric in person:
> 
> > So is this the central controller for all servers?  Or is this a single worker?
> 
> It's for a single worker.
> 
> > What happens if we start nrwt while a CrashReporter process is running?
> 
> NRWT will wait for CrashReporter to stop before doing anything meaningful.
> 
> > What if CrashReporter never stops?
> 
> Then NRWT will never stop.
> 
> > So if I have 24 workers on my machine, I'll get this printed 24 times?
> 
> Yes.  I've added the worker name to make it more clear what's going on.  Eric suggested that the master process could print this message, which is probably a good idea, but not a blocking issue.

The master doesn't know anything about what's going on here, so that might be kinda tricky. I suppose you could post a message from the worker back to the master and the master could throttle the rate it logged messages accordingly. 

But frankly, if you have 24 workers on your machine and they're all crashing at once, you've probably got worse things to worry about, and NRWT will exit shortly due to the max # of crashes being exceeded anyway.
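
To make the throttling idea concrete, here is a hypothetical helper (not existing webkitpy code; the class name, interval, and log_fn parameter are made up) that the manager could route worker notifications through, so a repeated message is printed at most once per interval rather than once per worker:

import time


class ThrottledLogger(object):
    # Hypothetical helper: emit a given message at most once every
    # `interval` seconds, however many workers report it.
    def __init__(self, log_fn, interval=5.0):
        self._log_fn = log_fn
        self._interval = interval
        self._last_logged = {}

    def log(self, message):
        now = time.time()
        if now - self._last_logged.get(message, 0) >= self._interval:
            self._log_fn(message)
            self._last_logged[message] = now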
Comment 9 Dirk Pranke 2011-06-27 17:49:15 PDT
(In reply to comment #7)
> > What if CrashReporter never stops?
> 
> Then NRWT will never stop.
> 

Actually, since this is being done in the worker threads, if this takes too long, the manager will treat this as a wedged thread. Wedging by itself is not fatal, and workers can actually un-wedge, but if every thread wedges and you make no progress at all, NRWT will eventually bail out.
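
A rough sketch of that bail-out condition, with made-up names and a made-up threshold; the real bookkeeping lives in manager.py (see the pointers below):

import collections
import time

# Hypothetical worker record; the real manager tracks richer state.
WorkerState = collections.namedtuple('WorkerState',
                                     ['running_test', 'test_started_at'])

WEDGE_TIMEOUT = 600.0  # seconds; made-up value for illustration


def is_wedged(worker, now):
    return (worker.running_test is not None and
            now - worker.test_started_at > WEDGE_TIMEOUT)


def should_bail_out(workers, tests_remaining, last_progress_at, now=None):
    now = now if now is not None else time.time()
    # A wedged worker by itself is tolerated (it may un-wedge), but if every
    # worker is wedged and nothing has completed for a long time, give up.
    return (tests_remaining > 0 and
            all(is_wedged(w, now) for w in workers) and
            now - last_progress_at > WEDGE_TIMEOUT)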
Comment 10 Adam Barth 2011-06-27 17:50:20 PDT
> NRWT will eventually bail out.

Where is the code that does that?
Comment 11 Dirk Pranke 2011-06-27 17:54:00 PDT
(In reply to comment #10)
> > NRWT will eventually bail out.
> 
> Where is the code that does that?

Look at the _run_tests() and is_done() methods in the Manager class (manager.py:674 and manager.py:1316).
Comment 12 Adam Barth 2011-06-27 17:57:18 PDT
> Look at the _run_tests() and is_done() methods in the Manager class (manager.py:674 and manager.py:1316).

Thanks.  I suspect we might trigger those conditions if lots of tests are crashing.
Comment 13 Dirk Pranke 2011-06-27 18:07:37 PDT
(In reply to comment #12)
> > Look at the _run_tests() and is_done() methods in the Manager class (manager.py:674 and manager.py:1316).
> 
> Thanks.  I suspect we might trigger those conditions if lots of tests are crashing.

Yeah, it's possible, but even if we do, it's really just a slightly accelerated version of the early exit we'd do when we hit the --exit-after-n-crashes-or-timeouts threshold, so I don't know that I'd worry about it too much (although the fewer workers we have running, the less true this becomes ...).

Dunno if you've poked around in this part of the code yet, but in worker.py you have _run_test() on line 127, which calculates the "hang timeout", posts a message to the manager saying that it's starting a test, runs the test and compares the results, and then posts the test completed message.
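
As an illustration only (hypothetical function and parameter names, and a made-up hang-timeout formula; the real code is Worker._run_test() in worker.py), that flow looks roughly like this:

def run_single_test(post_message, run_and_diff, read_crash_log,
                    test_input, timeout_seconds):
    # Hypothetical padding: the manager allows some slack beyond the test's
    # own timeout before it treats the worker as wedged.
    hang_timeout = timeout_seconds * 2.0

    post_message('started_test', test_input, hang_timeout)
    result = run_and_diff(test_input)          # run the test, diff the output
    if result.crashed:
        # Today this potentially slow crash-log read (which now also waits
        # on ReportCrash) happens while the manager still thinks the worker
        # is mid-test.
        result.crash_log = read_crash_log(test_input)
    post_message('finished_test', result)
    return result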

If this really becomes an issue, we have three potential options, in order of increasing complexity and (IMO) decreasing desirability:

1) Adjust the "hang timeout" to be substantially longer. Given that tests don't really wedge these days, this should be fairly harmless.

2) Adjust the code in manager.py to realize you're getting a lot of crashes and be smart about adjusting the timeouts on that side to compensate.

3) Rearrange the code in worker.py so that you post that we've completed the test *before* trying to read the crash log, since we only check for wedged workers if we think they are actively running tests. This would effectively disable the wedge checking for this case, and would mess up the layering of the various methods, but would work in a worst-case scenario.
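
To make option 3 concrete, a hypothetical reordering of the sketch above: the completion message is posted before the potentially slow crash-log read, so the manager never sees the worker as mid-test while it waits on ReportCrash. The crash log then has to be delivered separately, which is the layering cost mentioned in option 3:

def run_single_test_reordered(post_message, run_and_diff, read_crash_log,
                              test_input, timeout_seconds):
    hang_timeout = timeout_seconds * 2.0
    post_message('started_test', test_input, hang_timeout)

    result = run_and_diff(test_input)
    post_message('finished_test', result)      # worker no longer "running"

    if result.crashed:
        # This may block on ReportCrash, but the manager has already been
        # told the test is done, so no wedge is reported.  The crash log now
        # has to travel in a separate, later message.
        post_message('crash_log', test_input, read_crash_log(test_input))
    return result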
Comment 14 Adam Barth 2011-06-27 23:56:22 PDT
This patch was rolled out because it made Qt sloooow.
Comment 15 Csaba Osztrogonác 2011-06-28 00:00:08 PDT
Gábor, Kristóf, could you check why this change made our NRWT bot so slow?
Comment 16 Adam Barth 2011-06-28 00:30:25 PDT
Committed r89899: <http://trac.webkit.org/changeset/89899>