Bug 63495 - NRWT should wait for ReportCrash
Summary: NRWT should wait for ReportCrash
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Adam Barth
URL:
Keywords:
Depends on: 63513
Blocks:
 
Reported: 2011-06-27 17:08 PDT by Adam Barth
Modified: 2011-06-28 00:30 PDT (History)
CC List: 5 users

See Also:


Attachments
Patch (7.06 KB, patch)
2011-06-27 17:24 PDT, Adam Barth
no flags
Patch (7.07 KB, patch)
2011-06-27 17:27 PDT, Adam Barth
eric: review+

Description Adam Barth 2011-06-27 17:08:04 PDT
NRWT should wait for ReportCrash
Comment 1 Adam Barth 2011-06-27 17:24:21 PDT
Created attachment 98825 [details]
Patch
Comment 2 Adam Barth 2011-06-27 17:27:16 PDT
Created attachment 98826 [details]
Patch
Comment 3 Eric Seidel (no email) 2011-06-27 17:27:35 PDT
Comment on attachment 98825 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=98825&action=review

> Tools/Scripts/webkitpy/layout_tests/port/server_process.py:172
> +            if self._executive.running_pids(self._port.is_crash_reporter):
> +                _log.warning('Waiting for crash reporter...')
> +            self._executive.wait_newest(self._port.is_crash_reporter)

So is this the central controller for all servers?  Or is this a single worker?   What happens if we start nrwt while a CrashReporter process is running?  What if CrashReporter never stops?
Comment 4 Eric Seidel (no email) 2011-06-27 17:28:33 PDT
Comment on attachment 98826 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=98826&action=review

> Tools/Scripts/webkitpy/layout_tests/port/server_process.py:171
> +                _log.warning('Waiting for crash reporter...')

So if I have 24 workers on my machine, I'll get this printed 24 times?
Comment 5 Eric Seidel (no email) 2011-06-27 17:32:51 PDT
Comment on attachment 98826 [details]
Patch

I'm certain you will get complaints about this spamming the console.
Comment 6 Adam Barth 2011-06-27 17:33:31 PDT
Committed r89881: <http://trac.webkit.org/changeset/89881>
Comment 7 Adam Barth 2011-06-27 17:42:51 PDT
These are the answers I gave Eric in person:

> So is this the central controller for all servers?  Or is this a single worker?

It's for a single worker.

> What happens if we start nrwt while a CrashReporter process is running?

NRWT will wait for CrashReporter to stop before doing anything meaningful.

> What if CrashReporter never stops?

Then NRWT will never stop.

> So if I have 24 workers on my machine, I'll get this printed 24 times?

Yes.  I've added the worker name to make it more clear what's going on.  Eric suggested that the master process could print this message, which is probably a good idea, but not a blocking issue.
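
As a standalone illustration only (not the webkitpy code; the helper names, the pgrep call, and the poll interval are assumptions), the per-worker wait amounts to roughly this on a Mac host, where the crash reporter process is named ReportCrash. The actual patch does the equivalent through Executive.running_pids() and wait_newest() with Port.is_crash_reporter as the filter:

import subprocess
import time


def crash_reporter_running():
    # pgrep exits 0 when a matching process exists, non-zero otherwise.
    return subprocess.call(['pgrep', '-x', 'ReportCrash'],
                           stdout=subprocess.DEVNULL) == 0


def wait_for_crash_reporter(poll_interval=0.25):
    # Warn once, then block until ReportCrash has finished writing its
    # crash log and exited.
    if crash_reporter_running():
        print('Waiting for crash reporter...')
    while crash_reporter_running():
        time.sleep(poll_interval)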
Comment 8 Dirk Pranke 2011-06-27 17:46:43 PDT
(In reply to comment #7)
> These are the answers I gave Eric in person:
> 
> > So is this the central controller for all servers?  Or is this a single worker?
> 
> It's for a single worker.
> 
> > What happens if we start nrwt while a CrashReporter process is running?
> 
> NRWT will wait for CrashReporter to stop before doing anything meaningful.
> 
> > What if CrashReporter never stops?
> 
> Then NRWT will never stop.
> 
> > So if I have 24 workers on my machine, I'll get this printed 24 times?
> 
> Yes.  I've added the worker name to make it more clear what's going on.  Eric suggested that the master process could print this message, which is probably a good idea, but not a blocking issue.

The master doesn't know anything about what's going on here, so that might be kinda tricky. I suppose you could post a message from the worker back to the master and the master could throttle the rate it logged messages accordingly. 

But frankly, if you have 24 workers on your machine and they're all crashing at once, you've probably got worse things to worry about, and NRWT will exit shortly due to the max # of crashes being exceeded anyway.
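
To make the throttling idea concrete, here is a hypothetical helper (not existing webkitpy code; the class name, interval, and log_fn parameter are made up) that the manager could route worker notifications through, so a repeated message is printed at most once per interval rather than once per worker:

import time


class ThrottledLogger(object):
    # Hypothetical helper: emit a given message at most once every
    # `interval` seconds, however many workers report it.
    def __init__(self, log_fn, interval=5.0):
        self._log_fn = log_fn
        self._interval = interval
        self._last_logged = {}

    def log(self, message):
        now = time.time()
        if now - self._last_logged.get(message, 0) >= self._interval:
            self._log_fn(message)
            self._last_logged[message] = now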
Comment 9 Dirk Pranke 2011-06-27 17:49:15 PDT
(In reply to comment #7)
> > What if CrashReporter never stops?
> 
> Then NRWT will never stop.
> 

Actually, since this is being done in the worker threads, if this takes too long, the manager will treat this as a wedged thread. Wedging by itself is not fatal, and workers can actually un-wedge, but if every thread wedges and you make no progress at all, NRWT will eventually bail out.
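
A rough sketch of that bail-out condition, with made-up names and a made-up threshold; the real bookkeeping lives in manager.py (see the pointers below):

import collections
import time

# Hypothetical worker record; the real manager tracks richer state.
WorkerState = collections.namedtuple('WorkerState',
                                     ['running_test', 'test_started_at'])

WEDGE_TIMEOUT = 600.0  # seconds; made-up value for illustration


def is_wedged(worker, now):
    return (worker.running_test is not None and
            now - worker.test_started_at > WEDGE_TIMEOUT)


def should_bail_out(workers, tests_remaining, last_progress_at, now=None):
    now = now if now is not None else time.time()
    # A wedged worker by itself is tolerated (it may un-wedge), but if every
    # worker is wedged and nothing has completed for a long time, give up.
    return (tests_remaining > 0 and
            all(is_wedged(w, now) for w in workers) and
            now - last_progress_at > WEDGE_TIMEOUT)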
Comment 10 Adam Barth 2011-06-27 17:50:20 PDT
> NRWT will eventually bail out.

Where is the code that does that?
Comment 11 Dirk Pranke 2011-06-27 17:54:00 PDT
(In reply to comment #10)
> > NRWT will eventually bail out.
> 
> Where is the code that does that?

Look at the _run_tests() and is_done() methods in the Manager class (manager.py:674 and manager.py:1316).
Comment 12 Adam Barth 2011-06-27 17:57:18 PDT
> Look at the _run_tests() and is_done() methods in the Manager class (manager.py:674 and manager.py:1316).

Thanks.  I suspect we might trigger those conditions if lots of tests are crashing.
Comment 13 Dirk Pranke 2011-06-27 18:07:37 PDT
(In reply to comment #12)
> > Look at the _run_tests() and is_done() methods in the Manager class (manager.py:674 and manager.py:1316).
> 
> Thanks.  I suspect we might trigger those conditions if lots of tests are crashing.

Yeah, it's possible, but even if we do, it's really just a slightly accelerated version of the early exit we'd do when we hit the --exit-after-n-crashes-or-timeouts threshold, so I don't know that I'd worry about it too much (although the fewer workers we have running, the less true this becomes ...).

Dunno if you've poked around in this part of the code yet, but in worker.py you have _run_test() on line 127, which calculates the "hang timeout", posts a message to the manager saying that it's starting a test, runs the test and compares the results, and then posts the test completed message.
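
As an illustration only (hypothetical function and parameter names, and a made-up hang-timeout formula; the real code is Worker._run_test() in worker.py), that flow looks roughly like this:

def run_single_test(post_message, run_and_diff, read_crash_log,
                    test_input, timeout_seconds):
    # Hypothetical padding: the manager allows some slack beyond the test's
    # own timeout before it treats the worker as wedged.
    hang_timeout = timeout_seconds * 2.0

    post_message('started_test', test_input, hang_timeout)
    result = run_and_diff(test_input)          # run the test, diff the output
    if result.crashed:
        # Today this potentially slow crash-log read (which now also waits
        # on ReportCrash) happens while the manager still thinks the worker
        # is mid-test.
        result.crash_log = read_crash_log(test_input)
    post_message('finished_test', result)
    return result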

If this really becomes an issue, we have three potential options, in order of increasing complexity and (IMO) decreasing desirability:

1) Adjust the "hang timeout" to be substantially longer. Given that tests don't really wedge these days, this should be fairly harmless.

2) Adjust the code in manager.py to realize you're getting a lot of crashes and be smart about adjusting the timeouts on that side to compensate.

3) Rearrange the code in worker.py so that you post that we've completed the test *before* trying to read the crash log, since we only check for wedged workers if we think they are actively running tests. This would effectively disable the wedge checking for this case, and would mess up the layering of the various methods, but would work in a worst-case scenario.
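
To make option 3 concrete, a hypothetical reordering of the sketch above: the completion message is posted before the potentially slow crash-log read, so the manager never sees the worker as mid-test while it waits on ReportCrash. The crash log then has to be delivered separately, which is the layering cost mentioned in option 3:

def run_single_test_reordered(post_message, run_and_diff, read_crash_log,
                              test_input, timeout_seconds):
    hang_timeout = timeout_seconds * 2.0
    post_message('started_test', test_input, hang_timeout)

    result = run_and_diff(test_input)
    post_message('finished_test', result)      # worker no longer "running"

    if result.crashed:
        # This may block on ReportCrash, but the manager has already been
        # told the test is done, so no wedge is reported.  The crash log now
        # has to travel in a separate, later message.
        post_message('crash_log', test_input, read_crash_log(test_input))
    return result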
Comment 14 Adam Barth 2011-06-27 23:56:22 PDT
This patch was rolled out because it made Qt sloooow.
Comment 15 Csaba Osztrogonác 2011-06-28 00:00:08 PDT
Gábor, Kristóf, could you check why this change made our NRWT bot so slow?
Comment 16 Adam Barth 2011-06-28 00:30:25 PDT
Committed r89899: <http://trac.webkit.org/changeset/89899>