Bug 81603

Summary: webkitpy: clean up actually getting crash logs from DRT/WTR crashes
Product: WebKit Reporter: Dirk Pranke <dpranke>
Component: Tools / TestsAssignee: Dirk Pranke <dpranke>
Status: RESOLVED FIXED    
Severity: Normal CC: aroben, ossy, pnormand
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on:    
Bug Blocks: 71380, 76871    
Attachments:
Description Flags
Patch
none
tweak changelog slightly
none
patch for landing, updated w/ review feedback
none
fix BasePort._get_crash_log() none

Description Dirk Pranke 2012-03-19 17:46:14 PDT
currently the logic is scattered between different files and doesn't actually handle WebProcess crashes properly.
Comment 1 Dirk Pranke 2012-03-19 17:47:42 PDT
Created attachment 132735 [details]
Patch
Comment 2 Dirk Pranke 2012-03-19 17:49:37 PDT
Created attachment 132736 [details]
tweak changelog slightly
Comment 3 Adam Roben (:aroben) 2012-03-20 12:12:04 PDT
Comment on attachment 132736 [details]
tweak changelog slightly

View in context: https://bugs.webkit.org/attachment.cgi?id=132736&action=review

> Tools/ChangeLog:15
> +        Mac will now slow-spin with a timeout waiting for the crash log, rather
> +        than waiting for the ReportCrash process to exit (it appears
> +        that the mac will manage multiple crashes with a single
> +        ReportCrash process, the process waits around longer than
> +        necessary, presumably to avoid thrashing if processes are
> +        repeatedly crashing).

Seems like we could optimize this by just keeping track of all the process name/pid pairs that crashed during the test run, and only trying to retrieve crash logs for them at the very end of the run. Then in the common case ReportCrash would finish its business without us ever having to wait on it. We'd only have to wait for tests that crashed very near the end of the test run.
Comment 4 Dirk Pranke 2012-03-20 13:02:32 PDT
Comment on attachment 132736 [details]
tweak changelog slightly

View in context: https://bugs.webkit.org/attachment.cgi?id=132736&action=review

>> Tools/ChangeLog:15
>> +        repeatedly crashing).
> 
> Seems like we could optimize this by just keeping track of all the process name/pid pairs that crashed during the test run, and only trying to retrieve crash logs for them at the very end of the run. Then in the common case ReportCrash would finish its business without us ever having to wait on it. We'd only have to wait for tests that crashed very near the end of the test run.

Perhaps. I think it's a little conceptually clearer to return the crash as part of the test output itself. In the normal case, most of the waiting happens before we even get notified in the select() that the process has crashed; the only time we need to wait longer is when lots of things are crashing at once.

I'd prefer to leave things as-is until we see a need to do another pass over things.
Comment 5 Adam Barth 2012-03-21 14:29:32 PDT
Comment on attachment 132736 [details]
tweak changelog slightly

View in context: https://bugs.webkit.org/attachment.cgi?id=132736&action=review

Ok.  I kind of like aroben's idea of waiting until the end.  I'm not sure whether that's going to make things go significantly faster in practice, but it might be a good refinement to experiment with.

> Tools/Scripts/webkitpy/layout_tests/port/chromium.py:611
>          return DriverOutput(text, output_image, actual_checksum, audio=audio_bytes,
> -            crash=crash, crashed_process_name=crashed_process_name, test_time=run_time, timeout=timeout, error=error)
> +            crash=crash, crashed_process_name=crashed_process_name, crashed_pid=crashed_pid, crash_log=crash_log,
> +            test_time=run_time, timeout=timeout, error=error)

This line of code is getting kind of out of control!

> Tools/Scripts/webkitpy/layout_tests/port/mac.py:169
> +        CRASH_LOG_WAIT_TIMEOUT_SECS = 5 * int(self.get_option('child_processes'))

Is there a reason CRASH_LOG_WAIT_TIMEOUT_SECS is in all caps?  Does get_option return the actual number of child processes we're using (e.g., if we have memory limitations)?
Comment 6 Dirk Pranke 2012-03-21 14:37:53 PDT
(In reply to comment #5)
> (From update of attachment 132736 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=132736&action=review
> 
> Ok.  I kind of like aroben's idea of waiting until the end.  I'm not sure whether that's going to make things go significantly faster in practice, but it might be a good refinement to experiment with.
> 
> > Tools/Scripts/webkitpy/layout_tests/port/chromium.py:611
> >          return DriverOutput(text, output_image, actual_checksum, audio=audio_bytes,
> > -            crash=crash, crashed_process_name=crashed_process_name, test_time=run_time, timeout=timeout, error=error)
> > +            crash=crash, crashed_process_name=crashed_process_name, crashed_pid=crashed_pid, crash_log=crash_log,
> > +            test_time=run_time, timeout=timeout, error=error)
> 
> This line of code is getting kind of out of control!
> 

Yeah ... I'm not sure that the named arguments are really needed, either.

> > Tools/Scripts/webkitpy/layout_tests/port/mac.py:169
> > +        CRASH_LOG_WAIT_TIMEOUT_SECS = 5 * int(self.get_option('child_processes'))
> 
> Is there a reason CRASH_LOG_WAIT_TIMEOUT_SECS is in all caps?

Hm. No; it was a constant at one point. Will fix.

>  Does get_option return the actual number of child processes we're using (e.g., if we have memory limitations)?

It represents the max number of child processes, so it respects the memory limitations. However, we may have fewer children actually running (either because we didn't need that many, or because the others have already finished). Getting the number actually running would be difficult if not impossible since it depends on what is going on in other processes.
Comment 7 Dirk Pranke 2012-03-21 14:41:32 PDT
(In reply to comment #5)
> (From update of attachment 132736 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=132736&action=review
> 
> Ok.  I kind of like aroben's idea of waiting until the end.  I'm not sure whether that's going to make things go significantly faster in practice, but it might be a good refinement to experiment with.
> 

Another downside to this is that currently *all* of the "handle test failure and put stuff into layout-test-results" is done in the worker/child processes, in parallel. I wouldn't really like to split that logic up, and pushing that into the manager to the end could end up slowing things down (would have to code it to see).

There is a certain appeal to only having the manager process writing into layout-test-results, though.
Comment 8 Dirk Pranke 2012-03-21 15:21:44 PDT
Created attachment 133125 [details]
patch for landing, updated w/ review feedback
Comment 9 Dirk Pranke 2012-03-21 15:24:52 PDT
Committed r111609: <http://trac.webkit.org/changeset/111609>
Comment 10 Csaba Osztrogonác 2012-03-22 01:05:47 PDT
(In reply to comment #9)
> Committed r111609: <http://trac.webkit.org/changeset/111609>

Reopen, because it broke Qt WK2 tester bot:
http://build.webkit.sed.hu/builders/x86-32%20Linux%20Qt%20Release%20WebKit2/builds/21876/steps/layout-test/logs/stdio

Could you check what happened?
Comment 11 Csaba Osztrogonác 2012-03-22 01:21:24 PDT
I rolled it out to unbreak Qt WK2 tester bot - http://trac.webkit.org/changeset/111661 (with accidentally change in r111615 too)
Comment 12 Dirk Pranke 2012-03-22 10:00:28 PDT
(In reply to comment #11)
> I rolled it out to unbreak Qt WK2 tester bot - http://trac.webkit.org/changeset/111661 (with accidentally change in r111615 too)

Argh. This was fixed in r111615 ... it was just the ChangeLog that was wrong :(.

Starting with build 21879 on that bot, though, there appears to be another, different bug. I will look into that one as well.
Comment 13 Philippe Normand 2012-03-22 10:22:09 PDT
(In reply to comment #12)
> (In reply to comment #11)
> > I rolled it out to unbreak Qt WK2 tester bot - http://trac.webkit.org/changeset/111661 (with accidentally change in r111615 too)
> 
> Argh. This was fixed in r111615 ... it was just the ChangeLog that was wrong :(.
> 
> Starting with build 21879 on that bot, though, there appears to be another, different bug. I will look into that one as well.

 '\n'.join('STDERR: ' + line for line in stderr.decode('utf8').splitlines()) if stderr else '<empty>')

should be:


 '\n'.join(['STDERR: '] + line for line in stderr.decode('utf8').splitlines()) if stderr else '<empty>')

I think.
Comment 14 Dirk Pranke 2012-03-22 11:21:12 PDT
(In reply to comment #13)
> (In reply to comment #12)
> > (In reply to comment #11)
> > > I rolled it out to unbreak Qt WK2 tester bot - http://trac.webkit.org/changeset/111661 (with accidentally change in r111615 too)
> > 
> > Argh. This was fixed in r111615 ... it was just the ChangeLog that was wrong :(.
> > 
> > Starting with build 21879 on that bot, though, there appears to be another, different bug. I will look into that one as well.
> 
>  '\n'.join('STDERR: ' + line for line in stderr.decode('utf8').splitlines()) if stderr else '<empty>')
> 
> should be:
> 
> 
>  '\n'.join(['STDERR: '] + line for line in stderr.decode('utf8').splitlines()) if stderr else '<empty>')
> 
> I think.

No, that part's correct, as it is prefixing each line with STDERR: (although adding some parentheses will make that clearer).

It's actually complaining about the %d getting handed a string two lines above. Clearly I need a unit test for this ...
Comment 15 Dirk Pranke 2012-03-22 11:43:19 PDT
Created attachment 133304 [details]
fix BasePort._get_crash_log()
Comment 16 Dirk Pranke 2012-03-22 11:44:38 PDT
Committed r111736: <http://trac.webkit.org/changeset/111736>