Bug 210202

Summary: Buildbot: Force crash log submission after each test run
Product: WebKit Reporter: Jonathan Bedard <jbedard>
Component: Tools / TestsAssignee: Jonathan Bedard <jbedard>
Status: RESOLVED FIXED    
Severity: Normal CC: aakash_jain, ap, ddkilzer, ews-watchlist
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=210378
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch for landing none

Description Jonathan Bedard 2020-04-08 11:15:16 PDT
We need to force crash log submission after each test run because only 20 unsubmitted crash logs are cached at a time.
Comment 1 Jonathan Bedard 2020-04-08 11:15:30 PDT
<rdar://problem/60508929>
Comment 2 Jonathan Bedard 2020-04-08 11:18:54 PDT
Created attachment 395841 [details]
Patch
Comment 3 Jonathan Bedard 2020-04-08 12:13:46 PDT
Created attachment 395851 [details]
Patch
Comment 4 Alexey Proskuryakov 2020-04-08 13:04:11 PDT
Comment on attachment 395851 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395851&action=review

It may be worth checking if there are any unsubmitted crashes, to avoid the need to spend the time when there is nothing to do. Although that would make us depend even more on undocumented and unsupoprted behaviors.

> Tools/ChangeLog:3
> +        Buildbot: Force crash log submission after each test run (Part 1)

This patch doesn't change anything for Buildbot yet. I see that it's part one, but it can have its own name.

> Tools/ChangeLog:9
> +        * BuildSlaveSupport/report-crashes: Added.

I think that this would benefit from a more descriptive name.

> Tools/BuildSlaveSupport/report-crashes:64
> +    if subprocess.call(['/usr/bin/killall', '-9', 'diagnostics_agent']):

Worth mentioning in a comment that this is a workaround for rdar://problem/60507877.

> Tools/BuildSlaveSupport/report-crashes:66
> +        return 1

I think that you are going to separate triggering and waiting for completion, in which case the early return won't be needed.

> Tools/BuildSlaveSupport/report-crashes:73
> +    print('Requested crashreporter submission, waiting on process {} to queise'.format(pid))

Typo: quiesce
Comment 5 Jonathan Bedard 2020-04-08 13:49:21 PDT
Created attachment 395862 [details]
Patch
Comment 6 Jonathan Bedard 2020-04-08 14:17:02 PDT
Created attachment 395868 [details]
Patch
Comment 7 Alexey Proskuryakov 2020-04-08 15:50:40 PDT
Comment on attachment 395868 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395868&action=review

> Tools/ChangeLog:18
> +        * BuildSlaveSupport/trigger-crash-collection: Added.
> +        * BuildSlaveSupport/wait-for-crash-collection: Added.

Maybe trigger-crash-log-submission? "Crash collection" doesn't parse without context.

> Tools/BuildSlaveSupport/trigger-crash-collection:35
> +    # Work around for <rdar://problem/60507877>

As a noun, "workaround" doesn't have a space. Please also add a period at the end.

> Tools/BuildSlaveSupport/trigger-crash-collection:37
> +        print('Failed to kill diagnostic agent')

Please verify that this process is named like this in all supported macOS versions.

Nit: Better to have the exact name in the log, diagnostics_agent.

> Tools/BuildSlaveSupport/trigger-crash-collection:39
> +    print('Killed diagnostic agent')

Better to have the exact name in the log.

> Tools/BuildSlaveSupport/trigger-crash-collection:43
> +        print('Failed to request crashreporter submission')

Nit: using different words in logs and in tool name suggests that one of those is not ideal. Is it "request" or "trigger"?

> Tools/BuildSlaveSupport/wait-for-crash-collection:29
> +SUBMIT_DIAG_INFO = '/System/Library/CoreServices/SubmitDiagInfo'

Have you confirmed that the name is the same on all supported macOS versions?

> Tools/BuildSlaveSupport/wait-for-crash-collection:33
> +    process = subprocess.Popen(['/bin/ps', '-eo', 'pid,comm'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

Surprised that there isn't a way for this in Python that doesn't involve launching a separate tool.

> Tools/BuildSlaveSupport/wait-for-crash-collection:51
> +def main():

I think that this should have a timeout, either built in, or an argument. It's better to run tests anyway if SubmitDiagInfo doesn't quit.

> Tools/BuildSlaveSupport/wait-for-crash-collection:59
> +    if not pid:
> +        print('Failed to find {}'.format(SUBMIT_DIAG_INFO))
> +        return 1

This seems like a success case, not failure.

> Tools/BuildSlaveSupport/wait-for-crash-collection:61
> +    print('Waiting on process {} to quiesce'.format(pid))

"on" or "for"?

> Tools/BuildSlaveSupport/wait-for-crash-collection:64
> +    while cpu_percentage(pid) > 5:
> +        pass

Should we sleep for a bit here, to not spin a whole CPU core.

> Tools/BuildSlaveSupport/build.webkit.org-config/steps.py:152
> +    description = ["waiting for crash collectio to quiesce"]

Typo: collectio
Comment 8 Jonathan Bedard 2020-04-08 16:17:33 PDT
Comment on attachment 395868 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395868&action=review

>> Tools/BuildSlaveSupport/wait-for-crash-collection:33
>> +    process = subprocess.Popen(['/bin/ps', '-eo', 'pid,comm'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> 
> Surprised that there isn't a way for this in Python that doesn't involve launching a separate tool.

This is the way webkitpy does it.

I suspect there is way, but I didn't look much further when I saw what webkitpy did.

>> Tools/BuildSlaveSupport/wait-for-crash-collection:51
>> +def main():
> 
> I think that this should have a timeout, either built in, or an argument. It's better to run tests anyway if SubmitDiagInfo doesn't quit.

I can add one to be explicit, I was relying on Buildbot's built in timeout when a script prints nothing for too long.
Comment 9 Alexey Proskuryakov 2020-04-08 16:25:00 PDT
> I can add one to be explicit, I was relying on Buildbot's built in timeout when a script prints nothing for too long.

That would abort steps by default, right? I think that we should ignore and continue, after 5 minutes or so.
Comment 10 Jonathan Bedard 2020-04-08 17:21:41 PDT
(In reply to Alexey Proskuryakov from comment #9)
> > I can add one to be explicit, I was relying on Buildbot's built in timeout when a script prints nothing for too long.
> 
> That would abort steps by default, right? I think that we should ignore and
> continue, after 5 minutes or so.

Ok, making that change now
Comment 11 Alexey Proskuryakov 2020-04-08 18:53:29 PDT
Comment on attachment 395868 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395868&action=review

> Tools/BuildSlaveSupport/build.webkit.org-config/factories.py:87
> +        if platform.startswith('mac') or platform.startswith('ios-simulator'):

Oh, also needed for EWS.
Comment 12 Jonathan Bedard 2020-04-09 08:26:34 PDT
Created attachment 395955 [details]
Patch
Comment 13 Jonathan Bedard 2020-04-09 08:37:18 PDT
Created attachment 395958 [details]
Patch
Comment 14 Jonathan Bedard 2020-04-09 08:46:37 PDT
Created attachment 395959 [details]
Patch
Comment 15 Aakash Jain 2020-04-09 09:20:35 PDT
Comment on attachment 395959 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395959&action=review

Let's use single-quotes instead of double-quotes through-out the patch.

> Tools/BuildSlaveSupport/build.webkit.org-config/steps.py:144
> +    name = "trigger crash log submission"

please remove spaces from the step name.

> Tools/BuildSlaveSupport/build.webkit.org-config/steps.py:151
> +    name = "wait for crash collection"

please remove spaces from the step name.

> Tools/BuildSlaveSupport/ews-build/factories.py:121
> +            self.addStep(WaitForCrashCollection())

Might be a good idea to add a comment here as well mentioning that this is a temporary workaround.

> Tools/BuildSlaveSupport/ews-build/factories_unittest.py:223
> +            _BuildStepFactory(steps.TriggerCrashLogSubmission),

SetBuildSummary will finish the build immediately, so it has to be the last step. Please have TriggerCrashLogSubmission executed before it.

> Tools/BuildSlaveSupport/ews-build/steps.py:1637
> +    name = "trigger crash log submission"

please remove spaces from the step name.

> Tools/BuildSlaveSupport/ews-build/steps.py:1652
> +    name = "wait for crash collection"

please remove spaces from the step name.

> Tools/BuildSlaveSupport/ews-build/steps.py:1658
> +        super(WaitForCrashCollection, self).__init__(timeout=360, logEnviron=False, **kwargs)

let's explicitly mention timeout as timeout=6 * 60

> Tools/BuildSlaveSupport/ews-build/steps.py:1662
> +            return {u'step': u'Crash log collection failed to quiesce'}

Please replace quiesce with an easy to understand word, so that people like me don't have to open the dictionary to understand the error message.
Comment 16 Jonathan Bedard 2020-04-09 09:45:19 PDT
Comment on attachment 395959 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395959&action=review

>> Tools/BuildSlaveSupport/ews-build/factories.py:121
>> +            self.addStep(WaitForCrashCollection())
> 
> Might be a good idea to add a comment here as well mentioning that this is a temporary workaround.

It is not a temporary workaround. The only temporary part is the killall. We're always going to need to trigger crash log submission and wait for collection to quiesce.

>> Tools/BuildSlaveSupport/ews-build/steps.py:1662
>> +            return {u'step': u'Crash log collection failed to quiesce'}
> 
> Please replace quiesce with an easy to understand word, so that people like me don't have to open the dictionary to understand the error message.

"quiesce" is the term we use in a few other places for "wait until a process stops doing work", namely in performance infrastructure. I don't know a better term in general, although this line could be 'Crash log collection process still running'. Is your complaint about this line in particular or the use of 'quiesce' throughout the patch
Comment 17 Jonathan Bedard 2020-04-09 10:11:04 PDT
Created attachment 395969 [details]
Patch
Comment 18 Aakash Jain 2020-04-09 10:49:55 PDT
Comment on attachment 395969 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395969&action=review

> Tools/ChangeLog:31
> +        * BuildSlaveSupport/trigger-crash-collection: Added.

Are these scripts well-tested?
Maybe we should consider enabling it in one of the CI first (either EWS or build.webkit.org) and make sure it works fine before enabling them in other one.

> Tools/BuildSlaveSupport/trigger-crash-log-submission:41
> +    time.sleep(3)

probably '3' should be a variable defined in beginning of this file.

> Tools/BuildSlaveSupport/trigger-crash-log-submission:42
> +    if subprocess.call(['/usr/bin/notifyutil', '-p', 'com.apple.crashreporter.debug.submit.now']):

Does this return immediately and crash log submission keeps happening in background?
If so, what if the bot immediately reboot after build finishes?

> Tools/BuildSlaveSupport/wait-for-crash-collection:48
> +def cpu_percentage(pid):

what if the pid finished just before this function was called. top output wouldn't mention any percentage in that case, and float conversion would raise an exception. We should probably have exception handling here

> Tools/BuildSlaveSupport/wait-for-crash-collection:78
> +        time.sleep(3)

probably '3' should be a variable defined in beginning of this file.

> Tools/BuildSlaveSupport/ews-build/steps.py:1647
> +            return {u'step': u'Failed to kill trigger crash log submission'}

kill? typo?
Comment 19 Jonathan Bedard 2020-04-09 11:17:57 PDT
Comment on attachment 395969 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=395969&action=review

>> Tools/ChangeLog:31
>> +        * BuildSlaveSupport/trigger-crash-collection: Added.
> 
> Are these scripts well-tested?
> Maybe we should consider enabling it in one of the CI first (either EWS or build.webkit.org) and make sure it works fine before enabling them in other one.

I've tested them pretty extensively.

I think we land the single patch and do one restart first, then the other. There aren't very many ways these scripts can fail, actually. Especially because we have a timeout on wait-for-crash-collection.

>> Tools/BuildSlaveSupport/trigger-crash-log-submission:42
>> +    if subprocess.call(['/usr/bin/notifyutil', '-p', 'com.apple.crashreporter.debug.submit.now']):
> 
> Does this return immediately and crash log submission keeps happening in background?
> If so, what if the bot immediately reboot after build finishes?

Yes.

Not something I've considered, nor something I think we should worry about. Either crash submission simply won't happen, or it will happen after reboot. Doesn't matter much either way. Failing to trigger crash submission is basically the world we live in now, which we've been able to live with for years.
Comment 20 Jonathan Bedard 2020-04-09 11:26:01 PDT
Created attachment 395980 [details]
Patch
Comment 21 Jonathan Bedard 2020-04-09 11:37:33 PDT
Created attachment 395984 [details]
Patch for landing
Comment 22 EWS 2020-04-09 12:00:42 PDT
Committed r259811: <https://trac.webkit.org/changeset/259811>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 395984 [details].
Comment 23 Aakash Jain 2020-04-11 05:02:11 PDT
Deployed this change on EWS server.
Comment 24 Aakash Jain 2020-04-11 05:07:00 PDT
Comment on attachment 395984 [details]
Patch for landing

View in context: https://bugs.webkit.org/attachment.cgi?id=395984&action=review

> Tools/BuildSlaveSupport/wait-for-crash-collection:73
> +        print('Failed to find {}'.format(SUBMIT_DIAG_INFO))

This results in a misleading error message: 'Failed to find /System/Library/CoreServices/SubmitDiagInfo'. e.g.: https://ews-build.webkit.org/#/builders/31/builds/7348/steps/8/logs/stdio

Should be something like: 'Failed to find running process for {}'
Comment 25 Aakash Jain 2020-04-11 05:08:03 PDT
> This results in a misleading error message
Fixed in https://trac.webkit.org/changeset/259926/webkit
Comment 26 Aakash Jain 2020-04-11 05:37:46 PDT
Follow-up fix to remove spaces from step name: https://trac.webkit.org/changeset/259927/webkit