RESOLVED FIXED 210202
Buildbot: Force crash log submission after each test run
https://bugs.webkit.org/show_bug.cgi?id=210202
Summary Buildbot: Force crash log submission after each test run
Jonathan Bedard
Reported 2020-04-08 11:15:16 PDT
We need to force crash log submission after each test run because only 20 unsubmitted crash logs are cached at a time.
Attachments
Patch (4.36 KB, patch)
2020-04-08 11:18 PDT, Jonathan Bedard
no flags
Patch (4.38 KB, patch)
2020-04-08 12:13 PDT, Jonathan Bedard
no flags
Patch (11.25 KB, patch)
2020-04-08 13:49 PDT, Jonathan Bedard
no flags
Patch (25.48 KB, patch)
2020-04-08 14:17 PDT, Jonathan Bedard
no flags
Patch (32.43 KB, patch)
2020-04-09 08:26 PDT, Jonathan Bedard
no flags
Patch (34.27 KB, patch)
2020-04-09 08:37 PDT, Jonathan Bedard
no flags
Patch (34.42 KB, patch)
2020-04-09 08:46 PDT, Jonathan Bedard
no flags
Patch (34.37 KB, patch)
2020-04-09 10:11 PDT, Jonathan Bedard
no flags
Patch (34.48 KB, patch)
2020-04-09 11:26 PDT, Jonathan Bedard
no flags
Patch for landing (34.49 KB, patch)
2020-04-09 11:37 PDT, Jonathan Bedard
no flags
Jonathan Bedard
Comment 1 2020-04-08 11:15:30 PDT
Jonathan Bedard
Comment 2 2020-04-08 11:18:54 PDT
Jonathan Bedard
Comment 3 2020-04-08 12:13:46 PDT
Alexey Proskuryakov
Comment 4 2020-04-08 13:04:11 PDT
Comment on attachment 395851 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=395851&action=review It may be worth checking if there are any unsubmitted crashes, to avoid the need to spend the time when there is nothing to do. Although that would make us depend even more on undocumented and unsupoprted behaviors. > Tools/ChangeLog:3 > + Buildbot: Force crash log submission after each test run (Part 1) This patch doesn't change anything for Buildbot yet. I see that it's part one, but it can have its own name. > Tools/ChangeLog:9 > + * BuildSlaveSupport/report-crashes: Added. I think that this would benefit from a more descriptive name. > Tools/BuildSlaveSupport/report-crashes:64 > + if subprocess.call(['/usr/bin/killall', '-9', 'diagnostics_agent']): Worth mentioning in a comment that this is a workaround for rdar://problem/60507877. > Tools/BuildSlaveSupport/report-crashes:66 > + return 1 I think that you are going to separate triggering and waiting for completion, in which case the early return won't be needed. > Tools/BuildSlaveSupport/report-crashes:73 > + print('Requested crashreporter submission, waiting on process {} to queise'.format(pid)) Typo: quiesce
Jonathan Bedard
Comment 5 2020-04-08 13:49:21 PDT
Jonathan Bedard
Comment 6 2020-04-08 14:17:02 PDT
Alexey Proskuryakov
Comment 7 2020-04-08 15:50:40 PDT
Comment on attachment 395868 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=395868&action=review > Tools/ChangeLog:18 > + * BuildSlaveSupport/trigger-crash-collection: Added. > + * BuildSlaveSupport/wait-for-crash-collection: Added. Maybe trigger-crash-log-submission? "Crash collection" doesn't parse without context. > Tools/BuildSlaveSupport/trigger-crash-collection:35 > + # Work around for <rdar://problem/60507877> As a noun, "workaround" doesn't have a space. Please also add a period at the end. > Tools/BuildSlaveSupport/trigger-crash-collection:37 > + print('Failed to kill diagnostic agent') Please verify that this process is named like this in all supported macOS versions. Nit: Better to have the exact name in the log, diagnostics_agent. > Tools/BuildSlaveSupport/trigger-crash-collection:39 > + print('Killed diagnostic agent') Better to have the exact name in the log. > Tools/BuildSlaveSupport/trigger-crash-collection:43 > + print('Failed to request crashreporter submission') Nit: using different words in logs and in tool name suggests that one of those is not ideal. Is it "request" or "trigger"? > Tools/BuildSlaveSupport/wait-for-crash-collection:29 > +SUBMIT_DIAG_INFO = '/System/Library/CoreServices/SubmitDiagInfo' Have you confirmed that the name is the same on all supported macOS versions? > Tools/BuildSlaveSupport/wait-for-crash-collection:33 > + process = subprocess.Popen(['/bin/ps', '-eo', 'pid,comm'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) Surprised that there isn't a way for this in Python that doesn't involve launching a separate tool. > Tools/BuildSlaveSupport/wait-for-crash-collection:51 > +def main(): I think that this should have a timeout, either built in, or an argument. It's better to run tests anyway if SubmitDiagInfo doesn't quit. > Tools/BuildSlaveSupport/wait-for-crash-collection:59 > + if not pid: > + print('Failed to find {}'.format(SUBMIT_DIAG_INFO)) > + return 1 This seems like a success case, not failure. > Tools/BuildSlaveSupport/wait-for-crash-collection:61 > + print('Waiting on process {} to quiesce'.format(pid)) "on" or "for"? > Tools/BuildSlaveSupport/wait-for-crash-collection:64 > + while cpu_percentage(pid) > 5: > + pass Should we sleep for a bit here, to not spin a whole CPU core. > Tools/BuildSlaveSupport/build.webkit.org-config/steps.py:152 > + description = ["waiting for crash collectio to quiesce"] Typo: collectio
Jonathan Bedard
Comment 8 2020-04-08 16:17:33 PDT
Comment on attachment 395868 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=395868&action=review >> Tools/BuildSlaveSupport/wait-for-crash-collection:33 >> + process = subprocess.Popen(['/bin/ps', '-eo', 'pid,comm'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) > > Surprised that there isn't a way for this in Python that doesn't involve launching a separate tool. This is the way webkitpy does it. I suspect there is way, but I didn't look much further when I saw what webkitpy did. >> Tools/BuildSlaveSupport/wait-for-crash-collection:51 >> +def main(): > > I think that this should have a timeout, either built in, or an argument. It's better to run tests anyway if SubmitDiagInfo doesn't quit. I can add one to be explicit, I was relying on Buildbot's built in timeout when a script prints nothing for too long.
Alexey Proskuryakov
Comment 9 2020-04-08 16:25:00 PDT
> I can add one to be explicit, I was relying on Buildbot's built in timeout when a script prints nothing for too long. That would abort steps by default, right? I think that we should ignore and continue, after 5 minutes or so.
Jonathan Bedard
Comment 10 2020-04-08 17:21:41 PDT
(In reply to Alexey Proskuryakov from comment #9) > > I can add one to be explicit, I was relying on Buildbot's built in timeout when a script prints nothing for too long. > > That would abort steps by default, right? I think that we should ignore and > continue, after 5 minutes or so. Ok, making that change now
Alexey Proskuryakov
Comment 11 2020-04-08 18:53:29 PDT
Comment on attachment 395868 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=395868&action=review > Tools/BuildSlaveSupport/build.webkit.org-config/factories.py:87 > + if platform.startswith('mac') or platform.startswith('ios-simulator'): Oh, also needed for EWS.
Jonathan Bedard
Comment 12 2020-04-09 08:26:34 PDT
Jonathan Bedard
Comment 13 2020-04-09 08:37:18 PDT
Jonathan Bedard
Comment 14 2020-04-09 08:46:37 PDT
Aakash Jain
Comment 15 2020-04-09 09:20:35 PDT
Comment on attachment 395959 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=395959&action=review Let's use single-quotes instead of double-quotes through-out the patch. > Tools/BuildSlaveSupport/build.webkit.org-config/steps.py:144 > + name = "trigger crash log submission" please remove spaces from the step name. > Tools/BuildSlaveSupport/build.webkit.org-config/steps.py:151 > + name = "wait for crash collection" please remove spaces from the step name. > Tools/BuildSlaveSupport/ews-build/factories.py:121 > + self.addStep(WaitForCrashCollection()) Might be a good idea to add a comment here as well mentioning that this is a temporary workaround. > Tools/BuildSlaveSupport/ews-build/factories_unittest.py:223 > + _BuildStepFactory(steps.TriggerCrashLogSubmission), SetBuildSummary will finish the build immediately, so it has to be the last step. Please have TriggerCrashLogSubmission executed before it. > Tools/BuildSlaveSupport/ews-build/steps.py:1637 > + name = "trigger crash log submission" please remove spaces from the step name. > Tools/BuildSlaveSupport/ews-build/steps.py:1652 > + name = "wait for crash collection" please remove spaces from the step name. > Tools/BuildSlaveSupport/ews-build/steps.py:1658 > + super(WaitForCrashCollection, self).__init__(timeout=360, logEnviron=False, **kwargs) let's explicitly mention timeout as timeout=6 * 60 > Tools/BuildSlaveSupport/ews-build/steps.py:1662 > + return {u'step': u'Crash log collection failed to quiesce'} Please replace quiesce with an easy to understand word, so that people like me don't have to open the dictionary to understand the error message.
Jonathan Bedard
Comment 16 2020-04-09 09:45:19 PDT
Comment on attachment 395959 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=395959&action=review >> Tools/BuildSlaveSupport/ews-build/factories.py:121 >> + self.addStep(WaitForCrashCollection()) > > Might be a good idea to add a comment here as well mentioning that this is a temporary workaround. It is not a temporary workaround. The only temporary part is the killall. We're always going to need to trigger crash log submission and wait for collection to quiesce. >> Tools/BuildSlaveSupport/ews-build/steps.py:1662 >> + return {u'step': u'Crash log collection failed to quiesce'} > > Please replace quiesce with an easy to understand word, so that people like me don't have to open the dictionary to understand the error message. "quiesce" is the term we use in a few other places for "wait until a process stops doing work", namely in performance infrastructure. I don't know a better term in general, although this line could be 'Crash log collection process still running'. Is your complaint about this line in particular or the use of 'quiesce' throughout the patch
Jonathan Bedard
Comment 17 2020-04-09 10:11:04 PDT
Aakash Jain
Comment 18 2020-04-09 10:49:55 PDT
Comment on attachment 395969 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=395969&action=review > Tools/ChangeLog:31 > + * BuildSlaveSupport/trigger-crash-collection: Added. Are these scripts well-tested? Maybe we should consider enabling it in one of the CI first (either EWS or build.webkit.org) and make sure it works fine before enabling them in other one. > Tools/BuildSlaveSupport/trigger-crash-log-submission:41 > + time.sleep(3) probably '3' should be a variable defined in beginning of this file. > Tools/BuildSlaveSupport/trigger-crash-log-submission:42 > + if subprocess.call(['/usr/bin/notifyutil', '-p', 'com.apple.crashreporter.debug.submit.now']): Does this return immediately and crash log submission keeps happening in background? If so, what if the bot immediately reboot after build finishes? > Tools/BuildSlaveSupport/wait-for-crash-collection:48 > +def cpu_percentage(pid): what if the pid finished just before this function was called. top output wouldn't mention any percentage in that case, and float conversion would raise an exception. We should probably have exception handling here > Tools/BuildSlaveSupport/wait-for-crash-collection:78 > + time.sleep(3) probably '3' should be a variable defined in beginning of this file. > Tools/BuildSlaveSupport/ews-build/steps.py:1647 > + return {u'step': u'Failed to kill trigger crash log submission'} kill? typo?
Jonathan Bedard
Comment 19 2020-04-09 11:17:57 PDT
Comment on attachment 395969 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=395969&action=review >> Tools/ChangeLog:31 >> + * BuildSlaveSupport/trigger-crash-collection: Added. > > Are these scripts well-tested? > Maybe we should consider enabling it in one of the CI first (either EWS or build.webkit.org) and make sure it works fine before enabling them in other one. I've tested them pretty extensively. I think we land the single patch and do one restart first, then the other. There aren't very many ways these scripts can fail, actually. Especially because we have a timeout on wait-for-crash-collection. >> Tools/BuildSlaveSupport/trigger-crash-log-submission:42 >> + if subprocess.call(['/usr/bin/notifyutil', '-p', 'com.apple.crashreporter.debug.submit.now']): > > Does this return immediately and crash log submission keeps happening in background? > If so, what if the bot immediately reboot after build finishes? Yes. Not something I've considered, nor something I think we should worry about. Either crash submission simply won't happen, or it will happen after reboot. Doesn't matter much either way. Failing to trigger crash submission is basically the world we live in now, which we've been able to live with for years.
Jonathan Bedard
Comment 20 2020-04-09 11:26:01 PDT
Jonathan Bedard
Comment 21 2020-04-09 11:37:33 PDT
Created attachment 395984 [details] Patch for landing
EWS
Comment 22 2020-04-09 12:00:42 PDT
Committed r259811: <https://trac.webkit.org/changeset/259811> All reviewed patches have been landed. Closing bug and clearing flags on attachment 395984 [details].
Aakash Jain
Comment 23 2020-04-11 05:02:11 PDT
Deployed this change on EWS server.
Aakash Jain
Comment 24 2020-04-11 05:07:00 PDT
Comment on attachment 395984 [details] Patch for landing View in context: https://bugs.webkit.org/attachment.cgi?id=395984&action=review > Tools/BuildSlaveSupport/wait-for-crash-collection:73 > + print('Failed to find {}'.format(SUBMIT_DIAG_INFO)) This results in a misleading error message: 'Failed to find /System/Library/CoreServices/SubmitDiagInfo'. e.g.: https://ews-build.webkit.org/#/builders/31/builds/7348/steps/8/logs/stdio Should be something like: 'Failed to find running process for {}'
Aakash Jain
Comment 25 2020-04-11 05:08:03 PDT
> This results in a misleading error message Fixed in https://trac.webkit.org/changeset/259926/webkit
Aakash Jain
Comment 26 2020-04-11 05:37:46 PDT
Follow-up fix to remove spaces from step name: https://trac.webkit.org/changeset/259927/webkit
Note You need to log in before you can comment on or make changes to this bug.