WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
210202
Buildbot: Force crash log submission after each test run
https://bugs.webkit.org/show_bug.cgi?id=210202
Summary
Buildbot: Force crash log submission after each test run
Jonathan Bedard
Reported
2020-04-08 11:15:16 PDT
We need to force crash log submission after each test run because only 20 unsubmitted crash logs are cached at a time.
Attachments
Patch
(4.36 KB, patch)
2020-04-08 11:18 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(4.38 KB, patch)
2020-04-08 12:13 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(11.25 KB, patch)
2020-04-08 13:49 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(25.48 KB, patch)
2020-04-08 14:17 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(32.43 KB, patch)
2020-04-09 08:26 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(34.27 KB, patch)
2020-04-09 08:37 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(34.42 KB, patch)
2020-04-09 08:46 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(34.37 KB, patch)
2020-04-09 10:11 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch
(34.48 KB, patch)
2020-04-09 11:26 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Patch for landing
(34.49 KB, patch)
2020-04-09 11:37 PDT
,
Jonathan Bedard
no flags
Details
Formatted Diff
Diff
Show Obsolete
(9)
View All
Add attachment
proposed patch, testcase, etc.
Jonathan Bedard
Comment 1
2020-04-08 11:15:30 PDT
<
rdar://problem/60508929
>
Jonathan Bedard
Comment 2
2020-04-08 11:18:54 PDT
Created
attachment 395841
[details]
Patch
Jonathan Bedard
Comment 3
2020-04-08 12:13:46 PDT
Created
attachment 395851
[details]
Patch
Alexey Proskuryakov
Comment 4
2020-04-08 13:04:11 PDT
Comment on
attachment 395851
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=395851&action=review
It may be worth checking if there are any unsubmitted crashes, to avoid the need to spend the time when there is nothing to do. Although that would make us depend even more on undocumented and unsupoprted behaviors.
> Tools/ChangeLog:3 > + Buildbot: Force crash log submission after each test run (Part 1)
This patch doesn't change anything for Buildbot yet. I see that it's part one, but it can have its own name.
> Tools/ChangeLog:9 > + * BuildSlaveSupport/report-crashes: Added.
I think that this would benefit from a more descriptive name.
> Tools/BuildSlaveSupport/report-crashes:64 > + if subprocess.call(['/usr/bin/killall', '-9', 'diagnostics_agent']):
Worth mentioning in a comment that this is a workaround for
rdar://problem/60507877
.
> Tools/BuildSlaveSupport/report-crashes:66 > + return 1
I think that you are going to separate triggering and waiting for completion, in which case the early return won't be needed.
> Tools/BuildSlaveSupport/report-crashes:73 > + print('Requested crashreporter submission, waiting on process {} to queise'.format(pid))
Typo: quiesce
Jonathan Bedard
Comment 5
2020-04-08 13:49:21 PDT
Created
attachment 395862
[details]
Patch
Jonathan Bedard
Comment 6
2020-04-08 14:17:02 PDT
Created
attachment 395868
[details]
Patch
Alexey Proskuryakov
Comment 7
2020-04-08 15:50:40 PDT
Comment on
attachment 395868
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=395868&action=review
> Tools/ChangeLog:18 > + * BuildSlaveSupport/trigger-crash-collection: Added. > + * BuildSlaveSupport/wait-for-crash-collection: Added.
Maybe trigger-crash-log-submission? "Crash collection" doesn't parse without context.
> Tools/BuildSlaveSupport/trigger-crash-collection:35 > + # Work around for <
rdar://problem/60507877
>
As a noun, "workaround" doesn't have a space. Please also add a period at the end.
> Tools/BuildSlaveSupport/trigger-crash-collection:37 > + print('Failed to kill diagnostic agent')
Please verify that this process is named like this in all supported macOS versions. Nit: Better to have the exact name in the log, diagnostics_agent.
> Tools/BuildSlaveSupport/trigger-crash-collection:39 > + print('Killed diagnostic agent')
Better to have the exact name in the log.
> Tools/BuildSlaveSupport/trigger-crash-collection:43 > + print('Failed to request crashreporter submission')
Nit: using different words in logs and in tool name suggests that one of those is not ideal. Is it "request" or "trigger"?
> Tools/BuildSlaveSupport/wait-for-crash-collection:29 > +SUBMIT_DIAG_INFO = '/System/Library/CoreServices/SubmitDiagInfo'
Have you confirmed that the name is the same on all supported macOS versions?
> Tools/BuildSlaveSupport/wait-for-crash-collection:33 > + process = subprocess.Popen(['/bin/ps', '-eo', 'pid,comm'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
Surprised that there isn't a way for this in Python that doesn't involve launching a separate tool.
> Tools/BuildSlaveSupport/wait-for-crash-collection:51 > +def main():
I think that this should have a timeout, either built in, or an argument. It's better to run tests anyway if SubmitDiagInfo doesn't quit.
> Tools/BuildSlaveSupport/wait-for-crash-collection:59 > + if not pid: > + print('Failed to find {}'.format(SUBMIT_DIAG_INFO)) > + return 1
This seems like a success case, not failure.
> Tools/BuildSlaveSupport/wait-for-crash-collection:61 > + print('Waiting on process {} to quiesce'.format(pid))
"on" or "for"?
> Tools/BuildSlaveSupport/wait-for-crash-collection:64 > + while cpu_percentage(pid) > 5: > + pass
Should we sleep for a bit here, to not spin a whole CPU core.
> Tools/BuildSlaveSupport/build.webkit.org-config/steps.py:152 > + description = ["waiting for crash collectio to quiesce"]
Typo: collectio
Jonathan Bedard
Comment 8
2020-04-08 16:17:33 PDT
Comment on
attachment 395868
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=395868&action=review
>> Tools/BuildSlaveSupport/wait-for-crash-collection:33 >> + process = subprocess.Popen(['/bin/ps', '-eo', 'pid,comm'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) > > Surprised that there isn't a way for this in Python that doesn't involve launching a separate tool.
This is the way webkitpy does it. I suspect there is way, but I didn't look much further when I saw what webkitpy did.
>> Tools/BuildSlaveSupport/wait-for-crash-collection:51 >> +def main(): > > I think that this should have a timeout, either built in, or an argument. It's better to run tests anyway if SubmitDiagInfo doesn't quit.
I can add one to be explicit, I was relying on Buildbot's built in timeout when a script prints nothing for too long.
Alexey Proskuryakov
Comment 9
2020-04-08 16:25:00 PDT
> I can add one to be explicit, I was relying on Buildbot's built in timeout when a script prints nothing for too long.
That would abort steps by default, right? I think that we should ignore and continue, after 5 minutes or so.
Jonathan Bedard
Comment 10
2020-04-08 17:21:41 PDT
(In reply to Alexey Proskuryakov from
comment #9
)
> > I can add one to be explicit, I was relying on Buildbot's built in timeout when a script prints nothing for too long. > > That would abort steps by default, right? I think that we should ignore and > continue, after 5 minutes or so.
Ok, making that change now
Alexey Proskuryakov
Comment 11
2020-04-08 18:53:29 PDT
Comment on
attachment 395868
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=395868&action=review
> Tools/BuildSlaveSupport/build.webkit.org-config/factories.py:87 > + if platform.startswith('mac') or platform.startswith('ios-simulator'):
Oh, also needed for EWS.
Jonathan Bedard
Comment 12
2020-04-09 08:26:34 PDT
Created
attachment 395955
[details]
Patch
Jonathan Bedard
Comment 13
2020-04-09 08:37:18 PDT
Created
attachment 395958
[details]
Patch
Jonathan Bedard
Comment 14
2020-04-09 08:46:37 PDT
Created
attachment 395959
[details]
Patch
Aakash Jain
Comment 15
2020-04-09 09:20:35 PDT
Comment on
attachment 395959
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=395959&action=review
Let's use single-quotes instead of double-quotes through-out the patch.
> Tools/BuildSlaveSupport/build.webkit.org-config/steps.py:144 > + name = "trigger crash log submission"
please remove spaces from the step name.
> Tools/BuildSlaveSupport/build.webkit.org-config/steps.py:151 > + name = "wait for crash collection"
please remove spaces from the step name.
> Tools/BuildSlaveSupport/ews-build/factories.py:121 > + self.addStep(WaitForCrashCollection())
Might be a good idea to add a comment here as well mentioning that this is a temporary workaround.
> Tools/BuildSlaveSupport/ews-build/factories_unittest.py:223 > + _BuildStepFactory(steps.TriggerCrashLogSubmission),
SetBuildSummary will finish the build immediately, so it has to be the last step. Please have TriggerCrashLogSubmission executed before it.
> Tools/BuildSlaveSupport/ews-build/steps.py:1637 > + name = "trigger crash log submission"
please remove spaces from the step name.
> Tools/BuildSlaveSupport/ews-build/steps.py:1652 > + name = "wait for crash collection"
please remove spaces from the step name.
> Tools/BuildSlaveSupport/ews-build/steps.py:1658 > + super(WaitForCrashCollection, self).__init__(timeout=360, logEnviron=False, **kwargs)
let's explicitly mention timeout as timeout=6 * 60
> Tools/BuildSlaveSupport/ews-build/steps.py:1662 > + return {u'step': u'Crash log collection failed to quiesce'}
Please replace quiesce with an easy to understand word, so that people like me don't have to open the dictionary to understand the error message.
Jonathan Bedard
Comment 16
2020-04-09 09:45:19 PDT
Comment on
attachment 395959
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=395959&action=review
>> Tools/BuildSlaveSupport/ews-build/factories.py:121 >> + self.addStep(WaitForCrashCollection()) > > Might be a good idea to add a comment here as well mentioning that this is a temporary workaround.
It is not a temporary workaround. The only temporary part is the killall. We're always going to need to trigger crash log submission and wait for collection to quiesce.
>> Tools/BuildSlaveSupport/ews-build/steps.py:1662 >> + return {u'step': u'Crash log collection failed to quiesce'} > > Please replace quiesce with an easy to understand word, so that people like me don't have to open the dictionary to understand the error message.
"quiesce" is the term we use in a few other places for "wait until a process stops doing work", namely in performance infrastructure. I don't know a better term in general, although this line could be 'Crash log collection process still running'. Is your complaint about this line in particular or the use of 'quiesce' throughout the patch
Jonathan Bedard
Comment 17
2020-04-09 10:11:04 PDT
Created
attachment 395969
[details]
Patch
Aakash Jain
Comment 18
2020-04-09 10:49:55 PDT
Comment on
attachment 395969
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=395969&action=review
> Tools/ChangeLog:31 > + * BuildSlaveSupport/trigger-crash-collection: Added.
Are these scripts well-tested? Maybe we should consider enabling it in one of the CI first (either EWS or build.webkit.org) and make sure it works fine before enabling them in other one.
> Tools/BuildSlaveSupport/trigger-crash-log-submission:41 > + time.sleep(3)
probably '3' should be a variable defined in beginning of this file.
> Tools/BuildSlaveSupport/trigger-crash-log-submission:42 > + if subprocess.call(['/usr/bin/notifyutil', '-p', 'com.apple.crashreporter.debug.submit.now']):
Does this return immediately and crash log submission keeps happening in background? If so, what if the bot immediately reboot after build finishes?
> Tools/BuildSlaveSupport/wait-for-crash-collection:48 > +def cpu_percentage(pid):
what if the pid finished just before this function was called. top output wouldn't mention any percentage in that case, and float conversion would raise an exception. We should probably have exception handling here
> Tools/BuildSlaveSupport/wait-for-crash-collection:78 > + time.sleep(3)
probably '3' should be a variable defined in beginning of this file.
> Tools/BuildSlaveSupport/ews-build/steps.py:1647 > + return {u'step': u'Failed to kill trigger crash log submission'}
kill? typo?
Jonathan Bedard
Comment 19
2020-04-09 11:17:57 PDT
Comment on
attachment 395969
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=395969&action=review
>> Tools/ChangeLog:31 >> + * BuildSlaveSupport/trigger-crash-collection: Added. > > Are these scripts well-tested? > Maybe we should consider enabling it in one of the CI first (either EWS or build.webkit.org) and make sure it works fine before enabling them in other one.
I've tested them pretty extensively. I think we land the single patch and do one restart first, then the other. There aren't very many ways these scripts can fail, actually. Especially because we have a timeout on wait-for-crash-collection.
>> Tools/BuildSlaveSupport/trigger-crash-log-submission:42 >> + if subprocess.call(['/usr/bin/notifyutil', '-p', 'com.apple.crashreporter.debug.submit.now']): > > Does this return immediately and crash log submission keeps happening in background? > If so, what if the bot immediately reboot after build finishes?
Yes. Not something I've considered, nor something I think we should worry about. Either crash submission simply won't happen, or it will happen after reboot. Doesn't matter much either way. Failing to trigger crash submission is basically the world we live in now, which we've been able to live with for years.
Jonathan Bedard
Comment 20
2020-04-09 11:26:01 PDT
Created
attachment 395980
[details]
Patch
Jonathan Bedard
Comment 21
2020-04-09 11:37:33 PDT
Created
attachment 395984
[details]
Patch for landing
EWS
Comment 22
2020-04-09 12:00:42 PDT
Committed
r259811
: <
https://trac.webkit.org/changeset/259811
> All reviewed patches have been landed. Closing bug and clearing flags on
attachment 395984
[details]
.
Aakash Jain
Comment 23
2020-04-11 05:02:11 PDT
Deployed this change on EWS server.
Aakash Jain
Comment 24
2020-04-11 05:07:00 PDT
Comment on
attachment 395984
[details]
Patch for landing View in context:
https://bugs.webkit.org/attachment.cgi?id=395984&action=review
> Tools/BuildSlaveSupport/wait-for-crash-collection:73 > + print('Failed to find {}'.format(SUBMIT_DIAG_INFO))
This results in a misleading error message: 'Failed to find /System/Library/CoreServices/SubmitDiagInfo'. e.g.:
https://ews-build.webkit.org/#/builders/31/builds/7348/steps/8/logs/stdio
Should be something like: 'Failed to find running process for {}'
Aakash Jain
Comment 25
2020-04-11 05:08:03 PDT
> This results in a misleading error message
Fixed in
https://trac.webkit.org/changeset/259926/webkit
Aakash Jain
Comment 26
2020-04-11 05:37:46 PDT
Follow-up fix to remove spaces from step name:
https://trac.webkit.org/changeset/259927/webkit
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug