commit-queue can wrongly reject patches if the buildbots are behind
The commit-queue uses the buildbot status to determine if it should be landing patches or not. Right now it checks before trying to build and if the bots are green goes ahead and tries to land. If the build/tests fail then it assumes they were caused by the patch.
This can lead to false rejections if the bots are simply behind but they are about to turn red. :(
There are many solutions we could use to fix this. Two possibilities:
- roll out the patch locally and try to build/test again, if that passes actually reject the patch.
- update to exactly the revision that the bots last tested.
OK. I've modified the commit-queue to pause itself when the buildbots get behind.
Builders ["Leopard Intel Release (Tests)" : 4 pending, "Leopard Intel Debug (Tests)" : 7 pending] are behind. Waiting. See http://build.webkit.org/waterfall. Sleeping until 2009-10-26 17:31:39 (5 mins).
I'm going to test the change locally for a bit. I'll post a patch for review tomorrow assuming all goes well.
The sad part of this change is that it will make the commit-bot much more sensitive to our slow layout tests. It also will make the commit-bot tend to commit patches late at night (when the tree is otherwise quiet) and will slow down response time to commit-queue+.
That said, I still would rather have a commit-queue which never falsely rejects patches than one which is fast.
Created attachment 41919 [details]
My proposed fix, currently testing on live commit-bot
http://webkit-commit-queue.appspot.com/ doesn't really like having such long status messages. But you can see what the output of the bot looks like there.
Actually checking to make sure that there are no pending builds isn't quite strong enough. If I want to actually make sure that the very latest build was green then I need to make sure that every builder is "idle".
Adam Barth points out that I should have a fall-back mode, where if the builders are busy I do an extra build/test step w/o the patch applied first, if that passes then I go ahead and commit. That might just be extra complexity for an incomplete solution, but then again it might be a big help.
As-is waiting for the builders to all be idle before letting the commit-queue run is going to greatly slow down the rate at which the commit-queue can land patches as it will be always tied to run as slowly as the slowest bot that it's watching. :(
I think if we end up turning this on for the commit-queue, I may end up needing to spend a bunch of effort making the buildbots faster (by speeding up run-webkit-tests). We'll see.
This patch is long out of date, but I need to update it, because it would be very useful. :)
This problem is super easy to solve just by running a complete cycle before landing the patch and then trying to apply the patch with --no-update. That's how the EWS rolls.
The commit is more likely to fail at the end however, because the ChangeLogs are more likely to be out of date.
That approach trades one evil (longer waits before bots are actually green) for another (longer cycles, more likely to reject). But since it's easy to implement and better than what we've got I say go for it. We also need to eventually fix buildbot.py's pending count (right now it's one-off), but that's relatively low priority.
Created attachment 45834 [details]
style-queue ran check-webkit-style on attachment 45834 [details] without any errors.
Comment on attachment 45834 [details]
162 self._update_status("Unabled to successfully build and test", None)
should include the revision number that it tried, and should include the results spew, no?
Also, shouldn't it prefix with ERROR: so that it shows up purple?
Probably should have a comment to explain the --no-update here:
170 self.run_bugzilla_tool(["land-attachment", "--force-clean", "--non-interactive", "--no-update", "--parent-command=commit-queue", "--build-style=both", "--quiet", patch["id"]])
I think this is definitely better than what we have. Sad that we'll end up running the layout tests twice which will slow us down a lot. Hopefully that will be moot once run_webkit_tests.py is with us.
I'll say r+, but I think this needs modifications to report the status of the failed test build better. That will help us more easily debug flakey tests (and will make the dashboard correctly show purple).
> should include the revision number that it tried, and should include the
> results spew, no?
It could, but no one is going to see the message expect maybe you. The EWS also as a similar idle message and I've never been interested in why they didn't build.
> Also, shouldn't it prefix with ERROR: so that it shows up purple?
This doesn't show up purple anywhere because it's not associated with any patch. It's just an idle message like "Builders [%s] are red."
I'm tempted to remove the message instead of making it more complicated.
> Probably should have a comment to explain the --no-update here:
> I'll say r+, but I think this needs modifications to report the status of the
> failed test build better. That will help us more easily debug flakey tests
> (and will make the dashboard correctly show purple).
Flakey tests will just make the commit-queue wait five minutes before landing a patch occasionally. If we get a flaky test 1% of the time, this will add 3 seconds of average latency, which is lost in the noise of the 5 minute polling frequency.
Created attachment 45835 [details]
Comment on attachment 45835 [details]
Slightly confused by the second _builders_are_green() check? (might need a comment?) Won't land-attachment check for us?
I still need to fix buildbot to be able to tell you if things are pending. I'll do that in a separate bug.
> Slightly confused by the second _builders_are_green() check? (might need a
I added that because you mentioned that build-and-test could take a while. If the bots have gone red in the meantime (say on Windows), then we don't want to land in order to avoid making the tree worse.
> Won't land-attachment check for us?
If that's the case, then why was the code there originally to check the builders? I'd be happy to remove it if you think the subprocess should handle checking the builders. I'd be more inclined to have to pass --ignore-builders and have the master process deal with deciding whether its time to run land-attachment.
Committed r52764: <http://trac.webkit.org/changeset/52764>
Well, back in the day we didn't have any way to distingish between a recoverable/temporary error and a permentant error. I'm not sure that builders being red returns the right exit code. But you're right, we could remove the second check and depend on land-attachment assuming that EnsureBuildersAreGreen returns the correct exit code.
This change has made the commit-queue nearly unbearably slow. We may need to revert this, or try a different approach.