Bug 30098

Summary: commit-queue can wrongly reject patches if the buildbots are behind
Product: WebKit Reporter: Eric Seidel (no email) <eric>
Component: Tools / TestsAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, levin, ojan, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: OS X 10.5   
Attachments:
Description Flags
My proposed fix, currently testing on live commit-bot
none
Patch
none
Patch eric: review+

Description Eric Seidel (no email) 2009-10-05 15:01:07 PDT
commit-queue can wrongly reject patches if the buildbots are behind

The commit-queue uses the buildbot status to determine if it should be landing patches or not.  Right now it checks before trying to build and if the bots are green goes ahead and tries to land.  If the build/tests fail then it assumes they were caused by the patch.

This can lead to false rejections if the bots are simply behind but they are about to turn red. :(

There are many solutions we could use to fix this.  Two possibilities:
- roll out the patch locally and try to build/test again, if that passes actually reject the patch.
- update to exactly the revision that the bots last tested.
Comment 1 Eric Seidel (no email) 2009-10-26 17:28:04 PDT
OK.  I've modified the commit-queue to pause itself when the buildbots get behind.

Example:
Builders ["Leopard Intel Release (Tests)" : 4 pending, "Leopard Intel Debug (Tests)" : 7 pending] are behind.  Waiting. See http://build.webkit.org/waterfall. Sleeping until 2009-10-26 17:31:39 (5 mins).

I'm going to test the change locally for a bit.  I'll post a patch for review tomorrow assuming all goes well.
Comment 2 Eric Seidel (no email) 2009-10-26 17:29:35 PDT
The sad part of this change is that it will make the commit-bot much more sensitive to our slow layout tests.  It also will make the commit-bot tend to commit patches late at night (when the tree is otherwise quiet) and will slow down response time to commit-queue+.

That said, I still would rather have a commit-queue which never falsely rejects patches than one which is fast.
Comment 3 Eric Seidel (no email) 2009-10-26 17:34:41 PDT
Created attachment 41919 [details]
My proposed fix, currently testing on live commit-bot
Comment 4 Eric Seidel (no email) 2009-10-26 17:35:36 PDT
http://webkit-commit-queue.appspot.com/ doesn't really like having such long status messages.  But you can see what the output of the bot looks like there.
Comment 5 Eric Seidel (no email) 2009-10-27 01:39:06 PDT
Actually checking to make sure that there are no pending builds isn't quite strong enough.  If I want to actually make sure that the very latest build was green then I need to make sure that every builder is "idle".

Adam Barth points out that I should have a fall-back mode, where if the builders are busy I do an extra build/test step w/o the patch applied first, if that passes then I go ahead and commit.  That might just be extra complexity for an incomplete solution, but then again it might be a big help.

As-is waiting for the builders to all be idle before letting the commit-queue run is going to greatly slow down the rate at which the commit-queue can land patches as it will be always tied to run as slowly as the slowest bot that it's watching. :(
Comment 6 Eric Seidel (no email) 2009-10-27 01:40:39 PDT
I think if we end up turning this on for the commit-queue, I may end up needing to spend a bunch of effort making the buildbots faster (by speeding up run-webkit-tests).  We'll see.
Comment 7 Eric Seidel (no email) 2009-12-22 18:27:05 PST
This patch is long out of date, but I need to update it, because it would be very useful. :)
Comment 8 Eric Seidel (no email) 2010-01-04 13:15:12 PST
Another victim:
https://bugs.webkit.org/show_bug.cgi?id=33090#c3
Comment 9 Adam Barth 2010-01-04 14:37:33 PST
This problem is super easy to solve just by running a complete cycle before landing the patch and then trying to apply the patch with --no-update.  That's how the EWS rolls.
Comment 10 Eric Seidel (no email) 2010-01-04 14:43:53 PST
The commit is more likely to fail at the end however, because the ChangeLogs are more likely to be out of date.

That approach trades one evil (longer waits before bots are actually green) for another (longer cycles, more likely to reject).  But since it's easy to implement and better than what we've got I say go for it.  We also need to eventually fix buildbot.py's pending count (right now it's one-off), but that's relatively low priority.
Comment 11 Adam Barth 2010-01-04 14:46:13 PST
Created attachment 45834 [details]
Patch
Comment 12 WebKit Review Bot 2010-01-04 14:49:28 PST
style-queue ran check-webkit-style on attachment 45834 [details] without any errors.
Comment 13 Eric Seidel (no email) 2010-01-04 14:50:34 PST
Comment on attachment 45834 [details]
Patch

Seems:
 162             self._update_status("Unabled to successfully build and test", None)

should include the revision number that it tried, and should include the results spew, no?

Also, shouldn't it prefix with ERROR: so that it shows up purple?

Probably should have a comment to explain the --no-update here:
 170             self.run_bugzilla_tool(["land-attachment", "--force-clean", "--non-interactive", "--no-update", "--parent-command=commit-queue", "--build-style=both", "--quiet", patch["id"]])

I think this is definitely better than what we have.  Sad that we'll end up running the layout tests twice which will slow us down a lot.  Hopefully that will be moot once run_webkit_tests.py is with us.

I'll say r+, but I think this needs modifications to report the status of the failed test build better.  That will help us more easily debug flakey tests (and will make the dashboard correctly show purple).
Comment 14 Adam Barth 2010-01-04 15:02:33 PST
> should include the revision number that it tried, and should include the
> results spew, no?

It could, but no one is going to see the message expect maybe you.  The EWS also as a similar idle message and I've never been interested in why they didn't build.

> Also, shouldn't it prefix with ERROR: so that it shows up purple?

This doesn't show up purple anywhere because it's not associated with any patch.  It's just an idle message like "Builders [%s] are red."

I'm tempted to remove the message instead of making it more complicated.

> Probably should have a comment to explain the --no-update here:

Done.

> I'll say r+, but I think this needs modifications to report the status of the
> failed test build better.  That will help us more easily debug flakey tests
> (and will make the dashboard correctly show purple).

Flakey tests will just make the commit-queue wait five minutes before landing a patch occasionally.  If we get a flaky test 1% of the time, this will add 3 seconds of average latency, which is lost in the noise of the 5 minute polling frequency.
Comment 15 Adam Barth 2010-01-04 15:04:07 PST
Created attachment 45835 [details]
Patch
Comment 16 Eric Seidel (no email) 2010-01-04 15:06:29 PST
Comment on attachment 45835 [details]
Patch

LGTM.

Slightly confused by the second _builders_are_green() check?  (might need a comment?)  Won't land-attachment check for us?

I still need to fix buildbot to be able to tell you if things are pending.  I'll do that in a separate bug.
Comment 17 Adam Barth 2010-01-04 15:13:13 PST
> Slightly confused by the second _builders_are_green() check?  (might need a
> comment?)

I added that because you mentioned that build-and-test could take a while.  If the bots have gone red in the meantime (say on Windows), then we don't want to land in order to avoid making the tree worse.

> Won't land-attachment check for us?

If that's the case, then why was the code there originally to check the builders?  I'd be happy to remove it if you think the subprocess should handle checking the builders.  I'd be more inclined to have to pass --ignore-builders and have the master process deal with deciding whether its time to run land-attachment.
Comment 18 Adam Barth 2010-01-04 15:14:02 PST
Committed r52764: <http://trac.webkit.org/changeset/52764>
Comment 19 Eric Seidel (no email) 2010-01-04 15:15:52 PST
Well, back in the day we didn't have any way to distingish between a recoverable/temporary error and a permentant error.  I'm not sure that builders being red returns the right exit code.  But you're right, we could remove the second check and depend on land-attachment assuming that EnsureBuildersAreGreen returns the correct exit code.
Comment 20 Eric Seidel (no email) 2010-01-20 03:42:06 PST
This change has made the commit-queue nearly unbearably slow.  We may need to revert this, or try a different approach.