Bug 94505 - Intermittenly, many WebKit2 tests have results from the wrong test compared to the test just run, giving false failures.
Summary: Intermittenly, many WebKit2 tests have results from the wrong test compared t...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Dirk Pranke
URL:
Keywords:
Depends on:
Blocks: 94277
  Show dependency treegraph
 
Reported: 2012-08-20 11:03 PDT by Brady Eidson
Modified: 2012-08-22 04:30 PDT (History)
9 users (show)

See Also:


Attachments
Patch (2.12 KB, patch)
2012-08-20 13:18 PDT, Dirk Pranke
ojan: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Brady Eidson 2012-08-20 11:03:24 PDT
Intermittenly, many WebKit2 tests have results from the wrong test compared to the test just run, giving false failures.

This can easily be seen on the WK2 Lion bots, for example. where we bail early with 500+ failures.  If you dig in to the failures you see that in many cases the wrong results are being attributed to the test, therefore causing the failure.

If you're running locally and don't have the "abort after 500+ failures" flag, these tests are re-run as flakes and pass the 2nd time.

This is huge pain point in regression testing right now.
Comment 1 Dirk Pranke 2012-08-20 11:06:31 PDT
Is this a sub-bug of bug 94277, or do you see this as different somehow?
Comment 3 Brady Eidson 2012-08-20 11:11:36 PDT
(In reply to comment #1)
> Is this a sub-bug of bug 94277, or do you see this as different somehow?

I think off-by-one is an identified symptom and therefore a distinct ticket.

It might verywell be the *only* issue, at which point I'll clean up the bookkeeping.
Comment 4 Simon Fraser (smfr) 2012-08-20 11:23:17 PDT
Did this problem start on > 1 bot at the same time? If so, maybe we can track it down to a specific commit. Otherwise, the test before or at the first failing test for each shard may be interesting.
Comment 5 Brady Eidson 2012-08-20 11:24:19 PDT
(In reply to comment #4)
> Did this problem start on > 1 bot at the same time? If so, maybe we can track it down to a specific commit. Otherwise, the test before or at the first failing test for each shard may be interesting.

This started weeks ago (and went unnoticed) and the bots don't keep long enough history to know when it started.
Comment 6 Dirk Pranke 2012-08-20 12:26:28 PDT
so, I'm able to reproduce something that looks at least similar, locally. 

It looks like what is happening is that the first test that fails in a shard (i.e., same directory) is failing with the message "Timed out waiting for final message from web process". Subsequent tests in the same shard then are failing with the output from the previous test, as if the output from the test that did time out isn't getting thrown away properly, and we're not communicating back to run-webkit-tests that something weird happened, and so run-webkit-tests doesn't know to recover/restart WTR either.

I don't yet know why this is happening, and I don't know the code in WTR at all so I don't know if this sounds plausible. However, it should be easy to test/debug/work around, so I'm trying that now.
Comment 7 Dirk Pranke 2012-08-20 13:18:57 PDT
Created attachment 159504 [details]
Patch
Comment 8 Ojan Vafai 2012-08-20 13:28:23 PDT
Comment on attachment 159504 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=159504&action=review

> Tools/Scripts/webkitpy/layout_tests/port/driver.py:163
> +        if 'Timed out waiting for final message from web process' in text:

This should probably have a FIXME. In theory at least, we should fix the WTR bug eventually and remove this special case, right?
Comment 9 Balazs Kelemen 2012-08-20 13:44:16 PDT
(In reply to comment #8)
> (From update of attachment 159504 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=159504&action=review
> 
> > Tools/Scripts/webkitpy/layout_tests/port/driver.py:163
> > +        if 'Timed out waiting for final message from web process' in text:
> 
> This should probably have a FIXME. In theory at least, we should fix the WTR bug eventually and remove this special case, right?

Before r124596, a change of mine, WTR was exiting in this case, so I think we should keep this behavior, although I'm not sure whether special case is needed in nrwt with that.
Comment 10 Dirk Pranke 2012-08-20 13:48:41 PDT
It should have a fixme to either fix the underlying bug or at least make WTR communicate the need to restart in a more consistent and/or orderly manner.
Comment 11 Dirk Pranke 2012-08-20 13:50:00 PDT
(In reply to comment #9)
> (In reply to comment #8)
> > (From update of attachment 159504 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=159504&action=review
> > 
> > > Tools/Scripts/webkitpy/layout_tests/port/driver.py:163
> > > +        if 'Timed out waiting for final message from web process' in text:
> > 
> > This should probably have a FIXME. In theory at least, we should fix the WTR bug eventually and remove this special case, right?
> 
> Before r124596, a change of mine, WTR was exiting in this case, so I think we should keep this behavior, although I'm not sure whether special case is needed in nrwt with that.

Is it possible your change is causing this bug?
Comment 12 Brady Eidson 2012-08-20 13:56:38 PDT
For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out.

I think we need to land this.
Comment 13 Dirk Pranke 2012-08-20 13:59:57 PDT
Committed r126062: <http://trac.webkit.org/changeset/126062>
Comment 14 Ojan Vafai 2012-08-20 16:50:54 PDT
(In reply to comment #12)
> For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out.
> 
> I think we need to land this.

I was just suggesting a FIXME. There's already a way that NRWT detects timing out tests. We should make WTR do that, then we don't need this special case. IIRC it something like spitting out a "#TIMEOUT" line.
Comment 15 Brady Eidson 2012-08-20 17:01:35 PDT
(In reply to comment #14)
> (In reply to comment #12)
> > For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out.
> > 
> > I think we need to land this.
> 
> I was just suggesting a FIXME. There's already a way that NRWT detects timing out tests. We should make WTR do that, then we don't need this special case. IIRC it something like spitting out a "#TIMEOUT" line.

I think Dirk and Balazas had much of that conversation on IRC today.  Unsure what the outcome was (if any)
Comment 16 Dirk Pranke 2012-08-20 17:16:38 PDT
(In reply to comment #15)
> (In reply to comment #14)
> > (In reply to comment #12)
> > > For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out.
> > > 
> > > I think we need to land this.
> > 
> > I was just suggesting a FIXME. There's already a way that NRWT detects timing out tests. We should make WTR do that, then we don't need this special case. IIRC it something like spitting out a "#TIMEOUT" line.
> 
> I think Dirk and Balazas had much of that conversation on IRC today.  Unsure what the outcome was (if any)

I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one.

The problem seems to be that in 

http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201

I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank).
Comment 17 Balazs Kelemen 2012-08-21 03:50:10 PDT
> I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one.
> 
> The problem seems to be that in 
> 
> http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201
> 
> I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank).

In my change I assumed that resetStateToConsistentValues() will never fail if there was no timeout before. In debug I added an assertion but in release this just mean we don't output anything on stderr so nrwt could not catch the error. I am going to take a try on whether removing this assumption fix this (without the patch we landed).
Comment 18 Balazs Kelemen 2012-08-21 08:32:14 PDT
(In reply to comment #17)
> > I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one.
> > 
> > The problem seems to be that in 
> > 
> > http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201
> > 
> > I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank).
> 
> In my change I assumed that resetStateToConsistentValues() will never fail if there was no timeout before. In debug I added an assertion but in release this just mean we don't output anything on stderr so nrwt could not catch the error. I am going to take a try on whether removing this assumption fix this (without the patch we landed).

My theory was wrong, this is not the reason. I think Dirk's patch fix the case when the test times out but resetStateToConsistentValues() does not fail (because if it fails we was restarting wtr anyway). I think the bug is that resetStateToConsistentValues() can return true even if the test is stil running. If I dump the #PROCESS UNRESPONSIVENESS message before calling resetStateToConsistentValues(), the off-by-one failures disappear.
Comment 19 Balazs Kelemen 2012-08-21 10:02:33 PDT
Filed bug 94609 to remove the workaround.
Comment 20 Simon Fraser (smfr) 2012-08-21 10:41:06 PDT
(In reply to comment #18)
> (In reply to comment #17)
> > > I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one.
> > > 
> > > The problem seems to be that in 
> > > 
> > > http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201
> > > 
> > > I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank).
> > 
> > In my change I assumed that resetStateToConsistentValues() will never fail if there was no timeout before. In debug I added an assertion but in release this just mean we don't output anything on stderr so nrwt could not catch the error. I am going to take a try on whether removing this assumption fix this (without the patch we landed).
> 
> My theory was wrong, this is not the reason. I think Dirk's patch fix the case when the test times out but resetStateToConsistentValues() does not fail (because if it fails we was restarting wtr anyway). I think the bug is that resetStateToConsistentValues() can return true even if the test is stil running. If I dump the #PROCESS UNRESPONSIVENESS message before calling resetStateToConsistentValues(), the off-by-one failures disappear.

Does that mean that we still need a fix, and should reopen this bug?
Comment 21 Simon Fraser (smfr) 2012-08-21 10:52:42 PDT
I guess the patch in bug 94609 implements the real fix.
Comment 22 Balazs Kelemen 2012-08-22 04:30:13 PDT
(In reply to comment #21)
> I guess the patch in bug 94609 implements the real fix.

Yep.