94505 – Intermittenly, many WebKit2 tests have results from the wrong test compared to the test just run, giving false failures.

RESOLVED FIXED 94505

Intermittenly, many WebKit2 tests have results from the wrong test compared to the test just run, giving false failures.

https://bugs.webkit.org/show_bug.cgi?id=94505

Summary Intermittenly, many WebKit2 tests have results from the wrong test compared t...

Brady Eidson

Reported 2012-08-20 11:03:24 PDT

Intermittenly, many WebKit2 tests have results from the wrong test compared to the test just run, giving false failures. This can easily be seen on the WK2 Lion bots, for example. where we bail early with 500+ failures. If you dig in to the failures you see that in many cases the wrong results are being attributed to the test, therefore causing the failure. If you're running locally and don't have the "abort after 500+ failures" flag, these tests are re-run as flakes and pass the 2nd time. This is huge pain point in regression testing right now.

Attachments
Patch (2.12 KB, patch) 2012-08-20 13:18 PDT, Dirk Pranke	ojan: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Dirk Pranke

Comment 1 2012-08-20 11:06:31 PDT

Is this a sub-bug of bug 94277, or do you see this as different somehow?

Dirk Pranke

Comment 2 2012-08-20 11:09:08 PDT

Examples for the lazy: http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r126041%20(2401)/results.html see especially the the dom failures, e.g.: http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r126041%20(2401)/dom/html/level2/core/createDocument08-pretty-diff.html http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r126041%20(2401)/results.html where it clearly looks like we're getting off-by-one issues ...

Brady Eidson

Comment 3 2012-08-20 11:11:36 PDT

(In reply to comment #1) > Is this a sub-bug of bug 94277, or do you see this as different somehow? I think off-by-one is an identified symptom and therefore a distinct ticket. It might verywell be the *only* issue, at which point I'll clean up the bookkeeping.

Simon Fraser (smfr)

Comment 4 2012-08-20 11:23:17 PDT

Did this problem start on > 1 bot at the same time? If so, maybe we can track it down to a specific commit. Otherwise, the test before or at the first failing test for each shard may be interesting.

Brady Eidson

Comment 5 2012-08-20 11:24:19 PDT

(In reply to comment #4) > Did this problem start on > 1 bot at the same time? If so, maybe we can track it down to a specific commit. Otherwise, the test before or at the first failing test for each shard may be interesting. This started weeks ago (and went unnoticed) and the bots don't keep long enough history to know when it started.

Dirk Pranke

Comment 6 2012-08-20 12:26:28 PDT

so, I'm able to reproduce something that looks at least similar, locally. It looks like what is happening is that the first test that fails in a shard (i.e., same directory) is failing with the message "Timed out waiting for final message from web process". Subsequent tests in the same shard then are failing with the output from the previous test, as if the output from the test that did time out isn't getting thrown away properly, and we're not communicating back to run-webkit-tests that something weird happened, and so run-webkit-tests doesn't know to recover/restart WTR either. I don't yet know why this is happening, and I don't know the code in WTR at all so I don't know if this sounds plausible. However, it should be easy to test/debug/work around, so I'm trying that now.

Dirk Pranke

Comment 7 2012-08-20 13:18:57 PDT

Created attachment 159504 [details] Patch

Ojan Vafai

Comment 8 2012-08-20 13:28:23 PDT

Comment on attachment 159504 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=159504&action=review > Tools/Scripts/webkitpy/layout_tests/port/driver.py:163 > + if 'Timed out waiting for final message from web process' in text: This should probably have a FIXME. In theory at least, we should fix the WTR bug eventually and remove this special case, right?

Balazs Kelemen

Comment 9 2012-08-20 13:44:16 PDT

(In reply to comment #8) > (From update of attachment 159504 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=159504&action=review > > > Tools/Scripts/webkitpy/layout_tests/port/driver.py:163 > > + if 'Timed out waiting for final message from web process' in text: > > This should probably have a FIXME. In theory at least, we should fix the WTR bug eventually and remove this special case, right? Before r124596, a change of mine, WTR was exiting in this case, so I think we should keep this behavior, although I'm not sure whether special case is needed in nrwt with that.

Dirk Pranke

Comment 10 2012-08-20 13:48:41 PDT

It should have a fixme to either fix the underlying bug or at least make WTR communicate the need to restart in a more consistent and/or orderly manner.

Dirk Pranke

Comment 11 2012-08-20 13:50:00 PDT

(In reply to comment #9) > (In reply to comment #8) > > (From update of attachment 159504 [details] [details]) > > View in context: https://bugs.webkit.org/attachment.cgi?id=159504&action=review > > > > > Tools/Scripts/webkitpy/layout_tests/port/driver.py:163 > > > + if 'Timed out waiting for final message from web process' in text: > > > > This should probably have a FIXME. In theory at least, we should fix the WTR bug eventually and remove this special case, right? > > Before r124596, a change of mine, WTR was exiting in this case, so I think we should keep this behavior, although I'm not sure whether special case is needed in nrwt with that. Is it possible your change is causing this bug?

Brady Eidson

Comment 12 2012-08-20 13:56:38 PDT

For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out. I think we need to land this.

Dirk Pranke

Comment 13 2012-08-20 13:59:57 PDT

Committed r126062: <http://trac.webkit.org/changeset/126062>

Ojan Vafai

Comment 14 2012-08-20 16:50:54 PDT

(In reply to comment #12) > For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out. > > I think we need to land this. I was just suggesting a FIXME. There's already a way that NRWT detects timing out tests. We should make WTR do that, then we don't need this special case. IIRC it something like spitting out a "#TIMEOUT" line.

Brady Eidson

Comment 15 2012-08-20 17:01:35 PDT

(In reply to comment #14) > (In reply to comment #12) > > For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out. > > > > I think we need to land this. > > I was just suggesting a FIXME. There's already a way that NRWT detects timing out tests. We should make WTR do that, then we don't need this special case. IIRC it something like spitting out a "#TIMEOUT" line. I think Dirk and Balazas had much of that conversation on IRC today. Unsure what the outcome was (if any)

Dirk Pranke

Comment 16 2012-08-20 17:16:38 PDT

(In reply to comment #15) > (In reply to comment #14) > > (In reply to comment #12) > > > For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out. > > > > > > I think we need to land this. > > > > I was just suggesting a FIXME. There's already a way that NRWT detects timing out tests. We should make WTR do that, then we don't need this special case. IIRC it something like spitting out a "#TIMEOUT" line. > > I think Dirk and Balazas had much of that conversation on IRC today. Unsure what the outcome was (if any) I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one. The problem seems to be that in http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201 I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank).

Balazs Kelemen

Comment 17 2012-08-21 03:50:10 PDT

> I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one. > > The problem seems to be that in > > http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201 > > I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank). In my change I assumed that resetStateToConsistentValues() will never fail if there was no timeout before. In debug I added an assertion but in release this just mean we don't output anything on stderr so nrwt could not catch the error. I am going to take a try on whether removing this assumption fix this (without the patch we landed).

Balazs Kelemen

Comment 18 2012-08-21 08:32:14 PDT

(In reply to comment #17) > > I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one. > > > > The problem seems to be that in > > > > http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201 > > > > I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank). > > In my change I assumed that resetStateToConsistentValues() will never fail if there was no timeout before. In debug I added an assertion but in release this just mean we don't output anything on stderr so nrwt could not catch the error. I am going to take a try on whether removing this assumption fix this (without the patch we landed). My theory was wrong, this is not the reason. I think Dirk's patch fix the case when the test times out but resetStateToConsistentValues() does not fail (because if it fails we was restarting wtr anyway). I think the bug is that resetStateToConsistentValues() can return true even if the test is stil running. If I dump the #PROCESS UNRESPONSIVENESS message before calling resetStateToConsistentValues(), the off-by-one failures disappear.

Balazs Kelemen

Comment 19 2012-08-21 10:02:33 PDT

Filed bug 94609 to remove the workaround.

Simon Fraser (smfr)

Comment 20 2012-08-21 10:41:06 PDT

(In reply to comment #18) > (In reply to comment #17) > > > I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one. > > > > > > The problem seems to be that in > > > > > > http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201 > > > > > > I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank). > > > > In my change I assumed that resetStateToConsistentValues() will never fail if there was no timeout before. In debug I added an assertion but in release this just mean we don't output anything on stderr so nrwt could not catch the error. I am going to take a try on whether removing this assumption fix this (without the patch we landed). > > My theory was wrong, this is not the reason. I think Dirk's patch fix the case when the test times out but resetStateToConsistentValues() does not fail (because if it fails we was restarting wtr anyway). I think the bug is that resetStateToConsistentValues() can return true even if the test is stil running. If I dump the #PROCESS UNRESPONSIVENESS message before calling resetStateToConsistentValues(), the off-by-one failures disappear. Does that mean that we still need a fix, and should reopen this bug?

Simon Fraser (smfr)

Comment 21 2012-08-21 10:52:42 PDT

I guess the patch in bug 94609 implements the real fix.

Balazs Kelemen

Comment 22 2012-08-22 04:30:13 PDT

(In reply to comment #21) > I guess the patch in bug 94609 implements the real fix. Yep.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component Tools / Tests

Assignee

Dirk Pranke

Reported

2012-08-20 11:03 PDT

Modified

2012-08-22 04:30 PDT History

CC List

9 users Show

URL

Keywords

Depends on

Blocks

94277

Dependencies

tree graph