Intermittenly, many WebKit2 tests have results from the wrong test compared to the test just run, giving false failures. This can easily be seen on the WK2 Lion bots, for example. where we bail early with 500+ failures. If you dig in to the failures you see that in many cases the wrong results are being attributed to the test, therefore causing the failure. If you're running locally and don't have the "abort after 500+ failures" flag, these tests are re-run as flakes and pass the 2nd time. This is huge pain point in regression testing right now.
Is this a sub-bug of bug 94277, or do you see this as different somehow?
Examples for the lazy: http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r126041%20(2401)/results.html see especially the the dom failures, e.g.: http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r126041%20(2401)/dom/html/level2/core/createDocument08-pretty-diff.html http://build.webkit.org/results/Apple%20Lion%20Release%20WK2%20(Tests)/r126041%20(2401)/results.html where it clearly looks like we're getting off-by-one issues ...
(In reply to comment #1) > Is this a sub-bug of bug 94277, or do you see this as different somehow? I think off-by-one is an identified symptom and therefore a distinct ticket. It might verywell be the *only* issue, at which point I'll clean up the bookkeeping.
Did this problem start on > 1 bot at the same time? If so, maybe we can track it down to a specific commit. Otherwise, the test before or at the first failing test for each shard may be interesting.
(In reply to comment #4) > Did this problem start on > 1 bot at the same time? If so, maybe we can track it down to a specific commit. Otherwise, the test before or at the first failing test for each shard may be interesting. This started weeks ago (and went unnoticed) and the bots don't keep long enough history to know when it started.
so, I'm able to reproduce something that looks at least similar, locally. It looks like what is happening is that the first test that fails in a shard (i.e., same directory) is failing with the message "Timed out waiting for final message from web process". Subsequent tests in the same shard then are failing with the output from the previous test, as if the output from the test that did time out isn't getting thrown away properly, and we're not communicating back to run-webkit-tests that something weird happened, and so run-webkit-tests doesn't know to recover/restart WTR either. I don't yet know why this is happening, and I don't know the code in WTR at all so I don't know if this sounds plausible. However, it should be easy to test/debug/work around, so I'm trying that now.
Created attachment 159504 [details] Patch
Comment on attachment 159504 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=159504&action=review > Tools/Scripts/webkitpy/layout_tests/port/driver.py:163 > + if 'Timed out waiting for final message from web process' in text: This should probably have a FIXME. In theory at least, we should fix the WTR bug eventually and remove this special case, right?
(In reply to comment #8) > (From update of attachment 159504 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=159504&action=review > > > Tools/Scripts/webkitpy/layout_tests/port/driver.py:163 > > + if 'Timed out waiting for final message from web process' in text: > > This should probably have a FIXME. In theory at least, we should fix the WTR bug eventually and remove this special case, right? Before r124596, a change of mine, WTR was exiting in this case, so I think we should keep this behavior, although I'm not sure whether special case is needed in nrwt with that.
It should have a fixme to either fix the underlying bug or at least make WTR communicate the need to restart in a more consistent and/or orderly manner.
(In reply to comment #9) > (In reply to comment #8) > > (From update of attachment 159504 [details] [details]) > > View in context: https://bugs.webkit.org/attachment.cgi?id=159504&action=review > > > > > Tools/Scripts/webkitpy/layout_tests/port/driver.py:163 > > > + if 'Timed out waiting for final message from web process' in text: > > > > This should probably have a FIXME. In theory at least, we should fix the WTR bug eventually and remove this special case, right? > > Before r124596, a change of mine, WTR was exiting in this case, so I think we should keep this behavior, although I'm not sure whether special case is needed in nrwt with that. Is it possible your change is causing this bug?
For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out. I think we need to land this.
Committed r126062: <http://trac.webkit.org/changeset/126062>
(In reply to comment #12) > For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out. > > I think we need to land this. I was just suggesting a FIXME. There's already a way that NRWT detects timing out tests. We should make WTR do that, then we don't need this special case. IIRC it something like spitting out a "#TIMEOUT" line.
(In reply to comment #14) > (In reply to comment #12) > > For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out. > > > > I think we need to land this. > > I was just suggesting a FIXME. There's already a way that NRWT detects timing out tests. We should make WTR do that, then we don't need this special case. IIRC it something like spitting out a "#TIMEOUT" line. I think Dirk and Balazas had much of that conversation on IRC today. Unsure what the outcome was (if any)
(In reply to comment #15) > (In reply to comment #14) > > (In reply to comment #12) > > > For the lazy - http://trac.webkit.org/changeset/124596 is when the WKTR stopped exiting when the WebProcess timed out. > > > > > > I think we need to land this. > > > > I was just suggesting a FIXME. There's already a way that NRWT detects timing out tests. We should make WTR do that, then we don't need this special case. IIRC it something like spitting out a "#TIMEOUT" line. > > I think Dirk and Balazas had much of that conversation on IRC today. Unsure what the outcome was (if any) I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one. The problem seems to be that in http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201 I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank).
> I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one. > > The problem seems to be that in > > http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201 > > I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank). In my change I assumed that resetStateToConsistentValues() will never fail if there was no timeout before. In debug I added an assertion but in release this just mean we don't output anything on stderr so nrwt could not catch the error. I am going to take a try on whether removing this assumption fix this (without the patch we landed).
(In reply to comment #17) > > I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one. > > > > The problem seems to be that in > > > > http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201 > > > > I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank). > > In my change I assumed that resetStateToConsistentValues() will never fail if there was no timeout before. In debug I added an assertion but in release this just mean we don't output anything on stderr so nrwt could not catch the error. I am going to take a try on whether removing this assumption fix this (without the patch we landed). My theory was wrong, this is not the reason. I think Dirk's patch fix the case when the test times out but resetStateToConsistentValues() does not fail (because if it fails we was restarting wtr anyway). I think the bug is that resetStateToConsistentValues() can return true even if the test is stil running. If I dump the #PROCESS UNRESPONSIVENESS message before calling resetStateToConsistentValues(), the off-by-one failures disappear.
Filed bug 94609 to remove the workaround.
(In reply to comment #18) > (In reply to comment #17) > > > I did add the FIXME ... Balazs' patch was doing an end-run around the existing timeout detection logic, and he suggested another patch that would've exited instead of letting NRWT abort things, but I decided not to land that one. > > > > > > The problem seems to be that in > > > > > > http://trac.webkit.org/browser/trunk/Tools/WebKitTestRunner/TestInvocation.cpp#L201 > > > > > > I think false is being returned from resetStateToConsistentValues() (which seems like it would only happen if we timed out while trying to load about:blank). > > > > In my change I assumed that resetStateToConsistentValues() will never fail if there was no timeout before. In debug I added an assertion but in release this just mean we don't output anything on stderr so nrwt could not catch the error. I am going to take a try on whether removing this assumption fix this (without the patch we landed). > > My theory was wrong, this is not the reason. I think Dirk's patch fix the case when the test times out but resetStateToConsistentValues() does not fail (because if it fails we was restarting wtr anyway). I think the bug is that resetStateToConsistentValues() can return true even if the test is stil running. If I dump the #PROCESS UNRESPONSIVENESS message before calling resetStateToConsistentValues(), the off-by-one failures disappear. Does that mean that we still need a fix, and should reopen this bug?
I guess the patch in bug 94609 implements the real fix.
(In reply to comment #21) > I guess the patch in bug 94609 implements the real fix. Yep.