Bug 90892

Summary: results.html should handle flaky tests differently
Product: WebKit
Reporter: Ojan Vafai <ojan>
Component: Tools / Tests
Assignee: Ojan Vafai <ojan>
Status: RESOLVED FIXED
Severity: Normal
CC: abarth, dpranke, kkristof, rniwa, simon.fraser
Priority: P2
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Attachments: Patch (flags: dpranke: review+)

Description Ojan Vafai 2012-07-10 10:00:46 PDT
We should have two flaky lists.
1. Tests that fail the first run and pass the second.
2. Tests that fail both runs but in different ways.

List 1 should come after tests that timed out and tests with stderr output (before "expected to fail but passed"). List 2 should stay where the flaky tests currently are. List 1 is consistently just noise that makes the page harder to make sense of. Also, I frequently want to flag the list of reliable failures to rerun, and it's annoying to have to tab through all the flaky passes to get to the timeouts.
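
As a rough illustration (a hand-written sketch, not the eventual patch; the function and the list names are hypothetical), results.html could bucket tests into the two lists from the per-run results that full_results.json records:

// Hand-written sketch, not the eventual patch: bucket a test into the two
// proposed flaky lists, assuming its full_results.json entry has an "actual"
// field with one result token per run (e.g. "TEXT PASS" or "TEXT IMAGE").
function classifyFlakiness(testResult)
{
    var runs = testResult.actual.split(' ');
    if (runs.length < 2)
        return 'not-flaky'; // only ran once, so it can't be flaky
    if (runs[0] != 'PASS' && runs[1] == 'PASS')
        return 'failed-then-passed'; // list 1: mostly noise
    if (runs[0] != runs[1])
        return 'failed-differently'; // list 2: failed both runs, in different ways
    return 'not-flaky'; // same failure both runs: a reliable failure
}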
Comment 1 Dirk Pranke 2012-07-11 12:23:01 PDT
I'm sure you realize this already but we don't currently have a way to compare the output from the two runs to see if they are different. I don't know that it would be particularly hard to add that.

Also, I'm a bit concerned that implementing this just makes it even easier to ignore the tests in list 1, which seem like they should either be marked as expected flaky or actually be fixed.
Comment 2 Ojan Vafai 2012-07-11 15:05:24 PDT
(In reply to comment #1)
> I'm sure you realize this already but we don't currently have a way to compare the output from the two runs to see if they are different. I don't know that it would be particularly hard to add that.

full_results.json, which is what results.html uses, has this information and shows it in the UI already. We don't technically know which run was first and which was second and we don't have the -actual.* files for the first run, but we have the type of failure for each run.
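
For illustration, the relevant entry looks roughly like this (hand-written example, not real output; the test name and values are made up):

// Hand-written illustration of the relevant shape of full_results.json,
// a JSONP payload. A fail-then-pass flake shows up as two tokens in "actual".
ADD_RESULTS({
    "tests": {
        "fast": {
            "dom": {
                "flaky-test.html": {
                    "expected": "PASS",
                    "actual": "TEXT PASS" // text diff on the first run, pass on the retry
                }
            }
        }
    }
});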

> Also, I'm a bit concerned that implementing this just makes it even easier to ignore the tests in list 1, which seem like they should either be marked as expected flaky or actually be fixed.

That's true. In practice, I think that this is already ignored. So the cost is that other non-flaky failures get missed.

In fact, upon further thought, I think we should hide list 1 by default. There's just too much noise right now in the results.html output.
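
If we hid it, it could be something like this (illustrative sketch only, not the committed patch; the element id and function names are made up):

// Illustrative sketch, not the committed patch: render list 1 collapsed
// by default behind a toggle so it stops drowning out the real failures.
function flakyPassSectionHtml(testNames)
{
    var html = '<a href="javascript:toggleFlakyPasses()">' + testNames.length +
        ' tests failed the first run but passed the retry (show)</a>' +
        '<div id="flaky-passes" style="display:none">';
    testNames.forEach(function(name) {
        html += '<div class="test">' + name + '</div>';
    });
    return html + '</div>';
}

function toggleFlakyPasses()
{
    var section = document.getElementById('flaky-passes');
    section.style.display = section.style.display == 'none' ? '' : 'none';
}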
Comment 3 Dirk Pranke 2012-07-11 15:11:13 PDT
(In reply to comment #2)
> (In reply to comment #1)
> > I'm sure you realize this already but we don't currently have a way to compare the output from the two runs to see if they are different. I don't know that it would be particularly hard to add that.
> 
> full_results.json, which is what results.html uses, has this information and shows it in the UI already. We don't technically know which run was first and which was second and we don't have the -actual.* files for the first run, but we have the type of failure for each run.
> 

True. Shouldn't we have the -actuals for the first run as well?

> > Also, I'm a bit concerned that implementing this just makes it even easier to ignore the tests in list 1, which seem like they should either be marked as expected flaky or actually be fixed.
> 
> That's true. In practice, I think that this is already ignored. So the cost is that other non-flaky failures get missed.
> 
> In fact, upon further thought, I think we should hide list 1 by default. There's just too much noise right now in the results.html output.

That would be hiding unexpected behavior, which seems kinda bad. If others thought this was a good idea, I'd be willing to give it a shot, though.
Comment 4 Ojan Vafai 2012-07-11 15:18:28 PDT
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > I'm sure you realize this already but we don't currently have a way to compare the output from the two runs to see if they are different. I don't know that it would be particularly hard to add that.
> > 
> > full_results.json, which is what results.html uses, has this information and shows it in the UI already. We don't technically know which run was first and which was second and we don't have the -actual.* files for the first run, but we have the type of failure for each run.
> > 
> 
> True. Shouldn't we have the -actuals for the first run as well?

Do we? Where do we store them? I thought the second run overwrote the first one.

> > > Also, I'm a bit concerned that implementing this just makes it even easier to ignore the tests in list 1, which seem like they should either be marked as expected flaky or actually be fixed.
> > 
> > That's true. In practice, I think that this is already ignored. So the cost is that other non-flaky failures get missed.
> > 
> > In fact, upon further thought, I think we should hide list 1 by default. There's just too much noise right now in the results.html output.
> 
> That would be hiding unexpected behavior, which seems kinda bad. If others thought this was a good idea, I'd be willing to give it a shot, though.

I suppose. I see a slew (~12) of flaky passes every time I run the tests. Maybe I just encounter this more because I usually run with -f and I'm just getting what I ask for.
Comment 5 Dirk Pranke 2012-07-11 15:55:16 PDT
The output for the retry is in layout-test-results/retries/...
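
So comparing the two runs' outputs is doable; a hedged sketch of a hypothetical helper (not part of webkitpy or results.html, written here with Node.js fs APIs just to show the idea):

// Hypothetical helper, not part of the tooling: with the retry's output under
// layout-test-results/retries/, the two runs' -actual.txt files can be diffed.
var fs = require('fs');
var path = require('path');

function actualOutputsDiffer(resultsDir, testName)
{
    // e.g. testName = 'fast/dom/flaky-test' maps to fast/dom/flaky-test-actual.txt
    var firstRun = path.join(resultsDir, testName + '-actual.txt');
    var retry = path.join(resultsDir, 'retries', testName + '-actual.txt');
    if (!fs.existsSync(firstRun) || !fs.existsSync(retry))
        return false; // can't tell without both outputs
    return fs.readFileSync(firstRun, 'utf8') !== fs.readFileSync(retry, 'utf8');
}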
Comment 6 Ojan Vafai 2012-07-17 10:59:46 PDT
Created attachment 152788: Patch
Comment 7 Ojan Vafai 2012-07-17 11:44:44 PDT
Committed r122864: <http://trac.webkit.org/changeset/122864>