Bug 90892

Summary: results.html should handle flaky tests differently
Product: WebKit
Reporter: Ojan Vafai <ojan>
Component: Tools / Tests
Assignee: Ojan Vafai <ojan>
Status: RESOLVED FIXED
Severity: Normal
CC: abarth, dpranke, kkristof, rniwa, simon.fraser
Priority: P2
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Attachments: Patch (flags: dpranke: review+)

Description Ojan Vafai 2012-07-10 10:00:46 PDT
We should have two flaky lists.
1. Tests that fail the first run and pass the second.
2. Tests that fail both runs but in different ways.

List 1 should come after tests that timed out and tests with stderr output (before "expected to fail but passed"). List 2 should stay where the flaky tests currently are. List 1 is consistently just noise that makes the page harder to make sense of. Also, I frequently want to flag the list of reliable failures to rerun, and it's annoying to have to tab through all the flaky passes to get to the timeouts.
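
As a rough illustration (a hand-written sketch, not the eventual patch; the function and the list names are hypothetical), results.html could bucket tests into the two lists from the per-run results that full_results.json records:

// Hand-written sketch, not the eventual patch: bucket a test into the two
// proposed flaky lists, assuming its full_results.json entry has an "actual"
// field with one result token per run (e.g. "TEXT PASS" or "TEXT IMAGE").
function classifyFlakiness(testResult)
{
    var runs = testResult.actual.split(' ');
    if (runs.length < 2)
        return 'not-flaky'; // only ran once, so it can't be flaky
    if (runs[0] != 'PASS' && runs[1] == 'PASS')
        return 'failed-then-passed'; // list 1: mostly noise
    if (runs[0] != runs[1])
        return 'failed-differently'; // list 2: failed both runs, in different ways
    return 'not-flaky'; // same failure both runs: a reliable failure
}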
Comment 1 Dirk Pranke 2012-07-11 12:23:01 PDT
I'm sure you realize this already but we don't currently have a way to compare the output from the two runs to see if they are different. I don't know that it would be particularly hard to add that.

Also, I'm a bit concerned that implementing this just makes it even easier to ignore the tests in list 1, which seem like they should either be marked as expected flaky or actually be fixed.
Comment 2 Ojan Vafai 2012-07-11 15:05:24 PDT
(In reply to comment #1)
> I'm sure you realize this already but we don't currently have a way to compare the output from the two runs to see if they are different. I don't know that it would be particularly hard to add that.

full_results.json, which is what results.html uses, has this information and shows it in the UI already. We don't technically know which run was first and which was second and we don't have the -actual.* files for the first run, but we have the type of failure for each run.
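
For illustration, the relevant entry looks roughly like this (hand-written example, not real output; the test name and values are made up):

// Hand-written illustration of the relevant shape of full_results.json,
// a JSONP payload. A fail-then-pass flake shows up as two tokens in "actual".
ADD_RESULTS({
    "tests": {
        "fast": {
            "dom": {
                "flaky-test.html": {
                    "expected": "PASS",
                    "actual": "TEXT PASS" // text diff on the first run, pass on the retry
                }
            }
        }
    }
});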

> Also, I'm a bit concerned that implementing this just makes it even easier to ignore the tests in list 1, which seem like they should either be marked as expected flaky or actually be fixed.

That's true. In practice, I think that this is already ignored. So the cost is that other non-flaky failures get missed.

In fact, upon further thought, I think we should hide list 1 by default. There's just too much noise right now in the results.html output.
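
If we hid it, it could be something like this (illustrative sketch only, not the committed patch; the element id and function names are made up):

// Illustrative sketch, not the committed patch: render list 1 collapsed
// by default behind a toggle so it stops drowning out the real failures.
function flakyPassSectionHtml(testNames)
{
    var html = '<a href="javascript:toggleFlakyPasses()">' + testNames.length +
        ' tests failed the first run but passed the retry (show)</a>' +
        '<div id="flaky-passes" style="display:none">';
    testNames.forEach(function(name) {
        html += '<div class="test">' + name + '</div>';
    });
    return html + '</div>';
}

function toggleFlakyPasses()
{
    var section = document.getElementById('flaky-passes');
    section.style.display = section.style.display == 'none' ? '' : 'none';
}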
Comment 3 Dirk Pranke 2012-07-11 15:11:13 PDT
(In reply to comment #2)
> (In reply to comment #1)
> > I'm sure you realize this already but we don't currently have a way to compare the output from the two runs to see if they are different. I don't know that it would be particularly hard to add that.
> 
> full_results.json, which is what results.html uses, has this information and shows it in the UI already. We don't technically know which run was first and which was second and we don't have the -actual.* files for the first run, but we have the type of failure for each run.
> 

True. Shouldn't we have the -actuals for the first run as well?

> > Also, I'm a bit concerned that implementing this just makes it even easier to ignore the tests in list 1, which seem like they should either be marked as expected flaky or actually be fixed.
> 
> That's true. In practice, I think that this is already ignored. So the cost is that other non-flaky failures get missed.
> 
> In fact, upon further thought, I think we should hide list 1 by default. There's just too much noise right now in the results.html output.

That would be hiding unexpected behavior, which seems kinda bad. If others thought this was a good idea, I'd be willing to give it a shot, though.
Comment 4 Ojan Vafai 2012-07-11 15:18:28 PDT
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > I'm sure you realize this already but we don't currently have a way to compare the output from the two runs to see if they are different. I don't know that it would be particularly hard to add that.
> > 
> > full_results.json, which is what results.html uses, has this information and shows it in the UI already. We don't technically know which run was first and which was second and we don't have the -actual.* files for the first run, but we have the type of failure for each run.
> > 
> 
> True. Shouldn't we have the -actuals for the first run as well?

Do we? Where do we store them? I thought the second run overwrote the first one.

> > > Also, I'm a bit concerned that implementing this just makes it even easier to ignore the tests in list 1, which seem like they should either be marked as expected flaky or actually be fixed.
> > 
> > That's true. In practice, I think that this is already ignored. So the cost is that other non-flaky failures get missed.
> > 
> > In fact, upon further thought, I think we should hide list 1 by default. There's just too much noise right now in the results.html output.
> 
> That would be hiding unexpected behavior, which seems kinda bad. If others thought this was a good idea, I'd be willing to give it a shot, though.

I suppose. I see a slew (~12) of flaky passes every time I run the tests. Maybe I just encounter this more because I usually run with -f and I'm just getting what I ask for.
Comment 5 Dirk Pranke 2012-07-11 15:55:16 PDT
The output for the retry is in layout-test-results/retries/...
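
So comparing the two runs' outputs is doable; a hedged sketch of a hypothetical helper (not part of webkitpy or results.html, written here with Node.js fs APIs just to show the idea):

// Hypothetical helper, not part of the tooling: with the retry's output under
// layout-test-results/retries/, the two runs' -actual.txt files can be diffed.
var fs = require('fs');
var path = require('path');

function actualOutputsDiffer(resultsDir, testName)
{
    // e.g. testName = 'fast/dom/flaky-test' maps to fast/dom/flaky-test-actual.txt
    var firstRun = path.join(resultsDir, testName + '-actual.txt');
    var retry = path.join(resultsDir, 'retries', testName + '-actual.txt');
    if (!fs.existsSync(firstRun) || !fs.existsSync(retry))
        return false; // can't tell without both outputs
    return fs.readFileSync(firstRun, 'utf8') !== fs.readFileSync(retry, 'utf8');
}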
Comment 6 Ojan Vafai 2012-07-17 10:59:46 PDT
Created attachment 152788: Patch
Comment 7 Ojan Vafai 2012-07-17 11:44:44 PDT
Committed r122864: <http://trac.webkit.org/changeset/122864>