Bug 86889 - should we get rid of the SLOW modifier in test_expectations.txt?
Status: RESOLVED WONTFIX
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Nobody
Blocks: 86796
Reported: 2012-05-18 13:20 PDT by Dirk Pranke
Modified: 2012-06-20 16:23 PDT

Description Dirk Pranke 2012-05-18 13:20:04 PDT
As part of the discussion on webkit-dev in the thread at https://lists.webkit.org/pipermail/webkit-dev/2012-May/020674.html and the subsequent discussion on bug 86796, there are open questions about whether we should move the SLOW modifier to the right side of the expectation line, whether its semantics should change, and/or whether we should get rid of it entirely.

1) Suppose you have an expectation with SLOW by itself: bug(dpranke) : foo/bar.html = SLOW

Presumably that should imply that the test is expected to pass, but it just takes a while to execute (in the current format, you would have SLOW : foo/bar.html = PASS, which is explicit).

2) Suppose you have SLOW plus an additional outcome keyword: bug(dpranke) : foo/bar.html = SLOW IMAGE

Presumably that means that the test is expected to produce a pixel test failure and take a while to execute. It does not represent "either pass slowly or fail the pixel test quickly", nor "run slowly but either pass or fail the pixel test"; i.e., this does not make the test flaky. (Of course, we could change the interpretation here; see the sketch after this list.)

3) Should it be a warning similar to "unexpected pass" if a test marked SLOW starts to complete quickly? Seems reasonable ...

4) Can we get away with only using SLOW for "SLOW PASS", and say that SLOW failures should not be allowed (since we want to move to a model where any test that fails for longer than a couple hours should get a new expectation/baseline checked in)?

5) Can/should we just get rid of SLOW altogether? If we did this, we would have to adjust the default timeout value for chromium upwards, and we would want to consider what the impact on bot cycle time would be.
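
To make the semantics in points 1 and 2 concrete, here is a minimal sketch of how a right-hand-side SLOW could be interpreted purely as a timeout modifier. This is hypothetical and not the actual webkitpy parser; the 6-second default timeout and the 5x SLOW multiplier are assumptions, and bug(dpranke) and foo/bar.html are just the placeholders from the examples above.

    # Hypothetical sketch, not the real webkitpy code: SLOW only scales the
    # per-test timeout, and the remaining keywords are the expected outcomes.
    DEFAULT_TIMEOUT_MS = 6000   # assumed default per-test timeout
    SLOW_MULTIPLIER = 5         # assumed multiplier for SLOW tests

    def parse_expectation(line):
        """Parse 'modifiers : test = expectations' into (test, outcomes, timeout_ms)."""
        left, right = line.split('=')
        modifiers, test = [part.strip() for part in left.split(':')]  # modifiers = bug link, ignored here
        keywords = right.split()
        is_slow = 'SLOW' in keywords
        outcomes = [kw for kw in keywords if kw != 'SLOW'] or ['PASS']
        timeout_ms = DEFAULT_TIMEOUT_MS * (SLOW_MULTIPLIER if is_slow else 1)
        return test, outcomes, timeout_ms

    # Point 1: SLOW by itself is read as a slow PASS.
    print(parse_expectation('bug(dpranke) : foo/bar.html = SLOW'))
    # -> ('foo/bar.html', ['PASS'], 30000)

    # Point 2: SLOW IMAGE is read as a slow pixel failure, not a flaky test.
    print(parse_expectation('bug(dpranke) : foo/bar.html = SLOW IMAGE'))
    # -> ('foo/bar.html', ['IMAGE'], 30000)

Under this reading, SLOW never makes a test flaky; it only changes how long the harness waits before reporting a TIMEOUT.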
Comment 1 Ryosuke Niwa 2012-05-18 13:33:16 PDT
(In reply to comment #0)
> 3) Should it be a warning similar to "unexpected pass" if a test marked SLOW starts to complete quickly? Seems reasonable ...

I suspect that'll add a lot of noise. Most of the slow tests I know of are not always slow.

> 4) Can we get away with only using SLOW for "SLOW PASS", and say that SLOW failures should not be allowed (since we want to move to a model where any test that fails for longer than a couple hours should get a new expectation/baseline checked in)?
> 
> 5) Can/should we just get rid of SLOW altogether? If we did this, we would have to adjust the default timeout value for chromium upwards, and we would want to consider what the impact on bot cycle time should be

I think we should do that. But how many tests are marked slow today? Maybe we can split them into smaller pieces to make them fast.
Comment 2 Ryosuke Niwa 2012-05-18 14:38:46 PDT
To answer my own question: there are 118 tests that are marked SLOW in chromium/test_expectations.txt, and some of them are from ietestcenter, so it might be quite challenging to fix all those tests and make them fast. Unfortunately, some of the slow tests are taking well over 10s to finish.

There are 258 tests that are marked TIMEOUT and not marked SKIP or WONTFIX. Most of them appear to be flaky between PASS and TIMEOUT, and I suspect many of them are just slow.

Note: if we were to run all of these TIMEOUT tests sequentially, at 10s each it would take roughly 43 minutes :(
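
For reference, a rough sketch of how one might reproduce counts like these. The file path, the '//' comment convention, and the plain substring matching are assumptions for illustration, not how webkitpy actually parses expectations.

    # Rough, hypothetical counting script; path and matching are assumptions.
    path = 'LayoutTests/platform/chromium/test_expectations.txt'

    slow = timeout_not_skipped = 0
    with open(path) as f:
        for line in f:
            line = line.split('//')[0]   # drop trailing comments
            if not line.strip():
                continue
            if 'SLOW' in line:
                slow += 1
            if 'TIMEOUT' in line and 'SKIP' not in line and 'WONTFIX' not in line:
                timeout_not_skipped += 1

    print(slow, timeout_not_skipped)

    # The 43-minute figure above: 258 entries * 10 s each = 2580 s, about 43
    # minutes sequentially; spread over 4 parallel workers it is closer to 11.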
Comment 3 Dirk Pranke 2012-05-18 14:57:14 PDT
(In reply to comment #2)

I think your 118 number includes some directories, since the editing and inspector tests are marked as SLOW, so the number of SLOW tests is likely much higher.
Comment 4 Ojan Vafai 2012-05-18 15:05:57 PDT
(In reply to comment #2)
> To answer my question, there are 118 tests that are marked slow in chromium/test_expectations.txt, and some of them are from ietestcenter so it might be quite challenging to fix all those tests and make them fast. Unfortunately, some of slow tests are taking well over 10s to finish.
> 
> There are 258 tests that are marked TIMEOUT and not marked SKIP or WONTFIX. Most of them appears to be flaky between PASS and TIMEOUT, and I suspect many of them are just slow.
> 
> Note: if we were to run all of these TIMEOUT tests sequentially for 10s, it'll take roughly 43 minutes :(

Most of the flaky timeout tests only time out once in many, many runs. I'd be curious to see how many non-flaky TIMEOUT listings we have.
Comment 5 Ryosuke Niwa 2012-05-18 15:08:33 PDT
(In reply to comment #4)
> Most of the flaky timeout tests only timeout 1 time out of many many runs. I'd be curious to see how many non-flaky TIMEOUT listings we have.

Roughly 80.
Comment 6 Ryosuke Niwa 2012-05-18 15:09:47 PDT
I guess spending 10 minutes isn't the end of the world, especially since we run tests in parallel on most of the bots (probably not on the Mac minis?).
Comment 7 Dirk Pranke 2012-05-18 15:13:39 PDT
Even the minis tend to run 2 or 4 threads in parallel.

It seems to me that we should be skipping any tests that are reliably timing out. I'm not sure what running them buys us, unless we are just hoping that they start working.
Comment 8 Ryosuke Niwa 2012-05-18 15:15:46 PDT
(In reply to comment #7)
> even the minis tend to run 2 or 4 threads in parallel.
> 
> It seems to me that we should be skipping any tests that are reliably timing out. I'm not sure what running them buys us, unless we are just hoping that they start working.

I've seen many tests that magically start working. It's typically because we fix some bug that was causing the test to time out or break a race condition, etc...
Comment 9 Dirk Pranke 2012-05-18 15:20:54 PDT
(In reply to comment #8)
> (In reply to comment #7)
> > even the minis tend to run 2 or 4 threads in parallel.
> > 
> > It seems to me that we should be skipping any tests that are reliably timing out. I'm not sure what running them buys us, unless we are just hoping that they start working.
> 
> I've seen many tests that magically start working. It's typically because we fix some bug that was causing the test to timeout or breaks time race condition, etc...

Oh sure, that happens, but I'm not convinced that we should be running these tests every time just to hope that it starts happening. It seems to me that we should skip these and have the onus be on someone to fix them. It could be that when someone actually goes to fix the bug, it turns out that the test is no longer failing.

I would probably say the same thing about tests that crash. Frankly, I'm disturbed that we have any tests that are marked as (just) CRASH at all - why haven't those bugs been fixed?
Comment 10 Ryosuke Niwa 2012-05-18 15:24:48 PDT
(In reply to comment #9)
> I would probably say the same thing about tests that crash. Frankly, I'm disturbed that we have any tests that are marked as (just) CRASH at all - why haven't those bugs been fixed?

Yes! While this should be discussed on a separate bug, I'm all for removing CRASH. I think having a CRASH expectation gives people the illusion that the tree is green and the problem is dealt with, while in reality we still DO have a crash.

Given that we're already moving away from the strict "every single layout test must be passing before you roll" model, I don't see any reason we should have a CRASH expectation.
Comment 11 Ojan Vafai 2012-05-18 15:40:51 PDT
To defend the status quo a bit:
1. If a test does output any text, we'll show that in the actual.txt file even if the test times out. So this can be very helpful for diagnosing why a test is timing out and fixing it (especially for flaky timeouts that are hard to reproduce).
2. If a test crashes, we get a stacktrace. This makes it easier to quickly look at the output of a crashing test on the flakiness dashboard and see if you have an idea what's going wrong, how to fix it, or who to CC to get it fixed.
3. It's actually not that rare for tests to stop timing out or crashing. Many times the crash is not in the test code itself, but is just something like memory corruption, so it goes from one test to another as the set of tests we skip/run changes. If we skip them, then the list of tests we skip just keeps growing indefinitely.
Comment 12 Ryosuke Niwa 2012-05-18 15:45:55 PDT
(In reply to comment #11)
> 3. It's actually not that rare for tests to stop timing out or crashing. Many times the crash is not in the test code itself, but is just something like memory corruption, so it goes from one test to another as the tests we skip/run changes. If we skip them, then the list of tests we skip just keeps growing indefinitely.

Really? I've rarely seen crashes due to memory corruption in DRT, although I do agree that there are cases where one test causes subsequent tests to crash.
Comment 13 Ojan Vafai 2012-05-18 15:47:42 PDT
(In reply to comment #12)
> (In reply to comment #11)
> > 3. It's actually not that rare for tests to stop timing out or crashing. Many times the crash is not in the test code itself, but is just something like memory corruption, so it goes from one test to another as the tests we skip/run changes. If we skip them, then the list of tests we skip just keeps growing indefinitely.
> 
> Really? I've rarely seen crashes due to memory corruptions in DRT although I do agree that there are cases where one test causes subsequent tests to crash.

Sure. My point is more that skipping tests that are not themselves the direct cause of the crash would just mean an ever-growing list of skipped tests.
Comment 14 Dirk Pranke 2012-05-18 16:18:52 PDT
(In reply to comment #13)
> Sure. My point is more that skipping tests that are not themselves the direct cause of the crash would just mean an ever growing list of skipped tests.

Of course, right now we have an ever-growing list of flaky crashing tests :). My hope (probably overly optimistic) is that people would treat a test being skipped as more serious than a test being marked as flaky, and would be more inclined to dig into it.
Comment 15 Ojan Vafai 2012-05-18 16:34:35 PDT
(In reply to comment #14)
> (In reply to comment #13)
> > Sure. My point is more that skipping tests that are not themselves the direct cause of the crash would just mean an ever growing list of skipped tests.
> 
> Of course, right now we have an ever-growing list of flaky crashing tests :). My hope (probably overly optimistic) is that people would treat a test being skipped as more serious than a test being marked as flaky, and would be more inclined to dig into it.

Do we know that to be true? My experience is that we tend to think the situation w/ test_expectations.txt is worse than it actually is.
Comment 16 Dirk Pranke 2012-05-18 17:26:43 PDT
(In reply to comment #15)
> (In reply to comment #14)
> > (In reply to comment #13)
> > > Sure. My point is more that skipping tests that are not themselves the direct cause of the crash would just mean an ever growing list of skipped tests.
> > 
> > Of course, right now we have an ever-growing list of flaky crashing tests :). My hope (probably overly optimistic) is that people would treat a test being skipped as more serious than a test being marked as flaky, and would be more inclined to dig into it.
> 
> Do we know that to be true? My experience is that we tend to think the situation w/ test_expectations.txt is worse than it actually is.

Well, there are about 132 flaky crashes listed in test_expectations.txt :). There are about 6 non-flaky ones.
Comment 17 Ojan Vafai 2012-05-18 17:40:27 PDT
(In reply to comment #16)
> (In reply to comment #15)
> > (In reply to comment #14)
> > > (In reply to comment #13)
> > > > Sure. My point is more that skipping tests that are not themselves the direct cause of the crash would just mean an ever growing list of skipped tests.
> > > 
> > > Of course, right now we have an ever-growing list of flaky crashing tests :). My hope (probably overly optimistic) is that people would treat a test being skipped as more serious than a test being marked as flaky, and would be more inclined to dig into it.
> > 
> > Do we know that to be true? My experience is that we tend to think the situation w/ test_expectations.txt is worse than it actually is.
> 
> Well, there's about 132 flaky crashes listed in test_expectations.txt :). There are about 6 non-flaky expectations.

Sure, but how many were there a year ago? And how many of those still crash, versus being just incorrectly marked? (This is something we can only answer if we run them.)
Comment 18 Dirk Pranke 2012-05-18 17:52:11 PDT
(In reply to comment #17)
> (In reply to comment #16)
> > (In reply to comment #15)
> > > (In reply to comment #14)
> > > > (In reply to comment #13)
> > > > > Sure. My point is more that skipping tests that are not themselves the direct cause of the crash would just mean an ever growing list of skipped tests.
> > > > 
> > > > Of course, right now we have an ever-growing list of flaky crashing tests :). My hope (probably overly optimistic) is that people would treat a test being skipped as more serious than a test being marked as flaky, and would be more inclined to dig into it.
> > > 
> > > Do we know that to be true? My experience is that we tend to think the situation w/ test_expectations.txt is worse than it actually is.
> > 
> > Well, there's about 132 flaky crashes listed in test_expectations.txt :). There are about 6 non-flaky expectations.
> 
> Sure, but how many were there a year ago? How many of those still crash or are just incorrectly marked (this is something we can answer only if we run them).

Dunno, and fair questions. I would suspect that we would need some other process to deal with any tests (including ones we already have) that are marked with SKIP but not WONTFIX.
Comment 19 Dirk Pranke 2012-06-20 16:23:39 PDT
Closing this as WONTFIX for now ... it seems there's no clear consensus at this point that removing SLOW would be better than leaving it.