Bug 63767 - Remove the concept of "being wedged" from new-run-webkit-tests
Summary: Remove the concept of "being wedged" from new-run-webkit-tests
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Adam Barth
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-06-30 16:12 PDT by Adam Barth
Modified: 2022-02-27 23:43 PST
CC List: 2 users

See Also:


Attachments
Patch (8.48 KB, patch)
2011-06-30 16:14 PDT, Adam Barth
eric: review+

Description Adam Barth 2011-06-30 16:12:51 PDT
Remove the concept of "being wedged" from new-run-webkit-tests
Comment 1 Dirk Pranke 2011-06-30 16:14:14 PDT
Why?
Comment 2 Adam Barth 2011-06-30 16:14:47 PDT
Created attachment 99391 [details]
Patch
Comment 3 Adam Barth 2011-06-30 16:15:53 PDT
From the ChangeLog:

        Worker processes shouldn't ever become wedged.  My understanding is
        that this code was originally motivated by the old threading-based
        design but no longer serves any purpose.

        Note: If we actually have a problem with the test harness getting
        stuck, buildbot will kill us.
Comment 4 Dirk Pranke 2011-06-30 16:16:17 PDT
Comment on attachment 99391 [details]
Patch

It is still possible for workers to wedge up, and they should not hang the overall test run. I'm R-'ing this unless you can explain why we don't need this concept any more.
Comment 5 Dirk Pranke 2011-06-30 16:17:34 PDT
(In reply to comment #3)
> From the ChangeLog:
> 
>         Worker processes shouldn't ever become wedged.  My understanding is
>         that this code was originally motivated by the old threading-based
>         design but no longer serves any purpose.
>

This is incorrect. Workers can still get wedged, it just happens much less often.
 
>         Note: If we actually have a problem with the test harness getting
>         stuck, buildbot will kill us.

When this happens, you get much less useful output from the test run, and it takes a lot longer. Plus, it's harder to diagnose what happened.
Comment 6 Adam Barth 2011-06-30 16:19:15 PDT
> This is incorrect. Workers can still get wedged, it just happens much less often.

How does this happen?  How often does it happen?
Comment 7 Dirk Pranke 2011-06-30 16:23:14 PDT
(In reply to comment #6)
> > This is incorrect. Workers can still get wedged, it just happens much less often.
> 
> How does this happen?  How often does it happen?

In the chromium port, in particular, the test timeout is enforced by DRT itself, rather than by the python code. If DRT wedges and doesn't self-timeout, the worker will wedge also.

It doesn't seem to happen that often, but I've definitely seen it happen.

It is possible that if you fix the Chromium code and we can safely assert that workers are never wedging then this code will no longer be necessary. I personally think it's still a good defense-in-depth that makes the tool more robust.

Also, I'll note that a few days ago you were asking about --run-singly and run_in_another_thread. If you got rid of this code, that would be yet another reason to have to keep run_in_another_thread around. I'd rather get rid of run_in_another_thread(), since the manager is a more appropriate place for this sort of logic.
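
To make the discussion concrete, here is a minimal sketch of the kind of manager-side wedge check being debated. The names, the threshold, and the structure are illustrative assumptions, not the actual new-run-webkit-tests code:

    # Illustrative sketch only -- not the actual NRWT manager code.
    # The manager records a "last activity" timestamp for each worker and
    # treats any worker that has been silent for too long as wedged.
    import time

    WEDGE_TIMEOUT_SECONDS = 600  # hypothetical threshold

    def find_wedged_workers(workers):
        """Return workers that have not reported progress recently."""
        now = time.time()
        return [worker for worker in workers
                if now - worker.last_activity_time > WEDGE_TIMEOUT_SECONDS]

    # The manager's polling loop could call find_wedged_workers() between
    # reads from the result channel and kill/restart anything returned, so
    # a hung DRT (and the worker blocked on it) cannot stall the whole run.
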
Comment 8 Adam Barth 2011-06-30 16:28:30 PDT
I just checked the last 100 builds on Webkit Mac10.6 (deps) and this condition has not occurred.  Would you like me to check the last thousand or would you like me to check another bot?
Comment 9 Adam Barth 2011-06-30 16:29:04 PDT
(To be clear, that was the Chromium bot)
Comment 10 Dirk Pranke 2011-06-30 16:31:08 PDT
(In reply to comment #9)
> (To be clear, that was the Chromium bot)

I would definitely want to check Chromium Win as well, and I would feel better if we could say that we hadn't seen this for a couple weeks. 

That said, I believe that this feature should stay in the tool, period.
Comment 11 Adam Barth 2011-06-30 16:32:02 PDT
> That said, I believe that this feature should stay in the tool, period.

To be clear, you believe the feature should stay in the tool regardless of whether the condition ever occurs?
Comment 12 Dirk Pranke 2011-06-30 16:36:52 PDT
(In reply to comment #11)
> > That said, I believe that this feature should stay in the tool, period.
> 
> To be clear, you believe the feature should stay in the tool regardless of whether the condition ever occurs?

Not exactly. I believe that the feature exists in order to keep the tool from deadlocking and hanging. The way the tool is currently written, deadlock is otherwise possible. If you can convince me that deadlock is not otherwise possible, I would be more open to removing it.
Comment 13 Adam Barth 2011-06-30 16:40:42 PDT
This hasn't happened in the 100 most recent builds on Chromium Windows either.

So, we've established that if this condition does ever occur, it probably occurs in less than 0.5 percent of runs.
Comment 14 Dirk Pranke 2011-06-30 16:43:15 PDT
(In reply to comment #13)
> This hasn't happened in the 100 most recent builds on Chromium Windows either.
> 
> So, we've established that if this condition does ever occur, it probably occurs in less than 0.5 percent of runs.

I'm not trying to be a nag here; comment #12 best explains why I think this feature should stay in the tool. 

That said, you are assuming a uniform distribution of events and that past performance predicts future performance.

If someone introduces a patch into Chromium DRT that hangs it, this will occur 100% of the time.
Comment 15 Dirk Pranke 2011-06-30 16:44:51 PDT
(In reply to comment #14)
> (In reply to comment #13)
> > This hasn't happened in the 100 most recent builds on Chromium Windows either.
> > 
> > So, we've established that if this condition does ever occur, it probably occurs in less than 0.5 percent of runs.
> 
> I'm not trying to be a nag here; comment #12 best explains why I think this feature should stay in the tool. 
> 
> That said, you are assuming a uniform distribution of events and that past performance predicts future performance.
> 
> If someone introduces a patch into Chromium DRT that hangs it, this will occur 100% of the time.

Oh, and in addition, this code protects not only against currently executing (hopefully correctly written) code but also against code that may be written in the future. E.g., you may find, as you are getting the Apple Win port working or porting this to work with some new WebKit port, that your DRTs hang on occasion, at which point this code would once again be useful.
Comment 16 Adam Barth 2011-06-30 16:48:02 PDT
If we need to kill DRT, we can do that in the worker process.  If the worker process has a bug that causes it to hang, we can fix that bug.  In neither case do we need a redundant layer of timeout.  It's not helping anyone.
Comment 17 Dirk Pranke 2011-06-30 16:52:32 PDT
(In reply to comment #16)
> If we need to kill DRT, we can do that in the worker process.  If the worker process has a bug that causes it to hang, we can fix that bug.  In neither case do we need a redundant layer of timeout.  It's not helping anyone.

No, you can't kill it in the worker process, because the worker process is likely blocked inside port-specific code in run_test(). 

If the worker process has a bug that causes it to hang, it is true that you can fix that bug, but in the meantime the tool will be hanging and bots and users will be less happy than they would be otherwise.

So, in both cases the redundant layer of timeouts makes the tool more reliable and robust.
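
For comparison, a worker-side deadline of the sort Adam proposes could look roughly like the sketch below. This is an assumption-laden illustration (the driver object, its pid attribute, and the run_test() call are hypothetical names, not the real port code); whether a check at this layer alone is sufficient is exactly what is being debated here.

    # Illustrative sketch only: a worker-side watchdog thread kills the
    # DRT process if a single test exceeds its deadline, so the blocking
    # read inside run_test() eventually sees EOF and returns.
    import os
    import signal
    import threading

    def run_test_with_deadline(driver, test_input, deadline_seconds):
        def _kill_driver():
            try:
                os.kill(driver.pid, signal.SIGKILL)  # POSIX-only in this sketch
            except OSError:
                pass  # the driver already exited on its own
        watchdog = threading.Timer(deadline_seconds, _kill_driver)
        watchdog.start()
        try:
            return driver.run_test(test_input)  # may block waiting on DRT
        finally:
            watchdog.cancel()
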
Comment 18 Adam Barth 2011-06-30 16:53:10 PDT
Comment on attachment 99391 [details]
Patch

I've now checked every build from the Chromium Windows (deps) builder that exists.  This condition has never occurred.  Our estimate is now that it occurs less than twice per thousand runs.  This code does not appear to be useful.  If this becomes a problem in the future, we can add the code back.
Comment 19 Dirk Pranke 2011-06-30 17:00:26 PDT
This probably won't convince you, but I will also point out that this code would be useful to anyone who was using the --worker-model=threads feature on a box with Python 2.5 (which would happen if we were running on Leopard or Ubuntu Hardy).

However, you have already also suggested that that code should be removed as well :).
Comment 20 Adam Barth 2011-06-30 22:54:25 PDT
> This probably won't convince you, but I will also point out that this code would be useful to anyone who was using the --worker-model=threads feature on a box with Python 2.5 (which would happen if we were running on Leopard or Ubuntu Hardy).

That code is gone now.  Hopefully we won't need to run the tests on Hardy anymore since we've moved to Lucid.

The anti-wedging code also causes problems when workers legitimately need to block for moderate periods of time.  For example, on Mac, we want to run /usr/bin/sample on tests that time out in order to understand why they time out.  The sample tool effectively blocks for a while in order to capture stack traces from DRT.

By removing the anti-wedging code, we give the worker process more rope, which it can use to hang itself or to do some useful work.  Given that we can't seem to find an example of a worker wedging in the 500 runs we've examined, it seems like the balance might lie with giving the worker some more rope at this time.

These decisions are all very reversible.  If we run into trouble with the test suite timing out and getting killed by buildbot, we can address that problem.
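
For context, capturing the sample mentioned above might look roughly like this. The helper name and report path are hypothetical, and the sample arguments (10 seconds at 10 ms intervals, written to a file with -file) are stated here as an assumption about how the scripts invoke it:

    # Illustrative sketch only -- not the actual worker code.
    import subprocess

    def sample_hung_drt(drt_pid, hang_report_path):
        # Blocks for roughly the sampling duration, which is the
        # "moderate period of time" the worker legitimately spends here.
        subprocess.call(["/usr/bin/sample", str(drt_pid), "10", "10",
                         "-file", hang_report_path])
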
Comment 21 Eric Seidel (no email) 2011-06-30 23:00:39 PDT
(In reply to comment #20)
> The anti-wedging code also causes problems when workers legitimately need to block for moderate periods of time.  For example, on Mac, we want to run /usr/bin/sample on tests that time out in order to understand why they time out.  The sample tool effectively blocks for a while in order to capture stack traces from DRT.

Are there easy ways we should be communicating back to the master that we're sampling/crash-reporting/whatever?  Is it correct division of duty for the worker to be doing these tasks?
Comment 22 Eric Seidel (no email) 2011-06-30 23:02:57 PDT
Comment on attachment 99391 [details]
Patch

I think it's OK to remove this stuff, given that I remember it being added back in the threading days.  I agree that wedges today are bugs in the workers and can be fixed.

It isn't entirely clear to me if sampling in the worker is the correct decision, but I assume you can enlighten me there.

It's also not entirely clear to me whether NRWT or DRT manages the test hang timeout.  I believe in ORWT both of them have watchdog timers; I'm not sure what the status in NRWT is.
Comment 24 Adam Barth 2011-06-30 23:05:18 PDT
> It isn't entirely clear to me if sampling in the worker is the correct decision, but I assume you can enlighten me there.

There are two reasons to do it in the worker:

1) The worker is the one that knows the process ID of DRT, which is necessary to run sample.
2) The worker is responsible for adding all the output files to the results directory, and the hang report is one such file.
Comment 25 Adam Barth 2011-06-30 23:06:36 PDT
> Are there easy ways we should be communicating back to the master that we're sampling/crash-reporting/whatever?  Is it correct division of duty for the worker to be doing these tasks?

The master hardly reads any information from the workers.  He just stuffs the channel with tasks and then waits for the workers to spin down.
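
A rough sketch of that shape, using the standard multiprocessing module rather than NRWT's actual manager/worker broker (all names here are hypothetical, and comment 34 below notes that the real manager does read results back):

    # Illustrative sketch only. The manager stuffs the queue with tasks
    # plus one stop sentinel per worker, then just waits for the workers
    # to spin down.
    import multiprocessing

    STOP = "STOP"  # sentinel telling a worker to exit

    def run_single_test(test_name):
        # Placeholder for the real per-test work (driving DRT, diffing, ...).
        print("ran %s" % test_name)

    def worker_main(task_queue):
        while True:
            task = task_queue.get()
            if task == STOP:
                break
            run_single_test(task)

    def run_all_tests(test_names, num_workers=4):
        task_queue = multiprocessing.Queue()
        for name in test_names:
            task_queue.put(name)
        for _ in range(num_workers):
            task_queue.put(STOP)
        workers = [multiprocessing.Process(target=worker_main,
                                           args=(task_queue,))
                   for _ in range(num_workers)]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
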
Comment 26 Eric Seidel (no email) 2011-06-30 23:08:59 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > > This is incorrect. Workers can still get wedged, it just happens much less often.
> > 
> > How does this happen?  How often does it happen?
> 
> In the chromium port, in particular, the test timeout is enforced by DRT itself, rather than by the python code. If DRT wedges and doesn't self-timeout, the worker will wedge also.

This sounds like a bug.  I believe in ORWT both DRT and ORWT have timers.  I guess that's what this wedge code is about.
Comment 27 Eric Seidel (no email) 2011-06-30 23:10:58 PDT
(In reply to comment #25)
> > Are there easy ways we should be communicating back to the master that we're sampling/crash-reporting/whatever?  Is it correct division of duty for the worker to be doing these tasks?
> 
> The master hardly reads any information from the workers.  He just stuffs the channel with tasks and then waits for the workers to spin down.

I see.  So he never reads back any state from the workers?

A design whereby the workers were considered "trusted" parts of the architecture and responsible for making sure that they themselves didn't wedge makes sense to me.  I remember this wedging code being very necessary back when we couldn't trust that threads wouldn't wedge.

If we're removing this, what DRT-wedge-prevention code do we have left in NRWT?
Comment 28 Eric Seidel (no email) 2011-06-30 23:13:12 PDT
(In reply to comment #24)
> 2) The worker is responsible for adding all the output files to the results directory, and the hang report is one such file.

(I added a FIXME in one of my patches that we should be sampling from the worker instead of the ServerProcess, since it seems silly for the ServerProcess to know anything about the port or the results directory.)  It does make sense that the worker would be the one responsible for stuffing the results directory though.

I'm not sure that ServerProcess makes sense on its own, given how much knowledge it needs about DRT (in terms of how to deal with stdout and stderr).   Right now the only other client for ServerProcess (besides DRT) is ImageDiff.
Comment 29 Eric Seidel (no email) 2011-06-30 23:26:29 PDT
Comment on attachment 99391 [details]
Patch

I understand your desire to remove this.  It does feel like the wrong layer to put this trust boundary now that we're not using threaded workers.  Can you please file a follow-up bug about handling the hung DRT case in the workers?  (Assuming we don't have timeout code in the worker like Dirk was alluding to before?)
Comment 30 Adam Barth 2011-06-30 23:28:58 PDT
Done: https://bugs.webkit.org/show_bug.cgi?id=63784
Comment 31 Adam Barth 2011-06-30 23:34:58 PDT
Committed r90207: <http://trac.webkit.org/changeset/90207>
Comment 32 Dirk Pranke 2011-07-01 13:01:38 PDT
(In reply to comment #22)
> (From update of attachment 99391 [details])
> I think it's OK to remove this stuff, given that I remember it being added back in the threading days.  I agree that wedges today are bugs in the workers and can be fixed.
> 
> It isn't entirely clear to me if sampling in the worker is the correct decision, but I assume you can enlighten me there.
> 

Partly, sampling in the worker is the right thing to do because you get increased parallelism that way, and partly it is fallout from how the code is currently designed.

> It's also not entirely clear to me if NRWT or DRT manages the test hang timeout.  I believe in ORWT both of them have watchdog timers, I'm not sure what the status in NRWT is.

Chromium's DRT enforces the test hang timeout directly. The other DRTs, as far as I know, do not (we don't even pass the timeout to the other DRTs).

This was before my time, so I don't know exactly why this decision was made, but it may have been prompted by the fact that non-blocking I/O is annoying to do in Python generally and significantly harder on Windows; it may not have worked at all in Python 2.4 on XP, for example.

The lack of NBIO is a big part of the reason the current Apple Win port doesn't work, since the webkit.py/server_process.py implementation relies on it.

The work I did a few months ago convinced me that it was actually doable on Windows, though I never got around to it. Once someone implements non-blocking I/O, it becomes more feasible to say that the workers can properly enforce the timeout and won't get wedged.
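
As a rough illustration of the non-blocking I/O issue (this is not server_process.py itself): select() can put a deadline on reads from a child's stdout pipe on POSIX, but on Windows select() only accepts sockets, which is why this approach doesn't port directly.

    # Illustrative sketch only. The function name is hypothetical.
    import os
    import select

    def read_with_deadline(child_stdout, deadline_seconds):
        """Return available bytes from the pipe, or None on timeout."""
        ready, _, _ = select.select([child_stdout], [], [], deadline_seconds)
        if not ready:
            return None  # timed out; the caller can decide to kill DRT
        return os.read(child_stdout.fileno(), 4096)
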
Comment 33 Dirk Pranke 2011-07-01 13:04:47 PDT
(In reply to comment #26)
> (In reply to comment #7)
> > (In reply to comment #6)
> > > > This is incorrect. Workers can still get wedged, it just happens much less often.
> > > 
> > > How does this happen?  How often does it happen?
> > 
> > In the chromium port, in particular, the test timeout is enforced by DRT itself, rather than by the python code. If DRT wedges and doesn't self-timeout, the worker will wedge also.
> 
> This sounds like a bug.  I believe in ORWT both DRT and ORWT have timers.  I guess that's what this wedge code is about.

There are comments in ORWT that DRT has an internal 30 second timeout (I have no idea if that works or not). WebKitTestRunner can take a timeout command line argument.
Comment 34 Dirk Pranke 2011-07-01 13:08:31 PDT
(In reply to comment #27)
> (In reply to comment #25)
> > > Are there easy ways we should be communicating back to the master that we're sampling/crash-reporting/whatever?  Is it correct division of duty for the worker to be doing these tasks?
> > 
> > The master hardly reads any information from the workers.  He just stuffs the channel with tasks and then waits for the workers to spin down.
> 
> I see.  So he never reads back any state from the workers?
> 

The manager gets the list of test failures and other data about each test run (like the amount of time it takes to run the test). It uses that to (a) tell the user when tests fail, (b) compute the aggregate statistics and results about the test run, and (c) generate the results files.


> A design whereby the workers were considered "trusted" parts of the architecture and responsible for making sure that they themselves didn't wedge makes sense to me.  I remember this wedging code being very necessary back when we couldn't trust that threads wouldn't wedge.
> 
> If we're removing this, what DRT-wedge-prevention code do we have left in NRWT?

You will be relying solely on the workers' ability to keep themselves from being wedged. I found having two layers of checking much more reliable, and it seems like an entirely sensible design to me, given that the manager is busy tracking other information about the workers as well (like whether they are done or not).