Bug 63767 - Remove the concept of "being wedged" from new-run-webkit-tests
Summary: Remove the concept of "being wedged" from new-run-webkit-tests
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Adam Barth
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-06-30 16:12 PDT by Adam Barth
Modified: 2022-02-27 23:43 PST
CC List: 2 users

See Also:


Attachments
Patch (8.48 KB, patch)
2011-06-30 16:14 PDT, Adam Barth
eric: review+

Description Adam Barth 2011-06-30 16:12:51 PDT
Remove the concept of "being wedged" from new-run-webkit-tests
Comment 1 Dirk Pranke 2011-06-30 16:14:14 PDT
Why?
Comment 2 Adam Barth 2011-06-30 16:14:47 PDT
Created attachment 99391 [details]
Patch
Comment 3 Adam Barth 2011-06-30 16:15:53 PDT
From the ChangeLog:

        Worker processes shouldn't ever become wedged.  My understanding is
        that this code was originally motivated by the old threading-based
        design but no longer serves any purpose.

        Note: If we actually have a problem with the test harness getting
        stuck, buildbot will kill us.
Comment 4 Dirk Pranke 2011-06-30 16:16:17 PDT
Comment on attachment 99391 [details]
Patch

It is still possible for workers to wedge up, and they should not hang the overall test run. I'm R-'ing this unless you can explain why we don't need this concept any more.
Comment 5 Dirk Pranke 2011-06-30 16:17:34 PDT
(In reply to comment #3)
> From the ChangeLog:
> 
>         Worker processes shouldn't ever become wedged.  My understanding is
>         that this code was originally motivated by the old threading-based
>         design but no longer serves any purpose.
>

This is incorrect. Workers can still get wedged, it just happens much less often.
 
>         Note: If we actually have a problem with the test harness getting
>         stuck, buildbot will kill us.

When this happens, you get much less useful output from the test run, and it takes a lot longer. Plus, it's harder to diagnose what happened.
Comment 6 Adam Barth 2011-06-30 16:19:15 PDT
> This is incorrect. Workers can still get wedged, it just happens much less often.

How does this happen?  How often does it happen?
Comment 7 Dirk Pranke 2011-06-30 16:23:14 PDT
(In reply to comment #6)
> > This is incorrect. Workers can still get wedged, it just happens much less often.
> 
> How does this happen?  How often does it happen?

In the chromium port, in particular, the test timeout is enforced by DRT itself, rather than by the python code. If DRT wedges and doesn't self-timeout, the worker will wedge also.

It doesn't seem to happen that often, but I've definitely seen it happen.

It is possible that if you fix the Chromium code and we can safely assert that workers are never wedging then this code will no longer be necessary. I personally think it's still a good defense-in-depth that makes the tool more robust.

Also, I'll note that a few days ago you were asking about --run-singly and run_in_another_thread. If you got rid of this code, that would be yet another reason to have to keep run_in_another_thread around. I'd rather get rid of run_in_another_thread(), since the manager is a more appropriate place for this sort of logic.
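
To make the discussion concrete, here is a minimal sketch of the kind of manager-side wedge check being debated. The names, the threshold, and the structure are illustrative assumptions, not the actual new-run-webkit-tests code:

    # Illustrative sketch only -- not the actual NRWT manager code.
    # The manager records a "last activity" timestamp for each worker and
    # treats any worker that has been silent for too long as wedged.
    import time

    WEDGE_TIMEOUT_SECONDS = 600  # hypothetical threshold

    def find_wedged_workers(workers):
        """Return workers that have not reported progress recently."""
        now = time.time()
        return [worker for worker in workers
                if now - worker.last_activity_time > WEDGE_TIMEOUT_SECONDS]

    # The manager's polling loop could call find_wedged_workers() between
    # reads from the result channel and kill/restart anything returned, so
    # a hung DRT (and the worker blocked on it) cannot stall the whole run.
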
Comment 8 Adam Barth 2011-06-30 16:28:30 PDT
I just checked the last 100 builds on Webkit Mac10.6 (deps) and this condition has not occurred.  Would you like me to check the last thousand or would you like me to check another bot?
Comment 9 Adam Barth 2011-06-30 16:29:04 PDT
(To be clear, that was the Chromium bot)
Comment 10 Dirk Pranke 2011-06-30 16:31:08 PDT
(In reply to comment #9)
> (To be clear, that was the Chromium bot)

I would definitely want to check Chromium Win as well, and I would feel better if we could say that we hadn't seen this for a couple weeks. 

That said, I believe that this feature should stay in the tool, period.
Comment 11 Adam Barth 2011-06-30 16:32:02 PDT
> That said, I believe that this feature should stay in the tool, period.

To be clear, you believe the feature should stay in the tool regardless of whether the condition ever occurs?
Comment 12 Dirk Pranke 2011-06-30 16:36:52 PDT
(In reply to comment #11)
> > That said, I believe that this feature should stay in the tool, period.
> 
> To be clear, you believe the feature should stay in the tool regardless of whether the condition ever occurs?

Not exactly. I believe that the feature exists in order to keep the tool from deadlocking and hanging. The way the tool is currently written, deadlock is otherwise possible. If you can convince me that deadlock is not otherwise possible, I would be more open to removing it.
Comment 13 Adam Barth 2011-06-30 16:40:42 PDT
This hasn't happened in the 100 most recent builds on Chromium Windows either.

So, we've established that if this condition does ever occur, it probably occurs in less than 0.5 percent of runs.
Comment 14 Dirk Pranke 2011-06-30 16:43:15 PDT
(In reply to comment #13)
> This hasn't happened in the 100 most recent builds on Chromium Windows either.
> 
> So, we've established that if this condition does ever occur, it probably occurs in less than 0.5 percent of runs.

I'm not trying to be a nag here; comment #12 best explains why I think this feature should stay in the tool. 

That said, you are assuming a uniform distribution of events and that past performance predicts future performance.

If someone introduces a patch into Chromium DRT that hangs it, this will occur 100% of the time.
Comment 15 Dirk Pranke 2011-06-30 16:44:51 PDT
(In reply to comment #14)
> (In reply to comment #13)
> > This hasn't happened in the 100 most recent builds on Chromium Windows either.
> > 
> > So, we've established that if this condition does ever occur, it probably occurs in less than 0.5 percent of runs.
> 
> I'm not trying to be a nag here; comment #12 best explains why I think this feature should stay in the tool. 
> 
> That said, you are assuming a uniform distribution of events and that past performance predicts future performance.
> 
> If someone introduces a patch into Chromium DRT that hangs it, this will occur 100% of the time.

Oh, and in addition, this code protects not only against currently executing (hopefully correctly written) code but also against code that may be written in the future. E.g., you may find, as you are getting the Apple Win port working or porting this to work with some new WebKit port, that your DRTs hang on occasion, at which point this code would once again be useful.
Comment 16 Adam Barth 2011-06-30 16:48:02 PDT
If we need to kill DRT, we can do that in the worker process.  If the worker process has a bug that causes it to hang, we can fix that bug.  In neither case do we need a redundant layer of timeout.  It's not helping anyone.
Comment 17 Dirk Pranke 2011-06-30 16:52:32 PDT
(In reply to comment #16)
> If we need to kill DRT, we can do that in the worker process.  If the worker process has a bug that causes it to hang, we can fix that bug.  In neither case do we need a redundant layer of timeout.  It's not helping anyone.

No, you can't kill it in the worker process, because the worker process is likely blocked inside port-specific code in run_test(). 

If the worker process has a bug that causes it to hang, it is true that you can fix that bug, but in the meantime the tool will be hanging and bots and users will be less happy than they would be otherwise.

So, in both cases the redundant layer of timeouts makes the tool more reliable and robust.
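
For comparison, a worker-side deadline of the sort Adam proposes could look roughly like the sketch below. This is an assumption-laden illustration (the driver object, its pid attribute, and the run_test() call are hypothetical names, not the real port code); whether a check at this layer alone is sufficient is exactly what is being debated here.

    # Illustrative sketch only: a worker-side watchdog thread kills the
    # DRT process if a single test exceeds its deadline, so the blocking
    # read inside run_test() eventually sees EOF and returns.
    import os
    import signal
    import threading

    def run_test_with_deadline(driver, test_input, deadline_seconds):
        def _kill_driver():
            try:
                os.kill(driver.pid, signal.SIGKILL)  # POSIX-only in this sketch
            except OSError:
                pass  # the driver already exited on its own
        watchdog = threading.Timer(deadline_seconds, _kill_driver)
        watchdog.start()
        try:
            return driver.run_test(test_input)  # may block waiting on DRT
        finally:
            watchdog.cancel()
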
Comment 18 Adam Barth 2011-06-30 16:53:10 PDT
Comment on attachment 99391 [details]
Patch

I've now checked every build from the Chromium Windows (deps) builder that exists.  This condition has never occurred.  Our estimate is now that it occurs less than twice per thousand runs.  This code does not appear to be useful.  If this becomes a problem in the future, we can add the code back.
Comment 19 Dirk Pranke 2011-06-30 17:00:26 PDT
This probably won't convince you, but I will also point out that this code would be useful to anyone who was using the --worker-model=threads feature on a box with Python 2.5 (which would happen if we were running on Leopard or Ubuntu Hardy).

However, you have already also suggested that that code should be removed as well :).
Comment 20 Adam Barth 2011-06-30 22:54:25 PDT
> This probably won't convince you, but I will also point out that this code would be useful to anyone who was using the --worker-model=threads feature on a box with Python 2.5 (which would happen if we were running on Leopard or Ubuntu Hardy).

That code is gone now.  Hopefully we won't need to run the tests on Hardy anymore since we've moved to Lucid.

The anti-wedging code also causes problems when workers legitimately need to block for moderate periods of time.  For example, on Mac, we want to run /usr/bin/sample on tests that time out in order to understand why they time out.  The sample tool effectively blocks for a while in order to capture stack traces from DRT.

By removing the anti-wedging code, we give the worker process more rope, which it can use to hang itself or to do some useful work.  Given that we can't seem to find an example of a worker wedging in the 500 runs we've examined, it seems like the balance might lie with giving the worker some more rope at this time.

These decisions are all very reversible.  If we run into trouble with the test suite timing out and getting killed by buildbot, we can address that problem.
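
For context, capturing the sample mentioned above might look roughly like this. The helper name and report path are hypothetical, and the sample arguments (10 seconds at 10 ms intervals, written to a file with -file) are stated here as an assumption about how the scripts invoke it:

    # Illustrative sketch only -- not the actual worker code.
    import subprocess

    def sample_hung_drt(drt_pid, hang_report_path):
        # Blocks for roughly the sampling duration, which is the
        # "moderate period of time" the worker legitimately spends here.
        subprocess.call(["/usr/bin/sample", str(drt_pid), "10", "10",
                         "-file", hang_report_path])
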
Comment 21 Eric Seidel (no email) 2011-06-30 23:00:39 PDT
(In reply to comment #20)
> The anti-wedging code also causes problems when workers legitimately need to block for moderate periods of time.  For example, on Mac, we want to run /usr/bin/sample on tests that time out in order to understand why they time out.  The sample tool effectively blocks for a while in order to capture stack traces from DRT.

Are there easy ways we should be communicating back to the master that we're sampling/crash-reporting/whatever?  Is it correct division of duty for the worker to be doing these tasks?
Comment 22 Eric Seidel (no email) 2011-06-30 23:02:57 PDT
Comment on attachment 99391 [details]
Patch

I think it's OK to remove this stuff, given that I remember it being added back in the threading days.  I agree that wedges today are bugs in the workers and can be fixed.

It isn't entirely clear to me if sampling in the worker is the correct decision, but I assume you can enlighten me there.

It's also not entirely clear to me whether NRWT or DRT manages the test hang timeout.  I believe in ORWT both of them have watchdog timers; I'm not sure what the status in NRWT is.
Comment 24 Adam Barth 2011-06-30 23:05:18 PDT
> It isn't entirely clear to me if sampling in the worker is the correct decision, but I assume you can enlighten me there.

There are two reasons to do it in the worker:

1) The worker is the one that knows the process ID of DRT, which is necessary to run sample.
2) The worker is responsible for adding all the output files to the results directory, and the hang report is one such file.
Comment 25 Adam Barth 2011-06-30 23:06:36 PDT
> Are there easy ways we should be communicating back to the master that we're sampling/crash-reporting/whatever?  Is it correct division of duty for the worker to be doing these tasks?

The master hardly reads any information from the workers.  He just stuffs the channel with tasks and then waits for the workers to spin down.
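
A rough sketch of that shape, using the standard multiprocessing module rather than NRWT's actual manager/worker broker (all names here are hypothetical, and comment 34 below notes that the real manager does read results back):

    # Illustrative sketch only. The manager stuffs the queue with tasks
    # plus one stop sentinel per worker, then just waits for the workers
    # to spin down.
    import multiprocessing

    STOP = "STOP"  # sentinel telling a worker to exit

    def run_single_test(test_name):
        # Placeholder for the real per-test work (driving DRT, diffing, ...).
        print("ran %s" % test_name)

    def worker_main(task_queue):
        while True:
            task = task_queue.get()
            if task == STOP:
                break
            run_single_test(task)

    def run_all_tests(test_names, num_workers=4):
        task_queue = multiprocessing.Queue()
        for name in test_names:
            task_queue.put(name)
        for _ in range(num_workers):
            task_queue.put(STOP)
        workers = [multiprocessing.Process(target=worker_main,
                                           args=(task_queue,))
                   for _ in range(num_workers)]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
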
Comment 26 Eric Seidel (no email) 2011-06-30 23:08:59 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > > This is incorrect. Workers can still get wedged, it just happens much less often.
> > 
> > How does this happen?  How often does it happen?
> 
> In the chromium port, in particular, the test timeout is enforced by DRT itself, rather than by the python code. If DRT wedges and doesn't self-timeout, the worker will wedge also.

This sounds like a bug.  I believe in ORWT both DRT and ORWT have timers.  I guess that's what this wedge code is about.
Comment 27 Eric Seidel (no email) 2011-06-30 23:10:58 PDT
(In reply to comment #25)
> > Are there easy ways we should be communicating back to the master that we're sampling/crash-reporting/whatever?  Is it correct division of duty for the worker to be doing these tasks?
> 
> The master hardly reads any information from the workers.  He just stuffs the channel with tasks and then waits for the workers to spin down.

I see.  So he never reads back any state from the workers?

A design whereby the workers were considered "trusted" parts of the architecture and responsible for making sure that they themselves didn't wedge makes sense to me.  I remember this wedging code being very necessary back when we couldn't trust that threads wouldn't wedge.

If we're removing this, what DRT-wedge-prevention code do we have left in NRWT?
Comment 28 Eric Seidel (no email) 2011-06-30 23:13:12 PDT
(In reply to comment #24)
> 2) The worker is responsible for adding all the output files to the results directory, and the hang report is one such file.

(I added a FIXME in one of my patches that we should be sampling from the worker instead of the ServerProcess, since it seems silly for the ServerProcess to know anything about the port or the results directory.)  It does make sense that the worker would be the one responsible for stuffing the results directory though.

I'm not sure that ServerProcess makes sense on its own, given how much knowledge it needs about DRT (in terms of how to deal with stdout and stderr).   Right now the only other client for ServerProcess (besides DRT) is ImageDiff.
Comment 29 Eric Seidel (no email) 2011-06-30 23:26:29 PDT
Comment on attachment 99391 [details]
Patch

I understand your desire to remove this.  It does feel like the wrong layer to put this trust boundary now that we're not using threaded workers.  Can you please file a follow-up bug about handling the hung DRT case in the workers?  (Assuming we don't have timeout code in the worker like Dirk was alluding to before?)
Comment 30 Adam Barth 2011-06-30 23:28:58 PDT
Done: https://bugs.webkit.org/show_bug.cgi?id=63784
Comment 31 Adam Barth 2011-06-30 23:34:58 PDT
Committed r90207: <http://trac.webkit.org/changeset/90207>
Comment 32 Dirk Pranke 2011-07-01 13:01:38 PDT
(In reply to comment #22)
> (From update of attachment 99391 [details])
> I think it's OK to remove this stuff, given that I remember it being added back in the threading days.  I agree that wedges today are bugs in the workers and can be fixed.
> 
> It isn't entirely clear to me if sampling in the worker is the correct decision, but I assume you can enlighten me there.
> 

Partly, sampling in the worker is the right thing to do because you get increased parallelism that way, and partly it is fallout from how the code is currently designed.

> It's also not entirely clear to me if NRWT or DRT manages the test hang timeout.  I believe in ORWT both of them have watchdog timers, I'm not sure what the status in NRWT is.

Chromium's DRT enforces the test hang timeout directly. The other DRTs, as far as I know, do not (we don't even pass the timeout to the other DRTs).

This was before my time, so I don't know exactly why this decision was made, but it may have been prompted by the fact that non-blocking I/O is annoying to do in Python generally and significantly harder on Windows; it may not have worked at all in Python 2.4 on XP, for example.

The lack of NBIO is a big part of the reason the current Apple Win port doesn't work, since the webkit.py/server_process.py implementation relies on it.

The work I did a few months ago convinced me that it was actually doable on Windows, though I never got around to it. Once someone implements non-blocking I/O, it becomes more feasible to say that the workers can properly enforce the timeout and won't get wedged.
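
As a rough illustration of the non-blocking I/O issue (this is not server_process.py itself): select() can put a deadline on reads from a child's stdout pipe on POSIX, but on Windows select() only accepts sockets, which is why this approach doesn't port directly.

    # Illustrative sketch only. The function name is hypothetical.
    import os
    import select

    def read_with_deadline(child_stdout, deadline_seconds):
        """Return available bytes from the pipe, or None on timeout."""
        ready, _, _ = select.select([child_stdout], [], [], deadline_seconds)
        if not ready:
            return None  # timed out; the caller can decide to kill DRT
        return os.read(child_stdout.fileno(), 4096)
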
Comment 33 Dirk Pranke 2011-07-01 13:04:47 PDT
(In reply to comment #26)
> (In reply to comment #7)
> > (In reply to comment #6)
> > > > This is incorrect. Workers can still get wedged, it just happens much less often.
> > > 
> > > How does this happen?  How often does it happen?
> > 
> > In the chromium port, in particular, the test timeout is enforced by DRT itself, rather than by the python code. If DRT wedges and doesn't self-timeout, the worker will wedge also.
> 
> This sounds like a bug.  I believe in ORWT both DRT and ORWT have timers.  I guess that's what this wedge code is about.

There are comments in ORWT that DRT has an internal 30 second timeout (I have no idea if that works or not). WebKitTestRunner can take a timeout command line argument.
Comment 34 Dirk Pranke 2011-07-01 13:08:31 PDT
(In reply to comment #27)
> (In reply to comment #25)
> > > Are there easy ways we should be communicating back to the master that we're sampling/crash-reporting/whatever?  Is it correct division of duty for the worker to be doing these tasks?
> > 
> > The master hardly reads any information from the workers.  He just stuffs the channel with tasks and then waits for the workers to spin down.
> 
> I see.  So he never reads back any state from the workers?
> 

The manager gets the list of test failures and other data about each test run (like the amount of time it takes to run the test). It uses that to (a) tell the user when tests fail, (b) compute the aggregate statistics and results about the test run, and (c) generate the results files.


> A design whereby the workers were considered "trusted" parts of the architecture and responsible for making sure that they themselves didn't wedge makes sense to me.  I remember this wedging code being very necessary back when we couldn't trust that threads wouldn't wedge.
> 
> If we're removing this, what DRT-wedge-prevention code do we have left in NRWT?

You will be relying solely on the workers' ability to keep themselves from being wedged. I found having two layers of checking much more reliable, and it seems like an entirely sensible design to me, given that the manager is busy tracking other information about the workers as well (like whether they are done or not).