Bug 103824 - clean up test result summarization / aggregating code in nrwt
Summary: clean up test result summarization / aggregating code in nrwt
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Dirk Pranke
URL:
Keywords: NRWT
Depends on: 78168 103828 103829 103830 103943 103961 103963 104047 104048 104051 104059 104061 104064 104067 104072 104158 104165 104871 105077 105078 105080
Blocks: 103826
 
Reported: 2012-12-01 17:19 PST by Dirk Pranke
Modified: 2013-02-20 16:31 PST
CC List: 6 users

See Also:


Attachments
work-in-progress big honking patch; no need to review (160.80 KB, patch)
2012-12-01 18:03 PST, Dirk Pranke

Description Dirk Pranke 2012-12-01 17:19:10 PST
Right now the code for collecting the results of individual tests is scattered across several badly named classes and dicts (ResultSummary, summarized_results, etc.) and is unwieldy. It would make things a lot easier to follow if this code were cleaned up and given some sort of structure; this bug is a meta-bug to track that work.

In addition, the test-running code (webkitpy.layout_tests.controllers.manager.Manager and layout_test_runner.Runner) has some odd object-ownership structure and doesn't really return enough information to be very useful; as a result, the run_webkit_tests_integration tests have to jump through a bunch of hoops to test things (e.g., bug 78168), and it would be nice to clean that up.

Lastly, if we cleaned up the way we return information about the test run to callers, it would probably simplify various client programs like the buildbot log parsers and the calling scripts used for various purposes (e.g., me trying to hunt down flaky tests).
Comment 1 Dirk Pranke 2012-12-01 17:31:21 PST
The general model I have in mind can be thought of functionally or in a dataflow manner ...

1) get a list of test paths (e.g., ['fast/html', 'fast/forms'])

2) convert those into a list of test names

3) convert those into a list of TestInputs

4) run the tests, each test produces a list of TestFailures plus other data (like test run time) which is collected into a TestResult for each test

5) the list of TestResults is collected into a TestPass class (what is currently called a ResultSummary), or some similar name; the concept is "making a pass over the tests" rather than "this test passed", but I don't like the confusion. 

6) if we have any unexpected failures and were not told not to retry them, we do a second pass over those, producing a second TestPass object (what is currently called the retry_summary)

7) from those two TestPass objects we can merge the results and summarize them (in the routine currently called summarize_results that returns a flat JSON/POD object that can be stored as full_results.json)

8) the results of 6 and 7 are collected together (along with the single exit code containing the number of unexpected failures) into a RunDetails object and handed back to callers, who will then have a full description of everything that happened during the test run.

I'm not wild about any of these names and am open to suggestions. Note that all of this basically describes the existing flow and objects; I'm not changing anything much, just cleaning it up.
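
To make the flow concrete, here's a toy, self-contained Python sketch of it. All the names (TestPass, RunDetails, run(), etc.) are just the placeholders from the list above, not the real webkitpy API, and expectations parsing, sharding, etc. are elided:

class TestInput(object):                            # step 3
    def __init__(self, name, expectation='PASS'):
        self.name = name
        self.expectation = expectation

class TestResult(object):                           # step 4: failures plus other data
    def __init__(self, name, failures=None, run_time=0.0):
        self.name = name
        self.failures = failures or []
        self.run_time = run_time

class TestPass(object):                             # step 5: one "pass over the tests"
    def __init__(self, results):
        self.results = results

    def unexpected_failures(self):
        return [r.name for r in self.results if r.failures]

class RunDetails(object):                           # step 8: full description of the run
    def __init__(self, exit_code, initial, retry, summary):
        self.exit_code = exit_code
        self.initial = initial
        self.retry = retry
        self.summary = summary

def run_tests(inputs):                              # stub for actually running the tests
    return TestPass([TestResult(i.name) for i in inputs])

def summarize_results(initial, retry):              # step 7: flat JSON/POD object
    return {'num_tests': len(initial.results),
            'unexpected': initial.unexpected_failures()}

def run(test_names, retry_failures=True):
    inputs = [TestInput(n) for n in test_names]     # steps 2-3
    initial = run_tests(inputs)                     # steps 4-5
    failed = initial.unexpected_failures()
    retry = None
    if failed and retry_failures:                   # step 6
        retry = run_tests([TestInput(n) for n in failed])
    summary = summarize_results(initial, retry)     # step 7
    return RunDetails(len(failed), initial, retry, summary)

A caller then gets back a single RunDetails object and can look at the summary (for the JSON), or at the initial/retry passes (for the raw TestResults), as needed.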
Comment 2 Dirk Pranke 2012-12-01 17:59:25 PST
Obviously, all of this cleanup is a bunch of work and will come as a bunch of patches, not one big honking one :).
Comment 3 Dirk Pranke 2012-12-01 18:03:20 PST
Created attachment 177122 [details]
work-in-progress big honking patch; no need to review
Comment 4 Ojan Vafai 2012-12-01 18:31:44 PST
(In reply to comment #1)
> The general model I have in mind can be thought of functionally or in a dataflow manner ...
> 
> 1) get a list of test paths (e.g., ['fast/html', 'fast/forms'])
> 
> 2) convert those into a list of test names

At some point around here we need to parse TestExpectations. We need to know the expectations to create the TestInput, right?

Also, it's not worth complicating things too much, but as you're designing this, a nice-to-have to keep in mind would be the ability to start step 4 before steps 2 and 3 have completely finished. This is to improve the time it takes to run the tests. Taking http://build.chromium.org/p/chromium.webkit/builders/WebKit%20XP/builds/1244/steps/webkit_tests/logs/stdio as a probably-extreme slow case:

16:58:57.230 3892 Collecting tests ...
16:59:25.726 3892 Parsing expectations ...
16:59:37.567 3892 Found 33781 tests; running 26622, skipping 7159.

Collecting tests took 28.5 seconds and parsing expectations took 12. This is obviously a very slow machine, but anything we can do to improve this would be great.
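
For illustration, a toy producer/consumer sketch of the idea (not webkitpy code; sharding and expectations are ignored): tests get pushed onto a queue as they are collected, and workers start running them right away instead of waiting for the full list.

import queue
import threading
import time

def collect_tests(work_queue, num_workers):
    # stand-in for walking LayoutTests/ and parsing expectations
    for i in range(100):
        time.sleep(0.01)                     # pretend collection is slow
        work_queue.put('test-%d.html' % i)
    for _ in range(num_workers):
        work_queue.put(None)                 # one sentinel per worker

def run_tests(work_queue, results):
    while True:
        name = work_queue.get()
        if name is None:
            break
        results.append((name, 'PASS'))       # stand-in for actually running the test

work_queue = queue.Queue()
results = []
num_workers = 4
workers = [threading.Thread(target=run_tests, args=(work_queue, results))
           for _ in range(num_workers)]
for w in workers:
    w.start()
collect_tests(work_queue, num_workers)
for w in workers:
    w.join()
print('ran %d tests' % len(results))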

> 5) the list of TestResults is collected into a TestPass class (what is currently called a ResultSummary), or some similar name; the concept is "making a pass over the tests" rather than "this test passed", but I don't like the confusion. 

As you said, overloading "pass" is asking for trouble. TestResultSet maybe? TestResultList? TestRunOutput? TestRunResults? I agree all of these sound not quite right, but I'd prefer any of these to TestPass. :)

> 6) if we have any unexpected failures and were not told not to retry them, we do a second pass over those, producing a second TestPass object (what is currently called the retry_summary)

This is another thing that would be great to parallelize. It's super annoying right now that we wait until the last shard finishes before we start the retry. I find this especially annoying when running tests locally; I usually run with --no-retry because it takes too long. Again, just something to keep in mind as you're writing the code.

> 7) from those two TestPass objects we can merge the results and summarize them (in the routine currently called summarize_results that returns a flat JSON/POD object that can be stored as full_results.json)

Side note: at some point, we should change the test results server to store the aggregate results files with a different name (e.g. aggregate-results.json) so that we can properly name this file results.json to match results.html.

> I'm not wild about any of these names and am open to suggestions. Note that all of this basically describes the existing flow and objects; I'm not changing anything much, just cleaning it up.

Yup. Except for overloading "pass", this all sounds good. Don't let any of my comments here complicate the refactoring too much; I just thought I'd mention them in case they affect the code/design in any way, or in case they're easy-ish to accomplish.
Comment 5 Dirk Pranke 2012-12-01 19:26:24 PST
(In reply to comment #4)
> (In reply to comment #1)
> > The general model I have in mind can be thought of functionally or in a dataflow manner ...
> > 
> > 1) get a list of test paths (e.g., ['fast/html', 'fast/forms'])
> > 
> > 2) convert those into a list of test names
> 
> At some point around here we need to parse TestExpectations. We need to know the expectations to create the TestInput, right?
> 

Correct. I didn't detail every step :).

> Also, it's not worth complicating things too much, but as you're designing this, a nice-to-have to keep in mind would be the ability to start step 4 before steps 2 and 3 have completely finished. This is to improve the time it takes to run the tests. Taking http://build.chromium.org/p/chromium.webkit/builders/WebKit%20XP/builds/1244/steps/webkit_tests/logs/stdio as a probably-extreme slow case:
> 
> 16:58:57.230 3892 Collecting tests ...
> 16:59:25.726 3892 Parsing expectations ...
> 16:59:37.567 3892 Found 33781 tests; running 26622, skipping 7159.
> 
> Collecting tests took 28.5 seconds and parsing expectations took 12. This is obviously a very slow machine, but anything we can do to improve this would be great.
> 

Yeah, I have often thought about this. Pipelining this is a bit tricky, especially since it affects how we do sharding and compute the batches to run, but as you say, it could be a nice win on a slow machine.

> > 5) the list of TestResults is collected into a TestPass class (what is currently called a ResultSummary), or some similar name; the concept is "making a pass over the tests" rather than "this test passed", but I don't like the confusion. 
> 
> As you said, overloading "pass" is asking for trouble. TestResultSet maybe? TestResultList? TestRunOutput? TestRunResults? I agree all of these sound not quite right, but I'd prefer any of these to TestPass. :)
> 

Yeah, I think I had thought of most of those as well, but I don't much care for the TestRun variants since I don't then know what to call the RunDetails/overall object. Your objection to TestPass is recorded :)

Anyone else have any ideas?

> > 6) if we have any unexpected failures and were not told not to retry them, we do a second pass over those, producing a second TestPass object (what is currently called the retry_summary)
> 
> This is another thing that would be great to parallelize. It's super annoying right now that we wait until the last shard finishes before we start the retry. I find this especially annoying when running tests locally; I usually run with --no-retry because it takes too long. Again, just something to keep in mind as you're writing the code.
>

I intentionally don't retry in parallel with running the first pass, in order to avoid the additional potential flakiness that running retries alongside the rest of the run could introduce. I think this is the better behavior, even if it makes the run longer locally.

> > 7) from those two TestPass objects we can merge the results and summarize them (in the routine currently called summarize_results that returns a flat JSON/POD object that can be stored as full_results.json)
> 
> Side note: at some point, we should change the test results server to store the aggregate results files with a different name (e.g. aggregate-results.json) so that we can properly name this file results.json to match results.html.
> 

Yup.
Comment 6 Dirk Pranke 2013-02-20 16:31:56 PST
Okay, I hit the point of declining marginal returns here and am calling this done.

The code ended up as follows:

run_webkit_tests.py creates a Manager object. The manager builds a list of TestInputs representing the tests to run. Each test gets run (normally), producing a TestResult. The TestResults get collected into a TestRunResults object (initial_results), which also contains some derived aggregate statistics.

If there are unexpected results, they are retried, producing a second TestRunResults object (retry_results).

initial_results and retry_results are then summarized (again) into a dict called summarized_results. 

These three objects, along with the resulting exit code, are returned as a RunDetails object back to run_webkit_tests.py. The RunDetails object is used to print the results to stdout and to compute the JSON blocks uploaded to the dashboards (and then to determine the exit code to exit with).
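
For reference, roughly the shape of those objects; the field and parameter names here just approximate the description above, not the exact webkitpy definitions:

class TestRunResults(object):
    # one pass over the tests (initial_results or retry_results), plus
    # some derived aggregate statistics
    def __init__(self, results):
        self.all_results = results                    # list of TestResult objects
        self.total = len(results)                     # illustrative aggregate stat

class RunDetails(object):
    # everything run_webkit_tests.py needs: print the results, build the
    # JSON for the dashboards, and pick the exit code
    def __init__(self, exit_code, summarized_results, initial_results, retry_results):
        self.exit_code = exit_code                    # number of unexpected failures
        self.summarized_results = summarized_results  # plain dict -> full_results.json
        self.initial_results = initial_results        # TestRunResults for the first pass
        self.retry_results = retry_results            # TestRunResults for the retry, or None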