Bug 86021 - performance tests should be able to measure runs/sec rather than time
Summary: performance tests should be able to measure runs/sec rather than time
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Ryosuke Niwa
URL:
Keywords:
Depends on:
Blocks: 77037
Reported: 2012-05-09 15:05 PDT by Tony Chang
Modified: 2012-05-14 12:29 PDT
CC List: 11 users

See Also:


Attachments
Adds PerfTestRunner.runPerSecond and converts flexbox tests (14.54 KB, patch)
2012-05-11 23:55 PDT, Ryosuke Niwa
Flexbox Perf. Test Experiment - function call per iteration is adjusted (deleted)
2012-05-13 16:58 PDT, Ryosuke Niwa
Flexbox Perf. Test Experiment - function call per iteration is 1 (11.87 KB, image/png)
2012-05-13 16:59 PDT, Ryosuke Niwa
Patch for landing (15.24 KB, patch)
2012-05-13 20:55 PDT, Ryosuke Niwa

Description Tony Chang 2012-05-09 15:05:54 PDT
It would be helpful to measure runs/sec over a fixed time because, as we speed up tests and hardware gets faster, the precision of time-based measurements gets worse.

I think Morita did a one-off change to support the Dromaeo tests (which measure hertz), but we should roll this ability directly into PerformanceTests/resources/runner.js so that non-Dromaeo tests can also report results in hertz.
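For illustration only, the difference can be sketched as a tiny helper that counts how many times a test body runs within a fixed window (a hypothetical sketch, not the API the patch below adds):

// Hypothetical sketch: instead of timing a fixed number of iterations,
// count how many iterations complete within a fixed window and report a rate.
function measureRunsPerSecond(run, windowMs) {
    var runs = 0;
    var start = Date.now();
    while (Date.now() - start < windowMs) {
        run();
        runs++;
    }
    return runs * 1000 / (Date.now() - start); // runs/s
}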
Comment 1 Ryosuke Niwa 2012-05-11 23:55:29 PDT
Created attachment 141553 [details]
Adds PerfTestRunner.runPerSecond and converts flexbox tests
Comment 2 Ojan Vafai 2012-05-12 10:31:42 PDT
Comment on attachment 141553 [details]
Adds PerfTestRunner.runPerSecond and converts flexbox tests

View in context: https://bugs.webkit.org/attachment.cgi?id=141553&action=review

> PerformanceTests/ChangeLog:17
> +        By default, runPerSecond runs the test for at least 750 milliseconds in each run, and executes
> +        21 runs, yielding the total run time of roughly 18 seconds. This is significantly faster than

Do we really need to run 21 times to get the variance down? Seems excessive to me. Have you looked at what V8 does? They've had lots of time to tweak their test runner to reduce variance, but still be fast. With every test taking 18 seconds, the cycle time for the bots will be huge as people add tests and it will be harder to tell which patch caused a regression.

> PerformanceTests/resources/runner.js:79
> +    result.unit = unit || "ms";

Looks like you always set unit, so do you need this ||?

> PerformanceTests/resources/runner.js:167
> +PerfTestRunner.runPerSecond = function (test) {

It's up to you, but I would call this method run and call the other method runTime or something like that. Whichever method is called run is the one naive test writers will use.

> PerformanceTests/resources/runner.js:179
> +    this._runCount = test.runCount || 21;
> +    this._doneFunction = function () { if (test.done) test.done(); };
> +    this._completedRuns = 0;
> +    this._callsPerIteration = 1;
> +    this.customRunFunction = null;
> +    this.unit = 'runs/s';
> +    this._results = [];
> +
> +    this._test = test;
> +    this._runner = this._perSecondRunner;
> +    this.log("Running " + this._runCount + " times");
> +    this._runLoop();

Looks like this duplicates a lot of code in PerfTestRunner.run. Could they both call a shared helper function? PerfTestRunner._resetState or something?

> PerformanceTests/resources/runner.js:195
> +        totalTime += this._perSecondRunnerIterator(callsPerIteration);
> +        i += callsPerIteration;
> +        if (this._completedRuns <= 0 && totalTime < 100)
> +            callsPerIteration = Math.max(10, 2 * callsPerIteration);

Again, it would be good to see what the V8 test harness does for this loop.

> PerformanceTests/resources/runner.js:212
> +    var startTime = new Date();
> +    for (var i = 0; i < callsPerIteration; i++)
> +        this._test.run();
> +    return new Date() - startTime;

Nit: s/new Date()/Date.now()/. Saves the cost of instantiating a Date object, which is most of the cost.
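For clarity, the suggested substitution amounts to the following (illustrative snippet, not the actual patch hunk):

// Before: two Date allocations per measurement.
var startTime = new Date();
// ... calls to this._test.run() ...
var elapsed = new Date() - startTime;

// After: Date.now() returns a plain number, avoiding the allocations.
var startTime = Date.now();
// ... calls to this._test.run() ...
var elapsed = Date.now() - startTime;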
Comment 3 Ryosuke Niwa 2012-05-12 13:00:33 PDT
Comment on attachment 141553 [details]
Adds PerfTestRunner.runPerSecond and converts flexbox tests

View in context: https://bugs.webkit.org/attachment.cgi?id=141553&action=review

Thanks for the review!

>> PerformanceTests/ChangeLog:17
>> +        21 runs, yielding the total run time of roughly 18 seconds. This is significantly faster than
> 
> Do we really need to run 21 times to get the variance down? Seems excessive to me. Have you looked at what V8 does? They've had lots of time to tweak their test runner to reduce variance, but still be fast. With every test taking 18 seconds, the cycle time for the bots will be huge as people add tests and it will be harder to tell which patch caused a regression.

I think each test taking 18s is an improvement over the current situation where we spend well over a minute on some tests.

I did look into what v8 benchmark and dom_perf do. According to http://code.google.com/p/v8/source/browse/trunk/benchmarks/base.js#201,
V8's benchmark runs for at least 1s or 32 runs. So 750ms is even smaller than that. We spend 18 seconds because we're getting 21 samples.

We can tweak and reduce the time if we wanted to in the future but we definitely need at least 20 samples.
(I wanted 100 samples but that takes too long so we're already compromising on that end)
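Pieced together from the snippets quoted above, the shape of one runPerSecond sample is roughly the following (a paraphrased sketch, not the exact runner.js code):

// Paraphrased sketch of one runPerSecond sample: run the test for at least
// minimumMs (750ms in the patch), batching calls between clock reads so that
// timer overhead stays small for very fast run functions. The harness takes
// 21 such samples and reports statistics over them.
function perSecondSample(test, minimumMs) {
    var runs = 0, totalTime = 0, callsPerIteration = 1;
    while (totalTime < minimumMs) {
        var start = Date.now();
        for (var i = 0; i < callsPerIteration; i++)
            test.run();
        totalTime += Date.now() - start;
        runs += callsPerIteration;
        // Double the batch size while a batch still finishes very quickly;
        // the actual patch only performs this adjustment during the first sample.
        if (totalTime < 100)
            callsPerIteration = Math.max(10, 2 * callsPerIteration);
    }
    return runs * 1000 / totalTime; // runs/s for this sample
}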

>> PerformanceTests/resources/runner.js:79
>> +    result.unit = unit || "ms";
> 
> Looks like you always set unit, so do you need this ||?

No, there are tests that call computeStatistics directly.

>> PerformanceTests/resources/runner.js:167
>> +PerfTestRunner.runPerSecond = function (test) {
> 
> It's up to you, but I would call this method run and call the other method runTime or something like that. Whichever method is called run is the one naive test writers will use.

Yeah, I'm planning to do that once I've converted some tests and seen how they behave over time.

>> PerformanceTests/resources/runner.js:179
>> +    this._runLoop();
> 
> Looks like this duplicates a lot of code in PerfTestRunner.run. Could they both call a shared helper function? PerfTestRunner._resetState or something?

I thought about it, but most of the initializations are slightly different between the two :(
The only code they could share is:
this.customRunFunction = null;
this._results = [];
this.log("Running " + this._runCount + " times");
this._runLoop();

Do you have a suggestion as to how to create a shared helper?

>> PerformanceTests/resources/runner.js:195
>> +            callsPerIteration = Math.max(10, 2 * callsPerIteration);
> 
> Again, it would be good to see what the V8 test harness does for this loop.

What they're doing is pretty naive: http://code.google.com/p/v8/source/browse/trunk/benchmarks/base.js#205
They call "new Date();" between each run even though they expect 32 or more runs per second.
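For comparison, the V8 measurement loop is roughly the following (paraphrased from the linked base.js, not a verbatim copy):

// Paraphrase of V8's loop: keep calling run() until at least one second has
// elapsed, reading the clock between every single run.
function v8StyleMeasure(benchmark) {
    var elapsed = 0, runs = 0;
    var start = new Date();
    while (elapsed < 1000) {
        benchmark.run();
        runs++;
        elapsed = new Date() - start;
    }
    return { runs: runs, elapsed: elapsed };
}
// The harness repeats this until it has accumulated at least 32 runs in total,
// then reports the average time per run.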

>> PerformanceTests/resources/runner.js:212
>> +    return new Date() - startTime;
> 
> Nit: s/new Date()/Date.now()/. Saves the cost of instantiating a Date object, which is most of the cost.

Good catch. Will fix.
Comment 4 Ojan Vafai 2012-05-12 22:49:00 PDT
Comment on attachment 141553 [details]
Adds PerfTestRunner.runPerSecond and converts flexbox tests

View in context: https://bugs.webkit.org/attachment.cgi?id=141553&action=review

>>> PerformanceTests/ChangeLog:17
>>> +        21 runs, yielding the total run time of roughly 18 seconds. This is significantly faster than
>> 
>> Do we really need to run 21 times to get the variance down? Seems excessive to me. Have you looked at what V8 does? They've had lots of time to tweak their test runner to reduce variance, but still be fast. With every test taking 18 seconds, the cycle time for the bots will be huge as people add tests and it will be harder to tell which patch caused a regression.
> 
> I think each test taking 18s is an improvement over the current situation where we spend well over a minute on some tests.
> 
> I did look into what v8 benchmark and dom_perf do. According to http://code.google.com/p/v8/source/browse/trunk/benchmarks/base.js#201,
> V8's benchmark runs for at least 1s or 32 runs. So 750ms is even smaller than that. We spend 18 seconds because we're getting 21 samples.
> 
> We can tweak and reduce the time if we wanted to in the future but we definitely need at least 20 samples.
> (I wanted 100 samples but that takes too long so we're already compromising on that end)

I actually prefer how the V8 benchmark code does this. It looks much simpler to me and is less likely to have test harness-induced variance. 

1. It doesn't do anything special to try to deal with Date objects being slow. I think if you're running for 1 second and using Date.now, you don't need to do this thing with callsPerIteration. It's too complicated and could lead to high accidental variance. How about we drop that for now and see what the results look like before complicating the test harness?

2. The V8 code does 1 second or 32 runs, but it only does it once. I don't see what benefit we get from running it 21 times. If we need something to reduce the variance, we can increase the measurement time to higher than 1 second or the minimum runs to more than 32. Running multiple times just makes for more complicated, harder to reason about code.
Comment 5 Ojan Vafai 2012-05-13 11:32:45 PDT
Comment on attachment 141553 [details]
Adds PerfTestRunner.runPerSecond and converts flexbox tests

View in context: https://bugs.webkit.org/attachment.cgi?id=141553&action=review

>>>> PerformanceTests/ChangeLog:17
>>>> +        21 runs, yielding the total run time of roughly 18 seconds. This is significantly faster than
>>> 
>>> Do we really need to run 21 times to get the variance down? Seems excessive to me. Have you looked at what V8 does? They've had lots of time to tweak their test runner to reduce variance, but still be fast. With every test taking 18 seconds, the cycle time for the bots will be huge as people add tests and it will be harder to tell which patch caused a regression.
>> 
>> I think each test taking 18s is an improvement over the current situation where we spend well over a minute on some tests.
>> 
>> I did look into what v8 benchmark and dom_perf do. According to http://code.google.com/p/v8/source/browse/trunk/benchmarks/base.js#201,
>> V8's benchmark runs for at least 1s or 32 runs. So 750ms is even smaller than that. We spend 18 seconds because we're getting 21 samples.
>> 
>> We can tweak and reduce the time if we wanted to in the future but we definitely need at least 20 samples.
>> (I wanted 100 samples but that takes too long so we're already compromising on that end)
> 
> I actually prefer how the V8 benchmark code does this. It looks much simpler to me and is less likely to have test harness-induced variance. 
> 
> 1. It doesn't do anything special to try to deal with Date objects being slow. I think if you're running for 1 second and using Date.now, you don't need to do this thing with callsPerIteration. It's too complicated and could lead to high accidental variance. How about we drop that for now and see what the results look like before complicating the test harness?
> 
> 2. The V8 code does 1 second or 32 runs, but it only does it once. I don't see what benefit we get from running it 21 times. If we need something to reduce the variance, we can increase the measurement time to higher than 1 second or the minimum runs to more than 32. Running multiple times just makes for more complicated, harder to reason about code.

I remembered that magnitude-perf.js does something similar to the code you have here. I actually wonder if that contributes to some of the flakiness we see in the LayoutTests/perf tests. Might be worth trying locally to simplify that logic and see if things get less flaky.
Comment 6 Ryosuke Niwa 2012-05-13 16:58:36 PDT
Created attachment 141617 [details]
Flexbox Perf. Test Experiment - function call per iteration is adjusted
Comment 7 Ryosuke Niwa 2012-05-13 16:59:08 PDT
Created attachment 141618 [details]
Flexbox Perf. Test Experiment - function call per iteration is 1
Comment 8 Ryosuke Niwa 2012-05-13 17:11:18 PDT
Okay, I was curious about this, so I did some experiments. I ran run-perf-tests on a Google-issued MacBook Pro (6,2; OS X 10.6.8; CPU: Core i5 2.53GHz; RAM: 4GB DDR3 1067MHz) with two variants of runner.js: one where callsPerIteration = 1, and another where callsPerIteration is adjusted as in the patch. To minimize noise, I:
- disabled anti-virus software during the experiment
- used each variant of runner.js in turn, so that if there had been some permanent environment change, both results would be affected
- did not use the keyboard or mouse while run-perf-tests was running.

First, I hypothesized that adjusting the number of function calls per iteration only benefits fast run functions and does not affect slow run functions. To confirm (or rather, to refute if possible) the latter null hypothesis (i.e. that it does not affect slow run functions), I used the flexbox performance tests added by Tony, as modified in the attached patch, ran the tests 10 times each, and recorded the mean, median, stdev, min, and max values.

CallsPerIteration = 1: https://bug-86021-attachments.webkit.org/attachment.cgi?id=141618
CallsPerIteration adjusted: https://bug-86021-attachments.webkit.org/attachment.cgi?id=141617

From the graph, I could not observe any statistically significant difference between the two approaches.
Comment 9 Ryosuke Niwa 2012-05-13 17:14:26 PDT
Here are the averages of the standard deviations I observed with the two approaches:
CallsPerIteration = 1
test      flexbox-column-nowrap flexbox-column-wrap flexbox-row-nowrap flexbox-row-wrap
avg stdev 10.013323             9.466342569         5.057115346        7.981334236

CallsPerIteration adjusted
test      flexbox-column-nowrap flexbox-column-wrap flexbox-row-nowrap flexbox-row-wrap
avg stdev 11.21493623           8.05499943          9.153795006        6.051266683
Comment 10 Ryosuke Niwa 2012-05-13 17:35:01 PDT
Now, for the former hypothesis that the adjustment benefits fast run functions (this addresses Ojan's point 1), I created the following test:

<!DOCTYPE html>
<html>
<body>
<script src="../resources/runner.js"></script>
<script>
PerfTestRunner.runPerSecond({run: function () { document.createElement('div'); }});
</script>
</body>
</html>

And took 20 samples with both approaches. The result can be seen at https://docs.google.com/spreadsheet/ccc?key=0AqvOKf8yeFVCdG1PRnJsYnBUWUg4SW5lZFJyN0M1bWc

The result is that the average standard deviation is roughly 1.35% when CallsPerIteration is adjusted (or 1.07% if we discard one outlier), and 2.02% when CallsPerIteration is 1. That is, we have 50%-100% more variance when CallsPerIteration is always 1. Furthermore, the results when CallsPerIteration is 1 were 14% slower, indicating that the calls to Date.now() significantly affected the results.

Thus, I'm strongly inclined to reject Ojan's null hypothesis in comment #4 that "if you're running for 1 second and using Date.now, you don't need to do this thing with callsPerIteration", although my experiments ran for 750ms per iteration.
Comment 11 Ryosuke Niwa 2012-05-13 17:48:39 PDT
Now let us test the second null-hypothesis Ojan suggested:

(In reply to comment #4)
> 2. The V8 code does 1 second or 32 runs, but it only does it once. I don't see what benefit we get from running it 21 times. If we need something to reduce the variance, we can increase the measurement time to higher than 1 second or the minimum runs to more than 32. Running multiple times just makes for more complicated, harder to reason about code.

Looking at the test results posted in comment #10, we see that even when CallsPerIteration is adjusted,
(avg max - avg min) / (avg min) is a staggering 6.3%, even though the standard deviation within each run stayed in the 1-2% range. Similarly, the values for the flexbox perf. tests were around 4-5%, even though the standard deviations within each run, again, stayed in the 0.5-2% range.

(See https://docs.google.com/spreadsheet/ccc?key=0AqvOKf8yeFVCdFNvT2s0ZjJRQmhXc3EwTDhQejQ3aFE for flexbox experiment results)

Thus, I conclude that running a test for 750ms only once is definitely inadequate, given that we would like to catch performance regressions on the order of 5-10%.
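For reference, the spread figure quoted above is simply the relative range of the per-run averages; a hypothetical helper (the name and shape are illustrative, not part of the harness):

// (avg max - avg min) / avg min, expressed as a percentage.
function relativeSpreadPercent(averages) {
    var min = Math.min.apply(null, averages);
    var max = Math.max.apply(null, averages);
    return (max - min) / min * 100;
}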
Comment 12 Ryosuke Niwa 2012-05-13 20:55:44 PDT
Created attachment 141631 [details]
Patch for landing
Comment 13 WebKit Review Bot 2012-05-13 22:22:55 PDT
Comment on attachment 141631 [details]
Patch for landing

Clearing flags on attachment: 141631

Committed r116916: <http://trac.webkit.org/changeset/116916>
Comment 14 WebKit Review Bot 2012-05-13 22:23:01 PDT
All reviewed patches have been landed.  Closing bug.
Comment 15 Ryosuke Niwa 2012-05-14 12:29:48 PDT
Build fix landed in http://trac.webkit.org/changeset/116976.