Bug 86021 - performance tests should be able to measure runs/sec rather than time
Summary: performance tests should be able to measure runs/sec rather than time
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Ryosuke Niwa
URL:
Keywords:
Depends on:
Blocks: 77037
Reported: 2012-05-09 15:05 PDT by Tony Chang
Modified: 2012-05-14 12:29 PDT
CC List: 11 users

See Also:


Attachments
Adds PerfTestRunner.runPerSecond and converts flexbox tests (14.54 KB, patch)
2012-05-11 23:55 PDT, Ryosuke Niwa
Flexbox Perf. Test Experiment - function call per iteration is adjusted (deleted)
2012-05-13 16:58 PDT, Ryosuke Niwa
Flexbox Perf. Test Experiment - function call per iteration is 1 (11.87 KB, image/png)
2012-05-13 16:59 PDT, Ryosuke Niwa
Patch for landing (15.24 KB, patch)
2012-05-13 20:55 PDT, Ryosuke Niwa

Description Tony Chang 2012-05-09 15:05:54 PDT
It would be helpful to measure runs/sec over a fixed time because, as we speed up tests and hardware gets faster, the precision of time-based measurements gets worse.

I think Morita did a one-off change to support the Dromaeo tests (which measure hertz), but we should roll this ability directly into PerformanceTests/resources/runner.js so that non-Dromaeo tests can also report results in hertz.
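For illustration only, the difference can be sketched as a tiny helper that counts how many times a test body runs within a fixed window (a hypothetical sketch, not the API the patch below adds):

// Hypothetical sketch: instead of timing a fixed number of iterations,
// count how many iterations complete within a fixed window and report a rate.
function measureRunsPerSecond(run, windowMs) {
    var runs = 0;
    var start = Date.now();
    while (Date.now() - start < windowMs) {
        run();
        runs++;
    }
    return runs * 1000 / (Date.now() - start); // runs/s
}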
Comment 1 Ryosuke Niwa 2012-05-11 23:55:29 PDT
Created attachment 141553 [details]
Adds PerfTestRunner.runPerSecond and converts flexbox tests
Comment 2 Ojan Vafai 2012-05-12 10:31:42 PDT
Comment on attachment 141553 [details]
Adds PerfTestRunner.runPerSecond and converts flexbox tests

View in context: https://bugs.webkit.org/attachment.cgi?id=141553&action=review

> PerformanceTests/ChangeLog:17
> +        By default, runPerSecond runs the test for at least 750 milliseconds in each run, and executes
> +        21 runs, yielding the total run time of roughly 18 seconds. This is significantly faster than

Do we really need to run 21 times to get the variance down? Seems excessive to me. Have you looked at what V8 does? They've had lots of time to tweak their test runner to reduce variance, but still be fast. With every test taking 18 seconds, the cycle time for the bots will be huge as people add tests and it will be harder to tell which patch caused a regression.

> PerformanceTests/resources/runner.js:79
> +    result.unit = unit || "ms";

Looks like you always set unit, so do you need this ||?

> PerformanceTests/resources/runner.js:167
> +PerfTestRunner.runPerSecond = function (test) {

It's up to you, but I would call this method run and call the other method runTime or something like that. Whichever method is called run is the one naive test writers will use.

> PerformanceTests/resources/runner.js:179
> +    this._runCount = test.runCount || 21;
> +    this._doneFunction = function () { if (test.done) test.done(); };
> +    this._completedRuns = 0;
> +    this._callsPerIteration = 1;
> +    this.customRunFunction = null;
> +    this.unit = 'runs/s';
> +    this._results = [];
> +
> +    this._test = test;
> +    this._runner = this._perSecondRunner;
> +    this.log("Running " + this._runCount + " times");
> +    this._runLoop();

Looks like this duplicates a lot of code in PerfTestRunner.run. Could they both call a shared helper function? PerfTestRunner._resetState or something?

> PerformanceTests/resources/runner.js:195
> +        totalTime += this._perSecondRunnerIterator(callsPerIteration);
> +        i += callsPerIteration;
> +        if (this._completedRuns <= 0 && totalTime < 100)
> +            callsPerIteration = Math.max(10, 2 * callsPerIteration);

Again, it would be good to see what the V8 test harness does for this loop.

> PerformanceTests/resources/runner.js:212
> +    var startTime = new Date();
> +    for (var i = 0; i < callsPerIteration; i++)
> +        this._test.run();
> +    return new Date() - startTime;

Nit: s/new Date()/Date.now()/. Saves the cost of instantiating a Date object, which is most of the cost.
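For clarity, the suggested substitution amounts to the following (illustrative snippet, not the actual patch hunk):

// Before: two Date allocations per measurement.
var startTime = new Date();
// ... calls to this._test.run() ...
var elapsed = new Date() - startTime;

// After: Date.now() returns a plain number, avoiding the allocations.
var startTime = Date.now();
// ... calls to this._test.run() ...
var elapsed = Date.now() - startTime;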
Comment 3 Ryosuke Niwa 2012-05-12 13:00:33 PDT
Comment on attachment 141553 [details]
Adds PerfTestRunner.runPerSecond and converts flexbox tests

View in context: https://bugs.webkit.org/attachment.cgi?id=141553&action=review

Thanks for the review!

>> PerformanceTests/ChangeLog:17
>> +        21 runs, yielding the total run time of roughly 18 seconds. This is significantly faster than
> 
> Do we really need to run 21 times to get the variance down? Seems excessive to me. Have you looked at what V8 does? They've had lots of time to tweak their test runner to reduce variance, but still be fast. With every test taking 18 seconds, the cycle time for the bots will be huge as people add tests and it will be harder to tell which patch caused a regression.

I think each test taking 18s is an improvement over the current situation where we spend well over a minute on some tests.

I did look into what v8 benchmark and dom_perf do. According to http://code.google.com/p/v8/source/browse/trunk/benchmarks/base.js#201,
V8's benchmark runs for at least 1s or 32 runs. So 750ms is even smaller than that. We spend 18 seconds because we're getting 21 samples.

We can tweak and reduce the time if we wanted to in the future but we definitely need at least 20 samples.
(I wanted 100 samples but that takes too long so we're already compromising on that end)
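Pieced together from the snippets quoted above, the shape of one runPerSecond sample is roughly the following (a paraphrased sketch, not the exact runner.js code):

// Paraphrased sketch of one runPerSecond sample: run the test for at least
// minimumMs (750ms in the patch), batching calls between clock reads so that
// timer overhead stays small for very fast run functions. The harness takes
// 21 such samples and reports statistics over them.
function perSecondSample(test, minimumMs) {
    var runs = 0, totalTime = 0, callsPerIteration = 1;
    while (totalTime < minimumMs) {
        var start = Date.now();
        for (var i = 0; i < callsPerIteration; i++)
            test.run();
        totalTime += Date.now() - start;
        runs += callsPerIteration;
        // Double the batch size while a batch still finishes very quickly;
        // the actual patch only performs this adjustment during the first sample.
        if (totalTime < 100)
            callsPerIteration = Math.max(10, 2 * callsPerIteration);
    }
    return runs * 1000 / totalTime; // runs/s for this sample
}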

>> PerformanceTests/resources/runner.js:79
>> +    result.unit = unit || "ms";
> 
> Looks like you always set unit, so do you need this ||?

No, there are tests that call computeStatistics directly.

>> PerformanceTests/resources/runner.js:167
>> +PerfTestRunner.runPerSecond = function (test) {
> 
> It's up to you, but I would call this method run and call the other method runTime or something like that. Whichever method is called run is the one naive test writers will use.

Yeah, I'm planning to do that once I've converted some tests and seen how they behave over time.

>> PerformanceTests/resources/runner.js:179
>> +    this._runLoop();
> 
> Looks like this duplicates a lot of code in PerfTestRunner.run. Could they both call a shared helper function? PerfTestRunner._resetState or something?

I thought about it, but most of the initializations are slightly different between the two :(
The only code they could share is:
this.customRunFunction = null;
this._results = [];
this.log("Running " + this._runCount + " times");
this._runLoop();

Do you have a suggestion as to how to create a shared helper?

>> PerformanceTests/resources/runner.js:195
>> +            callsPerIteration = Math.max(10, 2 * callsPerIteration);
> 
> Again, it would be good to see what the V8 test harness does for this loop.

What they're doing is pretty naive: http://code.google.com/p/v8/source/browse/trunk/benchmarks/base.js#205
They call "new Date();" between each run even though they expect 32 or more runs per second.
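For comparison, the V8 measurement loop is roughly the following (paraphrased from the linked base.js, not a verbatim copy):

// Paraphrase of V8's loop: keep calling run() until at least one second has
// elapsed, reading the clock between every single run.
function v8StyleMeasure(benchmark) {
    var elapsed = 0, runs = 0;
    var start = new Date();
    while (elapsed < 1000) {
        benchmark.run();
        runs++;
        elapsed = new Date() - start;
    }
    return { runs: runs, elapsed: elapsed };
}
// The harness repeats this until it has accumulated at least 32 runs in total,
// then reports the average time per run.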

>> PerformanceTests/resources/runner.js:212
>> +    return new Date() - startTime;
> 
> Nit: s/new Date()/Date.now()/. Saves the cost of instantiating a Date object, which is most of the cost.

Good catch. Will fix.
Comment 4 Ojan Vafai 2012-05-12 22:49:00 PDT
Comment on attachment 141553 [details]
Adds PerfTestRunner.runPerSecond and converts flexbox tests

View in context: https://bugs.webkit.org/attachment.cgi?id=141553&action=review

>>> PerformanceTests/ChangeLog:17
>>> +        21 runs, yielding the total run time of roughly 18 seconds. This is significantly faster than
>> 
>> Do we really need to run 21 times to get the variance down? Seems excessive to me. Have you looked at what V8 does? They've had lots of time to tweak their test runner to reduce variance, but still be fast. With every test taking 18 seconds, the cycle time for the bots will be huge as people add tests and it will be harder to tell which patch caused a regression.
> 
> I think each test taking 18s is an improvement over the current situation where we spend well over a minute on some tests.
> 
> I did look into what v8 benchmark and dom_perf do. According to http://code.google.com/p/v8/source/browse/trunk/benchmarks/base.js#201,
> V8's benchmark runs for at least 1s or 32 runs. So 750ms is even smaller than that. We spend 18 seconds because we're getting 21 samples.
> 
> We can tweak and reduce the time if we wanted to in the future but we definitely need at least 20 samples.
> (I wanted 100 samples but that takes too long so we're already compromising on that end)

I actually prefer how the V8 benchmark code does this. It looks much simpler to me and is less likely to have test harness-induced variance. 

1. It doesn't do anything special to try to deal with Date objects being slow. I think if you're running for 1 second and using Date.now, you don't need to do this thing with callsPerIteration. It's too complicated and could lead to high accidental variance. How about we drop that for now and see what the results look like before complicating the test harness?

2. The V8 code does 1 second or 32 runs, but it only does it once. I don't see what benefit we get from running it 21 times. If we need something to reduce the variance, we can increase the measurement time to higher than 1 second or the minimum runs to more than 32. Running multiple times just makes for more complicated, harder to reason about code.
Comment 5 Ojan Vafai 2012-05-13 11:32:45 PDT
Comment on attachment 141553 [details]
Adds PerfTestRunner.runPerSecond and converts flexbox tests

View in context: https://bugs.webkit.org/attachment.cgi?id=141553&action=review

>>>> PerformanceTests/ChangeLog:17
>>>> +        21 runs, yielding the total run time of roughly 18 seconds. This is significantly faster than
>>> 
>>> Do we really need to run 21 times to get the variance down? Seems excessive to me. Have you looked at what V8 does? They've had lots of time to tweak their test runner to reduce variance, but still be fast. With every test taking 18 seconds, the cycle time for the bots will be huge as people add tests and it will be harder to tell which patch caused a regression.
>> 
>> I think each test taking 18s is an improvement over the current situation where we spend well over a minute on some tests.
>> 
>> I did look into what v8 benchmark and dom_perf do. According to http://code.google.com/p/v8/source/browse/trunk/benchmarks/base.js#201,
>> V8's benchmark runs for at least 1s or 32 runs. So 750ms is even smaller than that. We spend 18 seconds because we're getting 21 samples.
>> 
>> We can tweak and reduce the time if we wanted to in the future but we definitely need at least 20 samples.
>> (I wanted 100 samples but that takes too long so we're already compromising on that end)
> 
> I actually prefer how the V8 benchmark code does this. It looks much simpler to me and is less likely to have test harness-induced variance. 
> 
> 1. It doesn't do anything special to try to deal with Date objects being slow. I think if you're running for 1 second and using Date.now, you don't need to do this thing with callsPerIteration. It's too complicated and could lead to high accidental variance. How about we drop that for now and see what the results look like before complicating the test harness?
> 
> 2. The V8 code does 1 second or 32 runs, but it only does it once. I don't see what benefit we get from running it 21 times. If we need something to reduce the variance, we can increase the measurement time to higher than 1 second or the minimum runs to more than 32. Running multiple times just makes for more complicated, harder to reason about code.

I remembered that magnitude-perf.js does something similar to the code you have here. I actually wonder if that contributes to some of the flakiness we see in the LayoutTests/perf tests. Might be worth trying locally to simplify that logic and see if things get less flaky.
Comment 6 Ryosuke Niwa 2012-05-13 16:58:36 PDT
Created attachment 141617 [details]
Flexbox Perf. Test Experiment - function call per iteration is adjusted
Comment 7 Ryosuke Niwa 2012-05-13 16:59:08 PDT
Created attachment 141618 [details]
Flexbox Perf. Test Experiment - function call per iteration is 1
Comment 8 Ryosuke Niwa 2012-05-13 17:11:18 PDT
Okay, I was curious about this, so I did some experiments. I ran run-perf-tests on a Google-issued MacBook Pro (6,2; OS X 10.6.8; CPU: Core i5 2.53GHz; RAM: 4GB DDR3 1067MHz) with two variants of runner.js: one where callsPerIteration = 1, and another where callsPerIteration is adjusted as in the patch. To minimize noise, I:
- disabled anti-virus software during the experiment
- used each variant of runner.js in turn, so that if there had been some permanent environment change, both results would be affected
- did not use the keyboard or mouse while run-perf-tests was running.

First, I hypothesized that adjusting the number of function calls per iteration only benefits fast run functions and does not affect slow run functions. To confirm (or rather, to refute if possible) the latter null hypothesis (i.e. that it does not affect slow run functions), I used the flexbox performance tests added by Tony, as modified in the attached patch, ran the tests 10 times each, and recorded the mean, median, stdev, min, and max values.

CallsPerIteration = 1: https://bug-86021-attachments.webkit.org/attachment.cgi?id=141618
CallsPerIteration adjusted: https://bug-86021-attachments.webkit.org/attachment.cgi?id=141617

From the graph, I could not observe any statistically significant difference between the two approaches.
Comment 9 Ryosuke Niwa 2012-05-13 17:14:26 PDT
Here are the averages of the standard deviations I observed with the two approaches:
CallsPerIteration = 1
test      flexbox-column-nowrap flexbox-column-wrap flexbox-row-nowrap flexbox-row-wrap
avg stdev 10.013323             9.466342569         5.057115346        7.981334236

CallsPerIteration adjusted
test      flexbox-column-nowrap flexbox-column-wrap flexbox-row-nowrap flexbox-row-wrap
avg stdev 11.21493623           8.05499943          9.153795006        6.051266683
Comment 10 Ryosuke Niwa 2012-05-13 17:35:01 PDT
Now, for the former hypothesis that the adjustment benefits fast run functions (this addresses Ojan's point 1), I created the following test:

<!DOCTYPE html>
<html>
<body>
<script src="../resources/runner.js"></script>
<script>
PerfTestRunner.runPerSecond({run: function () { document.createElement('div'); }});
</script>
</body>
</html>

And took 20 samples with both approaches. The result can be seen at https://docs.google.com/spreadsheet/ccc?key=0AqvOKf8yeFVCdG1PRnJsYnBUWUg4SW5lZFJyN0M1bWc

The result is that the average standard deviation is roughly 1.35% when CallsPerIteration is adjusted (or 1.07% if we discard one outlier), and 2.02% when CallsPerIteration is 1. That is, we have 50%-100% more variance when CallsPerIteration is always 1. Furthermore, the results when CallsPerIteration is 1 were 14% slower, indicating that the calls to Date.now() significantly affected the results.

Thus, I'm strongly inclined to reject Ojan's null hypothesis in comment #4 that "if you're running for 1 second and using Date.now, you don't need to do this thing with callsPerIteration", although my experiments ran for 750ms per iteration.
Comment 11 Ryosuke Niwa 2012-05-13 17:48:39 PDT
Now let us test the second null-hypothesis Ojan suggested:

(In reply to comment #4)
> 2. The V8 code does 1 second or 32 runs, but it only does it once. I don't see what benefit we get from running it 21 times. If we need something to reduce the variance, we can increase the measurement time to higher than 1 second or the minimum runs to more than 32. Running multiple times just makes for more complicated, harder to reason about code.

Looking at the test results posted in comment #10, we see that even when CallsPerIteration is adjusted,
(avg max - avg min) / (avg min) is a staggering 6.3%, even though the standard deviation within each run stayed in the 1-2% range. Similarly, the values for the flexbox perf. tests were around 4-5%, even though the standard deviations within each run, again, stayed in the 0.5-2% range.

(See https://docs.google.com/spreadsheet/ccc?key=0AqvOKf8yeFVCdFNvT2s0ZjJRQmhXc3EwTDhQejQ3aFE for flexbox experiment results)

Thus, I conclude that running a test for 750ms only once is definitely inadequate, given that we would like to catch performance regressions on the order of 5-10%.
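For reference, the spread figure quoted above is simply the relative range of the per-run averages; a hypothetical helper (the name and shape are illustrative, not part of the harness):

// (avg max - avg min) / avg min, expressed as a percentage.
function relativeSpreadPercent(averages) {
    var min = Math.min.apply(null, averages);
    var max = Math.max.apply(null, averages);
    return (max - min) / min * 100;
}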
Comment 12 Ryosuke Niwa 2012-05-13 20:55:44 PDT
Created attachment 141631 [details]
Patch for landing
Comment 13 WebKit Review Bot 2012-05-13 22:22:55 PDT
Comment on attachment 141631 [details]
Patch for landing

Clearing flags on attachment: 141631

Committed r116916: <http://trac.webkit.org/changeset/116916>
Comment 14 WebKit Review Bot 2012-05-13 22:23:01 PDT
All reviewed patches have been landed.  Closing bug.
Comment 15 Ryosuke Niwa 2012-05-14 12:29:48 PDT
Build fix landed in http://trac.webkit.org/changeset/116976.