Bug 97510

Summary: Some perf. tests have variances that differ greatly between runs
Product: WebKit    Reporter: Ryosuke Niwa <rniwa>
Component: Tools / Tests    Assignee: Ryosuke Niwa <rniwa>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, dpranke, eric, haraken, jhammel, mjs, morrita, ojan, slewis, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 97625, 99800, 105685, 106078, 106172, 111030    
Bug Blocks: 77037, 105003    
Attachments:
Description Flags
Two consecutive runs of perf. tests at r129387
none
layout_20_100 (original)
none
layout_20_100 (approach 1)
none
layout_20_100 (approach 2)
none
scroll-top (original)
none
scroll-top (approach 1)
none
scroll-top (approach 2)
none
js-attr-prototype (original)
none
js-attr-prototype (approach 1)
none
js-attr-prototype (approach 2)
none
Work in progress
none
Work in progress 2
none
Patch
none
Fixed a harness test
none
Updated per comment and introduced iteration groups
none

Comment 1 Ryosuke Niwa 2012-09-24 18:35:26 PDT
I've considered the following two approaches to solve this problem:

1. Increase the number of samples we take in each test (js code change).
2. Reduce the number of samples taken in each test by a factor of roughly 4, and run the same test in 4 different instances of DumpRenderTree.

I'm going to post a whole bunch of results pages now, but the results appear to indicate that we should take approach 2.
Comment 2 Ryosuke Niwa 2012-09-24 18:36:03 PDT
Created attachment 165487 [details]
layout_20_100 (original)
Comment 3 Ryosuke Niwa 2012-09-24 18:36:35 PDT
Created attachment 165488 [details]
layout_20_100 (approach 1)
Comment 4 Ryosuke Niwa 2012-09-24 18:36:58 PDT
Created attachment 165489 [details]
layout_20_100 (approach 2)
Comment 5 Ryosuke Niwa 2012-09-24 18:37:42 PDT
Created attachment 165490 [details]
scroll-top (original)
Comment 6 Ryosuke Niwa 2012-09-24 18:38:01 PDT
Created attachment 165491 [details]
scroll-top (approach 1)
Comment 7 Ryosuke Niwa 2012-09-24 18:38:30 PDT
Created attachment 165492 [details]
scroll-top (approach 2)
Comment 8 Ryosuke Niwa 2012-09-24 18:39:15 PDT
Created attachment 165493 [details]
js-attr-prototype (original)
Comment 9 Ryosuke Niwa 2012-09-24 18:39:48 PDT
Created attachment 165494 [details]
js-attr-prototype (approach 1)
Comment 10 Ryosuke Niwa 2012-09-24 18:40:24 PDT
Created attachment 165495 [details]
js-attr-prototype (approach 2)
Comment 11 Ryosuke Niwa 2012-09-24 18:42:24 PDT
To elaborate more on the two approaches, let me give you an example.

Suppose we have a sample test that has 20 iterations. Approach 1 increases the number of iterations to, say, 100. Approach 2 reduces the number of iterations to 5, but then runs it 4 times, each using a different instance of DumpRenderTree.
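The trade-off between the two approaches above can be sketched in a few lines. This is a purely illustrative toy (the names `run_iterations`, `approach_1`, and `approach_2` are hypothetical, not the actual WebKit harness API); the point is only that approach 2 collects the same total number of samples while giving each small batch a fresh process:

```python
import random

def simulate_one_iteration():
    # Placeholder for a single timed test iteration (hypothetical).
    return random.gauss(100, 5)

def run_iterations(count):
    """Stand-in for one DumpRenderTree instance running `count` iterations.
    A real harness would launch a fresh process and parse its output."""
    return [simulate_one_iteration() for _ in range(count)]

def approach_1(total=100):
    # One long-lived process, many iterations: per-process startup
    # state is shared by every sample.
    return run_iterations(total)

def approach_2(processes=4, per_process=5):
    # Several short-lived processes: startup state differs per batch,
    # so a single unlucky process startup skews fewer samples.
    samples = []
    for _ in range(processes):
        samples.extend(run_iterations(per_process))
    return samples

print(len(approach_2()))  # 4 processes x 5 iterations = 20 samples
```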
Comment 12 Kentaro Hara 2012-09-24 18:51:24 PDT
General comment: In my experience, an average value is strongly affected by a couple of outliers. How about calculating the average value after discarding one or two outliers (i.e. discarding one or two largest values)? Or how about using a median instead of an average?

What we are interested in is not the distribution of execution times but the execution time in "common cases". In that sense, in order to observe what we want to observe, it might make more sense to observe a median or an average that ignores outliers than observe a pure average of all values.
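The suggestion above (median, or a mean that discards the largest values) can be seen on a small synthetic sample. The numbers are made up to mimic a run with two outlier iterations, e.g. ones that hit a GC pause:

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def trimmed_mean(xs, discard=2):
    # Discard the `discard` largest values before averaging.
    return mean(sorted(xs)[:-discard])

samples = [10, 10, 11, 10, 11, 10, 50, 48]  # two outlier iterations
print(mean(samples))          # 20.0: dragged up by the outliers
print(median(samples))        # 10.5: unaffected by them
print(trimmed_mean(samples))  # ~10.33: unaffected by them
```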
Comment 13 Ryosuke Niwa 2012-09-24 19:53:23 PDT
(In reply to comment #12)
> General comment: In my experience, an average value is strongly affected by a couple of outliers. How about calculating the average value after discarding one or two outliers (i.e. discarding one or two largest values)? Or how about using a median instead of an average?

You can take a look at each graph (click on the test to show the graph, and then click on the graph to adjust the y-axis), but I don't think discarding one or two extrema or using the median would help here, because some of these tests have bi-modal distributions.
Comment 14 Ryosuke Niwa 2012-09-24 19:57:27 PDT
Also, take a look at the graph on https://bug-97510-attachments.webkit.org/attachment.cgi?id=165488 (layout_20_100 with 100 iterations). There, values are not only multi-modal but both means and medians are centered at different values in different runs.
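The problem with a bi-modal distribution described above can be reproduced with synthetic data (these numbers are invented, not real layout_20_100 timings): when each iteration lands near one of two modes, the median of a run depends on which mode that run happened to sample more, so it is no more stable across runs than the mean.

```python
import random

def bimodal_run(n=20, seed=None):
    # Each iteration lands near one of two modes, e.g. depending on
    # per-process startup state: around 100 or around 130 (arbitrary units).
    rng = random.Random(seed)
    return [rng.gauss(100 if rng.random() < 0.5 else 130, 2)
            for _ in range(n)]

def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

# Medians of several independent "runs" can sit near either mode,
# even though nothing about the code under test changed.
for seed in (1, 2, 3):
    print(round(median(bimodal_run(seed=seed)), 1))
```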
Comment 15 Stephanie Lewis 2012-10-01 15:44:04 PDT
IMO you want to run enough iterations so that issues in repeating code are found (e.g. if the JavaScript heap size grows each iteration, slowing things down), but running multiple instances does improve variance, and it reduces the risk of an entire run being an outlier.
Comment 16 Ryosuke Niwa 2012-10-01 16:18:21 PDT
(In reply to comment #15)
> IMO you want to run enough iterations so that issues in repeating code are found (e.g. if the JavaScript heap size grows each iteration, slowing things down), but running multiple instances does improve variance, and it reduces the risk of an entire run being an outlier.

In V8 at least, there are some global variables that are initialized at startup and then used to compute hashes, etc. So just increasing the number of iterations doesn't help; see the results labeled "(approach 1)".
Comment 17 Ryosuke Niwa 2013-02-19 01:54:14 PST
Created attachment 189027 [details]
Work in progress
Comment 18 Ryosuke Niwa 2013-02-28 01:49:56 PST
Created attachment 190681 [details]
Work in progress 2
Comment 19 Ryosuke Niwa 2013-02-28 15:58:23 PST
Created attachment 190829 [details]
Patch
Comment 20 Ryosuke Niwa 2013-02-28 16:01:32 PST
Created attachment 190831 [details]
Fixed a harness test
Comment 21 Benjamin Poulain 2013-03-01 16:49:45 PST
Comment on attachment 190831 [details]
Fixed a harness test

View in context: https://bugs.webkit.org/attachment.cgi?id=190831&action=review

I can't wait to see the result on the bots.

> PerformanceTests/Dromaeo/resources/dromaeorunner.js:9
>           setup: function(testName) {
> -             PerfTestRunner.prepareToMeasureValuesAsync({iterationCount: 5, doNotMeasureMemoryUsage: true, doNotIgnoreInitialRun: true, unit: 'runs/s'});
> +             PerfTestRunner.prepareToMeasureValuesAsync({dromaeoIterationCount: 5, doNotMeasureMemoryUsage: true, doNotIgnoreInitialRun: true, unit: 'runs/s'});
>  
>               var iframe = document.createElement("iframe");
> -             var url = DRT.baseURL + "?" + testName + '&numTests=' + PerfTestRunner.iterationCount();
> +             var url = DRT.baseURL + "?" + testName + '&numTests=' + 5;

var dromaeoIterationCount;
PerfTestRunner.prepareToMeasureValuesAsync({dromaeoIterationCount: dromaeoIterationCount, Foobar)
[...]
var url = DRT.baseURL + "?" + testName + '&numTests=' + dromaeoIterationCount;

> PerformanceTests/resources/runner.js:158
> +        iterationCount = test.dromaeoIterationCount || (window.testRunner ? 5 : 20);

Damn, JavaScript is ugly :-D

> Tools/Scripts/webkitpy/performance_tests/perftest.py:110
> +    def __init__(self, port, test_name, test_path, process_count=4):

process_count -> process_run_count or something alike?

> Tools/Scripts/webkitpy/performance_tests/perftest.py:134
> +        for _ in range(0, self._process_count):

xrange?

Gosh I hate python sometime :)

> Tools/Scripts/webkitpy/performance_tests/perftest.py:138
> +                if not self._run_with_driver(driver, time_out_ms):
> +                    return None

You may have 3 runs with results and one that failed?
Comment 22 Ryosuke Niwa 2013-03-01 20:35:59 PST
Created attachment 191091 [details]
Updated per comment and introduced iteration groups
Comment 23 Eric Seidel (no email) 2013-03-03 01:20:19 PST
Comment on attachment 191091 [details]
Updated per comment and introduced iteration groups

View in context: https://bugs.webkit.org/attachment.cgi?id=191091&action=review

> Tools/Scripts/webkitpy/performance_tests/perftest.py:383
> -        for i in range(0, 20):
> +        for i in range(0, 6):

What does this do?
Comment 24 Eric Seidel (no email) 2013-03-03 01:21:00 PST
Comment on attachment 191091 [details]
Updated per comment and introduced iteration groups

I would have split the iteration-count change out into a separate patch. That would have reduced the size by half and kept the two independent perf-results changes separate.
Comment 25 Ryosuke Niwa 2013-03-03 01:21:44 PST
Comment on attachment 191091 [details]
Updated per comment and introduced iteration groups

View in context: https://bugs.webkit.org/attachment.cgi?id=191091&action=review

>> Tools/Scripts/webkitpy/performance_tests/perftest.py:383
>> +        for i in range(0, 6):
> 
> What does this do?

It'll do 6 runs instead of 20 for a given driver.
20 should have been 21 since we ignore the first run. It was a bug :(
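The off-by-one described above is easy to sketch (this is an illustrative stand-in, not the actual `perftest.py` code): when the first warm-up run is discarded, a loop of 20 iterations only yields 19 measured samples, so the bound needs to be 21.

```python
def run_once(i):
    return i  # stand-in for one timed iteration

def collect_samples(iterations, ignore_first=True):
    # The harness ignores the first (warm-up) run before reporting.
    results = [run_once(i) for i in range(iterations)]
    return results[1:] if ignore_first else results

print(len(collect_samples(20)))  # 19 samples: the bug
print(len(collect_samples(21)))  # 20 samples: intended
```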
Comment 26 Ryosuke Niwa 2013-03-03 15:21:21 PST
Comment on attachment 191091 [details]
Updated per comment and introduced iteration groups

Clearing flags on attachment: 191091

Committed r144583: <http://trac.webkit.org/changeset/144583>
Comment 27 Ryosuke Niwa 2013-03-03 15:21:26 PST
All reviewed patches have been landed.  Closing bug.