172968 – Compute the final score using geometric mean in Speedometer 2.0

RESOLVED FIXED 172968

https://bugs.webkit.org/show_bug.cgi?id=172968

Mathias Bynens

Reported 2017-06-06 03:54:36 PDT

Currently, Speedometer uses the arithmetic mean of individual test results. As a result, Speedometer is essentially dominated by 3 individual benchmarks. It would be nice to get that fixed. I’m happy to submit a patch that uses the geometric mean instead.

Attachments
Changes the score computation (11.86 KB, patch) 2017-09-05 02:29 PDT, Ryosuke Niwa, saam: review+

Ryosuke Niwa

Comment 1 2017-06-06 10:56:30 PDT

This is by design. We're weighting each test case based on how slow it is.

Benedikt Meurer

Comment 2 2017-06-23 01:53:12 PDT

With arithmetic mean, the overall score is a lot less useful. This way Speedometer 2 is basically an Ember/AngularJS/React-Redux benchmark. And this will remain the case, since Ember does substantially more than other tests that merely test the view layer (to get a serious comparison you'd probably need to replace Ember with Glimmer.js). Using geometric mean instead would make it a bit more useful, since changes for other frameworks would also be reflected in the overall score (in both Safari and Chrome).

Tianyou Li

Comment 3 2017-07-11 08:21:55 PDT

As stated at http://browserbench.org/Speedometer/, "Speedometer is not meant to compare the performance of different JavaScript frameworks." Yet the current Speedometer score is weighted toward the JS frameworks that take longer to execute, for example Ember.js. Would it be possible to use the geomean for the score calculation, so that fast-running JS frameworks contribute a bit more to the overall Speedometer score than they do under the original arithmetic mean calculation?

Ryosuke Niwa

Comment 4 2017-08-26 22:12:02 PDT

I think we need to revisit this more carefully. In the latest version of Speedometer 2.0, Preact runs in 554ms whereas Inferno takes 1533ms on Safari. Since the total time is 6533, Preact only accounts for ~0.8% of the total score whilst Inferno accounts for ~23% of the total score. At that point, there's almost no practical value in measuring Preact's runtime.

Ryosuke Niwa

Comment 5 2017-08-26 22:24:11 PDT

(In reply to Ryosuke Niwa from comment #4)
> I think we need to revisit this more carefully. In the latest version of
> Speedometer 2.0, Preact runs in 554ms whereas Inferno takes 1533ms on
> Safari. Since the total time is 6533, Preact only accounts for ~0.8% of the
> total score whilst Inferno accounts for ~23% of the total score. At that
> point, there's almost no practical value in measuring Preact's runtime.

Sorry, I meant to say Preact runs in 54ms, not 554ms.
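The shares quoted in comment 4, with the 54ms correction from comment 5, can be checked with a little arithmetic. The sketch below is illustrative only: `shareOfTotal` is a hypothetical helper, not part of Speedometer, and the three timings are the Safari figures quoted in this thread.

```javascript
// Timings (ms) quoted in comments 4 and 5 for Safari.
const preactMs = 54;
const infernoMs = 1533;
const totalMs = 6533;

// Hypothetical helper: fraction of the summed run time that one suite accounts for.
// When the score is derived from a sum or arithmetic mean of times, this fraction
// is also the suite's effective weight in the score.
function shareOfTotal(suiteMs, sumMs) {
    return suiteMs / sumMs;
}

const preactShare = shareOfTotal(preactMs, totalMs);
const infernoShare = shareOfTotal(infernoMs, totalMs);

console.log((preactShare * 100).toFixed(1) + "%");  // under 1%
console.log((infernoShare * 100).toFixed(1) + "%"); // a bit over 23%
```

Under these figures, a change that doubled Preact's speed would move the summed time by less than half a percent.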

Ryosuke Niwa

Comment 6 2017-08-26 23:41:35 PDT

Another reason we should probably consider using geomean is that we now have both release & debug builds of Ember.js after https://trac.webkit.org/changeset/221205 and https://trac.webkit.org/changeset/221206. We did this because we noticed that debug build was 4x slower and therefore constitutes a fundamentally different kind of a test. However, if we used arithmetic mean to compute the score, then we’re effectively giving 4x more weight to debug build of Ember.js compared to its release build even though only ~5% of websites that use Ember.js use debug builds in production.
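The weighting effect described here can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the suite names and millisecond values are hypothetical, and only the 4x debug/release ratio comes from this comment. It compares how much each mean moves when a single suite gets 10% faster.

```javascript
// Hypothetical per-suite times in ms; only the 4x debug/release ratio comes from the thread.
const baseline = { emberRelease: 100, emberDebug: 400, other: 200 };

function arithmeticMean(values) {
    return values.reduce((a, b) => a + b, 0) / values.length;
}

function geometricMean(values) {
    // Averaging logarithms is equivalent to taking the n-th root of the product.
    const logSum = values.reduce((a, b) => a + Math.log(b), 0);
    return Math.exp(logSum / values.length);
}

// Relative change in the given mean when one suite gets 10% faster.
function relativeGain(mean, times, suiteName) {
    const improved = Object.assign({}, times, { [suiteName]: times[suiteName] * 0.9 });
    const before = mean(Object.values(times));
    const after = mean(Object.values(improved));
    return (before - after) / before;
}

const arithRelease = relativeGain(arithmeticMean, baseline, "emberRelease");
const arithDebug = relativeGain(arithmeticMean, baseline, "emberDebug");
const geoRelease = relativeGain(geometricMean, baseline, "emberRelease");
const geoDebug = relativeGain(geometricMean, baseline, "emberDebug");

console.log(arithDebug / arithRelease); // about 4: the slower suite dominates
console.log(geoDebug / geoRelease);     // about 1: both suites count equally
```

With the arithmetic mean, the same 10% speedup is worth about four times as much in the debug suite as in the release suite. With the geometric mean, the gain depends only on the speedup ratio and the number of suites, not on how slow the suite was to begin with.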

Mathias Bynens

Comment 7 2017-08-27 08:15:31 PDT

(In reply to Ryosuke Niwa from comment #6)
> Another reason we should probably consider using geomean is that we now have
> both release & debug builds of Ember.js after
> https://trac.webkit.org/changeset/221205 and
> https://trac.webkit.org/changeset/221206.
> 
> We did this because we noticed that debug build was 4x slower and therefore
> constitutes a fundamentally different kind of a test.
> 
> However, if we used arithmetic mean to compute the score, then we’re
> effectively giving 4x more weight to debug build of Ember.js compared to its
> release build even though only ~5% of websites that use Ember.js use debug
> builds in production.

IMHO that just means we should remove the debug build of Ember.js from the benchmark altogether.

Addy Osmani

Comment 8 2017-08-28 19:41:20 PDT

There are a few possible options here:

1. Switch to the Geometric mean. Avoids an issue where the lower execution times of frameworks like Vue and Preact don't contribute much to the final score. Also avoids Speedometer appearing to highlight the cost of some frameworks more than others.

2. Adopt a hybrid approach of measuring both Arithmetic and Geometric means, taking an average of the two.

3. Minimize the impact to overall scores by excluding the Ember debug build. This may not be sufficient alone.

4. Consider other weighting factors to each implementation to avoid any specific framework contributing more to the score than others.

Ryosuke Niwa

Comment 9 2017-08-28 20:39:55 PDT

(In reply to Mathias Bynens from comment #7)
> (In reply to Ryosuke Niwa from comment #6)
> > Another reason we should probably consider using geomean is that we now have
> > both release & debug builds of Ember.js after
> > https://trac.webkit.org/changeset/221205 and
> > https://trac.webkit.org/changeset/221206.
> > 
> > We did this because we noticed that debug build was 4x slower and therefore
> > constitutes a fundamentally different kind of a test.
> > 
> > However, if we used arithmetic mean to compute the score, then we’re
> > effectively giving 4x more weight to debug build of Ember.js compared to its
> > release build even though only ~5% of websites that use Ember.js use debug
> > builds in production.
> 
> IMHO that just means we should remove the debug build of Ember.js from the
> benchmark altogether.

The goal of the Speedometer benchmark is to measure plausible ways DOM APIs will be used, not necessarily only the most popular way or the most optimized way. Since 5% of websites that use Ember.js use the debug build, we should include it in the benchmark given how radically different its performance characteristics are.

Additionally, this doesn't solve the problem that Vue.js contributes less than 1% of the total score whereas Inferno contributes more than 23%, at least in Safari.

(In reply to Addy Osmani from comment #8)
> There are a few possible options here:
> 
> 1. Switch to the Geometric mean. Avoids an issue where the lower execution
> times of frameworks like Vue and Preact don't contribute much to the final
> score. Also avoids Speedometer appearing to highlight the cost of some
> frameworks more than others.

We should probably do this.

> 2. Adopt a hybrid approach of measuring both Arithmetic and Geometric means,
> taking an average of the two.

This has one problem: we're still going to give ~2x more weight to the debug build of Ember.js compared to its release build.

> 3. Minimize the impact to overall scores by excluding the Ember debug build.
> This may not be sufficient alone.

Right. Just removing the debug build of Ember.js doesn't solve the issue of Inferno accounting for ~23% of the test score while Vue.js accounts for less than 1%.

> 4. Consider other weighting factors to each implementation to avoid any
> specific framework contributing more to the score than others.

Given that we don't have a good understanding of how popular each framework / library is, I don't think we could reasonably do this. And it's subject to a lot of interpretations and opinions.

Ryosuke Niwa

Comment 10 2017-09-05 02:29:24 PDT

Created attachment 319888 [details] Changes the score computation

Saam Barati

Comment 11 2017-09-05 09:03:41 PDT

Comment on attachment 319888 [details]
Changes the score computation

View in context: https://bugs.webkit.org/attachment.cgi?id=319888&action=review

> PerformanceTests/Speedometer/resources/benchmark-runner.js:285
> +            values.sort(function (a, b) { return a - b }); // Avoid the loss of significance for the sum.

Do you want to compute product over the sorted array as well?

Ryosuke Niwa

Comment 12 2017-09-05 19:10:00 PDT

(In reply to Saam Barati from comment #11)
> Comment on attachment 319888 [details]
> Changes the score computation
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=319888&action=review
> 
> > PerformanceTests/Speedometer/resources/benchmark-runner.js:285
> > +            values.sort(function (a, b) { return a - b }); // Avoid the loss of significance for the sum.
> 
> Do you want to compute product over the sorted array as well?

No, the computation of a product doesn't suffer from the loss of significance.
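The point about sorting can be illustrated with a short sketch. This is not the committed patch: `summarize` is a hypothetical helper that takes per-suite totals in milliseconds, and the magnitudes in the demonstration are contrived so that the rounding becomes visible.

```javascript
// Hypothetical helper, not the committed patch: summarizes per-suite totals (ms).
function summarize(suiteTotals) {
    // Copy before sorting so the caller's array is left untouched.
    const values = suiteTotals.slice().sort(function (a, b) { return a - b; });

    // Adding the smallest values first lets them accumulate before they meet a large term.
    const total = values.reduce(function (a, b) { return a + b; }, 0);

    // Each multiplication keeps a small relative error whatever the operand order,
    // so sorting does not help the product.
    const product = values.reduce(function (a, b) { return a * b; }, 1);

    return {
        total: total,
        mean: total / values.length,
        geomean: Math.pow(product, 1 / values.length),
    };
}

// Contrived magnitudes; real suite times are far closer together than this.
const unsortedSum = [1e16, 1, 1].reduce(function (a, b) { return a + b; }, 0);
const sortedSum = summarize([1e16, 1, 1]).total;

console.log(unsortedSum === 1e16);   // true: both 1s were rounded away
console.log(sortedSum === 1e16 + 2); // true: the 1s were added together first
```

One design note for anyone adapting this: a product of many large values can overflow a double, in which case averaging logarithms is a common alternative to `Math.pow` on the raw product.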

Ryosuke Niwa

Comment 13 2017-09-05 19:37:43 PDT

Committed r221659: <http://trac.webkit.org/changeset/221659>

Radar WebKit Bug Importer

Comment 14 2017-09-27 12:53:45 PDT

<rdar://problem/34694227>

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component Tools / Tests

Assignee Ryosuke Niwa

Reported 2017-06-06 03:54 PDT

Modified 2017-09-27 12:53 PDT

CC List 15 users

Keywords InRadar

Blocks 172339