Bug 43253 - SunSpider run times are not reproducible
Summary: SunSpider run times are not reproducible
Status: RESOLVED DUPLICATE of bug 61552
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: PC All
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 32804 43257 43255 43256
Blocks:
Reported: 2010-07-30 08:43 PDT by Paul Biggar
Modified: 2011-07-02 14:16 PDT
CC: 2 users

Description Paul Biggar 2010-07-30 08:43:26 PDT
SunSpider has a very high level of variability in its run time, especially on Linux. In Firefox, we're trying to make very small improvements to the JS VM, but we can't trust SunSpider results for any improvement of less than 15ms.

This is a meta-bug, roughly corresponding to https://bugzilla.mozilla.org/show_bug.cgi?id=580532 in the Mozilla bugzilla. I will file more specific bugs as blockers to this bug (assuming that's the way you use your bugzilla).
Comment 1 Oliver Hunt 2010-08-06 16:30:17 PDT
How are you running SunSpider? I tend to see variance on the order of 0.3-0.5% with 30 runs.
Comment 2 Paul Biggar 2010-08-06 16:43:02 PDT
I've seen large variability even with 1000 runs. I run it on Linux, where I am told variability is worse.

I don't believe the number reported on the TOTAL line is accurate. It seems to be based on the spread of the total run times, rather than the sum of the spreads across all the benchmarks. As a simple example, the 3bit-bits-in-byte benchmark runs in either 0ms or 1ms for me and has massive variability, but that is completely masked by using the TOTAL value.

I wrote https://bug580532.bugzilla.mozilla.org/attachment.cgi?id=459618 as a better measure of variability. It's not perfect, but it allows you to see how variable a test run is.
Comment 3 Maciej Stachowiak 2010-08-06 16:53:19 PDT
I don't think summing the absolute differences from the mean for each subtest is a statistically valid procedure for computing the variance of the total score.

You are correct that individual subtests, particularly the tests that now have very short runtimes in modern implementations, have much higher variance than the total.
Comment 4 Paul Biggar 2010-08-06 17:04:42 PDT
(In reply to comment #3)
> I don't think summing the absolute differences from the mean for each subtest is a statistically valid procedure for computing the variance of the total score.

This is true. I ran it past our resident stats expert dmandelin, who confirmed it to be "not too bad as a rough measure". It seemed to work, so I stopped there. As I understand it, this is complicated by the fact that the benchmarks run for wildly different lengths of time. I don't know if you'd consider solving this for 0.9.2.
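
For illustration, here is a minimal sketch (in Python, with made-up subtest names and timings; it is not the attached script) of the two rough variability summaries discussed above: the spread of per-run TOTAL times, and the sum of per-subtest mean absolute deviations. The second is not a statistically valid variance of the total, but it does surface noisy short-running subtests such as 3bit-bits-in-byte.

# Minimal sketch contrasting two rough variability summaries.
# Subtest names and timings below are hypothetical, not real SunSpider data.
from statistics import mean, stdev

# runs[i][name] = time in ms for one subtest in one run
runs = [
    {"3bit-bits-in-byte": 0, "string-tagcloud": 120, "crypto-aes": 40},
    {"3bit-bits-in-byte": 1, "string-tagcloud": 119, "crypto-aes": 41},
    {"3bit-bits-in-byte": 0, "string-tagcloud": 121, "crypto-aes": 39},
    {"3bit-bits-in-byte": 1, "string-tagcloud": 120, "crypto-aes": 40},
]

# (a) Variability of the TOTAL: spread of the per-run sums. A subtest that
# flips between 0 ms and 1 ms barely moves this number.
totals = [sum(run.values()) for run in runs]
print("total mean: %.1f ms, stdev: %.2f ms" % (mean(totals), stdev(totals)))

# (b) Rough per-subtest measure: sum of each subtest's mean absolute
# deviation from its own mean. Not a valid variance of the total (the
# deviations do not add this way), but it highlights noisy short subtests.
summed_mad = 0.0
for name in runs[0]:
    times = [run[name] for run in runs]
    m = mean(times)
    mad = mean(abs(t - m) for t in times)
    summed_mad += mad
    print("%-20s mean %.1f ms, mean abs deviation %.2f ms" % (name, m, mad))
print("summed per-subtest mean abs deviation: %.2f ms" % summed_mad)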
Comment 5 Maciej Stachowiak 2011-07-02 14:16:47 PDT
I think there's nothing in particular to be done here other than increasing the runtime of the tests. Thus, reverse-duplicating to 61552.

*** This bug has been marked as a duplicate of bug 61552 ***