Bug 43253

Summary: SunSpider run times are not reproducible
Product: WebKit
Component: Tools / Tests
Status: RESOLVED DUPLICATE
Severity: Normal
Priority: P2
Version: 528+ (Nightly build)
Hardware: PC
OS: All
Reporter: Paul Biggar <pbiggar>
Assignee: Nobody <webkit-unassigned>
CC: mjs, oliver
Bug Depends on: 32804, 43257, 43255, 43256
Bug Blocks:

Description Paul Biggar 2010-07-30 08:43:26 PDT
SunSpider has a very high level of variability in its run time, especially on Linux. In Firefox, we're trying to make very small improvements to the JS VM, but we can't trust SunSpider results for any improvement of less than 15ms.

This is a meta-bug, roughly corresponding to https://bugzilla.mozilla.org/show_bug.cgi?id=580532 in the Mozilla bugzilla. I will file more specific bugs as blockers to this bug (assuming that's the way you use your bugzilla).
Comment 1 Oliver Hunt 2010-08-06 16:30:17 PDT
How are you running SunSpider? I tend to see variance on the order of 0.3-0.5% with 30 runs.
Comment 2 Paul Biggar 2010-08-06 16:43:02 PDT
I've seen large variability with even 1000 runs. I run it on Linux, where I am told variability is worse.

I don't believe the number reported on the TOTAL line is accurate. It seems to be based on the difference between the two run times, rather than the sum of the differences across all the benchmarks. As a simple example, the 3bit-bits-in-byte benchmark runs in either 0ms or 1ms for me, and has massive variability, but that is completely masked by using the TOTAL value.

I wrote https://bug580532.bugzilla.mozilla.org/attachment.cgi?id=459618 as a better measure of variability. It's not perfect, but it allows you to see how variable a test run is.
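
As a rough illustration of the masking effect described above, here is a minimal Python sketch with made-up timings. The `runs` data and the `relative_spread` helper are hypothetical; this is not the attached script and not how SunSpider itself computes its TOTAL line.

```python
from statistics import mean, pstdev

# Hypothetical timings in ms: each row is one run, each column one subtest.
# The first column mimics a subtest that finishes in either 0 ms or 1 ms.
runs = [
    [0, 120, 80],
    [1, 120, 80],
    [0, 120, 80],
    [1, 120, 80],
]

def relative_spread(samples):
    """Population standard deviation as a fraction of the mean (0 if the mean is 0)."""
    m = mean(samples)
    return pstdev(samples) / m if m else 0.0

# Spread of each subtest taken on its own (columns of the table).
per_subtest = [relative_spread(list(col)) for col in zip(*runs)]

# Spread of the per-run totals (sum of each row).
total = relative_spread([sum(row) for row in runs])

print("per-subtest relative spread:", per_subtest)
print("relative spread of the total:", total)
```

With these invented numbers the short subtest varies by 100% of its mean, while the total varies by roughly 0.25%, so looking only at the total hides the noisy subtest.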
Comment 3 Maciej Stachowiak 2010-08-06 16:53:19 PDT
I don't think summing the absolute differences from the mean for each subtest is a statistically valid procedure for computing the variance of the total score.

You are correct that individual subtests, particularly the tests that now have very short run times in modern implementations, have much higher variance than the total.
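
One way to see the statistical concern is a small Python sketch with made-up timings. The data and the `mean_abs_dev` helper are hypothetical, and the sketch assumes the measure in question is a sum of per-subtest mean absolute deviations, as characterized in this comment; it does not reproduce the attached script.

```python
from statistics import mean

# Hypothetical timings in ms: each row is one run, each column one subtest.
runs = [
    [10, 20],
    [12, 18],
    [10, 22],
    [12, 20],
]

def mean_abs_dev(samples):
    """Mean absolute deviation of the samples from their own mean."""
    m = mean(samples)
    return mean([abs(x - m) for x in samples])

# Measure 1: deviation computed per subtest, then summed.
sum_of_subtest_mads = sum(mean_abs_dev(list(col)) for col in zip(*runs))

# Measure 2: deviation of the per-run totals.
mad_of_total = mean_abs_dev([sum(row) for row in runs])

print("sum of per-subtest deviations:", sum_of_subtest_mads)
print("deviation of the total:", mad_of_total)
```

Here the summed per-subtest deviation is 2 ms while the deviation of the total is 1 ms, because fluctuations in different subtests partly cancel within a run. By the triangle inequality the sum can never be smaller than the deviation of the total, so it acts as an upper bound on the total's spread rather than an estimate of it.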
Comment 4 Paul Biggar 2010-08-06 17:04:42 PDT
(In reply to comment #3)
> I don't think summing the absolute differences from the mean for each subtest is a statistically valid procedure for computing the variance of the total score.

This is true. I ran it past our resident stats expert dmandelin, who confirmed it to be "not too bad as a rough measure". It seemed to work so I stopped there. As I understand it, this is complicated by the fact that the benchmarks run for wildly different lengths of time. I don't know if you'd consider solving this for 0.9.2.
Comment 5 Maciej Stachowiak 2011-07-02 14:16:47 PDT
I think there's nothing in particular to be done here other than increasing the run time of the tests. Thus, reverse-duplicating to 61552.

*** This bug has been marked as a duplicate of bug 61552 ***