Bug 43253 - SunSpider run times are not reproducible
Summary: SunSpider run times are not reproducible
Status: RESOLVED DUPLICATE of bug 61552
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: PC All
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 32804 43257 43255 43256
Blocks:
Reported: 2010-07-30 08:43 PDT by Paul Biggar
Modified: 2011-07-02 14:16 PDT
CC: 2 users

Description Paul Biggar 2010-07-30 08:43:26 PDT
SunSpider has a very high level of variability in its run time, especially on Linux. In Firefox, we're trying to make very small improvements to the JS VM, but we can't trust SunSpider results for any improvement of less than 15ms.

This is a meta-bug, roughly corresponding to https://bugzilla.mozilla.org/show_bug.cgi?id=580532 in the Mozilla bugzilla. I will file more specific bugs as blockers to this bug (assuming that's the way you use your bugzilla).
Comment 1 Oliver Hunt 2010-08-06 16:30:17 PDT
How are you running SunSpider? I tend to see variance on the order of 0.3-0.5% with 30 runs.
Comment 2 Paul Biggar 2010-08-06 16:43:02 PDT
I've seen large variability even with 1000 runs. I run it on Linux, where I am told variability is worse.

I don't believe the number reported on the TOTAL line is accurate. It seems to be based on the spread of the total run times, rather than the sum of the spreads across all the benchmarks. As a simple example, the 3bit-bits-in-byte benchmark runs in either 0ms or 1ms for me and has massive variability, but that is completely masked by using the TOTAL value.

I wrote https://bug580532.bugzilla.mozilla.org/attachment.cgi?id=459618 as a better measure of variability. It's not perfect, but it allows you to see how variable a test run is.
Comment 3 Maciej Stachowiak 2010-08-06 16:53:19 PDT
I don't think summing the absolute differences from the mean for each subtest is a statistically valid procedure for computing the variance of the total score.

You are correct that individual subtests, particularly the tests that now have very short runtimes in modern implementations, have much higher variance than the total.
Comment 4 Paul Biggar 2010-08-06 17:04:42 PDT
(In reply to comment #3)
> I don't think summing the absolute differences from the mean for each subtest is a statistically valid procedure for computing the variance of the total score.

This is true. I ran it past our resident stats expert dmandelin, who confirmed it to be "not too bad as a rough measure". It seemed to work, so I stopped there. As I understand it, this is complicated by the fact that the benchmarks run for wildly different lengths of time. I don't know if you'd consider solving this for 0.9.2.
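
For illustration, here is a minimal sketch (in Python, with made-up subtest names and timings; it is not the attached script) of the two rough variability summaries discussed above: the spread of per-run TOTAL times, and the sum of per-subtest mean absolute deviations. The second is not a statistically valid variance of the total, but it does surface noisy short-running subtests such as 3bit-bits-in-byte.

# Minimal sketch contrasting two rough variability summaries.
# Subtest names and timings below are hypothetical, not real SunSpider data.
from statistics import mean, stdev

# runs[i][name] = time in ms for one subtest in one run
runs = [
    {"3bit-bits-in-byte": 0, "string-tagcloud": 120, "crypto-aes": 40},
    {"3bit-bits-in-byte": 1, "string-tagcloud": 119, "crypto-aes": 41},
    {"3bit-bits-in-byte": 0, "string-tagcloud": 121, "crypto-aes": 39},
    {"3bit-bits-in-byte": 1, "string-tagcloud": 120, "crypto-aes": 40},
]

# (a) Variability of the TOTAL: spread of the per-run sums. A subtest that
# flips between 0 ms and 1 ms barely moves this number.
totals = [sum(run.values()) for run in runs]
print("total mean: %.1f ms, stdev: %.2f ms" % (mean(totals), stdev(totals)))

# (b) Rough per-subtest measure: sum of each subtest's mean absolute
# deviation from its own mean. Not a valid variance of the total (the
# deviations do not add this way), but it highlights noisy short subtests.
summed_mad = 0.0
for name in runs[0]:
    times = [run[name] for run in runs]
    m = mean(times)
    mad = mean(abs(t - m) for t in times)
    summed_mad += mad
    print("%-20s mean %.1f ms, mean abs deviation %.2f ms" % (name, m, mad))
print("summed per-subtest mean abs deviation: %.2f ms" % summed_mad)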
Comment 5 Maciej Stachowiak 2011-07-02 14:16:47 PDT
I think there's nothing in particular to be done here other than increasing the runtime of the tests. Thus, reverse-duplicating to 61552.

*** This bug has been marked as a duplicate of bug 61552 ***