Bug 43253

Summary:	SunSpider run times are not reproducible
Product:	WebKit	Reporter:	Paul Biggar <pbiggar>
Component:	Tools / Tests	Assignee:	Nobody <webkit-unassigned>
Status:	RESOLVED DUPLICATE
Severity:	Normal	CC:	mjs, oliver
Priority:	P2
Version:	528+ (Nightly build)
Hardware:	PC
OS:	All
Bug Depends on:	32804, 43257, 43255, 43256
Bug Blocks:

Paul Biggar

Reported 2010-07-30 08:43:26 PDT

SunSpider has a very high level of variability in its run-time, especially on Linux. In Firefox, we're trying to make very small improvements to the JS VM, but we can't trust SunSpider results for any improvement of less than 15ms. This is a meta-bug, roughly corresponding to https://bugzilla.mozilla.org/show_bug.cgi?id=580532 in the Mozilla bugzilla. I will file more specific bugs as blockers to this bug (assuming that's the way you use your bugzilla).

Oliver Hunt

Comment 1 2010-08-06 16:30:17 PDT

How are you running SunSpider? I tend to see variance on the order of 0.3-0.5% with 30 runs.

Paul Biggar

Comment 2 2010-08-06 16:43:02 PDT

I've seen large variability with even 1000 runs. I run it on Linux, where I am told variability is worse. I don't believe the number reported on the TOTAL line is accurate. It seems to be based on the difference between the two run-times, rather than the sum of the differences across all the benchmarks. As a simple example, the 3bit-bits-in-byte benchmark runs in either 0ms or 1ms for me, and has massive variability, but that is completely masked by using the TOTAL value. I wrote https://bug580532.bugzilla.mozilla.org/attachment.cgi?id=459618 as a better measure of variability. It's not perfect, but it allows you to see how variable a test run is.

Maciej Stachowiak

Comment 3 2010-08-06 16:53:19 PDT

I don't think summing the absolute differences from the mean for each subtest is a statistically valid procedure for computing the variance of the total score. You are correct that individual subtests, particularly the tests that now have very short runtimes in modern implementations, have much higher variance than the total.
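
To make the two measures in this exchange concrete, here is a minimal Python sketch. It is not the script from the linked attachment and not SunSpider's own reporting code. The timings are invented for illustration, and the function names (`total_stdev`, `summed_abs_deviation`, `relative_spread`) are hypothetical. It contrasts the spread of the per-run total with the sum of per-subtest mean absolute deviations, and shows how a subtest that runs in 0ms or 1ms can have a large relative spread that is barely visible in the total.

```python
import statistics

# Hypothetical per-run timings in ms: one list per subtest, one entry per run.
runs = {
    "short-test": [0, 1, 0, 1, 1, 0],
    "long-test": [100, 101, 99, 100, 102, 98],
}


def total_stdev(timings):
    """Sample standard deviation of the per-run total across all subtests."""
    totals = [sum(run) for run in zip(*timings.values())]
    return statistics.stdev(totals)


def summed_abs_deviation(timings):
    """Sum, over subtests, of the mean absolute deviation from the subtest mean."""
    result = 0.0
    for samples in timings.values():
        mean = statistics.fmean(samples)
        result += sum(abs(x - mean) for x in samples) / len(samples)
    return result


def relative_spread(samples):
    """Sample standard deviation divided by the mean, for one subtest."""
    return statistics.stdev(samples) / statistics.fmean(samples)


if __name__ == "__main__":
    print("stdev of total:", total_stdev(runs))
    print("summed abs deviation:", summed_abs_deviation(runs))
    for name, samples in runs.items():
        print(name, "relative spread:", relative_spread(samples))
```

With this made-up data, the short subtest's relative spread is above 100% while the long subtest's is under 2%. The two aggregate numbers also differ, which is consistent with the point that the summed deviation is a rough indicator rather than an estimate of the variance of the total.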

Paul Biggar

Comment 4 2010-08-06 17:04:42 PDT

(In reply to comment #3)
> I don't think summing the absolute differences from the mean for each subtest is a statistically valid procedure, for computing the variance of the total score.

This is true. I ran it past our resident stats expert dmandelin, who confirmed it to be "not too bad as a rough measure". It seemed to work so I stopped there. As I understand it, this is complicated by the fact that the benchmarks run for wildly different lengths of time. I don't know if you'd consider solving this for 0.9.2.

Maciej Stachowiak

Comment 5 2011-07-02 14:16:47 PDT

I think there's nothing in particular to be done here other than increase the runtime of the tests. Thus, reverse-duplicating to 61552.

*** This bug has been marked as a duplicate of bug 61552 ***
