Bug 43642 - SunSpider confidence intervals are questionable
Summary: SunSpider confidence intervals are questionable
Status: UNCONFIRMED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: All
OS: All
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-06 14:18 PDT by Dave Mandelin
Modified: 2011-05-05 10:45 PDT
CC List: 3 users

See Also:


Description Dave Mandelin 2010-08-06 14:18:34 PDT
Most SunSpider users I have talked to take the confidence intervals with a grain of salt, especially the confidence metrics for comparing two runs. In particular, there are a lot of false positives: way more than 5% of things marked "95% significant" are not in fact real differences. 

I looked into this for a while, and I saw that the comparison script uses the t-test, which is of course the standard significance test for the difference of two sample means taken from normally distributed data sets. I did some simulations and simple normality tests that show that SunSpider scores for individual benchmarks are not normally distributed (a sketch of such a check follows the list below). There seem to be 3 main deviations from normality:

1. The scores are integral numbers of milliseconds. This makes the data look very non-normal, especially for short-running tests.

2. The differences from the mean are not symmetrical: there are bigger outliers on the high side than on the low side. Related to this is the fact that the range of the scores stops at 0, rather than going down to negative infinity.

3. The tails seem to be much fatter than those of a normal distribution.
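
A minimal sketch of the kind of normality check described above, assuming per-test timings collected as integer milliseconds; the sample values here are made up purely for illustration, not taken from real runs:

from scipy import stats

# Hypothetical per-test timings in integer milliseconds; note the
# quantization, the high-side outlier, and the hard floor at 0.
samples = [12, 12, 13, 12, 14, 12, 13, 12, 19, 12]

skew = stats.skew(samples)      # > 0: bigger outliers on the high side
kurt = stats.kurtosis(samples)  # excess kurtosis > 0: fatter tails than normal
w, p = stats.shapiro(samples)   # Shapiro-Wilk test of normality

print("skewness=%.2f excess kurtosis=%.2f Shapiro-Wilk p=%.4f" % (skew, kurt, p))
if p < 0.05:
    print("normality rejected at the 5% level")

On data shaped like the three deviations above, one would expect positive skewness, positive excess kurtosis, and a small Shapiro-Wilk p-value.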

I'm not sure exactly what should be done about this. I think it would be possible to pick a distribution that better fits the benchmark scores and compute the confidence intervals from that instead. I did some simulations suggesting that multiplying the low end of the confidence interval by 1.5 and the high end by 2 gave something closer to actual 95% coverage (a sketch of that adjustment follows).
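
A sketch of that empirical adjustment, starting from the usual t-based interval; the 1.5x and 2x factors are the simulation-derived values above, and nothing here reflects SunSpider's actual scripts:

import math
from scipy import stats

def adjusted_confidence_interval(samples, confidence=0.95):
    # Standard t-based interval for the mean, then widened asymmetrically:
    # the low half-width scaled by 1.5, the high half-width by 2.
    n = len(samples)
    mean = sum(samples) / float(n)
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    stderr = math.sqrt(variance / n)
    half_width = stats.t.ppf((1 + confidence) / 2, n - 1) * stderr
    return (mean - 1.5 * half_width, mean + 2.0 * half_width)

Calling adjusted_confidence_interval(samples) then yields a deliberately wider, asymmetric interval than the plain t-based one.
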
Comment 1 Maciej Stachowiak 2011-05-05 10:45:05 PDT
Thanks for doing the testing. You are probably right that a normal distribution is not an appropriate assumption.

Your suggested tweaks to the confidence interval ranges would help the confidence intervals themselves, but I'm not sure what to use as a test of whether a difference is significant. Just having two confidence intervals is not necessarily sufficient for that; or at least, the t-test itself doesn't work that way.
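
For context, one standard distribution-free way to test whether two sets of timings differ in mean, not something proposed in this bug, is a permutation test. A minimal sketch, with hypothetical sample lists a and b:

import random

def permutation_test(a, b, iterations=10000):
    # Two-sided permutation test for a difference in means: shuffle the
    # pooled samples many times and count how often a random split is at
    # least as extreme as the observed difference (an approximate p-value).
    observed = abs(sum(a) / float(len(a)) - sum(b) / float(len(b)))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(iterations):
        random.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / float(len(left)) - sum(right) / float(len(right)))
        if diff >= observed:
            extreme += 1
    return extreme / float(iterations)

Because it only reshuffles the observed values, this kind of test makes no normality assumption, which sidesteps all three deviations noted in the description.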