Bug 43642 - SunSpider confidence intervals are questionable
Summary: SunSpider confidence intervals are questionable
Status: UNCONFIRMED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: All
OS: All
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-06 14:18 PDT by Dave Mandelin
Modified: 2011-05-05 10:45 PDT
CC List: 3 users

See Also:


Description Dave Mandelin 2010-08-06 14:18:34 PDT
Most SunSpider users I have talked to take the confidence intervals with a grain of salt, especially the confidence metrics for comparing two runs. In particular, there are a lot of false positives: way more than 5% of things marked "95% significant" are not in fact real differences. 

I looked into this for a while, and I saw that the comparison script uses the t-test, which is of course the standard significance test for the difference of two sample means taken from normally distributed data sets. I did some simulations and simple normality tests that show that SunSpider scores for individual benchmarks are not normally distributed (a sketch of such a check follows the list below). There seem to be 3 main deviations from normality:

1. The scores are integral numbers of milliseconds. This makes the data look very non-normal, especially for short-running tests.

2. The differences from the mean are not symmetrical: there are bigger outliers on the high side than on the low side. Related to this is the fact that the range of the scores stops at 0, rather than going down to negative infinity.

3. The tails seem to be much fatter than those of a normal distribution.
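
A minimal sketch of the kind of normality check described above, assuming per-test timings collected as integer milliseconds; the sample values here are made up purely for illustration, not taken from real runs:

from scipy import stats

# Hypothetical per-test timings in integer milliseconds; note the
# quantization, the high-side outlier, and the hard floor at 0.
samples = [12, 12, 13, 12, 14, 12, 13, 12, 19, 12]

skew = stats.skew(samples)      # > 0: bigger outliers on the high side
kurt = stats.kurtosis(samples)  # excess kurtosis > 0: fatter tails than normal
w, p = stats.shapiro(samples)   # Shapiro-Wilk test of normality

print("skewness=%.2f excess kurtosis=%.2f Shapiro-Wilk p=%.4f" % (skew, kurt, p))
if p < 0.05:
    print("normality rejected at the 5% level")

On data shaped like the three deviations above, one would expect positive skewness, positive excess kurtosis, and a small Shapiro-Wilk p-value.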

I'm not sure exactly what should be done about this. I think it would be possible to pick a distribution that better fits the benchmark scores and compute the confidence intervals from that instead. I did some simulations suggesting that multiplying the low end of the confidence interval by 1.5 and the high end by 2 gave something closer to actual 95% coverage (a sketch of that adjustment follows).
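
A sketch of that empirical adjustment, starting from the usual t-based interval; the 1.5x and 2x factors are the simulation-derived values above, and nothing here reflects SunSpider's actual scripts:

import math
from scipy import stats

def adjusted_confidence_interval(samples, confidence=0.95):
    # Standard t-based interval for the mean, then widened asymmetrically:
    # the low half-width scaled by 1.5, the high half-width by 2.
    n = len(samples)
    mean = sum(samples) / float(n)
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    stderr = math.sqrt(variance / n)
    half_width = stats.t.ppf((1 + confidence) / 2, n - 1) * stderr
    return (mean - 1.5 * half_width, mean + 2.0 * half_width)

Calling adjusted_confidence_interval(samples) then yields a deliberately wider, asymmetric interval than the plain t-based one.
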
Comment 1 Maciej Stachowiak 2011-05-05 10:45:05 PDT
Thanks for doing the testing. You are probably right that a normal distribution is not an appropriate assumption.

Your suggested tweaks to the confidence interval ranges would help the confidence intervals themselves, but I'm not sure what to use as a test of whether a difference is significant. Just having two confidence intervals is not necessarily sufficient for that; or at least, the t-test itself doesn't work that way.
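
For context, one standard distribution-free way to test whether two sets of timings differ in mean, not something proposed in this bug, is a permutation test. A minimal sketch, with hypothetical sample lists a and b:

import random

def permutation_test(a, b, iterations=10000):
    # Two-sided permutation test for a difference in means: shuffle the
    # pooled samples many times and count how often a random split is at
    # least as extreme as the observed difference (an approximate p-value).
    observed = abs(sum(a) / float(len(a)) - sum(b) / float(len(b)))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(iterations):
        random.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / float(len(left)) - sum(right) / float(len(right)))
        if diff >= observed:
            extreme += 1
    return extreme / float(iterations)

Because it only reshuffles the observed values, this kind of test makes no normality assumption, which sidesteps all three deviations noted in the description.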