Population and sample variance
Mar. 20th, 2014 03:02 pm
I am the author of PEP 450 and the statistics module. That module offers four different functions related to statistical variance, and some people may not quite understand what the difference between them is.
[Disclaimer: statistical variance is complicated, and my discussion here is quite simplified. In particular, most of what I say only applies to "reasonable" data sets which aren't too skewed or unusual, and samples which are random and representative. If your sample data is not representative of the population from which it is drawn, then all bets are off.]
The statistics module offers two variance functions, pvariance and variance, and two corresponding versions of the standard deviation, pstdev and stdev. The standard deviation functions are just thin wrappers which take the square root of the appropriate variance function, so there's not a lot to say about them. Except where noted differently, everything I say about the (p)variance functions also applies to the (p)stdev functions, so for brevity I will only talk about variance.
The two versions of variance give obviously different results:
py> import statistics
py> data = [1, 2, 3, 3, 3, 5, 8]
py> statistics.pvariance(data)
4.53061224489796
py> statistics.variance(data)
5.2857142857142865
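To see where the two numbers come from, here is a minimal sketch using the textbook definitions: the population variance divides the sum of squared deviations by n, while the sample variance divides by n - 1. This is only an illustration and not necessarily how the module computes its results internally; the names ss, by_n and by_n_minus_1 are my own. It also checks the claim that the standard deviation functions are the square roots of the matching variance functions.

```python
import math
import statistics

data = [1, 2, 3, 3, 3, 5, 8]
n = len(data)
mean = statistics.mean(data)

# Sum of squared deviations from the mean (textbook definition).
ss = sum((x - mean) ** 2 for x in data)

by_n = ss / n                # population variance: divide by n
by_n_minus_1 = ss / (n - 1)  # sample variance: divide by n - 1

# Compare with the module, allowing for floating point rounding.
pvariance_matches = math.isclose(by_n, statistics.pvariance(data))
variance_matches = math.isclose(by_n_minus_1, statistics.variance(data))

# The standard deviations are the square roots of the variances.
pstdev_matches = math.isclose(
    statistics.pstdev(data), math.sqrt(statistics.pvariance(data)))
stdev_matches = math.isclose(
    statistics.stdev(data), math.sqrt(statistics.variance(data)))

print(pvariance_matches, variance_matches, pstdev_matches, stdev_matches)
```

The comparisons use math.isclose rather than == because the hand-rolled floating point arithmetic here need not agree with the module's results to the last digit.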
So which should you use? In a nutshell, two simple rules apply:
- If you are dealing with the entire population, use pvariance.
- If you are working with a sample, use variance instead.
If you remember those two rules, you won't go badly wrong. Or at least, no more badly than most naive users of statistical functions. You want to be better than them, don't you? Then read on...
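As a rough illustration of why the second rule matters, the following sketch draws many small random samples from a known population and averages what each function reports. Everything about the setup is an assumption chosen for the demonstration: the population (the integers 1 to 100), the sample size of 5, the number of trials, the seed, and the variable names. Sampling is done with replacement via random.choices.

```python
import random
import statistics

rng = random.Random(450)  # fixed seed so the run is repeatable

# A made-up population whose true variance we can compute directly.
population = list(range(1, 101))
true_var = statistics.pvariance(population)

trials = 5000
sample_size = 5
total_sample_var = 0.0  # running total of variance(sample)
total_naive_var = 0.0   # running total of pvariance(sample)

for _ in range(trials):
    # Draw a small random sample, with replacement.
    sample = rng.choices(population, k=sample_size)
    total_sample_var += statistics.variance(sample)
    total_naive_var += statistics.pvariance(sample)

mean_sample_var = total_sample_var / trials
mean_naive_var = total_naive_var / trials

print("true population variance:      ", true_var)
print("average variance(sample):      ", mean_sample_var)
print("average pvariance(sample):     ", mean_naive_var)
```

Under these assumptions, the average of variance(sample) should land close to the true population variance, while the average of pvariance(sample) should come out noticeably too small. A single small sample can still be far off with either function.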