Population and sample variance
Mar. 20th, 2014 03:02 pm
I am the author of PEP 450 and the statistics module. That module offers four different functions related to statistical variance, and some people may not quite understand the differences between them.
[Disclaimer: statistical variance is complicated, and my discussion here is quite simplified. In particular, most of what I say only applies to "reasonable" data sets which aren't too skewed or unusual, and samples which are random and representative. If your sample data is not representative of the population from which it is drawn, then all bets are off.]
The statistics module offers two variance functions, `pvariance` and `variance`, and two corresponding versions of the standard deviation, `pstdev` and `stdev`. The standard deviation functions are just thin wrappers which take the square root of the appropriate variance function, so there's not a lot to say about them. Except where noted differently, everything I say about the `(p)variance` functions also applies to the `(p)stdev` functions, so for brevity I will only talk about variance.

The two versions of variance give obviously different results:
```
py> import statistics
py> data = [1, 2, 3, 3, 3, 5, 8]
py> statistics.pvariance(data)
4.53061224489796
py> statistics.variance(data)
5.2857142857142865
```
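Just to back up the thin-wrapper claim, here is a trivial check of my own (not part of the module) that the standard deviation functions agree with the square roots of the corresponding variances:

```python
# Check that the stdev functions match the square roots of the
# corresponding variance functions, up to float rounding.
import statistics
from math import sqrt

data = [1, 2, 3, 3, 3, 5, 8]
assert abs(statistics.pstdev(data) - sqrt(statistics.pvariance(data))) < 1e-12
assert abs(statistics.stdev(data) - sqrt(statistics.variance(data))) < 1e-12
```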
So which should you use? In a nutshell, two simple rules apply:

- If you are dealing with the entire population, use `pvariance`.
- If you are working with a sample, use `variance` instead.
If you remember those two rules, you won't go badly wrong. Or at least, no more badly than most naive users of statistical functions. You want to be better than them, don't you? Then read on...
When your data is the entire population, you should always calculate the variance using the `pvariance` function, no ifs, buts, or maybes. There is never any good reason to use the sample variance function on the entire population.

Things are a bit more complicated when you're dealing with a sample. The `variance` function is usually the right function to use, but there are circumstances where you might prefer to use the population variance function on a sample, despite the name.

But why are there two variance functions in the first place?
Start with population variance. Mathematicians give it the symbol σ² (sigma-squared), and you can consider it the "true" variance of a population. (That is to say, it is true by definition, since that is how variance is defined.) Ultimately, it is the population variance σ² that we care about, but since we rarely have access to the entire population data, we have to estimate it from a sample. This estimated value is called the sample variance, and is given the symbol s².
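For concreteness, here is the defining calculation written out by hand (a bare-bones sketch of my own; `statistics.pvariance` computes the same quantity, with more care taken over numerical accuracy):

```python
# Population variance from first principles: the mean of the squared
# deviations from the population mean. Note the denominator of n.
def population_variance(data):
    data = list(data)
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

# population_variance([1, 2, 3, 3, 3, 5, 8]) agrees with the
# statistics.pvariance result shown earlier, up to float rounding.
```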
The most obvious way to estimate the population variance from a sample is to just apply the population variance formula unchanged. Since this formula includes a denominator of n, the number of values in the data set, this is sometimes written as s² with a subscript n, sₙ². This is often called the uncorrected sample variance, or sometimes the biased sample variance. (For maximum confusion, sometimes the "uncorrected" or "biased" parts are left out.)

Since the samples are random, sometimes the estimate will be a little high, and sometimes a little low, but we'd like the average of those estimates to converge on the true population variance. Unfortunately that's not what happens: the estimates tend to underestimate the true value a bit more often than they overestimate it, leading to a systematic bias when used for the sample variance. That bias can be very significant for small sample sizes, although it does get smaller as the number of samples increases. For example, with five samples, `pvariance` tends to underestimate the true variance by 20%. But with 100 samples, that falls to just 1%, and with 5000 samples it is a mere 0.02%.
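You can watch this bias happen with a small simulation (my own sketch, not part of the module: it draws many samples of size five from a population whose true variance is 1, and averages the uncorrected estimates):

```python
# Draw many samples of size 5 from a normal distribution with variance 1,
# and average the uncorrected (denominator n) variance estimates.
# The average comes out around 0.8, i.e. about 20% too low.
import random
import statistics

random.seed(42)  # arbitrary seed, just for repeatability
estimates = []
for _ in range(10000):
    sample = [random.gauss(0, 1) for _ in range(5)]
    estimates.append(statistics.pvariance(sample))

print(statistics.mean(estimates))  # typically prints a value close to 0.8
```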
In any case, since the bias is known, we can correct for it. That version is called the corrected sample variance, and uses a denominator of n-1 instead of n. It too is given the symbol s², occasionally with a subscript n-1. Unfortunately, mathematicians tend to be annoyingly and confusingly inconsistent with their notation and terminology, but if you see something referred to as "sample variance" or s², it usually means this corrected version.
The statistics module's `variance` function is this corrected version. Unlike `pvariance` it is statistically unbiased (at least for representative samples — if your data exhibits a bias, so will anything you calculate from that data). This means that although any specific estimate of the variance may happen to be too high or too low, the average will tend towards the true population variance. That's a very desirable property, which is why `variance` is normally preferred over `pvariance` when dealing with samples.
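One way to see the relationship between the two functions: the corrected estimate is just the uncorrected one rescaled by n/(n-1). Here is a quick check of that, using the same data as before (again, a sketch of my own):

```python
# The n-1 correction is just a rescaling: multiplying the denominator-n
# estimate by n/(n-1) gives the denominator-(n-1) estimate.
import statistics

data = [1, 2, 3, 3, 3, 5, 8]
n = len(data)
rescaled = statistics.pvariance(data) * n / (n - 1)
assert abs(rescaled - statistics.variance(data)) < 1e-12
```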
[Aside: Although the corrected sample variance is unbiased, the square root of it, the sample standard deviation `stdev`, is not! Some bias still remains, although not as much as in the uncorrected or population version `pstdev`. Unfortunately, removing all the bias from the standard deviation is a hard problem that depends on knowledge of the exact population distribution. Fortunately, the difference rarely matters in practice.]
Statistical bias is not all bad. Although `pvariance` tends to underestimate the variance, it does have one advantage over the unbiased version: it exhibits less variability and is more consistent. On average the corrected sample variance lands closer to the correct value, but when it misses, it misses more badly than does the uncorrected version, so if you find yourself in a situation where you care more about consistency between samples than closeness to the population variance, you might prefer to use `pvariance`.
There is a third, rarer, corrected version of sample variance that you will occasionally see. This one uses a denominator of n+1 instead of n or n-1, and may be known as the biased estimator of the population variance. This version also tends to underestimate the variance of a sample, but it has the advantage of minimizing the mean squared error of the estimate (at least when the data is distributed normally). The statistics module doesn't provide that function, but you can make your own if you need it:

```python
def var1(data):
    data = list(data)
    n = len(data)
    return statistics.pvariance(data) * n / (n + 1)
```
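If you're curious, the mean squared error claim is easy to test empirically. This simulation (my own sketch, assuming normally distributed data with true variance 1) compares the three denominators:

```python
# Compare the mean squared error of the n+1, n and n-1 denominator
# estimators on small normal samples. The n+1 version should show the
# smallest MSE, despite having the largest bias.
import random
import statistics

random.seed(42)  # arbitrary seed, just for repeatability
true_var = 1.0
sq_errors = {"n+1": [], "n": [], "n-1": []}
for _ in range(10000):
    sample = [random.gauss(0, 1) for _ in range(5)]
    n = len(sample)
    ss = statistics.pvariance(sample) * n  # sum of squared deviations
    sq_errors["n+1"].append((ss / (n + 1) - true_var) ** 2)
    sq_errors["n"].append((ss / n - true_var) ** 2)
    sq_errors["n-1"].append((ss / (n - 1) - true_var) ** 2)

for name, errs in sq_errors.items():
    print(name, statistics.mean(errs))
```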
There's another complication which may be important. The corrected sample variance estimates the population mean from the sample, but sometimes you know what the mean of the population is, or at least you already have a good estimate. Provided that the mean was estimated or calculated independently of the sample you are now using, this extra information makes the bias-correction unnecessary. Instead, you should use the `pvariance` function, and pass that known population mean as the optional `mu` parameter. In this case, `pvariance` is acting as the uncorrected sample variance rather than the population variance.
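For example (a made-up scenario, where the population mean of 4.0 is assumed to be known from some independent source):

```python
# Hypothetical: we know from independent measurements that the population
# mean is 4.0, so we pass it as mu rather than estimating it from the sample.
import statistics

sample = [1, 2, 3, 3, 3, 5, 8]
known_mu = 4.0  # known independently of this sample
print(statistics.pvariance(sample, mu=known_mu))  # -> 4.714285714285714
```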
In summary, here is a more detailed guide on which variance function to use:
- If you are dealing with the entire population, always use `pvariance`, no exceptions.
- If you are working with a sample:
  - If you know the population mean by some independent method, unrelated to the sample you are working with, use the `pvariance` function, and pass that known mean as the `mu` parameter.
  - If the sample is small or of moderate size, use `variance`.
  - If the sample is very large, the difference between the two variance functions is minimal, so it really doesn't matter which one you use. If you have no other reason to prefer one over the other, the least surprising decision is to use the sample `variance` function.
  - If you care more about reducing the mean squared error than the bias, use the `pvariance` function; if you don't know what mean squared error is, or don't care about it, use `variance`.
Otherwise, when in doubt, stick to `variance`, and you will rarely go wrong.