import_that | Entries tagged with statistics

Quote of the day, via John Cook:

Statistics: A subject which most statisticians find difficult but in which nearly all physicians are expert. — Stephen Senn

Earlier, I discussed some of the terminology and different uses for the two variance functions in Python 3.4's new statistics module. That was partly motivated by a question from Andrew Szeto, who asked me why the variance functions took optional mu or xbar parameters.

The two standard deviation functions stdev and pstdev are just thin wrappers that return the square root of the variance, so they have the same signature as the variance functions.

I had a few motives for including the second parameter, in no particular order:

The reason given in the PEP was that I took the idea from the GNU Scientific Library. Perhaps they know something I don't? (I actually thought of the idea independently, but when I was writing PEP 450 I expected this to be controversial, and was pleased to find prior art.)

It allows a neat micro-optimization to avoid having to recalculate the mean if you've already calculated it. If you have a large data set, or one with custom numeric types where the __add__ method is expensive, calculating the mean once instead of twice may save some time.

Mathematically, the variance is in some sense a function dependent on μ (mu) or x̄ (xbar). Making them parameters of the Python functions reflects that sense.

Variance has a nice interpretation in physics: it's the moment of inertia around the centre of mass. We can calculate the moment of inertia around any point, not just the centre — might we not also calculate the "variance" around some point other than the mean? If you want to abuse the variance function by passing (say) the median instead of the mean as xbar or mu, you can. But if you do, you're responsible for ensuring that the result is physically meaningful. Don't come complaining to me if you get a negative variance or some other wacky value.

It also allows you to calculate an improved sample variance by passing the known population mean to the pvariance function — see my earlier post or Wikipedia for details.

Individually, none of theses were especially strong, and probably wouldn't have justified the additional complexity on their own. But taken together I think they justified including the optional parameters.

Surprisingly (at least to me), I don't recall much if any opposition to these mu/xbar parameters. I expected this feature would be a lot more controversial than it turned out to be. If I recall correctly, there was more bike-shedding about what to call the parameters (I initially just called them "m", for mu/mean) than whether or not to include them.

I am the author of PEP 450 and the statistics module. That module offers four different functions related to statistical variance, and some people may not quite understand what the difference between them.

[Disclaimer: statistical variance is complicated, and my discussion here is quite simplified. In particular, most of what I say only applies to "reasonable" data sets which aren't too skewed or unusual, and samples which are random and representative. If your sample data is not representative of the population from which it is drawn, then all bets are off.]

The statistics module offers two variance functions, pvariance and variance, and two corresponding versions of the standard deviation, pstdev and stdev. The standard deviation functions are just thin wrappers which take the square root of the appropriate variance function, so there's not a lot to say about them. Except where noted differently, everything I say about the (p)variance functions also applies to the (p)stdev functions, so for brevity I will only talk about variance.

The two versions of variance give obviously different results:

py> import statistics
py> data = [1, 2, 3, 3, 3, 5, 8]
py> statistics.pvariance(data)
4.53061224489796
py> statistics.variance(data)
5.2857142857142865

So which should you use? In a nutshell, two simple rules apply:

If you are dealing with the entire population, use pvariance.

If you are working with a sample, use variance instead.

If you remember those two rules, you won't go badly wrong. Or at least, no more badly than most naive users of statistical functions. You want to be better than them, don't you? Then read on...

( Read more... )

S	M	T	W	T	F	S
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30
31

Import That!

Statistics quote

More on variance

Population and sample variance

Profile

May 2015

Syndicate

Most Popular Tags

Page Summary

Style Credit

Expand Cut Tags