Algorithms for calculating variance
In statistics, a formula for calculating the variance of a population of size n is:
- <math>\mathit{Variance} = \frac {n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2}{n^2}<math>
A formula for calculating the unbiased estimation of the population variance from n finite samples is:
- <math>\mathit{Variance} = \frac {n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2}{n(n-1)}<math>
The method of calculation may be more easily understood from the table below
where the mean is 8.
| i | xi | xi-mean
| (xi-mean)2
|
| (index) | (datum) | (deviation) | (squared deviation)
|
| 1 | 5 | -3 | 9
|
| 2 | 7 | -1 | 1
|
| 3 | 8 | 0 | 0
|
| 4 | 10 | 2 | 4
|
| 5 | 10 | 2 | 4
|
| n=5 | sum=40 | 0 | 18
|
- mean = 40/5 = 8
- variance = (5*338 - 402)/(5 * 4) = 4.5
- standard deviation = <math>\sqrt{\mathit{Variance}}<math>= 2.12
Note: Details of the variance calculation:
338 = [52 + 72 + 82 + 102 + 102]
40 = [5 + 7 + 8 + 10 + 10]
Algorithm
Therefore a simple algorithm to calculate variance can be described by the following pseudocode:
double sum;
double sum_sqr;
double variance;
long n = data.length; // the number of elements in the data array (the actual syntax is language-specific)
for i = 0 to n
sum += data[i];
sum_sqr += ( data[i] * data[i] );
end for
variance = ((n * sum_sqr) - (sum * sum))/(n*(n-1));
Algorithm
Another algorithm which avoids large numbers in sum_sqr while summing up
double avg;
double var;
long n = data.length; // number of elements
for i = 0 to n
avg = (avg*i + data[i]) / (i + 1);
if (i > 0) var += (var * (i - 1) + (data[i] - avg)*(data[i] - avg)) / i;
end for
return var; // resulting variance