R: Sample variance and SD (2024)

Sample variance and Standard Deviation using R

With our cattle weight data held in a variable called y, R can calculate the sample variance and sample standard deviation using the instructions var(y) and sd(y), giving:

> var(y)
[1] 1713.333
> sd(y)
[1] 41.39243

var(y) instructs R to calculate the sample variance of y. In other words, it divides the sum of squared deviations by n-1 'degrees of freedom', where n is the number of observations in y.
sd(y) instructs R to return the sample standard deviation of y, using n-1 degrees of freedom.
sd(y) = sqrt(var(y)). In other words, it is simply the square root of the n-1 variance, with no further correction applied for the bias of the standard deviation itself.
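
To see how var and sd relate to the n-1 formula, here is a minimal sketch using made-up weights (these are not our cattle data). The name manual_var is hypothetical, introduced only for the comparison:

```r
# Hypothetical weights, made up for illustration (NOT the cattle data)
y = c(410, 450, 480, 520, 390)
n = length(y)

# Sample variance by hand: sum of squared deviations divided by n-1
manual_var = sum((y - mean(y))^2) / (n - 1)

var(y)                           # 2750
all.equal(var(y), manual_var)    # TRUE
all.equal(sd(y), sqrt(var(y)))   # TRUE
```

The mean of these five values is 450, so the squared deviations sum to 11000, and 11000/4 gives 2750.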
The var function cannot give the 'population variance', which divides by n rather than n-1, but there are two simple ways to obtain it:
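
One pair of instructions consistent with the remarks on this page is sketched here, again with made-up weights rather than the cattle data. The names pop_var_1, pop_var_2 and y1 are hypothetical, and the order of the two formulas is an assumption based on how they behave when n=1:

```r
# Hypothetical weights, made up for illustration (NOT the cattle data)
y = c(410, 450, 480, 520, 390)
n = length(y)

# First way (assumed): rescale the sample variance from n-1 to n
pop_var_1 = var(y) * (n - 1) / n

# Second way (assumed): mean squared deviation from the mean
pop_var_2 = mean((y - mean(y))^2)

pop_var_1   # 2200
pop_var_2   # 2200

# With a single observation the two behave differently
y1 = 450
var(y1) * (length(y1) - 1) / length(y1)   # NA
mean((y1 - mean(y1))^2)                   # 0
```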
Remember that if n=1 the second variance formula will always yield zero, because the mean of y will equal y, whereas the first formula will always yield NA, because the sample variance involves 0/(1-1) = 0/0, which cannot be evaluated.
Similarly, to obtain the 'population' standard deviation, use:
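
As a sketch of what that could look like (with made-up weights, not the cattle data, and with the hypothetical names pop_sd_1 and pop_sd_2), either of these two equivalent forms should work:

```r
# Hypothetical weights, made up for illustration (NOT the cattle data)
y = c(410, 450, 480, 520, 390)
n = length(y)

# Rescale the sample standard deviation from n-1 to n
pop_sd_1 = sd(y) * sqrt((n - 1) / n)

# Or take the square root of the mean squared deviation
pop_sd_2 = sqrt(mean((y - mean(y))^2))

all.equal(pop_sd_1, pop_sd_2)   # TRUE; both equal sqrt(2200)
```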

R can calculate the variance from the frequencies (f) of a frequency distribution with class midpoints (y) using these instructions:
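
Put together from the four instructions explained line by line in this section, the sequence would read as follows (a reconstruction, so the layout and comments are assumptions):

```r
# Class interval midpoints
y=c(110, 125, 135, 155)
# Frequency of each class
f=c(23, 15, 6, 2)
# Arithmetic mean estimated from the midpoints and frequencies
ybar=sum(y*f)/sum(f)
# Sample variance estimated from the frequencies (printed by R)
sum(f*(y-ybar)^2) / (sum(f)-1)
```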

Giving:

[1] 143.8768

y=c(110, 125, 135, 155) copies the class interval midpoints into a variable called y.
f=c(23, 15, 6, 2) copies the frequency of each class into a variable called f.
ybar=sum(y*f)/sum(f) creates a variable called ybar, containing the arithmetic mean as calculated from these frequencies and midpoints.
Even if you have a more accurate arithmetic mean, calculated directly from the observations themselves, you still need to use this formula. Otherwise your estimated variance will be too high, because the sum of squared deviations is smallest when taken about the mean of the same midpoints and frequencies from which the variance is calculated.
sum(f*(y-ybar)^2) / (sum(f)-1) calculates the sample variance from the frequencies, f, midpoints, y, and the mean estimated from them, ybar.
Alternatively, you could combine two of these instructions as: sum(f*(y-sum(y*f)/sum(f))^2)/(sum(f)-1)
Remember that this only provides an estimate of the variance you would obtain from the original data, and that it depends upon the choice of midpoints and the number of class intervals used.
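
Two of the points above can be checked with a short sketch. It uses the midpoints and frequencies from this page, but the names grouped_var, expanded, raw_mean and mismatched are hypothetical, and the value 121 is a made-up stand-in for a mean calculated from raw observations:

```r
y = c(110, 125, 135, 155)
f = c(23, 15, 6, 2)
ybar = sum(y*f)/sum(f)
grouped_var = sum(f*(y-ybar)^2)/(sum(f)-1)

# Repeating each midpoint f times and calling var() gives the same value
expanded = rep(y, f)
all.equal(var(expanded), grouped_var)   # TRUE

# Made-up mean, standing in for one calculated from the raw observations
raw_mean = 121
mismatched = sum(f*(y-raw_mean)^2)/(sum(f)-1)
mismatched > grouped_var                # TRUE
```

Any mean other than ybar inflates the result, which is why the mean estimated from the midpoints and frequencies is the one to use.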