According to http://vortex.ihrc.fiu.edu/MET4570/members/Lectures/Lect05/m10divideby_nminus1.pdf, the reason we use n - 1
is that it ensures the average of the sample variances over every possible sample (with replacement) of a population equals the population variance, i.e. it makes the sample variance an unbiased estimator!
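For example, here's a minimal sketch of that claim on a tiny population (the population {1, 2, 3} and the sample size of 2 are arbitrary choices; np.var with ddof=1 is the n - 1 version):

import itertools
import numpy as np

# tiny population so we can enumerate every possible sample with replacement
population = np.array([1.0, 2.0, 3.0])
population_variance = np.var(population)  # divides by N; this is σ^2

sample_size = 2
samples = itertools.product(population, repeat=sample_size)
sample_variances = [np.var(s, ddof=1) for s in samples]  # divides by n - 1

# the average of all the n - 1 sample variances equals σ^2 exactly
print(population_variance)        # 0.666...
print(np.mean(sample_variances))  # 0.666...

The code below then compares the n - 1 sample variance against the older divide-by-n version on larger, simulated populations.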
import numpy as np

def calculate_variance(population):
    # the population data must be normally distributed for this to make sense
    # use normality tests to check if it makes sense
    sample = np.random.choice(population, size=100, replace=False)
    population_deviations = population - np.mean(population)
    sample_deviations = sample - np.mean(sample)
    # σ^2 - this is a "parameter" of the population
    population_variance = np.mean(np.square(population_deviations))
    # s^2 - this is a "statistic" of the sample
    sample_variance = np.sum(np.square(sample_deviations)) / (
        sample_deviations.size - 1)
    # older style of sample variance that didn't divide by n - 1
    sample_variance_ = np.mean(np.square(sample_deviations))
    return (population_variance, sample_variance, sample_variance_)
However, running this many times, the n - 1 sample variance isn't necessarily closer to the population variance than the old divide-by-n sample variance!
def cma(cma=0, count=0):
    # closure that maintains a cumulative moving average
    def next(value):
        nonlocal cma
        nonlocal count
        cma = (value + (count * cma)) / (count + 1)
        count += 1
        return cma
    return next
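# for example, a quick sanity check of the running average:
avg = cma()
print(avg(1))  # 1.0
print(avg(2))  # 1.5
print(avg(3))  # 2.0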
# empirical evidence of estimation performance
# we look at the difference between sample_variance and sample_variance_
# relative to the population variance
variance_diff_avg_cma = cma()
variance_diff_avg_cma_ = cma()
for _ in range(1000):
    (population_variance, sample_variance,
     sample_variance_) = calculate_variance(np.random.normal(size=1000))
    variance_diff_avg = variance_diff_avg_cma(
        (population_variance - sample_variance)**2)
    variance_diff_avg_ = variance_diff_avg_cma_(
        (population_variance - sample_variance_)**2)
print('DIFF 1:', variance_diff_avg)
print('DIFF 2:', variance_diff_avg_)
if variance_diff_avg < variance_diff_avg_:
    print('new sample variance is better')
elif variance_diff_avg > variance_diff_avg_:
    print('old sample variance is better')
else:
    print('new and old sample variance are equal')
# this doesn't make sense:
# the old divide-by-n sample variance ends up better estimating
# the population variance
variance_diff_avg_cma = cma()
variance_diff_avg_cma_ = cma()
# set the population first
population = np.random.normal(size=1000)
for _ in range(1000):
    (population_variance, sample_variance,
     sample_variance_) = calculate_variance(population)
    variance_diff_avg = variance_diff_avg_cma(
        (population_variance - sample_variance)**2)
    variance_diff_avg_ = variance_diff_avg_cma_(
        (population_variance - sample_variance_)**2)
print('DIFF 1:', variance_diff_avg)
print('DIFF 2:', variance_diff_avg_)
if variance_diff_avg < variance_diff_avg_:
    print('new sample variance is better')
elif variance_diff_avg > variance_diff_avg_:
    print('old sample variance is better')
else:
    print('new and old sample variance are equal')
I haven't seen any documentation that shows this result. But I guess the logical need for an unbiased estimator is more important than the empirical distance from the actual population variance.
Once you have the standard deviation, you can estimate the bounds on values using the 68–95–99.7 rule.
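For example, a quick sketch on simulated normal data (the mean and standard deviation used here are arbitrary):

import numpy as np

data = np.random.normal(loc=10.0, scale=2.0, size=100_000)
mean = np.mean(data)
std = np.std(data, ddof=1)  # sample standard deviation (n - 1)

for k, rule in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    # fraction of values within k standard deviations of the mean
    within = np.mean(np.abs(data - mean) <= k * std)
    print(f'within {k} std of the mean: {within:.3f} (rule of thumb: {rule})')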
For an ML system: https://www.quora.com/What-is-meant-by-high-variance-low-bias-and-why-does-this-lead-to-overfitting-in-machine-learning
Overfitting can mean high variance (could be low or high bias).
Underfitting can mean high bias (could be low or high variance).
You want to reduce both bias and variance: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
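A minimal sketch of that tradeoff (the sine target, noise level, and polynomial degrees below are arbitrary assumptions): fitting polynomials of increasing degree to repeated noisy samples of a fixed curve, the low-degree fit tends to show high bias (underfitting) and the high-degree fit tends to show high variance (overfitting).

import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    return np.sin(x)

x_test = np.linspace(0, np.pi, 50)
degrees = [1, 3, 9]
predictions = {d: [] for d in degrees}

for _ in range(200):
    # a fresh noisy training set each round
    x_train = rng.uniform(0, np.pi, 20)
    y_train = true_function(x_train) + rng.normal(scale=0.3, size=20)
    for d in degrees:
        coeffs = np.polyfit(x_train, y_train, d)
        predictions[d].append(np.polyval(coeffs, x_test))

for d in degrees:
    preds = np.array(predictions[d])
    # bias^2: squared gap between the average fit and the true curve
    bias_sq = np.mean((preds.mean(axis=0) - true_function(x_test))**2)
    # variance: how much the fits wobble across training sets
    variance = np.mean(preds.var(axis=0))
    print(f'degree {d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}')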
But in practice it's often a tradeoff. It seems like the tradeoff between precision and recall!
There is a relationship!
https://stats.stackexchange.com/questions/158443/bias-variance-tradeoff-vs-precision-and-recall
In the diagram in that answer, the overfitting situation is where you have low bias and high variance. The underfitting situation is where you have high bias and low variance.
A high recall, low precision model corresponds to the case of high variance and low bias. Similarly, a model which gets some false negatives but fewer false positives, i.e. high precision and low recall, corresponds to the high bias, low variance case. https://stats.stackexchange.com/questions/158443/bias-variance-tradeoff-vs-precision-and-recall
It's important to get to a point where you minimise total error. In the PvR scenario, you want a high F-score. In the BvV scenario you want low variance and low bias (just like high precision and high recall)!
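For reference, the F-score (F1 here) is just the harmonic mean of precision and recall, so it's only high when both are high:

def f1_score(precision, recall):
    # harmonic mean of precision and recall
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.9, 0.1))  # high precision, low recall -> low F1 (0.18)
print(f1_score(0.7, 0.7))  # both reasonably high -> higher F1 (0.7)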
More commentary on this: https://stats.stackexchange.com/a/17893/198729