Denote the expectation by
${\mathbb E}[f(X)]:=\sum_x f(x)\cdot p(x)$,
where $\sum_x$ is the sum (or integral) over all $x$ and $p(x)$ is the probability distribution in question.
Denote the variance by
$\mathrm{Var}[X]:={\mathbb E}[(X-{\mathbb E}[X])^2]$
Further, define the standard deviation
$\sigma[X]=\mathrm{Var}[X]^\frac{1}{2}$
(Which, if we work with units, has the same units as ${\mathbb E}[X]$ or $X$ itself, unlike the variance.)
A funny source of Python issues: the common libraries disagree on the default denominator for the standard deviation.
import statistics, pandas, torch, numpy
xs = list(numpy.random.random(20))
statistics.mean(xs)            # the four means agree (up to float rounding)...
pandas.Series(xs).mean()
numpy.mean(xs)
torch.mean(torch.tensor(xs))
statistics.stdev(xs)           # n-1 denominator (sample standard deviation)
pandas.Series(xs).std()        # n-1 denominator
torch.std(torch.tensor(xs))    # n-1 denominator
numpy.std(xs)                  # n denominator -- numpy is the odd one out
xs = [15.0, 17.0, 19.0]
statistics.mean(xs) # 17.0
statistics.variance(xs) # 4.0
statistics.stdev(xs) # 2.0
pandas.Series(xs).mean() # 17.0
pandas.Series(xs).var() # 4.0
pandas.Series(xs).std() # 2.0
numpy.mean(xs) # 17.0
numpy.var(xs) # 2.6666666666666665 = 8/3 = (2*(2**2))/3
numpy.std(xs) # 1.632993161855452
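The mismatch above comes down to the default denominator, which numpy exposes as a parameter. A minimal sketch reconciling the libraries via numpy's `ddof` ("delta degrees of freedom") argument, whose divisor is $n - \mathrm{ddof}$:

```python
import statistics

import numpy

xs = [15.0, 17.0, 19.0]

# statistics.variance uses the (n-1)-denominator (sample variance);
# numpy.var defaults to the n-denominator (population variance).
assert statistics.variance(xs) == 4.0
assert abs(numpy.var(xs) - 8 / 3) < 1e-12

# numpy's divisor is n - ddof, so ddof=1 reproduces the
# sample variance and sample standard deviation exactly.
assert numpy.var(xs, ddof=1) == 4.0
assert numpy.std(xs, ddof=1) == 2.0
```

torch.std similarly defaults to the $n-1$ denominator and takes a flag to switch to the $n$ denominator.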
===Theorem (informal)===
Note:
$\forall i.\ {\mathbb E}[X_i-X_1] =0 \implies {\mathbb E}\left[\sum_{i=1}^n X_i\right] = n\,{\mathbb E}[X_1]$
To discuss the consequences of re-using samples from $X$,
we take the $X_i$ here to be iid random variables.
For $x_i$ sampled from $X_i$, with $i\in\{1,2,\dots, n\}$, define the sample mean by
$\overline x := \frac{1}{n} \sum_{j=1}^n x_j$.
By the above implication, ${\mathbb E}[\overline X] = {\mathbb E}[X_1]$, where $\overline X := \frac{1}{n} \sum_{j=1}^n X_j$ is the sample mean viewed as a random variable,
so $\overline x$ may be used to estimate the expectation.
Note:
Consider the case $n=2$. Then, whatever the two samples are, $\overline x$ always lies between them.
So $\overline x$, as a means to estimate ${\mathbb E}[X]$, is constrained by where the samples lie, in a very geometric way.
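A quick numerical sketch of both points (unbiasedness and the geometric constraint), under the assumption that $X$ is uniform on $[0,1)$, so ${\mathbb E}[X]=0.5$:

```python
import random

random.seed(0)

# X uniform on [0, 1), so E[X] = 0.5.  With n = 2 samples per trial,
# the sample mean always lies between the two samples, yet averaged
# over many trials it recovers E[X], since E[sample mean] = E[X_1].
sample_means = []
for _ in range(100_000):
    x1, x2 = random.random(), random.random()
    xbar = (x1 + x2) / 2
    assert min(x1, x2) <= xbar <= max(x1, x2)  # the geometric constraint
    sample_means.append(xbar)

print(sum(sample_means) / len(sample_means))  # close to 0.5
```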
Let
$D(m) := \sum_{k=1}^n \left( x_k - m \right)^2$
Informal theorem: A "good" estimator of the variance $\mathrm{Var}[X]:={\mathbb E}[(X-{\mathbb E}[X])^2]$,
via a sample set of size $n$, is the so-called sample variance
$\frac{1}{n-1}D(\overline x)$.
Naive question:
If the number $m$ is an estimate of ${\mathbb E}[X]$, we might want to estimate the variance via
$\frac{1}{n} D(m)$.
As $\overline x$ indeed estimates ${\mathbb E}[X]$, why not use $\frac{1}{n} D(\overline x)$?
We will see that this naive formula fails to have the desired properties.
Hint: In $D(\overline x) = \sum_{k=1}^n \left( x_k - \overline x \right)^2$, each sample $x_k$ is re-used inside $\overline x$, leading to bias.
Note:
$D'(m) = \sum_{k=1}^n 2 \left( x_k - m \right) \cdot (-1) = 2 n (m - \overline x)$
I.e. among all values of $m$, the choice $m={\overline x}$ in fact minimizes $D$.
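A small numerical check of this minimization, reusing the three-point sample from earlier:

```python
xs = [15.0, 17.0, 19.0]
n = len(xs)
xbar = sum(xs) / n  # 17.0

def D(m):
    """Sum of squared deviations of the samples from m."""
    return sum((x - m) ** 2 for x in xs)

# D'(m) = 2n(m - xbar) vanishes at m = xbar, and D is convex,
# so D(xbar) is the minimum over all m:
for m in [16.0, 16.9, 17.1, 18.0]:
    assert D(xbar) < D(m)

print(D(xbar) / (n - 1))  # the sample variance, 4.0
```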
$\bullet$ We'll discuss why $\frac{1}{n-1}$ over $\frac{1}{n}$ is sensible in a formal proof.
$\bullet$ Dividing by the bigger $n$ leads to an underestimate, as we'll argue,
albeit the factor $\frac{n-1}{n}<1$ matters little, in practice, for large $n$.
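Both bullet points can be checked empirically; a Monte Carlo sketch, assuming $X$ uniform on $[0,1)$ with $\mathrm{Var}[X]=\frac{1}{12}\approx 0.0833$:

```python
import random

random.seed(1)

# Compare the 1/n and 1/(n-1) formulas over many trials of n = 3
# samples each.  The 1/n version should land near (n-1)/n * Var[X],
# the 1/(n-1) version near Var[X] itself.
n, trials = 3, 200_000
naive = bessel = 0.0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    xbar = sum(xs) / n
    d = sum((x - xbar) ** 2 for x in xs)
    naive += d / n
    bessel += d / (n - 1)

print(naive / trials)   # near 2/3 * 1/12, approx. 0.056
print(bessel / trials)  # near 1/12, approx. 0.083
```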
===Theorem===
The expectation of the sample variance (defined above, for samples of any size $n\ge 2$) equals $\mathrm{Var}[X_1]={\mathbb E}[(X_1-{\mathbb E}[X_1])^2]$, i.e. the estimator is unbiased.
Proof. Since the $X_i$ are iid,
$i=j\implies {\mathbb E}[X_i\cdot X_j] = {\mathbb E}[X_1^2]$
$i\neq j\implies {\mathbb E}[X_i\cdot X_j] = {\mathbb E}[X_1]^2$
With this,
$ {\mathbb E} \left[ \left( \sum_{j=1}^n (X_k - X_j) \right)^2 \right]$
$= {\mathbb E} \left[ \sum_{j=1}^n \sum_{l=1}^n (X_k - X_j)(X_k - X_l) \right]$
Expanding each product into $X_k^2 - X_k X_l - X_j X_k + X_j X_l$ and counting, term by term, how many of the $n^2$ index pairs $(j,l)$ yield ${\mathbb E}[X_1^2]$ versus ${\mathbb E}[X_1]^2$:
$= {\mathbb E}[X_1^2]\cdot(n^2+2n\cdot(-1)+n\cdot(-1)^2) + {\mathbb E}[X_1]^2\cdot(2n(n-1)\cdot(-1) + n(n-1)\cdot(-1)^2)$
$= n(n-1) \big({\mathbb E}[X_1^2] - {\mathbb E}[X_1]^2\big)$
$= n(n-1)\, {\mathbb E}[(X_1-{\mathbb E}[X_1])^2]$
$= n(n-1) \mathrm{Var}(X_1)$
Hence
${\mathbb E} \left[ \sum_{k=1}^n \left( X_k - \frac{1}{n} \sum_{j=1}^n X_j \right)^2 \right]$
$= \sum_{k=1}^n {\mathbb E} \left[ \left( X_k - \frac{1}{n} \sum_{j=1}^n X_j \right)^2 \right]$
$= \frac{1}{n^2} \sum_{k=1}^n {\mathbb E} \left[ \left( \sum_{j=1}^n (X_k - X_j) \right)^2 \right]$
$= (n-1) \mathrm{Var}(X_1)$
The average over all $\left( x_k - {\overline x} \right)^2$, i.e. the more naive formula $\frac{1}{n} D(\overline x)$,
instead systematically underestimates the variance, by the factor $\frac{n-1}{n}$.
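The theorem can also be verified exactly in a small case by enumerating all outcomes; here a fair coin on $\{0,1\}$ (so $\mathrm{Var}[X]=\frac{1}{4}$) with $n=2$ samples:

```python
from itertools import product

# X fair coin on {0, 1}: Var[X] = 1/4.  With n = 2 samples, average
# D(sample mean) over all 2**n equally likely outcomes; the theorem
# predicts (n - 1) * Var[X] = 0.25 exactly.
n = 2
total = 0.0
for outcome in product([0, 1], repeat=n):
    xbar = sum(outcome) / n
    total += sum((x - xbar) ** 2 for x in outcome)

expected_D = total / 2 ** n
print(expected_D)            # 0.25 = (n-1) * Var[X]
print(expected_D / (n - 1))  # sample variance in expectation: 0.25
print(expected_D / n)        # naive 1/n formula: 0.125, an underestimate
```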