Details to show uncertainty for huge amount of data

Mathematical Definitions

The following Formulas are typeset with the help of https://alexanderrodin.com/github-latex-markdown/.

statistical Variable

derived Mathematical Rules

$E \left( a X + Y \right) = a E \left( X \right) + E \left( Y \right)$

$V \left( a X \right) = a^2 V \left( X \right)$

$V \left( X + b \right) = V \left( X \right)$

$V \left( X + Y \right) = V \left( X \right) + 2 \mathrm{cov} \left( X , Y \right) + V \left( Y \right)$

Mathematical derivation

Model

$Y = a X + b + F$

input $X$
output $Y$
error $F$

Somehow natural conditions

expected error is zero

$\widehat{F} = E \left( F \right) \stackrel{!}{=} 0$

Applying expectation on the model yields

$\begin{align*} E \left( Y \right) &= E \left( a X + b + F \right) \ \widehat{Y} &= a \widehat{X} + b + \widehat{F} \ \widehat{Y} &= a \widehat{X} + b \end{align*}$

This means, the model applied to the expectation values is exact.

variation of error is minial

$\widetilde{F} = V \left( F \right) \stackrel{!}{=} \min$

Inserting the model into this brings up

$\begin{align*} V \left( F \right) &= V \left( Y - a X - b \right)= V \left( Y - a X \right) \ &= V \left( Y \right) + 2 \mathrm{cov}\left( Y , -a X \right) + V \left( - a X \right)\ &= \widetilde{YY} - 2 a \widetilde{YX} + a^2 \widetilde{XX} \end{align*}$

Derive for the unknown scalar value $a$ and setting to zero shows

$0 = - 2 \widetilde{YX} + 2 a \widetilde{XX}$

or simpler and explicit

$a = \frac{\widetilde{YX}}{\widetilde{XX}}$

combination of the former

Inserting the result from the variation we get from the expectation

$\widehat{Y} = \frac{\widetilde{YX}}{\widetilde{XX}} \widehat{X} + b$

and folling explicit

$b = \widehat{Y} - \frac{\widetilde{YX}}{\widetilde{XX}} \widehat{X}$

Explicit model

Putting all together shows we should model as

$Y \approx \frac{\widetilde{YX}}{\widetilde{XX}} \left( X - \widehat{X}\right) + \widehat{Y}$

with the error variance

$\widetilde{FF} = \widetilde{YY} - \frac{\widetilde{XY} \widetilde{XY}}{\widetilde{XX}}$

Python Implementation

def plotwithlinreguncertainty(data: pandas.DataFrame,
                              x: str,
                              y: str,
                              res: int = 150,
                              ci=.95,
                              globalmodell: bool = True,
                              **pltopts):
    assert 2 * res <= len(data), f"Zu hohe Auflösung {res} für {len(data)} Werte. Zwei Einträge pro Block für Kovarianz nötig."
    import matplotlib.pyplot
    import numpy
    import scipy.special
    sigmafactor = scipy.special.erfinv(ci) * numpy.sqrt(2)
    wrkdata = data[[x, y]].sort_values(x).reset_index(drop=True)
    assert (wrkdata[x].is_monotonic)
    fg = matplotlib.pyplot.figure(**pltopts)
    ax = matplotlib.pyplot.gca()
    grpcol = res*wrkdata.index.to_series() // len(wrkdata)
    grpobj = wrkdata.groupby(by=grpcol)
    pltdata = grpobj.mean()
    Vx = grpobj.cov().xs(y, level=1)[y]
    Cxy = grpobj.cov().xs(x, level=1)[y]
    Vy = grpobj.cov().xs(x, level=1)[x]
    if globalmodell:
        covges = wrkdata.cov()
        a = covges.loc[y, x] / covges.loc[x, x]
    else:
        a = Cxy / Vx
    delta = numpy.sqrt(Vy - 2 * a * Cxy +
                       a * a * Vx)  # Could be into if/else - Performance?
    # Create Plots
    lines, = ax.plot([], [])
    fillcol = lines.get_color()
    lines.remove()
    ax.plot(pltdata[x], pltdata[y], color=fillcol)
    ax.fill_between(pltdata[x],
                    pltdata[y] - sigmafactor * delta,
                    pltdata[y] + sigmafactor * delta,
                    facecolor=fillcol,
                    alpha=.15)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(f'Mean and {100*ci} % confidence bands')

Barry1/plotwithuncertainty.md