Example:
    from fitline import Fitline  # a class with functions fit(), line(), plot(), trim(), trimfit()

    F = Fitline( level=.95, nbig=5, verbose=1, axis=None, xline=None )  # the defaults
    F.fit( x, y )                              # -> F.a, F.b, F.residuals ...
    a, b = F.a, F.b                            # y ~= a + b * x
    xtrim, ytrim = F.trimfit( x, y, niter=1 )  # trim the nbig biggest residuals, fit again
See https://github.com/denis-bz/plots/issues/1 for an example plot.
Parameters:

level=0.95: the level for confidence / prediction bands, aka 1 - alpha
nbig=5: track the few biggest residuals |y_i - line(x_i)|
verbose=1: print fit() info like
    fitline 164 points: intercept 0.52 +- 0.12  slope 0.319 +- 0.1
    residuals: sd 0.627  var 0.394  biggest res^2 are [9 5 5 4 3] ... % of the total
axis=None: a matplotlib fig or axis from e.g. fig, axes = pl.subplots(),
    to plot fit()s; the default None does no plotting.
Methods:

F.fit( x, y ): fit a line y = a + b * x to 1d numpy arrays or array-likes.

F.line( xx ): the line, F.a + F.b * xx. (F.predict() is a synonym for F.line().)

F.plot( x, y ): called from F.fit() if Fitline( axis=fig ) or axis=axis.
    It plots the x y points, F.line( F.xline ),
    and confidence and prediction regions, using matplotlib.

xtrim, ytrim = F.trim( x, y ): x y without the nbig points with the biggest residuals,
    e.g. 100 -> 95 points. x y must be the same as in the last fit( x, y ).

xtrim, ytrim = F.trimfit( x, y, niter=1 ): trim(), then fit again.

se_conf( xx ), se_predict( xx ): see Wikipedia Confidence_and_prediction_bands.
    These use sd_res, n, mean_x and Sxx from the last fit().
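For reference, the textbook standard errors from that Wikipedia page take only a few
lines of numpy. This is a sketch of the formulas, not Fitline's actual code;
here Sxx = sum( (x_i - mean_x)^2 ):

    import numpy as np
    from scipy.stats import t

    def se_conf( xx, sd_res, n, mean_x, Sxx ):
        # standard error of the fitted line at points xx
        return sd_res * np.sqrt( 1./n + (xx - mean_x)**2 / Sxx )

    def se_predict( xx, sd_res, n, mean_x, Sxx ):
        # standard error of a new observation at xx: adds the residual noise itself
        return sd_res * np.sqrt( 1 + 1./n + (xx - mean_x)**2 / Sxx )

    # halfwidth of a level=0.95 band: tval * se_...( xx ),
    # with tval = t.ppf( (1 + .95) / 2., n - 2 )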
Attributes, after fit():

    a  b  slope == b
    se_a  se_slope
    mean_x  mean_y
    residuals  res2  sd_res  var_res
    bigres  jbig  xbig  ybig
    summary
Please see the code for details.
Requires: Python2, numpy, scipy (for tval); for plotting, matplotlib and optionally seaborn.
A few big squares (outliers) can shift least-squares lines quite a lot. For example,
distances 1 1 1 1 1 10: 10 is 2/3 of the sum
=> squares 1 1 1 1 1 100: 100 is 95 % of the sum
See the picture under Wikipedia Coefficient_of_determination.
(Fitline
plots big residuals as long vertical lines though -- visually |res|,
not the res^2 that least-squares is minimizing.)
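To see how much, a tiny numpy demonstration (made-up data):

    import numpy as np

    x = np.arange( 6. )
    y = x.copy()            # 6 points exactly on the line y = x
    yout = y.copy()
    yout[-1] += 10          # move the last point 10 up

    print( "clean:       slope, intercept %s" % np.polyfit( x, y, 1 ))
    print( "one outlier: slope, intercept %s" % np.polyfit( x, yout, 1 ))
    # the one outlier moves the slope from 1 to ~ 2.4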
A different way to plot confidence intervals and bands is to
calculate 100 lines from random subsets of the data, i.e. bootstrap.
Say we want an 80 % confidence level -- 10 low, 80 middle, 10 high.
For a given x, the interval from the 10th lowest to the 10th highest of the 100 lines(x)
is a confidence interval.
Sweeping this across x in e.g. np.linspace(...) gives a confidence band.
Why bootstrap?
- A few definite lines may be more informative than a nice, symmetric OVERconfidence region.
- It works for any method of fitting lines, with residuals normal or not.
See understanding-shape-and-calculation-of-confidence-bands-in-linear-regression for an example plot of bootstrap lines.
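A sketch of that recipe in plain numpy (the function name and details are mine,
not part of Fitline):

    import numpy as np

    def bootstrap_band( x, y, nlines=100, lo=10, hi=90 ):
        # fit nlines lines to random resamples of (x, y); at each point of xx,
        # the 10th lowest .. 10th highest line values give an 80 % band
        n = len(x)
        xx = np.linspace( x.min(), x.max(), 50 )
        lines = np.empty(( nlines, len(xx) ))
        for j in range( nlines ):
            J = np.random.randint( 0, n, n )      # resample with replacement
            b, a = np.polyfit( x[J], y[J], 1 )    # slope, intercept
            lines[j] = a + b * xx
        return xx, np.percentile( lines, [lo, hi], axis=0 )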
If a line is a crummy fit to some data, a "confidence region" is unlikely to say "crummy" out loud, especially if the boss wants to hear "95 % confident". Exercise: plot confidence regions for Anscombe's quartet.
If your data has outliers, one approach is to trim a few, as with trimfit.
Some other approaches:

- medianline: the line through [median(x), median(y)] with slope = median( y_i / x_i ).
  This is robust and takes only a few lines of code (medianline.py),
  but is afaik hard to analyze theoretically. (A sketch follows this list.)
- If x and y could both be noisy, look at the line x ~ y as well as y ~ x.
- Cheap, approximate orthogonal aka Deming regression (see the picture there):
  first centre the data at [0 0], by subtracting off means or medians,
  so we have only one parameter, slope or angle. Then
    - fit slope y = b0 x as usual
    - rotate the data, e.g. if b0 == 1, 45 degrees clockwise;
      now residual lines are orthogonal to the b0 line
    - fit again, y = b1 x
    - rotate back: y = b1 x rotated 45 degrees counterclockwise.
  (A sketch of this too, after the list.)
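A sketch of medianline, following the description above literally
(the real medianline.py may differ):

    import numpy as np

    def medianline( x, y ):
        # the line through [median(x), median(y)] with slope median( y_i / x_i )
        x, y = np.asarray( x, float ), np.asarray( y, float )
        b = np.median( y / x )                    # assumes no x_i == 0
        a = np.median( y ) - b * np.median( x )   # passes through the medians
        return a, b                               # y ~= a + b * x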
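And a sketch of the cheap Deming recipe, for data already centred at [0 0]
(my reading of the steps above, not tested against real Deming regression):

    import numpy as np

    def cheap_deming_slope( x, y ):
        # x, y centred at [0 0]: subtract means or medians first
        b0 = np.dot( x, y ) / np.dot( x, x )      # fit y = b0 x as usual
        theta = np.arctan( b0 )                   # rotate clockwise by theta,
        c, s = np.cos( theta ), np.sin( theta )   # so the b0 line -> the x axis
        xr, yr = c*x + s*y, c*y - s*x             # residuals now orthogonal to it
        b1 = np.dot( xr, yr ) / np.dot( xr, xr )  # fit again, y = b1 x
        return np.tan( theta + np.arctan( b1 ))   # rotate back: the final slope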
There are many many methods, papers and books on outliers and robust regression.
Start by looking at the biggest residuals, e.g. in Fitline
plots.
This Fitline class is modified from fitLine.py in
https://github.com/thomas-haslwanter/statsintro
    author: Thomas Haslwanter
    date: 13.Dec.2012
    ver: 2.2
See also:

Wikipedia Outlier:
    "Deletion of outlier data is a controversial practice frowned on ..."
Visualizing linear relationships in Seaborn
RANSAC, if there are many outliers
On stats.stackexchange:
    Confidence_and_prediction_bands
    Understanding-shape-and-calculation-of-confidence-bands-in-linear-regression
    Linear-regression-prediction-interval
    Other-ways-to-find-line-of-best-fit
Comments welcome, test cases most welcome.

cheers
  -- denis-bz-py t-online.de

Last change: 2015-07-10