Jason Vestuto, Data Scientist, University of Texas at Austin
We start the course with an initial exploration of linear relationships, including some motivating examples of how linear models are used, and demonstrations of data visualization methods from matplotlib. We then use descriptive statistics to quantify the shape of our data and use correlation to quantify the strength of linear relationships between two variables.
- Models are used to make predictions for future performance (extrapolation)
- Models can interpolate the data
- Will use "matplotlib.pyplot" package to visualize the models
- Linear models use the slope-intercept form of the linear equation
- slope: dy / dx (rise-over-run)
- intercept: the value of y when x = 0
- Review of Single Variable Statistics
Name | Formula |
---|---|
Mean | sum(x) / len(x) |
Deviation | x - mean(x) |
Variance | np.mean(dx * dx) |
Standard Deviation | np.sqrt(variance) |
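A minimal sketch of the table above, using a small made-up array x:
>>> import numpy as np
>>> x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> mean = np.sum(x) / len(x)      # same value as np.mean(x)
>>> dx = x - mean                  # deviations from the mean
>>> variance = np.mean(dx * dx)    # same value as np.var(x)
>>> stddev = np.sqrt(variance)     # same value as np.std(x)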
- Quantifying Linear Relationships
- Covariance
- Deviation of two variables: dx, dy
- Co-vary means to vary together: deviation_products = dx * dy
- Covariance as the mean: covariance = np.mean(dx*dy)
- Correlation
- divide deviations by standard deviation:
- zx = dx/np.std(x)
- zy = dy/np.std(y)
- mean of the normalized deviations: correlation = np.mean(zx*zy)
- Normalization helps to compare (correlate) signals of different magnitudes
- Magnitude and direction of correlation: the sign (positive, negative, or zero) gives the direction, the magnitude gives the strength (see the sketch below)
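A short sketch of the covariance and correlation recipes above, with small made-up arrays x and y; np.corrcoef is included only as a cross-check:
>>> import numpy as np
>>> x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
>>> dx, dy = x - np.mean(x), y - np.mean(y)
>>> covariance = np.mean(dx * dy)
>>> zx, zy = dx / np.std(x), dy / np.std(y)
>>> correlation = np.mean(zx * zy)
>>> print( np.isclose(correlation, np.corrcoef(x, y)[0, 1]) )
True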
Here we look at the parts that go into building a linear model. Using the concept of a Taylor Series, we focus on the parameters slope and intercept, how they define the model, and how to interpret them in several applied contexts. We apply a variety of Python modules to find the model that best fits the data by computing the optimal values of slope and intercept, using least-squares, numpy, statsmodels, and scikit-learn.
- Taylor Series
- Approximates any curve
- Polynomial form: y = a0 + a1*x + a2*x^2 + a3*x^3 + ... + an*x^n
- often, first order is enough: y = a0 + a1*x
- First order with a1=0: flat line
- First order with a1=1: diagonal line
- Second order with a2=1: parabolic (quadratic) curve
- Overfitting is a problem -- predictions beyond the training data can be significantly incorrect (see the sketch below)
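A small sketch of why "first order is often enough" near a point, using np.sin as the curve being approximated around x=0 (the values are made up for illustration):
>>> import numpy as np
>>> x = np.linspace(-0.5, 0.5, 5)
>>> y_curve = np.sin(x)                 # the curve being approximated
>>> y_linear = 0.0 + 1.0 * x            # first-order Taylor terms: a0=0, a1=1
>>> print( np.max(np.abs(y_curve - y_linear)) < 0.03 )   # close near x=0 ...
True
>>> print( np.abs(np.sin(3.0) - 3.0) < 0.03 )            # ... but not far away (extrapolation)
False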
- Interpreting Slope and Intercept
- y = a0 + a1*x
- x = independent variable, e.g. time
- y = dependent variable, e.g. distance traveled
- slope as an average rate of change, e.g. average speed (see the example below)
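For example (a sketch with made-up times in hours and distances in miles):
>>> import numpy as np
>>> times = np.array([0.0, 1.0, 2.0, 3.0])            # hours (independent variable)
>>> distances = np.array([10.0, 60.0, 110.0, 160.0])  # miles (dependent variable)
>>> a1 = (distances[-1] - distances[0]) / (times[-1] - times[0])   # slope = rise-over-run = average speed
>>> a0 = distances[0]                                  # intercept = distance at time = 0
>>> print(a1, a0)
50.0 10.0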
- Residuals, squared residuals, and RSS
>>> residuals = y_model - y_data
>>> len(residuals) == len(y_data)
True
>>> print( np.sum(residuals) )   # residuals of a least-squares fit sum to zero
0.0
>>> residuals_squared = np.square(y_model - y_data)
>>> print( np.sum(residuals_squared) )
65.1
>>> RSS = np.sum(residuals_squared)
>>> print( RSS )
65.1
- Least-Squares Optimization
- Minima of RSS
- a1 = covariance(x, y) / variance(x)
- a0 = mean(y) − a1 * mean(x)
- Optimized by Numpy
>>> x_mean = np.sum(x) / len(x)
>>> y_mean = np.sum(y) / len(y)
>>> x_dev = x - x_mean
>>> y_dev = y - y_mean
>>> a1 = np.sum( x_dev * y_dev ) / np.sum( x_dev**2 )
>>> a0 = y_mean - (a1 * x_mean)
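As a quick cross-check (not in the original notes), np.polyfit with degree 1 should recover the same slope and intercept from the same x and y arrays:
>>> a1_check, a0_check = np.polyfit(x, y, deg=1)   # degree-1 fit returns [slope, intercept]
>>> print( np.isclose(a1, a1_check), np.isclose(a0, a0_check) )
True True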
- Optimized by Scipy
>>> from scipy import optimize
>>> x_data, y_data = load_data()
>>> def model_func(x, a0, a1):
...     return a0 + (a1 * x)
>>> param_opt, param_cov = optimize.curve_fit(model_func, x_data, y_data)
>>> a0 = param_opt[0] # a0 is the intercept in y = a0 + a1*x
>>> a1 = param_opt[1] # a1 is the slope in y = a0 + a1*x
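The covariance matrix returned by curve_fit can also give rough parameter uncertainties; its diagonal holds the parameter variances:
>>> import numpy as np
>>> param_err = np.sqrt(np.diag(param_cov))   # one-standard-deviation errors for [a0, a1]
>>> e0, e1 = param_err[0], param_err[1]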
- Optimized by Statsmodels
>>> from statsmodels.formula.api import ols
>>> x_data, y_data = load_data()
>>> import pandas as pd
>>> df = pd.DataFrame(dict(x_name=x_data, y_name=y_data))
>>> model_fit = ols(formula="y_name ~ x_name", data=df).fit()
>>> y_model = model_fit.predict(df)   # predict at the x values in df
>>> x_model = x_data
>>> a0 = model_fit.params['Intercept']
>>> a1 = model_fit.params['x_name']
Next we will apply models to real data and make predictions. We will explore some of the most common pitfalls and limitations of predictions, and evaluate and compare models by quantifying and contrasting several measures of goodness-of-fit, including RMSE and R-squared.
- Optimized by Scikit-Learn
from sklearn.linear_model import LinearRegression
# Initialize a general model
model = LinearRegression(fit_intercept=True)
# Load and shape the data
x_raw, y_raw = load_data()
x_data = x_raw.reshape(len(x_raw), 1)
y_data = y_raw.reshape(len(y_raw), 1)
# Fit the model to the data
model_fit = model.fit(x_data, y_data)
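To read off the fitted parameters and make predictions with the scikit-learn model above (a short sketch; the x_future values are made up):
# Extract the intercept and slope from the fitted model
a0 = model_fit.intercept_[0]
a1 = model_fit.coef_[0, 0]
# Predict at new x values (note: must be 2-D, like x_data)
import numpy as np
x_future = np.array([[4.0], [5.0]])
y_future = model_fit.predict(x_future)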
- Predictions, Parameters, Uncertainty
a0 = model_fit.params['Intercept']   # intercept from the statsmodels fit
a1 = model_fit.params['times']       # slope (the x variable is named 'times' here)
e0 = model_fit.bse['Intercept']      # standard error of the intercept
e1 = model_fit.bse['times']          # standard error of the slope
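A rough sketch of extrapolating with these values; x_future is made up, and the error estimate below ignores the covariance between a0 and a1:
import numpy as np
# Predict at a future time (extrapolation)
x_future = 10.0
y_future = a0 + a1 * x_future
# Naive propagated uncertainty from the parameter standard errors
y_future_err = np.sqrt(e0**2 + (x_future * e1)**2)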
- Limits of prediction, Domain of Validity
- Goodness of Fit -- 3 different R's
- Building Models
- RSS
- Evaluating Models
- RMSE
- R-squared
- Computing RSS, RMSE, and R-squared
residuals = y_model - y_data
RSS = np.sum( np.square(residuals) )
mean_squared_residuals = np.sum( np.square(residuals) ) / len(residuals)
MSE = np.mean( np.square(residuals) )
RMSE = np.sqrt(np.mean( np.square(residuals)))
RMSE = np.std(residuals)   # same value when the residuals have zero mean (true for a least-squares fit)
# Deviations
deviations = np.mean(y_data) - y_data
VAR = np.sum(np.square(deviations))   # total sum of squared deviations of the data about its mean
# R-squared
r_squared = 1 - (RSS / VAR)
r = np.corrcoef(y_data, y_model)[0, 1]   # Pearson correlation; r**2 matches r_squared for a least-squares linear fit
- RMSE vs R-Squared
- RMSE -- how much variation remains in the residuals, in the units of the data
- R-squared -- what fraction of the total variation is explained by the linear model (dimensionless)