Jason Vestuto, Data Scientist, University of Texas at Austin
We start the course with an initial exploration of linear relationships, including some motivating examples of how linear models are used, and demonstrations of data visualization methods from matplotlib. We then use descriptive statistics to quantify the shape of our data and use correlation to quantify the strength of linear relationships between two variables.
- Models are used to make predictions for future performance (extrapolation)
- Models can interpolate the data
- Will use "matplotlib.pyplot" package to visualize the models
- Linear models use the slope-intercept form of the linear equation
- slope: dy / dx (rise-over-run)
- intercept: the value of y when x = 0
- Review of Single Variable Statistics
Name | Formula |
---|---|
Mean | sum(x) / len(x) |
Deviation | x - mean(x) |
Variance | np.mean(dx * dx) |
Standard Deviation | np.sqrt(variance) |
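A minimal sketch of the table above, using a small made-up array x:
>>> import numpy as np
>>> x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> mean = np.sum(x) / len(x)      # same value as np.mean(x)
>>> dx = x - mean                  # deviations from the mean
>>> variance = np.mean(dx * dx)    # same value as np.var(x)
>>> stddev = np.sqrt(variance)     # same value as np.std(x)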
- Quantifying Linear Relationships
- Covariance
- Deviation of two variables: dx, dy
- Co-vary means to vary together: deviation_products = dx * dy
- Covariance as the mean: covariance = np.mean(dx*dy)
- Correlation
- divide deviations by standard deviation:
- zx = dx/np.std(x)
- zy = dy/np.std(y)
- mean of the normalized deviations: correlation = np.mean(zx*zy)
- Normalization helps to compare (correlate) signals of different magnitudes
- Magnitude and direction of correlation: the sign (positive, negative, or zero) gives the direction, the magnitude gives the strength (see the sketch below)
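A short sketch of the covariance and correlation recipes above, with small made-up arrays x and y; np.corrcoef is included only as a cross-check:
>>> import numpy as np
>>> x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
>>> dx, dy = x - np.mean(x), y - np.mean(y)
>>> covariance = np.mean(dx * dy)
>>> zx, zy = dx / np.std(x), dy / np.std(y)
>>> correlation = np.mean(zx * zy)
>>> print( np.isclose(correlation, np.corrcoef(x, y)[0, 1]) )
True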
Here we look at the parts that go into building a linear model. Using the concept of a Taylor Series, we focus on the parameters slope and intercept, how they define the model, and how to interpret them in several applied contexts. We apply a variety of Python modules to find the model that best fits the data by computing the optimal values of slope and intercept, using least-squares, numpy, statsmodels, and scikit-learn.
- Taylor Series
- Approximates any curve
- Polynomial form: y = a0 + a1*x + a2*x^2 + a3*x^3 + ... + an*x^n
- often, first order is enough: y = a0 + a1*x
- First order with a1=0: flat line
- First order with a1=1: diagonal line
- Second order with a2=1: parabolic (quadratic) curve
- Overfitting is a problem -- predictions beyond the training data can be significantly incorrect (see the sketch below)
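A small sketch of why "first order is often enough" near a point, using np.sin as the curve being approximated around x=0 (the values are made up for illustration):
>>> import numpy as np
>>> x = np.linspace(-0.5, 0.5, 5)
>>> y_curve = np.sin(x)                 # the curve being approximated
>>> y_linear = 0.0 + 1.0 * x            # first-order Taylor terms: a0=0, a1=1
>>> print( np.max(np.abs(y_curve - y_linear)) < 0.03 )   # close near x=0 ...
True
>>> print( np.abs(np.sin(3.0) - 3.0) < 0.03 )            # ... but not far away (extrapolation)
False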
- Interpreting Slope and Intercept
- y = a0 + a1*x
- x = independent variable, e.g. time
- y = dependent variable, e.g. distance traveled
- slope as an average rate of change, e.g. average speed (see the example below)
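For example (a sketch with made-up times in hours and distances in miles):
>>> import numpy as np
>>> times = np.array([0.0, 1.0, 2.0, 3.0])            # hours (independent variable)
>>> distances = np.array([10.0, 60.0, 110.0, 160.0])  # miles (dependent variable)
>>> a1 = (distances[-1] - distances[0]) / (times[-1] - times[0])   # slope = rise-over-run = average speed
>>> a0 = distances[0]                                  # intercept = distance at time = 0
>>> print(a1, a0)
50.0 10.0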
- Residuals, squared residuals, and RSS
>>> residuals = y_model - y_data
>>> len(residuals) == len(y_data)
True
>>> print( np.sum(residuals) )   # residuals of a least-squares fit sum to zero
0.0
>>> residuals_squared = np.square(y_model - y_data)
>>> print( np.sum(residuals_squared) )
65.1
>>> RSS = np.sum(residuals_squared)
>>> print( RSS )
65.1
- Least-Squares Optimization
- Minima of RSS
- a1 = covariance(x, y) / variance(x)
- a0 = mean(y) − a1 * mean(x)
- Optimized by Numpy
>>> x_mean = np.sum(x) / len(x)
>>> y_mean = np.sum(y) / len(y)
>>> x_dev = x - x_mean
>>> y_dev = y - y_mean
>>> a1 = np.sum( x_dev * y_dev ) / np.sum( x_dev**2 )
>>> a0 = y_mean - (a1 * x_mean)
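As a quick cross-check (not in the original notes), np.polyfit with degree 1 should recover the same slope and intercept from the same x and y arrays:
>>> a1_check, a0_check = np.polyfit(x, y, deg=1)   # degree-1 fit returns [slope, intercept]
>>> print( np.isclose(a1, a1_check), np.isclose(a0, a0_check) )
True True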
- Optimized by Scipy
>>> from scipy import optimize
>>> x_data, y_data = load_data()
>>> def model_func(x, a0, a1):
...     return a0 + (a1 * x)
>>> param_opt, param_cov = optimize.curve_fit(model_func, x_data, y_data)
>>> a0 = param_opt[0] # a0 is the intercept in y = a0 + a1*x
>>> a1 = param_opt[1] # a1 is the slope in y = a0 + a1*x
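The covariance matrix returned by curve_fit can also give rough parameter uncertainties; its diagonal holds the parameter variances:
>>> import numpy as np
>>> param_err = np.sqrt(np.diag(param_cov))   # one-standard-deviation errors for [a0, a1]
>>> e0, e1 = param_err[0], param_err[1]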
- Optimized by Statsmodels
>>> from statsmodels.formula.api import ols
>>> x_data, y_data = load_data()
>>> import pandas as pd
>>> df = pd.DataFrame(dict(x_name=x_data, y_name=y_data))
>>> model_fit = ols(formula="y_name ~ x_name", data=df).fit()
>>> y_model = model_fit.predict(df)   # predict at the x values in df
>>> x_model = x_data
>>> a0 = model_fit.params['Intercept']
>>> a1 = model_fit.params['x_name']
Next we will apply models to real data and make predictions. We will explore some of the most common pitfalls and limitations of predictions, and evaluate and compare models by quantifying and contrasting several measures of goodness-of-fit, including RMSE and R-squared.
- Optimized by Scikit-Learn
from sklearn.linear_model import LinearRegression
# Initialize a general model
model = LinearRegression(fit_intercept=True)
# Load and shape the data
x_raw, y_raw = load_data()
x_data = x_raw.reshape(len(x_raw), 1)
y_data = y_raw.reshape(len(y_raw), 1)
# Fit the model to the data
model_fit = model.fit(x_data, y_data)
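To read off the fitted parameters and make predictions with the scikit-learn model above (a short sketch; the x_future values are made up):
# Extract the intercept and slope from the fitted model
a0 = model_fit.intercept_[0]
a1 = model_fit.coef_[0, 0]
# Predict at new x values (note: must be 2-D, like x_data)
import numpy as np
x_future = np.array([[4.0], [5.0]])
y_future = model_fit.predict(x_future)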
- Predictions, Parameters, Uncertainty
a0 = model_fit.params['Intercept']   # intercept from the statsmodels fit
a1 = model_fit.params['times']       # slope (the x variable is named 'times' here)
e0 = model_fit.bse['Intercept']      # standard error of the intercept
e1 = model_fit.bse['times']          # standard error of the slope
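A rough sketch of extrapolating with these values; x_future is made up, and the error estimate below ignores the covariance between a0 and a1:
import numpy as np
# Predict at a future time (extrapolation)
x_future = 10.0
y_future = a0 + a1 * x_future
# Naive propagated uncertainty from the parameter standard errors
y_future_err = np.sqrt(e0**2 + (x_future * e1)**2)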
- Limits of prediction, Domain of Validity
- Goodness of Fit -- 3 different R's
- Building Models
- RSS
- Evaluating Models
- RMSE
- R-squared
- Computing RSS, RMSE, and R-squared
residuals = y_model - y_data
RSS = np.sum( np.square(residuals) )
mean_squared_residuals = np.sum( np.square(residuals) ) / len(residuals)
MSE = np.mean( np.square(residuals) )
RMSE = np.sqrt(np.mean( np.square(residuals)))
RMSE = np.std(residuals)   # same value when the residuals have zero mean (true for a least-squares fit)
# Deviations
deviations = np.mean(y_data) - y_data
VAR = np.sum(np.square(deviations))   # total sum of squared deviations of the data about its mean
# R-squared
r_squared = 1 - (RSS / VAR)
r = np.corrcoef(y_data, y_model)[0, 1]   # Pearson correlation; r**2 matches r_squared for a least-squares linear fit
- RMSE vs R-Squared
- RMSE -- how much variation remains in the residuals, in the units of the data
- R-squared -- what fraction of the total variation is explained by the linear model (dimensionless)