Introduction to Linear Modeling in Python

Lecturer

Jason Vestuto, Data Scientist, University of Texas at Austin

Exploring Linear Trends

We start the course with an initial exploration of linear relationships, including some motivating examples of how linear models are used, and demonstrations of data visualization methods from matplotlib. We then use descriptive statistics to quantify the shape of our data and use correlation to quantify the strength of linear relationships between two variables.

  • Models are used to make predictions for future performance (extrapolation)
  • Models can interpolate the data
  • Will use "matplotlib.pyplot" package to visualize the models
  • Linear models use the slope-intercept form of the linear equation
    • slope: dy / dx (rise-over-run)
    • intercept: when x=0, y=?
  • Review of Single Variable Statistics
    • Mean: sum(x) / len(x)
    • Deviation: dx = x - mean(x)
    • Variance: np.mean(dx * dx)
    • Standard Deviation: np.sqrt(variance)
  • Quantifying Linear Relationships
    • Covariance
      • Deviation of two variables: dx, dy
      • Co-vary means to vary together: deviation_products = dx * dy
      • Covariance as the mean: covariance = np.mean(dx*dy)
    • Correlation
      • divide deviations by standard deviation:
        • zx = dx/np.std(x)
        • zy = dy/np.std(y)
      • mean of the normalized deviations: correlation = np.mean(zx*zy)
    • Normalization helps to compare (correlate) signals of different magnitudes
    • Correlation has magnitude and direction: positive (> 0), negative (< 0), or zero -- see the sketch below
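A minimal sketch of the quantities above (the arrays x and y are illustrative, not the course data):

import numpy as np

# Illustrative data: y increases roughly linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Deviations from the mean
dx = x - np.mean(x)
dy = y - np.mean(y)

# Covariance: mean of the deviation products
covariance = np.mean(dx * dy)

# Correlation: normalize the deviations by the standard deviations first
zx = dx / np.std(x)
zy = dy / np.std(y)
correlation = np.mean(zx * zy)

print(covariance, correlation)  # correlation is close to +1 for this data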

Building Linear Models

Here we look at the parts that go into building a linear model. Using the concept of a Taylor Series, we focus on the parameters slope and intercept, how they define the model, and how to interpret them in several applied contexts. We apply a variety of Python modules to find the model that best fits the data by computing the optimal values of slope and intercept, using least squares with numpy, statsmodels, and scikit-learn.

  • Taylor Series
    • Approximates any curve
    • Polynomial form: y = a0 + a1*x + a2*x^2 + a3*x^3 + ... + an*x^n
    • often, first order is enough: y = a0 + a1*x
  • First order with a1=0: flat line
  • First order with a1=1: diagonal line
  • Second order with a2=1: parabolic (quadratic) curve
  • Overfitting is a problem -- predictions beyond the training data can be significantly incorrect, as in the sketch below
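A hedged sketch of that overfitting point, using synthetic data and numpy's polyfit (both are assumptions here, not the course code): a high-order polynomial matches the training points but extrapolates badly, while the first-order fit stays near the trend.

import numpy as np

# Synthetic, roughly linear training data (illustrative only)
rng = np.random.default_rng(0)
x_train = np.linspace(0, 5, 10)
y_train = 2.0 * x_train + 1.0 + rng.normal(0, 0.5, size=x_train.shape)

# First-order vs. high-order polynomial fits
fit_1 = np.polyfit(x_train, y_train, deg=1)
fit_9 = np.polyfit(x_train, y_train, deg=9)

# Extrapolate beyond the training range (x_train stops at 5)
x_new = 10.0
print(np.polyval(fit_1, x_new))  # stays near the underlying linear trend (~21)
print(np.polyval(fit_9, x_new))  # typically far from the trend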
  • Interpreting Slope and Intercept
    • y = a0 + a1*x
    • x = independent variable, e.g. time
    • y = dependent variable, e.g. distance traveled
    • slope = average rate of change, e.g. average speed (distance traveled per unit time)
  • Residuals, squared, variations
>>> residuals = y_model - y_data
>>> len(residuals) == len(y_data)
  True
>>> print( np.sum(residuals) )
  0.0
>>> residuals_squared = np.square(y_model - y_data)
>>> print( np.sum(residuals_squared) )
  65.1
>>> RSS = np.sum(residuals_squared)
>>> print( RSS )
  65.1
  • Least-Squares Optimization
    • Minima of RSS
    • a1 = covariance(x, y) / variance(x)
    • a0 = mean(y) − a1 * mean(x)
  • Optimized by Numpy
>>> x_mean = np.sum(x) / len(x)
>>> y_mean = np.sum(y) / len(y)
>>> x_dev = x - x_mean
>>> y_dev = y - y_mean
>>> a1 = np.sum( x_dev * y_dev ) / np.sum( x_dev**2 )
>>> a0 = y_mean - (a1 * x_mean)
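As a cross-check (an addition, not from the course notes), numpy's polyfit with degree 1 should recover approximately the same parameters:

>>> a1_check, a0_check = np.polyfit(x, y, deg=1)  # deg=1 returns [slope, intercept]
>>> # a1_check should agree with a1, and a0_check with a0, up to rounding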
  • Optimized by Scipy
>>> from scipy import optimize
>>> x_data, y_data = load_data()
>>> def model_func(x, a0, a1):
...     return a0 + (a1 * x)
>>> param_opt, param_cov = optimize.curve_fit(model_func, x_data, y_data)
>>> a0 = param_opt[0] # a0 is the intercept in y = a0 + a1*x
>>> a1 = param_opt[1] # a1 is the slope in y = a0 + a1*x
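A common follow-up (an assumption here, not shown in the notes): the diagonal of the returned covariance matrix gives the parameter variances, so their square roots are standard errors.

>>> param_err = np.sqrt(np.diag(param_cov))  # assumes numpy is imported as np
>>> e0, e1 = param_err[0], param_err[1]      # standard errors of a0 and a1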
  • Optimized by Statsmodels
>>> import pandas as pd
>>> from statsmodels.formula.api import ols
>>> x_data, y_data = load_data()
>>> df = pd.DataFrame(dict(x_name=x_data, y_name=y_data))
>>> model_fit = ols(formula="y_name ~ x_name", data=df).fit()
>>> y_model = model_fit.predict(df)
>>> x_model = x_data
>>> a0 = model_fit.params['Intercept']
>>> a1 = model_fit.params['x_name']
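Note on the formula interface: in the string "y_name ~ x_name", the name on the left of the tilde is the response column and the name on the right is the predictor column, so they must match the column names of df; the fitted intercept and slope are then read from model_fit.params as shown above.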

Making Model Predictions

Next we apply models to real data and make predictions. We explore some of the most common pitfalls and limitations of predictions, and we evaluate and compare models by quantifying and contrasting several measures of goodness of fit, including RMSE and R-squared.

  • Optimized by Scikit-Learn
from sklearn.linear_model import LinearRegression
 
# Initialize a general model
model = LinearRegression(fit_intercept=True)

# Load and shape the data
x_raw, y_raw = load_data()
x_data = x_raw.reshape(len(y_raw), 1)
y_data = y_raw.reshape(len(y_raw), 1)

# Fit the model to the data
model_fit = model.fit(x_data, y_data)
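A short follow-up sketch (the attribute names below are scikit-learn's; the reshaped arrays are the ones from the block above): predictions come from predict(), and the fitted parameters from intercept_ and coef_.

# Predict on the training inputs and read off the fitted parameters
y_model = model.predict(x_data)
a0 = model.intercept_[0]   # intercept (arrays because y_data is 2-dimensional)
a1 = model.coef_[0, 0]     # slope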
  • Predictions, Parameters, Uncertainty (note: params and bse below come from a statsmodels ols fit, not the sklearn model above)
a0 = model_fit.params['Intercept']
a1 = model_fit.params['times']

e0 = model_fit.bse['Intercept']
e1 = model_fit.bse['times']
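A minimal sketch of putting these values to use (the time values below are illustrative, and the one-sigma band is a rough bracket, not a formal prediction interval):

# Central prediction plus a rough uncertainty band from the standard errors
t = np.linspace(0, 10, 50)            # illustrative time values
y_pred = a0 + a1 * t                  # central model prediction
y_lo = (a0 - e0) + (a1 - e1) * t      # crude lower bound
y_hi = (a0 + e0) + (a1 + e1) * t      # crude upper bound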
  • Limits of prediction, Domain of Validity
  • Goodness of Fit -- 3 different R's
    • Building Models
      • RSS
    • Evaluating Models
      • RMSE
      • R-squared
residuals = y_model - y_data
RSS = np.sum( np.square(residuals) )

mean_squared_residuals = np.sum( np.square(residuals) ) / len(residuals) 

MSE = np.mean( np.square(residuals) )

RMSE = np.sqrt(np.mean( np.square(residuals)))
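# np.std(residuals) equals the RMSE here only because least-squares residuals have (near-)zero mean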
RMSE = np.std(residuals)

# Deviations
deviations = y_data - np.mean(y_data)
VAR = np.sum(np.square(deviations))

# R-squared
r_squared = 1 - (RSS / VAR)
r = correlation(y_data, y_model)
  • RMSE vs R-Squared
    • RMSE -- how much variation is residual
    • R-squared -- what fraction of variation is linear
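A closing sketch tying the measures together (synthetic data and np.polyfit are assumptions here): for a least-squares linear fit, R-squared equals the square of the correlation between the data and the model values.

import numpy as np

def correlation(x, y):
    # Mean of the normalized deviation products, as defined in the first chapter
    zx = (x - np.mean(x)) / np.std(x)
    zy = (y - np.mean(y)) / np.std(y)
    return np.mean(zx * zy)

# Illustrative data and a least-squares line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])
a1, a0 = np.polyfit(x, y, deg=1)
y_model = a0 + a1 * x

residuals = y_model - y
RSS = np.sum(np.square(residuals))
deviations = y - np.mean(y)
VAR = np.sum(np.square(deviations))

r_squared = 1 - (RSS / VAR)
r = correlation(y, y_model)
print(np.isclose(r_squared, r**2))  # True: the two views agree for a linear fit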