Simple regression with interaction terms to measure effect of a regime change on the predictors. Implementation of https://stats.stackexchange.com/a/99432/8451
#' ---
#' title: "Regression for quantifying a regime change"
#' author: "David Marx"
#' date: "June 5, 2017"
#' output: html_document
#' ---
#' There are two time points of interest, and we want to test the hypothesis that the regression
#' coefficients changed after each of them. We will accomplish this by introducing
#' dummy variables that denote whether we are before or after a particular change point. This approach
#' is based on a suggestion I stumbled across on Cross Validated: https://stats.stackexchange.com/a/99432/8451
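#'
#' Concretely, if $D_1$ and $D_2$ are indicators for being past the first and second change point,
#' the model fit below expands to
#' $$y = \beta_0 + \beta_1 x + \gamma_1 D_1 + \gamma_2 D_2 + \delta_1 x D_1 + \delta_2 x D_2 + \varepsilon,$$
#' where $\delta_1$ and $\delta_2$ measure how much the slope changes after each change point.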
#' We start by defining simulation parameters
a=10 # Main intercept
b1=3 # Main slope
b2=5 # Slope after first regime change
b3=2 # Slope after second regime change
s=20 # Error standard deviation
n=150 # Number of observations
n1=floor(n/3) # First change point
n2=2*n1 # Second change point
set.seed(111)
#' Generate data: `y` has no regime change, `y2` has one, and `y3` has two
r1 = 1:n1       # Observations before the first change point
r2 = (n1+1):n   # Observations after the first change point
r3 = (n2+1):n   # Observations after the second change point
x = 1:n
# No regime change: a single slope throughout
y = a + b1*x + rnorm(length(x), 0, s)
# One regime change: after the first change point, continue from y[n1] with slope b2
y2 = y
offset = y[r2[1]-1]
y2[r2] = a + b2*x[seq_along(r2)] + rnorm(length(r2), 0, s) + offset
# Two regime changes: after the second change point, continue from y2[n2] with slope b3
y3 = y2
offset = y2[r3[1]-1]
y3[r3] = a + b3*x[seq_along(r3)] + rnorm(length(r3), 0, s) + offset
# Dummy variables indicating whether each observation falls after a change point
df0 = data.frame(x=x, change_point_1=1*(x>n1), change_point_2=1*(x>n2))
df1 = df0; df2 = df0; df3 = df0
df1$y = y   # No regime change
df2$y = y2  # One regime change
df3$y = y3  # Two regime changes
#' ## No regime change
#' Let's start by running a simulation where our hypothesis is wrong: there was no change to
#' the model after either of the time points of interest.
# Raw data, colored by regime
with(df1, plot(x,y, col=1+change_point_1 + change_point_2)); lines(y~x, df1)
mod_1 = lm(y~x*change_point_1 + x*change_point_2, df1)
df1$y_pred = predict(mod_1)
# Raw data again, with the fitted line drawn separately for each regime
with(df1, plot(x,y, col=1+change_point_1 + change_point_2))
with(df1[1:n1,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df1[(n1+1):n2,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df1[(n2+1):n,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
knitr::kable(summary(mod_1)$coefficients[, c('Estimate','Std. Error', 'Pr(>|t|)')])
#' We're essentially only interested in coefficients with 'x' in them.
#'
#' The coefficient for x is correctly estimated close to 3 and has a p-value << 0.01.
#' The coefficients for the interactions between x and the change point dummy variables
#' are both close to zero, with p-values > 0.01.
#'
#' The interpretation of these results is that the change points had no measurable effect on the
#' slope of the predictor, which the model estimates as $3.3$. The true value is 3, so the model
#' did a reasonably good job considering the high variance used to generate the data.
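#'
#' As a quick sanity check, we can pull out just those coefficients by subsetting the summary
#' table by name (a minimal sketch; the object name `coefs_1` is just for illustration):
coefs_1 = summary(mod_1)$coefficients
coefs_1[grepl('x', rownames(coefs_1)), c('Estimate', 'Pr(>|t|)')]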
#+eval=FALSE, echo=FALSE
0 # Need a code block for h2 to interpret correctly
#' ## One regime change
#' Let's modify the simulation by changing the slope after the first regime change but not
#' the second and see what that looks like.
with(df2, plot(x,y, col=1+change_point_1 + change_point_2)); lines(y~x, df2) # 1 regime shift
mod_2 = lm(y~x*change_point_1 + x*change_point_2, df2)
df2$y_pred = predict(mod_2)
with(df2, plot(x,y, col=1+change_point_1 + change_point_2))
with(df2[1:n1,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df2[(n1+1):n2,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df2[(n2+1):n,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
knitr::kable(summary(mod_2)$coefficients[, c('Estimate','Std. Error', 'Pr(>|t|)')])
#' The dummy variables essentially have the effect of fitting the data separately before and after each
#' time point, so the intercept and 'x' coefficients are unchanged: they are fit to literally
#' the same data (the first 50 observations) as in the previous model.
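#'
#' We can see this directly by fitting the pre-change observations on their own and comparing
#' the estimates (a small check; `mod_pre` is just an illustrative name):
mod_pre = lm(y~x, df2[1:n1,])
rbind(separate_fit = coef(mod_pre), full_model = coef(mod_2)[c('(Intercept)', 'x')])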
#'
#' As before, the coefficient for x:change_point_2 is close to 0 and has a high p-value, but
#' the coefficient for x:change_point_1 is estimated as 1.7 (p << 0.01) with a standard error of 0.28. Interestingly,
#' the standard errors for x:change_point_1 and x:change_point_2 are the same (which also happened
#' in the 'no regime change' simulation), although one was significant and the other was not.
#' The model has determined that the first change point had the effect of increasing the slope by 1.7.
#' The true value for the difference in slopes is 2, which is captured by a two-standard-error margin about
#' our estimate (i.e. an approximate 95% CI).
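#'
#' We can make that interval explicit with `confint()` (using the default t-based interval
#' rather than the rough two-standard-error rule):
confint(mod_2, 'x:change_point_1', level = 0.95)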
#+eval=FALSE, echo=FALSE
0 # Need a code block for h2 to interpret correctly
#' ## Two regime changes
#' Finally, let's see what happens when the slope changes after both time points of interest. In the last
#' simulation the slope increased, so this time let's reduce the slope after the second time point.
with(df3, plot(x,y, col=1+change_point_1 + change_point_2)); lines(y~x, df3) # 2 regime shifts
mod_3 = lm(y~x*change_point_1 + x*change_point_2, df3)
df3$y_pred = predict(mod_3)
with(df3, plot(x,y, col=1+change_point_1 + change_point_2))
with(df3[1:n1,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df3[(n1+1):n2,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df3[(n2+1):n,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
knitr::kable(summary(mod_3)$coefficients[, c('Estimate','Std. Error', 'Pr(>|t|)')])
#' The coefficients for `x` and `x:change_point_1` are exactly as before. But now, x:change_point_2
#' is estimated as having an effect of -2.85 (p<<0.01). As in the earlier models, the standard errors for our
#' interaction terms are tied together, and are estimated to be about 0.28. The true value for the difference in
#' slope after the second change point is -3, so the model did a pretty good job here as well.
#'
#' Rather than looking at the effect on the slope, we can use the estimated coefficients to get estimates of
#' the slope in each of our time periods of interest. Prior to any of the change points, the estimated slope
#' is just the estimated coefficient for 'x', which was 3.3. After the first change point, we add 1.7 to get 5.0.
#' Finally, after the second change point we subtract 2.9 to get 2.1. Our estimated slopes are then 3.3, 5.0, and 2.1,
#' respectively. The true values are 3, 5, and 2, so the model did a pretty good job here.
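#'
#' As a small sketch of that arithmetic, the per-period slopes can be assembled directly from
#' the fitted coefficients:
cf = coef(mod_3)
c(before_changes = unname(cf['x']),
  after_first    = unname(cf['x'] + cf['x:change_point_1']),
  after_second   = unname(cf['x'] + cf['x:change_point_1'] + cf['x:change_point_2']))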