Simple regression with interaction terms to measure effect of a regime change on the predictors. Implementation of https://stats.stackexchange.com/a/99432/8451
#' ---
#' title: "Regression for quantifying a regime change"
#' author: "David Marx"
#' date: "June 5, 2017"
#' output: html_document
#' ---
#' There are two time points of interest, and we want to test the hypothesis that the regression
#' coefficients changed after each of them. We will accomplish this by introducing
#' dummy variables that denote whether we are before or after a particular change point. This approach
#' is based on a suggestion I stumbled across on Cross Validated: https://stats.stackexchange.com/a/99432/8451
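#'
#' Concretely, if $D_1$ and $D_2$ are indicators for being past the first and second change point,
#' the model fit below expands to
#' $$y = \beta_0 + \beta_1 x + \gamma_1 D_1 + \gamma_2 D_2 + \delta_1 x D_1 + \delta_2 x D_2 + \varepsilon,$$
#' where $\delta_1$ and $\delta_2$ measure how much the slope changes after each change point.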
#' We start by defining simulation parameters
a=10 # Main intercept
b1=3 # Main slope
b2=5 # Slope after first regime change
b3=2 # Slope after second regime change
s=20 # Error standard deviation
n=150 # Number of observations
n1=floor(n/3) # First change point
n2=2*n1 # Second change point
set.seed(111)
#' Generate data: `y` has no regime change, `y2` has one, and `y3` has two
r1 = 1:n1       # Observations before the first change point
r2 = (n1+1):n   # Observations after the first change point
r3 = (n2+1):n   # Observations after the second change point
x = 1:n
# No regime change: a single slope throughout
y = a + b1*x + rnorm(length(x), 0, s)
# One regime change: after the first change point, continue from y[n1] with slope b2
y2 = y
offset = y[r2[1]-1]
y2[r2] = a + b2*x[seq_along(r2)] + rnorm(length(r2), 0, s) + offset
# Two regime changes: after the second change point, continue from y2[n2] with slope b3
y3 = y2
offset = y2[r3[1]-1]
y3[r3] = a + b3*x[seq_along(r3)] + rnorm(length(r3), 0, s) + offset
# Dummy variables indicating whether each observation falls after a change point
df0 = data.frame(x=x, change_point_1=1*(x>n1), change_point_2=1*(x>n2))
df1 = df0; df2 = df0; df3 = df0
df1$y = y   # No regime change
df2$y = y2  # One regime change
df3$y = y3  # Two regime changes
#' ## No regime change
#' Let's start by running a simulation where our hypothesis is wrong: there was no change to
#' the model after either of the time points of interest.
# Raw data, colored by regime
with(df1, plot(x,y, col=1+change_point_1 + change_point_2)); lines(y~x, df1)
mod_1 = lm(y~x*change_point_1 + x*change_point_2, df1)
df1$y_pred = predict(mod_1)
# Raw data again, with the fitted line drawn separately for each regime
with(df1, plot(x,y, col=1+change_point_1 + change_point_2))
with(df1[1:n1,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df1[(n1+1):n2,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df1[(n2+1):n,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
knitr::kable(summary(mod_1)$coefficients[, c('Estimate','Std. Error', 'Pr(>|t|)')])
#' We're essentially only interested in coefficients with 'x' in them.
#'
#' The coefficient for x is correctly estimated close to 3 and has a p-value << 0.01.
#' The coefficients for the interactions between x and the change point dummy variables
#' are both close to zero, with p-values > 0.01.
#'
#' The interpretation of these results is that the change points had no measurable effect on the
#' slope of the predictor, which the model estimates as $3.3$. The true value is 3, so the model
#' did a reasonably good job considering the high variance used to generate the data.
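#'
#' As a quick sanity check, we can pull out just those coefficients by subsetting the summary
#' table by name (a minimal sketch; the object name `coefs_1` is just for illustration):
coefs_1 = summary(mod_1)$coefficients
coefs_1[grepl('x', rownames(coefs_1)), c('Estimate', 'Pr(>|t|)')]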
#+eval=FALSE, echo=FALSE
0 # Need a code block for h2 to interpret correctly
#' ## One regime change
#' Let's modify the simulation by changing the slope after the first regime change but not
#' the second and see what that looks like.
with(df2, plot(x,y, col=1+change_point_1 + change_point_2)); lines(y~x, df2) # 1 regime shift
mod_2 = lm(y~x*change_point_1 + x*change_point_2, df2)
df2$y_pred = predict(mod_2)
with(df2, plot(x,y, col=1+change_point_1 + change_point_2))
with(df2[1:n1,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df2[(n1+1):n2,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df2[(n2+1):n,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
knitr::kable(summary(mod_2)$coefficients[, c('Estimate','Std. Error', 'Pr(>|t|)')])
#' The dummy variables essentially have the effect of fitting the data separately before and after each
#' time point, so the intercept and 'x' coefficients are unchanged: they are fit to literally
#' the same data (the first 50 observations) as in the previous model.
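#'
#' We can see this directly by fitting the pre-change observations on their own and comparing
#' the estimates (a small check; `mod_pre` is just an illustrative name):
mod_pre = lm(y~x, df2[1:n1,])
rbind(separate_fit = coef(mod_pre), full_model = coef(mod_2)[c('(Intercept)', 'x')])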
#'
#' As before, the coefficient for x:change_point_2 is close to 0 and has a high p-value, but
#' the coefficient for x:change_point_1 is estimated as 1.7 (p << 0.01) with a standard error of 0.28. Interestingly,
#' the standard errors for x:change_point_1 and x:change_point_2 are the same (which also happened
#' in the 'no regime change' simulation), although one was significant and the other was not.
#' The model has determined that the first change point had the effect of increasing the slope by 1.7.
#' The true value for the difference in slopes is 2, which is captured by a two-standard-error margin about
#' our estimate (i.e. an approximate 95% CI).
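#'
#' We can make that interval explicit with `confint()` (using the default t-based interval
#' rather than the rough two-standard-error rule):
confint(mod_2, 'x:change_point_1', level = 0.95)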
#+eval=FALSE, echo=FALSE
0 # Need a code block for h2 to interpret correctly
#' ## Two regime changes
#' Finally, let's see what happens when the slope changes after both time points of interest. In the last
#' simulation the slope increased, so this time let's reduce the slope after the second time point.
with(df3, plot(x,y, col=1+change_point_1 + change_point_2)); lines(y~x, df3) # 2 regime shifts
mod_3 = lm(y~x*change_point_1 + x*change_point_2, df3)
df3$y_pred = predict(mod_3)
with(df3, plot(x,y, col=1+change_point_1 + change_point_2))
with(df3[1:n1,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df3[(n1+1):n2,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df3[(n2+1):n,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
knitr::kable(summary(mod_3)$coefficients[, c('Estimate','Std. Error', 'Pr(>|t|)')])
#' The coefficients for `x` and `x:change_point_1` are exactly as before. But now, x:change_point_2
#' is estimated as having an effect of -2.85 (p<<0.01). As in the earlier models, the standard errors for our
#' interaction terms are tied together, and are estimated to be about 0.28. The true value for the difference in
#' slope after the second change point is -3, so the model did a pretty good job here as well.
#'
#' Rather than looking at the effect on the slope, we can use the estimated coefficients to get estimates of
#' the slope in each of our time periods of interest. Prior to any of the change points, the estimated slope
#' is just the estimated coefficient for 'x', which was 3.3. After the first change point, we add 1.7 to get 5.0.
#' Finally, after the second change point we subtract 2.9 to get 2.1. Our estimated slopes are then 3.3, 5.0, and 2.1,
#' respectively. The true values are 3, 5, and 2, so the model did a pretty good job here.
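#'
#' As a small sketch of that arithmetic, the per-period slopes can be assembled directly from
#' the fitted coefficients:
cf = coef(mod_3)
c(before_changes = unname(cf['x']),
  after_first    = unname(cf['x'] + cf['x:change_point_1']),
  after_second   = unname(cf['x'] + cf['x:change_point_1'] + cf['x:change_point_2']))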