Simple regression with interaction terms to measure the effect of a regime change on the regression coefficients. Implementation of https://stats.stackexchange.com/a/99432/8451
#' ---
#' title: "Regression for quantifying a regime change"
#' author: "David Marx"
#' date: "June 5, 2017"
#' output: html_document
#' ---
#' There are two time points of interest. We want to test the hypothesis that the regression
#' coefficients changed after these time points, respectively. We will accomplish this by introducing
#' dummy variables to denote whether we are before or after a particular change point. This approach
#' is based on a suggestion I stumbled across on Cross Validated: https://stats.stackexchange.com/a/99432/8451
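#'
#' Concretely, with dummy variables $D_1$ and $D_2$ marking observations after the first and second
#' change points, the model we will fit is
#' $$y = \beta_0 + \beta_1 x + \beta_2 D_1 + \beta_3 D_2 + \beta_4 x D_1 + \beta_5 x D_2 + \varepsilon,$$
#' which corresponds to the R formula `y ~ x*change_point_1 + x*change_point_2`. The interaction
#' coefficients $\beta_4$ and $\beta_5$ measure how much the slope changes after each change point.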
#' We start by defining simulation parameters
a=10 # Main intercept
b1=3 # Main slope
b2=5 # Slope after first regime change
b3=2 # Slope after second regime change
s=20 # Error standard deviation (passed to rnorm)
n=150 # Number of observations
n1=floor(n/3) # First change point
n2=2*n1 # Second change point
set.seed(111)
#' Generate data
r1 = 1:n1       # Observations before the first change point
r2 = (n1+1):n   # Observations after the first change point
r3 = (n2+1):n   # Observations after the second change point
x = 1:n
# y: no regime change (slope b1 throughout)
y = a + b1*x + rnorm(length(x), 0, s)
# y2: slope changes to b2 after the first change point
y2 = y
offset = y[r2[1]-1]  # offset so the new regime starts near where the previous one ended
y2[r2] = a + b2*x[seq_along(r2)] + rnorm(length(r2), 0, s) + offset
# y3: slope additionally changes to b3 after the second change point
y3 = y2
offset = y2[r3[1]-1]
y3[r3] = a + b3*x[seq_along(r3)] + rnorm(length(r3), 0, s) + offset
df0 = data.frame(x=x, change_point_1=1*(x>n1), change_point_2=1*(x>n2))
df1 = df0; df2 = df0; df3 = df0
df1$y = y
df2$y = y2
df3$y = y3
#' ## No regime change
#' Let's start by running a simulation where our hypothesis is wrong: there was no change to
#' the model after either of the time points of interest.
with(df1, plot(x,y, col=1+change_point_1 + change_point_2)); lines(y~x, df1)
mod_1 = lm(y~x*change_point_1 + x*change_point_2, df1)
df1$y_pred = predict(mod_1)
with(df1, plot(x,y, col=1+change_point_1 + change_point_2))
with(df1[1:n1,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df1[(n1+1):n2,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df1[(n2+1):n,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
knitr::kable(summary(mod_1)$coefficients[, c('Estimate','Std. Error', 'Pr(>|t|)')])
#' We're essentially only interested in the coefficients with 'x' in them.
#'
#' The coefficient for x is correctly estimated close to 3 and has a p-value << 0.01.
#' The coefficients for the interactions between x and the change point dummy variables
#' are both close to zero with p-values > 0.01.
#'
#' The interpretation of these results is that the change points had no measurable effect on the
#' slope, which the model estimates as $3.3$. The true value is 3, so the model
#' did a reasonably good job considering the high noise level used to generate the data.
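#' As an extra sanity check, we can pull a 95% confidence interval for the slope coefficient
#' directly with base R's `confint()`:
confint(mod_1)["x", ]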
#+eval=FALSE, echo=FALSE
0 # Need a code block for h2 to interpret correctly
#' ## One regime change
#' Let's modify the simulation by changing the slope after the first change point but not
#' the second and see what that looks like.
with(df2, plot(x,y, col=1+change_point_1 + change_point_2)); lines(y~x, df2) # 1 regime shift
mod_2 = lm(y~x*change_point_1 + x*change_point_2, df2)
df2$y_pred = predict(mod_2)
with(df2, plot(x,y, col=1+change_point_1 + change_point_2))
with(df2[1:n1,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df2[(n1+1):n2,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df2[(n2+1):n,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
knitr::kable(summary(mod_2)$coefficients[, c('Estimate','Std. Error', 'Pr(>|t|)')])
#' The dummy variables have the effect of essentially fitting the data separately before and after each
#' time point, so the intercept and 'x' coefficients are unchanged: they were fit to exactly
#' the same data (the first 50 observations).
#'
#' As before, the coefficient for x:change_point_2 is close to 0 and has a high p-value, but
#' the coefficient for x:change_point_1 is estimated as 1.7 (p << 0.01) with a standard error of 0.28. Interestingly,
#' the standard errors for x:change_point_1 and x:change_point_2 are the same (which also happened
#' in the 'no regime change' simulation), even though one coefficient is significant and the other is not.
#' The model has determined that the first change point had the effect of increasing the slope by 1.7.
#' The true value for the difference in slopes is 2, which is captured within two standard errors of
#' our estimate (i.e. a 95% CI).
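#' We can confirm this with the corresponding 95% confidence interval, which should contain the
#' true difference of 2:
confint(mod_2)["x:change_point_1", ]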
#+eval=FALSE, echo=FALSE
0 # Need a code block for h2 to interpret correctly
#' ## Two regime changes
#' Finally, let's see what happens when the slope changes after both time points of interest. In the last
#' simulation the slope increased, so this time let's reduce the slope after the second time point.
with(df3, plot(x,y, col=1+change_point_1 + change_point_2)); lines(y~x, df3) # 2 regime shifts
mod_3 = lm(y~x*change_point_1 + x*change_point_2, df3)
df3$y_pred = predict(mod_3)
with(df3, plot(x,y, col=1+change_point_1 + change_point_2))
with(df3[1:n1,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df3[(n1+1):n2,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
with(df3[(n2+1):n,], lines(x,y_pred, col=1+change_point_1 + change_point_2))
knitr::kable(summary(mod_3)$coefficients[, c('Estimate','Std. Error', 'Pr(>|t|)')])
#' The coefficients for `x` and `x:change_point_1` are exactly as before. But now, `x:change_point_2`
#' is estimated as having an effect of -2.85 (p << 0.01). As in the earlier models, the standard errors for our
#' interaction terms are tied together and are estimated to be about 0.28. The true value for the difference in
#' slope after the second change point is -3, so the model did a pretty good job here as well.
#'
#' Rather than looking at the change in slope, we can use the estimated coefficients to get estimates for
#' the slope in each of our time periods of interest. Prior to any of the change points, the estimated slope
#' is just the estimated coefficient for 'x', which was 3.3. After the first change point, we add 1.7 to get 5.0.
#' Finally, after the second change point we subtract 2.9 to get 2.1. Our estimated slopes are then 3.3, 5.0, and 2.1
#' respectively. The true values are 3, 5, and 2, so the model did a pretty good job here.
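#' The same arithmetic can also be read directly off the fitted coefficients; a minimal sketch
#' (the `cf` and `slopes` names below are just for illustration):
cf = coef(mod_3)
slopes = c(regime_1 = cf[["x"]],
           regime_2 = cf[["x"]] + cf[["x:change_point_1"]],
           regime_3 = cf[["x"]] + cf[["x:change_point_1"]] + cf[["x:change_point_2"]])
slopes  # should be roughly 3.3, 5.0, and 2.1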