= Time Series Summary
:toc:
:toc-placement: left
toc::[]
== Summary
* mostly from https://otexts.org/fpp2/[Forecasting: Principles and Practice (Hyndman, Athanasopoulos)]
* "stationary": properties do not depend on absolute time (so no calendar-fixed seasonality; cycles are allowed if they arise from dependence on past values)
* a centered moving average will smooth out seasons (e.g. weights 1/8,1/4,1/4,1/4,1/8 for quarterly data)
* consider transforming the data if you want to restrict it to a range (e.g. log for positive values) or if the variance is changing (e.g. Box-Cox transform)
** -> remember to use the correct back transformation, since the naive back transform gives a biased mean
* consider whether effects should be modelled as additive or multiplicative
* forecasting with co-integrated features may require special treatment
* include spike dummy variables for special events
* ARIMA-like models need a stationary time series -> use the KPSS test -> difference the series if the p-value is small (otherwise variance/interval estimates are incorrect)
* you can include features (rather than the target alone); they also need to be stationary, and if differencing, ideally difference them all at the same time
* include lagged features if lagged effects are plausible
* use AICc or CV to choose models (but the differencing order cannot be chosen this way, since differencing changes the data and the criteria are no longer comparable)
* for ARIMA(p,d,q) with p=0 or q=0, the model can be identified from the behaviour of the ACF/PACF plots
* test the residuals:
** use the Ljung-Box test to see if the residuals are white noise
** use an AR(1) or Cochrane-Orcutt test to check for autocorrelation in the residuals
** -> if they are not white noise, you are missing information and may need a better model (more features, transformations, ...)
* prediction intervals are usually too narrow since many modelling choices are not accounted for
* bagging: make many predictions and average them
=== Successful models
* SARIMAX: but does not work if the period is large (e.g. yearly seasons on weekly data)
* damped Holt-Winters
* auto.arima (R): may need to be forced to search more exhaustively
* ETS (Error-Trend-Season): non-linear ETS is not ARIMA
* seasons can be handled with Fourier terms asciimath:[sin/cos(2*pi*k*t/T)] ("harmonic regression")
* TBATS
* LSTM, RNN
* XGB, ANN
* GARCH
* Prophet (Facebook)
* CausalImpact (Google)
=== Special models
* Hierarchical/grouped models: when you want to model both the parts and the whole
* Vector Autoregression (VAR): when there are interactions between multiple targets
* Low integer count prediction: there are special methods for this, e.g. predict the non-zero values and the time between them
== Plots
* seasonal subseries plot: x-axis is the month; within each month, show the development over the years (plus the mean of that month across the years)
* scatter plot (matrix) of the y's
* autocorrelation at different lags (ACF/PACF; see the sketch below)
* ACF of the residuals
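A minimal plotting sketch in Python (statsmodels/matplotlib); the series `y` and the fitted model `fit` are placeholders:

[source,python]
----
# Diagnostic plots sketch (assumes a series `y` and, for residuals, a fitted model `fit`)
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(y, lags=36, ax=axes[0])     # autocorrelation at different lags
plot_pacf(y, lags=36, ax=axes[1])    # partial autocorrelation
plt.show()

plot_acf(fit.resid, lags=36)         # ACF of residuals: should look like white noise
plt.show()
----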
== Tests
* Portmanteau tests: test whether the residuals could be white noise (see the sketch below)
** Box-Pierce test: (scaled) sum of the first h squared residual autocorrelations; use h=10 for non-seasonal data, h=2*season for seasonal data, but h<=T/5
** Ljung-Box test: more accurate than Box-Pierce; chi^2 test
* Breusch-Godfrey (Lagrange multiplier) test for serial correlation: joint hypothesis that there is no autocorrelation in the residuals up to a certain specified order; a small p-value -> significant autocorrelation remaining in the residuals
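A Ljung-Box sketch in Python (statsmodels), assuming residuals `resid` from a fitted model:

[source,python]
----
from statsmodels.stats.diagnostic import acorr_ljungbox

h = 10  # rule of thumb for non-seasonal data; 2*season for seasonal data, capped at T/5
lb = acorr_ljungbox(resid, lags=[h], return_df=True)
print(lb)  # a large lb_pvalue means the residuals are compatible with white noise
----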
=== Unit root test for stationarity
* determine whether differencing is needed for stationarity
* KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test: difference if the p-value is small
* also check whether seasonal differencing is needed
* with autocorrelated residuals:
** the coefficient estimates are still unbiased, but they are not the best possible since a feature with predictive power is missing from the model
** the (hidden) variance of the estimates is increased
** the estimated standard errors are too narrow; all statistics are too optimistic
== Autocorrelation
* trend -> ACF drops slowly
* white noise -> 95% of autocorrelations within +-2/sqrt(T) [T is the length of the time series]; if there are large deviations or too many outside -> not white noise
* the model is still unbiased if the residuals are autocorrelated, but the prediction intervals may be wrong
* (Durbin-Watson tests for first-order autocorrelation only; it also does not work if the target is reused as a future feature; 0: positive 1st-order autocorrelation, 2: none, 4: negative)
* an AR(1) test is more powerful; Cochrane-Orcutt gives almost the same result
* to fix:
** add a feature
** change the functional form
== Box-Cox transform
* asciimath:[(y^lambda-1)/lambda] (or asciimath:[ln(y)] if asciimath:[lambda=0])
* works best if it makes the seasonal variation about equal in size across the series
* does not work if asciimath:[y<0]
* forecasts are not sensitive to asciimath:[lambda], but it may have large effects on the prediction intervals
*! note that the naive back transform gives a biased mean -> needs a correction (using the forecast variance) if you want to add up results (e.g. aggregate); see the sketch below
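A Box-Cox sketch with scipy; the bias-adjusted back transform follows the FPP2 formula and needs the forecast variance (`w_hat` and `sigma2_h` below are illustrative placeholders):

[source,python]
----
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

y = np.array([12.0, 15.0, 11.0, 18.0, 22.0, 17.0, 25.0, 30.0])  # toy positive series
w, lam = boxcox(y)                      # transformed series, estimated lambda

# suppose a model on w yields point forecast w_hat with forecast variance sigma2_h
w_hat, sigma2_h = w.mean(), w.var()     # placeholders for illustration only
naive_back = inv_boxcox(w_hat, lam)     # naive back transform ~ forecast median
# bias adjustment so that back-transformed forecasts approximate the mean (can be summed)
mean_back = naive_back * (1 + sigma2_h * (1 - lam) / (2 * (lam * w_hat + 1) ** 2))
print(naive_back, mean_back)
----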
== Metrics
* MAE for the median
* RMSE for the mean
* MAPE: mean of the percentage errors (only makes sense if there is a meaningful zero); puts a heavier penalty on negative errors
* (sMAPE: asciimath:[mean(2*|y_t - hat y_t|/(y_t + hat y_t))]; not recommended!)
* MASE (mean absolute scaled error): scale the errors by the training MAE of a simple forecast (e.g. the naive forecast); see the sketch below
* Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22, 679–688.
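A small MASE sketch (the function name and the toy data are illustrative):

[source,python]
----
# MASE sketch: scale test errors by the in-sample MAE of a (seasonal) naive forecast
import numpy as np

def mase(y_train, y_test, y_pred, m=1):
    """m=1: naive forecast; m=season length: seasonal naive forecast."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))   # training MAE of the naive forecast
    return np.mean(np.abs(y_test - y_pred)) / scale

# toy example
y_train = np.array([10., 12., 11., 13., 14., 13.])
y_test, y_pred = np.array([15., 14.]), np.array([14., 15.])
print(mase(y_train, y_test, y_pred))
----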
== Prediction intervals
* https://otexts.org/fpp2/prediction-intervals.html
* 1 step ahead: the standard deviation is about the same as that of the residuals
* for basic methods there are closed-form expressions (e.g. asciimath:[sigma*sqrt(h)] for the naive method; see the sketch below)
* bootstrap method: simulate future paths by iteratively accumulating sampled errors
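A sketch of the naive-method interval, assuming approximately normal residuals (toy data):

[source,python]
----
# Prediction-interval sketch for the naive method: sigma_h = sigma * sqrt(h)
import numpy as np

y = np.array([102., 105., 101., 108., 110., 107., 112.])  # toy series
resid = y[1:] - y[:-1]                  # one-step residuals of the naive forecast
sigma = np.std(resid, ddof=1)
h = 3                                   # forecast horizon
forecast = y[-1]                        # naive point forecast
lo, hi = forecast - 1.96 * sigma * np.sqrt(h), forecast + 1.96 * sigma * np.sqrt(h)
print(f"95% interval for h={h}: [{lo:.1f}, {hi:.1f}]")
----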
== Selecting features
* https://otexts.org/fpp2/selecting-predictors.html
* try multiple model versions, e.g. all feature subsets
* or do backward feature selection; not necessarily the best model, but always a good model
* (forward selection if there are too many features to try all subsets)
* (adjusted R^2: equivalent to minimizing the residual standard error; selects too many features)
* LOO-CV (fast methods exist; see the sketch below)
* AIC (for large T equivalent to the CV method)
* AICc: corrected version, since for small T the AIC selects too many features
* (BIC: for large T similar to leave-v-out CV with asciimath:[v=T*(1-1/(ln(T)-1))])
* -> use AICc/CV; FPP uses AICc
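For linear regression, LOO-CV can be computed without refitting via the hat-matrix formula quoted later in this document (asciimath:[H=X(X'X)^(-1)X'], asciimath:[CV=1/T sum (e_t/(1-h_t))^2]); a sketch:

[source,python]
----
# Fast LOO-CV for linear regression via the hat matrix (no refitting needed)
import numpy as np

def loo_cv(X, y):
    X = np.column_stack([np.ones(len(y)), X])   # add intercept column
    H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix H = X (X'X)^-1 X'
    e = y - H @ y                               # in-sample residuals
    h = np.diag(H)
    return np.mean((e / (1 - h)) ** 2)          # CV = 1/T * sum (e_t/(1-h_t))^2

# toy example comparing two candidate feature sets
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] + 0.1 * rng.normal(size=50)
print(loo_cv(X[:, :1], y), loo_cv(X, y))        # smaller CV -> preferred subset
----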
== Exponential smoothing
* weighted average between the last observation and the previous forecast
* can capture trend and seasonality
* asciimath:[hat y_(T+1)=alpha*y_T+alpha(1-alpha)*y_(T-1)+alpha(1-alpha)^2*y_(T-2)+...]
* need to choose the smoothing and initial parameters
* Holt extended the method to allow for a trend: forecast/level/trend equations
* Gardner & McKenzie: added damping of the trend towards a flat line (damping parameter typically 0.8...0.98)
* can include seasons (Holt-Winters; see the sketch below)
* forecast by iterating the model (and setting unknown future errors to zero)
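A damped Holt-Winters sketch with statsmodels, assuming a monthly series `y` with at least a few full seasons (`damped_trend` is the parameter name in recent statsmodels versions):

[source,python]
----
from statsmodels.tsa.holtwinters import ExponentialSmoothing

fit = ExponentialSmoothing(
    y, trend="add", damped_trend=True,      # additive, damped trend
    seasonal="add", seasonal_periods=12,    # additive yearly seasonality
).fit()                                     # smoothing/initial parameters estimated from data
print(fit.forecast(24))                     # iterate the model 24 steps ahead
----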
== ARIMA
* captures autocorrelation
* the series needs to be stationary
* cyclic data may still be stationary if the cycles depend on past values and not on absolute time
* ACF:
** non-stationary: drops slowly, r1 often large and positive
** stationary: drops quickly
* to check: difference the series and apply the Ljung-Box test to the differenced series (want a large p-value, i.e. it resembles white noise)
* https://otexts.org/fpp2/MA.html
* AR: autoregression on the target
* MA: regression on past forecast errors
* a stationary AR model can theoretically be written as an MA model of large order
* with some constraints the MA model is invertible and can also be written as an AR model of large order; invertible means that more recent observations get higher weight than more distant ones
* for stationarity the coefficients are restricted:
** AR(1): asciimath:[-1<phi_1<1]
** AR(2): asciimath:[-1<phi_2<1], asciimath:[phi_1+phi_2<1], asciimath:[phi_2-phi_1<1]
** AR(3) and higher: complex restrictions
* -> a linear model on the (differenced) target and past prediction errors
* also includes a constant term
* ARIMA(p,d,q):
** p: order of the autoregression
** d: degree of differencing
** q: order of the moving average
* with c=0: differencing order 0/1/2 -> long-term forecasts go to zero / a non-zero constant / follow a straight line
* with c!=0: differencing order 0/1/2 -> long-term forecasts go to the mean / follow a straight line / follow a quadratic trend
* for cycles p>=2 is needed; the average cycle length is then asciimath:[2*pi/arccos(-phi_1(1-phi_2)/(4*phi_2))]
* it is sometimes possible to tell p and q from the ACF/PACF plots
* the PACF at lag k is like the last coefficient in an AR(k) model
* finding p/q from the ACF/PACF plots:
**! the ACF/PACF plots only help if only one of p>0 or q>0
** ARIMA(p,d,0): ACF shows exponential decay or a sine pattern; PACF has a spike at lag p but none beyond (sharp drop)
** ARIMA(0,d,q): PACF shows exponential decay or a sine pattern; ACF has a spike at lag q but none beyond (sharp drop)
* auto.arima in R may need to be forced to search more parameter combinations
* estimation methods are complex and different software can give different results
* AIC can be used to find the parameters (it has a penalty term asciimath:[p+q+bool(c!=0)])
*! the d parameter cannot be found from AIC since differencing changes the data, making the AIC values not comparable
* auto.arima (see the Python sketch at the end of this section):
** https://otexts.org/fpp2/arima-r.html
** unit root tests, AICc minimization, MLE fitting
** d<=2 determined from repeated KPSS tests
** initial models (p,q) = (0,0), (2,2), (1,0), (0,1); if d<2: include a constant, but also fit (0,0) without a constant
** start from the best of the initial models and vary the parameters to find better ones
** `approximation=FALSE` and `stepwise=FALSE` for a more thorough search
* prediction interval calculations are complicated; for stationary models (d=0) they converge to a constant width
*! prediction intervals are too narrow since variation in the parameters, model order, etc. is not accounted for
* seasonal ARIMA, behaviour at the seasonal lags:
** seasonal (P,Q)=(0,1): single spike in the ACF, decay in the PACF
** seasonal (P,Q)=(1,0): decay in the ACF, single spike in the PACF
* Books:
** Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series analysis: Forecasting and control (5th ed). Hoboken, New Jersey: John Wiley & Sons.
** math background: Brockwell, P. J., & Davis, R. A. (2016). Introduction to time series and forecasting (3rd ed). New York, USA: Springer.
** alternative automatic ARIMA approach: Peña, D., Tiao, G. C., & Tsay, R. S. (Eds.). (2001). A course in time series analysis. New York, USA: John Wiley & Sons.
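auto.arima itself is an R function; a rough Python analogue (an assumption, not FPP's code) is `auto_arima` from the third-party pmdarima package, sketched here for a series `y` with seasonal period 12:

[source,python]
----
import pmdarima as pm

model = pm.auto_arima(
    y,                      # assumes a series with seasonal period 12
    seasonal=True, m=12,
    stepwise=False,         # exhaustive search, analogous to stepwise=FALSE in R
    suppress_warnings=True,
)
print(model.summary())
print(model.predict(n_periods=12))
----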
=== ARIMA vs ETS
* linear ETS models are special cases of ARIMA
* non-linear ETS models have no ARIMA equivalent
* some ARIMA models have no ETS equivalent
* all ETS models are non-stationary, some ARIMA models are stationary
* AIC cannot be used to compare ARIMA and ETS because they are different model classes and the likelihoods are computed differently
* but time series CV can be used to compare them (see the sketch below)
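A rolling-origin time series CV sketch for such a comparison, assuming a sufficiently long 1-D numpy array `y` (the model orders and window sizes are illustrative):

[source,python]
----
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

h, start = 1, 60                                  # 1-step forecasts, first training size
errors = {"ets-like": [], "arima": []}
for t in range(start, len(y) - h + 1):            # expanding training window
    train, actual = y[:t], y[t + h - 1]
    ets_fc = ExponentialSmoothing(train, trend="add").fit().forecast(h)[-1]
    arima_fc = ARIMA(train, order=(1, 1, 1)).fit().forecast(h)[-1]
    errors["ets-like"].append(actual - ets_fc)
    errors["arima"].append(actual - arima_fc)

for name, e in errors.items():
    print(name, np.sqrt(np.mean(np.square(e))))   # compare models by out-of-sample RMSE
----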
== Dynamic regression
* https://otexts.org/fpp2/estimation.html
* normal linear regression, but the (auto-correlated) error asciimath:[nu] is allowed to follow an ARIMA model (with white-noise error asciimath:[epsilon]); see the sketch at the end of this section
* you cannot just minimize the (auto-correlated) asciimath:[nu]: the coefficient estimates would no longer be the best ones, statistical tests would be incorrect, the AIC would be incorrect, and p-values would be too small -> minimize the asciimath:[epsilon] from the ARIMA error model instead
*! requires that the target and all features are stationary (otherwise the coefficients will not be consistent)
* exception to the stationarity requirement: co-integration (a linear combination of the non-stationary series is actually stationary)
* to preserve the relationship between the variables, apply the same differencing to all of them at once ("model in differences")
* then only an ARMA error model (not ARIMA) is needed
* to forecast, you also need to predict the features (e.g. by assuming past averages or scenarios)
* Pankratz, A. E. (1991). Forecasting with dynamic regression models.
* Generalization of dynamic regression models, "transfer function models": Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series analysis: Forecasting and control (5th ed)
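A dynamic-regression sketch using statsmodels SARIMAX (regression with ARMA errors); the stationary target `y`, feature matrix `X`, and future feature values `X_future` are placeholders:

[source,python]
----
from statsmodels.tsa.statespace.sarimax import SARIMAX

# linear regression on X with an ARMA(1,1) error structure (epsilon is white noise)
fit = SARIMAX(y, exog=X, order=(1, 0, 1)).fit(disp=False)
print(fit.summary())

# forecasting also needs future feature values (scenarios or their own forecasts)
print(fit.forecast(steps=len(X_future), exog=X_future))
----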
=== Trend modeling
* options:
** deterministic trend: constant term in the target prediction equation
** stochastic trend: use differencing (asciimath:[d=1]) in the error model (gives wider prediction intervals, which may be good because it allows for changing trends)
== Seasons with Harmonic regression
* https://otexts.org/fpp2/dhr.html
* for long seasonal periods: dynamic regression with Fourier terms is often better, especially since ARIMA does not handle large periods well (24 for hourly data is about the limit)
* also, seasonally differencing daily data against the value a year earlier is just noisy
* -> use harmonic regression for the seasons and ARMA errors for the short-term dynamics (see the sketch below)
* the only drawback is that the seasonal pattern is assumed to be constant over time
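A harmonic-regression sketch: Fourier terms for a long (yearly) season plus ARMA errors, assuming a daily series `y`; the period, number of harmonics K, and ARMA order are illustrative (K would normally be chosen by AICc):

[source,python]
----
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

T, K = 365.25, 5                                   # yearly period for daily data, K harmonics
t = np.arange(len(y))
fourier = np.column_stack(
    [f(2 * np.pi * k * t / T) for k in range(1, K + 1) for f in (np.sin, np.cos)]
)
fit = SARIMAX(y, exog=fourier, order=(2, 0, 1)).fit(disp=False)  # ARMA for short-term dynamics
print(fit.summary())
----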
== Lagged features
* include lagged features if it makes sense (due to delayed effects)
* choose the number of lags by AIC
== Hierarchical/grouped time series
* when the target can be split into (hierarchical) categories (e.g. type or region)
* want "coherent" forecasts, i.e. the series add up correctly
* there is always: [all series at every level] = [summing matrix] * [bottom-level series] (see the sketch at the end of this section)
* Wickramasuriya, S. L., Athanasopoulos, G., & Hyndman, R. J. (2018). Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization
* "grouped time series": also independent (crossed) categorisations, not just a single tree
* bottom-up approach: just predict the bottom level and sum
* top-down approach: predict the top level and then predict the proportions of the categories; may perform well
* check https://otexts.org/fpp2/top-down.html for more
* middle-out approach: predict a middle level and expand by summing upwards and using proportions downwards
* in general, incoherent forecasts can be made coherent with a "reconciliation matrix"
* optimal reconciliation matrix: https://otexts.org/fpp2/reconciliation.html; "MinT" approach
** no top-down approach can be unbiased
** need to estimate the forecast error variances -> use approximations
* further reading:
** Gross & Sohl (1990) provide a good introduction to the top-down approaches.
** The reconciliation methods were developed in a series of papers, which are best read in the following order: Hyndman et al. (2011), Athanasopoulos et al. (2009), Hyndman, Lee, & Wang (2016), Wickramasuriya et al. (2018).
** Athanasopoulos, Hyndman, Kourentzes, & Petropoulos (2017) extends the reconciliation approach to deal with temporal hierarchies.
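A tiny summing-matrix example (toy hierarchy: total splits into A and B, with B split into B1 and B2); bottom-up forecasts are coherent by construction:

[source,python]
----
import numpy as np

S = np.array([
    [1, 1, 1],   # total = A + B1 + B2
    [0, 1, 1],   # B     = B1 + B2
    [1, 0, 0],   # A
    [0, 1, 0],   # B1
    [0, 0, 1],   # B2
])
b_forecast = np.array([10.0, 4.0, 6.0])   # independent bottom-level forecasts (A, B1, B2)
coherent = S @ b_forecast                 # bottom-up: all levels add up by construction
print(coherent)                           # [20. 10. 10.  4.  6.]
----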
== Complex, multiple seasonality
* https://otexts.org/fpp2/complexseasonality.html
* `msts` class and `mstl()` function in R: specify all seasonal frequencies
* or use harmonic regression (include Fourier terms asciimath:[sin/cos(2*pi*k*t/T)] for each frequency asciimath:[T]); optimize the number of terms by AICc
* or the TBATS model by De Livera, Hyndman & Snyder: combination of Fourier terms, exponential smoothing and Box-Cox, automated
** however the prediction intervals are often too wide
** does not allow for features
* more sophisticated:
** Hyndman, R. J., & Fan, S. (2010). Density forecasting for long-term peak electricity demand. IEEE Transactions on Power Systems, 25(2), 1142–1153.
** Fan, S., & Hyndman, R. J. (2012). Short-term load forecasting based on a semi-parametric additive model. IEEE Transactions on Power Systems, 27(1), 134–141.
== Vector autoregression (VAR) for bi-directional influence
* the target influences the features as well
* includes the cross-term influences
* choose how many variables (keep only a few, e.g. by information criterion) and how many lags to use
* for co-integrated features use a "vector error correction model"
* predict recursively with 1-step predictions (see the sketch below)
* `VARselect()` from the R `vars` package selects the number of lags; Hyndman suggests the BIC (SC) criterion since AIC tends to choose too many lags
* letting every variable influence every other may seem crude, but VAR is useful for:
** forecasting a collection of related variables where no explicit interpretation is required
** testing whether one variable is useful in forecasting another (the basis of Granger causality tests)
** impulse response analysis, where the response of one variable to a sudden but temporary change in another variable is analysed
** forecast error variance decomposition, where the proportion of the forecast variance of each variable is attributed to the effects of the other variables
* vector ARMA is a generalization but also more complicated
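A VAR sketch with statsmodels, assuming a DataFrame `df` of several related, stationary series:

[source,python]
----
from statsmodels.tsa.api import VAR

model = VAR(df)
print(model.select_order(maxlags=8).summary())        # lag selection by AIC/BIC/HQIC, like VARselect()
fit = model.fit(maxlags=8, ic="bic")                  # BIC, as suggested above
print(fit.forecast(df.values[-fit.k_ar:], steps=6))   # recursive multi-step forecast
----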
== Neural networks
* use asciimath:[p] lags and asciimath:[k] hidden nodes in a simple (autoregressive) neural network NNAR(p,k)
* `nnetar()` in R (a rough Python sketch is shown below)
* since it is not a stochastic model, there are no closed-form prediction intervals -> draw random errors from a distribution while predicting and simulate possible future paths
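`nnetar()` is an R function; a rough Python analogue (an assumption, not an equivalent implementation) is a small MLP on p lagged values, assuming a 1-D numpy array `y`. This gives point forecasts only; intervals would need simulated error paths as described above:

[source,python]
----
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_lags(y, p):
    # rows: [y_(t-p), ..., y_(t-1)]; targets: y_t
    X = np.column_stack([y[i : len(y) - p + i] for i in range(p)])
    return X, y[p:]

p, k, h = 4, 5, 10                                   # NNAR(p,k)-style setup, h-step forecast
X, target = make_lags(y, p)
net = MLPRegressor(hidden_layer_sizes=(k,), max_iter=2000, random_state=0).fit(X, target)

history = list(y[-p:])
for _ in range(h):                                   # iterate 1-step forecasts
    history.append(net.predict([history[-p:]])[0])
print(history[p:])                                   # h point forecasts
----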
== Bootstrapping and bagging
* "blocked bootstrap": random sections of the remainder series (i.e. after removing trend and season) are selected and joined together
* -> gives a measure of forecast uncertainty
== General
* everything can be additive or multiplicative
* remove calendar effects (e.g. number of days in a month)
* forecast variable y: dependent, explained variable
* predictor variable x: independent, explanatory variable
* a forecast is inefficient if the residuals are autocorrelated or correlated with the target
* could use spike or step functions for interventions or changes
* lagged values can be useful
* for seasons of period asciimath:[m] one can use the first few Fourier features asciimath:[sin/cos(2*pi*t*k/m)]
* regression with Fourier terms: "harmonic regression"
* https://otexts.org/fpp2/regression-matrices.html
* the CV statistic can be calculated directly from the hat matrix asciimath:[H=X(X'X)^(-1)X'] via asciimath:[CV=1/T sum (e_t/(1-h_t))^2], where asciimath:[h_t] are the diagonal elements of the hat matrix [hence no need to fit many models]
* with multi-collinearity: individual coefficient estimates are unreliable and relations between features are unreliable; with good software not a problem for forecasting [unless the collinearity is perfect?]
* forecasts are unreliable if new feature values lie outside the previously observed range
* "seasonally adjusted": seasonal component removed
* if the season has order 4: a 2x4 MA (moving average, first of order 4, then a 2-MA of the result; centered) gives weights (1/8,1/4,1/4,1/4,1/8) and hence removes a quarterly season
* classical decomposition: detrend with an MA (m-MA for odd, 2x m-MA for even seasonal order) -> subtract the smoothed curve from the original -> average the detrended values per season (e.g. per month) to get the seasonal component
* for monthly/quarterly seasons, some institutions have developed averaging methods: https://otexts.org/fpp2/x11.html
* https://otexts.org/fpp2/stl.html[STL: Seasonal and Trend decomposition using Loess]
* "stationary": properties do not depend on time (every set of observations has a time-invariant joint distribution)
* FPP prefers AICc for minimization
* one can plot the (inverse) complex roots against the unit circle (https://otexts.org/fpp2/arima-r.html) to see how stationary and invertible the model is -> roots close to the unit circle may make the model unstable (but many implementations only return well-behaved models)
* for forecasting: use predicted values as features of later steps; use zero for future errors; https://otexts.org/fpp2/arima-forecasting.html
* forecasting with co-integration: "Applied time series modelling and forecasting" (Richard Harris and Robert Sollis)
* almost all models give too narrow(!) prediction intervals since some uncertainty is not considered (e.g. an ETS 95% interval may cover only 70%)
* bagging: predict many bootstrapped versions and average
* one could need e.g. 13 Fourier terms with asciimath:[2*pi*j*t/52.18] for weekly data with yearly seasons
* for holiday effects use dummy variables (but STL, ETS, TBATS do not allow features) -> rather use dynamic regression models
* for predicting low integer counts: https://otexts.org/fpp2/counts.html
** Croston's method: often used, even though not great; predict the non-zero values and the lengths of the zero-count runs separately
** better methods for low count data: Christou, V., & Fokianos, K. (2015). On count time series prediction. Journal of Statistical Computation and Simulation, 85(2), 357–373.
* to limit forecasts to a range, one can transform them https://otexts.org/fpp2/limits.html (e.g. log for positive counts)
* if you want to predict aggregates (e.g. a 3-month sum when having monthly data), you can just sum, but may need to correct the prediction intervals (possible correlations)
* ETS and ARIMA with differencing allow the model to adapt to changing conditions, but dynamic regression models have no such evolution
* think about whether you want 1-step or multi-step forecasts in the train/test setup
* missing values can be estimated
* you could replace outliers with missing values
* ideally do not use future information to decide on missing or outlier values
* more possible time series issues: last chapter of "Principles of business forecasting" (Ord, J. K., Fildes, R., & Kourentzes, N. (2017))
== Models to try
* ETS
* (S)ARIMA(X)
* GARCH
* TBATS
* Prophet (Facebook)
* CausalImpact
* Shallow ANN
== Cookbook
* check the ACF of the residuals
** see the tests above
** helpful if they are normal, for easier prediction intervals
* residuals correlate with a feature -> may need a non-linear feature transformation
* does any feature which is not in the model correlate with the residuals? -> include it
* residuals vs fitted values -> if there is a pattern, then heteroscedasticity; maybe transform the target
* are there outliers which are "influential observations"?
* for ARIMA:
** https://otexts.org/fpp2/arima-r.html
** plot the series
** stabilize the variance if needed
** check stationarity with the ACF (it needs to drop quickly)
** make the series stationary by differencing if needed
** check the ACF/PACF to see whether a (p,0) or (0,q) model is appropriate -> use AICc to find better models
** check the residuals with the ACF and a Portmanteau test to verify that they are white noise
* on the CV/test set you can evaluate with the business metric and also compare different differencing parameters
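For example, a quick stationarity check in Python with statsmodels (`s` is the series under test; the KPSS null hypothesis is stationarity, so a large p-value means differencing is not indicated):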
[source,python]
----
from statsmodels.tsa.stattools import kpss

p_value = kpss(s)[1]              # KPSS null hypothesis: the series is (level-)stationary
if p_value > 0.05:
    print("Stationary. Good!")    # cannot reject stationarity -> no differencing needed
----
* alternatively, want `adfuller(s)[1] < 0.05` (ADF test; but this test is weaker)