The tstests package provides a number of different statistical tests for evaluating the goodness of fit of time series models as well as their out of sample predictions.
Table 1 summarizes the tests currently implemented, together with the reference paper and type of test. The tests are broadly categorized as Wald [W], Likelihood Ratio [LR], Hausman [H] and Lagrange Multipler [LM].
One of the key features of the package includes customization of the test output using the as_flextable method to generate non-console compatible and nicely formatted tables. The next section provides an overview of the tests together with examples.
2 Goodness of Fit Tests
2.1 GMM Orthogonality
The Generalized Method of Moments framework of Hansen (1982) provides for a set of orthogonality conditions for testing the specification of a model. These conditions test for how well a model captures a set of lower to higher order moments and co-moments by testing the following hypothesis:
with \(\theta\) being the parameter vector. For large T we expect that \(g_T\left(\theta\right)\) converges to \(E\left[M_t\left(\theta\right)\right]\), and should be equal to zero under a correctly specified model. The GMM estimator of \(\theta\) is obtained by minimizing:
\[
S = \frac{1}{T}\sum^T_{t=1}M_t\left(\theta\right)M_t\left(\theta\right)'.
\tag{4}\]
which turns out to be the asymptotic covariance matrix.1
The test is conducted on individual moment conditions as a t-test with \(T-1\) degrees of freedom, on the co-moment conditions as a Wald test2 with \(j\) degrees of freedom, and joint tests on the co-moment conditions and all conditions (\(J\)3) as Wald tests with \(m\) and \(4+3m\) degrees of freedom, respectively.
As an example, we use the SPY log returns to estimate an exponential GARCH (2,1) model with Johnson’s SU distribution, and test all these conditions using the gmm_test.
## GMM Orthogonality Test
## Hypothesis(H0) : E[Moment] = 0
##
## Mean Std. Error t value Pr(>|t|)
## M1 0.0004718 0.01152 0.04095 0.96734
## M2 0.0081345 0.02585 0.31473 0.75297
## M3 -0.1599486 0.15545 -1.02894 0.30354
## M4 1.2079050 1.26430 0.95540 0.33941
## Q2[J] NA NA 2.18549 0.33530
## Q3[J] NA NA 4.74401 0.09329 .
## Q4[J] NA NA 1.60572 0.44805
## J NA NA 11.04053 0.35437
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
None of the p-values is below 5%, therefore we fail to reject the null hypothesis. Notice that the joint conditions (Q2-Q4 and J) have no Mean or Std.Error as these are vector valued. It is also possible to create a nicer looking table, with symbols, using flextable, and we also include the test paper reference in the footnote.
Reference: Hansen,L.P. (1982), Large sample properties of generalized method of moments estimators, Econometrica, 50(4), 1029--1054.
2.2 Parameter Constancy
The parameter constancy tests of Nyblom (1989) and Hansen (1992) are Lagrange multiplier parameter stability tests. The test of Nyblom (1989) tests the null hypothesis that all parameters are constant against the alternative that they follow a martingale process, whilst that of Hansen (1992) tests the null hypothesis of constancy for individual parameters. The distribution of the statistic is non-standard, and related to the distribution of squares of independent Brownian bridges which has the following series expansion4:
where \(p\) are the number of parameters. Hansen (1992) provides a table with critical values for up to 20 parameters based on simulation which has been used extensively as a reference point when conducting this test. Instead, the tstests package uses an internal dataset generated by simulation for up to 40 parameters in addition to a kernel smoothed density in order to generate the p-values directly. It should be noted that neither the individual nor joint tests provide any information about the location of breaks, and the distribution is only valid for stationary data.
To illustrate the use of the test, we continue with the same model as in the previous section, and turn on the argument to provide guidance on whether to reject the null hypothesis at the 5% level of significance.
Reference: Nyblom,J. (1989), Testing for the constancy of parameters over time, Journal of the American Statistical Association, 405, 223--230.
Table 3 indicates that individually we can reject the null of parameter constancy on most of the parameters at the 5% level as well as jointly (J).
2.3 Sign Bias
The sign bias test of Engle and Ng (1993) evaluates the presence of leverage effects in the standardized residuals (to capture possible misspecification of a GARCH model), by regressing the squared standardized residuals on lagged negative and positive shocks as follows:
where \(S^{-}_{t-1} = I_{\varepsilon_{t-1}<0}\), \(S^{+}_{t-1} = I_{\varepsilon_{t-1}\ge0}\). and \(\varepsilon_t\) the model residuals. The null hypothesis is that all coefficients \(\{c_1,c_2,c_3\}\) are individually and jointly zero. The joint test is a Wald test distributed as \(\chi^2(3)\).
Table 4 shows the output of this test on the residuals of the model used so far, where the p-values indicate that we cannot reject the null of no sign bias either individually or jointly.
Code
test <-signbias_test(residuals(mod), sigma =sigma(mod))as_flextable(test, use.symbols =TRUE, footnote.reference =TRUE)
Reference: Engle RF, Ng VK (1993). Measuring and testing the impact of news on volatility. The Journal of Finance, 48(5), 1749--1778.
3 Forecast Evaluation Tests
3.1 Density Forecast
A novel method to analyze how well a conditional density fits the underlying data is through the probability integral transformation (PIT) discussed in Rosenblatt (1952) and defined as:
which transforms the data \(y_t\), using the estimated distribution \(\hat F\) into i.i.d. \(U(0,1)\) under the correctly specified model.
Because of the difficulty in testing a \(U(0,1)\) sequence, the PIT data (\(x_t\)) is transformed into \(N(0,1)\) by Berkowitz (2001) using the normal quantile (\(\Phi^{-1}\)) function:
\[
z_t = \Phi^{-1}\left(x_t\right)
\tag{8}\]
Under the null of a correctly forecast density the values (\(z_t\)) should be independent across observations, with mean zero, unit variance and non-significant autoregressive terms (zero). For example, the following first order autoregressive dynamics can be tested:
as the unrestricted equation in a likelihood ratio (LR) test against the restricted equation where \(\mu=0\), \(\sigma=1\) and \(\rho=0\). The LR test is then:
where \(L\) is the log-likelihood. The LR test statistic is asymptotically distributed as \(\chi^2(2+m)\) where \(m\) are the number of autoregressive lags.
A final testing step can also be included which looks at the goodness of fit of the transformed series into the Normal. Following Dowd (2004), the output includes the Normality test of Jarque and Bera (1987).
We use the GARCH based pre-computed backtest5 in the package for the example:
Code
p <-pdist('jsu', q = garch_forecast$actual, mu = garch_forecast$forecast,sigma = garch_forecast$sigma, skew = garch_forecast$skew,shape = garch_forecast$shape)as_flextable(berkowitz_test(p))
Hypothesis(H0) : Normal(0,1) with no autocorrelation
Both the p-values of the LR and Jarque-Bera tests are greater than 5% so at this level we fail to reject the null hypothesis, and find the density forecast adequately meets the requirements of these tests.
3.2 Directional Accuracy
High and significant Directional Accuracy could imply either an ability to predict the sign of the mean forecast or could merely be the result of volatility dependence in the absence of mean predictability as argued by Christoffersen and Diebold (2006).
Pesaran and Timmermann (1992) provide a non-parametric test to evaluate the magnitude and significance of the forecast’s directional accuracy. Let \(y_t\) denote the actual series and \(x_t\) the forecast of that series, then the null hypothesis is that the two are independent (the forecast cannot predict the actual). This is a Hausman type test with statistic:
\[
\hat P = \frac{1}{n} \sum^n_{t=1}H\left(y_t x_t\right)
\tag{12}\]
representing the proportion of times that x correctly predicts y, with \(H(.)\) being the Heaviside function taking values of 1 if the signs of the forecast and actual agree and 0 otherwise.
with \(\hat P_y\) and \(\hat P_x\) the proportion of positive cases in y and x respectively. Equation 13 represents the expected proportion of times that x would correctly predict y, under the assumption that they are independently distributed (the null). This difference between realized proportion and expected proportion under the null forms the basis of the Hausman test statistic. The denominator standardizes this difference in proportions by the difference in the variance of the two proportions, with:
The statistic \(S_n\) is asymptotically distributed as \(N\left(0,1\right)\).
While Pesaran and Timmermann (1992) test for sign predictability, Anatolyev and Gerko (2005) test for mean predictability based on normalized excess profitability implied by a directional strategy which goes long or short depending on the sign of the forecast. One of the key differences between these two tests relates to the observed asymmetric nature of \(y_t\), which means that it may matter when \(x_t\) fails to predict \(y_t\) if during those times the loss (or missed gain) is high enough to make the strategy relatively unprofitable despite having significant directional accuracy. Consider the 1-period return of the trading strategy:
A benchmark strategy which issues buy and sell signals at random with proportions of buys and sells the same as in the trading strategy, has the following expected return:
Under the null of conditional mean independence, then \(E\left[y_t|I_{t-1}\right] = \mathrm{constant}\). If the strategy has predictive power, it will generate higher returns on average than the benchmark, and \(A_n - B_n\) will be greater than zero and significant. To test the significance of this difference, it needs to be normalized by the the variance of this difference (\(V\)):
\[
V = \frac{4}{n^2}\hat p_y\left(1 - \hat p_y\right)\sum^n_{t=1}\left(y_t-\hat y\right)^2
\tag{19}\]
where \(\hat p_y=\frac{1}{2}\left(1 + \frac{1}{n}\sum^n_{t=1}\mathrm{sign}\left(\hat y_t\right)\right)\)
Similar to the test of Pesaran and Timmermann (1992), this is also a Hausman test with statistic \(\frac{A_n - B_n}{\sqrt(V)}\), asymptotically distribution as \(N\left(0,1\right)\).
To illustrate we use an ARMA(1,1) pre-computed backtest forecast available in the package:
The p-values for both tests in Table 6 suggest that we fail to reject the null of no predictability, with high confidence.
3.3 Forecast Unbiasedness
To understand the properties of a good forecast, we start by considering what an optimal forecast should look like. Consider the Wald representation of \(y\), at horizon h:
and variance of the h-step forecast error increasing in h.
The optimal forecast error is unbiased with \(E\left[\epsilon_{n+h|n}\right] = 0\), with the h-step forecast errors having at most an MA(h-1) structure and the 1-step ahead forecast error \(\epsilon_{n+1|n} = \epsilon_{n+h}\) being white noise. This implies that :
can be tested under the null of unbiasdness with \(\alpha=0\) and \(\beta=1\) using a Wald test. This is the essence of the regression test by Mincer and Zarnowitz (1969). It effectively tests for forecast bias, though it does not say anything about forecast variance.
Table 7 shows the output of the test using a 25-step ahead forecast on a chosen subsample of SPY log returns from an ARMA(12,0) model. We fail to reject the NULL of \(\alpha=0\) and \(\beta=1\) given the results of the Wald test with Pr(>Chisq) = 0.91.
Reference: Mincer JA. and Zarnowitz V. (1969), 'The evaluation of economic forecasts.' In Economic forecasts and expectations: Analysis of forecasting behavior and performance, NBER, 3--46
3.4 Expected Shortfall
The test of Du and Escanciano (2017) combines ideas from Berkowitz (2001) and Christoffersen (1998) to create an unconditional and conditional shortfall test based on the probability integral transformed realizations, conditioned on the forecast distribution, to evaluate the severity and independence of the shortfall residuals.
3.5 The Unconditional Test
The unconditional test assesses the expected value of the cumulative violations beyond the Value at Risk (VaR) threshold, using a two-sided t-test, with the following statistic:
where the term \(\frac{\alpha}{2}\) is the expected value under a correctly calibrated risk model. \(\bar H\left(\alpha\right)\) denotes the sample mean of \(H_t\left(\alpha\right)\):
with \(H_t\left(\alpha\right)\) the cumulative failures (violations beyond VaR) such that:
\[
H_t\left(\alpha\right) = \frac{1}{\alpha}\left(\alpha - u_t\right)\mathrm{I}\left(u_t\le\alpha\right).
\tag{28}\] The vector \(u\) is the probability integral transformation of the future realization given the forecast distribution.
The distribution of the test statistic \(U_{ES}\) is asymptotically distributed as \(N(0,1)\). Since the statistic is bounded in the unit interval, the confidence intervals need to be constrained to the be between [0,1]. Therefore, the p-value will take the following form:
The conditional test assesses not only the correctness and significance of the expected value of the cumulative failures but also their independence. Consider the auto-covariance of the cumulative violations for lag j:
Hypothesis(H0) : Unconditional(U) and Independent(C)
Reference: Du Z. and Escanciano J.C. (2017). Backtesting expected shortfall: accounting for tail risk. Management Science, 63(4), 940--958.
3.8 Value at Risk: Coverage and Duration
The Value at Risk (VaR) tests in the tstests package are based on the proportion of failures test of Kupiec (1995), the conditional coverage test of Christoffersen (1998), and the duration between failures test of Christoffersen and Pelletier (2004). These are summarized in the next sections.
3.9 Proportion of Failures (UC)
The test of Kupiec (1995) assesses whether the proportion of failures is consistent with the expected proportion of failures at a given level of VaR. This is done by summing the number of violation (binary) and dividing by the length of the forecasts.6 Under the null hypothesis that the proportion of failures (\(\pi\)) is equal to the VaR level (\(\alpha\)), then the restricted model likelihood is:
\[
\mathrm{L}_r = \left(1 - \alpha\right)^{n}\alpha^{k}
\tag{33}\] and the unrestricted (observed) model likelihood:
\[
\mathrm{L}_u = \left(1 - \pi\right)^{n}\pi^{k}
\] where \(\pi=\frac{n}{\left(n + k\right)}\), \(n\) are the number of observations, and \(k\) the number of failures. A likelihood ratio test can then be conducted with statistic:
\[
\mathrm{LR}_{uc} = -2\log{\frac{\mathrm{L}_r}{\mathrm{L}_u}}
\tag{34}\] which is distributed as \(\chi^2_1\).
3.10 Independence of Failures (CCI)
The proportion of failures (or unconditional) test of Kupiec (1995) tests the coverage of the interval but has no power in detecting whether they are independent. Christoffersen (1998) proposed a test to explicitly check the independence assumption against a first order Markov alternative. Consider a binary, first order Markov chain with transition probability matrix:
where \(\pi_{ij}=Pr\left(I_t=j|I_{t-1}-i\right)\), and \(I_t\) is the indicator function denoting failures. The approximate likelihood of this process is then
\[
\mathrm{L}_u = \left(1-\pi_{01}\right)^{n_{00}}\pi_{01}^{n_{01}}\left(1-\pi_{11}\right)^{n_{10}}\pi_{11}^{n_{11}}
\tag{36}\] where \(n_{ij}\) is the number of observation with value i followed by j. For example, \(n_{10}\) represents the number of periods with failures followed by periods with no failures. For the restricted model under the null of independence, the first order Markov chain has the following transition probability matrix:
\[
\Pi_1 =
\begin{bmatrix}
1-\pi_2 & \pi_2\\
1-\pi_2 & \pi_2\\
\end{bmatrix}
\tag{37}\] with the following likelihood:
which is asymptotically distributed as \(\chi^2\left(1\right)\).
3.11 Conditional Coverage (CC)
The likelihood ratio for the joint test of coverage and independence is simply the sum of the Kupiec coverage and independence likelihood ratios:
\[
\mathrm{LR}_{cc} = \mathrm{LR}_{uc} + \mathrm{LR}_{cci}
\tag{40}\] which is asymptotically distributed as \(\chi^2\left(2\right)\).
3.12 Duration between Failures (D)
The conditional coverage independence test in the previous section has limited power in detecting temporal dependence beyond the simple first order Markov structure. To provide a more powerful test Christoffersen and Pelletier (2004) proposed a more general structure using the duration of time between VaR failures (the no-hit duration), defined as :
\[
D_i = t_i - t_{i-1}
\tag{41}\]
where \(t_i\) denotes the time index of violation number \(i\). Under the null hypothesis of a correctly specified risk model, this no-hit duration (\(D\)) should have no memory with a mean duration of \(\frac{1}{\alpha}\) periods. Since the only continuous memory free random distribution is the exponential, then under the null hypothesis the distribution of the no-hit duration should be:
\[
f_{\mathrm{exp}}\left(D;\alpha\right) = p\exp\left(-\alpha D\right).
\tag{42}\] In order to test for this, an encompassing distribution which allows for duration dependence and also embeds the exponential is required. The Weibull distribution offers one such example, with distribution:
\[
f_{\mathrm{W}}\left(D;a,b\right) = a^bbD^{b-1}\exp\left(-\left(aD\right)^b\right).
\tag{43}\] The exponential is a special case when \(b=1\). Therefore, the null hypothesis of a memory-less duration process corresponds to \(b=1\) and \(a=\alpha\), which can be tested using a likelihood ratio test distributed as \(\chi^2\left(1\right)\).7
3.13 Example
Table 9 shows the 4 tests for VaR, with p-values well above 10% indicating a correctly specified model for this series during the out of sample period tested.
Code
library(tsdistributions)data("garch_forecast")q <-qdist("jsu", p =0.05, mu = garch_forecast$forecast, sigma = garch_forecast$sigma,skew = garch_forecast$skew, shape = garch_forecast$shape)test <-var_cp_test(actual = garch_forecast$actual, forecast = q, alpha =0.05)as_flextable(test, footnote.reference = T) |>width(j =1:3,width =1.2)
Du, Zaichao, and Juan Carlos Escanciano, 2017, Backtesting expected shortfall: Accounting for tail risk, Management Science 63, 940–958.
Engle, Robert F, and Victor K Ng, 1993, Measuring and testing the impact of news on volatility, The Journal of Finance 48, 1749–1778.
Hamilton, James Douglas, 2020, Time Series Analysis (Princeton university press).
Hansen, Bruce E, 1992, Testing for parameter instability in linear models, Journal of Policy Modeling 14, 517–533.
Hansen, L.P., 1982, Large sample properties of generalized method of moments estimators, Econometrica 50, 1029–1054.
Jarque, Carlos M, and Anil K Bera, 1987, A test for normality of observations and regression residuals, International Statistical Review/Revue Internationale de Statistique, 163–172.
Kupiec, P.H., 1995, Techniques for verifying the accuracy of risk measurement models, The Journal of Derivatives 3, 73–84.
Mincer, Jacob A, and Victor Zarnowitz, 1969, The evaluation of economic forecasts, Economic forecasts and expectations: Analysis of forecasting behavior and performance (NBER).
Nyblom, J., 1989, Testing for the constancy of parameters over time, Journal of the American Statistical Association 84, 223–230.
Pesaran, M.H., and A. Timmermann, 1992, A simple nonparametric test of predictive performance, Journal of Business & Economic Statistics 10, 461–465.
Rosenblatt, M., 1952, Remarks on a multivariate transformation, The Annals of Mathematical Statistics 23, 470–472.
Footnotes
It is also possible to correct this estimator for serial correlation as suggested by Hamilton (2020), but we leave this for future investigation.↩︎
Based on the SPY log returns data (tsdatasets::spy()) using a GARCH(1,1)-JSU model (see documentation for details and replication code).↩︎
Under the assumption of i.i.d observations, the sequence of failures is distributed as Bernoulli(\(\alpha\)).↩︎
The actual implementation requires calculating the duration of the hit sequence, the censored observations and combining all this to construct the log-likelihood which needs to be solved using numerical methods for the unrestricted likelihood.↩︎