tstests – Time Series Models

1 Introduction

The tstests package provides a number of different statistical tests for evaluating the goodness of fit of time series models as well as their out of sample predictions.

Table 1 summarizes the tests currently implemented, together with the reference paper and type of test. The tests are broadly categorized as Wald [W], Likelihood Ratio [LR], Hausman [H] and Lagrange Multipler [LM].

Table 1: Package Tests

Test	Function	Reference	Type
GMM Orthogonality Test	gmm_test	Hansen (1982)	[W]
Nyblom Parameter Constancy Test	nyblom_test	Nyblom (1989)	[LM]
Sign Bias Test	signbias_test	Engle and Ng (1993)	[W]
Berkowitz Forecast Density Test	berkowitz_test	Berkowitz (2001)	[LR]
Directional Accuracy Tests	dac_test	Pesaran and Timmermann (1992), Anatolyev and Gerko (2005)	[H]
Mincer-Zarnowitz Test	minzar_test	Mincer and Zarnowitz (1969)	[W]
Expected Shortfall Test	shortfall_de_test	Du and Escanciano (2017)	[LR]
Value at Risk Test	var_cp_test	Christoffersen (1998), Christoffersen and Pelletier (2004)	[LR]

One of the key features of the package includes customization of the test output using the as_flextable method to generate non-console compatible and nicely formatted tables. The next section provides an overview of the tests together with examples.

2 Goodness of Fit Tests

2.1 GMM Orthogonality

The Generalized Method of Moments framework of Hansen (1982) provides for a set of orthogonality conditions for testing the specification of a model. These conditions test for how well a model captures a set of lower to higher order moments and co-moments by testing the following hypothesis:

\[ \begin{aligned} E&\left[z_t\right]=0,\quad E\left[z_t^2-1\right]=0\\ E&\left[z_t^3\right]=0,\quad E\left[z_t^3\right]-3=0\\ E&\left[\left(z^2_t-1\right)\left(z_{t-j}^2-1\right)\right]=0,\quad j=1,\ldots,m\\ E&\left[\left(z^3_t\right)\left(z_{t-j}^3\right)\right]=0,\quad j=1,\ldots,m\\ E&\left[\left(z^4_t-3\right)\left(z_{t-j}^4-3\right)\right]=0,\quad j=1,\ldots,m\\ \end{aligned} \tag{1}\]

where $z_t$ are the standardized residuals from an estimated model, and $m$ the number of lags used for testing the hypothesis.

Define the moment (M) generating conditions as:

\[ g_T\left(\theta\right) = \frac{1}{T}\sum^T_{t=1}M_t\left(\theta\right) \tag{2}\]

with $\theta$ being the parameter vector. For large T we expect that $g_T\left(\theta\right)$ converges to $E\left[M_t\left(\theta\right)\right]$, and should be equal to zero under a correctly specified model. The GMM estimator of $\theta$ is obtained by minimizing:

\[ g_T\left(\theta\right)'S^{-1}g_T\left(\theta\right) \tag{3}\]

where S is the weighting matrix and defined as

\[ S = \frac{1}{T}\sum^T_{t=1}M_t\left(\theta\right)M_t\left(\theta\right)'. \tag{4}\]

which turns out to be the asymptotic covariance matrix.¹

The test is conducted on individual moment conditions as a t-test with $T-1$ degrees of freedom, on the co-moment conditions as a Wald test² with $j$ degrees of freedom, and joint tests on the co-moment conditions and all conditions ($J$³) as Wald tests with $m$ and $4+3m$ degrees of freedom, respectively.

As an example, we use the SPY log returns to estimate an exponential GARCH (2,1) model with Johnson’s SU distribution, and test all these conditions using the gmm_test.

Code

library(tsgarch)
library(tsdistributions)
spy <- tsdatasets::spy
spyr <- na.omit(diff(log(spy)))
spec <- garch_modelspec(spyr, model = "egarch", order = c(2,1), constant = TRUE,
distribution = "jsu")
mod <- estimate(spec)
skewness <- dskewness("jsu", skew = coef(mod)["skew"], shape = coef(mod)["shape"])
kurtosis <- dkurtosis("jsu", skew = coef(mod)["skew"], shape = coef(mod)["shape"]) + 3
test <- gmm_test(residuals(mod, standardize = TRUE), lags = 2, skewness = skewness, 
                 kurtosis = kurtosis, conf_level = 0.95)
print(test)

## GMM Orthogonality Test
## Hypothesis(H0) :  E[Moment] = 0 
## 
##             Mean Std. Error  t value Pr(>|t|)  
## M1     0.0004718    0.01152  0.04095  0.96734  
## M2     0.0081345    0.02585  0.31473  0.75297  
## M3    -0.1599486    0.15545 -1.02894  0.30354  
## M4     1.2079050    1.26430  0.95540  0.33941  
## Q2[J]         NA         NA  2.18549  0.33530  
## Q3[J]         NA         NA  4.74401  0.09329 .
## Q4[J]         NA         NA  1.60572  0.44805  
## J             NA         NA 11.04053  0.35437  
## 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

None of the p-values is below 5%, therefore we fail to reject the null hypothesis. Notice that the joint conditions (Q2-Q4 and J) have no Mean or Std.Error as these are vector valued. It is also possible to create a nicer looking table, with symbols, using flextable, and we also include the test paper reference in the footnote.

Code

as_flextable(test, include.decision = FALSE, footnote.reference = TRUE)

Table 2: GMM Orthogonality Test

Moment	Mean	Std. Error	t value	Pr(>\|t\|)
$E\left[z_t\right]$	0.0005	0.0115	0.0409	0.9673
$E\left[z^2_t - 1\right]$	0.0081	0.0258	0.3147	0.7530
$E\left[z^3_t\right]$	-0.1599	0.1554	-1.0289	0.3035
$E\left[z^4_t - 3\right]$	1.2079	1.2643	0.9554	0.3394
$E\left[(z^2_t - 1)(z^2_{t-j} - 1)\right]$			2.1855	0.3353
$E\left[(z^3_t)(z^3_{t-j})\right]$			4.7440	0.0933	.
$E\left[(z^4_t - 3)(z^4_{t-j} - 3)\right]$			1.6057	0.4480
$J$			11.0405	0.3544
Signif. codes: 0 '*' 0.001 '' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypothesis(H0) : E[Moment] = 0
Reference: Hansen,L.P. (1982), Large sample properties of generalized method of moments estimators, Econometrica, 50(4), 1029--1054.

2.2 Parameter Constancy

The parameter constancy tests of Nyblom (1989) and Hansen (1992) are Lagrange multiplier parameter stability tests. The test of Nyblom (1989) tests the null hypothesis that all parameters are constant against the alternative that they follow a martingale process, whilst that of Hansen (1992) tests the null hypothesis of constancy for individual parameters. The distribution of the statistic is non-standard, and related to the distribution of squares of independent Brownian bridges which has the following series expansion⁴:

\[ \sum^{\infty}_{k=1}\frac{1}{\left(\pi k\right)^2}\chi^2_{k}\left(p\right) \tag{5}\]

where $p$ are the number of parameters. Hansen (1992) provides a table with critical values for up to 20 parameters based on simulation which has been used extensively as a reference point when conducting this test. Instead, the tstests package uses an internal dataset generated by simulation for up to 40 parameters in addition to a kernel smoothed density in order to generate the p-values directly. It should be noted that neither the individual nor joint tests provide any information about the location of breaks, and the distribution is only valid for stationary data.

To illustrate the use of the test, we continue with the same model as in the previous section, and turn on the argument to provide guidance on whether to reject the null hypothesis at the 5% level of significance.

Code

test <- nyblom_test(residuals(mod, standardize = TRUE), scores = estfun(mod), 
                    parameter_names = names(coef(mod)), 
                    parameter_symbols = mod$parmatrix[estimate == 1]$symbol)
as_flextable(test, use.symbols = TRUE, footnote.reference = TRUE, include.decision = T)

Table 3: Nyblom Parameter Stability

	Statistic	Pr(>t)		Decision(5%)
$\mu$	0.1599	0.3499		Fail to Reject H0
$\omega$	0.5565	0.0285	*	Reject H0
$\alpha_1$	0.7846	0.0079	**	Reject H0
$\alpha_2$	0.7569	0.0092	**	Reject H0
$\gamma_1$	0.4700	0.0473	*	Reject H0
$\gamma_2$	0.0714	0.7169		Fail to Reject H0
$\beta_1$	0.5187	0.0354	*	Reject H0
$\zeta$	1.8639	0.0000	***	Reject H0
$\nu$	0.8672	0.0050	**	Reject H0
$\mathrm{J}$	5.3285	0.0000	***	Reject H0
Signif. codes: 0 '*' 0.001 '' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypothesis(H0) : Constant Parameters
Reference: Nyblom,J. (1989), Testing for the constancy of parameters over time, Journal of the American Statistical Association, 405, 223--230.

Table 3 indicates that individually we can reject the null of parameter constancy on most of the parameters at the 5% level as well as jointly (J).

2.3 Sign Bias

The sign bias test of Engle and Ng (1993) evaluates the presence of leverage effects in the standardized residuals (to capture possible misspecification of a GARCH model), by regressing the squared standardized residuals on lagged negative and positive shocks as follows:

\[ z^2_t = c_0 + S^{-}_{t-1} + c_2 S^{-}_{t-1}\varepsilon_{t-1} + c_3 S^{+}_{t-1}\varepsilon_{t-1} + u_t \tag{6}\]

where $S^{-}_{t-1} = I_{\varepsilon_{t-1}<0}$, $S^{+}_{t-1} = I_{\varepsilon_{t-1}\ge0}$. and $\varepsilon_t$ the model residuals. The null hypothesis is that all coefficients $\{c_1,c_2,c_3\}$ are individually and jointly zero. The joint test is a Wald test distributed as $\chi^2(3)$.

Table 4 shows the output of this test on the residuals of the model used so far, where the p-values indicate that we cannot reject the null of no sign bias either individually or jointly.

Code

test <- signbias_test(residuals(mod), sigma = sigma(mod))
as_flextable(test, use.symbols = TRUE, footnote.reference = TRUE)

Table 4: Sign Bias Test

	Estimate	Std. Error	t value	Pr(>\|t\|)
$S^{-}_{t-1}$	0.0648	0.0693	0.9347	0.3500
$S^{-}_{t-1}\varepsilon_{t-1}$	3.9303	3.9425	0.9969	0.3189
$S^{+}_{t-1}\varepsilon_{t-1}$	3.4217	4.2902	0.7976	0.4251
$\mathrm{J}$			1.6447	0.6493
Signif. codes: 0 '*' 0.001 '' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypothesis(H0) : No sign bias
Reference: Engle RF, Ng VK (1993). Measuring and testing the impact of news on volatility. The Journal of Finance, 48(5), 1749--1778.

3 Forecast Evaluation Tests

3.1 Density Forecast

A novel method to analyze how well a conditional density fits the underlying data is through the probability integral transformation (PIT) discussed in Rosenblatt (1952) and defined as:

\[ {x_t} = \int\limits_{ - \infty }^{{y_t}} {\hat f\left( u \right)du = \hat F\left( {{y_t}} \right)} \tag{7}\]

which transforms the data $y_t$, using the estimated distribution $\hat F$ into i.i.d. $U(0,1)$ under the correctly specified model.

Because of the difficulty in testing a $U(0,1)$ sequence, the PIT data ($x_t$) is transformed into $N(0,1)$ by Berkowitz (2001) using the normal quantile ($\Phi^{-1}$) function:

\[ z_t = \Phi^{-1}\left(x_t\right) \tag{8}\]

Under the null of a correctly forecast density the values ($z_t$) should be independent across observations, with mean zero, unit variance and non-significant autoregressive terms (zero). For example, the following first order autoregressive dynamics can be tested:

\[ z_t = \mu + \rho\left(z_{t-1} - \mu\right) + \varepsilon_t \tag{9}\]

as the unrestricted equation in a likelihood ratio (LR) test against the restricted equation where $\mu=0$, $\sigma=1$ and $\rho=0$. The LR test is then:

\[ \mathrm{LR} = -2\left(L\left(0,1,0\right) - L\left(\hat\mu,\hat\sigma,\hat\rho\right)\right) \tag{10}\]

where $L$ is the log-likelihood. The LR test statistic is asymptotically distributed as $\chi^2(2+m)$ where $m$ are the number of autoregressive lags.

A final testing step can also be included which looks at the goodness of fit of the transformed series into the Normal. Following Dowd (2004), the output includes the Normality test of Jarque and Bera (1987).

We use the GARCH based pre-computed backtest⁵ in the package for the example:

Code

p <- pdist('jsu', q = garch_forecast$actual, mu = garch_forecast$forecast,
sigma = garch_forecast$sigma, skew = garch_forecast$skew,
shape = garch_forecast$shape)
as_flextable(berkowitz_test(p))

Table 5: Berkowitz Forecast Density Test

Test	DoF	Statistic	Pr(>Chisq)
Berkowitz	3	3.2655	0.3525
Jarque-Bera	2	3.5131	0.1726
Signif. codes: 0 '*' 0.001 '' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypothesis(H0) : Normal(0,1) with no autocorrelation

Both the p-values of the LR and Jarque-Bera tests are greater than 5% so at this level we fail to reject the null hypothesis, and find the density forecast adequately meets the requirements of these tests.

3.2 Directional Accuracy

High and significant Directional Accuracy could imply either an ability to predict the sign of the mean forecast or could merely be the result of volatility dependence in the absence of mean predictability as argued by Christoffersen and Diebold (2006).

Pesaran and Timmermann (1992) provide a non-parametric test to evaluate the magnitude and significance of the forecast’s directional accuracy. Let $y_t$ denote the actual series and $x_t$ the forecast of that series, then the null hypothesis is that the two are independent (the forecast cannot predict the actual). This is a Hausman type test with statistic:

\[ S_n = \frac{\hat P - \hat P_*}{\sqrt{\left[\hat V\left(\hat P\right) - \hat V\left(\hat P_*\right)\right]}} \tag{11}\]

where:

\[ \hat P = \frac{1}{n} \sum^n_{t=1}H\left(y_t x_t\right) \tag{12}\]

representing the proportion of times that x correctly predicts y, with $H(.)$ being the Heaviside function taking values of 1 if the signs of the forecast and actual agree and 0 otherwise.

Further,

\[ \hat P_* = \hat P_y \hat P_x + \left(1 - \hat P_y\right)\left(1 - \hat P_x\right) \tag{13}\]

with $\hat P_y$ and $\hat P_x$ the proportion of positive cases in y and x respectively. Equation 13 represents the expected proportion of times that x would correctly predict y, under the assumption that they are independently distributed (the null). This difference between realized proportion and expected proportion under the null forms the basis of the Hausman test statistic. The denominator standardizes this difference in proportions by the difference in the variance of the two proportions, with:

\[ \hat V\left(\hat P\right) = \frac{1}{n}\hat P_*\left(1 - \hat P_*\right) \tag{14}\]

and

\[ \begin{aligned} \hat V\left(\hat P_*\right) =& \frac{1}{n}\left(2\hat P_y-1\right)^2\hat P_x\left(1-\hat P_x\right) + \\ & \frac{1}{n}\left(2\hat P_x-1\right)^2\hat P_y\left(1-\hat P_y\right)+\\ & \frac{4}{n^2}\hat P_y\hat P_x\left(1 - \hat P_y\right)\left(1 - \hat P_x\right) \end{aligned} \tag{15}\]

The statistic $S_n$ is asymptotically distributed as $N\left(0,1\right)$.

While Pesaran and Timmermann (1992) test for sign predictability, Anatolyev and Gerko (2005) test for mean predictability based on normalized excess profitability implied by a directional strategy which goes long or short depending on the sign of the forecast. One of the key differences between these two tests relates to the observed asymmetric nature of $y_t$, which means that it may matter when $x_t$ fails to predict $y_t$ if during those times the loss (or missed gain) is high enough to make the strategy relatively unprofitable despite having significant directional accuracy. Consider the 1-period return of the trading strategy:

\[ r_t = \textrm{sign}\left(x_t\right)y_t \tag{16}\]

with expected return

\[ A_n = \frac{1}{n}\sum^n_{t=1}r_t \tag{17}\]

A benchmark strategy which issues buy and sell signals at random with proportions of buys and sells the same as in the trading strategy, has the following expected return:

\[ B_n = \left(\frac{1}{n}\sum^n_{t=1}\textrm{sign}\left(x_t\right)\right)\left(\frac{1}{n}\sum^n_{t=1}y_t\right) \tag{18}\]

Under the null of conditional mean independence, then $E\left[y_t|I_{t-1}\right] = \mathrm{constant}$. If the strategy has predictive power, it will generate higher returns on average than the benchmark, and $A_n - B_n$ will be greater than zero and significant. To test the significance of this difference, it needs to be normalized by the the variance of this difference ($V$):

\[ V = \frac{4}{n^2}\hat p_y\left(1 - \hat p_y\right)\sum^n_{t=1}\left(y_t-\hat y\right)^2 \tag{19}\]

where $\hat p_y=\frac{1}{2}\left(1 + \frac{1}{n}\sum^n_{t=1}\mathrm{sign}\left(\hat y_t\right)\right)$

Similar to the test of Pesaran and Timmermann (1992), this is also a Hausman test with statistic $\frac{A_n - B_n}{\sqrt(V)}$, asymptotically distribution as $N\left(0,1\right)$.

To illustrate we use an ARMA(1,1) pre-computed backtest forecast available in the package:

Code

data(arma_forecast)
as_flextable(dac_test(arma_forecast$actual, arma_forecast$forecast))

Table 6: Directional Accuracy Tests

Test	Statistic	Pr(>\|t\|)
PT: (Sign)	0.3594	0.3596
AG: (Mean)	0.1144	0.4545
Signif. codes: 0 '*' 0.001 '' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Obs: 250, Accuracy: 0.5, Prevalence: 0.464
Hypothesis(H0) : PT: No Predictability (Sign)
Hypothesis(H0) : AG: No Predictability (Mean)

The p-values for both tests in Table 6 suggest that we fail to reject the null of no predictability, with high confidence.

3.3 Forecast Unbiasedness

To understand the properties of a good forecast, we start by considering what an optimal forecast should look like. Consider the Wald representation of $y$, at horizon h:

\[ y_{n+h} = \mu + \epsilon_{n+h} + b_1\epsilon_{n+h-1} + b_2\epsilon_{n+h-2} + \ldots \tag{20}\]

The optimal h-step forecast is then

\[ y_{n+h|n} = \mu + b_h\epsilon_n + b_{h+1}\epsilon_{n-1} + \ldots \tag{21}\]

with optimal forecast error

\[ \epsilon_{n+h|n} = \epsilon_{n+h} + b_1\epsilon_{n+h-1} + b_2\epsilon_{n+h-2} + \ldots + b_{h-1}\epsilon_{n+1} \tag{22}\]

and variance of the h-step forecast error increasing in h.

The optimal forecast error is unbiased with $E\left[\epsilon_{n+h|n}\right] = 0$, with the h-step forecast errors having at most an MA(h-1) structure and the 1-step ahead forecast error $\epsilon_{n+1|n} = \epsilon_{n+h}$ being white noise. This implies that :

\[ \epsilon_{n+h|n} = \alpha + \beta y_{n+h|n} + \varepsilon_{n+h} \tag{23}\]

will have $\alpha=0$ and $\beta=0$.

Since

\[ \epsilon_{n+h|n} = y_{n+h} - y_{n+h|n} \tag{24}\]

then,

\[ y_{n+h|n} = \alpha + \beta x_{n+h|n} + \varepsilon_{n+h|n} \tag{25}\]

can be tested under the null of unbiasdness with $\alpha=0$ and $\beta=1$ using a Wald test. This is the essence of the regression test by Mincer and Zarnowitz (1969). It effectively tests for forecast bias, though it does not say anything about forecast variance.

Table 7 shows the output of the test using a 25-step ahead forecast on a chosen subsample of SPY log returns from an ARMA(12,0) model. We fail to reject the NULL of $\alpha=0$ and $\beta=1$ given the results of the Wald test with Pr(>Chisq) = 0.91.

Code

library(tsarma)
library(xts)
spyr <- na.omit(diff(log(tsdatasets::spy)))
spec <- arma_modelspec(spyr[2000:2500], order = c(12,0))
mod <- estimate(spec)
p <- predict(mod, h = 25, forc_dates = index(spyr[2501:2525]))
test <- minzar_test(actual = as.numeric(spyr[2501:2525]), 
                    forecast = as.numeric(p$mean), vcov. = sandwich::vcovHAC)
as_flextable(test, footnote.reference = T, digits = 2)

Table 7: Mincer-Zarnowitz Test

	Estimate	Std. Error	t value	Pr(>\|t\|)
constant	-0.00	0.00	-0.28	0.78
forecast	1.04	3.15	0.33	0.74
J			0.18	0.91
Signif. codes: 0 '*' 0.001 '' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypothesis(H0) : Unbiased Forecast
Reference: Mincer JA. and Zarnowitz V. (1969), 'The evaluation of economic forecasts.' In Economic forecasts and expectations: Analysis of forecasting behavior and performance, NBER, 3--46

3.4 Expected Shortfall

The test of Du and Escanciano (2017) combines ideas from Berkowitz (2001) and Christoffersen (1998) to create an unconditional and conditional shortfall test based on the probability integral transformed realizations, conditioned on the forecast distribution, to evaluate the severity and independence of the shortfall residuals.

3.5 The Unconditional Test

The unconditional test assesses the expected value of the cumulative violations beyond the Value at Risk (VaR) threshold, using a two-sided t-test, with the following statistic:

\[ U_{ES} = \frac{\sqrt{n}\left(\bar H\left(\alpha\right) - \frac{\alpha}{2}\right)}{\sqrt{\alpha\left(\frac{1}{3} - \frac{\alpha}{4}\right)}} \tag{26}\]

where the term $\frac{\alpha}{2}$ is the expected value under a correctly calibrated risk model. $\bar H\left(\alpha\right)$ denotes the sample mean of $H_t\left(\alpha\right)$:

\[ \bar H\left(\alpha\right)=\frac{1}{n}\sum^n_{t=1}H_t\left(\alpha\right) \tag{27}\]

with $H_t\left(\alpha\right)$ the cumulative failures (violations beyond VaR) such that:

\[ H_t\left(\alpha\right) = \frac{1}{\alpha}\left(\alpha - u_t\right)\mathrm{I}\left(u_t\le\alpha\right). \tag{28}\] The vector $u$ is the probability integral transformation of the future realization given the forecast distribution.

The distribution of the test statistic $U_{ES}$ is asymptotically distributed as $N(0,1)$. Since the statistic is bounded in the unit interval, the confidence intervals need to be constrained to the be between [0,1]. Therefore, the p-value will take the following form:

\[ Pr(>|t|) = 2 \min{\left(\mathrm{Pr}\left[|U_{ES}|\le x\right],1-\mathrm{Pr}\left[|U_{ES}|\le x\right]\right)} \tag{29}\]

3.6 The Conditional Test

The conditional test assesses not only the correctness and significance of the expected value of the cumulative failures but also their independence. Consider the auto-covariance of the cumulative violations for lag j:

\[ \gamma_j = \frac{1}{n-j}\sum^n_{t=j+1} \left(H_t - \frac{\alpha}{2}\right)\left(H_{t-j}-\frac{\alpha}{2}\right). \tag{30}\]

The autocorrelation is then equal to:

\[ \rho_j = \frac{\gamma_j}{\gamma_0} \tag{31}\] and the test statistic for m lags is:

\[ C_{ES} = n\sum^m_{j=1}\rho^2_j \tag{32}\]

which is asymptotically distributed as $\chi^2_m$, and is independent of $\alpha$.

3.7 Example

The following example, using the pre-computed GARCH backtest data, highlights the approach to using the shortfall_de_test function.

Code

library(tsdistributions)
data("garch_forecast")
# the pit data
x <- pdist("jsu", q = garch_forecast$actual, mu = garch_forecast$forecast,
sigma = garch_forecast$sigma, skew = garch_forecast$skew,
shape = garch_forecast$shape)
test <- shortfall_de_test(x, alpha = 0.05, lags = 4)
as_flextable(test, footnote.reference = T) |> width(j = 1:3,width = 1.2)

Table 8: Expected Shortfall Test

Test	Statistic	Pr(>\|t\|)
DE (U)	0.0277	0.7377
DE (C)	2.7680	0.5974
Signif. codes: 0 '*' 0.001 '' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Coverage: 0.05, Obs: 250
Hypothesis(H0) : Unconditional(U) and Independent(C)
Reference: Du Z. and Escanciano J.C. (2017). Backtesting expected shortfall: accounting for tail risk. Management Science, 63(4), 940--958.

3.8 Value at Risk: Coverage and Duration

The Value at Risk (VaR) tests in the tstests package are based on the proportion of failures test of Kupiec (1995), the conditional coverage test of Christoffersen (1998), and the duration between failures test of Christoffersen and Pelletier (2004). These are summarized in the next sections.

3.9 Proportion of Failures (`UC`)

The test of Kupiec (1995) assesses whether the proportion of failures is consistent with the expected proportion of failures at a given level of VaR. This is done by summing the number of violation (binary) and dividing by the length of the forecasts.⁶ Under the null hypothesis that the proportion of failures ($\pi$) is equal to the VaR level ($\alpha$), then the restricted model likelihood is:

\[ \mathrm{L}_r = \left(1 - \alpha\right)^{n}\alpha^{k} \tag{33}\] and the unrestricted (observed) model likelihood:

\[ \mathrm{L}_u = \left(1 - \pi\right)^{n}\pi^{k} \] where $\pi=\frac{n}{\left(n + k\right)}$, $n$ are the number of observations, and $k$ the number of failures. A likelihood ratio test can then be conducted with statistic:

\[ \mathrm{LR}_{uc} = -2\log{\frac{\mathrm{L}_r}{\mathrm{L}_u}} \tag{34}\] which is distributed as $\chi^2_1$.

3.10 Independence of Failures (`CCI`)

The proportion of failures (or unconditional) test of Kupiec (1995) tests the coverage of the interval but has no power in detecting whether they are independent. Christoffersen (1998) proposed a test to explicitly check the independence assumption against a first order Markov alternative. Consider a binary, first order Markov chain with transition probability matrix:

\[ \Pi_1 = \begin{bmatrix} 1-\pi_{01} & \pi_{01}\\ 1-\pi_{11} & p_{11} \end{bmatrix} \tag{35}\]

where $\pi_{ij}=Pr\left(I_t=j|I_{t-1}-i\right)$, and $I_t$ is the indicator function denoting failures. The approximate likelihood of this process is then

\[ \mathrm{L}_u = \left(1-\pi_{01}\right)^{n_{00}}\pi_{01}^{n_{01}}\left(1-\pi_{11}\right)^{n_{10}}\pi_{11}^{n_{11}} \tag{36}\] where $n_{ij}$ is the number of observation with value i followed by j. For example, $n_{10}$ represents the number of periods with failures followed by periods with no failures. For the restricted model under the null of independence, the first order Markov chain has the following transition probability matrix:

\[ \Pi_1 = \begin{bmatrix} 1-\pi_2 & \pi_2\\ 1-\pi_2 & \pi_2\\ \end{bmatrix} \tag{37}\] with the following likelihood:

\[ \mathrm{L}_r = \left(1 - \pi_2\right)^{n_{00} + n_{10}}\pi_2^{n_{01} + n_{11}} \tag{38}\]

Finally, the likelihood ratio for the test of independence can be expressed as:

\[ \mathrm{LR}_{cci} = -2\log\frac{L_r}{L_u} \tag{39}\]

which is asymptotically distributed as $\chi^2\left(1\right)$.

3.11 Conditional Coverage (`CC`)

The likelihood ratio for the joint test of coverage and independence is simply the sum of the Kupiec coverage and independence likelihood ratios:

\[ \mathrm{LR}_{cc} = \mathrm{LR}_{uc} + \mathrm{LR}_{cci} \tag{40}\] which is asymptotically distributed as $\chi^2\left(2\right)$.

3.12 Duration between Failures (`D`)

The conditional coverage independence test in the previous section has limited power in detecting temporal dependence beyond the simple first order Markov structure. To provide a more powerful test Christoffersen and Pelletier (2004) proposed a more general structure using the duration of time between VaR failures (the no-hit duration), defined as :

\[ D_i = t_i - t_{i-1} \tag{41}\]

where $t_i$ denotes the time index of violation number $i$. Under the null hypothesis of a correctly specified risk model, this no-hit duration ($D$) should have no memory with a mean duration of $\frac{1}{\alpha}$ periods. Since the only continuous memory free random distribution is the exponential, then under the null hypothesis the distribution of the no-hit duration should be:

\[ f_{\mathrm{exp}}\left(D;\alpha\right) = p\exp\left(-\alpha D\right). \tag{42}\] In order to test for this, an encompassing distribution which allows for duration dependence and also embeds the exponential is required. The Weibull distribution offers one such example, with distribution:

\[ f_{\mathrm{W}}\left(D;a,b\right) = a^bbD^{b-1}\exp\left(-\left(aD\right)^b\right). \tag{43}\] The exponential is a special case when $b=1$. Therefore, the null hypothesis of a memory-less duration process corresponds to $b=1$ and $a=\alpha$, which can be tested using a likelihood ratio test distributed as $\chi^2\left(1\right)$.⁷

3.13 Example

Table 9 shows the 4 tests for VaR, with p-values well above 10% indicating a correctly specified model for this series during the out of sample period tested.

Code

library(tsdistributions)
data("garch_forecast")
q <- qdist("jsu", p = 0.05, mu = garch_forecast$forecast, sigma = garch_forecast$sigma,
skew = garch_forecast$skew, shape = garch_forecast$shape)
test <- var_cp_test(actual = garch_forecast$actual, forecast = q, alpha = 0.05)
as_flextable(test, footnote.reference = T) |> width(j = 1:3,width = 1.2)

Table 9: Value at Risk Tests

Test	DoF	Statistic	Pr(>Chisq)
Kupiec (UC)	1	0.9514	0.3294
CP (CCI)	1	0.0009	0.9763
CP (CC)	2	0.9522	0.6212
CP (D)	1	0.0821	0.7745
Signif. codes: 0 '*' 0.001 '' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Coverage: 0.05, Obs: 250, Failures: 16, E[Failures]: 12
Hypothesis(H0) : Unconditional(UC), Independent(CCI), Joint Coverage(CC) and Duration(D)
References:
Kupic, P. (1995), Techniques for verifying accuracy of risk measurement models, Journal of Derivatives, 3, 73--84.
Christoffersen, P. (1998), Evaluating Interval Forecasts, International Economic Review, 39, 841--862.
Christoffersen, P. and Pelletier, D. (2004), Backtesting value-at-risk: A duration-based approach, Journal of Financial Econometrics, 2(1), 84--108.

4 References

Anatolyev, S., and A. Gerko, 2005, A trading approach to testing for predictability, Journal of Business & Economic Statistics 23, 455–461.

Berkowitz, J., 2001, Testing density forecasts, with applications to risk management, Journal of Business & Economic Statistics 19, 465–474.

Christoffersen, Peter F, 1998, Evaluating interval forecasts, International Economic Review, 841–862.

Christoffersen, P.F., and F.X. Diebold, 2006, Financial asset returns, direction-of-change forecasting, and volatility dynamics, Management Science 52, 1273–1287.

Christoffersen, P., and D. Pelletier, 2004, Backtesting value-at-risk: A duration-based approach, Journal of Financial Econometrics 2, 84–108.

Dowd, Kevin, 2004, A modified berkowitz back-test, Risk Magazine 17, 86–87.

Du, Zaichao, and Juan Carlos Escanciano, 2017, Backtesting expected shortfall: Accounting for tail risk, Management Science 63, 940–958.

Engle, Robert F, and Victor K Ng, 1993, Measuring and testing the impact of news on volatility, The Journal of Finance 48, 1749–1778.

Hamilton, James Douglas, 2020, Time Series Analysis (Princeton university press).

Hansen, Bruce E, 1992, Testing for parameter instability in linear models, Journal of Policy Modeling 14, 517–533.

Hansen, L.P., 1982, Large sample properties of generalized method of moments estimators, Econometrica 50, 1029–1054.

Jarque, Carlos M, and Anil K Bera, 1987, A test for normality of observations and regression residuals, International Statistical Review/Revue Internationale de Statistique, 163–172.

Kupiec, P.H., 1995, Techniques for verifying the accuracy of risk measurement models, The Journal of Derivatives 3, 73–84.

Mincer, Jacob A, and Victor Zarnowitz, 1969, The evaluation of economic forecasts, Economic forecasts and expectations: Analysis of forecasting behavior and performance (NBER).

Nyblom, J., 1989, Testing for the constancy of parameters over time, Journal of the American Statistical Association 84, 223–230.

Pesaran, M.H., and A. Timmermann, 1992, A simple nonparametric test of predictive performance, Journal of Business & Economic Statistics 10, 461–465.

Rosenblatt, M., 1952, Remarks on a multivariate transformation, The Annals of Mathematical Statistics 23, 470–472.

Footnotes

It is also possible to correct this estimator for serial correlation as suggested by Hamilton (2020), but we leave this for future investigation.↩︎
Asymptotically distributed as $\chi^2$.↩︎
All joint tests are denoted by the capital letter $J$ in the package.↩︎
From equation 3.3 of Nyblom (1989).↩︎
Based on the SPY log returns data (tsdatasets::spy()) using a GARCH(1,1)-JSU model (see documentation for details and replication code).↩︎
Under the assumption of i.i.d observations, the sequence of failures is distributed as Bernoulli($\alpha$).↩︎
The actual implementation requires calculating the duration of the hit sequence, the censored observations and combining all this to construct the log-likelihood which needs to be solved using numerical methods for the unrestricted likelihood.↩︎

1 Introduction

2 Goodness of Fit Tests

2.1 GMM Orthogonality

2.2 Parameter Constancy

2.3 Sign Bias

3 Forecast Evaluation Tests

3.1 Density Forecast

3.2 Directional Accuracy

3.3 Forecast Unbiasedness

3.4 Expected Shortfall

3.5 The Unconditional Test

3.6 The Conditional Test

3.7 Example

3.8 Value at Risk: Coverage and Duration

3.9 Proportion of Failures (UC)

3.10 Independence of Failures (CCI)

3.11 Conditional Coverage (CC)

3.12 Duration between Failures (D)

3.13 Example

4 References

Footnotes

3.9 Proportion of Failures (`UC`)

3.10 Independence of Failures (`CCI`)

3.11 Conditional Coverage (`CC`)

3.12 Duration between Failures (`D`)