The tsgam package provides a custom wrapper of the gam function from the mgcv package of Wood (2022), in order to conform to the methods and outputs of the tsmodels framework. In particular, the predict method will return the simulated distribution of the mean forecast, and optionally includes the possibility of meta-fitting and using one of the distributions from the tsdistributions package for the uncertainty.
The reason for introducing this model into the framework is due to the flexibility and robustness of the smooth basis functions in Generalized Additive Models (GAMs), the speed of estimation when working with large data sets, the ability to capture complex and multiple seasonal periods and potentially many regressors which are available at the forecast horizon which may not be purely linear in the response.
2 Package Demo: Electricity Load
We highlight the package functionality using the Total electricity load in Greece dataset from the tsdatasets package.
2.1 Estimation
The model specification, unlike previous packages in the tsmodels repo, takes as argument a formula describing the GAM model, in line with how many linear models work in R. This is a flexible approach, but does require an understanding of the types of smooth constructs available in mgcv.
We first load the data and construct some additional information from the index and load information. We use the 48 hour lagged value of load to be conservative in terms of release schedules of the actual load and forecast delays. We also construct hour of day, day of week and month of year variables to capture stable seasonal patterns using cyclic cubic regression splines. In practice, one would hopefully have more information to include such as forecast temperatures, forecast demand etc.
We’ll estimate 2 models and compare them using an F-test. The first model simply uses the 48 hour lag of the load for prediction, whilst the second model also makes use of the seasonal factors. We’ll also use approximately 90% of the data for training and leave 10% for out of sample prediction.
The estimated objects include a slot for model which is the mgcv estimated object, from which we can use existing methods for evaluation. We first compare the extended model with the baseline model using a F-test.
Code
anova(mod_baseline$model, mod_extended$model, test ="F")
Even though the extended model has significantly more parameters, the test indicates that it provides a better fit than the baseline. We therefore proceed with this model.
The extended model explains about 85% of the deviance. Looking at the QQ plot of the residuals using the appraise function from the gratia package suggests that the Gaussian assumptions is reasonable.
Code
appraise(mod_extended$model)
The figure below shows the marginal effect plot of the months on Load, with a clear peak during summer and winter months as load goes up due to temperature and hence the use of air-conditioners and heaters respectively (although the inference will be slightly misleading here since we are using UTC whereas the local time zone is EET).
We can also plot the additive decomposition of the components using the existing tsdecompose method which is available for both estimate and predicted objects.
The prediction method does not have any n.ahead arguments as in other time series methods in the tsmodels framework, instead being completely determined by the newdata argument. Additionally, the argument distribution can be used to proxy the simulated distribution by any other in the tsdistributions package if it is believed that the Gaussian assumption is too restrictive. Choosing another distribution will lead to the estimation of the distributional parameters on the response residuals followed by simulation of the distribution given these parameters and the forecast mean. Comparison of different simulated predictive distributions can then be carried out using for instance the CRSP or MIS functions from the tsaux package. A further argument, tabulate, returns a data.table object of the draws, forecast dates and predictions in long format which may be useful in production settings when such values need to be stored in a database.
Note that, even though the training dataset is of length 20,000 (hours), this does not mean that we are performing such a long range forecast since the 48 hour lag of the actual load effectively allows us to forecast, at time T, from T+1 to T+48. Without re-estimation, this is equivalent, in the other state space type models, to a horizon of 48 with updating every 48 hours (filtering with new data and then re-forecasting). An example and comparison is provided in the next section.
Plotting the simulated predictive distribution against the realized values indicates reasonable performance, which we confirm with a table of statistics.
The simple GAM model appears to outperform, for this series, the ISSM model on all 3 metrics.
There is also a tsbacktest method which will be expanded on in a separate post as this takes a training and testing split list of time indices which departs from the current approach in the other packages but offers more flexibility and is more suitable for models of this type.