key: cord-0138204-t5wsw1pk
authors: Coulombe, Philippe Goulet; Marcellino, Massimiliano; Stevanovic, Dalibor
title: Can Machine Learning Catch the COVID-19 Recession?
date: 2021-03-01
journal: nan
DOI: nan
sha: 11957778d898bb0b7b6773c8706a36e310c5b140
doc_id: 138204
cord_uid: t5wsw1pk

Based on evidence gathered from a newly built large macroeconomic data set for the UK, labeled UK-MD and comparable to similar datasets for the US and Canada, it seems the most promising avenue for forecasting during the pandemic is to allow for general forms of nonlinearity by using machine learning (ML) methods. But not all nonlinear ML methods are alike. For instance, some cannot extrapolate (like regular trees and forests) and some can (when complemented with linear dynamic components). This and other crucial aspects of ML-based forecasting in unprecedented times are studied in an extensive pseudo-out-of-sample exercise.

Forecasting economic developments during crisis time is problematic since the realizations of the variables are far away from their average values, while econometric models are typically better at explaining and predicting values close to the average, particularly so in the case of linear models.

The situation is even worse for the Covid-19 induced recession, during which typically well-performing econometric models such as Bayesian VARs with stochastic volatility have trouble tracking the unprecedented fall in real activity and labour market indicators; see, for the US, Carriero et al. (2020) and Plagborg-Møller et al. (2020), or An and Loungani (2020) for an analysis of the past performance of the Consensus Forecasts.

As a partial solution, Foroni et al. (2020) employ simple mixed-frequency models to nowcast and forecast quarterly GDP growth rates for the US and the rest of the G7, using common monthly indicators such as industrial production, surveys, and the slope of the yield curve. They then adjust the forecasts by a specific form of intercept correction, or estimate by the similarity approach (see Clements and Hendry (1999) and Dendramis et al. (2020)), showing that the former can reduce the extent of the forecast error during the Covid-19 period. Schorfheide and Song (2020) do not include COVID periods in the estimation of a mixed-frequency VAR model because those observations substantially alter the forecasts. An alternative approach is the specification of sophisticated nonlinear / time-varying models. While this is not without perils when used on short economic time series, it can yield some gains; see e.g. Ferrara et al. (2015) in the context of forecasting during the financial crisis using Markov-switching, threshold and other types of random parameter models.

The goal of this paper is to go one step further in terms of model sophistication, by considering a variety of machine learning (ML) methods and assessing whether and to what extent they can improve the forecasts, both in general and specifically during the Covid-19 crisis. We focus on the UK economy, which at the same time was also experiencing substantial Brexit-related uncertainty. A related paper, but with a focus on the largest euro area countries, is Huber et al. (2020), who introduce Bayesian Additive Regression Tree-VARs (BART-VARs) for Covid. They develop a nonlinear mixed-frequency VAR framework by incorporating regression trees and exploiting their ability to model outliers and to disentangle the signal from noise. Indeed, regression trees (and even more so forests) are able to adapt quickly to extreme observations and to detect a switch in the underlying regime. Another relevant related paper is Goulet Coulombe et al. (2019), which however does not include an analysis of the Covid-19 period and focuses on the US. A third related paper, again with a focus on the US, is Clark et al. (2021), who consider alternative specifications of BART-VARs, possibly also with a nonparametric specification for the time-varying volatility, and compare their point, density and tail forecast performance with that of large Bayesian VARs with stochastic volatility, often finding gains, though of limited size.

In line with Goulet Coulombe et al. (2019), we consider five nonlinear nonparametric ML methods. Three of them have the capacity to extrapolate and two do not. Specifically, being based on trees, boosted trees (BT) and random forests (RF) cannot predict out-of-sample a value ($\hat{y}_i$) greater than the maximal in-sample value (the same goes for the minimum). This is a simple implication of how forecasts are constructed, basically by taking means over sub-samples chosen in a data-driven way. Clearly, this is an important limitation when it comes to forecasting variables that moved far outside their typical range during the pandemic (like hours worked). No such constraints bind on Macroeconomic Random Forest (MRF), Kernel Ridge Regression (KRR), and Neural Networks (NN). By using a linear part within the leaves, MRF can extrapolate the same way a linear model does, while retaining the usual benefits of tree-based methods (limited or nonexistent overfitting, little to no tuning required, ability to cope with large data). Goulet Coulombe (2020a) notes that this particular feature gives MRF an edge over RF when it comes to forecasting the (once) extreme escalation of the unemployment rate during the Great Recession.
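The extrapolation limit of tree-based forecasts can be seen in a toy example. The sketch below is an illustration only, not the paper's code: the helper names (`fit_stump`, `predict_stump`, `fit_line`) and the data are hypothetical. It fits a one-split regression tree and a least-squares line to a steadily rising series, then forecasts at a point far outside the training range.

```python
# Toy illustration (not the paper's code): a tree forecast is a leaf mean,
# so it can never exceed the largest target seen in the training sample.

def fit_stump(x, y):
    """Fit a one-split regression tree; return (threshold, left_mean, right_mean)."""
    best = None
    for c in sorted(set(x))[:-1]:          # candidate split points
        left = [yi for xi, yi in zip(x, y) if xi <= c]
        right = [yi for xi, yi in zip(x, y) if xi > c]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, c, ml, mr)
    return best[1:]

def predict_stump(stump, x_new):
    """Return the mean of the leaf that x_new falls into."""
    c, ml, mr = stump
    return ml if x_new <= c else mr

def fit_line(x, y):
    """Ordinary least squares with an intercept; return (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [2.0 * v for v in x_train]       # in-sample maximum is 12
x_new = 10.0                               # far outside the training range

tree_forecast = predict_stump(fit_stump(x_train, y_train), x_new)
intercept, slope = fit_line(x_train, y_train)
line_forecast = intercept + slope * x_new
```

The tree forecast stays at or below the in-sample maximum, while the linear fit continues the trend. A linear part inside the leaves, as in MRF, is one way to combine the two behaviours.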

Machine learning algorithms offer ways to approximate unknown and potentially complicated functional forms with the objective of minimizing the expected loss of a forecast over h periods. The focus of the current paper is to construct a feature matrix likely to improve the macroeconomic forecasting performance of off-the-shelf ML algorithms. Let $H_t = [H_{1t}, ..., H_{Kt}]$ for $t = 1, ..., T$ be the vector of variables found in a large macroeconomic dataset, such as the FRED-MD database of McCracken and Ng (2016) or the UK-MD dataset described in the next section, and let $y_{t+h}$ be our target variable. We follow Stock and Watson (2002a,b) and target average growth rates or average differences over h periods ahead:

$$y_{t+h} = g(f_Z(H_t)) + e_{t+h}. \qquad (1)$$

To illustrate this point, define $Z_t \equiv f_Z(H_t)$ as the $N_Z$-dimensional feature vector, formed by combining several transformations of the variables in $H_t$. The function $f_Z$ represents the data preprocessing and/or feature engineering whose effects on forecasting performance we seek to investigate. The training problem for the case of no data pre-processing ($f_Z = I()$) is

$$\min_{g \in \mathcal{G}} \left\{ \sum_{t=1}^{T} \left( y_{t+h} - g(H_t) \right)^2 + \text{pen}(g; \tau) \right\}. \qquad (2)$$

The function $g$, chosen as a point in the functional space $\mathcal{G}$, maps transformed inputs into the transformed targets. pen() is the regularization function whose strength depends on some vector/scalar hyperparameter(s) $\tau$.

In this section we present the main predictive models (for a more complete discussion see, among others, Hastie et al. (2009)) and some additional, less standard, forecasting models we will consider (more details can be found in Goulet Coulombe et al. (2019)). Table 1 lists all the models implemented in the forecasting exercise, together with their respective input matrices $Z_t$.

We consider the autoregressive model (AR), as well as the autoregressive diffusion index (ARDI) model of Stock and Watson (2002a,b). Let $Z_t = [y_t, y_{t-1}, ..., y_{t-P_y}, F_t, F_{t-1}, ..., F_{t-P_f}]$ be our feature matrix; then the ARDI model is given by

$$y_{t+h} = \beta Z_t + \varepsilon_{t+h}, \qquad (3)$$

$$X_t = \Lambda F_t + u_t, \qquad (4)$$

where $F_t$ are $k$ factors extracted by principal components from the $N_X$-dimensional set of predictors $X_t$ and parameters are estimated by OLS. The AR model is obtained by keeping in $Z_t$ only the lagged values of $y_t$. The hyperparameters of both models are specified using the Bayesian information criterion (BIC).

The Elastic Net model simultaneously predicts the target variable $y_{t+h}$ and selects the most relevant predictors from a set of $N_Z$ features contained in $Z_t$, whose weights $\beta := (\beta_i)_{i=1}^{N_Z}$ solve the following penalized regression problem

$$\hat{\beta} := \arg\min_{\beta} \sum_{t=1}^{T} \left( y_{t+h} - Z_t \beta \right)^2 + \lambda \sum_{i=1}^{N_Z} \left( \alpha |\beta_i| + (1 - \alpha) \beta_i^2 \right),$$

where $(\alpha, \lambda)$ are hyperparameters. Here, $Z_t$ contains lagged values of $y_t$, factors and $X_t$. The Lasso estimator is obtained when $\alpha = 1$, while the Ridge estimator imposes $\alpha = 0$, and both use unit weights throughout. We select $\lambda$ and $\alpha$ by grid search, with $\alpha \in \{.01, .02, .03, ..., 1\}$ and $\lambda \in [0, \lambda_{max}]$, where $\lambda_{max}$ is the penalty term beyond which coefficients are guaranteed to be all zero assuming $\alpha \neq 0$. Since those algorithms perform shrinkage (and selection), we do not cross-validate $P_y$, $P_f$ and $k$. We impose $P_y = 6$, $P_f = 6$ and $k = 8$ and let the algorithms select the most relevant features for the forecasting task at hand.
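To make the role of $\lambda_{max}$ concrete, the following sketch works through the single-predictor case, where the Elastic Net solution has a closed form. It assumes the objective is the sum of squared errors plus $\lambda(\alpha|\beta| + (1-\alpha)\beta^2)$ with no $1/T$ scaling; with a different scaling the constants change. The function names and data are hypothetical.

```python
# Sketch under an assumed objective:
#   sum_t (y_t - z_t * b)^2 + lam * (alpha * |b| + (1 - alpha) * b^2)

def soft_threshold(v, k):
    """Shrink v toward zero by k, returning 0 if |v| <= k."""
    if v > k:
        return v - k
    if v < -k:
        return v + k
    return 0.0

def enet_one_predictor(z, y, lam, alpha):
    """Closed-form Elastic Net coefficient when there is a single predictor z."""
    zy = sum(a * b for a, b in zip(z, y))
    zz = sum(a * a for a in z)
    return soft_threshold(2.0 * zy, lam * alpha) / (2.0 * zz + 2.0 * lam * (1.0 - alpha))

def lambda_max(Z_cols, y, alpha):
    """Smallest penalty at which every coefficient is zero (requires alpha > 0)."""
    return 2.0 * max(abs(sum(a * b for a, b in zip(col, y))) for col in Z_cols) / alpha

z = [1.0, -1.0, 2.0, 0.5]
y = [0.9, -1.2, 2.1, 0.4]
alpha_grid = [i / 100 for i in range(1, 101)]   # {.01, .02, ..., 1} as in the text
lam_top = lambda_max([z], y, alpha=0.5)
beta_at_top = enet_one_predictor(z, y, lam_top, alpha=0.5)        # exactly zero
beta_below = enet_one_predictor(z, y, 0.5 * lam_top, alpha=0.5)   # nonzero
```

With $\alpha = 0$ the threshold vanishes and no finite penalty sets the coefficient to zero, which is why the bound is stated for $\alpha \neq 0$.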

This algorithm provides a means of approximating nonlinear functions by combining regression trees. Each regression tree partitions the feature space defined by $Z_t$ into distinct regions and, in its simplest form, uses the region-specific mean of the target variable $y_{t+h}$ as the forecast, i.e. for $M$ leaf nodes

$$\hat{y}_{t+h} = \sum_{m=1}^{M} c_m I(Z_t \in R_m),$$

where $R_1, ..., R_M$ is a partition of the feature space and $c_m$ is the mean of the target in region $R_m$. The input $Z_t$ is the same as in the case of Elastic Net models. To circumvent some of the limitations of regression trees, Breiman (2001) introduced Random Forests. Random Forests consist in growing many trees on subsamples (or nonparametric bootstrap samples) of observations. Only a random subset of features is eligible as the splitting variable at each split, which further decorrelates the trees. The final forecast is obtained by averaging over the forecasts of all trees. In this paper we use 500 trees, which is normally enough to stabilize the predictions. The minimum number of observations in each terminal node is set to 3, while the number of features considered at each split is $\#Z_t / 3$. In addition, we impose $P_y = 6$, $P_f = 6$ and $k = 8$.

This algorithm provides an alternative means of approximating nonlinear functions by additively combining regression trees in a sequential fashion. Let $\eta \in [0, 1]$ be the learning rate, and let $\hat{y}^{(n)}_{t+h}$ and $e^{(n)}_{t+h} := y_{t+h} - \eta \hat{y}^{(n)}_{t+h}$ be the step $n$ predicted value and pseudo-residuals, respectively. Then, for square loss, the step $n + 1$ prediction is obtained as

$$\hat{y}^{(n+1)}_{t+h} = \hat{y}^{(n)}_{t+h} + \rho_{n+1} f(Z_t, c_{n+1}),$$

where $\rho_{n+1}$ is the step size, $f(Z_t, c_{n+1})$ is a regression tree fitted by least squares to the pseudo-residuals, and $c_{n+1} := (c_{n+1,m})_{m=1}^{M}$ are the parameters of a regression tree. In other words, it recursively fits trees on pseudo-residuals. We consider vanilla Boosted Trees where the maximum depth of each tree is set to 10 and all features are considered at each split. We select the number of steps and $\eta \in [0, 1]$ with Bayesian optimization. $Z_t$ contains lagged values of $y_t$, factors and $X_t$, and we impose $P_y = 6$, $P_f = 6$ and $k = 8$.
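The recursion can be sketched in a few lines for squared loss. This is a simplified textbook variant, not the paper's implementation: it uses one-split trees rather than depth-10 trees, computes residuals against the running (already shrunken) forecast, and fixes the learning rate instead of tuning it by Bayesian optimization. All names and data are hypothetical.

```python
# Toy sketch of boosting for squared loss: each step fits a stump to the
# current residuals and adds a shrunken copy of it to the running forecast.

def fit_stump(x, r):
    """Best single split of x for predicting r; returns a prediction function."""
    best = None
    for c in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= c]
        right = [ri for xi, ri in zip(x, r) if xi > c]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, c, ml, mr)
    _, c, ml, mr = best
    return lambda v: ml if v <= c else mr

def boost(x, y, n_steps, eta):
    """Return the final in-sample fit and the in-sample MSE after each step."""
    fitted = [0.0] * len(y)
    mse_path = []
    for _ in range(n_steps):
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        tree = fit_stump(x, resid)
        fitted = [fi + eta * tree(xi) for fi, xi in zip(fitted, x)]
        mse_path.append(sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y))
    return fitted, mse_path

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [v ** 2 for v in x]
fitted, mse_path = boost(x, y, n_steps=50, eta=0.1)
```

In-sample error cannot increase from one step to the next here, which is why the number of steps has to be chosen on held-out data rather than on the training fit.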

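A way to introduce high-order nonlinearities among the predictors in $Z_t$, without specifying a plethora of basis functions, is to opt for the kernel trick. As in Goulet Coulombe et al. (2019), the nonlinear ARDI predictive equation (3) is written in a general nonlinear form $g(Z_t)$ and can be approximated with basis functions $\phi()$ such that

$$y_{t+h} = g(Z_t) + \varepsilon_{t+h} = \phi(Z_t)' \gamma + \varepsilon_{t+h}.$$

The so-called kernel trick is the fact that there exists a reproducing kernel $K()$ such that

$$\hat{E}(y_{t+h} | Z_t) = \sum_{i=1}^{t} \hat{\alpha}_i \langle \phi(Z_i), \phi(Z_t) \rangle = \sum_{i=1}^{t} \hat{\alpha}_i K(Z_i, Z_t).$$

This means we do not need to specify the numerous basis functions: a well-chosen kernel implicitly replicates them. Here we use the standard radial basis function (RBF) kernel

$$K_\sigma(x, x') = \exp\left( -\frac{\| x - x' \|^2}{2 \sigma^2} \right),$$

where $\sigma$ is a tuning parameter to be chosen by cross-validation. In terms of implementation, after factors are extracted via PCA from (4), the forecast of the Kernel Ridge Regression (KRR) diffusion index model is obtained from

$$\hat{E}(y_{t+h} | Z_t) = K_\sigma(Z_t, Z) \left( K_\sigma(Z, Z) + \lambda I_T \right)^{-1} y,$$

where $Z$ and $y$ stack the training inputs and targets and $\lambda$ is the ridge penalty. Here, we impose the same set of inputs, $Z_t$, as in the ARDI model and we fix $P_y = 6$, $P_f = 6$ and $k = 8$.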
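As a rough illustration of how a KRR forecast is computed, the sketch below builds the RBF Gram matrix on a handful of training points, solves the ridge-regularized linear system, and evaluates the forecast at a new point. It assumes the kernel form $\exp(-\|x - x'\|^2 / (2\sigma^2))$; the function names, the tiny linear solver and the data are hypothetical, and a real application would tune $\sigma$ and $\lambda$ by cross-validation.

```python
import math

def rbf(u, v, sigma):
    """Radial basis function kernel exp(-||u - v||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def krr_fit_predict(Z, y, z_new, sigma, lam):
    """Kernel ridge forecast k(z_new)' (K + lam I)^{-1} y."""
    n = len(Z)
    K = [[rbf(Z[i], Z[j], sigma) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    weights = solve(K, y)
    return sum(w * rbf(zi, z_new, sigma) for w, zi in zip(weights, Z))

Z = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]
# With a very small ridge penalty the fit nearly interpolates the training data
near = krr_fit_predict(Z, y, [2.0], sigma=1.0, lam=1e-6)
```

Only the Gram matrix and a linear solve are needed; the basis functions themselves never appear in the computation.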

We consider standard feed-forward networks, and the architecture closely follows that of Gu et al. (2019). Cross-validating the whole network architecture is a difficult task, especially with a small number of observations as is the case in macroeconomic applications. Hence, we use two hidden layers, the first with 32 neurons and the second with 16 neurons. The number of epochs is fixed at 100. The activation function is ReLU and that of the output layer is linear. The batch size is 32 and the optimizer is Adam (Keras default values). The learning rate and the Lasso parameter are chosen by 5-fold cross-validation from the grids {0.001, 0.01} and {0.001, 0.0001}, respectively. We apply early stopping, i.e. training stops once 20 epochs pass without any improvement of the cross-validation MSE. The final prediction is the average of an ensemble of 5 different estimations. $Z_t$ contains lagged values of $y_t$, factors and $X_t$, and we impose $P_y = 6$, $P_f = 6$ and $k = 8$.
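The early-stopping rule can be written down independently of any deep learning library. The sketch below is a plain-Python illustration of the stopping logic only (patience of 20 epochs, at most 100 epochs); the function name and the validation path are hypothetical, and it does not reproduce the Keras implementation used by the authors.

```python
def early_stopping_epoch(val_mse, patience=20, max_epochs=100):
    """Return (stop_epoch, best_epoch), 1-indexed, for a sequence of validation MSEs."""
    best, best_epoch = float("inf"), 0
    for epoch, mse in enumerate(val_mse[:max_epochs], start=1):
        if mse < best:
            best, best_epoch = mse, epoch          # new best: reset the wait
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch               # no improvement for `patience` epochs
    return min(len(val_mse), max_epochs), best_epoch

# Validation error improves for 10 epochs and then stalls
val_path = [1.0 / e for e in range(1, 11)] + [0.2] * 40
stop_epoch, best_epoch = early_stopping_epoch(val_path)
```

In this example the best epoch is the 10th and training halts 20 epochs later.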

Macroeconomic Random Forest (MRF), developed in Goulet Coulombe (2020a), is described by the model

$$y_t = X_t \beta_t + \varepsilon_t, \qquad \beta_t = \mathcal{F}(S_t),$$

where $S_t$ are the state variables governing time variation and $\mathcal{F}$ a forest. $S_t$ is (preferably) a high-dimensional macroeconomic data set. In this paper, it is the same $Z_t$ as in plain RF and Boosting. $X_t$ determines the linear model that we want to be time-varying. Usually $X_t \subset S_t$ is rather small (and focused) compared to $S_t$. For instance, an autoregressive random forest (ARRF) uses lags of $y_t$ for $X_t$. A factor-augmented ARRF (FA-ARRF) adds factors to ARRF's linear part.

The problem is to find the optimal variable $S_j$ (that is, the best $j$ out of the random subset of predictor indexes $\mathcal{J}^-$) with which to split the sample, and the value $c$ of that variable at which to split. The outputs are $j^*$ and $c^*$, which are used to split $l$ (the parent node) into two children nodes, $l_1$ and $l_2$. Hence, the greedy algorithm developed in Goulet Coulombe (2020a) runs this search recursively to construct trees.
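In notation consistent with the description above, one way to write the split search is the following sketch. The within-leaf coefficient vectors $\beta_1$ and $\beta_2$, the candidate children $l_1(j,c)$ and $l_2(j,c)$, and the exact form of the ridge term are assumptions made for illustration; they are not taken from the text, and the time-smoothing weights discussed next are left out.

```latex
\min_{j \in \mathcal{J}^-,\; c \in \mathbb{R}}
\left[
  \min_{\beta_1} \sum_{t \in l_1(j,c)} \left( y_t - X_t \beta_1 \right)^2
    + \lambda \lVert \beta_1 \rVert_2^2
  \;+\;
  \min_{\beta_2} \sum_{t \in l_2(j,c)} \left( y_t - X_t \beta_2 \right)^2
    + \lambda \lVert \beta_2 \rVert_2^2
\right],
\qquad
l_1(j,c) = \{ t \in l : S_{j,t} \le c \}, \quad
l_2(j,c) = \{ t \in l : S_{j,t} > c \}.
```

When $X_t$ is reduced to a constant and $\lambda = 0$, each inner problem returns a leaf mean and the search collapses to the usual regression-tree split.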

As was the case for RF, the bulk of regularization comes from taking the average over a diversified ensemble of trees (generated by both Bagging and a random $\mathcal{J}^- \subset \mathcal{J}$). Nonetheless, the $\beta_t$'s (and the attached prediction) can also benefit from extra (yet mild) regularization. Time-smoothness is made operational by taking the "rolling-window view" of time-varying parameters. That is, the tree solves many weighted least squares (WLS) problems, each of which includes close-by observations. To keep computational demand low, the kernel $w(t; \zeta)$ is a symmetric 5-step Olympic podium. Informally, the kernel puts a weight of 1 on observation $t$, a weight of $\zeta < 1$ on observations $t - 1$ and $t + 1$, and a weight of $\zeta^2$ on observations $t - 2$ and $t + 2$. Note that a small Ridge penalty is added to make sure every matrix inverts nicely (even in very small leaves), so a single tree has in fact two sources of regularization.
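The podium kernel is simple enough to write out directly. The sketch below returns the weights around a given observation; the function name and the handling of sample edges (neighbours outside the sample are dropped) are assumptions for illustration.

```python
def podium_weights(t, n_obs, zeta):
    """Weights w(s; zeta) around observation t: 1 at t, zeta at t±1, zeta^2 at t±2."""
    weights = {}
    for offset in range(-2, 3):
        s = t + offset
        if 0 <= s < n_obs:                  # drop neighbours outside the sample
            weights[s] = zeta ** abs(offset)
    return weights

w = podium_weights(t=5, n_obs=10, zeta=0.5)
```

Setting `zeta=0` leaves a weight of 1 on observation `t` and 0 on its neighbours, so no time smoothing takes place.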

The standard RF is a restricted version of MRF where $X_t = \iota$, $\lambda = 0$, $\zeta = 0$ and the block size for Bagging is 1. In words, the only regressor is a constant, there is no within-leaf shrinkage, and Bagging does not care for serial dependence. It is understood that MRF will have an edge over RF whenever linear signals included in $X_t$ are strong and the number of training observations (or signal-to-noise ratio) is low. The reason for this is simple: MRF nudges the learning algorithm in the right direction rather than hoping for RF to learn everything nonparametrically. Moreover, by providing generalized time-varying parameters (and credible regions for those), MRF lends itself more easily to interpretation.

shrunk (and maybe selected) to 0, using MARX transforms the usual $\beta_{k,p} \to 0$ prior into shrinking each $\beta_{k,p}$ to $\beta_{k,p-1}$ for the $p$-th lag of predictor $k$. For more sophisticated techniques where shrinkage is only implicit (like RF and Boosting), MARX "proposes" to the variable-selecting algorithm pre-assembled groups of lags, which helps avoid the underlying trees wasting splits on a bunch of scattered lags (Goulet Coulombe, 2020a). Goulet Coulombe et al. (2020) report that the transformation is particularly helpful for US monthly real economic activity targets. Adding MARX to the input set $Z_t$ is considered in all models except ARDI and KRR.
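A possible reading of the MARX transformation is sketched below. The text here does not spell out the formula, so the definition used (for each $p = 1, ..., P$, the average of the $p$ most recent observations of a predictor) is an assumption, as are the function name and the data.

```python
def marx_features(series, t, max_lag):
    """Moving averages of the p most recent values of `series` ending at index t, p = 1..max_lag."""
    if t + 1 < max_lag:
        raise ValueError("not enough history for the requested number of lags")
    feats = []
    for p in range(1, max_lag + 1):
        window = series[t - p + 1 : t + 1]   # last p observations, ending at t
        feats.append(sum(window) / p)
    return feats

oil_change = [1.0, -2.0, 3.0, 0.0, 4.0, 6.0]   # hypothetical monthly changes
feats = marx_features(oil_change, t=5, max_lag=6)
```

Under this reading the last element is the six-month average, and each feature bundles a group of consecutive lags rather than a single one.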

Large datasets are now very popular in empirical macroeconomic research since Stock and Watson (2002a,b) 

The dataset contains 112 macroeconomic and financial indicators divided into nine categories: labour, production, retail and services, consumer and retail price indices, producer price indices, international trade, money, credit and interest rate, stock market and finally sentiment and leading indicators. The selection of variables is inspired by McCracken and Ng (2016), Fortin-Gagnon et al.

Our last concern is to balance the resulting panel since some series have missing observations. We opted to apply an expectation-maximization algorithm, assuming a factor model to fill in the blanks as in Stock and Watson (2002b) and McCracken and Ng (2016). We initialize the algorithm by replacing missing observations with their unconditional mean, starting in 1998M1, and then proceed to estimate a factor model by principal components. The fitted values of this model are used to replace missing observations.

Finally, for this application we also add nineteen US macroeconomic and financial aggregates as considered in Banbura et al. (2008). These series include income, production, labour market, housing, consumption and monetary indicators, as well as interest rates and prices. The complete list is available in Appendix D.

Large macroeconomic datasets are mainly used for forecasting and impulse response analysis through the lens of factor modeling (Kotchoni et al., 2019; Bernanke et al., 2005). Indeed, the factors provide a widely used dimension reduction method, but they also serve as an empirical representation of general equilibrium models (Boivin and Giannoni, 2006). Hence, it is important to explore the factor structure of our UK-MD dataset.

After the static factors are estimated by principal components as in Stock and Watson (2002a), we report in Table 2 their marginal contribution to the variance of the variables constituting UK-MD.

For instance, mR 2 i (k) measures the incremental explanatory power of the factor k for the variable i, which is simply the difference between the R 2 after regressing the variable i on the first k and k − 1 factors. The overall marginal contribution of the factor k is the sample average over all variables. Table 2 shows the average mR 2 (k) for each of nine estimated factors, lists ten series that load most importantly on each factor and indicates the group to which the series belongs. For example, factor 1 explains 20.7% of the variation in UK-MD and is clearly a real activity factor as the ten most related variables are indicators of production and services. In particular, it explains 88.7 and 83.6% of variation in the index of services and the index of production in manufacturing respectively.
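The marginal contributions can be computed without running nested regressions when the factors are mutually uncorrelated, as principal components are in sample. Under that assumption the $R^2$ on the first $k$ factors is the sum of squared correlations, so the increment from factor $k$ is the squared correlation with that factor alone. The sketch below uses this shortcut; the function names and data are hypothetical.

```python
def corr(a, b):
    """Sample correlation between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def marginal_r2(x_i, factors):
    """mR^2_i(k) for k = 1..K, assuming the factors are mutually uncorrelated."""
    return [corr(x_i, f) ** 2 for f in factors]

# Two orthogonal, zero-mean factors and a series loading on both
f1 = [1.0, -1.0, 1.0, -1.0]
f2 = [1.0, 1.0, -1.0, -1.0]
x = [2.0 * a + b for a, b in zip(f1, f2)]
mr2 = marginal_r2(x, [f1, f2])
```

Averaging `marginal_r2` over all series for a given factor gives the overall contribution of that factor.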

The second factor explains 8.4% of variation overall, and mainly represents the group of interest rates. For instance, its marginal contribution to the 12-month LIBOR is 0.532. Factor 3's average explanatory power is 5.4% and it is linked to price indices, with the highest $mR^2_i(k) = 0.513$ for CPI inflation. Factors 4 and 5 are related to stock market and employment variables respectively.

The sixth factor explains 3.4% of total variation and can be interpreted as the international trade factor. Factor 7 is related to unemployment and working hours indicators, with an explanatory power of 24.5% for the over-12-month unemployment duration. Exchange rates are well explained by the eighth factor. Finally, the ninth component stands out as an energy factor, as it explains a sizeable fraction of variation in production indices of oil extraction, mining and energy sectors.

for 20 series, mostly services, production and average weekly hours series. The nine factors are also very important for 42 variables, as they have an $R^2$ between 0.5 and 0.8. There is only one series for which the idiosyncratic component explains over 90% of the variation, IOP_PETRO, and 3 variables for which the common component $R^2$ is less than 20%. Therefore, we can conclude that the factor structure in UK-MD seems reasonable and is comparable to those in the FRED-MD and CAN-MD datasets. Interestingly, the interpretation of the first three UK-MD factors is identical to the interpretation of the first three FRED-MD components.

We consider direct predictive modeling, in which the target is projected on the information set and the forecast is made directly using the most recent observables. All the variables above are assumed I(1), so we forecast the average growth rate (Stock and Watson, 2002b),

$$y_{t+h} = \frac{1}{h} \ln\left( \frac{Y_{t+h}}{Y_t} \right), \qquad (6)$$

where $Y_t$ denotes the level of the series, except for UNRATE where we target the average change as in (6) but without logs.
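The two target definitions can be sketched as follows. The formulas assume the average is taken as the total (log) change between $t$ and $t+h$ divided by $h$, with no annualization; the function names and the example series are hypothetical.

```python
import math

def avg_growth(level, t, h):
    """Average log growth of `level` between t and t+h."""
    return (math.log(level[t + h]) - math.log(level[t])) / h

def avg_change(series, t, h):
    """Average change between t and t+h, used for rates such as unemployment."""
    return (series[t + h] - series[t]) / h

level = [100.0 * 1.01 ** i for i in range(4)]   # grows 1% per month
rate = [4.0, 4.2, 4.6, 5.2]                     # hypothetical unemployment rate, in %

g = avg_growth(level, t=0, h=3)    # close to ln(1.01)
d = avg_change(rate, t=0, h=3)     # 0.4 percentage points per month
```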

The pseudo-out-of-sample period starts in 2008M01. The end period depends on the target variable. The labor market series (EMP, UNEMP RATE and HOURS) end in 2020M09, while RETAIL is available up to 2020M10. The remaining variables end in 2020M11. The forecasting horizons considered are 1, 2 and 3 months. All models are estimated recursively with an expanding window in order to include more data so as to potentially reduce the variance of more flexible models.
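The recursive expanding-window design can be sketched as a loop over forecast origins. The code below is a schematic, not the authors' pipeline: the historical mean stands in for any of the models above, and the function names are hypothetical. In a direct forecasting setup, the training pairs at origin $t$ would additionally be restricted to targets already observed by $t$.

```python
def expanding_window_forecasts(y, first_origin, h, fit, predict):
    """Re-estimate on y[:t+1] at every origin t and forecast y[t+h]."""
    out = []
    for t in range(first_origin, len(y) - h):
        model = fit(y[: t + 1])              # only data available at time t
        out.append((t + h, predict(model, h), y[t + h]))
    return out

def fit_mean(history):
    """Placeholder model: the historical mean."""
    return sum(history) / len(history)

def predict_mean(model, h):
    """The mean forecast does not depend on the horizon."""
    return model

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
results = expanding_window_forecasts(y, first_origin=2, h=1,
                                     fit=fit_mean, predict=predict_mean)
```

Each tuple holds the target date, the forecast and the realized value, which is all that is needed to compute out-of-sample MSEs.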

The standard Diebold and Mariano (2002) (DM) test procedure is used to compare the predictive accuracy of each model against the reference autoregressive model. Mean squared error (MSE) is the most natural loss function given that all models are trained to minimize the squared (2019) compared it with a scheme which respects the time structure of the data in the context of macroeconomic forecasting and found K-fold to be performing as well as or better than this alternative scheme. All models are estimated (and their hyperparameters re-optimized) every month.
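For reference, a bare-bones version of the comparison against the AR benchmark is sketched below. It computes the MSE ratio and a Diebold-Mariano statistic under squared-error loss, using a plain sample variance of the loss differential. That is a simplification: for horizons above one month the forecast errors overlap, and a HAC variance would normally be used instead. The function names and error series are hypothetical.

```python
import math

def dm_statistic(e_model, e_bench):
    """Diebold-Mariano statistic for squared-error loss, no autocorrelation correction."""
    d = [b ** 2 - m ** 2 for m, b in zip(e_model, e_bench)]   # > 0 when the model is better
    n = len(d)
    d_bar = sum(d) / n
    var = sum((v - d_bar) ** 2 for v in d) / (n - 1)
    return d_bar / math.sqrt(var / n)

def mse_ratio(e_model, e_bench):
    """MSE of the model relative to the benchmark; below 1 favours the model."""
    return sum(m ** 2 for m in e_model) / sum(b ** 2 for b in e_bench)

e_model = [0.5, -0.4, 0.6, -0.5]
e_bench = [1.0, -0.9, 1.1, -1.2]
ratio = mse_ratio(e_model, e_bench)
dm = dm_statistic(e_model, e_bench)
```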

In this section we present the results of the forecasting experiment, focusing first on the Covid-19 era and then on average performance over the longer evaluation sample. Table 3 .

Though the Covid era is short and so the results should be interpreted with care, the outcome is quite interesting. Linear models have a hard time characterizing the path of EMP during the Pandemic recession. Ridge+MARX, which was marginally better than the nonlinear FA-ARRF(2,2) during the pre-Covid era, is predicting an employment cataclysm that did not materialize. This is a general property of linear models for this target since the best linear forecast (other than the AR) for EMP in 2020 is the 0 forecast, that is, the RW without drift in levels. FA-ARRF(2,4) (with FA-ARRF(2,2) close behind) is the best model for EMP at a horizon of one month. At longer horizons, RF-MARX is the best model, with a decisive advantage over both AR and RF, which do not use the transformations of Goulet Coulombe et al. (2020). This winning streak extends to unemployment at all horizons, another variable that responded in a rather mild fashion to the Covid shock due to Government intervention. Given RF's usual robustness (Goulet Coulombe, 2020b), those gains are almost all statistically significant.

In Figure 3b, we see that the improvement at h = 1 comes from responding more swiftly (and more vigorously) to the first Covid shock than what AR would allow for. An explanation for this well-calibrated response can be found in Figure 4, which plots the underlying Generalized Time-Varying Parameters (GTVPs) for FA-ARRF(2,2). The persistence seems to be highly state-dependent, being much higher during certain episodes (including recessions). This feature is replicated out-of-sample during the Pandemic recession, which gave FA-ARRF(2,2) an edge over the competitive plain AR. Additionally, the model incorporates an intercept that alternates between two regimes, with the negative one being attributed to recessions (but not exclusively, according to pre-2008 data). The drop in intercept is also predicted out-of-sample for the Covid period. Unsurprisingly, those switches match those of persistence. Finally, the sensitivity to the first factor (which usually characterizes real activity) is initially milder during recessions for EMP. This is a salient feature for 2020, as the EMP response to the Covid shock is much milder than that of other labor/production indicators (like HOURS).

Turning to HOURS, which experienced an unprecedented rise and fall during the onset of the Pandemic Recession, it is striking to see that only Macroeconomic Random Forests (MRF) can beat the AR benchmark at h = 1. Indeed, the four MRFs report MSE ratios between 0.69 and 0.78, whereas those of the other nonlinear models range between 1.05 and 1.5. Things are even worse for linear models. are closely related to HOURS itself, and that all successful MRFs include an AR component, this points in the direction that HOURS may well follow a nonlinear AR process which MRF is particularly well equipped to extract. As a result, the response of MRF to the Covid shock is (as was the case for EMP) more timely than that of AR. Given how fast things were evolving back in the spring of 2020, that timing provides MRF with an improvement of around 30% over the benchmark.

As conjectured earlier, MRF's capacity to extrapolate (which RF and Boosted Trees both lack) proves vital for variables which exhibited (previously unseen) swings of extraordinary proportions. While NN-ARDI also has the capacity to extrapolate (and is marginally better than FA-ARRF(2,2) in the pre-Covid era), its lack of an explicit linear part is likely to blame for its spectacular incapacity to propel the Covid shock in Figure 3b . A similar dismal predicament is observed for RIDGE-MARX which is the best linear model for the Covid sample.

Different troubles afflict data-rich linear models for RPI HOUSING, with MSE ratios exploding well over 10. As a result, the best linear model is without question the simple autoregression. An obvious explanation for the generalized failure of linear models (and also most data-rich ones) can be found in Figure 3b. The "orange" forecasts basically predict a path largely inspired by the experience of the Great Recession, i.e., a joint collapse of real activity and housing prices. Since this is the sole recession in the training set, it is fair to say that most ML methods naively (yet inevitably) associate real activity slowdown with a significant drop in RPI HOUSING. However, by information available to the economist, but not to the sample-constrained ML algorithm, this association is more of a 2008-2009 exception than a "rule".

Notes: GTVPs of the one month ahead EMP forecast. Persistence is defined as the sum of $y_{t-1:2}$'s coefficients. The gray bands are the 68% and 90% credible regions. The pale orange region is the OLS coefficient ± one standard error. The vertical dotted line is the end of the training sample (for this graph only, not the forecasting exercise itself, which is ever-updating). Pink shading corresponds to recessions.

The only models able to beat the benchmark are the MRFs equipped with small autoregressions as linear parts (ARRF(2) and ARRF(6)). So, how did they avoid the dismal fate of other ML methods and nicely capture the soft drop (and bounce back) of RPI HOUSING in 2020? First, they do not rely explicitly on linkages with other groups of variables (as FA-ARRFs would through the use of factors) but rather focus on nonlinear autoregressive dynamics. This strategy is expected to pay off whenever a shock can truly be thought of as "exogenous" and we simply need a model to propagate it; this description corresponds to the onset of the Pandemic Recession but certainly not its predecessor. Second, the model needs to separate pre-2008 dynamics from what followed. Figure 9 confirms visually that the variation in the intercept of ARRF(6) gives an edge over both AR and the best linear model (RIDGE-MARX), especially starting from 2011. As a result, ARRF(6) is also the best model for all horizons in the quieter period of 2011-2019 (see Table 11), with improvements over the AR benchmark of 70%, 54% and 54% at horizons 1 to 3 respectively.

The last quadrant of Figure 3a shows that for PPI MANU, a model that does marginally worse most of the time can generate substantial gains during the Covid period. Such is the case for RF-MARX, whose performance is similar to that of the best linear model for most samples (and the best overall pre-Covid). Figure 3b makes clear that this edge during the Pandemic happens because (i) RF-MARX goes almost as deep as linear models during the spring and yet (ii) does not call for a large decrease in September and October (unlike linear models, and akin to AR's prediction).

Since RF-MARX does better than plain RF by 36% and Boosting-MARX better than plain Boosting by 12%, it is natural to investigate the variable importance (VI) measures of those models to uncover which particular MARX transformations RF is so fond of. In Figure 6, we see that both plain Boosting and RF rely strongly on the most recent values of oil prices, PPI oil and PPI MANU itself, which comes as no surprise. Interestingly, the other lags of oil prices are generally absent from the top 20.

The MARX versions consider a slightly less focused set of predictors composed of various moving averages of oil prices. In both the RF and Boosting case, the most important feature is the last 6 months average of oil prices change. Thus, RF-MARX versions avoid calling for another decrease of PPI MANU by relying less on monthly oil indicators by themselves, which are subject to large swings, but rather on temporal averages that have the ability of smoothing out the noise inevitably present in the oil price trajectory. Moreover, by the very design of the manufacturing production chain, increases/decreases over several months are more likely to be transmitted into prices than notoriously volatile one-month-to-the-next variations.

It has been repeatedly reported that the benefits of a large panel of predictors may solely be present during periods of economic turmoil (Kotchoni et al., 2019; Siliverstovs and Wochner, 2019). For this reason and others (Lerch et al., 2017), it is of interest to study the marginal benefits associated with data-rich models outside of the tumultuous entry/exit of the Great Recession and the Pandemic Recession. Moreover, starting the pseudo-out-of-sample from 2011 gives data-rich models at least one recession to be trained on, and 13 years of data overall rather than 10 (as was the case in Table 4).

Ridge and Ridge-MARX do well for EMP and HOURS with gains roughly distributed between 10% and 20% depending on the horizon. The MARX version usually has the upper hand by a small margin. The evidence for other activity indicators is more mixed. For HOURS, only nonlinear models manage to beat the AR benchmark albeit in a non-statistically significant fashion. The best model for IP PROD at all horizons is ARRF(2) which improves upon the AR by small margins. For IP MACH, some small gains can be obtained at a horizon of 3 months (with FA-ARRF(2,2), most notably) but none of those are statistically significant.

In line with traditional wisdom for the US (Stock and Watson, 2008), it is hard to beat the simple benchmark when it comes to CPI inflation. Nevertheless, ARRF(6) is the best model for all horizons (ex aequo at h = 1) with gains of 9-10%, but none of those are significant. Larger improvements are obtained for RPI, where various data-rich models (linear and nonlinear) provide gains of around 20%. The most notable are those of FA-ARRFs at a horizon of 3 months (but also at any other horizon), which are nearly 30%, far ahead of most of the competing models, including all those that also rely directly on factors. Finally, as a last notable observation from Table 11, ARRF(6) dominates at all horizons for both RPI HOUSING and CREDIT, highlighting the benefits of a more focused modeling of persistence (while allowing for its time variation) in otherwise high-dimensional/data-rich/nonlinear ML methods.

In this paper we assess the forecasting performance of a variety of standard and ML forecasting methods for key UK economic variables, with a special focus on the Covid-19 period and using a specifically collected large dataset of monthly indicators, labeled UK-MD (also augmented with some international indicators).

As standard benchmarks, we consider AR, random walk and factor-augmented AR models. As ML methods, we evaluate penalized regressions (RIDGE, LASSO, ELASTIC NET), boosted trees (BT) and random forests (RF), Kernel Ridge Regression (KRR), and Neural Networks (NN), plus Macroeconomic Random Forest (MRF), which uses a linear part within the leaves, and Moving Average Rotation of X (MARX), a feature engineering technique which generates an implicit shrinkage more appropriate for time series data.

Overall, ML methods can provide substantial gains when short-term forecasting several indicators of the UK economy, though a careful temporal and variable-by-variable analysis is needed.

Over the full sample, RF works particularly well for labour market variables, in particular when augmented with MARX; KRR for real activity and consumer price inflation; LASSO or LASSO+MARX for the retail price index and its version focusing on housing; and RF for credit variables. The gains can be sizable, even 40-50% with respect to the benchmark, and ML methods were particularly useful during the Covid-19 period. During the Covid era, nonlinear methods with the ability to extrapolate have a clear edge. Certain MRFs, unlike linear methods or simpler nonlinear ML techniques, deliver important improvements by predicting the large "bounce back" that did occur and by avoiding predictions of mayhem that did not materialize.

Goulet Coulombe, P. (2020a). The macroeconomy as a random forest. arXiv preprint arXiv:2006.12724.

Goulet Coulombe, P. (2020b). To bag is to prune. arXiv e-prints, pages arXiv-2008.

Goulet Coulombe, P., Leroux, M., Stevanovic, D., and Surprenant, S. (2019). How is machine learning useful for macroeconomic forecasting? Technical report, CIRANO Working Papers, 2019s-22.

Goulet Coulombe, P., Leroux, M., Stevanovic, D., and Surprenant, S. (2020). Macroeconomic data transformations matter. arXiv preprint arXiv:2008.01714.

Gu, S., Kelly, B., and Xiu, D. (2019). Empirical asset pricing via machine learning. Review of Financial Studies.

including the specific predictor in the forest part. $VI_{OOB}$ means VI for the out-of-bag criterion. $VI_{OOS}$ is using the hold-out sample. $VI_{\beta}$ is an out-of-bag measure of how much $\beta_{t,k}$ varies by withdrawing a certain predictor.

Figure 9: Full POOS forecasts for RPI HOUSING at h = 1

Notes: Pink shading corresponds to recessions. Exact selected models are reported in Table 3.

When available, the series were retrieved already adjusted for seasonality. The price indices (CPI, RPI and PPI) were not; after conducting the Kruskal and Wallis (1952) test for seasonal behavior, these were seasonally adjusted using the X-13ARIMA-SEATS software developed by the United States Census Bureau. The transformation codes are: 1 = no transformation; 2 = first difference; 4 = logarithm; 5 = first difference of logarithm.
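The four transformation codes listed above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' data pipeline. The function name `apply_transformation` and the choice to mark unavailable leading observations with `None` are assumptions, and only the codes named in the text are handled.

```python
import math


def apply_transformation(series, code):
    """Apply one of the listed transformation codes to a list of numbers.

    Leading observations lost to differencing are returned as None
    (an assumed convention for this sketch).
    """
    if code == 1:    # no transformation
        return list(series)
    if code == 2:    # first difference
        return [None] + [series[t] - series[t - 1] for t in range(1, len(series))]
    if code == 4:    # logarithm (requires strictly positive values)
        return [math.log(x) for x in series]
    if code == 5:    # first difference of logarithm
        logs = [math.log(x) for x in series]
        return [None] + [logs[t] - logs[t - 1] for t in range(1, len(logs))]
    raise ValueError(f"unsupported transformation code: {code}")


# Toy price index, purely illustrative.
index_levels = [100.0, 110.0, 121.0]
log_growth = apply_transformation(index_levels, 5)
print(log_growth)
```

For a series growing at a constant 10% per period, as in the toy example, code 5 returns the same log growth rate, log(1.1), at every available date.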

Improved penalization for determining the number of factors in approximate factor models

How well do economists forecast recoveries

Determining the number of factors in approximate factor models. Econometrica

Large Bayesian VARs

A note on the validity of cross-validation for evaluating autoregressive time series prediction

Measuring the effects of monetary policy: a factor-augmented vector autoregressive (FAVAR) approach

DSGE models in a data-rich environment

Random forests. Machine learning

Nowcasting tail risks to economic activity with many indicators

Tail forecasting with bayesian additive regression trees

Forecasting Non-Stationary Economic Time Series

A similarity-based approach for macroeconomic forecasting

Comparing predictive accuracy

Macroeconomic forecasting during the great recession: The return of non-linearity?

Forecasting the covid-19 recession and recovery: Lessons from the financial crisis

A large canadian database for macroeconomic analysis

Recessions as breadwinner for forecasters state-dependent evaluation of predictive ability: Evidence from big macroeconomic us data. KOF Working Papers

Forecasting using principal components from a large number of predictors

Macroeconomic forecasting using diffusion indexes

Phillips curve inflation forecasts

The additional transformation codes are: 6 = second difference of logs; 7 = $\Delta(x_t / x_{t-1} - 1)$.
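The two additional codes can be sketched in the same spirit, reading code 7 as the first difference of the simple growth rate $x_t / x_{t-1} - 1$. The function names and the use of `None` for observations lost to differencing are assumptions made for this illustration.

```python
import math


def second_difference_of_logs(series):
    """Code 6: log(x_t) - 2*log(x_{t-1}) + log(x_{t-2})."""
    logs = [math.log(x) for x in series]
    changes = [logs[t] - 2.0 * logs[t - 1] + logs[t - 2] for t in range(2, len(logs))]
    return [None, None] + changes


def difference_of_growth_rate(series):
    """Code 7: (x_t / x_{t-1} - 1) - (x_{t-1} / x_{t-2} - 1)."""
    growth = [series[t] / series[t - 1] - 1.0 for t in range(1, len(series))]
    changes = [growth[i] - growth[i - 1] for i in range(1, len(growth))]
    return [None, None] + changes


# Toy series, purely illustrative: growth accelerates from 10% to 20%.
levels = [100.0, 110.0, 132.0]
print(second_difference_of_logs(levels))
print(difference_of_growth_rate(levels))
```

Both transformations consume two leading observations, and both return values near zero when the series grows at a constant rate.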