Improving Macroeconomic Model Validity and Forecasting Performance with Pooled Country Data using Structural, Reduced Form, and Neural Network Model
Cameron Fen and Samir Undavia
2022-03-13

Abstract: We show that pooling countries along a panel dimension of macroeconomic data can improve, by a statistically significant margin, the generalization ability of structural, reduced-form, and machine learning (ML) methods, producing state-of-the-art results. Using GDP forecasts evaluated on an out-of-sample test set, this procedure reduces root mean squared error (RMSE) by 12% across horizons and models for certain reduced-form models and by 24% across horizons for dynamic stochastic general equilibrium (DSGE) models. Removing US data from the training set and forecasting out-of-sample country-wise, we show that reduced-form and structural models are more policy-invariant when trained on pooled data, and outperform a baseline that uses US data only. Given the comparative advantage of ML models in a data-rich regime, we demonstrate that our recurrent neural network model and automated ML approach outperform all tested baseline economic models. Robustness checks indicate that our outperformance is reproducible, numerically stable, and generalizable across models.

A central problem in both reduced-form and structural macroeconomics is the dearth of underlying data. For example, GDP is a quarterly series that only extends back to around the late 1940s, which yields roughly 300 timesteps. Generalization and external validity are therefore pertinent problems for these models. In forecasting, this problem is partially addressed by using simple linear models. In structural macroeconomics, the use of micro-founded parameters and Bayesian estimation attempts to improve generalization, to limited effect.
More flexible and nonparametric models would likely produce more accurate forecasts, but with limited data this avenue is not available. However, pooling data across many countries allows economists to forecast with, and even estimate, larger structural models that have both better external validity and better forecasting performance, without compromising internal validity or model design. We show that pooling US (or other single-country) data with data from other countries, in conjunction with large DSGE models and machine learning, leads to improvements in external validity and economic forecasting. This pooling approach adds more data rather than more covariates, and leads to consistent and significant improvements in external validity as measured by timestep out-of-sample forecasts and other metrics. For example, our data grows from 211 timesteps of US data for our AutoML model to 2581 timesteps over all countries in our pooled data. This not only leads to significant improvements in forecasting for standard models, but also allows more flexible models to be used without overfitting. Pooling a panel of countries also yields parameters that better fit the underlying data generating process - almost for free - without any change to the equations governing the models, changing only the interpretation from a single-country parameter to an average world parameter. Even in this case, we show that the stability of parameters over space - across countries - may be better than the alternative of going back further in time for more single-country data. A central theme throughout this paper is that more flexible models benefit more from pooling. We start with linear models as a baseline. Even in this case, pooling improves performance. Estimating traditional reduced-form models - AR(2) (Walker, 1931), VAR(1), and VAR(4) models (Sims, 1980) - we show that we can reduce RMSE by an average of 12% across horizons and models.
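To make the pooling step concrete, the sketch below stacks lagged observations from two toy "countries" as extra rows before fitting a shared AR(2) by ordinary least squares. The series, helper names, and solver are our own illustrative assumptions, not the paper's code.

```python
def ar2_design(series):
    """Within one country, build rows y_t ~ [1, y_{t-1}, y_{t-2}]."""
    X, y = [], []
    for t in range(2, len(series)):
        X.append([1.0, series[t - 1], series[t - 2]])
        y.append(series[t])
    return X, y

def ols(X, y):
    """Solve the k x k normal equations (X'X) b = X'y by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):                      # forward elimination
        for j in range(i + 1, k):
            f = A[j][i] / A[i][i]
            A[j] = [a - f * c for a, c in zip(A[j], A[i])]
            b[j] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):            # back substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

# Two noiseless toy countries following the same process y_t = 0.5 y_{t-1} - 0.2 y_{t-2}.
countries = {
    "A": [1.0, 0.8, 0.20, -0.06, -0.070, -0.023],
    "B": [0.5, 0.4, 0.10, -0.03, -0.035, -0.0115],
}

# Pooling adds rows, not covariates: the AR(2) equation itself is unchanged.
X_pool, y_pool = [], []
for series in countries.values():
    X, y = ar2_design(series)
    X_pool += X
    y_pool += y

beta = ols(X_pool, y_pool)
print(beta)  # recovers roughly [0.0, 0.5, -0.2]
```

Estimating on one country alone would use four rows here; pooling doubles that, which is the same effect the paper exploits at scale (211 US timesteps versus 2581 pooled).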
Beyond pure forecasting, analysis of our pooling procedure across models suggests improvements in external validity in other ways, making models more policy/regime invariant. To show this, we estimate both linear and nonlinear models on data from all countries except the target country being forecasted. Thus, our test data is not only timestep out-of-sample, which we implement in all our forecast evaluations, but also country out-of-sample, which we add to illustrate generalizability. Across most models and forecasting horizons, our out-of-sample forecasts outperform the typical procedure of estimating models on only the data of the country of interest. This time and country out-of-sample analysis captures about 80% of the improvement gained from moving all the way to forecasting with the full data set. We believe this provides evidence that this data augmentation scheme can help make reduced-form, as well as more nonlinear, models more policy-invariant. Moving to a more flexible model, we proceed to apply our panel data augmentation scheme to the Smets-Wouters structural DSGE model, which has 45+ parameters (Smets and Wouters, 2007). Our augmentation statistically significantly improves the generalization of this model by an average of 24% across horizons. We again test our model in a country out-of-sample manner and show improvements over single-country baselines. This suggests that even DSGE models are not immune to the Lucas critique, and the use of country panel data can improve policy/country invariance. Given the consistent improvements across all horizons and reduced-form models, we are confident this approach will generalize to the estimation of other structural models.
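The country out-of-sample protocol amounts to a leave-one-country-out loop: to score country c, estimate on every other country and evaluate on c's held-out test period. Everything below (the data and the mean-growth "model") is a hypothetical stand-in used only to show the loop's shape.

```python
def rmse(preds, actuals):
    return (sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds)) ** 0.5

def fit_pooled_mean(train_series):
    """Toy 'model': forecast the pooled historical mean growth rate."""
    rows = [x for s in train_series for x in s]
    mu = sum(rows) / len(rows)
    return lambda horizon: mu               # same forecast at every horizon

history = {                                 # toy quarterly growth rates
    "USA": [0.6, 0.5, 0.7, 0.4],
    "DEU": [0.4, 0.5, 0.3, 0.4],
    "JPN": [0.2, 0.3, 0.1, 0.2],
}
held_out = {"USA": [0.5, 0.6], "DEU": [0.4, 0.3], "JPN": [0.2, 0.2]}

scores = {}
for target in history:
    train = [s for c, s in history.items() if c != target]  # drop the target country
    model = fit_pooled_mean(train)
    preds = [model(h) for h in range(len(held_out[target]))]
    scores[target] = rmse(preds, held_out[target])

print(scores)  # RMSE per country, each computed without that country's data
```

Note the evaluation is doubly out-of-sample: the test quarters are unseen and so is the entire target country.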
We advocate applying this approach to calibration, generalized method of moments (Hansen and Singleton, 1982), and Bayesian estimation (Fernández-Villaverde and Rubio-Ramírez, 2007), where the targets are moments from a cross-section of countries instead of just one region like the United States. Recognizing that this augmentation increases the effective number of observations by a factor of 10, we also demonstrate that pooling can overcome overfitting in flexible machine learning models, which can now outperform traditional forecasting models in this high-data regime. This is in line with the trend of larger models having a comparative advantage in terms of forecasting improvement given more data. We use two different algorithms. The first, AutoML, runs a horse race with hundreds of machine learning models to determine the best performing model on the validation set, which is ultimately evaluated on the test set. We view AutoML as a proxy for strong machine learning performance, but also expect that individual model tuning could improve performance even further. As different models perform better under the low-data (US) regime and the high-data (pooled) regime under AutoML, we also test an RNN to show the improvement of a single model under both data regimes. The model improvement indicates that while these approaches are competitive in the low-data regime, machine learning methods consistently outperform baseline economic models - VAR(1), VAR(4), AR(2), Smets-Wouters, and factor models - in the high-data regime. Furthermore, while some of the baseline models use a cross-section of country data, we only use three covariates: GDP, consumption, and unemployment lags. In contrast, the DSGE model uses 11 different covariates, the factor model uses 248, and the Survey of Professional Forecasters (SPF) (None, 1968) uses just as many covariates along with real-time data, suggesting our machine learning models still have room for improvement.
Over most horizons, our model approaches the SPF median forecast performance, albeit evaluated on 2020-vintage data (see Appendix C), and outperforms the SPF benchmark at five quarters ahead. The outperformance of our model over the SPF benchmark is noteworthy because the SPF is an ensemble of both models and human decision makers, and many recessions, like the recent COVID-19 recession, are not generally predicted by models.

The paper proceeds as follows: Section II. reviews the literature on forecasting and recurrent neural networks and describes how our paper merges these two fields; Section III. discusses feedforward neural networks, linear state space models, and gated recurrent units; Section V. describes the data; Section III.B. briefly describes our model architecture; Section IV. discusses the benchmark economic models and the SPF that we compare our model to; Section VI. and Appendix G.2 provide the main results and robustness checks; and Section VII. concludes the paper.

This paper connects multiple strands of literature: machine learning, time-series econometrics, and panel macroeconomic analysis. As our pooling technique leads to larger datasets, it creates an opportunity to either increase the parameter count in models or move to more powerful tools. In machine learning, when combined with additional data, even models with billions of parameters still exhibit continued log-linear improvements in accuracy (Kaplan et al., 2020). This opens up an avenue to explore whether the historical outperformance of linear models is due more to 1) the lack of data or 2) their attractive properties in fitting the underlying data generating process. The results of pooling across countries suggest that the advantage of linear models is due to the former.
We applied a recent machine learning technique, AutoML, to our economic data; the technique was introduced and improved on as machine learning models became more complicated, with more layers and hyperparameters to tune (Thornton et al., 2013). Unlike the case with deep learning, many innovations came from the software industry to automate estimation techniques; however, there is a vibrant academic literature following its introduction (F et al., 2014). The basic premise is to automate the model training and discovery portion of machine learning. H2O (LeDell and Poirier, 2020), the AutoML platform we use, takes data, automatically runs a horse race of machine learning models on the validation set, and returns the best performing model. Hence, we view the output model as a proxy for a well-trained and effective predictive model built by a data scientist, even if some additional fine-tuning could improve performance. While there is room for human improvement over an automated machine learning process, removing the human from the process entirely in our AutoML algorithm shields us from most p-hacking critiques. The second machine learning procedure we used was the estimation of an RNN, which is a state space model much like the linear state space models often used in economics. Innovations in deep learning have given these models predictive power well beyond their linear counterparts familiar to economists. RNNs have existed in many forms, but were mainly popularized in the 1980s (Rumelhart, Hinton and Williams, 1986). The popularity and performance of RNNs grew with the introduction of long short-term memory (LSTM) networks by Hochreiter and Schmidhuber (1997). The model uses gates to prevent exploding or vanishing gradients, giving it states that can "remember" many timesteps into the past.
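The gating mechanism can be illustrated with a single scalar GRU-style step (the same idea the LSTM's gates implement); the hand-set weights below are our own illustrative assumption, chosen so the update gate stays nearly closed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    z = sigmoid(w["wz"] * x + w["uz"] * h)                 # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h)                 # reset gate
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h))    # candidate state
    return (1 - z) * h + z * h_cand                        # gated blend of old and new

# A strongly negative update-gate weight keeps z near 0 for these inputs,
# so the state is carried through almost unchanged -- the "remembering".
w = {"wz": -4.0, "uz": 0.0, "wr": 1.0, "ur": 1.0, "wh": 1.0, "uh": 1.0}
h = 0.8
for x in [1.0, 1.0, 1.0]:
    h = gru_step(x, h, w)
print(round(h, 3))  # stays close to the initial 0.8
```

With the gate nearly closed, gradients flowing back through the `(1 - z) * h` term are multiplied by values near one rather than repeatedly shrunk, which is what prevents the vanishing-gradient problem.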
In addition to their pervasive use in language modeling, LSTMs are used in fields as disparate as robot control (Mayer et al., 2006), protein homology detection (Hochreiter, Heusel and Obermayer, 2007), and rainfall detection (Shi et al., 2015). We use a modification of the long short-term memory network called a gated recurrent unit. RNNs and other deep learning architectures, like convolutional neural networks, have been used to forecast unemployment (Smalter Hall and Cook, 2017). Within economics, gated recurrent units (GRUs) have been applied to stock market prediction (Minh et al., 2018) and power grid load forecasting (Li, Zhuang and Zhang, 2020). Moving from machine learning models to economic models, autoregressive models have been the workhorse forecasting models since the late 1930s (Diebold, 1998), (Walker, 1931). Even the machine learning models maintain an autoregressive structure in their inputs. Despite their simplicity and age, these models are still used among practitioners and as a baseline in many papers (Watson, 2001). One advancement in forecasting stems from the greater adoption of structural or pseudo-structural time series models like the Smets-Wouters DSGE model (Smets and Wouters, 2007). While DSGE forecasting is widely used in the literature, such models are competitive with - but oftentimes no better than - a simple AR(2), with more bespoke DSGE models performing worse (Edge, Kiley and Laforte, 2010). However, the use of DSGE models for counterfactual analysis is an important and unique benefit of these models. The final economic baseline is the factor model (Stock and Watson, 2002a), which uses a large cross-section of data, and thus a more comprehensive picture of the economy, to perform forecasting. Details on all these models and our implementation can be found in Appendix D.
In addition, our paper uses tools from forecast evaluation (West, 1996), (Pesaran and Timmermann, 1992), (Diebold and Mariano, 2002), as well as model averaging (Koop, Leon-Gonzalez and Strachan, 2012), (Timmermann, 2006), (Wright, 2008). Moving to structural economics, there is a small but robust literature on panel data and dynamic general equilibrium models (Breitung, 2015). Most of that literature focuses on the use of panel data to better identify effects of interest that vary across countries. Much of it is theoretical and adapts microeconometric panel techniques to macroeconomic data (Banerjee, Marcellino and Osbat, 2004). This literature also studies the use of cross-sectional data to improve counterfactual analysis in general equilibrium models (Miyamoto, Nguyen and Sergeyev, 2018), (Crandall, Lehr and Litan, 2007), as well as work with a more microeconomic forecasting focus (Baltagi, 2008). There is also literature on specific panel models applied to macroeconomics, like dynamic panel models (Doran and Schmidt, 2006), (Bai and Ng, 2010), (Diebold, Rudebusch and Aruoba, 2006). At the same time, the approach of pooling countries has faced some resistance for theoretical reasons. Pesaran and Smith (1995) argue that structural parameters lose meaning as they turn into a mean value across countries rather than an estimate of the true value in one country. However, the econometrician still needs a minimum amount of data for good parameter identification, and pooling across a large spatial cross-section allows the use of data that is more recent. As we show empirically, more recent data spread across different countries has the same - if not more - predictive power than data that is constrained to a single country but extends further into the past.
This finding suggests that even though it might be neater to use single-country data extending further back in time, countries are somewhat artificial boundaries: the stability and predictive power of parameters are at least as strong across space as across time.

AutoML software is designed to provide end-to-end solutions for machine learning problems by efficiently training and evaluating a number of different models and ultimately returning the best model. In order to provide a proxy for the performance of a good "nonparametric" machine learning model, we tested the open-source automated machine learning (AutoML) software H2O (https://www.h2o.ai/). We created a pipeline for each prediction horizon, trained the model using our international cross-sectional data, evaluated on US validation data, and lastly predicted using our US test set. In contrast with our own custom model, setting up H2O and training on our dataset was almost entirely automated. The benefit of automation is that while humans can improve performance, there was little we could do, either via tinkering with architecture or devoting more computational resources, to influence the performance of the procedure. From predicting one quarter ahead to five quarters ahead, the AutoML software picked a different model for each horizon, respectively: XGBoost, gradient boosting machine, gradient boosting machine, distributed random forest, and deep learning. We noticed that the software generally picked deep learning models for the quarters that were further away (four and five quarters ahead) and gradient-boosted techniques for closer quarters (one and two quarters ahead). This is not surprising, as decision-tree-based techniques have relatively few degrees of freedom and good inductive biases for many problems, while deep learning techniques are ultimately more flexible and scale better on larger datasets because of larger parameter counts.
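The horse race H2O runs can be sketched in miniature: fit each candidate, score it on the validation set, keep the lowest RMSE. The three candidate forecasters below are toy stand-ins, not the actual H2O model zoo.

```python
def rmse(p, a):
    return (sum((x - y) ** 2 for x, y in zip(p, a)) / len(p)) ** 0.5

def last_value(train):                  # random-walk forecast
    return lambda n: [train[-1]] * n

def mean_value(train):                  # historical-mean forecast
    mu = sum(train) / len(train)
    return lambda n: [mu] * n

def drift(train):                       # last value plus average past change
    d = (train[-1] - train[0]) / (len(train) - 1)
    return lambda n: [train[-1] + d * (i + 1) for i in range(n)]

def horse_race(candidates, train, valid):
    scored = sorted(
        (rmse(maker(train)(len(valid)), valid), name)
        for name, maker in candidates.items()
    )
    return scored[0][1], scored         # winner by lowest validation RMSE

train = [0.2, 0.3, 0.4, 0.5, 0.6]
valid = [0.7, 0.8]
best, leaderboard = horse_race(
    {"last_value": last_value, "mean": mean_value, "drift": drift}, train, valid
)
print(best)  # 'drift' wins on this trending toy series
```

Run per horizon, a loop like this naturally picks different winners at different horizons, which is exactly the behavior reported above.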
Ultimately, AutoML had very strong results and can be applied to other prediction problems in economics. Additionally, because AutoML selects a different model for a given horizon and data set size, we also estimated an RNN on both the reduced US dataset and the pooled world dataset. This allows us to show the effect of the increase in data size holding the model architecture fixed. The RNN also has the advantage of not being a model considered by AutoML, which gives broader coverage of the universe of machine learning models considered in our paper. The RNN model we use to supplement AutoML is the gated recurrent unit (GRU) model, described in Appendix E.4. We add additional feedforward layers as well as other architecture choices as indicated in Figure I. The model architecture involves preprocessing the data using two feedforward dense layers, with 64 and 128 hidden units respectively and rectified linear unit activation (Agarap, 2018) (see Appendix E.5.), then running a gated recurrent unit with sixteen states on this preprocessed data. Our model contains parallel dense layers between each operation; the layers were originally skip connections (He et al., 2015), but we modified them to allow for learning of linear parameters. The final skip connection concatenates the input with the output of the network so that the neural network nests a VAR(1) model. These design choices all improved forecasting performance. Between all of our non-skip-connection layers, we also use batch normalization (Ioffe and Szegedy, 2015). More details on batch normalization can be found in Appendix E.6. Ultimately, our model comprises about 17,000 parameters, which explains its comparative outperformance in a data-rich regime. We tested the predictive power of a series of machine learning and traditional macroeconomic models estimated on our panel of countries using our novel data pooling method.
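As a sanity check on the quoted parameter count, one can tally the stack described above (dense 64, dense 128, 16-state GRU, 3 input covariates). The output head and the omitted skip/batch-norm layers are our assumptions, so the total is only a ballpark figure.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out                 # weights + biases

def gru_params(n_in, n_hidden):
    # update, reset, and candidate gates each carry input, recurrent, and bias terms
    return 3 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)

n_inputs = 3                                    # GDP, consumption, unemployment
counts = {
    "dense_1": dense_params(n_inputs, 64),
    "dense_2": dense_params(64, 128),
    "gru": gru_params(128, 16),
    "head": dense_params(16 + n_inputs, n_inputs),  # assumed concat-skip output head
}
total = sum(counts.values())
print(counts, total)  # lands in the same ballpark as the ~17,000 quoted above
```

Most of the budget sits in the 64-to-128 dense layer and the GRU's recurrent weights, which is why a 300-observation single-country sample is far too small for this model but a 2581-observation pooled panel is workable.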
We found that the more complex the model, the more our data augmentation helped. The machine learning models tended to be more flexible, but even among economic models the trend still held. Additionally, we provided comparisons to the Survey of Professional Forecasters (None, 1968) median GDP forecast, which is seen as a proxy for state-of-the-art performance. A discussion of the Survey of Professional Forecasters and our attempt to evaluate their forecasts is contained in Appendix C. The baseline economic models we used are the AR(2) autoregressive model, the Smets-Wouters 2007 DSGE model (Smets and Wouters, 2007), a factor model (Stock and Watson, 2002a), (Stock and Watson, 2002b), and a VAR(4)/VAR(1) (Sims, 1980). A more detailed explanation of these models is contained in Appendix D. For the linear models, getting cross-country data is straightforward, so we compare those models estimated only on US data as well as on our data set of 50 countries. For the Smets-Wouters DSGE, we also assembled a panel of 27 rich and developing countries on which to estimate the structural model. As is standard in economic forecasting, the baseline models were trained in a pseudo-out-of-sample fashion where the training set/validation set expands as the forecast date becomes more recent. However, with our neural network and AutoML, we keep the training set and validation set fixed due to computational costs and programming constraints.

2. While we use the word preprocessed, the approach is trained entirely end-to-end and is not a two-step process as the word might imply. The neural network projects the input data - consumption growth, GDP growth, and the unemployment rate - into a high-dimensional state that the gated recurrent unit finds easier to work with, much like preprocessing would. The end-to-end procedure learns the preprocessing and the gated recurrent unit analysis at the same time.
We expect that our model would improve further under a pseudo-out-of-sample approach. When initially training our complex neural network models, we found that United States macroeconomic data was not sufficient, so, in order to train the model, we used data from 49 other developed and developing countries, as listed in Appendix A. We source cross-country data from Trading Economics via the Quandl platform API, as well as GDP data from the World Bank. We used GDP, consumption, and the unemployment rate as inputs to the model; GDP and consumption were expressed as growth rates, and unemployment as a rate. As mentioned earlier, we also assembled 11 different covariates across 27 countries for the panel of data used to estimate the DSGE model.

We split our data into training, validation, and test sets. We forecast GDP and evaluated with RMSE. The test set ran from 2008-Q4 to 2020-Q1, testing either only US data or world data depending on the particular problem. The validation set consisted of data from 2003-Q4 to 2008-Q3, which was only used for the RNN; this data was in the training set for all other models. AutoML, which was not a sequential model, used k-fold cross-validation on the entire training set, comprising the remainder of the data not in the test or validation sets. We chose these periods so that both the test set and the validation set would have periods of both expansion and recession, based on the US business cycle. Including the 2001 recession in the validation set would leave the model without enough training data, so we split the data of the Great Recession over the test and validation sets. The quarter with the fall of Lehman Brothers and the largest dip in GDP was the first quarter in our test set, 2008-Q4; the two quarters with negative growth preceding it were in the validation set. We estimated all models from a horizon of one quarter ahead to five quarters ahead. The metric of choice for forecast evaluation was RMSE.
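The chronological split described above can be sketched directly on quarter labels; the helper names are ours, but the cut points (validation 2003-Q4 to 2008-Q3, test 2008-Q4 to 2020-Q1) are the ones used in the paper.

```python
def quarter_key(q):
    """'2008-Q4' -> a sortable (year, quarter) tuple."""
    year, qtr = q.split("-Q")
    return int(year), int(qtr)

def split(quarters):
    val_start, test_start, test_end = "2003-Q4", "2008-Q4", "2020-Q1"
    train, val, test = [], [], []
    for q in quarters:
        k = quarter_key(q)
        if k < quarter_key(val_start):
            train.append(q)
        elif k < quarter_key(test_start):
            val.append(q)
        elif k <= quarter_key(test_end):
            test.append(q)
    return train, val, test

quarters = [f"{y}-Q{q}" for y in range(2000, 2021) for q in range(1, 5)]
train, val, test = split(quarters)
print(len(train), len(val), len(test))
assert test[0] == "2008-Q4"  # the Lehman quarter opens the test set
```

Splitting on dates rather than on row indices keeps the same quarters out-of-sample for every country in the pooled panel.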
Our first set of results shows the benefit of pooled data for the reduced-form models, VARs and ARs, with significantly improved GDP forecasting accuracy. Despite the relative parsimony of these models, adding pooled data improves RMSE almost uniformly, by an average of 12%. Our second set of results shows that the panel data augmentation improved the chances of building externally, and potentially internally, valid structural models, using the Smets-Wouters DSGE model as the main benchmark. These models benefited more from the pooled data as they had a higher parameter count, improving RMSE by 24%. Our third set of results takes all the models and demonstrates the forecasting power of "nonparametric" machine learning models over all the previously mentioned traditional economic models in this relatively data-rich regime. As these models were the largest and most data-hungry, the use of pooled data improved their performance from slightly above average to state-of-the-art predictions across the board. The RMSE of the RNN-based model improved by 23%, a smaller improvement than for the structural models; however, its RMSE starting point was much better than that of the structural models, and proportional improvement is more difficult the better the original baseline. Ultimately, we present an interesting finding: the improvement in performance from moving from the US data set to the pooled world data set grows as model capacity increases. This suggests that more complex models benefit from increasingly larger datasets and that pooling can address overfitting. Asterisks appended to a model's name indicate the statistical significance of world-data outperformance over the US data using a Diebold-Mariano test (Diebold and Mariano, 2002) at the 1% (***), 5% (**), or 10% (*) level. This format will be followed for the rest of the RMSE performance graphs, unless otherwise noted.
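The significance stars come from Diebold-Mariano comparisons. A minimal version of the statistic under squared-error loss is below; the errors are made up, and practical implementations add a HAC long-run variance and small-sample corrections that we omit.

```python
import math

def dm_stat(errors_a, errors_b):
    """DM statistic for equal predictive accuracy under squared-error loss."""
    d = [a * a - b * b for a, b in zip(errors_a, errors_b)]  # loss differential
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)    # compare to standard normal quantiles

# Toy forecast errors: model B is systematically more accurate than model A.
errors_a = [0.9, -1.1, 1.0, -0.8, 1.2, -1.0, 0.9, -1.1]
errors_b = [0.4, -0.5, 0.5, -0.3, 0.6, -0.4, 0.5, -0.5]

stat = dm_stat(errors_a, errors_b)
print(round(stat, 2))  # well above 1.96, i.e. significant at the 5% level
```

A positive statistic means model B's squared errors are smaller on average; under the null of equal accuracy the statistic is asymptotically standard normal, which is where the 1%/5%/10% cutoffs come from.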
Pooling improved the performance of the models in a statistically significant manner, especially at longer horizons. Except for a slight underperformance at one quarter ahead for the AR(2), all other horizon-model pairs show outperformance using the country panel data augmentation. The outperformance of the pooled data averages roughly 12% of US RMSE over all horizon-model pairs. We show a significant improvement with the pooled data; however, since these models have limited complexity, the improvement is not as large as that of the more complex structural or machine learning models.

Figure III shows forecasting performance on the same US test data, but for a model that is both timestep and country out-of-sample. For example, since we forecasted US GDP, we used every country but the US to forecast US GDP. This test enables us to show that using panel data can lead to models that are policy/country invariant and can generalize even to country data that the model lacked access to. In all cases, the RMSE of the out-of-sample forecasts was statistically indistinguishable from the RMSE of the model estimated on the full panel of countries, but significantly better than the US baseline, with stars indicating the significance of the out-of-sample forecast. Again, the models were mainly significant at longer horizons, but any significance is nevertheless impressive since the outperforming model uses no US data to forecast US GDP, for example. Excluding the horizon-one AR(2) pair, using only the out-of-sample data captured 79%, on average, of the outperformance of the world panel over US-only forecasts. Ultimately, there was an improvement in performance due to the use of pooled data over single-country training data. We performed the same tests over our entire cross-section of countries in Table I, where each model is estimated using every country but the country of forecasting interest.
Aside from providing an additional robustness test regarding the outperformance of world data and even out-of-sample data, this table shows additional policy-invariance, both from the out-of-sample tests and the world-data tests, where the same model can jointly forecast all countries better than models custom-tailored to each country. Additionally, this table seems to indicate a diminishing return in RMSE for linear models when using pooled data: all linear models converge to a similar RMSE, although the larger VAR(4) models seem to have the best performance, suggesting further improvement when moving to more complex structural and machine learning models. Considering the success of data pooling for reduced-form models, we also tested the procedure on structural models and achieved even greater success. Since the DSGE model requires 11 different variables, we assembled our own data from the World Bank, OECD, Eurostat, and FRED. This panel has only a 27-country cross-section, smaller than the 50-country panel used in our reduced-form models. The performance of our structural models demonstrates that this pooling of data likely leads to performance gains across models, including DSGE models that should generalize to out-of-sample data because of their resilience to the Lucas critique. Combined with the results using machine learning models, our results make the case that outperformance due to pooling helps across many, if not most, types of models.
The results we show entail: 1) comparing the forecasting performance of a DSGE model trained only on US data to a DSGE model estimated on the entire panel of countries, in the same manner as Figure II; and 2) comparing the performance of a model that is not only out-of-sample from a timestep perspective, as is usual in forecasting, but whose estimation data also excludes the country being forecasted, in the same manner as Figure III.

Unlike the reduced-form models, the most significant results for structural models, with p-values less than 1%, were at shorter horizons - below three quarters ahead - although all horizons were significant at the 10% level. The parameters that changed the most when moving from US-only data to world data were the shocks and the moving averages of the shock variables, the monetary policy Taylor-rule variables, and the variables governing wages and inflation. While the increased variance of the shocks did not affect expected forecasts, due to certainty equivalence, the model is both less confident and closer to correct when using pooled world data. Perhaps it is unsurprising that variables governing monetary policy and inflation differ when estimated on world data: inflation, especially among developing countries, along with the monetary response to it, was a more pernicious problem outside the US than within (Azam and Khan, 2020). For more information on the changes in structural variables when moving from US data to pooled world data, see Appendix F.3. Despite the increased uncertainty of the model, as illustrated by the increase in the standard deviation of the shocks, the parameters were more reliable when estimated on world data. The improvement in RMSE averages over 25% across all horizons, more than double the percentage improvement for the reduced-form models. Part of the outperformance was due to the weaker performance of the models estimated on US data.
This suggests that the Smets-Wouters model is no better at generalizing across policy regimes or countries than reduced-form models, and benefits more from pooling because of its higher parameter count. In Figure V, we also provide a structural chart that parallels the out-of-sample chart in Figure III in the reduced-form section. As a reminder, the out-of-sample DSGE model was estimated on the panel of 26 countries (removing the US), so that the GDP forecast was country out-of-sample as well as timestep out-of-sample. The out-of-sample performance was 9% better, on average, than even the performance of a DSGE estimated on the entire world data. However, the Diebold-Mariano tests are less significant, with only the first two horizons having p-values less than 1% and no significance at horizons four and five. This suggests that the out-of-sample outperformance may in part be due to chance, and we study this in further detail below. Since the United States was the only country with data extending much before 2000 across all our needed variables, we hypothesized that DSGE model parameters were more stable across countries than across time; removing the US made the data closer in time to the test set. This addresses an internal-validity criticism of our panel approach, which argues that when structural parameters have different values across countries, estimating a single pooled model on all countries strips the parameters of economic meaning (Pesaran and Smith, 1995). For example, a parameter no longer represents the depreciation rate of the United States, but an average depreciation rate across 30 countries. Our result provides a suggestive counterargument to that claim by pointing out that, since one needs a certain number of data points to get accurate estimates to begin with, using data across a cross-section of countries provides forecasts that are at least as good as using data that extends further back in time.
Probing this hypothesis led to somewhat mixed results. We estimated a model trained on the entire panel of countries with data from 1995-Q1 onward. This affected three countries: the US, Japan, and New Zealand. This procedure isolates more sharply the effect of similarity across space versus across time on model generalization, rather than just removing all US data. The US lost about 140-190 data points (as the test set requires rolling forecasts), and New Zealand and Japan each lost about 15-60 timesteps. The resulting performance could suggest that some of the outperformance of the out-of-sample DSGE model was due to chance. However, it seems that the post-1995 data performed at least as well as a model estimated on world data both pre- and post-1995, despite our robust results suggesting that more data is generally better. This makes some practical sense: considering something like the advent of software, the depreciation rate in France in 2015 plausibly has more in common with the depreciation rate in the US in 2020 than with the depreciation rate of the US in 1960. As one needs a set number of data points to identify structural models anyway, our results provide suggestive evidence that using cross-sectional data yields parameters that are no less stable than parameters estimated on data going back in time. The results seem inconclusive, but certainly do not suggest any more parameter stability across time than across space, in contrast with the potential pitfall highlighted by Pesaran and Smith (1995) and generally accepted in the literature. The DSGE models we estimated generally underperformed the parameters recovered in the original Smets and Wouters (2007), because we focused on maximum likelihood estimation without priors, optimized only with gradient descent.
Despite the forecasting success of Smets and Wouters (2007) and other Bayesian DSGE models (Herbst and Schorfheide, 2015; Fernández-Villaverde, Rubio-Ramírez and Schorfheide, 2016), we chose to use a maximum likelihood approach to maintain comparability with the reduced-form and machine learning experiments, as well as with a large portion of the applied literature that focuses on point-estimate techniques ranging from calibration and maximum likelihood to generalized method of moments (Hansen and Singleton, 1982). However, Figure V as well as Appendix G.1 shows that the model estimated via maximum likelihood outperformed the parameters of Smets and Wouters (2007) along some horizons. Given the limitations of maximum likelihood and our differing focus, we see this outperformance as an endorsement of the pooling approach. We show that the use of pooled data results in the Smets-Wouters DSGE model outperforming DSGE models estimated only on US data. Furthermore, we provided suggestive evidence that models may be more externally and internally valid if one uses data across countries, in addition to the statistically significant improvement in generalization. Using our out-of-sample tests, we show that pooled panels can improve the out-of-sample generalization of the Smets-Wouters DSGE model even though such models are theoretically policy-invariant. The data shown in the charts are also displayed in Appendix G.1. As a final note, while it is difficult to quantify improved performance for calibration and generalized method of moments, based on the generalization improvements from estimation for both reduced-form and structural models, we expect these results to generalize: macroeconomists would benefit from calibrating to moments, as well as applying other methods, on data featuring a large number of countries.
Given the improvement in forecasting performance for both the reduced-form and structural models, and the improving relative performance of complex models, we decided to test the performance of nonparametric models that are even more flexible than the DSGEs and some of the larger VARs. We tested both an RNN and an AutoML algorithm. While the improvement in performance was smaller than the DSGE improvement from pooled data, it is arguably more impressive given that the flexible models had much better performance even on US data alone. This again illustrates the trend that models with higher parameter counts gain more from pooling. The two charts below illustrate the performance of the RNN and AutoML models on both US and pooled world data. We compared estimation on US data as well as pooled world data for both models. For the RNN, Figure VII shows the improvement in RMSE from estimating a recurrent neural network using only US data to using the entire cross-section of 50 countries. The improvement was statistically significant for all horizons except five quarters ahead. The average improvement was around 23% over all horizons, which was similar to the improvement for the Smets-Wouters model and almost double the improvement of the linear models. This is a reassuring confirmation, as the RNN is a data-hungry model that benefits more from data-rich regimes. We also attempted to add a country identifier term to our model, for example using GDP per capita at the time of prediction as an input to localize the pooled data to some degree. While this might be expected to reduce bias, it did not improve out-of-sample performance. This potentially suggests that countries are more similar than different and the bias of pooling different countries has a limited effect, while adding such a covariate leads to more overfitting. Our second chart, in Figure VIII, shows the same performance graph for AutoML.
The performance gain for AutoML is not as easily interpreted: AutoML benefits from the pooled data but can also pick different model classes that do relatively well in data-poor and data-rich regimes. Because of that, the large gains of the RNN are more representative of the performance gains from moving to pooled data.

The previous sections outlined the performance of all reduced-form, structural, and machine learning models. This section takes all the data and provides results from a holistic perspective. We first compare forecasting models using all approaches, estimated on both pooled and US data. This table demonstrates the effectiveness of the machine learning forecasting methods in data-rich regimes. AutoML estimated on world data outperforms all baseline economic models on four of the five horizons, and the RNN outperforms on longer-term horizons. We do not report the maximum likelihood version of the Smets-Wouters model, as the original Bayesian parameterization has better performance than either of our maximum likelihood Smets-Wouters models estimated on world or US data. Introducing our other DSGE variations would be difficult to justify and would also have no effect on the results of the horse race. Regardless, all of our models that outperform all baseline models on a horizon are bolded. No baseline model ever outperformed both our models along any horizon, and the best performing model along all horizons was either an AutoML model or an RNN model, likely because the additional pooled data allowed a more powerful model to be used without overfitting. To illustrate the effect that pooling data has on forecasting, we show a graph that orders models by increasing complexity and compares the RMSE trend when estimated on US data versus pooled data. As the image shows, when using increasingly complex models with only the three hundred or so US timesteps, the most parsimonious model, the AR(2), performs the best and the models get progressively worse.
However, when using the pooled data, the picture is entirely different. The AR(2) model actually performs a little bit worse, likely due to chance. However, each progressively more complex model improves on the AR(2) performance. Even if the RMSE decline is less striking in this latter case, the improvement is actually quite large; it only looks small because the US-data RNN performs so poorly that it would never be used. In fact, despite the appearance of only a small improvement due to model complexity, the performance of the RNN on pooled data is state-of-the-art, while the performance of the AR(2) on pooled data is somewhat pedestrian. A similar story holds across other horizons, with less striking consistency compared to the one-period-ahead story. We also provide a graph of the forecasts of five of the models: AR(2), factor, DSGE, RNN, and AutoML, as well as the true data. This graph is useful in disentangling why our machine learning models outperform. To illustrate the relative strengths of the models, we display the one-quarter-ahead forecasts in Figure X here; the rest of the graphs are in Appendix B. The AutoML, factor, and RNN models all did a good job of forecasting the Great Recession, with AutoML forecasting the best at one quarter ahead. The AR(2) and DSGE did not detect a regime change for the recessionary periods and are also upwardly biased, leading to even worse performance during recession onset.5 However, the XGBoost model that performed the best in AutoML forecast the expansions satisfactorily, both in terms of the average level and the individual movements in the quarterly data. Neither the factor model nor the RNN were able to forecast the quarter-by-quarter movements as well.

5. More information on the biases and variances of the different models can be found in Appendix F.2.

In this paper we show how estimating macroeconomic models on a panel of countries, as opposed to a single country, can significantly improve external validity.
Using a panel of countries as a training set, we statistically significantly improved the RMSE performance of reduced-form models -AR(2), VAR(1), and VAR(4)- by roughly 12%. We further show that we can make these reduced-form models more policy/country invariant, suggesting that these models have learned to generalize GDP forecasting even to countries the model has never been trained on. We also showed that a similar training set of a panel of countries can improve the external validity of structural models, which again are typically estimated only on a single country of interest. We focus on the Smets-Wouters model (Smets and Wouters, 2007). Using a panel of countries improves the forecasting performance of the Smets-Wouters model estimated with maximum likelihood by roughly 24% averaged across horizons. These results are again statistically significant. We then demonstrated that we can again improve policy-invariance and generalization to out-of-sample countries by using a panel of countries in our training set. Additionally, we addressed one potential roadblock to the adoption of pooling country data: the fact that the structural parameters may not be stable across countries, and hence the pooled parameter value can only be interpreted as a mean value. While our results are less conclusive on this front, we argue, based on a forecasting exercise, that parameter generalization and stability are likely as good across space as across time. Finally, concluding our section on structural models, we capitalize on the consistency of improvements and discuss the likelihood that our results will extend to other estimation techniques like generalized method of moments, calibration, and Bayesian approaches.
Our last set of results recognizes that our dataset has increased from 300 timesteps to around 3000 timestep-countries, showing that nonparametric machine learning models are able to outperform all the economic baseline models after being estimated in this more data-rich regime. Our RNN outperforms all economic baselines for horizons longer than two periods ahead. Likewise, our AutoML model outperforms all baselines for all horizons except three quarters ahead. Combined, the best performing model over all horizons is either an AutoML model or a recurrent neural network model, which suggests there is likely much more room to test other nonparametric models in the more data-rich macroeconomic regime.

The first model we use is the autoregressive model, AR(n). An oft-used benchmark model, it estimates a linear relationship using the dependent variable lagged n times. In terms of forecasting ability, this model is competitive with or outperforms the other economic models in our tests, which is consistent with Diebold (1998). We used an autoregressive model with two lags and a constant term. Additionally, we compared against the Smets-Wouters model (Smets and Wouters, 2007), as DSGE models share many similarities with recurrent neural networks and Smets and Wouters (2007) suggest that this particular model can outperform VARs and BVARs in forecasting. We used the standard Smets-Wouters Dynare code contained in the published paper's data appendix and take the resulting point forecasts. Like Smets and Wouters (2007), we use Dynare (Adjemian et al., 2011) to solve and estimate the model. A final set of models we included in our baseline economic models were factor models (see Stock and Watson (2002a) and Stock and Watson (2002b)). In short, the factor-model approach takes a large cross-section of data and uses a technique like principal components analysis to reduce the dimensionality of the problem.
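As a sketch of the factor-model approach, with a simulated stand-in for the macro panel (the paper's actual data, lag structure, and factor-selection criterion differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a T x N panel of macro indicators and a GDP growth series.
T, N, n_factors = 200, 40, 5
X = rng.standard_normal((T, N))
y = rng.standard_normal(T)

# 1. Extract principal components from the standardized panel via SVD.
Xs = (X - X.mean(0)) / X.std(0)
_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
factors = Xs @ Vt[:n_factors].T               # T x n_factors component scores

# 2. Regress next-quarter GDP on current factors plus lagged GDP.
Z = np.column_stack([np.ones(T - 1), factors[:-1], y[:-1]])
beta, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)

# 3. Forecast one step ahead from the latest observation.
z_T = np.concatenate([[1.0], factors[-1], [y[-1]]])
forecast = z_T @ beta
```

The dimensionality reduction lets a small regression exploit a wide cross-section of indicators without overfitting.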
In our case, we concatenate five to eight principal components (chosen by information criterion) of the high-dimensional data with a lagged value of GDP and regress future GDP on them. We modified and used the code from FRED-QD as our baseline factor model (McCracken and Ng, 2016). While these models were extremely effective at lower horizons, they were also dependent on a large cross-section of economic data with a long history in a country. In reality, only a few other developed countries have a cross-section of data large enough to permit using these models as effectively as they can be used in the United States. That being said, factor models do outperform our neural networks at shorter time intervals, and we imagine there is promise in combining the factor approach with an RNN or AutoML approach. We also tested the forecasting performance of vector autoregressions (Sims, 1980). In addition to displaying performance in our main table, we compared this model and the AR(2) in our 50-country cross-section test as well. Since we were only forecasting GDP, the vector autoregressive models used lagged GDP, consumption, and unemployment to forecast the single GDP variable as a linear regression. All the economic models were estimated on US GDP, as is standard. While we ran preliminary tests on estimating these models on our cross-section of 50 countries, we ran into issues with estimating both factor models and DSGE models this way. However, preliminary results on the AR(2) model suggest there could be some improvement from using a cross-section even for a three-parameter AR(2) model. The improvement is not as large as for the RNN, which is not surprising, as the RNN has more parameters to take advantage of a larger dataset. The feed-forward network is the prototypical image associated with deep learning. At its core, a feed-forward neural network is a recursively nested linear regression with nonlinear transforms.
For example, assume X_in is a vector-valued input to the neural network and X_out is the output. In a typical linear regression, X_out = X_in β_1. The insight for composing a feed-forward network is to take the output and feed that into another linear regression: Y = X_out β_2. In Figure XVI (An Example of a Feed-Forward Neural Network), X_in would be the input layer, X_out would be the hidden layer, and Y would be the output layer. The problem is not all that interesting if X_out is a scalar. If X_in is a matrix of dimension timesteps by regressors, X_out can be a matrix of dimension timesteps by hidden units. In the figure, the dimension of the hidden layer is four, so β_1 has to be a matrix of dimension three by four (regressors by hidden units). Thus, we make X_out an input into a multidimensional regression for the second layer, Y = X_out β_2, if the first layer is a vector regression.6 This can be repeated for as many layers as desired. A composition of two layers results in Y = X_out β_2 = (X_in β_1)β_2. A product of two matrices is still another matrix, which means the model is still linear, and clearly this holds no matter how many layers are added. However, an early result in the literature showed that if between every regression, e.g. X_out = X_in β_1, one inserts an almost arbitrary nonlinear link function, the neural network can approximate any continuous function (Hornik, Stinchcombe and White, 1989). For example, one can insert a logistic transformation between X_in and X_out, i.e. X_out = σ(X_in β_1), where σ(z) = 1/(1 + e^{-z}). One can apply these nonlinearities as often as one would like, to get something like Y = σ(σ(X_in β_1)β_2).

6. Note: this regression is not a vector autoregression, as X_out is a latent variable.
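The composition just described can be sketched in a few lines of numpy; the weights here are random placeholders rather than fitted values, and the dimensions match the figure (three regressors, four hidden units):

```python
import numpy as np

def sigmoid(z):
    """Logistic link function inserted between the nested regressions."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X_in = rng.standard_normal((10, 3))   # timesteps x regressors
beta1 = rng.standard_normal((3, 4))   # regressors x hidden units
beta2 = rng.standard_normal((4, 1))   # hidden units x output

X_out = sigmoid(X_in @ beta1)         # hidden layer: regression + nonlinearity
Y = X_out @ beta2                     # output layer: another linear regression
```

Without the sigmoid, Y would collapse to X_in (beta1 @ beta2), a single linear regression; the nonlinearity is what makes the composition expressive.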
These are the fundamental building blocks of neural networks: additional compositions of link functions and matrix multiplications of parameters form the basis of deeper networks and allow these models to be universal approximators. For the purposes of this paper, associating an RNN with a feed-forward network will not hinder comprehension. That being said, without even knowing it, many economists are already familiar with RNNs. The simplest is a Kalman filter-like linear state space model, defined by two equations: a state transition equation and a measurement equation.7 In a linear state space model, the state s_i is an unobserved variable which allows the model to keep track of the current environment. One uses the state, along with lagged values of the observed variables, to forecast the observed variables y_i. For example, for GDP, the state could be either an expansionary period or a recession; a priori, the econometrician does not know which. However, one can make an educated guess based on GDP growth. As machine learning is more interested in prediction, the state is often estimated with point estimates, which allows the data scientist to sidestep the tricky problem of filtering. We add autoregressive lags to make the model more general. Using Equations (3) and (4) to recursively substitute for the model prediction at a particular time period yields the forecast for period 1 (Equation (5)) and the forecast for period 2 (Equation (6)). Hatted variables indicate predictions and unhatted variables correspond to actual data. Additional time periods would be solved by iteratively substituting for the state using Equations (3) and (4) for the previous state. In order to update the parameter matrices A, B, C, D, E, and F, the gradient is derived for each matrix and each parameter is updated via hill climbing. We will illustrate the process of hill climbing by taking the gradient with respect to one parameter matrix, B, in Equation (7). Here L() indicates the loss function.
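The recursion and hill-climbing update can be sketched numerically. This assumes a minimal form, state update s_t = A s_{t-1} + B y_{t-1} and measurement ŷ_t = C s_t + D y_{t-1}, with toy values; the paper's full specification carries additional lag matrices (E, F), and in practice the gradient is taken by automatic differentiation rather than finite differences:

```python
import numpy as np

# Toy 1x1 parameter matrices for a scalar observable and scalar state.
A = np.array([[0.5]])
C = np.array([[1.0]])
D = np.array([[0.3]])

def forecast(B, y0, s0, horizons):
    """Iterate the state and measurement equations, feeding each
    prediction back in as the next period's lagged observable."""
    preds, s, y = [], s0, y0
    for _ in range(horizons):
        s = A @ s + B @ y        # state transition
        y = C @ s + D @ y        # measurement / prediction
        preds.append(y[0, 0])
    return np.array(preds)

y0, s0 = np.array([[1.0]]), np.array([[0.0]])
actual = np.array([0.9, 0.8])

def loss(B):
    """Squared-error loss of the two-period forecast, as a function of B."""
    return np.sum((forecast(B, y0, s0, 2) - actual) ** 2)

# One hill-climbing step on B using a finite-difference gradient.
B = np.array([[0.2]])
eps, lr = 1e-6, 0.1
grad = (loss(B + eps) - loss(B - eps)) / (2 * eps)
B_new = B - lr * grad
```

A single step already moves B in the loss-reducing direction, which is all that hill climbing requires.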
Substituting for y_1 and y_2 with Equations (5) and (6) into (7) and using squared error as the loss function, we arrive at an equation, (8), from which we can take partial derivatives with respect to B. Distributing all the B's and taking the derivative of (8) provides the gradients for hill climbing. In practice, the derivatives are taken automatically in code. Gated recurrent units were introduced to improve upon the performance of previous RNNs that resembled linear state space models and to deal with the exploding gradient problem. The problem with linear state space models is that if one does not apply filtering, the state vector either blows up or goes to a steady-state value. This can be seen by recognizing that each additional timestep results in the state vector getting multiplied by U^s an additional time. Depending on whether the eigenvalues of U^s are greater than or less than one, the states will ultimately explode (go to infinity) or go to a steady state. More sophisticated RNNs, like the gated recurrent units we use, fix this with gates. First, recall that σ is the logistic link function: σ(z) = 1/(1 + e^{-z}). The idea behind the gate is to allow the model to control the magnitude of the state vector. A simple gated recurrent neural network looks like the linear state space model with an added gate equation. The output of σ() is a number between zero and one which is element-wise multiplied by s_t, the first draft of the state. The operation ⊙ indicates element-wise multiplication, or the Hadamard product. Variables are subscripted with the time period they are observed in (t or t-1). Weight matrices, which are not a function of the inputs, are superscripted with the equation name they feed into. All elements are considered vectors and matrices, and matrix multiplication is implied when no operation is present.
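The bounding effect of the gate can be seen numerically. In this sketch, the state matrix U and the gate weight are toy values chosen so that the ungated recursion is explosive (eigenvalue above one) while the gated version stays bounded:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

U = np.array([[1.2]])        # eigenvalue > 1: explosive without a gate
W_gate = np.array([[-0.5]])  # toy gate weight

s_plain = np.array([1.0])
s_gated = np.array([1.0])
for _ in range(50):
    s_plain = U @ s_plain                # ungated: grows like 1.2**t
    draft = U @ s_gated                  # first draft of the state
    z = sigmoid(W_gate @ s_gated)        # gate output in (0, 1)
    s_gated = z * draft                  # element-wise (Hadamard) product
```

After fifty steps the ungated state has grown by orders of magnitude, while the gated state remains small: whenever the state grows, the gate output shrinks and damps the next update.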
The presence of the gate controls the behavior of the state, which means that even if the eigenvalues of U^s were greater than one, or equivalently even if h_t would explode without the gate, the gate can keep the state bounded. Additionally, the steady-state distribution of the state does not have to converge to a number; the behavior could be periodic or even chaotic (Zerroug, Terrissa and Faure, 2013). This allows for the modeling of more complex behavior as well as the ability of the state vector to "remember" behavior over longer time periods (Chung et al., 2014). The equations of the gated recurrent unit are given in Equations (14)-(18), where tanh denotes the hyperbolic tangent. Like the linear state space model, the state vector of the gated recurrent unit persists over timesteps in the model. Mapping these equations to Equations (10)-(13): Equation (14) is the measurement equation (analogous to Equation (10)); Equations (15) and (16) are both gates and analogous to Equation (11); Equation (17) is the first draft of the state before the gate z_t is applied and resembles Equation (12); and Equation (18) is the final draft of the state after z_t is applied and resembles Equation (13). The recurrent neural network is optimized using gradient descent, where the derivative of the loss function with respect to the parameters is calculated via the chain rule/reverse-mode differentiation. The gradient descent optimizer algorithm we use is Adam (Kingma and Ba, 2014), which shares similarities with a quasi-Newton approach. See Appendix E.7 for more information. A nonlinearity used in our architecture, but not in the gated recurrent unit layers, is the rectified linear unit (ReLU) (Agarap, 2018), defined as ReLU(x) = max(0, x): the identity operation with a floor of zero, much like the payoff of a call option. Despite being almost the identity map, this nonlinearity applied in a wide enough neural network can approximate any function (Hornik, Stinchcombe and White, 1989).
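A single gated-recurrent-unit step in the standard form (update gate z_t, reset gate r_t, candidate state, gated final state) can be sketched as follows; the weight matrices here are random placeholders, not estimated values, and bias terms are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
Wz, Wr, Wh = (rng.standard_normal((n_h, n_in)) for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((n_h, n_h)) for _ in range(3))

def gru_step(x_t, h_prev):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # first draft of state
    return z * h_prev + (1.0 - z) * h_tilde           # gated final state

h = np.zeros(n_h)                                     # state persists across steps
for x in rng.standard_normal((5, n_in)):              # run five timesteps
    h = gru_step(x, h)
```

Because the final state is a convex combination of the previous state and a tanh-bounded candidate, its entries stay in (-1, 1) no matter how many timesteps are run, illustrating the boundedness property discussed above.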
Skip connections (He et al., 2015) allow the input to skip the operation in a given layer. The input is then added onto the output of the skipped layer, forming the final output of the layer. This allows the layer being skipped to learn a difference between the "correct" output and the input, instead of learning a full transformation. Additionally, if the model is overfitting, the neural network can learn the identity map easily. Skip connections are typically used when the input and the output have the same dimension, which allows each input to correspond to one output. Because our network does not have this property, we learn a linear matrix that converts the input to the output dimension. All the skip connections are linear operations and have no activation or batch norm, which differs from the pair of dense layers at the beginning of the network, which have both batch norm and rectified linear unit activations. Batch normalization (Ioffe and Szegedy, 2015) is used to prevent drift of the output through a deep neural network. Changes to parameters in the early layers will cause an outsized effect on the output values of the later layers. Batch norm fixes this problem by normalizing the output of each layer to look like a standard normal distribution. Thus the effect of changes in parameters will not greatly affect the magnitude of the output vector, as between each layer the data is re-normalized to have a mean of 0 and a standard deviation of 1. Adam combines momentum (Polyak, 1964), a technique that uses recent history to smooth out swings orthogonal to the objective direction, with RMSprop (Tieleman and Hinton, 2012), a technique used to adjust the step size based on gradient volatility. Traditional gradient descent hill climbing updates the parameters with a single equation: θ_t = θ_{t-1} - λ ∇_θ L_θ(x, y). Here ∇_θ L_θ(x, y) denotes the gradient of the loss with respect to θ, the parameters of the model. For convenience, we will denote this term g_t.
By subtracting the gradient multiplied by a small step size λ, we move the parameters θ in the direction that reduces the loss the most at θ_{t-1}. If we wanted to use information from the second derivative to inform optimization, we could use Newton-Raphson instead: θ_t = θ_{t-1} - H^{-1} g_t, where H is the Hessian of the loss. This uses the Hessian to determine an optimal step size based on the steepness of the loss function. Typically, this approach is not used in deep learning, as deep learning models have a large number of parameters: calculating the Hessian has a quadratic cost in the number of parameters, and inverting it has a super-linear cost as well. However, there are quasi-Newton methods that attempt to approximate the Hessian to determine the step size without the high computational cost; Adam is similar to these methods. The equations that define Adam are:

m_t = β_1 m_{t-1} + (1 - β_1) g_t
v_t = β_2 v_{t-1} + (1 - β_2) g_t^2
m̂_t = m_t / (1 - β_1^t),  v̂_t = v_t / (1 - β_2^t)
θ_t = θ_{t-1} - λ m̂_t / (sqrt(v̂_t) + ε)

The first equation is a moving average of the gradient. This "momentum" term is used because, often in training, the direction of the gradient moves nearly perpendicular to the direction towards the optimum. Gradient descent would spend a lot of time zig-zagging while only making slow progress towards the optimum (see Figure XVII). Taking a moving average of previous gradients preserves the principal direction while the orthogonal directions cancel each other out. This approach is empirically shown to lead to more stable optimization, and even better optima, than simpler gradient descent approaches for large networks.

We regress true GDP on a varying collection of forecasts to test for a statistically significant contribution of a given forecast, like that of our gated recurrent unit model. An interpretation of significant coefficients is that the given forecast method adds statistically significant information when pooled with the other regressors. Here is the H2O forecast compared to the SPF on the baseline test set, ranging from one-quarter-ahead to five-quarter-ahead forecasts:

Notes: *** Significant at the 1 percent level.
** Significant at the 5 percent level. * Significant at the 10 percent level.

The following table contains regressions comparing the information content of the H2O model and the baselines, excluding the SPF:

Notes: *** Significant at the 1 percent level. ** Significant at the 5 percent level. * Significant at the 10 percent level.

This final table performs the same regression but includes the SPF, H2O, and all baseline models:

Notes: *** Significant at the 1 percent level. ** Significant at the 5 percent level. * Significant at the 10 percent level.

None of the models except the SPF have consistent, statistically significant information above and beyond the other models. The following table contains the mean bias as well as the variance of the models; for the gated recurrent unit we use the median forecast. For our RNN model, we found we could improve forecasting performance by taking the mean prediction of many models estimated by stochastic gradient descent. The ensembling improves performance slightly, but later graphs will show it also improves model stability and variance. Bolded entries indicate outperformance over all economic models. We provide a Monte Carlo simulation (Table IX), estimating our RNN model at each time horizon 100 times. At every horizon, the average root mean squared error of our simulated models indicates competitive performance, if not outperformance, against the baseline models. Interestingly, it seems that the model performing best on validation data often performs worse on the test data than the average over all the models. This is something that should be investigated further, but based on this phenomenon, we recommend that practitioners take a simple mean or median forecast across many different models. In response to potential critiques, we also provide robustness checks across different architectures and data periods.
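The ensembling recommendation can be sketched as follows. Here each "model" is a toy forecaster, last observed value plus seed-dependent noise, standing in for an RNN re-estimated from a fresh random initialization; only the averaging logic is the point:

```python
import numpy as np

def toy_forecast(history, seed):
    """Stand-in for one independently estimated model: the last observed
    value plus noise that depends on the random initialization (seed)."""
    rng = np.random.default_rng(seed)
    return history[-1] + 0.1 * rng.standard_normal()

history = [2.1, 1.9, 2.3, 2.0]

# Estimate 100 "models" and combine their forecasts.
forecasts = np.array([toy_forecast(history, s) for s in range(100)])
mean_forecast = forecasts.mean()
median_forecast = np.median(forecasts)
```

Averaging across initializations cancels the seed-dependent noise, which mirrors why the mean or median forecast is more stable than any single estimated model.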
One test we performed was to replace the gated recurrent units with long short-term memory (LSTM) layers (Hochreiter and Schmidhuber, 1997), another type of RNN. We use the same test data as for the main result (USA 2009-Q1 to 2020-Q1) as well as the same inputs. The LSTM results in Table X are analogous to those for the gated recurrent unit models in the table in Section VI.C. in the main text. Mean RMSE and Std Dev RMSE correspond to the entries in the table below. The baseline performances are still the same, as the test set has not changed. The LSTM networks outperform the baseline models along essentially the same time horizons. Performance is also competitive, but consistently a little worse than the gated recurrent unit over all time frames. The LSTM has a similar standard deviation of root mean squared error, suggesting that the two models consistently find a similar optimum when it comes to forecasting. Again, taking a model average through the mean or median forecast results in small but consistent root mean squared error performance improvements. We also re-ran the experiment on an alternative test window (Table XI). The reason we use this horizon is that it contains no recessions. Since the highly flexible neural network will have an advantage forecasting periods with a significant departure from a more linear-friendly period of expansion, removing the recessions would hamstring our model compared to the more linear model baselines. Our gated recurrent units were completely re-estimated, as we additionally included 2009-Q1 to 2009-Q3 in the validation set. Performance would improve if we left those (recession) timesteps out of the validation set, as the test set contains no recessions; however, this decision cannot be rationalized from the point of view of an out-of-sample forecaster.
Although this version of our model did not outperform the best baseline models along any horizon, considering performance over all horizons, we think our median and mean models are better than the US AR(2), VAR(1), and the factor model on this test set, while performing slightly worse than the DSGE model and the world AR(2). This supports our hypothesis that the main outperformance of our model was in highly nonlinear domains like recessions and other regime changes, although using the cross-sectional data reduced the tendency for the models to be biased upwards and was a contributor to the RNN's outperformance over models trained only on US data. This provides supplementary evidence that the outperformance of our neural network is not due to either over-fitting the test set or over-fitting the architecture choice.

RMSE by horizon (one to five quarters ahead):

AR(2), US data:               1.7  1.7  1.8  1.9  1.9
AR(2), World data:            1.6  1.6  1.6  1.5  1.5*
Smets-Wouters DSGE, US data:  1.8  1.8  1.7  1.6  1.5*
Factor, US data:              1.6  1.6  1.6  1.9  2.1
RNN*, Best:                   1.8  2.3  2.0  2.0  1.9
RNN*, Mean Forecast:          1.7  1.7  1.7  1.7  1.7
RNN*, Median Forecast:        1.7  1.7  1.7  1.7  1.7
SPF Median:                   1.4  1.5  1.5  1.5  1.5

* All RNN models use the entire world data cross-section.

Additionally, we ran Monte Carlo simulations (Table XII) which show that, given one hundred random initializations and optimization routines over all five horizons, the model still consistently achieves low root mean squared error and has a low standard deviation, demonstrating stability and reproducibility.
References

Dynare: Reference Manual Version 4
Deep Learning using Rectified Linear Units (ReLU)
Threshold effects in the relationship between inflation and economic growth: Further empirical evidence from the developed and developing world
Instrumental variable estimation in a data rich environment
Forecasting with panel data
Some cautions on the use of panel methods for integrated series of macroeconomic data
The analysis of macroeconomic panel data
On the Properties of Neural Machine Translation: Encoder-Decoder Approaches
Empirical evaluation of gated recurrent neural networks on sequence modeling
The effects of broadband deployment on output and employment: A cross-sectional analysis of US data
Comparing predictive accuracy
The macroeconomy and the yield curve: a dynamic latent factor approach
GMM estimators with improved finite sample properties using principal components of the weighting matrix, with an application to the dynamic panel data model
A comparison of forecast performance between federal reserve staff forecasts, simple reduced-form models, and a DSGE model
Estimating macroeconomic models: A likelihood approach
Solution and estimation methods for DSGE models
AutoML 2014 @ ICML
Generalized instrumental variables estimation of nonlinear rational expectations models
Deep Residual Learning for Image Recognition
Bayesian estimation of DSGE models
Long Short-Term Memory
Fast model-based protein homology detection without alignment
Multilayer feedforward networks are universal approximators
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Scaling Laws for Neural Language Models
Bayesian model averaging in the instrumental variable regression model
Short-term Power Load Forecasting Based on Gate Recurrent Unit Network and Cloud Computing Platform
A System for Robotic Heart Surgery that Learns to Tie Knots Using Recurrent Neural Networks
FRED-MD: A monthly database for macroeconomic research
Deep learning approach for short-term stock trends prediction based on two-stream gated recurrent unit network
Government spending multipliers under the zero lower bound: Evidence from Japan
Survey of Professional Forecasters
A simple nonparametric test of predictive performance
Estimating long-run relationships from dynamic heterogeneous panels
Some methods of speeding up the convergence of iteration methods
Learning representations by back-propagating errors
Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting
Macroeconomics and Reality
Macroeconomic indicator forecasting with deep neural networks
Shocks and frictions in US business cycles: A Bayesian DSGE approach
Forecasting using principal components from a large number of predictors
Macroeconomic forecasting using diffusion indexes
Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms
Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude
Handbook of economic forecasting
On periodicity in series of related terms
Time series: economic forecasting
Asymptotic inference about predictive ability
Bayesian model averaging and exchange rate forecasts
Chaotic dynamical behavior of recurrent neural network