Forecasting US Inflation Using Bayesian Nonparametric Models
Todd E. Clark, Florian Huber, Gary Koop, Massimiliano Marcellino
February 28, 2022

The relationship between inflation and predictors such as unemployment is potentially nonlinear with a strength that varies over time, and prediction errors may be subject to large, asymmetric shocks. Inspired by these concerns, we develop a model for inflation forecasting that is nonparametric both in the conditional mean and in the error using Gaussian and Dirichlet processes, respectively. We discuss how both these features may be important in producing accurate forecasts of inflation. In a forecasting exercise involving CPI inflation, we find that our approach has substantial benefits, both overall and in the left tail, with nonparametric modeling of the conditional mean being of particular importance.

As reviewed in studies including Stock and Watson (2007) and Faust and Wright (2013), inflation forecasting is fraught with challenges. Structural economic models and simple economic reasoning imply that inflation should be forecastable with a range of indicators, including measures of domestic and international economic activity, import prices or exchange rates, cost measures such as wage growth, and oil prices. Other work has explored forecasting inflation with forward-looking financial indicators, such as bond yields. Over some periods and in some studies, some of these variables have yielded some success in improving the accuracy of inflation forecasts. However, by now a body of work has established that simple inflation forecasts from the unobserved components model of Stock and Watson (2007) and the inflation gap model of Faust and Wright (2013) are very difficult for another specification to beat. These specifications improve inflation forecasts by accounting for a time-varying trend in inflation. But adding other information does little to help inflation forecasts. For example, models motivated by the Phillips curve to include indicators of economic activity or marginal costs of production cannot consistently improve on these simple univariate benchmarks. 1

Although most of the literature alluded to has focused on parametric linear models of inflation, other work has examined the predictability of inflation using nonlinear or nonparametric models. For example, some work has suggested that the nonlinear effects of economic activity on inflation kick in as the economy becomes very strong, such that economic expansions fail to produce much of a rise in inflation until such times. Babb and Detmeister (2017) provide a useful review of the literature on nonlinear Phillips curves. A fast-growing literature evaluates the use of machine learning techniques for macroeconomic forecasting, with random forests (see Breiman (2001) and, e.g., Masini, Medeiros, and Mendes (2021), for a survey) performing particularly well, including during crisis times, in a variety of studies and for key variables such as GDP growth and inflation; see, e.g., Goulet Coulombe, Marcellino, and Stevanovic (2021). While these papers adopt classical methods, Bayesian techniques are also available. Jochmann (2015) proposes an infinite hidden Markov model and applies it to model US inflation dynamics.
This model endogenously selects the number of regimes and, for US data, finds a secular decline in inflation volatility and around 7 distinct inflation regimes. In a recent paper, Clark, et al. (2021) use multivariate Bayesian additive regression tree (BART) models to forecast several US macroeconomic aggregates (including inflation), with particular interest in forecasting tail risks. For inflation, BART models improve upon linear models in turbulent periods such as the COVID-19 pandemic. One shortcoming of BART (and in fact many nonparametric techniques such as Gaussian processes or kernel methods) is that the shocks are assumed to be Gaussian. If there is empirical evidence of non-Gaussian features such as heavy tails in the innovations, the flexibility of these nonparametric models could imply that the conditional mean captures this and the model would thus erroneously suggest nonlinear relationships between inflation dynamics and predictors of inflation. 2 These considerations motivate the models we propose in this paper. In this paper, we consider models of inflation that combine nonparametric specifications of the conditional mean and nonparametric specifications of the innovation to inflation. We model the conditional mean using a flexible and analytically tractable Gaussian process (GP) regression. This specification is capable of capturing nonlinearities in the relationship between inflation and its predictors. To avoid overfitting, we introduce a subspace shrinkage prior (Shin, Bhattacharya, and Johnson, 2020 ) that shrinks the GP regression toward a linear subspace in a data-driven manner. The resulting model can be interpreted as a convex combination between a GP regression and either a linear (estimated by OLS) or a factor model (when principal components rather than the original regressors are used to define the linear subspace). To capture fat tails, possible asymmetries, and other non-Gaussian features that might determine inflation dynamics, we introduce a Dirichlet process mixture (DPM) model to estimate the unknown shock distribution. This mixture model allows us to capture unobserved heterogeneity in a very flexible manner and is thus capable of handling situations such as the pandemic. To assess which of these features improves inflation forecasts, we also consider variants that treat the conditional mean as linear or the error term as Gaussian. In addition, we allow for error specifications with either constant variances or stochastic volatility. After developing proper Markov chain Monte Carlo (MCMC) estimation algorithms, we use all of the models to forecast quarterly consumer price inflation (CPI) in the US, using various sets of predictors. We evaluate out-of-sample forecasts over a long sample of 1980 through 2021, on the basis of the accuracy of point forecasts, density forecasts, and tail risk forecasts. Our results confirm the benefits of our flexible, nonparametric models. Over the 1980 to 2021 period, our nonparametric models achieve some gains in the accuracy of point and density forecasts relative to the common benchmark of the univariate model of Stock and Watson (2007) . These models achieve sizable gains during the volatile 2020-21 period of the pandemic. The primary gains to flexible nonparametric modeling come from nonlinear modeling of the conditional mean, through Gaussian processes. 
Although a large set of variables in some settings offers an advantage, it does not do so uniformly; most of the gains to nonparametric modeling can be achieved with a moderately sized set of variables. In exercises that drill deeper into the properties of predictive distributions, our nonparametric models are also shown to yield gains in predicting left-tail risks to inflation. They are more challenged in capturing time variation in the right tail, which may have to do with the prevalence of low and stable inflation for much of the sample. Although the sample is small, the models seem to better capture the right tail of the predictive distribution of inflation during the pandemic period. We also show that our proposed models yield predictive distributions that sometimes display asymmetry, related to the literature on inflation at risk; see, e.g., Lopez-Salido and Loria (2020).

The paper proceeds as follows. Section 2 presents our models and estimation algorithms. Section 3 describes the data and design of the forecasting exercise and presents results. Section 4 summarizes results for a robustness check with an alternative measure of inflation that excludes food and energy. Section 5 concludes.

Models used for macroeconomic forecasting involve assumptions about the functional form of the conditional mean and the distributional form of the error process. In this paper we use nonparametric forms for both of these, and it is in this sense that we refer to our model as fully nonparametric. For the conditional mean, we use a GP prior and for the error distribution a DPM model. In this section, we define our model, discuss its properties, and introduce MCMC methods that allow for computationally efficient Bayesian inference and prediction, with additional details presented in Appendix A.

Our model assumes that inflation in time t, y_t, depends on a vector of K appropriately lagged predictors, x_t, in a possibly nonlinear way:

y_t = f(x_t) + ε_t.  (1)

Here, f : R^K → R denotes an unknown and potentially nonlinear function. In the next subsection, we focus on f. Then we present our nonparametric treatment of ε_t.

Learning the unknown function f can be achieved through many different techniques such as Bayesian additive regression trees (Chipman, George, and McCulloch, 2010), B-splines (Shin, Bhattacharya, and Johnson, 2020), or (deep) neural networks (Nakamura, 2005; Goulet Coulombe, 2022). In this paper, we propose approximating the unknown function f using a GP regression. This approach places a GP prior on the function f. This implies a Gaussian prior on f = (f(x_1), . . . , f(x_T))':

f ∼ N(0_T, K),  (2)

with K being a T × T kernel matrix with typical element k(x_t, x_τ) for times t and τ. GP priors are nonparametric in the sense that they do not assume a particular form for f; instead, they are interpreted as a prior over all functions that might fit the data. In essence, the T elements in f are treated as unknown parameters. The likelihood defined by (1) is over-parameterized, but the prior information given in (2) can be used to overcome this concern. A textbook introduction to GPs is given in Rasmussen and Williams (2006). A recent macroeconomic application is Hauzenberger, et al. (2021), who also provide further intuition and explanation of GPs in an economic context. As compared to other approaches, GPs can be applied to data sets including many covariates without introducing additional parameters (and hence remaining relatively parsimonious).
Another key advantage is that the computational burden is little affected by the number of covariates but depends largely on the number of observations. This makes GP regression well suited to quarterly macroeconomic data where T is relatively small.

Function estimation through GPs relies heavily on the particular choice of the kernel. Suitable kernels allow for capturing many different functional shapes and dynamics for the function f. In principle, many choices for K are possible. In a time-series context, kernels can be developed for capturing low-frequency movements or abrupt breaks. In this sense, they can approximate the behavior of unobserved component models and successfully extract trend inflation. Kernels can also be specified in a way that leads to a specific form of a neural network; see Lee, et al. (2017). But the most common choice, which we also adopt in some cases, is the Gaussian kernel. A typical element of K under a Gaussian kernel is given by:

k(x_t, x_τ) = ξ exp( -(φ/2) ||x_t - x_τ||^2 ),

with ξ, φ ∈ R_+ denoting the hyperparameters of the kernel. This Gaussian kernel captures the idea that similar values for x_t and x_τ should be associated with similar values for f(x_t) and f(x_τ). The distance between two values of x_t and x_τ is measured in squared exponential terms. The degree of smoothness of the function depends on φ. Low values for this hyperparameter lead to a smooth function, whereas higher values allow for more high-frequency variation. Note that if x_t = x_τ, then Var(f(x_t)) = ξ, so ξ controls the variance of the function f. Since these hyperparameters are crucial for appropriately capturing inflation dynamics, we estimate them using a Bayesian approach. This requires adequate priors. We found that values of ξ and φ greater than 1 led to overfitting, and thus, we use a Uniform prior between 0 and 1 to avoid values of ξ, φ > 1. This choice implies that, as long as these hyperparameters are not too large, we remain agnostic on the precise values of ξ and φ.

We also introduce a second version of the GP prior, which involves the concept of subspace shrinkage; see Shin, Bhattacharya, and Johnson (2020). To motivate this addition, note that the GP prior using the Gaussian kernel reflects a belief in smoothness. However, we might also be interested in a prior that reflects a belief in linearity. Linearity is a subspace of the nonlinear form of f, hence the terminology subspace shrinkage. This might be useful in and of itself (i.e., shrinking toward a more parsimonious model often improves forecasts), but subspace shrinkage methods can also be used as a model selection device (i.e., they can select the linear model if the data warrant this; see Huber and Koop (2021)). Shin, Bhattacharya, and Johnson (2020) show how one can shrink toward a pre-specified subspace such as the linear one in the context of a particular nonparametric model (in their case, a B-spline regression). 3 Here we adapt these methods for our purposes. Let Φ_0 = X(X'X)^{-1}X' denote the linear projection matrix of X = (x_1, . . . , x_T)'. Subspace shrinkage involves modifying the prior variance, which in our case is the kernel: the GP prior in (2) is replaced by a GP prior whose kernel combines the Gaussian kernel K with the projection matrix Φ_0 and depends on an additional scalar parameter τ^2. We stress that our GP subspace shrinkage prior is still a GP prior, but with a different choice of kernel. Hence, the same Bayesian MCMC methods such as those for the GP prior with Gaussian kernel described above can be used. The only addition is that we treat τ^2 as an unknown parameter.
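To make the kernel construction concrete, the following minimal sketch (in Python with NumPy; the function names gp_kernel and linear_projection and the numerical values are ours, not the paper's) builds the T × T Gaussian kernel matrix with hyperparameters ξ and φ and the projection matrix Φ_0 toward which the GP is shrunk:

```python
import numpy as np

def gp_kernel(X, xi, phi):
    """Gaussian kernel matrix with K[t, s] = xi * exp(-phi/2 * ||x_t - x_s||^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # T x T squared distances
    return xi * np.exp(-0.5 * phi * sq_dists)

def linear_projection(X):
    """Projection matrix Phi_0 = X (X'X)^{-1} X' onto the linear subspace spanned by the predictors."""
    return X @ np.linalg.inv(X.T @ X) @ X.T

# toy example: T = 100 observations on K = 3 predictors (illustrative data only)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
K = gp_kernel(X, xi=0.5, phi=0.5)      # prior covariance of f = (f(x_1), ..., f(x_T))'
Phi0 = linear_projection(X)            # used to shrink the GP fit toward the linear fit Phi0 @ y
f_draw = rng.multivariate_normal(np.zeros(100), K + 1e-8 * np.eye(100))  # one draw from the GP prior
```

The draw at the end illustrates the sense in which (2) is a prior over functions: each draw of f is one candidate path for the conditional mean evaluated at the observed covariates.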
We assume that the prior on τ has a density proportional to

p(τ) ∝ τ^{2 d_1 - 1} (1 + τ^2)^{-(d_0 + d_1)}.

This prior reduces to the half-Cauchy distribution if d_0 = d_1 = 1/2, a choice we follow in this paper (Shin, Bhattacharya, and Johnson, 2020). The parameter τ^2 plays an important role in the prior as it controls the weight on the linear part of the model. If τ^2 is large, little weight is placed on the linear model. But as τ^2 decreases, the linear part receives more weight. Formally, Lemma 3.1 of Shin, Bhattacharya, and Johnson (2020) shows that:

E(f | y) = ω Φ_0 y + (1 - ω) f̄,

with ω = 1/(1 + τ^2) ∈ [0, 1], y = (y_1, . . . , y_T)', and f̄ being the posterior mean of f for the GP process with Gaussian kernel. This shows that the posterior fit of f under the subspace shrinkage prior can be interpreted as a convex combination between the fit of a GP regression with kernel K and the fit of a simple OLS regression. 4 The same result also holds when interest centers on predictive distributions, thus implying that our approach can, in a data-driven way, assess whether inflation is better described by a linear model or whether nonlinearities in the conditional mean are necessary.

We now turn to the modeling of the error distribution. The GP specification on the conditional mean implies a great deal of flexibility in terms of capturing arbitrary functional relationships between the covariates in x_t and y_t. However, macroeconomic time series are also subject to, e.g., infrequent large shocks, conditional heteroskedasticity, and possible multi-modality of the shocks. To capture such features without strong a priori assumptions, we rely on DPMs. DPMs have long been used as a Bayesian nonparametric method for uncovering unknown distributions; see, e.g., Escobar and West (1995). We use an implementation as in Frühwirth-Schnatter and Malsiner-Walli (2019).

The DPM is an infinite mixture of distributions. We assume the errors to be independent over time with

ε_t ∼ Σ_{j=1}^{∞} w_j N(µ_j, σ_j^2),  (5)

where Σ_{j=1}^{∞} w_j = 1 and w_j ≥ 0 ∀ j. We use the standard stick-breaking representation of the weights, w_j, developed in Sethuraman (1994), to cast the mixture into a finite dimensional representation. The stick-breaking representation can be interpreted as a prior on the weights w_j that depends on auxiliary quantities ξ_j as follows:

w_1 = ξ_1,  w_j = ξ_j ∏_{i=1}^{j-1} (1 - ξ_i) for j ≥ 2,  (6)

and each ξ_i ∼ B(1, α) is Beta distributed with α being a hyperparameter. This hyperparameter plays an important role in controlling the clustering behavior of the DPM, and we thus introduce a Gamma prior α ∼ G(2, 4), a choice suggested in Escobar and West (1995). For the component means µ_j we use Gaussian priors centered on zero with variance v_j = 4. Given the scale of our data, this introduces relatively little information and allows for sufficient flexibility for capturing outliers. On the component precision σ_j^{-2}, we use Gamma priors σ_j^{-2} ∼ G(c_0, c_1) with c_0 = 10 and c_1 = 5. This choice ensures a proper prior with mean 2 and variance 0.4, introducing sufficient information if one of the components includes only very few (or no) observations.

Intuitively speaking, this mixture specification soaks up any variation in y_t not explained through the GP component in the conditional mean. Since our choice of the kernel implies that the GP captures smoothly varying trends in inflation (which are determined by x_t), the mixture model will capture transitory and possibly large shocks to the trend that can also be asymmetric. In our empirical work, we present results using this form for the errors.
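As an illustration of the stick-breaking construction and of this error distribution, the sketch below (ours, not the authors' code) simulates errors from a truncated version of the DPM; the truncation level J and the seed are illustrative, while the Beta, Gaussian, and Gamma priors follow the values stated above (interpreting G(a, b) as shape a and rate b):

```python
import numpy as np

def stick_breaking_weights(alpha, J, rng):
    """Truncated stick-breaking: w_1 = xi_1, w_j = xi_j * prod_{i<j}(1 - xi_i), xi_j ~ Beta(1, alpha)."""
    xi = rng.beta(1.0, alpha, size=J)
    xi[-1] = 1.0                                           # close the stick at the truncation level
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - xi[:-1])))
    return xi * remaining                                  # weights sum to one

rng = np.random.default_rng(1)
alpha = rng.gamma(shape=2.0, scale=1.0 / 4.0)              # alpha ~ G(2, 4): shape 2, rate 4
J = 20                                                     # illustrative truncation level
w = stick_breaking_weights(alpha, J, rng)
mu = rng.normal(0.0, 2.0, size=J)                          # component means, prior variance v_j = 4
sigma2 = 1.0 / rng.gamma(10.0, 1.0 / 5.0, size=J)          # precisions ~ G(10, 5), so prior mean 2

# simulate T errors from the (truncated) mixture
T = 200
comp = rng.choice(J, size=T, p=w / w.sum())
eps = rng.normal(mu[comp], np.sqrt(sigma2[comp]))
```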
But note that it assumes the errors to be independent over time, which is a potential drawback with time-series data. In particular, if the time series features volatility clustering, the DPM has problems capturing persistence in the variance of the series. Standard stochastic volatility (SV) models capture this through a persistent latent volatility component. To get the best of both worlds, we propose a version of the DPM that is capable of detecting volatility clusters. Several specifications have been proposed in the literature for combining DPMs with SV; see, for instance, Jensen and Maheu (2010) and Jensen and Maheu (2014). But these specifications imply that standard Kalman filter-based algorithms are not applicable. To circumvent this issue, we consider a model, which we label DPM-SV, that replaces (5) with

ε_t ∼ Σ_{j=1}^{∞} w_j N(µ_j, σ_t^2)

and lets σ_t^2 follow a standard SV process, with log σ_t^2 evolving according to an AR(1) process. We assume the same stick-breaking process for the weights. Thus, this model combines some of the flexibility of the DPM with the empirically desirable properties of the SV models.

Estimation of all of these models can be carried out using MCMC techniques. In Appendix A we provide additional details on the precise steps. In principle, our MCMC algorithm samples all parameters and latent quantities of the model from their full conditional posterior distributions. We carry out posterior inference by repeating the algorithm 20,000 times and discarding the first 10,000 draws as burn-in. Based on full-sample results, the sampler mixes well, with inefficiency factors across all parameters and latent states of the model being well below 40.

This section briefly describes the data, summarizes the models and design of the forecasting exercise, and presents results. We use quarterly data that range from 1959:Q1 to 2021:Q3 from the FRED-QD database developed in McCracken and Ng (2020) and maintained by the Federal Reserve Bank of St. Louis. As a measure of medium-term inflation expectations, we also use a 5-quarters-ahead inflation expectation from the Survey of Professional Forecasters, obtained from the website of the Federal Reserve Bank of Philadelphia. We focus on forecasting CPI inflation, measured as (400/h) ln(P_{t+h}/P_t) at horizon h = 1, 4, with results on ex food and energy inflation provided in a robustness check.

We run all of the models, except for the unobserved components (UC) ones, which do not include any explanatory variables, using three different data sets: one that uses only lagged values of inflation as explanatory variables (labeled AR(1)), a moderately sized data set that includes 29 variables, and a large one involving 169 variables (which, as indicated below, we sometimes reduce to principal components to facilitate estimation). As detailed in Appendix B, the set with 29 variables includes an array of major macro indicators fitting into broad categories commonly considered as possible predictors of inflation. These include various indicators of economic activity (e.g., growth in industrial production, growth in payroll employment, and the unemployment rate), growth in wages and unit labor costs, producer price inflation, and financial indicators (e.g., interest rates, stock returns, and growth in business loans).

In our forecasting exercise, we consider a variety of different models and different data sets. We consider four different treatments of the conditional mean, four different treatments of the error distribution, and three different sets of variables. In all implementations, the models include one lag of the explanatory variables.
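Before turning to the list of model specifications, a small simulation sketch of the DPM-SV error process described earlier in this section may help fix ideas (ours, with illustrative SV parameters and placeholder mixture weights rather than the paper's estimated values):

```python
import numpy as np

rng = np.random.default_rng(2)
T, J = 200, 10

# stochastic volatility: log sigma_t^2 follows an AR(1) process (parameter values are illustrative)
mu_h, rho_h, sig_h = -1.0, 0.9, 0.2
h = np.empty(T)
h[0] = mu_h
for t in range(1, T):
    h[t] = mu_h + rho_h * (h[t - 1] - mu_h) + sig_h * rng.standard_normal()
sigma2_t = np.exp(h)

# mixture over component means with a shared, time-varying variance
w = np.full(J, 1.0 / J)                        # placeholder weights; in the model they come from stick-breaking
mu = rng.normal(0.0, 2.0, size=J)
comp = rng.choice(J, size=T, p=w)
eps = rng.normal(mu[comp], np.sqrt(sigma2_t))  # DPM-SV errors: eps_t | delta_t = j ~ N(mu_j, sigma_t^2)
```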
In this sub-section we list the models and define the acronyms we use. For the conditional mean, we have two versions of the Gaussian process with and without subspace shrinkage: GP-sub and GP, respectively. We also present results assuming f is linear and use the acronym Linear for this. This model is estimated as a nested alternative of GP-sub with ω = 1. We also produce results for UC models that involve only a time-varying intercept that follows a random walk. This model (with a stochastic volatility specification; see below) serves as our main benchmark given its strong performance when forecasting quarterly inflation (see, e.g., Stock and Watson (2007) and Stock and Watson (2010)).

For the error distribution we assess whether allowing for departures from Gaussianity and homoskedasticity in a flexible way pays off. To do so, we consider conventional Normal errors with constant variance (labeled Homosk.) and also with SV. The former serves as a natural benchmark to assess the predictive gains from allowing for heteroskedasticity, whereas the latter has been shown to be helpful in improving the accuracy of both point and density forecasts (Clark, 2011). These traditional assumptions on the shocks are contrasted with our comparable nonparametric versions: the DPM and DPM-SV defined in the preceding sub-section. This leads to 16 different models that combine a choice for the conditional mean with one for the error distribution. We estimate and forecast with all of these.

We run all of the models, except for the UC ones, which do not include any explanatory variables, using the three different data sets described above. These data sets are also used to form the subspace toward which we shrink in our GP-sub models. It is worth stressing that for the large GP regressions, the regressors enter the model in an unrestricted manner. But for the linear regression model, since OLS estimation using all 175 regressors of the large data set is infeasible, we follow Stock and Watson (2002) and use a linear model with the first 6 principal components being explanatory variables. As described in Footnote 4, this also forms the subspace toward which we shrink the large-scale GP regressions.

The design of our forecasting exercise is recursive, using an expanding window of data. We use the period from 1980:Q1 to 2021:Q3 as our forecast evaluation period and produce forecasts for horizons h = 1 and h = 4. These forecasts are produced by lagging the elements in x_t appropriately (i.e., we compute direct forecasts). As measures of predictive accuracy we focus on the mean squared forecast error (MSE) and the log predictive likelihood (i.e., log score, denoted LPL). To assess tail forecasting accuracy, we will focus on the quantile score (QS), associated with the tick loss function as in studies such as Giacomini and Komunjer (2005):

QS_{p,t+h} = (y_{t+h} - Q_{p,t+h}) (p - 1{y_{t+h} ≤ Q_{p,t+h}}),

where Q_{p,t+h} is the forecast of quantile p, and the indicator function 1{y_{t+h} ≤ Q_{p,t+h}} has a value of 1 if the outcome is at or below the forecast quantile and 0 otherwise.
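As a concrete reference for how these evaluation metrics are computed, the following sketch (ours; the forecast and log-score arrays are hypothetical stand-ins, not the paper's output) calculates the MSE ratio, the LPL difference, and the tick-loss quantile score relative to a benchmark:

```python
import numpy as np

def quantile_score(y, q_forecast, p):
    """Average tick loss (y - Q_p) * (p - 1{y <= Q_p}); lower values indicate better tail forecasts."""
    indicator = (y <= q_forecast).astype(float)
    return np.mean((y - q_forecast) * (p - indicator))

# hypothetical realized inflation, point forecasts, and per-period log scores for illustration only
rng = np.random.default_rng(3)
y = rng.normal(2.0, 1.0, size=160)
point_model = y + rng.normal(0.0, 0.9, 160)
point_bench = y + rng.normal(0.0, 1.0, 160)
lpl_model = rng.normal(-1.4, 0.1, 160)
lpl_bench = rng.normal(-1.5, 0.1, 160)

mse_ratio = np.mean((y - point_model) ** 2) / np.mean((y - point_bench) ** 2)  # < 1 favors the model
lpl_diff = np.mean(lpl_model) - np.mean(lpl_bench)                             # > 0 favors the model
q05_forecast = point_model - 1.645 * 0.9          # a stylized 5 percent quantile forecast
qs_05 = quantile_score(y, q05_forecast, p=0.05)   # left-tail accuracy at p = 0.05
```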
Table 1 reports MSE and average LPL (in parentheses) results for forecast horizons of 1 and 4 quarters. For the benchmark UC-SV specification, we report the MSE and LPL levels, whereas for all other models, we report ratios of MSEs relative to the benchmark (ratios less than 1 mean a model is more accurate than the UC-SV model) and differences in average LPL relative to the benchmark (positive entries mean a model is more accurate than the benchmark). The table has four horizontal panels for, respectively, UC, univariate, moderately sized, and large models. For all models we report results for linear, GP, and GP-sub specifications (except UC) for the conditional mean, and homoskedastic, DPM, SV, and DPM-SV for the error process. Finally, the table has two vertical panels: the left one for a long evaluation period ranging from 1980:Q1 until 2021:Q3, the right one for the short COVID pandemic period, ranging from 2020:Q1 to 2021:Q3. Naturally, the short length of the COVID period should be taken into proper consideration when assessing the results, but it is of course of major interest, since most standard econometric models had severe problems in tracking and forecasting economic variables during this period due to their wide swings. In particular, inflation first dropped significantly, almost to deflation, and then rebounded so strongly that it reached levels last seen in the early 1980s. In contrast, the low and stable inflation that prevailed from the early 1990s until the Great Recession may create challenges in beating the forecast accuracy of the simple UC-SV benchmark forecast.

Starting with the UC models, the main finding is that the pandemic led to large increases in MSE for h = 1 but not for h = 4, while there were massive decreases in LPL for both horizons. Among the UC specifications, the homoskedastic specification is the least accurate. Focusing on the other UC specifications, for h = 1, there are only small differences in MSE from the various error specifications over the long evaluation period, with some gains for DPM and DPM-SV (about 5-7 percent) during the pandemic period, which, however, turn into large losses (up to 54 percent) for h = 4. For LPL, there are also small differences in general, but, during the pandemic, DPM and DPM-SV present some small gains, indicating that these more complex specifications produced slightly more accurate density forecasts. Overall, it seems difficult to consistently beat the benchmark UC-SV model with more complex specifications of the error term.

Moving to the AR(1) models, over the full evaluation sample the linear AR(1) specification is worse than UC-SV in terms of MSE and LPL for both h = 1 and h = 4, in line with previous results in the literature. Interestingly, GP is overall comparable with UC-SV in terms of MSE and LPL, with SV and DPM-SV being slightly more accurate than UC-SV for h = 4. GP-sub is, not surprisingly, more similar to linear AR(1). During the COVID period, the linear AR(1) is slightly to modestly better than UC-SV for h = 1 but much worse for h = 4, and the same happens for GP-sub. Yet, GP is much better than UC-SV for both h = 1 and h = 4 and for both MSE and LPL, with minor differences across error types. Overall, GP seems a better univariate model than UC-SV, since it performs similarly in normal times, in particular when complemented with SV or DPM-SV, and much better in problematic periods.

Next, we consider the role of the variables included in the moderately sized multivariate model. With the moderately sized set of variables, the GP and GP-sub specifications consistently improve on the forecast accuracy of the UC-SV benchmark.
In the full sample, at both horizons the MSEs of the moderately sized GP and GP-sub models are slightly to modestly lower than those for linear regression and UC-SV, with gains with respect to the UC-SV baseline of about 5-8 percent for h = 1 and as much as 15 percent for h = 4, and small differences across error specifications. The moderately sized GP and GP-sub models also offer small to modest gains in LPL. In the pandemic sample, the MSE and LPL gains of the moderately sized GP and GP-sub models are larger. For h = 4, by the MSE metric, GP is much better than not only UC-SV but also the linear moderately sized model and the AR(1) GP specification. GP's better performance at h = 4 with respect to both linear and UC-SV could be due to the increased importance of nonlinearity, in the sense that if the relationship between inflation and the explanatory variables is nonlinear at h = 1, it will, in general, be even more nonlinear at h = 4. Overall, the moderately sized multivariate GP specification seems so far to be the preferred model, beating the benchmark for both MSE and LPL across both horizons and evaluation periods, with little difference across error specifications.

Finally, we consider the larger multivariate models and, in the interest of brevity, we focus on the large-scale GP models as compared to the moderately sized GP specifications. 5 With the nonparametric specifications, it is mostly, but not uniformly, the case that models using the moderately sized set of variables forecast as well as or better than models with the large-scale data set. More specifically, over the long evaluation sample, at both forecast horizons, using the large set of variables offers no gains in MSE or LPL (in fact, for h = 4, using the large set sometimes reduces accuracy) as compared to using the moderately sized set of variables; i.e., nonlinearity seems to make the larger information set redundant. But in the pandemic period, results are more mixed, depending on the forecast horizon. For h = 1, the large-scale GP model yields lower MSE and higher LPL as compared to the moderately sized GP model; but for h = 4, the reverse is true. For both horizons, the ranking of GP and GP-sub, and of the various error specifications, is not clear-cut, but in general the differences are small.

Overall, we conclude from the point and density forecast evaluation that our nonparametric models can offer gains in forecast accuracy over a long sample but particularly during the volatile period of the pandemic. With no clear-cut ranking across the different specifications of the error term of the inflation model, we believe that the primary gains to flexible nonparametric modeling come from nonlinear modeling of the conditional mean, through GPs (without a clear advantage or disadvantage to applying subspace shrinkage). Although a large set of variables offers an advantage in some settings, it does not do so uniformly; most of the gains to nonparametric modeling can be achieved with a moderately sized set of variables.

We now discuss additional empirical results that may shed light on some of the patterns emerging from Table 1, and clarify additional aspects such as the sources of differences in LPLs and the stability of the relative model performance over time. Figures 1 and 2 report cumulative differences in LPLs with respect to the UC-SV specification for, respectively, moderately sized and large models. In each figure, the top and bottom rows provide results for h = 1 and h = 4, respectively.
Columns 1 to 3 report results for linear, GP, and GP-sub specifications, respectively. Figures 1 and 2 show that the relative performance with respect to UC-SV improves around the financial crisis, more sharply for h = 1 than for h = 4. Following the crisis, for the GP and GP-sub models applied to both the moderately sized and the large variable sets, the SV and DPM-SV error specifications are consistently first or second best in model fit, except in the case of the GP-sub specification at the one-step-ahead horizon. By this measure, including stochastic volatility in the model's error specification improves model fit, particularly following the financial crisis. Once again, in the GP class, the large-scale set of variables offers no clear advantages over the moderately sized data set.

To assess efficacy in forecasting tail risks to inflation, in both the left and right tails, Figures 3 and 4 present quantile scores for GP and GP-sub specifications with, respectively, moderately sized and large models. In this comparison, too, scores are presented relative to the UC-SV benchmark. The main message emerging from these figures is that the GP models are much better in the left tail (low inflation) than in the right tail (high inflation). More specifically, from Figure 3's h = 1 results with the moderately sized set of variables, in the left tail, GP is better than GP-sub, and both are 10-20 percent better than UC-SV. At the 5 percent and 10 percent quantiles, the specifications with SV and DPM-SV are most accurate. With the moderately sized set of variables, the left-tail gains are noticeably larger for h = 4 (reaching as much as 40 percent) than for h = 1. However, as the quantile considered moves toward the median and into the right tail, the score performance deteriorates. Indeed, in the right tail, the nonparametric models are less accurate than the UC-SV baseline, by 10-20 percent for GP and a bit less for GP-sub (and as much as 40 percent at the four-quarters-ahead horizon). Results with large models are broadly similar, both qualitatively and quantitatively, to those with moderately sized models: the patterns for large models in Figure 4 and moderately sized models in Figure 3 are quite similar.

These quantile scores are computed over the full hold-out period (i.e., 1980:Q1 to 2021:Q3) and might thus mask important temporal variation in tail forecasting performance. To understand whether the tail performance described above is stable over time, Table 2 reports quantile scores separately for the sub-periods 1980-1990, 1991-2000, 2001-2010, and 2011-2021.

The previous results have been based on relative model performance. To gauge how well the corresponding predictive densities are calibrated, Figures 5 and 6 report the Rossi and Sekhposyan (2019) test results for correct calibration of the predictive densities of, respectively, the moderately sized and large models. These results indicate that the predictive densities of the GP-SV and UC-SV specifications are closest to being correctly specified, whereas the densities of other models display some noticeable mis-specification, likely related to the bad predictive performance in the right tail noted above (unreported probability integral transforms yield similar patterns and conclusions). For both horizons, the UC-SV model's quantiles of the predictive density lie within the correct-specification bands. With the moderately sized set of variables, at the one-step-ahead horizon the quantiles of the predictive density also lie within the confidence band for the homoskedastic and SV versions of the GP model.
But significant departures from correct specification occur if the version of the GP model is changed, the forecast horizon is increased to 4, or the variable set is changed to the large-scale set. In calibration, as in the quantile score performance presented above, the problems seem to be in the right tail of the distribution: Departures from the correct specification line become more prevalent and larger as the quantile moves from the left to the right tail.

This sub-section focuses on the qualitative properties of the predictive distributions. Figure 7 graphs the evolution of right-tail quantile scores (p = 0.95) and the 10th, 50th, and 90th percentiles of the predictive distribution. In the interest of brevity, we report results for the selected specifications of the UC-SV benchmark and the GP model with DPM-SV for the moderately sized data set, for h = 1 in the upper panel and h = 4 in the lower panel. As shown in the panels on the left side of the figure, in most periods, the right-tail quantile scores are lower (better) for UC-SV than for GP, but in episodes in the mid-2000s and during the pandemic, the GP specification beats UC-SV. The panels on the right side of the figure report the corresponding percentiles of the predictive distributions.

To get a better understanding of the effects of extreme events such as the financial crisis and the pandemic on the predictive distributions, Figure 8 plots the predictive densities in some quarters of these specific episodes. The figure provides results from the UC-SV benchmark and the homoskedastic and DPM-SV versions of the moderately sized GP specification. In each case, the red dot provides the actual outcome of inflation for the quarter in question. These plots nicely illustrate that, in some quarters (especially at the beginning of the aforementioned periods), the gains from the more flexible model stem from non-Gaussian features (heavy tails and skewness). After one or two quarters, the GP models also seem to adjust the point forecasts (better than the UC-SV models) and this provides further predictive gains.

In a linear regression model, the marginal effect of any predictor on inflation is simply its regression coefficient. In our nonparametric model, the nonlinear interaction between y_{t+h} and x_t means no such simple summary of the effect of a predictor on inflation exists. The effect can depend on the magnitude of any or all of the predictors and can vary over quantiles. Several recent papers (see, e.g., Crawford, et al., 2019; Woody, Carvalho, and Murray, 2021) address this issue through linear posterior summaries that are close (according to some metric) to the actual posterior distribution. In this paper, we follow a similar approach to Woody, Carvalho, and Murray (2021). Our aim is to approximate the quantiles of the predictive distribution Q_{p,t+h} using a linear and possibly sparse regression model. For each p, we solve the following optimization problem:

min_{β_p} Σ_{t=t_0}^{T-h} (Q_{p,t+h} - x_t' β_p)^2 + λ Σ_{k=1}^{K} |β_{p,k}|,

where t_0 marks the beginning of the hold-out period, β_p = (β_{p,1}, . . . , β_{p,K})' is a vector of linear coefficients, and λ ≥ 0 is a penalty term that controls the trade-off between model fit and parsimony. This is a LASSO problem, and we decide on λ through cross-validation. Notice that for each quantile we search for a linear representation that minimizes the squared errors between the quantile forecast and the linear predictor. This closely resembles a standard quantile regression model and, if repeated for every p, provides us with a linearized version of the predictive distribution that is simple to interpret.
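A minimal sketch of this posterior-summary step is given below (ours, using scikit-learn's cross-validated LASSO; the arrays X_holdout and Q_p are hypothetical stand-ins for the hold-out predictors and the model's quantile forecasts at a given p):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# hypothetical hold-out predictors and p = 0.1 quantile forecasts from the nonparametric model
rng = np.random.default_rng(4)
T_out, K = 160, 29
X_holdout = rng.standard_normal((T_out, K))
Q_p = X_holdout[:, 0] - 0.5 * X_holdout[:, 1] + 0.1 * rng.standard_normal(T_out)

# choose the penalty by cross-validation and fit the sparse linear summary of the quantile forecasts
lasso = LassoCV(cv=5).fit(X_holdout, Q_p)
beta_p = lasso.coef_                      # linearized effect of each predictor on this quantile
active = np.flatnonzero(beta_p)           # predictors retained by the penalty
print(f"selected penalty: {lasso.alpha_:.4f}, active predictors: {active}")
```

Repeating this fit for a grid of p values yields the interpretable, quantile-by-quantile linear approximation of the predictive distribution described above.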
As a robustness check, we assess whether our model is also able to compete with the UC-SV benchmark when the focus is on core CPI (i.e., CPI excluding food and energy) inflation. This exercise is considerably harder for the models we propose (which are designed to capture non-systematic, higher-frequency movements in inflation), since core CPI inflation excludes the volatile food and energy components. Table 4 shows the results for core CPI inflation. In these results, as in the results for headline inflation, the MSE for the benchmark UC-SV model is much higher in the pandemic period than in the full sample at the one-step-ahead horizon, whereas at the four-steps-ahead horizon, the MSE is more stable across samples. But for both horizons, the LPL for the benchmark model drops substantially in the pandemic period compared to the full sample. In the full sample, using core rather than headline inflation reduces the advantages of the GP and GP-sub specifications compared to the UC-SV benchmark. Among these models, the GP specification with SV applied to the moderately sized set of variables fares best over the full sample, about the same as the benchmark in MSE and LPL accuracy for h = 1 and modestly to solidly more accurate for h = 4. Over the pandemic sample, most models beat the UC-SV benchmark in both MSE and LPL. In particular, the GP models offer gains larger than those found over the long sample, with the SV specification again best with GP. In this sample, by the MSE metric, gains are larger for h = 4 (as large as 33 percent) than for h = 1. In addition, over the pandemic, there is some additional advantage to using the large-scale variable set as compared to the moderately sized data set.

Forecasting inflation is hard, partly due to the changing, and often vanishing, relationship between it and its predictors and partly due to the occasionally large and asymmetric shocks that hit the inflation process. The model developed in this paper, being nonparametric in the conditional mean and in the error distribution, is designed to address these challenges. In our empirical work, we have shown that the model is capable of producing accurate point and density forecasts. These forecasts are often more precise than the ones obtained from simpler alternatives such as the UC-SV model or a linear regression model. This forecasting performance is driven by a superior overall performance in the left tail and the center of the distribution. However, the performance in the right tail is somewhat weaker. This is mainly driven by slightly inflated predictive intervals during the Great Moderation. In the high-inflation period of the early 1980s and during the second year of the pandemic, our model also improves upon the UC-SV model in the right tail. Given that the model works well in turbulent times, it might also work well for forecasting time series that are subject to rapid mean shifts and changing volatilities such as exchange rates or asset prices. Moreover, the univariate nature of the model implies that we do not model dynamic interdependencies between variables. Since our framework is highly scalable, it would be straightforward to extend it to the VAR case and use it for structural analysis.

Appendix A

In this section we discuss our posterior simulation algorithm. With some exceptions, the parameters and states of the model can be sampled through Gibbs updating steps.
In cases where the full conditionals are not of a well-known form, we rely on Metropolis-Hastings updates to simulate from the target distribution.

A.1 Posterior simulation of the Dirichlet process mixture

We start by discussing the posterior simulation steps of the DPM. This algorithm closely follows the one discussed in Frühwirth-Schnatter and Malsiner-Walli (2019). Notice that (5) can be written as:

ε_t | δ_t = j ∼ N(µ_j, σ_j^2),  (A.1)

with δ_t denoting a latent variable that assigns each observation to one of the components. δ_t is specified such that Prob(δ_t = j) = w_j and we let δ = (δ_1, . . . , δ_T)' denote a T-dimensional classification vector. In terms of full-data vectors, the DPM can be written as follows:

ε = µ + η,  η ∼ N(0_T, Σ),

with ε = (ε_1, . . . , ε_T)', µ = (µ_{δ_1}, . . . , µ_{δ_T})', and Σ = diag(σ^2_{δ_1}, . . . , σ^2_{δ_T}). To simulate from the posterior of the DPM, we iterate between the following steps (conditional on knowing f):

• We start by sampling the weights ξ_i conditional on the classification indicators and the remaining model parameters and latent states. Let ξ_J = 1, where J is a truncation level for the number of mixture components. The sticks ξ_1, . . . , ξ_{J-1} are simulated from a sequence of Beta distributions,

ξ_j | • ∼ B(1 + T_j, α + Σ_{i=j+1}^{J} T_i),

where T_j = Σ_{t=1}^{T} I(δ_t = j) denotes the number of observations associated with cluster j, with I(•) denoting an indicator function that equals 1 if its argument is true. Draws of ξ_1, . . . , ξ_{J-1} are then used to construct the weights w_1, . . . , w_{J-1} using the stick-breaking formula in (6).

• Next we sample the classification indicators on a t-by-t basis using slice sampling (Kalli, Griffin, and Walker, 2011). We do this in two steps. First, we sample an auxiliary quantity u_t | δ_t ∼ U(0, ℓ_{δ_t}) with ℓ_j = (1 - κ) κ^{j-1} and κ = 0.8. In the second step, the indicators are simulated from a discrete distribution with the probability that δ_t = j (j = 1, . . . , J) being proportional to:

Prob(δ_t = j | •) ∝ (I(u_t < ℓ_j) / ℓ_j) × w_j N(ε_t | µ_j, σ_j^2).

The • notation indicates conditioning on all remaining coefficients and latent states of the model.

• The posterior of α is obtained through a Metropolis-Hastings step (see Frühwirth-Schnatter and Malsiner-Walli, 2019).

• The component mean µ_j (j = 1, . . . , J) is simulated from a Gaussian posterior:

µ_j | • ∼ N(m_j, V_j),  V_j = (1/v_j + T_j/σ_j^2)^{-1},  m_j = V_j σ_j^{-2} Σ_{t=1}^{T} (ε_t × I(δ_t = j)).

• We simulate the component-specific error variances from an inverse Gamma-distributed posterior:

σ_j^2 | • ∼ G^{-1}( c_0 + T_j/2,  c_1 + (1/2) Σ_{t=1}^{T} (ε_t - µ_j)^2 I(δ_t = j) ).

Finally, we obtain the truncation level J such that 1 - Σ_{j=1}^{J} w_j < min(u_1, . . . , u_T). In case we use the stochastic volatility specification, the sampling step for the component-specific variances is replaced with the algorithm proposed in Kastner and Frühwirth-Schnatter (2014).

Conditional on µ and Σ, we need to sample f and the parameter controlling the weight put on the subspace shrinkage prior, τ^2:

• The full conditional posterior of f is a T-dimensional Gaussian distribution:

f | • ∼ N( K(K + Σ)^{-1}(y - µ),  K - K(K + Σ)^{-1}K ).

• We sample from the posterior of τ^2 using a slice sampler (Shin, Bhattacharya, and Johnson, 2020). In the first step, we sample a uniformly distributed random variable u ∼ U(0, r) with r = (τ^{-2} + 1)^{-(d_0 + d_1)} and set r* = u^{-1/(d_0 + d_1)} - 1. In the next step, we simulate yet another auxiliary quantity ζ from a Gamma distribution truncated to the interval (0, r*). A draw of τ^2 is then obtained by setting τ^2 = 1/ζ.

• We estimate φ and ξ jointly using a random walk Metropolis-Hastings updating step.
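The Gaussian full conditional of f stated above can be sketched in a few lines of NumPy (ours, not the authors' code; the toy kernel, error variances, and data are arbitrary illustrations):

```python
import numpy as np

def draw_f(y, mu, Sigma_diag, K, rng):
    """One draw of f from N( K(K+Sigma)^{-1}(y-mu), K - K(K+Sigma)^{-1}K ),
    the standard GP-regression update given the mixture means and variances."""
    T = len(y)
    Sigma = np.diag(Sigma_diag)
    A = np.linalg.solve(K + Sigma, np.eye(T))                      # (K + Sigma)^{-1}
    post_mean = K @ A @ (y - mu)
    post_cov = K - K @ A @ K
    post_cov = 0.5 * (post_cov + post_cov.T) + 1e-8 * np.eye(T)    # enforce symmetry, add jitter
    return rng.multivariate_normal(post_mean, post_cov)

# toy usage with an arbitrary one-dimensional Gaussian kernel
rng = np.random.default_rng(5)
T = 100
y = rng.standard_normal(T)
mu = np.zeros(T)                   # mixture means assigned to each observation
Sigma_diag = np.full(T, 0.5)       # mixture variances assigned to each observation
K = 0.5 * np.exp(-0.25 * (np.subtract.outer(np.arange(T), np.arange(T)) / 10.0) ** 2)
f_draw = draw_f(y, mu, Sigma_diag, K, rng)
```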
[Figure: panels for the GP and GP-sub specifications, horizontal axis 1980-2017; panel (a): 4-quarters-ahead.]

References

Babb and Detmeister (2017). Nonlinearities in the Phillips curve for the United States: Evidence using metropolitan data.
Breiman (2001). Random forests.
Chipman, George, and McCulloch (2010). BART: Bayesian additive regression trees.
Clark (2011). Real-time density forecasts from Bayesian vector autoregressions with stochastic volatility.
Clark et al. (2021). Tail forecasting with multivariate Bayesian additive regression trees.
Cogley and Sargent (2001). Evolving post-World War II US inflation dynamics.
Crawford et al. (2019). Variable prioritization in nonlinear black box methods: A genetic association case study.
Escobar and West (1995). Bayesian density estimation and inference using mixtures.
Faust and Wright (2013). Forecasting inflation. Handbook of Economic Forecasting, volume 2.
Frühwirth-Schnatter and Malsiner-Walli (2019). From here to infinity: Sparse finite versus Dirichlet process mixtures in model-based clustering.
Giacomini and Komunjer (2005). Evaluation and combination of conditional quantile forecasts.
The macroeconomy as a random forest.
How is machine learning useful for macroeconomic forecasting?
Goulet Coulombe, Marcellino, and Stevanovic (2021). Can machine learning catch the COVID-19 recession?
Hauzenberger et al. (2021). Gaussian process vector autoregressions and macroeconomic uncertainty.
Huber and Koop (2021). Subspace shrinkage in conjugate Bayesian vector autoregressions.
Jensen and Maheu (2014). Estimating a semiparametric asymmetric stochastic volatility model with a Dirichlet process mixture.
Jochmann (2015). Modeling US inflation dynamics: A Bayesian nonparametric approach.
Kalli, Griffin, and Walker (2011). Slice sampling mixture models.
Kastner and Frühwirth-Schnatter (2014). Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models.
Lee et al. (2017). Deep neural nets as Gaussian processes.
Lopez-Salido and Loria (2020). Inflation at risk.
Masini, Medeiros, and Mendes (2021). Machine learning advances for time series forecasting.
McCracken and Ng (2020). FRED-QD: A quarterly database for macroeconomic research.
Medeiros et al. (2021). Forecasting inflation in a data-rich environment: The benefits of machine learning methods.
Nakamura (2005). Inflation forecasting using a neural network.
Rasmussen and Williams (2006). Gaussian processes for machine learning.
Rossi and Sekhposyan (2019). Alternative tests for correct specification of conditional predictive densities.
Sethuraman (1994). A constructive definition of Dirichlet priors.
Shin, Bhattacharya, and Johnson (2020). Functional horseshoe priors for subspace shrinkage.
Evolving post-World War II US inflation dynamics: Comment.
Stock and Watson (2002). Macroeconomic forecasting using diffusion indexes.
Stock and Watson (2007). Why has U.S. inflation become harder to forecast?
Stock and Watson (2010). Modeling inflation after the crisis.
Woody, Carvalho, and Murray (2021). Model interpretation through lower-dimensional posterior summarization.