authors: Hauzenberger, Niko; Huber, Florian; Klieber, Karin title: Real-time Inflation Forecasting Using Non-linear Dimension Reduction Techniques date: 2020-12-15

In this paper, we assess whether using non-linear dimension reduction techniques pays off for forecasting inflation in real-time. Several recent methods from the machine learning literature are adopted to map a large dimensional dataset into a lower dimensional set of latent factors. We model the relationship between inflation and the latent factors using constant and time-varying parameter (TVP) regressions with shrinkage priors. Our models are then used to forecast monthly US inflation in real-time. The results suggest that sophisticated dimension reduction methods yield inflation forecasts that are highly competitive with linear approaches based on principal components. Among the techniques considered, the Autoencoder and squared principal components yield factors that have high predictive power for one-month- and one-quarter-ahead inflation. Zooming into model performance over time reveals that controlling for non-linear relations in the data is of particular importance during recessionary episodes of the business cycle or the current COVID-19 pandemic.

Inflation expectations are used as crucial inputs for economic decision making in central banks such as the European Central Bank (ECB) and the US Federal Reserve (Fed). Given current and expected inflation, economic agents decide on how much to consume, save and invest. In addition, measures of inflation expectations are often employed to estimate the slope of the Phillips curve, infer the output gap or the natural rate of interest. Hence, being able to accurately predict inflation is key for designing and implementing appropriate monetary policies in a forward-looking manner. Although the literature on modeling inflation is voluminous and the efforts invested considerable, predicting inflation remains a difficult task and simple univariate models are still difficult to beat (Stock and Watson, 2007). The recent literature, however, has shown that using large datasets (Stock and Watson, 2002a) and/or sophisticated models (see Koop and Potter, 2007; Koop and Korobilis, 2012; D'Agostino et al., 2013; Koop and Korobilis, 2013; Clark and Ravazzolo, 2015; Chan et al., 2018; Jarocinski and Lenza, 2018) has the potential to improve upon simpler benchmarks. These studies often exploit information from huge datasets. This is commonly achieved by extracting a relatively small number of principal components (PCs) and including them in a second stage regression model (see, e.g., Stock and Watson, 2002a). While this approach performs well empirically and yields consistent estimators for the latent factors, it fails to capture non-linear relations in the dataset. In the presence of non-linearities, using simple PCs potentially reduces predictive accuracy by ignoring important features of the data. Some studies deal with this issue by using flexible factor models which allow for non-linearities in the data. Bai and Ng (2008) use targeted predictors coupled with quadratic principal components and show that allowing for non-linearities yields non-trivial improvements in predictive accuracy for inflation. This suggests that non-linearities (of a known form) are present in US macroeconomic datasets which are commonly employed for inflation forecasting.
More recently, Pelger and Xiong (2021) propose a flexible state-dependent factor model and apply this method to US bond yields and stock returns. Using this non-linear and non-parametric technique yields results which differ from linear, PC-based models by extracting significantly more information from the data. One additional assumption commonly made is that the relationship between inflation and the latent factors is constant. For longer time series which feature multiple structural breaks this assumption is a strong one and may be deleterious for predictive accuracy. Several recent papers deal with this issue by using time-varying parameter (TVP) regressions which, in addition, allow for heteroscedasticity through stochastic volatility (SV) models (Koop and Potter, 2007; Koop and Korobilis, 2012; D'Agostino et al., 2013; Belmonte et al., 2014b; Clark and Ravazzolo, 2015; Jarocinski and Lenza, 2018; Korobilis, 2021) . Investigating whether allowing for non-linearities in the compression stage pays off for inflation forecasting is the key objective of the present paper. Building on recent advances in machine learning (see Gallant and White, 1992; McAdam and McNelis, 2005; Exterkate et al., 2016; Chakraborty and Joseph, 2017; Heaton et al., 2017; Mullainathan and Spiess, 2017; Feng et al., 2018; Coulombe et al., 2019; Kelly et al., 2019; Medeiros et al., 2021) , we adopt several non-linear dimension reduction techniques. The resulting latent factors are then linked to in-flation in a second stage regression. To investigate whether there exists a relationship between non-linear factor estimation and flexible modeling of the predictive inflation equation, we introduce dynamic regression models that allow for TVPs and SV. Since the inclusion of a relatively large number of latent factors can still imply a considerable number of parameters (and this problem is even more severe in the TVP regression case), we rely on state-of-the-art shrinkage techniques. From an empirical standpoint it is necessary to investigate how these dimension reduction techniques perform over time and during different business cycle phases. We show this by carrying out a thorough real-time forecasting experiment for the US. Our forecasting application uses monthly real-time datasets (i.e., the FRED-MD database proposed in McCracken and Ng, 2016) and includes a battery of well established models commonly used in central banks and other policy institutions to forecast inflation. These include simple benchmarks as well as more elaborate models such as the specification proposed in Stock and Watson (2002a) . Our results show that non-linear dimension reduction techniques yield forecasts that are highly competitive to (and in fact often better than) the ones obtained from using linear methods based on PCs. In terms of one-month-ahead forecasts we find that models based on the Autoencoder yield point and density forecasts which are more precise than the ones obtained from other sophisticated non-linear dimension reduction techniques as well as traditional methods based on PCs. When the focus is on one-quarter-ahead forecasts we find that non-linear variants of PCs perform best. This performance, however, is not homogeneous over time and some of the models do better than others during different stages of the business cycle. In a brief discussion, we also analyze how our set of models performs during the COVID-19 pandemic. These findings give rise to the second contribution of our paper. 
Since we observe that more sophisticated non-linear dimension reduction methods outperform simpler techniques during recessions, we combine the different models using dynamic model averaging (see Raftery et al., 2010; Koop and Korobilis, 2013) . We show that combining our proposed set of models with a variety of standard forecasting models yields predictive densities which are very close to the single best performing model in overall terms. Since the set of models we consider is huge, this indicates that using model and forecast averaging successfully controls for model uncertainty. The remainder of this paper is structured as follows. Section 2 discusses our proposed set of dimension reduction techniques. Section 3 introduces the econometric modeling environment that we use to forecast inflation. Section 4 first provides some in-sample features, then discusses the results of the forecasting horse race and finally presents our findings based on forecast averaging. The last section summarizes and concludes the paper. The Online Appendix provides further details on the econometric techniques as well as the data and additional empirical results. Suppose that we are interested in predicting inflation using a large number of K regressors that we store in a T × K matrix X = (x 1 , . . . , x T ) , where x t denotes a K-dimensional vector of observations at time t. If K is large relative to T , estimation of an unrestricted model that uses all columns in X quickly becomes cumbersome and overfitting issues arise. As a solution, dimension reduction techniques are commonly employed (see, e.g., Stock and Watson, 2002a; Bernanke et al., 2005) . These methods strike a balance between model fit and parsimony. At a very general level, the key idea is to introduce a function f that takes the matrix X as input and yields a lower dimensional representation Z = f (X) = (z 1 , . . . , z T ) , which is of dimension T × q, as output. The critical assumption to achieve parsimony is that q K. The latent factors in Z are then linked to inflation through a dynamic regression model (see Section 3). The function f : R T ×K → R T ×q is typically assumed to be linear with the most prominent example being PCs. In this paper, we will consider several choices of f that range from linear to highly non-linear (such as manifold learning as well as deep learning) specifications. We subsequently analyze how these different specifications impact inflation forecasting accuracy. In the following sub-sections, we briefly discuss the different techniques and refer to the original papers for additional information. We start our discussion by considering principal component analysis (PCA) . Minor alterations of the standard PCA approach allow for introducing non-linearities in two ways. First, we can introduce a non-linear function g that maps the covariates onto a matrix W = g(X). Second, we could alter the sample covariance matrix (the kernel) with a function h: κ = h(W W ). Both W and κ form the two main ingredients of a general PCA reducing the dimension to q, as outlined below (for details, see Schölkopf et al., 1998) . Independent of the functional form of g and h, we obtain PCs by performing a truncated singular value decomposition (SVD) of the transformed sample covariance matrix κ. Conditional on the first q eigenvalues, the resulting factor matrix Z is of dimension T × q. These PCs, for appropriate q, explain the vast majority of variation in X. 
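To make the general recipe concrete, the following base-R sketch extracts q latent factors from a standardized T × K panel by keeping the leading eigenvectors of a (possibly transformed) covariance matrix. The helper name extract_factors() and the simulated toy panel are illustrative assumptions, not objects from the paper; with g and h set to the identity the sketch reproduces plain PCA.

```r
# Minimal sketch of the compression step Z = f(X): standardize the panel,
# form kappa = h(g(X)' g(X)) and keep the leading q eigenvectors.
extract_factors <- function(X, q, g = identity, h = identity) {
  X <- scale(X)                      # mean zero, unit variance per column
  W <- g(X)                          # optional non-linear transformation of the data
  kappa <- h(crossprod(W))           # (transformed) sample covariance / kernel matrix
  eig <- eigen(kappa, symmetric = TRUE)   # assumes h preserves symmetry
  Lambda <- eig$vectors[, 1:q, drop = FALSE]   # truncated eigenvector matrix
  Z <- W %*% Lambda                  # T x q factor estimates
  list(Z = Z, Lambda = Lambda, share = sum(eig$values[1:q]) / sum(eig$values))
}

# Toy example with illustrative sizes: T = 240 months, K = 105 predictors, q = 5 factors
set.seed(1)
X  <- matrix(rnorm(240 * 105), nrow = 240)
pc <- extract_factors(X, q = 5)
dim(pc$Z)       # 240 x 5
pc$share        # share of variation in X captured by the first q components
```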
In the following, the relationship between the PCs and X is given by Z = W Λ(κ), with Λ(κ) being the truncated K × q eigenvector matrix of κ (Stock and Watson, 2002a). Notice that this is always conditional on deciding on a suitable number q of PCs. The number of factors is a crucial parameter that strongly influences predictive accuracy and inference (Bai and Ng, 2002). In our empirical work, we consider a small (q = 5), moderate (q = 15), and large (q = 30) number of PCs.

By varying the functional form of g and h we are now able to discuss the first set of linear and non-linear dimension reduction techniques belonging to the class of PCA. The simplest way is to define both g and h as the identity function, resulting in W = X and κ = X′X. Due to the linear link between the PCs and the data, PCA is very easy to implement and yields consistent estimators for the latent factors if K and T go to infinity (Stock and Watson, 2002a; Bai and Ng, 2008). Even if there is some time-variation in the factor loadings (and K is large), Stock and Watson (2002b) show that principal components asymptotically (i.e., T → ∞) remain a consistent estimator for the factors and also that the resulting forecast is efficient. 1

The literature suggests several ways to overcome the linearity restriction of PCs. Bai and Ng (2008), for example, apply a quadratic link function between the latent factors and the regressors, yielding a more flexible factor structure. While squared PC considers just squaring the elements of X, resulting in W = X^(2) and κ = (X^(2))′(X^(2)), with X^(2) = (X ⊙ X) and ⊙ denoting element-wise multiplication, quadratic PC is defined as W = (X, X^(2)) and κ = W′W. Both variants also focus on the second moments of the covariate matrix and allow for a non-linear relationship between the principal components and the predictors. Bai and Ng (2008) show that quadratic variables can have substantial predictive power as they provide additional information on the underlying time series. Intuitively speaking, given that we transform our data to stationarity in the empirical work, this transformation strongly overweights situations characterized by sharp movements in the columns of X (such as during a recession). By contrast, periods characterized by little variation in our macroeconomic panel are transformed to mildly fluctuate around zero (and thus carry little predictive content for inflation). Since our regressions always feature lagged inflation, this transformation effectively implies that in tranquil periods the model is close to an autoregressive model, whereas in crisis periods more information is introduced.

Another approach for non-linear PCs is kernel principal component analysis (KPCA). KPCA dates back to Schölkopf et al. (1998), who proposed using integral operator kernel functions to compute PCs in a non-linear manner. In essence, this amounts to implicitly applying a non-linear transformation of the data through a kernel function and then applying PCA on this transformed dataset. Such an approach has been used for forecasting in Giovannelli (2012) and Exterkate et al. (2016). We allow for non-linearities in the kernel function between the data and the factors by defining h to be a Gaussian or a polynomial kernel κ (which is of dimension K × K), with the (i, j)th element given by a Gaussian kernel function of the Euclidean distance ‖x_{•i} − x_{•j}‖ (scaled by c_0) or by a polynomial kernel in the inner product x_{•i}′x_{•j} (scaled by c_1), respectively. Here, W = X (i.e., g is the identity function), x_{•i} and x_{•j} (i, j = 1, . . . , K) denote two columns of X, while c_0 and c_1 are scaling parameters.
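As an illustration of these non-linear PCA variants, the sketch below builds the transformed inputs for squared and quadratic PC and a Gaussian kernel matrix, reusing the extract_factors() helper and the toy matrix X from the sketch above. The exact kernel parametrization exp(−‖x_{•i} − x_{•j}‖²/c_0) is a standard choice and should be read as an assumption; the scaling constant follows the choice of Exterkate et al. (2016) discussed next in the text.

```r
# Squared PC: element-wise squares of the (standardized) data
g_sq   <- function(X) X * X                         # X^(2) = X (element-wise) X
# Quadratic PC: original data augmented with its element-wise squares
g_quad <- function(X) cbind(X, X * X)               # W = (X, X^(2))

pc_sq   <- extract_factors(X, q = 5, g = g_sq)
pc_quad <- extract_factors(X, q = 5, g = g_quad)

# Kernel PCA with a Gaussian kernel: h() turns the Gram matrix G = W'W into
# kappa_ij = exp(-||x_i - x_j||^2 / c0), using ||x_i - x_j||^2 = G_ii + G_jj - 2 G_ij.
c0 <- (ncol(X) + 2) / 2                             # scaling constant, c0 = (K + 2)/2
h_gauss <- function(G) {
  d2 <- outer(diag(G), diag(G), "+") - 2 * G        # squared distances between columns
  exp(-d2 / c0)
}
pc_kern <- extract_factors(X, q = 5, h = h_gauss)
dim(pc_kern$Z)                                      # 240 x 5
```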
As suggested by Exterkate et al. (2016), we set c_0 = (K + 2)/2 and c_1 = √c_K/π, with c_K being the 95th percentile of the χ² distribution with K degrees of freedom.

Diffusion maps, originally proposed in Coifman et al. (2005) and Coifman and Lafon (2006), are another set of non-linear dimension reduction techniques that retain local interactions between data points in the presence of substantial non-linearities in the data. 2 The local interactions are preserved by introducing a random walk process. The random walk captures the notion that moving between similar data points is more likely than moving to points which are less similar. We assume that the weight function which determines the strength of the relationship between x_{•i} and x_{•j} is given by w(x_{•i}, x_{•j}) = exp(−‖x_{•i} − x_{•j}‖²/c_2), where ‖x_{•i} − x_{•j}‖ denotes the Euclidean distance between x_{•i} and x_{•j} and c_2 is a tuning parameter. Here, c_2 is determined by the median distance of the k-nearest neighbors of x_{•i}, as suggested by Zelnik-Manor and Perona (2004). The number of neighbors k is approximated using the algorithm suggested by Angerer et al. (2016). The probability of moving from x_{•i} to x_{•j} is then simply obtained by normalizing, p_{i→j} = w(x_{•i}, x_{•j}) / Σ_s w(x_{•i}, x_{•s}). This probability tends to be small except for the situation where x_{•i} and x_{•j} are similar to each other. As a result, the probability that the random walk moves from x_{•i} to x_{•j} will be large if they are close to each other but rather small if both covariates differ strongly.

Let P denote a transition matrix of dimension K × K with (i, j)th element given by p_{i→j}. The probability of moving from x_{•i} to x_{•j} in n = 1, 2, . . . steps is then simply given by the matrix power P^n, with typical element denoted by p^n_{i→j}. Using a biorthogonal spectral decomposition of P^n yields p^n_{i→j} = Σ_s λ_s^n φ_s(x_{•i}) ψ_s(x_{•j}), with ψ_s and φ_s denoting left and right eigenvectors of P, respectively. The corresponding eigenvalues are given by λ_s. We then proceed by computing the so-called diffusion distance as follows: D_n(x_{•i}, x_{•j})² = Σ_u (p^n_{i→u} − p^n_{j→u})² / p_0(x_{•u}), with p_0 being a normalizing factor that measures the proportion of time the random walk spends at the respective point. This measure turns out to be robust with respect to noise and outliers. Coifman and Lafon (2006) show that D_n(x_{•i}, x_{•j})² = Σ_{s≥1} λ_s^{2n} (φ_s(x_{•i}) − φ_s(x_{•j}))². This allows us to introduce the family of diffusion maps from R^K → R^q given by Ξ_n(x_{•i}) = (λ_1^n φ_1(x_{•i}), . . . , λ_q^n φ_q(x_{•i}))′. The distance matrix can then be approximated as D_n(x_{•i}, x_{•j}) ≈ ‖Ξ_n(x_{•i}) − Ξ_n(x_{•j})‖. Intuitively, this equation states that we now approximate diffusion distances in R^K through the Euclidean distance between Ξ_n(x_{•i}) and Ξ_n(x_{•j}). This discussion implies that we have to choose n and q, and we do this by setting q = {5, 15, 30} according to our approach with either a small, moderate or large number of factors and n = T, the number of time periods. The algorithm in our application is implemented using the R packages diffusionMap and destiny (Richards and Cannoodt, 2019; Angerer et al., 2016).
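The following base-R sketch mirrors this construction on a small scale: it builds the Gaussian weight matrix over the columns of the toy panel X, row-normalizes it to obtain the transition matrix P, and computes diffusion coordinates from the leading non-trivial eigenvectors. It uses a single fixed bandwidth and a small number of steps rather than the adaptive k-nearest-neighbor rule and n = T used above, so it should be read as a simplified illustration, not as the destiny/diffusionMap implementation relied on in the paper.

```r
# Simplified diffusion map: the columns of X are the objects being embedded.
diffusion_map <- function(X, q, n_steps, c2) {
  D2 <- as.matrix(dist(t(X)))^2            # squared Euclidean distances between columns
  W  <- exp(-D2 / c2)                      # Gaussian weights
  d  <- rowSums(W)
  # Symmetric normalization gives real eigenpairs; the right eigenvectors of
  # P = D^{-1} W are recovered by rescaling with D^{-1/2}.
  S   <- W / sqrt(outer(d, d))
  eig <- eigen(S, symmetric = TRUE)
  phi <- eig$vectors / sqrt(d)             # right eigenvectors of P
  lam <- eig$values
  # Drop the trivial first eigenpair (lambda = 1) and scale by lambda^n.
  sweep(phi[, 2:(q + 1)], 2, lam[2:(q + 1)]^n_steps, `*`)
}

dm <- diffusion_map(X, q = 5, n_steps = 10, c2 = median(as.matrix(dist(t(X)))^2))
dim(dm)   # 105 x 5: one diffusion coordinate vector per predictor column
```

If T × q factors comparable to the PCA case are needed, one could project the data onto these coordinates (for example Z = X %*% dm), analogously to Z = W Λ(κ) above; in the paper this step is handled by the diffusionMap and destiny packages.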
Locally linear embeddings (LLE) have been introduced by Roweis and Saul (2000). Intuitively, the LLE algorithm maps a high dimensional input dataset X into a lower dimensional space while preserving the neighborhood structure. This implies that points which are close to each other in the original space are also close to each other in the transformed space. The LLE algorithm is based on the assumption that each x_{•i} is sampled from some underlying manifold. If this manifold is well defined, each x_{•i} and its neighbors x_{•j} are located close to a locally linear patch of this manifold. One consequence is that each x_{•i} can be reconstructed from its neighbors x_{•j} with j ≠ i, conditional on suitably chosen linear coefficients. This reconstruction, however, will be corrupted by measurement errors. Roweis and Saul (2000) introduce a cost function to quantify these errors, C(Ω) = Σ_i ‖x_{•i} − Σ_j ω_{ij} x_{•j}‖², with ω_{ij} denoting the (i, j)th element of a weight matrix Ω. This cost function is then minimized subject to the constraint that each x_{•i} is reconstructed only from its neighbors. This implies that ω_{ij} = 0 if x_{•j} is not a neighbor of x_{•i}. The second constraint is that the matrix Ω is row-stochastic, i.e., the rows sum to one. Conditional on these two restrictions, the cost function can be minimized by solving a least squares problem.

To make this algorithm operational we need to define our notion of neighbors. In the following, we will use the k-nearest neighbors in terms of the Euclidean distance. We choose the number of neighbors by applying the algorithm proposed by Kayo (2006), which automatically determines the optimal number for k. The q latent factors in Z are then obtained by minimizing Φ(Z) = Σ_i ‖z_i − Σ_j ω_{ij} z_j‖², where z_i denotes the low-dimensional representation of x_{•i}; this implies a quadratic form in the embedding coordinates. Subject to suitable constraints, this problem can be easily solved by computing M = (I − Ω)′(I − Ω) and finding the q + 1 eigenvectors of M associated with the q + 1 smallest eigenvalues. The bottom eigenvector is then discarded to arrive at q factors. For our application, we use the R package lle (Diedrich and Abel, 2012).

Isometric Feature Mapping (ISOMAP) is one of the earliest methods developed in the category of manifold learning algorithms. Introduced by Tenenbaum et al. (2000), the ISOMAP algorithm determines the geodesic distance on the manifold and uses multidimensional scaling to come up with a low number of factors describing the underlying dataset. Originally, ISOMAP was constructed for applications in visual perception and image recognition. In economics and finance, some recent papers highlight its usefulness (see, e.g., Ribeiro et al., 2008; Lin et al., 2011; Orsenigo and Vercellis, 2013; Zime, 2014). The algorithm consists of three steps. In the first step, a dissimilarity index that measures the distance between data points is computed. These distances are then used to identify neighboring points on the manifold. In the second step, the algorithm estimates the geodesic distance between the data points as shortest path distances. In the third step, metric scaling is performed by applying classical multidimensional scaling (MDS) to the matrix of distances. For the dissimilarity transformation, we determine the distance between points i and j by the Manhattan index d_{ij} = Σ_k |x_{ki} − x_{kj}| and collect those points where i is one of the k-nearest neighbors of j in a dissimilarity matrix. For our empirical application, we again choose the number of neighbors by applying the algorithm proposed by Kayo (2006) and use the implementation in the R package vegan (Oksanen et al., 2019). The described non-linear transformation of the dataset enables the identification of a non-linear structure hidden in a high-dimensional dataset and maps it to a lower dimension. Instead of pairwise Euclidean distances, ISOMAP uses the geodesic distances on the manifold and compresses information under consideration of the global structure.
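The three ISOMAP steps translate into a compact base-R sketch: build a k-nearest-neighbor graph from Manhattan distances between the columns of the toy panel X, approximate geodesic distances by shortest paths (a simple Floyd-Warshall loop), and apply classical MDS via cmdscale(). The fixed choice of k is an illustrative assumption; the paper selects it with the Kayo (2006) algorithm and relies on the vegan implementation.

```r
isomap_sketch <- function(X, q, k) {
  D <- as.matrix(dist(t(X), method = "manhattan"))   # Manhattan distances between columns
  K <- ncol(X)
  # Keep only edges to the k nearest neighbors (symmetrized); all others set to Inf.
  G <- matrix(Inf, K, K)
  for (i in 1:K) {
    nn <- order(D[i, ])[2:(k + 1)]                   # skip the point itself
    G[i, nn] <- D[i, nn]
    G[nn, i] <- D[nn, i]
  }
  diag(G) <- 0
  # Floyd-Warshall: shortest-path (geodesic) distances on the neighborhood graph.
  for (m in 1:K) G <- pmin(G, outer(G[, m], G[m, ], `+`))
  stopifnot(all(is.finite(G)))                       # assumes a connected k-NN graph
  cmdscale(G, k = q)                                 # classical MDS: K x q coordinates
}

iso <- isomap_sketch(X, q = 5, k = 10)
dim(iso)   # 105 x 5
```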
Deep learning algorithms are characterized by not only non-linearly converting input to output but also representing the input itself in a transformed way. This is called representation learning in the sense that representations of the data are expressed in terms of other, simpler representations before mapping the data input to output values. One tool which performs representation of the input as well as representation to output is the Autoencoder (AE). The first step is accomplished by the encoder function, which maps an input to an internal representation. The second part, the decoder, maps the encoded (internal) representation back to the original input space. As activation function we use the hyperbolic tangent, which we apply element-wise to the entries of X. Using tanh activation functions is justified by their strong empirical properties identified in recent studies such as Saxe et al. (2019) and Andreini et al. (2020).

The structure of our deep learning algorithm can be represented in the form of a composition of univariate semi-affine functions given by X̂^(l) = f(X̂^(l−1) W^(l) + ι_T b_l′) for l = 1, . . . , L, and X̂^(0) = X for l = 0. Here, W^(l) denotes a weighting matrix of dimension N_{l−1} × N_l (with N_l being the number of neurons in layer l), b_l is an N_l × 1 bias vector and ι_T is a T × 1 vector of ones. 3 In principle, f can vary over the different layers. The output of the network is then obtained by setting Z = X̂^(L). Notice that if we set N_L = q (≪ K), we achieve dimension reduction and the output of the network is a (non-linearly) compressed version of the input dataset. In principle, what we have just described constitutes the encoding part of the Autoencoder. If we are interested in recovering the original dataset X we simply have to add additional layers characterized by increasing numbers of neurons until we reach N_{L+j} = K for j = 1, 2, . . . . To capture the dynamics of the different cycles present in the data, the optimization procedure needs to be repeated for a reasonably large number of epochs. We find that the algorithm converges quickly and setting the number of epochs to 100 is sufficient. We employ the R interface to keras (Allaire and Chollet, 2019), a high-level neural networks API and widely used package for implementing deep learning models.
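A minimal sketch of such an Autoencoder with the R interface to keras is given below (it requires a working keras/TensorFlow installation). The single tanh bottleneck with q = 30 units, the linear output layer, the Adam optimizer and the mean squared error loss are illustrative choices rather than the exact architecture of the paper; the encoder output then serves as the factor matrix Z.

```r
library(keras)

K_dim <- ncol(X)   # number of (lagged) predictors in the toy panel
q     <- 30        # size of the bottleneck layer

# Encoder: compress the K-dimensional input to q latent factors.
enc_input  <- layer_input(shape = K_dim)
enc_output <- enc_input %>% layer_dense(units = q, activation = "tanh")
encoder    <- keras_model(enc_input, enc_output)

# Decoder: map the q factors back to the original dimension.
dec_output  <- enc_output %>% layer_dense(units = K_dim, activation = "linear")
autoencoder <- keras_model(enc_input, dec_output)

autoencoder %>% compile(optimizer = "adam", loss = "mse")
autoencoder %>% fit(X, X, epochs = 100, batch_size = 32, verbose = 0)

Z_ae <- predict(encoder, X)   # T x q matrix of non-linear factors
dim(Z_ae)                     # 240 x 30
```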
In the following, we introduce the predictive regression that links our target variable, inflation in consumer prices, to Z and other observed factors. Following Stock and Watson (1999), h-step-ahead inflation y_{t+h} is specified as the log-difference of the consumer price index between periods t + h and t, with CPI_{t+h} denoting the consumer price index in period t + h. In the empirical application we set h ∈ {1, 3}. y_{t+h} is then modeled using a dynamic regression model,

y_{t+h} = d_t′ β_{t+h} + ε_{t+h}, ε_{t+h} ∼ N(0, σ²_{t+h}), (3)

where β_{t+h} is a vector of TVPs associated with M (= q + p) covariates denoted by d_t and σ²_{t+h} is a time-varying error variance. d_t might include the latent factors extracted from the various methods discussed in the previous sub-section, lags of inflation, an intercept term or other covariates which are not compressed. Following much of the literature (Taylor, 1982; Belmonte et al., 2014a; Kalli and Griffin, 2014; Kastner and Frühwirth-Schnatter, 2014; Stock and Watson, 2016; Chan, 2017; Huber et al., 2021) we assume that the TVPs and the error variances evolve according to independent stochastic processes,

β_{t+h} = β_{t+h−1} + η_{t+h}, η_{t+h} ∼ N(0_M, V), V = diag(v_1², . . . , v_M²),
log σ²_{t+h} = μ_h + ρ_h (log σ²_{t+h−1} − μ_h) + ξ_{t+h}, ξ_{t+h} ∼ N(0, ϑ²), (4)

with μ_h denoting the conditional mean of the log-volatility, ρ_h its persistence parameter, ϑ² the variance of the log-volatility innovations, and v_j² being the process innovation variance that determines the amount of time-variation in the jth element of β_{t+h}. This setup implies that the TVPs are assumed to follow a random walk process while the log-volatilities evolve according to an AR(1) process.

The model described by Eq. (3) and Eq. (4) is a flexible state space model that encompasses a wide range of models commonly used for forecasting inflation. For instance, if we set V = 0_M and ϑ² = 0, we obtain a constant parameter model with homoscedastic errors. If V is instead a full M × M matrix but of reduced rank, we obtain the model proposed in Chan et al. (2020). If d_t includes the lags of inflation and (lagged) PCs, we obtain a model closely related to the one used in Stock and Watson (2002a). If we set d_t = 1 and allow for TVPs, we obtain a specification similar to the unobserved components stochastic volatility model successfully adopted in Stock and Watson (1999). A plethora of other models can be identified by appropriately choosing d_t, V and ϑ². This flexibility, however, calls for model selection. We select appropriate submodels by using Bayesian methods for estimation and forecasting. These techniques are further discussed in Section B of the Online Appendix and allow for data-based shrinkage towards simpler nested alternatives.

In our empirical application we consider the popular FRED-MD database. This dataset is publicly accessible and available in real-time. The monthly data vintages ensure that we only use information that would have been available at the time a given forecast is being produced. A detailed description of the database can be found in McCracken and Ng (2016). To achieve approximate stationarity we transform the dataset as outlined in Section C of the Online Appendix. Furthermore, each time series is standardized to have sample mean zero and unit sample variance prior to using the non-linear dimension reduction techniques. Our US dataset includes 105 monthly variables that span the period from 1963:01 to 2021:01.

The forecasting design relies on a rolling window, as justified in Clark (2011), that initially ranges from 1980:01 to 1999:12. For each month of the hold-out sample, which starts in 2000:01 and ends in 2019:12, we compute the h-month-ahead predictive distribution for each model (for h ∈ {1, 3}), keeping the length of the estimation sample fixed at 240 observations (i.e., a rolling window of 20 years). 5 For these periods we contrast each forecast with the realization of inflation in the vintage one quarter ahead, following the evaluation approach of Chan (2017). As most data revisions take place in the first quarter while afterwards the vintages remain relatively unchanged (see, e.g., Croushore, 2011; Pfarrhofer, 2020), we make sure that realized inflation is no longer subject to revisions. One key limitation is that all methods are specified conditionally on d_t and thus implicitly on the specific function f used to move from X to Z. Another key objective of this paper is therefore to control for uncertainty with respect to f by using dynamic model averaging techniques. For obtaining predictive combinations, we use the first 24 observations of our hold-out sample. The remaining periods (i.e., ranging from 2002:01 to 2019:12) then constitute our evaluation sample and the respective predictions are again contrasted to the one-quarter-ahead vintage of inflation.

In terms of competing models we can classify the specifications along two dimensions. The first dimension concerns the information included in d_t. Let s_t denote the K_0-dimensional vector of (transformed) variables in our dataset. The input x_t = (s_t′, . . . , s_{t−p+1}′)′ is then composed of p lags of s_t with K = pK_0. In our empirical work we set p = 12 and include all variables in the dataset (except for the transformed CPI series, i.e., K_0 = 104). We then use the different dimension reduction techniques outlined in Section 2 to estimate z_t. Moreover, we include p lags of y_t as additional observed factors in d_t. This serves to investigate how different dimension reduction techniques perform when interest centers on predicting inflation.
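The rolling-window design boils down to a loop over forecast origins in which the factors are re-extracted and a direct h-step regression is re-estimated on each 240-observation window. The sketch below illustrates the structure on simulated data, using the extract_factors() helper from the earlier sketch; for clarity it uses the contemporaneous panel rather than p = 12 lags, a single lag of inflation, and OLS in place of the Bayesian shrinkage regressions, so it is a stylized stand-in rather than the paper's real-time exercise.

```r
set.seed(1)
T_all <- 400; K0 <- 20; h <- 1; window <- 240; q <- 5
S <- matrix(rnorm(T_all * K0), T_all, K0)        # stand-in for the transformed predictor panel
y <- as.numeric(filter(rnorm(T_all), 0.5, method = "recursive"))  # stand-in for inflation

err <- c()
for (origin in window:(T_all - h)) {
  idx <- (origin - window + 1):origin            # rolling estimation sample
  Z   <- extract_factors(S[idx, ], q = q)$Z      # re-extract factors on this window
  tt  <- 1:(window - h)                          # in-sample points with observed h-step target
  fit <- lm(y[idx[tt] + h] ~ Z[tt, ] + y[idx[tt]])   # direct h-step regression, one lag of y
  dnew <- c(1, Z[window, ], y[origin])           # covariates at the forecast origin
  err  <- c(err, y[origin + h] - sum(coef(fit) * dnew))
}
sqrt(mean(err^2))                                # rolling-window RMSE of the point forecasts
```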
We also consider simple AR(12) models as well as a small- and a large-scale AR specification augmented with (observed) exogenous covariates (ARX). Since the choice of these covariates is far from obvious (see Hauzenberger et al., 2019), we use a semi-automatic approach which handles this issue rather agnostically. We discuss this in more detail in Sub-section 4.2.

The second dimension along which our models differ is the relationship between d_t and y_{t+h}, i.e., the specific relationship described by Eq. (3). To investigate whether non-linear dimension reduction techniques are sufficient to control for unknown forms of non-linearities, we benchmark all our models that feature TVPs with their respective constant parameter counterparts. To perform model selection we consider two priors. The first one is the horseshoe (HS, Carvalho et al., 2010) prior and the second one is an adaptive Minnesota (MIN, see Carriero et al., 2015; Giannone et al., 2015) prior (for further details see Section B of the Online Appendix).

In this sub-section we analyze bivariate correlations between the factors, obtained from using different dimension reduction techniques, and the variables in our dataset as well as inflation. These correlations provide some information on the specific factor dynamics and (with caution) on how to interpret the factors in Z from a structural perspective. 6 The recent literature (Crawford et al., 2018, 2019; Joseph, 2019) advocates using linear approximations or Shapley values to improve interpretability of these highly non-linear models. In this paper, we opt for a simple correlation-based approach, given the large number of competing dimension reduction techniques and the fact that such interpretability tools work better for some compression methods than for others.

Figure 1 is a heatmap of the correlations, with rows denoting the different covariates in X and columns representing the different dimension reduction techniques. These correlations are averages across the factors (in case that q > 1) and, since we include several lags of the input dataset, are also averaged across the lags. The figure suggests for most dimension reduction techniques that the factors are correlated with housing quantities (PERMIT and HOUST alongside their sub-components) as well as interest rate spreads. Some variables which measure real activity (such as industrial production and several of its components) also display comparatively large correlations with the factors. In some cases, these correlations are positive whereas in other cases, correlations are negative. In both instances, however, the absolute magnitudes are similar. The three exceptions from this rather general pattern are diffusion maps as well as PCA quadratic and squared. In these cases, the corresponding columns indicate lower correlations.

Averaging over the factors, as done in Figure 1, potentially masks important features of individual factors. Next we ask whether there are relevant differences by analyzing the correlations between each z_j (j = 1, . . . , q) and each column of X. For brevity, we focus on a specific model that performs extraordinarily well in terms of density forecasts: the Autoencoder with a single hidden layer and 30 factors. Figure 2 shows, for each factor, the five variables which display the largest absolute correlation. The variables in the rows are a union over the sets of top-five variables for each factor. This figure shows that several factors display quite similar correlation patterns. For all of them, housing quantities are either positively or negatively correlated (with similar magnitudes).
Apart from that, and consistent with the findings discussed above, we observe that financial market variables (such as interest rate spreads) show up frequently for several factors. Only very few factors depart from this overall pattern. In the case of factors 9, 22, 23 and 24 we find low correlations with housing and much stronger correlations with financial markets. In fact, factor 9 closely tracks the credit (BAAFFM) and term spreads (e.g., T10YFFM).

These two heatmaps provide a rough overview on what variables drive the factors. Next, we ask whether we can construct models based on including variables which display the strongest correlations with the factors. This approach can be interpreted as a simple selection device which takes non-linearities in the input dataset implicitly into account. Since the heatmap is based on a single dataset, we re-determine the top-five correlated variables for each real-time vintage when constructing these models (see Section C of the Online Appendix). 6 The estimates of the factors themselves are considerably more difficult to interpret. Nevertheless, to provide some intuition on how the factors for the best performing specifications evolve over time, see the corresponding figure in the Online Appendix.

The variables in the rows of Figure 2 (CP3Mx, FEDFUNDS, GS1, HOUST, HOUSTMW, HOUSTNE, HOUSTS, HOUSTW, PERMIT, PERMITMW, PERMITNE, PERMITS, PERMITW, T10YFFM, T1YFFM, T5YFFM, TB3MS, TB3SMFFM, TB6MS and TB6SMFFM) are also the ones which display high correlations with the factors in Figure 1 and are included in the large-scale ARX model. Here, it is worth stressing that there seems to be appreciable heterogeneity with respect to dimension reduction methods. Most of them generate factors that are highly correlated with real activity and housing measures as well as interest rates and other stock market variables. Interestingly, when we focus on the second group we observe that the factors arising from using PCA squared (and to a somewhat lesser extent PCA quadratic) are heavily related to labor market measures. Average correlations with prices (i.e., CUSR0000SA0L5) are small for most techniques (with PCA quadratic yielding the largest correlations of around 0.3 to 0.4). Some methods also yield factors that are strongly correlated with money stocks and reserves (e.g., diffusion maps). Table C.3 of the Online Appendix provides a much more detailed picture on the precise variables used to build the small-scale models.

Next, we ask whether the factors are correlated with inflation. Table 1 shows the correlation with inflation averaged across the number of factors for each dimension reduction technique as well as the minimum and maximum value (across these factors) in parentheses. To assess whether these correlations differ over time, we divide our sample into expansionary and recessionary periods. 7 Since the COVID-19 pandemic marks an extraordinary period in our sample, we also compute the correlations for 2020 only and include them at the bottom of Table 1. For the full sample as well as during expansions, we find that the factors obtained from using the linear variants of PCA display comparatively higher correlations relative to the other dimension reduction techniques (with some of the factors featuring a correlation of close to 0.2). In recessions and the pandemic, these correlations increase substantially to reach average correlations close to 0.3 (with the factor displaying the maximum correlation being strongly related to inflation, with values of around 0.6). The non-linear dimension reduction techniques yield strong correlations during turbulent times (i.e., recessions and the pandemic).
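The correlation screening behind Figures 1 and 2 and the selection of covariates for the small-scale ARX models can be sketched in a few lines of R. The code below computes, for each predictor, the average absolute correlation with the q factors and then returns the five most correlated variables per factor; it reuses the toy objects from the earlier sketches and the placeholder variable names are assumptions, so it illustrates the selection device rather than the exact implementation.

```r
colnames(X) <- paste0("var", seq_len(ncol(X)))   # placeholder variable names
Z <- extract_factors(X, q = 5)$Z
C <- cor(X, Z)                                   # K x q correlations between predictors and factors
avg_abs_cor <- rowMeans(abs(C))                  # heatmap-style summary (averaged over factors)
head(sort(avg_abs_cor, decreasing = TRUE))

# Top-five correlated variables per factor; the union gives the rows of a Figure-2-type plot
top5 <- apply(abs(C), 2, function(col) names(sort(col, decreasing = TRUE))[1:5])
union_top5 <- sort(unique(as.vector(top5)))      # candidate covariates for the ARX models
```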
The strong correlations of the non-linear factors during turbulent times are not surprising, since these methods tend to work well if there are strong deviations from linearity (which mostly occur in recessions). Such a feature can be easily demonstrated by considering PCA squared. In normal times, the factors will be centered around zero and typically display little variation. But in recessions the link function implies that larger changes will dominate the shape of the factors and imply pronounced movements which could be helpful for predicting turning points in inflation.

We now consider point and density forecasting performance of the different models and dimension reduction techniques. The forecast performance is evaluated through log predictive likelihoods (LPLs) for density forecasts and root mean squared errors (RMSEs) for point forecasts. Superior models are those with high scores in terms of LPL and low values in terms of RMSE. We benchmark all models relative to the autoregressive (AR) model with constant parameters and the Minnesota prior. The first entry in the tables gives the actual value of the LPL (cumulated over the hold-out sample) with actual RMSEs in parentheses (averaged over the hold-out sample) for our benchmark model. The remaining entries are differences in LPLs with relative RMSEs in parentheses. We mark statistically significant results according to the Diebold and Mariano (1995) test at the one, five and ten percent significance levels with three, two and one asterisks, respectively.

Starting with the one-month-ahead horizon, the literature on the design of neural networks (see, e.g., Huang, 2003; Heaton, 2008) suggests that the number of hidden layers should increase with the complexity of the dataset. Our results, however, suggest the opposite. For a typical US macroeconomic dataset the forecast performance of the Autoencoder seems to be strongest when a single hidden layer coupled with a large number of factors is used. Next, we inspect the longer forecast horizon in greater detail.

Before proceeding to the next sub-section we briefly discuss two important issues. First, it is worth stressing that the factors used in this forecasting exercise are extracted from the full set of variables in X. In Table A.1 and Table A.2 of the Online Appendix we divide the dataset into slow- and fast-moving variables (Bernanke et al., 2005) and extract the latent factors from these partitioned datasets exclusively. The main results based on extracting the factors from the full dataset remain in place: for one-month-ahead forecasts we find the Autoencoder to perform particularly well whereas for one-quarter-ahead predictions PCA squared and quadratic yield accurate forecast densities.

Note: The table extends Tables 2 and 3 with ARX models. For these ARX specifications the best performing model solely serves as an unsupervised variable selection device. Conditional on the latent factors of this best performing dimension reduction technique, we always include the top-five correlated variables as covariates alongside the lags of inflation (see Figure C.1 of the Online Appendix). Asterisks indicate statistical significance for each model relative to the benchmark at the 1% (***), 5% (**) and 10% (*) significance levels.

Second, for one of our best performing models (the Autoencoder with one hidden layer) forecasting performance changes sharply when the number of factors is changed. This raises the question of how forecast performance relates to the number of factors.
In an additional figure in the Online Appendix we show two graphs that illustrate how point and density forecasting performance change with the number of factors. In this exercise we find that the largest jumps in predictive accuracy are found when increasing the number of factors from 17 to 24 and again from 29 to 30 in terms of LPLs, and from 18 to 26 and from 29 to 30 in terms of RMSEs.

The results based on RMSEs and LPLs provide information on relative forecasting performance. In the next step, we ask whether the different methods and models we propose yield predictive distributions which are well calibrated. To this end, we consider the normalized forecast errors obtained through the probability integral transform (PIT). If a model is correctly specified, the PITs are iid uniformly distributed and the respective standardized forecast errors should be iid normally distributed. Departures from the standard Gaussian distribution allow us to inspect along what dimensions the model is poorly calibrated. For instance, if the variance of the normalized forecast errors is too small (i.e., below one), this is evidence that the predictive distribution is too wide (i.e., it places too much probability mass in the tails), while values greater than one indicate that the predictive distribution is too tight (i.e., the tails are not adequately represented).

Note: Following Clark (2011), we show the mean, the variance and the AR(1) coefficient of the normalized forecast errors. Given a well-calibrated model (i.e., the null hypothesis), normalized forecast errors should have zero mean, a variance of one and experience no autocorrelation. These conditions are tested separately: 1) to test for a zero mean we compute the p-values with a Newey-West variance (with five lags); 2) to test for a unit variance we regress the squared normalized forecast errors on an intercept and allow for a Newey-West variance (with three lags); 3) to test for no autocorrelation we obtain the p-values with an AR(1) model that features an unconditional mean and heteroskedasticity-robust standard errors. Asterisks indicate statistical significance for each model at the 1% (***), 5% (**) and 10% (*) significance levels.

Table 5 shows the results for the one-month-ahead normalized forecast errors. 8 In principle, we observe that the mean across methods is close to zero (with a few exceptions such as PCA quadratic for q = 15). Nevertheless, these differences are never statistically significantly different from zero. Considering the variances shows that most models yield forecast distributions which seem to be slightly too narrow (with variances exceeding one). The asterisks indicate whether the variances are significantly different from one. For a few models, this is the case (especially if we assume constancy of the parameters), but if we allow for TVPs there are only a handful of cases left. This, however, strongly depends on the shrinkage prior adopted. Turning to the autocorrelation of the normalized shocks reveals that these are mostly close to zero and never statistically significantly different from zero. Comparing sophisticated to simple dimension reduction methods suggests no discernible differences in model calibration. In principle, approaches based on linear PCs yield normalized forecast errors with similar statistical properties to the ones obtained from using more sophisticated dimension reduction techniques.

Note: If a model is correctly specified, the PITs are standard uniformly distributed and the normalized forecast errors standard normally distributed.
This theoretically correct specification is indicated by the black lines, with the dashed lines referring to the respective 95% confidence interval. In red we present the results of the benchmark, whereas in blue we indicate the respective model. The light gray shaded areas refer to the global financial crisis. The discussion above might mask important differences in calibration of different parts of the predictive distribution. We now turn to a deeper analysis of the one-month-ahead predictive distribution of the two best performing models vis-á-vis the benchmark: the Autoencoder 1l (q = 30) and PCA squared (q = 5). This analysis is based on visual inspection of the normalized forecast errors (left panel of Figure 3 ), a histogram of the PITs (middle panel of Figure 3 ) and the visual diagnostic of the empirical cumulative density function proposed in Rossi and Sekhposyan (2019) (right panel of Figure 3 ). Recall that, under correct specification, the PITs should be iid uniformly and the normalized forecast errors should be iid standard normally distributed, respectively. The left panel of the figure indicates that for both models under consideration, normalized forecast errors are centered on zero, display little serial correlation and a variance close to one (with the Autoencoder generating slightly more spread out normalized forecast errors). In some periods, normalized forecast errors depart significantly from the standard normal distribution (i.e., the corresponding observations lie outside the 95% confidence intervals). But in general, and for both models (and the benchmark), model calibration seems to be adequate. Next, we focus on the histogram in the middle panel of Figure 3 (which includes 95% confidence intervals). From this figure, we learn that both models are well calibrated with some tendency to overestimate the upper tail risk. Finally, considering the right panel shows that all models appear to be well calibrated, with most observations being clustered around the 45 degree lines and not a single observation being outside the 95% confidence intervals. To the detriment of linear modeling techniques, the COVID-19 pandemic caused severe outliers for several of the time series we include in our dataset. Following the recent literature (e.g., Huber et al., 2020; Clark et al., 2021; Coulombe et al., 2021) which advocates using non-linear and non-parametric modeling techniques in turbulent times, we briefly investigate whether the non-linear dimension reduction techniques proposed in this paper yield more precise inflation forecasts during the pandemic. Figure 4 depicts the differences in LPLs for the period 2020:01 to 2020:08. For illustrative purposes, we only consider the models with 30 factors. 9 The figure provides a few interesting insights. First, we observe that in March 2020, models based on the Autoencoder improve upon the benchmark, irrespective of the prior and regression specification adopted. This finding is less pronounced for the other techniques in the constant parameter case. Comparing the performance of the constant parameter and the TVP regression models reveals that, irrespective of the prior, allowing for time variation in the parameters improves density forecasts during the pandemic. This finding is consistent with findings in, e.g., Huber et al. (2020) , who show that flexible models improve upon linear models during the pandemic due to increases in the predictive variance. 
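The calibration diagnostics discussed above can be computed directly from a set of predictive draws. The R sketch below obtains PITs and normalized forecast errors from simulated predictive distributions and runs the three checks in a simplified form (plain t-tests and an AR(1) regression instead of the Newey-West corrections used in the paper); the simulated draws and these simplifications are assumptions made to keep the example self-contained.

```r
set.seed(1)
n_eval <- 240
# Simulated predictive distributions (rows: forecast periods, columns: draws) and realizations.
# The predictive sd of 1.2 makes the densities deliberately too wide.
pred_draws <- matrix(rnorm(n_eval * 2000, mean = 0, sd = 1.2), nrow = n_eval)
y_real     <- rnorm(n_eval)

# PITs: share of predictive draws below the realization; normalized errors via qnorm.
pit  <- rowMeans(pred_draws <= y_real)
zerr <- qnorm(pmin(pmax(pit, 1e-6), 1 - 1e-6))     # guard against PITs of exactly 0 or 1

mean(zerr); var(zerr)                              # ~0 and ~1 under correct calibration;
                                                   # here var < 1 because the density is too wide
t.test(zerr, mu = 0)$p.value                       # simplified zero-mean test
t.test(zerr^2, mu = 1)$p.value                     # simplified unit-variance test
summary(lm(zerr[-1] ~ zerr[-length(zerr)]))        # AR(1) check for autocorrelation
```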
9 The findings for the other factors are very similar and available from the corresponding author upon request.

In the previous sub-section and Section A of the Online Appendix we provide some evidence that model performance varies considerably over time (see Figure A.3). The key implication is that non-linear compression techniques (and time-varying parameters) might be useful during turbulent times, whereas the evidence of forecast gains is less pronounced in normal times. In this sub-section, we ask whether combining models in a dynamic manner further improves predictive accuracy. After having obtained the predictive densities of y_{t+h} for the different dimension reduction techniques and model specifications, the goal is to exploit the advantages of both linear and non-linear approaches. This is achieved by combining models in a model pool such that better performing models over certain periods receive larger weights while inferior models are subsequently down-weighted.

The literature on forecast combinations suggests several different weighting schemes, ranging from simply averaging over all models (see, e.g., Hendry and Clements, 2004; Hall and Mitchell, 2007; Clark and McCracken, 2010; Berg and Henzel, 2015) to estimating weights based on the models' performances according to the minimization of an objective or loss function (see, e.g., Timmermann, 2006; Hall and Mitchell, 2007; Geweke and Amisano, 2011; Conflitti et al., 2015; Pettenuzzo and Ravazzolo, 2016) or according to the posterior probabilities of the predictive densities (see, e.g., Raftery et al., 2010; Koop and Korobilis, 2012; Beckmann et al., 2020). More recent approaches set up separate state space models which assume sophisticated laws of motion for the weights associated with each predictive distribution (Billio et al., 2013; Pettenuzzo and Ravazzolo, 2016; McAlinn and West, 2019). These approaches, while being elegant and having the advantage of incorporating all available sources of uncertainty (i.e., they also control for estimation uncertainty in the weights), are computationally cumbersome if the number of models to be combined is large.

Since our model space is huge, we use computationally efficient approximations to dynamically combine models. Our approach builds on combining predictive densities according to their posterior probabilities. This is referred to as Bayesian model averaging (BMA). The resulting weights are capable of reflecting the predictive power of each model for the respective periods. Dynamic model averaging (DMA), as specified by Raftery et al. (2010), extends the approach by adding a discount (or forgetting) factor to control for a model's forecasting performance in the recent past. The 'recent past' is determined by the discount factor, with higher values attaching greater importance to past forecasting performances of the model and lower values gradually ignoring results of past predictive densities. Similar to Raftery et al. (2010), Koop and Korobilis (2012) and Beckmann et al. (2020), we apply DMA to combine the predictive densities of our various models. These methods do not require computationally intensive MCMC or sequential Monte Carlo techniques and are thus fast and easy to implement.

DMA works as follows. Let ω_{t+h|t} = (ω_{t+h|t,1}, . . . , ω_{t+h|t,J})′ denote a set of weights for J competing models at time t + h given all available information up to time t.
These (horizon-specific) weights vary over time and depend on the recent predictive performance of the model according to

ω_{t+h|t,j} = ω_{t|t,j}^δ / Σ_{i=1}^J ω_{t|t,i}^δ,
ω_{t+h|t+h,j} = ω_{t+h|t,j} p_j(y_{t+h}|y_{1:t}) / Σ_{i=1}^J ω_{t+h|t,i} p_i(y_{t+h}|y_{1:t}),

where p_j(y_{t+h}|y_{1:t}) denotes the h-month-ahead predictive distribution of model j evaluated at y_{t+h} and δ ∈ (0, 1] denotes a forgetting factor close to one. Intuitively speaking, the first equation is a prediction of the weights based on all available information up to time t while the second equation shows how the weights get updated if new data flows in. In our empirical work we set δ = 0.97. 10 Notice that if δ = 1, we obtain standard BMA weights, while δ close to zero would imply that the weights depend almost exclusively on the forecasting performance in the last period.

Weights obtained by combining models according to their predictive power convey useful information about the adequacy of each model over time. In order to get a comprehensive picture of the effects of different model modifications, we combine our models and model specifications in various ways. Across the two forecast horizons considered, we find pronounced accuracy improvements for point and density forecasts relative to the AR model. When we benchmark the different combination strategies to the single best performing model we find no accuracy gains for either horizon. Differences in terms of LPLs are, however, rather small. This suggests that while the best performing model (i.e., a constant parameter regression with factors obtained through the Autoencoder) is hard to beat, one can effectively reduce model and specification uncertainty and thus obtain competitive forecasts without the need to rely on a single model.

Comparing whether restricting the model space a priori improves predictions yields mixed insights. For the one-month-ahead horizon we observe that pooling over models which use our variant of the Minnesota prior yields more favorable forecasts as compared to a pooling strategy which uses both priors or the horseshoe only. When we pool over constant and TVP regressions we find small decreases in predictive accuracy relative to a model pool which only includes constant parameter regressions. Turning to one-quarter-ahead forecasts yields a similar picture. Using a large pool of models generally leads to slightly less precise forecasts. For higher-order forecasts our results suggest that pooling models which use the horseshoe yields higher LPLs. When we compare the different regression specifications we find that integrating out uncertainty with respect to whether parameters should be time-varying yields forecasts which are very similar to the strategy that only pools over constant parameter models. In general, the differences in predictive performance across the DMA-based averaging schemes are small. Hence, as a general suggestion we can recommend applying DMA and using the most exhaustive model space available (i.e., including both priors, the different numbers of factors, and TVP and constant parameter regressions).

To investigate which model receives substantial posterior weight over time, Figure 5 plots the evolution of the DMA weights. The bottom panel (panel (c)) of Figure 5 provides information on how much weight is allocated to models that exploit non-linear dimension reduction techniques. This figure corroborates our full sample findings: the Autoencoder performs extremely well and dominates our pool of models. Notice, however, that this statement is not true during the global financial crisis. During that period we observe that models based on PCA squared and PCA quadratic feature large weights.
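The forgetting-factor recursion is straightforward to implement. The sketch below updates the weights of J models over an evaluation sample given a matrix of predictive likelihoods; the simulated likelihoods are placeholders for the values that, in the application, come from the models' predictive densities, and δ = 1 would reproduce standard BMA weights.

```r
dma_weights <- function(pred_lik, delta = 0.97) {
  # pred_lik: T x J matrix of predictive likelihoods p_j(y_{t+h} | y_{1:t})
  T_ev <- nrow(pred_lik); J <- ncol(pred_lik)
  w_filt <- matrix(NA, T_ev, J)
  w_prev <- rep(1 / J, J)                      # equal initial weights
  for (t in 1:T_ev) {
    w_pred <- w_prev^delta / sum(w_prev^delta) # prediction (forgetting) step
    w_upd  <- w_pred * pred_lik[t, ]           # update with the new predictive likelihood
    w_filt[t, ] <- w_upd / sum(w_upd)
    w_prev <- w_filt[t, ]
  }
  w_filt
}

set.seed(1)
pl <- matrix(rexp(216 * 4), 216, 4)            # toy predictive likelihoods for J = 4 models
w  <- dma_weights(pl, delta = 0.97)
round(tail(w, 1), 3)                           # weights at the end of the evaluation sample
```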
We also find that linear techniques (PCA linear) and other non-linear techniques (PCA with a Gaussian kernel, LLE, ISOMAP, diffusion maps) retreive almost no posterior weight over time. Summing up this discussion we find that the single best performing model (the Autoencoder) is hard to beat when we dynamically combine models. However, this comparison is, to some extent, unfair since the researcher does not have this information at her disposal. Hence, combining models helps to integrate out this uncertainty by producing forecasts which are close to the single best performing model but, at the cost of higher computational costs, without the necessity of knowing the strongest single model specification. In macroeconomics, the vast majority of researchers compress information using linear methods such as principal components to efficiently summarize information embodied in huge datasets in forecasting applications. Machine learning techniques describing large datasets with relatively few latent factors have gained relevance in the last years in various areas. In this paper, we have shown that using such approaches potentially improves real-time inflation forecasts for a wide range of competing model specifications. Our findings indicate that point forecasts of simpler models are hard to beat (especially at the one-month-ahead horizon). For density forecasts, however, we find that more sophisticated modeling techniques that rely on non-linear dimension reduction do particularly well. Among all the techniques considered, our results suggest that the Autoencoder, a particular form of a deep neural network, produces the most precise inflation forecasts (both in terms of point and density predictions). The large battery of competing models gives rise to substantial model uncertainty. We address this issue by using dynamic model averaging to dynamically weight different models, dimension reduction methods and priors. Doing so yields forecasts which are almost as accurate as the ones obtained from the single best performing models. Real-time Inflation Forecasting Using Non-linear Dimension Reduction Techniques NIKO HAUZENBERGER 1,2 , FLORIAN HUBER 1 , and KARIN KLIEBER 1 1 University of Salzburg 2 Vienna University of Economics and Business A.1 Properties of the factors cont. A.2 Results based on extracting the factors from subgroups of the dataset Note: The table shows LPLs with RMSEs in parentheses below when we divide our dataset into slow-and fast-moving variables. The first (red shaded) entry gives the actual LPL and RMSE scores of our benchmark (an autoregressive (AR) model with constant parameters and a Minnesota prior). Asterisks indicate statistical significance for each model relative to the benchmark at the 1% (***), 5% (**) and 10% (*) significance levels. Since the large ARX model with time-varying parameters would features 273 period-specific coefficients and is computationally intractable, we assume that the TVPs feature a factor structure (with three factors) to reduce the dimension of the state space (see Section B of the Online Appendix and Chan et al., 2020) . A.3 Forecast performance over time we observe declines in the relative model performance vis-á-vis the AR benchmark. A.4 Probability integral transforms of higher order forecasts To implement the Bayesian priors to achieve shrinkage in the TVP regression defined by Eq. (3) and Eq. (4), we use the non-centered parameterization proposed in Frühwirth-Schnatter and Wagner (2010). 
Intuitively speaking, this allows us to move the process innovation variances into the observation equation and discriminate between a time-invariant and a time-varying part of the model. The non-centered parameterization of the model is given by

y_{t+h} = d_t′ β + d_t′ diag(v_1, . . . , v_M) β̃_{t+h} + ε_{t+h}, (B.1)

where the jth element of β̃_{t+h} is given by β̃_{jt+h} = (β_{jt+h} − β_j)/v_j and the normalized states evolve according to a random walk with standard normal innovations. Conditional on the normalized states β̃, Eq. (B.1) can be written as a linear regression model, y_{t+h} = x̃_t′ α + ε_{t+h}, with x̃_t = (d_t′, (d_t ⊙ β̃_{t+h})′)′ and α = (β′, v_1, . . . , v_M)′. We use a multivariate Gaussian prior on α, α ∼ N(0_{2M}, V). Since our dependent variable is transformed to be approximately stationary, we set the prior mean equal to zero. This reflects the notion that inflation, as defined in Eq. (2), follows a white noise process a priori. Moreover, V denotes a 2M-dimensional prior variance-covariance matrix, V = diag(τ_1², . . . , τ_{2M}²). This matrix collects the prior shrinkage parameters τ_j associated with the time-invariant regression coefficients and the process innovation standard deviations.

In the empirical work, we consider two prior variants that differ in the specification of the prior variance-covariance matrix V. The first is the horseshoe (HS, Carvalho et al., 2010) prior and the second is an adaptive Minnesota (MIN, see Carriero et al., 2015; Giannone et al., 2015) prior. The horseshoe prior of Carvalho et al. (2010) achieves shrinkage by introducing local and global shrinkage parameters (see Polson and Scott, 2010). These follow a standard half-Cauchy distribution restricted to the positive real numbers. That is, τ_j = ζ_j ς with ζ_j ∼ C⁺(0, 1) and ς ∼ C⁺(0, 1). While the global component ς strongly pushes all coefficients in α towards the prior mean (i.e., zero), the local scalings {ζ_j}_{j=1}^{2M} allow for variable-specific departures from zero in light of a global scaling parameter close to zero. This flexibility leads to heavy tails in the marginal prior (obtained after integrating out ζ_j), which turns out to be useful for forecasting.

Inspired by Chan (2021), we consider a simplified version of the adaptive hierarchical Minnesota prior. This setup is also closely related to the one of Carriero et al. (2015) and Giannone et al. (2015). That is, we allow for different treatment of own lags of inflation, the exogenous factors as well as the square roots of the state innovation variances (governing the time variation). To capture this notion of the adaptive Minnesota prior and to be consistent with the notation of the horseshoe prior, we impose the following structure on the diagonal elements of the prior variance-covariance matrix: for coefficients associated with the own lags of inflation (l = 1, . . . , p), the prior variance is governed by the global parameter ς_1 and the corresponding local scaling; for coefficients associated with the kth exogenous factor (k = 1, . . . , q), it is governed by ς_2 and additionally scaled by the variance ratio σ̂²_π/σ̂²_k; and for the process innovation standard deviations it is governed by ς_3. Here, ς_1, ς_2 and ς_3 are global shrinkage parameters, {ζ_j}_{j=1}^{2M} denote local scalings, and σ̂²_π as well as {σ̂²_k}_{k=1}^q refer to OLS variances of an AR(1) model on inflation and on the kth exogenous factor, respectively. We assume that the global scaling parameters ς_1, ς_2 and ς_3 feature a hierarchical prior structure and are standard half-Cauchy distributed (see Polson and Scott, 2012), while the local scalings are set to fixed and known values. This structure contrasts with the horseshoe prior above, where we do not discriminate between certain groups of coefficients and where both hyperparameters (i.e., global and local scalings) are estimated from the data.

We sample from the relevant full conditional posterior distributions iteratively. This is repeated 10,000 times and the first 2,000 draws are discarded as burn-in.
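To give a concrete sense of how horseshoe shrinkage operates on the constant part of such a regression, the sketch below implements a standard Gibbs sampler for a static linear regression with a horseshoe prior, using the inverse-gamma auxiliary representation of the half-Cauchy distributions. It is a generic textbook-style implementation on simulated data, not the authors' TVP sampler, and the number of draws is kept small for illustration.

```r
rinvgamma <- function(n, shape, rate) 1 / rgamma(n, shape = shape, rate = rate)

horseshoe_lm <- function(y, X, ndraw = 3000, burn = 1000) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p); sig2 <- 1; lam2 <- rep(1, p); tau2 <- 1; nu <- rep(1, p); xi <- 1
  XtX <- crossprod(X); Xty <- crossprod(X, y)
  keep <- matrix(NA, ndraw - burn, p)
  for (it in 1:ndraw) {
    # beta | rest  (prior: beta_j ~ N(0, sig2 * tau2 * lam2_j))
    A    <- XtX + diag(1 / (tau2 * lam2), p)
    A_ch <- chol(A)
    mu   <- backsolve(A_ch, forwardsolve(t(A_ch), Xty))
    beta <- mu + backsolve(A_ch, rnorm(p)) * sqrt(sig2)
    # local and global scales via inverse-gamma auxiliaries (half-Cauchy marginals)
    lam2 <- rinvgamma(p, 1, 1 / nu + beta^2 / (2 * tau2 * sig2))
    nu   <- rinvgamma(p, 1, 1 + 1 / lam2)
    tau2 <- rinvgamma(1, (p + 1) / 2, 1 / xi + sum(beta^2 / lam2) / (2 * sig2))
    xi   <- rinvgamma(1, 1, 1 + 1 / tau2)
    # error variance
    resid <- y - X %*% beta
    sig2  <- rinvgamma(1, (n + p) / 2, sum(resid^2) / 2 + sum(beta^2 / (tau2 * lam2)) / 2)
    if (it > burn) keep[it - burn, ] <- beta
  }
  colMeans(keep)
}

# Sparse toy example: only the first 3 of 50 coefficients are non-zero.
set.seed(1)
n <- 200; p <- 50
Xs <- matrix(rnorm(n * p), n, p)
yb <- Xs[, 1:3] %*% c(2, -1.5, 1) + rnorm(n)
round(horseshoe_lm(yb, Xs)[1:6], 2)   # first three large, the remainder shrunk towards zero
```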
The Federal Reserve Economic Data (FRED) database contains monthly observations of macroeconomic variables for the US and is available for download at https://research.stlouisfed.org. Details on the dataset can be found in McCracken and Ng (2016). For each data vintage (available from 1999:08), the time series start in January 1959. Due to missing values in some of the series, we preselect 105 variables and transform them according to Table C.1.

We use all variables in our models except for the ARX model. In this case, we apply the variable selection approach described in Sub-section 4.2 and choose the variables according to their correlation with the factors obtained from the different dimension reduction techniques. For the small ARX models, we determine the top-five correlated variables in each vintage for each dimension reduction technique; a small code sketch of this step is given below. As an example, Figure C.1 shows the outcome of the first step for the best performing models (i.e., the Autoencoder with one layer and 30 factors, PCA quadratic and PCA squared with five factors).

Note (Table C.1): Column "Trans I(0)" denotes the transformation applied to each time series to achieve approximate stationarity: (1) no transformation, (2) Δx_t, (4) log(x_t), (5) Δ log(x_t), (6) Δ² log(x_t), (7) Δ(x_t/x_{t−1} − 1.0).

Note: The table shows the first five variables with the highest average correlation with the factors obtained from the different dimension reduction techniques. We determine the top-five correlated variables for each vintage and present those occurring most frequently. This overview serves mainly as a simple illustration of the unsupervised variable selection approach used in Table 4. In parentheses, we report the frequency of occurrence across these vintages. For example, when we obtain five factors with the Autoencoder (1 layer), the variable PERMIT is among the top-five correlated variables in 22 percent of all vintages.
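The following is a minimal R sketch of this correlation-based selection step, assuming the transformed series and the estimated factors for one vintage are available as matrices. The function name select_top_correlated and the rule of averaging absolute correlations across factors are our own illustrative assumptions, not the paper's exact code.

# Minimal sketch of the correlation-based variable selection for the small ARX
# models: for one vintage, rank the transformed series by their average absolute
# correlation with the estimated factors and keep the top five.
select_top_correlated <- function(X, factors, n_keep = 5) {
  # X:       T x N matrix of stationarity-transformed series (columns named)
  # factors: T x q matrix of latent factors from one dimension reduction method
  corr_mat <- abs(cor(X, factors, use = "pairwise.complete.obs"))  # N x q
  avg_corr <- rowMeans(corr_mat)                                   # average over factors
  names(sort(avg_corr, decreasing = TRUE))[seq_len(n_keep)]
}

# Example usage (placeholder objects):
# top5 <- select_top_correlated(X_vintage, autoencoder_factors)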
References

keras: R Interface to 'Keras'
destiny: diffusion maps for large-scale single-cell data in R
Determining the number of factors in approximate factor models
Exchange rate predictability and dynamic Bayesian learning
Hierarchical shrinkage in time-varying coefficient models
Hierarchical shrinkage in time-varying parameter models
Point and density forecasts for the euro area using Bayesian VARs
Measuring the effects of monetary policy: A factor-augmented vector autoregressive (FAVAR) approach
Time-varying combinations of predictive densities using nonlinear filtering
Bayesian VARs: specification choices and forecast accuracy
On Gibbs sampling for state space models
The horseshoe estimator for sparse signals
Machine learning at central banks
The stochastic volatility in mean model with time-varying parameters: An application to inflation modeling
Minnesota-type adaptive hierarchical priors for large Bayesian VARs
A new model of inflation, trend inflation, and long-run inflation expectations
Reducing the state space dimension in a large TVP-VAR
Real-time density forecasts from BVARs with stochastic volatility
Tail Forecasting with Multivariate Bayesian Additive Regression Trees
Averaging forecasts from VARs with uncertain instabilities
Macroeconomic forecasting performance under alternative specifications of time-varying volatility
Diffusion maps
Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps
Optimal combination of survey forecasts
How is machine learning useful for macroeconomic forecasting?
Can machine learning catch the Covid-19 recession?
Variable prioritization in nonlinear black box methods: A genetic association case study
Bayesian approximate kernel regression with variable selection
Frontiers of real-time data analysis
Macroeconomic forecasting and structural change
Forecasting using a large number of predictors: Is Bayesian shrinkage a valid alternative to principal components?
Comparing Predictive Accuracy
lle: Locally linear embedding
Nonlinear forecasting with many predictors using kernel ridge regression
Deep learning for predicting asset returns
Data augmentation and dynamic linear models
Stochastic model specification search for Gaussian and partial non-Gaussian state space models
On learning the derivatives of an unknown mapping with multilayer feedforward networks
Optimal prediction pools
Prior selection for vector autoregressions
Nonlinear forecasting using large datasets: Evidences on US and Euro area economies
Combining density forecasts
Fast and Flexible Bayesian Inference in Time-varying Parameter Regression Models
Introduction to neural networks with Java
Deep learning for finance: deep portfolios
Pooling of forecasts
Learning capability and storage capacity of two-hidden-layer feedforward networks
Inducing Sparsity and Shrinkage in Time-Varying Parameter Models
Nowcasting in a pandemic using non-parametric mixed frequency VARs
An inflation-predicting measure of the output gap in the Euro area
Parametric inference with universal function approximators
Time-varying sparsity in dynamic regression models
Dealing with stochastic volatility in time series using the R package stochvol
Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models
Locally linear embedding algorithm: Extensions and applications
Characteristics are covariances: A unified model of risk and return
Forecasting inflation using dynamic model averaging
Estimation and forecasting in models with multiple breaks
High-dimensional macroeconomic forecasting using message passing algorithms
The use of hybrid manifold learning and support vector machines in the prediction of business failure
A simple sampler for the horseshoe estimator
Forecasting inflation with thick models and neural networks
Dynamic Bayesian predictive synthesis in time series forecasting
FRED-MD: A monthly database for macroeconomic research
Forecasting inflation in a data-rich environment: the benefits of machine learning methods
Machine learning: An applied econometric approach
vegan: Community Ecology Package
Linear versus nonlinear dimensionality reduction for banks' credit rating prediction
State-varying factor models of large dimensions
Optimal portfolio choice under decision-based model combinations
Forecasts with Bayesian vector autoregressions under real time conditions
Shrink globally, act locally: Sparse Bayesian regularization and prediction
Online prediction under model uncertainty via Dynamic Model Averaging: Application to a cold rolling mill
Supervised Isomap with dissimilarity measures in embedding learning
Exploiting low-dimensional structure in astronomical spectra
Alternative tests for correct specification of conditional predictive densities
Nonlinear dimensionality reduction by locally linear embedding
On the information bottleneck theory of deep learning
Nonlinear component analysis as a kernel eigenvalue problem
Macroeconomic forecasting using diffusion indexes
Forecasting using principal components from a large number of predictors
Why has U.S. inflation become harder to forecast?
Core inflation and trend inflation
Financial returns modelled by the product of two stochastic processes: a study of the daily sugar prices 1961-75
A global geometric framework for nonlinear dimensionality reduction
Forecast combinations
Self-tuning spectral clustering
Economic performance evaluation and classification using hybrid manifold learning and support vector machine model