Forecasting Market Changes using Variational Inference

Udai Nagpal and Krishan Nagpal
May 2, 2022

Abstract: Though various approaches have been considered, forecasting near-term market changes of equities and similar market data remains quite difficult. In this paper we introduce an approach to forecast near-term market changes for equity indices as well as portfolios using variational inference (VI). VI is a machine learning approach which uses optimization techniques to estimate complex probability densities. In the proposed approach, clusters of explanatory variables are identified and market changes are forecast based on cluster-specific linear regression. Apart from the expected value of changes, the proposed approach can also be used to obtain the distribution of possible outcomes, which can be used to estimate confidence levels of forecasts and risk measures such as VaR (Value at Risk) for the portfolio. Another advantage of the proposed approach is the clear model interpretation, as clusters of explanatory variables (or market regimes) are identified for which the future changes follow similar relationships. Knowledge about such clusters can provide useful insights about portfolio performance and identify the relative importance of variables in different market regimes. Illustrative examples of equity and bond indices are considered to demonstrate forecasts of the proposed approach during Covid-related volatility in early 2020 and subsequent benign market conditions. For the portfolios considered, it is shown that the proposed approach provides useful forecasts in both normal and volatile markets even with only a few explanatory variables.
Additionally, the predicted estimate and distribution adapt quickly to changing market conditions and thus may also be useful in obtaining better real-time estimates of risk measures such as VaR compared to traditional approaches.

Financial market data is hard to forecast, and various approaches ranging from fundamental and econometric methods to technical charts and machine learning have been proposed with varying levels of success (see for example [11]). Some of the early approaches for forecasting equity changes were based on regression and time series prediction models such as ARMA. More complex modeling approaches for dynamical systems, such as Hidden Markov Models, have also been proposed (see for example [7]). Neural networks of various forms have also been proposed (see for example [5]) and are generally good at identifying non-linear relationships. Ensembles of decision trees can be used to synthesize forecasts from different indicators, as in [9]. Support Vector Machines have shown strong performance in financial forecasting (see for example [8]). The goal of this paper is to show that Variational Inference (VI) approaches can also be successfully adapted to provide accurate and insightful financial forecasts. The objective here is not to compare the performance of various machine learning approaches for financial forecasting, or even to describe the best VI approach and its performance, but rather to illustrate how a VI approach can be applied in financial forecasting to obtain useful forecasts and insights. VI is generally used to obtain analytical approximations of the posterior probability density for graphical models in which the observed data depends on unobserved latent variables. VI has been used in a wide range of applications to obtain approximate Bayesian posterior distributions conditioned on data (see for example [1] and [2]).
Compared to sampling-based approaches such as the Metropolis-Hastings algorithm ([6]), VI can be more computationally efficient for large data sets or complex distributions. The concept of latent variables in creating posterior distributions in VI has a natural analog in financial data, as market participants think of market conditions as "different regimes" where in each regime markets behave in a particular manner. For example, market participants might think of market regimes as "rates increasing" vs. "rates decreasing" environments, or as "risk on" vs. "risk off" environments, where in each such environment market changes are expected to follow a similar trend. In VI and other latent variable approaches, these market regimes such as "risk on" or "risk off" are not user specified but rather identified through analysis of the data.

In the proposed framework it is assumed that the underlying data is a mixture model of Gaussian clusters with a locally linear mapping, where the unknown linear regression vector for each cluster is drawn from another Gaussian distribution. Similar mixture models have been considered in [3] and, for dynamical systems, in [10]; in both cases parameter estimation is based on Expectation-Maximization. Variational approaches for mixtures of linear mixed models have also been considered in [12]. The main advantages of using the proposed approach in financial forecasting are a) its closed-form solution involving a relatively easy optimization problem, b) obtaining not only an estimate but the entire distribution of outcomes, which can be useful in scenario analysis and portfolio risk management, and c) by identifying clusters, one can gain insight into market conditions for which future changes follow similar patterns.

This paper is organized as follows. The next section contains a brief introduction to variational inference and the mean-field family of distributions.
The underlying assumptions and modeling framework are described in Section 3. Section 4 contains illustrative examples to demonstrate the strong performance of the proposed approach in both normal and volatile market conditions. Section 5 concludes with a summary, and the Appendix contains the derivation of the variational parameter estimates.

In Bayesian statistics the goal is to infer the posterior distribution of unknown quantities using observations. For some complex problems it is assumed that observations, denoted by x = {x_1, ..., x_N} in this section, are linked to latent variables z = {z_1, ..., z_K} which are drawn from a prior distribution p(z). The likelihood of observation x depends on z through the distribution p(x|z). In Bayesian learning and estimation problems the goal is to learn the posterior distribution p(z|x) of the latent variables conditioned on the data. VI is a widely-used method in Bayesian machine learning to approximate posterior distributions. In VI a suitable family of distributions F is chosen which is complex enough to capture the attributes of the data and yet simple enough to be computationally tractable. The best distribution is obtained from an optimization to minimize the Kullback-Leibler (KL) divergence with respect to the exact posterior (see for example [1] and [2]):

q*(z) = argmin_{q ∈ F} KL( q(z) || p(z|x) )     (1)

where the KL divergence, a measure of information-theoretic distance between two distributions, is obtained as follows (E_q denotes expectation with respect to q(z)):

KL( q(z) || p(z|x) ) = E_q[ log q(z) ] − E_q[ log p(z, x) ] + log p(x)     (2)

Since log p(x) does not depend on the distribution q(z), the optimal distribution q*(z) that minimizes the KL divergence in (1) also maximizes the Evidence Lower Bound (ELBO) defined as follows:

ELBO(q) = E_q[ log p(z, x) ] − E_q[ log q(z) ]     (3)

Here we will work with the mean-field variational family of distributions q(z), in which one assumes that the latent variables are mutually independent and the probability density of each latent variable is governed by a distinct factor (see [1] and [2] for more details).
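The identity log p(x) = ELBO(q) + KL(q(z) || p(z|x)) implied by the equations above can be checked numerically. The following is a minimal sketch, assuming a conjugate toy model z ∼ N(0,1), x|z ∼ N(z,1), for which the posterior N(x/2, 1/2) and the evidence N(0,2) are available in closed form (the model and all names here are illustrative, not from this paper):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3  # a single observation

# Conjugate toy model: z ~ N(0,1), x|z ~ N(z,1)
# => posterior z|x ~ N(x/2, 1/2) and evidence x ~ N(0,2), both closed form
log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

def elbo(m, s, n=400_000):
    """Monte Carlo estimate of the ELBO for the variational family q(z) = N(m, s^2)."""
    z = rng.normal(m, s, n)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
    return np.mean(log_joint - norm.logpdf(z, m, s))

def kl_gaussians(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# For an arbitrary q: ELBO(q) + KL(q || posterior) equals the log evidence
gap = elbo(0.2, 0.8) + kl_gaussians(0.2, 0.8, x / 2, np.sqrt(0.5)) - log_evidence
# When q is the exact posterior, KL = 0 and the ELBO attains the log evidence
tight = elbo(x / 2, np.sqrt(0.5)) - log_evidence
```

Since log p(x) is fixed, maximizing the ELBO over q is equivalent to minimizing the KL divergence in (1), which is why the ELBO can serve as the optimization objective.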
This assumption greatly simplifies the computation of the optimal parameters of the variational distribution q(z). Due to the assumption of mutual independence, if there are K latent variables z = {z_1, ..., z_K}, the joint density function is of the form

q(z) = ∏_{k=1}^{K} q_k(z_k)

where q_k(z_k) is the density of the k'th latent variable z_k and each q_k has its own set of parameters. For the mean-field variational family of distributions, the most commonly used approach for obtaining parameters that maximize the ELBO (defined in (3)) is coordinate ascent variational inference (CAVI). In this approach, which has similarities to Gibbs sampling, one optimizes the parameters for each latent variable one at a time while holding the others fixed, resulting in a monotonic increase in the ELBO ([1], [2]). It can be shown ([1]) that for the mean-field variational family, q*_k(z_k), the optimal distribution for q_k(z_k) that maximizes the ELBO (3), satisfies the following:

q*_k(z_k) ∝ exp( E_{−k}[ log p(z_k | z_{−k}, x) ] )     (4)

where z_{−k} denotes all z_i other than z_k and E_{−k} represents expectation with respect to all z_i other than z_k. Thus the optimal variational distribution of latent variable z_k is proportional to the exponentiated expected log of the posterior conditional distribution of z_k given all other latent variables and all the observations. Since p(z_{−k}, x) does not depend on z_k (due to the independence of the latent variables z_i), one can write the above equivalently as

q*_k(z_k) ∝ exp( E_{−k}[ log p(z_k, z_{−k}, x) ] )

For example, if there are three latent variables z_1, z_2 and z_3, the above implies

q*_1(z_1) ∝ exp( E_{z_2, z_3}[ log p(z_1, z_2, z_3, x) ] )

The proposed approach is based on the mean-field family and thus (4) will be instrumental in obtaining the parameters of the posterior distribution.

It is assumed that the available information known at time t is x_t ∈ R^n and this is used to predict some signal y_{t+1} ∈ R that is not known until time t+1. The vector x_t could be a function of observed market data (such as the S&P index, swap rates or equity implied volatilities) and their recent trend, trading volumes, etc.
The optimal choice of the components of x_t could be based on choosing the combination that has the best out-of-sample performance using the proposed algorithm. The goal is to predict the signal y_{t+1} and its distribution at time t+1 from x_t, which comprises information available at time t. As an illustrative example, suppose our goal is to predict the S&P change for date t+1 based on available information up to date t. Here y_{t+1} will be the S&P index change from day t to day t+1. The input x_t, which is based only on information up to time t, could be composed of the prior day change in the S&P index, trading volume deviation from average, implied volatility in recent days, changes in interest rates, etc.

We will assume that there are K clusters, where each cluster can be thought of as a different market regime. For any time t, x_t belongs to one of the K clusters. For each t, c_t ∈ R^K will denote the indicator vector which describes which cluster x_t belongs to. The indicator vector c_t has K−1 elements equal to 0 and one element equal to 1 that corresponds to the cluster assignment of x_t. For example, if x_t is in the second cluster then

c_t = (0, 1, 0, ..., 0)'

We will denote by c_t(k) the k'th element of c_t; as noted above, c_t(k) = 1 for exactly one k ∈ {1, ..., K} and is zero for all other k. For example, if the vector c_t is as above, c_t(2) = 1 and c_t(k) = 0 for all k ≠ 2. We will assume that the mean and variance of the cluster means do not change over time and that the likelihood that the state x_t is in cluster k is π_k (the π_k must sum to one over the K clusters, as at every time x_t is in one of the K clusters). The assumption that the cluster probability π_k does not depend on time t does not imply that the underlying time series must be stationary. For non-stationary data such as equity indices, x_t would be obtained by transforming the non-stationary time series into stationary signals.
For example, for non-stationary equity time series, x_t could be composed of drift terms (such as the change over n days) which are stationary. In other words, the dynamical aspects of the markets (such as momentum and/or mean reversion) would here be captured by appropriate transformations of market data, such as changes over n days in the S&P index or interest rates. The mean of x_t for cluster k, which we denote by µ_k, is not known. It is assumed that for each cluster k, µ_k is normally distributed with mean µ_k0 and variance R_k0. Within each cluster k, x_t is normally distributed around the cluster mean µ_k with variance M, which is the same for all clusters. It is assumed that the explanatory variables x_t have been suitably adjusted for mean and drift terms so that the cluster mean µ_k does not depend on time t.

It is assumed that for each of the K clusters there is a regression vector β_k that linearly maps x_t to the output of interest y_{t+1}. More specifically, if x_t is in cluster k,

y_{t+1} = x_t' β_k + ε_t,   ε_t ∼ N(0, σ²)

Note that there is no constant term to capture the intercept. For simplicity of notation we will assume that a constant such as 1 is included in the vector x_t to allow for a non-zero intercept in the linear expression above. Since we have absorbed the constant 1 into x_t, the corresponding elements of R_k0 (the variance of µ_k) and M (the variance of x_t within cluster k) should be close to zero to reflect the fact that there is little uncertainty about this element of x_t. We represent prior information about the regression vector β_k for cluster k through a Gaussian prior distribution with mean β_k0 and variance Q_k0.

We will make the simplifying assumption that each observation of input-output data (x_t, y_{t+1}) is independent. In the case of daily prediction, this assumption implies that each day provides a new independent observation. This is a simplifying assumption, as there may be some overlap between market data x_t and x_{t+1} (they may incorporate data from the same prior days).
It is possible to drop the independence assumption and extend the VI framework to the case where x_t is the state of a linear dynamical system, as in [10]. However, such an approach leads to greater complexity as one must also estimate the parameters that describe the time evolution of the system. The assumptions described above imply that the observed data (input x_t and output y_{t+1}) is generated by the following model:

• K is the number of clusters of the input space x_t (where x_t ∈ R^n).
• The fraction of times x_t is in cluster k is π_k. Thus the π_k sum to one over all k.
• For cluster k, the mean of x_t is µ_k, which is normally distributed with mean µ_k0 and covariance R_k0. In other words, for cluster k, µ_k ∼ N(µ_k0, R_k0).
• If x_t is in cluster k then x_t is normally distributed with mean µ_k and covariance M. In other words, if c_t(k) = 1 then x_t ∼ N(µ_k, M). Note that while each cluster has a different mean µ_k, we are assuming the same covariance M for x_t in each cluster.
• The regression coefficient for cluster k, which is β_k ∈ R^n, is normally distributed with mean β_k0 and covariance Q_k0. In other words, β_k ∼ N(β_k0, Q_k0).
• If x_t at time t is in cluster k, then the output y_{t+1} that we would like to predict is normally distributed with mean x_t' β_k and standard deviation σ (equivalently, if c_t(k) = 1, then y_{t+1} ∼ N(x_t' β_k, σ)).
• Summary of hyperparameters: π for the cluster probabilities; σ for the standard deviation of exogenous noise in the market change; (µ_k0, R_k0) for the mean and variance of each cluster mean; M for the variance of x_t within a given cluster; (β_k0, Q_k0) for the regression parameter mean and its variance for each cluster.

Generative process: for each time t,
(a) Draw the latent variable c_t ∼ Cat(π), which describes the cluster assignment for time t. c_t is an indicator vector which is 1 in only one of its K elements and zero in the other K−1 elements.
(b) If c_t(k) = 1, draw the input x_t ∼ N(µ_k, M), where µ_k ∼ N(µ_k0, R_k0).
(c) If c_t(k) = 1, draw the output y_{t+1} ∼ N(x_t' β_k, σ), where β_k ∼ N(β_k0, Q_k0).

Let us illustrate the above generative process with an example.
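The generative process above can be sketched as a short simulation. This is a minimal illustration under assumed toy hyperparameters (the values of K, n, T and all prior parameters below are arbitrary choices for demonstration, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy hyperparameters: K clusters, n-dimensional inputs, T time steps
K, n, T = 3, 2, 500
pi = np.array([0.5, 0.3, 0.2])                         # cluster probabilities, sum to 1
mu0 = rng.normal(size=(K, n)); R0 = 0.5 * np.eye(n)    # prior on cluster means
beta0 = rng.normal(size=(K, n)); Q0 = 0.5 * np.eye(n)  # prior on regression vectors
M = 0.1 * np.eye(n)                                     # within-cluster covariance of x_t
sigma = 0.2                                             # noise std of the output

# Draw cluster-level latents once: mu_k ~ N(mu_k0, R_k0), beta_k ~ N(beta_k0, Q_k0)
mu = np.array([rng.multivariate_normal(mu0[k], R0) for k in range(K)])
beta = np.array([rng.multivariate_normal(beta0[k], Q0) for k in range(K)])

# Per-time latents and observations:
# c_t ~ Cat(pi), x_t ~ N(mu_{c_t}, M), y_{t+1} ~ N(x_t' beta_{c_t}, sigma^2)
c = rng.choice(K, size=T, p=pi)
x = np.array([rng.multivariate_normal(mu[c[t]], M) for t in range(T)])
y = np.array([x[t] @ beta[c[t]] + sigma * rng.normal() for t in range(T)])
```

Data simulated this way can be used to sanity-check an inference implementation, since the true cluster assignments and regression vectors are known.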
Consider the case when there are K = 3 clusters for the market data x_t. One can think of the three clusters as three market regimes corresponding to market rallies, sell-offs, and relatively unchanged markets (note that the algorithm identifies the clusters, and they may not have this intuitive interpretation). At any given time, the markets are in one of these three possible states or clusters. For each of the three possible states, the input or market condition as defined by x_t is normally distributed with the mean corresponding to that cluster and variance M. The output y_{t+1} for time t+1 is assumed to be normally distributed with mean x_t' β_k and standard deviation σ.

Given data x_t and y_{t+1} for t ∈ {1, ..., T}, the goal is to obtain posterior distributions for the cluster means µ_k, the regression vectors β_k, and the cluster assignment variables c_t for all data points indexed by time t. We now summarize the above prior information, where π_k, µ_k0, R_k0, β_k0 and Q_k0 for k ∈ {1, ..., K} are hyperparameters for the cluster assignments, means and regression vectors:

p(c_t(k) = 1) = π_k     (5)
µ_k ∼ N(µ_k0, R_k0)     (6)
β_k ∼ N(β_k0, Q_k0)     (7)

Based on the above, x_t and y_{t+1} are assumed to have the following probability densities, where M and σ are hyperparameters:

p(x_t | µ_k, c_t(k) = 1) = N(µ_k, M)     (8)
p(y_{t+1} | x_t, β_k, c_t(k) = 1) = N(x_t' β_k, σ²)     (9)

Let us assume we have T observations of the data available. For ease of communication we will use the following notation:

x := {x_1, ..., x_T},  y := {y_2, ..., y_{T+1}},  β := {β_1, ..., β_K},  µ := {µ_1, ..., µ_K},  c := {c_1, ..., c_T}

With the above generative process, the joint density factorizes as

p(x, y, β, µ, c) = ∏_{k=1}^{K} p(µ_k) p(β_k) ∏_{t=1}^{T} p(c_t) p(x_t | µ, c_t) p(y_{t+1} | x_t, β, c_t)     (12)

The above factorization implies that the posterior p(β, µ, c | x, y) is proportional to the joint density above. For approximating the posterior probability density of the latent variables p(β, µ, c | x, y) by the variational distribution q(β, µ, c), we will consider the following mean-field distribution in which β, µ and c are independent and governed by their own variational parameters (the variational parameters being β̂_k, Q̂_k, µ̂_k, R̂_k and φ_t):

q(β, µ, c) = q_β(β) q_µ(µ) q_c(c)

Please note that the superscript "ˆ" denotes the variational parameters of the corresponding distribution. The mean-field family of distributions above may not contain the true posterior because of the assumption that β, µ and c are independent.
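The factorized joint density above can be written directly as a log-density function, which is often useful for debugging an inference implementation. A minimal sketch, assuming the model of this section; the function and argument names are ours, not the paper's:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn, norm

def log_joint(x, y, c, mu, beta, pi, mu0, R0, beta0, Q0, M, sigma):
    """Log of the factorized joint density p(x, y, beta, mu, c):
    cluster-level priors times per-time assignment, input and output terms."""
    lp = 0.0
    K = len(pi)
    for k in range(K):
        lp += mvn.logpdf(mu[k], mu0[k], R0)      # prior on cluster mean mu_k
        lp += mvn.logpdf(beta[k], beta0[k], Q0)  # prior on regression vector beta_k
    for t in range(len(c)):
        k = c[t]
        lp += np.log(pi[k])                      # cluster assignment c_t
        lp += mvn.logpdf(x[t], mu[k], M)         # x_t | mu_k within cluster k
        lp += norm.logpdf(y[t], x[t] @ beta[k], sigma)  # y_{t+1} | x_t, beta_k
    return lp
```

Because each factor is evaluated in log space, the function remains numerically stable even for long histories where the raw joint density would underflow.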
The independence assumption leads to a much more tractable estimation of the posterior probability density of the latent variables. The distribution q_c, in terms of its variational parameters φ_t, is assumed to be as follows:

q_c(c_t(k) = 1) = φ_t(k)  for t ∈ {1, ..., T} and k ∈ {1, ..., K}     (14)

Without making any assumptions on the family of distributions for q_β and q_µ, it is shown in the Appendix that these are Gaussian distributions of the following form in terms of their variational parameters:

q_µ(µ_k) = N(µ̂_k, R̂_k)     (15)
q_β(β_k) = N(β̂_k, Q̂_k)     (16)

The variational posterior distribution q(β, µ, c), which approximates p(β, µ, c | x, y), is obtained by maximizing the ELBO (3). The variational parameters {β̂_k, Q̂_k, µ̂_k, R̂_k, φ_t(k)} have the following interpretation for the posterior distribution: φ_t(k) is the probability of the k'th cluster assignment for data x_t; the parameters µ̂_k and R̂_k are the mean and variance of the k'th cluster of the input x_t; and β̂_k and Q̂_k are the mean and variance of the regression vector for the k'th cluster. Utilizing (12) and (13), the ELBO function (3) can be expressed as follows, where the expectation is with respect to the variational distribution q(β, µ, c):

ELBO(q) = E[ log p(x, y, β, µ, c) ] − E[ log q(β, µ, c) ]     (17)

The variational parameters {β̂_k, Q̂_k, µ̂_k, R̂_k, φ_t(k)} are to be estimated so that the ELBO function described above is maximized. Recall that the indicator function c_t(k) is 1 if x_t is in cluster k and 0 otherwise. Thus

p(x_t, y_{t+1} | β, µ, c_t) = ∏_{k=1}^{K} [ p(x_t | µ_k, c_t(k) = 1) p(y_{t+1} | x_t, β_k, c_t(k) = 1) ]^{c_t(k)}     (18)

where p(x_t | µ_k, c_t(k) = 1) and p(y_{t+1} | x_t, β_k, c_t(k) = 1) are described in (8) and (9). The variational parameters that maximize the ELBO function in (17) can now be obtained based on (4). It is shown in the Appendix that the optimal variational parameters satisfy the following equations:

R̂_k = ( R_k0^{−1} + ( Σ_{t=1}^{T} φ_t(k) ) M^{−1} )^{−1},   µ̂_k = R̂_k ( R_k0^{−1} µ_k0 + M^{−1} Σ_{t=1}^{T} φ_t(k) x_t )

Q̂_k = ( Q_k0^{−1} + σ^{−2} Σ_{t=1}^{T} φ_t(k) x_t x_t' )^{−1},   β̂_k = Q̂_k ( Q_k0^{−1} β_k0 + σ^{−2} Σ_{t=1}^{T} φ_t(k) y_{t+1} x_t )

φ_t(k) = r_t(k) / Σ_{j=1}^{K} r_t(j)  for k ∈ {1, ..., K} and t ∈ {1, ..., T}     (20)

where r_t(k) is defined as follows:

r_t(k) = π_k exp( −(1/2) (x_t − µ̂_k)' M^{−1} (x_t − µ̂_k) − (1/2) tr(M^{−1} R̂_k) − [ (y_{t+1} − x_t' β̂_k)² + x_t' Q̂_k x_t ] / (2σ²) )     (21)

One of the most commonly used algorithms for obtaining the optimal variational parameters is Coordinate Ascent Variational Inference (CAVI) (see for example [2]). Here parameters are updated one at a time, keeping the other parameters constant.
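The alternating structure of these coordinate updates can be illustrated on a stripped-down version of the model: a univariate Bayesian Gaussian mixture with known unit observation variance and no regression component. This is a simplified sketch of CAVI, not the paper's exact updates; the data and hyperparameters below are toy choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: two well-separated clusters with unit observation variance
x = np.concatenate([rng.normal(-4, 1, 150), rng.normal(4, 1, 150)])
N, K, sigma0_sq = len(x), 2, 10.0   # sigma0_sq: prior variance of the cluster means

# Initialize variational parameters
m = np.array([-1.0, 1.0])           # q(mu_k) means
s_sq = np.ones(K)                   # q(mu_k) variances
phi = np.full((N, K), 1.0 / K)      # q(c_i) cluster responsibilities

for _ in range(50):
    # Coordinate update of q(c): phi_ik ∝ exp( x_i E[mu_k] - E[mu_k^2] / 2 )
    logits = np.outer(x, m) - 0.5 * (s_sq + m**2)
    phi = np.exp(logits - logits.max(axis=1, keepdims=True))
    phi /= phi.sum(axis=1, keepdims=True)
    # Coordinate update of q(mu): Gaussian with precision = prior + responsibilities
    precision = 1.0 / sigma0_sq + phi.sum(axis=0)
    m = (phi * x[:, None]).sum(axis=0) / precision
    s_sq = 1.0 / precision
# m converges to roughly the two cluster centers, near -4 and 4
```

Each pass holds one factor fixed while the other is set to its optimum, which is exactly the mechanism that guarantees a monotonically non-decreasing ELBO.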
The process is repeated until the ELBO converges. The CAVI algorithm for estimating the parameters is described in Algorithm 1. The algorithm may, however, converge to a local maximum, and thus running the algorithm with different initial estimates of the variational parameters can improve the approximated model posterior.

Algorithm 1
Input: Data x_t and y_{t+1} for t ∈ {1, ..., T}
Input: Hyperparameters: number of clusters K; intra-cluster variance M; π_k, µ_k0, R_k0, β_k0 and Q_k0 for k ∈ {1, ..., K}
Initialize the variational parameters β̂_k, Q̂_k, µ̂_k, R̂_k for all k ∈ {1, ..., K} and φ_t(k) for all t ∈ {1, ..., T} and k ∈ {1, ..., K}
while the ELBO has not converged do
    update φ_t(k), (µ̂_k, R̂_k) and (β̂_k, Q̂_k) one at a time using the fixed-point equations above, holding the other parameters fixed
end while

For the following discussion we will denote all the available observations of x_t and y_t up to time t by D_t, i.e.

D_t := {x_1, ..., x_t, y_1, ..., y_t}

We are interested in estimating the probability density of the forecast y_{t+1} based on the available information up to time t. Its computation is based on approximating p(β, µ, c | D_t) by the variational distribution q(β, µ, c). In particular, the density of the forecast is obtained by integrating over the regression vectors and summing over the clusters, using (20) and (16) for q(c_t(k)) and q(β_k) and equation (9) for p(y_{t+1} | x_t, β_k, c_t(k) = 1). Let us define

β̄_k := β̂_k − ∆_k ψ_k  for k = 1, ..., K     (29)

where ∆_k and ψ_k are defined in (27) and (28), and where the second equality in (27) follows from the Matrix Inversion Lemma. Substituting these definitions in (26), using the fact that a Gaussian density integrates to one, and utilizing the definitions of ∆_k and ψ_k in (27) and (28), one obtains after some algebraic manipulations the following expression for the predictive density:

p(y_{t+1} | x_t, D_t) = Σ_{k=1}^{K} q(c_t(k) = 1) [ N( y_{t+1} ; x_t' β̂_k , σ² + x_t' Q̂_k x_t ) ]     (31)

From the predictive density described in (31), one can obtain the expected value of y_{t+1} as well as various confidence levels for the prediction.
Note that the expression (31) is a weighted combination of K Gaussian probability densities, one from each cluster. The term in square brackets is the contribution to the density estimate from each cluster, which is weighted proportionally to the cluster probability. The predictive density within cluster k is centered around x_t' β̂_k and has variance (σ² + x_t' Q̂_k x_t), which increases as Q̂_k (the uncertainty of the k'th regression vector) increases.

To illustrate the performance of the proposed approach, we consider one-day forecasts of two portfolios in both normal and volatile market conditions. In both cases the one-day prediction is based on calibrating the model parameters to the most recent 250 observations (that is, the model is calibrated to roughly the most recent one year of data). The first example is predicting the one-day change in the S&P index. The states used for predicting the next-day S&P change are:

• S&P change on the most recent day
• VIX index change over the most recent five days
• Standard deviation of USD/JPY exchange rate changes over the most recent five days
• The constant 1, to account for a nonzero intercept

Calibration was done after normalizing all inputs and outputs. The first two inputs (recent S&P and VIX changes) and the output (next-day S&P change) were normalized to their z-scores based on the previous 250 observations, where the z-score is defined as (current value − average over the most recent 250 days) / (standard deviation over the most recent 250 days). The third input is the difference between the standard deviation of the JPY/USD exchange rate over the most recent five days and the standard deviation of the JPY/USD exchange rate over the most recent 250 days, with appropriate normalization. For illustrative purposes, only three explanatory variables are included, capturing the momentum effect and recent market volatility, but one could include more variables, such as those involving customer flows.
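The rolling normalization described above can be sketched as follows. This is a minimal illustration that assumes the z-score at time t is computed from the 250 observations strictly before t, so that only past information is used; the function name is ours, not the paper's:

```python
import numpy as np

def rolling_zscore(series, window=250):
    """z-score of each point relative to the trailing `window` observations
    (excluding the current point, so the score uses only past information)."""
    series = np.asarray(series, dtype=float)
    z = np.full(series.shape, np.nan)   # undefined until a full window exists
    for t in range(window, len(series)):
        past = series[t - window:t]
        z[t] = (series[t] - past.mean()) / past.std()
    return z
```

Normalizing inputs and outputs this way keeps the explanatory variables on comparable scales across calm and volatile periods, which also makes the fitted cluster means and regression coefficients easier to compare.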
Figure 1 provides a scatter plot of the z-score of the actual one-day S&P change vs. that predicted using the proposed approach with the input variables x_t defined above. The results are displayed separately for the volatile markets during early 2020 and the more normal markets after that. In both cases it is evident that the predictions provide quite reasonable estimates of the next-day changes. Interestingly, the performance is similar across volatile and normal markets. Another interesting observation about these predictions is that on most of the days when the S&P dropped significantly, the prediction was also for a decrease in the S&P (in other words, there are few data points in the second quadrant, for which the index dropped but the forecast was for the index to increase).

The second portfolio is a mix of equities and bonds and is assumed to be 50% S&P and 50% AGG (a bond ETF). The assumed states for predicting next-day changes for this portfolio are:

• Portfolio value change on the most recent day
• AGG value change over the most recent five days
• Standard deviation of USD/JPY exchange rate changes over the most recent five days
• The constant 1

Figure 2 shows a similar comparison of the one-day prediction vs. actual (in terms of z-score) for this portfolio of half equities and half bonds. As evident from these plots, the predictive performance is similar and quite reasonable for this portfolio as well, in both normal and volatile market conditions. One of the advantages of the proposed approach is that it provides not only the forecast but also the distribution of the forecast. Figure 3 shows the daily changes of the two portfolios compared to the 5th and 95th percentiles of the predictions for the first six months of 2020. As one can observe, not only is the realized change usually within the band from the 5th to the 95th percentile, but the predictive distribution also reacts quickly to changing market conditions.
For example, the predictive distribution becomes wider and the predictive band often moves further from 0 during the period of high market volatility in March 2020, whereas after April 2020 the predictive band narrows once again and is more consistently centered around 0 as markets become less volatile. Table 1 provides a backtesting comparison that can be used to evaluate the accuracy of the predictive distribution of the forecasts. The table contains the percentage of days from January 2020 to November 2021 on which the portfolio change was below the specified percentiles of the predictive distribution. For example, on 5.05% of the days the S&P change was below the 5th percentile of the forecast, while for the portfolio of half S&P and half AGG the corresponding number was 6.46%.

Another advantage of the proposed approach is the interpretability of the results in terms of visualizing market regimes/clusters in terms of the explanatory variables and identifying the most important variables for forecasts in each market regime. The variational parameters of the approximate posterior distribution vary with time, as they are obtained based on the most recent 250 days of observations. Below we describe the cluster means and regression parameters for two days, one from a normal market period and one from a more volatile period. Table 2 describes the cluster means (variable µ̂_k) for November 22, 2021, when markets were calm and stocks were rallying. Table 3 describes the regression parameters (β̂_k) for each of the clusters on this day.
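A percentile backtest of this kind can be sketched in a few lines, given per-day Gaussian-mixture forecast parameters (in the notation of this paper, cluster weights, means x_t' β̂_k, and standard deviations sqrt(σ² + x_t' Q̂_k x_t)). The helper names and the toy mixture values below are ours, used only to check the mechanics:

```python
import numpy as np
from scipy.stats import norm

def mixture_cdf(y, weights, means, stds):
    """CDF at y of a K-component Gaussian mixture (one day's predictive density)."""
    return float(np.sum(weights * norm.cdf(y, loc=means, scale=stds)))

def coverage(realized, forecasts, q):
    """Fraction of days on which the realized change fell below the q-th quantile
    of that day's predictive mixture; close to q indicates good calibration."""
    pit = np.array([mixture_cdf(y, *f) for y, f in zip(realized, forecasts)])
    return float(np.mean(pit < q))

# Toy check: if realized changes are drawn from the forecast mixture itself,
# the empirical coverage should match the nominal level
rng = np.random.default_rng(3)
w, m, s = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([0.5, 0.5])
forecasts = [(w, m, s)] * 4000
comp = rng.choice(2, size=4000, p=w)
realized = rng.normal(m[comp], s[comp])
cov5 = coverage(realized, forecasts, 0.05)
```

Comparing the empirical coverage against the nominal level at several percentiles is exactly the comparison reported in Table 1.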
Table 1: Percentage of days (January 2020 to November 2021) on which the portfolio change was below the given percentile of the forecast distribution

Percentile of forecast    S&P Index    S&P and AGG portfolio
5th                       5.05%        6.46%
25th                      17.37%       22.02%
50th                      50.1%        51.92%
75th                      80%          78.59%
95th                      97.17%       94.75%

Based on Tables 2 and 3, one can draw some qualitative conclusions about the clusters which, though relatively simple, may provide useful insight to practitioners. For example, one may draw the following qualitative conclusions about each of the clusters in terms of a) market conditions, b) important market variables for forecasting in such market conditions, and c) the bias of the expected change for the next day:

Cluster 1
• Market conditions: S&P relatively unchanged and low market volatility (most of the elements of µ̂_1 other than the constant term are relatively small and close to zero).
• Critical variables for forecast: The most important variable for the forecast is the change in implied vol (the largest element of the regression vector β̂_1 is −0.962, corresponding to the five-day VIX change). Increasing implied vol (VIX is a proxy for implied vols) in such market conditions implies a lower S&P forecast for the next day because of the negative sign of this parameter.
• Bias for the next-day S&P change: The last element of the regression vector β̂_1, corresponding to the intercept, is −0.423 (a negative number), and thus the expected S&P change for the next day is negative in cluster 1, assuming the other variables remain small.

Cluster 2
• Market conditions: S&P increasing while the five-day VIX change is negative (the first two elements of µ̂_2 are 0.276 and −0.285).
Thus this cluster corresponds to market conditions where equity prices are increasing marginally, accompanied by a decrease in equity implied vols.
• Critical variables for forecast: None, as most of the elements of the regression vector β̂_2 are small.
• Bias for the next-day S&P change: S&P changes are expected to be small, as the last element of the regression vector β̂_2, corresponding to the intercept, is −0.02, which is close to zero. Thus in this cluster the forecast is for minimal change, as both the intercept term and the other regression parameters are quite small.

Cluster 3
• Market conditions: S&P decreasing and the five-day VIX change is positive (the first two elements of µ̂_3 are −0.379 and 0.407). Thus this cluster corresponds to market conditions where equity prices are falling marginally, accompanied by an increase in equity implied vols.
• Critical variables for forecast: The regression parameters corresponding to the VIX change and the JPY/USD exchange rate standard deviation are positive. This implies that for this cluster, an increase in volatility would likely result in a higher S&P value the next day.
• Bias for the next-day S&P change: Since the last element of the regression vector β̂_3, corresponding to the intercept, is slightly positive (0.171), the S&P is likely to increase marginally the next day, assuming the other variables are close to their average levels.

Note that the predictions have different sensitivities to market variables in different clusters. For example, in cluster 1 an increasing VIX results in a lower S&P forecast for the next day, while in cluster 3 an increase in VIX results in a higher S&P forecast (because the regression parameters associated with the VIX change in the two clusters have opposite signs, −0.962 and 0.388 respectively). The variational parameters will change with market conditions. Table 4 describes the cluster means (variable µ̂_k) for April 3, 2020, when markets were very volatile due to Covid pandemic concerns.
Table 5 describes the regression parameters (β̂_k) for each of the clusters on this day. As before, one can draw some qualitative conclusions about the different market clusters/regimes based on these tables:

Cluster 1
• Market conditions: S&P increase on the most recent day accompanied by a near-term increase in VIX and a highly volatile JPY/USD exchange rate. Such a cluster would be less likely in normal markets, when a rally in equities is typically accompanied by a decrease (rather than an increase) in implied volatilities.
• Critical variables for forecast: An increasing JPY/USD exchange rate standard deviation/volatility would lower the S&P forecast, because the corresponding regression parameter is −0.67. Forecasts increase with a greater recent increase in equities, as the regression parameter is 0.23.
• Bias for the next-day S&P change: The last element of the regression vector β̂_1 is −3.573, which suggests that the S&P is likely to drop significantly the next day.

Cluster 2
• Market conditions: All elements of µ̂_2 other than the constant term are small, suggesting that the markets are stable and normal conditions prevail.
• Critical variables for forecast: None, as most of the elements of the regression vector β̂_2 are small.
• Bias for the next-day S&P change: S&P changes are expected to be small, as the last element of the regression vector β̂_2, corresponding to the intercept, is close to zero (0.033).

Cluster 3
• Market conditions: Large drop in S&P accompanied by a large five-day increase in VIX (the first two elements of µ̂_3 are −3.85 and 4.176). Thus this cluster corresponds to market conditions when there is a sharp sell-off in equity markets accompanied by a large increase in equity implied vols.
• Critical variables for forecast: An increase in the JPY/USD exchange rate standard deviation would result in a higher forecast for the S&P (the corresponding regression parameter is 0.917).
• Bias for the next-day S&P change: Since the last element of the regression vector β̂_3, corresponding to the intercept, is 0.875, the S&P is expected to increase slightly the next day, assuming the other variables are close to their average levels.

In this paper, an approach based on variational inference is proposed for financial forecasting. In the proposed approach, clusters of market variables are identified such that the predicted output is a linear function of the market variables, where the linear function depends on the cluster. One advantage of the proposed approach is that it identifies market regimes in which future changes share a similar dependence on market variables. It is shown that a fairly simple model with only three explanatory variables provides useful predictions for a couple of index portfolios. Another advantage of the proposed approach is that it also provides a fairly accurate predictive density of the forecast. The predictive density of the forecast can be useful in financial risk management, since it can produce estimates of risk measures such as VaR (Value at Risk) that adapt quickly to changing market conditions.

References
[1] Pattern Recognition and Machine Learning.
[2] Variational Inference: A Review for Statisticians.
[3] High-dimensional regression with Gaussian mixtures and partially-latent response variables.
[4] Variational Bayesian inference for linear and logistic regression.
[5] The use of data mining and neural networks for forecasting stock market returns.
[6] Sampling based approaches to calculating marginal densities.
[7] Stock Market Forecasting Using Hidden Markov Model: A New Approach. Proceedings of the 2005 5th International Conference on Intelligent Systems Design and Applications (ISDA'05).
[8] Financial time series forecasting using support vector machines.
[9] Predicting the direction of stock market prices using random forest.
[10] Variational inference and learning of piecewise-linear dynamical systems.
[11] Short-term stock market price trend prediction using a comprehensive deep learning system.
[12] Variational approximation for mixtures of linear mixed models.

Appendix

We now derive the parameters of the variational distribution based on (4). First let us consider the variational density of the cluster mean µ_k. From (4),

q*_µ(µ_k) ∝ exp( E_{−µ_k}[ log p(β, µ, c, x, y) ] )

where E_{−µ_k} is expectation with respect to {β_1, ..., β_K}, {c_i, 1 ≤ i ≤ T} and {µ_i, i ≠ k}. From (6), (8), (18), and the variational parameters for q_c in (14), one notes that