Deep Switching State Space Model (DS$^3$M) for Nonlinear Time Series Forecasting with Regime Switching
Xiuqin Xu, Ying Chen
2021-06-04

Abstract. We propose a deep switching state space model (DS$^3$M) for efficient inference and forecasting of nonlinear time series that switch irregularly among various regimes. The switching among regimes is captured by both discrete and continuous latent variables with recurrent neural networks. The model is estimated with variational inference using a reparameterization trick. We test the approach on a variety of simulated and real datasets. In all cases, DS$^3$M achieves competitive performance compared to several state-of-the-art methods (e.g. GRU, SRNN, DSARF, SNLDS), with superior forecasting accuracy, convincing interpretability of the discrete latent variables, and powerful representation of the continuous latent variables for different kinds of time series. Specifically, the MAPE values improve by 0.09% to 15.71% against the second-best performing alternative models.

Regime switching models for nonlinear time series are of interest in many contexts including, but not limited to, physics, medicine, transportation, energy and economics. Developing efficient and interpretable inference algorithms for the dynamic nonlinear behavior of such time series, let alone forecasting them, has long been considered difficult, owing to several notorious challenges: complex stochastic dependence, irregular dynamic switching, and relatively small sample sizes. It is natural to allow for the existence of different states or regimes in a time series, and to let the dynamics switch among these regimes. The switching state space models (switching SSMs) are arguably the most popular of such models, in which the evolution of the time series is assumed to be driven by hidden factors switching among discrete regimes, see [1, 2, 3, 4, 5, 6]. The switching SSM is a generalization of the Hidden Markov Model (HMM) and the State Space Model (SSM), where the dynamics in each regime is usually represented by a simple linear model that can be estimated efficiently even with a small sample size [7], and the switching among regimes is controlled by the hidden transition probabilities of a Markov process. For example, a Linear Gaussian State Space Model (LGSSM) can model the short-term dynamics of the data, while an HMM describes the longer-term regime changes and controls the LGSSM. By combining local linear models for different regimes, the resulting model can approximate globally nonlinear behavior while retaining interpretability. For complex dependence where the stage-wise linear structure is insufficient, SSMs can be customized with pre-specified nonlinear transition and/or emission functions, e.g. the RSSSM [8], where Taylor approximation is often used for inference [9, 10, 11]. Despite their popularity, these existing nonlinear models rely on pre-specified forms with simple structures, linear or nonlinear, which can be too restrictive to describe the nonlinear patterns in real-world data. In contrast, deep learning, and especially recurrent neural networks (RNNs), have emerged as the new standard for modeling nonlinear time series with highly complex dependence.
Gated architectures such as the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), which alleviate the vanishing gradient problem during training, have been developed for a broad spectrum of time series applications, see [12, 13]. Transformers [14] and temporal convolution networks [15] have also been applied to nonlinear time series forecasting. However, deep learning methods are known as black-box tools for identifying patterns, which makes interpretation difficult. Moreover, the classic models specify deterministic functions to link, e.g., the previous hidden states and the current observations. The only randomness allowed appears in the conditional output probability model, with either a simple unimodal distribution, e.g. Gaussian in [16], or a mixture of simple unimodal distributions, e.g. Gaussian mixture models in [17]. This ignores the presence of stochastic signals in the dynamic system. Though unobserved, stochastic signals continuously influence the evolution of a time series, in addition to the random noise. For example, the unemployment rate not only depends on economic status, e.g. boom or recession, but is also influenced by latent variables such as the elasticity of regional wage levels, which in turn vary with economic status. It is ambitious to expect a model with deterministic states to capture the stochastic behavior of nonlinear time series with non-stationary transitions. As a remedy, such a model needs a large number of parameters to ensure reasonable modeling accuracy, which in turn requires a large sample size for consistent estimation. Unfortunately, in many disciplines the amount of real-world time series data is not that large. This advocates the integration of deep learning and stochastic latent variable models to leverage their complementary strengths of nonlinear representation and interpretability [18, 19, 20]. A variety of deep SSMs have emerged for time series forecasting, in which continuous Gaussian latent variables are introduced at each time step in a sequence of variational auto-encoders [21, 22, 23, 24]. We refer to [25] for a systematic review. Though powerful, it is challenging to understand how the dynamics evolve without knowledge of discrete regimes. This has stimulated the development of deep SSMs that incorporate discrete latent variables for interpretable inference of nonlinear time series with regime switching, see SVAE [26], DSARF [27] and SNLDS [28], which have both discrete and continuous latent variables, and rHSMM [29] and HSMM [30], which have two sets of discrete latent variables. The Markov assumption of the SSM is commonly retained in the integration with deep learning, because it simplifies the presentation and many problems indeed follow Markov dynamics. Several works have extended the Markov dynamics of the discrete latent variables by letting them depend on previous continuous latent variables [31, 32, 33] or on the last observations [28]. The relevant information from the past beyond one time period acts as a disturbance to the switching dynamics. However, this can lead to unnecessarily frequent state shifts in the estimated discrete latent variables, making interpretation difficult. There is a need for a well-designed architecture that allows the disturbance to be reasonably represented in the complex regime switching structure without leading to overly frequent switching.
In this work, we keep a Markov dynamic for the discrete latent variable and push the non-Markov dynamics into the continuous latent variables, which also depend on the hidden states of a recurrent neural network that summarizes the information coming from the past. The main contribution of our paper is a new modeling framework for interpretable and efficient inference and forecasting of nonlinear time series that switch irregularly among various regimes. We name it DS$^3$M, i.e. deep switching state space model. We allow a flexible and realistic modeling framework, where the switching among regimes is modeled by both discrete and continuous latent variables with recurrent neural networks. Specifically, the nonlinear evolution is driven by both the discrete and continuous latent variables. The discrete latent variables, representing the unknown regimes, are designed to influence both the observed time series and the values of the continuous latent variables, i.e. some unobserved driving factors. In other words, regime switching has an impact directly on the time series and indirectly via its influence on the continuous latent variables. Moreover, the current hidden state, which contains all relevant information from the past, is designed to have an impact on the time series too. We design an approximate variational inference algorithm that can scale to large datasets. The key idea is to marginalize out the discrete latent variables only at each time step, and then use a reparameterization trick on the continuous latent variables. It should be highlighted that the DS$^3$M differs from existing methods in several aspects. Above all, it incorporates both discrete and continuous latent variables in a deep switching state space model, which allows a comprehensive understanding of the joint impact of regimes and stochastic signals on the evolution of the time series. Secondly, the architecture is designed to reflect the inherent dependence structure of nonlinear time series with regime switching. This is particularly useful for real problems; with only one type of latent variable, a model would fail to provide a meaningful interpretation. Lastly, we extend the SSMs with hidden variables representing more than one-step temporal dependence. As such, the non-Markov problem can still be handled within the Markov framework. The proposed model leverages the interpretability of discrete latent variables, the powerful representation ability of continuous latent variables, and the nonlinearity of deep learning models. We test the approach on a variety of simulated and real datasets. In all cases, DS$^3$M achieves competitive performance compared to several state-of-the-art methods (e.g. SRNN, GRU, DSARF, SNLDS), with superior predictive accuracy (the MAPE improves by 0.09% to 15.71% against the best performing alternatives), convincing interpretability of the discrete latent variables, and powerful representation of the continuous latent variables for different kinds of time series. The duration of the learned regimes is longer than that of the alternatives and is more consistent with the empirical facts. The rest of the paper is organized as follows. Section 2 reviews the related work including the RNN, the SSM, and the switching SSM. Section 3 details the proposed DS$^3$M and the scalable inference algorithm. Section 4 presents the numerical performance of the DS$^3$M on several simulated and real-world datasets. Section 5 concludes.
Denote a time series of $T$ observations as $y_{1:T} = \{y_1, y_2, \cdots, y_T\}$, $y_t \in \mathbb{R}^D$, and a sequence of inputs as $x_{1:T} = \{x_1, x_2, \cdots, x_T\}$, $x_t \in \mathbb{R}^U$. In the setting of time series forecasting, $x_t$ can be one or multiple lagged values of the time series, e.g. $y_{t-1}$ and higher orders $y_{t-2}, y_{t-3}, \cdots$. The inputs $x_t$ may also contain exogenous variables. We are interested in modeling $p(y_{1:T} | x_{1:T})$.

The RNNs introduce hidden states, denoted by $h_t \in \mathbb{R}^H$, to encode the information coming from the past inputs $x_{1:t}$ with a deterministic nonlinear function
$$h_t = f(h_{t-1}, x_t),$$
where the function $f$ is commonly chosen as the LSTM or the GRU.

In the class of switching SSMs, both continuous and discrete latent variables are introduced, based on which the dynamics are decomposed into regimes. The simplest form is the switching linear dynamical system (SLDS), where the dynamics of each regime are explained by a linear state space model. The discrete latent variables, denoted as $d_t \in \{1, 2, \cdots, K\}$ at each time step $t = 1, 2, \cdots, T$, follow a Markov process. In particular, $d_t | d_{t-1}$ is assumed to follow a transition matrix $\Gamma \in \mathbb{R}^{K \times K}$, where $\Gamma_{ij} = p(d_t = j | d_{t-1} = i)$. The discrete latent variables $d_t$ have an impact on both the continuous latent variables $z_t \in \mathbb{R}^Z$ and the observable data $y_t$:
$$y_t = W^{(d_t)} z_t + b^{(d_t)} + \epsilon_t, \quad (1)$$
$$z_t = A^{(d_t)} z_{t-1} + c^{(d_t)} + e_t, \quad (2)$$
where the parameters depend on the state of $d_t$, and $W^{(k)}, b^{(k)}, A^{(k)}, c^{(k)}$ collect the emission and transition parameters of regime $k$. The emission function (1) specifies the dynamics of the observed time series, given the state of the latent variables at time $t$. The transition function (2) determines the evolution of the latent variable. The emission noise $\epsilon_t$ and the transition noise $e_t$ are Gaussian distributed. When $K = 1$, the model is also termed the Linear Gaussian State Space Model (LGSSM).

There have been several extensions to the SLDS, which is essentially a piecewise linear model. To model complex nonlinear structures, the RSSSM [8] adopts a pre-specified nonlinear transition function for the latent variables and uses the extended Kalman filter for estimation. The SVAE model [26] parametrizes the emission function by neural networks, while the transition function remains linear. The SNLDS model [28] parametrizes both the emission and transition functions with nonlinear neural networks. The DSARF [27] approximates high-dimensional time series with the dynamics of factors and weights that are guided by the switching latent variables. The discrete switching variables in the SLDS are assumed to be Markov, i.e. $d_t$ depends on $d_{t-1}$ only. The rSLDS (recurrent SLDS) extends the open-loop Markov dynamics, making $d_t$ depend on the hidden state $z_{t-1}$ [31]. [32] imposed a tree structure prior on the switching variables of the rSLDS, so that the dynamics of the switching variables behave similarly within the same subtrees. In addition to Gibbs sampling, [33] used variational inference for the rSLDS and approximated the discrete variables with a continuous relaxation, see [34, 35]. A common question in switching models is how to decide the number of switching states. [6] proposed a hierarchical Dirichlet process prior on the switching variables. [36] directly outputs the parameters of the SSM at each time step via an RNN, which can be viewed as a switching SSM with an infinite number of switching states. Most works, including this one, assume that the number of states is fixed with prior knowledge. The discrete latent variable can also be modeled as a semi-Markov process, where the duration of each state is controlled by another discrete random variable, see [30, 29, 37].
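To make the SLDS concrete, the following minimal sketch simulates a two-regime SLDS in Python. The transition matrix, regime-specific coefficients and noise scales are illustrative choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-regime SLDS: d_t follows a Markov chain,
# z_t and y_t follow regime-specific linear Gaussian dynamics (1)-(2).
Gamma = np.array([[0.95, 0.05],      # p(d_t = j | d_{t-1} = i)
                  [0.10, 0.90]])
A = [0.9, 0.5]                       # transition coefficients per regime
W = [1.0, 2.0]                       # emission coefficients per regime
sigma_e, sigma_eps = 0.1, 0.2        # transition / emission noise std

T = 200
d = np.zeros(T, dtype=int)
z = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    d[t] = rng.choice(2, p=Gamma[d[t - 1]])                        # discrete regime
    z[t] = A[d[t]] * z[t - 1] + sigma_e * rng.standard_normal()    # continuous latent state
    y[t] = W[d[t]] * z[t] + sigma_eps * rng.standard_normal()      # observation
```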
In this section, we introduce the DS$^3$M by extending the linear switching SSM with neural network models. We also present the inference method and derive the predictive distributions.

Generative network. The generating procedure of the DS$^3$M contains four steps. At time step $t$, a forward RNN is firstly used to process the input data:
$$h_t = f_{\theta_h}(h_{t-1}, x_t).$$
Secondly, the discrete latent variable $d_t \in \{1, \cdots, K\}$ is drawn from a Markov transition probability $p_\theta(d_t | d_{t-1})$ with transition matrix $\Gamma \in \mathbb{R}^{K \times K}$. Here $K$ refers to the number of regime states. Thirdly, the continuous latent variable $z_t$ transits with a Gaussian process determined by $d_t$:
$$z_t | z_{t-1}, d_t, h_t \sim N\big(\mu_t^{(d_t)}, \Sigma_t^{(d_t)}\big),$$
where the mean $\mu_t^{(d_t)}$ and variance $\Sigma_t^{(d_t)}$ are parameterized by neural networks taking $z_{t-1}$ and $h_t$ as inputs. Finally, the time series $y_t$ is modeled as
$$y_t \sim \pi\big(\Phi_o(z_t, d_t, h_t)\big),$$
where $\pi$ represents the output probabilistic model with parameter $\Phi_o$. The choice of $\pi$ depends on the stochastic behavior of the observed time series, e.g. Gaussian for data with a bell shape or lognormal for asymmetric data. The joint probability is represented as
$$p_\theta(y_{1:T}, z_{1:T}, d_{1:T} | x_{1:T}) = \prod_{t=1}^{T} p_\theta(y_t | z_t, d_t, h_t)\, p_\theta(z_t | z_{t-1}, d_t, h_t)\, p_\theta(d_t | d_{t-1}). \quad (3)$$
Given the nonlinearity introduced by the neural networks, it is intractable to obtain the likelihood of the observations, denoted by $L(\theta)$, by averaging out $z_{1:T}$ and $d_{1:T}$ in (3). In other words, the maximum likelihood method is not useful. Thus we develop a scalable learning and inference algorithm for the DS$^3$M using variational inference. Figure 1a displays a graphical representation of the DS$^3$M.

It is important to stress the key differences between the DS$^3$M and the state-of-the-art SNLDS. We stack an RNN below the SSSM and design a direct connection of the hidden state $h_t$ to the time series $y_t$, inspired by the skip connections in ResNet [38], Transformers [39] and SRNN [22]. From a modeling aspect, a lack of this connection would force the continuous latent variable $z_t$ to encode all the relevant continuous information, including the latent driving factors, the disturbance from the past, and others. The connection between $y_t$ and $h_t$, on the other hand, allows a clear structure, where the deterministic hidden states and the stochastic latent variables can separately encode different aspects of information.

The ELBO is tight, i.e. $L(\theta) = \mathrm{ELBO}(\theta, \phi)$, only when $q_\phi(z_{1:T}, d_{1:T} | y_{1:T}, x_{1:T})$ is equal to the true posterior $p_\theta(z_{1:T}, d_{1:T} | y_{1:T}, x_{1:T})$, which is unfortunately intractable. To achieve a tight ELBO, we consider the following factorization derived from d-separation [40]:
$$q_\phi(z_{1:T}, d_{1:T} | y_{1:T}, x_{1:T}) = \prod_{t=1}^{T} q_\phi(z_t, d_t | z_{t-1}, d_{t-1}, y_{t:T}, h_{t:T}),$$
where the posterior of $z_t, d_t$ depends on the past information encoded in $\{z_{t-1}, d_{t-1}\}$ as well as the future information in $\{y_{t:T}, h_{t:T}\}$, see the generative network in Figure 1a. The inference is designed to use the information from all time steps to approximate the posterior at each time step $t$:
$$q_\phi(z_t, d_t | z_{t-1}, d_{t-1}, A_t) = q_{\phi_z}(z_t | z_{t-1}, d_t, A_t)\, q_{\phi_d}(d_t | d_{t-1}, A_t), \quad (4)$$
where $A_t = g_{\phi_A}(A_{t+1}, [y_t, h_t])$ and $\phi = \{\phi_z, \phi_d, \phi_A\}$. We parameterize $g_{\phi_A}$ as a backward RNN and $q_{\phi_z}(z_t | z_{t-1}, d_t, A_t)$ with a Gaussian probabilistic density:
$$q_{\phi_z}(z_t | z_{t-1}, d_t = k, A_t) = N\big(\mu_t^{(k)}, \Sigma_t^{(k)}\big), \quad (5)$$
where $z_t$ is a Gaussian variable with mean $\mu_t^{(k)}$ and variance $\Sigma_t^{(k)}$ determined by neural network models $g^{(k)}_{\phi_z}$ taking $z_{t-1}$ and $A_t$ as inputs. The graphical model of the inference network is shown in Figure 1b. In addition, $q_{\phi_d}(d_t | d_{t-1}, A_t)$ is parameterized by a neural network with a softmax output over the $K$ states. With the defined approximate posterior, the ELBO can be derived as
$$\mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{q_\phi}\big[\log p_\theta(y_{1:T}, z_{1:T}, d_{1:T} | x_{1:T}) - \log q_\phi(z_{1:T}, d_{1:T} | y_{1:T}, x_{1:T})\big], \quad (6)$$
where the per-time-step expectations are taken under the marginal posteriors $q^*_\phi(z_t, d_t) = \int q_\phi(z_{1:t}, d_{1:t} | y_{1:T}, x_{1:T})\, dz_{1:t-1}\, dd_{1:t-1}$ and $q^*_\phi(d_t) = \int q_{\phi_d}(d_{1:t} | y_{1:T}, x_{1:T})\, dd_{1:t-1}$. We approximate the ELBO using a Monte Carlo method. Specifically, we sample $(z^{(s)}_t, d^{(s)}_t)$ for $t = 1, \cdots, T$ from $q^*_\phi(z_t, d_t)$ using ancestral sampling according to (4) and approximate the ELBO in (6) with the resulting samples.

Algorithm 1 (structured inference for the DS$^3$M):
while not converged do
1. Run the forward RNN to obtain $h_{1:T}$ and the backward RNN to obtain $A_{T:1}$
2. Sample $(z^{(s)}_t, d^{(s)}_t)$ from $q^*_\phi(z_t, d_t)$ for $t = 1, 2, \cdots, T$ sequentially according to (4) to approximate the ELBO in (6)
3. Derive $\nabla_\theta \mathrm{ELBO}(\theta, \phi)$ and $\nabla_\phi \mathrm{ELBO}(\theta, \phi)$
4. Update $\theta^{(\mathrm{Iter})}, \phi^{(\mathrm{Iter})}$ using ADAM, set Iter = Iter + 1
end while
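To make the four-step generative procedure above concrete, the following PyTorch-style sketch performs one unbatched generative step. The module names (`gru_cell`, `trans_net`, `emis_net`), the use of a learned transition-logit matrix, and the Gaussian emission are illustrative assumptions; the exact parameterization in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DS3MGenerativeStep(nn.Module):
    """Illustrative, unbatched one-step generative pass: h_t -> d_t -> z_t -> y_t."""
    def __init__(self, x_dim, y_dim, z_dim, h_dim, K):
        super().__init__()
        self.K = K
        self.gru_cell = nn.GRUCell(x_dim, h_dim)                   # forward RNN on the inputs
        self.trans_logits = nn.Parameter(torch.zeros(K, K))        # Markov transition of d_t
        self.trans_net = nn.ModuleList(                            # regime-specific z dynamics
            [nn.Linear(z_dim + h_dim, 2 * z_dim) for _ in range(K)])
        self.emis_net = nn.Linear(z_dim + K + h_dim, 2 * y_dim)    # Gaussian emission for y_t

    def forward(self, x_t, h_prev, d_prev, z_prev):
        h_t = self.gru_cell(x_t.unsqueeze(0), h_prev.unsqueeze(0)).squeeze(0)
        probs = torch.softmax(self.trans_logits[d_prev], dim=-1)
        d_t = int(torch.distributions.Categorical(probs).sample())          # sample regime
        mu_z, log_var_z = self.trans_net[d_t](torch.cat([z_prev, h_t])).chunk(2)
        z_t = mu_z + torch.exp(0.5 * log_var_z) * torch.randn_like(mu_z)     # continuous latent
        d_onehot = F.one_hot(torch.tensor(d_t), self.K).float()
        mu_y, log_var_y = self.emis_net(torch.cat([z_t, d_onehot, h_t])).chunk(2)
        y_t = mu_y + torch.exp(0.5 * log_var_y) * torch.randn_like(mu_y)     # observation
        return h_t, d_t, z_t, y_t
```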
It is easy to obtain $\nabla_\theta \mathrm{ELBO}(\theta, \phi)$, while $\nabla_\phi \mathrm{ELBO}(\theta, \phi)$ is complicated because $\phi$ also appears in the expectation. The score function gradient estimator [41] can be used to approximate the gradient, but the resulting estimates suffer from high variance. Thus the reparameterization approach [19, 18] is often used to obtain an estimator with low variance. We apply the reparameterization approach to the continuous latent variable $z_t$ in the ELBO. Specifically, we generate a sample $\epsilon_t \sim N(0, I)$ and then use $\mu_t + \Sigma_t^{1/2} \epsilon_t$ as a sample of $z_t \sim N(\mu_t, \Sigma_t)$ in the above Monte Carlo approximation. The gradients can then be backpropagated through the continuous random variables. One cannot use the Gumbel-softmax reparameterization trick for the discrete latent variables, as non-integer values of $d_t$ are invalid for our generative model. Also, [28] shows that using the Gumbel-softmax reparameterization trick reduces the benefit of discrete latent variables. As an alternative approach, in (6) we marginalize out the discrete variable $d_t$ with a summation over its probability at each time step $t$, and do not marginalize out the discrete variables before time $t$. While fast to learn, this may introduce a biased gradient estimator for $\phi_d$ and $\phi_z$, as it can be viewed as a gradient clip where the gradients from the previous time steps are ignored. In our experiments, such an approximation performs very well compared to the unbiased score function estimator, and we consider the bias arguably ignorable. We do not observe posterior collapse for the discrete latent variables, which suggests that the gradients backpropagated through time vanish quickly, so that the gradients contributed by the previous few time steps do not play an important role. On the other hand, gradient explosion is also mitigated by such an approximation, making the gradients more stable. A summary of the structured inference algorithm is given in Algorithm 1.

It is worth mentioning that the SNLDS marginalizes the discrete latent variables using the exact posterior derived with the forward-backward algorithm, while the DS$^3$M marginalizes the discrete latent variables using the approximate posterior at each time step. Although the latter introduces a slightly biased gradient estimator for the ELBO, it speeds up inference by avoiding the forward-backward algorithm. Moreover, in the SNLDS the approximate posterior for $z_t$ does not depend on $d_t$, which can lead to a severe posterior collapse problem where $d_t$ is not used at all. Thus an entropy regularizer has to be introduced in the SNLDS to encourage an evenly distributed posterior for $d_t$. In contrast, our posterior for $z_t$ depends on $d_t$, and the posterior collapse problem for the discrete latent variables does not appear in our experiments. Lastly, the focus of the SNLDS is the segmentation of time series, i.e. identifying the regimes (in-sample inference), while our focus is on the prediction task as well.

Predictive distributions. Given a trained model, one can approximate the predictive distributions for the future values of the time series $y_{T+1}$, the discrete latent $d_{T+1}$ and the continuous latent $z_{T+1}$ by generating samples $\{z^{(s)}_{T+1}, d^{(s)}_{T+1}\}$, where $s = 1, \cdots, S$ and $S$ represents the number of Monte Carlo samples. The predictive distribution of the observations, for example, can be obtained as
$$p(y_{T+1} | y_{1:T}, x_{1:T+1}) \approx \frac{1}{S} \sum_{s=1}^{S} p_\theta\big(y_{T+1} | z^{(s)}_{T+1}, d^{(s)}_{T+1}, h_{T+1}\big),$$
with analogous Monte Carlo averages for $d_{T+1}$ and $z_{T+1}$.

In this section, we evaluate the DS$^3$M with several datasets.
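As a sketch of the two estimator ingredients discussed above, the snippet below draws a reparameterized sample of $z_t$ and marginalizes the discrete variable over its $K$ posterior probabilities at a single time step. The tensor names and the shape of the per-step term are illustrative assumptions; the exact ELBO decomposition in the paper is not reproduced here.

```python
import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * eps, so gradients flow through mu and log_var."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def step_elbo_term(q_d_probs, log_lik_per_regime, kl_z_per_regime, kl_d):
    """Marginalize d_t at one time step instead of sampling it:
    sum_k q(d_t = k) * [log p(y_t | z_t, d_t = k, h_t) - KL_z(k)] - KL_d.
    Inputs are illustrative tensors of shape (K,) except kl_d (scalar)."""
    return (q_d_probs * (log_lik_per_regime - kl_z_per_regime)).sum() - kl_d

# illustrative usage with K = 2 regimes
mu = torch.zeros(3, requires_grad=True)
log_var = torch.zeros(3, requires_grad=True)
z_t = reparameterize(mu, log_var)                      # differentiable sample of z_t
term = step_elbo_term(torch.tensor([0.7, 0.3]),
                      torch.tensor([-1.2, -2.5]),
                      torch.tensor([0.1, 0.4]),
                      torch.tensor(0.05))
```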
We first consider a simulated 1-d time series whose true dynamics follow a nonlinear switching state space model, and a simulated 10-d time series based on the Lorenz attractor. We further apply the DS$^3$M to several real-world datasets covering a variety of applications such as health care, transportation, energy and econometrics. Both the simulations and the real data analysis demonstrate that the DS$^3$M captures the switching regimes well and achieves competitive prediction accuracy compared with several state-of-the-art methods, including GRU, SRNN [22], DSARF [27] and SNLDS [28]. Specifically, SRNN can be considered as our model without discrete latent variables. DSARF and SNLDS are two recently proposed nonlinear dynamic latent variable models for time series that have both continuous and discrete latent variables. DSARF has been shown to outperform several models such as rSLDS, SLDS, BTMF, TRMF and RKN for time series forecasting, while SNLDS has been shown to be better at time series segmentation than rSLDS, SVAE, KVAE and CompILE. Thus, we omit comparisons with the other models and present DSARF and SNLDS only. For a fair comparison, we select the same datasets used in the original papers of DSARF and SNLDS when applicable. The implementation is done with their official codes. Details of the hyperparameters are provided in the Appendix.

Toy example. For the toy example, we simulated data of length 2000 from a nonlinear switching state space model in which the switching indicator $d_t$ controls both the dynamics of the continuous latent variable $z_t$ and the observation $y_t$. By design, $y_t$ is much more volatile when $d_t = 0$ than when $d_t = 1$. We transform the time series into subsequences of length 20, resulting in 1980 subsequences. The first 1000, the following 480, and the last 500 subsequences are used for training, validation and testing, respectively. We set $x_t = y_{t-1}$. Figure 2a displays the one-step-ahead forecasting results (one experiment run) of the DS$^3$M for the testing data, as well as the predicted switching indicators of DS$^3$M, SNLDS and DSARF. It shows that the predicted means of the observations trace the true observations well and the 90% confidence intervals cover most of the observations. The model also succeeds in providing relatively wider confidence intervals when the data are more volatile and narrower ones when the data are more stable. The learned transition matrix is [0.91, 0.09; 0.18, 0.82], which is close to the true transition matrix. A summary of the forecasting and inference accuracy over five experiment runs is provided in Table 1. The DS$^3$M achieves a smaller forecasting RMSE for the observations (a relative improvement of 8.46% and 0.67% compared to SNLDS and DSARF, respectively), higher prediction accuracy (a relative improvement of 43.08% and 1.59%) and a higher F1 score (a relative improvement of 39.95% and 1.43%) for the switching indicators. It can be observed in Figure 2a that the DS$^3$M provides much more reliable predictions for the switching indicators, while SNLDS and DSARF tend to switch overly frequently. The mean duration lengths of the two statuses are 8.211 and 8.296; although these are smaller than the true values, i.e. 24 and 24, they are much better than those of the other two models, whose duration lengths are around 1-4. When used to segment time series (inference), the DS$^3$M also delivers better accuracy and F1 scores than SNLDS and DSARF.
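The subsequence construction described above can be sketched as follows. The window length and split sizes follow the text; the use of a random stand-in series and the input/target slicing are illustrative.

```python
import numpy as np

def make_subsequences(y, window=20):
    """Slice a univariate series into overlapping subsequences of a given length."""
    return np.stack([y[i:i + window] for i in range(len(y) - window)])

y = np.random.randn(2000)                  # stand-in for the simulated series
subs = make_subsequences(y, window=20)     # 1980 subsequences of length 20
train, val, test = subs[:1000], subs[1000:1480], subs[1480:]
x_inputs = subs[:, :-1]                    # x_t = y_{t-1} as the model input
targets = subs[:, 1:]
```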
Lorenz attractor. The Lorenz attractor is a canonical nonlinear dynamical system whose state $z_t = [z_{t,1}, z_{t,2}, z_{t,3}]$ follows the nonlinear dynamics
$$\dot{z}_1 = \sigma (z_2 - z_1), \qquad \dot{z}_2 = z_1 (\rho - z_3) - z_2, \qquad \dot{z}_3 = z_1 z_2 - \beta z_3.$$
The variable $z_t = [z_{t,1}, z_{t,2}, z_{t,3}]^T$ is treated as a latent variable and thus is unobservable. In the simulation, we considered a 10-dimensional time series observed from the latent Lorenz states. The same dataset was used in [27]. Similar to the toy example, we set $x_t = y_{t-1}$. The traces of the Lorenz attractor can roughly be separated into two ellipses. We simulated a time series of length 3000 and transformed it into subsequences of length 5, resulting in 2990 subsequences. The first 1000, the following 990, and the last 1000 subsequences are used for training, validation and testing, respectively. The forecasted switching variables of the DS$^3$M are shown in Figure 2b. The model successfully separates the two ellipses with a forecasting accuracy of 0.882 ± 0.079 (a relative improvement of 43.14% and 11.99% compared with SNLDS and DSARF, respectively) and an F1 score of 0.837 ± 0.127 (a relative improvement of 39.50% and 8.11%), see Table 1. For the forecasting accuracy of the observations, the DS$^3$M has a smaller RMSE and MAPE than SNLDS, but does not beat DSARF. This is because DSARF corresponds to the true data generating process of this simulated dataset. For the segmentation task (inference), the DS$^3$M also achieves the highest accuracy, 0.911 ± 0.068 (a relative improvement of 22.51% and 15.49%), and the best F1 score, 0.883 ± 0.103 (a relative improvement of 29.97% and 16.06%).

We further evaluated the performance of the DS$^3$M on several real-world datasets. The Sleep Apnea dataset is a public physiological dataset from a patient diagnosed with sleep apnea, a medical condition in which patients intermittently stop breathing during sleep. The respiration pattern in sleep apnea can be characterized by at least two regimes: no breathing and gasping breathing induced by reflex arousal. We use the same separation of training and testing data as in [5] and [27]. For Apnea, the DS$^3$M separates the time series into two regimes (blue and red). The status with consistent red color mostly corresponds to periods when the patient has no or weak breathing, while the blue color corresponds to periods when the patient gasps for breath.

The Hangzhou Metro dataset consists of the incoming passenger flow of 80 metro stations in Hangzhou, China from January 1 to January 25, 2019 [27, 43]. The passenger flow data have a temporal resolution of 10 minutes during the service hours, i.e. 108 points per day. The last 5 days are used for testing. The predictions from the DS$^3$M for Hangzhou station 0 and station 40 are shown in the figures in the Appendix. They show that the DS$^3$M model helps to automatically segment the time series into peak hours and non-peak hours without supervision.

The Seattle Traffic dataset contains the traffic speed from 323 loop detectors in Seattle, USA, from January 1 to January 28, 2015 [27, 44]. It has a temporal resolution of 5 minutes, i.e. 288 points per day. The last 5 days are reserved for testing. For this dataset, the red regime corresponds to the periods where the traffic becomes more volatile.

The Pacific surface temperature dataset consists of monthly surface temperatures of the Pacific for 2520 gridded spatial locations from January 1970 to December 2002 [27, 45]. The last 5 years are used for testing. For this dataset, the level of the time series changes across regimes: for Location 0, the red regime has a higher level compared to the blue regime, while for Location 840, the blue regime has a higher level.
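As a reference for the error measures reported across these datasets, a minimal sketch of the RMSE and MAPE computations is given below. The function and array names are placeholders, and the exact evaluation protocol (scaling and aggregation across series) follows the respective papers.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((y_true - y_pred) / (np.abs(y_true) + eps))) * 100)

# illustrative usage
y_true = np.array([100.0, 120.0, 90.0])
y_pred = np.array([110.0, 115.0, 95.0])
print(rmse(y_true, y_pred), mape(y_true, y_pred))
```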
The French Electricity demand dataset contains half-hourly electricity demand in France from January 1, 2012 to December 31, 2019 [46, 47]. The year 2019 is used for testing. The DS$^3$M uses two regimes for the French electricity demand dataset: the red color mostly corresponds to working days, while the blue color mostly coincides with weekends. Table 2 summarizes the forecasting performance on the real-world datasets.

We proposed a deep switching state space model (DS$^3$M) for forecasting nonlinear time series with regime switching. The model consists of a recurrent neural network (RNN) and a nonlinear switching state space model (SSSM) whose emission and transition functions are governed by a Markov chain of discrete latent variables and parameterized by multilayer perceptrons. The RNN and the SSSM are stacked together so that the continuous latent variables in the SSSM can use the long-term information embedded in the RNN. Also, the RNN is skip-connected to the observations to further improve the forecasts. The model is applicable to large datasets as it is estimated with amortized variational inference, where an inference network and the generative network are trained together. The DS$^3$M is applied to a variety of simulated and real datasets and achieves competitive performance compared to several state-of-the-art methods.

References
On state estimation in switching environments
State estimation for discrete systems with switching parameters
Analysis of time series subject to changes in regime
Switching Kalman filters
Variational learning for switching state-space models
Nonparametric Bayesian learning of switching linear dynamical systems
Time series analysis by state space methods
Nonlinear regime-switching state-space (RSSS) models
Application of statistical filter theory to the optimal estimation of position and velocity on board a circumlunar vehicle
New extension of the Kalman filter to nonlinear systems
A tutorial on particle filtering and smoothing: Fifteen years later
Long short-term memory
Empirical evaluation of gated recurrent neural networks on sequence modeling
Enhancing the locality and breaking the memory bottleneck of Transformer on time series forecasting
Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting
DeepAR: Probabilistic forecasting with autoregressive recurrent networks
Generating sequences with recurrent neural networks
Auto-encoding variational Bayes
Stochastic backpropagation and approximate inference in deep generative models
Learning stochastic recurrent networks
A recurrent latent variable model for sequential data
Sequential neural models with stochastic layers
Structured inference networks for nonlinear state space models
A disentangled recognition and nonlinear dynamics model for unsupervised learning
Dynamical variational autoencoders: A comprehensive review
Composing graphical models with neural networks for structured representations and fast inference
Deep switching auto-regressive factorization: Application to time series forecasting
Collapsed amortized variational inference for switching nonlinear dynamical systems
Recurrent hidden semi-Markov model
Structured inference for recurrent hidden semi-Markov model
Bayesian learning and inference in recurrent switching linear dynamical systems
Tree-structured recurrent switching linear dynamical systems for multi-scale modeling
Switching linear dynamics for variational Bayes filtering
The concrete distribution: A continuous relaxation of discrete random variables
Categorical reparameterization with Gumbel-softmax
Deep state space models for time series forecasting
CompILE: Compositional imitation learning and execution
Deep residual learning for image recognition
Attention is all you need
Identifying independence in Bayesian networks
Simple statistical gradient-following algorithms for connectionist reinforcement learning
US unemployment rate
Hangzhou incoming passenger flow
Seattle inductive loop detector dataset
Probabilistic forecasting for daily electricity loads and quantiles for curve-to-curve regression
Modeling and forecasting daily electricity load curves: A hybrid approach

Appendix. The ELBO for the log-likelihood can be derived as follows:
$$L(\theta) \geq \int q_\phi(z_{1:T}, d_{1:T} | y_{1:T}, x_{1:T}) \log \frac{p_\theta(y_{1:T}, z_{1:T}, d_{1:T} | x_{1:T})}{q_\phi(z_{1:T}, d_{1:T} | y_{1:T}, x_{1:T})}\, dz_{1:T}\, dd_{1:T} = \int q_\phi(z_{1:T}, d_{1:T} | y_{1:T}, h_{1:T}) \log \frac{p_\theta(y_{1:T}, z_{1:T}, d_{1:T} | h_{1:T})}{q_\phi(z_{1:T}, d_{1:T} | y_{1:T}, h_{1:T})}\, dz_{1:T}\, dd_{1:T}.$$
The predictive distribution of the discrete latent variable can be approximated by Monte Carlo as
$$p(d_{T+1} | x_{1:T}, y_{1:T}) \approx \frac{1}{S} \sum_{s=1}^{S} p_\theta\big(d_{T+1} | d_T^{(s)}\big).$$

The DS$^3$M is implemented in PyTorch with a V100 GPU. The Adam optimizer is used. The initial learning rate is set to 0.001 and is reduced by a factor of 0.1 when the validation loss has stopped improving for 10 epochs. Early stopping is also implemented to end training when the validation loss has stopped improving for 20 epochs. The number of switching states is set to 2. The batch size is set to 64. The dimension of the continuous latent variables is set to 2 for the toy example, the sleep apnea and the unemployment rate datasets, 3 for the Lorenz dataset, and 10 for the other datasets. All RNNs are 1-layer GRUs, with a hidden dimension of 10 for the toy example, 20 for the Lorenz dataset, and the dimension of the observations D for the other datasets. All MLPs have 2 layers, with the hidden dimension equal to the dimension of the outputs. All datasets are normalized before training and transformed back for evaluation. All parameters are initialized randomly. The classical linear KL annealing approach is used to increase the coefficients of the KL terms from 0.01 to 1 over the course of training. 100 epochs are sufficient for most experiments. Figure 3 shows the predictions of the testing data and the forecasted switching regimes for different datasets.
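A minimal sketch of the training schedule described in the appendix above (Adam with plateau-based learning-rate reduction, early stopping, and linear KL annealing) might look as follows. The `model` and `train_loader` objects are placeholders, and the exact implementation used for the paper may differ.

```python
import torch

# placeholders: `model(batch)` returns (reconstruction_loss, kl_loss); `train_loader` yields batches
def train(model, train_loader, val_loss_fn, epochs=100):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1, patience=10)
    best_val, patience, bad_epochs = float("inf"), 20, 0
    for epoch in range(epochs):
        # linear KL annealing: weight grows from 0.01 to 1 over the course of training
        kl_weight = min(1.0, 0.01 + (1.0 - 0.01) * epoch / (epochs - 1))
        model.train()
        for batch in train_loader:
            opt.zero_grad()
            recon_loss, kl_loss = model(batch)
            loss = recon_loss + kl_weight * kl_loss
            loss.backward()
            opt.step()
        val_loss = val_loss_fn(model)
        sched.step(val_loss)                      # reduce lr when validation stalls
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:            # early stopping
                break
```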