key: cord-0196512-nco4egea
authors: Chen, Ziang
title: Dual Reparametrized Variational Generative Model for Time-Series Forecasting
date: 2022-03-11
journal: nan
DOI: nan
sha: e8514eb8fb84cdde5c647a05e6bdbfa6231c8271
doc_id: 196512
cord_uid: nco4egea

This paper proposes DualVDT, a generative model for time-series forecasting. We introduce a dual reparametrized variational mechanism on top of the variational autoencoder (VAE) that tightens the evidence lower bound (ELBO), and we prove this improvement analytically. The mechanism leverages a latent score-based generative model (SGM) to explicitly denoise the perturbations that accumulate in the latent vector, using a reverse-time stochastic differential equation and variational ancestral sampling. The posterior of the denoised latent distribution is fused with the dual reparametrized variational density, which reduces the KL divergence term in the ELBO and leads to better results. This paper also proposes a latent attention mechanism that extracts multivariate dependencies explicitly, building factor-wise dependencies through a constructed local topology and temporal dependencies simultaneously. Proofs and experiments on multiple datasets show that DualVDT, with its novel dual reparametrized structure that denoises latent perturbations through reverse dynamics combined with local-temporal inference, achieves strong performance both analytically and empirically.

1 Introduction

Multivariate time-series forecasting is applied extensively across disciplines such as finance [25], economics [15], epidemiology [2], and self-driving [17], and has attracted a large body of research, surveyed in [25] [15]. The most advanced results are mainly achieved with deep generative models, including normalizing flows [14], variational autoencoders (VAE) [11] [12], and generative adversarial networks (GAN) [5] [20] [13] [16]. Among these, VAEs use variational inference to capture the important factors and provide density estimates; [21] [11] demonstrate the advantages of VAEs over other generative models, and experiments on multiple tasks show impressive VAE results [3] [4] [28]. However, a central issue in temporal inference with VAEs is that the variational lower bound may diverge as time increases [6] [14]: the error, which is closely tied to how dependencies are modelled, accumulates in the autoregressive process and restricts the real-world performance of these models [8].

To build an applicable dependency model for multivariate time series, interactions can be categorized into factor-to-factor and temporal-to-factor (also called spatial-temporal) phases [17] [15]. Factor-to-factor interactions, where elements may affect each other to varying degrees, can be captured explicitly with a topological structure in a graph neural network (GNN) [10] [29], or implicitly by local operations such as pooling or convolution [18]. For temporal dependencies, the same implicit local operations can be applied directly [18] or combined with temporal information. One may make the natural assumption that importance decays as the recall horizon grows [26], or establish the temporal dependency separately from the hidden-state embedding by exploiting the self-attention mechanism [23] [30]. The methods above have demonstrated clear practical advances [30] [15].
From a theoretical perspective, score-based generative models (SGMs), which explicitly denoise perturbations in the latent space, have demonstrated impressive results in audio and image generation [22] [19] [1]. A latent SGM has a forward and a reverse process. The forward process defines a noise diffusion on the latent space via a stochastic differential equation (SDE), and the model learns the log-density gradient of the perturbation, i.e., the distribution of the injected noise [19] [22]. To remove the noise, the learned score model is used in a reverse-time SDE, which can be seen as ancestral sampling from the perturbed density back to the clean density [7]. This approach thus optimizes the evidence lower bound of the variational probabilistic model.

Figure 1: Structure of DualVDT. Multivariate time-series data are first fed into the encoder, which extracts features and builds their local-temporal dependencies. The result is then passed to the dual reparametrized reverse-SDE phase, which tightens the variational lower bound of the model.

This paper combines these practical and analytical advances and introduces DualVDT, a model that brings the latent score-based generative model and a local-temporal aggregation operator to the multivariate time-series forecasting problem, leveraging the learned dependencies through a novel dual reparametrized mechanism. A variational lower bound can be constructed and proven tighter than that of the original VAE, guaranteeing better performance in theory. DualVDT also exploits a self-attention mechanism on a dynamic topology to infer spatial and temporal interactions simultaneously. On the experimental side, this paper evaluates on ETDataset (Electricity Transformer Dataset) [30] and the Covid-19 open data [24]. The main contributions are:

Dual reparametrized variational inference: DualVDT introduces dual reparametrized variational inference on the latent space through the latent score-based generative model, which explicitly denoises the accumulated perturbation via a reverse-time stochastic differential equation.

Tighter evidence lower bound: with the dual reparametrization, the evidence lower bound can be proven tighter than that of the original VAE, guaranteeing model performance. The ablation study also illustrates the effect of this mechanism.

2 Score-Based Generative Models

2.1 Score Matching in Diffusion Processes

This section reviews score-based generative models, which reduce perturbations by matching the score function of a diffusion process, directly estimating the injected noise. Consider a random vector, which in this paper represents the latent code z produced by the encoder q_φ(z|x) (the latent space is discussed in Section 2.2). A perturbation process on the latent space at time t has transition density q(z_t | z_0), described by the stochastic differential equation (SDE)

dz = f(z, t) dt + g(t) dw,   (1)

which maps the latent code between two k-dimensional spaces and injects noise; here w is the standard Wiener process. Under this process, the density converges to a Gaussian as time tends to infinity, representing the error accumulation in the autoregressive model. Following [19], the reverse-time SDE describing q(z_0 | z_t) is

dz = [f(z, t) − g(t)^2 ∇_z log q_t(z)] dt + g(t) dw̄,   (2)

where w̄ is the reverse-time Wiener process. The goal of the SGM is to train a score function that estimates the score ∇_z log q_t(z), reversing the diffusion process for an arbitrary density of the random vector z through ancestral sampling of the reverse transition q(z_0 | z_t).
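To make the forward and reverse dynamics concrete, the following is a minimal PyTorch sketch of Euler-Maruyama steps for the forward SDE (1) and the reverse-time SDE (2). The variance-preserving drift f(z, t) = −(1/2)β(t)z, the diffusion g(t) = √β(t), the linear schedule `beta`, and the `score_model` interface are illustrative assumptions in the style of [19], not this paper's exact configuration.

```python
import torch

def beta(t, beta_min=0.1, beta_max=20.0):
    # Linear noise schedule on t in [0, 1]; an assumption borrowed from the
    # variance-preserving SDE of [19], not necessarily this paper's choice.
    t = torch.as_tensor(t, dtype=torch.float32)
    return beta_min + t * (beta_max - beta_min)

def forward_sde_step(z, t, dt):
    # One Euler-Maruyama step of the forward SDE (1):
    # dz = f(z, t) dt + g(t) dw, with f = -0.5 * beta(t) * z, g = sqrt(beta(t)).
    drift = -0.5 * beta(t) * z
    noise = torch.randn_like(z)
    return z + drift * dt + beta(t).sqrt() * noise * dt ** 0.5

def reverse_sde_step(z, t, dt, score_model):
    # One Euler-Maruyama step of the reverse-time SDE (2), integrated backwards:
    # dz = [f(z, t) - g(t)^2 * score(z, t)] dt + g(t) dw_bar.
    g2 = beta(t)
    drift = -0.5 * g2 * z - g2 * score_model(z, t)
    noise = torch.randn_like(z)
    return z - drift * dt + g2.sqrt() * noise * dt ** 0.5
```

Iterating `forward_sde_step` from a clean latent z_0 perturbs it toward the Gaussian prior; iterating `reverse_sde_step` with a trained score model pushes noise back toward the data manifold, which is the denoising role the SGM plays in DualVDT's latent space.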
The prior density q(z) can be estimated via the score matching objective

min_θ E_{t ∼ U[0,1]} λ(t) E_{q(z_0) q(z_t | z_0)} || S_θ(z_t, t) − ∇_{z_t} log q(z_t | z_0) ||_2^2 + C.   (3)

The coefficient C and the weighting λ(t) = (1/2) g(t)^2 are independent of the parameters, following [22].

2.2 SGM in Latent Space

[22] introduced a latent space equipped with the diffusion denoising process described above. The common prior of the latent p(z_0) in the VAE [11] [21] is a Gaussian N(z_0; 0, σ_0^2 I), and the diffused density is q(z_t | z_0) = N(z_t; µ_t(z_0), σ_t^2 I). According to [22], for a latent space generated by the encoder q_φ(z|x) with a score-matching prior p_θ(z), the cross-entropy term is

CE(q_φ(z_0 | x) ∥ p_θ(z_0)) = E_{t ∼ U[0,1]} λ(t) E_{q_φ(z_0 | x) q(z_t | z_0)} || S_θ(z_t, t) − ∇_{z_t} log q(z_t | z_0) ||_2^2 + C.   (4)

The score model S_θ, defined in Section 3, is the approximation of the score function in (2). The goal of the latent score-based model is to minimize both the VAE reconstruction error [11] and the score matching error [22] with the combined loss

L(φ, θ, ψ) = E_{q_φ(z_0 | x)} [ −log p_ψ(x | z_0) ] + CE(q_φ(z_0 | x) ∥ p_θ(z_0)),   (5)

where p_ψ(x | z_0) is the decoder. The joint distribution can thus be written as p(z_0, x) = p_θ(z_0) p_ψ(x | z_0).

3 DualVDT

The structure of DualVDT is shown in Figure 1. This section first formulates the multivariate time-series forecasting inference problem and then introduces the local-temporal neighbourhood aggregation operator.

3.1 Problem Formulation

Consider a group of series x = (x_0, x_1, ..., x_{n_x})^T with n_x variables, where each variable has a history sequence x_i = (x_{i0}, x_{i1}, ..., x_{iT_x}) within a given look-back window t = 0, 1, ..., T_x. The forecast series is y = (y_0, y_1, ..., y_{n_y})^T with y_i = (y_{i0}, y_{i1}, ..., y_{iT_y}), i.e., n_y series over T_y time steps in total. Probabilistically, forecasting can be viewed as sampling from a learned posterior distribution y ∼ p_ψ(y | x). For convenience of discussion, let n_x = n_y and T_x = T_y, denoted n and T in the remainder; unmatched dimensions can be handled simply by zero-padding the smaller dimension and masking out the unused features, as in [23].

3.2 Local-Temporal Dependency

To speed up computation and achieve spatial and temporal aggregation simultaneously, following [27], we define a local-temporal mask, in which a binary tensor γ selects links along the time axis and its complement 1 − γ selects links among factors.

Figure 2: An example of the temporal mask γ and local mask 1 − γ.

Applying the multi-head self-attention mechanism [23], inference with the local-temporal masks can be evaluated simultaneously using the query, key, and value produced by the learned extractors Q(x), K(x), V(x), with details given in Section 4. With the weights of the query, key, and value tensors W = (w_q, w_k, w_v), aggregating the attention with the masks yields the multi-level local-temporal dependency (8), where ⊙ is the element-wise product and W_α, W_β are the weight tensors of the local and temporal branches, respectively (see the illustrative sketch below). The dependency built by multi-head self-attention can be seen as a dynamic topology on a heterogeneous graph [9] with two types of links, spatial (10) and temporal (11); the nodes represent the factor embeddings, built in the same way as in (8).

3.3 Dual Reparametrized Variational Mechanism

After the local-temporal dependency is built, the latent is encoded with the dual reparametrized variational mechanism. Let the posterior of the latent vector z be p_θ(z) with parameter θ; the denoising transition density q(z_t | z_0) and the reverse transition q(z_0 | z_t) can be obtained through (2). The variance converges according to

q(z_t | z_{t−1}) = N(z_t; µ_t(z_{t−1}), σ_t^2 I),   (12)

with the expectation scaled through µ_t(z_{t−1}) = √(1 − σ_t^2) z_{t−1}.
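To illustrate the scaled-mean transition in (12) and the variance convergence it implies, the following sketch iterates the Gaussian kernel on a deliberately non-standard latent batch; the constant per-step σ_t and the batch shape are assumptions for demonstration only.

```python
import torch

def perturb_step(z, sigma_t):
    # One step of q(z_t | z_{t-1}) = N(sqrt(1 - sigma_t^2) z_{t-1}, sigma_t^2 I):
    # the mean is shrunk so the marginal variance approaches 1.
    return (1 - sigma_t ** 2) ** 0.5 * z + sigma_t * torch.randn_like(z)

z = torch.randn(10000, 8) * 3.0 + 2.0   # arbitrary non-standard latent batch
for _ in range(200):
    z = perturb_step(z, sigma_t=0.1)
print(z.mean().item(), z.var().item())  # both drift toward the N(0, I) prior
```

Since the per-step variance satisfies Var_t = (1 − σ_t^2) Var_{t−1} + σ_t^2, the marginal variance has fixed point 1 and the mean shrinks toward 0, which is exactly the Gaussian error accumulation the reverse phase must undo.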
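The following is the masked-attention sketch promised in Section 3.2. The exact form of (8) is not recoverable from the text, so this is only a plausible single-head reading of it: the block structure of γ (temporal links within a variable) and 1 − γ (local links across variables at the same step), the fixed fusion weights standing in for W_α and W_β, and all tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def local_temporal_masks(n, T):
    # Tokens are (variable i, step t) pairs flattened to length n*T.
    idx = torch.arange(n * T)
    var, step = idx // T, idx % T
    gamma = (var[:, None] == var[None, :]).float()    # temporal mask: same variable
    local = (step[:, None] == step[None, :]).float()  # local mask: same time step
    return gamma, local

def masked_attention(q, k, v, mask):
    # Scaled dot-product attention with disallowed links set to -inf before softmax.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

n, T, d = 4, 8, 16
x = torch.randn(n * T, d)                          # token embeddings (illustrative)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3)) # stand-ins for learned Q(x), K(x), V(x)
q, k, v = x @ Wq, x @ Wk, x @ Wv
gamma, local = local_temporal_masks(n, T)
# Fixed fusion of the temporal and local branches (stand-ins for W_alpha, W_beta).
out = 0.5 * masked_attention(q, k, v, gamma) + 0.5 * masked_attention(q, k, v, local)
```

Each token attends only along its own series (temporal branch) or only across factors at its own step (local branch), and the two branches are fused, which matches the paper's description of simultaneous factor-wise and temporal aggregation.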
The training goal for the optimal posterior score density parameter θ in (5) is re-weighted to the objective (13), which defines ancestral sampling from the reverse transition density via (14). With a sequence of samplings, p_θ(z_{t−1}) = q(z_{t−1} | z_t) p_θ(z_t), and the denoised latent z_0 is sampled through the dual reparametrization (15). With the sample z_θ, the score matching objective therefore becomes maximizing the likelihood under the dual reparameterized process. The score function S_θ can be an arbitrary model with refinable parameters mapping between spaces of the same dimension, as in (2).

The model introduced in this paper follows the process shown in Figure 1 and Algorithm 1. During inference, the multivariate temporal data are formulated as in Section 3.1, and factors interact locally and temporally through the learned dynamic dependency in (8); ←_s denotes the corresponding sampling method of variational inference [11] and of (14). The latent is fed into the encoder, and the reverse phase of the SGM on the latent space is optimized as in (13). With a sequence of samplings, the denoised posterior of the latent is transformed by the dual reparameterized variational mechanism in (15), denoted D, which ensures conjunction with the prior in (5). This paper makes the density estimation of the reconstruction decoder and of the future prediction identical, both given by p_ψ; DualVDT can therefore evaluate both prediction and imputation with different masks.

The variational lower bound of DualVDT can be proven tighter than the original ELBO of the VAE. The vanilla VAE has the ELBO [11]

log p(x) ≥ E_{q_φ(z | x)} [ log p_ψ(x | z) ] − D_KL(q_φ(z | x) ∥ p(z)).   (16)

Splitting the last two terms, a KL divergence between the latent density and the posterior can be separated out. Since the latent generative model estimates the posterior distribution of the latent and reduces this KL divergence through the score-based objective in (5), the convergence in (12) ensures that the algorithm converges to a tighter ELBO.

4 Experiments

For the experiments, this paper first compares different sequence-to-sequence models (Table 1) and then performs an ablation study illustrating convergence to a bound tighter than the original VAE's (Table 2). Models are trained on ETDataset [30] and the Covid-19 open data [24] and assessed with mean squared error (MSE) and mean absolute error (MAE). In ETDataset the factors include High UseLess Load (HUL), Middle UseFul Load (MUFL), Low UseFul Load (LUFL), and Low UseLess Load (LULL), and the target variable is Oil Temperature (OT). Following [30], the task is to predict usage over time from the factors HUL, MUFL, LUFL, and LULL; OT is chosen as the target because it reflects the condition of the Electrical Transformer (ET). Covid-19 Open Data is a global epidemiological database; the pandemic problem naturally carries topological interaction, since infected populations migrate and affect other regions. This paper uses the same approach as [2] to split the dataset and perform validation, with the same parameter settings as in [26] [23] [14].

The results show that DualVDT outperforms the other sequence models on these datasets in most cases. One reason may be that LSTM and Transformer are built for classification outputs (over word embeddings) rather than regression, and the Gaussian process can be seen as DualVDT with a linear encoder and decoder and with both the dual reparametrization and the local-temporal inference removed. For the ablation study, this paper selects different settings of the encoder and score model with the various sampling methods introduced in [19]. The decoder has the same structure as the encoder, except that its output dimension matches the dimension of the target variable.
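To make the reverse-phase training and sampling of Section 3.3 concrete, here is a minimal sketch of a denoising score-matching step in the spirit of the re-weighted objective (13), paired with an ancestral sampling loop in the spirit of (14)-(15). The update rule, the σ_t^2 re-weighting, the schedule, and the network interface are illustrative assumptions consistent with the variance-preserving kernel in (12), not the paper's exact equations.

```python
import torch

def dsm_loss(score_model, z0, sigmas):
    # Denoising score matching: perturb z0 at a random noise level t, then
    # regress the score of q(z_t | z_0); sigma_t^2 is one common re-weighting.
    t = torch.randint(len(sigmas), (z0.shape[0],))
    sigma_t = sigmas[t].view(-1, 1)
    eps = torch.randn_like(z0)
    zt = (1 - sigma_t ** 2).sqrt() * z0 + sigma_t * eps
    target = -eps / sigma_t                 # score of the Gaussian kernel (12)
    return ((score_model(zt, t) - target) ** 2 * sigma_t ** 2).mean()

@torch.no_grad()
def ancestral_sample(score_model, shape, sigmas):
    # Reverse the perturbation step by step: correct z_t with the learned score,
    # rescale the mean, and re-inject noise (reparametrized Gaussian sampling).
    z = torch.randn(shape)
    for t in reversed(range(len(sigmas))):
        sigma_t = sigmas[t]
        t_batch = torch.full((shape[0],), t)
        mean = (z + sigma_t ** 2 * score_model(z, t_batch)) / (1 - sigma_t ** 2) ** 0.5
        z = mean + sigma_t * torch.randn(shape) if t > 0 else mean
    return z
```

Here `sigmas` might be a simple increasing schedule such as `torch.linspace(0.02, 0.3, 100)`; in a DualVDT-style model, `dsm_loss` would be minimized alongside the VAE reconstruction term of the combined loss (5).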
In Table 2, Fully Connected (FC), Convolutional Neural Network (CNN), and Local-Temporal Attention (LT) components are assessed at the different stages shown in Figure 1. The sampling methods include ancestral sampling, the reverse differential equation solved with NeuralODE, and the probability-flow method [19]. The results support the proof in Section 3: the dual reparametrization reduces the loss of the model, and in most cases local-temporal attention gives more accurate estimates than a vanilla VAE built from multi-layer perceptrons. The assessment of the different sampling methods also matches the findings of [19].

5 Conclusion

This paper proposes DualVDT, a generative model for time-series forecasting with a dual reparametrized variational mechanism. The model can be proven to have a tighter evidence lower bound (ELBO) than the vanilla VAE, guaranteeing its performance. With the score-based generative model, it explicitly reduces the KL divergence through the score matching process by reversing the perturbation process. The paper also proposes a latent attention mechanism that extracts multivariate dependencies and dynamically builds local-temporal dependencies simultaneously; this mechanism captures factor-wise dependencies through density estimation over the topology. The results show advantages both analytically and experimentally. Further work may focus on applications of DualVDT and on combining more informative differential equation models in specific fields.

References

[1] WaveGrad: Estimating gradients for waveform generation
[2] Study of Opioid Crisis with SI Model Based on Data
[3] GP-VAE: Deep probabilistic time series imputation
[4] SOM-VAE: Interpretable discrete representation learning on time series
[6] Temporal Difference Variational Auto-Encoder
[7] Denoising Diffusion Probabilistic Models
[8] The vanishing gradient problem during learning recurrent neural nets and problem solutions
[9] Heterogeneous Graph Transformer
[10] Edge-labeling graph neural network for few-shot learning
[11] Auto-Encoding Variational Bayes
[12] An Introduction to Variational Autoencoders
[13] Semi-supervised learning with deep generative models
[14] Normalizing Flows: An Introduction and Review of Current Methods
[15] Time-series forecasting with deep learning: a survey
[16] Auxiliary deep generative models
[17] Human motion trajectory prediction: a survey
[18] Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting
[19] Score-based generative modeling through stochastic differential equations
[20] A note on the evaluation of generative models
[21] NVAE: A deep hierarchical variational autoencoder
[22] Score-based generative modeling in latent space
[23] Attention is all you need
[24] COVID-19 Open-Data: curating a fine-grained, global-scale data repository for SARS-CoV-2 (2020, work in progress)
[25] Judgemental and statistical time series forecasting: a review of the literature
[26] A review of recurrent neural networks: LSTM cells and network architectures
[27] AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting
[28] Deep learning methods for forecasting COVID-19 time-series data: a comparative study
[29] Heterogeneous graph neural network
[30] Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting