title: Spectral Temporal Graph Neural Network for Multivariate Time-series Forecasting
authors: Cao, Defu; Wang, Yujing; Duan, Juanyong; Zhang, Ce; Zhu, Xia; Huang, Congrui; Tong, Yunhai; Xu, Bixiong; Bai, Jing; Tong, Jie; Zhang, Qi
date: 2021-03-13

Abstract

Multivariate time-series forecasting plays a crucial role in many real-world applications. It is a challenging problem, as one needs to consider both intra-series temporal correlations and inter-series correlations simultaneously. Recently, there have been multiple works trying to capture both correlations, but most, if not all, of them only capture temporal correlations in the time domain and resort to pre-defined priors as inter-series relationships. In this paper, we propose Spectral Temporal Graph Neural Network (StemGNN) to further improve the accuracy of multivariate time-series forecasting. StemGNN captures inter-series correlations and temporal dependencies jointly in the spectral domain. It combines Graph Fourier Transform (GFT), which models inter-series correlations, and Discrete Fourier Transform (DFT), which models temporal dependencies, in an end-to-end framework. After passing through GFT and DFT, the spectral representations hold clear patterns and can be predicted effectively by convolution and sequential learning modules. Moreover, StemGNN learns inter-series correlations automatically from the data without using pre-defined priors. We conduct extensive experiments on ten real-world datasets to demonstrate the effectiveness of StemGNN. Code is available at https://github.com/microsoft/StemGNN/

1 Introduction

Time-series forecasting plays a crucial role in various real-world scenarios, such as traffic forecasting, supply chain management and financial investment. It helps people make important decisions when the future evolution of events or metrics can be estimated accurately. For example, we can modify our driving route or reschedule an appointment if a severe traffic jam is anticipated in advance. Moreover, if we can forecast the trend of COVID-19 in advance, we are able to reschedule important events and take quick actions to prevent the spread of the epidemic.

Making accurate forecasts based on historical time-series data is challenging, as both intra-series temporal patterns and inter-series correlations need to be modeled jointly. Recently, deep learning models have shed new light on this problem. On one hand, Long Short-Term Memory (LSTM) [12], Gated Recurrent Units (GRU) [6], Gated Linear Units (GLU) [8] and Temporal Convolution Networks (TCN) [3] have achieved promising results in temporal modeling. At the same time, the Discrete Fourier Transform (DFT) is also useful for time-series analysis. For instance, the State Frequency Memory (SFM) network [39] combines the advantages of DFT and LSTM for stock price prediction; the Spectral Residual (SR) model [28] leverages DFT and achieves state-of-the-art performance in time-series anomaly detection. Another important aspect of multivariate time-series forecasting is to model the correlations among multiple time-series. For example, in the traffic forecasting task, adjacent roads naturally interplay with each other. Current state-of-the-art models highly depend on Graph Convolutional Networks (GCNs) [16], which originate from the theory of Graph Fourier Transform (GFT).
These models [38, 20] stack GCNs and temporal modules (e.g., LSTM, GRU) directly, so they only capture temporal patterns in the time domain and require a pre-defined topology of inter-series relationships. In this paper, our goal is to better model the intra-series temporal patterns and inter-series correlations jointly. Specifically, we hope to combine the advantages of both GFT and DFT, and model multivariate time-series data entirely in the spectral domain. The intuition is that after GFT and DFT, the spectral representations hold clearer patterns and can be predicted more effectively.

It is non-trivial to achieve this goal. The key technical contribution of this work is a carefully designed StemGNN (Spectral Temporal Graph Neural Network) block. Inside a StemGNN block, GFT is first applied to transfer structural multivariate inputs into spectral time-series representations, so that different trends are decomposed into orthogonal time-series. Then, DFT is utilized to transfer each univariate time-series into the frequency domain. After GFT and DFT, the spectral representations become easier to recognize for convolution and sequential modeling layers. Moreover, a latent correlation layer is incorporated into the end-to-end framework to learn inter-series correlations automatically, so the model does not require multivariate dependencies as priors. In addition, we adopt both forecasting and backcasting output modules with a shared encoder to facilitate the representation capability of multivariate time-series. The main contributions of this paper are summarized as follows:

• To the best of our knowledge, StemGNN is the first work that represents both intra-series and inter-series correlations jointly in the spectral domain. It encapsulates the benefits of GFT, DFT and deep neural networks simultaneously and collaboratively. Ablation studies further prove the effectiveness of this design.

• StemGNN enables a data-driven construction of dependency graphs for different time-series. Thereby, the model is general for all multivariate time-series without pre-defined topologies. As shown in the experiments, automatically learned graph structures have good interpretability and work even better than graph structures defined by humans.

• StemGNN achieves state-of-the-art performance on nine public benchmarks of multivariate time-series forecasting. On average, it outperforms the best baseline by 8.1% on MAE and 13.3% on RMSE. A case study on COVID-19 further shows its feasibility in real scenarios.

2 Related Work

Time-series forecasting is an emerging topic in machine learning, which can be divided into two major categories: univariate techniques [25, 27, 23, 33, 39, 24] and multivariate techniques [29, 26, 20, 38, 3, 35, 30, 19, 18]. Univariate techniques analyze each individual time-series separately without considering the correlations between different time-series [27]. For example, FC-LSTM [36] forecasts univariate time-series with LSTM and fully-connected layers. SFM [39] improves the LSTM model by breaking down the cell states of a given univariate time-series into a series of different frequency components through the Discrete Fourier Transform (DFT). N-BEATS [24] proposes a deep neural architecture based on a deep stack of fully-connected layers with basis expansion. Multivariate techniques consider a collection of multiple time-series as a unified entity [29, 11, 21].
TCN [3] is a representative work in this category, which treats the high-dimensional data entirely as a tensor input and achieves a large receptive field through dilated convolutions. LSTNet [17] uses convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract short-term local dependency patterns among variables and discover long-term patterns of time-series. DeepState [26] marries state space models with deep recurrent neural networks and learns the parameters of the entire network by maximizing the log-likelihood. DeepGLO [30] leverages both global and local features during training and forecasting. The global component in DeepGLO is based on matrix factorization and is able to capture global patterns by representing each time-series as a linear combination of basis components. There is another category of works using graph neural networks to capture the correlations between different time-series explicitly. For instance, DCRNN [20] incorporates both spatial and temporal dependencies in a convolutional recurrent neural network for traffic forecasting. ST-GCN [38] is another deep learning framework for traffic prediction, integrating graph convolution and gated temporal convolution through spatio-temporal convolutional blocks. GraphWaveNet [35] combines graph convolutional layers with adaptive adjacency matrices and dilated causal convolutions to capture spatio-temporal dependencies. However, most of these works either ignore the inter-series correlations or require a dependency graph as a prior. In addition, the Fourier transform has shown its advantages in previous work, especially the Joint Fourier Transform (JFT) [10], and its application to forecasting can be found in [14, 21]. In spite of this, none of the existing solutions capture temporal patterns and multivariate dependencies jointly in the spectral domain. In this paper, StemGNN is proposed to address these issues. We refer the reader to recent surveys [34, 41, 40] for more details about related works.

3 Problem Definition

In order to emphasize the relationships among multiple time-series, we formulate the problem of multivariate time-series forecasting based on a data structure called a multivariate temporal graph, denoted as G = (X, W). X = {x_{it}} ∈ R^{N×T} stands for the multivariate time-series input, where N is the number of time-series (nodes) and T is the number of timestamps. We denote the observed values at timestamp t as X_t ∈ R^N. W ∈ R^{N×N} is the adjacency matrix, where w_{ij} > 0 indicates that there is an edge connecting nodes i and j, and w_{ij} indicates the strength of this edge. Given the observed values of the previous K timestamps, X_{t-K}, ..., X_{t-1}, the task of multivariate time-series forecasting is to predict the node values in the multivariate temporal graph G = (X, W) for the next H timestamps, denoted by X̂_t, X̂_{t+1}, ..., X̂_{t+H-1}. These values can be inferred by a forecasting model F with parameters Φ and a graph structure G, where G can be given as a prior or automatically inferred from data.

4 Spectral Temporal Graph Neural Network

Here, we propose Spectral Temporal Graph Neural Network (StemGNN) as a general solution for multivariate time-series forecasting. The overall architecture of StemGNN is illustrated in Figure 1. The multivariate time-series input X is first fed into a latent correlation layer, where the graph structure and its associated weight matrix W can be inferred automatically from data. Next, the graph G = (X, W) serves as input to the StemGNN layer, which consists of two residual StemGNN blocks.
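To make this data flow concrete, below is a minimal PyTorch sketch of a single pass through the pipeline. The tensor names are illustrative, and the random row-stochastic matrix merely stands in for the learned adjacency; the released code linked above is the authoritative implementation.

```python
import torch

# Illustrative sizes: N series, a window of K past steps.
N, K = 25, 12
X = torch.randn(N, K)  # one sliding window of the multivariate input

# 1) Latent correlation layer: in StemGNN, W is learned from data by
#    self-attention (sketched in the next section); here a random
#    row-stochastic matrix stands in for the learned adjacency.
W = torch.softmax(torch.randn(N, N), dim=-1)
W = 0.5 * (W + W.T)  # symmetrize so the Laplacian below is well defined

# 2) Graph Fourier Transform: eigenbasis of the normalized Laplacian
#    L = I_N - D^{-1/2} W D^{-1/2}.
d = torch.diag(W.sum(dim=1).rsqrt())
L = torch.eye(N) - d @ W @ d
eigvals, U = torch.linalg.eigh(L)  # L = U diag(eigvals) U^T
X_spec = U.T @ X                   # GFT: project the N series onto the eigenbasis

# 3) Each StemGNN block applies the Spe-Seq Cell (DFT -> conv/GLU -> IDFT)
#    and a spectral graph convolution to X_spec, then maps back with the
#    inverse GFT; two such blocks are stacked with a residual connection.
X_rec = U @ X_spec                 # inverse GFT maps spectral output back to the nodes
```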
A StemGNN block is designed to model structural and temporal dependencies inside multivariate time-series jointly in the spectral domain (as visualized in the top diagram of Figure 1). It contains a sequence of operators in a well-designed order. First, a Graph Fourier Transform (GFT) operator transforms the graph G into a spectral matrix representation, where the univariate time-series of the different nodes become linearly independent. Then, a Discrete Fourier Transform (DFT) operator transforms each univariate time-series component into the frequency domain. In the frequency domain, the representation is fed into 1D convolution and GLU sub-layers to capture feature patterns before being transformed back to the time domain through the inverse DFT. Finally, we apply graph convolution on the spectral matrix representation and perform the inverse GFT.

After the StemGNN layer, we add an output layer composed of GLU and fully-connected (FC) sub-layers. There are two kinds of outputs in the network. The forecasting outputs Y_i are trained to generate the best estimation of future values, while the backcasting outputs X̂_i are used in an autoencoding fashion to enhance the representation power of multivariate time-series. The final loss function can be formulated as a combination of both forecasting and backcasting losses:

L(X̂, X; Δ_θ) = Σ_t ||X̂_t − X_t||²_2 + Σ_t Σ_{τ=1}^{K} ||B_{t−τ}(X) − X_{t−τ}||²_2,

where the first term represents the forecasting loss and the second term denotes the backcasting loss. For each timestamp t, {X_{t−K}, ..., X_{t−1}} are input values within a sliding window, and X_t is the ground truth value to forecast; X̂_t is the forecasted value for timestamp t, and {B_{t−K}(X), ..., B_{t−1}(X)} are reconstructed values from the backcasting module. B indicates the entire network that generates the backcasting output, and Δ_θ denotes all parameters in the network.

In the inference phase, we adopt a rolling strategy for multi-step prediction. First, X̂_t is predicted by taking {X_{t−K}, ..., X_{t−1}} as input. Then, the input is changed to {X_{t−K+1}, ..., X_{t−1}, X̂_t} for predicting the next timestamp X̂_{t+1}. By applying this rolling strategy consecutively, we can obtain forecasts for the next H timestamps.

GNN-based approaches require a graph structure when modeling multivariate time-series. It can be constructed by human knowledge (such as the road network in traffic forecasting), but sometimes we do not have a pre-defined graph structure as a prior. In order to serve general cases, we leverage the self-attention mechanism to learn latent correlations between multiple time-series automatically. In this way, the model emphasizes task-specific correlations in a data-driven fashion. First, the input X ∈ R^{N×T} is fed into a Gated Recurrent Unit (GRU) layer, which calculates the hidden state corresponding to each timestamp t sequentially. Then, we use the last hidden state R as the representation of the entire time-series and calculate the weight matrix W by the self-attention mechanism as follows:

Q = R W^Q,  K = R W^K,  W = Softmax(Q K^T / √d),

where Q and K denote the representations for query and key, calculated by linear projections with learnable parameters W^Q and W^K in the attention mechanism, respectively, and d is the hidden dimension size of Q and K. The output matrix W ∈ R^{N×N} serves as the adjacency weight matrix of graph G. The overall time complexity of self-attention is O(N²d).
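A minimal sketch of this layer, assuming a PyTorch implementation, is given below; the module and parameter names are our own illustrations, not necessarily those used in the released code.

```python
import math
import torch
import torch.nn as nn

class LatentCorrelation(nn.Module):
    """Infer an N x N adjacency matrix from raw series via self-attention."""
    def __init__(self, d: int = 32):
        super().__init__()
        # A shared GRU encodes each series; its last hidden state is R.
        self.gru = nn.GRU(input_size=1, hidden_size=d, batch_first=True)
        self.w_q = nn.Linear(d, d, bias=False)  # learnable W^Q
        self.w_k = nn.Linear(d, d, bias=False)  # learnable W^K
        self.d = d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T), one row per time-series; N is treated as the batch.
        _, h = self.gru(x.unsqueeze(-1))   # h: (1, N, d), last hidden states
        r = h.squeeze(0)                   # R, one d-dim vector per series
        q, k = self.w_q(r), self.w_k(r)    # Q = R W^Q, K = R W^K
        # W = Softmax(Q K^T / sqrt(d)); overall cost O(N^2 d).
        return torch.softmax(q @ k.T / math.sqrt(self.d), dim=-1)

W = LatentCorrelation(d=32)(torch.randn(25, 12))  # W.shape == (25, 25)
```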
The StemGNN layer is constructed by stacking multiple StemGNN blocks with skip connections. A StemGNN block is designed by embedding a Spectral Sequential (Spe-Seq) Cell into a Spectral Graph Convolution module. In this section, we first introduce the motivation and architecture of the StemGNN block, and then briefly describe the Spe-Seq Cell and the Spectral Graph Convolution module separately.

StemGNN Block. Spectral Graph Convolution has been widely used in time-series forecasting tasks due to its extraordinary capability of learning latent representations of multiple time-series in the spectral domain. Its key component is applying the Graph Fourier Transform (GFT) to capture inter-series relationships. It is worth noting that the output of GFT is also a multivariate time-series, and GFT does not learn intra-series temporal relationships explicitly. Therefore, we can utilize the Discrete Fourier Transform (DFT) to learn the representations of the input time-series on the trigonometric basis in the frequency domain, which captures the repeated patterns in periodic data and the auto-correlation features among different timestamps. Motivated by this, we apply the Spe-Seq Cell on the output of GFT to learn temporal patterns in the frequency domain. Then the output of the Spe-Seq Cell is processed by the remaining components of the Spectral Graph Convolution.

Our model can also be extended to multiple channels. We apply GFT and the Spe-Seq Cell on each individual channel X_i of the input data and sum the results after graph convolution with kernel Θ_{·j}. Next, the Inverse Graph Fourier Transform (IGFT) is applied on the sum to obtain the j-th channel Z_j of the output, which can be written as follows:

Z_j = GF^{−1} ( Σ_i g_{Θ_{ij}}(Λ_i) S(GF(X_i)) ).

Here GF, GF^{−1} and S denote GFT, IGFT and the Spe-Seq Cell respectively; Θ_{ij} is the graph convolution kernel corresponding to the i-th input and j-th output channel, and Λ_i is the eigenvalue matrix of the normalized Laplacian. The number of eigenvectors used in GFT is equal to the multivariate dimension (N), without dimension reduction. After that, we concatenate the output channels Z_j to obtain the final result Z. Following [24], we use learnable parameters to represent basis vectors V and a fully-connected layer to generate basis expansion coefficients θ based on Z. Then the output can be calculated as a combination of different bases: Y = Vθ. We have two branches of this module in the StemGNN block: one to forecast future values, namely the forecasting branch, and the other to reconstruct history values, namely the backcasting branch (denoted by B). The backcasting branch helps regulate the functional space for the block to represent time-series data. Furthermore, we employ residual connections to stack multiple StemGNN blocks to build deeper models. In our case, we use two StemGNN blocks. The second block tries to approximate the residual between the ground-truth values and the reconstructed values from the first block. Finally, the outputs from both blocks are concatenated and fed into GLU and fully-connected layers to generate predictions.

Spectral Sequential Cell (Spe-Seq Cell). The Spe-Seq Cell S aims to decompose each individual time-series after GFT onto a frequency basis and learn feature representations on it. It consists of four components in order: Discrete Fourier Transform (DFT, F), 1D convolution, GLU and Inverse Discrete Fourier Transform (IDFT, F^{−1}), where DFT and IDFT transform time-series data between the temporal domain and the frequency domain, while 1D convolution and GLU learn feature representations in the frequency domain. Specifically, the output of DFT has a real part (X^r_u) and an imaginary part (X^i_u), which are processed by the same operators with different parameters in parallel. The operations can be formulated as:

M_*(X^*_u) = GLU(θ^*_τ(X^*_u)) = θ^*_τ(X^*_u) ⊙ σ(θ^*_τ(X^*_u)), * ∈ {r, i},

where θ^*_τ is the convolution kernel, whose size is 3 in our experiments; ⊙ is the Hadamard product; and the nonlinear sigmoid gate σ^* determines how much information in the current input is closely related to the sequential pattern. Finally, the result can be obtained as M_r(X^r_u) + i M_i(X^i_u), and IDFT is applied on the final output.
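Below is a minimal sketch of the Spe-Seq Cell under these equations, assuming PyTorch's torch.fft API. It follows the stated operator order (DFT, 1D convolution, GLU, IDFT) with separate parameters for the real and imaginary parts, but it is an illustration rather than the reference implementation; in particular, the gate uses the standard two-half-channel GLU variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeSeqCell(nn.Module):
    """DFT -> (1D conv + GLU) on real/imaginary parts -> inverse DFT."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Each conv emits 2 channels; GLU gates one half with the other.
        self.conv_r = nn.Conv1d(1, 2, kernel_size, padding=pad)  # real branch
        self.conv_i = nn.Conv1d(1, 2, kernel_size, padding=pad)  # imaginary branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T) spectral time-series coming out of the GFT.
        xf = torch.fft.rfft(x, dim=-1)                    # DFT
        real, imag = xf.real.unsqueeze(1), xf.imag.unsqueeze(1)
        # M_*(X_u^*) = theta(X_u^*) gated by sigmoid, * in {r, i}.
        m_r = F.glu(self.conv_r(real), dim=1).squeeze(1)
        m_i = F.glu(self.conv_i(imag), dim=1).squeeze(1)
        out = torch.complex(m_r, m_i)                     # M_r + i M_i
        return torch.fft.irfft(out, n=x.shape[-1], dim=-1)  # inverse DFT

y = SpeSeqCell()(torch.randn(25, 12))  # same (N, T) shape as the input
```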
The Spectral Graph Convolution [16] is composed of three steps: (1) the multivariate time-series input is projected to the spectral domain by GFT; (2) the spectral representation is filtered by a graph convolution operator with learnable kernels; (3) the Inverse Graph Fourier Transform (IGFT) is applied on the spectral representation to generate the final output.

Graph Fourier Transform (GFT) [9] is the basic operator of Spectral Graph Convolution. It projects the input graph to an orthonormal space whose bases are constructed by the eigenvectors of the normalized graph Laplacian. The normalized graph Laplacian [1] can be computed as

L = I_N − D^{−1/2} W D^{−1/2},

where I_N ∈ R^{N×N} is the identity matrix and D is the diagonal degree matrix with D_{ii} = Σ_j w_{ij}. Then, we perform eigenvalue decomposition on the Laplacian matrix, forming L = U Λ U^T, where U ∈ R^{N×N} is the matrix of eigenvectors and Λ is a diagonal matrix of eigenvalues. Given the multivariate time-series X ∈ R^{N×T}, the operators of GFT and IGFT are defined as GF(X) = U^T X = X̂ and GF^{−1}(X̂) = U X̂ respectively. The graph convolution operator is implemented as a function g_Θ(Λ) of the eigenvalue matrix Λ with parameter Θ. The overall time complexity is O(N³).

5 Experiments

We compare the performance of StemGNN on nine public datasets, ranging from the traffic and energy domains to the electrocardiogram domain, with other state-of-the-art models, including FC-LSTM [32], SFM [39], N-BEATS [24], DCRNN [20], LSTNet [17], ST-GCN [38], DeepState [26], TCN [3], GraphWaveNet [35] and DeepGLO [30]. We tune the hyper-parameters of StemGNN on the validation data by grid search. Finally, the channel size of each graph convolution layer is set as 64 and the kernel size of the 1D convolution is 3. Following [38], we adopt the RMSprop optimizer, and the number of training epochs is set as 50. The learning rate is initialized as 0.001 and decayed with rate 0.7 after every 5 epochs. We use Mean Absolute Error (MAE) [13], Mean Absolute Percentage Error (MAPE) [13] and Root Mean Squared Error (RMSE) [13] to measure the performance, averaged over the H steps in multi-step prediction. We report the performance of baseline models from their original publications unless otherwise stated. The dataset statistics are summarized in Table 1. We divide each dataset into three parts for training, validation and testing, with a ratio of 6:2:2 on PEMS03, PEMS04 and PEMS08, and 7:2:1 on METR-LA, PEMS-BAY, PEMS07, Solar, Electricity and ECG. The inputs of ECG are normalized by min-max normalization following [5]. The other inputs are normalized by the Z-Score method [24]; that is, StemGNN is trained on normalized input where each time-series in the training set is re-scaled as X_in = (X_in − µ(X_in)) / σ(X_in), where µ and σ denote the mean and standard deviation respectively. More detailed descriptions of the datasets, evaluation metrics and experimental settings can be found in Appendices B, C and D.

The evaluation results are summarized in Table 2, and more details can be found in Appendix E.1. Generally, StemGNN establishes a new state-of-the-art on most of the datasets. Furthermore, the model does not need an a priori topology, which demonstrates the feasibility of learning latent correlations automatically.
In particular, across all datasets, StemGNN improves over the best baseline for each dataset by an average of 8.1% on MAE and 13.3% on RMSE. In terms of baseline models, FC-LSTM only takes temporal information into consideration and performs estimation in the time domain. SFM models the time-series data in the frequency domain and shows stable improvement over FC-LSTM. Besides, N-BEATS, TCN and DeepState are state-of-the-art deep learning models specialized for sequential modeling. A common limitation is that they do not capture the correlations among multiple time-series explicitly, which hinders their application to multivariate time-series forecasting. Therefore, it is natural that StemGNN shows much better performance than these baselines. On the other hand, spatial and temporal correlations can be modeled in GNN-based approaches, such as DCRNN, ST-GCN and GraphWaveNet. However, most of them need a pre-defined topology of the different time-series and are not applicable to the Solar, Electricity and ECG datasets. GraphWaveNet is able to work without a pre-defined structure, but its performance is not satisfactory. For traffic forecasting tasks, StemGNN outperforms these models consistently without any prior knowledge of the road network. It is convincing that a data-driven latent correlation layer works more effectively than human-defined priors. Moreover, DeepGLO is a hybrid method that enables the model to focus on local properties of individual time-series as well as global properties, while multivariate correlations are encoded by a matrix factorization module. It shows competitive performance on some datasets like Solar and PEMS08, but StemGNN is generally more advantageous. Arguably, it is beneficial to recognize both structural and sequential patterns jointly in the spectral domain.

To better understand the effectiveness of different components in StemGNN, we design six variants of the model and conduct an ablation study on several datasets. Table 3 summarizes the results obtained on PEMS07 [4], and more results on other datasets can be found in Appendix E.2.

6 Analysis

To investigate the validity of our proposed latent correlation layer, we perform a case study in the traffic forecasting scenario. We choose 6 detectors from PEMS-BAY and show the average correlation matrix learned from the training data (the right part of Figure 2). Each column represents a sensor in the real world: column i represents the correlation strength between detector i and the other detectors. As we can see, some columns have higher values, like column s_1, and some have smaller values, like column s_25. This indicates that some nodes are closely related to each other while others are only weakly related. This is reasonable, since detector s_1 is located near the intersection of main roads, while detector s_25 is located on a single road, as shown in the left part of Figure 2. Therefore, our model not only obtains an outstanding forecasting performance, but also shows an advantage in interpretability.

We further analyze the effect of GFT and DFT in StemGNN. We choose the top two eigenvectors obtained by eigenvalue decomposition of the normalized Laplacian matrix L and visualize their corresponding time-series after GFT in Figure 4. As encoded by the eigenvectors, the first time-series captures a common trend of the whole world and the second captures a common trend of Asian countries.
For a clear comparison, we also visualize the ground truth of the daily number of newly confirmed cases in the whole world and in Asian countries. As shown in Figure 4, the time-series after GFT clearly capture these two major trends. Moreover, the time-series data in the spectral space becomes smoother, which increases the generalization capability and reduces the difficulty of forecasting. We also plot the time-series after being processed by the Spectral Sequential Cell (denoted by IDFT in Figure 4), which recognizes the data patterns in the frequency domain. Compared to the ones after GFT, the resulting time-series turn out to be smoother and more feasible for forecasting.

7 Conclusion

In this paper, we propose a novel deep learning model, namely Spectral Temporal Graph Neural Network (StemGNN), to take advantage of both inter-series correlations and temporal dependencies by modeling them jointly in the spectral domain. StemGNN outperforms existing approaches consistently in a variety of multivariate time-series forecasting applications. Future work is considered in two directions. First, we will investigate approximation methods to reduce the time complexity of StemGNN, because directly applying eigenvalue decomposition is prohibitive for very large graphs of high-dimensional time-series. Second, we will look for applications to more real-world scenarios, such as product demand, stock price prediction and budget analysis. We also plan to apply StemGNN to predictive maintenance, which is an important topic in AIOps.

Broader Impact

Time-series analysis is an important research domain for machine learning, and multivariate time-series forecasting is one of the most prevalent tasks in this domain. This paper proposes a novel model, StemGNN, for the task of multivariate time-series forecasting. For the first time, we model the inter-series correlations and temporal patterns jointly in the spectral domain, which improves the representation power of multivariate time-series. Signals in the time domain can be easily restored from the orthogonal basis in the frequency domain, so we can leverage the rich information beneath the hood of the frequency domain to improve forecasting results. StemGNN is neat yet powerful, as proved by extensive experiments and analyses. It is one of the first attempts to incorporate the Discrete Fourier Transform into Graph Neural Networks. We believe it will motivate more exploration along this direction in other related domains with temporal features, such as social graph mining and sentiment analysis. Moreover, StemGNN adopts a latent correlation layer in an end-to-end framework to learn the relationships among multivariate signals automatically. This makes StemGNN a general approach that can be applied to a wide range of applications, including surveillance of traffic flows, healthcare data monitoring, natural disaster forecasting and economics.

Multivariate time-series forecasting has significant societal implications as well. A sophisticated supply chain management system may be built if we can predict market trends precisely. It also brings benefits to our daily life. For example, there is a real case about 'Flooding Risk Analysis'. The task is to predict when there will be flooding in certain areas near a city. The prediction mainly depends on two external factors, tides and rainfall. Accurate prediction can alert people to keep away from the area at the corresponding time to avoid unnecessary losses.
For COVID-19, accurate prediction of the trend may help governments make suitable decisions to control the spread of the epidemic. According to the case study on COVID-19 in this paper, we can reasonably forecast the daily number of newly confirmed cases four weeks in advance based on historical data. Nevertheless, how to predict the trend from the beginning, without sufficient historical data, is more challenging and remains to be investigated. Moreover, we are aware of the potential negative impact of this technique on personal privacy. Customers' behavior may be predicted from historical records by unscrupulous business people, which provides a convenient way to send spam. Hackers may also use predicted data to evade the surveillance of a bank's security system with fraudulent credit card transactions. Although current models are still far from predicting future data with perfect accuracy, we do believe that the margin is decreasing rapidly. We hope that researchers can understand and mitigate the potential risks in this domain. We would like to mention the concept of responsible AI, which guides us to integrate fairness, interpretability, privacy, security and accountability into the design of AI systems. We suggest that researchers take a people-centered approach to the research, development and deployment of AI, and cultivate a responsible-AI-ready culture.

We compare the performance of StemGNN with other state-of-the-art models on ten public datasets, ranging from the traffic, energy and electrocardiogram domains to the COVID-19 domain. Among all the datasets, only the datasets from the traffic domain provide an a priori topology. Table 1 shows the statistics of these datasets.

Traffic Forecasting. These datasets are collected by the Caltrans Performance Measurement System (PeMS) [4] and the loop detectors on the highways of Los Angeles County (METR) [15]. The monitoring data is aggregated into 5-minute intervals from 30-second data samples, which means there are 12 points in the flow data for each hour. We evaluate the performance of traffic flow forecasting on PEMS03, PEMS07 and PEMS08, and traffic speed forecasting on PEMS04, PEMS-BAY and METR-LA.

Energy Forecasting. We consider two datasets in this respect. (1) Solar: it contains the photovoltaic production of 137 stations in Alabama State [17], sampled every 10 minutes. (2) Electricity: it contains hourly time-series of electricity consumption from 370 customers [2].

Electrocardiogram Forecasting. We adopt the ECG5000 dataset from the UCR Time-Series Classification Archive [5], which is composed of 140 electrocardiograms (ECGs) with a length of 5000.

COVID-19 Forecasting. This dataset is provided by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [7] and contains daily case reports including the numbers of confirmed, death and recovered cases. We use the daily number of newly confirmed COVID-19 cases as the time-series and select the time-series of 25 countries with severe COVID-19 outbreaks from 1/22/2020 to 5/10/2020 (110 days in total). Specifically, we use the first 60 days for training and the remaining 50 days for testing.

Let X̂_t and X_t be the predicted and ground truth values at timestamp t respectively, and let T be the total number of timestamps. The evaluation metrics we use in the experiments can be computed by:

MAE = (1/T) Σ_{t=1}^{T} |X̂_t − X_t|,
MAPE = (100% / T) Σ_{t=1}^{T} |X̂_t − X_t| / |X_t|,
RMSE = sqrt( (1/T) Σ_{t=1}^{T} (X̂_t − X_t)² ).

FC-LSTM [32]: FC-LSTM forecasts univariate time-series with fully-connected LSTM hidden units. The source code can be found at https://github.com/farizrahman4u/seq2seq.
We use 4 stacked LSTM cells with 1000 hidden units; other detailed settings can be found in [32].

SFM [39]: SFM improves the LSTM model by breaking down the cell states of a given univariate time-series into a series of different frequency components. We use SFM with a hidden dimension of 50 and a frequency dimension of 10. Other default configurations are given in the source code: https://github.com/z331565360/State-Frequency-Memory-stock-prediction.

N-BEATS [24]: N-BEATS proposes a deep neural architecture based on backward and forward residual links and a very deep stack of fully-connected layers, without using time-series domain knowledge. We use the open source code from https://github.com/philipperemy/n-beats and only modify the data I/O interface for different input shapes. In our experiments, the backcast length is 10 and the number of hidden units is 128. As recommended, we turn on the 'share_weights_in_stack' option.

LSTNet [17]: LSTNet takes advantage of a convolution layer to discover local dependency patterns among multi-dimensional input variables, and a recurrent layer to capture complex long-term dependency patterns. We use the open source code from https://github.com/fbadine/LSTNet and modify the input data shapes. In our experiments, the number of output filters in the CNN layer is 100 and the CNN filter size is 6. For other experimental settings, we follow the paper and the default values in the code.

DCRNN [20]: DCRNN is a deep learning framework for traffic forecasting that incorporates both spatial and temporal dependencies in the traffic flow. Some of the results of DCRNN are directly reported in [20, 31], and we use the source code at https://github.com/liyaguang/DCRNN when reproduction is necessary. In our experiments, the horizon size is 12 and the number of RNN layers is 2, with 64 units each. DCRNN is not applicable in scenarios without an a priori topology; thus, DCRNN is only used to forecast the traffic data.

STGCN [38]: STGCN is a novel deep learning framework for traffic prediction, integrating graph convolution and gated temporal convolution through spatio-temporal convolutional blocks. The performance of STGCN can be found in [38] and the source code is available at https://github.com/VeritasYin/STGCN_IJCAI-18. We use 12 historical steps to forecast future data, with a batch size of 50, 50 training epochs and a learning rate of 0.001. STGCN is not applicable to scenarios without an a priori topology.

TCN [3]: TCN combines best practices such as dilations and residual connections with causal convolutions for autoregressive prediction. We take the source code from https://github.com/locuslab/TCN. We use a configuration similar to the polyphonic music task mentioned in the paper, where the kernel size is 5, the gradient clip is 0.2, the upper epoch limit is 100 and the initial learning rate is 0.001.

DeepState [26]: This model marries state space models with deep recurrent neural networks. First, it uses a recurrent neural network to calculate h_t = RNN(h_{t−1}, x_t). Then, the model uses h_t to calculate the parameters of the state space, Θ_t = Φ(h_t). Finally, the likelihood p_ss(z_{1:T} | Θ_{1:T}) is calculated and the parameters are learned by maximizing the log-likelihood. DeepState is integrated in Gluon Time Series (GluonTS), the Gluon toolkit for probabilistic time-series modeling. The tutorials can be found at https://gluon-ts.mxnet.io/. We use the default configuration given by the toolkit and only change the sampling frequency as given in Table 1.
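Schematically, the recurrence described above can be sketched as follows (a simplified illustration with a Gaussian emission; the names are ours, not GluonTS internals):

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=1, hidden_size=16, batch_first=True)
phi = nn.Linear(16, 2)   # maps h_t to state-space parameters Theta_t

x = torch.randn(1, 24, 1)        # one univariate series of length 24
h, _ = rnn(x)                    # h_t = RNN(h_{t-1}, x_t)
theta = phi(h)                   # Theta_t = Phi(h_t), here (mean, log-scale)
mean, log_scale = theta.unbind(-1)
# p(z_{1:T} | Theta_{1:T}) under a simple Gaussian emission; training
# maximizes this log-likelihood over the parameters of rnn and phi.
ll = torch.distributions.Normal(mean, log_scale.exp()).log_prob(x.squeeze(-1)).sum()
```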
GraphWaveNet [35]: GraphWaveNet is a method that represents each node's network neighborhood via a low-dimensional embedding by leveraging heat wavelet diffusion patterns. The results of GraphWaveNet are reported in [35, 37, 31], and the source code can be found at https://github.com/nnzhan/Graph-WaveNet. We turn on the 'add graph convolution layer' option when reproducing results on some datasets. We set the weight decay rate as 0.0001 and the dropout rate as 0.3. Other configurations follow the options recommended in the paper. We use the a priori topology when it is available (traffic forecasting); this method also works in scenarios without an a priori topology (energy, electrocardiogram and COVID-19 forecasting).

DeepGLO [30]: The kernel size is set as 7 for both the hybrid model and the local model, and the learning rate is set as 0.005. We report the best results from the normalized and unnormalized settings in the paper. Please refer to the original publications for more detailed descriptions and settings.

We conduct all our experiments on one NVIDIA GeForce GTX 1080 GPU. We divide each dataset into three parts for training, validation and testing, according to [11] (PEMS03, PEMS04, PEMS08), [38] (PEMS07) and [20] (METR-LA, PEMS-BAY, Solar, Electricity, ECG). The inputs of ECG are normalized by min-max normalization following [5]. The other inputs are normalized by the Z-Score method [24]; that is, StemGNN is trained on normalized input where each time-series in the training set is re-scaled as X_in = (X_in − µ(X_in)) / σ(X_in), where µ and σ denote the mean and standard deviation respectively. The evaluation of the Solar, Electricity and ECG datasets is performed on the re-scaled data following [5] and [26]: we first use the normalization algorithm to transform Solar, Electricity and ECG into a value range of [0, 1] and apply StemGNN to generate the forecasting values; afterwards, the predictions are transformed back to the original scale and the metrics are calculated on the original data.

In StemGNN, the dimension of the self-attention layer is 32, chosen from a search space of [16, 32, 64, 128] on the validation data. The channel size of each graph convolution layer is 64, chosen from a search space of [16, 32, 64, 128], and the kernel size of the 1D convolution is 3, selected from a search space of [3, 6, 9, 12]. The batch size is 50. The learning rate is initialized as 0.001 and decays with rate 0.7 after every 5 epochs. The total number of training epochs is set as 50. In the traffic datasets (METR-LA, PEMS-BAY, PEMS03, PEMS04, PEMS07, PEMS08), the data is aggregated every 5 minutes, so the number of timestamps per day is 288. For the traffic speed forecasting task, we use one hour of historical data to predict the next 15 minutes [20, 38]; for the traffic flow forecasting task, we use one hour of historical data to predict the values in the next hour [11]. The Solar dataset is aggregated every 10 minutes; following [26, 30], we forecast the trend of the next half hour with 4 hours of historical data. For the Electricity data, we follow [26, 30] and use 24 hours of historical data to infer the values in the next 3 hours. For the ECG5000 dataset, following [24], we set the forecasting step as 3 and the sliding window size as 12. For the COVID-19 dataset, we forecast the trend of the next 4 weeks, which means the maximum forecasting step is 28 (days). Besides, we average MAE, MAPE and RMSE over the predicted time period to evaluate StemGNN and all baselines.
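For reference, these three metrics, averaged over the predicted period as stated above, amount to a few lines of NumPy (a direct sketch of the standard definitions from the metrics appendix):

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_pred - y_true)))

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Percentage error; assumes no exact zeros in y_true.
    return float(np.mean(np.abs((y_pred - y_true) / y_true)) * 100)

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Arrays of shape (H, N): H forecast steps for N series; each metric is
# averaged over all steps and series.
y_true, y_pred = np.random.rand(3, 25) + 0.1, np.random.rand(3, 25) + 0.1
print(mae(y_true, y_pred), mape(y_true, y_pred), rmse(y_true, y_pred))
```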
E.1 More results on METR-LA and COVID-19

In order to prove that there is a steady improvement for multi-step forecasting, we choose METR-LA and COVID-19 for an evaluation over a longer time span. As shown in Table 6, StemGNN achieves excellent performance in multi-step forecasting scenarios. In particular, we use the COVID-19 data to forecast the number of infected people in the next 1-4 weeks, which is of great significance in helping relevant departments make decisions. Compared to other solutions, StemGNN reduces the time-dependent error accumulation and improves the performance of long-term forecasting.

E.2 Ablation Study

• w/o Latent Correlations (LCs). We use the a priori topology instead of automatic correlations. As shown in Table 7, dynamic latent correlations perform even better than a static a priori topology. The reason may be that an a priori topology is static, while StemGNN is capable of building a topology for each sliding window dynamically, which captures the newest knowledge about the interactions between different time-series.

• w/o Spe-Seq Cell. This setting is not equipped with the Spe-Seq Cell. It performs the worst among all settings, indicating that temporal dependency is the most important clue for time-series forecasting.

• w/o DFT. This setting removes the DFT and inverse DFT operators in the Spectral Sequential Cell; thus, temporal dependencies are modeled in the time domain. It shows improvement over the naive baseline without the Spe-Seq Cell, but under-performs StemGNN by a large margin, which proves the benefit of DFT.

• w/o GFT. We no longer use the entire StemGNN cell, but only the Spectral Sequential Cell. The performance drops significantly, which shows the necessity of capturing latent correlations through the Graph Fourier Transform.

• w/o Residual. This setting has two stacked StemGNN blocks without a residual connection. It verifies that the second block learns supplementary information through the residual connection.

• w/o Backcasting. This model disables the backcasting branch, showing the benefit of the backcasting module for enhancing the time-series representation.

F Analysis

F.1 Efficiency Analysis

Efficiency results are reported in Table 8. StemGNN is similar in training time to the first-order approximate graph convolution model (STGCN), but our performance is improved significantly. Compared to other baselines, StemGNN has a superior training speed, and the inference speed leads to the same conclusion. The learning curves for StemGNN and major baselines on the METR-LA dataset are shown in Figure 5, where the x-axis denotes wall-clock time and the y-axis denotes validation RMSE. It shows that StemGNN has an effective training procedure and achieves a better RMSE score at convergence than other SOTA models with comparable training time.

F.2 Case Study on COVID-19

To investigate the usability and interpretability of StemGNN, we conduct a detailed analysis on the COVID-19 data. We take the daily number of newly confirmed cases as time-series and choose 25 countries with severe outbreaks as the multivariate input. Figure 6(a) shows a visualization of the inter-series correlations captured automatically by our model. In this figure, row i represents the correlation strength between country i and other countries. As we can see, the correlations are not uniform across countries. This indicates that some nodes are closely related to neighboring countries while only weakly related to others. This is reasonable, since countries on the same continent have higher correlations in population mobility and related policies.
Therefore, our model not only obtains the best forecasting performance, but also shows the advantage of interpretability. To prove that the conversion of the graph Fourier transform over multivariate time-series data is effective, we first visualize the matrix of eigenvectors (U), obtained by the decomposition of the normalized Laplacian matrix L, in Figure 6(c). Each column of U represents an eigenvector corresponding to an eigenvalue, sorted from the highest to the lowest. We select the three eigenvectors with the largest eigenvalues (u_0, u_1, u_2) and visualize the corresponding time-series after GFT for further analysis. As shown in Figure 6(c), u_0 captures the general trend of countries across the world, u_1 learns the major trend of Asian countries, and u_2 learns the common trend of South American countries. As illustrated in Figure 6(b), the three components capture these three trends respectively, and the time-series in the spectral space are relatively smooth [22] compared to the original data, reducing the difficulty of forecasting. Thus, it is clear that the graph Fourier transform can better leverage the relationships learned by the latent correlation layer and make forecasting easier through feature smoothness. Moreover, IDFT also helps to improve the smoothness of time-series and leads to better generalization (Figure 6(b)).

Finally, Figure 6(d) shows the forecasting results for several exemplar countries and demonstrates the feasibility of StemGNN. In the figure, 'Canada' represents the ground truth; 'Canada1' means forecasting 1 day in advance; 'Canada7' means forecasting 7 days in advance; and similarly for the other countries. Figure 6 and Figure 7 compare the results of StemGNN and two major baselines, which use historical data to forecast the number of confirmed cases 1 day in advance (denoted by *-1) and 1 week in advance (denoted by *-7), respectively. We select several typical countries, and the figures show that StemGNN predicts future trends more accurately. Thanks to the graph Fourier transform and the Spectral Sequential Cell, StemGNN captures the major trends more smoothly and predicts changes in the data more promptly. The turning points predicted by other baselines have larger time delays compared to StemGNN.

References

[1] Learning on graph with Laplacian regularization.
[2] UCI machine learning repository.
[3] An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.
[4] Freeway performance measurement system: Mining loop detector data.
[5] The UCR time series classification archive.
[6] On the properties of neural machine translation: Encoder-decoder approaches.
[7] COVID-19 data repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
[8] Language modeling with gated convolutional networks.
[9] Convolutional neural networks on graphs with fast localized spectral filtering.
[10] A time-vertex signal processing framework: Scalable processing and meaningful representations for time-series on graphs.
[11] Attention based spatial-temporal graph convolutional networks for traffic flow forecasting.
[12] Long short-term memory.
[13] Another look at measures of forecast accuracy.
[14] Forecasting time series with VARMA recursions on graphs.
[15] Big data and its technical challenges.
[16] Semi-supervised classification with graph convolutional networks.
[17] Modeling long- and short-term temporal patterns with deep neural networks.
[18] Social-WaGDAT: Interaction-aware trajectory prediction via Wasserstein graph double-attention network.
[19] EvolveGraph: Multi-agent trajectory prediction with dynamic relational reasoning.
[20] Diffusion convolutional recurrent neural network: Data-driven traffic forecasting.
[21] Stationary time-vertex signal processing.
[22] Birds of a feather: Homophily in social networks.
[23] FFORMA: Feature-based forecast model averaging.
[24] N-BEATS: Neural basis expansion analysis for interpretable time series forecasting.
[25] Predicting stock market index using fusion of machine learning techniques.
[26] Deep state space models for time series forecasting.
[27] Recurrent neural network and a hybrid model for prediction of stock returns.
[28] Time-series anomaly detection service at Microsoft.
[29] DeepAR: Probabilistic forecasting with autoregressive recurrent networks.
[30] Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting.
[31] Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting.
[32] Sequence to sequence learning with neural networks.
[33] Forecasting sales by exponentially weighted moving averages.
[34] A comprehensive survey on graph neural networks.
[35] Graph WaveNet for deep spatial-temporal graph modeling.
[36] Convolutional LSTM network: A machine learning approach for precipitation nowcasting.
[37] 3D graph convolutional networks with temporal graphs: A spatial information free framework for traffic forecasting.
[38] Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting.
[39] Stock price prediction via discovering multi-frequency trading patterns.
[40] Deep learning on graphs: A survey.
[41] Graph neural networks: A review of methods and applications.