key: cord-0936302-8v7qym9m authors: Spadon, Gabriel; Hong, Shenda; Brandoli, Bruno; Matwin, Stan; Rodrigues-Jr, Jose F.; Sun, Jimeng title: Pay Attention to Evolution: Time Series Forecasting with Deep Graph-Evolution Learning date: 2020-08-29 journal: IEEE transactions on pattern analysis and machine intelligence DOI: 10.1109/tpami.2021.3076155 sha: 12d2db4253abeeeb6b644f9ef1c250c2e53a95c2 doc_id: 936302 cord_uid: 8v7qym9m Time-series forecasting is one of the most active research topics in artificial intelligence. Applications in real-world time series should consider two factors for achieving reliable predictions: modeling dynamic dependencies among multiple variables and adjusting the model's intrinsic hyperparameters. A still open gap in that literature is that statistical and ensemble learning approaches systematically present lower predictive performance than deep learning methods. They generally disregard the data sequence aspect entangled with multivariate data represented in more than one time series. Conversely, this work presents a novel neural network architecture for time-series forecasting that combines the power of graph evolution with deep recurrent learning on distinct data distributions; we named our method Recurrent Graph Evolution Neural Network (ReGENN). The idea is to infer multiple multivariate relationships between co-occurring time-series by assuming that the temporal data depends not only on inner variables and intra-temporal relationships (i.e., observations from itself) but also on outer variables and inter-temporal relationships (i.e., observations from other-selves). An extensive set of experiments was conducted comparing ReGENN with dozens of ensemble methods and classical statistical ones, showing sound improvement of up to 64.87% over the competing algorithms. Furthermore, we present an analysis of the intermediate weights arising from ReGENN, showing that by looking at inter and intra-temporal relationships simultaneously, time-series forecasting is majorly improved if paying attention to how multiple multivariate data synchronously evolve. Time series refers to the persistent recording of a phenomenon along time, a continuous and intermittent unfolding of chronological events subdivided into past, present, and future. In the last decades, time series analysis has been vital to predict dynamic phenomena on a wide range of applications, such as climate change [1] - [4] , financial market [5] - [7] , land-use monitoring [8] - [10] , anomaly detection [11] - [13] , energy consumption, and price forecasting [14] - [16] , apart from epidemiology and healthcare-related studies [17] - [22] . On such applications, an effective data-driven decision requires precise forecasting based on time series [23] . A prime example is the SARS-CoV-2, COVID-19, or Coronavirus Pandemic [24] , which is known to be highly contagious and cause increased pressure on healthcare systems worldwide [25] . In this case, time-series analysis plays a vital role in planning a safe retake of fundamental activities by preventing economic systems' collapse. Time series can be regarded as univariate or multivariate describing, respectively, single and multiple variables varying over time [26] . Recent techniques in time series have roots in the use of Artificial Neural Networks [27] , which contain a non-linear functioning that enables it to outperform classical algorithms [28] . Such techniques evolved into deep learning models for timeseries forecasting, such as Haoyi et al. 
[29], who used an Informer component to enhance long-sequence time-series predictions but disregarded the inter-dependencies existing within different multivariate time-series, besides others from the spatiotemporal forecasting field. For example, Seo et al. [30] proposed the Graph Convolutional Recurrent Network (GCRN) for graph-structured and time-varying data by combining a Convolutional Neural Network (CNN) [31], which identifies spatial structures, and a Recurrent Neural Network (RNN) [32], which learns dynamic patterns; Li et al. [33] introduced a spatiotemporal model for traffic forecasting with the Diffusion Convolutional Recurrent Neural Network (DCRNN); and Zhang et al. [34] presented a Gated Attention Network (GaAN) for forecasting traffic speed using a Graph Gated Recurrent Unit (GGRU) with a CNN controlling the attention heads' importance. Further contributions in the spatiotemporal field [35] - [37] employ traditional recurrent units, attention mechanisms, and even the Graph Convolution Network (GCN) [38]. However, because they are designed to deal with spatial data, those models often fail to frame temporal dependencies, as they aim to achieve state-of-the-art generalization across space and time at once. Moreover, the LSTNet [39] encodes short-term data into low-dimensional vectors by using a CNN for later decoding through an RNN; leverages a recurrent-skip Gated Recurrent Unit for capturing long-term dependencies within the temporal data; and incorporates an Autoregressive (AR) component in parallel to the non-linear neural network for preserving the scale of the output. Similarly, the DSANet [40] integrates an AR component with a dual self-attention network [41] with parallel convolutional components, a versatile idea for modeling global and local temporal patterns. More recently, the MLCNN [42] proposed the use of short- and long-term prediction strategies for modeling temporal behavior through a multi-layer CNN together with Long Short-Term Memory (LSTM) [43] recurrent units. However, although LSTNet, DSANet, and MLCNN are cutting-edge multivariate time-series forecasting algorithms, they do not explicitly address per-variable and inter-time-series dependencies, which weakens their forecasting ability in the face of higher-dimensional data. Therefore, the state-of-the-art in time-series forecasting is bounded to a bidimensional space in which we understand the forecasting process as a non-linear function between time and variables. Differently, we hypothesize that time-series depend on their inner variables, which are observations from themselves, and on outer variables provided by different time-series that share the same timestream. For instance, the evolution of a biological species is not solely related to observations from itself but also from other species that share the same habitat, as they are all part of the same food chain.

Fig. 1: A multiple multivariate time-series forecasting problem, where each multivariate time-series (i.e., sample) shares the same domain, timestream, and variables. When stacking the time-series together, we assemble a tridimensional tensor with the axes describing samples, timestamps, and variables. The multiple samples have equal variables recorded during the same timestamps, meaning that samples are unique but all observed in the same way. By tackling the problem altogether, we leverage inner and outer variables besides intra- and inter-temporal relationships to improve forecasting.
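As a concrete illustration of the tensor assembly described in Fig. 1, the minimal sketch below stacks several multivariate time-series that share the same variables and timestamps into a samples × timestamps × variables tensor. The shapes, sample count, and variable names in the comments are illustrative assumptions, not the paper's actual data pipeline.

```python
import numpy as np

# Hypothetical setup: 3 samples (e.g., countries), 120 timestamps (days),
# and 3 variables (e.g., cases, recovered, deaths) observed synchronously.
n_samples, n_timestamps, n_variables = 3, 120, 3

# Each multivariate time-series is a (timestamps x variables) matrix.
series = [np.random.rand(n_timestamps, n_variables) for _ in range(n_samples)]

# Stacking the co-occurring series yields the tridimensional tensor
# with axes (samples, timestamps, variables), as depicted in Fig. 1.
tensor = np.stack(series, axis=0)
assert tensor.shape == (n_samples, n_timestamps, n_variables)
```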
The time series gains an increased dimensionality by considering the variables and the dependency aspect during the analysis. Consequently, a previously bidimensional problem, in which a model's forecasting ability comes from observing relationships of variables over time, now becomes tridimensional, where forecasting means understanding the entanglement between variables of different time-series that co-occur in time. Accordingly, time-series define an event that is not a consequence of a single chain of observations but a set of synchronous observations of many time-series. For example, during the Coronavirus Pandemic, it is paramount to understand the disease's time-aware behavior in every country. Despite progressing at different moments and locations, the pandemic's underlying mechanisms are supposed to follow similar (and probably interconnected) patterns. Along these lines, looking individually at the development of the pandemic in each country, one can describe the problem in terms of multiple variables, like the number of confirmed cases, recovered people, and deaths. However, when looking at all countries at once, the problem yields an additional data dimension, and each country becomes a multivariate sample of a broader problem, as depicted in Fig. 1. We refer to such a problem as multiple multivariate time-series forecasting. In line with these premises, in this study, we contribute an unprecedented neural network that emerges from a graph-based time-aware auto-encoder with linear and non-linear components working in parallel to forecast multiple multivariate time-series simultaneously, named the Recurrent Graph Evolution Neural Network (REGENN). We refer to evolution as the natural progression of a process where the neural network iteratively optimizes a graph representing observations from the past until it reaches an evolved version of itself that generalizes on future data still to be observed. Accordingly, the underlying network structure of REGENN is powered by two Graph Soft Evolution (GSE) layers, a further contribution of this study. GSE stands for a graph-based learning-representation layer that enhances the encoding and decoding processes by learning a shared graph across different time-series and timestamps. The results we present are based on an extensive set of experiments, in which REGENN surpassed a set of 49 competing algorithms from the fields of deep learning, machine learning, and time-series, among which are single-target, multi-output, and multi-task regression algorithms in addition to univariate and multivariate time-series forecasting algorithms. Aside from surpassing the state-of-the-art, REGENN remained effective after three rounds of 30 ablation tests through distinct hyperparameters. All experiments were carried out over the SARS-CoV-2, Brazilian Weather, and PhysioNet datasets. In the task of epidemiology modeling on the SARS-CoV-2 dataset, we had improvements of at least 64.87%. We improved climate forecasting on the Brazilian Weather dataset by at least 11.96% and patient monitoring in intensive care units on the PhysioNet dataset by 7.33%. Furthermore, we analyzed the results using the Evolution Weights from the GSE layers, which are the intermediate hidden adjacency matrices that arise from the graph-evolution process after going through the cosine similarity activation, showing that graphs shed new light on the understanding of non-linear black-box models.
Since multiple multivariate time-series is an ascending research topic, we understand REGENN has implications in multiple domains, like economics, social sciences, and biology, in which different time-series share the same timestream and co-occur in time, mutually influencing one another. In order to present our contributions, this paper is further divided into four sections. We begin by proposing a layer and neural network architecture, besides detailing the methods used along with this study. Subsequently, we display the experimental results compared to previous literature. Next, we provide an overall discussion on our proposal and the achieved results. Finally, we present the conclusions and final remarks. Alongside, the Supplementary Material presents extended methods and additional results. Hereinafter, we use bold uppercase letters to denote multidimensional matrices (e.g., X), bold lowercase letters to vectors (e.g., x), and calligraphic letters to sets (e.g., X ). Matrices, vectors, and sets can be used with subscripts. For example, the element in the i-th row and j-th columns of a matrix is X ij , the i-th element of a vector is x i , and the j-th element of a set is X j . The transposed matrix of X ∈ R m×n is X T ∈ R n×m , and the transposed vector of x ∈ R m×1 is x T ∈ R 1×m , where m and n are arbitrary dimensions. We display in Tab. 1 a summary of all context-specific notations. Graph Soft Evolution (GSE) stands for a representationlearning layer that, given a training dataset, builds a graph in the form of an adjacency matrix, as in Fig. 2 . The GSE layer receives no graph as input but a set of multiple multivariate time-series. The graph is built by tracking pairs of co-occurring variables, one sample at a time, and merging the results into a single co-occurrence graph shared among samples and timestamps. We define co-occurring variables as two variables, from a multivariate time-series, with a nonzero value in the same timestamp -in that case, we say one variable influences another and is influenced back. The cooccurrence graph is the projection of a tridimensional tensor, T, T ∈ R s×w×v , into a bidimensional one, A, A ∈ R v×v , describing variables' pair-wise time-invariant relationships. The co-occurrence graph G = V, E is symmetric and weighted. 
It is composed of a set V of |V| nodes equal to the number of variables and another set E of |E| non-directed edges equal to the number of co-occurring variables. A node v ∈ V corresponds to a variable from the multivariate time-series domain, and an edge e ∈ E is an unordered pair u, v ≡ v, u of co-occurring variables u, v ∈ V. The edge weight f(u, v) corresponds to the summation of the values of the variables u, v ∈ V whenever they co-occur in time, i.e., it accumulates T_iju + T_ijv over every sample i and training timestamp j in which both variables are non-zero. We use summation as the graph-merging operator when building A from G because, in the face of a zero-one input, a multiplicative operator would provide values close to zero, division would make those values overgrow toward infinity, and subtraction would turn some of them negative, while summation provides consistently positive values upper bounded by 2 × (s × w), helping to sustain the magnitude of the variables' values. This way, the whole graph is bounded by w, which is the number of timestamps existing in the training portion of the input tensor, and if a pair of variables never co-occurs in the training data, no edge is assigned to the graph, meaning that u, v ∉ E and f(u, v) = 0. The GSE layer for an arbitrary graph is formulated in Eqs. 1.1 to 1.3, whose operations we describe in the following.

Tab. 1: Summary of context-specific notations.
ω ∈ N+ : Sliding window size
w, z ∈ N+ : Number of training and testing (i.e., stride) timestamps
s, t, v ∈ N+ : Number of samples, timestamps, and variables
T ∈ R^(s×t×v) : Tensor of multiple multivariate time-series
Y ∈ R^(s×ω×v) : Batched input of the first GSE and the Autoregression layers
Yα ∈ R^(s×ω×v) : Output of the first GSE layer and input of the encoder layers
Yε ∈ R^(s×ω×v) : Output of the encoder and input of the decoder layers
Yε ∈ R^(s×z×v) : Output from the first recurrent unit and input to the second one
Y ∈ R^(s×z×v) : Output of the second recurrent unit and input of the second GSE layer
Yψ ∈ R^(s×z×v) : Non-linear output yielded by the second GSE layer
Yλ ∈ R^(s×z×v) : Linear output provided by the Autoregression layer
Y ∈ R^(s×z×v) : The final result from merging the linear and non-linear outputs
T ∈ R^(s×z×v) : The ground truth expected from merging the linear and non-linear outputs
G = (V, E) : Graph in which V is the set of nodes and E the set of edges
A ∈ R^(v×v) : Adjacency matrix of co-occurring variables
Aµ ∈ R^(v×v) : Neighbor-smoothed per-variable embedding shared between GSE layers
Aφ ∈ R^(v×v) : Evolved adjacency matrix produced by the second GSE layer
U • V : Batch-wise Hadamard product between matrices U and V
U · V : Batch-wise scalar product between matrices U and V
‖·‖F : The Frobenius norm of a given vector or matrix
ϕ(·) : Dropout regularization function
σg(·) : Sigmoid activation function
σh(·) : Hyperbolic tangent activation function
cos θ(·) : Cosine matrix-similarity activation function
ReLU(·) : Rectified linear unit activation function
Softmax(·) : Normalized exponential activation function

Fig. 2: Graph Soft Evolution representation-learning, in which the set of multiple multivariate time-series is mapped into adjacency matrices of co-occurring variables. The matrices are element-wise summed to generate a shared graph among samples, which, after a linear transformation, goes through a similarity activation function and is scaled by an element-wise multiplication to produce an intermediate hidden adjacency-matrix with similarity properties inherent to the shared graph.
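The following sketch illustrates one possible way to build the shared adjacency matrix A of co-occurring variables from the training portion of the tensor, following the summation rule described above (edge weights accumulate the values of both variables whenever they are simultaneously non-zero). It is an assumption-laden approximation for illustration, not the authors' implementation.

```python
import numpy as np

def co_occurrence_graph(T_train: np.ndarray) -> np.ndarray:
    """Build a (v x v) adjacency matrix from a (s x w x v) training tensor.

    Two variables are considered co-occurring at a given sample and training
    timestamp when both are non-zero; the edge weight then accumulates the
    values of both variables, keeping the matrix symmetric.
    """
    s, w, v = T_train.shape
    A = np.zeros((v, v))
    for i in range(s):                      # one sample at a time
        for j in range(w):                  # one training timestamp at a time
            x = T_train[i, j]               # (v,) observations at this timestamp
            active = np.flatnonzero(x)      # variables with non-zero values
            for u in active:
                for q in active:
                    if u != q:              # no self-loops; symmetric by construction
                        A[u, q] += x[u] + x[q]
    return A
```

With zero-one normalized inputs, each entry of A stays below 2 × (s × w), matching the bound mentioned in the text.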
Fig. 3: Graph Soft Evolution layers assembled for evolution-based learning. In such a case, the first GSE layer's output (i.e., source) feeds further layers of the neural network, whose result goes through the second GSE layer (i.e., target). The GSE, as the last layer, uses neither regularizers nor linear transformations before the output. Instead, it outputs the result of the scalar product between the learned representation and the data propagated throughout the network.

The layer first applies the similarity activation to the shared graph (Eq. 1.1) and then scales the resulting similarities by learning that two variables influence one another on different scales, resulting in A η (Eq. 1.2). Next, in Eq. 1.3, it performs a batch-wise matrix-by-matrix multiplication between the adjacency matrix from Eq. 1.2 and the batched input tensor Y to combine the information from the graph, which generalizes samples and timestamps, with the time-series. The result is followed by a dropout regularizer [44] and another batch-wise matrix-by-matrix multiplication, where the final features from joining both tensors are extracted and forwarded to the next layer of the network. The evolution concept comes from the cooperation between two GSE layers, one at the beginning (i.e., right after the input) and the other at the end (i.e., right before the output) of a neural network, as in the example shown in Fig. 3. As evolution arises from sharing hidden weights between a pair of non-sequential layers, we named this process Soft Evolution. Accordingly, the first layer (i.e., source) aims to learn the weights to scale the matrix and produce A µ. Such a result is the input of the second GSE layer (i.e., target), where it is used for learning the evolved version of the adjacency matrix, referred to as A φ and produced as in Eq. 1.1. Notice that in Fig. 3, the source layer differs from the target one because we disregard the regularizer ϕ, the trainable weights W α, and the bias b α from Eq. 1.3. Those elements aim to enhance the feature-learning process when multiple layers are stacked together; as the last layer, GSE instead provides the output from already-learned features through one last scalar product between the data propagated throughout the network, i.e., Y, and the intermediate hidden adjacency matrix, i.e., A ψ. One can see that the source GSE layer has two constant inputs, the graph and the input tensor, whereas the target GSE layer has two dynamic inputs, the shared graph from the source GSE layer and the data propagated throughout the network. In this work's scope, we use an auto-encoder between the GSE layers to learn data codings from the source layer's output, which are decoded into a representation closest to the expected output and later re-scaled by the target layer. In this sense, while the first layer learns a graph from the training data (i.e., past data), working as a pre-encoding feature-extraction layer, the second one re-learns (i.e., evolves) a graph at the end of the forecasting process based on future data, working as a post-decoding output-scaling layer.

When joining the GSE layers with the auto-encoder, we assemble the Recurrent Graph Evolution Neural Network (REGENN). REGENN is a graph-based time-aware auto-encoder with linear and non-linear components on parallel data-flows working together to provide future predictions based on past observations. The linear component is the autoregression implemented as a feed-forward layer, and the non-linear component is made of an encoder and a decoder module powered by a pair of GSE layers.
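Since the closed-form equations (Eqs. 1.1 to 1.3) are not reproduced here, the sketch below only approximates the behavior described in the text: a linear transformation of the shared graph, a cosine matrix-similarity activation, an element-wise scaling that yields an intermediate hidden adjacency matrix, and a batch-wise multiplication with the propagated tensor followed by dropout. All module and parameter names are assumptions for illustration and do not reproduce the authors' exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftGraphLayer(nn.Module):
    """Rough, assumption-based sketch of a GSE-like layer (source variant)."""

    def __init__(self, n_variables: int, dropout: float = 0.1):
        super().__init__()
        self.graph_linear = nn.Linear(n_variables, n_variables)              # transforms the shared graph
        self.scale = nn.Parameter(torch.ones(n_variables, n_variables))      # learned per-pair scaling
        self.dropout = nn.Dropout(dropout)

    def forward(self, Y: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # Y: (samples, window, variables); A: (variables, variables) shared graph.
        H = self.graph_linear(A)                                  # linear transformation of the graph
        H_norm = F.normalize(H, dim=-1)
        S = H_norm @ H_norm.T                                     # cosine matrix-similarity activation
        A_hidden = S * self.scale                                 # element-wise scaling (intermediate matrix)
        out = torch.matmul(Y, A_hidden)                           # batch-wise product with the time-series
        return self.dropout(out)
```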
Fig. 4: Data diagram of the Recurrent Graph Evolution Neural Network (REGENN), which has a linear component parallel to a non-linear one. The linear component has a feed-forward layer, and the non-linear one has an auto-encoder and two GSE layers. Although equal to the first, the last GSE layer yields an early output as it is not stacked with another layer.

Fig. 4 depicts how these components communicate from the input to the output, and, in the following, we detail their operation. The non-periodical changes and constant progression of the series across time usually decrease the performance of the network. That is because the output scale loses significance compared to the input, which comes from the complexity and non-linear nature of neural networks in time-series forecasting tasks [39]. Following a systematic strategy to deal with such a problem [45], [46], REGENN leverages an Autoregressive (AR) layer working as a linear feed-forward shortcut between the input and output, which, for a tridimensional input, computes Y λ[i, :, j] = Y[i, :, j] · W + b for every sample i and variable j, where W ∈ R^(ω×z) are the weights and b ∈ R^z the bias to be learned (Eq. 2). The output of the linear component, i.e., Y λ ∈ R^(s×z×v), is element-wise added to the non-linear component's output, i.e., Y ψ ∈ R^(s×z×v), to produce the final predictions of the network Y ∈ R^(s×z×v), formally given as Y = Y λ + Y ψ. Subsequently, we describe the auto-encoder that produces the non-linear output of REGENN.

We use a non-conventional Transformer Encoder [41] that employs self-attention to learn an encoding from the features forwarded by the GSE layer. It consists of multiple encoders joined through the scaled dot-product attention into a single set of encodings through multi-head attention. The number of features expected by the Transformer Encoder must be a multiple of the number of heads in the multi-head attention. Our encoder's non-conventionality comes from the fact that the first GSE layer's output goes through a single scaled dot-product attention on a single-head attention task. That is because the number of features produced by the encoder is equal to the length of the sliding window, and through single-head attention, the window can assume any length. The encoder module is defined in Eqs. 3a to 3d. The self-attention in Eq. 3a is a particular case of the multi-head attention, in which the input query Q, key K, and value V of the scaled dot-product attention, i.e., SOFTMAX(Q · K^T ÷ √d_k) · V, are equal, and d_k is the dimension of the keys. The attention results are followed by a dropout regularization [44], a residual connection [47], and a layer normalization [48], as in Eq. 3b, to ensure generalization. The first two layers work to avoid overfitting and gradient vanishing, while the last one normalizes the output such that the samples within the input have zero mean and unit variance, i.e., γ(∆(Y ε + ϕ(Y ε))) + β, where ∆ is the normalization function, and γ and β are parameters to be learned. After, in Eq. 3c, the intermediate encoding goes through a double linear layer, a point-wise feed-forward layer, which, in this case, consists of two linear transformations in sequence with a ReLU activation in between, having the weights W ε, W ι and biases b ε, b ι as optimizable parameters. Finally, the transformed encoding goes through one last set of generalizing operations, as shown in Eq. 3d. The resulting encoding Y ε ∈ R^(s×ω×v) is a tensor with the time-axis length matching the size of the sliding window ω. The previous encoding will be decoded by two sequence-to-sequence layers, which are Long Short-Term Memory (LSTM) [43] units.
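Before the gate-level details that follow, the sketch below gives a minimal picture of the two-pass decoding: the first LSTM runs along the time-axis (treating each variable of each sample as a univariate sequence) and maps the window-sized encoding to a stride-sized output, and the second runs along the variable-axis of the partially-decoded encoding with a residual connection. Shapes, reshaping choices, and layer sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TwoAxisDecoder(nn.Module):
    """Assumption-based sketch of decoding over the time-axis and variable-axis."""

    def __init__(self, window: int, stride: int, n_variables: int, hidden: int = 64):
        super().__init__()
        self.time_lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.time_proj = nn.Linear(window * hidden, stride)   # window-sized input -> stride-sized output
        self.var_lstm = nn.LSTM(input_size=n_variables, hidden_size=n_variables, batch_first=True)
        self.stride = stride

    def forward(self, enc: torch.Tensor) -> torch.Tensor:
        s, w, v = enc.shape                                    # (samples, window, variables)
        # Pass 1 (time-axis): each variable of each sample is a univariate sequence.
        x = enc.permute(0, 2, 1).reshape(s * v, w, 1)          # (samples*variables, window, 1)
        h, _ = self.time_lstm(x)                               # (samples*variables, window, hidden)
        y = self.time_proj(h.reshape(s * v, -1))               # (samples*variables, stride)
        y = y.reshape(s, v, self.stride).permute(0, 2, 1)      # (samples, stride, variables)
        # Pass 2 (variable-axis): translate patterns across variables at each timestamp,
        # with a residual connection to the partially-decoded encoding.
        z, _ = self.var_lstm(y)                                # (samples, stride, variables)
        return z + y
```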
The decoder operates on two of the tridimensional axes of the encoding, the time-axis and the variable-axis, one at a time. Under the sample- and time-axis, the time-axis decoder tracks temporal dependencies, looks for a set of weights that generalizes across variables, and translates the window-sized input into a stride-sized output, as described in Eq. 4, in which v is the v-th variable of the t-th time-series group, and the weights W o ∈ R^z are parameters to be learned. In Eq. 4, we refer to f as the forget gate's activation vector, i as the input and update gate's activation vector, o as the output gate's activation vector, C as the cell input activation vector, C as the cell state vector, and h as the hidden state vector. The last hidden state vector goes through a dropout regularization ϕ before the next LSTM in the sequence. Under the time- and variable-axis, the next recurrent unit decodes the variable-axis from the partially-decoded encoding, searches for a set of weights that generalizes across time, and translates patterns arising from different variables on the same timestamp. The set of variables within the time-series does not necessarily imply a sequence, which does not interfere with the decoding process as long as the variables are always kept in the same order; a common approach used with boosting trees, such as the XGBoost [49] algorithm. The second LSTM, in which t is the t-th timestamp of the v-th variable group, is algebraically defined in Eq. 5, where Y ε is the partially-decoded encoding and the weights W o ∈ R^z are internal parameters. The description of the notations within Eq. 4 holds for Eq. 5. Their difference is the residual connection with the partially-decoded encoding Y ε at the last hidden state vector after the dropout regularization ϕ. After the decoding is complete, the last recurrent unit's output Y goes through the second GSE layer, producing the non-linear output Y ψ of REGENN.

REGENN operates on a tridimensional space shared between samples, timestamps, and variables. In such a space, it carries out a time-based optimization strategy. The training process iterates over the time-axis of the dataset, showing to the network how the variables within a subset of time-series behave as time goes by and later repeating the process through subsets of different samples in distinct batches of preset window size. However, the shared graph A is built during the first epoch of the training phase. Such a process happens in parallel to the optimization process, impacting neither training stability nor convergence. After that point, A remains as it is during training and testing. The network's weights are shared among the entire dataset and optimized towards the best generalization simultaneously across samples, timestamps, and variables. We used Adam [50], a gradient descent-based algorithm, to optimize the model and, as the optimization criterion, the Mean Absolute Error (MAE), which is a generalization of the Support Vector Regression [51] with a soft-margin criterion, where Ω is the set of internal parameters of REGENN, Y is the network's output, and T is the ground truth. Because the SARS-CoV-2 data behaves as a streaming dataset, we adopted a transfer-learning approach to train the network on that dataset. Transfer learning shares knowledge across different domains by using the pre-trained weights of another neural network. The approach we adopted, although different, resembles Online Deep Learning [52].
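Putting the pieces above together, the sketch below shows, under simplified assumptions, how a linear autoregressive shortcut and a generic non-linear branch can be merged (Y = Yλ + Yψ) and optimized with Adam under the MAE criterion. The module and function names are hypothetical and this is not the authors' training loop.

```python
import torch
import torch.nn as nn

class ARShortcut(nn.Module):
    """Linear feed-forward shortcut mapping the window axis (w) to the stride axis (z)."""

    def __init__(self, window: int, stride: int):
        super().__init__()
        self.proj = nn.Linear(window, stride)                  # W in R^(w x z), b in R^z

    def forward(self, Y: torch.Tensor) -> torch.Tensor:
        # Y: (samples, window, variables) -> (samples, stride, variables)
        return self.proj(Y.transpose(1, 2)).transpose(1, 2)

def training_step(nonlinear: nn.Module, linear: ARShortcut,
                  optimizer: torch.optim.Optimizer,
                  Y: torch.Tensor, T: torch.Tensor) -> float:
    """One optimization step with the MAE (L1) criterion."""
    optimizer.zero_grad()
    prediction = linear(Y) + nonlinear(Y)                      # merge linear and non-linear outputs
    loss = nn.functional.l1_loss(prediction, T)                # Mean Absolute Error
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, the optimizer would be built over both branches, e.g., torch.optim.Adam(list(linear.parameters()) + list(nonlinear.parameters())).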
The main idea is to train the network on incremental slices of the time-axis, such that the pre-trained weights of a previous slice are used to initialize the weights of the network in the next slice. This technique aims not only to achieve better forecasting performance but also to show that REGENN is superior to the other algorithms throughout the pandemic.

The results are based on three datasets, all of which are multi-sample, multivariate, and vary over time. The first dataset describes the Coronavirus Pandemic, referred to as SARS-CoV-2, made available by John Hopkins University [53]. It describes 3 variables through 120 days for 188 countries and spans from the first day of the pandemic to the day it completed four months of duration. The second one is the Brazilian Weather dataset, collected from 253 sensors during 1,095 days and covering 4 variables. The third dataset is from the 2012 PhysioNet Computing in Cardiology Challenge [54], from which we are using 9 variables across 48 hours recorded from 11,988 ICU patients. The SARS-CoV-2 variables are the numbers of confirmed cases, recovered people, and deaths; the Brazilian Weather variables are solar radiation, rain, and maximum and minimum temperature; and the PhysioNet variables comprise intensive-care measurements such as systolic and diastolic arterial blood pressure and urine output.

The number of datasets that can be used for multiple multivariate time-series forecasting is still limited. Although the ones we experimented with have a small number of variables, there is no theoretical upper bound for the number of samples, timestamps, and variables REGENN can handle. However, as REGENN deals with dense tridimensional tensors, increasing any of the axes (samples × time × variables) in size will make the tensor grow at an almost-cubic rate, quickly overflowing the GPU memory. Accordingly, REGENN ends up suffering from a trade-off between the dataset's size and the available hardware.

The datasets were individually min-max normalized into a zero-one scale on the variable-axis to avoid gradient spikes and speed up training. We used the training data to define the min-max parameters required for normalization, and those parameters were applied to normalize the test data as well. As a consequence of normalizing the entire dataset, the network's output follows a similar scale; therefore, the output was inversely transformed back to the original scale before evaluating the results.

A simplistic yet effective approach to train time-series algorithms is the Sliding Window technique [55], also referred to as Rolling Window. The window size is well known to be a highly sensitive hyperparameter [56], [57]. Consequently, we followed a non-tunable approach, in which we set the window size before the experiments, taking into consideration only the context and domain of the datasets. These values were used across all window-based experiments, including the baselines and ablation tests. It is noteworthy that most of the machine-learning algorithms are not meant to handle time-variant data, such that no sliding window was used in those cases; instead, we considered training timestamps as features and those reserved for testing as multi-task regression labels. For the deep-learning algorithms, we used a window size of 7 days for training and reserved 7 days for validation (between the training and test sets) to predict the last 14 days of the SARS-CoV-2 dataset. The 7-7-14 split comes from the disease incubation period, which is 14 days. On the other hand, we used a window size of 84 days and reserved 28 days for validation to predict the last 56 days of the Brazilian Weather dataset.
The 84-28-56 split is based on the seasonality of the weather data, such that we look at the previous 3 months (a weather-season window) to predict the last 2 months of the upcoming season. Finally, we used a window size of 12 hours for training and 6 hours for validation to predict the last 6 hours of the PhysioNet dataset. The 12-6-6 split comes from the fact that patients in ICUs are in a critical state, such that predictions within 24 hours are more useful than long-term predictions.

Many existing algorithms support neither multi-task nor multi-output regression, which makes them even more limited when the data is tridimensional. The most straightforward yet effective approach we followed to compare them to REGENN was to create a chain of ensembles (see Regressor Chain at https://bit.ly/3hBfxTA). In such a case, each estimator makes predictions in the order specified by the chain, using all of the available features provided to the model plus the predictions of earlier models in the chain. The number of estimators in each experiment varies according to the type of the ensemble and the type of the algorithms, and the final performance is the average of each estimator's performance. For simplicity's sake, we grouped the algorithms as follows: tridimensional-compliant algorithms, which use a single estimator; multivariate algorithms (○), which use s estimators, one for each sample; multi-output and multi-task algorithms (◎), which use v estimators, one per variable; single-target algorithms, which use v×z estimators, one per variable and stride; and univariate algorithms, which use s×v estimators, one for each sample and variable. As time-series forecasting works as a time-aware regression problem, our goal remains to predict values that resemble the ground truth as closely as possible.

REGENN has two hyperparameters able to change the dimensions of the weight tensors, which are the window size (i.e., input size) and the stride size (i.e., output size). As already discussed, both were set before the experiments, and neither was tuned towards any performance improvement. The trade-off of having fewer hyperparameters is spending more energy training the network towards better performance. When we refer to tunable hyperparameters, we mean the network optimizer, gradient clipping, learning-rate scheduler, and dropout regularization. Along these lines, we followed a controlled and limited exploratory approach similar to a random grid-search, starting with PYTORCH's defaults. Tuning was performed on the validation set, intentionally reserved for measuring the network's improvement, by updating the hyperparameters whenever better results were observed, leading us to a set of optimized but not optimal hyperparameters. We used the set of optimized hyperparameters to evaluate REGENN on the test set and the default values for all the other algorithms [49], [58], [59], unless explicitly required for working with a particular dataset, as was the case of LSTNet [39], DSANet [40], and MLCNN [42]. The complete list of hyperparameters is in the Supplementary Material.

The experiments related to machine-learning and time-series algorithms were carried out on a Linux-based system with 64 CPUs and 750 GB of RAM.
The experiments related to deep learning on the SARS-CoV-2 dataset were carried out on a Linux-based system with 56 CPUs, 256 GB of RAM, and 8 GPUs (Titan X Pascal). The Brazilian Weather and PhysioNet datasets were tested on a different system with 32 CPUs, 512 GB of RAM, and 8 GPUs (Titan X Maxwell). While CPU-based experiments are even across all CPU architectures, the same does not hold for GPUs, such that the GPU model and architecture must match to guarantee reproducibility. Aiming at complete reproducibility, we disclose not only the source code of REGENN on GitHub but also the scripts, pre-processed data, and snapshots of all trained networks in a public folder on Google Drive.

Subsequently, we go through the experimental results on each one of the benchmark datasets. Because not all the algorithms perform evenly across them, we display the 34 most prominent ones out of the 49 tested. We also analyze the Evolution Weights arising from the GSE layers by using the cosine similarity on the adjacency matrices of co-occurring variables (see Section 2.2).

Fig. 5: The results confirmed REGENN's superior performance, as it is the algorithm with the lowest error and standard deviation; the improvement in the experiment was no lower than 64.87%. In the image, the algorithms are symbol-encoded based on their type and number of estimators; we use gray arrows to report the standard deviation of the results. The negative deviation, which is equal to the positive one, was suppressed for improved readability.

The SARS-CoV-2 dataset has been updated daily since the beginning of the Coronavirus Pandemic. We used a self-to-self transfer-learning approach to train the network in slices of time due to the dataset's streaming nature. In short, the network was re-trained every 15 days with new incoming data, using the pre-trained weights from the network trained in the previous 15 days as starting weights, so as to evaluate REGENN's performance over the pandemic's progression. In such a case, when the pandemic completed 45 days, the first time-slice on which REGENN was trained, it outperformed the second-placed algorithm, the Orthogonal Matching Pursuit, by 27.27% on the Mean Absolute Error (MAE), 16.50% on the Root Mean Square Error (RMSE), and 38.87% on the Mean Squared Logarithmic Error (MSLE). For all subsequent time-slices (i.e., 60, 75, 90, 105, and 120 days), the second-placed algorithm was the Exponential Smoothing, which was outperformed with an improvement no lower than 47.40% on the MAE, 17.19% on the RMSE, and 37.39% on the MSLE. We further detail the results on the time-slices mentioned above in the Supplementary Material. In Fig. 5, however, we detail the results on the complete dataset of 120 days, in which REGENN surpassed the Exponential Smoothing, the second-best algorithm, by 75.21% on the MAE, 64.87% on the RMSE, and 79.61% on the MSLE.

As a result of analyzing the dataset in time-slices, we noticed that, as time goes by and more information becomes available on the SARS-CoV-2 dataset, the problem becomes more challenging to solve by looking individually at each country and more natural when looking at all of them together. Although countries have their particularities, which make the disease spread in different ways, the main goal is to decrease the spreading, such that similarities between the historical data of different countries provide for finer predictions. Furthermore, we observed that not all the estimators within an ensemble perform in the same way in the face of different countries.
Due to REGENN's capability of observing inter- and intra-relationships between time-series, it performs better on highly uncertain cases like this one.

Subsequently, we present the ablation results, in which we utilized the same data-flow as REGENN but no GSE layers while systematically changing the decoder architecture. We provide results using different Recurrent Units (RU), including the Elman RNN [32], LSTM [43], and GRU [60]. We also varied the recurrent unit's directional flag between Unidirectional (U) and Bidirectional (B). That is because a unidirectional recurrent unit tracks only forward dependencies, while a bidirectional one tracks both forward and backward dependencies. A summarized tag describes each test's network architecture; for example, (E → U RU + B RU) + AR means the model has a Transformer Encoder (E) as the encoder, a Unidirectional Recurrent Unit as the time-axis decoder, and a Bidirectional Recurrent Unit as the variable-axis decoder. Besides that, the output of the decoder is element-wise added to the Autoregression (AR) output. The table shows results with and without the encoder and the AR component, besides using a single recurrent unit only for time-axis decoding. According to the ablation results in Tab. 2, the improvement of REGENN is slightly smaller than previously reported. That is because its performance comes not only from the GSE layer but also from how the network handles the multiple multivariate time-series data. Consequently, the ablation experiments reveal that some models without GSE layers are enough to surpass all the competing algorithms. However, when using REGENN, we can improve them further and achieve an additional reduction of 20.81% on the MAE, 19.77% on the RMSE, and 35.72% on the MSLE.

Fig. 6 shows the Evolution Weights originating from applying the cosine similarity on the hidden adjacency matrices of REGENN. When comparing the input and evolved graphs, the number of cases and deaths have a mild similarity. That might come from the fact that diagnosing infected people was already a broad concern at the beginning of the pandemic. The problem did not go away, but more infected people were discovered as more tests were made, and also because the disease spread worldwide. A similar scenario can be drawn from the number of recovered people and the number of cases, as infected people with mild or no symptoms were unaware of being infected. Contrarily, we can see that the similarity between recovered and deaths decreases over time, which comes from the fact that, as more tests are made, the mortality rate drops to a stable threshold due to the increased number of recovered people.

The Brazilian Weather dataset is a highly seasonal dataset with a higher number of samples, variables, and timestamps than the previous one. For simplicity's sake, in this experiment, REGENN was trained on the whole training set at once.

Fig. 7: In this experiment, REGENN once more outperformed all the competing algorithms, demonstrating versatility by performing well even on a highly-seasonal dataset, with an improvement no lower than 11.95%. In the face of seasonality, the Elman RNN surpassed the Exponential Smoothing, the previously second-best algorithm. In the image, the algorithms are symbol-encoded based on their type and number of estimators; we use gray arrows to report the results' standard deviation. The negative deviation, which is equal to the positive one, was suppressed for improved readability.
The results are in Fig. 7, in which REGENN was the first-placed algorithm, followed by the Elman RNN in second. REGENN overcame the Elman RNN by 11.95% on the MAE, 11.96% on the RMSE, and 25.84% on the MSLE. We noticed that all the algorithms perform quite similarly on this dataset. The major downside for most algorithms comes from predicting small values close to zero, as noted by the MSLE results. In such a case, the ensembles showed a high variance when compared to REGENN. We believe this is why the Elman RNN shows performance closer to REGENN rather than to the Exponential Smoothing, the third-placed algorithm, as REGENN has a single estimator, while the Exponential Smoothing is an ensemble of estimators. Another understanding of why some algorithms underperform on the MSLE is related to their difficulty in tracking temporal dependencies, such as weather seasonality.

The ablation results are in Tab. 3, in which we observed again that the network without the GSE layers already surpasses the baselines. When decommissioning the GSE layers of REGENN and using GRU instead of LSTM on the decoder, we observed a 3.86% improvement on the MAE, 10.02% on the RMSE, and 25.34% on the MSLE when compared to the Elman RNN results. Using REGENN instead, we achieve a further performance gain of 8.42% on the MAE and 2.16% on the RMSE over the ablation experiment.

Fig. 8 depicts the Evolution Weights for the current dataset, in which we can observe a consistent similarity between pairs of variables in the input graph, which does not repeat in the evolved graph, implying different relationships. We observe that the similarity between all pairs of variables increased on the evolved graph. The pairs Solar Radiation and Rain, Maximum Temperature and Rain, and Solar Radiation and Minimum Temperature stood out. Those pairs are mutually related, which comes from solar radiation interfering in the maximum and minimum temperature and in the precipitation factors, where the opposite relation also holds. What can be extracted from the Evolution Weights, in this case, is the notion of importance between pairs of variables, so that the pairs that stood out are more relevant and provide better information during the forecasting process.

The PhysioNet dataset presents a large number of samples and an increased number of variables, but little information on the time-axis, a setting in which ensembles still struggle to perform accurate predictions, as depicted in Fig. 9. Once again, REGENN keeps steady as the first-placed algorithm in performance, showing solid improvement over the Linear SVR, the second-placed algorithm. The improvement was 7.33% on the MAE and 35.13% on the MSLE, while the RMSE achieved by REGENN laid within the standard deviation of the Linear SVR, pointing to an equivalent performance between them. The Linear SVR is an ensemble of multiple estimators, while REGENN uses only one, which makes REGENN more accurate in dealing with the current dataset. As shown in Tab. 4, the ablation results reveal that a neural network architecture without the GSE layers can achieve a better performance than the baseline algorithms. In this specific case, we see that by using a bidirectional LSTM instead of a unidirectional one on the decoder module of the neural network, we can achieve a performance almost as good as REGENN, but not enough to surpass it, as REGENN still shows an improvement of 1.05% on the MAE and 0.98% on the RMSE over the experiment with the bidirectional LSTM. In this specific case, REGENN learns by observing multiple ICU patients.
However, one cannot say that an ICU patient's state is somehow connected to another patient's state. Contrarily, the idea holds as in the first experiment: although the samples are different, they have the same domain, variables, and timestream, such that the information from one sample might help enhance future forecasting for another one. That means REGENN learns both from the past of the patient and from the past of other patients. Nevertheless, we must be careful about drawing any understanding from these results, as the reason each patient is in the ICU is different, and while some explanations might be suited for a small set of patients, they tend not to generalize to a significant number of patients.

Fig. 9: Results in terms of MAE, in which REGENN was the algorithm with the best performance, followed by the Linear SVR. The comparative improvement was no lower than 7.33%, but, in this case, REGENN yielded an RMSE compatible with the Linear SVR. In the image, the algorithms are symbol-encoded based on their type and number of estimators; we use gray arrows to report standard deviation. The negative deviation, which is equal to the positive one, was suppressed for improved readability.

Fig. 10: Evolution Weights for the PhysioNet dataset, in which we use the cosine similarity to compare the relationship between pairs of variables. We use "ABP" as a shortening for Arterial Blood Pressure, "NI" as Non-Invasive, "Dias" as Diastolic, and "Sys" as Systolic.

When analyzing the Evolution Weights in Fig. 10 aided by a physician, we can say that there is a relationship between the amount of urine excreted by a patient and the arterial blood pressure, and also that there is a relation between the systolic and diastolic blood pressure. However, even aided by the Evolution Weights, we cannot further describe these relations, since there are variables of the biological domain that are not being taken into consideration.

We refer to the Evolution Weights as the intermediate weights of the representation-learning process optimized throughout the network's training. Such weights are time-invariant and are a requirement for the feature-learning behind the GSE layer. Although time does not flow through the adjacency matrix, the network is optimized as a whole, such that every operation influences the gradients of the backward-propagation process. That means the optimizer, influenced by the gradients of both time-variant and time-invariant data, will optimize the weights towards a better forecasting ability. Such a process depends not only on the network architecture but also on the reliability of the optimization process. That increases uncertainty, which is the downside of REGENN, demanding more time to train the neural network and causing the improvement not to be strictly monotonic. Consequently, training might take long sessions, even with consistently reduced learning rates on plateaus or simulated-annealing techniques; this is influenced by the fact that the second GSE layer has two dynamic inputs, which arise from the graph-evolution process. However, we observed that, throughout the epochs, the Evolution Weights reach a stable point with no major updates. As a result, the network demonstrates a remarkable improvement in its final iterations, when the remaining weights intensely converge to a near-optimal configuration.
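For reference, the Evolution Weights discussed above can be inspected with a plain cosine matrix-similarity over a hidden adjacency matrix, as in the minimal sketch below; the matrix used here is random and purely illustrative.

```python
import torch
import torch.nn.functional as F

def evolution_weights(A: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between the rows of a (v x v) hidden adjacency matrix."""
    A_norm = F.normalize(A, dim=-1)   # normalize each row to unit length
    return A_norm @ A_norm.T          # symmetric matrix of cosine similarities

# Illustrative usage with a random hidden adjacency matrix for 4 variables.
weights = evolution_weights(torch.rand(4, 4))
```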
Even though REGENN has a particular drawback, it demonstrates excellent versatility, which comes from its superior performance in the task of epidemiology modeling on the SARS-CoV-2 dataset, climate forecasting on the Brazilian Weather dataset, and patient monitoring in intensive care units on the PhysioNet dataset. Consequently, we see REGENN as a tool to be used in data-driven decision-making tasks, helping, for instance, to prevent natural disasters or to prepare for an upcoming pandemic. As a precursor in multiple multivariate time-series forecasting, REGENN still leaves much to be improved. For example, reducing the uncertainty that harms REGENN without decreasing its performance should be the first step, followed by extending the proposal to handle problems in the spatiotemporal field, of great interest to traffic forecasting and environmental monitoring. Another possibility would be to remove the decoder's recurrent layers while tracking the temporal dependencies through multiple graphs, a new temporal-modeling perspective in which one could leverage Graph Convolution Networks for extracting inter-variable relationships and use those as hidden features during the time-series forecasting process, to which further hypothesis constraints may apply. Notwithstanding, in some cases where extensive generalization is not required, the analysis of a single multivariate time-series may be preferred to multiple multivariate time-series. That is because, when focusing on a single series at a time, some but not all samples might yield a lower forecasting error, as the model will be driven to a single multivariate sample. However, both approaches for tackling time-series forecasting can coexist in the state-of-the-art, and, as a consequence, the decision to work on a higher or lower dimensionality must relate to which problem is being solved and how much data is available to solve it.

This paper tackles multiple multivariate time-series forecasting tasks by proposing the Recurrent Graph Evolution Neural Network (REGENN), a graph-based time-aware auto-encoder powered by a pair of Graph Soft Evolution (GSE) layers, a further contribution of this study that stands for a graph-based learning-representation layer. The literature handles multivariate time-series forecasting with outstanding performance, but up to this point, we lacked a technique with increased generalization over multiple multivariate time-series and sound performance. Previous research might have avoided tackling such a problem because a neural network, for that matter, is challenging to train and usually yields poor results; that is because one aims to achieve good generalization on future observations for multivariate time-series that do not necessarily hold the same data distribution. Because of that, REGENN is a precursor in multiple multivariate time-series forecasting, and, even though this is a challenging problem, REGENN surpassed all the baselines and remained effective after three rounds of 30 ablation tests through distinct hyperparameters. The experiments were carried out over the SARS-CoV-2, Brazilian Weather, and PhysioNet datasets with improvements, respectively, of at least 64.87%, 11.96%, and 7.33%.
As a consequence of the results, REGENN opens a new range of possibilities in time-series forecasting, starting by demonstrating that ensembles perform poorly compared to a single model able to learn the entanglement between different variables by looking at how they interact as time goes by and how multiple multivariate time-series synchronously evolve.

GS conceived the experiments, GS prepared the data for experimentation, and GS conducted them. SH and BB validated GS's source-code implementation. SH and BB analyzed the preliminary results. SM, JR, and JS validated the final results. GS wrote the original draft. SH, BB, SM, JR, and JS iteratively polished the main ideas and the paper. All authors reviewed and approved the results and manuscript.

Baselines Notes. Table 1 lists the acronym and full name of all algorithms we tested during the baselines' computation. Tables 2 to 6 present detailed information from the experiments discussed in the main manuscript. The following tables regard the tests using Transfer Learning on the SARS-CoV-2 dataset, in which a new network was trained every 15 days, starting 45 days after the pandemic began and up to 120 days of its duration.

Cosine Similarity. The cosine similarity, which has been widely applied in learning approaches, accounts for the similarity between two non-zero vectors based on their orientation in an inner product space [1]. The underlying idea is that the similarity is a function of the cosine of the angle θ between vectors u = [u_1, u_2, ..., u_n] and v = [v_1, v_2, ..., v_n]. Hence, when cos θ = 1, the two vectors in the inner product space have the same orientation; when cos θ = 0, the vectors are oriented at 90° relative to each other; and when cos θ = −1, the vectors are diametrically opposed. The cosine similarity between the vectors u and v is defined as cos θ(u, v) = (u · v) ÷ (‖u‖ ‖v‖), where u · v = Σ_i u_i v_i denotes the dot product between u and v, ‖u‖ = √(u · u) represents the norm of the vector u, and u_i is the i-th variable of u. In this work's scope, the cosine similarity is used to build similarity adjacency matrices, which measure per-node similarity in a variables' co-occurrence graph. The similarity between two nodes in the graph describes how likely those two variables are to co-occur in time. In this case, the similarity ends up acting as an intermediate activation function, enabling the graph-evolution process by maintaining the relationships' similarity between pairs of nodes. Thus, the cosine-matrix similarity is defined as cos θ(A) = (A · A^T) ÷ (‖A‖ ‖A^T‖), where A · A^T denotes the dot product between the adjacency matrix A and its transpose A^T, while ‖A‖ represents the norm of that same matrix with respect to any of its ranks. The resulting matrix of the cosine-similarity activation is symmetric and is referred to throughout the main manuscript as the Evolution Weights.

Horizon Forecasting. It stands for an approach used for making non-continuous predictions by accounting for a future gap in the data. It is useful in a range of applications by considering, for instance, that recent data is not available or too costly to be collected. Thereby, it is possible to optimize a model that disregards the near future and focuses on the far-away future. However, such an approach abdicates additional information that could be learned from continuous timestamp predictions [2]. By not considering the near past as a variable that influences the near future, we might end up with a non-stochastic view of time, meaning that the algorithm focuses on long-term dependencies rather than on both long- and short-term dependencies.
Along these lines, both the LSTNet [3] and DSANet [4] comply with horizon forecasting, and to make our results comparable, we set the horizon to one on both of them. Thus, we started assessing the test results right after the algorithms' last validation step because, the closer to the horizon, the more accurate the horizon-based models should be.

A simplistic yet effective approach to train time-series algorithms is the Sliding Window technique [5], which is also referred to as Rolling Window (see Fig. 1). Such a technique fixes a window size, which slides over the time-axis, predicting a predefined number of future steps, referred to as the stride. Some time-series studies have used a variant technique known as the Expanding Sliding Window [6, 7]. This variant starts with a prefixed window size, which grows as it slides, showing more information to the algorithm as time goes by. ReGENN holds to the traditional technique as it is bounded to the dimensions of the tensor weights. Those dimensions are of a preset size and cannot be effortlessly changed during training, as doing so comes with increased uncertainty by continuously changing the number of internal parameters, such that a conventional neural network optimizer cannot handle it properly. Nevertheless, the window size of the Sliding Window is well known to be a highly sensitive hyperparameter [8, 9]; to avoid an increased number of hyperparameters, we followed a non-tunable approach, in which we set the window size before the experiments taking into consideration the context of the datasets; such values were kept even across all window-based trials, including the baselines and ablation.

Optimization Strategy. ReGENN operates on a three-dimensional space shared between samples, time, and variables. In such a space, it carries out a time-based optimization strategy. The training process iterates over the time-axis of the dataset, showing to the network how the variables within a subset of time-series behave as time goes by, and later repeating the process through subsets of different samples. The network's weights are shared among the entire dataset and optimized towards best generalization simultaneously across samples, time, and variables. The dataset T ∈ R^(s×t×v) is sliced along the time-axis into a training portion T ∈ R^(s×w×v) and a testing portion T ∈ R^(s×z×v). Once the data is sliced, we follow by using a gradient descent-based algorithm to optimize the model. In this work's scope, we used Adam [10] as the optimizer, as it is the most common optimizer among time-series forecasting problems. As the optimization criterion, we used the Mean Absolute Error (MAE), which is a generalization of the Support Vector Regression [11] with a soft-margin criterion, formalized in terms of w, the set of optimizable parameters, ‖·‖F, the Frobenius norm, and the hyperparameters C and ρ. The idea, then, is to find the w that best fits y_i, x_i ∀ i ∈ [1, n] so that all values are in [ρ + ξ_i, ρ + ξ*_i], where ξ_i and ξ*_i are the two farthest opposite points in the dataset. A similar formulation on the Linear SVR implementation for horizon forecasting was presented by Lai et al. [3]. Due to the higher dimensionality of the multiple multivariate time-series used in this study, in which we consider the time to be continuous, the problem is restated over Ω, the set of internal parameters of ReGENN, Y, the network's output, and T, the ground truth.
When disregarding C and setting ρ to zero, we can reduce the problem to the MAE loss formulation:

$$\mathcal{L}\left(\mathcal{Y}, \mathcal{T}\right) = \frac{1}{n}\sum_{i=1}^{n}\left\lvert\,\mathcal{Y}_{i} - \mathcal{T}_{i}\,\right\rvert,$$

where n is the number of predicted entries. Square- and logarithm-based criteria can also be used with ReGENN. We avoided doing so, as this is a decision that should be made based on each dataset. Instead, we follow the SVR path of evaluating absolute values, which is less sensitive to outliers and enables ReGENN to be applied to a broader range of applications.

Transfer-Learning Approach. We adopted a Transfer Learning approach to train the network on the SARS-CoV-2 dataset, an approach that, although different, resembles Online Deep Learning [12]. The idea is to train the network on incremental slices of the time axis, such that the pre-trained weights from a previous slice are used to initialize the weights of the network in the next slice (see Fig. 2; a short illustrative sketch is also given at the end of this section). This technique aims not only at better network performance but also at showing that ReGENN remains useful throughout the pandemic. Hyperparameter adjustment is usually required when transferring the weights from one network to another, mainly the learning rate; for the list of hyperparameters we used, see Tab. 3. Besides, we deliberately applied a 20% dropout to all tensor weights, outside the network architecture and before starting the training, aiming to insert randomness into the pipeline and avoid local optima. It is worth mentioning that we did not observe any decrease in performance, although the optimizer's convergence was slower in some cases.

Baselines Algorithms. Open-source Python libraries provided the time-series and machine-learning algorithms used in the experiments. Time-series algorithms came from statsmodels, while the machine-learning ones came mostly from Scikit-Learn. Further algorithms, such as XGBoost, LGBM, and CatBoost, have their own dedicated open-source implementations, which were preferred over the others. We used the default hyperparameters in all experiments, performing no fine-tuning. However, because all the datasets we tested are strictly positive, we forced all negative outputs to zero, as a ReLU activation function would do. A list with the names of all algorithms tested in the experiments is provided in Tab. 1, which contains more algorithms than reported in the main paper. That is because we list all algorithms, even those removed from the pipeline for being unable to handle the input data and yielding exceptions.

Table: Hyperparameters for the Ablation Tests.
Table 9: Detailed results for the first three slices of the SARS-CoV-2 dataset.
Legend: Algorithms with the best performance are in bold, the ones noted as '-' yielded exceptions, and those noted as '***' were suppressed due to poor performance.
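Returning to the transfer-learning protocol, the sketch below reuses the weights learned on one temporal slice as the initialization for the next slice, assigns each slice its own learning rate, and zeroes a random 20% of the weights to inject randomness. The model, the slice iterables, the epoch count, and the exact point at which the weight dropout is applied are assumptions for illustration; the released implementation may differ.

```python
import torch

def weight_dropout_(model: torch.nn.Module, p: float = 0.2) -> None:
    """Zero a random fraction p of every weight tensor, in place."""
    with torch.no_grad():
        for param in model.parameters():
            param.mul_(torch.bernoulli(torch.full_like(param, 1.0 - p)))

def train_over_slices(model, slices, learning_rates, epochs=100):
    """Train on incremental time slices, carrying the weights from one slice
    over to the next; each slice receives its own (typically smaller) learning rate."""
    for (x, y), lr in zip(slices, learning_rates):
        weight_dropout_(model, p=0.2)  # randomness to help escape local optima
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = torch.nn.functional.l1_loss(model(x), y)  # MAE criterion
            loss.backward()
            optimizer.step()
    return model
```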
References.
[1] Data Mining Introduction.
[2] Towards better forecasting by fusing near and distant future visions.
[3] Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks.
[4] DSANet: Dual self-attention network for multivariate time series forecasting.
[5] Segmenting Time Series: a Survey and Novel Approach.
[6] Picture fuzzy time series: Defining, modeling and creating a new forecasting method.
[7] Machine learning panel data regressions with an application to nowcasting price earnings ratios.
[8] Input window size and neural network predictors.
[9] Time series prediction and neural networks.
[10] Adam: A method for stochastic optimization.
[11] Support vector method for function approximation, regression estimation, and signal processing.
[12] Online Deep Learning: Learning Deep Neural Networks on the Fly.