title: Evaluation of deep learning models for multi-step ahead time series prediction authors: Chandra, Rohitash; Goyal, Shaurya; Gupta, Rishabh date: 2021-03-26 DOI: 10.1109/access.2021.3085085 Time series prediction with neural networks has been the focus of much research in the past few decades. Given the recent deep learning revolution, there has been much attention on using deep learning models for time series prediction, and hence it is important to evaluate their strengths and weaknesses. In this paper, we present an evaluation study that compares the performance of deep learning models for multi-step ahead time series prediction. The deep learning methods comprise simple recurrent neural networks, long short-term memory (LSTM) networks, bidirectional LSTM networks, encoder-decoder LSTM networks, and convolutional neural networks. We provide a further comparison with simple neural networks that use stochastic gradient descent and adaptive moment estimation (Adam) for training. We focus on univariate time series for multi-step-ahead prediction from benchmark time series datasets and provide a further comparison of the results with related methods from the literature. The results show that the bidirectional and encoder-decoder LSTM networks provide the best accuracy for the given time series problems. Apart from econometric models, machine learning methods have become extremely popular for time series prediction and forecasting in the last few decades [1]-[7]. Some of the popular categories include one-step, multi-step, and multivariate prediction. Recently, some attention has been given to dynamic time series prediction where the size of the input to the model can change dynamically [8]. As the term indicates, one-step prediction refers to the use of a model to make a prediction one step ahead in time, whereas multi-step prediction refers to predicting a series of steps ahead in time from an observed trend in a time series [9], [10]. In the latter case, the prediction horizon defines the extent of future prediction. The challenge is to develop models that produce low prediction errors as the prediction horizon increases, given the chaotic nature of and noise in the dataset [11]-[13]. There are two major approaches for multi-step-ahead prediction: recursive and direct strategies. The recursive strategy uses the prediction of a one-step-ahead model as input for the next prediction horizon [14], [15], so that the error made for one horizon accumulates over future horizons. The direct strategy encodes the multi-step-ahead problem as a multi-output problem [16], [17], which in the case of neural networks can be represented by multiple neurons in the output layer for the prediction horizons. The major challenges in multi-step-ahead prediction include highly chaotic time series and time series with missing data, which have been approached with non-linear filters and neural networks [18]. Neural networks have been popular for time series prediction in various applications [19]. Different neural network architectures have different strengths and weaknesses. Time series prediction requires careful integration of knowledge in temporal sequences; hence, it is important to choose the right neural network architecture and training algorithm.
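To make the distinction between the two strategies concrete, the sketch below contrasts them in Python, assuming a generic one-step model and a multi-output model with a scikit-learn-style predict interface; the function and variable names are illustrative only and are not part of the implementation evaluated in this paper.

```python
import numpy as np

def recursive_forecast(one_step_model, window, horizon):
    """Recursive strategy: each one-step prediction is fed back as input,
    so the error made at one horizon accumulates over later horizons."""
    window = list(window)
    predictions = []
    for _ in range(horizon):
        y_hat = float(one_step_model.predict(np.array(window).reshape(1, -1))[0])
        predictions.append(y_hat)
        window = window[1:] + [y_hat]   # slide the input window by one step
    return np.array(predictions)

def direct_forecast(multi_output_model, window):
    """Direct strategy: a single multi-output model (e.g. a neural network
    with one output neuron per horizon) predicts all horizons at once."""
    return multi_output_model.predict(np.array(window).reshape(1, -1))[0]
```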
Recurrent neural networks (RNNs) are well known for modelling temporal sequences [20]-[24] and dynamical systems when compared to feedforward networks [25]-[27]. The Elman RNN [20], [28] is one of the earliest architectures to be trained by backpropagation through time, which is an extension of the backpropagation algorithm [21]. The limitation of canonical RNNs in learning long-term dependencies in temporal sequences [29], [30] has been addressed by long short-term memory (LSTM) networks [23]. The recent deep learning revolution [24] contributed further improvements such as gated recurrent unit (GRU) networks [31], [32], which provide similar performance to LSTM networks and are simpler to implement. Some of the other extensions include predictive state RNNs [33], which combine RNNs with the power of predictive state representation [34]. Bidirectional RNNs connect two hidden layers of opposite directions to the same output, where the output layer can get information from past and future states simultaneously [35]. The idea was further extended into bidirectional LSTM networks for phoneme classification [36], which performed better than standard RNNs and LSTM networks. Further work combined bidirectional LSTM networks with convolutional neural networks (CNNs) for the problem of named entity recognition in natural language processing [37]. Further extensions include encoder-decoder LSTM networks, which use one LSTM to map the input sequence to a vector of fixed dimensionality and another LSTM to decode the target sequence, for language tasks such as English-to-French translation [38]. CNNs with regularisation methods such as dropout during training can improve generalisation [39]. Adaptive gradient methods such as adaptive moment estimation (the Adam optimiser) have become very prominent for training neural networks [40]. Apart from these, neuroevolution, which uses evolutionary algorithms, and multi-task learning have been used for time series prediction [8], [41]. RNNs have also been trained by neuroevolution with applications for time series prediction [10], [42]. We note that limited work has been done to compare feedforward neural networks (FNNs) and RNNs for multi-step time series prediction [43], [44]. It is important to evaluate the advancements of deep learning methods for a challenging problem, which in our case is multi-step time series prediction. LSTM networks have dominated applications in natural language processing and signal processing such as phoneme recognition; however, there is no work that evaluates their performance for time series prediction, particularly multi-step ahead prediction. Since the underlying feature of LSTM networks is in handling temporal sequences, it is worthwhile to investigate their predictive power, i.e. accuracy as the prediction horizon increases. In this paper, we present an evaluation study that compares the performance of selected deep learning models for multi-step ahead time series prediction. We examine univariate time series prediction with selected models and learning algorithms for benchmark time series datasets. The deep learning methods comprise standard LSTM, bidirectional LSTM, encoder-decoder LSTM, and CNNs. We also compare the results with canonical neural networks that use stochastic gradient descent learning and the Adam optimiser. We further compare our results with other related machine learning methods for multi-step time series prediction from the literature.
The rest of the paper is organised as follows. Section 2 presents a background and literature review of related work. Section 3 presents the details of the different deep learning models, and Section 4 presents experiments and results. Section 5 provides a discussion and Section 6 concludes the paper with a discussion of future work. One of the first attempts at recursive-strategy multi-step-ahead prediction used a state-space Kalman filter and smoothing [45], followed by recurrent neural networks [46]. Later, a dynamic recurrent network that used current and delayed observations as inputs reported excellent generalisation performance [47]. The non-parametric Gaussian process model was used to incorporate the uncertainty about intermediate regressor values [48]. The Dempster-Shafer regression technique for data-driven machinery prognosis used an iterative strategy with promising performance [49]. Lately, reinforced real-time recurrent learning was used with an iterative strategy for flood forecasts [12]. One of the earliest works using the direct strategy for multi-step-ahead prediction employed RNNs trained by the backpropagation through time algorithm [13]. A review of single-output versus multiple-output approaches showed the direct strategy to be a more promising choice than the recursive strategy [17]. Multiple-output support vector regression (M-SVR) achieved better forecasts when compared to standard SVR using direct and iterated strategies [50]. The combination of recursive and direct strategies has also been prominent, such as multiple SVR models that were trained independently based on the same training data and with different targets [14]. The optimally pruned extreme learning machine (OP-ELM) used recursive, direct and a combination of the two strategies in an ensemble approach, where the combination gave better performance than the standalone methods [51]. Chandra et al. [52] presented recursive and cascaded neural networks inspired by multi-task learning, trained via cooperative neuroevolution, where the tasks represented different prediction horizons. We note that neuroevolution provides an alternate training method that does not require gradients [53]. Ye and Dai [54] presented a multi-task learning method which considers different prediction horizons as tasks and explores the relatedness amongst prediction horizons. The method consistently achieved lower error values over all horizons when compared to other related iterative and direct prediction methods. A comprehensive study on the different strategies was given using a large experimental benchmark (the NN5 forecasting competition) [3], with a further comparison for macroeconomic time series reporting that the iterated forecasts typically outperformed the direct forecasts [55]. The relative performance of the iterated forecasts improved with the forecast horizon, with a further comparison that presented an encompassing representation for deriving the auto-regressive coefficients [56]. A study on the properties shows that the direct strategy provides prediction values that are relatively robust and that the benefits increase with the prediction horizon [57]. The applications to real-world problems include 1) autoregressive models for predicting critical levels of abnormality in physiological signals [58]; 2) flood forecasting using recurrent neural networks [59], [60]; 3) emissions of nitrogen oxides using a neural network and related approaches [61]; 4) photo-voltaic power forecasting using a hybrid support vector machine [62]; 5)
earthquake ground motions and seismic response prediction [63]; and 6) central processing unit (CPU) load prediction [64]. Recently, Wu [65] employed an adaptive-network-based fuzzy inference system with uncertainty quantification for the prediction of short-term wind and wave conditions for marine operations. Wang and Li [66] used multi-step-ahead prediction of wind speed based on optimal feature extraction, LSTM networks, and an error correction strategy. The method showed lower error values for one, three and five-step-ahead predictions in comparison to related methods. Wang and Li [67] also used a hybrid strategy for wind speed prediction with empirical wavelet transformation for feature extraction. Moreover, they used an autoregressive fractionally integrated moving average model and a swarm-based backpropagation neural network. Deep learning has been very successful for computer vision [68], computer games [69], multimedia, and big data related problems. Deep learning methods have also been prominent for modelling temporal sequences [24], [70]. RNNs have been popular in forecasting time series with their ability to capture temporal information [10], [22], [71]-[73]. Mirikitani and Nikolaev [74] used variational inference for implementing Bayesian RNNs in order to provide uncertainty quantification in predictions. CNNs have gained attention recently in forecasting time series. Wang et al. [75] used CNNs with the wavelet transform for probabilistic wind power forecasting. Xingjian et al. [76] used CNNs in conjunction with LSTM networks to capture spatiotemporal sequences for forecasting precipitation. Amarasinghe et al. [77] employed CNNs for energy load forecasting, and Huang and Kuo [78] combined CNNs and LSTM networks for air pollution quality forecasting. Sudriani et al. [79] employed LSTM networks for forecasting the discharge level of a river for managing water resources. Ding et al. [80] employed CNNs to evaluate the effect of different events on stock price behavior, and Nelson et al. [81] used LSTM networks to forecast stock market trends. Chimmula and Zhang employed LSTM networks for forecasting COVID-19 transmission in Canada [82]. The original time series data needs to be embedded (reconstructed) for multi-step-ahead prediction. Takens' embedding theorem expresses that the reconstruction can reproduce important features of the original time series [83]. Therefore, given an observed time series $x(t)$, an embedded phase space $Y(t) = [x(t), x(t-T), \ldots, x(t-(D-1)T)]$ can be generated, where $T$ is the time delay, $D$ is the embedding dimension (window size), $t = 0, 1, 2, \ldots, N - DT - 1$, and $N$ is the length of the original time series. A study needs to be done to determine optimal values for $D$ and $T$ in order to apply Takens' theorem efficiently [84]. Takens proved that if the original attractor is of dimension $d$, then $D = 2d + 1$ is sufficient [83]. We refer to the backpropagation neural network and multilayer perceptron as simple neural networks, which have typically been trained by the stochastic gradient descent (SGD) algorithm. SGD maintains a single learning rate for all the weight updates, which does not change during training. The Adam optimiser [85] extends SGD by adapting the learning rate for each parameter (weight) as learning unfolds. Using the first and second moments of the gradients, Adam computes adaptive learning rates, inspired by the adaptive gradient algorithm (AdaGrad) [86]. In the literature, Adam has shown better results when compared to SGD and AdaGrad for a wide range of problems.
In our experiments, we evaluate them further for multi-step ahead time series prediction. The Adam optimiser updates for the set of neural network parameters (weights $w$ and biases $b$), denoted $\theta$, for iteration $t$ can be formulated as

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
$\hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)$
$\theta_t = \theta_{t-1} - \alpha \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$    (1)

where $g_t$ is the gradient of the loss with respect to the parameters; $m_t$ and $v_t$ are the respective first and second moment vectors for iteration $t$; $\beta_1$, $\beta_2$ are constants $\in [0, 1]$; $\alpha$ is the learning rate; and $\epsilon$ is a constant close to zero. The Elman RNN [28] is a prominent example of a simple RNN that features a context layer to act as memory, incorporating the current state to propagate information into future states in order to handle future inputs. The context layer stores the output of the state neurons from the computation of previous time steps, making simple RNNs applicable to time-varying patterns in data. The context layer maintains memory of the prior hidden layer result, as shown in Figure 1. A vectorised formulation for simple RNNs is given as

$h_t = \sigma_h(W_h x_t + U_h h_{t-1} + b_h)$
$y_t = \sigma_y(W_y h_t + b_y)$    (2)

where $x_t$ is the input vector, $h_t$ the hidden layer vector, and $y_t$ the output vector; $W_h$ and $W_y$ represent the weights of the hidden and output layers; $U_h$ is the context state weight matrix; $b_h$ and $b_y$ are the biases; and $\sigma_h$ and $\sigma_y$ are the respective activation functions. Backpropagation through time (BPTT) [21] has been a prominent method for training simple RNNs. In comparison to simple neural networks, BPTT in RNNs propagates the error through a deeper network architecture that features states defined by time. Simple RNNs have limitations in learning long-term dependencies due to vanishing and exploding gradients [29]. LSTM networks [23] employ memory cells and gates for much better capabilities in remembering the long-term dependencies in temporal sequences, as shown in Figure 2. LSTM units are trained in a supervised fashion on a set of training sequences using an adaptation of the BPTT algorithm that considers the respective gates [23]. LSTM networks calculate the hidden state $h_t$ as

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
$\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
$h_t = o_t * \tanh(C_t)$    (3)

where $i_t$, $f_t$ and $o_t$ refer to the input, forget and output gates at time $t$, respectively; $x_t$ and $h_t$ refer to the input (of dimension equal to the number of input features) and the hidden state (of dimension equal to the number of hidden units), respectively; $W$ and $U$ are the weight matrices adjusted during learning, along with the bias $b$. The initial values are $C_0 = 0$ and $h_0 = 0$. All the gates have the same dimension $d_h$, the size of the hidden state. $\tilde{C}_t$ is a "candidate" hidden state, and $C_t$ is the internal memory of the unit, as shown in Figure 2. Note that $*$ denotes element-wise multiplication. A major shortcoming of conventional RNNs is that they only make use of the previous context state for determining future states. Bidirectional RNNs (BD-RNNs) [35] process information in both directions with two separate hidden layers, which are then propagated forward to the same output layer. BD-RNNs consist of placing two independent RNNs together to allow both backward and forward information about the sequence at every time step. A BD-RNN computes the forward hidden sequence $h_f$, the backward hidden sequence $h_b$, and the output sequence $y$ by iterating the backward layer from $t = T$ to $t = 1$ and the forward layer from $t = 1$ to $t = T$ in order to update the output layer; when both networks are combined, information is propagated in a bidirectional manner. Bidirectional LSTM networks (BD-LSTM) [36] were originally proposed for word-embedding in natural language processing in order to access long-range context or state in both directions, similar to BD-RNNs.
BD-LSTM networks take the input in two ways, one from past to future and one from future to past, which differs from conventional LSTM networks. By running information backwards, state information from the future is preserved. Hence, with the two hidden states combined, at any point in time the network can preserve information from both past and future, as shown in Figure 3. BD-LSTM networks have been used in several real-world sequence processing problems such as phoneme classification [36], continuous speech recognition [87], and speech synthesis [88]. Sutskever et al. [89] introduced the encoder-decoder LSTM network (ED-LSTM), which is a sequence-to-sequence model that maps an input sequence to an output sequence via a fixed-length vector representation [90]. The length of the input and output may differ, which makes ED-LSTM networks applicable to automatic language translation tasks (such as English to French). For instance, the input can be a sequence of video frames (x_1, ..., x_n), and the output a sequence of words (y_1, ..., y_m). Therefore, we estimate the conditional probability of an output sequence (y_1, ..., y_m) given an input sequence (x_1, ..., x_n), i.e. p(y_1, ..., y_m | x_1, ..., x_n). In the case of multi-step time series prediction, the input and output sequences can be of different lengths. ED-LSTM networks handle variable-length inputs and outputs by first encoding the input sequences, one time step at a time, into a latent vector representation, and then decoding the outputs from that representation. In the encoding phase, given an input sequence, the ED-LSTM computes a sequence of hidden states. In the decoding phase, it defines a distribution over the output sequence given the input sequence, as shown in Figure 4. CNNs, introduced by LeCun [91], [92], are a prominent deep learning architecture inspired by the natural visual system of mammals. CNNs can be trained using the backpropagation algorithm for tasks such as handwritten digit classification [93]. CNNs have been prominent in many computer vision and image processing tasks. Recently, CNNs have been applied to time series prediction and produced very promising results [75]-[77]. CNNs learn spatial hierarchies of features by using multiple building blocks such as convolution layers, pooling layers, and fully connected layers. Figure 5 shows an example of a CNN used for time series prediction with a univariate time series as input, where multiple output neurons represent different prediction horizons. We note that CNNs are more appropriate for multivariate time series, given the features extracted via the convolutional and pooling layers. We use a combination of benchmark problems that include simulated and real-world time series. The simulated time series are Mackey-Glass [94], Lorenz [95], Henon [96], and Rossler [97]. The real-world time series are Sunspot [98], Lazer [99] and the ACI-finance time series [100]. They have been used in our previous works and have been prominent for time series problems [101]-[103]. The Sunspot time series indicates solar activity from November 1834 to June 2001 and consists of 2000 data points [98]. The ACI-finance time series contains closing stock prices from December 2006 to February 2010, featuring 800 data points [100]. The Lazer time series is from the Santa Fe competition and consists of 500 points [99]. The respective time series are processed into a state-space vector [83] with embedding dimension D = 5 and time lag T = 1 for 10-step-ahead prediction.
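A minimal sketch of this reconstruction is given below, using the direct (multi-output) strategy with one target value per prediction horizon; the function name and the synthetic example series are illustrative only and are not part of the original implementation.

```python
import numpy as np

def embed_series(series, D=5, T=1, horizon=10):
    """Reconstruct a univariate series into input windows of D lagged values
    (time lag T) and multi-step targets covering the prediction horizon."""
    X, Y = [], []
    for t in range(len(series) - D * T - horizon + 1):
        X.append(series[t:t + D * T:T])                   # D past values
        Y.append(series[t + D * T:t + D * T + horizon])   # next `horizon` values
    return np.array(X), np.array(Y)

# Example: scale a series to [0, 1], embed it, and use a 60/40 train-test split
series = np.sin(0.1 * np.arange(1000))                    # stand-in for a benchmark series
series = (series - series.min()) / (series.max() - series.min())
X, Y = embed_series(series, D=5, T=1, horizon=10)
split = int(0.6 * len(X))
x_train, y_train = X[:split], Y[:split]
x_test, y_test = X[split:], Y[split:]
print(X.shape, Y.shape)                                   # (986, 5) (986, 10)
```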
We determine the respective model hyper-parameters, such as the number of hidden neurons and the learning rate, from trial experiments. Table 1 gives details of the topology of the respective models in terms of input, hidden and output layers. We use a maximum of 1000 epochs with rectified linear unit (ReLU) activations in all the respective models. The simple neural networks feature SGD and the Adam optimiser (FNN-SGD and FNN-Adam). The Adam optimiser is used in the deep learning models, which include simple RNNs, LSTM networks, ED-LSTM, BD-LSTM, and CNNs. The time series are scaled in the range [0, 1]. We use the first 1000 data points, of which the first 60% are used for training and the remainder for testing. We use the root mean squared error (RMSE) as the main performance measure for the different prediction horizons,

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$    (4)

where $y_i$ and $\hat{y}_i$ are the observed and predicted data, respectively, and $N$ is the length of the observed data. We report the mean and 95% confidence interval of the RMSE for each prediction horizon, for the respective problem, for train and test datasets, from 30 experimental runs with different initial neural network weights. Figures 9 to 12 present the results for the simulated time series (Tables 8 to 11 in the Appendix). Figures 6 to 8 present the results for the real-world time series (Tables 5 to 7 in the Appendix). We define robustness in terms of the confidence interval, which must be as low as possible to indicate high confidence in prediction. We consider scalability as the ability to provide consistent performance as the prediction horizon increases. The results are given in terms of the RMSE, where lower values indicate better performance. We first review results for the real-world time series that feature noise (ACI-finance, Sunspot, Lazer). Figure 6 shows the results for the ACI-finance problem. We observe that the test performance is better than the train performance in Figure 6 (a), where deep learning models provide more reliable performance. The prediction error (RMSE) increases with the prediction horizon, and the deep learning methods do much better than the simple neural networks (FNN-SGD and FNN-Adam). We find that LSTM provides the best overall performance, as shown in Figure 6 (b). The overall test performance shown in Figure 6 (a) indicates that FNN-Adam and LSTM provide similar performance, which is better than the rest of the methods. Figure 13 shows the ACI-finance prediction performance of the best experimental run for selected prediction horizons, which indicates how the prediction deteriorates as the prediction horizon increases. Next, we consider the results for the Sunspot time series shown in Figure 7, which follow a similar trend to the ACI-finance problem in terms of the increase in prediction error with the prediction horizon. The test performance is better than the train performance, as evident from Figure 7 (a). The LSTM methods (LSTM, ED-LSTM, BD-LSTM) give better performance than the other methods, as can be observed from Figures 7 (a) and 7 (b). Note that FNN-SGD gives the worst performance, and the performance of the RNN is better than that of the CNN, FNN-SGD, and FNN-Adam, but poorer than the LSTM methods. Figure 14 shows the Sunspot prediction performance of the best experimental run for selected prediction horizons. The results for the Lazer time series are shown in Figure 8, which exhibits a similar trend in terms of the train and test performance as the other real-world time series problems.
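To illustrate the evaluation protocol described above, the sketch below computes the RMSE separately for each prediction horizon and summarises 30 independent runs by their mean and a 95% confidence interval; the normal-approximation CI formula and the array names are assumptions for illustration, not necessarily those of the original implementation.

```python
import numpy as np

def rmse_per_horizon(y_true, y_pred):
    """RMSE for each prediction horizon.
    y_true, y_pred: arrays of shape (num_samples, num_horizons)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))

def summarise_runs(rmse_runs):
    """Mean and 95% confidence interval of the RMSE across runs.
    rmse_runs: array of shape (num_runs, num_horizons)."""
    mean = rmse_runs.mean(axis=0)
    ci95 = 1.96 * rmse_runs.std(axis=0, ddof=1) / np.sqrt(rmse_runs.shape[0])
    return mean, ci95

# Example with dummy results from 30 runs and 10 prediction horizons
rng = np.random.default_rng(0)
rmse_runs = np.abs(rng.normal(0.05, 0.01, size=(30, 10)))
mean, ci95 = summarise_runs(rmse_runs)
print(mean.shape, ci95.shape)   # (10,) (10,)
```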
Note that the Lazer problem is highly chaotic (as visually evident in Figure 16), which seems to be the primary reason behind the difference in performance along the prediction horizon in contrast to the other problems, as displayed in Figure 8 (b). It is striking that none of the methods appears to show any trend in prediction accuracy along the prediction horizon, as seen in the previous problems. In terms of scalability, all the methods appear to perform better in comparison with the other problems. The performance of the CNN is better than that of the RNN, which differs from the other real-world time series. Figure 16 shows the Lazer prediction performance of the best experimental run using ED-LSTM for selected prediction horizons. We note that due to the chaotic nature of the time series, the prediction performance is visually not clear. We now consider the simulated time series that do not feature noise (Henon, Mackey-Glass, Rossler, Lorenz). The Henon time series in Figure 9 shows that ED-LSTM provides the best performance. Note that there is a more significant difference between the three LSTM methods when compared to the other problems. The trends are similar to the ACI-finance and Sunspot problems in terms of the prediction horizon performance in Figures 9 (a) and 9 (b), where the simple neural networks (FNN-SGD and FNN-Adam) appear to be more scalable than the other methods along the prediction horizon, although they perform poorly. Figure 17 and Figure 15 show the Mackey-Glass and Henon prediction performance of the best experimental run using ED-LSTM for selected prediction horizons. The Henon prediction in Figure 15 indicates that it is far more chaotic than Mackey-Glass; hence, it poses more challenges. We show them since these are cases without noise, in contrast to the real-world time series shown previously, which show a larger deterioration in prediction performance as the prediction horizon increases (Figures 13 and 14). In the Lorenz, Mackey-Glass and Rossler simulated time series, the deep learning methods perform far better than the simple neural networks, as shown in Figures 10, 11 and 12. The trend along the prediction horizon is similar to the previous problems, i.e., the prediction error increases with the prediction horizon. If we consider scalability, the deep learning methods are more scalable in the Lorenz, Mackey-Glass and Rossler problems than in the previous problems. This is the first instance where the CNN has outperformed LSTM, for the Mackey-Glass and Rossler time series. We note that there have been distinct trends in prediction for the different types of problems. In the simulated time series, if we exclude Henon, we observe a similar trend for the Mackey-Glass, Lorenz and Rossler time series. The trend indicates that the simple neural networks face major difficulties. The ED-LSTM and BD-LSTM networks provide the best performance, which also applies to the Henon time series, except that the simple neural networks come close to the deep learning models for prediction horizons 7-10 (Figure 9 (b)). This difference reflects the highly chaotic nature of the time series (Figure 15). We further note that in the case of the simple neural networks, Henon (Figure 9) does not deteriorate in performance as the prediction horizon increases when compared to the Mackey-Glass, Lorenz and Rossler problems; the simple neural networks in this case perform poorly from the first prediction horizon.
The performance of the simple neural networks in the Lazer problem shows a similar trend, where the predictions are poor from the beginning, and it is striking that the LSTM networks actually improve in performance as the prediction horizon increases (Figure 8 (b)). This trend is a clear outlier when compared to the rest of the real-world and simulated problems, since they all show that the deep learning models deteriorate as the prediction horizon increases. Tables 3 and 4 show a comparison with related methods from the literature for the simulated and real-world time series, respectively. We note that the comparison is not fair, as other methods may have employed different models with different data processing and reported results with different measures of error. Moreover, some papers report the best experimental run and do not show the mean and standard deviation of the results. We highlight in bold the best performance for the respective prediction horizon. In Table 3, we compare the Mackey-Glass and Lorenz time series performance for two-step-ahead prediction by real-time recurrent learning (RTRL) and echo state networks (ESN) [12]. Note that * in the results implies that the comparison is not fair due to a different embedding dimension in the state-space reconstruction, and it is not clear whether the mean or the best run has been reported. We show a further comparison for Mackey-Glass for the 5th prediction horizon using extended Kalman filtering (EKF), unscented Kalman filtering (UKF) and Gaussian particle filtering (GPF), along with their generalised versions G-EKF, G-UKF and G-GPF, respectively [104]. In the case of MultiTL-KELM [54], we find that it beats all our methods for the Mackey-Glass time series, but not for the Henon time series. In general, we find that our deep learning methods (LSTM, BD-LSTM, ED-LSTM) outperform most of the methods from the literature for the simulated time series, except for the Mackey-Glass time series. In Table 4, we compare the performance for the Sunspot time series with support vector regression (SVR), iterated (SVR-I), direct (SVR-D), and multiple-model (M-SVR) methods [14]. In the respective problems, we also compare with coevolutionary multi-task learning (CMTL) [52]. We observe that our deep learning methods have given the best performance for the respective problems for most of the prediction horizons. Moreover, we find that FNN-Adam overtakes CMTL in all time series problems except for 8-step-ahead prediction in Mackey-Glass and 2-step-ahead prediction in the Lazer time series. It should also be noted that, except for the Mackey-Glass and ACI-finance time series, the deep learning methods are the best, which motivates further applications to challenging forecasting problems. We provide a ranking of the methods in terms of performance accuracy over the test dataset across the prediction horizons in Table 2. We observe that FNN-SGD gives the worst performance for all time series problems, followed by FNN-Adam in most cases. We observe that the BD-LSTM and ED-LSTM models provide some of the best performance across different problems with different properties. We also note that across all the problems, the confidence interval of the RNN is the lowest, followed by the CNN, which indicates that they provide more robust performance accuracy given different model initialisations in weight space. We note that it is natural for the performance accuracy to deteriorate as the prediction horizon increases in multi-step-ahead problems.
The prediction is based on current values, and the information gap increases with the prediction horizon since our problem is formulated with the direct strategy for multi-step-ahead prediction, as opposed to the iterated prediction strategy. The ACI-finance problem is unique in that there is not a major difference between the simple neural networks and the deep learning models (Figure 7 (b)) at the higher prediction horizons (7-10). Long-term dependency problems arise in the analysis of time series where the statistical dependence between two points decays slowly as the time interval between them increases. Simple RNNs had difficulty in training on long-term dependency problems [29]; hence, LSTM networks were developed [23]. The time series problems in our experiments are not long-term dependency problems; however, LSTM networks provide better performance when compared to simple RNNs. It seems that the memory gates in LSTM networks help to better capture information in temporal sequences, even though the given problems do not have long-term dependencies. We note that the memory gates in LSTM networks were originally designed to cater for the vanishing gradient problem. It seems that the memory gates of LSTM networks are helpful in capturing salient features in temporal sequences, which helps in predicting future trends much better than simple RNNs. We note that simple RNNs provide better results than simple neural networks (FNN-SGD and FNN-Adam) since they are more suited to temporal sequences. Moreover, it is striking that CNNs, which are suited to image processing, in general perform better than simple RNNs. This could be due to the convolutional layers in CNNs that help in better capturing hidden features in temporal sequences. Moving on, it is important to understand why the more recent LSTM network models (ED-LSTM and BD-LSTM) have given much better results. The ED-LSTM model was designed for language modelling tasks, primarily sequence-to-sequence modelling for language translation, where the encoder LSTM maps a source sequence to a fixed-length vector and the decoder LSTM maps the vector representation back to a variable-length target sequence [38]. In our case, the encoder maps an input time series to a fixed-length vector and the decoder LSTM then maps the vector representation to the different prediction horizons. Although the application is different, the underlying task of mapping inputs to outputs remains the same; hence, ED-LSTM models have been very effective for multi-step-ahead prediction. Simple RNNs make use of only the previous context states for determining future states. On the other hand, BD-LSTM networks process information using two LSTM models to feature forward and backward information about the sequence at every time step [36]. Although these have been useful for language modelling tasks, our results show that they are applicable for mapping current and future states in time series modelling. The information from past and future states is somewhat preserved, which seems to be the key feature in achieving better performance for multi-step prediction problems when compared to conventional LSTM models.
Table 3 (fragment), RMSE values for related methods from the literature: 2SA-RTRL* [12]: 0.0035; ESN* [12]: 0.0052; EKF [104]: 0.2796; G-EKF [104]: 0.2202; UKF [104]: 0.1374; G-UKF [104]: 0.0509; GPF [104]: 0.0063; G-GPF [104]: 0.0022; Multi-KELM [54]: 0.0027, 0.0031, 0.0028, 0.0029; MultiTL-KELM [54]: 0.0025, 0.0029, 0.0026, 0.0028; CMTL [52]: 0.0550, 0.0750, 0.0105, 0.1200; ANFIS(SL) [105]: 0.0051, 0.0213, 0.0547; R-ANFIS(SL) [105]: 0.0045, 0.0195, 0.0408; R-ANFIS(GL) [105]: 0

In this paper, we provide a comprehensive evaluation of emerging deep learning models for multi-step-ahead time series problems. Our results indicate that the encoder-decoder and bidirectional LSTM networks provide the best performance for both simulated and real-world time series problems. The results are a significant improvement over related time series prediction methods given in the literature. In future work, it would be worthwhile to provide a similar evaluation for multivariate time series prediction problems. Moreover, it is worthwhile to investigate the performance of the given deep learning models for spatiotemporal problems, such as the prediction of certain characteristics of storms and cyclones. Further applications to other real-world problems would also be feasible, such as air pollution and energy forecasting. We provide an open-source implementation in Python, along with data for the respective methods, for further research.

Table 4 (fragment), RMSE values for related methods from the literature: [14]: 0.2355 ± 0.0583; SVR-I [14]: 0.2729 ± 0.1414; SVR-D [14]: 0.2151 ± 0.0538; CMTL [52]: 0

Dr Chandra has built a program of research encircling methodologies and applications of artificial intelligence, particularly in the areas of Bayesian deep learning, neuroevolution, Bayesian inference via MCMC, climate extremes, landscape and reef evolution models, and mineral exploration. Dr Chandra has been developing novel methods for machine learning inspired by neural systems and learning behaviour, including transfer and multi-task learning, with the goal of developing modular deep learning methods. The current focus has been on Bayesian deep learning with a focus on recurrent, convolutional, and graph neural networks, with application to language models involving sentiment analysis and COVID-19. Dr Chandra has attracted multi-million dollar funding with a leading international interdisciplinary team and is part of the Australian Research Council (ARC ITTC) Training Centre for Data Analytics in Minerals and Resources (2020-2025). Dr Chandra is an Associate Editor for Neurocomputing, IEEE Transactions on Neural Networks and Learning Systems, and Geoscientific Model Development (Topical Editor).
References
An empirical comparison of machine learning models for time series forecasting
25 years of time series forecasting
Recent developments in econometric modeling and forecasting
The econometric analysis of economic time series
Co-evolutionary multi-task learning for dynamic time series prediction
Feature extraction, classification and forecasting of time series signal using fuzzy and garch techniques
Cooperative coevolution of Elman recurrent neural networks for chaotic time series prediction
A bias and variance analysis for multi-step-ahead time series forecasting
Reinforced two-step-ahead weight adjustment technique for online training of recurrent neural networks
Multi-step-ahead prediction with neural networks: a review
Iterated time series prediction with multiple support vector regression models
Recursive and direct multi-step forecasting: the best of both worlds
Methodology for long-term prediction of time series
Multiple-output modeling for multi-step-ahead time series forecasting
Multi-step prediction of time series with random missing data
Time series prediction and neural networks
Learning the hidden structure of speech
Backpropagation through time: what it does and how to do it
Recurrent neural networks and robust time series prediction
Long short-term memory
Deep learning in neural networks: An overview
Fuzzy finite-state automata can be deterministically encoded into recurrent neural networks
Training second-order recurrent neural networks using hints
Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical fuzzy systems
Finding structure in time
The vanishing gradient problem during learning recurrent neural nets and problem solutions
Learning long-term dependencies with gradient descent is difficult
Empirical evaluation of gated recurrent neural networks on sequence modeling
Learning phrase representations using RNN encoder-decoder for statistical machine translation
Predictive state recurrent neural networks
Predictive state representations: A new theory for modeling dynamical systems
Bidirectional recurrent neural networks
Framewise phoneme classification with bidirectional LSTM and other neural network architectures
Named entity recognition with bidirectional LSTM-CNNs
Sequence to sequence learning with neural networks
Dropout: A simple way to prevent neural networks from overfitting
The Landlab v1.0 OverlandFlow component: a Python tool for computing shallow-water flow across watersheds
Coevolutionary multi-task learning for feature-based modular pattern classification
Time series prediction with recurrent neural networks trained by a hybrid PSO-EA algorithm
Time series prediction with multilayer perceptron, FIR and Elman neural networks
Evaluation of co-evolutionary neural network architectures for time series prediction with mobile application in finance
Recursive estimation and forecasting of nonstationary time series
Long-term predictions of chemical processes using recurrent neural networks: a parallel training approach
Multi-step-ahead prediction using dynamic recurrent neural networks
Gaussian process priors with uncertain inputs: application to multiple-step ahead time series forecasting
Dempster-Shafer regression for multi-step-ahead time-series prediction towards data-driven machinery prognosis
Multi-step-ahead time series prediction using multiple-output support vector regression
Long-term time series prediction using OP-ELM
Co-evolutionary multi-task learning with predictive recurrence for multi-step chaotic time series prediction
Evolving deep LSTM-based memory networks using an information maximization objective
MultiTL-KELM: A multi-task learning algorithm for multi-step-ahead time series prediction
A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series
Direct and iterated multistep AR methods for difference stationary processes
Multistep forecasting in the presence of location shifts
Multi-step ahead predictions for critical levels in physiological time series
Reinforced recurrent neural networks for multi-step-ahead flood forecasts
Real-time multi-step-ahead water level forecasting by recurrent neural networks for urban flood control
Multi-step-ahead prediction of NOx emissions for a coal-based boiler
Comparison of strategies for multi-step ahead photovoltaic power forecasting models based on hybrid group method of data handling networks and least square support vector machine
Multi-step prediction of strong earthquake ground motions and seismic responses of SDOF systems based on EMD-ELM method
A pattern fusion model for multi-step-ahead CPU load prediction
Prediction of short-term wind and wave conditions for marine operations using a multi-step-ahead decomposition-ANFIS model and quantification of its uncertainty
Multi-step ahead wind speed prediction based on optimal feature extraction, long short term memory neural network and error correction strategy
An innovative hybrid approach for multi-step ahead wind speed prediction
Deep residual learning for image recognition
Playing Atari with deep reinforcement learning
Deep learning
Recurrent neural networks for time series classification
Competition and collaboration in cooperative coevolution of Elman recurrent neural networks for time-series prediction
DeepAR: Probabilistic forecasting with autoregressive recurrent networks
Recursive Bayesian recurrent neural networks for time-series modeling
Deep learning based ensemble approach for probabilistic wind power forecasting
Convolutional LSTM network: A machine learning approach for precipitation nowcasting
Deep neural networks for energy load forecasting
A deep CNN-LSTM model for particulate matter (PM2.5) forecasting in smart cities
Long short term memory (LSTM) recurrent neural network (RNN) for discharge level prediction and forecast in Cimandiri river, Indonesia
Deep learning for event-driven stock prediction
Stock market's price movement prediction with LSTM neural networks
Time series forecasting of COVID-19 transmission in Canada using LSTM networks
Detecting strange attractors in turbulence
Chaos theory and transportation systems: Instructive example
Adam: A method for stochastic optimization
Adaptive subgradient methods for online learning and stochastic optimization
TTS synthesis with bidirectional LSTM based recurrent neural networks
Hybrid speech recognition with deep bidirectional LSTM
Sequence to sequence learning with neural networks
On the properties of neural machine translation: Encoder-decoder approaches
Handwritten digit recognition with a backpropagation network
Gradient-based learning applied to document recognition
Theory of the backpropagation neural network (International Joint Conference on Neural Networks, 1989)
Oscillation and chaos in physiological control systems
Deterministic non-periodic flows
A two-dimensional mapping with a strange attractor
An equation for continuous chaos
Solar cycle forecasting: A nonlinear dynamics approach
Time series prediction: Forecasting the future and understanding the past (proceedings of the NATO advanced research workshop on a comparative time series analysis)
NASDAQ Exchange Daily: 1970-2010 Open, Close, High, Low and Volume
Langevin-gradient parallel tempering for Bayesian neural learning
Competition and collaboration in cooperative coevolution of Elman recurrent neural networks for time-series prediction
Co-evolutionary multi-task learning with predictive recurrence for multi-step chaotic time series prediction
Multi-step prediction of chaotic time-series with intermittent failures based on the generalized nonlinear filtering methods
Explore an evolutionary recurrent ANFIS for modelling multi-step-ahead flood forecasts