Optimized ensemble deep learning framework for scalable forecasting of dynamics containing extreme events
Arnob Ray, Tanujit Chakraborty, and Dibakar Ghosh
Chaos, 2021-06-09. DOI: 10.1063/5.0074213

The remarkable flexibility and adaptability of both deep learning models and ensemble methods have led to the proliferation of their applications in understanding many physical phenomena. Traditionally, these two techniques have largely been treated as independent methodologies in practical applications. This study develops an optimized ensemble deep learning (OEDL) framework wherein these two machine learning techniques are jointly used to achieve synergistic improvements in model accuracy, stability, scalability, and reproducibility, prompting a new wave of applications in the forecasting of dynamics. Unpredictability is one of the key features of chaotic dynamics, so forecasting the dynamics of nonlinear systems is a relevant issue in the scientific community. It becomes even more challenging when the prediction of extreme events is the focus. In this circumstance, the proposed OEDL model, based on a best convex combination of feed-forward neural networks, reservoir computing, and long short-term memory, can play a key role in advancing predictions of dynamics consisting of extreme events. The combined framework generates better out-of-sample performance than the individual deep learners and the standard ensemble framework for both numerically simulated and real-world data sets. We exhibit the outstanding performance of the OEDL framework for forecasting extreme events generated from a Liénard-type system, prediction of COVID-19 cases in Brazil, dengue cases in San Juan, and sea surface temperature in the Niño 3.4 region.

The study of extreme events is an interdisciplinary research field owing to their devastating impact on nature and human civilization. Several disciplines deal with extreme events because of their occurrence in a wide variety of settings [1, 2]: the spread of the COVID-19 pandemic across the whole world [3], the 2020 global stock market crash [4], the super cyclonic storm Amphan in India [5], and the oceanic rogue wave near Newfoundland [6], to name a few. Over the last few decades, researchers have been interested in exploring extreme events from a dynamical systems approach [7]. Extreme events occur in a dynamical system when its trajectory evolves within a bounded region most of the time but occasionally moves far away from that region, producing a significantly large amplitude in the temporal dynamics for a short while [8]. Several experimental and numerical works have been carried out on the appearance and origination of extreme events in various dynamical systems [9-13]. However, unpredictability is an important attribute of extreme events, so predicting them [14] becomes significant for reducing harmful losses [15] or for applying a suitable control scheme [16-18] in advance. Hence, the study of the prediction of extreme events in dynamical systems [19, 20] or real-world scenarios deserves special attention. For the prediction of extreme events in a dynamical system, it is essential in most cases to know the dynamics of the system.
An analytical approach is available to detect an indicator for the prediction of extreme events [21], but applying such a scheme uniformly across all systems is a challenging task. Bialonski et al. [22] have proposed a data-driven scheme for the prediction of extreme events in a spatially extended excitable system, based on the reconstruction of the local dynamics from data. At the same time, machine learning models open up a new scope for data-driven prediction of extreme events in which it is not necessary to understand the nature of the system dynamics [23-28]. In the last few years, machine learning techniques have been used extensively to explore several emergent phenomena in nonlinear dynamical systems. In this context, reservoir computing (RC) [29-31], a version of the recurrent neural network model, is effective for inference of unmeasured variables in chaotic systems using values of a known variable [32], forecasting the dynamics of chaotic oscillators [33, 34], predicting the evolution of the phase of chaotic dynamics [35], and predicting critical transitions in dynamical systems [36]. RC has also been used to detect synchronization [37-39] and spiking-bursting phenomena [40], and to infer network links [41] in coupled systems. Apart from RC, researchers have applied other artificial neural network architectures, such as the feed-forward neural network (FFNN) [42, 43] and long short-term memory (LSTM) [44-46], for purposes such as detecting phase transitions in complex networks [47] and functional connectivity in coupled systems [48], and forecasting complex dynamics [49].

Currently, deep learning has furnished natural ways for humans to communicate with digital devices and is foundational for constructing artificial general intelligence [50]. Deep learning was inspired by the architecture of the cerebral cortex, and insights into autonomy and general intelligence may be found in other brain regions that are essential for planning and survival [51]. Although applications of deep learning frameworks to real-world problems have become ubiquitous, they are not without shortcomings: deep learners often exhibit high variance and may fall into local loss minima during training. At the forefront of machine learning and artificial intelligence, ensemble learning and deep learning have independently made a substantial impact on time series forecasting through their widespread applications [52]. Ensemble learning combines several individual models to obtain better generalization performance [53]. Since the seminal paper by Bates and Granger [54], various ensemble methods that combine different base forecasting methods have been introduced in the literature [55-57]. The resulting ensemble forecasting method is expected to be more accurate than its components, but at the same time it should not overfit the data and should not be too complex to understand and explain. Thus, by creating an interpretable optimized ensemble deep learning model, one can exploit the relative advantages of both deep learning models and ensemble learning so that the final model has better generalization performance.
Motivated by these observations, this paper proposes and studies a novel mathematical optimization-based ensemble deep learning model, the optimized ensemble deep learning (OEDL) framework, which builds an optimized ensemble that trades off the accuracy of the ensemble against the power of the deep neural network models it combines. Our approach is flexible enough to incorporate desirable properties one may want of the ensemble, such as better predictive performance than any individual forecaster. We illustrate our approach with real data sets arising in various contexts. In this work, we endeavor to forecast dynamics consisting of extreme events from time series using an optimized ensemble of three deep learning frameworks. Our main focus is to improve on the forecasts given by three experts, namely (a) FFNN [58], (b) RC [59], and (c) LSTM [60]. For this, we use the ensemble technique [61] in which the three forecasts are combined through a best convex combination approach, which yields better predictions of both regular and extreme events than any individual forecaster. The proposal should achieve synergistic improvements in model accuracy, stability, scalability, and reproducibility, prompting a new wave of applications in the forecasting of dynamics. To validate our hypothesis, we show experimentally that the proposed framework yields better predictions than the individual experts (FFNN, RC, and LSTM). Besides applying the aggregation of forecasters to numerically simulated data sets, we also implement it for three real-world forecasting scenarios: a pandemic, an epidemic, and a weather phenomenon. The OEDL framework performs excellently in forecasting extreme events in a Liénard-type system using synthetic data, as well as in predicting COVID-19 cases in Brazil, dengue cases in San Juan, and ENSO in the Niño 3.4 region. In each case, we also compare the results obtained from the OEDL model with the individual deep learners (FFNN, RC, and LSTM). The robustness and scalability of the proposed OEDL framework lie in its wide range of applications across data sizes ranging from fewer than 500 to 50,000 points, which makes it a potential forecaster for other applied forecasting problems.

The main contributions of the paper can be summarized as follows:

1. We present a novel formulation of the ensemble deep learning model consisting of FFNN, RC, and LSTM, based on a convex optimization problem that motivates our new forecasting method, optimized ensemble deep learning (OEDL). We also conclude that the theoretical formulation of the framework ensures worst-case guarantees and always generates the best set of weights (to be used in the ensemble) using an online solver.

2. Using a range of tests with synthetic data collected from a Liénard-type system exhibiting extreme events, comparing the proposed OEDL against individual experts (FFNN, RC, LSTM) and a simple ensemble method, we demonstrate that solving the predictability problem of extreme events to optimality yields an ensemble deep learner that better reflects the ground truth in the data.

3. We extend the experimentation of the proposed OEDL framework to real-world forecasting challenges, namely COVID-19 prediction in Brazil, dengue prediction in San Juan, and ENSO prediction in the Niño 3.4 region, to show the wide range of applicability of the proposal.
4. From a practitioner's point of view, we present applications of the OEDL method to two synthetic data sets and three real-world data sets, showing that the optimal ensemble method delivers consistent and significant accuracy improvements over the individual experts. All the data sets and codes used in this study are made publicly available at https://github.com/arnob-r/OEDL.

This article is arranged as follows. We introduce our optimized ensemble deep learning framework in Sec. II. In Sec. III, we compare the forecasting accuracy of the deep-learning forecasts and the ensemble of forecasts on both numerically simulated and real-world data sets. This section is divided into two subsections: we describe the dynamics of the Liénard-type system and discuss our results for numerically simulated data sets in Subsec. III A, and a similar exploration is carried out on real-world data sets in Subsec. III B. The conclusion is drawn in Sec. IV with some discussion and future aspects. In Appendix V, we briefly explain the three neural network-based forecasting methods used in the proposed ensemble framework.

Deep learning is well known for its power to approximate almost any function and increasingly demonstrates predictive accuracy that surpasses human experts. However, deep learning models are not without shortcomings; they often exhibit high variance and may fall into local loss minima during training. Ensemble learning, as its name implies, combines multiple individual learners to complete the learning task together [54, 55, 62]. Indeed, ensemble methods that combine the output of multiple deep learning models have been shown empirically to achieve better generalizability than a single model [63]. In addition to simple ensemble approaches such as averaging the output of individual models, combining heterogeneous models enables a multifaceted abstraction of the data and may lead to better learning outcomes [64]. Researchers' quest for an optimal solution to forecast combination continues in many applied fields, ranging from dynamical systems to epidemiology. In this research, based on the constructed deep neural networks, an optimization algorithm is used to integrate the component learners for ensemble learning. Section II A describes the formulation of the proposed OEDL framework in terms of an optimization problem with linear constraints. Section II B establishes the connection of the approach with the constrained Lasso and derives some theoretical results about the solution. Section II C presents the algorithmic structure of the proposed OEDL framework. Finally, we discuss the implementation of the proposed OEDL model in Section II D.

This section presents the new ensemble approach. We describe the formulation of the model in terms of an optimization problem with linear constraints. Let F be a finite set of base forecasting models for the time series data y_i. No restriction is imposed on the collection of base forecasters. In this work, we introduce the OEDL framework as an ensemble of three deep learning experts, namely the FFNN, RC, and LSTM models. By taking convex combinations of the base forecasters in F, we obtain a broader class of forecasters, namely

op(F) = { Σ_{f∈F} α_f f : α_f ≥ 0 for all f ∈ F, Σ_{f∈F} α_f = 1 }.   (1)

We denote (α_f)_{f∈F} by α. The selection of a combined forecaster from op(F) can be done by optimizing a function that takes into account the following criteria.
The fundamental criterion is the overall accuracy of the combined framework, measured through a loss function ℒ defined on op(F). Since the proposed OEDL framework favors base forecasters f ∈ F with higher reliability, i.e., with lower individual loss, higher importance is given to the loss of the ensemble forecaster. Therefore, an optimized ensemble learner is obtained by solving the following mathematical optimization problem with linear constraints:

min_{α ∈ S} ℒ( Σ_{f∈F} α_f f ),   (2)

where S = { α ∈ R^{|F|} : α_f ≥ 0, Σ_{f∈F} α_f = 1 } is the unit simplex in R^{|F|}. In general, Problem (2) has linear constraints. Depending on the choice of loss function, we can rewrite the objective as a linear or a convex quadratic function while the constraints remain linear. Therefore, for the commonly used loss functions, Problem (2) is easily tractable with commercial solvers. In addition, under some mild assumptions, we can characterize the behavior of the optimal solution.

Remark 1. Let T be a training sample in which each individual time sequence y_i ∈ T. Suppose ℒ is the empirical loss of quantile regression for T, i.e.,

ℒ(f) = Σ_{y_i ∈ T} max{ τ(y_i − f_i), (τ − 1)(y_i − f_i) },

for some τ ∈ (0, 1). Then Problem (2) can be expressed as a linear programming problem and thus efficiently solved with linear programming solvers [65, 66].

Remark 2. Let T be a training sample in which each individual time sequence y_i ∈ T. Suppose ℒ is the empirical loss of ordinary least squares regression for T, i.e.,

ℒ(f) = Σ_{y_i ∈ T} (y_i − f_i)².

Hence, Problem (2) is a convex quadratic optimization problem with linear constraints, which can be seen as a constrained Lasso problem [67] without the standard regularization parameter (λ = 0). In particular, we can assert that Problem (2) has a unique optimal solution α.

Under mild conditions on ℒ, applicable in particular to the quantile and ordinary least squares empirical loss functions, we can find the optimal solution of Problem (2). For these loss functions, the resulting problems are linear or convex quadratic, respectively, and thus can be addressed with available convex solvers [68]. However, we are interested in online aggregation rules that perform almost as well as, for instance, the best constant convex combination of the experts that solves Problem (2). In our proposed OEDL framework, we use the 'oracle' function [61, 68, 69], which is the best constant convex combination of the experts; its guarantees hold for all sequences of time series and come with finite-time worst-case bounds.

In the proposed OEDL framework, we create a combination (ensemble) of forecasts based on an optimized online expert aggregation method from a pool of forecasting methods, namely

1. Feed-forward neural network (Appendix V A);
2. Reservoir computing (Appendix V B);
3. Long short-term memory (Appendix V C).

More formally, we consider a sequence of observations y_1, y_2, . . . , y_n (any real bounded time series) to be predicted step by step. Suppose that a finite set of deep learning-based forecasters k = 1, 2, 3 (FFNN, RC, and LSTM) provides us, before each time step t = 1, 2, . . . , n, with predictions f_{k,t} of the next observation y_t. We obtain the final prediction ŷ_t by using only the knowledge of the past observations y_1, y_2, . . . , y_{t−1} and the expert forecasts f_{1,t}, f_{2,t}, f_{3,t}, i.e.,

ŷ_t = α_{1,t} f_{1,t} + α_{2,t} f_{2,t} + α_{3,t} f_{3,t},   (3)

where α_{1,t}, α_{2,t}, and α_{3,t} are three non-negative weights subject to α_{1,t} + α_{2,t} + α_{3,t} = 1, as in Eqn. (1).
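To make the optimization step concrete, the following is a minimal sketch, assuming the squared loss of Remark 2, of how the best constant convex combination of three experts could be computed by directly solving Problem (2) on the unit simplex. In the paper this computation is delegated to the oracle function of the opera solver; the function name find_convex_weights and the toy data here are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def find_convex_weights(forecasts, y):
    """forecasts: (n, K) matrix of expert predictions; y: (n,) observations.
    Returns alpha on the unit simplex minimizing sum_t (y_t - forecasts_t @ alpha)^2."""
    n, K = forecasts.shape
    loss = lambda a: np.sum((y - forecasts @ a) ** 2)           # empirical squared loss
    cons = ({'type': 'eq', 'fun': lambda a: np.sum(a) - 1.0},)  # sum(alpha) = 1
    bnds = [(0.0, 1.0)] * K                                     # alpha_k >= 0
    res = minimize(loss, x0=np.full(K, 1.0 / K), bounds=bnds, constraints=cons)
    return res.x

# toy usage: three noisy "experts" tracking a sine wave
rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 10, 200))
F = np.column_stack([y + rng.normal(0, s, y.size) for s in (0.3, 0.2, 0.05)])
alpha = find_convex_weights(F, y)   # most weight should fall on the least noisy expert
print(alpha.round(3))
```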
Interestingly, we could choose these weights equal and uniform [55], i.e., α_{1,t} = α_{2,t} = α_{3,t} = 1/3 (we refer to this as the 'ensemble' model), but this formulation does not come with finite-time worst-case guarantees. We formulate the mathematical optimization problem as an empirical squared loss minimization problem, as mentioned in Remark 2, with the following loss function:

ℒ = Σ_{t=1}^{n} ( y_t − Σ_{k=1}^{3} α_{k,t} f_{k,t} )².   (4)

We obtain the best possible weights in the proposed OEDL framework by minimizing this loss function. This choice of loss function ensures that the best convex oracle strategy, through efficient aggregation of the forecasters, performs at least as well as the individual experts or the uniform average of the expert forecasts with respect to mean squared error. The oracle function implements a strategy that cannot be defined online: it requires advance knowledge of the whole data set y_t and of the expert advice to be well defined. 'Online Prediction by ExpeRts Aggregation' (opera), developed by Gaillard et al. [61, 69], is a robust online solver that estimates these weights (α_{k,t}) based on the expert forecasts. The online solver opera performs predictions for regression-oriented time series by combining a finite set of forecasts provided by the user (FFNN, RC, and LSTM in our case). To check the accuracy of prediction, we select the root mean square error (RMSE) [70] as our statistical metric, defined as

RMSE = sqrt( (1/n) Σ_{t=1}^{n} (y_t − ŷ_t)² ).   (5)

The complete procedure for making one-step-ahead predictions with our optimized ensemble methodology is summarized in Algorithm 1 and visualized in Fig. 1.

Algorithm 1.
Offline phase: Train the learning models (FFNN, RC, and LSTM).
Input: {y_1, y_2, . . . , y_n}: n observed time series values that form the reference set; F: a set of expert forecasts to be used in the ensemble.
Output: Record {ŷ_t : t = 1, 2, . . . , n}.
Steps:
1. For each family of strategies (introduced in Eqn. 1), compute the performance corresponding to the best constant choice of the parameters by minimizing Eqn. 4.
2. Obtain the best set of weights (α_f) using the 'oracle' function (set of convex weights) available in the online linear-time solver 'Online Prediction by ExpeRts Aggregation' (opera).
3. Obtain the final set of predictions from the OEDL model using Eqn. 3.
4. Assess the quality of the operational performance, i.e., the performance obtained after some automatic and sequential tuning.
5. Compare the performance of the experts, the uniform combination of experts, and the best convex combination of experts as in OEDL, using the RMSE of Eqn. 5 (on new data).

FIG. 1: The schematic diagram depicts the aggregation of expert forecasters. For a given input vector, the three deep learning frameworks FFNN, RC, and LSTM generate three sets of forecasts. We combine the three results and obtain a new forecast with better accuracy with respect to RMSE performance. The aggregation rule is defined as a best constant convex combination of forecasts: if at the i-th time FFNN, RC, and LSTM produce the forecast values f_1, f_2, and f_3, then the final forecast is α_1 f_1 + α_2 f_2 + α_3 f_3.
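As an illustration of steps 3 and 5 of Algorithm 1, the following self-contained sketch aggregates expert forecasts via Eqn. (3) and compares the uniform 'ensemble' weights with an optimized set via the RMSE of Eqn. (5). The optimized weights here are stand-in values for illustration; in the actual pipeline they would come from opera's oracle (or a direct solve as sketched above). All names and the toy data are illustrative.

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))   # Eqn. (5)

def aggregate(forecasts, alpha):
    return forecasts @ alpha                     # Eqn. (3): y_hat_t = sum_k alpha_k f_{k,t}

# toy data: three synthetic experts with different noise levels
rng = np.random.default_rng(1)
y = np.cos(np.linspace(0, 20, 500))
F = np.column_stack([y + rng.normal(0, s, y.size) for s in (0.4, 0.25, 0.1)])

uniform = np.full(3, 1 / 3)             # the basic 'ensemble' model
alpha_star = np.array([0.1, 0.2, 0.7])  # stand-in for the oracle's convex weights

print('uniform RMSE  :', rmse(y, aggregate(F, uniform)))
print('optimized RMSE:', rmse(y, aggregate(F, alpha_star)))
```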
The proposed OEDL is an optimized ensemble model in which, instead of building a single model, multiple 'base' models are combined to perform time series forecasting tasks. The implementation of the proposed OEDL framework consists of four major steps: (a) The time-series data is divided into training and test data sets, and the neural network models (FFNN, RC, and LSTM) are fitted on the training data. (b) Forecasts based on these experts are obtained (on the test data) and used in the next step. (c) The final ensemble forecasts are generated using the best constant combination of the experts via Online Prediction by ExpeRts Aggregation. (d) For comparison, results for the basic 'ensemble' model (uniform weights) are also computed. An optimized ensemble deep learning model is thus built from the available forecasts of the three deep neural network-based experts. To obtain the best convex combination of the experts, we have used an online mathematical optimization solver, namely opera, from which we obtain the best combination of weights for the proposed OEDL framework. For the sake of reproducibility, the data sets and implementation code of the proposed OEDL framework are made publicly available in our GitHub repository.

In this section, we implement the proposed OEDL framework on both numerically simulated and real-world data sets. We divide the section into two subsections. First, we describe the paradigmatic model and discuss our results on the Liénard-type system. Second, we experiment with the proposed methodology on three real-world data sets. Each data set (numerical as well as real-world) contains extreme events from the perspective of its own context.

A. The Liénard-type system

We consider a paradigmatic Liénard-type system [71] with external sinusoidal forcing. Leo Kingston et al. [12] have reported that this system exhibits extreme events for an admissible set of parameter values. The mathematical expression of the model is

ẋ = y,
ẏ = −α x y − γ x − β x³ + F sin(ωt),   (6)

where α, β, and γ represent the nonlinear damping, the strength of nonlinearity, and the intrinsic frequency, respectively, and F and ω are the amplitude and frequency of the external forcing. The parameter values are kept fixed at α = 0.45, β = 0.5, γ = −0.5, and F = 0.2. The parameter ω is treated as a bifurcation parameter and is varied to observe extreme events in this system. Leo Kingston et al. [12] have shown that extreme events occur in the y variable of this system via the Pomeau–Manneville intermittency [72] and interior-crisis-induced intermittency [73] routes to chaos. The state variable y is therefore the observable of this system, and its local maxima are considered as events. Two bifurcation diagrams are plotted in the upper panel of Fig. 2 to reveal the two routes. The local maxima of y, denoted y_max, are portrayed by varying the forcing frequency ω ∈ [0.6421, 0.6424] in Fig. 2(a) and ω ∈ [0.731, 0.734] in Fig. 2(b). In Fig. 2(a), as the value of ω is increased, we observe that the period-1 orbit suddenly transitions to a chaotic attractor at a critical value of ω (ω ≈ 0.6422); this scenario is the Pomeau–Manneville intermittency [72]. On the other hand, the size of the chaotic attractor suddenly decreases when ω crosses a critical value (ω ≈ 0.7328) in Fig. 2(b). Here, a high-amplitude chaotic attractor transforms into a chaotic attractor of lower amplitude via an interior crisis, due to which intermittent chaotic bursts are observed in the time series of y in the vicinity of the critical value of ω. An event (y_max) is called an extreme event [12] if it occasionally crosses a significant height H_S = µ + 8σ [74-76], a pre-defined threshold, where µ and σ are the mean and standard deviation of a sufficiently large data set of events, respectively.
We select two values of ω from the bifurcation diagrams of Figs. 2(a) and 2(b) such that the corresponding chaotic attractors exhibiting extreme events are reached through the two different routes mentioned above. We plot the time evolution of y at ω = 0.6423 in Fig. 2(c) and at ω = 0.7315 in Fig. 2(d). For prediction purposes, we construct two data sets of y_max from the Liénard-type system by simulating Eqn. (6) for ω = 0.6423 and ω = 0.7315; these two values of ω are chosen such that the chaotic attractor is generated via the Pomeau–Manneville intermittency and interior-crisis routes to chaos, respectively. Each data set consists of 50000 events. Out of the 50000 data points, 90% of the events are used to train the FFNN, RC, and LSTM models; the remaining 10% are used to test the forecasting results. For the experiments on the numerically simulated data sets, the hidden layer in the FFNN model consists of 100 neurons and the learning rate for training is 0.05. For RC, the size of the reservoir is 400, and the average degree of W_res is 40. The number of hidden cells in the LSTM layer is 200. The FFNN model is trained for 2500 epochs, while the LSTM model is trained for 250 epochs. The gradient threshold is set to 1 to prevent the gradients from exploding. The initial learning rate is 0.005, and the learning rate is dropped after 125 epochs by multiplying by a factor of 0.2.

We aggregate the three forecast outputs of the individual FFNN, RC, and LSTM models using two aggregation rules. First, we implement the usual ensemble technique, which is the average (arithmetic mean) of the three expert forecasts; for this standard ensemble deep learning model [54, 55], α_1 = α_2 = α_3 = 1/3. The results on the test data sets are computed for this ensemble framework, and the average RMSE and its standard deviation are shown in Table I. Next, we apply our proposed OEDL model to the two simulated data sets by following Algorithm 1, as discussed in Subsec. II C. In the left panel of Fig. 3, we plot a small segment of the time evolution of the y variable (blue) containing extreme events, with the events (y_actual) marked by black dots, for the data set with ω = 0.6423. For this data set, we calculate the weights α_1 = 0.144, α_2 = 0.133, α_3 = 0.723 for the convex aggregation rule. In Figs. 3(b, d, f, h, j), we draw the regression plots corresponding to Figs. 3(a, c, e, g, i), respectively. Similarly, the time evolution of y along with the events (y_actual) and the forecasted values (y_forecasted) is depicted for ω = 0.7315 in the left panel of Fig. 4, with the corresponding regression plots in the right panel of Fig. 4. We determine α_1 = 0.101, α_2 = 0.066, α_3 = 0.833 for this second data set. Regression plots using the test data points of y_max and the forecasts from the different frameworks and aggregation rules are shown in the right panels of Figs. 3 and 4, and the RMSE results are reported in Table I for all five models considered in this study. In both cases, we observe that the proposed OEDL model gives the best result according to the RMSE (minimum RMSE values). This is because we minimize the loss function corresponding to the convex aggregation, as discussed in Sec. II, and the success of the proposal is also evident from Figs. 3 and 4. We now validate the proposed OEDL model on real-world data to check whether it also gives better predictions there.
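Before moving on, the following is a sketch of the synthetic data-generation step described above, assuming the forced Liénard form reconstructed in Eqn. (6): it integrates the system with a fourth-order Runge–Kutta scheme, collects the local maxima of y as events, and flags extreme events above H_S = µ + 8σ. The step size, run length, and function names are illustrative choices, not the paper's exact settings.

```python
import numpy as np

# parameters from the text; the vector field is the reconstructed Eqn. (6)
a, b, g, F_amp, omega = 0.45, 0.5, -0.5, 0.2, 0.6423

def deriv(s, t):
    x, y = s
    return np.array([y, -a * x * y - g * x - b * x**3 + F_amp * np.sin(omega * t)])

def rk4_step(s, t, dt):
    k1 = deriv(s, t)
    k2 = deriv(s + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = deriv(s + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = deriv(s + dt * k3, t + dt)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

dt, n_steps = 0.02, 500_000
s, ys = np.array([0.1, 0.1]), np.empty(n_steps)
for i in range(n_steps):
    s = rk4_step(s, i * dt, dt)
    ys[i] = s[1]
ys = ys[n_steps // 10:]                      # discard the initial transient

# events = local maxima of y; extreme events exceed H_S = mu + 8*sigma
mid = ys[1:-1]
peaks = mid[(mid > ys[:-2]) & (mid > ys[2:])]
H_S = peaks.mean() + 8 * peaks.std()
print(f'{peaks.size} events, {np.sum(peaks > H_S)} extreme (H_S = {H_S:.3f})')
```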
To this end, we experiment on various real-world data sets with different dynamics. Three time-series forecasting data sets from practical fields are considered to showcase the scalability of the proposal. These data sets are relatively small compared to the numerically simulated data. First, we take daily case counts of the recent COVID-19 pandemic in Brazil. Another data set consists of weekly dengue cases in San Juan, the capital of Puerto Rico. Finally, a data set of sea surface temperature (SST) values in the Niño 3.4 region is also considered. We use 80% of the data points for training the forecasting methods; the remaining 20% are used for checking the performance of the forecasters. All these data sets contain extreme events from the perspective of the respective research fields. Below we provide a brief description of these real-world data sets.

(a) COVID-19 pandemic data: Daily confirmed COVID-19 cases in Brazil from 26 February 2020 to 29 April 2021 are collected from Our World in Data. The data set consists of 429 data points in total. Several works address COVID-19 forecasting [57, 77, 78]. From the Brazil pandemic data set, we take 80% of the total data (i.e., 343 daily cases) as training data and the rest as test data.

(b) Dengue epidemic data: We gather 1196 weekly dengue case counts for San Juan, from May 1990 to April 2013, from the National Oceanic and Atmospheric Administration. Dengue forecasting is an active area of epidemic forecasting research [56, 79, 80]. We build our models on 956 weekly cases (80% of the total data) for the training phase and keep the rest for testing.

(c) ENSO data: El Niño-Southern Oscillation (ENSO) is a climate phenomenon that occurs due to an interplay between atmospheric and ocean circulations [81]. A fluctuation in sea surface temperature (SST) is observed across the equatorial Pacific Ocean: a warming phase of SST in the eastern Pacific is called El Niño, while La Niña is an occasional cooling of ocean surface waters in the eastern Pacific [82]. These two phases are captured in the data set of sea surface temperatures of the Pacific Ocean. We collect weekly sea surface temperatures from 3 January 1990 to 21 April 2021 in the Niño 3.4 region (5°N-5°S and 170°W-120°W) across the Pacific Ocean from the Climate Prediction Center, forming a set of 1634 data points. This data can be used for SST forecasting, i.e., El Niño/La Niña forecasting, which is one of the trending topics in climatology [83, 84]. We also validate our proposed framework for forecasting SST over the Niño 3.4 region, which supports Niño 3.4 ENSO prediction. In this case study, we train on 1307 weekly SST data points and use the rest for testing.

The standard implementation of the FFNN, RC, and LSTM methods, as discussed in Section II D, is followed for all the real-world data sets. The number of hidden neurons in the FFNN, the size of the reservoir in RC, and the number of hidden cells in the LSTM are chosen using cross-validation [51, 85]. We now explore the forecasting results on the real-world data sets using the five different forecasters to showcase the performance of our proposed OEDL model (convex combination of experts) in terms of RMSE. In the left panel of Fig. 5, the actual values of the test data are plotted together with the forecasts obtained from the proposed OEDL model and the individual experts (FFNN, RC, LSTM); the comparison between actual and forecasted data points is illustrated in those figures.
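For concreteness, the following is a minimal sketch of the 80/20 split described above for the Brazil COVID-19 series. The OWID CSV URL and column names are assumptions based on the publicly documented Our World in Data export and may need adjusting to the actual source used in the paper.

```python
import pandas as pd

url = 'https://covid.ourworldindata.org/data/owid-covid-data.csv'  # assumed path
df = pd.read_csv(url, usecols=['location', 'date', 'new_cases'],
                 parse_dates=['date'])
series = (df[df['location'] == 'Brazil']
          .set_index('date')['new_cases']
          .loc['2020-02-26':'2021-04-29'])     # 429 daily observations

n_train = int(0.8 * len(series))               # 343 training points, rest for testing
train, test = series.iloc[:n_train], series.iloc[n_train:]
print(len(train), 'train /', len(test), 'test')
```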
In addition, the RMSE values for the five competing forecasting methods are given in the right panel of Fig. 5 for the three real-world data sets; the smaller the RMSE, the better the forecasting method. Five bars are plotted in Figs. 5(b, d, f) corresponding to the five forecasters, namely FFNN (magenta), RC (yellow), LSTM (green), the basic ensemble method (blue), and the proposed OEDL method (red), respectively. According to Figs. 5(b, d, f), the proposed OEDL attains the minimum RMSE among all methods for all three real-world data sets. Let the forecasting results of FFNN, RC, and LSTM be aggregated with weights α_1, α_2, and α_3, respectively, in the proposed framework. The chosen sets of weights for the three real-world case studies, used in depicting Fig. 5, are as follows: {α_1, α_2, α_3} = {0.0, 0.33, 0.67} for daily COVID-19 case forecasting, {α_1, α_2, α_3} = {0.31, 0.425, 0.265} for weekly dengue case forecasting, and {α_1, α_2, α_3} = {0.029, 0.41, 0.561} for SST forecasting. Also, for these real-world data sets, we compute the mean (⟨RMSE⟩) and fluctuation (σ_RMSE) of the RMSEs for each forecasting method in Table II to check the practical robustness of our proposal; the RMSE values are calculated over 50 trials for each of the three real-world data sets. From Table II, we conclude that the proposed OEDL always wins in forecasting accuracy, by a significant margin, in comparison with the other experts or the uniform aggregation of experts (ensemble).

We have proposed a novel data-driven ensemble method, based on deep neural networks, for modeling and prediction of chaotic dynamics consisting of extreme events as well as of dynamics in real-world scenarios. The proposed optimized ensemble deep learning framework is trained on time-series data and requires no prior knowledge of the underlying governing equations. Using the trained model, long-term predictions are made by iteratively predicting one step forward. The proposed OEDL method is based on multiple deep models, a best-convex-combination optimization algorithm, and an ensemble design strategy, making it useful for scalable forecasting. The proposal was first built theoretically as an optimization problem whose possible solutions lie in a broader set of combinations of experts. The best convex combination of weights was chosen using an online solver that minimizes the average squared loss and comes with finite-time worst-case guarantees. The method is scalable to even larger systems, requires significantly less training data to obtain high-quality predictions than individual state-of-the-art deep learning forecasters, and can effectively utilize an ensemble of base experts. Experimental results on both numerically simulated and real-world data sets demonstrated the outstanding performance of the proposed OEDL method in comparison with other single and ensemble models. These results provide comprehensive evidence that the optimal ensemble deep learner is scalable for practical applications and leads to significant improvements over standard neural network methods.

Since the discovery of the 'Wisdom of Crowds' (more popularly, 'Vox Populi') over 100 years ago, theories of collective intelligence have held that group accuracy requires either statistical independence or informational diversity among individual beliefs [86].
This idea received immense interest in time series forecasting when Bates and Granger [54] showed, with empirical evidence, that simple combinations of forecasting methods can generate better out-of-sample forecasts. Of course, the current progress in machine learning and deep learning has brought various highly capable forecasters to our door. This paper provides a mathematical optimization-based ensemble of deep learning methods for efficient and accurate forecasting of dynamics consisting of extreme events. Both the theoretical and the experimental results presented in this paper support our claims and improve the predictability of the dynamical systems considered in this study. A more principled way of ensemble learning by stacking would be to perform it at the full-sequence level rather than at the data level, as attempted in this paper; the greater complexity of this pursuit leaves it to future work. Also, with an increasingly large amount of training data and computing power, it is conceivable that the weight parameters and hyperparameters of all the constituent deep networks could be learned jointly by unified backpropagation in an end-to-end fashion.

The authors would like to thank Chittaranjan Hens and Subrata Ghosh for constructive discussions and notable comments.

Deep learning systems have dramatically improved the accuracy of machine learning-based forecasting systems, and various deep architectures and learning methods have been developed in recent years with distinct strengths and weaknesses. We now discuss the three state-of-the-art deep neural network frameworks for time series forecasting that are used to develop the proposed OEDL framework. As already mentioned, the three expert forecasters are FFNN, RC, and LSTM. These three types of neural network-based deep learning models serve as the base time series models for constructing the ensemble, and their basic concepts and modeling processes are briefly reviewed below.

Feed-forward neural networks (FFNN) are flexible computing frameworks for modeling a broad range of nonlinear forecasting problems, inspired by the structure of human and animal brains. One significant advantage of neural network models over other classes of nonlinear models is that the feed-forward neural network is a universal approximator that can approximate a large class of functions with a high degree of accuracy [87]. Their power comes from the parallel processing of the information in the data, and no prior assumption on the data-generating process is required in the model-building process; instead, the network model is largely determined by the characteristics of the data. The shallow feed-forward network with one hidden layer is the most widely used model for time series forecasting in applied domains [88]. A shallow FFNN model is characterized by a network of three layers of simple processing units connected by acyclic links (see Fig. 6). The relationship between the output y_t and the inputs (y_{t−1}, y_{t−2}, . . . , y_{t−P}) is represented by the following mathematical equation:

y_t = w_0 + Σ_{j=1}^{Q} w_j g( w_{0,j} + Σ_{i=1}^{P} w_{i,j} y_{t−i} ) + e_t,   (7)

where w_{i,j} (i = 0, 1, 2, . . . , P; j = 1, 2, . . . , Q) and w_j (j = 0, 1, 2, . . . , Q) are model parameters, often called connection weights; P is the number of input nodes; Q is the number of hidden nodes; and e_t is the error term. The logistic function is often used as the hidden-layer transfer function, i.e., g(x) = 1/(1 + exp(−x)).
Hence, the FFNN model of Eqn. (7) in fact performs a nonlinear functional mapping from the past observations to the future value y_t, i.e.,

y_t = f(y_{t−1}, y_{t−2}, . . . , y_{t−P}, W) + e_t,

where W is a vector of all parameters and f(·) is a function determined by the network structure and connection weights. Thus, the neural network is equivalent to a nonlinear autoregressive model. Note that expression (7) implies one output node in the output layer, which is typically used for one-step-ahead forecasting. The neural network given by (7) can approximate an arbitrary function when the number of hidden nodes Q is sufficiently large [87, 88]. In practice, FFNN structures with a small number of hidden nodes often work well in out-of-sample forecasting, which may be due to the overfitting effect typically found in the neural network training process. During the training phase of the FFNN, it is common to use a back-propagation learning algorithm [89] to update the weights and bias values by minimizing the error, where the error is the mean square of the difference between the predicted and actual outputs. We use one of the popular optimization algorithms, gradient descent with momentum [90], to minimize the error function. The momentum term may also help prevent the learning process from being trapped in a local minimum and is usually chosen within the interval [0, 1]. Finally, the estimated model is evaluated on a separate hold-out sample that is not exposed to the training process.

FIG. 6: A schematic diagram of a feed-forward neural network with one hidden layer consisting of Q neurons and one output layer. A set of P inputs {y_{t−1}, y_{t−2}, . . . , y_{t−P}} is fed through the input layer. The activation functions are log-sigmoid (denoted 'logsig') for the hidden layer and linear for the output layer, respectively. The input weight parameters are w_{i,j}, whereas the output weight connections are w_j.
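The following is a minimal sketch of the forward pass of the shallow FFNN of Eqn. (7): P lagged inputs, Q logistic hidden units, and one linear output node. The weights are random placeholders and training is omitted (the paper uses back-propagation with gradient descent plus momentum); the function name ffnn_forecast is illustrative.

```python
import numpy as np

def logsig(x):
    return 1.0 / (1.0 + np.exp(-x))                 # g(x) in the text

def ffnn_forecast(lags, W_hidden, b_hidden, w_out, b_out):
    """lags: (P,) vector (y_{t-1}, ..., y_{t-P}); W_hidden: (Q, P); w_out: (Q,)."""
    hidden = logsig(W_hidden @ lags + b_hidden)     # hidden-layer activations
    return w_out @ hidden + b_out                   # linear output: one-step forecast

rng = np.random.default_rng(2)
P, Q = 5, 100                                       # 100 hidden neurons, as in Sec. III A
W_h, b_h = rng.normal(size=(Q, P)), rng.normal(size=Q)
w_o, b_o = rng.normal(size=Q), 0.0
print(ffnn_forecast(rng.normal(size=P), W_h, b_h, w_o, b_o))
```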
Training the FFNN model is a time-consuming process and can also give reduced prediction accuracy. To overcome this, another powerful computational approach, namely reservoir computing (RC), has been developed based on recurrent neural networks. RC maps input signals into higher-dimensional computational spaces through the dynamics of a fixed, nonlinear system called a reservoir [91]. RC is a generalization of earlier neural network architectures such as recurrent neural networks, liquid-state machines, and echo-state networks [29]. Usage of RC is abundant at this time [92], and recent applications of RC are evident in time series forecasting [93, 94], robot motor control [95], and speech recognition [96], among many others. The basic architecture of the reservoir computer is delineated below.

FIG. 7: A schematic diagram of the reservoir computer consisting of an input layer, a reservoir, and an output layer. For an input vector {x_1, x_2, . . . , x_n}, an output vector {y_1, y_2, . . . , y_n} is calculated. The input and output weight matrices are [W_in]_{m×n} and [W_out]_{m×n}, respectively. W_res is a weighted random matrix of order m × m, and u(t) is the hidden state vector at time t.

The reservoir computer consists of an input layer, a reservoir, and an output layer; a schematic diagram is portrayed in Fig. 7. An input vector x = (x_1, x_2, . . . , x_n)^T of dimension n is sent from the input layer into the reservoir through the linear input weight matrix W_in of order m × n, whose elements lie within [−1, 1]. An m-dimensional reservoir state vector u(t) (or hidden state vector) describes the reservoir neuron activations and follows the deterministic model

u(t + ∆t) = (1 − β) u(t) + β tanh( W_res u(t) + W_in x(t) + W_b ),

where ∆t is the time-discretization step; we use ∆t = 1 in this study. Here, W_res denotes the recurrent weighted adjacency matrix of order m × m of the reservoir. This matrix is a sparse random Erdös–Rényi matrix whose non-zero entries are drawn uniformly from [−1, 1]. All the elements of W_res are rescaled uniformly so that the largest magnitude of its eigenvalues becomes a pre-defined positive real number, called the 'spectral radius' of W_res. The leakage rate is denoted by β and lies in (0, 1]. We set the spectral radius and the leakage rate to 0.001 and 0.5, respectively, in our experiments. The bias vector of order m × 1 is denoted by W_b, and its elements are 1. A linear output weight matrix W_out of order m × n is constructed to generate the output vector ŷ = (y_1, y_2, . . . , y_n)^T at time t as ŷ(t) = W_out^T u(t). The elements of W_out are adjusted in the training phase by linear regression, specifically ridge regression [97], which minimizes the mean square error between the network output and the training target data; the output weights are computed as

W_out = ( X X^T + λ I )^{−1} X Y_target^T,

where λ is the ridge regularization parameter, and Y_target and X consist of the target output data and the reservoir states collected at each training time iteration. If p is the number of training time steps, then the orders of Y_target and X are n × p and m × p, respectively. Since the concept of reservoir computing stems from the use of recursive connections within neural networks to create a complex dynamical system, it is highly useful in the forecasting of dynamics [32, 33, 98].
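The following is a minimal sketch of the reservoir computer described above: a sparse random reservoir rescaled to a chosen spectral radius, the leaky state update (in the standard leaky-integrator form assumed in the reconstruction of the state equation), and a ridge-regression readout. The hyperparameter values and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, beta_leak, rho, lam = 1, 400, 0.5, 0.001, 1e-6   # dims, leak rate, spectral radius, ridge

W_in = rng.uniform(-1, 1, size=(m, n))
W_res = rng.uniform(-1, 1, size=(m, m)) * (rng.random((m, m)) < 0.1)  # sparse Erdos-Renyi
W_res *= rho / np.max(np.abs(np.linalg.eigvals(W_res)))               # set spectral radius
W_b = np.ones(m)

def step(u, x):
    # leaky-integrator update of the reservoir state
    return (1 - beta_leak) * u + beta_leak * np.tanh(W_res @ u + W_in @ x + W_b)

# drive the reservoir with a scalar series and fit the readout by ridge regression
series = np.sin(np.linspace(0, 50, 1000))
u, states = np.zeros(m), []
for x_t in series[:-1]:
    u = step(u, np.array([x_t]))
    states.append(u)
X = np.array(states).T                        # m x p collected reservoir states
Y = series[1:][None, :]                       # n x p one-step-ahead targets
W_out = np.linalg.solve(X @ X.T + lam * np.eye(m), X @ Y.T)   # ridge readout
y_hat = W_out.T @ u                           # prediction from the last state
print('readout shape:', W_out.shape, 'last prediction:', y_hat)
```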
To overcome the problems of speed and stability in recurrent neural networks, long short-term memory (LSTM) was proposed by Hochreiter and Schmidhuber [60]. The LSTM is similar to the recurrent neural network, but in this model a new concept is introduced, with one cell or interaction per module. As presented in Fig. 8, the LSTM is a chain-like structure capable of remembering information and of long-term training, with four network layers. Moreover, the LSTM is designed in such a way that the vanishing gradient problem is overcome; for this reason, LSTM has become a popular choice in various applied fields [99, 100]. The LSTM network consists of memory blocks, called cells. As usual, a hidden state is introduced in the LSTM. In addition, the memory of the LSTM is carried by a cell state, and the cell state is updated by removing or adding information as it passes through three different gates, namely the forget gate, the input gate, and the output gate. With these gates, long-term dependency problems can be avoided while the LSTM memorizes information. We now focus on how the cell and hidden states are modified for the input vector x_i ∈ R^n at the i-th time step.

Forget gate: The forget gate decides which information is removed from the cell state c_i at the i-th time step. Let h_{i−1} ∈ R^m be the hidden state at the previous, (i−1)-th, step. By applying the sigmoid activation function σ to the combination of x_i and h_{i−1} with their corresponding weight matrices W_{x1} and W_{h1} and bias vector b_1 ∈ R^m, the forget gate's activation vector F_i ∈ R^m is generated as

F_i = σ( W_{x1} x_i + W_{h1} h_{i−1} + b_1 ),

where the orders of the matrices W_{x1} and W_{h1} are m × n and m × m, respectively. The result lies within [0, 1], where 0 indicates forgetting the information completely, whereas 1 signifies keeping the information as it is.

Input gate: The input gate decides how much new information is added to the cell state. The input gate's activation vector I_i ∈ R^m is produced using a sigmoid activation function as

I_i = σ( W_{x2} x_i + W_{h2} h_{i−1} + b_2 ).

In addition, a vector of new candidate values g_i ∈ R^m is created through the tanh activation function as

g_i = tanh( W_{x3} x_i + W_{h3} h_{i−1} + b_3 ).

Here, the orders of W_{x2} and W_{x3} are m × n, the orders of W_{h2} and W_{h3} are m × m, and b_2, b_3 ∈ R^m are bias vectors. The term I_i ⊙ g_i contributes to updating the cell state, where the operator ⊙ represents point-wise multiplication between two vectors. Finally, the combination of the input-gate and forget-gate terms is employed to update the previous cell state into the current cell state:

c_i = F_i ⊙ c_{i−1} + I_i ⊙ g_i.

Output gate: Finally, the hidden state is modified in the output gate. The output gate's activation vector is determined as

O_i = σ( W_{x4} x_i + W_{h4} h_{i−1} + b_4 ),

where W_{x4} ∈ R^{m×n} and W_{h4} ∈ R^{m×m} are weight matrices and b_4 ∈ R^m is a bias vector. The current hidden state is updated by applying the tanh activation function to the current cell state:

h_i = O_i ⊙ tanh(c_i),

where the initial values are set as c_0 = 0 and h_0 = 0. The output vector is then calculated from the hidden state at time step i as

ŷ_i = σ( W h_i + b_5 ),

where σ is the activation function, W ∈ R^{n×m} is a weight matrix, and b_5 ∈ R^n is a bias vector. As a result of this state transition within the LSTM, a trained LSTM network with a fixed set of weight vectors can still exhibit different predictive behavior for different time series.
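To close the appendix, the following is a minimal sketch of a single LSTM cell step implementing the gate equations above (forget, input, candidate, cell update, output, hidden update). The weights are random placeholders; a trained network would learn them by back-propagation, and all names here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """x: (n,) input; h_prev, c_prev: (m,) previous hidden/cell states;
    W: dict of weight matrices W_x*, W_h* and bias vectors b_*."""
    F = sigmoid(W['Wx1'] @ x + W['Wh1'] @ h_prev + W['b1'])   # forget gate
    I = sigmoid(W['Wx2'] @ x + W['Wh2'] @ h_prev + W['b2'])   # input gate
    g = np.tanh(W['Wx3'] @ x + W['Wh3'] @ h_prev + W['b3'])   # candidate values
    c = F * c_prev + I * g                                    # cell-state update
    O = sigmoid(W['Wx4'] @ x + W['Wh4'] @ h_prev + W['b4'])   # output gate
    h = O * np.tanh(c)                                        # hidden-state update
    return h, c

rng = np.random.default_rng(4)
n, m = 1, 8
W = {}
for k in range(1, 5):
    W[f'Wx{k}'] = rng.normal(size=(m, n))
    W[f'Wh{k}'] = rng.normal(size=(m, m))
    W[f'b{k}'] = np.zeros(m)

h, c = np.zeros(m), np.zeros(m)                # c_0 = 0 and h_0 = 0, as in the text
h, c = lstm_step(np.array([0.5]), h, c, W)
print(h.round(3))
```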