key: cord-0254567-z0gpwi4x
authors: Baur, Sebastian; Rath, Christoph
title: Predicting high-dimensional heterogeneous time series employing generalized local states
date: 2021-02-24
journal: nan
DOI: 10.1103/physrevresearch.3.023215
sha: ff3b980c4b7b7d47762cfdc15b084a9071558b5f
doc_id: 254567
cord_uid: z0gpwi4x

We generalize the concept of local states (LS) for the prediction of high-dimensional, potentially mixed chaotic systems. The construction of generalized local states (GLS) relies on defining distances between time series on the basis of their (non-)linear correlations. We demonstrate the prediction capabilities of our approach based on the reservoir computing (RC) paradigm using the Kuramoto-Sivashinsky (KS), the Lorenz-96 (L96) and a combination of both systems. In the mixed system a separation of the time series belonging to the two different systems is made possible with GLS. More importantly, prediction remains possible with GLS, where the LS approach must naturally fail. Applications for the prediction of very heterogeneous time series with GLSs are briefly outlined.

Tremendous advances in predicting the short and long term behavior of complex systems have been made in recent years by applying machine learning [1-7]. Reservoir computing turned out to be a very promising machine learning approach, as it combines outstanding performance with conceptual advantages such as very fast and comparably transparent learning and possibly appealing hardware realizations [8, 9]. For high-dimensional systems, reservoir computing, like other machine learning methods, suffers from the "curse of dimensionality": the number of nodes of the network representing the reservoir has to be considerably larger than the dimensionality of the input data, which renders training with a naive reservoir computing approach infeasible. With a parallel prediction scheme based on local states [10], however, the forecasting of high-dimensional chaotic spatiotemporal systems of arbitrarily large extent becomes possible [11, 12]. The definition of local states relies on defining spatially local neighborhoods for each time series to be predicted. Thus, knowledge of the position of each time series in space is a necessary prerequisite for defining local states. The similarity of time series can, however, be defined in a much more general way by deducing a distance measure, and thus a local neighborhood, from the correlations among the time series [13].
Those generalized similarities led, among others, to a reasonable, fully data-driven taxonomy of the stock market [14], while crucial differences between the linear and nonlinear correlation structure of the stock market, especially during crises, were reported later [15]. In this Article, we employ this approach to define generalized local states for the prediction of high-dimensional systems, with which some of the shortcomings of the local states approach can be overcome.

For modeling high-dimensional, spatiotemporal, chaotic systems, the L96 and KS systems have become widely used in the RC community [3, 16-18]. The L96 system [19, 20] is defined as

    dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F,  i = 1, ..., D (with cyclic indices),    (1)

where x(t) is the D-dimensional state vector of the system and F the forcing parameter. In this study, we set D = 40 and F = 5, resulting in a chaotic system for which we calculate a maximal Lyapunov exponent (MLE) of Λ_max = 0.45 using Sprott's method of orbit separation (OS) [21]. We use the fourth-order Runge-Kutta method [22] for the simulation, with a time step of ∆t = 0.05. The second model, widely used to model a variety of weakly turbulent fluid systems [23, 24], is the KS system [25]. Its PDE reads

    ∂u/∂t = -u ∂u/∂x - ∂²u/∂x² - ∂⁴u/∂x⁴,    (2)

where the field u(x, t) is defined on a domain of size L. In this study, we use a domain size of L = 22 with periodic boundary conditions u(x + L, t) = u(x, t) for all 0 ≤ x ≤ L. For the numerical treatment, the equation is discretized on a grid of D = 40 points, the same size as the L96 system, and numerically integrated, with a time step of ∆t = 0.5, using the fourth-order time-stepping method ETDRK4 [26]. Using OS we find an MLE of Λ_max = 0.049.

We use a standard setup for reservoir computing. At the center of RC is a network, in the following called the reservoir A, created as a sparse random D_r × D_r network with average node degree κ. After network generation, all its random, uniformly distributed connection strengths are scaled such that the matrix has a predetermined fixed spectral radius ρ. The D_in-dimensional input x(t) interacts with the reservoir state r(t) through an input coupler, which in our case is a sparse D_r × D_in matrix W_in. Following [1], W_in is created such that one element in each row is chosen uniformly from [-ω, ω], where ω is the input coupler scaling parameter. All other elements of W_in are zero. The input data then connects with the reservoir state r(t) of the previous time step via the activation function tanh(·) to advance the reservoir state by one step in time,

    r(t + ∆t) = tanh(A r(t) + W_in x(t)).    (3)

We choose a simple matrix as output coupler W_out, whose elements are determined in the training phase via ridge regression,

    W_out = argmin_W Σ_t || W r̃(t) - y_T(t) ||² + β ||W||²,    (4)

where y_T(t) is the D_out-dimensional target output and β the ridge regression parameter. Here r̃ is a nonlinear transformation of the reservoir state r, chosen to be r̃ = [r, r²]^T = [r_1, r_2, ..., r_{D_r}, r_1², r_2², ..., r_{D_r}²]^T. This blow-up of the reservoir states, first introduced by Lu et al. [16], actually serves to break the symmetries in the reservoir equations [27]. To avoid that the arbitrary initial state of the reservoir influences the regression results, training only starts after a washout phase of 20000 time steps. Once trained, the output y(t) can be calculated from the reservoir states r(t) as y(t) = W_out r̃(t). When using RC for prediction, it can then be run autonomously by using the prediction of the previous time step y(t) as the input x_pred(t) to calculate the next predicted time step y(t + ∆t). In this case one finds

    r(t + ∆t) = tanh(A r(t) + W_in x_pred(t)).    (5)
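To make the training and closed-loop prediction steps above concrete, the following is a minimal, self-contained sketch in Python. It is not the authors' code: the reservoir dimension is scaled down from the value in Table I, and the function names (make_reservoir, run, train, predict) are purely illustrative. It implements the state update, the ridge regression on the blown-up states r̃ = [r, r²], and the autonomous prediction loop of Eq. (5).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(d_r=500, d_in=5, kappa=3, rho=0.5, omega=0.5):
    """Sparse random reservoir A (rescaled to spectral radius rho) and input coupler W_in."""
    A = np.zeros((d_r, d_r))
    for i in range(d_r):                                  # about kappa connections per node
        idx = rng.choice(d_r, size=kappa, replace=False)
        A[i, idx] = rng.uniform(-1.0, 1.0, size=kappa)
    A *= rho / np.max(np.abs(np.linalg.eigvals(A)))       # fix the spectral radius
    W_in = np.zeros((d_r, d_in))
    for i in range(d_r):                                  # one nonzero entry per row in [-omega, omega]
        W_in[i, rng.integers(d_in)] = rng.uniform(-omega, omega)
    return A, W_in

def run(A, W_in, x, r0=None):
    """Drive the reservoir with the input sequence x of shape (T, d_in); return all states."""
    r = np.zeros((x.shape[0], A.shape[0]))
    prev = np.zeros(A.shape[0]) if r0 is None else r0
    for t in range(x.shape[0]):
        prev = np.tanh(A @ prev + W_in @ x[t])            # Eq. (3)
        r[t] = prev
    return r

def train(r, y, beta=1e-6):
    """Ridge regression for W_out on the nonlinear states r_tilde = [r, r**2]."""
    rt = np.hstack([r, r**2])
    return np.linalg.solve(rt.T @ rt + beta * np.eye(rt.shape[1]), rt.T @ y).T

def predict(A, W_in, W_out, r_last, n_steps):
    """Closed-loop prediction: feed each output back as the next input, Eq. (5)."""
    r, out = r_last, []
    for _ in range(n_steps):
        y = W_out @ np.concatenate([r, r**2])             # y(t) = W_out r_tilde(t)
        r = np.tanh(A @ r + W_in @ y)                     # x_pred(t) = y(t)
        out.append(y)
    return np.array(out)
```

In this convention the state r(t) produced by the input x(t - ∆t) is trained to output x(t), so one would train one step ahead, e.g. W_out = train(run(A, W_in, x_train)[:-1], x_train[1:]), discard a washout period before the regression, and start predict from the final synchronized reservoir state.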
GLS is based on the LS approach proposed by Parlitz et al. [10] and used for RC prediction by Pathak et al. [3]. While non-local implementations of RC algorithms use just one network to process all input data, LS and GLS partition the input data into multiple subsets of smaller dimension, each with their own reservoir. The reservoirs are then trained on and predict these subsets only. This provides an effective workaround for the curse of dimensionality by essentially parallelizing the prediction of one high-dimensional dataset using many lower-dimensional subsets. These subsets are in the following called neighborhoods. Each neighborhood consists of a number of core variables and neighbor variables and is assigned its own reservoir. Each reservoir then uses only the variables of the input making up its neighborhood to predict its core variables as accurately as possible. As such, the input time series for the reservoir assigned to the i-th neighborhood is no longer the full D_in-dimensional input time series x, but instead a slice x^i unique to this neighborhood, of dimension D^i_in ≤ D_in. Similarly, the trained output y^i of the i-th neighborhood is given by just its core variables and hence is an even smaller subset of the neighborhood, of dimension D^i_out ≤ D^i_in. These different neighborhoods are depicted in Figure 1.

[Figure 1 (caption): (b) LS RC: the highlighted neighborhood has one core variable x_i and four locally adjacent neighbors; these five variables are used as input for the reservoir R_i to predict the core variable's future state y_i; many more reservoirs exist, each with their own neighborhood. (c) GLS RC: as in (b), the neighborhood has one core variable and four neighbors, with the crucial difference that the neighbors do not have to be locally adjacent to the core.]

Between prediction steps the neighborhoods need a new input which, as typically D^i_out < D^i_in, can only come from the other neighborhoods' predictions. Hence, before each prediction step a new input vector x_pred is formed from the combined core-variable predictions of all neighborhoods. As a consequence, each variable of the original input must appear in one and only one neighborhood as a core variable. Pathak et al. originally introduced their LS approach in the context of the KS system, a purely locally interacting system. As such, their neighborhoods were chosen to have only spatially adjacent cores surrounded by contiguous buffer regions of neighbors (see Figure 1b). For systems where such a local interaction is not present, this scheme cannot be used. Nonetheless, the idea of locality in the sense of importance to a prediction can be generalized to something we will call similarity. As the choice of neighborhoods is essentially arbitrary, this allows creating neighborhoods not from the variables locally closest to the core variables, but by instead including the variables most similar to the cores as neighbors. Such a non-local GLS neighborhood is shown in Figure 1c. The effectiveness of this procedure of course depends on an appropriate choice of neighborhoods, such that the information necessary to predict the core variables is present in their (non-local) neighbors.

Our first measure to estimate the similarity of the time series is the (linear) cross-correlation (CC) coefficient C_{x_i, x_j}, where x_i is the i-th variable of the time series x. To transform the CC into a similarity measure (SM), we take its absolute value and associate a larger value with higher similarity. As such, both a high correlation and a high anti-correlation correspond to a high similarity.
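As a concrete illustration of the CC similarity measure and of the GLS neighborhood construction described above, the following sketch (our own simplification, not the published code) builds one neighborhood per core variable by taking the k variables most similar to that core as its neighbors; the dictionary layout and the tie handling are illustrative assumptions.

```python
import numpy as np

def cc_similarity(x):
    """|C_{x_i, x_j}|: absolute linear cross-correlation between all variables.

    x has shape (T, D): T time steps, D variables (one column per time series)."""
    return np.abs(np.corrcoef(x, rowvar=False))

def build_neighborhoods(similarity, n_neighbors):
    """One neighborhood per core variable: the core plus its n_neighbors most
    similar variables according to the given D x D similarity matrix."""
    neighborhoods = []
    for core in range(similarity.shape[0]):
        order = np.argsort(similarity[core])[::-1]        # most similar first
        neighbors = [j for j in order if j != core][:n_neighbors]
        neighborhoods.append({"core": core, "members": [core] + neighbors})
    return neighborhoods
```

Each reservoir is then trained on x[:, nb["members"]] and predicts only x[:, nb["core"]]; for the spatial (LS) case the same structure applies, with the members given by the core's locally adjacent indices instead of the most similar ones.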
Considering that chaotic time series are by their very nature nonlinear, we use one more measure that captures these nonlinear relationships: the mutual information (MI). As we are working with long (10^5 time steps), well-behaved time series, we implement the widely used binning method [15, 28, 29] as an estimator for the MI. Akin to [15], we find empirically that for our time series of length T = 10^5 a choice of √T/2 ≈ 158 bins of equal size works well. We normalize the MI as described by Strehl et al. [30], leading to our MI SM

    MI_SM(X_i, X_j) = I(X_i, X_j) / sqrt( H(X_i) H(X_j) ),

where I(X_i, X_j) is the MI, while H(X_i) and H(X_j) are the Shannon entropies of X_i and X_j, respectively.

Once a SM has been chosen and calculated for all variables of the full input data, one needs to use it to create the neighborhoods. As the prediction output of each neighborhood is only given by its core variables, predicting these core variables as accurately as possible is most important. While a variety of choices are possible, we restrict ourselves to associating each neighborhood with exactly one core variable. In the LS case, the neighbors of the core are simply the variables spatially closest to that core. We will call this a spatial neighborhood (SN). In the GLS case, the exact analog is possible, where the neighbors of the core are the variables most similar to it as defined by the SM. We call the corresponding neighborhoods CC or MI neighborhoods, depending on the similarity measure used. Lastly, it should be emphasized that the GLS method is in principle independent not only of the specific SM used, but also of the chosen prediction method. Many other choices for the SM, e.g. the transfer entropy, or for the network, e.g. LSTMs, are conceivable.

To enable a fair comparison of the different neighborhood generation methods, the RC hyperparameters are optimized for the LS neighborhoods and then copied for the GLS neighborhoods without further adjustments. Furthermore, we added noise to all training data, following Vlachas et al. [2]. Without this noise we found the short term prediction accuracy to often be higher, but at the cost of an increased rate of failed realizations and lower quality long term predictions in general. The added noise proved decisive in minimizing the variance between network realizations and reducing the number of failed realizations, especially for the L96 system. Heuristically, we found normally distributed noise with standard deviation σ_noise = 1% σ_data, where σ_data is the standard deviation of the training data, to be a sweet spot. The hyperparameters used for all reservoirs are given in Table I.

Table I. Reservoir computing hyperparameters.
    reservoir dimension D_r           5000
    average node degree κ             3
    spectral radius ρ                 0.5
    input coupler scaling ω           0.5
    ridge regression parameter β      10^-6
    noise level α                     1%

Transient effects of the simulated data were discarded before any synchronization, training or prediction took place. Similarly, each reservoir training and prediction is preceded by 2000 synchronization steps. All reservoirs were trained for a total of 10^5 time steps using the noisy training data. To calculate the SMs we use the same noisy data used to train the reservoirs. As a result, all neighborhoods shown here are representative examples, but not identical for all realizations.
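A minimal sketch of the binned MI similarity measure described above follows; the equal-size binning and the normalization by the geometric mean of the Shannon entropies follow the description in the text, while the function name and the estimator details are our own simplification.

```python
import numpy as np

def mi_similarity(xi, xj, n_bins):
    """Normalized mutual information of two time series, estimated by binning.

    Returns I(X_i, X_j) / sqrt(H(X_i) * H(X_j)), the normalization described in the text."""
    p_ij, _, _ = np.histogram2d(xi, xj, bins=n_bins)      # joint histogram on equal-size bins
    p_ij /= p_ij.sum()                                    # joint probabilities
    p_i, p_j = p_ij.sum(axis=1), p_ij.sum(axis=0)         # marginal probabilities
    h_i = -np.sum(p_i[p_i > 0] * np.log(p_i[p_i > 0]))    # Shannon entropies
    h_j = -np.sum(p_j[p_j > 0] * np.log(p_j[p_j > 0]))
    outer = np.outer(p_i, p_j)
    mask = p_ij > 0
    mi = np.sum(p_ij[mask] * np.log(p_ij[mask] / outer[mask]))
    return mi / np.sqrt(h_i * h_j)
```

For the T = 10^5 sample time series used here, one would call this with the roughly 158 equal-size bins quoted in the text and fill a D × D similarity matrix pairwise.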
The neighborhoods are calculated for a single core and 18 neighbors in the case of the SN and MI neighborhoods, and 28 neighbors in the case of the CC neighborhoods. The fixed neighborhood sizes for SN and MI were chosen to be similar to Pathak's original paper [3], which we found to be a good compromise between computation speed (small neighborhoods) and prediction accuracy (large neighborhoods). While for SN and MI this results in good predictions, for the CC SM this leads to essentially all predictions diverging. This is likely the result of the linear CC SM not recognizing the importance of the core's nearest neighbors in these systems. As such, the CC neighborhood size was increased to a total size of 29, the minimum at which no predictions diverged. CC neighborhoods of different sizes for the L96 system are discussed in appendix A. Example neighborhoods for the L96 system are shown in Figure 2. While the SN neighborhoods are simply defined as a single core and its nearest neighbors, the CC and MI neighborhoods warrant a closer look. First and foremost, even though the MI neighborhoods were calculated purely from the data, without directly using the knowledge of L96 being a locally interacting system, the resulting MI neighborhoods closely resemble the SN neighborhoods. The CC neighborhoods, in contrast, include many variables spatially far away from the core.

100 distinct random network realizations are generated for each system-SM combination. They are then trained and used to predict the same 300 sections of 10 Lyapunov times length on the chaotic attractors of the L96 and KS systems. To quantify the short term prediction accuracy, we define the normalized root mean square error (NRMSE) as

    NRMSE(ŷ, y) = sqrt( (1/D) Σ_i (ŷ_i - y_i)² ) / (y_max - y_min),

where ŷ ∈ R^D is the prediction at a single time step, y ∈ R^D the true signal, and y_max (y_min) the largest (smallest) value taken by any variable in the simulated data set. Short term prediction results are shown in Figure 3. Looking at the short term predictions of the L96 system, it is very striking that the averaged NRMSEs of the SN and MI neighborhoods coincide more or less exactly, while the CC prediction is significantly worse. This performance drop is likely the result of the CC neighborhoods' inclusion of many variables far from the, ostensibly most important, close region around the core (see Figure 2b). It should be noted that the statistical stability assessed here for the first time is remarkable. Often, RC algorithms exhibit a much larger spread of prediction qualities between different random networks [31]. This stability is partly attributable to finding the correct hyperparameters for the systems at hand, as suboptimally chosen hyperparameters often leave the best predictions intact while drastically increasing the variance towards completely failed predictions [17, 31]. However, we found empirically that the role of the noise added to the training data in achieving this stability, especially for the L96 system, cannot be overstated either.

Using the OS method, we can calculate the MLEs of the predictions as another characteristic to quantify long term prediction accuracy. For this purpose we let each of our reservoir realizations predict three distinct sections of the system attractor, each 1000 Lyapunov times in length, which are then used to calculate the MLEs shown in Table II. The MLEs of all three neighborhoods agree well with the MLEs calculated directly from the simulations. The details of the OS procedure are described in appendix B.
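For reference, the NRMSE defined above and used for all short term comparisons in this work can be written compactly as follows; this is a minimal sketch, and the function name and argument conventions are ours.

```python
import numpy as np

def nrmse(y_pred, y_true, y_min, y_max):
    """Root mean square error over all D variables at one time step,
    normalized by the full value range of the simulated data set."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2)) / (y_max - y_min)
```

Here y_min and y_max are taken over the whole simulated data set, so that error curves of different systems are on a comparable scale.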
In preparation for the non-local system discussed in the following section, we predict and analyze the KS system as we have just done for the L96 system. Example neighborhoods for the KS system are shown in Figure 4. Looking at the short term predictions of the KS system shown in Figure 5, it is striking that the averaged NRMSEs of all three neighborhoods coincide more or less exactly. This is remarkable especially when compared to the L96 system, for which the CC neighborhoods performed significantly worse. Using OS to calculate the MLEs, we again find them to agree excellently with the simulated MLE.

To test the usefulness of GLS for non-locally interacting systems, we use the KS and L96 systems to artificially create such a non-locally interacting test system. As depicted in Figure 6, we do this by concatenating both systems and then randomly shuffling the 80 variables of the combined system. As this combined system is now a composite of two systems with different time steps, it does not have a well defined Lyapunov exponent anymore. For the sake of consistency we nonetheless continue the time axis rescaling in terms of Lyapunov times. To do so we use the larger of the two systems' time steps per Lyapunov time as calculated in Table IV, hence at worst slightly underestimating our short term prediction results. As before, we can also calculate the SN, CC and MI neighborhoods for this new concatenated system. The CC and MI neighborhoods are shown in Figure 7. The SN neighborhoods have been omitted from this figure as they are, by definition, always the same. Fascinatingly, both the CC and MI neighborhoods in the combined but not yet shuffled system look like they are composed of the individual systems' neighborhoods. In fact, exactly this is the case, as both the CC and the MI SMs are able to completely separate the KS and L96 systems. In the case of experimental data, such an analysis of the similarity structure can be highly informative, as it exposes the underlying relationships between dimensions.

As before, we quantify the short term prediction accuracy using the NRMSE. The results are shown in Figure 8. Immediately noticeable is the almost instant divergence of all SN predictions. This is of course expected, considering that the SN neighborhoods assume a locally interacting system, which the shuffled system is not. Furthermore, the NRMSE of the CC and MI neighborhoods is the combination of the NRMSEs of the individual systems. This, again, makes sense due to the perfect separation between the L96 and the KS system shown in Figure 7. Consequently, the MI SM delivers significantly better results than the CC SM for the combined system. For the long term statistical analysis we cannot use the MLE, as it is not well defined due to the different time step sizes of the sub-systems. Therefore, we look at the probability distribution functions (PDFs) to roughly estimate the system climate, as also done in [4]. The details of the histogram generation are described in appendix C. CC and MI PDFs are depicted in Figure 9. The predicted PDFs show excellent agreement with the simulated data. Similarly to the MLE results, the worse short term predictive performance of the CC SM compared to the MI SM is not reflected in its climate reproduction quality.
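A sketch of the PDF-based climate check used here and detailed in appendix C: both the simulated and the predicted long term data sets are flattened and binned into histograms with identical bin positions and sizes. Fixing the bin edges from the simulated data is our assumption for the sketch; the text only requires that both histograms share the same bins.

```python
import numpy as np

def climate_pdfs(simulated, predicted, n_bins=100):
    """Histogram-based PDF estimates of simulation and prediction on identical bins.

    Both data sets are flattened, and the bin edges are fixed once (here from the
    simulated data) so that bin position and size are the same for both histograms."""
    sim, pred = np.ravel(simulated), np.ravel(predicted)
    edges = np.histogram_bin_edges(sim, bins=n_bins)
    pdf_sim, _ = np.histogram(sim, bins=edges, density=True)
    pdf_pred, _ = np.histogram(pred, bins=edges, density=True)
    return edges, pdf_sim, pdf_pred
```

Comparing pdf_sim and pdf_pred then gives a rough measure of how well the predicted attractor reproduces the system climate.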
We have proposed and discussed a generalization of the concept of local states in the sense of using similarity measures derived from correlations among (instead of spatial distances between) time series. This offers a much more versatile approach for the prediction of high-dimensional complex systems. First, generalized local states can still make excellent predictions in the case of mixed systems, where local states are doomed to fail. Here it is worth noticing that the perfect neighborhood separation we found in the combined KS-L96 system (see Figure 7) suggests that generalized local states can be used to separate mixed data sets and thus to infer the different origins of a set of heterogeneous time series for which the generating processes are unknown. Second, prediction of high-dimensional systems remains feasible when no spatial information is available for some or even all of the time series. This is increasingly the typical case in real world applications analyzing heterogeneous data sets, ranging from remote sensing data, financial data and social media data (e.g. Instagram, Twitter) to, nowadays, infection rates during the COVID-19 pandemic, and combinations of the aforementioned. Current research explores applications of predictions based on generalized local states in those cases.

We wish to acknowledge valuable discussions with D. Mohr, P. Huber, J. Aumeier, Y. Mabrouk, and J. Herteux.

Appendix A: CC neighborhood sizes

As discussed in the text, the CC neighborhood sizes were necessarily chosen to be significantly larger than the SN and MI neighborhood sizes. This necessity is likely the result of the CC SM not recognizing the importance of the core's nearest neighbors in the L96 and KS systems. As such, the CC neighborhood size was increased to a total size of 29, the minimum at which no predictions diverged and at which the core's nearest neighbors were consistently included in its neighborhood. CC neighborhoods of size 19, 27 and 29 for the L96 system are depicted in Figure 10. While the true importance of each variable for the prediction of another is of course unknown, at least for the locally interacting L96 and KS systems studied here the spatial nearest neighbors of the cores are critical for achieving a sensible prediction.

Appendix B: MLE calculation via orbit separation

Typically, the most important Lyapunov exponent is considered to be the MLE, defined as the largest Lyapunov exponent of a given chaotic system. Its importance stems from the fact that it is intimately tied to the predictability of the system: given an MLE Λ_max, two infinitesimally close trajectories in phase space, initially separated by the vector δx(t = 0), diverge as

    |δx(t)| ≈ |δx(0)| e^{Λ_max t},    (B1)

where t is the time since separation [32]. To calculate the MLE we use Sprott's method of OS [21]. By taking the logarithm and the average ⟨·⟩ over many trajectory divergences in different parts of the chaotic attractor, we find

    ⟨ ln |δx(t)| ⟩ ≈ ⟨ ln |δx(0)| ⟩ + Λ_max t.    (B2)

Note that for this equation to hold, we only use the divergence data after transient effects have subsided, but before the divergence saturates at the size of the chaotic attractor. From this, we can use a simple linear least squares fit to calculate the MLE. As described in the main text, we base the OS calculation on the three long term prediction datasets of 1000 Lyapunov times each. For each of the 100 realizations we choose 50 trajectory positions, uniformly distributed in the first of the three long term datasets, as starting points for the orbit separation, at which we add normally distributed noise with standard deviation σ_noise = 10^-10 σ_r to the internal reservoir states. From this perturbed internal state, we let the reservoir predict 1500 time steps and compute the separation magnitude with respect to the unperturbed predicted time series, using the least squares fit of equation (B2) as described above. The MLEs of the L96 and KS simulations are computed analogously, using one 10^5 time step long dataset from which 10^4 uniformly distributed starting positions for the trajectory divergence are chosen. The resulting MLEs serve as the simulation reference values quoted in the main text.
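The least squares estimate of Λ_max from the averaged log-divergence of equation (B2) might look like the following sketch; the perturbation of the reservoir states and the averaging over many starting points are assumed to have been done already, and the function name is illustrative.

```python
import numpy as np

def mle_from_divergence(log_divergence, dt):
    """Largest Lyapunov exponent from the averaged log orbit separation.

    log_divergence: <ln |delta_x(t)|> sampled once per time step, already averaged
    over many perturbed trajectories and restricted to the linear-growth regime
    (after transients, before saturation at the attractor size)."""
    t = dt * np.arange(len(log_divergence))
    slope, _ = np.polyfit(t, log_divergence, deg=1)   # slope of Eq. (B2) is Lambda_max
    return slope
```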
Appendix C: PDF Estimation

PDF estimation was done with the same data sets used to calculate the MLE. The three resulting predictions, and likewise the reference data sets from the simulation, are each treated as one single data set of 3000 Lyapunov times length. The resulting datasets are flattened and then used to create the histograms. The histograms themselves consist of 100 equally sized bins. To make the comparison easy, the bins are chosen such that bin position and size are the same for the simulated and predicted time series histograms.

References
[1] Attractor reconstruction by machine learning
[2] Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks
[3] Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach
[4] Data-driven super-parameterization using deep learning: Experimentation with multiscale Lorenz 96 systems and transfer learning
[5] Calibrated reservoir computers
[6] Reducing network size and improving prediction stability of reservoir computing
[7] Do reservoir computers work best at the edge of chaos?
[8] Accuracy of neural networks for the simulation of chaotic dynamics: Precision of training data vs precision of the algorithm
[9] Theory of neuromorphic computing by waves: Machine learning by rogue waves, dispersive shocks, and solitons
[10] Prediction of spatiotemporal time series based on reconstructed local states
[11] Model-free prediction of large spatiotemporally chaotic systems from data (supplemental material)
[12] Observing spatiotemporal dynamics of excitable media using reservoir computing
[13] Hierarchical structure in financial markets
[14] Dynamics of market correlations: Taxonomy and portfolio analysis
[15] Linear and nonlinear market correlations: Characterizing financial crises and portfolio optimization
[16] Reservoir observers: Model-free inference of unmeasured variables in chaotic systems
[17] Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics
[18] Detecting causality from time series in a machine learning framework
[19] Predictability: A problem partly solved
[20] Effects of stochastic parametrizations in the Lorenz '96 system
[21] Chaos and time-series analysis
[22] Numerical recipes in Fortran: The art of scientific computing
[23] Persistent propagation of concentration waves in dissipative media far from thermal equilibrium
[24] The Kuramoto-Sivashinsky equation: A caricature of hydrodynamic turbulence?
[25] Persistent propagation of concentration waves in dissipative media far from thermal equilibrium
[26] Fourth-order time-stepping for stiff PDEs
[27] Breaking symmetries of the reservoir equations in echo state networks
[28] A mathematical theory of communication
[29] Estimating mutual information
[30] Cluster ensembles: A knowledge reuse framework for combining multiple partitions
[31] Good and bad predictions: Assessing and improving the replication of chaotic attractors by means of reservoir computing
[32] A practical method for calculating largest Lyapunov exponents from small data sets, Phys. D Nonlinear Phenom.