key: cord-0864338-p9lgwk8w
authors: Puleio, Alessandro
title: Recurrent neural network ensemble, a new instrument for the prediction of infectious diseases
date: 2021-03-16
journal: Eur Phys J Plus
DOI: 10.1140/epjp/s13360-021-01285-3
sha: f9ab185bf0b4351cc24096480efd939e66ec3780
doc_id: 864338
cord_uid: p9lgwk8w

Infectious diseases have afflicted human beings since ancient times. They can be classified into two principal types: emerging diseases, which are caused by new pathogens, and re-emerging diseases, which are due to a new spread of a known pathogen. Both types can be further subdivided into natural, accidental or intentional spreads. The risk associated with infectious diseases has increased strongly in recent decades, especially because of globalisation, which creates denser and more efficient links between nations, so that a local infection may easily spread worldwide, as SARS-CoV-2 did in 2019–2020. The development of new methods to predict the spread of diseases is crucial. However, the variables are sometimes so numerous that classical algorithms fail in the prediction. The aim of this work is to investigate the use of an ensemble of recurrent neural networks for disease prediction, using real influenza data to train and develop an instrument capable of forecasting future influenza seasons. Two different studies have been conducted. The first investigates the influence of the neural network architecture and has been performed using 12 seasons to train the model and 3 seasons to test it. The second investigates the number of seasons needed to obtain a good prediction of future ones. The results demonstrate that this approach can ensure very high performance even with simple architectures. The ensemble approach also provides information about the uncertainty of the prediction, so that countermeasures can be taken as a function of that value.
In the future, this approach may be applied to many other types of disease.

Infectious diseases have affected human beings since ancient times; historical traces of epidemics caused by infectious diseases are reported in many manuscripts of the past. An early example is the "plague of Athens" of 430 BC, in which 25-35% of the population died [1], while a more recent one is the Spanish flu of 1918, with a toll of tens of millions of deaths around the world [1], up to the SARS-CoV-2 pandemic. To understand the impact of infectious diseases on humans, it is sufficient to consider that in 2004 fifteen million of the fifty-seven million deaths around the world (more than 25%) were caused by infectious diseases [1]. The term "infectious diseases" identifies all the pathologies caused by contact between humans and microorganisms, with a consequent interaction between the microorganism and the host's immune system [2]. The etiological agents of these diseases include viruses, bacteria and fungi, which can cause a state of disease after penetrating the human organism [2]. Infectious diseases can be classified as emerging infectious diseases (EIDs) or re-emerging infectious diseases. An EID is defined as any infectious disease that appears for the first time in humans, such as SARS [3]. Re-emerging infectious diseases are known diseases that caused great public health problems in the past and that return to the human population after a long period of absence, or that show an explosive increase in the number of cases in a short period [4]. There is also a third category, defined as deliberately emerging diseases [4].
This category refers to every spread of an infectious disease that originates from an accidental or voluntary dispersion of natural or bio-engineered microorganisms [4]. In many cases, EIDs are classified as zoonoses [4]. This type of disease is distinguished by the passage of a microorganism from animals to humans, and in many cases the human is a 'dead-end' host [4]. To understand the importance of this category, it is worth noting that 335 emerging infectious disease events were reported between 1940 and 2004 [3]. As regards re-emerging infectious diseases, the microorganisms underlying this category are generally characterised by their genetic variability, which gives these etiological agents great adaptive potential. An example is influenza A, with its segmented genome characterised by the phenomena of antigenic shift and drift [4, 5]. Antigenic drift (the accumulation of point mutations) and antigenic shift (the reassortment of the genome segments of two different influenza viruses within an animal host) create the conditions for evolutionary success that allow the viruses to pass between animals and humans [4, 5]. They also lead to the development of new types of virus, some of which are able to escape the human immune response built up against previous influenza viruses [4, 5]. These mechanisms may lead to new deadly pandemics in the future, as happened in 1918, 1957, etc. [4, 5]. Furthermore, different factors can influence the possibility of new emerging or re-emerging diseases [6].
Many factors influence emergence and re-emergence, such as ecological changes (resulting from agricultural and economic development or from climate change), demographic changes (as a consequence of economic development or of changes in human behaviour), economic development with the consequent trend towards globalisation (with an increase in travel and in the transport of goods), and technological and industrial development [6]. Moreover, the mutation and evolutionary capabilities of microorganisms lead to increasing drug resistance [6], with a consequent increase in the risk to human health. In this scenario, preparedness and efficient countermeasures play a key role in the prevention and control of future infectious diseases. The dynamics and the factors behind the spread of infectious diseases have been studied since the nineteenth century; an example is the study of the spread of the influenza virus through mathematical models that start from real data to forecast new epidemics [4]. Thus, new computational instruments capable of forecasting the diffusion of an infectious disease are fundamental for public health decision-makers [7].

The aim of this study is to investigate the capabilities of recurrent neural networks as an instrument for predicting the trend of epidemic curves, organising them in ensembles so as to also obtain an uncertainty value for each forecast. The main idea and innovation of this work lie in the use of ensembles. Recurrent or Time-Delay Neural Networks (TDNNs) are supervised machine learning tools. They are strongly influenced by the quality of the training set. If we try to predict something that differs strongly from the training set (because it follows different physics or is affected by new external variables), the tool will return wrong information.
In a real application, however, we usually have no way of knowing whether this is the case, and thus we do not know whether the prediction is reliable. An interesting approach is the use of ensembles. Neural networks with the same architecture and trained with the same data should predict similar results in a "predictable" region, while their predictions should be more scattered in an "unpredictable" or "different" region. Therefore, with a TDNN ensemble, one may calculate the average and the standard deviation of the prediction. The average prediction is the most probable value, while the standard deviation gives the uncertainty of the prediction, providing more robust information for decision making. In order to study the capability of the ensemble to predict epidemics and pandemics, in this work the same case (influenza) has been analysed with different TDNN architectures, different ensemble sizes, and different training set sizes.

The use of neural networks is of great interest in science, and it could play a very important role in predicting the evolution of an epidemic or pandemic [8, 9]. Recurrent neural networks are supervised machine learning tools that aim to predict future data (time series) from the past of the physical system considered. Machine learning tools have found applications in all fields of research and science, such as image analysis (medicine [10, 11], physics [12, 13], facial recognition [14, 15], etc.), time-series prediction (financial series [16, 17], plasma disruptions [18, 19]), and statistical analysis and causality detection [20]. More specifically, there are also several applications of artificial neural networks in medicine, such as ECG analysis and prediction [21, 22], medical image analysis (cancer detection [23, 24], etc.), and medical decision making [25, 26].

The etiological agent of influenza, the disease used as the case study, belongs to the Orthomyxoviridae family [5].
This family of viruses comprises different animal pathogenic viruses, among which is the only human pathogenic virus: the influenza virus (divided into types A, B and C) [5]. These viruses are distinguished by a segmented, negative-sense RNA genome; indeed, they require an RNA-dependent RNA polymerase for their replication [5]. This peculiarity, combined with the segmentation of the genome, increases the likelihood of mutation and therefore of new types of influenza virus arising [5]. The great variability that distinguishes these viruses is the basis of two principal mechanisms, known as antigenic shift and drift [5]. The problem with these mechanisms is that this variability can produce a new virus capable of escaping the human immune system, despite the presence of antibodies against previous influenza viruses [5]. If we add to this evolutionary success the transmission mechanism, through aerosol particles, the influenza virus has the potential to be endemic around the world [4, 5]. It is important to note that influenza viruses are characterised by a seasonal epidemic trend that contributes greatly to the worldwide mortality rate, especially in the population aged over 65 years [5, 27]. In particular, as reported by the Italian Ministry of Health, the influenza virus affects on average 8% of the Italian population, and its seasonal epidemics are associated with high levels of morbidity and mortality [28]. As regards the diagnosis of influenza-like illness (ILI), since the 2014-2015 season the protocol for diagnosing the influenza syndrome or ILI has been the same as that adopted in the rest of Europe by the ECDC [29]. For a diagnosis of ILI, one or more of the following symptoms must be present with rapid or sudden onset:
1. General symptoms [29]:
• Fever,
• General discomfort or exhaustion,
• Headache,
• Muscle pain.
2. Respiratory symptoms [29]:
• Cough,
• Sore throat,
• Wheezing.

As regards prevention, the first instrument is vaccination of the fragile population to prevent fatal outcomes [5]. This category includes: people over the age of 50, adults and children with cardiovascular and pulmonary conditions, adults and children with chronic metabolic disorders (e.g. diabetes), women in the second and third trimester of pregnancy, and persons professionally exposed to contagion [5]. Unfortunately, the unpredictability and great variability of influenza viruses, and the possibility of a new virus every year, create difficult conditions for prevention programmes [5, 27]. In this scenario, computational instruments that forecast the epidemic trend and its variations can be useful in combination with the primary prevention protocol, which remains vaccination. In particular, obtaining not only a forecast of the epidemic curve but also its uncertainty allows the decision-maker to create efficient plans for prevention and control.

To use recurrent neural networks as an instrument to forecast the future seasonal influenza epidemic curve, it is necessary to create a database of previous seasonal influenza epidemic curves. Indeed, to acquire forecasting capability, a recurrent neural network must be trained with a large database of previous data on the phenomenon to be forecast. The database used in this work is public and consists of the seasonal influenza dataset collected and made available by the Italian national sentinel influenza surveillance system (InfluNet), which began its work with the 1999-2000 influenza season [27, 30].
The Italian national sentinel surveillance system, InfluNet, collects data on the diffusion of seasonal influenza through the community of family doctors and paediatricians, who are instructed to report all diagnosed cases of ILI to the InfluNet network [29]. The operating protocol establishes that the number of doctors taking part in the surveillance must be sufficient to ensure that at least 4% of the population is under observation [29]. Furthermore, the protocol under which this surveillance system works establishes that the diagnosed ILI cases must have the same symptoms indicated by the ECDC, so as to have a single reference standard across Europe [29]. The surveillance period each year is the period of maximum diffusion of seasonal influenza, which starts in the 42nd week of the year and finishes around the 17th week of the next year (as can be seen in Table 1), with a few exceptions in the database [29, 30]. In this work, all the publicly available files [30] were used as the database, with the seasonal epidemic curves shown in Table 1, from season 2003/2004 to season 2017/2018, for a total of 15 seasons.

The neural networks used are time-delay neural networks, which have a simple feed-forward architecture. In this specific case, the networks have two hidden layers and one output layer. The output layer has only one neuron (since there is a single output). The two hidden layers can have any number of neurons. Two different architectures have been used: the first has only two neurons per layer, while the second has 4 neurons in the first layer and 2 in the second. More complex networks have also been investigated but, as the reader will see, even such simple architectures ensured the highest performance. Two inputs have been used: the week (the phenomenon being seasonal) and the past cases of influenza.
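Before a time-delay network can use the two inputs named above (week and past cases), they have to be arranged into fixed-length windows of past values. The following is a minimal sketch of that step, assuming a window of k past weeks and a one-week-ahead target. The function name `make_delay_inputs`, the toy numbers and the exact layout of the input vector are illustrative assumptions and are not taken from the original work.

```python
import numpy as np

def make_delay_inputs(weeks, cases, k):
    """Arrange two series into time-delay inputs (hypothetical helper).

    Each input row holds the k previous week indices followed by the
    k previous case counts; the target is the case count of the next week.
    """
    weeks = np.asarray(weeks, dtype=float)
    cases = np.asarray(cases, dtype=float)
    X, y = [], []
    for t in range(k, len(cases)):
        # Inputs: the k time slices immediately before week t.
        X.append(np.concatenate([weeks[t - k:t], cases[t - k:t]]))
        # Target: the number of cases in week t.
        y.append(cases[t])
    return np.array(X), np.array(y)

# Toy data (not real surveillance data): 8 weeks of one season.
weeks = np.arange(1, 9)
cases = np.array([1.0, 2.0, 4.0, 8.0, 6.0, 3.0, 2.0, 1.0])
X, y = make_delay_inputs(weeks, cases, k=3)
print(X.shape, y)
```

With a delay of k weeks, the first k weeks of a series can only serve as inputs, so a longer delay leaves fewer training samples per season.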
Two different time delay values have been tested, 5 and 10 (a time delay corresponds to one week, which is also the time step). Figure 1 shows the two architectures of the neural networks. In Fig. 1, the value k represents the time delay used by the neural network, while the value t-p represents the first time slice of the dataset. These neural networks are the core of the ensemble. The ensemble is composed of many neural networks, which are trained and make predictions independently. The size of the ensemble, i.e. the number of neural networks, can be changed freely. The cases analysed are summarised in Table 2. The outputs of the ensemble are the mean and the standard deviation of the prediction, calculated, respectively, as the average and the standard deviation of the predictions of the individual neural networks.

Figure 2 shows the experimental scheme of this work. First, the dataset is divided into a training set and a test set: the first N weeks are used to train the ensemble, and the remaining weeks to test it. The training set is in turn divided into three subsets: train (70%), validation (15%), and test (15%) [31]. This subdivision is used to prevent each neural network from overfitting the data. Once the ensemble is trained, it is tested on the test set.

The first part of the study investigates the capabilities and the quality of the results in forecasting epidemic curves. Six ensembles, with two different delay values, are used to forecast the last three seasons of influenza. The corresponding subdivision of the database is shown in Table 3.

In the second part of the study, the aim was to investigate how important the size of the training set is for ensuring high prediction performance, by comparing the resulting forecasts with the real data. The ensemble used is the best one obtained from the previous study. As shown in Table 4, the study was conducted by progressively decreasing the number of seasons used for training while increasing the number of seasons predicted.
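The ensemble outputs described above (the mean and the standard deviation across the member networks) can be sketched as follows. This is only an illustration under stated assumptions: the member predictions are made-up numbers rather than outputs of trained networks, the function name `combine_ensemble` is hypothetical, and the use of the sample standard deviation (`ddof=1`) and of a band of two standard deviations around the mean are choices made here for the example.

```python
import numpy as np

def combine_ensemble(member_predictions, n_sigma=2.0):
    """Combine member forecasts into a mean, a spread and a band.

    member_predictions: array of shape (n_networks, n_weeks), one row
    per independently trained network (hypothetical input layout).
    """
    preds = np.asarray(member_predictions, dtype=float)
    mean = preds.mean(axis=0)          # ensemble prediction per week
    std = preds.std(axis=0, ddof=1)    # spread across networks (assumed ddof=1)
    lower = mean - n_sigma * std       # lower edge of the uncertainty band
    upper = mean + n_sigma * std       # upper edge of the uncertainty band
    return mean, std, lower, upper

# Toy forecasts of three networks for three weeks (illustrative only).
preds = np.array([[10.0, 20.0, 30.0],
                  [12.0, 20.0, 36.0],
                  [14.0, 20.0, 24.0]])
mean, std, lower, upper = combine_ensemble(preds)
print(mean, std)
```

Weeks where the networks agree get a narrow band, while weeks where they disagree get a wide one, which is the behaviour the ensemble approach relies on to flag unreliable forecasts.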
The study starts with 13 seasons for training and 1 for prediction, and arrives at 1 season used for training and 13 seasons forecast.

The prediction results are analysed using R² values [32], which quantify the correspondence between the predictions and the real data. Table 5 reports the average R² values for each ensemble tested. The best result in terms of R² is obtained by ensemble 5x_4-2_D10, with a coefficient of determination equal to 98.26% for the three seasons predicted (Table 5), while the worst results are obtained by the ensembles of the same network with a delay value equal to 5 (Table 5). Figures 3, 4, 5 and 6 plot the three predicted seasonal epidemic curves. The black line always represents the real data, while the coloured lines represent the predicted data and the uncertainty of each ensemble (red for ensembles of 5 nets, orange for ensembles of 10 nets and blue for ensembles of 15 nets).

Figure 3 shows the predictions obtained by the ensembles of the 2-2 net (Fig. 1) with a delay value equal to 5. The predicted data are close to the real data; in particular, there is a slight overestimation in the second season, while in the third season all ensembles slightly underestimate the epidemic curve. The ensembles of 10 nets (Fig. 3) show appreciable differences between the predicted and real data in the third season. However, when the error is large, the ensemble also returns a large uncertainty, and the real data lie within the confidence interval (2 standard deviations).

Figure 4 reports the predictions obtained by the ensembles of nets with two neurons in each hidden layer (Fig. 1) and a delay value equal to 10. In these cases, the predicted and real data are very close for each season, as demonstrated by Fig. 4 and by the R² results in Table 5. Here too, the third season is affected by larger errors and prediction uncertainties. However, the real data again lie within the predicted values (considering the uncertainty), implying that the prediction returns reliable results.

Figure 5 reports the predictions of the ensembles built from the neural network with four neurons in the first layer and two in the second (Fig. 1), with a delay value equal to 5. In the first plot (top of Fig. 5), the forecast of the ensemble of five nets is distant from the real data at some points, especially around the peaks of the second and third seasons. This inaccuracy is confirmed by the R² value of this ensemble, equal to 73.43% (Table 5), which indicates a weaker correspondence between the forecast and the real data; it is the lowest R² value obtained in this work.

Table 3 Database subdivision used to perform the first study

As regards the ensembles of 10 nets and 15 nets, the forecasts are closer to the real data, with small differences around the peaks of the epidemic curve (Fig. 5). In this case, the forecast uncertainty is always high around the peaks of the epidemic curve (Fig. 5).

Finally, Fig. 6 plots the forecasts obtained by the three ensembles based on the neural network with four neurons in the first layer and two in the second (Fig. 1), with a delay value equal to 10. In this case, the forecasts are very close to the real data, especially for the ensemble of 5 nets (top plot of Fig. 6), which gives the best result in terms of trend and especially of R² value in comparison with the other ensembles (Table 5). In the ensembles' forecasts (Fig. 6), a slight underestimation can also be seen at the third season's peak; it is more appreciable in the ensemble of 15 nets, which for this reason obtained a lower R² value, 95.89% (Table 5).

To perform the second study, the best ensemble of neural networks obtained in the first study has been used: the ensemble of nets with four neurons in the first layer and two in the second (Fig. 1), with a delay value equal to 10. Also in this case, the analysis has been performed varying the size of the ensemble, i.e. the number of neural networks. The study is based on varying the training and test set sizes. Having a total of 15 seasons, 14 different trials have been performed. The first had 1 season for the training and 14 for the test. Figure 7 shows the plotted R² values. All the R² curves show a similar trend, decreasing strongly as the training set shrinks. This trend indicates, as expected, a correlation between the quality of the forecasts and the amount of data used to train the neural networks: as the amount of training data decreases, the inaccuracy of the forecast grows. It is also interesting to observe the effect of the ensemble size. When only 5 neural networks are used (black line), the R² oscillates, implying that an ensemble of only 5 networks is too small and its prediction accuracy is poor. By contrast, the ensembles with 10 and 15 networks return much more stable and better results.

In this work, the possibility of using neural network ensembles for infectious disease prediction has been investigated by performing two studies aimed at characterising the applicability of this tool.
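The R² values used to score the forecasts in the results above can be computed along the following lines. This sketch assumes the common definition of the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares; the source does not state which variant was used, and the function name `r_squared` and the toy numbers are illustrative only.

```python
import numpy as np

def r_squared(observed, predicted):
    """Coefficient of determination, R^2 = 1 - RSS / TSS (assumed definition)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Residual sum of squares: mismatch between forecast and real data.
    rss = np.sum((observed - predicted) ** 2)
    # Total sum of squares: spread of the real data around its mean.
    tss = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - rss / tss

# Toy curves (not real surveillance data).
observed = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.0, 2.0, 3.0, 5.0])
score = r_squared(observed, predicted)
print(f"R^2 = {score:.2%}")
```

Under this definition, a perfect forecast scores 100%, and a forecast no better than the mean of the real data scores 0%.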
As regards the results of the first study, it is possible to conclude that ensembles based on TDNNs are capable of forecasting the epidemic curves of influenza. Indeed, analysing the R² values of each tested ensemble, in many cases the values are greater than 95%, meaning that the forecast average epidemic curves are very close to the real data (which were not used in the training phase of the ensembles). The various architectures did not show significant differences, and the main conclusions are:
1. Increasing the number of neural networks improves the accuracy of the prediction and of its standard deviation (since they are calculated with larger statistics).
2. A larger number of time delays may increase the prediction accuracy.

Through the second study, it is possible to conclude that the quantity of data in the database used for the training phase is strongly connected to the quality of the prediction obtained by an ensemble. Indeed, observing the results for the three different ensembles (Fig. 7), the quality of the prediction decreases as the size of the database is reduced. Increasing the number of TDNNs in the ensemble gives better statistics and therefore more stable and reliable results.

Finally, it is possible to conclude that an ensemble of TDNNs can be an instrument for predicting future epidemic curves with good results. Furthermore, the ensemble structure also provides an uncertainty measure for the predicted epidemic curve, making it possible to adapt future measures of prevention, response, and control. It is important to note that a very simple neural network has been used in this work. This is due to the small number of input variables (past cases and weeks).
In more advanced analyses, where it could be important to take into account much more information (vaccination, countermeasures taken by governments, etc.), more complex neural networks may help to predict more complex epidemics or to increase the prediction performance for standard ones. In conclusion, this work demonstrated that a neural network ensemble may be used to predict epidemics or pandemics with great accuracy, while also providing important additional information: the uncertainty. This value is meaningful and truly important for decision making, since it indicates the reliability of the predicted values. A predicted value with an uncertainty equal to 2% of the predicted value implies a reliable and accurate prediction, while an uncertainty higher than 50% indicates a prediction of poor accuracy and reliability. In the future, more complex epidemics with many more input variables will be tested.

Data Availability Statement Not applicable.

The author declares no conflict of interest.

Epidemics and Pandemics: Their Impacts on Human History
Manuale Di Virologia Medica
Advances in Financial Machine Learning
Prevenzione e Controllo Dell'influenza (Ministero della Salute)
Sistema di Sorveglianza Integrata dell'Influenza - Istituto Superiore di Sanità
Divide Data for Optimal Neural Network Training - MATLAB & Simulink - Help center
An Introduction to Statistical Learning