key: cord-1011294-7k975qvr
authors: Wieczorek, Michał; Siłka, Jakub; Woźniak, Marcin
title: Neural Network powered COVID-19 spread forecasting model
date: 2020-08-15
journal: Chaos Solitons Fractals
DOI: 10.1016/j.chaos.2020.110203
sha: 130a3c51fa03379a3ceebac118525fd10a88982e
doc_id: 1011294
cord_uid: 7k975qvr

Virus spread prediction is very important for actively planning countermeasures. Viruses are unfortunately not easy to control, since the speed and reach of their spread depend on many factors, from environmental to social ones. In this article we present research results on developing a Neural Network model for COVID-19 spread prediction. Our predictor is based on a classic approach with a deep architecture which learns by using the NAdam training model. For the training we have used official data from governmental and open repositories. Predictions are made for countries but also for regions, to provide as wide a spectrum of values about the predicted COVID-19 spread as possible. Results of the proposed model show high accuracy, which in some cases reaches above 99%.

Virus spread prediction is very important for planning a response to the situation. The problem with modeling such a system is that the number of new potential cases on each day of the COVID-19 epidemic cannot be determined by a simple mathematical equation. There are many reasons for this. The spread of threats to humans in general depends on various features, some related to human behavior and some to the biological structure of the virus itself. In any case research must be done both to describe the virus biologically for the development of medical treatment and to model the spread, which will help to prevent new cases and to concentrate on locations with the highest potential needs. The COVID-19 virus belongs to the Coronaviridae family and the order Nidovirales, thus it has a few taxonomic hallmarks. These viruses have a lipid envelope and a positive-sense, single-stranded RNA within a large ribonucleoprotein (RNP) core.
All such viruses contain three characteristic proteins anchored in the viral membrane: the envelope protein (E), the transmembrane glycoprotein (M) and the Spike (S) glycoprotein. It is worth noting that the viral envelope is unusually thick, almost twice as thick as a typical biological membrane [1]. The genome of SARS-CoV-2 consists of six major open reading frames (ORFs), which are frequently identified in other CoVs. It seems that some of the genes share less than 80% nucleotide sequence identity with SARS-CoV [2]. Like other CoVs, the COVID-19 virus is sensitive to ultraviolet rays and heat; it is thought that this virus can be inactivated at 27 °C. Moreover, this virus might be inactivated by ether (75%), ethanol, chlorine-containing disinfectant, peroxyacetic acid and chloroform, but not by chlorhexidine [3]. The view that coronaviruses may have a vast impact on public health is underpinned by the fact that, throughout history, many of them have been able to cross species barriers and ultimately affect human health by causing the common cold or more severe diseases such as SARS and MERS [3]. A large-scale study based on 1995 cases has shown that the main clinical symptoms are fever (88.5% of cases), cough (68.6% of cases), myalgia or fatigue (35.8% of cases), expectoration (28.2% of cases), and dyspnea (21.9% of cases), while the minor ones include headache or dizziness (12.1% of cases), diarrhea (4.8% of cases), nausea and vomiting (3.9% of cases) [4]. The transmission of the novel coronavirus, like that of many pathogens, is believed to occur through respiratory droplets, so the vast majority of transmission is limited to enclosed spaces. The incubation time usually lasts 3-7 days, however the symptoms may take up to 2 weeks to appear [3].

In recent months several important models were presented. In [5] machine learning was applied to estimate how the outbreak of this threat will develop.
Nevertheless it is not easy to predict the situation in the case of COVID-19, since there are many factors which determine rapid changes [6]. Therefore many approaches were used to help. In [7] the threat was predicted by a mathematical model in which undetected infections were estimated for the China region. Sometimes even very simple methodologies are used. When a decision is necessary immediately, prediction can start from data preprocessing in which some cases are simply removed before applying a model on a Euclidean network [8]. In Japan prediction models also evaluated first symptoms [9]. One of the first models presented for Italy used the Gauss error function and Monte Carlo simulation on recorded cases [10]. Stochastic predictors can also help in the first days, when not much data is available for machine learning approaches [11]. Such stochastic models have also been shown to work even for very large societies such as India [12]. Therefore, when Artificial Intelligence is applied in the first days of prediction periods, the results are mostly for a single region or country. One of the first approaches for China was presented in [13]. An interesting discussion of principles for using mathematical modeling was presented in [14]. Some methodologies not only predict numbers of new cases but also make some assumptions regarding the dynamics of growth [15].

There are many sources of information for predicting the situation. As reported in [16], social media can bring valuable information not only about confirmed cases but also about further spread. Relations between new cases and the speed or reach of growth can be transferred to prediction in another location: as shown in [17], such knowledge transfer to model another region was done between Italy and Hunan province in China. The Diamond Princess ship case was discussed in [18]. There are also models which estimate the situation in larger regions or in more than one country.
In [19] and [20] the applied prediction models were defined to work with data from China, Italy and France. Some models treat only the total number of cases in the whole world [21].

The model proposed in our article is a complex solution. The proposed Neural Network architecture was developed to flexibly predict new cases in various countries and regions of the world. The architecture is composed of seven layers and the output predicts the number of new cases. Our predictor is trained by NAdam, which was chosen in tests for the best efficiency and shortest training time. Several other tests enabled us to select the best time window in which data is fed to the network. As the results have shown, a model composed this way is able to predict new cases with very high efficiency, which in some regions and countries is above 99%. In the following section we discuss our approach by showing how the data set was selected and preprocessed. Further we discuss network construction and results for tested training models. In the final section we present results, compare them to classic statistical approaches and draw conclusions from our research.

In our work we used the dataset provided by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University on their GitHub page. It contains combined data from numerous different sources such as:
• World Health Organization (WHO),
• European Center for Disease Prevention and Control (ECDC),
• DXY.cn. Pneumonia. 2020,
• COVID Tracking Project,
• National Health Commission of the People's Republic of China (NHC),
• China CDC (CCDC),
• Washington State Department of Health,
and many other smaller, regional health departments. This dataset is free for academic purposes and contains time series data with daily-updated total confirmed cases for countries and some bigger regions. Because it is all combined in a single csv file, it is easy to use with our neural network predictor.
The dataset contained three categories of data:
• Region name,
• Geolocalization in the form of Latitude and Longitude,
• Total cases count per day.
In our model we focus on predicting the total cases count, so we have removed the region name and localization data from the dataset and given each country a numerical ID, used later to identify the predicted data and show it in the form of a table; the ID is not used in the training process. In order to prevent overfeeding we sorted our regions in descending order by the maximum cases count and then split them into separate files containing a maximum of 30 regions per file. Using 20 regions per file decreased performance by about 3% compared to 30 regions per file, and using 40 and 50 regions per file decreased accuracy by 1% and 1.67%, respectively. After the division we train a separate network for each file and then combine all results in one final file.

Because we are working with time-step data we had to specify how many previous days are used to predict future values. To choose the best option we prepared an experiment in which we tested a few different time-steps and analyzed the results. The gathered data are shown in Table 1 and Fig. 1. As we can see, a smaller time-step allows the network to adapt more quickly to changes, so it reduces the delay between the first few days of the evaluation period in some regions and the adjustment of the network. Because of that, accuracy grows in the early days of the period. However, because the network uses only a few days for the prediction, the model is also less stable and fluctuates more if changes are higher than normal, which results in predicted

Having all of these pros and cons, we had to find a balance between fast adaptation and stability of the predicted data. Because the spread has already started in most regions, there is less need for fast adjustment and more need for reliability of the data, so we decided to use a higher time-step.
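The grouping of regions and the time-step framing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names `chunk_regions_by_peak` and `make_windows`, the dictionary layout of the input, the descending sort order and the choice of predicting exactly one day ahead are all hypothetical.

```python
def chunk_regions_by_peak(series_by_region, chunk_size=30):
    """Sort region names by their maximum case count (descending, an assumption)
    and split them into groups of at most `chunk_size` regions."""
    ordered = sorted(
        series_by_region,
        key=lambda region: max(series_by_region[region]),
        reverse=True,
    )
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]


def make_windows(series, time_step):
    """Turn one total-cases series into (input, target) pairs.

    Each input holds `time_step` consecutive days; the target is the value of
    the day that follows (one-day-ahead prediction is an assumption)."""
    inputs, targets = [], []
    for start in range(len(series) - time_step):
        inputs.append(series[start:start + time_step])
        targets.append(series[start + time_step])
    return inputs, targets
```

Each group returned by `chunk_regions_by_peak` would correspond to one file and one separately trained network, and `make_windows` would be applied to every region series inside a group with the chosen time-step.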
In our tests we found that the best option for our predictor was a 14-day evaluation period, which not only produces an accurate plot but also gives the highest overall accuracy of the model.

To normalize our data before training we divide all values from the dataset by the maximum value found in the pool, to get real values on a scale from 0 to 1. After training, to get the final values we multiply the results by the same maximum value and round them to integers.

Because we are working not with classification but with prediction, we need a method to compute the accuracy of our model. To do this we have used a measure with an error margin of 0.2, in which $a$ denotes a match. The margin value is relative to the total cases count. This makes it possible to normalize the accuracy calculation and to use the same equation for regions with a small number of cases as well as for the most populated regions.

After analyzing most of the regions we noticed that although different regions have some similarities in their growth curves, there are also some major differences. If we want to accurately predict the virus spread in every region, we cannot train the network on some countries and test it on others. Such an approach would generalize the spread prediction and create a universal but only semi-accurate model, without the ability to predict variations. Because of that we have divided our data in a different way: we trained our model on data that is 5 days old or older, and tested it on the newest data. That allowed our network to become a generalized global predictor but also to adapt accurately to every region separately. This method, however, is not perfect. Because the network is trained on older data, in some cases there may be a small slip at the beginning of the COVID-19 spread, and the network needs a few days to adapt to the new situation and accurately predict new values.
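The normalization, the recency-based split and the margin-based accuracy described in this section can be sketched as below. The exact accuracy equation is not reproduced in the text, so the matching rule used here (a day counts as a match when the prediction lies within margin times the real value) is an assumed reading of "relative to the total cases count". All function names are hypothetical.

```python
def normalize(values):
    """Divide every value by the maximum found in the pool (scale 0 to 1)."""
    peak = max(values)
    if peak == 0:
        return [0.0 for _ in values], peak
    return [v / peak for v in values], peak


def denormalize(scaled, peak):
    """Multiply by the same maximum and round to integer case counts."""
    return [round(s * peak) for s in scaled]


def split_by_recency(series, test_days=5):
    """Keep the newest `test_days` entries for testing, train on the rest."""
    if test_days <= 0:
        raise ValueError("test_days must be positive")
    return series[:-test_days], series[-test_days:]


def margin_accuracy(real, predicted, margin=0.2):
    """Share of days whose prediction is within `margin` * real of the real
    value (assumed form of the relative error margin)."""
    matches = [1 if abs(p - r) <= margin * r else 0 for r, p in zip(real, predicted)]
    return sum(matches) / len(matches)
```

Under this reading a region with zero recorded cases only scores a match when the prediction is also zero, which keeps one rule usable for both small and large regions.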
Our model works in such a way that there is no need to specify the perfect number of iterations to get the best possible accuracy. It saves the network's best performing synaptic weights during training, restores them after the training, and only then makes predictions. Because of that, we can safely give the network more iterations than needed. This is however not recommended because of limited time and resources. To find the minimum number of iterations that can safely give us the best result we analyzed accuracy and loss. Based on them we set the number of iterations to 3300. In most cases this was more than needed, but in some regions the network needed even more iterations to perform at its best. During our research we found that going over 3300 iterations did not improve the results of our model, so we settled on this value.

It has been shown that neural network predictors can achieve higher accuracy and better adjustment than other stochastic methods. On the other hand, there are many different architectures of neural networks. In this article we present the predictor which gave the highest efficiency. In our research on time-step data we also tried to use a Recurrent Neural Network (RNN) for total cases prediction, however in our case it performed about 1-2% worse than our ANN model. Additionally, even with GPU acceleration, network training with the same parameters took about 3 times longer. Because of that we finally decided to use a classic ANN in our predictor. Sample comparisons between RNN and ANN in accuracy and loss are shown in Fig. 3 (for the ANN model) and Fig. 4 (for the RNN model). As we can see, the ANN not only scores higher than the RNN but also reaches its maximum accuracy in far fewer iterations, making it the most suitable architecture for our problem. In our model we have used classic, fully connected layers in the Neural Network presented in Fig. 2.
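The strategy of saving the best performing weights during training and restoring them afterwards, described at the start of this section, can be illustrated with a framework-independent sketch. The helper `train_keeping_best`, its callback arguments and the use of loss (rather than accuracy) as the selection criterion are assumptions made for the example.

```python
import copy


def train_keeping_best(weights, step_fn, loss_fn, iterations):
    """Run `iterations` training steps and return the best weights seen.

    `step_fn(weights)` returns updated weights and `loss_fn(weights)` returns
    a scalar loss; both are hypothetical callbacks supplied by the caller."""
    best_weights = copy.deepcopy(weights)
    best_loss = loss_fn(weights)
    for _ in range(iterations):
        weights = step_fn(weights)
        loss = loss_fn(weights)
        if loss < best_loss:
            # Keep a private copy so later steps cannot overwrite the snapshot.
            best_loss = loss
            best_weights = copy.deepcopy(weights)
    return best_weights, best_loss
```

Because the snapshot is only replaced on improvement, running more iterations than necessary costs time but cannot make the returned weights worse, which matches the reasoning given for choosing a generous iteration budget.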
To predict the value of cases we have used two activation functions:
• Hyperbolic Tangent - for all hidden layers,
$$R(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x > 0 \end{cases}$$

To find the best solution for our network training we ran tests using the best known optimization algorithms. After the analysis we selected the one with the best accuracy and the fastest learning. Results are shown in Table 2. Conclusions are as follows:
• Adadelta - gave us the worst results, with effectively 0% accuracy,
• SGD - the network was learning slowly, however it did not reach any high accuracy,
• Adagrad - the loss function was decreasing very slowly, however the final accuracy was much higher than for SGD,
• Ftrl - similar to Adagrad,
• Adamax - the loss function decreased fast, however high accuracy was reached very late in the training process,
• Adam - high accuracy was reached much faster than with Adamax and was even a little higher,
• RMSprop - the network reached around 87.65% accuracy in half the time needed by the NAdam algorithm,
• NAdam - the final accuracy was the highest, however it was reached much later than with RMSprop.
Results of our tests are presented in Fig. 5. In the end we decided to use NAdam, because it gave us the best accuracy overall.

In order to get the best possible accuracy of our model, based on the previously conducted tests, we have used an improved version of the Adaptive Moment Estimation algorithm: NAdam. Because of its high performance and small requirement of computing power it is widely used in Artificial Intelligence research. To adjust our network's prediction even more, we have also used learning rate decay, to speed up rough adaptation to the cases curve and then, using much smaller steps, slowly polish the final model. The NAdam formula can be described as follows:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,$$

where $g_t$ is the current gradient value of the error function and $\beta_1$, $\beta_2$ are constant values called hyper-parameters.
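As a rough illustration of the optimizer discussed here, the sketch below implements a single NAdam step in the simplified form commonly given in the literature (Adam moment estimates with a Nesterov-style look-ahead on the first moment). It is not taken from the paper: the function name `nadam_step`, the list-of-floats weight layout and the default values of `beta1`, `beta2` and `eps` are assumptions, and library implementations may add details such as a momentum schedule. Only the learning rate 0.0004 is quoted from the text.

```python
import math


def nadam_step(w, grad, m, v, t, lr=0.0004, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one simplified NAdam update to a list of scalar weights.

    w, grad, m, v are equally long lists; t is the 1-based step counter.
    Returns the updated (weights, first moments, second moments)."""
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, grad, m, v):
        mi = beta1 * mi + (1.0 - beta1) * gi        # first moment estimate
        vi = beta2 * vi + (1.0 - beta2) * gi * gi   # second moment estimate
        m_hat = mi / (1.0 - beta1 ** t)             # bias-corrected first moment
        v_hat = vi / (1.0 - beta2 ** t)             # bias-corrected second moment
        # Nesterov look-ahead: mix corrected momentum with the current gradient.
        lookahead = beta1 * m_hat + (1.0 - beta1) * gi / (1.0 - beta1 ** t)
        new_w.append(wi - lr * lookahead / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v
```

A caller would keep `m` and `v` between steps, initialise them to zeros, and increase `t` by one per update; a learning rate decay such as the one mentioned above could be applied by passing a smaller `lr` on later steps.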
The values $m_t$ and $v_t$ are used to calculate the corrections denoted $\hat{m}_t$ and $\hat{v}_t$ according to the equations:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$

Using the corrections calculated above, the final formula for changing weights in our ANN architecture was defined as follows:

$$w_t = w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $\epsilon$ is a small constant value and $\eta$ is the learning rate (in our case with value 0.0004). We applied NAG to Adam by modifying the standard Adam update rule: because the first term no longer depends on $g_t$, the expressions for $w_t$ and $\hat{m}_t$ need to be changed.

We start the discussion of our predictions from classic measures used in statistics. These measures serve as a reference point for the further discussion of the accuracy of our predictor and the final predictions for countries and regions. In a statistical sense the shape of the prediction line for confirmed cases can be affected by several factors, such as the number of tests carried out in a country or region, restrictions imposed by state authorities, and also the behavior of people in a given area. The Sample Moving Average (SMA) of detected cases is calculated as

$$SMA = \frac{1}{n} \sum_{i=1}^{n} c_i,$$

where $n$ is the number of assumed factors and $c_i$ is the i-th value. The Exponential Moving Average (EMA) of detected cases is calculated as

$$EMA_i = a Y_i + (1 - a) EMA_{i-1},$$

where $EMA_0 = Y_0$, $Y_i$ is the i-th value of the set taken into account, and $a = \frac{2}{n-1}$, where $n$ is the number of values of the set. A linear trend line is built for the number of detected cases from $X_i$, the i-th value of the set, $\bar{X}$, the arithmetic mean of the set, $Y_i$, the i-th value of the set, and $\bar{Y}$, the arithmetic mean of the set.

In Tables 3 and 4 we can see the final accuracy for selected regions and countries using different error margins in our predictor. The accuracy was computed as the arithmetic mean of the per-day accuracy over the given time interval.
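The three statistical baselines introduced above (SMA, EMA and the linear trend line) can be sketched in a few lines. This is an illustrative version only: the function names are hypothetical, the smoothing factor is passed in as `alpha` rather than derived from the set size, and the trend line is assumed to be an ordinary least-squares fit built from the deviations of $X_i$ and $Y_i$ from their means.

```python
def simple_moving_average(values, n):
    """Arithmetic mean of the last `n` values."""
    window = values[-n:]
    return sum(window) / len(window)


def exponential_moving_average(values, alpha):
    """EMA_0 = Y_0, then EMA_i = alpha * Y_i + (1 - alpha) * EMA_{i-1}."""
    ema = values[0]
    for y in values[1:]:
        ema = alpha * y + (1.0 - alpha) * ema
    return ema


def linear_trend(xs, ys):
    """Least-squares slope and intercept of the line y = slope * x + intercept
    (assumed form of the trend line)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean
```

With day indices as `xs` and total case counts as `ys`, extrapolating the fitted line by one day gives the kind of classic statistical forecast against which a neural predictor can be compared.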
High accuracy with the lowest error margin means that the network was very precise in predicting the value of cases, while medium accuracy for the low error margin combined with high accuracy for the highest error margin means that the network correctly predicted the shape of the growth curve. Even if not all predicted values were exact, from the analyzed charts we can see that our predictor works well, reaching high efficiency both for regions and countries. Accuracy plots are shown in Fig. 9 and Fig. 10. In each chart the X-axis contains dates and the Y-axis shows the accuracy for that day with an error margin of 0.2 for the selected region/country. As we can see, when there are no active cases in some regions the developed Neural Network correctly predicts the number of cases. However, at the beginning of the period we can see fluctuations, since to reach good performance the network needs a proper amount of data, which was hard to obtain at that time. Once cases were recorded and data became available from various services, the developed network predictor quickly adapted to the new situation and the accuracy peaked at almost 100%.

In Fig. 7 and Fig. 8 we can see the comparison between real and predicted values for the total cases count plot. As we can see, the network correctly predicts the main trend of cases growth, however for some days the results are a little unstable. We conclude this is caused by rapid up and down changes in recorded cases, which is a normal situation in virus detection. Usually it happens when some additional variables change the speed of the spread. On the other hand, it is also visible that our predictor learns such situations and returns to the correct trend easily.

Analysis of statistical predictions and our proposed Neural Network shows differences. For our model it is necessary to have enough data from countries affected by COVID-19 to train the predictor. The increment curve itself is, however, much better fitted to the real data.
The statistical approach gives results at the beginning, when we do not yet have a well defined Neural Network architecture; once composed, however, the neural predictor trains well and reaches high prediction efficiency, which compensates for the difficulties of developing it in the beginning. In Fig. 11 we can see how the system predicts the situation in the world. Such a map gives a prediction model for various continents. By analyzing such a presentation we can see which continents may suffer from growth in the near future. Our system also marks countries which report a decreasing trend, which makes the estimation more useful.

In our article we have shown both advantages and disadvantages of solutions for predicting the number of cases of infection with the COVID-19 virus. We have presented why, in our opinion, such models benefit from the use of a neural network. One should also take into account many factors influencing the shape of the curve of the increase in infections, among others: behavior of the population in a given region, behavior of governments of given countries, as well as access to knowledge and medical equipment. The network we have developed has a unified architecture, which makes it easy to use as a predictor, since we do not need to change it for particular countries or regions. Our predictor achieves very high accuracy for most regions, around 87.70%. We can conclude that by using individual predictors for each region or country, the accuracy of predictions could be increased. Our research also shows the problem of insufficient data at the beginning of the situation. With an increasing amount of data from governments, our predictor gained accuracy over the following days. In future work we want to improve precision with dedicated predictors and to increase network efficiency when only a small amount of data is available, by using statistical models as compensation.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Michal Wieczorek, Jakub Silka, Marcin Wozniak
Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice, Poland
e-mail: michal_wieczorek@hotmail.com, kubasilka@gmail.com, marcin.wozniak@polsl.pl

[1] Cryoelectron tomography of mouse hepatitis virus: insights into the structure of the coronavirion
[2] A pneumonia outbreak associated with a new coronavirus of probable bat origin
[3] Features, evaluation and treatment coronavirus (COVID-19), in: StatPearls
[4] COVID-19 patients' clinical characteristics, discharge rate, and fatality rate of meta-analysis
[5] COVID-19 outbreak prediction with machine learning
[6] Why is it difficult to accurately predict the COVID-19 epidemic?
[7] Mathematical modeling of the spread of the coronavirus disease 2019 (COVID-19) taking into account the undetected infections. The case of China, Communications in Nonlinear Science and Numerical Simulation 88
[8] COVID-19 spread: Reproduction of data and prediction using a SIR model on Euclidean network
[9] Prediction of the epidemic peak of coronavirus disease in Japan
[10] Mathematical prediction of the time evolution of the COVID-19 pandemic in Italy by a Gauss error function and Monte Carlo simulations
[11] A discrete stochastic model of the COVID-19 outbreak: Forecast and control
[12] SEIR and regression model based COVID-19 outbreak predictions in India
[13] Artificial intelligence forecasting of COVID-19 in China
[14] Predictive mathematical models of the COVID-19 pandemic: Underlying principles and value of projections
[15] A model based study on the dynamics of COVID-19: Prediction and control
[16] Prediction of number of cases of 2019 novel coronavirus (COVID-19) using social media search index
[17] Extended SIR prediction of the epidemics trend of COVID-19 in Italy and compared with Hunan, China
[18] Estimation of the reproductive number of novel coronavirus (COVID-19) and the probable outbreak size on the Diamond Princess cruise ship: A data-driven analysis
[19] Analysis and forecast of COVID-19 spreading in China, Italy and France
[20] Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions
[21] Forecasting the novel coronavirus COVID-19

curve is not compensated. On the other hand, as shown in our experiments, when new data is coming regularly to