key: cord-0845356-dmaahg2q
authors: Salgotra, Rohit; Gandomi, Mostafa; Gandomi, Amir H
title: Time Series Analysis and Forecast of the COVID-19 Pandemic in India using Genetic Programming
date: 2020-05-30
journal: Chaos Solitons Fractals
DOI: 10.1016/j.chaos.2020.109945
sha: 157dca8849575a3c6f31c03bc143a3a3295e53b8
doc_id: 845356
cord_uid: dmaahg2q

COVID-19, declared a global pandemic by the WHO, has emerged as the most aggressive disease, impacting more than 90% of the countries of the world. The virus, which started from a single human being in China, is now increasing globally at a rate of 3% to 5% daily and has become a never-ending process. Some studies even predict that the virus will stay with us forever. India, being the second most populous country in the world, has not been spared, and the virus is spreading through community-level transmission. Therefore, it becomes really important to analyse the possible impact of COVID-19 in India and forecast how it will behave in the days to come. In the present work, prediction models based on genetic programming (GP) have been developed for confirmed cases (CC) and death cases (DC) across the three most affected states, namely Maharashtra, Gujarat and Delhi, as well as the whole of India. The proposed prediction models are presented using explicit formulas, and the importance of the prediction variables is studied. Statistical parameters and metrics have been used to evaluate and validate the evolved models. From the results, it has been found that the proposed GEP-based models use simple linkage functions and are highly reliable for time series prediction of COVID-19 cases in India.

transport, closure of shops, mass activities and any other activity which accounts for social gatherings were brought to an immediate halt. All of these measures were employed to minimize the effect of community-level transmission of the disease [4]. The Chinese authorities took control of the situation and collected data from the 2018 International Air Transport Association (IATA) to identify and check the infectious disease vulnerability indexes (IDVIs) in new countries where the virus might have been transmitted outside China [5]. It should be noted that the IDVI has a range of [0, 1]; the higher the value of the IDVI, the lower the risk of disease transmission and vulnerability. The virus affected more than 85,000 of the Chinese population, and the initial destinations affected were Hong Kong, Bangkok, Tokyo and Taipei, all having an IDVI above 0.65 [6]. Although numerous efforts were put in place, the virus was not controlled, and by 19 January 2020 numerous cases had been reported across the world [3]. The WHO declared the disease an emergency for the whole world on 31 January 2020, and by 11 March 2020 it was declared a new threatening global pandemic [4]. As of 12 May 2020, almost 4,006,257 cases have been reported across the globe with a total death count (DC) of 278,892, amounting to a daily increase of 26% and 28% in confirmed cases (CC) and deaths per day, respectively [6]. The worst affected country is the USA, and the second most affected is the Russian Federation, followed by the United Kingdom, Spain, Italy, Germany, France and Turkey. The virus, which started from China, has engulfed almost every country of the world, with the most affected region being the European continent [7].

Average estimates have also been drawn to calibrate and design COVID-19 transmission models before further investigation and pandemic control measures can be implemented [8]. It can also be noted that the virus started from a single individual, migrated to the cluster level and, in the present situation, is increasing enormously through community-level transmission [11].

India is the second most populous country in the world, with 1.3 billion people to serve, an average household income ranked 112th out of 164 countries by the World Bank, and a rank of 150th in global health care by the World Economic Forum. This critical condition was under the scanner of the whole world when the COVID-19 pandemic first set foot on Indian soil [9]. The first case was reported on 30 January 2020, and it was expected that India would not be able to survive the heat and that, due to the lack of essential services, the lives of millions of people would be at stake. Its more than three person coming to the country from China and people who had any Chinese travel history in the past few days. The first nationwide lockdown for 21 days was announced by the government on 23 March 2020 and was further extended by another 21 days till 3 May 2020, which has been further extended till 17 May 2020 (and can be extended to 30 May 2020). Important measures, including a timely response to provide health care facilities, contact tracing, extensive testing, community mobilization and others, have helped to contain the virus and keep the mortality rate low. Different states have already recovered from various adversities, which has helped them to keep a check on the new situation.

Odisha, Kerala and Tamil Nadu have a long history of natural disasters, and precautionary measures have already been taken by the government. Maharashtra as a whole uses drones to monitor social distancing and lockdown. A cluster containment strategy has also been employed to diagnose and contain the virus. This has been done by surveying, detecting and contact tracing within about 3 km of an area where more than three patients are diagnosed [9].

The potential effect of COVID-19 has prompted various studies on the characteristics of the coronavirus, and a large number of studies are under way to estimate the possible devastation by the virus and to derive a vaccine for its cure. It has also been found that the virus has an adverse effect on elderly people as well as on people who are suffering from some kind of disease, such as heart attacks, respiratory diseases and others, and it is a big concern for the authorities to keep a check on the virus so that minimum harm is done [12]. Various studies have been conducted by researchers across the world to estimate the possible impact of the coronavirus. The major studies include stochastic simulations [13], the Weibull distribution model [14], the exponential growth model [16], the lognormal distribution [15] and others [17]. The studies were able to predict an average incubation period of 5.1 days and a total of 14 days of quarantine necessary for analysing the virus within a person [13]. But none of these studies could estimate the exact reproduction rate, and hence not much has been done to predict how the virus will behave in the coming weeks or so. Also, all of the studies have been done on China, and not much work has been done with respect to the Indian sub-continent.

In the present work, a new genetic programming (GP) based model [18] for time series prediction of the COVID-19 scenarios in India has been proposed to estimate the possible spread of the virus.

The dataset for evaluation is taken from [19]. GP is an enhanced version of the genetic algorithm (GA) [20], in which new solutions are generated as computer programs rather than simple binary strings. Overall, it can be said that prediction models proposed by GEP-based modelling have better calibration and can analyse the results in a much more effective way [27], [28]. Thus, in the present work, GEP models are proposed based on the raw data taken from authentic sources since 24 March

The paper is organized into 4 sections; the first section includes the introduction as discussed above. In Section 2, technical preliminaries and model analysis are presented, providing details on the basics of GEP and the proposed GEP model. Section 3 provides the various results and discussion related to different scenarios of COVID-19. Here it should be noted that the three most affected states of India have been taken into consideration and GEP models for all the states have been proposed. These

GEP is a highly effective evolutionary algorithm and has proved its worth in comparison to GA.

The algorithm produces new equations instead of binary strings and hence has the direct advantage of mathematical formulations for higher dimensional problems, which is otherwise not possible with a standard GA. In order to formulate the GEP model for India, it is very important to investigate existing models and analyse whether the proposed GEP models will be significant enough or not. Various models such as AceMod (Australian Census-based Epidemic Model) [29], neural network based models [30] and others have been employed to assess the situation and provide exact predictions.

Though these models are somewhat significant, the first, AceMod, has been used for influenza prediction [29] and has little relevance to COVID-19. The other, neural network based, model uses a shallow long short-term memory (LSTM) method along with a fuzzy rule based model to predict the present scenario. Very high root mean square error (RMSE) and correlation (R²) values have been found, making the model somewhat vulnerable to uncertainties. Both of the methods discussed are basic and discrete in nature and require more sets of data values to provide exact predictions. Also, these are classical techniques and pose a very challenging implementation when compared to simple GEP-based modelling. GEP allows a system to be calibrated easily and can even predict accurate as well as reliable solutions under minimal constraints [27]. In the present work, five major time series including CC and DC have been taken into consideration to assess and predict the possible impact

new structure consists of five major components, namely a function set, control parameters, a terminal condition, a fitness function and a terminal set. All these components collectively form a simple parse tree, which is known as an expression tree (ET). The major advantage of using this kind of methodology is that it is extremely simple and works at the chromosome level. It also has multi-genic properties and can be used for the evolution of complex and nonlinear sub-programming models [31]. Each GEP model consists of a list of fixed-length symbols, a function set (e.g. +, -, ×, /, Log) and a terminal set (e.g. a, b, c, 7). Thus, in terms of both the terminal set and the function set, a GEP can be an invention of multiple chromosomes which are capable of representation in the form of any parse tree.

To decode this information, Karva language is used at chromosome level [32] and a simple gene in Karva language is given by

where a, b and c are variables and 3 is a constant. The expression in Equation 1 is called a K-expression or, generally, Karva notation. The above formulated model can then be expressed in the form of a simple ET, as given in Figure 1. The expression in Equation 1 can be converted into a k-expression.

This expression is the root of the ET, which reads through the functional nodes and finally to the terminal nodes. This type of interpretation allows for a quicker understanding of the complex mathematical intricacies [33]. Thus, a simplified k-expression is presented in the form of mathematical equations as given by

Note that from the above discussed k-expression it can be seen that the total length of the genes in a GEP model remains the same, but the total number of ETs keeps changing with respect to the problem under consideration [22]. The GEP model thus formulated further shows that certain redundant elements are present in the notation which are not significant for genetic mapping. So, for a GEP model to be efficient, the total length of a k-expression should be equal to or less than the total length of the GEP gene. Here it should be noted that a random head-tail methodology is followed by GEP to select a gene. The head may contain both function symbols and terminal symbols, but the tail contains only terminal symbols [22].
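The level-by-level reading of a k-expression described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the gene `*+-abc3aab`, the arity table `ARITY` and the helper names `decode_kexpression` and `evaluate` are assumptions chosen for the example, since Equation 1 is not reproduced in this text.

```python
from collections import deque

# Hypothetical function set with arities; any other symbol is a terminal.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2}


def decode_kexpression(gene):
    """Read a gene left to right and fill the expression tree level by level.

    Returns (root, used): each node is [symbol, children] and `used` is the
    number of symbols consumed, i.e. the length of the k-expression.
    """
    symbols = iter(gene)
    root = [next(symbols), []]
    queue = deque([root])
    used = 1
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node[0], 0)):
            child = [next(symbols), []]
            node[1].append(child)
            queue.append(child)
            used += 1
    return root, used


def evaluate(node, variables):
    """Evaluate an expression tree for the given variable values."""
    symbol, children = node
    if symbol in ARITY:
        left, right = (evaluate(child, variables) for child in children)
        if symbol == "+":
            return left + right
        if symbol == "-":
            return left - right
        if symbol == "*":
            return left * right
        return left / right
    # Terminals are either named variables or numeric constants.
    return variables[symbol] if symbol in variables else float(symbol)


gene = "*+-abc3aab"  # assumed example gene; the last three symbols are never read
tree, k_length = decode_kexpression(gene)
value = evaluate(tree, {"a": 1.0, "b": 2.0, "c": 5.0})
print(k_length, value)
```

In this sketch the gene has ten symbols but the decoded tree, (a + b) × (c − 3), uses only the first seven, which illustrates why the k-expression can be shorter than the gene that carries it.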

To have a clear understanding of the total number of COVID-19 cases across India, three major parameters are taken into consideration, including CC and DC. All of these parameters are taken in order to accurately assess and predict the effect of COVID-19. For performance evaluation, eight former records are used in the time series and the best GEP model based on them is selected. The numerical data set is divided into two sub data sets, which are used for the training and the testing/validation phases. Also, it is a well-known fact that the performance of an evolutionary algorithm cannot be judged from a single run, and hence multiple runs on the data set were performed to reduce the error and predict a near-optimal output [27]. Here multiple runs on the same data set were simulated, thus helping the algorithm to provide an exact output even if the total instances for experimentation are limited in number. As a further evaluation step, 70% of the data was used for training and the remaining 30% for testing/validation. Here it should be further noted that the training data uses gene evolution and the best model is selected using the correlation coefficient. Thus, a new model has been proposed that has better performance in training and can work well enough in the testing phase.
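The data preparation described above (former records as inputs, then a 70%/30% division) might look as follows. This is a sketch under stated assumptions: the helper names `make_lagged` and `chronological_split` are hypothetical, the series is synthetic, and the split is taken in time order, which the text does not specify.

```python
def make_lagged(series, n_lags=8):
    """Pair each value with the n_lags former records that precede it."""
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(list(series[t - n_lags:t]))
        targets.append(series[t])
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=0.7):
    """Keep the earliest records for training and the rest for testing."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


# Synthetic cumulative counts, for illustration only (not real case data).
series = [10 * day for day in range(1, 21)]
X, y = make_lagged(series, n_lags=8)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y)
print(len(train_X), len(test_X))
```

Splitting in time order avoids training on records that come after the ones being tested; a random split would be a different design choice.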

Apart from this, the GEP model is greatly affected by the choice of parameters. In order to have a fair model, multiple runs have been conducted to find the global solution by changing the parametric settings. The initial parametric setting was based on the previously introduced model as given by [22]. Further, for the best performance, the fitness function is evolved with respect to mean squared

The fitness function formulated in Equation 3 is used for all the cases under investigation. A detailed discussion on the model validity and a comparative study are presented in the following section.

The GEP algorithm to devise a new model for COVID-19 in India was implemented using GeneXpro Tool [33]. For the genetic operators, the parameter settings are as given in Table 1

model has been devised. The statistical parameters for the proposed GEP model are given in Table 2. The next section details the results and discussion corresponding to the proposed GEP model.

Here two models based on CC and DC are devised, and the parameter settings for their evaluation are given in Table 2. Here it should be noted that the root mean square error (RMSE) and correlation coefficient (R) metrics are taken into consideration to evaluate the model, which are calculated as

where n is the total number of samples, h_i and t_i are the actual and intended outputs, and h̄_i and t̄_i are the averages of the actual and intended outputs for the ith output.

values of R has the capability of providing reliable time series predictions [35].
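For readers who want to reproduce the two metrics, the sketch below computes them from their standard textbook definitions, using the symbols defined above (h for actual and t for intended or predicted outputs). The function names `rmse` and `correlation` and the sample values are assumptions for illustration; the paper's own equations are not reproduced in this text.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: sqrt of the mean squared difference."""
    n = len(actual)
    return math.sqrt(sum((h - t) ** 2 for h, t in zip(actual, predicted)) / n)


def correlation(actual, predicted):
    """Pearson correlation coefficient R between the two sequences."""
    n = len(actual)
    h_bar = sum(actual) / n
    t_bar = sum(predicted) / n
    covariance = sum((h - h_bar) * (t - t_bar) for h, t in zip(actual, predicted))
    spread = math.sqrt(
        sum((h - h_bar) ** 2 for h in actual)
        * sum((t - t_bar) ** 2 for t in predicted)
    )
    return covariance / spread


# Illustrative values only: every prediction is off by exactly one unit.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(rmse(actual, predicted), correlation(actual, predicted))
```

The example shows why both metrics are reported together: a constant offset leaves R at 1 while the RMSE is clearly non-zero.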

In order to externally validate the GEP model, the criteria used by [36] have also been employed. The main feature of these criteria is that the regression slopes (k or k') should be close to 1 and the regression lines should pass close to the origin. The values of the parameters n and m ought to be lower than 0.1, whereas the external predictability R_m should be greater than 0.5 [37]. Also, the squared correlation coefficient (Ro²) and the coefficient (Ro'²) should be close to 1. Here it should be noted that Ro² is computed between the predicted and desired values, whereas Ro'² is computed between the desired and predicted values [27]. More details on other parameters for external validation are given in Table 2. All of these parameters play an important part in ensuring the good prediction capability of each proposed model and in analysing the validity of the model.
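A sketch of how these external validation quantities could be computed is given below. The expressions are an assumption: they follow the form in which these criteria are commonly tabulated in the GEP modelling literature, not Table 2 of this paper, which is not reproduced here. The function name `external_validation` and the sample values are likewise hypothetical.

```python
import math


def external_validation(actual, predicted):
    """Compute k, k', Ro^2, Ro'^2, m, n and R_m (assumed common forms)."""
    h, t = actual, predicted
    count = len(h)
    h_bar = sum(h) / count
    t_bar = sum(t) / count
    sum_ht = sum(a * b for a, b in zip(h, t))

    # Slopes of the regression lines through the origin.
    k = sum_ht / sum(a * a for a in h)
    k_prime = sum_ht / sum(b * b for b in t)

    # Squared Pearson correlation.
    covariance = sum((a - h_bar) * (b - t_bar) for a, b in zip(h, t))
    ss_h = sum((a - h_bar) ** 2 for a in h)
    ss_t = sum((b - t_bar) ** 2 for b in t)
    r2 = covariance ** 2 / (ss_h * ss_t)

    # Through-origin coefficients, with h_o = k * t and t_o = k' * h.
    ro2 = 1 - sum((b - k * b) ** 2 for b in t) / ss_t
    ro2_prime = 1 - sum((a - k_prime * a) ** 2 for a in h) / ss_h

    return {
        "k": k,
        "k_prime": k_prime,
        "Ro2": ro2,
        "Ro2_prime": ro2_prime,
        "m": (r2 - ro2) / r2,
        "n": (r2 - ro2_prime) / r2,
        "R_m": r2 * (1 - math.sqrt(abs(r2 - ro2))),
    }


# Illustrative check with a perfect prediction.
perfect = external_validation([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
passes = abs(perfect["m"]) < 0.1 and abs(perfect["n"]) < 0.1 and perfect["R_m"] > 0.5
print(perfect["k"], perfect["R_m"], passes)
```

Only the thresholds stated in the text (m and n below 0.1, R_m above 0.5) are checked here; what counts as a slope "close to 1" is left to the reader.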

A comparison of actual to intended or predicted values for CC and DC is given in Figure 3. The mathematical formulations discussed above represent a complex organization of constants, operators and variables to predict the output. From Figure 3, it is evident that both prediction models give results almost equivalent to the original CC and DC. Also, from the models, it can be seen that till 13 May 2020 the total numbers of CC and DC are around 80,000 and 2500, respectively. The original values versus predicted values for the same period are almost similar. Figure 3 also shows the predicted outputs for these models, and it has been found that in the next 10 days, that is by 23 May 2020, the total numbers of CC and DC will become approximately 142,000 and 4200, respectively.

The expression trees for the whole of India are given in Figure 4 in terms of both CC and DC. Based on these, mathematical equations can be formulated and new prediction analyses can be drawn. From Figure 4, it can be said that the proposed ETs can be divided into four subprograms. Each of the proposed subprograms represents individual aspects of the problem under consideration, and meaningful information can be derived from them to get the overall desired solution [27]. Here it can be indicated that each of the newly evolved sub-functions from the ETs contains potential information about the basic psychology and architecture for a certain facet of the problem. This kind of information ultimately paves the way for evaluation at the chromosomal level [33]. From the sub-ETs in Figure 4, it can be seen that the linkage function for CC is addition, whereas for DC, it

Predictor variables are an important and integral part of a GEP model [39]. These parameters help in understanding the contribution of all the variables in the model. Here a randomization procedure is followed for each input variable in order to analyse its importance: the values of the variable are randomized, and the average reduction in R² between the predicted value and the desired output is then found.
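The randomization procedure can be sketched as a permutation-style importance measure. This is an illustration under assumptions, not the tool's internal method: the function names `r_squared` and `permutation_importance`, the number of repeats and the toy model are all hypothetical, and R² is taken here as the squared Pearson correlation. The scores are scaled so that they sum to 1.

```python
import random


def r_squared(actual, predicted):
    """Squared Pearson correlation; 0.0 if either sequence is constant."""
    n = len(actual)
    a_bar = sum(actual) / n
    p_bar = sum(predicted) / n
    covariance = sum((a - a_bar) * (p - p_bar) for a, p in zip(actual, predicted))
    ss_a = sum((a - a_bar) ** 2 for a in actual)
    ss_p = sum((p - p_bar) ** 2 for p in predicted)
    return covariance ** 2 / (ss_a * ss_p) if ss_a > 0 and ss_p > 0 else 0.0


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average drop in R^2 when one input column is shuffled, scaled to sum to 1."""
    rng = random.Random(seed)
    baseline = r_squared(y, [predict(row) for row in X])
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [list(row[:j]) + [v] + list(row[j + 1:])
                        for row, v in zip(X, column)]
            total += baseline - r_squared(y, [predict(row) for row in shuffled])
        drops.append(max(total / n_repeats, 0.0))
    scale = sum(drops)
    return [d / scale for d in drops] if scale > 0 else [0.0] * len(drops)


# Toy model that depends only on the first of two inputs (illustration only).
X = [[float(i), float((7 * i) % 5)] for i in range(10)]
y = [2.0 * row[0] for row in X]
importance = permutation_importance(lambda row: 2.0 * row[0], X, y)
print(importance)
```

In the toy case all of the importance falls on the first input, because shuffling the second input leaves the predictions unchanged.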

The results obtained for all the prediction variables are normalized such that the importances of all the variables sum to 1. From the results in Figure 5, it can be seen that for the whole of India as of 13 May 2020, the variable d13 greatly affects the models and is the most important variable for both CC and DC. In the case of CC, the model is also highly sensitive to two other variables, namely d2 and d4. In the next subsections, three major states of India are also studied, and the performance of the proposed GEP models of CC and DC for these three states is analysed. Note that the basic details about the ETs, variable importance and statistical values are not described again, in order to avoid repetition.

the state is presented in Figure 6. It has been found that the GEP-based prediction models proposed in the present work provide very reliable results for both CC and DC till 15 May 2020. The GEP models further predict that by 25 May 2020 the total number of CC will reach almost 45,000 and, in the same period, the total DC will reach nearly 1800. Thus, a sharp rise can be noticed in the COVID-19 cases across Maharashtra in the coming days.

This parameter follows a randomization procedure for each of the input variables in order to analyse the importance of each variable. Here the average reduction in R² between the predicted and desired output is taken into consideration, and the variable importance is calculated. The results in Figure

Gujarat is the second most affected state of India after Maharashtra and COVID-19 cases are rising at a higher pace. This section proposes two new GEP models for CC and DC in Gujarat.

The results of the prediction models with respect to the actual cases for both CC and DC are given in Figure 9. From the results, the total numbers of CC and DC as of 15 May 2020 are around 10,000 and 600, respectively, and as per the GEP prediction models, they are expected to increase up to 14,000 and 900, respectively. The projected values further indicate that the possible spread of COVID-19 in Gujarat is alarming and needs to be kept in check. The ET-based validation is presented in the next subsection.

The ETs for both CC and DC in Gujarat are given in Figure 10. It can be seen that the ETs for both cases consist of four subprograms or chromosomes, or simply four sub-ETs. All the sub-ETs are connected by an addition linking function for CC, whereas for DC an average linking function is used.
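How a linking function joins the outputs of several sub-ETs into one prediction can be sketched as follows. The helper `link` and the four sub-ET outputs are hypothetical placeholders; the actual sub-ETs are those shown in the paper's figures and are not reproduced here.

```python
def link(sub_outputs, how="addition"):
    """Combine the outputs of the sub-ETs with the chosen linking function."""
    if how == "addition":
        return sum(sub_outputs)
    if how == "average":
        return sum(sub_outputs) / len(sub_outputs)
    raise ValueError(f"unknown linking function: {how}")


# Placeholder outputs of four sub-ETs for one time step (illustration only).
sub_outputs = [1.0, 2.0, 3.0, 4.0]
cc_style = link(sub_outputs, "addition")  # addition linking, as used for CC
dc_style = link(sub_outputs, "average")   # average linking, as used for DC
print(cc_style, dc_style)
```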

Also, from these ETs, mathematical equations can be formulated as per the end user's requirements for further evaluation at the chromosomal level. The general time series prediction model for Gujarat in the case of CC is given in Algorithm 5 and for DC in Algorithm 6. Here it should be noted that the models have been generated based on 39 training records for both CC and DC. is presented in the next subsection.

The ETs in the case of Delhi are also grouped into four sub-ETs. Here a subtraction linking function is used for both CC and DC, and the ETs are presented in Figure 13. Predictor variables in the case of Delhi play a very significant role. Here d13 plays the most significant role for both the CC and DC GEP models. Along with that, d5 and d12 also have a small impact on the prediction model for CC, whereas d8, d9 and d10 contribute a small amount of knowledge to the DC model. The results are presented in Figure 14, and it can be said that for Delhi two to three prediction variables affect the GEP models. Here also, the results from the prediction model

A robust and reliable variant of GEP was used to model the confirmed cases and death cases of COVID-19 in India. New accurate empirical models were designed for the prediction of CC and DC across the whole of India and three major states which are highly affected by the COVID-19 pandemic. These states include Maharashtra, Gujarat and Delhi. The proposed models were developed from the daily situation reports of COVID-19 cases published by the Ministry of Home Affairs, Govt. of India, since the onset of the first lockdown in the country, that is 24 March 2020. The following conclusions have been formulated based on the proposed models:

• The GEP models proposed in this work are highly reliable in predicting both confirmed cases and death cases across India. They also satisfy all the requirements of external validation and hence can be used for predicting future cases.

• The RMSE and R values for all the cases are higher and close to 1, respectively, thus verifying the solution quality of the proposed models and raising the chances of reliable predictions.

• The ETs derived are very simple, and basic mathematical equations can be formulated from them without any time-consuming laboratory implementations. These mathematical equations can be further used to optimize the proposed models using different optimization techniques such as differential evolution, the cuckoo search algorithm and others.

• The prediction variables of all the proposed models play a very significant role, and it has been found that, apart from Delhi, all other models are affected by only one or two prediction variables, thus making the models less sensitive to variables.

• Apart from that, from the experimental results, it can be said that GEP models are highly reliable as they are based on experimental data rather than just basic assumptions, as in case of conventional models. Another salient feature of GEP modeling is that it can work on less time series data and still provide reliable results.

Thus, overall, it can be said that GEP-based models are highly reliable and can be treated as a benchmark for time series predictions. A concern arises when the total number of cases increases many fold. In those cases, the GEP models need to be optimized, and the mathematical equations derived from the GEP models can be optimized using highly effective state-of-the-art evolutionary algorithms. As a future direction, these equations can be derived and algorithms such as the Krill herd algorithm, the naked mole-rat algorithm and others can be used to optimize the prediction models. Also, the CC and DC predicted by the models for the next 10 days show that strict measures need to be taken to keep the virus in check. Here lockdown and social distancing should be strictly followed so that the virus can be controlled and confined to particular areas only.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Clinical features of patients infected with 2019 novel coronavirus in

Statement Regarding Cluster of Pneumonia Cases in

WHO Director-General's opening remarks at the media briefing on COVID-19 -11

Identifying Future Disease Hot Spots: Infectious Disease Vulnerability Index; RAND Corporation

Situation Report; World Health Organization

Pattern of early human-to-human transmission of Wuhan 2019-nCoV

The incubation period of 2019-nCoV infections among travellers from Wuhan, China

India under COVID-19 lockdown

Situation Report: Government of India

Modelling transmission and control of the COVID-19 pandemic in

Novel coronavirus 2019-nCoV: early estimation of epidemiological parameters and epidemic predictions

Risk assessment of novel coronavirus COVID-19 outbreaks outside China

Epidemiological characteristics of novel coronavirus infection: A statistical analysis of publicly available case data

Real-time estimation of the novel coronavirus incubation time

Modelling disease outbreaks in realistic urban social networks

Transmission dynamics of 2019 novel coronavirus

Genetic programming: On the programming of computers by means of natural selection

COVID-19: Time Series Datasets India versus World

Genetic algorithms and machine learning

On the automatic evolution of computer programs and its application

Gene expression programming: A new adaptive algorithm for solving problems

Analysis and forecast of COVID-19 spreading in China, Italy and France

A model based study on the dynamics of COVID-19: Prediction and control

Weather reports India, Government of India

COVID-19: Time Series Datasets India versus World

Nonlinear genetic-based models for prediction of flow number of asphalt mixtures

Applications of artificial intelligence and data mining techniques in soil modeling

Creating a surrogate commuter network from Australian Bureau of Statistics census data

Neural network based country wise risk prediction of COVID-19

Novel approach to strength modeling of concrete under triaxial compression

A robust data mining approach for formulation of geotechnical engineering systems

Probability and statistics in civil engineering

The data analysis handbook

Beware of q2

On some aspects of variable selection for partial least squares regression models

Nonlinear modeling of shear strength of SFRC beams using linear genetic programming

Handbook of genetic programming applications

Krill herd: a new bio-inspired optimization algorithm. Communications in nonlinear science and numerical simulation

The naked mole-rat algorithm