key: cord-0501962-ob3pdfon authors: Ghosh, Sayantari; Bhattacharya, Saumik title: A Data-driven Understanding of COVID-19 Dynamics Using Sequential Genetic Algorithm Based Probabilistic Cellular Automata date: 2020-08-27 journal: nan DOI: nan sha: 008f4d3eed24ae4f3c89f97e2306f872d507b315 doc_id: 501962 cord_uid: ob3pdfon COVID-19 pandemic is severely impacting the lives of billions across the globe. Even after taking massive protective measures like nation-wide lockdowns, discontinuation of international flight services, rigorous testing etc., the infection spreading is still growing steadily, causing thousands of deaths and serious socio-economic crisis. Thus, the identification of the major factors of this infection spreading dynamics is becoming crucial to minimize impact and lifetime of COVID-19 and any future pandemic. In this work, a probabilistic cellular automata based method has been employed to model the infection dynamics for a significant number of different countries. This study proposes that for an accurate data-driven modeling of this infection spread, cellular automata provides an excellent platform, with a sequential genetic algorithm for efficiently estimating the parameters of the dynamics. To the best of our knowledge, this is the first attempt to understand and interpret COVID-19 data using optimized cellular automata, through genetic algorithm. It has been demonstrated that the proposed methodology can be flexible and robust at the same time, and can be used to model the daily active cases, total number of infected people and total death cases through systematic parameter estimation. Elaborate analyses for COVID-19 statistics of forty countries from different continents have been performed, with markedly divergent time evolution of the infection spreading because of demographic and socioeconomic factors. The substantial predictive power of this model has been established with conclusions on the key players in this pandemic dynamics. 
With its outbreak in Wuhan, China, Coronavirus disease-2019 (COVID-19) has spread across the world within a few months. Due to its explosive growth and considerable rate of fatality, the World Health Organization (WHO) declared COVID-19 a pandemic and a global health emergency [1]. According to the available statistics in June, 2020, the total number of infections by SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2), the causative agent of this disease, is approaching 19 million around the world, causing around 700,000 deaths in 213 countries and territories, with no effective vaccination available in the market so far. Beyond respiratory discomforts including pneumonia, dry cough, cold and sneezing [2, 3], it has been reported to cause liver and gastrointestinal tract maladies, kidney dysfunction and heart inflammation in cases of severe infection [4, 5, 6]. This highly infectious disease transmits from person to person through respiratory droplets produced by an infected person. Fomite-mediated and nosocomially acquired infections are also being identified as important sources of viral diffusion [7, 8, 9]. A typical incubation time from exposure to symptoms has been reported for COVID-19, while infection transmission from asymptomatic individuals has been observed as well [10, 11, 12]. Immediately after the detection of human-to-human transmission, the government agencies of various countries started implementing several mitigation strategies to control the epidemic. The measures thus taken include social distancing, restrictions on domestic as well as international travel, cancelling social events, and shutting down public as well as commercial activities, which can effectively reduce the possibilities of physical human contact. Moreover, contact tracing, aggressive testing, and hospital or home quarantine for infected individuals and suspected cases have also been executed to track and prevent further spread.
However, these strategies are directly contributing to enormous economic loss. The optimum estimation of this novel disease dynamics is emerging as a challenging problem in this context. The immense disruption caused by COVID-19, resulting in overwhelming disorder in the health, economy and lives of billions of people around the globe, has brought the necessity for accurate modelling of infectious diseases into focus. The effect and effectiveness of this complex interplay between differing length-scales and time-scales with the applied control strategies can only be understood and predicted with the help of precisely designed quantitative models. With a tremendous effort from researchers around the world, a spectrum of mathematical and computational approaches is being used to understand and predict COVID-19 statistics, addressing its different perspectives. In a rudimentary sense, the studies being pursued can be segmented into two categories: (i) data science and machine learning approaches and (ii) differential equation based mathematical modelling techniques. The first group of studies relied mostly on data mining from national/international repositories (e.g., WHO, country-specific data centres etc.) or popular social media platforms to forecast the active cases and mortality data [13, 14, 15, 16, 17]. The major goal of these studies is to estimate and predict the time evolution of the disease using specific computational concepts, like Monte Carlo decision making, fuzzy rule induction, deep learning etc. [18, 19, 20, 21, 22]. Some of these studies also explored the impact of disease control interventions, like travel restrictions [23], patient quarantining and isolation [24], medical facilities [25], and social distancing and administrative responsibility [15], on the epidemic spreading rate. Though these models are quite effective, being entirely dependent on data, their efficiency depends heavily on the data quality.
As comprehensively reviewed by [26] , several data-dependent models are prone to suffer from high risk of bias, which is very much probable for imprecise short time series data. With the evidence of giving effective predictions for past pandemics [27, 28, 29] , the traditional approaches of the mathematical theory of epidemiological dynamics also have driven several researchers to study COVID-19 dynamics. Theoretical modelling based approaches have been long associated to understand and predict the outbreak probabilities and seriousness of a disease, and provide key information to control the intensity [30, 31, 32, 33] . Most of the mathematical models that are being used to investigate the COVID-19 dynamics [34, 35, 36, 37] are based on variants of classical deterministic model of susceptible-infectious-recovered (SIR) that was introduced by Kermack and McKendrick [38] . Constituting a set of nonlinear ordinary differential equations (ODE), the SIR model compartmentalises the population where susceptible subpopulation declines over time, constantly getting infected (by infectious subpopulation), and then recovered from (and gaining immunity to) the disease over time. Being powerful and computationally favourable tool to analyse epidemic, variants of this methodology are common in understanding real epidemic data [39, 40] . Though these models capture the disease transmission dynamics, being deterministic, they suffer from the assumption of homogeneous mixing, forgoing the spatial information. For modelling real-world dynamics of a disease that spreads from close-contacts only, the tool needs to accommodate neighborhood information. Moreover, the platform requires to take into account of stochasticity of real dynamics, spatial infection spread and inherent heterogeneity in population, which are some major limitations of the mentioned works. 
Thus, the identified research gap points towards designing a methodology that addresses the above-mentioned issues in order to understand and predict the neighbourhood-dependent, person-to-person probabilistic transmission of COVID-19, and that is supported by extensive computational tools for parametric optimization. In this study, we propose a probabilistic cellular automata based dynamical model, optimised through a sequential genetic algorithm, for an accurate assessment of the extent of COVID-19 dynamics. The major motivation for using cellular automata (CA) is their ability to depict extremely complex macroscopic outcomes while being based on local interactions, relying on the interaction of a multitude of single individuals [41, 42]. This methodology is capable of giving a direct correspondence to the physical system and also rectifies the major drawbacks of ODE models by (i) tracking individual contact processes, (ii) giving room for introducing probabilistic individual behaviour, and (iii) capturing neighbourhood as well as global spatial information. For these reasons, CA based approaches have been successfully used as a competent substitute method to simulate physical, biological, environmental and social contagion-like spreading [43, 44, 45, 46]. For studying past epidemics as well as interpreting COVID-19, some studies have proposed cellular automata as an alternative method [47, 48, 49, 50]. However, capturing and interpreting the behaviour of real data through CA needs a large-scale parameter optimization that could be time consuming as well as sub-optimal.
Table 1, proposed method: (a) accommodates heterogeneity in population; (b) includes stochasticity and probabilistic dynamics; (c) estimates optimum epidemic dynamics parameters; (d) considers neighbourhood and demography explicitly; (e) performs robust prediction with limited data.
Thus, though being extremely flexible and powerful, CA has not been yet optimized to understand and interpret COVID-19 data for countries worldwide. To explore this, in this study, genetic algorithm (GA) has been employed, which is a well-known method for generating the optimal parameter subset through stochastic search procedures based on the principle of the survival of the fittest [51, 52, 53, 54, 55] . Crossover and mutations, two key properties of genetic algorithm help to optimize the parameter set efficiently in limited steps. Cellular automata coupled with genetic algorithm has been used before to explore evolutionary aspects of game theoretical problems [56] , but to the best of our knowledge analyzing and developing understanding from real pandemic data like COVID-19 using optimized CA platform has not been attempted yet. The main contributions of this work are as follows: • To build a CA model which is probabilistic, so that it can take into account of demographic variations, neighbourhood diversity and uncertainties of real dynamics. • To create an easily implementable framework where optimization using GA will be done sequentially for all parameters associated with the transition rules of the CA model for real data interpretation. • To interpret and understand COVID-19 disease transmission dynamics with an optimized CA framework, which can be extended for prediction as well. Through this, on one hand, one can track the individual contact process through time and space; on the other hand, a self-adapting process of evolutionary strategies has been created by designing the chromosome with parametric genes and establishing fitness function that maximises over the generations. The main limitations of the state-of-the-art algorithms and the major contributions of the proposed method are listed in Table 1 for a clear understanding. 
The main rationale behind this approach is that it is extremely difficult to find the optimal parameters of the complex spatial epidemiological model using random search or analytical techniques. The proposed GA based framework helps to search the parameter space more efficiently for the optimal performance of the entire algorithm. The rest of this article is organized as follows: Section 2 includes the proposed concepts of the epidemiological model, probabilistic cellular automata and the sequential genetic algorithm used in this work. In Section 3, the results have been elaborately discussed, where the optimized CA model has been employed for simultaneously understanding as well as analyzing active infections, total infections and total deaths caused by COVID-19 for several countries, considering the demographic and spatial population density variations. Section 4 comprises concluding remarks. An object process diagram of the proposed method has been depicted in Fig. 1(a). The methodology starts with the infection spreading according to the SEIQR epidemiological model in a random human population over a 2D grid, initialized on a country-specific basis. The parameters of the epidemiological model are continuously optimized using the proposed sequential genetic algorithm to match the real country-specific infection spread data. The proposed methodology consists of three distinct parts: (A) an epidemiological model that governs the infection spreading, (B) probabilistic cellular automata (PCA) to model the dynamics of the pandemic spread and (C) optimization of the parameters associated with PCA using genetic algorithm (GA) to fit real-world data. In the epidemiological model, the entire population is partitioned into five distinct parts. At the very beginning, every person is healthy but vulnerable to the infection. These people are denoted as susceptible (S) subpopulation.
Table 2 (notation):
$\tau_{ij}$: Transitional delay for $x$ to move from $a_i$ to $a_j$
$e_t$, $i_t$: Number of exposed and infected people in the $d$-neighbourhood of $x$ at time $t$
$p_e$, $p_i$: Probabilities that an exposed or an infected person spreads the infection to a susceptible person when they meet
$\Theta$: A gene containing all the parameters of PCA method
$B$: Binary encoded representation of $\Theta$
$G(\Theta)$: The PCA model with parameter $\Theta$
$y$: Time series of an epidemiological state in a country
$\hat{y}$: Time series estimate of epidemiological state from PCA
$e_{ji}$: Estimation error of $j$-th gene in $i$-th generation
$N_g$: Total number of chromosomes in the gene pool
$F$: Number of parents selected for mating from $N_g$
$p_\beta$: Fraction of $r_t$ that recovers from the disease
$\rho$: Fraction of parents $F$ that lives in the next generation

At time instance $t = 0$, some people in the population got exposed to the infection from some known or unknown source. These exposed people do not have any particular symptom of the infection, but they can spread the infection to the susceptible people. These asymptomatic people are referred to as exposed (E) subpopulation. At time instance $t = 0$, there were also some people who had clear symptoms of the infection and who also had the potential to spread the infection among susceptible people. These symptomatic people are considered as infected (I) subpopulation. After an incubation period, some of the exposed people show the symptoms of the infection and they move to subpopulation I. Because of the health facilities and testing time, the infected people are detected with some average delay, and put into quarantine. The people who are quarantined cannot spread the infection to other people, though they themselves remain in the infectious stage. These people are denoted as quarantined (Q) subpopulation. Both the quarantined people and the infected (but not detected) people would come out of the infectious stage eventually, and after that they no longer contribute to the infection spreading dynamics.
These people are denoted as removed (R) subpopulation in the model. This removed subpopulation contains two kinds of people: those who have recovered from the infection completely and neither infect nor get infected in future, and those who have died due to the severity of the infection. A schematic diagram of the transitions, probabilities and timelines corresponding to the dynamics of infection is shown in Fig. 1(b). In the analysis, normalized subpopulations have been considered, and the respective normalized subpopulation is denoted using the same lowercase character. For example, the normalized susceptible and infected subpopulations are denoted by $s$ and $i$ respectively. As shown in Fig. 1(c), this epidemiological time evolution has been implemented on a 2D lattice using PCA as discussed below. Let $L$ be a finite subset of $\mathbb{Z}^2$ at time instance $t$, denoted as $L \subset \mathbb{Z}^2$, which defines a regular 2D lattice. Every point on this lattice $x \in L$ can acquire one of a finite number of states $A$. In this particular problem, the set $A$ can be defined as $A = \{0, s, e, i, q, r\}$, where the terms $s$, $e$, $i$, $q$ and $r$ denote the particular possible states of infection as discussed in Sec. 2.1, and $0$ denotes no human occupant or an empty space. At time $t = 0$, $n_i^0$ points are randomly selected on $L$ and assigned the state $a_i$, where $i \in A$. The total initial population is defined as $N = \sum_{i \in A} n_i^0$. At any instance of time $t$, $n_i^t$, $i \in A \setminus 0$, denotes the total number of people in the respective state $a_i$. For the neighbourhood criteria, a modified-Moore neighbourhood or $d$-neighbourhood has been used. A finite subset $\Omega_d \subset \mathbb{Z}^2$ is defined, containing the origin $\mathbf{0} = (0, 0)$, and the cardinality of $\Omega_d$ is $4d(d + 1)$.
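The lattice set-up described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the string state labels and the use of Python's `random` module are assumptions, and the neighbourhood offsets exclude the origin so that their count matches the stated cardinality $4d(d+1)$. The example counts (100×100 grid, 2500 susceptible, 50 exposed, 4 infected) are the values the text reports later for its simulations.

```python
import random

# State labels for A = {0, s, e, i, q, r}; "0" marks an empty cell.
EMPTY, S, E, I, Q, R = "0", "s", "e", "i", "q", "r"

def d_neighbourhood(d):
    """Offsets of the d-neighbourhood (modified Moore), origin excluded.

    There are (2d + 1)^2 - 1 = 4d(d + 1) such offsets.
    """
    return [(dx, dy)
            for dx in range(-d, d + 1)
            for dy in range(-d, d + 1)
            if (dx, dy) != (0, 0)]

def init_lattice(size, counts, rng):
    """Place counts[state] individuals on distinct random cells of a size x size grid."""
    grid = [[EMPTY] * size for _ in range(size)]
    cells = [(x, y) for x in range(size) for y in range(size)]
    chosen = rng.sample(cells, sum(counts.values()))  # distinct cells, no overlap
    pos = 0
    for state, n in counts.items():
        for x, y in chosen[pos:pos + n]:
            grid[x][y] = state
        pos += n
    return grid

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
grid = init_lattice(100, {S: 2500, E: 50, I: 4}, rng)
offsets = d_neighbourhood(2)
```

How cells at the lattice boundary are treated (for example periodic wrapping versus truncation) is not specified in the text, so the sketch leaves that choice open.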
General probabilistic cellular automata (PCA) is a stochastic process that describes a sequence of mappings $\Lambda_t^a : L \to a$, $a \in A$, where any particular state $\Lambda_t^a(x)$ of $x \in L$ at a particular time instance $t$ is dependent on the previous states of the $d$-neighbourhood of $x$, denoted as $x + \Omega_d = \{x + \omega : \forall \omega \in \Omega_d\}$, with certain probabilities. More precisely, in COVID-19 infection spread, $\Lambda_t^E(x)$ is decided by $\Lambda_{t-1}(x + \omega)$, $\forall \omega \in \Omega_d$. The other mappings $\Lambda_t^a(x)$, $a \in A \setminus E$, depend on the sequence of states $\Lambda_\kappa^a(x)$, $0 \le \kappa < t$. The transition probability $p^t_{a_i a_j}$ denotes the probability of transition at time $t$ from state $a_i$ to state $a_j$, where $a_i, a_j \in A$. Without any loss of generality, $p^t_{a_i a_j}$ is denoted as $p^t_{ij}$ and the transition from state $a_i$ to $a_j$ as $a_{ij}$ in the rest of the discussion for a simpler notation. In cases where $a_i \neq a_j$, $p^t_{ij}$ is referred to as the state transitional probability, and if $a_i = a_j$, $p^t_{ii}$ is called the self transitional probability. If a state transition $a_{ij}$, $i \neq j$, happens in $x$ at time $t$ following the transition probability $p^t_{ij}$, and the transition $a_{ij}$ has a transitional delay $\tau_{ij}$, then the transition is governed by this delay measured from $t_{ui}$, the time instance when the transition $a_{ui}$, $u \neq i$, happened. In this infection diffusion model, only the state transitional probabilities $p^t_{se}$, $p^t_{ei}$, $p^t_{iq}$, $p^t_{qr}$ and $p^t_{ir}$ are considered to be nonzero at certain instances of time, and for all the other transitional probabilities, $\tau_{ij}$ is set to infinity, where $p_{ij}$ and $\tau_{ij}$ are user defined parameters. However, for the transition $a_{se}$, $t_{ui}$ and $\tau_{ij}$ are set to zero, and for $x \in L$, let us define $p^t_{se} = p_{ij} = 1 - p^t_{ss}$, where the self-transition probability $p^t_{ss}$ depends on $i_{t-1}$ and $e_{t-1}$, the numbers of cells in states $i$ and $e$ respectively in the $\Omega_d$ neighbourhood of $x$ at time $t - 1$.
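To make the two transition rules above concrete, the following sketch shows one plausible form for each. Both forms are assumptions, since the exact expressions are not given in the text. The first treats every exposed or infected neighbour as an independent chance of transmission, so a susceptible cell stays susceptible with probability $(1 - p_e)^{e_{t-1}} (1 - p_i)^{i_{t-1}}$. The second lets a delayed transition fire with probability $p_{ij}$ only once $\tau_{ij}$ steps have passed since the cell entered state $a_i$. The function names are hypothetical.

```python
def exposure_probability(p_e, p_i, n_exposed, n_infected):
    """Assumed form of p_se = 1 - p_ss for a susceptible cell.

    n_exposed and n_infected are the counts of exposed and infected cells
    in the d-neighbourhood at the previous time step.
    """
    p_stay_susceptible = (1.0 - p_e) ** n_exposed * (1.0 - p_i) ** n_infected
    return 1.0 - p_stay_susceptible

def delayed_probability(p_ij, tau_ij, t, t_entered):
    """Assumed delay gate: the transition a_ij is possible only after tau_ij steps.

    t_entered plays the role of t_ui, the time the cell entered state a_i.
    """
    return p_ij if t - t_entered >= tau_ij else 0.0
```

Under this assumed form a susceptible cell with no exposed or infected neighbours can never become exposed, and the exposure probability grows with the number of such neighbours, which is the qualitative behaviour the text describes.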
The probabilities $p_e$ and $p_i$ are defined as 'infection probabilities', which can be considered as the probabilities that a susceptible person becomes exposed to the infection when that person meets an exposed or an infected person respectively. An empty cell does not contribute to the infection spread, and thus the self transitional probability $p^t_{00} = 1$, $\forall t$. Among the total removed population $r_t$ at time instance $t$, a population fraction $p_\beta r_t$ is considered to recover from the infection at time instance $t$ and acquire long-term immunity towards the disease, and a population fraction $(1 - p_\beta) r_t$ is considered to be deceased. The removed population $r_t$ is not considered further in the infection dynamics and it is taken that $p^{t'}_{rr} = 1$, $\forall t' > t$. Though PCA has the potential to model the probabilistic transition of states on a spatial lattice, the main challenge in using it to model a real-world scenario is to find the optimal parameters for the PCA. As the search space for the proposed PCA model is very large, it is practically impossible to search for the optimal parameter setting manually to analyse the characteristics of the infection spread from real data. Thus, genetic algorithm (GA) has been applied to find the optimal parameter set given real time-series data. Let us assume a discrete time signal $y[n]$, $0 \le n \le (T - 1)$, associated with the real-world infection spread. The PCA model is denoted by $G(\Theta)$, where $\Theta = [\theta_1, \theta_2, \ldots, \theta_h]$ denotes the set of parameters used for the PCA model. If $\hat{y}[n]$, $0 \le n \le (T - 1)$, is the time evolution of the desired variable in the model $G(\Theta)$, then the objective is to find an optimal parameter set $\Theta^*$ such that $\hat{y}[n] \to y[n]$, $\forall n$. To apply GA, each $\theta_i$, $1 \le i \le h$, is encoded as a string of binary digits $b_i$ [54, 55], assuming that $\theta_i$ has a bound $|\theta_i| < \zeta_i$, $1 \le i \le h$. This binary string is referred to as a gene, and the concatenated genes, in the order of appearance of the respective $\theta_i$ in $\Theta$, are called the chromosome.
For example, if $B$ is the chromosome corresponding to parameter set $\Theta$, $G(B)$ is equivalent to $G(\Theta)$. A collection of $N_g$ chromosomes of estimated parameters, often referred to as the gene pool, is evaluated at every time step (called a generation). In our work, the error of each chromosome has been evaluated using the $l_1$ norm distance. At the $i$-th generation, the error of the $j$-th chromosome $B_{ji}$ is computed as $e_{ji} = \lVert y - \hat{y}_{ji} \rVert_1 = \sum_{n=0}^{T-1} |y[n] - \hat{y}_{ji}[n]|$, where $\hat{y}_{ji}$ is the estimated output of $G(B_{ji})$ in vector form and $\hat{y}_{ji}[n]$ is the value of $\hat{y}_{ji}$ at time instance $n$. At each generation, GA finds $\min(e_{ji})$, $\forall j$, and tries to make $e_{ji} \to 0$ as $i \to \infty$. In the proposed framework, some of the parameters are probabilities with a range of 0 to 1, and some of the parameters are associated with time (in days), which are discrete integers greater than or equal to zero in our case. Thus, the parameters are initialized randomly, keeping their domain restrictions intact. For mating, two chromosomes, often referred to as parents, are selected from the gene pool considering their 'fitness'. For the two selected parents, a crossover point or splice point is selected at $b_i$, $1 \le i \le h$, in both chromosomes, and a crossover [55] happens that produces two offspring. In our approach, the fitness $f_{ji}$ of each chromosome has been defined as the inverse of its error at a particular generation. At each generation, the $F$ best chromosomes, those having the maximum fitness, are selected from the gene pool for mating. Following the idea of [52], $\rho F$ parents are kept in the next generation along with the new chromosomes to ensure that the error in the next generation is always less than or equal to that of the current generation. After selecting $\rho F$ chromosomes from the parents, $N_g - \rho F$ children are produced from mating to keep the size of the gene pool constant.
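The encoding and scoring steps above can be illustrated with a short sketch. It is a minimal example under stated assumptions: parameters are taken to be non-negative and bounded above by `bound` (standing in for $\zeta_i$), the gene length `n_bits` is an arbitrary choice, and the small constant `eps` only guards against division by zero. None of these names or values come from the paper.

```python
def encode(theta, bound, n_bits=10):
    """Quantise theta in [0, bound] to a binary string (one gene)."""
    level = round(theta / bound * (2 ** n_bits - 1))
    return format(level, "0{}b".format(n_bits))

def decode(bits, bound):
    """Map a binary gene back to a parameter value in [0, bound]."""
    return int(bits, 2) / (2 ** len(bits) - 1) * bound

def l1_error(y, y_hat):
    """l1 distance between the real series y and the model output y_hat."""
    return sum(abs(a - b) for a, b in zip(y, y_hat))

def fitness(y, y_hat, eps=1e-12):
    """Fitness as the inverse of the l1 error (eps is an added safeguard)."""
    return 1.0 / (l1_error(y, y_hat) + eps)

# A chromosome is the concatenation of the genes, in parameter order.
chromosome = encode(0.25, 1.0, 4) + encode(14, 30, 5)
```

With `n_bits` bits, a decoded parameter can differ from the original by at most half a quantisation step, so the gene length sets the resolution of the search.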
After the offspring are generated, $s$ genes are randomly selected in the parameter space and small perturbations are added to them individually to mimic mutation. As shown by several researchers [57], the homogeneity in the gene pool increases with the generations, and as the perturbations due to mutation are typically small, the reduction of error becomes a problem after a few generations. Thus, to restrict homogeneity in the gene pool, a small number of offspring $\mu$ are selected from the total $N_g - \rho F$ generated offspring and are replaced with randomly generated chromosomes to maintain diversity. This step is called 'diversification' of the gene pool. In our problem, the parameters $\Theta$ of the PCA model $G(\Theta)$ are the state transitional probabilities $p_{ei}$, $p_{iq}$, $p_{ir}$, $p_{qr}$, the infection probabilities $p_e$ and $p_i$, the state transition delays $\tau_{ei}$, $\tau_{iq}$, $\tau_{qr}$, $\tau_{ir}$, the neighbourhood $d$, and the recovery fraction $p_\beta$ (which fixes the death fraction $1 - p_\beta$) as mentioned in Sec. 2.2. As optimizing this many parameters simultaneously might be challenging and require a huge amount of resources, we propose a variant of GA with a sequential evolution mechanism where, instead of optimizing the solutions simultaneously, the parameters are optimized sequentially. Let us define a set of generations as an era. For the first era, containing a small number of generations, a traditional GA methodology is followed as discussed so far to obtain a set of initial parameters. From the next era onward, two parameters are fixed and optimized sequentially in that era. Mutation and crossover are restricted to those two respective genes, whereas parent selection is done based on the performance of the entire chromosome. This newly proposed sequential optimization of the parameters of PCA using GA is defined as PCA-GA. The proposed approach can optimize a large number of parameters efficiently using limited resources. All the notations used in PCA-GA are briefly summarized in Table 2.
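The sequential scheme can be sketched as below. This is an illustrative simplification rather than the authors' code: chromosomes are kept as lists of real values instead of binary strings, crossover swaps whole genes rather than splicing at a point, the pair of active genes is cycled in a fixed order, and the error function is a toy stand-in for the $l_1$ error of a PCA run. All function names, default values and the toy target are hypothetical.

```python
import random

def next_generation(pool, error_fn, bounds, active, rng,
                    n_parents=10, rho=0.3, n_fresh=2, sigma=0.05):
    """One generation; crossover and mutation touch only the `active` genes."""
    ranked = sorted(pool, key=error_fn)               # lowest error first
    parents = ranked[:n_parents]                      # the F fittest chromosomes
    n_keep = max(1, int(rho * n_parents))             # rho * F survivors
    survivors = [list(p) for p in parents[:n_keep]]
    children = []
    while len(survivors) + len(children) < len(pool):
        a, b = rng.sample(parents, 2)
        child = list(a)
        for k in active:                              # gene-wise crossover
            if rng.random() < 0.5:
                child[k] = b[k]
        k = rng.choice(active)                        # small mutation, clipped
        lo, hi = bounds[k]
        step = rng.gauss(0.0, sigma * (hi - lo))
        child[k] = min(hi, max(lo, child[k] + step))
        children.append(child)
    for idx in range(min(n_fresh, len(children))):    # diversification
        children[idx] = [rng.uniform(lo, hi) for lo, hi in bounds]
    return survivors + children

def sequential_ga(error_fn, bounds, pool_size=30, eras=6, gens_per_era=15, seed=0):
    """First era evolves all genes; each later era evolves one pair of genes."""
    rng = random.Random(seed)
    h = len(bounds)
    pool = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pool_size)]
    history = []
    for era in range(eras):
        if era == 0:
            active = list(range(h))
        else:
            start = (2 * (era - 1)) % h
            active = [start, (start + 1) % h]
        for _ in range(gens_per_era):
            pool = next_generation(pool, error_fn, bounds, active, rng)
            history.append(min(error_fn(c) for c in pool))
    return min(pool, key=error_fn), history

# Toy usage: recover a hidden parameter vector (stand-in for fitting real data).
target = [0.2, 0.7, 0.5, 0.9]
toy_error = lambda c: sum(abs(a - b) for a, b in zip(c, target))
best, history = sequential_ga(toy_error, [(0.0, 1.0)] * 4)
```

Because the best parents are copied unchanged into the next pool, the best error in `history` never increases when the error function is deterministic. With a stochastic PCA simulation as the error function, that guarantee would hold only for a fixed random seed per evaluation.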
The proposed PCA-GA has a complexity which can be approximated as $O(N_g T_g O(f))$, where $N_g$ is the size of the gene pool, $T_g$ is the total number of generations and $O(f)$ is the complexity of measuring the fitness in the GA. For a large enough $N_g$, $T_g$ is considered a comparatively smaller constant, and thus the complexity of the entire algorithm is mainly governed by $N_g$ and $O(f)$. The complexity of estimating the fitness can be approximated as $O(f) = O(T + 8N\tau T)$ for the Moore neighbourhood criteria, where $N$ is the total population on the 2D grid. The length of the original time series data $T$, and $\tau$, the maximum of $\tau_{ij}$, are both constant, and thus $O(f)$ can be represented as $O(N)$. Though GA has been selected as the strategy to optimize the parameters of the proposed PCA model, it is evident that, because of the generalized construction of the proposed framework, other meta-heuristic methods could also be employed to search the parameters of the spatially driven SEIQR model, which is the main focus of this work. However, the presence of mutation and diversification in GA helps to search for better solutions, as the search space is extremely large. To validate the effectiveness of the proposed framework using PCA-GA, the actual statistics of COVID-19 spread till 20th June, 2020 in different countries are used. For finalizing the data-set from the available data of 213 countries, several aspects have been considered. At first, 102 countries were dropped due to a low number of reported cases (fewer than 1000 reported cases till 20th June 2020). Out of the remaining countries, some countries, like Iran, Greece, Paraguay etc., were removed due to data inconsistency, and finally 40 countries were randomly selected ensuring the following points:
• At least 2 countries from each continent were selected to maintain demographic diversity in our data.
• Care has been taken to maintain significant variation in population density, which we believe to be a major factor contributing to disease transmission.
• It was ensured that countries from three distinct stages of COVID-19 infection are considered: (i) where the infection is significantly diminished, (ii) where the peak infection has been reached but substantial infection still persists, and (iii) where consistent growth in infection is occurring.
With this widely varying spectrum of time series data, we proceed to quantitative calibration and interpretation through the proposed methodology. All data samples are taken from the website worldometers.info. To point out the major contributing factors in the dynamics of infection spread, for every country under consideration, three available time series, namely daily active cases, total number of infected cases and total number of deaths, are accumulated. Out of these three series, the daily active cases time series is used for model formulation, and the rest are considered for model validation. It is important to mention that the population $q_t$ is the relevant observable here, as the infected people in $i_t$ and $e_t$ remain latent and undetected in the population. The reported daily active case data is associated with the lifetime of the infection, and is used in this study to check the effectiveness of the proposed framework as follows. By applying PCA-GA on the daily active case data of a particular country, the parameters $\Theta^*$ that give the minimum $l_1$ error are extracted. For validation of the optimized parameters and understanding the robustness of the algorithm, results generated by using $G(\Theta^*)$ for the total infected states and deceased states are then compared with the real-world data. Here it must be noted that the optimal parameters $\Theta^*$ remain unaltered and no further optimization is performed. For all the simulations, PCA is initialized with a fixed lattice size of 100×100 with $n_e = 50$ and $n_i = 4$.
The populations $n_q$ and $n_r$ are set to zero at $t = 0$. The susceptible population $n_s$ has been initiated depending on the population density of a country as follows: among the countries considered in our study, for the country with the lowest population density (Canada), $n_s = 2500$ has been selected, and for the country with the highest population density (Singapore), $n_s = 6000$ has been fixed. For any other country, $n_s$ has been assigned within this range using logarithmic scaling based on the population of that country. As each of the parameters of PCA-GA has physical relevance, the sequential searching process has been initiated with the following restrictions on ranges. It is important to note that in our problem, genes associated with probabilities are initiated in the range $[0, 1]$ and clipped during the optimization process accordingly. The state transition delays $\tau_{ei}$ (incubation period) and $\tau_{iq}$ (testing delay) are considered to be within the range $(0, 30)$. The transition delays $\tau_{ir}$ and $\tau_{qr}$ (the corresponding recovery periods) are initialized in the range $(20, 100)$. All the simulations are executed on a system with an Intel Core i7 8700K processor, 64GB RAM and an NVIDIA GeForce RTX 2080 8GB GPU using Python and numpy packages. The daily active cases can be defined as $c_t = c_{t-1} + q_t - r_t$, where $c_t$ is the number of active cases at time instance $t$, with the initialization $c_0 = 0$. In Fig. 2, the active cases of 20 different countries are shown along with the respective estimated active cases using the PCA-GA model. For the countries shown in Fig. 2, the first peak of the infection has already been crossed and a steady fall in the infection spread is observed. It can also be seen that the active cases of some countries, like China, Israel and Switzerland, follow smooth bell-shaped curves, whereas for some countries, like Australia, Cyprus, Hungary etc., the time series data deviates from bell-shaped curves with a substantial degree of noise.
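The initialisation and the active-case recursion above can be sketched as follows. The logarithmic interpolation is one plausible reading of "logarithmic scaling" and is an assumption, as are the function names; the scaling quantity is called `density` here, although the text is not explicit about whether population or population density is used. The recursion treats $q_t$ and $r_t$ as the numbers of people newly quarantined and newly removed on day $t$, which is also an interpretation.

```python
import math

def susceptible_count(density, d_min, d_max, n_min=2500, n_max=6000):
    """Interpolate n_s between n_min and n_max on a logarithmic scale.

    d_min and d_max are the scaling values of the least and most dense
    countries in the data set (hypothetical inputs, not taken from the paper).
    """
    frac = (math.log(density) - math.log(d_min)) / (math.log(d_max) - math.log(d_min))
    return round(n_min + frac * (n_max - n_min))

def active_cases(q_new, r_new):
    """Apply c_t = c_{t-1} + q_t - r_t with c_0 = 0; returns c_1, c_2, ..."""
    c, series = 0, []
    for q, r in zip(q_new, r_new):
        c = c + q - r
        series.append(c)
    return series
```

For instance, `active_cases([3, 2, 0], [0, 1, 2])` gives the running totals for three days in which 3, 2 and 0 people are quarantined while 0, 1 and 2 are removed.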
In all the cases, PCA-GA has successfully captured the trend of the time series data by estimating the parameters of the epidemiological process. To measure the goodness of the model estimation, three different metrics have been used. The root mean square error (RMSE) distance, correlation distance and chi-square distance [58, 59, 60], denoted as $d_l$, $d_c$ and $d_\chi$ respectively, are computed between the real data and the estimated values from the PCA-GA model to evaluate the effectiveness of the optimized model. For two vectors $u$ and $v$, these distances are defined in terms of $T$, the length of each vector, $u_i$ and $v_i$, the $i$-th elements of $u$ and $v$ respectively, and $(\cdot)$, the dot product of two vectors. As shown in Fig. 3(a), the proposed model performs well in modelling the real data. When evaluated over all the countries considered in this work, the proposed model fits the data well, and the fittings were poor in only 0% to 12.5% of the cases, depending on the evaluation metric. It is important to mention that all the distance measures are evaluated on normalized data. In Fig. 2, an interesting point to notice is that the peaks of the active cases are located at markedly differing time instances, and the other properties of the observed distributions, like variance, skewness etc., also vary drastically. The fundamental differences between the fitted curves are quantified with the help of boxplots of the parameters in Fig. 3(b)-(c) by analysing basic statistical properties. The reported boxplots are specifically for the countries selected in Fig. 2. It can be noted that $p_e$, $p_i$ and $p_{ei}$ exhibit a wide variability in Fig. 3(b). During our analysis, a strong positive correlation of $p_e$ and $p_i$ with population density has also been observed. It can thus be inferred that the variation in population density in the considered countries causes the wide range of these parameters.
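For reference, commonly used forms of the three distances can be sketched as below. These are standard textbook definitions and are assumptions here: the exact expressions used in the paper (for example the normalisation of the chi-square distance) are not recoverable from the text, and the function names are hypothetical. The inputs are assumed to be normalized, non-negative series of equal length.

```python
import math

def rmse_distance(u, v):
    """Root mean square error between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))

def correlation_distance(u, v):
    """One minus the Pearson correlation of u and v (0 means perfectly correlated)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    dot = sum(a * b for a, b in zip(du, dv))
    norm = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    return 1.0 - dot / norm

def chi_square_distance(u, v, eps=1e-12):
    """Symmetric chi-square distance; eps avoids division by zero."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(u, v))
```

The three measures respond to different things: RMSE to absolute deviations, correlation distance to the shape of the curve regardless of scale, and chi-square distance to deviations relative to magnitude, which may be why the share of poor fits reported in the text depends on the metric.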
It can also be concluded that a high population density increases the probability of transmission of the disease. The considerable difference between the mean magnitudes of the infection-associated probabilities ($p_e$, $p_i$ and $p_{ei}$) and the recovery-related probabilities ($p_{iq}$, $p_{ir}$ and $p_{qr}$) indicates the sharper rise and slower fall of the active case curves, which results in a skewed distribution in most cases (see Fig. 2). Fig. 3(c) also shows that $\tau_{ei}$, which is identified as the incubation time in the model, exhibits a range of 3-14 days with a mean of 7.3 days, which agrees well with observations from around the world [61]. In this figure, a wide variability in the range of $\tau_{ir}$ and $\tau_{qr}$ is observed, which points to the substantial differences in the health infrastructure of these countries. It must be mentioned that, while performing this statistical analysis with all 40 countries, some countries were found to be consistent outliers (not included in Fig. 3(b)-(c)) in terms of four transitional parameters: $p_{ir}$, $p_{qr}$, $\tau_{ir}$ and $\tau_{qr}$. Analysis of the active case distributions of these outliers showed that the time series data for all these countries have a saturating trend, in which the daily active cases do not show an average descent with time. Some of these cases are shown in Fig. 4. Even for these data, which have a drastically different qualitative trend compared to the countries shown in Fig. 2, the proposed PCA-GA framework captures the trend of the real time series data accurately. There are also certain countries, like India, Brazil, Chile and Mexico, for which the infection spreading started later than in countries like China or Italy, and where the daily active cases are still growing almost exponentially. As shown in Fig. 5, PCA-GA is able to estimate the time series data for these countries where the infection is spreading rapidly.
The dynamics of COVID-19 spread in these countries are of particular interest, as predicting the peak positions might help immensely in understanding the maximum socioeconomic impact of the disease at any one time in that geographical location.

While analyzing complex dynamics like the spread of a pandemic, it is not always sufficient to model only the input real data. For real-world applications, the optimized model should be robust and should provide meaningful interpretations without further retraining or parameter tuning. To validate the robustness and effectiveness of the proposed algorithm, the optimized model is now employed for three different tasks. First, the robustness of the optimized model is checked by estimating the total number of infected cases, followed by the total number of death cases, without any further training, tuning or supervision. Finally, to further validate the efficiency of the model, its performance is evaluated on the prediction task by training the model on partitioned data and evaluating its future predictions without any further optimization.

The total number of infected cases $z_t$ at time instance $t$ can be defined as $z_t = \sum_{i=0}^{t} q_i$. This cumulative sum indicates the total number of people who have suffered from the disease up to that point in time. For a country where the first wave of the infection has passed, e.g., Croatia or Italy, $z_t$ approximately follows a sigmoid function, whereas for countries like India and Mexico, where the infection has not reached its peak, $z_t$ follows an exponential function. As PCA-GA is optimized using the time series information of daily active cases $c_t$, $z_t$ is used to validate the parameters learnt by the sequential GA framework in the following way. Once a particular country is selected, $\Theta^*$ is estimated using PCA-GA with the actual $c_t$.
Next, $\hat{z}_t$ is calculated for $G(\Theta^*)$ without any further fine-tuning of the parameters, and $\hat{z}_t$ is compared with the actual $z_t$. In Fig. 6, the total cases (blue) of six such countries are shown along with the best-fit results obtained from PCA-GA (red), which show excellent agreement with the data. It must be mentioned that for all three dynamical stages of infection spreading discussed in Sec. 3.2, i.e., where the first wave of infection has passed, where the active cases are currently almost saturated, and where the active cases are increasing rapidly, our estimated $\hat{z}_t$ closely matches $z_t$ without any further parameter optimization. When evaluated over all 40 countries for the number of infected people, the proposed method gives average $d_l$, $d_c$ and $d_\chi$ of 0.037, 0.006 and 0.53 respectively, which demonstrates the robustness of the model.

To further validate the 'goodness' of the estimated parameters, the parameter set $\Theta^*$ optimized over the daily active cases of a particular country is taken, and the identical parameter values are used to compare the estimated total deaths with the actual total deaths of that country. Death in the population is the prime concern in the COVID-19 pandemic and, as mentioned in Sec. 2.2.1, the daily deceased population is a fraction of $r_t$ in our model. So, the total estimated death cases can be defined as $\hat{d}_t = (1 - p_\beta) \sum_{i=0}^{t} r_i$, where $p_\beta$ and $r_i$ for $0 \le i \le t$ are given by $\Theta^*$ and $G(\Theta^*)$ respectively. Fig. 7 compares the actual total death cases $d_t$ with the estimated total death cases $\hat{d}_t$ for $\Theta^*$, the identical set of parameters used previously for estimating active cases as well as total cases. The same countries shown in Fig. 6 have been selected to show the robustness of the parameter set $\Theta^*$ estimated using the proposed technique.
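The two derived quantities used in these robustness checks are cumulative sums of the model outputs, as the following minimal sketch shows. The function names are illustrative, and `q` and `r` stand for the daily series $q_t$ and $r_t$ produced by the optimized model.

```python
import numpy as np

def total_infected(q):
    """Cumulative infected cases z_t = sum_{i=0}^{t} q_i."""
    return np.cumsum(np.asarray(q, dtype=float))

def total_deaths(r, p_beta):
    """Estimated cumulative deaths d_t = (1 - p_beta) * sum_{i=0}^{t} r_i."""
    return (1.0 - p_beta) * np.cumsum(np.asarray(r, dtype=float))
```

Because both series are computed from the same $q_t$, $r_t$ and $p_\beta$ that were fitted to the active cases, comparing them with reported totals tests the fitted parameters rather than refitting them.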
Excellent agreement with the data is found for the total death cases as well; when evaluated over all 40 countries, the proposed method gives average $d_l$, $d_c$ and $d_\chi$ of 0.041, 0.006 and 0.48 respectively.

Prediction of future events is always challenging in data modeling [62]. For the final stage of validation of the methodology, the predictive power of the model has been tested. As the impact of this pandemic is far-reaching and varies with socioeconomic context, a reasonably accurate prediction of the dynamics of the infection spread can be crucial and useful in many ways. As PCA-GA successfully estimates the optimal parameter set $\Theta^*$, this set of parameters can also be used to predict the future course of the infection in that country. To validate the prediction strategy, the daily active cases $c_t$ of a country are truncated to $c_P$ by keeping the first $P$ values. PCA-GA is applied to $c_P$ to estimate the parameters $\Theta_P$. Then $\Theta_P$ is used to predict the daily active cases $\hat{c}_t$. As shown in Fig. 8, for two countries, Israel and Switzerland, the daily active case information up to 54 and 43 days respectively is used to predict the daily active cases up to 100 days. In the figure, the estimated curve (shown in red) is optimized using all the real data points available, whereas the predicted curve (shown in black) is optimized using the truncated real data. It can be observed that the predictive estimation closely follows the real active case data, even though only about 50% of the data points are used for parameter estimation. For Israel and Switzerland, the 100-day prediction of the algorithm produces $(d_l, d_c, d_\chi)$ of (0.056, 0.008, 0.95) and (0.028, 0.005, 0.43) respectively. As prediction of the spread of the infection is one of the most challenging tasks, the predictive ability of the proposed algorithm is compared with different baseline methods to better understand its performance.
As only very few data points are available in the truncated data, the fast decision tree learning algorithm [63] and random forest regression perform poorly, giving $(d_l, d_c, d_\chi)$ of (0.43, 0.49, 243.82) and (0.439, 0.51, 252.6) respectively for the truncated time series of Switzerland. SVM regression with an RBF kernel performs satisfactorily on the same truncated data and produces $(d_l, d_c, d_\chi)$ of (0.09, 0.02, 27.8). However, the proposed PCA-GA algorithm significantly outperforms the baseline algorithms and produces $(d_l, d_c, d_\chi)$ of (0.028, 0.005, 0.43).

As the PCA-GA methodology has been validated in detail in Section 3.3, in this section it is employed to predict consistently rising real epidemic data. Though the parameter estimation works well even when only minimal information about the peak position in $c_t$ is available, the prediction task becomes really challenging when $c_t$ is exponential in nature. For a particular country where $c_t$ is rising almost exponentially, prediction proceeds as follows. First, the best set of parameters $\Theta^*$ is found by PCA-GA, with fitness $f^*$ and error $e^*$. As the drop of the infection depends heavily on the transitional probabilities $p_{ir}$ and $p_{qr}$ and the state transitional delays $\tau_{ir}$ and $\tau_{qr}$, these parameters are tuned to find a region of predictions bounded by the possible best-case and worst-case scenarios. While estimating the best-case scenario, $p_{ir}$ and $p_{qr}$ are chosen equal to the maximum and minimum $p_{ir}$ and $p_{qr}$ observed in the continent to which the country belongs. The reason behind this strategy is that the parameters related to the infection spreading differ from continent to continent, as also observed in [64].
In the best-case scenario, the transitional delays $\tau^*_{ir}$ and $\tau^*_{qr}$, the corresponding optimized delays available in $\Theta^*$, are reduced to obtain the best-case transitional delays (denoted here $\tau^{\ominus}_{ir}$ and $\tau^{\ominus}_{qr}$) such that the fitness remains within 90% of $f^*$. For the worst-case scenario, we consider $\tau^{\oplus}_{ir} = \tau^*_{ir} + \alpha_{ir}$ and $\tau^{\oplus}_{qr} = \tau^*_{qr} + \alpha_{qr}$, where $\alpha_{ir} = \tau^*_{ir} - \tau^{\ominus}_{ir}$ and $\alpha_{qr} = \tau^*_{qr} - \tau^{\ominus}_{qr}$. Fig. 9 depicts the prediction of the daily active cases using the method discussed so far. In Fig. 9, the black dotted line indicates the prediction using the optimal parameters $\Theta^*$ estimated using PCA-GA. The orange line indicates the best-case scenario, in which the maximum daily active cases would be minimized given the real data. The red line indicates the worst-case scenario based on the specific conditions mentioned above. The best-case and worst-case scenarios act as the limits of an area (shaded in pink) of probable future states. Any curve inside the pink region that contains the real data could be the future evolution of the daily active cases, given real time series data that are currently in an exponentially rising state. This indicates that for India, which is now one of the biggest epicenters of COVID-19 in South-eastern Asia, the disease can start to decline very soon if vigorous measures from the government and complete support from the public can be achieved. It also shows that the maximum number of active cases on a single day, which puts a direct burden on the health infrastructure of the country, can be restricted to below 750,000 if people comply with the mitigation strategies indicated by the government and the recovery rate remains at its current value. In that case, the peak of the disease is expected to pass between mid-September and mid-October, and the first wave of the disease can be over by March 2021.
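The construction of the best-case and worst-case delays described above can be sketched as follows. This is only an illustration under stated assumptions: `fitness` is a hypothetical callable that evaluates the model fitness for a given delay with all other parameters fixed, higher fitness is taken to be better, "within 90% of $f^*$" is read as fitness at least $0.9 f^*$, and the delay is reduced in unit steps until that condition first fails.

```python
def scenario_delays(tau_star, fitness, f_star, keep=0.9, step=1):
    """Return (best_case_delay, worst_case_delay) for one transition.

    tau_star : optimized delay taken from Theta*
    fitness  : hypothetical callable, delay -> fitness (higher is better)
    f_star   : fitness achieved at tau_star
    keep     : assumed reading of 'within 90% of f*'
    """
    tau_best = tau_star
    # Shorten the delay while the fit stays acceptable and the delay positive.
    while tau_best - step > 0 and fitness(tau_best - step) >= keep * f_star:
        tau_best -= step
    alpha = tau_star - tau_best      # reduction achieved in the best case
    tau_worst = tau_star + alpha     # mirror the same amount for the worst case
    return tau_best, tau_worst
```

With a toy fitness such as `lambda tau: 100 - 3 * abs(tau - 30)`, an optimized delay of 30 and `f_star = 100`, the delay can be shortened to 27 before the fitness drops below 90, so the worst-case delay becomes 33. The same routine would be applied separately to $\tau_{ir}$ and $\tau_{qr}$.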
These predictions also imply that the range of future states possible for exponentially rising daily active cases depends not only on the evolution of the epidemic so far, but is also strongly affected by the consistency and efficiency with which mitigation strategies are implemented.

The COVID-19 outbreak has had a massive impact across the globe. Even after nation-wide lockdowns, extensive testing strategies and medical support, the spread of the virus has overwhelmed several countries. Thus, it is becoming more and more important to understand the nature of the infection spread and the key parameters that control it. In this work, we proposed a probabilistic cellular automata model to understand and depict the spread of COVID-19 using an appropriate choice of loss functions and an evolutionary optimization framework. The parameters of this cellular automata model are optimized using a sequential evolutionary genetic algorithm. It has been shown that this self-adapting methodology is highly flexible and can accurately estimate the time trajectories of epidemics. The model works with physically interpretable parameters, which are accessible for analysis, data collection and further experiment, and can be readily related to ground reality. The model has been successfully employed to optimize all these parameters simultaneously for the daily active cases, and the optimized parameters robustly reproduce the total infected cases and total deaths. The performance of the model has been demonstrated for a large number of countries with great diversity in population density, continent and available healthcare infrastructure. The predictive strength of the model has also been validated extensively, and the model has been shown to estimate the course of the pandemic for countries where the infection peak has not yet been reached.
It is important to mention that the motivation of this work was to develop a data-driven, generalized, spatial framework that can be used to estimate relevant epidemiological parameters. The methodology is powerful and flexible enough that physical interpretations of the results obtained from these analyses can have wide-ranging implications. Once the data are properly interpreted with the proposed methodology, interesting realistic features can be identified for specific countries. For example, in a pandemic situation, easily relatable factors such as population clusters, variable population density and variable health facilities at different places in a country can be studied to understand and predict the emergence of new hotspots, which can be used to design selective area containment strategies. While we propose and establish the applicability and strength of this framework in this work, we wish to address these application perspectives in upcoming studies. With the proposed platform, the impact of individuality on the contagion process can be studied explicitly, which might be directly related to questions such as behavioral differences during lockdown, the influence of rumors and differences in opinion on vaccination. As the effects of more complex dynamical factors, like periodic lockdowns or population clusters, are not considered in the present model, its prediction capability in its present form is not satisfactory for time series data with abrupt discontinuities. The proposed framework could be enhanced with other $\ell_p$-norm distances and different optimization techniques, like the multi-objective genetic algorithm or the Strength Pareto Evolutionary Algorithm. Other swarm-based optimization techniques can also be explored for further refinement of the model.
The potential of the proposed approach can be utilized to better understand the spreading and control of disease, beyond the pandemic the world is currently facing, by keeping track of the spatial information of the dynamics, incorporating realistic behavioural aspects, and optimizing in terms of demographic as well as socioeconomic features.

[1] World Health Organization coronavirus disease (COVID-2019) situation reports
[2] Epidemiological, clinical and virological characteristics of 74 cases of coronavirus-infected disease 2019 (COVID-19) with gastrointestinal symptoms
[3] Clinical characteristics of COVID-19 patients with digestive symptoms in Hubei, China: a descriptive, cross-sectional, multicenter study
[4] Kidney disease is associated with in-hospital death of patients with COVID-19
[5] Digestive symptoms in COVID-19 patients with mild disease severity: clinical presentation, stool viral RNA testing, and outcomes
[6] COVID-19 and the cardiovascular system
[7] Immediate psychological responses and associated factors during the initial stage of the 2019 coronavirus disease (COVID-19) epidemic among the general population in China
[8] Unique epidemiological and clinical features of the emerging 2019 novel coronavirus pneumonia (COVID-19) implicate special control measures
[9] Aerosol and surface stability of SARS-CoV-2 as compared with SARS-CoV-1
[10] Presumed asymptomatic carrier transmission of COVID-19
[11] Estimation of the asymptomatic ratio of novel coronavirus infections
[12] A familial cluster of infection associated with the 2019 novel coronavirus indicating possible person-to-person transmission during the incubation period
[13] Modelling the COVID-19 epidemic and implementation of population-wide interventions in Italy
[14] Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions
[15] On a quarantine model of coronavirus infection and data analysis
[16] Retrospective analysis of the possibility of predicting the COVID-19 outbreak from internet searches and social media data, China
[17] Propagation analysis and prediction of the COVID-19
[18] Composite Monte Carlo decision making under high uncertainty of novel coronavirus epidemic using hybridized deep learning and fuzzy rule induction
[19] Healthcare impact of COVID-19 epidemic in India: a stochastic mathematical model
[20] Finding an accurate early forecasting model from small dataset: a case of 2019-nCoV novel coronavirus outbreak
[21] Monte Carlo deep neural network model for spread and peak prediction of COVID-19
[22] A fuzzy dynamic optimal model for COVID-19 epidemic in India based on granular differentiability
[23] COVID-19 progression timeline and effectiveness of response-to-spread interventions across the United States
[24] Modelling the epidemic 2019-nCoV event in Italy: a preliminary note
[25] Assessing spread risk of Wuhan novel coronavirus within and beyond China
[26] Prediction models for diagnosis and prognosis of COVID-19 infection: systematic review and critical appraisal
[27] Dynamically modeling SARS and other newly emerging respiratory illnesses: past, present, and future
[28] Forecasting models for coronavirus disease (COVID-19): a survey of the state-of-the-art
[29] Synchrony of sylvatic dengue isolations: a multi-host, multi-vector SIR model of dengue virus transmission in Senegal
[30] Infectious diseases of humans: dynamics and control
[31] Asymptotic behavior in a deterministic epidemic model
[32] Optimal control of deterministic epidemics
[33] Viral marketing on social networks: an epidemiological perspective
[34] The reproductive number of COVID-19 is higher compared to SARS coronavirus
[35] Transmission potential and severity of COVID-19 in South Korea
[36] Early dynamics of transmission and control of COVID-19: a mathematical modelling study
[37] Epidemic analysis of COVID-19 in China by dynamical modeling
[38] A contribution to the mathematical theory of epidemics
[39] Analysis, simulation and optimal control of a SEIR model for Ebola virus with demographic effects
[40] A simple mathematical model for Ebola in Africa
[41] Cellular automata machines: a new environment for modeling
[42] Cellular automata and complexity: collected papers
[43] A probabilistic automata network epidemic model with births and deaths exhibiting cyclic behaviour
[44] A simple cellular automaton model for influenza A viral infections
[45] Individual-based lattice model for spatial spread of epidemics
[46] Epidemic dynamics: discrete-time and cellular automaton models
[47] A cellular automata modeling for visualizing and predicting spreading patterns of dengue fever
[48] A novel cellular automata classifier for COVID-19 prediction
[49] Enhanced cellular automata with autonomous agents for COVID-19 pandemic modeling
[50] Computational model on COVID-19 pandemic using probabilistic cellular automata
[51] Genetic algorithms for real parameter optimization
[52] Nonlinear parameter estimation via the genetic algorithm
[53] A hybrid genetic algorithm for efficient parameter estimation of large kinetic models
[54] A genetic algorithm approach to curve fitting
[55] Least median squares curve fitting using a genetic algorithm
[56] Evolutionary aspects of spatial prisoner's dilemma in a population modeled by continuous probabilistic cellular automata and genetic algorithm
[57] Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence
[58] Clustering of time series data: a survey
[59] Denoising nonlinear time series by adaptive filtering and wavelet shrinkage: a comparison
[60] Anomaly detection in medical WSNs using enclosing ellipse and chi-square distance
[61] World Health Organization coronavirus disease (COVID-2019) situation reports
[62] A comparative study of statistical and rough computing models in predictive data analysis
[63] A fast decision tree learning algorithm
[64] Correlation between universal BCG vaccination policy and reduced morbidity and mortality for COVID-19: an epidemiological study