key: cord-0996604-7ft0g062 authors: Khan, Farhan Mohammad; Kumar, Akshay; Puppala, Harish; Kumar, Gaurav; Gupta, Rajiv title: Projecting the Criticality of COVID-19 Transmission in India Using GIS and Machine Learning Methods date: 2021-05-30 journal: nan DOI: 10.1016/j.jnlssr.2021.05.001 sha: e7e2f21678144cd808c5459bc914fce3740075f5 doc_id: 996604 cord_uid: 7ft0g062
The world faces a new public health catastrophe with the advent and spread of the 2019 novel coronavirus (2019-nCoV). The experience of various countries and the World Health Organization (WHO) guidelines indicate that social distancing, use of sanitizers, thermal screening, quarantining, and lockdown of cities are effective measures for containing the spread of the pandemic. Though complete lockdown helps in containing the spread, it creates complications by breaking the chain of economic activity. Besides, laborers, farmers, and workers may lose their daily earnings. Owing to these detrimental effects, the government has to lift the lockdown strategically. Predicting the spread of COVID-19 and analyzing when the cases would stop increasing helps in developing such a strategy. This paper attempts to predict the time after which the number of new cases stops rising, assuming strict implementation of lockdown conditions. Three techniques, Decision Tree, Support Vector Machine, and Gaussian Process Regression, are used to project the number of cases. The projections are then used to identify inflection points, which would help in planning the strategic easing of lockdown in selected areas. The criticality in a region is evaluated using the criticality index (CI), which was proposed by the authors in an earlier research work. This research work is made available in a dashboard to enable decision-makers to combat the pandemic.
Coronavirus disease (COVID-19) is a new and contagious disease caused by a novel coronavirus. The disease affects the lungs and causes a respiratory illness with symptoms such as cold, throat inflammation, cough, fever, and, in severe cases, trouble breathing. Public health authorities recommend that people protect themselves by frequently washing and/or sanitizing their hands, avoiding touching the nose, ears, and face, and maintaining social distance from other people. Considering the global footprint of the disease, the World Health Organization (WHO) declared COVID-19 to be a pandemic and, on 30th March 2020, set out guidelines to help countries safeguard critical health services during the COVID-19 epidemic. Different countries put action plans in place to contain the spread of the pandemic. Controlling the situation requires a plan that depends on the prediction of new COVID-19 cases. Such predictions help hospitals and administrations take necessary measures in advance (WHO, 2020). In the context of an emerging infectious disease outbreak, predicting the trend of the epidemic is of paramount importance for planning effective control strategies and determining how those strategies affect the course of the epidemic. Gaussian process regression (GPR) is a nonparametric, Bayesian approach to regression that is widely used in machine learning. A significant benefit of GPR is that it can work well on small datasets and can provide uncertainty measurements for its predictions (Zhang et al., 2018). Unlike many supervised machine learning algorithms, which learn exact values for every parameter of a function, the Bayesian approach infers a probability distribution over all the possible values.
GPR is not limited to a particular functional form: rather than calculating the probability distribution of the parameters of a specific function, it places a probability distribution over all admissible functions that fit the data. A prior over functions is specified, and the predictive posterior distribution is computed from the patterns observed in the training data. The mean function and the covariance kernel function are selected based on the model architecture and tuned during model selection. The mean function is typically constant, either zero or the mean of the training dataset. The covariance kernel function can take many forms as long as it satisfies the properties of a kernel (i.e., it is symmetric and positive semi-definite). Common kernel functions include the constant, linear, exponential, squared exponential, and Matern kernels. A prior function space and covariance kernel functions have been used for GPR (Rasmussen et al., 2016). The GPLP toolbox (Park et al., 2012) is an Octave and MATLAB implementation of several localized regression methods for GPR: the domain decomposition method (DDM; Park et al., 2011), partial independent conditional (PIC; Snelson and Ghahramani, 2007), localized probabilistic regression (LPR; Urtasun and Darrell, 2008), and bagging for Gaussian process regression (BGP; Chen and Ren, 2009). Most of these localized regression methods can be applied to general machine learning problems, although the domain decomposition method is applicable only to spatial data sets. The toolbox also offers two parallel versions of the DDM. Ease of parallelization is one of the advantages of localized regression, and the two parallel implementations show how this benefit can be realized in software (Park et al., 2012).
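The GPR formulation described above (a constant zero mean function plus a covariance kernel) can be illustrated with a minimal sketch. This is not the authors' MATLAB or toolbox implementation: the function names, the squared exponential kernel chosen as the example, the small noise term `noise_var`, and the plain Gaussian-elimination solver are all assumptions made for illustration.

```python
import math


def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance between two scalar inputs (illustrative choice)."""
    return signal_var * math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back-substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_predict(x_train, y_train, x_new, kernel=sq_exp_kernel, noise_var=1e-6):
    """Posterior mean and variance at x_new under a zero-mean GP prior."""
    n = len(x_train)
    # Covariance of the training inputs, with a small noise term on the diagonal.
    K = [[kernel(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [kernel(x_new, xi) for xi in x_train]
    alpha = solve(K, y_train)  # K^{-1} y
    v = solve(K, k_star)       # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, max(var, 0.0)
```

Near a training input the posterior variance shrinks towards the noise level, while far from the data the prediction falls back to the prior mean (zero) with the prior variance. This is the uncertainty estimate the text refers to.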
Gaussian process regression has the advantage of being able to combine different kernels, creating a rich set of interpretable and reusable building blocks (Duvenaud et al., 2013). For example, adding two kernels together models the data as a superposition of independent functions. Multiplying a kernel by a radial basis function kernel locally smooths the predictions of the first kernel. Gaussian process models learn functions efficiently: even if the inputs are sampled at random, the error of a Gaussian process regression goes down as observations are added (Schulz et al., 2018). A support vector machine (SVM) is a supervised learning algorithm that can be used for classification or regression. An SVM classifies data by finding the best hyperplane that separates data points of one class from those of the other class (Dagher et al., 2008). Support vectors are a small subset of the training data that determine the optimal location of the decision surface. SVM variants include linear, quadratic, cubic, fine Gaussian, medium Gaussian, and coarse Gaussian SVMs for classification and prediction. Of these, the quadratic SVM has medium model flexibility, is hard to interpret, and has medium memory usage for binary problems and large memory usage for multiclass problems (Pelckmans et al., 2002). In machine learning, a decision tree can be used to represent decisions and decision-making for deriving a strategy to predict future data. Decision trees are easy to interpret, fast for fitting and prediction, and low on memory usage. The available variants are the coarse tree, medium tree, and fine tree, whose model flexibility increases with the maximum number of splits setting. A fine tree has many leaves, which allows it to make many fine distinctions between classes (Safavian et al., 1991). For prediction, the decisions are followed down the tree from the root node to a leaf node. A fine tree with many leaves is usually highly accurate on the training data.
A leafy tree tends to overtrain, and its validation accuracy is often far lower than its training accuracy. A fine DT is built by a greedy algorithm that performs binary splits of the feature space to maximize the information gained at each tree node (Martinez et al., 2007). database, the spread of COVID-19 is predicted at a national level, which is further distributed among the districts in terms of cumulative cases. In view of the findings of past research work (Akshay et al., 2020), the Gaussian Process Regression method is adopted in this study to project the number of cumulative cases that are likely to arise. Additionally, the inflection points at the district level, which were not addressed in the past study (Akshay et al., 2020), are discussed. The need of the hour is to understand the spatial footprint of COVID-19 and to predict its possible spread while analyzing the risk in a region. In this regard, a framework is designed to help decision-makers forecast the likely scenarios of its extent. Besides mapping the cumulative cases, which are evaluated using the data set available as of 10th June 2020, the attributes of the parameters are projected for the next two months using three machine learning techniques: Gaussian Process Regression (GPR), Support Vector Machine (SVM), and Decision Tree (DT). GPR with an exponential kernel function was used in forecasting, considering the worst-case scenario. Exponential GPR is identical to the Squared Exponential GPR except that the Euclidean distance is not squared. Exponential GPR replaces inner products of basis functions with kernels more slowly than the Squared Exponential GPR.
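The difference between the two kernels named above can be written out directly. The following is a small sketch under assumed parameter names (`length_scale`, `signal_std`) and an assumed parameterization. The only property taken from the text is that the exponential kernel uses the Euclidean distance itself while the squared exponential kernel uses its square.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_exponential(x, y, length_scale=1.0, signal_std=1.0):
    """Covariance that decays with the squared distance."""
    r = euclidean(x, y)
    return signal_std ** 2 * math.exp(-(r ** 2) / (2.0 * length_scale ** 2))


def exponential(x, y, length_scale=1.0, signal_std=1.0):
    """Same form, but the distance enters without being squared."""
    r = euclidean(x, y)
    return signal_std ** 2 * math.exp(-r / length_scale)
```

With a unit length scale the two kernels agree at zero separation. The exponential kernel then falls off faster at short distances and more slowly at long distances.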
The Exponential GPR handles smooth functions well with minimal errors. A support vector machine is a supervised learning algorithm that can be used for classification or regression. An SVM classifies data by finding the best hyperplane that separates data points of one class from those of the other class. SVM variants include linear, quadratic, cubic, fine Gaussian, medium Gaussian, and coarse Gaussian SVMs for classification and prediction. The quadratic SVM has medium model flexibility, is hard to interpret, and has medium memory usage for binary problems and large memory usage for multiclass problems. Decision trees are easy to interpret, fast for fitting and prediction, and low on memory usage. The available variants are the coarse tree, medium tree, and fine tree, whose model flexibility increases with the maximum number of splits setting. (1) It is recommended that lockdown be completely lifted in districts that are in the green zone. However, travel to an adjacent district can be restricted if that district falls into another zone. Lockdown can alternately be opened and closed in these regions with continuous monitoring of new positive cases. The importance of the work lies in identifying the districts that will fall into a more severe zone in the following weeks. For such districts, a policy of partial release is recommended with various preventive actions in place. Due to the nature of the problem, it is recommended that the maps be updated daily and that changes of a district from one critical zone to another be identified. Though the findings help in planning the combat strategies, updating the database would help produce forecasts that more closely mimic reality. The current study rests on a few assumptions that may test the capabilities of the decision-maker in planning the essential strategies, since there is always the possibility of a certain deviation.
For better clarity, the assumptions are listed below.
Assumption-1: For the district-wise analysis, the dataset used for forecasting was taken from various sources. Due to the sudden mass outbreak of this pandemic, there are discrepancies in the data across sources.
Assumption-2: The time series analysis of the total number of daily cases starts from the date when the first infection was registered in the state. Days on which no case appeared in a district were assigned zero values, and the same was done for recoveries and deaths in all districts.
Assumption-3: For missing district data, a value proportionate to the nearest hotspot in the state is taken. In forecasting, a daily increase in the number of cases of less than 0.4 is treated as zero, and a daily increase greater than 0.4 is treated as one.
Assumption-4: District-wise deceased and recovered cases are based on the new cases that arose 15 days earlier.
In this study, three machine learning-based algorithms are used to observe the transmission pattern of COVID-19 by changing the learning coefficient from 0.1 to 0.001. Based on the data available as of 10th June 2020, the daily confirmed, recovered, and deceased cases of COVID-19 in India are forecasted for the next two months using three machine learning techniques: GPR, SVM, and DT. Figure 8, Figure 9, and Figure 10 show the forecast of daily new confirmed cases for learning coefficient values of 0.001, 0.01, and 0.1, respectively. The GPR prediction assumes that the lockdown will remain active, and confirmed cases are expected to start declining from the first week of August.
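Assumptions 3 and 4 above amount to two small data-handling rules, sketched here for illustration. The function names and the `recovered_share` parameter are hypothetical. The sketch also assumes that a forecast increase of exactly 0.4 is treated as one, that increases of one or more are left unchanged, and that every confirmed case resolves as either recovered or deceased after the lag. The study does not state how it handles these details.

```python
def snap_small_increase(value, cutoff=0.4):
    """Map a fractional forecast of daily new cases to 0 or 1 (Assumption-3).

    Values of 1 or more are returned unchanged (an assumption of this sketch).
    """
    if value < cutoff:
        return 0.0
    if value < 1.0:
        return 1.0
    return value


def lagged_outcomes(new_cases, recovered_share, lag_days=15):
    """Derive daily recovered and deceased counts from new cases lag_days earlier.

    recovered_share is a hypothetical fraction of cases that recover; the
    remainder is counted as deceased (Assumption-4).
    """
    recovered, deceased = [], []
    for day in range(len(new_cases)):
        source = new_cases[day - lag_days] if day >= lag_days else 0.0
        recovered.append(recovered_share * source)
        deceased.append((1.0 - recovered_share) * source)
    return recovered, deceased
```

For example, 100 new cases on day 0 with `recovered_share=0.9` would appear as roughly 90 recoveries and 10 deaths on day 15, and as nothing on earlier days.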
It can be seen clearly from the curve that at the end of July, the daily new confirmed cases will start declining if conditions in the country remain the same. Figure 11, Figure 12, and Figure 13 show the predictions for recovered cases, and Figure 14, Figure 15, and Figure 16 show the predictions for deceased cases, based on the assumption that today's confirmed cases will become either recovered or deceased cases after around 15 days. The results are validated with existing data. However, the percentages of recovered and deceased cases may differ from state to state. These predictions are made under existing conditions, and the outcome can be improved by taking a few preventive steps. We intend to further improve our model by collecting more data in the upcoming days. Hence, it is proposed to update the site weekly. The projected statistics are distributed among the districts. Although this exercise is subject to deviation, it gives a holistic picture and helps decision-makers foresee the intensity of COVID-19 transmission at the district level. The mapping helps identify surrounding districts where further precautions should be taken. The distributed confirmed, recovered, and deceased cases are shown in Figure 17, Figure 18, and Figure 19, respectively. The forecasted attributes in turn help in estimating the cumulative score of a region, which is determined using TOPSIS, a multi-criteria decision-making technique. The obtained cumulative score is introduced as the criticality index. Based on the distribution of the criticality index, the regions can be classified into clusters of low risk, moderate risk, and high risk. Lockdown should be imposed in areas of high risk, whereas red zones can be identified in regions of moderate risk, with restrictions on movement imposed there. Lockdown in regions of low risk can be released with some precautionary measures.
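A TOPSIS-style score of the kind described above can be sketched as follows. This is a generic illustration, not the authors' exact procedure. It uses the three criteria the paper lists for the index (confirmed cases, hospitals-to-population ratio, and population density), and it assumes vector normalization, caller-supplied weights, and a score defined as relative closeness to the most critical combination of criteria. The function and variable names and the sample district values in the usage note are hypothetical.

```python
import math


def topsis_criticality(rows, weights, raises_risk):
    """Return one score in [0, 1] per row; 1 means closest to the most critical case.

    rows: one list of criterion values per district.
    weights: one weight per criterion (assumed, not taken from the paper).
    raises_risk: True where a larger criterion value means higher criticality.
    """
    n_crit = len(weights)
    # Vector-normalize each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in rows)) or 1.0 for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in rows]
    cols = list(zip(*weighted))
    most_critical = [max(cols[j]) if raises_risk[j] else min(cols[j]) for j in range(n_crit)]
    least_critical = [min(cols[j]) if raises_risk[j] else max(cols[j]) for j in range(n_crit)]

    def distance(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    scores = []
    for row in weighted:
        d_most = distance(row, most_critical)
        d_least = distance(row, least_critical)
        total = d_most + d_least
        scores.append(d_least / total if total > 0 else 0.0)
    return scores
```

With three hypothetical districts given as `[confirmed cases, hospitals per population, population density]` and `raises_risk=[True, False, True]`, a district that is worst on every criterion scores 1, one that is best on every criterion scores 0, and the rest fall in between. Thresholds for the low, moderate, and high risk clusters would still have to be chosen separately.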
The probable variation of the criticality index for the next four weeks is mapped using the three different techniques, with the variation obtained from TOPSIS applied to the data forecasted using DT, SVM, and GPR. The criticality of COVID-19 in a region is evaluated using the Criticality Index (CI). Three crucial parameters are considered in the CI: intensity, preparedness, and the probability of transmission. The intensity of COVID-19 in a region is evaluated using the total number of confirmed cases, preparedness is evaluated using the hospitals-to-population ratio, and population density is taken as a measure of the probability of transmission. The attributes of these parameters are collected for each region, and the cumulative risk associated with each district is evaluated. The probable variation of the criticality index for the next two months from 10th June is mapped using the GPR method, and the resulting criticality index map is shown in Figure 20. From the evaluated criticality index, it is apparent that the intensity would increase in districts located in Rajasthan, Andhra Pradesh, Maharashtra, Bihar, Delhi, Punjab, and Chhattisgarh. Keeping in view the projected scenario, the current duration of lockdown, and the degrading economic activity, a lockdown opening strategy is proposed, as shown in Table 2. The findings of the study are further used to identify the total number of days after which the curve flattens.
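One simple way to locate the day after which a cumulative curve starts to flatten is to find where the daily increase peaks, since that is the inflection point of the cumulative series. The sketch below makes that assumption. The paper does not spell out its detection rule, so the function name and the rule itself are illustrative only.

```python
def inflection_day(cumulative):
    """Return the day index at which the daily increase peaks, or None.

    cumulative: day-wise cumulative case counts (observed or projected).
    None is returned when the largest increase falls on the last day, i.e. the
    curve is still accelerating within the projection window.
    """
    daily = [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]
    if not daily:
        return None
    peak = max(range(len(daily)), key=lambda t: daily[t])
    if peak == len(daily) - 1:
        return None
    return peak + 1  # daily[t] is the increase recorded on day t + 1
```

Applied to each district's projected series, the returned day could be used to sort districts into release groups, with the earliest inflection first. Real projections are noisy, so smoothing the daily increases before taking the peak may be needed.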
With the day-wise projected data set, the inflection points are identified, and the districts are grouped into five different groups, as shown in Figure 22. The lockdown can be released sequentially as per the groups. and other viral infections [57; 58]. According to Kucharski et al. [59], the isolation of cases and contact tracing could be less successful for COVID-19 because infectivity begins before symptoms appear [60]. According to Hellewell et al. [61], adequate contact tracing and case isolation are sufficient to contain a new COVID-19 outbreak within three months, although the likelihood of control declines with lengthy delays between symptom onset and isolation and with increased transmission before symptoms appear. In the context of COVID-19 outbreaks, it is critical to understand the factors that influence the transmission dynamics of the disease in order to develop strategies for halting or slowing its spread and to support health policy through fiscal, social, and environmental interventions. To demonstrate the performance of the proposed method, comparison experiments are performed against various benchmark methods. Table 3 shows that the proposed method outperforms the other methods in prediction, with an accuracy of 95%. This study considered the data on daily confirmed, deceased, and recovered cases of COVID-19 in India. Besides evaluating the present data, the cases are projected using machine learning techniques to forecast possible future scenarios of COVID-19 in India. Additionally, this study proposed a criticality index that can help quantify the risk in a region, which is further used to classify the regions into zones of high risk, moderate risk, and low risk. Developing maps from the updated data and mapping the risk for the following weeks by integrating machine learning tools and GIS would certainly help to combat COVID-19 transmission.
The critical contribution of the work lies in supporting preparedness and mitigation of the disaster with at least some projection based on scientific methods. Even if the predictions are not very accurate, they provide a systematic plan to combat an unknown enemy. It should be noted that the findings of this study hold under the condition that the lockdown continues. In future work, a new methodology will be used to forecast the COVID-19 scenario after the complete lockdown is lifted. None. WHO Virtual press conference on COVID-19 Infectious diseases of humans: dynamics and control Mathematical epidemiology of infectious diseases: model building, analysis and interpretation The mathematics of infectious diseases Mathematical models in population biology and epidemiology A contribution to the mathematical theory of epidemics Gaussian process regression method for classification for high-dimensional data with limited samples Gaussian processes for machine learning Preparedness and Mitigation by projecting the risk against COVID-19 transmission using Machine Learning Techniques Forecasting the trend in cases of Ebola virus disease in west African countries using auto regressive integrated moving average models Using phenomenological models to characterize transmissibility and forecast patterns and final burden of Zika epidemics Estimation of the final size of the COVID-19 epidemic Prediction of number of cases expected and estimation of the final size of coronavirus epidemic in India using the logistic model and genetic algorithm Forecasting of covid-19 confirmed cases in different countries with arima models Artificial intelligence forecasting of covid-19 in china COVID-19: Mathematical Modeling and Predictions.
ResearchGate, DOI: DOI Modeling and Predictions for COVID 19 Forecasting Covid-19 Preparedness and Mitigation by projecting the risk against COVID-19 transmission using Machine Learning Techniques Pandemic Prediction for Hungary; a Hybrid Machine Learning Approach. A Hybrid Machine Learning Approach Fitting SIR model to COVID-19 pandemic data and comparative forecasting with machine learning Outbreak trends of CoronaVirus (COVID-19) in India: A Prediction First-principles machine learning modelling of COVID-19 A machine learning methodology for forecasting of the COVID-19 cases in India Analysis of COVID-19 spread in South Korea using the SIR model with time-dependent parameters and deep learning Simulation of Covid-19 epidemic evolution: are compartmental models really predictive COVID-19 Epidemic Analysis using Machine Learning and Deep Learning Algorithms Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions A machine learning methodology for real-time forecasting of the 2019-2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models Prediction of COVID-19 Disease Progression in India: Under the Effect of National Lockdown Analysis of the COVID-19 pandemic by SIR model and machine learning technics for forecasting Forecasting the dynamics of COVID-19 Pandemic in Top Using country-level variables to classify countries according to the number of confirmed COVID-19 cases: An unsupervised machine learning approach Domain decomposition approach for fast Gaussian process regression of large spatial data sets Sparse Gaussian processes using pseudoinputs Bagging for Gaussian process regression Gplp: a local and parallel computation toolbox for gaussian process regression A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions Structure discovery in nonparametric regression through compositional kernel search Quadratic kernel-free non-linear 
support vector machine LS-SVMlab: a matlab/c toolbox for least squares support vector machines A survey of decision tree classifier methodology Computational statistics handbook with MATLAB Arima and nar based prediction model for time series analysis of covid-19 cases in india Using supervised machine learning and empirical bayesian kriging to reveal correlates and patterns of covid-19 disease outbreak in sub-saharan africa: Exploratory data analysis European Centre for Disease Prevention and Control; 2020. Public Health Management of Persons Having Had Contact With Novel Coronavirus Cases in the European Union Effectiveness of airport screening at detecting travellers infected with novel coronavirus (2019-nCoV) Impact of international travel and border control measures on the global spread of the novel 2019 coronavirus outbreak Modeling and public health emergency responses: lessons from SARS Implementation and management of contact tracing for Ebola virus disease Contact tracing performance during the Ebola epidemic in Liberia Public Health England MERS-CoV close contact algorithm. Public health investigation and management of close contacts of Middle East Respiratory Coronavirus (MERS-CoV) cases Contact tracing for imported case of Middle East respiratory syndrome Active contact tracing beyond the household in multidrug resistant tuberculosis in Vietnam: a cohort study Potential scenarios for the progression of a COVID-19 epidemic in the European Union and the European Economic Area Evaluation of the benefits and risks of introducing Ebola community care centers Comparing nonpharmaceutical interventions for containing emerging epidemics Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts The authors have declared no competing interest.