key: cord-0745476-n94yz8t2
authors: Bajaj, N. S.; Pardeshi, S. S.; Patange, A. D.; Kotecha, D.; Mate, K. K.
title: Statistical analysis of national & municipal corporation level database of COVID-19 cases In India
date: 2020-07-21
journal: nan
DOI: 10.1101/2020.07.18.20156794
sha: 523f4e0f273542c4c99020532432d765df754afe
doc_id: 745476
cord_uid: n94yz8t2

Since its origin in December 2019, the Novel Coronavirus disease, or COVID-19, has caused massive panic in the world by infecting millions of people with a varying fatality rate. The main objective of Governments worldwide is to control the extent of the outbreak until a vaccine or cure has been devised. Machine learning has been an efficient mechanism to train, map, analyze, and predict datasets. This paper aims to utilize regression, a supervised machine learning algorithm, to assess time-series datasets of the COVID-19 pandemic by performing a comparative analysis on datasets of India and two Municipal Corporations of Maharashtra, namely, Mira-Bhayander and Akola. This study's contribution is an attempt to draw attention to the dynamics of the pandemic in a controlled locality such as a Municipal Corporation. The results of the study show that the growth of COVID-19 cases is exponential when considered nationally; for a limited area, however, the curve is observed to be cubic for total cases and multi-peak Gaussian for active cases. In conclusion, the Government should empower districts/corporations to adopt their own methodology and decision-making policy to contain the pandemic at the regional level, as in the case of Dharavi.

The novel coronavirus showed similar etiology to the Severe Acute Respiratory Syndrome (SARS) virus of 2002-2003. Hence, the virus is known as SARS-CoV-2 and the disease it causes as COVID-19 (WHO, 2020) [3]. It is a contagious disease that spreads via minute respiratory droplets through coughing, sneezing, or close contact.

Patients can experience symptoms after an incubation period of 2 to 14 days, such as high fever, body pain or weakness, dry cough, breathlessness, pneumonia, kidney failure, and respiratory distress. In some cases, the patients do not show any symptoms of the virus; that is, they are asymptomatic. In both scenarios, testing of the patient is essential for verification, and they should be isolated (He et al., 2020; Singhal et al., 2020) [4] [5]. Thus, to curb the rate of infection and prevent community spread, wearing masks, avoiding crowded spaces, and maintaining the protocols of social distancing are essential (WHO, 2020) [3].

In India, the first confirmed case of COVID-19 was of a person with a travel history from Wuhan (WHO, 2020) [3]. Although the progression of cases was slow initially, the Indian government imposed a lockdown on 25th March to prevent human-to-human transmission and contain the pace of the virus. Despite all the protective measures, the growth of the virus has continued at an exponential rate in the country (Figure 1). Thus, modeling the current situation, analyzing the expansion pattern, identifying the reason for such behavior, and using it to forecast the next trend of the virus will help us develop coping mechanisms.

In Section 1, the current scenario has been explained along with the background and requirement of the proposed methodology. Section 2 details the importance of regression and the theory behind it. Details and motivation regarding the case study have been explained in Section 3. Section 4 consists of graphs and results after analysis. Section 5 details the comparative discussions of the obtained results followed by the conclusion in Section 6.

Machine learning algorithms are used in all walks of life, including the business sector, finance, medical, engineering, and educational domains. Machine learning is exceptionally efficient in the prediction of healthcare data (Mai et al., 2017, 2; Purcaro et al., 2018, 3; Ye et al., 2003) [6] [7] [8]. From helping to understand influenza pandemic dynamics to analyzing data for predicting the growth rate of COVID-19, its role is unmatched. As the cases of COVID-19 continue to rise at a frightening rate, panic continues to spread among the people and their governments. Ceased production, a reduction in goods and services, a collapsing economy, and an increase in fatalities create a situation of mass hysteria. In such scenarios, machine learning can help countries make the right decisions and avoid loss of human lives and damage to the economy. Designing an accurate outbreak model will provide essential insights for new policy-making and will also help evaluate the results of the policies (Remuzzi et al., 2020) [9].

Several mathematical and statistical models have been utilized by researchers to study the behavior of the virus, people's response, and predictions. Yadav (2020) [10] applied six regression-based models, namely quadratic, cubic (third-degree), fourth-, fifth-, and sixth-degree polynomial, and exponential, to the dataset of India. It provided a functional analysis of [...]

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted July 21, 2020.

[...] [12]. The model also underestimates the actual observations but can be updated to remove that issue, as suggested in the paper. It also predicted a slowdown in cases in the coming days, with a reduction in daily cases when accompanied by proper guidelines and measures.

Building on the recent conceptualization of detecting connective communities in time series, Demertzis et al. (2020) developed a novel spline regression model that determines the knots using community detection in a complex network for the Greece dataset (Demertzis et al., 2020, 5) [13].

Owing to diversity, differences in geography, and a large population, the study would not be directly [...] [17]. A comparison of various models revealed that MLP and ANFIS showed the best results and can be employed for long-term predictions.

As can be seen from the above, various types of methods have been used for analysis. However, the analysis has largely been limited to a national approach, with limited study of state-level differences, and barely any study has undertaken a district-level assessment. As differences between state-level and national-level dynamics have been demonstrated previously, this paper aims to explore Municipal-level differences by mapping and evaluating datasets of Mira-Bhayandar and Akola.

Regression is a statistical tool used in machine learning. It is a supervised technique in which an algorithm learns the mapping from input data to output data. Its objective is to approximate the mapping function as accurately as possible from the training dataset. Machine learning makes it possible to analyze data, identify patterns, and make corrections, with human help where needed.

These tools and algorithms are used by regression for mapping and prediction. Regression models are employed to study the relationship between single or multiple independent variable(s) (X) and a dependent variable (Y) involving unknown parameter(s) (β) (Sarstedt et al., 2014, 3) [18].

The purpose of such an analysis is to find the β that best fits the model. A benefit of regression analysis is that it can quantify the relative strength of the effects of the independent variables. Further analysis can also predict the value of y for a given x from a sample of size n. It includes modeling and analysis of variables to determine the most fitting function that can explain their correlation.

The term e represents the error or residual. It is the distance between every observed value and its corresponding predicted value. ŷ (y-hat) represents the prediction made by the best regression line. An assumption during this process is that e follows a normal distribution $N(0, \sigma^2)$ (Rasmussen et al., 2006, 1) [19]. Therefore, e is calculated by:

$$e = y - \hat{y}$$

β is calculated using the least-squares regression method, which minimizes the sum of the squared residuals e to obtain the best-fit curve. Another important quantity is the coefficient of determination, or $R^2$, which indicates how accurately the proposed model describes the change in the dependent variable. It is represented by the formula:

$$R^2 = \frac{SSR}{SST}$$

where SSR (regression sum of squares) is the sum of the squared differences between the regression line and the average value, and SST (total sum of squares) is the sum of the squared differences between each observation and the average value:

$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

where the $Y_i$ are the observed values corresponding to each X, $\hat{Y}_i$ is the estimated (predicted) value of the response variable, $\bar{Y}$ is the average of the $Y_i$, and n is the total number of observations. The value of $R^2$ always lies between 0 and 1. For $R^2 = 1$ the model has perfect determination, and a value closer to 1 indicates a better model.
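As an illustration of the quantities defined above, the following minimal sketch fits a straight line by ordinary least squares and then computes the residuals, SSR, SST, and R² as the ratio SSR/SST. The function names (`fit_line`, `r_squared`) and the toy data are hypothetical and are not taken from the study, which used SciDAVis rather than hand-written code.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def r_squared(observed, predicted):
    """Return (SSR, SST, R^2) for paired observed and predicted values."""
    mean_y = sum(observed) / len(observed)
    # SSR: squared distance of each prediction from the mean observation.
    ssr = sum((p - mean_y) ** 2 for p in predicted)
    # SST: squared distance of each observation from that same mean.
    sst = sum((y - mean_y) ** 2 for y in observed)
    return ssr, sst, ssr / sst


# Hypothetical toy data, not taken from the study.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]  # e = y - y_hat
ssr, sst, r2 = r_squared(ys, predicted)
print(f"b0={b0:.3f}, b1={b1:.3f}, R^2={r2:.4f}")
```

For a least-squares fit that includes an intercept, the residuals sum to zero and SSR/SST coincides with the more familiar 1 - SSE/SST, which is why R² stays between 0 and 1 in this setting.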

Fitting curves used for modeling include simple linear, multiple linear, polynomials of varying orders such as quadratic and cubic, exponential, Gaussian, and logarithmic. The analysis performed in this study employed cubic, exponential, and multi-peak Gaussian curves.

In cubic regression, the best-fit line is a curve rather than a straight line. It is much more flexible and can accommodate data with complex relationships. It is useful when the plotted data first curve one way and then another. It is considered a special case of the multiple linear regression model. The function that represents the graph is:

$$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3$$

The following system of equations helps to find the values of the coefficients:

$$a_3 \sum x^3 + a_2 \sum x^2 + a_1 \sum x + a_0 n = \sum y \quad (6)$$

$$a_3 \sum x^4 + a_2 \sum x^3 + a_1 \sum x^2 + a_0 \sum x = \sum xy \quad (7)$$

$$a_3 \sum x^5 + a_2 \sum x^4 + a_1 \sum x^3 + a_0 \sum x^2 = \sum x^2 y \quad (8)$$

$$a_3 \sum x^6 + a_2 \sum x^5 + a_1 \sum x^4 + a_0 \sum x^3 = \sum x^3 y \quad (9)$$

For comparing the efficiency, the model considers the values of the standard error of regression, the correlation coefficient (R), and the coefficient of determination ($R^2$).

This method is used for the district-level (Municipal Corporation) models in the study, where it represents the total cases, the total number of deceased, and the total recoveries.
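The least-squares system for a cubic can be sketched in a few lines of plain Python. The example below builds the four normal equations from sums of powers of x and solves them by Gaussian elimination. The coefficient names follow the form Y = a0 + a1*x + a2*x^2 + a3*x^3 used with the regional results; the helper names (`solve_linear_system`, `fit_cubic`) and the synthetic data are assumptions made for illustration, not the SciDAVis procedure used in the study.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * coeffs = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augment the matrix so row operations act on both sides at once.
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution on the upper-triangular system.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (aug[row][n] - tail) / aug[row][row]
    return coeffs


def fit_cubic(xs, ys):
    """Return [a0, a1, a2, a3] for y = a0 + a1*x + a2*x^2 + a3*x^3."""
    # power_sums[k] = sum of x^k for k = 0..6 (power_sums[0] equals n).
    power_sums = [sum(x ** k for x in xs) for k in range(7)]
    # moment_sums[k] = sum of x^k * y for k = 0..3.
    moment_sums = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(4)]
    # Row i of the normal equations: sum_j a_j * sum(x^(i+j)) = sum(x^i * y).
    matrix = [[power_sums[i + j] for j in range(4)] for i in range(4)]
    return solve_linear_system(matrix, moment_sums)


# Synthetic data generated from a known cubic (hypothetical, not study data).
days = list(range(10))
cases = [5 + 2 * d + 0.5 * d ** 2 + 0.1 * d ** 3 for d in days]

a0, a1, a2, a3 = fit_cubic(days, cases)
print(a0, a1, a2, a3)
```

Solving the normal equations directly is easy to follow but can become ill-conditioned for long time series with large day indices; rescaling x or using a QR-based polynomial fit is a common alternative in that case.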

Exponential regression is one of the simplest types of non-linear regression. When the data vary non-linearly in this manner, this method gives the best-fitting curve.

Coefficients 'a' and 'b' can be calculated with the help of:

For the study, the country's statistics in terms of total cases, recoveries, and deaths follow an exponential trend.
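A minimal sketch of how two coefficients a and b of an exponential trend can be estimated is given below. It assumes the common form y = a * exp(b * x), which is an assumption made here for illustration, and fits it by taking logarithms so that the problem becomes a straight-line least-squares fit of ln(y) against x. The function name `fit_exponential` and the synthetic case counts are hypothetical.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on ln(y); every y must be > 0."""
    n = len(xs)
    log_ys = [math.log(y) for y in ys]
    sum_x = sum(xs)
    sum_ly = sum(log_ys)
    sum_xx = sum(x * x for x in xs)
    sum_xly = sum(x * ly for x, ly in zip(xs, log_ys))
    # Slope and intercept of the line ln(y) = ln(a) + b * x.
    b = (n * sum_xly - sum_x * sum_ly) / (n * sum_xx - sum_x ** 2)
    a = math.exp((sum_ly - b * sum_x) / n)
    return a, b


# Synthetic cumulative counts growing at 5% per day (hypothetical data).
days = list(range(1, 11))
cases = [100 * math.exp(0.05 * d) for d in days]

a, b = fit_exponential(days, cases)
doubling_time = math.log(2) / b  # days needed for the fitted curve to double
print(a, b, doubling_time)
```

The log transform minimizes the error in ln(y) rather than in y itself, so large counts are weighted less than in a direct non-linear fit; a dedicated non-linear solver may return slightly different coefficients on noisy data.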

Similar to cubic regression, the standard error of regression, the correlation coefficient (R), and the coefficient of determination ($R^2$), computed with the same formulae, help to assess the efficiency of the model used. If the correlation coefficient is 0, there is no linear relation between X and Y; for a negative value they are inversely related; and for a value of 1 they are strongly related. Both methods follow a similar path while solving the problem: each compares the tabulated data with empirical formulae and finds the closest match, using the least-squares method to obtain the best fit, for which the residual is minimum.

Multi-peak Gaussian fitting is especially useful as it improves the fit by using a mathematical algorithm and numerical calculation to find Gaussian sub-peaks in the data. The sub-peaks are then combined, which results in a waveform that consists of multiple peaks (Qing et al., 2013) [20]. When the existing methods are unfit to describe the data, or when the process involves small datasets, a Gaussian process approach can provide uncertainty measures on the predictions.
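The idea of combining Gaussian sub-peaks into one multi-peak waveform can be sketched as follows. The parametrization (height, centre, and a width used as the standard deviation) is an assumption chosen for simplicity and may differ from the width convention of a particular fitting tool; the peak values are hypothetical and are not the fitted values reported in the study. The last helper is the least-squares objective that a fitting routine would minimise over the peak parameters.

```python
import math


def gaussian_peak(x, height, centre, width):
    """Single Gaussian peak; 'width' is treated as the standard deviation here."""
    return height * math.exp(-((x - centre) ** 2) / (2 * width ** 2))


def multi_peak(x, peaks, baseline=0.0):
    """Sum of Gaussian sub-peaks plus a constant baseline."""
    return baseline + sum(gaussian_peak(x, *p) for p in peaks)


def peak_area(height, width):
    """Area under one peak for this parametrization: height * width * sqrt(2*pi)."""
    return height * width * math.sqrt(2 * math.pi)


def sum_squared_residuals(xs, ys, peaks):
    """Least-squares objective: squared distance between data and the model."""
    return sum((y - multi_peak(x, peaks)) ** 2 for x, y in zip(xs, ys))


# Hypothetical three-peak curve given as (height, centre, width) tuples.
peaks = [(40.0, 20.0, 4.0), (90.0, 45.0, 5.0), (60.0, 65.0, 3.0)]
days = list(range(0, 81))
active = [multi_peak(d, peaks) for d in days]

highest_day = days[active.index(max(active))]
print(highest_day, max(active))
```

In practice the peak parameters are unknown and are found by a non-linear least-squares solver started from rough guesses of the peak centres; the sketch only shows the model and the objective being minimised.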

In this approach, first, a Gaussian process prior is assumed: [...] inputs in space that corresponds to the similarity of outputs. It has two hyper-parameters, signal variance ($\sigma^2$) and length scale ($l$).

$$k(x, x') = \sigma^2 \exp\left(-\frac{1}{2 l^2} \lVert x - x' \rVert^2\right) \quad (14)$$
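The squared-exponential kernel with signal variance σ² and length scale l can be written out directly. The sketch below evaluates the kernel for a pair of inputs and builds the covariance matrix for a small set of points; the function names (`rbf_kernel`, `covariance_matrix`) and the hyper-parameter values are hypothetical choices for illustration only.

```python
import math


def rbf_kernel(x, x_prime, signal_variance=1.0, length_scale=1.0):
    """Squared-exponential kernel k(x, x') for scalar or vector inputs."""
    if isinstance(x, (int, float)):
        x, x_prime = [x], [x_prime]
    # Squared Euclidean distance between the two inputs.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return signal_variance * math.exp(-sq_dist / (2 * length_scale ** 2))


def covariance_matrix(points, signal_variance=1.0, length_scale=1.0):
    """Matrix of kernel values for every pair of input points."""
    return [
        [rbf_kernel(p, q, signal_variance, length_scale) for q in points]
        for p in points
    ]


# Hypothetical inputs (day indices) and hyper-parameters.
days = [0.0, 1.0, 5.0]
K = covariance_matrix(days, signal_variance=2.0, length_scale=1.5)
for row in K:
    print(row)
```

Nearby inputs receive a covariance close to the signal variance, while distant inputs receive a covariance close to zero; the length scale controls how quickly that similarity decays.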

The epidemiological statistics of India were obtained from the official COVID-19 website of India (Coronavirus India) [21]. India lies in South Asia, with a population close to 1.380 billion. Maharashtra has been one of the most affected states during the pandemic, with approximately 217K confirmed cases as of 8 July. SciDAVis (Scientific Data Analysis and Visualization), a cross-platform program for graphical presentation of datasets and data analysis, was used during the study's analysis phase. It supports linear, non-linear, and multi-peak functions and has built-in operations for column/row statistics, convolutions, and filtering, along with a user-friendly interface. Since the data needed non-linear analysis such as exponential, cubic, and multi-peak Gaussian fitting, this software was utilized for the analysis.

Results of the analysis performed in SciDAVis using time-series datasets of India, Kerala, and the Municipal Corporations of Mira-Bhayander and Akola are presented in graphical and tabulated form in this section. They form the basis of the discussion in Section 6.

The $R^2$ represents the accuracy of fit, mentioned in Table 1 [...] May. It has an exponential growth trend. Figure 3 represents the cumulative plots of total deceased, cases, and discharged for Mira-Bhayander between 27th March and 2nd June, whereas Figure 4 represents the cumulative graph for Akola for the timeframe 7th April to 5th June. Dotted lines are the observed dataset for both the graphs, while the blue line is the cubic fit obtained. Unlike the exponential trend observed in Figure 1, both regional plots follow a cubic regression curve. These curves can be represented by the equation $Y = a_0 + a_1 x + a_2 x^2 + a_3 x^3$. [...] Table 2 along with values of coefficients.

The active-case datasets of Mira-Bhayandar and Akola are depicted in Figures 5 and 6, respectively. Both graphs follow a 3-peak Gaussian curve, unlike the plot of active cases of India in Figure 1, which resembles an exponential plot. Table 3 tabulates the characteristics of each peak in Figures 5 and 6, namely the values of area, centre, width, and height, along with the total accuracy of the fit.

A comparative discussion on the significance of these spikes is given in Section 5. The lockdown that was implemented earlier is a form of quarantine, and since data on the number of people in quarantine were available for Kerala, these statistics were compared with the case study of Mira-Bhayander and Akola. Figure 9 is the plot of active cases in Kerala between 30th January and 4th June. The total number of people in quarantine on each date in the state is shown in Figure 10.

The plots of total active cases and of the number of people in quarantine follow a 3-peak Gaussian curve similar to the regional active cases. Kerala was one of the earliest states in India to be affected by the pandemic; the whole state underwent quarantine implementation. The plots obtained after analysis are also similar to the regional curves rather than the country plot. Hence, its results can be easily compared with the controlled containment zones. The values of area, centre, width, and height of each peak observed in both figures, along with their overall accuracy of fit, are tabulated in Table 4.

6. Inferences and discussion:

Comparison of Figure 1 of the cumulative country plot with the plots of Mira-Bhayander and Akola from Figures 3 and 4 shows the difference in the curve obtained i.e. while the country plot follows an exponential path, regional curve behaves in a cubic manner. It can be attributed to the reason that the outbreak's statistics behave differently at a lower level than at the nation-level.

When all such statistics are combined, the resultant curve should resemble an exponential pattern.

Thus, the dynamics of COVID-19 at a regional-level change drastically and should be noted while implementing policies.

The daily onset of cases in the regions of Mira-Bhayandar and Akola in Figures 7 and 8 shows four notable spikes in the bar plots of the outbreak. The dates of these spikes, together with the lockdown and unlock dates for comparison, are given in Table 5. Table 5 shows that the spikes are either an immediate result of, or due to the anticipation of, a lockdown/unlock. This feature can only be seen in a regional plot, unlike the national plot. Thus, it can be taken as a reference for future policies.

The active-case plots of Mira-Bhayandar and Akola in Figures 5 and 6, respectively, mimic a Gaussian curve with three peaks. This can be attributed to the fact that when the number of cases in a region increases, the number of people in quarantine increases proportionally. However, as the number of cases starts to fall, the quarantine number also decreases. As a result, more people get infected, giving rise to active cases again.

A prime example of this feature is the graphs of Kerala in Figures 9 and 10, depicting active and quarantine cases, which follow this hypothesis. Therefore, it is crucial to maintain high rates of quarantine, irrespective of the number of active cases, to avoid further outbreaks. Quarantine regulations should be relaxed only when the total active cases start to dwindle.

India had previously been making decisions at the national level. However, after the commencement of lockdown four, a state-based approach was followed where each state devised its own policies and regulations. The state implemented zone-based policies depending upon the extent of contamination. While this paper was being drafted, the Maharashtra State Government implemented another policy that empowers districts/Municipal Corporations to take their own decisions, showing a shift in trend from a macro-level to a micro-level policy-making approach. It supports the proposed problem statement, and it can help them model their regional datasets for analysis. It can also be utilized to forecast further developments in response to decisions.

In this paper, the nature of growth of COVID-19 total cases, deceased, recovered, and active cases was analyzed using regional datasets from two Municipal Corporations. The different curves obtained are an indicator of the pandemic's distinct features at the local or zonal level. The trend in the active-case plot resulted from changes in quarantine regulations based on the number of cases. This response is validated by the active and quarantine case plots of Kerala, which show that both factors are co-dependent. Similarly, spikes in daily cases of the regional plots are attributable to human psychology combined with lockdown outcomes. This study suggests that more districts and states in every country should be given the authorization to make decisions regarding policies and lockdown regulations depending on the outbreak's nature in particular regions.

Rules of quarantine should be adhered to strictly until the total active cases fall to a minimum value for a continuous period. This would also help reduce the spikes in daily cases observed around significant days.

Funding: Not applicable.

There was no conflict of interest.

Pneumonia of Unknown Cause -China

WHO. 2020. Coronavirus Disease

The clinical feature of silent infections of novel coronavirus infection (COVID-19) in Wenzhou

Modeling and Forecasting for Covid-19 growth curve in India. medRxiv

Controlling testing volume for respiratory viruses using machine learning and text mining, AMIA

Volatile fingerprinting of human respiratory viruses from cell culture

Predicting hepatitis B virus-positive metastatic hepatocellular carcinomas using gene expression profiling and supervised machine learning

COVID-19 and Italy: what next? The Lancet

Data analysis of COVID-2019 epidemic using machine learning methods: a case study of India

Analysis on Novel Coronavirus (COVID-19) Using Machine Learning Methods. Chaos

Modeling and Forecasting for Covid-19 growth curve in India. medRxiv

Modeling and forecasting the COVID-19 temporal spread in Greece: an exploratory approach based on complex network defined splines

Regression Analysis of COVID-19 Spread in India and its Different States, medRxiv

SEIR and Regression Model based COVID-19 outbreak predictions in India, medRxiv

Outbreak Data Analysis and Prediction

COVID-19 Outbreak Prediction with Machine Learning, medRxiv

Regression Analysis, Springer Texts in Business and Economics

Gaussian Process for Machine Learning

Study of the combustion mechanism of oil shale semi-coke with rice straw based on Gaussian multi-peak fitting and peak-to-peak methods

Coronavirus in India: Latest Map and Case Count

Municipal Corporation

Kerala: COVID-19 Battle
