key: cord-1029755-7daui4rh
authors: Ghosh, Shinjini
title: Predictive Model with Analysis of the Initial Spread of COVID-19 in India
date: 2020-08-25
journal: Int J Med Inform
DOI: 10.1016/j.ijmedinf.2020.104262
sha: 1f1b53cff365d97e3fcc5477552ba1c958974c34
doc_id: 1029755
cord_uid: 7daui4rh

OBJECTIVE: The Coronavirus Disease 2019 (COVID-19) has ravaged the world, resulting in over thirteen million confirmed cases and over five hundred thousand deaths, a complete change in daily life as we know it, worldwide lockdowns, travel restrictions, and heightened hygiene measures and physical distancing. Being able to analyse and predict the spread of this disease is hence of utmost importance, especially as it would inform decisions that drastically affect countries and their people and help ensure efficient resource and utility management. However, the needs of the people and the specific conditions of the spread vary widely from country to country. Hence, this article has two objectives: (i) to conduct an in-depth statistical analysis of COVID-19 affected patients in India, and (ii) to propose a mathematical model for predicting the spread of COVID-19 cases in India.

MATERIALS AND METHOD: There has been limited research in modeling and predicting the spread of COVID-19 in India, owing both to the ongoing nature of the pandemic and the limited availability of data. Widely used SIR models, and non-SIR models based on the Gauss error function and Monte Carlo simulation, do not perform well in the context of COVID-19 spread in India. We propose a 'change-factor' or 'rate-of-change' based mathematical model to predict the spread of the pandemic in India, with data drawn from hundreds of sources.

RESULTS: The average age of affected patients was found to be 38.54 years, with 66.76% males and 33.24% females. Most patients were in the age range of 18 to 40 years.
Optimal parameter values of the prediction model (α = 1.35, N = 3 and T = 10) were identified through extensive experiments. Over the entire period since the outbreak started in India, the model has been 90.36% accurate in predicting the total number of cases for the next day, correctly predicting the range on 150 of the 166 days examined.

CONCLUSION: The proposed system showed an accuracy of 90.36% for prediction since the first COVID-19 case in India, and 96.67% accuracy over the month of April. The predicted number of cases for the next day is found to be a function of the numbers over the last 3 days, but with an 'increase' factor influenced by the last 10 days. Males are affected more than females. In India, the number of infected people decreases steadily with each successive age bucket, with the youngest adults forming the largest infected group, a departure from the world trend. The model is self-correcting as it improves its predictions every day, by incorporating the previous day's data into the trendline for the following days. This model can thus be used dynamically not only to predict the spread of COVID-19 in India, but also to check the effect of various government measures in a short span of time after they are implemented.

The present article delves into more specifics of the spread of COVID-19 in India, conducting a statistical analysis of the affected patients, as well as proposing a new mathematical model to predict the spread of the disease in the country. Section 2 outlines how the data has been collected; Section 3 discusses the patient-level statistical analysis and compares it with the world trends; and Section 4 looks at the details of the modeling and prediction system. The results are then discussed and analysed thoroughly in Sections 5 and 6. Concluding remarks and areas for future work are presented in Section 7.

The data analysed in this paper includes the official counts as released by MoHFW, hundreds of news reports, the COVID-19 India API, as well as volunteer-collected de-identified open source data until the date of July 16, 2020.

In this section, we analyse various demographic and other factors of the COVID-19 affected patients in India, and compare and contrast those with the world trends. Age counts of 2339 patients were received, and the mean age was found to be 38.54 years, with a standard deviation of 17.22 years. (3.07%) deceased. Of the cases with an outcome, 87.74% had recovered, somewhat higher than the then global average of 81% (Worldometers) [5].
Of the affected patients whose status was tracked, an outcome was obtained in a mean time of 10.69 days, with great variance observed: the mean recovery time (time between hospitalization and official status of recovery) was found to be 14.47 days and the mean time to death (time between hospitalization and death) was found to be a mere 3.18 days. The recovery time is on par with CDC guidelines and WHO [1] statements that mention it may take the body up to 2 weeks to recover from the illness, and up to 6 weeks for severe or critical cases. However, the mean time for death is drastically lower than the WHO mentioned

Out of the 2339 patients whose age data was received, the distribution was observed as in Table 1. We see a sharp contrast in this distribution when compared with other countries / regions of the world. In Italy, about 71% of

A very important aspect of attempting to control this ongoing pandemic and save lives is to model the spread that has occurred so far, and to be able to use that to predict future spread and number of cases. There has been extensive ongoing research in this aspect, catering to the needs of various countries. In the rest of this article, we propose a model and prediction system that is developed very specifically with India in mind, and currently offers excellent predictions.

published in early April 2020 [11], we see the usage of the currently famous SIR Model, and a prediction that India will reach a final epidemic size of around 13,000 by the end of May 2020. Needless to say, this is a drastic difference from the current state of affairs, as we prepare to enter May 2020 with 35,403 cases, almost thrice the value predicted for the end of May. A different model [12] predicted that 54 days of total lockdown would yield around 5,000 cases. On the 40th day of lockdown, there were seven times as many cases, and currently almost 200 times as many, so this model is not offering good predictions either.
Yet another mathematical model [13] estimates that with lockdown from 25th March onwards (as happened), we would have just shy of 10,000 cases by end-April, which, once again, is one-third the actual total number of cases at the end of April 2020, and the number of cases by the end of April 25, 2020 was about 2.7 times the predicted value. In the present article, we propose a mathematical model with a different approach, which has been yielding consistent results in predicting the spread since the beginning of the outbreak in India. Moreover, while there exist data-driven approaches to look at the outbreak of COVID-19 [14], [15], there have

Based on these studies, we came up with Algorithm 2 for predicting the number of daily cases. Here, α is a factor we can tune as necessary.

Algorithm 2: Predict number of cases for the next day based on last N days' cases, and last T days' change factors, with list of daily cases L
trendline :: change factors for last T days

implies that the predicted number of cases for the next day is a function of that in the last 3 days, but with an 'increase' factor influenced by the last 10 days. This is consistent with the highly dynamic nature of the epidemic.

We notice sharp spikes and troughs in the number of daily cases, and this can be due to a variety of reasons, including staggered testing, staggered collection of results, holidays, one-time breakout events, mass gatherings, new testing development, new medical guidelines, and so on. For this reason, we also analyse the number of cases (confirmed, recovered and deceased) using a simple moving average of the data over a window of the last K days. We have experimented with various values of K and have settled on K = 3 as the most balanced point, wherein we do not dilute the effects of each day's changes much, but still account somewhat for day-to-day external changes in testing and other factors that are unrelated to the 'true' spread of the disease.
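The smoothing and change-factor ideas above can be sketched in Python. This is a minimal illustration and not the paper's Algorithm 2, whose body is not reproduced in the text here. The function names (`moving_average`, `change_factors`, `predict_next`), the rule of multiplying the mean of the last N days by the mean of the last T day-over-day ratios, and the use of α to widen the point estimate into the range [point / α, point × α] are all assumptions made for the example.

```python
from statistics import mean


def moving_average(values, k=3):
    """Simple moving average over a trailing window of up to k values."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - k + 1): i + 1]
        out.append(mean(window))
    return out


def change_factors(values):
    """Day-over-day ratios values[i] / values[i-1].

    Pairs whose earlier day is zero are skipped to avoid division by zero.
    """
    return [b / a for a, b in zip(values, values[1:]) if a > 0]


def predict_next(daily_cases, n=3, t=10, alpha=1.35):
    """Return (low, point, high) for the next day's cases.

    ASSUMPTION: this combination rule is hypothetical. It takes the mean of
    the last n days as a base, scales it by the mean of the last t change
    factors, and uses alpha only to set the width of the range.
    """
    if len(daily_cases) < 2:
        raise ValueError("need at least two days of data")
    base = mean(daily_cases[-n:])
    factors = change_factors(daily_cases)[-t:]
    trend = mean(factors) if factors else 1.0
    point = base * trend
    return point / alpha, point, point * alpha


if __name__ == "__main__":
    cases = [100, 120, 90, 150, 160, 140, 200, 210, 190, 250, 260, 240]
    smoothed = moving_average(cases, k=3)
    print(predict_next(cases))      # range from the raw series
    print(predict_next(smoothed))   # range from the K = 3 smoothed series
```

With constant daily counts the average change factor is 1, so the point estimate equals the recent mean and α only sets the width of the range. A value of α further from 1 widens the range, which is the accuracy versus precision tradeoff the text describes.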
With the window averaged list of K = 3 days, we use a prediction system

Because the 'true' spread of the epidemic is as hard to estimate as the number

We observe that the prediction range offered by the raw and moving averaged data achieves a high accuracy of 90.36% until the date of July 16, 2020. We further note that there is a tradeoff in selecting a value of α that is too high or too low: a value much further from 1 will spread out the range and thus have a greater chance of being 'correct', but at the same time a huge range prevents a model from being very precise, although accurate. Hence, we have kept the

The model is self-correcting as it improves its predictions every day, by incorporating the previous day's data into the trendline for the following days. This model can thus be used dynamically not only to predict the spread of COVID-19 in India, but also to check the effect of various government measures in a short span of time after they are implemented.

The prediction range bounded by the raw and moving averaged data ob-

Compared with other contemporary models: an article published in early April 2020 [11], using the currently famous SIR Model, predicted that India would reach a final epidemic size of around 13,000 by the end of May 2020, while the real figure was 190,609, almost 14 times as much as predicted. The proposed model had 86.67% accuracy by the same day, predicting 8,342 cases on May 30 (with the actual number being as close as 8,336). A different model [12] predicted that 54 days of total lockdown would yield around 5,000 cases. On the 40th day of lockdown, there were seven times as many cases.
Another data-driven mathematical model [13] estimated that with lockdown from 25th March onwards (as happened), there would be a little less than 10,000 cases by end-April, which, once again, is one-third the actual total number of cases at the end of April,

Figure 4 denotes the same data as Figure 3 does, but on a logarithmic scale. Given the exponential nature of this pandemic, we have decided to include this log-scale plot. A straight line on a logarithmic scale indicates an exponential rise in the absolute number of cases. Looking at the three subplots in Column 2 of Figure 4, we note this alarming trend.

Finally, in Figure 5, we plot the corresponding derivatives of the subplots in Figure 3. The three subplots in Column 1 denote the first derivative of the number of daily cases, i.e., how the number of daily new cases is changing. We see sharp, large fluctuations towards the end of the subplots, indicating the difficulty that any predictive system faces. In Column 2, we have the first derivative of the total number of cases, i.e., the number of daily cases. These plots resemble those in Column 1 of Figure 3, as the derivative of the cumulative total is nothing but the number of daily cases.

Mathematical prediction of the time evolution of the COVID-19 pandemic in Italy by a Gauss error function and Monte Carlo simulations
Predictions for COVID-19 outbreak in India using epidemiological models
Modeling and predictions for covid
Spread of COVID-19 in India: A mathematical model
Transmission dynamics of the COVID-19 outbreak and effectiveness of government interventions: A data-driven analysis
Analyzing the epidemiological outbreak of COVID-19: A visual exploratory data analysis approach

Summary Points
1. The present article looks into the specifics of the spread of COVID-19 in India and conducts a statistical analysis of the affected patients.
2. Existing models have failed to predict the spread of COVID-19 in India.
3.
The present article proposes a new mathematical model to predict the spread of the disease in the country, which has succeeded with 90.36% accuracy ever since the first COVID-19 case was