key: cord-0618469-8c4g4hqx authors: Das, Sourish title: Prediction of COVID-19 Disease Progression in India: Under the Effect of National Lockdown date: 2020-04-07 journal: nan DOI: nan sha: 309f789af580cd8118949f25277c0da525fb47c1 doc_id: 618469 cord_uid: 8c4g4hqx

In this policy paper, we implement the epidemiological SIR model to estimate the basic reproduction number $\mathcal{R}_0$ at the national and state level. We also develop a statistical machine learning model to predict the cases ahead of time. Our analysis indicates that the situation in Punjab ($\mathcal{R}_0 \approx 16$) is not good and requires immediate, aggressive attention. The $\mathcal{R}_0$ for Madhya Pradesh (3.37), Maharashtra (3.25) and Tamil Nadu (3.09) is more than 3. The $\mathcal{R}_0$ of Andhra Pradesh (2.96), Delhi (2.82) and West Bengal (2.77) is higher than India's $\mathcal{R}_0 = 2.75$, as of 04 April 2020. India's $\mathcal{R}_0 = 2.75$ is comparable to that of Hubei/China at the early stage of disease progression. Our analysis indicates that the early disease progression of India is similar to that of China. Therefore, even with the lockdown in place, India should expect as many cases as China, if not more. If the lockdown works, we should expect fewer than 66,224 cases by May 01, 2020. All data and \texttt{R} code for this paper are available from \url{https://github.com/sourish-cmi/Covid19}

The World Health Organization (WHO) declared the outbreak of the novel coronavirus disease, COVID-19, a pandemic. It will take twelve to eighteen months to develop a vaccine for COVID-19 [5]. The absence of a vaccine makes the situation worse for the already overstretched Indian health care system. For example, the number of hospital beds per 1,000 population is less than one [13]; this is just one indicator of the poor state of India's health care system.
In the absence of a vaccine, social distancing is the optimal strategy to control the spread of the novel coronavirus [5]. Beyond social distancing, broad-based rapid testing and cluster testing are essential to identify those who are infected and isolate them. However, India did not have enough testing capacity, as widely reported in the media [2]. Although Indian scientists recently developed an affordable testing kit for COVID-19 [8], India needed a complete overhaul of its health care system on a war footing. In this situation, India's Prime Minister Narendra Modi announced an unprecedented three-week nationwide lockdown on 24 March 2020. The purpose of the lockdown is to slow the spread of the novel coronavirus, so that the government can pursue a multi-pronged strategy: add more beds to its network of hospitals, and scale up the production of COVID-19 testing kits and of personal protective equipment (PPE) for health workers.

In such a grim scenario, the important question for Indian health officials is how many new confirmed cases will be seen and by when, in the hope that the national lockdown will slow the spread of the virus and buy them time to overhaul the health care system. However, will the lockdown provide the necessary slowdown of the virus spread? Even if the lockdown helps India control the spread of the virus, it is not economically sustainable to extend it further, as a large part of India's workforce is employed in the informal sector as daily wage laborers. Therefore, in this policy paper, we try to estimate the effect of the lockdown and set up a track against which we can tell whether the lockdown is working.

In this paper, we develop the epidemiological SIR model and a statistical machine learning model to predict disease progression in India. We implement the SIR model to estimate the basic reproduction number $\mathcal{R}_0$ at the national and state level.
This lets us identify which states require more attention. We then implement the machine learning model to predict the number of cases ahead of time, so that the Indian administration can be better prepared. In Section (2), we introduce the databases from which the data are downloaded and on which the models are built. In Section (3), we present the methodology used to analyze and predict the data. In Section (4), we present our analysis and prediction of the COVID-19 disease progression in India. Section (5) concludes the paper with some policy recommendations.

In this paper, we used the following major databases.

Legendary statistician Prof George Box once said, "All models are wrong, but some are useful", see [3]. Keeping this in mind, in this paper we take a model-agnostic, two-pronged approach. The first prong is to understand the severity of the ground situation; the second is prediction, which will help health officials plan accordingly. Epidemic models for infectious disease yield insights into the dynamic behavior of the disease spread. With these insights, health officials can develop more effective disease intervention strategies. Such epidemic models are also used to forecast the course of the epidemic. In addition to epidemic models, we consider statistical machine learning (SML) models, which are extremely good for prediction. The interpretability of SML models is often questioned. However, because we take a model-agnostic approach, we can use the epidemic models to understand the ground reality while adopting SML to achieve better prediction accuracy.

A popular epidemic model for infectious disease is the Susceptible, Infected, Recovered (SIR) model. The model considers a closed population. To start with, a few infected people are added to the population. The model assumes that the mixing pattern is homogeneous.
During the period of sickness, each contagious person infects on average $\mathcal{R}_0$ susceptible people, each of whom then goes on to infect $\mathcal{R}_0$ others. The $\mathcal{R}_0$ is popularly known as the Basic Reproduction Number. It is the fundamental quantity of the disease progression: a higher $\mathcal{R}_0$ means that more people will tend to be infected in the course of the epidemic. The major advantage of the SIR model is that it gives a single number, $\mathcal{R}_0$, which can be used to benchmark and compare the ground situation of different states, so that resources can be allocated to the states that are hard hit. The SIR model can be described as

$$\frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad \frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I, \qquad (1)$$

where $S$, $I$, and $R$ are the numbers of people in the population that are susceptible, infected and recovered, and $N$ is the total population. The parameter $\beta$ is the transmission rate, i.e., the average rate of contacts a susceptible person makes that are sufficient to transmit the infection. Each susceptible person contacts $\beta$ people per day, a fraction $\frac{I}{N}$ of whom are infectious. Therefore $\frac{\beta S I}{N}$ people per day move out of the susceptible group and into the infected group. The parameter $\gamma$ is the recovery rate, and $\gamma I$ is the flow out of the infected group and into the recovered group. The average duration a person spends in the infected group is $\frac{1}{\gamma}$ days. For COVID-19, $\frac{1}{\gamma}$ is around 14 days, see [5].

In this paper, we follow the SIR implementation methodology described in [12]. Given $\mathcal{R}_0$, $\beta$ and $\gamma$, the implementation of the SIR model is fairly straightforward via the deSolve package, a collection of solvers for initial value problems of differential equations, see [10]. It is known that $\mathcal{R}_0 = \frac{\beta}{\gamma}$, see [4]. We take $\gamma = \frac{1}{14}$, from [5]. However, we need a good estimate of $\mathcal{R}_0$ to implement the SIR model and predict the disease progression in India. To estimate $\mathcal{R}_0$, we use the R package 'R0', a toolbox to estimate reproduction numbers, see [7].
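The SIR dynamics and the relation between $\beta$, $\gamma$ and the basic reproduction number described above can be illustrated with a minimal Python sketch. This is not the author's R code (which relies on the deSolve package); it integrates the standard SIR equations with a hand-written fourth-order Runge-Kutta step. Only the recovery rate of 1/14 and the national reproduction number of 2.75 come from the text; the population size, the initial number of infected people and the horizon in the usage lines are hypothetical values chosen purely for illustration.

```python
def sir_rhs(state, beta, gamma, n):
    """Right-hand side of the SIR equations for state = (S, I, R)."""
    s, i, _ = state
    new_infections = beta * s * i / n   # flow from S to I
    recoveries = gamma * i              # flow from I to R
    return (-new_infections, new_infections - recoveries, recoveries)


def rk4_step(state, h, beta, gamma, n):
    """Advance the state by one classical Runge-Kutta step of size h (days)."""
    def shifted(k, scale):
        return tuple(x + scale * dx for x, dx in zip(state, k))

    k1 = sir_rhs(state, beta, gamma, n)
    k2 = sir_rhs(shifted(k1, h / 2), beta, gamma, n)
    k3 = sir_rhs(shifted(k2, h / 2), beta, gamma, n)
    k4 = sir_rhs(shifted(k3, h), beta, gamma, n)
    return tuple(
        x + h / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_sir(r0, gamma, n, i0, days, steps_per_day=10):
    """Return a list of daily (S, I, R) tuples, starting at day 0."""
    beta = r0 * gamma                    # from R0 = beta / gamma
    state = (float(n - i0), float(i0), 0.0)
    h = 1.0 / steps_per_day
    path = [state]
    for _ in range(days):
        for _ in range(steps_per_day):
            state = rk4_step(state, h, beta, gamma, n)
        path.append(state)
    return path


if __name__ == "__main__":
    # Hypothetical population and seed; R0 and gamma as quoted in the text.
    path = simulate_sir(r0=2.75, gamma=1 / 14, n=1_000_000, i0=3, days=60)
    for day in (0, 20, 40, 60):
        s, i, r = path[day]
        print(day, round(s), round(i), round(r))
```

Because the three derivatives sum to zero, the total S + I + R stays equal to the closed population size throughout the simulation, which is a convenient sanity check on any implementation.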
The time between the infection of a primary case and that of one of its secondary cases is called the generation time, see [11]. The 'R0' package assumes that the generation time distribution is known and must be provided as input. The mean generation time for Wuhan has been reported as 6.5 days [6]. In this paper, we assume that the generation time follows a Gamma distribution, and we estimate the mean and shape parameters of the Gamma distribution from the data. Our estimated mean generation time for Hubei province turns out to be 6.7 days, as presented in Table (1). We assume that individuals who recover from infection are immune to re-infection in the short term; this assumption is the same as in [5].

Currently, we deploy a grid search over the mean and shape of the Gamma distribution for the generation time. For a particular choice of the mean ($\mu$) and shape ($\kappa$) parameters, we construct the generation time distribution and, given that as input, estimate $\mathcal{R}_0$ using the 'R0' package in R. Then, for the estimated $\mathcal{R}_0$ and $\gamma$ (assumed to be 1/14), we simulate the disease progression over the period for which we observed the new incidences. We then calculate the Mean Square Error (MSE) as

$$\text{MSE} = \frac{1}{T} \sum_{t=1}^{T} \left( \hat{I}(t) - i_{obs}(t) \right)^2, \qquad (2)$$

where $\hat{I}(t)$ is the new incidence estimated from the SIR model described in (1) at time point $t$, $i_{obs}(t)$ is the actual incidence observed in the data at time point $t$, and $T$ is the number of observed time points. We choose the mean parameter $\mu$ and shape parameter $\kappa$ for which the MSE in (2) is minimum. Then, for the estimated mean and shape parameters, $\mathcal{R}_0$ is estimated using the 'R0' package.

The infection rate of a typical epidemic reaches its peak and then slows down. The SIR model predicts very well when that peak will be reached, because it captures the inherent dynamics of the epidemic. However, the SIR model is not helpful for short and medium-term predictions. We also need short and medium-term predictions, so that health officials can take appropriate decisions as quickly as possible.
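The grid search over the generation time parameters described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: the text does not say which estimator of the 'R0' package is used, so the sketch swaps in the exponential-growth relation, in which the reproduction number equals $(1 + r\mu/\kappa)^{\kappa}$ for a Gamma generation time with mean $\mu$, shape $\kappa$ and epidemic growth rate $r$. The function names, the Euler integration, the grids and the synthetic incidence series are all hypothetical.

```python
import math
from itertools import product

GAMMA = 1.0 / 14.0  # recovery rate assumed in the text


def growth_rate(incidence):
    """Least-squares slope of log(incidence) against the day index."""
    pts = [(t, math.log(y)) for t, y in enumerate(incidence) if y > 0]
    t_bar = sum(t for t, _ in pts) / len(pts)
    y_bar = sum(y for _, y in pts) / len(pts)
    num = sum((t - t_bar) * (y - y_bar) for t, y in pts)
    den = sum((t - t_bar) ** 2 for t, _ in pts)
    return num / den


def r0_from_growth(r, mean, shape):
    """Stand-in estimator (an assumption): R0 for a Gamma generation time."""
    return (1.0 + r * mean / shape) ** shape


def sir_incidence(r0, n, i0, days, steps_per_day=20):
    """Daily new infections from an SIR model, using small Euler steps."""
    beta, h = r0 * GAMMA, 1.0 / steps_per_day
    s, i = float(n - i0), float(i0)
    daily = []
    for _ in range(days):
        new_today = 0.0
        for _ in range(steps_per_day):
            infections = beta * s * i / n * h
            s -= infections
            i += infections - GAMMA * i * h
            new_today += infections
        daily.append(new_today)
    return daily


def mse(predicted, observed):
    """Mean square error between simulated and observed incidence."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def grid_search(observed, n, i0, means, shapes):
    """Return (mse, mean, shape, r0) for the grid point with the smallest MSE."""
    r = growth_rate(observed)
    best = None
    for mean, shape in product(means, shapes):
        r0 = r0_from_growth(r, mean, shape)
        err = mse(sir_incidence(r0, n, i0, len(observed)), observed)
        if best is None or err < best[0]:
            best = (err, mean, shape, r0)
    return best


if __name__ == "__main__":
    # Synthetic incidence from a known SIR run; all numbers are hypothetical.
    observed = sir_incidence(2.5, 1_000_000, 10, 30)
    print(grid_search(observed, 1_000_000, 10, [5.0, 6.5, 8.0], [1.0, 2.0, 4.0]))
```

The structure mirrors the steps in the text: fix a candidate mean and shape, obtain a reproduction number, simulate the SIR path with the recovery rate fixed at 1/14, and keep the pair with the smallest MSE.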
Statistical Machine Learning (SML) models are popular for their prediction accuracy over the short to medium term [9]. Consequently, SML and SIR models complement each other. Note that SML does not do well in long-term prediction; in particular, it cannot predict when the epidemic will reach its peak. With this understanding, we develop traditional SML models and not deep learning models. Deep learning models need a lot of data, and in epidemiology we do not have that kind of big data. In addition, the literature on how to adapt deep learning to small data is not yet sufficient. Therefore we develop a traditional regression-type SML model for short to medium-term prediction.

As the population levels of different countries or provinces differ, the variable we analyze is the number of cases per 100,000 population (the Rate),

$$\text{Rate} = \frac{\text{confirmed cases}}{\text{population}} \times 100{,}000.$$

Then we model the Rate as a function of time, country and time-country interaction in the following way:

$$\text{Rate}_{it} = \alpha_{0i} + \alpha_{1i}\, t + \alpha_{2i}\, t^2,$$

where $\text{Rate}_{it}$ is the Rate of the $i$-th country at the $t$-th time point, $\alpha_{0i}$ is the effect of the $i$-th country, $\alpha_{1i}\, t$ is the linear effect of time on the Rate of the $i$-th country, and $\alpha_{2i}\, t^2$ is the quadratic effect of time on the Rate of the $i$-th country. We considered the following countries in our model: (1)

On March 24, 2020, India announced the national lockdown. To measure the effectiveness of the lockdown, we used all data up to March 24, 2020, to train the model and learn its parameters. Based on the trained model, we predict the disease progression path. Since the incubation period of COVID-19 is about 14 days, it is likely that for 14 days from the beginning of the lockdown the disease will follow the predicted path, and will then deviate downward from it. If the new confirmed cases come in below the predicted path, then we will know that this is due to the effect of the lockdown.
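The regression-type model described above, fitted on pre-lockdown data and then extrapolated to give the predicted path, can be sketched as follows. This is a minimal Python illustration rather than the author's R code. It assumes that every country has its own intercept, linear and quadratic coefficients, so that the least-squares fit can be carried out one country at a time; the function names and the toy series in the usage lines are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def rate_per_100k(cases, population):
    """Cases per 100,000 population (the Rate in the text)."""
    return cases * 100_000 / population


def fit_quadratic(times, rates):
    """Least-squares fit of rate = a0 + a1*t + a2*t^2 for one country."""
    powers = [[t ** p for p in range(3)] for t in times]
    xtx = [[sum(row[i] * row[j] for row in powers) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(powers, rates)) for i in range(3)]
    return solve_linear(xtx, xty)


def predict(coefs, t):
    """Predicted Rate at time t from fitted coefficients (a0, a1, a2)."""
    a0, a1, a2 = coefs
    return a0 + a1 * t + a2 * t * t


if __name__ == "__main__":
    # Hypothetical pre-lockdown Rate series for one country (days 0..9).
    times = list(range(10))
    rates = [0.01, 0.012, 0.016, 0.02, 0.027, 0.035, 0.044, 0.055, 0.068, 0.08]
    coefs = fit_quadratic(times, rates)
    # Extrapolate the predicted path beyond the training window.
    print([round(predict(coefs, t), 4) for t in range(10, 15)])
```

Observed post-lockdown rates can then be compared day by day against the extrapolated values to see on which side of the predicted path they fall.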
If, instead, the disease progression stays on the predicted path, then we will know that the lockdown did not work. If the disease progression rises above the predicted path, then we can say that the ground situation worsened during the lockdown.

Exploratory Data Analysis (EDA) is important for developing good predictive models. In Figure (1), we plot the cases per 100,000 (the Rate) for the US, EU and Iran. The rates of the worst-hit US, EU and Iran are in the range of 70 to 250. On the other hand, disease progression among Asian countries is very different, see Figure (2). The disease progressions of India and Japan are similar. We see an exponential rise in India and Japan, but at a much lower rate than in the Western nations. China has been able to flatten the curve, and South Korea was able to curb the rise from exponential to linear. However, so far South Korea has experienced the worst rate among the four major Asian countries. In Table (3), we present the actual prediction till May 01, 2020. If the lockdown works, then the actual confirmed cases for India should stay below 66,224 by May 01, 2020.

A comparison of $\mathcal{R}_0$ between India and China: Table (1) reports $\mathcal{R}_0$ with a 95% confidence interval. For Hubei province and China, $\mathcal{R}_0$ is around 2.5 during the first 23 days from the start of the lockdown. India's $\mathcal{R}_0$ is computed using two different starting points for the breakout. The first is 02-Mar-2020, because the number of cases in India started rising from that day. The $\mathcal{R}_0$ for India for the first 22 days, till the lockdown, is around 2.5, like China. However, if we use the data till 04-Apr-2020, then the $\mathcal{R}_0$ value is around 2.75. This indicates that the situation has worsened since the lockdown, which is also clear from Figure (4). In the second approach, we consider India's breakout from 23-Jan-2020. In that case, if we consider the data till 24-Mar-2020, the $\mathcal{R}_0$ is almost 1.9, and if we consider the data till 04-Apr-2020, the $\mathcal{R}_0$ is nearly 2.1.
This means that if we include data from before 02-Mar-2020, India's $\mathcal{R}_0$ looks better. In Figure (3), we compare the incidences of Hubei and India, in panels (3:a) and (3:b). We consider the date range for Hubei from 23-Jan-2020 to 14-Feb-2020, i.e., the first 23 days of the Hubei lockdown. For India, we consider the data from 02-Mar-2020 to 24-Mar-2020, i.e., up to the lockdown. On 23-Jan-2020, Hubei had 444 confirmed cases and China overall had 548 confirmed cases. On 02-Mar-2020, India had only 3 confirmed cases, whereas on the day of the lockdown, 24-Mar-2020, India had 536 confirmed cases. So on the day the lockdown started, India and Hubei/China had a comparable number of cases. Perhaps we should consider India's $\mathcal{R}_0$ to be around 2.5, similar to that of the early stage of COVID-19 disease progression in China. Even with the lockdown, China experienced more than 80,000 cases. Perhaps we should prepare for at least that many cases in India, if not more.

State-wise $\mathcal{R}_0$: In Table (2), we present the state-wise Basic Reproduction Number, $\mathcal{R}_0$, as of 04 April 2020. Punjab's $\mathcal{R}_0$ is the worst in the country. Punjab's high $\mathcal{R}_0 \approx 16$ is likely due to a super spreader who ignored advice to self-quarantine after returning from a trip to Italy and Germany, see [1]. The situation in Punjab is really complicated, and serious intervention is required. In Figure (5), we present the cases in Punjab over time. Since March 20, 2020, the number of confirmed cases has increased at an unprecedented rate. From Table (2), we see that the $\mathcal{R}_0$ for Madhya Pradesh (3.37), Maharashtra (3.25) and Tamil Nadu (3.09) is more than 3. Clearly the situation is complicated in these three states. The $\mathcal{R}_0$ of Andhra Pradesh (2.96), Delhi (2.82) and West Bengal (2.77) is higher than India's $\mathcal{R}_0$, which is 2.75. These seven states need special attention, as their $\mathcal{R}_0$ exceeds that of India (2.75). These numbers are as of 4 April 2020.
For the following states, we either do not have enough data to make inference for $\mathcal{R}_0$, or the algorithm fails to converge: (1) Andaman and Nicobar Islands; (2) Arunachal Pradesh;

Here we present a point-by-point discussion of our analysis and prediction.

1. The situation of Punjab ($\mathcal{R}_0 \approx 16$) is bad. It requires immediate aggressive attention.
2. The $\mathcal{R}_0$ for Madhya Pradesh (3.37), Maharashtra (3.25) and Tamil Nadu (3.09) is more than 3. Aggressive intervention is needed.

Table 1: $\mathcal{R}_0$ with a 95% confidence interval for Hubei province and China is around 2.5 during the first 23 days from the start of the lockdown. India's $\mathcal{R}_0$ with a 95% confidence interval is computed using two different starting points. The first is 02-Mar-2020, because the number of cases in India started rising from that day. The $\mathcal{R}_0$ for India for the first 22 days, till the lockdown, is around 2.5, like China. However, if we use the data till 04-Apr-2020, then the $\mathcal{R}_0$ value is around 2.75. In the second approach, we consider India's breakout from 23-Jan-2020. In that case, if we consider the data till 24-Mar-2020, the $\mathcal{R}_0$ is almost 1.9, and if we consider the data till 04-Apr-2020, the $\mathcal{R}_0$ is nearly 2.1.

Figure 3: We consider the date range for Hubei from 23-Jan-2020 to 14-Feb-2020, i.e., the first 23 days of the Hubei lockdown. For India, we consider the data from 02-Mar-2020 to 24-Mar-2020, before the lockdown. On 23-Jan-2020, Hubei had 444 confirmed cases and China overall had 548 confirmed cases. On 02-Mar-2020, India had only 3 confirmed cases, whereas on the day of the lockdown, 24-Mar-2020, India had 536 confirmed cases.

Punjab's high $\mathcal{R}_0$ is likely due to a super spreader who ignored advice to self-quarantine after returning from a trip to Italy and Germany, see [1].
[1] Coronavirus: India 'super spreader' quarantines 40,000 people. BBC News.
[2] Coronavirus: Why is India testing so little? BBC News.
[3] Science and statistics.
[4] Mathematical Epidemiology.
[5] Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand. London: Imperial College COVID-19 Response Team.
[6] Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia.
[7] The R0 package: a toolbox to estimate reproduction numbers for epidemic outbreaks.
[8] Coronavirus: The woman behind India's first testing kit.
[9] A Bayesian perspective of statistical machine learning for big data.
[10] deSolve: Solvers for Initial Value Problems of Differential Equations.
[11] A note on generation times in epidemic models.
[12] Epidemic modelling with compartmental models using R.
[13] Hospital beds (per 1,000 people).

Due to space constraints, we present values only at 5-day intervals, together with the recent out-of-sample values at the daily level.