key: cord-187857-emgxp5wg
authors: Gupta, Sourendu; Shankar, R.
title: Estimating the number of COVID-19 infections in Indian hot-spots using fatality data
date: 2020-04-07
journal: nan
DOI: nan
sha:
doc_id: 187857
cord_uid: emgxp5wg

In India the COVID-19 infected population has not yet been accurately established. As always in the early stages of any epidemic, the need to test serious cases first has meant that the population with asymptomatic or mild sub-clinical symptoms has not yet been analyzed. Using counts of fatalities, and previously estimated parameters for the progress of the disease, we give statistical estimates of the infected population. The doubling time is a crucial unknown input parameter which affects these estimates, and may differ strongly from one geographical location to another. We suggest a method for estimating epidemiological parameters for COVID-19 in different locations within a few days, so adding to the information required for gauging the success of public health interventions.

It is generally accepted that in many parts of the world the actual number of infected people is much larger than the number of confirmed cases. This is due to limited testing, which is biased towards the serious cases. The number of documented fatalities, on the other hand, is likely to be much closer to the actual number. In this note we use a method to estimate the actual number of infections from the documented number of fatalities. This estimate is one of our main results. It is important because it seems possible that asymptomatic and sub-clinical infections may also infect others [1]. We find large uncertainties in these predictions at this time, and suggest a method to improve the estimates systematically day by day. This gives us a secondary motivation, which is to refine epidemiological parameters for COVID-19 infections using only the daily statistics of fatalities. We do not utilize detailed epidemiological models.
Our model input is a hypothesis of exponential growth. There is no data from India at present which contradicts this. The other inputs we need are about the progress of the disease. There is agreement in the literature that post-infection there is a short asymptomatic period. We use statistical models for the progression of the disease from the asymptomatic stage to resolution into recovery or fatality, which are parametrized to fit reports. We also need the infection fatality ratio (IFR), which we take from previous studies. Using these we make predictions for the infected population now and in future for various scenarios for the exponential growth rate. We discuss how our predictions can be used to validate some of the model assumptions.

The distribution of population in India is highly non-uniform, and this could cause geographical fluctuations in the progress of the epidemic. So we apply our estimators to various localized outbreaks seen in India till now, and make predictions for the number of fatalities in these regions for about a week from now. Using the statistical inputs which we discuss here, we give confidence limits on the predictions, and discuss how to match them to future data in order to extract the remaining epidemiological parameters. It should be noted that current data is available aggregated over districts. Finer details may be very useful for discussing exit strategies from a lock-down. In view of this, the availability of data from each hospital separately would be of great use.

In the early stages of a typical epidemic, when the number infected is a very small fraction of the population, the number of infected cases, I, rises exponentially. This may be parametrized as

$$I(t) = I(t_0)\, e^{\lambda (t - t_0)}, \qquad (1)$$

where the doubling time is τ = ln 2/λ, and t_0 is the initial time at which the counting starts. The doubling time τ is related to the basic reproductive rate parameter R_0. This is affected by both the virulence of the pathogen and the rate of social contact.
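As an illustration of the exponential-growth hypothesis of eq. (1), the following minimal Python sketch converts a doubling time into a growth rate and evolves an infected population forward. The function names, the starting population of 100, and the doubling times of 10 and 5 days are illustrative assumptions, not quantities fixed by the analysis.

```python
import math


def growth_rate(doubling_time_days):
    """Growth rate lambda (per day) for a doubling time tau, using tau = ln 2 / lambda."""
    return math.log(2) / doubling_time_days


def infections(t, i0, doubling_time_days, t0=0.0):
    """Infected population I(t) = I(t0) * exp(lambda * (t - t0))."""
    lam = growth_rate(doubling_time_days)
    return i0 * math.exp(lam * (t - t0))


# Hypothetical example: 100 infections at t0 = 0, evolved for 10 days.
for tau in (10.0, 5.0):
    print(f"tau = {tau:4.1f} days: I(10) = {infections(10.0, 100.0, tau):.0f}")
```

Over the same 10 days the population doubles once for a doubling time of 10 days and twice for a doubling time of 5 days, which is why uncertainty in τ feeds so strongly into any estimate built on eq. (1).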
Estimates of R_0 in China vary from as low as 2.2 to as high as 14.8, with a cluster of estimates close to the median of 2.79 [4]. See also an interesting estimate of the effect of public health interventions on R_0 using data from the cruise ship Diamond Princess [5]. Converting R_0 to τ requires an epidemiological model. In this note, we do not use any particular model. Our only assumption about the dynamics of the epidemic is the exponential growth of infections given by eq. (1).

Due to the extreme heterogeneity of the population in India, R_0, or τ, could vary from one place to another. The most definitive estimates of population densities are almost a decade old, since they come from the 2011 census of India [6]. According to this source, the population density of Indore is 9,718 /km², whereas that for Mumbai is 28,185 /km². Inside Mumbai again there is extreme heterogeneity, with high density areas like Dharavi having a density of 3,35,743 /km². These numbers are indicative of the possibility that different cities, and even districts within a city, may have very different doubling times τ. In view of this, we do not assume a single value for τ, but work with two scenarios: in Scenario A infections double every 10 days, i.e., τ = 10 days; in Scenario B doubling is twice as rapid, with τ = 5 days.

In comparison, the aggregated data on fatalities in India taken until 30 March indicates a doubling time of roughly 4 days. We note that between March 30 and April 2, the number of fatalities in the Mumbai-Navi Mumbai-Pune area changed from 8 to 14, which supports the idea that Scenario B may not be far from the current doubling time in this location. The aggregated Indian fatality numbers could either be dominated by rapid growth in some local outbreaks, or a general spread of the disease. The geographically disaggregated data is capable of indicating which is the case.
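The doubling time implied by two fatality counts can be checked with a short calculation. The Python sketch below assumes pure exponential growth between the two dates and uses the counts quoted above (8 fatalities on March 30 and 14 on April 2, three days apart). The function name is hypothetical, and a two-point estimate from such small counts carries a large statistical uncertainty.

```python
import math


def doubling_time(count_start, count_end, days_elapsed):
    """Doubling time implied by growth from count_start to count_end over
    days_elapsed days, assuming pure exponential growth in between."""
    if count_start <= 0 or count_end <= count_start or days_elapsed <= 0:
        raise ValueError("need positive, growing counts and a positive time interval")
    return days_elapsed * math.log(2) / math.log(count_end / count_start)


# Fatalities in the Mumbai-Navi Mumbai-Pune area: 8 on March 30, 14 on April 2.
tau = doubling_time(8, 14, 3)
print(f"implied doubling time: {tau:.1f} days")  # about 3.7 days
```

The result, about 3.7 days, is closer to the 5 days of Scenario B than to the 10 days of Scenario A.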
For COVID-19 infections the incubation period, t_i, is estimated to be 5.1 days (95% CL 4.5-5.8 days), with a long tail [2]. Early observers reported that the recovery time, t_r, the time from the onset of symptoms to recovery, ranged from 12 to 32 days [3]. The interval from onset of symptoms to release from hospital for a recovery may depend on various factors. It is seen to be larger than the time to death. This latter number is seen to have a mean of 17.8 days [8], and is the one relevant for our analysis. We will take the duration of the disease, T, to be the sum of these two periods, i.e., T = t_i + t_r. We will take the incubation period to be a minimum of 3 days and have an exponential distribution after that, so that the mean incubation period is 5.1 days. We shall take the recovery time to have a minimum of 12 days, and be exponentially distributed such that the mean time to death is 17.8 days. In other words, t_i and t_r are random variates chosen from the distributions

$$P(t_i) = \begin{cases} 0 & \text{for } t_i < 3, \\ \frac{1}{2.1}\, e^{-(t_i - 3)/2.1} & \text{for } t_i > 3, \end{cases} \qquad (2)$$

and

$$P(t_r) = \begin{cases} 0 & \text{for } t_r < 12, \\ \frac{1}{5.8}\, e^{-(t_r - 12)/5.8} & \text{for } t_r > 12. \end{cases} \qquad (3)$$

With these assumptions, the mean infected period, T, is about 19 days, with the 95% CL being 15-27 days. These distributions can be improved in future. The core point to note is that a Gaussian distribution is not a good description when the data shows a long tail and skewness. Note that each of the random variates is applicable to a different case.

The fraction of the infected population which dies is called the infection fatality ratio, IFR. This is most reliably estimated after the end of an epidemic. Estimates based on Chinese data for COVID-19 give IFR = 0.0066 (i.e., 0.66%) on average [8], with a strong age structure [7] [8] [9]. The analysis of [8] gives a skewed posterior distribution of this quantity, so we will take IFR to have a minimum value of 0.0035, and to be exponentially distributed above this, so that it has a mean of 0.0066 and a width of 0.005.
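The shifted exponential distributions described above can be sampled with a few lines of Python. This is a minimal sketch under a literal reading of the stated minima and means, in which the scale of each exponential is taken to be the mean minus the minimum. The function names, the seed, and the sample size are assumptions made for illustration, and summary statistics from this sketch need not coincide with the values quoted in the text.

```python
import random
import statistics


def shifted_exponential(rng, minimum, mean):
    """Draw a variate that is never below `minimum` and is exponentially
    distributed above it, with overall mean `mean`."""
    return minimum + rng.expovariate(1.0 / (mean - minimum))


def sample_duration(rng):
    """Disease duration T = t_i + t_r, in days."""
    t_i = shifted_exponential(rng, 3.0, 5.1)    # incubation period
    t_r = shifted_exponential(rng, 12.0, 17.8)  # onset of symptoms to death
    return t_i + t_r


def sample_ifr(rng):
    """Infection fatality ratio, as a fraction."""
    return shifted_exponential(rng, 0.0035, 0.0066)


rng = random.Random(2020)  # fixed seed so the sketch is repeatable
durations = [sample_duration(rng) for _ in range(20_000)]
ifrs = [sample_ifr(rng) for _ in range(20_000)]
print(f"sample mean of T:   {statistics.mean(durations):.1f} days")
print(f"sample mean of IFR: {statistics.mean(ifrs):.4f}")
```

Drawing T and IFR independently mirrors the absence of age structure in the analysis; an age-structured treatment would need correlated draws.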
Since the case fatality ratio, CFR, is discussed more frequently, we discuss the relation between CFR and IFR in the final section. Note that IFR is a population averaged quantity, and the random values assumed for it are a measure of our uncertainty about this number. We have assumed that there is no correlation between T and IFR, since we have not taken age structuring into account. In a more detailed, age structured analysis, correlations may be important.

The remainder of our analysis will use the data on fatalities reported by the Ministry of Health and Family Welfare [11]. Before embarking on the analysis it is useful to separate the data on fatalities into two sets. One is the set of fatalities of persons who arrived from a foreign country and were diagnosed as being infected with COVID-19 very soon afterwards. The statistics of such deaths relate to infections in the country where the infection was acquired, so this set is not of relevance to our analysis of infections within the country. It is the complement, namely the set of fatalities to which no travel history can be attributed, which is of relevance to the analysis. This separation is not made in [11]. However, the data tracked in [12] adds this information from press reports, and the totals tally with the data of [11]. This is the data set we utilize here [13].

Due to the extreme variability in population density across India, it is good to avoid a country-averaged analysis if possible. We analyze clusters of fatalities due to COVID-19 in India, with data that was complete at the end of March 30, 2020. The fatalities of persons with no history of recent foreign travel fell into two groups: sporadic, defined as single fatalities in isolated geographical locations, and clusters, defined as more than one fatality in the same city or in towns very close to each other. We decided not to use sporadic cases, since statistical estimates are meaningless for single incidents.
We found four clusters which are listed in Table I. These are the epidemic hot-spots in the country according to currently available data.

The observation of the number of fatalities, D(t), on day t, may be converted to an estimate of the actual number of infections, I(t), on the same day by using the formula

$$I(t) = \frac{D(t)}{\mathrm{IFR}} \times e^{\lambda T}. \qquad (4)$$

The first factor, D(t)/IFR, is an estimate of the number of infections, I(t − T), on day t − T. The second factor is the exponential growth of eq. (1), which evolves this older number of infections to its current value, leading to eq. (4). Note that the very broad and skewed distributions of T and IFR will give similarly broad and skewed distributions of I(t).

Here we suggest how to narrow these estimates progressively. Any knowledge of I(t) gives a prediction of I(t′) at a future date t′. Uncertainties in I(t) expand into larger uncertainties in I(t′) due to exponential growth. However, the number of fatalities up to date t′, i.e., D(t′), is directly observable. A time series for D(t′) allows us to estimate τ directly. Furthermore, with each day's data on fatalities, one can run the evolution backwards to rule out some of the uncertainty in the starting prediction I(t) on 30 March. This means that the prediction for D(t′) = I(t′) × IFR further in the future is narrowed down. We show this inference procedure schematically in Figure 1. As the allowed range of the initial I(t) successively narrows, one also narrows the allowed range of T and IFR through a Bayesian inference paradigm.

There are statistical uncertainties in the parameters T and IFR. We have combined them through a random Monte Carlo sampling using the probability distribution functions defined above. We implemented the Monte Carlo in Mathematica. The use of eq. (5) to make estimates of I(t), and the future values I(t′) and D(t′), then gives a statistical distribution of these quantities. These are implemented in the same Monte Carlo estimator.
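The Monte Carlo combination of the uncertainties in T and IFR can be sketched as follows. This is a Python illustration, not the Mathematica implementation mentioned above. It assumes the shifted exponential distributions with scale equal to mean minus minimum, a growth rate λ = ln 2/τ, and independent draws of T and IFR. The function names, sample size, and seed are hypothetical, and the printed ranges depend on these parametrization choices, so they need not reproduce the values reported in the text.

```python
import math
import random


def shifted_exponential(rng, minimum, mean):
    """Variate bounded below by `minimum`, exponential above it, with mean `mean`."""
    return minimum + rng.expovariate(1.0 / (mean - minimum))


def amplification_samples(doubling_time_days, n_samples=20_000, seed=0):
    """Monte Carlo samples of I(t)/D(t) = exp(lambda * T) / IFR."""
    rng = random.Random(seed)
    lam = math.log(2) / doubling_time_days
    samples = []
    for _ in range(n_samples):
        # T = t_i + t_r, drawn independently of IFR (no age structure).
        t_total = (shifted_exponential(rng, 3.0, 5.1)
                   + shifted_exponential(rng, 12.0, 17.8))
        ifr = shifted_exponential(rng, 0.0035, 0.0066)
        samples.append(math.exp(lam * t_total) / ifr)
    return samples


def summarize(samples, level=0.95):
    """Return (lower, median, upper) of the central interval at the given level."""
    ordered = sorted(samples)
    n = len(ordered)
    lower = ordered[int(n * (1.0 - level) / 2.0)]
    upper = ordered[min(n - 1, int(n * (1.0 + level) / 2.0))]
    return lower, ordered[n // 2], upper


results = {}
for name, tau in (("A", 10.0), ("B", 5.0)):
    results[name] = summarize(amplification_samples(tau))
    lower, median, upper = results[name]
    print(f"Scenario {name}: median {median:.0f}, 95% interval {lower:.0f} to {upper:.0f}")
```

Multiplying each sampled amplification factor by an observed D(t) gives a sampled I(t), so the skewness of the input distributions carries over directly to the estimate of the infected population.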
The basic result is for the amplification factor, I(t)/D(t), which we obtain with this numerical estimate:

I(t) = (460 to 9,700) × D(t) for scenario A,
I(t) = (2,700 to 3,80,000) × D(t) for scenario B.

The ranges given here are 95% CL, and have been rounded to two significant digits. The median values are 1,500 in scenario A and 12,000 in scenario B. We tested the effect of changing the skewness of the distribution by replacing the shifted Exponential distributions in eq. (2) and eq. (3) by Gamma distributions tuned so as to reproduce the same means and variances. However, the Gamma distributions then have smaller skewness. In this case we obtain

I(t) = (440 to 7,700) × D(t) for scenario A,
I(t) = (1,800 to 2,00,000) × D(t) for scenario B.

The medians are 1,600 in scenario A and 15,000 in scenario B. One sees that the medians and lower limits of the 95% CL change by relatively small amounts, whereas the upper limits are quite different.

We list the COVID-19 hot-spots in Table I along with estimates of I and D on various dates for each in the two different scenarios defined earlier. We emphasize that different geographical locations may have different doubling times, so both scenarios may be relevant. These are our major results. For later use we also show in Figure 3 the 95% CL predictions for D on 10 April in two different hot-spots in Scenario B. Any limits that we can extract on disease and epidemiological parameters would help us to plan ahead for the kind of demands that may be put on medical facilities in the near future.

From the data on the geographical distribution of fatalities in India, we identified four possible hot-spots for COVID-19. The clustering of fatalities into hot-spots is an indication that there is perhaps no general spread of the epidemic through the country, and gives hope for partial removal of lock-down if the situation does not change. We used the simple statistical estimator given in eq.
(4) to make a prediction for the total number of infections in each of these hot-spots on 30 March, in two scenarios. Estimating I(t) is important, especially since there is a good chance that the part of this population which is pre-symptomatic, non-symptomatic, or has sub-clinical symptoms is able to communicate the disease to others. Our predictions are exhibited in Table I. Note that the 95% CL spans an enormous range, due to the spread in the input parameters, mainly the time interval between infection and death. Note that the distribution of I(t) is very skewed, and the median is close to the lower end of the 95% CL. In Scenario B, the upper end of the 95% CL is more than 1% of the population of Mumbai. We consider this highly unlikely, although statistically possible. We have outlined a procedure to refine these estimates by incorporating daily data progressively into the computation.

Given the very large values of I(t) which the model predicts, medical professionals may legitimately ask whether one sees so many respiratory cases arriving in hospitals. Note, however, that a very large fraction are likely to be either asymptomatic, or to exhibit sub-clinical symptoms. In fact [8] simultaneously reports IFR and the fraction of infected individuals who are hospitalized, H. This ratio, averaged over the population, is reported to be in the range H = 2-4%. In Scenario B, therefore, one may expect 440-12,000 people to arrive in a hospital. Even among these, some may not be able to consult a physician during a lock-down. Nevertheless, current experience strongly disfavours the upper end of this 95% CL range. One recalls again that the median is close to the lower end. This estimate again emphasizes how important it is to narrow the range of prediction from the model.

We are able to perform another simple estimate from these numbers. The case fatality ratio, CFR, is defined as the number of fatalities divided by the number of cases tested.
Assuming that the tests are largely done on the people who arrive at a hospital, one can see that

$$\mathrm{CFR} \approx \frac{D(t)}{H\, I(t)} = \frac{\mathrm{IFR}}{H}.$$

With IFR = 0.657% and H = 2-4%, we find CFR = 0.16-0.33, i.e., in the range of 16% to 33%. This is precisely in the range that is seen in the current data for India. This also means that the current policy for testing is likely to be catching most cases which need to be hospitalized. One hopes that with the decision to administer the faster serum test, it might become possible to sample the larger population more effectively.

Finally we point out that there is a strong age structure to all the model parameters, which we have ignored. We plan to include it in future. Along with the planned Bayesian narrowing of the parameter space of COVID-19 pathology and epidemiology, this would provide valuable inputs for more detailed models which can be used to inform future policy.

Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1
The Incubation Period of Coronavirus Disease 2019 (COVID-19) from Publicly Reported Confirmed Cases: Estimation and Application
Positive RT-PCR Test Results in Patients Recovered from COVID-19
The reproductive number of COVID-19 is higher compared to SARS Coronavirus
COVID-19 outbreak on the Diamond Princess cruise ship: estimating the epidemic potential and the effectiveness of public health countermeasures
Characteristics of and Important Lessons from the Coronavirus Disease 2019 (COVID-19) Outbreak in China
Estimates of the severity of Coronavirus disease 2019: a model based analysis
Clinical predictors of mortality due to COVID-19 based on an analysis of data of 150 patients from Wuhan, China
Potential Biases in Estimating Absolute and Relative Case-Fatality Risks during Outbreaks
We make our data set on clusters of fatalities publicly available on Google maps at the address drive

The authors would like to thank Sandhya Koushika, Gautam Menon, and Rahul Siddharthan for crucial inputs, and many members of
the ISRC (Indian Scientists' Response to COVID-19) mailing list for discussions.