key: cord-0214434-d81cf3ms authors: Gupta, Sourendu title: Epidemic parameters for COVID-19 in several regions of India date: 2020-05-18 journal: nan DOI: nan sha: 9f9b549629dd05cefa6f18a59bc893531d3bbd85 doc_id: 214434 cord_uid: d81cf3ms

Bayesian analysis of publicly available time series of cases and fatalities in different geographical regions of India during April 2020 is reported. It is found that the initial apparent rapid growth in infections could be partly due to confounding factors such as the initial rapid ramp-up of disease surveillance. A brief discussion is given of the fallacies which arise if this possibility is neglected. The growth after April 10 is consistent with a time independent but region dependent exponential. From this, R_0 is extracted using both known cases and fatalities. The two estimates are seen to agree in many cases; for these CFR is reported. It is seen that CFR and R_0 increase together. Some public health implications of this observation are discussed, including a target doubling interval if medical facilities are to remain adequate.

SARS-CoV-2 is a virus which has newly entered the global human population [1]. As this host-parasite system evolves towards an equilibrium, its epidemiology has been studied extensively, but with some conflicting results [2, 3]. The true extent of its penetration into the population is as yet open to question [4], since testing is fairly restricted in most countries. Nor is the progression of the disease, COVID-19, or its method of spread completely clear [5]. Since the virus is already so widely established, it seems unlikely that it will be totally eliminated soon. So it is important to extract basic epidemiological parameters as cleanly as possible.

India has managed to geographically contain the spread of the COVID-19 epidemic with the nation-wide lock-down which started on 24 March, 2020.
At the end of April the proportion of identified cases in India as a whole was a few tens per million, with 1-2 orders of magnitude more in hot spots. Even if this were wrong by an order of magnitude, it would still mean that the epidemic remains at an early stage in India. This, combined with the lock-down, presents an opportunity to examine the growth of the epidemic in multiple isolated regions which implement essentially the same policy with regard to testing.

This study examines the heterogeneity in the growth rate of the disease in several ways. First, the doubling intervals, τ, of the cumulative number of identified cases, C(t), and the cumulative number of fatalities, D(t), are examined. From τ it is possible to extract the basic reproduction rate, R_0, within epidemic models. Marked heterogeneity is observed. After this, the correlation between the case fatality ratio, CFR, and R_0 is studied.

Epidemic data, especially at the beginning, is never clean. The public health system has to gear up for disease surveillance. The continuing recurrence of Cholera epidemics [6], the spread of Dengue and Chikungunya [7], and the successful surveillance and elimination of Nipah [8] and Zika [9] show that India has a mixed record on epidemic surveillance. In addition to a possible lag between the beginning of the epidemic and its surveillance, there could be a problem of incomplete surveillance during the time the health service ramps up. Any examination of data has to allow for the identification of confounding factors such as these.

For the COVID-19 surveillance data, there are further cautionary remarks. The ICMR guidelines for testing [10] specify that only symptomatic cases should be tested using rRT-PCR. This part of the policy has been unchanged since the middle of March. Depending on the fraction of cases which are symptomatic, this could miss the actual prevalence of the disease in the population.
Estimates of the fraction of asymptomatic infections range as high as 80% [11], implying that, in this extreme case, the testing policy can never reveal more than 20% of the cases. The social stigma attached to COVID-19 [12] also means that some fraction of infections may be cryptic.

There are uncertainties in the statistics of fatalities also. It has been reported that in Europe and the US the number of fatalities due to COVID-19 may have been underestimated by a factor of 2-3. Indian cities have fairly complete registries of deaths, so miscounting of COVID-19 fatalities could come mainly from mistaken or incomplete reports of the cause of death. For larger regions, say districts and whole states, where most deaths happen at home and death certificates are not common [13], the errors in counting fatalities may be significantly larger, and hard to estimate at this time.

One point about the quality test that is developed here is that absolute numbers are not as important for it as the check that fatalities and identified cases independently trace the same rate of growth of the epidemic. This is expected at the beginning of the epidemic, when all epidemic models become linear, and the growth of generic measures is driven by the maximum eigenvalue of the linearized models. However, in the extraction of the CFR, the absolute counts do matter. In spite of the uncertainties, the correlation of CFR and R_0 holds important lessons for public health in the inevitable later stages of this epidemic in India, and in the middle and low income countries of the world.

Data has been extracted from official sources where possible. For Ahmedabad city, the data is made publicly available by the municipal council of the city [14]. This well-organized site corrects data retrospectively for up to about 10 days. For Chennai city, the data has been collected from daily tweets by the municipal council [15].
For Indore city, the data was collected from daily bulletins of the Chief Medical and Health Officer and the collection is available for public use [16]. For Mumbai city the data has been collated from the daily tweets by MCGM [17] into a publicly available site [18]. For Pune district the data was collated from the daily tweets by the district authority [19] and the collection is publicly available [20]. For Delhi and all other states, the data was taken from the publicly available collection at Covid19India [21]. This site corrects data retrospectively for over a week. Only data on the cumulative number of identified cases, C(t), and the cumulative number of known fatalities, D(t), are used in this analysis. For this work data collection stopped on May 1, 2020, and retrospective corrections made after this are not included.

The unquantifiable parts of the errors in the counts of cases and fatalities due to COVID-19 were discussed in the previous section, along with the reasons why their estimates need not be included in this analysis. However, there is another part of the errors in the daily counts of cases and fatalities which comes from backlogs of tests or hospital records. These shuffle a fraction of the numbers from one day to another, and therefore cause errors in the daily counts. As long as the number of facilities keeps pace with the growth of the epidemic, these errors remain proportional to the number of cases and fatalities. Since the wait time for hospital beds for COVID-19 cases has remained roughly constant during the period of study, this argument is expected to hold. In view of this, 20% of the reported values of C(t) and D(t) are assigned as errors. The specific fraction, 20%, was chosen in order to cover the long range fluctuations visible in the time series (for example, those visible on days 3 and 13 of Figure 1). It has been seen that official reports and independent estimates of these numbers are generally within this range.
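The proportional error assignment described above can be sketched in a few lines of Python. The function names and the short series of counts are hypothetical and chosen only for illustration. The second function uses first-order error propagation to show that a fixed fractional error on a count corresponds to an approximately constant error on the logarithm of the count, whatever the size of the count.

```python
import math

REL_ERR = 0.20  # fractional error assigned to C(t) and D(t) in the text


def count_errors(counts, rel_err=REL_ERR):
    """Absolute errors proportional to the reported cumulative counts."""
    return [rel_err * n for n in counts]


def log2_error(rel_err=REL_ERR):
    """First-order error on log2(count) implied by a fixed fractional error.

    Uses d(log2 N) = dN / (N ln 2), so the result does not depend on N.
    """
    return rel_err / math.log(2.0)


# Made-up cumulative counts, for illustration only (not data from the paper).
cumulative_cases = [120, 152, 190, 240, 300]
for n, s in zip(cumulative_cases, count_errors(cumulative_cases)):
    print(f"C = {n:4d} +/- {s:.0f}")
print(f"error on log2(C): about {log2_error():.2f} doublings")
```

One consequence of this choice is that, on a logarithmic scale, every day in the series carries roughly the same weight.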
At such an early stage in the infection, it is reasonable to assume exponential growth, i.e., doubling every τ days. Within this assumption one can check how well the lock-down is working by letting the doubling interval become time dependent. The simplest function to try is a linear change in τ, i.e.,

$$D(t) = D_0\, 2^{t/(\tau_0 + \tau_1 t)} \qquad (1)$$

and a similar set of three parameters for C(t). Note that τ_0 has dimensions of time, whereas τ_1 is dimensionless. A fitting form with constant doubling interval was also used; this is denoted τ, dropping the subscript. The fitting procedure follows the methods of [23], with gamma distributions used as prior probability distribution functions (PDFs) for τ_0 and C_0. The additional parameter τ_1 is allowed to take positive and negative values by taking its prior PDF to be a Gaussian. For all these distributions, the widths are taken large enough that the posterior distribution is insensitive to the choice of priors.

The appendix contains details of the relation between a time varying doubling interval and the time variation of the basic reproductive rate R_0. This requires choosing a model of the epidemic. Using the SEIR model, and the median interval between the appearance of symptoms and the time of fatalities, t_2 = 17.8 days [22], one has [equation missing in source]. When a constant τ is used, one can set τ_1 = 0 in the above formulae and write R_0 and τ instead of R_0^0 and τ_0. Exactly the same procedure is followed for fits to the time series for D(t). Estimates of the median values of the parameters, along with interquartile ranges (IQR) and 95% credible intervals (CrI), are quoted for the doubling intervals as well as R_0.

The analysis of the time series for C(t) and D(t) leads quite naturally to the case fatality ratio, CFR. This is defined as the ratio

$$\mathrm{CFR} = \frac{D(t)}{C(t)}.$$

If C(t) is underestimated, then CFR is overestimated, and conversely, when D(t) is underestimated, then CFR is also underestimated. This was regulated using a Bayesian estimator.
Since the outcome is binomial, the prior PDF used is a beta distribution with α = 1 and β = 2. These choices make the posterior distribution insensitive to doubling or halving the values of the priors. The posterior distribution is of the same form with α = 1 + D(t) and β = 2 + C(t) − D(t), with t taken to be the final day of the analysis. Since C(t) and D(t) are both large, the following approximations for the median, µ, and standard deviation, σ, may be used: [equation missing in source]

III. RESULTS

The time series of C(t) and D(t) are shown for the example of Delhi in Figure 1. Of the regions that we analysed, most cities show an initial rapid growth followed by a tempered growth. The exceptions are Ahmedabad and Chennai among cities, and the states of Gujarat, Kerala, and West Bengal. Note that day one is taken to be March 31, 2020, which is 7 days after the beginning of the national lock-down. Since t_2 = 17.8 days, it might be expected that the growth rate of cases in the pre-lock-down period could manifest itself in that of fatalities until around day 11. In case of a successful lock-down, D(t) could then show an initial exponential growth, tempered after day 11. The initial data for fatalities in Delhi, Indore, Mumbai and Pune can indeed be described by an exponential. However, the doubling interval in Pune turns out to be half of that in Mumbai, although the average population density of Mumbai is about 6 times larger than the average in Pune city.

The ansatz of eq. (1), i.e., a linearly varying doubling interval, was also examined for urban regions. The results are collected in Table I. In most locations the initial doubling interval seems to be between half a day and two days. When converted to R_0, one obtains extremely high values, far in excess of what has been quoted in the literature.
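The size of the R_0 values implied by such short doubling intervals can be checked with a few lines of Python. The sketch below assumes the constant-τ relation R_0 = 1 + t_2 ln 2/τ, which follows from the linearized growth equation dI/dt = (R_0 − 1)I of the appendix when time is measured in units of t_2 = 17.8 days. This closed form and the function name `r0_from_doubling` are assumptions made for illustration, and need not be the exact expression used for the tables.

```python
import math

T2_DAYS = 17.8  # median interval from symptom onset to fatality quoted in the text [22]


def r0_from_doubling(tau_days, unit_days=T2_DAYS):
    """Assumed constant-tau conversion: R0 = 1 + unit_days * ln(2) / tau_days.

    Matches I(t) = I(0) * 2**(t / tau) to I(t) = I(0) * exp((R0 - 1) * t / unit_days).
    """
    if tau_days <= 0:
        raise ValueError("doubling interval must be positive")
    return 1.0 + unit_days * math.log(2.0) / tau_days


for tau in (0.5, 2.0, T2_DAYS):
    print(f"tau = {tau:5.1f} days  ->  R0 = {r0_from_doubling(tau):.2f}")
```

Under this assumption, doubling intervals of half a day and two days correspond to R_0 of roughly 26 and 7, while a doubling interval equal to t_2 gives R_0 = 1 + ln 2, or about 1.7.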
Certainly R_0 could vary from place to place, since it depends on the infectivity of the virus as well as the social networks in each location, and the latter may change from one place to another. However, τ_0 for Pune is one third that of Mumbai, when Mumbai has six times the average population density. The wrong dependence of the doubling time for fatalities on population density, together with the observation that C(t) shows a growth till the same date, supports the idea that there could be a more parsimonious explanation for this common period of growth. This is discussed in the next section. At the moment, any statistical evidence for a gradual slowing of the growth rate of the epidemic is hidden by confounding factors.

In view of this, the analysis was continued with a constant doubling interval, τ, applying it to the period after day 10 or 11. For this part of the analysis, data from three states, namely Gujarat, Kerala, and West Bengal, was also used. From Figure 1, one sees that this simpler model provides as good a description of the data as the model of eq. (1). Furthermore, it yields more realistic values of R_0. This implies that during the lock-down each of these places has seen a location dependent, constant doubling interval. The values of τ, along with inferred values of R_0, are collected in Table II. These are the primary results of this analysis.

It was noted that the number of known cases, C(t), is definitely missing cases among those who have not been tested. This could include a possibly large fraction of asymptomatic and non-critical or pre-symptomatic cases [24, 25]. However, India's disease surveillance mechanism has concentrated on identifying critical cases and contact tracing, which could be a good tracer of the growth of epidemics. If this reasoning is correct, then, during the early growth of the epidemic, one should be able to obtain reliable doubling intervals from the cumulative counts of test positives [23].
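As a rough illustration of how a constant doubling interval can be read off a cumulative series, the sketch below fits a straight line to log2 of the counts by ordinary least squares. This is a simplified stand-in, not the Bayesian procedure of [23] used for the tables. It assumes equal weights for all days, which is reasonable when each count carries the same fractional error, since a fixed fractional error is an approximately constant error on the logarithm. The function name and the synthetic series are hypothetical.

```python
import math


def fit_constant_doubling(days, counts):
    """Least-squares line through log2(counts) against days.

    Returns (tau, n0): the doubling interval in days and the fitted count at day 0.
    """
    if len(days) != len(counts) or len(days) < 2:
        raise ValueError("need at least two matching (day, count) pairs")
    y = [math.log2(c) for c in counts]
    n = len(days)
    mean_t = sum(days) / n
    mean_y = sum(y) / n
    sxx = sum((t - mean_t) ** 2 for t in days)
    sxy = sum((t - mean_t) * (v - mean_y) for t, v in zip(days, y))
    slope = sxy / sxx  # doublings per day
    if slope <= 0:
        raise ValueError("series is not growing; doubling interval undefined")
    intercept = mean_y - slope * mean_t
    return 1.0 / slope, 2.0 ** intercept


# Synthetic series with a true doubling interval of 8 days, starting on day 11.
days = list(range(11, 31))
counts = [100.0 * 2.0 ** ((t - 11) / 8.0) for t in days]
tau, n0 = fit_constant_doubling(days, counts)
print(f"fitted doubling interval: {tau:.2f} days")
```

A full analysis would replace the least-squares step with a posterior over τ and report the median with its credible interval, as described in the text.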
The results of the analysis of C(t) are also given in Table II. The two independent estimates of R_0 agree well enough that a closer look is worthwhile.

The scatter plot in Figure 2 of R_0 obtained in two different ways shows several interesting patterns. First, there seem to be two groups of outbreaks. Most regions have R_0 below 3. Among the regions that we studied, Ahmedabad and Gujarat form a separate group, which saw a faster epidemic growth, with R_0 above 3. Finally there is Kerala, a different outlier, whose doubling intervals are longer than t_2, and which therefore has very low values of R_0.

The case of Kerala merits a separate remark. The cumulative number of fatalities reached 4 at the end of the period of study. With such low counts of fatalities the assumption of exponential growth cannot be well tested. The count of total infections was larger, and supported the hypothesis of exponential growth over the period studied.

Second, one sees that most estimates lie close to the diagonal line. If the data were perfect, and the epidemic grew steadily, the estimates would lie exactly on this line. With this requirement we can separate the regions into two groups. The first, consisting of Ahmedabad, Chennai, Delhi, Gujarat, and West Bengal, lies on this line within statistical uncertainties. The second group, with Indore, Kerala, Mumbai, and Pune, does not. This could indicate some issues with the data. On the other hand, if the data is as good as that for the other regions, then the fact that they are off the diagonal line should be understood. Kerala, which is the only region which lies below the diagonal, is perhaps seeing a lower growth in new cases than in fatalities, which could be indicative of a gradual slowing down of the epidemic. Due to the lag by t_2, fatalities would see the slowing down later.
Conversely, the regions which lie above the diagonal (namely Indore, Mumbai, and Pune, and, possibly, Chennai) could be seeing an increased growth in infections, not yet visible in fatalities because of the same time lag. Whether these scenarios are true, or the data quality is not dependable, should be known to the health agencies now, and would become visible to the public later.

When there is a statistically significant difference between the doubling intervals determined from C(t) and D(t), the ratio D(t)/C(t) gives a time-dependent CFR. This is usually understood to be a transient phenomenon. In view of this, the analysis was restricted to Ahmedabad, Chennai, Delhi, Gujarat, Kerala, and West Bengal, i.e., the regions which lie on the diagonal line of Figure 2, and therefore are seeing a steady growth of identified cases as well as fatalities. The case fatality ratios for these regions are plotted against the R_0 inferred from D(t) in Figure 3. The most obvious feature is that for the group of three cities there is an overall trend towards smaller CFR with decreasing R_0. This is also true of the two states. However, the CFR for states is displaced upwards from that for cities. Both trends have strong implications for the public health outlook and will be discussed further in the next section.

Counts of known cases and fatalities of COVID-19 from five cities (Ahmedabad, Chennai, Delhi, Indore, and Mumbai), one district (Pune), and three states (Gujarat, Kerala and West Bengal) were investigated in this work. In two of the groups, there was one case each where the epidemic was not severe at the end of April (Chennai among cities, and Kerala among states). The others were known hot spots. Kerala is special because the number of fatalities is too low for statistical tests to be meaningful. There is strong regional heterogeneity in the course of the epidemic, indicating the necessity of looking at its spread at extremely local scales in order to check and control it.
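The CFR values discussed above come from a beta posterior with prior parameters α = 1 and β = 2, updated to α = 1 + D and β = 2 + C − D. A minimal sketch of such an estimator is given below. The closed forms used for the median and standard deviation are the standard textbook expressions for a beta distribution (the median formula is an approximation valid when both parameters exceed 1). They are an assumption here and may differ from the approximations used for the figures. The function name and the example counts are hypothetical.

```python
import math


def cfr_posterior(deaths, cases, alpha_prior=1.0, beta_prior=2.0):
    """Median and standard deviation of the beta posterior for the CFR.

    Posterior parameters: a = alpha_prior + deaths, b = beta_prior + cases - deaths.
    """
    if not 0 <= deaths <= cases:
        raise ValueError("need 0 <= deaths <= cases")
    a = alpha_prior + deaths
    b = beta_prior + cases - deaths
    # Standard approximation to the median of Beta(a, b), good for a, b > 1.
    median = (a - 1.0 / 3.0) / (a + b - 2.0 / 3.0)
    # Exact standard deviation of Beta(a, b).
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))
    return median, sd


# Made-up counts, for illustration only (not data from the paper).
mu, sigma = cfr_posterior(deaths=50, cases=1000)
print(f"CFR = {100 * mu:.2f}% +/- {100 * sigma:.2f}%")
```

For large counts the median is close to the naive ratio D/C, and the prior matters only when the number of fatalities is small, as in Kerala.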
The time series of both known cases and fatalities in four out of the six urban centers showed a rapid rise for about 18 days after the lock-down. This was followed by a much slower growth. Since fatalities track cases with a delay of 17.8 days on the average, the early part of this data could track the growth in the time before the lock-down. However, it turns out that the data grows faster in less dense urban areas. Moreover, this hypothesis is not tenable for the growth in the number of known cases. A possibility which resolves these difficulties is that this rapid rise of numbers in the early days tracks the rapid improvement of disease surveillance rather than the epidemic. The fact that the positive cases in Kerala do not show such a rapid initial growth is consistent with reports that the state activated disease surveillance after the first infections came from abroad [26]. This could also be true of Ahmedabad and Gujarat, two other centers which show no such initial increase, since the state had passed through the surveillance challenge of Zika virus in recent years [9]. Due to this confounding factor, it is not possible to use the data until April 10 or 11 to make any statistically valid measurement of the growth of the epidemic before the lock-down. Neglecting this leads to multiple fallacies, which I remark on next.

The apparent slowing down of the growth in later stages may be falsely interpreted as a transition to polynomial growth. As shown in eq. (A5) and eq. (A7), this is equivalent to a time dependent doubling interval. It has been discussed in the previous subsection that this leads to highly unlikely properties of the COVID-19 epidemic.

The same apparent slowing down of the growth rate in India has also been interpreted within the homogeneous SIR model with constant, time invariant, parameters [27]. In such a simple model the time dependence can only come from early evolution towards herd immunity.
This gives rise to the unlikely conclusion that herd immunity will be reached for COVID-19 while 99% of the population remains susceptible.

A misrepresentation of data also arises when "instantaneous doubling intervals" or similar measures of exponential growth are constructed using C(t) for one day, or averaged over small windows of time [28]. This shows a spurious gradual slowing of growth during the first three weeks of the lock-down. In later weeks these estimates are also plagued by spurious effects which result when delayed reports are dumped into cumulative numbers on one day instead of being assigned to the correct past dates. These appear as evidence of local spurts or slumps in growth. Evidence of retroactive corrections from [14] shows that delays of as much as ten days may occur. When these artifacts are averaged over a moving window, this gives the mistaken appearance of peaks and troughs, and may put erroneous pressure to change policies.

Due to the reasons discussed in the previous subsections, the period after April 10 or 11 constitutes the base data for the main part of this analysis. As shown in Figure 1, a constant growth rate in each locality during the lock-down models the data as well as a growth rate which changes linearly with time. This is also the most parsimonious hypothesis about the growth of the epidemic.

The observed doubling interval, and the derived quantity R_0, fall into three groups (see Table II and Figure 2). Several geographical regions have R_0 less than 3. Kerala has R_0 ≃ 1.7 (a doubling interval larger than t_2, the interval from the emergence of symptoms to death). Gujarat and Ahmedabad have R_0 higher than 3. Since this is the growth rate during the lock-down, population density effects are unlikely to be the major determinant of R_0.
It would be worthwhile to consider the role of individuals with an extremely large number of contacts in this context, or a significant tail of the distribution with a small number of contacts, but still above three.

Five regions pass the following data quality test: the values of R_0 obtained from the growth of fatalities and of cases are equal. This does not mean that the number of cases is correctly counted. Rather it indicates that the effort to find the cases requiring critical care, and to trace their contacts, has successfully resulted in tracking a constant fraction of all infected persons. It may miss, for example, a large fraction of asymptomatic cases.

For the five geographical regions which pass the quality test described in the previous section, a further study was performed. The dependence of the case fatality ratio, CFR (i.e., the ratio of the observed number of fatalities and cases), on R_0 was investigated. Although the number of cases identified may be much smaller than the actual number of cases, the chance that cases are identified in these five regions is expected to be similar, since the rate of testing is about the same.

A positive correlation between R_0 and CFR is observed. One possible reason for this is that with lower R_0 the number of critical cases grows slower, giving medical practitioners time to figure out good practices which prevent critical care patients from progressing to fatality. Deeper studies of this factor, comparing case data from different regions, are called for in future. It is possible that this is one of the most positive, and least discussed, outcomes of the lock-down.

Another possibility may also be conjectured. Careful maintenance of social distancing, necessary to reduce R_0, results in evolutionary pressure on the virus.
Lock-down and similar methods force the virus to evolve in a direction which maximizes its ability to reproduce, which it can do if the disease becomes less critical or asymptomatic, and the chances of fatality decrease. It would be interesting to compare different regions across the globe for changes in the serial time and CFR.

At the observed rate of growth, and with the current rate of testing, more than 0.5% of the population in hot spots will begin to test positive for infections in about a month. A constant rate of growth of infections means that the number of hospital beds needed will also grow at the same rate, for as long as the epidemic is growing. Even if the rate is slowed down heavily, as it is already in Delhi, Mumbai, and Chennai, the demand for hospital facilities will keep on growing, as long as the epidemic grows. This demand is already beginning to outstrip resources in the larger cities. The mean interval between the start of symptoms and discharge was estimated to be 24.7 days [22]. This means that unless the doubling interval is kept above 35 days (= 24.7/ln 2), the demand on hospitals will keep rising. Of the places we studied, only Kerala has begun to approach this break-even point.

CFR is currently small, partly because medical facilities have been able to cope with the rate of growth. If the number of cases exceeds the capacity of the medical system, cases which might have recovered will be harder to treat. Inevitably in such cases CFR will climb. It is useful to note that in Figure 3 the statewide figures for CFR are higher than those for cities. This is a reflection of the relative paucity of medical services outside cities, and is a pointer to what might happen when the number of infections rises beyond the sustainable capacity of hospitals.

I thank Rahul Banerjee, Prahlad Harsha, D. Indumathi, and R. Shankar for sharing collated data on various cities. I thank Jayasree Subramanian for providing me with the reference [14].
In this appendix the unit of time will be taken to be the inverse of the mean rate of fatality of the infected. In these units, R_0 is the average number of new infections caused by an infected person. R_0 depends on the infectivity of the virus, as well as the average degree of the contact network. As a result, it may be affected by public health policies, such as a lock-down. Say a policy measure has a time scale T. Due to this, R_0 may become time-dependent, and one may write a Taylor series expansion of R_0 in time, with leading term R_0^0 and linear coefficient R_0^1. One may introduce this into a typical epidemic model equation, to obtain

$$\frac{dI}{dt} = (R_0 - 1)\, I,$$

which gives

$$\log\frac{I(t)}{I(0)} = t\,(R_0^0 - 1) + \ldots$$

In this form of the equation time is measured in units of the case resolution time. This equation assumes that the fraction of susceptible persons is close to unity, and the fraction of persons in any other compartment is very small. As argued before, this is a reasonable assumption to make. The cumulative number of infections is then found by integration. There is no closed form result for the general case. If only the linear term in the expansion of R_0 is retained, then the result can be written in terms of the function Erfi [equation missing in source]. The function Erfi is defined through the integral

$$\mathrm{Erfi}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{x^2}\, dx.$$

It is possible to use an expansion for t ≪ T, which gives the form

$$I(t) = I(0)\, \frac{1}{\lambda_0}\, e^{\lambda_0 t} \left[ 1 + \epsilon \left( 1 - \lambda_0 t + \lambda_0^2 t^2 \right) + O(\epsilon^2) \right] \qquad (A5)$$

where the notation λ_0 = R_0^0 − 1 and ε = R_0^1/(λ_0^2 T) is introduced. The imaginary part vanishes exponentially. This is easy to match to the phenomenological form

$$I(t) = I(0)\, 2^{t/(\tau_0 + \tau_1 t/T')} = I(0)\, 2^{t/\tau_0} \left[ 1 - \ln 2\; \tau_1 \ldots \right]$$

where an artificial expansion parameter T′ is introduced. It is set to unity after expansion. Matching these two expansions is accurate only when λ_0 t is large. Then [equation missing in source]. The phenomenological parametrization of eq. (A2) can be connected to the parameters of (non-autonomous) evolution equations for the epidemic. Note that T and T′ are both regularization scales, in the sense of a renormalization group, whose numerical value need not be specified.

In order to change units of time to days, it is necessary to choose a model of the epidemic. If one uses the SEIR model, then the unit of time would be the median interval between the appearance of symptoms and the time of fatality or recovery, whichever is earlier. This quantity is t_2 = 17.8 days [22]. If one instead uses the SIR model, then it is appropriate to choose the unit of time to be the median interval between the beginning of the infection and the earlier of the time of fatality or recovery. This is t_1 + t_2, where t_1 is the median pre-symptomatic period, t_1 = 5.1 days. Here the conversion is made within the SEIR scheme. This gives [equation missing in source]. When a constant τ is used, one can set τ_1 = 0 in the above formulae and write R_0 and τ instead of R_0^0 and τ_0.

Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic
Changes in contact patterns shape the dynamics of the COVID-19 outbreak in China
Effect of non-pharmaceutical interventions to contain COVID-19 in China
First antibody surveys draw fire for quality, bias
Coronavirus disease 2019 (COVID-19): A literature review
Identification of burden hotspots and risk factors for cholera in India: An observational study
Emergencies Preparedness, Response
WHO, Zika Virus Infection - India, Emergencies Preparedness, Response
Strategy for COVID19 testing in India
Global Covid-19 Case Fatality Rates
Ministry of Health and Family Welfare, Government of India, Addressing Social Stigma Associated with COVID-19, undated advisory
Nationwide mortality studies to quantify causes of death: relevant lessons from India's Million Death Study
Amdavad Municipal Corporation, COVID-19 website
Greater Chennai Corporation, Official Twitter page
Collection of Press releases by Praveen Jadiya, Chief Medical and Health Officer
Municipal Commission of Greater Mumbai, Health Department, Official handle
District Information Office, Pune, Official Twitter account
Covid-19 India API
Estimates of the severity of Coronavirus disease 2019: a model based analysis
Inferring epidemic parameters for COVID-19 from fatality counts in Mumbai
Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship
Estimation of the asymptomatic ratio of novel coronavirus infections (COVID-19)
Coronavirus: Surveillance is the key, Kerala shows the way
Singapore Univ. of Tech. and Design, Data Driven Innovation Lab, Predictive Monitoring of COVID-19
Coronavirus (COVID-19) Cases, Our World in Data