key: cord-0565285-itdr1r67 authors: Simone, Andrea De; Piangerelli, Marco title: The impact of undetected cases on tracking epidemics: the case of COVID-19 date: 2020-05-13 journal: nan DOI: nan sha: 932d5e546759f2ef8810e10e08ac85e1afec7b28 doc_id: 565285 cord_uid: itdr1r67

One of the key indicators used in tracking the evolution of an infectious disease is the reproduction number. This quantity is usually computed using the reported number of cases, ignoring that many more individuals may be infected (e.g. asymptomatics). We propose a statistical procedure to quantify the impact of undetected infectious cases on the determination of the effective reproduction number. Our approach is stochastic, data-driven and does not rely on any compartmental model. It is applied to the COVID-19 case in eight different countries and all Italian regions, showing that the effect of undetected cases leads to estimates of the effective reproduction number larger than those obtained with the reported cases alone, by factors ranging from two to ten. Our findings urge caution in deciding when and how to relax containment measures based on the value of the reproduction number.

Tracking the evolution of the spread of an infectious disease is of primary importance during the whole course of any epidemic. An accurate evaluation of the transmission potential of the disease provides invaluable information to guide the decision-making process of control interventions, and to assess their effectiveness. One of the key epidemiological variables in this respect is the effective reproduction number $R$, defined as the average number of secondary cases per primary case of infection. In order to stop an epidemic, $R$ needs to be persistently reduced to a level below 1. The issue of providing reliable estimates of $R$ is particularly pressing during the ongoing COVID-19 pandemic [1], and urgently calls for efforts towards a comprehensive mathematical modelling of the outbreak.
In this paper we propose a statistical framework for computing the effective reproduction number $R$, characterized by the following main features.

1. Stochastic. Our approach is stochastic, and not rooted in any deterministic framework of compartmental models, such as SIR and its extensions. Although compartmental models can indeed provide very useful outcomes, especially for extrapolating the outbreak evolution into the near future, they rely on the simultaneous determination of all the coefficients appearing in the differential equations describing the dynamics of each compartment.
2. Real-time. The method provides a time series of estimates of $R$ at each time step (e.g. one day), and not a single a posteriori value when the outbreak is almost over.
3. Bayesian. Within a Bayesian framework the results have a transparent probabilistic interpretation, the assumptions (priors) are explicit and their role is clearly tracked. A Bayesian updating procedure also accounts for the real-time evolution of the probabilities.
4. Comprehensive. Our method is explicitly designed to take into account the (unknown) number of undetected cases. However, it can be straightforwardly generalized to include any additional random variable affecting the reproduction number. Since the reproduction number is a highly complex quantity, affected by a wide variety of biological, environmental and social factors [2], this feature is particularly relevant.

Several methods for estimating the reproduction number have been proposed in the literature and have one or more of the above characteristics (see e.g. Refs. [3, 4, 5, 6, 7, 8, 9]). To the best of our knowledge, however, our approach is the first to combine all of these components at once. In particular, in this paper we are interested in assessing the role and impact of the number of undetected infection cases on the effective reproduction number.
There may be different reasons why an infected patient is undetected and does not appear in the official reports: individuals who do not show the symptoms of the disease but are able to infect others (asymptomatics), individuals whose symptoms have not been linked to the disease under consideration (especially in the early stages of the outbreak), the impossibility of a complete population screening, etc. For COVID-19, the number of undetected cases of infection may indeed be rather large. According to Ref. [10], in the small Italian town of Vo' Euganeo 43% of the confirmed infections were asymptomatic, with no statistically significant difference in viral load with respect to the symptomatic cases. Another study, performed at the NewYork-Presbyterian Allen Hospital and Columbia University Irving Medical Center in NYC, found that 29 of the 33 patients who were positive for SARS-CoV-2 (87.9%) had no COVID-19 symptoms [11]. These results justify describing asymptomatic transmission of SARS-CoV-2 as the "Achilles' heel" of current COVID-19 containment strategies [12].

We develop a Bayesian statistical formulation of the evolution of the effective reproduction number, building upon the works of Refs. [4, 5, 6]. Full details about our model and calculations are provided in Supplementary Material S.1. It is important to remark that our approach is stochastic and data-driven, and does not rely on any deterministic compartmental model. The number of newly infected individuals at a given time is modelled as a discrete-time stochastic process. Among the many variables affecting the effective reproduction number $R$ [2], we choose as the most relevant ones the time series of disease incidence data up to time $t$, $I_{\le t} = \{I_0, I_1, \ldots, I_t\}$, and the serial interval $W$, i.e. the time from symptom onset in a primary case to symptom onset in his/her secondary cases.
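As a rough illustration of what a discrete-time stochastic incidence process driven by $R$ and the serial interval can look like, the sketch below simulates daily counts from a Poisson renewal model, in which the expected number of new cases equals $R$ times the past incidence weighted by the serial interval distribution. The renewal form, the helper names (`poisson_sample`, `simulate_incidence`) and all numerical values are assumptions chosen for illustration; none of them come from the data analysed in this paper.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw from Poisson(lam): Knuth's method for small lam, normal approximation otherwise."""
    if lam <= 0:
        return 0
    if lam < 30:
        threshold, k, prod = math.exp(-lam), 0, rng.random()
        while prod > threshold:
            k += 1
            prod *= rng.random()
        return k
    return max(0, round(rng.gauss(lam, math.sqrt(lam))))

def simulate_incidence(r, serial_pmf, seed_cases, n_steps, rng):
    """Simulate daily incidence from a Poisson renewal process with constant r."""
    incidence = list(seed_cases)
    for _ in range(n_steps):
        t = len(incidence)
        # Expected new cases: r times past incidence weighted by the serial interval.
        potential = sum(serial_pmf[s - 1] * incidence[t - s]
                        for s in range(1, min(t, len(serial_pmf)) + 1))
        incidence.append(poisson_sample(r * potential, rng))
    return incidence

if __name__ == "__main__":
    rng = random.Random(0)
    serial_pmf = [0.2, 0.5, 0.3]  # hypothetical P(W = 1), P(W = 2), P(W = 3)
    print(simulate_incidence(1.5, serial_pmf, [10], 20, rng))
```

Values of `r` above 1 tend to produce growing counts and values below 1 decaying ones, which is the behaviour the inference procedure has to recover from a single observed series.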
At any time $t$ during an epidemic, the posterior probability density of $R$, conditioned on the past incidence data and serial interval, $p_{R|I_{\le t},W}(r|i_{\le t}, w)$, encodes a great deal of information about the current state of the outbreak. In particular, the effective reproduction number at time $t$ ($R_t$) can be derived from it as the expected value,

$$R_t = \int r \, p_{R|I_{\le t},W}(r|i_{\le t}, w)\, dr, \qquad (1)$$

together with the 95% central credible intervals (see Fig. 1). As initial prior for $R$ we adopt an uninformative uniform distribution throughout the paper. Our mathematical formalism is general and flexible enough to incorporate in a straightforward way any other random variable affecting $R$, in addition to the incidence and serial interval usually considered. In particular, in this paper we consider the impact on $R$ of the unknown number of undetected cases, provided we make assumptions about their probability distribution. The general posterior probability density of $R$ needed in Eq. (1), given Poisson-distributed incidence data, serial interval data and the undetected cases, is reported in Eq. (S.11), including the marginalization over all nuisance parameters.

Our results allow one to track the effective reproduction number in real time during an outbreak, including the effects of undetected cases, under general assumptions. We also took a further step by assuming a parametric form for the serial interval distribution, and uniform probabilities for the undetected cases. In this simple case we were able to compute the posterior density $p_{R|I_{\le t},W}(r|i_{\le t}, w)$ analytically in closed form (see Eq. (S.12)). This is the form we apply to the COVID-19 data, as we now turn to discuss.

We employ COVID-19 daily incidence data for eight different countries (France, Germany, [...] $R$ over $C_U$ is cut off at $C_U = 2$. Of course this choice of the prior for $C_U$ is subjective, and we believe it is also rather conservative. The computation of $R$ starts from the first day reported in the dataset.
It is worth mentioning that the starting date is not the same for all the countries taken into account. The results are shown in Fig. 2 and Fig. 3. The figures show the time evolution of $R$ and, wherever applicable, the dates when contagion containment measures were enforced. In Table 1 we report the numerical results for the last day of our analysis. Our results clearly show that $R$, after a transitory period, is on a downward trend in all the countries we considered. In all of these countries, we find values of the mean $R_t$ larger than those obtained without considering the undetected cases, by factors of about 2 to 4. This is a somewhat expected consequence of our conservative choice of $C_U$ being at most 2. Allowing for more undetected cases would necessarily result in higher reproduction numbers. Furthermore, the upper values of the 95% credible intervals are larger than the estimate based only on officially reported cases by factors of up to 10, and they are all significantly above 1.

In this paper we took the first steps towards a comprehensive stochastic modelling of the effective reproduction number $R$ during an epidemic. We followed a completely general approach, which enables one to account for any random variable affecting $R$. In particular, our primary focus was on assessing the impact on $R$ of the number of undetected infection cases. We investigated the time evolution of the posterior probability density of $R$, marginalized over the parameters of the serial interval and undetected cases distributions. The application of our method to the COVID-19 outbreak in different countries shows that the effective reproduction number is strongly affected by the undetected cases, and in general it increases by factors of order 2 to 10. There are several directions in which further research can be carried out, aiming at expanding the capabilities of the basic framework described in this paper.
For instance, it is desirable to explore a more realistic model of the probability distribution of undetected cases, also including time dependence. Furthermore, it is also possible to adopt a fully data-driven approach by performing non-parametric estimation of the serial interval distribution from transmission chain data. The stochastic approach outlined in this paper is not designed to establish or predict any cause-effect relationship between the $R_t$ trend and the enforcement of the containment measures [...]

We work in a Bayesian framework in which we treat all observations and parameters as random variables. The parameter of primary interest in this paper is the effective reproduction number $R$, considered as a continuous random variable with prior probability density function $p_R(r)$. We then consider the observed incidence cases (number of newly infected individuals at time $t$) $I_t$ and the unknown number of undetected cases $U_t$ as non-negative integer-valued stochastic processes with discrete time index $t$. We will also consider the stochastic process $T_t \equiv I_t + U_t$, describing the total number of incident cases as a function of time, i.e. the sum of observed (reported) cases and undetected cases. We ignore imported cases. Notice that, in general, $I_t$ and $U_t$ are dependent, and their dependence is encoded by the conditional variable $U_t|I_t$. The serial interval is described by the discrete random variable $W$. Its probability mass function $p_W(w)$ provides the probability of a secondary case arising $w$ time steps after a primary case. Given a time window of $\tau$ time steps, over which $R$ is assumed to be constant, we can split the times into two intervals: $0 \le k \le t - \tau - 1$ and $t - \tau \le k \le t$.
To avoid notational clutter, we will indicate by $I_\tau$ [...] the posterior probability density of $R$ given the serial interval distribution, the incidence data history and the undetected cases in the time [...] where we assumed that $W$ and $R$ are independent (a generic dependence between them can be implemented in a straightforward way). Now, we can use the law of total probability to sum over the unknown number of undetected cases, assuming that $U_k$ depends only on $I_k$. This way, we get the posterior probability density for $R$ given the serial interval distribution and the time series of incidence data up to time $t$, Eq. (S.4). The sum over the undetected cases $u_k$ ideally runs up to the total population minus the observed cases $i_k$ (neglecting effects of acquired immunity), but in practice it is cut off much earlier by the distribution $p_{U_k|I_k}$, as discussed below.

The only assumptions made to derive the posterior probability density in Eq. (S.4) are that $W$ is independent of $R$ and that $U_k$ depends only on $I_k$. So, Eq. (S.4) describes quite generally how to incorporate the effect of undetected cases in a Bayesian statistical model for the effective reproduction number, given the serial interval and incidence data. We now turn to formulating our assumptions about each of the terms appearing in Eq. (S.4), and we will reach a simple analytical form, ready to use for numerical simulations.

The prior $p_R(r)$ for $R$ is assumed to be uniform. For the prior distribution $p_W(w)$ of the serial interval variable $W$ we assume a continuous Gamma distribution (to be evaluated on integer values of $w$) described by two parameters, the shape parameter $a$ and the rate parameter $b$:

$$p_W(w) = \frac{b^a}{\Gamma(a)}\, w^{a-1} e^{-b w}. \qquad (S.5)$$

The probability mass function of the undetected cases $U_t$, which we already assumed to depend on $I_t$ only, can be modelled in many different ways, for example with decreasing probabilities associated with large values of undetected cases, and also in a time-dependent way.
In this paper we adopt the simplest assumption of a discrete uniform distribution with a single parameter, $C_U$. This way, we are describing the situation where the number of undetected cases at a given time $t$ can be at most $C_U$ times the number of reported cases at the same time. We leave the investigation of alternative (and more realistic) scenarios to future work. So, the probability mass function of the number of undetected cases, conditioned on the values of the incidence data and the parameter $C_U$, is constant on this range and is written in terms of the indicator function $\chi$. The prior distribution of the continuous parameter $C_U$ is assumed to be an uninformative uniform prior between 0 and 2: $C_U \sim \mathrm{Uniform}([0, 2])$. The choice of the number 2 is of course subjective, although we believe it is a reasonable and conservative prior.

The number of new total cases $T_k$ at time $k$ within the time window $[t - \tau, t]$, given the previous incidence data, the serial interval data and the value of $R$, is assumed to be Poisson-distributed with parameter $R\Lambda_k$, where the total infection potential at a generic time $t$ is defined as

$$\Lambda_t = \sum_{s=1}^{t} p_W(s)\, I_{t-s}.$$

To a first approximation, this quantity can be considered as unaffected by the undetected cases, as the serial interval distribution is derived from tracking the secondary cases of reported cases. By considering the sample of secondary infections as representative of the population, the approximation above is justified. The generalizations to replace the Poisson distribution with a two-parameter negative binomial distribution and to include the undetected cases into $\Lambda_t$ are left to future work. Therefore, we can write explicitly the probability mass function of $T_k|I_\tau$, [...] From this conditional posterior probability density it is possible to compute the mean and the 95% central credible intervals.
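The ingredients listed above can be combined numerically on a grid of $R$ values. The following is a minimal sketch under stated assumptions, not the closed-form result of Eq. (S.12): the Gamma density is renormalised over integer lags, the undetected cases are taken uniform on $\{0, \ldots, \lfloor C_U i_k \rfloor\}$ with a single $C_U$ shared by all days in the window, the Gamma parameters are held fixed rather than marginalised, and the function names, case counts and parameter values are hypothetical.

```python
import math

def gamma_serial_pmf(shape, rate, max_lag):
    """Gamma density at integer lags 1..max_lag, renormalised to sum to 1 (assumption)."""
    dens = [math.exp(shape * math.log(rate) + (shape - 1) * math.log(w)
                     - rate * w - math.lgamma(shape))
            for w in range(1, max_lag + 1)]
    total = sum(dens)
    return [d / total for d in dens]

def infection_potential(incidence, pmf, k):
    """Lambda_k: reported past incidence weighted by the serial interval PMF."""
    return sum(pmf[s - 1] * incidence[k - s]
               for s in range(1, min(k, len(pmf)) + 1))

def poisson_pmf(n, mean):
    """Poisson probability of n events with the given mean."""
    if mean <= 0:
        return 1.0 if n == 0 else 0.0
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))

def posterior_r(incidence, pmf, tau, r_grid, c_grid):
    """Posterior density of R on r_grid (uniform priors on R and C_U)."""
    t = len(incidence) - 1
    window = range(t - tau, t + 1)
    c_max = max(c_grid)
    post = []
    for r in r_grid:
        # For each day k, running sums over u of Poisson(i_k + u | r * Lambda_k).
        cums = []
        for k in window:
            mean = r * infection_potential(incidence, pmf, k)
            acc, cum = 0.0, []
            for u in range(int(c_max * incidence[k]) + 1):
                acc += poisson_pmf(incidence[k] + u, mean)
                cum.append(acc)
            cums.append(cum)
        # Average over the C_U grid; u is uniform on {0, ..., floor(c * i_k)}.
        total = 0.0
        for c in c_grid:
            like = 1.0
            for k, cum in zip(window, cums):
                m = int(c * incidence[k])
                like *= cum[m] / (m + 1)
            total += like
        post.append(total / len(c_grid))
    dr = r_grid[1] - r_grid[0]
    norm = sum(post) * dr
    return [p / norm for p in post]

def summarize(r_grid, post):
    """Posterior mean and 95% central credible interval on the grid."""
    dr = r_grid[1] - r_grid[0]
    mean = sum(r * p for r, p in zip(r_grid, post)) * dr
    cdf, lo, hi = 0.0, None, None
    for r, p in zip(r_grid, post):
        cdf += p * dr
        if lo is None and cdf >= 0.025:
            lo = r
        if hi is None and cdf >= 0.975:
            hi = r
    return mean, lo, hi

if __name__ == "__main__":
    cases = [5, 8, 12, 15, 20, 24, 30, 33, 38, 41, 45, 44, 47, 50]  # made-up counts
    pmf = gamma_serial_pmf(shape=2.0, rate=0.4, max_lag=12)          # hypothetical a, b
    r_grid = [0.02 * j for j in range(1, 501)]
    reported_only = posterior_r(cases, pmf, 6, r_grid, [0.0])
    with_undetected = posterior_r(cases, pmf, 6, r_grid, [0.0, 0.5, 1.0, 1.5, 2.0])
    print("reported only:  ", summarize(r_grid, reported_only))
    print("with undetected:", summarize(r_grid, with_undetected))
```

Passing `c_grid=[0.0]` recovers the reported-cases-only posterior, so the two calls in the usage example give a direct comparison of the mean and credible interval with and without undetected cases.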
In particular, the effective reproduction number $R_t$ is the expected value of $R$ conditioned on the past incidence data and serial interval,

$$R_t = \int r \, p_{R|I_{\le t},W}(r|i_{\le t}, w)\, dr,$$

where the conditional probability density $p_{R|I_{\le t},W}(r|i_{\le t}, w)$, marginalized over all nuisance parameters, is given by Eq. (S.11) in general, and by Eq. (S.12) for the particular case of a uniform distribution of the number of undetected cases.

The daily incidence data, i.e. the number of new positive cases on each day, that we used for the plots in Section 2 are shown in Figure S.1. Notice how the pattern of disease incidence in Italy, Spain, France, Germany, UK and USA is different from the pattern in South Korea and Sweden. For Spain, a negative incidence of -1400 was reported on 2020-04-19. We believe this is due to a correction of numbers that had previously been reported incorrectly. In order to deal with positive incidence data only, we adjust the negative number of cases reported on 2020-04-19 by redistributing cases from the previous and following days. In particular, we assign one-half of the reported cases on 2020-04-18 and one-fourth of the cases on 2020-04-20 to the daily incidence on 2020-04-19, resulting in a total of 590 cases on 2020-04-19.

The serial interval is the number of days between the onset of symptoms in a primary patient and the onset of symptoms in a secondary patient infected by the primary one. For modelling the serial interval we used a gamma distribution (see Eq. (S.5)). We simulate different realizations of the gamma distribution considering the parameters $a$ and $b$ normally distributed, with mean and variance reported in Eq. (S.6). The resulting 95% confidence interval is shown in Figure S.2.

We apply our statistical model, described in Section S.1, to the incidence data in the twenty Italian regions, retrieved from Ref. [16]. The computation of $R_t$ starts 5 days after the first reported day, which is the same for all the regions.
This choice is adopted in order to bypass the uncertainty and delay in collected data in the very first days. For the time window length we [...] our analysis (2020-05-08), for each of the Italian regions we considered. The second column reports the value we find by including only reported incidence data and mean values of the serial interval distribution. In the third column we report the mean of the posterior distribution of $R$, marginalized over the nuisance parameters describing the serial interval and undetected cases distributions. The corresponding 95% credible interval is reported in the last column.

Coronavirus disease (COVID-19) pandemic
Complexity of the basic reproduction number (R0)
Different Epidemic Curves for Severe Acute Respiratory Syndrome Reveal Similar Impacts of Control Measures
Real time Bayesian estimation of the epidemic potential of emerging infectious diseases
A New Framework and Software to Estimate Time-Varying Reproduction Numbers During Epidemics