key: cord-0844899-ynairc1u
authors: De Simone, Andrea; Piangerelli, Marco
title: A Bayesian approach for monitoring epidemics in presence of undetected cases
date: 2020-08-26
journal: Chaos Solitons Fractals
DOI: 10.1016/j.chaos.2020.110167
sha: 98e1579aa75fd1ee5a44178dc4c67152f1f182f0
doc_id: 844899
cord_uid: ynairc1u

One of the key indicators used in tracking the evolution of an infectious disease is the reproduction number. This quantity is usually computed using the reported number of cases, but ignoring that many more individuals may be infected (e.g. asymptomatic carriers). We develop a Bayesian procedure to quantify the impact of undetected infectious cases on the determination of the effective reproduction number. Our approach is stochastic, data-driven and not relying on any compartmental model. It is applied to the COVID-19 outbreak in eight different countries and all Italian regions, showing that the effect of undetected cases leads to estimates of the effective reproduction numbers larger than those obtained only with the reported cases by factors ranging from two to ten.

Tracking the evolution of the spread of an infectious disease is of primary importance during the whole course of any epidemic. An accurate evaluation of the transmission potential of the disease provides invaluable information to guide the decision-making process of control interventions, and to assess their effectiveness. One of the key epidemiological variables in this respect is the effective reproduction number R t , defined as the average number of secondary cases per primary case of infection. In order to stop an epidemic, R t needs to be persistently reduced to a level below 1. The issue of providing reliable estimates of R t is particularly severe now during the on-going COVID-19 pandemic [1] , and urgently calls for efforts towards a comprehensive mathematical modelling of the outbreak.

In this paper we propose a statistical framework for computing the effective reproduction number R t characterized by the following main features.

1. Stochastic. Our approach is stochastic, and not rooted in any deterministic framework of compartmental models, such as the Susceptible -Infectious -Recovered (SIR) model and its extensions. Although compartmental models can indeed provide very useful outcomes, especially for extrapolating the outbreak evolution into the near future, they rely on the simultaneous de- * Corresponding author. termination of all the coefficients appearing in the differential equations describing the dynamics of each compartment. On the other hand, in our approach the time evolution of the system is modelled as a stochastic process, with a probability distribution associated to each random variable. 2. Real-time. The method provides a time series of estimations of R t at each time step (e.g. one day), and not a single a posteriori value when the outbreak is almost over. 3. Bayesian. Within a Bayesian framework the results have a transparent probabilistic interpretation, the assumptions (priors) are explicit and their role is clearly tracked. A Bayesian updating procedure also accounts for the real-time evolution of the probabilities. 4. Comprehensive. Our method is explicitly carried out for taking into account the (unknown) number of undetected cases. However, it can be straightforwardly generalized to include any additional random variable affecting the reproduction number. Since the reproduction number is a highly complex quantity, affected by a wide variety of factors, such as biological, environmental and social factors [2] , this feature is particularly relevant.

Several methods for estimating the reproduction number have been proposed in the literature and have one or more of the above characteristics (see e.g. Ref. [3] for a review). The pioneering work of Ref. [4] laid the grounds for a stochastic approach to the problem. The sequential Bayesian method, which provides real-time estimates of R t is introduced in Ref. [5] . Comprehensive https://doi.org/10.1016/j.chaos.2020.110167 0960-0779/© 2020 Elsevier Ltd. All rights reserved. and real-time approaches, which also allow one to include the effect of undetected cases, have been recently developed in Refs. [6] [7] [8] , although within the framework of compartmental models. Our work builds upon the advances carried out in Refs. [9, 10] , where a real-time and Bayesian estimation method is developed, not tied to any compartmental model. The framework we develop in this work proceeds along the lines of generalizing the formalism of Refs. [9, 10] to include many other factors affecting the reproduction number directly or indirectly. To the best of our knowledge, our approach is the first one combining at once all the components listed above.

In particular, in this paper we are interested in assessing the role and impact of the number of undetected infection cases onto the effective reproduction number. There may be different reasons why an infected patient is undetected and does not appear in the official reports: individuals not showing the symptoms of the disease but are able to infect others (asymptomatic carriers), individuals whose symptoms have not been linked to the disease under consideration (especially in the early stages of the outbreak), impossibility of a complete population screening, etc.

The number of undetected cases of infection may indeed be rather large. Recent studies about the COVID-19 case reported that many confirmed infections were asymptomatic, with no statistically different viral load with respect to the symptomatic cases: 43% for the sample of Ref. [11] , 88% for the sample in Ref. [12] . These results justify the denomination of asymptomatic transmission of SARS-CoV-2 as the "Achilles' heel" of current COVID-19 containment strategies [13] .

We work in a Bayesian framework in which we treat all observations and parameters as random variables. The parameter of primary interest in this paper is the effective reproduction number R t , described by a continuous random variable R with prior probability density function p R ( r ). We then consider the observed incidence cases (number of new infected individuals at time t ) I t and the unknown number of undetected cases U t as positive integer-valued stochastic processes with discrete time index t .

We will also consider the stochastic process T t ≡ I t + U t , describing the total number of incident cases as a function of time, i.e. the sum of observed (reported) cases and the number of undetected cases. Notice that, in general, I t and U t are dependent, and their dependence is encoded by the conditional variable U t | I t .

The serial interval is described by the discrete random variable W . Its probability mass function p W ( w ) provides the probability of a secondary case arising w time steps after a primary case.

Given a time window of τ time steps, over which R is assumed to be constant, we can split the times into two intervals:

To avoid notational clutter, we will indicate by I < s the set of random variables before time s 

In our numerical simulations leading to the results displayed in Sections 2 and A.3 we set τ = 7 days.

By Bayes' theorem, at any given time t > τ , the posterior probability density of R given the serial interval distribution, the incidence data history and the undetected cases in the time window

where we assumed that W and R and independent (a generic dependence between them can be implemented in a straightforward way). Now, we can use the law of total probability to sum over the unknown number of undetected cases, assuming that U k depends only on I k . This way, we can get the posterior probability density for R given the serial interval distribution and the time series of incidence data up to time t

The sum over the undetected cases u k is ideally running up to the total population minus the observed cases i k (neglecting effects of acquired immunity), but in practice it is cutoff much earlier by the distribution p U k | I k , as discussed below.

The only assumptions made to derive the posterior probability density in Eq. (4) have been that W is independent of R and that U k only depends on I k . So, Eq. (4) quite generally describes how to incorporate the effect of undetected cases in a Bayesian statistical model for the effective reproduction number, given the serial interval and incidence data.

We now turn to formulate our assumptions about each of the terms appearing in Eq. (4) , and we will reach a simple analytical form, ready to use for numerical simulations.

The prior p R ( r ) for R is assumed to be uniform over the interval [0,40].

For the prior distribution p W ( w ) of the serial interval variable W we assume a continuous Gamma distribution (to be evaluated on integer values of w ) described by two parameters: the shape parameter a and the rate parameter b

The parameter values at 1 σ level have been reported in Ref. [14] as

The prior distributions of the serial interval parameters a, b are

The probability mass function of the undetected cases U t , which we already assumed to depend on I t only, can be modelled in many different ways, for example with decreasing probabilities associated to large values of undetected cases, and also in a timedependent way. In this paper we adopt the simplest assumption of a discrete uniform distribution with a single parameter C U :

. This way, we are describing the situation where the number of undetected cases at a given time t is constant and can be at most C U times the number of reported cases at the same time. We leave the investigation of alternative (and more realistic) scenarios to future work. So, the probability mass function of the number of undetected cases, conditioned on the values of the incidence data and the parameter C U is

where χ is the indicator function. The prior distribution of the continuous parameter C U is assumed to be an uninformative uniform prior between 0 and 2: C U ~Uniform([0,2]). The choice of the number 2 is of course subjective, although we believe it is a reasonable and conservative prior. The number of new total cases T k at time k within the time window [ t − τ, t] , given the previous incidence data, the serial interval data and the value of R , is assumed to be Poisson-distributed with parameter R k , where the total infection potential at a generic time t is defined by

At first approximation, this quantity can be considered as unaffected by the undetected cases, as the serial interval distribution is derived from tracking the secondary of reported cases. By considering the sample of secondary infections as representative of the population, the approximation above is justified. The generalizations to replace the Poisson distribution with a two-parameter negative binomial distribution [15] [16] [17] [18] and to include the undetected cases into t are left to future work. Therefore, we can write explicitly the probability mass function of T k | I < k , W, R as

Now we have collected all the ingredients of Eq. (4) , and we added the nuisance parameters a, b, C U to the model. So the joint posterior density of R and the nuisance parameters reads

By marginalizing over a, b, C U we finally get the posterior probability density for R

With a uniform prior for the number of undetected cases as described in Eq. (7) , the sum in the square bracket of Eq. (11) can be computed analytically in terms of the regularized upper incomplete gamma function Q ( s, x ) as

At any time t > τ , from the conditional posterior probability density p R | I ≤t ,W it is possible to compute the mean and the 95% central credible intervals. In particular, the effective reproduction number R t is the expected value of R conditioned on the past incidence data and serial interval

where the conditional probability density p R | I ≤t ,W , marginalized over all nuisance parameters, is given in general by Eq. (11) , and by Eq. (12) for the particular case of uniform distribution of the number of undetected cases. 

We now turn to apply the statistical procedure developed in the previous section to real data of the COVID-19 outbreak in eight different countries, using daily incidence data from Ref. [19] (see also Appendix A.2 ). In Appendix A.3 , we also analyze the incidence data for the twenty Italian regions.

The computation of R t starts from the first day reported on the dataset. It is worth mentioning that the starting date is not the same for all the countries taken into account. In the following we show the time evolution of the mean of the posterior distribution of R , as depicted in Fig. 1 , and the corresponding 95% central credible interval. The results are shown in Figs. 2 and 3 . By applying the full procedure described in the previous section, which delivers the posterior distribution marginalized over all the nuisance parameters of the serial interval and undetected cases distributions, we obtain our main results for R t (blue solid line). 1 As benchmarks, we 1 Open-source code available at github.com/de-simone/epidemic-Rt-Bayesian . also show the results obtained with only reported cases of infection and with fixed serial interval distribution (black dashed line), and the results obtained with only reported cases and marginalizing over the parameters of the serial interval distribution (green dash-dotted line).

Wherever applicable, we also report the dates when contagion containment measures have been enforced. In Table 1 we report the numerical results for the last day of our analysis.

Our results clearly show that R t , after a transitory period, is in a down trend in all the countries we considered. In all the countries considered, we find values of the mean value R t larger than the ones without considering the undetected cases by factors of about 2 -4. This is a somewhat expected consequence of our conservative choice of C U being at most 2. By allowing more undetected cases, the resulting reproduction numbers would be necessarily higher. Furthermore, the upper values of the 95% credible intervals are larger than the estimate with only officially reported cases by factors up to 10, and they are all significantly above 1. Similar results are obtained for the Italian regions (as shown in A.3 ). 

The values of the effective reproduction number R t for COVID-19, on the last day of our analysis (2020-05-08), for each of the countries we considered. The second column reports the value we find by including only reported incidence data and mean values of the serial interval distribution. On the third column we report R t from the posterior distribution, marginalized over the nuisance parameters describing the serial interval and undetected cases distributions. The corresponding 95% credible interval is reported on the last column. We emphasize that these results depend on the assumptions about the prior distributions, as described in Section 2 . However, the method itself is quite general and it is straightforward to implement any other choice of priors.

In this paper we took the first steps towards a comprehensive stochastic modelling of the effective reproduction number R t during an epidemic. We followed a completely general approach, which enables one to account for any random variable affecting R t . In particular, our primary focus was on assessing the impact on R t of the number of undetected infection cases.

We investigated the time evolution of the posterior probability density of the random variable describing R t , marginalized over the parameters of the serial interval and undetected cases distributions. The application of our method to the COVID-19 outbreak in different countries show that the effective reproduction number is largely affected by the undetected cases, and in general it increases by factors of order 2 to 10.

There are several directions in which further research can be carried out, aiming at expanding the capabilities of the basic framework described in this paper. For instance, it is desirable to explore a more realistic model of the probability distribution of undetected cases, also including time dependence. Furthermore, it is also possible to adopt a fully data-driven approach by performing non-parametric estimation of the serial interval distribution from transmission chains data.

The stochastic approach outlined in this paper is not designed to establish or predict any cause-effect relationship between the R t trend and the enforcement of the containment measures. It is nevertheless possible to use our results (especially the credible intervals) to define some robust criterion for evaluating the effectiveness of the containment measures.

Our findings can be used by public health institutions to take more informed decisions about designing and gauging the strategies of infection containment. According to our results, we recommend caution in deciding when and how to relax the containment measures based on the value of the reproduction number, since it may be much larger than usually estimated.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. 

The serial interval is the number of days occurring from the onset of symptoms in a patient and the onset of symptoms in a secondary patient infected by the primary one. For modelling the serial interval we used a gamma distribution (see Eq. (5) ). We simulate different realizations of the gamma distribution considering the parameters a and b normally distributed, with mean and variance reported in Eq. (6) . The resulting 95% confidence interval is shown in Fig. 4 .

The daily incidence data, i.e. the number of new positive cases on each day, we used for the plots in Section 2 are plotted in Fig. 5 . Notice how the pattern of disease incidence in Italy, Spain, France, Germany, UK and USA are different from the pattern in South Korea and Sweden. For Spain, a negative incidence of -1400 was reported on 2020-04-19. We believe this is due to fixing numbers which were incorrectly reported previously. In order to deal with positive incidence data only, we adjust the negative number of cases reported on 2020-04-19 by distributing the cases of the previous and following day. In particular, we assign one-half of the reported cases on 2020-04-18 and one-fourth of the cases on 2020-04-20 to the daily incidence on 2020-04-19, resulting in a total of 6) . The shaded region is the 95% confidence interval.

The values of the effective reproduction number R t for COVID-19, on the last day of our analysis (2020-05-08), for each of the countries we considered. The second column reports the value we find by including only reported incidence data and mean values of the serial interval distribution. On the third column we report R t from the posterior distribution, marginalized over the nuisance parameters describing the serial interval and undetected cases distributions. The corresponding 95% credible interval is reported on the last column. 590 cases on 2020-04-19. In Fig. 5 we also superimpose the effective reproduction number R t .

We apply our statistical model, described in Section 2 , to the incidence data in the twenty Italian regions, retrieved from Ref. [20] . The computation of R t starts 5 days after the first reported day, that is the same for all the regions. This choice is adopted in order to bypass the uncertainty and delay in collected data in the very first days. For the time window length we use τ = 7 days. The results for all regions in alphabetical order are shown in Figs. 6-10 , and Table 2 . 

World Health Organization. Coronavirus disease (covid-19) pandemic

Complexity of the basic reproduction number (r0)

The R0 package: a toolbox to estimate reproduction numbers for epidemic outbreaks

Different Epidemic Curves for Severe Acute Respiratory Syndrome Reveal Similar Impacts of Control Measures

Real time bayesian estimation of the epidemic potential of emerging infectious diseases

Modelling the covid-19 epidemic and implementation of population-wide interventions in italy

Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov-2)

Asymptomatic infectives and r 0 for covid

A New Framework and Software to Estimate Time-Varying Reproduction Numbers During Epidemics

Improved inference of time-varying reproduction numbers during infectious disease outbreaks

Suppression of covid-19 outbreak in the municipality of vo

Universal screening for sars-cov-2 in women admitted for delivery

Asymptomatic transmission, the achilles heel of current strategies to control covid-19

The early phase of the covid-19 outbreak in lombardy, italy

Stochastic processes and population growth

Deterministic and stochastic models for recurrent epidemics

Modelling biological populations in space and time

Dynamics of measles epidemics: estimating scaling of transmission rates using a time series sir model

European Centre for Disease Prevention and Control

Coronavirus measures may have already averted up to 12010 0 0 deaths across europe