key: cord-0779765-wv7c8938
authors: Böhning, Dankmar; Rocchetti, Irene; Maruotti, Antonello; Holling, Heinz
title: Estimating the undetected infections in the Covid-19 outbreak by harnessing capture-recapture methods
date: 2020-06-11
journal: Int J Infect Dis
DOI: 10.1016/j.ijid.2020.06.009
sha: 087735068f19874a30d47eac13d286258c8aa524
doc_id: 779765
cord_uid: wv7c8938

OBJECTIVES: A major open question, affecting the decisions of policy makers, is the estimation of the true number of Covid-19 infections. Most infections go undetected because of the large number of asymptomatic cases. We provide an efficient, easy-to-compute and robust lower bound estimator for the number of undetected cases.

METHODS: A modified version of the Chao estimator is proposed, based on the cumulative time-series distributions of cases and deaths. Heterogeneity is addressed by assuming a geometric distribution underlying the data generation process. An (approximate) analytical variance of the estimator is derived to compute reliable confidence intervals at the 95% level.

RESULTS: A motivating application to the Austrian situation is provided and compared with an independent and representative study on the prevalence of Covid-19 infection. Our estimates match well with the results from the independent prevalence study, but the capture-recapture estimate involves less uncertainty as it is based on a larger sample size. Results from other European countries are mentioned in the discussion. The estimated ratio of the total estimated cases to the observed cases is around 2.3 for all the analyzed countries.

CONCLUSIONS: The proposed method answers a fundamental open question: “How many undetected cases are going around?”. CR methods provide a straightforward solution to shed light on undetected cases, incorporating heterogeneity that may arise in the probability of being detected.
[...] with large, almost unacceptable, uncertainty on the obtained estimates. As mentioned above, several methods have been proposed to estimate the undetected number [...] sense, the most obvious method to estimate a dark number. For more details see Böhning (2016).

Hence, the purpose of this contribution is to propose a lower bound estimator for the number of people affected by Covid-19 but not detected for various reasons, the major one being that they are asymptomatic. In other words, the aim is to estimate the size of an elusive, i.e. partially [...]

This short note is organized as follows. In Section 2, we introduce the basic notation and how we are going to work with the cumulative distribution of observed cases and deaths. Section [...]

We denote by $N(t)$ the cumulative count of infections at day $t$, where $t = t_0, \dots, t_m$. Hence $\Delta N(t) = N(t) - N(t-1)$ is the count of new infections at day $t$, where $t = t_0 + 1, \dots, t_m$. Also, let $D(t)$ denote the cumulative count of deaths at day $t$, where $t = t_0, \dots, t_m$. Here $t_0$ defines the beginning of the observational period and $t_m$ defines the end. We make the trivial assumption $t_m > t_0$, so that the observational window is not empty. Again, we denote by $\Delta D(t) = D(t) - D(t-1)$ the count of new deaths at day $t$, where $t = t_0 + 1, \dots, t_m$. To illustrate, we look at these data (taken from https://www.worldometers.info/coronavirus/country/austria/) for Austria, as provided in Table 1 for the infections and in Table 2 for the deaths.

[...] in this sampling process. Also, let $p_x$ denote the probability of identifying a unit $x$ times, where $x = 0, 1, \dots$.
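The daily increments $\Delta N(t)$ and $\Delta D(t)$ defined above are simply first differences of the cumulative series. A minimal sketch in R, using made-up cumulative counts (not the Austrian data of Tables 1 and 2) and hypothetical variable names:

```r
# Hypothetical cumulative counts for five consecutive days (illustrative only)
N <- c(10, 14, 21, 33, 50)   # cumulative infections N(t), t = t0, ..., tm
D <- c(0, 0, 1, 1, 3)        # cumulative deaths D(t),     t = t0, ..., tm

# Daily increments for t = t0 + 1, ..., tm
dN <- diff(N)   # new infections per day
dD <- diff(D)   # new deaths per day

print(dN)
print(dD)
```

Each increment vector is one element shorter than the cumulative series, which reflects that the increments are only defined from day $t_0 + 1$ onwards.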
In the capture-recapture world the following mixture model is quite common: [...] In (1) [...] this assumption we allow the parameter $\theta$ to vary in the population with arbitrary unknown distribution $f(\theta)$ to reflect varying identification probabilities across the target population:

$$p_x = \int_0^1 (1-\theta)^x \, \theta \, f(\theta) \, d\theta, \qquad x = 0, 1, \dots \qquad (2)$$

Often the Poisson distribution is used in (2) instead of the geometric distribution. However, we prefer the latter, because the geometric distribution can be viewed as a Poisson distribution mixed with an exponential density; hence the geometric already incorporates some of the heterogeneity that is likely present in the population.

We assume that model (2) is valid, which we consider a weak assumption. Then, using the Cauchy-Schwarz inequality for moments, it is possible to show that for the probability $p_0$ of missing a unit of interest the following inequality holds:

$$p_0 \ge \frac{p_1^2}{p_2} \qquad (3)$$

Replacing $p_1$ and $p_2$ on the right-hand side of (3) with the observed frequencies $f_1$ of those identified exactly once and $f_2$ of those identified exactly twice leads to the lower bound estimate of Chao (Chao, 1987, 1989; Chao and Colwell, 2017):

$$\hat{f}_0 = \frac{f_1^2}{f_2} \qquad (4)$$

Here $f_0$ is the frequency of units that remain unobserved or hidden, for which (4) is a lower bound estimate. In the case of no heterogeneity, (4) is a direct estimate of $f_0$. Chao's lower bound has also been generalized to include covariate information such as regional information [...]

The idea is to apply the estimator (4) day-wise. We take an arbitrary day $t$. On this day we have $\Delta N(t)$ new infections. This will be viewed as $f_1$, the infected people identified just once. If we look at $\Delta N(t-1)$, this is the count of new infections the day before. These people will still be infected at day $t$ unless they have died. So $f_2$ corresponds to $\Delta N(t-1) - \Delta D(t)$.
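To make the day-wise substitution concrete, here is a small worked example in R for a single day. It assumes the lower bound (4) takes the geometric form $f_1^2 / f_2$; the counts and variable names are hypothetical and chosen only for illustration.

```r
# Hypothetical counts for one day t (illustrative only, not Austrian data)
dN_t    <- 12   # new infections at day t
dN_prev <- 7    # new infections at day t - 1
dD_t    <- 1    # new deaths at day t

f1 <- dN_t              # identified exactly once: new cases at day t
f2 <- dN_prev - dD_t    # previous-day cases still alive at day t
f0_hat <- f1^2 / f2     # assumed geometric-type lower bound on hidden cases

print(f0_hat)
```

With these numbers, $f_2 = 6$ and the lower bound for that day is $144 / 6 = 24$ hidden cases.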
We can apply this for each day $t$ of the observational period and sum the day-wise estimates:

$$\sum_t \frac{[\Delta N(t)]^2}{\Delta N(t-1) - \Delta D(t)} \qquad (5)$$

We will use a bias-corrected form of (5) suggested by Chao (1989) and given as

$$\sum_t H(t), \qquad H(t) = \frac{\Delta N(t)\,[\Delta N(t) - 1]}{1 + \Delta N(t-1) - \Delta D(t)} \qquad (6)$$

We adopt the convention that $\Delta N(t-1) - \Delta D(t)$ is set to 0 if it becomes negative; in other words, the denominator uses $\max\{0, \Delta N(t-1) - \Delta D(t)\}$.

[...] Var $H(t)$.

The results are provided in Table 3 for the country of Austria which includes estimates of the [...]

We have applied the capture-recapture approach using Chao's estimator to large entities such as countries in Europe. However, the approach can also be used to indicate regional variation, in other words applied to smaller geographical or administrative units. In addition, if age-specific numbers are provided, Chao's estimator can be applied in an age-stratified way. Another question relates to the size of the observational period. In the case study we have [...]

The example provided here relies on Austrian data, but many other countries can be analyzed. Considering data from the day on which the first death was recorded, we obtain the estimates of undetected cases for Italy, Germany, Spain, UK and Greece (see Table 4). The last column in Table 4 shows the ratio of the total estimated cases to the observed cases. There is a remarkable [...] compared to those presented in other studies. We emphasize that the estimates provided are conservative, in the sense that they provide lower bounds on the size of undetected infections. However, we have provided some evidence, such as in the situation of Austria, that these lower [...]

We thank Professor Herwig Friedl for a critical reading of the paper as well as for pointing out some valuable improvements. We also express deep thanks to an anonymous referee for his/her very [...]

Ethical Approval
Not required.

Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Conflict of interest
We declare that we have no conflict of interest.

Funding sources
None.
Ratio plot and ratio regression with applications to social and medical sciences

A flexible ratio regression approach for zero-truncated capture-recapture counts

Estimating the population size for capture-recapture data with unequal catchability

Estimating population size for sparse data in capture-recapture experiments

[...] comparing richness with incidence data and incomplete sampling

A Time-dependent SIR model for Covid-19

Estimating the number of infections and the impact of non-pharmaceutical interventions on Covid-19 in 11 European countries

Modelling the Covid-19 epidemic and implementation of population-wide interventions in Italy

[...] Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia

Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus

Estimation of Unreported Novel Coronavirus (SARS-CoV-2) Infections from Reported Deaths: A Susceptible Exposed Infectious Recovered Dead Model

Estimating the number of undetected Covid-19 cases exported internationally from all of China.
medRxiv preprint

Capture-recapture estimation based upon the geometric distribution allowing for heterogeneity

Case-fatality rate and characteristics of patients dying in relation to Covid-19 in Italy

Estimating the final epidemic size for Covid-19 outbreak using improved epidemiological models

Iceland is doing science 50% of people with Covid-19 not showing symptoms, 50% have very moderate cold symptoms

Estimation of Covid-19 outbreak size in [...] The Lancet Infectious Disease

Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak

Preliminary prediction of the basic reproduction number of the Wuhan novel coronavirus 2019-nCoV

# The Austrian case study
cases = c [...]
[...] <- function(cases, deaths) {
  n.obs <- length(deaths)
  sum = 0
  t0 = 2; tend = n.obs - 1
  f0 <- f1 <- f2 <- variance <- 0
  for [...]
  tot <- sum(variance, na.rm = T)
  return(list(f0 = sum, N.hat = sum + cases [...]
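Because the fragment above is incomplete, the following is a minimal runnable sketch of the kind of day-wise computation it appears to perform, not the authors' code. It rests on several assumptions: the function name `chao_hidden` is hypothetical; `cases` and `deaths` are taken to be cumulative counts; the bias-corrected daily term is assumed to be $f_1 (f_1 - 1) / (f_2 + 1)$ with $f_2$ floored at 0; and the total is assumed to be the hidden count plus the last cumulative case count. The variance step hinted at in the fragment is omitted.

```r
# Sketch only: day-wise, bias-corrected lower bound on hidden cases.
# `chao_hidden` and its internals are hypothetical, not the authors' function.
chao_hidden <- function(cases, deaths) {
  stopifnot(length(cases) == length(deaths), length(cases) >= 3)
  dN <- diff(cases)                      # new infections per day
  dD <- diff(deaths)                     # new deaths per day
  idx <- 2:length(dN)                    # days with a previous-day increment
  f1 <- dN[idx]                          # identified once: new cases at day t
  f2 <- pmax(dN[idx - 1] - dD[idx], 0)   # previous-day cases still alive, floored at 0
  h <- f1 * (f1 - 1) / (f2 + 1)          # assumed bias-corrected daily term
  hidden <- sum(h)
  # Assumed total: hidden cases plus the last observed cumulative count
  list(f0 = hidden, N.hat = hidden + cases[length(cases)])
}

# Usage with made-up cumulative series (illustrative only)
res <- chao_hidden(cases = c(10, 14, 21, 33, 50), deaths = c(0, 0, 1, 1, 3))
print(res$f0)
print(res$N.hat / 50)   # ratio of estimated total to observed cases
```

The vectorised form avoids an explicit loop over days; flooring $f_2$ with `pmax` mirrors the convention of setting a negative difference to 0.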