key: cord-0867890-v2o6etjk authors: HERNANDEZ-SUAREZ, CARLOS M; Verme, Paolo; Murillo-Zamora, Efren title: On the estimation of the total number of SARS-CoV-2 infections date: 2020-04-29 journal: nan DOI: 10.1101/2020.04.23.20077446 sha: ba383829fc4243c6f6540e28a1b51bb0becd8259 doc_id: 867890 cord_uid: v2o6etjk

We introduce a simple methodology to estimate the total number of individuals infected with SARS-CoV-2 based on the number of deaths in households with at least one confirmed case of COVID-19. If we are willing to assume that a single member of a household with $n$ members will infect the remaining members with probability 1, then the number of deaths in a household follows a binomial distribution with parameters $(n-1,p)$, where $p$ is the CFR. Although the method may be affected by classification errors, its simplicity allows the error of the estimates to be reduced by increasing the sample size, since it requires minimal laboratory testing capabilities. We illustrate our methodology with data from Mexico and estimate the CFR at 0.34%, that is, we estimate that the total number of infections is about $300$ times larger than the number of deaths. We specify some dataset limitations. In comparison, using the number of deaths to date and recently published results from random tests in Iceland, we calculated the ratio of estimated infections to deaths at about $200$ for that country.

It is known that the immune response to SARS-CoV-2 ranges from fully asymptomatic to mild or even severe responses that may cause death. Estimates of the probability of presenting a particular response are useful for prevention and care purposes, or even for building appropriate mathematical models that may provide projections at the population level, especially to analyze the evolution of the immune population with the purpose of economic recovery.
These estimates are particularly important for estimating the total number of infections by expanding the fraction observed in some category, for instance the number of hospitalized persons or the number of deaths.

Let $p = [p_1, p_2, \ldots, p_s]$ be the probabilities that an individual will develop reaction $i$ from a possible set of reactions, for instance $S = \{\text{None}, \text{Mild}, \text{Severe}, \text{Death}\}$, or any other categorization that can be assigned to an individual without error and where the categories are mutually exclusive. The idea is that if the number of individuals in some category $k$, say $n_k$, is known or can be approximated, and its proportion $p_k$ can be estimated, then the total number of individuals in all categories can be estimated with $n_k/p_k$.

There are current estimates of the probability of showing a specific reaction to infection, for instance, being asymptomatic or presenting mild or severe symptoms [1, 2, 3, 4], but their statistical properties are unknown. A possible design that would allow $p$ to be estimated is random screening for infection or antibodies, categorizing the response of infected or already immune individuals. The press has announced ongoing studies of this type to estimate the share of immune individuals, which would allow the spread of the disease to be estimated, but these studies may be biased depending on the level of randomness: in most trials participation is voluntary, and individuals who were exposed, or believe they were exposed, may feel encouraged to participate, which contributes to overestimating the spread of the disease. The ideal random sample would be one extracted from a census database, making sure that those who died from the disease are included. Nevertheless, this is expensive because it requires comprehensive laboratory testing. Moreover, if the fraction of infected or recovered individuals is small, the cost per unit of information may be very high, since only those infected at some time provide information. In addition, a small sample would result in estimates with large confidence intervals.

medRxiv preprint doi: https://doi.org/10.1101/2020.04.23.20077446; this version posted April 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

Here we suggest a simple study design based on the outcomes of households in which there has been at least one infected individual.

Let us define an effective contact, or contact for short, as any act between an infectious and a susceptible individual that would result in the infection of the susceptible [5]. Suppose that we are presented with an individual that had a contact; this individual will then provide information on the likelihood of presenting a reaction in the set $S$. If we are presented with a sample of $n$ individuals that we know had a contact with some infectious individual (not necessarily the same infectious individual), and $x_i$ is the number of individuals that exhibit reaction $i$, then $\hat{p}_i = x_i/n$ is an estimate of $p_i$, the probability that an individual will develop reaction $i$ to infection. The variance of the estimate is

$$\operatorname{Var}(\hat{p}_i) = \frac{p_i (1 - p_i)}{n}.$$

Hence the importance of finding individuals that we know had a contact. But these individuals are easy to find: several studies have shown that household transmission, as well as familial transmission, is high [6, 7, 8, 9, 10, 11, 12], and transmission has been reported even in offices after relatively short interactions [13].
Therefore, if we are willing to concede that all the members of a household with a diagnosed individual had a contact with the initial infected in the household, the fraction of the remaining members of the household that exhibited reaction $i$ is an estimate of $p_i$, and we can pool data from several households to obtain a better estimate. In what follows, we formalize this estimate.

Define a household as an infected household if there has been at least one confirmed COVID-19 case. Suppose we have several infected households, with $n_j$ inhabitants in household $j$. Call the initial infected in the household infected zero. Assume that:

(i) Infected zero will infect the remaining $n_j - 1$ susceptibles in the household with probability 1.

(ii) Once infected, the responses of individuals in an infected household are independent, that is, the responses of the remaining susceptible members in a household follow a multinomial distribution with parameters $n_j - 1$ and $p$.

Observe that (i) implies that when two or more individuals are infected in the household, the probability that any one of the remaining susceptibles will be infected is not increased. It also implies that all infected individuals are equally infectious, regardless of their symptomatic response to infection.

In our approach, it is required to know the total number of individuals in a specific category of responses. The simplest approach is to consider the number of deaths, as this is likely the most reliable observed indicator to proxy the corresponding statistics in the population.
Hereafter, we will refer exclusively to this response to infection, and thus our set consists of two responses, $S = \{\text{Recovered}, \text{Dead}\}$. This is preferable to using the total number of individuals attending hospitals or receiving intensive care, for instance, which depend on the availability of health facilities and on case definitions that may vary between countries. Thus, the individual responses within a household follow a Bernoulli distribution with parameter $p$, where $p$ is the Case Fatality Ratio (CFR).

A list of confirmed cases can be used to obtain a sample of infected households. Suppose that the sample is of size $m$. Let $n_j$ be the size of household $j$ and $n = \sum_{j=1}^{m} n_j$ be the sum of all members in all households in the sample. Let $x_j$ be the number of deaths (excluding infected zero) in household $j$ and let $x = \sum_{j=1}^{m} x_j$. The estimate of $p$, the CFR, measured at the household level is $x_j/(n_j - 1)$. Using the data from all households in the sample, the estimate of $p$ is:

$$\hat{p} = \frac{x}{n - m} = \frac{\sum_{j=1}^{m} x_j}{\sum_{j=1}^{m} (n_j - 1)}. \tag{1}$$

With one further assumption, one can estimate the number of infections for ...

The estimate of the total number of infections per death is $\hat{\theta} = 1/\hat{p}$. The approximate variance of $\hat{\theta}$ is:

$$\operatorname{Var}(\hat{\theta}) \approx \frac{1 - p}{(n - m)\, p^{3}}.$$

Let $M$ be the total number of deaths from COVID-19 in the population. The estimate of the total number of infected individuals in the population, $N$, is:

$$\hat{N} = M \hat{\theta} = \frac{M}{\hat{p}},$$

with approximate variance:

$$\operatorname{Var}(\hat{N}) \approx \frac{M^{2} (1 - p)}{(n - m)\, p^{3}}.$$

The probability that, among the remaining susceptibles in the household, one or more will become infected by an individual other than infected zero is negligible, mainly because of the comparative pressure of infected zero on all members of the household. Nevertheless, assume that this happens and one of the susceptibles in the household is infected by someone outside the household.
At a glance, it seems that the correct estimate at the household level is now $x_j/(n_j - 2)$, because there are only $n_j - 2$ remaining susceptibles, but this is incorrect. The simplest explanation is the following: we know that in an infected household with $n_j$ members, there are $n_j - 1$ individuals that have been subject to a contact. If one of the members of the household has a contact with an individual outside the household, that member's response still counts regardless of where the infection was acquired. Recall that we are estimating the probability of having a specific reaction to infection, not the probability of infection. This is the rationale we use to select a member at random from the duplicates in the list of deaths. It is not relevant who the infected zero is; we only need to ensure there was enough pressure of infection to guarantee a contact.

In this example we build an approximation to (1) using a database from Mexico's IMSS (Instituto Mexicano del Seguro Social), the Mexican Institute for Social Insurance. The main problem with the database is estimating how many households there are ($m$) and the total population living in those $m$ households, $n$. State, county, city and street are known, but in most cases there is no street number, so in this approximation we considered two cases in the same street as belonging to the same household, which underestimates the number of households. Observe that the denominator in (1) can be written as $m(\mu - 1)$, where $\mu$ is the average household size; thus this approach tends to overestimate $\hat{p}$ in (1).

The database has 1180 confirmed COVID-19 cases from March 2 to April 16, 2020.
The outcome (death, recovered) was missing for several cases, which were excluded. In an attempt to consider only households with final outcomes, we excluded cases with symptom onset in the last 21 days, that is, we considered only cases from March 2 to April 11, 2020. We also removed cases with missing addresses, leaving a final sample of 502 cases. The mean age of this final set was 47.3 years, with a standard deviation of 16.1 years and a median of 47 years. Of these, 61% were male and 39% female, and 43% were at least 50 years old.

First we must mention that our goal here is not to provide precise estimates of $p$ for Mexico but to illustrate a simple methodology to estimate the true number of infections in a population using available information on confirmed individuals. As mentioned before, the database we used does not allow for a direct calculation of the number of households, which is underestimated, and thus the CFR is overestimated.

Our estimate from the IMSS data at $\mu = 2.8$ is $\hat{p} = 0.0034$, which is 3.5 times smaller than the CFR for the Diamond Princess (CFR $= 0.012$, mean age of 58 years) [14] and 3.4 times larger than that reported so far for the USS Theodore Roosevelt (CFR $= 0.001$), with an evidently lower mean age [15]. In conclusion, we estimate one death per 300 infected individuals.
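The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the pooled estimate is taken as the deaths among non-index household members divided by the number of non-index members (consistent with the denominator $m(\mu - 1)$ mentioned in the text), the standard error uses a standard first-order (delta-method) approximation under the binomial assumption, and the function names, the toy households and the death count `M = 100` are all hypothetical.

```python
from math import sqrt


def pooled_cfr(deaths, sizes):
    """Return (p_hat, exposed) for a sample of infected households.

    deaths[j]: deaths in household j, excluding infected zero.
    sizes[j]:  total members n_j of household j, including infected zero.
    """
    if len(deaths) != len(sizes):
        raise ValueError("deaths and sizes must have the same length")
    # Members assumed to have had a contact: n_j - 1 per household (n - m in total).
    exposed = sum(n_j - 1 for n_j in sizes)
    if exposed <= 0:
        raise ValueError("need at least one household with two or more members")
    return sum(deaths) / exposed, exposed


def infections_per_death(p_hat, exposed):
    """Return (theta_hat, se) with theta_hat = 1 / p_hat.

    The standard error is a first-order approximation under the binomial
    model: Var(theta_hat) ~ (1 - p) / (exposed * p**3).
    """
    if not 0.0 < p_hat < 1.0:
        raise ValueError("p_hat must lie strictly between 0 and 1")
    theta_hat = 1.0 / p_hat
    var_theta = (1.0 - p_hat) / (exposed * p_hat ** 3)
    return theta_hat, sqrt(var_theta)


def total_infections(total_deaths, theta_hat, se_theta):
    """Scale infections-per-death by the population death count M."""
    return total_deaths * theta_hat, total_deaths * se_theta


# Hypothetical toy sample (not the IMSS data): five households.
toy_sizes = [4, 3, 5, 2, 6]
toy_deaths = [0, 1, 0, 0, 0]
p_hat, exposed = pooled_cfr(toy_deaths, toy_sizes)
theta_hat, se_theta = infections_per_death(p_hat, exposed)
n_hat, se_n = total_infections(100, theta_hat, se_theta)  # M = 100 is made up
print(f"p_hat={p_hat:.4f} theta_hat={theta_hat:.1f} N_hat={n_hat:.0f}")

# The reported point estimate 0.0034 gives 1 / 0.0034, roughly 294,
# which the text rounds to about 300 infections per death.
print(round(1 / 0.0034))
```

With so few exposed individuals the standard error in the toy sample is almost as large as the estimate itself, which illustrates why pooling many households matters.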
A recent study in Iceland [16] reports that from 2,283 persons selected at ...

In a following step, we can obtain the same probabilities for the whole population of positive cases by matching the household sample of tested households with households in the census. In other words, we only need to make sure ... population, which would allow us to weight for differential response to the infection. Suppose that we classify a population into $K$ categories (e.g., age) with relative frequencies $f_i$. Let $x^{(i)}$ and $n^{(i)}$ be, respectively, the total number of deaths and the total number of individuals in category $i$ in all households in the sample of size $m$; then a better estimate of $p$ would be:

$$\hat{p} = \sum_{i=1}^{K} f_i \, \frac{x^{(i)}}{n^{(i)}}.$$

This $\hat{p}$ must be plugged into (2).

One of the most important sources of bias in this method is that some observations may be censored. Perhaps death has not occurred yet in a given household, and thus the probability of death is underestimated. We tried to control for this by using only data where the onset of symptoms was at least 21 days old, so that the outcome is very likely observed, but in principle we should use households where there is enough evidence to believe that we can observe final outcomes.
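The category-weighted estimate of $p$ described above can be sketched as follows. This is a minimal illustration under stated assumptions: the estimate is taken to be the sum over categories of $f_i \, x^{(i)} / n^{(i)}$, the text does not say whether $n^{(i)}$ excludes infected zeros (here it is simply whatever count the caller supplies), and the function name and the three age bands are hypothetical.

```python
from math import isclose


def weighted_cfr(freqs, deaths, counts):
    """Category-weighted CFR: sum over i of f_i * x_i / n_i.

    freqs[i]:  relative frequency f_i of category i in the population.
    deaths[i]: deaths x^(i) observed in category i across sampled households.
    counts[i]: individuals n^(i) in category i across sampled households.
    """
    if not (len(freqs) == len(deaths) == len(counts)):
        raise ValueError("all inputs must have one entry per category")
    if not isclose(sum(freqs), 1.0, abs_tol=1e-9):
        raise ValueError("relative frequencies must sum to 1")
    if any(n_i <= 0 for n_i in counts):
        raise ValueError("every category needs at least one sampled individual")
    return sum(f_i * x_i / n_i for f_i, x_i, n_i in zip(freqs, deaths, counts))


# Hypothetical age bands (made-up numbers, not from the IMSS data).
freqs = [0.5, 0.3, 0.2]     # population shares: young, middle, old
deaths = [0, 1, 4]          # deaths per band in the sample
counts = [100, 100, 100]    # sampled individuals per band

p_weighted = weighted_cfr(freqs, deaths, counts)
p_unweighted = sum(deaths) / sum(counts)
print(f"weighted={p_weighted:.4f} unweighted={p_unweighted:.4f}")
```

In this made-up example the sample over-represents the oldest band relative to the population, so the unweighted pooled estimate is higher than the weighted one.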
References

Viral dynamics in mild and severe cases of COVID-19.
Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship, Yokohama, Japan.
Estimation of the asymptomatic ratio of novel coronavirus infections.
Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China.
Mathematical Models in Epidemiology.
A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster.
Clinical characteristics of 24 asymptomatic infections with COVID-19 screened among close contacts in Nanjing, China.
Asymptomatic cases in a family cluster with SARS-CoV-2 infection.
A COVID-19 transmission within a family cluster by presymptomatic infectors in China.
A familial cluster of infection associated with the 2019 novel coronavirus indicating possible person-to-person transmission during the incubation period, The Journal of Infectious Diseases.
Transmission of 2019-nCoV infection from an asymptomatic contact in Germany.
Estimating the infection and case fatality ratio for COVID-19 using age-adjusted data from the outbreak on the Diamond Princess cruise ship.
... Gylfason, ... SARS-CoV-2 in the Icelandic population, New England Journal of Medicine. doi:10.1056/NEJMoa2006100
... org.mx/proyectos/enchogares/especiales/intercensal