key: cord-0604725-ww8yv9a1 authors: Pozderac, Calvin; Skinner, Brian title: Superspreading of SARS-CoV-2 in the USA date: 2020-07-30 journal: nan DOI: nan sha: 16fc36144a57ad7ec99bb4aff7a2ea3f81342167 doc_id: 604725 cord_uid: ww8yv9a1 A number of epidemics, including the SARS-CoV-1 epidemic of 2002-2004, have been known to exhibit superspreading, in which a small fraction of infected individuals are responsible for the majority of new infections. The existence of superspreading implies a fat-tailed distribution of infectiousness (new secondary infections caused per day) among different individuals. Here, we present a simple method to estimate the variation in infectiousness by examining the variation in early-time growth rates of new cases among different subpopulations. We use this method to estimate the mean and variance in the infectiousness, $k$, for SARS-CoV-2 transmission during the early stages of the pandemic within the United States. We find that $sigma_k/mu_k gtrsim 3.7 $, where $mu_k$ is the mean infectiousness and $sigma_k$ is its standard deviation, which implies pervasive superspreading. This result allows us to estimate that in the early stages of the pandemic in the USA, nearly $88%$ of new cases were a result of the top $10%$ of most infectious individuals. The temporal growth of an epidemic is often characterized by either a time scale (such as the doubling time) [1, 2] or by the reproduction rate R 0 , which indicates the average number of new infections produced by each infected individual [3] . Estimates of R 0 for the current pandemic of SARS-CoV-2 range from 1.4 to 3.8 [4] [5] [6] [7] . Neither of these numbers, however, gives any information about the distribution of infectiousness among individuals -i.e., whether new infections arise relatively uniformly from all infected individuals, or whether new infections are driven primarily by a small number of highly infectious individuals. The latter case is commonly referred to as "superspreading", and different epidemics exhibit superspreading to different degrees. For example, during the outbreak of SARS CoV-1 in 2002-2004, over 80% of cases were observed to result from the top 20% most infectious individuals [8, 9] . Understanding the degree of superspreading in the current pandemic of SARS-CoV-2 is crucial for developing strategies to mitigate continued spread and informing an educated reopening procedure [10] [11] [12] [13] . Here we present a simple and direct method to understand how the infectiousness (also called the "reproduction rate" of the disease) varies among infected individuals. At late times after the onset of an epidemic, the number of infected individuals is large, and consequently any statistical fluctuations in the growth rate are relatively small, so that the growth rate is well characterized by the mean infectiousness, µ k . However, at early times, when there are relatively few cases, the growth rate is stochastic and the degree of randomness depends on the variance in infectiousness, σ 2 k , between individuals (Fig. 1a) . By examining the variance in growth rate across subpopulations at these early times (Fig. 1b) , we are able to infer the variation in the distribution of infectiousness. In our analysis we divide the US cases into counties and observe how the variance in growth rate across them evolves as the number of cases increases. Formalizing this idea, we first present a derivation of The number of infections per infected individual as a function of total infections. In the main figure, each point corresponds to a given county at a given time point. As the number of cases increase, the rates narrow in to the mean infection rate. The mean (points) and variance (bars) of ∆I/I at a given I are shown in the inset. The variance decreases like (µ k +σ 2 k )/I (black lines). the variance in the exponential growth rate, or number of new cases per infected individual per day, ∆I/I, using an SIR framework that incorporates a probability distribution for the infectiousness of a given individual. Our result implies a simple method for estimating the mean, µ k , and variance, σ 2 k , of the infectiousness k. We apply this method to data for COVID-19 cases in the USA, and find a mean infection rate of µ k = 0.15 cases/day and standard deviation of σ k 0.55 cases/day. Since the standard deviation is considerably larger than the mean, with σ k /µ k 3.7, we conclude that superspreading is prevalent. By our estimate, these results imply that at least 88% of new cases are caused by the top 10% of most infectious individuals. Our method, which uses only a direct measurement of variance in detected case data in the USA, is consistent with estimates of superspreading using surveillance data [14] , secondary-case data [15] , and more complicated estimates of cluster size distribution using Markov Chain Monte Carlo [16] . In this section we derive a relation between the variance in the case growth rate and the variance in individual infectiousness between individuals in the population. We start with a standard discrete-time SIR model [17] , which is governed by the following difference equations: Here, N is the total population and S, I, and R are the time-dependent numbers of susceptible, infected, and recovered individuals, respectively. The parameters k and r encode the infectiousness and recovery rate of a disease within a population. The time is effectively discretized into days by the available data, so we use ∆I rather than the usual time derivative, dI/dt. The SIR description typically assumes fixed values for k and r across the population. However, in superspreading contexts there is a substantial variance in the infectiousness within a population [8, 9, 18, 19] . We account for this variation by introducing a probability distribution of infectiousness, p(k), so that the probability for a randomly-selected individual to have infectiousness in the range (k, k + dk) is given by is given by p(k)dk. For an individual with a given infectiousness, k, the probability of infecting exactly n others in a day follows the Poisson distribution, Pois(n; k). The probability that a randomly selected individual will infect n others is given by combining the Poisson distribution with the distribution p(k), giving The first two moments of P (n), µ n and σ 2 n , can be calculated independent of the form of p(k): Equation (4) represents the variance, among all infected individuals, of the number of new infections caused by a single person in a given day. When there are I active cases, the mean number of new cases per infected person, ∆(I +R)/I, is given by the average of I random variables drawn from the distribution P (n). By the central limit theorem, it follows that Var(∆(I + R)/I) = σ 2 n /I. Additionally, in the SIR model with a finite total population N , ∆(I + R)/I = kS/N = k(1 − (I + R)/N ) decreases as the susceptible population continually shrinks. Effectively, p(k) is scaled by the factor (1 − (I + R)/N ), which represents the fraction of the population that remains susceptible. Consequently, µ k → µ k (1 − (I + R)/N ) and Therefore the total variance in ∆(I + R)/I follows: This result becomes simpler in the limiting case where there is no significant change in the susceptible population (N → ∞). In this limit, we retrieve the case of simple exponential growth, for which [20] Var In the limit σ k → 0, where every infected individual has the same infectiousness µ k , the variance in the average infection rate is simply µ k /I, which corresponds to the variance in a Poisson process with rate µ k . In the case of SARS-CoV-2, it is well established that there are asymptomatic carriers [21] [22] [23] who transmit the virus without being detected, as well as other infections that are undetected or unreported. Current estimates typically predict that only 10 − 25% [24] [25] [26] of cases are detected. One can attempt to address this effect by assuming that there is a fixed detection probability, p det , and that the entire infected population, regardless of symptoms, follows the same infectiousness distribution p(k). In this case, there are many more infected individuals, I ∼ I det /p det , than those detected, which reduces the statistical fluctuations in the growth rate and makes our calculation of σ 2 k a lower bound. The effect of undetected cases is considered in more detail in Appendix C. In order to be conservative (especially given the possibility that asymptomatic cases have a lower rate of infection than symptomatic ones [27, 28] ), the results we present here use p det = 1. We corroborate Eqs. (5) and (6) using a numerical simulation of the trajectories of infection growth, I(t), for a given distribution p(k). Reference 18 has suggested that infectiousness follows a gamma distribution, and consequently, P (n) = NB(n; µ 2 k /σ 2 k , µ k /(µ k + σ 2 k )) where NB is the negative binomial distribution [10, 16] . Using this assumption, we simulate the growth of the epidemic by assuming that a given individual i, with infectiousness k i that is drawn randomly from p(k), generates a number n i of new cases each subsequent day that is drawn from Pois(n i ; k i ). The simulation results confirm Eqs. (5) and (6) , as shown in Appendix A. We now turn our attention to data for total detected cases of COVID-19 in the USA, taken from the publicly available data set at Ref. 29 . In the following analysis we limit our consideration to only a short timescale (∼14 days) after the first infection is detected in a given county † . This limitation in time scale serves three main purposes; first, it is likely that through changes in policy, lockdown, social distancing, mask usage, etc., the average infectiousness within the population is time-dependent. By restricting ourselves to a relatively small window of early times, we may assume that there is a constant average infectiousness. Second, considering only beginning stages allows us to neglect the possible saturation of the susceptible population, effectively allowing us to take the N → ∞ limit. Finally, the recovery period for COVID-19 is ∼14 days [30] and so by considering this two week period, we can treat our system as if there is no recovery and R → 0. These restrictions allow us to treat the USA data using the exponential case, Eq. (6). In our analysis, the population is divided into geographic regions and the variance is calculated across different trajectories I(t). The US cases are divided by county. For each county, we calculate the average number of new cases per current case per day, ∆I/I, for the first 14 days after the first infection is detected in that county. The variance in ∆I/I is then calculated among all counties that have a given fixed value of I (we present data only for values of I that have at least 250 corresponding counties). As shown in Fig. 2 , the US data generally follows the predicted ∼ 1/I trend. An unbiased fit of the data gives Var(∆I/I) ∝ I −0.77 . From Eq. (6), we calculate µ k + σ 2 k by averaging Var(∆I/I) × I, weighted by the number of instances at each I value. One might worry that the main source of variation comes from differing average growth rates, µ k , in various counties (i.e. rural vs. urban). However, we show in Appendix B that FIG. 2. As the number of infections I in a given county increases, the variance in infection rate ∆I/I decreases as (µ k + σ 2 k )/I. We observe that the USA data (blue) is inconsistent with a model of uniform infectiousness, or σ k = 0 (dashed red line ). A fit to the data (solid black line) implies a large variance in infectiousness, such that σ k /µ k 3.7. variance in µ k across counties is too small to explain the large observed variance in ∆I/I. We calculate µ k from the entire USA population by averaging all values of ∆I/I weighted by the current number of infections. Equivalently, we sum the number of cases caused each day and then divide by the sum of the number of cases across those days. This procedure gives the mean infectiousness, µ k , and thus from Eq. (6) and the fitted slope in Fig. 2 , we can infer σ 2 k . This procedure yields µ k = 0.15 cases/day and σ k = 0.55 cases/day. The small value of µ 2 k /σ 2 k = 0.07, equivalent to the dispersion parameter [16, 31, 32] , provides clear evidence for superspreading during early stages of the COVID-19 pandemic in the United States. These results for µ k and σ k can be used to further quantify the extent of superspreading. In particular, under the assumption that p(k) follows a gamma distribution [18] , one can calculate the fraction of individuals X k0 with infectiousness larger than a given value k 0 , as well as the fraction of secondary infections Y k0 that these individuals are expected to cause: where Q is the Regularized Gamma function. By eliminating k 0 we find Equation (9) gives the cumulative share of infections, Y , caused by the top X portion of most infectious cases, and is plotted in Fig. 3 A wide distribution p(k) in infectiousness k leads to large statistical variation in the early-time growth rate of a disease. Here we have shown that by calculating the variance in growth rate among different subpopulations one can infer the variance in p(k). Our result for COVID-19 cases in the USA suggests that σ k /µ k 3.7, implying a relatively severe superspreading. If we further assume that p(k) follows a gamma distribution (as in Ref. 18 ), then we can produce a more direct estimate of the extent of superspreading (Fig. 3) . Our relatively simple and direct method, based on a calculation of variance in reported case data, can be contrasted with more complicated methods for inferring the dispersion parameter that are based on maximum likelihood estimation (e.g., Ref. [32] develops such a method using simulated data), cluster size distributions [16, 33] , and surveillance or tracing data [14, 15] . These methods also tend to yield a lower-bound estimate for σ k /µ k . While studies based on testing and contact tracing (e.g., Refs. [18, [34] [35] [36] ) remain the definitive method for assessing superspreading, the method we present here may provide a much simpler way of estimating its prevalence across a much larger population. We emphasize that our analysis is unable to determine whether this large variance is a result of differing biological symptoms, social behavior, or other possible explanations. Additionally, this estimation is carried out for early times to minimize effects from a time varying p(k) and therefore predominantly speaks to the infectiousness prior to widespread lockdown measures. We close by commenting on a number of complicating factors that we did not include in our analysis and which, one might suspect, could alter our primary finding of a large value of σ k /µ k . For example, we have assumed a uniform value of µ k across different geographic locations; we have neglected undetected cases; we have ignored the possible variation in detection rate p det among different counties; we have effectively treated each county as an isolated population and have neglected cross-county interactions; and we have ignored the effects of latency in disease as well as the potential variation in latency periods between individuals. In the Appendices, we consider each of these mechanisms in turn and show that none of them can explain our result, so that our conclusion of prevalent superspreading of SARS-CoV-2 in the USA remains robust. In brief: the variation in µ k among different geographic locations is too small to explain the observed variance in growth rate [App. B]; neglecting undetected cases leads to an underestimate of the variance σ 2 k , so that our result is effectively a lower bound for the prevalence of superspreading [App. C]; variation in p det between counties does not directly affect the variance in the growth rate (∆I det )/I det , other than to provide an average of p det < 1, which results in a lower-bound estimate of σ 2 k [App. D]; cross county interactions tend to reduce the variance, so our result cannot be explained as a consequence of such interactions [App. E]; and variations in latency period can only reduce the apparent variance in growth rate [App. F]. We employ Monte Carlo simulations to corroborate our theoretical calculations [Eqs. (5) and (6)]. We start by adopting the conclusion of Ref. 18 and defining p(k) as a Gamma distribution with mean µ k and standard deviation σ k . We simulate an outbreak by first randomly generating a k from the distribution p(k) and then drawing a random n from Pois(n; k). This process is repeated until a non-zero n is generated, representing an outbreak starting in a given location with n individuals. Each of these n infected individuals is given a correspond infectiousness, k i , which they keep for the remainder of the simulation. Each individual, i, then generates n i new cases randomly drawn from Pois(n i ; k i ), and each of these secondary infections is assigned its own randomly generated infectiousness as well. During every iteration of the simulation, representing a day, each infected individual infects others given by a new random Poisson variable with mean defined by their own infectiousness. After a set number of infections is reached, the simulation is stopped and the trajectory I(t) is recorded. This process is repeated for 3,000 total trajectories, representing the ∼3,000 counties in the real USA data. This simulated data is then treated in the same manner as the real data, which is explained in Sec. 2 of the main text. We also consider simulations with a recovery phase and a finite carrying capacity N . To implement recovery, we specify a given number of days, t rec , over which an infected individual is infectious. Only those who contract the virus within this time period infect others. The effect of finite carrying capacity N is included by scaling the infectiousness k i of individual i by the factor S/N = 1 − (I + R)/N . For example, someone with infectiousness k i generates a number of cases n i drawn from the probability distribution Pois(n; k i (1 − (I + R)/N )) each day. This procedure is cut off once a certain fraction of N is reached, and then repeated 3,000 times. Although each trajectory follows the same p(k), N can vary between different trajectories. To account for this variation in N , we normalize ∆I → √ I(∆I/I −µ k (1−(I +R)/N )). We see that the simulated variance matches well with our theory (Fig. 4) . It is reasonable to question whether the calculated variance, σ 2 k , is a result of various geographic locations having differing average infectiousness, µ k , due to varying population density, social norms, etc. One may instead consider that the mean infectiousness µ k follows some distribution q(µ k ) among different counties. For a given µ k , we have shown that the variance in P (n; µ k , σ k ) averaged over I realizations is given by µ 2 k + σ 2 k /I in the exponential case. Including the effect of a distribution q(µ k ), we calculate the variance in ∆I/I to be: That is, when we account for the possibility that each region has a different µ k , the value of µ k is replaced by its meanμ k across counties, and a constant term is added for the variance in µ k across counties. We can conclude that this variance cannot fully explain the data for two reasons. First, we observe a clear Var(∆I/I) ∼ 1/I trend in the data (Fig. 2) , which can only be a result of the variance in p(k) rather than q(µ k ). Additionally, we can directly measure the variance in µ k across counties, which we find to be 0.003 (cases/day) 2 . This number is too small to significantly affect the total variance in ∆I/I, as seen in Fig. 5 . When the measured variance in q(µ k ) is taken into account in our fitting procedure, we find that µ k ∼ 0.15 cases/day, σ 2 k ∼ 0.29 cases 2 /days 2 , resulting in very slightly different value of σ k /µ k ∼ 3.6. Between asymptomatic cases and imperfect testing, there are a significant number of active cases, which can transmit the virus, that do not show up in the data set we use. One way to take this effect into account is by introducing an average probability that a case is detected, p det . All variance calculations occur at fixed values of the number of detected cases, I det . Given I det , the probability that there are I total cases is given by a negative binomial distribution: If there are I active cases, then the probability that ∆I cases are generated is given by the sum of I random variables drawn from P (n). Since µ n = µ k and σ 2 n = µ k +σ 2 k , Once ∆I cases are generated on a given day, the probability that ∆I det are detected is given by a binomial distribution: (C4) We combine these equations to derive the probability distribution for the number of new detected cases in a given day, ∆I det , given that there are currently I det active cases. (C5) It follows that the mean and variance in ∆I det /I det are That is, when under-detection is accounted for, an extra term µ 2 k (1 − p det )/I det is added to the variance due to the variance in the underlying total number of cases, I. The term σ 2 k /I det is also scaled down by a factor p det , since there are on average a larger number I ∼ I det /p det of total cases. Thus, since µ 2 k (1 − p det ) µ k and σ 2 k is suppressed by a factor of p det , the calculated value of σ 2 k is a lower bound. Although there are estimates of the fraction of detected COVID-19 cases in literature (e.g., Refs. [24] [25] [26] , in order to be conservative we do not directly use these estimates for the parameter p det in Eq. (C7), due to the possibility that undetected cases have a different (likely lower) infectiousness. We have shown that if there are I total cases, then the mean and variance in the number of new cases, ∆I, is µ k I and (µ k + σ 2 k )I, respectively. Once ∆I cases are generated, they are randomly sorted into the M counties. Therefore, (E2) Combining these distributions, the probability that ∆I 1 cases occur in a given county that has I 1 active cases is: P (∆I 1 ; I 1 ) = ∞ I=I1 P (I; I 1 ) ∞ ∆I=∆I1 P (∆I; I)P (∆I 1 ; ∆I). From P (∆I 1 ; I 1 ) we calculate the mean and variance in ∆I 1 /I 1 to find: Since µ k = 0.15 cases/day, the term µ k +µ 2 k (1−1/M ) cannot account for the variation present in the US. Therefore, if there is maximal interactions between counties, then the calculation of σ 2 k remains a lower bound estimate. Intuitively, one can say that when different counties interact strongly with each other, there is a larger underlying number of active cases from which new cases can be drawn for a given county, and this larger number reduces the statistical variance. The previous consideration assumes that all counties interact evenly with each other. It is possible that some counties might gain a large number of cases entering from a neighboring county, while others have more exiting than entering. To account for this potential source of variance we consider a single county. Assume there is a fixed portion, p exit , of new infections from cases within a county that are spread to an outside county, as well as a portion, p enter , of new cases from other counties. Both of these quantities are defined in terms of the number of cases in the current county, I(t). Consequently, the effective number of cases leading to new infections in the current county is (1 − p exit + p enter )I(t). Since all counties are assumed to follow the same underlying distribution p(k), we expect that the mean of ∆I/I will be (1 − p exit + p enter )µ k . With this understanding, there are two possibilities: either all counties have equal flows in and out so that p exit ∼ p enter , or there is a balance between counties with p exit > p enter and vice versa. In the first case, we see that all counties share approximately the same measured µ k while in the later, there is a wide spread in µ k from this cross county interaction. Since we show in Appendix B that the there is little variance in µ k across counties, this suggests that p exit ∼ p enter within each county. Consequently, the effective number of cases that can lead to a new case within a given county, (1 − p exit + p enter )I(t), varies from the true number, I(t), only on the order of σ µ k σ k . Therefore, we conclude that this effect cannot explain the large variance we observe. FIG. 6. We simulate the impact of variance in latency compared to a fixed latency (blue). Here, we use p(L) ∼ e λL with λ = 1, 2, 3 (orange, green, red). Variance in latency only decreases the observed variance and consequently cannot explain the large σ k we calculate. that can lead to detection. This treatment is equivalent to assuming that there is a fixed latency period for all infections. In this situation, if the true number of cases at a given time (counting cases as those infected, and not necessarily already detected) is I(t), then the number of cases observed is simply I observed (t) = I(t − L), where L is the amount of time for an individual to start showing symptoms and be detected. However, it is reasonable to suspect that some individuals could have shorter or longer latency periods, and this variance could impact the overall variance that we calculate. To address this, we assume that the latency time L follows some distribution q(L) with mean µ L . At a given time t, the infections being observed are originating from times in the vicinity of t − µ L . Therefore, the effective number of infected individuals leading to new infections at time t is µ L +∆L L=µ L −∆L q(L)I(t−L), where 2∆L represents the range of possible latency times. Since I(t) grows exponentially, the terms from the most recent time, t = µ L − ∆L, will dominate the sum. Consequently, the effective infected population is always greater than or equal to I(t − µ L ), and this effect tends to decreases the overall variance in ∆I/I. This diminishing of the variance with increasing ∆L is corroborated by simulations (Fig. 6 ). Doubling time of the covid-19 epidemic by province, china Covid-19 seeding time and doubling time model: an early epidemic risk assessment tool Mathematical Biology Early transmission dynamics in wuhan, china, of novel coronavirusinfected pneumonia Pattern of early human-tohuman transmission of wuhan 2019 novel coronavirus (2019-ncov) High contagiousness and rapid spread of severe acute respiratory syndrome coronavirus The reproductive number of COVID-19 is higher compared to SARS coronavirus Dimensions of superspreading Super-spreaders in infectious diseases Stochasticity and heterogeneity in the transmission dynamics of sars-cov-2 (2020) Super-spreader businesses and risk of covid-19 transmission Modelling covid-19 Projections and earlywarning signals of a second wave of the covid-19 epidemic in illinois Characterizing super-spreading events and age-specific infectivity of covid-19 transmission in georgia, usa Superspreading in early transmissions of covid-19 in indonesia, medRxiv Estimating the overdispersion in covid-19 transmission using outbreak sizes outside china A contribution to the mathematical theory of epidemics Superspreading and the effect of individual variation on disease emergence Impact of superspreaders on dissemination and mitigation of covid-19 Fitting the negative binomial distribution to biological data Estimation of undetected symptomatic and asymptomatic cases of covid-19 infection and prediction of its spread in usa Estimating the asymptomatic proportion of coronavirus disease 2019 (covid-19) cases on board the diamond princess cruise ship Estimation of the asymptomatic ratio of novel coronavirus infections (covid-19) Quantifying undetected covid-19 cases and effects of containment measures in italy: Predicting phase 2 dynamics Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sarscov-2) Estimating the Early Outbreak Cumulative Incidence of COVID-19 in the United States: Three Complementary Approaches, medRxiv Characterization of an asymptomatic cohort of SARS-COV-2 infected individuals outside of Wuhan Physical distancing, face masks, and eye protection to prevent person-to-person transmission of sars-cov-2 and covid-19: a systematic review and metaanalysis An interactive webbased dashboard to track covid-19 in real time Science forum: Sars-cov-2 (covid-19) by the numbers Beyond r0: Heterogeneity in secondary infections and probabilistic epidemic forecasting Maximum likelihood estimation of the negative binomial dispersion parameter for highly overdispersed data The role of superspreading in middle east respiratory syndrome coronavirus (mers-cov) transmission Ebola superspreading, The Lancet Infectious Diseases The role of super-spreading events in Mycobacterium tuberculosis transmission: evidence from contact tracing Network analysis of mers coronavirus within households, communities, and hospitals to identify most centralized and super-spreading in the arabian peninsula Acknowledgments. The authors are grateful to N. E. Skinner for helpful conversations. We have shown that a fixed detection rate, p det , across counties cannot account for the variance observed within the US population. However, one can also check to ensure that variation in p det between counties, described by a probability distribution q(p det ), does not explain the data either. To account for differing values of p det we weight Eq. (C4) by q(p det ) so that P (∆I det ; ∆I) → 1 0 dp det q(p det )P (∆I det ; ∆I). Plugging into Eg. (C7) we seewhere p det = 1 0 dp det q(p det )p det is the mean detection rate across all counties. Therefore, variance in p det cannot explain the large observed variance in (∆I)/I, and a substantial variation from σ 2 k is required to explain the US data. Our analysis in the main text operates under the assumption that each county in the USA is an independent population in which the virus can spread. However, it is clear that there is some portion of infections that cross county lines. To understand how this interaction can affect the variance in observed growth rate of cases, we explore what the variance looks like if we have perfect mixing between M counties, each with I 1 active cases. In this formulation, the variance we calculate is Var(∆I 1 /I 1 ) at a given I 1 . Focusing on a single county with I 1 active cases, the probability that there are I total cases across the M counties is given by a negative binomial distribution: Our analysis ignores the effect of latency between the moment of infection and the appearance of symptoms