key: cord-0165273-dvp2w5a8 authors: Tsirtsis, Stratis; De, Abir; Lorch, Lars; Gomez-Rodriguez, Manuel title: Group Testing under Superspreading Dynamics date: 2021-06-30 journal: nan DOI: nan sha: 64fb9c38cfabcb496cfe5315ac3142a77b96288e doc_id: 165273 cord_uid: dvp2w5a8 Testing is recommended for all close contacts of confirmed COVID-19 patients. However, existing group testing methods are oblivious to the circumstances of contagion provided by contact tracing. Here, we build upon a well-known semi-adaptive pool testing method, Dorfman's method with imperfect tests, and derive a simple group testing method based on dynamic programming that is specifically designed to use the information provided by contact tracing. Experiments using a variety of reproduction numbers and dispersion levels, including those estimated in the context of the COVID-19 pandemic, show that the pools found using our method result in a significantly lower number of tests than those found using standard Dorfman's method, especially when the number of contacts of an infected individual is small. Moreover, our results show that our method can be more beneficial when the secondary infections are highly overdispersed. As countries around the world learn to live with COVID-19, the use of testing, contact tracing and isolation (TTI) has been proven to be as important as social distancing for containing the spread of the disease [1] . However, as the infection levels grow, TTI reaches a tipping point and its effectiveness quickly degrades as the health authorities lack resources to trace and test all contacts of diagnosed individuals [2] . In this context, there has been a flurry of interest on the use of group testing-testing groups of samples simultaneously-to scale up testing under limited resources. The literature on group testing methods has a rich history, starting with the seminal work by Dorfman [3] [4] [5] . However, existing methods [6, 7] , including those developed and used in the context of the COVID-19 pandemic [8] [9] [10] [11] [12] [13] [14] [15] [16] , are oblivious to the circumstances of contagion provided by contact tracing and assume statistical independence of the samples to be tested. This assumption can be seemingly justified by classical epidemiological models where the number of infections caused by a single individual follows a Poisson distribution. However, in COVID-19, there is growing evidence suggesting that the number of secondary infections caused by a single individual is overdispersed -most individuals do not infect anyone but a few superspreaders infect many in infection hotspots [17] [18] [19] [20] [21] [22] [23] . Overdispersion has been also observed in MERS and SARS [24] [25] [26] . In this work, our goal is to develop group testing methods that are specifically designed to use the information provided by contact tracing and are effective in the presence of overdispersion. More specifically, we build upon a well-known semi-adaptive pool testing method, Dorfman's method with imperfect tests [5, 27] . In Dorfman's method, samples from multiple individuals are first pooled together and evaluated using a single test. If a pooled sample is negative, all individuals in the pooled sample are deemed negative. If the pooled sample is positive, each individual sample from the pool is then tested independently. However, rather than modeling the probability that each individual sample is positive using an independent Bernoulli distribution as in Dorfman's method, we assume that: (i) the samples to be tested are all the contacts of a diagnosed individual during their infectious period, who are identified using contact tracing, and (ii) the number of true positive samples, i.e., secondary infections by the diagnosed individual, follows an overdispersed negative binomial distribution [22, 23] . Given any arbitrary set of pools, we then compute the average number of tests and the expected number of false negatives and false positives under our model. Finally, we introduce a dynamic programming algorithm to efficiently find the set of pools that optimally trade off the average number of tests, false negatives and false positives in polynomial time. Experiments using a variety of reproduction numbers and dispersion levels in secondary infections, including those observed for COVID-19, show that the pools found using our method result in a significantly lower average number of tests than those found using standard Dorfman's method, especially when the number of contacts of an infected individual is small. Moreover, our results show that our method can be particularly beneficial when the number of secondary infections caused by an infectious individual is highly overdispersed. Previous work have mostly built on the assumption that the number of infections X caused by a single individual follows a Poisson distribution with mean r, so X ∼ Pois(r), where r is often called the effective reproduction number. However, having equal mean and variance, the Poisson is unable to capture settings where the number of cases to be tested for exhibits higher variance. Following recent work in the context of COVID-19 [19, 22] , we instead model X using a generalized negative binomial distribution. For a (standard) negative binomial distribution, X ∼ NBin(k, p) can be interpreted as the number of successes before the k-th failure in a sequence of Bernoulli trials with success probability p. For a generalized negative binomial distribution, k > 0 can take real values and the probability mass function is given by where k is called the dispersion parameter and parameterizes higher variance of the distribution for small k. Since E[X] = kp/(1 − p), we assume in this work that the number of secondary infections X is distributed as X ∼ NBin (k, p) with p = r/(k +r), hence parameterizing X via its mean E[X] = r and dispersion parameter k. Under this parameterization, Var[X] = r(1 + r/k), which is greater than the variance of the Poisson r for k < ∞. For k → ∞, the sequence of random variables X k ∼ NBin(k, r/(k + r)) converges in distribution to X ∼ Pois(r), making the negative binomial a suitable generalization of the Poisson for modeling secondary infections. Assuming we test all contacts of a diagnosed individual during their infectious period, we have prior information about the maximum number of possible infections N in practice. In this setting, we will write q r,k,N (n) := P (X = n | X ≤ N ) when X ∼ NBin (k, r/(k + r)) . (1) Here, note that P (X = n | X ≤ N ) = P (X = n)/P (X ≤ N ) if n ≤ N and 0 otherwise. In this context, our goal is to identify infected individuals among all contacts N of a positively diagnosed individual via testing, where |N | = N . For events concerning each individual j ∈ N , we define the following indicator random variable: In addition, for each pool of individuals S ⊆ N , we define the number of infected in S as I(S) := j∈S I j . Following our assumption on the distribution of the number of secondary infections, we define To account for the sensitivity s e (i.e., true positive probability) and specificity s p (i.e., true negative probability) of tests, we parameterize the conditional probabilities as In the above, following the literature on the group testing [5] , we assume that the sensitivity of pool tests is independent of the exact number of infected individuals in the pool. Moreover, while dilution has been shown to often be negligible [28] , the effect can easily be addressed by making the conditional T (S) | I(S), dependent on the corresponding pool size |S|. Dorfman testing proceeds by pooling individuals into non-overlapping partitions of N and first testing the combined samples of each pool using a single test. Every member of a pool is marked as negative if their combined sample is negative. If a combined sample of a pool is positive, each individual of the pool is subsequently tested individually to determine who exactly is marked positive in the pool. Let D S j denote the indicator random variable for the event that individual j is marked as infected in pool S ⊆ N : |S| > 1 after Dorfman testing. Then, D S j can be expressed as i.e., taking value 1 if and only if the combined sample of pool S is first tested positive and the sample of individual j is tested positive in the second step. In the simple case of |S| = 1, we have D S j = T ({j}). Expected number of tests Let K(S) be the number of tests performed when testing pool S as described above. Then, the expected number of tests E[K(S)] due to a pool S is: where f (S) is given by where the last step follows from the fact that I(S) | I(N ) = n ∼ HGeom(N, n, |S|), our assumption about P(I(N ) = n) and our assumptions about T (S). Expected number of false negatives To compute the number of false negatives, we distinguish between two cases. If |S| = 1, i.e., the pool consists of only one person, there is no distinction between a group test and an individual test. Therefore, a false negative can occur only if the person is infected and the test turns out negative. Thus, s ← argmin 1≤j≤n [g(j) + h(n − j)] 8: where λ 1 and λ 2 are given non-negative parameters, penalizing the numbers of false negatives and false positives. Perhaps surprisingly, we can solve the above problem in polynomial time using a simple dynamic programming procedure. More specifically, define the following recursive functions: Interpreting n as the number of individuals not yet assigned to a pool, using the two recursive functions, the (sizes of the) optimal set of pools can be recovered by computing h(n) in increasing order of n. Algorithm 1 summarizes the overall procedure, where the function ComputeObjective(·) precomputes the function g(k) for each k ∈ {1, . . . , N }. More formally, we arrive at the following proposition: Proposition 1 Given N contacts, the set of pool sizes S N returned by Algorithm 1 are optimal. Proof We will prove this proposition by induction. In the base case, where n = 1, it is easy to see that the optimal solution is S * 1 = {1} i.e., it consists of one group with size 1 and the minimum of the objective value is OP T 1 = g(1), while the recursive functions trivially find the optimal solution since h(1) = g(1). For n > 1 contacts, the inductive hypothesis is that the values h(i) and sets S i recovered by Algorithm 1 for all i < n are optimal. Let S * n and OP T n be the optimal set of pool sizes and the respective value of the objective function for n contacts. Suppose, for the sake of contradiction, that OP T n < h(n) i.e., the solution computed using the recursive functions for n contacts is suboptimal. Let S * n = {s * 1 , s * 2 , . . . , s * l }. Then, we get: where the first step is based on the fact that g(s) + h(n − s) ≤ g(j) + h(n − j) for all j : 1 ≤ j ≤ n and the second step is based on the inductive hypothesis. Since l i=2 s * i = n − s * 1 , the final inequality implies that, having n − s * 1 contacts, the set of pool sizes {s * 2 , . . . , s * l } is strictly better than the optimal one which is clearly a contradiction. Therefore, the values h(n) and sets of pool sizes S n given by the recursive functions are optimal for all n : 1 ≤ n ≤ N . (c) shows the empirical distribution of the percentage of tests saved by using our method instead of Dorfman's method, where we exclude the highest and lowest 5% of observations. In all panels, we set the sensitivity and specificity to s e = s p = 0.95, and sample the number of secondary infections from a truncated negative binomial distribution with mean r = 2.5 and dispersion parameter k = 0.1 [19, 20, 29, 30] . For each combination of method and parameter values, the averages and quantiles in all panels are estimated using 100,000 samples. We perform simulations to evaluate Algorithm 1 against Dorfman's method in its ability to optimally trade off resources and false test outcomes in the presence of overdispersed distributions of secondary infections. To generate the infection states for each contact, we first fix a number of contacts N and sample the number of secondary infections n ∼ q r,k,N (n), where q r,k,N (n) is a truncated negative binomial distribution as defined in Eq. 1. Then, we select n of the N contacts at random and set their status to infected. To implement Dorfman's method, we use a variation of Algorithm 1 in which the expected numbers of tests, false negatives and false positives are computed assuming an independent individual probability of infection p = E q r,k,N [n]/N , using the formulas derived by Abrahamian et al. [5] . We first compare the performance of our method and Dorfman's method at finding the pools that minimize the number of tests (i.e., λ 1 = λ 2 = 0) for fixed values of the reproductive number r and dispersion parameter k matching estimations done during the early phase of the COVID-19 pandemic. Figure 1 summarizes the results with respect to the number of contacts N of the diagnosed individuals. The results show that our method achieves a lower average number of tests across all values of N , with its competitive advantage being greater when the number of contacts is small and less apparent as the number of contacts is increasing. Moreover, Dorfman' method chooses pool sizes that increase with the number of contacts while the ones chosen by our method remain relatively constant. This leads to significant differences between the distributions of the number of tests performed under the two methods. For example, as shown in Figure 1(c) , when the number of contacts is N = 20, our method is most likely to perform about 50% less tests than Dorfman's. However, due to the more conservative pool sizes given by Dorfman's method, there is a small probability that our method ends up performing more tests, sometimes even double the amount. Next, we investigate to what extent our method offers a competitive advantage with respect to Dorfman's method for additional values of the reproductive number r and dispersion parameter k different than those estimated during the early phase of the COVID-19 pandemic. Figure 2 summarizes the results, which show that our method offers the greatest competitive advantage whenever the number of secondary infections is overdispersed, i.e., k → 0. Moreover, as the number of contacts N increases, the competitive advantage is greater for larger values of the reproductive number r. Finally, we explore the trade-off between the average number of tests that our method achieves and the false positive and negative rates, under different values of the parameters λ 1 and λ 2 . Figure 3 In each panel, we either penalize the false negative rate (i.e., we vary λ 1 and set λ 2 = 0) or the false positive rate (i.e., we vary λ 2 and set λ 1 = 0). Accordingly, for the former, we show the false negative rate vs average number of tests (in blue) and, for the latter, we show the false positive rate vs average number of tests (in pink). Here, we set the number of contacts to N = 100 and sample the number of positive infections from a truncated negative binomial distribution with mean r = 2.5 and dispersion parameter k = 0.1. In each experiment, we estimate averages using 100,000 samples. results, which show that, to achieve lower false negative and false positive rates, more tests need to be performed. When trading off the number of tests with the number of false positives (λ 1 = 0, λ 2 > 0), our method gradually changes the average pool size, leading to many possible trade-off points between the number of tests and the false positive rate. When λ 2 takes small values, the optimal solution leads to pool sizes that minimize the number of tests, while the solution consists of pools of two contacts when λ 2 gets significantly larger. When balancing the number of tests with the number of false negatives (λ 1 > 0, λ 2 = 0), there are only two or three possible solutions, with the extreme ones corresponding to pool sizes minimizing the number of tests and pools of size one, i.e., individual testing for all. We noticed that the expected number of false negatives in a pool grows linearly with its size for sizes greater than one, therefore, making the exact size of the pool irrelevant in terms of the expected total number of false negatives. As a consequence, this leads to only a few optimal solutions where some of the contacts are individually tested while the rest of them are split into pools. Our results have direct implications for the allocation of limited and imperfect testing resources in future pandemics whenever there exists evidence of substantial overdispersion in the number of secondary infections. In this context, we acknowledge that more research is needed to more accurately characterize the level of overdispersion in a pandemic. Moreover, it would be interesting to extend our algorithm using distributions other that the negative binomial and practically evaluate our method in randomized control studies. The challenges of containing SARS-CoV-2 via test-trace-and-isolate Iris Pigeot, Anita Schöbel, and Viola Priesemann. The foreshadow of a second wave: An analysis of current covid-19 fatalities in germany The detection of defective members of large populations Group testing in the presence of test error; an extension of the dorfman procedure Optimal risk-based group testing Group testing: an information theory perspective. Now Foundations and Trends Combinatorial group testing and its applications Khadija Said Mohammed, et al. Pooled testing conserves sars-cov-2 laboratory resources and improves test turn-around time: experience on the kenyan coast Multi-stage group testing improves efficiency of large-scale covid-19 screening Pooled testing for expanding covid-19 mass surveillance A pooled testing strategy for identifying sars-cov-2 at low prevalence Simulation of pooled-sample analysis strategies for covid-19 mass testing Increasing testing throughput and case detection with a pooledsample bayesian approach in the context of covid-19. bioRxiv Large-scale implementation of pooled rna extraction and rt-pcr for sars-cov-2 detection Boosting test-efficiency by pooled testing strategies for sars Noisy pooled pcr for virus testing Clustering and superspreading potential of SARS-CoV-2 infections in Hong Kong Too little, too late: The unacceptable neglect of the elderly in care homes during the COVID-19 epidemic in Spain Estimating the overdispersion in COVID-19 transmission using outbreak sizes outside China Characterizing superspreading events and age-specific infectiousness of SARS-CoV-2 transmission in Georgia, USA Identifying and interrupting superspreading events-implications for control of severe acute respiratory syndrome coronavirus 2 Effective reproduction number and dispersion under contact tracing and lockdown on COVID-19 in Karnataka. medRxiv Quantifying the effects of contact tracing, testing, and containment measures in the presence of infection hotspots Middle east respiratory syndrome coronavirus superspreading event involving 81 persons Superspreading and the effect of individual variation on disease emergence Super-spreaders in infectious diseases A generalized binomial group testing problem Evaluation of covid-19 rt-qpcr test in multi sample pools Effectiveness of isolation, testing, contact tracing, and physical distancing on reducing transmission of sars-cov-2 in different settings: a mathematical modelling study Superspreading in early transmissions of covid-19 in indonesia We would like to thank Vipul Bajaj for fruitful discussions.