key: cord-0594035-y4h0upd5 authors: Broder, Andrei Z.; Kumar, Ravi title: A Note on Double Pooling Tests date: 2020-04-03 journal: nan DOI: nan sha: 8a8387828719f01f07d32a0b96cc434d53d4257d doc_id: 594035 cord_uid: y4h0upd5 We present double pooling, a simple, easy-to-implement variation on test pooling, that in certain ranges for the a priori probability of a positive test, is significantly more efficient than the standard single pooling approach (the Dorfman method). The concept of test pooling was apparently invented by Robert Dorfman [Dor43] in 1943 who suggested that it would be more effective to test WW2 would-be recruits for syphilis by mixing the blood samples of several recruits and test the pool for antigens. If the pool tests negative then all the pool members are deemed healthy; otherwise, each member of the pool is tested separately. A simple analysis (see next section) shows that for a given probability p that a recruit is infected there is an optimum pool size s 1 (p) that minimizes the expected number of needed tests. The lower the p, the larger the s 1 (p) and the lower the expected number of tests required. Dorfman's analysis has been further refined and generalized to deal with various problems such as false negatives [LLZA12] and studied as part of the broad topic of Combinatorial Group Testing [DH93, AJS19] . Note that some recursive and adaptive approaches dear to computer scientists, such as binary search, often may not work for this problem: there are pragmatic limitations on (a) the size of the pool beyond which dilution results in too many false negatives; (b) the number of samples available from a given specimen; and (c) the total time required to produce an answer. Nevertheless the emergence of the COVID-19 pandemic and the cost and scarcity of tests for the underlying virus has revived an enormous interest in test pooling. For COVID-19 test pooling has been shown to be doable with pools as large as 64 [YAST + 20] and is already in use in several countries including Germany [SCea] and Israel [YAST + 20]. The purpose of this note is to propose a simple variation on Dorfman's approach that we call double pooling; for clarity, we refer to Dorfman's method as single pooling. Double pooling works as follows: given a probability p of a positive test, pick an optimal size s 2 (p) for the pool size. (The optimal s 2 (p) is larger than the corresponding optimal s 1 (p) for single pooling.) Divide the population to be tested into non-overlapping pools of size s 2 (the division is assumed to be random) twice. Thus, now every patient belongs to two pools and is tested in two parallel rounds, A and B. For every patient if both the pools test positive then test the patient individually. Otherwise consider that patient cleared. If the pool tests do not ever produce false negatives the algorithm is clearly correct. (The false positives only reduce efficiency.) It turns out the double pooling is particularly advantageous for p < 2% corresponding to testing a large population of asymptomatic patients but it is more efficient than single pooling even for p = 10%. We will discuss double pooling in the next section in more detail, but to see its advantages and build an intuitive understanding we start with an example. Assume that p = 0.01112. It turns out s 1 = 10 is the optimal size for single pooling with this p and results in an expected cost of ≈ 0.206 tests/patient, a nice improvement over testing everyone. However using double pooling the optimum s 2 (p) is 23 and the expected cost further declines to just 0.145, an almost 30% improvement. (These quantities will be obvious from our analysis later.) At first blush these gains might seem surprising, but here is a quick-and-dirty computation: Assume that we are testing 1000 patients and remember p ≈ 0.011 so we will posit we have exactly 11 positive patients. • For single pooling, since s 1 = 10, we start with 100 tests. Assuming all positive cases end in separate pools (an upper bound) we will need to do another 110 tests to deal with all the suspicious cases, hence 210 total tests (which is close enough to 206.) • For double pooling, since s 2 = 23, we do twice 44 tests with 23 patients each (88 tests). In each round at most 11 tests will come back positive raising suspicions about a total of 22 × 11 = 242 healthy patients. Thus a given healthy patient has probability 0.24 to be a suspect in Round A and the same 0.24 probability in round B. These are quasi-independent, hence the probability of being suspected twice is only 0.24×0.24 = 0.0576. Thus we expect to have less than 57 (≈ 0.0576×(1000−11)) healthy patients that were in a positive pool in both rounds and will have to be retested. In addition the 11 truly positive patients will be retested as well. Thus the total number of tests is 88 + 57 + 11 = 156 (which is reasonably close to the claimed 145 since we overestimated and also because in fact we now cover 44 × 23 = 1012 patients). In conclusion, the "magic" of double pooling comes from a paradigm that has been observed in many other situations, e.g., Bloom filters [Blo70, BM03] and balanced allocations [ABKU99, Mit01] . Although the probability of being "unlucky" in a given trial might be high, the probability of being unlucky in two or more independent trials decreases dramatically. Consider the expected cost attributable to one patient in the single pooling situation, where the size of the pool is s: • if the patient is positive, then the cost is 1/s + 1 (the patient's share of the pool + their individual test); • if the patient is negative, then the cost is 1/s + 1 × (1 − (1 − p) (s−1) ) (the patient's share of the pool + their individual test iff not all the other patients are healthy). Since the probability of being positive is p, the total expected cost per patient in this case is To determine the pool size that minimizes the total for a given p, we take the derivative of the cost with respect to s: and set it to 0. The solution of interest can be expressed in terms of the Lambert W function 1 namely Let us define p 10 as the value of p for which the optimum s 1 is exactly 10. It turns out that p 10 ≈ 0.01112 which is the value we used in the introductory example. More generally, Figure 1 shows the optimum integer s 1 as a function of p. Let us turn to double pooling: now each patient will be assigned to two random pools each of size s. A patient will be tested individually iff both their pools test positive. Again let us look at the expected cost induced by the testing of one patient: • if the patient is positive, then the cost is 2/s + 1 (the patient's share of the two pools + their individual test); • if the patient is negative, then the cost is 2/s + 1×(1 − (1 − p) (s−1) ) 2 (the patient's share of the pools + their individual test iff both pools test positive). Hence the total expected cost is where for brevity q stands for 1 − p. As before, to determine the pool size that minimizes the total cost for a given p, we take the partial derivative of the cost with respect to s: and set it to 0. This has to be solved numerically at each p. Solving this at p = p 10 ≈ 0.01112 we see that the optimum s 2 = 23. More generally, Figure 2 shows the optimum integer s 2 as a function of p and Figure 3 shows the expected cost per patient tested as a function of p for both single and double pooling using optimal integer values of s 1 (p) and s 2 (p). In principle we can generalize double pooling to k-pooling, whereby each patient participates in k independent pools in k parallel rounds. The expected cost becomes Depending on p this can yield further improvements but they are probably impractical especially if p gets larger. (With triple testing for p = p 10 and about 1000 samples we would need only 128 tests with pools of size 36, and with quadruple testing only 122 tests with pools of size 47.). Even more asymptotically efficient tests can be constructed [MT11, PR11] , but it is unclear if they can be practical. We presented double pooling, a simple, easy-to-implement variation on test pooling, that in certain ranges for p, the a priori probability of a positive tests, is significantly more efficient than the standard single pooling approach. Figure 4 shows the percentage of savings of double pooling over single pooling as a function of p. We can see that double pooling is particularly advantageous for p below 2% corresponding to large scale testing of asymptomatic patients, but is still at least 10% better than single pooling all the way up to p = 5.4%. Our analysis assumes sampling from an infinite distribution, but in practice, double pooling can be implemented after accumulating a fairly small collection of samples. (There is a small efficiency penalty due to the correlation between rounds that we will discuss in the final version.) The main disadvantage of double pooling is that it is more sensitive to dilution-induced false negatives for two reasons: one physical: the pools used are larger; and one mathematical: a true positive sample will be missed if either of its two pools produces a false negative. We will discuss this further in the final version. We are reaching out to our colleagues in the medical field to find out whether double pooling is practically usable for COVID testing and will update our note with their feedback. Presently, there is an extraordinary flurry of activity and independent work on group testing for COVID. This includes an analysis of a single pooling method [Gol20] and a proposal based on binary search [Gos20] . It might well be the case that independent researchers have already obtained the same results presented here. We are encouraging members of the community to send us their comments and feedback. Balanced allocations Group testing: An information theory perspective. Foundations and Trends in Communications and Information Theory Space/time trade-offs in hash coding with allowable errors Network applications of Bloom filters: A survey The detection of defective members of large populations Optimal group testing to exit the COVID confinement Group testing against COVID-19 Optimality of group testing in the presence of misclassification The power of two choices in randomized load balancing Group testing with random pools: Optimal twostage algorithms Explicit nonadaptive combinatorial group testing schemes Pool testing of SARS-CoV-2 samples increases test capacity Evaluation of COVID-19 RT-qPCR test in multi-sample pools. medRxiv We thank our colleagues Fernando Pereira and Tamás Sarlós for many useful comments.