key: cord-0575869-7epiuixu authors: Bergel, Itsik title: Variable pool testing for infection spread estimation date: 2020-04-07 journal: nan DOI: nan sha: 44138c327f3da5384474c0f41f97d965ad019770 doc_id: 575869 cord_uid: 7epiuixu

We present a method for efficient estimation of the prevalence of infection in a population with high accuracy using only a small number of tests. The presented approach uses pool testing with a mix of pool sizes. The test results are then combined to generate an accurate estimate over a wide range of infection probabilities. This method does not require an initial guess of the infection probability. We show that, using the suggested method, even a set of only $50$ tests with a total of only $1000$ samples can produce a reasonable estimate over a wide range of probabilities. A measurement set with only $100$ tests is shown to achieve $25\%$ accuracy over infection probabilities from $0.001$ to $0.5$. The presented method is applicable to COVID-19 testing.

Pool testing is used for various infectious diseases [3], [4], and was proven to work also for RT-qPCR tests [5], [6]. Furthermore, pool testing was recently demonstrated successfully for the SARS-CoV-2 pathogen of COVID-19 [7] for pool sizes of up to 64 samples. Pool sampling strategies for COVID-19 (with fixed pool size) were also studied in [8]. The literature on pool testing is quite diverse, both in the medical literature (e.g., [9], [10]) and in data processing contexts (e.g., [11], [12]).

Pool testing is commonly used for efficient detection of infected samples when the infection probability is small. It is also used for efficient estimation of the prevalence of a rare disease. However, using pool testing for prevalence estimation requires choosing the pool size in accordance with the expected prevalence. This is problematic, as such tests are often performed without prior knowledge of the tested group.
For example, [13] suggested sequential pool testing, where at each stage the estimate of the infection probability is improved and used to design the next stage. Yet, the test accuracy still strongly depends on the quality of the initial guess. So far, there is no simple and efficient method to choose pool sizes that will bring accurate estimation over a wide range of infection probabilities in a single batch.

In this work we present a method for efficient estimation of the prevalence of infection in a population using a small number of tests. The presented approach uses a mix of pool sizes, ranging from single-sample tests to very large pools. The test results are then combined to generate an accurate estimate over a wide range of infection probabilities. This solves the problem of choosing the pool size (which requires an initial guess of the probability). As an example, using only 100 tests, we can estimate the infection probability to an accuracy of ±25% over the whole probability range from $10^{-3}$ to $0.5$. In the following we first consider pool testing with a fixed pool size, and then present the variable pool size approach.

We consider the estimation of the spread of a disease in a given population. Denote the population size by $L$ and the number of infected by $L_i$. We define the probability of finding an infected sample by $p = L_i / L$. We consider the estimation accuracy given a limited number of tests $T$. With a fixed pool size $N$, the result of test $i$, $\delta_i$, is a binary random variable with $P(\delta_i = 1) = 1 - (1-p)^N$. Note that $\delta_i = 1$ indicates a positive test for the disease. Considering a maximum likelihood (ML) estimator, we have:

$$\hat{p} = \arg\max_{p} \prod_{i=1}^{T} \left(1-(1-p)^{N}\right)^{\delta_i} (1-p)^{N(1-\delta_i)}. \qquad (2)$$

Let $w = \sum_{i=1}^{T} \delta_i$. We take the derivative of the log of (2) with respect to $p$ and compare to zero:

$$\frac{w N (1-p)^{N-1}}{1-(1-p)^{N}} - \frac{(T-w) N}{1-p} = 0, \quad \text{which gives} \quad \hat{p} = 1 - \left(1 - \frac{w}{T}\right)^{1/N}.$$

The accuracy of this estimation is presented in Fig. 1. We measure the accuracy relative to the actual infection probability. The root mean square error is defined as $\mathrm{RMSE} = \sqrt{E\left[(\hat{p} - p)^2\right]}$ and is evaluated using Monte Carlo simulation. The estimation accuracy is defined as $\eta = \mathrm{RMSE} / p$. Note that the accuracy is better if $\eta$ is smaller.
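As a minimal sketch of the fixed pool size case, the following Python code assumes that each pooled test is positive with probability $1-(1-p)^N$ (a large population and error-free tests), uses the resulting closed-form ML estimate $\hat{p} = 1-(1-w/T)^{1/N}$, and estimates the relative error $\eta = \mathrm{RMSE}/p$ by Monte Carlo simulation. The function names, the seed, and the parameter values are illustrative assumptions and are not taken from the original simulations.

```python
import math
import random


def ml_estimate_fixed_pool(w, T, N):
    """Closed-form ML estimate of p from w positive pools out of T pools of size N."""
    return 1.0 - (1.0 - w / T) ** (1.0 / N)


def relative_rmse(p, T, N, reps=10_000, seed=0):
    """Monte Carlo estimate of eta = RMSE / p for a fixed pool size N."""
    rng = random.Random(seed)
    # Assumed test model: a pool is positive if at least one of its N samples is infected.
    p_pos = 1.0 - (1.0 - p) ** N
    sq_err = 0.0
    for _ in range(reps):
        w = sum(rng.random() < p_pos for _ in range(T))  # number of positive pools
        sq_err += (ml_estimate_fixed_pool(w, T, N) - p) ** 2
    return math.sqrt(sq_err / reps) / p


if __name__ == "__main__":
    # Illustrative comparison of three pool sizes at p = 0.01 with T = 100 tests.
    for N in (1, 10, 100):
        print(N, relative_rmse(p=0.01, T=100, N=N, reps=2000))
```

Running the loop for several pool sizes at a fixed $p$ gives a quick way to see how strongly $\eta$ depends on the match between $N$ and $p$.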
All Monte Carlo simulations in this work use $10^4$ repetitions. If the pool size is too small, then the probability of getting a positive test is still too small and the efficiency is reduced. On the other hand, if the pool size is too large, almost all tests will turn positive and the accuracy degrades very fast. Thus, to have an efficient test, one must match the pool size to an initial guess of the actual probability. To avoid the need for such a guess, the next sub-section presents an efficient method for estimation with variable pool sizes.

We next consider a general pooling scheme where the pool of test $i$ is of size $N_i$. Thus, $\delta_i$ is a binary random variable with

$$P(\delta_i = 1) = 1 - (1-p)^{N_i}$$

(and again, $\delta_i = 1$ indicates a positive test for the disease). The ML estimate of the infection probability, $p$, from a set of $T$ tests with pool sizes $N_1, N_2, \ldots, N_T$ is given by:

$$\hat{p} = \arg\max_{p} \prod_{i=1}^{T} \left(1-(1-p)^{N_i}\right)^{\delta_i} (1-p)^{N_i(1-\delta_i)}. \qquad (8)$$

April 8, 2020 DRAFT

Taking the derivative of the log of (8) and comparing to zero:

$$\sum_{i=1}^{T} \left[ \frac{\delta_i N_i (1-p)^{N_i-1}}{1-(1-p)^{N_i}} - \frac{(1-\delta_i) N_i}{1-p} \right] = 0.$$

Thus, the ML estimator is the solution to

$$\sum_{i=1}^{T} N_i \left( 1 - \frac{\delta_i}{1-(1-\hat{p})^{N_i}} \right) = 0. \qquad (11)$$

In this case, the ML estimator does not have a closed-form expression. Yet, the left-hand side of Equation (11) is monotonically increasing with $\hat{p}$. Hence, the ML estimator can be efficiently calculated by solving (11) using a binary search.

As shown above, the probability estimate will benefit most from a pool size which is approximately $N = 1/p$. In this approach we wish to measure a large range of possible infection probabilities, $p$. Thus, we need to use pools with a wide range of sizes. To do so, we suggest selecting the pool sizes in a logarithmic manner, that is:

$$N_i = \lceil N_0 \cdot q^{i} \rfloor$$

for $i = 0, \ldots, T-1$, where $N_0$ is the size of the smallest pool and $q > 1$ is the logarithmic spacing. We use the notation $\lceil \cdot \rfloor$ to indicate rounding to the nearest integer. If the desired range of measured probabilities is $p_{\min} < p < p_{\max}$, then it is important to have $N_0 < 1/p_{\max}$ and $N_0 \cdot q^{T-1} > 1/p_{\min}$. Thus, for a fixed number of tests, the choice of $q$ represents a tradeoff between a large measurement range and the measurement accuracy.
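The variable pool size procedure can be sketched in Python as follows. The sketch assumes error-free tests with $P(\delta_i = 1) = 1-(1-p)^{N_i}$, writes the ML condition in the equivalent form $\sum_i N_i \left(1 - \delta_i / (1-(1-p)^{N_i})\right) = 0$, whose left-hand side increases with $p$, and solves it by bisection. The function names, the handling of the all-negative and all-positive cases, and all parameter values are illustrative assumptions rather than the author's implementation.

```python
import math
import random


def log_spaced_pool_sizes(n0, q, T):
    """Pool sizes N_i = n0 * q**i rounded to the nearest integer, for i = 0..T-1."""
    return [math.floor(n0 * q ** i + 0.5) for i in range(T)]


def ml_score(p, pools, results):
    """Sum_i N_i * (1 - delta_i / (1 - (1-p)**N_i)); increasing in p, zero at the ML estimate."""
    total = 0.0
    for n, d in zip(pools, results):
        total += n
        if d:
            total -= n / (1.0 - (1.0 - p) ** n)
    return total


def ml_estimate(pools, results, iters=60):
    """Solve ml_score(p) = 0 on (0, 1) by bisection."""
    if not any(results):
        return 0.0  # assumed convention: no positive pool gives an estimate of 0
    if all(results):
        return 1.0  # assumed convention: all pools positive gives an estimate of 1
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ml_score(mid, pools, results) < 0.0:
            lo = mid  # score is negative below the root
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    rng = random.Random(1)
    p_true = 0.02  # hypothetical infection probability
    pools = log_spaced_pool_sizes(n0=1, q=1.2, T=40)  # hypothetical design
    results = [rng.random() < 1.0 - (1.0 - p_true) ** n for n in pools]
    print(ml_estimate(pools, results))
```

When all pools have the same size, the bisection result coincides with the closed-form fixed pool size estimate, which gives a simple consistency check. The parameter values in this sketch are illustrative only and are not the ones used for the figures.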
This is demonstrated in Fig. 2 for the values of $q$ given above. Indeed, we see that the estimation accuracy starts to deteriorate around $1/(N_0 \cdot q^{T-1})$, that is, around 0.009, 0.001 and 0.00013 respectively. Note that even with the largest $q$ in this simulation, the accuracy of ±40% is quite good, as this error is obtained in a measurement that covers three orders of magnitude of the actual probability.

We next present a last set of simulations that demonstrates the efficiency of the suggested approach for a fast and efficient measurement of the infection probability. We choose the measurement design to cover the range $10^{-3} < p < 0.5$. Using $T = 100$ tests, we use $N_0 = 1$ and $q = 1.085$ (such that $q^{T-1} \approx 10^{3.5}$). The resulting accuracy is depicted by the solid line with x-markers in Fig. 3. The figure demonstrates that using only 100 tests, we can get an estimation accuracy of 25% over the whole probability range.

The main drawback of this approach is that it requires many samples. The example above used 100 tests, but required a total of 40,439 samples (mixed in the various pools). This is sometimes a problem, as it may be difficult to persuade this many people to be tested. Thus, the figure also depicts the accuracy when the measurement is limited to only 1000 samples. In this case, we adjust $q$ to achieve a logarithmic spacing such that the sum of all pool sizes is 1000 (i.e., $q = 1.03708$).

We presented a method for efficient estimation of the infection probability using a small number of tests. The method is based on pool testing with variable pool sizes. It is shown that a proper choice of pool sizes leads to accurate estimates even with a small number of tests. For example, using 100 tests was shown to achieve 25% accuracy over a wide range of actual infection probabilities. Even a set of only 50 tests over only 1000 samples was shown to produce a reasonable estimate.
Further research is required in order to accommodate false alarm (false positive) and missed detection (false negative) probabilities. In particular, it can be assumed that these error probabilities increase with the pool size and hence affect the choice of pool sizes. Pool testing for COVID-19 has so far been demonstrated for up to 64 samples in a pool, with small enough errors [7]. The data in that work can give an initial estimate of the behavior of the error probabilities, and hence can be used to improve the estimation and the pool size design.

[1] The detection of defective members of large populations
[2] Pooled-testing procedures for screening high volume clinical specimens in heterogeneous populations
[3] Screening for the presence of a disease by pooling sera samples
[4] A methodology for deriving the sensitivity of pooled testing, based on viral load progression and pooling dilution
[5] High-throughput pooling and real-time PCR-based strategy for malaria detection
[6] Evaluation of the pooling of swabs for real-time PCR detection of low titre shedding of low pathogenicity avian influenza in turkeys
[7] Evaluation of COVID-19 RT-qPCR test in multi-sample pools
[8] Evaluation of group testing for SARS-CoV-2 RNA
[9] Blood donor screening with cobas s 201/cobas TaqScreen MPX under routine conditions at German Red Cross institutes
[10] A general regression framework for group testing data, which incorporates pool dilution effects
[11] Boolean compressed sensing and noisy group testing
[12] Adaptive Bayesian group testing: Algorithms and performance
[13] Sequential prevalence estimation with pooling and continuous test outcomes