key: cord-0325796-tk9nnx4l
authors: Skorniakov, V.; Leipus, R.; Juzeliunas, G.; Staliunas, K.
title: Optimal group based testing strategy for detecting infected individuals: comparison of algorithms
date: 2020-06-29
journal: nan
DOI: 10.1101/2020.06.29.20142323
sha: 9281cbfba1cf7a54bf71ea88be97dc19c191d0e1
doc_id: 325796
cord_uid: tk9nnx4l

We investigate a group based testing strategy targeted at identifying infected patients by making use of a medical test which applies equally well to single and pooled samples. We demonstrate that, under the assumed setting, a quicksort-grounded testing algorithm allows one to reduce average costs, and the reduction is very significant when the infection percentage is low. Although the basic idea of test pooling is known, our major novelty is the rigorous treatment of the model. Another interesting insight following from the rigorous analysis is that the average number of tests per individual scales like the entropy of the prevalence of infection. A further motivation for the paper is the context: taking into account the current situation with the coronavirus, dissemination of renowned ideas and the optimisation of algorithms can be of great importance and economic benefit.

Group testing is devoted to the detection of an unknown subset of defective items in a set of objects, using the tests in the most efficient way. It turns out that quite often individual testing can be efficiently replaced by testing of groups. In many settings (and, particularly, in a medical one), adoption of such a strategy results in significant time and cost savings and is, therefore, a desirable step. The concept of group testing was originally introduced by [1] in relation to efficient mass blood testing and later found important applications in molecular biology, quality control, computer science and other fields. In particular, the current COVID-19 epidemic raises the need for quick and cheap PCR testing for the SARS-CoV-2 virus in order to find a rapid exit from the lockdown strategy adopted in many countries. In this paper, we investigate a group based testing strategy aimed at identifying infected patients by making use of a medical test which applies equally well to single and pooled samples. We show that the quicksort-grounded (halving) testing algorithm allows one to reduce average costs, and the reduction is very significant when the infection percentage is low. The main objective of the paper is to recall the main schemes of group testing and to compare them to each other on a rigorous mathematical basis. We hope that, taking into account the current situation with the coronavirus, dissemination of renowned ideas and the optimisation of algorithms can be of great importance and economic benefit.

The remaining part of the paper is organized as follows. Section 2 provides the description of two classical schemes of testing together with preliminaries. In Section 3, we consider the main algorithm of the paper (Scheme C) together with a comparison to the other two schemes. In Section 4, we give an accompanying discussion of the assumptions of our setup and the related literature. Appendices A and B contain mathematical derivations and tables.

Consider the following setup. Assume that the prevalence of some disease (the fraction of infected individuals) is equal to p ∈ (0, 1) in an infinite (or large enough) population. A cohort comprising N independent individuals has to be tested. For this, samples are collected from each individual.
The test to be applied performs equally well for individual and for pooled samples: such a situation occurs, e.g., when the test indicates the presence of the infection in a blood sample, and there is no difference whether the latter is obtained from a single individual or by pooling the samples of a whole cohort. In the described situation, physicians can choose different testing strategies. Let us assume that the following are two possible choices.

Scheme A. Test each patient's sample.

Scheme B. Test the pooled sample of the whole cohort. Test each member of the cohort individually only in case the infection is detected in the pooled sample.

Although it is not obvious at first glance, the second choice is more efficient. To give a rigorous justification, let us formally define the underlying model corresponding to Scheme B. Consider a sample of N individuals. Put X_i = 1, provided the test of the i-th individual is positive, and X_i = 0 otherwise. Let S = S_N = X_1 + · · · + X_N be the total number of infected individuals in the sample, and let T = T_N be the total number of tests applied to the cohort: the test is applied once if the result is negative, and it is further applied to each of the N individuals otherwise, i.e.,

    T = 1 + N 1_{S ≥ 1}.

Assume that X_1, . . . , X_N are independent identically distributed (i.i.d.) random variables, each having Bernoulli distribution Be(p). Then S has Binomial distribution Bin(N, p). Therefore, the average number of tests per cohort is

    E T = 1 + N (1 − q^N),

where q := 1 − p. The average number of tests per individual, say t = t(N), is

    t(N) = E T / N = 1/N + 1 − q^N.    (2.2)

Consider the function t : (0, ∞) → (0, ∞) given in (2.2). By equating its derivative to zero, we see that the stationary points solve the equation

    N = q^{−N/2} (ln(1/q))^{−1/2},    (2.3)

which is a fixed point equation for g(N) = q^{−N/2} (ln(1/q))^{−1/2}, N ∈ (0, ∞), and, hence, can be easily solved iteratively. It is further not difficult to prove that, for p in a region enclosing (0, 0.2), there exists a unique solution N_p > 0 of (2.3) which is a minimizer of t(N) (see Proposition A.1 in Appendix A). Turning back to the economic/biomedical interpretation, we conclude that, for a cohort of ⌊N_p⌋ individuals (here and in the sequel, ⌊y⌋ stands for the integer part of y ∈ ℝ), Scheme B results in the lowest average number of tests per person which is possible when applying a scheme of this type to a population with prevalence p. Scheme A, in contrast, always requires a constant number of tests, 1 per person. Therefore, the average (absolute) gain attained by applying Scheme B instead of Scheme A is given by the difference

    G_p = 1 − t(⌊N_p⌋).

The right panel of Figure 1 shows the graph of p ↦ 100 G_p, p ∈ (0, 0.2), which is the average gain measured by the number of tests saved per 100 individuals. The corresponding values are provided in Table 1 (see Appendix B). The left panel of Figure 1 demonstrates the dependence of the optimal sample size on p. To obtain quick numerical evidence, assume that N is bounded away from zero and pN → 0. Then from (2.3) it follows that the optimal sample size satisfies

    N_p ≈ p^{−1/2}.

Hence, assuming that p is small enough for pN ≈ 0 to hold, the above implies that

    G_p ≈ 1 − 2√p.

For example, if p = 0.01, then G_p ≈ 0.8, i.e., the approximate average gain is 80% or so. In what follows, we retain all the notions and assumptions introduced previously.
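Before moving on, the Scheme B computations above are easy to reproduce numerically. Below is a minimal Python sketch (our illustration, not part of the original paper; the function names t_scheme_b and optimal_pool_size are ours). It iterates the fixed-point map g from (2.3) to locate N_p and then evaluates t and the gain G_p:

    import math

    def t_scheme_b(n: float, p: float) -> float:
        """Average number of tests per person under Scheme B, eq. (2.2):
        t(N) = 1/N + 1 - q^N with q = 1 - p."""
        q = 1.0 - p
        return 1.0 / n + 1.0 - q ** n

    def optimal_pool_size(p: float, n0: float = 2.0, iters: int = 100) -> int:
        """Solve the fixed-point equation (2.3), N = q^(-N/2) * (ln(1/q))^(-1/2),
        by plain iteration (for low p it converges to the smaller fixed point),
        then pick the better of the two neighbouring integers."""
        q = 1.0 - p
        g = lambda n: q ** (-n / 2.0) / math.sqrt(math.log(1.0 / q))
        n = n0
        for _ in range(iters):
            n = g(n)
        lo, hi = math.floor(n), math.ceil(n)
        return lo if t_scheme_b(lo, p) <= t_scheme_b(hi, p) else hi

    if __name__ == "__main__":
        p = 0.01
        n_p = optimal_pool_size(p)
        # expect N_p near 1/sqrt(p) = 10 and G_p near 1 - 2*sqrt(p) = 0.8
        print(f"p = {p}: N_p = {n_p}, t(N_p) = {t_scheme_b(n_p, p):.4f}, "
              f"G_p = {1.0 - t_scheme_b(n_p, p):.3f}")

For p = 0.01 the sketch reproduces the figures quoted above: a pool size of about ten and roughly 80 tests saved per 100 individuals.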
For the sake of convenience, we additionally assume that the size of the cohort to be tested is of the form N = 2^n, n ∈ ℕ. With all this in mind, below is the recommended algorithm for group based testing when p is low (Scheme C).

Step 1. Test the pooled sample of the whole cohort. Proceed to Step 2.
Step 2. If the test is positive, proceed to Step 3; otherwise, finish testing the cohort.
Step 3. Divide the cohort into two parts consisting of the first and the second halves respectively. Apply the whole algorithm to the two obtained parts recursively.

The main features of the above testing algorithm are summarized in Proposition 3.1 below.

Proposition 3.1. Assume the testing Scheme C. Then:

(i) the average number of tests per person is given by

    t(N) = 1/N + ∑_{j=0}^{log₂N − 1} 2^{−j} (1 − q^{2^{j+1}}) = 1/N + 2 ln(1/q) log₂(N) ∫₀¹∫₀¹ q^{x 2^{1+⌊v log₂N⌋}} dx dv;    (3.1)

(ii) the average number of tests per person in the case of an infinitely large cohort is

    t(∞) = 2 − ∑_{j=0}^{∞} 2^{−j} q^{2^{j+1}};

(iii) for a fixed p ∈ (0, 1), the function t : ℕ → (0, ∞) admits at most two minimizers: the value N = N_p corresponding to the optimal sample size is either ⌊N*⌋ or ⌊N*⌋ + 1, where N* = ln 2 / (2 ln(1/q)) solves 1 − 2q^{2N} = 0.

To shed some light on the described testing scheme, we provide several comments on the performance of Scheme C vs. Scheme A and Scheme B. Consider first the limit in (ii) of Proposition 3.1. Obviously, each summand 2^{−j} q^{2^{j+1}} increases to 2^{−j} as q → 1, so that t(∞) → 0. Hence, for q ≈ 1 (or, alternatively, p ≈ 0), t(∞) < 1. The latter means that, when the prevalence is low, the proposed scheme always outperforms the common sequential Scheme A. To gain a quick quantitative insight, assume that p is small enough for pN ≈ 0 to hold. Then, turning to (iii) and taking the 'continuous' (undiscretized) version of N_p equal to ln 2 / (2 ln(1/q)), we obtain the relationship (see Remark A.1, eq. (A.3))

    t(N_p) ≈ −2p log₂ p.    (3.2)

Therefore, an approximation to the average gain G_p = 1 − t(N_p) is 1 + 2p log₂ p. Taking, e.g., p = 0.01 results in G_{0.01} ≈ 0.867. Compared with the analogous example given in Section 2 for Scheme B, the gain increases by close to 7%. In fact, this is not surprising (for a visual comparison of Schemes B and C on the linear and the log-log scale, see Figure 2, and, for the numerical one, Table 1 and Table 2 in Appendix B), since for Scheme B we had G_p ≈ 1 − 2√p, and p log₂ p / √p → 0 as p → 0.

Equality (3.2), however, exhibits some magic flavor. To see this, note that, for p ≈ 0, the entropy I_p of X ∼ Be(p) is asymptotically equivalent to −p log₂ p, since

    I_p = −p log₂ p − (1 − p) log₂(1 − p), with (1 − p) log₂(1 − p) = O(p) = o(p log₂ p) as p → 0.

Consequently, (3.2) means that the average number of tests per individual scales like the entropy of the prevalence of the infection. Keeping the above relationship in view, it is not surprising that a significant number of works ([2], [3], [4]) have approached the testing problem from the Information Theory perspective.

Another connection with Information Theory is due to the Quick Sort (QS) algorithm utilized in Scheme C. It is well known that QS yields the best (up to a constant multiple) possible average performance among comparison based algorithms: to sort an array having N non-constant (i.e., random) items, the smallest average number of comparisons is of the order N ln N [5], and, in fact, all "randomize and conquer" type algorithms (QS being one among the rest) have expected time asymptotically equivalent to that of QS, which randomly splits the sorted set into two equal subsets [6]. Our formula (3.1) is just a confirmation of this well known fact. To see this, note that, in the context of the sorting task, (3.1) gives the average number of comparisons per item. Though the order is correct, we are inclined to think that the multiplier ∫₀¹∫₀¹ q^{x 2^{1+⌊v log₂N⌋}} dx dv can be improved by making use of a QS modification (or another comparison-based algorithm) designed to sort items with a small number of possible values (in our case, there are just two: "sick" and "healthy"). On the other hand, we think that the order is optimal: though there are algorithms which can beat the order of QS when sorting integers, e.g., [7], [8], they operate in a different, non-comparison-based mode. In our case, however, comparison is predefined by the setting of the problem at hand: we assume that biomedical tests can only be carried out by making use of comparison.
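To complement the formulas with code, the following Python sketch (again our illustration, not from the paper; the names are ours) implements Scheme C literally as a recursion and checks the first equality of (3.1), together with the entropy scaling (3.2), by simulation:

    import math
    import random

    def halving_tests(samples):
        """Scheme C: test the pooled sample; if it is positive and the group
        has more than one member, recurse on both halves. Returns the number
        of tests spent on this cohort (True marks an infected sample)."""
        tests = 1                      # one test for the pooled sample
        if any(samples) and len(samples) > 1:
            mid = len(samples) // 2
            tests += halving_tests(samples[:mid]) + halving_tests(samples[mid:])
        return tests

    def t_scheme_c(n_exp: int, p: float) -> float:
        """Exact average number of tests per person, first equality of (3.1),
        for N = 2**n_exp."""
        q, N = 1.0 - p, 2 ** n_exp
        return 1.0 / N + sum(2.0 ** (-j) * (1.0 - q ** (2 ** (j + 1)))
                             for j in range(n_exp))

    if __name__ == "__main__":
        p, n_exp, trials = 0.01, 5, 20000          # cohorts of N = 32
        N = 2 ** n_exp
        sim = sum(halving_tests([random.random() < p for _ in range(N)])
                  for _ in range(trials)) / (trials * N)
        print(f"simulated t(N)            = {sim:.4f}")
        print(f"formula (3.1) t(N)        = {t_scheme_c(n_exp, p):.4f}")
        print(f"entropy scale -2p*log2(p) = {-2 * p * math.log2(p):.4f}")

With p = 0.01 and N = 32, the simulated and exact values of t(N) agree and both are close to −2p log₂ p ≈ 0.133, i.e., roughly 13 tests per 100 individuals instead of 100.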
Assumptions. At first glance, the independence assumption seems too restrictive; it would appear more natural to assume that the individuals forming the tested cohort are related. However, an application of the proposed procedure is most effective when the prevalence is low (p ≈ 0). In such a case, under the classical assumption pN → λ > 0, the number of infected individuals S_N can be well approximated by the Poisson distribution Pois(pN), and the approximation remains quite accurate irrespective of the nature of the dependence exhibited by the summands (see [9], [10], [11] and references therein for results of this kind with possible extensions of the classical setting); a quick numerical sketch of this approximation is given at the end of this section.

Another important assumption is that of constant prevalence p. When the testing is applied in a rapidly evolving pandemic environment, it would be more natural to assume dependence on time: p = p(t). This, however, requires switching to a more complex stochastic process model, and it is difficult to say what is more reasonable: to assume the validity of our simple setting for a short period of time, or to look for a more complicated yet more realistic model requiring longer and more elaborate analysis as well as suitable data.

Related works. The ideas underlying the testing problem covered in our paper are far from new. For example, Scheme B dates back at least to 1943 (see [1]) and then reappears, with modifications and further extensions, not only in the biomedical context but in the engineering sciences as well: [12], [13], [14]. Scheme C is currently afresh discussed by Gollier and Gossner [15], Mentus et al. [16] and Shani-Narkiss et al. [17]. For an older reference, discussing the case of a nonhomogeneous population (i.e., one in which the probability p of being infected may vary across individuals) and containing quite a large body of applied literature on the halving algorithm corresponding to Scheme C, we refer to [18]. Our major input is that, in contrast to the mentioned references, which discuss the procedure and provide instructions suitable for the practical application of Scheme C with only a brief theoretical background, we establish its properties in a rigorous fashion. To the best of our knowledge, the reference [19] is the only work close to ours both in the nature of the investigations and in the results. However, in that paper, the authors focus on the treatment of the asymptotic regime of Scheme C when p → 0.

To finish our short discussion, it is also worth mentioning that there is a large body of papers of an applied nature, prevalent in the medical, engineering and other literature, devoted to pooling in various settings (see, e.g., [20], [21] among others). Writing the paper, we aimed to recall the ideas of group testing, highlight directions for improvements, and promote their dissemination, which seems to be of utmost importance in the present context, which has forced a burst of papers devoted to similar problems (see, e.g., [22], [23], [24], [25], [26], [27], [28]).
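As for the numerical sketch promised in the assumptions paragraph above: in the i.i.d. case the quality of the Poisson approximation can be checked directly (our own Python illustration; the dependent case is covered by the cited references):

    from math import comb, exp, factorial

    def binom_pmf(k: int, n: int, p: float) -> float:
        """P(S_N = k) for S_N ~ Bin(n, p)."""
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    def pois_pmf(k: int, lam: float) -> float:
        """P(Z = k) for Z ~ Pois(lam)."""
        return exp(-lam) * lam ** k / factorial(k)

    N, p = 100, 0.01     # low prevalence: pN = 1
    for k in range(4):
        print(f"k={k}: Bin = {binom_pmf(k, N, p):.5f}  Pois = {pois_pmf(k, N * p):.5f}")

    # summarize the error by the total variation distance
    tv = 0.5 * sum(abs(binom_pmf(k, N, p) - pois_pmf(k, N * p)) for k in range(N + 1))
    print(f"total variation distance = {tv:.5f}")

The total variation distance here stays below Np² = 0.01 (the classical Le Cam bound), so replacing Bin(N, p) by Pois(pN) is indeed harmless at low prevalence.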
Appendix A

Proof of Proposition 3.1. (i) Under Scheme C, the pooled sample of a group at level k ∈ {1, . . . , n} (a group of size 2^{n−k}) is tested if and only if its parent group of size 2^{n−k+1} contains at least one infected individual. Taking expectations yields

    E T_N = 1 + ∑_{k=1}^{n} 2^k (1 − q^{2^{n−k+1}}),    (A.1)

hence, after dividing by N = 2^n and substituting j = n − k, the first equality in (i). For the second one, take the last sum above and continue as follows:

    ∑_{j=0}^{n−1} 2^{−j} (1 − q^{2^{j+1}}) = 2 ln(1/q) ∑_{j=0}^{n−1} ∫₀¹ q^{x 2^{j+1}} dx = 2 ln(1/q) log₂(N) ∫₀¹∫₀¹ q^{x 2^{1+⌊v log₂N⌋}} dx dv.    (A.2)

(ii) By the first equality in (i),

    t(∞) = lim_{n→∞} t(2^n) = ∑_{j=0}^{∞} 2^{−j} (1 − q^{2^{j+1}}) = 2 − ∑_{j=0}^{∞} 2^{−j} q^{2^{j+1}}.

(iii) Since N = N(n) = 2^n, the first equality in (i) gives

    ∆_n := t(2^{n+1}) − t(2^n) = 2^{−(n+1)} (1 − 2q^{2N}).

Clearly, q^{2N} ↓ 0 as N → ∞, so ∆_n is negative for small n and non-negative for large n. Therefore, there exist no more than two N_p ∈ ℕ such that ∆_n ≤ 0 for all N ≤ N_p and ∆_n ≥ 0 for all N ≥ N_p, and t attains its minimal value at N_p. To find N_p, we first solve 1 − 2q^{2N} = 0 with respect to N, and then choose from the two nearest integers (i.e., ⌊N⌋, ⌊N⌋ + 1) the one which minimizes t(N). □

Remark A.1. Note that, if N ≫ 1 and pN → 0, then for t(N) in (3.1) it holds that

    t(N) ≈ 1/N + 2p log₂ N and, for N_p = ln 2 / (2 ln(1/q)), t(N_p) ~ −2p log₂ p as p → 0.    (A.3)

To see this, it suffices to use the following bounds:

    2^{j+1} p q^{2^{j+1}} ≤ 1 − q^{2^{j+1}} ≤ 2^{j+1} p,

which follow from the identity 1 − q^m = p (1 + q + · · · + q^{m−1}).

The next proposition justifies the discussion of Scheme B presented in the introductory Section 2.

Proposition A.1. Let p ∈ A := (0, 1 − e^{−4/e²}) be fixed. Consider the function g_p(N) = g(N) = q^{−N/2} (ln(1/q))^{−1/2}, N > 0. It admits a unique fixed point N_p which minimizes t(N) given by (2.2).

Proof. Step 1 (fixed points). Define f(N) = N − g(N), N > 0. Then f(0+) = −(ln(1/q))^{−1/2} < 0, f(N) → −∞ as N → ∞, and the derivative f′(N) = 1 − (1/2) (ln(1/q))^{1/2} q^{−N/2} vanishes at the single point

    N_max = (2 / ln(1/q)) ln(2 (ln(1/q))^{−1/2}),

with f(N_max) = N_max − 2/ln(1/q) > 0 if and only if ln(1/q) < 4/e², which is satisfied for any p ∈ A. Therefore, f attains its maximal value at N_max and f(N_max) > 0. Consequently, f has two zeroes: N_p ∈ (0, N_max) and Ñ_p ∈ (N_max, ∞). The latter means that g has two fixed points.

Step 2 (minimizer). In this step, we show that N_p from Step 1 is the minimizer of t(N) given in (2.2). By (2.3),

    t′(N) = −1/N² + q^N ln(1/q),

hence t′(N) < 0 ⟺ N < g(N) ⟺ f(N) < 0. Consequently, t decreases on (0, N_p), increases on (N_p, Ñ_p), and decreases again on (Ñ_p, ∞) towards t(∞) = 1; since t(N_p) < 1 for the considered range of p, N_p is the unique minimizer of t. □

Appendix B

In the tables below, the following information is provided:

• Column 'N_p' shows the optimal sample size corresponding to p ranging in the interval given in the column 'Range of p'.

• Column 'Range of 100 G_p' shows the average gain (as defined in the main body of the paper) per 100 individuals corresponding to the values of p and N_p given in the two leading columns. The highest gain corresponds to the lowest p in the corresponding interval.
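The sign-change argument behind part (iii) of Proposition 3.1 is also easy to verify numerically. The Python sketch below (ours; t_scheme_c from the earlier sketch is repeated to keep it self-contained) tabulates ∆_n next to 1 − 2q^{2N} and prints the continuous optimum N* solving 1 − 2q^{2N} = 0:

    import math

    def t_scheme_c(n_exp: int, p: float) -> float:
        """t(N) for N = 2**n_exp, first equality of (3.1)."""
        q, N = 1.0 - p, 2 ** n_exp
        return 1.0 / N + sum(2.0 ** (-j) * (1.0 - q ** (2 ** (j + 1)))
                             for j in range(n_exp))

    p = 0.01
    q = 1.0 - p
    n_star = math.log2(math.log(2.0) / (2.0 * math.log(1.0 / q)))  # N* = 2**n_star
    print(f"continuous optimum N* = {2 ** n_star:.2f}")
    for n in range(1, 9):
        delta = t_scheme_c(n + 1, p) - t_scheme_c(n, p)            # Delta_n
        sign_pred = 1.0 - 2.0 * q ** (2 * 2 ** n)                  # same sign as Delta_n
        print(f"n={n}  N={2 ** n:3d}  t={t_scheme_c(n, p):.4f}  "
              f"Delta_n={delta:+.5f}  1-2q^(2N)={sign_pred:+.4f}")

For p = 0.01 the sign of ∆_n flips between n = 5 and n = 6 (N* ≈ 34.5), exactly where 1 − 2q^{2N} changes sign, in agreement with the proof above.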
References

The detection of defective members of large populations.
Entropy-based optimal group-testing procedures.
New procedures for group-testing based on the Huffman lower bound and Shannon entropy criteria.
Group testing: An information theory perspective.
Introduction to Algorithms.
A simple expected running time analysis for randomized "divide and conquer" algorithms.
Sorting in linear time?
Randomized sorting in O(n log log n) time and linear space using addition, shift, and bit-wise Boolean operations.
Poisson approximation for dependent trials.
Approximation Methods in Probability Theory.
Group testing to eliminate efficiently all defectives in a binomial sample.
Optimal group testing.
Dorfman and R1-type procedures for a generalized group-testing problem.
Group testing against Covid-19.
Analysis and applications of adaptive group testing method for COVID-19.
Efficient and practical sample pooling for High-Throughput PCR diagnosis of COVID-19.
Group testing in heterogeneous populations by using halving algorithms.
Asymptotic analysis of optimal nested group-testing procedures.
Screening for the presence of a disease by pooling sera samples.
Pooled nucleic acid testing to identify antiretroviral treatment failure during HIV infection.
Optimal group testing to exit the Covid confinement.
Pool testing of SARS-CoV-2 samples increases test capacity.
A note on Double Pooling Tests.
Evaluation of COVID-19 RT-qPCR test in multi-sample pools.
Accelerated testing for COVID-19 using group testing.
Evaluation of group testing for SARS-CoV-2 RNA.
Pooled testing with replication: a mass testing strategy for the COVID-19 pandemics.
COVID-19 testing.