key: cord-1008806-w9mb1e69 authors: Chen, J. Y.; Chen, A. S. C. title: Optimal Pool Size for COVID-19 Group Testing date: 2020-05-01 journal: nan DOI: 10.1101/2020.04.26.20076265 sha: 6ead6fff10a11a1bdbe42a7360bcf6e68e36e2e6 doc_id: 1008806 cord_uid: w9mb1e69

This paper presents an analytical formulation for determining the optimal pool size in the initial pooling stage and the subsequent retest stages for COVID-19. A generalized constant-compaction approach confirms the efficiency of halving the targeted population between retest stages. An analytical gain formula is derived to aid future test designs. It is observed that the optimal gain relies on the proper choice of the initial pool size. This optimal compaction scheme could outperform the conventional algorithms in most cases and may provide a mathematically native road map for operating beyond the standard super-even-number-based (64, 32, 16, 8, 1) group testing algorithms.

Group (or pooled) testing for a disease (or other attribute) involves testing "pools" of individual samples. If a pool tests negative, then all samples therein are assumed negative; if it tests positive, then further pooled or individual tests ensue. Group testing was first studied in the combinatorial context in 1962, with the introduction of a multi-stage algorithm [1, 2]. The game of improving group testing efficiency is largely driven by the prevalence rate: the strong appeal of pooled testing is that it can significantly reduce the number of tests, and the associated costs, when the prevalence of a disease is small.

The 2020 COVID-19 pandemic calls for timely, extensive testing of a large populace. The lack of available tests aggravates the situation around the world, and hence this study. At the time of writing, Germany and Israel have reported group testing efforts. This study aims to develop a strategy for choosing the optimal pool size and compaction factor for each stage to further improve testing efficiency.

The main parameters used in the model are: $p_i$, the prevalence at stage $i$; $b_i$, the pool size for stage $i$; $n_i$, the number of individual samples to be tested in stage $i$; and $N$, the total number of people in the target population to be tested. It is obvious that $n_1 = N$: the full population is targeted in stage 1.

The probability of a particular pool testing negative in stage $i$ is $(1 - p_i)^{b_i}$, so the number of pools testing negative in stage 1 is $(n_1/b_1)(1 - p_1)^{b_1}$. It is productive to define a compaction factor $c_1 \equiv n_2/n_1$ for stage 1. In general, the compaction factor in stage $i$ is

  $c_i \equiv \frac{n_{i+1}}{n_i} = 1 - (1 - p_i)^{b_i}$.   (1)

As a result of a compaction in stage $i$,

  $n_{i+1} = c_i\,n_i$,   (2)

  $p_{i+1} = p_i/c_i$.   (3)

Pooling in a given stage reduces the size of the population to be tested and increases the prevalence in the next stage, since all positives are carried forward while most negatives are cleared. The task here is that, given $p_i$, one must choose a good combination of pool size and an effective compaction factor to drive the next-stage compaction. The critical parameter $c_i$ depends on $b_i$ as shown in (1). To see how $b_i$ depends on $c_i$, solve for $b_i$ using (1):

  $b_i = \frac{\ln(1 - c_i)}{\ln(1 - p_i)}$.   (4)

For our region of interest, $0 \le p_i \ll 1$, Eq (4) can be approximated with

  $b_i \approx \frac{-\ln(1 - c_i)}{p_i}$.   (5)

Recall that the goal of group testing is to identify all positive individuals with as few total tests as possible, so we need to examine the number of tests required. The number of tests required in stage 1 is $t_1 = n_1/b_1$; the number required in stage 2 is $t_2 = n_2/b_2$, and so on. Denote the total number of stages as $S$. The compaction continues through $S - 1$ stages, until the prevalence is so large that it is more effective to just test each remaining sample individually, i.e., to set the pool size $b_S = 1$, so that $t_S = n_S$. The total number of tests required is $T = \sum_{i=1}^{S} t_i$.
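To make Eqs (1), (4), and (5) concrete, here is a minimal Python sketch; the function and variable names are ours, not the paper's:

```python
import math

def optimal_pool_size(p, c=0.5):
    """Pool size b giving compaction factor c at prevalence p -- Eq (4)."""
    return math.log(1.0 - c) / math.log(1.0 - p)

def compaction_factor(p, b):
    """Realized compaction c = 1 - (1 - p)^b -- Eq (1)."""
    return 1.0 - (1.0 - p) ** b

p = 0.02                           # 2% prevalence
b_exact = optimal_pool_size(p)     # Eq (4): ~34.3
b_thumb = 0.693 / p                # Eq (5) with c = 0.5: ~34.7
b = round(b_exact)                 # real pools need integer sizes
print(b, compaction_factor(p, b))  # 34, ~0.497 (close to the 0.5 target)
```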
The figure of merit used to guide the study is the gain, defined as

  $G \equiv \frac{N}{T}$.   (6)

More specifically,

  $G = N \bigg/ \sum_{i=1}^{S} \frac{n_i}{b_i}$.   (7)

Computer simulations were used to scan through the parameter space $\{p_1, b_i, c_i\}$ to hunt for effective compacting algorithms. The main results are summarized in Figure 1. The simulation results conclusively indicated that the best strategy for choosing $\{b_i, c_i\}$ to maximize the gain is to use a constant compaction factor of $c_i = 0.5$ in stages 1 through $S - 1$.

Deploying all stages with an identical compaction of 0.5 is a clear indication that this algorithm is structurally a close cousin of the standard 64-32-16-8 "halving" algorithms. The distinctive feature of the approach in this study is the continuous optimization of the pool size $b_i$: it determines an optimal pool size at every stage that deviates freely from the super even sequence of 2-4-8-16... Such an approach regularly outperforms the standard group testing algorithms based on the conventional super even sequence.

The results of two sample runs are summarized in Table 1: (1) p = 1% and (2) p = 2%. A population size of 1,000,000 is used; p and N are the only two input parameters. Table 1 shows that the 2%-prevalence simulation reaches a gain of 5.48 using the optimized pool size approach developed here, whereas a conventional approach achieves a gain of 5.24. The 1%-prevalence simulation shows a gain of 9.451 using the optimized pool size approach against 9.449 for the conventional approach. Tables 2 and 3 compare details of the 2%-prevalence simulation runs; Tables 4 and 5 compare details of the 1%-prevalence simulation runs.

In the following, we construct an analytical model for the gain in testing efficiency based on a constant compaction factor. The constant compaction factor will be denoted $c$; its value is chosen to be 0.5 based on the simulation results. A common compaction factor in all compaction stages leads to the following:

  $n_{i+1} = c\,n_i, \qquad p_{i+1} = \frac{p_i}{c}$.   (8)

The optimal pool size at every stage depends on the prevalence at that stage,

  $b_i = \frac{\ln(1 - c)}{\ln(1 - p_i)}$,   (9)

which, per Eq (5), is approximated by

  $b_i \approx \frac{-\ln(1 - c)}{p_i} = \frac{0.693}{p_i}$ for $c = 0.5$.   (10)

The dependence is shown in Figure 2. It is worth noting that Eq (10) is an almost perfect approximation of Eq (9) in our operating range $0 \le p_i \ll 1$; the error is $O(p_i^2)$. It can be used as a simple rule of thumb when one wants to determine the optimal pool size quickly: $b_i \approx 0.693/p_i$.
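As a cross-check on the simulations summarized above, the following is a minimal expected-value sketch of the constant-compaction scheme, assuming integer-rounded pool sizes and the 30% stopping threshold derived below; it lands close to the Table 1 gains:

```python
import math

def group_testing_gain(p1, N=1_000_000, c=0.5, p_stop=0.3):
    """Expected-value run of the constant-compaction scheme.

    Each stage pools with the integer-rounded optimal size from Eq (4),
    then Eqs (1)-(3) update the remaining population and prevalence.
    Pooling stops once prevalence exceeds p_stop; the rest are tested
    individually. Returns the gain G = N / (total tests)."""
    n, p, tests = float(N), p1, 0.0
    while p <= p_stop:
        b = max(2, round(math.log(1 - c) / math.log(1 - p)))  # Eq (4)
        tests += n / b                     # one test per pool
        c_real = 1 - (1 - p) ** b          # Eq (1): realized compaction
        n, p = n * c_real, p / c_real      # Eqs (2) and (3)
    return N / (tests + n)                 # last stage tests everyone left

print(group_testing_gain(0.01))  # ~9.45, cf. Table 1's 9.451
print(group_testing_gain(0.02))  # ~5.48, cf. Table 1's 5.48
```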
Furthermore, the constant compaction factor leads to some desirable simplifications. Consider the number of tests required in each stage. In stage 1, $t_1 = n_1/b_1$. In the subsequent retest stages, $n_{i+1} = c\,n_i$ while, by Eq (10), $b_{i+1} \approx c\,b_i$, so that

  $t_{i+1} = \frac{n_{i+1}}{b_{i+1}} = \frac{n_i}{b_i} = t_1 = \frac{N p_1}{-\ln(1 - c)}, \quad i = 1, \dots, S - 2$.   (11)

The fact that each retest stage requires the same number of tests as the initial stage (except for the last, non-pooling stage $S$) simplifies the derivation of the gain in closed form.

To calculate the gain, we need to know the number of compaction stages, $S - 1$. The very last ("test every remaining sample") stage is triggered by a threshold in prevalence. At the end of each stage $i$ near the end, with a population of $n_{i+1}$ remaining to be tested and an expected prevalence of $p_{i+1}$, a decision is made whether to (1) test everyone individually and conclude the test, or (2) keep pooling in the next stage and perhaps wrap up in the stage after that.

(1) Stop pooling and test everyone. The number of tests required to conclude the whole test is $n_{i+1}$.

(2) Keep pooling. The number of tests required to conclude the test is the number of pooled tests for the next stage plus the population remaining after that stage.

Define $\lambda \equiv -\ln(1 - c) = \ln 2 \approx 0.693$ for $c = 0.5$. Writing $p$ for $p_{i+1}$ and $n$ for $n_{i+1}$, the next stage uses pool size $b \approx \lambda/p$, so option (2) costs $np/\lambda + n\left[1 - (1 - p)^{\lambda/p}\right]$ tests. We will stop pooling and test everyone if

  $n \le \frac{np}{\lambda} + n\left[1 - (1 - p)^{\lambda/p}\right]$.

The condition simplifies to

  $(1 - p)^{\lambda/p} \le \frac{p}{\lambda}$.   (12)

The inequality can be solved graphically for $p$, yielding $0.3034 < p$. One can also use a binomial expansion to solve the inequality analytically to a good approximation; the result is very close to the threshold prevalence of 0.3034 obtained from the graphical solution of Eq (12). In the following, when the prevalence of the next stage is more than 0.3, or 30%, we test everyone and conclude the test.

Now we are ready to calculate the number of stages required. With constant compaction, the prevalence at stage $i$ is $p_i = p_1/c^{\,i-1}$ (doubling at each stage for $c = 0.5$), and pooling continues while $p_i \le 0.3$. So the total number of stages with pooling, not counting the very last stage in which every sample is tested, is

  $S - 1 = \left\lceil \log_{1/c}\!\left(\frac{0.3}{p_1}\right) \right\rceil$.   (13)

The ceiling function in the equation means rounding up, so as to obtain an integer number of stages. As a result, the smoothness of the overall gain formula is, in reality, modulated by the ceiling function; the effect can be seen in Figures 1 and 3. The same is true of the rounding required in Eq (4) to obtain an integer optimal pool size.

The number of tests in the last stage is

  $t_S = n_S = N c^{\,S-1}$.   (14)

Combining the required tests for all stages using Eqs (11) and (14), we get the total number of tests

  $T = (S - 1)\,\frac{N p_1}{\lambda} + N c^{\,S-1}$.   (15)

A full expression for the gain is

  $G = \frac{N}{T} = \left[(S - 1)\,\frac{p_1}{\lambda} + c^{\,S-1}\right]^{-1}, \quad S - 1 = \left\lceil \log_{1/c}\!\left(\frac{0.3}{p_1}\right) \right\rceil$.   (16)

The $c$ in the equation is a placeholder for 0.5 in the model. It is clearly seen from the equation that the whole multi-stage group testing optimization through constant compaction is driven by only one critical parameter: the initial expected prevalence $p_1$. High gains in group testing efficiency can be expected for $p_1 \ll 1$. The dependence of the gain on prevalence is shown in Figure 3.
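A minimal sketch of the closed form, Eqs (13) and (16), under the same assumptions ($c = 0.5$, 0.3 threshold); the function name is ours:

```python
import math

def closed_form_gain(p1, c=0.5, p_stop=0.3):
    """Gain from Eq (16), with the stage count from Eq (13)."""
    lam = -math.log(1 - c)                            # lambda = ln 2 for c = 0.5
    stages = math.ceil(math.log(p_stop / p1, 1 / c))  # S - 1 pooling stages
    return 1 / (stages * p1 / lam + c ** stages)

for p1 in (0.001, 0.01, 0.02):
    print(p1, round(closed_form_gain(p1), 2))
# ~66.9, ~9.67, ~5.62 -- the 1% and 2% values sit just above the simulated
# 9.45 and 5.48; integer rounding of pool sizes accounts for the gap.
```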
The ceiling function in the gain formula introduces important modifications. Figure 4 shows the dependence of the gain on the compaction factor for these scenarios. The cliff-like drop in the peak gain can be as large as 11% when p = 2%, and the peak-gain compaction factor also appears to shift from 0.5 toward 0.4 for larger prevalences. These modifications need to be addressed in any actual implementation of the optimal pool size approach when the prevalence is ≥ 2%.

Figure 4. The ceiling function in the gain formula induces a peak-gain shift from 0.5 to 0.4 when the prevalence is higher than 2%. Application of the optimal pool size approach will need to take this effect into consideration.

The potential high-gain region of operation can reduce overall testing costs and time. In order to take full advantage of low prevalence, it is essential to conduct smart screening to accomplish a hybrid group test by forming heterogeneous subgroups. Smart grouping calls for risk assessments. The risks can be measured in a number of traditional ways; some recent techniques involve artificial intelligence, in which a training data set of individual diagnoses and corresponding risk factors is used to establish a binary regression model. This model can then be applied to the current individuals being screened in order to estimate their risk of having the disease. As an example, when the whole target population is divided into low-risk (p = 0.1%), medium-risk (p = 1%), and high-risk (p = 20%) groups, a gain of 64.7 can be expected for the low-risk group and a gain of 9.6 for the medium-risk group. A composite gain of 30 is accomplished in this example; see Table 6 (a sketch of the composite-gain arithmetic follows the references).

This idealized mathematical study focuses narrowly on improving group testing efficiency. Other important and compelling issues related to testing, such as sensitivity, specificity, and technical logistics, have not been discussed here; they are very relevant and deserve to be studied in detail. In addition, the unconventional choices of pool size in this approach dictate that future robotic pipetting will need major development to accommodate cross-pool mixing arrangements during retest stages. These advanced robots will also have to handle rows and columns of non-super-even numbers as well as super-accurate low-dose dispensing.

The approach studied here is, deep down, isomorphic to the conventional multi-stage "halving" algorithms based on super even numbers. It is believed that the initial choice of pool size is very relevant to retaining a high gain in testing efficiency. Using the usual 16 (or 64) to start group testing, though pleasingly simple, may cost us in efficiency.

References

1. Li, C. H. A sequential method for screening experimental variables. J. Am. Stat. Assoc. (1962).
2. Du, D.-Z. & Hwang, F. K. Combinatorial Group Testing and Its Applications. World Scientific.
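A sketch of the composite-gain arithmetic behind the Table 6 example. The 90/9/1 population split below is a hypothetical assumption of ours (Table 6's actual split is not reproduced in the text), so the output only lands in the neighborhood of the quoted composite gain of 30:

```python
import math

def closed_form_gain(p1, c=0.5, p_stop=0.3):
    """Gain from Eq (16), with the stage count from Eq (13)."""
    lam = -math.log(1 - c)
    stages = math.ceil(math.log(p_stop / p1, 1 / c))
    return 1 / (stages * p1 / lam + c ** stages)

def composite_gain(groups):
    """groups: (population fraction, prevalence) pairs. Tests add up
    across subgroups, so the composite gain is the fraction-weighted
    harmonic mean of the subgroup gains."""
    return 1 / sum(f / closed_form_gain(p) for f, p in groups)

# Hypothetical 90/9/1 split across low/medium/high risk groups.
print(composite_gain([(0.90, 0.001), (0.09, 0.01), (0.01, 0.20)]))  # ~32.6
```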