Bloom Origami Assays: Practical Group Testing
Louis Abraham, Gary Bécigneul, Benjamin Coleman, Bernhard Schölkopf, Anshumali Shrivastava, Alexander Smola
2020-07-21

We study the problem usually referred to as group testing in the context of COVID-19. Given n samples collected from patients, how should we select and test mixtures of samples to maximize information and minimize the number of tests? Group testing is a well-studied problem with several appealing solutions, but recent biological studies impose practical constraints for COVID-19 that are incompatible with traditional methods. Furthermore, existing methods use unnecessarily restrictive solutions, which were devised for settings with more memory and compute constraints than the problem at hand. This results in poor utility. In the new setting, we obtain strong solutions for small values of n using evolutionary strategies. We then develop a new method combining Bloom filters with belief propagation to scale to larger values of n (more than 100) with good empirical results. We also present a more accurate decoding algorithm that is tailored for specific COVID-19 settings. This work demonstrates the practical gap between dedicated algorithms and well-known generic solutions. Our efforts result in a new and practical multiplex method yielding strong empirical performance without mixing more than a chosen number of patients into the same probe. Finally, we briefly discuss adaptive methods, casting them into the framework of adaptive sub-modularity.

Lacking effective treatments or vaccinations, the most effective way to save lives in an ongoing epidemic is to mitigate and control its spread. This can be done by testing and isolating positive cases early enough to prevent subsequent infections. If done regularly and for a sufficiently large fraction of susceptible individuals, mass testing has the potential to prevent many of the infections a positive case would normally cause. However, a number of factors, such as limits on material and human resources, necessitate economical and efficient use of tests. Group testing aims to improve testing efficiency by testing groups of samples simultaneously. We wish to leverage this framework to design practical and efficient COVID-19 tests with limited testing resources. Group testing can be adaptive or non-adaptive. In the former, tests can be decided one at a time, taking into account previous test results. In the latter, one can run tests in parallel, but also has to select all tests before seeing any lab results. A popular example of a semi-adaptive group test is to first split n samples into g groups of (roughly) equal size, pool the samples within the groups and perform g tests on the pooled samples. All samples in negatively tested pools are marked as negative, and all samples in positively tested pools are subsequently tested individually.

Figure 1: We formulate the group testing problem as a constrained information maximization problem. Samples are grouped into testing pools so that the information gain is maximized while obeying practical constraints (e.g. no more than 64 samples in one group). Here, positive samples are shown in black and positive tests are shown in blue. The tests are decoded with error-correcting probabilistic methods.

Practical Constraints for COVID-19.
Although group testing is a well-studied problem, the recent COVID-19 pandemic introduces specific constraints. In contrast to seroprevalence antibody tests, PCR tests aim to detect active cases, and only successfully do so during part of the disease course [8]. This results in a small prevalence (prior probability of population infection; we will assume a default value of $10^{-3}$), assuming we screen the general population rather than only symptomatic individuals. Group testing has recently been validated for COVID-19 PCR tests [22, 26]. It is facilitated by the fact that PCR is an amplification technique that can detect small virus concentrations. Nevertheless, there are limitations on the number of samples l that can be placed in a group ([26] considers up to 64), and constraints on the number of times a particular sample can be used ([22] uses serial incubation of the same respiratory sample in up to k = 10 tubes). In addition, there are practical issues: adaptive testing is time consuming and hard to manage, and complex multiplex designs are prone to human error. Existing research on non-adaptive group testing is generally concerned with identifying at most k positive samples amongst n total samples, which is referred to as non-adaptive hypergeometric group testing [9]. This assumption yields asymptotic bounds on the number of tests needed to recover the ground truth [13, 10, 4, 3]. However, these are of limited practical relevance when constructive results on small numbers of samples are required. The specific constraints for COVID-19 force us to revisit the general framework of group testing.

Novel Formulation. We formulate the problem based on the principle of information gain: given n people and m testing kits, the characteristics of the test and prior probabilities for each person to be sick, we seek to optimize the way the tests are used by combining several samples. For simplicity, samples are assumed to be independent, analogous to [18]. However, we focus on implementable tests, unlike [18] which focuses on asymptotic results that are valid for large n. Figure 1 summarizes our approach.

Optimal Characterization: By leveraging the framework of adaptive sub-modularity initially developed for sensor covering by [6], we prove near-optimality of a simple greedy strategy for adaptive testing. Despite its simplicity, this greedy strategy has exponential running time and becomes infeasible for n ≥ 16. Fortunately, the near-optimality of the greedy-adaptive method points toward a simple and scalable non-adaptive solution leveraging randomization akin to the Bloom filter structure [20].

Bloom Origami Assays: [19, 2] recently showed that pooling using random hash functions, similar to a Bloom filter structure, can lead to an efficient and straightforward group testing protocol. We will show that such a protocol, based on random hash functions, is unnecessarily restrictive. Bloom filters were designed for streaming data, where there is no choice but to use universal hash functions for pooling. For COVID-19, the computational situation is much simpler. Leveraging our information gain framework, we propose superior but straightforward hashing strategies. A bigger problem with Bloom filters is the (necessarily) simple decoder. The decoder trades accuracy for efficiency, as it was designed for internet-scale problems where linear-time decoding is prohibitive. For COVID-19, we instead propose a message-passing decoder, similar to Counter Braids [16], which is more accurate.
Our proposal of connecting probabilistic graphical model (PGM) inference with Bloom filters could be broadly applicable to situations beyond COVID-19 group testing. Since the graphical model is a bipartite graph for which no narrow junction tree can be found, message passing does not necessarily converge to the global optimum. Therefore, we propose a new method for graphical model inference leveraging probabilistic concentration and meet-in-the-middle (MITM) techniques, which may be of independent interest. Our MITM method is particularly useful for the low-prevalence scenario. This paper illustrates the power of algorithmic re-design to target practical constraints. We obtain significant gains even on the relatively well-studied topic of group testing. Notations are progressively introduced throughout but are gathered in the appendix, which also contains the proofs.

Denote the number of patient samples by n. As previously mentioned, we consider the group testing task in the particular context of the COVID-19 pandemic. This choice of problem setting naturally introduces new mathematical constraints of a practical nature.

Impracticality of Adaptivity. Adaptive methods require several hours in between each lab result of the adaptive sequence. This inspires us to only consider either non-adaptive methods or semi-adaptive methods with no more than two phases of testing.

Low Concentration and Test Accuracy. Excessive mixing of patient swabs may result in prohibitively low viral concentration with negative consequences for testing. A recent study reports that one can safely mix a patient swab up to 10 times [22]; another reports that mixing up to 32 patient samples into the same probe yields a false negative rate below 10% [26]. There is clearly ambiguity in the limitations of the experimental protocol. For instance, [15] validate double-digit numbers of patients per sample for PCR tests. While dilution effects are relevant for such large pools, they can be partly addressed by incubating respiratory swabs multiple times [22]. Also note that we are only concerned with the accuracy of the tests per se rather than the biological sampling protocol (i.e. whether swabs are taken when viral load is detectable in patients). In what follows we consider group sizes of n = 100 as a sensible upper limit.

Notations and Reminders. Denote the number of tests to run by m. Tests are assumed to be imperfect, with a true positive rate (or sensitivity) tpr and a true negative rate (or specificity) tnr. As simple default values, we will use tpr = 99% [21] and tnr = 90% [26]. Patient sample i is infected with probability $p_i \in [0, 1]$ and we assume statistical independence of infection of patient samples. Denoting by a '1' a positive result (infection), the unknown ground truth is a vector of size n made up of '0's and '1's. This vector describes who is infected and who is not. We call this the secret, denoted as $s \in \{0, 1\}^n$. A design of a test $d \in \{0, 1\}^n$ to run in the lab is a subset of patient samples to mix together into the same pooled sample, where $d_i = 1$ if patient sample i is mixed into design d and $d_i = 0$ otherwise. Note that the outcome of a perfect design d for a given secret s can simply be obtained as $\mathbf{1}_{\langle d, s \rangle > 0}$, where $\langle d, s \rangle := \sum_{i=1}^n d_i s_i$. That is, a test result is positive if there is at least one patient i for which $d_i = 1$ (patient i is included in the sample) and $s_i = 1$ (patient i is infected). Figure 1 illustrates the problem setting. Recall that the secret s is unknown.
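To make the testing model concrete, the following minimal Python sketch (ours, not the authors' C++ implementation) draws a secret from independent priors and simulates a single imperfect pooled test; the default tpr/tnr values follow the paper, while the example pool, prior values and helper names are illustrative.

```python
import numpy as np

# Illustrative sketch of the testing model: a secret s in {0,1}^n is drawn from
# independent priors p_i, a pool design d in {0,1}^n selects which samples are mixed,
# and the lab result is a noisy version of 1_{<d,s> > 0}.

rng = np.random.default_rng(0)

def sample_secret(priors):
    """Draw a ground-truth infection vector s from independent Bernoulli priors."""
    return (rng.random(len(priors)) < priors).astype(int)

def test_outcome(s, d, tpr=0.99, tnr=0.90):
    """Simulate one imperfect pooled test T(s, d)."""
    truly_positive = int(np.dot(d, s) > 0)          # 1_{<d,s> > 0}
    if truly_positive:
        return int(rng.random() < tpr)              # detected with probability tpr
    return int(rng.random() >= tnr)                 # false positive with probability 1 - tnr

# Example: 8 samples with prevalence 10^-3, pooling the first four samples.
priors = np.full(8, 1e-3)
s = sample_secret(priors)
d = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(s, test_outcome(s, d))
```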
However, since we assume that patient sample i is infected with probability $p_i$ and that patient samples are independent, we have a prior probability distribution over the possible values of s. We hence represent the random value of s as a random variable (r.v.), denoted by S, with probability distribution $p_S(s) := \Pr[S = s]$ over $\{0, 1\}^n$. Let us now recall the definition of the entropy of our random variable,
$$H(S) := -\sum_{s \in \{0,1\}^n} p_S(s) \log_2 p_S(s).$$
The entropy represents the amount of uncertainty that we have on its outcome, measured in bits. It is maximized when S follows a uniform distribution, and minimized when S constantly outputs the same value. As we perform tests, we gain additional knowledge about S. For instance, if we group all samples into the same pool and have a negative result, then our posterior probability that all patients are healthy goes up. That is, $p_S((0, \ldots, 0))$ increases according to Bayes' rule of probability theory. More generally, we may perform a sequence of tests of varying composition, updating our posterior after each test. Our goal will be to select designs of tests so as to minimize entropy, resulting in the least amount of uncertainty about the infection status of all individuals.

Given n people, test characteristics tpr & tnr and a set of prior probabilities of sample infection $(p_i)_{1 \le i \le n}$, the best multiset D of m pool designs is the one maximizing the information gain. The tests are order insensitive, which gives a search space of cardinality $\binom{2^n + m}{m}$. Evaluating the information gain of every multiset separately takes $O(2^{n+m})$ operations. Hence, brute-forcing this search space is prohibitive even for small values of n and m. We resort to randomized algorithms to find a good enough solution. Our approach is to use Evolutionary Strategies (ES). We apply a variant of the $(1 + \lambda)$ ES with optimal restarts [17] to optimize any objective function over individuals (multisets of tests).

Detailed Description. We maintain a population of 1 individual between steps. At every step of the ES, we mutate it into $\lambda \in \mathbb{N}^+$ offspring. In the standard $(1 + \lambda)$ ES, each offspring is mutated from the population, whereas our offspring are mutated iteratively, each one being a mutation of the previous. These offspring are added to the population, and the best element of the population is selected as the next generation. We initialize our population with the "zero" design that doesn't test anyone. Our mutation step is straightforward: flipping one bit $d_i$ of one pool design d, both chosen uniformly at random. We also restrict our search space if needed: the number of 1's in a column must not exceed the number of times a given swab can be mixed with others, and the number of 1's in a row is constrained so as not to put too many swabs into the same pool. Our iterative mutation scheme allows us to step out of local optima. After choosing a base b proportional to $n \times m$ (which is approximately the logarithm of our search space), we apply restarts according to the Luby sequence $(b, b, 2b, b, b, 2b, 4b, b, b, 2b, b, b, 2b, 4b, 8b, \ldots)$. This sequence of restarts is optimal for Las Vegas algorithms [17], and our ES can be viewed as such under two conditions: (i) that the population never be stuck in a local optimum, which can be achieved in our algorithm using $\lambda = n \times m$ (note that much smaller constant values are used in practice); (ii) the second condition is purely conceptual and consists in defining a success as having a score larger than some threshold.
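The following is a minimal, illustrative sketch of one step of the $(1 + \lambda)$ strategy with chained mutations described above; the toy scoring function, constraint limits and helper names are placeholders of ours rather than the paper's information-gain objective.

```python
import random

# Sketch of one (1 + lambda) ES step with iteratively chained mutations:
# each offspring is a mutation of the previous one, and the best feasible
# individual (including the parent) survives.

def mutate(design, n, m):
    """Flip one bit d_i of one pool design d, both chosen uniformly at random."""
    child = [row[:] for row in design]
    i, j = random.randrange(m), random.randrange(n)
    child[i][j] ^= 1
    return child

def feasible(design, max_uses=10, max_pool=32):
    """Column sums = times a swab is reused; row sums = swabs per pool (limits are illustrative)."""
    col_ok = all(sum(row[j] for row in design) <= max_uses for j in range(len(design[0])))
    row_ok = all(sum(row) <= max_pool for row in design)
    return col_ok and row_ok

def es_step(design, score, n, m, lam):
    """One (1 + lambda) step: chain lam mutations, keep the best feasible individual."""
    best, best_score = design, score(design)
    current = design
    for _ in range(lam):
        current = mutate(current, n, m)
        if feasible(current) and score(current) > best_score:
            best, best_score = current, score(current)
    return best, best_score

# Example with a toy score (number of distinct non-empty pools), n = 6 samples, m = 3 tests.
n, m = 6, 3
design = [[0] * n for _ in range(m)]            # the "zero" design that tests no one
toy_score = lambda d: len({tuple(r) for r in d if any(r)})
design, sc = es_step(design, toy_score, n, m, lam=n * m)
print(sc)
```

In the full method, the score would be the information-gain criterion described next, and the step would be wrapped in restarts following the Luby schedule.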
The fact that our algorithm does not use this threshold as an input yields the following result, proved in Appendix C.1:

Theorem 1. Under condition (i), the evolutionary strategy using the Luby sequence for restarts yields a Las Vegas algorithm that restarts optimally [17] to achieve any target score threshold.

Note that since tests are imperfect, for a given pool design $d \in \{0, 1\}^n$ and a given secret $s \in \{0, 1\}^n$, the Boolean outcome $T(s, d)$ of the test in the lab is not deterministic. If tests were perfect, we would have $T(s, d) = \mathbf{1}_{\langle d, s \rangle > 0}$. To allow for imperfect tests, we model $T(s, d)$ as a r.v. whose distribution is described by $\Pr[T(s, d) = 1 \mid \langle d, s \rangle > 0] = \mathrm{tpr}$ and $\Pr[T(s, d) = 0 \mid \langle d, s \rangle = 0] = \mathrm{tnr}$. Since the secret s is also unknown (and described by the r.v. S), the outcome $T(S, d)$ now has two sources of randomness: imperfection of tests and the unknown secret. In practice, one will not run one test but multiple tests. We now suppose that m tests of pool designs are run and let their designs be represented as a multiset $D \in (\{0, 1\}^n)^m$. This leads us to the following question: given an initial prior probability distribution $p_S$ over the secret, how should we select pool designs to test in the lab? We want to select them such that once we have their outcome, we have as much information as possible about S, i.e. the entropy (uncertainty) of S has been minimized. Since we cannot know in advance the outcome of the tests, we have to minimize this quantity in expectation over the randomness coming from both the imperfect tests and the unknown secret. This requires the notion of conditional entropy.

Conditional Entropy. Given pool designs D, we consider two random variables S (secret) and $T := T(S, D)$ (test results). The conditional entropy of S given T is given by:
$$H(S \mid T) := -\sum_{t} p_T(t) \sum_{s} p_{S \mid T = t}(s) \log_2 p_{S \mid T = t}(s).$$
It represents the amount of information (measured in bits) needed to describe the outcome of S, given that the result of T is known. The mutual information between S and T can equivalently be defined as $I(S; T) := H(S) - H(S \mid T)$, and our criterion is to select
$$D^* \in \arg\max_{D} I\big(S; T(S, D)\big).$$
This criterion selects the pool designs D whose outcome will maximize our information about S.

Expected Confidence. We report another evaluation metric of interest called the expected confidence. It is the mean average precision of the maximum likelihood outcome. The maximum likelihood outcome is defined by
$$ML(t) := \arg\max_{s \in \{0,1\}^n} p_{S \mid T = t}(s).$$
ML is of particular practical interest: given test results t, a physician wants to make a prediction. In this case, it makes sense to use the maximum likelihood predictor. The interpretation of Confidence is straightforward: it is the probability that the prediction is true (across all possible secrets).

Updating the priors. Both scoring functions described above compute the expectation relative to the test results of a score on the posterior distribution $p_{S \mid T = t}(s)$. After observing the test results, we are able to replace the prior distribution $p_S$ by the posterior. By the rules of Bayesian computation, this update operation is commutative, i.e., the order in which designs $d_1$ and $d_2$ are tested does not matter, and compositional in the sense that we can test $\{d_1, d_2\}$ simultaneously with the same results. Thus, we can decompose those steps and make different choices as we run tests (see the adaptive method below). Although searching the space of all possible adaptive strategies would yield a prohibitive complexity of $\Omega(2^{2^m})$, it turns out that a simple adaptive strategy can yield provably near-optimal results.
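For very small instances, the criterion above can be evaluated exactly by enumeration. The sketch below (ours) computes H(S), H(S | T) and I(S; T) for a candidate multiset of designs by brute force; the example priors and designs are arbitrary.

```python
import itertools
import numpy as np

# Brute-force sketch of the information-gain criterion for small n: enumerate all secrets
# and all test outcomes to compute H(S), H(S | T) and I(S; T) for a candidate multiset D.
# Only feasible for small n and m; variable names here are ours, not the paper's.

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def secret_prior(s, priors):
    return np.prod([p if si else 1.0 - p for si, p in zip(s, priors)])

def outcome_likelihood(t, s, D, tpr=0.99, tnr=0.90):
    """P[T = t | S = s] for independent imperfect tests with pool designs D."""
    prob = 1.0
    for tj, d in zip(t, D):
        p_pos = tpr if np.dot(d, s) > 0 else 1.0 - tnr   # chance this test reads positive
        prob *= p_pos if tj else 1.0 - p_pos
    return prob

def mutual_information(D, priors, tpr=0.99, tnr=0.90):
    secrets = list(itertools.product([0, 1], repeat=len(priors)))
    ps = np.array([secret_prior(s, priors) for s in secrets])
    H_S, H_S_given_T = entropy(ps), 0.0
    for t in itertools.product([0, 1], repeat=len(D)):
        joint = ps * np.array([outcome_likelihood(t, s, D, tpr, tnr) for s in secrets])
        pt = joint.sum()
        if pt > 0:
            H_S_given_T += pt * entropy(joint / pt)      # weight posterior entropy by P[T = t]
    return H_S - H_S_given_T

priors = [1e-2] * 5
D = [np.array([1, 1, 1, 0, 0]), np.array([0, 0, 1, 1, 1])]
print(mutual_information(D, priors))
```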
We describe an adaptive scheme in Algorithm 1, which greedily optimizes the criterion defined in Eq. (4):

Algorithm 1 (greedy adaptive strategy). While tests remain: select the pool design $d^*$ maximizing the criterion of Eq. (4); run $d^*$ in the lab; update $p_S$ (see Eq. (3)) according to the realization of $d^*$ in the lab; decrease the number of remaining tests k by 1.

Leveraging the framework of adaptive sub-modularity [6], and assuming that the criterion defined by Eq. (4) is adaptive sub-modular (empirical validation in Appendix F), Algorithm 1 has the guarantee below.

Theorem 2. Denote by 'Algo' an adaptive strategy. Let I(Algo) be the expected mutual information obtained at the end of all m tests by running Algo, the expectation being taken over all $2^m$ outcomes of lab results. Denote by 'Optimal' the best (unknown) adaptive strategy. If we run Algorithm 1 for $m_1$ tests and Optimal for $m_2$ tests, we have:
$$I(\text{Algorithm 1}) \ge \left(1 - e^{-m_1 / (\alpha m_2)}\right) I(\text{Optimal}),$$
where α is defined as follows: assume that our priors $p_i$ are wrong, in the sense that there exist constants c, d with $c\,p_i \le p_i^{\mathrm{true}} \le d\,p_i$ for $i \in \{1, \ldots, n\}$, with $c \le 1$ and $d \ge 1$, where $p_i^{\mathrm{true}}$ denotes the true prior: we set $\alpha := d/c$.

Remarks. Accordingly, Algorithm 1 is (i) robust to wrong priors and (ii) near-optimal in the sense that the ratio of its performance with that of the optimal strategy goes to 1 exponentially in the ratio of the numbers of tests run in each algorithm. For α = 1 and $m_1 = m_2$, this yields $1 - e^{-1} \approx 0.63$.

Our previous methods are effective, but they are prohibitively expensive for n > 30 patients. To address this, we present a randomized approach to selecting D by grouping patients into pools using Bloom filters [1]. Randomized test pooling may be attractive to practitioners because it is straightforward to understand and implement in the laboratory. The simplest method partitions n patients into random groups of equal size. Patients are either re-tested or reported positive if their group tests positive (single pooling). In [2], the authors propose an extension to this idea that inserts patients into two sets of pools, named double pooling, which offers impressive advantages at the same cost. We present a generalization of this idea that uses an array of Bloom filters to improve the error characteristics of the test. While Bloom filters have been considered for the low-prevalence COVID-19 testing problem [19, 12], current methods are based on a simple randomized encoding and decoding process that was designed for internet-scale applications where even linear time was prohibitive and where the keys are not known beforehand. This sacrifices accuracy. We now design an improved algorithm.

Encoding. Bloom filters use universal random hash functions for load balancing because the streaming algorithm setting does not allow us to control the number of items in each group. Here, we can improve the filter with perfect load balancing. We divide the m tests into g groups of b pools. In each group, we assign the n patient samples to the b pools so that each pool contains roughly n/b patients. This procedure constrains the multiset D of possible test designs. With uniform prior probabilities, we implement a perfectly load-balanced hash by assigning each patient a number based on a permutation $\pi_j$ of the integers $\{1, \ldots, n\}$. Thus patient i is assigned to pool $h_j(i) := \pi_j(i) \bmod b$ in group j. For non-uniform priors, we can resort to a variable-load hash to balance total weights into pools. Due to the concavity of the entropy, the information gain is maximized if all pools have the same probability of testing positive; the entropy of a single pool outcome is largest when this probability equals 1/2, the maximizer of the binary entropy.
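A minimal sketch (ours) of the load-balanced permutation hash just described: in each group j, a permutation π_j sends patient i to pool π_j(i) mod b, so every pool in a group receives roughly n/b patients.

```python
import numpy as np

# Load-balanced "permutation hash" encoding: returns the m = g * b pool designs.
def bloom_origami_design(n, g, b, seed=0):
    rng = np.random.default_rng(seed)
    designs = []
    for j in range(g):
        perm = rng.permutation(n)                 # pi_j, a permutation of {0, ..., n-1}
        for pool in range(b):
            d = np.zeros(n, dtype=int)
            d[perm % b == pool] = 1               # patient i goes to pool pi_j(i) mod b
            designs.append(d)
    return designs

D = bloom_origami_design(n=100, g=3, b=5)
print(len(D), [int(d.sum()) for d in D[:5]])      # 15 pools holding ~20 patients each
```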
Theorem 3 (Load balancing implies information gain). Load balancing, as exhibited by our encoding, maximizes the information gain for a practical subset of constrained Bloom filter group test problems.

Theorem 3 motivates Bloom filters in the context of our information-theoretic framework. With a constraint on the number of samples in each pool, our load-balancing hash allocation is the optimal pooling strategy provided that $\Pr[t_b = 1]$ is sufficiently small (roughly 20%). We defer a detailed discussion to the appendix.

Decoding for Perfect Tests. Assuming perfect tests, one can easily decode the pooled test results $t \in \{0, 1\}^{b \times g}$ because all patients in negative pools are healthy. We can then identify positive (and ambiguous) samples by eliminating healthy samples from positive pools, as described in the appendix. In the cases g = 1 and g = 2, we recover the widely used single pooling method and the recently proposed double pooling method [2], respectively. Assuming there are no false negative pool results, one can use the decoder to identify all positive samples and derive optimal dimensions b × g that minimize the number of tests; the analysis, deferred to the appendix, borrows tools from regular Bloom filters and the results shown in [20]. Note that the problem with no test error and 1/2 prevalence is a #P-complete restriction of #SAT, called monotone CNF [24]. Realistic tests with nontrivial false negative and false positive rates are technically more interesting. A natural idea is an algorithm dating back to [11], originally used to decode diseases from the QMR database.

Decoding via Message Passing. Indeed, false negative rates are often as high as 10%. The perfect-test decoder fails for imperfect tests because even negative pools might contain positive samples. A small number of healthy pools might even test positive for some protocols (e.g. due to spurious contamination). When viewed as a probabilistic graphical model, we can interpret $t_{gb}$ as a corrupted version of the true state $y_{gb}$. It is our goal to infer the secret s that produced $t_{gb}$. Belief propagation is a common technique to estimate the posterior distribution $p_{S \mid T = t}$ for a graphical model. Since our graphical model cannot be rewritten as a junction tree with narrow treewidth, there are no efficient exact algorithms. Instead, we resort to loopy belief propagation [14]. While inexact (loopy BP is not guaranteed to converge to the global optimum), the resulting solution can classify samples as positive or negative with reasonable performance. While the degree of each pool node is so high that the clique potential would naively involve an intractable number of states, the clique potentials have a simple form that permits an efficient implementation (details in the appendix).

Figure 3: Intuition behind probabilistic decoding. In each group, we suspect that patients in positive pools are positive. If a patient falls within multiple positive pools, the likelihood that their test status is positive increases. Even if a false positive or negative occurs, we may still report the correct diagnosis thanks to information from other groups. This process is known as "error correction" and can be implemented with message passing or our MITM algorithm.

Figure 5: Intuition behind the MITM approach. If the prevalence is low, then we do not need to consider inputs with many positives. This restricts the set of possible secrets and the set of ways we can encode those secrets. The figure shows the inputs for at most 3 positives. Given a (potentially corrupted) output, there are only a small set of true encodings that could have produced that output; it is highly unlikely that every test had a false result. The two conditions "meet in the middle" to produce a small set of states. Our MITM algorithm efficiently approximates the posterior probabilities by summing over this restricted state space.
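For reference, here is a minimal sketch (ours) of the perfect-test decoder described above: clear every patient who appears in at least one negative pool, and flag the rest as positive or ambiguous. The probabilistic decoders below replace this rule when tests are imperfect.

```python
import numpy as np

# Perfect-test decoder: anyone in a negative pool is healthy; the remaining
# patients (all of whose pools tested positive) are flagged as positive/ambiguous.

def decode_perfect(designs, results, n):
    """designs: list of 0/1 arrays of length n; results: list of 0/1 pool outcomes."""
    cleared = np.zeros(n, dtype=bool)
    for d, r in zip(designs, results):
        if r == 0:
            cleared |= d.astype(bool)             # everyone in a negative pool is healthy
    return [i for i in range(n) if not cleared[i]]

designs = [np.array([1, 1, 0, 0]), np.array([0, 1, 1, 0]), np.array([0, 0, 1, 1])]
results = [1, 1, 0]
print(decode_perfect(designs, results, n=4))      # -> [0, 1]: samples 0 and 1 remain suspect
```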
Decoding for Imperfect Tests: Meet-in-the-Middle (MITM). The structure of the problem also enables an efficient approximation to the exact solution in the (realistic) setting where the tests are fairly accurate and the disease prevalence is low. Low prevalence implies that there are relatively few "likely secrets" $s \in \{0, 1\}^n$, because most $s_i$ are 0 with high probability. Thus, we only need to consider secrets with a small number of positive patients. Since the secrets concentrate in a small subset of $\{0, 1\}^n$, we expect to see relatively few Bloom encodings $y \in \{0, 1\}^{b \times g}$ for low-prevalence problems. Furthermore, the output space is likely to be corrupted in relatively few ways. The true state $y_{gb}$ is likely to be the same as the observed output $t_{gb}$, so we only need to consider states that are similar to the observed output. By restricting our attention to "likely secrets" and "likely outputs", we can reduce the $O(2^n)$ complexity of the naive brute-force algorithm. This process constitutes a "meet in the middle" approach where we only need to consider the small set of states consistent with both restrictions.

Illustrative values for MITM decoding. Using Stirling's formula $n! \sim \sqrt{2\pi n}\,(n/e)^n$, one can easily show that for a fraction $x \in [0, 1]$, we have $f(xn) = o((2p^x)^n)$ as $n \to +\infty$. If $k := xn$ is such that $2p^x < 1$, i.e. $x > \log(2)/\log(1/p)$, then $f(k)$ will be exponentially small w.r.t. n. With our default p = 0.1%, we only need to consider secrets whose fraction of infected people is below a threshold slightly larger than $x^* = \log_2(2)/\log_2(1/10^{-3}) \approx 10.03\%$ to yield negligible error. For n = 60, choosing x = 13% reduces the search space of secrets from $2^{60} \approx 10^{18}$ to $\sum_{i=0}^{\lfloor 60 \times 0.13 \rfloor - 1} \binom{60}{i} < 6 \cdot 10^7$ with an error $\varepsilon < (2p^{0.13})^{60} < 5 \cdot 10^{-6}$.

We ran simulations to compare test designs for a large variety of group testing parameters (n, tpr/tnr, b × g, prevalence) in the appendix. In this section we present results for a practical scenario where tnr = 0.9, tpr = 0.99, and 0.1% prevalence. In Figure 6, we compare the entropy-minimizing solution found by genetic algorithms with several Bloom filter designs. The Bloom filter performance closely resembles the optimal solution, albeit with higher variance. This validates our claim that the load-balancing permutation hash implies a good information gain. We also apply our graphical model framework to 3 × 5 arrays of Bloom filters, single pooling and double pooling designs. We use the MITM technique to compute posteriors for all designs and we compare performance. While MITM provides the best results, computational constraints may demand belief propagation for situations where there are many positive group tests. In the high-prevalence scenario, belief propagation will still provide sufficient error correction for good diagnostic results (Figure 4). The vanilla Bloom decoding (single and double pooling) is unnecessarily inaccurate, clearly implying the need for specific tailored algorithms.
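The secret-side restriction can be illustrated with a short sketch (ours): enumerate only the secrets with at most k positives and sum their posterior weights. The code-side pruning and the exact estimators of the appendix are omitted, and k, the priors and the example design are illustrative choices.

```python
import itertools
import numpy as np

# Posterior decoding restricted to "likely secrets": with low prevalence we enumerate
# only secrets with at most k positives instead of all 2^n, and sum their likelihoods.

def restricted_posterior(designs, t, priors, k, tpr=0.99, tnr=0.90):
    n = len(priors)
    marg = np.zeros(n)           # unnormalized P[s_i = 1, t]
    total = 0.0                  # unnormalized P[t]
    for num_pos in range(k + 1):
        for positives in itertools.combinations(range(n), num_pos):
            s = np.zeros(n, dtype=int)
            s[list(positives)] = 1
            prior = np.prod(np.where(s == 1, priors, 1.0 - np.asarray(priors)))
            like = 1.0
            for d, tj in zip(designs, t):
                p_pos = tpr if np.dot(d, s) > 0 else 1.0 - tnr
                like *= p_pos if tj else 1.0 - p_pos
            total += prior * like
            marg += s * prior * like
    return marg / total          # approximate P[s_i = 1 | t], ignoring secrets with > k positives

# Example: 12 patients, a 2 x 3 array of load-balanced pools, at most k = 3 positives.
priors = [0.02] * 12
designs = [np.array([1 if i % 3 == p else 0 for i in range(12)]) for p in range(3)] \
        + [np.array([1 if i // 4 == p else 0 for i in range(12)]) for p in range(3)]
t = [1, 0, 0, 1, 0, 0]
print(np.round(restricted_posterior(designs, t, priors, k=3), 3))
```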
We have presented a framework for group testing that takes into account the specifics of the current COVID-19 pandemic. It applies methods of probability and information theory to construct and decode multiplex codes spanning the relevant range of group sizes, establishing an interesting connection to Bloom filters and graphical model inference along the way. Our empirical results, more of which are included in the appendix, show that our methods lead to better codes than randomized pooling and popular approaches such as single pooling and double pooling. Furthermore, we provide an approximate inference algorithm through Theorem 5 that outperforms the message passing approach for realistic parameter values by pruning the exponential search space. We also prove bounds on its error that can be computed at run time; these are highly useful in practice because they are strict. We believe that the test multiplexing problem is an ideal opportunity for our community to make a contribution towards addressing the current global crisis. By firmly rooting this problem in learning and inference methods, we provide fertile ground for further development. As more information about test characteristics becomes available, we could take into account dependencies of tpr, tnr on pool size. The framework could be adapted to different objective functions, or linked to decision theory using suitable risk functionals, e.g., taking into account the downstream risk of misdiagnosing an individual with particular characteristics (comorbidities, probability of spreading the disease, etc.). It can be combined with the output of other methods providing individualized estimates of infection probabilities, to optimize pool allocation for non-uniform priors/prevalence. Statistical dependencies (e.g., for family members) could be taken into account. Finally, similar methods also permit addressing the problem of prevalence estimation. Further details as well as some concrete design recommendations derived from our methods are available in the appendix.

The motivation for this work was to help address the worldwide shortage of testing capacity for SARS-CoV-2. Testing plays a major role in breaking infection chains, monitoring the pandemic, and informing public policy. Countries successful at containing COVID-19 tend to be those that test a lot. On an individual level, availability of tests allows early and targeted care for high-risk patients. While treatment options are limited, it is believed that antiviral drugs are most effective if administered early on, since medical complications in later stages of the disease are substantially driven by inflammatory processes, rather than by the virus itself [23]. Finally, large-scale testing as enabled by pooling and multiplexing strategies may be a crucial component for opening up our societies and economies. People want to visit their family members in nursing homes, send their children to school, and the economy needs to function in order to secure supply chains and allow people to earn their livelihoods. However, the present work also poses some ethical challenges, some of which we list below. The first family of concerns relates to the accuracy of the tests. Indeed, when the number of tests and patients are equal, it is natural to compare the tpr/tnr of the individual test to the tpr/tnr of the individual results in our grouped test framework (obtained by marginalizing the posterior distribution). In some situations with unbalanced priors, the marginal tpr/tnr of some people in the group could be lower than the test tpr/tnr, even if the grouped test is more successful overall.
However, reporting the marginal individual results gives doctors a tool to decide whether further testing is needed; still, we cannot rule out that individuals might be worse off by being tested in a group. We furthermore show in the appendix that some designs are fairer than others, in that the individual performances are more equally distributed. The second family of concerns, directly resulting from the first, is the responsibility of the doctor when assigning people to batches and giving them prior probabilities (using another model). The assignment of people to batches should be dealt with in a future extension of our framework, while the sensitivity of our protocols to priors should be studied in more depth. The adaptive framework may be more robust with respect to the choice of priors than the non-adaptive one. Finally, the possibility of truly large-scale testing may allow countries with sufficient financial resources to perform daily testing of large populations, with significant advantages for economic activity. This, in turn, could exacerbate economic imbalances.

We use upper-case letters exclusively for random variables (r.v.), except for mutual information I and entropy H.

Different objective functions. We have used the number of tests and samples as given, and then optimized a conditional entropy. However, from a practical point of view, other quantities are relevant and may need to be included in the objective, e.g. the expectation (over a population) of the waiting time before an individual is "cleared" as negative (and can then go to work, visit a nursing home, or perform other actions which may require a confirmation of non-infectiousness).

Semi-adaptive tests. Instead of performing m consecutive tests, one could do them in k batches of respective sizes $m_1, \ldots, m_k$ satisfying $m_1 + \cdots + m_k = m$. Adaptivity over the sequence of length k could be handled greedily as in Algorithm 1, except that instead of selecting a single pool design $d^*$, we would select $m_i$ designs at the i-th step. We named this semi-adaptive algorithm the k-greedy strategy.

Further practical considerations. A good practical strategy could be to perform one round of pooled tests on disjoint groups every morning as individuals arrive at work, with the pools evaluated during work hours. Those who are in a positive group (adaptively) get assigned to a second pool design tested later, which can consist of a non-adaptive combination of multiple designs, tested overnight. They receive the result in the morning before they go to work, and if individually positive, they enter quarantine. If the test is so sensitive that it detects infections even before individuals become contagious (which may be the case for PCR tests), such a strategy could avoid most infections at work.

"Bloom" denotes the use of the Bloom encoding described in Section 5, while "Entropy" denotes the use of the Conditional Entropy / Mutual Information encoder described in Section 4. In both comparisons, we observe that the Entropy encoder yields more similar PR curves across patients, compared to the Bloom encoder.

Dependencies between tpr, tnr and pool size. The reliability of tests may vary with pool size. In our notation, the outcome of the tests is a random variable that need not only depend on whether one person is sick ($\mathbf{1}_{\langle d, s \rangle > 0}$), but may also depend on the number of tested people |d| and the number of sick people $\langle d, s \rangle$; it could even assign different values of tpr and tnr to different people.
The tpr may in practice be an increasing function of the proportion of sick people $\langle d, s \rangle / |d|$.

Estimating prior infection probabilities. Currently, we start with a factorized prior of infection that not only assumes independence between the tested patients but is also oblivious to individual characteristics. We could, however, build a simple ML system that estimates the prior probabilities based on a set of features such as: job, number of people living in the same household, number of children, location of home, movement or contact data, etc. Those prior probabilities can then be readily used by our approach to optimize the pool designs, and the ML system can gradually be improved as we gather more test results.

Prevalence estimation. Similar methods can be applied to the question of estimating prevalence. Note that this is an easier problem in the sense that we need not necessarily estimate which individuals are positive, but only how many.

C.1 Theorem 1. Statement: Under the condition that the population never be stuck in a local optimum, the evolutionary strategy using the Luby sequence $(b, b, 2b, b, b, 2b, 4b, b, b, 2b, b, b, 2b, 4b, 8b, \ldots)$ for restarts yields a Las Vegas algorithm that restarts optimally [17] to achieve any target score threshold.

Proof. Let us recall the main result we use on optimal restarts [17]: the simple Luby sequence of restart times given by $(b, b, 2b, b, b, 2b, 4b, b, b, 2b, b, b, 2b, 4b, 8b, \ldots)$ is optimal (up to a log factor) for Las Vegas algorithms (i.e. randomized algorithms that always provide the correct answer when they stop, but may have non-deterministic running time). Our theorem is a direct consequence of conceptually casting our problem as a Las Vegas algorithm: indeed, we seek to optimize a fitness function f. For a given threshold A > 0, we can replace the maximization of f by the condition f > A. Applying the result of [17] for an exhaustive family of thresholds yields the desired result.

C.2 Theorem 2. We wish to invoke Theorem 1 of [6]. In order to do so, we need to prove that the conditional entropy which we introduced in Eq. (2) is adaptive monotone. Concerning adaptive sub-modularity, we make it an assumption upon which our result is conditioned, and validate it numerically with high precision for small values of n (see Appendix F). Direct respective correspondence between our notations and that of [6] is given by:
• Pool designs d : items e;
• Test results T : realizations Φ;
• Set D of selected designs : set E(π, Φ) of items selected by policy π.
This allows one to define, following Definition 1 of [6], the conditional expected marginal benefit of a pool design d given results t as
$$\Delta(d \mid t) := H(S \mid T = t) - \mathbb{E}\left[H\big(S \mid T = t,\, T(S, d)\big)\right].$$
It represents the marginal gain of information obtained, in expectation, by observing the outcome of d at a given stage (this stage being defined by $p_S$, i.e. after having observed test results t). Adaptive monotonicity holds if $\Delta(d \mid t) \ge 0$ for any d. Adaptive sub-modularity holds if, for any two sets of results t and t' such that t is a sub-realization of t', and for any pool design d:
$$\Delta(d \mid t) \ge \Delta(d \mid t').$$
The lemma below concludes the proof.

Lemma. With respect to Δ defined in Eq. (9), adaptive monotonicity holds.

Proof. Adaptive monotonicity is a consequence of the "information-never-hurts" bound $H(X \mid Y) \le H(X)$ [5].
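The numerical validation mentioned above can be illustrated with a toy check (ours, not Appendix F itself): for a small instance, we verify exhaustively that the expected marginal benefit Δ(d | t) is non-negative for every candidate design.

```python
import itertools
import numpy as np

# Toy numerical check of adaptive monotonicity:
# Delta(d | t) = H(S | T = t) - E[ H(S | T = t, T(S, d)) ] should be >= 0 for every design d.
# Exhaustive enumeration, so only feasible for very small n; tpr/tnr follow the paper.

TPR, TNR = 0.99, 0.90

def entropy(p):
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def likelihood(t, s, D):
    prob = 1.0
    for tj, d in zip(t, D):
        p_pos = TPR if np.dot(d, s) > 0 else 1.0 - TNR
        prob *= p_pos if tj else 1.0 - p_pos
    return prob

def marginal_benefit(post, secrets, d):
    """Delta(d | t) given the current posterior over secrets."""
    h_now, exp_h = entropy(post), 0.0
    for outcome in (0, 1):
        joint = post * np.array([likelihood((outcome,), s, [d]) for s in secrets])
        p_outcome = joint.sum()
        if p_outcome > 0:
            exp_h += p_outcome * entropy(joint / p_outcome)
    return h_now - exp_h

n, priors = 4, np.full(4, 0.05)
secrets = list(itertools.product([0, 1], repeat=n))
prior = np.array([np.prod([p if si else 1 - p for si, p in zip(s, priors)]) for s in secrets])
# Posterior after observing one positive pooled test on samples {0, 1}.
joint = prior * np.array([likelihood((1,), s, [np.array([1, 1, 0, 0])]) for s in secrets])
post = joint / joint.sum()
for d in itertools.product([0, 1], repeat=n):       # all candidate next designs
    assert marginal_benefit(post, secrets, np.array(d)) >= -1e-12
print("adaptive monotonicity holds on this toy instance")
```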
We are interested in the information $I(S, T)$ for a single Bloom filter row with B cells. Because each test in the row contains a disjoint set of patients, $I(S, T)$ is the sum of the information for each test (i.e. the $t_b$ random variables are independent and there are no cross terms). Using the basic definition of $H(t_b)$, we have
$$H(t_b) = -\sum_{t \in \{0,1\}} \Pr[t_b = t] \log_2 \Pr[t_b = t].$$
We use the fact that $I(S; t_b) = H(t_b) - H(t_b \mid S_{\text{patients} \in b})$. Since the relationship between the ideal test result $y_b$ and the patient statuses $s_i$ is deterministic, conditioning on the $s_i$ is the same as conditioning on $y_b$. In particular, one can write $\Pr(y_b = 0) = \prod_{i \in b} (1 - p_i)$. Observe that $\mathrm{tnr} = \Pr(t_b = 0 \mid y_b = 0)$ and $\mathrm{tpr} = \Pr(t_b = 1 \mid y_b = 1)$, and let $\rho_b = \Pr(y_b = 1)$. This gives us a simple expression for $\Pr[t_b = t]$, namely $\Pr[t_b = 1] = \mathrm{tpr}\,\rho_b + (1 - \mathrm{tnr})(1 - \rho_b)$, and thus for $H(t_b)$. We approach the second term $H(t_b \mid S_{\text{patients} \in b})$ the same way:
$$H(t_b \mid S_{\text{patients} \in b}) = H(t_b \mid y_b) = -\sum_{y \in \{0,1\}} \Pr[y_b = y] \sum_{t \in \{0,1\}} \Pr[t_b = t \mid y_b = y] \log_2 \Pr[t_b = t \mid y_b = y].$$
Put $\beta = \mathrm{tnr} \log_2(\mathrm{tnr}) + (1 - \mathrm{tnr}) \log_2(1 - \mathrm{tnr})$ and $\alpha = \mathrm{tpr} \log_2(\mathrm{tpr}) + (1 - \mathrm{tpr}) \log_2(1 - \mathrm{tpr})$. Then the information $I(S, T)$ is equal to
$$I(S, T) = \sum_b \left[ H(t_b) + (1 - \rho_b)\,\beta + \rho_b\,\alpha \right].$$

Information is Concave in ρ: To show that there is a single, constant, and optimal probability for each group test to be positive, we prove that $I(S, T)$ is concave in ρ. It is sufficient to show that each term $I(S, t_b)$ in the sum is concave in $\rho_b$. Taking the first and second derivatives with respect to $\rho_b$, we wish to show that $\frac{d^2}{d\rho_b^2} I(S, t_b) \le 0$, which we will do by proving that the two fractions appearing in the second derivative are both positive. The squared terms in the numerators are positive, as is the expression $(2\,\mathrm{tpr} - 1)\rho_b + 1 - \mathrm{tnr}$ because $\mathrm{tpr} > 0.5$ and $\mathrm{tnr} < 1$. This leaves the $(1 - 2\,\mathrm{tnr})\rho_b + \mathrm{tnr}$ term in the denominator. This term is linear in $\rho_b \in [0, 1]$, with a minimum value of $1 - \mathrm{tnr}$. Thus, $I(S, T)$ is concave.

Optimal Value of $\rho_b$: Since the information is concave, there is an optimal value of $\rho_b$ that maximizes the information gain from each grouped test. Since $H(t_b)$ depends only on tpr, tnr and $\rho_b$, it is easy to see that this value is constant and the same for all groups b. This proves the theorem. However, it is of practical importance to find or approximate the optimal value of $\rho_b$. If one wanted to load-balance a variety of (possibly different) priors into groups that have the optimal probability of testing positive, one would need to know the desired value of $\rho_b$. We obtain an equation for the optimal $\rho_b$ by setting the derivative to zero, with $c = (1 - 2\,\mathrm{tnr}) + (2\,\mathrm{tpr} - 1) + \log(2)(\beta - \alpha)$. One can obtain the optimal $\rho_b$ by numerically solving this equation. When tpr = tnr, we have c = 0 and the optimal value of $\rho_b$ is 0.5.
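Instead of solving the stationarity equation in closed form, one can simply maximize the per-pool information I(S; t_b) = H(t_b) − H(t_b | y_b) numerically, as in this short sketch (ours); the grid search and the default tpr/tnr values are illustrative.

```python
import numpy as np

# Numerically locate the optimal rho_b = Pr[y_b = 1] by maximizing the per-pool
# information I(S; t_b) = H(t_b) - H(t_b | y_b) on a grid.

def binary_entropy(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

def per_pool_information(rho, tpr=0.99, tnr=0.90):
    p_pos = tpr * rho + (1 - tnr) * (1 - rho)            # Pr[t_b = 1]
    h_cond = rho * binary_entropy(tpr) + (1 - rho) * binary_entropy(tnr)
    return binary_entropy(p_pos) - h_cond

rhos = np.linspace(0.0, 1.0, 10001)
info = per_pool_information(rhos)
print(rhos[np.argmax(info)], info.max())                 # optimum equals 0.5 only when tpr == tnr
```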
We prove the theorem using an analysis that is similar to the one for standard Bloom filters. The Bloom filter decoder identifies a sample as positive if all of the pools containing the sample are positive. It is easy to see that the decoder cannot produce false negatives under perfect tests, because each positive sample will always generate a positive pool result. We now analyze the systemic false positives introduced by the pooling operation. Each pool contains either N/B or N/B − 1 patients, where the latter situation occurs when B does not perfectly divide N and there are a few "leftover" elements. Thus, any given sample will share a bin with up to N/B − 1 other elements, each of which has independent probability ρ of testing positive. To correctly identify a sample as negative, we require that all of these N/B − 1 samples also test negative. Hence the probability that our sample will not collide with a positive sample is at least
$$(1 - \rho)^{N/B - 1}.$$
The −1 arises from the fact that the sample cannot collide with itself. This analysis holds for a single Bloom filter row, but we have G independent opportunities to land in a negative pool. The rows are independent because independent random hash functions are used to form the groupings. The probability that we collide with a positive in all G groups is at most
$$p_f \le \left(1 - (1 - \rho)^{N/B}\right)^{G}.$$
This expression gives the probability that we fail to identify the sample correctly. We want to bound the failure probability $p_f$ and choose parameters that minimize the bound. Note that we replaced N/B − 1 with N/B; the inequality still holds because $(1 - \rho) < 1$. The optimal dimensions for the Bloom filter come from minimizing the upper bound. We use the relation $M = B \times G$ to put $p_f$ in terms of M and G. We find that the optimal $G = \frac{M}{N\rho} \log 2$.

Notations. We use $\tilde{x}$ to denote an estimate of a quantity x.

Error Bounds and Confidence Levels. Given a test result $t \in \{0, 1\}^m$ and a patient $i \in \{1, \ldots, n\}$, we seek to estimate $P[s_i \mid t]$, i.e. the probability of patient sample $s_i$ being positive. We can rewrite
$$P[s_i \mid t] = \frac{\lambda}{\lambda + \mu},$$
where we defined $\lambda := P[s_i, t]$ and $\mu := P[\bar{s}_i, t]$. Hence, we seek to estimate λ, resp. µ. We use the term "code space" to refer to the space $\{0, 1\}^m$ of encodings of secrets $s \in \{0, 1\}^n$. We write λ and µ in terms of the joint distribution of secrets s, encodings c, and results t. Summing across the code space yields
$$\lambda = \sum_c a(c)\, b(c),$$
with $a(c) := P[t \mid c]$ and $b(c) := P[s_i, c]$ (and similarly for µ with $P[\bar{s}_i, c]$). Suppose that we have (under-)estimates $\tilde{a}$ and $\tilde{b}$ such that $0 \le \max_c (a(c) - \tilde{a}(c)) \le \varepsilon$ and $0 \le \sum_c (b(c) - \tilde{b}(c)) \le \varepsilon$. Later, we will describe how to obtain these estimates. For now, observe that we can (under-)estimate $\lambda = \sum_c a(c) b(c)$ with $\tilde{\lambda} := \sum_c \tilde{a}(c)\tilde{b}(c)$, with the error bound $0 \le \lambda - \tilde{\lambda} \le 2\varepsilon$, and similarly $0 \le \mu - \tilde{\mu} \le 2\varepsilon$, which implies $0 \le (\lambda + \mu) - (\tilde{\lambda} + \tilde{\mu}) \le 4\varepsilon$. This concludes the proof that we can estimate $P[s_i \mid t]$ with error less than $5\varepsilon/\tilde{P}[t]$, where $\tilde{P}[t] := \tilde{\lambda} + \tilde{\mu}$. This concludes our presentation of the estimators $\tilde{a}(c)$ and $\tilde{b}(c)$.

Note that we presented a confidence interval together with a bound on its size, i.e. we showed that the true value $P[s_i \mid t]$ is within an interval that depends on the observed quantity $\tilde{P}[s_i \mid t]$. However, we can also provide an interval for the observed quantity as a function of the true value, whose size can be bounded under the assumption $\varepsilon < \lambda/4$, which justifies that $\frac{3\lambda\varepsilon^2}{(\lambda + \mu)^3} \cdot \frac{1}{1 - \frac{\varepsilon}{\lambda + \mu}} < \frac{\varepsilon}{\lambda + \mu}$. One might also be interested in the error rather than a confidence interval. We will now present an algorithm that efficiently computes these estimators. In our algorithm, $A(\varepsilon)$ is the number of secrets with at most k nonzeros, $C(\varepsilon)$ is the number of codes produced by this restricted set of k-sparse secrets, and $B(\varepsilon)$ is a set of probable ideal codes for the potentially corrupted output t that we observe.

1. Preprocessing (independent of the results t):
2. Compute k such that $f(k) < \varepsilon$ and initialize $C = \emptyset$;
3. Enumerate all the codes $c := \mathrm{enc}(s)$ for s with fewer than k positives;
4. Use these codes to approximate $\tilde{P}[s_i, c]$ and $\tilde{P}[\bar{s}_i, c]$ using the formula for $\tilde{b}(c)$ in Eq. (42); store the results in C;
5. Query (dependent upon the results t):
6. Compute $P := \sum_i t_i$, $N := m - P$ and $a(c)$ (see Eq. (48)) for $FP \le P$, $FN \le N$;

Complexity Analysis. Since the outcome of a test t is conditionally independent of s given c, we can pre-compute all encodings $c := \mathrm{enc}(s)$ for s belonging to the reduced search space of size $A(\varepsilon) := \sum_{i=0}^{k-1} \binom{n}{i}$. Saving all these resulting encodings with a hashmap or a set structure gives a space complexity proportional to $C(\varepsilon) \le A(\varepsilon)$, since the image of an input set under a function is never larger than the input set itself. Finally, at query time, we seek to estimate $P[s_i \mid t]$.
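A simplified sketch (ours) of the precompute/query split: likely secrets are grouped by their ideal encoding in a hashmap during preprocessing, and at query time each stored encoding is weighted by P[t | c]. For clarity, the sketch uses exact probabilities in place of the estimators of Eqs. (42) and (48), and k and the example parameters are our illustrative choices.

```python
import itertools
import numpy as np

# MITM-style preprocessing/query: aggregate likely secrets (at most k positives)
# by their ideal encoding c = enc(s); at query time, weight each code by P[t | c].

def enc(s, designs):
    return tuple(int(np.dot(d, s) > 0) for d in designs)

def precompute(designs, priors, k):
    n = len(priors)
    store = {}                                    # code -> (P[code], per-patient P[s_i = 1, code])
    for r in range(k + 1):
        for pos in itertools.combinations(range(n), r):
            s = np.zeros(n, dtype=int)
            s[list(pos)] = 1
            p = np.prod(np.where(s == 1, priors, 1.0 - np.asarray(priors)))
            c = enc(s, designs)
            tot, marg = store.get(c, (0.0, np.zeros(n)))
            store[c] = (tot + p, marg + p * s)
    return store

def query(store, t, tpr=0.99, tnr=0.90):
    num, den = 0.0, 0.0
    for c, (tot, marg) in store.items():
        fn = sum(1 for ti, ci in zip(t, c) if ci == 1 and ti == 0)
        fp = sum(1 for ti, ci in zip(t, c) if ci == 0 and ti == 1)
        ones, zeros = sum(c), len(c) - sum(c)
        p_t_given_c = tpr**(ones - fn) * (1 - tpr)**fn * tnr**(zeros - fp) * (1 - tnr)**fp
        num, den = num + p_t_given_c * marg, den + p_t_given_c * tot
    return num / den                              # approximate P[s_i = 1 | t]

priors = [0.02] * 12
designs = [np.array([1 if i % 4 == p else 0 for i in range(12)]) for p in range(4)]
store = precompute(designs, priors, k=3)
print(len(store), np.round(query(store, (1, 0, 0, 0)), 3))
```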
Note that we have pruned two search spaces: the space of encodings of the $A(\varepsilon)$ many secrets, reduced from $2^m$ to $C(\varepsilon)$, and the space of codes c such that, for our given t, $\mathrm{prob}[FP][FN] > \varepsilon$, reduced from $2^m$ to $B(\varepsilon)$. Given a test result t, we can compute N and P in O(m) operations, which then allows us to compute $B(\varepsilon)$ for this t. Also note that we approximate $P[s_i, t]$ via $\sum_c \tilde{a}(c)\tilde{b}(c)$. Since $\tilde{a}(c) = 0$ for c such that t doesn't belong to the reduced test-results space of size $B(\varepsilon)$, we can choose to perform this sum over either this set or the reduced code space of size $C(\varepsilon)$, whichever is smallest. This is where the name "meet-in-the-middle" comes from.

The C++ code can be used in the browser through an interactive WebAssembly demo: https://bloom-origami.github.io/ The following features are implemented:
• Bloom assay generation
• Greedy adaptive strategy simulation
• Design optimization using genetic algorithms
• Posterior decoding using MITM

Our designs assume that the prevalence ρ is known, at least approximately. However, we can also use our Bloom filter design to estimate the prevalence in the overall population. When we randomly and independently sample an individual from the population, they have probability ρ of being infected. The prevalence estimation problem is to determine ρ using as few tests as possible. Here, we assume perfect tests to simplify the analysis. Of course, one could individually test a large number of people from the population and report the fraction of positive test results. The challenge is that if we screen individuals, we end up with a random variable for which the mean-to-variance ratio is unfavorable. Consider a random variable $X \in \{0, 1\}$ with $\mathbb{E}[X] = \rho$ and variance $\mathrm{var}[X] = \rho - \rho^2 = \rho(1 - \rho)$. The error of the empirical average of m individual tests is $\sqrt{\rho(1 - \rho)/m}$, so the relative error is $\sqrt{\frac{1 - \rho}{\rho m}}$. Clearly this is minimized for ρ = 1. Unfortunately, this value is entirely useless since it corresponds to the situation where every test returns positive. In practice, we encounter the unfortunate situation $\rho \ll 1$, where the relative error diverges. Under a naive random sampling approach to prevalence estimation, a very large number of tests are required. To amend this situation, it is beneficial to increase the probability of a positive test by testing multiple candidates at once. Our pooled tests are no longer positive with probability ρ but with probability $q = 1 - (1 - \rho)^k$, where k is the number of samples combined in a single pool. Knowing q, we can solve for ρ via $\rho = 1 - (1 - q)^{1/k}$. We will use the central limit theorem and the delta method to show that we need fewer Bloom filter pooled tests than random individual tests to estimate the prevalence. The central limit theorem states that $\sqrt{m}\,(\bar{X}_m - \mu)$ converges in distribution to $\mathcal{N}(0, \sigma^2)$. Suppose we combine k samples into each bin. Now X is the test status of the bin and it is positive with probability $q = 1 - (1 - \rho)^k$. Hence µ = q and $\sigma^2 = q(1 - q)$. Use the delta method with the map $q \mapsto 1 - (1 - q)^{1/k}$ to transfer this asymptotic normality to the estimate of ρ.
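A small simulation sketch (ours) of the pooled prevalence estimator described above, assuming perfect tests: estimate q from the fraction of positive pools and invert ρ = 1 − (1 − q)^{1/k}; the pool size and counts are illustrative.

```python
import numpy as np

# Pooled prevalence estimation with perfect tests:
# estimate q from the fraction of positive pools, then invert rho = 1 - (1 - q)^(1/k).

rng = np.random.default_rng(1)

def estimate_prevalence(true_rho, k, num_pools):
    statuses = rng.random((num_pools, k)) < true_rho       # k independent samples per pool
    q_hat = statuses.any(axis=1).mean()                    # estimate of q = 1 - (1 - rho)^k
    return 1.0 - (1.0 - q_hat) ** (1.0 / k)                # plug-in estimate of rho

true_rho = 0.001
print(estimate_prevalence(true_rho, k=32, num_pools=500))  # pooled estimate
print((rng.random(500) < true_rho).mean())                 # naive estimate from 500 individual tests
```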
References
• Space/time trade-offs in hash coding with allowable errors
• A note on double pooling tests
• Non-adaptive group testing: Explicit bounds and novel algorithms
• Graph-constrained group testing
• Elements of information theory
• Adaptive submodularity: Theory and applications in active learning and stochastic optimization
• Near-optimal sensor placements in Gaussian processes
• Temporal dynamics in viral shedding and transmissibility of COVID-19
• Non-adaptive hypergeometric group testing
• Efficiently decodable non-adaptive group testing
• Variational probabilistic inference and the QMR-DT network
• Non-adaptive group testing in the presence of errors
• Probabilistic graphical models: principles and techniques
• Pooling of samples for testing for SARS-CoV-2 in asymptomatic people. The Lancet Infectious Diseases
• Counter braids: a novel counter architecture for per-flow measurement
• Optimal speedup of Las Vegas algorithms
• Nonadaptive group testing with random set of defectives
• Bloom-filter inspired testing of pooled samples (and splitting of swabs!)
• Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis
• Reconstructed diagnostic sensitivity and specificity of the RT-PCR test for COVID-19. medRxiv
• FACT - Frankfurt adjusted COVID-19 testing - a novel method enables high-throughput SARS-CoV-2 screening without loss of sensitivity. medRxiv
• The trinity of COVID-19: immunity, inflammation and intervention
• Model counting of monotone CNF formulas with spectra
• Paper-origami-based multiplexed malaria diagnostics from whole blood
• Evaluation of COVID-19 RT-qPCR test in multi-sample pools

Acknowledgments. Gary Bécigneul is funded by the Max Planck ETH Center for Learning Systems. Benjamin Coleman and Anshumali Shrivastava are supported by NSF-1652131, NSF-BIGDATA-1838177, AFOSR-YIP FA9550-18-1-0152, an Amazon Research Award, and an ONR BRC grant for Randomized Numerical Linear Algebra.