Near optimal sparsity-constrained group testing: improved bounds and algorithms
Oliver Gebhard, Max Hahn-Klimroth, Olaf Parczyk, Manuel Penschuck, Maurice Rolvien, Jonathan Scarlett, Nelvin Tan
2020-04-24

Abstract: Recent advances in noiseless non-adaptive group testing have led to a precise asymptotic characterization of the number of tests required for high-probability recovery in the sublinear regime $k = n^{\theta}$ (with $\theta \in (0,1)$), where $n$ is the number of individuals among which $k$ are infected. However, the required number of tests may increase substantially under real-world practical constraints, notably including bounds on the maximum number $\Delta$ of tests an individual can be placed in, or the maximum number $\Gamma$ of individuals in a given test. While previous works have given recovery guarantees for these settings, significant gaps remain between the achievability and converse bounds. In this paper, we substantially or completely close several of the most prominent gaps. In the case of $\Delta$-divisible items, we show that the definite defectives (DD) algorithm coupled with a random regular design is asymptotically optimal in dense scaling regimes, and optimal to within a factor of $e$ more generally; we establish this by strengthening both the best known achievability and converse bounds. In the case of $\Gamma$-sized tests, we provide a comprehensive analysis of the regime $\Gamma = \Theta(1)$, and again establish a precise threshold, proving the asymptotic optimality of SCOMP (a slight refinement of DD) equipped with a tailored pooling scheme. Finally, for each of these two settings, we provide near-optimal adaptive algorithms based on sequential splitting, and provably demonstrate gaps between the performance of optimal adaptive and non-adaptive algorithms.

The group testing problem, originally introduced by Dorfman [2], is a prominent example of a classical inference problem that has recently regained considerable attention [3], [4], [5]. Briefly, the problem is posed as follows: among a population of n individuals, a small subset of k individuals is infected with a rare disease. We are able to test groups of individuals at once, and each test result returns positive if (and only if) there is at least one infected individual in the test group. The challenge is to develop strategies for pooling individuals into tests such that the status of every individual can be recovered reliably from the outcomes, and to do so using as few tests as possible. While the preceding terminology corresponds to medical applications, group testing also has many other key applications [3, Sec. 1.7], ranging from DNA sequencing [6], [7] to protein interaction experiments [8], [9]. Particular attention has been paid to group testing as a tool for the containment of an epidemic crisis. On the one hand, mass testing appears to be an essential tool to face pandemic spread [10], while on the other hand, the capability of identifying infected individuals quickly and at low cost is indispensable [11]. For the sake of pandemic control, risk surveillance plans aim at an early, fast, and efficient identification of infected individuals to prevent diseases from spreading [12], [13], [14].
The group testing problem includes many variants, depending on the presence/absence of noise, possible adaptivity of the tests, recovery requirements, and so on. Our focus in this paper is on the following setup, which has been the focus of numerous recent works (see [3] for a survey):
• The tests are non-adaptive, meaning they must all be designed in advance before observing any outcomes. This is highly desirable in applications, as it permits the tests to be implemented in parallel.
• The tests are noiseless; this assumption is more realistic in some applications than others, but serves as an important starting point for understanding the problem.
• The goal is high-probability identification of each individual's defectivity status (i.e., probability approaching one as n → ∞). While a deterministic (probability-one) recovery guarantee is also feasible in the noiseless setting [5], it requires considerably more tests, incurring a k^2 dependence on the number of infected individuals (whenever k ≤ O(√n)) instead of k.
• The number of infected individuals k is taken to equal n^θ for some θ ∈ (0, 1), i.e., the sublinear regime. Heaps' law of epidemics [15], [16] indicates that this regime is of major interest. In addition, recent hardness results preclude non-trivial recovery guarantees in the linear regime k = Θ(n) [17], at least under the most widely-adopted recovery criterion.
Under this setup, Coja-Oghlan et al. [18], [4] recently established the exact information-theoretic threshold on the number of tests, in an asymptotic sense including the implied constant. This threshold was originally attained using a random regular testing design [18] (see also [19]), improving on earlier results for Bernoulli testing [20], [21]. While the recovery algorithm used in [18] is not computationally efficient, the subsequent work [4] attained the same threshold using a spatially coupled random regular design and a computationally efficient recovery algorithm. All of the preceding test designs have in common that each individual takes part in O(ln n) tests, and each test contains O(n/k) individuals. As a result, these designs face limitations in real-world applications. Firstly, one may face dilution effects: if an infected individual gets tested within a group of many uninfected individuals, the signal of the infection (e.g., the concentration of the relevant molecules) might be too low. For instance, a testing scheme for HIV typically should not contain more than 80 individual samples per test [22]. More recently, evidence was found that certain laboratory tests allow pooling of up to 5 individuals [23] or 64 individuals [24] per test for reliably detecting COVID-19 infections. Secondly, it is often the case that each individual can only be tested a certain number of times, due to the limited volume of the sample taken. More generally, test designs with few tests-per-individual and/or individuals-per-test may be favorable due to resource limitations, difficulties in manually placing samples into tests, and so on. In light of these practical issues, there is substantial motivation to study the group testing problem under the following constraints on the test design:
• Under the ∆-divisible items constraint (or bounded resource model), any given individual can be tested at most ∆ times;
• Under the Γ-sized tests constraint (or bounded test-size model), any given test can only contain at most Γ individuals.
Previous studies of group testing under these constraints [25], [26], [27], [1] are surveyed in Section I-A. We note that some of the above practical motivations may warrant more sophisticated models (e.g., random noise models for dilution effects), but nevertheless, noiseless group testing under the preceding constraints serves as an important starting point towards a full understanding. In addition, as with previous works, we only consider the above two constraints separately, though the case that both are present simultaneously may be of interest for future studies. As outlined above, the asymptotically optimal performance limits are well understood in the case of unconstrained test designs, with optimal designs placing each item in ∆ = Θ(ln n) tests, and each test containing Γ = Θ(n/k) items. We refer the reader to [3] for a more detailed survey, and subsequently focus our attention on the (much more limited) prior work considering the constrained variants with ∆ = o(ln n) and Γ = o(n/k). The most relevant prior work is that of Gandikota et al. [25], who gave information-theoretic lower bounds on the number of tests under both kinds of constraint, as well as upper bounds via the simple COMP algorithm [28]. (The COMP algorithm declares any individual in a negative test as uninfected, and all other individuals as infected; it is called the Column Matching Algorithm in [25].) The main results therein are summarised as follows, assuming the sublinear regime k = n^θ with θ ∈ (0, 1) throughout (we sometimes refer to θ as the density parameter): for Γ-sized tests with Γ = Θ(n^β) for some β ∈ [0, 1) and a target error probability ξ = n^{−ζ} with ζ > 0, the error probability is at most ξ when m ≥ (1+ζ)/((1−θ)(1−β)) · n/Γ (Theorem 4.6 in [25]); analogous achievability and converse bounds are given for ∆-divisible items. A sizable gap remains between the achievability and converse bounds in the case of ∆-divisible items. For Γ-sized tests, the bounds match to within a constant factor, but the optimal constant remains unknown; in particular, the two differ by at least a multiplicative 1/(1−θ) factor, and even for θ close to zero, the two can differ by a factor of 2 due to the rounding in the achievability part. As we outline further below, we nearly completely close these gaps for ∆-divisible items, and we close them completely for Γ-sized tests in the special case β = 0 (i.e., Γ = Θ(1)) for all θ ∈ (0, 1). We achieve these results using both the DD and SCOMP algorithms introduced in [20]. While the regime β ∈ (0, 1) is also of interest, it appears to require different techniques, and is deferred to future work.
TABLE I caption: Overview of noiseless non-adaptive sparsity-constrained group testing results under the scaling k = n^θ (θ ∈ (0, 1)). For the setting of Γ-sized tests, this table only corresponds to Γ = Θ(1), and in both settings we neglect higher-order terms and the dependence on the error probability. See the main text for more complete and precise statements.
Gandikota et al. [25] additionally gave explicit designs (i.e., test matrices that can be deterministically constructed in polynomial time), but these give worse scaling laws, and are therefore of less relevance to our results based on random designs. In a distinct but related line of works, Macula [27] and Inan et al. [26], [29] developed designs for the much stronger guarantee of uniform recovery, i.e., a single test matrix that uniquely recovers any infected set of size at most k, without allowing any error probability.
This stronger guarantee comes at the price of requiring considerably more tests, and we thus omit a direct comparison and refer the interested reader to [27], [26], [29] for details. Our main contributions are informally outlined as follows (with k = n^θ for θ ∈ (0, 1), and ε being an arbitrarily small constant throughout), with "w.h.p." meaning with probability approaching one as n → ∞. The formal statements are given in the theorems referenced, and the results are also summarised in Table I.
• ∆-divisible items setting: Assuming that ∆ = (ln n)^{1−Ω(1)} (and in some cases, any ∆ = o(ln n) is allowed), we have the following:
- (General converse - Theorem 3.1) If m ≤ (1−ε) e^{−1} ∆ k^{1+(1−θ)/(∆θ)}, then w.h.p. any (possibly adaptive) group testing strategy fails.³
- (Non-adaptive converse - Theorem 3.2) Under any non-adaptive test design, if ∆ > θ/(1−θ) and m ≤ (1−ε) ∆ k^{1+1/∆}, then w.h.p. any inference algorithm fails. Combining with the general lower bound, the same holds for
m ≤ (1−ε) max{ e^{−1} ∆ k^{1+(1−θ)/(∆θ)}, ∆ k^{1+1/∆} }. (1)
- (DD achievability - Theorem 3.3) Under the random regular design, DD succeeds when m exceeds the corresponding threshold by a factor of (1+ε) (w.h.p. when ∆ = ω(1), and with probability Ω(1) when ∆ = Θ(1)).
- (DD-specific converse - Theorem 3.4) Under random regular testing, DD fails when m is slightly below the achievability bound (w.h.p. when ∆ = ω(1), and with Ω(1) probability when ∆ = Θ(1)).
- (Adaptive achievability - Theorem 5.1) There exists an efficient adaptive algorithm succeeding with probability one when m ≥ (1+ε) ∆ k^{1+(1−θ)/(∆θ)}.
(Footnote 3: These expressions are obtained after substituting k = n^θ. In the more general case that k equals a positive constant times n^θ, the results remain essentially unchanged. Note also that the achievability bounds may exceed n in some scaling regimes, but in such cases m = n tests still suffice, since one can instead resort to one-by-one testing.)
• Γ-sized tests setting: Assuming that Γ = Θ(1) in the non-adaptive setting (whereas the adaptive results allow general Γ = o(n/k)), we have the following:
- (Non-adaptive achievability - Theorems 4.10 and 4.18) SCOMP succeeds w.h.p. under suitably chosen test designs. We use different test designs and analyses for the dense regime θ ≥ 1/2 (Theorem 4.10) and the sparse regime θ < 1/2 (Theorem 4.18), and combine the two results to get the overall condition on m in Section IV-F. For the dense regime, our analysis shows that DD has the same guarantee, whereas for the sparse regime, we crucially require the refined SCOMP algorithm.
- (Adaptive achievability - Theorem 6.1) There exists an efficient adaptive algorithm succeeding with probability one when m ≥ (1+ε)(n/Γ + k log₂ Γ). In particular, when Γ = o(n/(k ln n)), it suffices that m ≥ (1+ε) n/Γ.
- (General converse - Theorem 6.2) If m ≤ (1−ε) n/Γ, then the error probability is bounded away from zero for any (possibly adaptive) group testing strategy.
These results have several interesting implications, which we discuss as follows. In the ∆-divisible setting, our first converse bound strengthens that of [25] (removing the −5ξ term in the exponent) and extends it to the adaptive setting, and our second converse provides a further improvement for non-adaptive designs.
For θ > 1/2 and ∆ = ω(1), our DD achievability result matches the non-adaptive converse: our results demonstrate that DD is asymptotically optimal among non-adaptive strategies, with a precise phase transition between success and failure at m ≈ ∆ k^{1+1/∆}. For θ < 1/2, while establishing a precise phase transition remains an open problem, our results establish DD's optimality up to a multiplicative factor of e, and demonstrate that one cannot reduce the number of tests further under DD and the random regular design. Finally, our results prove a strict adaptivity gap for θ > 1/2, and demonstrate that our adaptive algorithm is optimal to within a factor of e for all θ ∈ (0, 1). In the Γ-sized tests setting, our results provide an exact asymptotic threshold on the number of tests in the Γ = Θ(1) regime, and we establish the asymptotic optimality of SCOMP in all such cases. To achieve this, we adopt novel analysis techniques specific to this scaling, including a novel test design in the case θ < 1/2, as described in the next section. This case of θ < 1/2 also has the interesting feature that using SCOMP instead of DD appears to be crucial, in stark contrast with other settings in which the two algorithms tend to have identical asymptotic performance [18]. We note that the distinction between integer and non-integer values of θ/(1−θ) arises due to rounding issues in the analysis, e.g., counting the number of individuals appearing in at most θ/(1−θ) tests. Our results again demonstrate a strict adaptivity gap (this time for all θ ∈ (0, 1)), and we provide a precise phase transition at n/Γ for adaptive algorithms under most scalings of Γ. Finally, in Section VIII, we present numerical results for small population sizes to support our theoretical findings.
Given the number of individuals n, the number of infected individuals k ∼ n^θ (θ ∈ (0, 1)), and the number of tests m, we let G = (V ∪ F, E) be a bipartite (multi-)graph with |F| = m factor nodes (a_1, ..., a_m) and |V| = n variable nodes (x_1, ..., x_n). The variable nodes represent individuals, the factor nodes represent tests, and an edge between individual x_i and test a_j indicates that x_i takes part in test a_j. Furthermore, let (∂_G a_1, ..., ∂_G a_m) and (∂_G x_1, ..., ∂_G x_n) denote the neighbourhoods in G. Whenever the context clarifies what G is, we drop the subscript. The test-node degrees are given by Γ_i(G) = |∂_G a_i|, and the individual-node degrees by ∆_i(G) = |∂_G x_i|. We can visualise any non-adaptive group testing instance by a pooling scheme in the form of such a graph G. We indicate the infection status of each individual of the population by σ ∈ {0,1}^n, a uniformly chosen vector of Hamming weight k.
Formally, σ_x = 1 iff x is infected. Then, we let σ̂ = σ̂(G, σ) ∈ {0,1}^m denote the sequence of test results, such that σ̂_a = 1 iff test a contains at least one infected individual; that is,
σ̂_a = max_{x ∈ ∂a} σ_x.
Throughout the paper, we use standard Landau notation, e.g., o(1) is a function converging to 0, while ω(1) stands for an arbitrarily slowly diverging function. Moreover, we say that a property P holds with high probability (w.h.p.) if it holds with probability 1 − o(1) as n → ∞. The random (almost-)regular bipartite pooling scheme is known to be information-theoretically optimal in the unconstrained variant of group testing [18], and is conceptually simple and easy to implement. In this work, depending on the setup, we sometimes require less standard schemes, as described in the following. It is important to note that in each of these designs, we construct a multi-graph rather than a graph, and every multi-edge is counted when referring to a node degree. In the following, we define our choices of the restricted pooling schemes and denote them by G_∆ and G̃_Γ.
1) ∆-divisible: In this setup, we adopt the design of [18], [19], but with fewer tests per individual in accordance with the problem constraint: each individual chooses ∆ tests uniformly at random with replacement; thus, an individual may be placed in the same test more than once. By construction of G_∆, any individual has degree exactly ∆, whereas the test degrees fluctuate. We denote by Γ(G_∆) = (Γ_1(G_∆), ..., Γ_m(G_∆)) the (random) sequence of test degrees.
2) Γ-sparse: In the Γ-sparse case, our choice of pooling scheme requires additional care; we define G̃_Γ(θ) separately for two cases, G̃_Γ(θ) = G_Γ for θ ≥ 1/2 and G̃_Γ(θ) = G*_Γ for θ < 1/2, with G_Γ and G*_Γ defined in the following. Throughout the paper, we will always clarify which of the cases we assume, and we will therefore refer to G̃_Γ(θ) as G̃_Γ. Starting with G_Γ, we employ the configuration model [30]. Given n, m, Γ, set ∆ = mΓ/n (it will turn out in due course that mΓ/n is an integer under the choice of Γ used in the analysis), and create for each individual x a set of ∆ clones {x} × [∆], and for each test a a set of Γ clones {a} × [Γ]. Then, choose a perfect matching uniformly at random between the individual-clones and the test-clones, and construct a random multi-graph by merging the clones into vertices and adding an edge (x, a) whenever there are i ∈ [∆], j ∈ [Γ] such that the edge ({x} × {i}, {a} × {j}) is part of the perfect matching (in other words, the edge (x, a) exists in the graph as a result of the i-th clone of x and the j-th clone of a being matched). We denote by G_Γ the random regular multi-graph that results from this procedure. For G*_Γ, we adopt a different approach. First, we select γ ≤ 2n/(Γ+1) individuals randomly and set them apart for the moment (denote by X = {x_1, ..., x_γ} the set of those vertices). The precise value of γ is chosen such that we can create a random bipartite regular graph on the remaining vertices with each individual having degree 2 and each test having degree Γ−1 (thus, an instance of G_{Γ−1}). By a simple comparison of degrees, this is only possible if m ≥ 2n/(Γ+1). Then, we draw a uniformly random matching between the tests (of degree Γ−1) and the set-aside individuals x_1, ..., x_γ. By definition, each of those individuals takes part in exactly one test. In both cases above, G̃_Γ is an almost-regular bipartite graph with each test comprising at most Γ individuals.
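To make these designs concrete, the following minimal Python sketch (our own illustration; the function names and parameters are not from the paper) generates the random regular design G_∆, in which each individual joins ∆ tests chosen uniformly at random with replacement, and computes the outcome vector σ̂:

```python
import random

def random_regular_design(n, m, delta, rng=random.Random(0)):
    """G_Delta: every individual joins delta tests chosen uniformly at
    random *with replacement*, so multi-edges are possible."""
    tests = [[] for _ in range(m)]        # tests[a] lists the members of test a
    for x in range(n):
        for _ in range(delta):
            tests[rng.randrange(m)].append(x)
    return tests

def outcomes(tests, infected):
    """sigma_hat[a] = 1 iff test a contains at least one infected individual."""
    return [int(any(x in infected for x in test)) for test in tests]

# Small example: n individuals, k infected chosen uniformly at random.
n, k, delta, m = 1000, 10, 3, 60
rng = random.Random(1)
infected = set(rng.sample(range(n), k))
tests = random_regular_design(n, m, delta, rng)
sigma_hat = outcomes(tests, infected)
print(sum(sigma_hat), "of", m, "tests are positive")
```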
We make use of the definite defectives (DD) and sequential combinatorial orthogonal matching pursuit (SCOMP) algorithms [20], which are described as follows; note that SCOMP amounts to running DD and then performing greedy improvements. (A positive test is called unexplained if it does not contain any individuals that have already been marked as infected.)
1. Declare every individual x that appears in a negative test as uninfected; remove all such individuals.
2. Declare all individuals that are now the sole individual in a (positive) test as infected.
3. Proceed as follows depending on the algorithm:
• For DD, declare all remaining individuals as uninfected.
• For SCOMP, repeat the following step until no unexplained positive tests remain: declare as infected the (previously undeclared) individual appearing in the largest number of unexplained positive tests.
(The DD and SCOMP algorithms as defined by [20].)
In this section, we introduce four types of individuals (see Figure 3) that might appear in any group testing instance and that our analysis makes use of. It turns out that the sizes of the sets of these individuals are the key to understanding group testing combinatorially. Given a pooling scheme G, let V_0(G) and V_1(G) be the sets of uninfected and infected individuals, respectively. Then we define the easy uninfected individuals to be the uninfected individuals that appear in a negative test; clearly, they can easily be identified. We call the set of such individuals V_{0−}; formally,
V_{0−}(G) = {x ∈ V_0(G) : ∃a ∈ ∂x with σ̂_a = 0}.
Then, there are the easy infected individuals (sometimes referred to as definite defectives). These are the infected individuals that appear in at least one test whose remaining members are all easy uninfected individuals. Thus, upon removing the easy uninfected individuals, there will be at least one positive test with exactly one undeclared individual, and this individual has to be infected. We call this set V_{1−−}; formally,
V_{1−−}(G) = {x ∈ V_1(G) : ∃a ∈ ∂x with ∂a \ {x} ⊆ V_{0−}(G)}.
Subsequently, there might be disguised uninfected individuals, which are uninfected themselves but only appear in positive tests. It is well known [31], [18], [4] that since the prior probability of being uninfected is very large, a group testing instance can tolerate a certain number of individuals of this type. Formally,
V_{0+}(G) = {x ∈ V_0(G) : σ̂_a = 1 for all a ∈ ∂x}.
Finally, there might be disguised infected individuals, that is, infected individuals appearing only in tests that contain at least one further infected individual. Formally,
V_{1+}(G) = {x ∈ V_1(G) : (∂a \ {x}) ∩ V_1(G) ≠ ∅ for all a ∈ ∂x}.
Figure 3 caption: Infected individuals (red) that appear only in tests containing a further infected individual are impossible to identify. Infected individuals of V_{1−−} appear in at least one test with only elements of V_{0−}; thus, after identifying all elements of V_{0−}, they can be identified. The dashed lines represent the fact that the individuals may also participate in other tests; these may include negative tests classifying their participants as uninfected (elements of V_{0−}) even though the particular test displayed is positive.
While the above types of individuals are not exhaustive, we will see in Section II-E that they are the relevant types for the information-theoretic and algorithmic analyses. A small implementation sketch of these sets and of the above decoders is given next.
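Continuing the sketch above (again our own illustration, reusing `outcomes` and the variables from the previous snippet), the four sets of Section II-D and the DD/SCOMP decoders can be computed directly from the instance:

```python
def classify(tests, n, infected):
    """Compute V0-, V1--, V0+, V1+ (Section II-D) from the ground truth."""
    sigma_hat = outcomes(tests, infected)
    nb = [[] for _ in range(n)]                    # nb[x] = tests containing x
    for a, test in enumerate(tests):
        for x in set(test):
            nb[x].append(a)
    V0m = {x for x in range(n) if x not in infected
           and any(sigma_hat[a] == 0 for a in nb[x])}
    V1mm = {x for x in infected
            if any(all(y in V0m for y in tests[a] if y != x) for a in nb[x])}
    V0p = {x for x in range(n) if x not in infected
           and nb[x] and all(sigma_hat[a] == 1 for a in nb[x])}
    V1p = {x for x in infected
           if all(any(y != x and y in infected for y in tests[a]) for a in nb[x])}
    return V0m, V1mm, V0p, V1p

def dd_decode(tests, n, sigma_hat):
    """DD: clear anyone in a negative test, then declare sole remaining
    individuals of positive tests infected; everyone else uninfected."""
    cleared = {x for a, t in enumerate(tests) if sigma_hat[a] == 0 for x in t}
    declared = set()
    for a, t in enumerate(tests):
        if sigma_hat[a] == 1:
            rest = set(t) - cleared
            if len(rest) == 1:
                declared |= rest
    return declared

def scomp_decode(tests, n, sigma_hat):
    """SCOMP: run DD, then greedily declare the individual appearing in the
    most unexplained positive tests until every positive test is explained."""
    cleared = {x for a, t in enumerate(tests) if sigma_hat[a] == 0 for x in t}
    declared = dd_decode(tests, n, sigma_hat)
    while True:
        unexplained = [a for a, t in enumerate(tests)
                       if sigma_hat[a] == 1 and not declared & set(t)]
        if not unexplained:
            return declared
        votes = {}
        for a in unexplained:
            for x in set(tests[a]) - cleared - declared:
                votes[x] = votes.get(x, 0) + 1
        declared.add(max(votes, key=votes.get))

V0m, V1mm, V0p, V1p = classify(tests, n, infected)
print("DD succeeds:", dd_decode(tests, n, sigma_hat) == infected,
      "| infected within V1--:", infected <= V1mm)
```

On any instance, the two printed booleans agree: the set declared by DD is exactly V_{1−−}, in line with the characterization of DD given in Claim 2.4 below.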
1) Remarks on information-theoretic and combinatorial bounds: It turns out that in the sparse group testing problem, as well as in the unrestricted version [20], [4], the non-adaptive information-theoretic phase transition comes in two installments. First, there are universal information-theoretic bounds, e.g., counting bounds, that account for the fact that a given number of tests can carry only a certain amount of information. Such bounds apply directly to the non-adaptive as well as the adaptive setting. Second, there are combinatorial / graph-theoretic restrictions: given that there exists a large number of disguised infected individuals (i.e., individuals such that each of their tests contains a second infected individual), any non-adaptive algorithm fails with high (conditional) probability [18], [4]. This non-adaptivity gap becomes stronger as the infection density parameter θ increases, because for larger θ, the chance of finding multiple infected individuals in a small neighbourhood increases as well. In this section, we deal with the combinatorial part. In our setting, the transition where the combinatorial bound dominates the information-theoretic bound happens at k ∼ √n, i.e., at the point where we find multiple infected individuals in a bounded neighbourhood w.h.p.
Given a pooling scheme G, a ground-truth infection status vector σ (drawn uniformly from the vectors of Hamming weight k), and a sequence of test results σ̂, we denote by S_k(G, σ) the set of all colorings (i.e., infection status assignments) τ ∈ {0,1}^n of the individuals that would have led to the test outcomes σ̂ (clearly including σ itself). Furthermore, we define Z_k(G, σ) = |S_k(G, σ)|. The following proposition states that all assignments in S_k(G, σ) are equally likely given the test outcomes.
Proposition 2.1 (Corollary 2.1 of [18]): For all τ ∈ {0,1}^n, we have
P(σ = τ | G, σ̂) = 1{τ ∈ S_k(G, σ)} / Z_k(G, σ).
This immediately implies the following corollary.
Corollary 2.2: If Z_k(G, σ) ≥ ℓ w.h.p., then any inference algorithm recovers σ from (G, σ̂) with probability at most ℓ^{−1}(1 + o(1)).
In other words, as soon as multiple satisfying assignments exist, one cannot do better than selecting one uniformly at random, as no further information is contained in G and σ̂ [32]. The following claims will also be useful.
Claim 2.3: For any test design, we have Z_k(G, σ) ≥ |V_{1+}(G)| · |V_{0+}(G)|. Hence, conditioned on the sets V_{1+}(G) and V_{0+}(G), any inference algorithm fails with probability at least 1 − 1/(|V_{1+}(G)| |V_{0+}(G)|).
Proof: The first statement is straightforward and was already given in [18, Fact 3.3], and the second statement follows directly from Corollary 2.2.
Finally, we have the following well-known result on the DD algorithm.
Claim 2.4: DD recovers σ from (G, σ̂) if and only if V_1(G) \ V_{1−−}(G) = ∅.
Proof: By definition, DD first classifies all x ∈ V_{0−}(G) correctly. In the second step, DD classifies as infected those individuals x that belong to a positive test a such that ∂a \ {x} ⊆ V_{0−}(G). Thus, DD finds all x ∈ V_1 ∩ V_{1−−}(G). As DD classifies the remaining individuals as uninfected, it fails as soon as there exists an individual x ∈ V_1 \ V_{1−−}(G).
We note that even if V_1(G) \ V_{1−−}(G) ≠ ∅, the DD algorithm does not produce any false positives, but only false negatives. In addition, if DD succeeds then SCOMP is guaranteed to succeed [33]; however, unlike DD, in general SCOMP may produce both false positives and false negatives.
A key tool for dealing with an arbitrary test design is to introduce certain levels of independent randomness. For example, the only randomness in (G, σ) is the infection status of each individual. We will see in due course that we can study an independent infection model (denoted by σ*) instead of dealing with exactly k infected individuals, specifically considering each individual as being infected independently of all others with probability p = (k − √(k ln n))/n (see Corollary 3.6). For the purposes of establishing a converse, the main step is to show that V_{1+}(G) ≠ ∅, and we will establish this in two steps.
We denote by V_+(G) the set of disguised individuals, i.e., those individuals x such that every test containing x contains at least one other infected individual (differing from x); hence, V_{1+}(G) = V_+(G) ∩ V_1(G) and V_{0+}(G) = V_+(G) ∩ V_0(G). Once we find a large enough set, |V_+(G)| ≫ n/k, there will be some infected individuals in V_+(G) w.h.p. The main challenge is that in order to find the set of disguised individuals, one uses infected individuals; therefore, the events of |V_+(G)| exceeding a specific size and of infected individuals existing in V_+(G) are not independent in (G, σ*). This is where the two-round exposure technique, used very prominently in the study of random graphs [30], comes into play. More specifically, our analysis will take the following steps, in which individuals are randomly infected:
1) We first mark each individual as infected with probability αk/n for some fixed constant α ∈ (0, 1), and find a set K_1 of infected individuals whose neighbourhood (the tests they belong to) has certain properties.
2) Next, we mark the remaining individuals in the second neighbourhood of K_1 (hence, we look at the individuals that are contained in tests together with the vertices of K_1) as infected independently with probability (1−2α)k/n, for establishing the property of being disguised.
3) After the previous steps, each individual has been infected with some probability p_i ≤ p; we finally mark each individual as infected with probability (p − p_i)/(1 − p_i), where p_i is the probability already incurred from the first two steps. By doing so, the overall distribution of σ* is i.i.d. with parameter p, as desired. While these extra infections are not actually analyzed, the idea is that they produce the desired overall distribution, while only enlarging (or keeping unchanged) the set of individuals that are disguised.
In this section, we formally state and prove our main results regarding non-adaptive group testing with ∆-divisible individuals. As we highlighted earlier, optimal unconstrained designs are known that place each individual in Θ(ln n) tests. Accordingly, we only consider the regime ∆ = o(ln n), and specifically suppose that ∆ ≤ (ln n)^{1−δ} for some constant δ ∈ (0, 1). Define
m_inf(∆) = max{ e^{−1} ∆ k^{1+(1−θ)/(∆θ)}, ∆ k^{1+1/∆} } and m_DD(∆) = max{ ∆ k^{1+(1−θ)/(∆θ)}, ∆ k^{1+1/∆} },
which will represent the information-theoretic converse bound for any non-adaptive group testing scheme and the algorithmic barrier for DD, respectively. In the following, we assume that ∆ ≥ 2 and ∆ > θ/(1−θ). If the latter inequality is reversed, then we find that m_DD(∆) = ω(n), in which case one is better off resorting to one-by-one testing. Our first main result provides a simple counting-based converse bound for any adaptive or non-adaptive test design. This result, and all subsequent results, will be proved throughout the rest of the section; an overview of the proof strategy is provided in Section III-B1.
Theorem 3.1: Fix ε ∈ (0, 1), and suppose that k = n^θ with θ ∈ (0, 1) and m ≤ (1−ε) e^{−1} ∆ k^{1+(1−θ)/(∆θ)}. Then w.h.p., any (possibly adaptive) group testing procedure that tests each individual at most ∆ times fails to recover σ.
This bound recovers the first term of the max{·,·} appearing in the definition of m_inf(∆) above, which is dominant for θ ≤ 1/2. For the second term (which is dominant for θ ≥ 1/2), we require a more sophisticated argument that only holds for non-adaptive designs; as we will see in Section V, adaptive designs can in fact go beyond this threshold. The proof of Theorem 3.1 is given in Section III-C.
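As a quick numerical illustration (a sketch under the definitions of m_inf(∆) and m_DD(∆) stated above; the helper names are ours), the following snippet evaluates both thresholds, showing the factor-e gap for θ < 1/2 and exact agreement for θ ≥ 1/2:

```python
import math

def m_inf(n, theta, delta):
    """Information-theoretic converse threshold for Delta-divisible items."""
    k = n ** theta
    return delta * max(math.e ** -1 * k ** (1 + (1 - theta) / (delta * theta)),
                       k ** (1 + 1 / delta))

def m_dd(n, theta, delta):
    """Algorithmic (DD) threshold under the random regular design."""
    k = n ** theta
    return delta * max(k ** (1 + (1 - theta) / (delta * theta)),
                       k ** (1 + 1 / delta))

n, delta = 10 ** 6, 5
for theta in (0.3, 0.5, 0.7):
    ratio = m_dd(n, theta, delta) / m_inf(n, theta, delta)
    print(f"theta={theta}: m_DD/m_inf = {ratio:.3f}")
# For theta < 1/2 the ratio is e ~ 2.718; for theta >= 1/2 it equals 1.
```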
Theorem 3.2: Given any non-adaptive pooling scheme G in which any individual gets tested at most ∆ times (with θ/(1−θ) < ∆ ≤ (ln n)^{1−δ}) and m ≤ (1−ε) ∆ k^{1+1/∆}, any inference algorithm fails to recover σ w.h.p. if ∆ = ω(1), and with probability bounded away from zero if ∆ = O(1).
Combining these results, we find that any non-adaptive group testing strategy using at most (1−ε) m_inf(∆) tests fails w.h.p. if ∆ = ω(1), and fails with constant non-zero probability if ∆ = O(1). We provide the proof of Theorem 3.2 in Section III-D. Next, we state our main upper bound, corresponding to the random regular design and the DD algorithm.
Theorem 3.3: Suppose that m = (1+ε) m_DD(∆) for some ε > 0. Then, under the random regular design with parameter ∆, DD recovers σ from (G_∆, σ̂) with a probability that tends to one as ∆ → ∞. If ∆ = O(1), then we need to take ε → ∞ for the success probability to approach one (but it can be close to one for finite ε).
The proof of Theorem 3.3 is given in Section III-E. Comparing this result with the above converse bounds, we find that DD is asymptotically optimal for θ ≥ 1/2. On the other hand, a gap between m_inf(∆) and m_DD(∆) remains for θ < 1/2. In principle, this could be due to a weakness in the converse, a fundamental limitation of DD, or a weakness in our analysis of DD. However, the following theorem rules out the last of these.
Theorem 3.4: Let θ < 1/2. Given the random regular pooling scheme G_∆ with m = (1−ε) m_DD(∆) tests for fixed ε ∈ (0, 1), we have the following:
1) If ∆ = Θ(1), then DD fails with probability bounded away from zero.
2) If ∆ = (ln n)^{1−δ} for δ ∈ (0, 1), then DD fails w.h.p.
Thus, Theorem 3.4 settles a coarse phase transition of DD in the random regular model when there are finitely many tests per individual, and a sharp phase transition when the number of tests per individual is diverging. The proof of Theorem 3.4 is provided in Section III-F. We expect that DD is in fact provably suboptimal for θ < 1/2, but leave this as an open problem.
1) Overview of proofs: Before proving Theorems 3.1-3.4, we provide a brief overview:
• To prove Theorem 3.1, we establish an upper bound on the probability that an arbitrary inference algorithm recovers σ correctly, based on the amount of information provided by the test results (which is inherently limited due to the testing constraints). This already suffices to show that as soon as the number of tests drops below a certain threshold, any inference algorithm must have an error probability approaching one.
• Theorem 3.2 deals with non-adaptive designs, which can be represented as a bipartite graph. The main argument is that when there are too many disguised infected and disguised uninfected individuals, perfect recovery becomes impossible, since interchanging these two types of individuals would not impact the test results. We carefully analyse the number of occurrences of these disguised individuals by means of local structures in the graph (see Figure 3).
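To visualise the phase transition around m_DD(∆) described in Theorems 3.3 and 3.4, the following small experiment (ours; it reuses random_regular_design, outcomes, dd_decode, and m_dd from the earlier sketches, and the parameter choices are purely illustrative) estimates DD's empirical success rate as m varies; for constant ∆ the transition is coarse, in line with Theorem 3.4:

```python
def dd_success_rate(n, k, delta, m, trials=100, seed=0):
    """Empirical probability that DD exactly recovers the infected set."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        infected = set(rng.sample(range(n), k))
        tests = random_regular_design(n, m, delta, rng)
        sigma_hat = outcomes(tests, infected)
        wins += dd_decode(tests, n, sigma_hat) == infected
    return wins / trials

n, theta, delta = 2000, 0.6, 4
k = round(n ** theta)
m_star = round(m_dd(n, theta, delta))
for c in (0.6, 0.8, 1.0, 1.2, 1.4):
    rate = dd_success_rate(n, k, delta, round(c * m_star))
    print(f"m = {c:.1f} * m_DD: success rate ~ {rate:.2f}")
```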
We first prove a counting-based upper bound on the success probability for any test design and inference algorithm; afterwards, we will use this bound to prove our main converse bound, providing a lower bound on m for attaining a given target error probability. Let A(G, σ̂, k) be the output of a group testing inference algorithm with input G (the pooling scheme), σ̂ (the test results), and k (the number of infected individuals). The inference algorithm is successful if A(G, σ̂, k) = σ, and P(A(G, σ̂, k) = σ) is the success probability. The following nonasymptotic counting-based bound is proved via a similar approach to [34] with suitable adjustments, and also using the Nishimori property similarly to [18].
Lemma 3.5: Under the preceding setup, for any pooling scheme G and inference algorithm A(G, σ̂, k), we have
P(A(G, σ̂, k) = σ) ≤ Σ_{i=0}^{∆k} C(m, i) / C(n, k).   (7)
Proof: Any given pooling scheme can be viewed as a deterministic mapping from an infection status vector σ ∈ {0,1}^n to an outcome vector σ̂ ∈ {0,1}^m. Recall from Proposition 2.1 that S_k(G, σ) is the set of all colorings of individuals that lead to the outcome sequence σ̂, and Z_k(G, σ) is its cardinality. In the following, we additionally let Ẑ_k(G, σ̂) denote Z_k(G, σ) when the test outcomes produced by (G, σ) are equal to σ̂, and let Ŝ_k(G, σ̂) be the set of all σ sequences that produce test outcomes σ̂. Proposition 2.1 shows that the optimal inference algorithm outputs an arbitrary element of S_k(G, σ), and is correct with probability (conditioned on σ) equal to 1/Z_k(G, σ). Thus, averaging over the C(n, k) possible k-sparse vectors σ, we have
P(A(G, σ̂, k) = σ) ≤ C(n,k)^{−1} Σ_σ 1/Z_k(G, σ) =^{(a)} C(n,k)^{−1} |{σ̂ : Ŝ_k(G, σ̂) ≠ ∅}| ≤^{(b)} Σ_{i=0}^{∆k} C(m, i) / C(n, k),
where (a) follows by grouping the σ with common outcomes σ̂, since there are then Ẑ_k(G, σ̂) terms in the inner summation, cancelling the 1/Ẑ_k(G, σ̂) factor, and (b) uses the fact that at most ∆k test outcomes can be positive, even in the adaptive setting; this is because adding another infected individual always introduces at most ∆ additional positive tests.
We now use the result in (7) to prove Theorem 3.1.
Proof of Theorem 3.1: Let m_count(∆) = e^{−1} ∆ k^{1+(1−θ)/(∆θ)} denote the threshold in the theorem statement. It suffices to prove the claim for m = (1−ε) m_count(∆), since the inference algorithm could choose to ignore tests. We use the nonasymptotic bound in Lemma 3.5, and upper bound the sum of binomial coefficients via [35, Section 4.7.1] to obtain
Σ_{i=0}^{∆k} C(m, i) ≤ e^{m h(∆k/m)},   (8)
where h(·) is the binary entropy function in nats (logs to base e). From (8), attaining a success probability of 1−ξ (for some ξ ∈ (0, 1)) requires e^{m h(∆k/m)} / C(n, k) ≥ 1−ξ, which implies that
∆k ln(em/(∆k)) ≥ (1 − o(1)) ln((1−ξ) C(n, k)),   (9)
where we use a Taylor expansion and the fact that ∆k/m = o(1), which gives h(∆k/m) = (1+o(1)) (∆k/m) ln(em/(∆k)). Rearranging (9), we obtain
m ≥ (1−o(1)) e^{−1} ∆k ((1−ξ) C(n, k))^{1/(∆k)} ≥^{(a)} (1−o(1)) e^{−1} ∆ k^{1+(1−θ)/(∆θ)} (1−ξ)^{1/(∆k)},
where (a) follows from the fact that C(n, k) ≥ (n/k)^k and k = n^θ. Finally, we note that (1−ξ)^{1/(∆k)} → 1 for any fixed ξ ∈ (0, 1), since k → ∞ by assumption. This means that m must be at least (1−o(1)) e^{−1} ∆ k^{1+(1−θ)/(∆θ)} to attain any fixed success probability 1−ξ, and hence, if m is instead a (1−ε) factor below this threshold (as we have assumed), then the success probability must tend to zero.
It suffices to prove the assertion of Theorem 3.2 for m = (1−ε) ∆ k^{1+1/∆}, since extra tests can only help (or can be ignored). Let ε, θ, δ ∈ (0, 1), and θ/(1−θ) < ∆ ≤ (ln n)^{1−δ}. Furthermore, let G be an arbitrary non-adaptive pooling scheme with V(G) the set of n individuals and F(G) the set of m = (1−ε) ∆ k^{1+1/∆} tests, such that each individual is tested at most ∆ times. Let
Γ̄ = m^{−1} Σ_{a ∈ F(G)} Γ_a.   (11)
Thus, Γ̄ represents the average degree of the tests in F(G), where Γ_a is the size of test a. We pick a set of k infected individuals uniformly at random and let σ be the {0,1}-vector representing them. We introduce p = (k − √(k ln n))/n and σ* as a binomial {0,1}-vector, such that each entry represents one individual and equals 1 with probability p independently of the others.
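Before stating the coupling result (Corollary 3.6 below), the following sketch (ours) illustrates the coupling used there: sample σ* with i.i.d. entries of rate p = (k − √(k ln n))/n, and on the typical event that its weight lies in [k − 2√(k ln n), k], flip uniformly chosen uninfected entries to infected so that the result is a uniform vector of Hamming weight exactly k:

```python
import math, random

def coupled_sample(n, k, rng=random.Random(0)):
    """Sample sigma* ~ iid Bernoulli(p) with p = (k - sqrt(k ln n))/n, then
    (on the typical event) add uniform random infections to reach weight k."""
    p = (k - math.sqrt(k * math.log(n))) / n
    sigma_star = [int(rng.random() < p) for _ in range(n)]
    w = sum(sigma_star)
    if not (k - 2 * math.sqrt(k * math.log(n)) <= w <= k):
        return None, None                  # atypical event; probability o(1)
    sigma = list(sigma_star)
    zeros = [i for i in range(n) if sigma[i] == 0]
    for i in rng.sample(zeros, k - w):     # flip at most 2*sqrt(k ln n) entries
        sigma[i] = 1
    return sigma_star, sigma

n, k = 100000, 316                          # roughly k = n^0.5
star, full = coupled_sample(n, k)
if full is not None:
    print("weight of sigma*:", sum(star), " weight of sigma:", sum(full))
```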
Our next result relates σ and σ*. As in [17], [4], the way to establish a lower bound is to show that the underlying graph structure always contains a certain number of disguised infected as well as disguised uninfected individuals. We note that due to the ∆-divisibility condition, a straightforward application of the FKG inequality does not appear to provide a sufficiently strong bound, since the variances of the random variables of interest may become too large.
Corollary 3.6: Under the preceding setup, for fixed ε ∈ (0, 1) and n large enough, if there is a non-negative integer C (possibly C = 0) such that w.h.p.
|V_{1+}(G, σ*)| ≥ C and |V_{0+}(G, σ*)| = ω(√(k ln n) ln n),
then it also holds w.h.p. that
|V_{1+}(G, σ)| ≥ C   (12)   and   |V_{0+}(G, σ)| = ω(ln n).   (13)
Proof: The proof follows along the lines of the proof of [4, Lemma 3.6]. Let B be the event that |σ*| ∈ [k − 2√(k ln n), k]. Then a standard application of the Chernoff bound guarantees that P(B) = 1 − o(1). Given B, we couple σ* and σ by flipping at most 2√(k ln n) uninfected individuals in σ* to infected, uniformly at random. This yields the correct distribution, since by definition the set I_1 = {i : σ*_i = 1} is a uniform subset of size |σ*| (conditioned on |σ*|). Hence, when we infect another random subset of size k − |I_1| uniformly at random, the overall infected set is uniform over the subsets of size k. Clearly, the number of disguised infected individuals can only increase, yielding (12). However, it might happen that previously disguised uninfected individuals now count towards |V_{1+}(G, σ)| rather than |V_{0+}(G, σ)|. By the above coupling argument, the number of individuals lost from V_{0+} in this way is at most the number of flipped individuals, which is at most 2√(k ln n). Therefore, Markov's inequality implies that w.h.p. at most 2√(k ln n) ln n disguised uninfected individuals are lost, which together with the assumption on |V_{0+}(G, σ*)| yields (13). The desired result now follows directly from (12), (13), and P(B) = 1 − o(1).
Corollary 3.7: Under the preceding setup, if |V_+(G, σ*)| = ω(max{n/k, √(k ln n)} ln n) w.h.p., then the conditions of Corollary 3.6 hold with C = 1, and consequently w.h.p. |V_{1+}(G, σ)| ≥ 1 and |V_{0+}(G, σ)| = ω(ln n).
Proof: We use the fact that the property of being disguised is independent of the infection status. Indeed, given the number of disguised individuals |V_+(G, σ*)|, we have |V_{1+}(G, σ*)| ∼ Bin(|V_+(G, σ*)|, k/n) and |V_{0+}(G, σ*)| ∼ Bin(|V_+(G, σ*)|, 1 − k/n). Under the stated lower bound on |V_+(G, σ*)|, the first of these is w.h.p. at least 1, while the second is w.h.p. of the same order as |V_+(G, σ*)| itself. The desired result then follows directly from Corollary 3.6.
By adopting the two-round exposure technique from Section II-F, Theorem 3.2 will follow from the next lemma, which establishes the conditions in Corollary 3.7 regarding σ*.
Lemma 3.8: For any ε, θ, δ ∈ (0, 1), the following holds. Consider the i.i.d. infection model σ*, and let G be a test design such that each of the n = |V(G)| individuals is tested at most ∆ times (with θ/(1−θ) < ∆ ≤ (ln n)^{1−δ}) and m = (1−ε) ∆ k^{1+1/∆}. Then |V_+(G, σ*)| satisfies the condition of Corollary 3.7 (w.h.p. if ∆ = ω(1), and with probability Ω(1) if ∆ = O(1)).
Proof: We first give a brief overview of the proof:
• We first establish that we may assume there are no tests in G with too few individuals (Claim 3.9).
• Second, we apply the two-round exposure technique described in Section II to create a set K_1 of infected individuals of size roughly αk.
• Third, we remove any tests that already contain two infected individuals, since individuals of K_1 are disguised if and only if they are disguised upon the removal of such tests (Fact 3.10).
• Next, we show that, upon applying the second stage of the two-round exposure technique to the second neighbourhood of the individuals of K_1 in the remaining graph, the probability of an individual x ∈ K_1 being disguised is minimised in the case that its tests are disjoint (Claim 3.11).
• The preceding result is used to lower bound the average probability of being disguised, by employing a hypothetical model in which all tests are mutually disjoint and therefore independent (Claim 3.12).
• Finally, carefully applied concentration results are used to complete the proof.
Proceeding more formally, we first show that G satisfies certain degree properties; namely, there cannot be any tests that are too small.
Claim 3.9: For any fixed integer D, we can assume without loss of generality (for proving Lemma 3.8) that, for n large enough, every test has size at least D.
Proof of Claim 3.9: We obtain an alternative design G' from G by iteratively deleting a test of size less than D together with all individuals contained in that test, until all tests have size at least D. In each step, we remove one test, between one and D individuals, and at most ∆D edges. Without loss of generality, assume that in G there are only o(n) individuals that are not contained in any test (otherwise, the error probability would trivially tend to one). Therefore, the test design G contains at least (1 − o(1))n edges, and since the individual degree is still at most ∆, the number of individuals n' = |V(G')| satisfies n' ≥ (1 − o(1)) n/∆. This lower bound on n', along with the assumption ∆ ≤ (ln n)^{1−δ}, additionally implies that ∆ ≤ (ln n')^{1−δ/2} when n is sufficiently large. As for the remaining number of tests m' = |F(G')|, we claim that for all large enough n,
m' ≤ m − (n − n')/D ≤ (1 − ε/2) ∆ (n')^ζ, where ζ := θ + θ/∆.   (16)
Indeed, the first inequality follows since m ≤ (1−ε) ∆ k^{1+1/∆} = (1−ε) ∆ n^{θ+θ/∆} and the fact that we delete at least one test per D deleted individuals. For the second inequality, note that ζ < 1 by our assumption ∆ > θ/(1−θ). Then, we distinguish two cases:
• If n − n' ≥ √n, then we have
∆(n^ζ − (n')^ζ) ≤ ∆(n − n')^ζ ≤ (n − n')/D,
where the first inequality holds because the function f(x) = x^ζ (for ζ ∈ (0, 1)) is concave and monotone, so for any δ' > 0 it holds that f(x + δ') − f(x) is largest when x = 0, and the second inequality holds for sufficiently large n since D is constant, ζ ∈ (0, 1), and ∆ is at most logarithmic. Substituting this finding yields the desired second inequality in (16).
• On the other hand, if n − n' < √n, then n'/n = 1 − o(1), and we have the following for large enough n: (1−ε) ∆ n^ζ ≤ (1−ε)(1+o(1)) ∆ (n')^ζ ≤ (1−ε/2) ∆ (n')^ζ. Hence, in this case we get the desired result even after trivially bounding (n − n')/D by zero.
Since V_{1+}(G') ⊆ V_{1+}(G), we can continue working with G', and the desired claim holds.
Recall that in the multi-step argument in Section II-F, for some α > 0, the first step is to infect each individual independently with probability αk/n; denote the resulting set of infected individuals by K_1. We seek to characterize the number of disguised individuals in K_1 following a second step of infections, in which each previously-uninfected individual is infected with probability (1−2α)k/n. Given K_1, let X*_v be the probability that v ∈ K_1 is disguised after this second step, and let X* = Σ_{v ∈ K_1} X*_v. To prove that X* is large, we need the following two statements.
Fact 3.10: Let a be a test such that |∂a ∩ K_1| ≥ 2. Then any individual in K_1 is disguised if and only if it is disguised when removing the test a.
This fact is immediate, as any infected individual of K_1 is disguised in a by definition. Furthermore, to get a handle on the subtle dependencies between overlapping tests, we prove that the probability for an individual to be disguised in two tests is minimised when the tests are disjoint (Claim 3.11 below); a small numerical check of this inequality is given next.
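The following snippet (ours; q, the test sizes, and the overlap z are illustrative parameters) verifies this minimisation property exactly, comparing the probability that both tests of an individual x contain a further infected member under overlap z against the disjoint case:

```python
def both_tests_disguised(g1, g2, z, q):
    """P(both tests contain an infected individual other than x) when the
    tests have g1 and g2 members besides x, overlapping in z individuals."""
    shared_hit = 1 - (1 - q) ** z              # some shared individual infected
    miss = (1 - q) ** z                        # all shared individuals healthy
    solo1 = 1 - (1 - q) ** (g1 - z)            # private part of test 1 hit
    solo2 = 1 - (1 - q) ** (g2 - z)            # private part of test 2 hit
    return shared_hit + miss * solo1 * solo2

q, g1, g2 = 0.05, 10, 12
disjoint = both_tests_disguised(g1, g2, 0, q)
for z in range(1, min(g1, g2) + 1):
    assert both_tests_disguised(g1, g2, z, q) >= disjoint
print("overlapping >= disjoint for every overlap z, as asserted in Claim 3.11")
```

The assertion holds because, writing u = 1 − q, the overlap expression simplifies to 1 − u^{g1} − u^{g2} + u^{g1+g2−z}, which is increasing in z.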
Denote by ∂^{(x)}a = ∂a \ {x} the set of individuals in test a other than x.
Claim 3.11: Consider marking each individual in ∂^{(x)}a ∪ ∂^{(x)}a' as infected with some probability q, independently of the others. Then, for any integer z > 0, any individual x ∈ V(G), and any two tests a, a' ∈ ∂x, the probability that both ∂^{(x)}a and ∂^{(x)}a' contain an infected individual is at least as large when |∂^{(x)}a ∩ ∂^{(x)}a'| = z as when the two sets are disjoint.
Proof: We first note that when the two sets are disjoint,
P(both ∂^{(x)}a and ∂^{(x)}a' contain an infected individual) = (1 − (1−q)^{|∂^{(x)}a|})(1 − (1−q)^{|∂^{(x)}a'|}),   (17)
as the infected individuals in the two tests are independent in this case. On the other hand, suppose that |∂^{(x)}a ∩ ∂^{(x)}a'| = z > 0. In order for both tests to contain at least one infected individual other than x, we can either have at least one of the z common individuals infected (which happens with probability 1 − (1−q)^z), or we need both tests to contain an infected individual outside of the intersection. Hence,
P(both ∂^{(x)}a and ∂^{(x)}a' contain an infected individual) = 1 − (1−q)^z + (1−q)^z (1 − (1−q)^{|∂^{(x)}a|−z})(1 − (1−q)^{|∂^{(x)}a'|−z}).   (18)
Using (17) and (18), we conclude the proof with a short calculation: writing u = 1 − q, the quantity in (18) equals 1 − u^{|∂^{(x)}a|} − u^{|∂^{(x)}a'|} + u^{|∂^{(x)}a| + |∂^{(x)}a'| − z}, which is increasing in z and reduces to (17) at z = 0; the claim follows by expanding and simplifying.
With this in mind, we can consider a simplified model in which the test degrees are unchanged, but the tests are all disjoint. More precisely, we define the following: given an infection rate q ∈ (0, 1), we let Y_a = Y_a(q) := 1 − (1−q)^{Γ_a − 1} be the probability that a test a of size Γ_a, with one fixed individual x, contains at least one infected individual other than x. For any individual v, we then denote by X_v = X_v(q) := Π_{a ∈ ∂v} Y_a(q) the probability that v is disguised in this model, where all tests are mutually disjoint. Observe that, by Claim 3.11, X*_v ≥ X_v, and therefore X* ≥ X, where X := Σ_{v ∈ K_1} X_v. The advantage is that in this model, X_v and X_u are independent for v ≠ u. Let ℓ_a := Γ_a k/n, and recall that ℓ̄ := Γ̄k/n = ∆k m^{−1} = o(1) because ∆ ≤ (ln n)^{1−δ} and k = n^θ. Note that X_v describes the probability of being disguised for one individual; we proceed by considering the entire set of individuals. The following claim provides a useful lower bound on n^{−1} Σ_{v ∈ V(G)} X_v.
Claim 3.12: Under the preceding setup with q = (1−2α)k/n, we have
n^{−1} Σ_{v ∈ V(G)} X_v ≥ (1 + o(1)) ((1 − 5α) ℓ̄)^∆.
Proof: By the inequality of arithmetic and geometric means, we have
n^{−1} Σ_{v ∈ V(G)} X_v ≥ (Π_{v ∈ V(G)} X_v)^{1/n} = (Π_{a ∈ F(G)} Y_a^{Γ_a})^{1/n}.   (19)
Furthermore, by Claim 3.9, we may assume that Γ_a ≥ (3α)^{−1}, and hence Γ_a − 1 ≥ (1 − 3α)Γ_a, from which we deduce that Y_a ≥ 1 − e^{−(1−5α)ℓ_a}. Hence, (19) yields
n^{−1} Σ_{v ∈ V(G)} X_v ≥ (Π_{a ∈ F(G)} (1 − e^{−(1−5α)ℓ_a})^{Γ_a})^{1/n}.   (20)
Next, we note that Σ_{a ∈ F(G)} Γ_a ≤ ∆n by the ∆-divisibility constraint, which further implies Σ_{a ∈ F(G)} ℓ_a ≤ k∆. The choice m = (1−ε) ∆ k^{1+1/∆} also implies ℓ̄ = ∆k m^{−1}, and we can lower bound the logarithm of the right-hand side of (20) by
n^{−1} Σ_{a ∈ F(G)} Γ_a ln(1 − e^{−(1−5α)ℓ_a}) ≥ (1 + o(1)) ∆ ln((1 − 5α) ℓ̄),   (21)
by the above calculations regarding ℓ̄ and the ℓ_a, along with the fact that ℓ̄ = ∆k m^{−1} = o(1) and the relevant function is decreasing for small enough arguments. Finally, the assertion of the claim follows from (20) and (21).
We note from this claim that if we let v be a uniformly random individual, we have (also using ℓ̄ = o(1)) that
E[X_v] ≥ (1 + o(1)) (1 + ε/4)^∆ / k,
provided that α ≤ ε/8. Now, recall that X = Σ_{v ∈ K_1} X_v, and that each individual is in K_1 with probability αk/n. Then we deduce from the above that
E[X] ≥ (1 + o(1)) α (1 + ε/4)^∆.
As X_v and X_u are independent for v ≠ u, we can apply the Chernoff bound (Lemma 7.1, or more precisely a one-sided version that saves a factor of 2) to obtain that X ≥ E[X]/2 with probability 1 − e^{−Ω(E[X])}. Now, as described earlier, consider infecting each previously-uninfected individual with probability q = (1−2α)k/n, independently of all the others. Then, as Σ_{v ∈ K_1} P(v ∈ V_{1+}(G)) = X* ≥ X, we find that, conditioned on K_1 and X, disguised individuals of K_1 exist with probability bounded in terms of X. Recall that p = (k − √(k ln n))/n, and note that so far, any individual has been infected with probability at most (1−α)k/n ≤ p, independently of all the others.
As discussed in Section II-F, we can in hindsight raise the infection probability of each individual to p, which can only increase the size of the set V_{1+}(G) (i.e., the number of disguised infected individuals). This yields the assertion of Lemma 3.8 for the i.i.d. infection model. The theorem now follows easily by combining Lemma 3.8 with Corollary 3.7: with at least one disguised infected individual and at least ln n disguised uninfected individuals, the conditional error probability tends to one by Claim 2.3.
E. Algorithmic achievability on the random regular model: Proof of Theorem 3.3
Recall the random regular model G_∆ from Section II-B1. We let (Γ_1, ..., Γ_m) be the (random) sequence of test degrees, which satisfies Σ_{i ∈ [m]} Γ_i = n∆ by construction. We stress at this point that the construction of G_∆ allows for multi-edges, and hence one individual might take part in a test multiple times and contribute more than one to its degree. Moreover, we parametrise the average degree as Γ̄ = ℓ n/k, such that ℓ denotes the expected number of infected individuals a test would contain in a binomial random bipartite graph. The definition of Γ̄ implies ℓ = k∆/m, and substituting m = (1+ε) m_DD yields
ℓ = (1+ε)^{−1} min{ k^{−1/∆}, n^{−(1−θ)/∆} }.   (24)
Note that with θ/(1−θ) < ∆ ≤ (ln n)^{1−Ω(1)}, we have ω(n^{−(1−θ)}) ≤ ℓ ≤ o(1). We will make use of a stronger version of the left inequality stating that ℓ n^{1−θ} ≥ n^{Ω(1)}, which follows from ∆ > θ/(1−θ) and checking both cases of which term in (24) attains the minimum. We first argue that each test degree is tightly concentrated with high probability, defining the concentration event C_Γ as
C_Γ := { |Γ_i − Γ̄| ≤ n^{−Ω(1)} Γ̄ for all i ∈ [m] }.   (25)
Lemma 3.13: For ℓ given in (24), we have P(C_Γ) = 1 − o(n^{−9}).
Proof: Each individual chooses ∆ tests with replacement. Hence, each individual has a chance of picking a given test in each of its ∆ choices independently, yielding Γ_i ∼ Bin(n∆, 1/m). Thus, we have E[Γ_i] = n∆/m = ℓ n/k, which scales as ω(1) since we have established ℓ ≥ ω(n^{−(1−θ)}). Applying the Chernoff bound (Lemma 7.1) and the above-established fact ℓ n^{1−θ} ≥ n^{Ω(1)}, we obtain
P(|Γ_i − Γ̄| ≥ t Γ̄) ≤ 2 e^{−t² Γ̄ / 3}.   (26)
Hence, we can choose t of the form O(n^{−Ω(1)} √(ln n)) = O(n^{−Ω(1)}) to attain
P(|Γ_i − Γ̄| ≥ t Γ̄) = o(n^{−10}).   (27)
Therefore, the lemma follows from (26), (27), and a union bound over all m ≤ n tests.
2) Analysis of the different types of individuals: Let Y_i denote the number of infected individuals (counting all multi-edges) in test a_i (for i = 1, ..., m). These variables are not mutually independent, as a single individual takes part in multiple tests. Luckily, it turns out that the family of the Y_i can be approximated sufficiently well by a family of mutually independent random variables. Given (Γ_i)_{i ∈ [m]}, let (X_i)_{i ∈ [m]} be a sequence of mutually independent Bin(Γ_i, k/n) variables. Furthermore, let
E_∆ := { Σ_{i ∈ [m]} X_i = k∆ }
be the event that the sequence (X_i) renders the correct total number of infected individual placements. Stirling's approximation (Lemma 7.2) guarantees that E_∆ is not too unlikely; specifically, P(E_∆ | (Γ_i)_i) = Ω((n∆)^{−1/2}). Furthermore, the X_i are indeed a good local approximation of the correct distribution, as stated in the following known result.
Lemma 3.14: Given (Γ_i)_{i ∈ [m]}, the sequence (Y_i)_{i ∈ [m]} is distributed as (X_i)_{i ∈ [m]} conditioned on the event E_∆.
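The following sketch (ours, reusing the earlier helpers) illustrates this approximation empirically, comparing the distribution of the per-test infected counts Y_i in G_∆ against a Bin(Γ̄, k/n) benchmark with Γ̄ = n∆/m:

```python
from collections import Counter
import math, random

def infected_counts(tests, infected):
    """Y_i: number of infected placements (with multiplicity) in each test."""
    return [sum(1 for x in test if x in infected) for test in tests]

n, k, delta, m = 20000, 140, 3, 2000
rng = random.Random(2)
infected = set(rng.sample(range(n), k))
tests = random_regular_design(n, m, delta, rng)
emp = Counter(infected_counts(tests, infected))
gbar, p = round(n * delta / m), k / n
for y in range(4):
    binom = math.comb(gbar, y) * p ** y * (1 - p) ** (gbar - y)
    print(f"Y={y}: empirical {emp.get(y, 0) / m:.3f} vs Bin approximation {binom:.3f}")
```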
Using this approximation, we can characterise the numbers of negative and positive tests.
Lemma 3.15: W.h.p., the number of negative tests satisfies m_0 = (1 + O(m^{−1/4})) e^{−ℓ} m, and the number of positive tests satisfies m_1 = (1 + O(k^{−1/4})) (1 − e^{−ℓ}) m.
Proof: Let m_0' = |{i ∈ [m] : X_i = 0}|. Combining the definition of X_i with (7.5), we get
E[m_0' | (Γ_i)_i] = Σ_{i ∈ [m]} (1 − k/n)^{Γ_i},   (29)
which represents the expected number of negative tests approximated through (X_i)_i. Hence, when (Γ_i)_i satisfies the concentration event defining C_Γ (see (25)), a second-order Taylor expansion (Lemma 7.4) yields
E[m_0' | (Γ_i)_i] = (1 + O(n^{−Ω(1)})) e^{−ℓ} m.
Then, conditioned on (Γ_i)_i, the Chernoff bound implies with probability at least 1 − o(n^{−10}) that
m_0' = (1 + O(m^{−1/4})) e^{−ℓ} m.   (30)
The first assertion of the lemma now follows from (29), (30), Lemma 3.13, Lemma 3.14, and the fact that E_∆ has probability Ω((n∆)^{−1/2}): letting A be the above probability-o(n^{−10}) event, we simply write P(A | E_∆) ≤ P(A)/P(E_∆), and substitute the upper bound on the numerator and the lower bound on the denominator. For the second assertion of the lemma, we additionally take note of the fact that ℓ = o(1) and hence m_1 is of a smaller order than m. But since ℓm = k∆, this only amounts to replacing m^{−1/4} by k^{−1/4} in the counterpart of (30), and otherwise has no impact.
Next, we provide a characterization of the size of V_{0+}(G_∆), i.e., the number of disguised uninfected individuals.
Lemma 3.16: With probability at least 1 − O(n^{−Ω(1)}), we have
|V_{0+}(G_∆)| = (1 + O(n^{−Ω(1)})) n (1 − e^{−ℓ})^∆.
Proof: Without loss of generality, given m_1 and C_Γ, we suppose that tests a_1, ..., a_{m_1} are the positive tests. By the degree bounds in (25) and Lemma 3.15, the total number of edges connected to a positive test is w.h.p. given by
Σ_{i=1}^{m_1} Γ_i = (1 + O(n^{−Ω(1)})) (1 − e^{−ℓ}) m Γ̄.   (31)
We need to calculate the probability that a given uninfected individual belongs to V_{0+}(G_∆), i.e., that each of its ∆ edges is connected to a positive test. By a counting argument, we have
P(x ∈ V_{0+}(G_∆)) = (1 + O(n^{−Ω(1)})) (1 − e^{−ℓ})^∆,   (32)
where the simplification follows via Claim 7.3 along with (31) and Σ_{i=1}^m Γ_i = m Γ̄. Therefore,
E[|V_{0+}(G_∆)|] = (1 + O(n^{−Ω(1)})) n (1 − e^{−ℓ})^∆.
Analogously, the second moment turns out to be
E[|V_{0+}(G_∆)|²] = (1 + O(n^{−Ω(1)})) (n (1 − e^{−ℓ})^∆)².   (33)
The idea behind (33) is to consider pairs of uninfected individuals whose 2∆ combined edges only participate in positive tests; the simplification again follows from Stirling's approximation in the form of Claim 7.3. The lemma is now obtained using (32), (33), and Chebyshev's inequality, noting that n(1 − e^{−ℓ})^∆ = n^{Ω(1)} (which is seen by using ℓ = o(1) to approximate (1 − e^{−ℓ})^∆ by ℓ^∆, and applying (24)).
Let A denote the number of infected individuals that do not belong to the easy infected set V_{1−−}(G_∆). The following lemma allows us to bound its size.
Lemma 3.17: Suppose that m = (1+ε) m_DD(∆). Then, for some γ ∈ (0, 1) smaller than ε, we have E[A] ≤ (1 + o(1)) ((1+γ)/(1+ε))^∆; in particular, E[A] = o(1) whenever ∆ = ω(1).
Proof: We can split (24) into two cases, depending on the sparsity level θ:
ℓ = (1+ε)^{−1} k^{−1/∆} for θ > 1/2, and ℓ = (1+ε)^{−1} n^{−(1−θ)/∆} for θ ≤ 1/2.   (34)
Recall that m_1 is the number of positive tests, and define
F_∆ := { m_1 = (1 + O(k^{−1/4})) (1 − e^{−ℓ}) m } ∩ { |V_{0+}(G_∆)| = (1 + O(n^{−Ω(1)})) n (1 − e^{−ℓ})^∆ }   (35)
as the event that both the number of positive tests and the size of V_{0+}(G_∆) behave as expected. Lemmas 3.15 and 3.16 guarantee that F_∆ is a high-probability event, namely, P(F_∆) ≥ 1 − Õ(n^{−1}). Given m_1, we suppose without loss of generality that a_1, ..., a_{m_1} are the tests rendering a positive result. We describe the number of occurrences of the different types of individuals by introducing two sequences of random variables. For i ∈ [m_1], define (R_i^1, R_i^+, R_i^−) as the number of infected individuals, disguised uninfected individuals of V_{0+}(G_∆), and non-disguised uninfected individuals (those of V_{0−}(G_∆)) appearing in test a_i, respectively. By construction, R_i^1 ≥ 1 for every positive test. Given |V_{0+}(G_∆)| and m_1, we approximate these variables by a sequence of mutually independent multinomials. Specifically, let
(r_i^1, r_i^+, r_i^−) ∼ i.i.d. Mult_{≥(1,0,0)}(Γ_i; q^1, q^+, q^−),   (36)
where Mult_{≥(1,0,0)} means a multinomial conditioned on the first coordinate being at least one, and (q^1, q^+, q^−) denote the respective population fractions of the three types. We introduce the event
H := { the sequence (r_i^1, r_i^+, r_i^−)_{i ∈ [m_1]} reproduces the correct totals of each type },   (37)
and make use of the following.
Claim 3.18: For any event A measurable with respect to (R_i^1, R_i^+, R_i^−)_{i ∈ [m_1]}, we have P(A) ≤ P(A')/P(H), where A' denotes the corresponding event for (r_i^1, r_i^+, r_i^−)_{i ∈ [m_1]}; moreover, P(H) decays at most polynomially in n.
In addition, let ℓ_i denote the number of connections from individuals in V_{0−}(G_∆) to positive tests. Then, a counting argument shows that, conditioned on the totals, the law of (R_i^1, R_i^+, R_i^−)_{i ∈ [m_1]} coincides with that of the independent sequence given H.   (39)
Letting (r_i)_{i ∈ [m_1]} be a second sequence as above, it follows that the probability of any event under the true model is at most its probability under the independent model divided by P(H). Next, define r^1 := Σ_{i ∈ [m_1]} r_i^1, r^+ := Σ_{i ∈ [m_1]} r_i^+, r^− := Σ_{i ∈ [m_1]} r_i^−, and analogously R^1, R^+, R^− for the original sequence.
By definition, we have R^1 + R^+ + R^− = Σ_{i ∈ [m_1]} Γ_i, and similarly for r^1, r^+, r^−. Then, by the definition of H, the two triples of totals agree on this event. Thus, the first statement of Claim 3.18 follows from (37) and (39), and the second statement follows from Claim 7.5.
We now introduce a random variable that counts (positive) tests featuring only one infected individual and no disguised uninfected individuals. Formally, let
B := |{ i ∈ [m_1] : R_i^1 = 1 and R_i^+ = 0 }|.   (40)
In the following, we suppose that (Γ_i)_i satisfies the concentration around Γ̄ defining the event C_Γ (see (25)), and that m_1 and |V_{0+}(G_∆)| satisfy the concentration defining the event F_∆ (see (35)). Using the concentration of Γ_i, the asymptotic expansion (1 − k/n)^{Γ̄} = exp(−ℓ(1 + O(n^{−Ω(1)}))), and the multinomial model (36), and further applying Γ̄ = n∆/m, k = n^θ, and the concentration of m_1, we obtain an estimate of E[B]. Now, let us distinguish between the cases θ ≤ 1/2 and θ > 1/2.
Case 1: θ > 1/2. In this case, we have n/k = o(k) and ℓ = (1+ε)^{−1} k^{−1/∆}. We recall the event F_∆ from (35) that gives a concentration condition on |V_{0+}(G_∆)| and m_1. Substituting into (35), we find that given F_∆, there is some γ ∈ (0, 1) controlling the (vanishing) relative contribution of the disguised uninfected individuals.   (43)
Hence, using (43) and applying Γ̄ = ℓ n/k, we find by a second-order Taylor expansion of e^{−ℓ} that E[B] is close to k∆ e^{−ℓ}. Now, B is a binomial random variable with a random number of trials and a random probability parameter. Clearly, when conditioning on a specific number of trials and a specific probability, B is a binomial random variable. Therefore, recalling the expression for ℓ in (34), the Chernoff bound guarantees that under the concentration events C_Γ and F_∆, we have
B ≥ (1 − γ − o(1)) k∆ e^{−ℓ}   (45)
with probability at least 1 − o(n^{−10}) in the independent model; then, similarly to the proof of Lemma 3.15, Claim 3.18 yields the same with probability 1 − O(n^{−Ω(1)}) in the true model. Thus, we can calculate the probability of an infected individual not belonging to V_{1−−}(G) (i.e., not being in the easily-identified infected set) as follows. Such an individual has to choose all of its ∆ edges out of the k∆ − B edges that would not lead to a test in which the individual could be identified by DD. Hence, we have
P(x ∈ V_1 \ V_{1−−}(G)) ≤ (1 + o(1)) ((k∆ − B)/(k∆))^∆,   (46)
where the simplification holds using (45) and Claim 7.3. Interpreting the average of A as a sum of k probabilities, it follows that
E[A] ≤ (1 + o(1)) k ((k∆ − B)/(k∆))^∆ ≤ (1 + o(1)) ((1+γ)/(1+ε))^∆.   (47)
Case 2: θ ≤ 1/2. In this case, we have ℓ = (1+ε)^{−1} n^{−(1−θ)/∆}. Hence, given F_∆,
|V_{0+}(G_∆)| = (1 + o(1)) n ℓ^∆ = (1 + o(1)) (1+ε)^{−∆} k.   (48)
In contrast to the first case, here we find that the influence of the number of disguised uninfected individuals does not vanish asymptotically relative to the number of infected individuals in (43). By a similar argument to the first case, (48) and (43) imply a corresponding lower bound on B, and similarly to (45), combining this with the Chernoff bound and Claim 3.18 yields that
B ≥ (1 − γ − o(1)) k∆ e^{−ℓ}   (50)
with probability 1 − O(n^{−Ω(1)}). Therefore, the probability of an infected individual not belonging to V_{1−−}(G) satisfies the following analog of (46):
P(x ∈ V_1 \ V_{1−−}(G)) ≤ (1 + o(1)) ((1+γ)/(1+ε))^∆ n^{−(1−θ)}.   (51)
Since 2θ − 1 ≤ 0 by assumption, it follows that
E[A] ≤ (1 + o(1)) n^{2θ−1} ((1+γ)/(1+ε))^∆ ≤ (1 + o(1)) ((1+γ)/(1+ε))^∆.   (52)
In accordance with Claim 2.4, we first provide a lemma bounding the size of V_{1−−}(G_∆), the set of infected individuals appearing in at least one test with only easy uninfected individuals.
Lemma 3.19: For θ < 1/2 and m = (1−ε) m_DD(∆), we have under the random regular design that
E[|V_{1−−}(G_∆)|] ≤ (1 − q_{∆,ε,γ}) k,
where q_{∆,ε,γ} ∈ (0, 1) is bounded away from zero for sufficiently small γ, and tends to one as ∆ → ∞.
Proof: We re-use the notations ℓ̄ and Γ̄ from (11), but their expressions are modified as follows in accordance with the choice m = (1−ε) ∆ k^{1+(1−θ)/(∆θ)} associated with θ < 1/2:
ℓ = (1−ε)^{−1} n^{−(1−θ)/∆} and Γ̄ = ℓ n/k.   (53)
We additionally recall B from (40) as the number of tests featuring exactly one infected individual and no elements of V_{0+}. By the same calculation as in (50) and (51), with ℓ and Γ̄ replaced by the values in (53), we obtain a corresponding upper bound on B with probability at least 1 − o(n^{−8}).
Therefore, we can calculate the probability that an infected individual does not belong to V_{1−−}(G_∆) via Claim 7.3, as in (55). Since there are k individuals in V_1(G) by assumption, we obtain (56), and the lemma follows.

Knowing the expected size of V_{1−−}(G_∆), Markov's inequality leads to the following.

Corollary 3.20: Let θ < 1/2, m = (1 − ε)m_DD(∆), and ∆ = Θ(1). Then, with probability at least the quantity in (57), DD fails to recover σ.

Claim 2.4 and Corollary 3.20 immediately imply Theorem 3.4, since (57) is always positive for sufficiently small γ, and approaches one as ∆ → ∞.

In this section, we formally state and prove our main results concerning non-adaptive group testing with Γ-sized tests, namely, a universal lower bound and an algorithmic upper bound that matches it. Recall that we focus on the regime Γ = Θ(1). Within this section, G denotes an arbitrary non-adaptive pooling scheme respecting the Γ-size constraint. The section contains two main parts, outlined as follows:

• Theorem 4.1 states our universal lower bound for non-adaptive designs. The proof is based on a careful analysis of the appearance of disguised individuals (see Section II-D), with the idea being that too many such individuals lead to failure. For θ < 1/2, we additionally use the idea of identifying sufficiently many tests with multiple individuals of degree one, prohibiting reliable inference.

• Theorems 4.10 and 4.18 analyze the performance of the DD and SCOMP algorithms. The proofs are again based on the idea that, in the underlying pooling scheme, any infected individual appears in at least one test in which all other members are definite uninfected individuals (elements of V_{0−}(G)). We refer the reader to Sections II-D and II-E for further insights on these properties. The test size constraints pose additional technical challenges compared to the unconstrained setting [18], in particular leading us to adopt a less standard matching-based test design when θ < 1/2.

The first statement that we prove is an information-theoretic converse that applies to any non-adaptive group testing scheme with maximum test size Γ. Denote
m_inf,Γ = max{(1 + ⌊θ/(1−θ)⌋) n/Γ, 2n/(Γ+1)}, (58)
which we will show to be the sharp information-theoretic phase transition point when Γ ≥ 1 + θ/(1−θ); note that if this inequality is reversed, then m_inf,Γ > n, whereas n tests trivially suffice via one-by-one testing. In [25], a lower bound of (n/Γ)(1 + o(1)) was proved, and we see that in the regime Γ = Θ(1), our lower bound improves on this for all θ ∈ (0, 1).

Theorem 4.1: Let θ ∈ (0, 1), Γ ≥ 1 + θ/(1−θ), and δ > 0. Furthermore, let G be any non-adaptive pooling scheme (deterministic or randomised) with m = (1 − δ)m_inf,Γ tests such that each test contains at most Γ individuals. Then any inference algorithm A fails in recovering σ from (σ̂, G) w.h.p. when θ/(1−θ) is non-integer, and with probability Ω(1) otherwise.

Thus, even with unlimited computational power, there cannot be any algorithm with a maximum test size of Γ that is able to infer the infected individuals correctly w.h.p. once the number of tests drops below (58). The distinction between integer and non-integer values of θ/(1−θ) arises for technical reasons (e.g., counting the number of nodes with degree at most ⌊θ/(1−θ)⌋), and we found it difficult to prove a high-probability (rather than constant-probability) failure result in the integer case. The proof of the universal information-theoretic converse resembles the proof of [4] for the existence of a universal information-theoretic bound for unrestricted non-adaptive group testing, but several modifications are required to handle the test size constraint.
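To make the threshold (58) concrete, the following snippet evaluates m_inf,Γ for a few illustrative parameter choices (the values are ours, not the paper's):

```python
from math import floor

def m_inf(n, theta, gamma):
    """Evaluate (58): max{(1 + floor(theta/(1-theta))) * n/Gamma,
    2n/(Gamma+1)}."""
    d_plus = 1 + floor(theta / (1 - theta))
    return max(d_plus * n / gamma, 2 * n / (gamma + 1))

n, gamma = 1_000_000, 10
for theta in (0.3, 0.5, 0.7, 0.9):
    print(theta, round(m_inf(n, theta, gamma)))
# For theta < 1/2 the second term 2n/(Gamma+1) dominates; for larger
# theta the first term takes over, in line with Theorem 4.1.
```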
We provide the details in the following subsection. We start by defining d₋ = ⌊θ/(1−θ)⌋ and d₊ = d₋ + 1. For the proof, we distinguish two different regimes for θ, as stated in Proposition 4.2 and Proposition 4.6. We start with the following proposition addressing the existence of disguised individuals.

Proposition 4.2: Let 1/2 ≤ θ < 1, Γ ≥ d₊, and let G be an arbitrary pooling scheme with tests of size at most Γ and m ≤ (1 − ε)d₊n/Γ tests. Then V_{1+}(G) and V_{0+}(G) are non-empty w.h.p. when θ/(1−θ) is non-integer, and with probability Ω(1) otherwise.

Proof of Proposition 4.2: Let G be an arbitrary pooling scheme such that each test contains at most Γ individuals. We denote by V(G) the set of individuals, and by F(G) the set of tests in G (by the identification of G with a bipartite graph). Instead of analysing (G, σ̂), similarly to the ∆-divisible case, we analyse a related model that eliminates nuisance dependencies between the infection statuses of different individuals. Specifically, let p = (k − √k ln n)/n, and let σ* be a {0, 1}-valued vector in which every entry is one with probability p. Corollary 3.6 guarantees that the required statements for the modified model carry over to the original model. Thus, working with the modified model is sufficient. For the sake of brevity, we henceforth write G_Γ in place of (G, σ*), leaving the dependence on σ* implicit.

We proceed by finding a set of (many) individuals that have a high probability of being disguised. We will apply the probabilistic method iteratively to create the desired set. Creating this set turns out to be delicate due to the dependencies in an arbitrary pooling scheme. Fortunately, it suffices for our purposes to note that whenever two individuals have distance at least 6 (i.e., the shortest path between them has at least 6 edges) in the underlying graph, the events of being disguised are independent [4]. To see this, note that we can identify whether an individual is disguised by looking at the tests it is in, and the defectivity status of all other individuals in those tests. This procedure only reaches the second neighborhood, so a separation of 6 is enough to ensure that there is no overlap when doing this for two different nodes (which implies independence under an i.i.d. defectivity model). In the following, we denote the set of all disguised individuals by V₊(G_Γ) = V_{0+}(G_Γ) ∪ V_{1+}(G_Γ).

We first present a claim establishing that we may safely assume that each individual gets tested Θ(1) times.

Claim 4.3: Given any pooling scheme G with m = (1 − 2ε)d₊n/Γ (for some ε > 0) such that each test contains at most Γ = Θ(1) individuals, there is another pooling scheme G′ with at most Γ = Θ(1) individuals per test and m′ = (1 − ε)d₊n/Γ tests, satisfying the following:
• Each individual is contained in at most C = Θ(1) tests;
• Recovery of σ from (G′, σ̂′) implies recovery from (G, σ̂).

Proof: Given G and a constant C′ ∈ N, there is C ∈ N such that there are at most n/C′ individuals of degree at least C in G, which is an immediate consequence of m being linear in n (due to Γ = Θ(1)). Design G′ such that each individual of G with degree larger than C gets tested individually (causing n/C′ additional tests), and all other individuals and tests stay the same as under G. Clearly, if recovery in G was possible, then it is possible in G′ as well. Setting C′ = Γ/(εd₊), the claim follows.

In addition to being able to assume that there are no individuals with an overly high degree, we can also prove that there cannot be too many individuals with an overly low degree.
Lemma 4.4: Let G be the given pooling scheme with m ≤ (1 − ε)d₊n/Γ and Γ ≥ d₊. If there is a constant α > 0 such that the number of individuals of degree at most d₋ is at least αn, then |V_{1+}(G)| > 0 w.h.p. when θ/(1−θ) is non-integer, and with probability Ω(1) otherwise.

Proof: Suppose that the number of individuals with degree at most d₋ is at least αn, and recall that p = (k − √k ln n)/n. Without loss of generality, we can assume that there are no tests of degree zero or one. Otherwise, remove them and each connected individual from the testing scheme, and note that, by the assumed lower bound Γ ≥ d₊, there are at least εn individuals left. This manipulated graph satisfies the same inequality between the number of individuals and the number of tests and, clearly, if the inference of σ does not succeed on this manipulated graph, then it cannot succeed in G. Before proceeding, we introduce the following auxiliary result.

Claim 4.5: Under the preceding setup, suppose that there exists a set I₋ ⊂ V of individuals of degree at most d₋ with |I₋| ≥ αn (α ∈ (0, 1)). Then, there exists β ∈ (0, α) (depending only on d₋ and Γ) such that there must also exist I₊ ⊂ I₋ with |I₊| = βn, having the property that dist(x, y) ≥ 6 for all pairs x ≠ y in I₊.

Proof: First recall from Claim 4.3 that all degrees in the graph are bounded. Consider the procedure of iterating through the individuals x ∈ I₋, deleting all y ∈ I₋ of distance at most four from x, and repeating until no individuals remain. Let I₊ denote the set of x's visited by this process. Since the degrees in the graph are bounded, each removal only decreases the size of the set I₋ by at most a constant, and the assertion of the claim follows.

Let B be the largest possible subset of individuals satisfying the requirements of Claim 4.5. Thus, B is a set of βn individuals such that dist(x, x′) ≥ 6 for all x ≠ x′ ∈ B. We analyze a single individual x ∈ B using the FKG inequality (e.g., see [36, Proposition 1]); as noted in [17, Lemma 4], the events of x being disguised in each of its tests are increasing with respect to σ* (in the sense that marking additional individuals as infected in σ* can only increase the probability that an individual x is disguised). Hence, the FKG inequality allows us to lower bound the probability of x being disguised by a product over its tests, recalling that we are considering the case that deg(a) ≥ 2 for all tests a. Then, by the fact that deg(x) ≤ d₋, the probability that x is disguised is at least C p^{d₋} for some constant C depending on θ and Γ.

We now turn to the total number of disguised individuals in B. As noted above, for two individuals x, x′ ∈ B, the events of being disguised are independent due to the pairwise distances being at least 6. Thus, the number of disguised infected individuals |V_{1+}(G)| stochastically dominates a binomial random variable Bin(βn, p · C p^{d₋}). Since np ∼ k = n^θ, the mean of this binomial distribution scales as Θ(n^{θ−(1−θ)d₋}). In particular, when θ/(1−θ) is non-integer, the choice d₋ = ⌊θ/(1−θ)⌋ ensures that the exponent is positive, and the Chernoff bound gives |V_{1+}(G)| > 0 w.h.p. On the other hand, if θ/(1−θ) is integer-valued, then the mean of the binomial is Θ(1), which is enough to ensure that |V_{1+}(G)| > 0 with Ω(1) probability. Combining these two cases completes the proof of Lemma 4.4.

As an immediate consequence of Lemma 4.4, in any group testing instance that succeeds w.h.p., there are at most o(n) individuals of degree at most d₋. However, if m ≤ (1 − ε)d₊n/Γ, we find at least αn individuals of degree at most d₋ (for some α depending on ε) by the handshaking lemma [37, Corollary 1.3], yielding a contradiction.
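The iterative selection used in the proof of Claim 4.5 can be sketched as follows (our own rendering; tests_of and items_of are assumed adjacency maps of the bipartite pooling graph, and are not notation from the paper):

```python
from collections import deque

def distance_separated_subset(I_minus, tests_of, items_of, radius=4):
    """Greedily keep individuals from I_minus while discarding everything
    within `radius` edges, so kept individuals have pairwise distance >= 6
    (item-item distances in a bipartite graph are even)."""
    chosen, blocked = [], set()
    for x in I_minus:
        if x in blocked:
            continue
        chosen.append(x)
        frontier, seen = deque([("item", x, 0)]), {("item", x)}
        while frontier:  # BFS out to `radius` edges from x
            kind, v, d = frontier.popleft()
            if d == radius:
                continue
            nbrs = tests_of[v] if kind == "item" else items_of[v]
            nxt = "test" if kind == "item" else "item"
            for w in nbrs:
                if (nxt, w) not in seen:
                    seen.add((nxt, w))
                    if nxt == "item":
                        blocked.add(w)
                    frontier.append((nxt, w, d + 1))
    return chosen
```

Since all degrees are bounded (Claim 4.3), each BFS touches only O(1) vertices, so each kept individual blocks a constant number of others, matching the |I₊| = βn guarantee.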
Therefore, Proposition 4.2 is a direct consequence of Lemma 4.4, with the claims regarding |V_{0+}(G)| following easily from those regarding |V_{1+}(G)| in the same way as Corollary 3.7. ■

We now turn to the sparse regime θ < 1/2, establishing the following proposition as a stepping stone to Theorem 4.1.

Proposition 4.6: Let 0 < θ < 1/2, and let G be an arbitrary pooling scheme with tests of size at most Γ. For all ε > 0 and sufficiently large n, if m ≤ (2 − ε)n/(Γ+1), then any algorithm (efficient or not) fails at recovering σ from σ̂ and G w.h.p.

The proof hinges on a fairly straightforward observation. We can again assume without loss of generality that there are no tests containing only one individual (otherwise, we remove them and their corresponding individuals from the testing scheme); by a simple counting argument, there can be only o(n) such tests (since otherwise m > 2n/Γ, which is a contradiction). In addition, we can assume that there are no degree-zero individuals: if there were Ω(n) of them, high-probability correct inference would trivially be impossible, whereas with o(n) of them, they can be removed and the subsequent analysis still holds for those remaining, with the o(n) difference not impacting the final result. Then, another counting argument leads to the fact that the number of individuals of degree 1 is large when m < 2n/Γ, as stated in the following.

Lemma 4.7: If m ≤ (2 − ε)n/Γ, then there are at least εn individuals of degree one.

Proof: Let αn denote the number of degree-one individuals. Since all remaining individuals have degree at least two, counting edges gives αn + 2(1 − α)n ≤ mΓ ≤ (2 − ε)n. Solving for α yields α ≥ ε, and the lemma follows.

The next lemma shows that there can only be a small number of tests containing more than one individual of degree 1.

Lemma 4.8: If there is any algorithm recovering σ from the test results with Ω(1) probability, then the number of tests containing more than one individual of degree one is below n/√k = o(n).

Proof: Suppose that at least n/√k tests contain at least two individuals of degree one, and consider any resulting subset of 2n/√k individuals (two per test). The average number of infected individuals among these is (2n/√k) · (k/n) = 2√k. Hence, by the Chernoff bound for the hypergeometric distribution, w.h.p. there are at least √k/ln n such infected individuals. On the other hand, among these tests, the average number in which both of these degree-one individuals are infected is (n/√k) · O((k/n)²) = O(k√k/n), so Markov's inequality implies that w.h.p. the actual number is O(√k · n^{−Ω(1)}). Hence, all but an o(1) fraction of the above-mentioned √k/ln n infected individuals must be in a test with both a degree-one infected and a degree-one uninfected individual. For these tests, the inference algorithm cannot do better than guessing which one is the infected one, but then the probability of all guesses being correct is (1/2)^{ω(1)} = o(1), from which the lemma follows.

We are now in a position to prove Proposition 4.6. Writing m = (2 − ε)n/Γ, we find by Lemma 4.7 that there are at least εn individuals of degree 1. By Lemma 4.8 and the fact that Γ = Θ(1), only o(n) such individuals can be placed together in any tests, and hence, the total number of tests is at least εn − o(n). Formally,
εn − o(n) ≤ m = (2 − ε)n/Γ. (60)
Solving (60) for ε, we find ε ≤ 2/(Γ+1) + o(1). Hence,
m = (2 − ε)n/Γ ≥ 2n/(Γ+1) − o(n), (61)
and the proposition follows. ■

The universal lower bound in the considered regime is a direct consequence of Proposition 4.2, Proposition 4.6, and Claim 2.3. The proof of Theorem 4.1 is thus complete. We now turn to the problem of establishing an upper bound, with a suitably chosen test design and an efficient inference algorithm, that matches the universal lower bound.
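As a reference point for the algorithmic results that follow, here is a compact sketch of the DD and SCOMP decoders (our own implementation of the standard algorithms; see Algorithm 1 for the paper's formal description). Here tests is a list of sets of individuals, outcome[i] indicates whether test i is positive, and outcomes are assumed consistent with some infection vector.

```python
def dd(tests, outcome, n):
    """Definite defectives: rule out anyone in a negative test, then
    declare infected the sole remaining member of any positive test."""
    ruled_out = set().union(*(t for t, o in zip(tests, outcome) if not o))
    possible = set(range(n)) - ruled_out
    return {next(iter(t & possible))
            for t, o in zip(tests, outcome) if o and len(t & possible) == 1}

def scomp(tests, outcome, n):
    """SCOMP: start from DD, then greedily add the possible individual
    explaining the most still-unexplained positive tests."""
    ruled_out = set().union(*(t for t, o in zip(tests, outcome) if not o))
    possible = set(range(n)) - ruled_out
    estimate = dd(tests, outcome, n)
    unexplained = [t for t, o in zip(tests, outcome) if o and not (t & estimate)]
    while unexplained:
        best = max(possible - estimate,
                   key=lambda x: sum(x in t for t in unexplained))
        estimate.add(best)
        unexplained = [t for t in unexplained if best not in t]
    return estimate
```

Whenever DD succeeds, its output already explains every positive test, so the greedy loop never runs and SCOMP's output coincides with DD's [33].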
We start by recalling the definition of G̃_Γ from Section II-B2. We equip this pooling scheme with the efficient DD algorithm (see Algorithm 1). In the following, we will see that the combination of these tools leads to information-theoretically optimal performance in the Γ-sparse setting with Γ = Θ(1). Define
m_SCOMP = max{(1 + ⌊θ/(1−θ)⌋) n/Γ, 2n/(Γ+1)}.
For Γ = Θ(1) and m = (1 + ε)m_SCOMP, we have w.h.p. that SCOMP applied to G̃_Γ recovers σ. To prove this result, we handle the dense regime θ > 1/2 in Theorem 4.10 below, the sparse regime θ < 1/2 in Theorem 4.18, and combine them in Section IV-F. We observe that m_SCOMP = m_inf,Γ, i.e., the achievability and converse results match for all θ ∈ (0, 1).

We first show that the DD algorithm succeeds with a slightly higher threshold, namely max{2, 1 + ⌊θ/(1−θ)⌋} n/Γ, employing the configuration model G_Γ. We define
m_DD(G_Γ) = max{2, 1 + ⌊θ/(1−θ)⌋} n/Γ, (62)
representing this achievability bound for DD in G_Γ.

Theorem 4.10: Let ε > 0 and m ≥ (1 + ε)m_DD(G_Γ). Then w.h.p. DD infers σ from (G_Γ, σ̂) correctly.

We stress at this point that Theorem 4.10 gives a performance guarantee for the configuration model at any sparsity level, but it will turn out in due course that for θ < 1/2 a different model performs slightly better. Note also that for θ ≥ 1/2, we can simplify max{2, 1 + ⌊θ/(1−θ)⌋} to 1 + ⌊θ/(1−θ)⌋. The key ingredient is Lemma 4.11, which states that w.h.p. every infected individual belongs to V_{1−−}(G_Γ). The proof of Lemma 4.11, while conceptually not difficult and similar to [18], is technically challenging, as we have to deal with subtle dependencies in the pooling scheme caused by the multi-edges arising from the configuration model. A heuristic argument with a (false) independence assumption can provide some intuition as follows: the probability that an individual x is part of a test containing no infected individual (besides possibly x itself) is roughly (1 − k/n)^{Γ−1}. For x to be disguised, and thus an element of V_{0+}(G_Γ) or V_{1+}(G_Γ), x may not be part of such a test. Hence, the probability of x being disguised would be roughly (1 − (1 − k/n)^{Γ−1})^∆ if the associated ∆ events were independent (recall that ∆ = mΓ/n is the degree of each individual in the random regular design).

To formally deal with the dependencies in the graph, we proceed as follows. Denote by (Y₁, …, Y_m) the numbers of infected individuals in the tests. There are n∆ edges connected to individuals, out of which exactly k∆ correspond to infected individuals. Each test chooses exactly Γ individuals without replacement, and hence, the number of infected individuals in any given test follows a hypergeometric distribution. In order to get a handle on this distribution, we introduce a family (X₁, …, X_m) of independent binomial variables, such that X_i ∼ Bin(Γ, k/n). These variables accurately describe the local behaviour of how many infected individuals belong to test a_i. We define E_Γ to be the event that the overall number of edges connected to infected individuals is correct, i.e., ∑_{i∈[m]} X_i = k∆. Claim 7.5 implies that P(E_Γ) = Ω((n∆)^{−1/2}). In addition, we have the following (Lemma 4.12): conditioned on E_Γ, the sequence (X₁, …, X_m) is distributed identically to (Y₁, …, Y_m). Indeed, the conditional probability of any realisation can be rewritten in terms of factorials and simplified, and given ∑_i x_i = k∆, it matches the corresponding probability under the configuration model; this implies the lemma.

Thus, similarly to the analysis following Lemma 3.14, we are able to carry out all necessary calculations with respect to (X₁, …, X_m) and transfer them to (Y₁, …, Y_m) at the cost of a factor 1/P(E_Γ). Observing that E[m₁] = Θ(k) (since m = Θ(n) due to Γ = Θ(1)), the Chernoff bound (Lemma 7.1) guarantees that m₁ concentrates around its mean, and, similar to the proof of Lemma 3.15, by combining Lemma 4.12 with Claim 7.5, we obtain the corresponding statement for the configuration model. Thus, the first part of Lemma 4.13 follows. The second part is immediate, as m₀ + m₁ = m.
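As a quick numerical sanity check of the order of P(E_Γ) in Claim 7.5 (with parameters chosen arbitrarily by us), note that ∑_i X_i is distributed as Bin(n∆, k/n), so the point mass at k∆ can be estimated directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta, gamma, delta = 40_000, 0.6, 10, 2
k = round(n ** theta)
total_edges = n * delta  # = m * Gamma by double counting
hits = (rng.binomial(total_edges, k / n, size=500_000) == k * delta).mean()
print(hits, total_edges ** -0.5)
# The empirical mass is ~1/sqrt(2*pi*k*delta), comfortably above the
# Omega((n*delta)^{-1/2}) lower bound of Claim 7.5.
```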
The above-mentioned naive calculation (assuming independence) can now be rigorously justified, and we can establish the sizes of the disguised sets w.h.p. as follows.

Lemma 4.14: Given n and k = n^θ, as well as Γ = Θ(1) and ∆ ≥ 2, we have w.h.p. that |V_{0+}(G_Γ)| = o(k).

Proof: By the definition of G_Γ via the configuration model, Lemma 4.13 guarantees that the total number of edges connected to a positive test is, with probability at least 1 − o(n^{−2}), given by (64). Let x be an uninfected individual. We can calculate the probability of x belonging to V_{0+}(G_Γ) (i.e., being disguised and uninfected) as follows: each of the ∆ = Θ(1) edges⁹ that are mapped to x in the configuration model has to be connected to a positive test. Thus, by (64) along with Claim 7.3, we obtain the bound (65). Combining (65), ∆ ≥ 2, and Markov's inequality, we obtain the assertion of Lemma 4.14.

(⁹ By counting degrees, we have n∆ = mΓ, so the assumption Γ = Θ(1) leads to m = Θ(n∆). Since ∆ is integer-valued and we are considering m > 0 and m ≤ n (otherwise, individual testing would be preferred), it follows that ∆ = Θ(1).)

Next, we define the event F_Γ (see (66)) in which the number of positive tests and the number of disguised uninfected individuals behave as expected. By Lemmas 4.13 and 4.14, we have P(F_Γ) ≥ 1 − o(1). We assume without loss of generality that the first m₁ tests render a positive result. Letting H = ∑_{i=1}^{m₁} H_i as before, we have the following analog of Corollary 3.18 (Claim 4.15). By the definition of R_i, we have (67). Letting (r_i)_{i∈[m₁]} be a second sequence as above, the analogous identity follows. Furthermore, by the definition of X, we have (68). The first part of the claim follows from Equations (67) and (68), and the probability bound follows by applying Claim 7.5 with ∆ = Θ(1).

We are interested in the number of positive tests that contain exactly one infected individual and no elements of V_{0+}(G_Γ). Therefore, we define B analogously to (40), and Claim 4.16 characterizes its behaviour w.h.p.

Proof of Claim 4.16: We use Claim 4.15 to simulate B through a sum B̃ of independent random variables. Since B̃ is a sum of independent multinomial variables, we obtain its expectation by applying (66), Lemma 7.2, and Bayes' theorem, where the last step follows from Lemma 7.4 and Γ = Θ(1). Conditioning on F_Γ defined in (66), we obtain (70), where the first line uses Lemma 4.14, the second line uses Claim 7.4, and we additionally recall that k = n^θ, ∆ = mΓ/n, and Γ = Θ(1). Moreover, since B̃ is (conditionally) a binomial random variable, the Chernoff bound (Lemma 7.1) yields that its concentration fails with probability at most o(n^{−10}). Thus, similar to the proof of Lemma 3.15, Claim 4.15 transfers this concentration to B w.h.p.

We are now in a position to characterize the number A of infected individuals outside V_{1−−}(G_Γ) (Claim 4.17).

Proof of Claim 4.17: The combinatorial expression follows by adding k probabilities, one per infected individual. Each term is the probability that a given infected individual does not belong to V_{1−−}, which equals the probability that all of its ∆ connections avoid the B connections to tests in which it would have been the only infected individual with no disguised uninfected individuals, i.e., that all ∆ connections fall among the remaining k∆ − B connections. The assertion then follows by combining this expression with Claim 4.16.

Proof of Lemma 4.11: We distinguish between θ/(1−θ) ∉ Z and θ/(1−θ) = T ∈ Z, and recall m_DD from (62) with ∆ = max{2, 1 + ⌊θ/(1−θ)⌋}. For simplicity, we assume that the inequality m ≥ m_DD holds with equality, but the general case is analogous.

Case A: θ/(1−θ) ∉ Z. In this case, for m = m_DD, we have ∆ = max{2, 1 + ⌊θ/(1−θ)⌋}. We distinguish the two sub-cases θ < 1/2 and θ > 1/2 as follows:
• Case A1: θ > 1/2. In this case, we have ∆ = 1 + ⌊θ/(1−θ)⌋.
Defining η = θ − (1−θ)·⌈θ/(1−θ)⌉ < 0, using (72) and Γ, ∆ = Θ(1), we find E[A] = O(n^η) = o(1).
• Case A2: θ < 1/2. In this case, we have ∆ = 2, and hence E[A] = O(n^{3θ−2}) = o(1), since θ < 1/2 gives 3θ − 2 < −1/2.

Case B: θ/(1−θ) = T ∈ Z. Again, we distinguish the cases θ = 1/2 and θ > 1/2:
• Case B1: θ > 1/2. We have ∆ = T + 1, so by (72) and Γ, ∆ = Θ(1), we find E[A] = O(n^{θ−(1−θ)(T+1)}) = O(n^{−(1−θ)}) = o(1), where the last step uses (1−θ)T = θ and T > 1.
• Case B2: θ = 1/2. We have ∆ = 2, and hence E[A] = O(n^{3θ−2}) = O(n^{−1/2}) = o(1).

Combining (73)–(76) with Markov's inequality and the fact that F_Γ occurs w.h.p., we deduce that A = 0 w.h.p., completing the proof of Lemma 4.11. Theorem 4.10 now follows directly by combining Lemma 4.11 and Claim 2.4.

So far, we have addressed the case where the test design is formed using the configuration model, and showed that the DD algorithm is optimal in this regime when applied to the random regular pooling scheme G_Γ. However, the preceding analysis does not provide a tight bound for θ < 1/2, which motivates the matching-based design. Recall from Section II-B2 that the matching-based model with parameter γ is denoted by G*_Γ. While the DD algorithm does not appear to be optimal in this case, it turns out that turning to SCOMP (a slight refinement of DD) suffices for optimality.

Theorem 4.18: If m ≥ 2n/(Γ+1) and 0 < θ < 1/2, then w.h.p. SCOMP recovers σ from G*_Γ and σ̂.

1) Proof of Theorem 4.18: We prove the theorem for m = 2n/(Γ+1) (which implies γ = 2n/(Γ+1)), but the more general case follows analogously; intuitively, a higher number of tests can only help. We analyse the DD algorithm on G*_Γ in two steps, starting with the regular part of the graph. Denote by G*,r_Γ the (Γ−1, 2)-regular part, in which we select n − γ individuals and pool them into two tests each. The first step (Lemma 4.19) establishes that DD succeeds w.h.p. on this regular part. It remains to handle the second step, and specifically to argue that, after adding the γ = 2n/(Γ+1) individuals (one to each test), we can guarantee the success of SCOMP. We denote by k′ the number of infected individuals among the remaining n′ individuals, and let θ′ be the value such that k′ = Θ((n′)^{θ′}), which is well-defined due to the following.

Claim 4.20: Under the preceding setup, we have w.h.p. that θ′ = θ.

Proof: As we remove γ = 2n/(Γ+1) individuals randomly, the number of infected individuals in the remaining part is a hypergeometrically distributed random variable K ∼ H(n, k, n′). Thus, the Chernoff bound for the hypergeometric distribution guarantees w.h.p. that K = (1 + o(1)) k n′/n, and the assertion follows.

In the second step, we analyse the remaining part of the graph, in which the γ remaining individuals are placed into one test each. To do so, the following lemma turns out to be useful.

Lemma 4.21: Under the matching-based model G*_Γ with θ < 1/2, it holds w.h.p. that there are no two infected individuals within distance 4 in the graph.

Proof: By construction, it holds with probability one that G*_Γ has individual-degree at most two, and test-degree at most Γ = Θ(1). Hence, all degrees are bounded. This means that for any given individual x, the set of individuals x′ with dist(x, x′) ≤ 4 has size O(1). For any two individuals x and x′, the probability of both being infected is O((k/n)²), and a union bound over the O(n) possible pairs with dist(x, x′) ≤ 4 gives a total probability of O(n(k/n)²). The assumption θ < 1/2 implies that k = o(√n), and thus we have O(n(k/n)²) = o(1), which establishes the lemma.

We now combine the preceding lemmas to establish the success of SCOMP.

Lemma 4.22: Conditioned on DD succeeding on the regular part G*,r_Γ, and on all infected individuals having pairwise distance exceeding 4, it holds with conditional probability one that the SCOMP algorithm recovers σ from (G*_Γ, σ̂).
Proof: By the construction of G*_Γ, there are γ = 2n/(Γ+1) individuals added to G*,r_Γ to produce G*_Γ. Denote the set of these individuals by X = {x₁, …, x_γ}. As γ ≤ m, there is a matching from X to the m tests. Having assumed success on the regular part G*,r_Γ, we only need to show that the newly added individuals in X are also correctly identified, and additionally do not impact the identifications in G*,r_Γ. Recall from Claim 2.4 that DD succeeds if and only if all infected individuals are easy infected (i.e., are in V_{1−−}(G*_Γ)), and recall also that the success of DD implies the success of SCOMP [33]. We distinguish four different cases, which are illustrated in Figure 4.

Fig. 4: The blue vertex is added in the second step of the construction of G*_Γ, and the labels inside the vertices indicate the defectivity status or test outcome after adding the blue vertex. In case A1, recovery is clearly possible if and only if the same is true for the remainder of the graph. In case A2, the yellow vertices may, in principle, no longer be identifiable as definite non-defectives; this happens if and only if the corresponding red individual is infected, which in turn implies a length-4 path between defectives, contradicting a high-probability event that we show. In case B1, there is a path of length two from the infected blue individual to the infected yellow individual, which is again a contradiction. In case B2(i), the infected yellow vertex can still be recovered, as it is an element of V_{1−−} in the regular part and will be recovered successfully during the first two steps of SCOMP. In case B2(ii), the two red tests could be explained either by the yellow infected individual or by the two blue (uninfected) individuals; due to its greedy selection rule, SCOMP declares the yellow individual infected and the blue individuals uninfected.

Case A: Connecting to a negative test. Suppose that an individual x ∈ X connects to a (previously) negative test a. Then, for all y ∈ ∂_{G*,r_Γ}(a), we have y ∈ V_{0−}(G*,r_Γ).
• Case A-1: σ_x = 0. If x is uninfected and connects to a negative test, then the test remains negative. It follows immediately that x ∈ V_{0−}(G*_Γ) (i.e., x is easy uninfected), which further implies that all other individuals in the test that were previously easy uninfected or easy infected in G*,r_Γ remain so in G*_Γ, as desired.
• Case A-2: σ_x = 1. In this case, we have σ̂_a(G*,r_Γ) = 0 but σ̂_a(G*_Γ) = 1. To maintain success, we need to show that all y ∈ ∂_{G*,r_Γ}(a) (which were previously easy uninfected) remain easy uninfected in G*_Γ; this implies both that previous decisions are not affected, and that the decision for x is correct due to x ∈ V_{1−−}(G*_Γ). To establish that each y ∈ ∂_{G*,r_Γ}(a) is easy uninfected, we argue that the second test that y belongs to is negative. Indeed, suppose for contradiction that y is in another positive test a′ with an infected individual x′. Then, there is a path of length 4 in G*_Γ from x to a to y to a′ to x′, and this contradicts Lemma 4.21.

Case B: Connecting to a positive test. Suppose that an individual x ∈ X connects to a (previously) positive test a. Then, there exists at least one y ∈ V₁(G*,r_Γ) ∩ ∂_{G*,r_Γ}(a). As DD succeeds on G*,r_Γ by assumption, we have y ∈ V_{1−−}(G*,r_Γ).
• Case B-1: σ_x = 1. This case does not occur, because it would imply a length-2 path from x to y, both of which are infected, in contradiction with the lemma assumption.
• Case B-2: σ_x = 0.
Since the first two steps of SCOMP (Algorithm 1) never make mistakes, the only way that an error can occur in this case is that (i) x is added in some iteration of the final (sequential greedy) step, or (ii) y ∉ V_{1−−}(G*_Γ) and y fails to be chosen throughout the final step. We argue that neither of these events occurs. To see this, first note that in G*,r_Γ, y is not only part of V_{1−−}(G*,r_Γ) because of a, but also because the second test that y belongs to consists only of y and individuals from V_{0−}(G*_Γ): if this were not the case, then we could create a path of length at most 4 from y to another infected individual. We then have the following:
– If y ∈ V_{1−−}(G*_Γ), then y is trivially decoded correctly, and x is certainly not added in the final step (since its only test is already explained).
– If y ∉ V_{1−−}(G*_Γ), then the two tests containing y are unexplained at the start of the final step. Due to the above-established property of both of these tests leading to y ∈ V_{1−−}(G*,r_Γ) in the regular part, we have that in G*_Γ, only y and/or the newly added elements of X can explain these two tests. But y explains both of them, whereas the elements of X can only explain one each (since their degree is one), so it is clearly y (and not x) that will be chosen, as desired.

We now have all the ingredients to prove Theorem 4.18.

Proof of Theorem 4.18: By construction, G*_Γ consists of n individuals and m = 2n/(Γ+1) tests. By Lemma 4.19, this m suffices for DD to succeed w.h.p. on the regular part of G*_Γ (i.e., on G*,r_Γ). In addition, Lemma 4.21 gives the convenient distance-4 property w.h.p., and Lemma 4.22 guarantees that the preceding two findings suffice to ensure that SCOMP infers σ correctly from G*_Γ and σ̂. Hence, the theorem follows.

Theorem 4.10 proves that DD succeeds on the biregular graph G_Γ created by the configuration model using max{2, 1 + ⌊θ/(1−θ)⌋} n/Γ tests, and hence so does SCOMP [33]. Furthermore, as Theorem 4.18 shows, for θ < 1/2, 2n/(Γ+1) tests suffice when employing G*_Γ and using SCOMP. Finally, we show that the results of Theorem 4.10 and Theorem 4.18 combine to match the information-theoretic lower bound (58), i.e., max{(1 + ⌊θ/(1−θ)⌋) n/Γ, 2n/(Γ+1)}. On the one hand, for θ < 1/2, the lower bound simplifies to the desired quantity 2n/(Γ+1), due to the fact that ⌊θ/(1−θ)⌋ = 0 in this regime and 2/(Γ+1) ≥ 1/Γ for Γ ≥ 1. On the other hand, if θ ≥ 1/2, then we have ⌊θ/(1−θ)⌋ ≥ 1, and so the maximum in the lower bound is achieved by the first term (since 1/Γ ≥ 1/(Γ+1)), thus again matching the upper bound. Hence, the SCOMP algorithm is information-theoretically optimal when used with the pooling scheme G̃_Γ.

In this section, we turn to adaptive testing strategies in the case of ∆-divisible individuals, and demonstrate that in certain cases the number of tests can be reduced significantly. Recall that the converse bound proved in Theorem 3.1 already covers adaptive test designs. Thus, any adaptive strategy fails w.h.p. when m ≤ (1 − ε)e^{−1}∆k^{1+(1−θ)/(∆θ)} for fixed ε > 0. We present an algorithm that can be viewed as an analog of Hwang's binary splitting algorithm [38], instead using non-binary splitting in order to ensure that each item is in at most ∆ tests. As with Hwang's algorithm, we assume that the size k of the infected set is known. In the case that only an upper bound k_max ≥ k is known, the same analysis and results apply with k_max in place of k.
However, such bounds may be somewhat loose, and care should be taken in using initial tests to estimate k (e.g., see [39], [40], [41]), as this may use a significant portion of the ∆ budget. For clarity, we only consider the case of known k in this section, and leave the case of unknown k to future work (see also [1] for some initial findings).

1) Recovering the Infected Set: Our adaptive algorithm is described in Algorithm 2, where we assume for simplicity that (n/k)^{1/∆} is an integer.¹⁰ Using Algorithm 2, we have the following theorem, which is proved throughout the remainder of the subsection. We define m_ada(∆) = ∆k^{1+(1−θ)/(θ∆)}.

Theorem 5.1: For ∆ = o(ln n) and k = n^θ with θ ∈ (0, 1), the adaptive algorithm in Algorithm 2 tests each individual at most ∆ times and uses at most m_ada(∆)(1 + o(1)) tests to recover the infected set exactly with zero error probability.

(¹⁰ Note that we assume k = o(n) and ∆ = o(ln(n/k)), meaning that (n/k)^{1/∆} → ∞. Hence, the effect of rounding is asymptotically negligible, and is accounted for by the 1 + o(1) term in Theorem 5.1.)

Require: Number of individuals n, number of infected individuals k, and divisibility ∆ of each individual
1: Initialise n̄ ← (n/k)^{(∆−1)/∆} and the estimate K ← ∅
2: Arbitrarily group the n individuals into n/n̄ groups of size n̄
3: Test each group and discard any that return negative
4: Label the remaining groups incrementally as G_j^{(0)}, where j = 1, 2, …
5: for i = 1 to ∆ − 1 do
6:   for each group G_j^{(i−1)} from the previous stage do
7:     Arbitrarily group all individuals in G_j^{(i−1)} into n̄^{1/(∆−1)} sub-groups of size n̄^{1−i/(∆−1)}
8:     Test each sub-group and discard any that return a negative outcome
9:     Label the remaining sub-groups incrementally as G_j^{(i)}
10: return the individuals in the remaining (singleton) groups as K
Algorithm 2: Adaptive algorithm for ∆-divisible individuals

Proof: Similar to Hwang's generalised binary splitting algorithm [38], the idea behind the parameter n̄ in Algorithm 2 is that when k becomes large, having large groups during the initial splitting stage is wasteful, as it results in each test having a high probability of being positive (and hence not very informative). Hence, we want to find the appropriate group sizes that result in more informative tests, so as to minimise the number of tests. Each stage (outermost for-loop in Algorithm 2) refers to the process in which all groups of the same size are split into smaller groups (e.g., see Figure 5). We let n̄ be the group size at the initial splitting stage of the algorithm. The algorithm first tests n/n̄ groups of size n̄ each,¹¹ then steadily decreases the sizes of the groups down the stages: n̄ → n̄^{1−1/(∆−1)} → n̄^{1−2/(∆−1)} → ⋯ → 1 (see Figure 5). Hence, we have n/n̄ groups in the initial splitting and n̄^{1/(∆−1)} sub-groups in all subsequent splits.

(¹¹ Note that n/n̄ is an integer for our chosen n̄, which gives n/n̄ = k(n/k)^{1/∆}, and (n/k)^{1/∆} was already assumed to be an integer.)

Require: Number of individuals n, number of infected individuals k, and test size restriction Γ
1: Initialise the infected set K ← ∅
2: Randomly group the n individuals into n/Γ groups of size Γ
3: for each group G_i, i ∈ {1, …, n/Γ} do
4:   while testing G_i returns a positive outcome do
5:     run Algorithm 4 on a copy of G_i, and add its output (one infected individual) k* to K
6:     G_i ← G_i \ {k*}
7: return K
Algorithm 3: Adaptive algorithm for Γ-sparse tests

With the above observations, we can derive an upper bound on the total number of tests needed. We have n/n̄ tests in the first stage.
Since there are k infected individuals, and each positive group is split into n̄^{1/(∆−1)} sub-groups in subsequent stages, the number of smaller groups that each stage can produce is at most k n̄^{1/(∆−1)}. This implies that the number of tests conducted at each stage is at most k n̄^{1/(∆−1)}, for a total of at most n/n̄ + (∆−1)k n̄^{1/(∆−1)} = k(n/k)^{1/∆} + (∆−1)k(n/k)^{1/∆} = m_ada(∆), and the assertion follows.

We also use the following direct consequence of the binomial expansion.

Claim 7.4: For any real number x ≥ −1 and any integer t ≥ 0, the following holds: (1 + x)^t = 1 + tx + O(t²x²).

Finally, we state the following useful result relating to Stirling's approximation and the local limit theorem.

Claim 7.5 (Appendix B1 of [18]): For any m, ∆ ∈ N, θ ∈ (0, 1), and k ∼ n^θ, let (X_i)_{i∈[m]} denote a sequence of independent Bin(Γ_i, k/n) variables, and define E = {∑_{i∈[m]} X_i = k∆}. Then, we have P(E) = Ω(1/√(n∆)).

In Figures 6 and 7, we compare our theoretical findings to empirical results obtained as follows:
• In the non-adaptive case, we fix the number of individuals n, the infection parameter θ, and, depending on the setup considered, the individual degree ∆ or the test degree Γ. We vary the number m of tests (x-axis), and simulate 10⁴ independent trials per parameter set. DD's performance (y-axis) is reported as the fraction of simulations per parameter point that inferred the infected set without errors.
• In the adaptive case, we cannot directly control the number of tests m a priori. Instead, we fix the same parameter sets as in the non-adaptive case, and carry out 10⁶ simulations. We then report the cumulative distribution of the number of tests required, i.e., the y-value corresponding to some m is given as the fraction of runs that required at most m tests.

We observe that the empirical results are consistent with our theoretical thresholds in all cases. The adaptive testing strategies show a particularly rapid transition at m_ada(∆) and m_ada(Γ), respectively. We find that the non-adaptive DD algorithm requires more tests in comparison to the adaptive schemes, and has a much broader range of transient behaviour. This suggests that convergence rates to the first-order asymptotic threshold may reveal an even wider gap between adaptive and non-adaptive designs, in analogy with studies of channel coding [44]. Note that the change of slope in Figure 7 (right) at m = 2000 is due to the rounding of ∆.

We have studied the information-theoretic and algorithmic thresholds of group testing with constraints on the number of items per test or tests per item. For ∆-divisible items, we proved that, at least for ∆ = ω(1), the DD algorithm is asymptotically optimal for θ > 1/2, and is optimal to within a factor of e for all θ ∈ (0, 1), thus significantly improving on existing bounds for the COMP algorithm, which has suboptimal scaling laws. For Γ-sized tests with Γ = Θ(1), we improved on both the best known upper bounds and lower bounds, established a precise threshold for all θ ∈ (0, 1), and introduced a new randomised test design for θ < 1/2. In both settings, we additionally provided near-optimal adaptive algorithms, and demonstrated a strict gap between the number of tests for adaptive and non-adaptive designs in broad scaling regimes.

[1] Near-optimal sparse adaptive group testing
[2] The detection of defective members of large populations
[3] Group testing: an information theory perspective,
Foundations and Trends in Communications and Information Theory
[4] Optimal group testing
[5] Combinatorial group testing and its applications
[6] Pooling designs and nonadaptive group testing: important tools for DNA sequencing
[7] A survey on combinatorial group testing algorithms with applications to DNA library screening
[8] Designing pooling systems for noisy high-throughput protein-protein interaction experiments using Boolean compressed sensing
[9] A new pooling strategy for high-throughput screening: the shifted transversal design
[10] Mitigating the COVID Economic Crisis: Act Fast and Do Whatever It Takes
[11] Pandemics: risks, impacts and mitigation
[12] Surveillance and studies in a pandemic in Europe
[13] Planning and Preparedness Resources
[14] Global surveillance during an influenza pandemic
[15] Discovery of power-laws in chemical space
[16] Evolution of scaling emergence in large-scale spatial epidemic spreading
[17] Individual testing is optimal for non-adaptive group testing in the linear regime
[18] Information-theoretic and algorithmic thresholds for group testing
[19] Performance of group testing algorithms with near-constant tests per item
[20] Group testing algorithms: bounds and simulations
[21] Phase transitions in group testing
[22] Pooled testing for HIV screening: capturing the dilution effect
[23] Pool testing of SARS-CoV-2 samples increases worldwide test capacities many times over
[24] Pooling method for accelerated testing of COVID-19
[25] Nearly optimal sparse group testing
[26] Sparse group testing codes for low-energy massive random access
[27] A simple construction of d-disjunct matrices with certain constant weights
[28] Non-adaptive probabilistic group testing with noisy measurements: near-optimal bounds with efficient algorithms
[29] Sparse combinatorial group testing
[30] Random Graphs
[31] Improved group testing rates with constant column weight designs
[32] Statistical physics of inference: thresholds and algorithms
[33] On the optimality of some group testing algorithms
[34] The capacity of adaptive group testing
[35] Information Theory
[36] Correlation inequalities on some partially ordered sets
[37] The weighted version of the handshaking lemma
[38] A method for detecting all defective members in a population by group testing
[39] Competitive group testing and learning hidden vertex covers with minimum adaptivity
[40] Estimating the number of defectives with group testing
[41] Adaptive group testing algorithms to estimate the number of defectives
[42] Randomized Algorithms
[43] A remark on Stirling's formula
[44] Feedback in the nonasymptotic regime

OG was funded by DFG CO 646/3. MHK was partially funded by Stiftung Polytechnische Gesellschaft and DFG FOR 2975. OP was supported by the DFG (Grant PA 3513/1-1) and the London School of Economics and Political Science. MP was funded by ME 2088/4-2 and ME 2088/5-1 (DFG FOR 2975). JS was funded by an NUS Early Career Research Award.

Comparisons: We observe that m_ada(∆) matches the universal lower bound in Theorem 3.1 to within a factor of e for all θ ∈ (0, 1). For θ < 1/2, we have m_ada(∆) = m_DD(∆) = ∆k^{1+(1−θ)/(∆θ)}, meaning that the best known bounds for the adaptive and non-adaptive settings are identical (though the adaptive algorithm attains zero error probability). In contrast, for θ > 1/2, we have m_DD(∆) = ∆k^{1+1/∆} and m_ada(∆) = ∆k^{1+(1−θ)/(∆θ)}. The former is significantly higher, and Theorem 3.2 reveals that this limitation is inherent to any non-adaptive test design and algorithm. Hence, for θ > 1/2, there is a significant gap between the number of tests required by adaptive and non-adaptive algorithms.
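For concreteness, the following is a simplified, runnable rendering of Algorithm 2 (our own sketch: group sizes are rounded with ceil, so the test counts match Theorem 5.1 only up to rounding, and test models a noiseless pooled test):

```python
from math import ceil

def test(group, infected):
    return any(x in infected for x in group)

def adaptive_splitting(n, infected, k, delta):
    n_bar = (n / k) ** ((delta - 1) / delta)  # initial group size
    split = ceil(n_bar ** (1 / (delta - 1)))  # sub-groups per split
    groups = [list(range(i, min(i + ceil(n_bar), n)))
              for i in range(0, n, ceil(n_bar))]
    tests_used = 0
    for _ in range(delta):                    # Delta rounds of testing
        positives = []
        for g in groups:
            tests_used += 1
            if test(g, infected):
                positives.append(g)
        if all(len(g) == 1 for g in positives):
            break
        # Refine: split every positive group into `split` sub-groups.
        groups = [g[j::split] for g in positives for j in range(split)]
        groups = [g for g in groups if g]
    return {g[0] for g in positives if len(g) == 1}, tests_used

infected = {3, 1_000, 54_321, 90_000}
found, used = adaptive_splitting(100_000, infected, k=4, delta=4)
print(found == infected, used)  # True, each individual tested <= 4 times
```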
Our adaptive algorithm with Γ-sparse tests, shown in Algorithm 3, is again a modification of Hwang's generalised binary splitting algorithm [38], where we initially divide the n individuals into n/Γ groups of size Γ, instead of k groups of size n/k as in the original algorithm. Our main result is stated as follows, in which we define m_ada(Γ) = n/Γ + k log₂ Γ.

Require: A group of individuals G̃
1: while G̃ consists of multiple individuals do
2:   Pick half of the individuals in G̃ and call this set G̃′
3:   Perform a single test on G̃′; if the test is positive, set G̃ ← G̃′, and otherwise set G̃ ← G̃ \ G̃′
4: return the single individual in G̃
Algorithm 4: Binary splitting

Theorem 6.1: For any Γ = o(n/k), Algorithm 3 outputs the correct configuration of infection statuses with probability one, while using at most m_ada(Γ)(1 + o(1)) tests, each containing at most Γ items.

Proof: Let k_i be the number of infected individuals in the i-th of the initial n/Γ groups. Note that since Γ = o(n/k) implies k = o(n/Γ), most groups will not have an infected individual. In the binary splitting stage of the algorithm, we can round the halves in either direction if they are not an integer. Hence, for each of the initial n/Γ groups, we take at most log₂ Γ + O(1) adaptive tests to find an infected individual, or one test to confirm that there are no infected individuals. Therefore, for each of the initial n/Γ groups, we need max{1, k_i log₂ Γ + O(k_i)} tests to find its k_i infected individuals. Summing across all n/Γ groups, we need a total of m = ∑_{i=1}^{n/Γ} max{1, k_i log₂ Γ + O(k_i)} tests. This has the following upper bound:
m ≤ n/Γ + k log₂ Γ + O(k) (a)= m_ada(Γ)(1 + o(1)),
where (a) uses k = o(n/Γ).

If we slightly strengthen the requirement Γ = o(n/k) to Γ = o(n/(k ln(n/k))) (which, in particular, includes the regime Γ = (n/k)^{1−Ω(1)} studied in [25]), then we have n/Γ = ω(k ln(n/k)) and hence n/Γ = ω(k ln Γ). Thus, we obtain m ≤ (n/Γ)(1 + o(1)). This simplified upper bound is tight, due to the simple fact that (n/Γ)(1 − o(1)) tests (of size at most Γ) are needed just to test a fraction 1 − o(1) of the items at least once each (which is a minimal requirement for recovering σ w.h.p.). Formally, this argument reveals the following.

Theorem 6.2: In the setup of Γ-sparse tests with k = n^θ for some θ ∈ (0, 1), any (possibly adaptive) group testing procedure that recovers σ w.h.p. must use at least (n/Γ)(1 − o(1)) tests.

The following variant of the Chernoff bound is convenient to work with (e.g., see [42, Sec. 4]).

Lemma 7.1 (Multiplicative Chernoff Bound): Let X₁, …, X_n be independent random variables such that 0 ≤ X_i ≤ 1 almost surely, and fix δ ∈ (0, 1). Then, writing X = ∑_{i=1}^n X_i and μ = E[X], we have P(X ≥ (1 + δ)μ) ≤ e^{−δ²μ/3} and P(X ≤ (1 − δ)μ) ≤ e^{−δ²μ/2}.
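To complement the pseudocode above, here is a minimal runnable sketch of Algorithm 4 and its repeated use within a single group in Algorithm 3 (our own rendering, with test again modelling a noiseless pooled test):

```python
def test(group, infected):
    return any(x in infected for x in group)

def binary_split(group, infected):
    """Algorithm 4: find one infected individual in a positive group,
    using at most ceil(log2 |group|) tests."""
    group = list(group)
    while len(group) > 1:
        half = group[: len(group) // 2]
        group = half if test(half, infected) else group[len(half):]
    return group[0]

def identify_group(group, infected):
    """Inner loop of Algorithm 3: repeatedly strip out one infected
    individual until the remaining group tests negative."""
    group, found = list(group), set()
    while test(group, infected):
        found.add(binary_split(group, infected))
        group = [x for x in group if x not in found]
    return found

print(identify_group(range(32), infected={5, 17}))  # {5, 17}
```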