Group Testing with Correlation under Edge-Faulty Graphs

Hesam Nikpey, Jungyeol Kim, Xingran Chen, Saswati Sarkar, Shirin Saeedi Bidokhti

February 5, 2022

In applications of group testing in networks, e.g. identifying individuals who are infected by a disease spread over a network, exploiting correlation among network nodes provides fundamental opportunities in reducing the number of tests needed. We model and analyze group testing on $n$ correlated nodes whose interactions are specified by a graph $G$. We model correlation through an edge-faulty random graph formed from $G$ in which each edge is dropped with probability $1-r$, and all nodes in the same component have the same state. We consider three classes of graphs: cycles and trees, $d$-regular graphs, and stochastic block models (SBM), and obtain lower and upper bounds on the number of tests needed to identify the defective nodes. Our results are expressed in terms of the number of tests needed when the nodes are independent, as functions of $n$, $r$, and the target error. In particular, we quantify the fundamental improvements that exploiting correlation offers by the ratio between the total number of nodes $n$ and the equivalent number of independent nodes in a classic group testing algorithm. The lower bounds are derived by illustrating a strong dependence of the number of tests needed on the expected number of components. In this regard, we establish a new approximation for the distribution of component sizes in "$d$-regular trees" which may be of independent interest and leads to a lower bound on the expected number of components in $d$-regular graphs. The upper bounds are found by forming dense subgraphs in which nodes are more likely to be in the same state. When $G$ is a cycle or tree, we show an improvement by a factor of $\log(1/r)$. For the grid, a graph with almost $2n$ edges, the improvement is by a factor of $(1-r)\log(1/r)$, indicating a drastic improvement compared to trees. When $G$ has a larger number of edges, as in the SBM, the improvement can scale in $n$.

Group testing [DHH00] is a well studied problem at the intersection of many fields, including computer science [Dor43, CHKV11, COGHKL20, CGM21, CTB+20], information theory [AJS19, BJ21, GH08] and computational biology [KM95, FKKM97]. The goal is to find an unknown subset of n items that are different from the rest using the least number of tests. The target subset is often referred to as defective, corrupted or infected, depending on the field of study. In this work, we use the term defective. To find the subset of defectives, items are tested in groups. The result of a test is positive if and only if at least one item in the group is defective. Group testing is beneficial when the number of defective items is o(n). It is often assumed that the (expected) number of defective items is n^α for some α < 1.

realization in the subset, i.e., two realizations have the same components except for one component that is partitioned into two components. Only one realization does not obey this rule: the one with the least number of connected components. They found a near-optimal non-adaptive testing strategy for any distribution over the considered realizations of the graph and showed optimality for specific graphs. The correlation model we consider is close to the work of [AU21].
We consider a random (edge-faulty) graph G in which each edge is realized with probability r. In a realized graph, each connected component is assumed defective with probability p. As opposed to [AU21], we do not constrain our study to a subset of the realizations and instead consider all possible realizations of the graph G. Despite its simplicity, our model captures two key features. First, given a network of nodes, one expects that when a path between two nodes gets shorter, the probability that they are in the same state increases. Our proposed model captures this intuition: adding one edge to a path reduces the probability of being in the same state by a factor of r. Second, two nodes might be far apart from each other, but when there are many edge-distinct paths between them, it is more likely that they are in the same state. Under our model, adding distinct paths between two nodes increases the probability that they are in the same state.

In other related works, a graph could represent potential constraints on testing (among independent nodes) [CKMS12, SW18]. This can be viewed as a direction complementary to our setting, in which a graph models the inherent dependency of the nodes but there is no constraint on how the groups can be formed for testing. In [CKMS12], the authors design optimal non-adaptive testing strategies when each group is constrained to be path connected. In particular, they use random walks to obtain the pool of tests. In a follow-up work, [SW18] shows that either the constraints are too strong and no algorithm can do better than testing most of the nodes, or optimal algorithms can be designed matching the unconstrained version of the problem. This is attained by sampling each edge with probability r (0 < r < 1, optimized). If a component is large enough, the algorithm tests the entire component. Our approach in this paper has similarities with [SW18] in aiming to find parts of the graph that are large and connected enough to remain connected with a decent probability after realizing the edges, but our techniques for finding the dense subgraphs and the corresponding analysis are different.

We start by motivating the key attributes that we capture in our model through an example of testing for infections in a network of people (nodes). Consider the interaction network for the spread of an infectious disease (e.g. COVID-19) in a network of people/nodes. There is an edge between two nodes if the corresponding individuals are in physical proximity for a minimum amount of time each week. Such individuals are more likely to be in the same state than those who have been distant throughout. Thus, firstly, the probability of being in the same state decreases as the length of the paths (i.e., the distance in the interaction network) between nodes increases. Second, infection is more likely to spread from one node to another if there are many distinct paths between them. Thus, the probability that two nodes are in the same state increases with the number of distinct paths between them.

We capture correlation through an edge-faulty graph model. Consider a graph G = (V, E) where the node set V represents the items and the edge set E represents connections/correlations between them. Suppose each edge is realized with probability 0 ≤ r ≤ 1. After the sampling, we have a random graph that we denote by G_r. Each node is either defective or non-defective.
All nodes in the same component of G_r are in the same state, rendering defectiveness a component property. We consider that each component is defective with probability (w.p.) p, independent of the others. As an example, consider a graph G with five nodes and eight edges, and a sampled graph realization G_r, as shown in Figure 1 (left) and Figure 1 (right) respectively. When r = 1/3, G_r is realized w.p. (1/3)^3 (2/3)^5. There are two components in G_r, namely {v_1, v_4, v_5} and {v_2, v_3}; v_1, v_4, v_5 are in the same state, which is defective w.p. p, independent of the state of v_2, v_3. This model importantly captures the two attributes we discussed. Clearly, a long path between two nodes in G has a smaller chance of survival in G_r compared to a short path, making the end nodes less likely to be in the same state as the length of the path between them in G increases. Moreover, the probability that at least one path between two nodes survives in G_r increases with the number of distinct paths between them in G, so having distinct paths between a pair of nodes in G makes them more likely to be in the same state.

We aim to find the minimum expected number of tests needed to find the defective items with at most εn errors, where ε can potentially be of order o(1). To be precise, let #ERR(H) be the number of nodes mispredicted by an algorithm on graph H. Then we require

E[#ERR(G_r)] ≤ εn, (1)

where the expectation is taken over G_r and possible randomization of the algorithm. We refer to this as the average error.

Remark 1.1. The definition of error in classic probabilistic group testing such as [LCH+14] is a stronger notion of error probability, where the goal is to correctly predict all nodes with probability 1 − ε, and with probability ε any number of nodes can be mispredicted. This is stronger than our definition of average error in (1): with probability ε at most n nodes are mispredicted in classic group testing, so the average error would be at most εn, the allowed error in our model. We mostly work with the notion of average error in this paper. In the last section (Section 5), we introduce another notion of error which we refer to as maximum error, allowing at most εn mispredicted nodes with high probability. We recover all the results of the paper for this stronger notion of maximum error as well.

Our approach in this paper is to relate the problem of group testing with correlation to an equivalent independent group testing problem with fewer nodes, providing a basis for comparison and for quantifying the improvements that our methods offer by exploiting correlation. The tests cannot be designed with the knowledge of G_r; only the value of r is known a priori. In the extreme case of r = 0, the problem reduces to classic group testing with |V| independent nodes. In the extreme case of r = 1, all components of G remain connected and hence the problem reduces to independent group testing over the components of G. When 0 < r < 1, the problem is non-trivial, because there can be multiple components, some with more than one node, and the number and composition of the components is a priori unknown. Thus, it is not known in advance which nodes will be in the same state. Our group testing strategies will seek to circumvent this challenge by identifying parts of G that are connected enough to remain connected in G_r with high probability.
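To make the model concrete, here is a minimal simulation sketch (ours, not from the paper; sample_states and its signature are illustrative). It samples G_r from an edge list and draws one Bernoulli(p) state per connected component, so that defectiveness is a component property as described above.

```python
import random

def sample_states(edges, n, r, p, seed=None):
    """Sample G_r: keep each edge of G independently with probability r,
    then mark every connected component of G_r defective with
    probability p (all nodes of a component share one state)."""
    rng = random.Random(seed)
    parent = list(range(n))          # union-find over realized edges

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        if rng.random() < r:         # edge survives in G_r
            parent[find(u)] = find(v)

    root_state = {}                  # one Bernoulli(p) draw per component
    states = []
    for v in range(n):
        rt = find(v)
        if rt not in root_state:
            root_state[rt] = rng.random() < p
        states.append(root_state[rt])
    return states                    # states[v] is True iff v is defective

# Example: a 5-node cycle with r = 1/3 and p = 0.1 (the graph of
# Figure 1 would be encoded as a similar edge list).
print(sample_states([(i, (i + 1) % 5) for i in range(5)], 5, 1/3, 0.1, seed=0))
```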
We obtain upper and lower bounds on the number of group tests needed to determine the states, defective or otherwise, of individual nodes in a large class of interaction graphs in the presence of correlation among the states of nodes. We progressively consider 1) cycles and trees (about n links), 2) d-regular graphs (about dn/2 links) and 3) stochastic block models or SBM (Θ(n²) links). The correlation is captured by the factor r (see Section 1.1). The bounds are obtained in terms of the number of tests needed when the states are independent, and help us quantify the efficiency brought forth by group testing in terms of r.

For trees and cycles, we prove an upper bound on the optimal number of tests in terms of the number of group tests when there are n log(1/r) independent nodes. Note that one can trivially determine the states of the nodes by disregarding correlation and testing among n independent nodes (e.g. using classic group testing techniques). Our upper bound therefore shows that group testing can reduce the number of tests by a factor of log(1/r), which is less than 1 when r > 1/2. As r approaches 1, the multiplicative factor decreases even further, implying even greater benefits due to group testing. Our lower bound, on the other hand, shows an improvement factor of (1 − r). For d-regular graphs, we prove new results on the distribution of components. This leads to a lower bound that is expressed as a sum series depending on r and n. We further prove an upper bound for a specific 4-regular graph, namely the grid, in terms of the number of group tests when there are n(1 − r) log(1/r) independent items. Thus, the improvement factor is (1 − r) log(1/r), as opposed to only log(1/r) for trees; this hints that group testing gets drastically more efficient for denser graphs. The stochastic block model divides the network into communities such that nodes in the same community are more connected than nodes in different communities. We show that the reduction in the test count due to group testing can be classified into three regimes: 1) strong intra-community connectivity but sparse inter-community connectivity, which reduces the effective number of independent nodes to the number of communities; 2) strong connectivity leading to an (almost) fully connected graph, in which case all nodes have the same state and one independent test is representative; 3) sparse connectivity leading to many isolated nodes, in which case the states of all nodes are independent.

We use the following notation for the rest of the paper. Let CrltOPT(G, r, p, ε) be the expected number of tests in an optimal algorithm on graph G with parameter r, probability of defectiveness p, and an error of εn. Let IndepOPT(n, p, ε) be the minimum expected number of tests needed for n items in order to find the defective set with error probability at most ε, where each item is independently defective with probability p. It is worth noting that the definitions of error in IndepOPT and CrltOPT are different, as mentioned in Remark 1.1. When clear from the context, we may drop p, r, ε from the notation.

The following lemma provides a lower bound on CrltOPT in terms of IndepOPT, the minimum number of group tests needed in discovery of the defective set among independent nodes.

Lemma 2.1. Let C(G_r) be the number of connected components of G_r. Then

IndepOPT(C(G_r), p, εn) ≤ CrltOPT(G, r, p, ε). (2)

Proof. Knowing the connected components of G_r, we have C(G_r) independent connected components that each contain nodes in the same state.
Assign a candidate node to each connected component and form a group testing problem on the C(G_r) candidate nodes alone. An algorithm achieving CrltOPT(G, r, p, ε) mispredicts at least one candidate node with probability at most εn, since each such event contributes at least one mispredicted node to the average error, which cannot exceed εn. It therefore solves the independent problem on C(G_r) nodes with error probability at most εn, and we have the lemma.

Lemma 2.1 illustrates a connection between the number of connected components and the minimum number of tests CrltOPT. We next discuss concentration lemmas for a class of graph theoretic functions, including the number of connected components. With these concentration results, we will be able to replace C(G_r) by its expectation in (2).

Definition 2.2. [AS16] A graph theoretic function f is said to satisfy the edge Lipschitz condition if, whenever H and H′ differ in only one edge, |f(H) − f(H′)| ≤ 1.

The analogous condition for graphs that differ in one node is called node Lipschitz [AS16]. Under this definition, the number of connected components f(G) = C(G) is edge Lipschitz: two graphs that differ in only one edge either have the same number of connected components, or the graph with one fewer edge has one more connected component.

The Edge Exposure Martingale. Let e_1, e_2, ..., e_m be an arbitrary order of the edges. We define a martingale X_0, X_1, ..., X_m where X_i is the expected value of a graph theoretic function f(H) after exposing e_1, e_2, ..., e_i. Note that X_0 is a constant, namely the expected value of f over the random graph G_r. This is a special case of martingales sometimes referred to as a Doob martingale process, where X_i is the conditional expectation of f(H), as long as the information known at time i includes the information at time i − 1 [AS16]. The same process can be defined for node exposure martingales, where the nodes are exposed one by one [AS16]: at the i-th step the graph has i nodes along with the edges between them. Then we have the following theorem.

Theorem 2.3. [AS16] When f satisfies the edge (resp. node) Lipschitz condition, the corresponding edge (resp. node) exposure martingale satisfies |X_{i+1} − X_i| ≤ 1.

We then have Azuma's inequality.

Theorem 2.4. [AS16] Let X_0, ..., X_m be a martingale with |X_{i+1} − X_i| ≤ 1 for all 0 ≤ i < m. Then for any λ > 0, Pr[|X_m − X_0| > λ√m] < 2e^{−λ²/2}.

Corollary 2.5. Let δ > 0. Then with probability 1 − δ we have

|C(G_r) − E[C(G_r)]| ≤ √(2m log(2/δ)).

Specifically, in the case that E[C(G_r)] = cn for a constant c, and G has m = o(n²) edges, then with high probability the number of connected components is within cn ± o(n)√(log(2/δ)).

Proof. Consider the number of connected components C(G), which is an edge-Lipschitz function of the graph. Define X_0 = E[C(G_r)] and X_m = C(G_r). Applying Theorem 2.4 concludes the proof.

Lemma 2.1 provides a lower bound on CrltOPT in terms of IndepOPT. Indeed, any lower bound on the latter leads to a lower bound on the former. We now review some useful lower and upper bounds on IndepOPT(n, p, ε). In the probabilistic group testing of [LCH+14], there are n individuals and every individual i is independently defective with probability p_i. Let X be the indicator vector of the items' defectiveness and H(X) = Σ_i p_i log(1/p_i) be the entropy of X. [LCH+14] proves the following lower bound on the required number of tests (for both adaptive and non-adaptive scenarios).

Theorem 2.6. [LCH+14] Any probabilistic group testing algorithm whose probability of error is at most ε requires at least (1 − ε)H(X) tests.
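As a quick numerical illustration of Theorem 2.6 (a sketch of ours; entropy_lower_bound is an illustrative name), the following evaluates (1 − ε)H(X) with H(X) = Σ_i p_i log(1/p_i) as defined above.

```python
import math

def entropy_lower_bound(p_list, eps):
    """(1 - eps) * H(X), with H(X) = sum_i p_i * log2(1/p_i) as above."""
    H = sum(p * math.log2(1 / p) for p in p_list if p > 0)
    return (1 - eps) * H

# 10,000 items, each defective w.p. 0.01, error probability eps = 0.05:
# any scheme needs at least about 631 tests.
print(entropy_lower_bound([0.01] * 10_000, 0.05))
```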
For the upper bounds in this paper, roughly speaking, we partition the graph into groups of nodes and assume independence among the groups. Solving the problem for the groups requires classic group testing algorithms, so we also need testing results from classic group testing. Below, we summarize the probabilistic group testing results of [LCH+14] for upper bounds on the number of tests needed when items are independent.

Theorem 2.7. [LCH+14] There is an adaptive algorithm with probability of error at most exp(−2δ²n^{1/4}) such that the number of tests is less than

Theorem 2.8. [LCH+14] For any 0 < ε ≤ 1 and δ > 0, if the entropy of X satisfies

then with probability of error at most ε there is a non-adaptive algorithm that requires no more than

tests.

In this section, we consider graphs that have n + c edges, where c is a constant. Specifically, we prove lower and upper bounds on the number of tests needed when the underlying graph is a cycle or a tree; the lower bound generalizes to any graph with n + c edges. We obtain these bounds by formulating the problem as one with independent nodes, potentially with fewer nodes than the original problem. First, we use the lemmas introduced in Section 2 to obtain lower bounds for cycles and trees. Next, we propose group testing algorithms and prove upper bounds for cycles and trees.

By Corollary 2.5, if we know the expected number of components in a graph, we can find the minimum number of tests needed via Lemma 2.1. The following theorem proves a lower bound for cycles and trees.

Theorem 3.1. Let G be a cycle or a tree. Then we have

Proof. In a tree, removing an edge creates one more connected component, so after removing k ≥ 1 edges the tree has k + 1 connected components and the cycle has k. Each edge is removed with probability 1 − r, so the expected number of components is 1 + (1 − r)(n − 1) for trees and about (1 − r)n for cycles. By Corollary 2.5, the number of components is (1 − r)n ± O(√(n log n)) with probability 1 − 1/n², and with probability 1/n² the difference in tests is at most n, hence O(1/n) additional tests. Applying Lemma 2.1 thus completes the proof.

The above proof also works for any graph with n + c edges, where c is a constant. In other words, when the number of edges is less than n + c, a lower bound on the number of tests needed for almost (1 − r)n independent nodes is also a lower bound on the number of tests needed under our model with correlation r.

In this section, we provide algorithms to find the defective set and prove theoretical bounds. We start with a simple cycle as a warm-up. Later, we generalize the ideas used for cycles to devise algorithms for any tree. Note that an algorithm for trees yields an algorithm for general graphs, by simply considering a spanning tree of the graph; but that algorithm might be far from optimal. The general idea is to partition the graph G into subgraphs that will remain connected in G_r with high probability. The nodes in those connected subgraphs will thus have the same states. We can then select a candidate node from each subgraph to be tested. Knowing the probability of each subgraph being connected and the probability of error in classic group testing, we can estimate the error in our problem. First, we provide an algorithm for cycles.

Theorem 3.2. Let l = max{log[1/(1 − ε/2)] / log(1/r), 1} and ε′ < ε/2. Then there is an algorithm that uses IndepOPT(⌈n/l⌉, p, ε′) tests and finds the defective set with error at most ε·n over a cycle of length n.

Proof.
Consider the following algorithm:

1. Let l = max{log[1/(1 − ε/2)] / log(1/r), 1}. Partition the cycle into ⌈n/l⌉ paths P_1, P_2, ..., P_⌈n/l⌉ of the same length l, except one path that may be shorter.
2. For each path, choose one of its nodes at random and let the corresponding nodes be v_{P_1}, v_{P_2}, ..., v_{P_⌈n/l⌉}.
3. Use an IndepOPT(⌈n/l⌉, p, ε′) algorithm (by Theorem 2.7 for adaptive or Theorem 2.8 for non-adaptive group testing) to find the defective items among v_{P_1}, v_{P_2}, ..., v_{P_⌈n/l⌉}, where ε′ < ε/2 and the probability of being defective equals p.
4. Assign to all the nodes in P_i the state of v_{P_i}, for all i.

Note that for each i, the defectiveness probability of v_{P_i} is p. The probability that P_i is actually connected after a realization is r^{l−1}, so the probability that P_i is not in the same state as v_{P_i} is 1 − r^{l−1}. Then, assuming that we detect all the v_{P_i}'s correctly, the error in G is at most ⌈n/l⌉ · (1 − r^{l−1}) · l. Substituting l = max{log[1/(1 − ε/2)] / log(1/r), 1}, the error becomes less than εn/2. Moreover, we might also have errors on the v_{P_i}'s (given the criteria set in IndepOPT): with probability 1 − ε′ all the nodes are predicted correctly, and with probability ε′ we have at least one mispredicted node and at most n mispredicted nodes. So the total error from this part is at most ε′n < εn/2. The total error is therefore at most (ε′ + ε/2)n < εn, and we have the above theorem.

Corollary 3.3. With a constant expected number of defective items, the algorithm above uses O(log n · log(1/r)) tests. Note that when correlations are strong, i.e., r ≥ 1 − 1/log n, the algorithm does a constant number of tests, as expected.
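The scheme behind Theorem 3.2 can be sketched as follows (our illustration; the classic group test on the candidate nodes is abstracted as an assumed black-box callback test_independent, and l is rounded down to an integer, which only tightens the error analysis).

```python
import math
import random

def cycle_group_test(n, r, eps, test_independent):
    """Sketch of the Theorem 3.2 scheme on a cycle v_0, ..., v_{n-1}.
    `test_independent(candidates)` is an assumed black box running a
    classic group-testing algorithm on the candidates (with its own
    error budget eps' < eps/2) and returning their predicted states."""
    l = max(math.floor(math.log(1 / (1 - eps / 2)) / math.log(1 / r)), 1)
    # ceil(n/l) paths of length l; the last one may be shorter.
    paths = [list(range(i, min(i + l, n))) for i in range(0, n, l)]
    candidates = [random.choice(p) for p in paths]
    verdicts = test_independent(candidates)
    # Propagate each candidate's verdict to its whole path.
    states = [None] * n
    for path, verdict in zip(paths, verdicts):
        for v in path:
            states[v] = verdict
    return states
```

The choice of l keeps the per-path disconnection probability 1 − r^{l−1} below ε/2, which is exactly how the proof above budgets the error.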
We now generalize the ideas to achieve an upper bound for trees. We will again partition the graph into ⌈n/l⌉ subgraphs (which we sometimes refer to as groups) of l nodes, find the probability of each being connected in a random realization, and then optimize over l. At a high level, we partition the graph such that the nodes within a subgraph have short paths among each other. This is because shorter paths remain in the graph with higher probability, maximizing the probability of the nodes being in the same state. We first give a definition to formalize the number of nodes needed to make a subset of nodes connected.

Definition 3.4. Let S ⊆ V for a graph G. The smallest connecting closure of S is a smallest subset S′ ⊆ V such that the induced graph over S ∪ S′ is connected.

For example, consider the graph G_r in Figure 1. If S = {v_1, v_5}, then the smallest connecting closure of S is {v_4}, as adding v_4 to S makes S connected. We now provide a partition of the nodes of a tree such that for each subgraph, we do not need to add many other nodes to make it connected. Formally:

Lemma 3.5. Let G be a tree. For each l ≤ n, there is a partitioning of the graph into ⌈n/l⌉ subgraphs of size l (one subgraph may have fewer than l nodes), such that the size of the smallest connecting closure of each subgraph is at most l.

The proof is provided in Appendix A. Now we are ready to prove the upper bound for trees.

Theorem 3.6. Consider a tree with n nodes and let l = max{log[1/(1 − ε/2)] / (2 log(1/r)), 1}. Let ε′ < ε/2. Then there is an algorithm that uses IndepOPT(⌈n/l⌉, p, ε′) tests and finds the defective set with at most ε·n errors. I.e., CrltOPT(G, r, p, ε) ≤ IndepOPT(⌈n/l⌉, p, ε′).

Proof. Consider the following algorithm:

1. By Lemma 3.5, partition the tree into ⌈n/l⌉ subgraphs g_1, g_2, ..., g_⌈n/l⌉ of the same size l (one subgraph might be smaller than the others).
2. For each group, choose one of its nodes at random and let these be v_{g_1}, v_{g_2}, ..., v_{g_⌈n/l⌉}.
3. Use an IndepOPT(⌈n/l⌉, p, ε′) algorithm to find the defective set among v_{g_1}, v_{g_2}, ..., v_{g_⌈n/l⌉}.
4. Assign to all the nodes in g_i the state of v_{g_i}, for all i.

First, we calculate the probability that g_i is connected. By Lemma 3.5, each g_i has the property that its smallest connecting closure has size at most l. This ensures that at most l edges (beyond the edges already in g_i) are needed to make g_i connected. Therefore, the probability of g_i being connected is at least r^{2l}, and the probability that g_i is not in the same state as v_{g_i} is at most 1 − r^{2l}. The rest of the proof revolves around showing that the total error is less than εn, as was done for cycles, and this completes the proof.

Corollary 3.7. Corollary 3.3 can be recovered for trees with an additional factor of 2.

In this section, we focus on graphs that potentially have many edges. As the number of edges increases, the correlation between nodes increases even when r is not large. As mentioned earlier, we need to target those components that are more likely to appear in various realizations. There is a threshold phenomenon in some edge-faulty graphs: when r is below a threshold, there are many isolated nodes (and hence many independent tests are needed), and when r is above that threshold, there is a giant component (and hence a single test suffices). Most famously, this threshold is log n/n for Erdős–Rényi graphs. For random d-regular graphs, [Goe97] has shown that when a graph is drawn uniformly from the set of all d-regular graphs with n nodes and each edge is then realized with probability r, 1/(d − 1) is a threshold almost surely. For the rest of this section, we first study a (deterministic) 4-regular graph, known as the grid, and then provide near-optimal results for the stochastic block model. When we consider (deterministic) d-regular graphs, we cannot use the results of [Goe97] for random d-regular graphs, because we cannot be sure that the specific chosen graph is among the "good" graphs that constitute the almost-sure result. So we need to develop new results on the number of connected components and their distribution for our purposes.

We first formally define a grid. A grid with n nodes and side length √n is a graph whose nodes are the pairs (a, b) with 1 ≤ a, b ≤ √n. Node (a, b) is connected to its four closest neighbors (when they exist), namely (a − 1, b), (a + 1, b), (a, b + 1), (a, b − 1). Border nodes (with a ∈ {1, √n} or b ∈ {1, √n}) may have only three or two neighbors.

In order to get a lower bound, we need to know the expected number of components in G_r. Traditionally, this has been done by finding the expected size of the component that a node belongs to [AS16, Goe97]. Consider the following process. Pick a node v ∈ V(G), mark it as processed, and let it be the root of a tree. For each u ∈ V(G) that is not processed and is a neighbor of v, the edge uv is realized with probability r, and u is then added as a child of v. The same process is repeated for each realized u in Breadth First Search (BFS) order. When the process ends, there is a tree with root v, and the expected size of the tree is the expected size of the component that v ends up in. An example of this process is shown in Figure 2. Node v_11 is the root (colored in blue), the children that are realized are marked in green, and the children that are not realized are marked in red. The component would be {v_11, v_12, v_7, v_2}.
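The exploration process above can be simulated directly. The sketch below (ours; names are illustrative) runs it on an idealized infinite tree in which every node has three potential children, the proxy used below for the grid. Viewing the growth as a branching process with Binomial(3, r) offspring, a standard fact gives mean component size 1/(1 − 3r) in the subcritical case r < 1/3; we use this only as a sanity check on the simulation.

```python
import random

def component_size_3ary(r, cap=10**6, rng=random):
    """One run of the exploration process: grow the component of the
    root when every node has three potential children and each edge is
    realized with probability r. `cap` guards the supercritical case
    r >= 1/3, where the component may be infinite."""
    size, frontier = 1, 1
    while frontier and size < cap:
        realized = sum(rng.random() < r for _ in range(3 * frontier))
        size += realized
        frontier = realized
    return size

r = 0.2  # subcritical: expected component size 1/(1 - 3r) = 2.5
runs = 20_000
print(sum(component_size_3ary(r) for _ in range(runs)) / runs)  # about 2.5
```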
By repeating the process for each node that is not yet processed, we get a spanning forest. The expected number of components in the forest is the expected number of components in the original random graph. Here, the challenge is that we do not know the number of available (unprocessed) neighbors of a node. It depends heavily on the previously chosen nodes, especially when d is small, as in the grid. We circumvent this issue by analyzing an infinite regular tree process that effectively corresponds to a more connected graph and therefore leads to a lower bound on the expected number of connected components of the grid.

Consider an infinite tree with root v such that each node in the tree has three children, and consider the process where each edge is realized with probability r. Let C(v) be the component that v ends up in. The following lemmas approximate the distribution of |C(v)|.

Lemma 4.1. Under the above process and for t ∈ N,

P(|C(v)| = t) = (1/(2t + 1)) (3t choose t) r^{t−1} (1 − r)^{2t+1}.

Proof. Let T be an embedded tree in G with t nodes. In order for T to be realized in the process, all the edges in T must be realized and the remaining edges incident to a node of T must not be realized. There are t − 1 edges in T, and each node has three potential edges, so there are 2t + 1 edges that are not realized. So the probability that T is realized is r^{t−1}(1 − r)^{2t+1}. Let C_t be the number of trees with t nodes and v as the root. We have

P(|C(v)| = t) = C_t r^{t−1}(1 − r)^{2t+1}.

Note that C_0 = C_1 = 1. We find a closed form of C_t by recursion. Node v has three potential subtrees, where the sum of the sizes of the subtrees is t − 1. We thus get the recursion

C_t = Σ_{i+j+k=t−1} C_i C_j C_k.

This recursion has the same initial values and the same form as the recursion of the second-order Catalan numbers. So the solution has the form C_t = (1/(2t + 1)) (3t choose t), and the proof is completed.

Lemma 4.3. Let P_∞ be the probability that v ends up in an infinite component. Then P_∞ satisfies the fixed-point equation 1 − P_∞ = (1 − rP_∞)³.

The solutions of this equation are 0 and P_∞ = (3r ± √(r(4 − 3r)))/(2r²). Note that P_∞ < 1, so the solution with the plus sign is not valid. The relevant solution turns out to depend on whether r > 1/3 or r ≤ 1/3. As a matter of fact, when r > 1/3, the probabilities of all finite components (as found in Lemma 4.1) do not add up to one. So for r > 1/3, the correct solution is P_∞ = (3r − √(r(4 − 3r)))/(2r²), and for r ≤ 1/3 the correct solution is P_∞ = 0.

Theorem 4.4. When r ≤ 1/3, the expected component size is 1/(1 − 3r). (3)

Proof. The proof follows from Lemma 4.1 and Lemma 4.3 and by using the Stirling approximation. It is worthwhile to remark that the proof generalizes to general d-regular tree processes.

Remark 4.5. When the underlying graph is an infinite d-regular tree, under the defined process and for t ∈ N,

P(|C(v)| = t) = (1/((d − 2)t + 1)) ((d − 1)t choose t) r^{t−1} (1 − r)^{(d−2)t+1}.

In the case of the grid, we consider 3-regular trees: after the root, each node in the grid has at most 3 unprocessed potential neighbors, and if we choose the root on the border, the root also has at most 3 potential children, as illustrated in Figure 2. So the 3-regular tree process that we analyzed corresponds to a more connected graph than the grid. Therefore, the expected component size found in (3) upper bounds the expected component size in the grid, which in turn yields a lower bound on the expected number of connected components of the grid.

Theorem 4.6. For a grid with n nodes and r ≤ 1/3, the expected number of components is at least n(1 − 3r).

Corollary 4.7. Using Lemma 2.1 in conjunction with Theorem 4.6, any lower bound on the number of tests for n(1 − 3r) independent nodes is also a lower bound on the number of tests on a grid under our model.

In this subsection, we provide an upper bound on the number of tests in a grid. The high-level idea is to partition the grid into subgrids and assume that each subgrid is connected, so that we can test one representative node per subgrid, as sketched below.
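A minimal sketch of this subgrid scheme (ours; test_independent is the same assumed black-box classic group test as before, and all names are illustrative):

```python
import random

def grid_group_test(side, k, test_independent, rng=random):
    """Sketch of the grid scheme: split a side x side grid into k x k
    subgrids, test one random representative per subgrid with a classic
    (independent-items) algorithm, and propagate its verdict."""
    blocks, reps = [], []
    for a0 in range(0, side, k):
        for b0 in range(0, side, k):
            block = [(a, b)
                     for a in range(a0, min(a0 + k, side))
                     for b in range(b0, min(b0 + k, side))]
            blocks.append(block)
            reps.append(rng.choice(block))
    verdicts = test_independent(reps)   # about (side/k)^2 = n/k^2 items
    states = {}
    for block, verdict in zip(blocks, verdicts):
        for node in block:
            states[node] = verdict
    return states
```

With k² of order log(1/(1 − ε/1000))/((1 − r) log(1/r)), as derived below, the number of candidate nodes is about n(1 − r) log(1/r) up to ε-dependent factors, matching the improvement factor claimed for the grid.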
The algorithmic idea is similar to the proofs of Theorems 3.2 and 3.6, where the graph G was partitioned into subgraphs that were likely to remain connected in G_r; here, we choose the subgraphs to be subgrids. In order to calculate the error, we need to compute the probability that a subgrid of side length k is connected, where k is to be optimized later. We first estimate this probability.

Lemma 4.8. Let P_k be the probability that a grid of side length k > 1 becomes connected when each of its edges is realized with probability r. We have:

P_k ≥ P_{k−1} · (1 − 1/k^{10}) · r^{Θ((1−r)k)}.

Proof. Consider the subgrid of side length k − 1 that contains the bottom-left corner node. Then the main grid consists of this subgrid and a path of 2k − 1 nodes, where each node in the path has one edge to the subgrid. With probability P_{k−1} the subgrid is connected. The path decomposes into (1 − r)2k ± o(rk) = Θ((1 − r)k) subpaths with probability at least 1 − 1/k^{10}. Each subpath has at least one edge to the subgrid, so each one is connected to the subgrid with probability at least r. The probability that all of them connect to the subgrid is then at least r^{Θ((1−r)k)}, and the lemma is proved.

Theorem 4.9. Let P_k be the probability that a grid of side length k > 1 becomes connected when each of its edges is realized with probability r. We have:

P_k ≥ c · r^{Θ((1−r)k²)} for a constant c > 0.

Proof. The proof follows by replacing P_{k−1} with P_{k−2}, then P_{k−2} with P_{k−3}, and so on in Lemma 4.8, and finally using P_1 = 1.

Now, similarly to Theorem 3.6, by setting the error probability of each subgrid small enough, that is 1 − P_k < ε/2, we get k² < log(1 − ε/1000)/((1 − r) log r). The error is then less than εn with at most n/k² independent node tests with error ε′ < ε/2.

In this section, we study our model on SBM graphs. We apply the same techniques used in Erdős–Rényi graphs for finding the connectivity threshold to find the structure of the connected components. A stochastic block model has g clusters of size k = n/g, where any pair of nodes in the same cluster is connected with probability q_1, and any pair of nodes in different clusters is connected with probability q_2 < q_1. After realizing each edge with probabilities q_1 and q_2, we have our graph G. Then, based on our correlation model, each edge remains in G_r with probability r. So an edge within a cluster remains with probability r_1 = rq_1, and an edge between two different clusters remains with probability r_2 = rq_2. Here, we assume the size of the clusters is much bigger than log n, i.e. k ≫ log n. We find the number of connected components based on r_1 and r_2.

Theorem 4.10.
• If r_1 ≥ 100 log n/k and 1 − (1 − r_2)^{k²} ≥ 100 log g/g, then with high probability G is connected. (first regime, one test needed)
• If r_1 ≥ 100 log n/k and 1 − (1 − r_2)^{k²} ≤ 1/(100g), then with high probability each cluster is connected but most of the clusters are isolated. (second regime, g independent tests needed)
• If r_1 ≤ 1/(100k) and r_2 ≤ 1/(100n), then with high probability G has many isolated nodes. (third regime, Ω(n) independent tests needed)
• If r_1 ≤ 1/(100k) and r_2 ≥ 100 log n/n, and g > 1, then with high probability G is connected. (fourth regime, one test needed)
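Before the proof, the following rough sketch (ours; names are illustrative, logarithms are natural, and parameter choices falling between the regimes return None because the theorem does not cover them) maps SBM parameters to the regimes of Theorem 4.10.

```python
import math

def sbm_regime(n, g, q1, q2, r):
    """Classify (n, g, q1, q2, r) into a regime of Theorem 4.10.
    Here k = n/g, r1 = r*q1 (intra-cluster), r2 = r*q2 (inter-cluster)."""
    k = n / g
    r1, r2 = r * q1, r * q2
    # Probability of at least one surviving edge between two clusters.
    p_inter = 1 - (1 - r2) ** (k * k)
    if r1 >= 100 * math.log(n) / k:
        if p_inter >= 100 * math.log(g) / g:
            return "regime 1: G connected, one test"
        if p_inter <= 1 / (100 * g):
            return "regime 2: clusters connected but isolated, g tests"
    if r1 <= 1 / (100 * k):
        if r2 <= 1 / (100 * n):
            return "regime 3: Omega(n) isolated nodes, Omega(n) tests"
        if r2 >= 100 * math.log(n) / n and g > 1:
            return "regime 4: G connected, one test"
    return None  # between regimes; the theorem is silent here
```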
Proof. First, suppose r_1 ≥ 100 log n/k. A cut is a partition of the nodes into two sets (parts), and its size is the number of nodes in the smaller set. We say a cut is disconnected if there is no edge between the two sets. A cut of size i ≤ k/2 in a single cluster has i(k − i) potential edges between the parts. Then the probability that this specific cut is disconnected is

(1 − r_1)^{i(k−i)} ≤_{(i)} e^{−r_1 i(k−i)} ≤_{(ii)} e^{−(100 log n/k) i(k−i)} ≤_{(iii)} (1/n)^{50i}.

The first inequality, (i), is true because 1 − x ≤ e^{−x} for x ≥ 0; (ii) is true by r_1 ≥ 100 log n/k; and (iii) is true by i ≤ k/2. Note that the number of cuts of size i is (k choose i) ≤ k^i, so by the union bound the probability that any cut of size i becomes disconnected is at most k^i (1/n)^{50i} ≤ (1/n)^{49i}, and, summing over i, the probability that some cut is disconnected is at most (1/n)^{48}. So with probability 1 − (1/n)^{48} a single cluster of size k is connected. Again, by the union bound, with probability at most (1/n)^{48} · g < (1/n)^{47} there is a disconnected cluster. So with probability 1 − (1/n)^{47}, all clusters are connected.

Now, assuming all the clusters are connected, if there is an edge between two clusters, then those two clusters are connected. So consider a graph where the nodes represent the clusters and two nodes are adjacent if there is at least one edge between the corresponding clusters; we need to understand the connectivity of this new graph. The probability that there is at least one edge between two clusters is 1 − (1 − r_2)^{k²}, and again if this value is more than 100 log g/g, then G is connected with high probability. If 1 − (1 − r_2)^{k²} < 1/(100g), then the probability that a given cluster is isolated is more than (1 − 1/(100g))^{g−1} ≈ e^{−1/100} ≥ 0.99, so most of the clusters are isolated, which proves the first two parts of the theorem.

If r_1 ≤ 1/(100k), then by the same argument, with high probability most of the nodes in all clusters are isolated within their clusters. If we also have r_2 ≤ 1/(100n), then most of the nodes have no neighbors outside of their cluster with high probability, so in total the graph has Ω(n) isolated nodes, which proves the third part.

Now suppose r_2 ≥ 100 log n/n; we prove the last part of the theorem. We assume each cluster is empty, i.e. there is no edge inside the clusters; even then, with r_2 ≥ 100 log n/n the graph is connected with high probability. Consider a cut in G with i ≤ n/2 nodes. Each node has n − k potential neighbors in other clusters, so it has at least n − k − i potential neighbors outside of its cluster and the chosen cut. Then, almost as in the first part of the theorem, the probability that this cut is disconnected is at most

(1 − r_2)^{i(n−k−i)} ≤ e^{−(100 log n/n) i(n−k−i)} ≤_{(i)} (1/n)^{100i(1/2 − 1/g)}.

Here, (i) is true by i ≤ n/2 and k/n = 1/g. Again, there are (n choose i) ≤ n^i cuts of size i. So the probability that any cut of size i is disconnected is at most n^i · (1/n)^{100i(1/2−1/g)} = (1/n)^{100i(1/2−1/g−0.01)}. It is not hard to see that if g > 2, then (1/n)^{100i(1/2−1/g−0.01)} = o(1/n²), so the probability that any cut is disconnected is bounded by o(1/n). So in the case of g > 2, we have proved the last part of the theorem. If g = 2, for i > 2, a cut of size i has at most i²/4 edges inside the node set of size i, as the graph is bipartite and the number of edges in the set is maximized when i/2 nodes are chosen from each part of the graph. So the number of potential edges to the other side of the cut is at least i(n − k) − i²/4 = i(n − k − i/4) = i(n/2 − i/4) ≥ i · 3n/8, as i ≤ n/2, and we can repeat the reasoning to prove that with high probability all cuts in this graph are connected. It is also easy to verify that when the cut is a single node or a pair of nodes, the cut is disconnected with probability at most o(1/n⁴), and this completes the proof.

So far, we have only focused on bounding the expected error of the algorithms. But similarly to [LCH+14], we can also target a probabilistic error with a relaxation on error-free prediction.
In other words, in [LCH+14] the goal is to find the defective set with probability 1 − δ such that no error is made; here we may allow up to εn mispredicted nodes. Suppose that we want at most εn mispredicted nodes with probability 1 − δ, while in the remaining δ fraction we may have any number of mispredicted nodes. We call this maximum error with parameters ε and δ. Let CrltOPT(G, r, p, δ, ε) be the expected number of tests in an optimal algorithm with maximum error with parameters ε and δ. Parameters r and p are defined as in Section 1.1.

We first find a lower bound for CrltOPT(G, r, p, δ, ε). Note that we are no longer able to use the lower bounds of classic group testing directly, because having an error in classic group testing might not fulfil our maximum error target. In other words, in classic group testing, when an error happens (with probability δ), there is no guarantee on the number of mispredicted nodes; it might be one, a constant, or all of the nodes, whereas up to εn mispredicted nodes are allowed under maximum error. So we need to directly derive a lower bound for the problem with the maximum error definition and independent nodes. We adapt the approach in [LCH+14] to derive the following lower bound:

Theorem 5.1. Let IndepOPT(n, p, δ, ε) be the minimum number of tests for n independent nodes under maximum error with parameters (δ, ε). Then any probabilistic group testing algorithm that, with probability 1 − δ, correctly predicts all independent nodes but εn of them needs n(1 − δ)(h(p) − h(ε)) tests, where h is the binary entropy function. I.e.,

IndepOPT(n, p, δ, ε) ≥ n(1 − δ)(h(p) − h(ε)).

The proof is provided in Appendix B. Using the above theorem, along with Lemma 2.1, we find:

Lemma 5.2. Let C(G_r) be the number of connected components in G_r. Then, under a maximum error target, we have

Lower bounds for cycles, trees and grids can be recovered via this lemma. For average error, by Theorem 2.6, with C(G) independent nodes, any algorithm should perform (1 − ε)H(X) = n(1 − ε)h(p) tests, and by finding a concentration over the number of connected components, we were able to obtain a lower bound on the number of tests. Now, after finding the concentration, we use Lemma 5.2 instead of Theorem 2.6 and Lemma 2.1 to get the lower bounds.

We now find upper bounds similar to those provided in Sections 3.2 and 4.1. Under an average error target, we partitioned the graph and computed the incurred average error. Under a maximum error target, however, we do not know what fraction of the realizations with average error εn actually has fewer than εn mispredicted nodes, so we cannot reuse those results and need to prove concentration around the average. For cycles and grids, the subgraphs we designed were independent of each other, in the sense that the connectivity of one subgraph does not change the probability of connectivity of the other subgraphs. Hence, using Hoeffding's bound, we prove the following theorem for a maximum error target.

Theorem 5.3. Let l = max{log[1/(1 − ε/2)] / log(1/r), 1}, δ > 2 exp[−Θ(ε²n log(1/r)/log(1/(1 − ε/2)))], and ε′ < δ/2. There is an algorithm that uses IndepOPT(⌈n/l⌉, p, ε′) tests and finds the defective set with maximum error with parameters ε and δ, i.e. CrltOPT(Cycle, r, p, δ, ε) ≤ IndepOPT(⌈n/l⌉, p, ε′).

Proof. We use the same algorithm and the same subgraphs as in Theorem 3.2. Recall that the probability of one subgraph not being connected is 1 − r^{l−1} and the average error is εn/2. The number of mispredicted nodes in each subgraph is in [0, l] and the subgraphs are independent of each other, so by Hoeffding's bound, with probability at least 1 − exp[−Θ(ε²n log(1/r)/log(1/(1 − ε/2)))] > 1 − δ/2, the error is at most εn.
Also, with probability more than 1 − ε′ > 1 − δ/2, all candidates are predicted correctly. So with probability more than 1 − δ, the classic group tests detect all the defective candidates with no error and, conditioned on this, the error over the subgraphs is at most εn, and we are done.

The same reasoning works for the subgrids of a grid and leads to the same upper bound with the parameters δ > 2e^{−Θ(ε²n log(1/r))} and ε′ < δ/2. But for trees, we cannot use Hoeffding's bound, because the connectivity of one subgraph can affect the others, violating the independence assumption that we need for Hoeffding's inequality. Before providing the new theorem for trees, we need the following lemma in order to use the concentration result for the node exposure martingale defined earlier.

Lemma 5.4. Let g_1, g_2, ..., g_⌈n/l⌉ be the subgraphs formed in Lemma 3.5. Let f(H) be the number of subgraphs g_i that are connected in the graph H. Then there is an order of node exposure such that at each step, the value of f does not change by more than one.

The proof is provided in Appendix C. The above lemma allows us to use Azuma's inequality for the groups made for trees, as described in the following theorem.

Theorem 5.5. Let l = max{log[1/(1 − ε/2)] / (2 log(1/r)), 1}, δ > 2 exp[−Θ(ε²n log(1/r)/log(1/(1 − ε/2)))], and ε′ < δ/2. Then there is an algorithm that uses IndepOPT(⌈n/l⌉, p, ε′) tests and finds the defective set with maximum error with parameters ε and δ, i.e. CrltOPT(Tree, r, p, δ, ε) ≤ IndepOPT(⌈n/l⌉, p, ε′).

Proof. We use the same algorithm and the same subgraphs introduced in Theorem 3.6. Recall from the proof of Theorem 3.6 that the probability of one group not being connected is at most 1 − r^{2l} and the average error is εn/2. Here we cannot use Hoeffding's bound to show concentration around the average; instead, we use a node exposure martingale. Note that Theorem 2.4 for the node exposure martingale only applies when the graph function is node Lipschitz, meaning that when H_1 and H_2 differ only in one node, |f(H_1) − f(H_2)| ≤ 1. But if we can find an order of exposed nodes v_1, ..., v_n such that the graph H_i on the first i nodes satisfies |f(H_{i+1}) − f(H_i)| ≤ 1, then we can still use Azuma's inequality. Let f(H) be the number of groups g_i connected in H. By Lemma 5.4 there is an order such that |f(H_{i+1}) − f(H_i)| ≤ 1, and we can use the concentration result (Theorem 2.4) for the number of connected groups in the random graph G_r. We now have the same error concentration as in Theorem 5.3 (equivalent to the Hoeffding bound), hence we can repeat the argument to complete the proof.

is an ancestor of the rest of the nodes. Let us call u and the other nodes at which we recurse "breaking points". Then any pair of breaking points are ancestor and descendant, and all the nodes added to S lie in subtrees of breaking points. So by connecting all the breaking points by a single path, which has length at most l (as the distance is less than that from u to v), we connect S; hence the smallest connecting closure of S has size at most l. Moreover, we have only included some subtrees of G, so by removing S, G remains connected and we can use the induction hypothesis for the remaining tree.

To illustrate the algorithm, consider the tree in Figure 3 with l = 5. We start with v, which is a deepest leaf, move up, and add node 7 to the subgraph. We cannot add u and its subtrees, as the size of the subgraph would exceed l. We therefore update l_1 = l − 2 = 3 and proceed with another subtree of u, say the right subtree containing node 8. The size of this subtree is one, so we can add it to the subgraph and update l_2 = l_1 − 1 = 2.
Now we continue with the updated l and the left subtree of u. Node 10 is a deepest leaf that we start with; we move up to u′, but we cannot add its subtree to the subgraph, as the size of u′ and its subtrees is bigger than l_2. So we add 10 and proceed with u′. Finally, our updated l is 1 and we only add 9 to the subgraph. So the final subgraph is {v, 7, 8, 9, 10}, and we can connect all of them by adding u and u′. Now we have found S that has smallest connecting closure at most l and includes only subtrees of G, so removing it does not disconnect the graph. We then save S as one of the target subgraphs and use the induction hypothesis on the rest of the graph. The other subgraphs formed have size l (except one) and small connecting closures. By adding S to these subgraphs, we get the desired grouping of all nodes, and the proof is completed.

Define an error random variable E such that E = 1 if ∥X − X̂∥ > εn and E = 0 otherwise, where X̂ is the algorithm's estimate of X and ∥·∥ is the number of non-zero elements in a vector. By Fano's inequality, we can bound the conditional entropy as

Proof. Let g_1, g_2, ..., g_⌈n/l⌉ be the subgraphs formed in Lemma 3.5, in this order. Consider the following order of node exposure: we first expose all the nodes in the last subgraph, g_⌈n/l⌉, in some order; then the nodes in the subgraph before it, g_{⌈n/l⌉−1}; and so on, until we expose the nodes of the first subgraph, g_1. Consider a node v in subgraph g_i when it is exposed. Note that no subgraph g_j, j > i, can become connected after exposing v, as by construction of the subgraphs, all the nodes of the connecting closure of g_j lie in the g_k with k > j. Also, no subgraph g_j, j < i, can become connected, because none of g_j's nodes are exposed yet. So f(H) can change by at most one, namely when the last exposed node of g_i makes g_i connected, and the lemma is proved.

References

- Adaptive group testing on networks with community structure
- Group testing: an information theory perspective
- Conservative two-stage group testing
- The probabilistic method
- Group testing with a graph infection spread model
- Multivariate Fuss-Catalan numbers
- Network group testing
- Group testing as a strategy for COVID-19 epidemiological monitoring and community surveillance
- Random multiple-access communication and group testing
- Semiquantitative group testing in at most two rounds
- Group testing with probabilistic tests: theory, design and application
- Randomized communication in radio networks
- Graph-constrained group testing
- Optimal group testing
- Noisy adaptive group testing using Bayesian sequential experimental design
- Combinatorial group testing and its applications
- The detection of defective members of large populations
- Group testing problems with sequences in experimental molecular biology
- Group testing against COVID-19
- Improved adaptive group testing algorithms with applications to multiple access channels and dead sensor diagnosis
- The giant component threshold for random regular graphs with edge faults
- Group testing problems in experimental molecular biology
- Group testing with prior statistics
- Sundara Rajan Srinivasavaradhan, Tao Guo, Christina Fragouli, and Suhas Diggavi. Group testing for connected communities
- Unconstraining graph-constrained group testing
- Group testing for SARS-CoV-2 allows for up to 10-fold efficiency increase across realistic scenarios and testing strategies. Frontiers in Public Health

A Proof of Lemma 3.5

Proof. We prove the lemma by induction on the number of nodes of G. For n = 1, 2, 3, the statement is trivially true.
Now suppose the lemma is true for any number of nodes less than n; we prove it for n. We aim to find a set of nodes of size l such that, first, by removing the set the graph remains connected, and second, the smallest connecting closure of the set is at most l. Then, removing this set and keeping it as one of the final subgraphs, we use the induction hypothesis for the rest of the graph.

To do that, suppose the tree is hung from an arbitrary node. Let v be a deepest leaf, meaning that any other node is at a higher or equal level than v. Let u be the first ancestor of v such that the subtree rooted at u, including u, has l or more nodes. If it has exactly l nodes, we can group them together, and the target subgraph is found. So suppose the subtree rooted at u has more than l nodes. Note that the distance from v to u is less than l. Now we form a subgraph with a small connecting closure. Starting with the empty set S, we add the subtree of the child of u that v is a descendant of, together with the child itself; call this subtree s_1. Note that the added set has k < l nodes, and by removing it from G, G remains connected.

Recursively, we do the same process for the other subtrees of u with the updated value l_1 = l − k. To be precise, u has subtrees other than s_1, as the subtree rooted at u has more than l nodes and s_1 has at most l − 2 nodes. Consider another subtree of u, called s_2. If |s_2| ≤ l_1, we update l_2 = l_1 − |s_2|, add s_2 to S, and continue with another subtree of u, which by the same argument exists. If |s_2| > l_1, then again we choose a deepest leaf of s_2 and proceed with the same process as before to find another group of nodes, i.e. we start with the deepest leaf and go up the tree until the subtree size exceeds l_1, and repeat the procedure. Note that after moving to the subtrees of u, we disregard the rest of the graph, so u is an ancestor of all the nodes we encounter next. Again, for the next recursion, the subtree of the node that exceeds the updated l