key: cord-0674179-yesnrz44 authors: Gebhard, Oliver; Loick, Philipp title: Note on the offspring distribution for group testing in the linear regime date: 2021-03-24 journal: nan DOI: nan sha: 8fa44ec80fa567f60091535559d73a7699097f74 doc_id: 674179 cord_uid: yesnrz44 The group testing problem is concerned with identifying a small set of $k$ infected individuals in a large population of $n$ people. At our disposal is a testing scheme that can test groups of individuals. A test comes back positive if and only if at least one individual is infected. In this note, we lay groundwork for analysing belief propagation for group testing when $k$ scales linearly in $n$. To this end, we derive the offspring distribution for different types of individuals. With these distributions at hand, one can employ the population dynamics algorithm to simulate the posterior marginal distribution resulting from belief propagation. 1.1. Motivation and related work. The group testing problem, originally introduced by Dorfman [6] , is a prime example of a statistical inference problem. The problem can be formulated as follows. Consider a population of n individuals k of which suffer from a rare decease. Furthermore, there is a test scheme available that can test groups of individuals. A test returns positive if and only if there is at least one infected individual in the tested group. The goal is to recover all infected individuals within the total population with the smallest number of tests with high probability (w.h.p.) 1 . In the last years this problem gained significant attention (see [3] for a detailed survey) and finds application in fields such as DNA sequencing [9, 12] , protein interaction experiments [11, 14] or the current COVID-19 pandemic [7] . Dorfman's original approach to recover the set of infected individuals was to perform a two-stage testing scheme. In the first stage, groups of individuals would be tested. If a test returned positive, each individual would subsequently be tested separately. Conversely, a negative test ensures that every individual in the test is uninfected and thus no further testing is needed. While this scheme improves on individual testing for low infection rates, it is clearly not optimal. Over the years, a variety of algorithms have been proposed and analysed to cut down the number of tests. Due to its amenable properties, mathematical research on group testing focused primarily on the sublinear regime where k ∼ n θ for some θ ∈ (0, 1). By today, the information-theoretic and algorithmic thresholds in this regime are well understood. In [5] , Coja-Oghlan et al. provide a novel pooling scheme and a sophisticated polynomial-time algorithm achieving reliable recovery with the information-theoretically minimum number of tests possible for one-stage and multi-stage test designs. The natural next step of the group testing research is to move to the regime where the number of infected individuals scales linearly in the number of infected individuals (as for instance of relevance in the ongoing pandemic [8] ). Next to Dorfman's original two-stage algorithm, a number of algorithms from the sublinear regime readily carry over. First, consider a plain random regular test design where each individual is assigned to a fixed number of tests uniformly at random without replacement 2 . A first idea would be to apply the one-stage so-called COMP algorithm in which we classify all individuals in negative tests as uninfected and all other individuals as infected. There is a slightly more sophisticated approach called DD under which we first classify all individuals in negative tests as uninfected as above and remove them from the test design. Then, we search for any individuals included in a positive test on its own (after removal of the definitively uninfected from before) and classify them as infected. All other individuals are classified as uninfected. Since no one-stage algorithm can recover the set of infected individuals in the linear regime w.h.p. [2] , both of these approaches will not entirely solve the group testing problem, but they might serve as starting points for a first group testing stage with a follow-up stage of individual testing for those individuals that are not definitively uninfected or infected. Interestingly, the DD algorithm closely resembles the well-known warning propagation algorithm from mathematical physics which belongs to the hand tool in solving constraint satisfaction problems. Indeed, the first two steps of the DD algorithm are precisely what warning propagation would stipulate for the group testing problem. Some prominent applications of warning propagation in other realms include graph colouring [10] , constraint satisfaction problems [1] , and the k-core of a graph [13] . The set of definitively uninfected and definitively infected individuals in the group testing problem correspond to the hard fields identified by warning propagation in physics jargon. The interesting question is what to do Oliver Gebhard and Philipp Loick are supported by DFG 646/3. 1 We say that a sequence of events E n occurs with high probability if lim n→∞ P(E n ) = 1. 2 A regular design is provably superior to a Bernoulli-type design in the sublinear regime and likely superior as well in the linear regime. 1 with those individuals that are not definitively uninfected or infected? Both the COMP and DD algorithm remain clueless. For the sublinear regime, finding the remaining set of infected individuals boils down to solving a minimum vertex cover problem and a greedy vertex cover algorithm called SCOMP has been suggested, which, however, does not improve performance beyond DD [4] . For the linear regime, things are trickier and the default approach of mathematicians and physicists would be to apply belief propagation to a random regular test design as described above. However, analysing belief propagation in the linear regime is a challenging endeavour due to the breakdown of central-limit-like properties. In this note, we lay some groundwork towards analysing belief propagation in the linear regime. To be precise, we consider different types of non-hard individuals, i.e. individuals that are not classified as definitely uninfected or infected by the warning propagation algorithm and determine their offspring distribution. We analyse the second neighbourhood of any such individual type and derive the distribution of the different types of non-hard individuals. With these distributions at hand, one can employ the population dynamics algorithm to simulate the posterior marginal distribution resulting from belief propagation for each of the non-hard individual types. Before we can state our result, let us introduce some notation. In the following, we consider individuals x 1 , ..., x n out which k are infected. The infection status of each individual is encoded by a configuration σ ∈ {0, 1} V . We call the set of infected individuals V 1 and the set of uninfected individuals V 0 . Furthermore we consider tests a 1 , ..., a m and denote the set of tests by F . We assume the number of tests conducted to scale as m = cn for some constant c > 0. With G denoting a graph describing the assignment of individuals to tests and using standard graph notation such ∂x and ∂a for the first neighbourhood of individual x or test a, letσ ∈ {0, 1} F denote the test outcomes being determined bŷ We employ a fixed variable individual degree model. Each individual has degree ∆ = cd/λ for some constant d. We restrict our attention to constants c and d such that ∆ is integer. Each individual draws its tests from the set F without replacement, leading to a fluctuating test degree (Γ a ) a∈F . This test design gives rise to different types of individuals. For uninfected individuals, we distinguish two types, V − 0 and V + 0 . V − 0 is the set of all uninfected individuals that show up in at least one negative test. V + 0 is the counterset. Formally, Recall from the above, that the DD algorithm which is the application of warning propagation to the group testing problem easily classifies V − 0 -the definitely uninfected. Along the same lines, let V − 1 denote the set of all infected individuals that show up in at least one test where all remaining individuals are from the set V − 0 . Again, those individuals are easily classified as infected by warning propagation. Formally, If we consider an individual x we can divide the second neighbourhood of x into four sets To shorten notation, we refer to these sets as V + 1,(2) ,V + 0,(2) ,V − 1,(2) ,V − 0,(2) , respectively. Let O 0+ denote the event that the considered individual x is from V + 0 and O 1+ the event that it is from V + 1 . Moreover, we write O 1+,a for the event that test a in the neighbourhood of x is compatible with the individual under consideration being in V + 1 and O 0+,a for the event that test a is compatible with the considered individual being in V + 0 . All that O 0+,a says is that test a must feature an infected individual inducing the test to be positive and thus the considered individual to be in V + 0 . The same goes for O 1+,a . Our test design will be denoted by G. In the following, we will typically consider a reduced graph G ′ obtained as follows. First, remove all negative tests from G and all individuals assigned to these negative tests, i.e. the set V − 0 . Next, remove all individuals that now are included in at least one positive test with no other individual, i.e. the set V − 1 . Finally, remove any tests in the neighbourhood of V − 1 . Let us denote this reduced graph G ′ which will be the primary object of study hereafter. Note, that this graph model precisely remains after we run warning propagation on G. Put differently, all removed individuals will have completely polarised marginals after running belief propagation and the remove tests are inconsequential for the posterior distribution of the remaining individuals. For an individual x, the following holds for ∂ (2] x under G ′ . with (X i ) i≥1 being a sequence of independent Bernoulli random variables with parameter q 0 . with (X ′ i ) i≥1 being a sequence of independent Bernoulli random variables with parameter q 1 and furthermore, (Y i ) i≥1 being a sequence of independent random variables with support in (N 0 ) 2 and probability density function for any i ∈ N and j , k ≥ 0 given by Since individuals are assigned to tests independently, we can characterise the distribution of infected and uninfected individuals as follows. |∂a ∩ V 0 | ∼ Bin (n − k, d/k) and |∂a ∩ V 1 | ∼ Bin (k, d/k) . as claimed. Proof. Infected individuals are assigned to tests mutually independent. In combination with Lemma 2.2, we find for a test a that Thus, the expected number of positive tests in the neighbourhood of any uninfected individual x ∈ V 0 is The lemma now follows readily from the fact that ∆ = Θ(1) as n → ∞. We have Proof. By Lemma 2.3 and the fact that the maximum number of uninfected individuals per test is O log n with probability at least 1 − o n −2 we find for any test a ∈ ∂x for x ∈ V 1 The lemma now readily follows from the fact that ∆ = Θ(1) as n → ∞. Offspring for x ∈ V + 0 . Next, we will characterize the second neighbourhood of an individual x ∈ V + 0 in the reduced graph G ′ . For clarity, let us denote this individual by x r . To get started, for each test that in the neighborhood of x r we can calculate the probability that no individual from V − 1 is contained in its second neighborhood. Note that we identify such tests since they are removed in G ′ . Thus, a test in the neighborhood of x r remains unexplained with q 0 . Using this insight, we obtain the following two results for the offspring distribution in the second neighbourhood of x r . Lemma 2.6. Let (X i ) i≥1 be a sequence of independent Bernoulli random variables with parameter q 0 . Given O 0+ we have Proof. Given O 0+ , we know that any test a ∈ ∂x r contains an infected individual and is thus positive. Starting from the distribution of uninfected individuals in test a from Fact 2.1 and using Lemma 2.3 and the fact that the maximum test degree is O(log n) with probability at least 1 − o n −2 , we find as n → ∞. The lemma now follows readily from the fact that warning propagation removes tests featuring individuals from V − 1 and that we have ∆ = Θ(1). Lemma 2.7. Let (X i ) i≥1 be a sequence of independent Bernoulli random variables with parameter q 0 . Given Proof. Another application of Bayes Theorem reveals for any test a ∈ ∂x r and any i ≥ 0 Indeed, (2.5) is the probability density function of the distribution Po ≥1 d q 1,∆−1 . Thus, the lemma follows from the fact that warning propagation removes tests featuring individuals from V − 1 and ∆ = Θ(1). Offspring for x ∈ V + 1 . We adopt the same notation as above. To be precise, we will write O 1+ to denote that the considered individual is from V + 1 . Moreover, we write O 1+,a for the event that test a is compatible with the considered individual being in V + 1 , i.e. the test featuring at least one other infected individual or individual from V + 0 . Lemma 2.8. For any a ∈ ∂x r , we have Proof. By Bayes Theorem we have Plugging (2.7)-(2.9) into (2.6) yields the desired expression. Next, let us specify the offspring distribution for V + 0, (2) and V + 1, (2) . In contrast to having x ∈ V + 0 as the starting individual, the distribution of both types is not independent and we have to specify a joint distribution. Lemma 2.9. Given O 1+ we have with (X i ) i≥1 be a sequence of independent Bernoulli random variables with parameter q 1 and (Y i ) i≥1 being a sequence of independent random variables with support in (N 0 ) 2 and probability density function for any i ∈ N and j , k ≥ 0 given by Proof. For any test a ∈ ∂x r and i , j ≥ 0, we find by Bayes Theorem Lower bounds for random 3-sat via differential equations Individual testing is optimal for non-adaptive group testing in the linear regime Group testing: an information theory perspective. Foundations and Trends in Communications and Information Theory Information-theoretic and algorithmic thresholds for group testing Optimal group testing. 33rd Annual Conference on Learning Theory (COLT 2020) The detection of defective members of large populations Pool testing of sars-cov-02 samples increases worldwide test capacities many times over Covid-19 data repository Pooling designs and nonadaptive group testing: important tools for dna sequencing The freezing threshold for k-colourings of a random graph Designing pooling systems for noisy high-throughput protein-protein interaction experiments using boolean compressed sensing A survey on combinatorial group testing algorithms with applications to dna library screening Sudden emergence of a giant k-core in a random graph A new pooling strategy for high-throughput screening: the shifted transversal design