key: cord-0184112-go8kkrdx
authors: Cohen, Alejandro; Cohen, Asaf; Jaggi, Sidharth; Gurewitz, Omer
title: Secure Group Testing
date: 2016-07-17
journal: nan
DOI: nan
sha: 60d2f29365cda9608988c5a1be1457efa05735fb
doc_id: 184112
cord_uid: go8kkrdx

The principal mission of Group Testing (GT) is to identify a small subset of"defective"items from a large population, by grouping items into as little as possible test pools. The test outcome of a pool is positive if it contains at least one defective item, and is negative otherwise. GT algorithms are utilized in numerous applications, and in most of them the privacy of the tested subjects, namely, whether they are defective or not, is critical. In this paper, we consider a scenario where there is an eavesdropper (Eve) which is able to observe a subset of the GT outcomes (pools). We propose a new non-adaptive Secure Group Testing (SGT) algorithm based on information theoretic principles, which keeps the eavesdropper ignorant regarding the items' status. Specifically, when the fraction of tests observed by Eve is $0 leq delta<1$, we prove that the number of tests required for both correct reconstruction at the legitimate user (with high probability) and negligible mutual information at Eve's side is $frac{1}{1-delta}$ times the number of tests required with no secrecy constraint.

The classical version of Group Testing (GT) was suggested during World War II in order to identify syphilis infected draftees while dramatically reducing the number of required tests [1] . Specifically, when the number of infected draftees, K, is much smaller than the population size, N , instead of examining each blood sample individually, one can conduct a small number of of pooled samples. Each pool outcome is negative if it contains no infected sample, and positive if it contains at least one infected sample. The problem is thus to identify the infected draftees via as few pooled tests as possible. Figure 1 (a)-(c) depicts a small example.

Since its exploitation in WWII, GT was utilized in numerous fields, including biology and chemistry [2] , [3] , communications [4] - [7] , sensor networks [8] , pattern matching [9] and web services [10] . GT also found applications in the emerging field of Cyber Security, e.g., detection of significant changes in network traffic [11] , Denial of Service attacks [12] and indexing information for data forensics [13] .

Many applications which utilize GT include sensitive information which should not be revealed if some of the tests leak or if the tests are distributed to several labs for parallel processing. However, in GT, a leakage of even a single pooltest may contain significant information on the tested items. I.e., if it is found to be negative it indicates that none of the items in the pool are defective. If it is positive, at least one of the items in the pool is defective (see Figure 1 (d) for a short example). Accordingly, ensuring that a leakage of a fraction of the pool-tests to undesirable or malicious eavesdroppers A. Cohen, A. Cohen and O. Gurewitz are with the Department of Communication Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel (e-mail: alejandr@post.bgu.ac.il; coasaf@post.bgu.ac.il; gurewitz@post.bgu.ac.il). S. Jaggi is with the Department of Information Engineering, Chinese University of Hong Kong (jaggi@ie.cuhk.edu.hk). Parts of this work were presented at the IEEE International Symposium on Information Theory, ISIT 2016.

does not give them any useful information on the status of the items is crucial. It is very important to note that protecting GT is different from protecting the communication between the parties. To protect the communication, one should merely protect the bits transmitted, and can do so by encoding them before the transmission. However, in GT, we do not want to assume one entity has access to all pool-tests, and can apply an encoding function to them before they are exposed. We also do not want to assume a mixer can add a certain substance that will prevent a third party from testing the sample. To protect GT, one should make sure that without altering mixed samples, if a fraction of them leaks, either already tested or not, information is not revealed.

While the current literature includes several works on the privacy in GT algorithms for digital objects [13] - [16] , these works are based on cryptographic schemes, assume the testing matrix is not known to all parties, ensue a high computational burden, and, last but not least, assume the computational power of the eavesdropper is limited [17] , [18] . Information theoretic security [18] , [19] , on the other hand, can offer privacy at the price of additional tests, without keys, obfuscation or assumptions on limited power.

In this work, we formally define Secure Group Testing (SGT), suggest SGT algorithms based on information theoretic principles and analyse their performance. In the considered model, there is an eavesdropper which may observe part of the information from the outcome pool-tests vector. The goal of the test designer is to design the tests such that a legitimate decoder can decode the status of the items (defective or not) with an arbitrarily small error probability, yet, as long as the eavesdropper gains only part of the output vector (a fraction δ), she cannot (asymptotically) gain any significant information on the status of the items.

We propose a SGT code, and two algorithms which satisfy reliability as well as weak or strong secrecy conditions. We further provide sufficiency and necessity bounds on the number of tests required, and show that the code construction, together with Maximum Likelihood (ML) decoding is order-optimal (in N , K and δ). The second algorithm, while missing the exact behaviour with respect to δ, is highly efficient, maintains the reliability and secrecy guarantees, yet requires only O(N 2 T ) operations, where T is the number of tests.

We do so by proposing a model, which is, in a sense, analogous to a channel model, as depicted in Figure 2 . The subset of indices pointing to the (in hindsight) defective items in this analogy takes the place of a confidential message; the testing matrix represents the design of the pools: a matrix row can be considered as codeword, where each row corresponds to a separate item. The decoding algorithm is analogous to a channel decoding process, and the eavesdropped signal is the output of an erasure channel, namely, having only part of the transmitted signal from the legitimate source to the legitimate receiver.

In the suggested construction, instead of each item receiving a single test vector which determines exclusively in which pool-tests it should participate, each item receives a random vector, chosen from a set of random and independent vectors. The vectors are, in a sense, dissimilar when the number of tests is large. Namely, we use stochastic encoding, and each vector determines different pool-tests in which the item should participate. For each item the lab picks one of the vectors in its set at random, and the item participates in the pooltests according to this randomly chosen vector. A schematic description of our procedure is depicted in Figure 4 . It is important to note, though, that the randomness is required only at the lab (the "mixer"), and it is not a shared key in any sense.

Accordingly, by obtaining a pool-test result, without knowing the specific vectors chosen by the lab for each item, the eavesdropper may gain only negligible information regarding the items themselves. Specifically, we show that even though the pool-tests in which each item participated can be chosen randomly and even though the legitimate user does not know a-priori in which pool-tests each item has participated, the legitimate user will be able to identify the defective items, while the eavesdropper, observing only a subset of the pooltest results, will have no significant information regarding the status of the items.

The structure of this work is as follows. In Section II, a SGT model is formally described. In Section III, we summarize the related work. Section IV includes our main results, with the direct proved in Section V and converse proved in Section VI. Section VII describes an efficient algorithm, and proves an upper bound on its error probability. Section VIII concludes the paper.

In SGT, a legitimate user desires to identify a small unknown subset K of defective items form a larger set N , while reducing the number of measurements T as possible and keeping the eavesdropper, which is able to observe a subset of the tests results, ignorant regarding the status of the N items. N = |N |, K = |K| denote the total number of items, and the number of defective items, respectively. The status of the items, defective or not, should be kept secure from the eavesdropper, but detectable to the legitimate user. We assume that the number K of defective items in K is known a priori. This is a common assumption in the GT literature [3] .

Throughout the paper, we use boldface to denote matrices, capital letters to denote random variables, lower case letters to denote their realizations, and calligraphic letters to denote the alphabet. Logarithms are in base 2 and h b (·) denotes the binary entropy function. Figure 3 gives a graphical representation of the model. In general, and regardless of security constraints, GT is defined by a testing matrix Figure 1 : An example of test results, a simple decoding procedure at the legitimate decoder and the risk of leakage. The example includes 7 items, out of which at most one defective (the second one in this case; unknown to the decoder). Three pooled tests are conducted. Each row dictates in which pooled tests the corresponding item participates. (a) Since the first result is negative, items 1 and 6 are not defective. (b) The second result is positive, hence at least one of items 2 and 4 is defective. (c) Based on the last result, as item 4 cannot be defective, it is clear that 2 is defective. Note that decoding in this case is simple: any algorithm which will simply rule out each item whose row in the matrix is not compatible with the result will rule out all but the second item, due to the first and last test results being negative, thus identifying the defective item easily. (d) An evesdropped which has access to part of the results (the first two) can still infers useful information. Our goal is construct a testing matrix such that such an eavesdropper remains ignorant. where each row corresponds to a separate item j ∈ {1, . . . , N }, and each column corresponds to a separate pool test t ∈ {1, . . . , T }. For the j-th item,

. , X j (T )} is a binary row vector, with the t-th entry X j (t) = 1 if and only if item j participates in the t-th test. If A j ∈ {0, 1} is an indicator function for the j-th item, determining whether it belongs to the defective set, i.e., A j = 1 if j ∈ K and A j = 0 otherwise, the pool tests outcome Y T is

where is used to denote the boolean OR operation, and the second equality is since only the defective items will have influence in the boolean operation on the outcome vector.

In SGT, we assume an eavesdropper which observes a noisy vector Z T = {Z(1), . . . , Z(T )}, generated from the outcome vector Y T . In the erasure case considered in the work, the probability of erasure is 1 − δ, i.i.d. for each test. That is, on average, T δ outcomes are not erased and are accessible to the eavesdropper via Z T . Therefore, in the erasure case, if B t ∈ {1, ?} is an erasure indicator function for the t-th pool test, i.e., B t = 1 with probability δ, and B t =? with probability 1 − δ, the eavesdropper observes

Denote by W ∈ {1, . . . , N K } the index of the subset of defective items. We assume W is uniformly distributed, that is, there is no a-priori bias to any specific subset. Further, denote byŴ (Y T ) the index recovered by the legitimate decoder, after observing Y T . We refer to the testing matrix, together with the decoder as a SGT algorithm. As we are interested in the asymptotic behavior, the following definition lays out the goals of SGT algorithm. (2) Weakly secure: At the eavesdropper, observing Z T , we have lim

(3) Strongly secure: At the eavesdropper, observing Z T , we have lim

To conclude, The goal is to design (for parameters N and K) an N × T measurement matrix and a decoding algorithm W (Y T ), such that observing Y T , the legitimate user will identify the subset of defective items (with high probability), yet, observing Z T , the eavesdropper cannot (asymptotically) identify the status of the items, defective or not.

Performance Bounds: GT can be non-adaptive, where the testing matrix is designed beforehand, adaptive, where each new test can be designed while taking into account previous test results, or a combination of the two, where testing is adaptive, yet with batches of non-adaptive tests. It is also important to distinguish between exact recovery and a vanishing probability of error.

To date, the best known lower bound on the number of tests required (non-adaptive, exact recovery) is Ω( K 2 log K log N ) [20] . The best known explicit constructions were given in [21] , resulting in O(K 2 log N ). However, focusing on exact recovery requires more tests, and forces a combinatorial nature on the problem. Settling for high probability reconstructions allows one to reduce the number of tests to the order of K log N . 1 For example, see the channel-coding analogy given in [22] , [23] . A similar analogy to wiretap channels will be at the basis of this work as well. In fact, probabilistic methods with an error probability guarantee appeared in [24] , without explicitly mentioning GT, yet showed the O(K log N ) bound. Additional probabilistic methods can be found in [25] for support recovery, or in [26] , when an interesting phase transition phenomenon was observed, yielding tight results on the threshold (in terms of the number of tests) between the error probability approaching one or vanishing. Finally, note that the significant difference between exact reconstruction and negligible error probability is also apparent in adaptive GT, e.g., [27] .

A Channel Coding Interpretation: As mentioned, the analogy to channel coding proved useful [22] , [23] . [28] defined the notion of group testing capacity, that is, the value of

T under which reliable algorithms exist, yet, over which, no reliable reconstruction is possible. A converse result for the Bernoulli, non-adaptive case was given in [29] . 1 A simple information theoretic argument explains a lower bound. There are K defectives out of N items, hence N K possibilities to cover: log N K bits of information. Since each test carries at most one bit, this is the amount of tests required. Stirling's approximation easily shows that for K N , the leading factor of that is K log N .

Strong converse results were given in [30] , [31] , again, building on the channel coding analogy, as well as converses for noisy GT [32] . In [27] , adaptive GT was analyzed as a channel coding with feedback problem.

Efficient Algorithms: A wide variety of techniques were used to design efficient GT decoders. Results and surveys for early non-adaptive decoding algorithms were given in [33] - [35] . Moreover, although most of the works described above mainly targeted fundamental limits, some give efficient algorithms as well. In the context of this work, it is important to mention the recent COMP [36] , DD and SCOMP [37] algorithms, concepts from which we will use herein. Efficient GT decoding is also possible via polar codes [38] or LDPC [39] .

Noisy Group Testing and Graph Models: Since GT has numerous applications, several different models were suggested, with diverse tools being applied to tackle them. Perhaps the most studied extension is that of noisy GT. Noisy models can include cases of false positives, where with some probability a negative pool is still marked as positive, or false negatives if defective items get diluted. [22] includes several results in this context. Additional converse results for noisy GT are also in [32] . Noise level can also depend on the number of items tested in each pool [38] . Analysis of noisy GT from a Signal to Noise ratio perspective is given in [40] .

In threshold group testing [41] - [43] , the test results are sensitive to the number of defective items within a test, hence, are not necessarily a Boolean sum. A recent work also allowed the presence of inhibitors, which can mask test results [34] , [44] . The capacity of Semi-quantitative group testing was discussed in [45] , where the tests' output is not necessarily binary (positive or negative). In a similar context, [46] considered the case where the testing matrix has to be sparse (either because an item cannot be tested too many times, or because a test cannot include too many items). [47] considered the case where the number of defectives, K, is unknown in advance. [48] considered the case where the items are ordered, and the defective set is a consecutive subset of the items.

Finally, since at the heart of GT stands a binary testing matrix, dictating in which test each item participates, GT also has a clean bipartite graph representation, when the Left set of vertices may stand for the items, and the Right for the tests. Indeed, such a representation gives a very elegant combinatorial result regarding the number of errors a certain testing matrix can withstand [49] . Results on cover-free families yield lower bounds for GT as well [50] . These definitions also lead to a natural definition for list GT [51] , where the goal is to output a list containing all defective items, and perhaps additional ones up to the size of the list. Bipartite graphs for GT and noisy GT where also used in [52] , [53] . [54] considered a more structured graph model, i.e., when the bipartite graph is regular, with items having degree l and tests having degree r.

Under the model definition given in Section II, our main result are the following sufficiency (direct) and necessity (converse) conditions, characterizing the maximal number of tests required to guarantee both reliability and security. The proofs are deferred to Section V and Section VI.

The sufficiency part is given by the following theorem.

for some ε ≥ 0 independent of N and K, then there exists a sequence of SGT algorithms which are reliable and secure. That is, as N → ∞, both the average error probability approaches zero and an eavesdropper with leakage probability δ is kept ignorant.

The construction of the SGT algorithm, together with the proofs of reliability and secrecy are deferred to Section V. However, a few important remarks are in order now.

First, rearranging terms in eq. (1), we have

That is, compared to only a reliability constraint, the number of tests required for both reliability and secrecy is increased by the multiplicative factor 1 1−δ , where, again, δ is the leakage probability at the eavesdropper.

Since for a fixed K and large enough N we have log N −K i = Θ(i log N ), we have the following corollary on the number of tests required.

Corollary 1. For SGT with parameters K << N and T , reliability and secrecy can be maintained with

Note that this suggests a Θ (K log N ) result for δ bounded away from 1.

The necessity part is given by the following theorem.

Theorem 2. Let T be the minimum number of tests necessary to identify a defective set of cardinality K among population of size N while keeping an eavesdropper, with a leakage probability δ < 1, ignorant regarding the status of the items. Then, for large enough N one must have

The lower bound is derived using Fano's inequality to address reliability, assuming a negligible mutual information at the eavesdropper, thus keeping an eavesdropper with leakage probability δ ignorant, and information inequalities bounding the rate of the message on the one hand, and the data Eve does not see on the other. Compared with the lower bound without security constraints, it is increased by the multiplicative factor 1 1−δ .

Returning to the analogy in [28] between channel capacity and group testing, one might define by C s the (asymptotic) minimal threshold value for log N K /T , above which no reliable and secure scheme is possible. Under this definition, the result in this paper show that C s = (1 − δ)C, where C is the capacity without the security constraint. Clearly, this can be written as

raising the usual interpretation as the difference between the capacity to the legitimate decoder and that to the eavesdropper [55] . Note that as the effective number of tests Eve sees is T e = δT , her GT capacity is δC.

Under the SGT model definition given in Section II, we further consider a computationally efficient algorithm at the legitimate decoder. Specifically, we analyze the Definite Non-Defective (DND) algorithm (originally called Combinatorial Orthogonal Matching Pursuit (COMP)), considered for the non-secure GT model in the literature [36] , [37] . The theorem below states that indeed efficient decoding (with arbitrarily small error probability) and secrecy are possible, at the price of a higher T . Interestingly, the theorem applies to any K, and not necessarily only to K = O(1). This is, on top of the reduced complexity, an important benefit of the suggested algorithm.

Theorem 3. Assume a SGT model with N items, out of which K are defective. Then, for any δ < 1 2 1 − ln 2 K , there exists an efficient decoding algorithm, requiring O(N 2 T ) operations, such that if the number of tests satisfies

its error probability is upper bounded by

The construction of the DND GT algorithm, together with the proofs of reliability and secrecy are deferred to Section VII.

In order to keep the eavesdropper, which obtains only a fraction δ of the outcomes, ignorant regarding the status of the items, we randomly map the items to the tests. Specifically, as depicted in Figure 4 , for each item we generate a bin, containing several rows. The number of such rows corresponds to the number of tests that the eavesdropper can obtain, yet, unlike wiretap channels, it is not identical to Eve's capacity, and should be normalized by the number of defective items. Then, for the j-th item, we randomly select a row from the j-th bin. This row will determine in which tests the item will participate.

In order to rigorously describe the construction of the matrices and bins, determine the exact values of the parameters (e.g., bin size), and analyze the reliability and secrecy, we first briefly review the representation of the GT problem as a channel coding problem [22] , together with the components required for SGT.

A SGT code consists of an index set I = {1, 2, . . . N K }, its w-th item corresponding to the w-th subset K ⊂ {1, . . . , N }; A discrete memoryless source of randomness (R, p R ), with known alphabet R and known statistics p R ; An encoder,

which maps the index W of the defective items to a matrix X T Sw of codewords, each of its rows corresponding to a different item in the index set S w . The need for a stochastic encoder is similar to most encoders ensuring information theoretic security, as randomness is required to confuse the eavesdropper about the actual information [55] . Hence, we define by R K the random variable encompassing the randomness required for the K defective items, and by M the number of rows in each bin. Clearly, M K = H(R K ).

At this point, and important clarification is in order. The lab, of course, does not know which items are defective. Thus, operationally, it needs to select a row for each item. However, in the analysis, since only the defective items affect the output (that is, only their rows are summed up to give Y T ), we refer to the "message" as the index of the defective set w and refer only to the random variable R K required to choose the rows in their bins.

A decoder at the legitimate user is a map

The probability of error is P (Ŵ (Y T ) = W ). The probability that an outcome test leaks to the eavesdropper is δ. We assume a memoryless model, i.e., each outcome Y (t) depends only on the corresponding input X Sw (t), and the eavesdropper observes Z(t), generated from Y (t) according to

p(Y (t)|X Sw (t))p(Z(t)|Y (t)).

We may now turn to the detailed construction and analysis.

1) Codebook Generation: Set M = 2 T δ− K . Using a distribution P (X T ) = T i=1 P (x i ), for each item generate M independent and identically distributed codewords x T (m), 1 ≤ m ≤ M , where can be chosen arbitrarily small. The codebook is depicted in the left hand side of Figure 4 . Reveal the codebook to Alice and Bob. We assume Eve may have the codebook as well.

2) Testing: For each item j, the lab selects uniformly at random one codeword x T (m) from the j-th bin. Therefore, the SGT matrix contains N randomly selected codewords of length T , one for each item, defective or not. Amongst is an unknown subset X T Sw , with the index w representing the true defective. An entry of the j-th random codeword is 1 if the j-item is a member of the designated pool test and 0 otherwise.

3) Decoding at the Legitimate Receiver: The decoder looks for a collection of K codewords X T Sŵ , one from each bin, for which Y T is most likely. Namely,

Then, the legitimate user (Bob) declaresŴ (Y T ) as the set of bins in which the rowsŵ reside.

Let (S 1 , S 2 ) denote a partition of the defective set S into disjoint sets S 1 and S 2 , with cardinalities i and K − i, respectively. Let I(X S 1 ; X S 2 , Y ) denote the mutual information between X S 1 and X S 2 , Y , under the i.i.d. distribution with which the codebook was generated and remembering that Y is the output of a Boolean channel. The following lemma is a key step in proving the reliability of the decoding algorithm.

then, under the codebook above, as N → ∞ the average error probability approaches zero.

Next, we prove Lemma 1, which extends the results in [56] to the codebook required for SGT. Specifically, to obtain a bound on the required number of tests as given in Lemma 1, we first state Lemma 2, which bounds the error probability of the ML decoder using a Gallager-type bound [57] .

Definition 2. The probability of error event E i in the ML decoder defined, as the event of mistaking the true set for a set which differs from it in exactly i items.

where the error exponent E o (ρ) is given by

In Appendix A we analyze the bound provided in Lemma 2. Note that there are two main differences compared to nonsecured GT. First, the decoder has N K M K possible subsets of codewords to choose from, N K for the number of possible bins and M K for the number of possible rows to take in each bin. Thus, when fixing the error event, there are N −K i M i subsets to confuse the decoder. Moreover, due to the bin structure of the code, there are also many "wrong" codewords which are not the one transmitted on the channel, hence create a decoding error codeword-wise, yet the actual bin decoded may still be the right one.

Proof of Lemma 1. For this lemma, we follow the derivation in [56] . However, due the different code construction, the details are different. Specially, for each item there is a bin of codewords, from which the decoder has to choose. Define

We wish to show that T f(ρ) → ∞ as N → ∞. Note that log K i is a constant for the fixed K regime. Thus for large T we have lim T →∞ f(0) = 0. Since the function f(ρ) is differentiable and has a power series expansion, for a sufficiently small α, by Taylor series expansion in the neighborhood of ρ ∈ [0, α] we have

Hence, if

for some constant ε > 0, the exponent is positive for large enough T and we have P (E i ) → 0 as T → ∞ for ρ > 0.

Using a union bound one can show that taking the maximum over i will ensure a small error probability in total.

The expression I(X S 1 ; X S 2 , Y ) in Lemma 1 is critical to understand how many tests are required, yet it is not a function of the problem parameters in any straight forward mannar. We now bound it to get a better handle on T . Claim 1. For large K, and under a fixed input distribution for the testing matrix ( ln(2) K , 1 − ln(2) K ), the mutual information between X S 1 and (X S 2 , Y ) is lower bounded by 

where equality (a) follows since the rows of the testing matrix are independent, and (b) follows since H(Y |X S ) is the uncertainty of the legitimate receiver given X S , thus when observing the noiseless outcomes of all pool tests, this uncertainty is zero. Also, note that the testing matrix is random and i.i.d. with distribution (1 − q, q), hence the probability for i zeros is q i . Then, under a fixed input distribution for the testing matrix (p = ln (2) K , q = 1 − ln(2) K ) and large K it is easy to verify that the bounds meet at the two endpoint of i = 1 and i = K, yet the mutual information is concave in i thus the bound is obtained. This is demonstrated graphically in Figure 5 .

Applying Claim 1 to the expression in Lemma 1, we have

Hence, substituting M = 2 T δ− K , a sufficient condition for reliability is

with some small . Rearranging terms results in

Noting that this is for large K and N , and that ε is independent of them, achieves the bound on T provided in Theorem 1 and reliability is established.

We now prove the security constraint is met. Hence, we wish to show that I(W ; Z T )/T → 0, as T → ∞. Denote by C T the random codebook and by X T S the set of codewords corresponding to the true defective items. We have,

where T → 0 as T → ∞. (a) is since there is a 1 : 1 correspondence between (W, R K ) and X T S ; (b) is since R K is independent of W and C T ; (c) is since in this direct result the codebook is defined by the construction and is memoryless, as well as the channel; (d) is since by choosing an i.i.d. distribution for the codebook one easily observes that I(X S ; Z|C T ) ≤ δ. Finally, (e) is for the following reason: Given W and the codebook, Eve has a perfect knowledge regarding the bins from which the codewords were selected. It requires to see whether she can indeed estimate R K . Note that the channel Eve sees in this case is the following multiple access channel: each of the defective items can be considered as a "user" with M messages to transmit. Eve's goal is to decode the messages from all users. This is possible if the rates at which the users transmit are within the capacity region of this MAC. Indeed, this is a (binary) Boolean MAC channel, followed by a simple erasure channel. The sum capacity cannot be larger than δ, and this sum capacity is easily achieved by letting one user transmit at a time, or, in our case, time sharing between the users. Thus, each user sees a capacity of δ/K, hence Eve may decode if 1 T log M is smaller than that value. Remark 1. Under non-secure GT, it is clear that simply adding tests to a given GT code (increasing T ) can only improve the performance of the code (in terms of reliability). A legitimate decoder can always disregard the added tests. For SGT, however, the situation is different. Simply adding tests to a given code, while fixing the bin sizes, might make the vector of results vulnerable to eavesdropping. In order to increase reliability, one should, of course, increase T , but also increase the bin sizes proportionally, so the secrecy result above will still hold. This will be true for the efficient algorithm suggested in Section VII as well.

In this section, we derive the necessity bound on the required number of tests. LetZ denote the random variable corresponding to the tests which are not available to the eavesdropper. Hence, Y = (Z,Z). By Fano's inequality, if P e → 0, we have

where T → 0 as T → ∞. Moreover, the secrecy constraint implies

where T → 0 as T → ∞. Consequently,

where (a) follows from Fano's inequality and since Y = (Z,Z), (b) follows from (2), (c) follows from the Markov chain W → X T Sw → Y → Z and (d) is since conditioning reduces entropy.

We now evaluate H(Z). Denote byĒ the set of tests which are not available to Eve and byĒ γ the event {|Ē| ≤ T (1 − δ)(1 + γ)} for some γ > 0. We have

where the last inequality follows from the Chernoff bound for i.i.d. Bernoulli random variables with parameter (1 − δ) and is true for some f (γ) such that f (γ) > 0 for any γ > 0.

Thus, we have

That is,

for some T such that T → 0 as T → ∞. This completes the converse proof.

The achievability result given in Theorem 1 uses a random codebook and ML decoding at the legitimate party. The complexity burden in ML, however, prohibits the use of this result for large N . In this section, we derive and analyze an efficient decoding algorithm, which maintains the reliability result using a much simpler decoding rule, at the price of only slightly more tests. The secrecy constraint, as will be clear, is maintained by construction, as the codebook and mixing process do not change compared to the achievability result given before. Moreover, the result in this section will hold for any K, including even the case were K grows linearly with N .

Specifically, we assume the same codebook generation and the testing procedure given in Section V, and analyze the Definite Non-Defective (DND) algorithm, previously considered for the non-secure GT in the literature [36] , [37] . The decoding algorithm at the legitimate user is as follows. Bob attempts to match the rows of X with the outcome vector Y T . If a particular row j of X has the property that all locations t where it has 1, also corresponds to a 1 in Y (t), then that row can correspond to a defective item. If, however, the row has 1 at a location t where the output has 0, then it is not possible that the row corresponds to a defective item. The problem, however, when considering the code construction in this paper for SGT, is that the decoder does not know which row from each bin was selected for any given item. Thus, it takes a conservative approach, and declares an item as defective if at least one of the rows in its bin signals it may be so. An item is not defective only if all the rows in its bin prevent it from being so.

It is clear that this decoding procedure has no false negatives, as a defective item will always be detected. It may have, though, false positives. A false positive may occur if all locations with ones in a row corresponding to a non-defective item are hidden by the ones of other rows corresponding to defective items and selected by the mixer. To calculate the error probability, fix a row of X corresponding to a nondefective item (a row in its bin). Let j 1 ; . . . ; j k index the rows of X corresponding to the K defective items, and selected by the mixer for these items (that is, the rows which were actually added by the Boolean channel). An error event associated with the fixed row occurs if at any test where that row has a 1, at least one of the entries X j1 (t), . . . , X j k (t) also has a 1. The probability for this to happen, per column, is p(1 − (1 − p) K ). Hence, the probability that a test result in a fixed row is hidden from the decoder, in the sense that it cannot be declared as non defective due to a specific column, is

Since this should happen for all T columns, the error probability for a fixed row is 1 − p(1 − p) K T . Now, to compute the error probability for the entire procedure we take a union bound over all M (N − K) rows corresponding to nondefective items. As a result, we have

In the above, (a) follows by taking p = y/K and setting T as βK log N , for some positive y and β, to be defined. (b) follows since e −y ≤ (1 − y/n) n−1 for small y > 0 and any integer n > 0. In the sequence below, we will use it with y = ln 2, for which it is true. (c) follows since e −x ≥ (1 − x/n) n for x > 0 and any integer n > 0. (d) follows by choosing y = ln 2. (e) is by setting M = 2 T δ− K and substituting the value for T . The result in (3) can be interpreted as follows. As long as δ, the leakage probability at the eavesdropper, is smaller than 1 2 (1 − ln 2 K ), choosing T = βK log N with a large enough β results in an exponentially small error probability. For example, for large enough K and δ = 0.25, one needs β > 4, that is, about 4K log N tests to have an exponentially small (with N ) error probability while using an efficient decoding algorithm. To see the dependence of the error probability on the number of tests, denote

Then, if the number of tests satisfies

Thus, while the results under ML decoding (Theorem 1) show that any value of δ < 1 is possible (with a 1 1−δ toll on T compared to non-secure GT), the analysis herein suggests that using the efficient algorithm, one can have a small error probability only for δ < 1/2, and the toll on T is greater than 1 1 2 −δ . This is consistent with the fact that this algorithm is known to achieve only half of the capacity for non-secure GT [37] . However, both these results may be due to coarse analysis, and not necessarily due to an inherent deficiency in the algorithm. Remark 2 (Complexity). It is easy to see that the algorithm runs over all rows in the codebook, and compares each one to the vector of tests' results. The length of each row is T . There are N items, each having about 2 δ K T rows in its bin. Since T = O(K log N ), we have O(N 2 ) rows in total. Thus, the number of operations is O(N 2 T ) = O(KN 2 log N ). This should be compared to O(KN log N ) without any secrecy constraint. Figure 6 includes simulation results of the secure DND GT algorithm proposed, compared with ML decoding and the upper and lower bounds on the performance of ML.

In this paper, we proposed a novel non-adaptive SGT algorithm, which with parameters N, K and T is asymptotically reliable and secure. Specifically, when the fraction of tests observed by Eve is 0 ≤ δ < 1, we prove that the number of tests required for both correct reconstruction at the legitimate user (with high probability) and negligible mutual information at Eve's side is 1 1−δ times the number of tests required with no secrecy constraint. We further provide sufficiency and necessity bounds on the number of tests required in the SGT model to obtains both, reliability and secrecy constraints. Moreover, we analyze in the proposed secure model, computationally efficient algorithms at the legitimate decoder, previously considered for the non-secure GT in the literature which identify the definitely non-defective items.

In this section, we give a direct proof for Lemma 1. Specifically, we bound the error probability of the maximum likelihood decoder, and show that under the condition on T in the lemma, the error probability is indeed approaching 0 as N → ∞.

Under the assumption that all items are equality likely to be defective, we consider the error probability given that the first K items are defective, that is, S 1 ∈ I is the defective set. Denote this probability by P e|1 . We have

where E i is the event of a decoding error in which the decoder declares a defective set which differs from the true one (S 1 ) in exactly i items.

In general, we follow the derivation in [22] . However, there is a key difference. In the code construction suggested in Section V, for each item there are M possible codewords (a "bin" of size M ). Only one of these codewords is selected by the mixer to decide in which pool tests the item will participate. Thus, when viewing this problem as a channel coding problem, if an item is defective, one and only one codeword out of its bin is actually transmitted (and summed with the codewords of the other K − 1 defective items). Since the decoder does not know which codewords were selected in each bin (the randomness is known only to the mixer), there are multiple error events to consider. E.g., events where the decoder choose the wrong codeword for some items, yet identified parts of the bins correctly, and, of course, events where the codeword selected was from a wrong bin. This complicates the error analysis significantly. Moreover, we wish to employ the correction suggested in [56] , which results in a simpler yet stricter bound.

Consider the event E i . Clearly, E i can be broken down into two disjoint events. The first is E i and the event that the codewords selected for the correct K − i items are the true transmitted ones, and the second is the event of both E i and the event that at least one of the codewords selected for the correct items is wrong. Denote the first event as E i . Now, consider the case where we have E i , that is, a correct decision on K − i items, yet, out of these K − i items, the decoder identified only j codewords right, 0 ≤ j < K − i, and for the rest, it identified a wrong codeword in the right bin. We claim that the probability for this event is exactly P (E K−j ), as the decoder has to mistake X for X S1 , where both are collections of K codewords, and X shares exactly j codewords with X S1 . Thus, we have:

The last inequality in (4) is loose, yet it suffices for the proof in the regime where K = O(1). A more delicate bounding technique might be required if K might grow with N . We now bound P (E i ). Particularly, we will establish the following lemma.

Lemma 3. The error probability P (E i ) is bounded by

where the error exponent E o (ρ) is given by

the set of indices corresponding to sets of K items that differ from the true defective set S 1 in exactly i items. S 1 c ,w for w ∈ I denotes the set of items which are in S w but not in S 1 .

We have

where (5) is exactly [22, eq. (25) ], as when considering E i we assume the decoder not only got K − i items right, but also the correct codeword in each such bin. Hence

for all s > 0 and 0 ≤ ρ ≤ 1. Note that (a) is since the probability is less than 1 and can be raised to the power of ρ. (b) is critical. It follows from the symmetry of codebook construction, namely, the inner summation depends only on the codewords in X S 1 c ,w but not the ones in S 1 c ,w . Due to the code's construction, i.e., its binning structure, there are exactly N −K i M i possible sets of codewords to consider for S 1 c ,w . (c) follows as the sum of positive numbers raised to the ρ-th power is smaller than the sum of the ρ-th powers.

We now continue similar to [22] , substituting the conditional error probability just derived in a summation over all codewords and output vectors. We have

There are K K−i sets S 1,w , and the summation does not depend on which set is it, hence, we get

which continue similar to [22] , results in

Thus, we have

(STRONG SECRECY) We now prove the security constraint is met. Hence, we wish to show that I(W ; Z T ) → 0. Denote by C T the random codebook and by X T Sw the set of codewords corresponding to the true defective items.

We assumed W ∈ {1, . . . , N K } is uniformly distributed, that is, there is no a-priori bias to any specific subset. Further, the codebook is generated independent codewords and identically distributed, hence the eavesdropper, observing Z T , wish to decode the true K independent and identically distributed codewords, which correspond to the defective items from N K subsets. To analyze the information leakage at the eavesdropper, we note that the channel Eve sees in this case is the following multiple access channel. Each of the items can be considered as a "user" with M specific codewords. Eve's goal is to identify the active users by decode all the transmitted codewords. This is possible if the rates at which the users transmit are within the capacity region of this MAC. Indeed, this is a (binary) Boolean MAC channel, followed by a simple erasure channel. The sum capacity to Eve cannot be larger than δ, this sum capacity is achieved in our case by sharing the channel between the active users. Thus, Eve's channel obtain from each active user a capacity of at most δ/K. Consequently, Eve may obtain a sub-matrixZ of possibly transmitted codewords, where from each codeword Eve sees at most a capacity of δ/K.

However, in our analysis we help Eve to get the true sub-matrixZ, with the maximum rate δ/K, that she can obtain in the Boolean MAC channel followed by erasure channel with erasure probability of 1 − δ. Providing this information to Eve only makes her stronger and thus a coding scheme that succeeds to keep Eve ignorant, will also succeed against the original Eve. Now, once Eve obtain the true sub-matrixZ, since the codebook and the subsets of the items was generated independent and identically distributed, we will analyze the information that leakage at the eavesdropper from each possibly transmitted codeword inZ separately. Namely, per original codeword X T j , on average, δ K T from the δT outcomes are not erased and are accessible to the eavesdropper viaZ T j . That is, where for each item the encoder generate M independent and identically distributed codewords, there are an exponential number of codewords, that from eavesdropper perspective, may have been participate in X T Sw that would have resulted in the sameZ T j . These consistent codewords are exactly those that lie in a ball aroundZ T j of radius d ≈ (1 − δ K )T as depicted in Figure 7 . The eavesdropper doesn't know what is the codeword X T j in X T Sw , which was selected by the mixer and he even doesn't know what d H = X T j −Z T j is exactly (where · is the hamming distance), other than the fact that d ≈ (1 − δ K )T . However, in our analysis we help Eve by providing d and by choosing a small yet exponential set of codewords (of size 2 T K for an appropriately small ) from the codebook C chosen from the set of all codewords at distance d fromZ T j , with the additional guarantee that it also contains the true X T j . We refer to this set as the oracle-given set X T

Oracel . Again providing this information to Eve only makes him stronger and thus a coding scheme that succeeds to keep Eve ignorant, will also succeed against the original Eve. Conditioned on Eves viewZ T j and X T Oracel , each of the codewords in X T Oracel is equally likely to have been participated in the pool tests. Nonetheless, Eve still has a reasonable amount of uncertainty about which of the 2 T K codewords was actually participated indeed, this is the uncertainty that we leverage in our analysis.

We define, for anyZ T j and d, ball and shell as

Sh(Z T j , d) = {X T j : d H (X T j ,Z T j ) = d} Hence, we define the probability for a codeword to fall in shell as P r(X T j ∈ C ∩ X T j ∈ Sh(Z T j , d)) = Vol(Sh(Z T j , d)) 2 T = 2 (1− δ K )T 2 T .

For each item we have M = 2 ( δ+ K )T codewords. Thus, the number of codewords Eve sees on a shell, per defective item is

Hence, we can conclude that on average for every defective item, Eve has quite a few options in particular shell. Now, where the average number of codewords per item |{w : X T j (w) ∈ Sh(Z T j , d)}| is 2 T K , the probability that the actual number of options deviates from the average by more than ε is (1 − ε)2 T K ≤ |{w : X T j (w) ∈ Sh(Z T j , d)}| ≤ (1 + ε)2 T K , which is the probability that the number of the codewords per item is not in the range. We define

Let fix a pairZ T j and d for large enough T , thus by Chernoff bound and taking union bound overZ T j and d, we have P r(E C1 (Z T j , d)) ≥ 1 − 2 −ε 2 T K . Due to the super exponential decay in T , even where we take a union bound over all the codewords and all shells, we get that the probability that a codeword will have |{w : X T j (w) ∈ Sh(Z T j , d)}| options significantly different than 2 T K is very small, actually with high probability, it has almost the same number of options per item, hence, for Eve, all codewords are are almost equiprobable and I(W ; Z T ) → 0.

The detection of defective members of large populations

Combinatorial group testing and its applications

Probabilistic nonadaptive group testing in the presence of errors and DNA library screening

Group detection for synchronous Gaussian codedivision multiple-access channels

Graphconstrained group testing

Partition information and its transmission over Boolean multi-access channels

Achievable partition information rate over noisy multi-access Boolean channel

Joint source-channel communication for distributed estimation in sensor networks

k-mismatch with dont cares

Testing web services using progressive group testing

What's new: finding significant differences in network data streams

Detecting application denial-of-service attacks: A group-testing-based approach

Indexing information for data forensics

Private combinatorial group testing

Efficient private matching and set intersection

The secrecy of compressed sensing measurements

Chapter 16 cryptographic problems and philosophies

Physical-Layer Security: From Information Theory to Security Engineering

The wire-tap channel

Onr-cover-free families

Explicit nonadaptive combinatorial group testing schemes

Boolean compressed sensing and noisy group testing

Correction to "Boolean compressed sensing and noisy group testing

On two random search problems

Limits on support recovery with probabilistic models: An information-theoretic framework

Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM

Adaptive group testing as channel coding with feedback

The capacity of adaptive group testing

The capacity of bernoulli nonadaptive group testing

Strong impossibility results for sparse signal processing

Strong converses for group testing in the finite blocklength regime

Converse bounds for noisy group testing with arbitrary measurement matrices

A survey on nonadaptive group testing algorithms through the angle of decoding

Optimal two-stage algorithms for group testing problems

Efficiently decodable non-adaptive group testing

Non-adaptive group testing: Explicit bounds and novel algorithms

Group testing algorithms: bounds and simulations

Polar coding for group testing

Note on noisy group testing: asymptotic bounds and belief propagation reconstruction

Near-optimal adaptive compressed sensing

Nonadaptive algorithms for threshold group testing

Improved constructions for non-adaptive threshold group testing

Near-optimal stochastic threshold group testing

Non-adaptive group testing with inhibitors

Semi-quantitative group testing

Nearly optimal sparse group testing

Competitive group testing and learning hidden vertex covers with minimum adaptivity

Group testing for consecutive positives

Derandomization and group testing

Some new bounds for cover-free families

Noise-resilient group testing: Limitations and constructions

Grotesque: noisy group testing (quick and efficient)

Saffron: A fast, efficient, and robust framework for group testing based on sparse-graph codes

An analysis on non-adaptive group testing based on sparse pooling graphs

Physical-Layer Security: From Information Theory to Security Engineering

Correction to "boolean compressed sensing and noisy group testing

Information theory and reliable communication