key: cord-0539035-u4j862wv authors: Haddad-Zaaknoon, Catherine A. title: Heuristic Random Designs for Exact Identification of Defectives Using Single Round Non-adaptive Group Testing and Compressed Sensing date: 2021-12-23 journal: nan DOI: nan sha: ecf9e4116ef54ee62d01b5a431fdd6c52ed2df62 doc_id: 539035 cord_uid: u4j862wv Among the challenges revealed by the COVID-19 pandemic outbreak is the problem of reducing the number of tests required to identify the virus carriers, in order to contain the viral spread while preserving test reliability. To cope with this issue, a prevalence testing paradigm based on a group testing and compressive sensing approach, or GTCS, was examined. In these settings, a non-adaptive group testing algorithm is designed to rule out sure-negative samples. Then, on the reduced problem, a compressive sensing algorithm is applied to decode the positives without requiring any further testing beyond the initial test matrix designed for the group testing phase. The result is a single-round non-adaptive group testing - compressive sensing algorithm that identifies the positive samples. In this paper, we propose a heuristic random method to construct the test design, called α-random row design or α-RRD. In the α-RRD, a random test matrix is constructed such that each test aggregates at most α samples in one group test, or pool. The pooled tests are heuristically selected one by one such that samples that were previously selected in the same test are less likely to be aggregated together in a new test. We examined the performance of the α-RRD design within the GTCS paradigm for several values of α. The experiments were conducted on synthetic data. Our results show that, for some values of α, a reduction of up to 10-fold in the number of tests can be achieved when the α-RRD design is applied in the GTCS paradigm.
The problem of group testing is the problem of identifying a small number of items or subjects, known as defective items or positive subjects, within a pile of elements using group tests or pools. Let X = [n] := {1, · · · , n} be a set of n items or subjects to be tested, and let I ⊆ X be the set of positive subjects such that |I| = d < n. A group test or a pool is a subset Q ⊆ X of subjects. A test result is positive if Q ∩ I ≠ ∅ and negative otherwise. Alternatively, we identify the test Q ⊆ X with an assignment a ∈ {0, 1}^n where a_i = 1 if and only if i ∈ Q, i.e., a_i = 1 signifies that item i participates in the test Q. The objective of group testing algorithms is to find the members of the set I with a minimum number of group tests. Group testing algorithms can be deterministic or randomized, adaptive or non-adaptive. In adaptive algorithms, the tests might be selected based on the results of previous ones. In non-adaptive algorithms, tests are independent and must not rely on previous results; therefore, all the tests can be performed in a single parallel step. The set of tests in any non-adaptive deterministic (resp. randomized) algorithm can be identified with an m × n (resp. random) test design matrix M (also called a pool design) whose rows are the assignments a that correspond to the group tests selected by the algorithm. The group testing algorithm considered in this work is random and non-adaptive. The group testing approach was first introduced during WWII [8], when Robert Dorfman, in 1943, suggested the method to reduce the expected number of tests needed to weed out all syphilitic soldiers in a specific unit. Since it was first suggested, group testing algorithms have been diversely applied in a wide range of areas.
These applications include DNA library screening [10, 23], product quality control testing [26], file searching in storage systems [18], sequential screening of experimental variables [19], efficient contention algorithms for MAC [18, 27], the random access communications problem [27], data compression [15], and computations in the data stream model [7]. For a brief history and other applications, the reader is referred to [1, 6, 7, 9, 10, 15, 16, 18, 21, 23] and the references therein. Among its recent applications, due to the recent pandemic outbreak, the group testing approach for accelerating COVID-19 testing was widely applied across the globe. Due to severe shortages in testing kit supply, a number of researchers adopted the group testing paradigm for COVID-19 mass testing, not only to accelerate the testing process but also to reduce the number of tests required to reveal positive virus carriers [2, 4, 11, 14, 22, 24, 29]. In many labs, COVID-19 detection was performed using PCR methods. PCR-based machines can perform multiple parallel tests in a single run, while each run can be several hours long. Driven by the process of PCR testing, non-adaptive group testing is best suited for these settings. In this context, the items in question are samples taken from potential patients, and the positive subjects are samples that test positive for the virus. While many researchers applied Dorfman's approach with multi-stage PCR runs, some have examined the possibility of using the method while designing tests that require no more than one PCR round. One of the promising directions is the group testing - compressed sensing paradigm (GTCS) used in [12, 13, 17, 25]. This method includes the following stages: initially, a test matrix M is designed for a single non-adaptive group testing round. Upon delivery of the test results, a two-stage decoding process is performed. The decoding process is purely combinatorial and does not involve any further sample testing.
Using standard non-adaptive group testing decoding (e.g., the COMP algorithm [5]), a substantial number of samples that tested negative for the virus are ruled out. Obviously, the main benefit of this phase is to reduce the dimension of the compressed sensing problem by cutting down the number of samples that need further decoding. This is crucial due to the computational complexity of compressed sensing algorithms. In the next stage, compressive sensing techniques are used on the reduced problem (e.g., Orthogonal Matching Pursuit - OMP [28], Fast-OMP [20]) to analyze and identify real carriers. The design of the test matrix M is crucial for both the group testing phase and the compressive sensing phase that follows. In [25], the design matrix is constructed using Reed-Solomon error correcting codes. The authors checked their method on a set of n = 384 samples in which 5 samples are positive (about 1.3%). For a pool size of 48 and using 48 group tests, they could recover all 5 positive samples. In the work of Jirong, Mudumbai and Xu [17], the authors investigated two types of pooling designs. The first is a Bernoulli random matrix where each entry is selected to be 1 or 0 with equal probability. The second design is obtained using expander graphs, where each column has a fixed number of non-zero entries. The designs tested in [12] are based on Kirkman triples. In this paper, we propose a heuristic random method to construct the test design M, called α-random row design or α-RRD. In the α-RRD, a random test matrix is constructed such that each test in M aggregates at most α < n samples in one group test. This model is useful in applications where test reliability might be compromised if the pool size is large. The matrix rows are selected one by one.
The main idea of the construction is to choose the non-zero entries of a new row according to two considerations: samples that belong to the same subject participate in a similar number of tests on average (fairness); and samples that were previously selected in the same test are less likely to be aggregated together in a new test (sparsity). We perform experiments on noiseless synthetic data to examine the performance of the design, while applying OMP (Orthogonal Matching Pursuit) as the compressive sensing algorithm. Practically, test designs need to be deterministic, meaning they need to be predefined before the testing process. To use a random test design, it is acceptable to simulate several random designs and choose the design that performs best on some set of data. This design is then adopted as a deterministic one for the real-time tests. Our experimental results suggest that using the GTCS framework with the α-RRD design can reduce the number of tests dramatically. First, the results imply that there is an evident correlation between the value of α and the performance of the process; choosing higher values of α can increase the success rate in identifying the positives. This can be observed from the results summarized in Fig. 3. For example, for n = 400 and a positives rate of 5%, the success rate for α = 10 is below 20%, while it can be increased beyond 95% when α = 22. This pattern is reproduced in all test configurations outlined in Fig. 3. In the context of COVID-19 testing, the term positives rate or positives ratio indicates the rate of positives within some community. The simulations show that the number of tests can be reduced by up to 10-fold, relative to testing samples one by one, when α = 48 for n = 900 and a positives rate of 1%. Table 3 summarizes the performance of the α-RRD design when the number of tests is fixed to m = 96 for positives rates of 1%, . . . , 5% and n = 400, . . . , 900.
An additional set of experiments examines the performance of the design when m = 384, for positives rates of 1%, . . . , 10% and a number of samples n that reaches 4000. Similar results are obtained in these dimension settings, as summarized in Table 2. The values of m are derived from practical considerations that involve the common sizes of PCR-based machines used in the industry. Let X = [n] := {1, · · · , n} be a set of n items or subjects, and let I ⊆ X be the set of positive (defective) items such that |I| = d ≪ n. A group test or a pool is a subset Q ⊆ X of items. The quantity p := |Q| is called the pool size. The result of the test Q with respect to I is defined by Q(I) = 1 if Q ∩ I ≠ ∅, and Q(I) = 0 otherwise. The set of tests in any non-adaptive group testing algorithm can be identified with an m × n test design matrix M (pool design) where each row corresponds to an assignment a ∈ {0, 1}^n that defines a group test selected by the algorithm. Upon performing the tests defined by M, each of the m assignments in M yields the value 1 or 0 according to whether the test contains at least one positive sample or not. Let y ∈ {0, 1}^m denote the test results obtained by performing the tests of M, and let x ∈ {0, 1}^n be a vector such that x_i = 1 if and only if i ∈ I. Formally, y = M ⊙ x (Equation (1)), where the ⊙ operation is defined as follows: for each 1 ≤ i ≤ m, y_i = ∨_{j ∈ H(M^(i))} x_j, where the ∨ operation is the logical OR and H(M^(i)) := {j : M_{i,j} = 1}. It is easy to see that the definition from Equation (1) is equivalent to y_i = Q_i(I), where Q_i := H(M^(i)) is the set that corresponds to the test defined by the ith row in M. Assume that each subject's sample can be measured by a real-valued number that expresses the magnitude or the load of the examined symptom (e.g., viral load in the COVID-19 case). Let x̂ ∈ R^n be an n-dimensional real-valued vector that signifies the symptom load of the subjects; i.e., for each 1 ≤ i ≤ n, x̂_i indicates the symptom load of subject i, where the value of x̂_i is directly proportional to the load. Symptom-free items have their corresponding load measure equal to 0.
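The equivalence between the OR-based definition of y and the thresholded matrix product can be checked directly; the snippet below (an illustrative consistency check with our own variable names, not the paper's code) compares the two on a random instance.

```python
import numpy as np

# Equation (1): y_i = OR_{j in H(M^(i))} x_j. Since M and x are 0/1,
# this coincides with thresholding the ordinary matrix product M @ x.
rng = np.random.default_rng(0)
M = (rng.random((6, 10)) < 0.3).astype(int)  # random 6 x 10 design
x = np.zeros(10, dtype=int)
x[[2, 7]] = 1                                # d = 2 positive items

# OR-based evaluation, pool by pool.
y_or = np.array([int(x[np.flatnonzero(row)].any()) for row in M])
# Matrix-product evaluation with thresholding.
y_mat = (M @ x > 0).astype(int)

assert (y_or == y_mat).all()  # the two definitions agree
```

The threshold form is also how the GTCS pipeline later derives the Boolean result vector y from the real-valued load readout ŷ.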
We assume that the number of positives d ≪ n; therefore, the load vector x̂ is d-sparse, meaning it includes only d non-zero entries. The objective is to restore the indexes of the non-zero entries in x̂. Similar to the definition of the result vector y from the GT settings, the design matrix M, also called the sensing matrix in the compressive sensing context, defines the load vector ŷ ∈ R^m, where each entry ŷ_i correlates with the load of the ith pool in M. That is, ŷ = M · x̂ (Equation (2)), where the (·) operation is the standard matrix multiplication; therefore, for each 1 ≤ i ≤ m, ŷ_i = Σ_{j ∈ H(M^(i))} x̂_j (Equation (3)). In this paper, we will be interested in restoring the indexes of the non-zero entries of the vector x̂, which is equivalent to restoring the binary vector x from the GT settings. The GTCS paradigm suggests a non-adaptive group testing generic algorithm for identifying the exact set of positives while using CS-based decoding techniques. The GTCS paradigm is composed of three basic phases. 1. Create and perform the actual tests: create a test design M and perform the group tests defined by the design. Practically, the outcome of this stage is a vector ŷ ∈ R^m as described in Equations (2) and (3). This stage is followed by a two-stage decoding process to exactly identify the set of positives. The vector y is derived from ŷ by assigning each entry y_i = 1 if ŷ_i > 0 and y_i = 0 otherwise. 2. Group testing decoding: using standard group testing decoding methods (e.g., the COMP algorithm [5]) on the problem y = M ⊙ x, a subset X_0 ⊆ X of items that are guaranteed by the GT algorithm to be negative samples is identified. This stage is used to reduce the size of the problem to be solved in the next stage. The rationale behind this step is to exploit the fact that GT decoding algorithms like COMP have zero false negatives (i.e., all samples that were detected by the algorithm as negative are actually negative).
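Phases 1 and 2 can be sketched as follows; comp_sure_negatives is our own illustrative implementation of the COMP-style elimination the paradigm relies on (the paper only references COMP [5], so details here are an assumption).

```python
import numpy as np

def comp_sure_negatives(M, y):
    """COMP-style GT decoding: every item that appears in at least one
    negative pool is surely negative (zero false negatives). Returns X0,
    the set of sure-negative item indices. Sketch under our own naming."""
    neg_pools = np.flatnonzero(y == 0)
    # An item is surely negative if some pool containing it tested 0.
    return set(np.flatnonzero(M[neg_pools].sum(axis=0) > 0))

M = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0]])
y_hat = np.array([0.7, 0.0, 0.7])  # phase 1: pooled loads y_hat = M . x_hat
y = (y_hat > 0).astype(int)        # Boolean results: [1 0 1]
X0 = comp_sure_negatives(M, y)
print(sorted(X0))                  # pool 1 is negative, so items 2, 3 in X0
```

Here item 0 (the only true positive, with load 0.7) never appears in a negative pool, so it survives to the compressive sensing stage, as the zero-false-negative property guarantees.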
Therefore, eliminating the set of sure-negative samples X_0 reduces the computational complexity of the step that follows, while keeping its decoding accuracy intact. The reduced compressive sensing problem is established by applying the following enhancements. Given the set X_0 that includes the sure negatives, we define a new set X_r := X \ X_0. Let Y_0 ⊆ [m] be the set of test indexes that yielded the result 0 in the previous stage. The new test design matrix M_r is constructed from M, X_0 and Y_0 by projecting M on the columns that correspond to the samples in X \ X_0 and the rows that appear in the set [m] \ Y_0. Therefore, the resulting matrix is an (m_r × n_r) binary matrix where m_r = m − |Y_0| and n_r = n − |X_0|. The reduced test result vector ŷ_r is derived from ŷ by deleting the entries that correspond to Y_0. 3. Compressive sensing decoding: by applying standard compressive sensing algorithms (e.g., OMP, LASSO) on the reduced problem ŷ_r = M_r · x̂_r, and using the results from the previous stage, the vector x̂_r, and therefore the vectors x̂ and x, can be restored. In this section, we propose a random design for GT. For this design, we restrict the pool size to be at most α < n. The design is constructed row by row. The main idea of the construction is to choose the non-zero entries in the new row according to two principles: 1. Fairness: elements that participated in the minimum number of tests in previous rows are more likely to be chosen in the new test. 2. Sparsity: elements that were previously selected in the same test are less likely to be assembled together in the new test. For a vector a = (a_1, . . . , a_n) ∈ {0, 1}^n, recall that H(a) := {i : a_i = 1} ⊆ [n]. The Hamming weight of a is denoted by ω(a) and is equal to ω(a) = |H(a)|. Let 0_n (resp. 1_n, ∞_n) denote the all-zero (resp. all-one, all-∞) vector of length n. Let A_(m×n) be an m × n matrix over {0, 1}.
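The construction of the reduced problem (M_r, ŷ_r) from X_0 and Y_0 described above amounts to a row/column projection of M; a minimal sketch, with our own helper names:

```python
import numpy as np

def reduce_problem(M, y_hat, X0):
    """Build the reduced sensing problem y_r = M_r . x_r by dropping the
    sure-negative columns X0 and the negative pools Y0 (a sketch; the
    variable names mirror the text, not the paper's code)."""
    Y0 = np.flatnonzero(y_hat == 0)                       # negative pools
    keep_rows = np.setdiff1d(np.arange(M.shape[0]), Y0)   # [m] \ Y0
    keep_cols = np.setdiff1d(np.arange(M.shape[1]), list(X0))  # X \ X0
    M_r = M[np.ix_(keep_rows, keep_cols)]
    y_r = y_hat[keep_rows]
    return M_r, y_r, keep_cols  # keep_cols maps reduced indices back to X

M = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0]])
y_hat = np.array([0.7, 0.0, 0.7])
M_r, y_r, cols = reduce_problem(M, y_hat, X0={2, 3})
print(M_r.shape)  # (2, 2): m_r = m - |Y0| = 2, n_r = n - |X0| = 2
```

The returned column map is what lets the final CS stage translate the support of x̂_r back into the original item indices.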
For all 1 ≤ i ≤ m, denote by A^(i) ∈ {0, 1}^n the ith row of the matrix A, by A^(j) the jth column, and by A_{i,j} the element of A in the ith row and jth column. The columns weight vector of a matrix A_{m×n} is a vector w = (w_1, . . . , w_n) ∈ R^n such that w_j = Σ_{i=1}^m A_{i,j}. Practically, the columns weight vector indicates the number of tests each element participated in. The procedure RRD(n, m, α) in Fig. 1 describes the RRD strategy to choose a random design A with m rows and n columns where each row has Hamming weight at most α. The algorithm starts by randomly choosing the first row of A from the set of binary vectors of length n and Hamming weight α. Assume that the first ℓ − 1 rows are already chosen, and let A_{ℓ−1} be the matrix defined by those rows. Let ŵ = (w_1, . . . , w_n) ∈ R^n be the columns weight vector of A_{ℓ−1}. Then, the algorithm chooses the first non-zero entry in the ℓth row uniformly at random from the set of indexes that correspond to the entries of minimal value in ŵ. This choice complies with the fairness principle. Let k < α be the number of non-zero entries that the algorithm has already chosen for the ℓth row. The (k + 1)th entry is chosen as follows. Let Q_k be the set of indexes of the non-zero entries chosen so far in the current row. Let Z be the set of row indexes i such that H(A^(i)) ∩ Q_k ≠ ∅, and let ŵ be the weight vector of the submatrix of A defined by the rows in Z. The algorithm evaluates ŵ in steps 2.1.2 and 2.1.3 in Fig. 1. The algorithm then constructs the set of indexes S ⊆ [n] that includes all the indexes j such that w_j is of minimum value among the entries in ŵ, and sums the corresponding columns. Let X be the set of column indexes with minimum value. Then, among the indexes in X, choose s ∈ X uniformly at random and assign A_{ℓ,s} = 1. These are steps 2.1.4 − 2.1.11 in Fig. 1. This choice complies with the fairness principle. The procedure CalcSelectedRows from Fig. 2(c) outputs a set of row numbers C ⊆ {1, . . .
 , ℓ − 1} such that i ∈ C if and only if H(A^(i)) ∩ Q_k ≠ ∅. The procedure UpdateWeight calculates the weight vector w = Σ_{i∈C} A^(i) (see step 2 in Fig. 2(a)). To ensure that the entries in Q_k are excluded from the selection of the next non-zero index, the weight vector w is updated to have the value ∞ in the corresponding indexes (see step 3 in Fig. 2(a)). The selection of the set S complies with the sparsity principle. The set S is derived from the weight vector ŵ by selecting the indexes with minimal weight, where the weight is evaluated over the rows that agree with one or more of the entries selected for the current test. The initialization step in SumColumns implies that the choice of the next non-zero entry of the current test will be from the indexes in S. Those indexes are the ones with minimum agreement with the current test; therefore, choosing the next non-zero entry from them is the best choice to keep the sparsity principle. The selection of the set X in step 2.1.8 complies with the fairness principle; among the best candidates from the indexes of S, the algorithm chooses s uniformly from those that appeared a minimum number of times over all the previous tests. In this section, we present test results on the performance of the α-RRD design when chosen as the test design in the GTCS generic paradigm. For the GT decoding, the COMP algorithm is selected to generate the initial sure-negative set, while OMP is used as the CS algorithm in the final stage. We performed the tests on noiseless synthetic data sets. In this set of simulations, we examine the performance of the α-RRD design for two major categories of settings. The first category fixes the number of tests m to 96, while the choice of m in the second category is m = 384.
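Putting the fairness and sparsity rules together, the row-by-row construction described above can be sketched as below. This is our simplified approximation of the procedure in Fig. 1; the exact tie-breaking of steps 2.1.2-2.1.11 may differ from the paper's.

```python
import numpy as np

def alpha_rrd(n, m, alpha, rng=None):
    """Simplified sketch of the alpha-RRD construction. Row 1 is a uniform
    weight-alpha vector; each later entry is drawn among the columns of
    minimal agreement weight with the entries already chosen for the
    current row (sparsity), breaking ties by minimal overall column
    weight (fairness). Our approximation, not the paper's exact code."""
    rng = rng or np.random.default_rng()
    A = np.zeros((m, n), dtype=int)
    A[0, rng.choice(n, size=alpha, replace=False)] = 1
    for l in range(1, m):
        col_w = A[:l].sum(axis=0)  # tests each item participated in so far
        # First entry: uniform among minimum-weight columns (fairness).
        Q = [rng.choice(np.flatnonzero(col_w == col_w.min()))]
        while len(Q) < alpha:
            # Rows that share at least one entry with the current row.
            rows = np.flatnonzero(A[:l, Q].sum(axis=1) > 0)
            w = A[rows].sum(axis=0).astype(float)  # agreement weights
            w[Q] = np.inf                          # exclude chosen entries
            S = np.flatnonzero(w == w.min())       # sparsity candidates
            X = S[col_w[S] == col_w[S].min()]      # fairness tie-break
            Q.append(rng.choice(X))
        A[l, Q] = 1
    return A

A = alpha_rrd(n=20, m=8, alpha=4, rng=np.random.default_rng(0))
print(A.sum(axis=1))  # every row has Hamming weight alpha = 4
```

Generating the design is polynomial in m and n, matching the complexity claim made for the α-RRD in the conclusion.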
Despite the fact that the construction from Section 3 does not impose any limitation on the parameter m, the choices of the values of m in the experiments are derived from the number of tests that can be performed in parallel in most PCR machines used in the industry. For m = 96, the examined pool sizes range from 10 to 48, when applicable. The positives ratio, denoted here by p, ranges between 1% − 5%, and the tests were performed on sets of n = 400, 500, . . . , 1000. In each setting, the minimum choice of α is derived from the applicability of the compressive sensing algorithm on the constructed matrix, while the maximum value of the pool size matches the maximum value tested for COVID-19 PCR pool designs [25]. Fig. 3 describes the success rate in restoring the defective items. The x-axis denotes the total number of defectives d in the set of n elements. The success rate is examined while applying different values of α. In this set of experiments, for each n and each possible value of d that is up to 5% of the total number of elements n, the target vector x̂ is chosen such that its d non-zero entries are selected uniformly at random, while the symptom load of each non-zero entry is chosen uniformly over the real range [0, 1]. For each choice of n, d and α, the success rate is evaluated over 100 different m × n α-RRD matrices and 100 corresponding vectors x̂ for each design. For each x̂ and design pair, we check whether GTCS restored the vector x correctly. We remind the reader that x is the binary vector derived from x̂ whose non-zero entries indicate the set of positive items. The success rate in Fig. 3 reflects the number of trials in which exact restoration of x was achieved over the total number of trials, i.e., 100. Our experimental results suggest that using the GTCS framework with the α-RRD design can reduce the number of tests dramatically.
Moreover, the results imply that there is a correlation between the value of α and the performance of the process; choosing higher values of α can increase the success rate in identifying the positives. This can be observed from the results summarized in Fig. 3. For example, for n = 400 and p = 5%, the success rate for α = 10 is below 20%, while it can be increased beyond 95% when α = 22. This pattern is reproduced in all test configurations outlined in Fig. 3. In addition, for n = 400 and a positives rate p = 1%, a success rate of 99% is achieved for a pool size as small as 10. This implies that we can test up to n = 400 samples with only 96 tests when the positives rate is 1%, with a success rate as high as 99%. When the positives rate rises to 5%, to achieve this success rate, α needs to be increased up to 28. For n = 900, designs with pool size α = 48 can be used to identify the positives when p = 1% using the same number of tests. This indicates that the number of tests can be reduced by up to 10-fold when α = 48 for n = 900. For rates higher than that, no value of α ≤ 48 could achieve the success rate of 99%. Table 1 summarizes the best (minimal) choice of α for each n and p to achieve a success rate higher than 99%. For m = 384, we tested the performance of GTCS with RRD designs when α ranges from 8 up to 48 and n ranges from 800 up to 2000. In this set, for each choice of n and α, the success rate is evaluated on 10 random designs. Table 2 lists the minimal choice of α that is required to achieve a high success rate (≥ 99%) for each value of n and positives ratio p. In addition, we tested a configuration with n = 4000, α = 48 and positives rates up to p = 1.5%.
For this configuration, we achieved a success rate of 100% for our choice of α = 48 and a positives rate p = 1.3%; this is a 10-fold improvement in the number of tests required to test a group whose magnitude reaches 4000 samples. Both sets of experiments were performed on a simulator specifically programmed to implement the algorithms used in these tests. The simulator can be easily configured to provide an α-RRD design for any choice of α, n, m and p. When these parameters are provided, the simulator generates a configurable number of test designs, among which it chooses the one that performs best on its auto-generated synthetic datasets. In this paper, we suggested a new random pooling design, the α-RRD. This design can be used as part of the GTCS paradigm in order to build a single-round non-adaptive group testing protocol that exactly identifies the positives within a large set of elements. The complexity of generating the α-RRD design is polynomial in the dimensions of the design, and the design can be easily generated for any dimensions m and n. Given the parameters m, n and the positives rate, the best choice for the parameter α can be concluded using computer simulations. Moreover, since random sensing matrices can perform well with compressive sensing algorithms, the GTCS paradigm can be further tested with other well-known group testing random designs such as RID, RrSD, RsSD and the Transversal design [3]. Similarly, other compressive sensing algorithms can be applied too. We performed the tests in this paper on noiseless synthetic data sets; however, from a practical point of view, it is encouraged to check the performance on noisy data or even real-life data, while applying standard methods designed for coping with the inaccuracies and errors dictated by testing processes.

References:
- Queries and concept learning
- Bounds for the number of tests in non-adaptive randomized algorithms for group testing
- Pooling for sars-cov-2 control in care institutions. medRxiv
- Venkatesh Saligrama and Samar Agnihotri. Non-adaptive group testing: Explicit bounds and novel algorithms
- Group Testing
- What's hot and what's not: Tracking most frequent items dynamically
- The detection of defective members of large populations
- Combinatorial group testing and its applications
- Pooling designs and nonadaptive group testing: Important tools for DNA sequencing
- Ad hoc laboratory-based surveillance of sars-cov-2 by real-time rt-pcr using minipools of rna prepared from routine respiratory samples. medRxiv
- Ajit Rajwade and Manoj Gopalkrishnan. A compressed sensing approach to group-testing for covid-19 detection. arXiv: Quantitative Methods
- Group testing against covid-19
- Group testing for image compression
- A method for detecting all defective members in a population by group testing
- Low-cost and high-throughput testing of COVID-19 viruses and antibodies via compressed sensing: system concepts and computational experiments
- Nonrandom binary superimposed codes
- A sequential method for screening experimental variables
- Fast non-negative orthogonal matching pursuit
- A group testing method for finding patterns in data
- Analysis and applications of adaptive group testing methods for covid-19. medRxiv
- A survey on combinatorial group testing algorithms with applications to dna library screening
- Efficient and practical sample pooling for high-throughput pcr diagnosis of covid-19. medRxiv
- Efficient high throughput sars-cov-2 testing to detect asymptomatic carriers. medRxiv
- Group testing to eliminate efficiently all defectives in a binomial sample
- Born again group testing: Multiaccess communications
- Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition
- Evaluation of covid-19 rt-qpcr test in multi-sample pools. medRxiv