key: cord-0699281-8kccpd4x authors: Zhu, Junan; Rivera, Kristina; Baron, Dror title: Noisy Pooled PCR for Virus Testing date: 2020-04-08 journal: bioRxiv DOI: 10.1101/2020.04.06.026765 sha: 49b4ed9d9ca2bbcd3fb20be16448f8b4101e9735 doc_id: 699281 cord_uid: 8kccpd4x Fast testing can help mitigate the coronavirus disease 2019 (COVID-19) pandemic. Despite their accuracy for single sample analysis, infectious diseases diagnostic tools, like RT-PCR, require substantial resources to test large populations. We develop a scalable approach for determining the viral status of pooled patient samples. Our approach converts group testing to a linear inverse problem, where false positives and negatives are interpreted as generated by a noisy communication channel, and a message passing algorithm estimates the illness status of patients. Numerical results reveal that our approach estimates patient illness using fewer pooled measurements than existing noisy group testing algorithms. Our approach can easily be extended to various applications, including where false negatives must be minimized. Finally, in a Utopian world we would have collaborated with RT-PCR experts; it is difficult to form such connections during a pandemic. We welcome new collaborators to reach out and help improve this work! Motivation. Reverse transcription polymerase chain reaction (RT-PCR) is a prevalent diagnostic tool for infectious diseases, like coronavirus disease 2019 (COVID-19). RT-PCR is a labor intensive technical procedure that requires numerous trained laboratory personnel to analyze one patient sample [1] . Briefly, ribonucleic acid (RNA) is isolated from a patient's respiratory tract and purified for reverse transcription, a process where the RNA template is turned into complementary deoxyribonucleic acid (cDNA). cDNA, along with specific viral primers, is loaded into a machine, where cDNA is amplified and annealed to the target sequence. While extending through each PCR cycle, a reporter dye is cleaved or broken from a probe to amplify fluorescence intensity and reveal a positive sample. Despite its accuracy for single sample analysis, RT-PCR requires substantial resources to test a large number of samples. Instead, we aim to develop a scalable testing procedure that allows for patient samples to be combined before PCR. Main idea. Noisy group testing is used to analyze RT-PCR data from mixed or pooled samples, as recently demonstrated for COVID-19 [2] . The goal of group testing is twofold. First, to increase the accuracy of testing for each individual patient by combining information from multiple pooled measurements that sample genetic material from that same individual. Second, to use a reduced number of measurements, especially in settings where a large population is being tested, most patients are healthy, and so many individual measurements will come out negative and thus show that multiple patients are healthy. In summary, group testing allows to evaluate large populations at high throughput, low per-patient diagnostic costs, and low false positive and negative probabilities. Noiseless group testing has been established, but noisy group testing algorithms are less mature. For example, recent work on COVID-19 [2] , [3] uses pooled tests to rule out patients corresponding to negative pooled measurements. Their approach implicitly relies on false negatives being rare in RT-PCR, but diluting many samples may increase false negatives [4] . Additionally, patients corresponding to positive pooled measurements are later tested individually [3] , which does not benefit from pooling. Our algorithm (Sec. III) applies pooling to identify individual sick patients. Recently, researchers across the world have been looking to increase the sample size per PCR run by using custom barcodes for each sample and then pooling them together [5] . Custom barcoding is not new in terms of multiplexed genetic sequencing [6] . Briefly, custom barcodes for each patients RNA sample are designed by an algorithm and substituted in as the reverse transcriptase (RT) primers to generate barcoded cDNA. Next-generation sequencing is performed after a single pooled PCR reaction, and then demultiplexed to determine each samples viral content. Our method of pooling samples before adding barcodes could be used by researchers for a quicker time to analysis and also as a complementary method to reanalyze barcoded samples. Contributions and organization. We focus on a simple pooled testing model for RT-PCR (Sec. II). This model is converted to a linear inverse problem (Sec. II-A), and our goal is to estimate a vector of patient illness status, x, from a vector of noisy RT-PCR measurements, y, a matrix A relating patients and measurements, and statistical information about false positives and negatives (Sec. II-B and Fig. 1 ). This estimation problem is solved using generalized approximate message passing (GAMP) [7] in Sec. III. Promising numerical results are provided in Sec. IV, and Sec. V discusses how our GAMP-based approach can be extended. Conventional RT-PCR. RT-PCR has a binary outcome. That is, once the sample is amplified, there is plenty of genetic material available for identification. Prior to amplification, there can be problems during pre-processing to isolate purified RNA. Either genetic material can be damaged, in which case all further tests with this material are negative, or the sample is contaminated, in which case further tests are positive. However, such pre-processing problems essentially flip the sample's condition permanently from negative to positive or Fig. 1 . System model. A Bernoulli process with probability ρ of sickness generates an input vector x ∈ {0, 1} N , which reflects patient illness status. The input is multiplied by a measurement matrix, A ∈ {0, 1} M ×N , resulting in noiseless measurements, w = Ax ∈ N M (1), which are processed by an RT-PCR channel, resulting in noisy measurements, y ∈ {0, 1} M (2). GAMP [7] processes the input channel relating ρ and x with g in (·) (4), and the output channel relating w and x with gout(·) (5) (details in Sec. III). vice versa, because all tests of the patient will be using flawed genetic material from this patient. Therefore, such problems will not be discussed further. Once the sample has been pre-processed, there are two further problems that could arise once we partition the sample into multiple group test measurements. One possible outcome is a false negative, meaning not enough genetic material from a sick patient and, therefore, insufficient amplification. It is also possible to have a false positive, meaning the sample was contaminated by viral matter, and the test is positive although the patient is healthy. We focus on these two problems, as a well-designed group testing procedure mitigates their effects. Group testing. Instead of sampling genetic material from one patient, we will pool material from multiple patients. In principle, if any of the patients is sick, and there is sufficient genetic material, then the pooled group test will come out positive. However, it is possible that group testing will use less genetic material per patient, meaning that the measurement is diluted, and the probability of false negatives (per sick patient) might be larger than conventional (unpooled) measurements [4] . We now express the number of sick patients pooled per measurement as a product between a binary matrix representing what patients are pooled in different measurements, and a binary vector representing patient illness status. Later, Sec. II-B forms a noisy probabilistic outcome, where the RT-PCR test being positive or negative depends probabilistically on the number of sick patients pooled per measurement. Our system is illustrated in Fig. 1 . We have N patients. The status of each patient is given by x n , where n ∈ {1, . . . , N }. If patient n is sick, then x n = 1, else x n = 0. The N entries are modeled as independent and identically distributed (i.i.d.), and we model x n using a random variable (RV), X n . These N RVs follow a Bernoulli probability mass function (pmf), X n ∼ Ber(ρ), where Pr(·) denotes probability, and ρ is the percentage of sick patients. Next, the vector x ∈ {0, 1} N is multiplied by a binary measurement matrix A ∈ {0, 1} M ×N . Because x and A are binary, the matrix vector product, is a length-M vector of natural (non-negative) numbers. The matrix A is interpreted as follows. Row m corresponds to measurement m, and column n to patient n, where m ∈ {1, . . . , M } and n ∈ {1, . . . , N }. If patient n is not measured in measurement m, then A mn , the matrix entry in row m and column n, is zero; such patients do not affect the outcome of measurement m. In contrast, A mn = 1 when genetic material from patient n appears in measurement m. It can be seen that w m counts the number of sick patients evaluated by measurement m. In noiseless group testing, the RT-PCR measurement is positive if and only if w m > 0. However, RT-PCR suffers from false positives and negatives. We now account for these false positives and negatives. The noisy measurement y m depends on w m through a conditional probability, Pr(Y m |W m ), where Y m and W m are RVs. To evaluate Pr(Y m |W m ), we denote the probability of an individual patient being sick yet not having enough genetic material in one of the measurements by p 1 . This is the probability of a false negative caused by one patient; with w m sick patients, the probability that all of them have false negatives is (p 1 ) wm . (We note in passing that false negatives could be modeled as independent of w m [3] .) Similarly, the probability of a false positive is denoted by p 2 . If there is a false positive in y m , then y m = 1, irrespective of the status of the patients being evaluated by measurement m. On the other hand, y m = 0 means that there was no false positive, and all the patients evaluated that were actually sick resulted in false negatives. Based on this discussion, we can express the probability for y m to be 0 or 1 given w m , In communication and information theory, such a probabilistic relationship is known as a channel [8] ; the output channel relating the vectors w and y appears in Fig. 1 . We now have a linear relationship from x to w, and the noiseless measurements vector w, which contains the number of sick patients per measurement, is then processed by a probabilistic channel to yield the noisy measurements vector, y. Our goal is to estimate x from y, A, and statistical information about the channel. Other group testing approaches often perform pooled measurements in a first part, and positives are tested individually in a second part [3] ; our method can improve both parts by pooling all measurements and accounting for all available information. In the following section, we describe our algorithmic framework in detail. Algorithm 1 GAMP Inputs. Maximum iterations t max , percentage of sick patients ρ, false negative probability p 1 , false positive probability p 2 , measurements y, and matrix A. Initialize. t, k, h m , Θ m , x n , s n , ∀m, n. Comment. t is iteration number, k is mean of our estimate for Ax, h m is correction term for w m , Θ m is variance of h, x n is our estimate for x n , s n is variance in our estimate x n . h m = g out (k m , y m , θ m ) 7: // clean up input channel 12: for n = 1 to N do 13: x n = g in (∆ vn , q n ) = E[x n |q n ] // mean estimate 14: s n = E[x 2 n |q n ] − E 2 [x n |q n ] // variance estimate 15: t = t + 1 Output. Estimate x, pseudo data q, and scalar channel noise variance ∆ v . We estimate x from y, A, and statistical information about the channel by applying generalized approximate messsage passing (GAMP) [7] , which is an iterative signal estimation algorithm. GAMP is preferred, because it achieves best-possible estimation-theoretic performance asymptotically, in the limit of large linear estimation problems. Our approach focuses on the large system limit, where N → ∞, M (N ) depends on N , and lim N →∞ where we call R the measurement rate. GAMP relies on the large system limit for various summations in the derivation steps of the algorithm to be well-approximated as Gaussian under the central limit theorem [7] . Note that running our algorithm for small problem sizes such as N = 100 patients and M = 30 measurements may result in poor estimation quality. The GAMP algorithm is listed in Algorithm 1. For a detailed derivation, we refer the reader to Rangan [7] . An intuitive and less formal explanation is provided below. GAMP is comprised of two parts. The first part involves the input channel relating ρ and x (Fig. 1) [7] , where x is estimated from an auxiliary vector q ∈ R N (cf. Line 10 of Algorithm 1) through a function g in (·). The auxiliary vector, q, is known in the AMP literature 1 as the pseudo data, which can be treated as a noisy version of the true signal x, where v ∈ R N is additive white Gaussian noise (AWGN) with 1 AMP can be derived from GAMP for a specific setting [7] . While AMP [9] requires the matrices it processes to have zero mean, GAMP is less restrictive. zero mean, where ∆ vn is the variance of v n . Hence, g in (·) can be interpreted as a denoising function, While other denoising functions can be used, conditional expectation, i.e., E[X|Q], minimizes the mean squared error (MSE) in each GAMP iteration, and so it reduces the error as quickly as possible. The second part of GAMP involves the output channel (cf. Fig. 1) , where y m depends probabilistically on w m . We estimate w m from y m using a second denoising function, where the expectation is taken over the pmf, In this expression, (5), we have mean and variance values for w m , and can interpret g out (k m , y m , Θ m ) as a correction term that reflects residual information, which is provided by the noisy measurements vector, y, but is not yet reflected in our estimates, k for Ax, and x. The correction term is used in later iterations to compute q and g in (·). GAMP uses these two scalar functions, g in (·) (4) and g out (·) (5), to estimate x and w = Ax (1) from q (3) and y (2), respectively. That is, GAMP iteratively cleans the input and output channels. A numerical illustration is provided in Fig. 2 ; Sec. IV discusses this figure in detail. GAMP also uses derivatives of these scalar functions to estimate the variance. In words, knowing not only the mean but also the variance around the mean allows GAMP to judiciously use information from x when estimating w and vice versa. This section provides numerical results showing how GAMP solves noisy group testing problem. For readers who are new to GAMP, we begin by illustrating how GAMP cleans up the input and output channels iteratively. We evaluate N = 5000 patients at a time, where the fraction of infected patients is ρ = 0.01. The measurement rate is R = M/N = 0.3, meaning that we take M = N R = 1500 RT-PCR measurements. 2 The matrix A is designed to pick up n pos sick patients per measurement on average; we let n pos = 0.5. The numbers of ones per row and column are kept close to n pos /ρ and Rn pos /ρ, respectively. For the RT-PCR channel, we assume a false negative probability, p 1 = 0.02, and false positive probability, p 2 = 0.001. 3 We quantify GAMP signal estimation quality using the area under the receiver operating curve (AUC-ROC). In words, the ROC captures trade-offs between false positives and negatives, and increasing the AUC reflects better estimation. While standard GAMP minimizes the MSE [7] , other error metrics can be minimized [10] , [11] . The top pannel of Fig. 2 plots the 2 norms of the input channel noise (dashed red line) and output channel noise (solid blue). We can see that the input and output channels improve over iterations. The bottom panel shows the first 1000 entries of the input signal vector x and their estimates. We can see that patient illness status is estimated well. We investigate the impact of the percentage of sick patients, ρ, and measurement rate, R = M/N , on estimation accuracy in Fig. 3 . As before, N = 5000, n pos = 0.5, p 1 = 0.02, and p 2 = 0.001. We run our algorithm on ρ ∈ {0.005, 0.01, 0.015, · · · , 0.05} and R ∈ {0.1, 0.15, 0.2, · · · , 0.5}. For each setting, we randomly generate 20 different triples of (x, A, y), and record the AUC for every triple. The performance for each setting is evaluated by averaging AUC values. Our results show that the AUC increases with the measurement rate, R, and larger ρ requires larger R to yield an AUC near 1. These results align with our expectation that more measurements improve estimation, while more sick patients require more measurements. Recent work Hanel and Thurner [3] analyzes a two part group testing approach. Their model for PCR uses Pr(Y m = 0|W m > 0) = (1 − p 2 )p 1 , while we use Pr(Y m = 0|W m = w m > 0) = (1 − p 2 )p wm 1 ; we evaluated their approach using ρ = 0.01, p 1 = 0.02, and p 2 = 0.001. Hanel and Thurner's Part 1 pools a block of B = 11 patients at a time. If a pool is negative, all patients in the block are declared healthy; else A simulation over N = 10 7 patients had 935 false positives and 4038 false negatives. Our two part approach modifies Part 2. Instead of testing patients within each positive block individually, we combine all patients within all positive blocks into a new linear inverse problem, and solve the resulting estimation problem (1) with GAMP. For example, let the block size be B = 25 patients in Part 1. In Part 2, we combine all positive blocks and apply R = 0.5 and n pos = 0.5 to the linear inverse problem. Note that (i) positive measurements from Part 1 are reused in the matrix A and measurement vector y of Part 2, because they contain information that helps GAMP; (ii) we decide whether a patient is sick or not by thresholding x. Combining Parts 1 and 2, the measurement rate is R = 0.149. We randomly generate 100 (x, y, A) triples for N = 10000 patients in Part 1, Among 100N = 10 6 patients, there are 92 false positives and 366 false negatives. Our false positive and negative rates are both lower than those of Hanel and Thurner [3] . V. DISCUSSION Our current approach relies on various assumptions. Below are issues that can be considered in ongoing and future work. Challenges. Some of the challenges we expect involve better modeling of RT-PCR, in particular how pooling multiple samples dilutes the genetic material and may increase false positives and negatives [12] . Other challenges involve matrix design; better matrices will improve estimation quality. One question is whether p 1 is the same for each patient n in each measurement m where A mn = 1. It might be possible to use different matrix entry values (not just 0 and 1) and thus sample more genetic material in some cases, and less in others. This will likely result in p 1 depending on the amount of genetic matter being sampled. Therefore, genetic material will have to be measured before samples are pooled to ensure the same concentration of genetic material is loaded for each sample. Limiting the amount of genetic matter being processed may reduce the costs of the overall measurement process. On the other hand, if the amount of genetic material per patient per measurement varies, a sophisticated channel could be supported by having nonzeros in A mn take different values. Another question is whether the false positive probability, p 2 , is identical for all measurements. Alternately, p 2 may depend on the measurement system, for example the number of samples pooled together, or the number of RT-PCR iterations. If individual RT-PCR iterations are costly, then the number of iterations can be reduced, resulting in larger p 2 . Our experience with AMP-based algorithms suggests that a modest increase in R = M/N will compensate for the degraded individual measurements. Cost effectiveness of each iteration will determine the number of times each sample can be run. This value should be weighed against the number of times individual samples are pooled, allowing to optimize the number of samples pooled per measurementà la [3] . A third challenge pertains to the measurement matrix, A. In our current design, rows and columns contain similar numbers of nonzeros (Sec. IV). Therefore, each measurement provides the same signal to noise ratio (SNR). One matrix design option is to allow different rows and columns to have different numbers of nonzeros. Another is to prevent any pair of patients, n 1 and n 2 , from both having nonzero matrix entries in different rows, m 1 and m 2 , i.e., A m1n1 , A m1n2 , A m2n1 , and A m2n2 cannot all be nonzero. Refinements in matrix design will improve our estimation quality. Applications. Improvements in testing accuracy can be used in different ways in different applications. •False positive and negative rates of individual RT-PCR measurements can be reduced by pooling together samples, and using GAMP-based algorithms. This can help decide when it is safe to release a COVID-19 patient from quarantine. •Latency reduction. As the first RT-PCR measurements from a batch of patients arrive, all patients corresponding to positive measurements can be quarantined. As more RT-PCR measurements come in, GAMP can determine which of the individual patients in the positive pooled samples are actually healthy. •Throughput can be drastically increased for fixed target levels of false positives and negatives. This can be useful for testing large populations with minimal cost. •False negatives must be low to prevent a few sick patients from infecting many others. Low false negative probabilities can be provided by a two-part signal estimation approach [13] . Part 1 will use a conventional measurement matrix A. Part 2 takes extra measurements only for patients deemed healthy in Part 1, thus reducing false negatives. Similar ideas have been proposed for adaptive sensing [14] . Finally, our GAMP-based approach uses statistical information about the probability of a patient being sick, ρ, and probabilities governing false positives and negatives, p 1 and p 2 . The algorithm can be improved using more information, for example statistical dependencies between household members being sick. We will integrate more predictive information to further improve estimation quality. Quantification of mRNA using real-time rt-PCR Evaluation of COVID-19 RT-qPCR test in multi-sample pools Boosting test-efficiency by pooled testing strategies for SARS-CoV-2 Comparative analysis of triplex nucleic acid test assays in United States blood donors A Massively Parallel COVID-19 Diagnostic Assay for Simultaneous Testing of 19200 Patient Samples Inexpensive multiplexed library preparation for megabase-sized genomes Generalized approximate message passing for estimation with random linear mixing Elements of Information Theory Message passing algorithms for compressed sensing Signal estimation with additive error metrics in compressed sensing Performance limits with additive error metrics in noisy multimeasurement vector problems Impact of pre-amplification conditions on sensitivity of the tat/rev induced limiting dilution assay Two-part reconstruction with noisysudocodes Adaptive sensing for sparse recovery