title: Smart Pooling: AI-powered COVID-19 testing
authors: Escobar, M.; Jeanneret, G.; Bravo-Sanchez, L.; Castillo, A.; Gomez, C.; Valderrama, D.; Roa, M. F.; Martinez, J.; Madrid-Wolff, J.; Cepeda, M.; Guevara-Suarez, M.; Sarmiento, O. L.; Medaglia, A. L.; Forero-Shelton, M.; Velasco, M.; Pedraza-Leal, J. M.; Restrepo, S.; Arbelaez, P.
date: 2020-07-15
DOI: 10.1101/2020.07.13.20152983

Massive molecular testing for COVID-19 has been identified as fundamental to moderating the spread of the disease. Pooling methods can enhance testing efficiency, but they are viable only at very low disease prevalences. We propose Smart Pooling, a machine learning method that uses sociodemographic data from patients to increase the efficiency of pooled molecular testing for COVID-19 by arranging samples into all-negative pools. We show efficiency gains of 42% with respect to individual testing at disease prevalences of up to 25%, a regime in which two-step pooling offers only marginal efficiency gains. Additionally, we calculate the possible efficiency gains of one- and two-dimensional two-step pooling strategies and present the optimal strategies for disease prevalences of up to 25%. We discuss practical limitations of conducting pooling in the laboratory. Smart Pooling keeps pooling methods efficient even at high disease prevalence.

Figure 1. a. In standard two-step pooling methods, samples are pooled randomly. When the outcome of a pooled test is negative, all samples in it are labeled as negative. When the outcome is positive, all samples are tested individually. As prevalence increases, the efficiency of pooling without a priori information drops rapidly and makes the strategy unviable, mainly because the probability of having at least one positive sample in a pool increases. b.
Smart Pooling tackles this problem by arranging samples into pools that maximize the probability of being all-negative. c. The Smart Pooling pipeline. Samples and data are collected from patients. The Smart Pooling model processes these data and returns an arrangement with the probability that each sample tests positive. In the lab, samples are pooled based on this arrangement. Subsequently, samples from positive pools are tested individually. Finally, the diagnostic outcome of each sample is fed back to the Smart Pooling platform. This process enlarges the dataset and allows for continuous learning.

In the epidemiological context of the COVID-19 pandemic, testing laboratories may have clinical and demographic data for each individual. These data could be exploited to estimate the probability of a patient yielding a positive result [17]. Thus, an informed guess could be made to exclude a sample from a pool. Smart Pooling is easily adaptable to any pooling procedure. To use Smart Pooling as a tool for the desired pooling strategy, we propose a five-step pipeline between the laboratory procedures and the Smart Pooling analysis. Figure 1c illustrates this process. First, the laboratory acquires samples and demographic metadata from patients. Secondly, the Smart Pooling platform processes the metadata to provide an ordering of the patients into pools. Thirdly, samples are pooled according to this ordering in the laboratory. Then, molecular tests are run on the ordered pools until there is a diagnosis for each sample. Lastly, the laboratory feeds the results of the tested samples back into the Smart Pooling platform to continuously improve the model.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version was posted July 15, 2020.
We identify that using complementary information to arrange pools can improve the efficiency of testing. We do this by training a machine learning algorithm to predict the probability that a sample will test positive for COVID-19 based on sociodemographic data. Testing efficiency increases by using individual testing on high-probability samples and arranging the remaining samples into pools, simulating a two-step pooling protocol. Figure 2 shows that for disease prevalences ranging from 0% to 25%, Smart Pooling outperforms the simulated efficiency obtained with Dorfman's two-step pooling and with individual testing.

Figure 2. Efficiency of Smart Pooling trained on different data. Smart Pooling achieves higher efficiencies than two-step pooling for disease prevalences of up to 25%.

For our experiments, we construct a dataset that comprises the results of qPCR tests of samples from the city of Bogotá, Colombia. We gather data at different granularity levels reflecting real data availability and organize it into the Test Center Dataset. These data are restricted to information on the test center and the date the samples were taken. We conduct qPCR on DNA extracted from lower or upper respiratory tract samples, following standard diagnostic protocols for COVID-19 [18, 19]. We perform DNA extraction and amplification individually for each sample in the dataset. Figure 2 shows the efficiency gains of Smart Pooling. Our computational experiments show that efficiency gains are obtained for all simulated prevalences up to 25% when performing Smart Pooling.
For instance, with Smart Pooling at a prevalence of 10% and an efficiency around 2, the estimated number of patients that could be tested with the same number of test kits is doubled compared to individual testing. These results on the Test Center Dataset show that Smart Pooling does not depend on the availability of rich complementary data such as patient-level data.

Figure 3 visualizes the predictions of our machine learning models trained with test center metadata. We rank samples according to the confidence of the model in predicting a sample as positive. Compared to a random ordering of the samples, Smart Pooling's predictions on the test center data place most of the positive samples at the top of the arrangement. This figure illustrates the working principle of Smart Pooling: it enhances efficiency by artificially reducing the incidence in the samples sent to pools, by sending the samples most likely to test positive to individual testing. The machine learning model infers the probability that a sample tests positive from the metadata. For the Test Center Dataset, the machine learning model could be exploiting underlying correlations in the samples. Samples in the dataset were acquired during strict measures limiting mobility in the city of Bogotá. It is likely that people tested at the same center share epidemiological factors, such as visiting the same markets, sharing public transport, or being in the same hospital. The model could also be learning the different probability distributions of samples being positive in different parts of the city.

Smart Pooling can be used with multiple pooling strategies and improves efficiency regardless of the strategy. Figure S1 shows the effect of using Smart Pooling with a fixed pool size of 10 and with an adaptive pooling strategy based on the optimal strategies stated previously. Both of these alternatives are more efficient than two-step pooling and individual testing when coupled with Smart Pooling.
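The working principle above can be illustrated with a small simulation. This is a sketch with synthetic labels and a hypothetical noisy risk score standing in for the model's predictions, not the paper's actual model: sorting samples by predicted risk before forming fixed-size pools concentrates the positives into a few pools, so most pools resolve in the first step.

```python
import random

def two_step_tests(labels, pool_size=10):
    """Count qPCR tests used by a two-step protocol over samples in the
    given order: one test per pool, plus individual retests of positive pools."""
    tests = 0
    for i in range(0, len(labels), pool_size):
        pool = labels[i:i + pool_size]
        tests += 1                       # first step: one pooled test
        if any(pool):
            tests += len(pool)           # second step: retest each sample
    return tests

random.seed(0)
n, prevalence = 1000, 0.10
labels = [random.random() < prevalence for _ in range(n)]
# Hypothetical noisy risk score standing in for the model's prediction.
scores = [0.4 * y + 0.6 * random.random() for y in labels]

ranked = [y for _, y in sorted(zip(scores, labels), reverse=True)]
print("efficiency, random order:", round(n / two_step_tests(labels), 2))
print("efficiency, ranked order:", round(n / two_step_tests(ranked), 2))
```

With a useless score the ordering reduces to standard two-step pooling, which is one way to see why efficiency should degrade gracefully with model quality.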
Our simulations show that adaptive pooling strategies offer greater efficiency gains than fixed-size pooling at higher prevalences (p > 15%). Ultimately, the pooling strategy used in practice should be determined by the resources available at the testing laboratory.

By a two-step pooling protocol, we mean a procedure in which a given set of samples is combined into pools using specific criteria. Once the pools are defined, we carry out two steps. In the first step, we test each pool for the disease using qPCR. There are two possible outcomes: (i) the pooled sample is negative, which implies that all individual samples within the pool are negative; or (ii) the pooled sample is positive, which implies that at least one individual sample within the pool is positive. If the test result of the first step is positive, then it is necessary to proceed to the second step, in which each sample is individually tested, thereby finding the infected samples. More importantly, note that if the result of the first step is negative, by performing a single pooled test we save all the PCR kits required to test the samples individually (except the one used for the pooled test). Crucially, as we will show later, this gain is obtained without any loss in either test sensitivity or specificity. We explored the following two kinds of pooling protocols:

1. Dorfman's pooling protocols [4]: given m samples, we make a single pool with all of them. We denote this protocol by S_m.

2. Matrix pooling protocols [5]: given a collection of m × n samples, we place them
into a rectangular m × n array. We create pools by combining samples along the rows and along the columns of this array (for a total of m + n pools). In the second phase, we individually test each sample at the intersection of a positive row and a positive column. We denote this protocol by S_{m×n}.

Pooling efficiency will be our main tool to quantitatively compare different pooling protocols; it is defined as the number of patients tested per detection kit consumed. Efficiency depends on the prevalence p of the sample (defined as the probability that an individual in the population is ill). Assuming independence among patients, pooling efficiency can be computed analytically [4, 5]:

1. For Dorfman's pooling S_m, the efficiency is given by

E_{S_m}(p) = m / (1 + m(1 − (1 − p)^m)).

2. For the matrix pooling protocol S_{m×n}, the efficiency is given by

E_{S_{m×n}}(p) = mn / (m + n + mn[p + (1 − p)(1 − (1 − p)^{m−1})(1 − (1 − p)^{n−1})]).

The efficiency functions above show the key property behind pooling in general, and Smart Pooling in particular. At low prevalences, the efficiency can be considerably greater than one. As the prevalence increases, E(p) decreases, and the efficiency gains become negligible at prevalences around 30%. Figure 4 shows graphically these efficiency gains (and how they vanish) for two-step pooling.

What are the best pooling protocols that maximize efficiency for a given prevalence p, assuming a fixed bound c on pool size? The experimental setup unavoidably constrains the maximum pool size. In the context of COVID-19, we focus on the cases c = 5 and c = 10, since these are the most useful in practice (see Section S1.1 for details). Finding the optimal pooling strategy for a given prevalence amounts to finding the pooling protocol of maximum efficiency by comparing the values of E(p) for the different protocols. Figure 5 shows the efficiency curves of the best pooling protocols of the form S_j and S_{m×n} with maximum pool size c ≤ 10 and c ≤ 5.
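The two closed-form efficiencies, and the search over candidate protocols that the figures summarize, can be written out directly. This is a sketch under the stated independence assumption; `best_protocol` is a hypothetical helper name, not from the paper.

```python
def dorfman_eff(p: float, m: int) -> float:
    """Efficiency of Dorfman's protocol S_m: one pool of m samples."""
    q = 1.0 - p                                  # P(one sample is negative)
    return m / (1.0 + m * (1.0 - q ** m))        # pool test + retests if positive

def matrix_eff(p: float, m: int, n: int) -> float:
    """Efficiency of the matrix protocol S_{m x n}: m + n pooled tests,
    then individual tests at positive row/column intersections."""
    q = 1.0 - p
    retest = p + (1.0 - p) * (1.0 - q ** (m - 1)) * (1.0 - q ** (n - 1))
    return (m * n) / (m + n + m * n * retest)

def best_protocol(p: float, c: int = 10):
    """Most efficient candidate protocol at prevalence p, pool size <= c."""
    cand = {f"S_{m}": dorfman_eff(p, m) for m in range(2, c + 1)}
    cand.update({f"S_{m}x{n}": matrix_eff(p, m, n)
                 for m in range(2, c + 1) for n in range(m, c + 1)})
    name = max(cand, key=cand.get)
    return name, cand[name]

for p in (0.01, 0.05, 0.15, 0.25, 0.31):
    name, eff = best_protocol(p)
    print(f"p = {p:.0%}: best is {name}, efficiency {eff:.2f}")
```

At p = 5% the classical Dorfman optimum of pools of five gives an efficiency near 2.35, and past roughly 30% prevalence every candidate drops below 1, matching the vanishing gains discussed above.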
These protocols are in fact optimal among all two-step pooling protocols, not only among those of the form S_m and S_{m×n}. More precisely, Table 1 shows the optimal protocols and their respective intervals of optimality for c = 10 and c = 5. Even when using optimal protocols, it is clear that the efficiency gains quickly disappear as the prevalence increases (effectively vanishing when p ≥ 30.66%).

The data correspond to the molecular tests conducted by Universidad de los Andes for the Health Authority of the city of Bogotá, Colombia. We organized the data according to the test center that collected the sample. Additionally, metadata from the test centers, such as their location and name, were available. Samples were collected from April 6th to May 25th, 2020, for a total of 7162 samples from 101 test centers. To construct the dataset, we tested samples individually following the Berlin Protocol [18] (for dates before April 18th) and the protocol for the U-TOP™ COVID-19 Detection Kit from Seasun Biomaterials [19]. To experimentally validate the performance of our proposed models and descriptors, we divided the data into a training split containing all the samples until May 7th, and a held-out test split for evaluating the performance of the model on the remaining samples.

Table 1. Prevalence intervals and their respective optimal pooling strategy.
S_m denotes single pooling protocols and S_{m×n} matrix pooling protocols.

Quantitative evaluation. We used standard machine learning metrics for detection problems: precision-recall curves, the maximum F-measure, and average precision. Additionally, we measured the efficiency of our models in silico in terms of the number of tests used when employing the model's output as the criterion for Smart Pooling. We calculated the efficiency as the ratio between the initial number of samples and the total number of tests used, following a two-step pooling strategy. First, we sorted the individual tests in decreasing order of predicted prevalence. Then, we grouped the samples into pools following this order and the pool size. Finally, the total number of tests corresponds to the number of pools in the first step plus the individual samples from the second step, that is, samples from the pools that were positive in the first step.

Prevalence and efficiency graphs. For our best model at test time, we simulated different prevalences from the data up to 25%. We calculated the model's efficiency for each prevalence and thus obtained a point in the prevalence-efficiency graph. Originally, the prevalence was 22.6% for the Test Center Dataset. We obtained higher and lower prevalence values by removing positive and negative samples, respectively. Due to the nature of the data, we could not remove individual samples; thus, we eliminated all the samples of a test center on a given date.

We trained a machine learning method for the Test Center Dataset. We performed the training of our models in three phases. First, we further split the training data into two disjoint subsets: training and validation.
Secondly, we used AutoML from the H2O library [20] to explore multiple machine learning models, using performance on the validation subset as the criterion to select the best model. Lastly, we obtained our final model by retraining the best model on all the available training data. During the first training phase, we selected the validation subset such that we predicted as many dates as those in the test set. The maximum number of dates to predict in May was five. We removed the test centers that did not have sufficient dates for the validation split but included them back for the second training phase. Below we explain the details of training.

The level of granularity of this dataset allowed us to have information on the number of positive and total tests per test center on a given date. However, we do not have daily reports from each test center; thus, we explicitly modeled the data as a sparse time series. We trained the machine learning methods to predict the fraction of positive tests for a center up to the current date. Afterward, we assigned an incidence to each sample from a test center on a given date; this incidence corresponds to the probability of a sample from that test center testing positive on that date. We sorted the tests by decreasing incidence, simulated a two-step pooling protocol, computed the number of tests used, and calculated the efficiency of Smart Pooling. The best model for this task was a Gradient Boosting Machine (GBM) [21] with 50 trees, a constant depth of 5, a minimum of 12 leaves, and a maximum of 29 leaves.

To predict the incidence for a date in the validation or test set, we defined a descriptor calculated from the available training data for each testing center. The feature dimensions include the cumulative tests of each institution up to every date within the time series, and the total number of tests from all the institutions on the corresponding date.
To include temporal information, we defined as features the date in YY-MM-DD format and the relative date, i.e., the number of days since the first date in the time series. Additionally, we created a variable that we named gap, which encodes the distance between two consecutive entries from the same test center. In particular, at test time, the gap encodes the distance between the last training date and the date to predict. The gap provided the model with an estimate of the analyzed time window. We compute the descriptor's features by analyzing the relative differences between variables on the last known date and those from the training days in the current gap. These gap features comprise the cumulative number of tests, total tests, and number of positive samples for each institution at the last available date, together with the corresponding incidences. We complemented the descriptor with features that indicate the number of days, relative to the current prediction date, since the first n positive tests in each test center. For our experiments we used n = 1, 5, 10, 100, 500, 1000. The rationale behind this feature was to provide the model with a temporal encoding of the incidence's evolution. These features were adapted from a public kernel from Kaggle's COVID-19 Global Forecasting Challenge.

The key idea of Smart Pooling is maintaining p artificially low, even under scenarios with a high overall prevalence of COVID-19, by reordering the samples according to a priori estimates of prevalence before the pooling takes place. We demonstrate efficiency gains at simulated prevalences of up to 25%, considerably higher than those at which even optimal standard pooling methods are applicable.
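The descriptor construction described above can be sketched on a toy per-center time series. The numbers, column names, and the simplified gap and incidence definitions below are illustrative assumptions; the real dataset has 101 centers and additional gap features.

```python
import pandas as pd

# Toy per-center time series: cumulative positives and tests by date.
df = pd.DataFrame({
    "center": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2020-04-06", "2020-04-09", "2020-04-15",
                            "2020-04-07", "2020-04-20"]),
    "positives": [2, 5, 9, 0, 3],
    "tests": [20, 40, 70, 10, 30],
}).sort_values(["center", "date"])

# Relative date: days since the first date in the series.
df["rel_date"] = (df["date"] - df["date"].min()).dt.days
# Gap: days between consecutive entries of the same center (sparse series).
df["gap"] = df.groupby("center")["date"].diff().dt.days.fillna(0).astype(int)
# Running incidence: fraction of positive tests per center up to each date.
df["incidence"] = df["positives"] / df["tests"]

# Days since a center first reached n positives (n = 1 and 5 shown here);
# negative values mean the center had not yet reached n positives.
for n in (1, 5):
    first = df[df["positives"] >= n].groupby("center")["date"].min()
    df[f"days_since_{n}_pos"] = (df["date"] - df["center"].map(first)).dt.days

print(df[["center", "date", "gap", "incidence", "days_since_1_pos"]])
```

Centers that never reach n positives get a missing value for that feature, which is one place where a tree-based model such as a GBM copes naturally.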
At the prevalence levels of COVID-19 that many diagnostic laboratories are currently managing, two-step pooling efficiency is significantly reduced. Here we show that it is possible to increase pooling efficiency by using machine learning to separate likely positive cases and then pool the rest of the samples using what we show to be an optimal strategy. From our understanding, access to more detailed patient information is very likely to improve predictor performance.

Smart Pooling uses artificial intelligence to enhance the performance of well-established diagnostics. It is an example of how data-driven models can complement, not replace, high-confidence molecular methods. Its robustness to the variability of the available data, prevalence, and model performance, and its independence from the pooling strategy, make it compelling to apply at large scales. Additionally, its continuous learning should make it robust to the pandemic's evolution and to our understanding of it.

Smart Pooling could ease access to large-scale testing. This pandemic has presented challenges to all nations, regardless of their income. As the number of infected people and the risk of contagion increase, more testing is required. However, the supply of test kits and reagents cannot cope with the demand, with most countries unable to perform even 0.3 new tests per thousand people [3]. Adopting Smart Pooling could translate into more accessible and larger-scale massive testing. In the case of Colombia, this could have meant testing 20,000 samples daily, instead of 11,000, in mid-June 2020 [3]. If deployed globally, Smart Pooling has the potential to empower humanity to respond to the COVID-19 pandemic. It is an example of how artificial intelligence can be employed for social good.
Historically, pooling was used extensively during the Second World War, but since then it has mostly been implemented in specific niches such as testing blood for diseases and reducing costs in developing countries [16, 22, 23]. In the specific context of SARS-CoV-2, there are several incentives to implement pooling. Most significant is the current shortage of reagents, especially in developing countries that do not produce them and have limited stocks. Additionally, if implemented correctly, these methods can increase throughput, another motivation for developing countries and locales with a limited number of certified qPCR machines. Finally, through the reduced use of reagents and increased throughput, these methods can reduce costs and motivate health care providers.

Here we considered Dorfman and array testing algorithms because they are easily compatible with a manual implementation of pooling. Although it is possible to find the optimal algorithm for pooling at a particular prevalence, it may be cumbersome to implement all the protocols. Fortunately, most of the pooling algorithms are relatively robust over a broader range than that in which they are optimal, so implementing some of them should increase efficiency without added complexity. Different algorithms will also affect tip use as well as plating time. Filter tips have been in short supply since the beginning of the pandemic [24], and some of the pooling protocols increase efficiency at the expense of increased tip use. Plating time will depend on both the pooling algorithm and the experience of the personnel doing the pooling. Each lab has to adapt the specifics of the protocol to its needs.
Successful implementation of PCR-based pooling requires understanding the limitations of the method and performing viability tests. Although it is possible to pool reactions including primers that detect the virus only when it is present, it is not possible, to our knowledge, to pool the RNase P positive control. This is because these samples are positive unless a problem has occurred with the sample or the extraction of RNA; pooling them would involve finding a negative among positives where samples have variable RNA content and an exponential amplification step. In our implementation, the two reactions are separate: the RNase P control is performed on a different plate, using a faster PCR protocol (1.5 hours instead of 2.5 hours). In kits for which the specific primers and the control are multiplexed, for example, it is not possible to ensure the presence of RNA when pooling.

Sensitivity must also be taken into account when implementing pooling. Each two-fold dilution increases the Ct value by 1 unit (1 ΔCt), on average. Some have detected dilutions of up to 32-fold [8], but only for samples with average Ct, not for samples close to the detection limit. It is necessary to calibrate the dilution process in each lab, since it depends on the kit, machine, and operators. Another limit on the size of the pools is the total reagent volume in each pool. Our plates can handle 10 µl of a sample, while other plate-kit combinations may handle just 5 µl. Depending on the operators, pipetting volumes under 1 µl may cause quality control problems, and since we rely on many volunteers for our testing, we only considered pools of up to 10 samples, which also keeps the maximum Ct of the first round below 42, the cycle limit of our machine. Additionally, based on the calibrations, we increased the Ct cutoff for the first round from 38 to 38 + ΔCt at this dilution.
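The pool-size limit follows from a quick back-of-the-envelope calculation, sketched here under the rule of thumb stated above (roughly 1 Ct per two-fold dilution, so an m-fold dilution adds about log2(m) cycles):

```python
import math

def pooled_ct_shift(pool_size: int) -> float:
    """Expected Ct increase from an m-fold dilution: ~log2(m) cycles."""
    return math.log2(pool_size)

base_cutoff, machine_limit = 38, 42   # first-round Ct cutoff and cycle limit
for m in (5, 10, 16):
    cutoff = base_cutoff + pooled_ct_shift(m)
    status = "within" if cutoff < machine_limit else "at/over"
    print(f"pool of {m:2d}: adjusted cutoff {cutoff:.1f} ({status} the 42-cycle limit)")
```

A pool of 10 shifts the cutoff to about 41.3, still under the 42-cycle limit, while a 16-sample pool would sit exactly at it, consistent with capping pools at 10.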
Raising this cutoff may increase the number of false positives in the first round, but since the second-round samples are tested individually, the standard cutoff can be used to eliminate potential false positives from the first round.

Pooling before RNA extraction is an attractive option, since it reduces reagent use, but we did not implement it for several reasons. The first is that the expected increase in Ct of 5 [25] was above the practical limits for our machine and kit combination. Secondly, it is not possible to flag poorly taken samples in which there is no RNA, for the same reasons it is not possible to pool the RNase P control. Finally, we received heterogeneous samples from both the upper and lower respiratory tracts, which resulted in false negatives when pooled together. It is likely that implementing pooling before RNA extraction is possible with other kits or more homogeneous samples.

The long-term management of COVID-19 will likely require the use of complementary approaches, including testing for antibodies against the virus. The methods presented in this paper can be applied in this context and could increase the measurement of seropositivity in the population.
References

[1] A novel coronavirus from patients with pneumonia in China
[2] World Health Organization
[3] Coronavirus pandemic (COVID-19)
[4] The Detection of Defective Members of Large Populations
[5] The use of a square array scheme in blood testing
[6] On the utility of pooling biological samples in microarray experiments
[7] Assessment of Specimen Pooling to Conserve SARS-CoV-2 Testing Resources
[8] Evaluation of COVID-19 RT-qPCR test in multi-sample pools
[9] Ad hoc laboratory-based surveillance of SARS-CoV-2 by real-time RT-PCR using minipools of RNA prepared from routine respiratory samples
[10] Sample Pooling as a Strategy to Detect Community Transmission of SARS-CoV-2
[11] Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
[12] Rapid AI development cycle for the coronavirus (COVID-19) pandemic: initial results for automated detection & patient monitoring using deep learning CT image analysis
[13] An interpretable mortality prediction model for COVID-19 patients
[14] Forecasting the novel coronavirus COVID-19
[15] Real-time tracking of self
[16] High-throughput pooling and real-time PCR-based strategy for malaria detection
[17] Editorial: Making the Best Use of Test Kits for COVID-19
[18] Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR
[19] U-TOP™ COVID-19 Detection Kit
[20] H2O AutoML (machine learning library)
[21] Gradient boosting machines, a tutorial
[22] Pooling of sera for human immunodeficiency virus (HIV) testing: an economical method for use in developing countries
[23] Feasibility of pooling sera for HIV-1 viral RNA to diagnose acute primary HIV-1 infection and estimate HIV incidence
[24] Shortage of standard health supplies is 'a huge problem'
[25] Pooling of nasopharyngeal swab specimens for SARS-CoV-2 detection by RT-PCR

Acknowledgments

ME, GJ, and PA acknowledge special funding from Facultad de Ingeniería, Universidad de los Andes.
The authors thank John Mario González from the Faculty of Medicine at Universidad de los Andes and his team at CBMU for devoting the laboratory to COVID testing and for their contributions during the laboratory certification. The authors also acknowledge the team members of the GenCore Covid Laboratory at Universidad de los Andes. The authors thank Secretaría de Salud de Bogotá and Alcaldía de Bogotá for their support during the laboratory certification process and for access to complementary information for the samples. The authors declare no competing interests.