key: cord-0999038-7rip6wtu authors: Minhas, Fayyaz; Grammatopoulos, Dimitris; Young, Lawrence; Amin, Imran; Snead, David; Anderson, Neil; Ben-Hur, Asa; Rajpoot, Nasir title: Improving COVID-19 Testing Efficiency using Guided Agglomerative Sampling date: 2020-04-14 journal: bioRxiv DOI: 10.1101/2020.04.13.039792 sha: c8c527540298dccedc89234d4cc7973b096e0752 doc_id: 999038 cord_uid: 7rip6wtu One of the challenges in the current COVID-19 crisis is the time and cost of performing tests, especially for large-scale population surveillance. Since the probability of testing positive in large population studies is expected to be small (<15%), most of the test outcomes will be negative. Here, we propose the use of agglomerative sampling, which can prune out multiple negative cases in a single test by intelligently combining samples from different individuals. The proposed scheme builds on the assumption that samples from the population may not be independent of each other. Our simulation results show that the proposed sampling strategy can significantly increase testing capacity under resource constraints: on average, a saving of ~40% of tests can be expected assuming a positive test probability of 10% across the given samples. The proposed scheme can also be used in conjunction with heuristic or machine learning guided clustering to improve the efficiency of large-scale testing further. The code for generating the simulation results for this work is available here: https://github.com/foxtrotmike/AS. Effective large-scale testing and contact tracing have been successfully used in a number of countries for controlling the spread of the SARS-CoV-2 virus (CoV-2), which causes the COVID-19 disease [1]. However, in resource-limited settings, it may not be feasible to do large-scale testing unless the efficiency of existing tests is improved in terms of the number of tests required for a given number of samples.
In this short paper, we discuss a computer-science-inspired divide-and-conquer strategy based on pooling samples from multiple individuals that can significantly improve test efficiency under a minimal set of assumptions. Given a set of N individuals to be tested for CoV-2, the number of tests required for identifying positive individuals can be reduced from N by considering the fact that the probability of testing positive is small (say, p = 0.1) and that individual test results are typically not independent of each other (e.g., members of the same household, or people in contact with each other or with other CoV-2 infected individuals, can have dependencies in their test results). In this work, we propose a divide-and-conquer agglomerative sampling strategy that is built on these ideas and can be used to reduce the number of tests. Before delving into the details of the method, we present a list of assumptions underlying the proposed method:

1. Pooling: Samples of multiple individuals can be combined or mixed into a single "bag" which can be tested by a single test such that:
   a. The test produces a positive outcome if any of the samples in the bag is positive
   b. The test produces a negative outcome if none of the samples in the bag is positive
   c. Testing in bags does not change the error rate of the test being used
2. Multiplicity: Multiple samples can be taken from a single individual, or a single sample from an individual can be divided further
3. Rarity: The probability of testing positive is small (p < 0.2)

It can be expected that these assumptions are satisfied by a number of current tests for CoV-2 infection, such as quantitative RT-PCR and serological (antibody) testing [2]. Consider a set of N individuals X = {1, 2, …, N} to be tested for CoV-2 infection. Assume that the (originally unknown) test result of each individual is given by y_i ∈ {0, 1}, i = 1 … N.
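Assumptions 1a-1c amount to an idealized "OR" model of a pooled test: a bag is positive if and only if it contains at least one positive sample. A minimal sketch of this model (the function name and example labels are illustrative, not from the authors' code):

```python
def pooled_test(sample_labels):
    """Pooled ("bag") test under assumptions 1a-1c: the outcome is positive
    iff at least one sample in the bag is positive, with no added error."""
    return any(sample_labels)

# One positive sample makes the whole bag positive; an all-negative
# bag tests negative and is resolved in a single test.
assert pooled_test([0, 0, 1, 0]) is True
assert pooled_test([0, 0, 0, 0]) is False
```

Assumption 1c (no change in error rate) is the strongest of the three in practice, since dilution in large pools can affect assay sensitivity.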
Without loss of generality or introducing any limitations in the model, assume that for each individual we are also given a set of d features x_i ∈ R^d (such as frailty, age, gender, contact with known or suspected CoV-2 infected patients, geographical location, symptoms, family/household dependencies, etc.) that can be used to generate a degree of belief of that individual testing positive. We denote this degree of belief by b_i, i = 1 … N. If it is not possible to assign a belief to each individual, b_i can be considered to be uniformly random, i.e., b_i ~ U(0, 1). Alternatively, the belief can be assigned by a human oracle in a subjective manner, or it can be obtained through machine learning or probabilistic modelling based on the given set of features. If we cluster or mix individual samples into bags and test these bags in a hierarchical manner, the number of required tests can be reduced by essentially pruning out multiple negative samples in a single test. For this purpose, consider a tree-structured organization of the given set of individuals based on the degrees of belief b_i, i = 1 … N (or using the given set of features directly), as shown in the example figure below. The basic idea of agglomerative testing is that we test a bag of samples and, if the bag-level result comes out negative, then there is no need to test each of the samples individually. However, if the test comes out positive, we subdivide the samples into further clusters and test each of these bags next. This is continued until we obtain a test result for each individual. Furthermore, if a test for a bag comes out positive but the next sub-bag tests negative, then we know that the positive result is a consequence of a positive individual in the other bag, which can be split further directly without additional testing. This guided algorithm based on an even binary split is summarized in Algorithm-1.
The figure below shows that if we obtain a mixed bag of the samples of individuals 1-8 and do a single test, the outcome will be negative and there is no need to do individual testing. For a bag comprising cases 9-16, the result of the test will be positive because there is at least one positive individual in the bag. Doing this in a recursive manner can reduce the number of tests required from 16 to 11 or 14, depending upon how the terminal nodes are tested. If we have access to informed belief values, then the given samples can be sorted with respect to their belief values prior to tree construction. Tree construction can also be done in an unsupervised manner based on existing individual features coupled with hierarchical or agglomerative clustering [3]. In order to evaluate the efficacy of this approach, we constructed a simple simulation in which individuals are assigned random test labels (y_i = 1 with probability p and y_i = 0 with probability 1 − p). Each individual is then assigned a degree of belief b_i. We tested with both a random degree of belief (no belief information) and varying degrees of belief, as measured by the concordance between b_i and y_i, generated using an additive Gaussian noise model b_i = y_i + ε_i (with ε_i ~ N(0, σ)), with the degree of noise controlled by the standard deviation parameter σ. For a given simulation setting (number of individuals N, prior probability p and belief control factor σ), we calculate the number of tests required by the proposed sampling method. In order to get reliable statistical estimates of the distribution of the number of required tests for a given simulation setting, we repeated the simulation multiple times with the same setting and plotted the distribution of the number of required tests using a box plot. Based on our computational analysis, the expected number of tests T(p, N) required for a given positive probability p and N input samples (under no belief assumptions) can be calculated as: T(p, N) = 2(N − 1)(1 − 2^(−4.5p)) + 1.
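The simulation described above can be approximated with a short Monte Carlo sketch. The parameter names and the compact recursive test counter below are illustrative assumptions, not the authors' released code at the GitHub link:

```python
import random

def count_tests(bag, known_positive=False):
    """Pooled tests needed to resolve `bag` (a list of 0/1 labels) using
    recursive even binary splits; a pooled test is positive iff the bag
    contains at least one positive label."""
    t = 0
    if not known_positive:
        t += 1                      # pooled test on the whole bag
        if not any(bag):
            return t                # whole bag negative: pruned in one test
    if len(bag) == 1:
        return t                    # single (necessarily positive) sample
    mid = len(bag) // 2
    left, right = bag[:mid], bag[mid:]
    t += 1                          # pooled test on the left sub-bag
    if any(left):
        return t + count_tests(left, known_positive=True) + count_tests(right)
    # Bag positive but left negative: the right sub-bag must be positive.
    return t + count_tests(right, known_positive=True)

def simulate(n=128, p=0.1, sigma=None, trials=200, seed=0):
    """Mean number of pooled tests over repeated random cohorts. With
    sigma=None the beliefs are uninformative (uniform random); otherwise
    b_i = y_i + N(0, sigma), and samples are sorted by belief before the
    binary tree is built."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        labels = [1 if rng.random() < p else 0 for _ in range(n)]
        if sigma is None:
            beliefs = [rng.random() for _ in range(n)]
        else:
            beliefs = [y + rng.gauss(0, sigma) for y in labels]
        ordered = [y for _, y in sorted(zip(beliefs, labels))]
        total += count_tests(ordered)
    return total / trials
```

Sorting by an informative belief clusters the likely positives into a few sub-trees, so more negative bags are pruned high up in the tree; with uninformative beliefs the ordering is effectively random, which is the setting the closed-form estimate refers to.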
This formulation captures the typical average-case number of required tests using the proposed strategy. It can be seen that this formulation is heavily dependent on the value of the positive probability p. However, it can significantly reduce the number of required tests when the probability is small, e.g., for community-level testing. The probability value up to which the proposed strategy can remain effective, i.e., up till T
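As a sanity check, the closed-form estimate can be evaluated directly. Note that setting T(p, N) = N gives 2^(−4.5p) = 1/2, i.e., a break-even probability of p = 1/4.5 ≈ 0.22 independent of N, which is consistent with the rarity assumption (p < 0.2) stated earlier. A sketch, with the formula taken from the text:

```python
def expected_tests(p, n):
    """Closed-form estimate from the text: T(p, n) = 2(n-1)(1 - 2^(-4.5p)) + 1."""
    return 2 * (n - 1) * (1 - 2 ** (-4.5 * p)) + 1

# Savings per sample shrink as p grows; at p = 1/4.5 ~= 0.22 the expected
# number of tests equals n and pooling stops paying off.
for p in (0.0, 0.05, 0.1, 0.2, 1 / 4.5):
    print(f"p={p:.3f}: {expected_tests(p, 128):6.1f} expected tests for 128 samples")
```

At p = 0 the formula reduces to a single pooled test for the whole cohort, and as p grows it approaches 2N − 1, the total number of nodes in a binary tree over N leaves.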