PoolTestR: An R package for estimating prevalence and regression modelling with pooled samples
Authors: McLure, Angus; O'Neill, Ben; Mayfield, Helen; Lau, Colleen; McPherson, Brady
Date: 2020-12-10. DOI: 10.1016/j.envsoft.2021.105158

Pooled testing (also known as group testing), where diagnostic tests are performed on pooled samples, has broad applications in the surveillance of diseases in animals and humans. An increasingly common use case is molecular xenomonitoring (MX), where surveillance of vector-borne diseases is conducted by capturing and testing large numbers of vectors (e.g. mosquitoes). The R package PoolTestR was developed to meet the needs of increasingly large and complex molecular xenomonitoring surveys but can be applied to analyse any data involving pooled testing. PoolTestR includes simple and flexible tools to estimate prevalence and fit fixed- and mixed-effect generalised linear models for pooled data in frequentist and Bayesian frameworks. Mixed-effect models allow users to account for the hierarchical sampling designs that are often employed in surveys, including MX. We demonstrate the utility of PoolTestR by applying it to a large synthetic dataset that emulates an MX survey with a hierarchical sampling design.

Pooled testing, also known as group testing, where diagnostic tests are performed on pooled samples, has broad applications for the detection of traits, defects, or diseases with low prevalence. Recently, pooled testing has been used to efficiently and rapidly test for SARS-CoV-2 (Sunjaya and Sunjaya, 2020), but this approach has long been used for surveillance of other infectious diseases, e.g. to detect pathogens in food animals (Arnold et al., 2011) and conduct surveillance of vector-borne diseases (Pilotte et al., 2017; Rodríguez-Pérez et al., 2006). The World Health Organization has for decades been running global elimination programs to reduce the impact of many vector-borne diseases, including lymphatic filariasis (LF) (World Health Organisation, 2019). A key challenge for programs of this scale is to optimise the efficiency and accuracy of surveillance, especially as the disease becomes rarer or more localised over time. Molecular xenomonitoring (MX), the surveillance of pathogens or molecular markers in vector populations, is emerging as an alternative or adjunct to human-based surveillance of vector-borne diseases such as LF, and typically employs pooled testing of mosquitoes for filarial DNA (Lau et al., 2016; Rao et al., 2016; Schmaedick et al., 2014; Subramanian et al., 2020). In the right setting, MX with pooled testing could potentially be very sensitive, avoid the need to test humans, and provide insights to assist in vector control strategies.

Large-scale MX surveys, such as those needed to support decision making for elimination programs, involve capturing and testing large numbers of vectors. It is not always feasible to individually test every vector captured, so vectors are routinely pooled to reduce cost and improve efficiency. If using a sufficiently sensitive and specific test, each pool returns a positive result if any individual in the pool is positive and a negative result otherwise. The presence/absence of infected vectors provides a proxy measure of ongoing transmission, and could potentially help identify geographic areas where further surveillance or interventions may be required.
The prevalence of infection amongst vectors provides useful information for decision makers when assessing and communicating risk, and can be used to prioritise the allocation of resources, observe the effects of interventions, and identify spatial and temporal trends. However, estimating prevalence from pooled samples requires specialised statistical methods (Chen and Swallow, 1990; Farrington, 1992; Hepworth, 2005; Walter et al., 1980).

Many MX surveys employ a hierarchical sampling structure. For instance, to conduct an MX study in a particular country, one may select a number of representative regions, followed by a number of villages within each region, and place traps at a number of representative sites within each village. While hierarchical sampling designs like this provide an effective means of collecting a representative sample of vectors across the study area, they call for specialised analytical methods to accurately estimate infection prevalence. Analytical methods that do not account for hierarchical sampling structures will tend to underestimate the uncertainty in prevalence when applied to data collected using these sampling methods (Birkner et al., 2013).

There are a number of extant software packages for working with pooled testing models, e.g. PoolScreen, and the R packages pooling (Van Domelen, 2020), binGroup (Zhang et al., 2018), and its successor binGroup2 (Hitt et al., 2020). PoolScreen is a standalone application that has been used in many MX projects and provides a graphical user interface for estimating prevalence in both frequentist and Bayesian paradigms. However, PoolScreen does not have functionality for regression modelling and is impractical for very large and complex surveys where one may wish to estimate prevalence for many subsets of the data (e.g. estimating prevalence by vector species, by country/region/village, by sampling year, etc.). The R package pooling is designed primarily for case-control studies with pooled assays and is not applicable to MX studies. The R packages binGroup and its successor binGroup2 provide tools for fixed-effect regression modelling with pooled data, though only in a frequentist framework. However, neither PoolScreen nor binGroup2 has functionality to account for the hierarchical sampling frames that are common in large-scale MX surveys, and both will therefore tend to underestimate the uncertainty in prevalence estimates associated with the sampling design. The lack of a tool that is tailored for large-scale hierarchical MX surveys limits the efficiency of data analyses and the amount of information to be gleaned from these studies, particularly in resource-poor settings with limited technical capacity.

To fill this gap, we developed PoolTestR, a package for the R language (R Core Team, 2020) which provides a user-friendly and extensible framework for estimating prevalence with pooled data, and for performing fixed- and mixed-effect regression modelling for both hierarchical and non-hierarchical surveys. All analyses can be conducted in frequentist or Bayesian frameworks. Though our package can be applied to generic pooled testing datasets, we demonstrate its use through examples based on simulated MX data with known prevalence.

2 Pooled testing and the pooled binomial GLMM

Suppose we have a molecular marker of infection in a population with unknown prevalence $p$. We can estimate the prevalence by taking a random sample of binary outcomes showing whether or not the molecular marker is present in a sampled unit (e.g.
testing each mosquito caught). If the molecular marker has a low prevalence in the population, and the unit cost of each test is much more than the unit cost of procuring samples (e.g. trapping mosquitoes), pooled testing may be a more cost-effective means of estimating prevalence. For simplicity, assume that the test can detect the marker with 100% sensitivity and specificity. We also assume that the marker is independent across each individual in the sample (and therefore also in the pools) and that the total number of samples is much smaller than the population. While larger pools may 'dilute' the marker of interest and thus affect the sensitivity of the test, for the purposes of our analysis we will assume that the dilution is insignificant for the range of pool sizes used. Under these assumptions, for a pool of size $s$, the probability of a positive result is:

$$\phi(s) = 1 - (1 - p)^s.$$

In some cases a fixed pool size is used; however, we consider the general case where there may be a mixture of pool sizes. Suppose we have a set of observations $j = 1, \dots, J$ where we observe $y_j$ positive tests out of $N_j$ pools, each of size $s_j$. Given the pool counts $N = (N_1, \dots, N_J)$ and pool sizes $s = (s_1, \dots, s_J)$, the vector of counts of positive tests $y = (y_1, \dots, y_J)$ is a sufficient statistic for the prevalence $p$ and follows a "pooled binomial distribution", with probability mass function given by:

$$\mathrm{PoolBin}(y \mid p, s, N) = \left[\prod_{j=1}^{J} \binom{N_j}{y_j} \left(1 - (1 - p)^{s_j}\right)^{y_j}\right] (1 - p)^{\,n - n_+},$$

where $n = \sum_j N_j s_j$ is the total number of individuals in the sample and $n_+ = \sum_j y_j s_j$ is the total number of individuals in all the positive pools in the sample. Other than in trivial situations where all the positive pools have the same size, the maximum likelihood estimate (MLE) for $p$ does not have a closed-form expression and is computed numerically. Under some weak regularity conditions, the standardised MLE converges in distribution to the standard normal distribution, allowing for Wald confidence intervals. Other confidence intervals for $p$ have been proposed (Hepworth, 2005).

In our package we implement a generalised linear mixed model (GLMM) using the pooled binomial distribution. Our mixed model has outcomes $y_j$ that are associated with covariates $x_j$ (using fixed effects $\beta$) and $z_j$ (using random effects $u \sim \mathrm{N}(0, \Sigma)$):

$$g(p_j) = x_j^\top \beta + z_j^\top u,$$

where $g$ is the link function. In practice, we will generally take $\Sigma$ to be a diagonal matrix with diagonal values (i.e. variances) that are variable parameters that may be shared over covariate groups. This yields random effect terms that are independent, but not necessarily with equal variance. Conditional on the covariates and random effects, the GLMM uses the model equation:

$$f(y \mid x, z, s, N, \beta, u) = \prod_{j=1}^{J} \mathrm{PoolBin}(y_j \mid p_j, s_j, N_j).$$

This is a mixed-effects model that extends the GLM discussed in Farrington (1992); our model with a logistic link function is also similar to a bird-nesting model used in Shaffer (2004). Our package accommodates two link functions for the prevalence parameter: the logistic and the complementary log-log ($\mathrm{CLL}(p) = \log(-\log(1 - p))$). The logistic link function produces more readily interpretable coefficients and is the default in our package. However, the CLL link function leads to a simpler mathematical exposition and so is illustrated here. The main benefit of the complementary log-log function in the present context is that $\mathrm{CLL}(\phi(s)) = \log(s) + \mathrm{CLL}(p)$, so the link function can be used to separate the pool size from the prevalence parameter. The joint log-likelihood function for the model parameters $\beta$ and $\Sigma$ and the random effects $u$ is:

$$\ell(\beta, \Sigma, u \mid y, x, z, s, N) = \sum_{j=1}^{J} \log \mathrm{PoolBin}(y_j \mid p_j, s_j, N_j) + \log \mathrm{N}(u \mid 0, \Sigma),$$

where $\mathrm{N}(\cdot \mid 0, \Sigma)$ denotes the multivariate normal density. Our package implements both a frequentist and a Bayesian analysis of this GLMM.
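As noted above, when pool sizes vary the MLE for $p$ has no closed form and must be found numerically. The following is a minimal sketch in base R of the pooled-binomial log-likelihood and a one-dimensional numerical maximisation; it is intended only to illustrate the calculation (it is not the package's internal implementation), and the pool sizes and counts are invented for the example.

```r
# Pooled-binomial log-likelihood of prevalence p, given y[j] positive pools
# out of N[j] pools, each of size s[j] (assuming a perfect test)
pooled_binom_loglik <- function(p, y, N, s) {
  phi <- 1 - (1 - p)^s                 # P(a pool of size s[j] tests positive)
  sum(dbinom(y, size = N, prob = phi, log = TRUE))
}

# Invented example data with a mixture of pool sizes
s <- c(5, 10, 25)   # pool sizes
N <- c(40, 30, 20)  # number of pools of each size
y <- c(2, 4, 6)     # number of positive pools of each size

# Numerical maximum likelihood estimate of the prevalence on (0, 1)
mle <- optimise(pooled_binom_loglik, interval = c(1e-8, 1 - 1e-8),
                y = y, N = N, s = s, maximum = TRUE)
mle$maximum
```

In PoolTestR this estimation, together with the construction of confidence and credible intervals, is handled internally by the functions described below.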
In the Bayesian case, the functions in our package have flexibility for the user to input their own prior for the unknown model parameters $\beta$ and $\Sigma$, or to use default uninformative priors. For any prior density $\pi_0$, the posterior distribution is:

$$\pi(\beta, \Sigma, u \mid y, x, z, s, N) \propto \exp\!\big(\ell(\beta, \Sigma, u \mid y, x, z, s, N)\big) \cdot \pi_0(\beta, \Sigma).$$

The covariates in these models can represent any characteristic of pools. For an MX study of a mosquito-borne disease, this may include sample collection time and location, or attributes of the site where the sample is collected (e.g. interventions in place at the site, altitude, vegetation index, distance to housing). It may also include any attributes of the mosquitoes shared by the entire pool, e.g. mosquito species in a survey design where trapped mosquitoes are sorted by species before being pooled. However, there is no scope for units within the same pool to have different covariates, unlike in binGroup2 (Hitt et al., 2020). Mixed-effect terms can be used to account for intra-site variation not captured by other covariates in studies with hierarchical sampling frames.

The PoolTestR package has been designed to be a simple, user-friendly and extensible way to analyse test results from pooled samples. The package has four primary functions for the estimation of prevalence: PoolPrev, HierPoolPrev, PoolReg, and PoolRegBayes. PoolPrev produces unadjusted estimates of the prevalence of a marker based on the outcome of tests on pooled samples, optionally stratifying the dataset by one or more user-specified covariates. HierPoolPrev is like PoolPrev but allows users to adjust prevalence estimates for hierarchical structure in sampling frames. PoolReg and PoolRegBayes provide flexible and extensible frameworks to fit mixed- or fixed-effect regression models in frequentist (PoolReg) or Bayesian (PoolRegBayes) frameworks, allowing users to identify variables associated with higher prevalence and produce predictive models while accounting for hierarchical sampling frames. Table 1 provides an overview of the differences between the four main functions. The following sections provide more details of these functions, Boxes A and B provide example code, and Figures 1 and 2 compare the outputs of these functions when applied to a synthetic dataset.

PoolPrev was designed to produce comparable results to the popular stand-alone application PoolScreen, for familiarity to existing users of the software and to enable direct comparison with results from the many studies that used PoolScreen. Stratifying a dataset and calculating prevalence for each subgroup of the data using PoolScreen requires many manual steps to import data, run analyses and export results. Using our function PoolPrev, this same task can be achieved in a few lines of R code with a simple syntax. Given a dataset containing the number of samples per pool and the test results for each pool, PoolPrev returns Bayesian and maximum likelihood estimates of the prevalence together with uncertainty intervals. Efficient Bayesian inference is performed with Hamiltonian Monte Carlo using the Stan programming language (Stan Development Team, 2020b) and the R packages rstan (Stan Development Team, 2020a) and rstantools (Gabry et al., 2020). Users can specify their prior belief for the prevalence as a Beta distribution, or use the default uninformative Jeffreys prior, i.e. Beta(0.5, 0.5).
Users can also optionally specify the prior probability that the marker of interest is entirely absent from the population, in which case PoolPrev also returns the probability of absence given the data. As we assume that the test performed on the pooled samples does not produce false positives or false negatives, the probability of absence is always zero if any of the pools test positive. In most cases the bounds of the credible interval (CrI) for the prevalence are the 2.5% and 97.5% quantiles of the posterior distribution. However, if all tests are positive, the upper bound of the CrI is 1 and the lower bound is the 5% quantile of the posterior. Similarly, if all tests are negative, the lower bound of the CrI is 0 and the upper bound is the 95% quantile of the posterior. The uncertainty interval for the maximum likelihood estimate is calculated using the likelihood ratio method (i.e. a Wilks confidence interval). As with the Bayesian CrI, the lower or upper bound of the confidence interval is zero or one when all pools are negative or positive, respectively. All estimates can be optionally stratified by variables (e.g. vector species or location) by providing the name(s) of the columns in the dataset containing the variable(s). Estimation of prevalence proceeds independently for each subgroup of the data defined by the variable(s). Box A demonstrates the use of PoolPrev on a synthetic dataset (described in Section 4).

HierPoolPrev is designed to account for the hierarchical sampling structures that are common in MX studies. It assumes that samples were taken from a number of sites across the study area, and that these sites can be nested within one or more hierarchical levels (e.g. sites within villages, villages within regions, regions within provinces, etc.). HierPoolPrev estimates prevalence by fitting an intercept-only hierarchical generalised linear mixed model with a logistic link function. The syntax and outputs are very similar to PoolPrev. There is only one additional argument, hierarchy, which requires the user to list the variables that encode the hierarchical structure of the sampling frame (e.g. the names of columns containing site IDs, village IDs, etc.). The output provides the Bayesian posterior mean and CrI for the prevalence but, unlike PoolPrev, does not provide frequentist outputs (i.e. maximum likelihood estimates or likelihood ratio confidence intervals). As with PoolPrev, users can specify their prior belief for the prevalence and specify variables that stratify the dataset into subgroups. If subgroups of the data are specified, estimation of prevalence and random-effect variances proceeds independently for each subgroup. Box A demonstrates the use of HierPoolPrev on a synthetic dataset (described in Section 4).

Our package provides tools for mixed-effect regression models in both frequentist and Bayesian frameworks. PoolReg fits a frequentist mixed- or fixed-effect generalised linear model that adjusts for the sizes of pools, building on the glm function from the stats package (R Core Team, 2020) for fixed-effect models and the glmer function from the lme4 package (Bates et al., 2015) for mixed-effect models. For a model with only fixed effects the output is an S3 object of class glm, while the output for a model with random effects is an S4 object of class glmerMod; both support the same methods (e.g. summary, predict, plot, confint, anova) as any other object returned by the glm or glmer functions.
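As a brief, hedged illustration of this frequentist interface, the sketch below fits a fixed-effect and a mixed-effect model to the synthetic dataset described in Section 4. The dataset name (ExampleData), its column names (Result, NumInPool, Region, Village, Site, Year), the poolSize argument name, and the assumption that Site IDs are unique across villages are all assumptions that should be checked against the package documentation (?PoolReg).

```r
library(PoolTestR)

# Synthetic survey dataset bundled with the package (assumed name and columns)
data("ExampleData", package = "PoolTestR")

# Fixed-effect model: pool-level results regressed on Year and Region,
# adjusting for pool size; returns a glm-type object
fit_fixed <- PoolReg(Result ~ Year + Region,
                     data = ExampleData, poolSize = NumInPool)
summary(fit_fixed)

# Mixed-effect model with random intercepts for the sampling hierarchy;
# returns a glmerMod (lme4) object. If Site IDs are reused across villages,
# a nested term such as (1 | Village/Site) would be needed instead.
fit_mixed <- PoolReg(Result ~ Year + Region + (1 | Village) + (1 | Site),
                     data = ExampleData, poolSize = NumInPool)
summary(fit_mixed)
```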
PoolRegBayes provides functionality to perform the same analyses in a Bayesian framework and returns a brmsfit object. By building on these existing statistical packages, PoolTestR leverages the extensive suite of diagnostic tools available for working with models fitted with these functions and uses paradigms that will be familiar to existing users of R. These frameworks allow for a very broad range of linear models (e.g. polynomial regression, spline regression, Gaussian process models). In addition, PoolTestR includes the function getPrevalence, which provides a convenient way to extract estimates of prevalence from regression models fitted with PoolReg or PoolRegBayes. In many cases getPrevalence is able to detect whether a model includes adjustments for hierarchical random/group effect terms, and automatically estimates prevalence at every level of the sampling hierarchy. Box B applies PoolReg and PoolRegBayes to the same synthetic dataset used to demonstrate PoolPrev, estimating the trend of decline in prevalence over time.

PoolTestR provides a number of approaches to estimating prevalence: frequentist or Bayesian, stratifying by or adjusting for covariates, and adjusting for or ignoring the hierarchical sampling frame (Table 1). We compare these approaches with a large synthetic dataset including 179,092 mosquitoes split into 7,604 pools, sampled across three years with a realistic hierarchical sampling design spanning three regions, 30 villages (10 per region), and 300 sites (10 per village). The synthetic dataset was generated by simulating samples taken from across three regions (A, B, and C) in which the vectors have a low (0.5%), medium (2%), and high (4%) prevalence of the marker of interest. Within each region we choose ten villages, and within each of these villages we choose ten sites to place traps. We sample from the same locations once a year over three years (0, 1, and 2). Prevalence is not uniform within each region or over time. At baseline (year 0), prevalence varies between villages within each region, and between sites within each village. Consequently, though the prevalence is different for each site, two sites within the same village are likely to have a more similar prevalence than two sites in different villages or two sites in different regions. On average, prevalence declines over time (odds ratio of 0.8 per year); however, the rate of change varies between villages. Consequently, two sites in different villages with similar prevalence at baseline may have different prevalence by the third year, and prevalence may go up in some villages. Each year the traps at each site catch a negative binomially distributed number (mean 200, dispersion 5) of vectors, and catch sizes are independent across sites and years. Though a wide range of pool sizes leads to better estimates (Gu et al., 2004), we simulate a simple and commonly used pooling strategy: each year, the catches at each site are pooled into groups of 25, with an additional pool for any remainder (e.g. a catch of 53 vectors will be pooled into two pools of 25 and one pool of three). Each pool is tested once for the marker of interest using a test with perfect sensitivity and specificity. This synthetic dataset is provided with the PoolTestR package and has been used to illustrate the package in Boxes A and B. Box A demonstrates the use of the functions PoolPrev and HierPoolPrev by estimating prevalence stratified by year and region, with or without adjustments for the hierarchical sampling frame.
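In the spirit of Box A (which is not reproduced here), a hedged sketch of these two approaches is shown below. It reuses the assumed dataset and column names from the earlier regression sketch, and assumes that the stratifying columns are supplied as additional unquoted arguments and that the hierarchy argument takes a character vector of column names; check ?PoolPrev and ?HierPoolPrev for the exact interface.

```r
library(PoolTestR)
data("ExampleData", package = "PoolTestR")

# Unadjusted prevalence estimates, stratified by Region and Year
prev <- PoolPrev(ExampleData, Result, NumInPool, Region, Year)

# Estimates adjusted for the hierarchical sampling frame:
# sites nested within villages (Region and Year remain stratifying variables)
prev_hier <- HierPoolPrev(ExampleData, Result, NumInPool,
                          hierarchy = c("Village", "Site"),
                          Region, Year)

prev
prev_hier
```

Because these functions perform Bayesian inference via Stan, the calls above may take some time to run.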
Box B demonstrates the functions PoolReg and PoolRegBayes, fitting logistic-type regression models with Year and Region as covariates, with and without adjustment for the sampling hierarchy, in frequentist and Bayesian frameworks. The predictions for each region and year, and for a selection of villages, are compared in Figures 1 and 2. Since our example has an adequate sample size, the estimates obtained in a frequentist framework are very similar to estimates obtained in a Bayesian framework with non-informative priors; compare, for example, the frequentist and Bayesian outputs of PoolPrev in Figures 1 and 2. For this synthetic dataset, in which there is moderate variability between sites and villages, accounting for hierarchical sampling increases the fraction of confidence/credible intervals that contain the true value; the improvement can be seen whether stratifying by other covariates (Box A) or adjusting for them (Box B). Stratifying the data by Year and Region produces estimates with wider confidence/credible intervals than a regression framework; compare, in Figures 1 and 2, PoolPrev to PoolRegBayes (without adjustment for hierarchy), or HierPoolPrev to PoolRegBayes (with adjustments for hierarchy). This effect is particularly pronounced where prevalence is low (e.g. region A). Consequently, without adjustments for hierarchical sampling, using a regression framework further reduces the fraction of the confidence/credible intervals that contain the true value (Figures 1 and 2). However, the regression model with adjustments for the hierarchical sampling frame (hatched square in Figures 1 and 2) has the narrowest intervals amongst the models that consistently include the true prevalence values, and thus performs best on this synthetic dataset.

PoolTestR is a user-friendly, flexible, and extensible R package for estimating prevalence and regression modelling with tests on pooled samples. PoolTestR offers substantial advantages over existing software for pooled testing such as PoolScreen, especially for sampling designs such as those used in MX surveys. While each analysis in PoolScreen requires many manual steps to import data and export results, with PoolTestR users can utilise the diverse ecosystem of R packages to easily import data, perform multiple analyses, and export results in a number of common formats (e.g. csv, xls). PoolTestR also expands the range of statistical analyses that can be performed, allowing estimates to be adjusted for hierarchical sampling designs and providing tools for a very broad category of mixed-effect regression models in Bayesian or frequentist frameworks.

When conducting MX surveys, collecting a simple random sample of vectors across a large area is operationally infeasible. Many MX studies will therefore involve a hierarchical sampling frame with representative sample sites distributed across the study area. If the study area and the distance between traps are smaller than the movement range of the vector being studied, it may be fair to assume that all traps are sampling from the same population, and that there is no variation in prevalence between trap sites. In such cases the method implemented in PoolScreen and the PoolPrev function in our package are appropriate for estimating prevalence.
However, when aggregating data to estimate prevalence in a study area substantially larger than the typical movement range of the vectors, these methods, which do not account for heterogeneity between sample sites, may produce unreasonably narrow confidence intervals that often fail to contain the true value (Birkner et al., 2013). In these situations, the function HierPoolPrev or a hierarchical mixed-effect regression model fitted with PoolReg or PoolRegBayes should be preferred. While accounting for hierarchical sampling frames will increase the width of confidence intervals for prevalence estimates, failing to do so may result in confidence intervals that frequently fail to include the true prevalence value.

Molecular xenomonitoring surveys utilising pooled testing are often paired with human surveys utilising unpooled testing. Though regression modelling is commonly used in the human components of these surveys (e.g. Subramanian et al. (2020)), regression modelling with pooled MX data has been hampered by the lack of suitable software; the only method for examining differences between groups in PoolScreen is to stratify the data, and the regression models in binGroup2 cannot account for hierarchical sampling frames. The regression functions in the PoolTestR package fill this gap, allowing users to identify variables associated with higher prevalence, test the statistical significance of these associations, and produce predictive models. Moreover, where appropriate, regression models can produce more precise estimates (narrower confidence intervals) than simple stratification. Regression models could also be used for predictive prevalence mapping; however, further development of our package is required to allow for models with spatial correlation. There are currently no tools that readily allow for the comparison or synthesis of both the human and MX components of surveys (e.g. model predictions of prevalence in humans based on prevalence in vectors). This functionality may be added in future releases of PoolTestR.

As with all models, estimates made with PoolTestR will be unreliable if the implicit assumptions about the test characteristics, sampling frame, population, or covariates are substantially violated. All the models in our package currently assume that the test applied to each pool has perfect sensitivity and specificity. One of many things that can affect test sensitivity and specificity, and therefore the accuracy of the prevalence estimates, is pool size. Statistical methods that estimate test sensitivity or specificity from the data, or that test for the existence of a dilution effect in larger pools, have been proposed (Tu et al., 1995) and may be incorporated in future versions of PoolTestR. All of our models also assume that vector catch numbers are fixed by the sampling design or that the 'stopping rules' used to determine sample sizes are independent of prevalence and covariates. A common 'stopping rule' is to test as many vectors as possible from each site. The relationship between vector density, transmission rate, and prevalence depends on complex host, agent, and environment relationships, and so there may be correlation between catch numbers and disease prevalence at a site. If all vectors trapped at each site are tested, this kind of correlation, if not accounted for, will result in biased estimates. While a predetermined sample size for each site could avoid this bias, it may require sampling to be prolonged at some sites and vectors to be discarded at others.
The best way to detect and adjust for bias related to sampling designs that do not use a predetermined sample size remains an open question.

Another key consideration in MX studies is the appropriate sample size and pooling strategy. When designing a sampling strategy using pooled samples, there is a trade-off between cost and precision. Using fewer, larger pools makes it cheaper and faster to conduct laboratory tests, but greater numbers of smaller pools improve the power of the data and the precision of estimates. For a fixed number of pools, distributing the specimens into pools of a fixed size is likely to result in poorer estimates than using pools of various sizes (Gu et al., 2004). However, there are currently no practical rules or tools for determining optimal or near-optimal strategies for sampling or pooling. A tool that, given a sampling design, testing constraints, and catch size, determines the optimal number of pools and the optimal distribution of samples across these pools would further improve the cost-effectiveness of pooled MX surveys and may be incorporated in future updates of PoolTestR.

PoolTestR is a software package born out of the need for a simple, flexible and freely available tool to analyse large and complex datasets to estimate infection prevalence from pooled samples. PoolTestR allows users to conduct the most common analyses required for MX, whilst being able to adjust for hierarchical sampling designs and conduct a broad range of regression analyses. MX is increasingly being used as a surveillance method around the world, and we hope that PoolTestR can assist researchers and program managers in disease surveillance in a range of disease control settings and other contexts using pooled data.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Arnold et al., 2011. A comparison of pooled and individual bird sampling for detection of Salmonella in commercial egg laying flocks.
Bates et al., 2015. Fitting Linear Mixed-Effects Models Using lme4.
Birkner et al., 2013. Evaluation of a Frequentist Hierarchical Model to Estimate Prevalence When Sampling from a Large Geographic Area Using Pool Screening.
Chen and Swallow, 1990. Using Group Testing to Estimate a Proportion, and to Test the Binomial Model.
Farrington, 1992. Estimating prevalence by group testing using generalized linear models.
Gabry et al., 2020. rstantools: Tools for Developing R Packages Interfacing with 'Stan'.
Gu et al., 2004. Assessment of arbovirus vector infection rates using variable size pooling.
Hepworth, 2005. Confidence intervals for proportions estimated by group testing with groups of unequal size.
Hitt et al., 2020. binGroup2: Identification and Estimation using Group Testing.
Important experimental parameters for determining infection rates in arthropod vectors using pool screening approaches.
Lau et al., 2016. Lymphatic Filariasis Elimination in American Samoa: Evaluation of Molecular Xenomonitoring as a Surveillance Tool in the Endgame.
Pilotte et al., 2017. The Current Status of Molecular Xenomonitoring for Lymphatic Filariasis and Onchocerciasis.
R Core Team, 2020. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
Rao et al., 2016. Programmatic Use of Molecular Xenomonitoring at the Level of Evaluation Units to Assess Persistence of Lymphatic Filariasis in Sri Lanka.
Rodríguez-Pérez et al., 2006. Large-scale entomologic assessment of Onchocerca volvulus transmission by poolscreen PCR in Mexico.
Schmaedick et al., 2014. Molecular Xenomonitoring Using Mosquitoes to Map Lymphatic Filariasis after Mass Drug Administration in American Samoa.
Shaffer, 2004. A Unified Approach to Analyzing Nest Success.
Stan Development Team, 2020a. RStan: the R interface to Stan.
Stan Development Team, 2020b. Stan Modeling Language Users Guide and Reference Manual.
Subramanian et al., 2020. Molecular xenomonitoring as a post-MDA surveillance tool for global programme to eliminate lymphatic filariasis: Field validation in an evaluation unit in India.
Sunjaya and Sunjaya, 2020. Pooled Testing for Expanding COVID-19 Mass Surveillance.
Tu et al., 1995. On the informativeness and accuracy of pooled testing in estimating prevalence of a rare disease: Application to HIV screening.
Van Domelen, 2020. pooling: Fit Poolwise Regression Models.
Walter et al., 1980. Estimation of infection rates in populations of organisms using pools of variable size.
World Health Organisation, 2019. Global programme to eliminate lymphatic filariasis: progress report.