key: cord-0632845-6m3j9jpf authors: Baxter, Jonathan title: Bayesian Beta-Binomial Prevalence Estimation Using an Imperfect Test date: 2020-09-11 journal: nan DOI: nan sha: a2be47245aa94b07a69f73c15e15cac1600a914d doc_id: 632845 cord_uid: 6m3j9jpf Following [Diggle 2011, Greenland 1995], we give a simple formula for the Bayesian posterior density of a prevalence parameter based on unreliable testing of a population. This problem is of particular importance when the false positive test rate is close to the prevalence in the population being tested. An efficient Monte Carlo algorithm for approximating the posterior density is presented, and applied to estimating the Covid-19 infection rate in Santa Clara county, CA using the data reported in [Bendavid 2020]. We show that the true Bayesian posterior places considerably more mass near zero, resulting in a prevalence estimate of 5,000--70,000 infections (median: 42,000) (2.17% (95CI 0.27%--3.63%)), compared to the estimate of 48,000--81,000 infections derived in [Bendavid 2020] using the delta method. A demonstration, with code and additional examples, is available at testprev.com. We consider the problem of estimating disease prevalence in a population of interest using an unreliable test. Following [2, 3] , we take a Bayesian approach and model uncertainty in the test characteristics (sensitivity and specificity) as well as the uncertainty due to a finite testing population. We extend [2] by deriving a simple expression for the posterior prevalence probabilty density in the common case of a test that has been validated against a number of known positive and negative subjects. A Monte Carlo algorithm for computing the prevalence posterior is presented, and applied to Covid-19 infection data from Santa Clara county CA [1] , where the false positive test calibration rate (0.5%) was close to the measured prevalence (1.5%). The posterior distribution on prevalence in this case acquires a second mode at zero, which results in substantial broadening of the credible interval on prevalence. The appearance of a second mode also explains why local approximation methods such as the delta-method can fail to capture all the posterior variance. Suppose we know the test false-positive rate (1 -specificity) u and sensitivity v. Denote the (unknown) population prevalence by θ. The probability p of a positive test is the probability of a positive test given the subject has the disease, plus the probability of a positive test given the subject is disease-free: The probability of k positive tests out of n subjects tested follows a binomial distribution with parameter p: By application of Bayes' rule, the distribution of θ is given by: where Pr(θ) is our prior probability on θ and Choosing a uniform prior on θ, and applying dθ = dp v−u : is the incomplete Beta function, and for notational brevity we drop the dependence on k and n from B(v; k + 1, n − k + 1) and just write B(v). Substituting (1), (2) and (5) into (3) yields: 3 Estimated test performance Equation (6) expresses the distribution over population prevalence θ given known test characteristics u and v. However, the false-positive rate u and sensitivity v are usually themselves estimates based on validation against known positive and negative subjects. Specifically, suppose the test has been validated with ku false positives out of nu known negative samples, and kv true positives out of nv known positive samples. Assuming a beta prior on u with parameters αu, βu, the posterior density on u given the validation data is proportional to a beta density with parameters ku + αu, nu − ku + βu. Abusing notation for clarity, write Betau(u) for this density and similarly Betav(v) for the corresponding density on v. Let Betap(u + θ(v − u)) denote the density at u + θ(v − u) of the beta distribution with parameters k + 1, n − k + 1. With this notation, integrating out u, v from (6), we obtain the following expression for the posterior distribution on prevalence θ that accounts for uncertainty in the test characteristics: The domain of integration has been restricted to the region v − u > 0, reflecting the fact that a test with false-positive rate u in excess of sensitivity v is not a usable test (this can also be thought of as an adjustment of the joint posterior on u and v to capture a dependence between u and v). To the author's knowledge there is no closed-form expression for the righthand-side of (7) in terms of hypergeometric or related functions. In the next section we will describe an algorithm for evaluating the integral using Monte Carlo integration. Let I(θ) denote the integral on the right-hand-side of (7). Draw N samples ui, vi from Betau(u) and Betav(v) (any pairs such that ui > vi are rejected and resampled). Then with error ∼ 1 √ N , This expression is calculated for a discrete grid of values θj and then normalized to generate the posterior density Pr(θ). The samples ui, vi can be reused for each estimate I(θj), which allows computation of to be performed once and reused. To avoid numerical problems in the case B(u) ≈ B(v) ≈ 1, observe that B(u; k + 1, n − k + 1) = B(k + 1, n − k + 1) − B(1 − u; n − k + 1, k + 1) where B(k +1, n−k +1) is the complete beta function with parameters k +1, n−k +1. Thus The right-hand-side of (9) is the difference of two values close to zero when the left-hand-side is the difference of two values close to 1. Differencing two small numbers has better numerical stability than differencing two numbers that may be indistinguishable from 1 within machine precision. With this substitution, Algorithm 1 gives pseudocode for computing the full prevalance posterior given the results of an imperfect test. The prevalence of SARS-CoV-2 antibodies in Santa Clara county, CA, was recently measured using an imperfect serological test [1] . Three different calculations were performed based on different estimates of the test's sensitivity and specificity. For brevity, we will focus on their scenario 3 (similar conclusions apply to the other two scenarios). Figure 1 : Prevalence posterior for Santa Clara county testing data described in [1] . The x-axis is measured in bps, or units of 0.01%. Figure 1a : uniform priors on test false positive rate and sensitivity. Figure 1b: Beta(1, 99) prior on test false positive. The relevant parameters for the PPP estimation algorithm are as follows: • k = 50 positive tests out of n = 3330 subjects tested. • ku = 2 false positives out of nu = 401 known negative samples. • kv = 103 correct positives out of nv = 122 known positive samples. The authors used the delta method [4] to estimate standard errors for the population prevalence, which accounts for sampling error and propagates the uncertainty in the test sensitivity and specificity. However, the delta method provides only a local approximation to the posterior density, and with small counts this can result in underestimated variance. The raw positive test count k = 50 was also reweighted to account for demographic differences between the test sample and the overall Santa Clara population, yielding a considerably larger population-adjusted count of k = 94. The reweighting was applied before the sensitivity/specificity adjustments, which also has potential to underestimate variance in the final result. After all adjustments, the authors reported a prevalence estimate of 2.75% (95CI 2.01%-3.49%). In order to avoid potentially biasing our results, we applied the PPP algorithm to the raw counts, and then the population reweighting was applied to the estimated posterior prevalence distribution. To compare with the delta method used in [1] , we reran their methodology adjusting first for uncertainty in the test characteristics, and then for population. Omitting the details, we arrive at a prevalence estimate of 2.81% (95CI 1.74% -3.88%). Observe that while the lower bound of the CI has dropped from 2.01% to 1.74%, it is still well above zero. Figure 1a shows the prevalence posterior density computed using the PPP algorithm with uniform priors (αu = βu = αv = βv = 1), a Monte Carlo sample size N of 10,000, and a grid size M of 10,000. The estimated prevalence of Covid-19 in Santa Clara county is 1.89% (95CI 0.09% -3.51%), with a notably reduced lower bound on the credible interval of 0.09%. This translates to an infected population range of 1,800-68,000, considerably wider than the 38,000-76,000 range derived using the delta method in [1] . The shape of the prevalence posterior near zero helps explain the difference between the two results. The full Bayes approach assigns considerably more mass towards zero, such that the posterior distribution becomes bimodal. This is driven by two factors: • The uncertainty in the false positive rate u, which is derived from only ku = 2 false positves out of nu = 401 known negative samples (0.5%). • The underlying infection prevalence rate (estimated at 1.5%) is close to the test false positive rate (0.5%). The uncertainty in false positive rate is exacerbated by using a uniform prior on u, which is arguably too conservative in this case. Figure 1b shows the posterior generated with αu = 1, βu = 99, which corresponds to a beta prior with mean and standard deviation of 1%. Note that the bimodality is almost eliminated, but the credible interval is still considerably wider than that derived via the delta method: 2.17% (95CI 0.27%-3.63%). This corresponds to an infected range of 5,000-70,000 with median 42,000. Reliably estimating infection prevalence with an unreliable diagnostic test is of particular importance during the Covid-19 pandemic, expecially when the infection prevalence is not much greater than the test's false positive rate. Following [2, 3] , we derived a simple expression (7) for the posterior prevalence distribution given the results of an unreliable diagnostic test. A Monte Carlo algorithm (Posterior Prevalence Probability or PPP) for efficiently computing the posterior was given. Application of the algorithm to the Santa Clara county, CA Covid-19 test data in [1] generates credible intervals with considerably more mass at zero than the delta method used in the same paper. This is primarily due to the appearance of a second mode in the posterior density at zero, which is not captured by local methods such as the delta method. A demonstration (with code and additional examples) is available at https: //testprev.com. COVID-19 Antibody Seroprevalence in Estimating Prevalence Using an Imperfect Test Basic Methods for Sensitivity Analysis of Biases A Note on the Delta Method Thanks to David Joerg and Charlie Graham for helpful comments on an earlier draft of this paper.