title: Bayesian analysis of tests with unknown specificity and sensitivity
authors: Gelman, A.; Carpenter, B.
date: 2020-05-25
doi: 10.1101/2020.05.22.20108944

When testing for a rare disease, prevalence estimates can be highly sensitive to uncertainty in the specificity and sensitivity of the test. Bayesian inference is a natural way to propagate these uncertainties, with hierarchical modeling capturing variation in these parameters across experiments. Another concern is that the people in the sample may not be representative of the general population. Without strong assumptions, statistical adjustment cannot correct for selection bias in an opt-in sample, but multilevel regression and poststratification can at least adjust for known differences between sample and population. We demonstrate these models with code in R and Stan and discuss their application to a controversial recent study of COVID-19 antibodies in a sample of people from the Stanford University area. Wide posterior intervals make it impossible to evaluate the quantitative claims of that study regarding the number of unreported infections. For future studies, the methods described here should facilitate more accurate estimates of disease prevalence from imperfect tests performed on non-representative samples.

Correction of diagnostic tests for false positives and false negatives is a well-known probability problem. When the base rate is low, estimates become critically sensitive to misclassifications (Hemenway, 1997). This issue hit the news (Lee, 2020) with a recent study of coronavirus antibodies in a population with a low incidence rate. This is a problem where not fully accounting for uncertainty can make a big difference in scientific conclusions and potential policy recommendations.

In early April, 2020, Bendavid et al. (2020a) recruited 3330 residents of Santa Clara County, California and tested them for COVID-19 antibodies. 50 people tested positive, yielding a raw estimate of 1.5%. After corrections, Bendavid et al. (2020a) reported an uncertainty range of 2.5% to 4.2%, implying that the number of infections in the county was between 50 and 85 times the count of cases reported at the time. Using an estimate of the number of coronavirus deaths in the county up to that time, they computed an implied infection fatality rate (IFR) of 0.12-0.2%, much lower than IFRs in the range of 0.5%-1% that had been estimated from areas with outbreaks of the disease.

The estimates from Bendavid et al. (2020a) were controversial, and it turned out that they did not correctly account for uncertainty in the specificity (true negative rate) of the test. There was also concern about the adjustment they performed for non-representativeness of their sample. Thus, the controversy arose from statistical adjustment and assessment of uncertainty. In the present article we lay out a Bayesian framework to clarify these issues, setting up and fitting models in the probabilistic programming language Stan (Stan Development Team, 2020).
There is a long literature on Bayesian measurement-error models (for example, Gustafson, 2003) and their application to diagnostic testing (Greenland, 2009); our contribution here is to set up the model, supply code, and consider multilevel regression and poststratification, influence of hyperpriors, and other challenges that arise in the problem of estimating population prevalence using test data from a sample of people.

Testing for a rare disease is a standard textbook example of conditional probability, famous for the following counterintuitive result: Suppose a person tests positive for a disease, based on a test that has a 95% accuracy rate, and further suppose that this person is sampled at random from a population with a 1% prevalence rate. Then what is the probability that he or she actually has the disease? The usual intuition suggests that the conditional probability should be approximately 95%, but it is actually much lower, as can be seen from a simple calculation of base rates, as suggested by Gigerenzer et al. (2007). Imagine you test 1000 people: with a 1% prevalence rate, we can expect that 10 have the disease and 990 do not. Then, given the assumed 95% accuracy rate (assuming this applies to both specificity and sensitivity of the test), we would expect 0.95 * 10 = 9.5 true positives and 0.05 * 990 = 49.5 false positives; thus, the proportion of positive tests that are true positives is 9.5/(9.5 + 49.5) = 0.16, a number that is difficult to make sense of without visualizing the hypothetical populations of true positive and false positive tests.

A related problem is to take the rate of positive tests and use it to estimate the prevalence of the disease. If the population prevalence is π and the test has a specificity of γ and a sensitivity of δ, then the expected frequency of positive tests is p = (1 − γ)(1 − π) + δπ. So, given known γ, δ, and p, we can solve for the prevalence,

π = (p + γ − 1) / (δ + γ − 1).   (1)

If the properties of the test are known, but p is estimated from a random sample, we can obtain a simple classical estimate by starting with a confidence interval for p and then propagating it through the formula. For example, Bendavid et al. (2020a) report 50 positive tests out of 3330, which corresponds to an estimate p̂ = 50/3330 = 0.015 with standard error √(0.015(1 − 0.015)/3330) = 0.002. Supposing that their test had a specificity of γ = 0.995 and a sensitivity of δ = 0.80, this yields an estimate of (0.015 + 0.995 − 1)/(0.80 + 0.995 − 1) = 0.013 with standard error 0.002/(0.80 + 0.995 − 1) = 0.003.

Two immediate difficulties arise with the classical approach. First, if the observed rate p̂ is less than 1 − γ, the false positive rate of the test, then the estimate from (1) becomes meaninglessly negative. Second, if there is uncertainty in the specificity and sensitivity parameters, it becomes challenging to propagate uncertainty through the nonlinear expression (1).

We can resolve both these problems with a simple Bayesian analysis (Gelman, 2020). First, suppose that estimates and uncertainties for sensitivity and specificity have been externally supplied. The model is then:

y ∼ binomial(n, p)
p = (1 − γ)(1 − π) + δπ
γ ∼ normal(μ_γ, σ_γ), restricted to the unit interval
δ ∼ normal(μ_δ, σ_δ), restricted to the unit interval,

and Stan code is given in Appendix A.1. For simplicity, we have specified independent normal prior distributions (with the constraint that both parameters are restricted to the unit interval), but it would be easy enough to use other distributions if prior information were available in that form.
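As a concrete illustration, here is a minimal Stan program consistent with the model just described. The variable names, and the choice to pass the externally supplied prior means and standard deviations as data, are ours for this sketch; the code in Appendix A.1 may differ in its details.

// Sketch of the simple model: prevalence (π in the text) with externally
// supplied normal priors on specificity (γ) and sensitivity (δ).
data {
  int<lower=0> y_sample;        // number of positive tests
  int<lower=0> n_sample;        // number of people tested
  real mu_spec;                 // prior mean for specificity
  real<lower=0> sigma_spec;     // prior sd for specificity
  real mu_sens;                 // prior mean for sensitivity
  real<lower=0> sigma_sens;     // prior sd for sensitivity
}
parameters {
  real<lower=0, upper=1> prev;  // prevalence
  real<lower=0, upper=1> spec;  // specificity, constrained to the unit interval
  real<lower=0, upper=1> sens;  // sensitivity, constrained to the unit interval
}
model {
  real p_pos = (1 - spec) * (1 - prev) + sens * prev;  // P(positive test)
  y_sample ~ binomial(n_sample, p_pos);
  spec ~ normal(mu_spec, sigma_spec);   // normal priors, truncated by the declared bounds
  sens ~ normal(mu_sens, sigma_sens);
}

Fitting this program with y_sample = 50, n_sample = 3330, and whatever externally supplied summaries are available for the test's specificity and sensitivity gives posterior draws for prev directly, from which any interval can be computed.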
In the example of Bendavid et al. (2020a), prior information on specificity and sensitivity was given in the form of previous trials, yielding the model,

y ∼ binomial(n, p)
p = (1 − γ)(1 − π) + δπ
y_γ ∼ binomial(n_γ, γ)
y_δ ∼ binomial(n_δ, δ).   (2)

Fitting this model given the data in that report (y_γ/n_γ = 399/401 and y_δ/n_δ = 103/122) yields a wide uncertainty for π. Stan code is in Appendix A.2. Figure 1a shows the joint posterior simulations for π and γ: uncertainty in the population prevalence is in large part driven by uncertainty in the specificity. Figure 1b shows the posterior distribution for π, which reveals that the data and model are consistent with prevalences as low as 0 and as high as nearly 2%.

The asymmetric posterior distribution with its hard bound at zero suggests that the usual central 95% interval will not be a good inferential summary. Instead we use the highest posterior density or shortest posterior interval, for reasons discussed in Liu, Gelman, and Zheng (2015). The resulting 95% interval for π is (0, 1.8%), which is much different from the interval (1.1%, 2.0%) reported by Bendavid et al. (2020a). As a result, the substantive conclusion from that earlier report has been overturned. From the given data, the uncertainty in the specificity is large enough that the data do not supply strong evidence of a substantial prevalence.

The above analysis reveals that inference about specificity is key to precise estimation of low prevalence rates. In the second version of their report, Bendavid et al. (2020b) include data from 13 specificity studies and 3 sensitivity studies. Sensitivity and specificity can vary across experiments, so it is not appropriate to simply pool the data from these separate studies. Instead, we set up a hierarchical model where, for any study j, the specificity γ_j and sensitivity δ_j are drawn from normal distributions on the logistic scale,

logit(γ_j) ∼ normal(μ_γ, σ_γ)
logit(δ_j) ∼ normal(μ_δ, σ_δ).

Stan code is given in Appendix A.3. This is different from model (2) in that, with data from multiple calibration studies, the hyperparameters μ and σ can be estimated from the data. In general it could make sense to allow correlation between γ_j and δ_j (Guo, Riebler, and Rue, 2017), but the way the data are currently available to us, specificity and sensitivity are estimated from separate studies and so there is no information about such a correlation. When coding the model, we use the convention that j = 1 corresponds to the study of interest, with the other j's representing calibration studies in which specificity or sensitivity is estimated from samples of known status. One could also consider alternatives to the logistic transform, which allows the unbounded normal distribution to map to the unit interval but might not be appropriate for tests where the specificity can actually reach the value of 1.

Figure 2: Summary of inferences for the prevalence, specificity, and sensitivity of the Bendavid et al. (2020b) experiment, along with inferences for the hyperparameters characterizing the distribution of specificity and sensitivity on the logistic scale. (a) For the model with weak priors for σ_γ and σ_δ, the posterior inference for the prevalence, π, is highly uncertain. This is driven by the wide uncertainty for the sensitivity, which is driven by the large uncertainty in the hyperparameters for the sensitivity distribution. (b) Stronger priors on σ_γ and σ_δ have the effect of regularizing the specificity and sensitivity parameters, leading to narrower intervals for π, the parameter of interest in this study. The hyperparameters μ and σ are on the logistic scale, so they are difficult to interpret without transformation.
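The following Stan sketch shows one way to express this hierarchical model, using the convention above that study 1 is the study of interest. The variable names and data layout are our assumptions for illustration, not necessarily those of the code in Appendix A.3.

// Hierarchical model for varying specificity and sensitivity across studies.
// Study j = 1 is the study of interest; the remaining studies are calibration
// experiments. If the survey's own assay has no separate calibration data,
// its counts can be set to zero so it contributes no calibration likelihood.
data {
  int<lower=0> y_sample;               // positive tests in the survey
  int<lower=0> n_sample;               // people tested in the survey
  int<lower=1> J_spec;                 // number of specificity studies
  array[J_spec] int<lower=0> y_spec;   // true negatives observed
  array[J_spec] int<lower=0> n_spec;   // known-negative samples tested
  int<lower=1> J_sens;                 // number of sensitivity studies
  array[J_sens] int<lower=0> y_sens;   // true positives observed
  array[J_sens] int<lower=0> n_sens;   // known-positive samples tested
}
parameters {
  real<lower=0, upper=1> prev;         // prevalence in the population of interest
  real mu_logit_spec;                  // mean logit specificity across studies
  real mu_logit_sens;                  // mean logit sensitivity across studies
  real<lower=0> sigma_logit_spec;      // sd of logit specificity across studies
  real<lower=0> sigma_logit_sens;      // sd of logit sensitivity across studies
  vector[J_spec] logit_spec;           // study-level logit specificities
  vector[J_sens] logit_sens;           // study-level logit sensitivities
}
model {
  real p_pos = (1 - inv_logit(logit_spec[1])) * (1 - prev)
               + inv_logit(logit_sens[1]) * prev;
  y_sample ~ binomial(n_sample, p_pos);
  y_spec ~ binomial_logit(n_spec, logit_spec);   // calibration likelihoods
  y_sens ~ binomial_logit(n_sens, logit_sens);
  logit_spec ~ normal(mu_logit_spec, sigma_logit_spec);
  logit_sens ~ normal(mu_logit_sens, sigma_logit_sens);
  sigma_logit_spec ~ normal(0, 1);     // weak half-normal hyperpriors; the text
  sigma_logit_sens ~ normal(0, 1);     // below argues for tighter scales (e.g., 0.2)
}

A non-centered parameterization of logit_spec and logit_sens would likely sample more efficiently given how few calibration studies inform the hyperparameters, but the centered form above is easier to read.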
We fit the above hierarchical model to the data from Bendavid et al. (2020b), assigning weak normal⁺(0, 1) priors to σ_γ, σ_δ (using the notation normal⁺ for the truncated normal distribution constrained to be positive). The results are shown in Figure 2a. The 95% posterior interval for the prevalence is now (0.00, 0.35). Where does that upper bound come from: how could an underlying prevalence of 35% be possible, given that only 1.5% of the people in the sample tested positive? The answer can be seen from the huge uncertainty in the sensitivity parameter, which in turn comes from the possibility that σ_δ is very large. The trouble is that the sensitivity information in these data comes from only three experiments, which is not enough to get a good estimate of the underlying distribution. This is a problem discussed by Guo, Riebler, and Rue (2017).

The only way to make progress here is to constrain the sensitivity parameters in some way. One possible strong assumption is to assume that σ_δ is some small value. This could make sense in the current context, as we can consider it as a relaxation of the assumption of Bendavid et al. (2020b) that σ_δ = 0. We also have reason to believe that specificity will not vary much between experiments, so we will apply a soft constraint to the variation in specificities as well. We replace the weakly informative normal⁺(0, 1) priors on σ_γ, σ_δ with something stronger, σ_γ, σ_δ ∼ normal⁺(0, 0.2). To get a sense of what this means, start with the point estimate from Figure 2a of μ_δ, which is 1.36. Combining with this new prior implies that there's a roughly 2/3 chance that the sensitivity of the assay in a new experiment is in the range logit⁻¹(1.36 ± 0.2), which is (0.76, 0.83). This seems reasonable. Figure 2b shows the results. Our 95% interval for π is now (0.003, 0.021); that is, the infection rate is estimated to be somewhere between 0.3% and 2.1%.

To assess the sensitivity of the above prevalence estimate to the priors placed on σ_γ and σ_δ, we consider the family of prior distributions,

σ_γ ∼ normal⁺(0, τ_γ)
σ_δ ∼ normal⁺(0, τ_δ),

where τ_γ and τ_δ are user-specified hyperparameters. Setting τ_γ and τ_δ to zero would force σ_γ and σ_δ to be zero and would enforce complete pooling, corresponding to the assumption that each test site has identical specificity and sensitivity. As the hyperparameters are increased, σ_γ and σ_δ are allowed to take on larger values, and setting τ_γ and τ_δ to infinity would typically be considered noninformative in the sense of providing the least amount of constraint on the sensitivities and specificities.

In practice, we often use normal⁺(0, 1) priors for hierarchical scale parameters, on the default assumption that the underlying parameters (in this case, the specificities and sensitivities) will probably vary by less than 1 on the logit scale. For this problem, however, a weak prior does not work: as shown in the left panel of Figure 2, the resulting inferences for the sensitivities are ridiculously wide. No, we do not believe these tests could have sensitivities below 50%, yet such a possibility is included in the posterior distribution, and this in turn propagates to inappropriately wide intervals for the prevalence, π. As explained in the previous section, that is why we assigned a stronger prior, using hyperprior parameters τ_γ = τ_δ = 0.2. Figure 3 shows how these hyperprior parameters τ_γ and τ_δ affect inferences for the prevalence, π.
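To build intuition for what a particular value of τ implies, one can simulate from this prior: draw σ_δ, then draw a new study's sensitivity around a central value. The following small Stan sketch is our own illustration (it is not part of the paper's appendices); mu and tau are supplied by the user.

// Prior predictive sketch: what does sigma ~ normal+(0, tau) imply about the
// sensitivity of a new study whose logit sensitivity is centered at mu?
data {
  real mu;              // central logit sensitivity, e.g. 1.36
  real<lower=0> tau;    // hyperprior scale under consideration, e.g. 0.2 or 1
}
generated quantities {
  real sigma = fabs(normal_rng(0, tau));             // half-normal draw of sigma_delta
  real sens_new = inv_logit(normal_rng(mu, sigma));  // sensitivity in a new experiment
}

Running this with Stan's fixed_param algorithm and mu = 1.36 shows the contrast directly: with tau = 0.2, most simulated sensitivities fall between roughly 0.7 and 0.9, while with tau = 1 they spread over much of the unit interval, which is the behavior that produces the wide prevalence intervals in Figure 2a.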
The posterior median of π is not sensitive to the scales τ_γ and τ_δ of the hyperpriors, but the uncertainty in that estimate, as defined by the central posterior 90% intervals, is influenced by these settings. In particular, in the graphs on the right, when the sensitivity hyperprior parameter τ_δ is given a high value, the upper end of the interval is barely constrained. The lower end of the interval is fairly stable, as long as the specificity hyperprior parameter τ_γ is not given an artificially low value.

When τ_γ and τ_δ are too low, the variation in specificity and sensitivity is constrained to be nearly zero, all values are pooled, and uncertainty is artificially deflated. As the hyperprior parameters are increased, the uncertainty in prevalence increases. This gets out of hand when the hyperprior for sensitivity is increased, because there are only three data points to inform the distribution it controls. This is an example of the general principle that wide hyperpriors on hierarchical scale parameters can pull most of the probability mass into areas of wide variation and dominate the data, leading to inflated uncertainty. Around the middle of these ranges, the posterior intervals are not as sensitive to variation in the hyperpriors. We would consider values τ_γ = τ_δ = 0.5 to be weakly informative for this example, in that they are roughly consistent with inter-site variation in sensitivity in the range 73% to 99.3% and in specificity in the range 88% to 99.75%.

Figure 3: Each panel shows a plot of the posterior median and central 90% posterior interval of the prevalence, π, as a function of τ_γ and τ_δ, the prior scales for the specificity and sensitivity hyperparameters, σ_γ and σ_δ. The posterior median of prevalence is not sensitive to τ_γ and τ_δ, but the endpoints of the 90% interval show some sensitivity. It is possible to use a weak hyperprior on the scale of the specificity distribution, σ_γ: this makes sense given that there are 13 prior specificity studies in the data. For the scale of the sensitivity distribution, σ_δ, it is necessary to use a prior scale of 0.5 or less to effectively rule out the possibility of extremely high prevalence corresponding to an unrealistically low sensitivity parameter δ.

In addition we could consider priors on μ_γ and μ_δ. In this particular example, once we have constrained the variation in the specificities and sensitivities, enough data are available to estimate these population means without strong priors, but if this were a concern we could use moderately informative priors. For example, the normal(4, 2) distribution puts 2/3 of the prior mass in the range 4 ± 2, which, after undoing the logistic transformation, corresponds to (0.881, 0.998) on the probability scale (logit⁻¹(2) ≈ 0.881 and logit⁻¹(6) ≈ 0.998). In general it is good practice to include informative priors for all parameters; we kept them out of this particular model just to keep the code a little bit cleaner.

The complexity of this sensitivity analysis might seem intimidating: if Bayesian inference is this difficult and this dependent on priors, maybe it's not a good idea? We would argue that the problem is not as difficult as it might look.
The steps taken in Sections 2 and 3 show the basic workflow: We start with a simple model, then add hierarchical structure. For the hierarchical model we started with weak priors on the hyperparameters and examined the inferences, which made us realize that we had prior information (that specificities and sensitivities of the tests were not so variable), which we then incorporated into the next iteration of the model. Performing the sensitivity analysis was fine, and it helped us understand the inferences better, but it was not necessary for us to get reasonable inferences.

Conversely, non-Bayesian analyses would not be immune from this sensitivity to model choices, as is illustrated by the mistake made by Bendavid et al. (2020b) of treating specificity and sensitivity as not varying at all, that is, setting σ_γ = σ_δ = 0 in our notation. An alternative could be to use the calibration studies to get point estimates of σ_γ and σ_δ, but then there would still be the problem of accounting for uncertainty in these estimates, which would return the researchers to the need for some sort of external constraint or bound on the distribution of the sensitivity parameters δ_j, given that only three calibration studies are available here to estimate these. This in turn suggests the need for more data or modeling of the factors that influence the test's specificity and sensitivity. In short, the analysis shown in Figure 3 formalizes a dependence on prior information that would arise, explicitly or implicitly, in any reasonable analysis of these data.

5. Extensions of the model

5.1. Multilevel regression and poststratification (MRP) to adjust for differences between sample and population

Bendavid et al. (2020a,b) compared the demographics of the 3330 people they tested to those of the general population of Santa Clara County and found differences in the distributions of sex, age, ethnicity, and zip code of residence. It would be impossible to poststratify the raw data on 2 sexes, 4 ethnicity categories, 4 age categories, and 58 zip codes, as the resulting 1856 cells would greatly outnumber the positive tests in the data. They obtained population estimates by adjusting for sex × ethnicity × zip code, but their analysis is questionable, first because they did not adjust for age, and second because of noisy weights arising from the variables they did adjust for. To obtain stable estimates while adjusting for all these variables, we would recommend applying a multilevel model to the exposure probability, thus replacing the constant π in the above models with something like this logistic regression:

π_i = logit⁻¹(β_1 + β_2 male_i + β_3 x_zip[i] + α^eth_{eth[i]} + α^age_{age[i]} + α^zip_{zip[i]}),   (3)

where male is a variable that takes on the value 0 for women and 1 for men; eth[i], age[i], and zip[i] index person i's ethnicity category, age category, and zip code; x_zip is a zip-code-level predictor; and the varying intercepts α^eth, α^age, and α^zip are given normal(0, σ_eth), normal(0, σ_age), and normal(0, σ_zip) distributions. In the regression model (3), it is important to include the predictor x_zip, which in this example might be percent Latino or average income in the zip code. Otherwise, with so many zip codes, the multilevel model will just partially pool most of the zip code adjustments to zero, and not much will be gained from the geographic poststratification. The importance of geographic predictors is well known in the MRP literature; see, for example, Caughey and Warshaw (2019).
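As with the earlier models, regression (3) and the poststratification step described below can be written directly in Stan. The following sketch is our own illustration of the structure: the variable names, the data layout, and the simplification of giving specificity and sensitivity simple externally supplied normal priors (rather than the full hierarchical component) are our assumptions, and the code in Appendix A.4 may differ.

// MRP sketch: logistic regression (3) for individual-level prevalence,
// combined with the test-error model, plus poststratification in
// generated quantities.
data {
  int<lower=0> N;                            // number of respondents
  array[N] int<lower=0, upper=1> y;          // test result for each respondent
  vector<lower=0, upper=1>[N] male;          // 0 = female, 1 = male
  int<lower=1> N_eth;
  int<lower=1> N_age;
  int<lower=1> N_zip;
  array[N] int<lower=1, upper=N_eth> eth;
  array[N] int<lower=1, upper=N_age> age;
  array[N] int<lower=1, upper=N_zip> zip;
  vector[N_zip] x_zip;                       // zip-level predictor, e.g. percent Latino
  int<lower=1> J;                            // number of poststratification cells
  vector<lower=0, upper=1>[J] male_cell;     // cell definitions: sex x eth x age x zip
  array[J] int<lower=1, upper=N_eth> eth_cell;
  array[J] int<lower=1, upper=N_age> age_cell;
  array[J] int<lower=1, upper=N_zip> zip_cell;
  vector<lower=0>[J] N_cell;                 // population count in each cell
  real mu_spec;                              // externally supplied test properties
  real<lower=0> sigma_spec;
  real mu_sens;
  real<lower=0> sigma_sens;
}
parameters {
  vector[3] beta;                            // intercept, male, zip-level predictor
  real<lower=0> sigma_eth;
  real<lower=0> sigma_age;
  real<lower=0> sigma_zip;
  vector[N_eth] a_eth;
  vector[N_age] a_age;
  vector[N_zip] a_zip;
  real<lower=0, upper=1> spec;
  real<lower=0, upper=1> sens;
}
model {
  vector[N] p_prev = inv_logit(beta[1] + beta[2] * male + beta[3] * x_zip[zip]
                               + a_eth[eth] + a_age[age] + a_zip[zip]);
  vector[N] p_pos = (1 - spec) * (1 - p_prev) + sens * p_prev;
  y ~ bernoulli(p_pos);
  a_eth ~ normal(0, sigma_eth);
  a_age ~ normal(0, sigma_age);
  a_zip ~ normal(0, sigma_zip);
  beta[1] ~ normal(0, 2.5);    // with centered predictors, this plays the role of the
                               // prior on the centered intercept described below
  beta[2] ~ normal(0, 0.5);    // prior on the male coefficient
  // beta[3] (zip-level predictor) is left with Stan's default flat prior here
  sigma_eth ~ normal(0, 0.5);
  sigma_age ~ normal(0, 0.5);
  sigma_zip ~ normal(0, 0.5);
  spec ~ normal(mu_spec, sigma_spec);
  sens ~ normal(mu_sens, sigma_sens);
}
generated quantities {
  // poststratified prevalence: sum_j N_j p_j / sum_j N_j
  vector[J] p_cell = inv_logit(beta[1] + beta[2] * male_cell + beta[3] * x_zip[zip_cell]
                               + a_eth[eth_cell] + a_age[age_cell] + a_zip[zip_cell]);
  real p_avg = dot_product(N_cell, p_cell) / sum(N_cell);
}

Fitting this to data (or, as in the text, to simulated data to check the computation) yields posterior draws of p_avg, the poststratified county-level prevalence.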
The above model is a start; it could be improved by including interactions, following the general ideas of Ghitza and Gelman (2013). In any case, once this model has been fit, it can be used to make inferences for disease prevalence for all cells in the population, and these cell estimates can then be summed, weighting by known population totals (in this case, the number of people in each sex × ethnicity × age × zip code category in the population) to get inferences for the prevalence in the county. Here we are implicitly assuming that the data represent a random sample within poststratification cells. In addition, priors are needed for σ_eth, σ_age, σ_zip, and β, along with the hierarchical specificity and sensitivity parameters from the earlier model.

We code the model in Stan; see Appendix A.4. Unfortunately the raw data from Bendavid et al. are not currently available, so we fit the model to simulated data to check the stability of the computation. We use a normal(0, 2.5) prior for the centered intercept β_1 + β_2 male + β_3 x_zip (corresponding to an approximate 95% prior interval of (0.7%, 99.3%) for the probability that an average person in the sample has the antibody), a normal(0, 0.5) prior for β_2, and normal⁺(0, 0.5) priors for σ_eth, σ_age, σ_zip. These priors allow the prevalence to vary moderately by these poststratification variables. Once the above model has been fit, it implies inferences for the prevalence in each of the 2 × 4 × 4 × 58 poststratification cells; as discussed by Johnson (2020), these can be averaged to get a population prevalence,

p_avg = Σ_j N_j p_j / Σ_j N_j,

where N_j is the number of people in cell j in the general population and p_j is the prevalence in cell j as computed from the logistic model. We perform this summation in the generated quantities block of the Stan model in Appendix A.4.

The aforementioned Santa Clara County study is just one of many recent COVID-19 antibody surveys. Other early studies were conducted in Boston, New York, Los Angeles, and Miami, and in various places outside the United States, and we can expect many more in the future. If the raw data from these studies were combined, it should be possible to estimate the underlying prevalences from all these studies using a hierarchical model, allowing specificity, sensitivity, and prevalence to vary by location, and adjusting for non-sampling error where possible. Such an analysis is performed by Levesque and Maybury (2020) using detailed information on the different tests used in different studies. We will also be seeing more studies of changing infection rates over time. Stringhini et al. (2020) perform such an analysis of weekly surveys in Geneva, Switzerland, accounting for specificity and sensitivity and poststratifying by sex and age.

We have so far assumed that test results are binary, but additional information can be gained from continuous measurements that make use of partial information when data are near detection limits (Gelman, Chew, and Shnaidman, 2004; Bouman, Bonhoeffer, and Regoes, 2020). Further progress can be made by performing different sorts of tests on study participants or retesting observed positive results. Another promising direction is to include additional information on people in the study, for example from self-reported symptoms. Some such data are reported in Bendavid et al. (2020b), although not at the individual level.
With individual-level symptom and test data, a model with multiple outcomes could yield substantial gains in efficiency compared to the existing analysis using only a single positive/negative test result on each participant.

As with any statistical analysis, alternative approaches are possible that would use the same information and give similar results. In Section 2, it was necessary to account for uncertainty in all three parameters, while respecting the constraint that all three probabilities had to be between 0 and 1. We assume that both these aspects of the model could be incorporated into a non-Bayesian approach by working out the region in the space of (π, γ, δ) that is consistent with the data and then constructing a family of tests which could be inverted to create confidence regions. This could be expanded into a multilevel model as in Section 3 by considering the specificities and sensitivities of the different experiments as missing data and averaging over their distribution, but still applying non-Bayesian inference to the resulting hyperparameters. The wide uncertainty intervals from the analysis in Section 3 suggest that some constraints, regularization, or additional information on the hyperparameters would be necessary to get stable inferences here, no matter what statistical approach is used.

Fithian (2020) performs a non-Bayesian analysis of the data from Bendavid et al. (2020b), coming to the same basic conclusion that we do, demonstrating that the calibration data are incompatible with a model of constant specificity and that, once the specificity is allowed to vary, the observed rate of positive tests in the Santa Clara study does not allow rejection of the null hypothesis of zero infection rate. Had it been possible to reject zero, this would not be the end of the story: at that point one could invert a family of tests to obtain a confidence region, as noted above. Finally, some rough equivalent to the poststratification adjustment in Section 5.1 could be performed using a non-Bayesian weighting approach, using some smoothing to avoid the noisiness of raw poststratification weights. Similarly, non-Bayesian methods could be used to fit regressions allowing prevalence to vary over location and time.

7. Discussion

7.1. Limitations of the statistical analysis

Epidemiology in general, and disease testing in particular, features latent parameters with high levels of uncertainty, difficulty in measurement, and uncertainty about the measurement process as well. This is the sort of setting where it makes sense to combine information from multiple studies, using Bayesian inference and hierarchical models, and where inferences can be sensitive to assumptions. The biggest assumptions in this analysis are, first, that the historical specificity and sensitivity data are relevant to the current experiment; and, second, that the people in the study are a representative sample of the general population. We addressed the first concern with a hierarchical model of varying sensitivities and specificities, and we addressed the second concern with multilevel regression and poststratification on demographics and geography. But this modeling can take us only so far.
If there is hope or concern that the current experiment has unusual measurement properties, or that the sample is unrepresentative in ways not accounted for in the regression, then more information or assumptions need to be included in the model, as in Campbell et al. (2020).

The other issue is that there are choices of models, and tuning parameters within each model. Sensitivity to the model is apparent in Bayesian inference, but it would arise with any other statistical method as well. For example, Bendavid et al. (2020a) used an (incorrectly applied) delta method to propagate uncertainty, but this is problematic when the sample size is low and probabilities are near 0 or 1. Bendavid et al. (2020b) completely pooled their specificity and sensitivity experiments, which is equivalent to setting σ_γ and σ_δ to zero. And their weighting adjustment has many arbitrary choices. We note these not to single out these particular authors but rather to emphasize that, at least for this problem, all statistical inferences involve user-defined settings.

For the models in the present article, the most important user choices are: (a) what data to include in the analysis, (b) prior distributions for the hyperparameters, and (c) the structure and interactions to include in the MRP model. For these reasons, it would be difficult to set up the model as a plug-and-play system where users can just enter their data, push a button, and get inferences. Some active participation in the modeling process is required, which makes sense given the sparseness of the data. When studying populations with higher prevalences and with data that are closer to random samples, more automatic approaches might be possible.

As discussed above, the data from the Santa Clara study are consistent with zero infection rate and a wide variation in specificity and sensitivity across tests, and the numbers are also consistent with the claims made in Bendavid et al. (2020a,b). That does not mean anyone thinks the true infection rate is zero. It just means that more data, assumptions, and subject-matter knowledge are required. That's ok: people usually make lots of assumptions in this sort of laboratory assay. It's common practice to use the manufacturer's numbers on specificity, sensitivity, detection limit, and so forth, and not worry about that level of variation. It's only when you are estimating a very low underlying rate that the statistical challenges become so severe.

For now, we do not think the data support the claim that the number of infections in Santa Clara County was between 50 and 85 times the count of cases reported at the time, or the implied interval for the IFR of 0.12-0.2%. These numbers are consistent with the data, but the data are also consistent with a near-zero infection rate in the county. The data of Bendavid et al. (2020a,b) do not provide strong evidence about the number of people infected or the infection fatality ratio; the number of positive tests in the data is just too small, given uncertainty in the specificity of the test. Going forward, the analyses in this article suggest that future studies should be conducted with full awareness of the challenges of measuring specificity and sensitivity, that relevant variables be collected on study participants to facilitate inference for the general population, and that (deidentified) data be made accessible to external researchers.
References

Bendavid et al. (2020a). COVID-19 antibody seroprevalence in Santa Clara County, California. Version 1 of the report.
Bendavid et al. (2020b). COVID-19 antibody seroprevalence in Santa Clara County, California. Version 2 of the report.
Bouman, Bonhoeffer, and Regoes (2020). Estimating seroprevalence with imperfect serological tests: A cutoff-free approach.
Campbell et al. (2020). Bayesian adjustment for preferential testing in estimating the COVID-19 infection fatality rate: Theory and methods.
Caughey and Warshaw (2019). Public opinion in subnational politics.
Fithian (2020). Statistical comment on the revision of Bendavid et al.
Gelman (2020). Simple Bayesian analysis inference of coronavirus infection rate from the Stanford study in Santa Clara county.
Gelman, Chew, and Shnaidman (2004). Bayesian analysis of serial dilution assays.
Ghitza and Gelman (2013). Deep interactions with MRP: Election turnout and voting patterns among small electoral subgroups.
Gigerenzer et al. (2007). Helping doctors and patients make sense of health statistics.
Greenland (2009). Bayesian perspectives for epidemiologic research: III. Bias analysis via missing-data methods.
Guo, Riebler, and Rue (2017). Bayesian bivariate meta-analysis of diagnostic test studies with interpretable priors.
Gustafson (2003). Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments.
Hemenway (1997). The myth of millions of annual self-defense gun uses: A case study of survey overestimates of rare events.
Johnson (2020). Estimating seroprevalence with data from an imperfect test on a convenience sample.
Lee (2020). Two antibody studies say coronavirus infections are more common than we think.
Levesque and Maybury (2020). A note on COVID-19 seroprevalence studies: A meta-analysis using hierarchical modelling.
Liu, Gelman, and Zheng (2015). Simulation-efficient shortest probability intervals.
Stan Development Team (2020). Stan Modeling Language User's Guide and Reference Manual.
Stringhini et al. (2020). Repeated seroprevalence of anti-SARS-CoV-2 IgG antibodies in a population-based sample.