key: cord-0231275-3g8hrtne title: Confidence Intervals for Seroprevalence authors: DiCiccio, Thomas J.; Ritzwoller, David M.; Romano, Joseph P.; Shaikh, Azeem M. date: 2021-03-27 journal: nan DOI: nan sha: cba66669dd75215ac9a0a4abcb6aea3a33a2c41c doc_id: 231275 cord_uid: 3g8hrtne

This paper concerns the construction of confidence intervals in standard seroprevalence surveys. In particular, we discuss methods for constructing confidence intervals for the proportion of individuals in a population infected with a disease using a sample of antibody test results and measurements of the test's false positive and false negative rates. We begin by documenting erratic behavior in the coverage probabilities of standard Wald and percentile bootstrap intervals when applied to this problem. We then consider two alternative sets of intervals constructed with test inversion. The first set of intervals are approximate, using either asymptotic or bootstrap approximation to the finite-sample distribution of a chosen test statistic. We consider several choices of test statistic, including maximum likelihood estimators and generalized likelihood ratio statistics. We show with simulation that, at empirically relevant parameter values and sample sizes, the coverage probabilities for these intervals are close to their nominal level and are approximately equi-tailed. The second set of intervals are shown to contain the true parameter value with probability at least equal to the nominal level, but can be conservative in finite samples.

Effective public health policy requires accurate measurement of the spread of infectious diseases (Fauci et al., 2020; Peeling et al., 2020). Seroprevalence surveys, in which antibody tests are administered to samples of individuals from populations of interest, are a practical and widely applied strategy for assessing the progression of a pandemic (Krammer and Simon, 2020; Alter and Seder, 2020).
However, antibody tests, which detect the presence of viral antibodies in blood samples, are imperfect.[1] Accounting for the variation in the results of seroprevalence surveys induced by this imperfection is important for informative assessment of the uncertainty in measurements of the spread of infectious diseases. In this paper, we study the construction of confidence intervals in standard seroprevalence surveys. Given the public interest in explicit representations of disease incidence, our objective is to analyze the accuracy of various methods of constructing confidence intervals, so that results from empirical analyses can be reported with statistical precision. We demonstrate that some methods based on test inversion offer advantages relative to more standard confidence interval constructions in terms of the accuracy and validity of their coverage probabilities.

In a standard seroprevalence survey, the proportion of a population that has been infected with a disease is a smooth function of the parameters of three independent binomial trials. Although it may be expected that standard approaches to confidence interval construction are well suited to such a simple parametric problem, in Section 2 we demonstrate with simulation that standard Wald and percentile bootstrap confidence intervals have erratic coverage probabilities at empirically relevant parameter values and sample sizes when applied to this problem. In fact, as documented in Brown et al. (2001), erratic coverage probabilities for confidence intervals constructed using standard methods surface even in the context of inference on a single binomial parameter. Bootstrap (and other) methods that are typically second-order correct in continuous problems may not achieve this accuracy in discrete problems.[2]
Additionally, when a binomial random variable has parameter value near zero or one, even first-order approximations to its limiting distribution are not normal, while many standard methods for constructing confidence intervals rely, either explicitly or implicitly, on a normal approximation holding. As the parameter of interest in a seroprevalence survey is a function of three binomial parameters, inference in this setting is more challenging than for a single binomial proportion.

To address the erratic coverage probabilities of standard confidence interval constructions, we consider several alternative approaches based on test inversion. A test inversion confidence interval for a parameter θ consists of the set of points θ_0 for which the null hypothesis H(θ_0): θ = θ_0 is not rejected. For parameters θ where the corresponding null hypothesis H(θ_0) is simple, the application of test inversion is straightforward. However, when the corresponding null hypothesis is composite, as in the case of seroprevalence, the application of test inversion is not immediate.

[1] An early systematic review of the accuracy of SARS-CoV-2 antibody tests is given in Deeks et al. (2020), highlighting several methodological limitations. False positive and negative rates for five leading SARS-CoV-2 immunoassays are measured in Ainsworth et al. (2020). Estimates of false positive rates ranged from 0.1% to 1.1%; estimates of false negative rates ranged from 0.9% to 7.3%.
[2] Usually, claims of second-order correctness are made based on Edgeworth expansions (Hall, 2013). However, in discrete settings, Cramér's condition, a necessary condition for the application of an Edgeworth expansion, fails, and second-order accuracy may not be achievable. For example, atoms in the binomial distribution based on n trials have order n^{-1/2}, so expansions to order n^{-1} must account for this discreteness.
Thus, in Section 3 we explore the general problem of test inversion for parameters whose corresponding null hypotheses are composite. We consider both methods based on asymptotic or bootstrap approximation and methods with finite-sample guarantees. In the latter case, a maximization of p-values over a nuisance parameter space is required, as in Berger and Boos (1994) and Silvapulle (1996). In practice, this maximization is carried out over a discrete grid. We provide a refinement to such an approximation that maintains the finite-sample coverage requirement. We take particular care in requiring that the confidence intervals that we develop behave well at both endpoints; that is, we require that they are equi-tailed.[3]

In Section 4, we apply these approaches to construct confidence intervals for seroprevalence. We consider several choices of test statistic, including maximum likelihood estimators and generalized likelihood ratio statistics. We demonstrate with simulation that the intervals based on asymptotic or bootstrap approximation have coverage probabilities that, at empirically relevant parameter values and sample sizes, are close to, but potentially below, the nominal level and are approximately equi-tailed. By contrast, the finite-sample valid construction results in intervals that are longer on average but always have coverage probabilities satisfying the coverage requirement. We contextualize our analysis with data used to estimate seroprevalence at early stages of the 2019 SARS-CoV-2 pandemic. In particular, as a running example, we measure coverage probabilities and average interval lengths for each of the methods we consider at sample sizes and parameter values close to the estimates and sample sizes of Bendavid et al. (2020a), a preprint posted on medRxiv on April 11, 2020.[4]
This preprint estimates that the number of coronavirus cases in Santa Clara County, California on April 3-4, 2020 was more than fifty times larger than the number of officially diagnosed cases, and as a result received widespread coverage in the popular and scientific press (Kolata, 2020; Mallapaty, 2020). The methods and design of this study, including the reported confidence intervals, were questioned by many researchers (Eisen and Tibshirani, 2020), prompting the release of a revised draft on April 27, 2020, which we refer to as Bendavid et al. (2020b), that integrated additional data.[5] Our analysis highlights statistical challenges in seroprevalence surveys at early stages of the spread of infectious diseases, when disease incidence is low and close to the uncertain error rates of new diagnostic technologies.

This paper contributes to the literature on inference in seroprevalence surveys (Rogan and Gladen, 1978; Hui and Walter, 1980; Walter and Irwig, 1988); see Jewell (2004) for a general introduction to epidemiological statistics. More broadly, we contribute to the large literature on test inversion. The classical duality between tests and confidence intervals is discussed in Chapter 3 of Lehmann and Romano (2005). Bootstrap approaches to confidence interval construction based on estimating nuisance parameters are developed in Efron (1981), DiCiccio and Romano (1990), and Carpenter (1999).

[3] A 1 − α confidence interval is equi-tailed if the probabilities that the parameter exceeds the upper endpoint or is below the lower endpoint of the interval are both near or below α/2. That is, an equi-tailed 1 − α confidence interval should be given by the set of points satisfying both an upper and a lower confidence bound, each at level 1 − α/2.
[4] This preprint has subsequently been published as Bendavid et al. (2021).
Conservative approaches to confidence interval construction that maximize p-values over an appropriate nuisance parameter space are considered in Berger and Boos (1994) and Silvapulle (1996). For the problem considered in this paper, Toulis (2020) uses test inversion based on a particular choice of test statistic, though the resulting confidence interval is based on projection. Cai et al. (2020) is more closely related to one of the approaches we consider, and we discuss some important differences in Section 4.[6] Gelman and Carpenter (2020) take a Bayesian approach to the problem studied in this paper, and give a complementary analysis of uncertainty quantification in Bendavid et al. (2020a,b).

A standard seroprevalence survey entails the collection of antibody test results from three samples of individuals of sizes n_1, n_2, and n_3, with n = (n_1, n_2, n_3). The first sample is selected at random from the population under study. All individuals in the second sample have not had the disease of interest, and all individuals in the third sample have had the disease of interest.[7] We let X = (X_1, X_2, X_3) denote the numbers of positive antibody test results in the corresponding samples. It is assumed that each X_i has a binomial distribution with success probability p_i and is independent of the other samples.[8] The quantities 1 − p_2 and p_3 are referred to as the specificity and sensitivity of the test, respectively. We assume that the test has diagnostic value in the sense that p_2 < p_3, and so p_1 necessarily satisfies p_2 ≤ p_1 ≤ p_3.[9] Thus, the parameter p = (p_1, p_2, p_3) lies in the parameter space

    Ω = {p ∈ [0, 1]^3 : p_2 ≤ p_1 ≤ p_3, p_2 < p_3}.

We consider confidence intervals for the probability π that an individual randomly selected from the population under study has had the disease. By the law of total probability, p_1 = p_2(1 − π) + p_3 π, and so

    π = π(p) = (p_1 − p_2) / (p_3 − p_2).

We refer to π as seroprevalence.
A natural estimate of π is given by π̃_n = π(p̃_n), where p̃_n = (p̃_{n,1}, p̃_{n,2}, p̃_{n,3}) and p̃_{n,i} = X_i/n_i is the usual empirical frequency for group i. We let the maximum likelihood estimator (MLE) of p for the model p ∈ Ω be denoted by p̂_n = (p̂_{n,1}, p̂_{n,2}, p̂_{n,3}).[10] Accordingly, the MLE of π for the model p ∈ Ω is given by π̂_n = π(p̂_n).

The most obvious approach to constructing confidence intervals for π is to approximate the finite-sample distribution of π̂_n with its limiting normal distribution. The variance of the limiting normal distribution of π̂_n is given by

    V_{π̂_n}(p) = [σ_1²(p_1) + (1 − π)² σ_2²(p_2) + π² σ_3²(p_3)] / (p_3 − p_2)²,    (2)

where σ_i²(p_i) = p_i(1 − p_i)/n_i. This leads to the standard Wald or delta method confidence interval

    π̂_n ± z_{1−α/2} √(V_{π̂_n}(p̂_n)),

where z_{1−α/2} is the 1 − α/2 quantile of the standard normal cumulative distribution function Φ(·). This construction was used in Bendavid et al. (2020a). As outlined in the introduction, the Wald interval may perform poorly in finite samples due to the discreteness of the data or the proximity of parameter values to the boundaries of their spaces. To address some of these issues, Bendavid et al. (2020b) apply the percentile bootstrap confidence interval developed in Efron (1981).

[7] For example, in Bendavid et al. (2020a), the second sample was composed of blood samples taken before the COVID-19 epidemic and the third sample was composed of blood samples taken from patients who had recovered from confirmed cases of COVID-19.
[8] We assume the sample sizes are small relative to the population size, so that the difference between sampling with and without replacement is negligible.
[9] Note that some of the methods developed in Section 4 will not require p_2 < p_3.
[10] Typically, p̃_n and p̂_n agree, with the exception occurring if p̃_n ∉ Ω.
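To fix ideas, the delta method interval above can be computed directly from the three sample proportions. The sketch below is our own Python (helper names are ours, not the paper's code), evaluated at the counts X = (50, 2, 103) and n = (3300, 401, 122) described in the running example below; for these counts the empirical frequencies lie in Ω, so p̃_n and p̂_n coincide.

```python
import math

def seroprevalence(p1, p2, p3):
    # pi(p) = (p1 - p2) / (p3 - p2), from the law of total probability
    return (p1 - p2) / (p3 - p2)

def wald_interval(x, n, alpha=0.05):
    """Delta method (Wald) interval for pi; x and n are length-3 tuples."""
    p = [xi / ni for xi, ni in zip(x, n)]
    pi_hat = seroprevalence(*p)
    # sigma_i^2(p_i) = p_i (1 - p_i) / n_i
    s2 = [q * (1 - q) / ni for q, ni in zip(p, n)]
    # variance formula (2) for the limiting distribution of the estimator
    var = (s2[0] + (1 - pi_hat) ** 2 * s2[1] + pi_hat ** 2 * s2[2]) / (p[2] - p[1]) ** 2
    z = 1.959963984540054  # 1 - alpha/2 standard normal quantile for alpha = 0.05
    half = z * math.sqrt(var)
    return pi_hat, (pi_hat - half, pi_hat + half)

# Counts from Bendavid et al. (2020a): X = (50, 2, 103), n = (3300, 401, 122)
pi_hat, (lo, hi) = wald_interval((50, 2, 103), (3300, 401, 122))
print(pi_hat, lo, hi)  # pi_hat is approximately 0.012, as reported in the text
```

The point estimate matches the value π̂_n = 0.012 reported below; we make no claim that the interval endpoints reproduce Table 1 to the digit.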
A refinement of the percentile bootstrap interval, called the BC_a interval (Efron, 1987), is also applicable to this problem, with bias and acceleration constants estimated with the formulae given in Efron (1987) and DiCiccio and Romano (1995).

The Wald and bootstrap intervals are approximate. In contrast, it may be desirable to construct intervals that ensure coverage of at least 1 − α in finite samples, particularly if there are concerns that the finite-sample distribution of π̂_n is not well approximated by a normal distribution. A simple, but crude, approach to constructing finite-sample valid confidence intervals is projection. In particular, suppose that R_{1−α} is a joint confidence region for p of nominal level 1 − α. The projection method simply constructs the confidence interval I_{1−α} = {π(p) : p ∈ R_{1−α}}. The chance that π ∈ I_{1−α} is bounded below by the chance that p ∈ R_{1−α}. Thus, if R_{1−α} has guaranteed coverage of 1 − α, then so does I_{1−α}. For example, one possible choice of joint confidence region is the rectangle R_{1−α} = ∏_{i=1}^{3} I_{i,1−γ}, where I_{j,1−γ} is a nominal 1 − γ confidence interval for p_j and γ is taken to satisfy (1 − γ)³ = 1 − α.[11] In this case, the computational cost of the projection interval I_{1−α} is minimal, as π(p) is monotone increasing in p_1 and monotone decreasing in each of p_2 and p_3 as p varies over the parameter space Ω. Projection intervals are easy to implement, but are generally wide and conservative, in that the true coverage is often larger than the nominal level.

To assess the finite-sample performance of the delta method, bootstrap, and projection confidence intervals, we estimate their coverage probabilities and average lengths at parameterizations close to the sample sizes and estimates of Bendavid et al. (2020a). In this study, n_1 = 3300 participants were recruited for serologic testing for SARS-CoV-2 antibodies. The total number of positive tests was X_1 = 50.
The authors use n_2 = 401 pre-COVID era blood samples to measure the specificity of their test, of which only X_2 = 2 samples tested positive. Similarly, the authors use n_3 = 122 blood samples from confirmed COVID-19 patients, of which X_3 = 103 samples tested positive.[12] The realization of the MLE of π for these data is π̂_n = 0.012.[13] Table 1 reports the nominal 95% confidence intervals constructed with the standard approaches discussed above.

For each parameter (e.g., p_1), we simulate replicates of X at each value of a grid around the estimated value of the parameter, holding the other five parameters fixed at their estimated values (e.g., p̂_{n,2}, p̂_{n,3}, n_1, n_2, n_3). For each method at each combination of parameter values and sample sizes, we compute the proportion of replicates for which the true value of π (i.e., the value of π associated with the parameterization) is below, contained in, or above the corresponding confidence interval with nominal level 1 − α, where α = 0.05.

[11] In our implementation, we apply the standard Clopper and Pearson (1934) confidence intervals for p_j, as they have guaranteed coverage in finite samples. Other choices exist, however, and in particular the intervals recommended in Brown et al. (2001) may perform well. Alternatively, the region R_{1−α} can be constructed by inverting likelihood ratio tests, but this would incur a significantly larger computational cost. A related approach is developed in Toulis (2020).

Notes to Table 1: Table 1 reports the delta method, percentile bootstrap, BC_a bootstrap, and projection confidence intervals, at nominal level 95%, computed on data from Bendavid et al. (2020a). Estimates of the average length and coverage for these intervals at the sample sizes n and estimated values p̂_n from this study are also displayed. Estimates of average length and coverage are taken over 100,000 bootstrap replicates of X at the sample sizes n and the estimated parameters p̂_n from this study.
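As a concrete illustration of the rectangle projection with Clopper and Pearson (1934) component intervals, the following self-contained sketch (our own code, not the authors' implementation; the exact component intervals are obtained by bisection on the binomial tail) evaluates the projection at the counts above. Because π(p) is monotone in each coordinate, only two corners of the rectangle need to be evaluated, with the result clipped to [0, 1].

```python
import math

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p), summing lgamma-based log-pmfs
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        logpmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                  + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(logpmf)
    return min(total, 1.0)

def clopper_pearson(x, n, gamma):
    """Exact 1 - gamma Clopper-Pearson interval for a binomial proportion."""
    def solve(f):  # bisection for a function with f(0) > 0 > f(1)
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if f(mid) > 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    # lower endpoint solves P(X >= x | p) = gamma / 2
    lower = 0.0 if x == 0 else solve(lambda p: gamma / 2 - (1 - binom_cdf(x - 1, n, p)))
    # upper endpoint solves P(X <= x | p) = gamma / 2
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) - gamma / 2)
    return lower, upper

def projection_interval(x, n, alpha=0.05):
    gamma = 1 - (1 - alpha) ** (1 / 3)  # so that (1 - gamma)^3 = 1 - alpha
    (l1, u1), (l2, u2), (l3, u3) = (clopper_pearson(xi, ni, gamma) for xi, ni in zip(x, n))
    # pi(p) is increasing in p1 and decreasing in p2 and p3, so the extremes
    # of pi over the rectangle sit at opposite corners; clip to [0, 1].
    lo = max(0.0, (l1 - u2) / (u3 - u2))
    hi = min(1.0, (u1 - l2) / (l3 - l2))
    return lo, hi

lo, hi = projection_interval((50, 2, 103), (3300, 401, 122))
print(lo, hi)
```

At these counts the lower projection endpoint is clipped at zero, consistent with the conservatism of projection intervals discussed above.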
Figure 1 displays the results of this Monte Carlo experiment for each interval construction at parameter values around p̂_{n,1}, p̂_{n,2}, n_1, and n_2.[14] The black dots display one minus the proportion of replicates for which the realized confidence interval contains the true value of π, i.e., one minus the estimated coverage of the confidence interval. Additionally, Table 1 reports estimates of the coverage and average length of each interval taken over 100,000 bootstrap replicates at the sample sizes n and estimated values p̂_n from Bendavid et al. (2020a).

We find that the delta method and bootstrap intervals are quite liberal. In most cases the estimated coverage of a nominal 95% interval is below 90%. The estimated coverage decreases sharply as n_2 and p_1 become small and is not equi-tailed, in the sense that the proportions of replicates that fall below and above the confidence intervals are not approximately equal. By contrast, the projection method intervals are quite conservative. They are approximately 45% longer than the delta method intervals at the sample sizes n and estimated values p̂_n from Bendavid et al. (2020a). These findings motivate the development of approximate and finite-sample valid alternative methods for constructing confidence intervals that have less erratic coverage probabilities.

In this section, we consider both approximate and finite-sample valid approaches to the general problem of constructing test-inversion confidence intervals for parameters θ, where the corresponding null hypothesis H(θ_0): θ = θ_0 is composite. To this end, we require a more general notation. Suppose data X follow a general parametric model indexed by a parameter (θ, ϑ) in parameter space Ω̄. The parameter of interest θ is real-valued, and the nuisance parameter ϑ is finite-dimensional.

[14] There is little variation in the coverage probabilities for parameter values around p_3 and n_3, so we omit the results of this experiment for the sake of clarity.

Notes to Figure 1: Figure 1 displays estimates of the coverage probabilities of the delta method (π̂_n, ∆), percentile bootstrap (π̂_n, pb), BC_a bootstrap (π̂_n, bca), and projection intervals at parameter values close to the estimate p̂_n and sample size n of Bendavid et al. (2020a) as specified in Section 2. The nominal coverage probability is 0.95 and is denoted by the horizontal dotted line. The black dots denote one minus the proportion of replicates for which the true value of π falls in the realized confidence intervals, i.e., one minus the estimated coverage probability. The purple squares and blue triangles denote the proportions of replicates that fall below and above the realized confidence intervals, respectively. The vertical dotted line denotes the estimated value of p̂_{n,1}, p̂_{n,2} or sample size n_1, n_2 for Bendavid et al. (2020a).

[12] The specificity and sensitivity samples combine data provided by the test manufacturer and additional tests run at Stanford. We refer the reader to the statistical appendices of Bendavid et al. (2020a,b) for further details. In Bendavid et al. (2020b) it was revealed that there was an error in the recording of the sensitivity sample, i.e., that there were two fewer positive tests than reported. We adhere to the data as reported in Bendavid et al. (2020a).
[13] Bendavid et al. (2020a) report an alternative estimate of seroprevalence in which the demographics of their sample are weighted to match the overall demographics of Santa Clara County. We briefly discuss the application of the general methods developed in Section 3 to this setting in Section 5, and view further consideration as a useful extension. Gelman and Carpenter (2020) give a Bayesian approach that accommodates sample weights. In contemporaneous work, Cai et al. (2020) also address the case where samples are reweighted according to population characteristics.
For a fixed value θ_0, the parameter space for the nuisance parameter ϑ is denoted by

    Ω̄(θ_0) = {ϑ : (θ_0, ϑ) ∈ Ω̄}.

Observe that for the case of seroprevalence π, we have that θ = π and can assign ϑ = (p_2, p_3). Test inversion reduces the problem of confidence interval construction for θ to the problem of testing H(θ_0): θ = θ_0 against θ > θ_0 and θ < θ_0. A 1 − α/2 lower confidence bound for θ constructed with test inversion is given by the infimum of the set of θ_0 such that H(θ_0) is not rejected at level α/2 against the alternative θ > θ_0, which we denote by L_{1−α/2}. A 1 − α/2 upper bound for θ, U_{1−α/2}, may be constructed analogously by testing H(θ_0) against the alternative θ < θ_0. Thus, a 1 − α confidence interval is given by [L_{1−α/2}, U_{1−α/2}].

It is illustrative to assume that the nuisance parameter ϑ = ϑ_0 is known. In this case, the null hypothesis H(θ_0) is simple, and one-sided tests that control the level at α/2 are easily constructed. Consider the test that rejects H(θ_0) against θ > θ_0 for large values of the test statistic T_n = T_n(X). The cumulative distribution function of T_n is given by F_{n,θ,ϑ}(t) = P_{θ,ϑ}{T_n ≤ t}. Additionally, define the related quantity F_{n,θ,ϑ}(t^-) = P_{θ,ϑ}{T_n < t}, and let t_0 denote the observed value of T_n. The null probability that T_n ≥ t_0 is given by

    q̂_{L,θ_0,ϑ_0} = 1 − F_{n,θ_0,ϑ_0}(t_0^-),    (3)

and is a valid p-value in the sense that the test that rejects when this quantity is ≤ α/2 has size ≤ α/2; see, e.g., Lemma 3.3.1 in Lehmann and Romano (2005).[15] Thus, if F_{n,θ_0,ϑ_0}(t_0^-) is continuous and strictly monotone decreasing in θ_0, then a 1 − α/2 lower confidence bound θ_L may be obtained by solving

    F_{n,θ_L,ϑ_0}(t_0^-) = 1 − α/2.

Similarly, an upper confidence bound θ_U may be obtained by solving F_{n,θ_U,ϑ_0}(t_0) = α/2.[16] Thus, [θ_L, θ_U] is a 1 − α confidence interval for θ. For a single binomial parameter, this construction gives the classical Clopper and Pearson (1934) interval.

[15] Throughout, we denote various p-values by q̂ rather than p̂ because p̂ is reserved for various estimates of binomial parameters.
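The validity of the upper-tail p-value in (3) can be verified exactly in the simplest discrete case, a single binomial with T_n = X and known nuisance. The check below is ours, with illustrative values n = 30 and θ_0 = 0.12 that are not taken from the paper; it enumerates the binomial distribution and confirms P{q̂ ≤ u} ≤ u for several cutoffs u.

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, theta0 = 30, 0.12  # illustrative values, not from the paper
pmf = [binom_pmf(x, n, theta0) for x in range(n + 1)]

# q(x) = P_{theta0}(X >= x): the p-value (3) with T_n = X and known nuisance
upper_tail = [sum(pmf[x:]) for x in range(n + 1)]

for u in [0.01, 0.025, 0.05, 0.1]:
    # exact rejection probability of the test that rejects when q(x) <= u
    rejection_prob = sum(pmf[x] for x in range(n + 1) if upper_tail[x] <= u)
    assert rejection_prob <= u + 1e-12  # validity: P(q <= u) <= u
print("validity check passed")
```

Because the rejection region is an upper tail {X ≥ c} with c the smallest point whose tail probability is at most u, the exact size is the tail probability at c, which is ≤ u; discreteness typically makes the inequality strict.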
If the nuisance parameter ϑ is unknown, then it may be approximated. In particular, if ϑ̂(θ_0) is the MLE for ϑ subject to the constraint θ = θ_0, then the infeasible p-value (3) can be replaced with

    q̂_{L,θ_0,ϑ̂(θ_0)} = 1 − F_{n,θ_0,ϑ̂(θ_0)}(t_0^-),

where F_{n,θ_0,ϑ̂(θ_0)} is approximated either analytically or with the parametric bootstrap.[17] Accordingly, the infeasible confidence interval [θ_L, θ_U] is replaced with the feasible confidence interval [θ̂_L, θ̂_U], where the endpoints θ̂_L and θ̂_U are the values of θ_0 that satisfy

    1 − F_{n,θ̂_L,ϑ̂(θ̂_L)}(t_0^-) = α/2   and   F_{n,θ̂_U,ϑ̂(θ̂_U)}(t_0) = α/2,

respectively. In other words, either Wald or parametric bootstrap tests are constructed for each θ_0, where the distribution of the test statistic T_n is determined under the parameter (θ_0, ϑ̂(θ_0)). This approach was used in DiCiccio and Romano (1990) and DiCiccio and Romano (1995). In this approximate approach, the family of distributions indexed by (θ, ϑ) has been reduced to an approximate least favorable one-dimensional family of distributions governed by the parameter (θ_0, ϑ̂(θ_0)) as θ_0 varies. This approach implicitly orthogonalizes the parameter of interest with respect to the nuisance parameter, so that the effect of estimating the nuisance parameter is negligible to second order, and it typically results in second-order accurate confidence intervals. See Cox and Reid (1987) for a discussion of the role of orthogonal parameterizations for inference about a scalar parameter in the presence of nuisance parameters.

The quality of the coverage probability of the approximate intervals considered in the previous section will depend on the quality of the approximation of ϑ̂(θ_0) to the true value of the nuisance parameter ϑ_0 and on the quality of the analytic or bootstrap approximation to the finite-sample distribution F_{n,θ_0,ϑ̂(θ_0)}. In situations in which the qualities of these approximations are in doubt, e.g., due to discreteness or the proximity of the true parameters to the boundary of their parameter spaces, intervals that ensure coverage of at least 1 − α in finite samples may be desirable. An infeasible approach to constructing such intervals proceeds by taking the supremum of the p-values over all possible values of the nuisance component ϑ, giving

    q̂_{L,θ_0} = sup_{ϑ ∈ Ω̄(θ_0)} q̂_{L,θ_0,ϑ},    (7)

with finite-sample validity following from

    P_{θ_0,ϑ}{q̂_{L,θ_0} ≤ u} ≤ P_{θ_0,ϑ}{q̂_{L,θ_0,ϑ} ≤ u} ≤ u.

Note that p-values of this form may be conservative, as the supremum over ϑ may be obtained at a value far from ϑ_0.[18] To address this issue, one can restrict the space of values for ϑ that are considered by first constructing a 1 − γ confidence region for ϑ. Such an approach is considered in Berger and Boos (1994), Silvapulle (1996), and Romano et al. (2014). This refined approach proceeds as follows. Fix a small number γ and let I_{1−γ} be a 1 − γ confidence region for ϑ.[19] Consider the modified p-value defined by

    q̂_{L,θ_0,I_{1−γ}} = sup_{ϑ ∈ I_{1−γ}} q̂_{L,θ_0,ϑ} + γ.    (8)

The p-value obtained by (8) is valid in finite samples; see Berger and Boos (1994) or Silvapulle (1996). Thus, the one-sided test that rejects when q̂_{L,θ_0,I_{1−γ}} ≤ α/2 leads to a 1 − α/2 lower confidence bound for θ.

The finite-sample valid approach considered in the previous section is infeasible, as it involves computing a supremum over an infinite set Ω̄(θ_0).

[16] Even in the case that the distribution of T_n is discrete, the function F_{θ_0,ϑ_0}(t_0^-) is typically continuous in θ_0 (as in the binomial case). If not, one could use the infimum over θ_0 such that F_{θ_0,ϑ_0}(t_0^-) < 1 − α/2 as a lower bound. Note that, in general, one may wish to test at each endpoint of a reported confidence interval to determine whether it should be a closed or open interval. We simply take the conservative approach and report closed intervals.
[17] An alternative, related approach proceeds by imposing and integrating out a potentially uninformative prior on the nuisance parameter, and then constructs test statistics from the resultant pseudo-likelihood function (see, e.g., Severini (1999) and Datta and Mukerjee (2004)).
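In practice the supremum in (8) is replaced by a maximum over a grid, as discussed next. The sketch below instantiates the Berger and Boos (1994) construction in a deliberately simpler toy problem of our own choosing, not the seroprevalence model: θ = β_1 − β_2 for two independent binomials, T_n the difference in sample proportions, exact enumeration of the null distribution, and a normal-approximation confidence interval for the nuisance β_2 (the paper would use an exact region; this choice is purely illustrative).

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))

def pvalue_lower(theta0, b2, x1, x2, n1, n2):
    """q_{L,theta0,b2} = P(T >= t0) under (b1, b2) = (theta0 + b2, b2),
    for the toy model theta = b1 - b2 with T = X1/n1 - X2/n2."""
    b1 = theta0 + b2
    if not 0.0 <= b1 <= 1.0:
        return 0.0  # this nuisance value is incompatible with theta0
    t0 = x1 / n1 - x2 / n2
    q = 0.0
    for a in range(n1 + 1):          # exact enumeration of the null distribution
        for b in range(n2 + 1):
            if a / n1 - b / n2 >= t0 - 1e-12:
                q += binom_pmf(a, n1, b1) * binom_pmf(b, n2, b2)
    return q

def berger_boos_pvalue(theta0, x1, x2, n1, n2, gamma=0.001, grid_size=200):
    # crude 1 - gamma interval for the nuisance b2 (normal approximation,
    # purely for illustration; z = 3.29 is about the 0.9995 normal quantile)
    phat = x2 / n2
    half = 3.29 * math.sqrt(max(phat * (1 - phat), 1e-9) / n2)
    lo, hi = max(0.0, phat - half), min(1.0, phat + half)
    grid = [lo + (hi - lo) * i / grid_size for i in range(grid_size + 1)]
    # maximum of the p-value over the gridded region, plus gamma, as in (8)
    return min(1.0, max(pvalue_lower(theta0, b2, x1, x2, n1, n2) for b2 in grid) + gamma)

q = berger_boos_pvalue(theta0=0.0, x1=9, x2=3, n1=25, n2=25)
print(q)
```

Replacing the supremum by a grid maximum is exactly the approximation whose discretization error the refinement developed below is designed to control.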
A natural approximation to this approach is to approximate Ω̄(θ_0) with a finite discretization, so that the supremum is replaced by a maximum over a finite set of values on a grid. In particular, if G(θ_0) denotes a finite grid over the space Ω̄(θ_0), then (7) can be approximated by

    q̂_{L,θ_0,G(θ_0)} = max_{ϑ ∈ G(θ_0)} q̂_{L,θ_0,ϑ}.

Similarly, the refinement given in (8) can be approximated by replacing I_{1−γ} with Î_{1−γ}, where Î_{1−γ} denotes a finite grid (or ε-net) approximating I_{1−γ}.[20] We develop a modification to this construction that provably maintains finite-sample Type 1 error control for testing H(θ_0) by directly accounting for the approximation error induced by a finite discretization of Ω̄(θ_0).

Towards this end, we require additional structure. Suppose now that the components of the data X = (X_1, . . . , X_k) are independent, that the distribution of X_i depends on a parameter β_i, with β = (β_1, . . . , β_k) ∈ Ω, and that the family of distributions for X_i has a monotone likelihood ratio in X_i. As before, interest focuses on a real-valued parameter θ = f(β_1, . . . , β_k), with the nuisance parameter ϑ given by β_{−1} = (β_2, . . . , β_k). That is, we assume that the model can be equivalently parameterized by (θ, ϑ) ∈ Ω̄ or through β ∈ Ω, i.e., that the mapping from β to (θ, ϑ) is one-to-one. For a fixed value θ_0, the parameter space for β is given by

    Ω(θ_0) = {β ∈ Ω : f(β) = θ_0}.

As before, let T_n = T_n(X_1, . . . , X_k) be a test statistic for testing H(θ_0), with t_0 denoting its realized value. Assume that T_n is monotone with respect to each component X_i.[21] Let J_{n,β}(·) denote the cumulative distribution function of T_n for the β-parametrization, so that J_{n,β}(t) = P_β{T_n ≤ t}, and let β̂(θ_0) denote the MLE for β subject to the constraint that θ = θ_0.

[20] In contemporaneous work, Cai et al. (2020) use this construction to develop confidence intervals for seroprevalence that ensure finite-sample Type 1 error control up to the error induced by the finiteness of G(θ_0).
In this case, for example, we can represent the approximate p-value defined in Section 3.2 as q̂_{L,θ_0,ϑ̂(θ_0)} = 1 − J_{n,β̂(θ_0)}(t_0^-), and similarly for the other p-values previously introduced.

We replace the supremum over I_{1−γ} in (8) with a finite maximum while maintaining Type 1 error control. Consider a partition of the values of ϑ in I_{1−γ} into r regions E_1, . . . , E_r. In our implementation, each region is given by a hyperrectangle of the form ∏_{i=2}^{k} [β_i(j), β̄_i(j)], though this is not essential. For each region E_j, let β_{−1}(j) = (β_2(j), . . . , β_k(j)) be the vector giving the smallest values that all but the first component of β take on in E_j; that is, β_i(j) = inf{β_i : β_{−1} ∈ E_j}. Analogously, let β̄_{−1}(j) = (β̄_2(j), . . . , β̄_k(j)) be the vector giving the largest values that all but the first component of β take on in E_j. Let

    β_1(j) = inf{β_1 : β ∈ Ω(θ_0), β_{−1} ∈ E_j}   and   β̄_1(j) = sup{β_1 : β ∈ Ω(θ_0), β_{−1} ∈ E_j}    (10)

denote the smallest and largest values that the first component of β takes on Ω(θ_0) for β_{−1} in E_j. If the infimum and supremum in (10) are over a non-empty set, then define s_L(j) = J_{n,β̄(j)}(t_0^-) and s_U(j) = J_{n,β(j)}(t_0), where β̄(j) = (β̄_1(j), β̄_{−1}(j)) and β(j) = (β_1(j), β_{−1}(j)). If there is no β in Ω(θ_0) with (β_2, . . . , β_k) in E_j, then set s_L(j) = 1 and s_U(j) = 0, respectively. We construct the p-values

    q_{L,θ_0,I_{1−γ}} = max_{1≤j≤r} (1 − s_L(j)) + γ   and   q_{U,θ_0,I_{1−γ}} = max_{1≤j≤r} s_U(j) + γ

by taking the maximum over the adjusted p-values 1 − s_L(j) and s_U(j). This refinement is feasible and valid in finite samples.

Theorem 3.1 Assume that the components of the data X = (X_1, . . . , X_k) are independent, that each component X_i has distribution in a family having a monotone likelihood ratio, and that the statistic T_n = T_n(X_1, . . . , X_k) is monotone increasing with respect to each component X_i. Let I_{1−γ} be a finite-sample valid 1 − γ confidence region for (β_2, . . . , β_k).
Then, the p-values q_{L,θ_0,I_{1−γ}} and q_{U,θ_0,I_{1−γ}} are valid for testing H(θ_0) in the sense that, for any 0 ≤ u ≤ 1 and any ϑ,

    P_{θ_0,ϑ}{q_{L,θ_0,I_{1−γ}} ≤ u} ≤ u   and   P_{θ_0,ϑ}{q_{U,θ_0,I_{1−γ}} ≤ u} ≤ u.

PROOF OF THEOREM 3.1. First, note that I_{1−γ} could be the whole space Ω̄(θ_0) by taking γ = 0. It follows from Lemma A.1 in Romano et al. (2011) (which is a simple generalization of Lemma 3.4.2 in Lehmann and Romano (2005)) that the family of distributions of T_n satisfies J_{n,β′}(t) ≤ J_{n,β}(t) for any t, β = (β_1, . . . , β_k), and β′ = (β′_1, . . . , β′_k) with β′_i ≥ β_i for all i ≥ 1. The same is true if t is replaced by t^-. Thus, we have that J_{n,β̄(j)}(t^-) ≤ J_{n,β}(t^-) and J_{n,β(j)}(t) ≥ J_{n,β}(t) for any β with θ = f(β) = θ_0 and (β_2, . . . , β_k) ∈ E_j. Therefore, the p-value defined by (8) satisfies

    q̂_{L,θ_0,I_{1−γ}} = sup_{ϑ ∈ I_{1−γ}} q̂_{L,θ_0,ϑ} + γ ≤ max_{1≤j≤r} (1 − s_L(j)) + γ = q_{L,θ_0,I_{1−γ}},

and similarly q̂_{U,θ_0,I_{1−γ}} ≤ q_{U,θ_0,I_{1−γ}}, where q̂_{U,θ_0,I_{1−γ}} is defined analogously to (8) for tests of H(θ_0) against the alternative θ < θ_0. Since q̂_{L,θ_0,I_{1−γ}} and q̂_{U,θ_0,I_{1−γ}} are valid p-values, so are any random variables that are stochastically larger.

Thus, tests based on the p-values q_{L,θ_0,I_{1−γ}} and q_{U,θ_0,I_{1−γ}} may be used to test H(θ_0) and, through test inversion, yield finite-sample valid confidence bounds for θ.

In this section, we apply the general methods considered in Section 3 to the problem of constructing approximate and finite-sample valid confidence intervals for seroprevalence. We begin by exhibiting a set of test statistics T_n applicable to our problem. Let p̂_n(π_0) denote the MLE for p restricted to Ω(π_0) = {p ∈ Ω : π = π_0}.[22] A natural choice for the test statistic T_n is the difference between π̂_n and π_0.

[21] If T_n is monotone decreasing with respect to a particular component, say X_j, then X_j can be replaced by −X_j, whose family of distributions is then monotone increasing with respect to −β_j.
The difference $\hat\pi_n - \pi_0$ can be Studentized with an estimate of its standard deviation, giving $\tilde\pi_n(\pi_0) = (\hat\pi_n - \pi_0)/\sqrt{V_{\hat\pi_n}(\hat p_n)}$, where $V_{\hat\pi_n}(p)$ is given in (2). Alternatively, it can be Studentized with an estimate of its standard deviation under the constraint $\pi = \pi_0$, giving the test statistic $\tilde\pi_{n,C}(\pi_0) = (\hat\pi_n - \pi_0)/\sqrt{V_{\hat\pi_n}(\hat p_n(\pi_0))}$. Imposing the restriction $\pi = \pi_0$ explicitly in the estimate of the variance may provide a more accurate approximation to the variance of $\hat\pi_n$ under the null hypothesis. An analogous improvement has been established for the binomial case in Hall (1982). Observe that, as $p_1 = p_2(1-\pi) + p_3\pi$, we can rewrite the condition $\pi = \pi_0$ as the linear restriction $b(\pi_0)^\top p = 0$, where $b(\pi_0) = (1, -(1-\pi_0), -\pi_0)^\top$. This observation suggests consideration of the linear test statistic $\hat\phi_n(\pi_0) = b(\pi_0)^\top \hat p_n$. Moreover, as the variance of $\hat\phi_n(\pi_0)$ is exactly equal to $V_{\hat\phi_n(\pi_0)}(p) = \sigma_1^2(p_1)/n_1 + (1-\pi_0)^2\sigma_2^2(p_2)/n_2 + \pi_0^2\sigma_3^2(p_3)/n_3$, where $\sigma_i^2(p_i) = p_i(1-p_i)$, and can be estimated with the plug-in estimator $V_{\hat\phi_n(\pi_0)}(\hat p_n)$, the test statistic $\hat\phi_n(\pi_0)$ can be Studentized, giving the alternative test statistic $\tilde\phi_n(\pi_0) = \hat\phi_n(\pi_0)/\sqrt{V_{\hat\phi_n(\pi_0)}(\hat p_n)}$. In turn, we can Studentize $\hat\phi_n(\pi_0)$ with an estimate of its variance under the restriction $\pi = \pi_0$, giving the statistic $\tilde\phi_{n,C}(\pi_0) = \hat\phi_n(\pi_0)/\sqrt{V_{\hat\phi_n(\pi_0)}(\hat p_n(\pi_0))}$. Observe that $\hat\phi_n(\pi_0)$ is well-defined even if $p_2 = p_3$, and so $\hat\phi_n(\pi_0)$, $\tilde\phi_n(\pi_0)$, or $\tilde\phi_{n,C}(\pi_0)$ may be desirable choices in situations where $p_2$ is close to $p_3$. Alternatively, we can use statistics based on the likelihood function. In particular, let $L(p \mid x)$ be the likelihood function, given by $L(p \mid x) = \prod_{1 \le i \le 3} \binom{n_i}{x_i} p_i^{x_i}(1-p_i)^{n_i - x_i}$. The generalized likelihood ratio test statistic for testing $H(\pi_0): \pi = \pi_0$ is given by $W_n = W_n(\pi_0) = 2 \sum_{1 \le j \le 3} \left[ X_j \log\left(\hat p_{n,j}/\hat p_{n,j}(\pi_0)\right) + (n_j - X_j) \log\left((1-\hat p_{n,j})/(1-\hat p_{n,j}(\pi_0))\right) \right]$. Large values of $W_n$ give evidence for both $\pi < \pi_0$ and $\pi > \pi_0$. To address this issue, we also consider the signed square root likelihood ratio statistic for the restriction $\pi = \pi_0$, given by $R_n = R_n(\pi_0) = \mathrm{sign}(\hat\pi_n - \pi_0) \cdot \sqrt{W_n(\pi_0)}$.
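A minimal sketch of $W_n$ and $R_n$ follows, assuming the restricted MLE is obtained by numerically maximizing the likelihood over $(p_2, p_3)$ with $p_1$ pinned down by the null restriction $p_1 = p_2(1-\pi_0) + p_3\pi_0$. The counts and the optimizer's starting values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

# Hypothetical counts, as in the illustrations above.
x = np.array([50, 2, 103])
n = np.array([3330, 401, 122])

def loglik(p):
    # Log-likelihood of three independent binomial samples.
    return float(np.sum(binom.logpmf(x, n, p)))

def constrained_mle(pi0):
    """Restricted MLE over Omega(pi0): impose p1 = p2*(1 - pi0) + p3*pi0
    and maximize the likelihood over (p2, p3)."""
    def neg_ll(q):
        p2, p3 = q
        return -loglik(np.array([p2 * (1 - pi0) + p3 * pi0, p2, p3]))
    res = minimize(neg_ll, x0=[0.01, 0.85], bounds=[(1e-6, 1 - 1e-6)] * 2)
    p2, p3 = res.x
    return np.array([p2 * (1 - pi0) + p3 * pi0, p2, p3])

p_hat = x / n
pi_hat = (p_hat[0] - p_hat[1]) / (p_hat[2] - p_hat[1])  # unrestricted MLE of pi

def lr_statistics(pi0):
    """Generalized likelihood ratio W_n(pi0) and its signed root R_n(pi0)."""
    w = max(2.0 * (loglik(p_hat) - loglik(constrained_mle(pi0))), 0.0)
    return w, np.sign(pi_hat - pi0) * np.sqrt(w)

w_near, r_near = lr_statistics(0.012)  # null close to the point estimate
w_far, r_far = lr_statistics(0.05)     # null far above it
```

As expected, $W_n$ is small when $\pi_0$ is near the unrestricted estimate and large otherwise, while the sign of $R_n$ indicates the direction of the departure.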
Corrections to improve the accuracy of $W_n$ based on its signed square root $R_n$ have a long history; see Lawley (1956), Barndorff-Nielsen (1986), Fraser and Reid (1987), Jensen (1986), Jensen (1992), DiCiccio et al. (2001), and Lee and Young (2005). Frydenberg and Jensen (1989) consider the effect of discreteness on the efficacy of corrections to improve asymptotic approximations to the distribution of the likelihood ratio statistic. The statistic $R_n$ can be re-centered and Studentized, giving $\tilde R_n = \tilde R_n(\pi_0) = (R_n(\pi_0) - m_{R_n}(p))/\sqrt{V_{R_n}(p)}$, where $m_{R_n}(p)$ and $V_{R_n}(p)$ denote the mean and variance of $R_n$ under $p$ and, in practice, are computed with the bootstrap under $\hat p_n(\pi_0)$. We now outline the application of the approximate intervals developed in Section 3.2 to constructing confidence intervals for seroprevalence with the test statistics formulated in Section 4.1, and measure their performance in the Monte Carlo experiment developed in Section 2. Suppose that we are using the test statistic $T_n$ with observed value $t_0$. Let $J^T_{n,p}(t) = P_p\{T_n \le t\}$ denote the distribution of the general statistic $T_n$ under $p$, and also introduce the related quantity $J^T_{n,p}(t^-) = P_p\{T_n < t\}$, which will be of use in computing p-values for tests of the null hypothesis $\pi = \pi_0$ against alternatives of the form $\pi > \pi_0$. The approximate test-inversion intervals are constructed by first computing, for each $\pi_0$, the p-values $\hat q_{L,\pi_0,\hat p_n(\pi_0)} = 1 - J^T_{n,\hat p_n(\pi_0)}(t_0^-)$ and $\hat q_{U,\pi_0,\hat p_n(\pi_0)} = J^T_{n,\hat p_n(\pi_0)}(t_0)$. The resultant interval with nominal coverage $1-\alpha$ then takes the form $\{\pi_0 : \hat q_{L,\pi_0,\hat p_n(\pi_0)} \ge \alpha/2 \text{ and } \hat q_{U,\pi_0,\hat p_n(\pi_0)} \ge \alpha/2\}$. We begin by considering asymptotic approximations to the distribution $J^T_{n,\hat p_n(\pi_0)}$ for different test statistics. Table 2 reports the approximate test-inversion confidence intervals, constructed with an asymptotic approximation to the null distributions of the test statistics, computed on data from Bendavid et al. (2020a).
Estimates of the average length and coverage for these intervals, taken over 10,000 bootstrap replicates of $X$ at the sample size $n$ and estimated parameters $\hat p_n$ from this study, are also displayed. Observe that, under the null hypothesis $H(\pi_0)$ and provided that $p$ is not on the boundary of $\Omega$, the test statistics $\tilde\pi_{n,C}(\pi_0)$, $\tilde\phi_{n,C}(\pi_0)$, and $\tilde R_n$ are asymptotically $N(0,1)$, and $W_n$ is asymptotically $\chi^2_1$. Thus, if we set $T_n$ equal to any of the asymptotically normal statistics, we can approximate $J^T_{n,\hat p_n(\pi_0)}$ with a standard normal distribution. Likewise, we may apply a $\chi^2_1$ approximation if we set $T_n$ equal to $W_n$. 23 Notably, each of these intervals now includes zero. 24 These interval constructions are roughly the same length, on average, as the delta method intervals, but have coverage probability significantly closer to the nominal level. Figure 2 displays estimates of the coverage probabilities for these intervals in the Monte Carlo experiment developed in Section 2. Recall that the black dots display one minus the estimates of the coverage probabilities of the respective intervals, and that the purple squares and blue triangles display the proportion of the replicates in which the true value of $\pi$ falls below and above the realized confidence interval, respectively. In contrast to the results for the standard methods displayed in Figure 1, we estimate that the coverage probabilities for these methods are very close to the nominal value of 95% for most parameterizations. The intervals constructed with $W_n$ and $\tilde R_n$ are the most equi-tailed. 23 Observe that using normal approximations to $\tilde\pi_n$ or $\tilde\phi_n$ is equivalent to constructing Wald intervals for these statistics. For that reason, we focus on statistics that make explicit use of the null hypothesis restriction $\pi = \pi_0$. 24 If the null hypothesis $\pi = 0$ is of particular interest or concern, then there exists an exact uniformly most powerful unbiased level $\alpha$ test for the equivalent problem of testing $p_1 = p_2$ against $p_1 > p_2$. This is a conditional one-sided binomial test; see Section 4.5 of Lehmann and Romano (2005). Such a test does not exist for other values of $\pi_0$. Notes: Figure 2 displays estimates of the coverage probabilities of the approximate confidence intervals constructed with an asymptotic approximation to the null distributions of the test statistics. The nominal coverage probability is 0.95 and is denoted by the horizontal dotted line. Estimates of the coverage for the interval constructed with the test statistic $T_n$ are denoted by "$T_n$, aa" and are computed at parameter values close to the estimates $\hat p_n$ and sample sizes $n$ of Bendavid et al. (2020a), as specified in Section 2. The black dots denote one minus the proportion of replicates for which the true value of $\pi$ falls in the realized confidence intervals, i.e., one minus the estimated coverage probability. The purple squares and blue triangles denote the proportion of replicates that fall below and above the realized confidence intervals, respectively. The vertical dotted line denotes the estimated value of $\hat p_{n,1}$, $\hat p_{n,2}$, or sample size $n_1$, $n_2$ for Bendavid et al. (2020a). Next, we refine this approach by directly computing $J^T_{n,\hat p_n(\pi_0)}$ with the bootstrap. This method is more accurate, but can be more computationally expensive. In particular, choosing test statistics that make use of the constrained MLE $\hat p_n(\pi_0)$ requires solving the associated convex program for each bootstrap replicate. As a result, for this case, we focus attention on test statistics that do not make use of $\hat p_n(\pi_0)$.
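Both approximations can be sketched for the Studentized linear statistic $\tilde\phi_n(\pi_0)$, which avoids the constrained MLE. The resampling parameter below, $\hat p_n$ with its first coordinate replaced so that the null restriction holds exactly, is a crude stand-in for $\hat p_n(\pi_0)$, and the counts and grids are hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.array([50, 2, 103])     # hypothetical counts
n = np.array([3330, 401, 122])

def phi_stud(pi0, counts):
    """Studentized linear statistic: H(pi0) is b(pi0)'p = 0 with
    b(pi0) = (1, -(1 - pi0), -pi0)."""
    p = counts / n
    b = np.array([1.0, -(1.0 - pi0), -pi0])
    return (b @ p) / np.sqrt(np.sum(b**2 * p * (1.0 - p) / n))

def pvalues_asymptotic(pi0):
    # N(0,1) approximation: q_L ~ 1 - J(t0-), q_U ~ J(t0).
    t0 = phi_stud(pi0, x)
    return 1.0 - norm.cdf(t0), norm.cdf(t0)

def pvalues_bootstrap(pi0, B=2000):
    """Parametric bootstrap approximation to the same p-values. Resampling
    replaces p1 by p2*(1 - pi0) + p3*pi0 so the null holds exactly."""
    p = x / n
    p_null = np.array([p[1] * (1 - pi0) + p[2] * pi0, p[1], p[2]])
    t0 = phi_stud(pi0, x)
    ps = rng.binomial(n, p_null, size=(B, 3)) / n
    b = np.array([1.0, -(1.0 - pi0), -pi0])
    ts = (ps @ b) / np.sqrt(np.sum(b**2 * ps * (1.0 - ps) / n, axis=1))
    return np.mean(ts >= t0), np.mean(ts <= t0)

def invert(pvalue_fn, alpha=0.05, grid=np.linspace(0.0, 0.1, 401)):
    # Keep every pi0 whose equi-tailed p-values both exceed alpha / 2.
    kept = [pi0 for pi0 in grid if min(pvalue_fn(pi0)) >= alpha / 2]
    return min(kept), max(kept)

lo, hi = invert(pvalues_asymptotic)
qL_b, qU_b = pvalues_bootstrap(0.012)  # bootstrap p-values near the estimate
```

Inverting the asymptotic p-values over a grid of candidate values of $\pi_0$ yields an interval around the point estimate; at a $\pi_0$ near the estimate, the bootstrap p-values are far from the rejection threshold, as expected.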
Table 3 reports realizations of these approximate confidence intervals, constructed with a parametric bootstrap approximation to the null distributions of the test statistics, computed on data from Bendavid et al. (2020a), in addition to estimates of the coverage and average interval length taken over 10,000 bootstrap replicates of $X$ at the sample size $n$ and estimate $\hat p_n$ from this study. Again, each of these intervals includes zero, is roughly the same length as the delta-method intervals, and has coverage probability close to the nominal level. Estimates of the coverage probabilities for these intervals in the Monte Carlo experiment developed in Section 2 are again close to the nominal level and approximately equi-tailed. The interval constructed with $R_n$ is the most equi-tailed and appears to be the least sensitive to perturbations in $p_2$ and $n_2$. We now turn to the application of the finite-sample valid intervals discussed in Section 3.4. We focus our development on the test statistic $\hat\phi_n(\pi_0)$, as it is linear and therefore monotone with respect to each sample $X_i$, as is required. 25 To begin, we partition the parameter space $\Omega$ into a parameter of interest and a nuisance component. Recall that finite-sample exact intervals formed by maximizing p-values over a nuisance space will perform best if the distribution of the chosen test statistic does not vary much with the nuisance parameter. 25 One may also consider the test statistic $\hat\pi_n$, as $\pi(\cdot)$ is monotone with respect to each component as long as $p \in \Omega$. Notes: The black dots denote one minus the proportion of replicates for which the true value of $\pi$ falls in the realized confidence intervals, i.e., one minus the estimated coverage probability. The purple squares and blue triangles denote the proportion of replicates that fall below and above the realized confidence intervals, respectively. The vertical dotted line denotes the estimate $\hat p_{n,1}$, $\hat p_{n,2}$, or sample size $n_1$, $n_2$ for Bendavid et al. (2020a). For small values of $\pi_0$, the variance of $\hat\phi_n(\pi_0)$ is insensitive to changes in $p_3$, as the variance $\sigma_3^2(p_3)$ enters into (12) linearly and scaled by $\pi_0^2$. Additionally, for sample sizes comparable to the measurements taken in Bendavid et al. (2020a), where $n_1$ is much larger than $n_2$, the variance of $\hat\phi_n(\pi_0)$ will be less sensitive to changes in $p_1$ than to changes in $p_2$. Thus, we set the nuisance component $\vartheta = (p_1, p_3)$, giving the parameterization $(\pi, \vartheta)$. 26 To fix ideas, consider Figure 4, which displays a heat-map of the 0.975 quantile of $\hat\phi_n(\pi_0)$ under different values of the nuisance parameter $\vartheta$, with the sample size and the null hypothesis restriction $\pi_0$ equal to the sample size and estimated prevalence $\hat\pi_n$ from Bendavid et al. (2020a). The square black dot denotes the constrained MLE, $\hat\vartheta(\pi_0) = (\hat p_{n,1}(\pi_0), \hat p_{n,3}(\pi_0))$. The black line exhibits the boundary of the parameter space $\bar\Omega(\pi_0)$. Recall that in the constructions of approximate intervals considered in Section 4.2, a point $\pi_0$ is excluded from a confidence interval with nominal coverage 0.95 if the observed value of the chosen test statistic $T_n$ exceeds or falls below the 0.975 or 0.025 quantiles of the statistic's finite-sample distribution at the constrained MLE, $\hat p_n(\pi_0)$. However, as illustrated in Figure 4, the 0.975 quantile of the bootstrap distribution of $\hat\phi_n(\pi_0)$ varies considerably with the nuisance parameter $\vartheta$. As a result, these approximate intervals will not exactly control the coverage probability in finite samples, as the event that $\vartheta$ differs from $\hat\vartheta(\pi_0)$ occurs with positive probability.
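The sensitivity of the null quantiles to the nuisance parameter can be illustrated with a small simulation; the two nuisance values below are hypothetical, chosen only to mimic the kind of variation visible in a heat-map of this form.

```python
import numpy as np

rng = np.random.default_rng(1)
n = np.array([3330, 401, 122])  # hypothetical sample sizes
pi0 = 0.012

def upper_quantile(p, B=4000):
    """Simulated 0.975 quantile of phi_hat(pi0) = p1_hat
    - (1 - pi0) * p2_hat - pi0 * p3_hat when sampling at parameter p."""
    ps = rng.binomial(n, p, size=(B, 3)) / n
    return float(np.quantile(ps @ np.array([1.0, -(1.0 - pi0), -pi0]), 0.975))

# Two nuisance values (p1, p3); on Omega(pi0) the remaining coordinate
# is pinned down by p2 = (p1 - pi0 * p3) / (1 - pi0). Values hypothetical.
quantiles = []
for p1, p3 in [(0.015, 0.84), (0.05, 0.84)]:
    p2 = (p1 - pi0 * p3) / (1.0 - pi0)
    quantiles.append(upper_quantile(np.array([p1, p2, p3])))
```

Even with $\pi_0$ held fixed, moving $p_1$ within a plausible range changes the 0.975 quantile substantially, which is why comparing $t_0$ only to quantiles at $\hat p_n(\pi_0)$ cannot guarantee finite-sample coverage.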
In turn, comparing the realized value of a test statistic to quantiles of the statistic's finite-sample distribution at every value of the nuisance component $\vartheta$ is both infeasible, as the space of $\vartheta$ is infinite, and impractical, as it would lead to extremely conservative intervals. In fact, we can see in Figure 4 that the 0.975 quantile of the bootstrap distribution of $\hat\phi_n(\pi_0)$ is approximately four times as large at $p_1 = 0.05$ as at $p_1 = \hat p_{n,1}(\pi_0)$. Thus, the finite-sample approach developed in Section 3.4 begins by constructing a $1-\gamma$ confidence region for $\vartheta$ and forming a finite grid over this space. The initial confidence region $I_{1-\gamma}$ is illustrated in Figure 4 by the greyed rectangle, and a $10 \times 10$ grid over this space is illustrated by the grid of white dots. The confidence region $I_{1-\gamma}$ is formed by taking the Cartesian product of two $\sqrt{1-\gamma}$ level confidence regions for $p_1$ and $p_3$, each constructed by using the exact intervals of Clopper and Pearson (1934). For the purposes of this figure, we set $\gamma = 0.001$. 26 We note that for different sample sizes, it may be attractive to set the nuisance component $\vartheta = (p_2, p_3)$. For example, Bendavid et al. (2020b), the April 27th draft of Bendavid et al. (2020a), includes larger sensitivity and specificity samples $n_2$ and $n_3$. In particular, the specificity sample $n_2$ was increased from 401 to 3324. These additional data were aggregated over several samples taken at different times and locations. Gelman (2020), Fithian (2020b), and Bennett and Steyvers (2020) highlight issues with this aggregation. The choice $\vartheta = (p_2, p_3)$ also has computational advantages. In particular, by the identity $p_1 = p_2(1-\pi) + p_3\pi$, for any value of seroprevalence $\pi_0$ and any values of $p_2$ and $p_3$ satisfying $p_2 < p_3$, there is a value of $p_1$ satisfying $p_2 \le p_1 \le p_3$ such that $\pi_0 = (p_1 - p_2)/(p_3 - p_2)$. That is, any value of $(p_2, p_3)$ corresponds to a unique value of $p_1$ consistent with a given value of $\pi_0$ and satisfying the a priori restrictions on the parameter space $\Omega$. Notes: Figure 4 displays a heat-map of the 0.975 quantile of $\hat\phi_n(\pi_0)$ under different parameter values $(\pi_0, \vartheta)$, where the sample size and null hypothesis restriction $\pi_0$ are equal to the sample size and estimated prevalence $\hat\pi_n$ from Bendavid et al. (2020a). The black line exhibits the boundary of the parameter space $\bar\Omega(\pi_0)$. The black square dot denotes the constrained MLE $\hat\vartheta(\pi_0) = (\hat p_{n,1}(\pi_0), \hat p_{n,3}(\pi_0))$. With $\gamma = 0.001$, the grey rectangle denotes a $1-\gamma$ confidence region for $\vartheta$ constructed by taking the Cartesian product of two $\sqrt{1-\gamma}$ level confidence regions for $p_1$ and $p_3$, each constructed with the method of Clopper and Pearson (1934). The white dots denote a $10 \times 10$ grid over this space. This grid partitions the values of $\vartheta$ in $I_{1-\gamma}$ into $r = 81$ rectangles, which we enumerate $E_1, \ldots, E_r$. Define, for $i = 1, 3$, the extreme points $\underline p_i(j) = \inf\{p_i : (p_1, p_3) \in E_j\}$ and $\bar p_i(j) = \sup\{p_i : (p_1, p_3) \in E_j\}$, as well as $\underline p_2(j) = \inf\{p_2 : p \in \Omega(\pi_0), (p_1, p_3) \in E_j\}$ and $\bar p_2(j) = \sup\{p_2 : p \in \Omega(\pi_0), (p_1, p_3) \in E_j\}$. (14) As the test statistic $\hat\phi_n(\pi_0)$ is monotone increasing in $X_1$ and monotone decreasing in $X_2$ and $X_3$, define $p_L(j) = (\underline p_1(j), \bar p_2(j), \bar p_3(j))$ and $p_U(j) = (\bar p_1(j), \underline p_2(j), \underline p_3(j))$, as well as $s_L(j) = J_{n,p_U(j)}(t^-)$ and $s_U(j) = J_{n,p_L(j)}(t)$, where $s_L(j)$ and $s_U(j)$ are set equal to 1 and 0, respectively, if the infimum or supremum in (14) are taken over the empty set. Thus, by Theorem 3.1, we can construct the finite-sample valid p-values $\hat q_{L,\pi_0,I_{1-\gamma}} = \max_{1 \le j \le r}(1 - s_L(j)) + \gamma$ and $\hat q_{U,\pi_0,I_{1-\gamma}} = \max_{1 \le j \le r} s_U(j) + \gamma$ for testing the null hypothesis $\pi = \pi_0$. Hence, the resultant finite-sample valid interval with nominal coverage $1-\alpha$ takes the form $\{\pi_0 : \hat q_{L,\pi_0,I_{1-\gamma}} \ge \alpha/2 \text{ and } \hat q_{U,\pi_0,I_{1-\gamma}} \ge \alpha/2\}$.
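A compact sketch of this construction follows, assuming Monte Carlo evaluation of $J_{n,p}$ and a deliberately coarse $g \times g$ grid; the counts, grid size, and simulation budget are hypothetical illustrations rather than the settings used in the paper.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)
x = np.array([50, 2, 103])     # hypothetical counts
n = np.array([3330, 401, 122])

def cp_bounds(k, m, level):
    # Equi-tailed Clopper-Pearson interval at the given confidence level.
    a = 1.0 - level
    lo = 0.0 if k == 0 else float(beta.ppf(a / 2, k, m - k + 1))
    hi = 1.0 if k == m else float(beta.ppf(1 - a / 2, k + 1, m - k))
    return lo, hi

def finite_sample_pvalues(pi0, gamma=0.001, g=4, B=500):
    """Grid-maximized p-values for H(pi0) based on phi_hat(pi0), which is
    increasing in X1 and decreasing in X2 and X3. J_{n,p} is evaluated
    by simulation at the extreme corners of each rectangle."""
    def stat(xs):
        b = np.array([1.0, -(1.0 - pi0), -pi0])
        return (xs / n) @ b

    t0 = stat(x)
    level = np.sqrt(1.0 - gamma)
    l1, u1 = cp_bounds(x[0], n[0], level)   # region for p1
    l3, u3 = cp_bounds(x[2], n[2], level)   # region for p3
    e1 = np.linspace(l1, u1, g + 1)
    e3 = np.linspace(l3, u3, g + 1)
    q_low = q_up = 0.0
    for i in range(g):
        for j in range(g):
            lo1, hi1, lo3, hi3 = e1[i], e1[i + 1], e3[j], e3[j + 1]
            # On Omega(pi0), p2 = (p1 - pi0 * p3) / (1 - pi0).
            p2_lo = (lo1 - pi0 * hi3) / (1.0 - pi0)
            p2_hi = (hi1 - pi0 * lo3) / (1.0 - pi0)
            if p2_hi < 0.0 or p2_lo > 1.0:
                continue  # Omega(pi0) is empty over this rectangle
            p2_lo, p2_hi = max(p2_lo, 0.0), min(p2_hi, 1.0)
            # Stochastically largest / smallest configurations of the statistic.
            p_up = np.array([hi1, p2_lo, lo3])
            p_low = np.array([lo1, p2_hi, hi3])
            t_up = stat(rng.binomial(n, p_up, size=(B, 3)))
            t_low = stat(rng.binomial(n, p_low, size=(B, 3)))
            q_low = max(q_low, float(np.mean(t_up >= t0)))   # 1 - s_L(j)
            q_up = max(q_up, float(np.mean(t_low <= t0)))    # s_U(j)
    return q_low + gamma, q_up + gamma

q_low_near, q_up_near = finite_sample_pvalues(0.012)  # near the estimate
q_low_far, q_up_far = finite_sample_pvalues(0.10)     # far above it
```

A null value near the point estimate yields large p-values in both directions, while a null value far above it is rejected because every rectangle in the grid is incompatible with $\Omega(\pi_0)$ or places the statistic's distribution far from $t_0$.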
This approach is closely related to the method developed in Cai et al. (2020), though there are some differences. Roughly, Cai et al. (2020) compute p-values for tests of the null hypothesis $\pi = \pi_0$ with the parametric bootstrap, using the particular choice of test statistic $\hat\pi_n$, at each point of a grid spanning a confidence region for the nuisance parameter. Their construction begins with a joint confidence region for all three parameters, while our approach proceeds from a smaller initial region for just $p_1$ and $p_3$. We make an additional correction for the grid approximation to the nuisance space, which allows us to ensure finite-sample validity. Their construction does not guarantee that the resulting intervals are equi-tailed. Table 4 reports realizations of these finite-sample valid intervals for several values of $\gamma$, in addition to the projection intervals discussed in Section 2, for Bendavid et al. (2020a). 27 The table also reports estimates of the coverage and average interval length at the estimated values of $\hat p_n$ for this study. The cost of ensuring finite-sample valid coverage is large: the realized intervals are roughly 40% wider on average than intervals constructed with the delta method, and the coverage is very close to one. Figure 5 displays estimates of the coverage probabilities for the finite-sample valid intervals, as well as the projection intervals, in the same Monte Carlo experiment developed in Section 2. 28 Again, the coverage is very close to one at small sample sizes. However, the finite-sample valid intervals outperform the projection intervals; the difference is most salient in the measurements of coverage. 27 These results are insensitive to small changes in the grid size $g$. 28 Note that the proportion of Monte Carlo replicates for which the true value of $\pi$ falls below the realized intervals is very close to zero at most parameter values, and so the dots denoting one minus the estimated coverage and the proportion of Monte Carlo replicates for which the true value of $\pi$ falls above the realized intervals are approximately overlaid. Notes: Table 4 reports the finite-sample valid test-inversion and projection confidence intervals computed on data from Bendavid et al. (2020a). Estimates of the average length and coverage for these intervals, taken over 10,000 bootstrap replicates of $X$ at the sample size $n$ and estimate $\hat p_n$ from this study, are also displayed. The additional costs associated with correction of the approximation error induced by a finite discretization of the nuisance space are not overly burdensome. In particular, consider the test-inversion intervals for $\pi$ constructed with the p-values $\hat q_{L,\pi_0,\hat I_{1-\gamma,g}}$ and $\hat q_{U,\pi_0,\hat I_{1-\gamma,g}}$, where the former p-value is defined in (8), the latter is defined analogously for upper confidence bounds, and $\hat I_{1-\gamma,g}$ is a $g \times g$ grid over the initial confidence region $I_{1-\gamma}$. That is, $\hat I_{1-\gamma,g}$ denotes the white dots in Figure 4, where in that case $g = 10$. The realized value of these intervals with $g = 10$ and $\gamma = 10^{-2}$ for data from Bendavid et al. (2020a) is [0.000, 0.025]. For these values of $g$ and $\gamma$, this interval construction has an average length of 0.0249, which is 34.85% longer than the delta-method interval on average, i.e., roughly 3.5% shorter than the finite-sample valid intervals considered in this section. There are several facets of the finite-sample valid confidence intervals considered in this section that could potentially be improved. These include the choice of the nuisance parameter $\vartheta$ that leads to an initial confidence region $I_{1-\gamma}$.
Additionally, an initial confidence region may be constructed by taking the product of appropriate one-sided bounds, rather than using a single joint confidence region for both lower and upper bounds; this change should save roughly $\gamma/2$ in overcoverage. It may also be desirable to use a Studentized test statistic, as its distribution may vary even less within the initial confidence region. However, this requires a nontrivial modification of the correction for the approximation error induced by a finite discretization of the nuisance space, since monotonicity may be violated. Lastly, finer grids over the nuisance space may be applied to further reduce the length of the intervals. Notes: The black dots denote one minus the proportion of replicates for which the true value of $\pi$ falls in the realized confidence intervals, i.e., one minus the estimated coverage probability. The purple squares and blue triangles denote the proportion of replicates that fall below and above the realized confidence intervals, respectively. In the case of this figure, the black dots and blue triangles are approximately overlaid. The vertical dotted line denotes the estimated value of $\hat p_{n,1}$, $\hat p_{n,2}$, or sample size $n_1$, $n_2$ from Bendavid et al. (2020a). We demonstrate that standard methods for constructing confidence intervals in basic seroprevalence surveys can have erratic coverage probabilities, and that intervals based on test inversion provide accurate and valid alternatives. Moreover, the statistical significance of the results of the Bendavid et al. (2020a) seroprevalence survey appears to be contingent on the form of population weighting applied. 29 This sensitivity highlights the fundamental importance of high-quality data collection in survey design, and supports a view of seroprevalence surveys as an important input into, but not a final answer for, assessments of the progression of early stages of infectious diseases. The methods discussed in this paper are applicable to, but not tailored for, post-stratification weighted estimation.
If there are $S$ strata of the population of interest, with $p_1^i$ denoting the probability that a randomly selected individual in the $i$th stratum tests positive, then seroprevalence in the $i$th stratum is $\pi^i = (p_1^i - p_2)/(p_3 - p_2)$. If the $i$th stratum receives the known weight $w_i$, then the overall seroprevalence is $\pi = \sum_i w_i \pi^i$, which is a function of $S+2$ binomial parameters. However, for even moderately large values of $S$, the finite-sample valid intervals developed in Section 3.4 will be computationally expensive, due to the need to compute p-values for each point in a discretization of an $S+1$ dimensional first-stage confidence region. 30 The computational cost of the approximate intervals considered in Section 3.2 will not be higher than for the unweighted problem. We view further consideration of confidence interval constructions that are well suited to post-stratification weighting as a useful direction for further research. 31 29 The demographic data necessary for the replication of this result are not available. Bendavid et al. (2020a) report a weighted seroprevalence estimate and confidence interval, purportedly constructed with the delta method, of 0.0249 and (0.0201, 0.0349), respectively. However, Fithian (2020a) argues convincingly that there were coding errors made in the computation of these intervals. Cai et al. (2020) report weighted percentile bootstrap confidence intervals of (0.0110, 0.0372), where we note that there are small differences in the specificity and sensitivity estimates that they use relative to the data studied in this article. In many applied contexts, it is likely valuable, both for estimation and uncertainty quantification, to incorporate other forms of information made available in the collection of samples and test characteristics. For example, in Bendavid et al. (2020b) a larger specificity sample is constructed by aggregating several samples from different populations.
As indicated in Fithian (2020b), there is evidence of greater variation in estimates of specificity across these samples than would be expected if the test results in these samples were independent and identically distributed. Gelman and Carpenter (2020) propose a hierarchical approach to accounting for this over-dispersion, suggesting a model in which the specificity parameters of the tests implemented in each sample, including the sample taken from the population of interest, are drawn from a pre-specified parametric distribution. As the process generating the specificity samples might tend to differ from the process generating the population of interest (e.g., specificity samples may be drawn from hospital patients local to test manufacturers), we would advocate for an approach in which specificity is modeled as a function of relevant population characteristics. Gelman and Carpenter (2020) also highlight the possibility of incorporating individual-level symptom data; we second this suggestion and view it as a useful direction for further research. The methods presented in Section 3 apply quite generally to the construction of confidence intervals for real-valued parameters $\theta$. A subclass of problems can be described as follows. Suppose $X_1, \ldots, X_k$ are independent, with $X_i$ distributed as a binomial with parameters $n_i$ and $p_i$, and that it is desired to construct a confidence interval for some parameter $\theta = f(p_1, \ldots, p_k)$. In this case, the family of distributions of $X_i$ has monotone likelihood ratio in $X_i$. The application of Theorem 3.1 requires specification of a nuisance parameter $\vartheta$, construction of a confidence interval for $\vartheta$, and verification that the chosen test statistic is monotone with respect to each of its components. The latter is straightforward when the test statistic is given by $T_n = f(\hat p_1, \ldots, \hat p_k)$, where $\hat p_i = X_i/n_i$.
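Plug-in statistics of this form are straightforward to compute; the sketch below is a hedged illustration in which the counts, weights, and the particular choices of $f$ are hypothetical.

```python
import numpy as np

def plug_in(f, x, n):
    """Plug-in statistic T_n = f(p_hat) with p_hat_i = X_i / n_i."""
    return f(np.asarray(x) / np.asarray(n))

# Difference of two proportions: f(p) = p1 - p2, monotone in each component.
diff = plug_in(lambda p: p[0] - p[1], x=[30, 12], n=[100, 80])

# Post-stratified seroprevalence with S = 2 strata and known weights:
# f(p) = sum_i w_i (p1_i - p2) / (p3 - p2), a function of S + 2 binomial
# parameters, ordered here as (p1_1, p1_2, p2, p3). All values hypothetical.
w = np.array([0.4, 0.6])

def weighted_pi(p):
    return float(w @ ((p[:2] - p[2]) / (p[3] - p[2])))

pi_w = plug_in(weighted_pi, x=[40, 15, 2, 103], n=[2000, 1500, 401, 122])
```

The first statistic is monotone in each count, so Theorem 3.1 applies directly; the weighted-prevalence statistic inherits the same componentwise monotonicity within the parameter space, as discussed above.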
Some important special cases include differences of proportions, measures of relative risk reduction, and odds ratios. 32 Agresti and Min (2002) and Fagerland et al. (2015) also consider similar confidence interval constructions for these three parameters that involve minimizing p-values over a nuisance parameter space, but do not account for the discretization required. 30 In this case, adapting the finite-sample valid procedures proposed in this article for use with asymptotic approximations to the distribution of the signed square root likelihood ratio statistic (see, e.g., Brazzale et al. (2007); Jensen (1986, 1992)) may significantly facilitate computation. 31 Both Gelman and Carpenter (2020) and Cai et al. (2020) propose approaches to this problem, with the former taking a Bayesian perspective. 32 Uniformly most accurate unbiased confidence bounds exist only for the odds ratio based on classical constructions, but optimality considerations fail for the other parameters; see, e.g., Problem 5.29 of Lehmann and Romano (2005).

References
Unconditional small-sample confidence intervals for the odds ratio
Performance characteristics of five immunoassays for SARS-CoV-2: a head-to-head benchmark comparison. The Lancet Infectious Diseases
The power of antibody-based surveillance
Inference on full and partial parameters based on the standardized signed log likelihood ratio
COVID-19 antibody seroprevalence in Santa Clara County, California
COVID-19 antibody seroprevalence in Santa Clara County
COVID-19 antibody seroprevalence in Santa Clara County
Estimating COVID-19 antibody seroprevalence in Santa Clara County, California: a re-analysis of Bendavid et al. medRxiv
P values maximized over a confidence set for the nuisance parameter
Applied Asymptotics: Case Studies in Small-Sample Statistics
Interval estimation for a binomial proportion
Exact inference for disease prevalence based on a test with unknown specificity and sensitivity
Test inversion bootstrap confidence intervals
The use of confidence or fiducial limits illustrated in the case of the binomial
Parameter orthogonality and approximate conditional inference (with discussion)
Probability matching priors: higher order asymptotics
Antibody tests for identification of current and past infection with SARS-CoV-2
Simple and accurate one-sided inference from signed roots of likelihood ratios
Nonparametric confidence limits by resampling methods and least favorable families
On bootstrap procedures for second-order accurate confidence limits in parametric models
Nonparametric standard errors and confidence intervals
Better bootstrap confidence intervals
How to identify flawed research before it becomes dangerous. The New York Times
Recommended confidence intervals for two independent binomial proportions
COVID-19: navigating the uncharted
"I am grateful to moderator @jonc101x and speaker @jsross119 for allowing me to make a brief statement at the very interesting Stanford BMIR seminar yesterday (43:00 mark). A lightly edited text version follows"
Statistical comment on the
On conditional inference for a real parameter: a differential approach on the sample space
Is the 'improved likelihood ratio statistic' really improved in the discrete case?
Concerns with that Stanford study of coronavirus prevalence. Statistical Modeling, Causal Inference, and Social Science blog
Bayesian analysis of tests with unknown specificity and sensitivity
Improving the normal approximation when constructing one-sided confidence intervals for binomial or Poisson parameters
The Bootstrap and Edgeworth Expansion
Estimating the error rates of diagnostic tests
Similar tests and the standardized log ratio statistic
The modified signed likelihood statistic and saddlepoint approximations
Statistics for Epidemiology
Coronavirus infections may not be uncommon, tests suggest. The New York Times
Serology assays to manage COVID-19
A general method for approximating to the distribution of the likelihood ratio criteria
Parametric bootstrapping with nuisance parameters
Testing Statistical Hypotheses
Antibody tests suggest that coronavirus infections vastly exceed official counts
Estimating prevalence from the results of a screening test
Consonance and the closure method in multiple testing, article 12
A practical two-step method for testing moment inequalities
On the relationship between Bayesian and non-Bayesian elimination of nuisance parameters
A test in the presence of nuisance parameters
Estimation of COVID-19 prevalence from serology tests: a partial identification approach
Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review