key: cord-0643633-36t6tfqk
authors: Guerrier, Stéphane; Kuzmics, Christoph; Victoria-Feser, Maria-Pia
title: Prevalence Estimation from Random Samples and Census Data with Participation Bias
date: 2020-12-19
journal: nan
DOI: nan
sha: abd17d54760e688730d161b27e810e25b5a88b6f
doc_id: 643633
cord_uid: 36t6tfqk

Countries officially record the number of COVID-19 cases based on medical tests of a subset of the population with unknown participation bias. For prevalence estimation, the official information is typically discarded and, instead, small random survey samples are taken. We derive (maximum likelihood and method of moment) prevalence estimators, based on a survey sample, that additionally utilize the official information, and that are substantially more accurate than the simple sample proportion of positive cases. Put differently, using our estimators, the same level of precision can be obtained with substantially smaller survey samples. We take into account the possibility of measurement errors due to the sensitivity and specificity of the medical testing procedure. The proposed estimators and associated confidence intervals are implemented in the companion open source R package cape.

In the ongoing COVID-19 pandemic, governments face a trade-off between reducing the wealth or the health of citizens when choosing the degree of economic slowdown in their policy measures. The key to assessing this trade-off is an understanding of the number or proportion of cases in the population and their evolution. Acquiring this understanding, in turn, depends on reliable estimates of the number of cases (at different points in time).

The officially recorded number of positive cases can probably only be seen as a lower bound of the actual number of cases. The selection of participants to be medically tested is typically not complete and, importantly, also not random, but instead suffers from an unknown participation bias. The whole official procedure can, in fact, be understood as a complete census with (a possibly large) participation bias. It is typically unclear how many undetected positive cases there are in the population. Acknowledging this problem, for the case of COVID-19, some studies have proposed estimates for the prevalence among asymptomatic patients (see e.g. Nishiura et al., 2020; Mizumoto et al., 2020), or have attempted to infer the population prevalence from the prevalence obtained through the official procedure (see e.g. Manski and Molinari, 2020).

arXiv:2012.10745v2 [stat.ME] 23 Dec 2020

In this paper, we instead propose to combine the information available in the data obtained through the official procedure that, as argued, suffers from participation bias, with data collected using a random sample of participants from the population, all of whom are medically tested. From this random sample, an unbiased estimator of the population proportion of positive cases, ignoring the information available from the official procedure, is then simply the proportion of positive cases in the sample; see e.g. Bendavid et al. (2020); SORA (2020); Stringhini et al. (2020) for the analysis of COVID-19 prevalence. More precisely, we demonstrate that the information gathered through the official procedure, while not useful on its own, can be used to improve the accuracy of the best estimators derived from random samples. All that is needed is to also record, for each participant in the random sample, whether they are already part of the official statistics, i.e., whether they have already been declared positive through the official procedure. Appropriate estimators can then be derived whose key input is the number of new cases found in the sample.

We show that these estimators are substantially more accurate than the standard proportion of cases in the random sample. Put differently, appropriately utilizing the information obtained through the official procedure means that the sample sizes for the survey can be substantially smaller and yet achieve the same statistical accuracy, thus substantially reducing the costs and/or time for data acquisition. Alternatively, from the same survey sample, finer analysis at sub-population levels (e.g. regions) can reasonably be done even if the number of participants in these levels is rather small.

We also provide several standard approaches to building confidence interval bounds for the proportion of positive cases, and compare, in a simulation study, their (finite sample) coverage properties. We also take into account possible misclassification errors of the (medical) testing devices used to collect the data (see e.g. Kobokovich et al., 2020, and Surkova et al., 2020). The associated misclassification errors are determined by their sensitivity, i.e., the complement of the False Negative (FN) rate, and by their specificity, i.e., the complement of the False Positive (FP) rate, and adjusting for these errors avoids biased estimates (see e.g. Diggle, 2011; Lewis and Torgerson, 2012, and the references therein). Using a sensitivity analysis with the Austrian survey data, we find that the proposed estimators are much less influenced by the value of the FN rate than the survey sample proportion, which, in practice, limits the impact of the choice of the medical test sensitivity when estimating the proportion of positive cases.

Such misclassification adjustments are also necessary with binary outcomes in logistic regression; see e.g. Ni et al. (2019) and Meyer and Mittag (2017), and the references therein. In this paper, we consider the case of estimating the proportion of positive cases, but the framework could easily be extended to the case of logistic regression. Moreover, while the data from the November 2020 survey collected by Statistics Austria (2020) is suitable for prevalence estimation, i.e. for estimating the population proportion of Austrians infected by COVID-19 in November 2020, the same approach can be used to estimate other proportions such as the incidence of COVID-19 (see e.g. Woodward, 2014). For the sensitivity and specificity, we use cutoff values, hence without the need to specify a (prior) distribution for these quantities (see e.g. McDonald and Hodgson, 2018; Bouman et al., 2020, and the references therein). Finally, the analysis of the data from the Austrian survey (Statistics Austria, 2020) is performed using the cape R package, which includes the new methods developed in this paper (see Section 7 for more details).

The paper is organised as follows. We first present the formal setup in Section 2. In Section 3 we derive associated estimators and inference procedures, also treating the case of possible (partially) missing information. In Section 4 we present a simulation study that confirms and quantifies the theoretical results we develop in the previous sections. In Section 5 we apply the methodology to the case of the COVID-19 prevalence estimation and associated confidence bounds in Austria.

Consider taking a (random) survey sample of n participants in some population in order to estimate the population proportion π of, for example, a given infectious disease. Our framework also supposes that prior to the collection of the survey sample, a known proportion of individuals in the population have been declared positive through an official procedure based on an incomplete census or a census with participation bias. The official procedure has two steps. First, participants are selected based on some unknown criteria. Second, selected participants are medically tested for the disease.

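For each participant $i = 1, \ldots, n$ in the survey sample, there are three random variables of interest:

$$
\begin{aligned}
X_i &:= 1 \text{ if participant } i \text{ is positive, } 0 \text{ otherwise;}\\
Y_i &:= 1 \text{ if participant } i \text{ is tested positive in the survey sample, } 0 \text{ otherwise;}\\
Z_i &:= 1 \text{ if participant } i \text{ was declared positive with the official procedure, } 0 \text{ otherwise.}
\end{aligned}
\tag{1}
$$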

We assume that, for each participant i = 1, . . . , n in the survey sample, we observe Y i and Z i , but not X i . The objective is to provide an estimator for the unknown population proportion π := P (X i = 1) .

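We allow for the possibility that the outcome of (medical) tests can be subject to misclassification error. Let

$$
\alpha := P(Y_i = 1 \mid X_i = 0), \quad \beta := P(Y_i = 0 \mid X_i = 1), \quad \alpha_0 := P(Z_i = 1 \mid X_i = 0), \quad \beta_0 := P(Z_i = 0 \mid X_i = 1).
$$

The probabilities $\alpha$ and $\beta$ are the (assumed known) FP rate ($\alpha = 1 -$ specificity) and FN rate ($\beta = 1 -$ sensitivity) of the particular medical test employed in the survey. The probabilities $\alpha_0$ and $\beta_0$ are, respectively, the FP rate (assumed known) and the FN rate of the official procedure.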

REMARK A: Note that α 0 is not the FP rate of the medical test administered in the official procedure. It is the probability that a participant has been incorrectly declared positive through the official procedure and, therefore, the product of two probabilities: the probability that a negative individual is selected to be tested in the official procedure multiplied with the probability that the medical test is positive conditional on this individual being (selected and) negative. In many applications α 0 will, therefore, be, sometimes substantially, smaller than the FP rate of the medical test.

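The FN rate $\beta_0$ of the official procedure is not known (otherwise we would know the population proportion $\pi$) and depends on $\pi_0$, $\pi$ and $\alpha_0$ as follows:

$$
\pi_0 := P(Z_i = 1) = (1 - \pi)\alpha_0 + \pi(1 - \beta_0).
$$

Thus,

$$
\beta_0 = 1 - \frac{\pi_0 - \alpha_0(1 - \pi)}{\pi}.
$$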
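The relation between the official positive rate and the FN rate of the official procedure can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the second function simply solves the same identity for the smallest prevalence compatible with a given official rate, which is reached when the official procedure misses no positive case.

```python
def official_fn_rate(pi, pi0, alpha0):
    """FN rate beta_0 implied by pi_0 = (1 - pi) * alpha_0 + pi * (1 - beta_0)."""
    if not 0.0 < pi <= 1.0:
        raise ValueError("pi must lie in (0, 1]")
    return 1.0 - (pi0 - alpha0 * (1.0 - pi)) / pi


def smallest_compatible_prevalence(pi0, alpha0):
    """Prevalence obtained by setting beta_0 = 0 in the same identity."""
    if not 0.0 <= alpha0 <= pi0 < 1.0:
        raise ValueError("expected 0 <= alpha0 <= pi0 < 1")
    return (pi0 - alpha0) / (1.0 - alpha0)


if __name__ == "__main__":
    # Hypothetical values: 2% declared positive, 0.1% wrongly declared positive.
    low = smallest_compatible_prevalence(0.02, 0.001)
    print(f"smallest prevalence: {low:.6f}")
    print(f"implied beta_0 at that prevalence: {official_fn_rate(low, 0.02, 0.001):.2e}")
    print(f"implied beta_0 if pi = 4%: {official_fn_rate(0.04, 0.02, 0.001):.4f}")
```

With a prevalence of twice the official rate and no false declarations, half of the positive cases go undetected, which is what the first function returns for `official_fn_rate(0.04, 0.02, 0.0)`.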

It is useful to make three small assumptions.

ASSUMPTION A: α + β < 1.

ASSUMPTION B: α 0 + β 0 < 1.

ASSUMPTION C: The survey sample is collected completely at random, without replacement. Its size $n$ is small compared to the population size.

With Assumption A, we rule out the uninteresting case $\alpha + \beta = 1$. Indeed, if $\alpha + \beta = 1$, $Y_i$ would be completely uninformative about the random variable of interest $X_i$, as $P(X_i = 1 \mid Y_i = 1) = P(X_i = 1 \mid Y_i = 0) = \pi$. Otherwise Assumption A is without loss of generality in the following sense. If $\alpha + \beta > 1$, we could just use $Y_i' = 1 - Y_i$ instead of $Y_i$, which would have FP and FN rates of $\alpha' = 1 - \alpha$ and $\beta' = 1 - \beta$, with $\alpha' + \beta' < 1$.

Assumption B is similarly without loss of generality. It implies that α 0 ≤ π 0 . To see this suppose that α 0 > π 0 = (1 − π)α 0 + π(1 − β 0 ). This is equivalent to 0 > −πα 0 + π(1 − β 0 ), which in turn, is equivalent to 0 > 1 − α 0 − β 0 , a contradiction.

Assumption C specifies the type of sampling method assumed in this paper. Extensions to weighted sampling methods, with non random weights, would require a relatively straightforward adjustment of the proposed estimators, that we omit for clarity of exposition. Moreover, assuming that the sample size is relatively small compared to the population size, allows one to consider distributional properties of the variables that can be easily defined, in that binomial distributions can be used to approximate hypergeometric distributions.

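REMARK B: The unknown population proportion of positive cases $\pi$ is bounded from below by $\underline{\pi} := \frac{\pi_0 - \alpha_0}{1 - \alpha_0}$. To see this, recall that the equality $\pi_0 = (1 - \pi)\alpha_0 + \pi(1 - \beta_0)$ must hold (with both $\pi$ and $\beta_0$ unknown parameters). The lowest admissible value for $\pi$ is achieved when $\beta_0 = 0$, in which case we get the lower bound $\frac{\pi_0 - \alpha_0}{1 - \alpha_0}$. If $\alpha_0 = 0$ then $\underline{\pi} = \pi_0$. Note that, given the assumptions, $0 \le \frac{\pi_0 - \alpha_0}{1 - \alpha_0} \le 1$.

From the variables $Y_i$ and $Z_i$ we construct the following random variables that will be used to formulate the models:

$$
R_{11} := \sum_{i=1}^n Y_i Z_i, \quad R_{10} := \sum_{i=1}^n (1 - Y_i) Z_i, \quad R_{01} := \sum_{i=1}^n Y_i (1 - Z_i), \quad R_{00} := \sum_{i=1}^n (1 - Y_i)(1 - Z_i).
\tag{2}
$$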

In words, $R_{11}$ is the number of participants in the survey sample that are tested positive and have also been declared positive through the official procedure; $R_{10}$ is the number of participants in the survey sample that are tested negative but have been declared positive through the official procedure; $R_{01}$ is the number of participants in the survey sample that are tested positive but have been declared negative through the official procedure; $R_{00}$ is the number of participants in the survey sample that are tested negative and have been declared negative through the official procedure. We also make use of $R_1^* = \sum_{i=1}^n Y_i = R_{11} + R_{01}$, the number of participants that are tested positive in the survey sample. The success probabilities (see Supplementary Material A for their derivation), denoted by $\tau_{ij}(\pi)$ and associated to each $R_{ij}$, $i, j \in \{0, 1\}$, in (2) are given by

$$
\begin{aligned}
\tau_{11}(\pi) &:= (1 - \beta)\pi_0 - (1 - \pi)\alpha_0\Delta,\\
\tau_{10}(\pi) &:= \beta\pi_0 + (1 - \pi)\alpha_0\Delta,\\
\tau_{01}(\pi) &:= (1 - \beta)(\pi - \pi_0) + (1 - \pi)(\alpha + \alpha_0\Delta),\\
\tau_{00}(\pi) &:= \beta(\pi - \pi_0) + (1 - \pi)(1 - \alpha - \alpha_0\Delta),
\end{aligned}
\tag{3}
$$

where $\Delta := 1 - (\alpha + \beta)$. Without misclassification error, we would have $\tau_{11}(\pi) = \pi_0$, $\tau_{10}(\pi) = 0$, $\tau_{01}(\pi) = \pi - \pi_0$, $\tau_{00}(\pi) = 1 - \pi$. Moreover, it is easy to verify that, given our Assumptions, the $\tau$'s are non-negative and sum up to 1.
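The four success probabilities can be recomputed from first principles, which makes the special cases above easy to check. The sketch below is not the derivation of Supplementary Material A: it assumes (our assumption, labeled in the code) that the survey test result and the official declaration are independent conditionally on the true status, and the function name is hypothetical.

```python
def cell_probabilities(pi, pi0, alpha=0.0, beta=0.0, alpha0=0.0):
    """Return (tau11, tau10, tau01, tau00) for one survey participant.

    First index: official declaration Z; second index: survey test result Y.
    ASSUMPTION (ours): Y and Z are independent conditionally on the
    true status X.
    """
    if not 0.0 < pi <= 1.0:
        raise ValueError("pi must lie in (0, 1]")
    # FN rate of the official procedure implied by
    # pi0 = (1 - pi) * alpha0 + pi * (1 - beta0).
    beta0 = 1.0 - (pi0 - alpha0 * (1.0 - pi)) / pi
    tau11 = pi * (1 - beta0) * (1 - beta) + (1 - pi) * alpha0 * alpha
    tau10 = pi * (1 - beta0) * beta + (1 - pi) * alpha0 * (1 - alpha)
    tau01 = pi * beta0 * (1 - beta) + (1 - pi) * (1 - alpha0) * alpha
    tau00 = pi * beta0 * beta + (1 - pi) * (1 - alpha0) * (1 - alpha)
    return tau11, tau10, tau01, tau00


if __name__ == "__main__":
    # Hypothetical values: pi = 5%, pi0 = 2%, no misclassification.
    print(cell_probabilities(0.05, 0.02))
    # Same prevalence with small error rates; the four cells still sum to 1.
    taus = cell_probabilities(0.05, 0.02, alpha=0.01, beta=0.02, alpha0=0.001)
    print(taus, sum(taus))
```

Each probability is affine in the prevalence once the implied official FN rate is substituted, which is the property the moment-based estimators rely on.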

In this section we derive Maximum Likelihood Estimators (MLE), a marginal MLE when some data is missing, and some Generalized Method of Moments (GMM) estimators. We also provide (exact) fiducial confidence intervals when possible, such as for a Method of Moments Estimator (MME) under the assumption that the FP rates are zero. We also provide confidence intervals based on the estimators' asymptotic distribution. We compare the accuracy of the proposed estimators (that utilize the information from the official procedure) with the survey MLE, that is, the sample proportion of positive cases in the survey sample (which ignores the information from the official procedure).

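The benchmark estimator, which is based only on $R_1^*$ $(= R_{11} + R_{01})$, the number of positive cases in the survey sample, is given by

$$
\bar{\pi} := \frac{1}{\Delta}\left(\frac{R_1^*}{n} - \alpha\right),
\tag{4}
$$

which reduces to $\bar{\pi} = R_1^*/n$ when $\alpha = \beta = 0$. It is actually the MLE of $\pi$ based only on the survey sample.

Its variance is given by

$$
\operatorname{var}(\bar{\pi}) = \frac{(\pi\Delta + \alpha)(1 - \pi\Delta - \alpha)}{n\Delta^2}.
\tag{5}
$$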
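As a minimal sketch of the benchmark, the correction for test errors and the corresponding binomial variance can be written as below. The function names and the numerical inputs are hypothetical; the formulas use that a survey test is positive with probability equal to the prevalence times 1 − (α + β), plus α.

```python
def survey_mle(r1_star, n, alpha=0.0, beta=0.0):
    """Sample proportion of positives, corrected for FP rate alpha and FN rate beta."""
    delta = 1.0 - (alpha + beta)
    if delta <= 0.0:
        raise ValueError("requires alpha + beta < 1")
    return (r1_star / n - alpha) / delta


def survey_mle_variance(pi, n, alpha=0.0, beta=0.0):
    """Variance of the corrected proportion under binomial sampling."""
    delta = 1.0 - (alpha + beta)
    if delta <= 0.0:
        raise ValueError("requires alpha + beta < 1")
    p_positive = pi * delta + alpha  # probability that a survey test is positive
    return p_positive * (1.0 - p_positive) / (n * delta ** 2)


if __name__ == "__main__":
    # Hypothetical survey: 50 positive tests among 2000 participants.
    print(survey_mle(50, 2000))                          # raw proportion
    print(survey_mle(50, 2000, alpha=0.01, beta=0.02))   # error-corrected
    print(survey_mle_variance(0.05, 2000))
```

Note that the corrected estimate can fall below zero when the observed proportion is smaller than the assumed FP rate; the sketch does not truncate it.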

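Under Assumption C, the likelihood function for $\pi$ can be obtained from the multinomial distribution with categories provided by $R_{11}, R_{10}, R_{01}, R_{00}$ and their associated success probabilities $\tau_{11}(\pi), \tau_{10}(\pi), \tau_{01}(\pi), \tau_{00}(\pi)$. The log-likelihood function is, therefore, given by

$$
\ell(\pi \mid R_{11}, R_{10}, R_{01}, R_{00}) = C + R_{11}\ln(\tau_{11}(\pi)) + R_{10}\ln(\tau_{10}(\pi)) + R_{01}\ln(\tau_{01}(\pi)) + R_{00}\ln(\tau_{00}(\pi)),
\tag{6}
$$

where $C$ is a quantity independent of $\pi$.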

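The conditional MLE, i.e., the one based on the log-likelihood given in (6), which is hence conditional on the information provided by the official procedure, is defined by

$$
\hat{\pi} := \operatorname*{argmax}_{\pi \in [\underline{\pi}, 1]} \ell(\pi \mid R_{11}, R_{10}, R_{01}, R_{00}),
\tag{7}
$$

with $\underline{\pi}$ given in Remark B. The conditional MLE $\hat{\pi}$ generally has no closed-form solution but can be computed numerically. However, in the case when $\alpha_0 = 0$, we obtain a closed-form solution given by

$$
\hat{\pi} = \frac{1}{\Delta}\left[(1 - \pi_0)\frac{R_{01}}{R_{01} + R_{00}} + (1 - \beta)\pi_0 - \alpha\right].
\tag{8}
$$

When $\alpha_0 = \alpha = \beta = 0$ (so that $R_{10} = 0$ with probability one), this further reduces to

$$
\hat{\pi} = \frac{\pi_0(n - R_1^*) + R_{01}}{n - R_{11}}.
\tag{9}
$$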

REMARK C: The closed form expression in (8) is the conditional MLE only if the estimate is within the interval $[\underline{\pi}, 1]$. There are, however, possible (but unlikely in practice) combinations of parameter values and sample realisations for which the likelihood function is maximized at the boundaries, i.e. either at $\underline{\pi}$ or at 1. In the case of no misclassification errors ($\alpha_0 = \alpha = \beta = 0$) the estimate given in (9) is automatically within $[\underline{\pi}, 1]$.
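When no closed form is available, the constrained maximization can be carried out with any one-dimensional optimizer. The sketch below is one possible implementation, not the one used in the cape package: it uses a golden-section search (the log-likelihood is a sum of logarithms of affine functions of the prevalence, hence concave), and the cell probabilities are written under our assumption that the survey result and the official declaration are conditionally independent given the true status. All names and counts are hypothetical.

```python
import math


def cell_probabilities(pi, pi0, alpha, beta, alpha0):
    """(tau11, tau10, tau01, tau00), assuming conditional independence of Y and Z."""
    delta = 1.0 - (alpha + beta)
    tau11 = (1 - beta) * pi0 - (1 - pi) * alpha0 * delta
    tau10 = beta * pi0 + (1 - pi) * alpha0 * delta
    tau01 = (1 - beta) * (pi - pi0) + (1 - pi) * (alpha + alpha0 * delta)
    tau00 = beta * (pi - pi0) + (1 - pi) * (1 - alpha - alpha0 * delta)
    return tau11, tau10, tau01, tau00


def log_likelihood(pi, counts, pi0, alpha=0.0, beta=0.0, alpha0=0.0):
    """Multinomial log-likelihood up to a constant; counts = (R11, R10, R01, R00)."""
    total = 0.0
    for r, tau in zip(counts, cell_probabilities(pi, pi0, alpha, beta, alpha0)):
        if r == 0:
            continue  # an empty cell contributes nothing, even if tau is 0
        if tau <= 0.0:
            return -math.inf
        total += r * math.log(tau)
    return total


def conditional_mle(counts, pi0, alpha=0.0, beta=0.0, alpha0=0.0, tol=1e-10):
    """Maximize the log-likelihood over [lower bound, 1] by golden-section search."""
    lo, hi = (pi0 - alpha0) / (1.0 - alpha0), 1.0
    ratio = (math.sqrt(5.0) - 1.0) / 2.0

    def f(x):
        return log_likelihood(x, counts, pi0, alpha, beta, alpha0)

    c, d = hi - ratio * (hi - lo), lo + ratio * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:  # the maximizer lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + ratio * (hi - lo)
            fd = f(d)
        else:        # the maximizer lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - ratio * (hi - lo)
            fc = f(c)
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    counts = (30, 0, 40, 1930)  # hypothetical (R11, R10, R01, R00), n = 2000
    print(conditional_mle(counts, pi0=0.015))
    # Setting the score of the same likelihood to zero (no misclassification):
    print((40 + 0.015 * (2000 - 70)) / (2000 - 30))
```

The two printed values should agree to several decimals, which is a convenient sanity check before switching to settings where only the numerical route is available.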

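REMARK D: In the case of no misclassification errors ($\alpha_0 = \alpha = \beta = 0$) and under Assumption C, we have that $R_1^* \sim B(n, \pi)$ and, conditionally on $R_1^*$, we obtain the conditional model $R_{11} \mid R_1^* \sim B(R_1^*, \pi_0/\pi)$, since a positive participant has been declared positive through the official procedure with probability $\pi_0/\pi$. The associated (conditional) likelihood function is, therefore, given by

$$
L(\pi \mid R_{11}, R_1^*) = \binom{n}{R_1^*}\pi^{R_1^*}(1 - \pi)^{n - R_1^*}\binom{R_1^*}{R_{11}}\left(\frac{\pi_0}{\pi}\right)^{R_{11}}\left(1 - \frac{\pi_0}{\pi}\right)^{R_1^* - R_{11}},
$$

with associated conditional MLE given in (9).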
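A sketch of how the closed-form estimator follows from this two-stage binomial model in the absence of misclassification, writing $R_{11}$ for the survey positives already declared officially and $R_{01} = R_1^* - R_{11}$ for the new cases (the intermediate steps are our own working and are not taken from the supplementary material):

```latex
\begin{aligned}
L(\pi) &\propto \pi^{R_1^*}(1-\pi)^{n-R_1^*}
          \left(\frac{\pi_0}{\pi}\right)^{R_{11}}
          \left(1-\frac{\pi_0}{\pi}\right)^{R_{01}}
        \;\propto\; (1-\pi)^{n-R_1^*}\,(\pi-\pi_0)^{R_{01}}, \\[4pt]
\frac{d}{d\pi}\ln L(\pi) &= \frac{R_{01}}{\pi-\pi_0}-\frac{n-R_1^*}{1-\pi}=0
  \;\Longrightarrow\;
  R_{01}(1-\pi) = (n-R_1^*)(\pi-\pi_0), \\[4pt]
\hat{\pi} &= \frac{\pi_0\,(n-R_1^*)+R_{01}}{n-R_1^*+R_{01}}
           = \frac{\pi_0\,(n-R_1^*)+R_{01}}{n-R_{11}} .
\end{aligned}
```

The powers of $\pi$ cancel because $R_{11} + R_{01} = R_1^*$, so only the new cases and the survey negatives carry information about $\pi$ beyond what $\pi_0$ already provides.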

In Proposition 1 below, we show the consistency and asymptotic normality of the conditional MLE defined in (7).

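PROPOSITION 1: The conditional MLE $\hat{\pi}$ defined in (7) is consistent for $\pi$. Moreover, if $\pi \in (\underline{\pi}, 1)$, we have

$$
\sqrt{n}\,(\hat{\pi} - \pi) \xrightarrow{\;d\;} \mathcal{N}\left(0, I(\pi)^{-1}\right),
$$

where $I(\pi)$ denotes the Fisher information for a single observation.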

The proof of Proposition 1 is provided in Supplementary Material B.

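Alternatively, we can consider an estimator from the class of GMM estimators (Hansen, 1982) based on the random vector $R := [R_{11}/n, R_{10}/n, R_{01}/n, R_{00}/n]^T$ with expectation $E[R] := \tau(\pi) = [\tau_{11}(\pi), \tau_{10}(\pi), \tau_{01}(\pi), \tau_{00}(\pi)]^T$. A GMM estimator $\tilde{\pi}$ is given by

$$
\tilde{\pi} := \operatorname*{argmin}_{\pi \in [\underline{\pi}, 1]} \left(R - \tau(\pi)\right)^T \Omega \left(R - \tau(\pi)\right),
$$

where $\Omega$ is a fixed 4 by 4 positive definite matrix with entries $\omega_{ij}$, $i, j = 1, \ldots, 4$. Since $\tau(\pi)$ is an affine function of $\pi$, we can write $\tau(\pi) := a\pi + b$, with $a = [a_l]_{l=1,\ldots,4}$, $b = [b_l]_{l=1,\ldots,4}$ two vectors derived from (3). Then, assuming an interior solution exists (a remark similar to Remark C applies), $\tilde{\pi}$ is the root of

$$
\frac{d}{d\pi}\left(R - a\pi - b\right)^T \Omega \left(R - a\pi - b\right) = -2\,a^T \Omega \left(R - a\pi - b\right) = 0.
$$

Therefore, we obtain

$$
\tilde{\pi} = \frac{a^T \Omega (R - b)}{a^T \Omega a},
\tag{10}
$$

and it follows that $E[\tilde{\pi}] = \pi$. For a general matrix $\Omega$, $\tilde{\pi}$ is a linear combination of the elements of $R$, and it would be useful to choose $\Omega$ such that the distribution of $\tilde{\pi}$ is known (for all $n$), for the construction of exact confidence bounds. One such case is obtained when $\omega_{ij} = 1$ for $i = j = 3$ and 0 otherwise, i.e. the GMM is reduced to a MME based on $R_{01}$ (with expectation $n\tau_{01}(\pi)$), which, again assuming an interior solution exists, is given by $\tilde{\pi} \in [\underline{\pi}, 1]$ that solves

$$
\frac{R_{01}}{n} = \tau_{01}(\pi).
$$

This yields

$$
\tilde{\pi} = \frac{1}{\Delta(1 - \alpha_0)}\left(\frac{R_{01}}{n} + (1 - \beta)\pi_0 - \alpha - \alpha_0\Delta\right).
\tag{11}
$$

When $\alpha_0 = \alpha = \beta = 0$, this reduces to

$$
\tilde{\pi} = \pi_0 + \frac{R_{01}}{n}.
\tag{12}
$$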

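REMARK E: Interestingly, in the case of no misclassification errors ($\alpha_0 = \alpha = \beta = 0$), $\tilde{\pi}$ can also be seen as an approximation to the MLE (in (9)) for small values of $\pi_0$ and $\pi$, i.e., by simplifying $(n - R_1^*)/(n - R_{11}) \approx 1$ and $\pi_0(n - R_{11}) \approx \pi_0 n$.

Moreover, we have $E[\tilde{\pi}] = \pi$, i.e., the moment estimator is unbiased, and the variance is easily determined to be

$$
\operatorname{var}(\tilde{\pi}) = \frac{\tau_{01}(\pi)\left(1 - \tau_{01}(\pi)\right)}{n\,\Delta^2(1 - \alpha_0)^2}.
\tag{13}
$$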

The possible advantage of the MME $\tilde{\pi}$ in (11) is that it has a known finite sample distribution, based on $R_{01} \sim B(n, \tau_{01}(\pi))$, so that exact confidence bounds can be computed using, for example, the Clopper-Pearson method, see below. Actually, using (10) and setting $\omega_{ij} = 1$ for $i = j = l$ and 0 otherwise, $l = 1, \ldots, 4$, we can obtain all the MME corresponding to the different variables in $R$, as

$$
\tilde{\pi}^{(l)} = \frac{R_l - b_l}{a_l}, \quad l = 1, \ldots, 4,
\tag{14}
$$

where $R_l$ denotes the $l$-th element of $R$, with $E[\tilde{\pi}^{(l)}] = \pi$ for all $l = 1, \ldots, 4$, and also known finite sample distribution. In Supplementary Material C we propose an alternative and more efficient moment estimator based on a (variance minimizing) linear combination of the $\tilde{\pi}^{(l)}$, but unfortunately without known finite sample distribution. However, when $\alpha_0$ tends to zero (recall Remark A for the interpretation of $\alpha_0$), this minimum variance GMM estimator is in fact the MME in (11).

In some cases it might be that the information in $R_{10}$ (and $R_{00}$) in (2) is not easily available, for example, when additional data is collected using follow-up procedures. In that case, one can proceed with the marginalization of the likelihood function in (6) on the unknown quantities, leading to

$$
\begin{aligned}
\ell^*(\pi \mid R_{11}, R_{01}) &= C + R_{11}\ln(\tau_{11}(\pi)) + R_{01}\ln(\tau_{01}(\pi)) + E[R_{10}]\ln(\tau_{10}(\pi)) + \left(n - R_1^* - E[R_{10}]\right)\ln(\tau_{00}(\pi))\\
&= C + R_{11}\ln(\tau_{11}(\pi)) + R_{01}\ln(\tau_{01}(\pi)) + n\tau_{10}(\pi)\ln(\tau_{10}(\pi)) + \left(n - R_1^* - n\tau_{10}(\pi)\right)\ln(\tau_{00}(\pi)),
\end{aligned}
$$

where $C$ is a quantity independent of $\pi$. The marginal MLE is given by

$$
\check{\pi} := \operatorname*{argmax}_{\pi \in [\underline{\pi}, 1]} \ell^*(\pi \mid R_{11}, R_{01}),
\tag{15}
$$

and, generally, has no closed form. It can however be easily computed using a numerical optimisation method.

As for the conditional MLE, we show the consistency and asymptotic normality of the marginal MLE in (15) in Proposition 2 below. The proof is omitted as it follows closely the one of Proposition 1. Also, the exact expression of the asymptotic variance, denoted by $I^*(\pi)^{-1}$, is not explicitly provided here but is implemented in the cape R package (see Section 7).

PROPOSITION 2: The marginal MLE $\check{\pi}$ in (15) is consistent for $\pi$. Moreover, if $\pi \in (\underline{\pi}, 1)$, we have

$$
\sqrt{n}\,(\check{\pi} - \pi) \xrightarrow{\;d\;} \mathcal{N}\left(0, I^*(\pi)^{-1}\right).
$$

In this section, we compare the variance of the various estimators to assess their efficiency relative to the Cramer-Rao lower bound variance (that the conditional MLE achieves asymptotically) of all unbiased estimators.

The closed form expressions for the variance are given in (5) for the survey MLE and in (13) for the MME. No closed form expressions of the finite sample variance of the conditional MLE and the marginal MLE are easily obtained, not even for the case of no misclassification errors.

The Cramer-Rao lower bound, which is also the asymptotic variance of the conditional MLE, is given by the reciprocal of the Fisher information, that is, for a sample of size $n$,

$$
\frac{1}{n}\,I(\pi)^{-1}, \qquad I(\pi) = \sum_{i,j \in \{0,1\}} \frac{\left(\partial\tau_{ij}(\pi)/\partial\pi\right)^2}{\tau_{ij}(\pi)}.
$$

One can provide a lengthy closed form expression for $I(\pi)^{-1}$, see Proposition 1. In practice, based on simulations (not presented here), the sample variance appears indistinguishable from the asymptotic variance for sample sizes of $n \ge 500$.

In Section 4, we perform a simulation study, with parameter values loosely inspired by what one might expect for estimating the COVID-19 prevalence using PCR tests, to empirically assess the efficiency of the various estimators by considering the ratio of the Cramer-Rao lower bound and the variance of the estimator. In this section we formally compute the efficiency ratios, in the case of no misclassification errors, in order to highlight the increased precision that we get by considering the information from the official procedure. To do so, let $\alpha_0 = \alpha = \beta = 0$ and consider the ratio of the variance of $\bar{\pi}$ (in (5)) relative to that of $\tilde{\pi}$ (in (13)):

$$
\frac{\operatorname{var}(\bar{\pi})}{\operatorname{var}(\tilde{\pi})} = \frac{\pi(1 - \pi)}{(\pi - \pi_0)(1 + \pi_0 - \pi)} = \frac{\pi(1 - \pi)}{\pi(1 - \pi) - \pi_0(1 + \pi_0 - 2\pi)}.
$$

Therefore, when $2\pi > 1 + \pi_0$ we have $\operatorname{var}(\bar{\pi}) < \operatorname{var}(\tilde{\pi})$, while when $2\pi < 1 + \pi_0$ we have $\operatorname{var}(\bar{\pi}) > \operatorname{var}(\tilde{\pi})$. A sufficient condition for the variance of the MME to be lower than the variance of the survey MLE is, therefore, that the true population proportion $\pi$ is below one half.

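On the other hand, the efficiency of the survey MLE relative to the (asymptotic) conditional MLE, in this case, is given by

$$
\frac{\operatorname{var}(\bar{\pi})}{\operatorname{var}(\hat{\pi})} = \frac{\pi(1 - \pi_0)}{\pi - \pi_0}.
$$

Moreover, since the variance of the conditional MLE is also the Cramer-Rao lower bound for the variance of any unbiased estimator of $\pi$, the MME, being unbiased, must have a higher variance. Indeed, the relative efficiency of $\tilde{\pi}$ versus the conditional MLE (for sufficiently large $n$) is given by

$$
\frac{\operatorname{var}(\tilde{\pi})}{\operatorname{var}(\hat{\pi})} = \frac{(1 + \pi_0 - \pi)(1 - \pi_0)}{1 - \pi} = 1 + \frac{\pi_0(\pi - \pi_0)}{1 - \pi} \ge 1.
$$

The efficiency loss of $\bar{\pi}$ relative to $\tilde{\pi}$ can also be expressed in terms of the increase in sample size needed when using $\bar{\pi}$ rather than $\tilde{\pi}$. Let $n^*$ denote the sample size that is needed to obtain a variance for the survey MLE $\bar{\pi}$ that is equal to the one of the MME $\tilde{\pi}$ using a sample size of $n$. We obtain

$$
\frac{n^*}{n} = \frac{\pi(1 - \pi)}{(\pi - \pi_0)(1 + \pi_0 - \pi)},
$$

which, for small $\pi_0$, is approximately equal to $\frac{1}{1 - \pi_0/\pi}$. If, for instance, $\pi = 2\pi_0$ then $\frac{n^*}{n} \approx 2$. The added value in using the additional information provided in $R_{11}$, therefore, is equivalent to using the survey MLE with a sample with twice the size.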
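The variance comparisons of this section can be tabulated directly in the case of no misclassification errors. The sketch below uses the binomial variances of the survey proportion and of the count of new cases, together with the Fisher information of the multinomial model with cells of probability π_0, π − π_0 and 1 − π; the function names and the chosen values are hypothetical.

```python
def ratio_survey_to_mme(pi, pi0):
    """var(survey MLE) / var(MME) with no misclassification; also n*/n."""
    return pi * (1.0 - pi) / ((pi - pi0) * (1.0 + pi0 - pi))


def ratio_survey_to_cmle(pi, pi0):
    """var(survey MLE) / asymptotic var(conditional MLE) with no misclassification."""
    return pi * (1.0 - pi0) / (pi - pi0)


def ratio_mme_to_cmle(pi, pi0):
    """var(MME) / asymptotic var(conditional MLE) with no misclassification."""
    return (1.0 + pi0 - pi) * (1.0 - pi0) / (1.0 - pi)


if __name__ == "__main__":
    # Hypothetical case: half of the positive cases are officially known.
    pi, pi0 = 0.02, 0.01
    print(f"sample-size factor n*/n : {ratio_survey_to_mme(pi, pi0):.4f}")
    print(f"approximation 1/(1-pi0/pi): {1.0 / (1.0 - pi0 / pi):.4f}")
    print(f"survey MLE vs conditional MLE: {ratio_survey_to_cmle(pi, pi0):.4f}")
    print(f"MME vs conditional MLE       : {ratio_mme_to_cmle(pi, pi0):.6f}")
```

For small prevalences the last ratio is very close to one, which illustrates why the moment estimator loses little relative to the conditional MLE in that regime.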

Although the MME $\tilde{\pi}$ has a (typically small) efficiency loss relative to the conditional MLE, it has the advantage of having a known distribution through $R_{01} \sim B(n, \tau_{01}(\pi))$. This allows one to construct (exact, but possibly conservative) confidence intervals even in finite samples without appealing to the estimator's asymptotic normal distribution, using the (fiducial) approach put forward in Clopper and Pearson (1934) (see also e.g. Fisher, 1935; Brown et al., 2001).

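A Clopper-Pearson (CP) $(1 - \gamma)$ confidence interval based on the survey MLE, i.e., based on $R_1^*$, is given by

$$
\left[\frac{p_L(R_1^*) - \alpha}{\Delta},\; \frac{p_U(R_1^*) - \alpha}{\Delta}\right],
$$

where, generally, for a count $R$ out of $n$, the bounds $p_L(R)$ and $p_U(R)$ solve

$$
B\left(p_L(R);\, R,\, n - R + 1\right) = \frac{\gamma}{2}, \qquad B\left(p_U(R);\, R + 1,\, n - R\right) = 1 - \frac{\gamma}{2},
$$

with $p_L(0) = 0$ and $p_U(n) = 1$, and where $B(p; v, w)$, $0 \le p \le 1$, is the cumulative distribution function of a beta distribution with shape parameters $v$ and $w$.

A CP $(1 - \gamma)$ confidence interval can be constructed based on the moment estimator (11), i.e., based on the information provided by $R_{01}$. Given that $E[R_{01}/n] = \tau_{01}(\pi)$ (see (3)), a $(1 - \gamma)$ confidence interval for $\pi$ is given by

$$
\left[\frac{p_L(R_{01}) + (1 - \beta)\pi_0 - \alpha - \alpha_0\Delta}{\Delta(1 - \alpha_0)},\; \frac{p_U(R_{01}) + (1 - \beta)\pi_0 - \alpha - \alpha_0\Delta}{\Delta(1 - \alpha_0)}\right].
$$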
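The Clopper-Pearson bounds can be computed without a beta quantile routine by solving the equivalent binomial tail equations by bisection. The sketch below does so with the standard library only and then maps the bounds for a binomial success probability to bounds for the prevalence through an affine relation p = slope × π + intercept; the function names are hypothetical, and the slope and intercept must be supplied by the user for the count at hand.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for j in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                   + j * log_p + (n - j) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def _solve_decreasing(f, target):
    """Bisection for f(p) = target on [0, 1], with f non-increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def clopper_pearson(r, n, gamma=0.05):
    """Exact (1 - gamma) interval for a binomial success probability."""
    lower = 0.0 if r == 0 else _solve_decreasing(
        lambda p: binom_cdf(r - 1, n, p), 1.0 - gamma / 2.0)  # P(X >= r) = gamma/2
    upper = 1.0 if r == n else _solve_decreasing(
        lambda p: binom_cdf(r, n, p), gamma / 2.0)            # P(X <= r) = gamma/2
    return lower, upper


def interval_for_pi(r, n, slope, intercept, gamma=0.05):
    """Map the interval for p = slope * pi + intercept to an interval for pi."""
    lower, upper = clopper_pearson(r, n, gamma)
    return (lower - intercept) / slope, (upper - intercept) / slope


if __name__ == "__main__":
    # Hypothetical counts: 40 new cases among 2000, no misclassification, pi0 = 1.5%,
    # so that p = pi - pi0, i.e. slope = 1 and intercept = -pi0.
    print(interval_for_pi(40, 2000, slope=1.0, intercept=-0.015))
```

The sketch does not truncate the resulting bounds to the admissible range of the prevalence; whether and how to do so is left open here.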

Using the conditional and marginal MLEs we can also provide confidence intervals based on their asymptotic normal distribution. All these confidence intervals are compared in our COVID-19 inspired simulation study in Section 4 and in our case study using actual COVID-19 data from an Austrian survey sample in Section 5.

In this section, we present the efficiencies, coverage and confidence interval lengths of the different methods, in finite samples. This section is parameterized in such a way that it is loosely compatible with the case of COVID-19 prevalence estimation using PCR tests. In particular, the FP and FN rates have been chosen so that they correspond to sensitivity and specificity commonly encountered in COVID-19 medical tests, as for example reported by the Center for Health Security of the Johns Hopkins University (Kobokovich et al., 2020), see also Surkova et al. (2020). Throughout we choose $\alpha_0 = 0$ for the FP rate of the official procedure. We do so because, as pointed out in Remark A, $\alpha_0$ is the product of two probabilities, here the probability of a COVID-19 negative person being selected to be tested in the official procedure and the FP rate of the PCR test employed in the official procedure. Given the relatively low official prevalence of COVID-19, at least at the moment of writing this article, this product must be fairly close to zero. If, for instance, 1% of the members of a population have been found positive through the official procedure and if the FP rate of the PCR test is another 1%, we get $\alpha_0 = (0.01)^2 = 0.01\%$.

We consider three settings. Setting I is without misclassification error, i.e. with $\alpha_0 = \alpha = \beta = 0$. Setting II has only a FN rate, i.e. $\alpha_0 = \alpha = 0$, $\beta = 2\%$. Setting III, finally, has both types of misclassification errors, i.e., $\alpha_0 = 0$, $\alpha = 1\%$, $\beta = 2\%$. We consider a sample size of $n = 2{,}000$, which leads to the same conclusions (not presented here) as a somewhat smaller sample size (e.g. $n = 1{,}500$).

For $\pi$, we consider three rather different values, i.e. 5%, 20% and 75%, in order to cover a wide range of possible prevalence rates. For $\pi_0$, we consider, for each value of $\pi$, 30 equally spaced values between $1.025\min(\alpha_0, \pi)$ and $0.975\pi$, so that, conditionally on the information brought in by $Z_i$, one can appreciate the efficiency and accuracy gain of the approach based on the conditional model. As estimators, we consider the survey MLE $\bar{\pi}$ in (4), the conditional MLE $\hat{\pi}$ in (7), the MME $\tilde{\pi}$ in (11) as well as the marginal MLE $\check{\pi}$ in (15) for the plausible cases when the information on $R_{10}$ and $R_{00}$ in (2) is not available.

Figure 1 presents the efficiencies of the estimators. First, there is a substantial efficiency loss for the survey MLE $\bar{\pi}$ that increases drastically as $\pi_0$ approaches $\pi$, with or without misclassification errors. This is in line with the fact that the information brought in by considering $Z_i$ in (1) is more important when $\pi_0$ is near $\pi$, and ignoring it lowers the efficiency. Second, for the marginal MLE, the efficiency loss is negligible throughout the different settings, so there is little gain in considering $R_{10}$ and $R_{00}$ in (2), especially when this information is difficult or costly to obtain. Third, for the MME, the efficiency loss is negligible for $\pi = 5\%$ and $\pi = 20\%$ when $\pi_0$ is not too near to $\pi$, while the efficiency loss is rather important for small values of $\pi_0$ (relative to $\pi$), compared to the one of the survey MLE, when $\pi = 75\%$.

Figure 2 presents the coverage (at the 95% level), computed using simulations, for the CP method based on $R_1^*$ in (2), which is associated to the survey MLE $\bar{\pi}$, the CP method based on $R_{01}$ in (2), which is associated to the MME $\tilde{\pi}$, and the asymptotic method based on the conditional MLE $\hat{\pi}$. The coverage for the asymptotic method based on the marginal MLE $\check{\pi}$ is not presented as it is the same as the one for the asymptotic method based on the conditional MLE.
Overall, as expected, the CP method provides slightly conservative coverage across settings, while the asymptotic method based on the survey MLE is slightly liberal, especially for $\pi = 5\%$. Moreover, for both the CP method based on $R_{01}$ and the asymptotic method based on the conditional MLE, for $\pi = 5\%$ and $\pi = 20\%$, the coverage worsens (even if it remains quite accurate) as $\pi_0$ approaches $\pi$. For the asymptotic method, this can be explained by the fact that confidence intervals might have bounds falling outside the domain of $\pi$ (e.g. below $\pi_0$), especially when $\pi$ is near $\pi_0$ and in settings such as Setting II.

Figure 2: Empirical coverage (at the 95% level) for the CP method based on $R_1^*$ in (2), the CP method based on $R_{01}$ in (2) and the asymptotic method based on the conditional MLE $\hat{\pi}$. Top panels: $\alpha_0 = \alpha = \beta = 0$. Middle panels: $\alpha_0 = \alpha = 0$, $\beta = 2\%$. Bottom panels: $\alpha_0 = 0$, $\alpha = 1\%$, $\beta = 2\%$. The sample size is 2,000 and the number of Monte Carlo simulations is 50,000.

Given that the coverage is reasonable across methods, it is worth comparing the confidence interval lengths. Figure 3 presents the relative confidence interval (at the 95% level) lengths, computed using simulations, for the CP method based on $R_1^*$ in (2) (associated to the survey MLE $\bar{\pi}$) and the CP method based on $R_{01}$ in (2) (associated to the MME $\tilde{\pi}$), relative to the confidence interval (at the 95% level) lengths for the asymptotic method based on the conditional MLE $\hat{\pi}$. One can observe, as expected, that the (mean) confidence interval lengths can be a lot larger when ignoring the information provided by $Z_i$ in (1), especially as the information increases, i.e. as $\pi_0$ approaches $\pi$. An interesting feature appears, however, for a small population proportion ($\pi = 5\%$) when $\pi_0$ approaches $\pi$, in that the mean confidence interval length for the CP method based on $R_{01}$ (associated to the MME) is smaller than the one of the asymptotic method based on the conditional MLE. However, for a large population proportion ($\pi = 75\%$), the mean confidence interval lengths for the CP method based on $R_1^*$ are relatively smaller than the ones based on $R_{01}$, while remaining larger than the mean confidence interval length for the asymptotic method based on the conditional MLE. This is especially the case for small values of $\pi_0$ relative to $\pi$, and is in line with the study of the efficiencies provided in Figure 1.

Figure 3: Relative empirical confidence interval (at the 95% level) mean lengths for the CP method based on $R_1^*$ in (2) and the CP method based on $R_{01}$ in (2), relative to the empirical confidence interval (at the 95% level) mean lengths for the asymptotic method based on the conditional MLE $\hat{\pi}$. Top panels: $\alpha_0 = \alpha = \beta = 0$. Middle panels: $\alpha_0 = \alpha = 0$, $\beta = 2\%$. Bottom panels: $\alpha_0 = 0$, $\alpha = 1\%$, $\beta = 2\%$. The sample size is 2,000 and the number of Monte Carlo simulations is 50,000.

We use the methodology developed in this paper for the case of the COVID-19 prevalence estimation using the results of a survey done in November 2020 by Statistics Austria (2020). We also compare the different approaches, in order to illustrate, in practice, the impact of choosing one method rather than another one. In November 2020, a survey sample of $n = 2287$ was collected to test for COVID-19 using PCR tests. Seventy-one participants ($R_1^* = 71$) were tested positive, and among these, thirty-two ($R_{11} = 32$) had declared to have been tested positive with the official procedure during the same month. In November, there were 93914 declared cases among the official (approximately) 7166167 inhabitants in Austria (above 16 years old), so that $\pi_0 \approx 1.3105\%$. The sensitivity ($1 - \beta$) and the specificity ($1 - \alpha$) are not known with precision, so that we present estimates of the prevalence without misclassification error as well as for values of the FP and FN rates that are plausible given the data and according to the sensitivity and specificity reported in Kobokovich et al. (2020) or Surkova et al. (2020). Table 1 provides various estimates of $\pi$, the COVID-19 prevalence in Austria in November 2020, for the case of no misclassification error and for the case of misclassification errors with $\alpha = 1\%$, $\beta = 10\%$, and $\alpha_0 = 0$. Recall Remark A for the choice of $\alpha_0 = 0$. We also chose a small $\alpha$ (FP rate for the medical test in the survey sample), because we only observe 71 positive cases out of 2287 participants. If $\alpha$ were larger, say $\alpha = 5\%$, we would also expect a larger number of (misclassified) positive cases, i.e. 114 positive cases just because of false positives.

Table 1: Prevalence estimation for the Austrian data (November 2020) with associated 95% confidence intervals, using the conditional MLE (CMLE) with asymptotic confidence intervals, the moment estimator (MME) with Clopper-Pearson intervals and the survey MLE (SMLE) with asymptotic and Clopper-Pearson confidence intervals. For the latter, two additional estimates are provided with $n^* = kn$ and $R_1^{**} = kR_1^*$, with $k = 1.5$ (SMLE-CP*) and $k = 2$ (SMLE-CP**). The original data are $\pi_0 \approx 1.3105\%$, $n = 2287$, $R_1^* = 71$, $R_{11} = 32$. The CI are illustrated as horizontal bars with lengths associated to respectively 80%, 95% and 99% confidence levels, with a dot representing the estimate. The first three columns are under the assumption of no misclassification errors. The second three columns assume $\alpha = 1\%$, $\beta = 10\%$, and $\alpha_0 = 0$.

From the first three lines of Table 1, one can derive a series of insights. First, we note that without misclassification errors, the estimates are very similar across methods. Second, as expected, the confidence intervals for the survey MLE (SMLE) are wider than the ones associated to the conditional MLE (CMLE) or the MME. This second statement is true for both the case of no misclassification errors and the case of some misclassification errors. Third, in the case of misclassification errors, the estimates differ more substantially between the survey MLE and the conditional MLE or MME, with a difference of 10% in the estimate.
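The point estimates can be recomputed from the figures quoted in the text (n = 2287, 71 positive tests, 32 of them already declared officially, 93914 declared cases among 7166167 inhabitants). The sketch below is not the cape implementation and its output is not copied from Table 1: the formulas are the closed forms that hold when the FP rate of the official procedure is zero, the misclassified-case expressions assume that the survey result and the official declaration are conditionally independent given the true status, and the function names are hypothetical. The conditional MLE with misclassification is omitted because it requires the count of declared-positive participants who tested negative, which is not quoted in the text.

```python
N, R1_STAR, R11 = 2287, 71, 32
R01 = R1_STAR - R11        # new cases found by the survey
PI0 = 93914 / 7166167      # official proportion of declared cases


def survey_mle(r1_star, n, alpha=0.0, beta=0.0):
    """Error-corrected sample proportion of positive tests."""
    return (r1_star / n - alpha) / (1.0 - alpha - beta)


def mme(r01, n, pi0, alpha=0.0, beta=0.0):
    """Moment estimator based on new cases, with alpha_0 = 0."""
    return (r01 / n + (1.0 - beta) * pi0 - alpha) / (1.0 - alpha - beta)


def cmle_no_error(r01, r1_star, r11, n, pi0):
    """Conditional MLE without misclassification errors."""
    return (pi0 * (n - r1_star) + r01) / (n - r11)


if __name__ == "__main__":
    print(f"pi0                 : {100 * PI0:.4f}%")
    print("no misclassification")
    print(f"  survey MLE        : {100 * survey_mle(R1_STAR, N):.4f}%")
    print(f"  MME               : {100 * mme(R01, N, PI0):.4f}%")
    print(f"  conditional MLE   : {100 * cmle_no_error(R01, R1_STAR, R11, N, PI0):.4f}%")
    print("alpha = 1%, beta = 10%, alpha0 = 0")
    print(f"  survey MLE        : {100 * survey_mle(R1_STAR, N, 0.01, 0.10):.4f}%")
    print(f"  MME               : {100 * mme(R01, N, PI0, 0.01, 0.10):.4f}%")
```

Under these assumptions the three estimates without misclassification lie within about a tenth of a percentage point of each other, while with the assumed error rates the survey MLE exceeds the MME by roughly a tenth in relative terms.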

Since the FP rate $\alpha$ has a limited number of possible values, given the data, we present in Figure 4 a sensitivity analysis of the prevalence estimation by the survey MLE and the MME, when the FN rate $\beta$ varies from 0% to 30%. What is striking is that the survey MLE is much more influenced by the value of the FN rate $\beta$ than the MME, which is far more stable. To understand this feature, from (11), we get, for the MME, under the sensitivity analysis conditions, $\tilde{\pi} = \frac{1}{\Delta}\left(\frac{R_{01}}{n} + (1 - \beta)\pi_0\right)$. With increasing values of $\beta$, $\Delta = 1 - (\alpha + \beta)$ decreases, but at the same time, the quantity $(1 - \beta)\pi_0$ also decreases. On the other hand, with the survey MLE given in (4), an increase in the FN rate $\beta$ directly induces an increased value of the estimator.
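The contrast between the two estimators can be sketched numerically. The functions below are assumptions, not the implementation in cape: the survey MLE is taken to be the usual misclassification-corrected proportion $(R_{*1}/n - \alpha)/\Delta$, and the MME is written for $\alpha_0 = 0$ as $(R_{01}/n - \alpha + (1-\beta)\pi_0)/\Delta$, which follows from the cell probability $\tau_{11}$ given in the appendix and reduces to the expression in the text when $\alpha = 0$. The value $\alpha = 1\%$ is borrowed from Table 1; the value used for Figure 4 is not stated here.

```python
def survey_mle(r_star1, n, alpha, beta):
    """Misclassification-corrected sample proportion (assumed form of the survey MLE)."""
    delta = 1.0 - (alpha + beta)
    return (r_star1 / n - alpha) / delta


def moment_estimator(r_01, n, pi0, alpha, beta):
    """Moment estimator with alpha0 = 0 (assumed form, derived from the cell probabilities)."""
    delta = 1.0 - (alpha + beta)
    return (r_01 / n - alpha + (1.0 - beta) * pi0) / delta


# Austrian data from the text.
n, r_star1, r_11 = 2287, 71, 32
r_01 = r_star1 - r_11
pi0 = 93914 / 7166167
alpha = 0.01  # assumption: same FP rate as in Table 1

betas = [0.0, 0.1, 0.2, 0.3]
smle_path = [survey_mle(r_star1, n, alpha, b) for b in betas]
mme_path = [moment_estimator(r_01, n, pi0, alpha, b) for b in betas]

for b, s, m in zip(betas, smle_path, mme_path):
    print(f"beta = {b:.0%}: survey MLE = {s:.3%}, MME = {m:.3%}")

# Spread of each estimator over the range of beta values.
smle_spread = smle_path[-1] - smle_path[0]
mme_spread = mme_path[-1] - mme_path[0]
print(f"spread: survey MLE = {smle_spread:.3%}, MME = {mme_spread:.3%}")
```

Under these assumptions the survey MLE moves by roughly three times as much as the MME over the same range of $\beta$, because the numerator term $(1-\beta)\pi_0$ of the MME shrinks together with $\Delta$.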

Finally, in order to illustrate the accuracy gain from using a conditional MLE or MME, the last two lines of Table 1 provide the prevalence estimate using the survey MLE, with associated CIs, for 1.5 and 2 times as many sample data. In other words, the (hypothetical) data are built by choosing $n^* = kn$ and $R^*_{*1} = kR_{*1}$, with $k = 1.5, 2$. The aim of this exercise is to see whether, with more data, the survey MLE can provide an estimator that is as accurate as the conditional MLE or MME. One can see that, roughly, one would need twice as much survey sample data to achieve the same level of accuracy provided by the MME or the conditional MLE. This is in line with the theoretical results provided in Section 3.2.
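The effect of inflating the survey sample on the width of the interval for the simple sample proportion can be sketched as follows. This is a minimal, standard-library implementation of the Clopper-Pearson interval by bisection, assuming no misclassification; it is not the code used for Table 1. Only $k = 1$ and $k = 2$ are shown, since $k = 1.5$ gives a non-integer count that would need a rounding rule not specified here.

```python
import math


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), with each term computed in log space."""
    if x < 0:
        return 0.0
    if x >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for k in range(x + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def clopper_pearson(x, n, level=0.95):
    """Exact (Clopper-Pearson) interval for a binomial proportion, found by bisection."""
    a = 1.0 - level

    def root(g):
        # g is increasing in p, negative near 0 and positive near 1.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if g(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower bound: P(X >= x | p) = a / 2.  Upper bound: P(X <= x | p) = a / 2.
    lower = 0.0 if x == 0 else root(lambda p: (1.0 - binom_cdf(x - 1, n, p)) - a / 2.0)
    upper = 1.0 if x == n else root(lambda p: a / 2.0 - binom_cdf(x, n, p))
    return lower, upper


n, x = 2287, 71
lo1, hi1 = clopper_pearson(x, n)          # original survey data
lo2, hi2 = clopper_pearson(2 * x, 2 * n)  # hypothetical data with k = 2
width_ratio = (hi2 - lo2) / (hi1 - lo1)

print(f"k = 1: [{lo1:.3%}, {hi1:.3%}]")
print(f"k = 2: [{lo2:.3%}, {hi2:.3%}]")
print(f"width ratio = {width_ratio:.2f}")
```

Doubling the sample shrinks the interval by a factor close to $1/\sqrt{2}$, which gives a sense of how much extra survey data the simple proportion needs in order to match a more efficient estimator.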

While we have cast this paper in the language of disease prevalence estimation, the method we propose has a more general range of applications. We actually propose a method to estimate the proportion of some characteristic in a population using information both from a random sample and from an incomplete census or census with participation bias. In other words, we are interested in the prevalence (or proportion) of population members having characteristic A, conditional on another characteristic B, such that having characteristic B implies having characteristic A, but not necessarily vice versa. We study this problem with and without the possibility of misclassification errors for A as well as for B.

The approach that we propose for such settings is, when a random survey sample is drawn, to record for each participant not only whether they have characteristic A, but also whether they have characteristic B. The key idea, to improve the accuracy of the estimate of the prevalence of characteristic A in the population, is to base the estimate appropriately on the number of participants in the sample that have characteristic A and not B. We propose an MLE as well as an MME derived from this idea.

We show that our approach provides estimates that are substantially more accurate than the simple sample proportion (of participants with characteristic A), that is, the maximum likelihood estimate that ignores the information available for characteristic B. As an important consequence, our approach can provide a given level of desired accuracy with a substantially smaller sample size. This is useful when data collection is costly or, as for our COVID-19 example, medical tests (or lab spaces to evaluate tests) are in limited supply.
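The size of the accuracy gain can be illustrated with a back-of-the-envelope variance comparison. The sketch below assumes no misclassification, a known census proportion $\pi_0$ for characteristic B, and that B implies A, so that the moment estimator is $R_{01}/n + \pi_0$ with variance $\tau_{01}(1-\tau_{01})/n$ and $\tau_{01} = \pi - \pi_0$. The function name and the numerical values are hypothetical.

```python
def variance_ratio(pi, pi0):
    """var(moment estimator) / var(sample proportion) for the same sample size n."""
    tau01 = pi - pi0                       # P(has A but not B)
    var_moment = tau01 * (1.0 - tau01)     # n * var(R_01 / n + pi0)
    var_proportion = pi * (1.0 - pi)       # n * var(R_{*1} / n)
    return var_moment / var_proportion


pi, pi0 = 0.03, 0.0131                     # hypothetical values close to the Austrian data
ratio = variance_ratio(pi, pi0)
sample_size_multiplier = 1.0 / ratio       # extra data the sample proportion would need

print(f"variance ratio         = {ratio:.3f}")
print(f"sample size multiplier = {sample_size_multiplier:.2f}")
```

For these values the simple sample proportion would need roughly 1.75 times as many observations to reach the same variance, and the gain grows as $\pi_0$ approaches $\pi$.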

It would be straightforward to adapt the estimators to the case of weighted sampling, with non-random weights, as well as to include explanatory variables in our model, in the same vein as in generalized linear models, by postulating a relationship between the proportion parameter of interest and an array of additional observable characteristics.

Finally, there is some similarity of our approach and that of capture-recapture models (see e.g. Chao et al., 2001 , and the references therein) used to estimate the size of a population. In capture-recapture models several samples are drawn randomly from a population with unknown size. Estimates of the size then, as in our approach, rely on the possibility of participants in a first sample showing up again in a second sample. To see the difference between the two approaches, we can place our framework in the language of capture-recapture models as follows. In our case, the first capture is taken as an incomplete census or a census with participation bias from a population of known size, and not a random sample from a population of unknown size.

All computations presented in this paper were done using the ConditionAl Prevalence Estimation (cape) R package, which can be downloaded from https://github.com/stephaneguerrier/cape. Installation instructions as well as a user guide (vignette) for the package are provided at https://stephaneguerrier.github.io/cape/. All simulation results (as well as additional ones) can be reproduced, and the simulation script is available on GitHub.

The success probabilities τ j (π) for R j , j = 1, 2, 3, 4, in (2), can be deduced from the following table. There are two fundamental cases X i = 0 and X i = 1, and conditionally on each one of these cases, errors are independently and identically distributed.

We, thus, have

Plugging in $\beta_0 = 1 - \frac{\pi_0 - \alpha_0(1 - \pi)}{\pi}$ and using $\Delta := 1 - (\alpha + \beta)$, we obtain $\tau_{11}(\pi) = \pi\Delta\alpha_0 + (\pi_0 - \alpha_0)(1 - \beta) + \alpha\alpha_0$.

The remaining probabilities τ 10 , τ 01 and τ 00 can be similarly obtained.
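As a numerical check, the four cell probabilities can be computed directly from the two fundamental cases and compared with the closed form for $\tau_{11}(\pi)$. The sketch below relies on assumptions: the first index refers to the official declaration and the second to the survey test (as suggested by $R_{*1}$ and $R_{11}$), $\alpha_0$ and $\beta_0$ are the FP and FN rates of the official procedure, and the function name and parameter values are hypothetical.

```python
def cell_probabilities(pi, pi0, alpha, beta, alpha0):
    """Return {(z, y): probability}, z = official declaration, y = survey test result."""
    # FN rate of the official procedure implied by the known census proportion pi0.
    beta0 = 1.0 - (pi0 - alpha0 * (1.0 - pi)) / pi
    weight = {1: pi, 0: 1.0 - pi}            # P(X = x)
    p_y1 = {1: 1.0 - beta, 0: alpha}         # P(Y = 1 | X = x)
    p_z1 = {1: 1.0 - beta0, 0: alpha0}       # P(Z = 1 | X = x)
    tau = {}
    for z in (0, 1):
        for y in (0, 1):
            # Errors are independent conditionally on the true status X.
            tau[(z, y)] = sum(
                weight[x]
                * (p_z1[x] if z == 1 else 1.0 - p_z1[x])
                * (p_y1[x] if y == 1 else 1.0 - p_y1[x])
                for x in (0, 1)
            )
    return tau


# Hypothetical parameter values.
pi, pi0, alpha, beta, alpha0 = 0.03, 0.0131, 0.01, 0.10, 0.001
tau = cell_probabilities(pi, pi0, alpha, beta, alpha0)

delta = 1.0 - (alpha + beta)
tau11_closed_form = pi * delta * alpha0 + (pi0 - alpha0) * (1.0 - beta) + alpha * alpha0

print(f"tau_11 (direct)      = {tau[(1, 1)]:.7f}")
print(f"tau_11 (closed form) = {tau11_closed_form:.7f}")
print(f"sum of all cells     = {sum(tau.values()):.6f}")
```

The same dictionary gives $\tau_{10}$, $\tau_{01}$ and $\tau_{00}$, whose margins recover $P(Z = 1) = \pi_0$ and $P(Y = 1) = \pi\Delta + \alpha$.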

PROOF: The identifiability of the model is straightforward from (3) and, by the extreme value theorem, we have $E[|\ln p(R|\pi)|] < \infty$, where $p(R|\pi)$ denotes the probability mass function of a multinomial distribution with event probabilities $\tau_i$, $i = 1, 2, 3, 4$, as defined in (3). Therefore, by applying the information inequality (see e.g. Lemma 2.2 of Newey and McFadden, 1994), we can verify the identification of $\hat{\pi}$. By combining the compactness of $\Pi$, the (uniform) law of large numbers and/or Theorem 2.1 of Newey and McFadden (1994), $\hat{\pi}$ is a consistent estimator for $p_0$. Then, if $p_0 \in (\pi, 1)$, standard techniques can be used to show that $\sqrt{n}(\hat{\pi} - p_0)$

Finally, we verify that Assumption A guarantees that $I(\pi)$ exists and is finite. Indeed, none of the equations:

have a solution in (π, 1), which concludes the proof.

A possibly more efficient and closed form estimator can be obtained by choosing a weighted sum of the $\tilde{\pi}^{(l)}$, with weights summing to one, to obtain an unbiased estimator with a smaller variance. Indeed, let for example $\omega_{ll} = \gamma_l$, $l = 1, 2, 3$, and 0 otherwise, such that

and $\gamma = [\gamma_l]_{l=1,2,3}$, with $\sum_{l=1}^{3} \gamma_l = 1$, we can choose $\gamma$ to solve $\min_{\gamma} \operatorname{var}(\tilde{\pi}(\gamma))$.

The fourth term $l = 4$ is omitted as it does not provide additional information, since we have that $\sum_{i=0}^{1}\sum_{j=0}^{1} R_{ij} = n$. As is shown below, we have that

One can see that the weight $\gamma_3$ is the most important, as $\alpha_0$ is usually very small, see Remark A. Unfortunately, the weights $\gamma$ depend on $\pi$, so that one needs to plug in a value. This could be chosen as being the one provided by $\tilde{\pi}$ in (11), which is a consistent estimator of $\pi$. Nevertheless, the finite sample distribution of $\tilde{\pi}(\gamma)$ in (18) is unknown, so that one would need to resort to asymptotic theory, and this would not bring any advantage, in terms of inference, compared to the MLE.

To obtain (19), we first develop (14) using (3) to obtain

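Letting $\tau_j := \tau_j(\pi)$, $j = 1, \ldots, 4$, the variance of the GMM $\tilde{\pi}$ in (18), using the properties of the multinomial distribution, is given by

$$\operatorname{var}(\tilde{\pi}) = \frac{\gamma_1^2}{n\Delta^2\alpha_0^2}\,\tau_{11}(1 - \tau_{11}) + \frac{\gamma_2^2}{n\Delta^2\alpha_0^2}\,\tau_{10}(1 - \tau_{10}) + \frac{\gamma_3^2}{n\Delta^2(1 - \alpha_0)^2}\,\tau_{01}(1 - \tau_{01}) + 2\frac{\gamma_1\gamma_2}{n\Delta^2\alpha_0^2}\,\tau_{11}\tau_{10} - 2\frac{\gamma_1\gamma_3}{n\Delta^2\alpha_0(1 - \alpha_0)}\,\tau_{11}\tau_{01} + 2\frac{\gamma_2\gamma_3}{n\Delta^2\alpha_0(1 - \alpha_0)}\,\tau_{10}\tau_{01}.$$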

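Minimizing the variance subject to $\sum_j \gamma_j = 1$ is then equivalent to minimizing

$$H(\gamma) = \frac{\gamma_1^2}{\alpha_0^2}\,\tau_{11}(1 - \tau_{11}) + \frac{\gamma_2^2}{\alpha_0^2}\,\tau_{10}(1 - \tau_{10}) + \frac{\gamma_3^2}{(1 - \alpha_0)^2}\,\tau_{01}(1 - \tau_{01}) + 2\frac{\gamma_1\gamma_2}{\alpha_0^2}\,\tau_{11}\tau_{10} - 2\frac{\gamma_1\gamma_3}{\alpha_0(1 - \alpha_0)}\,\tau_{11}\tau_{01} + 2\frac{\gamma_2\gamma_3}{\alpha_0(1 - \alpha_0)}\,\tau_{10}\tau_{01} - \lambda(\gamma_1 + \gamma_2 + \gamma_3 - 1).$$

The first order conditions for minimality are then given by

$$\frac{\partial H}{\partial \gamma_1} = \frac{2\gamma_1}{\alpha_0^2}\,\tau_{11}(1 - \tau_{11}) + \frac{2\gamma_2}{\alpha_0^2}\,\tau_{11}\tau_{10} - \frac{2\gamma_3}{\alpha_0(1 - \alpha_0)}\,\tau_{11}\tau_{01} - \lambda = 0,$$

$$\frac{\partial H}{\partial \gamma_2} = \frac{2\gamma_2}{\alpha_0^2}\,\tau_{10}(1 - \tau_{10}) + \frac{2\gamma_1}{\alpha_0^2}\,\tau_{11}\tau_{10} + \frac{2\gamma_3}{\alpha_0(1 - \alpha_0)}\,\tau_{10}\tau_{01} - \lambda = 0,$$

$$\frac{\partial H}{\partial \gamma_3} = \frac{2\gamma_3}{(1 - \alpha_0)^2}\,\tau_{01}(1 - \tau_{01}) - \frac{2\gamma_1}{\alpha_0(1 - \alpha_0)}\,\tau_{11}\tau_{01} + \frac{2\gamma_2}{\alpha_0(1 - \alpha_0)}\,\tau_{10}\tau_{01} - \lambda = 0,$$

which can be simplified as

$$\frac{\gamma_1}{\alpha_0}(1 - \tau_{11}) + \frac{\gamma_2}{\alpha_0}\,\tau_{10} - \frac{\gamma_3}{1 - \alpha_0}\,\tau_{01} = \frac{\lambda\alpha_0}{2\tau_{11}}, \tag{20}$$

$$\frac{\gamma_2}{\alpha_0}(1 - \tau_{10}) + \frac{\gamma_1}{\alpha_0}\,\tau_{11} + \frac{\gamma_3}{1 - \alpha_0}\,\tau_{01} = \frac{\lambda\alpha_0}{2\tau_{10}}, \tag{21}$$

$$\frac{\gamma_3}{1 - \alpha_0}(1 - \tau_{01}) - \frac{\gamma_1}{\alpha_0}\,\tau_{11} + \frac{\gamma_2}{\alpha_0}\,\tau_{10} = \frac{\lambda(1 - \alpha_0)}{2\tau_{01}}. \tag{22}$$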

Using (20) in (21) to simplify for γ 3 yields

Similarly using (20) in (22) leads to

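Then, from (23) and (24), knowing that $\sum_{i=0}^{1}\sum_{j=0}^{1} \tau_{ij} = 1$, we obtain $\frac{\gamma_1}{\alpha_0}\,\tau_{00} = \frac{\lambda}{2}\left(1 + \frac{\alpha_0}{\tau_{11}}(\tau_{00} - \tau_{11})\right)$, which leads to $\gamma_1$ in (19). Using $\gamma_1$ in e.g. (23), we obtain $\gamma_2$ in (19), and finally $\gamma_3$ is deduced as in (19).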
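The constrained minimization can also be checked numerically. The sketch below solves the three first-order conditions as a linear system, $2M\gamma = \lambda\mathbf{1}$ with $\sum_l \gamma_l = 1$, where $M$ collects the coefficients of $H(\gamma)$, and then verifies the relation between $\gamma_1$, $\lambda$ and $\tau_{00}$ stated above. The parameter values, the function names and the small linear solver are hypothetical; the cell probabilities $\tau_{10}$ and $\tau_{01}$ are obtained from $\tau_{11}$ and the margins $P(Z = 1) = \pi_0$ and $P(Y = 1) = \pi\Delta + \alpha$.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(size):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][size] / m[i][i] for i in range(size)]


def quadratic_form(m, g):
    """Return g' M g, which is n * Delta^2 times the variance of the weighted estimator."""
    return sum(g[i] * m[i][j] * g[j] for i in range(3) for j in range(3))


# Hypothetical parameter values with all cell probabilities strictly positive.
pi, pi0, alpha, beta, alpha0 = 0.30, 0.20, 0.05, 0.10, 0.05
delta = 1.0 - (alpha + beta)
t11 = pi * delta * alpha0 + (pi0 - alpha0) * (1.0 - beta) + alpha * alpha0
t10 = pi0 - t11                     # P(Z = 1) = pi0
t01 = pi * delta + alpha - t11      # P(Y = 1) = pi * Delta + alpha
t00 = 1.0 - t11 - t10 - t01

a0, b0 = alpha0, 1.0 - alpha0
M = [
    [t11 * (1 - t11) / a0**2, t11 * t10 / a0**2, -t11 * t01 / (a0 * b0)],
    [t11 * t10 / a0**2, t10 * (1 - t10) / a0**2, t10 * t01 / (a0 * b0)],
    [-t11 * t01 / (a0 * b0), t10 * t01 / (a0 * b0), t01 * (1 - t01) / b0**2],
]

u = solve(M, [1.0, 1.0, 1.0])       # u = M^{-1} 1
half_lambda = 1.0 / sum(u)          # lambda / 2, fixed by the constraint sum(gamma) = 1
gamma = [half_lambda * v for v in u]

lhs = gamma[0] / a0 * t00
rhs = half_lambda * (1.0 + a0 / t11 * (t00 - t11))

print("gamma =", [round(g, 4) for g in gamma])
print(f"relation for gamma_1: lhs = {lhs:.8f}, rhs = {rhs:.8f}")
print(f"objective at optimum = {quadratic_form(M, gamma):.6f}, "
      f"using only R_01 = {quadratic_form(M, [0.0, 0.0, 1.0]):.6f}")
```

In terms of the simplified conditions, adding the first two cancels the $\gamma_3$ terms and subtracting the first from the third cancels the $\gamma_2$ terms, which is one way to arrive at the relation for $\gamma_1$ that the code verifies.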

COVID-19 antibody seroprevalence in

Estimating seroprevalence with imperfect serological tests: a cutoff-free approach

Interval estimation for a binomial proportion

The applications of capture-recapture models to epidemiological data

The use of confidence or fiducial limits illustrated in the case of the binomial

Estimating prevalence using an imperfect test

The fiducial argument in statistical inference

Large sample properties of generalized method of moments estimators

Serology-based tests for COVID-19

A tutorial in estimating the prevalence of disease in humans and animals in the absence of a gold standard diagnostic

Estimating the COVID-19 infection rate: Anatomy of an inference problem

Prior precision, prior accuracy, and the estimation of disease prevalence using imperfect diagnostic tests

Misclassification in binary choice models

Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship

Large sample estimation and hypothesis testing

Comparing external and internal validation methods in correcting outcome misclassification bias in logistic regression: A simulation study and application to the case of postsurgical venous thromboembolism following total hip and knee arthroplasty

Estimation of the asymptomatic ratio of novel coronavirus infections (COVID-19)

Prävalenz von SARS-CoV-2-Infektionen liegt bei 3,1%

Repeated seroprevalence of anti-SARS-CoV-2 IgG antibodies in a population

False-positive covid-19 results: hidden problems and costs

Epidemiology: Study Design and Data Analysis

Stéphane Guerrier is partially supported by Swiss National Science Foundation grant #176843 and Innosuisse-Boomerang Grant 37308.1 IP-ENG. Maria-Pia Victoria-Feser is partially supported by a Swiss National Science Foundation grant #182684. We are grateful to Michael Greinecker, Helmut Kuzmics, Hans Manner, Michael Richter, Michael Scholz and Dominique-Laurent Couturier for helpful comments and suggestions.