title: Combining Population and Study Data for Inference on Event Rates
authors: Rothe, Christoph
date: 2020-05-14

This note considers the problem of conducting statistical inference on the share of individuals in some subgroup of a population that experience some event. The specific complication is that the size of the subgroup needs to be estimated, whereas the number of individuals that experience the event is known. The problem is motivated by the recent study of Streeck et al. (2020), who estimate the infection fatality rate (IFR) of SARS-CoV-2 infection in a German town that experienced a super-spreading event in mid-February 2020. In their case the subgroup of interest consists of all infected individuals, and the event is death caused by the infection. We clarify issues with the precise definition of the target parameter in this context, and propose confidence intervals (CIs) based on classical statistical principles that result in good coverage properties.

In a recent study, Streeck et al. (2020) estimate the infection fatality rate (IFR) of SARS-CoV-2 infection in a German town that experienced a super-spreading event in mid-February 2020. The study features prominently in Germany's current political discussion, and has been covered extensively by major German and international news outlets. Several newspaper articles, however, have questioned whether the study reports an accurate confidence interval (CI) for its IFR estimate.

To explain the issue, consider a stylized version of the setup in Streeck et al. (2020). There is a population of total size $N_T$, in which $N_I$ individuals are infected, and $N_D$ individuals have died from the infection. The values $N_T$ and $N_D$ are known from administrative records, but $N_I$ is not directly observed.
Instead, the researcher collects a random sample of $N_S$ individuals, and observes that $N_P$ of them test positive for the disease. If the test is always accurate, the IFR can then be estimated by

$$\hat\theta = \frac{N_D}{\hat N_I}, \quad \text{where} \quad \hat N_I = N_T \cdot \frac{N_P}{N_S}$$

is an estimate of the number of infected units in the population.

Now, the CI for the IFR reported in Streeck et al. (2020) only takes the sampling uncertainty about $N_I$ into account, but treats the number of deaths $N_D$ as fixed. The question is whether doing so is appropriate, or if $N_D$ should be treated as random. We argue that the answer depends on whether $\hat\theta$ is interpreted as an estimate of the IFR among the $N_I$ infected individuals, or an estimate of the IFR among all $N_T$ members of the population.

To clarify this point, we postulate the existence of vectors $D = (D_1, \ldots, D_{N_T})$, $I = (I_1, \ldots, I_{N_T})$ and $S = (S_1, \ldots, S_{N_T})$, with $D_j \in \{0, 1\}$ an indicator for the (possibly counterfactual) event that the $j$th individual in the population would have died in the study period if s/he had been infected with SARS-CoV-2, $I_j \in \{0, 1\}$ an indicator for the $j$th individual actually being infected, and $S_j \in \{0, 1\}$ an indicator for the $j$th individual being included in the sample. These indicators are in principle unobserved, and such that

$$N_D = \sum_{j=1}^{N_T} D_j I_j, \quad N_I = \sum_{j=1}^{N_T} I_j, \quad N_S = \sum_{j=1}^{N_T} S_j, \quad N_P = \sum_{j=1}^{N_T} S_j I_j.$$

In addition, $\sum_{j=1}^{N_T} D_j$ is the counterfactual number of deaths one would have observed if the entire population had been infected at the time of the study. We consider $D$ to be a fixed feature of the population, and both $S$ and $I$ to be random vectors whose distribution is determined by the sampling design used in the study and the process that governs the spread of the infection, respectively. This means that $N_I$ and $N_D$ are also random, through their dependence on $I$.
There are then two plausible candidates for the parameter of interest: the IFR among the individuals that were infected at the time of the study, given by

$$\theta_1 = \frac{N_D}{N_I},$$

and the IFR for the entire population, given by

$$\theta_2 = \frac{1}{N_T} \sum_{j=1}^{N_T} D_j.$$

Now consider a CI that only accounts for the uncertainty in $\hat\theta$ through its dependence on $\hat N_I$, which can be obtained by scaling a conventional $(1 - \alpha)$ CI $[L_\alpha; U_\alpha]$ for the proportion of infected $N_I/N_T$:

$$C^\alpha_1 = \left[ \frac{N_D}{N_T U_\alpha};\ \frac{N_D}{N_T L_\alpha} \right].$$

This type of CI is reported in Streeck et al. (2020), and it is easily seen to have correct coverage for $\theta_1$ conditional on $I$, and therefore it must also have correct coverage unconditionally:

$$P(\theta_1 \in C^\alpha_1 \mid I) \ge 1 - \alpha \quad \Rightarrow \quad P(\theta_1 \in C^\alpha_1) \ge 1 - \alpha.$$

In that sense, the CI in Streeck et al. (2020) is not wrong, but it is a CI for a very particular target parameter. In general, inference on $\theta_2$ is going to be more practically relevant since IFR estimates are typically used to design policy measures that affect the entire population. The CI $C^\alpha_1$ clearly does not have correct coverage for $\theta_2$ though, with or without conditioning on $I$. Intuitively, an appropriate CI for $\theta_2$ should be wider than $C^\alpha_1$, but it is not immediately obvious how such a CI should be constructed.

In the remainder of this note, we propose two approaches that both result in good coverage properties. To avoid modeling the number of infections, we seek CIs $C^\alpha_2$ that are valid conditional on $N_I$; any CI with approximately correct conditional coverage must again also have approximately correct unconditional coverage. Note that the distinction between $\theta_1$ and $\theta_2$ is similar in spirit to that of sampling-based and design-based uncertainty in Abadie et al. (2020), but the details of their framework are very different from ours.

We impose the following assumptions for our analysis.

Assumption 1. The sampling and infection indicators are independent conditional on $N_I$: $S \perp I \mid N_I$.

Assumption 2. The infection status of each individual is as good as randomly assigned conditional on $N_I$, in the sense that for all $N_T$-vectors $i = (i_1, \ldots, i_{N_T})$ of dummy variables with $\sum_{j=1}^{N_T} i_j = N_I$ we have that

$$P(I = i \mid N_I) = \binom{N_T}{N_I}^{-1}.$$

Assumption 3. The individuals included in the study sample are determined by simple random sampling independently of $N_I$, in the sense that for all $N_T$-vectors $s = (s_1, \ldots, s_{N_T})$ of dummy variables with $\sum_{j=1}^{N_T} s_j = N_S$ we have that

$$P(S = s \mid N_I) = \binom{N_T}{N_S}^{-1}.$$

Assumption 1 is natural, and likely to hold even unconditionally. It would be violated, for example, if individuals with knowledge of their infection status are more or less likely to participate in the study. Assumption 2 implies that the individuals infected at the time of the study are representative of the entire population. This rules out, for example, different age groups being affected more or less severely over the course of the pandemic. Note that the "success" probability $N_I/N_T$ with which a sampled individual tests positive can be changed to accommodate infection testing with less than 100% sensitivity and specificity. Assumption 3 can easily be adapted if the sample of $N_S$ individuals is obtained through a different sampling scheme, such as cluster sampling.

Note that an equivalent definition of $\theta_2$ under the above assumptions is given by

$$\theta_2 = E(\theta_1 \mid N_I),$$

so that this parameter can be interpreted as the "average" IFR, where the averaging is done with respect to the distribution of $I$. This representation also makes it more apparent that $\hat\theta$ is actually a suitable estimate of $\theta_2$.

Since $\hat\theta$ depends on $S$ and $I$ through $N_P$ and $N_D$ only, it is also useful to state the implications of the above assumptions for the joint distribution of the latter two quantities conditional on $N_I$. Simple calculations show that this joint conditional distribution corresponds to two independent binomials:

$$N_P \mid N_I \sim \text{Binomial}(N_S, N_I/N_T), \qquad N_D \mid N_I \sim \text{Binomial}(N_I, \theta_2).$$

These distributions should be kept in mind for the following arguments.

Consider a test of the null hypothesis $H_0: \theta_2 = \theta_o$ that uses the estimated IFR $\hat\theta$ as the test statistic. We propose to construct $(1 - \alpha)$ CIs for $\theta_2$ by collecting all values of $\theta_o$ for which the p-value of such a test is at least $\alpha$, that is, all values that the test does not reject.
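The two independent binomials stated above can be simulated directly to see why treating the number of deaths as fixed understates the spread of the estimated IFR. The following is a minimal sketch, not part of the original analysis: the "true" values `N_I = 1892` and `THETA_2 = 0.0037` are hypothetical choices made for illustration, and `draw_binomial` and `simulate` are names introduced here.

```python
import random
import statistics

# Population and sample sizes reported in the text (Streeck et al., 2020).
N_T, N_S = 12_597, 919

# Hypothetical "true" values, chosen for illustration only.
N_I = 1892        # number of infected individuals, treated as known here
THETA_2 = 0.0037  # population IFR


def draw_binomial(rng, n, p):
    """One Binomial(n, p) draw as a sum of n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))


def simulate(reps=1000, seed=0):
    """Draw (N_P, N_D) repeatedly and return two lists of IFR estimates:
    one with random deaths, one with deaths held fixed at their mean."""
    rng = random.Random(seed)
    both, deaths_fixed = [], []
    expected_deaths = N_I * THETA_2
    for _ in range(reps):
        n_p = draw_binomial(rng, N_S, N_I / N_T)  # positives in the sample
        n_d = draw_binomial(rng, N_I, THETA_2)    # deaths among the infected
        if n_p == 0:
            continue  # estimate undefined without any positive test
        n_i_hat = N_T * n_p / N_S                 # estimated number infected
        both.append(n_d / n_i_hat)
        deaths_fixed.append(expected_deaths / n_i_hat)
    return both, deaths_fixed


both, deaths_fixed = simulate()
print("mean of estimates:", statistics.mean(both))
print("sd, deaths random:", statistics.stdev(both))
print("sd, deaths fixed: ", statistics.stdev(deaths_fixed))
```

Under these assumed values, most of the variability of the estimate comes from the number of deaths rather than from the sample share of positive tests, which is consistent with the argument that a CI for the population IFR should be wider than one that only reflects sampling uncertainty.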
With conditioning on $N_I$, the number of infections effectively becomes a nuisance parameter in this testing problem; and since $N_I$ is unknown, no exact p-value is feasible in this setup. However, we can still use existing statistical approaches to obtain CIs with good coverage properties. We specifically consider one based on the parametric bootstrap, and one based on varying $N_I$ over a "large" preliminary CI.

To describe these two approaches in our context, we introduce some notation. For constants $n_I$ and $\theta_o$, let $N^*_P$ and $N^*_D$ be independent random variables that each follow particular binomial distributions that only depend on the constants and other observable quantities:

$$N^*_P \sim \text{Binomial}(N_S, n_I/N_T), \qquad N^*_D \sim \text{Binomial}(n_I, \theta_o).$$

We also put $\hat N^*_I = N_T N^*_P / N_S$, and denote the CDF of the ratio $N^*_D / \hat N^*_I$ by

$$G(c \mid n_I, \theta_o) = P\left( N^*_D / \hat N^*_I \le c \right).$$

There is no simple closed form expression for this distribution function, but it can easily be computed through standard numerical methods for any value of the constants $n_I$ and $\theta_o$. For example, one can compute $G(c \mid n_I, \theta_o)$ to desired accuracy by simulating a sufficiently large number of draws from the distribution of $(N^*_P, N^*_D)$, and then taking the empirical CDF of the resulting realizations of $N^*_D / \hat N^*_I$.

The function $G(c \mid N_I, \theta_2)$ is the CDF of $\hat\theta$ conditional on $N_I$ under the statistical model described above, and $G(c \mid N_I, \theta_o)$ is the CDF under $H_0: \theta_2 = \theta_o$. If $N_I$ were observed, an equal-tailed p-value for a test of $H_0$ based on $\hat\theta$ would be given by

$$p(\theta_o, N_I) = 2 \min\left\{ G(\hat\theta \mid N_I, \theta_o),\ 1 - G(\hat\theta{-} \mid N_I, \theta_o) \right\},$$

where $G(c{-} \mid \cdot)$ denotes the left limit of $G$ at $c$, so that the second term is the probability of a ratio at least as large as $\hat\theta$. Using a "plug-in" or parametric bootstrap approach (e.g. Horowitz, 2001; Hall, 2013), we can substitute the estimator $\hat N_I$ into the p-value formula to construct a feasible CI for $\theta_2$:

$$C^\alpha_{2,PB} = \left\{ \theta_o : p(\theta_o, \hat N_I) \ge \alpha \right\}.$$

This CI is easily seen to have correct asymptotic coverage of $\theta_2$ conditional on $N_I$ under any sequence for which $\hat N_I / N_I = 1 + o_P(1)$. That is, it holds that $P(\theta_2 \in C^\alpha_{2,PB} \mid N_I) = 1 - \alpha + o_P(1)$ if $\hat N_I / N_I = 1 + o_P(1)$.
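The plug-in construction described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the author's implementation: it sums the two binomial distributions exactly instead of simulating them, it takes the equal-tailed p-value to be twice the smaller of the two tail probabilities (capped at one), it rounds the estimated number of infections to an integer so that it can serve as a binomial sample size, and it inverts the test over a coarse grid of candidate IFR values. The function names `binom_pmf`, `binom_cdf`, `p_value` and `bootstrap_ci` are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial(n, p) mass at k (0 <= k <= n, 0 < p < 1), via log-gamma."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return min(1.0, sum(binom_pmf(j, n, p) for j in range(k + 1)))


def p_value(theta_o, n_i, n_d, n_p, n_s, n_t):
    """Equal-tailed p-value 2 * min{P(T* <= t), P(T* >= t)}, capped at 1,
    where T* = N_D* / (N_T * N_P* / N_S) and t is the observed ratio."""
    lower = upper = 0.0
    for p_star in range(n_s + 1):
        w = binom_pmf(p_star, n_s, n_i / n_t)  # P(N_P* = p_star)
        if w < 1e-14:
            continue  # negligible contribution to either tail
        if p_star == 0:
            upper += w  # ratio treated as +infinity when no one tests positive
            continue
        # T* <= t  <=>  N_D* * n_p <= n_d * p_star  (N_T and N_S cancel),
        # so integer arithmetic avoids floating-point ties.
        k_le = (n_d * p_star) // n_p       # floor(n_d * p_star / n_p)
        k_ge = -((-n_d * p_star) // n_p)   # ceil(n_d * p_star / n_p)
        lower += w * binom_cdf(k_le, n_i, theta_o)
        upper += w * (1.0 - binom_cdf(k_ge - 1, n_i, theta_o))
    return min(1.0, 2.0 * min(lower, upper))


def bootstrap_ci(n_d, n_p, n_s, n_t, alpha=0.05, grid=None):
    """Collect grid values of theta_o that are not rejected at level alpha."""
    n_i_hat = round(n_t * n_p / n_s)  # rounded plug-in estimate of N_I
    grid = grid or [k * 2e-4 for k in range(1, 101)]  # 0.02% to 2%
    accepted = [th for th in grid
                if p_value(th, n_i_hat, n_d, n_p, n_s, n_t) >= alpha]
    return min(accepted), max(accepted)


ci_pb = bootstrap_ci(n_d=7, n_p=138, n_s=919, n_t=12_597)
print(ci_pb)
```

The endpoints returned here are only as fine as the grid, and they need not match the interval reported in the original note, since the exact p-value convention used there is an assumption of this sketch.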
If the sample size $N_S$ is rather large, it can be reasonable to treat $\hat N_I$ as a consistent estimate of $N_I$, in which case the above result implies that $C^\alpha_{2,PB}$ has approximately correct finite sample coverage of $\theta_2$.

If the goal is to have a CI with guaranteed finite sample coverage, a different method can be used to compute a p-value. Let $[L_\beta; U_\beta]$ be a standard $(1 - \beta)$ Clopper-Pearson CI for the share $N_I/N_T$ of infected individuals in the population, so that $C_\beta = [N_T L_\beta; N_T U_\beta]$ is a $(1 - \beta)$ CI for the number of infections $N_I$, for some $\beta$ substantially smaller than $\alpha$. We can then obtain a new p-value by maximizing $p(\theta_o, n_I)$ over $n_I \in C_\beta$, and correcting the result for the fact that $\beta$ is not zero (Berger and Boos, 1994; Silvapulle, 1996). This yields the following CI for $\theta_2$:

$$C^\alpha_{2,BB} = \left\{ \theta_o : \sup_{n_I \in C_\beta} p(\theta_o, n_I) + \beta \ge \alpha \right\}.$$

This CI has conditional coverage of at least $1 - \alpha$ in finite samples:

$$P(\theta_2 \in C^\alpha_{2,BB} \mid N_I) \ge 1 - \alpha.$$

The CI is conservative, however, in that this inequality is generally strict. Exact coverage only occurs in the unlikely scenario that the supremum in the definition of the p-value is attained at $N_I$, which happens only if $N_I$ coincides with one of the boundaries of $C_\beta$.

We illustrate the methods described above with numerical values taken from Streeck et al. (2020). The town investigated in that study has $N_T = 12{,}597$ inhabitants, of which $N_D = 7$ died in the study period with a SARS-CoV-2 infection. Out of a sample of $N_S = 919$ individuals, $N_P = 138$ tested positive for SARS-CoV-2. This corresponds to an infection rate of $N_P/N_S = 15.0\%$ in the sample, an estimated $\hat N_I = 1892$ infected individuals in the population, and an estimated IFR of $\hat\theta = 0.37\%$. Setting $\alpha = .05$ and $\beta = .01$, we compute the three CIs $C^\alpha_1$, $C^\alpha_{2,PB}$ and $C^\alpha_{2,BB}$. Recall that the first of these CIs has $\theta_1$ as the target parameter, while the latter two aim for coverage of $\theta_2$. As expected, the latter two CIs are substantially wider than the first.
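The point estimates in this illustration, and the Clopper-Pearson ingredients of the intervals, can be reproduced with a short script. This is a sketch under assumptions: the text only says that the first CI scales a "conventional" CI for the infection share, and Clopper-Pearson is used for it here as one possible choice; the interval limits are found by bisection on the binomial CDF, and `binom_cdf`, `bisect` and `clopper_pearson` are helper names introduced for this example.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, via log-gamma."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    total = 0.0
    for j in range(k + 1):
        total += math.exp(
            math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(p) + (n - j) * math.log1p(-p))
    return min(1.0, total)


def bisect(f, lo, hi, iters=60):
    """Root of an increasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def clopper_pearson(x, n, level):
    """Two-sided Clopper-Pearson interval for a proportion (0 < x < n)."""
    tail = (1.0 - level) / 2.0
    # Lower limit solves P(X >= x | p) = tail; the left side increases in p.
    lower = bisect(lambda p: (1.0 - binom_cdf(x - 1, n, p)) - tail,
                   1e-12, x / n)
    # Upper limit solves P(X <= x | p) = tail; the left side decreases in p.
    upper = bisect(lambda p: tail - binom_cdf(x, n, p),
                   x / n, 1.0 - 1e-12)
    return lower, upper


# Values reported in the text (Streeck et al., 2020).
N_T, N_D, N_S, N_P = 12_597, 7, 919, 138
ALPHA, BETA = 0.05, 0.01

n_i_hat = N_T * N_P / N_S      # estimated number of infected individuals
theta_hat = N_D / n_i_hat      # estimated IFR

# CI for theta_1: scale the CI for the infection share, deaths held fixed.
l_a, u_a = clopper_pearson(N_P, N_S, 1 - ALPHA)
c1 = (N_D / (N_T * u_a), N_D / (N_T * l_a))

# Preliminary (1 - beta) CI for N_I over which the p-value is maximized.
l_b, u_b = clopper_pearson(N_P, N_S, 1 - BETA)
c_beta = (N_T * l_b, N_T * u_b)

print(f"N_I_hat = {n_i_hat:.0f}, theta_hat = {theta_hat:.2%}")
# prints: N_I_hat = 1892, theta_hat = 0.37%
print("C_1    =", c1)
print("C_beta =", c_beta)
```

Because the preliminary interval `c_beta` uses a smaller error level than `c1`, it is wider by construction, which is what makes the maximized p-value approach conservative.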
We would argue that they are also more appropriate measures of uncertainty about the IFR estimate, since this quantity is used to design policy measures that affect the entire population.

We note that Streeck et al. (2020) report an estimated 1,956 infected individuals, an IFR of 0.36%, and a CI for the IFR of [0.29%; 0.45%]. These results differ from the $\hat N_I$, $\hat\theta$ and $C^\alpha_1$ given above for two reasons: first, Streeck et al. (2020) apply an adjustment factor to the raw infection rate in their sample to account for the sensitivity and specificity of their test for SARS-CoV-2 infection; and second, their sample is generated through a form of cluster sampling, which leads to a slightly wider CI relative to simple random sampling. Such adjustments should also slightly widen our CIs for $\theta_2$.

While this note is motivated by research on the current SARS-CoV-2 pandemic, the CIs proposed here could also be used in other contexts in which researchers want to combine sample and population data in a similar fashion. To give an economic example, suppose that there is a group of individuals that qualify for benefits from some public program, and that the researcher is interested in the share of these individuals that actually receive benefits (this share could be small if the program is not well-known, difficult to apply for, or comes with social stigma). This then fits into the framework of this note if the number of benefit recipients is known to administrators, but the number of qualifying individuals needs to be estimated from survey data.

References

Abadie et al. (2020). Sampling-based vs. design-based uncertainty in regression analysis.
Berger and Boos (1994). P values maximized over a confidence set for the nuisance parameter.
Hall (2013). The Bootstrap and Edgeworth Expansion.
Horowitz (2001). The bootstrap.
Silvapulle (1996). A test in the presence of nuisance parameters.
Streeck et al. (2020). Infection fatality rate of SARS-CoV-2 infection in a German community with a super-spreading event.