key: cord-0103459-tnr625zk
title: Bayesian adjustment for preferential testing in estimating the COVID-19 infection fatality rate: Theory and methods
authors: Campbell, Harlan; Valpine, Perry de; Maxwell, Lauren; Jong, Valentijn MT de; Debray, Thomas; Janisch, Thomas; Gustafson, Paul
date: 2020-05-18
sha: 5073c29dac4746d544bcb18d45f989db731fc7c8
doc_id: 103459
cord_uid: tnr625zk

Abstract: A key challenge in estimating the infection fatality rate (IFR) of COVID-19 is determining the total number of cases. The total number of cases is not known because not everyone is tested but also, more importantly, because tested individuals are not representative of the population at large. We refer to the phenomenon whereby infected individuals are more likely to be tested than non-infected individuals as "preferential testing." An open question is whether or not it is possible to reliably estimate the IFR without any specific knowledge about the degree to which the data are biased by preferential testing. In this paper we take a partial identifiability approach, formulating clearly where deliberate prior assumptions can be made and presenting a Bayesian model which pools information from different samples. Results suggest that when limited knowledge is available about the magnitude of preferential testing, reliable estimation of the IFR is still possible so long as there is sufficient "heterogeneity of bias" across samples.

If someone is infected with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the pathogen that causes COVID-19, how likely is that person to die of COVID-19? This simple question is surprisingly difficult to answer. The "case fatality rate" (CFR) is a common measure that quantifies the mortality risk in a certain population, and is given by the ratio of deaths (D) over confirmed cases (CC) during a specific time period.
However, because many COVID-19 cases are never diagnosed, the CFR almost certainly overestimates the true lethality of the virus. A better answer is captured by the infection fatality rate (IFR) (Wong et al., 2013; Kobayashi et al., 2020). The IFR, also a simple ratio, differs from the CFR by counting all cases, including asymptomatic, undetected and misdiagnosed infections, in the denominator. For instance, if 20 individuals die of the disease in a population with 1000 infections, then the IFR is 20/1000 = 0.02 = 2%. Evidently, a key challenge in calculating the IFR is determining the true total number of cases. The total number of cases (C) is not known because not everyone in the population (P) is tested. A naïve estimate of the IFR might take this into account by simply considering the number of tests (T) and estimating the number of cases as C ≈ (CC/T) × P. However, diagnostic tests are often selectively initiated, such that tested individuals are not representative of the population at large. In most countries/jurisdictions, those with classic COVID-19 symptoms (e.g., fever, dry cough, loss of smell or taste) are much more likely to be tested than those without symptoms. Due to this severity bias, the reported number of cases likely consists mostly of people whose symptoms were severe enough to prompt testing and excludes the vast majority of those who are mildly symptomatic or asymptomatic. Even when testing is made equally available to all individuals (e.g., Bendavid et al. (2020)), there is potential for "selection bias" if people who have reason to believe they are infected are more likely to volunteer to be tested. We refer to the phenomenon whereby infected individuals are more likely to be tested than non-infected individuals as "preferential testing." (Hauser et al.
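To make the two estimators concrete, the following sketch computes the CFR and the naïve IFR estimate described above; all numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
# Hypothetical counts for one jurisdiction (illustration only).
P = 1_000_000   # population size
T = 50_000      # tests performed
CC = 5_000      # confirmed cases
D = 100         # deaths

# Case fatality rate: deaths over confirmed cases.
cfr = D / CC

# Naive IFR: scale the test-positive rate up to the whole population,
# implicitly assuming tested individuals are representative.
C_naive = (CC / T) * P
ifr_naive = D / C_naive

print(cfr)        # 0.02
print(C_naive)    # 100000.0
print(ifr_naive)  # 0.001
```

Under preferential testing, CC/T overstates the population infection rate, so C_naive is too large and the naïve IFR is biased downward, while the CFR is biased upward.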
(2020) and others use the term "preferential ascertainment.") If the degree of preferential testing in a particular sample is of known magnitude, bias adjustment can be achieved by appropriately altering the estimated rate of infection and its uncertainty interval. However, the degree of preferential testing is no doubt difficult to ascertain and likely highly variable across different jurisdictions. An open question is whether or not it is possible to reliably estimate the IFR without any specific knowledge about the degree to which the data are biased by preferential testing (Q1). If not, should we rely only on select samples for which testing is representative and ignore the vast majority of available data? (Q2) In this paper, we address these two important questions by considering a simple Bayesian hierarchical model for estimation of the IFR. Bayesian models have been previously used in similar situations. For example, Presanis et al. (2009) conduct Bayesian inference to estimate the severity of pandemic H1N1 influenza; see also Presanis et al. (2011) . More recently, Rinaldi and Paradisi (2020) , and Hauser et al. (2020) employ Bayesian methods for estimating the severity of COVID-19 from early data from China and Italy. In order to address preferential testing bias, Hauser et al. (2020) apply susceptible-exposed-infected-removed (SEIR) compartmental models to age-stratified data and, in order to establish parameter identifiability, assume that all cases of infected patients aged 80 years and older are confirmed cases. The Bayesian model we propose is more general and allows one to obtain appropriate point and interval estimates for the IFR with varying degrees of prior knowledge about the magnitude of preferential testing and the distribution of other explanatory factors (e.g. age, minority status). This paper is structured as follows. In Section 2, we introduce required notation, discuss distributional assumptions and review key issues of identifiability. 
In Section 3, we formulate our Bayesian model and present a small illustrative example. In Section 4, we describe how the model can be scaled to larger populations and present results from a simulation study. In Section 5, we go over potential model extensions as well as model limitations. Finally, we conclude in Section 6 with a discussion of how the model could be used for an analysis of COVID-19 data and return to the questions of interest, Q1 and Q2.

2. Notation, distributions, and issues of (un)identifiability

Let us begin by describing the data and defining some basic notation. Suppose we have data from K groups (i.e., countries or jurisdictions) from a certain fixed period of time. For group k in 1, . . . , K, let:

• P_k be the population size (i.e., the total number of individuals at risk of infection);
• T_k be the total number of people tested;
• CC_k be the total number of confirmed cases resulting from the tests; and
• D_k be the total number of observed deaths attributed to COVID-19 infection.

We do not observe the following latent variables. For the k-th group, let:

• C_k be the total number of infected people (cases) in the population;
• IR_k be the true infection rate (the proportion of the population that is infected), which is the expected value of C_k/P_k; and
• IFR_k be the true underlying infection fatality rate (IFR), which is the expected value of D_k/C_k.

We assume that:

C_k ~ Binomial(P_k, IR_k),    (1)

D_k ~ Binomial(C_k, IFR_k),    (2)

where, in the k-th group, the unknown number of infections, C_k, and the known number of deaths, D_k, each follow a binomial distribution. Note that there are latent variables on both the left-hand side and the right-hand side of (1). Suppose that, for each population, CC_k is recorded instead of C_k. Even in the absence of preferential testing, CC_k will be smaller than C_k because not everyone is tested. In other words, the confirmed cases (CC) are a subset of the total cases (C).
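The two-stage binomial structure of expressions (1)-(2) can be sketched in a few lines of stdlib Python; the group size, infection rate and IFR below are hypothetical values, not data from the paper.

```python
import random

random.seed(1)

def rbinom(n, p):
    # Simple Bernoulli-sum binomial draw (stdlib only).
    return sum(random.random() < p for _ in range(n))

# Hypothetical group: the two-stage model of expressions (1)-(2).
P_k, IR_k, IFR_k = 2000, 0.20, 0.02

C_k = rbinom(P_k, IR_k)    # latent number of infections, C_k ~ Bin(P_k, IR_k)
D_k = rbinom(C_k, IFR_k)   # observed deaths, D_k | C_k ~ Bin(C_k, IFR_k)

print(C_k, D_k)
```

Only D_k (and later CC_k) would actually be observed; C_k is latent, which is the root of the identifiability problem discussed next.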
We assume that the distribution of the confirmed cases depends only on the actual infection rate (C/P) and the testing rate (T/P), and not on the infection fatality rate (D/C) or other information. In other words, we assume that the conditional distribution of (CC|C, T, P, D) is identical to the conditional distribution of (CC|C, T, P). This important assumption is similar to the assumption of "non-differential" exposure misclassification in measurement error models and may or may not be realistic; see De Smedt et al. (2018). For example, COVID-19 is thought to be deadlier amongst the elderly. If this is true, the non-differential preferentiality assumption would fail if elderly individuals were just as likely to be infected as others, yet more likely to be tested. The goal is to draw inference on the relationship between the number of deaths, D, and the number of cases, C, having only data on D, CC, P, and T. This is particularly challenging since the number of confirmed cases in each group may be subject to a certain unknown degree of preferential testing. Let the degree of preferential testing correspond to the non-centrality parameter φ_k, where CC_k follows Wallenius' non-central hyper-geometric distribution with:

CC_k ~ WNCH(T_k, C_k, P_k − C_k, φ_k).    (3)

The hyper-geometric distribution describes the probability of CC_k confirmed cases amongst T_k tests (without any individuals being tested more than once), from a finite population of size P_k that contains exactly C_k cases. The non-central hyper-geometric distribution is a generalization of the hyper-geometric distribution whereby, in this case, testing may be biased so that either cases or non-cases are more likely to be tested. When φ_k > 1, cases (i.e., infected individuals) are more likely to be tested than non-cases (i.e., non-infected individuals); when φ_k < 1, cases are less likely to be tested than non-cases.
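Wallenius' distribution corresponds to sequential sampling without replacement in which each remaining case carries φ times the weight of each remaining non-case. A minimal biased-urn simulation of one draw (with hypothetical population numbers) is:

```python
import random

random.seed(2)

def wallenius_draw(T, C, P, phi):
    """One draw of CC from a Wallenius non-central hyper-geometric scheme:
    T sequential tests without replacement from a population of P that
    contains C cases, where each remaining case has phi times the weight
    of each remaining non-case of being tested next."""
    cases, noncases, cc = C, P - C, 0
    for _ in range(T):
        w_case = phi * cases
        if random.random() * (w_case + noncases) < w_case:
            cases -= 1
            cc += 1
        else:
            noncases -= 1
    return cc

# Hypothetical group: 20% infected, 10% of the population tested.
P, C, T = 2000, 400, 200
cc_random = wallenius_draw(T, C, P, phi=1.0)   # testing at random
cc_pref   = wallenius_draw(T, C, P, phi=10.0)  # strong preferential testing
print(cc_random, cc_pref)
```

With φ = 1 the expected count is T × C/P = 40; with φ = 10 the confirmed-case count is inflated several-fold, which is exactly the bias the model must correct for.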
When φ_k = 1, the probability of being tested is equal for both cases and non-cases, and the non-central hyper-geometric distribution reduces to the standard hyper-geometric distribution. In this parameterization, the φ_k parameter can be interpreted as an odds ratio: the odds of a case being tested vs. the odds of a non-case being tested. Given the assumptions detailed above, for each of the K groups, there are three unknown parameters (latent states), IR_k, IFR_k and φ_k, that must be estimated for every two known values (D_k/P_k and CC_k/T_k). This suggests that a unique solution may not be attainable without additional external data. The problem at hand is sufficiently rich and complex that forming intuition about the information content of the data is challenging. Here we present an asymptotic argument which lays bare the flow of information. Consider a situation in which an infinite amount of data are available. In so-called "asymptotia," we have that populations are approaching infinite size (i.e., for k in 1, . . . , K, we have P_k → ∞), and that the number of tests also approaches infinity (i.e., for k in 1, . . . , K, we have T_k → ∞). Recall that a hyper-geometric distribution is asymptotically equivalent to a binomial distribution. As such, we consider the following limits:

D_k/P_k → a_k = IR_k × IFR_k,

E[CC_k/T_k] → b_k = 1 − (1 − IR_k)^φ_k,

equivalently, cloglog(b_k) = log(φ_k) + cloglog(IR_k). Note that a_k simply follows from the conditional binomial distribution (see expressions (11)-(14) and explanation in Section 4.1). However, this particular parameterization of b_k does not emerge from the limit of the non-central hyper-geometric distribution. As we discuss later in Section 4.1, we have simply chosen a convenient parameterization for b_k, with the important connotation that φ_k = 1 corresponds to testing at random but, as φ_k increases, the testing is more preferentially weighted to those truly infected.
For example, with an infection rate of IR_k = 0.01, the binomial sampling probability b_k is approximately 10 times larger than the infection rate if φ_k = 10, and about 18 times larger if φ_k = 20. Presume that the a priori defensible information about the preferential sampling in the k-th group is expressed in the form

φ_k^L ≤ φ_k ≤ φ_k^U,    (4)

i.e., φ_k^L and φ_k^U are investigator-specified bounds on the degree of preferential sampling for that jurisdiction. If one is certain that cases are at least as likely to be tested as non-cases, φ_k^L = 1 is appropriate. If testing is known to be entirely random for the k-th group, one would set φ_k^L = φ_k^U = 1. Note that for fixed (a_k, b_k), IFR_k is a function of φ_k with the form

IFR_k(φ_k) = a_k / (1 − (1 − b_k)^(1/φ_k)).    (5)

Examining (5), knowledge of (a_k, b_k), in tandem with (4), restricts the set of possible values for IFR_k. In fact, it is easy to verify that (5) is monotone in φ_k, hence the restricted set is an interval. We write this interval as I_k(a_k, b_k, φ_k^L, φ_k^U), or simply as I_k for brevity. This is the jurisdiction-specific identification interval for IFR_k. As we approach asymptotia for the k-th group, all values inside the interval remain plausible, while all values outside are ruled out; see Manski (2003). This is the essence of the partial identification inherent to this problem. Thinking now about the meta-analytic task of combining information, we envision that both φ_k and IR_k could exhibit considerable variation across jurisdictions. However, the variation in IFR could be small, particularly if sufficient jurisdiction-specific covariates are included (see Section 5.1). That is, after adjustment for a jurisdiction's age distribution, healthcare capacity, and so on, residual variation in IFR could be very modest. When modeling, we would invoke such an assumption via a prior distribution. For understanding in asymptotia, however, we simply consider the impact of an a priori bound on the variability in IFR.
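A small sketch, assuming the limiting parameterization b_k = 1 − (1 − IR_k)^φ_k and its inverse IR_k = 1 − (1 − b_k)^(1/φ_k), reproduces the numerical claim above and shows how bounds on φ_k map to an identification interval for IFR_k; the jurisdiction values below are hypothetical.

```python
def b_of(phi, ir):
    # Limiting test-positive proportion: b = 1 - (1 - ir)^phi,
    # i.e., cloglog(b) = log(phi) + cloglog(ir).
    return 1.0 - (1.0 - ir) ** phi

def ifr_of(phi, a, b):
    # Invert the limits: IR = 1 - (1 - b)^(1/phi), then IFR = a / IR.
    return a / (1.0 - (1.0 - b) ** (1.0 / phi))

# The text's example: IR_k = 0.01.
print(round(b_of(10, 0.01) / 0.01, 2))   # ~9.56, "approximately 10 times larger"
print(round(b_of(20, 0.01) / 0.01, 2))   # ~18.21, "about 18 times larger"

# IFR(phi) is monotone in phi, so bounds on phi map to an interval I_k
# of plausible IFR values (hypothetical jurisdiction).
ir_true, ifr_true, phi_true = 0.10, 0.02, 5.0
a = ir_true * ifr_true          # limiting death rate a_k
b = b_of(phi_true, ir_true)     # limiting test-positive rate b_k
phi_lo, phi_hi = 1.0, 10.0      # investigator-specified bounds (4)
I_k = (ifr_of(phi_lo, a, b), ifr_of(phi_hi, a, b))
print(I_k)
```

The true IFR (0.02 here) always lies inside I_k when the bounds contain the true φ_k; the width of I_k quantifies what is lost by not knowing φ_k exactly.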
Let τ be the standard deviation of IFR_k across jurisdictions. Then we presume τ does not exceed an investigator-specified upper bound τ^U, i.e., τ ≤ τ^U. The jurisdiction-specific prior bounds on the extent of preferential sampling, and the prior bound on IFR variation across jurisdictions, along with the limiting signal from the data in the form of (a, b), give rise to an identification region for the average infection fatality rate, IFR_avg = K^(−1) Σ_{k=1}^K IFR_k. Formally, this interval is defined as

{ K^(−1) Σ_{k=1}^K v_k : v_k ∈ I_k for k = 1, . . . , K, and SD(v_1, . . . , v_K) ≤ τ^U }.    (6)

Again, the interpretation is direct: in the asymptotic limit, all values of IFR_avg inside this interval are compatible with the observed data, and all values outside are not. The primary question of interest is whether this interval is narrow or wide under realistic scenarios, since this governs the extent to which we can learn about IFR_avg from the data. In general, evaluating (6) for given inputs is an exercise in quadratic programming nested within a grid search, hence can be handled with standard numerical optimisation. Details of this formulation are given in the Appendix. However, the special case of τ^U = 0 is noteworthy in terms of developing both scientific and mathematical intuition. Consequently, we explore this case in some depth in what follows. Scientifically, τ^U = 0 represents the extreme limit of an a priori assumption that, possibly after covariate adjustment, IFR is a 'biological constant' which does not vary across jurisdictions. If the prospects for inference are not good when this assumption holds, they will be even less good under the less strict assumption that the IFR heterogeneity is small, but not necessarily zero. Mathematically, the case is much simpler, with (6) reducing to the intersection

∩_{k=1}^K I_k = [ max_k inf I_k , min_k sup I_k ].

As can be seen immediately from Figure 1 (left-hand panels), in the present example the binding constraints arise from the first and twelfth jurisdictions, which happen to have the least and the most preferential testing. However, this pattern does not hold in general.
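In the τ = 0 case every jurisdiction shares a common IFR, so the identification region is the intersection of the jurisdiction-specific intervals. A minimal sketch, with hypothetical intervals:

```python
def intersect(intervals):
    """Intersection of closed intervals [(lo, hi), ...]; None if empty."""
    lo = max(l for l, _ in intervals)
    hi = min(h for _, h in intervals)
    return (lo, hi) if lo <= hi else None

# Hypothetical jurisdiction-specific identification intervals I_k for the IFR.
I = [(0.005, 0.039), (0.012, 0.060), (0.015, 0.025)]
print(intersect(I))   # (0.015, 0.025)
```

Jurisdictions with very different degrees of preferential testing can jointly pin the common IFR down tightly even though no single I_k is narrow; a disjoint collection of intervals (empty intersection) would instead falsify the constant-IFR assumption.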
One can easily construct pairs of infection rates for which the jurisdiction with more preferential testing has a smaller upper endpoint for I_k and/or a larger lower endpoint. Thus the values of φ_k alone do not determine which two jurisdictions will provide the binding information about the average IFR. Bayesian models work well for dealing with partially identified models; see Gustafson (2010). We describe a Bayesian model for the IFR which assumes standard Gaussian random effects, allowing both the infection rate (IR) and the infection fatality rate (IFR) to vary between populations, with a complementary log-log link function:

cloglog(IFR_k) ~ Normal(θ, τ²),    (7)

cloglog(IR_k) ~ Normal(β, σ²),    (8)

for k in 1, . . . , K, where θ is the parameter of primary interest, τ² represents the between-group IFR heterogeneity, β represents the mean cloglog infection rate, and σ² is the variance in infection rates across the K groups. A Bayesian model begins by specifying a joint probability distribution. For our unknown parameters of interest (θ, τ², β, σ²), latent variables (C_k, IFR_k, IR_k, and φ_k, for k in 1, . . . , K), and aggregate data from K sources (we require data = {P_k, T_k, CC_k, and D_k}, for k in 1, . . . , K), Bayes' theorem states that:

p(θ, τ², β, σ², C, IFR, IR, φ | data) ∝ p(data | θ, τ², β, σ², C, IFR, IR, φ) × p(θ, τ², β, σ², C, IFR, IR, φ).

We have that p(D_k|IFR_k, C_k) is defined according to a binomial distribution as stated in (2), that p(CC_k|T_k, P_k, C_k, φ_k) is defined according to (3), and that p(C_k|P_k, IR_k) is defined according to (1). We also have that p(IFR_k|θ, τ²) and p(IR_k|β, σ²) are defined according to (7) and (8) respectively. We are left to define prior distributions for the unknown parameters: θ, τ², β, σ², and φ_k, for k in 1, . . . , K. Defining prior distributions is often controversial, as their choice can substantially influence the posterior when few data are available; see Lambert et al. (2005); Berger (2013); Burke et al. (2018); Gelman et al. (2006).
We proceed by adopting flat priors on the probability scale to induce "uninformative" priors on the θ and β parameters (i.e., uniform priors on the inverse-cloglog scale), and weakly-informative half-normal priors for the τ and σ parameters. We fully expect that the true underlying infection rate will vary importantly across populations, while the true underlying IFR should be less variable (especially after accounting for population-level sources of heterogeneity; see Section 5.1). The priors are set accordingly, with:

icloglog(θ) ~ Uniform(0, 1), icloglog(β) ~ Uniform(0, 1),

and half-normal priors on τ and σ. The only remaining component is p(φ) = Π_k p(φ_k). Knowledge about the magnitude of preferential testing in a given population may come from a variety of sources, yet may be difficult to quantify. Let us begin by assuming that, for k = 1, . . . , K, φ_k is greater than 1, but otherwise of unknown magnitude. It also seems reasonable to assume a certain degree of heterogeneity for φ_k across the K populations, since the degree to which testing is available and randomly allocated might vary considerably. We therefore define a uniform prior such that:

φ_k ~ Uniform(1, γ + 1), for k in 1, . . . , K,

and: γ ~ Exp(λ = 0.5). The prior specification therefore assumes that the uniform range of possible values for φ_k is itself unknown. Setting λ = 0.5 implies that, a priori, a reasonable value for the φ_k odds ratio is 2 (i.e., since E(γ) = 1/λ and E(φ_k | γ) = (1 + (γ + 1))/2). In some scenarios, it might be conceivable to have a subset of groups for which φ_k is known and equal to 1 (i.e., to have data from some samples where testing is known to be random). Without loss of generality, suppose this subset is the first k′ groups, such that for k = 1, . . . , k′, we have φ_k = 1. In a situation where testing is known to be random for all groups, k′ = K. We emphasize that the performance of any Bayesian estimator will depend on the choice of priors. The priors described represent a scenario where there is little to no a priori knowledge about the θ, β, and φ model parameters.
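The implied prior mean of φ_k can be checked directly: with γ ~ Exp(λ = 0.5) we have E(γ) = 2, and since E(φ_k | γ) = 1 + γ/2, the marginal prior mean is E(φ_k) = 2. A quick Monte Carlo sketch of this hierarchical prior:

```python
import random

random.seed(3)

# Prior sketch: gamma ~ Exp(rate 0.5), then phi_k | gamma ~ Uniform(1, 1 + gamma).
# Analytically, E(gamma) = 1/0.5 = 2 and E(phi_k) = 1 + E(gamma)/2 = 2.
lam = 0.5
draws = []
for _ in range(200_000):
    gamma = random.expovariate(lam)
    draws.append(random.uniform(1.0, 1.0 + gamma))

mean_phi = sum(draws) / len(draws)
print(round(mean_phi, 2))   # close to 2.0
```

Smaller λ stretches the prior toward larger γ (and hence larger plausible φ_k), which is the robustness/efficiency trade-off examined in the simulation study.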
Inference would no doubt be improved should more informative priors be specified based on probable values for each of these parameters. We briefly consider the impact of priors in the simulation study in Section 4.3, where we look to different values for λ. Let us illustrate the proposed model with the simple example dataset introduced earlier in Table 1. The illustrative dataset considers K = 12 groups with average populations of 2,000 individuals (P_k obtained from a NegBin(2000, 1) distribution). The data were simulated such that, across all 12 groups, the expected overall IFR is 2% (i.e., icloglog(θ) = 0.02; θ = −3.90), and the expected overall IR is 20% (i.e., icloglog(β) = 0.20; β = −1.50). A certain degree of variability between populations was allowed by selecting τ² = 0.005 and σ² = 0.25. The testing rate for each population was obtained from a Uniform(0.01, 0.10) distribution, so that the number of tests in each population ranged from 1% to 10% of individuals. Note that the number of observed deaths is relatively small, ranging from 1 to 20. The number of confirmed cases differs with differing degrees of preferential testing. The φ_k values were set to be equally spaced from 1 to γ + 1, and the numbers of confirmed cases were then simulated from Wallenius' non-central hyper-geometric distribution (see Fog (2008)) as in equation (3), with either no preferential testing (γ = 0), "mild" preferential testing (γ = 4), "modest" preferential testing (γ = 11), or "substantial" preferential testing (γ = 22). We fit the model (M_1) as detailed in Section 3.1, where k′ = 0 and K = 12, and also fit the model (M_{γ=0}) where γ = 0 is fixed, corresponding to a situation in which one assumes that none of the populations are subject to any preferential testing. Each model is fit using JAGS (just another Gibbs sampler) (Kruschke, 2014), with 5

CC_k ~ Binomial(T_k, 1 − (1 − IR_k)^φ_k),    (11)

for k in 1, . . . , K.
Note that the φ_k parameter above no longer corresponds to an odds ratio, yet the interpretation is similar. We could have considered substituting the non-central hyper-geometric distribution in (3) with a Gaussian asymptotic approximation to the non-central hyper-geometric (Hannan and Harkness, 1963; Stevens, 1951). However, the Gaussian approximation requires solving quadratic equations and therefore might not necessarily help reduce the computational complexity of our model; see Sahai and Khurshid (1995). We can further simplify by marginalizing over the cases. Since we have the distribution of C_k and the conditional distribution of D_k given C_k, and since both of these are binomials, we have that unconditionally:

D_k ~ Binomial(P_k, IR_k × IFR_k).    (14)

For our unknown parameters of interest (θ, τ², β, σ²), latent variables (IFR_k, IR_k, and φ_k, for k in 1, . . . , K), and aggregate data from K sources (we require data = {P_k, T_k, CC_k, and D_k}, for k in 1, . . . , K), Bayes' theorem states that:

p(θ, τ², β, σ², IFR, IR, φ | data) ∝ p(data | θ, τ², β, σ², IFR, IR, φ) × p(θ, τ², β, σ², IFR, IR, φ),

where p(D_k|IFR_k, IR_k, P_k) and p(CC_k|T_k, IR_k, φ_k) are defined by the binomial distributions detailed in (14) and (11) respectively. We also have p(IFR_k|θ, τ²) and p(IR_k|β, σ²) defined according to (7) and (8) respectively, and priors defined as in Section 3. For the large-P model, MCMC mixing can be slow because different combinations of φ_k, cloglog(IR_k) and cloglog(IFR_k) can yield similar model probabilities. This is related to the identifiability issues discussed in Section 2.2. To improve mixing, we wrote this model in the nimble package (de Valpine et al., 2017), which supports an extension of the modeling language used in JAGS and makes it easy to configure samplers and provide new samplers. Using nimble, we applied two sampling strategies for the trio (φ_k, cloglog(IR_k), cloglog(IFR_k)) for each k in 1, . . . , K.
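The marginalization step relies on the standard binomial-thinning identity: if C ~ Bin(P, IR) and D | C ~ Bin(C, IFR), then marginally D ~ Bin(P, IR × IFR). This can be verified exactly for a small population:

```python
from math import comb

def binom_pmf(k, n, p):
    # Exact binomial probability mass function.
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Binomial thinning, as used to marginalize over C_k in expression (14).
P, IR, IFR = 30, 0.2, 0.1
for d in range(P + 1):
    marg = sum(binom_pmf(c, P, IR) * binom_pmf(d, c, IFR)
               for c in range(d, P + 1))
    assert abs(marg - binom_pmf(d, P, IR * IFR)) < 1e-10

print("marginal of D matches Bin(P, IR * IFR)")
```

Marginalizing removes C_k from the model entirely, which is what makes the large-P formulation tractable.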
The details of these are provided in the Appendix. We conducted a simple simulation study in order to better understand the operating characteristics of the proposed model. Specifically, we wished to evaluate the frequentist coverage of the credible interval (CI) for θ, and to investigate the impact of the chosen prior for the magnitude of preferential testing (i.e., the impact of selecting different values for λ). As emphasized in Gustafson et al. (2009), the average frequentist coverage of a Bayesian credible interval, taken with respect to the prior distribution over the parameter space, will equal the nominal coverage. This mathematical property is unaffected by the lack of identification. However, the variability of coverage across the parameter space is difficult to anticipate and could be highly affected by the choice of prior. For example, we might expect that, in the absence of preferential testing (i.e., when γ = 0), coverage will be lower than the nominal rate. However, if this is the case, coverage will need to be higher than the nominal rate when γ > 0, so that the "average" coverage (taken with respect to the prior distribution over the parameter space) is nominal overall. First, we ran the simulation study with the 12 "unknown" φ_k values, for k in 9, . . . , 20, set to be evenly spaced between 1 and γ + 1 (as in the illustrative example), with γ assuming one of the eight values of interest (Study A). We also repeated the entire simulation study with the 12 "unknown" φ_k values, for k in 9, . . . , 20, simulated from a Uniform(1, γ + 1) distribution (Study B). We fit three models to each unique dataset: M_1, M_2, and M_3. All three models follow the same large-P framework detailed in Section 4.1 but consider different subsets of the data. The M_1 model uses only data from the samples for which φ_k is unknown, i.e., {P_k, T_k, CC_k, and D_k} for k in 9, . . . , 20.
The M_2 model considers the data from all the groups, i.e., {P_k, T_k, CC_k, and D_k} for k in 1, . . . , 20. Finally, the M_3 model uses only data from the samples for which φ_k is known, i.e., {P_k, T_k, CC_k, and D_k} for k in 1, . . . , 8. To be clear, the M_2 and M_3 models include the assumption of (correctly) known φ_k for k = 1, . . . , 8. We simulated 200 unique datasets and, for each, fit the three different models. The results suggest that appropriate estimates are achievable even in the presence of a substantial and unknown amount of preferential testing. When γ = 11, the odds (on average) for a case to be tested are more than six times the odds for a non-case (the average φ_k value is equal to (1 + (γ + 1))/2), and yet the model is able to appropriately adjust. Results also point to a bias-variance trade-off with regard to one's choice for λ. The smaller the value of λ, the more robust the model is to a potentially large degree of preferential testing. The price of this additional robustness is greater uncertainty, i.e., a wider credible interval. Results from the M_2 and M_3 models suggest that, for a given range of γ values, the M_2 model (which makes use of the additional "non-representative" data) is preferable. Finally, note that overall the interval width is much narrower for M_2 relative to M_1 (compare the dashed lines in the lower-right panel of Figure 2 to those in the upper-right panel), which confirms that the k′ = 8 representative samples are very valuable for reducing the uncertainty around θ. With regard to the COVID-19 pandemic, this emphasizes the importance of conducting some amount of "unbiased testing," even if the sample sizes are relatively small; see Cochran (2020). Recall that, in Study B, the "unknown" φ_k values were simulated from a Uniform(1, 1 + γ) distribution. In this study, for a given dataset, we do not necessarily have a wide range between the lowest and highest φ_k values.
Based on the identifiability issues discussed in Section 2.2, we might therefore anticipate that appropriate inference is more challenging. Looking at the results with regard to coverage, this intuition appears to be correct. However, note that the M_2 model is still able to provide at or above nominal coverage for most of the γ values we considered. With regard to the infection rate (IR), the time since the first reported infection and the time between the first reported infection and the imposition of social distancing measures might be predictive (Anderson et al., 2020). Other, potentially less obvious, covariates could also be included for IR; see Stephens-Davidowitz (2020). Age is a key factor for explaining the probability of COVID-19-related death. One might therefore consider the median age of each group as a predictor for the IFR, or perform analyses that are stratified by different age groups (Onder et al., 2020). The latter strategy has, for instance, been recommended to make accurate predictions for respiratory infections (Pellis et al., 2020). To illustrate, let us briefly consider the possibility that, for each population, one has age-stratified data. Suppose one has {P_k^age, T_k^age, CC_k^age, and D_k^age}, for age in {0-30 years, 30-60 years, 60-80 years, ≥ 80 years}, and for k in 1, . . . , K. Then, including the covariates and a simple random effect can accommodate this as follows:

cloglog(IFR_k^age) ~ Normal(θ^age + η_k, τ²),

where: η_k ~ N(0, ω²), for k in 1, . . . , K, with unknown variance ω². Finally, note that including data from serology studies (Winter and Hegde, 2020) will be crucial to inform the IR. If data from multiple serology studies are available from a single jurisdiction (or from several different regions within the jurisdiction), the proposed model could incorporate all of these by including appropriate covariates and random effects.
Estimation of the IFR is very challenging due to the fact that it is a ratio of numbers where both the numerator and denominator are subject to a wide range of biases. Our proposed model only seeks to address one particular type of bias pertaining to the denominator: the bias in the number of cases due to preferential testing. With this in mind, we wish to call attention to several other important sources of bias. Cause of death information, compiled from death certificates, may not list SARS-CoV-2 as a contributing factor and certain jurisdictions may not have adopted the International Form of Medical Certificate of Cause of Death or have adopted the WHO guidelines on registering COVID-19-related deaths (WHO, 2020). As such, reported statistics on the number of deaths may be very inaccurate. To overcome this issue, many suggest looking to "excess deaths," by comparing aggregate data for all-cause deaths from the time during the pandemic to the years prior (Leon et al., 2020) . Using this approach and a simple Bayesian binomial model, Rinaldi and Paradisi (2020) are able to obtain IFR estimates without relying on official (possibly inaccurate) data for the number of COVID-19 deaths. Some people who are currently sick will eventually die of the disease, but have not died yet. Due to the delay between disease onset and death, the number of confirmed and reported COVID-19 deaths at a certain point in time will not reflect the total number of deaths that will occur among those already infected (right-censoring). This will result in the number of recorded deaths underestimating the true risk of death. The denominator of the IFR must be the number of cases with known outcomes. Using time-series survival data or a defensible prior on the time from infection to death, the Bayesian model could be expanded to account for this additional source of uncertainty. 
The model, as currently proposed, also fails to account for the (unknown) number of false-positive and false-negative tests. When both the test specificity and the infection rate are low, false positives can substantially inflate estimates of the infection rate and, as a consequence, the IFR could be biased downwards. In principle, the model could accommodate this by specifying priors for the sensitivity and specificity. In fact, Bayesian inference is known to be an excellent tool for adjusting for unknown testing uncertainty (Srinivasan et al., 2012; Burstyn et al., 2020). Finally, because the model uses data that are aggregated at the group level, estimates are potentially subject to ecological bias. While including group-level covariates may help reduce variability in the estimates, adjustment using group-level covariates can also lead to biased, misleading results (Li and Hua, 2020; Berlin et al., 2002). While ecological analyses can be useful for developing hypotheses (Pearce, 2000) and may be needed to make rapid use of publicly available surveillance data for inference during an epidemic, they cannot be used to make reliable inferences at the participant level. Sharing de-identified participant-level data as rapidly and widely as possible, in keeping with ethical and legal standards, is central to epidemic response (The GloPID-R Data Sharing Working Group, 2018). Reducing uncertainty around the severity of COVID-19 is of great importance to policy makers and the public (Ioannidis, 2020; Lipsitch, 2020). Comparisons between the COVID-19 and seasonal influenza IFRs have impacted the timing and degree of social distancing measures and highlighted the need for more accurate estimates of the severity of both viruses (Faust, 2020).
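To illustrate the magnitude of the false-positive problem, the standard Rogan-Gladen prevalence correction (a textbook adjustment, not the method proposed in the text) shows how imperfect specificity inflates the apparent infection rate at low prevalence; the sensitivity, specificity, and prevalence values below are hypothetical.

```python
def rogan_gladen(apparent, sens, spec):
    # Rogan-Gladen estimator: recover true prevalence from the apparent
    # test-positive rate, given assumed test sensitivity and specificity.
    return (apparent + spec - 1.0) / (sens + spec - 1.0)

# Hypothetical low-prevalence setting.
true_prev, sens, spec = 0.01, 0.90, 0.98
apparent = sens * true_prev + (1 - spec) * (1 - true_prev)

print(round(apparent, 4))                            # 0.0288, nearly 3x the truth
print(round(rogan_gladen(apparent, sens, spec), 4))  # recovers 0.01

# With deaths held fixed, inflating the case denominator this way biases
# the IFR estimate downward by the same factor.
```

In a fully Bayesian treatment, priors on sensitivity and specificity would propagate this extra uncertainty into the posterior for the IFR rather than applying a point correction.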
The current lack of clarity means that policy makers cannot tell whether cross-population differences (i.e., due to large τ²) reflect clinically relevant heterogeneity or spurious heterogeneity driven by testing and reporting biases (i.e., due to large γ). Existing efforts to understand the distribution of SARS-CoV-2 infection at the population level are unfortunately hampered by recruiting challenges (Gudbjartsson et al., 2020; Bendavid et al., 2020), leading to an over-representation of people who are concerned about their exposure and/or an under-representation of individuals who are self-quarantining, isolating, or hospitalized because of the virus. As of May 2020, non-preferential testing has been reported only for limited populations in which the entire population was tested, including the Diamond Princess cruise ship, four US prisons where all inmates and staff were tested, and the town of Vo', Italy (Field Briefing, 2020; Lavezzo et al., 2020; So and Smith, 2020; Aspinwall and Neff, 2020). Of these, only the Diamond Princess cruise ship had made testing and outcome data publicly available at the age-group level at the time of publication (Field Briefing, 2020); participant-level age and test-result data are available for Vo', Italy, but no outcome data were listed at the time of publication. Munich, Germany, has undertaken a population-representative prospective cohort study in which researchers randomly selected addresses and all household members will be asked to complete COVID-19-related questionnaires and serologic testing (Radon et al., 2020). While the research protocol does not specify when, whether, or how the data will be made publicly available, the Munich data will be an important resource as perhaps the first large-scale, population-representative dataset unaffected by preferential testing. In an ideal world (for purposes of estimating the IFR), ample data free from preferential testing bias would be available for analysis.
Unfortunately, in reality, such data are in scarce supply (Goodman and Shah, 2020). Our Bayesian model suggests that we can make headway nonetheless: data from unrepresentative jurisdictions can still be used to obtain informative estimates of the IFR and can help reduce uncertainty when used alongside the limited representative data available.

In the Introduction, we identified two important questions. First (Q1), is it possible to reliably estimate the IFR without any specific knowledge about the degree to which the data are biased by preferential testing? Second (Q2), must we rely only on select samples for which testing is representative and ignore the vast majority of available data? The proposed Bayesian model suggests that reliable estimation of the IFR at the group level is indeed possible when existing data do not reflect a random sample from the target population and when only limited knowledge is available about the likely magnitude of preferential testing. Importantly, the key to (partial) identifiability is sufficient heterogeneity in the degree of preferential testing across groups, together with sufficient homogeneity in the group-specific IFR. We also saw that one need not ignore the vast majority of available data that may be biased by preferential testing; these data, with appropriate adjustment, can supplement any available representative data and thereby sharpen the inference. In the typical situation of drawing inference from a single sample, obtaining appropriate estimates when that sample is biased by preferential testing is challenging, if not impossible, without some form of external validation data.
Intuition suggests that one might only be able to conduct a sensitivity analysis with respect to the impact of bias, and indeed, applying prior distributions for the degree of preferential testing and proceeding with Bayesian inference is often regarded as a probabilistic form of sensitivity analysis (see, for instance, Greenland (2005)).

Code for all models, data, and analysis is available at: https://github.com/harlanhappydog/COVID19IFR

7. Appendix

7.1. Issues of (un)identifiability (continued)

In Section 2.2 we described how the evaluation of the identification interval (6) for IFR reduces to a simple intersection of intervals in the special case of τ = 0. Here we describe the evaluation of (6) for τ > 0, i.e., where limited heterogeneity in IFR is permitted. Recall that quadratic programming constitutes the minimization of a quadratic function subject to linear constraints, which may be a mix of equality and inequality constraints. Let x be a candidate value, which we will test for membership in the identification interval. To perform this test, we use a standard quadratic programming package (quadprog; Turlach and Weingessel (2013)) to minimize the quadratic function Var(IFR), subject to the equality constraint IFR = x and the 2K inequality constraints that restrict each IFR_k to the interval I_k. By the definition of (6), then, x belongs to the identification interval if and only if the minimized variance does not exceed τ². Thus a simple grid search over values of x numerically determines the identification interval. Note that so long as a and b arise from values of φ within the prescribed bounds, the underlying value of IFR must belong to the identification interval. Thus two numerical searches can be undertaken: one starts at the underlying value and tests successively larger x until a failing value is obtained; the other does the same, but moving downwards.
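The grid search just described can be sketched numerically. For illustration we take Var(IFR) to be the equally weighted across-group variance (an assumption; the paper's weighting may differ) and, rather than calling a QP library such as quadprog, solve the box-constrained problem directly from its KKT conditions: with the mean fixed at x, the variance-minimizing IFR_k values are clipped versions of x + λ for a scalar λ found by bisection. The interval endpoints in the usage note are hypothetical.

```python
def min_var_at(x, intervals, iters=200, tol=1e-9):
    """Minimum across-group variance of IFR_k subject to mean(IFR_k) = x
    and IFR_k in I_k; returns inf if no feasible point exists.
    KKT: the minimizer has IFR_k = clip(x + lam, lo_k, hi_k) for a scalar
    lam chosen (by bisection) so that the mean equals x."""
    K = len(intervals)
    lo_sum = sum(lo for lo, _ in intervals)
    hi_sum = sum(hi for _, hi in intervals)
    if not (lo_sum - tol <= K * x <= hi_sum + tol):
        return float("inf")  # mean(IFR_k) = x is unachievable
    lo_l = min(lo for lo, _ in intervals) - x
    hi_l = max(hi for _, hi in intervals) - x
    for _ in range(iters):
        lam = 0.5 * (lo_l + hi_l)
        v = [min(max(x + lam, a), b) for a, b in intervals]
        if sum(v) < K * x:
            lo_l = lam
        else:
            hi_l = lam
    v = [min(max(x + 0.5 * (lo_l + hi_l), a), b) for a, b in intervals]
    return sum((vk - x) ** 2 for vk in v) / K

def identification_interval(intervals, tau, grid, tol=1e-10):
    """x belongs to the identification interval iff the minimized
    variance does not exceed tau^2 (up to numerical tolerance)."""
    return [x for x in grid if min_var_at(x, intervals) <= tau ** 2 + tol]
```

With τ = 0 this reproduces the intersection-of-intervals behaviour described in Section 2.2: for hypothetical intervals I_1 = [0.010, 0.020] and I_2 = [0.015, 0.030], only candidate values in [0.015, 0.020] survive, while a positive τ admits values whose across-group spread stays within τ².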
7.2. MCMC sampling strategies

Using nimble, we applied two sampling strategies for the trio (φ_k, cloglog(IR_k), cloglog(IFR_k)), for each k in 1, ..., K. In all cases the univariate sampling method was adaptive random-walk Metropolis-Hastings. For notation, we drop the subscript k and define η_1 = cloglog(IR) and η_2 = cloglog(IFR).

First, we included a block sampler on (φ, η_1, η_2) for each k, along with the usual univariate samplers on each element of the trio.

Second, we included samplers in two transformed coordinate spaces. Define transformed coordinates (z_1, z_2) = (h_1(η_1, η_2), h_2(η_1, η_2)) = (exp(η_1) + exp(η_2), exp(η_1) - exp(η_2)). (Based on the cloglog link, the quantities exp(η_1) and exp(η_2) may be interpreted as continuous-time rates.) Here z_1 represents the more strongly identified quantity, so mixing in z_2 can be slow; hence we wish to improve mixing in the z_2 direction. To do so, a sampler can operate in the (z_1, z_2) coordinates while transforming the prior such that it is equivalent in (z_1, z_2) to what was specified in the original coordinates (η_1, η_2). Using P(·) for priors, we have log(P(z_1, z_2)) = log(P(η_1, η_2)) - log(|J|), where |J| is the determinant of the Jacobian of (z_1, z_2) with respect to (η_1, η_2); in this case, |J| = 2 exp(η_1 + η_2). The other transformed coordinates used were (z_1, z_2) = (log(φ) + η_1, log(φ) - η_1). Note that log(φ) + η_1 = log(-log((1 - IR)^φ)). Hence z_1 again represents the more strongly identified quantity, and we improve mixing by sampling in the z_2 direction. The same formulation as above applies, with |J| = 2/φ.

References

How will country-based mitigation measures influence the course of the covid-19 epidemic? The Lancet.
These prisons are doing mass testing for covid-19 and finding mass infections.
Covid-19 antibody sero
Statistical decision theory and Bayesian analysis.
Individual patient- versus group-level data meta-regressions for the investigation of treatment effect modifiers: ecological bias rears its ugly head.
Bayesian bivariate meta-analysis of correlated effects: Impact of the prior distributions on the between-study correlation, borrowing of strength, and joint inferences.
Towards reduction in bias in epidemic curves due to outcome misclassification through Bayesian analysis of time-series of laboratory test results: Case study of covid-19 in Alberta.
Why we need more coronavirus tests than we think.
Bias due to differential and non-differential disease- and exposure misclassification in studies of vaccine effectiveness.
Programming with models: writing statistical algorithms for general model structures with nimble.
Comparing covid-19 deaths to flu deaths is like comparing apples to oranges.
Diamond princess covid-19 cases.
Sampling methods for Wallenius' and Fisher's noncentral hypergeometric distributions.
Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper).
The Hill: Poor state reporting hampers pandemic fight.
Multiple-bias modelling for analysis of observational data.
Spread of sars-cov-2 in the icelandic population.
Bayesian inference for partially identified models. The International
Interval estimation for messy observational data.
Normal approximation to the distribution of two independent binomials, conditional on fixed sum.
Estimation of sars-cov-2 mortality during the early stages of an epidemic: a modelling study in Hubei.
First Opinion: A fiasco in the making? As the coronavirus pandemic takes hold, we are making decisions without reliable data.
Communicating the risk of death from novel coronavirus disease.
Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan.
How vague is vague? A simulation study of the impact of the use of vague prior distributions in MCMC using WinBUGS.
Suppression of covid-19 outbreak in the municipality of
Covid-19: a need for real-time monitoring of weekly excess deaths. The Lancet.
Partial identification of probability distributions.
Using simulation studies to evaluate statistical methods.
Case-fatality rate and characteristics of patients dying in relation to covid-19 in Italy.
The ecological fallacy strikes back.
Systematic selection between age and household structure for models aimed at emerging epidemic predictions.
Changes in severity of 2009 pandemic A/H1N1 influenza in England: a Bayesian evidence synthesis.
The severity of pandemic H1N1 influenza in the United States.
Protocol of a population-based prospective covid-19 cohort study Munich.
An empirical estimate of the infection fatality rate of covid-19 from the first Italian outbreak.
Statistics in epidemiology: methods, techniques and applications.
In four U.S. state prisons, nearly 3,300 inmates test positive for coronavirus - 96% without symptoms.
Propagation of uncertainty in Bayesian diagnostic test interpretation.
The New York Times, Opinion: Google searches can help us find emerging covid-19 outbreaks.
Mean and variance of an entry in a contingency table.
quadprog: Functions to solve quadratic programming problems.
Guidelines: Cause of death - covid-19.
The important role of serology for covid-19 control.
Case fatality risk of influenza A (H1N1pdm09): a systematic review.