key: cord-0832154-une291x3 authors: Mieskolainen, M.; Bainbridge, R.; Buchmueller, O.; Lyons, L.; Wardle, N. title: Statistical techniques to estimate the SARS-CoV-2 infection fatality rate date: 2020-11-22 journal: nan DOI: 10.1101/2020.11.19.20235036 sha: 52aa02c5358b84924b2cb08e0d192e93b44e17a5 doc_id: 832154 cord_uid: une291x3 The determination of the infection fatality rate (IFR) for the novel SARS-CoV-2 coronavirus is a key aim of many of the field studies currently being undertaken in response to the pandemic. The IFR and the basic reproduction number R0 are the main epidemic parameters describing the severity and transmissibility of the virus, respectively. The IFR can also be used as a basis for estimating and monitoring the number of infected individuals in a population, which may subsequently be used to inform policy decisions relating to public health interventions and lockdown strategies. The interpretation of IFR measurements requires the calculation of confidence intervals. We present a number of statistical methods that are relevant in this context and develop an inverse problem formulation to determine correction factors that mitigate time-dependent effects which can otherwise lead to biased IFR estimates. We also review a number of methods to combine IFR estimates from multiple independent studies, provide example calculations throughout this note and conclude with a summary and "best practice" recommendations. The developed code is available online. The meta-analysis reported in Ref. [3] relies on the common 'method of moments' approach of DerSimonian and Laird [4] to provide a point estimate of the IFR. We review several approaches to combine results from individual studies of the IFR into a single estimate. These include the method of moments, a classic meta-analysis based on a normal likelihood, a joint likelihood combination, and methods that combine full probability densities, such as optimal transport and the product or mean of Bayesian posteriors. In Section 2, we introduce the Gangelt field study [2], which we use as an example in the first part of the paper. Section 3 reviews several methods that can be used to determine a confidence interval for the IFR, based on a single binomial proportion, a ratio of binomial proportions, a profile likelihood ratio, Monte Carlo simulation, and Bayesian constructs. Section 4 compares the confidence intervals from the various methods. Section 5 presents a time-dependent deconvolution calculus that accounts for intrinsic time delays through the determination and application of a correction factor (and associated uncertainty) to the IFR estimate. Section 6 first provides a general perspective on how multiple data points can be combined, and then introduces a set of seroprevalence studies from around the globe, which are used as concrete examples for the aforementioned combination techniques. Finally, Section 7 summarises this work and concludes by providing "best practice" recommendations in the context of confidence interval calculations for the IFR of the SARS-CoV-2 coronavirus. A sero-epidemiological study of a small German community exposed to a super-spreading event was undertaken to determine the IFR [2]. The municipality of Gangelt is located in the district of Heinsberg in the German state of North Rhine-Westphalia. The municipality reported unusually high levels of SARS-CoV-2 infections following local Carnival festivities held on 15th February 2020.
Strict social distancing measures, which included an advisory curfew, were introduced on 26th February to suppress further infections. The Carnival festivities held in Gangelt are attended predominantly by people living locally and the community is relatively closed, with low levels of tourism and travel. An estimate of the IFR is obtained from the double ratio

IFR = r_F / r_I = (n_F / n_P) / (n_{I∧T} / n_T) = n_F / (n_P r_I) ,   (2.1)

where the raw fatality rate r_F is the ratio of the number of fatalities n_F to the number of individuals n_P in a given population, and the raw infection rate r_I is the ratio of the number of infections n_{I∧T} identified in a representative cross-sectional sample to the number of tested individuals n_T. Assuming the test sample of n_T individuals is representative of the population under study, in terms of both demographic and seroprevalence measures, the product n_P r_I provides an estimate of the total number of infected individuals in the population. The authors of Ref. [2] identified the Gangelt municipality as a unique opportunity to accurately estimate the IFR, because of its stable, relatively isolated population and an appreciable number of fatalities arising from the high level of infections present in its community. The Gangelt municipality has a population of n_P = 12597. Inhabitants were randomly polled to participate in testing for both SARS-CoV-2 virus RNA (PCR) and antibodies, to identify the number of present and past infections, respectively. The number of inhabitants that were tested and satisfied all survey requirements is n_T = 919, and the number of identified infections in this sample is n_{I∧T} = 138. Within the duration of the study, the number of deaths recorded in the Gangelt municipality that were associated with a SARS-CoV-2 virus infection is n_F = 7. Hence the study yields a raw infection rate of 15.0% and an IFR of 0.37%. Following the application of corrections for various identifiable biases and the associated statistical and systematic uncertainties, Ref. [2] reports r_I = 15.5 [12.3, 19.0]%. Hence, the total number of infected individuals in the Gangelt municipality is estimated to be 1950 [1550, 2390]. The reported interval for the IFR of 0.37% is [0.29, 0.45]%, which can be used to estimate the number of infected individuals in other places with population characteristics similar to those of the Gangelt study. Several other key findings are reported in Ref. [2], which are beyond the scope of this note. We restrict the discussion in this note to the reported confidence interval for the IFR estimate. A total of 6575 deaths associated with SARS-CoV-2 were recorded in Germany by 2nd May 2020. Assuming the Gangelt sample population is representative of the German population as a whole, the IFR can be used to estimate the total number of infections, at some point earlier in the pandemic, that led to the reported death toll on 2nd May 2020. The study yields an estimate of 1.8 [1.5, 2.3] million infected individuals in the German population. Using the methods described below, we determine a confidence interval (CI95) of [0.9, 3.7] million on the estimate of 1.8 million infected individuals, based on a simple method (the Wilson score) applicable to a single binomial proportion.
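As a concrete reference point, the double ratio of Eq. 2.1 and the extrapolations quoted above can be reproduced directly from the raw counts. The short Python sketch below uses our own variable names (it is an illustration, not the released analysis code) and omits the bias corrections applied in Ref. [2].

```python
# Minimal sketch: raw rates and the double-ratio IFR of Eq. 2.1,
# using the Gangelt counts quoted in the text.
n_P = 12597    # municipality population
n_T = 919      # tested individuals satisfying all survey requirements
n_IT = 138     # identified infections among the tested (n_{I and T})
n_F = 7        # SARS-CoV-2 associated deaths in the municipality

r_I = n_IT / n_T               # raw infection rate ~ 0.150
r_F = n_F / n_P                # raw fatality rate
IFR = r_F / r_I                # double ratio ~ 0.37 %
n_infected = n_P * r_I         # extrapolated number of infected in Gangelt

# Extrapolating to Germany with 6575 recorded deaths by 2 May 2020
deaths_germany = 6575
infected_germany = deaths_germany / IFR   # ~ 1.8 million

print(f"r_I = {r_I:.3f}, IFR = {100*IFR:.2f} %")
print(f"Gangelt infected ~ {n_infected:.0f}, Germany infected ~ {infected_germany/1e6:.1f} M")
```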
The estimation of confidence intervals for parameters via statistical tests proceeds as follows. The test statistic for the null hypothesis H_0 is known to asymptotically follow an analytic distribution, which is taken as an approximate proxy for the exact problem. The exact solution is available only in special cases. The analytic approximation can be relaxed with more modern numerical bootstrap and Monte Carlo techniques, which provide the most appropriate tools when the data also need to be re-weighted, propagated through a chain of analysis algorithms or manipulated in more complex ways.

Notation. The likelihood function is L(θ) ≡ L(θ; x_1, x_2, ..., x_n) = f(x_1, x_2, ..., x_n; θ) = ∏_j f(x_j; θ), where f is the underlying sampling probability density (pdf) and the last equality holds for n independent and identically distributed (iid) observations in the sample {x_j}. Algebraically, the likelihood and the density have the same origin; the difference is only whether the parameter θ or the observable x is treated as the variable of the function. The unit normalization of the density holds only over x. This difference should make the mathematical meaning unambiguous, even if the term likelihood is often used in a relaxed way. The null hypothesis H_0 parameter values, when not fixed, are denoted with θ_0 and the maximum likelihood estimates (MLE) with θ̂ = arg max L(θ).

The single binomial uncertainty is the dominant statistical uncertainty on the IFR estimate, because the fractional uncertainty on the number of fatalities is typically much larger than the uncertainty on the number of infections. The most common estimator for the binomial success rate confidence interval is the so-called normal approximation (Wald) interval. It can be derived by inverting the Wald test statistic

Z = (p̂ − p_0) / ŝe(p̂) ,   (3.1)

where the standard error estimate of the binomial parameter is ŝe(p̂) = [p̂(1 − p̂)/n]^{1/2}. Writing down −z_{α/2} ≤ Z ≤ z_{α/2}, substituting Eq. 3.1 and re-arranging gives

CI_S = p̂ ± z_{α/2} [p̂(1 − p̂)/n]^{1/2} ,   (3.2)

where p̂ = k/n is the maximum likelihood estimate of the central success rate given k successes and n trials. The standard normal inverse cumulative distribution quantile is Φ^{−1}(1 − α/2) = −Φ^{−1}(α/2) = z_{α/2} for a confidence level (1 − α) × 100%. Numerically, these are z = 1 for 68.27% and z = 1.96 for 95% confidence levels (intervals), respectively. The construction here assumes √n(θ − θ̂) to follow a Gaussian N(0, σ²) by the Central Limit Theorem, and the true variance I(θ)^{−1} is estimated with the plug-in estimate I(θ̂)^{−1}, using the Fisher information I(θ) given in Eq. 3.5. Both assumptions are valid only as n → ∞. The finite-n coverage of this interval estimator is weak, as emphasized in [5] and also shown in our simulations in Appendix L, and its use cannot be recommended. Because the Wald test is not scale-invariant, one may try to improve its behavior with normalizing transformations such as the log-odds transform φ = ln[p/(1 − p)] ∈ (−∞, ∞) or a pure log-transform φ = ln p. The interval endpoints are then calculated in the transformed space and inverse transformed.

Wilson derived an estimator [6] for the binomial proportion confidence interval using more advanced reasoning on the probabilities than the standard Wald test-based approximation, and this leads crucially to a different evaluation point. Using a more modern construction, the score test statistic is

S(p_0) = U²(p_0) / I(p_0) ,   (3.3)

where the gradient of the log-likelihood (score) and the Fisher information are

U(p) = ∂ ln L(p)/∂p = k/p − (n − k)/(1 − p) ,   (3.4)
I(p) = −E[∂² ln L(p)/∂p²] = n / [p(1 − p)] .   (3.5)

As originally shown by Rao [7], this test follows χ²-distribution asymptotics like the Wald test. It can also be shown that the score test formulation is equivalent to a Lagrange multiplier based constrained optimization [8], often used in economics, physics and engineering. Score intervals typically require numerical solutions. However, by setting Eq. 3.3 equal to z², which is allowed because a χ² variable with one degree of freedom is the square of a standard normal, a quadratic closed-form solution is obtained,

CI_W = [ p̂ + z²/(2n) ± z (p̂(1 − p̂)/n + z²/(4n²))^{1/2} ] / (1 + z²/n) .   (3.6)

The central estimate of the rate is not given by k/n, but by (k + z²/2)/(n + z²), which makes a significant difference for small event counts. We return to this feature of the intervals with the Bayesian estimates. Typical extensions of the Wilson score interval add continuity corrections to the standard formula. The Wilson score interval can be recommended as the de facto choice to replace the weakly performing Wald interval.
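Both closed-form intervals above are one-liners to evaluate; a minimal sketch (our own helper functions, not the paper's released code), applied to the Gangelt fatality proportion of k = 7 deaths out of n = 12597 inhabitants:

```python
import numpy as np
from scipy import stats

def wald_interval(k, n, cl=0.95):
    """Normal approximation (Wald) interval of Eq. 3.2."""
    z = stats.norm.ppf(0.5 + cl / 2.0)
    p = k / n
    half = z * np.sqrt(p * (1.0 - p) / n)
    return p - half, p + half

def wilson_interval(k, n, cl=0.95):
    """Wilson score interval of Eq. 3.6."""
    z = stats.norm.ppf(0.5 + cl / 2.0)
    p = k / n
    center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return center - half, center + half

print(wald_interval(7, 12597))    # symmetric around k/n
print(wilson_interval(7, 12597))  # shifted towards (k + z^2/2)/(n + z^2)
```

For k = 7 and n = 12597 both intervals stay within [0, 1], but for smaller counts the Wald endpoints can become negative, in line with the coverage problems noted above.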
The log-likelihood ratio based test statistic is

LLR(θ_0) = 2 [ln L(θ̂) − ln L(θ_0)] ,   (3.7)

which, unlike the Wald or the score test, is scale-invariant. As before, one relies here on the χ²-distribution, which gives the asymptotic null hypothesis distribution according to Wilks' theorem [9]. Non-asymptotic inference without relying on the χ²-distribution is typically only possible via Monte Carlo simulations, unless one uses techniques such as saddlepoint approximations [10]. The binomial log-likelihood ratio is

LLR(p_0) = 2 [k ln p̂ + (n − k) ln(1 − p̂) − k ln p_0 − (n − k) ln(1 − p_0)] .   (3.8)

Comparing with the 1 − α quantile of the χ²_1-distribution gives the confidence interval

CI_LLR = [min(p_0), max(p_0)] such that LLR(p_0) ≤ χ²_{1,1−α} ,   (3.9)

which is found numerically. In the more general case, asymptotic d-parameter (vector) inference requires comparing the likelihood ratio with a χ²-distribution having d degrees of freedom.

This classic [11] binomial confidence interval estimator is also known as the 'exact' interval, because it is based on direct integration over the binomial distribution and is thus compatible with the Neyman construction of confidence intervals, given in Appendix E. By construction, it never undercovers. The interval [L, U] is defined by the binomial tail sums

∑_{i=k}^{n} C(n, i) L^i (1 − L)^{n−i} = α/2 ,   ∑_{i=0}^{k} C(n, i) U^i (1 − U)^{n−i} = α/2 ,   (3.10)

where C(n, i) is the binomial coefficient. The interval is usually obtained numerically by integrating over a beta distribution, which is dual to the sums over the binomial tails. The quantile integrals are

∫_0^L Beta(p; k, n − k + 1) dp = α/2 ,   (3.11)
∫_0^U Beta(p; k + 1, n − k) dp = 1 − α/2 ,   (3.12)

which are numerically inverted for L and U. The beta distribution is

Beta(p; α, β) = p^{α−1} (1 − p)^{β−1} / B(α, β) ,   (3.13)

with the normalization provided by the beta function

B(α, β) = Γ(α)Γ(β) / Γ(α + β) ,   (3.14)

where Γ is the gamma function. When k = 0 the interval is [0, 1 − (α/2)^{1/n}] and when k = n the interval is [(α/2)^{1/n}, 1]. This estimator is called conservative, because its guaranteed interval coverage is always equal to or larger than the nominal one. Relatedly, Blyth and Still [12] have shown how to construct a confidence interval for the binomial distribution with nominally optimal but conservative coverage and minimal length. The construction is nearly equivalent to Clopper-Pearson, but different optimization criteria are used. A downside of the alternative constructions is that intervals at different confidence levels are not always fully contained within each other, i.e., they are not necessarily nested as one would expect.

In situations like the one studied here, where observations are discrete (integers), the p-value is traditionally defined as the probability of obtaining the actual observed number k_obs or anything more extreme. These are used, for example, in obtaining Clopper-Pearson intervals for the binomial probability of success from the given k_obs, and this results in over-coverage. A method for mitigating this [13] is to consider only half of the probability of obtaining k_obs in calculating the p-value, i.e. for the right-hand tail.
Using this results in intervals that are shorter than those for the standard Clopper-Pearson intervals. The price to pay for the shorter intervals is that the mid-p method has undercoverage for specific ranges of the parameter of interest p, which vary with the total number of binomial tests N . Since N carries no useful information about p, suggestions have been made to average the coverage over N which should be acceptable even to frequentists. This procedure results in much reduced undercoverage for the mid-p approach (see, for example, ref. [14] ). Characteristics Chaotic coverage properties of classic binomial uncertainty interval estimators were studied in detail in [5] , where the Wald test based interval estimator was shown to be completely unsuitable when it comes to its practical coverage. In general the chaotic properties are due to the underlying binomial distribution spanning a discrete lattice structure, not a continuum. Table 1 summarizes different test construction strategies. The Wald, the score and the likelihood ratio all have the χ 2 -distribution as their null hypothesis H 0 asymptotic distribution. The frequentist coverage aspects behind these estimators have been also studied in [14] . Confidence intervals without relying on the asymptotic χ 2 -statistic can be obtained using Monte Carlo. A common choice with optimal properties in this context is the likelihood ratio based construction or 'ordering principle' of acceptance sets, which has been studied theoretically and computationally since the Neyman-Pearson lemma, see e.g. Refs. [15] [16] [17] [18] [19] . It relies on the (exact) duality between statistical tests and confidence intervals. Now, the well known brute-force algorithm to construct the so-called (Neyman) confidence belt is as follows. 1. The parameter θ 0 space is discretized or randomized uniformly. sets. Special care must be taken at this point while looping over the acceptance sets, because their union may yield sometimes discontinuous topologies depending on the specific ordering or acceptance principle of Step 3. For more information on this, see Appendix L. We illustrate this algorithm in Figure 1 for a single binomial uncertainty using the exact (MC based) log-likelihood ratio test statistic [16, 17, 19] as described above, where we see that the asymptotic χ 2 -approximation is quite good in this case, but the relative discrepancy grows reasonably large with small numbers. This MC based construction was introduced to high energy physics by Feldman and Cousins [19] . Clopper-Pearson procedure is also compared, which gives slightly different intervals than the exact LLR based. The basic property of the asymptotic LLR approximation is that the acceptance region threshold given by the χ 2 -distribution quantile, is independent of the local θ 0 value, unlike the exact Monte Carlo driven LLR test statistic and the CP construction. These both are exact procedures in a sense that their coverage is always larger than or equal to 1 − α. The lattice structure shows why < and ≤ operations make a difference with discrete numbers but not with continuous parameters. When constructing Figure 1 for LLR based variants, in Step 3., we used t ≤ t C , where t is the log-likelihood ratio test statistic and t C the cut value (χ 2 -quantile or Monte Carlo based). Similar care is required with the 'vertical direction' in Step 4. The approach described here is in principle generic, however, the construction in higher parameter dimensions can be technically challenging. 
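For the single-binomial case, a minimal Monte Carlo sketch of this belt construction with LLR ordering is given below. It is an illustration of the procedure (with our own grid and sample-size choices), not the paper's exact implementation, and it compares the exact-quantile scan with the asymptotic χ²₁ threshold and the Clopper-Pearson endpoints.

```python
import numpy as np
from scipy import stats
from scipy.special import xlogy

def llr_stat(k, n, p0):
    """Binomial log-likelihood ratio statistic (Eq. 3.8); xlogy handles k = 0 or k = n."""
    p_hat = k / n
    return 2.0 * (xlogy(k, p_hat) + xlogy(n - k, 1.0 - p_hat)
                  - xlogy(k, p0) - xlogy(n - k, 1.0 - p0))

def mc_llr_interval(k_obs, n, p_grid, cl=0.95, n_mc=20000, seed=0):
    """Neyman construction with MC-driven LLR ordering: accept p0 when the observed
    statistic lies below the MC-estimated cl-quantile of the statistic at p0."""
    rng = np.random.default_rng(seed)
    accepted = []
    for p0 in p_grid:
        t_cut = np.quantile(llr_stat(rng.binomial(n, p0, size=n_mc), n, p0), cl)
        if llr_stat(k_obs, n, p0) <= t_cut:
            accepted.append(p0)
    return min(accepted), max(accepted)

def asymptotic_llr_interval(k_obs, n, p_grid, cl=0.95):
    """Same scan with the p0-independent chi2(1) threshold (Eq. 3.9)."""
    ok = p_grid[llr_stat(k_obs, n, p_grid) <= stats.chi2.ppf(cl, df=1)]
    return ok.min(), ok.max()

k, n = 7, 12597                              # Gangelt fatality counts
grid = np.linspace(1e-5, 5e-3, 1000)         # scan region chosen around k/n
print(mc_llr_interval(k, n, grid))
print(asymptotic_llr_interval(k, n, grid))
print(stats.beta.ppf(0.025, k, n - k + 1),   # Clopper-Pearson endpoints, Eqs. 3.11-3.12
      stats.beta.ppf(0.975, k + 1, n - k))
```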
The difficulty in higher dimensions depends on the construction of the likelihood functions (parametric vs. non-parametric) and on the computational complexity of the Monte Carlo procedures. With several nuisance parameters, the profile likelihood approximation is typically used in Step 3 to keep the whole approach feasible. For K nuisance parameters, the profiled likelihood ratio complexity may scale (naively speaking) linearly, O(K), using e.g. simultaneous stochastic gradient search at each likelihood evaluation point, whereas a full hyperbelt scan is exponentially hard, O(N^K), where N is the number of discretization points in each dimension. Multiple extrema need in principle to be handled by the profiling procedure (an example is given in Section 3.4). Analogous computational challenges arise in Bayesian solutions with high dimensional integration, typically implemented with various Markov Chain MC and variational approximations.

We now consider a ratio r = p_1/p_2 between two binomial proportions, estimated by p̂_1 = k_1/n_1 and p̂_2 = k_2/n_2, where k_i denotes the number of successes and n_i the total number of trials. This setup models the uncertainty in the IFR estimate given by Eq. 2.1, the double ratio between the fatality rate and the infection rate. The full combinatorial setup of our problem is enumerated later in Table 2, which goes beyond the two independent binomial approximation. In this section we concentrate on the ratio between two binomial proportions, a problem which can be described using a 2 × 2 contingency table [20] or by algebraic statistics [21], but these formulations are not used in our case. Exact interval estimation of the two binomial sample ratio parameter seems not to be theoretically fully possible within the frequentist framework [22], nor is the calculation of the required generalized hypergeometric probabilities always feasible. However, it is possible within the Bayesian framework as shown in [23], which we derive in Section 3.6.

Nelson [24] considers confidence intervals for the ratio of two unknown Poisson mean occurrence rates. He proceeds by using binomials and constructs the conditional distribution of k_1 given k_1 + k_2 = N, which is Bin(N = k_1 + k_2, p = p_1 n_1 / (p_1 n_1 + p_2 n_2)). The maximum likelihood estimator for the ratio is r̂ = (k_1/n_1)/(k_2/n_2), which is biased, but it can be shown that no unbiased estimator exists. To obtain the confidence interval for the parameter p, the endpoints L and U are computed by inverting Eqs. 3.11 and 3.12 using Beta(k_1, k_2 + 1) and Beta(k_1 + 1, k_2), respectively. In principle, interval estimators other than Clopper-Pearson can also be used. The confidence interval for the ratio r = p_1/p_2 is then written as

CI = [ (n_2/n_1) L/(1 − L), (n_2/n_1) U/(1 − U) ] .   (3.15)

This approximation [25] is based on using the Wald test construction, a log-transform of the observed ratio and analytic error propagation by the standard delta method [26], which combines the central limit theorem and the first-order Taylor expansion g(θ) ≈ g(θ̂) + (θ − θ̂) g'(θ̂). In essence, the delta method is used for estimating the uncertainty on a non-linear function g(θ) of the parameter θ. Based on these tools, the standard error estimate of the logarithmic ratio is

ŝe[ln(r̂)] = [1/k_1 − 1/n_1 + 1/k_2 − 1/n_2]^{1/2} ,   (3.16)

and the confidence interval for the ratio is

CI_K = exp( ln(r̂) ± z ŝe[ln(r̂)] ) .   (3.17)

This Gaussianization of the ratio in log-space cannot be guaranteed to yield uniformly high precision results, especially for small n, but it results in a very simple formula.
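Applied to the Gangelt double ratio (k₁ = 7 fatalities out of n₁ = 12597 versus k₂ = 138 positives out of n₂ = 919), Eqs. 3.16-3.17 give, as a minimal sketch:

```python
import numpy as np
from scipy import stats

def katz_log_interval(k1, n1, k2, n2, cl=0.95):
    """Katz log interval for the ratio r = (k1/n1)/(k2/n2), Eqs. 3.16-3.17."""
    z = stats.norm.ppf(0.5 + cl / 2.0)
    r_hat = (k1 / n1) / (k2 / n2)
    se_log = np.sqrt(1.0/k1 - 1.0/n1 + 1.0/k2 - 1.0/n2)
    return r_hat * np.exp(-z * se_log), r_hat * np.exp(z * se_log)

lo, hi = katz_log_interval(7, 12597, 138, 919)
print(f"IFR CI95 ~ [{100*lo:.2f}, {100*hi:.2f}] %")   # roughly [0.17, 0.79] %
```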
Also, it is possible to combine this approach for example with a sinh −1 transform to optimize the interval lengths as suggested in [27] . Bootstrap A non-parametric Efron's bootstrap [28] proceeds via simulations by resampling with replacement the obtained sample and calculates the observable of interest for each bootstrap sample. Here, if we pick random numbers parametrically from two binomial distributions with parameters set to their maximum likelihood values, the results will be identical to those from non-parametric bootstrap. In general, this is not the case with more complicated distributions and sampling scenarios. The most common first order method with a coverage correct up to terms proportional to O(n −1/2 ), is to obtain confidence interval estimates based on taking the quantiles (percentiles) of the generated bootstrap sample {θ * }, known as the percentile bootstrap (PRC). This assumes the the bootstrap distribution is a good proxy for the underlying true distribution. The confidence interval estimate is obtained by ordering B bootstrap sample estimates θ * 1 ≤ θ * 2 ≤ · · · ≤ θ * B . A different variant, usually known by the name 'basic bootstrap', is to assume that bootstrap generates a good proxy of the error e * = θ * −θ, and then obtain the confidence interval with [2θ − θ * 1−α , 2θ − θ * α ]. We do not consider the basic bootstrap further here. Well known extensions of the percentile bootstrap are are the so-called bias corrected (BC) and bias corrected with acceleration (BCA) bootstrap [29] . Under certain assumptions and using asymptotic Edgeworth expansion techniques, it was shown by Hall that the second-order BCA bootstrap coverage is correct up to order O([n −1/2 ] 2 ) [30] . The BCA confidence interval estimate is whereĜ is the empirical CDF of the bootstrap sample statistics and Φ the standard normal CDF. The bias correction isẑ 0 and the acceleration term isâ, which can be negative, is to account for non-uniform variance. To construct the polynomial acceleration estimate, the jackknife residuals d i are needed whereθ (i) is one of the jackknife estimates obtained by dropping the i-th data point, and proceeding with this n − 1 sized sample as with the original data sample. The whole construction is motivated by doing a monotone normalizing transform m : θ → ϕ with a statisticŝ ϕ ∼ N (ϕ − z 0 σ ϕ , σ 2 ϕ ), with σ ϕ = 1 + aϕ. The interval construction is done in the transformed space, and finally the endpoints are inverse mapped with m −1 . The caseẑ 0 ≡â ≡ 0 reduces identically to the percentile bootstrap and the casê z 0 = 0,â ≡ 0 is the case of bias correction without acceleration. The profile likelihood method splits the parameters into two groups: true parameters of interest θ and nuisance parameters ξ, and maximizes the full likelihood over the nuisance parameters where sup denotes the supremum, the least upper bound, which is almost the same as the maximum but takes into account the possibility that the likelihood cannot be evaluated exactly at that point ξ. The main idea behind profiling is the dimensional reduction over the nuisance parameters, which then allows one to infer the uncertainty on θ by formally proceeding as with a usual likelihood, for example by using the score test or the likelihood ratio test which are asymptotically equivalent. Solutions based on the score test for the two binomial case have been proposed in [31, 32] . We shall now derive the likelihood ratio test based solution. 
Let us parametrize r ≡ p 1 /p 2 and write down the joint likelihood function for two independent binomials This re-parametrization does not involve the change of variables formula (Jacobian), because the likelihood as a sampling function and its volume normalization is over k 1 and k 2 , which we left intact. We are interested in the parameter r and treat the parameter p 1 as a nuisance parameter, which we profile out by finding a value for p 1 which maximizes the joint likelihood for every single value of r. This procedure is in principle readily generalized to arbitrary number of nuisance and true parameters of interest, which however is only a formal statement. Possible singularities depend on the exact type of nuisance parameters and models. A practical problem in higher dimensional cases is also the parameter optimization problem itself. The profile log-likelihood ratio test statistic follows here asymptotically The fact that χ 2 -asymptotics holds also for the profile likelihood inference is a non-trivial result, but propagates from Wilks' theorem under certain assumptions. After maximizing Eq. 3.27, the profile log-likelihood ratio is where the maximum likelihood estimates arê and the local profile extremum solution p * 1 conditioned at point r 0 has two roots p * 1 (r 0 ) = k 1 + n 2 + k 2 r 0 + n 1 r 0 ± [(−k 1 − n 2 − k 2 r 0 − n 1 r 0 ) 2 − 4(n 1 + n 2 )(k 1 r 0 + k 2 r 0 )] 1/2 2(n 1 + n 2 ) , (3.31) -11 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. where the negative (−) branch gives the right solution in our problem. It is not guaranteed that every profile likelihood problem is differentiable and has a closed form solution, but this turned out to be the case here. The profile log-likelihood ratio is illustrated in Figure 2 , which has asymmetric 95% confidence interval endpoints around the maximum likelihood value. Notation Q68, Q95 are used for pure density quantiles and CI68, CI95 for a parameter estimate confidence (frequentist) or credible (Bayesian) intervals. This elementary simulation approach starts with three Bernoulli random numbers: tests ∼ B T , infections ∼ B I and deaths ∼ B F , which are together per person modelled as a 3-dimensional Bernoulli distribution. To remind, the Bernoulli distribution is the underlying distribution behind the binomial distribution, which turns into a Poisson distribution when p is small and n is large. Thus, this approach is ab initio in this hierarchy of distributions. Now in general, a D-dimensional Bernoulli requires 2 D − 1 free parameters. However, we do not have enough measurements here to constrain all the parameters. To simplify this problem, we factorize B T to be independent of B I and B F That is, tests do not (hopefully) affect infections or fatalities. This leaves us with one and two dimensional sub-problems which require together 1 + 3 parameters. The two-dimensional problem can be parametrized directly with four probabilities of (B I , B F )-binary combinations, which sum to one. Another parametrization uses the expectation values E[B I ], E[B F ] and the correlation coefficient 1] . We use both in order to be able to sample correlated Bernoulli variables in the direct (multinomial) basis, which satisfy by definition the conservation of probability, and to give an interpretation of the problem in the correlation basis. The formulas are given in Appendix A and B. 
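A schematic version of this sampling step is sketched below. It is an illustration under the stated assumptions only (maximal positive (B_I, B_F) correlation, i.e. no fatality without infection, a fixed test sample drawn independently of infection and fatality status, and no type I/II test errors); the exact formulas of Appendices A and B may differ, and the full 8-cell bookkeeping of Table 2 is reduced here to the quantities needed for the smeared IFR.

```python
import numpy as np

rng = np.random.default_rng(1)
n_P, n_T = 12597, 919            # population size and (fixed) test sample size
p_I = 138 / 919                  # infection probability (study sample rate)
p_F = 7 / 12597                  # fatality probability

# Joint (B_I, B_F) cell probabilities at maximal positive correlation:
# the boundary condition "no fatality without infection" sets P(I=0, F=1) = 0.
p_cells = [1.0 - p_I, 0.0, p_I - p_F, p_F]   # cells (0,0), (0,1), (1,0), (1,1)

n_mc = 100000
counts = rng.multinomial(n_P, p_cells, size=n_mc)   # per-run population counts
n_I = counts[:, 2] + counts[:, 3]                   # all infected
n_F = counts[:, 3]                                  # all fatal
# Tests drawn independently of (I, F): infected among tested is hypergeometric.
n_IT = rng.hypergeometric(n_I, n_P - n_I, n_T)

ok = n_IT > 0
ifr_smeared = n_F[ok] / (n_P * n_IT[ok] / n_T)      # extrapolated ("smeared") IFR
print(np.quantile(ifr_smeared, [0.025, 0.5, 0.975]))
```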
-12 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. where the count variables and their associations to the underlying sets of tested T, all infected I in the city and all fatal F are as described in Section 2. Here, one must pay attention to Eq. 3.34, where the study sample infection rate is assumed to be representative in the simulation for the whole population by assuming homogeneity between samples. A systematic uncertainty could be associated here. By choosing in Eq. 3.36 the maximum possible positive correlation coupling (see Appendix B), the simulation output in Table 2 reproduces the event count observables, which enter as the input variables. That is, its value is fixed by data. Also, a boundary condition is used which states that no fatalities happen without getting infected. This forbids the combinations 1 and 5 in Table 2 from appearing. The total number of people n P = |P| is kept fixed for each MC run. We have also fixed the test sample size n T = |T| to be constant in these simulations, to follow more closely the Gangelt setup, but turning on the Bernoulli fluctuations is implemented in the code as an option. However, with the given event counts the difference is not significant for the IFR. Type I and II errors of tests are not simulated here. Once calibrated, including their effect as a post processing step is trivial with two free parameters and an additional coin flipping per tested person. For more details, see Appendix G. -13 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Using the generated Monte Carlo samples, arbitrary observables such as the IFR are computed by simply counting numbers from an (8 × N M C )-dim matrix, here N M C = 10 6 , and accumulating numerically the relevant point estimates such as mean values and percentiles. Note that this aggregated matrix is the fully sufficient statistic and contains all simulation information, due to binary random variables. These results are given Table 2 . The simulated distributions for the infection fatality rates are shown in Fig. 3 , which illustrates the non-trivial Dirac's comb discrete characteristics of the problem, but also the finite sample smearing effect on the extrapolation estimate. The smeared IFRdistribution is calculated from the test samples and the reference IFR-distribution from the inaccessible (full) population statistics, which are both obtained simultaneously in the simulation. See Appendix A for the exact definitions. To this end, we may summarize that the power of this simulation is the 'full phase space' modelling of partially overlapping sets of tested, infected and fatal, which is not possible with independent binomial ratios. The simulation is based on describing the most elementary classical stochastic process involved, namely correlated Bernoulli coins. The optimal confidence interval estimator can be based on the simulations as described in Sec. 3.2 where each simulation with fixed input parameters simply generates a sample for a single null hypothesis H 0 . Alternatively, faster bootstrap approximations can be used. 
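For instance, a parametric percentile bootstrap of the double ratio takes only a few lines (a sketch; with binomial resampling at the MLE values it coincides with the non-parametric bootstrap, as noted above, and no bias correction or acceleration is applied here):

```python
import numpy as np

rng = np.random.default_rng(2)
k1, n1 = 7, 12597      # fatalities / population
k2, n2 = 138, 919      # positives / tested
B = 200000

k1_star = rng.binomial(n1, k1 / n1, size=B)
k2_star = rng.binomial(n2, k2 / n2, size=B)
ok = k2_star > 0
r_star = (k1_star[ok] / n1) / (k2_star[ok] / n2)   # bootstrap replicas of the IFR

# Percentile bootstrap CI95
print(np.quantile(r_star, [0.025, 0.975]))
```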
In the exact Bayesian inference within two independent binomial distributions, we keep the number of observed events k 1 , k 2 as fixed numbers, also n 1 , n 2 , and calculate the joint posteriori density for the binomial parameters p 1 and p 2 . Their joint posteriori density is described with a product of two beta distributions given generic beta priors Beta(α i , β i ) and binomial likelihoods. The derivation of this and priors are given in Appendix C. Given this joint posterior, the ratio r ≡ p 1 /p 2 density is obtained via change of variables such that p 1 = ry and p 2 = y. Writing down the Jacobian determinant gives |∂(p 1 , p 2 )/∂(r, y)| = |y|. Then we substitute these new variables in Eq. 3.38, include the determinant -14 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this this version posted November 22, 2020. and integrate over y Using Mathematica, we obtain for this integral a representation where 2 F 1 is the regularized Gauss hypergeometric 2 F 1 function and B is the Euler beta function. The regularized version is 2 F (a, b, c, r) ≡ 2 F 1 (a, b, c, r)/Γ(c). Equation 3.40 represents the master formula, which can be evaluated numerically with high precision special function libraries and credible intervals can be obtained with standard numerical integration techniques. However, we found that numerically it is easier to use directly Eq. 3.39. The results are given Figure 4 , where the joint posteriori distribution includes the credible regions (CR), which encapsulate 68 and 95 percent of the probability mass. The shape is constructed according to the natural contour lines. The right figure shows the ratio posteriori density and the corresponding cumulative distribution by using two different prior distributions. The solid line is obtained using the non-informative Jeffreys prior Beta(1/2, 1/2), which is invariant under coordinate transformations. It is proportional to the square root of the Fisher's information determinant p(θ) ∝ det I(θ), where the determinant represents abstract information volume (here in one dimension the determinant is trivial). The results with dashed lines are obtained using a unit flat prior, which is not completely non-informative and results in slightly larger values. Its maximum (mode) gives numerically the same estimate for the IFR as the simple ML estimate. Nuisance parameters and systematic uncertainties Bayesian framework allows one to add nuisance parameters and systematic uncertainties into the formulation. For example: the death counts k 1 may be need to be scaled with a parameter γ due to time delays. Note that scaling the parameter p 1 instead is ambiguous, which is seen using the binomial pdf and by computing the Fisher information -15 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. matrix, which will turn out to be singular. Which means that the corresponding parameter estimation problem would be rank deficient. Similarly the positive test counts k 2 may be multiplied with another scale λ. Using auxiliary measurements or prior judgement, the uncertainty information on the nuisance parameter is often modelled using a Gaussian prior π γ (γ; µ γ , σ γ ) with fixed µ γ , σ γ . 
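Before discussing the choice of nuisance priors further, we note that the ratio posterior of Eq. 3.39 is also easy to evaluate by direct numerical integration; a minimal sketch with Jeffreys Beta(1/2, 1/2) priors and the Gangelt counts (the grid and integration choices are our own):

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid, cumulative_trapezoid

k1, n1 = 7, 12597                     # fatalities / population
k2, n2 = 138, 919                     # positives / tested
a1, b1 = k1 + 0.5, n1 - k1 + 0.5      # posterior Beta parameters, Jeffreys prior
a2, b2 = k2 + 0.5, n2 - k2 + 0.5

def ratio_pdf(r):
    """Posterior density of r = p1/p2 (Eq. 3.39): integrate the product of the
    two beta posteriors over y, including the Jacobian factor |y|."""
    y = np.linspace(1e-6, min(1.0, 1.0 / r), 4000)
    return trapezoid(stats.beta.pdf(r * y, a1, b1) * stats.beta.pdf(y, a2, b2) * y, y)

r_grid = np.linspace(1e-4, 0.02, 400)
pdf = np.array([ratio_pdf(r) for r in r_grid])
cdf = cumulative_trapezoid(pdf, r_grid, initial=0.0)
cdf /= cdf[-1]                                      # normalise on the scanned range

print(np.interp([0.025, 0.5, 0.975], cdf, r_grid))  # equal-tailed CI95 and median
```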
However, with a positive definite scale, using the gamma prior could be more suitable, although practical difference can be small. This is applied on the Bayesian inference master formula as an additional prior constraint term and by replacing k 1 → γk 1 everywhere. Computationally, each nuisance parameter requires typically an additional integral when marginalizing the posterior and in the normalization (Bayes denominator). For more details, see Appendix D. Table 3 shows the numerical results for different confidence interval estimators, using count data of the Gangelt study and Figure 5 similarly, but as a function of death counts (rather than for just the observed F = 7). Clear outliers in the group of these estimators are the normal (Wald) test based and the bootstrap percentile estimator. Their interval is shifted toward smaller IFR values, which is especially visible with CI95 intervals. The Wald test will also give negative (unphysical) values at small F . This can be expected from their mathematical construction. The impact of the bias correction and acceleration for the bootstrap is clear. The rest of the estimators yield numerically similar values for the Gangelt input data and small differences are more easily seen from Figure 5 . Compared with the Monte Carlo simulations of Figure 3 , the Bayesian distributions of Figure 4 are completely smooth because the observed event counts are considered fixed and the continuous binomial parameters p 1 , p 2 are considered random. In contrast, the simulations are closer to a frequentist inference, because the model parameters are fixed and the discrete event counts are random. Though technically speaking, the underlying simulation can be used within both philosophies. -16 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Coverage probability simulations and interval widths are given in Appendix L for the single binomial based estimators. These illustrate significant undercoverage of the Wald test based estimator, the conservative coverage of the Clopper-Pearson based estimator and the 'bracketing' coverage behavior of the Wilson score and Bayesian estimators. The likelihood ratio with an asymptotic χ 2 approximation has behavior similar to the Wilson score, but significantly undercovers at very small values of p, similarly to the the Wald test. -17 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this this version posted November 22, 2020. ; https://doi.org/10.1101/2020.11.19.20235036 doi: medRxiv preprint The previous discussion in Sections 2 and 3 is only fully applicable under the asymptotic time t → ∞ limit or instantaneous action, i.e., no time delays. To be concrete, by time asymptotic we mean the tail of a single epidemic outbreak and neglect additional possible complications (immunology evolution, different viral strains) which result from overlapping epidemic 'waves'. However, purely mathematical overlap is automatically handled by our description, that is, we do not assume any specific epidemic shape for the time-series input. In this section, we briefly outline how the IFR estimation and its uncertainty is implemented during an evolving epidemic with finite time delays. 
For the interested reader, further details of our time evolution study can made available upon request. A combined double delay effect can be summarized with one ratio function where K F is the delay kernel (pdf) from infections to deaths, K S is the delay kernel from infections to seroprevalence (antibodies) and the symbol * denotes a linear time-lag convolution integral. These kernels are extracted from data by fitting them typically with Weibull or log-normal distributions, see e.g. supplementary material of the Geneva study in [33] and references there, where a similar convolution calculus was used. The kernels are given in Appendix H, which shows explicitly the expected delays. The relative differences between K S (t) and K F (t) drive Eq. 5.1. The calculus here is general, and one may replace the antibody type tests with PCR type tests by replacing the delay kernel K S . In that case an additional multiplicative effect to include is the 'viral shedding' (loading) period probability, i.e., how long an infected person gives a positive test result. That evaporation factor can be neglected with antibody based serology (seroreversion effect), if the half-life involved is large enough on the scale of epidemic. For some specific antibodies, this may not be the case. The denominator of Eq. 5.1 can be extended to incorporate it, see Appendix H for that construction. Basic numerical integration is used here for the convolutions. Time t denotes the time of the seroprevalence determination and ∆t ≥ 0 denotes how many days later the population cumulative death count is taken. Because our problem here is essentially an inverse problem, the underlying cumulative infection countÎ(t) can be estimated computationally by regularized deconvolution of the reported positive cases dC(t)/dt of PCR viral tests. Daily counts need to be used in the inversion instead of cumulative counts, to conserve all information. Although even if one assumes a constant reporting rate,Î(t) can be estimated only up to an unknown scale (probability), which however fortunately cancels in Eq. 5.1. In principle, one may also use the reported daily death counts dF (t)/dt to obtain the deconvolution inverse estimate ofÎ(t), albeit the statistics might be too limited. For technical details about the deconvolution algorithm, see Appendix H. The algorithm is based on non-negative linear least squares with Tikhonov smoothness regularization. Regularization is needed, because basically all naive inversion procedures always amplify the counting fluctuations (noise). No fine structure recovery is needed, thus smoothness is a good functional prior in this problem. By using Eq. 5.1, the delay corrected non-equal time IFR estimate is now where F (t) is the population cumulative death count andÎ S (t) is the population level seroprevalence (extrapolated) estimateÎ S (t) = n P × n I∧T /n T . Here n P is the population size, n I∧T is the number of -18 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this this version posted November 22, 2020. ; infection positive in the demographically randomized test sample and n T is the test sample size. By non-equal time we refer here to the shift by ∆t, which can be optimized after the seroprevalence test. No delay correction is needed, if t or ∆t are chosen (or happen to be) with certain lucky values. 
This depends on interplay between three factors: 1. the delay kernel K F (t) of deaths, 2. the delay kernel K S (t) of antibodies (seroprevalence) and 3. the cumulative epidemic curveÎ(t). In our SARS-CoV-2 case, using kernels from [33] , the kernels give a functional shape for ψ(t, ∆t) which peaks above one for small t, then decreases below one, and asymptotically approaches one when t → ∞. However, eventually the antibodies will vanish from the body (seroreversion), so realistic times scales must be used, also for other obvious reasons. The systematic uncertainty estimates should include perturbation of the kernels and the estimated I(t) function, most easily studied via toy Monte Carlo, propagated through the deconvolution algorithm and Eqs. 5.1 and 5.2. The full procedure to estimate ψ(t, ∆t) is illustrated in Figure 6 , based on kernel data from [33] and Switzerland data from [34] . For the uncertainties, we used approximately 20 % Gaussian equivalent uncertainties in the Weibull kernel parameters and fluctuated the input data with Poisson uncertainties, propagated via toy Monte Carlo. A good re-projection of the deconvolved I(t) to deathsF (t) is observed shape wise, as a 'closure test'. We emphasize that this closure test would be trivial, if the daily death count dF (t)/dt would have been used as the algorithm input. But we used the reported daily PCR cases dC(t)/dt as the input, so the result is non-trivial. The absolute normalization forÎ(t) andF (t) is matched to data for the visualization, because it is not obtained as a part of the procedure. The scale function ψ(t, ∆t) has larger uncertainty and larger values earlier in the epidemic, which is natural. Optimal and practical procedures Now in principle, after observing the epidemic time series, we can determine the optimal delay argument ∆t of the ψ-function for each fixed prevalence determination time point t by setting ψ(t, ∆t) = 1, and inverting the best ∆t value numerically. This gives us the time point t + ∆t to read out the death counts. Alternatively, we use Equation 5.2 directly with some chosen ∆t value and obtain the correction factor given by ψ, which is a more flexible option. This is because numerically equivalent ∆t value can be in the asymptotic future, if the epidemic evolution has already saturated (no more counts). Also in practise the right hand tail truncation (causality) of data must be treated explicitly e.g. in kernel estimation in very early phase. The strategy of using the inversion machinery discussed here can be somewhat non-conservative. As a more conservative strategy, Figure 6 shows that using a fixed read-out delay ∆t = 7 days gives already a quite good choice as long as t is after the peak of the daily deaths. Before that point, it yields upward biased IFR values. Similar behavior was observed also with other public datasets, which basically follows from the underlying functional shape of the epidemic curve. However, these conclusions are not without uncertainties and depend ultimately on data, as formulated in Eq. 5.1. The uncertainties related to kernels and their parametrizations are never very rigorous with a novel virus. Thus, from a prevalence test design point of view, an optimal choice for precision IFR estimates is to use a prevalence determination which has not been done too early in the local epidemic evolution. In what follows, we explicitly evaluate the IFR values with different fixed ∆t values as a transparent and practical procedure. 
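A toy version of this calculation is sketched below: two lagged convolutions of a synthetic daily-infection curve with hypothetical Weibull kernels, with ψ taken as the death-side cumulative convolution at t + Δt over the seroprevalence-side one at t (our reading of Eq. 5.1). The kernel parameters are illustrative only, and the regularized deconvolution step that produces Î(t) from the reported cases is omitted.

```python
import numpy as np
from scipy import stats

t = np.arange(0, 200)                              # days
dI = stats.norm.pdf(t, loc=60, scale=12)           # toy daily infections (shape only)

# Hypothetical Weibull delay kernels (infection -> death, infection -> seroconversion)
K_F = stats.weibull_min.pdf(t, c=1.8, scale=24.0)
K_S = stats.weibull_min.pdf(t, c=2.2, scale=18.0)

def cum_delayed(dI, K):
    """Cumulative delayed counts: running integral of the lagged convolution K * dI."""
    return np.cumsum(np.convolve(dI, K)[: len(dI)])

F_exp = cum_delayed(dI, K_F)    # expected cumulative deaths (arbitrary common scale)
S_exp = cum_delayed(dI, K_S)    # expected cumulative seroconverted (same scale)

def psi(t_day, dt):
    """Correction factor of Eq. 5.1 in the convention stated above; -> 1 as t -> inf."""
    return F_exp[t_day + dt] / S_exp[t_day]

for day in (70, 90, 120, 180):
    print(day, [round(psi(day, dt), 3) for dt in (0, 7, 14)])
```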
In addition we show results with an optimal delay solved from ψ(t, ∆t) function using the same kernels globally for each region, as an approximation. Both of these procedures have their pros and cons, as discussed here. The fixed delay case is essentially a special case of the latter, and even the complete procedure with fully known (oracle) kernels relies on a specific assumption of delay kernels being invariant (constant) over time. This time invariance is the defining property of the convolution integral. Also, whenever the daily reported positive PCR cases are used in the inversion, it may be necessary to normalize the counts by non-constant test rates e.g. due to active policy changes of public test campaigns. -20 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this this version posted November 22, 2020. ; https://doi.org/10.1101/2020.11.19.20235036 doi: medRxiv preprint In general, the value of IFR for a given individual is dependent on a number of factors such as age, sex, viral load, diet, genetics etc. For example, estimates of IFR in SARS-CoV-2 were found to have a strong age dependence in [33] and [35] , varying over two to three orders of magnitude as a function of age. Mathematically, let there be a multivariate IFR function IFR (X = (age, sex, diet, genes, . . . )) : which takes as an input a random human feature vector X and returns the expected probability to die Y if infected. In Bayesian modelling, the random feature vector is often composed into measured X, and identified to be important but not necessarily measured variables Z, such that, Given enough, well sampled data, the function IFR(X) ≡ P (Y |X) can be modelled using simple histograms, (conditional) logistic regression, deep learning learning techniques or other methods. However, given the naturally limited data available early on in a pandemic, the IFR function can be integrated over its dependents to obtain a local expected IFR for a particular population, where f j (X|city) is the normalized sampling density of the population. A city is chosen here to represent a realistic system size in terms of sample statistics, which can be considered as independent from other systems. Any significant difference in the population densities f (X|city), between cities, will yield different empirical IFR j values. If the IFR function has a strong dependence over a particular feature, a bias with respect to the other city will be present. When comparing studies implemented in different cities, this intrinsic sampling bias can be compensated in two ways: 1. physically, a priori, using carefully designed sampling (selection) of the population 2. mathematically, a posteriori, using an inverse weight or sampling function, which can be modelled using detailed demographic statistics of the test sample and the city. Using either the strategy 1 or 2, one must also choose a reference population density or a 'standard template' to represent a typical demography. This provides the golden reference for the sampling procedures. Once this sampling or stratification bias is accounted for, a truthful comparison of IFR values can be obtained. The expected raw global value without any re-sampling schemes can be modelled with a mixture density model as where the weights are w j = N j / i N i and N j is the total number of people in the city. -21 -All rights reserved. 
No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this this version posted November 22, 2020. ; https://doi.org/10.1101/2020.11.19.20235036 doi: medRxiv preprint Table 4 shows a number of studies performed to determine the prevalence of SARS-CoV-2. The datasets are chosen to represent Western cities with varying population sizes and reasonably similar demographics, and all studies, except for Iceland and Stockholm, are based on an antibody type blood tests. Count data for each dataset are given in Table 5 . The daily reported cases and death counts for different cities and regions are collected from public databases [34, [36] [37] [38] with full time series information, except Gangelt, for which we use only the death counts as given in their report. All data we used is available within our online analysis code. We do not do any further adjustments to count data, such as try re-correct or verify type I (false positive) or type II (false negative) errors of the antibody tests but we do evaluate their estimated effect on the uncertainties (see Appendix G), compensate for underlying health conditions of the individuals (cause of death ambiguity) or try to adjust for demographic sampling differences. The effective population counts n P are from Wikipedia, which may be biased for some regions. It should be noted that the true prevalence can be determined only by full population (antibody) testing with unit sensitivity and specificity, which is not attainable with current technology. For more information about possible hidden sources of systematic uncertainties, perhaps correlated between different studies, see Appendix K. The data are collected over different time periods, T = [t 0 , t 1 ], are of different sample sizes and cover different points during the developing epidemic. We collect the death counts at times t + ∆t, and take a moving average of deaths over the prevalence determination period T to take into account the finite span of the test period. These averaged and rounded death counts are shown in Table 5 . We study the dependence of the results for different ∆t values to be able to Prevalence test period T Type Age Finland (FIN) [39] [2020-06-01, 2020-06-14] IgG 0-69 Los Angeles (LAC) [40] [2020-04-10, 2020-04-11] IgG/IgM all Santa Clara (SCC) [41] [2020-04-03, 2020-04-04] IgG/IgM all San Francisco (SFR) [42] [2020-04-23, 2020-04-27] IgG all Iceland (ISL) [43] [2020-04-04, 2020-04-04] PCR all Gangelt (GAN) [2] [2020-03-31, 2020-04-06] IgG/IgA all Geneva (GVA) [44] [2020-05-04, 2020-05-09] IgG all New York City (NYC) [42] [2020-03-23, 2020-04-01] IgG all Miami-Dade (MIA) [42] [2020-04-06, 2020-04-10] IgG all Region Stockholm (STK) [35] [2020-03-26, 2020-04-02] PCR all Philadelphia (PHI) [42] [2020-04-13, 2020-04-25] IgG all Table 4 . Studies for the determination of SARS-CoV-2 prevalence together with the type of the test used. The references for each study are provided in the table. Test type acronyms: Ig(G,M,A) is immunoglobulin type-(G,M,A) antibodies and PCR is polymerase chain reaction based amplification of viral RNA. We use a global fixed values for the test sensitivity v = (0.892 ± 0.02) and the specificity s = (1 − 6 · 10 −3 ± 1.4 · 10 −3 ), obtained by averaging and taking standard error on the mean with input from [45] . 
Alternatively, each dataset could be associated local central values and their estimated uncertainties, which is supported by our code. Studies have corrected their count data for sensitivity and specificity with methods similar to derived in Appendix G. -22 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. explicitly show the IFR estimate sensitivity on the epidemic time-evolution. The datasets are with different prevalence rates, some of them quite low. This means that especially then the test specificity can be a problem, i.e. test errors cannot be anymore reliable corrected for. See Appendix G for more information on this important aspect. We estimate this uncertainty using the methods described in the appendix, with the test sensitivity and specificity obtained by taking global averages from [45] . Certain regions such as New York, Los Angeles or Stockholm were in a fast evolving stage in the epidemic evolution, when the prevalence studies were performed. This is seen in the Table 5 fatality counts, as the counts change significantly for different values of ∆t. In contrast, Geneva or Finland were already in a stable stage, thus time delays will have a minimal impact for estimating the IFR. In the following, we will determine the IFR for each dataset, and study the impact of fixed time delay to account for time evolution. Based on Figure 6 , the choice of ∆t 7 days can be considered a reasonable conservative approximation, which most likely overestimates the IFR very early on the epidemic curve, then being close to optimal choice at larger t values. The optimal ∆t for each dataset is solved as outlined in Section 5. Future precision estimates in terms of time delays require careful kernel extraction for each dataset (region) individually. The datasets from the studies in Table 4 can be considered simultaneously to study the global IFR. In this section, we describe several strategies to do so. In general, we assume that the IFR values in each dataset are integrated values over all physical properties, and that the sample used in the study is chosen such that the single IFR value obtained will be representative of the whole population. Furthermore, we assume statistically independent infection rates in each city, because the systems are physically long distance isolated and do not contain common infection sources. An exact decomposition of the sources of variance in the IFR, e.g. due to different demographies both within (local) and across (global) populations, is clearly not uniquely solvable. However, meaningful statistical estimates can be obtained. For comparisons, see [46] . -23 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The following is a brief overview of the methods used in Section 6.3. In the following, r j represents the observed IFR values for j = 1...K independent studies. For some early work on the comparison analysis of similar experiments, we refer to Cochran [47] . This non-parametric estimator for the meta-analysis was in its simplest form proposed by DerSimonian and Lard [4] . This method aims to determine the mean and variance of the parent distribution of r. 
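Returning briefly to the test-error corrections mentioned above: the size of the effect can be illustrated with the standard prevalence correction for imperfect sensitivity v and specificity s (a Rogan-Gladen-type estimate, similar in spirit to, but not necessarily identical with, the Appendix G treatment), using the global values quoted with Table 4.

```python
def corrected_prevalence(k, n, sensitivity=0.892, specificity=1.0 - 6e-3):
    """Standard correction of an apparent test-positive rate for imperfect
    sensitivity and specificity (Rogan-Gladen type estimate), clipped to [0, 1]."""
    apparent = k / n
    denom = sensitivity + specificity - 1.0
    return min(max((apparent + specificity - 1.0) / denom, 0.0), 1.0)

# The effect is largest for low-prevalence samples:
print(corrected_prevalence(20, 1000))    # 2 % apparent rate -> ~1.6 % corrected
print(corrected_prevalence(138, 919))    # Gangelt-like 15 % apparent rate -> ~16 %
```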
The mean is determined as, with the global variance or 'heterogeneity' given by [48] where the test statistic is Q = K j=1 w j (r j −r) 2 and s 2 j represents the estimated variance within each study. We use a two step, iterative procedure in whichr and∆ 2 are first determined using, and then bothr and ∆ 2 are updated by setting w j = 1/(s 2 j +∆ 2 ). (6.10) Local convergence is obtained typically after a few iterations. The asymptotic standard error onr is given byŝ which provide Wald test-like confidence intervals. This method does not yield uncertainty on∆ 2 . In general, a finite sample error is to be expected directly based on the sampling error in the study-specific variance estimates s 2 j . Several weighting scheme variants of the DL estimator have been proposed. See Ref. [48] for a recent comparison study, where the two-step DL estimator was found to be among the best, but the original DL performed weakly in some scenarios. Hardy and Thomson [49] used parametric normal-normal hierarchy with sampling densities r j ∼ N (θ j , s 2 j ) (6.12) θ j ∼ N (r, ∆ 2 ). (6.13) -24 -All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this this version posted November 22, 2020. ; https://doi.org/10.1101/2020.11.19.20235036 doi: medRxiv preprint After integrating out the latent θ j , this gives a normal marginal distribution r j ∼ N (r, s 2 j + ∆ 2 ). The total joint log-likelihood with K contributing studies is . (6.14) It is easy to change the underlying sampling densities e.g. to a log-normal which takes effectively into account the physical boundary r > 0. The maximum likelihood solution can be obtained via standard optimization techniques or by iterating the following equationŝ The two-dimensional confidence region on these parameters is obtained with the log-likelihood ratio 2 ln L(r, ∆ 2 ) > 2 ln L(r,∆ 2 ) − χ 2 2,1−α . (6.17) By profiling, we obtain the individual confidence intervals for r and ∆ 2 . It is also straight-forward to extend this normal-normal model to fully Bayesian hierarchies as described in [50] . In that case Markov Chain MC sampling is typically used for obtaining the posterior density, which requires special technical care and is most easily dealt with specialized libraries. As an alternative to the simple normal likelihood based, the so-called Restricted Maximum Likelihood (REML) method was introduced by Patterson and Thomson [51] for unbiased estimates of variance components in linear mixed models. See Ref. [52] for a detailed derivation within the correlated and full multivariate formulations. The Wasserstein metric barycenter or the Fréchet mean [53, 54] is an optimal transport (OT) based approach, which solves the optimization problem P (r) = arg min P K j=1 w j W p p [P (r), P j (r)], (6.18) where P is the optimally combined new density under the p-Wasserstein metric W p , which is a geodesic transport metric in the space of densities. See Appendix I for more details and Ref. [55] for statistical properties. The weights can be taken as w j ∝ 1/s 2 j , if the solution is taken to be (inversely) proportional to the sample variances, for example. The solution is found by discretizing the 1D-posterior densities for each dataset and constructing the barycenter as an average in the inverse CDF space of quantile functions. 
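A minimal sketch of this quantile-averaging construction for the 2-Wasserstein metric (densities given on a common grid; the inputs below are toy densities for illustration only):

```python
import numpy as np

def wasserstein_barycenter_1d(r_grid, pdfs, weights, n_q=2000):
    """W2 barycenter of 1D densities: weighted average of the inverse CDFs
    (quantile functions), returned as a quantile function on levels u."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    u = np.linspace(1e-4, 1.0 - 1e-4, n_q)           # common probability levels
    q_bar = np.zeros_like(u)
    for w, pdf in zip(weights, pdfs):
        cdf = np.cumsum(pdf)
        cdf /= cdf[-1]
        q_bar += w * np.interp(u, cdf, r_grid)        # inverse CDF at levels u
    return u, q_bar

# Toy example: two Gaussian-like IFR posteriors on a common grid, weighted by 1/s_j^2
r = np.linspace(0.0, 0.02, 4000)
p1 = np.exp(-0.5 * ((r - 0.004) / 0.001) ** 2)
p2 = np.exp(-0.5 * ((r - 0.008) / 0.002) ** 2)
u, q = wasserstein_barycenter_1d(r, [p1, p2], weights=[1 / 0.001**2, 1 / 0.002**2])
print(np.interp([0.025, 0.5, 0.975], u, q))           # CI95 and median of the barycenter
```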
We also tried a Sinkhorn-iteration based algorithm using an entropy-regularized transport cost formulation known as Bregman projections [56], with the minimum regularization set such that a numerically stable output was obtained. However, this approach seemingly resulted in an over-smoothed output, which is mathematically expected due to the entropic approximation.

The arithmetic mean of posterior densities is

  P(r) = Σ_{j=1}^{K} w_j P_j(r),   (6.19)

which has a mixture-model interpretation; the weights can be taken as in the optimal transport case. The density interpolation properties can be more limited than in the optimal transport case, which can be crucial if the idea behind combining the posterior densities is to find one common data-generating distribution. For multimodal densities, the mean estimator is an inclusive, probability-mass covering estimator.

The normalized product (geometric mean) of posterior densities is

  P(r) = (1/Z) Π_{j=1}^{K} [ P_j(r) ]^{w_j},   (6.20)

where the constant Z provides the re-normalization. This approach is known in machine learning as the product of experts model by Hinton [57], in which several simple models are combined to 'vote' together; it is also called logarithmic pooling. For multimodal densities, the product estimator is thus an exclusive, single-mode seeking estimator (a small numerical sketch contrasting the mean and the product is given at the end of this overview). Using information theory, this approach can be derived with e.g. the so-called α-divergence of Chernoff, which is a generalization of the standard Kullback-Leibler (KL) divergence (relative entropy). The work by Amari [58] shows that the product is an optimal solution to a divergence risk minimization problem with α = 1 (reverse KL), and that the arithmetic mean of Eq. 6.19 is also a minimal-risk solution, but under its dual α = −1 (forward KL).

The joint likelihood approach uses a product over the likelihoods of the independent datasets,

  L(r) = Π_{j=1}^{K} L(r; X_j),   (6.21)

where X_j = {k_1, n_1, k_2, n_2}_j is the j-th dataset. We compute the profile likelihood ratio test statistic of Eq. 3.27 independently for each dataset. The combined IFR maximum likelihood value and confidence intervals are then easily obtained by comparing the total product (sum of logarithms) of Eq. 6.21 against the χ²₁-distribution quantiles, as described in Section 3.4.

In summary, the optimal choice of the combination method depends on the underlying assumptions, which are encoded by the implicit or explicit algebraic, information-theoretic or probabilistic aspects of the method. The first class of combiners is more inclusive, the second class more exclusive, in their output decision. See Appendix J for some illustrative properties of the probabilistic risk functions, which may be used for formal motivation.

As already mentioned, the methods we considered can be roughly classified into two scenarios for combining the data from the different studies. The first scenario aims to determine a parent distribution from which each IFR observed for a particular city is assumed to be sampled. In the second scenario, we assume δ_j ≡ 0 for all j in Equation 6.4, which represents a model in which the true global and local IFR values are the same. Under this assumption, the individual measurements of the IFR can be combined to provide a more precise, but possibly overoptimistic, measurement of the IFR.
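To make the contrast between the inclusive and exclusive combiner classes concrete, the following minimal sketch (illustrative only, not the paper's exact code) combines discretized posterior densities by the arithmetic mean and by the normalized weighted product on a common grid.

```python
import numpy as np

def combine_posteriors(r_grid, densities, weights=None):
    """Return (mixture, product) combinations of discretized 1D posterior densities."""
    P = np.array([p / np.trapz(p, r_grid) for p in densities])   # normalize inputs
    K = len(P)
    w = np.ones(K) / K if weights is None else np.asarray(weights, float) / np.sum(weights)

    # Arithmetic mean of posteriors: inclusive, probability-mass covering (mixture model)
    mean_post = np.sum(w[:, None] * P, axis=0)

    # Normalized weighted product of posteriors: exclusive, mode seeking (log pooling)
    log_prod = np.sum(w[:, None] * np.log(P + 1e-300), axis=0)
    prod_post = np.exp(log_prod - log_prod.max())
    prod_post /= np.trapz(prod_post, r_grid)

    return mean_post, prod_post
```

For input densities with little overlap, the product collapses onto a narrow region while the mixture keeps all modes, mirroring the inclusive/exclusive behaviour discussed above.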
In the following section, we use the method of moments, the normal likelihood, the Wasserstein-Fréchet mean and the arithmetic mean of posteriors for the first scenario, and the product of posteriors and the joint likelihood ratio for the second scenario.

The individual observed IFR result for each dataset is given in Table 6, where the 95% confidence (credible) intervals are obtained using the Bayesian estimator described in Section 3.6. The results are given for four different choices of the fixed time delay ∆t, and for an adaptive delay determined with the inverse machinery described in Section 5. In general, some datasets depend strongly on the chosen read-out delay; the reason for this is the underlying local epidemic evolution and its time derivatives. The adaptive delays are given in Table 8, where the confidence intervals are based on propagating Poisson fluctuations and kernel uncertainties through the whole deconvolution chain with Monte Carlo sampling, as described in Section 5. This estimated uncertainty δγ on the death-count scale is included as a Gaussian prior in the Bayesian estimator, as described in Section 3.6 and Appendix D, whereas the fixed read-out delay estimates here are 'bare' and do not include time-scale uncertainties. This difference is manifest in the widths of the individual densities. Measurements made at the very end of the local epidemic generate larger optimal read-out delay values and correspondingly smaller delay uncertainties on the death counts, because the daily increase in deaths has reached the slow asymptotic regime. The test-inversion systematic scale uncertainty δλ is included as a Gaussian prior affecting the IFR denominator, and is numerically larger for low-prevalence datasets (see Table 8).

The posterior distributions from each city are presented in Figures 7, 8 and 9 for the adaptive, ∆t = 7 and ∆t = 14 day choices, respectively. The sensitivity to time-delay effects highlights the importance of including these effects in any complete study of the IFR. Philadelphia yields significantly larger IFR peak values than the rest, while Los Angeles, Santa Clara and Finland all yield much smaller IFR peaks. This could point to a relatively strong IFR dependence on the local population, and further highlights the importance of sampling the population within an individual study. The Gangelt study is approximately in the middle and fairly constant with the choice of ∆t. It is worth noting that the kernels used in the adaptive delay inversion are the same (global) for each dataset, which makes the inversion locally biased for PCR-test based prevalence data (Stockholm, Iceland) and because of local reporting delays.

Table 6 caption (excerpt): results are given for fixed read-out delays ∆t and for a data-adaptive (inverse-solved) read-out ∆t ← ψ(t, ∆t) using global kernels. The credible interval takes into account the statistical counting uncertainty in the double ratio via the Bayesian posterior estimator under a non-informative Jeffreys prior, the test-inversion uncertainty δλ and, in the rightmost column (adaptive delay estimation), also the delay uncertainty δγ.
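For orientation, the counting part of such a Bayesian double-ratio estimate can be sketched with a simple Monte Carlo, shown below. The counts and the systematic scale values used here are placeholders, and the multiplicative Gaussian scale factors are a sampling stand-in for the prior integrals of Appendix D rather than the exact covidgen implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_F, n_P = 20, 100_000        # deaths and population size (hypothetical values)
k_I, n_T = 150, 1_000         # positive tests and tested sample size (hypothetical)

N = 200_000
r_F = rng.beta(n_F + 0.5, n_P - n_F + 0.5, N)   # Jeffreys Beta(1/2,1/2) posterior, fatality rate
r_I = rng.beta(k_I + 0.5, n_T - k_I + 0.5, N)   # Jeffreys posterior, infection rate

# Multiplicative systematic scale factors (illustrative Gaussian widths)
gamma = rng.normal(1.0, 0.05, N)                # stands in for the delta_gamma death-count scale
lam   = rng.normal(1.0, 0.10, N)                # stands in for the delta_lambda test-inversion scale

ifr = (gamma * r_F) / (lam * r_I)               # double ratio r_F / r_I with scale factors
lo, med, hi = np.percentile(100 * ifr, [2.5, 50.0, 97.5])
print(f"IFR = {med:.2f}%  CI95 [{lo:.2f}, {hi:.2f}]%")
```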
The results of the different combination strategies are given in Table 7 and in Figures 8 and 9, obtained using the methods described in Section 6.2. The global dispersion parameter ∆ and its uncertainty are instead shown in Figure 11, together with the IFR parameter r, using the full likelihood information of the NL model. The optimal transport approach is fully non-parametric, so its output distribution does not explicitly try to disentangle the IFR and global-dispersion-like components and their individual uncertainties. Instead, it yields distributions whose functional form looks similar to the individual distributions. The inverse-variance weighted variant prefers smaller values of r, which is well expected with this set of data. However, there can often be other, more suitable weighting strategies, based e.g. on some auxiliary risk minimization and the available domain knowledge.

The mean of posteriors method gives a very different distribution from the others. This method is sensitive to the datasets chosen as input and in general will not give a single representative (unimodal) distribution when the individual distributions have small overlap regions. In particular, this is seen in the nearby double-peak structure around r = 0.2 in the ∆t = 7 day results, which disappears in the ∆t = 14 day result. This double-peak structure is caused mainly by the New York distribution being narrow enough and moving significantly between these two choices. The simplest peak structure is obtained with the adaptive delays, shown in Figure 7, which has only one strong peak between r = 0.1 and r = 0.2. The product of posteriors and joint likelihood methods yield distributions which are much narrower than the others.

Table 8 caption: Left column: the optimal read-out delay in days for each dataset based on the deconvolution inversion, and its uncertainty. The Gangelt data include no detailed time series for deconvolution, so ∆t = 1 is used instead. Center column: the deconvolution and Monte Carlo propagated relative systematic scale uncertainty δγ ≡ σ[n_F(∆t)]/n_F(∆t) on the cumulative death counts due to causal time delays at time t + ∆t. Right column: the type I and II test-error inversion relative systematic scale uncertainty δλ, constructed from an error-propagated uncertainty and a pure binomial Wilson uncertainty on the corrected prevalence fraction p_2, according to Eq. G.6. Test sensitivity and specificity values are as given in Table 4 and are used in the error propagation described in Appendix G to obtain the individual δλ values given here. These systematic uncertainties are finally applied via the Bayesian priors as described in Appendix D.
Figure 10 shows the combination results using the joint likelihood method, where the delay ∆t = 14 days gives a slightly smaller joint log-likelihood ratio than the delay ∆t = 7 days. However, this cannot stand on its own as a method for determining a good effective global ∆t, because some datasets are invariant under the choice of ∆t and their IFR values would therefore play a pivotal role. Based on our time-delay calculus, the individual datasets all have different delay scale functions ψ(t, ∆t) and are evaluated at different t values, which give the optimal local ∆t values shown in Table 8. Finally, the high similarity between the two product methods is a natural consequence of the underlying assumption of these strategies that the IFR is a global quantity and that individual studies can be combined to improve the precision of the measurement. We caution, however, that this is a strong assumption, and in particular we highlight the fact that several of the individual distributions show tension with the product-like combinations, indicating that the assumption may be unjustified.

The demographically averaged infection fatality rate (IFR) of SARS-CoV-2 in Western societies has been estimated here to be approximately 0.4 [0.2, 0.8] % (Q95), when using a fixed ∆t = 7 day read-out delay between the prevalence determination and the public death counts, together with the optimal transport based posterior combination of the individual datasets. In general, our methods included the Bayesian double-ratio based binomial counting uncertainty, deconvolution of the underlying time series for the optimal read-out delay determination, systematic uncertainties in the death counts and systematic uncertainties in the corrected positive test counts. This estimate is a factor of 3−5 larger than the IFR of a typical seasonal influenza, if one assumes its often quoted value of ∼ 0.1 %. Our result is numerically similar to analyses e.g. in [59] or [3], but differs in methodology. However, even though we constructed our methodology rigorously, one should not take our estimate as an ultimate measurement of the IFR. In addition to requiring better control of the underlying details of the collected data, we also identified other factors that a more extensive analysis would take into account, such as demographic differences, the population counts used in the normalization and regional time delays.

Our IFR estimation is based on several randomly sampled seroprevalence determination datasets combined with different statistical techniques, such as Bayesian estimation of counting uncertainties and modern algorithmic, optimal transport driven probability density fusion. We recommend the Wilson estimator for fast but reasonable confidence interval estimation of binomial counting experiments (a minimal implementation is sketched below), and the Bayesian double-ratio estimator for more extensive counting uncertainty estimates in the context of the IFR, because it provides access to a full posterior distribution. Additional effects can also be incorporated more easily within the Bayesian formulation. When analysing multiple datasets, the choice of data combination tools depends on the underlying scenarios.
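For reference, a minimal Python implementation of the Wilson score interval recommended above (a standard textbook construction, not code taken from the accompanying repository) is:

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score CI for a binomial proportion k/n; z = 1.96 gives ~95% coverage."""
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Example usage with illustrative counts: 138 positives out of 919 tests
lo, hi = wilson_interval(138, 919)
```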
If it is reasonable to assume that there exists one global integrated IFR value, the product of posterior distributions or the joint likelihood method are perhaps the most natural approaches for improving the individual estimates. However, if that is not the case, then the Wasserstein-Fréchet mean (optimal transport) of posteriors, the classic meta-analysis models and the linear mean of posteriors can be more suitable, as we also argued using results from information theory.

An accurate estimate also requires careful consideration of the significant time delays observed between infection and death. The required time-delay convolutions necessitate the extraction (i.e. fitting) of delay kernel functions, which may need to be locally adjusted because of differences in health care and administrative procedures. Poorly estimated kernels used within (de)convolution naturally result in an unknown bias. A more transparent solution is always to show, in addition, results with several fixed delays such as one or two weeks, as presented here. Nevertheless, we provided and analyzed the necessary inverse-problem methods for advanced, close-to-optimal delay corrections and demonstrated this machinery with data using fixed global kernels. The best experimental solution is a cross-sectional seroprevalence trial that minimizes time-domain effects, namely one that is not carried out too early in the epidemic and that is tightly localized (not spread out) in time, if possible.

Based on studies in Stockholm [35] and Geneva [33], and on global comparisons [45], age appears to be a crucial binning (stratification) variable for the SARS-CoV-2 IFR, as expected. All the statistical methodology developed here can also be applied in a stratified analysis. However, age stratification is not enough; although it gives a strong ordering in the IFR, it does not provide a proper explanation of the underlying dynamics. Possible differences in the body's response and uncertainties in the antibody-type tests are crucial factors, as is the extrapolation to the total population level. A recent study has found a new type of genetic defect risk in type I interferon (IFN) pathways, inducing a life-threatening COVID-19 pneumonia [60], a mechanism that does not appear to be active with influenza viruses.

From a larger perspective, a direct one-to-one comparison, e.g. with seasonal flu, is non-trivial, because for the latter the population has more natural immunity and there are seasonal vaccinations. However, an indirect comparison is possible, and to understand the total risk and harm to society one should understand the multiplicative reproduction differences between these viruses. A virus with a seemingly small IFR can still pose a high risk if it is easily transmitted. A driving factor might not be just the mean R₀, but also, independently, the variance and tails of the transmission chain multiplicities. This possibly overdispersed case (compared to e.g. a Poisson distribution) is often modelled with a negative binomial distribution, which in physics describes at a phenomenological level the charged-particle final-state multiplicities of high-energy proton-proton collisions.
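As a small numerical illustration of this overdispersion point (with purely assumed parameter values for R₀ and the dispersion κ, not values estimated in this work), one can compare negative binomial and Poisson secondary-case counts of equal mean:

```python
import numpy as np

rng = np.random.default_rng(0)
R0, kappa, N = 2.5, 0.2, 100_000                    # assumed mean and dispersion

# NumPy parameterization: n = kappa, p = kappa / (kappa + R0) gives mean R0
nb = rng.negative_binomial(n=kappa, p=kappa / (kappa + R0), size=N)
po = rng.poisson(lam=R0, size=N)

print("means     ", nb.mean(), po.mean())            # both close to R0
print("variances ", nb.var(), po.var())              # NB variance R0*(1 + R0/kappa) >> R0
print("P(X >= 10)", (nb >= 10).mean(), (po >= 10).mean())   # heavy super-spreading tail
```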
Using sharp analogies, tools from high energy physics can thus be useful in modelling and analyzing the epidemic production side of the problem. Physically, the ultimate solution for the future is to massively increase the testing capabilities for real-time monitoring. This essentially requires a new type of non-invasive personal health care technology, perhaps based on promising new techniques such as CRISPR based diagnostics [61]. More measurements are not just a requirement for obtaining minimally extrapolated IFR estimates, reliable values for epidemic parameters such as R₀ and an understanding of the critical role of multiplicative fluctuations; more crucially, they are needed to obtain control of the epidemic with minimal lock-down measures. All the tools analyzed and developed here were constructed essentially from first principles, and these methods should therefore remain highly relevant in the future.

In set notation (Appendix A), the full-statistics IFR and its test-sample based estimate can be written as

  IFR = |I ∩ F| / |I|   (A.1)
  IFR = ( |F| / |P| ) / ( |I ∩ T| / |T| ),   (A.2)

where F, I, T, P denote the sets of fatal, infected, tested and all people in the city, respectively. The number of elements of a set is denoted by | · | and the intersection of two sets by ∩. To show that the construction is consistent, consider the limit where all people in the city are tested, by substituting T → P in Eq. A.2. Then the limited-statistics IFR coincides with the full-statistics IFR by construction, because it also always holds that I ∩ F ≡ F and I ∩ P ≡ I. These definitions are purely formal and assume perfect test sensitivity and specificity and no time delays; the necessary corrections are discussed in other sections of this paper. The implicit assumption made in the extrapolation is that the infections observed in the test sample truthfully represent the stochastic infection process in the full sample. In essence, some ergodicity (time average equals ensemble average) and sample homogeneity must be assumed.

Using the expectation values E[X], E[Y] and the correlation coefficient ρ[X, Y] between two correlated Bernoulli random variables X and Y, the direct (hypercube) basis parametrization is

  P₃ = E[X] E[Y] + ρ[X, Y] √(E[X](1 − E[X])) √(E[Y](1 − E[Y]))
  P₂ = E[X] − P₃,  P₁ = E[Y] − P₃,  P₀ = 1 − P₁ − P₂ − P₃.

The sampling of vectors (X, Y) is now multinomial, such that [(0, 0), (0, 1), (1, 0), (1, 1)] ∼ [P₀, P₁, P₂, P₃] are the four corners of the hypercube. Any multinomial distribution sampling algorithm can be used.

Binomial posterior density. We start with the generic Bayesian inference formula for the posterior density,

  P(θ|X₁, ..., X_n, γ) = f(X₁, ..., X_n|θ, γ) π(θ|γ) / p(X₁, ..., X_n) = f(X₁, ..., X_n|θ, γ) π(θ|γ) / ∫ dθ f(X₁, ..., X_n|θ, γ) π(θ|γ),   (C.1)

where X_i contains the observed data, θ the parameters of true interest and γ the hyperparameters or nuisance parameters. The likelihood function L(θ, γ) = f(X₁, ..., X_n|θ, γ) = Π_i f(X_i|θ, γ) is the sampling density evaluated as a function of θ, γ for n iid observations, and the prior density is π(θ|γ). In what follows, we take the sampling density to be the binomial, f(k|θ = p, n) = Bin(k|n, p), and the prior density to be π(θ = p) = Beta(p|α, β). A computationally easy conjugate prior pair for the binomial is the beta distribution, yielding a beta posterior density, as we show below. In the same way, the gamma and Poisson distributions are conjugate pairs. The flat prior case is Beta(p|1, 1) = 1. The Jeffreys coordinate-equivariant prior corresponds to the case Beta(p|1/2, 1/2) = [π √(p(1 − p))]⁻¹, which is an important one when considering non-informativeness under coordinate transforms. To obtain the denominator (evidence) of Eq. C.1, we marginalize over the parameter p-space (Eq. C.5). The posterior distribution is obtained by substituting Eq. C.2 and Eq. C.5 into Eq. C.1,
giving the posterior P(p|k, n, α, β) = Beta(p|k + α, n − k + β). Using a generic Beta(α, β) prior and the binomial likelihood thus gives the posterior density Beta(k + α, n − k + β), with the mean value

  p̂ = (k + α) / (n + α + β).   (C.7)

The result in Eq. C.7 was presumably first found by Laplace in his 'law of succession' and inverse probabilities; it was considered somewhat controversial at the time because it does not coincide with the intuitive maximum likelihood answer k/n. For arbitrary new data x_new, the prior predictive distribution is

  p(x_new) = ∫ dθ f(x_new|θ) π(θ).

Then, using a measured sample X ≡ (X₁, X₂, ..., X_n), the posterior predictive distribution for new data is

  p(x_new|X) = ∫ dθ f(x_new|θ) P(θ|X).

The posterior predictive distribution allows one to draw values x from the sampling density, with the parameter θ uncertainty described by the posterior density P(θ|X). Thus, strictly speaking, there exists no direct frequentist equivalent of this expression.

Additional systematic uncertainties on the counts k₁ and k₂ are applied by multiplying the Bayesian posterior ratio IFR formula of Eq. 3.39 by Gaussian prior densities for the scale parameters γ and λ and integrating over them (Eq. D.1), where G(x; µ, σ) is a normal density, ε is a small positive scalar and the integrals are computed numerically. These additional Gaussian-distributed parameters model multiplicative scale corrections γ and λ on the death counts k₁ and on the positive test counts k₂, respectively. The triple integral gives the posterior ratio probability density up to the overall normalization, which is obtained numerically. The normal prior densities can be replaced with gamma densities, for example. The mean values are typically taken to be µ_γ = µ_λ = 1, if the counts are corrected prior to applying this formula. The 1-sigma uncertainties on these corrections are described by δγ and δλ, which come from auxiliary procedures or calibration measurements. In our case, δγ is obtained via Monte Carlo error propagation of the deconvolution procedure and its kernel uncertainties in Appendix H, and δλ as described in Appendix G. One needs to pay attention to possible double counting of statistical uncertainties in Equation D.1 when estimating these parameters.

Bayesian. A Bayesian credible interval (CR) at the level 1 − α is defined as an integral over the posterior density,

  1 − α = ∫_C dθ P(θ|X),

where C defines the credible interval or multidimensional region, which contains the true parameter with (1 − α) × 100 % probability. There is usually an infinite number of such intervals, but often the tail masses are fixed to be equal to α/2. A given credible interval is not constructed to contain the parameter with the same probability if the experiment is repeated, which is what a frequentist confidence interval tries to construct. However, the Bayesian construction may also have strong frequentist coverage properties, as the well-known 'Jeffreys interval' demonstrates [5].

Frequentist. A basic property of frequentist confidence intervals (CI) is their coverage. This is a property of statistical procedures for extracting intervals for a parameter of interest θ at some confidence level 1 − α; it does not apply to a single confidence interval from a specific experiment. For a repeated set of measurements, each with its own fluctuations, the positions of the intervals will vary. The coverage is defined as the fraction of intervals that contain the true value of θ. Coverage can vary with the value of θ, but for frequentist intervals from a Neyman construction [63] it will never be smaller than 1 − α. The coverage can be estimated from repeated experiments as

  coverage ≈ (1/n) Σ_{i=1}^{n} I(θ ∈ C_i),

where I is the indicator function I : R → {0, 1} and n is the number of repeated experiments (with differing intervals).
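The coverage notion can be checked numerically. The sketch below (illustrative only, not the code behind the coverage figures of this paper) estimates the CI95 coverage of the equal-tailed Jeffreys interval, i.e. the Beta(k + 1/2, n − k + 1/2) credible interval, for one choice of p and n by repeated binomial sampling.

```python
import numpy as np
from scipy.stats import beta

def jeffreys_interval(k, n, alpha=0.05):
    """Equal-tailed Jeffreys credible interval for a binomial proportion (vectorized in k)."""
    lo = beta.ppf(alpha / 2, k + 0.5, n - k + 0.5)
    hi = beta.ppf(1 - alpha / 2, k + 0.5, n - k + 0.5)
    return lo, hi

def coverage(p_true, n, n_repeat=20_000, seed=3):
    rng = np.random.default_rng(seed)
    k = rng.binomial(n, p_true, size=n_repeat)      # repeated binomial experiments
    lo, hi = jeffreys_interval(k, n)
    return np.mean((lo <= p_true) & (p_true <= hi)) # fraction of intervals covering p_true

print(coverage(0.15, 919))                          # close to, but not exactly, 0.95
```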
Formally, for a given α, the confidence interval or region C_i is the one which gives the infimum (the greatest lower bound) of the coverage probability. The interval and its lower and upper endpoints L(X) ≤ U(X) are random variables depending on the random data X, whereas the true parameter θ itself is not a random variable in this picture. Finally, it is instructive to show that combining the two one-sided bounds

  inf_θ P(L(X) ≤ θ) = 1 − α/2  and  inf_θ P(U(X) ≥ θ) = 1 − α/2   (E.3)

gives the expected two-sided confidence interval, with P(L(X) ≤ θ ≤ U(X)) ≥ 1 − α. Unlike Bayesian credible intervals, frequentist confidence intervals do not explicitly estimate the probability for the parameter to be within some range.

The optimal frequentist confidence interval acceptance-set construction, used in the inverse construction of the Neyman confidence belts, can be derived briefly as follows [16]. Several other acceptance-set constructions or ordering principles also exist, such as the shortest expected length and various pdf-based orderings, each perhaps optimal under some very specific condition such as a certain interval topology. Randomized intervals can also be constructed, but these are mostly used only in theoretical analyses of (discrete) problems. Let our parameter of interest be θ ∈ Ω, let the random measurement be X, and let us use here the likelihood-ratio based ordering. We can formalize the confidence interval as a set having the corresponding coverage probability. To construct the set, the likelihood ratio

  λ(X; θ) = f(X|θ) / f(X|θ̂)

is considered at each value of θ, for each value of X, where θ̂ is the maximum likelihood estimate. The crucial piece above is the local threshold constructing the confidence level 1 − α, which is explicitly dependent on θ. This value can be constructed with asymptotic approximations or with Monte Carlo. For more information, see e.g. [17, 19].

Let p = P(V+) be the true viral prevalence of the population and let q = P(T+) be the fraction of positive tests in the test sample. Let the specificity be s ≡ P(T−|V−) = 1 − α = 1 − P(type I error) and the sensitivity be v ≡ P(T+|V+) = 1 − β = 1 − P(type II error). In alternative terminology, α is known as the False Positive Rate and 1 − β as the True Positive Rate; to be clear, these symbols should not be confused with the parameters of the Beta priors. The following derivation uses pure probability calculus, without specifying the underlying density or mass functions. The four different conditional probabilities can be combined under Bayes' rule with the law of total probability expanded for the true prevalence,

  q = P(T+) = v p + (1 − s)(1 − p).

Using this, a well-known inverse estimator (see e.g. [64]) for the true prevalence is

  p̂ = (q + s − 1) / (v + s − 1),   (G.3)

which requires v + s − 1 > 0 and a raw positive fraction q above the false positive rate 1 − s; otherwise the problem is physically ill-posed. The FPR lower bound is especially problematic when the viral prevalence is low.
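A minimal sketch of this inversion (Eq. G.3), with purely illustrative input values, is:

```python
def invert_prevalence(q, sensitivity, specificity):
    """Rogan-Gladen type inversion: true prevalence from the raw positive fraction q."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("Test must be informative: sensitivity + specificity > 1")
    p_hat = (q + specificity - 1.0) / denom
    return min(max(p_hat, 0.0), 1.0)        # clip to the physical range [0, 1]

# Example: 15% raw positives with assumed 90% sensitivity and 99% specificity
p = invert_prevalence(q=0.15, sensitivity=0.90, specificity=0.99)
```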
As a simple estimate of the related uncertainty, we can use a first-order Taylor expansion (error propagation) with q, s, v taken as independent. We get

  σ²_p̂ ≈ (∂p̂/∂q)² σ²_q + (∂p̂/∂s)² σ²_s + (∂p̂/∂v)² σ²_v,   (G.5)

where σ²_q, σ²_s, σ²_v are the squared individual 1-sigma uncertainties. The first is driven by the binomial counting, the other two by the uncertainty in the laboratory calibration of the test error rates. Instead of using dichotomic (binary) test output decisions and Eq. G.3, alternative inversion strategies can be based on a test-by-test weighted inversion, according to the conditional probabilities of Bayes' rule, and on an Expectation-Maximization (maximum likelihood) iteration of the prevalence fraction. This requires that the test provides a probabilistic output (e.g. from a multivariate analysis). Different strategies should be simulated with Monte Carlo sampling.

Renormalization procedure. Given an already inverted prevalence rate p̂ (or counts k = np̂), together with known (or assumed) sensitivity and specificity and their uncertainties, we can estimate the relative systematic multiplicative scale uncertainty δλ due to type I and II errors as follows: first compute the corresponding raw rate q by (re)inverting Eq. G.3, compute its binomial uncertainty σ_q, then apply Eq. G.5, and finally extract the additional (orthogonal) relative uncertainty by remembering that, in the multiplicative case, relative uncertainties add in quadrature. The pure binomial reference uncertainty σ_p can be computed e.g. with the Wilson estimator. The re-inversion step is needed only if no raw data are available. The idea behind this renormalization procedure is to protect against double counting the statistical uncertainty component when multiplicatively 'dressing' the Bayesian IFR estimates (with the corrected counts as input), as explained in Appendix D.

The deconvolution here is implemented as a non-negative least squares with Tikhonov regularization. We found this classic approach to be by far the most stable of the standard methods for this problem, compared with regularized Fourier-space methods and early-stopping regulated maximum likelihood EM iteration (Richardson-Lucy). The EM-iteration driven formulation assumes Poisson noise, which in principle should be more optimal; however, the explicit regularization properties of the method shown here seemed to play a bigger role. The regularized least squares solution for the discretized infection rate x ∼ dI(t)/dt is obtained by inverting the linear convolution equation Ax = y, by minimizing

  x̂ = argmin_{x ≥ 0} ||Ax − y||² + λ_R² ||L(x − x₀)||²,

where y is the measurement vector and λ_R controls the regularization strength. The measurement vector is constructed from the daily PCR infection counts, y ∼ dC(t)/dt, where the vector domain is extended (padded) with zeros before the first counts, in order to be able to describe the 'pull-back' of the deconvolution without a limiting boundary. The matrix A is the convolution-operation Toeplitz matrix constructed from the corresponding discretized kernel function K(t). The auxiliary vector is x₀ = 0 in our problem formulation. To regulate the solution smoothness (curvature), we use a finite-difference second-order derivative matrix L, whose rows have the band form (..., 1, −2, 1, ...). Other typical options for L are the identity matrix and first-order derivatives.
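The regularized non-negative least squares problem above can be solved, for example, via an augmented system passed to SciPy's nnls routine. The sketch below is a simplified stand-in for the actual implementation; the kernel and the value of λ_R are placeholders.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.linalg import toeplitz

def deconvolve(y, kernel, lam=1.0):
    """Tikhonov-regularized non-negative deconvolution of a daily count series."""
    n = len(y)

    # Causal convolution (Toeplitz) matrix of the discretized delay kernel
    col = np.zeros(n)
    col[: len(kernel)] = kernel
    A = toeplitz(col, np.zeros(n))

    # Second-order finite-difference matrix for curvature (smoothness) regularization
    L = np.zeros((n - 2, n))
    for i in range(n - 2):
        L[i, i : i + 3] = [1.0, -2.0, 1.0]

    # Augmented least squares: argmin ||A x - y||^2 + lam^2 ||L x||^2, subject to x >= 0
    A_aug = np.vstack([A, lam * L])
    y_aug = np.concatenate([np.asarray(y, float), np.zeros(n - 2)])
    x, _ = nnls(A_aug, y_aug)
    return x
```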
The minimization is done with an active set method [65], which enforces the necessary Karush-Kuhn-Tucker (KKT) constrained optimization conditions. To be able to use standard optimization algorithms with the regularization term included, we use an augmented matrix formulation,

  Â = [ A ; λ_R L ],  ŷ = [ y ; λ_R L x₀ ],

so that x̂ = argmin_{x ≥ 0} ||Â x − ŷ||². Thus the regularization is fully explicit here. We use minimal parameter values for λ_R yielding smooth inversion results without oscillatory behavior, and we remark that the statistical uncertainties of the inverse estimate are affected by the regularization procedure, due to the bias-variance trade-off. The regularization adds a small bias term to the solution and correspondingly suppresses the statistical fluctuations. This makes the statistical uncertainty properties of inverse estimates non-trivial.

The kernel extraction from data is itself its own problem, typically approached using e.g. Kaplan-Meier type non-parametric estimators [66] and functions generalizing the basic exponential (memoryless process) delay kernel, such as the Weibull pdf. Sequential delays are easy to model via cascaded convolutions, but the factorization and identifiability of the component kernels in terms of the underlying physically independent delay sources is not necessarily possible. The fitted kernels are shown in Figure 12; they are also causal, i.e. defined only for t > 0.

Figure 12 caption: Delay convolution Weibull kernels K(t) fitted using the mean and standard deviation values given in [33], where λ (scale) and k (shape) denote the corresponding Weibull pdf parameters. The individual delays are: K_{I→O} from infection to symptom onset (incubation period), K_{O→C} from symptom onset to the case report, K_{O→S} from symptom onset to seroconversion (antibodies) and K_{C→F} from the case report to death. The combined delays are: K_C = K_{I→O} * K_{O→C} from infection to the case report, K_S = K_{I→O} * K_{O→S} from infection to seroconversion and K_F = K_{I→O} * K_{O→C} * K_{C→F} from infection to death. The combined kernels are solved by numerical convolution of the individual kernels, with tilded variables denoting the resulting Weibull parameters.

Seroreversion. An additional effect, beyond the causal delays discussed earlier, is the finite half-life of antibodies. Taking this evaporation effect into account, the total measurable seroprevalence Ĩ_S(t) can be modelled with the following convolutions,

  Ĩ_S(t) = I_S(t) − I_RS(t),  with I_S(t) = (K_S * Î)(t) and I_RS(t) = (K_R * K_S * Î)(t),   (H.5)

where Î(t) comes from the deconvolution procedure and K_R is the new kernel function modelling the finite lifetime of measurable antibodies in the body (e.g. an exponential decay). Extracting K_R requires time-dependent control studies in which a group of test-positive individuals is monitored and continuously re-tested over a period of months. Convolution integrals are involved in the solution because we deal with time-evolving input distributions and assume linearity of the system and time-invariance of the kernels. In Eq. H.5, the first term I_S(t) is the time-delayed seroprevalence without antibody decays and the second term I_RS(t) is the delayed and decayed distribution part.
The difference between these two terms gives the actual measured seroprevalence Ĩ_S(t), and asymptotically Ĩ_S(t) → 0 when t → ∞, due to the finite half-life. Naturally, when the half-life of the antibodies t_{1/2} → ∞, we recover the case Ĩ_S(t) → I_S(t), that is, the case without seroreversion. Because the antibody decay kernel can have a very long tail, all computational convolution procedures with arrays should domain-extend (pad) the daily counts with zeros, to properly handle the convolutional (un)winding of these tails.

Using standard notation, the p-Wasserstein metric [67] for p ≥ 1 is given by

  W_p(µ, ν) = ( inf_{γ ∈ Γ(µ,ν)} ∫ d(x, y)^p dγ(x, y) )^{1/p},   (I.1)

where d(x, y) is the basic cost between two points x and y, for example d(x, y) = ||x − y||. The so-called transport map T : µ → ν maps a measure density µ to another measure density ν over the space χ. The set of all possible couplings is Γ(µ, ν), which has marginals µ and ν, with a realization γ satisfying γ(dx, χ) = µ(dx) and γ(χ, dy) = ν(dy). The case p = 2 and χ = R^D has a unique minimum solution. In one dimension, D = 1, the metric can be written as

  W_p^p(µ, ν) = ∫_0^1 | U^{−1}(z) − V^{−1}(z) |^p dz,   (I.2)

where U(z) and V(z) are the cumulative distribution functions (CDF) of µ and ν, so the comparison in Eq. I.2 is between the inverse CDFs (quantile functions).

It may be tempting to choose only one of the estimator results. However, the optimality of that decision depends on the risk function definition. To help with possible interpretations, we briefly list here some typical risk functions. Let our parameter of interest be θ, the random variable of the data be X, the decision function (estimator) be δ(X) and the loss function be ξ(θ, δ(X)), which encodes our cost definition. Beyond these probabilistic risks, there are related principles in information theory, such as the minimum description length (MDL) [68], and other formulations e.g. in economics.

Frequentist risk. The frequentist risk is the loss integrated over the sampling density,

  R(θ, δ) = E_θ[ξ(θ, δ(X))] = ∫ dx ξ(θ, δ(x)) f(x|θ).   (J.1)

If the loss function is ξ(θ, δ(X)) = (θ − δ(X))² (J.2), then the optimal decision δ*(X) minimizes the sum of the (squared) bias and the variance, which is trivial to show.

Posterior risk. The posterior risk is the loss integrated over the posterior density,

  B(θ, δ) = ∫ dθ ξ(θ, δ) P(θ|X),   (J.3)

which has two typical solutions:

  ξ(θ, δ(X)) = (θ − δ(X))²  →  δ*(x) = ∫ dθ θ P(θ|x) ∼ posterior mean   (J.4)
  ξ(θ, δ(X)) = |θ − δ(X)|  →  δ*(x) ∼ posterior median.   (J.5)

The optimal decisions δ*(X) for these losses are thus the posterior mean and the posterior median, respectively.

Bayes rule risk. The hybrid risk is the frequentist risk integrated over the prior density,

  H(θ, δ) = ∫ dθ R(θ, δ(X)) π(θ).   (J.6)

Using certain specific priors π(θ), one can turn this into a minimax risk, where sup_θ R(θ, δ) is the worst-case (maximum) frequentist risk. As its name states, the minimax-optimal decision is the one which minimizes the maximum expected risk.

This section is a general summary of possible unknowns.

• The number of tested people is a well-known quantity, but the total (effective) population size of the system is not fully known. This is the problem of open versus closed systems, or their idealization.
In reality, not all citizens are in contact, but there are locally isolated subsystems which are not 'thermalized' together. One may argue that, to define the IFR in the way it is typically done, by using a test sample and extrapolating to the full city population scale, the so-called ergodicity hypothesis of Boltzmann must be assumed to hold implicitly. Another sampling issue is the local household clusterization effect, which can in principle induce both positive and negative correlations, such as the average infection rate first increasing and then decreasing as a function of the household size, due to children. Monte Carlo simulations can be used to study these issues, but we may expect other sources of uncertainty to be typically much larger, at least when comparing studies implemented in systems of relatively similar size and density.

• The demographic heterogeneity uncertainties and their regression modelling are already discussed in some detail in Section 6. It makes sense to compare the average IFR one-to-one between countries which have similar demographics. The world-wide population median age spans approximately 32 years, from ∼ 15 years in Niger to ∼ 47 years in Japan, which is expected to have a large impact. Similarly, the health care provided is very different. The combination analysis, if implemented using studies done under similar demographic conditions, then probes the underlying and always partially unknown systematics in an empirical and effective way.

• It is currently an open question how large the effect of the initial viral dose is on the outcome of the disease. It has been hypothesized [69] that using face masks effectively reduces not just the number of infections, but in a more non-linear way also the infection fatality rate, due to smaller doses being transmitted and received. In this case, a person receiving a small dose would allow their body to develop mild symptoms and even immunity; a serious condition would instead be more likely with a large initial dose of the virus. A positive correlation can be expected between a large viral load during the disease and its severity, but the transmission dose dependence is hard to analyze without dedicated studies.

• Sensitivity (true positive rate) and specificity (true negative rate), or the ROC-curve ('receiver operating characteristic') working point of PCR or antibody tests, should be carefully calibrated and corrected for. Person by person, there are irreducible type I (false positive) and type II (false negative) classification errors which cannot be avoided; however, for large samples it is possible to compensate for these errors by an inversion analysis. The corrections can be calculated as explained in Appendix G or, perhaps even more optimally, using weighted corrections test-by-test. Re-weighting or other corrections can be executed only if the test manufacturer has produced well-calibrated tests and algorithms with a probabilistic output. In a review of five studies, SARS-CoV-2 PCR tests have been estimated to have a false negative rate of up to 29 % [70]; however, this depends on the chosen working point of the false positive rate.

• The degree of personal variation in the antibody response is not yet well understood. As an alternative strategy to antibodies, the T-cell response for SARS-CoV-2 currently seems promising to combine with the antibody response [71].
• Cumulative death counts [IFR bias ↓↑] Relative undercounting of death counts happens simply because of the chosen endpoint of the analysis time interval and the finite time delays according to Eq. 5.1, driven by biological and communication delays. Similarly, relative overcounting is possible. This 'efficiency' or 'overcounting' type of counting error can be estimated and multiplicatively corrected, but its accuracy is limited by the quality of the delay kernels extracted from data.

• If a PCR-type test is made too late, it can miss a possibly (earlier) positive person. This effect is prominent in the tails, when the infection vanishes from the population. In this case, fatality and infection counts stay the same, but as the test count grows as a function of time, it leads to a growing IFR estimate. The associated time period is also called the 'duration of viral shedding'. To mitigate this problem, seroprevalence tests should typically be used primarily to determine the IFR. An alternative is continuous (daily) PCR testing, which is typically feasible only for high-risk group individuals.

• When an antibody (IgG, IgM, ...) type seroprevalence determination is implemented, it is necessary to take into account the body's response delays in developing the necessary amount of antibodies to pass the test thresholds, but also the fact that the antibodies decay, i.e., their half-life is not necessarily insignificant on the time scale of the epidemic. Delays in the development, or the vanishing, of antibodies can bias the IFR estimate upward via a downward-biased prevalence count. In Ref. [72] it was concluded that, after SARS-CoV-2 infection, long-lasting protective antibodies are not likely to be produced; in precision terms, this is currently an open question. Ref. [73] found that SARS-CoV-2 IgG responses decreased by only 4 % within 90 days. However, IgA and IgM were short-lived, with median decay times of 70.5 [58.5, 87.5] and 48.9 [43.8, 55.6] days (CI95). Neutralizing antibody titers showed little decrease and were also highly correlated with IgG.

• Non-uniform sampling rate [IFR bias ↓↑] Any precision procedure relying on computing e.g. (de)convolutions between time-series data and delay kernels may need to take into account non-uniform testing and reporting rates. However, the reported daily death count time series can be considered more reliable, assuming that deaths are correctly reported and placed in the time series. Inspecting public data, this is evidently not always the case, with anomalously large discontinuities seen in the time series.

• The conditional classification of the death itself as being caused by COVID-19 is not fully unique. A person may develop a simultaneous serious bacterial (Streptococcus etc.) pneumonia increasing the fatality risk, which is typical with seasonal influenza viruses and one of the most common causes of death [74]. The unique cause of death will be ambiguous or degenerate in this case. Similarly, any underlying chronic conditions, such as the metabolic syndrome, can significantly affect the outcome.
One future solution to this could be a more advanced bookkeeping scheme, which assigns data-driven probabilities to one or more international cause of death (ICD) codes, to tag simultaneous underlying conditions. The conditional probability P(Y|X) is by definition the joint probability P(Y, X) divided by the probability of the condition P(X). As an approximation, there could also be just two categories of COVID-19 deaths, with and without existing chronic conditions. In addition, there can be other systematic country- or study-level differences in the basic bookkeeping of death counts. This issue is particularly relevant when excess fatality rate comparisons due to COVID-19 are made against seasonal flu fluctuations.

Figure 13 caption (coverage simulations): On the left, simulations of the single binomial proportion confidence interval CI95 coverage as a function of the binomial probability p. On the right, the corresponding average confidence interval CI95 relative widths (endpoints). Each row is for n binomial trials. The likelihood ratio uses the χ²-approximation.

References
Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand
Infection fatality rate of SARS-CoV-2 infection in a German community with a super-spreading event
A systematic review and meta-analysis of published research data on covid-19 infection-fatality rates
Meta-analysis in clinical trials
Interval estimation for a binomial proportion
Probable inference, the law of succession, and statistical inference
Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation
The Lagrangian multiplier test
The large-sample distribution of the likelihood ratio for testing composite hypotheses
The use of confidence or fiducial limits illustrated in the case of the binomial
Binomial confidence intervals
The combination of probabilities arising from data in discrete distributions
Frequentist evaluation of intervals estimated for a binomial parameter and for the ratio of Poisson means
Testing Statistical Hypotheses
The Advanced Theory of Statistics
Unbiasedness of likelihood ratio confidence sets in cases without nuisance parameters
Empirical likelihood ratio confidence regions
Unified approach to the classical statistical analysis of small signals
A network algorithm for performing Fisher's exact test in r × c contingency tables
Algebraic algorithms for sampling from conditional distributions
In defence of score intervals for proportions and their differences
Exact Bayesian analysis of two proportions
Confidence intervals for the ratio of two Poisson means and Poisson predictor intervals
Obtaining confidence intervals for the risk ratio in cohort studies
The limiting distributions of certain statistics
Better bootstrap confidence intervals
Theoretical comparison of bootstrap confidence intervals
Test-based exact confidence intervals for the difference of two binomial proportions
On small-sample confidence intervals for parameters in discrete distributions
Serology-informed estimates of SARS-COV-2 infection fatality risk in
Coronavirus (COVID-19) Data in the World
The infection fatality rate of COVID-19 in Stockholm
Coronavirus (COVID-19) Data in the United States
Coronavirus (COVID-19) Data in Sweden
Coronavirus (COVID-19) Data in Switzerland
Seroprevalence weekly report
Seroprevalence of SARS-CoV-2-Specific Antibodies Among Adults in
COVID-19 Antibody Seroprevalence in
Seroprevalence of Antibodies to SARS-CoV-2 in 10 Sites in the United States
Spread of SARS-CoV-2 in the Icelandic population
Repeated seroprevalence of anti-SARS-CoV-2 IgG antibodies in a population
Small sample inference for fixed effects from restricted maximum likelihood
Problems arising in the analysis of a series of similar experiments
A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses
A likelihood approach to meta-analysis with random effects
Bayesian approaches to random-effects meta-analysis: a comparative study
Recovery of inter-block information when block sizes are unequal
Maximum likelihood approaches to variance component estimation and to related problems
Barycenters in the Wasserstein space
The Fréchet distance between multivariate normal distributions
Statistical aspects of Wasserstein distances, Annual Review of Statistics and Its Application
Iterative Bregman projections for regularized transportation problems
Training products of experts by minimizing contrastive divergence
Integration of stochastic models by minimizing α-divergence
Estimates of the severity of coronavirus disease 2019: a model-based analysis
Auto-antibodies against type I IFNs in patients with life-threatening COVID-19
Field-deployable viral diagnostics using CRISPR-Cas13
Infections and Identified Cases of COVID-19 from Random Testing Data
Outline of a theory of statistical estimation based on the classical theory of probability
Estimating prevalence from the results of a screening test
Solving least squares problems
Nonparametric estimation from incomplete observations
Prescribing a system of random variables by conditional distributions, Theory of Probability & Its Applications
Modeling by shortest data description
Facial Masking for Covid-19 - Potential for "Variolation" as We Await a Vaccine
False-Negative results of initial RT-PCR assays for COVID-19: a systematic review
SARS-CoV-2-specific T cell immunity in cases of COVID-19 and SARS, and uninfected controls
Prevalence of IgG antibodies to SARS-CoV-2 in Wuhan - implications for the ability to produce long-lasting protective antibodies against SARS-CoV-2, medRxiv
Persistence and decay of human antibody responses to the receptor binding domain of SARS-CoV-2 spike protein in COVID-19 patients
Secondary bacterial infections associated with influenza pandemics

Properties used in the optimal interval construction: (1) there is a one-to-one mapping between tests and confidence intervals; (2) a uniformly most accurate (UMA) confidence region minimizes the probability of false coverage; (3) by using Property 1, the UMA set is found by inverting the uniformly most powerful (UMP) test. According to the Neyman-Pearson lemma [63], the likelihood ratio test is the UMP when both the hypothesis H₀ and the alternative H_A are simple (not composite). The UMP also exists for a composite H_A if the underlying distributional family has the so-called monotone likelihood ratio property.

Acknowledgements. We thank Heather Battey and Yoshi Uchida for reading and commenting on the manuscript, and Allen Caldwell for providing further information on their study [62].

Notes. Open source Python code which reproduces all algorithms, figures and tables shown here, and beyond, is available at: github.com/mieskolainen/covidgen.