Low incidence rate of COVID-19 undermines confidence in estimation of the vaccine efficacy

Yasin Memari

25 January 2021

Abstract

Knowing the true effect size of clinical interventions in randomised clinical trials is key to informing public health policies. Vaccine efficacy is defined in terms of the relative risk, or the ratio of two disease risks. However, only approximate methods are available for estimating the variance of the relative risk. In this article, we show using a probabilistic model that uncertainty in the efficacy rate could be underestimated when the disease risk is low. Factoring in the baseline rate of the disease, we estimate broader confidence intervals for the efficacy rates of the vaccines recently developed for COVID-19. We propose new confidence intervals for the relative risk. We further show that sample sizes required for phase 3 efficacy trials are routinely underestimated, and propose a new method for sample size calculation where the efficacy is of interest. We also discuss the deleterious effects of classification bias, which is particularly relevant at low disease prevalence.

Introduction

Vaccines are seen as the best control measure for the coronavirus pandemic. In this context, understanding the true efficacy of vaccines and clinical interventions is crucial. Randomised clinical trials are conducted to systematically study the safety and the efficacy of an intervention in a subset of the population before it is widely used in the general population. In placebo-controlled vaccine trials, participants are randomised into vaccinated and unvaccinated groups, where cases of the disease or infection are allowed to accrue over time. In planning a clinical trial, advance sample size calculation determines the size of the trial population needed to detect a minimal clinically relevant difference between the two groups, if such a difference exists.

The usual indicator of the effectiveness of a vaccine is the reduction of cases in the vaccinated group relative to the control group. However, it is sometimes naively assumed that trial participants who do not experience the event provide no information. Consequently, the event rate, or the incidence rate of the disease, receives inadequate attention. For rare diseases, it is often simply accepted that the accrual of cases takes longer. Human clinical trials are also an area where theory and practice are seldom consistent, as experiments in human populations are hardly fully controlled, not least due to unrealistic assumptions, loss to follow-up, noncompliance, heterogeneity of the treatment effect and the trial population, etc. [1] Therefore, it is not uncommon that, by the time an interim analysis declares a significant finding, the original assumptions used to define the statistical power of the study and the sample size have been neglected.

In this article our interest is in evaluating the impact of the event rate, insofar as it could affect the estimation of the efficacy rate. We show that a low incidence rate of the disease could lead to overestimation of confidence in the estimated efficacy rates. We propose a new method for the posterior probability of the vaccine efficacy that has a more subtle relationship with the event rate. Using our approach, we obtain broader confidence intervals for the efficacy of the vaccines recently developed for COVID-19.
Based on our findings, we propose new confidence intervals for the relative risk. A new method for sample size calculation in controlled efficacy trials is proposed which is more robust at low disease prevalence. Also highlighted is the impact of classification bias, which could have large consequences when the disease risk is low.

Methods

Vaccine efficacy is defined as the proportionate reduction in the risk of disease or infection in a vaccinated group compared to an unvaccinated group. It is defined as (1 − RR) × 100% in terms of the relative risk, or the risk ratio, RR = π_v/π_c, where π_v and π_c are the incidences of the disease among those exposed in the vaccinated and control groups respectively. Throughout this paper we use the terms incidence rate, disease risk, prevalence and event rate interchangeably. It is important to remember that the variables π_v and π_c are scaled binomials, as they represent sample proportions. Assuming equal person-time exposure in the two groups, the efficacy is often summarised in terms of the numbers of cases in the vaccinated and unvaccinated groups, t_v and t_c respectively:

α = 1 − π_v/π_c = 1 − t_v/t_c.    (1)

It appears in the literature that only approximate methods are available for the variance of the ratio of two binomial parameters [2, 3]. The consensus method commonly used to assign confidence intervals to the risk ratio is credited to Katz et al [3]. The method is based on the asymptotic normality of the logarithm of the ratio of two binomial variables. Assuming independence of the incidence rates, it follows that var(log(π_v/π_c)) = var(log(π_v)) + var(log(π_c)). Using a Taylor series, the variances are approximated as var(log(π)) ≈ var(π)/π², where the Wald method is often used to set var(π). Then two-sided 95% confidence intervals on the efficacy (e.g. see [4-7]) can be written as

100% × ( 1 − exp( ln(t_v/t_c) ± 1.96 √( 1/t_v − 1/n_v + 1/t_c − 1/n_c ) ) ).    (2)

Hereafter we refer to equation 2 as the pooled Wald approximation. We will show that the method underestimates the variance especially when the incidence rate is low.

Equation 2 sets out the large sample asymptotic variance of the risk ratio. However, the Wald method used to define var(π) is known to be unreliable when π is small. One may use alternative binomial proportion confidence intervals; however, log normality of the ratio might not hold and the variance of (the logarithm of) the ratio may be irreducible. Hightower et al [5] raised questions about the credibility of the confidence limits when the efficacy is high and the disease risk is low. Also, O'Neill [8] noted that, when t ≪ n, the variance of ln(RR) in equation 2 remains fairly stable and quickly converges to 1/t_v + 1/t_c.

Ratio distributions are known to have heavy tails and often no finite variance. If one were to model the likelihood function for the efficacy defined in equation 1 in terms of independent incidence rates, the choice of the prior probabilities for π_v and π_c would be critical. One can readily verify that the variance of the ratio of two binomial distributions increases as the binomial probabilities decrease. Uninformative priors could simply cancel out in the division, and the dependence of the posterior on the prevalence would not become obvious. Analytical solutions using independent incidence rates may also be hard to obtain. For an analytical solution, we model the efficacy in terms of conditional probabilities of the disease risks. Independence of the probabilities of the incidence rates is neither necessary nor ideal when calculating the efficacy, as equation 1 imposes a constraint on the two variables.
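As a reference point for what follows, equations 1 and 2 translate directly into a short R function. This is a minimal sketch; the function name and the example counts are illustrative, not taken from the trial data.

```r
# Point estimate (equation 1) and pooled Wald 95% CI (equation 2)
# for the vaccine efficacy, given case counts and group sizes.
efficacy_wald_ci <- function(tv, tc, nv, nc, conf = 0.95) {
  z  <- qnorm(1 - (1 - conf) / 2)
  rr <- (tv / nv) / (tc / nc)               # risk ratio
  se <- sqrt(1/tv - 1/nv + 1/tc - 1/nc)     # sd of log(RR), equation 2
  ci <- 1 - exp(log(rr) + c(1, -1) * z * se)  # map RR bounds to efficacy
  list(estimate = 1 - rr, lower = ci[1], upper = ci[2])
}

# Illustrative call with hypothetical counts:
efficacy_wald_ci(tv = 10, tc = 100, nv = 25000, nc = 25000)
```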
Under a binomial model with overall prevalence π = t/n in both groups and total population size n, the overall number of cases t = t_c + t_v follows t ∼ Bin(n, π); then, from equation 1, assuming t_c ∼ Bin(t, 1/(2 − α)), we expect t_c ∼ Bin(n, π/(2 − α)). Were we to use Poisson distributions for t and t_c, t_c conditional on t would still follow a binomial distribution. Modeling the efficacy in terms of conditional probabilities has previously been suggested [4]. This notation enables us to explicitly parametrise the likelihood function in terms of the prevalence, irrespective of the priors for π_v and π_c.

For a general solution accounting for classification bias, we assume an imperfect diagnostic procedure with sensitivity Se and specificity Sp. Then the fraction of individuals who test positive for the disease is the sum of the true positive rate and the false positive rate:

T = c_1 + c_2 π,    (3)

where c_1 = 1 − Sp is the false positive rate and c_2 = Se + Sp − 1. The posterior distribution of α given that t_c is binomial follows as

p(α | t_c, π, c_1, c_2) = p(t_c | α, π, c_1, c_2) f(π) p(α) / g(α),    (4)

where f(π) is the prior on π, the likelihood p(t_c | α, π, c_1, c_2) is the probability of t_c under Bin(n, c_1/2 + c_2 π/(2 − α)), and we have assumed a uniform prior on the efficacy, α ∼ Unif(0, 1). For a complete solution, the marginal likelihood g(α) can be written in terms of the incomplete beta function (see e.g. [9]). As we do not intend to impose a prior on the prevalence, f(π) in equation 4 cancels out and our analysis, in essence, is likelihood based. One needs to remember that the posterior in equation 4, as it was derived from the second equality in equation 1, is valid only when the individuals are equally divided between the two groups.

The mode of the posterior of α is obtained by setting the derivative of the log likelihood to zero, i.e. ∂ℓ/∂α = ∂ln(p(α | t_c, π))/∂α = 0. This leads to

α̂_mode = 2 − c_2 π / ( t_c/n − c_1/2 ),    (5)

which corresponds to the maximum likelihood estimator (MLE). The Cramér-Rao bound expresses a lower bound on the variance of any unbiased estimator of α in terms of the inverse of the Fisher information,

var(α̂) ≥ 1/I(α),    (6)

where the Fisher information I(α) is obtained as

I(α) = n c_2² π² / ( (2 − α)⁴ p (1 − p) ).    (7)

Here p = c_1/2 + c_2 π/(2 − α) is the success probability of the binomial likelihood for t_c. We will show that the conditional binomial model has a more subtle dependence on π compared to the pooled Wald method. Under certain regularity conditions and assuming asymptotic normality near the MLE, 95% confidence intervals on α_mode can be estimated as

α̂_mode ± 1.96 / √( I(α̂_mode) ).    (8)

However, as the posterior distribution is asymmetric, especially when the efficacy is high, and the intervals could lie outside [0, 1], we will estimate the credible intervals computationally.

Results

The posterior probability of vaccine efficacy, given in its simplest form in equation 4, is ready for inspection. Using binomial notation is particularly useful in enabling us to directly plug the numbers n and t_c into the estimation of α. In this section we evaluate the impact of the incidence rate on the efficacy and assign new confidence bounds to the efficacy of COVID-19 vaccines. Firstly, we assume a diagnostic test with perfect sensitivity and specificity, i.e. Se = Sp = 1. In the absence of misclassification, the mode of the posterior in equation 5 corresponds to the expectation α̂ = 1 − t_v/t_c. The larger n, the smaller the variance of the posterior; however, for a fixed n, the variance depends on π. Figure 1 shows the posterior probability of α plotted over a range of π, for a fixed n on the left hand side and for a fixed t on the right hand side, assuming true vaccine efficacies of 70% and 90% respectively. Also plotted as vertical lines are the independent 95% confidence intervals from equation 2.
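The computational estimation can be sketched as a grid approximation of equation 4, assuming the binomial likelihood t_c ∼ Bin(n, c_1/2 + c_2 π/(2 − α)) stated above. The function and argument names below are illustrative and are not the paper's Appendix code.

```r
# Grid approximation of the posterior in equation 4, with a uniform
# prior on alpha and likelihood t_c ~ Bin(n, c1/2 + c2*prev/(2 - alpha)).
efficacy_posterior <- function(tc, n, prev, Se = 1, Sp = 1,
                               alpha_grid = seq(0, 0.9999, length.out = 1e4)) {
  c1 <- 1 - Sp                                  # false positive rate
  c2 <- Se + Sp - 1
  p  <- c1 / 2 + c2 * prev / (2 - alpha_grid)   # binomial success probability
  ll <- dbinom(tc, size = n, prob = p)          # likelihood at each alpha
  data.frame(alpha = alpha_grid, posterior = ll / sum(ll))
}

# MAP and central 95% credible interval extracted from the grid:
credible_summary <- function(post, level = 0.95) {
  cdf <- cumsum(post$posterior)
  list(map   = post$alpha[which.max(post$posterior)],
       lower = post$alpha[which(cdf >= (1 - level) / 2)[1]],
       upper = post$alpha[which(cdf >= 1 - (1 - level) / 2)[1]])
}
```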
As the event rate falls, the posterior distributions and the confidence intervals become wider; however, for a fixed t (right plot) the Wald intervals are stable over a wide range of π, and more so when the efficacy is high. The proposed conditional binomial model better represents the variability at low prevalence.

Figure 1. The left hand plot assumes a fixed n = 50,000 while the right hand plot is for a fixed t = 2,000. The general trend holds for different values of the parameters. The Wald method overstates the confidence in the efficacy when t ≪ n.

Three clinical trials of the vaccines designed to prevent COVID-19 recently published their interim phase 3 analysis results [10-12], with two of them reporting remarkably narrow 95% confidence bounds on the efficacy. The reported case numbers and the efficacy rates for the primary end points are provided in Table 1. Firstly, we note that, although the trials used different models and priors on the efficacy, the reported confidence intervals correspond almost perfectly with those obtained from equation 2. At large n the posterior is clearly dominated by the data, and the Bayesian and frequentist estimates are equivalent. Furthermore, especially where the efficacy is high, the pooled Wald confidence intervals hardly vary with the choice of n. If one were to use different values for n in Table 1, equation 2 would still give the same confidence intervals over a large range of the values. Therefore, the uncertainty caused by n_v and n_c is not accounted for.

We re-estimate the confidence intervals using the conditional binomial model presented in the Methods. Using the case numbers reported, the likelihood of the data in equation 4 is obtained by setting the prevalence to π = T = t/n. Then the maximum a posteriori (MAP) estimates and 95% credible intervals for the efficacy rates are calculated computationally. The results shown in Table 1 are contrasted with those reported. Although the estimated modes are the same, our credible intervals are wider. Incorporating the incidence rates has removed the overwhelming confidence originally assigned to the point estimates. Note that our approach requires the trial participants to be equally divided between the vaccinated and unvaccinated groups, which is roughly the case here. Figure 2, in red, shows the posterior probabilities and the credible intervals for the COVID-19 vaccines. Of note is that, if we were to hypothetically assume π = t/n = 1, the posterior in equation 4 would produce the same intervals as those reported by the vaccine trials and the Wald approximation. Moreover, an independent binomial model with uninformative (e.g. uniform) priors for π_v and π_c would produce the pooled Wald intervals.
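For illustration, the sketch functions defined above can be applied to hypothetical case counts of the same order of magnitude as the trial reports (these are not the Table 1 values):

```r
# Hypothetical counts, not the actual trial data:
tv <- 10; tc <- 160; n <- 40000
prev <- (tv + tc) / n                  # set pi = t/n as in the text

post <- efficacy_posterior(tc = tc, n = n, prev = prev)
credible_summary(post)                 # MAP and 95% credible interval
```

With these counts the MAP equals 1 − t_v/t_c = 0.9375; per the analysis above, the resulting credible interval is expected to be wider than the corresponding pooled Wald interval.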
So far we have assumed no bias in the classification of cases; however, an imperfect diagnostic procedure could lead to misclassification of infected and uninfected individuals. In this section we examine the effect of classification bias on the estimation of the efficacy. It is worth noting that equation 3 requires the observed infection rate T to be greater than the false positive rate c_1 = 1 − Sp. This relates to the 'false positive paradox', which implies that the accuracy of a diagnostic test is compromised if the test is used in a population where the incidence of the disease is lower than the false positive rate of the test itself. Furthermore, false negatives could dominate at low incidence rates. When the disease risk is low, as the majority of the tests are negative, a small false negative rate could lead to a situation where false negatives outnumber the positive cases. These concepts are further explained in Note 1.

Figure 3 illustrates the effect of classification bias on the posterior probability of the vaccine efficacy. The left plot shows the impact of a very small reduction in specificity to 0.999 (or increase in the false positive rate), while the right hand plot shows the effect of a reduction in sensitivity to 0.95 (or increase in the false negative rate). A small loss of specificity could lead to serious underestimation of the effect size, as noted by [6, 7], but it could further lead to complete loss of precision when the incidence rate is low. Loss of sensitivity results in overestimation of the efficacy irrespective of the disease rate.

Figure 3. Effect of an imperfect diagnostic procedure. Misclassification error biases the vaccine efficacy rate. The left plot shows the distributions for Se = 1 and Sp = 0.999, while the right plot is for Se = 0.95 and Sp = 1, with n = 50,000 in both. The true efficacy rate is assumed to be 70%. Imperfect specificity, however small, could have disastrous effects when the incidence rate is low, whereas lack of sensitivity consistently inflates the efficacy rate.

In these plots, we have considered a larger reduction in sensitivity, not only because a reduction in specificity has a more dramatic effect, but also because diagnostic assays typically have relatively higher specificity than sensitivity, not least due to specimen collection, insufficient viral load, stage of the disease, etc. [13] However, the effect of a loss of sensitivity is consistently toward shifting the mode in equation 5, or the MAP, to higher values of α, even at low incidence rates where the negative predictive value is high.

Discussion

Base rate fallacy happens in situations where base rate information is ignored in favour of individuating information. In probability terms, it often occurs when P(A|B) is confused or used interchangeably with P(B|A), ignoring the prior probability P(A); e.g. the probability of having a rare disease given a positive test is wrongly equated to the probability of a positive test given the disease (the diagnostic sensitivity), ignoring the low prior probability of the disease itself. We showed that, in estimation of the vaccine efficacy when the disease rate is low, not only could diagnostic error have deleterious effects, but failure to appropriately integrate information about the base rate, or incidence rate, of the disease into the calculation could also lead to underestimation of the uncertainty.

Vaccine efficacy is defined in terms of the risk ratio π_v/π_c, that is, the ratio of two binomial proportions. Ratio distributions are known to have undefined variances; nevertheless, the pooled Wald method has traditionally been used to approximate the variance of the risk ratio. In this article, we used a parametrisation that makes the dependence of the efficacy on the disease prevalence explicit, without recourse to priors for π_v and π_c.

Note 1. J. Balayla [14] noted that there exists a prevalence threshold below which the positive predictive value (PPV) of a diagnostic test drops precipitously relative to the prevalence. This means that at too low a prevalence a positive test result could more likely be a false positive than a true positive. More underappreciated is the impact of the negative predictive value (NPV). Though, at low incidence rates, the negative predictive value is nearly 100%, a small loss in sensitivity could still have a marked effect, as the negative tests vastly outnumber the positive tests. We could even have a situation where the false negatives are more than the true and false positives. To avoid these pitfalls, the participants are pre-selected for their symptoms before confirmation with the assay. Though this raises the pre-test probability, it could cause collider bias [15].
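To make Note 1 concrete, the predictive values follow from Bayes' theorem. The sketch below, with illustrative names and numbers, shows the PPV collapsing below the prevalence threshold while the NPV stays near 100%.

```r
# Predictive values of a diagnostic test as a function of prevalence
# (Bayes' theorem), illustrating the prevalence threshold of Note 1.
predictive_values <- function(prev, Se, Sp) {
  ppv <- Se * prev / (Se * prev + (1 - Sp) * (1 - prev))
  npv <- Sp * (1 - prev) / (Sp * (1 - prev) + (1 - Se) * prev)
  c(PPV = ppv, NPV = npv)
}

predictive_values(prev = 0.001, Se = 0.95, Sp = 0.999)  # PPV below 50%

# Expected counts per 100,000 tests at this prevalence: false positives
# outnumber true positives, while the NPV remains nearly 100%.
n <- 1e5; prev <- 0.001; Se <- 0.95; Sp <- 0.999
c(true_pos  = n * prev * Se,
  false_neg = n * prev * (1 - Se),
  false_pos = n * (1 - prev) * (1 - Sp))
```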
Particularly, improper priors for π_v and π_c could lead to underestimation of the variance. We conditioned t_c on t = t_c + t_v and treated t as another random variable. The resulting compound probability t_c ∼ Bin(n, π/(2 − α)) is over-dispersed and better captures the variability of the variance with π, whereas the pooled Wald confidence intervals are largely insensitive to π when π is small. The Wald method is intended as a large sample approximation; however, the bulk of the life sciences deals with small sample sizes. Therefore, it is likely that the confidence intervals reported in the literature for the risk ratio (and the odds ratio) are overly optimistic.

By analogy with equations 6 and 8, one could define new confidence intervals for the risk ratio by substituting RR = (n_c/n_v)(1 − α) for unequal sized groups in the Fisher information. The results can be written as

RR ± z (n_c/n_v) √( (1 + (n_v/n_c) RR)² (1 + (n_v/n_c) RR − π) / (nπ) ),    (9)

where z is the standard normal critical value, n = n_v + n_c and π = (t_v + t_c)/(n_v + n_c). The above intervals on the risk ratio are generally wider than, but converge to, the pooled Wald method when the sample size is large. They may be preferred to those obtained from equation 2 when the sample size is small or the relative risk is low. In particular, for a fixed sample size, as RR nears zero the upper bound in equation 9 remains conservative while the lower bound takes negative values and becomes undetermined. On the contrary, as RR nears zero, the pooled Wald intervals remain positive and shrink rapidly, giving the counterintuitive impression of increased precision when the incidence rate is low (similar to Figure 2). However, as with the Wald method, the confidence intervals in equation 9 were derived using a normal approximation which may not hold when RR deviates significantly from 1.

Our findings have implications for pre-planning the sample sizes of phase 3 efficacy trials. Sample size calculation in a case-control design is often stated as: "How many samples need to be randomised in order to conclude, with 100(1 − β)% power, that a treatment difference of size ∆ exists between the two groups at a significance level of α?". Therefore the calculation of the sample size requires specification of the null hypothesis (expected treatment effect) and the alternative hypothesis defined in terms of the difference in treatment outcomes. Here, α, or the type I error, is the probability of rejecting the null hypothesis where we should not, and β, or the type II error, is the probability of failing to reject the null hypothesis where we should reject it. Under the assumption of normality of the treatment outcome, a generic formula for the per-group sample size is derived in terms of the two-sample t-test [1]:

n = 2σ² (z_{1−α/2} + z_{1−β})² / ∆²,    (10)

where the z-scores determine the critical values of the standard normal distribution. Therefore one needs to specify the variance of the measured variable, the desired rates of error and the magnitude of the treatment difference. Where the measured variable is binary (infected or uninfected), the test statistic reduces to the test for the difference between two proportions.
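As a minimal sketch, equation 10 translates directly into R; the function name and the example values are illustrative.

```r
# Per-group sample size from the generic two-sample formula (equation 10):
# n = 2 * sigma^2 * (z_{1-a/2} + z_{1-b})^2 / Delta^2.
samplesize_generic <- function(sigma, delta, alpha = 0.05, beta = 0.2) {
  z <- qnorm(1 - alpha / 2) + qnorm(1 - beta)
  ceiling(2 * sigma^2 * z^2 / delta^2)
}

samplesize_generic(sigma = 1, delta = 0.25)   # about 252 per group
```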
Where the efficacy is of interest, the log normal approximation of the risk ratio from equation 2 may be used to define the test statistic. O'Neill [8] calculated the required sample sizes for a two-sided test given the pooled Wald variance in equation 2. We re-write the total sample size in this form:

n = (z_{1−α/2} + z_{1−β})² (2 − VE)² / ( π (1 − VE) d² ),    (11)

where d = ln( ∆/(2(1 − VE)) + √( (∆/(2(1 − VE)))² + 1 ) ). Here VE is the anticipated efficacy and ∆ is the expected difference in VE in absolute terms. We showed, however, that at low prevalence rates equation 2 significantly underestimates the variance. Using an inadequately small variance could lead to underestimation of the type I and type II errors, potentially resulting in a winner's curse in underpowered studies [16, 17]. If instead we were to use the proposed compound binomial model, one could simply substitute the variance in equation 6. As in [8], under the assumption of normality and assuming ∆ is the difference between the upper and lower limits of the confidence interval, substituting the margin of error as ∆/2 = zσ in equation 6 gives

n = 4 (z_{1−α/2} + z_{1−β})² (2 − VE)² (2 − VE − π) / ( π ∆² ).    (12)

This equation sets out the total required sample size for a perfect diagnostic test, to be equally divided between the two groups. The proposed Cramér-Rao bound based formula 12 assumes normality of the distributions of the null and the alternative hypotheses; however, the binomial likelihood function is asymmetric, as are the pooled Wald intervals (see [8]), and becomes more so as the efficacy increases.

Notwithstanding the limitations, we plug the critical values for α = 0.05 and a power of 100(1 − β) = 80 per cent (z_{1−α/2} = 1.96 and z_{1−β} = 0.84) into equations 11 and 12. The resulting sample sizes are plotted in Figure 5 for ∆ = 10% and different prevalence and efficacy rates. In Figure 5 the relationship between the sample size and the incidence rate looks linear on a log-log scale, as they have a power law relationship. However, while the two methods coincide at high incidence rates, the pooled Wald method significantly underestimates the sample sizes at low incidence rates, especially when the efficacy is high (note that the y-axis is on a logarithmic scale). Contrasting Figure 5 with the case rates in Table 1, it is clear that, to achieve the narrow confidence bounds that Pfizer and Moderna reported, they would have needed several times more samples under the pooled Wald method, and an order of magnitude more under the Cramér-Rao bound. If the event rate were to differ from that in the general population, or if the possibility of misclassification were non-negligible, such a discrepancy in incidence rates could cause such large variations in the variance that the trial population could be unrepresentative of the larger population. Table 2 provides the total sample sizes from the Cramér-Rao bound formula 12 for different levels of efficacy and effect size. It is clear that the sample size is also very sensitive to the choice of ∆; therefore an investigator must be wary of misspecification of the anticipated treatment difference [1].
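The two sample size formulas can be sketched in R as follows, under the forms of equations 11 and 12 given above; the function names are illustrative and may differ from the paper's Appendix functions.

```r
# Total sample sizes for an efficacy trial with equal-sized groups.
# samplesize_wald follows equation 11 (pooled Wald / O'Neill) and
# samplesize_crb equation 12 (Cramer-Rao bound), as written above.
samplesize_wald <- function(VE, delta, prev, alpha = 0.05, beta = 0.2) {
  z <- qnorm(1 - alpha / 2) + qnorm(1 - beta)
  d <- asinh(delta / (2 * (1 - VE)))   # = log(x + sqrt(x^2 + 1))
  ceiling(z^2 * (2 - VE)^2 / (prev * (1 - VE) * d^2))
}

samplesize_crb <- function(VE, delta, prev, alpha = 0.05, beta = 0.2) {
  z <- qnorm(1 - alpha / 2) + qnorm(1 - beta)
  ceiling(4 * z^2 * (2 - VE)^2 * (2 - VE - prev) / (prev * delta^2))
}

# The two roughly coincide as prev -> 1 and diverge at low prevalence:
samplesize_wald(VE = 0.9, delta = 0.1, prev = 0.01)
samplesize_crb(VE = 0.9, delta = 0.1, prev = 0.01)
```

With VE = 0.9, ∆ = 0.1 and π = 0.01, the Cramér-Rao based size is roughly an order of magnitude larger than the pooled Wald one, in line with the comparison drawn from Figure 5.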
Throughout the Methods, we incorporated the misclassification error in the calculations in order to emphasise the importance of accounting for classification bias when the disease is rare. We showed that, while lack of diagnostic sensitivity consistently inflates the estimated efficacy rates, imperfect specificity results in serious loss of accuracy and precision at low disease risks. Case definition for COVID-19 is a particular caveat. The three vaccine trials broadly follow the FDA definition of the disease. For primary end points, symptomatic cases are identified by surveillance or are self-reported, and are subsequently confirmed with RT-PCR. Pre-selection of the participants for the PCR assay could create the possibility of collider bias [15]. Moreover, the highly non-specific symptoms of COVID-19, which include symptoms as common as cough and congestion, could create the perfect conditions for misclassification. False negatives due to e.g. selective reporting, specimen collection, etc., and PCR false positives due to e.g. remnant viral RNA, etc., could be introduced if the test is not repeated [13, 18]. Much remains unknown about COVID-19 and its many symptoms and presentations. Therefore, it is recommended to account for classification bias in the calculation.

Code availability

The code for calculating the posterior probability of the vaccine efficacy, which can simultaneously marginalise over the diagnostic sensitivity and specificity, is provided. R code for the posterior probability of the efficacy was modified from code published in [9]. It is provided in the Appendix along with functions to calculate the sample sizes from equations 11 and 12.

References

[1] Sample Size Calculations for Randomized Controlled Trials.
[2] Approximate interval estimation of the ratio of binomial parameters: a review and corrections for skewness.
[3] Obtaining confidence intervals for the risk ratio in cohort studies.
[4] Comparing methods for calculating confidence intervals for vaccine efficacy.
[5] Recommendations for the use of Taylor series confidence intervals for estimates of vaccine efficacy.
[6] Sensitivity, specificity, and vaccine efficacy.
[7] Impact of diagnostic methods on efficacy estimation: a proof-of-principle based on historical examples.
[8] On sample sizes to estimate the protective efficacy of a vaccine.
[9] Estimating prevalence using an imperfect test.
[10] Safety and efficacy of the ChAdOx1 nCoV-19 vaccine (AZD1222) against SARS-CoV-2: an interim analysis of four randomised controlled trials in Brazil, South Africa, and the UK. Lancet.
[11] Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine.
[12] Efficacy and safety of the mRNA-1273 SARS-CoV-2 vaccine.
[13] False-negative results of initial RT-PCR assays for COVID-19: a systematic review.
[14] Prevalence threshold and the geometry of screening curves.
[15] Collider bias undermines our understanding of COVID-19 disease risk and severity.
[16] Why most discovered true associations are inflated.
[17] Power failure: why small sample size undermines the reliability of neuroscience.
[18] Bayesian updating and sequential testing: overcoming inferential limitations of screening tests.

Acknowledgements

The author's position at the University of Cambridge is funded by CRUK grant C60100/A23916. The author would like to thank the Cancer Mutagenesis group at the MRC Cancer Unit for their helpful comments.