Bayesian sample size determination for diagnostic accuracy studies

Kevin J. Wilson, S. Faye Williamson, A. Joy Allen, Cameron J. Williams, Thomas P. Hellyer, B. Clare Lendrem

2021-08-19

Abstract: The development of a new diagnostic test ideally follows a sequence of stages which, amongst other aims, evaluate technical performance. This includes an analytical validity study, a diagnostic accuracy study and an interventional clinical utility study. Current approaches to the design and analysis of the diagnostic accuracy study can suffer from prohibitively large sample sizes and interval estimates with undesirable properties. In this paper, we propose a novel Bayesian approach which takes advantage of information available from the analytical validity stage. We utilise assurance to calculate the required sample size based on the target width of a posterior probability interval, and can choose to use or disregard the data from the analytical validity study when subsequently inferring measures of test accuracy. Sensitivity analyses are performed to assess the robustness of the proposed sample size to the choice of prior, and prior-data conflict is evaluated by comparing the data to the prior predictive distributions. We illustrate the proposed approach using a motivating real-life application involving a diagnostic test for ventilator associated pneumonia. Finally, we compare the properties of the proposed approach against commonly used alternatives. The results show that, by making better use of existing data from earlier studies, the assurance-based approach can not only reduce the required sample size when compared to alternatives, but can also produce more reliable sample sizes for diagnostic accuracy studies.

1 Introduction

Diagnostic accuracy studies evaluate the ability of a diagnostic test (the index test) to correctly identify patients with and without a target condition. This is typically achieved by prospectively comparing results from the index test to the true disease status obtained from the best available reference standard for a cohort of patients. The two main measures used to assess intrinsic diagnostic accuracy are sensitivity and specificity. For a test to proceed to the next stage of evidence development, it is important that these measures are estimated to an appropriate degree of accuracy. This hinges on the sample size chosen for the diagnostic accuracy study. Too small a sample size will lead to imprecise estimates with wide corresponding intervals, which are uninformative for the decision maker and contribute to research waste [Ioannidis et al., 2014]. Conversely, too large a sample size may delay the results of the study due to longer recruitment times and resource limitations, in addition to financial and ethical implications [Altman, 1980]. Consequently, choosing a sample size which strikes a balance between accuracy and efficiency is a crucial step in the design of any diagnostic accuracy study. Traditional sample size calculations are based on a hypothesis-testing framework. The idea is to choose a sample size such that the probability of rejecting the null hypothesis when there is a clinically relevant difference is greater than a required power (typically 80% or 90%) with a specified type I error rate (typically 5% for a two-sided test) [Kunzmann et al., 2021].
However, a sample size which captures the precision of the measure of interest, by targeting a desirable width of the corresponding confidence interval, can be more appropriate in certain circumstances [Jiroutek et al., 2003, Jia and Lynn, 2015]. This is pertinent in early clinical diagnostic studies, where the aim is to estimate test accuracy with sufficient precision, and it is the approach adopted here. In this paper, we consider the sample size problem from a Bayesian perspective and propose a novel approach, referred to as the Bayesian Assurance Method (BAM), to determine sample sizes for diagnostic accuracy studies. In doing so, we explore whether utilising information from the preceding laboratory study will reduce the sample size in the diagnostic accuracy study, and thus lead to a more efficient development process. This may be important if there is a need to deploy accurate diagnostic tests rapidly, such as in response to the COVID-19 pandemic.

2 Inference in a diagnostic accuracy study

We consider a diagnostic accuracy study to assess an index test under development. In the study, we observe the numbers of individuals in a 2 × 2 contingency table (Table 1(i)). The number of individuals with and without the disease is assumed to be known, based on a reference test. The intrinsic accuracy of the index test can be measured by its sensitivity and specificity, defined as the probability of a positive test given disease and the probability of a negative test given no disease, respectively.

Table 1: (i) The generic 2 × 2 contingency table for a diagnostic accuracy study; (ii) results of the biomarker selection study for IL-1β; (iii) results of the diagnostic accuracy study for IL-1β.

(i)
                 Disease      No Disease   Total
Test Positive    $n_{1,1}$    $n_{1,2}$    $n_{1,T}$
Test Negative    $n_{2,1}$    $n_{2,2}$    $n_{2,T}$
Total            $n_{T,1}$    $n_{T,2}$    $n_T$

(ii)
                 VAP   No VAP   Total
Test Positive    16    35       51
Test Negative     1    20       21
Total            17    55       72

(iii)
                 VAP   No VAP   Total
Test Positive    51    55       106
Test Negative     2    42        44
Total            53    97       150

There are two approaches used to model the numbers of individuals in the cells of the 2 × 2 table: assuming either binomial or multinomial likelihoods. In the first case, $n_{1,1} \mid \lambda, n_{T,1} \sim \text{Bin}(n_{T,1}, \lambda)$ and $n_{2,2} \mid \theta, n_{T,2} \sim \text{Bin}(n_{T,2}, \theta)$, where $\lambda$ is the sensitivity and $\theta$ is the specificity of the index test. The conjugate prior distributions are $\lambda \sim \text{Beta}(a_\lambda, b_\lambda)$ and $\theta \sim \text{Beta}(a_\theta, b_\theta)$. If we assume in the prior that the sensitivity and specificity are independent, then their posterior distributions are $\lambda \mid \boldsymbol{n} \sim \text{Beta}(a_\lambda + n_{1,1},\, b_\lambda + n_{2,1})$ and $\theta \mid \boldsymbol{n} \sim \text{Beta}(a_\theta + n_{2,2},\, b_\theta + n_{1,2})$. The independence assumption will often be reasonable since the diagnostic thresholds for the test are fixed at this stage, and the sensitivity and specificity consider mutually exclusive populations of patients.

In the second case, we consider the vector $\boldsymbol{n} = (n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$ and assume $\boldsymbol{n} \mid \boldsymbol{\gamma} \sim \text{Multi}(n_T, \boldsymbol{\gamma})$, where $\boldsymbol{\gamma} = (\gamma_{1,1}, \gamma_{1,2}, \gamma_{2,1}, \gamma_{2,2})$ is a vector containing the probabilities of each cell of the contingency table. Here, the sensitivity and specificity are given by $\lambda = \gamma_{1,1}/(\gamma_{1,1} + \gamma_{2,1})$ and $\theta = \gamma_{2,2}/(\gamma_{1,2} + \gamma_{2,2})$. A typical form for the prior distribution is a Dirichlet distribution, which provides conjugacy. That is, $\boldsymbol{\gamma} \sim \text{Dir}(\boldsymbol{\alpha})$, where $\boldsymbol{\alpha} = (\alpha_{1,1}, \alpha_{1,2}, \alpha_{2,1}, \alpha_{2,2})$. It can be shown that the two approaches are equivalent in terms of inference for the sensitivity and specificity (see Appendix A). In this paper, we will use the binomial form as it allows for the direct specification of the priors for the sensitivity, specificity and prevalence. We will assume conjugate beta priors, as detailed above, throughout the rest of the paper.
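To make these conjugate updates concrete, here is a minimal sketch in R; the function name and the default flat Beta(1, 1) priors are our illustrative choices rather than anything prescribed by the paper:

```r
# Posterior inference for sensitivity and specificity from a 2x2 table,
# assuming independent conjugate beta priors (flat Beta(1, 1) by default).
post_accuracy <- function(n11, n12, n21, n22,
                          a_lam = 1, b_lam = 1, a_the = 1, b_the = 1,
                          alpha = 0.05) {
  probs <- c(alpha / 2, 0.5, 1 - alpha / 2)
  list(
    # lambda | n ~ Beta(a + n11, b + n21): true positives vs false negatives
    sensitivity = qbeta(probs, a_lam + n11, b_lam + n21),
    # theta | n ~ Beta(a + n22, b + n12): true negatives vs false positives
    specificity = qbeta(probs, a_the + n22, b_the + n12)
  )
}

# Example with the biomarker selection study counts in Table 1(ii)
post_accuracy(n11 = 16, n12 = 35, n21 = 1, n22 = 20)
```

Each returned vector gives the lower interval limit, posterior median and upper interval limit for the corresponding accuracy measure.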
Assurance is a Bayesian alternative to power for choosing a sample size. Consider a two-armed clinical trial in which a hypothesis test is to be conducted with $H_0: \delta = 0$ versus $H_1: \delta > 0$, where $\delta$ represents the difference in the effect of two treatments. A typical power calculation would choose a sample size to provide a certain statistical power at a particular assumed value $\delta_c$ for $\delta$, often taken to be the minimal clinically relevant difference. In this case, the power is $\Pr(\text{Reject } H_0 \mid \delta = \delta_c)$ and would increase with sample size. In practice, the choice of $\delta_c$ is relatively arbitrary. As the true effect size $\delta$ is unknown, this can result in conditioning on an event which is extremely unlikely. One approach to mitigate this is to conduct a sensitivity analysis, varying the value of $\delta_c$ and choosing a sample size which is robust to small perturbations [Matthews, 2006]. In the Bayesian context we can take an alternative approach, and represent our uncertainty over $\delta$ using a prior distribution $\pi(\delta)$. The assurance is the expected power of the hypothesis test with respect to this prior,
$$A(n) = E_\delta\left[\Pr(\text{Reject } H_0 \mid \delta)\right] = \int \Pr(\text{Reject } H_0 \mid \delta)\,\pi(\delta)\,\mathrm{d}\delta.$$
We choose to make the dependence on the sample size $n$ explicit for the assurance $A(\cdot)$. Assurance is not restricted to settings where we will perform a hypothesis test at the end of a trial. If we perform a Bayesian analysis instead, then we may declare the trial a success and the new treatment superior if, for example, $\Pr(\delta \le 0) \le 0.05$ in the posterior. In this case,
$$A(n) = E_\delta\left[\Pr(\text{Trial a success} \mid \delta)\right] = \int \Pr(\text{Trial a success} \mid \delta)\,\pi(\delta)\,\mathrm{d}\delta.$$
Thus, the assurance is the unconditional probability that the trial results in a successful outcome.

We use assurance to choose a sample size to estimate the sensitivity, specificity, or both, of the index test to a certain degree of accuracy. We initially focus on the sensitivity of the index test, $\lambda$, and consider two cases: assuring the width of the posterior probability interval (two-sided), and assuring the width of the lower half of the posterior probability interval (one-sided). Considering the inference from Section 2, a $100(1-\alpha)\%$ symmetric posterior probability interval for $\lambda$ is $(\lambda_L, \lambda_U)$, where the limits of the interval are defined such that $\Pr(\lambda \le \lambda_L \mid \boldsymbol{n}) = \alpha/2$ and $\Pr(\lambda \ge \lambda_U \mid \boldsymbol{n}) = \alpha/2$. The accuracy of the estimation of $\lambda$ can be considered as the width of this interval, $\lambda_U - \lambda_L$, and a successful diagnostic accuracy study would produce an interval with a width smaller than some target, $\lambda_U - \lambda_L \le w^*$.

Suppose the number of individuals with the disease in the study, $n_{T,1}$, is fixed. There are three possibilities: no values of $n_{1,1}$ lead to an interval with width smaller than $w^*$, all values of $n_{1,1}$ lead to an interval with width smaller than $w^*$, or some values of $n_{1,1}$ lead to an interval with width smaller than $w^*$. To investigate the third case, consider the posterior variance of $\lambda$,
$$\text{Var}(\lambda \mid \boldsymbol{n}) = \frac{(a_\lambda + n_{1,1})(b_\lambda + n_{2,1})}{(a_\lambda + b_\lambda + n_{T,1})^2\,(a_\lambda + b_\lambda + n_{T,1} + 1)}.$$
For a fixed sample size $n_{T,1}$, the denominator of this fraction is constant. That is, substituting $n_{2,1} = n_{T,1} - n_{1,1}$,
$$\text{Var}(\lambda \mid \boldsymbol{n}) \propto (a_\lambda + n_{1,1})(b_\lambda + n_{2,1}) = \text{constant} + n_{1,1}(b_\lambda - a_\lambda + n_{T,1}) - n_{1,1}^2.$$
The variance is quadratic in $n_{1,1}$ and the squared term has a negative coefficient. Thus, the posterior probability interval will be narrower than $w^*$ when $n_{1,1} \le c_1$ or $n_{1,1} \ge c_2$, for two critical numbers of individuals $c_1 < c_2$. We define this set as $N = \{n_{1,1} : n_{1,1} \le c_1 \text{ or } n_{1,1} \ge c_2\}$.
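The argument above translates directly into a numerical search for $(c_1, c_2)$. The following R sketch covers the two-sided case; the function name and return conventions are ours, with the edge cases encoding "all" or "no" values of $n_{1,1}$ meeting the target:

```r
# For a fixed diseased count n_T1 and a Beta(a, b) prior on the sensitivity,
# find c1 < c2 such that the 100(1 - alpha)% two-sided posterior interval
# has width <= w_star exactly when n11 <= c1 or n11 >= c2.
critical_values <- function(n_T1, a, b, w_star, alpha = 0.05) {
  n11 <- 0:n_T1
  width <- qbeta(1 - alpha / 2, a + n11, b + n_T1 - n11) -
    qbeta(alpha / 2, a + n11, b + n_T1 - n11)
  too_wide <- which(width > w_star)  # indices correspond to n11 + 1
  if (length(too_wide) == 0)         # every n11 achieves the target width
    return(c(c1 = n_T1, c2 = 0))
  if (length(too_wide) == n_T1 + 1)  # no n11 achieves the target width
    return(c(c1 = -1, c2 = n_T1 + 1))
  c(c1 = min(too_wide) - 2, c2 = max(too_wide))
}
```

Because the width is largest for middling $n_{1,1}$ (the quadratic variance argument above), the "too wide" set is a contiguous middle block, and its edges give $c_1$ and $c_2$.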
We now consider a $100(1-\alpha)\%$ posterior probability interval for $\lambda$ of the form $(\lambda_L, 1)$, where the lower limit of the interval is defined such that $\Pr(\lambda \le \lambda_L \mid \boldsymbol{n}) = \alpha$. We consider the distance between the lower limit of the interval and a central point estimate of $\lambda$, i.e. $\lambda_{0.5} - \lambda_L$, where $\lambda_{0.5}$ is the posterior median. A successful diagnostic accuracy study would result in this interval having a width smaller than some target, $\lambda_{0.5} - \lambda_L \le w^*$. By the same logic as the two-sided case, the posterior probability interval will be narrower than $w^*$ when $n_{1,1} \le c_1$ or $n_{1,1} \ge c_2$, for two critical numbers of individuals $c_1 < c_2$. Thus, we consider the set $N = \{n_{1,1} : n_{1,1} \le c_1 \text{ or } n_{1,1} \ge c_2\}$ for the one-sided case, with $c_1$ and $c_2$ determined by the one-sided width $\lambda_{0.5} - \lambda_L$.

We can obtain an expression for the assurance for a sample size $n_T$, conditional on a fixed number of diseased individuals $n_{T,1}$. This is denoted by $A_\lambda(n_T \mid n_{T,1})$ and defined as
$$A_\lambda(n_T \mid n_{T,1}) = \int \Pr(\text{Accuracy achieved} \mid \lambda)\,\pi(\lambda)\,\mathrm{d}\lambda = \sum_{n_{1,1} \in N} \binom{n_{T,1}}{n_{1,1}} \frac{\Gamma(a_\lambda + b_\lambda)}{\Gamma(a_\lambda)\Gamma(b_\lambda)} \cdot \frac{\Gamma(a_\lambda + n_{1,1})\,\Gamma(b_\lambda + n_{2,1})}{\Gamma(a_\lambda + b_\lambda + n_{T,1})},$$
where $\Gamma(\cdot)$ represents the gamma function. A derivation is given in Section A of the supplementary material. As the number of individuals with the disease, $n_{T,1}$, will not be known in advance, we need to sum over the possible values $n_{T,1}$ can take. If we have a random sample from the target population, then $n_{T,1} \mid \rho \sim \text{Bin}(n_T, \rho)$, where $\rho$ is the prevalence of the disease. Let $\rho \sim \text{Beta}(a_\rho, b_\rho)$ for some chosen values of $(a_\rho, b_\rho)$. The unconditional assurance is then
$$A_\lambda(n_T) = \sum_{n_{T,1}=0}^{n_T} A_\lambda(n_T \mid n_{T,1})\, f(n_{T,1}),$$
where $f(n_{T,1}) = \int f(n_{T,1} \mid \rho)\,\pi(\rho)\,\mathrm{d}\rho$ is the probability of observing $n_{T,1}$ individuals in the disease group. The assurance can thus be expressed as
$$A_\lambda(n_T) = \sum_{n_{T,1}=0}^{n_T} \binom{n_T}{n_{T,1}} \frac{\Gamma(a_\rho + b_\rho)}{\Gamma(a_\rho)\Gamma(b_\rho)} \cdot \frac{\Gamma(a_\rho + n_{T,1})\,\Gamma(b_\rho + n_{T,2})}{\Gamma(a_\rho + b_\rho + n_T)}\, A_\lambda(n_T \mid n_{T,1}). \qquad (2)$$
This is derived in Section A of the supplementary material. All that remains is to find the values of $(c_1, c_2)$. For each fixed sample size, $n_T$, and number of diseased individuals, $n_{T,1}$, the values of $\lambda_L$, $\lambda_{0.5}$ and $\lambda_U$ will depend only on $n_{1,1}$ and, hence, the width of the interval will be a function of $n_{1,1}$, $W(n_{1,1})$, in both cases. Therefore, $c_1 = \operatorname{argmin}\{W(n_{1,1}) \ge w^*\} - 1$ and $c_2 = \operatorname{argmax}\{W(n_{1,1}) \ge w^*\} + 1$ for $n_1 < n_{T,1} < n_2$, where $n_1$ is a number below which the interval can never achieve the desired width and $n_2$ is a number above which the width of the interval is always below $w^*$. Hence, $A_\lambda(n_T \mid n_{T,1}) = 0$ for all $n_{T,1} \le n_1$ and $A_\lambda(n_T \mid n_{T,1}) = 1$ for all $n_{T,1} \ge n_2$.

To estimate the specificity of the index test to a given accuracy of $w^*$, we can derive the assurance in the same way, which results in an assurance analogous to that in equation (2). The details are given in Section A of the supplementary material. Finally, suppose we wish to estimate both the sensitivity and specificity to a particular accuracy. Consider different accuracy targets, $w^*_\lambda$ and $w^*_\theta$, for the sensitivity and specificity, respectively. In this case, the assurance for the sample size $n_T$ conditional on $n_{T,1}$ (and hence $n_{T,2}$, since $n_{T,2} = n_T - n_{T,1}$) is given by
$$A_{\lambda,\theta}(n_T \mid n_{T,1}) = \left[\sum_{n_{1,1} \in N_1} \binom{n_{T,1}}{n_{1,1}} \frac{\Gamma(a_\lambda + b_\lambda)\,\Gamma(a_\lambda + n_{1,1})\,\Gamma(b_\lambda + n_{2,1})}{\Gamma(a_\lambda)\Gamma(b_\lambda)\,\Gamma(a_\lambda + b_\lambda + n_{T,1})}\right] \left[\sum_{n_{2,2} \in N_2} \binom{n_{T,2}}{n_{2,2}} \frac{\Gamma(a_\theta + b_\theta)\,\Gamma(a_\theta + n_{2,2})\,\Gamma(b_\theta + n_{1,2})}{\Gamma(a_\theta)\Gamma(b_\theta)\,\Gamma(a_\theta + b_\theta + n_{T,2})}\right],$$
where $N_1$ contains the values $n_{1,1} \le c_1$ or $n_{1,1} \ge c_2$ that give a posterior interval narrower than $w^*_\lambda$ for the sensitivity, and $N_2$ contains the values $n_{2,2} \le \tilde{c}_1$ or $n_{2,2} \ge \tilde{c}_2$ that give a posterior interval narrower than $w^*_\theta$ for the specificity. To find the unconditional assurance, we sum over the possible values of $n_{T,1}$ to give
$$A_{\lambda,\theta}(n_T) = \sum_{n_{T,1}=0}^{n_T} A_{\lambda,\theta}(n_T \mid n_{T,1})\, f(n_{T,1}). \qquad (3)$$
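A direct, unoptimised implementation of equation (2) is sketched below, reusing critical_values() from the previous sketch; again the names are ours rather than the authors' released code:

```r
# Unconditional assurance for the sensitivity, equation (2): a beta-binomial
# mixture over the diseased count n_T1 and, within each n_T1, a beta-binomial
# probability that n11 falls in the success set N = {n11 <= c1 or n11 >= c2}.
assurance_sens <- function(n_T, a_lam, b_lam, a_rho, b_rho,
                           w_star, alpha = 0.05) {
  A <- 0
  for (n_T1 in 0:n_T) {
    # f(n_T1): beta-binomial weight induced by the prevalence prior
    f <- exp(lchoose(n_T, n_T1) +
               lbeta(a_rho + n_T1, b_rho + n_T - n_T1) - lbeta(a_rho, b_rho))
    cc <- critical_values(n_T1, a_lam, b_lam, w_star, alpha)
    n11 <- 0:n_T1
    in_N <- n11 <= cc["c1"] | n11 >= cc["c2"]
    # Beta-binomial probability of landing in the success set N
    pN <- sum(exp(lchoose(n_T1, n11[in_N]) +
                    lbeta(a_lam + n11[in_N], b_lam + n_T1 - n11[in_N]) -
                    lbeta(a_lam, b_lam)))
    A <- A + f * pN
  }
  A
}

# With the Section 5 priors (w_star = 0.16, alpha = 0.05), this should
# roughly reproduce the assurance of about 0.8 reported at n_T = 106:
assurance_sens(106, 25.9, 2.1, 29, 98, w_star = 0.16)
```

The log-scale beta-function calls (lbeta, lchoose) avoid overflow for the gamma-function ratios in equation (2).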
The proposed BAM is now summarised via the following steps:

1. Choose whether we wish to assure our estimate of sensitivity $\lambda$, specificity $\theta$, or both.
2. Choose a target width $w^*$ for the accuracy measure(s), a one- or two-sided posterior interval and a level $\alpha$ for the interval.
3. Specify the prior distributions for the chosen accuracy measure(s) and the prevalence $\rho$. We detail how to do this in the next section.
4. Use equation (2) or (3) (or see Section A of the supplementary material) to calculate the assurance for sample sizes $n_T = 1, 2, \ldots$.
5. Choose the minimum sample size $n^*_T$ to give the desired assurance.

Example: Suppose we wish to estimate both sensitivity and specificity to within 5%, with posterior probability 0.99 using a two-sided interval, i.e. $w^* = 0.05$ and $\alpha = 0.01$. We specify prior distributions for $\lambda$, $\theta$ and $\rho$, and use equation (3) to evaluate the assurance for sample sizes $n_T = 1, 2, \ldots$. To achieve the desired accuracy with a probability of at least 0.9, say, we choose the smallest value of $n_T$ which gives rise to an assurance greater than 0.9.

A diagnostic accuracy study is part of an extensive development process for the diagnostic test [see Graziadio et al., 2020, Figure 1]. Its main purpose is to estimate performance characteristics of the test, particularly the sensitivity and specificity, in the target population in a clinically relevant setting. Prior to the diagnostic accuracy study is the analytical validity phase, in which the test may still be under development and the data generated may be used to support regulatory approvals [Graziadio et al., 2020]. The validation conducted during this stage may test individuals from the target population. Consequently, the data produced can be used to inform the prior distributions in the diagnostic accuracy study. This assumes that the observations in the two stages are exchangeable, which may not always be reasonable. Therefore, in Section B of the supplementary material, we detail how the BAM can be used under weaker assumptions.

Consider the analytical validity testing. Suppose that a random sample of $n^0_T$ individuals was taken and the numbers in the cells of the 2 × 2 contingency table were $\boldsymbol{n}^0 = (n^0_{1,1}, n^0_{1,2}, n^0_{2,1}, n^0_{2,2})$. Using the inferential approach in Section 2, priors for the sensitivity, specificity and prevalence would be $\lambda \sim \text{Beta}(a^0_\lambda + n^0_{1,1},\, b^0_\lambda + n^0_{2,1})$, $\theta \sim \text{Beta}(a^0_\theta + n^0_{2,2},\, b^0_\theta + n^0_{1,2})$ and $\rho \sim \text{Beta}(a^0_\rho + n^0_{T,1},\, b^0_\rho + n^0_{T,2})$. These latter beta distributions can be used as priors for the diagnostic accuracy study. Although this does not negate the necessity of choosing the initial prior values $(a^0_\lambda, b^0_\lambda)$, $(a^0_\theta, b^0_\theta)$ and $(a^0_\rho, b^0_\rho)$, these will have a small effect on the sample size chosen if sufficient data are available from the analytical validity stage. This is explored further in the next section. The approach taken here is equivalent to using a power prior with the parameter quantifying the heterogeneity between the diagnostic study population and the analytical validity population set equal to one (representing homogeneous populations). In cases of heterogeneity between the two populations, a power prior could be used with this parameter taking a value in the range [0, 1]. For full details see Ibrahim et al. [2015]. In cases where it is controversial to use data from the analytical validity stage when inferring the sensitivity and specificity of the test, we could use a weaker prior in the analysis, but retain the original prior in the design to inform the sample size calculations. This is illustrated in Section B of the supplementary material.
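As a sketch of this prior construction (our own helper; the argument delta is the power-prior parameter, with delta = 1 recovering the fully exchangeable update):

```r
# Build analysis priors for the diagnostic accuracy study from analytical
# validity counts n0_*, discounted by a power-prior parameter delta in [0, 1].
# a0 and b0 hold the initial prior parameters for (lambda, theta, rho).
validity_priors <- function(n0_11, n0_12, n0_21, n0_22,
                            a0 = c(lam = 1, the = 1, rho = 1),
                            b0 = c(lam = 1, the = 1, rho = 1),
                            delta = 1) {
  n0_T1 <- n0_11 + n0_21   # diseased (reference-test positive)
  n0_T2 <- n0_12 + n0_22   # non-diseased
  list(
    sensitivity = c(a = a0[["lam"]] + delta * n0_11,
                    b = b0[["lam"]] + delta * n0_21),
    specificity = c(a = a0[["the"]] + delta * n0_22,
                    b = b0[["the"]] + delta * n0_12),
    prevalence  = c(a = a0[["rho"]] + delta * n0_T1,
                    b = b0[["rho"]] + delta * n0_T2)
  )
}
```

Setting delta below 1 down-weights the analytical validity data when the two populations are suspected to be heterogeneous, in the spirit of Ibrahim et al. [2015].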
The choice of initial prior parameters may have little effect on the assurance if sufficient data are observed at the analytical validity stage. We explore this using local sensitivity analysis and investigate the following two questions: (1) How does the optimal sample size, $n^*_T$, change with the values of the prior parameters? (2) How does the assurance at $n^*_T$, $A(n^*_T)$, change with the prior parameters? In particular, we vary the prior parameters $(a^0_C, b^0_C)$ for $C \in \{\lambda, \theta, \rho\}$ in turn over a range of values around their initial values, and record the smallest and largest values of the optimal sample size, $(\underline{n}^*_T, \overline{n}^*_T)$, and assurance, $(\underline{A}(n^*_T), \overline{A}(n^*_T))$. If these values do not differ by much, then the optimal sample size is relatively robust to the initial prior choice. Using the grid search approach [Roos et al., 2015] to determine an appropriate range of prior parameter values, we explore the sensitivity on a grid $G_{a^0,b^0}(\epsilon)$, where $\epsilon$ represents the distance between a prior and the original prior with parameters $(a^0, b^0)$. That is,
$$G_{a^0,b^0}(\epsilon) = \left\{(a, b) : H\!\left(\pi_{a,b}, \pi_{a^0,b^0}\right) = \epsilon\right\},$$
where $\pi_{a,b}(\gamma)$ represents the beta prior distribution with parameters $(a, b)$ and $\gamma$ is one of $\lambda$, $\theta$ and $\rho$. We use the Hellinger distance $H$ [Roos et al., 2015] which, for two beta distributions, can be expressed as
$$H\!\left(\pi_{a_1,b_1}, \pi_{a_2,b_2}\right) = \sqrt{1 - \frac{B\!\left(\frac{a_1+a_2}{2}, \frac{b_1+b_2}{2}\right)}{\sqrt{B(a_1,b_1)\,B(a_2,b_2)}}},$$
where $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$ is the beta function. To conduct the grid search, it is sensible to work in polar co-ordinates. Therefore, we set $a = a^0 + r\cos(\phi)$ and $b = b^0 + r\sin(\phi)$, where $r = \exp(z)$ so that the radius can be searched over on the log scale $z = \log(r)$. We search over the range $\phi \in [-\pi, \pi]$, solving for the value of $r$ which gives the correct value of $\epsilon$. From this grid search, we can then find the corresponding $(\underline{n}^*_T, \overline{n}^*_T)$ and $(\underline{A}(n^*_T), \overline{A}(n^*_T))$ for this $\epsilon$. We suggest a sensible choice of $\epsilon$ in Section 5.2.
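A sketch of this grid search in R (helper names are ours; the root-finding bracket assumes $\epsilon$ is small enough to be reached before a or b becomes negative):

```r
# Hellinger distance between Beta(a1, b1) and Beta(a2, b2), via the
# Bhattacharyya coefficient, computed on the log scale for stability.
hellinger_beta <- function(a1, b1, a2, b2) {
  bc <- exp(lbeta((a1 + a2) / 2, (b1 + b2) / 2) -
              0.5 * (lbeta(a1, b1) + lbeta(a2, b2)))
  sqrt(max(0, 1 - bc))
}

# Priors on the circle of Hellinger radius eps around Beta(a0, b0),
# parameterised by the polar angle phi.
grid_priors <- function(a0, b0, eps, n_phi = 16) {
  phi <- seq(-pi, pi, length.out = n_phi)
  t(sapply(phi, function(p) {
    g <- function(r) hellinger_beta(a0 + r * cos(p), b0 + r * sin(p),
                                    a0, b0) - eps
    r <- uniroot(g, lower = 1e-8, upper = 0.99 * min(a0, b0))$root
    c(a = a0 + r * cos(p), b = b0 + r * sin(p))
  }))
}

# e.g. the sensitivity prior of Section 5: grid_priors(25.9, 2.1, 0.00354)
```

Each row of the output is a perturbed prior at which the optimal sample size and assurance can be re-evaluated to obtain the reported ranges.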
Label the counts in the 2 × 2 table from the diagnostic accuracy study $\boldsymbol{n}^1 = (n^1_{1,1}, n^1_{1,2}, n^1_{2,1}, n^1_{2,2})$. The posterior distributions for the sensitivity and specificity (omitting the conditioning) will be $\lambda \sim \text{Beta}(a^2_\lambda, b^2_\lambda)$ and $\theta \sim \text{Beta}(a^2_\theta, b^2_\theta)$, respectively, where
$$a^2_\lambda = a^0_\lambda + n^0_{1,1} + n^1_{1,1}, \quad b^2_\lambda = b^0_\lambda + n^0_{2,1} + n^1_{2,1}, \quad a^2_\theta = a^0_\theta + n^0_{2,2} + n^1_{2,2}, \quad b^2_\theta = b^0_\theta + n^0_{1,2} + n^1_{1,2}.$$
The inference for the sensitivity and specificity is in the form of a weighted average of the prior and the observations, with weights determined by the relative sample sizes of each. The prior is made up of a weighted average of the observations in the analytical validity stage and the original prior. If all of the elements are in broad agreement, then the posterior distribution will provide an accurate summary of the properties of the index test in the population of interest. However, it could be the case that the prior and observations are not in agreement, which is known as prior-data conflict [Box, 1980, Schmidli et al., 2014]. For example, if the two studies are carried out at different times or in different locations, the spectrum of disease in the target population may not be the same. In this case, it is important to investigate why the differences are there and what action should be taken.

We can evaluate prior-data conflict by comparing the observations to the prior predictive distributions of the parameters. We consider the prior predictive distributions of the number of observations in the disease group, $n_{T,1}$, and, conditional on this, the number who test positive, $n_{1,1}$, of those with the disease, and the number who test negative, $n_{2,2}$, of those without the disease. These are given by
$$f(y) = \binom{n}{y} \frac{B(a + y,\, b + n - y)}{B(a, b)},$$
where $y$ is $(n_{T,1}, n_{1,1}, n_{2,2})$ in turn, $n$ is the corresponding sample size, i.e. $(n_T, n_{T,1}, n_{T,2})$, and $(a, b)$ are the beta distribution parameter values for the prevalence, sensitivity and specificity, respectively. We can then plot the prior predictive distributions and calculate probabilities of the form $\Pr(n \ge n_{\text{obs}})$, for an observed number of individuals $n_{\text{obs}}$. If the observed value lies in the body of the associated prior predictive distribution, then that prior is consistent with the data. Otherwise, this provides evidence of prior-data conflict.

Using published results [Conway Morris et al., 2010, Hellyer et al., 2015], we consider the development of a biomarker test for ventilator associated pneumonia (VAP). The development of the test involved four stages: an exploratory study to look at possible biomarkers for VAP diagnosis, a single centre observational study to choose suitable biomarkers, a multicentre diagnostic accuracy study to develop biomarker cut-offs and validate accuracy, and a randomised controlled trial of clinical utility. At each stage the target population was patients on a ventilator with suspected VAP. The reference standard test was the growth of pathogens at more than $10^4$ colony forming units per millilitre of bronchoalveolar fluid. All patients with suspected VAP receive antibiotics, although only 20-60% of patients will have VAP confirmed by the reference standard, leading to overuse of antibiotics. Microbiology culture and sensitivities take up to 72 hours to return results to clinicians, which delays the opportunity to discontinue antibiotics in patients who do not have infection. A rapid, highly sensitive biomarker test could allow for early stopping of antibiotics.

We consider planning the diagnostic accuracy study. The sample size was originally chosen to reduce the width of the 95% confidence interval for the post-test probability of VAP to 0.16, and resulted in $n_T = 150$. Estimates from the single centre observational study were used to calculate the sample size. The estimated sensitivity and prevalence in the single centre observational study were $\hat\lambda = 0.94$ and $\hat\rho = 0.24$, respectively, for the most promising biomarker, IL-1β. If instead the sample size had been chosen based on a confidence interval for the sensitivity, using the Wald interval [Zhou et al., 2011], a larger sample size of 196 would have been required. To use assurance to determine the sample size, we require the prior parameters for the sensitivity, $(a^0_\lambda, b^0_\lambda)$, and the prevalence, $(a^0_\rho, b^0_\rho)$, before the biomarker selection study. In the initial exploratory study, there were 55 patients, 12 of whom were confirmed by the reference test to have VAP. Assuming exchangeability, a suitable prior for the prevalence is $\rho \sim \text{Beta}(12, 43)$. The most promising biomarker gave an estimated sensitivity of 0.93. Since it was unclear which biomarker(s) would be used in the final test, it is not reasonable to make an exchangeability assumption for the test results in the two stages. A more suitable prior for the sensitivity is more diffuse but with a mean around this value, such as $\lambda \sim \text{Beta}(9.9, 1.1)$. These priors are represented by the dashed lines in Figure 1. In the biomarker selection study, the 2 × 2 contingency table is provided in Table 1(ii) for the most promising biomarker, IL-1β. We assume that these patients are exchangeable with those in the diagnostic accuracy study as they are randomly sampled from the same population.
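Under that exchangeability assumption, the conjugate update of Section 2 gives the analysis priors directly; a quick check in R (variable names are ours):

```r
# Updating the exploratory-stage priors with the biomarker selection data
# (Table 1(ii)): 16 true positives, 1 false negative, 17 of 72 with VAP.
sens_prior <- c(a = 9.9 + 16, b = 1.1 + 1)    # Beta(25.9, 2.1)
prev_prior <- c(a = 12 + 17,  b = 43 + 55)    # Beta(29, 98)
```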
Therefore, the prior distributions for the diagnostic accuracy study are $\lambda \sim \text{Beta}(25.9, 2.1)$ and $\rho \sim \text{Beta}(29, 98)$ (see Section 2), illustrated by the solid lines in the left hand side of Figure 1. Suppose we would like to estimate the sensitivity of the test to within 0.16 in a 95% symmetric probability interval and choose a sample size to give 80% assurance. Based on the priors above, we use the BAM to obtain a sample size of $n^*_T = 106$. This is significantly smaller than the original sample size of $n_T = 150$ (which would give an assurance of 88%). The full assurance curve is provided in the right hand side of Figure 1. Note that the assurance curve has a different shape to a power curve, and is monotonically increasing between 0 and 1.

To assess the sensitivity of the sample size and assurance to the prior distribution, we use the approach outlined in Section 4.2. In particular, we conduct a grid search for both the sensitivity and prevalence priors using a value of $\epsilon = 0.00354$ (equivalent to a mean shift in a standard normal random variable of 0.1). The resulting values of the beta distribution parameters $(a, b)$ are provided in Section B.3 of the supplementary material for the sensitivity and prevalence. The corresponding smallest and largest values of the assurance and sample size are provided in Table 2. Changes to the prevalence prior have little effect on the sample size or the assurance at $n_T = 106$. The effect is slightly larger for the sensitivity prior but, even for the most extreme prior, a sample size of 130 would be sufficient (which is considerably less than the sample size of 150 used in the study).

Table 2: The smallest and largest values of the assurance, $A(n_T)$, and the smallest and largest sample sizes, $n_T$, found in the local sensitivity analysis.

The results from the diagnostic accuracy study with the 150 patients are summarised in Table 1(iii) for the biomarker IL-1β. The resulting posterior distributions for the sensitivity and prevalence are $\lambda \sim \text{Beta}(76.9, 4.1)$ and $\rho \sim \text{Beta}(82, 195)$, respectively. The corresponding 95% posterior probability interval for the sensitivity is (0.893, 0.986), and so we meet the target of 0.16 on the width of the interval. To assess possible prior-data conflict, we use the approach detailed in Section 4.3 and compare the observations to the prior predictive distributions. The prior predictive distributions of the number of patients with VAP (left) and the number of patients with VAP who tested positive (right) are provided in Figure 2, with the observations shown as red dashed lines. We see that the number of patients correctly diagnosed with VAP lies within the main body of the prior predictive distribution. The observed number of patients with VAP lies in the body of the distribution, but is closer to the upper tail, at the 99th percentile. The observed number of patients correctly diagnosed lies at the 76th percentile. This provides some evidence of prior-data conflict for the number of patients with VAP, so we may choose a prior on the prevalence which is not based on the single centre observational study. The posterior mean and 95% posterior probability interval for the prevalence are 0.296 and (0.244, 0.351), respectively. The same quantities using a flat prior with $a_\rho = b_\rho = 1$ are 0.355 and (0.281, 0.433), respectively; this choice would not affect the inference on the sensitivity. However, if we believe the sub-populations with VAP are different between the two stages, we would also consider an alternative prior for the sensitivity.
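The percentile calculations above can be reproduced with the beta-binomial prior predictive pmf of Section 4.3; a minimal sketch in R (the function name is ours):

```r
# Beta-binomial prior predictive pmf: f(y) = choose(n, y) B(a+y, b+n-y) / B(a, b)
betabin_pmf <- function(y, n, a, b)
  exp(lchoose(n, y) + lbeta(a + y, b + n - y) - lbeta(a, b))

# Observed 53 of 150 patients with VAP, against the Beta(29, 98) prevalence prior
y <- 0:150
pmf <- betabin_pmf(y, 150, 29, 98)
sum(pmf[y <= 53])   # percentile of the observation (about 0.99, per the text)
sum(pmf[y >= 53])   # upper-tail probability Pr(n_T,1 >= 53)

# Observed 51 of 53 true positives, against the Beta(25.9, 2.1) sensitivity prior
sum(betabin_pmf(0:53, 53, 25.9, 2.1)[0:53 <= 51])  # about the 76th percentile
```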
In this section, we compare properties of the proposed BAM to commonly used alternative methods. Assume we wish to obtain the number of individuals with the disease, $n_{T,1}$, required to estimate the sensitivity to within a particular degree of accuracy. The alternative methods are based on a hypothesis test of $H_0: \lambda = \lambda_0$ against the two-tailed alternative $H_1: \lambda \neq \lambda_0$, conducted at a significance level of $\alpha$. We take the value of $\lambda_0$ to be $\hat\lambda$, i.e. the maximum likelihood estimate of the sensitivity using the analytical validity data. The sample size can be chosen according to a desired power of $\beta$ to detect a difference of size $w^*$. As discussed in Section 1, there are several possible approaches; we consider the following. The first is based on a Normal approximation. In this case, to achieve a power of $\beta$ we choose the sample size in the disease group as
$$n_{T,1} = \frac{(z_{\alpha/2} + z_\beta)^2\, \hat\lambda(1 - \hat\lambda)}{(w^*/2)^2},$$
where $z_\cdot$ is the corresponding upper percentile of a standard normal distribution. We construct a $100(1-\alpha)\%$ confidence interval based on this Normal approximation, known as the Wald interval. The second approach is based on an exact binomial test, giving the Clopper-Pearson (CP) interval. The third approach combines the Normal approximation with an adjustment to the hypothesised value as the centre of the interval, giving the Agresti-Coull (AC) interval.

In practice, the standard way of obtaining the required sample size is to use the appropriate sample size formula (if available), or in-built functions within statistical software (e.g. the binDesign function from the binGroup R package [Zhang et al., 2018]). However, these often give rise to unreliable sample sizes and, in our investigation, are shown to perform poorly over the range of parameter values considered; see Section E of the supplementary material. We instead rely on simulation. That is, we choose the smallest sample size $n_{T,1}$ to give the correct proportion of intervals below the desired target width $w^*$, based on simulating confidence intervals repeatedly and finding the power empirically. The total number of individuals to recruit, $n_T$, is found by scaling with respect to the estimated prevalence $\hat\rho$, i.e. $n_T = n_{T,1}/\hat\rho$. The same procedure is used to obtain the number of individuals without the disease, $n_{T,2}$, required to estimate the specificity to a certain degree of accuracy. In this case, $n_T = n_{T,2}/(1 - \hat\rho)$.

We now compare the sample sizes required for a diagnostic accuracy study using the methods outlined above. We consider a significance level of $\alpha = 0.05$, a power/assurance of $\beta = 0.8$, and aim to estimate the sensitivity to within 0.18 in a two-sided interval. We vary the sensitivity over the range [0.6, 0.9] and the prevalence over the range [0.15, 0.95]. For the proposed BAM, we consider three prior sample sizes of $n^0_T = 25$, 50 and 75 to represent "small", "medium" and "large" analytical validity studies. The results for all scenarios and methods are illustrated in Figure 3. Note that the power calculations are based on the true parameter values. The assurance calculation, however, uses beta priors with parameters $(n^0_T\lambda\rho,\, n^0_T\rho[1-\lambda])$ for the sensitivity and $(n^0_T\rho,\, n^0_T[1-\rho])$ for the prevalence. An assurance calculation with non-informative priors for the analysis is also considered. This is based on a design prior from the "small" analytical validity study to represent a reasonable "worst case" scenario.
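As a sketch of the simulation-based frequentist sizing described above, using the Clopper-Pearson interval via R's binom.test (the incremental search and function name are our simplifications):

```r
# Smallest diseased-group size n_T1 such that the Clopper-Pearson interval
# is narrower than w_star with probability >= beta at sensitivity lam_hat.
cp_sample_size <- function(lam_hat, w_star, beta = 0.8, alpha = 0.05,
                           n_sims = 2000, n_start = 10) {
  n_T1 <- n_start
  repeat {
    x <- rbinom(n_sims, n_T1, lam_hat)       # simulated true-positive counts
    w <- vapply(x, function(xi) {
      ci <- binom.test(xi, n_T1, conf.level = 1 - alpha)$conf.int
      ci[2] - ci[1]                          # realised interval width
    }, numeric(1))
    if (mean(w <= w_star) >= beta) return(n_T1)
    n_T1 <- n_T1 + 1
  }
}

# Total sample size after scaling by the estimated prevalence:
# n_T <- cp_sample_size(0.9, 0.18) / rho_hat
```

The Wald and Agresti-Coull variants follow the same loop with the corresponding interval substituted for binom.test's exact interval.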
[Figure 3. Recovered caption fragment: "... Agresti-Coull (green), assurance (black), and assurance based on non-informative analysis priors (light blue). In each plot, there are three black curves relating to prior sample sizes of (from top to bottom) 25, 50 and 75."]

In Figure 3, we observe similar patterns across the frequentist approaches (represented by the coloured lines) for each prevalence. CP always results in the largest sample size, with Wald and AC giving similar, slightly smaller, sample sizes. In comparison to assurance, the frequentist methods produce larger sample sizes when the prevalence is high. In some scenarios, they result in smaller sample sizes: for example, when the prior sample size is 25 below a prevalence of 0.5, when the prior sample size is 50 below a prevalence of 0.3, and when the prior sample size is 75 around a prevalence of 0.2. However, as the sensitivity increases, the required sample size based on assurance reduces more quickly than under the frequentist approaches, which are known to perform poorly as the sensitivity approaches one. Further details are provided in Section C of the supplementary material, including an assessment of different target interval widths. The message is consistent across the parameter combinations considered: assurance for the sensitivity reduces the required sample size in the majority of cases, particularly in moderate to high prevalence populations and when a highly accurate test is required. High prevalence situations are common in secondary care, where patients have already been triaged (such as in a suspected stroke [Shaw et al., 2021]), or in cancer pathways by the time an invasive test, such as a biopsy, is used. When the BAM is applied to even lower prevalences of 0.1, 0.05 and 0.01, the sample sizes required for a sensitivity of 0.9, based on a medium analytical validity study, are 681, 1643 and 2770, respectively. Such low prevalences may be the case in large-scale geographic prevalence surveys, for example.

A smaller sample size will not be useful if the corresponding interval estimates are very wide. Therefore, we conduct a simulation study, outlined below, to assess the width of the intervals resulting from each approach. First, we sample values of the sensitivity and prevalence from uniform distributions. These are used to sample analytical validity results, $n^0_{1,1}$ and $n^0_{T,1}$, from their respective binomial distributions based on a "medium" total sample size of $n^0_T = 50$. From these data, we find estimates of the sensitivity and prevalence for the power calculations and set the prior distributions for the assurance calculations. We then find the required sample size for each method. We sample the results of the diagnostic accuracy study, $n^1_{1,1}$ and $n^1_{T,1}$, from their respective binomial distributions and use these to calculate $100(1-\alpha)\%$ intervals for the sensitivity. Finally, we calculate the width of the intervals. By repeating this process 100 times, we consider the distributions of the widths of the intervals, which are shown in Figure 4 for a power/assurance, $\beta$, of 0.5 (left) and 0.8 (right). In all cases, $\alpha = 0.05$ and $w^* = 0.18$. For both $\beta = 0.5$ and $\beta = 0.8$, the approaches produce intervals with a similar distribution of widths. When $\beta = 0.5$, the median width of each approach lies approximately at the target width. When $\beta = 0.8$, the target width is around, or slightly above, the upper quartile for each method. Thus, the different sample sizes observed in the previous section do not come at the expense of less precision in inference.
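For completeness, the assurance arm of this simulation can be sketched as follows; the uniform sampling ranges are our assumption, taken from the grids considered earlier in this section, and assurance_sens is the sketch given after equation (2):

```r
# One replicate: simulate a "medium" validity study, size the main study by
# assurance, run the main study, and return the realised interval width.
sim_width_once <- function(n0_T = 50, w_star = 0.18, beta = 0.8,
                           alpha = 0.05) {
  lam <- runif(1, 0.6, 0.9)              # true sensitivity
  rho <- runif(1, 0.15, 0.95)            # true prevalence
  n0_T1 <- rbinom(1, n0_T, rho)          # diseased in the validity study
  n0_11 <- rbinom(1, n0_T1, lam)         # of whom the test is positive
  a_l <- 1 + n0_11; b_l <- 1 + n0_T1 - n0_11   # sensitivity prior
  a_r <- 1 + n0_T1; b_r <- 1 + n0_T - n0_T1    # prevalence prior
  n_T <- 10
  while (assurance_sens(n_T, a_l, b_l, a_r, b_r, w_star, alpha) < beta)
    n_T <- n_T + 1                       # smallest n_T achieving assurance
  n_T1 <- rbinom(1, n_T, rho)            # diagnostic accuracy study results
  n_11 <- rbinom(1, n_T1, lam)
  qbeta(1 - alpha / 2, a_l + n_11, b_l + n_T1 - n_11) -
    qbeta(alpha / 2, a_l + n_11, b_l + n_T1 - n_11)
}

widths <- replicate(100, sim_width_once())
```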
The simulations were repeated with interval widths of $w^* = 0.22$ and $w^* = 0.14$. The corresponding results are provided in Section D of the supplementary material. The main conclusions remain: for a power/assurance of 0.5, all of the distributions are approximately centred on the target width, and for a power/assurance of 0.8, each approach produces intervals which include the target width in the upper 25% of its empirical distribution. In addition, we have investigated the properties of the BAM when assuring both sensitivity and specificity together, in terms of the sample size required and the resulting interval widths. This is provided in Section F of the supplementary material.

In this paper, we have proposed the novel BAM to determine sample sizes for diagnostic accuracy studies. Bayesian assurance fulfils a similar role to power and, as we have shown, can offer benefits when suitable prior information is available. In particular, representing uncertainty in unknown test characteristics using prior distributions, and utilising information from different stages of the development pathway, allows for a wider range of evidence to be seamlessly incorporated into the design and analysis of a diagnostic accuracy study. Consequently, we have shown that this has the potential to reduce the sample size, thus increasing efficiency in evidence development. If no prior information is available, or accessible, from earlier stages of development, expert elicitation can be used to form the necessary prior distributions. Elicited distributions can include opinions from multiple experts, or be combined with data from other sources [Williams et al., 2021]. The larger the prior sample size, the more informative the prior distribution will be, which, as shown in Figure 3, typically corresponds to a smaller sample size in the diagnostic accuracy study. If it is not appropriate to use an informative prior for the analysis (e.g. to mitigate researcher bias), a sceptical or flat prior can be used instead. The BAM has the flexibility to allow for distinct prior distributions in the design and analysis stages, as illustrated in Section B of the supplementary material. The proposed BAM can be used regardless of whether the final analysis is frequentist or Bayesian. Some assurance calculations may not result in closed form solutions (e.g. if a Bayesian analysis uses a non-conjugate analysis prior), in which case simulation and numerical methods are required. Thus, calculating assurance can be challenging and, unlike power, is not available in standard software packages. To increase the accessibility of the BAM, R code is provided and an R Shiny application is currently under development.

This work focuses on assuring sensitivity and specificity as measures of diagnostic accuracy. We have also shown how the BAM can be used to assure sensitivity and specificity jointly, for which no existing approaches are available, to our knowledge. The assurance calculations can be modified to obtain sample sizes for other quantities, such as positive and negative predictive values or the area under the curve. Moreover, the assurance calculations could be extended to allow for multiple categorical results, or results in the form of continuous measures, which forms an area of further work.
In this paper, we considered the evaluation of a single diagnostic test, but further work could explore how the proposed method extends to multiple tests. To reflect standard practice in diagnostic accuracy studies, we have inherently assumed that the sampling plan will be produced prior to the study, carried out accordingly, and the data analysed at the end of the study. Future work could extend the approach so that it can be applied sequentially, participant-by-participant (or in blocks), to monitor the width of the posterior interval until the desired value is attained, at which point the study would terminate. This would reduce the sample size required. However, it would require a change in the way that diagnostic accuracy studies are routinely implemented.

Acknowledgements

The authors thank Prof. James Wason and Prof. John Simpson for useful discussions. AJA, CJW and BCL are supported by the NIHR Newcastle In Vitro Diagnostics Co-operative. The funders had no role in the preparation or decision to publish this manuscript. The views are those of the authors and not necessarily those of the NHS, NIHR, or the Department of Health and Social Care. The VAPrapid trial was supported by the Department of Health and Wellcome Trust via the Health Innovation Challenge Fund.

Supporting information

Additional supporting information may be found online in the Supporting Information section at the end of this article. R code used to implement the method is also available there. Data sharing is not applicable to this article as no new data were created or analyzed in this study.

References

Agresti, A. and Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician.
Altman, D. G. (1980). Statistics and ethics in medical research: III How large a sample? British Medical Journal.
Bachmann, L. M., Puhan, M. A., ter Riet, G. and Bossuyt, P. M. (2006). Sample sizes of studies on diagnostic accuracy: literature survey. BMJ.
Box, G. E. P. (1980). Sampling and Bayes' inference in scientific modelling and robustness. Journal of the Royal Statistical Society, Series A.
Brown, L. D., Cai, T. T. and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science.
Conway Morris, A., et al. (2010). Diagnostic importance of pulmonary interleukin-1β and interleukin-8 in ventilator-associated pneumonia. Thorax.
Graziadio, S., et al. (2020). How to ease the pain of taking a diagnostic point of care test to the market: a framework for evidence development.
Hellyer, T. P., et al. (2015). Diagnostic accuracy of pulmonary host inflammatory mediators in the exclusion of ventilator-acquired pneumonia. Thorax.
Ibrahim, J. G., Chen, M.-H., Gwon, Y. and Chen, F. (2015). The power prior: theory and applications. Statistics in Medicine.
Ioannidis, J. P. A., et al. (2014). Increasing value and reducing waste in research design, conduct, and analysis. The Lancet.
Jia, B. and Lynn, H. S. (2015). A sample size planning approach that considers both statistical significance and clinical significance. Trials.
Jiroutek, M. R., Muller, K. E., Kupper, L. L. and Stewart, P. W. (2003). A new method for choosing sample size for confidence interval-based inferences. Biometrics.
Kunzmann, K., et al. (2021). A review of Bayesian perspectives on sample size derivation for confirmatory trials. The American Statistician.
Matthews, J. N. S. (2006). Introduction to Randomized Controlled Clinical Trials. Chapman & Hall/CRC.
Newcombe, R. G. (2012). Confidence Intervals for Proportions and Related Measures of Effect Size. CRC Press.
O'Hagan, A. and Stevens, J. W. (2001). Bayesian assessment of sample size for clinical trials of cost-effectiveness. Medical Decision Making.
Roos, M., Martins, T. G., Held, L. and Rue, H. (2015). Sensitivity analysis for Bayesian hierarchical models. Bayesian Analysis.
Schmidli, H., Gsteiger, S., Roychoudhury, S., O'Hagan, A., Spiegelhalter, D. and Neuenschwander, B. (2014). Robust meta-analytic-predictive priors in clinical trials with historical control information. Biometrics.
Shaw, L., et al. (2021). Purines for rapid identification of stroke mimics (PRISM): study protocol for a diagnostic accuracy study.
Williams, C. J., et al. (2021). A comparison of prior elicitation aggregation using the classical method and SHELF.
Zhang, B., et al. (2018). binGroup: Evaluation and experimental design for binomial group testing. R package.
Zhou, X.-H., Obuchowski, N. A. and McClish, D. K. (2011). Statistical Methods in Diagnostic Medicine. Wiley.
A comment on sample size calculations for binomial confidence intervals.
A potential for seamless designs in diagnostic research could be identified.
Adaptive trial designs in diagnostic accuracy research.
Bayesian phase II optimization for time-to-event data based on historical information.
From statistical power to statistical assurance: It's time for a paradigm change in clinical trial design.
Improving interval estimation of binomial proportions.
Integration of pharmacometric and statistical analyses using clinical trial simulations to enhance quantitative decision making in clinical drug development.
Sample size calculations for studies designed to evaluate diagnostic test accuracy.
Targeted test evaluation: a framework for designing diagnostic accuracy studies with clear study hypotheses.
The importance of diagnostic testing during a viral pandemic: early lessons from novel coronavirus disease (COVID-19).
Three principles to define the success of a diagnostic study could be identified.

Appendix A

Based on the multinomial-Dirichlet prior structure, the sensitivity and specificity are independent in the prior. To see this, we note that the parameter vector $\boldsymbol{\gamma}$ can be re-ordered so that $\tilde{\boldsymbol{\gamma}} = (\gamma_{1,1}, \gamma_{2,1}, \gamma_{1,2}, \gamma_{2,2})$, which has a Dirichlet prior with re-ordered parameter vector $\tilde{\boldsymbol{\alpha}}$. Then, by the properties of neutrality and aggregation of the Dirichlet distribution, we see that $(\gamma_{1,1}/[\gamma_{1,1}+\gamma_{2,1}],\, \gamma_{2,1}/[\gamma_{1,1}+\gamma_{2,1}])$ and $(\gamma_{1,2}/[\gamma_{1,2}+\gamma_{2,2}],\, \gamma_{2,2}/[\gamma_{1,2}+\gamma_{2,2}])$ are mutually independent. Hence $\gamma_{1,1}/[\gamma_{1,1}+\gamma_{2,1}]$ and $\gamma_{2,2}/[\gamma_{1,2}+\gamma_{2,2}]$ are independent. Now, since $\boldsymbol{\gamma}$ has a Dirichlet prior distribution, we have $\gamma_{i,j}\gamma_\Sigma \sim \text{Gamma}(\alpha_{i,j}, 1)$, where $\gamma_\Sigma$ is the normalising sum in the standard gamma representation of the Dirichlet distribution. We can therefore write the sensitivity as $\gamma_{1,1}/(\gamma_{1,1}+\gamma_{2,1}) = \gamma_{1,1}\gamma_\Sigma/(\gamma_{1,1}\gamma_\Sigma + \gamma_{2,1}\gamma_\Sigma)$ and, by the properties of the gamma distribution, we have $\lambda \sim \text{Beta}(\alpha_{1,1}, \alpha_{2,1})$. By similar reasoning, $\theta \sim \text{Beta}(\alpha_{2,2}, \alpha_{1,2})$ for the specificity. When the number of individuals in each cell of the contingency table is observed, the posterior distributions are $\boldsymbol{\gamma} \mid \boldsymbol{n} \sim \text{Dir}(\alpha_{1,1}+n_{1,1},\, \alpha_{1,2}+n_{1,2},\, \alpha_{2,1}+n_{2,1},\, \alpha_{2,2}+n_{2,2})$, $\lambda \mid \boldsymbol{n} \sim \text{Beta}(\alpha_{1,1}+n_{1,1},\, \alpha_{2,1}+n_{2,1})$ and $\theta \mid \boldsymbol{n} \sim \text{Beta}(\alpha_{2,2}+n_{2,2},\, \alpha_{1,2}+n_{1,2})$. Setting $\alpha_{1,1} = a_\lambda$, $\alpha_{1,2} = b_\theta$, $\alpha_{2,1} = b_\lambda$ and $\alpha_{2,2} = a_\theta$, we see that the two approaches are equivalent for the sensitivity and specificity.
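A quick Monte Carlo check of this equivalence, with illustrative $\boldsymbol{\alpha}$ values of our choosing:

```r
# Sample gamma ~ Dir(alpha) via independent gammas and check that the implied
# sensitivity gamma_11/(gamma_11 + gamma_21) matches its Beta(a11, a21) marginal.
set.seed(1)
alpha <- c(a11 = 2, a12 = 3, a21 = 1, a22 = 4)   # illustrative values
m <- 1e4
x <- sapply(alpha, function(a) rgamma(m, shape = a))  # columns: a11, a12, a21, a22
g <- x / rowSums(x)                                   # Dirichlet draws
sens <- g[, "a11"] / (g[, "a11"] + g[, "a21"])
ks.test(sens, pbeta, alpha[["a11"]], alpha[["a21"]])  # should not reject
```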