On Frequentist and Bayesian Sequential Clinical Trial Designs
Tianjian Zhou and Yuan Ji
December 17, 2021

Abstract. Clinical trials usually involve sequential patient entry. When designing a clinical trial, it is often desirable to include a provision for interim analyses of accumulating data with the potential for stopping the trial early. We review frequentist and Bayesian sequential clinical trial designs with a focus on their fundamental and philosophical differences. Frequentist designs utilize repeated significance testing or conditional power to make early stopping decisions. The majority of frequentist designs are concerned with controlling the overall type I error rate of falsely rejecting the null hypothesis at any analysis. On the other hand, Bayesian designs utilize posterior or posterior predictive probabilities for decision-making. The prior and threshold values in a Bayesian design can be chosen either to achieve desirable frequentist operating characteristics or to reflect the investigator's subjective belief. We also comment on the likelihood principle, which is commonly tied to statistical inference and decision-making in sequential clinical trials. A single-arm trial example with normally distributed outcomes is used throughout to illustrate some frequentist and Bayesian designs. Numerical studies are conducted to assess these designs.

1. Introduction. In most clinical trials, patient enrollment is staggered, and patients' data are collected sequentially. When designing a clinical trial, it is often desirable to include a provision for interim analyses of accumulating data with the potential for modifying the conduct of the study (Pocock, 1977; Armitage, 1991).
For example, in a randomized-controlled trial, if an interim analysis demonstrates that the investigational drug is superior to the standard of care, the trial could be stopped early on grounds of ethics and trial efficiency (Geller and Pocock, 1987). The BNT162b2 COVID-19 vaccine trial is a recent case in which four interim analyses were planned with the possibility of early stopping (Polack et al., 2020). An important question that underlies sequential clinical trials is the following: how should statistical analysis of trial data be affected by the knowledge that interim analyses have been performed in the past or that further analyses might be undertaken in the future? (Jennison and Turnbull, 1990). As an illustration, consider a single-arm trial that aims to establish the therapeutic effect of an investigational drug. Suppose that a total of K analyses, including (K − 1) interim analyses and a final analysis, are planned during the course of the trial. At the jth analysis, data of n_j patients are accumulated, denoted by y_1, y_2, . . . , y_{n_j} and assumed independently and normally distributed with mean θ and variance σ². Here, θ is parameterized such that a positive value of θ is indicative of a therapeutic effect, and σ² is assumed known for simplicity. The planned maximum sample size is denoted by n_K and can be determined based on a power requirement or the amount of available resources. Most often, patients are enrolled in groups of equal size g, thus n_j = jg. If g = 1, it leads to the fully sequential case, known as continuous monitoring; if g > 1, it is called the group sequential case, which is more feasible in practice. The primary research question of the trial can be formulated as the following hypothesis test, H_0: θ ≤ 0 versus H_1: θ > 0. At each analysis, the hypothesis test is performed. If a certain stopping rule is triggered, say z_j = Σ_{i=1}^{n_j} y_i / (√n_j · σ) > c_j for some threshold c_j, H_0 is rejected, and the trial is terminated for efficacy.
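As a concrete sketch of this setup, the following snippet applies the rule z_j > c_j with a constant boundary c_j ≡ 1.96 and estimates by simulation how often a trial with no treatment effect (θ = 0) is stopped for efficacy. The constant boundary, group size g = 20, and K = 5 are hypothetical choices for illustration only; a constant boundary at the nominal one-sided level does not account for the multiple looks.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trial(theta, sigma=1.0, K=5, g=20, c=1.96, rng=rng):
    """Apply the stopping rule z_j > c at each of K analyses.

    Returns the analysis at which H0 was rejected (1-based), or None.
    Group size g, so n_j = j * g; c is a constant boundary chosen for
    illustration only (it does not adjust for the K repeated looks).
    """
    y = rng.normal(theta, sigma, size=K * g)
    for j in range(1, K + 1):
        n_j = j * g
        z_j = y[:n_j].sum() / (np.sqrt(n_j) * sigma)
        if z_j > c:
            return j
    return None

# Under theta = 0, the chance of rejecting H0 at *any* analysis
# exceeds the nominal one-sided 2.5% level of a single test.
rejections = sum(run_trial(theta=0.0) is not None for _ in range(20_000))
print(rejections / 20_000)
```

The estimated overall rejection probability under θ = 0 is noticeably larger than the single-look nominal level, which previews the multiplicity adjustments reviewed in Section 2.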
This is referred to as data-dependent or optional stopping. How should these stopping rules be determined? Under the frequentist paradigm, type I error rate control is central to hypothesis testing. The type I error rate refers to the probability of falsely rejecting the null at any analysis (in hypothetical repetitions of the trial), given that the null hypothesis is true. If each test is performed at a constant nominal level, the type I error rate will inflate as K grows and will eventually converge to 1 as K → ∞ (Armitage et al., 1969) . Therefore, adjustments to the stopping rules are necessary to ensure that the overall type I error rate is maintained at a desirable level. For example, the stopping boundaries may be decided using the Pocock or O'Brien-Fleming procedure (Pocock, 1977; O'Brien and Fleming, 1979) . Under the Bayesian paradigm, however, the answer to the question is less clear. Without accounting for the sequential nature of the hypothesis test, Bayesian designs can suffer the same problem of type I error inflation, which can be unsettling for many Bayesian statisticians. Therefore, in many Bayesian sequential trial designs, the stopping rules are determined to control the type I error rate at a desirable level (Zhu and Yu, 2017; Shi and Yin, 2019) . As an example, the recent BNT162b2 COVID-19 vaccine trial was designed using a Bayesian approach with four planned interim analyses (Polack et al., 2020) . The stopping boundaries were chosen such that the overall type I error rate was controlled at 2.5%. Indeed, regulatory agencies generally recommend demonstration of adequate control of the type I error rate for any trial design to be acceptable (Food and Drug Administration, 2010, 2019) . On the other hand, the type I error rate is a frequentist concept, the calculation of which involves an average over unrealized events such as hypothetical repetitions of the trial. 
Bayesian inference can be performed based solely on the observed data from the actual (and lone) trial and does not have to be concerned with type I error rate control, since the same trial is not assumed to repeat, hypothetically or in practice. Some think that the type I error rate is not the quantity that one should pay most attention to (Harrell, 2020b). Also, according to the likelihood principle (LP), unrealized events should be irrelevant to the statistical evidence about a parameter (Berger and Wolpert, 1988). Therefore, some Bayesian statisticians believe that the choice of the stopping rules does not need to depend on the planning of interim analyses (Berry, 1985, 1987). For example, one may stop the trial at any analysis provided that Pr(θ > 0 | data) exceeds some threshold, or if stopping minimizes the posterior expected loss. We will elaborate on these issues in the upcoming sections. The mainstream sequential clinical trial designs used in practice are frequentist. See, for example, Whitehead (1997) or Jennison and Turnbull (2000) for a comprehensive review. On the other hand, Bayesian trial designs have become increasingly popular (Berry, 2006; Berry et al., 2010). Insightful discussions on Bayesian sequential designs can be found in Cornfield (1966b), Berry (1985), Berry (1987), Jennison and Turnbull (1990), Freedman et al. (1994), Emerson et al. (2007), and Ryan et al. (2020), among many others. In this article, we review frequentist and Bayesian sequential clinical trial designs. Our review is not meant to be comprehensive. Instead, we focus on the fundamental and philosophical differences between the two camps rather than the methodological details with regard to the type of trial (e.g., single-arm or randomized-controlled), type of outcome (e.g., binary, continuous, or time-to-event), or distributional assumption (e.g., violation of the normality assumption).
We primarily consider early stopping rules for efficacy, as futility stopping does not increase the type I error rate of a design. Discussion on futility stopping is deferred to Section 6. The aforementioned single-arm trial example will be used throughout to demonstrate these designs, but we present an extension for randomized-controlled trials in Section 6. Compared to existing reviews on this topic (e.g., Jennison and Turnbull, 1990), we devote more discussion to Bayesian designs and aim to disentangle their philosophical and practical implications. As mentioned before, it remains an open question whether Bayesian designs need to be adjusted for the planning of interim analyses. We attempt to answer this question from three perspectives: a mixed frequentist-Bayesian perspective, a subjective Bayesian perspective, and a calibrated Bayesian perspective. We also comment on the LP. Our view is that, although only the observed data are relevant to the statistical evidence about a parameter, the loss for making a specific decision may depend on other aspects of an experiment, such as the planning of interim analyses. Therefore, it is not unreasonable for a decision rule to depend on unrealized events. For example, an optimal decision rule may need to minimize the expected loss over future interim analyses. More explanations are given later in Section 4. The remainder of the paper is structured as follows. In Section 2, we briefly review some frequentist sequential designs. In Section 3, we review selected Bayesian sequential designs based on posterior and posterior predictive probabilities, decision-theoretic designs, and three approaches to formulating Bayesian designs. In Section 4, we discuss the LP, which is commonly tied with statistical inference and decision-making in sequential clinical trials. In Section 5, we present some numerical studies.
Finally, in Section 6, we conclude and discuss some other considerations including randomized-controlled trials, futility stopping rules, and two-sided tests. The code for reproducing the results in this paper is provided in the Supplementary Material. 2. Frequentist Sequential Designs. Under the frequentist paradigm, the probability of an event in a clinical trial is interpreted as the limit of its relative frequency in infinite imaginary repetitions of the trial. Parameters are viewed as fixed but unknown. Uncertainty regarding an estimator is characterized by its sampling distribution, that is, the distribution of the estimator over imaginary repetitions of the experiment. For example, the meaning of a 95% confidence interval is that the fraction of the calculated confidence intervals in numerous repeated trials that encompass the true parameter value would tend toward 95% (Cox and Hinkley, 1974). Since a parameter is considered fixed, a hypothesis regarding the parameter is either true or false. One cannot assign a probability to a hypothesis but instead considers the type I and type II error rates. Here, the type I (or type II) error rate refers to the probability of mistakenly rejecting the null (or alternative) hypothesis in imaginary repetitions of the experiment, given that the null (or alternative) hypothesis is true. A frequentist hypothesis testing procedure usually preserves the type I error rate at a prespecified level while attempting to minimize the type II error rate. To illustrate frequentist sequential designs, let us revisit the single-arm trial example in Section 1. At analysis j, the z-statistic from accumulating data is given by z_j = Σ_{i=1}^{n_j} y_i / (√n_j · σ). If z_j > c_j for some critical value c_j, the null hypothesis is rejected, and the trial is stopped. At the moment, we focus on early stopping for efficacy, and discussion on futility stopping is deferred to Section 6. The type I error rate of this test procedure is the probability of falsely rejecting H_0 at any analysis, given that H_0 is true.
The maximum type I error rate is attained when θ = 0 and is given by α = Pr(z_1 > c_1 or z_2 > c_2 or · · · or z_K > c_K | θ = 0). It can be derived that z = (z_1, . . . , z_K)' follows a multivariate normal distribution with E(z_j) = θ√n_j/σ, Var(z_j) = 1, and Cov(z_j, z_{j'}) = √(n_j/n_{j'}) for j < j'. Therefore, α = 1 − Φ_K(c; 0, Σ), (2) where Φ_K(·; ·, ·) is the cumulative distribution function of a multivariate Gaussian random variable, c = (c_1, c_2, . . . , c_K)', and Σ is the covariance matrix of z. Frequentist group sequential designs are concerned with the specification of the boundary values {c_1, . . . , c_K} such that Equation (2) holds for prespecified α, K, and {n_1, . . . , n_K}. The solution to Equation (2) is not unique, thus restrictions on the boundary values have been considered. We give some examples next. 2.1. The Pocock and O'Brien-Fleming Procedures. In the case of equal group sizes (that is, n_j = jg for some g), Pocock (1977) proposed to use equal boundary values by setting c_1 = · · · = c_K = c_P(K, α), while O'Brien and Fleming (1979) suggested decreasing boundaries with c_j = c_OBF(K, α)·√(K/j). In either case, the boundary values can be solved through a numerical search. 2.2. The Error Spending Approach. Slud and Wei (1982) first considered the idea of specifying the error rate spent at each analysis, defined as κ_j = Pr(z_1 ≤ c_1, . . . , z_{j−1} ≤ c_{j−1}, z_j > c_j | θ = 0). This represents the probability of rejecting H_0 at stage j but not at any previous stages, given that θ = 0. We have α = Σ_{j=1}^{K} κ_j. Once the κ_j's are specified, one can successively calculate the boundary values. Lan and DeMets (1983) further extended this idea and suggested to use a function to characterize the rate at which the error rate is spent. This function, denoted by h(u) (0 ≤ u ≤ 1), satisfies h(0) = 0 and h(1) = α. The κ_j's can be chosen such that κ_j = h(n_j/n_K) − h(n_{j−1}/n_K) (with the understanding that n_0 = 0).
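The error spending recipe above can be sketched numerically: fix a spending function, obtain the κ_j's, and solve for the boundary values one analysis at a time. The sketch below uses the linear spending function h(u) = αu as an assumed illustrative choice and finds each c_j by bisection against a Monte Carlo approximation of the stage-wise crossing probability; the simulation size and tolerances are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, K = 0.05, 5
n = np.arange(1, K + 1) / K          # information fractions n_j / n_K (equal groups)
spend = alpha * n                    # linear spending h(u) = alpha * u
kappa = np.diff(np.concatenate(([0.0], spend)))  # error spent at each look

# Simulate correlated z-statistics via cumulative sums of group increments.
S = 400_000
inc = rng.standard_normal((S, K)) * np.sqrt(1.0 / K)  # independent group increments
z = np.cumsum(inc, axis=1) / np.sqrt(n)               # z_j = (sum y_i) / (sqrt(n_j) sigma)

c = np.empty(K)
alive = np.ones(S, dtype=bool)       # trials with no rejection so far
for j in range(K):
    lo, hi = 0.0, 6.0
    for _ in range(60):              # bisection: choose c_j so that
        mid = 0.5 * (lo + hi)        # Pr(no earlier rejection, z_j > c_j) = kappa_j
        p = np.mean(alive & (z[:, j] > mid))
        if p > kappa[j]:
            lo = mid
        else:
            hi = mid
    c[j] = 0.5 * (lo + hi)
    alive &= z[:, j] <= c[j]

print(np.round(c, 3))                # boundary values for the K analyses
```

The first boundary satisfies Pr(z_1 > c_1) = α/K exactly, so c_1 ≈ Φ⁻¹(0.99) ≈ 2.326 here, and the total crossing probability over all K looks is α up to Monte Carlo error.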
Common choices of the spending function include h_1(u) = α log[1 + (e − 1)u], h_2(u) = 2 − 2Φ(q_{α/2}/√u), and h_3(u) = αu^ρ for some ρ > 0. Here, Φ(·) is the cumulative distribution function of the standard normal distribution, and q_{α/2} = Φ^{−1}(1 − α/2) is the upper (α/2) quantile of the standard normal distribution, Φ(q_{α/2}) = 1 − α/2. It has been shown that in the case of equal group sizes, h_1(u) and h_2(u) produce boundary values similar to those given by Pocock's and O'Brien-Fleming's procedures, respectively. Function h_3 is known as the power spending function and has been studied by Kim and DeMets (1987b). The error spending approach introduces greater flexibility to sequential designs, as the frequency and timing of the interim analyses do not need to be specified in advance. 2.3. Stochastic Curtailment Based on Conditional Power. Lan et al. (1982) proposed the idea of stochastic curtailment: at any point in a sequential clinical trial, if the result at the end of the trial is inevitable, the study can be terminated early. Consider the single-arm trial example. Suppose that at the final analysis, H_0 will be rejected if the final z-statistic z_K > q_η, where q_η is the upper η quantile of the standard normal distribution. Then, at analysis j ∈ {1, . . . , K − 1}, the probability that H_0 will be rejected upon completion of the study, given θ, is CP_j(θ) = Pr(z_K > q_η | y_j, θ), where y_j = (y_1, . . . , y_{n_j})' is the vector of accumulating data up to analysis j. This is known as the conditional power. A simple calculation shows that CP_j(θ) = Φ( [z_j√n_j − q_η√n_K + (n_K − n_j)θ/σ] / √(n_K − n_j) ). If based on current data, H_0 will likely be rejected at the final analysis even if the investigational drug has no treatment effect (θ = 0), then the trial may be stopped early. Mathematically, one may stop the trial early if CP_j(0) > γ for some threshold γ. This is equivalent to z_j > q_η√(n_K/n_j) + q_{1−γ}√((n_K − n_j)/n_j). If desirable, one may use different thresholds γ_j's at different interim analyses. An important consideration is the type I error rate of this procedure, but Lan et al.
(1982) showed that the error rate is upper bounded by η/γ, regardless of the number of interim analyses. Therefore, if η and γ are chosen such that η/γ ≤ α, the type I error rate is maintained at or below α, even if interim analyses are conducted at arbitrary times. The stopping boundaries based on this argument are typically conservative. However, if the timing of the interim analyses is specified in advance, tighter stopping boundaries can be constructed by calculating the exact type I error rate numerically. 2.4. Analysis at the Conclusion of a Sequential Trial. Once a sequential trial has been completed, it is often of interest to construct a point estimate and a confidence interval for the treatment effect θ. Consider again the single-arm trial example. The results of the trial can be represented by a bivariate random vector (t, z_t), where t denotes the time of stopping, and z_t is the corresponding test statistic. Following Armitage et al. (1969) or Jennison and Turnbull (2000) (Chapter 8), the density of (t, z_t) is p(1, z_1; θ) = φ(z_1 − θ√n_1/σ) for t = 1, and p(t, z_t; θ) = ∫_{−∞}^{c_{t−1}} p(t − 1, u; θ) · √(n_t/(n_t − n_{t−1})) · φ( [z_t√n_t − u√n_{t−1} − (n_t − n_{t−1})θ/σ] / √(n_t − n_{t−1}) ) du for t = 2, . . . , K, with φ(·) denoting the standard normal density. The sample mean estimator, θ̂ = ȳ_t, is a straightforward point estimator for θ. It can be shown that θ̂ is also the maximum likelihood estimator (MLE). However, it is known that the MLE following a sequential trial is biased, and one may correct it by subtracting an estimate of its bias. See, e.g., Whitehead (1986) for more details. To construct a confidence interval for θ, one needs to define an ordering of the sample space (Tsiatis et al., 1984; Kim and DeMets, 1987a; Rosner and Tsiatis, 1988). For example, based on the stage-wise ordering, (t', z_{t'}) is above (t, z_t) if either (i) t' = t and z_{t'} > z_t, or (ii) t' < t. In this case, (t', z_{t'}) is indicative of a larger value of θ compared to (t, z_t). It can be shown that Pr[Observing an outcome above (t, z_t) | θ] is a continuous and monotonically increasing function of θ for every possible trial outcome (t, z_t) (Kim and DeMets, 1987a).
Thus, one can find unique values θ_L and θ_U which satisfy Pr[Observing an outcome above (t, z_t) | θ_L] = α/2 and Pr[Observing an outcome above (t, z_t) | θ_U] = 1 − α/2. The two equations can be solved numerically. Then, (θ_L, θ_U) is a 100(1 − α)% confidence interval for θ. 2.5. Summary. Frequentist designs for sequential clinical trials can effectively control the chance of falsely approving an ineffective drug. After the completion of sequential trials, frequentist confidence intervals are calculated such that the intervals have desirable coverage probabilities in repeated trials. For a more comprehensive review, alternative methods, and comparisons, refer to Jennison and Turnbull (2000). Note that unrealized events can play a role in the statistical inference, and these unrealized events depend on the experimental design, such as the frequency of interim analyses. 3. Bayesian Sequential Designs. Under the Bayesian paradigm, probability is interpreted as degree of belief (De Finetti, 2017). Parameters are viewed as random variables. Inferences regarding the unknown parameters are based on their posterior distributions, obtained using Bayes' rule by updating the prior distribution with the likelihood conditional on the observed data. The need for prior specification has made many people reluctant to adopt Bayesian statistics, but Box (1980) explained that it is logically impossible to distinguish between model assumptions and the prior distribution of the parameters. In a hypothesis testing problem, one can attach probability distributions to hypotheses. As a result, the notion of (frequentist) error rates is not essential for Bayesians to reject or accept a hypothesis, although these error rates may be calculated similarly even under a Bayesian framework. 3.1. Designs Based on Posterior Probabilities. In Bayesian sequential designs, early stopping rules are typically based on the posterior probability (PP) of θ being greater than some threshold (e.g., Thall and Simon, 1994; Heitjan, 1997). Assume the time and frequency of interim analyses are given in advance.
Let π(θ) denote the prior distribution of θ. At analysis j, the posterior distribution of θ is given by p(θ | y_j) ∝ f(y_j | θ)π(θ), where f(y_j | θ) denotes the sampling distribution of y_j. When the prior for θ is a conjugate normal distribution, θ ∼ N(µ, ν²), the above posterior is available in closed form, θ | y_j ∼ N( (ν² Σ_{i=1}^{n_j} y_i + σ²µ)/(n_jν² + σ²), σ²ν²/(n_jν² + σ²) ). If PP_j = Pr(θ > 0 | y_j) > γ_j for some threshold γ_j, H_0 is rejected, the trial is stopped, and efficacy of the drug is declared. This is equivalent to z_j > q_{1−γ_j}·√(1 + σ²/(n_jν²)) − σµ/(ν²√n_j), (6) where q_{1−γ_j} is the upper (1 − γ_j) quantile of the standard normal distribution. It remains to specify the prior π(θ) and threshold values {γ_1, . . . , γ_K}. We present three approaches next. 3.1.1. The Mixed Frequentist-Bayesian Approach. Without accounting for multiple looks at the data, the stopping rule in Equation (6) can lead to type I error rate inflation. As an example, consider an improper prior on θ by setting ν = ∞, that is, π(θ) ∝ 1. In this case, the stopping criterion at analysis j is z_j > q_{1−γ_j}, which has the same form as the frequentist stopping rule in Section 2 with c_j = q_{1−γ_j}. Suppose γ_j ≡ 0.95; using Equation (3), the type I error rates are α = 0.05, 0.08, 0.13, 0.17, 0.31, and 1 for K = 1, 2, 5, 10, 100, and ∞, respectively. As encouraged by regulatory agencies (Food and Drug Administration, 2010, 2019), one should adjust π(θ) and {γ_1, . . . , γ_K} according to the planning of interim analyses to achieve desirable type I error rate control (and possibly other frequentist properties). We refer to this as a mixed frequentist-Bayesian approach. With an intended type I error rate, the parameters in a Bayesian sequential design can be chosen in multiple ways. For prespecified threshold values, type I error rate control can be achieved by using a conservative prior. Freedman and Spiegelhalter (1989) and Freedman et al. (1994) demonstrated that by tuning the prior distribution of θ, one could achieve boundary values similar to or more conservative than Pocock's or O'Brien-Fleming's boundaries.
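A minimal sketch of this posterior-probability stopping rule under the conjugate normal model is given below. The numerical setting (σ = 1, K = 5 equal groups of 200, γ_j ≡ 0.95, and a N(0, 0.054²) prior) mirrors the single-arm example used in this paper; the Monte Carlo check of the overall type I error rate is our own addition.

```python
import numpy as np
from scipy.stats import norm

sigma, K, g, gamma, mu, nu = 1.0, 5, 200, 0.95, 0.0, 0.054
n = g * np.arange(1, K + 1)                      # n_j = j * g

def post_prob(sum_y, n_j):
    """Pr(theta > 0 | y_j) under the conjugate prior theta ~ N(mu, nu^2)."""
    m = (nu**2 * sum_y + sigma**2 * mu) / (n_j * nu**2 + sigma**2)
    v = sigma**2 * nu**2 / (n_j * nu**2 + sigma**2)
    return norm.cdf(m / np.sqrt(v))

# Equivalent boundary on the z-statistic (the mu = 0 case of the rule above):
c = norm.ppf(gamma) * np.sqrt(1.0 + sigma**2 / (n * nu**2))
print(np.round(c, 3))                            # decreasing, O'Brien-Fleming-like

# Monte Carlo estimate of the overall type I error rate at theta = 0.
rng = np.random.default_rng(7)
S = 100_000
groups = rng.normal(0.0, sigma * np.sqrt(g), size=(S, K))   # sums of g outcomes
z = np.cumsum(groups, axis=1) / (np.sqrt(n) * sigma)
reject = (z > c).any(axis=1)
print(round(reject.mean(), 3))
```

The posterior-probability rule and the z-boundary are the same rule in two parametrizations: crossing c_j at look j is exactly the event {Pr(θ > 0 | y_j) > γ_j}.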
In our case, we can simply set µ = 0 and adjust ν² according to the planning of interim analyses. From Equation (6), the stopping boundaries monotonically increase as ν² decreases. For example, consider the single-arm trial with an outcome variance of σ² = 1, a maximum sample size of 1000, K = 5 analyses, and equal group sizes. Then, with threshold values γ_j ≡ 0.95, a N(0, 0.054²) prior for θ controls the type I error rate at 0.05. The corresponding stopping boundaries for the z_j's are shown in Table 1. Alternatively, for a given prior π(θ), type I error rate control can be attained by adjusting the threshold values {γ_1, . . . , γ_K}. For the single-arm trial example, one may equate the stopping boundaries in Equation (6) to the corresponding boundaries in any frequentist sequential design. For example, suppose {c_1, . . . , c_K} are O'Brien-Fleming boundaries; then γ_j may be set at γ_j = Φ( (c_j + σµ/(ν²√n_j)) / √(1 + σ²/(n_jν²)) ). For more complicated trials (e.g., randomized-controlled, binary outcome), tuning π(θ) and {γ_1, . . . , γ_K} to achieve desirable type I error rate control is more challenging and may require numerical methods. See, for example, Zhu and Yu (2017). 3.1.2. The Subjective Bayesian Approach. Under a subjective interpretation of probability (see, e.g., Goldstein, 2006 and Robinson, 2019), the prior π(θ) should be specified to reflect a subjective belief on θ before the trial, and the threshold values {γ_1, . . . , γ_K} should be chosen to represent personal tolerance of risk. For example, a positive (or negative) prior mean for θ represents that the investigator's prior belief on the treatment effect is optimistic (or pessimistic). Similarly, the prior variance for θ reflects the investigator's uncertainty about the prior opinion. In practice, π(θ) could be elicited from preclinical data and historical clinical trials with a similar setting. On the other hand, the choice of the threshold values can be justified from a decision-theoretic perspective. See, e.g., Robert (2007) (Chapter 5.2). Consider the single-arm trial example again.
At analysis j, the possible decision is denoted by ϕ_j, where ϕ_j = 1 (or 0) indicates rejecting H_0 and stopping the trial (or failing to reject H_0 and continuing enrollment if j < K). Assume the loss associated with decision ϕ_j is ℓ_j(ϕ_j, θ) = ξ_{1j}·1(ϕ_j = 1, θ ≤ 0) + ξ_{0j}·1(ϕ_j = 0, θ > 0), where ξ_{1j} and ξ_{0j} are the losses for a false rejection and a false non-rejection, respectively. Then, the posterior expected loss of ϕ_j is L_j(ϕ_j, y_j) = ∫ ℓ_j(ϕ_j, θ)p(θ | y_j)dθ, and the decision that minimizes L_j(ϕ_j, y_j) is ϕ̂_j = 1 if Pr(θ > 0 | y_j) > ξ_{1j}/(ξ_{0j} + ξ_{1j}), and ϕ̂_j = 0 otherwise. By setting γ_j at ξ_{1j}/(ξ_{0j} + ξ_{1j}), the stopping rule in Equation (6) minimizes the posterior expected loss. In practice, one could specify the loss function ℓ_j(ϕ_j, θ) based on personal tolerance of risk and then derive the γ_j's subsequently. For example, if one wants to be conservative about rejections early in the trial, one could consider increasing the loss of false rejections at early interim analyses (Rosner and Berry, 1995). A more stringent way of formulating the loss function should take into account the sequential nature of the trial. For example, a decision to continue the trial should be made based on balancing the cost of enrolling more patients and the gain of acquiring more information. More discussion on this point is deferred to Section 3.3. We see that by taking a subjective Bayesian approach, one does not need to take frequentist properties into account. For example, suppose that ξ_{1j} = 19 · ξ_{0j} for all j; then one can reject H_0 and stop the trial at any analysis as long as Pr(θ > 0 | y_j) > 0.95. As Edwards et al. (1963) stated, "it is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience." This point has also been made by Harrell (2020a). Such a procedure is vulnerable to type I error rate inflation, which would bother many practitioners. However, it has been argued that the type I error rate is not the quantity that one should pay most attention to (Harrell, 2020b).
Also, the calculation of the type I error rate involves an average over unrealized events that may arise for hypothetical values of θ, but based on the LP, unobserved events are irrelevant to the evidence about θ (Berry, 1985, 1987). We provide more discussion in Section 4. A similar critique on the subjective Bayesian approach is the issue of "sampling to a foregone conclusion" (Cornfield, 1966a). However, Berry (1985, 1987) argued that this is not a threat, because the sequence of posterior probabilities, {Pr(θ > 0 | y_1, . . . , y_n) : n = 1, 2, . . .}, is a martingale. Note that E[Pr(θ > 0 | y_1, . . . , y_n, y_{n+1}) | y_1, . . . , y_n] = Pr(θ > 0 | y_1, . . . , y_n), where the expectation is taken with respect to the posterior predictive distribution of y_{n+1}. Therefore, E[ Pr(θ > 0 | y_1, . . . , y_n, y_{n+1}) | Pr(θ > 0 | y_1, . . . , y_n) ] = E[ E[ Pr(θ > 0 | y_1, . . . , y_n, y_{n+1}) | y_1, . . . , y_n ] | Pr(θ > 0 | y_1, . . . , y_n) ] = Pr(θ > 0 | y_1, . . . , y_n). If the posterior probability of {θ > 0} is less than 0.95 given n observations, say 0.94, then after the next observation, it may increase or decrease with an expected value of 0.94. In other words, one cannot guarantee reaching Pr(θ > 0 | data) > 0.95 with more data. Specifically, when the sampling distribution of the y_i's is normal, the expected number of additional observations required to raise Pr(θ > 0 | data) by any prescribed amount is infinite. This is analogous to the expected hitting time of a Brownian motion, which is infinite (see, e.g., Chapter 8.2 in Ross, 1996). 3.1.3. The Calibrated Bayesian Approach. Although Bayesian probabilities represent degrees of belief in some formal sense, for practitioners and regulatory agencies, it can be pertinent to examine the operating characteristics of Bayesian designs in repeated practices.
One could calibrate the prior and threshold values in a Bayesian sequential design to achieve desirable operating characteristics under a range of plausible scenarios, and we refer to this as a calibrated Bayesian approach (Rubin, 1984; Little, 2006). We provide more background on the calibrated Bayesian approach in Appendix A.1. We distinguish between operating characteristics and frequentist properties: we use the former to refer to the long-run average behaviors of a statistical procedure in a series of (possibly different) trials, and use the latter to refer to those in (imaginary) repetitions of the same trial. In other words, operating characteristics represent averages over a joint data-parameter distribution, while frequentist properties represent averages over a data distribution given a fixed parameter. See, e.g., Rubin (1984) or Bayarri and Berger (2004). Frequentist properties are a special class of operating characteristics. Consider again the single-arm trial example. Imagine an infinite series of such trials with true but unknown treatment effects {θ^(1), θ^(2), . . .}, which constitute some population distribution π_0(θ). For each trial, patient outcomes y_K ∼ f_0(y_K | θ) are observed sequentially, where y_K = (y_1, . . . , y_{n_K})'. Suppose a Bayesian design as in Section 3.1 is applied to every trial with a prior model π(θ), a sampling model f(y_K | θ), and threshold values {γ_1, . . . , γ_K}. Let Γ = { y_K : Pr(θ > 0 | y_j) > γ_j for some j ∈ {1, . . . , K} } (7) denote the rejection region of the design. That is, H_0 is rejected if y_K ∈ Γ. Since false rejections of the null may result in costly failures in drug development, it would be desirable to control the false discovery rate (FDR) and false positive rate (FPR) of the design in the infinite series of trials for a range of plausible f_0(y_K | θ)π_0(θ). This is similar to the rationale of type I error rate control.
The FDR is the relative frequency of false rejections among all trials in which H_0 is rejected, and the FPR is the relative frequency of false rejections among all trials with nonpositive treatment effects θ's. Mathematically, FDR = Pr(θ ≤ 0 | y_K ∈ Γ) and FPR = Pr(y_K ∈ Γ | θ ≤ 0), where the probabilities are computed under f_0(y_K | θ)π_0(θ). Our definitions of the FDR and FPR are slightly different from, but closely related to, their typical definitions in a frequentist sense (see, e.g., Storey, 2003). The calibration of the design parameters is typically done through computer simulations. For each plausible f_0(y_K | θ)π_0(θ), one could generate S hypothetical trials with treatment effects {θ^(1), θ^(2), . . . , θ^(S)} and outcomes {y_K^(1), . . . , y_K^(S)} (for some large S). Then, the FDR and FPR are respectively approximated by FDR ≈ [Σ_{s=1}^{S} 1(θ^(s) ≤ 0, y_K^(s) ∈ Γ)] / [Σ_{s=1}^{S} 1(y_K^(s) ∈ Γ)] and FPR ≈ [Σ_{s=1}^{S} 1(θ^(s) ≤ 0, y_K^(s) ∈ Γ)] / [Σ_{s=1}^{S} 1(θ^(s) ≤ 0)]. The prior and threshold values in the Bayesian design can be chosen such that the FDR and FPR do not exceed some prespecified levels for every plausible f_0(y_K | θ)π_0(θ). In certain contexts, there are theoretical guarantees on the operating characteristics of Bayesian sequential designs. Specifically, we have the following proposition. Proposition 3.1. Suppose the rejection region Γ of a Bayesian design is defined as in Equation (7). Assume the joint model for (y_K, θ) in the Bayesian design is the same as the actual joint distribution of (y_K, θ) in a series of trials, i.e., f(y_K | θ)π(θ) = f_0(y_K | θ)π_0(θ). Then, both the FDR and FPR of the design over this series of trials are upper bounded regardless of the time and frequency of interim analyses: FDR ≤ 1 − min_j γ_j, and FPR ≤ (1 − min_j γ_j)/Pr(θ ≤ 0). The proof is given in Appendices A.2 and A.3. Therefore, from a calibrated Bayesian perspective, the prior on θ could be elicited to resemble the actual distribution of θ in repeated practices as closely as possible, and the threshold values reflect acceptable FDR and FPR levels. In general, requiring a design to have good operating characteristics (under plausible scenarios) is more lenient than requiring it to have good frequentist properties (for all possible parameter values).
For example, the type I error rate is essentially the FPR when π_0(θ) is a point mass. Stringent type I error rate control requires that the FPR is controlled for all possible π_0(θ), even when π_0(θ) is a point mass at 0, while the calibrated Bayesian approach only requires the FPR to be controlled for plausible π_0(θ). In this sense, the calibrated Bayesian approach can be thought of as a middle ground between the mixed frequentist-Bayesian approach and the subjective Bayesian approach. We conclude this subsection with a remark on prior specification for hypothesis testing versus parameter estimation. From a testing perspective, one may specify separate prior densities for θ under the null and alternative hypotheses, say π^(0)(θ) and π^(1)(θ), leading to the mixture prior π(θ) = (1 − ω)π^(0)(θ) + ωπ^(1)(θ). Importantly, π^(0)(θ) and π^(1)(θ) have supports on (−∞, 0] and (0, ∞), respectively. Then, the prior probability for each hypothesis is also specified, Pr(H_0) = 1 − ω and Pr(H_1) = ω. At analysis j, the posterior probability of H_1 is Pr(H_1 | y_j) = ω ∫ f(y_j | θ)π^(1)(θ)dθ / [ (1 − ω) ∫ f(y_j | θ)π^(0)(θ)dθ + ω ∫ f(y_j | θ)π^(1)(θ)dθ ], which can be used to decide whether to stop the trial early. For example, if Pr(H_1 | y_j) > γ_j, H_0 is rejected, and the trial is stopped. This approach is equivalent to specifying a mixture prior distribution for θ and then stopping the trial at analysis j if Pr(θ > 0 | y_j) > γ_j. Note that under the mixture prior, Pr(θ > 0 | y_j) = Pr(H_1 | y_j). This relationship has been noted by Zhou et al. (2021). Although these two approaches are equivalent, when the primary goal is hypothesis testing, the prior for θ is usually specified as a mixture of two truncated distributions; when the primary goal is parameter estimation, the prior for θ is usually specified as a single continuous distribution. A special case is when H_0 is a point hypothesis, say when we test H_0: θ = 0 vs H_1: θ ≠ 0. From a hypothesis testing perspective, the prior for θ should be a mixture of a point mass at θ = 0 (denoted by δ_0(θ)) and a continuous distribution, π(θ) = (1 − ω)δ_0(θ) + ωπ^(1)(θ). Such a prior distribution is rarely used when the primary goal is parameter estimation.
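Returning to the calibrated Bayesian approach of Section 3.1.3, the simulation-based assessment of the FDR and FPR can be sketched as follows. The population distribution π_0 = N(0, 0.5²) is a hypothetical choice, used here as the analysis prior as well so that (to our reading) the matched-prior condition of Proposition 3.1 holds, in which case the FDR should stay below 1 − min_j γ_j = 0.05; the sample sizes are likewise arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
sigma, K, g, gamma = 1.0, 5, 20, 0.95
n = g * np.arange(1, K + 1)
mu, nu = 0.0, 0.5          # prior == population distribution (hypothetical)

S = 50_000
theta = rng.normal(mu, nu, size=S)                       # theta^(s) ~ pi_0
# Group sums of g outcomes per look: mean theta*g, sd sigma*sqrt(g).
groups = rng.normal(theta[:, None] * g, sigma * np.sqrt(g), size=(S, K))
sums = np.cumsum(groups, axis=1)                         # sum of y_i at each look

# Posterior Pr(theta > 0 | y_j) under the (correctly specified) conjugate prior.
m = (nu**2 * sums + sigma**2 * mu) / (n * nu**2 + sigma**2)
v = sigma**2 * nu**2 / (n * nu**2 + sigma**2)
pp = norm.cdf(m / np.sqrt(v))
reject = (pp > gamma).any(axis=1)                        # y_K in the region Gamma

null = theta <= 0
fdr = (reject & null).sum() / reject.sum()               # false discovery rate
fpr = (reject & null).sum() / null.sum()                 # false positive rate
print(round(fdr, 3), round(fpr, 3))
```

In this matched-prior scenario the estimated FDR falls well under 0.05 regardless of how many interim looks are taken, consistent with the bound discussed above.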
Lastly, Johnson and Cook (2009) and Johnson and Rossell (2010) recommended the use of non-local prior densities, which incorporate a minimum practically significant separation between the null and alternative hypotheses, for Bayesian hypothesis testing and applications in trial monitoring. In the upcoming sections, we review some other classes of Bayesian sequential designs.
3.2. Designs Based on Posterior Predictive Probabilities. Similar to the idea of conditional power, posterior predictive probabilities can be used to determine whether to stop a trial early. See, e.g., Dmitrienko and Wang (2006), Lee and Liu (2008), and Saville et al. (2014). Suppose that at the final analysis, efficacy of the drug will be declared if Pr(θ > 0 | yK) > 1 − η. At analysis j ∈ {1, ..., K − 1}, the posterior predictive distribution of the future observations y*_{j,K} = (y*_{n_j+1}, ..., y*_{n_K}) is
\[
f(y^*_{j,K} \mid y_j) = \int f(y^*_{j,K} \mid \theta)\, p(\theta \mid y_j)\, d\theta,
\]
and the posterior predictive probability of success (PPOS) is
\[
\mathrm{PPOS}_j = \Pr\!\left[ \Pr(\theta > 0 \mid y_j, y^*_{j,K}) > 1 - \eta \;\middle|\; y_j \right]
= \int 1\!\left\{ \Pr(\theta > 0 \mid y_j, y^*_{j,K}) > 1 - \eta \right\} f(y^*_{j,K} \mid y_j)\, dy^*_{j,K}.
\]
One may stop the trial early if PPOSj > γj for some threshold γj. To specify the prior for θ and the threshold values {γ1, ..., γK−1} and η, one may take one of the approaches in Sections 3.1.1–3.1.3. For the single-arm trial example, we have
\[
\bar{y}^*_{j,K} \mid y_j \sim \mathrm{N}\!\left( m_j,\; v_j + \frac{\sigma^2}{n_K - n_j} \right),
\]
where ȳ*_{j,K} = (y*_{n_j+1} + ··· + y*_{n_K})/(nK − nj), and mj and vj denote the posterior mean and variance of θ given yj. The criterion Pr(θ > 0 | yj, y*_{j,K}) > 1 − η is equivalent to
\[
\bar{y}^*_{j,K} > t_j, \quad \text{where } t_j = \frac{\sigma^2}{n_K - n_j} \left( z_{1-\eta} \sqrt{\nu^{-2} + n_K \sigma^{-2}} - \mu \nu^{-2} - \sigma^{-2} \textstyle\sum_{i=1}^{n_j} y_i \right)
\]
and z_{1−η} denotes the 1 − η quantile of the standard normal distribution. Finally, it can be derived that
\[
\mathrm{PPOS}_j = 1 - \Phi\!\left( \frac{t_j - m_j}{\sqrt{v_j + \sigma^2 / (n_K - n_j)}} \right).
\]
The PPOS depends on η and nK. In general, the stopping rules based on PPOS and PP are different, although for given η and nK, one may select γj such that {PPOSj > γj} and {PPj > γj} are equivalent. As a result, one may also impose type I error rate control on PPOS stopping rules based on the arguments in Section 3.1.1. As noted by Saville et al. (2014), if at the jth interim analysis the amount of data remaining to be collected (nK − nj) is infinite, then PPOSj = PPj regardless of η. Typically, the PPOS is close to the PP at the beginning of a trial and moves toward either 0 or 1 as the trial nears completion.
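For the conjugate normal model, the PPOS has a closed form, which the following sketch computes and then checks against a Monte Carlo draw from the posterior predictive distribution. The settings (n_j = 40, n_K = 100, η = 0.05, a N(0, 1²) prior) are hypothetical.

```python
import random
from statistics import NormalDist

std = NormalDist()

# Interim analysis j with n_j of n_K outcomes observed; S_j is their sum.
mu, nu, sigma = 0.0, 1.0, 1.0
n_j, n_K, eta = 40, 100, 0.05
S_j = 10.0
m_rem = n_K - n_j                           # outcomes still to be observed

v_j = 1.0 / (1.0 / nu**2 + n_j / sigma**2)  # posterior variance at analysis j
m_j = v_j * (mu / nu**2 + S_j / sigma**2)   # posterior mean at analysis j

# Final-analysis success, Pr(theta > 0 | y_K) > 1 - eta, reduces to
# "future sample mean exceeds a threshold t" under conjugacy.
z = std.inv_cdf(1 - eta)
P_K = 1.0 / nu**2 + n_K / sigma**2          # final posterior precision
t = (z * P_K**0.5 - mu / nu**2 - S_j / sigma**2) * sigma**2 / m_rem

# Closed-form PPOS: the predictive of the future mean is N(m_j, v_j + s2/m_rem).
s = (v_j + sigma**2 / m_rem) ** 0.5
ppos = 1 - std.cdf((t - m_j) / s)

# Monte Carlo check: draw theta from the posterior, then future data.
rng = random.Random(7)
hits, S = 0, 20000
for _ in range(S):
    theta = rng.gauss(m_j, v_j**0.5)
    ybar_future = rng.gauss(theta, (sigma**2 / m_rem) ** 0.5)
    m_K = (mu / nu**2 + (S_j + m_rem * ybar_future) / sigma**2) / P_K
    hits += std.cdf(m_K * P_K**0.5) > 1 - eta   # final Pr(theta > 0 | y_K)
mc = hits / S
```

The Monte Carlo frequency should agree with the closed form up to simulation error, which is one way to sanity-check an implementation of a PPOS-based stopping rule.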
3.3. Decision-theoretic Designs. As described in Section 3.1.1, the decisions in a sequential clinical trial can be made by minimizing the expected loss under a decision-theoretic framework. This approach has been considered by Berry and Ho (1988), Lewis and Berry (1994), Stallard et al. (1999), and Ventz and Trippa (2015), among others. The idea is that, at each interim analysis, the decision to stop the trial early and reject H0 is associated with some loss if the decision is wrong. On the other hand, continuing the trial results in more cost in terms of patient recruitment, but with more data, the chance of making a wrong decision may be decreased. By considering both factors, decision-theoretic designs combine the strengths of designs based on posterior and posterior predictive probabilities. We illustrate the idea of decision-theoretic designs through the single-arm trial example. Let ϕj denote a possible decision at analysis j. For j = 1, ..., K − 1, ϕj = 1 (or 0) represents rejecting H0 and stopping the trial early (or failing to reject and continuing enrollment). For j = K, ϕK = 1 (or 0) represents rejecting (or failing to reject) H0 at the final analysis, and the trial is stopped in either case. Let ℓj(ϕj, θ, yj) denote the loss of making decision ϕj at analysis j given parameter θ and data yj. The posterior expected loss is then
\[
L_j(\phi_j, y_j) = \int_\theta \ell_j(\phi_j, \theta, y_j)\, p(\theta \mid y_j)\, d\theta.
\]
The optimal decision is φ̂j(yj) = arg min_{ϕj} Lj(ϕj, yj), and the associated expected loss is L̃j(yj) = min_{ϕj} Lj(ϕj, yj), i.e., the Bayes risk. Suppose that the loss of making decision ϕj = 1 at analysis j (j = 1, ..., K − 1) is
\[
\ell_j(1, \theta, y_j) = \xi_{1j} \cdot 1(\theta \le 0),
\]
where ξ1j is the loss of mistakenly rejecting H0 and stopping the trial if θ ≤ 0. On the other hand, if ϕj = 0, the trial continues, (n_{j+1} − nj) patients will be enrolled until the next analysis, and we assume a unit loss for recruiting each patient.
We therefore take the expected loss of continuing at analysis j (ϕj = 0) to be
\[
L_j(0, y_j) = (n_{j+1} - n_j) + \int \tilde{L}_{j+1}(y_j, y^*_{j,j+1})\, p(y^*_{j,j+1} \mid y_j)\, dy^*_{j,j+1}.
\]
Here, the integral is the Bayes risk at analysis (j + 1) marginalized over the posterior predictive distribution of y*_{j,j+1} = (y*_{n_j+1}, ..., y*_{n_{j+1}}), that is, the observations between analyses j and j + 1. We also assume the loss of making decision ϕK at the final analysis is
\[
\ell_K(1, \theta, y_K) = \xi_{1K} \cdot 1(\theta \le 0), \qquad \ell_K(0, \theta, y_K) = \xi_0 \cdot 1(\theta > 0).
\]
Here, ξ1K is the loss of mistakenly rejecting H0 at the final analysis if θ ≤ 0 (a type I error), and ξ0 is the loss of failing to reject H0 if θ > 0 (a type II error). At analysis j, the optimal decision φ̂j(yj) can be solved by backward induction (DeGroot, 1970, Chapter 12). First, we calculate L̃K(yK) for all possible data yK that can arise at the final analysis. Next, using Equations (10) and (11), we can calculate L̃K−1(yK−1) for all possible data yK−1 that can arise at analysis (K − 1). Proceeding backward in this way gives L̃K−2(yK−2), ..., L̃j(yj). This procedure requires many minimizations and integrations which may not be analytically tractable. Simulation-based approaches have been proposed to mitigate these computational challenges (Müller et al., 2007). Lewis and Berry (1994) demonstrated that by tuning the loss functions, decision-theoretic designs can achieve desirable type I error rate control. Ventz and Trippa (2015) considered constrained optimal designs with explicit frequentist requisites. Alternatively, the loss functions and prior can be chosen by taking the subjective or calibrated Bayesian approach.
3.4. Posterior Inference. Suppose a trial is stopped at analysis t with accumulated data yt; Bayesian inference about θ is then based on the posterior distribution p(θ | yt). One may be worried that the stopping time t is not included in the conditioning of p(θ | yt). However, we have
\[
p(\theta \mid t, y_t) \propto p(t \mid y_t, \theta)\, f(y_t \mid \theta)\, \pi(\theta).
\]
Most often (and in all the designs that we have reviewed), θ affects t only through the observations yt, i.e., θ and t are independent conditional on yt. Therefore, p(θ | t, yt) = p(θ | yt), and the stopping rule plays no role in the posterior distribution of θ. See, e.g., Hendriksen et al. (2021).
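The backward induction described above can be sketched on a toy problem. The code below assumes the losses ξ1 · 1(θ ≤ 0) for rejecting and ξ0 · 1(θ > 0) for failing to reject, a unit cost per enrolled patient, and approximates the predictive integral with quantile midpoints; K, g, and the loss values are all illustrative.

```python
from statistics import NormalDist

std = NormalDist()

# Hypothetical setup: K analyses, group size g between analyses, known sigma,
# N(mu, nu^2) prior, loss xi1 for a false rejection, xi0 for a false
# non-rejection, and unit loss per enrolled patient.
mu, nu, sigma, g, K = 0.0, 1.0, 1.0, 10, 3
xi1, xi0 = 1900.0, 100.0
Q = 15  # quadrature points (quantile midpoints of the predictive)

def posterior(n, total):
    v = 1.0 / (1.0 / nu**2 + n / sigma**2)
    return v * (mu / nu**2 + total / sigma**2), v

def risk(j, n, total):
    """Bayes risk and optimal decision at analysis j, given the sufficient
    statistics (n, total) of the data observed so far."""
    m, v = posterior(n, total)
    p_null = std.cdf(-m / v**0.5)          # Pr(theta <= 0 | data)
    stop_loss = xi1 * p_null               # reject H0 and stop now
    if j == K:
        accept_loss = xi0 * (1 - p_null)   # fail to reject at the final look
        return (stop_loss, 1) if stop_loss < accept_loss else (accept_loss, 0)
    # Continue: pay g for the next group, plus expected future Bayes risk,
    # integrating the predictive of the next group mean over Q midpoints.
    s = (v + sigma**2 / g) ** 0.5          # predictive sd of next group mean
    future = 0.0
    for i in range(Q):
        ybar = std.inv_cdf((i + 0.5) / Q) * s + m
        future += risk(j + 1, n + g, total + g * ybar)[0]
    cont_loss = g + future / Q
    return (stop_loss, 1) if stop_loss < cont_loss else (cont_loss, 0)

# Strong interim evidence -> stop and reject; equivocal evidence -> continue.
loss_hi, dec_hi = risk(1, g, 5.0 * g)      # interim sample mean 5.0
loss_eq, dec_eq = risk(1, g, 0.0)          # interim sample mean 0.0
```

The quantile-midpoint integration is a crude stand-in for the exact predictive integral; in practice one would use finer quadrature or the simulation-based methods cited above.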
The posterior mean, E(θ | yt), is a commonly used point estimator for θ. A 100(1 − α)% credible interval for θ can be constructed as (θL, θU), where θL and θU are the α/2 and (1 − α/2) quantiles of p(θ | yt), respectively. This credible interval has its asserted coverage in repeated practices if the model specification is correct (see Appendix A.1), but the coverage may deteriorate in the presence of model misspecification. Lastly, the posterior probability of the alternative hypothesis, Pr(θ > 0 | yt), can also be reported.
3.5. Discussion. Bayesian sequential designs utilize posterior and/or posterior predictive probabilities to make early stopping decisions. After the completion of sequential trials, Bayesian credible intervals for the treatment effects are calculated based on their posterior distributions. The prior and threshold values/loss functions in a Bayesian design can be chosen using the mixed frequentist-Bayesian approach, the subjective Bayesian approach, or the calibrated Bayesian approach. Assessing a design or making a decision based on trial data has traditionally been based on controlling the type I error rate as opposed to minimizing the expected loss with carefully calibrated prior and loss functions. While we argue that the latter reflects the true objective of decision-making in clinical trials, we also acknowledge the challenges with such an approach (Berry et al., 2010). First, to warrant good operating characteristics of Bayesian designs in repeated practices, prior specification is crucial, which is not easy. Second, clinical trials often involve multiple decision makers with distinctive prior opinions and tolerances for risk. Third, even for experts trained in probabilistic thinking, the process of eliciting costs and benefits can be difficult. As Spiegelhalter et al.
(1994) noted, "when the decision is whether or not to discontinue the trial, coupled with whether or not to recommend one treatment in preference to the other, the consequences of any particular course of action are so uncertain that they make the meaningful specification of utilities rather speculative." As a result, calibrating the prior and loss functions to achieve desirable type I error rate control is a simple and realistic approach. Although Bayesian designs may involve additional complexities compared to their frequentist counterparts (e.g., prior elicitation and computational challenges when the posterior distribution is not analytically tractable), they have certain advantages (see, e.g., Freedman et al., 1994). First, with a chosen probability model, the data affect posterior inference only through the likelihood function. In this way, Bayesian inference obeys the LP (Gelman et al., 2013, p. 7). This can be philosophically appealing. Frequentist inference, on the other hand, may be affected by unrealized events. We will elaborate on this point in Section 4. Second, the stopping rule of an experiment is irrelevant to the construction and interpretation of a Bayesian credible interval. In contrast, a frequentist interval estimate of the treatment effect following a group sequential trial crucially depends on the stopping rule. As has been pointed out, such an interval may be quite unintuitive. Depending on the choice of sample space ordering, the interval may not always include the sample mean and can include zero difference even for data that lead to a recommendation to stop the trial at the first interim analysis (see Rosner and Tsiatis, 1988). Third, stringent frequentist inference can be challenging or unsatisfactory if the prescribed stopping rule is not followed. For example, a trial may be stopped due to unforeseeable circumstances such as the outbreak of COVID-19; in some cases, it may be desirable to extend a trial beyond the planned sample size.
Some have criticized that the relevance of stopping rules makes it almost impossible to conduct any frequentist inference in a strict sense (Berger, 1980; Berry, 1985; Berger and Wolpert, 1988; Wagenmakers, 2007).
4. The Likelihood Principle. Statistical inference and decision-making in sequential clinical trials are typically tied with the LP. We provide some discussion in this section. Let Y denote a random variable with density fθ(y). The likelihood function for θ, given the observed outcome y of the random variable Y, is Ly(θ) = fθ(y); that is, the density evaluated at y and considered as a function of θ. The (strong) LP, as in Birnbaum (1962) and Berger and Wolpert (1988), can be summarized as follows:
The Likelihood Principle. All the statistical evidence about θ arising from an experiment is contained in the likelihood function for θ given y. Two likelihood functions for θ (from the same or different experiments) contain the same statistical evidence about θ if they are proportional to one another.
Birnbaum (1962) showed that the LP can be deduced from two widely accepted principles: the sufficiency principle and the conditionality principle. There have been debates regarding Birnbaum's proof and the validity of the LP in general. A detailed treatment of the LP is outside the scope of this paper. We refer interested readers to Berger and Wolpert (1988), Robins and Wasserman (2000), and Evans (2013). What would be the consequences if we accept the LP? Since the LP deals only with the observed y, data that did not obtain and experiments not carried out have no impact on the evidence about θ (Berry, 1987; Berger and Wolpert, 1988). Also, as in Berger and Wolpert (1988), the LP implies that the reason for stopping an experiment (the stopping rule) should be irrelevant to the evidence about θ. In a clinical trial, the implication is that early stopping, no matter the reason, would not affect the evidential meaning of the trial outcome.
As an illustration, consider the example given by Berry (1987). Imagine that a single-arm trial as described in Section 1 has been conducted; two investigators who observe identical data but planned different interim analysis schedules would report different frequentist conclusions, even though their likelihood functions for θ are identical. The conflict here does not mean we have to either reject the LP or reject frequentist procedures. As explained previously (e.g., Berger and Wolpert, 1988; Gandenberger, 2015b; Gandenberger, 2017), the LP is not a decision procedure and gives little guidance in assessing the overall performance of a decision procedure. The LP implies that only the observed data are relevant to the evidence about θ, but the loss/utility for making a specific decision may depend on other aspects of an experiment. First, while the evidence about θ is trial-specific, a decision procedure is applied to many trials. For example, from a regulatory agency's perspective, the loss of a decision to approve a drug reflects not only the consequences of administering this drug to patients, but also the downstream consequences of that decision rule for other drugs in the future (Gandenberger, 2017). Therefore, frequentist measures such as the type I error rate can be factored into the decision procedure. Second, even for a single trial, it is not unreasonable to associate the loss of a decision with possible (unrealized) future consequences. For example, in a Bayesian decision-theoretic design, the loss of continuing the trial depends on possible future events (Equation 11). Imagine an ongoing clinical trial with a maximum sample size of 400 patients and an outcome variance of σ² = 1. After 200 outcomes have been recorded, an interim analysis is being performed by two investigators C and D, who use the same probability model with a N(0, 1²) prior on θ but have different plans. Investigator C planned another interim analysis after 300 observations, while investigator D did not plan to conduct any additional interim analysis. Suppose the z-statistic at the interim analysis is z1 = 1.75.
Then, using the design and loss functions described in Section 3.3 with ξ 0 = 400 and ξ 1j ≡ 19ξ 0 for all j, the optimal decisions for investigators C and D are continuing enrollment and stopping the trial, respectively. Specifically, Figure 1 shows the posterior expected losses for possible decisions that can be made by the two investigators. We can see that the existence of a planned future interim analysis has an impact on the posterior expected loss associated with continuing the trial. In summary, if a dichotomous decision must be made, the LP does not preclude one from utilizing other information in addition to the observed data. Still, the conflict does suggest that if we accept the LP, then frequentist measures such as type I/II error rates and p-values may not be used as measures of statistical evidence for or against a hypothesis in a clinical trial (Berger and Wolpert, 1988) . This point has been raised by many others as well. For example, Royall (1997) stated that "Neyman-Pearson statistical theory is aimed at finding good rules for choosing from a specified set of possible actions. It does not address the problem of representing and interpreting statistical evidence, and the decision rules derived from Neyman-Pearson theory are not appropriate tools for interpreting data as evidence." It should also be noted that not all Bayesian procedures are in compliance with the LP. For example, eliciting the prior for θ based on the sampling plan, such as using the Jeffreys prior (Jeffreys, 1946) , results in violation of the LP (Berger and Wolpert, 1988, p. 21) . We have mentioned in Section 3.1.1 that one may control the type I error rate of a Bayesian sequential design by calibrating the prior or threshold values. To avoid violation of the LP, however, we recommend taking the latter approach and not selecting the prior based on trial planning. 
Intuitively, changing the threshold values only affects decision-making, while changing the prior affects both the evidence about θ (e.g., point and interval estimation) and decision-making. As an illustration, we calculate the stopping boundaries for the z-statistics given by some of the aforementioned frequentist and Bayesian sequential designs, that is, the {c1, ..., cK} values for which we would stop the trial at analysis j if zj > cj. We consider the single-arm trial example described in Section 1. Suppose that a total of K = 5 (interim and final) analyses are planned, the maximum sample size is nK = 1000, and patients are enrolled in groups of size 200 (nj = 200j). The variance of the outcomes is set at σ² = 1 and is assumed known. For a fair comparison, the design parameters are calibrated such that the type I error rate at θ = 0 is α = 0.05 for every design. Specifically:
(i) For the Pocock and O'Brien-Fleming procedures (Section 2.1) and the error spending approaches (Section 2.2), the stopping boundaries are calculated using the R package gsDesign (Anderson, 2021). For the error spending approach with spending function h3, we use b = 1.
(ii) For stochastic curtailment based on conditional power (Equation 4), we use γ = 0.8 and find that setting η = 0.049 leads to a type I error rate of 0.05.
(iii) For stopping boundaries based on posterior probabilities (Equation 6), we consider the following two versions. In the first version, we use γj ≡ 0.95 and find that a N(0, 0.054²) prior for θ leads to a type I error rate of 0.05. In the second version, we place a N(0, 1²) prior on θ and find that setting γj ≡ 0.983 leads to a type I error rate of 0.05.
(iv) For stopping boundaries based on posterior predictive probabilities (Section 3.2), we set γj ≡ 0.8, η = 0.05, and find that a N(0, 0.063²) prior for θ leads to a type I error rate of 0.05.
(v) For the Bayesian decision-theoretic design (Section 3.3), we place a N(0, 1²) prior on θ, use ξ0 = 1000, and find that setting ξ1j ≡ 34890 leads to a type I error rate of 0.05.
The stopping boundaries are summarized in Table 1. The comparison of the stopping boundaries illustrates the differences among these sequential designs in a qualitative manner. For quantitative comparisons, a number of criteria can be considered. For example, one can match the type I error rate (at θ = 0) and type II error rate (at a specific θ value) of the designs, and then compare the power and expected sample size of the designs over a range of possible parameter values. For more details, refer to Jennison and Turnbull (2000).
Table 1. Stopping boundaries for the z-statistics given by several frequentist and Bayesian sequential designs. The single-arm trial in Section 1 is considered with K = 5 analyses, a maximum sample size of nK = 1000, and equal group sizes (nj = 200j). The design parameters are calibrated such that the type I error rate at θ = 0 is α = 0.05 for every design.
5. Numerical Studies. We conduct simulation studies to explore the operating characteristics of the Bayesian design in Section 3.1. Consider the single-arm trial example in Section 1 with a maximum sample size of nK = 1000. Suppose the actual effect size of the trial, θ, is a random draw from N(µ0, ν0²). As the trial progresses, patient outcomes become available sequentially and follow a normal distribution, y1, y2, ... ∼ N(θ, σ²). The trial statistician, on the other hand, uses a N(µ, ν²) prior to draw inference about θ, which may or may not be identical to the actual population distribution of θ. For simplicity, assume the sampling model used by the statistician, f(yK | θ), is correctly specified. At prespecified time and frequency, the statistician conducts interim analyses of accumulating data.
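The kind of type-I-error calibration used in items (i)–(v) can be mimicked by simulation. As one hedged example, the sketch below finds a constant (Pocock-style) boundary c for the z-statistics such that the overall type I error rate at θ = 0 is approximately α, using the empirical (1 − α) quantile of max_j zj over simulated null trials; all settings are illustrative and much smaller than those in Table 1.

```python
import random

def calibrate_constant_boundary(K, g, sigma=1.0, alpha=0.05, S=5000, seed=3):
    """Find a constant boundary c such that, under theta = 0,
    Pr(z_j > c for some j) is approximately alpha, by taking the empirical
    (1 - alpha) quantile of max_j z_j over S simulated null trials."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(S):
        total, zmax = 0.0, float("-inf")
        for j in range(1, K + 1):
            for _ in range(g):
                total += rng.gauss(0.0, sigma)
            z = total / ((j * g) ** 0.5 * sigma)   # z-statistic at analysis j
            zmax = max(zmax, z)
        maxima.append(zmax)
    maxima.sort()
    return maxima[int((1 - alpha) * S)]

c = calibrate_constant_boundary(K=5, g=20)
```

Because of the repeated looks, the calibrated boundary must exceed the fixed-sample one-sided critical value 1.645; the same bisection-free quantile trick extends to non-constant boundary shapes by calibrating a single scale factor.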
If the stopping rule as in Equation (6) is triggered, H 0 is rejected, the trial is stopped, and efficacy of the drug is declared. We consider 72 simulation scenarios, one for each combination of ν 0 ∈ {0.1, 0.5, 1}, ν ∈ {0.1, 0.5, 1, 10}, and K ∈ {1, 2, 5, 10, 100, 1000}. For simplicity, we fix the other parameters: µ 0 = µ = 0, and σ = 1. Here, a larger (or smaller) value of ν 0 indicates that the actual effect size is more likely to be larger (or smaller). We do not consider ν 0 > 1 as in practice, a standardized effect size that is much larger than what could be drawn from a N(0, 1 2 ) distribution is not common. A larger (or smaller) value of ν represents that the assumed prior for θ is more diffuse (or more concentrated around zero). When ν 0 = ν, the population distribution of θ over different trials is the same as the prior for θ used for analysis. Lastly, K is the total number of (interim and final) analyses. We assume that patients are enrolled in groups of equal size n K /K. For each scenario, we simulate S = 10, 000 hypothetical trials by first generating θ (1) , . . . , θ (S) ∼ N(µ 0 , ν 2 0 ). Next, for each θ (s) , trial outcomes are sequentially generated from N (θ (s) , σ 2 ). Interim analyses are performed after every n K /K outcomes have been observed, and the trial is stopped if the stopping rule as in Equation (6) is satisfied with γ j ≡ γ = 0.95. We record the FDR and FPR as defined in Equation (8). In addition, we record the percentage of 95% credible intervals for θ, calculated as in Section 3.4, that cover the true values. Table 2 summarizes the simulation results. Although the FDR and FPR increase with the number of analyses, according to Proposition 3.1, the FDR and FPR are upper bounded when the statistician's model is correctly specified. 
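A scaled-down version of this simulation experiment can be sketched as follows, comparing a matched analysis prior (ν = ν0) with a diffuse one (ν much larger than ν0); the trial sizes and number of looks are reduced from those in the paper so the example runs quickly, and all settings are illustrative.

```python
import random
from statistics import NormalDist

Phi = NormalDist().cdf

def fdr(nu0, nu, K=20, n_max=100, gamma=0.95, sigma=1.0, S=4000, seed=11):
    """FDR of the posterior-probability stopping rule when the true effects
    follow N(0, nu0^2) but the analysis prior is N(0, nu^2)."""
    rng = random.Random(seed)
    g = n_max // K
    false_rej = rejections = 0
    for _ in range(S):
        theta = rng.gauss(0.0, nu0)
        total, rej = 0.0, False
        for j in range(1, K + 1):
            for _ in range(g):
                total += rng.gauss(theta, sigma)
            n = j * g
            v = 1.0 / (1.0 / nu**2 + n / sigma**2)
            m = v * total / sigma**2
            if Phi(m / v**0.5) > gamma:   # Pr(theta > 0 | y_j) > gamma
                rej = True
                break
        rejections += rej
        false_rej += rej and theta <= 0
    return false_rej / max(rejections, 1)

fdr_matched = fdr(nu0=0.1, nu=0.1)    # prior matches the truth
fdr_diffuse = fdr(nu0=0.1, nu=10.0)   # overly diffuse analysis prior
```

With many interim looks and small true effects, the diffuse prior behaves like repeated unadjusted significance testing and its FDR inflates well past 1 − γ, while the matched prior keeps the FDR bounded, in line with Proposition 3.1.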
These theoretical results are corroborated by the simulations: when ν0 = ν, the FDR is roughly bounded by 1 − γ = 5% (due to Monte Carlo error and the finite number of simulations, the FDR may sometimes slightly exceed 5%), and the FPR is always below (1 − γ)/γ = 5.3%. In addition, when ν0 = ν, the coverage of the 95% credible intervals for θ is around 95% regardless of K. In the presence of model misspecification, however, Bayesian statements may not attain their asserted coverage, and the discrepancy becomes larger with more frequent applications of data-dependent stopping rules. These results are consistent with the findings in Rubin (1984) and Rosenbaum and Rubin (1984). When the assumed prior is more diffuse than the actual distribution of θ, the FDR and FPR are inflated, and the degree of FDR and FPR inflation becomes greater when K is larger. For example, when ν0 = 0.1, ν = 10, and K = 1000, the FDR and FPR are around 20%. From a calibrated Bayesian point of view, if one wants to control the FDR and FPR below 5% for all possible ν0 and K considered, ν should be set at ≤ 0.1; if one plans to conduct no more than K = 10 analyses, then setting ν ≤ 1 is sufficient. We caution against the use of diffuse priors for decision-making if data-dependent stopping rules are in frequent use and the actual effect sizes are believed to be small. In addition, when ν0 ≠ ν, the coverage of the 95% credible intervals for θ is below 95% and decreases as K increases. Interestingly, an overly conservative prior (one that is more concentrated around zero than the actual distribution) results in low coverage of the credible intervals, while a diffuse prior has less impact on the coverage.
6. Concluding Remarks. We have reviewed frequentist and Bayesian sequential clinical trial designs. Frequentist designs need to be adjusted for the planning of interim analyses to achieve intended type I error rate control.
Although similar adjustments can be made to Bayesian designs, such adjustments are not necessary, as type I error rate control is not central to Bayesian inference. We have summarized two alternative approaches to formulating Bayesian designs, namely the subjective Bayesian approach and the calibrated Bayesian approach, and have discussed their implications. We have also commented on the role of the LP in sequential trial designs. While the LP implies that unrealized events are irrelevant to the statistical evidence about the treatment effect, it gives little guidance in assessing a decision procedure and thus does not preclude the use of frequentist measures in decision-making. We have focused on the fundamental and philosophical differences between frequentist and Bayesian sequential designs. For simplicity, we have used a single-arm trial with normally distributed outcomes to illustrate the designs. It would be of interest to extend the methodology and discussion to randomized-controlled trials and trials with other types of outcomes (e.g., binary and time-to-event) and distributional assumptions. For example, in a randomized-controlled trial with normally distributed outcomes, at analysis j, the observed data are yr1, yr2, ..., y_{r n_{rj}} ∼ N(θr, σr²) for arm r, where r = 1 and 0 represent the investigational drug and control arms, respectively. The goal may be to test H0 : θ1 ≤ θ0 versus H1 : θ1 > θ0. Assume σ1² and σ0² are known. Let ȳrj denote the sample mean for arm r at analysis j, and let
\[
z_j = \frac{\bar{y}_{1j} - \bar{y}_{0j}}{\sqrt{\sigma_1^2 / n_{1j} + \sigma_0^2 / n_{0j}}}.
\]
Under the frequentist paradigm, one can reject H0 and stop the trial if zj > cj for some critical value cj. The specification of the cj values can follow exactly the same arguments as in Section 2. Under the Bayesian paradigm, one can specify a prior distribution for θ = θ1 − θ0, say θ ∼ N(µ, ν²). The posterior distribution of θ at analysis j is given by
\[
\theta \mid y_{1j}, y_{0j} \sim \mathrm{N}\!\left( \frac{\mu \nu^{-2} + (\bar{y}_{1j} - \bar{y}_{0j}) \left(\sigma_1^2 / n_{1j} + \sigma_0^2 / n_{0j}\right)^{-1}}{\nu^{-2} + \left(\sigma_1^2 / n_{1j} + \sigma_0^2 / n_{0j}\right)^{-1}},\; \frac{1}{\nu^{-2} + \left(\sigma_1^2 / n_{1j} + \sigma_0^2 / n_{0j}\right)^{-1}} \right).
\]
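The two-arm posterior update displayed above can be coded directly. A minimal sketch, with all inputs hypothetical:

```python
from statistics import NormalDist

def two_arm_posterior(ybar1, n1, ybar0, n0, sigma1=1.0, sigma0=1.0,
                      mu=0.0, nu=1.0):
    """Posterior of theta = theta1 - theta0 under a N(mu, nu^2) prior,
    following the conjugate update for the difference of two normal means
    with known variances."""
    prec_data = 1.0 / (sigma1**2 / n1 + sigma0**2 / n0)  # data precision
    prec = 1.0 / nu**2 + prec_data                       # posterior precision
    mean = (mu / nu**2 + (ybar1 - ybar0) * prec_data) / prec
    return NormalDist(mean, (1.0 / prec) ** 0.5)

post = two_arm_posterior(ybar1=0.5, n1=100, ybar0=0.2, n0=100)
prob_superior = 1 - post.cdf(0.0)   # Pr(theta > 0 | data)
```

The posterior probability Pr(θ > 0 | data) can then be compared with a threshold γj exactly as in the single-arm designs of Section 3.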
Then, one can proceed similarly as in Section 3. An alternative approach is to specify independent priors separately for θ 1 and θ 0 and then use these to obtain a posterior distribution for θ. This will lead to slightly different designs. See Stallard et al. (2020) . In all the trial examples we have given, the variances of the outcomes have been assumed known, but it is straightforward to extend the designs to accommodate unknown variances. So far, we have only considered early stopping for efficacy. In practice, it may be desirable to allow for early stopping when interim results suggest the investigational drug is unlikely to have a clinically meaningful treatment effect (Snapinn et al., 2006) . This is known as early stopping for futility. A sequential trial design can include a provision for either early efficacy stopping, early futility stopping, or both. Consider the single-arm trial example. Under the frequentist paradigm, one could stop the trial at analysis j in favor of the null hypothesis if z j < d j for some critical value d j ; under the Bayesian paradigm, one could stop early for futility if Pr(θ > 0 | y j ) < τ j . Futility stopping rules do not inflate the type I error rate; actually, they decrease the type I error rate. However, futility stopping rules also decrease the power and increase the false negative rate (FNR) and false omission rate (FOR) of a design. In frequentist sequential designs, the futility stopping boundaries are typically selected to satisfy certain power requirements while maintaining the type I error rate at a desirable level (e.g., Pampallona and Tsiatis, 1994) . In Bayesian sequential designs, one could specify the futility boundaries to either satisfy certain power and type I error rate requirements, reflect personal beliefs, or achieve desirable FNR, FOR, FDR, and FPR under plausible scenarios. Two-sided tests and point null hypotheses are very common in clinical trials. 
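The combined efficacy/futility monitoring described above can be sketched as follows. Using common random numbers across the two designs makes the type-I-error-reducing effect of the futility rule visible path by path; the thresholds γ = 0.95 and τ = 0.10 and the trial dimensions are illustrative.

```python
import random
from statistics import NormalDist

Phi = NormalDist().cdf

def run_trial(path, nu=1.0, sigma=1.0, g=20, K=5, gamma=0.95, tau=None):
    """Apply efficacy stopping (Pr(theta > 0 | y_j) > gamma) and, optionally,
    futility stopping (Pr(theta > 0 | y_j) < tau) to one outcome sequence.
    Returns True if H0 is rejected."""
    total = 0.0
    for j in range(1, K + 1):
        total += sum(path[(j - 1) * g: j * g])
        n = j * g
        v = 1.0 / (1.0 / nu**2 + n / sigma**2)
        p = Phi(v * total / sigma**2 / v**0.5)   # Pr(theta > 0 | y_j)
        if p > gamma:
            return True                # stop and reject for efficacy
        if tau is not None and p < tau:
            return False               # stop for futility
    return False

# Common random numbers: the same null (theta = 0) outcome paths are fed to
# the efficacy-only design and the efficacy-plus-futility design.
rng = random.Random(5)
S, n_max = 4000, 100
paths = [[rng.gauss(0.0, 1.0) for _ in range(n_max)] for _ in range(S)]
t1_eff = sum(run_trial(pth) for pth in paths) / S
t1_both = sum(run_trial(pth, tau=0.10) for pth in paths) / S
```

On each shared path, a futility stop can only remove a rejection that would otherwise have occurred later, so the estimated type I error rate with futility stopping is never larger than without it.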
For example, for the single-arm trial in Section 1, one may test
\[
H_0 : \theta = 0 \quad \text{vs.} \quad H_1 : \theta \ne 0. \tag{12}
\]
There have been several criticisms of testing a point null hypothesis (Berger and Sellke, 1987), such as the plausibility of θ being equal to 0 exactly. As a result, we have focused on a one-sided test with a composite null hypothesis (Equation 1). Most of our discussions are still applicable to tests like Equation (12), although from a Bayesian hypothesis testing perspective, the prior for θ should include a discrete mass at the location indicated by the point hypothesis. From a frequentist perspective, the issue of type I error rate inflation (or multiplicity) can arise from repeatedly testing a single hypothesis over time, or from testing multiple hypotheses simultaneously (Simon, 1994). From a subjective Bayesian perspective, however, repeated hypothesis testing is not a problem (see Section 3.1.2), and multiplicity adjustments are needed only when there are multiple tests. It is worth noting that frequentist and Bayesian philosophies on multiple testing are also quite different (Berry and Hochberg, 1999; Sjölander and Vansteelandt, 2019).
Appendix A. We present more details about the calibrated Bayesian approach described in Section 3.1.3. We consider the setup of an infinite series of single-arm trials (described in Section 1) with true but unknown treatment effects θ(1), θ(2), ... ∼ π0(θ). For each trial, patient outcomes yK ∼ f0(yK | θ) are observed sequentially. The Bayesian design in Section 3.1 is applied to every trial with a prior model π(θ), a sampling model f(yK | θ), and threshold values {γ1, ..., γK}. We are interested in the operating characteristics of the Bayesian design over this infinite series of trials, in particular its FDR and FPR.
A.1. Background. We first provide more background on the calibrated Bayesian approach.
Rubin (1984) called a statistical procedure (conservatively) calibrated if the resulting probability statements (at least) have their asserted coverage in repeated practices. Clearly, calibrated procedures are desirable, and Rubin recommended examining operating characteristics to select calibrated Bayesian procedures. Rubin's points were echoed by Little (2006). The following discussion is adapted from Rubin (1984). A Bayesian procedure is calibrated if the model specification is correct, that is, if f(yK | θ)π(θ) = f0(yK | θ)π0(θ). For example, suppose that I(yK) is a 95% credible interval for θ under the model f(yK | θ)π(θ); then
\[
\frac{\int_{\theta \in I(y_K)} f(y_K \mid \theta)\, \pi(\theta)\, d\theta}{\int_\theta f(y_K \mid \theta)\, \pi(\theta)\, d\theta}
= \frac{\int_{\theta \in I(y_K)} f_0(y_K \mid \theta)\, \pi_0(\theta)\, d\theta}{\int_\theta f_0(y_K \mid \theta)\, \pi_0(\theta)\, d\theta}
= 0.95.
\]
The interpretation is that, among the possible θ values from π0(θ) that might have generated the observed yK from f0(yK | θ), 95% of them belong to I(yK). Therefore, when the procedure of calculating I(yK) from f(yK | θ)π(θ) is repeatedly applied to data drawn from f0(yK | θ)π0(θ), 95% of the calculated credible intervals will cover the true parameter values. We see that posterior probabilities correspond to frequencies of actual events. Similarly, when we claim Pr(θ > 0 | yK) > 0.95, it means that among the possible θ values that might have generated yK, more than 95% are positive. Rubin (1984) and Rosenbaum and Rubin (1984) also demonstrated that when the model specification is correct, the coverage and interpretation of Bayesian statements remain valid under data-dependent stopping rules. For example, if we conclude Pr(θ > 0 | yj) > 0.95 at any interim analysis j, it means that more than 95% of the possible θ values that might have generated yj are positive, even if the trial is optionally stopped at analysis j based on the observed data. Of course, in the presence of model misspecification, the coverage of Bayesian statements is not warranted.
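Rubin's calibration property can be checked by simulation: when θ is truly drawn from the analysis prior, 95% credible intervals computed at a data-dependent stopping time should still cover θ about 95% of the time. A minimal sketch with illustrative settings:

```python
import random
from statistics import NormalDist

def coverage_with_stopping(S=3000, n_max=100, K=10, gamma=0.95,
                           nu=1.0, sigma=1.0, seed=9):
    """Coverage of 95% credible intervals computed at the (possibly early)
    stopping time, when theta is truly drawn from the analysis prior
    N(0, nu^2) and the sampling model is correct."""
    rng = random.Random(seed)
    g = n_max // K
    covered = 0
    for _ in range(S):
        theta = rng.gauss(0.0, nu)
        total = 0.0
        for j in range(1, K + 1):
            for _ in range(g):
                total += rng.gauss(theta, sigma)
            n = j * g
            v = 1.0 / (1.0 / nu**2 + n / sigma**2)
            m = v * total / sigma**2
            if NormalDist().cdf(m / v**0.5) > gamma:
                break                  # data-dependent early stop
        post = NormalDist(m, v**0.5)   # posterior at the stopping time
        covered += post.inv_cdf(0.025) < theta < post.inv_cdf(0.975)
    return covered / S

cov = coverage_with_stopping()
```

Rerunning the same sketch with a mismatched prior (e.g., generating θ from a much tighter distribution than the analysis prior) degrades the coverage, mirroring the misspecification results in the numerical studies.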
In particular, Rubin (1984) and Rosenbaum and Rubin (1984) noted that data-dependent stopping rules increase the sensitivity of Bayesian inference to the model specification. Therefore, especially for sequential trial designs, one might want to examine their operating characteristics for a range of plausible f0(yK | θ)π0(θ) (which may deviate from f(yK | θ)π(θ)) to select appropriate design parameters.
A.2. The False Discovery Rate. We show that the FDR is upper bounded if f(yK | θ)π(θ) = f0(yK | θ)π0(θ). Note that if yK ∈ Γ, then Pr(θ > 0 | yK) > γmin, where γmin = min_j γj. This is because for every j ∈ {1, ..., K},
\[
\Pr(\theta > 0 \mid y_j) = \int \Pr(\theta > 0 \mid y_j, y_{j,K})\, f(y_{j,K} \mid y_j)\, dy_{j,K},
\]
where y_{j,K} = (y_{n_j+1}, ..., y_{n_K}). If Pr(θ > 0 | yK) = Pr(θ > 0 | yj, y_{j,K}) ≤ γmin, then Pr(θ > 0 | yj) ≤ γmin for every j, which contradicts yK ∈ Γ. Therefore,
\[
\mathrm{FDR}
= \frac{\int_{y_K \in \Gamma} \int_{\theta \le 0} f_0(y_K \mid \theta)\, \pi_0(\theta)\, d\theta\, dy_K}{\int_{y_K \in \Gamma} f_0(y_K)\, dy_K}
= \frac{\int_{y_K \in \Gamma} \int_{\theta \le 0} f(y_K \mid \theta)\, \pi(\theta)\, d\theta\, dy_K}{\int_{y_K \in \Gamma} f(y_K)\, dy_K}
= \frac{\int_{y_K \in \Gamma} \Pr(\theta \le 0 \mid y_K)\, f(y_K)\, dy_K}{\int_{y_K \in \Gamma} f(y_K)\, dy_K}
\le 1 - \gamma_{\min}.
\]
A.3. The False Positive Rate. To derive the upper bound of the FPR when f(yK | θ)π(θ) = f0(yK | θ)π0(θ), we first introduce an inequality under the Bayesian hypothesis testing framework (Section 3.1.5). Assume θ | H0 ∼ π(0)(θ), θ | H1 ∼ π(1)(θ), and write f(yj | Hm) = ∫θ f(yj | θ)π(m)(θ)dθ for j = 1, ..., K and m = 0, 1. Then, the following inequality holds for any 0 < ε < 1 (Hendriksen et al., 2021):
\[
\Pr\!\left( \exists j \in \{1, \ldots, K\} : \frac{f(y_j \mid H_1)}{f(y_j \mid H_0)} \ge \frac{1}{\epsilon} \;\middle|\; H_0 \right) \le \epsilon,
\]
where Pr(· | H0) = ∫θ Pr(· | θ)π(0)(θ)dθ. This is referred to as a universal bound on the probability of observing misleading evidence (Royall, 2000; Sanborn and Hills, 2014). In our application, instead of specifying the priors for θ separately under H0 and H1, a single prior for θ is specified over the entire parameter space, θ ∼ π(θ).
Still, the universal bound is applicable, because θ ∼ π(θ) is equivalent to
\[
\Pr(H_0) = \int_{\theta \le 0} \pi(\theta)\,d\theta, \qquad \theta \mid H_0 \sim \pi(\theta \mid \theta \le 0) = \frac{\pi(\theta) \cdot 1(\theta \le 0)}{\int_{\theta \le 0} \pi(\theta)\,d\theta},
\]
\[
\Pr(H_1) = \int_{\theta > 0} \pi(\theta)\,d\theta, \qquad \theta \mid H_1 \sim \pi(\theta \mid \theta > 0) = \frac{\pi(\theta) \cdot 1(\theta > 0)}{\int_{\theta > 0} \pi(\theta)\,d\theta}.
\]
Also, Pr(θ > 0 | y_j) = Pr(H_1 | y_j) > γ_j is equivalent to
\[
\frac{f(y_j \mid H_1)}{f(y_j \mid H_0)} > \frac{\gamma_j \cdot \int_{\theta \le 0} \pi(\theta)\,d\theta}{(1 - \gamma_j) \cdot \int_{\theta > 0} \pi(\theta)\,d\theta}.
\]
Applying the universal bound and noting that f(y_K | θ)π(θ) = f_0(y_K | θ)π_0(θ), we have
\[
\mathrm{FPR} = \frac{\int_{y_K \in \Gamma} \int_{\theta \le 0} f_0(y_K \mid \theta)\,\pi_0(\theta)\,d\theta\,dy_K}{\int_{\theta \le 0} \pi_0(\theta)\,d\theta} = \int_{\theta} f(y_K \in \Gamma \mid \theta)\,\pi(\theta \mid \theta \le 0)\,d\theta \le \Pr\left( \exists j \in \{1, \ldots, K\} : \frac{f(y_j \mid H_1)}{f(y_j \mid H_0)} > \frac{\gamma_{\min} \cdot \int_{\theta \le 0} \pi(\theta)\,d\theta}{(1 - \gamma_{\min}) \cdot \int_{\theta > 0} \pi(\theta)\,d\theta} \;\middle|\; H_0 \right) \le \frac{(1 - \gamma_{\min}) \cdot \int_{\theta > 0} \pi(\theta)\,d\theta}{\gamma_{\min} \cdot \int_{\theta \le 0} \pi(\theta)\,d\theta}.
\]

References

Package 'gsDesign'
Interim analysis in clinical trials
Repeated significance tests on accumulating data
The interplay of Bayesian and frequentist analysis
Statistical Decision Theory: Foundations, Concepts, and Methods
Testing a point null hypothesis: the irreconcilability of P values and evidence
The Likelihood Principle
Interim analyses in clinical trials: classical vs. Bayesian approaches
Interim analysis in clinical trials: the role of the likelihood principle
Bayesian clinical trials
One-sided sequential stopping boundaries for clinical trials: a decision-theoretic approach
Bayesian perspectives on multiple comparisons
Bayesian Adaptive Methods for Clinical Trials
On the foundations of statistical inference
Sampling and Bayes' inference in scientific modelling and robustness
A Bayesian test of some classical hypotheses-with applications to sequential clinical trials
Sequential trials, sequential analysis and the likelihood principle
Theoretical Statistics
Theory of Probability: A Critical Introductory Treatment
Optimal Statistical Decisions
Bayesian predictive approach to interim monitoring in clinical trials
Bayesian statistical inference for psychological research
Bayesian evaluation of group sequential clinical trial designs
What does the proof of Birnbaum's theorem prove?
Guidance for the use of Bayesian statistics in medical device clinical trials
Adaptive designs for clinical trials of drugs and biologics: Guidance for industry
Comparison of Bayesian with group sequential methods for monitoring clinical trials
The what, why and how of Bayesian clinical trials monitoring
A new proof of the likelihood principle
Differences among noninformative stopping rules are often relevant to Bayesian decisions
Two Principles of Evidence and Their Implications for the Philosophy of Scientific Method
Interim analyses in randomized clinical trials: ramifications and guidelines for practitioners
Bayesian Data Analysis
Subjective Bayesian analysis: principles and practice
Continuous learning from data: no multiplicities from computing and using Bayesian posterior probabilities as often as desired
p-values and type I errors are not the probabilities we need
Bayesian interim analysis of phase II cancer clinical trials
Optional stopping with Bayes factors: a categorization and extension of folklore results, with an application to invariant situations
An invariant form for the prior probability in estimation problems
Statistical approaches to interim monitoring of medical trials: a review and commentary
Group Sequential Methods with Applications to Clinical Trials
Bayesian design of single-arm phase II clinical trials with continuous monitoring
On the use of non-local prior densities in Bayesian hypothesis tests
Confidence intervals following group sequential tests in clinical trials
Design and analysis of group sequential tests based on the type I error spending rate function
Discrete sequential boundaries for clinical trials
Stochastically curtailed tests in long-term clinical trials
A predictive probability design for phase II cancer clinical trials
Group sequential clinical trials: a classical evaluation of Bayesian decision-theoretic designs
Calibrated Bayes: a Bayes/frequentist roadmap
On the Birnbaum argument for the strong likelihood principle
Simulation-based sequential Bayesian design
A multiple testing procedure for clinical trials
Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favor of the null hypothesis
A note on recent criticisms to Birnbaum's theorem
Group sequential methods in the design and analysis of clinical trials
Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine
The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation
Conditioning, likelihood, and coherence: a review of some foundational concepts
What properties might statistical inferences reasonably be expected to have? Crisis and resolution in statistical inference
Sensitivity of Bayes inference with data-dependent stopping rules
A Bayesian group sequential design for a multiple arm randomized clinical trial
Exact confidence intervals following a group sequential trial: a comparison of methods
Stochastic Processes
Statistical Evidence: A Likelihood Paradigm
On the probability of observing misleading statistical evidence
Bayesianly justifiable and relevant frequency calculations for the applied statistician
Do we need to adjust for interim analyses in a Bayesian adaptive trial design?
The frequentist implications of optional stopping on Bayesian hypothesis tests
The utility of Bayesian predictive probabilities for interim monitoring of clinical trials
Control of type I error rates in Bayesian sequential designs
Problems of multiplicity in clinical trials
Frequentist versus Bayesian approaches to multiple testing
Two-sample repeated significance tests based on the modified Wilcoxon statistic
Assessment of futility in clinical trials
Bayesian approaches to randomized trials
Decision theoretic designs for phase II clinical trials with multiple outcomes
Comparison of Bayesian and frequentist group-sequential clinical trial designs
The positive false discovery rate: a Bayesian interpretation and the q-value
Practical Bayesian guidelines for phase IIB clinical trials
Exact confidence intervals following a group sequential test
Bayesian designs and the control of frequentist characteristics: a practical solution
A practical solution to the pervasive problems of p values
On the bias of maximum likelihood estimation following a sequential test
The Design and Analysis of Sequential Clinical Trials
The use of local and nonlocal priors in Bayesian test-based monitoring for single-arm phase II clinical trials
A Bayesian sequential design using alpha spending function to control type I error