title: Adaptive treatment allocation and selection in multi-arm clinical trials: a Bayesian perspective
authors: Arjas, Elja; Gasbarra, Dario
date: 2021-04-07

Clinical trials are an instrument for making informed decisions based on evidence from well-designed experiments. Here we consider adaptive designs mainly from the perspective of multi-arm Phase II clinical trials, in which one or more experimental treatments are compared to a control. Treatment allocation of individual trial participants is assumed to take place according to a fixed block randomization, albeit with an important twist: The performance of each treatment arm is assessed after every measured outcome, in terms of the posterior distribution of a corresponding model parameter. Different treatment arms are then compared to each other, according to pre-defined criteria and using the joint posterior as the basis for such assessment. If a treatment is found to be sufficiently clearly inferior to the currently best candidate, it can be closed off either temporarily or permanently from further participant accrual. The latter possibility provides a method for adaptive treatment selection, including early stopping of the trial. The main development in the paper is in terms of binary outcomes, but some extensions, notably for handling time-to-event data, are discussed as well. The presentation is to a large extent comparative and expository.

From the earliest contributions to the present day, the statistical methodology for designing and executing clinical trials has been dominated by frequentist ideas, most notably, on testing a precise hypothesis of "no effect difference" against an alternative, using a fixed sample size, and applying a pre-specified significance level to control for Type 1 error, as a means to guard against false positives in the long term. An important drawback of this basic form of the standard methodology is that the design does not include the possibility of interim analyses during the trial. Particularly in exploratory studies during Phase II, aimed at finding effective treatments from among a number of experimental candidates, it is natural to look for extended designs that allow the execution of the trial to be modified based on the results from interim analyses. For example, such results could provide reasons for terminating the accrual of additional patients to some treatments for lack of efficacy or, if the opposite is true, for allocating more patients to the treatments that turned out more successful. Allowing for earlier dissemination of such findings may then also benefit the patient population at large. These motivations have led to the development of a whole spectrum of adaptive trial designs, and of corresponding methods for the statistical analysis of such data. An authoritative presentation of group sequential methods is provided in the monograph Jennison and Turnbull (1999). More general reviews of adaptive clinical trial designs, from the perspective of classical inference, can be found in, e.g., Chow and Chang (2008), Mahajan and Gupta (2010), Chow (2014), Chang and Balser (2016), Pallmann et al. (2018) and Atkinson and Biswas (2019). While such adaptive designs allow for greater flexibility in the running of actual trials, their assessment is usually based on selected frequentist performance measures.
In the standard version, interim analyses are planned before the trial is started, and need then to be accounted for, due to the consequent multiple testing, in computing the probability of Type 1 error. Although such rigid form of planning can be relaxed when employing the so-called alpha spending functions (e.g., Pocock (1977) , O'Brien and Fleming (1979) , Demets and Lan (1994) ), looking into the data before reaching the pre-planned end of the trial carries a cost either in terms of an inflated probability of Type 1 error or, if that is fixed, in a reduced power of the test to detect meaningful differences between the considered treatments. These classical approaches in the design and execution of clinical trials have been challenged from both foundational and practical perspectives. Important early contributions include, e.g., Thompson (1933) , Flühler et al. (1983) , Berry (1985) , Spiegelhalter et al. (1986) , Berger and Berry (1988) , Spiegelhalter et al. (1994) and Thall and Simon (1994) ; for a brief historical account and a large number of references, see Grieve (2016) . Comprehensive expositions of the topic are provided in the monographs Spiegelhalter et al. (2004) , and Yuan et al. (2017) . The key argument here is the change of focus: instead of guarding against false positives in a series of trials in long term, the main aim is to utilize the full information potential in the observed data from the ongoing trial itself. Then, looking into the data in interim analyses is not viewed as something incurring a cost, but rather, as providing an opportunity to act more wisely. The foundational arguments enabling this change are provided by the adoption of the likelihood principle, e.g., Berger and Wolpert (1984) . In practice, this also implies a change of the inferential paradigm, from frequentist into Bayesian. In Bayesian inference, the conditional (posterior) distribution for unknown model parameters is being updated based on the available data, via updates of the corresponding likelihood. In a clinical trial, it is even possible to continuously monitor the outcome data as they are observed, and thereby utilize such data in a fully adaptive fashion during the execution of the trial. The advantages of this approach are summarized neatly in the short review paper Berry (2006) , in , Lee and Chu (2012) , and more recently, in Yin et al. (2017) , Ruberg et al. (2019) and Giovagnoli (2021) . The paper Villar et al. (2015) contains a useful review of the theoretical background, connecting the theory of the optimal design of clinical trials with that of multi-armed bandit problems. Unfortunately, general results on optimal strategies are largely lacking and their application in practice often infeasible because of computational complexity; however, see Press 2009 . Recently, simulation based approximations have been used for applying Bayesian decision theory in the clinical trials context (e.g., Müller et al. (2017) , Yuan et al. (2017) , Alban et al. (2018) ). Importantly, the posterior probabilities provide intuitively meaningful and directly interpretable answers to questions concerning the mutual comparison of different treatments, given the available evidence, and do so without needing reference to concepts such as sampling distribution of a test statistic under given hypothetical circumstances. Here we consider adaptive designs mainly from the perspective of multi-arm Phase II clinical trials, in which one or more experimental treatments are compared to a control. 
However, the same ideas can be applied, essentially without change, in confirmatory Phase III trials, where only a single experimental treatment is compared to a control, but the planned size of the trial is larger. In both situations, treatment allocation of individual trial participants is assumed to take place according to a fixed block randomization, albeit with an important twist: The performance of each treatment arm is assessed after every measured outcome in terms of the posterior distribution of a corresponding model parameter. Different treatment arms are then compared to each other according to pre-defined criteria. If a treatment arm is found to be inferior in such a comparison to the others, it can be closed off either temporarily or permanently from further accrual.

Of the recent clinical trials literature, the papers by Villar et al. (2015) and Jacob et al. (2016) seem most closely related to our approach, although in different ways. In the latter part of Villar et al. (2015), the authors discuss and compare several adaptive strategies according to which patients can be allocated to different treatments in a multi-arm trial. Although the paper uses Bayesian inferential methods in parameter estimation, the final comparison between alternative methods is based on frequentist ideas and measures: testing of hypotheses, using fixed sample size and given significance level. In contrast to this, Jacob et al. (2016) introduces three dynamic rules for dropping inferior treatment arms during the trial; these rules are closely similar to our Rules 1 and 2 below. On the other hand, and unlike Villar et al. (2015), Jacob et al. (2016) does not explicitly consider the possibility of adaptive treatment allocation.

We consider first, in Section 2, the simple situation in which the outcomes are binary, and they can be observed soon after the treatment has been delivered. Section 3 reports results from corresponding simulation experiments, following closely the settings of two examples in Villar et al. (2015) but applying the adaptive methods presented in Section 2. In Section 4, the approach is extended to cover situations in which either binary outcomes are measured after a fixed time lag from the treatment, or the data consist of time-to-event measurements, with the possibility of right censoring. This section also includes some notes on vaccine efficacy trials. The paper concludes with a discussion in Section 5. The presentation is to a large extent comparative and expository, particularly in Sections 3 and 5. As a companion to this paper, we provide an implementation of the proposed method in the form of a freely available R package Marttila et al. (2021) that facilitates the simulation of clinical trials with adaptive treatment allocation.

2 The case of Bernoulli outcomes

2.1 An adaptive method for treatment allocation: Rule 1

As in the papers Villar et al. (2015) and Jacob et al. (2016), consider the 'prototype' example of a trial with binary outcomes and two types of treatments, one type representing a control or reference treatment indexed by 0, and K experimental treatments indexed by k, 1 ≤ k ≤ K. Motivated by a conditional exchangeability postulate between trial participants (with conditioning corresponding to their assignment to the different treatment arms), independent Bernoulli outcomes can in this case be assumed for all treatments, with respective response rates θ_0 and θ_1, θ_2, . . . , θ_K considered as model parameters.
We index the participants in their order of recruitment to the trial by i, 1 ≤ i ≤ N_max, where N_max is an assumed maximal size of the trial. If no such maximal size is specified, we choose N_max to be infinite. In this prototype version it is assumed that, for each i, the outcome Y_i from the treatment of patient i is observed soon after the treatment has been delivered. This assumption simplifies the consideration of adaptive designs, as the rule applied for deciding the treatment given to each participant can then directly account for information on such earlier outcomes. The meaning of 'soon' here should be understood relative to the accrual of participants to the trial. If the considered medical condition is rare in the background population, accrual will usually be slow with relatively long times between the arrivals. Then this requirement of outcome information being available when the next participant arrives may apply even if 'soon' is not literally true in chronological time. Extensions of this simple situation are considered in Section 4.

We assume that, before starting the trial, a sequential block randomization to the treatment arms 0, 1, ..., K has been performed. We index by n ≥ 1 the positions on that list, calling n the list index, and denote by r(n) the corresponding treatment arm. Thus, we have a fixed sequence ((r(1), r(2), ..., r(K+1)), (r(K+2), r(K+3), ..., r(2(K+1))), ...) of randomized blocks of length K + 1, where the blocks are independent random permutations of the treatment arm indexes 0, 1, ..., K. Assignment of the participants to the different treatment arms is now assumed to follow this list, but with the possibility of skipping a treatment arm in case it has been determined to be in the dormant state for the considered value of n. This leads to a balanced design in the sense that, as long as no treatment arms have been skipped by the time of considering list index n, the numbers of participants assigned to different treatments can differ from each other by at most 1, and they are equal when n is a multiple of K + 1.

Denote by I_{k,n} the binary indicator variable of arm k being in the active state at list index value n, n ≥ 0, 0 ≤ k ≤ K, and let I_n = (I_{0,n}, I_{1,n}, ..., I_{K,n}) be the corresponding activity state vector. The values of these vectors are determined in an inductive manner to be specified later. By inspection we find that, at the time a value n ≥ 1 of the list index is considered, altogether

N(n) = Σ_{m=1}^{n} I_{r(m),m−1}    (2.1)

trial participants have so far arrived and been assigned to some treatment. Clearly N(n) ≤ n. Let now the sequence {N^{−1}(i); i ≥ 1} be defined by

N^{−1}(i) = min{n ≥ 1 : N(n) = i}.    (2.2)

Then N^{−1}(i) is the value of the list index n at which participant i is assigned to a treatment, while A_i = r(N^{−1}(i)) is the index of the corresponding treatment arm. Having postulated independent Bernoulli outcomes with treatment arm specific parameters θ_k, 0 ≤ k ≤ K, we then get that Y_i is distributed according to Bernoulli(θ_{r(N^{−1}(i))}).

The distinction between active and dormant states is that no trial participants are assigned, at a value n of the list index, to a treatment arm r(n) if it is in the dormant state. Generally speaking, treatments whose performance in the trial has been poor, relative to the others, are more likely to be transferred into the dormant state. However, with more data, there may later turn out to be sufficient evidence for such a trial arm to be returned back to the active state.
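To make this list-index bookkeeping concrete, the following minimal Python sketch generates a block-randomized allocation list and walks through it while skipping dormant arms, returning for each participant i the list index N^{−1}(i) and the assigned arm A_i. It is an illustration only (not part of the companion R package), and the activity decision is left as a placeholder function standing in for Rule 1 or Rule 2.

```python
import random

def block_randomization(K, n_blocks, seed=1):
    """Fixed allocation list r(1), r(2), ...: independent random
    permutations of the arm indices 0, ..., K, concatenated."""
    rng = random.Random(seed)
    r = []
    for _ in range(n_blocks):
        block = list(range(K + 1))
        rng.shuffle(block)
        r.extend(block)
    return r

def walk_allocation_list(r, is_active):
    """Walk through the list r, skipping dormant arms.

    is_active(arm, assignments_so_far) plays the role of the indicator
    I_{r(n),n-1} appearing in (2.1); here it is a placeholder for the
    adaptive rules.  Returns triples (i, N^{-1}(i), A_i): participant
    index, the list index at which participant i was assigned, and arm.
    """
    assignments = []
    i = 0                                   # N(n): participants assigned so far
    for n, arm in enumerate(r, start=1):
        if is_active(arm, assignments):
            i += 1
            assignments.append((i, n, arm))  # N^{-1}(i) = n, A_i = arm
    return assignments

# Example: two experimental arms plus control; no arm is ever dormant,
# so the design stays balanced and N(n) = n for all n.
r = block_randomization(K=2, n_blocks=4)
print(walk_allocation_list(r, is_active=lambda arm, data: True)[:6])
```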
The data D_n that have accrued from the trial when it has proceeded up to list index value n consist of the values of the state indicators I_{k,m−1}, 0 ≤ k ≤ K, 1 ≤ m ≤ n, and of the treatments A_i and outcomes Y_i for i ≤ N(n). Next, we outline the inductive rule by which the values of the state vectors I_n = (I_{0,n}, I_{1,n}, ..., I_{K,n}) in a data sequence {D_n; n ≥ 1} are updated when the value of n is increased by 1. We write θ = (θ_0, θ_1, . . . , θ_K) and use, for clarity, boldface notation θ_k when the parameters are unknown and considered as random variables. Denote also θ_∨ = max{θ_0, θ_1, . . . , θ_K}. According to this rule, called Rule 1, for n ≥ 1 and if r(n) = k is an experimental treatment arm, we let I_{k,n} = 0 if P_π(θ_k = θ_∨ | D_n) < ε, and otherwise I_{k,n} = 1. Similarly, for the control arm r(n) = 0 we let I_{0,n} = 0 if P_π(θ_0 + δ ≥ θ_∨ | D_n) < ε, and otherwise I_{0,n} = 1. Here the threshold values ε > 0 and δ ≥ 0 are selected operating characteristics of the algorithm. A smaller value of ε reflects then a more conservative attitude towards moving a treatment into the dormant state. The value of δ can be viewed as specifying the minimal important difference (MID) or minimal clinically important difference (MCID) in the trial; if positive, it provides some extra protection to the control arm from being moved into the dormant state. At the beginning, for n = 0, the coordinates of I_0 = (I_{0,0}, I_{1,0}, ..., I_{K,0}) are determined in a similar fashion directly from the prior. In practice, the prior is never so strong that we would not have I_0 = (1, 1, ..., 1).

Rule 1 Adaptive method for treatment allocation.
    for k ← 1 to K (experimental treatment arms) do
        if P_π(θ_k = θ_∨ | D_n) < ε then I_{k,n} ← 0; else I_{k,n} ← 1; end
    end
    if P_π(θ_0 + δ ≥ θ_∨ | D_n) < ε then I_{0,n} ← 0; else I_{0,n} ← 1; end

As a byproduct, successive applications of Rule 1 give us an explicit expression for the likelihood L(θ|D_n) = Lik_n, n ≥ 1, arising from observing data D_n as specified above. According to this rule, the likelihood expression L(θ|D_n) is updated only at values of n at which I_{r(n),n−1} = 1, and then this is done by multiplying the previous value L(θ|D_{n−1}) by the factor θ_{r(n)}^{Y_{N(n)}} (1 − θ_{r(n)})^{1−Y_{N(n)}}. By repeatedly applying the chain multiplication rule for conditional probabilities, we get that

L(θ|D_n) = Π_{k=0}^{K} θ_k^{S_{k,n}} (1 − θ_k)^{F_{k,n}}.    (2.3)

The right hand side expression is obtained by re-arranging the terms and denoting by S_{k,n} and F_{k,n}, respectively, the number of successful and failed outcomes from treatment k when considering list index values up to n. Of intrinsic importance in this derivation is that, when conditioning sequentially at n on the data D_n, the criteria according to which the values of the indicators I_{k,n} are updated to I_{k,n+1} do not depend on the parameter θ. As a consequence, these updates do not contribute to the likelihood terms that would depend on θ. Different formulations of this result can be found in many places, e.g., Villar et al. (2015).

As a consequence we can change the focus from the full data {D_n, n ≥ 1}, indexed according to the original list indexes used for randomization, to "condensed" data {D*_i, i ≥ 1} indexed according to the order in which the participants were treated. We denote by S_k(i) and F_k(i), respectively, the number of successful and failed outcomes from treatment k when considering the first i participants. Let S(i) and F(i) be the corresponding total numbers of successes and failures, across all treatment arms. Following the usual practice in similar contexts, we assume that the unknown parameter values θ_0, θ_1, . . .
, θ_K have been assigned independent Beta-priors, with Beta(θ_k | α_k, β_k) for treatment arm k, where α_k and β_k are separately chosen hyperparameters. The choice of appropriate values of these hyperparameters (e.g., Thall and Simon (1994)) is always context specific, and is not discussed here further. Then, due to the well-known conjugacy property of the Beta-priors and the Bernoulli-type likelihood (2.3), the posterior p(θ_k | D*_i) for θ_k, corresponding to data D*_i, has the form of a Beta-distribution with its parameters updated directly from the data:

p(θ_k | D*_i) = Beta(θ_k | α_k + S_k(i), β_k + F_k(i)).

This, together with the product form of the likelihood (2.3) and the assumed independence of the priors π, allows then for an easy computation of the joint posterior distribution for (θ_0, θ_1, . . . , θ_K) for any i. The density p_π(θ_0, θ_1, . . . , θ_K | D*_i) becomes the product of K + 1 Beta-densities. For example, posterior probabilities of the form P_π(θ_k = θ_∨ | D_n), or posterior distributions for pairwise differences of the type θ_k − θ_0 or θ_k − θ_l, can be computed numerically, in practice either by numerical integration as in Jacob et al. (2016), or by performing Monte Carlo sampling from this distribution; see also Zaslavsky (2012). In our numerical examples in Section 3 we have applied this latter possibility.

While Rule 1 may at least temporarily inactivate some less successful treatment arms and thereby close them off from further accrual, this closure need not be final. As long as a treatment arm is in the dormant state, the posterior for the corresponding parameter θ_k remains fixed. In contrast, with the accrual of participants to active treatment arms still continuing, the posteriors for their parameters can be expected to become less and less dispersed. As a consequence, returns from dormant to active state tend to become increasingly rare.

Thompson's rule. Rule 1 has much similarity with Thompson's rule (Thompson (1933), see also, e.g., Thall et al. (2015), Villar et al. (2015)), and both can be viewed as particular versions of response-adaptive randomization (RAR) designs (Chow and Chang (2008)). In its standard version, Thompson's rule randomizes new patients to different treatment arms k, 0 ≤ k ≤ K, directly according to the posterior probabilities P_π(θ_k = θ_∨ | D_n), updating the values of these probabilities as described above. Fractional versions of Thompson's rule use probability weights for this purpose, based on powers P_π(θ_k = θ_∨ | D_n)^κ, with 0 ≤ κ ≤ 1, normalized into probabilities by dividing such terms by their sum over different values of k. Thus, for κ = 0, the randomization is symmetric to all K + 1 treatments, and its adaptive control mechanism becomes stronger with increasing κ. We return to considering Thompson's rule in Section 3, in the context of the simulation experiments described there.

While an open-ended recipe such as Rule 1 or Thompson's algorithm may seem attractive, for example, from the perspective of drawing increasingly accurate inferences on the response parameters, practical considerations will often justify incorporation of rules for more definitive selection of some treatments and elimination of others. This is the case if the continued availability of more than one experimental treatment alternative at a later point in time is judged to be impracticable, as when entering the study into Phase III.
Another reason is that incorporation of such decision rules enables us to make more direct comparisons to trial designs utilizing classical hypothesis testing ideas. With this in mind, we complement Rule 1 with an optional possibility to conclusively terminate the accrual of additional participants to the less successful treatment arms. Rule 2 below is an adaptation and extension of the corresponding definitions in, e.g., Thall and Wathen (2007), Xie et al. (2012) and Jacob et al. (2016). In the commonly adopted terminology of adaptive designs, Rule 2 can be said to be a combination of versions of response-adaptive randomization (RAR) and drop-the-losers designs (Chow and Chang (2008)). In the definition of the algorithm, the letter T is used as a generic notation for the set of treatment arms still left in the trial at the considered value of n. Each elimination of a treatment reduces its size by one. Rule 2 contains, as part, Rule 1 for moving treatments to the dormant state. It then involves, in addition to the operating characteristics ε and δ for Rule 1, three new parameters, viz. θ_low, ε_1 and ε_2. Specifying a value for θ_low means setting up a level of minimum required treatment response rate (MRT), e.g., Xie et al. (2012). A treatment k ∈ T is eliminated from the trial if the posterior probability for {θ_k > θ_low} falls below ε_1. The criteria for eliminating treatments are formally identical to those for moving them into the dormant state except that the bounds for the posterior probabilities then need to be tighter, ε_1 ≤ ε.

Rule 2 Adaptive rule for treatment allocation and selection.
    in this case (r(n) ∈ T) and (I_{r(n),n−1} = 1);
    for k ∈ T \ {0} (experimental treatment arms) do
        if P_π(θ_k ≥ θ_low | D_n) < ε_1 or P_π(θ_k = max_{ℓ∈T} θ_ℓ | D_n) < ε_2 then
            T ← T \ {k}; I_{k,n} ← 0; n_{k,last} ← n;

Notes. The state indicator I_{r(n),n} at list index value n depends on the recorded past trial history {D_m; 1 ≤ m ≤ n − 1}. However, given this history, it is conditionally independent of the model parameters θ = (θ_0, θ_1, . . . , θ_K). As was the case in Rule 1, for a given original block randomization, the likelihood expression arising from applying Rule 2 depends only on the outcome data. This property is crucially important from the perspective of being able to draw correct statistical inferences from the trial. But it is also important from the perspective of practical implementation. Having assumed the initial randomization {r(n) : n ≥ 1} to be fixed, no further randomization is needed when the trial is run since, at any point in time, the next move to be made will be fully determined by the observed past data. After every new observed outcome, the algorithm of Rule 2 determines the current state of each treatment arm, choosing between the three possible options: active, dormant, or dropped. All moves between these states are possible except that the dropped state is absorbing: once a treatment arm has been dropped, it will stay there. If an arm is in the dormant state, it is at least momentarily closed from further patient accrual.

Consider then the different actions based on Rule 2 in more detail. The posterior probabilities P_π(θ_k ≥ θ_low | D_n) for the experimental arms, and P_π(θ_0 + δ ≥ θ_low | D_n) for the control arm, express how likely it is, given the currently available data, that their response rate exceeds the pre-specified MRT θ_low. The first criterion in Rule 2 then says that if this probability is below a selected threshold value ε_1, the treatment arm is dropped from the trial.
The value of ε_1 can then be said to represent an acceptable risk level of error when concluding that {θ_k ≥ θ_low}, or {θ_0 + δ ≥ θ_low}, would not be true. This part of Rule 2 will obviously not be active if either θ_low = 0 or ε_1 = 0.

The second criterion in Rule 2 makes a comparison of the response rate of a treatment and that of the best treatment in the trial. Both values are unknown, and the comparison is made in terms of the posterior probabilities P_π(θ_k = max_{ℓ∈T} θ_ℓ | D_n) for the experimental arms and P_π(θ_0 + δ ≥ max_{ℓ∈T} θ_ℓ | D_n) for the control. Here T ⊂ {0, 1, ..., K} is the set of treatment arms left in the trial at time n. The composition of T is determined in an inductive manner, starting from T = {0, 1, ..., K} at n = 1. A treatment is dropped from the trial if the corresponding posterior probability falls below the selected threshold level ε_2. Thus, for small ε_2, the decision to drop an experimental treatment k is made if, in view of the currently available data D_n, the event {θ_k = max_{ℓ∈T} θ_ℓ} is true only with probability close to 0, with ε_2 representing the selected risk level. The control arm is protected even more strongly from inadvertent removal from the trial if a positive safety margin δ is employed; the comparison to experimental arms becomes symmetric if δ = 0. This entire mechanism of eliminating treatments based on mutual comparisons is inactivated by letting ε_2 = 0. One should note that, while Rule 1 is compatible with the likelihood principle, Rule 2 has an element which violates it. This is because, in multi-arm trials with K > 1, when considered at times n at which some treatment arms have already been dropped, the definition of the maximal response parameter value θ_∨ = max_{ℓ∈T} θ_ℓ ignores those indexed in {0, 1, ..., K} \ T. Sequential elimination of treatments, as embodied in Rule 2, while it has an obvious practical appeal in running a clinical trial, also renders properties such as standard Bayesian consistency inapplicable.

The third criterion of Rule 2 copies Rule 1: An experimental treatment arm k ∈ T is made dormant if P_π(θ_k = max_{ℓ∈T} θ_ℓ | D_n) < ε, and the control arm if P_π(θ_0 + δ ≥ max_{ℓ∈T} θ_ℓ | D_n) < ε, where ε is a selected threshold. For this part of Rule 2 to function in a nontrivial way, we need to choose ε > ε_1 and ε > ε_2. If either ε = ε_1 or ε = ε_2, then the possibility of a treatment arm being moved into the dormant state is ruled out, and if ε_1 = ε_2 = 0, then Rule 2 is easily seen to collapse into the simpler Rule 1. Finally, if also ε = 0, then treatment allocation will follow directly the original block randomization, which was assumed to be symmetric between all treatment arms, and no treatments are dropped before reaching N_max.

The selection of appropriate threshold values δ and θ_low in Rule 1 and Rule 2 should be based on substantive contextual arguments in the trial. If a positive value for δ is specified, then, as already mentioned in the context of Rule 1, this is commonly viewed as the minimal clinically important difference (MCID) in the trial. Employing such a positive threshold value when comparing the response rate of the control arm to that of an experimental arm, and not doing so when comparing two experimental arms to each other, reflects the idea that the design should be more conservative towards moving the control arm to the dormant state, let alone dropping it conclusively from the trial, than when contemplating a similar move for an experimental treatment.
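To make the mechanics of these rules concrete, the following Python sketch estimates the posterior probabilities appearing in Rules 1 and 2 by Monte Carlo sampling from the independent Beta posteriors of Section 2.1 and classifies each arm as active, dormant or dropped. It is an illustration only (not the companion R package; all names are ours), and for simplicity it treats all arms passed to it as belonging to T, i.e., it ignores arms dropped at earlier interim looks.

```python
import numpy as np

def arm_states(successes, failures, alpha=None, beta=None,
               eps=0.1, eps1=0.0, eps2=0.05, delta=0.1, theta_low=0.0,
               n_draws=100_000, seed=1):
    """Monte Carlo evaluation of the Rule 1 / Rule 2 criteria (sketch).

    successes[k], failures[k] are S_k and F_k for arms k = 0 (control),
    1, ..., K.  Independent Beta(alpha_k + S_k, beta_k + F_k) posteriors
    are sampled, and each arm is classified as 'active', 'dormant' or
    'dropped'.  Arms already dropped earlier are not handled here.
    """
    rng = np.random.default_rng(seed)
    K1 = len(successes)
    alpha = np.ones(K1) if alpha is None else np.asarray(alpha, float)
    beta = np.ones(K1) if beta is None else np.asarray(beta, float)
    draws = rng.beta(alpha + np.asarray(successes),
                     beta + np.asarray(failures), size=(n_draws, K1))
    theta_max = draws.max(axis=1)

    states = []
    for k in range(K1):
        # the control arm (k = 0) is shifted by the safety margin delta
        value = draws[:, k] + (delta if k == 0 else 0.0)
        p_best = np.mean(value >= theta_max)         # comparison criterion
        p_above_mrt = np.mean(value >= theta_low)    # MRT criterion
        if p_above_mrt < eps1 or p_best < eps2:
            states.append("dropped")
        elif p_best < eps:
            states.append("dormant")
        else:
            states.append("active")
    return states

# Example: control with 10/30 successes, two experimental arms.
print(arm_states(successes=[10, 12, 20], failures=[20, 18, 10]))
```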
Once selected, the design parameters ε, ε 1 and ε 2 in applying Rule 2, and then deciding to either drop the treatment or putting it into the dormant state, can be interpreted directly as upper bounds for the risk that this decision was in fact unwarranted. By risk is here meant the posterior probability of error, each time conditioned on the current data actually observed. Suppose, for example, that a finite value for n k,last has been established due to P π θ k ≥ θ low D n k,last < ε 1 . Further accrual of trial participants to treatment arm k is then stopped after the patient indexed by N k,last because the response rate θ k from that arm is judged, with only a small probability ≤ ε 1 , given the data, to be above the MRT level θ low . If all experimental treatments have been dropped as a result of applying Rule 2, the trial ends with a negative result, futility, e.g. Thall and Wathen (2007) . On the other hand, if the control arm has been dropped, at least one of the experimental arms was deemed better than the control, which is a positive finding. In case more than two experimental arms were left at that time, the trial design may allow for a continued application of Rule 2, with the goal of ultimately identifying the one with the highest response rate. As remarked earlier, the application of Rule 2 is optional. If it is not enforced, Rule 1 is open ended and will only control the assignment of new participants to the different treatments. Then, if the trial size N max has been specified and fixed in advance, and regardless of whether Rule 1 was previously employed or not, the posterior probabilities P π (θ k ≥ θ low |D * Nmax ), P π (θ 0 + δ ≥ θ ∨ |D * Nmax ) and P π (θ k = θ ∨ |D * Nmax ) can be computed routinely after all outcome data D * Nmax have been observed, to provide the final assessment of the results from the trial. A frequentist perspective. A different perspective to the application of Rule 2 is offered by the classical frequentist theory of statistical hypothesis testing. While the main point of this paper is to argue in favor of reasoning directly based on posterior inferences, this may not be sufficient to satisfy stake holders external to the study itself, including the relevant regulatory authorities in question, which may be concerned about frequentist measures such as the overall Type 1 error rate at a pre-specified significance level (Chow and Chang (2008) ). From a frequentist point of view, the posterior probabilities P π (θ 0 + δ ≥ θ ∨ |D n ) and P π (θ k = θ ∨ |D n ), via their dependence on the data D n , can be viewed as test statistics in respective sequential testing problems, with Rule 2 defining the stopping boundaries. In the case K = 1, they correspond to considering two overlapping hypotheses (e.g., Lewis and Berry (1994) ), null hypothesis H 0 : θ 1 ≤ θ 0 + δ and its alternative H 1 : θ 1 ≥ θ 0 . For K ≥ 1, the null hypothesis becomes H 0 : θ ∨ ≤ θ 0 + δ, and the alternative H 1 : θ ∨ ≥ θ 0 . The posterior probabilities P π (θ 0 + δ ≥ θ ∨ |D n ) can then be used as test statistics in testing H 0 , and P π (θ ∨ ≥ θ 0 |D n ) for testing H 1 . The size of the test depends on the hypothesized "true" values of the response parameters θ = (θ 0 , θ 1 , . . . , θ K ), on the selected threshold values δ, θ low , ε, ε 1 , ε 2 and, if specified in advance, on the maximal size N max of the trial. 
For clarity, we denote such a hypothesized distribution generating the data by Q, distinct from the mixture distribution P_π used, after being conditioned on current data, in applying Rule 1 and Rule 2. Frequentist measures such as true and false positive and negative rates, characterizing the performance of a test, can be computed numerically to a good approximation by performing a sufficiently large number of forward simulations from the selected Q and then averaging the sampled values. Such a consideration is, however, essentially only needed at the design stage when the trial design needs to be approved and no outcome data are yet available. When the trial is then run, it is natural to utilize, at each time n, the currently available data D_n and the consequent posterior probabilities such as P_π(θ_0 + δ ≥ θ_∨|D_n), P_π(θ_k = θ_∨|D_n) and P_π(θ_k ≥ θ_low|D_n). In this context it may be useful to recall the well known result from general decision theory: for any prior, the smallest Bayes risk is achieved by minimizing "pointwise" the expected loss with respect to the posterior. We return to considering the frequentist measures in the next section, in connection with Experiments 1 and 2.

The Beta(α_k, β_k)-prior can be interpreted as representing prior information equivalent to α_k successes and β_k failures from a treatment arm before initiating the trial. If the selected values α_k and β_k for some particular treatment arm k are such that their sum α_k + β_k is larger than that of the others, say α_l + β_l, it may be a good idea to postpone the application of Rule 1 on arm k from the start of the trial, and use it to assign the first participants to those other arms l until the sum α_l + β_l + S_l(i) + F_l(i) reaches the level of α_k + β_k. Intuitively speaking, treatments are then compared to each other only after the joint posterior is based on the same number of (pseudo)observations from all arms.

3.1 Simulation studies with a 2-arm trial: Experiment 1

Our first simulation experiment mimics the setting of the two-arm trial described in Villar et al. (2015), Section 5.1. In this comparison of a single experimental treatment to a control, the hypothesis H_0: θ_1 ≤ θ_0 was tested against the alternative H_1: θ_1 > θ_0 by using Fisher's exact test at the significance level of α = 0.05 for Type 1 error. Two alternative parameter settings were considered in the simulations leading to the numerical results shown in Table 5 of Villar et al. (2015), with the Type 1 error rate computed at parameter values θ_0 = θ_1 = 0.3, henceforth denoted by Q_null, and the power of rejecting H_0 computed at θ_0 = 0.3, θ_1 = 0.5, denoted by Q_alt. The simulations and the tests were based on fixed trial size N_max = 148. In the present approach, instead of first collecting all planned outcome data and then performing a test at a given level of significance, the trial would be run in an adaptive manner, continuously updating the posterior probabilities specified in Rule 1 and/or Rule 2, and then proceeding in an inductive manner according to these rules. We considered three different settings of the design parameters for Rule 1: (a) δ = 0.1, ε = 0.1, (b) δ = 0.1, ε = 0.05, and (c) δ = 0.05, ε = 0.2. Note that larger values for δ and smaller values for ε correspond to a higher degree of conservatism towards moving a treatment arm from active to dormant state, and conversely. The choice (b) is therefore more conservative than (a), while (c) is more liberal.
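As a concrete illustration of how a run of this kind can be generated, the following Python sketch simulates a single two-arm trial under Q_alt, applying Rule 1 with the thresholds of setting (a), uniform Beta(1, 1) priors, and Monte Carlo evaluation of the posterior probabilities. It is a simplified stand-in for the companion R package, and all function and variable names are ours.

```python
import numpy as np

def simulate_two_arm_trial(theta=(0.3, 0.5), eps=0.1, delta=0.1,
                           n_max=500, n_draws=20_000, seed=7):
    """One simulated run of a two-arm trial under Rule 1, setting (a).

    Arm 0 is the control, arm 1 the experimental treatment; uniform
    Beta(1, 1) priors are used for both response rates.  Returns the
    numbers of successes and failures per arm.  Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    S = np.zeros(2)                    # successes per arm
    F = np.zeros(2)                    # failures per arm
    active = np.array([True, True])    # activity indicators I_0, I_1
    i = 0
    while i < n_max:
        for arm in rng.permutation(2):           # one block of size K + 1 = 2
            if i >= n_max:
                break
            if not active[arm]:
                continue                         # dormant arm: skip this slot
            y = rng.random() < theta[arm]        # Bernoulli outcome
            S[arm] += y
            F[arm] += 1 - y
            i += 1
            # update posteriors and activity indicators (Rule 1)
            draws = rng.beta(1 + S, 1 + F, size=(n_draws, 2))
            p_ctrl = np.mean(draws[:, 0] + delta >= draws[:, 1])
            p_exp = np.mean(draws[:, 1] >= draws[:, 0])
            active[:] = [p_ctrl >= eps, p_exp >= eps]
    return S, F

S, F = simulate_two_arm_trial()
print("patients per arm:", S + F, "successes per arm:", S)
```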
As an illustration of the workings of Rule 1, we performed an experiment emulating a real trial with maximal size N_max = 500, using a single realization generated from Q_alt and applying Rule 1 for treatment allocation with thresholds (a). For this, we considered the values of the list index n at which a new patient was assigned to either of the two treatments, i.e., N(n) − N(n−1) = 1, thereby skipping an index if it corresponded to an arm in the dormant state. For such values of n and until N(n) = 500, we monitored in Figure 1 the development of the posterior probabilities P_π(θ_0 + δ ≥ θ_∨|D_n) = P_π(θ_0 + δ ≥ θ_1|D_n) and P_π(θ_1 = θ_∨|D_n) = P_π(θ_1 ≥ θ_0|D_n), of the posterior expectations E_π(θ_0|D_n) and E_π(θ_1|D_n), and of the activity indicators I_{0,n} and I_{1,n}. Note that these functions depend only on the corresponding "condensed" simulated data {D*_i, 1 ≤ i ≤ 500}. According to Rule 1 (a), the control arm is in the dormant state for patient i if P_π(θ_0 + 0.1 ≥ θ_1|D*_i) < 0.1. The values of i for which this was the case in the considered simulation are shown in Figure 1 in grey color. For such i, no new patients were assigned to the control treatment, and therefore the corresponding cumulative sum of activity indicators I_{0,n} and the posterior expectation E_π(θ_0|D*_i) remained constant. In contrast, the experimental arm was active during the entire follow-up due to all posterior probabilities P_π(θ_1 ≥ θ_0|D*_i) staying above the threshold ε = 0.1. Had also Rule 2 been applied, say, with threshold values ε_1 = 0 and ε_2 = 0.05, the control arm would have been dropped from the trial at the first i for which P_π(θ_0 + 0.1 ≥ θ_1|D*_i) < 0.05. In the considered simulation this happened at i = 365. In Figure 1 this is indicated in dark grey.

[Figure 1, bottom panel: Cumulative sums of activity indicators of the treatment arms and the cumulative number of treatment successes. For more details, see text.]

Next, we study the effect of the choice of the design parameters ε and δ in Rule 1 on some frequentist type key characteristics of a trial. Figure 2 illustrates this effect for the joint distribution of the activity indicators I_0 and I_1, considered as a function of the number of treated patients. Empirical probabilities are shown, based on 5000 simulated trials of size N_max = 500, under Q_null with true parameter values θ_0 = θ_1 = 0.3 (left), and under Q_alt with θ_0 = 0.3, θ_1 = 0.5 (right). For Q_alt, θ_0 + δ = 0.3 + 0.1 < 0.5 = θ_1, and therefore the posterior probabilities P_π(θ_0 + δ ≥ θ_1 | D*_i) tend to be small, at least for larger values of i. When they are below the threshold ε, compliance with Rule 1 forces the control arm to be dormant. We can see this happening in Figure 2 on the right, where the Q_alt probability of {I_0 = 0, I_1 = 1} clearly dominates that of {I_0 = 1, I_1 = 0}. The effect is strongest in the liberal parameter setting (c), and weakest but still quite strong in the conservative alternative (b). In contrast, under Q_null, with θ_0 = θ_1, the configuration {I_0 = 1, I_1 = 1} remains the most likely alternative during the entire follow-up, with the strongest tendency to do so in the conservative design (b) and the weakest in the liberal (c). A third aspect to be noted on the left of Figure 2 is that the configuration {I_0 = 1, I_1 = 0} was always much more likely than {I_0 = 0, I_1 = 1}, due to the control arm being protected by the positive safety margin δ.
Finally, Figure 2 shows the expectations E_{Q_null}[E_π(θ_k|D*_i)] and E_{Q_alt}[E_π(θ_k|D*_i)], (1 ≤ i ≤ 500, k = 0, 1), computed from these simulations under Q_null and Q_alt. For small i all these values are close to 0.5, originating from the Uniform(0, 1)-priors assumed in all simulations. With more data, the curves stabilize close to the true parameter values, but exhibit then a small negative bias. This is an aspect shared by all adaptive methods favoring, in treatment allocation, arms with relatively more successes in the past, see e.g. Villar et al. (2015). Given that the main goal of each on-going trial is the mutual comparison of the different treatments involved, and that this assessment is here made with respect to the joint posterior based on the current trial data, the frequentist property of a small bias in the estimation of the individual treatment success parameters, in the same direction, does not seem very crucial.

Figure 3 shows the cumulative distribution functions (CDFs) of N_1(200), the number of patients out of the first 200 assigned by Rule 1 to the experimental treatment, and of S(200), the total number of successes from both treatments combined. Corresponding results from considering the first 100 and 500 patients are shown in Figures S1 and S2 included in the Supplement. The CDFs in these figures are based on simulated data sets from Q_null and Q_alt by using the same parameter settings (a), (b) and (c) of Rule 1 as in Figure 2 and, in addition, (d) where adaptive treatment allocation was inactivated by applying threshold value ε = 0, then leading to a completely symmetric block randomization. Finally, for a comparison, also shown are the CDFs of these variables when adaptive treatment allocation of patients was applied by using Thompson's rule with fractional exponents κ = 0.25, 0.50, 0.75 and 1.00. Note that κ = 0 would correspond to treatment assignment by tossing a fair coin, and therefore the corresponding CDF of S(200) would be very similar to that obtained under Rule 1 (d).

The top part of Figure 3 shows how the application of Rule 1, under Q_null, leads to often allocating exactly half of the patients to both treatment arms, which happens in trial runs during which the dormant state had not been entered even once. Overall, due to the protective safety margin δ > 0, Rule 1 has a tendency of allocating more patients to the control arm. Thompson's rule, in contrast, behaves symmetrically for data coming from Q_null. Under Q_alt, in which case the true success rate of the experimental treatment is higher than that of the control, after a training period, more than half of the patients will usually be given this better treatment. In Rule 1, the mode of control changes abruptly at times at which one of the two treatment arms enters the dormant state. The different versions of Rule 1 can therefore be said to represent a bang-bang type of system control. Thompson's rule applies a randomization scheme based on continuously updated posterior probabilities, and in this sense represents a softer control type. Under Q_alt, the better performance of the experimental treatment is usually detected rather early in the trial, and then, with more evidence from the data, all adaptive rules use progressively stronger control in directing patients to this better treatment. However, there is a small probability that, accidentally, more patients are given the inferior control treatment.
It is clear, as is also illustrated by Figure 1, that the risk for this to happen is highest early in the trial when there are only few observed outcomes. In the present simulations, these Q_alt-probabilities were, respectively, 0.041, 0.023 and 0.049, under Rule 1 with parameters (a), (b) and (c). In the bottom part of Figure 3, the CDFs for S(200) under Q_null are identical in all designs, due to both treatment arms having the same true response rate 0.3. For Q_alt, employing the symmetric block randomization scheme Rule 1 (d) gives E_{Q_alt}(S(200)) = 80, and if all patients could be given the better experimental treatment, the resulting optimal expected value would be 100. In Figure 3, the expectations E_{Q_alt}(S(200)) for different adaptive schemes range from 85.6 for Rule 1 (b) to 94.4 for Thompson's rule with κ = 1.

Employing an initial burn-in period. The potential problem of accidentally allocating more patients to an inferior treatment arm can be mitigated by delaying the workings of the adaptive mechanism of Rule 1, or Thompson's rule, by employing the symmetric block randomization scheme (d) until a fixed number of patients have been assigned to all treatments. To have an idea of the size of the effect of this modification in the present example, we carried out a simulation study identical to that leading to Figure 3 except that, of the considered 200 patients, the first 30 were divided evenly between the two treatments, 15 to each. The result is shown in the Supplement Figure S3. The probabilities of imbalance in the unwanted direction are now lower, respectively 0.013, 0.005 and 0.019 for Rule 1 (a), (b) and (c). On the other hand, delaying the adaptive mechanism from taking effect until outcome data from the first 30 patients are available obviously lowers, in the case of Q_alt, the expected number of treatment successes by small amounts. For additional comments on the effects of burn-in, see the Supplement. Alternative versions of burn-in in adaptive designs have been considered, e.g., in Thall and Wathen (2007) and Thall et al. (2015).

There are no free lunches, and these potential gains in terms of either more efficacious treatments given to more patients in the trial, or smaller numbers of treated patients needed for being able to select the better treatment, are to be weighed against the potentially stronger statistical inferences that might be obtained from more balanced designs. For a numerical comparison, we considered a design where adaptive patient allocation followed either Rule 1 or Thompson's rule, and an assessment of the results, including the possibility of dropping a treatment arm, was only allowed at the time at which a pre-specified number i = N_max of patients had been treated. Here we consider the choice N_max = 200, reporting the results from experiments with N_max = 100 and 500 in the Supplement. In a trial with only two treatments, dropping either one is taken to mean selection of the other. The final analysis made at N_max need not necessarily use the same threshold values as Rule 1, and therefore we use new notations ε_0 and δ_0 for them. Accordingly, when performing such an analysis at N_max, the control arm is dropped if P_π(θ_0 + δ_0 ≥ θ_1|D*_{N_max}) ≤ ε_0, and the experimental arm is dropped if P_π(θ_1 ≥ θ_0|D*_{N_max}) ≤ ε_0. Obviously, at most one of these criteria can be satisfied for given data D*_{N_max} when ε_0 < 0.5. But it is also possible that neither of them is satisfied, in which case no firm decision concerning treatment selection is made at N_max.
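A minimal numerical sketch of this final analysis, under the assumptions of uniform Beta(1, 1) priors and Monte Carlo evaluation of the two posterior probabilities, could look as follows; the function name and the example counts are purely illustrative.

```python
import numpy as np

def final_analysis(S0, F0, S1, F1, eps0=0.05, delta0=0.05,
                   n_draws=200_000, seed=3):
    """Final assessment at N_max for a two-arm trial (sketch).

    Drops the control arm if P(theta_0 + delta_0 >= theta_1 | D) <= eps0
    (i.e., selects the experimental treatment), drops the experimental
    arm if P(theta_1 >= theta_0 | D) <= eps0 (selects the control), and
    otherwise declares the trial inconclusive.  Uniform priors assumed.
    """
    rng = np.random.default_rng(seed)
    th0 = rng.beta(1 + S0, 1 + F0, n_draws)
    th1 = rng.beta(1 + S1, 1 + F1, n_draws)
    p_ctrl = np.mean(th0 + delta0 >= th1)
    p_exp = np.mean(th1 >= th0)
    if p_ctrl <= eps0:
        decision = "select experimental arm"
    elif p_exp <= eps0:
        decision = "select control arm"
    else:
        decision = "inconclusive"
    return decision, p_ctrl, p_exp

# Example: 30/100 successes on the control arm, 48/100 on the experimental arm.
print(final_analysis(S0=30, F0=70, S1=48, F1=52))
```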
Even then, however, there is the possibility of studying the joint posterior P_π((θ_0, θ_1) ∈ · |D*_{N_max}) for the purpose of drawing further inferences from the results of the trial. For example, one can print the posterior CDF P_π(θ_1 − θ_0 ≤ x|D*_{N_max}), −1 ≤ x ≤ 1, and then decide, the study protocol permitting this, whether to continue the trial by recruiting more participants. This may then lead to either one of the two selection criteria being satisfied at a later point in time.

We now study how the application of different versions of adaptive treatment allocation influences the strength of statistical inferences, viewed from a frequentist perspective, that can be drawn from trial data in Experiment 1. For this, we consider the probabilities Q(P_π(θ_0 + δ_0 ≥ θ_1|D*_{N_max}) ≤ ε_0) and Q(P_π(θ_1 ≥ θ_0|D*_{N_max}) ≤ ε_0), evaluated under both Q = Q_null and Q = Q_alt. Under Q_null, the CDFs of P_π(θ_1 ≥ θ_0 | D*_200) for different designs, shown in the bottom part of Figure 4, are almost linear, which would correspond to the Uniform(0, 1) sampling distribution. This is the case particularly in the designs following Thompson's rule, where the two treatment arms are considered symmetrically. For Rule 1 the deviations from linearity are clearer, and most evident in the case of Rule 1 (c). The overall shape of the CDFs of P_π(θ_0 + 0.05 ≥ θ_1 | D*_200) in the top part of Figure 4 is convex, signalling that the Q_null-density of these posterior probabilities tends to increase as their values increase. The reason is the threshold δ_0 = 0.05 providing extra protection for the control arm against being dropped. The CDFs generated under Q_alt behave very differently. Those of P_π(θ_1 ≥ θ_0 | D*_200) in the bottom part of Figure 4 show a high concentration of values close to 1, and those of P_π(θ_0 + 0.05 ≥ θ_1 | D*_200) in the top part of Figure 4 a somewhat lower but still high concentration close to 0. The main difference between these CDFs stems from the opposite directions of the inequalities between θ_0 and θ_1, and the difference in concentration is again due to the threshold δ_0 = 0.05.

Based on these results, we then computed numerical values for the true and false positive and negative rates, shown in Table 1. More exactly, we use the terms true positive rate = Q_alt(P_π(θ_0 + δ_0 ≥ θ_1|D*_{N_max}) ≤ ε_0) and false negative rate = Q_alt(P_π(θ_1 ≥ θ_0|D*_{N_max}) ≤ ε_0). In addition, the probabilities Q(P_π(θ_0 + δ_0 ≥ θ_1|D*_{N_max}) > ε_0, P_π(θ_1 ≥ θ_0|D*_{N_max}) > ε_0) are called inconclusive rates, respectively, under Q = Q_null and Q = Q_alt. The following conclusions are now immediate from Table 1. For N_max = 200, the false positive rates are small, below 2.5 percent, for all considered versions of adaptive treatment allocation. This is true even for the "liberal" design parameters (c) in Rule 1 for which there was a non-negligible probability, about five percent, of serious imbalance in treatment allocation in the unwanted direction. The false negative rates are very small for all considered designs. Under Q_null, the trial remains inconclusive with probability at least ninety percent, which is consistent with the fact that then there is no difference between the true response rates θ_0 and θ_1. Finally, the true positive rate (power) is on the moderate level of approximately seventy percent when applying Rule 1 with design parameter values (a), (b) and (d), and almost as high for Thompson's rule with κ = 0.25. Recall here that (d) means symmetric block randomization, which can be thought to provide a suitable yardstick for such comparisons of power.
For larger values of κ, for which the adaptive mechanism is stronger, these rates are smaller. Of all considered alternatives, the smallest true positive rate is obtained for the design parameters (c). Corresponding tables for N max = 100 and N max = 500 are provided, and commented on, in the Supplement as Table S1 and Table S2 . Employing an initial burn-in period. We also studied the effect of the burn-in period, described above in subsection 3.1.2, on the frequentist performance measures in Table 1 . For this, we drew CDFs (not shown) similar to those in Figure 4 , and then worked out numerical values for the true and false positive and negative rates. The results, with some comments, can be found in the Supplement Table S3 . Remarks on other test variants. Somewhat different numerical values are obtained if the positive safety margin δ 0 protecting the control arm from being dropped is given the value δ 0 = 0. With this extra protection removed, the rates of positive findings, both true and false, will naturally increase, while the rates of negative results remain unchanged. Another modification is to change the presently used decision criterion P π (θ 1 ≥ θ 0 |D * Nmax ) ≤ ε 0 for dropping the experimental arm into P π (θ 1 ≥ θ 0 + δ 0 |D * Nmax ) ≤ ε 0 , in which case it would be symmetric to the condition P π (θ 0 + δ 0 ≥ θ 1 |D * Nmax ) ≤ ε 0 for dropping the control arm. If made, such a conclusion (in effect, declaring futility) is made more easily. The true and false negative rates become then larger, while the rates of positive results remain unaltered. For both variants, the inconclusive rates are larger than when applying the original criteria. Numerical values for these two variants, with N max = 200, are provided in Supplement Tables S4 and S5 . It depends on the concrete context whether either one of these alternative criteria would be considered more appropriate than the version used in the construction of Table 1 . All three represent different forms of superiority trials. After a suitable modification, the same basic structures would also apply for testing non-inferiority and equivalence hypotheses (e.g., Lesaffre (2008) ). We then modified the design by employing the adaptive Rule 2 for treatment selection. Figure 5 shows the probabilities Q(N 0,last ≤ i, N 1,last > i) of having dropped the control arm, Q(N 0,last > i, N 1,last ≤ i) of having dropped the experimental arm, and Q(N 0,last > i, N 1,last > i) of not having done either of these, all considered at the time i patients had been treated. Note that, since the possibility of dropping both treatment arms in the same trial has been ruled out, the first two probabilities can be written in the shorter form Q(N 0,last ≤ i) and Q(N 1,last ≤ i). Empirical estimates of these probabilities are shown, based on 5000 simulated samples from Q null (left) and Q alt (right). The earlier threshold values (a), (b) and (c) for ε and δ were again applied, but combining them with ε 1 = 0 and ε 2 = 0.05 for Rule 2. In Figure 5 , on the left, the curve Q null (N 0,last ≤ i), 1 ≤ i ≤ 500, forming the upper boundary of the blue band "1 active, 0 dropped", depicts the false positive rate evaluated at i. On the right, Q alt (N 0,last ≤ i) is the true positive rate, or power at i. The widths of the brown bands in this figure can be interpreted similarly, with Q null (N 1,last ≤ i) on the left being the true negative rate evaluated at i, and Q alt (N 1,last ≤ i) on the right the false negative rate. 
The latter probabilities are small, at most 0.05 for the considered designs (a), (b) and (c) for Rule 1. The widths of the yellow bands represent the inconclusive rates at i. From this it follows that the areas of these colored bands in Figure 5 also have meaningful interpretations in terms of expected values. The area of the blue region from 1 to i is the expected value, with respect to Q_null (left) and to Q_alt (right), of the number of patients among the first i who were directed to the experimental treatment in the situation in which the control arm had already been dropped. The areas of the brown bands can be interpreted in a similar fashion, with the roles of the two treatments interchanged. The area of the yellow band from 1 to i is the expected value, again with respect to Q_null (left) and to Q_alt (right), of the random variable min{N_{0,last}, N_{1,last}, i}. Finally, Figure 5 shows the expectations E_{Q_null}[E_π(θ_k|D*_i)] and E_{Q_alt}[E_π(θ_k|D*_i)], (1 ≤ i ≤ 500, k = 0, 1), computed from these simulations. Overall, the behaviour of these curves is similar to those in Figure 2, although the negative bias seems here slightly larger. Apparently, this difference is due to Rule 2 imposing a stronger control on treatment allocation.

Our second simulation experiment is modeled following the set-up of Table 7 in Villar et al. (2015), describing a trial with K = 3 experimental arms and a control arm. The hypotheses were H_0: θ_k ≤ θ_0 for all k, 1 ≤ k ≤ 3, and its logical complement H_1: θ_k > θ_0 for at least one k, 1 ≤ k ≤ 3. Considered as a multiple hypothesis testing problem, applying significance level α = 0.05 and the Bonferroni correction, H_0 was tested separately against each alternative H_{1k}: θ_k > θ_0 at level α/3. The numerical results shown in Table 7 of Villar et al. (2015) were based on using the fixed trial size of N_max = 80, together with parameter values (θ_0, θ_1, θ_2, θ_3) = (0.3, 0.3, 0.3, 0.3) for computing the family-wise error rate (FWER), and (θ_0, θ_1, θ_2, θ_3) = (0.3, 0.4, 0.5, 0.6) for computing the power of concluding H_1. The small trial size was justified by thinking of a rare disease setting, where the number of patients in the trial could be a high proportion of all patients with the considered condition. Below, we continue using the shorthand notations Q_null and Q_alt for these two parameter settings.

As in Experiment 1, we first monitored the execution of this trial, based on a single simulation from Q_alt, and thereby applying Rule 1 for treatment allocation with threshold values ε = 0.1 and δ = 0.1. Figure 6 presents an example based on such simulated data, showing the time-evolution of the posterior probabilities P_π(θ_0 + δ ≥ θ_∨ | D*_i) and P_π(θ_k = θ_∨ | D*_i), 1 ≤ k ≤ 3, of the posterior expectations E_π(θ_k|D*_i) and of the cumulative sums of the activity indicators I_k, 0 ≤ k ≤ 3, all considered at times at which i patients had been treated and up to maximal trial size N_max = 500. From the top display we can see how, with some luck in the simulation that was carried out, the posterior probabilities P_π(θ_0 + δ ≥ θ_∨ | D*_i), P_π(θ_1 = θ_∨ | D*_i) and P_π(θ_2 = θ_∨ | D*_i) started progressively to take on values below the given threshold ε = 0.10 and finally stayed there during the remaining simulation run.
In contrast, after considerable early variation, the posterior probabilities P_π(θ_3 = θ_∨ | D*_i) corresponding to the highest true response rate θ_3 = 0.6 stayed consistently above that threshold level, and actually started to dominate the others from approximately i = 120 onward. The cumulative activity indicators for all treatment arms in the bottom display of Figure 6 show clearly when each of these arms was active or dormant. In this simulation, there was some back-and-forth movement between these two states, but finally treatment arms 0, 1 and 2, respectively after 153, 174, and 68 treated patients, remained dormant. The dotted line shows the cumulative numbers of successes in the simulation, ending up with the total S(500) = 294, not much short of the optimal expected value 300 that would have been obtained if all 500 patients had been assigned to the best treatment with success rate θ_3 = 0.6.

Next, as in Experiment 1, we studied the effect of the choice of the design parameters ε and δ in Rule 1 on some selected frequentist type key characteristics of the trial. For this, we simulated 2000 data sets of size N_max = 500, under both Q_null and Q_alt. The same three combinations of the design parameters were used as before: (a) ε = 0.1, δ = 0.1, (b) ε = 0.05, δ = 0.1, (c) ε = 0.2, δ = 0.05. For the analysis, θ_0, . . . , θ_3 were assumed to be a priori independent and uniformly distributed on (0, 1). In a 4-arm trial there would in principle be 2^4 − 1 = 15 possibilities of forming combinations of active and dormant states at a given i, and it would be hard to present such results in an easily understandable graphical form. The main aim of a trial of this type is to find out whether one of the experimental treatments would be better than the others, and in particular, better than the control. In view of this, we call treatment k maximal at i if P_π(θ_k = θ_∨|D*_i) ≥ P_π(θ_ℓ = θ_∨|D*_i) for all ℓ ≠ k, and then focus our attention on events of the form {treatment k is maximal, control treatment is dormant}. The results are shown in Figure 7. In the subfigures, the width of each of the 4 bands at i corresponds to the Q-probability of a respective event. The three lower bands represent the Q-probabilities of {treatment k is maximal at i, I_{0,i} = 0}, 1 ≤ k ≤ 3, and the upper band (violet) the Q-probabilities of {I_{0,i} = 1}.

In the present 4-arm experiment, we can think of all three experimental arms combined as competing with, and being evaluated against, the control arm, in a way analogous to the single experimental treatment in Experiment 1. Seen from this angle, the sum of the widths of the three lower bands of Figure 7 corresponds to the width of the lowest band in Figure 2, while that of the top one in the former corresponds to the sum of the top two in the latter. On the left of Figure 7, describing Q_null, the violet band corresponding to {I_{0,i} = 1} is broader than the other three, not only because of the assumed initial state {I_{0,1} = 1}, but because the control arm is protected by δ = 0.1 against being moved to the dormant state. The other three bands are similar to each other due to the assumed symmetry of the experimental treatments 1, 2 and 3 under Q_null. All these probabilities stabilize rather quickly with growing i, well before i = 100. On the right, corresponding to Q_alt, the violet band becomes narrower with growing i, losing ground mainly to the yellow band, which represents the Q_alt-probabilities of the events {treatment 3 is maximal at i, I_{0,i} = 0}.
The widths of the three lower bands, yellow, brown and blue, are seen to follow the same order as the corresponding true response parameter values. Approximate values of these probabilities can be read from Figure 7 as well. For example, considering design (a) at i = 500, we get Q alt (treatment 3 is maximal at 500, I 0,500 = 0) = 0.763. Overall, designs (a) and (b) led to very similar Q alt -probabilities, while the more liberal design (c), which allowed for more variability during the early stages of the trial, gave rise to somewhat broader brown and blue bands. We then employed also Rule 2, in order to study the ability of this algorithm to drop possibly inferior treatment arms from the trial and thereby to act as a selection mechanism for those performing better. Using data simulated under Q null and Q alt , the same three combinations of design parameters as in Experiment 1 were again considered: (a) ε = 0.1, ε 1 = 0, ε 2 = 0.05, In the subfigures, the width of each of the 4 bands corresponds to the Q-probability of a respective event in the box. Also shown are the expectations E Q null E π θ k |D * i and E Q alt E π θ k |D * i , (1 ≤ i ≤ 500, 1 ≤ k ≤ 3), computed from these simulations. For more details, see text. δ = 0.1, (b) ε = 0.05, ε 1 = 0, ε 2 = 0.05, δ = 0.1, (c) ε = 0.2, ε 1 = 0, ε 2 = 0.05, δ = 0.05. The results are shown in Figure 8 . The main distinction to Figure 7 is that, in the definition of the four colored bands, the events {I 0,i = 0} have here been replaced by {N 0,last ≤ i}. The widths of the three lowest bands therefore represent the Q-probabilities of the events {treatment k is maximal at i, N 0,last ≤ i}, 1 ≤ k ≤ 3, while that of the violet band is the Q-probability of {N 0,last > i}. As in the case of Experiment 1, these events have operational meanings comparable to corresponding key concepts used in hypothesis testing. Thus, on the left of Figure 8 , the sum of the three lower bandwidths at i is the false positive rate when observing outcome data from i patients. If its size is of major concern to a person considering the design from a frequentist perspective, it can be reduced smaller in a similar fashion as suggested in Section 3.1.2 in the context of Experiment 1, by employing a form of a burn-in period and activating adaptive treatment allocation only after some fixed number of patients have been treated in all four arms. Data of the kind considered in Sections 2 and 3, where binary outcomes are determined and observed soon after the treatment is delivered, may be rare in practical applications such as drug development. More likely, it takes some time until a response to a treatment can measured in a useful manner. For example, the status of a cancer patient could be determined one month after the treatment was given. Incorporation of such a delay into the model is not technically very difficult, but it necessitates explicit introduction of the recruitment or arrival process, in continuous time, of the patients to the trial. A somewhat different problem arises if the outcome itself is a measurement of time, such as time from treatment to relapse or to death in a cancer trial, or to infection in vaccine development. When such information would be needed for adaptive treatment allocation, part of the data are typically right censored. Both types of extensions of the basic Bernoulli model in Section 2 are considered briefly below. 
We now consider a model, where a binary outcome is systematically measured after a fixed time period has elapsed from the time at which the patient in question received the treatment. Modelling such a situation, rather obviously, requires that the model is based on a continuous time parameter. Let, therefore, t > 0 be a continuous time parameter, and denote by U 1 < U 2 < . . . < U i < . . . the arrival times of the patients to the trial, again using i = 1, 2, . . . to index the participants. We then assume that the treatment is always given immediately upon arrival, and that the outcome Y i is measured at time V i = U i + d, where d > 0 is fixed as part of the design. Let N (t) = i≥1 1 {U i ≤t} , t > 0, be the counting process of arrivals. At time t, outcome measurements are available from only those patients who arrived and were treated before time t − d. Therefore, the adaptive rule for assigning a treatment to a participant arriving at time t can utilize only the data . Three combinations of design parameters were considered: (a) ε = 0.1, ε 1 = 0, ε 2 = 0.05, δ = 0.1 (top), (b) ε = 0.05, ε 1 = 0, ε 2 = 0.05, δ = 0.1 (middle), (c) ε = 0.2, ε 1 = 0, ε 2 = 0.05, δ = 0.05 (bottom). In the subfigures, the width of each of the 4 bands corresponds to the Q-probability of a respective event in the box. Also shown are the expectations E Q null E π θ k |D * i and E Q alt E π θ k |D * i , (1 ≤ i ≤ 500, 1 ≤ k ≤ 3), computed from these simulations. For more details, see text. where the indicator C i (t) = 1 {U i 0, the process counting the arrivals to the trial. If the data are collected at time t, and U i ≤ t and V i > t hold for patient i, the response time X i will be right censored. Observed in the data are then the Suppose now that the original response times X i arising from treatment k, i.e., those for which A i = k, are independent and distributed according to some distribution F (x|θ k ) with respective parameter value θ k > 0, k = 0, 1, . . . , K. Denote the corresponding densities by f (x|θ k ). As above, we assume that the arrival process is not informative about the model parameters, and that the participants are conditionally exchangeable given their respective treatment assignments. Then the likelihood expression corresponding to data collected from treatment arm k up to time t, has the familiar form Such data are in the survival analysis literature commonly referred to as data with staggered entry. Due to the assumed conditional independence of the response times across the different treatment arms, given the respective parameters θ k , the combined data give rise to the product form likelihood where θ = (θ 0 , θ 1 , . . . , θ K ). Upon specifying a prior for θ, the posterior probabilities corresponding to the data D t can then be computed and utilized in Rule 1 or Rule 2. Remarks. It is well known that, in Bayesian inference, Gamma-distributions are conjugate priors to the likelihood arising from exponentially distributed survival or duration data, with θ k representing the corresponding intensity parameters. This holds also when such data are right censored, in which case the likelihood (4.3) corresponding to D k,t has the Poisson form, with being the number of measured positive outcomes and N (t) i=1 Y i (t) 1 {A i =k} the corresponding Total Time on Test (TTT) statistic. Assuming independent Gamma(θ k | α k , β k )priors for the respective treatment arms k = 0, 1, . . . 
, K, the posterior for θ k corresponding to data D k,t becomes (4.5) and the joint posterior p (θ | D t ) is the product distribution of these independent marginals. When considering the application of Rule 1 or Rule 2 in this exponential response time model, the natural target would often be to decrease, rather than increase, the value of the intensity parameter corresponding to an experimental treatment in the trial. Moreover, for measuring the degree of such potential improvements, use of hazard ratios, or relative risks, seems often more appropriate than of absolute differences. Criteria such as P π θ k ≥ θ low D n < ε 1 and P π θ 0 + δ ≥ max ∈T θ D n < ε 2 applied previously in Rule 2 should then be replaced by corresponding requirements of the form P π θ k ≤ θ high D t < ε 1 and P π ρθ 0 ≤ min where ρ < 1 is a given safety margin protecting the control arm from inadvertent dropping. Writing ρ = exp {−δ} and using η k = − log θ k as model parameters brings us back to the absolute scale, with the last inequality becoming the requirement η 0 + δ ≥ max ∈T η . An important and timely special case of time-to-event data are data coming from large scale Phase III vaccine trials. When a newly developed vaccine candidate has reached the stage when it is tested in humans for efficacy, the trial participants are usually healthy individuals and the control treatment is either placebo or some existing vaccine that has been already approved for wider use. In such trials adaptive treatment allocation is less likely to be an issue, whereas it would be important to arrive at some reasonably definitive conclusion about efficacy already before reaching the planned study endpoint N max . For this reason, in the recent trials for testing COVID-19 candidate vaccines in humans, the design has allowed for from two to five 'looks' into the data before trial completion, usually defined as times at which some prespecified number of infections have been observed. To our knowledge, most of these trials have applied frequentist group sequential methods for testing, adjusting the targeted significance level by suitably defined spending functions. This standard practice is followed in spite of that, arguably, in trials for experimental vaccines such as the COVID-19 candidates, for which Phase II has been already successfully completed, Type 1 errors could be considered less worrisome than Type 2 errors. Entertaining the idea that such vaccine trials had been designed by using the Bayesian framework as presented above in 4.2, this task could have been accomplished by applying Rule 2 and thereby selecting suitable values for its design parameters ρ, θ high , ε 1 , ε 2 and N max , letting finally ε = ε 2 to inactivate the separately defined adaptive mechanism for treatment allocation. For example, considering the case of a single experimental vaccine, the value ρ = 0.4 would signify the target of sixty percent decrease in the value of the intensity parameter θ 1 compared to the placebo control θ 0 , and thereby a corresponding reduction in the expected number of infected individuals among those vaccinated. The trial could then be run, and it would stop with declared success if a posterior probability P π ρθ 0 < θ 1 D * i < ε 2 were obtained for some i ≤ N max . On the other hand, futility would be declared if either P π θ 1 ≤ θ high D * i < ε 1 or P π ρθ 0 ≥ θ 1 D * i < ε 2 were established for such i. 
In either case, the monitoring of these probabilities could in principle be done in an open book form, and not just in a few 'looks' made at pre-planned check points. A somewhat different approach to modeling and analyzing vaccine trial data can be outlined as follows. Suppose that the design is fixed by allocating, at time t = 0, n 1 individuals to the vaccination group and n 0 individuals to the placebo group. Denote by 0 < T 1,1 < T 1,2 < ... the times at which the individuals in the former group become infected and by 0 < T 0,1 < T 0,2 < ... the corresponding times in the latter group. Expressed in terms of counting processes, N 1 (t) = m≥1 1 {T 1,m ≤t} and N 1 (t) = m≥1 1 {T 1,m ≤t} count the number of infections up to time t in these two groups. We then assume that infections occur at respective rates (n 1 − N 1 (t−))λ 1 (t) and (n 0 − N 0 (t−))λ 0 (t), where λ 1 (t) and λ 0 (t) are unknown functions of the follow-up time t. In practice, n 1 and n 0 are large, of the order 10.000 or more, while N 1 (t) and N 0 (t) can during the observation interval be at most a few hundred. Therefore, {N 1 (t); t ≥ 0} and {N 0 (t); t ≥ 0} can be approximated quite well by Poisson processes with respective intensities n 1 λ 1 (t) and n 0 λ 0 (t). Suppose that these processes are (conditionally) independent given their intensities. Then the likelihood corresponding to the data D t = {N 0 (s), N 1 (s); s ≤ t}, combined from both groups and up to time t, gets the familiar Poisson-form expression (4.6) Assuming that the processes {T 0,m ; t ≥ 1} and {T 1,m ; t ≥ 1} do not have exact ties, we now consider their superposition {0 < T 1 < T 2 < ...} and the corresponding counting process N (t) = N 0 (t) + N 1 (t) = m≥1 1 {Tm≤t} , which then has intensity n 0 λ 0 (t) + n 1 λ 1 (t). In what follows, for the purposes of statistical inference, this superposition is decomposed back into its components. For this, we define a Estimation of the function λ 0 (.), describing the infection pressure in the non-vaccinated population, may be possible by utilizing data sources that are external to the trial, but estimation of λ 1 (.) would be hard. This problem can be circumvented if we are ready to impose a proportionality assumption, according to which, although the rates at which infections occur in the vaccination and placebo groups generally vary in time, their ratio is a constant ρ > 0. Expressed in symbols, we assume then that λ 1 (t) = ρλ 0 (t), t ≥ 0. The smaller the value of ρ, the better protected, according to this model, the vaccinated individuals are. The value 1 − ρ is what is commonly called vaccine efficacy at reducing infection susceptibility, abbreviated as V E S (e.g., Halloran et al. (2010) ). The postulated proportionality property appears to be reasonable if all trial participants are vaccinated approximately at the same time, in which case t refers to time from vaccination, and if both groups, due to randomization, can be assumed to be exposed to approximately the same infection pressure. If the trial participants have been recruited from different geographical regions with highly varying levels of infection pressure, a stratified analysis based on a common vaccine efficacy value might still be possible. 
However, if vaccination takes place over a longer time period, it becomes difficult to differentiate from each other the effects of infection pressure, varying in the population with calendar time, and that of individual level susceptiblity, which is likely to depend on the build-up of the immune response and thereby on the time from vaccination. A different matter, which has received much attention recently in connection of COVID-19 vaccine trials, is the dependence of ρ on age, due to the immune response in the older age groups generally developing more slowly. Stratification of the analyses by using some age threshold has been applied, but the selected thresholds have varied. This is a problem for statistical analysis as long as the numbers of infected individuals in some age groups remain low. Supposing now a common value for ρ, there are two alternative approaches to be selected from: Either (i) considering joint inferences on the pair (λ 0 (.), ρ), using the "full" likelihood (4.6) for this purpose and introducing a separate model for a description of λ 0 (.), or (ii) following the path well known from the context of the Cox proportional hazards model and employing a corresponding partial likelihood expression (e.g., Yip and Chen (2000) ). In a stratified analysis, the (partial) likelihood expressions would become products across the considered strata. Here we consider briefly the approach based on partial likelihood. A comparative assessment of these approaches is beyond the scope of this presentation. By inserting the assumed form λ 1 (.) = ρλ 0 (.) of the intensity λ 1 (.) into (4.6), it can be written, after some re-arrangement and cancellation of terms, in the form . The latter product in this expression simplifies further into where we have denoted θ = n 0 (n 0 + n 1 ρ) −1 . This is the sought-after partial likelihood and, parameterized in this way, it has the familiar Binomial form. The word partial signifies the fact that the parts in the "full" likelihood that were omitted in the derivation of (4.7) also contain the unknown model parameter ρ. We now proceed by employing the approximation where the partial likelihood is treated as if it were the "full". On specifying a Beta( . | α, β)-prior for θ, and using the conjugacy property of the Beta-Binomial distribution family, we would get the posterior p(θ | D t ) = Beta(θ | α + N 0 (t), β + N (t) − N 0 (t)), and further the posterior for ρ by noting that ρ = n 0 (1 − θ)/n 1 θ. However, a Beta-prior may not be fully appropriate for this particular application. More naturally we could postulate, for example, the Uniform(0, 1) prior for ρ. It would correspond to the assumption that infectivity in the vaccine group cannot be larger than in the placebo group, but all values of vaccine efficacy between 0 and 100 percent are a priori equally likely. This would entail for θ a prior density, which is no longer of Beta-form. With the conjugacy property lacking in this case, the posterior can nevertheless be computed easily by applying Markov Chain Monte Carlo sampling. While adaptive treatment allocation appears to be less of an issue in vaccine trials, there will be more interest in how, and when, results from such trials could be appropriately reported. At times such as the current SARS-CoV-2 pandemic, there is much pressure towards making the results from vaccine trials available as soon as a pre-specified level of certainty can be assured. 
Again, consistent with the likelihood principle, all monitoring of posterior probabilities could be done in an open book form, and not just in a few 'looks' at pre-planned check points. For example, the trial could be run, and it could stop with declared success at time t if the posterior probability P π (V E S ≥ ve * |D t ) > 1 − ε 1 were obtained, with ve * a pre-specified minimal target value and ε 1 having a small value such as 0.05 or 0.01. (To compare, according to the WHO guidelines for evaluation of COVID-19 vaccines (World Health Organization (2020)), for a candidate vaccine the primary efficacy endpoint point estimate in a placebo-controlled efficacy trial should be at least 50 percent, and the lower bound of the appropriately alpha-adjusted confidence interval around the primary efficacy endpoint point estimate should be larger than 30 percent. Note that, while such a criterion defines a stopping time with respect to the internal history of the trial, it violates the likelihood principle.) A similar criterion could be set up for declaring futility. To give an example from a recent real study, Moderna, Inc. announced on November 30, 2020 (Moderna Inc. (2020)) a primary efficacy analysis of their Phase III COVID-19 Vaccine Candidate. The announcement, based on a randomized, 1:1 placebo-controlled study of 30.000 participants, reported 185 infections in the placebo group and 11 in the vaccine group, leading to the point estimate 11/185 = 0.059 of ρ and thereby efficacy estimate 0.941. We computed the posterior density p(ρ | D t ) of ρ, using these data N 0 (t) = 185 and N 1 (t) = 11 and assuming the uniform prior for ρ as described above. The result, together with the 95 percent HPDI (0.030, 0.105), is shown in Figure 9 . The corresponding HPDI for V E S = 1 − ρ is then (0.895, 0.970). Remarks. A practical advantage of the Poisson process approximation entertained above is that only the numbers N 0 (t) and N 1 (t) are needed for computing the posterior of ρ at time t. If n 0 and n 1 are not large enough to justify such an approximation, statistical inference based on partial likelihood is still possible, but it then necessitates monitoring of the sizes of the two risk sets. The exact times of infection are not required, but the ordering in which members of either the placebo or of the vaccine groups become infected needs to be known. As in the case of the Cox proportional hazards model, the partial likelihood expression is then somewhat more involved and the computations more slow. In the above approach and analysis we have assumed that the risk set sizes are reduced only due to the trial participants becoming infected. This may not be so, as there may be various other reasons why they may be lost from follow-up. If the resulting right censoring concerns a large proportion of the participants, this has to be accounted for in the analysis. It does not create a conceptually difficult problem, but it requires that the sizes of the risk sets, both in the vaccine and the placebo groups, are known at the times at which new infections are registered. The simple power form expression (4.7) for partial likelihood is then not valid any more, and needs to be replaced by the product (4.8) where R 0,Tm and R 1,Tm are the sizes of the two risk sets at time T m . It is, in fact, a simple form of the familiar expression used for the Cox proportional hazards model, connected to the latter by the transformation ρ = exp {−β}. 
Currently, several vaccines against COVID-19 have been successfully tested in placebo controlled Phase III trials and, somewhat depending on the country, have then been approved by the relevant regulatory authorities for wider use in their respective population. In addition to the original efficacy trials, there are now several studies on the population level effectiveness of COVID-19 vaccines (e.g., Dagan et al. 2021 , Vasileiou et al. 2021 ). On the other hand, in the present situation in which several vaccines that are demonstrably efficacious against both infection and the more serious forms of COVID-19 disease are available, it is difficult to find support, for a number of different reasons, to additional large-scale placebo controlled trials for testing new candidate vaccines, cf. Krause et al. 2020 . A possible alternative to such testing would be to use one or more of these existing vaccines as controls, and then make a comparative study. Such a design presents two major challenges, however. The first difficulty is demonstrated clearly by the Moderna study described briefly above: Of the approximately 15.000 individuals in the vaccine group only 11 were infected during the trial. If the candidate vaccine has at all comparable efficacy, as would naturally be desirable, the number of infected individuals in the vaccine group of a similar size, and assuming a comparable infection pressure in the study population, could not be expected to be much larger. With such small frequencies from both treatment arms in the trial, it would not be possible to arrive at a sufficiently firm conclusion concerning the desired target of superiority or non-inferiority, and this would be the case regardless of the statistical paradigm that were applied for such purpose. To overcome this problem, it would therefore be almost mandatory to seek regulatory approval to a design in which healthy volunteers, some vaccinated by the candidate and some by an already approved vaccine, say Vaccine*, used as a control treatment, are exposed to the virus under a carefully specified protocol. The possibility of a human challenge design, albeit with placebo controls, was already discussed at the time when no efficacious vaccine was available (World Health Organization 2020, Eyal et al. 2020 , Richards 2020 , and it is still considered relevant now (Eyal and Lipsitch 2021) . One could anticipate that in a challenge trial, naturally depending on the level of viral exposure that would be applied, a much smaller number of participants would be needed for reaching a statistically valid conclusion on comparability. If desired, such a design could be extended to involve more than a single candidate and/or control vaccine. Note that adaptive sequential recruitment and Bayesian decision making, as exemplified by Rule 2, would find here their natural place: It would not be necessary to fix the group sizes in advance; the trial could be run with newly recruited individuals until the desired level of certainty, as specified in the design, has been reached. A second issue arising in the context of such a design concerns statistical modeling and inference in a situation in which information comes from different data sources: While the design may lead to an efficacy estimate where the candidate vaccine is compared to another in routine use, this estimate cannot be readily converted to a corresponding V E S -estimate, where the candidate vaccine is compared to placebo. 
For practical consideration, this latter estimate could be the one of most interest. An approximate solution to this problem could be provided by assuming that the relative V E S -efficacy measures obtained from different trials, viz. an 'old' trial for testing Vaccine* vs. placebo, and the 'new' trial for testing the candidate vaccine vs. Vaccine*, act multiplicatively on each other, which would correspond to the structure of the Cox proportional hazards model. This would then yield a synthetic V E S -estimate for comparing the candidate vaccine to placebo, with a corresponding posterior derived by applying Bayesian inferential tools providing an uncertainty quantification. The relevance of this idea of combining estimates from different trials needs to be given careful scrutiny, however, and in particular since the dominant virus variant may have changed in between. This approach will be studied in more detail elsewhere. Clinical trials are an instrument for making informed decisions. In Phase II trials, the usual goal is to make a comparative evaluation on the success rates of one or more experimental treatments to a standard or control, and in multi-arm trials, also to each other. More successful treatments among the considered alternatives, if found, can then be selected for further study, possibly in Phase III. With this as the stated goal for a trial, the conclusions should obviously be drawn as fast as possible, but not jumping ahead of the evidence provided by the acquired data. Both aspects can be accounted for by applying a suitable adaptive design, allowing for a continuous monitoring of the outcome data, and then utilizing in the execution of the trial the information that the data contain. Still, there is always the antagonism Exploration versus Exploitation: From the perspective of an individual patient in the trial, under postulated exchangeability, the optimal choice of treatment would be to receive the one with the largest current posterior mean of the success rate, as this would correspond to the highest predictive probability of treatment success. However, as demonstrated in Villar et al. (2015) , this Current Belief (CB) strategy leads to a very low probability of ultimately detecting the best treatment arm among the considered alternatives and would therefore be a poor choice when considering the overall aims of the trial. Finding an appropriate balance between these two competing interests is a core issue in the design and execution of clinical trials, and can realistically be made only in each concrete context. For example, in trials involving medical conditions such as uncomplicated urinary infections, or acute ear infections in children, use of balanced non-adaptive 1:1 randomization to both symptomatic treatment and antibiotics groups appears fully reasonable. A very different example is provided by the famous ECMO trial on the use of the potentially life-saving technique of extracorporeal membrane oxygenation in treating newborn infants with severe respiratory failure (e.g., Bartlett et al. (1985) , Wolfson (2003) ). While statisticians advising clinical researchers have the responsibility of making available the best methods in their tool kit, there may well be overriding logistic, medical or ethical arguments which determine the final choice of the trial design. It has been even suggested that randomized clinical trials as such can present a scientific/ethical dilemma for clinical investigators, see Royall (1991) . 
Bayesian inferential methods are naturally suited to sequential decision making over time. In the present context, this involves deciding at each time point whether to continue accrual of more participants to the trial or to stop, either temporarily or permanently, and if such accrual is continued, selecting the treatment arm to which the next arriving participant is assigned. The current joint posterior distribution of the success parameters captures then the essential information in the data that is needed for such decisions. The posterior probabilities used for formulating Rule 2, when considered as functions of the accumulated data D n , can be viewed as test statistics in sequential tests of null hypotheses against corresponding alternatives. This link between the Bayesian and the frequentist inferential approaches makes it possible to compute, for the selected design parameters, the values of traditional performance criteria such as false positive rate and power. In the present approach, specifying a particular value for the trial size has no real theoretical bearing, and would serve mainly as an instrument for resource planning. Instead, the emphasis in the design is on making an appropriate choice of the operating characteristics, the ε's and δ, which control the execution of the trial, and on the direct consideration of posterior probabilities of events of the form {θ k = θ ∨ } and {θ 0 + δ ≥ θ ∨ } when monitoring outcome data from the trial. An important difference to the methods based on classical hypothesis testing is that posterior probabilities, being conditioned on the observed data, are directly interpretable and meaningful concepts as such, without reference to their quantile value in a sampling distribution conditioned on the null. This is true regardless of whether the trial design applies adaptive treatment allocation and selection while the trial is in progress, or whether only a final posterior analysis is performed when an initially prescribed number of trial participants have been treated and their outcomes observed. Large differences between the success parameters, if present, will often be detected early without need to wait until reaching a planned maximal trial size. On the other hand, if the joint posterior stems from an interim analysis, it forms a principled basis for predicting, in the form the consequent posterior predictive distribution, what may happen in the future if the trial is continued (e.g., Spiegelhalter et al. (1986) , Yin et al. (2012) ). Note, however, that future outcomes are uncertain even in the fictitious situation in which the true values of the success parameters were known. Therefore, from the perspective of decision making, the predictive distribution involves only "more uncertainty" than the posterior, not less. Another advantage of the direct consideration of posterior probabilities is that the joint posterior of the success parameters may contain useful empirical evidence for further study even when no firm final conclusion from the trial has been made. This is in contrast to classical hypothesis testing, where, unless the observed significance level is below the selected α-level so that the stated null hypothesis is rejected, the conclusion from the trial remains hanging in mid-air, without providing much guidance on whether some parts of the study would perhaps deserve further experimentation and consequent closer assessment. 
The standard paradigm of null hypothesis significance testing (NHST), and particularly the version where the observed p-value is compared mechanistically to a selected α-level such as 0.05, have been criticised increasingly sharply in the recent statistical literature (e.g., Wasserstein and Lazar (2016) , Greenland et al. (2016) ). In spite of this, the corresponding strong emphasis on controlling the frequentist Type 1 error rate at a pre-specified fixed level has been largely adopted in the Bayesian clinical trials literature as well (e.g., Shi, Yin, et al. (2019) , Stallard et al. (2020) ). These error rates are conditional probabilities, evaluated from a sampling distribution under an assumed null hypothesis Q null and in practice computed during the design stage when no actual outcome data from the trial are yet available. In contrast, in the Bayesian clinical trials methodology as outlined here, error control against false positives is performed continuously while the trial is run by applying bounds of the form P π θ 0 + δ ≥ θ ∨ D * i < ε 2 , where the considered posterior probabilities are conditioned on the currently available trial data D * i . For this reason, in our view, calibration of Bayesian trial designs on a selected fixed frequentist Type 1 error rate (e.g., Thall et al. (2015) ) does not form a natural basis for comparing such designs. More generally, the role of testing a null hypothesis and the consequent emphasis on Type 1 error rate should not enjoy primacy over other relevant criteria in drawing concrete conclusions from a clinical trial (Greenland (2020) ). Even posterior inferences alone are not sufficient for rational decision making in such a context, and should therefore optimally be combined with appropriately selected utility functions (e.g., D.V. Lindley in Grieve et al. (1994) ). If the trial is continued into Phase III, this can be done in a seamless fashion by using the joint posterior of the selected treatments from Phase II as the prior for Phase III. In particular, if some treatment arms have been dropped during Phase II, the trial can be continued into Phase III as if the selected remaining treatments had been the only ones present from the very beginning. Recall, however, from the remarks made in Section 2 that such treatment elimination, as encoded into Rule 2, contains a violation of the likelihood principle. If Rule 2 is employed in Phase III, and considering that Phase III trials are commonly targeted at providing confirmatory evidence on the safety and efficacy of the new experimental treatment against the current standard treatment used as a control, it may be a reasonable idea to lower the threshold values ε 1 and ε 2 from their levels used in Phase II, and thereby apply stricter criteria for final approval. No statistical method is uniformly superior to others on all accounts. Important criticisms against the use of adaptive randomization in clinical trials have been presented, e.g., in Thall et al. (2015) . There, computer simulations were used to compare adaptive patient allocation based on Thompson's rule (Thompson (1933) , Villar et al. (2015) ) in its original and fractional forms, in a two-arm 200-patient clinical trial, to an equally randomized group sequential design. 
The main argument against using methods applying adaptive randomization was their potential instability, that is, there was, in the authors' view, unacceptably large (frequentist) Q-probability of allocating more patients to the inferior treatment arm, the opposite of the intended effect. Although these simulations were restricted to Thompson's rule, the criticism in Thall et al. (2015) was directed more generally towards applying adaptive randomization and would therefore in principle apply to our Rules 1 and 2 as well. The results from our limited simulation experiments, shown in graphical form in Figure 3 and Figures S1 and S2 in the Supplement, do not support such a firm negative conclusion, however. This holds at least provided that the possibility of actually dropping a treatment arm is deferred to a somewhat later time from the beginning of the trial, and that in such assessment the deviations from balance in the opposite directions are not weighted completely differently. A precautionary approach to the design, from a frequentist perspective, could apply a sandwich structure, starting with a symmetric burn-in, followed by an adaptive treatment allocation realized by Rule 1 or Thompson's rule, and finally coupling in Rule 2 for actual treatment selection. Another criticism presented in Thall et al. (2015) was that, for trial data collected from an adaptive design, the considered tests had lower power than in a corresponding equally randomized design, and particularly so if the tests were calibrated to have the same Type 1 error rate. This question was discussed in subsection 3.1.3 and in the corresponding part of the Supplement. In these experiments, adaptive treatment allocation methods based on Rule 1 (a) and (b), and on Thompson's rule with fractional power κ = 0.25, demonstrated frequentist performance quite comparable to what was observed when applying the fully symmetric block randomization design (d). All adaptive methods favoring treatment arms with relatively more successes in the past will inevitably introduce some degree of bias in the estimation of the respective success parameters, see Bauer and Köhne (1994) and Villar et al. (2015) . A comprehensive review of the topic is provided in Robertson et al. (2021) . Here, we have only considered this matter briefly in our simulation experiments, and instead emphasized the, in our view, more important aspect of the mutual comparison of the performance of different treatment arms in the trial. All biases in these experiments were relatively small and in the same direction, downward, and therefore unlikely to have had a strong influence on the conclusions that were drawn. Our main focus has been on trials with binary outcome data, where individual outcomes could be measured soon after the treatment was delivered. More complicated data situations were outlined in Section 4. The important case of normally distributed outcome data was by-passed here; there is a large body of literature relating to it, e.g., Spiegelhalter et al. (1994) and Gsponer et al. (2014) . A complication with the normal distribution is that, unless the variance is known to a good approximation already from before, there are two free parameters to be estimated for each treatment. If a suitable yardstick at the start is missing, many observations are needed before it becomes possible to separate the statistical variability of the outcome measures from a true difference between treatment effects. 
In principle, the logic of Rules 1 and 2 remains valid and these rules can be applied for different types of outcome data, requiring only the ability to update the posterior distributions of the model parameters of interest when more data become available. The computation of the posteriors is naturally much less involved if the prior and the likelihood are conjugate to each other. Vague priors, or models containing more than a single parameter to be updated, will necessarily require more outcome data before adaptive actions based on Rule 1 or Rule 2 can kick in. If such updating is not done systematically after each individual outcome is measured, for example, for logistic reasons, but less frequently in batches, Rule 1 and Rule 2 can still be used at the times at which the batches are completed. The same holds if updating is done at regularly spaced points in time. Such thinning of the data sequence has the effect that some of the actions that would have been otherwise implied by Rule 1 and Rule 2 are then postponed to a later time or even omitted. In designing a concrete trial, one then needs to find an appropriate balance between, on one hand, the costs saved in logistics and computation, and on the other, the resulting loss of information and the effect this may have to the quality of the inferences that can be drawn. A Additional figures to subsection 3.1.2 Figures S1 and S2 below complement Figure 3 in the main paper, where we illustrated the effect of the design parameters of Rule 1 and Thompson's rule on treatment allocation in a two-arm trial with N max = 200, and on the consequent total number of treatment successes. Here we do the same for N max = 100 in Figure S1 and for N max = 500 in Figure S2 . For data generated under Q null , the overall shape of the CDFs in Figures S1 and S2 remains remarkably close to that in Figure 3 , where the trial size was N max = 200. The differences become more evident when considering Q alt , in which case the adaptive rules can use their potential to assign more patients to the treatment with higher true success rate. But learning from data takes time, and therefore the gains from using such adaptive rules become progressively more evident as the trial size increases. Thus, the expected number of successes can be increased by approximately ten percent by employing a strong adaptive treatment allocation rule when N max = 100, by fifteen percent when N max = 200, and twenty percent when N max = 500. Another point of interest in the case of Q alt is the probability of unwanted imbalance, allocating more patients to the inferior control arm than to the better experimental one. The highest risk for this to happen is in the case of Rule 1 (c), for which it was found to be approximately five percent when N max = 200. The corresponding percentage for N max = 100 is ten and for N max = 500 three. Rule 1 (c) appears to be the only design, among those considered, for which there is a non-negligible probability that the imbalance turns out to be serious. For the other designs, including different versions of Thompson's rule, the probabilities are much smaller, and very small for N max = 500. 46 Figure S1 : B Additional figures and tables to subsection 3.1.3 In subsection 3.1.3 of the main text we studied the performance of different adaptive designs in terms of true and false positive and negative rates, by considering trial size N max = 200 in Figure 4 and Table 1 . 
Below we present corresponding results for N max = 100 in Figure S4 and Table S1 , and for N max = 500 in Figure S5 and Table S2 . When combined, these results give us an idea about how such measures depend on the size of the trial. Figures S4 and S5 bear close similarity to Figure 4 . The main differences can be seen in the CDFs arising from data generated under Q alt . The CDFs of the posterior probabilities P π (θ 1 ≥ θ 0 |D * Nmax ) move to the right as N max grows from 100 to 200 and then to 500, thereby signalling that these probabilities become stochastically larger with growing trial size. A similar movement, somewhat slower and in the opposite direction, is seen in the CDFs of P π (θ 0 +0.05 ≥ θ 1 |D * Nmax ) with growing N max . The following conclusions can now be made from Tables 1, S1 and S2. Under Q null , the false positive rates are generally somewhat smaller for larger trial sizes, but remain under 0.025 even in the case of N max = 100. The true negative rates are usually larger, by a few percentage points, when the trial size is changed from 100 to 200 and then to 500, and the inconclusive rates correspondingly smaller, typically attaining values on either side of ninety percent. The false negative rates are very small for all considered designs. In contrast, as can be expected, the true positive rate (power ) under Q alt depends strongly on the size of the trial. As reported in 3.1.3, for N max = 200 it has the moderate level of approximately seventy percent for Rule 1 designs (a), (b) and (d), and almost as high for Thompson's rule with κ = 0.25. For these same designs and N max = 100, the true positive rates are lower, on both sides of 45 percent, but for N max = 500 already in the range of 95 percent. Again, of interest is to note that, in terms of these frequentist measures, three adaptive rules perform as well as the symmetric block randomization design (d). For Thompson's rule, larger values of κ lead to greater instability in the behavior of the adaptive mechanism and consequent weaker frequentist performance. Of all considered alternatives, the smallest true positive rate is obtained for the design (c) of Rule 1. The false negative rates are very small for all considered designs. 50 Figure S4 : Effect of the design parameters ε and δ of Rule 1, and κ of Thompson's rule, on the CDFs of the posterior probabilities P θ 0 + 0.05 ≥ θ 1 D * 100 (top) and P θ 1 ≥ θ 0 D * 100 (bottom) in the 2-arm trial of Experiment 1 when applying Rule 1 for treatment allocation and making a final assessment at i = N max = 100. The results are based on 5000 data sets generated under Q null and Q alt when using the following combinations of design parameters: (a) ε = 0.1, δ = 0.1, (b) ε = 0.05, δ = 0.1, (c) ε = 0.2, δ = 0.05. Figure S5 : Effect of the design parameters ε and δ of Rule 1, and κ of Thompson's rule, on the CDFs of the posterior probabilities P θ 0 + 0.05 ≥ θ 1 D * 500 (top) and P θ 1 ≥ θ 0 D * 500 (bottom) in the 2-arm trial of Experiment 1 when applying Rule 1 for treatment allocation and making a final assessment at i = N max = 500. The results are based on 5000 data sets generated under Q null and Q alt when using the following combinations of design parameters: (a) ε = 0.1, δ = 0.1, (b) ε = 0.05, δ = 0.1, (c) ε = 0.2, δ = 0.05. Table S1 : True and false positive and negative rates when applying adaptive treatment allocation with design parameter values ε 0 = 0.05 and δ 0 = 0.05 in a trial of size N max = 100. 
Table S2 : True and false positive and negative rates when applying adaptive treatment allocation with design parameter values ε 0 = 0.05 and δ 0 = 0.05 in a trial of size N max = 500. In Table S3 we consider the effect of the design modification, where the first 30 patients are divided evenly, by using a block randomization, to the two treatments. Adaptive treatment allocation is then applied after this, either in the form of Rule 1 or Thompson's rule, and the performance measures are evaluated at N max = 200 from a simulation experiment of 5000 repetitions. The numerical values in Table S3 are compared naturally to those in Table 1 , where the design was the same except that no burn-in was used. Overall, the differences are small. The largest change is in the values of true positive rate (power) for Rule 1 (c), which has increased from 0.303 in Table 1 to 0.443 due to the stabilizing initial burn-in. Smaller differences can be seen in the false positive rates for Rule 1 (c) and Thompson's rule with κ = 0.75 and κ = 1, where burn-in has trimmed down these already rather low rates by small amounts. The conclusion from this experiment is that, in a trial of size N max = 200, employing an initial burn-in period has a small to modest stabilizing effect on the frequentist performance of those adaptive designs in which the adaptive mechanism was strongest. In the first variant, we consider in Table S4 the case δ 0 = 0, where the special protection against dropping the control arm in the final test at N max has been removed. Thus we write false positive rate = Q null (P π (θ 0 ≥ θ 1 |D * Nmax ) ≤ ε 0 ), true negative rate = Q null (P π (θ 1 ≥ Table S4 : True and false positive and negative rates when applying adaptive treatment allocation with design parameter values ε 0 = 0.05 and δ 0 = 0 in a trial of size N max = 200. First test variant, see text. θ 0 |D * Nmax ) ≤ ε 0 ), true positive rate = Q alt (P π (θ 0 ≥ θ 1 |D * Nmax ) ≤ ε 0 ) and false negative rate = Q alt (P π (θ 1 ≥ θ 0 |D * Nmax ) ≤ ε 0 ). Inconclusive rates are the probabilities Q(P π (θ 0 ≥ θ 1 |D * Nmax ) > ε 0 , P π (θ 1 ≥ θ 0 |D * Nmax ) > ε 0 ), for Q = Q null and Q = Q alt . As noted in the main text, this change from the original criteria implies that, compared to the respective values provided in Table 1 , all positive rates are now larger, while the negative rates remain intact. Of the former, the rates for Rule 1 (a), (b) and (d), and for Thompson's rule with κ = 0.25, are again quite similar, with false positive rates varying on both sides of five percent and true positive rates (power ) reaching levels of almost ninety percent. The frequentist performance of the other designs is somewhat weaker, deteriorating with increasing instability of the allocation rule. In the second variant of the final test, the experimental arm is dropped if P π (θ 1 ≥ θ 0 + δ 0 |D * Nmax ) ≤ ε 0 . Therefore, in Table S5 we write false positive rate = Q null (P π (θ 0 + δ 0 ≥ θ 1 |D * Nmax ) ≤ ε 0 ), true negative rate = Q null (P π (θ 1 ≥ θ 0 + δ 0 |D * Nmax ) ≤ ε 0 ), true positive rate = Q alt (P π (θ 0 +δ 0 ≥ θ 1 |D * Nmax ) ≤ ε 0 ) and false negative rate = Q alt (P π (θ 1 ≥ θ 0 +δ 0 |D * Nmax ) ≤ ε 0 ). The probabilities Q(P π (θ 0 + δ 0 ≥ θ 1 |D * Nmax ) > ε 0 , P π (θ 1 ≥ θ 0 + δ 0 |D * Nmax ) > ε 0 ), for Q = Q null and Q = Q alt , are inconclusive rates. This change means that the negative rates, both true and false, are now larger than the respective values in Table 1 , while the positive rates remain intact. 
The true negative rates, which were below ten percent in Table 1 , vary in Table S5 on both sides of twenty percent. The inconclusive rates under Q null are now lower than in Table S4 , but still rather high, between seventy-five and eighty percent. The false negative rates are slightly higher than in Table 1 , but still very low for all allocation rules. The performance of Table S5 : True and false positive and negative rates when applying adaptive treatment allocation with design parameter values ε 0 = 0.05 and δ 0 = 0.05 in a trial of size N max = 200. Second test variant, see text. Rule 1 (a), (b) and (d), and of Thompson's rule with κ = 0.25, is again quite similar. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples Group sequential methods in the design and analysis of clinical trials A multiple testing procedure for clinical trials Bayesian approach to bioequivalence assessment: an example The likelihood principle Extracorporeal Circulation in Neonatal Respiratory Failure: A Prospective Randomized Study Interim analyses in clinical trials: classical vs. Bayesian approaches Monitoring clinical trials: Conditional or predictive power? Statistical analysis and the illusion of objectivity Ethics and statistics in randomized clinical trials Evaluation of experiments with adaptive interim analyses Interim analysis: the alpha spending function approach Bayesian approaches to randomized trials, Discussion Group sequential clinical trials: a classical evaluation of Bayesian decision-theoretic designs Bayesian approaches to randomized trials Practical Bayesian Guidelines for Phase IIB Clinical Trials Group sequential tests with applications to clinical trials. English. Chapman & Hall/CRC Interdisciplinary Statistics A Partial Likelihood Estimator of Vaccine Efficacy The development and use of extracorporeal membrane oxygenation in neonates Bayesian approaches to clinical trials and health-care evaluation Bayesian clinical trials Practical Bayesian Adaptive Randomization in Clinical Trials Adaptive design methods in clinical trials-a review Superiority, equivalence, and non-inferiority trials Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research Design and analysis of vaccine studies Adaptive design clinical trials: Methodology, challenges and prospect Summarizing historical information on controls in clinical trials Adaptive Clinical Trials: The Promise and the Caution Bayesian adaptive methods for clinical trials Bayesian clinical trials in action A Bayesian adaptive design for multi-dose, randomized, placebo-controlled phase I/II trials Phase II trial design with Bayesian adaptive randomization and predictive probability Bayesian Hypothesis Testing in Two-Arm Trials with Dichotomous Outcomes Adaptive clinical trial design A practical guide to Bayesian group sequential designs Statistical controversies in clinical research: scientific and ethical problems with adaptive randomization in comparative clinical trials Multi-armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges Adaptive Design-Recent Advancement in Clinical Trials Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations Idle thoughts of a 'well-calibrated' Bayesian in clinical drug development Evaluation of a multi-arm multi-stage Bayesian design for phase II drug selection trials -an example in hemato-oncology The ASA Statement on p-Values: Context, 
Process, and Purpose Clinical Trial Design as a Decision Problem Bayesian randomized clinical trials: From fixed to adaptive design Bayesian designs for phase I-II clinical trials Extending a Bayesian decisiontheoretic approach to a value-based sequential clinical trial design Adaptive designs in clinical trials: why use them, and how to run and report them Randomised response-adaptive designs in clinical trials Inference and Decision Making for 21st-Century Drug Development and Approval Control of type I error rates in Bayesian sequential designs Human challenge studies to accelerate coronavirus vaccine licensure Analysis goals, error-cost sensitivity, and analysis hacking: Essential considerations in hypothesis testing and multiple comparisons COVID-19 vaccine trials should seek worthwhile efficacy Moderna announces Primary Efficacy analysis in Phase 3 COVE study for Its Covid-19 Vaccine candidate and Filing today with U.S. FDA for emergency use authorization Ethical guidelines for deliberately infecting volunteers with COVID-19 Comparison of Bayesian and frequentist group-sequential clinical trial designs Key criteria for the Ethical acceptability of Covid-19 human challenge studies BNT162b2 mRNA Covid-19 vaccine in a nationwide mass vaccination setting How to test SARS-CoV-2 vaccines ethically even after one is available The Bayesian Design of Adaptive Clinical Trials barts: Bayesian adaptive rules for treatment selection Point estimation for adaptive trial designs Interim findings from first-dose mass COVID-19 vaccination roll-out and COVID-19 hospital admissions in Scotland: a national prospective cohort study We are grateful to Jukka Ollgren for comments and encouragement, and to Mikko Marttila for useful suggestions on the text. E.A. thanks Arnoldo Frigessi and David Swanson for support and useful discussions during an early stage of this work.