key: cord-0122170-rci07jfw
authors: Greenstreet, Peter; Jaki, Thomas; Bedding, Alun; Harbron, Chris; Mozgunov, Pavel
title: A multi-arm multi-stage platform design that allows pre-planned addition of arms while still controlling the family-wise error
date: 2021-12-12
journal: nan
DOI: nan
sha: 99db9d3d3c79425618e08d3670d601c81190fde7
doc_id: 122170
cord_uid: rci07jfw

There is growing interest in platform trials that allow new treatment arms to be added as the trial progresses and allow treatments to be stopped part way through the trial, either for lack of benefit (futility) or for superiority. In some situations, platform trials need to guarantee that error rates are controlled. This paper presents a multi-stage design that allows additional arms to be added in a platform trial in a pre-planned fashion, while still controlling the family-wise error rate. A method is given to compute the sample size required to achieve a desired level of power, and we show how the distribution of the sample size and the expected sample size can be found. A motivating trial is presented which focuses on two settings, the first being a set number of stages per active treatment arm and the second being a set total number of stages, with treatments that are added later getting fewer stages. Through this example we show that, compared to running multiple separate trials, the proposed method results in a smaller sample size while still controlling the errors.

Clinical trials take many years to run, and during this time it is not uncommon for new promising treatments to emerge that warrant evaluation. It may be advantageous to include these treatments in an ongoing trial, because of the shared trial infrastructure and the possibility of using a shared control group. This can result in useful therapies being identified faster while reducing cost and time (Cohen et al., 2015). Such a trial potentially requires less administrative and logistical effort than setting up separate trials, and so can noticeably speed up the development process. The addition of more arms may also enhance recruitment, as patients have a higher chance of receiving an experimental treatment, making them potentially more likely to join a trial (Meurer et al., 2012).

There is an ongoing discussion about how to add new treatments to clinical trials (Lee et al., 2021). Cohen et al. (2015) conclude in their literature review that there is no systematic or comprehensive method for adding new arms to clinical studies. Recently, several approaches to adding treatment arms have been proposed to help tackle this issue. Bennett and Mander (2020) and Choodari-Oskooei et al. (2020) propose approaches which extend the Dunnett test (Dunnett, 1955) to allow unplanned additional arms to be included in multi-arm trials while still controlling the family-wise error rate (FWER). This methodology does not incorporate the possibility of interim analyses.

Interim analyses are a further way to potentially improve the design of a clinical trial (Wason et al., 2016). They allow ineffective treatments to be dropped for futility (or lack of benefit) earlier, and allow the trial to stop early if a superior treatment is found. Both of these can reduce the expected sample size and cost of a trial. The multi-arm multi-stage (MAMS) design (e.g. Magirr et al., 2012) allows several treatments to be evaluated within one study and incorporates interim analyses for efficiency, but does not allow additional arms to be added during the trial. Burnett et al. (2020) developed an approach that allows unplanned additional treatment arms to be added to a trial already in progress, using the conditional error principle (Proschan and Hunsberger, 1995) to allow for modifications during the course of a trial. The unplanned nature of the adaptation, however, means that the type I error and power may differ between arms. As a result, the additional treatments can be underpowered.

In this work, we provide an analytical method for adding treatments to a multi-arm multi-stage (MAMS) trial in a pre-planned manner, while still controlling the statistical errors. The design assumes that, at the design stage, it is known that new treatments will be added later in the trial. Unlike existing approaches, the proposed approach allows a trial with multiple stages and multiple arms to be designed so that pre-planned additional treatments can be added while still controlling the family-wise error rate (FWER) in the strong sense (Dmitrienko et al., 2009) and achieving suitable power for the trial.

We focus our investigation on two settings: (i) each active treatment has the same number of stages; (ii) there is a fixed total number of stages. Using these two settings, we revisit the design of FLAIR (Howard et al., 2021) that had a treatment arm added during the trial. We derive the critical boundaries, the sample sizes and the expected sample sizes for trials based on FLAIR, and study the effect on errors when deviating from the planned additions. These two settings are compared to running separate trials, using the MAMS design by Magirr et al. (2012) and using the original trial design without adjusting for additional treatments.

Consider a clinical trial with up to $K$ experimental arms that will be tested against one common control arm, with $K^\star$ experimental arms starting at the beginning of the trial, where $K^\star \geq 1$, and $K - K^\star$ arms being added later. The primary outcome for each patient is independent and normally distributed with known variance $\sigma^2$. In total, the control treatment is recruited for a maximum of $J_0$ stages, giving a maximum of $J_0 - 1$ interim analyses, with an analysis taking place at the end of each stage. Each of the active treatments can have any number of stages (provided it is pre-specified and their total number is less than or equal to $J_0 - 1$), which coincide with the analyses for the other treatments. Each additional treatment can be added at any of the interim analyses, as long as this is pre-planned at the design stage of the trial. When comparing the control to the active treatments, only concurrent controls are used. This means that only participants recruited to the control arm at the same time as the active arm are used in the comparisons.

The null hypotheses of interest are $H_{01}: \mu_1 \leq \mu_0$, $H_{02}: \mu_2 \leq \mu_0$, ..., $H_{0K}: \mu_K \leq \mu_0$, where $\mu_1, \ldots, \mu_K$ are the mean responses on the $K$ experimental treatments and $\mu_0$ is the mean response of the control group. The global null hypothesis, $\mu_0 = \mu_1 = \mu_2 = \ldots = \mu_K$, is denoted by $H_G$. Each of the $K$ hypotheses is potentially tested at a series of analyses indexed by $j = 1, \ldots, J_k$, where $J_k$ is the maximum number of analyses for a given treatment $k = 1, \ldots, K$. Let $J$ denote the maximum number of planned analyses of any of the active treatments, $J = \max_k(J_k)$. The total number of stages of the control treatment is $J_0$. Let $s(k)$ be the stage when treatment $k$ is added to the trial and define the vector of adding times by $S = (s(1), \ldots, s(K))$. We denote the ratio of patients recruited to treatment $k$ by the end of its $j$th stage by $r_{k,j}$, and denote by $n_{k,j}$ the number of patients recruited to treatment $k$ by the end of its $j$th stage, with $k = 0$ denoting the control treatment. The number of patients recruited to the first stage of treatment $k$ is defined as $n_k$, so that $n_{k,j} = n_k r_{k,j} / r_{k,1}$. The total sample size of a trial is denoted by $N$, with maximum total planned sample size $\max(N) = \sum_{k=0}^{K} n_{k,J_k}$. At analysis $j$ for treatment $k$, to test $H_{0k}$ it is assumed that responses $X_{k,i}$ from patients $i = 1, \ldots, n_{k,j}$ are observed, as well as the responses $X_{0,i}$ from patients $i = n_{0,s(k)} + 1, \ldots, n_{0,s(k)+j}$, which are the outcomes of the patients allocated to the control who have been recruited from the time treatment $k$ was added up to the $j$th analysis of treatment $k$. The test statistics

$$Z_{k,j} = \frac{n_{k,j}^{-1} \sum_{i=1}^{n_{k,j}} X_{k,i} - (n_{0,s(k)+j} - n_{0,s(k)})^{-1} \sum_{i=n_{0,s(k)}+1}^{n_{0,s(k)+j}} X_{0,i}}{\sigma \sqrt{n_{k,j}^{-1} + (n_{0,s(k)+j} - n_{0,s(k)})^{-1}}}$$

are used to test hypothesis $H_{0k}$. Upper and lower stopping boundaries, $U_k = (u_{k,1}, \ldots, u_{k,J_k})$ and $L_k = (l_{k,1}, \ldots, l_{k,J_k})$, are used for decision-making as follows. If $Z_{k,j} > u_{k,j}$, then $H_{0k}$ is rejected and the trial stops with the conclusion that treatment $k$ is superior to control. If $Z_{k,j} < l_{k,j}$, then treatment $k$ is dropped from all subsequent stages of the trial. If the test statistics for all the treatments fall below their lower boundaries, the trial stops for futility. Treatment $k$ and control continue to the next stage $j + 1$ if neither of these conditions is met, that is, if $l_{k,j} \leq Z_{k,j} \leq u_{k,j}$. The boundaries are found so as to control the family-wise error rate (FWER) in the strong sense at a specified level $\alpha$, which is defined as $P(\text{reject at least one true } H_{0k} \text{ under any null configuration}, k = 1, \ldots, K) \leq \alpha$.
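As an illustration only, the use of concurrent controls and the stopping rule described above can be sketched in Python. The function names, the list-of-stages data layout and the toy numbers are hypothetical and are not taken from the authors' R code; the statistic is the standardised difference in means under the stated assumption of a known standard deviation.

```python
import math

def concurrent_controls(control_by_stage, s_k, j):
    """Pool control responses from control stages s_k+1, ..., s_k+j (hypothetical layout)."""
    pooled = []
    for stage in control_by_stage[s_k:s_k + j]:
        pooled.extend(stage)
    return pooled

def z_statistic(treatment, control, sigma):
    """Standardised difference in means, assuming a known standard deviation sigma."""
    n_t, n_c = len(treatment), len(control)
    diff = sum(treatment) / n_t - sum(control) / n_c
    return diff / (sigma * math.sqrt(1.0 / n_t + 1.0 / n_c))

def decision(z, lower, upper):
    """Apply the stopping rule for one treatment at one analysis."""
    if z > upper:
        return "reject"      # H_0k rejected, trial stops for efficacy
    if z < lower:
        return "drop"        # treatment dropped for futility
    return "continue"        # treatment and control go on to the next stage

# Toy data: the treatment is added after control stage 1 (s_k = 1), second analysis (j = 2).
control_by_stage = [[5.0, 5.0], [0.0, 2.0], [1.0, 1.0]]
treatment = [1.0, 2.0, 3.0, 2.0]
control = concurrent_controls(control_by_stage, s_k=1, j=2)  # stage 1 controls are excluded
z = z_statistic(treatment, control, sigma=1.0)
print(round(z, 3), decision(z, lower=0.0, upper=2.0))
```

Only the controls recruited while the treatment was in the trial enter the statistic, so the first control stage in the toy data has no influence on the result.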

While this work concerns the general procedure of adding arms, we focus on two special cases (see Figure 1). Setting 1 is the case in which each active treatment is planned to have the same number of stages regardless of when it is added. Setting 2 is the case with a set total number of stages, so that the later a treatment is added, the fewer stages are planned for it. Note that in Setting 1, $J_1 = \ldots = J_K = J$ and $J_0 = \max(S) + J$, and in Setting 2, $J_0 = J$ and $J_k = J - s(k)$.
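The stage counts implied by the two settings can be written out as a short Python sketch. The function name and argument names are hypothetical; the example values correspond to a two-arm case with Treatment 2 added at stage 1.

```python
def stage_plan(setting, adding_times, stages):
    """Return (J_0, [J_1, ..., J_K]) for Setting 1 or Setting 2.

    adding_times holds s(k) for each active treatment; stages is J.
    """
    if setting == 1:
        # Same number of stages for every active treatment.
        j_active = [stages] * len(adding_times)
        j_control = max(adding_times) + stages
    elif setting == 2:
        # Fixed total number of stages; later arms get fewer stages.
        j_active = [stages - s for s in adding_times]
        j_control = stages
    else:
        raise ValueError("setting must be 1 or 2")
    if min(j_active) < 1:
        raise ValueError("every active treatment needs at least one stage")
    return j_control, j_active

print(stage_plan(1, [0, 1], 2))  # (3, [2, 2])
print(stage_plan(2, [0, 1], 3))  # (3, [3, 2])
```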

Figure 1: Examples of the two settings for a trial with two active arms, with $J_0 = 3$ and Treatment 2 being added at stage 1. For Setting 1, both active treatments get 2 interim analyses, and for Setting 2, Treatment 2 is added one stage later and, hence, has one fewer stage. The grey represents areas of possible shared control group. The dashed black line represents an interim analysis.

Following the method of Dunnett (1955), one can exploit the correlation between the test statistics arising from the common control responses. This description follows Magirr et al. (2012). For any vector of constants $\Theta = (\theta_1, \ldots, \theta_K)$ and $k = 1, \ldots, K$, $j = 1, \ldots, J_k$, letting $I_{k,j} = \sigma^2 (n_{k,j}^{-1} + (n_{0,j+s(k)} - n_{0,s(k)})^{-1})$, define the events,

If $\mu_k - \mu_0 = \theta_k$ for $k = 1, \ldots, K$, the event that $H_1, \ldots, H_K$ all fail to be rejected is equivalent to $R_K(\Theta)$

with the convention that $\bigcap_{i=1}^{0} = \Omega$, where $\Omega$ is the whole sample space, $m_1 \in \{1, \ldots, K\}$ and $m_k \in \{1, \ldots, K\} \setminus \{m_1, \ldots, m_{k-1}\}$. The notation $m_k$ is used to reflect the fact that the order in which treatments are added affects the FWER. The events $A_{k,j}(\theta_k)$ and $B_{k,i}(\theta_k)$ can be rearranged so that the following holds

Here $\Phi(\cdot)$ denotes the standard normal distribution function, and $\Phi_j(L_{k,j}(\theta_k), U_{k,j}(\theta_k), \Sigma_{k,j})$ denotes the result of integrating the $j$-dimensional normal density with mean zero and correlation matrix $\Sigma_{k,j}$, whose $(i, i')$th element ($i \leq i'$) is $\sqrt{r_{k,i}/r_{k,i'}}$. The integration is over the region defined by a vector of lower limits $L_{k,j}(\theta_k) = (l_{k,1}(\theta_k), \ldots, l_{k,j-1}(\theta_k), -\infty)$ and upper limits $U_{k,j}(\theta_k) = (u_{k,1}(\theta_k), \ldots, u_{k,j-1}(\theta_k), l_{k,j}(\theta_k))$, where l k,j (θ k ) =l k,j 1 + r k,j r 0,s(k)+j − r 0,s(k) + √ r k,j r 0,s(k)+j − r 0,s(k)

Note that, in contrast to Magirr et al. (2012), the proposed approach accounts for the fact that treatments can be added at different points, and hence $l_{k,j}(\theta_k)$ and $u_{k,j}(\theta_k)$ depend on $s(k)$. It also allows for different stopping boundaries per treatment, since $A_{k,j}(\theta_k)$ and $B_{k,j}(\theta_k)$ depend on $l_{k,j}$ and $u_{k,j}$, and for different maximum numbers of stages per treatment. Then, one can obtain the following result.

Theorem 2.1. For any Θ, under the conditions above, P (reject at least one true H 0k |Θ) ≤ P (reject at least one true H 0k |H G ).

The proof of Theorem 2.1 is given in the Supporting Information in Web Appendix A. It follows from Theorem 2.1 that the FWER is maximized under the global null hypothesis.

Corollary 2.1.1. Setting $\Theta = 0$ and finding stopping boundaries such that $P(R_K(0)) = 1 - \alpha$ controls the FWER in the strong sense at level $\alpha$.

Proof. Under the global null hypothesis, $\mu_0 = \mu_k$ for all $k \in \{1, \ldots, K\}$, so that $\Theta = (0, \ldots, 0) = 0$. Using Theorem 2.1, the FWER is controlled in the strong sense at level $\alpha$ if $P(R_K(0)) = 1 - \alpha$.

As a result of Corollary 2.1.1, the stopping boundaries under the global null hypothesis, which result in P (R K (0)) = 1 − α, will guarantee strong control of FWER at level α.

As mentioned above, the proposed methodology allows different critical boundaries to be used for each treatment $k$, as seen in Equation (2.1). To find the boundaries one can use the functions $L_k = f_k(a_k)$ and $U_k = g_k(a_k)$ to reduce the number of unknowns, where $f_k$ and $g_k$ are the functions for the shape of the lower and upper boundaries, respectively, and $a_k$ are scalar parameters specific to each active treatment. One can use a single parameter $a$ to find the boundaries, so that $f_k$, $g_k$ and $a_k$ are the same for every treatment $k$; this is similar to the method presented in Magirr et al. (2012) and has the advantage of there being an equal number of unknowns and equations. However, using the same boundaries for each treatment arm, regardless of when it was added, can result in different probabilities of dropping each treatment, which might be undesirable. It may be of interest to have a different stopping boundary shape for each treatment, because using the same shape for every treatment may not be optimal: the trial may need a greater sample size than a design with different stopping boundary shapes, as seen in the Supporting Information (Web Appendix D). This requires using $L_k = f_k(a_k)$ and $U_k = g_k(a_k)$, which results in $K$ scalar parameters to be found, $a = (a_1, \ldots, a_K)$.
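The shape functions $f_k$ and $g_k$ are not written out in this section. As a hedged illustration of how a single scalar $a_k$ can generate a whole pair of boundaries, the Python sketch below assumes the triangular form $l_j = -a(1 - 3 r_j / r_J)/\sqrt{r_j}$ and $u_j = a(1 + r_j / r_J)/\sqrt{r_j}$ associated with Whitehead (1997) and Wason and Jaki (2012). This specific form, the function name and the example values are assumptions of the sketch and should be checked against those references.

```python
import math

def triangular_boundaries(a, ratios):
    """Lower and upper boundaries from one scalar a (assumed triangular shape).

    ratios holds r_{k,1}, ..., r_{k,J_k}, the cumulative allocation ratios.
    """
    r_max = ratios[-1]
    lower = [-a * (1 - 3 * r / r_max) / math.sqrt(r) for r in ratios]
    upper = [a * (1 + r / r_max) / math.sqrt(r) for r in ratios]
    return lower, upper

# Hypothetical example: three equally spaced stages, a = 2.
lower, upper = triangular_boundaries(2.0, [1, 2, 3])
for j, (l, u) in enumerate(zip(lower, upper), start=1):
    print(f"stage {j}: l = {l:.3f}, u = {u:.3f}")
```

With this shape the lower and upper boundaries meet at the final analysis, so a decision is always reached by the last stage; searching over the boundaries then reduces to searching over the single number $a_k$.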

To calculate $a_k$ for all $k = 1, \ldots, K$, we require the pairwise error rate (PWER) to be the same for all active treatments, where the PWER is the probability of incorrectly rejecting the null hypothesis $H_{0k}$. For the PWER for treatment $k$, it is assumed that no active treatment other than $k$ can stop the trial. This assumption eliminates the possibility of stopping the trial without finishing testing treatment $k$ and hence maximises the probability of the corresponding error. The PWER for treatment $k$, denoted by $\alpha_k$, is

with $L_{k,j} = (l_{k,1}, \ldots, l_{k,j-1}, -\infty)$, $U_{k,j} = (u_{k,1}, \ldots, u_{k,j-1}, l_{k,j})$, and covariance matrix $\tilde{\Sigma}_{k,j}$.

To ensure equal PWER across all the treatments and ensure FWER is controlled the iterative approach in Algorithm 1 is proposed. This approach yields the desired properties as with each iteration we update every a k so that PWER is equal for all the active treatments and then using step H we ensure that the FWER is controlled by using Corollary 2.1.1.

Algorithm 1: Iterative approach to compute the stopping boundaries

0 Begin by assuming $a = (a_1, a_1, \ldots, a_1)$ and find $a_1$ such that $a$ controls the FWER at a specified level $\alpha$, using Equation (2.1) with $\Theta = 0$. Then repeat the following iterative steps until each element of $a$ no longer changes between iterations, within some small tolerance:

1 Find $a_2$ such that $\alpha_2 = \alpha_1$.

. . .

H-1 Find $a_K$ such that $\alpha_K = \alpha_1$.

H Find a scalar multiplier $a^\star$ such that $a = a^\star (a_1, a_2, \ldots, a_K)$ results in Equation (2.1) with $\Theta = 0$ equalling $\alpha$.

The aim of the trial design in question is to find the sample size required for the power for every hypothesis to be greater than a pre-specified level $(1 - \beta)$. We assume that any given treatment $k'$, where $k' = 1, \ldots, K$, is recommended when (i) its test statistic crosses the corresponding upper boundary, and (ii) its test statistic is the largest one. The power is defined as the probability of rejecting $H_{0k'}$ for $k'$ such that $\mu_{k'} - \mu_0 = \theta$ and $\mu_k - \mu_0 = \theta_0$ for $k \neq k'$, where $\theta$ is the clinically interesting treatment effect and $\theta_0$ is the highest uninteresting treatment effect. This setting is known as the least favourable configuration for treatment $k'$ (denoted by $LFC_{k'}$; Thall et al., 1988).

Let $\Pi_{k',J'}$ denote the probability that, under $LFC_{k'}$, no null hypotheses are rejected before the $J'$th analysis for treatment $k'$, with treatment $k'$ not being stopped for futility at any of these analyses, and that at analysis $J'$, $H_{0k'}$ is rejected and treatment $k'$ is recommended, where $J' = 1, \ldots, J_{k'}$. The power for rejecting $H_{0k'}$ is then given by $\sum_{J'=1}^{J_{k'}} \Pi_{k',J'}$.

To obtain $\Pi_{k',J'}$, we find the probability that $H_{0k'}$ is not rejected before analysis $J'$ and treatment $k'$ is not dropped for futility before analysis $J'$, assuming that $t_1, \ldots, t_{s(k')+J'}$ and $v_{J'}$ are known, where $t$ is defined in Equation (2.2) and $v_{J'} = (\bar{X}_{k',J'} - \mu_{k'}) \sqrt{n_{k',J'}} / \sigma$, and then integrate over every possible value of $t_1, \ldots, t_{s(k')+J'}$ and $v_{J'}$, as can be seen below.

The event that $H_{0k'}$ is rejected at analysis $J'$, assuming that $t_1, \ldots, t_{s(k')+J'}$ and $v_{J'}$ are known, and where $1\{\cdot\}$ is an indicator function, is

The probability that $H_{0k}$ is not rejected before analysis $J'$ of treatment $k'$, for all $k \in \{1, \ldots, k'-1, k'+1, \ldots, K\}$, assuming that $t_1, \ldots, t_{s(k')+J'}$ and $v_{J'}$ are known, is

One can then find $\Pi_{k',J'}$ as

(2.4)

To ensure that all the experimental treatments achieve the pre-specified power under the corresponding $LFC_{k'}$, the sample size must be found such that $\Pi_{k',1} + \Pi_{k',2} + \ldots + \Pi_{k',J_{k'}} \geq 1 - \beta$ for all $k'$. Because treatments potentially start at different times and have different numbers of stages, each treatment may require a different number of patients to achieve the same power.

As changing the sample size of one treatment affects the power of another, an iterative approach is proposed to calculate the required sample size per treatment per stage. Specifically, to have all the treatments controlled at the same specified power, $1 - \beta$, under their $LFC_k$, one defines $n = (n_0, \ldots, n_K)$ and then, assuming each $n_k$ can take any real value, uses Algorithm 2. Once $n$ is found using Algorithm 2, each value of $n$ is rounded up to the nearest integer, and $a$ is then recalculated with these new values of $n$ to account for the fact that the ratios between treatments have now also changed.

Algorithm 2: Iterative approach to compute the sample size

0 Begin by assuming $n_k = 1$ for all $k \in \{0, \ldots, K\}$, then calculate $a$ using Algorithm 1. Find $n_1$ such that $\Pi_{1,1} + \Pi_{1,2} + \ldots + \Pi_{1,J_1} = 1 - \beta$ with $n_k = n_1$ for all $k \in \{0, \ldots, K\}$, and update $n$. Then repeat the following iterative steps until each element of $n$ no longer changes between iterations, within some small tolerance:

Find $n_0$ and $r_{0,1}, \ldots, r_{0,J_0}$ based on $n_1, \ldots, n_K$, then recalculate $a$ using Algorithm 1.

The distribution of the sample size and the expected sample size can both be calculated by finding the probability of every possible outcome of the trial, denoted by $P_{\tilde{J},Q}$. Define $\tilde{J} = (\tilde{j}(1), \ldots, \tilde{j}(K))$ with $\tilde{j}(k) = 1, \ldots, J_k$ as the point at which treatment $k$ would finish being tested, ignoring the possibility that the trial has already stopped early because a different treatment has been found to be superior to the control. This is done in order to remove the dependence between the active arms. We define $Q = (q(1), \ldots, q(K))$ with $q(k) = \infty$ if treatment $k$ goes below the lower stopping boundary at point $\tilde{j}(k)$ and $q(k) = 1$ if treatment $k$ goes above the upper stopping boundary at point $\tilde{j}(k)$. Because the possibility of the trial already having stopped early is ignored, every active treatment will stop either for futility or for efficacy, and therefore $q(k)$ can only take one of two values. We find

The $P_{\tilde{J},Q}$ are then associated with their total sample size $N_{\tilde{J},Q}$ for the given $\tilde{J}$ and $Q$.

where $\bullet$ is the scalar product. To obtain the sample size distribution, each combination of $\tilde{J}$ and $Q$ which results in the same value of $N_{\tilde{J},Q}$ is associated with its corresponding $P_{\tilde{J},Q}$. These $P_{\tilde{J},Q}$ are then summed to give the probability of realising this sample size. To find the sample size distribution for each active arm, one can associate $n_{k,\max(\min(\tilde{j}(k)+s(k),\,(\tilde{J}+S) \bullet Q) - s(k),\,0)}$ with its corresponding $P_{\tilde{J},Q}$, and this can similarly be done for the control treatment. The expected sample size for a given $\Theta$, $E(N \mid \Theta)$, can be found by summing over every possible combination of $\tilde{J}$ and $Q$,

$$E(N \mid \Theta) = \sum_{\tilde{J}, Q} P_{\tilde{J},Q} \, N_{\tilde{J},Q}.$$
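Once the outcome probabilities and their associated total sample sizes are available, the grouping and summation described above are simple bookkeeping. The Python sketch below illustrates this with made-up numbers; the function names and the list of (probability, sample size) pairs are hypothetical, and computing the probabilities themselves (which requires multivariate normal integrals) is not attempted here.

```python
from collections import defaultdict

def sample_size_distribution(outcomes):
    """Sum the probabilities of all outcomes that lead to the same total sample size."""
    dist = defaultdict(float)
    for prob, n_total in outcomes:
        dist[n_total] += prob
    return dict(dist)

def expected_sample_size(outcomes):
    """Probability-weighted sum of the total sample sizes."""
    return sum(prob * n_total for prob, n_total in outcomes)

# Hypothetical outcomes: (probability, total sample size) for each trial outcome.
toy_outcomes = [(0.25, 100), (0.25, 150), (0.5, 150)]
print(sample_size_distribution(toy_outcomes))  # {100: 0.25, 150: 0.75}
print(expected_sample_size(toy_outcomes))      # 137.5
```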

In recent years, several platform trials have been conducted, and their use appears to have increased during the COVID-19 pandemic (Stallard et al., 2020). One platform trial example is FLAIR (Howard et al., 2021). It is a randomised, controlled, open-label, confirmatory trial in chronic lymphocytic leukaemia. When FLAIR was designed, the plan was to add an active treatment during the trial and to include an interim analysis halfway through the planned sample size for each treatment. In the actual trial, two additional arms were added, one being an additional control arm. The original design of the study considered pairwise type I error control only and hence did not provide FWER control.

We revisit the design of the FLAIR trial, accounting for one additional active treatment arm that is planned to be added mid-trial. One active arm and a control arm begin the trial, and we apply the proposed methodology to control the FWER in the strong sense. One can argue that controlling the FWER in this trial is essential, as the trial aimed to test different combinations of treatments with the same common compound, ibrutinib, in all the active treatments, whilst the main control was not based on ibrutinib (Wason et al., 2014). Controlling the FWER while adding an arm during the trial might also be a regulatory requirement (FDA, 2019).

Based on the planned effect given in FLAIR, the interesting treatment difference is $\theta = -\log(0.69)$ with $\sigma = 1$, and the uninteresting treatment effect, which was not used in the original design, is assumed to be $\theta_0 = -\log(0.99)$. The desired power in FLAIR was 80%, while the type I error of each treatment comparison was 2.5% (one-sided). While still targeting the same power, we use a more stringent target of 2.5% FWER (one-sided). In line with FLAIR, we set the total number of stages in both settings to three. Therefore, in Setting 1, Treatments 1 and 2 both have two stages, whereas in Setting 2, Treatment 1 has three stages and Treatment 2 has two (see Figure 1). The interim analyses are equally spaced for the active treatments across all stages, so $r_{k,j} = j$ for $k > 0$. Informed by the recruitment to FLAIR, we assume a constant recruitment rate of 21 patients per month.

The operating characteristics that will be studied for both methods include the FWER and power under LFC k . These two are studied to ensure that the trial design meets the required error control. Other operating characteristics stated include the maximum number of stages per active treatment arm, denoted by NS k , as well as the number of patients per arm per stage. Also shown for each setting is the maximum sample size and duration until the trial is complete as well as the expected sample size and duration. The duration of the trial is denoted by T . These values are found in order to compare the different designs.

We compare the proposed designs to three alternative methods. All of these are in the frequentist framework. For brevity, the main focus of the comparison with competing designs will be on Setting 2 (with the results for Setting 1 provided in the Supporting Information).

The first competing approach is to evaluate each active treatment in a separate trial. In line with Setting 2, the first study uses the 3-stage design and the second uses the 2-stage design. This approach will be referred to as "Separate trials". Two variations on running separate trials are studied. The first controls the FWER across both trials using $\alpha'$, the error rate for each trial, chosen so that $(1 - \alpha')^2 = 0.975$. This results in a type I error for each trial of 1.26%. The second variation does not control the FWER across the two trials.
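The per-trial error rate quoted above follows from solving $(1 - \alpha')^2 = 0.975$ for $\alpha'$. A short Python check of this arithmetic is given below; the function name is hypothetical, and the generalisation to more than two independent trials is included only for illustration.

```python
def per_trial_alpha(fwer, n_trials):
    """Solve (1 - alpha')^n_trials = 1 - fwer for alpha', assuming independent trials."""
    return 1.0 - (1.0 - fwer) ** (1.0 / n_trials)

alpha_prime = per_trial_alpha(0.025, 2)
print(f"{100 * alpha_prime:.2f}%")  # 1.26%
```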

The second competing design is the MAMS approach proposed by Magirr et al. (2012). We will refer to this approach as "MAMS trial". Note that under this MAMS approach, the trial cannot start until all treatments are ready. As this approach requires an equal number of stages per treatment, the results for both a 2-stage and a 3-stage trial are presented.

The third competing method uses the same design parameters as originally planned for the 2-arm 3-stage trial and then also uses them for the additional treatment. This approach will be referred to as "Naive MAMS", as it does not adjust the design parameters for the added arm. It also allows us to demonstrate the effect of not adjusting a design for additional arms. We provide the results both for the same maximum sample size as originally planned and for the same sample size per arm per stage, $n_{j,k}$.

The operating characteristics of the different design options are provided in Table 1. The calculations were carried out using R (R Core Team, 2021). For the method given here, the multivariate normal probabilities were calculated using the package mvtnorm (Genz et al., 2021), and the outer integrals were calculated using a quadrature rule with the packages gtools (Warnes et al., 2021) and statmod (Smyth et al., 2021). Code is available at https://github.com/pgreenstreet/platform-design-with-addition-of-arms. The comparison results were obtained using the R package MAMS (Jaki et al., 2019).

All the results can be seen in Table 1. For all the designs, triangular stopping boundaries are used (Whitehead, 1997; Wason and Jaki, 2012). The first two rows of Table 1 show the proposed design under Settings 1 and 2, respectively, with the Supporting Information containing the results of using other stopping boundary shapes.

Table 1: Operating characteristics of the proposed design under two settings and competing approaches: running trials separately ("Separate trials"), the MAMS design by Magirr et al. (2012) ("MAMS"), and using a "naive" MAMS approach for the FLAIR trial.

is the expected sample size and trial duration under the null and under the LFC for treatment k, respectively.

Under Setting 1, which has two stages for each active treatment, the design requires 42 and 44 patients per stage for Treatment 1 and Treatment 2, respectively. As a result, the maximum sample size is 540, and the expected sample size varies between approximately 286 and 401. The maximum duration of the trial is 25.7 months, and the expected duration varies between 13.6 and 19.1 months. Under Setting 2, in which the first active treatment has 3 stages and the second has only 2, the required sample sizes per stage are 46 and 77, respectively. This results in a maximum sample size of 492 and an expected sample size between 296.6 and 347.8, depending on the configuration. The maximum duration is 23.4 months, and the expected duration varies between 14.1 and 16.8 months.
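The maximum durations quoted above are consistent with dividing the maximum sample size by the assumed constant recruitment rate of 21 patients per month. The Python sketch below reproduces that arithmetic; the function and variable names are hypothetical, and the sketch assumes recruitment is the only contributor to trial duration.

```python
RECRUITMENT_RATE = 21  # patients per month, as assumed in the text

def duration_months(sample_size, rate=RECRUITMENT_RATE):
    """Time needed to recruit sample_size patients at a constant rate."""
    return sample_size / rate

for label, n_max in [("Setting 1", 540), ("Setting 2", 492)]:
    print(f"{label}: maximum duration {duration_months(n_max):.1f} months")
```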

Comparing the sample size and trial duration for the proposed designs under Settings 1 and 2, Setting 2 has lower expected sample sizes under most configurations and requires nearly 50 fewer patients in the maximum sample size to achieve 80% power while controlling the FWER at 2.5%. This also translates into a shorter duration. Setting 1 has an advantage over Setting 2 when Treatment 1 is superior and Treatment 2 has the uninteresting effect. However, the difference in the expected sample size is around 10 patients and the difference in the expected duration is 0.5 months. For this reason, we focus on comparisons with Setting 2, with the results for Setting 1 provided in the Supporting Information.

The next two rows of Table 1 show the operating characteristics of running two separate trials. To match the setting of FLAIR (and Setting 2), it is assumed that the first treatment has three stages while the second has two. When the FWER is controlled across these separate trials, the maximum and expected sample sizes are noticeably larger, with 134 more patients required to achieve 80% power. Running two separate trials with FWER control also increases the maximum duration by 6 months and the expected duration by 2-5 months, depending on the configuration. When the two trials do not control the FWER, the advantage of the proposed design under Setting 2 still persists. Two separate trials would require 44 more patients, and their expected sample size is only about 11 patients lower under $LFC_2$, which corresponds to less than 1 month of recruitment time. By comparison, under $LFC_1$ the expected sample size is around 44 patients lower for Setting 2, which results in a time saving of over 2 months.

The second competing method is the MAMS approach proposed by Magirr et al. (2012), which requires all treatments to start at the same time and uses critical values that control the FWER and a sample size that achieves 80% power. We consider both 2- and 3-stage variants for a fairer comparison. For both, the duration until the trial finishes includes the time before the trial would be able to start. This is calculated by assuming that Treatment 2 is ready when planned under Setting 2 and then calculating the time before this treatment is added. This is the time taken for the first 92 patients to be recruited, which is 4.4 months. Under this MAMS design, the maximum sample size is lower than for the proposed design under Setting 2 (by 36 and 15 patients for the 2- and 3-stage designs, respectively). The maximum duration, however, is increased by 3-4 months. The expected duration is also increased for all the configurations studied, by between 1.4 and 5 months.

The final comparison is with the naive approach of using the original design for a 2-arm 3-stage trial. The original design has 46 patients per arm per stage, which results in 276 patients. When $n_{j,k}$ is kept the same, this results in a maximum sample size of 368. For this approach the FWER is inflated by over 75%. The power under $LFC_1$ is still above the desired level; under $LFC_2$, however, the power decreases to 56.4%, well below the target of 80%. For the naive approach in which the maximum sample size remains the same, the sample size for the second and third stages of Treatment 1 changes to accommodate the addition of the new treatment. As a result, there are 46 patients on Treatment 1 at stage 1, and this decreases to 30 for each of Treatment 1's final two stages. In order to keep $\max(N) = 276$, $n_2$ was set to 31 patients. In this case the FWER is inflated by over 75% and the power is not controlled at the desired level under either $LFC_1$ or $LFC_2$. The drop in power under $LFC_2$ between Setting 2 and this naive approach is over 35%, which is a dramatic loss in power. This poor result is to be expected for this naive approach, as it has neither bounds designed to control the FWER nor the number of patients required to achieve the desired power for either treatment.
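The two totals quoted for the first naive variant can be checked with simple arithmetic. The Python sketch below assumes that the control arm and Treatment 1 each have three stages and that the added Treatment 2 has two stages, all with 46 patients per arm per stage; the function name and this reading of the stage counts are assumptions of the sketch.

```python
def max_sample_size(per_stage, stages_per_arm):
    """Maximum total sample size when every arm recruits per_stage patients per stage."""
    return sum(per_stage * stages for stages in stages_per_arm)

original_design = max_sample_size(46, [3, 3])     # control and Treatment 1
with_added_arm = max_sample_size(46, [3, 3, 2])   # plus Treatment 2 with two stages
print(original_design, with_added_arm)  # 276 368
```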

The results of other common stopping boundary shapes and combinations of these can be seen in the Supporting Information in Web Appendix D for Setting 1 and Setting 2. In Web Appendix C of the Supporting Information the comparison results to Setting 1 can be seen when using the triangular stopping boundaries.

Overall, this section has shown how the methods proposed in this paper could be used to design a clinical trial in which an additional treatment is added later. In addition, competing frequentist approaches are studied to see how they compare. It can be seen that the proposed method offers a benefit over these other methods with regard to either sample size or trial duration.

The distribution of the total sample size and the sample size of each treatment for Setting 2 under the global null is given in Figure 2 . Analogous results for Setting 1 can be seen in the Supporting Information (Web Appendix E) along with the expression for the probability mass function for the total sample size for Setting 2. The design under Setting 2 results in the interquartile range of 246 to 292 and median of 292 under the global null for the total sample size. These figures can be used by the trial team and given to funders and regulators to help with the communication of how many patients are likely to be required for the trial. 

In the FLAIR trial, the second active treatment was not added until about three quarters of the way through the recruitment for the first treatment. In this section, the effect of adding the treatments earlier or later than planned (i.e. at the first interim for the considered example) is studied using simulations. Three approaches to how a treatment could be added later or earlier to a trial are considered. In all approaches the total maximum sample size is fixed to be the same. Below, we focus on Setting 2; similar results for Setting 1 are given in Web Appendix F of the Supporting Information.

Approach 1 is to change the timing of the interim analysis for Treatment 1 so it is conducted when Treatment 2 is added. Once the second active treatment is added, the allocation ratio for Treatment 1 to control changes as in the original proposed design. The patients remaining from the total sample size are then shared out across the phases with respect to the pre-set allocation ratio between each treatment. The pre-set stopping boundaries are used. Approach 2 follows Approach 1, but instead of keeping the original boundaries the bounds are recalculated using Algorithm 1 with the allocation ratios of each treatment at the time the additional treatment is actually added. Approach 3 keeps the timing of the interim analysis for Treatment 1 unchanged, and, at this point, the allocation ratio changes.

The effect of adding the second active treatment at any point from after only 1 patient has been recruited to the control up to after 189 patients have been recruited to the control is studied. The first treatment receives the corresponding number of patients based on its recruitment rate relative to the control's recruitment rate, i.e. 1 patient will also have been recruited to Treatment 1 before Treatment 2 is added in the earliest example. Figure 3 shows the resulting FWER and power for different times when the new treatment is added, based on 10 million simulations for each case. Figure 3a shows how the FWER varies for each of the approaches, with Approach 2 having the least variation and Approach 1 the most variation in FWER under the global null. The maximum inflation occurs for Approach 3, for which the FWER under the global null increases to 2.54% when the second treatment is added after 100 patients have been recruited to the control. For Approach 1 the FWER increases until the planned adding time and then decreases. Approach 2 stays constantly around the planned FWER, and for Approach 3 the FWER starts below 2.5% and increases until 100 patients, before starting to decrease.

The changes in FWER for Approaches 1 and 3 are caused by two opposing forces. First, when the additional treatment is added earlier there is a decrease in the correlation between the first interim analysis for Treatment 1 and the rest of its analyses, caused by a decrease in $r_{1,1}/r_{1,2}$ and $r_{1,1}/r_{1,3}$. Second, there is an increase in the correlation between the Z-statistics for Treatments 1 and 2, as there is now an increased number of shared control patients. These two opposing forces make it difficult to predict what effect any change will have on the FWER without running the calculations or using simulations. To guarantee that the FWER is controlled, either the second treatment needs to be added when planned, or the stopping boundaries need to be recalculated for the actual time of addition, as done in Approach 2.
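The dependence of the correlation between the two Z-statistics on the number of shared control patients can be illustrated with a short sketch. It assumes normally distributed outcomes with a common known variance (which cancels) and each Z-statistic being a standardised difference in means against its own set of control patients; the function name and the example numbers are hypothetical.

```python
import math


def z_correlation(n1, n01, n2, n02, n_shared):
    """Correlation between two treatment-versus-control Z-statistics.

    n1, n2   : patients on each active treatment
    n01, n02 : control patients used in each comparison
    n_shared : control patients common to both comparisons
    """
    if n_shared > min(n01, n02):
        raise ValueError("shared controls cannot exceed either control group")
    # Covariance of the two control means (in units of the common variance).
    covariance = n_shared / (n01 * n02)
    # Variance of each difference in means (same units).
    var1 = 1.0 / n1 + 1.0 / n01
    var2 = 1.0 / n2 + 1.0 / n02
    return covariance / math.sqrt(var1 * var2)


# Equal allocation with every control shared gives the familiar value of 0.5.
fully_shared = z_correlation(46, 46, 46, 46, 46)

# Illustrative case: fewer shared controls lowers the correlation.
partly_shared = z_correlation(46, 46, 46, 46, 23)
```

Under these assumptions, adding the second treatment earlier increases `n_shared` relative to the control group sizes, which raises the correlation between the two comparisons.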

Considering the power, for all the considered approaches, the later the additional arm is added the lower the power for this arm. For Approaches 1 and 2, the power for Treatment 1 increases the later the additional arm is added, whereas the power remains almost constant for Approach 3. One issue with Approaches 1 and 2 is therefore that the power for all the treatments under the LFC is no longer controlled unless the treatment is added at the pre-planned time. However, when using Approach 3, the power is controlled for both treatments when the treatment is added earlier, and the FWER is also controlled. This is not always the case, as can be seen in the Supporting Information (Web Appendix G), which provides an example with a higher uninteresting treatment effect. A higher $\theta_0$ results in a greatly increased chance of taking Treatment 2 forward before the second analysis for Treatment 1, owing to the effect of $\theta_0$ on the sample size of the second active treatment. This reiterates why using the pre-planned design and assessing the impact of deviations from the plan are crucial. One potential solution to the problem of controlling the errors when adding the treatment at a different time from that planned is to recalculate all design parameters (including the sample size). This section has highlighted the importance of following the original plan in order to control the FWER and the power under the LFC. It is therefore important when using this design to ensure that the additional treatment will be ready for the pre-planned addition time. One key point is that if the new treatment is not added to the trial at all, the design will still guarantee control of the FWER and power for the treatments already in the trial. Furthermore, using the original plan removes the bias potentially caused by changing when the additional treatments are added in order to benefit treatments already in the trial.

In this paper, a general design for adding additional treatments in a pre-planned manner is developed and explored. This design ensures strong control of the FWER and power under the least favourable configuration and allows for interim analyses. Both sample size distribution and expected sample size can be calculated. Two iterative approaches are given to allow for multiple stopping boundary shapes and to allow for different numbers of patients on each treatment depending on when the treatment is added to the trial. Two different designs based on FLAIR for the two settings are presented. These designs are then further explored where the effect of adding treatments later or earlier than planned is studied.

Overall the method proposed here, which builds on the work of Magirr et al. (2012), has shown that running a platform trial in which treatments are added at later points can result in a considerably more favourable design than running separate trials with respect to maximum and expected sample size. With respect to the time it takes before the trial concludes, this approach has shown that it can be worthwhile to start a trial earlier with the available treatments and plan to add treatments later, compared to waiting until all are ready before beginning the trial. This is true for both a constant and an increasing recruitment rate. In Section 3.3 it was seen that using Setting 2 rather than Setting 1 can be beneficial with regard to the trial's sample size and duration. This makes intuitive sense, as Setting 2 increases the correlation between the test statistics because there are more shared controls, which reduces the FWER of the trial for the given boundaries.

Throughout this paper only concurrent controls are used, as argued by Lee and Wason (2020). While it would be possible to extend the design to allow for non-concurrent controls, error control will only be guaranteed if there is no difference between the concurrent and the non-concurrent controls, and this is impossible to know before the trial. In this paper we assumed that an interim analysis is conducted at the time that a new treatment is added. This not only simplifies calculations but is also sensible: if a new treatment is being added to the trial, the other treatments in the trial may as well be studied at this point. This has two benefits. First, there is the opportunity for a treatment to be declared superior to the control before recruitment to the new treatment starts. Second, it allows the study of all the patients on the control treatment from before the additional treatment is added, potentially making it easier at later stages to know which controls are concurrent.

The PWER was used to calculate $a_1, a_2, \dots, a_K$ so as to share the FWER among the treatments. This was used as it ensures that the highest possible probability of rejecting a null hypothesis is the same for all treatments. However, there are a multitude of other ways in which the FWER could be shared, such as making the probability of rejecting a null hypothesis under the global null the same for all treatments, or making the probability that a treatment is taken forward to the next phase the same. To use one of these alternatives, the same iterative approach as given in Section 2.2 can be applied with a different Equation (2.3).

When calculating the expected sample size, every possible outcome of the trial was enumerated, resulting in a very computationally costly procedure. Web Appendix B of the Supporting Information provides a more efficient approach for computing the expected sample size. The cost of this efficiency is that the algorithm yields only the expected sample size and not the full sample size distribution.

An area for further research is deciding whether to wait for all the treatments to be ready or to start the trial with the ability to add preplanned treatments later. A lot of factors need to be considered when choosing this such as: recruitment time, recruitment cost, time left before all the treatments are ready, and the cost of delaying development of existing treatments. Therefore an area for further research from this paper is looking at how a decision framework, such as the one discussed in Lee et al. (2019) , could be used.

One potential limitation of this approach is that it assumes normally distributed data. However, by using asymptotic normality as discussed in Jaki and Magirr (2013), other endpoints can also be used. Another is the assumption that the common variance is known. However, an ad hoc approach such as the one in Magirr et al. (2012) can be used to transform the individual test statistics to address this issue.

Web Appendix A Proof of strong control of FWER

Proof. For any k > 0,

Take any

Next suppose, for any $m_1, \dots, m_K$ where $m_1 \in \{1, \dots, K\}$ and $m_k \in \{1, \dots, K\} \setminus \{m_1, \dots, m_{k-1}\}$, that $\theta_{m_1}, \dots, \theta_{m_l} \le 0$ and $\theta_{m_{l+1}}, \dots, \theta_{m_K} > 0$. Let $\Theta_l = (\theta_{m_1}, \dots, \theta_{m_l})$. Then

$$P(\text{reject at least one true } H_{0k} \mid \Theta) \le P\big(Z_{k,j} > u_{k,j} \text{ for some } (k,j) \in \{(m_1,1), \dots, (m_1,J_{m_1}), (m_2,1), \dots, (m_l,1), \dots, (m_l,J_{m_l})\} \mid \Theta\big)$$

The sample size calculation can be split into four sections. Two sections that focus on the control treatment and two sections that focus on the active treatments. These sections are:

1. The probability that the control treatment finishes at each stage $j_0$ because the trial is stopped, given that no null hypotheses are rejected, where $j_0 \in \{1, \dots, J_0\}$. This is calculated as the difference between the probability that every treatment is stopped for futility by the control's $j_0$th stage, denoted by $\Psi_{j_0}$, and the probability that every treatment is stopped for futility by the control's stage $j_0 - 1$. Recruitment to the control treatment cannot stop until either one or more null hypotheses are rejected or all the active treatments have had at least one stage. Therefore, in this calculation only the stages after every treatment has been added to the trial need to be considered, so $s$ is defined as $s = \max(S)$. Using this gives, for $j_0 > s$,

and for $j_0 \le s$ gives $\Psi_{j_0} = 0$ and $\Psi_0 = 0$.

2. The probability that the control treatment finishes at each stage because the trial is stopped, given that a null hypothesis is rejected. This is calculated as the difference between the probability that at least one null hypothesis is rejected by the control treatment's $j_0$th stage, denoted by $\Upsilon_{j_0}$, and the probability that at least one null hypothesis is rejected by the control's stage $j_0 - 1$.

where $\ddot{U}_{k,j}(\theta_k) = (u_{k,1}(\theta_k), \dots, u_{k,j-1}(\theta_k), u_{k,j}(\theta_k))$, with $\Upsilon_0 = 0$.

3. The probability that treatment $k$ stops at each of its stages $J$ because at least one other treatment's null hypothesis has been rejected at this stage, denoted by $\Lambda_{k,J}$, where

where $L_{k,j}(\theta_k) = (l_{k,1}(\theta_k), \dots, l_{k,j-1}(\theta_k), l_{k,j}(\theta_k))$ and

with k ,0 = 1.

4. The probability that treatment $k$ stops at each of its stages $J$ because only $H_{0k}$ is rejected, or because no null hypotheses are dropped at this stage and treatment $k$ is stopped as its test statistic drops below its lower boundary for that stage, denoted by $\Xi_{k,J}$.

Using the probabilities calculated above the expected sample size is:

(Ξ k ,J + Λ k ,J )n k ,J .
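The way the four sets of probabilities combine can be sketched numerically. The sketch reads items 1 to 4 as stage-wise stopping probabilities for each arm: for the control, the sum of the two differences described in items 1 and 2, and for an active treatment, the sum of the probabilities in items 3 and 4. It further assumes that the sample sizes are cumulative per arm. The function name and all numbers below are hypothetical.

```python
def expected_sample_size(stop_probs, cumulative_n):
    """Expected total sample size from per-arm stopping probabilities.

    stop_probs[arm][j]   : probability that `arm` stops recruiting at its (j+1)th stage
    cumulative_n[arm][j] : patients recruited to `arm` by the end of that stage
    """
    total = 0.0
    for arm, probs in stop_probs.items():
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError(f"stopping probabilities for {arm} must sum to 1")
        total += sum(p * n for p, n in zip(probs, cumulative_n[arm]))
    return total


# Hypothetical stopping probabilities, for illustration only.
stop_probs = {
    "control": [0.0, 0.5, 0.5],
    "treatment_1": [0.4, 0.3, 0.3],
    "treatment_2": [0.5, 0.5],  # added later, so it has fewer stages
}
cumulative_n = {
    "control": [46, 92, 138],
    "treatment_1": [46, 92, 138],
    "treatment_2": [46, 92],
}

expected_n = expected_sample_size(stop_probs, cumulative_n)
```

Working arm by arm in this way avoids enumerating every joint outcome of the trial, at the cost of not producing the full distribution of the total sample size.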

Web Appendix C Table of results for Setting 1 based on the motivating trial

As was done for Setting 2 in Table 1 of the main paper, Web Table 1 shows the results of the different comparison approaches given in Section 3.2. For the expected duration until the MAMS trial finishes, it is now assumed that the trial does not begin until the beginning of the second stage for Setting 1. Therefore 152 patients have already been recruited, which equals 7.2 months.

Web Table 1. The results of the triangular stopping boundary shape on the design configuration for the motivating FLAIR trial under Setting 1 and three competing approaches: running each trial separately, using the MAMS design of Magirr et al. (2012), and using the naive MAMS approach.

is the expected sample size and trial duration under the null and under the LFC for treatment k, respectively.

Web Appendix D Tables of results based on the motivating trial for different stopping boundaries

In Web Table 2 and Web Table 3 the results for the different combinations of stopping boundary shapes are shown for Settings 1 and 2, respectively. The stopping boundary shapes considered here are Pocock (Pocock, 1977), O'Brien and Fleming (O'Brien and Fleming, 1979) and triangular stopping boundaries (Whitehead, 1997). However, for both the Pocock and the O'Brien and Fleming boundary shapes the symmetric futility boundary may be too stringent a requirement for dropping ineffective treatments; therefore the simple alternative $l_{k,j} = 0$ for $j < J_k$ is used. The tables give the upper and lower stopping boundaries, with the top row being the boundaries for the first active treatment added and the second row those for the second active treatment.
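The three boundary shapes can be sketched as functions of a single constant. The parameterisation below follows the conventions commonly used for MAMS designs, as far as they are understood here, and is an assumption rather than the exact form used for the tables; the function name, the constant `c` and the stage ratios are hypothetical. In practice `c` would be found by a search so that the FWER is controlled.

```python
import math


def stopping_boundaries(shape, c, ratios, zero_futility=False):
    """Upper and lower boundaries for one treatment.

    shape         : "pocock", "obf" or "triangular"
    c             : boundary constant (hypothetical value in this sketch)
    ratios        : cumulative sample size at each stage relative to stage 1
    zero_futility : use l_j = 0 before the final stage instead of symmetric bounds
    """
    r_max = ratios[-1]
    if shape == "pocock":
        upper = [float(c)] * len(ratios)
    elif shape == "obf":
        upper = [c * math.sqrt(r_max / r) for r in ratios]
    elif shape == "triangular":
        upper = [c * (1 + r / r_max) / math.sqrt(r) for r in ratios]
    else:
        raise ValueError(f"unknown shape: {shape}")

    if shape == "triangular":
        lower = [-c * (1 - 3 * r / r_max) / math.sqrt(r) for r in ratios]
    elif zero_futility:
        lower = [0.0] * (len(ratios) - 1) + [upper[-1]]
    else:
        # Symmetric futility bounds; the bounds meet at the final stage.
        lower = [-u for u in upper[:-1]] + [upper[-1]]
    return upper, lower


tri_upper, tri_lower = stopping_boundaries("triangular", 2.0, [1, 2, 3])
poc_upper, poc_lower = stopping_boundaries("pocock", 2.0, [1, 2, 3], zero_futility=True)
obf_upper, obf_lower = stopping_boundaries("obf", 2.0, [1, 2, 3])
```

In every case the lower and upper boundaries coincide at the final stage, so a decision is always reached by the last analysis.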

Web Table 2 . The results of different stopping boundary shapes on the design configuration for the example trial under Setting 1. 

is the expected sample size and trial duration under the null and under the LFC for treatment k, respectively.

Web Table 3. The results of different stopping boundary shapes on the design configuration for the example trial under Setting 2.

Web Appendix E Distribution of sample size

The distribution of the total sample size and the sample size of each treatment when the triangular stopping boundaries are used for Setting 1 under the global null is given in Web Figure 1 with the probability mass function for the total sample size given in Equation (Web Appendix E.1).

In this example the triangular stopping boundaries for Setting 1 give an interquartile range of 308 to 384 and a median of 308 under the global null for the total sample size.

Distribution of the total sample size Setting 1 =

Web Appendix F Robustness to the timing of the actual adding for Setting 1

The same 3 approaches, as given in Section 4, to adding the second treatment earlier or later are studied here for Setting 1 as can be seen in Web Figure 

Designs for adding a treatment arm to an ongoing clinical trial

Adding experimental treatment arms to Multi-Arm Multi-Stage platform trials in progress

Adding new experimental arms to randomised clinical trials: Impact on error rates

Adding a treatment arm to an ongoing clinical trial: a review of methodology and practice

Multiple testing problems in pharmaceutical statistics

A Multiple Comparison Procedure for Comparing Several Treatments with a Control

Adaptive Designs for Clinical Trials of Drugs and Biologics Guidance for Industry

mvtnorm: Multivariate Normal and t Distributions

A platform trial in practice: adding a new experimental research arm to the ongoing confirmatory FLAIR trial in chronic lymphocytic leukaemia

Considerations on covariates and endpoints in multi-arm multistage clinical trials selecting all promising treatments

The R package MAMS for designing multi-arm multi-stage clinical trials

Statistical consideration when adding new arms to ongoing clinical trials: the potentials and the caveats

Including non-concurrent control patients in the analysis of platform trials: is it worth it?

To add or not to add a new treatment arm to a multiarm study: A decision-theoretic framework

A generalized Dunnett test for multi-arm multi-stage clinical studies with treatment selection

Adaptive Clinical Trials: A Partial Remedy for the Therapeutic Misconception?

A Multiple Testing Procedure for Clinical Trials

Group Sequential Methods in the Design and Analysis of Clinical Trials

Designed Extension of Studies Based on Conditional Power

R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing

Efficient Adaptive Designs for Clinical Trials of Interventions for COVID-19

Two-stage selection and testing designs for comparative clinical trials

Some recommendations for multi-arm multi-stage trials

Optimal design of multi-arm multi-stage trials

Correcting for multiple-testing in multi-arm trials: is it necessary and is it done?

The Design and Analysis of Sequential Clinical Trials

This report is independent research supported by the National Institute for Health Research (NIHR300576). The views expressed in this publication are those of the authors and not necessarily those of the NHS, the National Institute for Health Research or the Department of Health and Social Care (DHSC). PM and TJ also received funding from UK Medical Research Council (MC_UU_00002/14). This paper is based on work completed while PG was part of the EPSRC funded STOR-i centre for doctoral training (EP/S022252/1).

Web Appendix G The effect of a larger $\theta_0$ when using triangular stopping boundaries

For this example the same variables are used as those discussed in Section 3, apart from $\theta_0 = -\log(0.80)$. The main focus of this section is to show how this larger $\theta_0$ can result in the third approach to adding treatments earlier or later, as discussed in Section 4, failing to control the power of all the treatments when the treatment is added earlier. The stopping boundaries, sample size and expected sample size for both settings are given in Web Table 4. The effect of adding the treatment earlier or later than planned using the three approaches discussed in Section 4 is given in Web Figure 3 and Web Figure 4 for Settings 1 and 2, respectively. Both figures now show that the power under the LFC for Treatment 1 is no longer controlled for any of the approaches when the second treatment is added earlier.

Web Table 4. The results of the triangular stopping boundaries on the design configuration for the example trial with $\theta_0 = -\log(0.80)$ for both settings.