title: A Bayesian group sequential schema for ordinal endpoints
authors: Zhong, Chengxue; Pan, Haitao; Miao, Hongyu
date: 2021-08-14

Ordinal endpoints are prevalent in clinical studies. For example, in COVID-19 trials the most commonly used endpoint was a 7-point ordinal scale. Another example arises in phase II cancer studies, where efficacy is often assessed as an ordinal variable based on the level of response of solid tumors, with four categories: complete response, partial response, stable disease, and progression, although a dichotomized approach is often used in practice. Nevertheless, designs tailored to ordinal endpoints remain scarce, with Whitehead et al. (1993, 2017) and Jaki et al. (2003) among the few exceptions. In this paper, we propose a generic group sequential schema based on Bayesian methods for ordinal endpoints. It comprises three designs, based on the proportional-odds (PO) model, the non-proportional-odds (NPO) model, and a PO/NPO switch model, which makes the proposed methods generic enough to handle a variety of scenarios. We conducted extensive simulations to demonstrate the desirable performance of the proposed methods, and an R package, BayesOrdDesign, has also been developed.

Besides continuous, binary, and time-to-event outcomes, ordinal endpoints, which are essentially ordered categorical values, also occur frequently in clinical studies (Desai & Gyawali, 2020; Nunn et al., 2016; Peterson et al., 2019). For example, during the COVID-19 pandemic, more than 1,000 clinical trials were registered to assess the efficacy of potential therapeutic drugs. In these drug trials, the most commonly used endpoint was a 7-point ordinal scale, with a smaller number representing a higher severity of disease (Gordon et al., 2021). Another example is the efficacy response defined by the Response Evaluation Criteria in Solid Tumors (RECIST), which provides a consistent mechanism for evaluating anticancer treatments (Eisenhauer et al., 2009). RECIST is based on examining changes in tumor size, and each tumor lesion falls into one of four ordered categories: complete response (CR), partial response (PR), stable disease (SD), and progressive disease (PD); the efficacy response is therefore also an ordinal endpoint per se. In practice, however, ordinal endpoints are often converted to binary endpoints via dichotomization purely for simplification, even though the merits of using ordinal endpoints instead of dichotomized binary endpoints have been demonstrated in the literature; for instance, one may plan a study with a more feasible sample size and assess the benefits of an intervention at a higher resolution (Peterson et al., 2019; Whitehead, 1993). On the other hand, the ordinal endpoint is also accepted in some routine clinical practices. For instance, the Breast [...] Since existing approaches were primarily developed for binary endpoints, there is an urgent need to develop novel trial designs for general ordinal endpoints. The few design methods that have been proposed for ordinal endpoints were all developed from the frequentist perspective.
For example, Whitehead (1993) introduced a sample size formula for ordinal responses and discussed the effects of stratification and misclassification on sample size calculation, with the odds ratio (OR) as the effect measure. In addition, these existing frequentist methods mainly adopt the proportional odds (PO) model to design the study, which requires the restrictive assumption that a single OR applies across the different categories; an example is the generic ordinal sequential trial (GOST) design proposed by Whitehead and colleagues (2017). If the PO assumption holds, these methods have desirable properties, such as control of the type I and type II error rates; if the PO assumption is inappropriate, however, the design properties will be compromised. To our knowledge, only the study by Murray et al. (2018) has considered the non-proportional odds assumption for clinical trial design. Specifically, the authors proposed a utility-based Bayesian design for ordinal outcomes that considered both the proportional odds (PO) model and the non-proportional odds (NPO) model. In practice, however, the question of how to choose between the PO and NPO models, and how to incorporate this uncertainty into the design stage, remains unsolved, and the method proposed by Murray et al. did not address it. The PO assumption may not hold in practice, yet the NPO model can be difficult to fit because more parameters are involved. Moreover, at the design stage, the sample size is very likely to be overestimated or underestimated, since there are no actual data to determine which model is correct.

In this paper, we propose three versions of a two-arm Bayesian group sequential design (GSD) for ordinal outcomes: the PO, NPO, and PO/NPO switch model-based designs. The PO model-based design is a Bayesian GSD for the case in which the proportional odds assumption holds, while the NPO model-based design applies when this assumption fails to hold. To tackle the uncertainty of choosing between the PO and NPO models prior to a study, we propose a third design, the PO/NPO switch model-based design, which proceeds as follows. First, a Bayesian method is developed to allow NPO model fitting, and a utility-based comparative criterion is applied to assist with interpreting the model's results. Second, a reversible-jump Markov chain Monte Carlo (RJMCMC) algorithm is adopted to allow an automatic switch between the PO and NPO structures after interim data have accumulated, so that the sample size required for the subsequent recruitment stages can be reduced. In brief, our proposed designs can not only handle the PO or NPO model assumption, but can also examine whether the PO or the NPO model is more suitable based on the interim data when we are unsure a priori which model to use (the PO/NPO switch model-based design).

The remainder of the paper is organized as follows. Section 2 introduces the notation and the proposed PO, NPO, and PO/NPO switch models, with emphasis on the utility-based comparative criterion for the NPO model and on RJMCMC for the PO/NPO model switch. Section 3 examines the operating characteristics by simulation. Section 4 briefly introduces how to use the developed R package BayesOrdDesign to implement the proposed designs. Section 5 ends with some concluding remarks.

Let $Y = (Y_1, Y_2, \ldots, Y_n)'$ and $X = (X_1, X_2, \ldots, X_n)'$ be the observed ordinal outcomes in the control and treatment groups, respectively, which can be conceived of as originating from the unobserved (latent) variables $Y^*$ and $X^*$.
To be specific, let $Y^*$ and $X^*$ be the unobserved continuous outcomes in the control and treatment groups, respectively, where $Y^* = (Y_1^*, Y_2^*, \ldots, Y_n^*)'$ and $X^* = (X_1^*, X_2^*, \ldots, X_n^*)'$. Assume that the latent variables follow a logistic distribution with standard deviation $\sigma$, and that the corresponding linear regression models are

$$Y_i^* = \mu + \varepsilon_i, \quad \varepsilon_i \sim \text{Logistic}(0, \sigma^2), \quad i = 1, \ldots, n, \qquad (1)$$
$$X_i^* = \mu + \Delta + \varepsilon_i, \quad \varepsilon_i \sim \text{Logistic}(0, \sigma^2), \quad i = 1, \ldots, n,$$

where the sample sizes of the control and treatment groups are assumed to be equal (i.e., 1:1 allocation) and the variances of the two groups are assumed to be identical for simplicity. Assume that the observed ordinal variables $Y = (Y_1, Y_2, \ldots, Y_n)'$ and $X = (X_1, X_2, \ldots, X_n)'$ originate from the aforementioned unobserved variables $Y^*$ and $X^*$. The latent thresholds $\gamma_j$ $(1 \le j \le k-1)$ and $\theta_j$ $(1 \le j \le k-1)$ partition the values of $Y^*$ and $X^*$ into the observed ordinal categories of $Y$ and $X$. Formally, the correspondence between the latent variables and the observed outcomes is defined as

$$Y_i = j \ \text{ if } \ \gamma_{j-1} < Y_i^* \le \gamma_j, \qquad X_i = j \ \text{ if } \ \theta_{j-1} < X_i^* \le \theta_j, \qquad j = 1, \ldots, k,$$

where the boundaries $\gamma_j$ and $\theta_j$ are unknown, $-\infty = \gamma_0 < \gamma_1 < \cdots < \gamma_{k-1} < \gamma_k = \infty$, and similarly for the $\theta_j$. We let $\gamma = (\gamma_1, \ldots, \gamma_{k-1})$ and $\theta = (\theta_1, \ldots, \theta_{k-1})$ denote the vectors of thresholds. For the control group, assume that $\varepsilon$ has a continuous distribution function $F(\cdot)$ whose range lies in $(-\infty, \infty)$:

$$\Pr(Y_i \le j \mid \mu) = F(\gamma_j - \mu), \qquad j = 1, \ldots, k-1,$$

where $\Pr(Y_i \le j \mid \mu)$ is the cumulative probability for the control group. The probability $\Pr(Y_i = j \mid \mu)$, which is of primary interest, can be computed from

$$\Pr(Y_i = j \mid \mu) = F(\gamma_j - \mu) - F(\gamma_{j-1} - \mu).$$

Similarly, for the unobserved ordinal variables in the treatment group, the probability $\Pr(X_i = j \mid \mu, \Delta)$ and the cumulative probability $\Pr(X_i \le j \mid \mu, \Delta)$ can be derived from

$$\Pr(X_i \le j \mid \mu, \Delta) = F(\theta_j - \mu - \Delta), \qquad \Pr(X_i = j \mid \mu, \Delta) = F(\theta_j - \mu - \Delta) - F(\theta_{j-1} - \mu - \Delta).$$

If we assume that $\gamma_j = \theta_j$ for $j = 1, \ldots, k-1$, then

$$\operatorname{logit} \Pr(Y_i \le j \mid \mu) - \operatorname{logit} \Pr(X_i \le j \mid \mu, \Delta) = \Delta, \qquad j = 1, \ldots, k-1,$$

where $\Delta$ represents the log OR between the two groups; this also implies that the treatment effect is constant regardless of the outcome scale. Therefore, if the proportional odds (PO) assumption holds, the hypotheses are

$$H_0: \Delta = 0 \quad \text{vs.} \quad H_1: \Delta < 0.$$

The null hypothesis $H_0$ implies group mean equality, and the alternative suggests an effective treatment (a higher category value indicates a less desirable outcome). The model described above is formulated in a Bayesian framework and fitted using Markov chain Monte Carlo (MCMC). Before describing our design, we define the comparative criterion used for interim monitoring and the final decision:

$$r = \Pr(\Delta < 0 \mid \text{data}).$$

The proposed Bayesian design is described below for the case of one interim look and one final look. Herein, $n$ patients are recruited into each of the two groups, with a 1:1 allocation ratio, at every look.

1. Enroll $2n$ patients and randomize them to the treatment or control group.
2. Given the first-stage data $Y_1 = (y_{1,1}, \ldots, y_{1,n})$ and $X_1 = (x_{1,1}, \ldots, x_{1,n})$, compute $r$. If $r < c_f$, terminate the trial early due to futility, where $c_f$ is the cutoff point for futility stopping.
3. Otherwise, at the second stage, continue to enroll an additional $2n$ patients and randomize them to the treatment or control group.
4. Once the maximum sample size is reached, calculate $r$ based on all the two-stage data. If $r > c_s$, where $c_s$ is the cutoff point for superiority, conclude that the treatment is superior; otherwise, conclude that the treatment is not effective.

To guarantee good frequentist operating characteristics of the design, the futility and superiority cutoff points $c_f$ and $c_s$ need to be calibrated to achieve desirable type I and type II error rates. An optimal search over candidate cutoff points is used in the simulation-based calibration step.
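To make the two-stage decision rule above concrete, the following R sketch simulates a single trial under the PO model-based design. It is a minimal illustration under stated assumptions rather than the paper's implementation: the posterior probability $\Pr(\Delta < 0 \mid \text{data})$ is approximated by the asymptotic normal distribution of the maximum-likelihood estimate from MASS::polr instead of a full MCMC fit, the helper names simulate_arm(), post_prob_delta_neg(), and run_two_stage_trial() are hypothetical, and the cutoffs $c_f = 0.2$ and $c_s = 0.9$ are illustrative rather than calibrated values.

library(MASS)

# Simulate n ordinal outcomes (levels 1, ..., k) for one arm from given probabilities.
simulate_arm <- function(n, probs, arm_label) {
  data.frame(
    y   = factor(sample(seq_along(probs), n, replace = TRUE, prob = probs),
                 levels = seq_along(probs), ordered = TRUE),
    arm = arm_label
  )
}

# Approximate Pr(Delta < 0 | data) using the normal approximation to the
# posterior implied by the polr point estimate and its standard error.
post_prob_delta_neg <- function(dat) {
  fit <- polr(y ~ arm, data = dat, method = "logistic", Hess = TRUE)
  est <- unname(coef(fit)["arm"])
  se  <- sqrt(vcov(fit)["arm", "arm"])
  pnorm(0, mean = est, sd = se)
}

run_two_stage_trial <- function(n, p_ctrl, p_trt, c_f = 0.2, c_s = 0.9) {
  ## Stage 1: enroll 2n patients (n per arm) and evaluate the futility rule.
  dat1 <- rbind(simulate_arm(n, p_ctrl, arm_label = 0),
                simulate_arm(n, p_trt,  arm_label = 1))
  r1 <- post_prob_delta_neg(dat1)
  if (r1 < c_f) {
    return(list(decision = "stop early for futility", r = r1))
  }
  ## Stage 2: enroll an additional 2n patients and make the final decision.
  dat2 <- rbind(dat1,
                simulate_arm(n, p_ctrl, arm_label = 0),
                simulate_arm(n, p_trt,  arm_label = 1))
  r2 <- post_prob_delta_neg(dat2)
  list(decision = if (r2 > c_s) "treatment superior" else "treatment not effective",
       r = r2)
}

## Example: 6-level outcome (level 1 = best); the treatment shifts mass toward
## the more desirable levels.
p_ctrl <- c(0.10, 0.15, 0.20, 0.25, 0.20, 0.10)
p_trt  <- c(0.20, 0.25, 0.20, 0.15, 0.12, 0.08)
set.seed(123)
run_two_stage_trial(n = 80, p_ctrl = p_ctrl, p_trt = p_trt)

Repeating run_two_stage_trial() many times under null and alternative scenarios yields Monte Carlo estimates of the type I error rate, power, and early-termination probability, which is the quantity over which a cutoff calibration could search.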
In general, without the assumption of equality between $\gamma_j$ and $\theta_j$ for $j = 1, \ldots, k-1$, we assume the following generic (NPO) model:

$$\operatorname{logit} \Pr(Y_i \le j \mid \mu) = \gamma_j - \mu, \qquad \operatorname{logit} \Pr(X_i \le j \mid \mu, \Delta_j) = \theta_j - \mu - \Delta_j, \qquad j = 1, \ldots, k-1,$$

in which the treatment effect $\Delta_j$ is allowed to vary across outcome categories. The prior distributions for $\mu$, $\gamma$, and $\theta$ are similar to those described above; the mean and variance of the prior distribution of $\Delta_j$ can be estimated from historical information using, for example, the R package ordinal (Christensen, 2019). Under the NPO structure, the treatment effect is no longer summarized by a single OR, so comparing two arms is less straightforward. For example, suppose treatment A yields one set of probabilities over three ordered outcome levels, while treatment B gives probabilities (0.4, 0.2, 0.4); it is then unclear whether A is superior to B, since B has higher probabilities of both the Good (best) and Poor (worst) levels than A. We adopt the utility-based comparative criterion because it is a flexible tool that can be applied to evaluate which treatment is better and can be conveniently communicated to clinicians. Specifically, to provide a criterion for deciding whether the treatment is superior, numerical utility values $U(Y = j)$ need to be specified for all levels of the outcome. Then, we can easily compare different treatments using the mean utilities

$$\bar{U}(t) = \sum_{j=1}^{k} U(Y = j)\, \pi_j(t), \qquad \bar{U}(c) = \sum_{j=1}^{k} U(Y = j)\, \pi_j(c),$$

where $\pi_j(t)$ represents the outcome probabilities of the treatment group, $\pi_j(c)$ denotes the outcome probabilities of the control group, and the utility function $U(Y = j)$ denotes the desirability of each response outcome level. In our context, outcome level 1 is defined as the most desirable outcome and outcome level 6 as the least desirable. We can first set, for example, $U(Y = 1) = 100$ and $U(Y = 6) = 0$, and specify utilities for the intermediate levels in consultation with clinicians. With the utility setting, a Bayesian model is used to estimate the outcome level probabilities with unknown parameters and then to calculate model-based mean utility scores. Given this information, the comparative criterion is defined as

$$r = \Pr\{\bar{U}(t) > \bar{U}(c) \mid \text{data}\}.$$

With the proposed comparative criterion, the procedure of the corresponding Bayesian group sequential design parallels that of the PO model-based design described above, with $r$ defined by the utility-based criterion.
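The following R sketch illustrates how this utility-based criterion can be computed. It is a simplified stand-in under stated assumptions, not the paper's NPO model: for illustration the outcome-level probabilities of each arm are given independent Dirichlet posteriors (a conjugate Dirichlet-multinomial model) rather than being estimated from the NPO regression, the function name utility_criterion() is hypothetical, and the utility vector matches the one used in the package example later in the paper.

# Approximate Pr(U(t) > U(c) | data) from observed category counts, assuming
# independent Dirichlet posteriors for each arm's category probabilities.
utility_criterion <- function(counts_trt, counts_ctrl,
                              u = c(100, 80, 65, 25, 10, 0),
                              prior = rep(1, length(u)),
                              n_draws = 10000) {
  ## Dirichlet draws via independent gammas, one row per posterior draw.
  rdirichlet <- function(n, alpha) {
    g <- matrix(rgamma(n * length(alpha), shape = alpha),
                nrow = n, byrow = TRUE)
    g / rowSums(g)
  }
  p_trt  <- rdirichlet(n_draws, counts_trt  + prior)
  p_ctrl <- rdirichlet(n_draws, counts_ctrl + prior)
  ## Mean utility for each draw: U = sum_j u_j * pi_j.
  u_trt  <- as.vector(p_trt  %*% u)
  u_ctrl <- as.vector(p_ctrl %*% u)
  mean(u_trt > u_ctrl)
}

## Example: observed category counts after the first stage (level 1 = best).
counts_trt  <- c(30, 25, 20, 12, 8, 5)
counts_ctrl <- c(18, 20, 22, 18, 12, 10)
set.seed(123)
utility_criterion(counts_trt, counts_ctrl)

In the group sequential design, this posterior probability plays the same role as $r$ in the PO-based design: it is compared with the futility cutoff at the interim look and with the superiority cutoff at the final analysis.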
Thus far, the relationship between the response variable and the treatment indicator has been modeled with either a PO structure or an NPO structure. However, sufficient information is often lacking to determine which of these two models is more appropriate. Here, we propose a method based on the reversible-jump MCMC (RJMCMC) algorithm, which determines the appropriate model structure from the patients enrolled at the interim stages. We designate the proposed method as the PO/NPO switch model-based design, which has two advantages over the previous two designs: (i) the sample size for the second stage can be re-estimated after the interim analysis, which may save sample size; and (ii) the RJMCMC approach can select a more appropriate model at the interim analysis and thereby possibly increase the power of the design.

We briefly describe the RJMCMC procedure. As outlined by Green (1995), a key step in RJMCMC is to build bijective functions describing the relationships between the parameters of different models. When the dimensions of two models differ, supplemental parameters are introduced to keep the number of parameters consistent. For instance, if model $M_1$ has 5 parameters and model $M_2$ has only 1, then mapping $M_2$ to $M_1$ requires a bijective function with a supplemental variable vector $u$, where $u$ is 4-dimensional and is not connected with $M_2$; it serves only as a placeholder for $M_2$. Thus, building connections among various models requires specifying $\binom{K}{2}$ bijective functions, where $K$ is the number of candidate models; when $K$ is large, this can become a computational challenge. Therefore, to simplify and demystify RJMCMC, Barker and Link (2013) proposed a universal parameter called the "palette," from which the parameters of all models can be calculated. Specifically, the model-specific parameters, together with the supplemental variables, can be converted to the palette through a known bijective mapping $g_k(\theta_k, u_k) = \psi$, where $\theta_k$ denotes the parameters of model $M_k$. With the universal parameter in hand, the number of bijective functions is reduced from $\binom{K}{2}$ to $K$ (here, 2), as there is no need to develop a bijective function for each pair of models. Crucially, this implementation is a post-processing algorithm, which means that samples from the model-specific posteriors can be used directly once the candidate models are fitted. To this aim, we consider two models: the NPO model ($M_1$) and the PO model ($M_2$). Assuming 6-level ordinal endpoints, we construct a 5-dimensional palette $\psi = (\psi_1, \ldots, \psi_5)$. In the NPO model, the parameter vector $\Delta = (\Delta_1, \ldots, \Delta_5)$ is linked to $\psi$; in the PO model, the single $\Delta$ is associated with a weighted average of the palette components. Our RJMCMC sampler for the PO/NPO switch proceeds as follows.

1. Given the current value of the model indicator, sample $\psi$:
   - Under $M_1$, sample $\Delta_j$ from $[\Delta_j \mid M_1, \text{data}]$ for $j = 1, \ldots, 5$, and compute $\psi = (\Delta_1, \ldots, \Delta_5)'$.
   - Under $M_2$, sample $\Delta$ from $[\Delta \mid M_2, \text{data}]$ and $u$ from $[u \mid M_2]$, and compute $\psi = g_2^{-1}(\Delta, u_1, \ldots, u_4)'$.
2. Given the current value of $\psi$, sample the model indicator; we use Eqn. (5) for this step.

Because the model structure is re-selected after the first stage, at the second interim look it is possible that a smaller sample size will be selected based on the RJMCMC result from the first-stage data. Similar to the proposed PO and NPO model-based designs, to guarantee good frequentist operating characteristics, the futility and superiority cutoff points $c_f$ and $c_s$ also need to be calibrated to control the type I and type II error rates. A grid search over cutoff points is used in the simulation-based calibration step.

[Figure 1]

In this simulation study, for each of the three proposed designs, we consider a seamless phase II/III trial with an ordinal endpoint. The trial has one control group and one experimental treatment group, to which patients are equally randomized. Under the null hypothesis, we assume that the OR is equal to 1. The overall type I error rate under each model-based design is computed from 1,000 simulations. Under the alternative hypothesis, we examine various scenarios (see Table 1) in 1,000 simulated trials using the proposed designs. We first generate the first-stage ordinal outcomes from the specified category probabilities of each group; a sketch of this data-generation step is given below. To evaluate the proposed designs, we considered the following three metrics: (i) the percentage of early termination (PET), defined as the percentage of trials that are terminated early; (ii) the percentage of rejecting the null (PRN) hypothesis, defined as the percentage of simulated trials in which $H_0$ is rejected; the PRN is essentially the type I error rate (or power) when $H_0$ is (or is not) true; and (iii) the average sample size used over the 1,000 simulated trials. We constructed eight scenarios to cover different types of data structure. To compare the power of the three proposed models under different data structures, we generated two types of observations for the treatment group, one following the PO structure and the other the NPO structure, both with the same control group. Specifically, under the proportional odds assumption, we generate a series of treatment group observations based on ORs ranging from 1 to 2, and the three models are fitted to evaluate their power. Similarly, without the proportional odds assumption, observations are generated based on $OR_1 = OR_2 = 1.5$ and $OR_3 = OR_4 = OR_5$, with the last three OR values ranging from 1 to 1.4 in increments of 0.05. Meanwhile, the utility function is applied to compute the corresponding mean utilities and their differences for all scenarios (see Table 1). Higher numerical utilities reflect higher efficacy of the treatment group relative to the control group.
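The following R sketch shows one way such scenario data can be generated: given the control-arm category probabilities and a vector of cumulative odds ratios (all equal under the PO structure, unequal under the NPO structure), the treatment-arm probabilities are obtained on the cumulative-odds scale and then sampled. This is a sketch under assumptions, not the package's internal code: the helper names probs_from_ors() and generate_stage() are hypothetical, the convention that an OR greater than 1 favors the treatment arm is one common choice, and the control distribution shown is the one used in the package example in Section 4.

# Convert control-arm category probabilities and per-cut-point odds ratios
# into treatment-arm category probabilities.
probs_from_ors <- function(p_ctrl, ors) {
  ## Cumulative probabilities at the k - 1 cut points of the control arm.
  cum_ctrl <- cumsum(p_ctrl)[-length(p_ctrl)]
  ## Multiply the cumulative odds by the (possibly category-specific) ORs.
  odds_trt <- ors * cum_ctrl / (1 - cum_ctrl)
  cum_trt  <- c(odds_trt / (1 + odds_trt), 1)
  ## With unequal ORs, check monotonicity so all probabilities are non-negative.
  stopifnot(all(diff(cum_trt) >= 0))
  diff(c(0, cum_trt))
}

# Generate one stage of outcomes (n patients per arm).
generate_stage <- function(n, p_ctrl, ors) {
  p_trt <- probs_from_ors(p_ctrl, ors)
  list(control   = sample(seq_along(p_ctrl), n, replace = TRUE, prob = p_ctrl),
       treatment = sample(seq_along(p_trt),  n, replace = TRUE, prob = p_trt))
}

## PO-type scenario: one common OR across all five cut points.
## NPO-type scenario: OR1 = OR2 = 1.5 and OR3 = OR4 = OR5 = 1.2.
p_ctrl <- c(0.58, 0.05, 0.17, 0.03, 0.04, 0.13)
set.seed(123)
po_data  <- generate_stage(100, p_ctrl, ors = rep(1.8, 5))
npo_data <- generate_stage(100, p_ctrl, ors = c(1.5, 1.5, 1.2, 1.2, 1.2))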
Table 2 shows the performance of the three proposed designs based on 1,000 simulated trials under different pairs of $H_0$ and $H_1$. In scenario 1, which is the complete null case, all three designs control the type I error rate near or below 0.05. Tables 2a, 2b, and 2c show the frequentist operating characteristics, computed with the aforementioned metrics, for the PO model-based design, the NPO model-based design, and the PO/NPO switch model-based design, respectively. Table 2a shows an increase in power as the true effect size increases. In Table 2b, the PO assumption is not satisfied; for scenarios 6, 7, and 8, the power of declaring that the treatment group is superior to the control group increases as the mean utility increases. Note that, for simplicity, we refer to both the OR for the PO model and the numeric utility for the NPO model as the effect size in Table 2: in Table 2a, the effect size represents the OR, and in Table 2b it denotes the disparity in numeric utility between the treatment and control groups. Table 2c shows all eight scenarios and the performance of the PO/NPO switch model-based design. The results in Table 2c show that, under the predefined scenarios, the switch model-based design has a higher percentage of rejecting the null (PRN) than the other two model-based designs. For example, if the effect size is 1.8, the PRN is 91% for the PO model-based design (Table 2a) and 96% for the PO/NPO switch model-based design; if the effect size is 3.6, the PRN is 56% for the NPO model-based design (Table 2b) and 75% for the PO/NPO switch model-based design. Note that under the non-null case, the PRN is the power.

When the PO assumption holds, Figure 2 shows the probability of concluding that the treatment group is efficacious plotted against the true value of the OR and against the sample size, respectively. Specifically, the top panel of Figure 2 shows that the power of the PO model-based design increases as the OR increases from 1 to 2; similarly, a larger sample size gives the design more power (bottom panel of Figure 2). The sample sizes in the range of 50 to 190 refer to the sample size for each stage per group. When the PO assumption is violated, the probability of concluding that the treatment group is efficacious is displayed against the sample size for the NPO model-based design (top panel of Figure 2). To evaluate the relationship between the ORs and the model's power, we define $OR = (1, 1, 1, 1, 1)$ as the first (null) scenario and $OR = (1.5, 1.5, 1.5, 1, 1)$ as the second scenario, and increase the remaining ORs in the subsequent scenarios.

The futility and superiority cutoff points are 0.2 and 0.9, respectively, and the type I error rate is 0.04. The recommended sample size is 80 and 80 for the treatment and control groups at each stage, and the corresponding power is 0.82. The above output shows that the code produces a paragraph of results: the futility and superiority cutoff points (0.2 and 0.9) were calibrated to achieve a desirable type I error rate, 0.04 in this case, and in this example the design has a power of 82% with a sample size of 80 and 80 for the treatment and control groups, respectively, at each stage. Since the operating characteristics of the Bayesian two-stage designs need to be compared, we also provide convenient functions for users to easily obtain the power under various scenarios and with different models.
In this two-stage NPO model-based design example, the desirability (utility) of the outcome levels is (100, 80, 65, 25, 10, 0). The multiple effect sizes to be detected, in terms of ORs, are given as a matrix in which each row represents a specific scenario and the elements of each row are the corresponding ORs for the outcome scales. Given that the distribution of clinical categories for the control group is (0.58, 0.05, 0.17, 0.03, 0.04, 0.13) and the required sample size for the treatment and control groups per stage is 100, this is specified with the argument fixed_ss = 100. A total of 1,000 simulated trials are conducted with the argument ntrial = 1000, and the design can then be implemented using the following code:

> library(BayesOrdDesign)
> # Generate a matrix containing various effect sizes. The first row represents
> # the null hypothesis; the second row represents Scenario 2, where
> # OR1 = OR2 = 1.5 and OR3 = OR4 = OR5 = 1; the third row represents Scenario 3,
> # where OR1 = OR2 = 1.5 and OR3 = OR4 = OR5 = 1.05; and so on.
> ors <- rbind(c(1,   1,   1,    1,    1),
+              c(1.5, 1.5, 1,    1,    1),
+              c(1.5, 1.5, 1.05, 1.05, 1.05))
> # The remaining rows follow the same pattern, so that ors is an 8-by-5 matrix.

The argument ors is an 8-by-5 matrix composed of the scenarios that users are interested in. Note that our package allows users to perform statistical inference based on either the frequentist or the Bayesian method; the approach is selected with the method argument, for which the following options are available: (i) frequentist (method = "Frequentist") and (ii) Bayesian (method = "Bayesian").

The ordinal endpoint is commonly used in clinical trials to construct categories of outcome assessments ranked in order of patient status (e.g., from resumption of normal activities to death). The most common regression model for analyzing ordinal data is the proportional odds model, in which the regression parameter is constant across the levels of the response. If this assumption of parallel lines does not hold, an alternative non-proportional odds model is required, in which the regression parameters are allowed to vary with the level of the response. In this study, we proposed three Bayesian group sequential designs for comparing two treatment arms with respect to ordinal endpoints: a proportional odds (PO) model-based design, a non-proportional odds (NPO) model-based design, and a switch model-based design that can perform model re-selection at the interim analysis based on the accumulated data. The results of our simulation study reveal that the design based on the PO/NPO switch model shows desirable frequentist operating characteristics regardless of the data structure. Specifically, when the PO assumption is satisfied, the PO and PO/NPO switch model-based designs have similar probabilities of recommending the efficacious treatment arm; when the PO assumption is not satisfied, the NPO and PO/NPO switch model-based designs have similar probabilities of reaching the correct conclusion. In this paper, we also introduced the R package BayesOrdDesign for estimating the required sample size for the model-specific trial designs. Although the PO/NPO switch model-based design can re-select the model at the interim analysis and potentially save sample size, in this study we applied only a naïve method for sample size re-estimation. A more formal method, for instance one using the predictive probability to re-estimate the sample size based on interim data, has the potential to improve performance; this is left for future research.

For scenarios 1 to 5, the PO assumption holds, and the corresponding odds ratios (ORs) are 1, 1.2, 1.4, 1.6, and 1.8, respectively. The prior distributions for $\psi$ are then calculated as shown below.
$$p(\psi \mid M_1) = p(\Delta_1, \ldots, \Delta_5 \mid M_1) \times \left| \frac{\partial \psi}{\partial (\Delta_1, \ldots, \Delta_5)} \right| = \prod_{j=1}^{5} p(\Delta_j \mid M_1).$$

References
Bayesian multimodel inference by RJMCMC: A Gibbs sampling approach
ordinal - Regression Models for Ordinal Data
Endpoints used in phase III randomized controlled trials of treatment options for COVID-19
New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1)
R package rjmcmc: The calculation of posterior model probabilities from MCMC output
Transdimensional algorithms
Interleukin-6 Receptor Antagonists in Critically Ill Patients with Covid-19 - Writing Committee
Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination
Law (2020) A stochastically curtailed two-arm randomised phase II trial design for binary outcomes
A utility-based design for randomized comparative trials with ordinal outcomes and prognostic subgroups
Analysis of the modified Rankin scale in randomised controlled trials of acute ischaemic stroke: A systematic review
Comparison of an ordinal endpoint to time-to-event, longitudinal, and binary endpoints for use in evaluating treatments for severe influenza requiring hospitalization
Bayesian sequential monitoring design for two-arm randomized clinical trials with noncompliance
Remdesivir in adults with severe COVID-19: a randomised, double-blind, placebo-controlled, multicentre trial
Sample size calculations for ordered categorical data
GOST: A generic ordinal sequential trial design for a treatment trial in an emerging pandemic
Phase II trial design with Bayesian adaptive randomization and predictive probability

For the PO model, $\gamma$ and $\theta$ are identical and only one $\Delta$ is required. Assume that the outcome follows a categorical distribution, a discrete probability distribution describing a random variable that can take on one of six possible categories, with the probability of each category specified separately:

$$Y_i, X_i \mid \mu, \gamma, \theta, \Delta \sim \text{Categorical}(p_{i1}, \ldots, p_{i6}), \qquad i = 1, \ldots, N; \ j = 1, \ldots, 6,$$

where $p_{ij}$ represents the probability of the corresponding categorical value and $\sum_{j=1}^{6} p_{ij} = 1$. For the PO model, the likelihood function for a single observation is

$$\Pr(X_i = j \mid \mu, \gamma, \Delta) \propto \frac{1}{1 + e^{-\gamma_j + \mu + \Delta}} - \frac{1}{1 + e^{-\gamma_{j-1} + \mu + \Delta}}.$$
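As a quick numerical check of this single-observation likelihood, the short R sketch below computes the implied category probabilities for both arms from the cumulative logistic expression above; the threshold values, baseline location, and log odds ratio used here are purely illustrative, and the helper name po_category_probs() is introduced only for this example.

# Category probabilities under the PO model: differences of logistic CDF values
# evaluated at successive thresholds, padded with 0 and 1 at the extremes.
po_category_probs <- function(gamma, mu, delta) {
  cum <- c(0, plogis(gamma - mu - delta), 1)   # Pr(X <= j) at each cut point
  diff(cum)                                    # Pr(X = j), j = 1, ..., k
}

gamma <- c(-2, -1, 0, 1, 2)   # illustrative thresholds for a 6-level outcome
mu    <- 0                    # illustrative baseline (control) location
delta <- -0.5                 # illustrative log OR; delta < 0 favors treatment

p_ctrl <- po_category_probs(gamma, mu, 0)
p_trt  <- po_category_probs(gamma, mu, delta)
rbind(control = p_ctrl, treatment = p_trt)   # each row sums to 1

Each row of the returned matrix sums to 1, and a negative delta shifts probability mass toward the lower (more desirable) outcome levels, consistent with the alternative hypothesis $H_1: \Delta < 0$.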