key: cord-0888763-nw69p3rx
authors: Lin, Dan-Yu; Zeng, Donglin; Eron, Joseph J
title: Evaluating the Efficacy of Therapies in COVID-19 Patients
date: 2020-08-21
journal: Clin Infect Dis
DOI: 10.1093/cid/ciaa1231
sha: ada41faae65cbb95f301fad806ee23fe14d6cd7d
doc_id: 888763
cord_uid: nw69p3rx
There is a proliferation of clinical trials worldwide to find effective therapies for patients diagnosed with novel coronavirus disease-2019 (COVID-19). The endpoints that are currently used to evaluate the efficacy of therapeutic agents against COVID-19 are focused on clinical status at a particular day or on time to a specific change of clinical status. To provide a full picture of the clinical course of a patient and make complete use of available data, we consider the trajectory of clinical status over the entire follow-up period. We also show how to combine the evidence of treatment effects on the occurrences of various clinical events. We compare the proposed and existing endpoints through extensive simulation studies. Finally, we provide guidelines on establishing the benefits of treatments.
Several studies have recently been completed and many more are currently underway or in the planning stages to investigate the efficacy and safety of therapeutic agents in patients diagnosed with COVID-19. A clinical trial of lopinavir/ritonavir (LPV/r) on adult patients hospitalized with severe COVID-19 was completed with unprecedented speed; 1 clinical trials of remdesivir on a spectrum of COVID-19 patients have just concluded or are still ongoing; 2−4 and WHO and partners recently launched SOLIDARITY, a global mega-trial of remdesivir, LPV/r, interferon beta-1a, chloroquine, and hydroxychloroquine. 5 Table 1 shows six remdesivir trials registered on ClinicalTrials.gov. The Capital Medical University in China has conducted two of those trials, one in patients with mild/moderate COVID-19 and one in patients with severe COVID-19. 
2 Gilead Sciences has also conducted two trials, one in patients with moderate disease and one in patients with severe disease. 4 In addition, NIAID has conducted a trial of remdesivir 3 and is now evaluating the combination of baricitinib and remdesivir (ClinicalTrials.gov number: NCT04401579). Finally, INSERM is conducting a trial of remdesivir, LPV/r, interferon beta-1a, and hydroxychloroquine.
The efficacy of a therapeutic agent is assessed mainly in terms of the primary endpoint used in a clinical trial. Table 1 shows the primary endpoints adopted by the aforementioned six remdesivir trials. The primary endpoints are quite different among these trials, even for patients with similar disease severity at enrollment. Combining data from these trials would enable a more accurate assessment of the efficacy of remdesivir than separate analyses, but data from studies with different endpoints cannot be efficiently combined. Without a common endpoint, it would also be difficult to compare the efficacy of remdesivir with that of other agents in the future.
The currently used primary endpoints are focused on clinical status at a particular day or on time to a specific improvement of clinical status. To fully represent important clinical outcomes and make efficient use of available data, we propose using the entire clinical course of a patient to assess the efficacy of COVID-19 therapy. Specifically, we evaluate the effect of treatment on the clinical-status trajectory over the follow-up period by regarding daily ratings of clinical status as repeated measures of this trajectory. In addition, we combine the evidence of treatment effects on all levels of improvement of clinical status over time, as well as all levels of deterioration, including critical illness and death. 
Finally, we demonstrate the advantages of the proposed methods over the existing ones through extensive simulation studies using empirical data from recently completed COVID-19 trials. 1−3 The clinical status of a COVID-19 patient is commonly rated on a seven-category ordinal scale: 1, not hospitalized, with resumption of normal activities; 2, not hospitalized, but unable to resume normal activities; 3, hospitalized, not requiring supplemental oxygen; 4, hospitalized, requiring supplemental oxygen; 5, hospitalized, requiring nasal high-flow oxygen therapy, noninvasive mechanical ventilation, or both; 6, hospitalized, requiring extracorporeal membrane oxygenation (ECMO), invasive mechanical ventilation, or both; and 7, death. 1,6−8 This severity-rating system was adopted by the Chinese LPV/r trial 1 and the INSERM trial. It was also adopted by the Chinese remdesivir trial, although the two outpatient categories were merged into one. 2 NIAID also adopted this seven-category scale but divided Category 3 further to indicate whether or not ongoing medical care is required. 3 Gilead used the severity-rating scale of NIAID but merged the two outpatient categories. 4 In the Chinese trials of LPV/r and remdesivir, 1,2 the primary endpoint is time from randomization to clinical improvement, which is defined as a decline of two categories of severity (from status at randomization) or live discharge from the hospital, whichever occurs first. The primary endpoint in the INSERM trial is distribution of severity rating at day 15. NIAID also adopted this endpoint but later changed its primary endpoint to time to recovery, which is defined as hospital discharge or not requiring ongoing medical care. 3 The primary endpoints for the two Gilead trials were changed from hospital discharge by day 14 and normalization of fever and oxygen saturation by day 14 to distribution of severity rating at days 11 and 14, respectively. 
4 Notably, the primary endpoint in each of the six trials captures only part of the clinical course of a patient.
Rather than focusing on a specific change in the severity rating over time or the severity rating at a particular day, it is less arbitrary and more comprehensive to consider the severity-rating trajectory over the follow-up period. This endpoint encapsulates the entire clinical course of a patient and represents all available clinical data. To prove the concept, we adopt the seven-category severity rating system used in the Chinese LPV/r trial and the INSERM trial and define "recovery" as hospital discharge.
[Figure caption fragment: [...] patients. In each of the four plots, the even-numbered patient has a higher severity-rating curve than the odd-numbered patient; however, the two patients in each plot have the same [...]]
We can characterize the treatment effect on severity rating at a particular day by the difference of the mean severity ratings between treatment and control. Thus, we consider average severity rating, the sum of a patient's daily severity ratings over the follow-up [...]
We can also characterize the treatment effect on severity rating at a particular day by the odds ratio of lower severity (i.e., the odds of falling into or below a severity-rating category versus falling above it for the treatment group divided by that of the control group) under the proportional odds model; 10 see Supplementary Appendix S.1. If we assume that the odds ratio of lower severity is constant over the follow-up period of interest, then we can estimate or test this common odds ratio by applying GEE to the ordinal daily severity ratings. Although the assumption of a common odds ratio may not hold when treatment is effective, this formulation yields a nonparametric test of the null hypothesis that treatment has no effect on the distribution of severity rating at any time and an estimator of the overall treatment effect on the severity-rating trajectory. 
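As a rough illustration of the two day-specific summaries described above, the following sketch computes a patient's average severity rating and a crude, unadjusted odds ratio of lower severity at one cutpoint. It assumes that the average severity rating is the sum of daily ratings divided by the number of days (the defining sentence is truncated in the source). The function names and toy ratings are hypothetical, and the sketch does not attempt the GEE or stratified proportional odds analyses used in the paper.

```python
from statistics import mean

def average_severity(daily_ratings):
    """Mean of one patient's daily severity ratings (1 = best, 7 = death).

    Assumption: average = sum of daily ratings / number of days.
    """
    return sum(daily_ratings) / len(daily_ratings)

def odds_of_lower_severity(ratings, cutpoint):
    """Odds that a rating falls in or below `cutpoint` versus above it.

    Raises ZeroDivisionError if no rating lies above the cutpoint.
    """
    at_or_below = sum(1 for r in ratings if r <= cutpoint)
    above = len(ratings) - at_or_below
    return at_or_below / above

def odds_ratio_lower_severity(treated, control, cutpoint):
    """Treatment odds of lower severity divided by control odds."""
    return (odds_of_lower_severity(treated, cutpoint)
            / odds_of_lower_severity(control, cutpoint))

# Hypothetical ratings of eight patients per arm on a single day.
treated_day15 = [1, 2, 2, 3, 3, 4, 5, 7]
control_day15 = [2, 3, 3, 4, 4, 5, 6, 7]

# Negative difference: treated patients have lower (better) mean severity.
mean_difference = mean(treated_day15) - mean(control_day15)

# Odds ratio above 1: treated patients are more likely to be in Category 3 or lower.
or_cut3 = odds_ratio_lower_severity(treated_day15, control_day15, cutpoint=3)
```

Under the proportional odds model the odds ratio is taken to be the same at every cutpoint; computing `odds_ratio_lower_severity` at several cutpoints is one informal way to see how far a data set departs from that assumption.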
It is worthwhile to examine the odds ratios at various time points. The estimated common odds ratio is not very meaningful when individual odds ratios are in opposite directions.
We can describe the changes of clinical status over time by a multi-state model; 11 see Figure 2. We know each patient's initial state, i.e., clinical status at randomization. We also observe the time when each patient transits to a different state (i.e., category), provided that the transition occurs before the end of follow-up. An effective treatment should accelerate transitions to less severe categories and slow transitions to more severe categories.
We can view these transitions as multiple types of events. 11, 12 There are nine types of events for hospitalized patients: improvement by one, two, three, four, or five categories and deterioration by one, two, three, or four categories (from status at randomization). Each patient can potentially experience six events, whose types depend on the initial clinical status: a patient initially in Category 5 can improve by one, two, three, or four categories or deteriorate by one or two categories; a patient initially in Category 4 can improve or deteriorate by one, two, or three categories; and so on. As detailed in Supplementary Appendix S.2, we formulate the treatment effects on the five levels of improvement and the four levels of deterioration through nine Cox proportional hazards models. 13 Suppose that the hazard ratios of treatment versus control for the five levels of improvement are the same. Then we can estimate or test this common hazard ratio by the methodology of Wei, Lin and Weissfeld (WLW), 12 which is an extension of GEE to multiple events data. 
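The counting of event types above can be checked with a short sketch. It enumerates, for each hospitalized category on the seven-category scale, the levels of improvement and deterioration that are reachable from that category. The function and variable names are hypothetical and serve only to illustrate the bookkeeping; the sketch does not fit any Cox or WLW model.

```python
def possible_events(initial_category, best=1, worst=7):
    """List the improvement and deterioration levels reachable from a category."""
    improvements = [f"improve by {k}"
                    for k in range(1, initial_category - best + 1)]
    deteriorations = [f"deteriorate by {k}"
                      for k in range(1, worst - initial_category + 1)]
    return improvements + deteriorations

# Hospitalized patients start in Categories 3 to 6.
hospitalized = range(3, 7)
events_by_category = {c: possible_events(c) for c in hospitalized}

# Distinct event types across all hospitalized starting categories.
all_event_types = sorted({e for events in events_by_category.values() for e in events})
```

Each starting category yields six potential events, and the union over Categories 3 to 6 contains five levels of improvement and four levels of deterioration, matching the nine event types described in the text.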
Although the five hazard ratios of improvement may not be the same when treatment is effective, this framework provides a valid test of the null hypothesis of no treatment effect on any level of improvement and a summary of the treatment effect on improvement. Likewise, we can use the WLW methodology to estimate or test a common hazard ratio for the four levels of deterioration. Furthermore, we can test the global null hypothesis of no treatment benefit on any change of clinical status. We refer to these three methods as WLW-imp, WLW-det, and WLW-ben. We suggest reporting the common hazard ratio of improvement and that of deterioration, as well as the nine constituent hazard ratios.
The "clinical improvement" endpoint used in the Chinese trials of LPV/r and remdesivir pertains to improvement by one category for patients initially in Category 3 and to improvement by two categories for more ill patients, with time to improvement by one category also as a secondary endpoint. 1,2 By contrast, WLW-imp clearly distinguishes between improvement by one versus two categories and also includes improvement by more than two categories. In addition, it automatically accounts for multiple comparisons. The "recovery" endpoint used in the NIAID trial corresponds to a greater degree of improvement for a patient who is more ill at enrollment; however, all "recovery" events are treated the same in the endpoint. By contrast, WLW-imp considers each level of improvement separately and thus makes fuller and more precise use of the available data.
The ultimate goal of any therapy for COVID-19 patients is to prevent death. It would therefore be desirable to use 28-day mortality as the primary endpoint. However, the mortality rates are relatively low for COVID-19, 1−3 so a large number of patients are required to achieve good statistical power for detecting a moderate treatment difference in mortality. 
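To give a sense of scale for the sample-size remark above, the following sketch applies the standard normal-approximation formula for comparing two proportions. This is a textbook calculation, not the method used in the paper. The 15% control mortality echoes the placebo death rate used in the first simulation study; the 10% treated mortality, the 80% power target, and 1:1 allocation are hypothetical choices made only for illustration.

```python
from math import ceil
from statistics import NormalDist

def patients_per_arm(p_control, p_treated, alpha_one_sided=0.025, power=0.80):
    """Approximate patients per arm needed to detect a difference in proportions.

    Uses the unpooled normal approximation with equal allocation:
    n = (z_alpha + z_beta)^2 * [p1(1 - p1) + p2(1 - p2)] / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha_one_sided)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p_control - p_treated) ** 2)

# Hypothetical scenario: mortality reduced from 15% to 10%.
n_per_arm = patients_per_arm(p_control=0.15, p_treated=0.10)
```

Under these assumed inputs the formula calls for several hundred patients per arm, and the requirement grows quickly as the assumed mortality difference shrinks.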
On the other hand, the patients who become critically ill (e.g., on ECMO or invasive mechanical ventilation) are likely to suffer multiorgan failure, and even if they survive past day 28, they are still at risk of dying. Thus, critical illness is a good surrogate for 28-day mortality. A treatment comparison based on time to critical illness comprises a larger number of events and therefore tends to be more powerful than the mortality difference. We can combine the evidence of treatment effects on time to recovery and time to critical illness or death through the WLW methodology; we refer to this method as WLW-rc or WLW-rm.
To assess the operating characteristics of various endpoints and methods, we conducted a simulation study mimicking the design of the Chinese remdesivir trial. 2 We let 15%, 70%, and 15% of the patients belong to Category 3, 4, and 5, respectively, at enrollment. 1,2 Within each category, we assigned patients to treatment or placebo at a ratio of 2:1. We simulated the transitions between the seven categories of severity rating according to the multi-state model 11 shown in Figure 2. As described in Supplementary Appendix S.3, we chose a set of transition probabilities such that 70% of the placebo patients experienced the "clinical improvement" endpoint and 15% died by day 28. 1, 2
As detailed in Supplementary Appendix S.3, we considered ten possible scenarios for the treatment effects on the transitions between the seven categories. Case 1 pertains to the null hypothesis of no treatment effect. Cases 2-9 pertain to alternative hypotheses in which treatment accelerates the transition to a less severe category and slows the transition to a more severe category. In Case 2, the magnitude of treatment effect is the same for all transitions. In Case 3, the treatment effect is stronger on the transition to a less severe category than to a more severe category; the opposite is true in Case 4. 
In Case 5, the treatment effect becomes weaker when the current state is more severe; in Case 6, the treatment effect becomes stronger when the current state is more severe. In Case 7, the treatment effect increases as severity at enrollment increases; in Case 8, the treatment effect decreases as severity at enrollment increases. Case 9 is the same as Case 2, but there is a patient-specific random effect to create heterogeneity. Case 10 is the same as Case 2, but treatment accelerates the transition to death. The treatment effects on various endpoints are shown in the top panel of Table 2. Of note, the proportional odds and proportional hazards assumptions do not hold in Cases 2-10; the values shown in Table 2 are the mean estimates of treatment effects based on a large number of simulated data sets.
We implemented linear models for severity rating at day 15 and average severity rating over days 1-28 or 8-28; proportional odds models for odds ratio of lower severity at day 15 and common odds ratio of lower severity over days 1-28 or 8-28; Cox models for time to clinical improvement, time to recovery, time to critical illness, and time to death; and a logistic model for 28-day mortality, stratifying each by severity at enrollment. We also implemented the four WLW methods. In each scenario, we simulated 100,000 data sets, each with 453 patients. For each of the fifteen methods, we tested the null hypothesis of no treatment effect at the one-sided nominal significance level of 2.5% and estimated the rejection probability.
The results of the simulation study are summarized in the top panel of Table 3. We excluded the logistic model for 28-day mortality from the summary because the estimation algorithm did not always converge due to the small number of deaths. Cox models for time to critical illness and time to death and WLW-det have slightly inflated type I error due to the small number of events. 
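When reading estimated rejection probabilities such as these, it can help to know how much Monte Carlo noise to expect. The sketch below uses the usual binomial standard error for a proportion estimated from independent replicates, evaluated at the nominal 2.5% level with the 100,000 replicates stated above. The function names are hypothetical, and the sketch does not reproduce the paper's simulation.

```python
from math import sqrt

def rejection_probability(p_values, alpha=0.025):
    """Fraction of simulated data sets whose one-sided p-value is below alpha."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)

def monte_carlo_se(probability, n_replicates):
    """Binomial standard error of an estimated rejection probability."""
    return sqrt(probability * (1 - probability) / n_replicates)

nominal_level = 0.025
se_at_nominal = monte_carlo_se(nominal_level, n_replicates=100_000)

# Approximate 95% range for the estimate when the true type I error is exactly nominal.
plausible_range = (nominal_level - 1.96 * se_at_nominal,
                   nominal_level + 1.96 * se_at_nominal)
```

With this many replicates the standard error is small relative to the nominal level, so an estimated type I error that falls clearly outside `plausible_range` is unlikely to be simulation noise alone.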
The other methods have reasonable type I error.
We now discuss the results in Cases 2-9. Cox models for time to clinical improvement and time to recovery have about 80% and 82% power, respectively, whereas linear models for average severity rating have about 90% power. The power of the linear model for severity rating at day 15 is about 5% lower than that of linear models for average severity rating. Proportional odds models have similar power to linear models in most cases. As expected, Cox model for time to critical illness is more powerful than Cox model for time to death. WLW-imp is much more powerful than Cox models for time to clinical improvement and time to recovery. WLW-det is much more powerful than Cox models for time to critical illness and time to death. WLW-ben is nearly as powerful as linear and proportional odds models for severity-rating trajectory. WLW-rc is much more powerful than Cox model for time to critical illness and also tends to be more powerful than Cox model for time to recovery.
In Case 10, treatment has a beneficial effect on improvement and deterioration generally, except that it slightly increases the risk of death. In this case, Cox model for time to death has 1.6% power and WLW-det has 60%, whereas the other methods have 80% or higher power. In such situations, the tests based on composite endpoints should be used with caution, and having low probability to claim a beneficial treatment may be desirable.
We conducted a second simulation study mimicking the design of the NIAID remdesivir trial. 3 We let 15%, 40%, and 45% of the patients belong to Category 3, 4, and 5, respectively, at enrollment. Within each category, we assigned patients to treatment or placebo at a ratio of 1:1. 
We adopted the same set of transition probabilities and the same ten scenarios of treatment effects as in the first simulation study but chose smaller effect sizes such that Cox model for time to recovery has 80% power with 1,000 patients. The treatment effects on various endpoints are shown in the bottom panel of Table 2.
The results of this simulation study are summarized in the bottom panel of Table 3. All fifteen methods have correct type I error. In Cases 2-9, average severity rating, common odds ratio, and WLW-ben continue to have the highest power; Cox models for time to critical illness and time to death fare worse than before because of reduced treatment effects; and logistic model for 28-day mortality is slightly more powerful than Cox model for time to death. In Case 10, WLW-det and WLW-ben have only 35% and 68% power, respectively.
COVID-19 trials have rated clinical status on an ordinal scale of severity rating, with 6, 7, or 8 points. The rating system covers a multitude of important clinical outcomes, favorable or unfavorable. We propose two approaches for using the daily severity ratings over the follow-up period of interest to capture the totality of evidence on treatment efficacy: average severity rating and common odds ratio pertain to severity-rating trajectory, and WLW deals with times to changes of clinical status. Severity-rating trajectory is of great interest if there are substantial fluctuations of severity rating over time, whereas times to changes of clinical status are most relevant if severity-rating curves are largely monotone. Average severity rating assigns a specific value or weight to each rating category, whereas common odds ratio and WLW rely only on the ordering of the rating scale. WLW can also accommodate clinical events not derived from a specific rating scale. 
We recommend using common odds ratio or WLW-ben as the primary analysis, depending on how clinical status changes over time. Time to recovery and 28-day mortality are clinically meaningful endpoints that can be used with common odds ratio or WLW-ben. It is difficult to power a trial on a mortality endpoint, and time to recovery does not measure deterioration of clinical status. We may declare a therapy beneficial if WLW-ben or the treatment effect on common odds ratio is statistically significant and the treatment effects on 28-day mortality and time to recovery are in the right directions. WLW-imp and WLW-det can also serve as secondary endpoints.
Most COVID-19 trials were designed to follow patients for only 3-4 weeks. However, patients with severe illness, especially those experiencing prolonged ventilation or developing acute respiratory distress syndrome with a fibrotic component, may have unfavorable long-term outcomes. In addition, patients may require intensive care well beyond the end of the study, and some may die of COVID-19 in several months. Thus, we recommend that patients be followed as long as possible in order to evaluate long-term treatment effects.
Given the positive findings from the NIAID trial, 3 any trials that compare remdesivir to placebo will likely be terminated, and any trials that compare multiple agents to placebo will likely switch placebo patients to active agents. Combining the data that have been collected thus far on all patients who have received remdesivir or placebo will enable a more accurate assessment of the effects of remdesivir (relative to placebo) on mortality and other outcomes than individual trials. Meta-analysis of summary statistics (i.e., estimated treatment effects and standard errors) will be logistically simpler than, but statistically as efficient as, joint analysis of patient-level data. 
14 It is difficult to establish the benefits of other treatments beyond that of remdesivir, so having common endpoints that are clinically relevant and statistically powerful is critically important to future COVID-19 trials.
1. A trial of lopinavir-ritonavir in adults hospitalized with severe COVID-19
2. Remdesivir in adults with severe COVID-19: a randomised, double-blind, placebo-controlled, multicentre trial
3. Remdesivir for the treatment of Covid-19 - preliminary report
4. Remdesivir for 5 or 10 days in patients with severe Covid-19
5. The World Health Organization (2020). "Solidarity" clinical trial for COVID-19 treatments
6. International Severe Acute Respiratory and Emerging Infections Consortium (ISARIC) home page
7. Coronavirus disease (COVID-2019) R&D. Geneva: World Health Organization
8. NEWS) 2: Standardising the assessment of acute-illness severity in the NHS. London: Royal College of Physicians
9. Analysis of Longitudinal Data
10. Regression models for ordinal data
11. The Statistical Analysis of Failure Time Data
12. Regression analysis of multivariate incomplete failure time data by modeling marginal distributions
13. Regression models and life-tables
14. On the relative efficiency of using summary statistics versus individual-level data in meta-analysis
15. Estimation of regression coefficients when some regressors are not always observed
16. Adjustment during Army Life
17. Misspecified proportional hazard models
18. Checking the Cox model with cumulative sums of martingale-based residuals
19. On the restricted mean survival time curve in survival analysis