Estimating the causal effects of multiple intermittent treatments with application to COVID-19

Liangyuan Hu, Fan Li, Jiayi Ji, Himanshu Joshi, Erick Scott

September 27, 2021

Abstract. To draw real-world evidence about the comparative effectiveness of multiple time-varying treatment regimens on patient survival, we develop a joint marginal structural proportional hazards model and novel weighting schemes in continuous time to account for time-varying confounding and censoring. Our methods formulate complex longitudinal treatments with multiple "start/stop" switches as recurrent events with discontinuous intervals of treatment eligibility. We derive the weights in continuous time to handle a complex longitudinal dataset on its own terms, without the need to discretize or artificially align the measurement times. We further propose using machine learning models designed for censored survival data with time-varying covariates and the kernel function estimator of the baseline intensity to efficiently estimate the continuous-time weights. Our simulations demonstrate that the proposed methods provide better bias reduction and nominal coverage probability when analyzing observational longitudinal survival data with irregularly spaced time intervals, compared to conventional methods that require aligned measurement time points. We apply the proposed methods to a large-scale COVID-19 dataset to estimate the causal effects of several COVID-19 treatment strategies on in-hospital mortality or ICU admission, and provide new insights relative to findings from randomized trials.

1. Introduction. The COVID-19 pandemic has been a rapidly evolving crisis challenging global health and economies. Public health experts believe that this pandemic has no true precedent in modern times. While multiple COVID-19 vaccines have been developed across the globe, no consensus has been reached on optimal clinical management of COVID-19 (Yousefi et al., 2020).
The lack of evidence for effective treatment options warrants further investigation into the causal effects of multiple COVID-19 treatment strategies currently implemented in clinics. Although randomized controlled trials (RCTs) are considered the gold standard for evaluating the efficacy of COVID-19 therapies, they are enormously expensive and time-consuming, especially in a time of crisis. Stringent inclusion and exclusion criteria also limit the generalizability of RCTs to frailer populations at higher risk for severe morbidity and mortality. To overcome these challenges, we study the causal effects of COVID-19 treatment strategies on patient survival by leveraging the continuously growing observational data collected at the Mount Sinai Health System, New York City's largest academic medical system. We focus on four commonly used medication classes that are of most clinical interest: (i) remdesivir; (ii) dexamethasone; (iii) anti-inflammatory medications other than corticosteroids; and (iv) corticosteroids other than dexamethasone.

The complex nature of COVID-19 treatments, owing to differential physician preferences and variability of treatment choices attributable to evolving clinical guidelines, poses three major challenges for statistical analysis of observational data that cannot be easily addressed by existing longitudinal causal inference methods. First, treatment is not randomly allocated, and the treatment status over time may depend upon evolving patient- and disease-specific covariates. Second, the measurement time points during follow-up are irregularly spaced. Third, there is more than one treatment under consideration: patients can be prescribed various treatment combinations simultaneously, or can be switched to a different treatment. Figure 1 illustrates the observed treatment trajectories for nine randomly selected patients during their hospital stays. While previous work has shown that a continuous-time marginal structural model is effective in addressing time-varying confounding and provides consistent causal effect estimators (Johnson and Tsiatis, 2005; Saarela and Liu, 2016; Hu et al., 2018; Hu and Hogan, 2019; Ryalen et al., 2020), the development has been restricted to a single longitudinal treatment and therefore may not be directly applicable. We consider a joint marginal structural model to accommodate multiple longitudinal treatments in continuous time. To estimate the causal parameters in the joint marginal structural model, we derive a novel set of continuous-time stabilized inverse probability weights by casting each treatment process as a counting process for recurrent events, allowing for discontinuous intervals of eligibility. In addition, we propose to use machine learning and smoothing techniques designed for censored survival data to estimate these complex weights. Through simulations, we demonstrate that our approach provides valid causal effect estimates and can considerably alleviate the consequences of unstable inverse probability weights under parametric formulations. We further undertake a detailed analysis of a large longitudinal registry of clinical management and patient outcomes to investigate the comparative effectiveness of multiple COVID-19 treatments on patient survival.

2.1. Notation and set up. We consider a longitudinal observational study with multiple treatments and a right-censored survival outcome.
Denote $t$ as the time elapsed from study entry (e.g., hospital admission), $t_o$ the maximum follow-up time, and $\mathcal{T}$ a collection of time points on the interval $[0, t_o]$. Suppose each individual has a $p$-dimensional covariate process $\{L(t) : t \in \mathcal{T}\}$, some elements of which may be time-varying; by definition, the time-fixed elements of $L(t)$ are constant over $\mathcal{T}$. Let $T$ denote the time to an outcome event of interest such as death, with $\{N^T(t) : t \in \mathcal{T}\}$ as its associated zero-one counting process. We consider $W$ different medication classes (treatments), whose separate and joint causal effects on patient survival are of interest. We use $A_w(t)$ to denote the assignment of treatment $w \in \mathcal{W} = \{1, \ldots, W\}$, which can be characterized as a counting process; we let $A_w(t) = 1$ if an individual is treated with $w$ at time $t$ and $A_w(t) = 0$ otherwise (Johnson and Tsiatis, 2005). Let $C$ denote the time to censoring due to, for example, discharge or loss to follow-up. We use the overbar notation to represent the history of a random variable; for example, $\bar{A}_w(t) = \{A_w(s) : 0 \le s \le t\}$ corresponds to the history of treatment $A_w$ from hospital admission up to time $t$, and $\bar{L}(t) = \{L(s) : 0 \le s \le t\}$ corresponds to the covariate history up to time $t$. Following the convention in the longitudinal causal inference literature (Robins, Orellana and Rotnitzky, 2008), we assume the treatment decision is made only after observing the most recent covariate information just prior to the treatment; that is, for a given $t$, $A_w(t)$ occurs after $L(t)$ for all $w$. Let $T^{\bar{a}_1(t), \ldots, \bar{a}_W(t)}$ represent the counterfactual failure time to the event of interest had an individual been assigned treatment history $\{\bar{a}_1(t), \bar{a}_2(t), \ldots, \bar{a}_W(t)\}$ rather than the observed treatment history $\{\bar{A}_1(t), \bar{A}_2(t), \ldots, \bar{A}_W(t)\}$. Similarly, $T^{\bar{A}_1(t), \ldots, \bar{A}_W(t)}$ represents the observed failure time to event for an individual given the observed treatment history. We similarly define $C^{\bar{a}_1(t), \ldots, \bar{a}_W(t)}$ as the counterfactual censoring time under treatment $\{\bar{a}_1(t), \bar{a}_2(t), \ldots, \bar{a}_W(t)\}$. The observed data available for drawing inferences about the distribution of potential outcomes are as follows: the observed time to the outcome event is $T^* = T \wedge C$, with the censoring indicator $\Delta_T = I(T \le C)$. Note that both the treatment processes $\{A_w(t), w = 1, \ldots, W\}$ and the covariate process $\bar{L}(t)$ are defined for all $t \in \mathcal{T}$ but are observed only at discrete and potentially irregularly spaced time points for each individual. For example, individual $i$ may have covariates and treatment status observed at a set of discrete time points from study entry $t = 0$ to his or her last follow-up time $t_{iK_i} \le t_o$. We denote the set of discrete time points with observed covariate and treatment information for individual $i$ as $\mathcal{T}_i = \{0, t_{i1}, \ldots, t_{iK_i}\}$; the observed covariate and treatment histories therefore become $\bar{L}(t_{iK_i})$ and $\bar{A}_w(t_{iK_i})$ evaluated at the time points in $\mathcal{T}_i$.
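To fix ideas, the sketch below lays out a hypothetical version of such data in the long (counting-process) format used throughout: one row per observed interval per patient, with irregularly spaced measurement times. All variable names and values here are illustrative, not the study's actual data schema.

```r
# Hypothetical long-format records for two patients; times are days since
# admission and are not aligned across patients.
obs <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  start = c(0.0, 1.5, 4.0, 0.0, 2.2),
  stop  = c(1.5, 4.0, 6.5, 2.2, 7.8),
  A1    = c(0, 1, 1, 0, 0),            # treatment 1 status over the interval
  A2    = c(0, 0, 1, 1, 1),            # treatment 2 status over the interval
  L1    = c(0.8, 1.9, 2.4, 1.1, 1.3),  # time-varying confounder (e.g., D-dimer)
  event = c(0, 0, 1, 0, 0)             # outcome-event indicator at `stop`
)
```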
2.2. Joint marginal structural model for survival outcomes. We consider a marginal structural model to estimate the joint causal effects of $\bar{A}_1(t), \ldots, \bar{A}_W(t)$ on patient survival. The most popular model specification is a marginal structural Cox model, owing to its flexibility in handling the baseline hazard and its straightforward software implementation when used in conjunction with stabilized inverse probability weights (Howe et al., 2012). When there is a strong concern that the proportional hazards assumption may not be satisfied across the marginal distribution of the counterfactual survival times, alternative strategies, including the structural additive hazards model or the accelerated failure time model, can also be considered. For purposes of presenting our methodology, we focus on the marginal structural Cox model, but extensions to alternative structural models are possible with straightforward modifications of the weighted estimating equations. For notational brevity but without loss of generality, we first consider $W = 2$ treatments. Expansion of the joint marginal structural model and weighting scheme for $W \ge 3$ treatments is discussed in Section 3.4. Specifically, we assume $T^{\bar{a}_1(t), \bar{a}_2(t)}$ follows a marginal structural proportional hazards model of the form

$$\lambda_{T^{\bar{a}_1(t), \bar{a}_2(t)}}(t) = \lambda_0(t) \exp\{\psi_1 a_1(t) + \psi_2 a_2(t) + \psi_3 a_1(t) a_2(t)\}, \qquad (1)$$

where $\lambda_{T^{\bar{a}_1(t), \bar{a}_2(t)}}$ is the hazard function for $T^{\bar{a}_1(t), \bar{a}_2(t)}$ and $\lambda_0(t)$ is the unspecified baseline hazard function when treatments $A_1$ and $A_2$ are withheld during the study. The parameter $\psi_1$ encodes the instantaneous effect of treatment $A_1$ on $T^{\bar{a}_1(t), \bar{a}_2(t)}$ in terms of the log hazard ratio while $A_2$ is withheld during the study. Similarly, $\psi_2$ corresponds to the instantaneous treatment effect of $A_2$ in the absence of $A_1$. The multiplicative interaction effect of $A_1$ and $A_2$ is captured by $\psi_3$. In addition, the hazard function $\lambda_{T^{\bar{a}_1(t), \bar{a}_2(t)}}$ can depend on baseline covariates by elaborating model (1) or by using a stratified version of $\lambda_0(t)$. Model (1) implicitly assumes that the instantaneous treatment effect is constant over the course of follow-up. This model assumption is reasonable given that COVID-related hospitalization is generally short and medications are prescribed for days in succession. Finally, model (1) is a continuous-time generalization of the discrete-time model considered by Howe et al. (2012) for estimating the joint survival effects of multiple time-varying treatments.

Model (1) offers two advantages for the estimation of treatment effects. First, the counterfactual survival function can be expressed as

$$S^{\bar{a}_1(t), \bar{a}_2(t)}(t) = \exp\left[-\int_0^t \lambda_0(u) \exp\{\psi_1 a_1(u) + \psi_2 a_2(u) + \psi_3 a_1(u) a_2(u)\}\, du\right].$$

Therefore, causal contrasts can be performed based on any relevant summary measures of the counterfactual survival curves, including median survival times and restricted mean survival times. Second, model (1) allows for the estimation of causal effects of interventions defined by varying treatment initiation timing and treatment duration. For example, an intervention may take the form $\bar{a}_1(t^\dagger) = \{a_1(s) = 1, 0 \le s \le t^\dagger\}$, representing prescribing treatment $A_1$ until $t^\dagger$ (e.g., $t^\dagger =$ day 6). A more complex intervention strategy is $\{\bar{a}_1(t^\dagger), \bar{a}_2(t^\dagger)\} = \{a_1(s) = 1(0 \le s \le t^\dagger/2), a_2(s) = 1(t^\dagger/2 < s \le t^\dagger)\}$, which refers to assigning treatment $A_1$ until $t^\dagger/2$ and then switching altogether to $A_2$ until $t^\dagger$.
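To illustrate the first advantage concretely, the sketch below computes the counterfactual survival curve implied by model (1) for one such regimen; the $\psi$ values, baseline hazard, and time grid are assumptions chosen purely for illustration.

```r
# Counterfactual survival under a hypothetical regimen: treat with A1 for
# the first 10 days and start A2 at day 5 (the two overlap on days 5-10).
psi  <- c(-0.20, -0.53, -0.30)            # assumed (psi1, psi2, psi3)
h    <- 0.1                               # grid spacing in days
grid <- seq(0, 28, by = h)
lam0 <- rep(0.005, length(grid))          # assumed constant baseline hazard
a1   <- as.numeric(grid <= 10)
a2   <- as.numeric(grid >= 5)
haz  <- lam0 * exp(psi[1]*a1 + psi[2]*a2 + psi[3]*a1*a2)
surv <- exp(-cumsum(haz * h))             # S(t) = exp{-integral of the hazard}
plot(grid, surv, type = "l",
     xlab = "Days since admission", ylab = "Counterfactual survival")
```

Summary measures such as the restricted mean survival time then follow directly, e.g., `sum(surv * h)` over the 28-day window.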
It is worth noting that the comparative effectiveness research question under investigation in this work is fundamentally different from the problem of self-controlled case series (Farrington, 1995; Simpson et al., 2013). In a self-controlled case series, individuals act as their own controls, and the goal of such studies is to compare on-drug versus off-drug periods within a single patient. By contrast, we compare the population-level causal effects of time-varying treatment regimens, each of which is followed by, and thus observed from, a certain number of patients in the study population. These treatment regimens were not randomly assigned to patients. Thus, specialized causal inference techniques are needed to isolate cause and effect from other biasing factors, which can often be time-varying and may themselves be affected by past treatments. Because of this complexity, we motivate our methodology under the counterfactual outcomes framework rooted in the emerging literature on estimating the causal effects of time-varying treatments; see, for example, Hernán, Brumback and Robins (2001) and Robins, Orellana and Rotnitzky (2008) for such methods applied to discrete-time longitudinal observational studies, and Hu et al. (2018) and Hu and Hogan (2019) for such methods applied to continuous-time longitudinal and survival data but with only one binary treatment. To obtain a consistent estimator for $\psi = \{\psi_1, \psi_2, \psi_3\}$ in model (1) using longitudinal observational data with two treatments, we introduce the following causal assumptions and maintain them throughout the rest of the article:

(A1) Consistency. The observed failure times satisfy $T = T^{\bar{a}_1(t), \bar{a}_2(t)}$ whenever $\bar{A}_1(t) = \bar{a}_1(t)$ and $\bar{A}_2(t) = \bar{a}_2(t)$; similarly, for the observed censoring times, $C = C^{\bar{a}_1(t), \bar{a}_2(t)}$ whenever $\bar{A}_1(t) = \bar{a}_1(t)$ and $\bar{A}_2(t) = \bar{a}_2(t)$. The consistency assumption implies that the observed outcome corresponds to the counterfactual outcome under a specific joint treatment trajectory $\{\bar{a}_1(t), \bar{a}_2(t)\}$ when an individual actually follows treatment $\{\bar{a}_1(t), \bar{a}_2(t)\}$. This is an extension of the consistency assumption developed for a single time-varying treatment (Robins, 1999) to two time-varying treatments.

(A2) Conditional exchangeability. Alternatively referred to as sequential randomization, this assumption states that initiation of treatment at time $t$ among those who are still alive and remain in the study is conditionally independent of the counterfactual survival time $T^{\bar{a}_1(t), \bar{a}_2(t)}$, given the observed treatment and covariate histories. Mathematically, letting $\bar{O}(t) = \{\bar{L}(t), \bar{A}_1(t), \bar{A}_2(t)\}$ denote the observed history up to $t$, we assume, for all $t \in \mathcal{T}$,

$$\lambda_{A_1, A_2}\{t \mid \bar{O}(t^-), T^{\bar{a}_1(t), \bar{a}_2(t)}\} = \lambda_{A_1, A_2}\{t \mid \bar{O}(t^-)\},$$

where $\lambda_{A_1, A_2}(t)$ is the intensity process of the joint counting process of $A_1(t)$ and $A_2(t)$. Similarly, we assume conditional exchangeability for censoring such that, for all $t \in \mathcal{T}$,

$$\lambda_C\{t \mid \bar{O}(t^-), T^{\bar{a}_1(t), \bar{a}_2(t)}\} = \lambda_C\{t \mid \bar{O}(t^-)\},$$

where $\lambda_C(t)$ is the intensity process corresponding to the counting process of censoring. Our conditional exchangeability assumption is a continuous-time generalization of the usual sequential randomization assumption for discrete-time marginal structural models (Robins, 1999; Howe et al., 2012).

(A3) Positivity. We assume that at any given time $t$, there is a positive probability of initiating a treatment plan, among those who are subject to initiating at least one treatment, for all configurations $\bar{O}(t^-)$:

$$\lambda_{A_1, A_2}\{t \mid \bar{O}(t^-)\} > 0.$$

For a pair of joint treatments $(A_1, A_2)$ at a given time $t$, individuals with treatment status $(0, 0)$, $(0, 1)$ or $(1, 0)$ are subject to initiating at least one treatment. The treatment initiation patterns can be as follows: $(0, 0) \to \{(0, 1), (1, 0), (1, 1)\}$, $(0, 1) \to \{(1, 1), (1, 0)\}$, $(1, 0) \to \{(1, 1), (0, 1)\}$. Treatment discontinuation (from 1 to 0), however, is not considered a stochastic process in our study, as COVID medication is typically prescribed with a specific treatment duration, e.g., treat with dexamethasone at a dose of 6 mg once daily for 10 days (RECOVERY Collaborative Group, 2021). Furthermore, because an individual cannot be at risk for receiving the same treatment once he or she is on the treatment, we only need to assume positivity when the individual is off that specific treatment, i.e., at risk for initiating that treatment.
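One informal way to probe positivity in practice is to tabulate which joint-treatment transitions actually occur between consecutive records. The base-R sketch below does this for the hypothetical `obs` data frame from the earlier layout sketch; it is an illustrative check, not a diagnostic prescribed by the method.

```r
# Tabulate consecutive joint-treatment states (A1, A2) within each patient;
# empty cells flag transitions never observed in the data.
obs     <- obs[order(obs$id, obs$start), ]
state   <- paste0("(", obs$A1, ",", obs$A2, ")")
prev    <- c(NA, head(state, -1))
same_id <- c(FALSE, diff(obs$id) == 0)  # TRUE when a row continues the same patient
table(previous = prev[same_id], current = state[same_id])
```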
3.1. Framing repeated treatment initiation as recurrent events. As Figure 1 suggests, the observed treatment pattern is complex due to considerable variability in COVID treatment protocols and clinician preferences over time. Individuals may discontinue a treatment and restart the same treatment at a later time, or they may be switched altogether to another treatment. Meanwhile, patients can take more than one treatment for a period of time. Each treatment can therefore be viewed as a counting process of recurrent events, with discontinuous intervals of treatment eligibility (Andersen and Gill, 1982). Specifically, casting treatment initiation as a recurrent event process captures two distinguishing features of our observational data: (i) having received a treatment prevents an individual from receiving the same treatment again during the period while the individual is on the treatment; and (ii) after the individual comes off the treatment, he or she is again eligible, or at risk, for re-initiating the treatment.

To formalize the treatment initiation process, we first consider a univariate treatment process $N^{A_w}$. We assume that the jumps of $A_w(t)$, i.e., $dA_w(t)$, are observed on certain subintervals of $\mathcal{T}$ only. Specifically, for individual $i$, we observe the stochastic process $A_{w,i}(t)$ on a set of intervals $E_{w,i} = \bigcup_{j=1}^{J_i} (V_{w,ij}, U_{w,ij}]$ contained in $[0, t_{iK_i}]$. This representation implies the following results. First, an individual can have at most $J_i \ge 1$ initiations of treatment $w$: if $U_{w,iJ_i} = t_{iK_i}$, then individual $i$ has $J_i - 1$ treatment initiations; and if $U_{w,iJ_i} < t_{iK_i}$, then individual $i$ has $J_i$ treatment initiations. A special case where $J_i = 1$ and $U_{w,iJ_i} = t_{iK_i}$ corresponds to the situation where individual $i$ is continuously eligible for treatment initiation and has not been treated with $w$ during the follow-up. Second, once treatment is initiated, $A_{w,i}(t)$ is no longer stochastic until person $i$ discontinues the treatment. This also implies that the $j$th treatment initiation is observed at $U_{w,ij}$. Third, we have $A_{w,i}(t) = 1$ for $t \notin E_{w,i}$; in words, treatment status is equal to one deterministically on the discontinuous intervals of ineligibility (i.e., the on-treatment periods). Define a censoring or filtering process by $D^{A_w}_i(t) = I(t \in E_{w,i})$, and the filtered counting process by $N^{A_w}_{iq}(t) = \int_0^t D^{A_w}_i(s)\, dA_{w,iq}(s)$, where $q$ indexes the $q$th treatment initiation. We assume conditional independence among occurrences of treatment initiation given all observed history, and that the set $E_{w,i}$ is defined such that $D^{A_w}_i(t)$ is predictable. The observed data with occurrences on the set $E_{w,i}$ can therefore be viewed as a marked point process generating the filtration $(\mathcal{F}^D_{w,t})$. Similarly, we denote the filtration generated by the counting process $\{A_w(t) : t \in \mathcal{T}\}$ corresponding to $E_{w,i} = \mathcal{T}$ by $(\mathcal{F}_{w,t})$. We assume $A_{w,iq}(t)$ follows Aalen's multiplicative intensity model (Aalen, Borgan and Gjessing, 2008) with intensity $Y_{w,iq}(t)\, \lambda^{A_w}(t; \theta)$, where $\lambda^{A_w}(t; \theta)$ is the hazard rate function parameterized by $\theta$, and $Y_{w,iq}(t)$ is the at-risk function, with $Y_{w,iq}(t) = 1$ indicating that person $i$ is eligible just before time $t$ for the $q$th initiation of treatment $w$ in the interval $[t, t + dt)$, and $Y_{w,iq}(t) = 0$ otherwise. It follows that the filtered counting process $N^{A_w}_{iq}(t)$ follows the multiplicative intensity model

$$\lambda^{N_w}_{iq}(t) = D^{A_w}_i(t)\, Y_{w,iq}(t)\, \lambda^{A_w}(t; \theta). \qquad (4)$$

With two treatments, model (4) can be directly extended to the joint treatment initiation process, in which $Y_{iq}(t)$ is the at-risk process for the $q$th treatment initiation and the filtering process is defined jointly by $A_1$ and $A_2$.
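The sketch below illustrates, for one hypothetical patient, how a treatment history with two initiations is expanded into filtered counting-process records: the on-treatment periods are flagged as ineligible and dropped before the initiation model is fit. All times are illustrative.

```r
# Patient initiates treatment w at day 2 (on drug through day 6) and again
# at day 9 (on drug through day 12); follow-up ends at day 15.
rec <- data.frame(
  id     = 1,
  start  = c(0, 2, 6, 9, 12),
  stop   = c(2, 6, 9, 12, 15),
  atrisk = c(1, 0, 1, 0, 1),  # D(t): eligible to initiate only while off drug
  init   = c(1, 0, 1, 0, 0)   # an initiation occurs at `stop` of an at-risk row
)
# Only eligible person-time contributes to the initiation intensity:
rec_fit <- rec[rec$atrisk == 1, ]
```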
3.2. Derivation of the continuous-time weights. Under assumptions (A1)-(A3), a consistent estimator of $\psi$ can be obtained by solving the weighted partial score equations (Hu et al., 2018),

$$\sum_{i=1}^n \int_0^{t_o} \Omega^{A_1, A_2}(t_{K_i}) \left\{ Z_i(t) - \bar{Z}(\psi, t) \right\} dN^T_i(t) = 0, \qquad (7)$$

where $\Omega^{A_1, A_2}(t_{K_i})$ is the weight that corrects for potential time-varying confounding for the time-varying treatments $A_1$ and $A_2$, $Z_i(t) = \{A_{1,i}(t), A_{2,i}(t), A_{1,i}(t) A_{2,i}(t)\}^\top$, and $\bar{Z}(\psi, t)$ is a modified (weighted) version of the mean of $Z$ over observations still at risk for the outcome event at time $t$,

$$\bar{Z}(\psi, t) = \frac{\sum_{l \in R^T_t} \Omega^{A_1, A_2}_l(t)\, Y^{**T}_l(t)\, r(A_{1,l}, A_{2,l}, t)\, Z_l(t)}{\sum_{l \in R^T_t} \Omega^{A_1, A_2}_l(t)\, Y^{**T}_l(t)\, r(A_{1,l}, A_{2,l}, t)}.$$

In equation (7), $R^T_t$ denotes the collection of subjects still at risk for the outcome event at time $t$, $Y^{**T}$ is the weighted risk set indicator (at-risk function) for the outcome event, and $r(a_1, a_2, t) = \exp\{\psi_1 a_1(t) + \psi_2 a_2(t) + \psi_3 a_1(t) a_2(t)\}$. In the discrete-time setting with non-recurrent treatment initiation, the stabilized inverse probability weights (we suppress subscript $i$ for brevity) are given by Howe et al. (2012):

$$\Omega^{A_1, A_2}(t) = \prod_{k: t_k \le t} \frac{\Pr\{A_1(t_k), A_2(t_k) \mid \bar{A}_1(t_{k-1}), \bar{A}_2(t_{k-1})\}}{\Pr\{A_1(t_k), A_2(t_k) \mid \bar{A}_1(t_{k-1}), \bar{A}_2(t_{k-1}), \bar{L}(t_k)\}}, \qquad (8)$$

where the $t_k$'s are a set of ordered discrete time points common to all individuals. While $\Omega^{A_1, A_2}(t)$ in (8) corrects for time-varying confounding by adjusting for $\bar{L}(t)$ in the weights, it requires that the time points be well aligned across all individuals. In addition, it does not accommodate the recurrent nature of complex intervention strategies as in our observational study.

We now generalize the weights developed for the discrete-time setting to a continuous-time process, which does not require the time points to be aligned. Partition the time interval $[0, t]$ into a number of small time intervals, and let $dA_w(s)$ be the increment of $A_w$ over the small time interval $[s, s + ds)$, $\forall s \in [0, t]$. Recall that treatment initiation, i.e., the jumps $dA_w(t)$, is observed only on a number of subintervals of $\mathcal{T}$. That is, conditional on the history $\bar{L}(s)$, the occurrence of treatment initiation for an individual in $[s, s + ds)$, for $s \in E$, is a Bernoulli trial with outcomes $dA_w(s) = 1$ and $dA_w(s) = 0$. The likelihood contribution of the treatment process over $[0, t]$ is then the finite product of these Bernoulli factors,

$$\prod_{s} \{\lambda^{A_w}(s)\, ds\}^{dA_w(s)} \{1 - \lambda^{A_w}(s)\, ds\}^{1 - dA_w(s)},$$

which takes the form of the individual partial likelihood for the filtered counting process $\{D^{A_w}(s) A_w(s) : 0 \le s \le t\}$. When the number of time intervals in $[0, t]$ increases and $ds$ approaches zero, the finite product over the time intervals approaches a product integral (Aalen, Borgan and Gjessing, 2008), given by

$$\prod_{q} \lambda^{A_w}(U_{w,q}) \times \exp\left\{-\int_0^t D^{A_w}(s)\, Y_w(s)\, \lambda^{A_w}(s)\, ds\right\}. \qquad (9)$$

For individual $i$, both factors in (9) are evaluated with respect to the individual's filtered counting process $\{N^{A_w}_{iq}(t) : 0 \le t \le t_{K_i},\ q = 1, \ldots, Q_{w,i}\}$, with the first factor equal to the finite product over the jump times and the second factor being the survival function for treatment initiation. As described in Section 3.1, the number of treatment initiations for individual $i$, $Q_{w,i}$, can take three values: (i) $Q_{w,i} = 0$, when the individual remains continuously eligible and never initiates treatment $w$; (ii) $Q_{w,i} = J_i - 1$, when the last eligibility interval ends at the last follow-up time ($U_{w,iJ_i} = t_{iK_i}$); and (iii) $Q_{w,i} = J_i$, when the last eligibility interval ends with an initiation ($U_{w,iJ_i} < t_{iK_i}$). Corresponding to the three cases, the quantity in (9) can be rewritten as

$$\prod_{j=1}^{Q_{w,i}} f^{A_w}(U_{w,ij}) \times \left\{S^{A_w}(t_{iK_i})\right\}^{I(U_{w,iJ_i} = t_{iK_i})}, \qquad (10)$$

where $S^{A_w}$ and $f^{A_w}$ are the survival and density functions of the filtered counting process for treatment $A_w$, so that only eligible person-time contributes. Here we assume that initiations of different treatments are ordered. For example, whether to initiate $A_2$ at time $t$ is decided upon observing the treatment status $A_1(t)$. This suggests that the hazard function $\lambda^{A_2}$ is estimable by conditioning on $\bar{A}_1(t)$ and $\bar{A}_2(t^-)$, and the hazard function $\lambda^{A_1}$ is estimable by conditioning on $\bar{A}_1(t^-)$ and $\bar{A}_2(t^-)$. For brevity of exposition, we define the history sets $\bar{O}^{A_1}(t) = \{\bar{L}(t), \bar{A}_1(t^-), \bar{A}_2(t^-)\}$ and $\bar{O}^{A_2}(t) = \{\bar{L}(t), \bar{A}_1(t), \bar{A}_2(t^-)\}$, with $\bar{O}^{A_w,*}(t)$ denoting the corresponding history with the covariate process $\bar{L}(t)$ removed. Putting this all together, the individual continuous-time stabilized inverse probability weight that corrects for time-varying confounding by $\bar{L}$ is given by $\Omega^{A_1, A_2}(t) = \Omega^{A_1}(t)\, \Omega^{A_2}(t)$, with

$$\Omega^{A_w}(t) = \frac{\prod_{j=1}^{Q_w} f^{A_w}\{U_{w,j} \mid \bar{O}^{A_w,*}(U_{w,j})\} \times \left[S^{A_w}\{t \mid \bar{O}^{A_w,*}(t)\}\right]^{I(U_{w,J} = t)}}{\prod_{j=1}^{Q_w} f^{A_w}\{U_{w,j} \mid \bar{O}^{A_w}(U_{w,j})\} \times \left[S^{A_w}\{t \mid \bar{O}^{A_w}(t)\}\right]^{I(U_{w,J} = t)}}.$$
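As a worked special case of the weight just derived (using the reconstructed notation above): for an individual with a single observed initiation of treatment $w$ at time $u \le t$ who remains on treatment through $t$, the products collapse to single density terms,

```latex
\Omega^{A_w}(t)
  \;=\;
  \frac{f^{A_w}\{u \mid \bar O^{A_w,*}(u)\}}
       {f^{A_w}\{u \mid \bar O^{A_w}(u)\}} ,
```

so if the covariate history carries no information about initiation beyond the treatment history, the numerator and denominator coincide and the weight equals one, as it would under randomized treatment assignment.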
Turning to censoring: under the conditional exchangeability assumption (A2), the censoring process is covariate- and treatment-dependent. To correct for selection bias due to censoring, we additionally define a weight function associated with censoring,

$$\Omega^C(G_i) = \frac{S^C\{G_i \mid \bar{A}_1(G_i), \bar{A}_2(G_i)\}}{S^C\{G_i \mid \bar{O}(G_i)\}},$$

where $S^C$ is the survival function associated with the censoring process, evaluated at individual $i$'s follow-up time $G_i$. This leads to a final modification of the estimating equation for $\psi$:

$$\sum_{i=1}^n \int_0^{t_o} \Omega^{A_1, A_2}(t_{K_i})\, \Omega^C(G_i) \left\{ Z_i(t) - \bar{Z}(\psi, t) \right\} dN^T_i(t) = 0. \qquad (11)$$

3.3. Estimation of the causal survival effects. We consider four ways in which the continuous-time weights $\Omega^{A_1, A_2}(t)$ can be estimated: (i) fitting a usual Cox regression model for the intensity process of the counting process of treatment initiation $\{A_w(t) : t \in \mathcal{T}\}$, estimating the density function $f^{A_w}$ and survival function $S^{A_w}$ from the fitted model with the Nelson-Aalen estimator for the baseline intensity function, and finally calculating the weights following equation (10); (ii) smoothing the Nelson-Aalen estimator, and in turn $f^{A_w}$ and $S^{A_w}$ estimated from the fitted Cox regression model, by means of kernel functions (Ramlau-Hansen, 1983), and calculating the weights using the smoothed versions of $f^{A_w}$ and $S^{A_w}$; (iii) fitting a multiplicative intensity tree-based model in which the functional form of the intensity ratio for treatment initiation is flexibly captured, to estimate $f^{A_w}$ and $S^{A_w}$ and calculate the weights; (iv) smoothing the Nelson-Aalen estimator of the baseline intensity from the tree-based model and calculating the weights using the smoothed versions of $f^{A_w}$ and $S^{A_w}$. Among these approaches, (i) relies on parametric assumptions about the intensity ratio relationships between the treatment initiation process and the covariate process, and may be subject to model misspecification and bias in estimating causal effects. Compared to the Nelson-Aalen estimator, which includes discrete jumps at event occurrences, the kernel function estimator in (ii) may help alleviate the issue of extreme or spiky weights, and has also been shown to be a consistent and asymptotically normal baseline intensity estimator. Approach (iii) leverages a recent random survival forests model that can accommodate time-varying covariates, mitigating the parametric assumptions and attendant biases associated with the usual Cox regression. With baseline time-fixed treatment, prior work has used similar machine learning techniques to improve propensity score weighting estimators (Lee, Lessler and Stuart, 2010) as well as to provide more accurate causal effect estimates with censored survival data (Hu, Ji and Li, 2021). Finally, approach (iv) smooths the baseline intensity estimated from the survival forests for estimating the stabilized inverse probability weights, and serves as an additional step to smooth over potentially spiky weights. In Section 4, we compare the performances of these four strategies for estimating the continuous-time weights to generate practical recommendations. In addition, the censoring weight function $\Omega^C(G_i)$ can be estimated in a similar fashion via any one of these four approaches. Additional details of the kernel function smoothing in approach (ii) and the random survival forests in approach (iii) are presented in Web Appendix S1.

To accommodate the time-varying covariate process and account for the recurrent nature of treatment initiation, we fit a survival model to the counting-process style of data input. Each individual is represented by several rows of data corresponding to nonoverlapping time intervals of the form (start, stop]. To allow for discontinuous intervals of eligibility when defining the multiple time intervals $E_{w,i} = \bigcup_{j=1}^{J_i} (V_{w,ij}, U_{w,ij}]$ on $\mathcal{T}$ for individual $i$, the duration of a treatment is removed from $\mathcal{T}$ while the individual is being treated and therefore not eligible for initiating the treatment.
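A minimal sketch of approaches (i) and (ii) with the survival package: fit a Cox model to (start, stop] initiation records, take the Breslow (Nelson-Aalen type) baseline cumulative intensity, and smooth its increments with an Epanechnikov kernel. The data, covariates, and bandwidth below are all hypothetical.

```r
library(survival)

set.seed(1)
n <- 200
dat <- data.frame(            # synthetic at-risk records for one treatment
  id    = 1:n,
  start = 0,
  stop  = rexp(n, rate = 0.2),       # days of eligible follow-up
  init  = rbinom(n, 1, 0.6),         # 1 = initiation observed at `stop`
  L1    = rnorm(n),                  # hypothetical confounders
  L2    = rbinom(n, 1, 0.5)
)

# Approach (i): Cox model for the initiation intensity + Breslow baseline.
fit <- coxph(Surv(start, stop, init) ~ L1 + L2, data = dat)
bh  <- basehaz(fit, centered = FALSE)  # cumulative baseline intensity
dL  <- diff(c(0, bh$hazard))           # Nelson-Aalen-type increments

# Approach (ii): Ramlau-Hansen kernel smoothing of the increments.
epan <- function(x) ifelse(abs(x) <= 1, 0.75 * (1 - x^2), 0)
b    <- 2                              # bandwidth in days (assumed)
lambda0_hat <- function(t) sum(epan((t - bh$time) / b) * dL) / b
lambda0_hat(5)                         # smoothed baseline intensity at day 5
```

The density and survival factors entering (10) then follow from the (smoothed) intensity in the usual way.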
Finally, since our estimator for $\psi$ is a solution to the weighted partial score equation (11), we can use the robust sandwich variance estimator to construct confidence intervals for the structural parameters; the details of the robust sandwich variance estimator are provided in Web Appendix S1. In practice, the robust sandwich variance estimator is at most conservative under the discrete-time setting, and we will empirically assess the accuracy of this variance estimator with continuous-time weights via simulations.

3.4. Extensions to more than two time-varying treatments. Although we introduce our methods with two longitudinal treatments, our approach can be extended to more than two time-varying treatments in a straightforward fashion. In theory, a fully interacted version of model (1) can be formed to include all the main effects of $a_w(t)$, $\forall w \in \mathcal{W}$, and the interactions thereof. Clinical interest and data sparsity on combinations of treatments may also be used to guide the inclusion of interaction terms in the structural model. Suppose $\mathcal{B} = \{b_1(t), \ldots, b_V(t)\}$ is a collection of causal interaction terms of interest, e.g., $b_1(t) = a_1(t) a_2(t)$; the general joint marginal structural proportional hazards model is

$$\lambda_{T^{\bar{a}_1(t), \ldots, \bar{a}_W(t)}}(t) = \lambda_0(t) \exp\left\{\sum_{w=1}^W \psi_{1w}\, a_w(t) + \sum_{v=1}^V \psi_{2v}\, b_v(t)\right\}, \qquad (12)$$

where the $\psi_{1w}$'s and $\psi_{2v}$'s capture the causal main and interaction effects on the counterfactual hazard function. A consistent estimator of $\psi = \{\psi_{11}, \ldots, \psi_{1W}, \psi_{21}, \ldots, \psi_{2V}\}$ can be obtained by solving the general form of the estimating equation (11), with $Z_i(t)$ expanded to include all main-effect and interaction terms. The joint treatment weights $\Omega^{A_1, \ldots, A_W}(t_{K_i})$ can be estimated by assuming a specific order in which treatments are initiated and calculating the weights using the appropriate history information $\bar{O}^{A_w}(t)$ and $\bar{O}^{A_w,*}(t)$, similarly to Section 3.2. The estimation of the censoring weights $\Omega^C(G_i)$ also follows the same strategy outlined in Section 3.2 with two longitudinal treatments.

4.1. Simulation design. We carry out simulations to investigate the finite-sample properties of the proposed weight estimators for the marginal structural Cox model parameters. We simulate data compatible with the marginal structural Cox model by generating and relating data adhering to the structural nested accelerated failure time (SNAFT) model (Young et al., 2008). A general representation of a SNAFT model for a time-varying treatment $a$ is (Hernán et al., 2005)

$$T^{\bar{0}} = \int_0^{T} \exp\{\psi_{aft}\, a(u)\}\, du,$$

where $T^{\bar{0}}$ is the counterfactual failure time under no treatment. Robins (1992) developed a simulation algorithm to generate data adhering to the SNAFT model under the discrete-time version of the identifying assumptions (A1)-(A3) in Section 3. Young et al. (2008) showed that, under the same identifying assumptions, data adhering to a marginal structural Cox model of the form $\lambda_{T^{\bar{a}}}(t) = \lambda_0(t) \exp\{\psi_{msm}\, a(t)\}$ can be simulated from a SNAFT model with $\psi_{aft} = \psi_{msm}$ by adding an additional quantity to the term $\exp\{\psi_{aft}\, a(t)\}$. In particular, when $T^{\bar{0}}$ has an exponential distribution, the additional quantity is zero; hence the structural nested AFT model simulation algorithm (Robins, 1992) can be used to appropriately simulate data compatible with the marginal structural Cox model under complex time-varying data structures.
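As a simple worked case of this SNAFT representation (illustrative only): for an individual treated continuously from study entry, $a(u) = 1$ for all $u$, and the model gives

```latex
T^{\bar 0} \;=\; \int_0^{T} e^{\psi_{\mathrm{aft}}\, a(u)}\, du
          \;=\; e^{\psi_{\mathrm{aft}}}\, T
\qquad\Longrightarrow\qquad
T \;=\; e^{-\psi_{\mathrm{aft}}}\, T^{\bar 0},
```

so a negative $\psi_{\mathrm{aft}}$ (e.g., the true value $\psi_1 = -0.5$ used in the algorithm below) stretches the untreated failure time by the factor $e^{0.5} \approx 1.65$, i.e., treatment prolongs survival.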
Building on these previous works, we extend the simulation algorithm described in Karim, Platt and Group (2017) to generate data from the joint marginal structural Cox model, while allowing for multiple time-varying treatments with discontinuous intervals of treatment eligibility and for both continuous and discrete time-varying confounders. Throughout, we simulate an observational study with $n = 1000$ patients and two time-varying treatments $A_1(t)$ and $A_2(t)$. We assume $\bar{L}(t)$ is appropriately summarized by a continuous time-varying confounding variable $L_1(t)$ and a binary time-varying confounding variable $L_2(t)$. The simulation algorithm includes two steps. In step (1), we consider nonlinear terms for the continuous variable $L_1(t_k)$ and the interaction term in the true treatment decision model. In particular, past treatment status $\{A_1(t_{k-1}), A_2(t_k)\}$ is a predictor of $L(t_k)$, which then predicts future treatment exposure $\{A_1(t_k), A_2(t_{k+1})\}$ as well as future failure status $Y(t_{k+1})$ via $1/\log(T^{\bar{0}})$. Therefore, $L(t_k)$ is a time-dependent confounder affecting both the future treatment choices and the counterfactual survival outcomes. The simulation of treatment initiation is placed in the recurrent events framework. Once treatment is initiated at time $t_k$, treatment duration following initiation is simulated from a zero-truncated Poisson distribution. In step (1), we generate a longitudinal data set with 100 × 1000 observations (100 aligned measurement time points for each of $n = 1000$ individuals). In step (2), we randomly discard a proportion of follow-up observations for a randomly selected subset of individuals (Lin, Scharfstein and Rosenheck, 2004); in the resulting data set, the individuals have varying numbers of follow-up measurement time points, which are also irregularly spaced. Web Appendix S2 provides the full pseudo-code for simulating data under the marginal structural Cox model with two time-varying treatments. Our simulation parameters are chosen so that the simulated data possess similar characteristics to those observed in the motivating COVID-19 data set. The treatments $A_1$ and $A_2$ are simulated to resemble dexamethasone and remdesivir such that: (i) about 20% of patients did not take any of the anti-viral and anti-inflammatory medications aimed at treating COVID-19; (ii) among those who were treated, 62% took dexamethasone only, 25% took remdesivir only and 13% took both (either concurrently or with treatment switching); (iii) the number of initiations for both treatments ranges from 0 to 4, with an average medication duration of about 5 days. The values of the treatment effect parameters $\psi_1$ and $\psi_2$ were set to yield a 6.7% mortality proportion among those who received dexamethasone and a 4.9% mortality proportion in those treated with remdesivir.
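To make the recurrent-initiation step concrete, the sketch below draws on-treatment durations from a zero-truncated Poisson via inverse-CDF sampling; the mean parameter here is illustrative rather than the algorithm's exact setting.

```r
# Zero-truncated Poisson sampler: draw U uniformly above P(X = 0), then
# invert the ordinary Poisson CDF, so the result is always >= 1.
rztpois <- function(n, lambda) {
  u <- runif(n, min = ppois(0, lambda), max = 1)
  qpois(u, lambda)
}
set.seed(42)
durations <- rztpois(1000, lambda = 5)  # on-treatment durations in days
mean(durations)                         # close to 5 / (1 - exp(-5)), about 5.03
```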
4.2. Comparison of methods. We conduct two sets of simulations to investigate the finite-sample performance of our proposed joint marginal structural survival model in continuous time (JMSSM-CT). First, we compare how accurately the four weight estimators described in Section 3.3 estimate the structural parameter $\psi$. Second, we use the best weight estimator, suggested by the first set of simulations, for JMSSM-CT, and compare it with the joint marginal structural model that requires aligned discrete time points (JMSM-DT). To ensure an objective comparison, we use random forests adapted into our proposed recurrent-events framework to estimate the weights for JMSM-DT. In addition, we implement both JMSSM-CT and JMSM-DT on the "rectangular" simulation data with 100 aligned time points for each individual and on the "ragged" data with irregular observation time points. The performance on the rectangular data is considered the benchmark, against which we assess the relative accuracy of JMSSM-CT and JMSM-DT when estimating the structural parameters with the ragged data.

4.3. Performance characteristics. To assess the performance of each method, we simulate 250 observational data sets using the above approach, and evaluate the absolute bias, root mean squared error (RMSE) and coverage probability (CP) for estimating $\psi$. The CP is evaluated with normality-based confidence intervals using the robust sandwich variance estimator.

Figure 2 suggests that the weight estimator (iv), using both the flexible tree-based survival model and the kernel function estimator of the treatment initiation intensity, yielded the lowest biases in estimating both $\psi_1$ and $\psi_2$. By contrast, the weight estimator (i), via the usual main-effects Cox regression model along with the Nelson-Aalen baseline intensity estimator, produced the largest estimation bias. Applying kernel function smoothing to the Nelson-Aalen estimator led to bias reduction for both the Cox (approach (ii)) and tree-based survival models (approach (iv)) for the treatment process. Flexible modeling of the intensity ratio function has a larger effect in reducing the bias in structural parameter estimates than smoothing the nonparametric baseline intensity estimator. For example, compared to approach (ii), approach (iii) further reduced the mean absolute bias (MAB) in estimating $\hat{\psi}_1$ by approximately 67%. Supplementary Table 1 summarizes the MAB, RMSE and CP for the four weight estimators and similarly suggests that approach (iv) led to the smallest MAB and RMSE, and provided close to nominal CP with the robust sandwich variance estimator.

FIG 2. Biases in the estimates of ψ1 and ψ2 among 250 data replications using four approaches to estimate the weights as described in Section 3.3. Approach (i) uses a main-effects Cox regression model and the Nelson-Aalen estimator for the baseline intensity. Approach (ii) uses kernel function smoothing of the Nelson-Aalen estimator in approach (i). Approach (iii) uses a survival forests model that accommodates time-varying covariates and the Nelson-Aalen estimator for the baseline intensity. Approach (iv) uses kernel function smoothing of the Nelson-Aalen estimator in approach (iii).

The second set of simulations benchmarks the performance of JMSSM-CT versus JMSM-DT on data with fully aligned follow-up time points and compares how well each method recovers the benchmark performance when the longitudinal measurements are irregularly spaced. Table 1 displays the MAB, RMSE and CP for each of the two methods under both data settings, and Supplementary Figure 1 visualizes the distributions of biases across 250 data replications. In the rectangular data setting with fully aligned time points, compared to JMSM-DT, JMSSM-CT had similar CP but smaller MAB and RMSE. As the sparsity of longitudinal measurements increased and the time intervals became unevenly spaced, the proposed JMSSM-CT could still recover the benchmark performance, whereas JMSM-DT showed deteriorating performance (larger MAB and RMSE and lower CP), with a larger performance decline under coarser discretization of the follow-up time.
Supplementary Table 2 summarizes the distribution of estimated individual time-varying weights from one random replication of the ragged data for JMSSM-CT and JMSM-DT. Overall, JMSSM-CT with the weight estimator (iv) provided the smallest maximum/minimum weight ratio (2.36/0.68) and no extreme or spiky weights.

TABLE 1. Comparing the proposed method JMSSM-CT in continuous time with JMSM-DT in discrete time in estimating the treatment effect ψ on the basis of mean absolute bias (MAB), root mean squared error (RMSE) and coverage probability (CP) across 250 data replications. In the estimation of the weights, the weight estimator (iv) was used for JMSSM-CT and the random forests adapted into our recurrent-events framework (Section 3.2) was used for JMSM-DT. Both methods were implemented on the "rectangular" simulation data with 100 aligned time points for each individual and on the "ragged" data with unaligned time points. With the ragged data, the follow-up time was discretized in increments of 0.5, 1 and 2 days for JMSM-DT.

5.1. COVID-19 data. We apply the proposed method JMSSM-CT to a comprehensive COVID-19 data set drawn from the Epic electronic medical records system of the Mount Sinai Medical Center, and draw causal inferences about the comparative effectiveness of multiple COVID-19 treatment strategies. The data set includes 11,286 de-identified unique adult patients (≥18 years of age) who were diagnosed with COVID-19 and hospitalized within the Mount Sinai Health System between February 25, 2020 and February 26, 2021. A confirmed case of COVID-19 was defined as a positive test result from a real-time reverse-transcriptase PCR-based clinical test carried out on nasopharyngeal swab specimens collected from the patient (Wang et al., 2020). We focus on the comparative effectiveness of four treatment classes that are of most clinical interest: (i) remdesivir; (ii) dexamethasone; (iii) anti-inflammatory medications other than corticosteroids; and (iv) corticosteroids other than dexamethasone. We defined treatment classes by carefully reviewing the medications administered to patients. For example, the dexamethasone class includes both oral and intravenous dexamethasone; and the corticosteroids-other-than-dexamethasone class includes oral and intravenous hydrocortisone, oral and intravenous methylprednisolone, intravenous prednisolone, and oral and intravenous prednisone. Detailed definitions of the four treatment classes are provided in Supplementary Table 3. The observed treatment patterns are visualized in Figure 1; patients could be simultaneously taking two or more treatment classes, or they could switch from one treatment class to another during their hospital stays. Following suggestions by our clinician investigators, we assumed that the following time-fixed and time-varying confounders were sufficient to predict both treatment decisions and the outcome (i.e., assumption (A2) holds): age, sex, race, ethnicity, D-dimer levels (the degradation product of crosslinked fibrin, reflecting ongoing activation of the hemostatic system), serum creatinine levels (a waste product that forms when creatine breaks down, indicating how well the kidneys are working), whether the patient used tobacco at the time of admission, history of comorbidity represented by a set of binary variables (hypertension, coronary artery disease, cancer, diabetes, asthma and chronic obstructive pulmonary disease), hospital site, and patient oxygen levels (definition provided in Supplementary Table 4).
The time-varying confounding variables were D-dimer levels, serum creatinine levels and patient oxygen levels.

FIG 3. Trajectories of D-dimer levels over the course of hospital stay for 9 randomly chosen patients. Symbols represent the types of treatment classes received by a patient at a given time.

The average age of this sample population is 64.6 years, with a standard deviation of 18.1. About 54% of the patients were male and 46% female. Hispanic patients accounted for about 26% of the patient population, and the racial composition is 29% White, 25% Black, 6% Asian and 40% Other. Summary statistics for time-fixed confounders are presented in Supplementary Table 5. Time-varying confounders were measured repeatedly over the course of the hospital stay. Figure 3 displays trajectories of D-dimer levels for 9 randomly chosen patients over the course of hospital stay. Supplementary Figures 2-3 show trajectories of the serum creatinine levels and patient oxygen levels. Considerable variability is observed across patients in both these time-varying measures and the treatments. We considered a composite outcome, ICU admission or in-hospital death, whichever occurred first. The outcome may be right censored by hospital discharge or administratively censored on $t_o$ = February 26, 2021, the date on which the database for the current analysis was locked.

5.2. Fitting the joint marginal structural survival models. We implemented all four approaches discussed in Section 3.3 to estimate the time-varying weights for the JMSSM-CT model. Additionally, we discretized the time in increments of 1, 3 and 5 days, and applied the discrete-time-based method JMSM-DT to compare with our proposed JMSSM-CT method. The weight model included all time-fixed and time-varying confounders listed in Supplementary Table 5. In the structural model (12), pairwise treatment interactions were included if there were sufficient data points supporting the joint use of the pair of treatments.

5.3. Results. Using the stabilized inverse probability weights to correct for time-varying confounding and censoring, the structural model parameter estimates $\hat{\psi}$ (log hazard ratio) and the associated 95% confidence intervals are provided in Table 2. Echoing the findings from our simulation study (Section 4), the weight estimator (iv), using the random survival forests model and kernel function smoothing of the Nelson-Aalen estimator, produced the narrowest confidence intervals. By contrast, the weight estimator (i), using the main-effects Cox regression model and the non-smoothed Nelson-Aalen estimator, led to the widest confidence intervals. As a result, using the weight estimator (iv), we observe a significant treatment benefit with dexamethasone (−0.20 (−0.35, −0.06)) and remdesivir (−0.53 (−0.75, −0.31)), and an added treatment benefit when remdesivir is used in combination with corticosteroids other than dexamethasone. Using the weight estimator (i), none of the main or interactive treatment effects appeared to be statistically significant.

TABLE 2. The joint and interactive effect estimates $\hat{\psi}$ (log hazard ratio) of COVID-19 treatments and associated 95% confidence intervals (CI), using the COVID-19 dataset drawn from the Epic electronic medical records system at the Mount Sinai Medical Center. The composite outcome of in-hospital death or admission to ICU was used. To estimate the weights, four approaches (i)-(iv) (Section 3.3) were used for JMSSM-CT. Confidence intervals were estimated via the robust sandwich variance estimators. "×" denotes treatment interaction.
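For reference, once the continuous-time weights are in hand, the final weighted fit of the structural model can be carried out with standard software. Below is a minimal sketch with synthetic data and placeholder weights; all names and values are hypothetical and do not reflect the study data.

```r
library(survival)

set.seed(7)
n <- 300
dat <- data.frame(
  id    = 1:n,
  start = 0,
  stop  = rexp(n, 0.1),
  event = rbinom(n, 1, 0.3),
  A1    = rbinom(n, 1, 0.5),
  A2    = rbinom(n, 1, 0.4),
  w     = runif(n, 0.7, 1.4)   # placeholder stabilized weights Omega * Omega^C
)
# Weighted Cox fit of model (1); robust = TRUE requests the sandwich
# standard errors described in Web Appendix S1.3.
fit <- coxph(Surv(start, stop, event) ~ A1 + A2 + A1:A2 + cluster(id),
             data = dat, weights = w, robust = TRUE)
summary(fit)   # robust SEs for psi1, psi2, psi3
```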
To obtain further insights into the operating characteristics of each method, we summarize the distribution of the estimated individual weights in Table 3. As one can clearly see, the weight estimator (i) produced a substantial number of extreme weights, with a minimum of 0.0001 and a maximum of 63. By comparison, estimator (iv) generated no spiky weights, with a mean value close to one. There is little difference in the weight distribution between estimator (ii) and estimator (iii), both of which mitigated the issue of extreme weights, but not to the same degree as estimator (iv). Figure 4 shows a side-by-side comparison of the time-varying weights at 7, 14, 21 and 28 days since hospital admission estimated using the four weight estimators. The weight estimator (iv) produced no extreme weights at any of the time points. An increasing number of extreme weights was generated as the modeling flexibility decreased or when the baseline intensity estimator was not smoothed.

TABLE 3. The distribution of the individual time-varying weights estimated from the COVID-19 data. Four approaches (i)-(iv) described in Section 3.3 were used for the weight estimation.

Corroborating findings from our simulation study, discretizing the time can lead to a loss of information and efficiency, as suggested by the widths of the confidence intervals of ψ for JMSSM-CT and JMSM-DT shown in Table 4. JMSSM-CT yielded the narrowest confidence intervals, whereas for JMSM-DT, the width of the confidence intervals grows as the discretization interval increases.

TABLE 4. Comparing the proposed JMSSM-CT with the discrete-time-based method JMSM-DT in estimating the joint and interactive effects $\hat{\psi}$ (log hazard ratio) of COVID-19 treatments and associated 95% confidence intervals (CI), using the COVID-19 dataset drawn from the Epic electronic medical records system of the Mount Sinai Medical Center. The composite outcome of in-hospital death or admission to ICU was used. Confidence intervals were estimated via the robust sandwich variance estimators. "×" denotes treatment interaction. The weight estimator (iv) was used for JMSSM-CT. For JMSM-DT, the follow-up time was discretized in increments of 1, 3, and 5 days. The smallest unit of time in the COVID-19 data is 1 day.

Using the parameter estimates $\hat{\psi}$, we further computed the counterfactual survival curves under each treatment regimen. Figure 5 presents the counterfactual survival curves, using $\hat{\psi}$ estimated by the most promising weight estimator (iv), for ICU admission or death, whichever occurs first, among patients with COVID-19 infection, under five treatment regimens given upon admission to hospital. Among the four main treatment classes, remdesivir had significantly better treatment benefits, followed by dexamethasone, than the two alternative treatment classes: anti-inflammatory medications other than corticosteroids and corticosteroids other than dexamethasone. Interestingly, remdesivir and corticosteroids other than dexamethasone had a significant treatment interaction effect, suggesting additional survival benefit when they are used in combination. This is demonstrated by the highest counterfactual survival curve under the concomitant use of these two types of medications.
5.4. Sensitivity analysis to assess the impact of treatment ordering. When multiple time-varying treatments are under investigation, the analyst may face the question of the optimal treatment ordering by which the joint treatment weights $\Omega^{A_1, \ldots, A_W}(t_{K_i})$ (Section 3.4) are estimated. In our simulation and case study, we used the default decreasing order of "treatment sizes". Note that the "treatment sizes" considered here do not refer to the sizes of mutually exclusive treatment groups, but simply the numbers of patients who received the treatments at some point during the hospital stay. In our COVID-19 dataset, 7830 patients took dexamethasone ($A_1$) at some point in time following hospital admission, 4943 took corticosteroids other than dexamethasone ($A_2$), 4103 received remdesivir ($A_3$), and 2844 were treated with anti-inflammatory medications other than corticosteroids ($A_4$). We estimated the joint treatment weights by multiplying the sequence of conditionals,

$$\Omega^{A_1, A_2, A_3, A_4}(t) = \Omega^{A_1}(t)\, \Omega^{A_2}(t)\, \Omega^{A_3}(t)\, \Omega^{A_4}(t),$$

where each factor conditions on the histories of the treatments preceding it in the assumed order. To evaluate whether treatment ordering can impact the causal inferences about treatment effects, we conducted a sensitivity analysis in which we explored four choices of treatment order: decreasing order of treatment sizes, increasing order of treatment sizes, and two random choices of treatment order. As demonstrated in Table 5, the different treatment orders by which the joint treatment weights were estimated did not lead to appreciable or directional changes in the estimate of ψ. However, more efficiency was gained by using the decreasing order of treatment sizes (first conditioning on the treatment used by the most patients), as suggested by the narrowest confidence intervals of ψ.

6. Discussion. Motivated by inconclusive real-world evidence for the comparative effectiveness of multiple treatment strategies for COVID-19, we have developed a joint marginal structural survival model and novel weighting schemes to address time-varying confounding and censoring in continuous time. There are three main advantages of our proposed method. First, this approach casts a complex time-varying treatment with irregular "start/stop" switches as a process of recurrent events, where treatment initiation can be considered under the recurrent-event framework with discontinuous intervals of eligibility. This innovative formulation enables us to address complex time-varying confounding by modeling the intensity processes of the filtered counting processes for complex time-varying treatments. Second, the proposed method is able to handle a complex longitudinal dataset on its own terms, without discretizing and artificially aligning measurement times, which would lead to less accurate and less efficient treatment effect estimates, as demonstrated by our simulations.

TABLE 5. The joint and interactive effect estimates $\hat{\psi}$ (log hazard ratio) of COVID-19 treatments and associated 95% confidence intervals (CI), using the COVID-19 dataset drawn from the Epic electronic medical records system of the Mount Sinai Medical Center. The composite outcome of in-hospital death or admission to ICU was used. The weight estimator (iv) (Section 3.3) was used. The joint treatment weights were estimated using four different choices of treatment order. Confidence intervals were estimated via the robust sandwich variance estimators. "×" denotes treatment interaction. $A_1$ = dexamethasone. $A_2$ = corticosteroids other than dexamethasone. $A_3$ = remdesivir. $A_4$ = anti-inflammatory medications other than corticosteroids. "→" denotes treatment order.
Third, modern machine learning techniques designed for censored survival data and smoothing techniques for the baseline intensity function can be used easily with our weighting method to further improve the treatment effect estimator over conventional parametric formulations. We have also introduced a simulation algorithm that is compatible with the complex data structures of our proposed modeling framework, and demonstrated the accuracy of the proposed method for estimating causal parameters.

Our approach can be extended in the following two directions. First, we considered a joint marginal structural proportional hazards model and a tailored simulation algorithm to generate datasets of complex time-varying structures that are compatible with the proportional hazards model. It may be worthwhile to develop alternative joint marginal structural survival models, such as the structural additive hazards model, and assess the robustness of different structural models for estimating counterfactual survival functions under different data generating processes. Second, we have maintained the conditional exchangeability assumption in our work. This is a standard causal structural assumption in the literature on addressing time-varying confounding. Although untestable using the observed data, our clinician investigators supported the validity of the assumption in the COVID-19 dataset upon review of the time-varying confounders. For future research, expanding the methodology to address baseline unmeasured confounding (Hogan, Daniels and Hu, 2014; Hu et al., 2022) and developing sensitivity analysis approaches to capture the impact of time-varying unmeasured confounding in continuous time for our model would be a worthwhile and important contribution.

7. Software. R code to implement the proposed methods and replicate our simulation studies is provided on the GitHub page of the first author: https://github.com/liangyuanhu/JMSSM-CT.

S1.1. Kernel function smoothing in approaches (ii) and (iv). For brevity of exposition, consider a simple multiplicative intensity model for treatment initiation,

$$\lambda(t) = Y(t)\, \lambda_0(t)\, r\{L(t), \beta\},$$

where $\lambda_0(t)$ is the baseline intensity rate function, $r\{L(t), \beta\} > 0$ is the (time-dependent) intensity ratio function parameterized by $\beta$, and $Y(t)$ is the at-risk indicator. The cumulative baseline hazard is $\Lambda_0(t) = \int_0^t \lambda_0(s)\, ds$. The kernel function estimator for $\lambda_0(t)$ is derived by smoothing the increments of the Nelson-Aalen estimator $\hat{\Lambda}_0$ as

$$\hat{\lambda}_0(t) = \frac{1}{b} \int K\left(\frac{t - s}{b}\right) d\hat{\Lambda}_0(s),$$

where $K$ is the kernel function, a bounded function on $[-1, 1]$ with integral 1, and the bandwidth $b$ is a positive parameter governing the amount of smoothness. Commonly used kernel functions include the standard normal density function $K(x) = \phi(x)$ and the Epanechnikov kernel function $K(x) = 0.75(1 - x^2)$, $|x| \le 1$. The kernel function estimator $\hat{\lambda}_0(t)$ is consistent and asymptotically normal provided that there exists a sequence of positive constants $\{a_n\}$, increasing to infinity as $n \to \infty$, such that the bandwidth tends to zero more slowly than $a_n^{-2}$ as $n \to \infty$. Placed in the framework of recurrent events for a treatment $A_w$ that can have multiple "start/stop" switches, as described in Section 3.1, the intensity of the $q$th treatment initiation $\lambda^{A_w}_q(t)$ is smoothed by means of the kernel function estimator of the baseline intensity function $\lambda^{A_w}_{0q}(t)$. In our simulations (Section 4) and the COVID-19 case study (Section 5), we used both the standard normal density $\phi(x)$ and the Epanechnikov kernel functions, and they yielded similar results.
We presented results from the normal density kernel function. Following Andersen et al. (1993), we chose the optimal bandwidth $b$ as the value that minimizes the mean integrated squared error (MISE).

S1.2. Random survival forests accommodating time-varying covariates. Approach (iii) uses a recent random survival forests model that accommodates time-varying covariates to reduce the parametric assumptions about the form of the intensity ratio function required by the usual proportional intensity regression model. The forest method estimates the survival function in three steps. In step (1), the data are reformatted into the counting-process structure; that is, for individual $i$, the time-varying covariate history is broken into records over nonoverlapping time intervals, each record carrying the covariate values over that interval. The counting-process-style records from the $N$ subjects are pooled to create a list of pseudo-subjects, each corresponding to a left-truncated right-censored (LTRC) observation, and the set of pseudo-subjects is treated as if they were independent. In step (2), an LTRC forest is grown on the pseudo-subjects; and in step (3), the survival function for each individual is constructed from the forest predictions along the individual's observed covariate path.

We now briefly describe the forest algorithms. The method extends relative risk forests, which combine relative risk trees (LeBlanc and Crowley, 1992) with random forest methodology (Breiman, 2001), to LTRC data by modifying the splitting criteria. The splitting criterion under the relative risk framework is to maximize the reduction in the one-step deviance between the log-likelihood of the saturated model and the maximized log-likelihood. Let $R_h$ denote the set of observations that fall into node $h$. Let $\Lambda_0$ denote the baseline cumulative hazard function, $\varphi_h$ represent the nonnegative relative risk of node $h$, and $t_l$ and $\delta_l$ be the time and event indicator of the $l$th observation, $\forall l \in R_h$. Given the right-censored observations $(t_l, \delta_l)$, the full likelihood deviance residual for node $h$ is defined as

$$D(h) = 2 \sum_{l \in R_h} \left[ \delta_l \log\left\{\frac{\delta_l}{\hat{\Lambda}_0(t_l)\, \hat{\varphi}_h}\right\} - \left\{\delta_l - \hat{\Lambda}_0(t_l)\, \hat{\varphi}_h\right\} \right]. \qquad (2)$$

Two steps are needed to modify the splitting rule and obtain the deviance residual appropriate for LTRC data (Fu and Simonoff, 2017). First, compute the estimated cumulative hazard function $\hat{\Lambda}_0(\cdot)$ based on all pseudo-subject observations. Second, replace $\hat{\Lambda}_0(t_l)$ in (2) with $\hat{\Lambda}_0(t_{l+1}) - \hat{\Lambda}_0(t_l)$, and replace $\delta_l$ in (2) with the event indicator of the corresponding pseudo-subject record. We refer readers to the original paper for a more detailed description of the random survival forests model.

S1.3. Variance estimation. Since our estimator for $\psi$ (the marginal structural proportional hazards model parameter) can be considered a solution to the weighted partial score equation, we use the robust sandwich variance estimator as a convenient device to construct confidence intervals. The robust sandwich variance estimator has been considered, for example, in Xiao, Abrahamowicz and Moodie (2010) and Karim et al. (2017, 2018), and has been shown to be at most conservative under the discrete-time setting. We use this estimator for inference in conjunction with our continuous-time stabilized inverse probability weights, and formally evaluate its performance in our simulations. We briefly describe the robust sandwich variance estimator below. With the continuous-time weights, we focus on equation (11) in the main text with two treatments:

$$U(\psi) = \sum_{i=1}^n \int_0^{t_o} \Omega^{A_1, A_2}\, \Omega^C(G_i) \left\{ Z_i(t) - \bar{Z}(\psi, t) \right\} dN^T_i(t) = 0,$$

where $\Omega^{A_1, A_2}\, \Omega^C(G_i)$ is the weight for the time-varying treatments $A_1$ and $A_2$ and censoring, $Z_i(t) = \{A_{1,i}(t), A_{2,i}(t), A_{1,i}(t) A_{2,i}(t)\}^\top$, and

$$\bar{Z}(\psi, t) = \frac{\sum_{l \in R^T_t} \Omega^{A_1, A_2}_l \Omega^C_l\, Y^{**T}_l(t)\, r(A_{1,l}, A_{2,l}, t)\, Z_l(t)}{\sum_{l \in R^T_t} \Omega^{A_1, A_2}_l \Omega^C_l\, Y^{**T}_l(t)\, r(A_{1,l}, A_{2,l}, t)}.$$

In the above definition, $R^T_t$ refers to the collection of subjects still at risk for the outcome event at time $t$, $Y^{**T}$ is the at-risk function for the outcome event, and $r(a_1, a_2, t) = \exp\{\psi_1 a_1(t) + \psi_2 a_2(t) + \psi_3 a_1(t) a_2(t)\}$. Now further define

$$\Sigma_0 = -\frac{\partial U(\psi)}{\partial \psi^\top}, \qquad \Sigma_1 = \sum_{i=1}^n \left[\int_0^{t_o} \Omega^{A_1, A_2}\, \Omega^C(G_i) \left\{ Z_i(u) - \bar{Z}(\psi, u) \right\} d\hat{M}_i(u)\right]^{\otimes 2},$$

with $a^{\otimes 2} = a a^\top$ for any vector $a$. Then the robust sandwich variance estimator takes the form $\Sigma_0^{-1} \Sigma_1 \Sigma_0^{-1}$ (Binder, 1992), where $d\hat{M}_i(u)$ is the martingale increment based on the counting process for the outcome event.
S1.3. Variance estimation. Since our estimator of $\psi$ (the marginal structural proportional hazards model parameter) can be viewed as the solution to a weighted partial score equation, we consider the robust sandwich variance estimator as a convenient device for constructing confidence intervals. The robust sandwich variance estimator has been considered, for example, in Xiao, Abrahamowicz and Moodie (2010) and Karim et al. (2018), and has been shown to be at most conservative in the discrete-time setting. We use this estimator for inference in conjunction with our continuous-time stabilized inverse probability weights, and formally evaluate its performance in our simulations. We briefly describe the robust sandwich variance estimator below; a short software sketch follows this subsection.

With the continuous-time weights, we focus on equation (11) in the main text with two treatments,

$$U(\psi) = \sum_{i=1}^{n}\int_{0}^{\infty} \Omega^{A_1,A_2}\Omega^{C}(G_i)\left\{Z_i(t) - \frac{S^{(1)}(\psi,t)}{S^{(0)}(\psi,t)}\right\} dN_i^{T}(t) = 0,$$

where $\Omega^{A_1,A_2}\Omega^{C}(G_i)$ is the weight for the time-varying treatments $A_1$ and $A_2$ and for censoring, $Z_i(t) = \{A_{1i}(t), A_{2i}(t), A_{1i}(t)A_{2i}(t)\}^{\top}$, and, for $k = 0, 1, 2$,

$$S^{(k)}(\psi,t) = \sum_{j \in R_t^{T}} \Omega^{A_1,A_2}\Omega^{C}(G_j)\, Y_j^{**T}(t)\, r(a_{1j}, a_{2j}, t)\, Z_j(t)^{\otimes k}.$$

In the above definition, $R_t^{T}$ refers to the collection of subjects still at risk for the outcome event at time $t$, $Y^{**T}$ is the at-risk function for the outcome event, and $r(a_1, a_2, t) = \exp\{\psi_1 a_1(t) + \psi_2 a_2(t) + \psi_3 a_1(t)a_2(t)\}$. Now further define

$$\Sigma_0 = \sum_{i=1}^{n}\int_{0}^{\infty} \Omega^{A_1,A_2}\Omega^{C}(G_i)\left\{\frac{S^{(2)}(\psi,t)}{S^{(0)}(\psi,t)} - \left(\frac{S^{(1)}(\psi,t)}{S^{(0)}(\psi,t)}\right)^{\otimes 2}\right\} dN_i^{T}(t),$$

$$\Sigma_1 = \sum_{i=1}^{n}\left[\int_{0}^{\infty} \Omega^{A_1,A_2}\Omega^{C}(G_i)\left\{Z_i(u) - \frac{S^{(1)}(\psi,u)}{S^{(0)}(\psi,u)}\right\} d\widehat{M}_i(u)\right]^{\otimes 2},$$

with $a^{\otimes 2} = aa^{\top}$ for any vector $a$. Then the robust sandwich variance estimator takes the form $\widehat{\mathrm{var}}(\widehat{\psi}) = \Sigma_0^{-1}\Sigma_1\Sigma_0^{-1}$ (Binder, 1992), where $d\widehat{M}_i(u)$ is the estimated martingale increment based on the counting process for the outcome event. The sandwich variance estimator is obtained by evaluating both $\Sigma_0$ and $\Sigma_1$ at the estimated weights and $\widehat{\psi}$. Because the above robust variance estimator treats the weights $\Omega^{A_1,A_2}\Omega^{C}(G_i)$ as fixed, known values (Binder, 1992), it can yield conservative (but still valid) inference. With time-invariant weights estimated by logistic regression in the cross-sectional treatment setting, a corrected robust sandwich variance estimator has been derived to improve variance estimation for hazard ratio parameters. However, an extension to continuous-time weights is not trivial and is currently unavailable. Alternatively, resampling methods such as the bootstrap (Brumback et al., 2004; Cole and Hernán, 2008) could be used to draw more robust inference about $\psi$ that accounts for the uncertainty in the estimated weights. Although theoretically valid, this resampling approach is not pursued in our current work because of the substantially more intensive computation associated with repeated estimation of the complex weights under the recurrent event framework.
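In software, the point estimate $\widehat{\psi}$ and a robust sandwich variance of this form (with the estimated weights held fixed) can be obtained from a weighted Cox fit. The sketch below uses `survival::coxph` with `cluster(id)`, whose clustered "robust" variance is of the $\Sigma_0^{-1}\Sigma_1\Sigma_0^{-1}$ type above; the data frame `dat` and its columns, including the estimated continuous-time stabilized weight `w`, are hypothetical.

```r
library(survival)

## Weighted Cox fit of the joint MSM for two time-varying treatments;
## `w` holds the estimated continuous-time stabilized weights, and the
## clustered robust variance treats them as fixed, known values.
msm <- coxph(Surv(tstart, tstop, event) ~ A1 + A2 + A1:A2 + cluster(id),
             data = dat, weights = w)

summary(msm)   # "robust se" column: sandwich standard errors for psi
confint(msm)   # Wald confidence intervals (at most conservative)
```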
S2. Simulation algorithm. The following algorithm generates the time-varying treatments in our simulations; an R sketch of the A_1 step is given after the algorithm.

SET
γ ← (γ_0, γ_1, ..., γ_11) ← [log(2/7), 1/2, -1/2, -log(3/5), 0.8, 0.5, 0.8, -0.5, 1/2, 1.2, -0.6, -0.3] (parameter vector for generating A_1);
η ← (η_0, η_1, ..., η_11) ← [log(3/7), 1/3, -1/3, -log(2/5), 0.9, 0.6, 0.8, -0.5, 1/3, 0.9, -0.6, -0.4] (parameter vector for generating A_2);
ψ_1 ← -0.5 (true log-hazard value representing the effect of treatment A_1);
ψ_2 ← -0.3 (true log-hazard value representing the effect of treatment A_2).

COMPUTE
for ID ← 1 to n (for each individual) do
  for each measurement time m = 1, ..., M do
    logit P{A_1(m) = 1 | A_1(m-1), A_2(m-1), L_1(·), L_2(·), n_{A_1}, Y(m) = 0}
      = γ_0 + γ_1 A_1(m-1) + γ_2 {L_2(m-1)}^2 + γ_3 {L_1(m-1)}^2 + γ_4 A_1(m-1)L_1(m)
        + γ_6 L_1(m)L_2(m) + γ_7 A_1(m-1)L_2(m) + γ_8 A_2(m-1)
        + γ_9 A_2(m-1)L_1(m) + γ_10 A_2(m-1)L_2(m) + γ_11 n_{A_1}
    if A_1(m-1) = 0 or m-1 = τ_{A_1} (eligible to initiate A_1) then draw A_1(m) from the above model
      if A_1(m) = 1 then
        δ_{A_1} ∼ zero-truncated Poisson(10) (treatment duration after initiation)
        τ_{A_1} ← m + δ_{A_1}; A_1(m+1) : A_1(min(τ_{A_1}, M)) ← 1
    logit P{A_2(m) = 1 | A_1(·), A_2(m-1), L_1(·), L_2(·), n_{A_2}, Y(m) = 0}
      = η_0 + η_1 A_1(m-1) + η_2 {L_2(m-1)}^2 + η_3 {L_1(m-1)}^2 + η_4 A_1(m)L_1(m)
        + η_6 L_1(m)L_2(m) + η_7 A_1(m)L_2(m) + η_8 A_2(m-1)
        + η_9 A_2(m-1)L_1(m) + η_10 A_2(m-1)L_2(m) + η_11 n_{A_2}
    if A_2(m-1) = 0 or m-1 = τ_{A_2} (eligible to initiate A_2) then draw A_2(m) from the above model
      if A_2(m) = 1 then
        δ_{A_2} ∼ zero-truncated Poisson(9) (treatment duration after initiation)
        τ_{A_2} ← m + δ_{A_2}; A_2(m+1) : A_2(min(τ_{A_2}, M)) ← 1

Here n_{A_1} and n_{A_2} denote the numbers of prior initiations of A_1 and A_2, Y(m) = 0 indicates that the outcome event has not yet occurred, and τ_{A_1} (τ_{A_2}) is the scheduled end of the current A_1 (A_2) treatment course.
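The following R sketch implements the A_1-initiation step of the algorithm for a single individual, given placeholder covariate and A_2 paths (in the full algorithm these are generated jointly). Helper names are illustrative; the linear predictor includes exactly the terms listed above, so the γ_5 coefficient (`g[6]` below) is unused.

```r
expit <- function(x) 1 / (1 + exp(-x))
## Zero-truncated Poisson draw via the inverse-CDF trick
rztpois <- function(n, lambda) qpois(runif(n, dpois(0, lambda), 1), lambda)

## A1-initiation step for one individual; index t + 1 stores time t.
simulate_A1 <- function(M, L1, L2, A2, g) {
  A1  <- numeric(M + 1)   # A1[t + 1] = treatment status at time t
  tau <- 0                # scheduled end of the current treatment course
  nA1 <- 0                # number of prior A1 initiations
  for (m in 1:M) {
    if (A1[m] == 0 || (m - 1) == tau) {   # eligible to (re)initiate
      lp <- g[1] + g[2] * A1[m] + g[3] * L2[m]^2 + g[4] * L1[m]^2 +
        g[5] * A1[m] * L1[m + 1] + g[7] * L1[m + 1] * L2[m + 1] +
        g[8] * A1[m] * L2[m + 1] + g[9] * A2[m] +
        g[10] * A2[m] * L1[m + 1] + g[11] * A2[m] * L2[m + 1] + g[12] * nA1
      A1[m + 1] <- rbinom(1, 1, expit(lp))
      if (A1[m + 1] == 1) {               # stay on treatment for delta visits
        delta <- rztpois(1, 10)
        tau   <- m + delta
        nA1   <- nA1 + 1
        if (m + 1 <= min(tau, M)) A1[(m + 1):min(tau, M) + 1] <- 1
      }
    }
  }
  A1
}

gam <- c(log(2/7), 1/2, -1/2, -log(3/5), 0.8, 0.5, 0.8, -0.5, 1/2, 1.2, -0.6, -0.3)
M   <- 20
a1  <- simulate_A1(M, L1 = rnorm(M + 1), L2 = rnorm(M + 1),
                   A2 = rbinom(M + 1, 1, 0.3), g = gam)
```

The A_2 step follows the same pattern with η in place of γ and a zero-truncated Poisson(9) duration.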
[Figure: The distribution of the estimated individual time-varying weights from one random replication of the "ragged" longitudinal data with unaligned time points, for the proposed JMSSM-CT versus JMSM-DT. To estimate the weights, random forests were used for JMSM-DT, and the four approaches (i)-(iv) (Section 3.4) were used for JMSSM-CT.]

S3.2. Treatment classes for COVID-19. In Table 3, we provide detailed definitions of the four treatment classes for COVID-19 whose comparative effectiveness on in-hospital death was investigated in Section 6.

TABLE 3
Definitions of the treatment classes for COVID-19.

Corticosteroids other than dexamethasone: Hydrocortisone (po), Hydrocortisone (iv), Methylprednisolone (po), Methylprednisolone (iv), Prednisolone (iv), Prednisone (po), Prednisone (iv).

Anti-inflammatory medications other than corticosteroids: Alpha-1-Proteinase Inhibitor (iv), Anakinra (iv), Azathioprine (po), Belatacept (iv), Dexamethasone (iv), Dexamethasone (po), Eculizumab (iv), Envarsus (iv), Sarilumab (iv), Gengraf (po), Gengraf (iv), Hydrocortisone (po), Hydrocortisone (iv), Ibrutinib (po), Immune Globulin (iv), Infliximab (iv), Methylprednisolone (po), Methylprednisolone (iv), Montelukast (po), Prednisolone (iv), Prednisone (po), Prednisone (iv), Ruxolitinib (po), Tocilizumab (iv), Apremilast (po), Celecoxib (po), Dasatinib (po), Everolimus (iv), Gemtuzumab (iv), Ibuprofen (iv), Ibuprofen (po), Ifosfamide (iv), Leflunomide (po), Mesalamine (iv), Methotrexate (iv), Methylphenidate (iv), Mycophenolate (iv), Naproxen (po), Rituximab (iv), Sulfasalazine (po), Tacrolimus (iv), Pacritinib (po), Risankizumab (iv), Daratumumab (iv), Talquetamab (iv).

S3.3. Patient oxygen levels. We describe in Table 4 how patient oxygen levels are categorized based on the use of a ventilator.

TABLE 4
Ventilator use: Tracheotomy, Transtracheal oxygen therapy, Ventilator, Endotracheal tube, T-shaped tubing connected to an endotracheal tube, Nasal synchronized intermittent mandatory ventilation.

Acknowledgments. This work was supported in part by award ME_2017C3_9041 from the Patient-Centered Outcomes Research Institute and by grants R21 CA245855-01 and P30CA196521-01 from the National Cancer Institute. Conflict of Interest: None declared.

REFERENCES

Survival and event history analysis: a process point of view.
Cox's regression model for counting processes: a large sample study.
Statistical models based on counting processes.
Relative incidence estimation from case series for vaccine safety evaluation.
Dexamethasone in hospitalized patients with Covid-19.
Marginal structural models to estimate the joint causal effect of nonrandomized treatments.
Structural accelerated failure time models for survival analysis in studies with time-varying treatments.
A Bayesian perspective on assessing sensitivity to assumptions about unobserved data.
Estimating the effects of multiple time-varying exposures using joint marginal structural models: alcohol consumption, injection drug use, and HIV acquisition.
Causal comparative effectiveness analysis of dynamic continuous-time treatment initiation rules with sparsely measured outcomes and death.
Estimating heterogeneous survival treatment effect in observational data using machine learning.
Modeling the causal effect of treatment initiation time on survival: application to HIV/TB co-infection.
A flexible sensitivity analysis approach for unmeasured confounding with multiple treatments and a binary outcome with application to SEER-Medicare lung cancer data.
Semiparametric inference in observational duration-response studies, with duration possibly right-censored.
Estimating inverse probability weights using super learner when weight-model specification is unknown in a marginal structural Cox model context.
Improving propensity score weighting using machine learning.
Analysis of longitudinal data with irregular, outcome-dependent follow-up.
Smoothing counting process intensities by means of kernel functions.
Estimation of the time-dependent accelerated failure time model in the presence of confounding factors.
Association, causation, and marginal structural models.
Estimation and extrapolation of optimal treatment and testing strategies.
Causal inference in continuous time: an example on prostate cancer therapy.
A flexible parametric approach for estimating continuous-time inverse probability of treatment and censoring weights.
Variance estimation in inverse probability weighted Cox models.
Multiple self-controlled case series for large-scale longitudinal observational databases.
Hospitalised COVID-19 patients of the Mount Sinai Health System: a retrospective observational study using the electronic medical records.
Ensemble methods for survival data with time-varying covariates.
Simulation from structural survival models under complex time-varying data structures.
A global treatments for coronaviruses including COVID-19.
Fitting Cox's proportional hazards models from survey data.
Random forests.
Sensitivity analyses for unmeasured confounding assuming a marginal structural model for repeated measures.
Constructing inverse probability weights for marginal structural models.
Survival trees for left-truncated and right-censored data, with application to time-varying covariate data.
On the application of statistical learning approaches to construct inverse probability weights in marginal structural Cox models: hedging against weight-model misspecification.
Comparison of statistical approaches dealing with time-dependent confounding in drug effectiveness studies.
Relative risk trees for censored survival data.
Accuracy of conventional and marginal structural Cox model estimators: a simulation study.