Estimating average causal effects from patient trajectories
Dennis Frauen, Tobias Hatt, Valentyn Melnychuk, Stefan Feuerriegel
arXiv:2203.01228v1 [stat.ML], 2 March 2022

In medical practice, treatments are selected based on the expected causal effects on patient outcomes. Here, the gold standard for estimating causal effects is randomized controlled trials; however, such trials are costly and sometimes even unethical. Instead, medical practice is increasingly interested in estimating causal effects among patient subgroups from electronic health records, that is, observational data. In this paper, we aim at estimating the average causal effect (ACE) from observational data (patient trajectories) that are collected over time. For this, we propose DeepACE: an end-to-end deep learning model. DeepACE leverages the iterative G-computation formula to adjust for the bias induced by time-varying confounders. Moreover, we develop a novel sequential targeting procedure which ensures that DeepACE has favorable theoretical properties, i.e., is doubly robust and asymptotically efficient. To the best of our knowledge, this is the first work that proposes an end-to-end deep learning model for estimating time-varying ACEs. We compare DeepACE in an extensive number of experiments, confirming that it achieves state-of-the-art performance. We further provide a case study for patients suffering from low back pain to demonstrate that DeepACE generates important and meaningful findings for clinical practice. Our work enables medical practitioners to develop effective treatment recommendations tailored to patient subgroups.

Causal effects of treatments on patient outcomes are hugely important for decision-making in medical practice [47]. These estimates inform medical practitioners about the expected effectiveness of treatments and thus guide treatment selection. Notwithstanding, information on causal effects is also relevant for decision-making in other domains such as public health [9] and marketing [44].

Causal effects can be estimated from either randomized controlled trials (RCTs) or observational studies [33]. Even though RCTs present the gold standard for estimating causal effects, they are often highly costly, unfeasible in practice, or even unethical (e.g., medical professionals cannot withhold effective treatment from patients in need) [33]. Therefore, medical practice is increasingly relying on observational data to study causal relationships between treatments and patient outcomes. Nowadays, electronic health records are readily available, capture patient trajectories with high granularity, and thus provide rich observational data in medical practice [2, 26].

In this paper, we aim at estimating causal effects from observational data in the form of patient trajectories. Patient trajectories encode medical histories in a time-resolved manner and are thus of longitudinal form. However, estimating causal effects from observational data is subject to challenges [32, 13]. One reason is that the underlying treatment assignment mechanism is usually confounded with the patient outcomes. Another reason is that confounders may vary over time, which introduces additional dependencies and treatment-confounder feedback. As an example, consider a physician who assigns medications to patients over time.
Here, the patient outcome is a measurement of some vital signs indicating the patient's health status. Typically, the physician selects treatments based on a patient's time-varying covariates such as blood pressure or heart rate, which, however, also affect the patient outcome and are influenced by previous treatments. Hence, methods for estimating causal effects must adjust for time-varying confounders in order to produce unbiased estimates.

While there are works on estimating individualized causal effects [21, 5, 20], we are interested in average causal effects (ACEs). As such, ACEs give the expected difference in health outcomes when applying different treatment interventions at the level of patient subgroups. Estimating ACEs is especially important for medical practice for a number of reasons [32, 13]. (i) There are many settings in practice where interventions affect whole populations, such as in public health. For example, a government might be interested in the different effects of a stay-home order on COVID-19 spread for vaccinated vs. non-vaccinated people, or in the effect of a sugar tax on diabetes onset in society. Here, treatments affect subgroups of people, thus necessitating average causal effects. (ii) In medical practice, personalization is typically done by varying treatment recommendations across patient subgroups. To this end, medical research seeks to find increasingly more granular subgroups [25]. These should identify differences in disease dynamics and thus capture different phenotypes [2], so that different treatment guidelines are tailored to each subgroup [10]. To do so, one must again compare causal effects across patient cohorts, that is, average causal effects across patient cohorts. Motivated by such considerations from medical practice, our work aims at time-varying ACE estimation.

Proposed method: We propose an end-to-end deep learning model to estimate time-varying ACEs, called DeepACE. DeepACE combines a recurrent neural network and feed-forward neural networks to learn conditional expectations of factual and counterfactual outcomes under complex non-linear dependencies, based on which we then estimate time-varying ACEs. In DeepACE, we address time-varying confounding by leveraging the G-formula, which expresses the ACE as a sequence of nested conditional expectations based on observational data. Existing methods are limited in that they learn the nested conditional expectations separately by performing an iterative procedure [41]. In contrast, our end-to-end model DeepACE makes it possible to learn them jointly, leading to a more efficient use of information across time. We further develop a sequential targeting procedure by leveraging results from semi-parametric estimation theory in order to improve the estimation quality of DeepACE. The sequential targeting procedure perturbs ("targets") the outputs of DeepACE so that our estimator satisfies a semi-parametric efficient estimating equation. To achieve this, we propose a targeting layer and a targeted regularization loss for training. We then derive that DeepACE provides a doubly robust and asymptotically efficient estimator.

Our main contributions:
1. We propose DeepACE: the first end-to-end neural network for estimating time-varying average causal effects using observational data. DeepACE builds upon the iterative G-computation formula to address time-varying confounding.
2. We develop a novel sequential targeting procedure which ensures that DeepACE provides a doubly robust and asymptotically efficient estimator.

Estimating causal effects from observational data can be loosely grouped into static and longitudinal settings (Table 1).

Table 1: Methods for causal effect estimation, grouped by setting.
Causal effect    | Static setting                          | Longitudinal setting
Individual (=ITE) | e.g., TARNet [36], causal forest [45]  | e.g., RMSNs [21], CRN [5], G-Net [20]
Average (=ACE)    | e.g., TMLE [42], DragonNet [37]        | e.g., g-methods [24], LTMLE [39, 41], DeepACE (ours)

Extensive works focus on treatment effect estimation in static settings [36, 11, 14]. Two important methods that adopt machine learning for average treatment effect estimation in the static setting are: (i) targeted maximum likelihood estimation (TMLE) [42] and (ii) DragonNet [37]. TMLE is a plug-in estimator that takes a (machine learning) model as input and perturbs the predicted outcomes, so that the final estimator satisfies a certain efficient estimating equation. This idea builds upon semi-parametric efficiency theory and is often called targeting. Any estimator satisfying the efficient estimating equation is guaranteed to have desirable asymptotic properties. On the other hand, DragonNet is a neural network that incorporates a targeting procedure into the training process by extending the network architecture and adding a tailored regularization term to the loss function. This allows the model parameters to adapt simultaneously to provide a targeted estimate. To the best of our knowledge, there exists no similar targeting procedure for longitudinal data, and, to fill this gap, we later develop a tailored sequential targeting procedure.

In general, causal effect estimation for static settings is different from longitudinal settings due to time-varying confounding and treatment-confounder feedback [32]. The latter can appear when dealing with longitudinal observational data. Hence, methods for static causal effect estimation are biased when they are applied to longitudinal settings, thus leading to an inferior performance. For this reason, we later use methods for time-varying causal effect estimation as our prime baselines. Results for static baselines are reported in Appendix B.

Causal effect estimation in longitudinal settings is often also called time-varying causal effect estimation. Here, we distinguish individual and average causal effects.

Individual causal effects: There is a growing body of work on adapting neural networks for estimating individual causal effects in the longitudinal setting (also called individual treatment effects, or ITEs for short). These methods predict counterfactual outcomes conditioned on patient trajectories, and then output causal effects by taking differences between the factual and counterfactual outcome. Recurrent marginal structural networks (RMSNs) [21] use inverse probability weighting to learn a sequence-to-sequence model that addresses bias induced by time-varying confounders. The counterfactual recurrent network (CRN) [5] adopts adversarial learning to build balanced representations. The G-Net [20] incorporates the G-formula into a recurrent neural network and applies Monte Carlo sampling. In this paper, we are instead not interested in individual but average causal effects. Still, we later include the above state-of-the-art methods from ITE estimation (i.e., RMSNs, CRN, and G-Net) as baselines (we average the individual estimates). There are other methods for predicting counterfactual outcomes over time (e.g., [35, 38, 28]), which are not applicable due to different settings or assumptions (e.g., stronger unconfoundedness assumptions).
Average causal effects: Several methods for time-varying ACE estimation originate from the epidemiological literature. Here, so-called g-methods [24] are common. Examples of g-methods include marginal structural models [33], G-computation via the G-formula [32], and structural nested models [30]. These methods make linearity assumptions and are consequently not able to exploit non-linear dependencies within the data. Nevertheless, we include the g-methods as baselines. We are aware of only one work that leverages machine learning methods for time-varying ACE estimation: longitudinal targeted maximum likelihood estimation (LTMLE) [39]. LTMLE has two components: (i) LTMLE makes use of iterative G-computation. Mathematically, the G-formula can be expressed in terms of nested conditional expectations, which can then be learned successively by using arbitrary regression models. As such, each conditional expectation is learned separately, which is known as iterative G-computation. Our method, DeepACE, follows a similar approach but estimates the conditional expectations jointly. (ii) LTMLE targets the estimator by perturbing the predicted outcomes in each iteration to make them satisfy an efficient estimating equation. In contrast to LTMLE, our sequential targeting procedure is incorporated into the model training process, which allows all model parameters to be learned simultaneously. Later, we implement two variants of LTMLE as baselines, namely LTMLE with a generalized linear model (glm) [39] and LTMLE with a super learner [41].

Research gap: To the best of our knowledge, there exists no end-to-end machine learning model for time-varying ACE estimation. Hence, DeepACE is the first neural network that simultaneously learns to perform iterative G-computation and to apply a sequential targeting procedure. By learning all parameters jointly, we expect our end-to-end model to provide more accurate estimation results.

We build upon the standard setting for estimating time-varying ACEs [32, 39]. For each time step $t \in \{1, \dots, T\}$, we observe (time-varying) patient covariates $X_t \in \mathbb{R}^p$, treatments $A_t \in \{0, 1\}$, and outcomes $Y_{t+1} \in \mathbb{R}$. For example, we would model critical care for COVID-19 patients by taking blood pressure and heart rate as time-varying patient covariates, ventilation as treatment, and respiratory frequency as outcome. Modeling the treatments $A_t$ as binary variables is consistent with prior works [42, 37] and is standard in medical practice [33]: should one apply a treatment or not? At each time step $t$, the treatment $A_t$ directly affects the next outcome $Y_{t+1}$, and the covariates $X_t$ may affect both $A_t$ and $Y_{t+1}$. All $X_t$, $A_t$, and $Y_{t+1}$ may have direct effects on future treatments, covariates, and outcomes. The corresponding causal graph is shown in Fig. 1.

Figure 1: One possible example of a causal graph describing the data-generating process.

For notation, we denote the observed trajectory at time $t$ by $\bar{H}_t = (\bar{X}_t, \bar{A}_{t-1})$, where $\bar{X}_t = (X_1, \dots, X_t)$ and $\bar{A}_{t-1} = (A_1, \dots, A_{t-1})$. We always consider the lagged outcomes $Y_t$ to be included in the covariates $X_t$. We further have access to an observational dataset $\mathcal{D}$ that consists of $N$ independent patient trajectories for patients $i \in \{1, \dots, N\}$, i.e., $\mathcal{D} = \big(\{\bar{x}_T^{(i)}, \bar{a}_T^{(i)}, \bar{y}_{T+1}^{(i)}\}\big)_{i=1}^{N}$. Such patient trajectories are nowadays widely available in electronic health records [2]. For notation, we use a superscript $(i)$ to refer to patients (we omit it unless needed).
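For concreteness in the code sketches that follow, the dataset $\mathcal{D}$ can be thought of as stored in dense arrays. This layout is our own illustration and is not prescribed by the paper:

```python
import numpy as np

# Hypothetical in-memory layout of the observational dataset D
N, T, p = 1000, 15, 6     # patients, time steps, covariate dimension
X = np.zeros((N, T, p))   # time-varying covariates x_t (lagged outcomes included)
A = np.zeros((N, T))      # binary treatments a_t
Y = np.zeros(N)           # final outcomes y_{T+1}
a_int = np.ones(T)        # one treatment intervention (a_1, ..., a_T)
```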
We build upon the potential outcomes framework [34], which was extended in [33] to the longitudinal setting. We denote $Y_{t+1}(\bar{a}_t)$ as the potential outcome, which would have been observed at time $t+1$ if a treatment intervention $\bar{a}_t = (a_1, \dots, a_t)$ was applied. Note that $Y_{t+1}(\bar{a}_t)$ is unobserved if the treatment intervention $\bar{a}_t$ does not coincide with the treatment assignments $\bar{A}_t$ in the observational dataset. This is also known as the fundamental problem of causal inference [27]. Given two intended treatment assignments $\bar{a}_T = (a_1, \dots, a_T)$ and $\bar{b}_T = (b_1, \dots, b_T)$, we define the expected potential outcomes as

$$\theta_a = \mathbb{E}\left[Y_{T+1}(\bar{a}_T)\right] \quad \text{and} \quad \theta_b = \mathbb{E}\left[Y_{T+1}(\bar{b}_T)\right]. \quad (1)$$

Objective: We aim at estimating the average causal effect (ACE):

$$\psi = \theta_a - \theta_b. \quad (2)$$

To do so, we build upon the following standard assumptions in causal inference [33, 19]:

Assumption 1 (Consistency). The treatment assignment $\bar{A}_t = \bar{a}_t$ implies $Y_{t+1}(\bar{a}_t) = Y_{t+1}$, i.e., potential and observed outcomes coincide, for all $t \in \{1, \dots, T\}$.

Assumption 2 (Positivity). For all $t \in \{1, \dots, T\}$ and all histories $\bar{h}_t$ with $\mathbb{P}(\bar{H}_t = \bar{h}_t) > 0$, it holds that $0 < \mathbb{P}(A_t = 1 \mid \bar{H}_t = \bar{h}_t) < 1$.

Assumption 3 (Sequential ignorability). For all $t \in \{1, \dots, T\}$ and all interventions $\bar{a}_t$, it holds that $Y_{t+1}(\bar{a}_t) \perp A_t \mid \bar{H}_t$.

Assumption 3 implies that there are no unobserved confounders that influence both treatment assignments and outcomes. Together, Assumptions 1-3 allow us to identify the ACE $\psi$ from observational data [27]. The expected potential outcome $\theta_a$ can be expressed in terms of the observational data via the well-known G-formula [29]. For our setting, we consider a variant that uses iterated conditional expectations [31, 3] to write $\theta_a$ as

$$\theta_a = \mathbb{E}\Big[\mathbb{E}\big[\cdots \mathbb{E}\big[\mathbb{E}\left[Y_{T+1} \mid \bar{X}_T, \bar{A}_T = \bar{a}_T\right] \mid \bar{X}_{T-1}, \bar{A}_{T-1} = \bar{a}_{T-1}\big] \cdots \mid X_1, A_1 = a_1\big]\Big]. \quad (3)$$

For a better understanding of Eq. (3), we can equivalently write the parameter $\theta_a$ as the result of an iterative process. More precisely, we introduce recursively defined conditional expectations $Q^a_t$ that depend on the covariates $\bar{X}_{t-1}$ and interventions $\bar{a}_{t-1}$ via

$$Q^a_t = \mathbb{E}\left[Q^a_{t+1} \mid \bar{X}_{t-1}, \bar{A}_{t-1} = \bar{a}_{t-1}\right] \quad \text{for } t \in \{2, \dots, T+1\}, \qquad Q^a_{T+2} = Y_{T+1}. \quad (4)$$

Then, the expected potential outcome can be written as

$$\theta_a = \mathbb{E}\left[Q^a_2\right]. \quad (5)$$

In Eq. (4), the covariates $X_T, X_{T-1}, \dots$ are successively integrated out, and $\theta_a$ is obtained in Eq. (5) by averaging.

In the following, we review iterative G-computation [41], which is an iterative procedure that leverages Eq. (4) and Eq. (5) to estimate the expected potential outcome $\theta_a$ and, subsequently, the average causal effect $\psi$. Iterative G-computation estimates the conditional expectations $Q^a_t$ by using a regression model for all $t \in \{2, \dots, T+1\}$. Then, $\theta_a$ can be estimated by taking the empirical mean in Eq. (5). The full algorithm is in Alg. 1. Iterative G-computation only requires estimating conditional expectations. This is usually an easier task as compared to estimating conditional distributions, which is done by other G-computation methods [32].
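Alg. 1 itself did not survive extraction; the following is a minimal sketch of the iterative procedure just described, assuming the array layout from above and the plain linear regressors used for this baseline in Appendix C:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def iterative_g_computation(X, A, Y, a_int):
    """Estimate theta_a via iterative G-computation in the spirit of Alg. 1.

    X: (N, T, p) covariates; A: (N, T) treatments; Y: (N,) outcomes Y_{T+1};
    a_int: (T,) treatment intervention (a_1, ..., a_T).
    """
    N, T, p = X.shape
    Q = Y.copy()  # convention: Q^a_{T+2} = Y_{T+1}
    for t in reversed(range(T)):
        # regress the current pseudo-outcome on the observed history
        hist = np.hstack([X[:, : t + 1].reshape(N, -1), A[:, : t + 1]])
        model = LinearRegression().fit(hist, Q)
        # evaluate at the intervention to obtain Q^a_{t+1} (cf. Eq. (4))
        hist_int = np.hstack([X[:, : t + 1].reshape(N, -1),
                              np.tile(a_int[: t + 1], (N, 1))])
        Q = model.predict(hist_int)
    return Q.mean()  # empirical mean, Eq. (5)
```

The backward loop mirrors Eq. (4): each regression is fitted on the predictions of its already-trained predecessor and then evaluated at the intervention.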
However, iterative G-computation in the above form is subject to drawbacks. In particular, one has to specify $T$ separate regression models that are trained separately. Each model uses the predictions of its already trained predecessor as training labels. This may lead to a training procedure that is comparatively unstable. The reason is that each training objective does not take into account how prediction errors propagate through time. For example, a model $\hat{Q}_{T+1}(\cdot)$ could generate predictions that approximate $Y_{T+1}$ well (leading to a small regression objective) but that cause other models $\hat{Q}_{t+1}(\cdot)$ for $t < T$ to become unstable. However, when training $\hat{Q}_{t+1}(\cdot)$, the model $\hat{Q}_{T+1}(\cdot)$ is already fixed and cannot be trained anymore.

Need for an end-to-end model: We postulate that an end-to-end model overcomes the above drawbacks. In particular, an end-to-end model can share information across time steps and, thereby, should be able to generate more accurate estimates for time-varying ACEs. Motivated by this, our end-to-end model learns all parameters jointly.

Overview: We propose a novel end-to-end deep learning model for ACE estimation, called DeepACE. DeepACE is motivated by the idea of iterative G-computation. It is trained with observational data from patient trajectories and a specific treatment intervention $\bar{a}_T$, based on which it learns the conditional expectations $Q^a_t$ from Eq. (4). DeepACE consists of two main components: (i) a G-computation layer (Sec. 4.1), which produces initial estimates of the $Q^a_t$ by minimizing a tailored G-computation loss, and (ii) a targeting layer (Sec. 4.2), which applies perturbations in a way that the final estimator satisfies an efficient estimating equation. The overall model architecture is shown in Fig. 2. DeepACE is trained by combining (i) a G-computation loss $\mathcal{L}_Q$, (ii) a propensity loss $\mathcal{L}_g$, and (iii) a targeting loss $\mathcal{L}_{tar}$ into a joint loss $\mathcal{L}$ as described later. The outputs of DeepACE can then be used to provide ACE estimates (Sec. 4.3). We further show that DeepACE provides a doubly robust and asymptotically efficient estimator (Sec. 4.4). Finally, implementation details are described in Sec. 4.5.

The G-computation layer takes the observational data $\mathcal{D}$ and a specific treatment intervention $\bar{a}_T$ as input. For each time step $t \in \{2, \dots, T+1\}$, it generates two outputs: (i) a factual output $\hat{Q}^A_t$ for $Q_t(\bar{X}_{t-1}, \bar{A}_{t-1})$, and (ii) a counterfactual output $\hat{Q}^a_t$ for $Q_t(\bar{X}_{t-1}, \bar{a}_{t-1})$ according to Eq. (4). The factual outputs $\hat{Q}^A_t$ are trained to estimate the one-step shifted counterfactual outputs $\hat{Q}^a_{t+1}$, while the counterfactual outputs $\hat{Q}^a_t$ are obtained by implicitly evaluating $\hat{Q}_t(\bar{X}_{t-1}, \cdot)$ at $\bar{A}_{t-1} = \bar{a}_{t-1}$, as done in Alg. 1. The architecture of the G-computation layer is shown in Fig. 2 (bottom).

In the G-computation layer, we use a long short-term memory (LSTM) layer [15] to process the input data. We choose an LSTM due to its ability to learn complex non-linear dynamics from patient trajectories while addressing the vanishing gradient problem, which frequently occurs when using recurrent neural networks. We feed the data twice into the LSTM: (i) once with the observed treatments $\bar{A}_T$ (factual forward pass), and (ii) once with the treatment intervention $\bar{a}_T$ (counterfactual forward pass). Based on this, we compute the hidden LSTM states as follows. At each time step $t$, the factual forward pass leads to a factual hidden LSTM state $h^A_t$ depending on the factual trajectory $\bar{H}_t = (\bar{X}_t, \bar{A}_{t-1})$, and the counterfactual forward pass leads to a counterfactual hidden LSTM state $h^a_t$ depending on the past covariates $\bar{X}_t$ and interventions $\bar{a}_{t-1}$. Both hidden states $h^A_t$ and $h^a_t$ are processed further. In the factual forward pass, we feed the factual hidden state $h^A_t$ together with the current observed treatment $A_t$ into a fully-connected feed-forward network $\mathrm{FF}^Q_t$. The network $\mathrm{FF}^Q_t$ generates a factual output $\hat{Q}^A_{t+1}$ for $Q_{t+1}(\bar{X}_t, \bar{A}_t)$ according to Eq. (4). In the counterfactual forward pass, we also feed the counterfactual hidden state $h^a_t$ into $\mathrm{FF}^Q_t$ and replace the treatment input $A_t$ with the current intervention $a_t$. As a result, the network $\mathrm{FF}^Q_t$ generates a counterfactual output $\hat{Q}^a_{t+1}$ for $Q^a_{t+1} = Q_{t+1}(\bar{X}_t, \bar{a}_t)$.
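The following is a compact sketch of how the two forward passes can be wired in PyTorch. Layer sizes and single-layer heads are hypothetical; variational dropout and the gradient blocking of the counterfactual pass are omitted here for brevity:

```python
import torch
import torch.nn as nn

class GCompLayer(nn.Module):
    """Sketch of the G-computation layer (hypothetical layer sizes).

    A shared LSTM processes the trajectory twice -- once with observed
    treatments (factual pass), once with the intervention (counterfactual
    pass) -- and per-step heads FF^Q_t map the hidden state plus the
    current treatment to the outputs Q^A_{t+1} and Q^a_{t+1}.
    """

    def __init__(self, p: int, T: int, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=p + 1, hidden_size=hidden, batch_first=True)
        self.ff_q = nn.ModuleList(nn.Linear(hidden + 1, 1) for _ in range(T))

    def _encode(self, X, A_prev):
        # hidden state h_t depends on (X_1..X_t) and treatments up to t-1
        h, _ = self.lstm(torch.cat([X, A_prev.unsqueeze(-1)], dim=-1))
        return h

    def forward(self, X, A, a_int):
        N, T, _ = X.shape
        pad = torch.zeros(N, 1, device=X.device)
        A_prev = torch.cat([pad, A[:, :-1]], dim=1)      # observed A_{t-1}
        a_rep = a_int.expand(N, T)                       # intervention a_t
        a_prev = torch.cat([pad, a_rep[:, :-1]], dim=1)
        h_fact, h_cf = self._encode(X, A_prev), self._encode(X, a_prev)
        Q_fact, Q_cf = [], []
        for t in range(T):
            # factual head receives observed A_t; counterfactual head receives a_t
            Q_fact.append(self.ff_q[t](torch.cat([h_fact[:, t], A[:, t:t + 1]], dim=-1)))
            Q_cf.append(self.ff_q[t](torch.cat([h_cf[:, t], a_rep[:, t:t + 1]], dim=-1)))
        return torch.cat(Q_fact, dim=1), torch.cat(Q_cf, dim=1)  # each (N, T)
```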
We design a tailored loss function such that we mimic Algorithm 1. For this, we denote the outputs of the G-computation layer for a patient $i$ at time $t$ by $\hat{Q}^{A(i)}_{t+1}(\eta)$ and $\hat{Q}^{a(i)}_{t+1}(\eta)$. Here, we explicitly state the dependence on the model parameters (i.e., the LSTM and feed-forward layers), which we denote by $\eta$. We define the G-computation loss as

$$\mathcal{L}_Q(\eta) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=2}^{T+1} \left( \hat{Q}^{a(i)}_{t+1}(\eta) - \hat{Q}^{A(i)}_{t}(\eta) \right)^2, \qquad \hat{Q}^{a(i)}_{T+2}(\eta) = y^{(i)}_{T+1}. \quad (6)$$

By way of how $\mathcal{L}_Q$ is constructed, each counterfactual output $\hat{Q}^a_{t+1}$ is used as a prediction objective by the previous factual output $\hat{Q}^A_t$. Recall that, in Algorithm 1, the counterfactual estimates $\hat{Q}^a_{t+1}$ are obtained only by evaluating the learned conditional expectation at $\bar{A}_t = \bar{a}_t$. Therefore, we only want the factual outputs $\hat{Q}^A_t$ to learn the counterfactual outputs $\hat{Q}^a_{t+1}$ and not the other way around. Hence, when training the model with gradient descent-based optimization, we block the gradient backpropagation through the counterfactual $\mathrm{FF}^Q_t$ during the counterfactual forward pass.

For our sequential targeting procedure, we now introduce a targeting layer. The motivation is as follows: In principle, we could estimate the expected potential outcome $\theta_a$ by first training the G-computation layer and subsequently following Eq. (5), i.e., taking the empirical mean over the first counterfactual outputs $\hat{Q}^a_2$. Instead, we propose to leverage results from semi-parametric estimation theory, as this allows us to construct an estimator with better theoretical properties, namely double robustness and asymptotic efficiency. (For an overview of semi-parametric estimation theory, we refer to [17].) For this purpose, we design our targeting layer so that it estimates the treatment assignments, i.e., the so-called propensity scores

$$g_t(\bar{H}_t) = \mathbb{P}\left(A_t = 1 \mid \bar{H}_t\right). \quad (7)$$

We then use the propensity scores to perturb the counterfactual outputs $\hat{Q}^a_t$ to make them satisfy an efficient estimating equation. To formalize this, we first provide the mathematical background and subsequently describe how we implement the targeting layer.

In the following, we summarize the general framework under which semi-parametric efficient estimators can be obtained. Let $\hat{Q}^a = (\hat{Q}^a_2, \dots, \hat{Q}^a_{T+1})$ be estimators of the conditional expectations $(Q^a_2, \dots, Q^a_{T+1})$ from Eq. (4), and let $\hat{g} = (\hat{g}_1, \dots, \hat{g}_T)$ be estimators of the propensity scores $(g_1, \dots, g_T)$, where $g_t = g_t(\bar{H}_t)$. Furthermore, let $\hat{\theta}_a$ be an estimator of $\theta_a$. Ideally, we would like to obtain a tuple of estimators $(\hat{Q}^a, \hat{g}, \hat{\theta}_a)$ with the following properties: (1) double robustness, i.e., $\hat{\theta}_a$ is consistent if either $\hat{Q}^a$ or $\hat{g}$ is consistent; and (2) asymptotic efficiency, i.e., $\hat{\theta}_a$ asymptotically achieves the smallest possible variance. It can be shown that, asymptotically, the tuple $(\hat{Q}^a, \hat{g}, \hat{\theta}_a)$ fulfills properties (1) and (2) if it satisfies the following efficient estimating equation [17]:

$$\frac{1}{N} \sum_{i=1}^{N} \phi\left(\hat{Q}^a, \hat{g}, \hat{\theta}_a\right)^{(i)} = 0, \quad (8)$$

where $\phi$ is the efficient influence function of $(\hat{Q}^a, \hat{g}, \hat{\theta}_a)$. We call an estimator that satisfies Eq. (8) "targeted". For the longitudinal setting, $\phi$ has a closed form (derived in [39]) that is given by

$$\phi\left(\hat{Q}^a, \hat{g}, \hat{\theta}_a\right) = \sum_{t=2}^{T+1} \left( \prod_{s=1}^{t-1} \frac{\mathbb{1}(A_s = a_s)}{\hat{g}_s(\bar{H}_s)^{a_s} \left(1 - \hat{g}_s(\bar{H}_s)\right)^{1-a_s}} \right) \left( \hat{Q}^a_{t+1} - \hat{Q}^a_t \right) + \hat{Q}^a_2 - \hat{\theta}_a, \quad (9)$$

where we used the convention that $\hat{Q}^a_{T+2} = Y_{T+1}$ and where $\mathbb{1}(\cdot)$ denotes the indicator function.
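As a sanity check, Eq. (9) can be evaluated directly. The sketch below is our own illustration, using the array layout from the problem setting:

```python
import numpy as np

def eif_phi(Q, g, A, a_int, theta):
    """Efficient influence function phi from Eq. (9) (sketch).

    Q: (N, T+1) columns (Q^a_2, ..., Q^a_{T+2}), last column = Y_{T+1};
    g: (N, T) propensity estimates g_t = P(A_t = 1 | H_t);
    A: (N, T) observed treatments; a_int: (T,) intervention; theta: scalar.
    Returns phi per patient; a targeted estimator has mean(phi) = 0.
    """
    # density of the intervention treatments under the propensity estimates
    g_obs = np.where(a_int[None, :] == 1, g, 1.0 - g)
    # cumulative inverse-propensity weights; zero once A_s deviates from a_s
    w = np.cumprod((A == a_int[None, :]) / g_obs, axis=1)
    increments = Q[:, 1:] - Q[:, :-1]  # (Q^a_{t+1} - Q^a_t) for t = 2..T+1
    return (w * increments).sum(axis=1) + Q[:, 0] - theta
```

An estimator tuple is targeted when `eif_phi(...).mean()` vanishes, which is exactly what the targeting layer below enforces during training.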
We use our targeting layer to perturb the initial estimates produced by the G-computation layer in order to satisfy Eq. (8). Specifically, we propose a sequential targeting procedure by adding a model parameter that is jointly trained with the other model parameters. A tailored regularization term ensures that the efficient estimating equation from Eq. (8) is satisfied. Inputs to the targeting layer are (i) the counterfactual outputs $\hat{Q}^a_{t+1}(\eta)$ of the G-computation layer and (ii) predictions $\hat{g}_t(\eta)$ of the propensity scores $g_t(\bar{H}_t)$, where $\eta$ denotes the trainable parameters of the G-computation layer. To allow for gradient backpropagation, we obtain the counterfactual outputs $\hat{Q}^a_{t+1}(\eta)$ from a second (identical) output of $\mathrm{FF}^Q_t$, where the gradient flow is not blocked during training. Furthermore, we generate the propensity estimates $\hat{g}_t(\eta)$ by adding separate feed-forward networks $\mathrm{FF}^g_t$ on top of the factual hidden states $h^A_t$.

In the following, we describe how the targeting layer applies perturbations to generate targeted outputs $\tilde{Q}^a_{t+1}$. We recursively define perturbation values $q_{T+2}(\eta) = 0$ and

$$q_t(\eta) = q_{t+1}(\eta) + \prod_{s=1}^{t-1} \frac{\mathbb{1}(A_s = a_s)}{\hat{g}_s(\eta)^{a_s} \left(1 - \hat{g}_s(\eta)\right)^{1-a_s}} \quad (10)$$

for $t \in \{2, \dots, T+1\}$. The perturbation values are used to create the targeted network outputs via

$$\tilde{Q}^a_t(\eta, \epsilon) = \hat{Q}^a_t(\eta) + \epsilon \, q_t(\eta) \quad (11)$$

for $t \in \{2, \dots, T+2\}$, where $\epsilon$ is an additional network parameter that is trained together with $\eta$. Note that, by definition, we have that $\tilde{Q}^a_{T+2}(\eta, \epsilon) = Y_{T+1}$.

We use two regularization terms in order to train the targeting layer. First, we define the propensity loss

$$\mathcal{L}_g(\eta) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathrm{BCE}\left(a^{(i)}_t, \hat{g}^{(i)}_t(\eta)\right), \quad (12)$$

where BCE denotes the binary cross-entropy loss. The propensity loss ensures that the propensity networks learn to predict the propensity scores $g_t(\bar{H}_t)$. Second, we define our targeting loss

$$\mathcal{L}_{tar}(\eta, \epsilon) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=2}^{T+1} \left( \tilde{Q}^{a(i)}_{t+1}(\eta, \epsilon) - \tilde{Q}^{a(i)}_t(\eta, \epsilon) \right)^2. \quad (13)$$

We show in Sec. 4.4 that our targeting loss forces the outputs $\tilde{Q}^a_{t+1}$ to satisfy the efficient estimating equation from Eq. (8) and thus makes them "targeted". In contrast to other sequential targeting methods (e.g., LTMLE [39]) that apply targeting perturbations iteratively over time, our procedure allows the entire model to be learned jointly. We later show that this gives more accurate ACE estimates.

To estimate the ACE $\psi$ for two treatment interventions $\bar{a}_T$ and $\bar{b}_T$, we train two separate DeepACE models, one for each of $\bar{a}_T$ and $\bar{b}_T$. Then, we estimate $\psi$ via

$$\hat{\psi} = \frac{1}{N} \sum_{i=1}^{N} \left( \tilde{Q}^{a(i)}_2 - \tilde{Q}^{b(i)}_2 \right), \quad (14)$$

where $\tilde{Q}^{a(i)}_2$ and $\tilde{Q}^{b(i)}_2$ denote the two targeted DeepACE outputs for patient $i$ at the time step $t = 2$.

The following theorem ensures that our combination of targeting layer and regularization in DeepACE indeed produces a targeted estimator. Here, $\mathcal{L}(\eta, \epsilon)$ denotes the joint loss (defined in Sec. 4.5).

Theorem 1. Let $(\hat{\eta}, \hat{\epsilon})$ be a stationary point of $\mathcal{L}(\eta, \epsilon)$. Then, for any $\beta > 0$, the estimator

$$\hat{\theta}_a = \frac{1}{N} \sum_{i=1}^{N} \tilde{Q}^{a(i)}_2(\hat{\eta}, \hat{\epsilon}) \quad (15)$$

is targeted, i.e., DeepACE satisfies the efficient estimating equation from Eq. (8).

By Theorem 1, the DeepACE estimator $\hat{\theta}_a$ is doubly robust, i.e., $\hat{\theta}_a$ is consistent even if either the targeted outputs $\tilde{Q}^a_{t+1}$ or the propensity estimates $\hat{g}_t$ are misspecified (but not both). Assuming that the true conditional expectations $Q^a_{t+1}$ and propensity scores $g_t$ are contained in a suitable hypothesis class, the initial outputs $\hat{Q}^a_{t+1}$ from the G-computation layer and the propensity estimates $\hat{g}_t$ will converge to $Q^a_{t+1}$ and $g_t$ with growing sample size due to the construction of $\mathcal{L}_Q$ and $\mathcal{L}_g$ (provided that $\alpha > 0$). The next corollary shows that this implies asymptotic efficiency of $\hat{\theta}_a$.

Corollary 1. If the initial outputs $\hat{Q}^a_{t+1}$ from the G-computation layer and propensity estimates $\hat{g}_t$ of DeepACE are consistent for $Q^a_{t+1}$ and $g_t$, then the targeted outputs $\tilde{Q}^a_{t+1}$ are also consistent for $Q^a_{t+1}$. In particular, the DeepACE estimator $\hat{\theta}_a$ is asymptotically efficient for $\theta_a$.

Proof. See Appendix A.

Overall loss: To train DeepACE, we combine the above into an overall loss

$$\mathcal{L}(\eta, \epsilon) = \mathcal{L}_Q(\eta) + \alpha \mathcal{L}_g(\eta) + \beta \mathcal{L}_{tar}(\eta, \epsilon),$$

where $\alpha$ and $\beta$ are constants that control the amount of propensity and targeting regularization, respectively.

We implemented DeepACE using the PyTorch Lightning framework. We incorporated variational dropout [8] in the LSTM. This allows us to provide uncertainty estimates using DeepACE. We used the Adam optimizer [18] with 100 epochs. The feed-forward networks are set to one layer each, and the layer sizes are subject to hyperparameter tuning. Details on our hyperparameter tuning and training are in Appendix D.
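Putting the pieces together, the following sketch shows how the three losses, as reconstructed in Eqs. (6), (12), and (13), can be combined. It assumes outputs shaped like those of the `GCompLayer` sketch above, with `eps` as an `nn.Parameter`; it is an illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def deepace_loss(Q_fact, Q_cf, g_hat, A, a_int, Y, eps, alpha=0.1, beta=0.05):
    """Sketch of the joint loss L = L_Q + alpha * L_g + beta * L_tar.

    Q_fact, Q_cf: (N, T) outputs for t = 2..T+1; g_hat: (N, T) propensities;
    A: (N, T) observed treatments (float); a_int: (1, T) intervention;
    Y: (N,) outcomes; eps: scalar targeting parameter.
    """
    # G-computation loss, Eq. (6): the factual output at t predicts the
    # one-step-shifted counterfactual output; .detach() blocks gradients
    # through the targets, mimicking the blocked counterfactual pass
    targets = torch.cat([Q_cf[:, 1:], Y.unsqueeze(1)], dim=1).detach()
    L_Q = ((targets - Q_fact) ** 2).mean()

    # propensity loss, Eq. (12)
    L_g = F.binary_cross_entropy(g_hat, A)

    # perturbation values, Eq. (10): q_t is a reversed cumulative sum of the
    # cumulative inverse-propensity products (base case q_{T+2} = 0)
    g_obs = torch.where(a_int == 1, g_hat, 1 - g_hat)
    pi = torch.cumprod((A == a_int).float() / g_obs, dim=1)
    q = torch.flip(torch.cumsum(torch.flip(pi, dims=[1]), dim=1), dims=[1])

    # targeted outputs, Eq. (11), with tilde-Q_{T+2} = Y, and Eq. (13)
    Q_tilde = torch.cat([Q_cf + eps * q, Y.unsqueeze(1)], dim=1)
    L_tar = ((Q_tilde[:, 1:] - Q_tilde[:, :-1]) ** 2).mean()

    return L_Q + alpha * L_g + beta * L_tar
```

Setting the gradient of this loss with respect to `eps` to zero recovers the estimating equation of Theorem 1.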
We compare DeepACE against state-of-the-art methods for time-varying causal effect estimation; see Table 1. The baselines are selected from recent literature on causal effect estimation [20, 39]. (1) G-methods: Here, we use: (i) a marginal structural model (MSM) [33] with inverse probability weighting, (ii) iterative G-computation as in Algorithm 1, (iii) G-computation via the parametric G-formula [32], and (iv) a structural nested mean model (SNMM) with g-estimation [30]. (2) Longitudinal targeted maximum likelihood estimation (LTMLE) [39]: We implement LTMLE in two variants. The first variant (i) uses generalized linear models (glm) to estimate the conditional expectations. The second variant (ii) uses the super learner algorithm [40], which builds upon a cross-validation algorithm to combine different regression models into a single predictor. (3) Deep learning for ITE estimation: Additionally, we include (i) recurrent marginal structural networks (RMSNs) [21], (ii) a counterfactual recurrent network (CRN) [5], and (iii) G-Net [20]. Different from our method, these baselines predict individual as opposed to average causal effects. We obtain ACE estimates by averaging the predicted ITEs. Implementation and hyperparameter tuning details are in Appendix C. For completeness, we also report results for static baselines in Appendix B, but note that these aim at a different setting and thus have an inferior performance.

Setting: Synthetic data are commonly used to evaluate the effectiveness of causal inference methods (e.g., [42, 4, 37]). For real-world data, the counterfactual outcomes are never observed, and, hence, the ground-truth ACE is unknown. In contrast, synthetic data allow us to have access to counterfactual outcomes. Therefore, we can compute the ground-truth ACE and thus benchmark the performance. We simulate from a data-generating process similar to [4, 12]. The time-varying covariates $X_t \in \mathbb{R}^p$ follow a non-linear autoregressive process (17) for some lag $h \geq 1$, with randomly sampled weights $\alpha_i, \beta_i \sim N(1/(i+1), 0.02^2)$, $\gamma_i \sim \mathrm{unif}\{-1, 1\}$ (discrete uniform distribution), and noise $\varepsilon^X_t \sim N_p(0, \mathrm{diag}_p(0.1^2))$. The treatments $A_t \in \{0, 1\}$ are selected as $A_t = \mathbb{1}_{\{\pi_t > 0.5\}}$, where $\pi_t$ are treatment probabilities. They depend on the past observations through a sigmoid transformation, where $\sigma$ denotes the sigmoid function and $\varepsilon^A_t \sim N(0, 0.2^2)$ is noise. Finally, the outcomes $Y_{t+1} \in \mathbb{R}$ are determined via (19) for weights $w_i = (-1)^{i+1} \frac{1}{i}$ and noise $\varepsilon^Y_{t+1} \sim N(0, 0.1^2)$.
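The bodies of Eqs. (17)-(19) did not survive extraction, so the sketch below is only a hypothetical stand-in: it reproduces the stated ingredients (weight distributions, noise scales, the sigmoid-threshold treatment rule) but the aggregation via `np.tanh` and covariate means is our own choice. It does illustrate the stated evaluation trick of reusing the same noise (via a fixed seed) for factual and counterfactual simulations:

```python
import numpy as np

def simulate(a_int=None, seed=0, N=1000, T=15, p=6, h=5):
    """Hypothetical stand-in for the synthetic DGP of Eqs. (17)-(19).

    If a_int is given, treatments are fixed to the intervention while the
    same noise (same seed, same draw order) is reused, which yields
    counterfactual outcomes under that intervention.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(1, h + 1)
    alpha = rng.normal(1 / (idx + 1), 0.02)
    beta = rng.normal(1 / (idx + 1), 0.02)
    gamma = rng.choice([-1.0, 1.0], size=h)
    w = np.array([(-1) ** (i + 1) / i for i in idx])
    X = np.zeros((N, T, p)); A = np.zeros((N, T)); Y = np.zeros((N, T + 1))
    for t in range(T):
        for i in range(1, min(t, h) + 1):  # hypothetical autoregressive mixing
            X[:, t] += gamma[i - 1] * np.tanh(alpha[i - 1] * X[:, t - i]
                                              + beta[i - 1] * A[:, t - i, None])
        X[:, t] += rng.normal(0, 0.1, size=(N, p))
        # treatment probabilities via a sigmoid of past observations + noise
        pi = 1 / (1 + np.exp(-(X[:, t].mean(1) + rng.normal(0, 0.2, N))))
        A[:, t] = a_int[t] if a_int is not None else (pi > 0.5)
        Y[:, t + 1] = sum(w[i - 1] * A[:, t - i + 1]
                          for i in range(1, min(t + 1, h) + 1)) \
                      + X[:, t].mean(1) + rng.normal(0, 0.1, N)
    return X, A, Y[:, -1]
```

With two interventions, the ground-truth ACE is then the difference of the mean final outcomes, e.g., `simulate(a)[2].mean() - simulate(b)[2].mean()`.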
Results: We generate N = 1000 patient trajectories over T = 15 time steps and sample p = 6 time-varying covariates with lag h = 5. At the same time, we use the same process and noise to generate counterfactual data for three setups with different treatment interventions $\bar{a}_T$ and $\bar{b}_T$. For each method and setup, we calculate the absolute error between the estimated and ground-truth ACE, averaged over 5 different runs with random seeds. The results are shown in Table 2.

Table 2: Results on synthetic data (mean ± standard deviation).
Method | Setup 1 | Setup 2 | Setup 3
(1) G-methods
MSM [33] | 0.24 ± 0.18 | 0.27 ± 0.19 | 0.21 ± 0.10
Iterative G-computation [41] | 0.26 ± 0.23 | 0.34 ± 0.26 | 0.32 ± 0.27
G-formula (parametric) [32] | 0.14 ± 0.10 | 0.44 ± 0.25 | 0.86 ± 0.45
SNMM [30] | 0.39 ± 0.04 | 0.47 ± 0.02 | 0.35 ± 0.02
(2) LTMLE
LTMLE (glm) [39] | 0.20 ± 0.19 | 0.25 ± 0.19 | 0.33 ± 0.24
LTMLE (super learner) [39, 40] | 0.13 ± 0.08 | 0.14 ± 0.12 | 0.19 ± 0.13
(3) Deep learning for ITE estimation
RMSNs [21] | 0.20 ± 0.10 | 0.52 ± 0.32 | 0.95 ± 0.52
CRN [5] | 0.20 ± 0.10 | 0.52 ± 0.32 | 0.95 ± 0.52
G-Net [20] | 0.18 | |
lower = better

All baselines are clearly outperformed by DeepACE on all three experiments. The best-performing baseline is LTMLE with the super learner. This is reasonable, as it is the only baseline available that is both tailored to ACE estimation and makes use of machine learning. The linear methods (i.e., the g-methods and LTMLE with glm) are not able to capture the non-linear dependencies within the data and thus achieve an inferior performance. The ITE baselines use recurrent neural networks and should thus be able to learn non-linearities but, nevertheless, are inferior. This is attributed to the fact that they are designed for estimating individual rather than average causal effects.

To compare DeepACE with iterative G-computation, we compute the estimation errors of both methods over different time lags $h \in \{1, \dots, 5\}$. The results are shown in Fig. 3 and demonstrate the effectiveness of DeepACE for long-range dependencies, as common in modern EHRs.

Ablation study: We also analyze DeepACE with the targeting layer removed (Table 2). Both variants of DeepACE outperform the baselines. The variant without targeting performs well, but we are careful with interpretations due to the simple nature of the synthetic data and rather relegate conclusions to real-world data as in the following.

Setting: We create semi-synthetic data that enable us to evaluate DeepACE using real-world data while having access to the ground-truth ACE. For this purpose, we use the MIMIC-III dataset [16], which includes electronic health records from patients admitted to intensive care units. We use a preprocessing pipeline [46] to extract 10 time-varying covariates $X_t \in \mathbb{R}^{10}$ over T = 15 time steps. Then, treatments $A_t \in \{0, 1\}$ are simulated as binary variables with treatment probabilities that depend on the current treatment level $\lambda_t$ and noise $\varepsilon^A_t \sim N(0, 0.5^2)$, where the treatment level is defined by $\lambda_t = \lambda_{t-1} + 2(A_t - 1) X_t \tanh(Y_t)$ with initialization $\lambda_0 = T/2$. Finally, outcomes are generated as a function of the patient history with noise $\varepsilon^Y_t \sim N(0, 0.1^2)$.

Results: We generate N = 1000 patient trajectories with lag h = 8. We compare three setups with different treatment interventions that are chosen analogously to Sec. 5.2. The results are shown in Table 3. Again, DeepACE outperforms all baselines by a large margin.

Table 3: Results on semi-synthetic data (mean ± standard deviation).
Method | Setup 1 | Setup 2 | Setup 3
(1) G-methods
MSM [33] | 2.35 ± 0.64 | 3.06 ± 0.67 | 2.47 ± 0.95
Iterative G-computation [41] | 0.81 ± 0.35 | 0.72 ± 0.72 | 1.90 ± 1.06
G-formula (parametric) [32] | 0.32 ± 0.27 | 0.31 ± 0.20 | 0.32 ± 0.27
SNMM [30] | 0.28 ± 0.33 | 0.52 ± 0.26 | 1.96 ± 2.36
(2) LTMLE
LTMLE (glm) [39] | 0.82 ± 0.35 | 0.72 ± 0.72 | 1.84 ± 1.13
LTMLE (super learner) [39, 40] | 0.96 ± 1.01 | 0.92 ± 1.17 | 0.76 ± 0.47
(3) Deep learning for ITE estimation
RMSNs [21] | 2.35 ± 0.14 | 2.32 ± 0.18 | 2.36 ± 0.14
CRN [5] | 2.53 ± 0.03 | 2.53 ± 0.04 | 2.52 ± 0.04
G-Net [20] | 0.67 ± 0.15 | 0.65 ± 0.17 | 0.67 ± 0.15
DeepACE w/o targeting (ours) | 0.18 ± 0.17 | 0.25 ± 0.11 | 0.21 ± 0.07
DeepACE (ours) | 0.18 ± 0.14 | 0.12 ± 0.14 | 0.16 ± 0.10
lower = better (best in bold)

Ablation study: We repeat the experiments with DeepACE but with the targeting layer removed (Table 3). This demonstrates the importance of our targeting procedure for achieving a superior performance.

Setting: We demonstrate the value of DeepACE for real-world patient trajectories collected in a clinical study. For this purpose, we analyze data from N = 928 patients with low back pain (LBP) [25]. The dataset consists of pain as a time-varying outcome recorded via assessments at T = 3 time steps over the course of a year.
Pain intensity is our outcome of interest, $Y_{t+1}$. As treatment, we consider whether a patient has been doing physical labor or was exempt, such as by recommendation of a medical professional. Covariates are given by perceived disability and 127 risk factors (e.g., age, gender). Hence, we are interested in the causal effect of whether patients have been allowed/disallowed to do physical labor on pain intensity. As of now, little is understood about potential between-patient heterogeneity in LBP progression. Due to that, medical research is interested in identifying different phenotypes, whereby LBP is classified into different subgroups with clinically meaningful interpretation. We thus estimate average causal effects for two patient cohorts, namely (1) severe LBP and (2) mild LBP [25]. We repeated this for K = 20 evaluations with variational dropout for uncertainty estimation.

Results: We make several observations relevant for medical practice (Fig. 4): (i) Both ACEs are positive, which is in line with medical knowledge implying that physical labor may worsen LBP progression. (ii) Surprisingly, the ACE is larger for the group with mild LBP. While this may not sound intuitive at first, it matches common medical practice: patients with mild LBP are rarely recommended to stop physical labor. Hence, our results suggest that physical labor should also be stopped for mild LBP, as this would eliminate negative effects on observed pain. (iii) The variance is much larger for mild LBP, indicating that the impact of physical labor, and thus the quality of care, is more heterogeneous.

Need for DeepACE: Estimating causal effects from observational data requires custom methods that adjust for time-varying confounding. For the above experiments, we also implemented a standard LSTM but found that it was inferior to the other baselines (results omitted for space). This is to be expected [32]: a standard LSTM does not address treatment-confounder feedback, because of which it has a large generalization error and is even biased [1]. As a remedy, we add to a growing stream of research that develops tailored methods for estimating causal effects.

Strengths: DeepACE improves on state-of-the-art methods for time-varying ACE estimation in several ways. (i) We incorporate iterative G-computation into an end-to-end deep learning model. Our experiments confirm that this achieves superior estimation results. (ii) DeepACE provides a targeted estimator that is both doubly robust and asymptotically efficient. To achieve this, DeepACE does not rely on an iterative targeting procedure as, e.g., LTMLE does, but instead incorporates a novel sequential targeting into the end-to-end training procedure. This is inspired by DragonNet [37], but the original targeting is for static settings, whereas we are the first to handle longitudinal settings. (iii) DeepACE is an end-to-end model, which is an advantage when dealing with patient subgroups. The entire data from all subgroups can be used for learning. In contrast, LTMLE or iterative G-computation can only use data from a single subgroup separately, thereby impeding performance.

Practical considerations: As with other research in causal inference from observational data, one challenge for causal effect estimation is that one only observes factual but not counterfactual outcomes. For practice in medicine and industry, this has implications, as the performance cannot be evaluated directly but would necessitate RCTs.
However, this also pinpoints the need for DeepACE: RCTs are costly and oftentimes even unethical (e.g., one cannot withhold effective treatments from patients to estimate causal effects) [33]. Here, a powerful remedy is offered by methods for estimating causal effects from observational data and, in particular, our DeepACE.

Use cases: DeepACE fulfills important needs for decision-making in medicine and industry. In medicine, current guidelines typically make recommendations at the level of patient subgroups [10]: treatments are selected that promise the desired effects on health outcomes for patients from a specific subgroup, and, as such, they are based on average causal effects. By offering more accurate ACE estimation, DeepACE will enable better decision-making for patient subgroups and will thus contribute to more effective care. Here, we expect EHRs and post-approval drug monitoring to offer valuable input data. Moreover, DeepACE may also inform RCT design where one tests new treatment strategies that were generated from observed patient trajectories. Beyond that, ACE estimation is also relevant, for example, in marketing (e.g., how does the treatment effect of price discounts on purchase behavior differ for online vs. offline customers conditional on their past purchase history?).

Conclusion: In this paper, we proposed DeepACE for time-varying ACE estimation. To the best of our knowledge, DeepACE is the first end-to-end deep learning model for that purpose. By estimating causal effects of treatments on health outcomes, DeepACE informs the choice of effective treatments for patient subgroups.

Proof of Theorem 1. We show that the tuple $\big(\tilde{Q}^a(\hat{\eta}, \hat{\epsilon}), \hat{g}(\hat{\eta}), \hat{\theta}_a\big)$ satisfies the efficient estimating equation in Eq. (8). At any stationary point $(\hat{\eta}, \hat{\epsilon})$, we have that

$$0 = \frac{\partial \mathcal{L}}{\partial \epsilon}(\hat{\eta}, \hat{\epsilon}) = \beta \, \frac{\partial \mathcal{L}_{tar}}{\partial \epsilon}(\hat{\eta}, \hat{\epsilon}) = -\frac{2\beta}{T} \cdot \frac{1}{N} \sum_{i=1}^{N} \sum_{t=2}^{T+1} \left( \prod_{s=1}^{t-1} \frac{\mathbb{1}(a^{(i)}_s = a_s)}{\hat{g}^{(i)a_s}_s \big(1 - \hat{g}^{(i)}_s\big)^{1-a_s}} \right) \left( \tilde{Q}^{a(i)}_{t+1} - \tilde{Q}^{a(i)}_t \right),$$

where the last equality follows from Eq. (10) and Eq. (11), since $\frac{\partial}{\partial \epsilon}\big(\tilde{Q}^{a(i)}_{t+1} - \tilde{Q}^{a(i)}_t\big) = q^{(i)}_{t+1} - q^{(i)}_t$ equals minus the inverse-propensity product. Rescaling both sides and adding $\frac{1}{N}\sum_{i=1}^{N} \big(\tilde{Q}^{a(i)}_2 - \hat{\theta}_a\big) = 0$, which holds by the definition of $\hat{\theta}_a$ in Eq. (15), yields the result.

Proof of Corollary 1. We show that consistency of $\hat{Q}^a_{t+1}$ and $\hat{g}_t$ implies consistency of the targeted outputs $\tilde{Q}^a_{t+1}$. Then, asymptotic efficiency follows from Theorem 1. The key argument is similar to the one in [37]. By assumption, the G-computation loss $\mathcal{L}_Q$ and the propensity loss $\mathcal{L}_g$ are asymptotically minimized by $\hat{Q}^a_{t+1} = Q^a_{t+1}$ and $\hat{g}_t = g_t$ (due to finite Vapnik-Chervonenkis (VC) dimension). In the following, we show that the overall loss $\mathcal{L}$ is asymptotically minimized by additionally setting $\hat{\epsilon} = 0$. Hence, the targeted outputs $\tilde{Q}^a_{t+1}$ are consistent because DeepACE can set the perturbation parameter to $\hat{\epsilon} = 0$, which implies $\tilde{Q}^a_{t+1} = \hat{Q}^a_{t+1}$ for all $t \in \{1, \dots, T\}$. Because the targeting layer only adds a single parameter to the model, a finite VC dimension is preserved. We now show that $\hat{\epsilon} = 0$ asymptotically minimizes each summand of the targeting loss $\mathcal{L}_{tar}$. The last summand for $t = T+1$ is minimized (squared loss) at

$$\tilde{Q}^a_{T+1} = \mathbb{E}\left[Y_{T+1} \mid \bar{X}_T, \bar{A}_T = \bar{a}_T\right] = Q^a_{T+1},$$

which is indeed achieved at $\hat{\epsilon} = 0$. The other summands for $t \in \{2, \dots, T\}$ are minimized at

$$\tilde{Q}^a_t = \mathbb{E}\left[\tilde{Q}^a_{t+1} \mid \bar{X}_{t-1}, \bar{A}_{t-1} = \bar{a}_{t-1}\right] = Q^a_t,$$

which is also achieved at $\hat{\epsilon} = 0$.

There are several methods for causal effect estimation in the static setting. However, these do not take into account time-varying confounding and are biased in the longitudinal setting. Because of that, we refrained from reporting them in our main paper. To show that DeepACE also outperforms static methods, we implemented four state-of-the-art methods for estimating static causal effects and evaluated them on our synthetic and semi-synthetic datasets. Here, we follow [22]: we consider DragonNet [37], TARNet [36], the double machine learner (DML) [6], and the doubly robust learner (DR) [7].
The latter two methods are meta-learners, which we instantiate via causal forests [45]. Results are in Table 4. Overall, DeepACE is superior by a large margin.

We provide a detailed overview of our baselines. Of note, our baselines represent state-of-the-art methods for causal effect estimation in longitudinal settings [5, 21, 39, 20]. Generalized methods (g-methods) [24] are a class of statistical models for time-varying ACE estimation originally used in epidemiology [33]. G-methods can be loosely categorized into marginal structural models, G-computation via the G-formula, and structural nested models. We use models of all three categories.

MSMs express the expected potential outcome $\theta_a$ directly as a function of the treatment intervention $\bar{a}_T = (a_1, \dots, a_T)$. In our case, we consider the model

$$\theta_a = \beta_0 + \sum_{t=1}^{T} \beta_t a_t \quad (30)$$

with parameters $\beta = (\beta_0, \dots, \beta_T)$. The parameters $\beta$ can be estimated via inverse probability weighting. More precisely, the stabilized inverse probability weight for patient $i$ is defined as

$$SW^{(i)} = \prod_{t=1}^{T} \frac{f_{A_t \mid \bar{A}_{t-1}}\left(a^{(i)}_t \mid \bar{a}^{(i)}_{t-1}\right)}{g_{A_t \mid \bar{H}_t}\left(a^{(i)}_t \mid \bar{h}^{(i)}_t\right)}, \quad (31)$$

where $f_{A_t \mid \bar{A}_{t-1}}$ and $g_{A_t \mid \bar{H}_t}$ denote the conditional densities of $A_t$ given $\bar{A}_{t-1}$ and $\bar{H}_t = (\bar{A}_{t-1}, \bar{X}_t)$, respectively. Both conditional densities can be estimated via standard logistic regressions. Robins [33] showed that the parameters $\beta$ of the MSM can be consistently estimated by performing a weighted linear regression. Here, the observed outcomes $Y_{T+1}$ are regressed on the observed treatments $\bar{A}_T$, and each observation $i$ is weighted by $SW^{(i)}$. Once the estimates $\hat{\beta}$ are obtained, we estimate the ACE via

$$\hat{\psi} = \sum_{t=1}^{T} \hat{\beta}_t \left(a_t - b_t\right), \quad (32)$$

where $\bar{a}_T$ and $\bar{b}_T$ denote the treatment interventions.
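A minimal sketch of computing the stabilized weights of Eq. (31) with logistic regressions follows. It is our own illustration and assumes both treatment classes occur at every time step:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stabilized_weights(X, A):
    """Stabilized IP weights SW^(i) per Eq. (31) (minimal sketch).

    X: (N, T, p) covariates; A: (N, T) binary treatments. Numerator and
    denominator propensities are fitted with logistic regressions.
    """
    N, T, _ = X.shape
    sw = np.ones(N)
    obs = np.arange(N)
    for t in range(T):
        a_t = A[:, t].astype(int)
        if t == 0:  # marginal P(A_1) for the numerator at the first step
            p1 = A[:, 0].mean()
            num = np.where(a_t == 1, p1, 1 - p1)
        else:       # P(A_t | A_1..A_{t-1})
            num_model = LogisticRegression().fit(A[:, :t], a_t)
            num = num_model.predict_proba(A[:, :t])[obs, a_t]
        # denominator: P(A_t | H_t) with H_t = (X_1..X_t, A_1..A_{t-1})
        hist = np.hstack([X[:, : t + 1].reshape(N, -1), A[:, :t]])
        den = LogisticRegression().fit(hist, a_t).predict_proba(hist)[obs, a_t]
        sw *= num / den
    return sw
```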
C.1.2 G-computation [32, 41]

The G-formula from Eq. (3) can be used to estimate $\theta_a$ in an iterative manner. We include Algorithm 1 as a baseline, in which we use linear regression to estimate the conditional expectations. An equivalent way to write the G-formula from Eq. (3) is

$$\theta_a = \int \mathbb{E}\left[Y_{T+1} \mid \bar{X}_T = \bar{x}_T, \bar{A}_T = \bar{a}_T\right] \prod_{t=1}^{T} f_{X_t \mid \bar{X}_{t-1}, \bar{A}_{t-1}}\left(x_t \mid \bar{x}_{t-1}, \bar{a}_{t-1}\right) \mathrm{d}\bar{x}_T, \quad (33)$$

where $f_{X_t \mid \bar{X}_{t-1}, \bar{A}_{t-1}}$ denotes the conditional density of $X_t$ given $\bar{X}_{t-1}$ and $\bar{A}_{t-1}$. The conditional densities and the conditional expectation $\mathbb{E}\left[Y_{T+1} \mid \bar{X}_T = \bar{x}_T, \bar{A}_T = \bar{a}_T\right]$ can be estimated by parametric regression models and are subsequently plugged into Eq. (33). This method is also known as parametric G-computation [32]. We use the algorithm from [23] as another baseline. Here, the conditional densities are essentially modeled as conditional normal distributions, where both mean and variance are estimated by generalized linear regression models.

A structural nested mean model specifies the marginal effects of the treatment interventions at each time step. In our case, we consider the model given by

$$\mathbb{E}\left[Y_{T+1}(\bar{a}_t, 0) - Y_{T+1}(\bar{a}_{t-1}, 0) \mid \bar{H}_t = \bar{h}_t\right] = \beta_t a_t \quad (34)$$

for all $t \in \{1, \dots, T\}$ with parameters $\beta = (\beta_1, \dots, \beta_T)$. Here, $Y_{T+1}(\bar{a}_t, 0)$ denotes the potential outcome that is observed if the treatment intervention $\bar{a}_t$ is applied until time $t$ and no treatment is applied afterwards. The SNMM uses a method called g-estimation [32] to estimate the model parameters $\beta$. G-estimation is based on solving certain estimating equations, which are implied by Eq. (34) in combination with the assumptions from Sec. 3.1. For our experiments, we use the g-estimation method from [43] to obtain estimates $\hat{\beta}$.

LTMLE [39] extends iterative G-computation (Algorithm 1) by targeting the estimates $\hat{Q}^a_t$ after each iteration step $t$. For a detailed description, we refer to [39] and [41]. In our experiments, we consider two different variants of LTMLE: variant (i) estimates the conditional expectations $Q^a_t$ and the propensity scores $g_t$ with generalized linear models, and variant (ii) applies the super learner algorithm [40], which uses k-fold cross-validation to find an optimal weighted combination of machine learning base models. We set k = 3 and use a generalized linear model, a random forest, xgboost, and a generalized additive model (GAM) with regression splines as base models.

We included state-of-the-art baselines that predict future potential outcomes conditional on the patient trajectories $\bar{H}_t$ using deep learning. These are RMSNs [21], CRN [5], and G-Net [20]. To this end, we implemented all models as described in the respective references. Of note, these baselines aim at predicting individual counterfactual outcomes rather than estimating average causal effects. Because of that, we obtain the ACE by averaging the respective predicted outcomes and subsequently subtracting them. For hyperparameter tuning, we refer to Appendix D.

C.3.1 RMSNs [21] and CRN [5]

Both baselines are based on an encoder-decoder architecture and consider the setting where treatment interventions $\bar{a}_{(t, t+\tau-1)} = (a_t, \dots, a_{t+\tau-1})$ are applied from some time $t$ to $t + \tau - 1$. The encoder builds a representation $\Phi_t$ of the patient trajectory $\bar{H}_t$. This is then used by the decoder together with the treatment intervention $\bar{a}_{(t, t+\tau-1)}$ to predict the future potential outcome $Y_{t+\tau}(\bar{a}_{(t, t+\tau-1)})$. In the encoder, we set $\bar{H}_1 = X_1$ as input so that treatment interventions can span the complete time frame from $t = 1$ to $\tau = T$. RMSNs address time-varying confounding by re-weighting the training loss using inverse probability of treatment weights similar to Eq. (31). These weights are estimated using two separate LSTMs for numerator and denominator. In contrast, CRN adopts adversarial training techniques to build balanced representations that are non-predictive of the treatment.

G-Net uses the G-formula from Eq. (33) conditioned on the history $\bar{H}_1$ to estimate the outcomes $Y_{T+1}(\bar{a}_T)$ via Monte Carlo sampling (we use 100 samples). For this purpose, the conditional densities $f_{X_t \mid \bar{X}_{t-1}, \bar{A}_{t-1}}$ are estimated by learning the corresponding conditional expectations $\mathbb{E}[X_t \mid \bar{X}_{t-1}, \bar{A}_{t-1}]$ via an LSTM-based model. One can then sample from $f_{X_t \mid \bar{X}_{t-1}, \bar{A}_{t-1}}$ by drawing from the empirical distributions of the residuals on a holdout set that is not used to estimate the conditional expectations. We used 20% of the training data for the holdout dataset.
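The following sketch illustrates this Monte Carlo procedure. Purely for compactness, it swaps the LSTM-based conditional-expectation models for linear regressions; the residual-bootstrap sampling is as described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def gnet_theta(X, A, Y, a_int, n_mc=100, seed=0):
    """Sketch of G-Net-style Monte Carlo G-computation (linear stand-in).

    Fits E[X_t | X_{1..t-1}, A_{1..t-1}] per step, draws noise from holdout
    residuals, rolls the covariates forward under the intervention a_int,
    and averages the final outcome regression over n_mc samples.
    """
    rng = np.random.default_rng(seed)
    N, T, p = X.shape
    split = int(0.8 * N)  # 80% to fit models, 20% holdout for residuals
    cov_models, residuals = [], []
    for t in range(1, T):
        hist = np.hstack([X[:, :t].reshape(N, -1), A[:, :t]])
        m = LinearRegression().fit(hist[:split], X[:split, t])
        cov_models.append(m)
        residuals.append(X[split:, t] - m.predict(hist[split:]))
    out_model = LinearRegression().fit(np.hstack([X.reshape(N, -1), A]), Y)
    a_rep = np.tile(a_int, (N, 1))
    draws = []
    for _ in range(n_mc):
        Xs = X.copy()
        for t in range(1, T):  # roll covariates forward under the intervention
            hist = np.hstack([Xs[:, :t].reshape(N, -1), a_rep[:, :t]])
            idx = rng.integers(len(residuals[t - 1]), size=N)
            Xs[:, t] = cov_models[t - 1].predict(hist) + residuals[t - 1][idx]
        draws.append(out_model.predict(np.hstack([Xs.reshape(N, -1), a_rep])))
    return np.mean(draws)  # estimate of theta_a
```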
We use PyTorch Lightning for our implementation. We performed hyperparameter tuning for all deep learning models (including DeepACE) on all datasets by splitting the data into a training set (80%) and a validation set (20%). We then performed 30 random grid search iterations and chose the set of parameters that minimized the factual MSE on the validation set. The hyperparameter search ranges are shown in Table 5. After training the models with optimal hyperparameters, we used the full datasets to estimate the ACEs. Each model is trained over 100 epochs with the Adam optimizer [18]. During training, all LSTM networks use variational dropout [8]. We set the regularization parameters to α = 0.1 and β = 0.05. Note that it is infeasible to include α or β in the hyperparameter tuning process because we only have access to the factual data (and, hence, α and β may shrink to zero in order to minimize the factual loss). In doing so, we are consistent with several state-of-the-art methods for causal effect estimation [37, 5].

References

[1] Limits of estimating heterogeneous treatment effects: Guidelines for practical algorithm design
[2] Analyzing patient trajectories with artificial intelligence
[3] Doubly robust estimation in missing data and causal inference models
[4] Time series deconfounder: Estimating treatment effects over time in the presence of hidden confounders
[5] Estimating counterfactual treatment outcomes over time through adversarially balanced representations
[6] Double/debiased machine learning for treatment and structural parameters
[7] Orthogonal statistical learning
[8] A theoretically grounded application of dropout in recurrent neural networks
[9] Causal inference in public health
[10] The path to personalized medicine
[11] Estimating average treatment effects via orthogonal regularization
[12] Sequential deconfounding for causal inference with unobserved confounders
[13] Generalizing off-policy learning under sample selection bias
[14] Combining observational and randomized data for estimating heterogeneous treatment effects
[15] Long short-term memory
[16] MIMIC-III, a freely accessible critical care database
[17] Semiparametric theory and empirical processes in causal inference
[18] Adam: A method for stochastic optimization
[19] Deconfounding temporal autoencoder: Estimating treatment effects over time using noisy proxies
[20] G-Net: A recurrent network approach to G-computation for counterfactual prediction under a dynamic treatment regime
[21] Forecasting treatment responses over time using recurrent marginal structural networks
[22] Estimating individual treatment effects with time-varying confounders
[23] gfoRmula: An R package for estimating the effects of sustained treatment strategies via the parametric G-formula
[24] An introduction to g methods
[25] Identifying subgroups of patients using latent class analysis: Should we use a single-stage or a two-stage approach? A methodological study using a cohort of patients with low back pain
[26] AttDMM: An attentive deep Markov model for risk scoring in intensive care units
[27] Causal inference in statistics: An overview
[28] SyncTwin: Treatment effect estimation with longitudinal outcomes
[29] A new approach to causal inference in mortality studies with a sustained exposure period: Application to control of the healthy worker survivor effect
[30] Correcting for non-compliance in randomized trials using structural nested mean models
[31] Robust estimation in sequentially ignorable missing data and causal inference models
[32] Estimation of the causal effects of time-varying exposures. Chapman & Hall/CRC Handbooks of Modern Statistical Methods
[33] Marginal structural models and causal inference in epidemiology
[34] Bayesian inference for causal effects: The role of randomization
[35] Reliable decision support using counterfactual models
[36] Estimating individual treatment effect: Generalization bounds and algorithms
[37] Adapting neural networks for the estimation of treatment effects
[38] Treatment-response models for counterfactual reasoning with continuous-time, continuous-valued interventions
[39] Targeted minimum loss based estimation of causal effects of multiple time point interventions
[40] Super learner
[41] Targeted learning in data science
[42] Targeted maximum likelihood learning
[43] Revisiting g-estimation of the effect of a time-varying exposure subject to time-varying confounding
[44] Causal inference in economics and marketing
[45] Estimation and inference of heterogeneous treatment effects using random forests
[46] MIMIC-extract: A data extraction, preprocessing, and representation pipeline for MIMIC-III
[47] Causal inference in the age of decision medicine