title: Federated Estimation of Causal Effects from Observational Data
authors: Vo, Thanh Vinh; Hoang, Trong Nghia; Lee, Young; Leong, Tze-Yun
date: 2021-05-31

Many modern applications collect data in a federated manner, with data kept locally and undisclosed. To date, most approaches to causal inference require the data to be stored in a central repository. We present a novel framework for causal inference with federated data sources. We assess and integrate local causal effects from different private data sources without centralizing them, and estimate the treatment effects on subjects from observational data using a non-parametric reformulation of the classical potential outcomes framework. We model the potential outcomes as a random function distributed by Gaussian processes, whose defining parameters can be efficiently learned from multiple data sources while respecting privacy constraints. We demonstrate the promise and efficiency of the proposed approach through a set of simulated and real-world benchmark examples.

Estimating the causal effects of an intervention on an outcome is common in many practical areas, e.g., personalized medicine (Powers et al., 2018), digital experiments (Taddy et al., 2016) and political science (Green and Kern, 2012). One example is to estimate the effect of smoking on lung cancer. To accurately infer causal effects, one would need a large number of observations. However, observational data often exist across different institutions and typically cannot be centralized for processing due to privacy constraints. For example, medical records of patients are kept strictly confidential at local hospitals (Gostin et al., 2009). This real-life scenario limits the access of causal inference algorithms to training data. Existing medical data have not been fully exploited by causal inference primarily because of these constraints. Current approaches to causal inference (e.g., Shalit et al., 2017; Yao et al., 2018) require the medical records to be shared and put in one place for processing, which could violate the privacy rights of patients. Alternative solutions such as establishing data use agreements or creating secure data environments may not be possible and are often not implemented. For example, suppose that several hospitals own the electronic health records (EHRs) of different patient populations and we wish to use these EHRs to infer whether smoking causes lung cancer in all of these populations. These records cannot be shared across the hospitals because they may contain sensitive information about the patients. This poses a major barrier to developing effective causal effect estimators that generalize, since such estimators usually need a large and diverse dataset. How to use such EHRs to build a global causal effect estimator while preserving the patients' privacy rights is a challenging problem that has not been well explored. In practice, it is intractable to verify whether causal estimands are reliable. Thus, in addition to giving point estimates of causal effects, an estimator that outputs confidence intervals would give helpful insight into the uncertainty of the causal estimands. For example, a confidence interval for the individual treatment effect indicates how confident we can be that a patient is at a higher risk of getting lung cancer.
Most recent causal effect estimators (e.g., Shalit et al., 2017; Louizos et al., 2017; Yao et al., 2018; Madras et al., 2019), however, do not discuss the uncertainty of the estimated causal effects. Some existing causal inference packages such as econml (Microsoft Research, 2019) provide frequentist approaches, e.g., the bootstrap (Efron and Tibshirani, 1994) and Bootstrap-of-Little-Bags (Kleiner et al., 2014), to obtain such confidence intervals. These approaches require many rounds of resampling the entire dataset and retraining the models on the resamples. Hence, using them for multi-source causal inference while preserving privacy would require a careful redesign of the resampling algorithms. These challenges motivate us to propose a framework that can learn the causal effects of interest without combining the data sources at a central site and, at the same time, learn higher-order statistics of the causal effects, hence capturing their uncertainty. To address this problem, we adopt the Bayesian imputation approach (Imbens and Rubin, 2015), since it can capture the uncertainty of the causal estimands. We then generalize this model to a more generic model based on Gaussian processes (GPs). To train the model on multiple sources while preserving data privacy, we further decompose this generic model into multiple components, each handling one source in our multi-source data context. This follows the federated learning paradigm, introduced recently in McMahan et al. (2017), which has not previously been studied for causal inference. In short, our contributions are as follows:
• We propose a novel Federated Causal Inference (FedCI) framework that fuses federated learning and causal inference to incorporate multiple data sources while preserving the privacy rights of users.
• An advantage of the proposed method is that it also gives higher-order statistics of the causal estimands under a Bayesian approach.
• We propose a variational approximation scheme for the proposed model whose evidence lower bound can be decomposed additively across the data sources. This allows the parameters to be optimized via federated gradient averaging. We then leverage the computed predictive distribution to estimate the desired treatment effect quantities efficiently.
We carry out an empirical evaluation of the proposed framework on benchmark datasets, which shows competitive performance compared to baselines trained on the combined dataset.
Causal inference. Many recent works (e.g., Shalit et al., 2017; Yao et al., 2018; Künzel et al., 2019; Nie and Wager, 2020) used parametric methods to model the potential outcomes. These methods make the standard ignorability assumption of Rosenbaum and Rubin (1983). Louizos et al. (2017) and Madras et al. (2019) followed the structural causal model (SCM) framework (Pearl, 1995) to estimate causal effects in the presence of latent confounding variables. Bica et al. (2020a,b) formalized potential outcomes for temporal data with observed and unobserved confounding variables to estimate counterfactual outcomes for treatment plans. None of these works considers multi-source data that cannot be shared and combined into a unified dataset due to privacy constraints. Our model, in contrast, learns the individual treatment effect (ITE) and average treatment effect (ATE) while preserving the privacy of the observed individuals.
Our setting is also different from the problem of transportability of causal relations (e.g., Pearl and Bareinboim, 2011; Bareinboim and Pearl, 2013a,b, 2016), where theoretical tools were developed to transport causal effects from a source population to a target population without taking privacy constraints into account.
Federated learning. The concepts of federated learning and causal inference are two well-known areas that have been developed independently. Federated learning aims to train an algorithm across multiple decentralized clients, thus preserving the privacy of the data (McMahan et al., 2017). Two variants of federated learning are federated stochastic gradient descent (Shokri and Shmatikov, 2015) and federated averaging (McMahan et al., 2017). Recent developments in these areas (e.g., Álvarez et al.) are formalized for typical classification or regression problems. Federated learning has recently been applied to facilitate multi-institutional collaborations without sharing patient data (Rieke et al., 2020; Sheller et al., 2020) and to healthcare informatics (Lee and Shin, 2020; Xu et al., 2021). Applications of federated learning to medical data include predicting hospitalizations for cardiac events (Brisimi et al., 2018), predicting adverse drug reactions (Choudhury et al., 2019), stroke prevention (Ju et al., 2020), mortality prediction (Vaid et al., 2020), medical imaging (Ng et al., 2021), and predicting outcomes in SARS-CoV-2 patients (Flores et al., 2020). However, to the best of our knowledge, no such work addresses causal inference. Following some recent works in causal inference (e.g., Shalit et al., 2017; Yao et al., 2018; Oprescu et al., 2019; Künzel et al., 2019; Nie and Wager, 2020), we adopt the potential outcomes framework to develop a federated causal inference algorithm. Our approach is connected to the SCM approach with a causal graph that includes three variables: treatment, outcome, and observed confounder (Pearl, 2009, Chapter 7), where the causal effects can be identified using the backdoor adjustment formula (Pearl, 2009). We summarize the related models in the subsequent sections.
The concept of potential outcomes was proposed by Neyman (1923) for randomized trial experiments. Rubin (1975, 1976, 1977, 1978) re-formalized the framework for observational studies. We consider the causal effects of a binary treatment w, with w = 1 indicating assignment to 'treatment' and w = 0 indicating assignment to 'control'. Following convention in the literature (e.g., Rubin, 1978), the causal effect for individual i is defined as a comparison of the two potential outcomes, y_i(0) and y_i(1), which are the outcomes that would be observed under w = 0 and w = 1, respectively. We can never observe both y_i(1) and y_i(0) for any individual i, because it is not possible to go back in time and expose the i-th individual to the other treatment. Therefore, individual causal effects cannot be known directly and must be inferred. In this work, we generalize the Bayesian imputation model of Imbens and Rubin (2015, Chapter 8), since this model can capture the uncertainty of the causal estimands in a Bayesian setting. The model is specified as follows:

y_i(0) = β_0^⊤ x_i + ε_{0i},    y_i(1) = β_1^⊤ x_i + ε_{1i},    (1)

where ε_{0i} and ε_{1i} are Gaussian noises. The key to computing the treatment effects is y_i(0) and y_i(1); however, we cannot observe both of them, so we need to impute one of the two outcomes. Let y_{i,obs} be the observed outcome and y_{i,mis} the unobserved outcome.
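To make this concrete, the following is a minimal sketch (not the authors' code) of the linear imputation model with a conjugate Gaussian prior on β_0 and β_1; the toy data, prior scales, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_beta(X, y, noise_var=1.0, prior_var=10.0):
    """Conjugate posterior N(mu, S) over beta for y = X beta + eps,
    with beta ~ N(0, prior_var * I) and eps ~ N(0, noise_var)."""
    d = X.shape[1]
    S = np.linalg.inv(np.eye(d) / prior_var + X.T @ X / noise_var)
    mu = S @ X.T @ y / noise_var
    return mu, S

# toy observational data: covariates X, binary treatment w, observed outcome y
n, d = 200, 5
X = rng.normal(size=(n, d))
w = rng.integers(0, 2, size=n)
beta0, beta1 = rng.normal(size=d), rng.normal(size=d) + 1.0
y_obs = np.where(w == 1, X @ beta1, X @ beta0) + 0.1 * rng.normal(size=n)

# p(theta | y_obs, X, w): each treatment arm only sees the units assigned to it
mu0, S0 = posterior_beta(X[w == 0], y_obs[w == 0])
mu1, S1 = posterior_beta(X[w == 1], y_obs[w == 1])
print("posterior mean of beta1 - beta0:", mu1 - mu0)
```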
The idea is to find the marginal distribution p(y_{i,mis} | y_obs, X, w). Once the missing outcomes are imputed, the treatment effects can be estimated. Note that p(y_{i,mis} | y_obs, X, w) ≠ p(y_{i,mis} | y_{i,obs}, x_i, w_i), i.e., the outcomes of all individuals are dependent. To find the above distribution, Imbens and Rubin (2015) suggested four steps based on the following equation:

p(y_{i,mis} | y_obs, X, w) = ∫ p(y_{i,mis} | y_obs, X, w, θ) p(θ | y_obs, X, w) dθ,

where θ is the set of all parameters in the model, i.e., θ = {β_0, β_1}. The aim is to find p(y_{i,mis} | y_obs, X, w, θ) and p(θ | y_obs, X, w), and then compute the above integral to obtain p(y_{i,mis} | y_obs, X, w), which is a non-parametric prediction. In Sections 3.4, 3.5 and 3.6, we generalize this model with Gaussian processes and decompose it into multiple components to perform federated inference of the causal effects.
This section formalizes the problem of estimating causal effects under privacy constraints. We address this problem by generalizing the Bayesian imputation model presented in Section 2.2 to a more generic model based on Gaussian processes. We decompose the model into multiple components, each associated with a data source. This decomposition results in the proposed Federated Causal Inference (FedCI) method. In the following, we detail the proposed model specification and explicate the link to the causal quantities that we would like to estimate.
Problem setting & notations. Suppose we have m sources of data, each denoted by D^s = {(w^s_i, y^s_{i,obs}, x^s_i)}_{i=1}^{n_s}, where s = 1, 2, ..., m, and the quantities w^s_i, y^s_{i,obs} and x^s_i are the treatment assignment, the observed outcome associated with the treatment, and the covariates of individual i in source s, respectively. In this work, we focus on a binary treatment w^s_i ∈ {0, 1}; thus y^s_{i,obs} can be either of the potential outcomes y^s_i(0) or y^s_i(1), i.e., for each individual i, we can only observe either y^s_i(0) or y^s_i(1), but not both. We further denote the unobserved or missing outcome by y^s_{i,mis}. These variables are related through the following equations:

y^s_{i,obs} = w^s_i y^s_i(1) + (1 − w^s_i) y^s_i(0),    y^s_{i,mis} = (1 − w^s_i) y^s_i(1) + w^s_i y^s_i(0).

Thus, y^s_i(1) = y^s_{i,obs} when w^s_i = 1 and y^s_i(1) = y^s_{i,mis} when w^s_i = 0, and similarly for y^s_i(0). For notational convenience, we further denote y^s(0) = [y^s_1(0), ..., y^s_{n_s}(0)]^⊤, y^s_obs = [y^s_{1,obs}, ..., y^s_{n_s,obs}]^⊤, and similarly for y^s(1), y^s_mis, X^s and w^s.
Causal effects of interest. Due to privacy concerns, the data sources D^s are located at different sites. We are interested in estimating the individual treatment effect (ITE) and the average treatment effect (ATE), defined as

τ^s_i := y^s_i(1) − y^s_i(0),    τ := (1/n) Σ_{s=1}^m Σ_{i=1}^{n_s} (y^s_i(1) − y^s_i(0)),

where y^s_i(1) and y^s_i(0) are realized outcomes of their corresponding random variables, and n = Σ_{s=1}^m n_s is the total number of samples. Note that the ITE is also known as the conditional average treatment effect (CATE). The ITE is estimated as

τ̂^s_i = w̃^s_i ( y^s_{i,obs} − E[y^s_{i,mis} | y_obs, X, w] ),    (4)

where w̃^s_i := 2w^s_i − 1 and y_obs, X, w denote the vectors/matrices of the observed outcomes, covariates and treatments concatenated from all the sources. The estimate of the ATE is

τ̂ = (1/n) w̃^⊤ ( y_obs − E[y_mis | y_obs, X, w] ),    (5)

where w̃ := 2w − 1 with 1 a vector of ones. The above estimates capture the mean of the treatment effects; their variance follows from the same predictive distribution, since V[τ^s_i | y_obs, X, w] = V[y^s_{i,mis} | y_obs, X, w]. What remains is to learn the posterior p(y_mis | y_obs, X, w), the predictive distribution of y_mis given all the covariates, treatments and observed outcomes from all sources. In the next sections, we develop a federated GP-augmented imputation model to approximate this distribution.
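Given Monte Carlo draws of y_mis from this predictive distribution, Eqs. (4) and (5) become simple averages. A minimal illustrative sketch (the synthetic draws and all names are assumptions for demonstration, not the authors' code):

```python
import numpy as np

def treatment_effects(y_obs, w, y_mis_draws):
    """ITE/ATE estimates from Monte Carlo draws of the missing outcomes.
    y_mis_draws has shape (n_draws, n). w_tilde = 2w - 1 flips the sign so
    that ITE = y(1) - y(0) regardless of which arm was observed."""
    w_tilde = 2 * w - 1                              # +1 treated, -1 control
    ite_draws = w_tilde * (y_obs - y_mis_draws)      # shape (n_draws, n)
    ite_mean, ite_std = ite_draws.mean(0), ite_draws.std(0)
    ate_draws = ite_draws.mean(axis=1)
    return ite_mean, ite_std, ate_draws.mean(), ate_draws.std()

# toy usage: pretend the model imputed y_mis with some posterior spread
rng = np.random.default_rng(0)
n, n_draws = 100, 500
w = rng.integers(0, 2, size=n)
y_obs = rng.normal(size=n)
y_mis_draws = y_obs - (2 * w - 1) * (1.5 + 0.3 * rng.normal(size=(n_draws, n)))
print(treatment_effects(y_obs, w, y_mis_draws)[2])   # ATE estimate, near 1.5
```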
In the following, we make some assumptions that allow the causal effects to be estimated in a federated setting. The first three assumptions are standard. The fourth assumption is needed for the preprocessing step.
Assumption 1 (Unconfoundedness). Conditional on the covariates x, the treatment assignment w is independent of the potential outcomes y(0) and y(1) (Rosenbaum and Rubin, 1983).
Assumption 2 (The stable unit treatment value assumption). (i) There are no hidden versions of the treatment, and (ii) treatment on one unit does not affect the potential outcomes of another unit (Imbens and Rubin, 2015).
Assumption 3. The individuals from all sources share the same set of covariates.
Assumption 4. There exists a set of features such that any individual is uniquely identified across different sources. We refer to this set as the 'primary key'.
A 'primary key' in Assumption 4 is not limited to the observed data used for inference as described in Section 3.1; it can be any set of features that uniquely identifies an individual, such as {nationality, national id} of patients. Assumption 4 allows us to run a preprocessing procedure (if necessary) that removes repeated individuals across sources while preserving the individuals' privacy. The procedure is as follows. First, each source uses a one-way hash function (such as MD4, MD5, SHA or SHA-256) to encrypt each individual's primary key and then sends the hashed sequences to a server; in this way, the individuals' data remain secured. The one-way hash function is agreed among the sources so that they all use the same function. The server then collects the hashed sequences from all sources and runs a matching algorithm to detect individuals repeated across sources. For each repeated individual, the server randomly chooses to keep that individual in a small, predefined number of sources and informs the remaining sources to exclude the individual from the training process. The whole procedure ensures that no individual appears in a large number of sources, thus preventing a biased model. We summarize the procedure in Figure 1. Assumption 4 and the preprocessing procedure are only required for data with a high rate of repetition across sources. For data unlikely to have many repetitions, such as patients from hospitals in different countries, they are not required. Note that existing (centralized) methods also need Assumption 4, since they need to combine data and remove repeated individuals. Note that Assumption 1 is not testable, since we cannot observe both y^s_i(0) and y^s_i(1); this is well documented (Imai et al., 2010). Assumption 2, however, is likely to hold in real-life settings. For example, a patient whose blood pressure increases under a medication cannot in any shape or form influence the blood pressure (outcome) of another patient. In addition, most hospitals collect common covariates of their patients, so Assumption 3 is also reasonable. By the preceding discussion, Assumption 4 is a realistic assumption. In the subsequent sections, we assume that all the assumptions described in this section are satisfied, and that the preprocessing procedure has been performed where necessary.
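A minimal sketch of the hash-based deduplication, assuming SHA-256 as the agreed one-way hash. The server below keeps each repeated individual in at most max_copies sources (the paper chooses the kept sources at random; the first ones are kept here for brevity), and all names are illustrative.

```python
import hashlib
from collections import defaultdict

def hash_key(primary_key: str) -> str:
    """One-way hash agreed upon by all sources (SHA-256 here)."""
    return hashlib.sha256(primary_key.encode("utf-8")).hexdigest()

# each source hashes its primary keys locally; only hashes reach the server
keys_per_source = {                                   # toy data
    "hospital_a": ["VN-001", "VN-002", "VN-003"],
    "hospital_b": ["VN-002", "VN-004"],
    "hospital_c": ["VN-002", "VN-003"],
}
hashes = {s: {hash_key(k) for k in ks} for s, ks in keys_per_source.items()}

def duplicates_to_drop(hashes, max_copies=1):
    """Server side: for a hash seen in several sources, keep it in at most
    max_copies sources and tell the remaining sources to drop it."""
    owners = defaultdict(list)
    for s, hs in hashes.items():
        for h in hs:
            owners[h].append(s)
    drop = defaultdict(set)
    for h, sources in owners.items():
        for s in sources[max_copies:]:
            drop[s].add(h)
    return drop

print({s: len(hs) for s, hs in duplicates_to_drop(hashes).items()})
```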
The model presented in Eq. (1) is a simple Bayesian linear model. In this section, we present a more generic nonlinear model under the Bayesian setting. In particular, since β_0 and β_1 follow multivariate normal distributions, the two components β_0^⊤ x_i and β_1^⊤ x_i also follow multivariate normal distributions. These two components generalize to f_0(x_i) = β_0^⊤ φ(x_i) and f_1(x_i) = β_1^⊤ φ(x_i), where φ(x_i) is a vector of basis functions with input x_i. This formulation implies that the marginals of f_0(x) and f_1(x) are Gaussian processes. Thus, we propose modelling f_0 and f_1 directly as Gaussian processes with covariance matrix K, where K denotes the covariance matrix computed with a kernel function k(x, x′). Similar to the imputation model of Imbens and Rubin (2015), the model presented here also requires finding the marginal distribution p(y_{i,mis} | y_obs, X, w). Although this model is generic, it requires access to all of the observed data to compute K, and hence violates privacy rights. In the subsequent sections, we propose a federated model that addresses this problem.
Recall that the aim is to find p(y_mis | y_obs, X, w) so that we may in turn compute Eqs. (4) and (5) to arrive at the quantities of interest. To that end, we propose to model the joint distribution of the potential outcomes as follows:

[y^s_i(0), y^s_i(1)]^⊤ = Φ^{1/2} [f^s_0(x_i) + g^s_0, f^s_1(x_i) + g^s_1]^⊤ + Σ^{1/2} ε^s_i,    (7)

where ε^s_i ∼ N(0, I_2) handles the noise of the outcomes. As mentioned earlier in Sections 2.2 and 3.4, all the outcomes are dependent in the Bayesian imputation approach. This dependency is handled via f^s_j(x_i) and g^s_j (j ∈ {0, 1}). We call the dependency handled by f^s_j(x_i) intra-dependency and the one captured by g^s_j inter-dependency.
Intra-dependency. f^s_0(x_i) and f^s_1(x_i) are GP-distributed functions, which allow us to model each source dataset simultaneously along with their heterogeneous correlation. Specifically, we model f^s_0 ∼ GP(µ_0(·), k(·,·)) and f^s_1 ∼ GP(µ_1(·), k(·,·)), where µ_0(·) and µ_1(·) are functions modelling the means of these GPs. The parameters of these functions and the hyperparameters of the kernel function are shared across the sources. These GPs handle the correlation within one source only.
Inter-dependency. To capture dependency among the sources, we introduce the variable g = [g_0, g_1], where g_0 ∼ N(r_0, M) and g_1 ∼ N(r_1, M). Each g^s_0 and g^s_1 is shared within source s, and they are correlated across the sources s ∈ {1, ..., m}. The correlation among the sources is modelled via the covariance matrix M, which is computed with a kernel function. The inputs to this kernel function are sufficient statistics (we used the mean, variance, skewness, and kurtosis) of each covariate x^s within source s. We denote the first four moments of the covariates by x̄^s ∈ R^{4d_x×1} and the kernel function by γ(x̄^s, x̄^{s′}), which evaluates the correlation of two sources s and s′. The above formulation implies that g_0 and g_1 are GPs. The elements of r_0 and r_1 are computed with the mean functions r_0(x̄^s) and r_1(x̄^s), respectively. In this setting, we only share the sufficient statistics of the covariates, never the covariates of a specific individual, hence preserving the privacy of all individuals.
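A minimal sketch of this inter-dependency kernel over per-source moment vectors x̄^s; an RBF form for γ is an assumption here, and only these summaries ever leave a source.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def source_summary(X):
    """First four moments of each covariate: the only statistics a source
    shares, so no individual row ever leaves the source."""
    return np.concatenate([X.mean(0), X.var(0), skew(X, axis=0), kurtosis(X, axis=0)])

def inter_source_cov(summaries, lengthscale=5.0):
    """M[s, s'] = gamma(xbar_s, xbar_s'); an RBF kernel is assumed."""
    S = np.stack(summaries)                               # shape (m, 4 * d_x)
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)   # pairwise distances
    return np.exp(-0.5 * sq / lengthscale**2)

# toy usage with m = 3 sources: the two similar sources correlate strongly
rng = np.random.default_rng(0)
sources = [rng.normal(loc=mu, size=(200, 4)) for mu in (0.0, 0.1, 3.0)]
M = inter_source_cov([source_summary(X) for X in sources])
print(np.round(M, 3))
```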
The two variables Φ and Σ. These variables are positive semi-definite matrices capturing the correlation between the two potential outcomes y^s_i(0) and y^s_i(1); Φ^{1/2} and Σ^{1/2} are their Cholesky decomposition matrices. Note that Φ and Σ are also random variables. The reason we constrain Φ and Σ to be positive semi-definite is explained later in Lemma 2. Because of this constraint, we model their priors using Wishart distributions, Φ ∼ Wishart(V_0, d_0) and Σ ∼ Wishart(S_0, n_0), where V_0, S_0 ∈ R^{2×2} are predefined positive semi-definite matrices and d_0, n_0 ≥ 2 are predefined degrees of freedom.

Figure 2: Graphical model that summarizes the proposed framework with treatment w^s, covariates X^s, and the two potential outcomes y^s_mis and y^s_obs. The quantity f^s is idiosyncratic to the sources and g contains shared characteristics across all the sources. Σ and Φ are shared parameters. Note that this is not a causal graph.

The graphical model of our framework. We summarize our framework in Figure 2. The figure shows that g, Σ and Φ are shared across the sources, thus capturing the correlation among them, while f^s is specific to source s and captures the correlation among individuals within this source. To see how our model handles dependency between the outcomes of two different sources through the latent variable g, we block the paths between two sources s and s′ through Φ and Σ and keep only the path through g. The covariance between the outcomes of s and s′ is presented in Lemma 1.
Lemma 1. Let s and s′ be two different sources. Then Cov(y^s_i, y^{s′}_j | Σ, Φ) = γ(x̄^s, x̄^{s′}) Φ.
The factor γ(x̄^s, x̄^{s′}) in Lemma 1 is non-zero, which implies that y^s_i(0) and y^{s′}_j(0) are correlated, and so are y^s_i(1) and y^{s′}_j(1).
In this section, we present some results on the joint distribution of the potential outcomes. We then construct an objective function that can be trained in a federated fashion.
3.6.1 The joint distribution of the outcomes. In the following, we derive some results that are helpful in constructing the federated objective function in Section 3.6.2. Due to limited space, we defer the proofs of these results to the Appendix. For convenience in presenting the subsequent results, we further denote g^s = [g^s_0, g^s_1], where g^s_0 = [g^s_0, ..., g^s_0]^⊤ ∈ R^{n_s} and g^s_1 = [g^s_1, ..., g^s_1]^⊤ ∈ R^{n_s}.
Lemma 2. Let Φ, Σ, K, µ_0(X^s), µ_1(X^s), and g^s satisfy the model in Eq. (7). Then

[y^s(0)^⊤, y^s(1)^⊤]^⊤ | Φ, Σ, g^s ∼ N( (Φ^{1/2} ⊗ I_{n_s}) [(µ_0(X^s)+g^s_0)^⊤, (µ_1(X^s)+g^s_1)^⊤]^⊤,  Φ ⊗ K^s + Σ ⊗ I_{n_s} ),

where ⊗ is the Kronecker product.
From Lemma 2, we observe that Φ, K^s, Σ, and I_{n_s} are positive semi-definite; thus the covariance matrix Φ ⊗ K^s + Σ ⊗ I_{n_s} is positive semi-definite by the fundamental property of the Kronecker product. This explains why we chose Φ and Σ to be positive semi-definite in our model; otherwise, the covariance matrix would be invalid. From Lemma 2, we can obtain the following result.
Lemma 3. Let Φ, Σ, K, µ_0(X^s), µ_1(X^s), and g^s satisfy the model in Eq. (7). Then

[y^s_obs^⊤, y^s_mis^⊤]^⊤ | X^s, w^s, Φ, Σ, g^s ∼ N( [µ_obs(X^s)^⊤, µ_mis(X^s)^⊤]^⊤,  [[K^s_obs, K^s_om], [(K^s_om)^⊤, K^s_mis]] ).

The mean functions µ_obs(X^s) and µ_mis(X^s) are as follows:

µ_obs(X^s) = (1 − w^s) ∘ m_0 + w^s ∘ m_1,    µ_mis(X^s) = w^s ∘ m_0 + (1 − w^s) ∘ m_1,

where m_0 = φ*_{11} (µ_0(X^s) + g^s_0) and m_1 = φ*_{21} (µ_0(X^s) + g^s_0) + φ*_{22} (µ_1(X^s) + g^s_1), with φ*_{ab} the (a, b)-th element of the Cholesky decomposition matrix of Φ, 1 a vector of ones, and ∘ the element-wise product. The covariance matrices K^s_obs, K^s_mis, and K^s_om are computed with kernel functions whose entries depend on φ_{ab} and σ_{ab}, the (a, b)-th elements of Φ and Σ, respectively.
In the subsequent sections, we use the result in Lemma 3 to obtain the conditional likelihood p(y^s_obs | X^s, w^s, Φ, Σ, g^s), which is used to infer the parameters and hyperparameters of our proposed model. We also obtain the posterior p(y^s_mis | y^s_obs, X^s, w^s, Φ, Σ, g) to estimate the ITE and local ATE.
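A numerical illustration of the covariance structure in Lemma 2: with positive semi-definite Φ and Σ, the matrix Φ ⊗ K^s + Σ ⊗ I_{n_s} is a valid Gaussian covariance. The kernel and sizes below are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X, lengthscale=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

n_s, d_x = 6, 3
K_s = rbf(rng.normal(size=(n_s, d_x)))          # within-source kernel matrix

# random 2x2 PSD matrices for Phi and Sigma (A @ A.T is always PSD)
A, B = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
Phi, Sigma = A @ A.T, B @ B.T

# Lemma 2: covariance of the stacked outcomes [y^s(0); y^s(1)]
cov = np.kron(Phi, K_s) + np.kron(Sigma, np.eye(n_s))

# all eigenvalues are non-negative (up to round-off): a valid covariance
print(np.linalg.eigvalsh(cov).min() >= -1e-9)
```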
3.6.2 The federated objective function. Since estimating p(y^s_mis | y^s_obs, X^s, w^s) exactly is intractable, we sidestep this intractability via a variational approximation (Kingma and Welling, 2013; Blei et al., 2017). To achieve this, we maximize the following evidence lower bound (ELBO):

L = Σ_{s=1}^m L_s,    where L_s = E_q[ log p(y^s_obs | ·) ] − (1/m) Σ_{z ∈ {g, Φ, Σ}} D_KL( q(z) ‖ p(z) ).    (8)

The conditional likelihood p(y^s_obs | ·) is obtained from Lemma 3 by marginalizing out y^s_mis, i.e., p(y^s_obs | X^s, w^s, Φ, Σ, g^s) = N(y^s_obs; µ_obs(X^s), K^s_obs). We observe that this conditional likelihood is free of σ_{21} and σ_{12}, which capture the correlation of the two potential outcomes. Thus the posterior of these variables coincides with their prior, i.e., this correlation cannot be learned but must be set as a prior. This is well known, since one of the potential outcomes can never be observed (Imbens and Rubin, 2015). In Eq. (8), the ELBO L is derived from the joint marginal likelihood of all m sources, and it factorizes into m components L_s, one per source. This enables federated optimization of L. The first term of L_s is the expectation of the conditional likelihood with respect to the variational posterior q(g, Φ, Σ); this distribution is therefore learned from the data of all the sources. In the following, we present its factorization.
Variational posterior distributions. We apply the typical mean-field approximation to factorize the variational posterior as q(Φ, Σ, g) = q(Φ) q(Σ) q(g), where q(g) = q(g_0) q(g_1) with q(g_j) = N(g_j; h_j, U) for j ∈ {0, 1}; h_0(·) and h_1(·) are the mean functions, and U is the covariance matrix computed with a kernel function κ(u^s, u^{s′}), where u^s := [ỹ^s_obs(0), ỹ^s_obs(1), x̄^s, w̄^s]. Since Φ and Σ are positive semi-definite matrices, we model their variational posteriors as Wishart distributions: q(Φ) = Wishart(Φ; V_q, d_q) and q(Σ) = Wishart(Σ; S_q, n_q), where d_q, n_q are degrees of freedom and V_q, S_q are positive semi-definite scale matrices. We set the form of these scale matrices as

V_q = [[ν_1^2, ρ ν_1 ν_2], [ρ ν_1 ν_2, ν_2^2]],    S_q = [[δ_1^2, η δ_1 δ_2], [η δ_1 δ_2, δ_2^2]],

where ν_i, ρ, δ_i, η are parameters to be learned and ρ, η ∈ [0, 1], so that positive semi-definiteness holds by construction. To maximize the ELBO, we approximate the expectation in L_s with Monte Carlo integration, which requires drawing samples of g, Φ and Σ from their variational distributions. This in turn requires a reparameterization so that gradients can pass through the random variables g, Φ and Σ. Since we model the correlation among individuals and the correlation between the two potential outcomes, the typical reparameterization of Kingma and Welling (2013) cannot be applied, as it only holds for diagonal covariance matrices. We apply a more general reparameterization,

g_j = h_j + U^{1/2} ξ_j,    j ∈ {0, 1},

where ξ_j ∼ N(0, I_m) and U^{1/2} is the Cholesky decomposition matrix of U. Since q(Φ) is modeled as a Wishart distribution, we introduce the following procedure to draw Φ:

Φ = Σ_{i=1}^{d_q} (V_q^{1/2} z_i)(V_q^{1/2} z_i)^⊤,    z_i ∼ N(0, I_2),

where V_q^{1/2} is the Cholesky decomposition matrix of V_q. Likewise, we apply the same procedure to draw Σ.
The federated optimization algorithm. With the above model and objective function, we can compute the gradients of the learnable parameters separately in each source, without sharing data with a central server; the privacy constraints are thus satisfied. We summarize our inference procedure in Algorithm 1.
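A minimal sketch of the federated optimization loop, with a quadratic stand-in for the per-source objective L_s; in FedCI the local gradient would instead be the reparameterized Monte Carlo gradient of L_s. Only gradients, never data, reach the server. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_gradient(theta, X, y):
    """Gradient of this source's term of the objective w.r.t. the shared
    parameters; a least-squares stand-in for the ELBO term L_s."""
    return 2 * X.T @ (X @ theta - y) / len(y)

# toy sources: each keeps (X, y) locally and never shares it
theta_true = rng.normal(size=5)
sources = []
for _ in range(3):
    X = rng.normal(size=(50, 5))
    sources.append((X, X @ theta_true + 0.1 * rng.normal(size=50)))

theta, lr = np.zeros(5), 0.1          # shared parameters held by the server
for step in range(300):
    grads = [local_gradient(theta, X, y) for X, y in sources]  # local step
    theta -= lr * np.mean(grads, axis=0)                       # server averages

print(np.allclose(theta, theta_true, atol=0.05))
```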
3.6.3 How do data from all sources help predict causal effects in a specific source? Recall that the key to estimating ITE and ATE is to find the predictive distribution p(y_mis | y_obs, X, w). This distribution can be estimated through the following relation:

p(y_mis | y_obs, X, w) ≈ E_q[ Π_{s=1}^m p(y^s_mis | y^s_obs, X^s, w^s, Φ, Σ, g) ],

where the expectation is with respect to the variational distribution q(Φ, Σ, g). To understand why data from all the sources help predict causal effects in a source s, observe that

p(y^s_mis | y_obs, X, w) ≈ E_q[ p(y^s_mis | y^s_obs, X^s, w^s, Φ, Σ, g) ] = p(y^s_mis | y^s_obs, X^s, w^s, ỹ_obs(0), ỹ_obs(1), X̄, w̄).    (10)

Eq. (10) is an approximation of the predictive distribution of the missing outcomes y^s_mis, and it depends on three components: (i) the observed outcomes, covariates and treatment assignments from the same source s; (ii) the shared parameters Θ learned from the data of all the sources; and (iii) the sufficient statistics of the observed data from all the sources. The last two components, (ii) and (iii), indicate that the predictive distribution in source s utilizes knowledge from all the sources through Θ and the sufficient statistics [ỹ_obs(0), ỹ_obs(1), X̄, w̄]. This explains why data from all the sources help predict the missing outcomes in source s.
Baselines and the aims of our experiments. In this section, we first perform experiments to examine the performance of FedCI. We then compare the performance of FedCI against recent methods, including BART (Hill, 2011), CEVAE (Louizos et al., 2017), OrthoRF (Oprescu et al., 2019), X-learner (Künzel et al., 2019), and R-learner (Nie and Wager, 2020). Note that none of these works considers causal effects in a federated setting. The aim of this analysis is to show the efficacy of our method compared to the baselines trained in three different cases: (1) training a local model on each source's data; (2) training a global model on the combined data of all sources; (3) using bootstrap aggregating (also known as bagging, an ensemble learning method; Breiman, 1996), where m models are trained separately on each source's data and the predicted treatment effects of the models are averaged. Note that case (2) violates individuals' privacy rights and is used for comparison purposes only. In general, we expect the performance of FedCI to be close to that of the baselines in case (2). An implementation of CEVAE is readily available (Louizos et al., 2017). For BART (Hill, 2011), we use the package BartPy, which is also available online. For X-learner (Künzel et al., 2019) and R-learner (Nie and Wager, 2020), we use the package causalml; in both methods, we use xgboost.XGBRegressor as the learner for the outcomes. For OrthoRF (Oprescu et al., 2019), we use the package econml (Microsoft Research, 2019). For all methods, we fine-tune the learning rate over {10^−1, 10^−2, 10^−3, 10^−4} and the regularizers over {10^1, 10^0, 10^−1, 10^−2, 10^−3}.
Evaluation metrics. We report two evaluation metrics: (i) the precision in estimation of heterogeneous effects (PEHE) (Hill, 2011),

PEHE := Σ_{s=1}^m Σ_{i=1}^{n_s} (τ^s_i − τ̂^s_i)^2 / (m n_s),

for evaluating ITE, and (ii) the absolute error

ε_ATE := |τ − τ̂|,

for evaluating ATE, where τ^s_i and τ are the true ITE and true ATE, respectively, and τ̂^s_i and τ̂ are their estimates. Note that these evaluation metrics are for point estimates of the treatment effects; in our case, the point estimates are the means of the ITE and ATE under their predictive distributions.
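A minimal sketch of the two metrics, taking lists of per-source arrays of true and estimated ITEs; all names are illustrative.

```python
import numpy as np

def pehe(tau_true, tau_hat):
    """PEHE = sum_s sum_i (tau_i^s - tau_hat_i^s)^2 / (m * n_s), given lists
    of per-source ITE arrays (tables typically report its square root)."""
    m = len(tau_true)
    return sum(np.mean((t - th) ** 2) for t, th in zip(tau_true, tau_hat)) / m

def ate_error(tau_true, tau_hat):
    """Absolute error between the true and estimated ATE, pooled over sources."""
    return abs(np.concatenate(tau_true).mean() - np.concatenate(tau_hat).mean())
```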
Data description. Obtaining ground truth for evaluating causal inference algorithms is challenging; thus, most state-of-the-art methods are evaluated on synthetic or semi-synthetic datasets. In this experiment, the synthetic data are simulated from parametric distributions over w_i, x_i, y_i(0) and y_i(1), where ϕ(·) denotes the sigmoid function and λ(·) denotes the softplus function. We simulate two synthetic datasets: DATA-1 and DATA-2. For DATA-1, the ground-truth parameters are randomly set as follows: σ_0 = σ_1 = 1, (a_0, b_0, c_0) = (0.6, 0.9, 2.0), where 1 is a vector of ones and I_{d_x} is an identity matrix. For DATA-2, we set (b_0, c_0) = (6, 30), b_1 ∼ N(10·1, 2·I_{d_x}), c_1 ∼ N(15·1, 2·I_{d_x}), and the other parameters as in DATA-1. The purpose is to produce two different scales of the outcomes for the two datasets. For each dataset, we simulate 10 replications with n = 5000 records. We keep only {(y_i, w_i, x_i)}_{i=1}^n as the observed data, where y_i = y_i(0) if w_i = 0 and y_i = y_i(1) if w_i = 1. We divide the data into five sources, each consisting of n_s = 1000 records. In each source, we use 50 records for training, 450 for testing and 400 for validation. In the following, we report the evaluation metrics and their standard errors over the 10 replications. The parameters chosen for this simulation satisfy Assumption 1, since y_i(0) and y_i(1) are independent of w_i given x_i. Assumption 2 is respected, as the treatment of an individual i does not affect the outcome of another individual j (i ≠ j). Since we fix the dimension of x_i and draw it from the same distribution, Assumption 3 is implicitly satisfied. It is important to note that Assumption 4 and the preprocessing procedure are not necessary here, since each record drawn from the above distributions is attributed to one individual, which means that no individual is duplicated across sources. In a real-life setting where individuals do appear in multiple sources, Assumption 4 needs to hold, and the preprocessing procedure described in Section 3.3 has to be performed to exclude the repeated individuals from the training process.
FedCI is as good as training on combined data. Figure 3 reports the evaluation metrics of FedCI against two baselines: training on the combined data and training locally on each data source. As expected, the errors of FedCI are as low as those of training on the combined data. This result verifies the efficacy of the proposed federated algorithm.
Inter-dependency component analysis. We study the impact of the inter-dependency component (see Section 3.5) by removing it from the model. Figure 4 presents the errors of FedCI compared with 'no inter-dependency' (FedCI without the inter-dependency component). The errors in predicting ITE and ATE of 'no inter-dependency' tend to be higher than those of FedCI. This result showcases the importance of the proposed inter-dependency component.
Contrasting with existing baselines. In this experiment, we compare FedCI with the existing baselines. None of the baselines considers estimating causal effects on multiple sources under privacy constraints; thus, we train them in the three cases explained earlier: (1) training locally (loc), (2) training on combined data (com), and (3) training with bootstrap aggregating (agg). Again, case (2) violates the privacy constraints, and we expect the error of FedCI to be close to case (2) of the baselines.
Tables 1 and 2 report the performance of each method in estimating ATE and ITE. Across the different outcome scales of the two synthetic datasets, FedCI achieves competitive results compared to all the baselines; in particular, FedCI is among the top-3 performers. Importantly, FedCI obtains lower errors than BART_com, X-Learner_com, R-Learner_com, and OrthoRF_com, which were trained on combined data and thus violate the privacy constraints. Compared with CEVAE_com, FedCI is better in predicting ITE and comparable in predicting ATE (slightly higher errors). We emphasize, however, that this result is expected, since FedCI is a federated learning algorithm while CEVAE_com is not.
The estimated distribution of ATE. To analyze uncertainty, Figure 5 presents the estimated distribution of the ATE in the first source (s = 1). The figures show that the true ATE is covered by the estimated interval and that the estimated mean ATE shifts towards its true value (dotted lines) as more data sources are used. This result may provide helpful information for users.
Data description. The Infant Health and Development Program (IHDP) (Hill, 2011) is a randomized study on the impact of specialist visits (the treatment) on the cognitive development of children (the outcome). The dataset consists of 747 records with 25 covariates describing properties of the children and their mothers. The treatment group includes children who received specialist visits, and the control group includes children who did not. For each unit, a treated and a control outcome are simulated using the numerical schemes provided in the NPCI package (Dorie, 2016), thus allowing us to know the true individual treatment effects. We use 10 replicates of the dataset in this experiment. For each replicate, we divide the data into three sources, each consisting of 249 records, and split each source into three equal sets for training, testing, and validation. We report the mean and standard error of the evaluation metrics over the 10 replicates. This dataset satisfies Assumptions 1, 2 and 3; Assumption 4 is redundant since there are no repetitions of individuals in this dataset.
Results and discussion. As in the experiments on synthetic data, we train the baselines in the three cases explained earlier and expect the errors of FedCI to be close to those of the baselines trained on combined data (com). The results reported in Table 3 show that FedCI achieves competitive results compared to the baselines (we skip the first case (loc) here; see the Appendix for the full table). Indeed, FedCI is among the top-3 performers. This result again verifies that FedCI can estimate causal effects effectively under the privacy constraints of the data sources. The estimated distribution of ATE is presented in the Appendix due to limited space.
We introduced a causal inference paradigm via a reformulation of multi-output GPs to learn causal effects while keeping private data at their local sites. A modular inference method whose ELBO decomposes additively across data sources was presented. We posit that our formulation would prove useful in a diverse range of use cases within causal inference across a range of applications.
We note that the inherent use of GPs in our approach incurs cubic time complexity for inverting the covariance matrix in each source. A possible future direction is to reformulate this in terms of recent sparse Gaussian process models.
A.1 IHDP dataset. In this section, we present additional experimental results on the IHDP dataset that were not presented in the main text due to limited space. In Table 4 (first five rows), we present additional results for the baselines trained locally (loc). As in the experiments on synthetic data, the results show that FedCI achieves much smaller errors. The reason is that FedCI accesses all the data sources in a federated fashion, while the baselines trained locally (loc) only access a local data source. Similar to the experiment on synthetic data, the estimated distribution of the ATE in the first source (s = 1) is presented in Figure 6. Again, the figures show that the true ATE lies inside the estimated interval and the estimated mean ATE shifts towards its true value (dotted lines) as more data sources are used.
A.2 Synthetic data: DATA-2. In this section, we present additional experimental results on DATA-2 that were skipped in the main text due to limited space. In Table 5 (first five rows), we present additional results for the baselines trained locally (loc) and with bootstrap aggregating (agg). As with the results on DATA-1 presented in the main text, the results on DATA-2 show that FedCI achieves much lower errors, especially in predicting ITE.

Table 4: Out-of-sample errors on the IHDP dataset, where top-3 performances are highlighted in bold (lower is better). The dashes (-) in 'loc' and 'agg' indicate that the numbers are the same as those of 'com'.

Method          √PEHE, 1 source  √PEHE, 2 sources  √PEHE, 3 sources  ε_ATE, 1 source  ε_ATE, 2 sources  ε_ATE, 3 sources
BART_loc        -                5.83±2.6          6.56±3.3          -                2.09±0.9          1.38±0.5
X-Learner_loc   -                4.14±1.5          4.54±1.9          -                1.51±0.7          0.77±0.5
R-Learner_loc   -                6.35±1.9          6.16±2.0          -                2.

Figure 6: The estimated ATE distribution on source #1 of the IHDP dataset. The dotted black lines represent the true ATE.

Proof of Lemma 1. Cov(y^s_i, y^{s′}_j | Σ, Φ) = Cov(Φ^{1/2} g^s, Φ^{1/2} g^{s′} | Σ, Φ) = Φ^{1/2} Cov(g^s, g^{s′}) (Φ^{1/2})^⊤ = γ(x̄^s, x̄^{s′}) Φ, since f^s, f^{s′} and the noise terms are independent across sources. This completes the proof.

Proof of Lemma 2. We denote ξ^s_0 ∼ N(0, I_{n_s}) and ξ^s_1 ∼ N(0, I_{n_s}). Then, from the model definition (Eq. (7) in the main text), we have

[y^s_1(0), ..., y^s_{n_s}(0); y^s_1(1), ..., y^s_{n_s}(1)] = Φ^{1/2} [f^s_1(0)+g^s(0), ..., f^s_{n_s}(0)+g^s(0); f^s_1(1)+g^s(1), ..., f^s_{n_s}(1)+g^s(1)] + Σ^{1/2} [ε^s_1(0), ..., ε^s_{n_s}(0); ε^s_1(1), ..., ε^s_{n_s}(1)],
which is equivalent to the following:

Y^s = [ µ_0(X^s) + g^s_0 + (K^s)^{1/2} ξ^s_0,  µ_1(X^s) + g^s_1 + (K^s)^{1/2} ξ^s_1 ] (Φ^{1/2})^⊤ + [ ε^s_0, ε^s_1 ] (Σ^{1/2})^⊤,
which implies that the covariance of the stacked outcomes [y^s(0)^⊤, y^s(1)^⊤]^⊤ is Φ ⊗ K^s + Σ ⊗ I_{n_s}, as stated in Lemma 2. This completes the proof.

Proof of Lemma 3. Following the proof of Lemma 2, we note that if the observed treatment w^s_i = 0, then the mean of p(y^s_{i,obs} | X^s, w^s, Φ, Σ, g^s) equals the mean of p(y^s_i(0) | Φ, Σ, X^s, w^s, g^s), and the mean of p(y^s_{i,mis} | X^s, w^s, Φ, Σ, g^s) equals the mean of p(y^s_i(1) | Φ, Σ, X^s, w^s, g^s). If the observed treatment w^s_i = 1, then the mean of p(y^s_{i,obs} | X^s, w^s, Φ, Σ, g^s) equals the mean of p(y^s_i(1) | Φ, Σ, X^s, w^s, g^s), and the mean of p(y^s_{i,mis} | X^s, w^s, Φ, Σ, g^s) equals the mean of p(y^s_i(0) | Φ, Σ, X^s, w^s, g^s). Similarly, each element in K_obs and K_mis also depends on whether w^s_i = 0 or w^s_i = 1.
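As a complement to the proof of Lemma 3, the posterior of the missing outcomes given the observed ones is the standard conditional of a joint Gaussian. A minimal sketch, assuming the blocks µ_obs, µ_mis, K_obs, K_mis, K_om have been built as in Lemma 3:

```python
import numpy as np

def condition_gaussian(mu_obs, mu_mis, K_obs, K_mis, K_om, y_obs):
    """p(y_mis | y_obs) for the joint Gaussian of Lemma 3 with covariance
    blocks [[K_obs, K_om], [K_om.T, K_mis]]: standard Gaussian conditioning."""
    alpha = np.linalg.solve(K_obs, y_obs - mu_obs)
    mean = mu_mis + K_om.T @ alpha
    cov = K_mis - K_om.T @ np.linalg.solve(K_obs, K_om)
    return mean, cov
```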