key: cord-0681560-ipl45svm
authors: Squires, Chandler; Shen, Dennis; Agarwal, Anish; Shah, Devavrat; Uhler, Caroline
title: Causal Imputation via Synthetic Interventions
date: 2020-11-05
journal: nan
DOI: nan
sha: 1306e7695e5a9ea134c40e0d5806809ee3bc9608
doc_id: 681560
cord_uid: ipl45svm

Consider the problem of determining the effect of a drug on a specific cell type. To answer this question, researchers traditionally need to run an experiment applying the drug of interest to that cell type. This approach is not scalable: given a large number of different actions (drugs) and a large number of different contexts (cell types), it is infeasible to run an experiment for every action-context pair. In such cases, one would ideally like to predict the result for every pair while only having to perform experiments on a small subset of pairs. This task, which we label "causal imputation", is a generalization of the causal transportability problem. In this paper, we provide two main contributions. First, we demonstrate the efficacy of the recently introduced synthetic interventions estimator on the task of causal imputation when applied to the prominent CMAP dataset. Second, we explain the demonstrated success of this estimator by introducing a generic linear structural causal model which accounts for the interaction between cell type and drug.

A central goal in science is to predict the effect of our actions (clinical treatments, economic policies, or social programs, to name a few) prior to deciding which action to execute. A key challenge is that the available data about the effect of an action almost always come from a different context than the one in which we are trying to make a prediction. For example, consider the problem of predicting the effect of 10,000 different drugs on 100 different cell types.

The traditional method of experimental design, which tests every possible combination of drug and cell type, is not scalable to such large problems since 1 million different experiments are needed. An alternative approach, which is taken in this paper, is to first collect data on only a small subset of such pairs, to then infer the relationships between different cell types and drugs, and lastly to apply these learnt relationships to impute the effect of untested drug-cell type pairs.

Causal transportability, structure discovery. A related line of work that focuses on the task of predicting the effect of an experiment in a new context goes by the name of causal transportability (Bareinboim and Pearl, 2014; Lee et al., 2020). This work starts from the assumption that one possesses a selection diagram, which specifies both a causal model of the underlying system and the potential differences in this model between contexts. Access to such a selection diagram is infeasible for large, complex systems such as gene regulatory networks, where active research is still required to establish the pattern of regulatory relationships and the ways in which they differ between cell types. When there are a small number of variables, a potential solution to this problem is to first perform causal structure discovery using the available interventional data (Wang et al., 2017; Yang et al., 2018) and then to apply causal transportability techniques to the resulting graphs. However, in large systems with hundreds or thousands of variables, it is often infeasible to gather enough data to perform causal structure discovery with high certainty. Thus, we seek to develop direct methods, which estimate the effect in the target domain without first learning any causal structure.

Synthetic interventions. This direct approach is reminiscent of the approaches taken in the field of policy evaluation. In particular, Agarwal et al. (2020) recently introduced the synthetic interventions (SI) method for solving the causal imputation problem. They considered a setting where one observes data associated with a collection of units (e.g., cell types). Here, there is a pre-intervention period where all units are observed under control (e.g., absence of any actions), followed by a post-intervention period where each unit experiences one intervention or action (e.g., drug) from some finite collection of interventions. Under this sparse observation pattern, the goal in that work is to impute what would have occurred to each unit under every intervention in the post-intervention period. Agarwal et al. (2020) proved that, under suitable assumptions, the SI estimator consistently imputes these counterfactuals without having to first learn the underlying causal graph.

Our contributions. The main contributions of this work are two-fold. First, we establish that under a generic linear structural causal model, the model assumptions required for SI are satisfied. Specifically, in the context of our setup, the key assumption in the SI framework is the existence of a particular linear algebraic relationship across the contexts and actions; more formally, a latent factor model (see Definition 2). Pleasingly, this factor model implies the existence of an invariant linear relationship between actions or contexts; this invariance is key in allowing us to effectively perform causal imputation. Moreover, in this work, we argue that a generic linear structural causal model implies a factor model. Subsequently, we establish that the direct approach of SI provides a consistent estimator for causal imputation under this generic linear structural causal model without requiring the model to be learned.

Second, we utilize the SI method for directly predicting the effect of a drug on a cell type for a large number of drug-cell type pairs using a small fraction of data. Specifically, we focus on the prominent CMAP dataset, which contains gene expression signatures for over 20,000 different small molecules across over 70 different cell types. Empirically, we demonstrate that the SI method performs surprisingly well: it accurately predicts the effect of an untested drug-cell type pair, outperforming natural baselines. In particular, SI achieves a median leave-one-out (LOO) $R^2$ of 0.89, compared to a median LOO $R^2$ of 0.83 for the best baseline. This is a significant improvement, considering that the baseline is well known to perform strongly already, which leaves little room for improvement.

Our experiments show that applying SI across drugs or perturbations (actions) for a given cell type (context) significantly outperforms applying SI across cell types for a given perturbation, a result that is of biological interest in its own right.

Organization of the paper. In Section 2, we review related work on methods related to causal imputation. In Section 4, we review the SI method, describe how we adapt it for causal imputation, and provide a simple linear structural causal model under which the SI estimator is consistent. Finally, in Section 5, we describe the CMAP dataset, and evaluate the SI method against natural baselines with respect to the task of predicting gene expression for chemical perturbations across different cell types.

Given the ubiquity of the task of predicting the effect of an action in a context different from the ones we observe, there are many lines of study tackling this problem; we highlight some of the most relevant below.

Causal transportability. As stated earlier, a closely related line of work to causal imputation is causal transportability (Bareinboim and Pearl, 2014, 2016). The objective in causal transport is to find a transport formula from a set of "source" contexts to a target context, given the ability to perform some subset of experiments in the source contexts. Whereas causal transport focuses on prediction from data generated by some target action in contexts other than the target context, causal imputation generalizes this problem to also use data about different actions, including data from the target context itself.

The transport formulas generated by previous methods are derived from a selection diagram, which specifies a causal model of the system as well as differences between the source and target contexts. A major difference between this work on causal imputation and prior work on causal transportability is that we do not require knowledge about the structure of the causal graph, and instead only require that the true causal model be of a generic linear form. A further advantage of the SI framework considered in this work is that the key assumption (Assumption 3) that enables SI to generalize from a small collection of experiments to predicting on untested context-action pairs can be verified through a simple data-driven hypothesis test (see Section 4). As a side benefit, this approach requires far fewer samples than learning the structure of the causal graph.

Imputation of gene expression. The most relevant comparison to the task considered in Section 5 is prior work on predicting perturbation response across cell types. Dixit et al. (2016) used a regularized linear model, which makes predictions of the form $\hat{x}_{ca} = \beta_a + w_c \beta_c$, where $\hat{x}_{ca}$ denotes the predicted outcome vector associated with cell type $c$ under action $a$ and $w_c$ is a set of covariates for the cell type $c$. When no covariates are available except the categorical cell type parameter, this reduces to $\hat{x}_{ca} = \beta_a + \beta_c$, i.e., a fixed effect of the action on the cell type (or vice versa). This is a limited case of the factor model which we describe in Section 4, after centering to remove the cell-type-specific baseline $\beta_c$.

Lotfollahi et al. (2019) introduced the scGen method for causal imputation on single-cell gene expression data. This method first trains a variational autoencoder in order to find a low-dimensional representation of gene expression data. Then, for each drug $a$, the method computes a vector $\delta_a$ representing the shift induced by that drug in the latent space. Finally, to predict the effect of the drug on a new cell type, they first encode the unperturbed gene expression vector, add the shift $\delta_a$ in the latent space, and decode to obtain the estimate. They tested their method on control data and a single perturbation, stimulation by IFN-β, across seven different cell types, and showed that their method was able to accurately predict the expression level of both cell-type-specific IFN-β markers and cell-type-invariant IFN-β markers.

Our method can be used in conjunction with an autoencoder to generalize the scGen method: as in that method, we can apply the SI method proposed in this paper in the latent space, instead of using the more restrictive fixed estimator as done in that work. Promisingly, recent work using autoencoders to predict perturbation effects on SARS-CoV-2 infected cells empirically demonstrated that autoencoders align representations in a way that induces linearity amongst the perturbations and cell types in the learned space (Belyaeva et al., 2020). We leave this as an interesting future direction to pursue.

The "causal imputation" methods we consider start with genomic data from various perturbations and cell types, and impute genomic data for pairs of perturbations and cell types. Orthogonally, there exists a vast literature on predicting perturbation response from other data modalities, such as a perturbation's molecular structure (Stokes et al., 2020) or imaging data (Hofmarcher et al., 2019; Yang et al., 2018). There are many possible avenues for combining these approaches with our method, e.g., through coupled autoencoders (Yang and Uhler, 2019) to represent each perturbation as a combination of other perturbations and an encoding of its molecular structure.

Policy evaluation. The field of policy evaluation has a rich literature on estimating the effect of an action on a collection of units; this quantity is denoted as the "treatment effect". The usual quantity of interest in policy evaluation, known as the Average Treatment Effect (ATE), is the average effect of an intervention over all units; in our setting, this would translate to the average effect of a drug over all cell types. A randomized controlled trial (RCT) remains the de facto way to estimate the ATE. Over the past two decades, the synthetic control (SC) estimator (Abadie and Gardeazabal, 2003; Abadie et al., 2010) and variants thereof have gained in popularity; here the goal is to estimate what would have occurred to a particular unit under control (e.g., if no intervention was applied). The recent work of Agarwal et al. (2020) generalizes SC to estimate the counterfactual on an individual unit under all interventions (including control). They show the efficacy of SI in a collection of social settings, including the effect of mobility-restricting interventions on COVID-19 mortality rates, and the effect of financial and communication interventions on immunization rates in villages in Haryana, India.

We consider a collection of contexts (e.g., cell types), denoted by $C$, and actions (e.g., drugs), denoted by $A$. The quantity or measurements of interest associated with a given context $c \in C$ under action $a \in A$ is denoted by $x_{ca} \in \mathbb{R}^p$. Collectively, this forms an order-three tensor $X = [x_{ca} : c \in C, a \in A] \in \mathbb{R}^{|C| \times |A| \times p}$. Our primary interest is in finding or estimating $X$.

Towards this, we have access to limited observations corresponding to a small number of experiments that have already been conducted. Specifically, let $\Omega \subset C \times A$ denote the pairs of contexts and actions for which we observe the associated measurements, i.e., for each $(c, a) \in \Omega$, we observe $x_{ca}$.

We define some useful notation. Specifically, for any $a \in A$ and $c \in C$, let $A(c) = \{a' \in A : (c, a') \in \Omega\}$ and $C(a) = \{c' \in C : (c', a) \in \Omega\}$. This notation extends to sets of contexts $C' \subseteq C$ and sets of actions $A' \subseteq A$, i.e., $A(C') = \cap_{c \in C'} A(c)$ and $C(A') = \cap_{a \in A'} C(a)$.
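The index sets above are plain set operations on the observation pattern. A minimal sketch, assuming Ω is stored as a Python set of (context, action) pairs; the function names and the toy labels are illustrative and do not come from the paper:

```python
def actions_of(omega, contexts):
    """A(C'): actions observed in every context of `contexts`."""
    contexts = list(contexts)
    all_actions = {a for _, a in omega}
    return {a for a in all_actions if all((c, a) in omega for c in contexts)}


def contexts_of(omega, actions):
    """C(A'): contexts observed under every action of `actions`."""
    actions = list(actions)
    all_contexts = {c for c, _ in omega}
    return {c for c in all_contexts if all((c, a) in omega for a in actions)}


# Toy observation pattern: three cell types, three drugs.
omega = {
    ("c1", "a1"), ("c1", "a2"), ("c1", "a3"),
    ("c2", "a1"), ("c2", "a2"),
    ("c3", "a1"),
}

donor_actions = actions_of(omega, ["c2"])            # A(c2): a1 and a2
shared_contexts = contexts_of(omega, ["a1", "a2"])   # C({a1, a2}): c1 and c2
```

Passing a single-element list recovers the per-context set A(c) or the per-action set C(a).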

We utilize an adaptation of the SI estimator introduced in Agarwal et al. (2020) within the context of policy evaluation. We start by describing the estimator for our setup, and then argue that $X$ is guaranteed to satisfy a factor model under a linear structural causal model. Under the factor model, as argued in Agarwal et al. (2020), SI provably recovers the underlying $X$ of interest.

Consider any target context $c$ and target action $a$ such that $(c, a) \notin \Omega$. In short, our adapted SI estimator imputes the counterfactual $x_{ca}$ by appropriately weighting the observations $\{x_{cj} : j \in A(c)\}$. Towards learning the weights, let $C_{\text{train}} = C(A(c) \cup \{a\})$ denote the collection of contexts for which $\{x_{ij} : i \in C_{\text{train}}, j \in A(c) \cup \{a\}\}$ are observed. We call $A(c)$ the donor actions and $C_{\text{train}}$ the training contexts. We formally define the estimation strategy for imputing $x_{ca}$ below.
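A minimal sketch of the weighting-actions estimator follows. It assumes, consistently with the relation $\hat{\beta} = x_{\text{train}}^{\dagger} y_{\text{train}}$ recalled in the proof of Theorem 1, that $x_{\text{train}}$ stacks the donor-action measurements of the training contexts, $y_{\text{train}}$ stacks their measurements under the target action, and $x_{\text{test}}$ holds the donor-action measurements of the target context. The data layout (a dict keyed by (context, action)), the function name, and the `rcond` cutoff are assumptions made for illustration.

```python
import numpy as np


def si_impute_actions(data, c, a):
    """Impute x_ca by weighting the donor actions A(c).

    `data` maps observed (context, action) pairs to length-p vectors.
    """
    donors = sorted(j for (i, j) in data if i == c)            # A(c)
    needed = donors + [a]
    contexts = sorted({i for (i, _) in data})
    train = [i for i in contexts
             if all((i, j) in data for j in needed)]          # C_train
    if not donors or not train:
        raise ValueError("no donor actions or training contexts available")

    # Rows: one per (training context, coordinate); columns: donor actions.
    x_train = np.vstack([np.column_stack([data[(i, j)] for j in donors])
                         for i in train])
    y_train = np.concatenate([data[(i, a)] for i in train])
    x_test = np.column_stack([data[(c, j)] for j in donors])

    # Minimum-norm least squares; rcond drops numerically zero singular values.
    beta_hat = np.linalg.pinv(x_train, rcond=1e-10) @ y_train
    return x_test @ beta_hat
```

The weights are learned only from the training contexts and then applied to the target context, which is what lets the estimate transfer across contexts.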

We note that the standard SI estimator can also be applied towards weighting $\{x_{ia} : i \in C(a)\}$ instead, i.e., we learn a synthetic intervention using contexts rather than actions. To achieve this, we follow the estimation strategy laid out above with $A_{\text{train}} = A(C(a) \cup \{c\})$ and $C(a)$ replacing $C_{\text{train}}$ and $A(c)$, respectively. We similarly call $C(a)$ the donor contexts and $A_{\text{train}}$ the training actions. We visualize both variants of the estimator and the associated terminology in Appendix A.

These two estimators, weighting actions and weighting contexts, can be used in a complementary fashion. That is, say we apply the weighting-actions estimator first. Then, the imputed values can increase the set of donor actions, which should lead to better outcomes when applying the weighting-contexts estimator. An analogous argument can be made when the weighting-contexts estimator is applied first. This suggests a composite algorithm that iteratively applies each approach until the change in the imputed tensor falls below a desired threshold. However, one drawback of this approach is that the estimator is computationally more demanding. As such, formally motivating and analyzing such an algorithm is left to future work.

Handling noise and sparsity. In many experimental settings, we may have measurement error and/or missing entries. In this case, Agarwal et al. (2020) proposed first applying a matrix estimation (ME) map, $\text{ME} : \mathbb{R}^{n \times p} \to \mathbb{R}^{n \times p}$ with $n$ denoting the number of donor contexts or actions, to $x_{\text{train}}$ and $x_{\text{test}}$ in order to "de-noise" the observations and fill in missing values; popular ME methods include nuclear norm minimization and singular value thresholding, as analyzed by Agarwal et al. (2020).
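As an illustration of the de-noising step, the sketch below truncates a matrix to its top singular values, one simple hard-thresholding variant of singular value thresholding. The function name and the choice to pass the retained rank directly are assumptions; missing entries would need to be filled (for example with zeros or row means) before calling it, which is not shown.

```python
import numpy as np


def hard_svt(matrix, rank):
    """Keep the top `rank` singular values of `matrix` and zero out the rest."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s = s.copy()
    s[rank:] = 0.0
    return (u * s) @ vt


# Usage on a noisy rank-2 matrix.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 30))
noisy = low_rank + 0.1 * rng.normal(size=low_rank.shape)
denoised = hard_svt(noisy, rank=2)
```

In practice the rank (or an equivalent singular value cutoff) has to be chosen from the data, for instance by inspecting the singular value spectrum.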

Linear structural causal model. Here, we explain the success of the SI estimator on causal imputation via a connection to causal structural models, which have been frequently used as models of genomic networks (Friedman et al., 2000; Badsha et al., 2019) . We begin by defining the particular structure we impose on the causal structural models we study.

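Definition 1. A linear structural equation model over the random vector $x \in \mathbb{R}^p$ is defined by the set of equations

$$x = A x + \varepsilon,$$

where $A \in \mathbb{R}^{p \times p}$ is strictly lower triangular under a topological ordering of the variables and $\varepsilon \in \mathbb{R}^p$ collects the exogenous terms.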

Let $x_{ca}$ denote a random vector generated from context $c$ under action $a$. We assume that $x_{ca}$ is generated from a linear structural equation model with latent variables $v_a \in \mathbb{R}^r$, whose distribution depends on the action $a$, but where the rest of the model depends only on the context $c$; i.e.,

$$x_{ca} = A_c x_{ca} + B_c v_a, \tag{1}$$

with $A_c \in \mathbb{R}^{p \times p}$ and $B_c \in \mathbb{R}^{p \times r}$. Note that Definition 1 implies that $A_c$ only has non-zero entries below the diagonal. This model is pictured in Figure 1. In the non-parametric setting considered by Bareinboim and Pearl (2016), the outcome $x_{ca}$ is not identifiable for such a graphical structure, even given access to the true selection diagram. We show in the rest of this section that $x_{ca}$ is identifiable in the linear setting, without access to the true selection diagram.

Proposition 1. Under (1), for any $a \in A$ and $c \in C$, there exists $M_c \in \mathbb{R}^{p \times r}$ such that

$$x_{ca} = M_c v_a.$$

Proof. Since $A_c$ only has non-zero entries below the diagonal, $I - A_c$ is lower triangular with unit diagonal and is therefore invertible. As a result, we may re-write $x_{ca}$ as

$$x_{ca} = (I - A_c)^{-1} B_c v_a.$$

Defining $M_c$ as $(I - A_c)^{-1} B_c$ completes the proof.

Definition 2. We say that $X$ satisfies a factor model if for any $a \in A$ and $c \in C$,

$$x_{ca} = U_c v_a,$$

where $U_c \in \mathbb{R}^{p \times r}$ is a matrix of latent measurement factors associated with context $c$, and $v_a \in \mathbb{R}^r$ is a latent factor associated with action $a$.

From Proposition 1, it follows that a linear structural model implies that X satisfies a factor model. 
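The implication can be checked numerically. The sketch below assumes the structural equation has the form $x = A_c x + B_c v_a$ with $A_c$ strictly lower triangular, draws random parameters (the dimensions and seed are arbitrary), solves the equations one coordinate at a time, and compares the result with the closed form $M_c v_a$ from Proposition 1.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 6, 2

# Context-specific parameters: A_c strictly lower triangular, B_c arbitrary.
A_c = np.tril(rng.normal(size=(p, p)), k=-1)
B_c = rng.normal(size=(p, r))
v_a = rng.normal(size=r)   # action-specific latent vector

# Solve x = A_c x + B_c v_a sequentially, following the variable ordering.
x = np.zeros(p)
for i in range(p):
    x[i] = A_c[i, :i] @ x[:i] + B_c[i] @ v_a

# Closed form: x = M_c v_a with M_c = (I - A_c)^{-1} B_c.
M_c = np.linalg.solve(np.eye(p) - A_c, B_c)
matches = np.allclose(x, M_c @ v_a)
```

Because $M_c$ depends only on the context and $v_a$ only on the action, this is exactly the factorization required by Definition 2 with $U_c = M_c$.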

We begin by stating some of our key modeling assumptions that will enable SI to consistently impute X . To reduce redundancy, the following discussion will be restricted to the algorithm described in Section 4.1, where linear weights across the actions are learnt. Analogous statements apply to the case where linear weights across the contexts are learnt.

Assumption 1. Let X satisfy the factor model as defined in Definition 2.

As stated above, such a factor model is implied by a linear structural causal model.

Assumption 2. Given a target context $c$ and target action $a$, there exists $\beta \in \mathbb{R}^{|A(c)|}$ such that $v_a = \sum_{j \in A(c)} \beta_j v_j$.

This is a mild assumption. Such a $\beta$ exists exactly when $v_a$ lies in the linear span of $\{v_j : j \in A(c)\}$.

Recall that $v_a$ and $v_j$ are in $\mathbb{R}^r$. If $r \ll |A(c)|$, then by the definition of rank, it is easy to see that the 'pathological' case where $v_a$ is linearly independent of $[v_j]_{j \in A(c)}$ is unlikely to hold; that is, in the worst case, this undesirable event occurs for at most $r$ actions out of all possible actions in $A$, which is a small fraction if $r \ll |A(c)| \le |A|$.

Assumption 3. Given a target context $c$ and target action $a$, let $\text{rowspan}(x_{\text{test}}) \subseteq \text{rowspan}(x_{\text{train}})$, where $x_{\text{train}}$ and $x_{\text{test}}$ are defined in Section 4.1. Here, the row span of a matrix $M \in \mathbb{R}^{m \times n}$ is the linear subspace of $\mathbb{R}^n$ spanned by the row vectors of $M$.

Intuitively, this assumption requires that the target context is no more "complex" than the training contexts in a linear algebraic sense; this is the key assumption that enables SI to generalize from a small set of experiments to making predictions on untested context-action pairs. In Section 4.4, we show how one can test whether this assumption holds in a data-driven manner.

Theorem 1. Suppose Assumption 1 holds. Further, for all $(c, a) \notin \Omega$, suppose Assumptions 2 and 3 hold. Then $\hat{x}_{ca} = x_{ca}$, and hence $X$ is recovered exactly.

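Proof. Throughout this proof, we follow the notation set in Section 4.1. To begin, consider any $(c, a) \notin \Omega$. Then by Assumptions 1 and 2, for all $i \in C$,

$$x_{ia} = U_i v_a = \sum_{j \in A(c)} \beta_j U_i v_j = \sum_{j \in A(c)} \beta_j x_{ij}. \tag{2}$$

Since (2) holds for all $i \in C$, it necessarily holds for the target context $c$ and those contexts $i \in C_{\text{train}} = C(A(c) \cup \{a\})$.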

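Now, let $v_{\text{train}} \in \mathbb{R}^{|A(c)| \times r_1}$ and $v_{\text{test}} \in \mathbb{R}^{|A(c)| \times r_2}$ denote the right singular vectors of $x_{\text{train}}$ and $x_{\text{test}}$, respectively, with $r_1 = \text{rank}(x_{\text{train}})$ and $r_2 = \text{rank}(x_{\text{test}})$. Recall that $\hat{\beta} = x_{\text{train}}^{\dagger} y_{\text{train}}$. Since $\hat{\beta} \in \text{rowspan}(x_{\text{train}}) = \text{span}(v_{\text{train}})$ by design, it follows from (2), which gives $y_{\text{train}} = x_{\text{train}} \beta$, that

$$\hat{\beta} = x_{\text{train}}^{\dagger} x_{\text{train}} \beta = v_{\text{train}} v_{\text{train}}^{\top} \beta.$$

By Assumption 3, $\text{span}(v_{\text{test}}) \subseteq \text{span}(v_{\text{train}})$, so that $x_{\text{test}} v_{\text{train}} v_{\text{train}}^{\top} = x_{\text{test}}$. Combining these arguments yields

$$\hat{x}_{ca} = x_{\text{test}} \hat{\beta} = x_{\text{test}} v_{\text{train}} v_{\text{train}}^{\top} \beta = x_{\text{test}} \beta = \sum_{j \in A(c)} \beta_j x_{cj} = x_{ca}.$$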

This completes the proof.

Below, we provide some robustness checks of when to use the SI estimator in practice. As before, we focus on the SI estimator that learns across actions, but analogous statements hold when SI learns across contexts.

Subspace inclusion hypothesis test. Recall that Assumption 3 enables SI to recover $X$ from our sparse subset of observations $\Omega$ (Theorem 1). As such, Agarwal et al. (2020) proposed a data-driven hypothesis test to check when this condition is satisfied in practice. In effect, their test statistic $\hat{\tau}$ measures the gap between the row spaces of $x_{\text{train}}$ and $x_{\text{test}}$, i.e.,

$$\hat{\tau} = \left\| (I - v_{\text{train}} v_{\text{train}}^{\top}) v_{\text{test}} \right\|_F^2,$$

where $v_{\text{train}}$ and $v_{\text{test}}$ are the right singular vectors of $x_{\text{train}}$ and $x_{\text{test}}$, respectively. Agarwal et al. (2020) derived a critical value $\tau_\alpha$, which depends on the parameters of the underlying noise distribution, such that under the null hypothesis $H_0$ where Assumption 3 holds,

$$\mathbb{P}(\hat{\tau} > \tau_\alpha \mid H_0) \le \alpha.$$

Instead of estimating the parameters of the underlying noise distribution, we follow an intuitive heuristic based on the observation that $\hat{\tau} \le \text{rank}(v_{\text{test}})$, since the columns of $v_{\text{test}}$ are orthonormal. The statistic $\hat{\tau}$ can then be interpreted as the spectral energy of $v_{\text{test}}$ that does not belong within $\text{span}(v_{\text{train}})$; thus, we fix some fraction $\rho \in [0, 1]$ and reject $H_0$ if $\hat{\tau} \ge \rho \cdot \text{rank}(v_{\text{test}})$, i.e., if more than a $\rho$ fraction of the spectral energy in $v_{\text{test}}$ lies outside of $\text{span}(v_{\text{train}})$.
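The heuristic can be sketched in a few lines. The code assumes the statistic is the squared Frobenius norm of the part of $v_{\text{test}}$ lying outside $\text{span}(v_{\text{train}})$, which matches the spectral-energy interpretation above; the helper names, the numerical rank tolerance, and the default $\rho = 0.1$ are illustrative assumptions. With noisy data the rank would have to be chosen by thresholding the spectrum more carefully.

```python
import numpy as np


def right_singular_vectors(matrix, tol=1e-10):
    """Orthonormal basis (as columns) of the row space of `matrix`."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    rank = int(np.sum(s > tol * s[0]))
    return vt[:rank].T


def subspace_inclusion_test(x_train, x_test, rho=0.1):
    """Return (tau_hat, reject); reject H0 when tau_hat >= rho * rank(v_test)."""
    v_train = right_singular_vectors(x_train)
    v_test = right_singular_vectors(x_test)
    # Remove from v_test its projection onto span(v_train).
    residual = v_test - v_train @ (v_train.T @ v_test)
    tau_hat = np.linalg.norm(residual, "fro") ** 2
    return tau_hat, bool(tau_hat >= rho * v_test.shape[1])
```

A statistic near zero indicates that the rows of `x_test` are (numerically) contained in the row space of `x_train`, while a value close to the rank of `x_test` indicates almost no overlap.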

Rejection of the test implies that the SI estimator might not perform well on our prediction task, since Assumption 3 is unlikely to hold. In practice, this allows us to switch to a simpler baseline estimator, as we show in Section 5.

Arbitrary sets of donor actions. In the estimation strategy that we have described, we have assumed that our estimator will be a linear combination of all actions $A(c)$ which are available for context $c$. In turn, this restricts the set of training contexts to only those contexts for which each of these actions, along with the target action, is measured, i.e., $C_{\text{train}} = C(A(c) \cup \{a\})$. However, expressing the estimate as a linear combination of all actions in $A(c)$ may be suboptimal if it leads to a significantly smaller set $C_{\text{train}}$. For example, say $|A(c)| = 11$, and there is only one training context containing all of these actions, i.e., $|C(A(c) \cup \{a\})| = 1$. Suppose there is a subset $A' \subset A(c)$ with $|A'| = 10$ and $|C(A' \cup \{a\})| = 100$, i.e., by excluding a single action from the linear combination, we are able to train on 100 times as many contexts. Thus, we may consider the SI estimator induced by a specific donor set $A_{\text{donor}} \subseteq A(c)$, with the corresponding set of training contexts being $C_{\text{train}} = C(A_{\text{donor}} \cup \{a\})$. The choice of donor set introduces a tradeoff: as the number of donor actions increases, the number of training contexts might decrease. This raises the question of how to pick an optimal set of donor actions for a given prediction.

Picking donor actions. The subspace inclusion hypothesis test suggests an elegant method for picking a set of donor actions: for a fixed significance level, find a set passing the hypothesis test which maximizes the number of training contexts. Unfortunately, this induces a combinatorial optimization problem which may be difficult to solve when there are many actions. We consider two computationally efficient alternatives. First, we may greedily pick actions, according to whichever action least reduces the number of training contexts, until the set passes the hypothesis test. Second, we may always use $A(c)$ as the donor set, then use the subspace inclusion hypothesis test described above to decide on whether to use the SI estimate.
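The first, greedy alternative might look as follows. This is a sketch under stated assumptions: the observation pattern is a set of (context, action) pairs, the hypothesis test is abstracted as a caller-supplied predicate `passes_test` (hypothetical), and ties are broken by sorted label order.

```python
def greedy_donor_actions(omega, c, a, passes_test):
    """Grow a donor set for the target pair (c, a) greedily.

    `omega` is a set of observed (context, action) pairs and `passes_test`
    is a predicate on (donor_list, training_context_set).
    Returns (donors, training_contexts), or None if no set passes.
    """
    candidates = {j for (i, j) in omega if i == c}    # A(c)
    contexts = {i for (i, _) in omega}

    def training_contexts(donors):
        return {i for i in contexts
                if all((i, j) in omega for j in list(donors) + [a])}

    donors = []
    while candidates:
        # Add the action that keeps the largest number of training contexts.
        best = max(sorted(candidates),
                   key=lambda j: len(training_contexts(donors + [j])))
        donors.append(best)
        candidates.remove(best)
        train = training_contexts(donors)
        if not train:
            return None   # adding further actions can only shrink the set
        if passes_test(donors, train):
            return donors, train
    return None
```

In a full pipeline, `passes_test` would build $x_{\text{train}}$ and $x_{\text{test}}$ from the current donor set and apply the subspace inclusion test at the chosen threshold.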

In this section, we describe the results of applying the SI estimator to the task of causal imputation on the CMAP dataset.

Subramanian et al. (2017) developed the L1000 assay platform, which allows for cost-effective measurement of gene expression levels. In particular, this assay measures the levels of 978 "landmark" genes, which were picked using a data-driven approach based on their ability to recover information about the rest of the transcriptome. This cost reduction enabled the authors to measure L1000 profiles from over 1,000,000 different samples, covering 71 different cell types and over 40,000 different chemical perturbations. We focus on the subset of chemical perturbations in the dataset, and randomly sample 100 of these perturbations, along with the "control" perturbation, DMSO, in order to investigate a smaller, unbiased version of the dataset. A detailed evaluation of the dataset and a description of our preprocessing pipeline are given in Appendix B. The dataset can be accessed at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92742.

Figure 2 shows an embedding of 11,185 gene expression vectors via UMAP (McInnes et al., 2018), colored by cell type. It is clear from this plot that most of the variation in the data is due to the cell type rather than the perturbation applied to the cell. This is further supported by Fig. 8 in Appendix C, where we can see that most of the perturbations fall within the normal variation of a single cell type; the additional variation due to different perturbations is minor. This suggests a natural mean-over-actions baseline estimator,

$$\hat{x}_{ca}^{\text{avg-a}} = \frac{1}{|A(c)|} \sum_{j \in A(c)} x_{cj}.$$

Indeed, it is well known that the effect of chemical perturbations is much smaller than the effect of cell type, so that the simple mean-over-actions estimator already performs well, achieving a median $R^2$ score of 0.83. This leaves little room for improvement by incorporating information related to the target intervention.

We consider three other natural baselines. The mean-over-contexts estimator is defined as

$$\hat{x}_{ca}^{\text{avg-c}} = \frac{1}{|C(a)|} \sum_{i \in C(a)} x_{ia}.$$

That is, we average all contexts which receive the target action. We may combine these estimators into the parametric two-way mean estimator with parameter $\lambda_c \in [0, 1]$,

$$\hat{x}_{ca}^{\text{two-way}} = \lambda_c \hat{x}_{ca}^{\text{avg-c}} + (1 - \lambda_c) \hat{x}_{ca}^{\text{avg-a}},$$

which is simply a convex combination of the mean-over-actions and mean-over-contexts estimators.

Finally, we use an estimator which assumes that each perturbation applies a fixed, additive shift to the gene expression, as was used in conjunction with autoencoders in Lotfollahi et al. (2019) and Belyaeva et al. (2020). Formally, the fixed action effect estimator, relative to action $a'$, is defined as

$$\hat{x}_{ca}^{\text{fixed}} = x_{ca'} + \frac{1}{|C(\{a, a'\})|} \sum_{i \in C(\{a, a'\})} (x_{ia} - x_{ia'}).$$

In particular, a natural choice for $a'$ is "control", i.e., no action.
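The four baselines can be sketched compactly. The code assumes a dict `data` mapping observed (context, action) pairs to NumPy vectors, that each average is taken over a non-empty set, that `lam` weights the mean-over-contexts term, and that the fixed-effect shift is averaged over the other contexts in which both the target action and the reference action `ref` are measured. All of these choices, and the function names, are illustrative assumptions.

```python
import numpy as np


def mean_over_actions(data, c, a):
    """Average the target context c over its observed actions (excluding a)."""
    return np.mean([x for (i, j), x in data.items() if i == c and j != a], axis=0)


def mean_over_contexts(data, c, a):
    """Average the target action a over the other contexts that received it."""
    return np.mean([x for (i, j), x in data.items() if j == a and i != c], axis=0)


def two_way_mean(data, c, a, lam=0.5):
    """Convex combination of the two averages with weight lam in [0, 1]."""
    return (lam * mean_over_contexts(data, c, a)
            + (1 - lam) * mean_over_actions(data, c, a))


def fixed_action_effect(data, c, a, ref):
    """x_{c,ref} plus the average shift from `ref` to `a` in other contexts."""
    shifts = [data[(i, a)] - data[(i, ref)]
              for (i, j) in data if j == a and i != c and (i, ref) in data]
    return data[(c, ref)] + np.mean(shifts, axis=0)
```

Setting `lam=0` or `lam=1` in `two_way_mean` recovers the mean-over-actions and mean-over-contexts estimators, respectively.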

For each algorithm, we measure performance by the leave-one-out (LOO) $R^2$ score. In particular, for each cell type and perturbation pair $(c, a) \in \Omega$ that is measured, we remove the true gene expression vector $x_{ca}$ from the dataset, and use the remainder of the dataset to construct an estimate $\hat{x}_{ca}$. Figure 3a reports the $R^2$ scores of each estimator, including SI and the baselines described above.
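The evaluation loop can be sketched as below. It assumes the $R^2$ score is the usual coefficient of determination computed across the coordinates (genes) of a single held-out vector, which the text does not spell out, and that an estimator is any function taking `(data, c, a)`; the names are illustrative.

```python
import numpy as np


def r2_score(truth, estimate):
    """Coefficient of determination of `estimate` for the vector `truth`."""
    ss_res = np.sum((truth - estimate) ** 2)
    ss_tot = np.sum((truth - np.mean(truth)) ** 2)
    return 1.0 - ss_res / ss_tot


def leave_one_out_scores(data, estimator):
    """Hold out each observed pair in turn and score its reconstruction."""
    scores = {}
    for (c, a), truth in data.items():
        held_out = {k: v for k, v in data.items() if k != (c, a)}
        scores[(c, a)] = r2_score(truth, estimator(held_out, c, a))
    return scores
```

A summary such as `np.median(list(scores.values()))` then gives one number per estimator, comparable to the medians quoted in this section.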

The results of each estimator provide insight into the relationship between cell types and perturbations. As expected from Figure 2, the mean-over-actions estimator is a strong baseline, with a median $R^2$ score of 0.83. In contrast, the mean-over-contexts estimator performs quite poorly. Given Figure 2, this is to be expected: the substantial variation between different cell types suggests that each cell type is not representable as a linear combination of the others. The fixed effect estimator, using $a' = \text{DMSO}$, performs similarly to the mean-over-actions estimator, indicating that its prediction quality is dominated by the cell-type-specific baseline $x_{ca'}$, i.e., the shift vector for the perturbation is small.

Finally, the SI-action method performs best, with a median $R^2$ of 0.89, substantially closing the gap between the median $R^2$ of the mean-over-actions baseline and perfect prediction. The SI-action method takes a weighted linear combination of the measurements for a given cell type under different perturbations, in contrast to the mean-over-actions baseline, which simply uses equal weights across these measurements. It is noteworthy that these weights are learnt from measurements across perturbations for different cell types, and yet they are still effective when transported to other cell types. This gives credence to the factor model laid out in Section 4.2 holding in our setting.

In contrast, the SI-context method performs nearly as poorly as the mean-over-contexts baseline, indicating again that the difference between cell types is too great to effectively express one cell type as a linear combination of others.

To further elucidate the difference between the within-cell and within-action estimators, we plot in Figure 3b the frequencies of the test statistic $\hat{\tau}$ described in Section 4, where a higher value of the statistic indicates that the subspace inclusion hypothesis is less likely to hold and hence the results of our method are less likely to be meaningful. Indeed, we can see that the test statistic tends to be lower in the case of regressing in the action dimension; this is in line with what is implied by the factor model we posit in Definition 2. In particular, there is an invariant factor for a given action across measurements and cell types.

A few remarks are in order. Unlike traditional applications of the SI method, we do not need to de-noise the data to get the best empirical results. Specifically, the results in Figure 3a do not use any data pre-processing or de-noising via a matrix estimation procedure, as suggested in Agarwal et al. (2020), to remove high amounts of noise that might be present. We believe this is not needed in our setting because the data is implicitly de-noised: each measurement is obtained by averaging multiple observations for a given measured cell type and perturbation pair $(c, a) \in \Omega$. Indeed, as seen in Appendix D, matrix-estimation-based de-noising does improve the results if we restrict ourselves to a single sample per observed $(c, a) \in \Omega$.

In this paper, we adapted the SI estimator of Agarwal et al. (2020) for use on a task we call causal imputation: predicting the effect of an action across different contexts, without access to the ground truth structural causal model describing the system. We demonstrated the superior performance of the SI estimator to other baselines on the task of causal imputation in the CMAP dataset, an important source of information for predicting the perturbation effect of different drugs. Finally, we explained the success of the SI estimator by describing a linear structural equation model on which the method is consistent.

Several important directions are left open for future work, of which we cover only a few. First, the tradeoff between the number of donor actions and the number of training contexts raises the need for a principled method for picking "optimal" donor sets. One promising approach may be to frame this choice as a combinatorial optimization problem, where the objective function may be submodular under some assumptions on the problem structure. A related question is whether we can apply SI in a sequential manner to infer which samples are most informative to reduce sample complexity in an experimental design and/or active learning framework.

Another important direction for future work is nonlinear methods for the causal imputation problem. Genomic data is known to exhibit highly nonlinear relationships, so that our model in Section 4.2 is only a coarse approximation. A straightforward nonlinear extension of our method would be to perform SI in a latent space learned by an autoencoder. Two concepts from this paper are likely to be useful in the development and analysis of nonlinear methods. First, we demonstrated that it is beneficial to develop representations for each action which are invariant to the context in which they occur, allowing for the effect of the action to be transported between contexts. Second, our mechanistic explanation in Section 4.2 for the success of the SI method may serve as a starting point for explaining the success of nonlinear methods.

We visualize the two directions of the SI method, SI-action and SI-context, in Figures 4 and 5, respectively.

In Figure 6a, we display the availability of gene expression profiles for each cell type/chemical perturbation pair. The cell types are sorted from left to right by the number of perturbations for which gene expression profiles are available. The perturbations are sorted from bottom to top by the number of cell types for which gene expression profiles are available.
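The availability matrix described above can be sketched as follows. This is a minimal illustration, not the code used to produce Figure 6a: the function names, the input format (a list of observed cell type/perturbation pairs), the toy labels `cell_B` and `drug_A`, and the choice of descending sort order are all assumptions, since the text does not state the sort direction.

```python
import numpy as np

def availability_matrix(pairs, cell_types, perturbations):
    """Binary matrix A with A[p, c] = 1 if a profile exists for
    perturbation p applied to cell type c (hypothetical helper)."""
    c_idx = {c: j for j, c in enumerate(cell_types)}
    p_idx = {p: i for i, p in enumerate(perturbations)}
    A = np.zeros((len(perturbations), len(cell_types)), dtype=int)
    for cell_type, perturbation in pairs:
        A[p_idx[perturbation], c_idx[cell_type]] = 1
    return A

def sort_by_availability(A):
    """Reorder rows and columns by their counts.
    Descending order is an assumption made for illustration."""
    col_order = np.argsort(-A.sum(axis=0), kind="stable")
    row_order = np.argsort(-A.sum(axis=1), kind="stable")
    return A[row_order][:, col_order], row_order, col_order

# Toy example with hypothetical labels.
pairs = [("VCAP", "DMSO"), ("VCAP", "drug_A"), ("cell_B", "DMSO")]
A = availability_matrix(pairs, ["cell_B", "VCAP"], ["DMSO", "drug_A"])
A_sorted, row_order, col_order = sort_by_availability(A)
```

Column sums of `A` give the per-cell-type counts and row sums give the per-perturbation counts.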

In Figure 7a, we display the total number of perturbations for which gene expression profiles are available, for each cell type. The cell type with the most perturbations available is VCAP, with 15,805 perturbations available out of 20,369.

In Figure 7b, we display the total number of cell types for which gene expression profiles are available, for each perturbation. The perturbation with the most cell types available is DMSO (control), which is available for 70 of 71 cell types, followed by BRD-A19037878, which is available for 64 out of 71 cell types.

Data Selection. We use the Level 2 data from the L1000 dataset, which contains unnormalized gene expression values. The Level 2 data is split into two sets, "delta" and "epsilon", containing 49,216 and 1,278,882 samples, respectively. They differ in which landmark genes are used; we only use the larger "epsilon" dataset for consistency of our results.

We select 100 perturbations at random to run all of our analyses over, in order to create a smaller but unbiased dataset. The plots for this subsampled dataset, corresponding to those shown for the whole dataset, appear in Figures 6b, 7c, and 7d; they qualitatively verify that the subsampled dataset is similar in character to the original.
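A uniform random subsample of perturbations could be drawn along the following lines. This is only a sketch under stated assumptions: the function name, the per-sample label array, and the seed are hypothetical, and the text does not say how the random selection was actually implemented.

```python
import numpy as np

def subsample_perturbations(labels, n_keep=100, seed=0):
    """Pick n_keep distinct perturbations uniformly at random and
    return them with a boolean mask over the samples (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    unique = np.unique(labels)
    # Sample without replacement so that each perturbation is kept at most once.
    kept = rng.choice(unique, size=min(n_keep, len(unique)), replace=False)
    mask = np.isin(labels, kept)
    return kept, mask

# Toy example: 10 hypothetical perturbations with 5 samples each, keep 3.
labels = [f"drug_{i % 10}" for i in range(50)]
kept, mask = subsample_perturbations(labels, n_keep=3)
```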

In Figure 8, we show the UMAP embedding of gene expression data from 70 different perturbations in the VCAP cell line, which we picked since it has the greatest number of samples for any cell type in the dataset. Comparing to Figure 2, which included 70 different cell types, we see that gene expression vectors (even within a single cell type) cluster far less by perturbation than they do by cell type. Moreover, most perturbations do not substantially differ from the control perturbation (DMSO). This suggests that estimating the cell-type-specific perturbation effect will be a difficult task.

D Results on single samples

Figure 9 shows the results of using SI on unaveraged data. In particular, for each cell type and perturbation, we select a single corresponding sample at random.
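Selecting one sample per cell type/perturbation pair might look like the sketch below. The function name, the parallel label lists, and the seed are illustrative assumptions rather than the procedure actually used for Figure 9.

```python
import numpy as np

def pick_one_sample_per_pair(cell_types, perturbations, seed=0):
    """Map each (cell type, perturbation) pair to the index of one of its
    samples, chosen uniformly at random (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    groups = {}
    for idx, key in enumerate(zip(cell_types, perturbations)):
        groups.setdefault(key, []).append(idx)
    return {key: int(rng.choice(idxs)) for key, idxs in groups.items()}

# Toy example with hypothetical labels: two pairs, three samples.
cells = ["VCAP", "VCAP", "cell_B"]
perts = ["DMSO", "DMSO", "DMSO"]
chosen = pick_one_sample_per_pair(cells, perts)
```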

As described in Figure 4, the SI estimator may be used in conjunction with some matrix estimation procedure for denoising before regression. We use the hard singular value thresholding (HSVT) estimator at energy level $\rho = 0.95$. That is, given the singular value decomposition $X = U \Sigma V^\top$, with singular values in decreasing order, we find the first $k$ such that $\sum_{i=1}^{k} \Sigma_{ii}^2 \geq \rho \|X\|_F^2$, and use $\mathrm{HSVT}(X, k) = \sum_{i=1}^{k} \Sigma_{ii} U_i V_i^\top$. We see that, as predicted by theory, matrix estimation improves the predictions on unaveraged data (SI-action vs. SI-action-HSVT). However, the results still do not match the performance of the simple mean-over-actions baseline.
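The HSVT step described above can be sketched in NumPy as follows. This is a minimal illustration of the stated rule (keep the smallest number of leading singular values whose squared sum reaches a fraction ρ of the squared Frobenius norm), not the authors' implementation; the function name and return convention are assumptions.

```python
import numpy as np

def hsvt(X, rho=0.95):
    """Hard singular value thresholding at energy level rho.
    Returns the rank-k approximation of X and the chosen k (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s is sorted in decreasing order
    energy = np.cumsum(s ** 2)                        # energy[-1] equals ||X||_F^2
    # First k with cumulative energy >= rho * ||X||_F^2 (k is 1-indexed).
    k = int(np.searchsorted(energy, rho * energy[-1])) + 1
    k = min(k, len(s))
    X_hat = (U[:, :k] * s[:k]) @ Vt[:k]
    return X_hat, k

# Toy example: squared singular values 9, 1, 0.01, so two components cover 95% of the energy.
X = np.diag([3.0, 1.0, 0.1])
X_hat, k = hsvt(X, rho=0.95)
```

In the toy example the first component alone carries 9 / 10.01 of the energy, which is below 0.95, so the second component is also retained and the third is discarded.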

Thus, prior to prediction, we perform the subspace inclusion hypothesis test described in Section 4. In particular, if $\tau \geq 0.1 \cdot \operatorname{rank}(x_{\text{test}})$, then we reject the hypothesis, concluding that the SI method is unlikely to work. If the test passes, we use the SI-HSVT predictor; if it fails, we instead use the mean-across-actions predictor as a strong "fallback" option.
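The decision rule above amounts to a simple switch between two predictors, sketched below. The test statistic τ is taken as an input here, since its computation is defined in Section 4; the function name and argument names are hypothetical.

```python
def predict_with_fallback(tau, rank_test, si_hsvt_prediction, mean_prediction,
                          threshold=0.1):
    """Return the SI-HSVT prediction if the subspace inclusion test passes,
    otherwise the mean-across-actions prediction (hypothetical helper).
    tau is assumed to be computed beforehand as described in Section 4."""
    rejected = tau >= threshold * rank_test
    prediction = mean_prediction if rejected else si_hsvt_prediction
    return prediction, rejected

# Toy example: with rank 4 the rejection threshold is 0.4.
pred_a, rejected_a = predict_with_fallback(0.5, 4, "si", "mean")
pred_b, rejected_b = predict_with_fallback(0.3, 4, "si", "mean")
```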

We see that adding this hypothesis test (SI-action-HSVT, +test) returns us to the performance level of the mean-across-actions baseline.

Figure 9: Performance of causal imputation algorithms on recovering perturbation effects in the CMAP dataset, using unaveraged data.

The economic costs of conflict: A case study of the Basque country

Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program

Learning causal biological networks with the principle of Mendelian randomization

Transportability from multiple environments with limited experiments: Completeness results

Causal inference and the data-fusion problem

Causal network models of SARS-CoV-2 expression and aging to identify candidates for drug repurposing

Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens

Using Bayesian networks to analyze expression data

Accurate prediction of biological assays with high-throughput microscopy images and convolutional networks

General transportability: Synthesizing observations and experiments from heterogeneous domains

scGen predicts single-cell perturbation responses

UMAP: Uniform manifold approximation and projection for dimension reduction

Permutation-based causal structure learning with unknown intervention targets

A deep learning approach to antibiotic discovery

A next generation connectivity map: L1000 platform and the first 1,000,000 profiles

Permutation-based causal inference algorithms with interventions

Multi-domain translation by learning uncoupled autoencoders

Characterizing and learning equivalence classes of causal DAGs under interventions

Chandler Squires was partially supported by an NSF Graduate Fellowship, MIT J-Clinic for Machine Learning and Health, and IBM. Caroline Uhler was partially supported by NSF (DMS-1651995), ONR (N00014-17-1-2147 and N00014-18-1-2765), and a Simons Investigator Award. Dennis Shen was partially supported by a Draper Fellowship. Anish Agarwal was partially supported by an MIT IDSS Thomson Reuters Fellowship.