Synthetic Interventions
Anish Agarwal, Devavrat Shah, Dennis Shen
June 13, 2020

Consider a setting where there are $N$ heterogeneous units (e.g., individuals, sub-populations) and $D$ interventions (e.g., socio-economic policies). Our goal is to learn the potential outcome associated with every intervention on every unit (i.e., $N \times D$ causal parameters). Towards this, we present a causal framework, synthetic interventions (SI), to infer these $N \times D$ causal parameters while only observing each of the $N$ units under at most two interventions, independent of $D$. This can be significant as the number of interventions, i.e., the level of personalization, grows. Importantly, our estimator also allows for latent confounders that determine how interventions are assigned. Theoretically, under a novel tensor factor model across units, measurements, and interventions, we formally establish an identification result for each of these $N \times D$ causal parameters, and establish finite-sample consistency and asymptotic normality of our estimator. The estimator is furnished with a data-driven test to verify its suitability. Empirically, we validate our framework through both experimental and observational case studies; namely, a large-scale A/B test performed on an e-commerce platform, and an evaluation of the effect of mobility restrictions on morbidity outcomes due to COVID-19. We believe this has important implications for program evaluation and the design of data-efficient RCTs with heterogeneous units and multiple interventions.

There is a growing interest in personalized decision-making, where the goal is to select an optimal intervention for each unit from a collection of interventions. For example, in e-commerce, a common goal of an online platform is to determine which discount level is best suited (e.g., to increase engagement levels) for each customer sub-population. Such customer sub-populations are usually created based on customer demographics and their prior interactions with the platform, and the clustering is done to ensure that customers within a sub-population behave similarly, while allowing for significant heterogeneity across sub-populations. At its core, estimating the best "personalized" intervention requires evaluating the impact of each of the $D \ge 1$ discount levels on each of the $N \ge 1$ customer sub-populations. One way to estimate the best personalized intervention is to conduct $N \times D$ randomized experiments, one for each combination. In particular, for each sub-population and discount level, some number of randomly chosen customers from that sub-population are assigned that discount level and their corresponding outcomes are measured. Conducting these $N \times D$ personalized experiments would allow one to estimate all $N \times D$ causal parameters. However, this can become very expensive, if not infeasible, as $D$ and $N$ grow. This motivates the need for "data-efficient" experiment designs in which the $N \times D$ causal parameters can be inferred from far fewer experiments. Similar questions arise in program evaluation, where one may want to design a governmental policy that is particularly suited to the socio-economic realities of a geographic location. For example, in the context of COVID-19, policy-makers from $N$ countries might be interested in evaluating which of $D$ mobility-restricting policies is best suited for their country.
There is an additional challenge in program evaluation compared to the experimental setting as only observational data is readily available. In particular, each country can only undergo at most one of the D policies. Further, the policy implemented in any given country will likely be confounded with the characteristics of that country (e.g., the government structure, population density and demographics, cultural leanings), which itself may impact the outcomes observed under the policy. Further, these confounding variables might not be fully known or observed. In summary, our interest is in evaluating the N × D causal parameters associated with N units and D interventions based on potentially confounded, limited observations that do not scale with N × D. Special instances of this question have a rich literature within econometrics and beyond. In particular, the sub-problem of estimating what would have happened to a "treated" unit (i.e., undergoes an intervention) under "control" (i.e., absence on an intervention) has been studied through the prominent frameworks of differences-in-differences (DID) Ashenfelter and Card (1984) , Bertrand et al. (2004) , Angrist and Pischke (2009) and synthetic controls (SC) Abadie and Gardeazabal (2003) , Abadie et al. (2010) . These frameworks live within the framework of panel data settings, where one gets repeated measurements of units. By considering "control" as one of the D interventions, we see that DID and SC both allow Neyman (1923) and Rubin (1974) , we denote Y (d) tn ∈ R as the potential outcome for unit n and measurement t under intervention d. We encode the set of all potential outcomes {Y (d) tn } into an order-3 tensor whose dimensions correspond to units, measurements, and interventions. See Figure 1 for a visualization of this potential outcomes tensor. tn for d ∈ [D] 0 , n ∈ I (d) , and t ∈ T post . For all other entries of Y , we assume Y tnd = . Observation pattern. As per Assumption 1, we observe all N units under d = 0 for t ∈ T pre . For d > 0 and t ∈ T post , however, it is sufficient to only observe outcomes for N d = |I (d) | ≥ 1 units, provided N d is sufficiently large. Below, we connect our observation pattern with what is typically assumed in the SC literature. In particular, as with our setting, SC assumes that all N units are observed under d = 0 for t ∈ T pre . Since d = 0 often represents control and t typically indexes the t th time step, this time frame is often referred to as the "preintervention" period. During the "post-intervention" period, i.e., t ∈ {T 0 + 1, . . . , T } with T 1 = T − T 0 , each unit is then observed under one of the D interventions. See Figure 2a for a visual depiction of the typical observation pattern in the SC literature. Indeed, we use T pre and T post to be in line with the SC literature. However, our setup allows for more general observation patterns compared to SC as we do not strictly need a separate pre-and post-intervention period. This is particularly relevant in multi-arm randomized control trials (RCTs), like in the e-commerce setting described earlier, where there may not be a pre-intervention period. Further, in many such RCTs setting, units can be simultaneously observed under multiple interventions; for example, e-commerce platforms can randomly choose different individuals from the same customer sub-populations to undergo different discount levels (here, a unit is defined as a customer sub-population). Hence, the sets I (d 1 ) and I (d 2 ) do not have to be mutually exclusive. 
However, despite being able to simultaneously observe units under multiple interventions, recall that our goal is to learn all $N \times D$ causal parameters using far fewer experiments. We show this is possible if we observe all units under the same intervention for some number of measurements, as formalized in Assumption 1. See Figure 2b for a visual depiction of the observation pattern in our proposed data-efficient RCT.

Target causal parameter. The goal is to estimate, for every unit-intervention pair $(n, d)$, $\theta^{(d)}_n := \frac{1}{T_1} \sum_{t \in \mathcal{T}_{\text{post}}} \mathbb{E}[Y^{(d)}_{tn}]$ (equation (1)); that is, the expected potential outcome of unit $n$ under intervention $d$, averaged over the measurements $t \in \mathcal{T}_{\text{post}}$. In the setting of SC, this would correspond to estimating the average potential outcome of each unit under each intervention during the post-intervention period. We emphasize that in observational settings, the potential outcomes $Y^{(d)}_{tn}$ might be correlated with which entries of $Y$ are observed, i.e., there is confounding. This serves as an additional challenge in defining the appropriate causal framework to estimate $\theta^{(d)}_n$.

Synthetic interventions (SI) estimator: an overview. The SI estimator recovers a given $\theta^{(d)}_n$, with valid confidence intervals, via a three-step procedure in which each step has a simple closed-form expression. Below, we give an overview of the method. For a given $(n, d)$ pair, the first step is to estimate a linear model $w^{(n,d)} \in \mathbb{R}^{N_d}$ such that for all $t \in \mathcal{T}_{\text{pre}}$, $\mathbb{E}[Y^{(0)}_{tn}] = \sum_{j \in I^{(d)}} w^{(n,d)}_j \mathbb{E}[Y^{(0)}_{tj}]$. Specifically, we use principal component regression (PCR) to learn $w^{(n,d)}$ by linearly regressing $\{Y_{tn0} : t \in \mathcal{T}_{\text{pre}}\}$ on $\{Y_{tj0} : t \in \mathcal{T}_{\text{pre}}, j \in I^{(d)}\}$. Subsequently, $\theta^{(d)}_n$ is estimated as $\hat\theta^{(d)}_n = \frac{1}{T_1} \sum_{t \in \mathcal{T}_{\text{post}}} \sum_{j \in I^{(d)}} \hat w^{(n,d)}_j Y_{tjd}$ (equation (2)). The precise details of the SI estimator are in Section 2. Though the SI method produces an estimate of the entire trajectory $\mathbb{E}[Y^{(d)}_{tn}]$ for $t \in \mathcal{T}_{\text{post}}$, as appears in (1), our theoretical results are focused on proving consistency and asymptotic normality for $\hat\theta^{(d)}_n$. Further, the SI estimator comes with a data-driven hypothesis test to verify when one can accurately learn a model $w^{(n,d)}$ using outcomes under $d = 0$ and $t \in \mathcal{T}_{\text{pre}}$, and then transfer it to estimate counterfactual outcomes for measurements in $\mathcal{T}_{\text{post}}$ and $d \in [D]_0$.

Challenge in estimating beyond control: causal transportability with panel data. For ease of discussion, we restrict our attention here to the setting of SC, where we have a pre- and post-intervention period. Identification and estimation of $\theta^{(d)}_n$ across all $(n, d)$ complements the existing SC literature, which focuses on estimating $\{\theta^{(0)}_n : n \notin I^{(0)}\}$, i.e., the counterfactual average potential outcome under control for a treated unit, otherwise referred to as the "treatment effect on a treated unit". However, the SI framework also allows one to estimate causal parameters of the form $\theta^{(d)}_n$ for $d \neq 0$ and $n \in I^{(0)}$; this corresponds to the "treatment effect for an untreated unit". Estimating the best personalized treatment for a unit that is thus far untreated is clearly of interest, but it is neither identifiable nor estimable in the SC framework. We discuss the fundamental challenge in estimating $\theta^{(d)}_n$ for $d \neq 0$ next. Indeed, if we restrict ourselves to the task of estimating $\{\theta^{(0)}_n : n \notin I^{(0)}\}$, we see that the SI estimator in essence reduces to the standard SC estimator.² A reasonable question that may arise is whether $\theta^{(d)}_n$ across $(n, d)$ can be estimated by simply fitting $N \times D$ separate SC estimators, one for each pair. This is not possible. In SC, $w^{(n,0)}$ is learned using outcomes data under control and then is only applied to outcomes under control.
However, to estimate θ (d) n for d = 0, this requires learning a model using outcomes under control, but then applying it to outcomes under intervention d = 0-recall (2). This begs the pivotal question: "when does the structure between units under control continue to hold in other intervention regimes"? The SC framework does not have an answer for this question thus far and it has indeed been posed as an important open question in Abadie (2020). As we will see, a tensor factor model across time, units, and interventions provides one natural answer to this question. We highlight that the problem of when one can learn a model under one interventional regime and transfer it to another has gained interest across many fields and has been known by terms including: "causal transportability", "transfer learning", "learning with distribution shift". We believe the SI framework provides one answer to this problem for the panel data setting, and more broadly as well. 2 The key difference being that rather than restricting w (n,0) to be convex as is classically done, we learn this linear model using PCR. We motivate the suitability of PCR in Section 2. There are two key conceptual frameworks that are foundational to this work: SC and latent factor models. Given the sizeable literature, we provide a brief overview of closely related works from each topic. As noted earlier, when restricted to d = 0, i.e., estimating θ (0) n for a subset of units n / ∈ I (0) , the SI method is effectively SC as introduced in the seminal works of Abadie and Gardeazabal (2003), Abadie et al. (2010) , but with a different form of regularization. Precisely, for this restricted setting, SI is identical to the "robust synthetic controls" (RSC) method of Amjad et al. (2018) and further analyzed in Agarwal et al. (2020) . RSC is a variant of SC, which de-noises observations and imputes missing values via matrix completion, e.g., by using singular value thresholding. Other variants of SC have been considered, including in Hsiao et al. (2012) , Doudchenko and Imbens (2016) A critical aspect that enables SC is the structure between units and time under control (d = 0). One elegant encoding of this structure is through a (matrix) factor model (also known as an interactive fixed effect model), Chamberlain (1984) , Liang and Zeger (1986) , Arellano and Honore (2000) , Bai (2003 , 2009 ), Pesaran (2006 , Weidner (2015, 2017 n , by directly learning from the observed outcomes rather than relying on additional covariates. Section 2: Methodological. We propose the SI estimator, which recovers all N × D causal parameters θ (d) n , with valid confidence intervals, under the observation pattern described in Assumption 1. For the special case where N d = N/D, SI estimates the N × D parameters Section 3: Empirical. We empirically assess the efficacy of the SI framework through two real-world case studies: (i) Experimental-evaluating the feasibility of running dataefficient A/B tests using data from a large e-commerce platform. We call our proposed experimental design "synthetic A/B testing". (ii) Observational-evaluating the impact of mobility restrictions on COVID-19 related morbidity outcomes. Section 4: Theoretical. We introduce a novel tensor factor model over potential outcomes, which is a natural generalization of the matrix factor model. Under this setting, we prove the identification of all N × D causal parameters, and establish finite-sample consistency and asymptotic normality of the SI estimator. 
Collectively, our results provide an answer to the open question of extending SC to multiple interventions, Abadie (2020).

Section 5: Robustness check for SI (and SC). We propose a data-driven hypothesis test to check whether a learned model can be transferred across interventional regimes and measurements. We provide guarantees for both its Type I and Type II errors. We use the test in the canonical SC case studies of (i) terrorism in Basque Country and (ii) California's Proposition 99, Abadie and Gardeazabal (2003), Abadie et al. (2010).

Section 6: Simulations. We run simulations which support our theoretical consistency and asymptotic normality results.

Section 7: Comparison with SC literature. By restricting our attention to estimating $\theta^{(0)}_n$ for $n \notin I^{(0)}$, our identification and inference results immediately give new results for the SC literature. To better contextualize our work, we present a detailed comparison with some representative works in the SC literature.

Section 8: Causal inference & tensor completion. Our proposed tensor factor model over potential outcomes serves as a connection between causal inference and the growing field of tensor completion. Traditionally, the goal in tensor completion is to recover a tensor from its partial, noisy observations, where its entries are missing completely at random. In contrast, our work suggests that the SI method can be viewed as solving a "causal" variant of the tensor completion problem where there is confounding, i.e., the missingness pattern is correlated with the entries of the potential outcomes tensor. Indeed, we hope this provides a framework for future inquiry towards obtaining a clearer understanding of the trade-offs between sample complexity, statistical accuracy, experiment design, and the role of computation in estimating various causal parameters.

Notations. See Appendix A for formal definitions of the standard notations we use.

In Section 2.1, we formally introduce the SI estimator. In Section 2.2, we discuss some practical heuristics for its implementation and supplement it with a more technical discussion to motivate and justify its various steps. Without loss of generality, we focus on estimating $\theta^{(d)}_n$ for a given $(n, d)$ pair.

Additional notation. Here, we introduce the notation required to formally state the estimator. Formally, let $Y_{\text{pre},n} = [Y_{tn0} : t \in \mathcal{T}_{\text{pre}}] \in \mathbb{R}^{T_0}$ represent the vector of observed outcomes for unit $n$ under $d = 0$ for $t \in \mathcal{T}_{\text{pre}}$. Let $Y_{\text{pre},I^{(d)}} = [Y_{tj0} : t \in \mathcal{T}_{\text{pre}}, j \in I^{(d)}] \in \mathbb{R}^{T_0 \times N_d}$ and $Y_{\text{post},I^{(d)}} = [Y_{tjd} : t \in \mathcal{T}_{\text{post}}, j \in I^{(d)}] \in \mathbb{R}^{T_1 \times N_d}$ represent the observed outcomes for the units within $I^{(d)}$ for the pre- and post-intervention measurements, respectively. Note that $Y_{\text{pre},I^{(d)}}$ is constructed using observed outcomes under intervention $0$, while $Y_{\text{post},I^{(d)}}$ is constructed using outcomes under intervention $d$. We define the singular value decomposition $Y_{\text{pre},I^{(d)}} = \sum_{\ell=1}^{M} s_\ell u_\ell v_\ell^\top$, where $M = \min\{T_0, N_d\}$, $s_\ell \in \mathbb{R}$ are the singular values (arranged in decreasing order), and $u_\ell \in \mathbb{R}^{T_0}$, $v_\ell \in \mathbb{R}^{N_d}$ are the left and right singular vectors, respectively. Note that $s_\ell, u_\ell, v_\ell$ are actually indexed by the measurements in $\mathcal{T}_{\text{pre}}$ and the units in $I^{(d)}$, but we suppress this dependence to increase readability.

The SI estimator is a simple three-step procedure, each step with a closed-form expression. It has one hyper-parameter $k \in [M]$ that quantifies the number of singular components of $Y_{\text{pre},I^{(d)}}$ to retain. The third step shows how to estimate the confidence interval for $\theta^{(d)}_n$; for simplicity, we pick the 95% confidence interval.
1. Learn a linear model $\hat w^{(n,d)}$ between unit $n$ and the units in $I^{(d)}$ via principal component regression, $\hat w^{(n,d)} = \big(\sum_{\ell=1}^{k} (1/s_\ell)\, v_\ell u_\ell^\top\big) Y_{\text{pre},n}$ (equation (3)).
2. Apply the learned model to the post-intervention outcomes of $I^{(d)}$ to estimate the counterfactual trajectory and its average, $\widehat{\mathbb{E}}[Y^{(d)}_{tn}] = \sum_{j \in I^{(d)}} \hat w^{(n,d)}_j Y_{tjd}$ for $t \in \mathcal{T}_{\text{post}}$, and $\hat\theta^{(d)}_n = \frac{1}{T_1} \sum_{t \in \mathcal{T}_{\text{post}}} \widehat{\mathbb{E}}[Y^{(d)}_{tn}]$ (equations (4) and (5)).
3. Produce the 95% confidence interval for $\theta^{(d)}_n$, centered at $\hat\theta^{(d)}_n$ with width governed by $1.96\,\hat\sigma\,\|\hat w^{(n,d)}\|_2$, where $\hat\sigma^2$, defined in (7), is the in-sample pre-intervention regression error (equation (6)).
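To make the three steps above concrete, the following is a minimal sketch of the estimator in Python/NumPy. The function and variable names (`si_estimate`, `y_pre_n`, `Y_pre_donors`, `Y_post_donors`) are ours, and the exact normalization of `sigma_hat` and the $\sqrt{T_1}$ scaling of the interval width are illustrative assumptions standing in for equations (6) and (7), not the paper's exact specification.

```python
import numpy as np

def si_estimate(y_pre_n, Y_pre_donors, Y_post_donors, k, z=1.96):
    """Sketch of the synthetic interventions (SI) estimator for one (n, d) pair.

    y_pre_n       : (T0,)     outcomes of target unit n under d = 0, t in T_pre.
    Y_pre_donors  : (T0, Nd)  outcomes of donor units I^(d) under d = 0, t in T_pre.
    Y_post_donors : (T1, Nd)  outcomes of donor units I^(d) under d, t in T_post.
    k             : number of singular components of Y_pre_donors to retain.
    """
    # Step 1: principal component regression (PCR): hard-threshold the singular
    # values of Y_pre_donors, then regress y_pre_n on the truncated matrix.
    U, s, Vt = np.linalg.svd(Y_pre_donors, full_matrices=False)
    w_hat = Vt[:k].T @ ((U[:, :k].T @ y_pre_n) / s[:k])          # (Nd,)

    # Step 2: transfer the learned model to the post-intervention outcomes
    # of the donors under intervention d, then average over T_post.
    traj_hat = Y_post_donors @ w_hat        # estimate of E[Y^(d)_tn], t in T_post
    theta_hat = traj_hat.mean()

    # Step 3: plug-in 95% confidence interval. sigma_hat is the in-sample
    # (pre-intervention) regression error; the sqrt(T1) scaling is an
    # illustrative assumption in place of equation (6).
    T1 = Y_post_donors.shape[0]
    sigma_hat = np.sqrt(np.mean((y_pre_n - Y_pre_donors @ w_hat) ** 2))
    half_width = z * sigma_hat * np.linalg.norm(w_hat) / np.sqrt(T1)
    return theta_hat, (theta_hat - half_width, theta_hat + half_width), w_hat
```

Because Step 1 uses the truncated pseudo-inverse, `w_hat` is the minimum $\ell_2$-norm solution among linear models fit on the rank-$k$ approximation, consistent with the discussion of PCR in Sections 2.2 and 7.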
Below, we discuss the three steps of the SI estimator. Our goal here is to provide (i) practical heuristics for when and how to implement the SI estimator in practice and (ii) a more technical justification for each step.

2.2.1. Step 1: Estimating $w^{(n,d)}$. The first step of the SI estimator, given in (3), estimates a linear relationship $\hat w^{(n,d)}$ between the observed outcomes $Y_{\text{pre},n}$ and $Y_{\text{pre},I^{(d)}}$ by performing singular value thresholding (SVT) on the matrix $Y_{\text{pre},I^{(d)}}$ and subsequently running ordinary least squares (OLS) on the resulting matrix and $Y_{\text{pre},n}$. This is also known in the literature as principal component regression (PCR), Agarwal et al. (2020, 2021). The number of singular values retained is a hyper-parameter $k \in [\min\{T_0, N_d\}]$. Below, we justify when and why PCR is appropriate in the setting of latent factor models, and how to choose $k$.

Why PCR and how to choose k? The SI estimator, like various other methods such as SC, DID, and their variants, learns a linear model $w^{(n,d)}$ between the target unit $n$ and the "donor" units in $I^{(d)}$. However, to ensure the linear model is not "overfit", the different variants of these methods suggest different ways to regularize the linear fit. In particular, in SC, it is standard to impose that the linear fit is convex, i.e., the coefficients of $w^{(n,d)}$ are non-negative and sum to 1. More recent works impose an $\ell_1$-penalty term on the linear fit, e.g., Chernozhukov et al. (2020b); this is known as LASSO in the literature, and it promotes sparsity in the learned $w^{(n,d)}$, i.e., forces most of the coefficients of $w^{(n,d)}$ to be 0. PCR can be seen as another form of regularization, where rather than regularizing $w^{(n,d)}$ directly, it imposes "spectral sparsity", i.e., sets most of the singular values of $Y_{\text{pre},I^{(d)}}$ to 0, before learning a linear model. Indeed, PCR is particularly effective as a regularization technique if $Y_{\text{pre},I^{(d)}}$ is (approximately) low-rank; for further discussion on this point, please refer to the technical discussion below and Agarwal et al. (2020, 2021). With regards to selecting the hyper-parameter $k$, there exist a number of principled heuristics, and we name a few here. Perhaps the most popular data-driven approach is simply to use cross-validation, where the pre-intervention data is our training set and the post-intervention data is our validation set. Another standard approach is to use a "universal" thresholding scheme that preserves the singular values above a precomputed threshold (see Chatterjee (2015) and Donoho and Gavish (2013)). Finally, a "human-in-the-loop" approach is to inspect the spectral characteristics of $Y_{\text{pre},I^{(d)}}$ and choose $k$ to be the natural "elbow" point that partitions the singular values into those of large and small magnitudes. For a graphical depiction of such an elbow point, see Figure 3. We observe this clear elbow point in both case studies in Section 3. As such, if this approximately low-rank spectral profile is not present in the data, then using PCR is likely not appropriate.

Technical discussion. Here, we give a more technical justification of when we expect an elbow point in the spectrum of $Y_{\text{pre},I^{(d)}}$. The key assumption we make in Section 4.1 is that $\mathbb{E}[Y_{\text{pre},I^{(d)}}]$ is a low-rank matrix. If the entries of $\mathbb{E}[Y_{\text{pre},I^{(d)}}]$ are $\Theta(1)$ and its nonzero singular values are of the same magnitude, then they will scale as $\Theta(\sqrt{T_0 N_d / r})$; this places them well above the singular values induced by the mean-zero noise, and the resulting separation is what produces the elbow. This low-rank structure is also what motivates the existence of a linear model $w^{(n,d)}$.
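As a companion to the heuristics above, here is a small sketch of two ways one might pick $k$ in practice: retaining enough components to capture a large fraction of the spectral energy, or returning the singular values for visual "elbow" inspection. The 99% energy threshold is an illustrative choice on our part, not a prescription from the paper.

```python
import numpy as np

def choose_k_by_energy(Y_pre_donors, energy=0.99):
    """Smallest k whose top-k singular values capture at least `energy`
    of the total spectral energy of Y_pre_donors (illustrative heuristic)."""
    s = np.linalg.svd(Y_pre_donors, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, energy)) + 1

def singular_values_for_elbow(Y_pre_donors):
    """Singular values in decreasing order, to be plotted and inspected
    for a natural 'elbow' separating large and small magnitudes."""
    return np.linalg.svd(Y_pre_donors, compute_uv=False)
```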
When can we "transfer" $w^{(n,d)}$ across interventional regimes and measurements? Step 2 of the SI estimator, defined in (4) and (5), applies the learned model $\hat w^{(n,d)}$ to $Y_{\text{post},I^{(d)}}$ to produce the counterfactual estimate $\hat\theta^{(d)}_n$. This should give some pause: we learn $\hat w^{(n,d)}$ using $Y_{\text{pre},n}$ and $Y_{\text{pre},I^{(d)}}$, which correspond to outcomes for $d = 0$ and $t \in \mathcal{T}_{\text{pre}}$, yet we apply $\hat w^{(n,d)}$ to $Y_{\text{post},I^{(d)}}$, which corresponds to outcomes for any $d \in [D]_0$ and $t \in \mathcal{T}_{\text{post}}$. Thus, Step 2 of the estimation procedure implicitly assumes that the model learned under control in the pre-intervention period remains valid under intervention $d$ in the post-intervention period; that is, Step 2 of the SI estimator is a form of causal transportation. We find that a necessary condition for such causal transportation to be feasible is that the "complexity" of $\mathbb{E}[Y_{\text{post},I^{(d)}}]$ does not exceed that of $\mathbb{E}[Y_{\text{pre},I^{(d)}}]$. To build intuition, consider the setting where $\mathbb{E}[Y_{\text{pre},I^{(d)}}]$ is a matrix of 0's, i.e., it has a degenerate rowspace. Then, it is easy to see that we cannot effectively learn $w^{(n,d)}$ using any regression procedure. Hence, we require a natural condition that the span of the right singular vectors of $\mathbb{E}[Y_{\text{post},I^{(d)}}]$ (i.e., its rowspace) lies within that of $\mathbb{E}[Y_{\text{pre},I^{(d)}}]$ (see Assumption 8). Of course, $\mathbb{E}[Y_{\text{post},I^{(d)}}]$ and $\mathbb{E}[Y_{\text{pre},I^{(d)}}]$ are not observed, but we can estimate their respective rowspaces via $Y_{\text{post},I^{(d)}}$ and $Y_{\text{pre},I^{(d)}}$. The data-driven hypothesis test we propose in Section 5.1 uses $Y_{\text{post},I^{(d)}}$ and $Y_{\text{pre},I^{(d)}}$ to check whether this subspace inclusion property holds. The discussion above suggests that before running Step 2 of the SI estimator, one should first run this hypothesis test as a robustness check to verify whether it is feasible to transfer the learned model $\hat w^{(n,d)}$ between $Y_{\text{pre},I^{(d)}}$ and $Y_{\text{post},I^{(d)}}$. If the test fails, then causal transportation across regimes is likely infeasible.

Technical discussion. We now motivate and formally introduce the test statistic that underpins our proposed hypothesis test. Let $r_{\text{pre}} = \operatorname{rank}(\mathbb{E}[Y_{\text{pre},I^{(d)}}])$ and $r_{\text{post}} = \operatorname{rank}(\mathbb{E}[Y_{\text{post},I^{(d)}}])$, and let $V_{\text{pre}}$ and $V_{\text{post}}$ denote the right singular vectors of $\mathbb{E}[Y_{\text{pre},I^{(d)}}]$ and $\mathbb{E}[Y_{\text{post},I^{(d)}}]$, respectively. Further, let $\hat V_{\text{pre}} \in \mathbb{R}^{N_d \times r_{\text{pre}}}$ and $\hat V_{\text{post}} \in \mathbb{R}^{N_d \times r_{\text{post}}}$ denote their respective estimates, which are constructed from the top $r_{\text{pre}}$ and $r_{\text{post}}$ right singular vectors of $Y_{\text{pre},I^{(d)}}$ and $Y_{\text{post},I^{(d)}}$, respectively. As discussed above, if causal transportation as required by Step 2 of the SI estimator is to hold, then we require that the span of $V_{\text{post}}$ is contained within that of $V_{\text{pre}}$. Since we do not observe $V_{\text{pre}}$ and $V_{\text{post}}$, a natural test statistic $\hat\tau$ is the distance between $\hat V_{\text{pre}}$ and $\hat V_{\text{post}}$, namely the spectral energy of $\hat V_{\text{post}}$ lying outside the span of $\hat V_{\text{pre}}$: $\hat\tau = \|\hat V_{\text{post}} - \hat V_{\text{pre}} \hat V_{\text{pre}}^\top \hat V_{\text{post}}\|_F^2$ (equation (8)). Indeed, if $\hat V_{\text{pre}} = V_{\text{pre}}$, $\hat V_{\text{post}} = V_{\text{post}}$, and the span of $V_{\text{post}}$ is within that of $V_{\text{pre}}$, then $\hat\tau = 0$. To account for noise, for a given significance level $\alpha \in (0, 1)$, we reject the validity of Step 2 of the SI method to estimate $\theta^{(d)}_n$ if $\hat\tau$ exceeds a critical value $\tau(\alpha)$, defined in (20). See Section 5 for details and formal results.
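The subspace inclusion check just described can be computed directly from the data. Below is a sketch that uses the "remaining spectral energy" form of $\hat\tau$ from equation (8) together with the heuristic threshold $\tau(\alpha) = \alpha \cdot r_{\text{post}}$ described in Section 5.1.4; the exact critical value in (20) is not reproduced here, so the decision rule below should be read as the heuristic variant rather than the formal test.

```python
import numpy as np

def subspace_inclusion_check(Y_pre_donors, Y_post_donors, r_pre, r_post, alpha=0.05):
    """Heuristic robustness check for transferring w^(n,d) (Step 2 of SI).

    tau_hat is the spectral energy of the top-r_post right singular vectors of
    Y_post_donors lying outside the span of the top-r_pre right singular
    vectors of Y_pre_donors; it lies in [0, r_post].
    """
    V_pre = np.linalg.svd(Y_pre_donors, full_matrices=False)[2][:r_pre].T     # (Nd, r_pre)
    V_post = np.linalg.svd(Y_post_donors, full_matrices=False)[2][:r_post].T  # (Nd, r_post)
    residual = V_post - V_pre @ (V_pre.T @ V_post)   # part of V_post outside span(V_pre)
    tau_hat = np.linalg.norm(residual, 'fro') ** 2
    tau_alpha = alpha * r_post                        # heuristic critical value
    reject = tau_hat > tau_alpha                      # True => transfer is suspect
    return tau_hat, tau_alpha, reject
```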
Step 3: Uncertainty Quantification of $\theta^{(d)}_n$. The confidence interval we produce in (6) is a direct implication of our asymptotic normality result given in Theorem 4.3. The variance scales with two quantities, $\|\hat w^{(n,d)}\|_2$ and $\hat\sigma$. Both quantities are standard. In particular, $\|\hat w^{(n,d)}\|_2$ is the $\ell_2$-norm of our estimate of the linear model, and $\hat\sigma$, as defined in (7), is essentially our in-sample "training" error. This suggests that if the in-sample training error produced by PCR is large, then naturally our uncertainty about the estimated $\theta^{(d)}_n$ grows accordingly.

We present two real-world case studies to assess the efficacy of the SI framework and explore possible applications in both experimental and observational settings.

We use data collected from a large e-commerce company that segmented its users into $N = 25$ customer sub-populations/groups (i.e., units), with approximately 10,000 individual users per group, based on the historical engagement of a user, as measured by time and money spent on the platform. The aim of the company was to learn how different discount levels (i.e., interventions) affected the engagement levels of each of the 25 customer groups. The levels were 10%, 30%, and 50% discounts over the regular subscription cost, along with control (a 0% discount). Thus, there is a total of $D = 4$ interventions. The A/B test was performed by randomly partitioning users in each of the $N = 25$ customer groups into 4 subgroups; each subgroup was assigned either one of the three discount strategies or control. User engagement in each of these $N \times D = 25 \times 4 = 100$ experiments was measured daily over $T = 8$ days. We only had access to the average engagement level of all the customers in each of the 100 experiments. That is, we had access to 100 trajectories, each of length 8. This e-commerce A/B test is particularly suited to validate the SI framework as we observe the engagement levels of each customer group under each of the three discounts and control, i.e., for each customer group, we observe all possible "counterfactual" outcomes. Further, it is an experimental setting and thus there is no latent confounding in how interventions were assigned. Given that we have access to all possible potential outcomes of interest, we can test the efficacy of the SI framework by exposing data from a limited number of experiments to the SI estimator, and seeing how well it can recreate the customer engagement outcomes in the hidden experiments. To simulate how a data-efficient synthetic A/B test (i.e., RCT) can be run, we randomly partition the 25 customer groups into three clusters, denoted as customer groups 1-8, 9-16, and 17-25. For all 25 customer groups, we give the SI estimator access to the outcomes under the 0% discount (i.e., control). For the 10% discount level, we give the SI estimator access to data from customer groups 1-8, but hold out the corresponding data for groups 9-25. In other words, the SI estimator does not get to observe the trajectories of groups 9-25 under a 10% discount. Using the observed trajectories of customer groups 1-8, we separately apply the SI estimator to create synthetic user engagement trajectories for each of the customer groups 9-25 under a 10% discount. Analogously, we only give the SI estimator data for the 30% and 50% discounts for customer groups 9-16 and 17-25, respectively. As a first feasibility check, we inspect the spectrum of the pre-intervention data (see Figure 4a and Figure 5) and observe that it is approximately low-rank. In Table I, we show the hypothesis test results for the three discounts. The hypothesis test passes for each discount at a significance level of $\alpha = 0.05$. Thus, this dataset passes both feasibility checks. Further, since we have access to the customer engagement outcomes from all 100 experiments, we can inspect the spectral profile of the induced tensor of potential outcomes. This allows us to check whether the latent factor model across units, measurements, and interventions holds (see Assumption 2 for a formal description of the tensor factor model). Let $Y = [Y_{tnd}] \in \mathbb{R}^{8 \times 25 \times 4}$ be an order-3 tensor, where $Y_{tnd}$ represents the observed engagement level for customer group $n$ on day $t$ under discount policy $d$. Consider the mode-1 and mode-2 unfoldings of $Y$, which result in $8 \times (25 \times 4)$ and $25 \times (8 \times 4)$ matrices, respectively. We plot the spectra of the mode-1 and mode-2 unfoldings of $Y$ in Figure 6. For both unfoldings, over 99% of the spectral energy is captured by the top two singular values. This suggests that $Y$ has a low canonical polyadic tensor rank (Farias and Li, 2019, Proposition 1).
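The spectral-energy check on the tensor unfoldings described above is straightforward to reproduce; a sketch follows, assuming the outcomes have been arranged into a NumPy array `Y` with axes (measurements, units, interventions). The "top-2" and "99%" figures are the values reported in the text, not parameters of the method.

```python
import numpy as np

def unfolding_energy(Y, mode, top=2):
    """Fraction of spectral energy captured by the top singular values of a
    mode unfolding of the order-3 tensor Y (axes: measurements, units, interventions)."""
    # Move the chosen axis to the front and flatten the remaining two axes.
    M = np.moveaxis(Y, mode, 0).reshape(Y.shape[mode], -1)   # mode-(mode+1) unfolding
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s[:top] ** 2) / np.sum(s ** 2)

# Example usage for the 8 x 25 x 4 engagement tensor: the text reports that the
# top two singular values of both unfoldings capture over 99% of the energy.
# energy_mode1 = unfolding_energy(Y, mode=0)   # 8 x (25*4) unfolding
# energy_mode2 = unfolding_energy(Y, mode=1)   # 25 x (8*4) unfolding
```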
Empirical results. To quantify the accuracy of counterfactual predictions, we need a meaningful baseline to compare against. To that end, for any $(n, d)$ pair we use the squared error metric $\text{SE}^{(d)}_n$ defined in (9), which takes the form of a modified $R^2$ statistic: one minus the ratio of the squared error associated with the SI estimate $\hat\theta^{(d)}_n$ to the squared error associated with a baseline we coin the "RCT estimator", $(1/N_d) \sum_{j \in I^{(d)}} Y_{tjd}$, i.e., the average outcome across all units within the subgroup $I^{(d)}$. If the units are indeed homogeneous (i.e., they react similarly to intervention $d$), then the RCT estimator will be a good predictor of $\theta^{(d)}_n$. Therefore, $\text{SE}^{(d)}_n > 0$ indicates the success of the SI estimator over the RCT baseline. In effect, $\text{SE}^{(d)}_n$ captures the gain from "personalizing" the prediction to the target unit using SI over the natural baseline of taking the average outcome over $I^{(d)}$. Across the three discounts, SI achieves a median $\text{SE}^{(d)}$ of at least 0.98; this indicates substantial heterogeneity across customer groups in how they respond to each discount (the kind of heterogeneity that motivates, for example, clinical trials for personalized medicine). Lastly, we emphasize that the SI estimator did not use additional covariates about the customer groups or discounts. Rather, it only required a unique identifier (UID) for each customer group and discount.
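For completeness, here is a sketch of how the comparison against the RCT baseline could be computed when the held-out trajectory is available, as in this A/B test. Since equation (9) is not reproduced in the text, the per-day squared-error form below is our assumption; it matches the description of the numerator (error of the SI estimate) and the denominator (error of the RCT estimator).

```python
import numpy as np

def se_metric(y_held_out, theta_hat, Y_post_donors):
    """Modified R^2 of the SI estimate against the 'RCT estimator' baseline
    (an assumed form of equation (9)).

    y_held_out    : (T1,)    observed outcomes of unit n under intervention d.
    theta_hat     : scalar   SI estimate of theta^(d)_n.
    Y_post_donors : (T1, Nd) outcomes of the donor units in I^(d) under d.
    """
    rct_baseline = Y_post_donors.mean(axis=1)              # (1/Nd) * sum_j Y_tjd
    err_si = np.sum((y_held_out - theta_hat) ** 2)         # squared error of SI
    err_rct = np.sum((y_held_out - rct_baseline) ** 2)     # squared error of RCT baseline
    return 1.0 - err_si / err_rct                          # > 0 => SI beats the baseline
```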
At the onset of the COVID-19 pandemic across the globe, the policies that different nations enacted were primarily targeted at restricting mobility to curb the spread of the virus. A question of interest to policy-makers is how effective mobility restriction was, as answering it can help in understanding the trade-offs between the health outcomes and the economic impact of these policies. Although it is infeasible to run experiments in this setting, we explore how the SI framework can leverage readily available observational data from across the globe to help answer these questions. Below, we list and justify our key modeling decisions.

Outcome metric: daily death counts. Due to its relative reliability and availability, we use daily COVID-19 related morbidity outcomes as our outcome variable of interest, taken from Dong et al. (2020). The other standard metric, the number of daily infections, is much less reliable due to inconsistencies in testing and reporting across regions.

Intervention: change in mobility rate. At the start of the pandemic, each country implemented numerous policies to combat the spread of COVID-19. This makes it difficult to analyze any particular policy (e.g., stay-at-home orders vs. schools shutting down) in isolation. Thus, a key assumption we make is that at the start of the pandemic, almost all such policies were directed towards restricting how individuals move and interact. That is, we assume the effect of these various policies on COVID-19 related morbidity outcomes is solely mediated via the level of mobility restriction. To that end, we use Google's mobility reports, Google (2020), to study the change in a country's mobility compared to its respective national baseline from January 2020. Thus, we adopt mobility as our notion of intervention, and investigate how a country's change in mobility level potentially affects COVID-19 related morbidity outcomes.

The additional challenge here compared to the A/B testing case study is that we only have access to observational data. That is, there may be observed and/or latent characteristics of a country that influence both the mobility restriction policy enacted (e.g., population demographics, cultural trends, governmental structure) and the health outcomes observed.

Pre- and post-intervention periods. Recall from Assumption 1 that to apply SI, we require a set of outcome measurements for all units under a common intervention. Towards that, using Google's mobility reports, we verify that 20 days prior to cumulative 80 COVID-19 related deaths in a country (and at any time before), none of the $N = 27$ countries we select in this case study had restricted mobility. Thus, rather than using a particular calendar date, we choose the day a country reaches 80 cumulative COVID-19 deaths as "Day 0". Henceforth, we refer to the pre- and post-intervention periods as the days before and after Day 0, respectively. In particular, $T_0 = 20$, and we measure post-intervention outcomes for the first 15 days from the onset of the pandemic in a country, i.e., $T_1 = 15$.

Categorizing countries by intervention received: average (lagged) mobility score. Studies have shown that there is a median lag of 20 days from the onset of infection to the day of death (e.g., see Wilson et al. (2020)). Thus, a country's death count on a particular day is a result of the infection levels from approximately 20 days prior. In order to analyze the effect of a mobility restricting intervention from "Day 0" onwards, we consider a country's mobility score from Day -20 to Day -1. Given that Google's mobility score changes every day, we take its average in a given country from Day -20 to Day -1, and then bucket it into the $D = 3$ intervention groups defined as follows; see Figure 7a for a visual depiction of this clustering (the average reduction in mobility and the assigned intervention groups). For a given country, we define: (a) "low mobility restriction" as a reduction in mobility of less than 5% compared to the national baseline from January 2020; (b) "moderate mobility restriction" as a reduction in mobility between 5% and 35% compared to the national baseline from January 2020; (c) "severe mobility restriction" as a reduction in mobility greater than 35% compared to the national baseline from January 2020. We remark that discretizing mobility from a numerical trajectory over 20 days into these three categorical buckets is a significant modeling choice, and it coarsens the possible causal parameters of interest. See Figure 7b for a visual depiction of the observation pattern according to these assignments.

Feasibility checks. We run the same two feasibility checks as in the A/B testing case study. In particular, we (i) inspect the spectrum of $Y_{\text{pre},[N]}$, and (ii) run the hypothesis test for subspace inclusion. Let $d \in \{0, 1, 2\}$ correspond to the mobility restrictions {"low", "moderate", "severe"} described above, respectively. We plot the spectrum of $Y_{\text{pre},[N]}$ in Figure 8, which clearly exhibits low-rank structure. In Table II, we show that our dataset passes the hypothesis test for all mobility restriction levels at a significance level of $\alpha = 0.05$. Thus, both feasibility checks pass. Empirical results.
We apply SI using the setup above to produce counterfactual predictions of the daily COVID-19 morbidity outcomes for the 15 days following Day 0 under the three mobility restriction levels for each country. In this case study, since we do not have access to the counterfactual trajectory of a country for a mobility restricting intervention it did not actually go through, the best we can do is leave-one-out cross validation. For a country n that undergoes mobility restricting intervention d in the post-intervention period (i.e., n ∈ I (d) ), we hide the post-intervention data for that country, and we see how well SI is able to recreate that trajectory using only that country's pre-intervention data. We then use the estimated trajectory from SI to produce SE n , as defined in (9). Subsequently, we define SE (d) as the median SE (d) n over n ∈ I (d) . In Table II , we see that the median SE (d) is 0.46, 0.80, 0.08 for the low, moderate, and severe mobility restriction, respectively. This indicates that there is indeed significant heterogeneity amongst the countries in how mobility affects the COVID-19 related morbidity outcomes at least with respect to the low and moderate restriction. For severe mobility restriction, the SI estimator only slightly outperforms the RCT estimate; this is likely due to the fact that countries that underwent a severe mobility restriction all had very few COVID-19 related deaths in the post-intervention period, and thus there was not much heterogeneity in their trajectories. For every mobility restriction level, we display the three synthetic trajectories produced by SI for one representative country, using only the pre-intervention outcomes for each given country. This corresponds to producing two counterfactual trajectories per country, for the two mobility interventions it did not actually go through, and one trajectory for the intervention it did actually go through, which serves as cross-validation. Hence, in this case study, we produce 2 × 27 = 54 counterfactual trajectories. For the low mobility restricting regime, we show results for USA in Figure hold generally across all countries we use in this study. We display the results for USA, Brazil, and India as they are the largest countries within each intervention group. Further, we emphasize that we produce these trajectories using only their COVID-19 related death outcomes and the Google mobility report to categorize which intervention bucket they belong to in the post-intervention period. That is, we do not use any additional covariates about the countries or the various interventions. Importantly, the SI model of each country is fit in the pre-intervention period, when no intervention has yet occurred. Still, the learned model transfers to an interventional setting, i.e., when the interventions take effect within the donor countries. This helps validate the SI framework. An "optimistic" conclusion one can draw from the figures above is that, uniformly across all countries, there is a significant drop in the number of deaths with even a "moderate" drop in mobility (i.e, a 5-35% drop in mobility compared to the national baseline). After this point, gains by further restricting mobility seem to be diminishing. Of course, with any such conclusions, it needs to be rigorously cross-validated with other studies of a similar nature; further these estimated counterfactual trajectories are only valid up to the modeling decisions made and under appropriate causal assumptions. 
In Section 4, we provide a formal causal framework for SI and our associated consistency and normality results for the SI estimator. In this section, we present formal results. In Section 4.1, we introduce our causal framework. In Section 4.2, we establish identification of the causal parameter of interest, θ where r ≥ 1 represents the 'rank' or the dimension of the latent factors, and u t , v n , w d ∈ R r represent the latent factors associated with the t-th measurement, n-th unit, and d-th intervention, respectively. The tensor factor model is a natural generalization of the matrix factor model traditionally considered in the literature. Specifically, when restricted to any specific intervention d, ] ∈ R r . Indeed, (10) states that the expected potential outcomes for each intervention should satisfy the traditional matrix factor model. However, the critical additional assumption is that the unit n specific factor v n remains invariant across interventions. In this work, we shall require this weaker condition, which is implied by the tensor factor model. ASSUMPTION 2-Invariant unit factors: For any given (t, n, d), where u (d) t ∈ R r is the latent factor specific to (t, d); v n ∈ R r is the latent factor specific to n; and ε (d) tn ∈ R is a mean zero residual term specific to (t, n, d). Assumption 2 essentially states that the collection of latent factors tn ] : (t, n, d)}. Thus, the distribution of the potential outcomes is captured through the residual term {ε Collectively, let the intervention assignments be denoted as In an ideal setting, we wish to have D and Y (d) tn be independent, as is the case in a RCT. However, in observational studies, the interventions and potential outcomes can be dependent, which is known as confounding. We consider a setting where the confounders can be latent, but their impact on interventions and potential outcomes is mediated through the latent factors. That is, we assume conditioned on the latent factors, LF , the potential outcomes and interventions are conditionally independent. Strictly speaking, our identification results only require E[ε (d) tn |LF, D] = 0, which is known as conditional mean independence. However, we state it as ε (d) tn ⊥ ⊥ D | LF to increase the interpretability of the conditional exogeneity assumption we make. Assumptions (d) tn ⊥ ⊥ D | LF. Hence, this conditional independence condition can be thought of as "selection on latent factors", which is analogous to "selection on observables". Similar conditional independence assumptions have been considered in Athey et al. (2017) , Kallus et al. (2018) . In the context of the COVID-19 case study described in Section 3.2, it suggests that the mobility restriction each country went through can be dependent on latent factors (e.g. national politics, cultural trends, population demographics), which might also influence the COVID-19 morbidity outcomes under different interventions. However, Assumption 3 requires that the collection of latent factors is rich enough that conditional on it, the potential health outcomes of a country and the mobility restriction interventions are independent. Target causal parameter. We can now formally define our target causal parameter. For any given (n, d), we aim to estimate That is, for each unit n and intervention d, our interest is in the average expected potential outcomes over T post . Identification. 
Next, we argue that our target causal parameter can be written as a function of observed outcomes, i.e., we establish identification of our causal parameter. Let E = {LF, D} refer to the collection of latent factors and intervention assignments. To that end, we make an additional mild assumption. ASSUMPTION 4-Linear span inclusion: Given (n, d) and conditioned on E, v n ∈ Given Assumption 2, the linear span inclusion requirement as stated in Assumption 4 is rather weak. For example, consider the case that span({v j : j ∈ I (d) }) = R r . Since v n ∈ R r , Assumption 4 would immediately hold. Note Theorem 4.6.1 of Vershynin (2018) implies that if {v n } n∈ [N ] are sampled as independent, mean zero, sub-Gaussian vectors, then span({v j : j ∈ I (d) }) = R r holds with high probability as N (d) grows. More intuitively, Assumption 4 requires that the intervention assignment D is such that sufficiently many units undergo intervention d, and their unit factors are collectively "diverse" enough so that their span includes the latent factor associated with any other unit. Now we state the formal identification result. THEOREM 4.1: Given (n, d), let Assumptions 1 to 4 hold. Then, given w (n,d) , 4.3. Finite-sample Consistency Additional Assumptions. We state additional assumptions required to establish statistical guarantees for our estimation procedure. ASSUMPTION 5-Sub-Gaussian noise: Conditioned on E, for any (t, n, d), ε tn are zero-mean independent sub-Gaussian random variables with Var[ε (d) and s 1 ≥ . . . s rpre > 0 be its nonzero singular values. We assume the singular values are well-balanced, i.e., for universal constants c, c > 0, Finite-sample consistency. Now we state the finite-sample guarantee which establishes that the estimator described in Section 2.1 yields a consistent estimate of the causal parameter for any given unit-intervention pair. To simplify notation, we will henceforth absorb dependencies on σ into the constant within O p (·), defined in Appendix A. THEOREM 4.2: Given (n, d), let Assumptions 1 to 8 hold. , where k is defined as in (3). Then conditioned on E, Here, We establish that the estimate is asymptotically normal around the true causal parameter. This will justify the confidence interval defined in (6) (15). Then conditioned on E, Further, for w (n,d) and σ 2 defined in (3) and (7), respectively, we have Within the econometrics factor model analyses and matrix completion literature, Assumption 7 is analogous to incoherence-style conditions, e.g., Assumption A of Bai and Ng (2020) . It is also closely related to the notion of pervasiveness (see Proposition 3.2 of Fan et al. (2017)). Additionally, Assumption 7 has been shown to hold with high probability for the canonical probabilistic generating process used to analyze probabilistic principal component analysis in Bishop (1999) and Tipping and Bishop (1999) ; here, the observations are assumed to be a high-dimensional embedding of a low-rank matrix with independent sub-Gaussian entries (see Proposition 4.2 of Agarwal et al. (2020)). Lastly, we highlight that the assumption of a gap between the top few singular values of observed matrix of interest and the remaining singular values has been widely adopted in the econometrics literature of large dimensional factor analysis dating back to Chamberlain and Rothschild (1983) . Next, we discuss Assumption 8. 
The goal of the SI framework is to "causally transport" In Section 5.1, we formally define the subspace inclusion hypothesis test; this serves as a robustness check of Assumption 8, which enables our learned model to be transferred across interventions and measurements. In Section 5.2, we apply this test to two seminal case studies from the SC literature and discuss our findings. Notation. Recall r pre = rank(E[Y pre,I (d) ]) and r post = rank(E[Y post,I (d) ]). Further, recall analogously. Finally, let V pre ∈ R N d ×rpre and V post ∈ R N d ×rpost denote their respective estimates, which are constructed from the top r pre and r post right singular vectors of Y pre,I (d) and Y post,I (d) , respectively. Recall the test statistic τ as defined in (8) in Section 2.2.2, and the motivation for its usage. More formally, we define the test as follows: for any significance level α ∈ (0, 1), Here, τ (α) is the critical value, which we define for some absolute constant C ≥ 0: where φ pre (a) = (20), such that the Type I error is bounded as P( τ > τ (α)|H 0 ) ≤ α. To bound the Type II error, suppose the following additional condition holds: Then, the Type II error is bounded as P( τ ≤ τ (α)|H 1 ) ≤ α. The particular C for which Theorem 5.1 holds depends on the underlying distribution of tn , which determines the distribution of the potential outcomes Y tn be normally distributed for all (t, n, d). Then, P( τ > τ (α)|H 0 ) ≤ α and P( τ ≤ τ (α)|H 1 ) ≤ α. We now argue (21) is not a restrictive condition. Conditioned on H 1 , observe that r post > V pre V pre V post 2 F always holds. If Assumption 7 holds and the nonzero singular values of E[Y post,I (d) ] are well-balanced, then one can easily verify that the latter two terms on the right-hand side of (21) decay to zero as T 0 , T 1 , N d grow. Computing τ (α) requires estimating (i) σ 2 ; (ii) r pre , r post ; (iii) s rpre , ς rpost . We provide an estimator for σ in (7) If we consider the noiseless case, ε (d) tn = 0, we note that τ (α) = 0. More generally, if the spectra of E[Y pre,I (d) ] and E[Y post,I (d) ] are well-balanced, then Corollary 5.2 establishes that τ (α) = o(1), even in the presence of noise. We remark that Corollary 5.1 allows for exact constants in the definition of τ (α) under the Gaussian noise model. Here, we provide a complementary approach to computing τ (α), used in Squires et al. (2021) . To build intuition, observe that τ represents the remaining spectral energy of V post not contained within span( V pre ). Further, we note τ is trivially bounded by r post since the columns of V post are orthonormal. Thus, one can fix some fraction α ∈ (0, 1) and reject H 0 if τ > τ (α), where τ (α) = α · r post . In words, if more than α fraction of the spectral energy of V post lies outside the span of V pre , then the alternative test rejects H 0 . We remark that this variant is likely more robust compared to its exact computation counterpart, which requires estimating several "nuisance" quantities described above in order to estimate (20) and varies with the underlying modeling assumptions we make about ε (d) tn and the singular values, s rpre , ς rpost . As such, we employ the heuristic variant in our case studies presented in Section 3. We revisit two seminal case studies within the SC literature: (i) an evaluation on the impact of terrorism in Basque Country Abadie and Gardeazabal (2003) ; (ii) an evaluation on the impact of California's Proposition 99 on tobacco consumption Abadie et al. (2010) . 
In particular, we apply our subspace inclusion hypothesis test to study the potential feasibility of counterfactual inference in both studies. Indeed, these studies have been used extensively to explain the utility of the SC method and have subsequently served as benchmarks in follow-on works, cf. Amjad et al. (2018), Athey et al. (2017), Arkhangelsky et al. (2020), Agarwal et al. (2020), and more broadly as well. As such, we hope our findings not only motivate the usage of this test, but also spark the development of additional robustness tests to stress test the causal conclusions drawn from these studies and beyond.

Background & setup. In 1968, the first Basque Country victim of terrorism was claimed; however, it was not until the mid-1970s that terrorist activity became more rampant, Abadie and Gardeazabal (2003). To study the economic ramifications of terrorism on Basque Country, we use the per-capita GDP of $N = 18$ Spanish regions from 1955-1997, i.e., $T = 43$. Among these regions, there is one treated region, Basque Country, which experienced terrorism; the other 17 regions are considered control regions as they were relatively unaffected by terrorism. Translating to our notation, we have $D = 2$ with $d = 0$ representing control and $d = 1$ representing treatment (i.e., terrorism); in turn, we have $T_0 = 14$ and $T_1 = 29$. Without loss of generality, we index Basque Country as unit $n = 1$. Our interest is to estimate $\theta^{(0)}_1$. We note that the original work of Abadie and Gardeazabal (2003) uses 13 additional predictor variables for each region, including demographic information pertaining to one's educational status and average shares for six industrial sectors. We do not utilize this additional information here, i.e., we only use information related to the outcome of interest, per-capita GDP.

Hypothesis test results. We apply our heuristic variant of the hypothesis test, introduced in Section 5.1.4, with $\alpha = 0.05$. We find that our test statistic is $\hat\tau = 0.01$ and $\tau(\alpha) = 0.05$. Since $\hat\tau < \tau(\alpha)$, our test suggests that the SI method is suitable for this study, and it also supports the suitability of, and conclusions drawn from, previous works utilizing SC-like methods for this case study.

Background & setup. In 1988, California passed Proposition 99, a large-scale anti-tobacco legislation, Abadie et al. (2010); the other 38 states included in this study neither adopted an anti-tobacco program nor raised cigarette sales taxes by 50 cents or more. As such, these states are considered the control states and California is considered the treated state. Translating to our notation, we have $D = 2$ with $d = 0$ representing control and $d = 1$ representing treatment (i.e., Proposition 99); in turn, we have $T_0 = 29$ and $T_1 = 12$. Without loss of generality, we index California as unit $n = 1$. Our interest is to estimate $\theta^{(0)}_1$. It is worth noting that the original work in Abadie et al. (2010) uses six additional covariates per state, e.g., retail price, beer consumption per capita, and the percentage of individuals between the ages of 15-24. We do not include these variables in our study.

Hypothesis test results. We apply our heuristic variant of the hypothesis test with $\alpha = 0.05$. We find that our test statistic is $\hat\tau = 1.64$ and $\tau(\alpha) = 0.15$. Since $\hat\tau > \tau(\alpha)$, our test suggests that the SI estimator is likely ill-suited to produce a reliable estimate of $\theta^{(0)}_1$ in this case study.

In this section, we present illustrative simulations to reinforce our theoretical results.
In particular, these simulations suggest that the finite-sample estimation error bounds are tight, asymptotic Gaussian approximation is accurate, and the assumptions behind our theoretical results are necessary. Below, we present a brief overview of the setup for each simulation as well as the primary takeaway, and relegate the details to Appendix C. The purpose of this section is to study the finite-sample properties of the SI estimator. in such a way that our operating assumptions hold; namely, Assumptions 2, 4, 6, 7, 8. Further, we sample Y pre,I (d) , Y post,I (d) , and Y pre,n while respecting Assumptions 1, 3, 5. Our objective is to estimate θ Therefore, this simulation demonstrates that the estimator described in Section 2 produces a consistent estimate of the underlying causal parameter if Assumptions 1 to 8 hold. In this section, we study the asymptotic normality properties of the SI estimator, as well as the importance of subspace inclusion (Assumption 8). In the following simulation, we will enforce Assumption 8 to hold between the pre-and post-intervention data. However, we will allow the pre-and post-intervention data to be sampled from different distributions. Setup. We consider a binary intervention model D = 2, where the pre-intervention data will be observed under control d = 0, while the post-intervention data will be observed under treatment d = 1. In order to separate treatment from control, we will sample {Y (0) in such a way that our operating assumptions hold; namely, Assumptions 2, 4, 6, 7, 8. Further, we sample Y pre,I (1) , Y post,I (1) , and Y pre,n while respecting Assumptions 1, 3, 5. Our objective is to estimate θ (1) n from (Y pre,I (1) , Y post,I (1) , Y pre,n ). We re-emphasize that (Y pre,I (1) , Y pre,n ) used to learn the model follow a different distribution from Y post,I (1) used for predictions; however, subspace inclusion between the pre-and post-intervention is upheld. Results. We run 5000 iterations and display the histogram of estimates θ (1) n in Figure 11a . As implied by Theorem 4.3, the histogram is very well-approximated by a normal distribution with mean θ (1) n and variance (1/ √ T 1 )σ 2 w (n,1) 2 2 . This figure demonstrates that even if the pre-and post-intervention potential outcomes follow different distributions, our estimates remain normally distributed around the true causal parameter. That is, it is feasible to learn w (n,d) under one intervention setting (e.g., control), and then transfer the learned model to a different intervention regime, which may obey a different distribution, provided subspace inclusion holds. Next, we show θ (d) n is non-trivially biased when Assumption 8 fails. Setup. We again consider a binary intervention model D = 2, where the pre-intervention data will be observed under control d = 0, while the post-intervention data will be ob- served under treatment d = 1. In order to separate treatment from control, we will sam- in such a way that Assumptions 2, 4, 6, 7 hold. Crucially, however, we will violate Assumption 8. Further, we sample Y pre,I (1) , Y post,I (1) , and Y pre,n while respecting Assumptions 1, 3, 5. Our objective is to estimate θ (1) n from (Y pre,I (1) , Y post,I (1) , Y pre,n ). Results. We perform 5000 iterations and plot the histogram of estimates in Figure 11b . Unlike Figure 11a , the histogram is not well-approximated by the normal distribution with mean θ (1) n but instead has non-trivial bias. 
The juxtaposition of these two figures reinforces the importance of Assumption 8 for successful counterfactual inference (i.e., generalization). The goal of this section is to better contextualize our assumptions and results within the SC literature, by doing a detailed comparison with some representative works. Causal parameter in SC: treatment effect on the treated. For any (n, d) pair, letȲ tn . In addition, let τ denote the (relative) treatment effect between interventions d 1 and d 2 for unit n, averaged over the post-intervention period. The most closely related causal parameter in the SC literature to our work is τ (d,0) n for n ∈ I (d) and d = 0. This is referred to as the unit specific treatment effect on the treated, averaged over the post-intervention period. RecallȲ (d) n is observed for n ∈ I (d) . As such, the goal in these works is to estimate the counterfactual potential outcomes We re-emphasize the SI framework allows for identification and inference of θ Overview of assumptions. There are two key assumptions made in Chernozhukov et al. (2020b) . First, the authors assume the existence of w (n,0) ∈ R N 0 such that for all time t and conditioned 6 on any sampling of {Y where E[ε tn ] = 0 and E[ε tn Y tj ] = 0. Second, they assume the existence of an oracle estimator for w (n,0) , denoted as w (n,0) , such that w (n,0) − w (n,0) 2 = o(1). The two most common restrictions placed on w (n,0) , for which there exist estimators with formal performance guarantees, include (i) w (n,0) is convex, i.e., j∈I (0) w Under these assumptions, the authors provide a flexible and practical t-test based inference procedure to estimateȲ (0) n , which utilizes the oracle estimator w (n,0) . We compare these assumptions with the SI framework. (22). Note that under Assumption 2, we can equivalently write Assumption 4 as follows: Compared to (22) and for d = 0, we operate under a weaker functional form assumption as given in (23). To see this, simply take expectations on both sides of (22) with respect to ε tn , which implies (23). That is, we do not require the linear relationship w tj : j ∈ I (0) }. Rather, we make the weaker assumption that the relationship only holds between the expected potential outcomes. 6 The conditioning on {Y Restrictions on w (n,0) . As discussed, theoretical guarantees in previous works require w (n,0) to be approximately sparse, or even more stringently, convex. Further, the estimators used to learn w (n,0) require explicit knowledge of the restriction placed on w (n,0) . For example, if w (n,0) is assumed to be convex, then convex regression is used; if w (n,0) is assumed to be approximately sparse, then a 1 -norm regularizer is used. In comparison, the estimator we propose to learn w (n,0) in (3) does not impose any such restriction on w (n,0) , even in the high-dimensional setting where N 0 > max{T 0 , T 1 }. Indeed, Lemma F.2 establishes that our estimator consistently learns the unique minimum 2 -norm w (n,d) , denoted as w (n,d) . Further, our consistency and asymptotic normality results implicitly scale with the 1 -and 2 -norm of w (n,d) . In particular, the error in our consistency result of Theorem 4.2 implicitly scales with w (n,d) 1 and the variance in our asymptotic normality result of Theorem 4.3 implicitly scales with w (n,d) 2 . We do, however, assume that there exists a latent factor model between the potential outcomes (Assumption 2); such an assumption is not made in Chernozhukov et al. (2020b) . 
To overcome high-dimensionality, where w (n,d) is not uniquely specified, the estimator in (3) directly exploits the low-rank structure induced by this factor model in Assumption 2. Bai and Ng (2020) Overview of assumptions. In Bai and Ng (2020) , they consider the following factor model: Here, Λ n ∈ R r 1 is a unit n specific latent factor and F t ∈ R r 1 is a time t specific latent factor associated with intervention d = 0 (control); x tn ∈ R k is a vector of observed covariates; and β ∈ R k is a latent factor acting on x tn , which crucially is invariant across (t, n). The authors make appropriate assumptions on the factor loadings Λ = [Λ n ] ∈ R r 1 ×N and F = [F t ] ∈ R r 1 ×T , and on the residual term ε tn . See Assumptions A to D in Bai and Ng (2020) for details. In essence, these assumptions require ( Factor model in (24). First, we compare the factor model assumed in (24) with that in (11) of Assumption 2. Under (24), if we further assume that x tn , β admits a latent factorization given by x tn , β = x t ,x n withx t ,x n ∈ R k 1 , then we can write (24) as a special case of (11). To see this, let d = 0 and define u where r = r 1 + k 1 . We stress that we do not require access tox t orx n to express the model in Bai and Ng (2020) as an instance of (11); instead, we require that such a factorization of x tn , β implicitly exists. However, if one does not assume such a latent factorization, then first learning β using observed covariates and then estimating Λ, F via the residuals, is likely necessary. In Appendix B, we show how to incorporate covariates in SI. In addition, if one wants to estimate {θ (d) n : d = 0} in the SI framework (which we underscore is not considered in the SC literature), we require the added assumption that α (d) n factorizes into a (time, intervention) specific and unit specific latent factor, i.e., where r = r 1 + r 2 + k 1 . 7 7 In the SC literature, a closely related factor model is Y tn for d = 0, cf. Abadie and Gardeazabal (2003) , Abadie et al. (2010) . Here, β t ∈ R k is a latent factor, x n ∈ R k Assumption on Λ, F . Since our ultimate goal is to estimate θ (d) n , it is not necessary to accurately estimate either of the latent factor loadings (Λ, F ). In fact, our proposed estimator in Section 2 explicitly circumvents estimating these two quantities. In comparison, Bai and Ng (2020) establish their theoretical guarantees by requiring explicit bounds on the estimation qualities of (Λ, F ), which naturally require making more stringent assumptions on the latent factor loadings; for example, they require each singular vector of (Λ, F ) to be identifiable, which itself requires all of the non-zero singular values to be distinct. Arkhangelsky et al. (2020) and Agarwal et al. (2020) We compare with these recent works together as they both consider approximately lowrank factor models for the expected potential outcomes, a generalization of what we do. Comparison with Arkhangelsky et al. (2020) . They consider a binary intervention model: tn ] ∈ R T ×N is the matrix of potential outcomes under control; L = [L tn ] ∈ R T ×N encodes the latent factor model under control (in our notation, L tn = tn ] ∈ R T ×N is the matrix of potential outcomes under d = 1; and A = [α tn ] ∈ R T ×N encodes the treatment effect. The authors propose the "synthetic difference-in-differences" estimator for ∈I (0) α tn |L, and establish rigorous asymptotic normality results. 
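For completeness, the bookkeeping that casts the covariate factor model (24) of Bai and Ng (2020) as an instance of (11) can be written out in one line; the display below is only a sketch under the stated assumption that ⟨x_tn, β⟩ admits the latent factorization ⟨x̃_t, x̃_n⟩ with x̃_t, x̃_n ∈ R^{k_1}, and the stacked-vector notation is ours.

\begin{align*}
\mathbb{E}\big[Y^{(0)}_{tn}\big]
  = \langle F_t, \Lambda_n \rangle + \langle x_{tn}, \beta \rangle
  = \langle F_t, \Lambda_n \rangle + \langle \tilde{x}_t, \tilde{x}_n \rangle
  = \big\langle \underbrace{(F_t, \tilde{x}_t)}_{=:\, u^{(0)}_t},\;
                \underbrace{(\Lambda_n, \tilde{x}_n)}_{=:\, v_n} \big\rangle,
  \qquad r = r_1 + k_1 .
\end{align*}

With this mapping in hand, we return to the comparison with Arkhangelsky et al. (2020).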
If we continue to restrict our attention to the estimation of counterfactuals under control, then the two primary differences between this work and Arkhangelsky et al. (2020) are the assumptions made on the (i) spectral profile of the factor models and (ii) relationship between the target and donors. In particular, we consider a low-rank L (Assumption 2) with is an observed unit specific covariate, and F t , Λ n ∈ R r 1 and α (d) tn ∈ R are defined as in (25). As such, similar to (25), we can write this factor model as a special case of (11), where u (0) t = (β t , F t ) and v n = (x n , Λ n ). Again, if one wants to estimate θ (d) tn , then we need to assume an additional factorization of α (d) tn , just as in (26) and (27). a well-balanced spectra (Assumption 7); the latter effectively assumes a lower bound on the smallest non-zero singular value of L. In comparison, Arkhangelsky et al. (2020) makes a weaker assumption that L is approximately low-rank, i.e., its min{T 0 , N 0 }-th singular value is sufficiently small (see Assumption 3 of their work). However, as in Abadie et al. (2010) , they require a convex relationship to hold between the target and donors, while we make the weaker assumption that a linear relationship holds (see the discussion under Assumption 4 for why a linear relationship is directly implied by a low-rank factor model). Comparison with Agarwal et al. (2020) . They analyze the recently proposed "robust SC" (RSC) estimator of Amjad et al. (2018) , which is closely related to our proposed estimator in Section 2 if restricted to estimating θ Future directions. An interesting future research direction is to build upon these works to study the trade-offs between the assumptions placed on the latent factor model (e.g., spectral profile), relationship between the target and donors, and subsequent inference guarantees one can establish for various target causal parameters (e.g., extending the model and estimator of Arkhangelsky et al. (2020) to estimate θ (d) n for any n / ∈ I (d) and d = 0). Further, another direction that would be of interest is staggered adoption, where the duration of the pre-intervention period can vary amongst the units, i.e., T 0 is specific to each unit. This setup for example was considered in Athey et al. (2017) . As stated earlier, the works listed at the beginning of Section 7 use some combination (and/or variation) of the assumptions discussed in Sections 7.1 to 7.3 to prove formal statistical guarantees. To prove asymptotic normality, the works of Chernozhukov et al. (2020b) , Bai and Ng (2020) , Arkhangelsky et al. (2020) , make additional assumptions on the relative scalings between T 0 , T 1 , N 0 . We also make similar assumptions but with respect to N d , rather than N 0 , to estimate θ (d) n for d = 0. In particular, Chernozhukov et al. (2020b) Bai and Ng (2020) requires 2020)), which require that T 1 (N − N 0 ) do not grow "too quickly" compared to T 0 and N 0 . We note that finding the optimal relative scalings between T 0 , T 1 , N 0 (or N d ) remains interesting future work. In this section, we re-interpret the classical potential outcomes framework through the lens of tensors. Specifically, consider an order-3 tensor with axes that correspond to measurements, units, and interventions. Each entry of this tensor is associated with the potential outcome for a specific measurements, unit, and intervention. Recall Figure 1 for a graphical depiction of this tensor. 
Therefore, estimating unobserved potential outcomes, the fundamental task in causal inference, is equivalent to estimating various missing entries of this order-3 potential outcomes tensor. Recall from Figure 2 how different observational and experimental studies that are prevalent in causal inference can be equivalently posed as tensor completion with different sparsity patterns. Indeed, imputing entries of a tensor that are noisy and/or missing is the goal of tensor completion (TC). In Section 8.1, we discuss how important concepts in causal inference have a related notion in tensor completion. In Section 8.2, we show how low-rank tensor factor models, prevalent in the TC literature, can guide future algorithmic design for causal inference. In Section 8.3, we pose what algorithmic advances are required in the TC literature to allow it to more directly connect with causal inference. Causal parameters and TC error metrics. Here, we discuss the relationship between causal inference with different target causal parameters and TC under different error metrics. The first step in causal inference is to define a target causal parameter, while the first step in tensor completion is to define an error metric between the underlying and estimated tensors. Below, we discuss a few important connections between these two concepts. To begin, consider as the causal parameter the average potential outcome under intervention d across all T measurements and N units. Then, estimating this parameter can equivalently be posed as requiring a tensor completion method with a Frobenius-norm error guarantee for the d-th slice of the potential outcomes tensor with dimension T × N, normalized by 1/√(TN). As such, a uniform bound for this causal parameter over all D interventions would in turn require a guarantee over the maximum (normalized) Frobenius-norm error across the D slices. Another causal parameter is unit n's potential outcome under intervention d averaged over all T measurements (recall, this is θ^{(d)}_n). Analogously, this translates to an ℓ2-norm error guarantee for the n-th column of the d-th tensor slice, normalized by 1/√T. A uniform bound over all N units for the d-th intervention would then correspond to an ℓ_{2,∞}-norm error (defined in (47)) for the d-th tensor slice. As a final example, let the target causal parameter be unit n's potential outcome under intervention d and measurement t. This would require a TC method with a max-norm (entry-wise) error guarantee for the d-th matrix slice. Similarly, a uniform bound over all measurements, units, and interventions corresponds to a max-norm error over the entire tensor. Tensor factor model. We start by recalling Definition 4.1 of a low-rank tensor factor model. Let Y = [Y^{(d)}_{tn}] ∈ R^{T×N×D} denote an order-3 tensor of potential outcomes, which we assume admits the following decomposition: Y^{(d)}_{tn} = ∑_{ℓ=1}^{r} u_{tℓ} v_{nℓ} λ_{dℓ} + ε^{(d)}_{tn}, (28) where r is the canonical polyadic (CP) rank, u_t, v_n, λ_d ∈ R^r are latent factors associated with the t-th measurement, n-th unit, and d-th intervention, respectively, and ε^{(d)}_{tn} is a mean zero residual term. We note that such a factorization always exists; the key assumption is that the CP rank r is much smaller than N, T, D. Algorithm design guided by tensor factor models. The factorization in Assumption 2 is implied by the factorization assumed for a low-rank tensor as given in (28).
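Both the parameter-to-metric correspondence of Section 8.1 and the relation between (28) and Assumption 2 can be made concrete in a few lines of NumPy; the sketch below uses arbitrary dimensions and an idealized noiseless rank-r tensor, and is meant purely as an illustration of the definitions above.

import numpy as np

rng = np.random.default_rng(2)
T, N, D, r = 60, 40, 5, 4

# CP-rank-r potential outcomes tensor as in (28):
#   E[Y^(d)_{tn}] = sum_l u_{tl} * v_{nl} * lam_{dl}
U = rng.standard_normal((T, r))
V = rng.standard_normal((N, r))
Lam = rng.standard_normal((D, r))
EY = np.einsum("tl,nl,dl->tnd", U, V, Lam)

# The CP form implies the weaker factorization of Assumption 2: every
# intervention slice E[Y^(d)] = U diag(lam_d) V' has rank at most r.
assert all(np.linalg.matrix_rank(EY[:, :, d]) <= r for d in range(D))

# Error metrics matching the causal parameters discussed above, evaluated on a
# hypothetical estimate of the d-th slice.
d = 1
err = 0.1 * rng.standard_normal((T, N))            # estimate minus truth for slice d
frob = np.linalg.norm(err) / np.sqrt(T * N)        # average effect over all (t, n)
col_l2 = np.linalg.norm(err, axis=0) / np.sqrt(T)  # theta^(d)_n, one value per unit n
two_inf = col_l2.max()                             # uniform over units: the (2, inf) norm
entrywise = np.abs(err).max()                      # unit- and measurement-specific effect
print(frob, two_inf, entrywise)

Returning to the relationship between Assumption 2 and the full CP factorization in (28):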
In particular, Assumption 2 does not require the additional factorization of the (time, intervention) factor u^{(d)}_t as ⟨u_t, λ_d⟩, where u_t is a time specific factor and λ_d is an intervention specific factor. An important question we pose is whether it is feasible to design estimators that exploit an implicit factorization u^{(d)}_t = ⟨u_t, λ_d⟩. Indeed, a recent follow-up work, Squires et al. (2021), finds that rather than regressing on units, a variant of SI which regresses across interventions leads to better empirical results in the setting of single cell therapies. Further, Shah and Yu (2019), for example, directly exploit the tensor factor structure in (28) to provide max-norm error bounds for TC under a uniformly missing at random sparsity pattern. We leave as an open question whether one can exploit the further latent factor structure in (28), over that in Assumption 2, to operate under less stringent causal assumptions or data requirements, and/or to identify and estimate more involved causal parameters. The TC literature has grown tremendously because it provides an expressive formal framework for a large number of emerging applications. In particular, this literature quantifies the number of samples required and the computational complexity of different estimators to achieve theoretical guarantees for a given error metric (e.g., Frobenius-norm error over the entire tensor); indeed, this trade-off is of central importance in the emerging sub-discipline at the interface of computer science, statistics, and machine learning. Given our preceding discussions connecting causal inference and TC, a natural question is whether we can apply the current toolkit of tensor completion to understand statistical and computational trade-offs in causal inference. We believe a direct transfer of the techniques and analyses used in TC is not immediately possible for the following reasons. First, most results in the TC literature assume uniformly randomly missing entries over all T × N × D elements of the tensor. In comparison, as seen in Figure 2, causal inference settings frequently induce a "missing not at random" sparsity pattern. Further, this literature typically studies the Frobenius-norm error across the entire tensor. However, as discussed in Section 8.1, most meaningful causal parameters require more refined error metrics over the tensor. Hence, we pose two important and related open questions: (i) what block sparsity patterns and structural assumptions on the potential outcomes tensor allow for faithful recovery with respect to a meaningful error metric for causal inference, and (ii) if recovery is possible, what are the fundamental statistical and computational trade-offs that are achievable? An answer to these questions will formally bridge causal inference with TC, as well as with computer science and machine learning more broadly. The SI framework adds to the growing literature on data-efficient, personalized decision-making within causal inference and beyond. It presents a new perspective on learning with distribution shifts, an active area of research in econometrics and machine learning, also known as causal transportability. Along with proposing a new framework for experiment design, which we call synthetic A/B testing, we extend the SC framework to the multiple intervention setting, an open problem in the literature. For a matrix A ∈ R^{a×b}, we denote its transpose as A′ ∈ R^{b×a}. We denote the operator (spectral) and Frobenius norms of A as ‖A‖_op and ‖A‖_F, respectively.
The columnspace (or range) of A is the span of its columns, which we denote as R(A) = {v ∈ R^a : v = Ax, x ∈ R^b}. The rowspace of A, given by R(A′), is the span of its rows. Recall that the nullspace of A is the set of vectors that are mapped to zero under A. For any vector v ∈ R^a, let ‖v‖_p denote its ℓ_p-norm. Further, we define the inner product between vectors v, x ∈ R^a as ⟨v, x⟩ = ∑_{i=1}^{a} v_i x_i. If v is a random variable, we define its sub-Gaussian (Orlicz) norm as ‖v‖_{ψ2}. Let [a] = {1, . . . , a} for any integer a. Let f and g be two functions defined on the same space. We say that f(n) = O(g(n)) if and only if there exist a positive real number M and a real number n_0 such that for all n ≥ n_0, |f(n)| ≤ M|g(n)|. Analogously, we say: f(n) = Θ(g(n)) if and only if there exist positive real numbers m, M such that for all n ≥ n_0, m|g(n)| ≤ |f(n)| ≤ M|g(n)|; f(n) = o(g(n)) if for any m > 0, there exists n_0 such that for all n ≥ n_0, |f(n)| ≤ m|g(n)|. We adopt the standard notations and definitions for stochastic convergences. As such, we denote by →_d and →_p convergence in distribution and convergence in probability, respectively. We will also make use of O_p and o_p, which are probabilistic versions of the commonly used deterministic O and o notations. More formally, for any sequence of random vectors X_n, we say X_n = O_p(a_n) if for every ε > 0, there exist constants C_ε and n_ε such that P(‖X_n‖_2 > C_ε a_n) < ε for every n ≥ n_ε; equivalently, we say (1/a_n)X_n is "uniformly tight" or "bounded in probability". Similarly, X_n = o_p(a_n) if for all ε, ε′ > 0, there exists n_{ε′} such that P(‖X_n‖_2 > ε a_n) < ε′ for every n ≥ n_{ε′}. Additionally, we use the "plim" probability limit operator: plim X_n = a ⟺ X_n →_p a. We say a sequence of events E_n, indexed by n, holds "with high probability" (w.h.p.) if P(E_n) → 1 as n → ∞, i.e., for any ε > 0, there exists n_ε such that for all n > n_ε, P(E_n) > 1 − ε. More generally, a multi-indexed sequence of events E_{n_1,...,n_d}, with indices n_1, . . . , n_d and d ≥ 1, is said to hold w.h.p. if P(E_{n_1,...,n_d}) → 1 as min{n_1, . . . , n_d} → ∞. We also use N(µ, σ²) to denote a normal or Gaussian distribution with mean µ and variance σ²; we call it standard normal or Gaussian if µ = 0 and σ² = 1. In this section, we discuss how access to meaningful covariate information about each unit can help improve learning of the model. To this end, let X = [X_{kn}] ∈ R^{K×N} denote the matrix of covariates across units, i.e., X_{kn} denotes the k-th descriptor (or feature) of unit n. One approach towards incorporating covariates into the Synthetic Interventions estimation procedure described in Section 2.1 is to impose the following structure on X. ASSUMPTION 9-Covariate structure: For any k ∈ [K] and n ∈ [N], let X_{kn} = ⟨ϕ_k, v_n⟩ + ζ_{kn}. Here, ϕ_k ∈ R^r represents a latent factor specific to descriptor k, v_n ∈ R^r is the unit latent factor as defined in (11), and ζ_{kn} ∈ R is mean zero measurement noise specific to descriptor k and unit n. Interpretation. The key structure we impose in Assumption 9 is that the covariates X_{kn} share the same latent unit factors v_n as the potential outcomes. PROPOSITION B.1: Let Assumptions 2, 4, and 9 hold. Then, for every k ∈ [K], E[X_{kn}] = ∑_{j∈I^{(d)}} w^{(n,d)}_j E[X_{kj}], where recall w^{(n,d)} is defined as in Assumption 4. PROOF: The proof is immediate by plugging Assumption 4 into Assumption 9. Q.E.D. Adapting the Synthetic Interventions estimator. Proposition B.1 suggests a natural modification to the model learning stage of the Synthetic Interventions estimator presented in Section 2.1.
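Before turning to the modified estimator, here is a short numerical check of the weight transfer behind Proposition B.1; it is a sketch in a noiseless setting, with all dimensions chosen arbitrarily.

import numpy as np

rng = np.random.default_rng(3)
K, N_d, r = 8, 30, 5

Phi = rng.standard_normal((K, r))         # descriptor factors phi_k (Assumption 9)
V_donors = rng.standard_normal((N_d, r))  # donor unit factors v_j
w = rng.standard_normal(N_d)              # weights from Assumption 4
v_target = V_donors.T @ w                 # target unit factor expressed through the donors

EX_donors = Phi @ V_donors.T              # E[X_{kj}] = <phi_k, v_j>
EX_target = Phi @ v_target                # E[X_{kn}] = <phi_k, v_n>

# The weights that express the target's expected outcomes through the donors also
# express its expected covariates through the donors' covariates.
print(np.linalg.norm(EX_target - EX_donors @ w))   # numerically zero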
In particular, similar to the estimation procedure of Abadie et al. (2010) , we propose appending the covariates to the pre-intervention outcomes. Formally, we define X n = [X kn ] ∈ R K as the vector of covariates associated with the unit n ∈ [N ]; analogously, we define X I (d) = [X kj : j ∈ I (d) ] ∈ R K×N d as the matrix of covariates associated with units within I (d) . We further define Z n = [Y pre,n , X n ] ∈ R T 0 +K and Z I (d) = [Y pre,I (d) , X I (d) ] ∈ R (T 0 +K)×N d as the concatenation of pre-intervention outcomes and features associated with the target unit n and subgroup I (d) , respectively. We denote the SVD of Z I (d) as where M = min{(T 0 + K), N d }, λ ∈ R are the singular values (arranged in decreasing order), and µ ∈ R T 0 +K , ν ∈ R N d are the left and right singular vectors, respectively. Then, we define the modified model parameter estimator as (1/ λ ) ν µ Z n . The remainder of the algorithm, as described in Section 2.1, proceeds as is with the new estimate w (n,d) . Theoretical implications. Let in addition to setup of Theorem 4.2 and Assumption 9 hold. Then, it can be verified that the statistical guarantees in Sections 4.3, 4.4 and 5.1 continue to hold with T 0 being replaced by T 0 + K. In essence, adding covariates into the modellearning stage augments the data size, which can improve the estimation rates. For example, in Theorem 4.2, the term T In this section, we present the details of our simulations in Section 6. Below, we provide details for our consistency simulation in Section 6.1. We let N d = |I (d) | = 200 and r = 15, where r is defined in Assumption 2. We define the latent unit factors associated with I (d) as V I (d) ∈ R N d ×r (refer to (11)), where its entries are independently sampled from a standard normal distribution. Pre-intervention data. We choose T 0 = 200 and r pre = 10. We define the latent preintervention time factors under control (d = 0) as U pre ∈ R T 0 ×r , which is sampled as follows: (i) let A ∈ R T 0 ×rpre , where its entries are independently sampled from a standard normal; (ii) let Q ∈ R rpre×(r−rpre) , where its entries are first independently sampled from a uniform distribution over [0, 1], and then its columns are normalized to sum to one; (iii) define U pre = [A, AQ] as the concatenation of A and AQ. By construction, U pre has rank r pre w.h.p., which we empirically verify. Next, we define E[Y pre, Again by construction, we note that rank(E[Y pre,I (d) ]) = r pre w.h.p., which we empirically verify. We then generate the model w (n,d) ∈ R N d from a multivariate standard normal distribution, and define E[Y pre,n ] = E[Y pre,I (d) ]w (n,d) ∈ R T 0 . Post-intervention data. We choose T 1 = 200. We sample the post-intervention time factors as follows: (i) let P ∈ R T 1 ×T 0 , where its entries are first independently sampled from a uniform distribution over [0, 1], and then its rows are normalized to sum to one; (ii) define U post = P U pre ∈ R T 1 ×r . We then define E[Y post, To study the effect of the post-intervention period length, we will treat it as a variable. As Interpretation of data generating process. We now motivate the construction of E [Y pre,I (d) ] and E[Y post,I (d) ]. Recall that the SI framework allows potential outcomes from different interventions to be sampled from different distributions. As such, we construct E[Y post,I (d) ] such that they follow a different distribution to that of E [Y pre,I (d) ]. 
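Returning briefly to the covariate-augmented procedure of Appendix B described above, its model-learning step admits a compact sketch (function and variable names are ours, and the number of retained components k would be chosen in a data-driven manner as in Section 2.1):

import numpy as np

def si_weights_with_covariates(Y_pre_donors, X_donors, y_pre_target, x_target, k):
    # Stack the covariates beneath the pre-intervention outcomes and run the same
    # truncated-SVD (principal component) regression on the stacked data Z.
    Z_donors = np.vstack([Y_pre_donors, X_donors])        # (T0 + K) x N_d
    z_target = np.concatenate([y_pre_target, x_target])   # length T0 + K
    U, s, Vt = np.linalg.svd(Z_donors, full_matrices=False)
    return Vt[:k].T @ ((U[:, :k].T @ z_target) / s[:k])   # w-hat built from Z

We now return to the simulation construction of E[Y_{pre,I^{(d)}}] and E[Y_{post,I^{(d)}}] just described.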
This allows us to study when the model learnt using pre-intervention data "generalizes" to a post-intervention regime generated from a different distribution. However, we note that by construction, As- Verifying assumptions. We note that our data generating process ensures that Assumptions 1, 3, 5 hold. In addition, we empirically verify Assumptions 6 and 7. Further, for Assumption 2, we note that our pre-and post-intervention (expected) potential outcomes associated with I (d) were all generated using V I (d) ; thus, their variations only arise due to the sampling procedure for their respective latent time-intervention factors. Given that E[Y pre,n ] and θ In this section, we describe the setup for Section 6.2.1. We consider the binary D = 2 intervention model. Our interest is in estimating θ (1) n . Our data generating process will be such that the pre-and post-intervention data will obey different distributions. Towards this, we choose N 1 = |I (1) | = 400 and r = 15. We define V ∈ R N 1 ×r , where its entries are independently sampled from a standard normal. Pre-intervention data. We choose T 0 = 400, and define the latent pre-intervention time factors under control as U pre ∈ R T 0 ×r , where its entries are sampled independently from a standard normal. Next, we define E[Y pre,I (1) ] = U pre V ∈ R T 0 ×N 1 . By construction, rank(E[Y pre,I (1) ]) = r w.h.p., which we empirically verify. We then generate the model w (n,1) ∈ R N 1 from a standard normal, and define E[Y pre,n ] = E[Y pre,I (1) ]w (n,1) ∈ R T 0 . We Post-intervention data. We choose T 1 = 20, and generate post-intervention time factors under d = 1 as follows: We define U post ∈ R T 1 ×r , where its entries are independently sampled w.h.p., which we empirically verify. Observations. We generate Y pre,n and Y pre,I (1) by adding independent noise from a normal distribution with mean zero and variance σ 2 = 0.5 to E[Y pre,n ] and E[Y pre,I (1) ], respectively. We generate Y post,I (1) by applying the same additive noise model to E[Y post,I (1) ]. Verifying assumptions. As before, Assumptions 1 to 7 hold by construction. We perform 5000 iterations, where E[Y pre,n ], E[Y pre,I (0) ], E[Y post,I (1) ] are fixed throughout, but the idiosyncratic shocks are re-sampled to generate new (random) outcomes. Within each iteration, we first use (Y pre,n , Y pre,I (1) ) to fit w (n,1) , as in (3). Next, we use Y post,I (1) and w (n,1) to yield θ (1) n , as in (5). The resulting histogram is displayed in Figure 11a . Next, we describe the setup for Section 6.2.2. We continue analyzing the binary D = 2 intervention model. We let N 1 = 400, r = 15, and generate V I (1) ∈ R N 1 ×r by independently sampling its entries from a standard normal. Pre-intervention data. We choose T 0 = 400 and r pre = 12. We construct E[Y pre,I (1) ] using V I (1) identically to that in Section C.1.1, such that rank(E[Y pre,I (1) ]) = r pre w.h.p., which we empirically verify. As before, we generate w (n,1) ∈ R N 1 from a standard normal and Observations. We generate Y pre,n and Y pre,I (1) by adding independent noise from a normal distribution with mean zero and variance σ 2 = 0.5 to E[Y pre,n ] and E[Y pre,I (1) ], respectively. We generate Y post,I (1) by applying the same noise model to E[Y post,I (1) ]. Verifying assumptions. As before, Assumptions 1 to 7 hold by construction. We perform 5000 iterations, where E[Y pre,n ], E[Y pre,I (1) ], E[Y post,I (1) ] are fixed, but the idiosyncratic shocks are re-sampled. 
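The rank-deficient construction of the time factors used in Appendix C.1 (and re-used just above) can be sketched as follows; parameter values follow Appendix C.1.1, and the last check computes a population version of the quantity that the subspace-inclusion diagnostic of Section 5 is built around.

import numpy as np

rng = np.random.default_rng(4)
T0, T1, Nd, r, r_pre = 200, 200, 200, 15, 10

V = rng.standard_normal((Nd, r))        # unit latent factors

# Pre-intervention time factors with rank r_pre < r:  U_pre = [A, AQ].
A = rng.standard_normal((T0, r_pre))
Q = rng.uniform(size=(r_pre, r - r_pre))
Q /= Q.sum(axis=0)                      # columns of Q sum to one
U_pre = np.hstack([A, A @ Q])

# Post-intervention time factors as row mixtures of the pre-intervention ones,
# U_post = P U_pre, which enforces subspace inclusion (Assumption 8) by design.
P = rng.uniform(size=(T1, T0))
P /= P.sum(axis=1, keepdims=True)       # rows of P sum to one
U_post = P @ U_pre

EY_pre, EY_post = U_pre @ V.T, U_post @ V.T
print(np.linalg.matrix_rank(EY_pre), np.linalg.matrix_rank(EY_post))  # both r_pre

# Population check of subspace inclusion: the row space of E[Y_post] lies inside
# the span of the top right singular vectors of E[Y_pre].
_, _, Vt_pre = np.linalg.svd(EY_pre, full_matrices=False)
V_pre = Vt_pre[:r_pre].T
print(np.linalg.norm(EY_post - EY_post @ V_pre @ V_pre.T))            # numerically zero

In the bias experiment, the post-intervention factors are instead generated so that this inclusion fails. With the data generated as above, each Monte Carlo iteration then proceeds as described next.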
In each iteration, we use (Y pre,n , Y pre,I (1) ) to fit w (n,1) , and then use Y post,I (1) and w (n,1) to yield θ In what follows, the descriptors above the equalities will denote the assumption used, e.g., A1 represents Assumption 1: The third equality follows since u For ease of notation, we suppress the conditioning on E for the remainder of the proof. To bound the gap between s i and s i , we recall the following well-known results. LEMMA E.2-Weyl's inequality: Given A, B ∈ R m×n , let σ i and σ i be the i-th singular values of A and B, respectively, in decreasing order and repeated by multiplicities. Then at least 1 − 2 exp −t 2 . Here, K = max i,j A ij ψ 2 , and C > 0 is an absolute constant. Recalling Assumption 5 and applying Lemma E.3, we conclude for any t > 0 and some This completes the proof. where T post = {T − T 1 + 1, . . . , T } and w (n,d) is defined as in (15). PROOF: Recall that V pre ∈ R N d ×rpre represents right singular vectors of E [Y pre,I (d) ]. We also recall w (n,d) = V pre V pre w (n,d) , where w (n,d) is defined as in Theorem 4.1. Let V post ∈ R N d ×rpost denote the right singular vectors of E[Y post,I (d) ]. Assumption 8 implies Therefore, we have where we use (29) in the last equality. Hence, we conclude The first equality follows from (14) where k is defined as in (3). Then, conditioned on E, PROOF: We first re-state Corollary 5.1 of Agarwal et al. (2021) . LEMMA-Corollary 5.1 of Agarwal et al. (2021) : Let the setup of Lemma F.2 hold. Then with probability at least where C(σ) is a constant that only depends on σ. This is seen by adapting the notation in Agarwal et al. (2021) to that used in this paper. In particular, y = Y pre,n , X = E[Y pre,I (d) ], Z = Y pre,I (d) , β = w (n,d) , β * = w (n,d) , where y, X, Z, β, β * are the notations used in Agarwal et al. (2021) . 9 Further, we also use the fact that Xβ * in the notation of Agarwal et al. (2021) (i.e., E[Y pre,I (d) ] w (n,d) ) in our notation) equals E[Y pre,n ]. This follows from which follows from (13) in Theorem 4.1 and (30). We conclude by noting (31) implies Q.E.D. APPENDIX G: PROOF OF THEOREM 4.2 For ease of notation, we suppress the conditioning on E for the remainder of the proof. Let C( v p ) = max{1, v p } for any v ∈ R a , and let T post = {T − T 1 + 1, . . . , T }. For Note that the rows of Y post,I (d) are formed by {Y t,I (d) : t ∈ T post }. Additionally, let ∆ (n,d) = w (n,d) − w (n,d) . Finally, for any matrix A with orthonormal columns, let P A = AA denote the projection matrix onto the subspace spanned by the columns of A. By (5) and Lemma F.1 (implied by Theorem 4.1), we have where we have used Y t, Plugging (34) into (33) yields Below, we bound the three terms on the right-hand side (RHS) of (35) separately. Bounding term 1. By Cauchy-Schwartz inequality, observe that Hence, it remains to bound P Vpre ∆ (n,d) 2 . Towards this, we state the following lemma. Its proof can be found in Appendix G.1. LEMMA G.1: Consider the setup of Theorem 4.2. Then, Using Lemma G.1, we obtain This concludes the analysis for the first term. Bounding term 2. We begin with a simple lemma. LEMMA G.2: Let γ t be a sequence of independent mean zero sub-Gaussian random variables with variance σ 2 . Then, 1 PROOF: Immediately holds by Hoeffding's lemma (Lemma G.6 ). Q.E.D. By Assumptions 2 and 5, we have for any t ∈ T post , E[ ε t,I (d) , w (n,d) ] = 0, Var( ε t,I (d) , w (n,d) ) = σ 2 w (n,d) 2 2 . Since ε t,I (d) , w (n,d) are independent across t, Lemma G.2 yields Bounding term 3. 
First, we define the event E 1 as Now, condition on E 1 . By Assumptions 2 and 5, we have for any t ∈ T post , Var( ε t,I (d) , ∆ (n,d) ) = σ 2 ∆ (n,d) 2 2 . The above uses the fact that w (n,d) depends on Y pre,I (d) and Y pre,n and hence are independent of ε t,I (d) for all t ∈ T post . Given that ε t,I (d) , ∆ (n,d) are independent across t, Lemmas F.2 and G.2 imply E 2 |E 1 occurs w.h.p. Further, we note Since E 1 and E 2 |E 1 occur w.h.p, it follows from (38) that E 2 occurs w.h.p. As a result, Collecting terms. Incorporating (36), (37), (39) into (35), and simplifying yields This concludes the proof of Theorem 4.2. G.1. Proof of Lemma G.1 We begin by introducing some helpful notations: let Y rpre pre,I (d) = rpre =1 s u v be the rank r pre -approximation of Y pre,I (d) . More compactly, Y rpre pre,I (d) = U pre S pre V pre . To establish Lemma G.1, consider the following decomposition: We proceed to bound each term separately. Bounding term 1. Recall Av 2 ≤ A op v 2 for any A ∈ R a×b and v ∈ R b . Thus, To control the above term, we state a helper lemma that bounds the distance between the subspaces spanned by the columns of V pre and V pre . Its proof is given in Appendix G.2. LEMMA G.3: Consider the setup of Theorem 4.2. Then for any ζ > 0, the following holds w.p. at least 1 − 2 exp −ζ 2 : P Vpre − P Vpre op ≤ Cσ( , and C > 0 is an absolute constant. Substituting (41) and the bound in Lemma F.2 into (40), we obtain We call upon Property 4.1 of Agarwal et al. (2021) , which states that w (n,d) , as given by (3), is the unique solution to the following program: This, along with (32), implies that From (50) and (51), we have where we used (47) in the second inequality above. Using (49) and (52), we obtain We now state two helper lemmas that will help us conclude the proof. The proof of Lemmas G.4 and G.5 are given in Appendices G.3 and G.4, respectively. LEMMA G.4-Lemma 7.2 of Agarwal et al. (2021) : Let Assumptions 1, 2, 3, 5, 6, 7 hold. Suppose k = r pre , where k is defined as in (3). LEMMA G.5: Let Assumptions 1 to 7 hold. Then, given Y rpre pre,I (d) , the following holds with respect to the randomness in ε pre,n : Incorporating Lemmas E.1, G.4, G.5, and Assumption 7 into (53), we conclude Collecting terms. Combining (42) and (54), and noting v 2 ≤ v 1 for any v, we conclude We recall the well-known singular subspace perturbation result by Wedin. THEOREM G.1-Wedin's Theorem Wedin (1972) : Given A, B ∈ R m×n , let V , V ∈ R n×n denote their respective right singular vectors. Further, let V k ∈ R n×k (respectively, V k ∈ R n×k ) correspond to the truncation of V (respectively, V ), respectively, that retains the columns corresponding to the top k singular values of A (respectively, B). Let s i represent the i-th singular value of A. Then, Recall that V pre is formed by the top r pre right singular vectors of Y pre,I (d) and V pre is formed by the top r pre signular vectors of E[Y pre,I (d) ]. Therefore, Theorem G.1 gives , where we used rank(E[Y pre,I (d) ]) = r pre , and hence s rpre+1 = 0. By Assumption 5 and Lemma E.3, we can further bound the inequality above. In particular, for any ζ > 0, we have w.p. at least 1 − 2 exp −ζ 2 , Note that ε pre,n is independent of Y pre,I (d) , and thus also independent of U pre , Y rpre pre,I (d) . 
Therefore, Moreover, using the cyclic property of the trace operator, we obtain E P U pre ε pre,n , ε pre,n = E ε pre,n P U pre ε pre,n = E tr ε pre,n P U pre ε pre,n = E tr ε pre,n ε pre,n P U pre = tr E ε pre,n ε pre,n P U pre = tr σ 2 P U pre = σ 2 U pre 2 F = σ 2 r pre . Note that the above also uses (i) the mean zero and coordinate-wise independence of ε pre,n ; (ii) orthonormality of U pre ∈ R n×rpre . Therefore, it follows that Next, to obtain high probability bounds for the inner product term, we use the following lemmas, the proofs of which can be found in Appendix A of Agarwal et al. (2021) . LEMMA G.6-Modified Hoeffding's Lemma: Let X ∈ R n be r.v. with independent mean-zero sub-Gaussian random coordinates with X i ψ 2 ≤ K. Let a ∈ R n be another random vector that satisfies a 2 ≤ b for some constant b ≥ 0. Then for any ζ ≥ 0, P n i=1 a i X i ≥ ζ ≤ 2 exp − cζ 2 K 2 b 2 , where c > 0 is a universal constant. LEMMA G.7-Modified Hanson-Wright Inequality: Let X ∈ R n be a r.v. with independent mean-zero sub-Gaussian coordinates with X i ψ 2 ≤ K. Let A ∈ R n×n be a random matrix satisfying A op ≤ a and A 2 F ≤ b for some a, b ≥ 0. Then for any ζ ≥ 0, Using Lemma G.6 and (58) , and Assumptions 2 and 5, it follows that for any ζ > 0 P P Upre E[Y pre,n ], ε pre,n ≥ ζ ≤ exp − cζ 2 T 0 σ 2 . Note that the above also uses P Upre E[Y pre,n ] 2 ≤ E[Y pre,n ] 2 ≤ √ T 0 , which follows from the fact that P Upre op ≤ 1 and Assumption 6. Further, (61) implies P Upre E[Y pre,n ], ε pre,n = O p ( T 0 ). Similarly, using (59), we have for any ζ > 0 P U pre S pre V pre w (n,d) , ε pre,n ≥ ζ ≤ exp where we use the fact that Finally, using Lemma G.7, (60), Assumptions 2 and 5, we have for any ζ > 0 P P Upre ε pre,n , ε pre,n ≥ σ 2 r pre + ζ ≤ exp − c min ζ 2 σ 4 r pre , ζ σ 2 , where we have used P Upre op ≤ 1 and P Upre 2 F = r pre . Then, (65) implies P Upre ε pre,n , ε pre,n = O p r pre . From (57) APPENDIX H: PROOF OF THEOREM 4.3 For ease of notation, we suppress the conditioning on E for the remainder of the proof. Additionally, let T post = {T − T 1 + 1, . . . , T } and ∆ (n,d) = w (n,d) − w (n,d) . We start by establishing (17). To begin, we scale the left-hand side (LHS) of (35) by √ T 1 and analyze each of the three terms on the right-hand side (RHS) of (35) separately. To address the first term on the RHS of (35), we scale (36) by √ T 1 /(σ w (n,d) 2 ) and recall our assumption on T 1 given by (16). We then obtain 1 T 1 σ w (n,d) To address the second term on the RHS of (35), we scale (37) by √ T 1 σ w (n,d) 2 . Since ε t,I (d) , w (n,d) are independent across t, the Lindeberg-Lévy Central Limit Theorem (see Billingsley (1986) ) yields 1 T 1 σ w (n,d) 2 t∈Tpost ε t,I (d) , w (n,d) d − → N 0, 1 . To address the third term on the RHS of (35), we scale (39) by √ T 1 σ w (n,d) 2 and recall the assumption log(T 0 N d ) = o min{T 0 , N d }/(C 2 ( w (n,d) 2 )r 2 pre ) . This yields 1 T 1 σ w (n,d) 2 t∈Tpost ε t,I (d) , ∆ (n,d) = o p (1). Finally, scaling (35) by √ T 1 and collecting (67), (68), (69), we conclude This establishes (17). (18) is immediate from Lemma F.2. Next, we establish (19). Using the definition of w (n,d) in (3), Y pre,I (d) w (n,d) = Y rpre pre,I (d) w (n,d) , where recall that Y rpre pre,I (d) is obtained by truncating SVD of Y pre,I (d) by retaining top r pre components. 
Substituting this equality into our definition of the variance From (65), we have for any ζ > 0, P ε pre,n , P Upre ε pre,n ≥ σ 2 r pre + ζ ≤ exp − c min ζ 2 σ 4 r pre , ζ σ 2 , which implies that ε pre,n , P Upre ε pre,n = O p (r pre ). Plugging (76) and (77) into (73), we obtain Collecting terms. Normalizing (70) by T 0 and subsequently incorporating the bounds in (71), (72), and (78), we conclude APPENDIX I: PROOF OF THEOREM 5.1 For ease of notation, we suppress the conditioning on E for the remainder of the proof. We make use of the following notation: for any matrix A with orthonormal columns, let P A = AA denote the projection matrix onto the subspace spanned by the columns of A. Additionally, we follow the notation established in Section 5.1. In particular, we recall We proceed to bound each term on the right-hand side of (79) independently. Bounding (P Vpre − P Vpre ) V post 2 F . By Lemma G.3, we have w.p. at least 1 − α 1 , Note that we have used the fact that V post 2 F = r post . Bounding (I − P Vpre )(P Vpost − P Vpost ) 2 F . Observe that (I − P Vpre ) is a projection matrix, and hence I − P Vpre op ≤ 1. By adapting Lemma G.3 for V post , V post in place of V pre , V pre , we have w.p. at least 1 − α 2 P Vpost − P Vpost (81) Note that we have used the following: (i) P Vpost − P Vpost F = sin Θ F , where sin Θ ∈ R rpost×rpost is a matrix of principal angles between the two projectors (see Absil et al. (2006) ), which implies rank(P Vpost − P Vpost ) ≤ r post ; (ii) the standard norm inequality A F ≤ rank(A) A op for any matrix A. Using the result above, we have Bounding (P Vpre − P Vpre ) V post , (I − P Vpre ) V post F . Using the cyclic property of the trace operator, we have that (P V pre − P V pre ) V post , (I − P V pre ) V post F = tr V post (P V pre − P V pre )(I − P V pre ) V post = tr (P V pre − P V pre )(I − P V pre )P V post . Note that P Vpre − P Vpre is symmetric, and I − P Vpre and P Vpost are both symmetric positive semidefinite (PSD). As a result, Lemmas G.3 and I.6 yield w.p. at least 1 − α 1 tr (P Vpre − P Vpre )(I − P Vpre )P Vpost ≤ P Vpre − P Vpre op tr (I − P Vpre )P Vpost ≤ P Vpre − P Vpre op I − P Vpre op tr P Vpost ≤ Cσr post φ pre (α 1 ) s rpre . Again, to arrive at the above inequality, we use I − P Vpre op ≤ 1 and tr(P Vpost ) = r post . Collecting terms. Collecting (80), (82), and (84) with α 1 = α 2 = α/2, w.p. at least 1 − α, τ ≤ Cσ 2 r post φ 2 pre (α/2) s 2 rpre + Cσ 2 r post φ 2 post (α/2) ς 2 rpost + Cσr post φ pre (α/2) s rpre . Defining the upper bound as τ (α) completes the bound on the Type I error. Type II error. Next, we bound the Type II error. We will leverage Lemma I.2, the proof of which can be found in Appendix I.3. LEMMA I.2: The following equality holds: τ = r post − c 1 − c 2 , where We proceed to bound each term on the right hand side of (85) separately. Bounding (P Vpre − P Vpre ) V post 2 F . From (80), we have that w.p. at least 1 − α 1 , Bounding P Vpre (P Vpost − P Vpost ) 2 F . Using the inequality AB F ≤ A op B F for any two matrices A and B, as well as the bound in (81), we have w.p. at least 1 − α 2 , P Vpre (P Vpost − P Vpost ) 2 F ≤ Cσ 2 r post φ 2 post (α 2 ) ς 2 rpost . (87) Bounding (P Vpre − P Vpre ) V post , P Vpre V post F . Using an identical argument used to create the bounds in (83) and (84), but replacing I − P Vpre with P Vpre , we obtain w.p. at least Bounding P Vpre (P Vpost − P Vpost ), P Vpre P Vpost F . Like in the argument to produce the bound in (84), we use Lemmas G.3 and I.6 to get that w.p. 
at least 1 − α 2 , P V pre (P V post − P V post ), P V pre P V post F = tr (P V post − P V post )P V pre P V pre P V post = tr (P V post − P V post )P V pre P V post ≤ P V post − P V post op P V pre op tr P V post ≤ Cσr post φ post (α 2 ) ς r post . Collecting terms. Combining (86), (87), (88), (89) with α 1 = α 2 = α/2, and using the definition of τ (α), we have that w.p. at least 1 − α, c 2 ≤ τ (α) + Cσr post φ post (α/2) ς rpost . Hence, along with using Lemma I.2, it follows that w.p. at least 1 − α, τ ≥ r post − c 1 − τ (α) − Cσr post φ post (α/2) ς rpost . Now, suppose r post satisfies (21), which implies that H 1 must hold. Then, (90) and (21) together imply P( τ > τ (α)|H 1 ) ≥ 1 − α. This completes the proof. Under H 0 , it follows that (I − P Vpre )V post = 0. As a result, (I − P Vpre ) V post 2 F = (I − P Vpre )P Vpost 2 F = (I − P Vpre )P Vpost − (I − P Vpre )P Vpost 2 F = (I − P Vpre )(P Vpost − P Vpost ) 2 F . Applying these two sets of equalities above together completes the proof. Now, consider the second term of the equality above. Further, analyzing the second term of (92), we note that P V pre V post 2 F = P V pre P V post 2 F = P V pre P V post − P V pre P V post + P V pre P V post 2 F = P V pre (P V post − P V post ) 2 F + P V pre P V post 2 F + 2 P V pre (P V post − P V post ), P V pre P V post F . (93) Using synthetic controls: Feasibility, data requirements, and methodological aspects Synthetic control methods for comparative case studies: Estimating the effect of californiaâs tobacco control program The economic costs of conflict: A case study of the basque country On the largest principal angle between random subspaces On principal component regression in a high-dimensional error-invariables setting On robustness of principal component regression mrsc: Multi-dimensional robust synthetic control Robust synthetic control Mostly Harmless Econometrics: An Empiricist's Companion Panel data models: Some recent developments. Handbook of Econometrics Synthetic difference in differences Using the longitudinal structure of earnings to estimate the effect of training programs Matrix completion methods for causal panel data models Inferential theory for factor models of large dimensions Panel data models with interactive fixed effects Matrix completion, counterfactuals, and factor analysis of missing data The augmented synthetic control method How much should we trust differences-in-differences estimates? The Quarterly journal of economics Probability and Measure Bayesian pca The power of convex relaxation: Near-optimal matrix completion Panel data Arbitrage, factor structure, and mean-variance analysis on large asset markets The PCDID Approach: Difference-in-Differences when Trends are Potentially Unparallel and Stochastic. Working Papers 2020-03 Matrix estimation by universal singular value thresholding An exact and robust conformal inference method for counterfactual and synthetic controls Practical and robust t-test based inference for synthetic control and related methods An interactive web-based dashboard to track covid-19 in real time. The Lancet infectious diseases The optimal hard threshold for singular values is Balancing, regression, difference-in-differences and synthetic control methods: A synthesis An ∞ eigenvector perturbation bound and its application to robust covariance estimation Learning preferences with side information. 
Management Science Low-rank approximations of nonseparable panel models Google LLC google covid-19 community mobility reports A panel data approach for program evaluation: Measuring the benefits of political and economic integration of hong kong with mainland china Causal inference with noisy and missing covariates via matrix factorization Inference for factor model based average treatment effects. Available at SSRN 3112775 Estimation of average treatment effects with panel data: Asymptotic theory and implementation Longitudinal data analysis using generalized linear models Linear regression for panel with unknown number of factors as interactive fixed effects Dynamic linear panel regression models with interactive fixed effects Sur les applications de la theorie des probabilites aux experiences agricoles: Essai des principes Estimation and inference in large heterogeneous panels with a multifactor error structure Minimax rates of estimation for high-dimensional linear regression over q -balls A simpler approach to matrix completion Estimating causal effects of treatments in randomized and nonrandomized studies Iterative collaborative filtering for sparse noisy tensor estimation Causal imputation via synthetic interventions Probabilistic principal component analysis High-dimensional probability: An introduction with applications in data science Perturbation bounds in connection with singular value decomposition Case-fatality risk estimates for covid-19 calculated by using a lag time for fatality Generalized synthetic control method: Causal inference with interactive fixed effects models Incorporating (92) and (93) into (91), and recalling 5: Let A, B ∈ R n×n be symmetric PSD matrices. Then, tr(AB) ≥ 0. PROOF: Let B 1/2 denote the square root of B. Since A 0, we have tr(AB) = i ) A 6: If A ∈ R n×n is a symmetric matrix and B ∈ R n×n is a symmetric PSD matrix, then tr(AB) ≤ λ max (A) · tr(B), where λ max (A) is the top eigenvalue of A. PROOF: Since A is symmetric 5 yields tr((λ max (A)I − A)B) = λ max (A) · tr(B) − tr(AB) Bounding term 2. To begin, since V pre is an isometry, it follows that P Vpre ∆ (n,d) 2 2 = V pre ∆ (n,d) 2 2 .We upper bound V pre ∆ (n,d) 2 2 as follows: consider Y r pre pre,I (d) ∆ (n,d) 2 2 = ( V pre ∆ (n,d) ) S 2 r pre ( V pre ∆ (n,d) ) ≥ s 2 r pre V pre ∆ (n,d) 2 2 .Using (44) and (43) together impliesTo bound the numerator in (45), notewhere we have used (32). To further upper bound the the second term on the RHS above, we use the following inequality: for any A ∈ R a×b , v ∈ R b ,where A 2,∞ = max j A ·j 2 and A ·j represents the j-th column of A. Thus,Substituting (46) into (45) and subsequently using (48) impliesWe first re-state Lemma 7.2 of Agarwal et al. (2021) .LEMMA: Let the setup of Lemma G.4 hold. Then w.p. at leastwhere C(σ) is a constant that only depends on σ.This is seen by adapting the notation in Agarwal et al. (2021) to that used in this paper.In particular,are the notations used in Agarwal et al. (2021) with r = r pre .Next, we simplify (55) using Assumption 7. As such, w.p. at leastThis concludes the proof of Lemma G.4. Throughout this proof, C, c > 0 will denote absolute constants, which can change from line to line or even within a line. Recall w (n,d) = V pre S −1 pre U pre Y pre,n and Y pre,n = E[Y pre,n ] + ε pre,n . Thus,Therefore,estimator in (7), we get (n,d) ) .Below, we analyze each term on the RHS of (70) separately.In order to control this term, we incorporate Lemmas G.4 and G.5 into the bound in (52) to obtainBounding ε pre,n 2 . 
Since the entries of ε pre,n are independent mean zero sub-Gaussian random variables, it follows from Lemma G.2 thatBounding ε pre,n , (E[Y pre,n ] − Y rpre pre,I (d) w (n,d) ) . From (56), we have Y rpre pre,I (d) w (n,d) = P Upre (E[Y pre,n ] + ε pre,n ). Hence, ε pre,n , (E[Y pre,n ] − Y rpre pre,I (d) w (n,d) ) = ε pre,n , E[Y pre,n ] − ε pre,n , P Upre E[Y pre,n ] − ε pre,n , P Upre ε pre,n .By Lemma G.6 and Assumption 5, it follows that for any ζ > 0P ε pre,n , P Upre E[Y pre,n ] ≥ ζ ≤ exp − cζ 2 T 0 σ 2 .Note that we have used P Upre op ≤ 1 and Assumption 6 to obtain P Upre E[Y pre,n ] 2 ≤ E[Y pre,n ] 2 ≤ √ T 0 . Together, (74) and (75)