key: cord-0163736-k1y05smy
authors: Nie, Xinkun; Lu, Chen; Wager, Stefan
title: Nonparametric Heterogeneous Treatment Effect Estimation in Repeated Cross Sectional Designs
date: 2019-05-28
journal: nan
DOI: nan
sha: fc67989c61711019797710b765d0420ff57300f4
doc_id: 163736
cord_uid: k1y05smy

Identifying heterogeneity in a population's response to a health or policy intervention is crucial for evaluating and informing policy decisions. We propose a novel heterogeneous treatment effect estimator in the difference-in-differences design with repeated cross sectional data, where we observe different samples of a population at two time periods separated by the onset of a policy intervention, as well as samples of a population that serves as the control. Our estimator has orthogonality properties that enable fast rates on learning the treatment effect while allowing slower rates for estimating nuisance components. Our proposal shows promising empirical performance across a variety of simulation setups.

Difference-in-differences is an increasingly popular observational study design for estimating causal effects from repeated cross sections [e.g., Angrist and Pischke, 2008 , Bertrand et al., 2004 , Card and Krueger, 1994 , Lechner, 2011 , Obenauer and von der Nienburg, 1915 . For an overview of applying difference-in-differences method in public health, see Dimick and Ryan [2014] , Gertler et al. [2011] and Wing et al. [2018] for a review. Recently, several authors have employed it to study public health policy and implications due to COVID-19 [Brodeur et al., 2021, Goodman-Bacon and Marcus, 2020] .

A standard design with cross sectional data is as follows: we conduct a survey or draw a sample for some outcome (e.g., medical expediture) from two comparable states (or cities, regions, etc.). The first state later enacts some health policy of interest (e.g., encouragement of Medicare enrollment) and the second state doesn't. We then draw another sample from both states at a time after the health policy has been implemented in the first state. The simplest difference-in-differences estimator assumes global parallel trends: if neither state had enacted the policy, then their trends would have evolved in the same way. It would then attribute any difference in trends between the two states to the effect of the policy change.

In this paper, we focus on flexibly estimating treatment effect heterogeneity in the above design. There are two main challenges. The first challenge involves representing and targeting heterogeneity effectively without assuming specific functional forms. The second challenge involves relaxing the classic assumption of "global parallel trends," which is unlikely to hold when there is heterogeneity explained by covariates. In particular, states may have changing subgroups of people that exhibit markedly different trends on their own. For example, when studying the effect of enrollment in Medicare on medical expenditure, we may find that there are subgroups within states (e.g., based on age, income or gender) that have different baseline trends; then, if the two states under comparison have different proportions of these subgroups, the global parallel trends assumption immediately becomes questionable. Instead, we may want to control for these covariates, and only assume parallel trends once we have conditioned on them [e.g., Abadie, 2005 , Acemoglu and Angrist, 2001 , Blundell et al., 2004 , Heckman et al., 1998 .

Throughout the paper, we work in the following formal setup. We observe n independent samples (S i , T i , X i , Y i ), where the state indicator S i ∈ {0, 1} denotes whether the i-th individual is in the control or exposed state, T i ∈ {0, 1} denotes the time of the observation (pre-vs. post-intervention), X i ∈ R d is a set of potential confounders and Y i ∈ R is the outcome of interest. Only samples in the "exposed" state and "post" time period get treated, i.e., we can write the treatment or exposure indicator as W i = S i T i . Following the potential outcomes framework [Imbens and Rubin, 2015] , let Y i (0) and Y i (1) denote the control and treated potential outcomes and suppose we observe Y i = Y i (W i ). We are interested in estimating the heterogeneous treatment effect τ (x), defined as the expected treatment effect conditional on covariates X i = x and on X i treated:

Because E Y (0) X i = x, S i = 1, T i = 1 is not observed, we impose a parallel trends assumption conditional on covariates, so that

which then allows us to identify the conditional average treatment effect on the treated as follows,

As discussed in Abadie [2005] , (2) may be more credible than the standard parallel trends assumption that holds without conditioning on X i as it enables us to control for known sources of confounding. A classical approach to estimating τ (x) would be to employ flexible nonparametric modeling of the treatment effect function. A common practice is to employ a two-way fixed effect linear model with interactions for Y i in terms of X i , S i , T i of the form

then interpretingτ (x) =β s,t x as the treatment effect [e.g., Anzia and Berry, 2011] . However, the estimator (4) is not justified by a nonparametric version of the assumption (2) due to the assumed linear functional forms on τ (x) and on any confounding effects of X i that affect the outcome Y i [Angrist and Pischke, 2008 , Ding and Li, 2019 , Keele and Minozzi, 2013 , Lechner, 2011 . These constraining assumptions can be difficult to satisfy in practice.

In this paper, we aim to flexibly estimate the treatment effect function τ (X) given only the assumption (2) along with a relevant form of overlap. As a direct consequence of (2), the data generating process can be written as the following generic specification:

where the joint distribution of {T i , X i , S i } may be arbitrary and

are the conditional effect of S alone and T alone, and E ε i X i , S i , T i = 0. We note that all the conditional effect function b(·), ξ(·), ρ(·), τ (·) can be nonparametric funcitons.

One naive way to estimate τ (·) in this model is as follows. Recall our expression for τ (x) in (2). From that expression, one might attempt to estimate

for the four pairs of s and t on the corresponding subsets of the data, and then estimatê τ (x) =ĝ(x, 1, 1) −ĝ(x, 1, 0) −ĝ(x, 0, 1) +ĝ(x, 0, 0). However, this approach is often not robust. As an example where this method might fail, consider a high dimensional linear

We might consider fitting the Lasso [Tibshirani, 1996] for eachβ(s, t) separately, and estimateτ (x) = x (β(1, 1) −β(1, 0) −β(0, 1) +β(0, 0)). However, the lasso regularizes eachβ(s, t) towards 0 separately, which might result inβ(1, 1) −β(1, 0) −β(0, 1) +β(0, 0) being regularized away from 0, even when τ (x) = 0 everywhere. See Künzel et al. [2019] and Nie and Wager [2021] for a similar discussion on the T -Learner for the CATE. We seek to build a robust estimator for τ (·). To this end, we start by considering the case where the underlying treatment effect τ (x) is constant and the only challenge is to eliminate confounding. We start by developing an orthogonal transformation of (5) that generalizes the transformation of Robinson [1988] for the conditionally linear model. This representation allows us to build a transformed regression estimator (TR) to estimate the constant treatment effect. Our TR estimator achieves the parametric 1/ √ n rate of convergence while allowing for slower estimation rates on all nuisance components. The 1/ √ n rate of convergence in the TR estimator allows valid asymptotic confidence interval construction, while we still enable flexible nonparametric estimation on the nuisance components (e.g., without linearity assumptions). We further discuss the properties of the transformed regression estimator when the underlying linearity assumption is misspecified. We then build upon the transformed regression construction and propose a heterogeneous treatment effect estimator for τ (·) and show empirically the proposal is advantageous comparing to exisitng baselines.

Compared to the standard difference-in-differences design, we do not assume the panel setting where every individual is observed both before and after the treatment. In that particular case, one could apply any of the existing heterogeneous treatment effect estimators [e.g., , Hill, 2011 , Künzel et al., 2019 , Nie and Wager, 2021 by taking the difference between the outcomes from the two time periods as the new outcome variable. In our setting, we make the less stringent assumption of conditional parallel trends, and allow covariate shifts for the two time periods which makes the setting significantly more challenging. Adapting advances in semiparametric efficiency theory and leveraging flexible nonparametric machine learning methods for this task is our main methodological contribution.

The difference-in-differences approach to treatment effect estimation was popularized by Card and Krueger [1994] , and has since become ubiquitous in the social sciences. Angrist and Pischke [2008] and Lechner [2011] provide a textbook treatment and a broad literature review. Building on Abadie [2005] , we are here most interested in extensions of classical difference-in-differences methods that leverage covariate information to make the parallel trends assumption more plausible. Several authors have also recently extended differencein-differences analyses in other complementary directions. Arkhangelsky [2018] and Athey and Imbens [2006] consider difference-in-differences type designs where there may be nonadditive treatment effects. Abadie et al. [2010] , , , Ben-Michael et al. [2018] , Xu [2017] and Li and Li [2019] develop methods that can be applied in the panel setting in which the same individuals are observed in both the preand post-periods, whereas in our setting we observe separate cross-sectional data in each period.

Most of the existing literature with repeated cross sections, including Abadie [2005] , Li and Li [2019], Sant'Anna and Zhao [2018] , Chang [2020] , assume that there is no covariate shift across cross-sections from the same state: they require that the joint distribution of (X i , S i ) does not vary with T i , i.e., that (X i , S i ) ⊥ ⊥ T i . Our approach does not require this assumption (see Proposition 1 and the following comment), as this assumption may be hard to justify with cross-sectional data where we are not able to survey exactly the same people in the pre-and post-periods. For example, in a ride-sharing application, Lu et al.

[2018] estimates the effects of a dynamic pricing feature on drivers' behaviors by leveraging a natural experiment where a software bug temporarily disables a dynamic pricing feature for certain drivers. In this case, we may expect the distribution of the covariates X i for active drivers varies both with exposure S i and time T i .

Methodologically, we build on a large body of work in nonparametric estimation of heterogeneous treatment effects well. One approach is to reduce the "regularization bias" that might occur. Examples of this line of work include Athey and Imbens [2016] , Hahn et al. [2017] , and Shalit et al. [2016] . Another approach, the one we choose to adopt, is to develop meta-learning procedures that do not depend on a specific machine learning method. Key examples of such works are Künzel et al. [2019] and Nie and Wager [2021] . Our decomposition of τ (x) is conceptually similar to the orthogonal moments constructions from Robinson [1988] , and more broadly, from Belloni et al. [2011] , Bickel et al. [1998 ], Newey [1994 , Scharfstein et al. [1999] , Van Der Laan and Rubin [2006] and others.

In the difference-in-differences design, flexible modeling and estimation that goes beyond the standard two-way fixed effects and linearity assumptions has also drawn considerable interest. Abadie [2005] considers inverse propensity stratification based methods. Another approach considers more flexible outcome models, see [e.g., Heckman et al., 1998 , Meyer, 1995 . Recently, Chang [2020] , Sant'Anna and Zhao [2018] and Zimmert [2018] proposed doubly robust variants of the approach of Abadie [2005] that also allow for heterogeneity in τ (x).

The key orthogonality property of our proposed heterogeneous treatment effect estimator relies on a new decomposition for the outcome model (2) motivated by Robinson [1988] . The conditional probabilities of an observation being in state S i or time period T i conditionally on X i = x play a central role in our analysis [Rosenbaum and Rubin, 1983] . We write these quantities as

We write e s,t (X i ) = P S i = s, T i = t X i . We also write m(x) = E Y i X i = x for the conditional response function marginalizing over T i and S i , and

for the conditional effect of S marginalizing over T and S respectively. We write the conditional covariance of S i and T i as

Finally, for convenience, we let Z i = (X i , S i , T i ). Given this notation, we can verify the following (the derivation is given in the Supplementary Materials).

Proposition 1. Suppose we have access to an independent and identically distributed sequence of tuples Y i and Z i = (X i , S i , T i ). Under the model (2), our data-generating distribution admits a representation

where E i Z i = 0, and

Furthermore, all terms in the above decomposition are orthogonal in the following sense:

The key property of this representation is the orthogonality property (13), which will enable flexible estimation of treatment effects at parametric rates as discussed in the following section. In the setting of Abadie [2005] and Sant'Anna and Zhao [2018] , their assumption gives T i ⊥ ⊥ S i X i , which implies ∆(X i ) = 0. As an immediate corollary to Proposition 1, this decomposition then simplifies to a functional form closely reminiscent of Robinson's transformation [Robinson, 1988] :

More generally, we see that when ∆(X i ) is close to 0, all expressions underlying (11) and (12) are well-conditioned, and we expect estimation using (11) to be stable. Conversely, if T i and S i are highly correlated conditionally on (12) could become unstable; this is as expected, because if S i and T i are highly correlated, then we do not expect their interaction effect to be well identified. Finally, we note that all nuisance components in the decomposition above are marginal quantities, and thus can be estimated using all of the data. This property is desirable for empirical performance as it is more data efficient when we need to estimate them in a small-sample regime.

As a building block of our proposed heterogeneity treatment effect estimator, we first consider estimation in a setting where the treatment effect itself is constant τ (x) = τ in the representation (11), but all other nuisance components defined above, i.e., m(x), ν(x), ς(x), s(x), t(x) and ∆(x), may vary with x. The standard approach to estimating τ in this setting is to write a two-way fixed effect model of the form

and to interpret the coefficient on ST as an estimate of the treatment effect. However, as shown in our experiments, this simple linear regression-based approach to treatment effect estimation may be severely biased in the setting where the linear model (15) is misspecified.

Here, we propose the transformed regression (TR) estimator with cross-fitting (shown in Algorithm 1). The method is based on the decomposition (11), which is motivated by a decomposition used by Robinson [1988] to estimate parametric components in partial linear models. Robinson's decomposition has also been used in many other recent works, such as in for causal forests, Robins [2004] for G-estimation, as well as in Chernozhukov et al. [2018a] and Zhao et al. [2017] . The transformed regression estimator, motivated by Robinson, also has good theoretical properties. In Theorem 2, we show that the transformed regression estimator is √ n-consistent and asymptotically normal under considerably more generality than simply running an OLS regression with the model (15). We note that cross-fitting helps avoid overfitting of the estimates and also serves as a proof technique to show √ n-consistency on τ . Having √ n−consistency on τ enables us to build valid asymptotic confidence intervals for τ , while we still allow nonparametric estimation on the nuisance components without imposing linearity assumptions such as in (15). The proof of the theorem is in the Supplementary Materials.

Theorem 2. Under the conditions of Proposition 1, suppose furthermore that τ (x) = τ is constant and that the following conditions hold:

1 Split the data into Q roughly equal folds, I 1 , I 2 , ..., I K , with K fixed, to be used for cross-fitting.

, with data not in I k , using any supervised learning method for prediction accuracy (the superscript of −I k denotes using data not in the k-th fold). 3 Estimateν −I k (x) as a heterogeneous "treatment effect" of S i while ignoring T i ; and estimateς −I k (x) as a "treatment effect" of T i ignoring S i . Both can leverage methods designed for heterogeneous treatment estimation in the single cross-section case.

, using the estimated nuisance parameters following (12). Then, for j ∈ I k , obtain point estimatesĈ −I k (Z j ) and

6 Combine predictions from different folds I k :

1. Overlap: the conditional probabilities e s,t (x) are bounded away from 0 by some small η > 0 for all values of t, s and x.

2. Consistency: for any estimated nuisance parameterμ(x), such asm(x),ν(x) andς(x), we have that: sup

3. Risk decay: for any estimated nuisance parameterμ(x), we have:

. Boundedness: all the nuisance parameters are uniformly bounded:

Then, writingτ T R as the transformed regression estimater obtained using Algorithm 1, and τ * as the transformed regression estimator with oracle nuisance parameters, we have

and

In step 3 of Algorithm 1, ς(x) and ν(x) can be estimated with methods for heterogeneous treatment effect estimation. In our simulations, we use causal forests , but we note that other estimators in the literature can also be used here [e.g., , Hill, 2011 , Künzel et al., 2019 , Nie and Wager, 2021 .

If we ever want to use the transformed regression estimator which assumes a constant treatment effect, it is important to understand how it behaves under misspecification. Interestingly, as shown in Proposition 3, even when τ (x) is not constant, the transformed regression estimator converges to a weighted average of τ (x) with positive weights, mirroring the findings in Crump et al. [2009] and Li et al. [2018] . The proof for Proposition 3 is found in the Supplementary Materials.

Proposition 3. If we use the transformed regression estimator from Algorithm 1, and the conditions from Theorem 2 are satisfied, then

where V T R is the same variance term from Theorem 2.

We also note that when in the setting of Abadie [2005] , where T i ⊥ ⊥ S i X i and so ∆(X i ) = 0, the above simplifies to:

When τ (x) is not constant, the transformed regression estimator can thus be thought of as a weighted mean of the treatment effect, where more weight is given to the data points that are likely to appear with all four (S i , T i ) pairs.

In this section, we relax the assumption from the previous section that the underlying treatment effect is constant, and aim to estimate the heterogeneous treatment effect. We propose a flexible nonparametric estimator in the difference-in-differences setup that draws inspiration from recent advances in heterogeneous treatment effect estimation in the single cross-section case.

1 Split the data into K roughly-equal folds I 1 , ..., I K for cross-fitting. 2 Following steps 2 to 4 of Algorithm 1, estimate the nuisance parametersm −I k (x),

Also following step 4 of Algorithm 1, produce point estimatesĈ −I k (Z j ) and H −I k (Z j ) according to (16) for j ∈ I k . 3 Estimate the cross-fitted heterogeneous treatment effectτ (−Ik) (·) for data in I k aŝ (21) where Λ n (·) is some regularization term. For j ∈ I k , useτ (−Ik) (X j ) as the estimate for τ (X j ).

We adapt our estimator of a constant causal parameter τ into an estimator for a heterogeneous treatment function τ (x), by turning the estimation equation underlying the former estimator into a loss function. The R-learner [Nie and Wager, 2021] follows this strategy to derive a heterogeneous treatment effect estimator from Robinson's decomposition [Robinson, 1988] in the setting of a single cross section. We follow the same strategy here with repeated cross sectional data. Using the decomposition from (11), we can estimate treatment effect heterogeneity with the following algorithm R-DiD:

Recall from the last section that when the treatment effect τ (x) is contant, the transformed regression estimator achieves √ n rate for estimating the treatment effect parameter τ . When τ (x) is a nonparametric function, we can no longer achieve √ n parametric rates. Instead, we aim to show a quasi-oracle result, i.e., even if estimating the nuisance components m(x), ν(x), ς(x), ∆(x), s(x) and t(x) have a slow convergence rate, we can still achieve fast nonparametric rates on the treatment effect function τ (x) as if we had known these nuisnace components perfectly. See Chernozhukov et al. [2018b] , Luedtke and van der Laan [2016], van der Laan and Dudoit [2003] for similar developments. In this work, we leverage the recent result from Foster and Syrgkanis [2019] that generalizes the quasi-oracle bounds in Nie and Wager [2021] .

In particular, for any treatment effect functionτ and nuisance functionμ where the nuisance function µ(x) includes m(x), ν(x), ς(x), ∆(x), s(x) and t(x), define the loss function

Suppose in Algorithm 2, instead of leveraging cross-fitting to learn nuisance components, we use sample splitting, i.e. we split the data in half, and use the first half to learn all the nuisance componentsμ, and use the second half to estimateτ . The following then holds as a direct consequence of Theorem 1 in Foster and Syrgkanis [2019], Theorem 4. Suppose the conditional probabilities e s,t (x) are bounded away from 0 by some constant η > 0 for all values of t, s and x, and the outcome |Y | is bounded by M . Suppose further that K = 2 in Algorithm 2, and letτ =τ (−2) be the treatment effect estimate on the first fold. Suppose there exist rate functions r n such that

and

The theorem above shows that the required learning rate on the nuisance components is only at a 4th-order growth rate compared to the learning rate on the target parameter τ . Note that if the nuisance components µ were known, (24) would be in terms of L(τ , µ) − L(τ, µ). We refer readers to Section 4 in Foster and Syrgkanis [2019] to see that in the case of empirical risk minimization, the cost to relate these quantities gets absorbed, and the same conclusion holds. The proof of the theorem relies on the orthogonality results from Proposition 1 and is included in the Supplementary Materials.

In this section, we test the validity of our methods in a variety of simulation setups where the true underlying treatment effect can be both constant and non-constant. In the simulations, we generate n i.i.d. samples X i of dimension p from some underlyng distribution P ; the pre/post-treatment time indicator T i and the state indicator S i are generated with a multinomial distribution over the four pairs (T i , S i ) ∈ {(1, 1), (1, 0), (0, 1), (0, 0)}. The outcomes Y i are generated from (5) with i X i ∼ N (0, 1). Next, we present results in the case where the underlying treatment effect is constant as well in the case where it is heterogeneous.

We compare five methods in estimating heterogeneous treatment effects, four of which serve as baselines. OLS is as shown in (4). The T-Learner runs a separate regression for each of the four conditional quantities in (2) with a regression forest, and follows (2) to build the estimateτ from the four separate regressions. The CF-time learner runs a causal forest to learn the time-wise treatment effect in the treated location, i.e.

and then runs another causal forest to learn the time-wise treatment effect in the control location, i.e.

and subtracts the two estimates. On the other hand, The CF-state learner runs a causal forest to learn the state-wise treatment effect in the treated time period, i.e.

and then runs another causal forest to learn the time-wise treatment effect in the control period, i.e.

and subtracts the two estimates.

We compare the above four baselines with the R-DiD estimator outlined in Algorithm 2. In particular, we note that causal forests can be understood as an instantiation of the R-learner with random forests, and can be used in Step 3 of Algorithm 2. 1 All regressions in all of the methods under comparison are implemented with regression forests from the package grf . In Algorithm 2, we takeĤ as the "outcome" andĈ as the "treatment" in a causal forest. Along with CF-time and CF-state, they are all implemented with the package grf. We consider the following four setups:

Setup A X i ∼ N (0, I d×d ); easy treatment effect τ (x) = x 4 + 0.5x 5 ; easy conditional effects ρ(x) = 1/(1 + exp(x 3 )) + 4x 2 5 , ξ(X) = 1/(1 + exp(x 4 )) + 3x 2 6 ; easy baseline b(x) = max(x 1 + x 2 , 0) + 4x 2 6 ; constant propensity for S i : s(x) = 0.6, but constant propensity for T i , t(x) = 0.4, and S i is independent from T i .

Setup B X i ∼ N (0, I d×d ); easy treatment effect τ (x) = 0.5(x 1 + x 2 + x 3 ); challenging conditional effects ρ(x) = 5(sin(x 1 x 2 π) + 2(x 3 − 0.5) 2 ) and ξ(x) = 5(sin(x 1 x 2 π) + 2x 2 5 ). There is no baseline effect, i.e. b(x) = 0; propensities are constant s(x) = t(x) = 0.5, with S i and T i independent.

Setup C X i ∼ N (0, I d×d ); there exists no time or state effect: ρ(x) = ξ(x) = 0, but highly correlated baseline effect b(x) = 2 sin(1.5x 1 ) and propensities, where e 1,1 (x) = 0.5 + 0.5(1 − 6η) sin(1.5x 1 ), and e s,t (x) = (1 − e 1,1 (x))/3, for (s, t) = (1, 1); τ = 1.

Setup D X i ∼ N (0, I d×d ); easy non-constant treatment effect τ (x) = 3x 1 + 2x 4 ; easy conditional effects ρ(X) = 2x 5 , ξ(X) = 0; easy baseline b(x) = max(x 1 + x 2 + x 4 + x 6 , 0); non-constant propensity for S i : s(x) = min(max(η, (1/(1 + exp(−0.5x 3 )))), 1 − η), and nonconstant propensity for T i , t(x) = min(max(η, (1/(1 + exp(−0.5x 2 )))), 1 − η), and S i is independent from T i . This is a well-specified setup for OLS.

The results of the simulations are shown in Table 1 . For each of the n values, we draw n training data points, and another separate n testing data points. We report results on this indepedent testing set. We run the experiments 200 times, and the mean squared error of each algorithm is shown below. Our proposal R-DiD performs well, while in simulation setup D, we see that OLS performs particularly well given it's a well-specified setup. However, in practice it is less conceivable to have a well-specified setup, and our algorithm shines due to its flexibility to model nonparametrically and robustness towards estimation errors in nuisance components.

To test our methods in practice, we revisit a study from Angrist and Kugler [2008] on the effect of import restrictions on self-employment incomes. The context is as follows: Columbia was one of the major suppliers of cocaine to North America and Europe before the 2000s. Before 1994, Columbia relied on coca leaf supplies from Bolivia and Peru, which it then refined to produce cocaine. United States and local militaries disrupted the air-bridge that brought the coca leaves to Columbian refiners. As a result, coca cultivation shifted to Columbia's rural areas. The study from Angrist and Kugler [2008] then examines, among other things, the effect of the restriction of coca import on self-employment incomes in Columbia. The authors define self-employment income as income from individual short-term contract, from the sale of domestically produced goods, and from agricultural productions. The authors conclude that the decrease in import has a positive effect on the self-employment incomes. The dataset includes repeated cross-section survey data and contains the following information about individuals: gender, age, number of family members, immigrant status, marital status, and whether they lived in rural or urban areas, which we use as covariates X i . The individuals come from one of three coca-growing regions (Bolivar, Cauca and Narino), or one of thirteen non-growing regions (Atlantico, Sucre, Cordoba, Santander, Boyaca, Caldas, Risaralda, Quindio, Tolima, Huilda, Antioquia, Choco, Valle de Cauca); see map in Angrist and Kugler [2008] for details. The dataset also includes a number of demilitarized zones, which we omit in our analysis. 2 Individuals from growing regions are classified as exposed, with S i = 1, and those in non-growing regions classified as non-exposed, with S i = 0, because the increase in coca production could only benefit those in growing regions. As for time periods, we take 1993 as the pre-treatment period, with T i = 0; because air interdictions occured throughtout 1994, we take 1995 as the post-treatment period, with T i = 1. The outcomes are the log self-employment incomes, Y i .

We first fit the treatment effect function τ (x) using the heterogeneous effect estimator Figure 1 : Histogram ofτ (·) fitted using Algorithm 2 on the dataset from Angrist and Kugler [2008] . We see that there appears to be heterogeneity in the treatment: the treatment effects lie in two groups, one around 0.1 and the other around 0.5. Our results in Table  3 suggest that these two groups are treatment effects for people living in urban and rural areas respectively.

R-DiD as in Algorithm 2. Figure 1 provides a histogram of the estimated treatment effects. We see that there appears to be heterogeneity in the treatment, as the histogram exhibits two distinct masses. Beyond heterogeneity estimates, we compare a few different methods for the estimation of average treatment effects in this setup. In Appendix A, we describe the augmented inverse propensity estimator (AIPW) that builds upon the heterogeneous treatment effect estimates from Algorithm 2. An alternative approach is AMLE, which builds upon the Augmented Minimax Linear Estimation (AMLE) approach of Hirshberg and Wager [2021] by constructing balancing weights. See an earlier working paper [Lu et al., 2019] for details on this estimator and comparison with AIPW. We also compare our TR estimator that assumes constant treatment effects, as well as Sample Means by taking the empirical means of (3), and the standard OLS as in (15). 3 See Table 2 for result comparisons. Table 2 shows the results for estimating average treatment effects. There exists a nontrivial difference between the estimated effect from methods that allow for heterogeneity in τ (x) and those that do not, which suggests that there is a weighting effect going on when the likelihood of state and treatment time indicators vary with covariates. All the methods suggest that there is a positive effect of the air interdictions on the log self-employment income, as found in Angrist and Kugler [2008] . Recall that, when the treatment effect is non- 2008] . The first row is obtained by running the AMLE method on only individuals who live in the urban area; the second row comes from running AMLE on individuals who live in the rural area. As shown, there seems to be a much larger treatment effect on individuals living in rural areas than on individuals in urban areas.

constant, TR obtains estimates of weighted average treatment effects, given by Proposition 3. Because TR is giving significantly different results from the non-constant effect methods, we suspect that in this dataset the treatment effect varies with covariates. We also note that, in this case, the OLS estimator is fairly closely aligned with the TR estimator, suggesting that assuming linear nuisance components did not have too big an effect on our estimation of τ . However, it may have been difficult to argue a-priori that the linear specification used by OLS would be innocuous here.

As further evidence of treatment heterogeneity, Table 3 shows average effects obtained by AMLE, separately for urban and rural regions. Thus, estimates provided by the transformed regression estimator should not necessarily be interpreted as average treatment effects, and may instead better be interpreted as targeting a weighted estimand following the discussion in Section 4.

As a final sanity check, we also run a placebo analysis with years 1998 and 2000 as pre-and post-treatment period, and we use our methods to check if there was an effect in these years. It is encouraging that all the methods suggest there is a negligible effect on the treatment effect, as shown in Table 4 

Given our proposed heterogeneous treatment effect estimator, one direct way to construct an average treatment effect estimator is by building on results for doubly robust estimation as developed in Chernozhukov et al. [2018c] . In order to do so, we first note that the average treatment parameter can be written as a weighted average of outcomes using inverseprobability-style weights as follows: Chernozhukov et al. [2018c] implies that if we obtain good estimates both ofĝ and the inverse-probability-style weights outlined above, then we can obtain semiparametrically efficient estimates of τ using the doubly robust form such as in Augmented Inverse Propensity Weighitng (AIPW) [Robins and Rotnitzky, 1995] . We will refer to this algorithm as AIPW, which takes on the following steps:

1. Following the cross-fitting steps 2 to 4 of Algorithm 1, estimate the nuisance parametersm −I k (x),ς −I k (x),ν −I k (x)),Â −I k (z),B −I k (z),Ĉ −I k (z) using data not in the I k , where I 1 , ..., I K are the K folds of the data. In the same way, fit nonparametric regressions for the propensitiesê −I k 1,1 (x),ê −I k 1,0 (x),ê −I k 0,1 (x) andê −I k 0,0 (x).

2. Run step 3 of Algorithm 2 to obtain cross-fitted point estimatesτ (X j ) =τ −Ik (X j ), for each j ∈ I k .

3. For each j ∈ I k , produce the cross-fitted point estimates:

andγ

4. Estimate the average treatment effect E [τ (x)] as:

Abadie [2005] and Sant'Anna and Zhao [2018] explored a special case of this approach in cases where (X i , S i ) are independent from T i . As a result, only the estimate of the single propensity s(x) is needed to perform a similar estimation, as opposed to the quantity γ(x, s, t).

While plug-in estimation with the doubly robust score admits for algorithmically simple estimation of the average treatment effect, probabilities e 1,1 (x) etc. may get quite small, and so inverting even slightly inaccurate propensity estimates may result in instability.

Qingyuan Zhao, Dylan S Small, and Ashkan Ertefaie. Selective inference for effect modification via the lasso. arXiv preprint arXiv:1705.08020, 2017.

Michael Zimmert. Efficient difference-in-differences estimation with high-dimensional common trend confounding. arXiv preprint arXiv:1809.01643, 2018.

Assuming parallel trends conditioning on covariates X i as in (2), Y i can be written from (5) as follows:

We could then write the above as:

where ρ and ξ are defined as in (6). Note that ultimately, we want a decomposition of the following form:

we thus seek coefficients A, B, C and D. Note that we have the following expressions:

Equating the coefficients from (S3) and (S1), we have that:

Summing the equations, we get that D ≡ 1. We can then reformulate our objective as:

which is the final form expressed in the proposition. We re-express the components as:

m(x) = g(x, 0, 0) + (e 1,1 (x) + e 1,0 (x))ξ(x) + (e 1,1 (x) + e 0,1 (x)ρ(x) + e 1,1 (x)τ (x) equating coefficients of (S4) and (S2), we have:

Let us define ∆ s (x) = e 1,1 (x)/s(x) − e 0,1 (x)/(1 − s(x)) and ∆ t (x) = e 1,1 (x)/t(x) − e 1,0 (x)/(1 − t(x)). The second and third equations become:

which gives us:

To obtain the expressions in (12), we just have to get rid of the conditional expectations form above, and it is easy to check that we will obtain the desired forms. Now we check (13). First, note that:

Because E T i X i , S i = 1 = e 1,1 (X i )/s(X i ), the above also evaluates to zero. Similarly, we can show that E A(Z i ) X i , S i = 0 and E B(Z i ) X i , T i = 0. For E C(Z) X, S , note that:

Thus we have

where the third equality follows by noting that

We can similarly check that E C(Z i )B(Z i ) X i = 0.

First we prove the statement √ n(τ T R −τ * ) p − → 0, whereτ T R is the transformed regression estimator, andτ * is the transformed regression estimator with oracle nuisance parameters. From (18), we see that

Because each |I k |/n is approximately 1/K, which is fixed as n grows, and the number of folds K is also fixed, √ n(τ T R −τ * ) p − → 0 will follow if we show that √ n(τ −Ik −τ * ) → p 0. The proof consists of two steps. Firstly, we show that, if two nuisance parameters µ(x) and ν(x), estimated byμ −I k (x) andν −I k (x) respectively, satisfy conditions 2, 3 and 4 from above, then the estimation of their product µ(x) · ν(x) byμ −I k (x) ·ν −I k (x), and their sum µ(x) + ν(x) byμ −I k (x) +ν −I k (x), also satisfy conditions 2, 3, and 4. As a result, the estimate ofĤ −I k (x) andĈ −I k (x), by sums and products of other nuisance parameters, such asm −I k (x) andê −I k 1,1 (x), also satisfies the conditions listed above. Secondly, we show the desired result, assuming thatĤ −I k (x) andĈ −I k (x) satisfy the conditions above.

We start with the first step. Assume thatμ −I k (x) andμ −I k (x) satisfy conditions 2 to 4. We want to estimate µ(

Consistency and boundedness comes easily from the consistency and boundedness ofμ −I k (x) andμ −I k (x). Risk decay comes as follows:

where we used the assumption that the nuisance parameters are all bounded by M . Risk decay for µ + ν follows similarly. Recall we want to show the following:

which will follow if we can show that

Because with these bounds, and noting the Taylor expansion of f (x, y) = x y at some point x 0 and y 0 = 0 is given by:

we have that:

where Slutsky's theorem is used in the last equality. Let us first check (S7). Note that:

We check that the three terms above are small. For (S11), we first check that

where J(X i ) = e 1,1 (X i )/t(X i ) and K(X i ) = e 1,1 (X i )/s(X i ) and we similarly define the cross-fitted analogueĴ −I k (·) andK −I k (·). Many terms above vanished because of (13). We will just show that

and the argument for E τ (

Thus, again by (13), we have

where the first line conditions on the data not in fold I k . Similarly, we have that

, are all zero. Now we are ready to show that (S11) is

where the first equality is precisely because all terms of the form H(Z j ) Ĉ −I k (Z j ) − C(Z j ) have mean zero, and the bound comes from the risk decay assumption. To show (S12) is o p 1 √ n , the argument is exactly the same as the one we gave above: the only difference being we have to check that the term C(Z j ) Ĥ −I k (Z j ) − H(Z j ) has mean zero. Note that:

where terms vanish in the last equality because of (13). Again we just have to show that

have mean zero, which follows exactly the same argument as we used for (S14). Thus (S12) is also o p have to check that (S13) is o p 1 √ n , which follows by Cauchy-Schwarz and the risk decay assumption:

Thus we have finished checking (S7). We proceed to check (S8). Note that:

The second term above is immediately o p 1 √ n because of the risk decay assumption. Thus we only have to check that (S21) is o p 1 √ n , which will follow exactly the same argument for (S11), once we show that C(Z j ) Ĉ −I k (Z j ) − C(Z j ) has mean zero. Note then

where terms vanish in the last equality because of (13). Following the same argument for showing (S14), we can show that the two terms above are both zero. Thus

where (S25) use Slutsky's theorem and the central limit theorem. The desired result then follows.

The proof of Proposition 3 is essentially the same as that of Theorem 2: by exactly the same argument, we know that √ n(τ T R −τ * ) p − → 0, whereτ * is the transformed regression estimator with oracle nuisance parameters. Then we just have to show

E[C 2 (z)] 2 , which follows from the following calculation:

where (S29) uses Slutsky's theorem and the central limit theorem. 

due to the fact that E Z = 0 from Proposition 1. Next, we check Assumption 1. Note that we have to check it with directional derivative with respect to all nuisance components m(x), ν(x), ς(x), s(x), t(x), ∆(x). By symmetry, we only need to check it with m(x), ν(x), s(x), ∆(x). −C(Z)(τ (X) + t τ (τ (X) − τ (X))))C(Z)(τ (X) − τ (X))] /dt ν | tν =0,tτ =0 = −2E [A(Z)(ν (X) − ν(X))C(Z)(τ (X) − τ (X))] = −2E E A(Z)(ν (X) − ν(X))C(Z)(τ (X) − τ (X)) X = −2E (ν (X) − ν(X))(τ (X) − τ (X))E A(Z)C(Z) X = 0

where the last equality follows from E A(Z)C(Z) X = 0 as in Proposition 1. Before we continue, we define

and similarly, we define B t∆ (Z i ), C t∆ (Z i ). We further define = −2dE[M t∆,tτ (Z)C t∆ (Z)(τ (X) − τ (X))]/dt ∆ | t∆=0,tτ =0

For the second term, we notice that E M (Z) Z = 0 from Proposition 1, and so

For the first term, − 2E C(Z)(τ (X) − τ (X)) ∂M t∆,0 (Z) ∂t ∆ t∆=0 = 2E C(Z)(τ (X) − τ (X)) ∂A t∆ (Z)ν(X) ∂t ∆ t∆=0 + ∂B t∆ (Z)ς(X) ∂t ∆ t∆=0 + ∂C t∆ (Z)τ (X) ∂t ∆

.

First, we note that ∂A t∆ (Z)ν(X) ∂t ∆ t∆=0 = ∂ 1 − (∆+t∆(∆ −∆)) 2 (X) s(X)(1−s(X))t(X)(1−t(X)) −1 ∂t ∆ t∆=0 T − t(X) − ∆(X) (S − s(X)) s(X)(1 − s(X)) ν(X) + 1 − ∆ 2 (X) s(X)(1 − s(X))t(X)(1 − t(X)) −1 ∂ T − t(X) − (∆(X)+t∆(∆ (X)−∆))(S−s(X)) s(X)(1−s(X)) ∂t ∆ t∆=0 ν(X) = f 1 (X)A(Z) + f 2 (X)f 3 (X, S)

for some function f 1 , f 2 , f 3 . The upshot is that 2E C(Z)(τ (X) − τ (X)) ∂A t∆ (Z)ν(X) ∂t ∆ t∆=0 = 2E [C(Z)(τ (X) − τ (X))f 1 (X)A(Z)] + 2E [C(Z)(τ (X) − τ (X))f 2 (X)f 3 (X, S)]

= 2E E C(Z)(τ (X) − τ (X))f 1 (X)A(Z) X + 2E E C(Z)(τ (X) − τ (X))f 2 (X)f 3 (X, S) X, S = 2E (τ (X) − τ (X))f 1 (X)E C(Z)A(Z) X + 2E (τ (X) − τ (X))f 2 (X)f 3 (X, S)E C(Z) X, S = 0 + 0 = 0 where the second to the last equality follows from E C(Z) X, S = 0 and E C(Z)A(Z) X = 0 from Proposition 1. Following a similar argument, we can show that 2E C(Z)(τ (X) − τ (X)) ∂B t∆ (Z)ς(X) ∂t ∆ t∆=0 = 0.

Note that 

Given s(X)+ ∆(X) t(X) = Ω(η) and |A(Z)| = Ω(η), etc., we have C(Z) = Ω(η 2 ). Thus, λ = Ω(η 4 ) and κ = 0 in Assumption 3.

To show Assumption 4 holds, we can check that and this similarly holds when we take derivative with respect to s(X) or t(X). We conclude that β 2 = O(M 2 /η 4 ) in Assumption 4. The result then directly follows from Theorem 1 in Foster and Syrgkanis [2019] .

Semiparametric difference-in-differences estimators. The Review of Economic Studies

Synthetic control methods for comparative case studies: Estimating the effect of california's tobacco control program

Consequences of employment protection? the case of the Americans with Disabilities Act

Rural windfall or a new resource curse? coca, income, and civil conflict in colombia

Mostly harmless econometrics: An empiricist's companion

The Jackie (and Jill) Robinson effect: Why do congresswomen outperform congressmen?

Dealing with a technological bias: The difference-in-difference approach

Synthetic difference in differences

Recursive partitioning for heterogeneous causal effects

Identification and inference in nonlinear difference-indifferences models

Estimating treatment effects with causal forests: An application

Matrix completion methods for causal panel data models

Generalized random forests

Square-root lasso: pivotal recovery of sparse signals via conic programming

The augmented synthetic control method

How much should we trust differences-in-differences estimates?

Efficient and Adaptive Estimation for Semiparametric Models

Evaluating the employment impact of a mandatory job search program

Covid-19, lockdowns and well-being: Evidence from google trends

Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania

Double/debiased machine learning for difference-in-differences models

Double/debiased machine learning for treatment and structural parameters

Double/debiased machine learning for treatment and structural parameters

Locally robust semiparametric estimation

Dealing with limited overlap in estimation of average treatment effects

Methods for evaluating changes in health care policy: the difference-in-differences approach

Natural and quasi-experiments in economics

The asymptotic variance of semiparametric estimators

Quasi-oracle estimation of heterogeneous treatment effects

Effect of minimum-wage determinations in Oregon

Semiparametric efficiency in multivariate regression models with missing data

Optimal structural nested models for optimal sequential decisions

Root-n-consistent semiparametric regression

The central role of the propensity score in observational studies for causal effects

Doubly robust difference-in-differences estimators. Available at SSRN 3293315

Adjusting for nonignorable drop-out using semiparametric nonresponse models

Estimating individual treatment effect: generalization bounds and algorithms

Regression shrinkage and selection via the lasso

Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples

Targeted maximum likelihood learning

Designing difference in difference studies: best practices for public health policy research. Annual review of public health

Generalized synthetic control method: Causal inference with interactive fixed effects models