title: Difference in Differences with Time-Varying Covariates
authors: Caetano, Carolina; Callaway, Brantly; Payne, Stroud; Rodrigues, Hugo Sant'Anna
date: 2022-02-07

This paper considers identification and estimation of causal effect parameters from participating in a binary treatment in a difference in differences (DID) setup when the parallel trends assumption holds after conditioning on observed covariates. Relative to existing work in the econometrics literature, we consider the case where the value of covariates can change over time and, potentially, where participating in the treatment can affect the covariates themselves. We propose new empirical strategies in both cases. We also consider two-way fixed effects (TWFE) regressions that include time-varying regressors, which is the most common way that DID identification strategies are implemented under conditional parallel trends. We show that, even in the case with only two time periods, these TWFE regressions are not generally robust to (i) time-varying covariates being affected by the treatment, (ii) treatment effects and/or paths of untreated potential outcomes depending on the level of time-varying covariates in addition to only the change in the covariates over time, (iii) treatment effects and/or paths of untreated potential outcomes depending on time-invariant covariates, (iv) treatment effect heterogeneity with respect to observed covariates, and (v) violations of strong functional form assumptions, both for outcomes over time and the propensity score, that are unlikely to be plausible in most DID applications. Thus, TWFE regressions can deliver misleading estimates of causal effect parameters in a number of empirically relevant cases.
We propose both doubly robust estimands and regression adjustment/imputation strategies that are robust to these issues while not being substantially more challenging to implement. In this paper, we study difference in differences identification strategies where (i) the parallel trends assumption holds only after conditioning on covariates, (ii) some or all of these covariates vary over time, and (iii) some of the time varying covariates could themselves be affected by the treatment. A number of papers (e.g., Heckman, Ichimura, Smith, and Todd (1998), Abadie (2005) , and Sant'Anna and Zhao (2020)) show that certain causal effect parameters, typically the average treatment effect on the treated (ATT), are identified under conditional parallel trends assumptions. These types of conditional parallel trends assumptions are attractive in applications where the path of untreated potential outcomes may differ among units with different characteristics. However, work in the econometrics literature typically considers the case where covariates involved in the parallel trends assumption either do not vary over time or are "pre-treatment" (that is, the value of a time-varying covariate is set to its value in the pre-treatment period; see Bonhomme and Sauder (2011) and Lechner (2011) for some discussions on using pre-treatment values of time-varying covariates). In contrast, empirical work in economics often only includes covariates that vary over time. In this case, identification must implicitly assume that the treatment does not have an effect on the covariates themselves, which is implausible in some applications. Covariates that could have been affected by participating in the treatment are often referred to as "post-treatment" or as "bad controls." The received wisdom seems to be that this type of covariate should not be included in empirical research. 
1 For example, Angrist and Pischke (2008) discuss "bad controls" in the context of deciding whether or not to control for occupation when studying causal effects of graduating from college on earnings. In that case, occupation is likely to be affected by attending college and, therefore, can make comparisons in earnings among those with the same occupation who graduated or did not graduate from college hard to interpret (even if college were randomly assigned). Angrist and Pischke (2008) note that "...we would do better to control only for variables that are not themselves caused by education." We return to a related example later in this section on the effect of job displacement on earnings where occupation is potentially affected by job displacement.

However, we provide several examples below where it seems important to condition on the value of the covariate that would have occurred in the absence of the treatment; in these cases, it would not generally be sufficient to just "not include" this sort of covariate. We propose several different strategies for dealing with time-varying covariates that show up in the parallel trends assumption while also potentially being affected by the treatment.

Difference in differences identification strategies are most often implemented using two-way fixed effects (TWFE) regressions. The most common version of a TWFE regression that includes covariates is the following

$$Y_{it} = \theta_t + \eta_i + \alpha D_{it} + X_{it}'\beta + v_{it} \tag{1}$$

where $\theta_t$ is a time fixed effect, $\eta_i$ is individual-level unobserved heterogeneity (i.e., an individual fixed effect), $D_{it}$ is the treatment indicator, $X_{it}$ are time-varying covariates, and $v_{it}$ is an error term. In the TWFE regression in Equation (1), $\alpha$ is the parameter of interest; it is often interpreted as "the causal effect of the treatment" or, at least, is hoped to be a weighted average of underlying heterogeneous treatment effects. Being able to include covariates is one of the main attractions of using a TWFE regression to implement a DID design. For example, Angrist and Pischke (2008) write: "A second advantage of regression-DD is that it facilitates empirical work with regressors other than switched-on/switched off dummy variables." 2

TWFE regressions have come under much scrutiny in recent work in terms of how well they perform for implementing DID identification strategies. In particular, TWFE regressions can perform very poorly in the presence of more than two time periods, variation in treatment timing across units, and treatment effect heterogeneity (particularly, treatment effect dynamics); see Goodman-Bacon (2021) and de Chaisemartin and D'Haultfoeuille (2020). Although TWFE regressions with only two time periods are known to be reliable under unconditional parallel trends, we point out a number of problems with TWFE regressions for implementing DID identification strategies that rely on conditional parallel trends assumptions, even in the case with only two time periods. In particular, we show that TWFE regressions can deliver poor estimates of the average treatment effect on the treated (which is the natural target parameter for DID identification strategies) for any of four reasons: (1) time-varying covariates that are themselves affected by the treatment, (2) ATTs and/or parallel trends assumptions that depend on the pre-treatment level of time-varying covariates in addition to (or instead of) only the change in the covariates over time, (3) ATTs and/or paths of untreated potential outcomes that depend on time-invariant covariates, and (4) violations of strong functional form assumptions both for outcomes over time and for the propensity score. All four of these issues are common in applications in economics. In applications where none of the four issues mentioned above occur, TWFE regressions deliver a weighted average of conditional ATTs where all the weights are positive.
However, even in this best-case scenario, TWFE regressions still suffer from a "weight-reversal" property similar to the one pointed out in Słoczyński (2020) under unconfoundedness with cross-sectional data. In our case, conditional ATTs for relatively uncommon values of the covariates among the treated group (relative to the untreated group) are given large weights while conditional ATTs for common values of the covariates among the treated group are given small weights. In order to get around this weight-reversal issue, one needs to additionally rule out heterogeneous treatment effects across different values of the covariates. Adding this condition to the previous four implies that TWFE regressions deliver the ATT; however, we stress that this is a very stringent set of requirements for TWFE regressions to perform well for estimating the ATT when the parallel trends assumption depends on time-varying covariates.

We propose several new strategies for dealing with time-varying covariates that are required for the parallel trends assumption to hold. When the researcher is confident that the covariates evolve exogenously with respect to the treatment, we provide a doubly robust estimand for the ATT (these arguments are similar to the ones in Sant'Anna and Zhao (2020) for the case with time-invariant covariates). Doubly robust estimators have the property that they deliver consistent estimates of the ATT if either an outcome regression model or a propensity score model is correctly specified, thus giving researchers an extra chance to correctly specify a model relative to regression adjustment or propensity score weighting strategies. Besides this, our doubly robust estimands can also be used in the context of the double/debiased machine learning literature where the propensity score and outcome regression model can be estimated using a wide variety of modern machine learning techniques (see Chernozhukov et al.
(2018) for the general case and Chang (2020) in the context of DID). 3 When the time-varying covariates can be affected by the treatment, we provide sufficient (and easy-to-interpret) conditions under which the strategy of conditioning on "pre-treatment" covariates, which is common in the econometrics literature, is justified. We also discuss other cases where this strategy is not reasonable. In these cases, we propose regression adjustment-type and doubly robust-type expressions for the ATT. Finally, when a researcher is willing to make an additional functional form assumption for untreated potential outcomes, we propose some even simpler approaches based on regression adjustment (these approaches are also broadly similar to recent "imputation estimators" proposed in Liu, Wang, and Xu (2021), Gardner (2021), and Borusyak, Jaravel, and Spiess (2021)). We also show that stronger functional form assumptions for the model for untreated potential outcomes can allow for parallel trends-type assumptions for the covariates to be sufficient for identification of the ATT.

Before moving into our main arguments, we provide three examples to illustrate the types of questions that we address in the current paper. We revisit these applications at relevant parts of the paper.

Example 1 (Stand-your-ground laws). Cheng and Hoekstra (2013) study the effects of stand-your-ground laws on homicides and other crimes. They use state-level data and exploit variation in the timing of stand-your-ground laws across states in order to identify policy effects. For some of their results, they condition on time-varying covariates that include state-level demographics, the number of police officers in the state, the number of people incarcerated, median income, poverty rate, and spending on assistance and public welfare.
Although it is debatable whether or not some of these covariates could be affected by the treatment (particularly the number of police officers and the number of people incarcerated), by running TWFE regressions that include these covariates, Cheng and Hoekstra (2013) at least implicitly argue that these covariates evolve exogenously from the treatment. Whether this is true or not, for exposition purposes we will assume that none of the covariates used in this example are affected by the treatment.

Example 2 (Shelter-in-place orders). A number of recent papers study the effect of shelter-in-place orders on various outcomes including mobility (see, for example, Weill, Stigler, Deschenes, and Springborn (2021) and references therein), labor market outcomes (e.g., Gupta et al. (2020)), and consumer spending (e.g., Chetty, Friedman, Hendren, and Stepner (2020)). Paths of all of these outcomes (in the absence of shelter-in-place orders) likely depend on the current number of Covid-19 cases due to individuals making different choices about staying at home or continuing to work based on the local "state" of the pandemic. This suggests that parallel trends assumptions ought to condition on the number of Covid-19 cases that would have occurred if the policy had not been implemented. Moreover, since Covid-related policies are designed to affect the number of Covid-19 cases, this would be a case with a time-varying covariate that is likely to be affected by the treatment.

3 Using machine learning in this context may be particularly useful because the expressions for the ATT involve conditioning on time-varying covariates across different time periods. In many applications, time-varying covariates may be highly serially correlated, and it may be challenging to specify simple parametric models involving these covariates. Machine learning estimators, however, may perform much better in this setting.

Example 3 (Job Displacement).
Research on job displacement typically invokes parallel trends assumptions to identify causal effects of job displacement on workers' earnings. If, in the absence of job displacement, paths of earnings depend on the occupation, industry, or union status of a worker, then it would be desirable to condition on these variables in the parallel trends assumption. However, most empirical work on job displacement does not condition on these variables, presumably due to each of these possibly being affected by job displacement. 4 Moreover, Barnette, Odongo, and Reynolds (2021) argue that differences in the distribution of pre-displacement occupations are likely an important explanation for the magnitude of effects of job displacement; similarly, Brand (2006) reports relatively large effects of job displacement on occupation. The examples above are broadly representative of applications that invoke DID identification assumptions with time varying covariates. The first example involves time-varying covariates that can reasonably be thought of as evolving exogenously with respect to the treatment. The following two examples both involve covariates that are potentially affected by the treatment. Later in the paper, we point out some further conceptual differences between these latter two examples. Our paper shares a similar motivation to Zeldow and Hatfield (2021) which considers different possible sources of bias due to controlling for time-varying covariates that are possibly affected by the treatment. That paper mainly considers how sensitive existing strategies are (e.g., controlling for only pre-treatment covariates or additionally including lagged outcomes) to covariates that can be affected by the treatment. Relative to that paper, we make explicit assumptions on how the treatment can affect the covariates and, under these extra conditions, are able to propose estimation strategies that are guaranteed to perform well (up to regularity conditions) in those cases. 
Our paper is also related to the literature on causal inference with panel data using structural nested mean models (Robins (1997)) and marginal structural models (Robins, Hernán, and Brumback (2000)); see Blackwell and Glynn (2018) for a recent review. These approaches, however, are based on "sequential ignorability" assumptions rather than allowing for time-invariant unobserved heterogeneity. Sequential ignorability implies that treated and untreated potential outcomes are independent of treatment status conditional on pre-treatment values of covariates (and possibly pre-treatment outcomes). 5 Unlike the bulk of this literature, the current paper focuses on the case where a researcher would like to invoke a parallel trends assumption, rather than sequential ignorability, for identification. However, the current paper also invokes an additional assumption on how treated and untreated potential covariates are generated; this type of assumption is not made in this literature. The reason for this is that the timing that we consider differs from what is typically considered in the literature on sequential ignorability; in our case, units potentially become treated, then their covariate realizes (and may itself be affected by treatment), and this covariate needs to be controlled for identification. By contrast, the sequential ignorability literature typically has the covariate realized first, then the treatment, then the outcome, and controlling for, effectively, the covariate in the previous period is sufficient for identifying parameters of interest. That said, like the current paper, that literature does take seriously how covariates evolve over time and how participating in the treatment can affect covariates themselves. Of papers broadly in this literature, the most similar to the current paper is Imai, Kim, and Wang (2018), which focuses on a conditional parallel trends assumption that can hold after conditioning on past values of the covariates as well as past values of the outcome.

4 Some papers do include occupation, industry, and/or union controls in "robustness checks," and others study how effects of job displacement vary by whether or not a worker remains in the same industry, occupation, or union status following job displacement, which is broadly similar to controlling for each of these (see, for example, Topel (1991), Jacobson, LaLonde, and Sullivan (1993), and Stevens (1997)).

5 Another difference between the current paper and much of the sequential ignorability literature is that these papers are typically primarily interested in recovering causal effects of different treatment paths (e.g., where each unit can move into or out of the treatment in each period). The arguments in our paper could likely be extended in this direction, but our main results apply to the case where there are only two time periods and treatment can only take place in the second time period.

Our paper is also related to the literature on mediation analysis. Like a mediator, our covariates can be affected by treatment participation. However, the mediation literature is typically interested in decomposing treatment effects into direct effects of the treatment and indirect effects due to the effect of the treatment on the mediator (see Huber (2020) for a recent review of this literature). Our paper is less ambitious in that we only seek to identify the overall effect of the treatment on outcomes; the tradeoff is that we are generally able to make weaker assumptions than would be required to separately recover direct and indirect effects of participating in the treatment. That said, it would be interesting to extend our arguments to additionally identifying direct and indirect effects of participating in the treatment, and it seems likely that existing arguments from the mediation analysis literature could be applied in this case.
Our paper is relatively more similar to Rosenbaum (1984), Lechner (2008), and Flores and Flores-Lagunes (2009); these papers consider identification of treatment effect parameters under unconfoundedness (and with cross-sectional data) where the covariates that are required for the unconfoundedness assumption to hold could have been affected by the treatment. Besides this, our paper is related to a large literature in econometrics on strict exogeneity and pre-determinedness in panel data models (see, for example, Arellano and Honoré (2001)).

[...] (1). In some ways, the decompositions in these papers are more general than our decomposition as they all consider the case with more than two time periods and with variation in treatment timing. On the other hand, our results zoom in on the "textbook" case with exactly two periods and where no one is treated in the first period; our decomposition emphasizes a number of possible limitations of TWFE regressions even in the case with exactly two periods. Indeed, moving to more complicated cases with more periods and variation in treatment timing would make the case for using TWFE regressions even weaker, as it would introduce additional issues particularly related to using already treated units as comparison units (which can lead to negative weights on underlying treatment effect parameters), as all three papers mentioned above imply. See Remark 4 below for a more detailed comparison.

For this section, we focus on a baseline case where the researcher has access to two time periods of panel data. We label the second time period $t^*$ and the first time period $t^* - 1$, and use $t$ to indicate a generic time period. In each time period, we observe outcomes $Y_t$, a time-varying covariate $X_t$, and time-invariant covariates $Z$. As is standard in the DID literature, we suppose that no one is treated in the first time period. We use the binary variable $D$ to indicate whether or not a unit participates in the treatment.
Importantly for our setup, we allow for the possibility that the time-varying covariate can itself be affected by the treatment; in order to do this, we define $X_t(1)$ to be the value that the covariate would take if a unit participated in the treatment and $X_t(0)$ to be the value that the covariate would take if a unit did not participate in the treatment; for simplicity, we often refer to these as "treated potential covariates" and "untreated potential covariates." Next, we define treated potential outcomes as $Y_t(1, X_t(1))$ (this is the outcome that a unit would experience in time period $t$ if they participated in the treatment and their covariate took on its value under the treatment) and untreated potential outcomes as $Y_t(0, X_t(0))$ (this is the outcome that a unit would experience in time period $t$ if they did not participate in the treatment and their covariate took its value in the absence of the treatment). For most of the arguments in the current paper, it is sufficient to use the shorter notation $Y_t(1) := Y_t(1, X_t(1))$ and $Y_t(0) := Y_t(0, X_t(0))$.

In this setup, the observed covariates in each time period are:

$$X_{t^*} = D X_{t^*}(1) + (1 - D) X_{t^*}(0) \quad \text{and} \quad X_{t^*-1} = X_{t^*-1}(0).$$

In other words, in the second time period, we observe treated potential covariates for units that participate in the treatment, and we observe untreated potential covariates for units that do not participate in the treatment. In the first time period, since no units are treated yet, we observe untreated potential covariates for all units. Likewise, observed outcomes are given by

$$Y_{t^*} = D Y_{t^*}(1) + (1 - D) Y_{t^*}(0) \quad \text{and} \quad Y_{t^*-1} = Y_{t^*-1}(0).$$

Following the vast majority of the DID literature, we target identifying the average treatment effect on the treated (ATT). It is given by

$$ATT = \mathbb{E}[Y_{t^*}(1) - Y_{t^*}(0) \mid D = 1],$$

which is the average difference between treated and untreated potential outcomes among the treated group.

Throughout the paper, we make the following assumptions.

Assumption 1. The observed data $\{Y_{it^*}, Y_{it^*-1}, X_{it^*}, X_{it^*-1}, Z_i, D_i\}_{i=1}^n$ are independent and identically distributed.

Assumption 2 (Conditional Parallel Trends).

$$\mathbb{E}[\Delta Y_{t^*}(0) \mid X_{t^*}(0), X_{t^*-1}, Z, D = 1] = \mathbb{E}[\Delta Y_{t^*}(0) \mid X_{t^*}(0), X_{t^*-1}, Z, D = 0],$$

where $\Delta Y_{t^*}(0) := Y_{t^*}(0) - Y_{t^*-1}(0)$.

Assumption 1 says that we observe iid panel data.
Assumption 2 says that, on average, the path of untreated potential outcomes is the same for the treated group as for the untreated group after conditioning on untreated potential covariates in time period $t^*$, pre-treatment covariates $X_{t^*-1}$, and time-invariant covariates $Z$. Relative to standard conditional parallel trends assumptions (Heckman, Ichimura, and Todd (1997), Abadie (2005), and Callaway and Sant'Anna (2021)), the set of covariates being conditioned on includes untreated potential covariates, which are unobserved for the treated group and therefore can complicate existing identification strategies. Assumption 3 is an overlap assumption, and this type of assumption is standard in the treatment effects literature. Part (a) implies that, for any values of $X_{t^*}$, $X_{t^*-1}$, and $Z$, there will be some untreated units with those values of the covariates in the population. Part (b) is similar but holds for any values of $X_{t^*-1}$, $W_{t^*-1}$, and $Z$.

Next, we provide two distinct assumptions for dealing with covariates that vary over time. We call the first assumption covariate exogeneity because it implies that participating in the treatment does not change the distribution of covariates for the treated group. This assumption is technically weaker than assumptions like $X_{t^*}(1) = X_{t^*}(0)$ for all units, though this would certainly be a leading case where this sort of condition might hold. Assumption Cov-Exogeneity allows for covariates to change values over time, but it imposes that (in distribution) they are not affected by participating in the treatment. This sort of condition may be reasonable in some applications (e.g., Example 1 above). In other cases, this assumption may be less reasonable (e.g., Examples 2 and 3 above).

Assumption Cov-Unconfoundedness is an unconfoundedness assumption for untreated potential covariates.
It allows for the treatment to affect the time-varying covariates, but it says that the distribution of untreated potential covariates is the same for the treated group and the untreated group after conditioning on the vector of pre-treatment covariates $(X_{t^*-1}, W_{t^*-1}, Z)$. This assumption allows us to recover the conditional distribution of untreated potential covariates for the treated group. This distribution is a key ingredient for identifying the ATT below.

In Assumption Cov-Unconfoundedness, we allow for the possibility that $W_{t^*-1}$ is empty; in fact, this is a leading case. In this case, unconfoundedness for untreated potential covariates holds after conditioning on the lag of the time-varying covariates $X_{t^*-1}$ and other time-invariant covariates $Z$. Below, we connect this specific condition to the common practice in the econometrics literature on DID of conditioning on pre-treatment values of time-varying covariates. With a slight abuse of notation, we also allow for the possibility that $W_{t^*-1}$ includes the lagged outcome $Y_{t^*-1}$. For example, another interesting case is when $W_{t^*-1} = Y_{t^*-1}$, so that covariate unconfoundedness holds after conditioning on pre-treatment covariates, time-invariant covariates, and the pre-treatment outcome. Interestingly, we show below that, under this condition, both the path of outcomes over time and the lag of the outcome show up in the expression for $ATT$, which is unusual in DID applications (see Chabé-Ferret (2017) for related discussion). In the results below, we provide separate results that invoke either Assumption Cov-Exogeneity or Assumption Cov-Unconfoundedness.

Next, we state our main identification result.

Theorem 1. Under Assumptions 1 and 2,

(1) if, in addition, Assumption Cov-Exogeneity and Assumption 3(a) hold, then

$$ATT = \mathbb{E}[\Delta Y_{t^*} \mid D = 1] - \mathbb{E}\Big[ \mathbb{E}[\Delta Y_{t^*} \mid X_{t^*}, X_{t^*-1}, Z, D = 0] \,\Big|\, D = 1 \Big];$$

(2) if, in addition, Assumption Cov-Unconfoundedness and Assumption 3(b) hold, then

$$ATT = \mathbb{E}[\Delta Y_{t^*} \mid D = 1] - \mathbb{E}\Big[ \mathbb{E}\Big[ \mathbb{E}[\Delta Y_{t^*} \mid X_{t^*}, X_{t^*-1}, Z, D = 0] \,\Big|\, X_{t^*-1}, W_{t^*-1}, Z, D = 0 \Big] \,\Big|\, D = 1 \Big].$$

The intuition for part (1) of Theorem 1 is relatively straightforward.
Under the conditional parallel trends assumption and when covariates evolve exogenously, one can recover the ATT by (i) taking the path of outcomes experienced by the treated group and adjusting it by the path of outcomes experienced by the untreated group (conditional on $X_{t^*}$, $X_{t^*-1}$, and $Z$) and then (ii) accounting for differences in the distribution of $X_{t^*}$, $X_{t^*-1}$, and $Z$ across groups. This result is very similar to existing results with time-invariant covariates (e.g., Heckman, Ichimura, and Todd (1997) as well as Lechner (2011)).

The intuition for part (2) is somewhat more complicated. The term $\mathbb{E}[\Delta Y_{t^*} \mid X_{t^*}, X_{t^*-1}, Z, D = 0]$ is the average change in outcomes over time conditional on $X_{t^*}$, $X_{t^*-1}$, and $Z$ among the untreated group. Under Assumption 2, this is the path of outcomes that, conditional on $X_{t^*}(0)$, $X_{t^*-1}$, and $Z$, the treated group would have experienced if they had not participated in the treatment. The next expectation is over the distribution of $X_{t^*}(0)$ (conditional on $X_{t^*-1}$, $W_{t^*-1}$, and $Z$) for the untreated group. Under Assumption Cov-Unconfoundedness, this is the same conditional distribution that $X_{t^*}(0)$ follows for the treated group. Finally, the outside expectation is over the distribution of $X_{t^*-1}$, $W_{t^*-1}$, and $Z$ for the treated group and, therefore, allows for these variables to be distributed differently in the treated group relative to the untreated group.

Corollary 1 provides two important special cases for the results in part (2) of Theorem 1. The first part provides a formal justification for the common practice in the econometrics literature on DID with time-varying covariates of including only "pre-treatment" covariates.
In particular, this result says that, when unconfoundedness holds for the time-varying covariate conditional on time-invariant covariates and other pre-treatment covariates, then it is sufficient for the researcher to only "account for" pre-treatment and time-invariant covariates in order to recover the ATT. 6 The second part of Corollary 2 is also interesting in that it relates the ATT to an expression that includes the lagged outcome. There are a number of papers that explore the idea of including lagged outcomes in a DID framework (e.g., Chabé-Ferret (2017), Imai, Kim, and Wang (2018), and Zeldow and Hatfield (2021)), though it is challenging to provide a justification for including lagged outcomes in DID settings; our approach justifies the inclusion of lagged outcomes (in the manner specified in the corollary) in cases where unconfoundedness for the time-varying covariate holds after conditioning on the lag of the outcome variable.

Next, we provide alternative expressions for $ATT$ that are useful for estimation.

[...] (2020)). Besides this, they also provide a connection to the DID literature on estimating the ATT under conditional parallel trends using double/debiased machine learning; see, in particular, Chang (2020). This may be particularly useful in the first case where the propensity score and outcome regression depend on time-varying covariates in both periods. These can be practically difficult to estimate because, in many cases, $X_{t^*}$ and $X_{t^*-1}$ may be highly collinear. Conventional methods typically invoke functional form assumptions that impose, for example, that these functionals depend only on $\Delta X_{t^*}$. As noted below, these sorts of restrictions may be implausible in many applications.

Remark 1. In cases where time-varying covariates may be affected by the treatment, we mainly focus on the case where an unconfoundedness-type assumption holds for the time-varying covariates.
A natural alternative would be to invoke parallel trends assumptions for the time-varying covariates themselves. Importantly, though, our above arguments require identifying the entire conditional distribution of $X_{t^*}(0)$ for the treated group (not just its mean). 7 That said, difference in differences approaches that recover the distribution of untreated potential outcomes, such as Callaway and Li (2019) and Callaway, Li, and Oka (2018), could be applied here (though note that these approaches require additional assumptions). Likewise, the change-in-changes approach in Athey and Imbens (2006) and Melly and Santangelo (2015), which can recover distributions of untreated potential outcomes, could be applied to the time-varying covariates in this context. Another potential limitation of these approaches in this context is that they typically only point identify distributions of continuous outcomes and, therefore, would not be very suitable for a number of relevant applications that involve discrete time-varying covariates.

Although neither of our assumptions on untreated potential covariates in Assumptions Cov-Exogeneity and Cov-Unconfoundedness is directly testable, the condition in Assumption Cov-Unconfoundedness can be "pre-tested"; that is, one can check if it holds in pre-treatment time periods. One simple idea is to compute pseudo-ATTs in pre-treatment periods; if both Assumption 2 (the conditional parallel trends assumption) and Assumption Cov-Unconfoundedness hold in pre-treatment periods, then these pseudo-ATTs should be equal to 0. Alternatively, one can directly pre-test Assumption Cov-Unconfoundedness: for some pre-treatment period $t$, Assumption Cov-Unconfoundedness implies that the distribution of $X_t \mid X_{t-1}, W_{t-1}, Z, D = d$ is the same for both the treated and untreated groups. This sort of test could be implemented using results from the goodness-of-fit testing literature (e.g., Bierens (1982) and Stute (1997)).
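The nested-expectation logic described above for part (2) of Theorem 1, and the "pre-treatment covariates only" special case from Corollary 1, can be illustrated with a small simulation. The Python sketch below is only a rough illustration under stated assumptions: the data-generating process, the variable names, the linear regression models, the unit shift in the covariate caused by treatment, the true ATT of 2, and the absence of $Z$ and $W_{t^*-1}$ are all hypothetical choices made here and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
TRUE_ATT = 2.0  # hypothetical treatment effect used in this illustration

# Hypothetical two-period design (every number below is an assumption).
x_pre = rng.normal(size=n)                                   # X_{t*-1}
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * x_pre))).astype(float)
x_post_untreated = 0.5 * x_pre + rng.normal(size=n)          # X_{t*}(0)
x_post = x_post_untreated + 1.0 * d                          # observed X_{t*}
# The untreated path of outcomes depends on the level of X_{t*}(0), so
# parallel trends holds only conditional on untreated potential covariates.
dy = 1.0 + x_post_untreated + TRUE_ATT * d + rng.normal(size=n)


def ols_fit(features, target):
    """OLS coefficients of target on an intercept plus features."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def ols_predict(coef, features):
    """Fitted values from coefficients produced by ols_fit."""
    design = np.column_stack([np.ones(features.shape[0]), features])
    return design @ coef


treated, untreated = d == 1, d == 0
both = np.column_stack([x_post, x_pre])

# Step 1: model E[dY | X_{t*}, X_{t*-1}, D=0] using untreated units.
m_coef = ols_fit(both[untreated], dy[untreated])
# Step 2: average the step-1 fit over X_{t*}(0) given X_{t*-1}, again
# among untreated units (here via a second linear regression).
g_coef = ols_fit(x_pre[untreated], ols_predict(m_coef, both[untreated]))
# Step 3: average over the treated group's distribution of X_{t*-1}.
att_nested = dy[treated].mean() - ols_predict(g_coef, x_pre[treated]).mean()

# Shortcut: condition on pre-treatment covariates only.
pre_coef = ols_fit(x_pre[untreated], dy[untreated])
att_pre_only = dy[treated].mean() - ols_predict(pre_coef, x_pre[treated]).mean()

# Naive alternative: plug in the observed (treatment-affected) X_{t*}.
att_naive = dy[treated].mean() - ols_predict(m_coef, both[treated]).mean()

print(f"nested: {att_nested:.3f}  pre-only: {att_pre_only:.3f}  "
      f"naive: {att_naive:.3f}  truth: {TRUE_ATT}")
```

In this particular design, the nested and pre-treatment-only estimates coincide (with nested linear regressions this is an algebraic identity) and are close to the true ATT, while plugging in the observed post-treatment covariate absorbs the part of the effect that the treatment has on the covariate. A different data-generating process, or nonlinear conditional means, would call for more flexible first-step models.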
To conclude this section, we revisit the three examples from the introduction. Example 2 (Shelter-in-place, cont'd) In our example of the effect of shelter-in-place orders on various economic outcomes, the parallel trends assumption held after conditioning on the number of Covid-19 cases that would have occurred if the policy had not been implemented. That is, "untreated potential Covid-19 cases" plays the role of $X_{t^*}(0)$ in this case. Callaway and Li (2021) show that, under a SIRD model (the leading pandemic model in the epidemiology literature), controlling for the pre-treatment "state" of the pandemic is sufficient for unconfoundedness to hold. That is, the conditions in part (1) of Corollary 1 and part (2) of Corollary 2 hold when one wants to control for the number of untreated potential Covid-19 cases. Example 3 (Job displacement, cont'd) Finally, recall our example on the effect of job displacement on earnings, where the parallel trends assumption holds only after conditioning on, for example, "untreated potential occupation"; that is, the occupation that a worker would have had if they had not been displaced from their job. In this case, an unconfoundedness assumption for occupation may be more likely to hold if it conditions on (i) pre-treatment time-varying covariates (including pre-treatment occupation), (ii) time-invariant covariates (such as demographics and education), and (iii) pre-treatment earnings. In particular, conditioning on pre-treatment earnings could be important if there are occupation-specific wage premiums and high-earning workers are more likely (in the absence of job displacement) to stay in the same occupation over time relative to low-earning workers. This application would then be covered by the results from part (2) of Corollary 1.

In this section, we consider how to interpret $\alpha$ in the TWFE regression in Equation (1).
We continue to consider the "textbook" case with two time periods where no one is treated in the first time period and where some (but not all) units become treated in the second time period. This is a best case for TWFE regressions, as it does not introduce the well-known problems related to using already-treated units as comparison units that show up when using TWFE regressions with multiple periods, variation in treatment timing, and treatment effect heterogeneity (Goodman-Bacon (2021) and de Chaisemartin and D'Haultfoeuille (2020)). In the case with exactly two periods, it is helpful to equivalently re-write Equation (1) in first differences, where we define $\Delta X_{t^*} := (1, X_{t^*} - X_{t^*-1})'$, which is the change in the covariate over time augmented with an intercept term for the time fixed effect. We also slightly abuse notation by taking $\beta$ to include an extra parameter in its first position corresponding to the intercept. Our interest in this section is in determining what kind of conditions are required to interpret $\alpha$ as the ATT or at least as a weighted average of some underlying treatment effect parameters. Denote the linear projection of $\Delta Y_{t^*}$ on $\Delta X_{t^*}$ by $L(\Delta Y_{t^*}|\Delta X_{t^*})$, and define the corresponding projection error $e := \Delta Y_{t^*} - L(\Delta Y_{t^*}|\Delta X_{t^*})$. Similarly, define the linear projection of $D$ on $\Delta X_{t^*}$ by $L(D|\Delta X_{t^*})$ and the corresponding projection error $u := D - L(D|\Delta X_{t^*})$. Below, to keep the notation concise, it is useful to define $X^{all}(d) := (X_{t^*}(d), X_{t^*-1}, Z)$. We also define $ATT_{X^{all}(0)}(X^{all}(0)) := \mathbb{E}[Y_{t^*}(1) - Y_{t^*}(0) \mid X^{all}(0), D = 1]$, which is the ATT conditional on $X_{t^*}(0)$, $X_{t^*-1}$, and $Z$. And we further define $p(X^{all}(0)) := P(D = 1 \mid X^{all}(0))$. Next, we state a main result decomposing $\alpha$ from the TWFE regression. Proposition 1. Under Assumptions 1, 2, and 3(a), The result in Proposition 1 indicates that $\alpha$ is equal to a weighted average of underlying conditional ATTs (we discuss the nature of the weights in more detail below) plus a number of undesirable "bias" terms.
Next, we introduce several additional assumptions that are useful for eliminating the bias terms in Proposition 1. We also use additional notation that defines different types of conditional ATTs. (a) $ATT_{X^{all}(0)}(X^{all}(0)) = ATT_{X_{t^*}(0), X_{t^*-1}}(X_{t^*}(0), X_{t^*-1})$ a.s. Assumption 6 (Linearity of conditional ATTs and paths of untreated potential outcomes). (a) There exists a $\delta_1$ such that AT Assumption 7 (Linearity of propensity score in terms of change in time-varying covariates). There exists a $\delta_p$ such that $p(X^{all}(0)) = \Delta X_{t^*}(0)'\delta_p$. The first part of Assumption 4 says that, conditional on $X_{t^*}(0)$ and $X_{t^*-1}$, conditional ATTs (2021)). In those cases, it sometimes holds by construction (e.g., when the covariates are all discrete and a full set of interactions is included in the model). In our case, though, it seems particularly implausible as (i) it requires the propensity score to depend only on changes in covariates over time, and (ii) even with fully interacted discrete regressors, the propensity score is unlikely to be linear in changes in the regressors over time. For example, suppose that the only covariate is binary. In the cross-sectional case considered by the other papers mentioned above, the propensity score would be linear by construction. However, the change in the covariate over time would be a single variable that can take the values -1, 0, or 1; moreover, the change in a binary covariate over time is equal to 0 both when the covariate is equal to 1 in both periods and when the covariate is equal to 0 in both periods. This suggests that the propensity score would not be linear in the change in covariates over time even in this very simple case.

Proposition 2. Under Assumptions 1, 2, 3(a), Cov-Exogeneity, 4, 5, and 6, $\alpha = \mathbb{E}[\omega_{ATT}(X^{all}(0))\,ATT_{X^{all}(0)}(X^{all}(0)) \mid D = 1]$ where $\omega_{ATT}$ and $\omega_e$ are defined in Proposition 1. (a) If, in addition, Assumption 7 holds, then and $\mathbb{E}[\omega_{ATT}(X^{all}(0)) \mid D = 1] = 1$.
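The binary-covariate example can be checked numerically. The following is a small sketch with made-up numbers: the four propensity score values and the equal cell probabilities are assumptions chosen purely for illustration, and the helper `project` is a hypothetical name. It compares the population linear projection of the propensity score on the change in the covariate with a saturated model in the levels of the covariate.

```python
import numpy as np

# Cells (x_prev, x_curr) for a binary covariate observed in two periods.
cells = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])
p_true = np.array([0.2, 0.5, 0.4, 0.7])  # assumed propensity scores per cell
weights = np.full(4, 0.25)               # assumed equal cell probabilities

def project(design, target, w):
    """Population linear projection of target on design with cell weights w."""
    gram = design.T @ (design * w[:, None])
    return design @ np.linalg.solve(gram, design.T @ (w * target))

dx = cells[:, 1] - cells[:, 0]           # takes the values 0, 1, -1, 0
fit_dx = project(np.column_stack([np.ones(4), dx]), p_true, weights)

# Saturated model in levels: intercept, both levels, and their interaction.
sat = np.column_stack(
    [np.ones(4), cells[:, 0], cells[:, 1], cells[:, 0] * cells[:, 1]]
)
fit_sat = project(sat, p_true, weights)

print("fit using the change only: ", np.round(fit_dx, 3))
print("fit using saturated levels:", np.round(fit_sat, 3))
```

Because the cells (0, 0) and (1, 1) share the same change of 0, the projection on the change assigns them the same fitted value even though their assumed propensity scores differ, whereas the saturated levels model reproduces every cell exactly.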
The second part of Proposition 2 says that, if we are willing to assume that the propensity score is equal to the linear projection of the treatment on the change in time-varying covariates over time, then the weights on conditional ATTs will have mean one and the nuisance expression in term (E) will be equal to zero. Even in this case, the weights have a "weight-reversal" property analogous to the one pointed out in Słoczyński (2020) in the context of unconfoundedness and cross-sectional data. What this means is that conditional ATTs are given more weight for values of the covariates that are relatively uncommon among the treated group relative to the untreated group, and that conditional ATTs are given less weight for values of the covariates that are relatively common among the treated group relative to the untreated group. Finally, if in addition to all the previous conditions, conditional ATTs are constant across different values of the covariates, then $\alpha$ will be equal to the ATT. This is a treatment effect homogeneity condition with respect to the covariates. It is somewhat weaker than individual-level treatment effect homogeneity, and it allows for treatment effects to still be systematically different for treated units relative to untreated units; instead, for the treated group, treatment effects cannot be systematically different across different values of the covariates. These results are much different from our earlier results in Section 2. Those results did not require any of the additional assumptions in Proposition 2. In fact, when covariates evolve exogenously with respect to the treatment (as under Assumption Cov-Exogeneity), the doubly robust expressions for the ATT in part (1) of Corollary 2 only require that either the propensity score or the outcome regression model be correctly specified; in cases where these are estimated using machine learning, even these parametric assumptions can be substantially relaxed.
Moreover, in contrast with the TWFE regressions considered in this section, our earlier results can accommodate cases where the time-varying covariates are affected by the treatment. Remark 3. It is worth pointing out that all of the extra conditions considered in Proposition 2 are sufficient conditions rather than necessary conditions. For example, it would be possible for some violations of these assumptions to "offset" each other so that $\alpha$ happens to be equal to the ATT. That said, there is no reason to expect this to happen in applications. (though, in both papers, the weights can be negative). These include a version of conditional parallel trends that holds when one conditions on the change in covariates over time 11 and an assumption on linearity of the propensity score conditional on changes in observed covariates over time. 12 None of these papers explicitly address the issue of time-varying covariates potentially being affected by the treatment.

11 Ishimaru (2022) does point out that "conditioning on [changes in time-varying covariates] may not be sufficient to make parallel trends plausible."

12 Another way that the decomposition in Ishimaru (2022) is more general than the one in the current paper is that that paper does not require the treatment to be binary. Ishimaru (2022) also considers an interesting extension on decomposing a modified TWFE regression that additionally includes time-varying coefficients on time-varying covariates. Based on his result, it seems likely that this sort of regression would not suffer from issues related to parallel trends depending on the levels of time-varying covariates rather than only changes in time-varying covariates over time. However, it appears that this regression would still suffer from the other issues mentioned in this section; that said, this is a distinct regression from the TWFE regression in Equation (1) that is much more commonly used in empirical work in economics.
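One of the failure modes discussed in this section can be illustrated with a two-period simulation. This is a sketch under assumptions chosen for illustration only (the data-generating process, coefficient values, and variable names are all hypothetical): the covariate change is unaffected by the treatment, but the path of untreated potential outcomes and the treatment effect both depend on a time-invariant covariate `z` that the first-differenced TWFE regression drops.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)                       # time-invariant covariate
dx = rng.normal(size=n)                      # covariate change, unaffected by treatment
d = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-z))).astype(int)

# Untreated outcome path depends on z as well as on the covariate change.
dy0 = 1.0 + 0.5 * dx + 1.0 * z + rng.normal(size=n)
effect = 1.0 + 2.0 * z                       # heterogeneous treatment effects
dy = dy0 + d * effect
att_true = float(effect[d == 1].mean())      # ATT among simulated treated units

def ols(design, y):
    """Least-squares coefficients of y on the columns of design."""
    return np.linalg.lstsq(design, y, rcond=None)[0]

ones = np.ones(n)

# TWFE in first differences: regress dy on an intercept, D, and dx.
alpha_twfe = float(ols(np.column_stack([ones, d, dx]), dy)[1])

# Regression adjustment: fit dy on (1, dx, z) for untreated units, then impute.
design = np.column_stack([ones, dx, z])
coef0 = ols(design[d == 0], dy[d == 0])
att_ra = float(np.mean(dy[d == 1] - design[d == 1] @ coef0))

print(f"true ATT: {att_true:.3f}")
print(f"TWFE alpha: {alpha_twfe:.3f}")
print(f"regression adjustment: {att_ra:.3f}")
```

In this design the TWFE coefficient overstates the ATT because treated units have systematically higher `z`, while the regression adjustment estimate that conditions on `z` stays close to the ATT.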
In this section, we provide several alternative strategies that involve stronger parametric assumptions on the path of untreated potential outcomes than we made in Section 2. The approaches discussed in this section are generally simpler to estimate than would be the case for the expressions coming from Section 2 and, in some cases, can allow for weaker (or at least alternative) assumptions on how the treatment affects time-varying covariates. The strategies that we propose in this section are also able to avoid the issues with TWFE regressions pointed out in Section 3. To start with, it is well known (e.g., Blundell and Costa Dias (2009), Gardner (2021), and Borusyak, Jaravel, and Spiess (2021)) that there is a close connection between unconditional parallel trends assumptions and the following model for untreated potential outcomes independent of treatment status in all time periods), but allows for $\eta$ to be distributed differently across groups and does not impose any modeling assumptions on treated potential outcomes. As discussed above, the econometrics literature on difference in differences often considers the case where the covariates in the parallel trends assumption are time invariant. In that case, the analogous model for untreated potential outcomes is given by where the distribution of $\eta$ can vary across groups (as well as vary with $Z$) and the key condition for the conditional parallel trends assumption to hold is that $\mathbb{E}[\Delta v_t \mid Z, D = 1] = \mathbb{E}[\Delta v_t \mid Z, D = 0]$ (see, for example, Heckman, Ichimura, and Todd (1997) for a discussion of this kind of model). In this setup, the main challenge is estimating $g_t(z)$ (though note that this is a practical estimation challenge rather than an identification challenge). The natural way to parameterize this model is where we now take $Z$ to include an intercept (and, therefore, $\delta_t$ absorbs the time fixed effect).
Given this framework, which can be consistently estimated from the regression of $\Delta Y_{t^*}$ on $Z$ using only observations from the untreated group. The same sort of arguments imply that, when there are some covariates that vary over time (as above, we consider the case of a single time-varying covariate but note that it is straightforward to extend these arguments to cases with more time-varying covariates), a natural motivating model is Moreover, the same sorts of arguments as above imply that Assumption 2 holds in this model. Similar to the previous case, the main practical challenge is that $g_t(z, x_t(0))$ is likely to be challenging to estimate. Like the previous case, the natural way to parameterize this model is Although it is straightforward to recover the parameters in Equation (6), recall that, Given that the parameters are identified, every term is identified in this expression except for $\mathbb{E}[\Delta X_{t^*}(0) \mid D = 1]$ (because $X_{t^*}(0)$ is not observed for the treated group). We briefly consider six settings for recovering $\mathbb{E}[\Delta X_{t^*}(0) \mid D = 1]$: three of these come from the assumptions we have already considered for untreated potential covariates and three involve parallel trends assumptions for untreated potential covariates. Several of these cases involve averaging over conditional expectations of $\Delta X_t(0)$. In this section we additionally impose linear models for these conditional expectations; under this extra condition, researchers are able to estimate the ATT while potentially allowing for the treatment to affect time-varying covariates using only regressions and averaging.

Case 2: Assumption Cov-Unconfoundedness holds conditional on $(Z, X_{t^*-1})$. In this case, if we are willing to assume the following linear model for untreated potential where it follows by the conditions in this case that $\mathbb{E}[u_{t^*} \mid Z, X_{t^*-1}, D = d] = 0$ for $d \in \{0, 1\}$.
Plugging this expression into Equation (6) implies that Thus, in this case, one can estimate $\delta^*_{2,t^*}$ and $\beta^*_{2,t^*}$ from a regression of the change in outcomes over time using the untreated group, and then estimate the ATT from the sample analogue of Thus, this particular case bypasses the need for actually estimating a separate model for the change in the time-varying covariate over time. This is perhaps not surprising, as these are the same conditions as in Section 2 where it was sufficient for the researcher to condition on the pre-treatment value of the covariates to recover the ATT.

Case 3: Assumption Cov-Unconfoundedness holds conditional on $X_{t^*-1}$, $W_{t^*-1}$, $Z$. In this case, where the first equality holds by the law of iterated expectations, and the second equality holds by Assumption Cov-Unconfoundedness and by assuming a linear model for the change in untreated covariates over time. This suggests estimating $\mathbb{E}[\Delta X_{t^*}(0) \mid D = 1]$ by running a regression of $\Delta X_{t^*}$ on $Z$, $X_{t^*-1}$, and $W_{t^*-1}$ using only untreated observations in order to estimate the parameters $\gamma_{t^*}$, $\lambda_{t^*}$, and $\xi_{t^*}$, and then estimating $\mathbb{E}[\Delta X_{t^*}(0) \mid D = 1]$ by using the sample analogue of the expression in Equation (7).

For this case, we assume that This expression is very similar to the one in Case 1, except that one should use the change in untreated potential covariates for the untreated group.

For this case, we assume that $\mathbb{E}[\Delta X_{t^*}(0) \mid Z, W_{t^*-1}, D = 1] = \mathbb{E}[\Delta X_{t^*}(0) \mid Z, W_{t^*-1}, D = 0]$. In this case, where the first equality holds by the law of iterated expectations, and the second equality holds by the conditional parallel trends assumption used in this case and a linearity assumption. Similarly to above, this suggests running a regression of $\Delta X_{t^*}$ on $Z$ and $W_{t^*-1}$ using only untreated observations to estimate $\gamma_{t^*}$ and $\xi_{t^*}$, and then estimating $\mathbb{E}[\Delta X_{t^*}(0) \mid D = 1]$ from the sample analogue of Equation (9).
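The regression-and-averaging steps described for the unconfoundedness cases can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the simulated data-generating process and all variable names are assumptions, the conditioning set omits the additional covariates $W_{t^*-1}$ for brevity, and the treatment is assumed to shift the covariate by one unit for treated units.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
z = rng.normal(size=n)
x_prev = z + rng.normal(size=n)
d = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-0.5 * (z + x_prev)))).astype(int)

dx0 = 0.3 + 0.4 * z - 0.2 * x_prev + rng.normal(size=n)  # untreated covariate change
dx = dx0 + 1.0 * d                                       # treatment shifts the covariate
dy = 1.0 + 0.5 * z + 0.8 * dx0 + rng.normal(size=n) + 2.0 * d  # true ATT is 2

def ols(design, y):
    """Least-squares coefficients of y on the columns of design."""
    return np.linalg.lstsq(design, y, rcond=None)[0]

ones = np.ones(n)
untreated, treated = d == 0, d == 1

# Step 1: outcome-change model on (1, z, dx) using untreated units only.
out_design = np.column_stack([ones, z, dx])
b = ols(out_design[untreated], dy[untreated])

# Step 2: covariate-change model on (1, z, x_prev) using untreated units only.
cov_design = np.column_stack([ones, z, x_prev])
g = ols(cov_design[untreated], dx[untreated])

# Step 3: average predicted untreated covariate change among treated units.
dx0_treated = float(np.mean(cov_design[treated] @ g))

# Step 4: subtract the imputed untreated outcome change from the observed one.
imputed = b[0] + b[1] * z[treated].mean() + b[2] * dx0_treated
att_hat = float(dy[treated].mean() - imputed)

# For contrast: plug in the observed, treatment-affected covariate change.
att_bad = float(np.mean(dy[treated] - out_design[treated] @ b))

print(f"imputation with modeled covariate change: {att_hat:.3f}")
print(f"imputation with observed covariate change: {att_bad:.3f}")
```

In this design, using the observed covariate change for treated units removes the part of the effect that operates through the covariate (here 0.8 times the one-unit shift), whereas modeling the untreated covariate change recovers an estimate close to the ATT of 2.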
All of the approaches discussed in this section are substantially more robust than the TWFE regressions discussed in Section 3. In particular, unlike TWFE regressions, they allow for the

Remark 6. This section has continued to consider the case with exactly two periods, but it is straightforward to extend these arguments to multiple periods and variation in treatment timing by estimating models for untreated potential outcomes using all available untreated observations (these are observations both for units that do not participate in the treatment in any time period as well as pre-treatment time periods for units that become treated at some point). Once the model for untreated potential outcomes has been estimated, one can "impute" untreated potential outcomes for treated observations, and weighted averages of differences between observed treated potential outcomes and imputed untreated potential outcomes correspond to various treatment effect parameters, depending on the weights chosen by the researcher.

In the current paper, we have considered DID identification strategies where the parallel trends assumption holds only after conditioning on time-varying covariates that may themselves be affected by the treatment. This setting is common in empirical applications in economics, and we have provided several approaches that offer a number of advantages relative to more commonly used TWFE regressions that include covariates (even in the case where there are only two time periods). In addition, the new approaches that we have proposed are generally not much more complicated to implement than TWFE regressions.

where the first equality is just the definition of the ATT, the second equality holds by adding and subtracting $\mathbb{E}[Y_{t^*-1}(0) \mid D = 1]$, and the third equality holds by writing potential outcomes in terms of their observed counterparts.
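The three equalities described in the preceding sentence can be written out as follows. This is a reconstruction from the verbal description rather than the authors' displayed equation; it assumes the two-period setup in which no unit is treated in period $t^*-1$, so that for treated units $Y_{t^*} = Y_{t^*}(1)$ and $Y_{t^*-1} = Y_{t^*-1}(0)$.

```latex
\begin{align*}
ATT &= \mathbb{E}[Y_{t^*}(1) - Y_{t^*}(0) \mid D = 1] \\
    &= \mathbb{E}[Y_{t^*}(1) - Y_{t^*-1}(0) \mid D = 1]
       - \mathbb{E}[Y_{t^*}(0) - Y_{t^*-1}(0) \mid D = 1] \\
    &= \mathbb{E}[\Delta Y_{t^*} \mid D = 1]
       - \mathbb{E}[\Delta Y_{t^*}(0) \mid D = 1]
\end{align*}
```

The first term on the last line involves only observed outcomes for the treated group, so the identification problem reduces to recovering the second term, the mean path of untreated potential outcomes for the treated group.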
For part (1), further notice that, where the first equality holds by the law of iterated expectations, the second equality holds by Assumption 2, and the last equality holds because $\Delta Y_{t^*}(0)$ and $X_{t^*}(0)$ are observed for the untreated group and uses Assumption Cov-Exogeneity to integrate over the distribution of observed covariates (i.e., treated potential covariates) for the treated group. Combining this expression with the previous one for the ATT completes the proof for part (1) of the result. For part (2), notice that where the first equality holds by the law of iterated expectations, the second equality holds by Assumption 2 (unlike part (1), this term is not immediately identified because we do not have an immediate analogue of the distribution of $X_{t^*}(0)$ in order to identify the outer expectation), the third equality holds by the law of iterated expectations, the fourth equality holds by Assumption Cov-Unconfoundedness (because, after conditioning on $(X_{t^*-1}, W_{t^*-1}, Z)$, the only randomness comes from $X_{t^*}(0)$), the fifth equality holds by writing potential outcomes in terms of their observed counterparts, and this term is identified because the distribution of $(X_{t^*-1}, W_{t^*-1}, Z)$ is identified for the treated group.

Proof. For part (1), the result holds immediately by the law of iterated expectations. For part (2), the result holds immediately from the expression in part (2) of Theorem 1 using $W_{t^*-1} = Y_{t^*-1}$.

Proof. For part (1), we omit the proof as, after invoking Assumption Cov-Exogeneity, this becomes the same case as with time-invariant covariates; see, for example, Sant'Anna and Zhao (2020) for this sort of result in the case with time-invariant covariates. Given the expression for the ATT in part (1) of Corollary 1, the proof of part (2) follows using the same arguments as for part (1).

We prove the result in several steps. To start with, consider the numerator in the expression for $\alpha$ in Equation (1).
Notice that We provide results for each of the terms in Equation (A1) next.

Lemma 1. Under Assumptions 1, 2, and 3(a), where the first three equalities hold by repeatedly applying the law of iterated expectations, the fourth equality holds by rearranging terms, the fifth equality holds by integrating over the distribution of $X^{all}(0)$ conditional on $D = 1$ and re-weighting, and the last equality holds under the conditional parallel trends assumption in Assumption 2. Next, we provide a result for the second term in Equation (A1).

Lemma 2. Under Assumptions 1, 2, and 3(a), where the first equality holds by applying the law of iterated expectations, the second equality holds by applying the law of iterated expectations and the law of iterated projections, the third equality holds by adding and subtracting $\mathbb{E}[L(\Delta Y_{t^*}|\Delta X_{t^*}, D = 1)\,p(X^{all}(0))^2]$ and $\mathbb{E}[L(\Delta Y_{t^*}|\Delta X_{t^*}, D = 0)\,p(X^{all}(0))(1 - p(X^{all}(0)))]$ and rearranging terms, and the fourth equality holds by applying the law of iterated expectations to each term. This completes the proof. Next, we provide a result on decomposing differences between the conditional expectation of $\Delta Y_{t^*}$ (conditional on the full vector $X^{all}(0)$) and the linear projection of $\Delta Y_{t^*}$ on $\Delta X_{t^*}$.

Proof. The result holds immediately just by adding and subtracting terms. To which holds because untreated potential covariates are equal to observed covariates for the untreated group. Second, consider the case when $d = 1$. In this case, where the first equality holds by the law of iterated expectations, the second equality holds by Assumption 2, the third equality holds by Assumption 4(b), and the last equality holds because, conditional on $X_{t^*}(0)$ and $X_{t^*-1}$, the inside conditional expectation is non-random.
Thus, Term where the first equality holds by adding and subtracting terms, the second equality holds by Assumption Cov-Exogeneity, the definition of $ATT_{X_{t^*}(0), X_{t^*-1}}$, Assumption 5(b), and the middle term is equal to zero by the same arguments as were used for Term (B) above, the third equality holds by Assumption 5(a), the fourth equality holds by adding and subtracting terms and by Assumption Cov-Exogeneity, and the last equality holds because where the first equality holds by adding and subtracting terms, the second equality holds using similar arguments as for previous terms and uses Assumptions 2, 4 and 5 and Assumption Cov-Exogeneity, the third equality holds by Assumption 6, and the last equality holds by the definition of linear projection where the linear projection coefficient is given by $\delta_1 + \delta_0$. This completes the first part of the proof. Next, we prove additional result (a) in Proposition 2. Toward this end, recall that,

Semiparametric difference-in-differences estimators
Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants
Mostly Harmless Econometrics: An Empiricist's Companion
Panel data models: some recent developments
Does regression produce representative estimates of causal effects?
Identification and inference in nonlinear difference-in-differences models
Changes over time in the cost of job loss for young men and women
Consistent model specification tests
How to make causal inferences with time-series cross-sectional data under selection on observables
Alternative approaches to evaluation in empirical microeconomics
Recovering distributions in difference-in-differences models: A comparison of selective and comprehensive schooling
Revisiting event study designs: Robust and efficient estimation
The effects of job displacement on job quality: Findings from the Wisconsin Longitudinal Study
Quantile treatment effects in difference in differences models with panel data
Policy evaluation during a pandemic
Quantile treatment effects in difference in differences models under dependence restrictions and with only two time periods
Difference-in-differences with multiple time periods
Should we combine difference in differences with conditioning on pre-treatment outcomes
Double/debiased machine learning for difference-in-differences models
Does strengthening self-defense law deter crime or escalate violence? Evidence from expansions to castle doctrine
Double/debiased machine learning for treatment and structural parameters
The economic impacts of COVID-19: Evidence from a new public database built from private sector data
Two-way fixed effects estimators with heterogeneous treatment effects
Identification and estimation of causal mechanisms and net effects of a treatment under unconfoundedness
Two-stage difference in differences
On estimating multiple treatment effects with regression
Difference-in-differences with variation in treatment timing
Effects of social distancing policy on labor market outcomes
Characterizing selection bias using experimental data
Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme
Handbook of Labor, Human Resources and Population Economics
Matching Methods for Causal Inference with Time-Series Cross-Section Data
Recent developments in the econometrics of program evaluation
Empirical Decomposition of the IV-OLS Gap with Heterogeneous and Nonlinear Effects
What Do We Get From A Two-Way Fixed Effects Estimator? Implications From A General Numerical Equivalence
Earnings losses of displaced workers
A note on endogenous control variables in causal studies
The estimation of causal effects by difference-in-difference methods
A practical guide to counterfactual estimators for causal inference with time-series cross-sectional data
The changes-in-changes model with covariates
Causal inference from complex longitudinal data.
Latent variable modeling and applications to causality
Marginal structural models and causal inference in Epidemiology
Estimation of regression coefficients when some regressors are not always observed
The consequences of adjustment for a concomitant variable that has been affected by the treatment
Doubly robust difference-in-differences estimators
Adjusting for nonignorable drop-out using semiparametric nonresponse models
Interpreting OLS estimands when treatment effects are heterogeneous: Smaller groups get larger weights
A general double robustness result for estimating average treatment effects
Persistent effects of job displacement: The importance of multiple job losses
Nonparametric model checks for regression
Specific capital, mobility, and wages: Wages rise with job seniority
Researchers' Degrees-of-Flexibility and the Credibility of Difference-in-Differences Estimates: Evidence From the Pandemic Policy Evaluations
Confounding and regression adjustment in difference-in-differences studies

Next, we provide a useful result for the denominator in the expression for $\alpha$ in Equation $\mathbb{E}[D]\,\mathbb{E}[(1 - L(D|\Delta X_{t^*})) \mid D = 1]$

Proof. From the definition of $u$, it follows that and we consider each of these in turn. Start with, where the first equality holds from the definition of $L(D|\Delta X_{t^*})$ and the second equality holds immediately from the previous one. Next, where the first equality holds by the definition of A 3, the second equality holds by the definition of $L(D|\Delta X_{t^*})$, and the last equality holds by canceling terms. Plugging Equations (A3) and (A4) in Equation (A2) implies that where the second and third equalities hold by the law of iterated expectations and which completes the proof.

Proof of Proposition 1. The first part of the expression for $\alpha$ comes from Equation (A1) and by Lemma 1 and Lemma 4. The second and third parts come from Equation (A1) and by Lemmas 2 to 4.