key: cord-0155882-7nxqgfmi
authors: Torous, William; Gunsilius, Florian; Rigollet, Philippe
title: An Optimal Transport Approach to Causal Inference
date: 2021-08-12
journal: nan
DOI: nan
sha: ca89d6bdb6a7a99dd31e6e7262829bc8e714458b
doc_id: 155882
cord_uid: 7nxqgfmi

We propose a method based on optimal transport theory for causal inference in classical treatment and control study designs. Our approach sheds a new light on existing approaches and generalizes them to settings with high-dimensional data. The implementation of our method leverages recent advances in computational optimal transport to produce an estimate of high-dimensional counterfactual outcomes. The benefits of this extension are demonstrated both on synthetic and real data that are beyond the reach of existing methods. In particular, we revisit the classical Card&Krueger dataset on the effect of a minimum wage increase on employment in fast food restaurants and obtain new insights about the impact of raising the minimum wage on employment of full- and part-time workers in the fast food industry.

Causal inference is a central question behind data-driven decision making across a wide variety of domains such as health care, economics, marketing, education, law (Angrist and Pischke, 2008; Ho and Rubin, 2011; Imbens and Rubin, 2015; Rothman and Greenland, 2005) , and, more recently, the life sciences (Bates et al., 2020; Belyaeva et al., 2021; Meinshausen et al., 2016; Squires et al., 2020) .

One of the main promises of causal inference is to be able to answer the question "what if?": What if a patient had received a different treatment? What if a stimulus check was for a different amount? What if this gene was knocked down? What if this photo was taken at night? These questions pertain to a hypothetical reality called a counterfactual. This powerful concept is behind most of the methods based on the theory of potential outcomes (Neyman, 1923; Rubin, 1974 Rubin, , 1978 currently employed across social and biomedical sciences (Rothman and Greenland, 2005; Morgan and Winship, 2014; Imbens and Rubin, 2015) . If counterfactuals could be observed, the causal inference question would be trivial: the causal effect would simply be the difference between an observation and its counterfactual. In practice, however, counterfactuals cannot be observed; this limitation has been dubbed the "fundamental problem of causal inference" (Holland, 1986; Imbens and Rubin, 2015) . In this context, the power of a causal inference method hinges on adequate modeling and estimation of counterfactuals. The goal of this paper is to propose a new modeling paradigm for counterfactuals that is based on optimal transport. As a byproduct, we leverage recent progress in the theory of statistical optimal transport to develop an estimation strategy for these counterfacturals and hence, causal effects.

We focus on the simpler case of two 1 possible realities: with and without intervention. Consider a framework with four groups, that may not contain the same individuals: a treatment 2 group preintervention, a treatment group post-intervention, a control group pre-intervention, and a control group post-intervention. The purpose of the control group is to quantify the natural drift, i.e., the time trends between the pre and post-intervention periods and that are not related to the intervention under study. In turn, the causal effect of an intervention, a.k.a treatment effect, is given by the change observed in the treatment group between the two periods, net of the natural drift.

Prior work. The two main existing methods for estimating causal effects in the above setting are the classical method of Difference-in-Differences (DiD) (Lechner, 2011; Wing et al., 2018) and its generalization beyond average and aggregate effects, the Changes-in-Changes (CiC) estimator (Athey and Imbens, 2006) .

DiD is a linear approach designed to estimate average (or aggregate) causal effects (Abadie, 2005; Heckman, 1990 ) over all units. This method can be proved to correctly identify counterfactuals under the "parallel trends" assumption (Abadie, 2005; Ryan et al., 2019) under which the average natural shift is assumed to be the same across both the control and treatment group. This idea has been used in many areas of science where capturing the average treatment effect is sufficient to formulate informative causal conclusions. Impactful applications include quantifying the effect of public health measures in response to COVID-19 (Goodman-Bacon and Marcus, 2020) and estimating how irrigation farmers adapt to water quotas (Drysdale and Hendricks, 2018) .

The fact that the linear DiD estimator only provides estimates of average and aggregate treatment is a severe limitation in many settings where treatment heterogeneity is important, i.e., where different units can react differently to the same intervention. The CiC estimator (Athey and Imbens, 2006) addresses this issue for univariate outcomes. It generalizes the DiD estimator by not just focusing on averages but the whole probability distribution of the respective units. This development permits the estimation of the entire counterfactual law of the fraction of treated units had they not received the treatment, hence allowing for a general form of heterogeneity in their response to treatment.

In its current form the CiC estimator is only applicable in settings with univariate outcomes, as it relies heavily on the definition of quantile functions. This makes it rather restrictive for modern applications where the outcomes of interest are often high-dimensional. This includes, for example A/B testing in digital marketing campaigns where outcomes are a combination of features such as click-through rate, time-per-page, etc., as well as measuring intervention effects such as gene knock down on a population of cells measured in high-dimensional gene space. In principle, the CiC estimator may be tensorized by estimating treatment effects independently for each coordinate but this solution fails to capture correlations that are often important in causal discovery.

Our contributions. We recast the CiC estimator using tools from the theory of optimal transport. This perspective allows us to extend this methodology to handle multivariate observables. Our approach relies on a natural extension of the causal model proposed in Athey and Imbens (2006) which guarantees that both populations, treatment and control, evolve between pre-and post-treatment periods via optimal transport maps (Theorem 3). In turn, these optimal transport maps can be estimated consistently from data using recent results from statistical optimal transport and implemented efficiently using recent advances in computational optimal transport (Peyré and Cuturi, 2019) . This also allows us to introduce a natural and interpretable generalization of the "parallel trends" assumption via co-monotonicity. In Section 4, we demonstrate the benefits of this extension by comparing it to tensorized CiC both on artificial and real data. Artificial data is designed to demonstrate that tensorized CiC can be grossly inconsistent and in the real data example, we revisit the classical dataset of Card and Krueger (1994) that has sparked an intense debate about the effects of raising the minimum wage on employment. Crucially, our ability to handle the workload of part-time and full-time workers as a bivariate outcome reveals an interesting effect from the perspective of labor economics that had escaped the numerous prior studies of this data.

To better understand our new model, we recall in this section the causal model on which the classical difference-in-differences and changes-in-changes estimators are built; see Figure 1 .

In this model, the quantity of interest, or outcome, is modeled as a stochastic process denoted by {Y t , t = 0, 1} with two time periods: pre-intervention (t = 0) and post-intervention (t = 1). Denote by µ t the distribution of Y t . In this expository section, and following prior work, Y t is assumed be a scalar. Note that the main goal of this paper is to extend this question to the multidimensional case but we postpone this extension to Section 3.1. Without loss of generality, we assume that Y t is generated from a latent variable U t ∈ R, that may change over time but whose distribution ν is time-invariant: U t ∼ ν, t = 0, 1. We can think of U t as capturing all of the unobserved and intrinsic characteristics of a unit-e.g., a patient-involved in the study at time t. For example U t may capture genetic or education background.

The stochastic process (Y t ) evolves differently depending on whether it describes a unit in the control or in the treatment group. In the former case, it is subject solely to a natural drift meant to model time variations that occur independent of the intervention. In the latter case, the process is subject to both the natural drift and the treatment effect. Causal inference aims at deconvolving these two effects. In the rest of this section we describe a model for each of these processes separately.

To model the natural drift (without treatment) of the stochastic process {Y t } t , between the preintervention period (t = 0) and post-intervention period (t = 1), we postulate the existence of two production functions h t , t ∈ {0, 1} such that Y t = h t (U t ), t ∈ {0, 1}. In other words, the distribution µ t of Y t is the pushforward of ν by h t ; we write µ t = (h t ) # ν. In particular, assuming that h 0 is invertible, we also have ν = (h −1 0 ) # µ 0 so that

There are many maps T such that µ 1 = T # µ 0 so the natural drift map d = h 1 • h −1 0 need not be identifiable without further assumptions.

Treatment Figure 1 : Illustration of various maps in the space of measures. An arrow indicates a pushforward map; for example µ 1 = d # µ 0 . A dashed arrow indicates a measure-preserving pushforward map and hence maps a measure to itself. d is the natural drift map and T is map from an observed outcome to its counterfactual. The data is observed from the four measures colored in magenta.

On the one hand, we assume that not only the distribution of U t is time-invariant but that also U t is itself constant over time (U 0 = U 1 ), then we have Y 1 = d(Y 0 ) so this map can be estimated via multivariate regression from independent copies of the pair (Y 0 , Y 1 ). While observation setups where the pair (Y 0 , Y 1 ) is observed arise classically in many practical applications where we observe a unit in the control group pre-and post-intervention, the assumption that U 0 = U 1 is more tenuous. Moreover, this setup leaves out an important class of problems where the units in the control group can differ between t = 0 and t = 1. A canonical example arises in the context of genomics and more specifically of single-cell RNA-Seq data where each unit corresponds to a cell. Measuring the gene expression levels of a cell is a destructive process and, as a result, a given cell may be only measured once so that coupled observations of the form (Y 0 , Y 1 ) are not available. Instead, we observe independent copies of Y 0 as well as independent copies of Y 1 . While difficult, this problem can be solved at the cost of additional assumptions of the function d. This problem is sometimes referred to as uncoupled regression and it arises in various applications, far beyond the genomics example described above (see, e.g., .

It can be shown that the natural drift d is identifiable from uncoupled data if it is monotone 3 with known monotonicity (either increasing or decreasing). This follows for example from assuming that both h 0 and h 1 are monotone increasing, which are the main assumptions employed in the study of the CiC estimator. In fact, in the one-dimension case, the unique monotone increasing function d such that µ 1 = d # µ 0 is given by F −1 1 • F 0 where F t denotes the cumulative distribution function (CDF) of Y t for t = 0, 1. This follows from Brenier's theorem presented in Section 3.1. In turn, a consistent estimatord of d may be obtained by plugging-in empirical CDFs.

Recently, other approaches for modeling the natural drift, i.e. generalizing the "parallel trends" assumption to allow for general heterogeneity in the univariate setting, have been proposed. These methods in general make stronger or less interpretable assumptions on the natural drift than elementwise monotonicity in Athey and Imbens (2006) . Callaway and Li (2019) directly assume that the copulas between µ 0 and µ 1 as well as µ 0 and µ † 1 are the same; Roth and Sant'Anna (2020) assume that the pointwise differences between the corresponding cumulative distribution functions are equal,

for all x ∈ R; Bonhomme and Sauder (2011) restrict the heterogeneity of the model to be additively separable and assume that the pointwise differences between the corresponding logarithms of the characteristic functions are equal. As we show below (equation (2)) our optimal transport approach provides the most natural generalization of the parallel trends requirement to multivariate settings in the sense that it requires only h 0 and h 1 to be comonotone. Intuitively, this means that h 0 and h 1 have to move "roughly in the same direction", an arguably more natural and interpretable requirement for a parallel trend.

The treatment effect is only observable in the treatment group which may contain units with a distribution that differs from that of the control group. As a result, we assume a new set of latent variables U 0 and U 1 that share a time-invariant distribution ν which may differ from ν. Given an unknown production function h 1 , let Y 1 = h 1 (U 1 ) denote the outcome of a unit after treatment.

If for the same unit one could also observe its outcome without treatment Y † 1 = h 1 (U 1 ), or counterfactual, the treatment effect would simply be given by Y 1 − Y † 1 . Unfortunately, the pair (Y 1 , Y † 1 ) is not jointly observed; this issue is often dubbed the "fundamental problem of causal inference" (Holland, 1986) . The potential outcomes paradigm prompts us to estimate the counterfactual Y † 1 . In other words, we aim to estimate the effect of only the natural drift d on a given unit from the treatment group. In light of the previous subsection, a consistent estimatord of the natural drift map may be obtained using data from the control group. Unfortunately, this does not lead to an estimator of the counterfactual. Indeed, in the setup of Athey and Imbens (2006) described above, we only have that

Despite this hurdle we can still, somewhat surprisingly, estimate the counterfactual distribution, that is the distribution µ † 1 of Y † 1 . Indeed, let Y 0 = h 0 (U 0 ) be a random outcome in the treatment group pre-intervention and denote its distribution by µ 0 . Then assuming that the production functions h 0 and h 1 are the same across both control and treatment groups, we get that µ † 1 = d # µ 0 , where we recall that d = h 1 • h −1 0 can be estimated consistently using empirical CDFs. In turn, the knowledge of this distribution µ † 1 allows us to estimate linear treatment effects of the form ϕdµ 1 − ϕdµ † 1 . For example the popular average treatment effect on the treated (ATT) is of this form and corresponds to choosing ϕ to be the identity map. Note that estimating nonlinear treatment effects, such as the variance of the treatment effect, or more generally, the distribution of treatment effects requires additional assumptions to select a coupling between µ 1 and µ † 1 . Nonlinear effects can be recovered by assuming that the production function h 1 is also increasing. In that case, for each Y 1 , we can obtain its counterfactual as

1 . By virtue of Brenier's theorem optimal transport-see Theorem 1-the map T is, in fact the unique monotone map such that µ † 1 = T # µ 1 and it is given by

1 and F is the CDF of µ 1 . In particular it can be 4. We note in passing that if we were to assume that U1 = U0, we could, in fact, test the assumption that the natural drift d, and hence the production functions h0, h1, are monotone increasing from independent copies of paired data of the form (Y0, Y1). We would do so by checking if there exists an increasing function d that indeed interpolates these points. In practice, such a function seldom exists and allowing a time-variable Ut allows us to account for deviations from this ideal situation. estimated from data using an empirical CDF for F and any estimator of the CDF of the counterfactual distribution F † . A nonlinear effects of notable interest is the quantile of order q ∈ (0, 1) of the distribution of the treatment effect Y 1 − Y † 1 . Under the above assumption that h 1 is increasing, it coincides with a popular linear treatment effect, the so-called quantile treatment effect (QTE) defined by

The QTE is the difference between the observed quantile of the distribution of the treatment group post intervention net of its counterfactual quantile (Lehmann, 1974; Doksum, 1974; Firpo, 2007) . However, if h 1 is not monotone increasing, as noted by Chernozhukov and Hansen (2005) , the two notions do not coincide. Unfortunately, in practice the treatment may have drastic effects that shuffles the population under study. In particular, it implies the production function h * 1 is often not monotonically increasing. This is the case for example, if the treatment has a much stronger effect on relatively low pre-intervention outcomes than in does on high ones. In this paper, we present results both with and without this assumption on h 1 for completeness but emphasize the fact that this is often too stringent an assumption.

At this point, it is quite apparent that classical causal model described above relies heavily on monotonicity of the production functions to be identifiable. As a result, an extension of this model largely hinges on identifying a condition with equal power in high dimension. In the next section, we argue that the theory of optimal transport provides a natural framework for such an extension.

In this section we look at the CiC estimator through the lens of optimal transport theory. This provides the means to generalize its underlying idea to the multivariate setting.

Recall that identifiability of the causal model presented in the previous section hinges on the uniqueness of the increasing map d such that µ 1 = d # µ 0 -and possibly also of the map T such that µ † 1 = T # µ 1 -when estimating nonlinear treatment effects. This uniqueness follows from a fundamental result of optimal transport known as Brenier's theorem.

Theorem 1 (Brenier (1991) ) Let P and Q be two probability measures defined over R d such that P is absolutely continuous with respect to the Lebesgue measure. Then, among all maps T : R d → R d such that Q = T # P , there is a unique one, called the Brenier map denotedT , which is the gradient of a convex function. FurthermoreT is an optimal transport map in the following sense.

Let Γ denote the set of joint probability distributions of (X, Y ) ∈ R d × R d such that X ∼ P and Y ∼ Q, then the optimal transport problem,

admits a unique solutionγ such that (X, Y ) ∼γ if and only if X ∼ P and Y =T (X), P -a.s.

It is not hard to check that for d = 1, a map T is the gradient of a convex function if and only if it is non-decreasing. As a result, d is identifiable as soon as it is increasing which is ensured by the assumption that the production functions h 0 and h 1 are monotone. In fact, it follows from a slightly weaker condition, namely the co-monotonicity of theses two maps in the sense that

Here we intently use a notation that extends to higher dimension to facilitate our discussion in the next subsection.

With the analysis from the previous section, the generalization of the DiD idea, and in particular the CiC estimator, to higher-dimensional settings should in principle be straightforward. We simply generalize the assumption that the production functions h 0 , h 1 , and possibly h 1 , are monotone in a multivariate sense and then find a corresponding estimator based on optimal transport theory that is coherent with the model. To that end, recall that in higher dimensions, gradients of convex functions are cyclically monotone (Rockafellar, 1997, Theorem 24.8). A map T : R d → R d is cyclically monotone if for any positive integer m and any cycle u 1 , . . . , u m , u m+1 = u 1 in its domain, it holds

It is not hard to check that when m = 2, Equation (3) reduces to the classical multivariate notion of monotonicity

This notion leads to a natural extension of the co-monotone assumption (2) which drives identifiability in the one-dimensional case. Two production functions h 0 and h 1 are said to be cyclically comonotone if for any positive integer m and any cycle u 1 , . . . , u m , u m+1 = u 1 in their common domain, it holds

For m = 2, this condition reduces to (2). Moreover, Condition (5) readily implies that the map d = h 1 • h −1 0 is cyclically monotone and hence, by Theorem 1, is the unique Brenier map such that

Unfortunately, this solution to our technical problem comes at the cost of interpretability. Indeed, while the co-monotonicity condition (2) simply means that the two production functions h 0 and h 1 do not evolve in opposite directions, the cyclical co-comonotonicity condition (5) remains obscure despite attempts a finding justification for cyclical monotonicity in some applications (Chernozhukov et al., 2021; Shi et al., 2018) .

The challenge therefore is to find assumptions on the problem that ensure identifiability based on the simple co-monotonicity condition (2) only.

In order to produce a model for which the natural generalization of the CiC estimator via optimal transport remains identifiable, we rely on results from the literature on truthfulness in mechanism design, in particular on the following theorem.

Theorem 2 (Saks and Yu (2005) ) Let K and F be a convex and a finite subset of R d respectively. Then a function T : K → F is cyclically monotone if and only if it is monotone.

This result is quite remarkable and deserves some comments. Indeed while cyclical monotonicity is a much more stringent condition than monotonicity in general, the two conditions collapse into one when the map takes only a finite set of values. This prompts the question of whether this collapse occurs in more general setting. Unfortunately, to date, all extensions of this result are rather artificial and appear to capture essentially the same conceptual idea (see, e.g. Kushnir and Lokutsievskiy, 2021) . In our context, this assumption restricts the scope of the model to situations where the post-intervention distribution µ 1 has finite support. Since the size of this support may be arbitrarily large, this assumption is essentially vacuous and can be accounted for simply by numerical precision. In fact, in the case of small support size, it can benefit the estimation process by acting as a statistical regularizer. Indeed it has been observed that semi-discrete optimal transport between a continuous and a discrete observation escapes the curse of dimensionality that plagues statistical optimal transport between continuous measures (see Forrow et al., 2019; Gunsilius, 2021; Hütter and Rigollet, 2021) .

We are now in a position to define a new causal model for which the counterfactual distribution is identifiable. We refer to Section 2 and Figure 1 for details about the rationale behind this model.

We observe data from each of the four distributions µ 0 , µ 1 , µ 0 , and µ 1 , all defined on R d . Furthermore, we assume that there exists production functions h 0 , h 1 , h 1 :

Recall that our goal is to estimate the counterfactual distribution µ † 1 = (h 1 ) # ν . We gather here our main assumptions.

(A.1) The measures µ 0 and µ 0 are supported on convex sets K 0 and K 0 respectively and absolutely continuous with respect to the Lebesgue measure.

(A.

2) The measures µ 1 and µ † 1 are supported on finite sets F 1 and F † 1 respectively.

(A.

3) The production functions h 0 and h 1 are co-monotone in the sense of (2) and h 1 • h −1 0 is a well-defined function.

(A.4) The production functions h 0 are h 1 are co-monotone and h 1 • h −1 0 is a well-defined function.

(A.5) The measure µ 1 is supported on a convex set K 1 .

We now provide some brief justifications for these assumptions. Assumptions (A.1) and (A.5) are purely technical assumptions on the regularity of the problem that often arise in classical results of optimal transport theory (see, e.g., Caffarelli, 2000) . Assumption (A.3) is the natural generalization of the co-monotonicity assumption from the changes-in-changes estimator. Such a relationship is often reasonable, particularly when the unobservables are interpreted as characteristics of individuals such as health or ability (Athey and Imbens, 2006) . As discussed in Section 2, assumption (A.4) is often not realistic but necessary to obtain a map that matches each outcome Y 1 from the treatment group to its counterfactual and, ultimately, compute nonlinear treatment effects. We include it for completeness to obtain a side result of independent interest. The assumptions on h 1 • h −1 0 and h 1 • h −1 0 that appear in assumptions (A.3) and (A.4) ensures that if two latent variables yield the same outcome in the pre-intervention period, then they also lead to the same outcome in the post-intervention period thus eliminating potential redundancy in the definition of the latent space.

The main non-standard assumption, (A.2) is the most fundamental in generalizing the differencein-differences idea to multivariate settings. It is also technical in nature and allows us to obtain a unique counterfactual distribution via Theorem 2. Since the finite set may have arbitrarily large cardinally, it is inconsequential from a practical standpoint.

Under these assumptions, we get the following theorem, which is our main result. We postpone its proof to the appendix.

Theorem 3 Under assumptions (A.1-A. 3) there exists a unique map d :

It is the Brenier map from µ 0 to µ 1 . Moreover, if assumptions (A.4-A.5) also hold, there exists a unique map T :

In practice the distributions are unknown and we observe independent observations from each of the four distributions µ 0 , µ 1 , µ 0 , and µ 1 . Estimating Brenier maps from data is a subject of intense activity both from the theoretical and practical angles (see Gunsilius, 2021; Hütter and Rigollet, 2021; de Lara et al., 2021 , and references therein) often resulting in significant gaps between the two. Since this is a subject of independent interest, we do not dwell in this question and simply mention that simple consistent estimators can be constructed (see Panaretos and Zemel, 2020) and rates of convergence may also obtained under appropriate assumptions such as smoothness (Gunsilius, 2021; Hütter and Rigollet, 2021; Vacher et al., 2021) .

In this section, we demonstrate the performance of the new estimator on both synthetic and real data.

We present a synthetic data experiment which demonstrates that our proposed method manages to estimate joint counterfactual distributions, while naive tensorization of the CiC estimator fails to capture the dependence structure. In particular, the CiC estimator does not recover the true counterfactual marginals of simple linear production functions in R 2 while our optimal transport based method can.

We consider n = 3000 units in R 2 for each treatment arm. Samples of latent variables are drawn from the distribution ν at t = 0 and t = 1, mirroring the setup discussed in Section 2.1. Draws from ν are fixed across time for counterfactual validation purposes. For controls, ν's first coordinate is distributed according to Beta(3, 2) and second to Beta(2, 3); for treatments, ν 's first coordinate is distributed according to Beta(2, 3) and second to Beta(3, 2).

We consider the following set of production functions:

In the appendix we show that these linear production functions are co-monotone, but not elementwise monotone as required by the CiC estimator. Using these production functions, we generate n independent samples from each of the distributions µ 0 , µ 1 and µ 0 , as well as samples from the true counterfactual distribution µ † 1 for validation purposes.

To estimate the transport map d, we first compute an optimal plan using observed data and round it to an optimal transport map. This map is only defined on the data from µ 0 . To predict counterfactuals treatment on data from µ 0 , we employ nearest neighbor interpolation. Since the Beta distribution is supported on [0, 1] for all parameter choices, µ 0 and µ 0 have identical support in this example and this naive extrapolation technique performs sufficiently well. In particular, it shows that the OT based estimator remains close to the true counterfactual distribution while naive tensorization of CiC may depart significantly.

.008 .089 st. dev.

.002 .003 Table 1 : Recovery of eCDF by method over 20 runs.

The kernel density estimator (KDE) of the counterfactual joint distribution in Figure 5 and the recovered marginals in Figure 6 demonstrate these claims. Notably, CiC recovers a joint distribution with the appropriate shape but orientated in the wrong direction in R 2 . We also compute the empirical CDF for our 3000 true counterfactual observations and the counterfactuals generated by the two methods over a uniform mesh of 10, 000 points. Figure 7 visually demonstrates that OT almost perfectly recovers the eCDF, and Table 1 quantifies this result. The mean absolute difference over each mesh point is an order of magnitude smaller for OT and has a smaller standard deviation. 

On April 1, 1992, New Jersey raised its minimum wage from the Federal level of $4.25 per hour to $5.05 per hour. Card and Krueger (1994) (CK henceforth) investigated the effect this increase had on employment in fast food restaurants with a case-control experimental design, taking the bordering region of eastern Pennsylvania, where the minimum wage remained at $4.25, as a control group. The original study employs the standard DiD estimator to conclude that the higher minimum wage led to increased employment in New Jersey restaurants. This result spawned much debate within the economics community about the effect of minimum wage policies on employment (e.g. Neumark and Wascher 2000 , Dube et al. 2010 , Meer and West 2016 , Neumark et al. 2014 and references therein). We obtained a copy of the full dataset from David Card's public website. The treatment effect of interest in CK and many subsequent reevaluations is the change in full-time equivalent employees (FTE), which is defined via FTE = FT + .5PT, where FT (PT) is the number of full-time (part-time) employees (Neumark and Wascher, 2000; Lu and Rosenbaum, 2004) . We use the proposed multivariate extension of the changes-in-changes estimator to estimate a bivariate treatment effect in terms of full-and part-time employees, hence fully dissecting the causal effect of raising the minimum wage on these subgroups.

The original CK study finds a positive average treatment effect of 2.76 FTE jobs added in the treatment group compared to the control group, a result reproduced by Neumark and Wascher (2000) using different methodology. Neumark and Wascher (2000) also compute an average treatment effect in full-and part-time jobs with separate regressions and find treatment effects of 3.16 gained full-time jobs and 0.60 lost part-time jobs. These classical contributions only focus on aggregate outcomes over all restaurants, irrespective of the size of the respective restaurant. An exception is Ropponen (2011) who reanalyzes the data with the CiC estimator, taking into account the heterogeneity in the size of the fast-food restaurants. He finds the average treatment effect bounded in [0.90, 1.70]. Ropponen also notes in New Jersey that large restaurants pre-intervention (measured in FTE employees) have a negative treatment effect while small restaurants pre-intervention have a positive treatment effect. Our analysis recovers this trend for both FT and PT employees in New Jersey. As we also generate counterfactual controls, we find an opposite trend for Pennsylvania restaurants. This opposite trend is intriguing and worth exploring further as it does not seem to be explained by substitution effects of employees moving across the state line to find jobs.

CiC DiD ATE FT 3.07 2.61 3.45 ATE PT −1.79 −1.52 −1.00 Table 2 : Estimated average treatment effect on full-and part-time employment by method.

The proposed multivariate extension of the changes-in-changes estimator allows us to jointly estimate the effects on full-and part-time employment while accounting for the heterogeneity in restaurant size. For the results in this section, restaurants are represented in R 2 by our outcomes of interest, the number of full-and part-time employees. In the appendix we provide an additional experiment with restaurants represented in higher dimension including additional measurements about wages, prices, and operations. We discard 19 units with missing outcomes, leaving 76 control and 315 treatment restaurants. The transport plans for both groups are solved with a linear program, and we applly nearest neighbor matching between treatment and control pre-intervention samples to calculate counterfactual outcomes.

Our optimal transport analysis suggests that fast food restaurants responded to an increased minimum wage by substituting full-time employees for part-time ones with a net gain of full-time equivalent employees. Our results seem to indicate that the negative effect on part-time employees is more pronounced than previously estimated. The quantile plot in Figure 8 suggests this effect applies throughout the distribution of restaurants, not just the mean: fixing the number of full-time employees, counterfactual quantiles tend to have more part-time employees than treated quantiles at the same level; likewise fixing the number of part-time employees, treated quantiles tend to have more full-time employees. In their original paper CK suggest that restaurants may respond to higher labor costs with more full-time employees because they tend to be older, more skilled, and more productive, thus a better investment of capital. Furthermore, the univariate methods may underestimate the number of part-time employees lost, a conclusion supported by the higher dimensional analysis in Table 5 from the appendix.

We have introduced a general method based on optimal transport theory for causal inference in classical treatment and control study designs. It combines two desirable properties: it is designed for high dimensional settings while at the same time capturing the heterogeneity in treatment response of individuals. In particular, it generalizes both the classical difference-in-differences estimator, which is applicable in high dimension but only captures average effects, and the changes-in-changes estimator (Athey and Imbens, 2006) , which allows for general treatment heterogeneity but is only applicable in univariate settings. It also implies a natural and interpretable generalization of the standard "parallel trends" assumption of difference-in-difference estimators (Abadie, 2005) in terms of co-monotonicity of the causal production functions. Showcasing the utility of the proposed method, we revisit the classical Card & Krueger dataset by decomposing the treatment effects for full-and part-time employees. We find that fast-food restaurants responded to an increased minimum wage by substituting full-time employees for parttime employees, even when accounting for restaurant size. This provides a novel insight by dissecting the relationship between full-and part-time employees while accounting for the heterogeneity in restaurant size. More generally, since our proposed method is able to consider multivariate outcomes, it can help uncover interesting causal effects and nonlinear relations in a wide array of applications.

Proof of Theorem 3 Proof Recall first that the co-monotonicity assumption (A.3) reads as

for all x 0 , y 0 in the image of h 0 . In particular, this implies that the function d = h 1 • h −1 0 is monotone in the sense of (4). Since the function maps a convex set K 0 to a finite set F 1 , it follows from Theorem 2 that d is, in fact, cyclically monotone. By Rockafellar (1997, Theorem 24.8) , this is equivalent to d being the gradient of a convex function and, hence the unique Brenier map from µ 0 to µ 1 . This completes the first part of the theorem.

The second part of the theorem follows exactly the same arguments with h 1 replacing h 1 .

Proof of Co-monotone Production Functions in 4.1 Proof Recall we defined the following production functions in our synthetic data experiment:

h 0 (u) = 1 α α 1 u and h 1 (u) = 1 −α −α 1 u.

In our simulations, the α parameter of the production functions is fixed at 0.5. This choice makes h 0 and h 1 co-monotone, because for α < 1

Note, however, that h 0 and h 1 are not element-wise monotone as required by the CiC estimator. In fact, the first marginal function h

Hence by setting (u − v) ≡ 0 z for some z = 0 it holds that

In this section we provide details about the implementation of our numerical experiments from Section 4. In our experiments, we obtain finite draws from the continuous distributions ν and ν to apply production functions and generate µ 0 and µ 0 and post-intervention outcomes µ 1 and µ † 1 . Recent results prove consistency (Panaretos and Zemel 2020, de Lara et al. 2021 ) and rates of convergence under smoothness assumptions (Hütter and Rigollet 2021 , Gunsilius 2021 , Vacher et al. 2021 ) of optimal transport maps computed from empirical sample data. Because of these advances, we claim the transport maps calculated on large, but finite, simulation datasets estimate the true infinite-population maps well.

To compute optimal transport maps, we use a linear programming method ot.lp.emd from the Python package Python Optimal Transport (Flamary et al., 2021) . These maps are then extrapolated into optimal transport plans with Euclidean nearest neighbor matching. In our simulations the control and treatment groups pre-intervention have almost complete overlap of sampled support, justifying this technique. To be concrete, counterfactual treatment outcomes are generated by matching each treatment unit pre-intervention to the nearest control and pushing forward along the transport map between observed control distributions.

All numerical experiments are written and run in Python 3.

For our reanalysis of the Card & Krueger data on the minimum wage and fast food employment, we use the original dataset provided by David Card on his website. It contains survey information, collected via telephone, for 410 fast food restaurants across 4 fast-food chains. Of these, 79 are in the control region of northeastern Pennsylvania and 331 are in New Jersey. In addition to the two employment outcomes of interest, a number of other observations about a restaurant's wages, prices, and operations were collected. Our optimal transport methodology allows us to estimate treatment effects with restaurants more richly represented as vectors in R 10 . We present this experiment for illustrative purposes. We select a representative subset of 10 numerical covariates, listed in Table 3 , and remove any restaurant with a missing entry for any of these covariates. This leaves a final sample of 52 controls and 200 treated. The full dataset has 15 numerical covariates but including all 15 leads to a small sample size due to missing data. The covariates we do not include add little additional information, such as the price of fries (we include the price of entrées and sodas), have high rates of missingness, or have a different distribution in the subsample with complete data than the entire survey population. Unlike the outcome-only analysis presented in Section 4.2, the included covariates have different scales. Thus, we standardize each to zero mean and unit standard deviation. The dataset also includes categorical data such as the fast-food chain, which we do not include because the distance between restaurants would depend on how these are encoded.

We observed moderate sensitivity of estimate magnitude when perturbing the covariate set, although the sign of our estimators did not change. Therefore, we present aggregate results over subsets of covariates and units in this experiment. We believe the difficulty of estimating a 10 dimensional object with only 250 observations leads to this sensitivity and use more robust experiment design to smooth out the randomness. Table 4 shows that including additional covariates does not change the sign of our estimated treatment effects and estimates both full-time and part-time employment changes with smaller magnitude than bivariate optimal transport. Like the results in Section 4.2, this analysis suggests the minimum wage increase led the average fast food restaurant to hire more FTE employees and that univariate methods may underestimate the part-time treatment effect.

OT CiC DiD ATE FT 1.54 3.12 3.08 4.03 ATE PT −1.66 −2.12 −0.84 −0.97 Table 4 : Estimated treatment effects with and without covariates by method in the covariate-complete subpopulation.

To temper sensitivity to certain covariates, we calculate the treatment effects for all subsets of covariates in R 8 , R 9 , or R 10 . The positive results of this experiment are included in Table 5 . The sign of our two estimates never change in this experiment. To show our results are not biased by any particular units, we rerun our experiment 10000 times with 20% of our treatment and control units randomly dropped but all covariates included. Further reinforcing the estimates' consistency in sign, the effect for part-time workers becomes positive above the 96 th quantile and the effect for full-time workers becomes negative below the 1 st quantile. Both checks suggest the full-time treatment effect given by optimal transport with covariates is smaller than univariate estimates, while classical methods may underestimate the decrease in part-time workers. In future work with more tests for covariate selection and further sensitivity analysis, we hope stronger claims about the estimates' magnitude can be made. 

Semiparametric difference-in-differences estimators

Mostly harmless econometrics: An empiricist's companion

Identification and inference in nonlinear difference-in-differences models

Causal inference in genetic trio studies

Causal network models of SARS-CoV-2 expression and aging to identify candidates for drug repurposing

Recovering distributions in difference-in-differences models: A comparison of selective and comprehensive schooling

Polar factorization and monotone rearrangement of vector-valued functions

Monotonicity properties of optimal transportation and the FKG and related inequalities

Quantile treatment effects in difference in differences models with panel data

Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania

An IV model of quantile treatment effects

Identification of hedonic equilibrium and nonseparable simultaneous equations

A consistent extension of discrete optimal transport maps for machine learning applications

Empirical probability plots and statistical inference for nonlinear models in the two-sample case

Adaptation to an irrigation water restriction imposed through local governance

Minimum wage effects across state borders: Estimates using contiguous counties

Efficient semiparametric estimation of quantile treatment effects

POT: Python optimal transport

Statistical optimal transport via factored couplings

Using difference-in-differences to identify causal effects of COVID-19 policies

On the convergence rate of potentials of Brenier maps

Varieties of selection bias

Credible causal inference for empirical legal studies

Statistics and causal inference

Minimax estimation of smooth optimal transport maps

Causal inference in statistics, social, and biomedical sciences

When is a monotone function cyclically monotone? Theoretical Economics

The estimation of causal effects by difference-in-difference methods

Nonparametrics: statistical methods based on ranks. Holden-day

Optimal pair matching with two control groups

Effects of the minimum wage on employment dynamics

Methods for causal inference from gene perturbation experiments and validation

Counterfactuals and Causal Inference: Methods and Principles for Social Research

Minimum wages and employment: A case study of the fast-food industry in New jersey and Pennsylvania: Comment

Revisiting the minimum wage-employment debate: Throwing out the baby with the bathwater?

Sur les applications de la théorie des probabilités aux expériences agricoles : Essai des principes

An invitation to statistics in Wasserstein space

Computational optimal transport. Foundations and Trends® in Machine Learning

Uncoupled isotonic regression via minimum Wasserstein deconvolution. Information and Inference: A

Convex analysis

Reconciling the evidence of Card and Krueger (1994) and Neumark and Wascher (2000)

When is parallel trends sensitive to functional form?

Causation and causal inference in epidemiology

Estimating causal effects of treatments in randomized and nonrandomized studies

Bayesian inference for causal effects: The role of randomization

Now trending: Coping with non-parallel trends in difference-in-differences analysis

Weak monotonicity suffices for truthfulness on convex domains

Estimating semi-parametric panel multinomial choice models using cyclic monotonicity

Causal imputation via synthetic interventions

A dimension-free computational upper-bound for smooth optimal transport estimation

Designing difference in difference studies: best practices for public health policy research