title: Causal influence, causal effects, and path analysis in the presence of intermediate confounding
author: Iván Díaz (ild2005@med.cornell.edu)
date: 2022-05-16

Abstract: Recent approaches to causal inference have focused on the identification and estimation of causal effects, defined as (properties of) the distribution of counterfactual outcomes under hypothetical actions that alter the nodes of a graphical model. In this article we explore an alternative approach using the concept of causal influence, defined through operations that alter the information propagated through the edges of a directed acyclic graph. Causal influence may be more useful than causal effects in settings in which interventions on the causal agents are infeasible or of no substantive interest, for example when considering gender, race, or genetics as a causal agent. Furthermore, the "information transfer" interventions proposed allow us to solve a long-standing problem in causal mediation analysis, namely the non-parametric identification of path-specific effects in the presence of treatment-induced mediator-outcome confounding. We propose efficient non-parametric estimators for a covariance version of the proposed causal influence measures, using data-adaptive regression coupled with semi-parametric efficiency theory to address model misspecification bias while retaining $\sqrt{n}$-consistency and asymptotic normality. We illustrate the use of our methods in two examples using publicly available data.

1 Introduction

Statistical causal inference is primarily concerned with quantifying the strength of the causal relation between two variables. In the context of a non-parametric structural equation model, measures of causality are defined in terms of changes in the distribution of the variables under targeted hypothetical modifications to the structural equations. These methods can be roughly divided into two distinct classes: methods that seek to quantify the causal effect of an action, and information theoretic measures that view causality as information propagation. Causal effects are defined with respect to the actions that elicit them. Under a causal effect framework, the researcher specifies a set of (possibly hypothetical) interventions to the causal system to be evaluated. For example, a clinical researcher may be interested in evaluating mortality rates in a hypothetical world where all patients are given a certain treatment, and comparing them with mortality rates in a hypothetical world where no patient is given the treatment. Analyses involving causal effects are often prescriptive, i.e., the goal is to recommend one of the actions under evaluation. Statistical approaches to quantifying the effect of actions include (longitudinal) average treatment effects (e.g., Robins, 1986; Bang and Robins, 2005), quantile or other distributional effects (e.g., Díaz, 2017; Kennedy et al., 2021), optimal treatment regimes (e.g., Murphy, 2003; Díaz et al., 2018), and dynamic treatment initiation strategies (e.g., van der Laan et al., 2005; Cain et al., 2010), among many others. Information theoretic approaches, which are less common in the literature, seek to quantify causal influence through measuring properties of the structural equation model, and view causation as the transfer of information (Collier, 1999; Illari and Russo, 2014; Schölkopf, 2022).
To define measures of the strength of a causal relation, information theoretic approaches also rely on hypothetical interventions to the structural equations. The difference with causal effects is that information theoretic approaches are not concerned with the effect elicited by an action, but merely use interventions on the structural equations as a tool to measure the dependence of the structural equations on the causal agent of interest. In this sense, information theoretic approaches to causal inference are descriptive rather than prescriptive. For example, Janzing et al. (2013) propose to quantify the causal relation between X_k and X_s through the distribution of the data obtained under a hypothetical intervention where the arrows in the path X_k → X_s have been removed. Information theoretic approaches are common in the causal discovery literature, where the goal is to learn a family of plausible causal structures for a given dataset. A general tool that cuts across causal effects and information-theoretic approaches to causal inference is the concept of stochastic interventions (e.g., Didelez et al., 2006; Díaz and van der Laan, 2013; Young et al., 2014; Chaves et al., 2015), defined as interventions that replace certain equations of the model by random draws from user-given distributions. Examples of stochastic interventions include incremental propensity score interventions (Kennedy, 2019; Wen et al., 2021), and interventions that shift the exposure distribution on an additive or multiplicative scale (Díaz and van der Laan, 2012; Díaz and Hejazi, 2020; Díaz et al., 2021b, 2022a). Stochastic interventions can be viewed as measuring causal effects if the intervention used is of substantive interest. For example, incremental propensity score interventions yield interesting interpretations as the effect of actions if one can conceive of a real-world action/intervention that would yield such an incremental post-intervention exposure mechanism. In this sense, stochastic interventions can be of a prescriptive nature if one is interested in the effect of such a real-world action. However, stochastic interventions can also be viewed as information-theoretic tools that are not of prescriptive interest but are merely used to describe information transfer between variables in the structural equation system. For example, the edge-removal operations of Janzing et al. (2013) can be conceptualized as interventions where the causal agents of interest are set to random draws from their marginal distribution. In this paper we propose to use stochastic interventions as a means to define operations on a directed acyclic graph that remove or emulate the information transferred along certain edges of interest in the causal graph. These edge operations are then used to define measures of the strength of a causal relation, which we call causal influence. These measures of causal influence are interpretable outside of the causal effect framework, i.e., they are interpretable as information-theoretic quantities that do not require the researcher to be interested in the causal effect of hypothetical actions. Our proposed measures of causal influence can be of interest in multiple situations, for example if the causal agents of interest are not manipulable, such as race, gender, or genetics. The rest of the paper is organized as follows. In §2 we expand on the idea of non-manipulable causes and on the importance of mediation and path analysis. In §3 we introduce the notation and the causal model.
In §4 we introduce the novel measure of causal influence proposed in this paper, based on information transfer interventions along the edges of a directed acyclic graph. In §5 we discuss a path-analysis method that uses these interventions, and we prove that this method satisfies appropriately defined path-specific sharp null criteria. In §6 we present identification results for the proposed measures, and in §7 we develop non-parametric efficient estimators for some of the measures of causal influence proposed. In §8 we present the results of applying the methods to two publicly available datasets. We conclude in §9 with a discussion of connections to recent literature and directions of future research. In §1 of the supplement we present a path-effect decomposition that uses the ideas of intervening on the information transferred along edges of the causal graph to achieve a path-specific decomposition of causal effects.

Consider questions regarding the causal effect of race, or the causal effect of gender. A causal effect framework is inappropriate for analyzing the causal relation between race and other variables. Under a causal effect framework one is forced to define effects in terms of hypothetical actions that modify the causal agent. Under a causal effect framework, "potential causes must be plausibly manipulable; they cannot include fixed attributes such as race" (Kaufman and Cooper, 2001; VanderWeele and Robinson, 2014). However, systems with non-manipulable causal agents can be subject to hypothetical structural interventions that would remove a variable from the set of causes of other variables. For example, one could ask what the association between race and health outcomes would be in a hypothetical world where race was not causally related to social determinants of health such as education, job opportunities, and income. This operationalization of causal influence does not require defining hypothetical manipulations of race or gender, and can be formalized in a directed acyclic graph in terms of interventions that modify the transfer of information along the edges of the graph. A common characteristic of fixed attributes such as race, gender, and genetics is that their causal relation with other variables is usually mediated by other attributes. For example, the causal relation between race and health disparities in the US population is mediated by socioeconomic factors which are themselves affected by race through discrimination (Williams and Rucker, 2000). Likewise, the causal relation between polygenic risk scores and individual traits or disease is usually mediated by physiological or environmental processes that occur later in life. Thus, understanding the causal relation between such fixed causal agents and outcomes generally requires mediation analyses. Natural direct and indirect effects (Robins and Greenland, 1992; Pearl, 2001) are the preferred approach for mediation analysis when evaluating the causal effect of an action. However, these effects are not identified in the presence of treatment-induced mediator-outcome confounding (Avin et al., 2005). A popular solution to this problem which has gained traction in the literature is the use of randomized interventional effects (VanderWeele et al., 2014). Recent research (Miles, 2022) has uncovered an important limitation of these effects, namely that they fail to satisfy the sharp null mediational criterion, meaning that the effect through the mediator can be non-zero even when there is no structural relation that operates through the mediator for any individual in the population.
We propose to use interventions on the information transferred along the edges of a causal graph to define measures of causal influence. These measures solve a long-standing problem in the causal mediation literature, namely the identification and estimation of path-specific relations in the presence of mediator-outcome confounders affected by treatment. We show that the proposed measures of causal influence satisfy appropriately defined path-specific sharp null criteria. Importantly, we show that these information transfer interventions can also be used in the context of causal effects to decompose the average treatment effect into path-specific effects.

3 Notation, data, and causal model

Assume that we observe n independent and identically distributed copies X_1, ..., X_n of X = (W, A, Z, M, Y) ~ P. We use a non-parametric structural equation model (NPSEM) to study causal relations. We assume that X is generated according to

  W = f_W(U_W), A = f_A(W, U_A), Z = f_Z(W, A, U_Z), M = f_M(W, A, Z, U_M), Y = f_Y(W, A, Z, M, U_Y), (1)

where the functions f are deterministic but unknown, and U = (U_W, U_A, U_Z, U_M, U_Y) is a vector of exogenous errors whose distribution is unrestricted in principle. We simplify the presentation of the model by considering a single W with little loss of generality, but we could split it into several factors according to whether they are confounders of only some of the subsequent relations. We are interested in quantifying the strength of the causal relation between A and Y, and in understanding the extent to which that relation operates through the various paths involved in the directed acyclic graph (DAG) depicted in Figure 1. We refer to the former goal as mediation analysis, and to the latter as path-analysis.

The concept of causal influence is generally defined as the existence of a directed path between two variables (Pearl and Verma, 1995). A formal definition of the causal influence of A on Y in terms of the non-parametric structural equation model can be constructed as follows.

Definition 1 (Causal influence). For fixed a, let the counterfactual variable Y(a) be defined as the solution to the system of equations (1) where the equation for A is replaced by the assignment A = a. The variable A is said to have a causal influence on Y if P{Y(a_1) ≠ Y(a_0)} > 0 for some values a_0 and a_1.

For binary treatments, the concept of causal influence as stated above is related to Fisher's sharp null hypothesis of no individual treatment effect H_0: P{Y(1) = Y(0)} = 1. Most recent approaches to causal inference have focused on quantifying causal influence through the concept of a causal effect, defined through changes in the probability distribution of X under hypothetical interventions that modify the value of the variable A. In this paper, we will propose an alternative way to quantify causal influence, using an approach based on intervening on the information transferred along edges in the DAG depicted in Figure 1. In general, we will require that measures of causal influence satisfy the following property:

P1 (Sharp null criterion). A causal influence measure satisfies the sharp null criterion if it is null whenever there is no causal influence.

While this criterion is trivially satisfied by many measures of causal influence, it has been recently uncovered that it is not satisfied by widely adopted approaches to mediation analysis. In §5 we prove that our proposed causal influence measures for mediation satisfy a version of this criterion adapted to mediation.
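To fix ideas, the following minimal simulation sketches one data-generating process of the form (1) and checks the causal influence of A on Y in the sense of Definition 1. The structural functions and error distributions here are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One hypothetical instance of the NPSEM (1); functional forms are
# arbitrary illustrative choices.
U_W, U_A, U_Z, U_M, U_Y = rng.normal(size=(5, n))
W = U_W
A = (W + U_A > 0).astype(float)                    # f_A(W, U_A)
Z = 0.5 * A + 0.3 * W + U_Z                        # f_Z(W, A, U_Z)
M = 0.7 * Z + 0.2 * A + 0.3 * W + U_M              # f_M(W, A, Z, U_M)
Y = 0.4 * M + 0.3 * Z + 0.5 * A + 0.3 * W + U_Y    # f_Y(W, A, Z, M, U_Y)

def Y_of(a):
    """Counterfactual Y(a): solve the same equations with A fixed at a."""
    Za = 0.5 * a + 0.3 * W + U_Z
    Ma = 0.7 * Za + 0.2 * a + 0.3 * W + U_M
    return 0.4 * Ma + 0.3 * Za + 0.5 * a + 0.3 * W + U_Y

# A has a causal influence on Y: Y(1) differs from Y(0) with positive probability.
print(np.mean(Y_of(1) != Y_of(0)))  # approximately 1 in this continuous example
```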
Before we describe our proposed measure, we review the definition of causal effects for contrast. A general definition of causal effect is as follows (Pearl, 2000):

Definition 2 (Causal effect of an action). Given a fixed value a, the causal effect of the intervention A = a is defined as (moments of) the probability distribution of the counterfactual variable Y(a). Contrasts of causal effects indexed by one or more interventions will also be referred to as causal effects.

As stated in the definition, to define the causal effect of A on Y one needs to specify an intervention on the node corresponding to A in the NPSEM (1). For example, consider an intervention that removes the equation f_A from the NPSEM and replaces it with assignment to a fixed value A = a. The intervention may also be stochastic, in which case A is set to a random draw from a user-given distribution (e.g., Wen et al., 2021). We do not pursue these definitions further, since causal effects are not our focus and the above definition given in terms of static interventions will serve to illustrate the advantages and differences between the causal effect framework and the causal influence framework we propose.

To measure the causal strength of the relation between A and Y, we propose to measure the impact on the joint distribution of A and Y of intervening on the information transferred along all paths between A and Y in Figure 1. We first define an intervention where no information is transferred along any of these paths. Specifically, let Ã denote a noise variable defined as a random draw from the distribution of A conditional on W. Note that in some applications it may be more reasonable to define Ã as a random draw from the conditional distribution of A given a subset W_1 of W. In the interest of simplifying notation we pursue a definition based on a random draw conditional on W. To remove all the information transferred along all paths between A and Y, one possibility is to transfer noise along the edges in S = {A → Y, A → M, A → Z}, which corresponds to the following data generating mechanism:

  W = f_W(U_W), A = f_A(W, U_A), Ã ~ P(A = · | W),
  Z(Ã) = f_Z(W, Ã, U_Z), M(Ã) = f_M(W, Ã, Z(Ã), U_M), Y(Ã) = f_Y(W, Ã, Z(Ã), M(Ã), U_Y). (2)

The above counterfactual represents interventions that remove the influence of A on Y along all edges in the set S and replace it by transferring noise. We will therefore use the alternative notation Y_S for Y(Ã). This notation will be useful when we define counterfactuals that modify the information transferred along different sets of edges, such as those required for path analysis. We illustrate the relevance of this definition with an example. Consider a problem where one is interested in the causal influence of race A on mortality Y in the COVID-19 pandemic, mediated by hospitalization M and social determinants of health Z. In this problem, W may be the empty set since race is uniquely determined by genetics. The counterfactual Z(Ã) is then interpreted as the social determinants of health that would be observed in a hypothetical world where society's treatment of an individual, as related to their social determinants of health, was as if their race were a random draw from the population, i.e., as if race did not play a role in someone's "assignment" of social determinants of health. This information transfer intervention allows us to define measures of causal influence as follows:

Definition 3 (Measure of causal influence). Let P(y, a) denote the joint distribution of (Y, A), and let P_S(y, a) denote the joint distribution of (Y_S, A). A measure of the causal influence of A on Y is any contrast D(P, P_S) between (moments of) P(y, a) and P_S(y, a).

This definition is related to the definitions of causal strength given by Sprenger (2018) and Fitelson and Hitchcock (2011). The difference with our definition is that we are interested in contrasting systems where the information transferred along the edges out of the causal agent is present vs removed, whereas the definition in these works focuses on contrasts where the causal agent is turned on vs off.
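The following sketch implements the information transfer intervention (2) in the simulated NPSEM used above (again under assumed structural functions): Ã is drawn from the conditional law of A given W, Y_S = Y(Ã) is computed, and the joint behavior of (A, Y) is contrasted with that of (A, Y_S).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200_000

# Same illustrative NPSEM as before (assumed functional forms).
U_W, U_A, U_Z, U_M, U_Y = rng.normal(size=(5, n))
W = U_W
A = (W + U_A > 0).astype(float)

def downstream(a):
    """Evaluate f_Z, f_M, f_Y given treatment input `a` and the shared errors."""
    Z = 0.5 * a + 0.3 * W + U_Z
    M = 0.7 * Z + 0.2 * a + 0.3 * W + U_M
    return 0.4 * M + 0.3 * Z + 0.5 * a + 0.3 * W + U_Y

Y = downstream(A)

# Information-transfer intervention: replace A with a noise draw A_tilde from
# the conditional law of A given W; here P(A = 1 | W) = P(W + U_A > 0 | W) = Phi(W).
A_tilde = rng.binomial(1, norm.cdf(W))
Y_S = downstream(A_tilde)            # counterfactual Y_S = Y(A_tilde)

# (A, Y) and (A, Y_S) differ only through the directed paths from A to Y:
# the second covariance reflects confounding by W alone.
print(np.cov(A, Y)[0, 1], np.cov(A, Y_S)[0, 1])
```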
To further see why this is a sensible measure of causal influence, consider the original NPSEM (1) and its associated DAG in Figure 1. In this model A and Y may be associated due to two different types of relations: (i) any of the directed paths from A to Y, or (ii) undirected paths operating through common causes (e.g., W or U). In contrast, in the intervened NPSEM (2), A and Y_S can only be associated due to paths operating through common causes. Therefore, a contrast between the joint distributions P(y, a) and P_S(y, a) provides a measure of the causal influence operating through paths from A to Y. These ideas are formalized in the following property:

Proposition 1. Any contrast D(P, P_S) between (moments of) P(y, a) and P_S(y, a) satisfies the sharp null criterion P1.

Proof. This follows after noticing that Y_S = Y whenever A has no causal influence on Y.

To illustrate the utility of Definition 3, we now present a number of examples of how these distributions can be contrasted.

Example 1 (Covariance decomposition). Consider the covariance decomposition

  Cov(A, Y) = Cov(A, Y_S) + {Cov(A, Y) − Cov(A, Y_S)} =: τ + θ.

By definition, A does not have a causal influence on Y_S. Any association between A and Y_S is due to common causes, which means that Cov(A, Y_S) is a measure of confounding. On the other hand, θ = Cov(A, Y) − Cov(A, Y_S) is a contrast of covariances comparing hypothetical worlds that only differ in the influence through paths from A to Y, which is present in Cov(A, Y) but not in Cov(A, Y_S).

Example 2 (Regression of residuals). Consider a regression parameter that contrasts the regression of Y_S on A with the regression of Y on A. This parameter measures the strength of the association in a hypothetical world where the influence of A is removed (and therefore all association is due to confounding) and compares it to the association observed in the actual world.

Example 3 (Kullback-Leibler divergence). Consider the contrast D(P, P_S) = ∫ p(y, a) log{p(y, a)/p_S(y, a)} dy da, where p(y, a) is the density of (Y, A) and p_S(y, a) is the density of (Y_S, A).

The following theorem provides an expression that can be used to identify the joint distribution of (Y_S, A) under the standard assumption of no unmeasured confounders.

Theorem 1 (Identification of the distribution of (Y_S, A)). Assume that A is independent of Y(a) conditional on W for all a. Then

  P(Y_S ≤ y | A = a) = ∫∫ P(Y ≤ y | A = a′, W = w) dP(a′ | w) dP(w | A = a).

The assumption of the theorem is satisfied whenever U_A is independent of (U_Z, U_M, U_Y) conditional on W in the NPSEM (1). This assumption basically states that W contains all common causes of A and Y. Note also that identification of this causal influence does not require the positivity assumption P(A = a | W) > 0, unlike most influence measures defined using causal effects of actions. Furthermore, if A is randomized, this identification result reduces to P(Y ≤ y), so that Y_S is independent of A, in agreement with the idea that the causal relation between A and Y is not confounded in a randomized experiment. Under the conditions of the theorem, the decomposition of Example 1 recovers the law of total covariance: τ = Cov(A, Y_S) = Cov(E(A | W), E(Y | W)), and θ = Cov(A, Y) − Cov(A, Y_S) = E[Cov(A, Y | W)]. The expectation of the conditional covariance θ has been previously used for causal inference as a means to study other causal effects, such as the variance-weighted ATE (Li et al., 2011). Furthermore, it forms the basis to construct the partial correlation coefficient, which has a long but non-rigorous history as a measure of causality in applications (Ellett and Ericson, 1986), for example in the context of genomics (e.g., Freudenberg et al., 2009). However, we know of no previous result that provides an interpretation of the expected conditional covariance as a causal effect in terms of formal interventions in a causal model. In addition, this result provides a causal interpretation of the well known law of total covariance as a decomposition of the covariance between A and Y in terms of a pure causal effect θ and a pure confounding effect τ. This is a procedure commonly used in applied studies aiming to estimate causal effects (e.g., in genomics), but its interpretation in a formal causal inference framework has not been previously articulated.
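A quick numerical check of the covariance decomposition and of the law-of-total-covariance identity implied by Theorem 1, in the same illustrative simulation (assumed data-generating process):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 500_000

# Illustrative NPSEM, as in the previous sketches (assumed forms).
U_W, U_A, U_Z, U_M, U_Y = rng.normal(size=(5, n))
W = U_W
A = (W + U_A > 0).astype(float)

def downstream(a):
    Z = 0.5 * a + 0.3 * W + U_Z
    M = 0.7 * Z + 0.2 * a + 0.3 * W + U_M
    return 0.4 * M + 0.3 * Z + 0.5 * a + 0.3 * W + U_Y

Y = downstream(A)
A_tilde = rng.binomial(1, norm.cdf(W))   # draw from P(A | W)
Y_S = downstream(A_tilde)

cov_AY = np.cov(A, Y)[0, 1]
tau = np.cov(A, Y_S)[0, 1]               # confounding term tau = Cov(A, Y_S)
theta = cov_AY - tau                     # causal covariance theta

# tau should equal Cov(E[A|W], E[Y|W]); since E[A|W] = Phi(W) here,
# Cov(Phi(W), Y) provides an independent check. By the law of total
# covariance, theta then equals E[Cov(A, Y | W)].
print(tau, np.cov(norm.cdf(W), Y)[0, 1])
print(theta)
```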
The counterfactual Y_S involves information transfer interventions that remove all the information transferred along the edges in S. In this article we will also use an information transfer operation that emulates the information transferred along certain paths. Consider, for example, the paths that traverse the edge A → Z, and an intervention that replaces Z in the downstream equations with a random draw Z_A from the conditional distribution of Z given (A, W). Interestingly, the distribution of this counterfactual variable is identified by the same functional of the observed data that identifies the distribution of the original counterfactual, suggesting that this information transfer intervention preserves the relation that operates through the original paths. While this kind of intervention is not very useful for defining the total causal influence of A on Y, it will be fundamental in §5 when we present the path analysis methods using causal influence.

In the next section we discuss a major advantage of the information transfer interventions introduced in this section, namely the ability to provide measures of direct and indirect causal influence. We will show that these interventions allow the decomposition of the total causal influence into an influence and an effect that operates through each specific path. Importantly, we show that this decomposition is possible even in the presence of mediator-outcome confounders which are affected by exposure, a problem whose solution has been elusive in the causal inference literature that focuses on the causal effect of actions. We start our presentation with a brief discussion of the problems with existing approaches to mediation analysis using causal effects.

Mediation analysis is the task of decomposing the total causal influence of A on Y into an influence that operates through M and an influence that operates through all other mechanisms. This can be done by decomposing the total causal influence of the previous section into influence that operates through the pathways A → Y and A → Z → Y (so-called direct influence), and influence that operates through the pathways A → M → Y and A → Z → M → Y (so-called indirect influence).

Definition 4 (Causal influence through a mediator). For fixed ā = (a_0, a_1, a_2), define the counterfactual variable Y(a_0, Z(a_0), M(a_1, Z(a_2))). The variable A is said to have a causal influence on Y through M if and only if P{Y(a_0, Z(a_0), M(a_1, Z(a_2))) ≠ Y(a_0, Z(a_0), M(a_1′, Z(a_2′)))} > 0 for some a_0 and some (a_1, a_2) ≠ (a_1′, a_2′).

For binary exposures, this definition of causal influence is equivalent to the sharp mediational null hypothesis Y(a, M(1)) = Y(a, M(0)) for a ∈ {0, 1}. As before, it will be desirable that parameters that measure the influence of A on Y operating through M satisfy the following property (Miles, 2022):

P2 (Mediational sharp null criterion). A measure of the influence of A on Y operating through M is said to satisfy the mediational sharp null criterion if it is null whenever there is no causal influence through the mediator M.

In addition, it may be desirable that mediation analyses that seek to unveil mechanisms satisfy the following property, which we present in an additive scale but which can be equivalently stated in multiplicative or other scales:

P3 (Decomposition of the total influence). Measures of direct and indirect influence are said to decompose a measure of total influence if they add up to the total influence.

Having established the above desiderata for a measure of influence through a mediator, we now review two of the major mediation frameworks recently proposed: natural direct and indirect effects, and randomized interventional direct and indirect effects. We discuss the lack of identifiability of natural effects in the presence of a mediator-outcome confounder affected by treatment, and discuss the fact that randomized mediational effects do not satisfy the sharp mediational null criterion. These shortcomings were originally described by Avin et al. (2005) and Miles (2022), respectively. We then move on to discussing our proposal for mediation analysis based on the causal influence measures proposed in the previous section, and show how our proposal can be used to solve those shortcomings.
In the case of a binary exposure A, the average treatment effect (ATE) E(Y(1) − Y(0)) is a common measure of the effect of the action A = 1 vs the action A = 0. Natural mediation effects decompose the ATE into effects that operate through the mediator M and effects that operate through all other causes. Specifically, we have

  E{Y(1) − Y(0)} = E{Y(1, M(1)) − Y(1, M(0))} + E{Y(1, M(0)) − Y(0, M(0))},

where the first term on the right hand side is the natural indirect effect (NIE) and the second is the natural direct effect (NDE). While the NIE and NDE provide a useful and intuitive decomposition of the ATE into effects that operate through M vs all other mechanisms, and satisfy P2 and P3, these effects are not identified in the DAG of Figure 1 (Avin et al., 2005). To understand why identification fails, it is useful to consider a simplified model where W = ∅, all the errors U are mutually independent, and the edges A → M and A → Y have been removed. It can be proved that in this model we have

  NIE = E{Y(Z(1), M(Z(1))) − Y(Z(1), M(Z(0)))},

where we have enriched the notation to add the intervention node in the index of the counterfactual. While the distribution of Y(Z(1), M(Z(1))) is identified, the distribution of Y(Z(1), M(Z(0))) is not (for a counterexample, see Table 1 of Avin et al., 2005), leading to lack of identifiability of the NIE and NDE. The variable Z is often referred to as a recanting witness because it operates as a direct effect through the path A → Z → Y and as an indirect effect through the path A → Z → M → Y, leading to the lack of identifiability of either the direct or the indirect effect.

As a solution to the lack of identifiability of natural direct and indirect effects, much of the recent literature on mediation analysis has focused on randomized interventional effects (Didelez et al., 2006; van der Laan and Petersen, 2008; VanderWeele et al., 2014; Díaz et al., 2021a). We discuss the definition, interpretation, and identification of these effects below. Consider a random draw G(a) from the distribution of M(a) conditional on W. Randomized mediational effects are concerned with interventions that set the mediator to G(1) and G(0) instead of M(1) and M(0). Specifically, the randomized mediational effects are defined as follows:

  E{Y(1, G(1)) − Y(0, G(0))} = E{Y(1, G(1)) − Y(1, G(0))} + E{Y(1, G(0)) − Y(0, G(0))},

where the first term on the right hand side is the randomized interventional indirect effect and the second is the randomized interventional direct effect. The first limitation of randomized mediational effects is that they do not satisfy P3 in the sense that they do not decompose the ATE, but rather decompose an alternative treatment effect given by the left hand side of the above expression, which is defined in terms of interventions on both A and M. Unlike the NIE and NDE, randomized mediational effects are identified in the NPSEM (1). VanderWeele et al. (2014) show that the above randomized interventional indirect effect is identified under the assumption that there are no unmeasured confounders of the relations A → M, A → Y, and M → Y. However, Miles (2022) has recently uncovered an important limitation of these effects, namely that they fail to satisfy the mediational sharp null criterion P2. One counterexample involves creating an NPSEM with independent errors, W = ∅, and an exogenous binary variable U_Z such that there is no causal influence of A on Y operating through M (Definition 4). It is easy to see that the randomized interventional indirect effect in such an example is generally not null, and would only be null if U_Z were observed and conditioned upon in all the above quantities. The interested reader is referred to Miles (2022) for more details and counterexamples.
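The following Monte Carlo sketch reproduces the flavor of this phenomenon; the specific structural equations are our own illustrative construction (not necessarily those of Miles, 2022), chosen so that the sharp mediational null of Definition 4 holds for every unit while the randomized interventional indirect effect is positive.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Illustrative construction: W is empty and U_Z is an exogenous binary variable.
U_Z = rng.binomial(1, 0.5, n)

def Z(a):      # Z(a) = a * U_Z
    return a * U_Z

def M(a):      # M(a) = Z(a)
    return Z(a)

def Y(a, m):   # Y(a, m) = m * (1 - U_Z): Y depends on m only when U_Z = 0
    return m * (1 - U_Z)

# Sharp mediational null holds unit by unit: Y(a, M(a')) = a' * U_Z * (1 - U_Z) = 0.
print(np.all(Y(1, M(1)) == Y(1, M(0))))   # True

# Randomized interventional draw G(a): a draw from the *marginal* law of M(a),
# approximated here by permuting M(a) across units (independent of each unit's U_Z).
G1 = rng.permutation(M(1))
G0 = rng.permutation(M(0))
print(np.mean(Y(1, G1) - Y(1, G0)))        # approximately 0.25, not zero
```

Conditioning on U_Z would make both terms zero, matching the statement above that the effect would only be null if U_Z were observed and conditioned upon.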
In what follows we propose a mediation analysis strategy based on the information transfer interventions introduced in §4, and show that this approach overcomes the limitations of the interventional effects discussed in this section.

In this section, we propose a decomposition of the causal influence of A on Y into influence that operates through each of the paths P_1: A → Y, P_2: A → Z → Y, P_3: A → Z → M → Y, and P_4: A → M → Y. We first state the properties that are desirable of measures of such path-specific influence.

Definition 5 (Path-specific causal influence). For fixed ā = (a_0, a_1, a_2, a_3, a_4), define the counterfactual variable Y(a_1, Z(a_2), M(a_3, Z(a_4))). The variable A is said to have a causal influence on Y through each path P_1, P_2, P_3, P_4 if and only if the corresponding condition below holds with positive probability for some ā:

  Name  Path             Condition
  P_1   A → Y            Y(a_1, Z(a_2), M(a_3, Z(a_4))) ≠ Y(a_0, Z(a_2), M(a_3, Z(a_4)))
  P_2   A → Z → Y        Y(a_1, Z(a_2), M(a_3, Z(a_4))) ≠ Y(a_1, Z(a_0), M(a_3, Z(a_4)))
  P_3   A → Z → M → Y    Y(a_1, Z(a_2), M(a_3, Z(a_4))) ≠ Y(a_1, Z(a_2), M(a_3, Z(a_0)))
  P_4   A → M → Y        Y(a_1, Z(a_2), M(a_3, Z(a_4))) ≠ Y(a_1, Z(a_2), M(a_0, Z(a_4)))

P4 (Path-specific sharp null criterion). A measure of the causal influence through a path P_j is said to satisfy the path-specific sharp null criterion if it is null whenever there is no causal influence through P_j.

Our proposed method for path analysis requires specifying a set of interventions that sequentially remove the information transferred through the paths P_1, P_2, P_3, and P_4. To construct the interventions, we first define the sets S_j = {P_1, ..., P_j} for j = 1, 2, 3, 4. Then, let Ã and Z̃ denote random draws from the distributions of A and Z conditional on W, respectively, and let Z_A denote a random draw from the distribution of Z conditional on (A, W). The notation Z̃ indicates that Z̃ does not transfer information from A onto the descendants of Z, whereas the notation Z_A indicates that Z_A transfers information from A onto the descendants of Z. In this sense, an intervention that assigns Z as Z_A can be thought of as an edge-emulation intervention. Although the emulation of the information transferred along the edge is not perfect (e.g., Z_A cannot transfer information from U_Z into M), it will be sufficient for purposes of defining path-specific causal influence. Using these random draws, define counterfactual variables Y_{S_j} that remove the information transferred through the paths in S_j, together with edge-emulation counterparts, denoted with superscripts as in Y^{(k)}_{S_j}, in which the information transferred along some of the remaining paths through Z is emulated by means of the draw Z_A.

A straightforward measurement of the causal influence through path P_j could be achieved through a contrast of the distributions P(Y_{S_j} ≤ y | A = a) and P(Y_{S_{j−1}} ≤ y | A = a), for j = 1, 2, 3, 4. (A related definition of path-specific influence is proposed by Zhang and Bareinboim (2018) in terms of a covariance contrast and using atomic interventions.) However, some of these causal influence measures are not identified due to the recanting witness problem outlined in the previous section. Specifically, the probability distribution of Y_{S_2} is not identifiable because Z operates as a recanting witness through the paths P_2 and P_3. This means that the causal influence through these paths cannot be measured separately in this way. Fortunately, the edge-emulation intervention can be used to solve this problem. Specifically, denote by P^{(k)}_{S_j} the joint distribution of (Y^{(k)}_{S_j}, A). Then we have the following definition.

Definition 6 (Measure of causal influence through a path). For j = 0, 1, 2, 3, 4 and k = 0, 1, 2, let P^{(k)}_{S_j} denote the distribution of (Y^{(k)}_{S_j}, A). Define the measure of causal influence through path P_j as a contrast D(P^{(k)}_{S_{j−1}}, P^{(k)}_{S_j}) for paths not involving the recanting witness Z (i.e., P_1 and P_4), as D(P^{(1)}_{S_1}, P^{(1)}_{S_2}) for the path P_2, and as D(P^{(2)}_{S_2}, P^{(2)}_{S_3}) for P_3.

In Theorem 2 below we show that the above definition satisfies the path-specific null criterion, meaning that these parameters may be used to test the null hypothesis of no path-specific effect.

Theorem 2. The contrasts defined in Definition 6 satisfy the path-specific sharp null criterion P4 with respect to each path P_j.

Note that the definitions of the counterfactuals Y^{(1)}_{S_1}, Y^{(1)}_{S_2}, and Y^{(2)}_{S_3} entail intervening on the information that is transferred along certain paths. For example, in Y^{(1)}_{S_1} the influence of A on Y operating through the path A → Z → M → Y operates by means of the random draw Z_A, whereas the influence operating through the path A → Z → Y operates through the natural value of Z. These edge-emulation interventions substitute a path A → Z → ⋯ → Y by a synthetic path A → Z_A → ⋯ → Y that transfers the same information as the original path.
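The following sketch illustrates the difference between the draws Z̃ and Z_A in a simulated example (assumed structural equations): both are synthetic draws, but only Z_A carries information about A net of W, as measured by the conditional covariance E[Cov(A, · | W)].

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 500_000

# Illustrative structural equations, as in earlier sketches.
U_W, U_A, U_Z = rng.normal(size=(3, n))
W = U_W
A = (W + U_A > 0).astype(float)
Z = 0.5 * A + 0.3 * W + U_Z

# Z_tilde: a draw from P(Z | W) -- transfers no information from A.
A_tilde = rng.binomial(1, norm.cdf(W))               # fresh draw of A given W
Z_tilde = 0.5 * A_tilde + 0.3 * W + rng.normal(size=n)

# Z_A: a draw from P(Z | A, W) -- carries the same information about A as
# the natural Z, but none of the unit's own error U_Z.
Z_A = 0.5 * A + 0.3 * W + rng.normal(size=n)

# E[(A - P(A=1|W)) * V] equals E[Cov(A, V | W)]: positive for Z and Z_A,
# approximately zero for Z_tilde.
ps = norm.cdf(W)
for name, V in [("Z", Z), ("Z_A", Z_A), ("Z_tilde", Z_tilde)]:
    print(name, np.mean((A - ps) * V).round(3))
```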
The edge-emulation intervention is what allows us to solve the recanting witness problem and obtain causal influence measures that satisfy the path-specific sharp null criterion. The above definitions allow us to test the null hypothesis of no causal influence through each path. A related goal is to decompose the total influence of A on Y into influences operating through each path. This is achievable only for some contrasts D (e.g., it is not achievable for the Kullback-Leibler divergence of Example 3). Specifically, we have the following result, which is presented in an additive scale but could also be proved for a multiplicative scale.

Theorem 3 (Decomposition of the total influence into path-specific influences). Assume that the contrast function D is linear in the sense that D(P, F) = D(P, G) + D(G, F) for any distributions F, P, and G. Define the path-specific influences θ_{P_1} = D(P_{S_0}, P^{(1)}_{S_1}), θ_{P_2} = D(P^{(1)}_{S_1}, P^{(1)}_{S_2}), θ_{P_3} = D(P^{(2)}_{S_2}, P^{(2)}_{S_3}), and θ_{P_4} = D(P^{(0)}_{S_3}, P^{(0)}_{S_4}), as well as the parameter θ_{P_2∨P_3} = D(P^{(1)}_{S_2}, P^{(2)}_{S_2}) + D(P^{(2)}_{S_3}, P^{(0)}_{S_3}). Then we have the following decomposition of the total causal influence θ = D(P_{S_0}, P_{S_4}):

  θ = θ_{P_1} + θ_{P_2} + θ_{P_3} + θ_{P_2∨P_3} + θ_{P_4}.

Clearly, the covariance contrast and the expectation contrast of Examples 1 and 2 satisfy the assumption of the theorem.
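As a worked check of the linearity assumption for the covariance contrast of Example 1 (our own verification, not part of the paper), note that the contrast telescopes across any intermediate distribution:

```latex
% Covariance contrast: D(P, F) = Cov_P(A, Y) - Cov_F(A, Y).
\begin{align*}
D(\mathrm{P}, \mathrm{F})
  &= \operatorname{Cov}_{\mathrm{P}}(A, Y) - \operatorname{Cov}_{\mathrm{F}}(A, Y) \\
  &= \underbrace{\operatorname{Cov}_{\mathrm{P}}(A, Y) - \operatorname{Cov}_{\mathrm{G}}(A, Y)}_{D(\mathrm{P}, \mathrm{G})}
   + \underbrace{\operatorname{Cov}_{\mathrm{G}}(A, Y) - \operatorname{Cov}_{\mathrm{F}}(A, Y)}_{D(\mathrm{G}, \mathrm{F})}.
\end{align*}
% Chaining the distributions P_{S_0}, P^{(1)}_{S_1}, ..., P_{S_4} through this
% identity is exactly how Theorem 3 splits the total influence into
% path-specific terms.
```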
In the above decomposition, the parameter θ_{P_2∨P_3} appears as a consequence of the addition of the counterfactuals used in the definition of the path-specific influences. As the notation implies, this parameter is equal to zero if there is no influence through the path P_2, or there is no influence through the path P_3. This result is proved formally in Proposition 2 below. Intuition for this may be obtained as follows. The contrasts D(P_{S_1}, P_{S_2}) and D(P_{S_2}, P_{S_3}), which would readily yield measures of the influence through paths P_2 and P_3, respectively, are not identified because the distribution P_{S_2} is not identified due to the recanting witness problem. However, the contrast θ_{P_2∧P_3} = D(P_{S_1}, P_{S_3}) is identified, and it measures the influence operating through P_2 and P_3. That is, θ_{P_2∧P_3} = 0 if there is no causal influence operating through P_2 and none through P_3. This highlights the difficulty in measuring the influence through paths P_2 and P_3 separately. The above theorem decomposes θ_{P_2∧P_3} into an influence measure θ_{P_2} that operates only through P_2, an influence measure θ_{P_3} that operates only through P_3, and an influence measure θ_{P_2∨P_3} that operates through P_2 or P_3. This latter parameter is zero if the influence going through either of these paths is null. Furthermore, as demonstrated in the following proposition, the parameter is null whenever there is no intermediate confounding by Z.

Proposition 2. Assume that D satisfies D(P, P) = 0 and that D(P, F) = Σ_S D(P(· | S), F(· | S)) P(S) for any partition {S} of the sample space. Assume U can be partitioned into sets U_1, U_2, and U_3 such that the following hold almost surely:
  • supā |Z(a_1) − Z(a_0)| = 0 in U_1,
  • supā,z |Y(a_1, z_1, M(a_3, Z(a_4))) − Y(a_1, z_2, M(a_3, Z(a_4)))| = 0 in U_2, and
  • supā,z |M(a_3, z_1) − M(a_3, z_2)| = 0 in U_3.
Then θ_{P_2∨P_3} = 0.

Importantly, the above proposition implies that whenever there is no intermediate confounding, the path-specific decomposition is exact in the sense that it satisfies property P3. We now illustrate the proposed path-analysis based on the covariance decomposition of Example 1.

Example 1 (Continued). Let D denote the covariance contrast defined as D(P, F) = Cov_P(A, Y) − Cov_F(A, Y). Applying Theorem 3 with this contrast yields the covariance decomposition

  Cov(A, Y) = τ + θ_{P_1} + θ_{P_2} + θ_{P_3} + θ_{P_2∨P_3} + θ_{P_4}. (5)

Wright (1921, 1923, 1934) proposed a covariance decomposition for Cov(A, Y) in terms of path-specific coefficients in the context of linear models and in the absence of an intermediate confounder Z. An immediate consequence of the above results is that our covariance decomposition generalizes Wright's approach to a non-parametric model in the presence of intermediate confounding. Zhang and Bareinboim (2018) provide an alternative generalization that is unidentifiable in the presence of intermediate confounders.

We now present identification formulas for the probability distributions involved in the computation of the measures of causal influence of Definition 6. We will require the following assumptions:

A1 (No unmeasured confounders). Assume that the errors U_A, U_Z, U_M, and U_Y are mutually independent conditional on W. These conditions hold, for example, when all of the errors U are independent, and are also satisfied in other configurations of conditional independence of these errors; for example, we could allow U_W and U_Y to be correlated.

A2 (Overlap). For a fixed value a, assume the following hold for all w such that p(w) > 0:
  • p(z | a, w) > 0 implies p(z | a′, w) > 0 for all a′ such that p(a′ | w) > 0.
  • p(m | z, a, w) > 0 implies p(m | z′, a′, w) > 0 for all (a′, z′, z) such that p(a′ | w) > 0, p(z | a, w) > 0, and p(z′ | a, w) > 0.
  • p(m | z, a, w) > 0 implies p(m | z′, a′, w) > 0 for all (a′, z′, z) such that p(a′ | w) > 0, p(z | a′, w) > 0, and p(z′ | a′, w) > 0.

Theorem 4 (Identification of path-specific causal influence). Assume A1 and A2. Then the path-specific counterfactual distributions are identified as follows:

  P(Y_{S_1} ≤ y | A = a) = ∫ P(Y ≤ y | a′, z, m, w) dP(a′ | w) dP(z | a, w) dP(m | z, a, w) dP(w | a),
  P(Y^{(1)}_{S_1} ≤ y | A = a) = ∫ P(Y ≤ y | a′, z, m, w) dP(a′ | w) dP(z | a, w) dP(m | a, w) dP(w | a),
  P(Y^{(1)}_{S_2} ≤ y | A = a) = P(Y^{(2)}_{S_2} ≤ y | A = a) = ∫ P(Y ≤ y | a′, z, m, w) dP(z, a′ | w) dP(m, w | a),
  P(Y^{(2)}_{S_3} ≤ y | A = a) = ∫ P(Y ≤ y | a′, z, m, w) dP(z | a′, w) dP(m | z′, a, w) dP(z′ | a′, w) dP(a′ | w) dP(w | a),
  P(Y^{(0)}_{S_3} ≤ y | A = a) = ∫ P(Y ≤ y | a′, z, m, w) dP(z, a′ | w) dP(m | z, a, w) dP(w | a),
  P(Y^{(0)}_{S_4} ≤ y | A = a) = ∫ P(Y ≤ y | w) dP(w | a).

At this point it is important to note that the ideas of information transfer interventions discussed in this paper can also be used to obtain a decomposition of the average treatment effect into path-specific effects analogous to the decomposition of Theorem 3. We discuss such an extension in §1 of the supplement. In the following section, we discuss efficient non-parametric estimation of the covariance influence discussed in Example 1. Efficient non-parametric estimation of the other parameters discussed in this paper is also possible, but we defer the development of such estimators to future work.
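To make the identification formulas concrete, the following sketch evaluates the first formula of Theorem 4 (with E[Y | ·] in place of P(Y ≤ y | ·)) by plug-in g-computation when all variables are binary, replacing each conditional distribution by its empirical counterpart. The data-generating process is an arbitrary illustrative choice.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
n = 400_000

# All-binary illustrative data (functional forms are our own assumptions).
W = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.3 + 0.4 * W)
Z = rng.binomial(1, 0.2 + 0.3 * A + 0.2 * W)
M = rng.binomial(1, 0.1 + 0.3 * Z + 0.2 * A + 0.1 * W)
Y = rng.binomial(1, 0.1 + 0.2 * M + 0.2 * Z + 0.2 * A + 0.1 * W)

def cmean(V, mask):       # empirical E[V | mask]
    return V[mask].mean()

def cprob(event, cond):   # empirical P(event | cond)
    return (event & cond).sum() / cond.sum()

a = 1
est = 0.0
# Plug-in evaluation of
# E[Y_{S_1} | A=a] = sum_{w,a',z,m} E[Y|a',z,m,w] p(m|z,a,w) p(z|a,w) p(a'|w) p(w|a).
for w, ap, z, m in product([0, 1], repeat=4):
    cw = (W == w)
    est += (cmean(Y, (A == ap) & (Z == z) & (M == m) & cw)
            * cprob((M == m), (Z == z) & (A == a) & cw)
            * cprob((Z == z), (A == a) & cw)
            * cprob((A == ap), cw)
            * cprob(cw, (A == a)))
print(est)
```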
7 Efficient estimation of the path-specific influence using the covariance contrast

In this section we discuss estimation of the covariance parameters presented in Example 1, and specifically the causal influence decomposition in (5). Note that, to estimate these covariances, it suffices to construct estimators of parameters of the form τ^k_j = E[f(A) Y^{(k)}_{S_j}] for the functions f(a) = a and f(a) = 1, since estimation of the corresponding terms E[f(A) Y] involving the observed outcome is straightforward. If the model is parametric, maximum likelihood estimation of these parameters is optimal in the sense that the estimators converge to the optimal normal distribution at √n-rate. However, if the model is non-parametric, a plug-in estimation strategy results in first-order bias, which must be corrected. The general methods to characterize this first-order bias are rooted in semi-parametric estimation theory (e.g., von Mises, 1947; Begun et al., 1983; Bickel et al., 1997; van der Vaart, 1998; Robins et al., 2009), and in the theory for doubly robust estimation using estimating equations (Robins, 2000; Robins et al., 1994; van der Laan and Robins, 2003; Bang and Robins, 2005). Under this theory, the first-order bias is characterized in terms of the so-called canonical gradient. This canonical gradient also characterizes the efficiency bound in the non-parametric model, and is therefore also known as the efficient influence function. Importantly, knowledge of the canonical gradient allows the development of estimators under slow convergence rates for the nuisance parameters involved. This is important because flexible regression and estimation methods involving model selection must be used when the model is non-parametric, and those flexible estimation methods often fail to be consistent at parametric rate, though they may be consistent at slower rates.

The following theorem illustrates the sense in which general plug-in estimators are biased in the non-parametric model, and provides a characterization of the bias in terms of the canonical gradient ϕ^k_j. The specific formulas of the gradient for each pair (j, k) are useful to construct the estimators, but they are cumbersome and somewhat uninformative, so we relegate their presentation to the supplementary materials.

Theorem 5 (First order von Mises expansion). Let τ^k_j(G) denote the τ^k_j parameter evaluated at a distribution G. Then, for any pair of distributions P and G, we have the following:

  τ^k_j(G) − τ^k_j(P) = −E_P[ϕ^k_j(X; G)] + R^k_j(G, P),

where R^k_j(G, P) is a second-order term consisting of integrals of products of differences between the nuisance parameters under G and P, for functionals ω, κ, and ν that vary with (j, k). The specific form of each canonical gradient ϕ^k_j and second order term R^k_j(G, P) is given in the supplement.

An application of the above theorem with G equal to an estimate P̂ of the true probability distribution P reveals that a plug-in estimation strategy is generally biased in first order, and provides a representation of the bias as −E_P[ϕ^k_j(X; P̂)]. Importantly, this theorem also provides an avenue to correct for this first order bias. Specifically, the first order bias may be estimated through the empirical average of ϕ^k_j(X_i; P̂) across observations i. If this bias is added back to the plug-in estimator τ^k_j(P̂), one would expect that the resulting estimator is unbiased in first order. This idea is formalized below in Theorem 6. The canonical gradient and the above reasoning are at the center of several recent estimation methods that leverage machine learning and flexible regression, such as the targeted learning framework (van der Laan and Rubin, 2006; Rose, 2011, 2018) and double machine learning (Chernozhukov et al., 2018). An important feature of these approaches is the use of cross-fitting, which will allow us to obtain n^{1/2}-convergence of our estimators while avoiding entropy conditions that may be violated by data-adaptive estimators of the nuisance parameters (Zheng and van der Laan, 2011; Chernozhukov et al., 2018).

Let P_1, ..., P_V denote a random partition of the data set into V prediction sets of approximately the same size; that is, P_v ⊆ {1, ..., n}, ⋃_v P_v = {1, ..., n}, and P_v ∩ P_{v′} = ∅ whenever v ≠ v′. In addition, for each v, the associated training sample is given by T_v = {1, ..., n} \ P_v. Our proposed estimator only requires estimation of the conditional expectation of Y given (M, Z, A, W), which will be denoted by m, the probability mass function of M given (Z, A, W), denoted by p_M, the probability mass function of Z given (A, W), denoted by p_Z, and the probability mass function of A given W, denoted by p_A. Let η = (m, p_M, p_Z, p_A), and note that the canonical gradients ϕ^k_j can be written as functions ϕ^k_j(X; η). Let v(i) denote the prediction set to which observation i belongs, and let η̂_{v(i)} denote an estimator of η obtained using the training data T_{v(i)}. Then, the estimator of τ^k_j is defined as

  τ̂^k_j = (1/n) Σ_{i=1}^n φ̂^k_j(X_i; η̂_{v(i)}),

where φ̂^k_j is the uncentered canonical gradient given in the supplementary materials. The following theorem provides the conditions under which the above estimator is expected to be efficient and asymptotically normal:

Theorem 6 (Asymptotic linearity of the proposed estimator). Assume that R^k_j(η̂, η) = o_P(n^{−1/2}) and that ‖φ̂^k_j(·; η̂_v) − φ̂^k_j(·; η)‖ = o_P(1) for all v. Then

  √n(τ̂^k_j − τ^k_j) = (1/√n) Σ_{i=1}^n ϕ^k_j(X_i; η) + o_P(1).

The proof of this theorem is sketched in the supplementary materials. The arguments are standard in the analysis of estimators in the targeted learning and double machine learning frameworks.
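To illustrate the cross-fitting scheme, the following sketch implements a cross-fitted one-step estimator for the expected conditional covariance θ = E[Cov(A, Y | W)] of Example 1, whose uncentered gradient (a − E[A | w])(y − E[Y | w]) is a standard result. The general τ̂^k_j estimators follow the same pattern with the supplement's gradients; the data-generating process and regression learners here are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def crossfit_theta(W, A, Y, V=5, seed=0):
    """Cross-fitted one-step estimator of theta = E[Cov(A, Y | W)].

    Uncentered gradient: phi(X) = (A - E[A|W]) * (Y - E[Y|W]).
    """
    n = len(Y)
    phi = np.zeros(n)
    for train, pred in KFold(V, shuffle=True, random_state=seed).split(W):
        # Nuisance regressions fit on the training folds only.
        pi = GradientBoostingRegressor().fit(W[train], A[train])
        mu = GradientBoostingRegressor().fit(W[train], Y[train])
        phi[pred] = (A[pred] - pi.predict(W[pred])) * (Y[pred] - mu.predict(W[pred]))
    est = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n)   # standard error from the estimated gradient
    return est, se

# Illustrative data with confounding by W (assumed DGP).
rng = np.random.default_rng(5)
n = 5000
W = rng.normal(size=(n, 1))
A = (W[:, 0] + rng.normal(size=n) > 0).astype(float)
Y = 1.0 * A + W[:, 0] + rng.normal(size=n)
est, se = crossfit_theta(W, A, Y)
print(f"theta = {est:.3f} +/- {1.96 * se:.3f}")   # Wald-type 95% interval
```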
The above theorem implies that an estimator θ̂_{P_j} of θ_{P_j}, constructed through an application of the above estimator with f(A) = A and f(A) = 1, is also asymptotically linear. An application of the Delta method and the central limit theorem then yields

  √n(θ̂_{P_j} − θ_{P_j}) → N(0, σ²) in distribution,

where σ² is the non-parametric efficiency bound, which allows the construction of Wald-type confidence intervals.

In this section we apply the covariance path-specific decomposition analysis proposed in the previous section to estimating path-specific causal relations in two examples using publicly available data. In the first example, we re-analyze the data from a recent study examining gender differences in wage expectations among students at two Swiss institutions of higher education (Fernandes et al., 2021). In the second example, we re-analyze data from a nationally representative randomized experiment on how the framing of media discourse shapes public opinion on immigration policy (Brader et al., 2008). The datasets are publicly available in the causalweight (Bodory and Huber, 2021) and mediation (Tingley et al., 2014) R packages, respectively. Code to reproduce our analyses is available at https://github.com. All nuisance parameters were estimated with extreme gradient tree boosting, where the hyperparameters are chosen from a random grid of size 100 using the caret (Kuhn, 2021) library in R.

In the first study, the authors administered a survey to 804 students at the University of Fribourg and the University of Applied Sciences in Bern in the year 2017. The survey contained a number of questions regarding wage expectations, as well as a number of variables related to wages such as the study program (business, economics, communication, and business informatics), job or educational plans after finishing the studies, the intended industry (trade, transport, hospitality, communication, finance, etc.), as well as other variables such as age, parents' education, nationality, and home ownership. We study the causal relation between gender (A) and wage expectations three years after graduation (Y), with study program (Z) and whether a student plans to continue obtaining further education or work full time after graduation (M) as mediators. The outcome Y is recorded on a scale of 0-16, where 0 means less than 3500 Swiss Francs (CHF) gross per month, 1 means 3500-4000 CHF, 2 means 4000-4500 CHF, and so on, with 16 meaning more than 11000 CHF. For simplicity and illustration purposes we treat Y as a numerical variable. The results of our analysis are presented in Table 2. Importantly, the parameters in Table 2 do not require interpreting hypothetical and infeasible interventions that would modify someone's gender. Instead, we compute the causal covariance θ between gender and wage expectations, and decompose it into path-specific covariances.

Table 2: Results in the gender and wage expectation illustrative example along with 95% confidence intervals.

In the second study, the authors examine whether and how elite discourse affects public opinion and action on immigration policy. They conducted a randomized experiment in which 265 subjects were exposed to different media stories about immigration. They employed a 2 × 2 design in which they manipulated ethnic cues by altering the picture and name of an immigrant featured in a hypothetical New York Times story (white European vs Latin American).
They also manipulated the tone of the story, focusing on either positive or negative consequences of immigration, as well as conveying positive or negative attitudes of governors and other citizens towards immigration. Our treatment variable A takes values 0, 1, or 2, with 0 denoting a positive story about a white European immigrant, 1 denoting a negative story about a white European immigrant or a positive story about a Latin American immigrant, and 2 denoting a negative story about a Latin American immigrant. The authors also collected information on age, education, gender, and income of the study participants, which we denote with W. A major hypothesis of the study was that anxiety is an important mediator of the causal influence of the framing of the story on negative attitudes towards immigration. Anxiety M was measured using a numerical scale from 3 to 12, where 3 indicates the most negative feeling. Perceived harm Z caused by immigration was also measured on a scale between 2 and 8. The outcome of interest Y is a four-point scale measuring a subject's attitude towards increased immigration, where larger values indicate more negative attitudes. The results of the analysis are presented in Table 3. These results largely agree with the results of the original research article, with the difference that these analyses provide more nuance, in the sense that most of the influence of the framing of the story on negative attitudes towards immigration is mediated directly by anxiety through pathways that do not involve perceived harm from immigration.

Table 3: Results in the media discourse illustrative example along with 95% confidence intervals.

Our proposed approach to measuring causal influence by removing or replacing edges shares important connections to recent causal inference literature. For example, Nabi et al. (2019) study the problem of learning fair optimal treatment policies. Their approach to fairness relies on constructing a distribution where undesirable causal pathways are "removed" and then learning optimal treatment policies with respect to this fair but unobserved distribution. Their path-removal operations focus on removing average treatment effects by means of finding the closest (e.g., in KL-divergence) distribution to the observed data distribution under the constraint of no average effect through the specified paths. Our approach to removing paths is different from theirs in that we remove all types of causal influence, i.e., not only influences that affect the average of the outcome. Their approach to learning fair optimal treatment policies could possibly be applied to a fair distribution constructed using the information transfer interventions proposed in this paper. Our proposal also shares connections to the general theory of causal interventions presented by Shpitser and Tchetgen Tchetgen (2016). In their work, the authors generalize and unify many interesting targets of causal inference through the use of so-called edge- and path-interventions, defined as interventions on the source node of the edge or path, where the intervention operates only for the purpose of the specific outgoing path. Our approach to achieving identification of path-specific causal influence and path-specific causal effects also entails a type of path-intervention. The difference with the approach of Shpitser and Tchetgen Tchetgen (2016) is that we do not restrict the path-intervention to the source node, but instead intervene on nodes in other positions in the path.
Specifically, by allowing path-interventions to be defined in terms of the recanting witness node, we are able to define path-specific effects which are non-parametrically identifiable. A general theory for path-interventions that allows for interventions on nodes other than the source node seems to be an important direction for future work. Our discussion centers around a structural definition of causal influence, interpreted as the strength of the dependence of the structural functions on their arguments (see, e.g., Definition 4). This criterion for causal influence may be too strong in some cases. For example, Sprenger (2018) argues for a probabilistic rather than structural definition of causal influence. We conjecture that our proposed measures of causal influence would also satisfy probabilistic null criteria, but leave the proof of those results to future work. Lastly, there are multiple interesting directions for future work that build on the ideas presented in this paper. The first is that the order of the decomposition we pursue, where we proceed sequentially by intervening on the paths P_1, P_2, P_3, and P_4, is arbitrary. Other orderings for these paths can also be considered and may be of more practical relevance in certain applications. The second is that the ideas we present can be generalized to construct path-specific effects for multiple ordered mediators and for situations with time-varying treatments, mediators, and covariates.

Supplement

1 Extension to path analysis using causal effects

The information transfer interventions introduced in this article can also be used to construct a decomposition of the average effect of a binary treatment into path-specific effects that satisfy path-specific null criteria, analogous to the decomposition constructed for measures of causal influence. Specifically, consider the ATE ψ = E[Y(1) − Y(0)] introduced in §4.1, and note that under our assumed NPSEM we have Y(a) = Y(a, Z(a), M(a, Z(a))). Then we could use the following counterfactuals to define path-specific effects:

  Y_{S_0} = Y(1, Z(1), M(1, Z(1))), Y_{S_1} = Y(0, Z(1), M(1, Z(1))), Y_{S_2} = Y(0, Z(0), M(1, Z(1))),
  Y_{S_3} = Y(0, Z(0), M(1, Z(0))), Y_{S_4} = Y(0, Z(0), M(0, Z(0))), (6)

where the causal effect operating through path P_j is defined as E[Y_{S_{j−1}} − Y_{S_j}]. As before, the probability distribution of Y_{S_2} is not identified due to the recanting witness Z. However, we can achieve an effect decomposition into path-specific effects using information transfer interventions as follows. Let Z_a denote a random draw from the distribution of Z(a) conditional on W. Define

  Y′_{S_1} = Y(0, Z(1), M(1, Z_1)), Y′_{S_2} = Y(0, Z(0), M(1, Z_1)), Y′_{S_3} = Y(0, Z(0), M(1, Z_0)), (7)

where, in comparison to the definitions in (6), we have emulated the information transferred through some paths by means of the random draws Z_1 and Z_0. For example, in Y′_{S_2}, the effect of the action A = 1 operating through the path A → Z → M → Y operates by means of the random draw Z_1. This allows us to achieve the following identifiable effect decomposition:

Theorem 7 (Decomposition of the average treatment effect into path-specific effects). Define the path-specific causal effects as

  ψ_{P_1} = E[Y_{S_0} − Y_{S_1}], ψ_{P_2} = E[Y′_{S_1} − Y′_{S_2}], ψ_{P_3} = E[Y′_{S_2} − Y′_{S_3}], ψ_{P_4} = E[Y_{S_3} − Y_{S_4}],

as well as the parameter ψ_{P_2∨P_3} = E[Y_{S_1} − Y′_{S_1}] + E[Y′_{S_3} − Y_{S_3}]. Then we have the following decomposition of the average treatment effect ψ = E(Y_{S_0} − Y_{S_4}):

  ψ = ψ_{P_1} + ψ_{P_2} + ψ_{P_3} + ψ_{P_2∨P_3} + ψ_{P_4}.
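The decomposition in Theorem 7 can be read as a telescoping sum (a sketch of the bookkeeping under the definitions above): natural counterfactuals enter at the endpoints, edge-emulated ones in the middle, and the ψ_{P_2∨P_3} term absorbs the discrepancies between natural and emulated counterfactuals at the junctions.

```latex
\begin{align*}
\psi = \mathbb{E}[Y_{S_0} - Y_{S_4}]
  &= \underbrace{\mathbb{E}[Y_{S_0} - Y_{S_1}]}_{\psi_{P_1}}
   + \underbrace{\mathbb{E}[Y'_{S_1} - Y'_{S_2}]}_{\psi_{P_2}}
   + \underbrace{\mathbb{E}[Y'_{S_2} - Y'_{S_3}]}_{\psi_{P_3}}
   + \underbrace{\mathbb{E}[Y_{S_3} - Y_{S_4}]}_{\psi_{P_4}} \\
  &\quad + \underbrace{\mathbb{E}[Y_{S_1} - Y'_{S_1}] + \mathbb{E}[Y'_{S_3} - Y_{S_3}]}_{\psi_{P_2 \vee P_3}}.
\end{align*}
% Every intermediate counterfactual appears once with each sign, so the right
% hand side collapses back to E[Y_{S_0} - Y_{S_4}] = psi.
```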
As in the previous section, we have the following result showing that this decomposition of the total causal effect satisfies the path-specific null criterion:

Theorem 8. The contrasts defined in Theorem 7 satisfy the path-specific sharp null criterion P4 with respect to each path P_j.

Furthermore, in this case ψ_{P_2∨P_3} is also an effect operating through either A → Z → Y or A → Z → M → Y, which is equal to zero whenever there is no effect through at least one of these paths (e.g., if Z is not an intermediate confounder):

Proposition 3. Assume U can be partitioned into sets U_1, U_2, and U_3 such that the following hold almost surely:
  • supā |Z(a_1) − Z(a_0)| = 0 in U_1,
  • supā,z |Y(a_1, z_1, M(a_3, Z(a_4))) − Y(a_1, z_2, M(a_3, Z(a_4)))| = 0 in U_2, and
  • supā,z |M(a_3, z_1) − M(a_3, z_2)| = 0 in U_3.
Then ψ_{P_2∨P_3} = 0.

In order to present identification results for the above decomposition, we will require the following overlap assumption, which guarantees that the functionals defined in Theorem 9 are well defined.

A3 (Overlap assumption for the decomposition of the ATE). Assume the following hold for all w such that p(w) > 0:
  • p(a | w) > 0 for a ∈ {0, 1}.
  • p(z | A = 1, w) > 0 implies p(z | A = 0, w) > 0.
  • p(m | A = 1, z′, w) > 0 implies p(m | A = 0, z, w) > 0 for all (z, z′) such that p(z | a′, w) > 0 and p(z′ | a⋆, w) > 0, for a′, a⋆ ∈ {0, 1}.

Theorem 9 (Identification of the path-specific decomposition of the average treatment effect). Assume A1 and A3. Then, for a′ = 1 and a⋆ = 0, the distributions of the counterfactuals in (6) and (7) that enter the decomposition of Theorem 7 are identified by g-computation-type formulas analogous to those of Theorem 4. The proofs of the results in this section follow steps identical to the proofs of the results in §6.

2 Proofs of results in the paper

2.1 Theorem 1

Proof. We have

  P(Y_S ≤ y | A = a) = E{P(Y(Ã) ≤ y | Ã, W) | A = a}
    = ∫∫ P(Y(a′) ≤ y | W = w) dP(a′ | w) dP(w | a)
    = ∫∫ P(Y(a′) ≤ y | A = a′, W = w) dP(a′ | w) dP(w | a)
    = ∫∫ P(Y ≤ y | A = a′, W = w) dP(a′ | w) dP(w | a).

The first line follows by the law of iterated expectation, the second line by independence of Ã of all other data conditional on W and by the fact that Ã is distributed as A conditionally on W, the third line follows by the assumption of the theorem, the fourth line because Y(a′) = Y in the event A = a′, and the last line by definition.

2.2 Theorem 2

Proof. The proof of Theorem 2 proceeds as follows. We will prove the statement of the theorem for the contrast of each path separately.

1. For P_1, assume that P{supā |Y(a_1, Z(a_2), M(a_3, Z(a_4))) − Y(a_0, Z(a_2), M(a_3, Z(a_4)))| = 0} = 1. Then we have Y^{(1)}_{S_1} = Y(Ã, Z, M(A, Z_A)) = Y(A, Z, M(A, Z_A)) almost surely, where the second equality follows by assumption.

2. For P_2, assume that P{supā,m |Y(a_1, Z(a_2), M(a_3, Z(a_4))) − Y(a_1, Z(a_0), M(a_3, Z(a_4)))| = 0} = 1. Then we have Y^{(1)}_{S_2} = Y^{(1)}_{S_1} almost surely on each of the events U ∈ U_1 and U ∈ U_2, where U_1 and U_2 are the sets of the first statement of Lemma 1 given below. Indeed, Y is insensitive to the value of its Z argument a.s. in the event U ∈ U_1. We also have Z(Ã) = Z(A) a.s. in the event U ∈ U_2, and therefore Y^{(1)}_{S_2} = Y^{(1)}_{S_1}.

3. For P_3, assume that P{supā,m |Y(a_1, Z(a_2), M(a_3, Z(a_4))) − Y(a_1, Z(a_2), M(a_3, Z(a_0)))| = 0} = 1. Then we have Y^{(2)}_{S_3} = Y^{(2)}_{S_2} almost surely on each of the events U ∈ U_1, U ∈ U_2, and U ∈ U_3, where U_1, U_2 and U_3 are the sets of the second statement of Lemma 1. Indeed, Y is insensitive to the value of its M argument a.s. in the event U ∈ U_1. We also have M(Ã, Z(Ã)) = M(Ã, Z(A)) a.s. in the event U ∈ U_2, and therefore Y^{(2)}_{S_3} = Y^{(2)}_{S_2}. Lastly, in the event U ∈ U_3 we have Z(Ã) = Z(A) a.s., and therefore Y^{(2)}_{S_3} = Y^{(2)}_{S_2}.

4. For P_4, assume that P{supā,m |Y(a_1, Z(a_2), M(a_3, Z(a_4))) − Y(a_1, Z(a_2), M(a_0, Z(a_4)))| = 0} = 1. Then we have Y^{(0)}_{S_4} = Y^{(0)}_{S_3} almost surely on each of the events U ∈ U_1 and U ∈ U_2, where U_1 and U_2 are the sets of the third statement of Lemma 1. Indeed, Y is insensitive to the value of its M argument a.s. in the event U ∈ U_1. We also have M(Ã, Z(Ã)) = M(A, Z(Ã)) a.s. in the event U ∈ U_2, and therefore Y^{(0)}_{S_4} = Y^{(0)}_{S_3}.

2.3 Theorem 4

Proof. We will prove the results for each random variable at a time.
First, note that

P(Y_{S_1} ≤ y | A = a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, Z(a) = z, M(a, z) = m, W = w, Ã = a′] dP(a′ | w) dP(Z(a) = z | a, w) dP(M(a, z) = m | Z(a) = z, a, w) dP(w | a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, W = w] dP(a′ | w) dP(z | a, w) dP(m | z, a, w) dP(w | a)
  = ∫ P[Y(a′, z, m) ≤ y | a′, z, m, w] dP(a′ | w) dP(z | a, w) dP(m | z, a, w) dP(w | a)
  = ∫ P[Y ≤ y | a′, z, m, w] dP(a′ | w) dP(z | a, w) dP(m | z, a, w) dP(w | a),

where the first equality follows by the law of iterated expectation, the second and third by assumptions A1 and A2 and the definition of the counterfactuals M(a, z) and Z(a), and the fourth one by definition of the counterfactual Y(a, z, m). Note that, to simplify notation, we removed the random variables from the right hand side of the | symbol in the above probabilities. We also have

P(Y^{(1)}_{S_1} ≤ y | A = a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, Z(a) = z, M(a, z′) = m, W = w, Z_A = z′, Ã = a′] dP(a′ | w) dP(Z(a) = z | a, w) dP(M(a, z′) = m | Z(a) = z, a, w) dP(z′ | a, w) dP(w | a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, W = w] dP(a′ | w) dP(z | a, w) dP(m | z′, a, w) dP(z′ | a, w) dP(w | a)
  = ∫ P[Y(a′, z, m) ≤ y | a′, z, m, w] dP(a′ | w) dP(z | a, w) dP(m | z′, a, w) dP(z′ | a, w) dP(w | a)
  = ∫ P[Y ≤ y | a′, z, m, w] dP(a′ | w) dP(z | a, w) dP(m | a, w) dP(w | a).

Similarly,

P(Y^{(1)}_{S_2} ≤ y | A = a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, Z(a′) = z, M(a, z′) = m, W = w, Z_A = z′, Ã = a′] dP(a′ | w) dP(z′ | a, w) dP(M(a, z′) = m | Z(a′) = z, a, w) dP(Z(a′) = z | a, w) dP(w | a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, W = w] dP(a′ | w) dP(z′ | a, w) dP(m | z′, a, w) dP(z | a′, w) dP(w | a)
  = ∫ P[Y ≤ y | a′, z, m, w] dP(a′ | w) dP(z′ | a, w) dP(m | z′, a, w) dP(z | a′, w) dP(w | a)
  = ∫ P[Y ≤ y | a′, z, m, w] dP(a′ | w) dP(m | a, w) dP(z | a′, w) dP(w | a)
  = ∫ P[Y ≤ y | a′, z, m, w] dP(z, a′ | w) dP(m, w | a),

and

P(Y^{(2)}_{S_2} ≤ y | A = a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, Z(a) = z′, M(a, z′) = m, W = w, Z_A = z, Ã = a′] dP(a′ | w) dP(z | a′, w) dP(M(a, z′) = m | Z(a) = z′, a, w) dP(Z(a) = z′ | a, w) dP(w | a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, W = w] dP(a′ | w) dP(z | a′, w) dP(m | z′, a, w) dP(z′ | a, w) dP(w | a)
  = ∫ P[Y ≤ y | a′, z, m, w] dP(a′ | w) dP(z | a′, w) dP(m | z′, a, w) dP(z′ | a, w) dP(w | a)
  = ∫ P[Y ≤ y | a′, z, m, w] dP(z, a′ | w) dP(m, w | a).

In addition,

P(Y^{(2)}_{S_3} ≤ y | A = a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, Z(a′) = z′, M(a, z′) = m, W = w, Z_A = z, Ã = a′] dP(z | a′, w) dP(M(a, z′) = m | Z(a′) = z′, a, w) dP(Z(a′) = z′ | a, w) dP(a′ | w) dP(w | a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, W = w] dP(z | a′, w) dP(m | z′, a, w) dP(z′ | a′, w) dP(a′ | w) dP(w | a)
  = ∫ P[Y ≤ y | a′, z, m, w] dP(z | a′, w) dP(m | z′, a, w) dP(z′ | a′, w) dP(a′ | w) dP(w | a),

as well as

P(Y^{(0)}_{S_3} ≤ y | A = a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, Z(a′) = z, M(a, z) = m, W = w, Ã = a′] dP(a′ | w) dP(M(a, z) = m | Z(a′) = z, a, w) dP(Z(a′) = z | a, w) dP(w | a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, W = w] dP(a′ | w) dP(m | z, a, w) dP(z | a′, w) dP(w | a)
  = ∫ P[Y ≤ y | a′, z, m, w] dP(a′ | w) dP(m | z, a, w) dP(z | a′, w) dP(w | a)
  = ∫ P[Y ≤ y | a′, z, m, w] dP(z, a′ | w) dP(m | z, a, w) dP(w | a).

Lastly,

P(Y^{(0)}_{S_4} ≤ y | A = a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, Z(a′) = z, M(a′, z) = m, W = w, Ã = a′] dP(a′ | w) dP(M(a′, z) = m | Z(a′) = z, a, w) dP(Z(a′) = z | a, w) dP(w | a)
  = ∫ P[Y(a′, z, m) ≤ y | A = a, W = w] dP(a′ | w) dP(m | z, a′, w) dP(z | a′, w) dP(w | a)
  = ∫ P[Y(a′, z, m) ≤ y | m, a′, z, w] dP(a′ | w) dP(m | z, a′, w) dP(z | a′, w) dP(w | a)
  = ∫ P[Y ≤ y | m, a′, z, w] dP(a′ | w) dP(m | z, a′, w) dP(z | a′, w) dP(w | a)
  = ∫ P[Y ≤ y | w] dP(w | a).
2.4 Proposition 2

Proof. First, note that in the event U ∈ U_1 we have Y^{(1)}_{S_2} = Y^{(2)}_{S_2} and Y^{(2)}_{S_3} = Y^{(0)}_{S_3}. Decomposing each of the contrasts that make up θ_{P_2∨P_3} according to the partition {U_1, U_2, U_3}, using the assumption D(P, F) = Σ_S D(P(· | S), F(· | S)) P(S), yields, for example,

  D(P^{(1)}_{S_2}, P^{(2)}_{S_2}) = D(P^{(1)}_{S_2}(· | U_1), P^{(2)}_{S_2}(· | U_1)) P(U_1) + D(P^{(1)}_{S_2}(· | U_2), P^{(2)}_{S_2}(· | U_2)) P(U_2) + D(P^{(1)}_{S_2}(· | U_3), P^{(2)}_{S_2}(· | U_3)) P(U_3),

and similarly for D(P^{(2)}_{S_3}, P^{(0)}_{S_3}), giving terms (7)-(13). By the assumptions of the lemma we have (7)+(9)+(12) = 0, (8)+(11) = 0, and (10)+(13) = 0, concluding the proof of the lemma.

3 Proofs of results on estimation

We first give the form of the canonical gradients, and then prove the results. The canonical gradients are given by ϕ^{(k)}_j(X; P) = φ̂^{(k)}_j(X; P) − τ^k_j(P), where φ̂^{(k)}_j denotes the corresponding uncentered gradient.

3.1 Proof of Theorem 6

Proof. To simplify notation we remove the index j and the superscript k from τ^k_j. Let P_{n,v} denote the empirical distribution of the prediction set P_v, and let G_{n,v} denote the associated empirical process √(n/V)(P_{n,v} − P). Note that τ̂ = (1/V) Σ_{v=1}^V P_{n,v} φ̂(·; η̂_v) and τ = P φ̂(·; η). Thus,

  √n(τ̂ − τ) = G_n{φ̂(·; η) − τ} + R_{n,1} + R_{n,2},

where

  R_{n,1} = (1/√V) Σ_{v=1}^V G_{n,v}{φ̂(·; η̂_v) − φ̂(·; η)},  R_{n,2} = √n (1/V) Σ_{v=1}^V P{φ̂(·; η̂_v) − τ}.

It remains to show that R_{n,1} and R_{n,2} are o_P(1). Theorem 5 together with the assumption that R^{(k)}_j(η̂, η) = o_P(n^{−1/2}) shows that R_{n,2} = o_P(1). For R_{n,1} we use empirical process theory to argue conditionally on the training sample T_v. In particular, Lemma 19.33 of van der Vaart (1998) applied to the class of functions F = {φ̂(·; η̂_v) − φ̂(·; η)} (which consists of one element) yields

  E[ |G_{n,v}(φ̂(·; η̂_v) − φ̂(·; η))| | T_v ] ≲ 2C log 2 · n^{−1/2} + ‖φ̂(·; η̂_v) − φ̂(·; η)‖ (log 2)^{1/2}.

The assumption R^{(k)}_j(η̂, η) = o_P(n^{−1/2}) can only hold if η̂ − η = o_P(1); therefore the right hand side is o_P(1). Lemma 6.1 of Chernozhukov et al. (2018) may now be used to argue that conditional convergence implies unconditional convergence, concluding the proof.

Lemma 1. Let 𝒰 denote the range of U = (U_W, U_A, U_Z, U_M, U_Y). The following statements are true:
1. P{supā |Y(a_1, Z(a_2), M(a_3, Z(a_4))) − Y(a_1, Z(a_0), M(a_3, Z(a_4)))| = 0} = 1 implies 𝒰 can be partitioned into sets U_1 and U_2 such that
  (a) P{supā,z |Y(a_1, z_1, M(a_3, Z(a_4))) − Y(a_1, z_2, M(a_3, Z(a_4)))| = 0 | U ∈ U_1} = 1, and
  (b) P{supā |Z(a_2) − Z(a_0)| = 0 | U ∈ U_2} = 1.

2. P{supā |Y(a_1, Z(a_2), M(a_3, Z(a_4))) − Y(a_1, Z(a_2), M(a_3, Z(a_0)))| = 0} = 1 implies 𝒰 can be partitioned into sets U_1, U_2, and U_3 such that
  (a) P{supā,m |Y(a_1, Z(a_2), m_1) − Y(a_1, Z(a_2), m_2)| = 0 | U ∈ U_1} = 1,
  (b) P{supā,z |M(a_3, z_1) − M(a_3, z_2)| = 0 | U ∈ U_2} = 1, and
  (c) P{supā |Z(a_2) − Z(a_0)| = 0 | U ∈ U_3} = 1.

3. P{supā |Y(a_1, Z(a_2), M(a_3, Z(a_4))) − Y(a_1, Z(a_2), M(a_0, Z(a_4)))| = 0} = 1 implies 𝒰 can be partitioned into sets U_1 and U_2 such that
  (a) P{supā,m |Y(a_1, Z(a_2), m_1) − Y(a_1, Z(a_2), m_2)| = 0 | U ∈ U_1} = 1, and
  (b) P{supā |M(a_3, Z(a_4)) − M(a_0, Z(a_4))| = 0 | U ∈ U_2} = 1.

Proof. We prove the first statement; the other statements can be proved using parallel arguments. The statement is of the form "if p then q"; we will prove the contrapositive, "if not q then not p". Assume there is a set U⋆ ⊆ 𝒰 with P(U⋆) > 0 and values ā, m_1 ∈ supp{M(a_3, Z(a_4))}, z_1, and z_2 such that both of the contrasts in (a) and (b) are non-zero in U⋆. Then |Y(a_1, Z(a_2), M(a_3, Z(a_4))) − Y(a_1, Z(a_0), M(a_3, Z(a_4)))| bounds from above a non-negative term whose probability of being positive is greater than zero in U⋆ (i.e., not p), concluding the proof of the claim.

3.2 Proof of Theorem 5

Proof. The proof of the theorem proceeds as follows. We prove the result for (j, k) = (0, 1) and (j, k) = (1, 1); the other results can be obtained using the same arguments. We assume the distributions G and P have densities g and p dominated by a measure λ. We use ∫ f dλ to denote ∫ f(x) dλ(x), and use ∫ b f′ g⋆ h′⋆ dλ to denote ∫ b(m, z, a, w) f(m, z, a′, w) g(m, z⋆, a, w) h(m, z⋆, a′, w) dλ(m, z, z⋆, a, a′, w). For (j, k) = (0, 1), the first-order expansion involves the functional

  ∫ f(a) E(Y | a′, z, m, W) dP(m | a, W) dP(z | a, W) dP(a | W) dP(a′ | W),

and likewise for (j, k) = (1, 1); the remainder R^k_j(G, P) collects the second-order products of differences between the components of G and P.

References

Avin, C., Shpitser, I., and Pearl, J. (2005). Identifiability of path-specific effects.
Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models.
Begun, J. M., Hall, W. J., Huang, W.-M., and Wellner, J. A. (1983). Information and asymptotic efficiency in parametric-nonparametric models.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1997). Efficient and Adaptive Estimation for Semiparametric Models.
Bodory, H. and Huber, M. (2021). causalweight: Estimation Methods for Causal Inference Based on Inverse Probability Weighting. R package.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters.
Kaufman, J. S. and Cooper, R. S. (2001). Commentary: considerations for use of racial/ethnic classification in etiologic research.
Kennedy, E. H. (2019). Nonparametric causal effects based on incremental propensity score interventions.
Kennedy, E. H., Balakrishnan, S., and Wasserman, L. (2021). Semiparametric counterfactual density estimation.
Kuhn, M. (2021). caret: Classification and Regression Training. R package.
Li, L., Tchetgen Tchetgen, E., van der Vaart, A., and Robins, J. M. (2011). Higher order inference on a treatment effect under low regularity conditions.
Miles, C. H. (2022). On the causal interpretation of randomized interventional indirect effects.
Murphy, S. A. (2003). Optimal dynamic treatment regimes.
Nabi, R., Malinsky, D., and Shpitser, I. (2019). Learning optimal fair policies.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference.
Pearl, J. (2001). Direct and indirect effects.
Pearl, J. and Verma, T. S. (1995). A theory of inferred causation.
Robins, J. M. (1986). A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect.
Robins, J. M. (2000). Robust estimation in sequentially ignorable missing data and causal inference models.
Robins, J. M. and Greenland, S. (1992). Identifiability and exchangeability for direct and indirect effects.
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed.
Robins, J. M., Li, L., Tchetgen Tchetgen, E., and van der Vaart, A. (2009). Quadratic semiparametric von Mises calculus.
Schölkopf, B. (2022). Causality for machine learning.
Shpitser, I. and Tchetgen Tchetgen, E. (2016). Causal inference with a graphical hierarchy of interventions.
Sprenger, J. (2018). Foundations of a probabilistic theory of causal strength.
Tingley, D., Yamamoto, T., Hirose, K., Keele, L., and Imai, K. (2014). mediation: R package for causal mediation analysis.
van der Laan, M. J. and Petersen, M. L. (2008). Direct effect models. The International Journal of Biostatistics.
van der Laan, M. J. and Robins, J. M. (2003). Unified Methods for Censored Longitudinal Data and Causality.
van der Vaart, A. W. (1998). Asymptotic Statistics.
Wright, S. (1934). The method of path coefficients. The Annals of Mathematical Statistics.
Young, J. G., Hernán, M. A., and Robins, J. M. (2014). Identification, estimation and approximation of risk under interventions that depend on the natural value of treatment using observational data.
Zhang, J. and Bareinboim, E. (2018). Non-parametric path analysis in structural causal models.
Zheng, W. and van der Laan, M. J. (2011). Cross-validated targeted minimum-loss-based estimation.