title: Causal Feature Selection via Orthogonal Search
authors: Soleymani, Ashkan; Raj, Anant; Bauer, Stefan; Besserve, Michel; Schölkopf, Bernhard
date: 2020-07-06

The problem of inferring the direct causal parents of a response variable among a large set of explanatory variables is of high practical importance in many disciplines. However, established approaches often scale at least exponentially with the number of explanatory variables, are difficult to extend to nonlinear relationships, and cannot easily handle cyclic data. Inspired by Debiased machine learning methods, we study a one-vs.-the-rest feature selection approach to discover the direct causal parents of the response. We propose an algorithm that works for purely observational data while also offering theoretical guarantees, including the case of partially nonlinear relationships, possibly in the presence of cycles. As it requires only one estimation per variable, our approach is applicable even to large graphs. We demonstrate significant improvements compared to established approaches.

Identifying causal relationships is a profound and hard problem pervading experimental sciences such as biology (Sachs et al., 2005), medicine (Castro et al., 2020), earth system sciences (Runge et al., 2019), or robotics (Ahmed et al., 2020). While randomized controlled interventional studies are considered the gold standard, they are in many cases ruled out by financial or ethical concerns (Pearl, 2009; Spirtes et al., 2000). In order to improve the understanding of a system and help design relevant interventions, the subset of causes that have a direct effect (direct causes / direct causal parents) often needs to be identified based on observations only. Let us consider the setup exemplified in Figure 1, corresponding to a linear structural equation model (SEM) for the response Y,

    Y = <θ, X> + U,    (1)

where U is an independent exogenous variable with zero mean, θ, X ∈ R^d, Y ∈ R, and <·, ·> denotes the inner product. We investigate how to find the direct causes of Y among a high-dimensional vector of covariates X, where the covariates have arbitrary nonlinear, possibly cyclic relationships among them. In other words, the causal structure of the covariates X is an arbitrary member of the class of uniquely solvable structural causal models (simple SCMs), possibly with hidden confounders (as long as there is no hidden confounder for the response variable). Unique solvability of an SCM amounts to not having self-cycles in the causal structure, but any other arbitrary nonlinear cyclic structure is allowed (Bongers et al., 2021). Practically speaking, almost all causal discovery applications lie under the umbrella of simple SCMs (Bollen, 1989; Sanchez-Romero et al., 2019). Besides, the assumption of not having self-cycles is usually regarded as non-limiting in the literature (Lacerda et al., 2012; Rothenhäusler et al., 2015; Bongers et al., 2016). Under our formulation, a given entry of θ should be non-zero if and only if the corresponding variable is a direct causal parent (Peters et al., 2017), e.g., X_1 and X_2 in Figure 1. We restrict ourselves to the setting of linear direct causal effects on Y (LDC, as specified in Equation (1)) and no feature descending from Y (NFD).
LDC is justified as an approximation when the effects of each causal feature are weak, so that possibly nonlinear effects can be linearized; NFD is justified in applications where we can exclude any influence of Y on a covariate. This is, for example, the case when X are genetic factors and Y is a particular trait/phenotype. Our method comes in handy in this case in particular, due to the relatively complex nonlinear cyclic structure of these genetic factors in high-dimensional regimes (Yao et al., 2015; Warrell & Gerstein, 2020). While applicable to full graph discovery rather than the simplified problem of finding causal parents, state-of-the-art methods for causal discovery often rely on strong assumptions or the availability of interventional data, or have prohibitive computational costs, as explained in Section 1.1 in more detail. In addition, and despite their strong assumptions, causal discovery methods may perform worse than simple regression baselines (Heinze-Deml et al., 2018; Janzing, 2019; Zheng et al., 2018). While plain regression techniques have appealing computational costs, they come without guarantees. When using unregularized least-squares regression to estimate θ, there can be infinitely many choices of θ with equivalent prediction accuracy for regressing Y, especially in the case of over-parametrized models. However, none of these choices provides any information about the features which, when intervened upon, directly cause the output variable Y. On the other hand, when using a regularized method such as the Lasso, a critical issue is the bias induced by regularization (Javanmard & Montanari, 2018).

Figure 1: Feature selection in our setting, for the case of two direct causal parents of Y, X_1 and X_2, out of the variables {X_1, ..., X_11}, such that Y = θ_1 X_1 + θ_2 X_2 + U, with U an independent zero-mean noise. We propose an approach to find X_1 and X_2 under the assumptions discussed in the text. A real-world example of this setup is finding the genes which directly cause a phenotype.

When the distinction between covariates and direct causes is known, Double ML approaches (Chernozhukov et al., 2018a) have shown promising bias-compensation results in the context of high-dimensional observed confounding of a single variable. In the present paper, we generalize them to the problem of finding direct causes. Our key contributions are:
• We show that, under the assumption that no feature of X is a child of Y, the Double ML (Chernozhukov et al., 2018) principle can be applied in an iterative and parallel way to find the subset of direct causes from observational data.
• Our approach has a computational complexity that is polynomial (fast) in the dimension d.
• Our method provides asymptotic guarantees that the set of direct causes can be recovered from observational data. Importantly, this result requires neither linear interactions among the covariates, nor faithfulness, nor an acyclic structure.
• Extensive experimental results demonstrate the state-of-the-art performance of our method. Our approach significantly outperforms all other methods (even though the underlying data-generating conditions favor them), especially in the case of nonlinear interactions between covariates, despite relying only on linear projections.

The question of finding direct causal parents is also addressed in the literature as mediation analysis (Baron & Kenny, 1986; Hayes, 2017; Shrout & Bolger, 2002).
Several principled approaches have been proposed, relying for instance on instrumental variables (IVs) (Angrist & Imbens, 1995; Angrist et al., 1996; Bowden & Turkington, 1990), to test for a single direct effect in the context of specific causal graphs. Extensions of the IV-based approach to generalized-IV approaches (Brito & Pearl, 2012; Van der Zander & Liskiewicz, 2016) are the closest known results to discovering direct causal parents. However, no algorithm is provided in Brito & Pearl (2012) to identify the instrumental set. Subsequently, an algorithm is provided in Van der Zander & Liskiewicz (2016) for discovering the instrumental set in the simple setting where all interactions are linear and the graph is acyclic. In contrast, our method allows nonlinear cyclic interactions amongst the variables. Several other works have also tried to address the problem of discovering causal features. Guyon & Aliferis (2007) review work on causal feature selection. More recent papers on causal feature selection have appeared since (Cawley, 2008; Paul, 2017; Yu et al., 2018), but none of them claims to recover all the direct causal parents, asymptotically or non-asymptotically, as we do in our case. There has been another line of work on inferring causal relationships from observational data, most of which requires strong assumptions such as faithfulness (Mastakouri et al., 2019; Pearl, 2009; Spirtes et al., 2000). Classical approaches along these lines include the PC algorithm (Spirtes et al., 2000), which can only reconstruct the network up to a Markov equivalence class. Another approach is to restrict the class of interactions among the covariates and the functional form of the signal-noise mixing (typically considered additive) or the distribution (e.g., non-Gaussianity) to achieve identifiability (see Hoyer et al., 2009; Peters et al., 2014); this includes linear approaches like LiNGAM (Shimizu et al., 2006) and nonlinear generalizations with additive noise (Peters et al., 2011). For a recent review of the empirical performance of structure learning algorithms and a detailed description of causal discovery methods, we refer to Heinze-Deml et al. (2018). Recently, there have been several attempts at solving the problem of causal inference by exploiting the invariance of a prediction under a causal model across different experimental settings (Ghassami et al., 2017). The computational cost of running these algorithms is exponential in the number of variables when aiming to discover the full causal graph. Our method mainly takes inspiration from the Debiased/Double ML method (Chernozhukov et al., 2018a), which utilizes the concept of orthogonalization to overcome the bias introduced by regularization. We will discuss this in detail in the next section. Considering a specific example, the Lasso suffers from the fact that the estimated coefficients are shrunk towards zero, which is undesirable (Tibshirani & Wasserman, 2017). To overcome this limitation, a debiasing approach was proposed for the Lasso in several papers (Javanmard & Montanari, 2014; Zhang & Zhang, 2014). However, unlike our approach, Debiased Lasso methods do not recover all the non-zero coefficients of the parameter vector θ under the generic assumptions of the present work. Before describing the proposed method, we briefly discuss Double ML and Neyman orthogonality in the next section, which will be helpful in building the theoretical framework for our method.
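To make the regularization bias concrete, here is a small, hypothetical simulation (not from the paper; it assumes scikit-learn's Lasso and an arbitrary sparse linear model) showing how Lasso coefficient estimates are shrunk towards zero relative to the ground truth. Compensating exactly this kind of bias is the purpose of the orthogonalization discussed in the next section.

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    rng = np.random.default_rng(0)
    n, d = 200, 50
    theta = np.zeros(d)
    theta[:3] = [2.0, -1.5, 1.0]        # hypothetical sparse ground-truth coefficients
    X = rng.normal(size=(n, d))
    Y = X @ theta + rng.normal(size=n)

    lasso = Lasso(alpha=0.2).fit(X, Y)  # regularized fit: coefficients shrunk towards zero
    ols = LinearRegression().fit(X, Y)  # unregularized fit: unbiased but higher variance

    print("truth        :", theta[:3])
    print("Lasso        :", np.round(lasso.coef_[:3], 2))  # noticeably smaller in magnitude
    print("least squares:", np.round(ols.coef_[:3], 2))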
Given a fixed set of policy variables D and control variables X acting as common causes of D and Y, we consider the partial regression model

    Y = D θ_0 + g_0(X) + U,    E[U | X, D] = 0,
    D = m_0(X) + V,            E[V | X] = 0,    (2)

where Y is the outcome variable, U, V are disturbances, and g_0, m_0 : R^d → R are (possibly nonlinear) measurable functions. An unbiased estimator of the causal effect parameter θ_0 can be obtained via the orthogonalization approach of Chernozhukov et al. (2018a), which relies on the "Neyman orthogonality condition" described below.

Neyman orthogonality condition: A traditional estimator of θ_0 in Equation (2) can be obtained simply by finding the zero of the empirical average of a score function φ such that φ(W; θ, g) = D (Y − Dθ − g(X)). However, this estimate of θ_0 is sensitive to the bias in the estimation of the function g. Neyman (1979) proposed an orthogonalization approach to obtain an estimate of θ_0 that is more robust to the bias in the estimation of the nuisance parameter (m_0, g_0). Denoting by η_0 the true nuisance parameter (which represents m_0 and g_0 in Equation (2)), the orthogonalized "score" function ψ should satisfy the property that its Gateaux derivative with respect to η vanishes when evaluated at the true parameter values:

    ∂_η E[ψ(W; θ_0, η)] |_{η = η_0} = 0.    (3)

The corresponding orthogonalized or Double/Debiased ML estimator θ̂_0 solves the empirical moment equation E_n[ψ(W; θ̂_0, η̂_0)] = 0, where η̂_0 is an estimator of η_0 and ψ satisfies the condition in Equation (3). For the partially linear model of Equation (2), the orthogonalized score function is

    ψ(W; θ, η) = (Y − Dθ − g(X)) (D − m(X)),    with η = (m, g).    (4)

From Double ML to causal discovery: The distinction between policy variables and confounding variables is not always known in advance, which motivates us to consider the more general setting of causal discovery. To this end, we consider a set of variables X = {X_1, X_2, ..., X_d} which includes the direct causal parents of the outcome variable Y as well as other variables. We also reiterate our assumption that the relationship between the outcome variable and its direct causal parents is linear; the relationships among the other variables can be cyclic and nonlinear. We now provide a general approach to scanning putative direct causes that scales polynomially in their number (see the Computational Complexity paragraph in the next section), based on the application of a statistical test and Double ML estimators. We first describe the algorithm and then provide theoretical support for its performance.

We provide pseudo-code for our proposed method (CORTH Features) in Algorithm 1. Intuitively, the idea is to perform a one-vs-rest split for each variable in turn and to estimate the link between that particular variable and the outcome variable using Double ML. To do so, we decompose Equation (1) by singling out one variable D = X_k as the policy variable and taking the remaining variables Z = X_{-k} = X \ X_k as multidimensional control variables, and run the Double ML estimation assuming the partial regression model presented in Section 2.1, which now takes the form

    Y = D θ_k + g(Z) + U,    D = m(Z) + V.    (5)

The step-wise description of our estimation algorithm is as follows: (a) Select one of the variables X_i to estimate its (hypothetical) linear causal effect θ on Y. (b) Set all of the other variables X_{-i} as the set of possible confounders. (c) Use the Double ML approach to estimate the parameter θ, i.e., the causal effect of X_i on Y. (d) If the variable X_i is not a causal parent, the distribution of the empirical estimate of the conditional covariance χ_i (Proposition 3) is a Gaussian centered around zero.
We use a simple normality test for χ̂_i to select or discard X_i as one of the direct causal parents of Y. We repeat the procedure iteratively on each of the variables until completion. Pseudo-code for the entire procedure is given below in Algorithm 1. Note that Equation (5) is not necessarily a correct structural equation model for the true underlying causal structure. In general, for instance when D actually causes Z, it is non-trivial to show that the Double ML estimate of the parameter θ_k will be unbiased (see Section 2.3).

Remarks on Algorithm 1: X_i^[k] is the vector of samples of variable i chosen in the k-th subsampling procedure. In general, the subscript i refers to the estimation for the i-th variable and the superscript k to the k-th subsampling procedure. [K] denotes the index set of the subsamples obtained after sample splitting. m_i^[\k] and q_i^[\k] are (possibly nonlinear) parametric functions fitted using all subsamples except the k-th (i.e., the 1st, ..., (k−1)-th, (k+1)-th, ..., K-th subsamples).

Algorithm 1: Efficient Causal Orthogonal Structure Search (CORTH Features)
1: Input: response Y ∈ R^N, covariates X ∈ R^{N×d}, significance level α, number of partitions K.
2: Split the N observations into K-fold random partitions I_k, for k = 1, 2, ..., K, each having n = N/K observations.
3: for i = 1, ..., d do
4:   for subsample k ∈ [K] do
5:     Fit m̂_i^[\k](·) for E[X_i | X_{-i}] and q̂_i^[\k](·) for E[Y | X_{-i}] on all subsamples except I_k.
6:     for each sample j ∈ I_k, with Z_kj = X_{-i} and D_kj = X_i for that sample, do
7:       V̂_ij ← D_kj − m̂_i^[\k](Z_kj)
8:       χ̂_ij ← (Y_kj − q̂_i^[\k](Z_kj)) V̂_ij
9:     χ̂_i^k ← empirical average of χ̂_ij over I_k; (σ̂_i^k)^2 ← empirical variance of χ̂_ij over I_k
10:  Aggregate χ̂_i and σ̂_i over the K folds and accept X_i as a direct causal parent if the test of Section 2.3 rejects χ_i = 0 at level α (with Bonferroni correction).

Computational complexity: For each subset randomly selected from the data, we fit two Lasso estimators. Accelerated coordinate descent (Nesterov, 2012) can be applied to optimize the Lasso objective. To achieve error ε, O(d √κ_max log(1/ε)) iterations are required, where κ_max is the maximum of the condition numbers of the two problems, and each iteration requires O(nd) computation. Hence, the computational complexity of running our approach is only polynomial in d.

Figure 2: A generic example of identification of a causal effect θ in the presence of causal and anti-causal interactions between the causal predictor and other putative parents, and possibly arbitrary cyclic and nonlinear assignments for all nodes except Y (see Proposition 2).

We now describe the execution of our algorithm for a simple graph with 3 nodes. Let us consider the following linear structural equation model as an example of our general formulation:

    Y := θ_1 X_1 + θ_2 X_2 + ε_3,    X_2 := a_12 X_1 + ε_2,    X_1 := ε_1.    (6)

Example 1. Consider the system whose structural equation model is given in Equation (6). If ε_1, ε_2 and ε_3 are independent, uncorrelated noise terms with zero mean, then Algorithm 1 recovers the coefficients θ_1 and θ_2. A detailed proof is given in Appendix A.1. While the estimation of the parameter θ_2 is in line with the assumed partial regression model of Equation (5), the estimation of θ_1 does not follow it. However, it can be seen from the proof that θ_1 can also be estimated from the orthogonal score in Equation (4). We now show that this result holds for the more general graph structure given in Figure 2, allowing for nonlinear cyclic interactions among features.

Proposition 2. Assume the partially linear Gaussian model of Figure 2. Denote by X_{-k} = [Z_1, Z_2] the control variables, by γ = (γ_1, γ_2, γ_12) the parameter vector of the (possibly nonlinear) assignments between putative parents of Y, and by β = (β_1, β_2) the vector of causal coefficients encoding the linear effects of X_{-k} on the outcome Y.
Then, independently of the γ parameters and of the functional form of the associated assignments between the parents of Y, the score

    ψ(W; θ, η) = (Y − θ X_k − β^T X_{-k}) (X_k − µ^T X_{-k}),    (7)

where µ denotes the coefficient vector of the linear regression (projection) of X_k on X_{-k}, satisfies the Neyman orthogonality condition for the estimation of θ with nuisance parameters η = (β, γ), which reads ∂_η E[ψ(W; θ_0, η)] |_{η = η_0} = 0.

Please refer to Appendix A.2 for the proof. Comparing the score in Equation (7) with the score in Equation (4), there are two takeaways from Proposition 2: (i) the orthogonality condition remains invariant irrespective of the causal direction between X_k and Z, and (ii) the second term in Proposition 2 replaces the function m by the (unbiased) linear regression estimator for modelling all the relations, given that the relation between Z and Y is linear, even if the relationships between Z and X_k are nonlinear (see Appendix B for concrete examples). Combined with the Double ML theoretical results (Chernozhukov et al., 2018a), this suggests that regularized predictors based on Lasso or ridge regression are tools of choice for fitting the functions (m, g).

Now that we have illustrated and justified the fitting procedure of our algorithm, we provide a theoretically grounded statistical decision criterion for the direct causes once the model has been fitted. We work with the model

    Y = <θ, X> + U,    (8)

where U is an exogenous variable and X_{-j} represents all the variables except X_j. The assumptions made with the above formulation are standard in the orthogonal machine learning literature (Smucler et al., 2019; Chernozhukov et al., 2018). For j ∈ {1, ..., d}, let X_{-j} be the vector equal to X but excluding coordinate j, and define θ_{-j} similarly. Define, for j ∈ {1, ..., d}, the expected conditional covariance of X_j and Y given X_{-j},

    χ_j = E[ (X_j − E[X_j | X_{-j}]) (Y − E[Y | X_{-j}]) ],

which also has the double robustness property (Chernozhukov et al., 2018; Smucler et al., 2019).

Proposition 3. Under the conditions given in Equations (8) to (10), χ_j = θ_j E[Var(X_j | X_{-j})]; in particular, χ_j = 0 whenever X_j is not a direct causal parent of Y, and χ_j ≠ 0 for every direct causal parent X_j with E[Var(X_j | X_{-j})] > 0.

Proof. Take j ∈ PA_Y. Then, from Equation (8),

    Y − E[Y | X_{-j}] = θ_j (X_j − E[X_j | X_{-j}]) + (U − E[U | X_{-j}]),

and thus

    χ_j = θ_j E[ (X_j − E[X_j | X_{-j}])^2 ] + E[ (X_j − E[X_j | X_{-j}]) (U − E[U | X_{-j}]) ].

Let G be the sub-sigma-algebra generated by X_{-j}; under our assumptions E[X_j | X_{-j}] is an element of L^2(Ω, G), and since U is exogenous and independent of X, the second right-hand-side term of the above equation vanishes, which gives the result. For j ∉ PA_Y, the same decomposition holds with θ_j = 0, so χ_j = 0.

There are two main implications of the result in Proposition 3. (i) χ_j is non-zero only for direct causal parents of the outcome variable, and χ_j has the double robustness property shown in (Smucler et al., 2019; Chernozhukov et al., 2018). Double robustness means that, when computing the empirical version of χ_j, which we denote χ̂_j, one can use regularized methods like ridge regression or the Lasso to estimate the conditional expectations, and then perform a statistical test on top of the estimate to decide between zero and non-zero. (ii) In line with the orthogonal score results above, this quantity can be estimated using linear (unbiased) regression to fit the conditional expectations, although the interactions between features may be nonlinear.

Next, we discuss the variance of our estimator, so that a statistical test can later be used to distinguish zero from non-zero χ_j. For convenience, the case of two partitions (K = 2) is explained here. Randomly split the data into two halves, say D_{n_1} and D_{n_2}, and take j ∈ {1, ..., d}. The conditional expectations E[Y | X_{-j}] and E[X_j | X_{-j}] can be treated as regularized regression problems; we use the Lasso as the estimator of the conditional expectations (Equation (12)) in the experiments. Now, we compute the empirical estimates of χ_j: on each half k = 1, 2, using the nuisance estimates fitted on the other half,

    χ̂_j^k = P_{n_k}[ (X_j − m̂_j^[\k](X_{-j})) (Y − q̂_j^[\k](X_{-j})) ],

where P_{n_k} denotes the empirical average over D_{n_k} and (σ̂_j^k)^2 denotes the corresponding empirical variance of χ_j.
Finally, let χ̂_j denote the average of the two fold estimates and σ̂_j the corresponding empirical standard deviation. Theorem 1 of Smucler et al. (2019) provides conditions under which (see also Chernozhukov et al. (2018)), when the estimators are Lasso-type regularized linear regressions, it holds asymptotically that

    √n (χ̂_j − χ_j) / σ̂_j → N(0, 1)   in distribution.

In this case, the test that rejects χ_j = 0 when |χ̂_j| ≥ 1.96 σ̂_j / √n has approximately a 95% confidence level, and the probability of rejecting the null hypothesis when it is false (the power of the test) tends to one as the sample size grows. In order to account for multiple testing, we use a Bonferroni correction.

Comments about the estimator: In this paper, we use the Lasso for the nuisance parameter estimation because the variance of the conditional covariance is known in this case (Smucler et al., 2019). One can also use other estimators instead, provided one obtains a sufficiently accurate estimate of the nuisance parameter (up to an N^{-1/4}-neighbourhood (Chernozhukov et al., 2018a)) with the correct variance term; this is beyond the scope of this paper.

Conditional independence tests: Asymptotically, conditional independence testing between Y and X_j given X_{-j} is also a possible solution to our problem. Indeed, d-separation rules imply that true causes are conditionally dependent according to this test, while non-causes are conditionally independent (because X_{-j} is not a collider under our NFD assumption). However, conditional independence testing is challenging in high-dimensional/nonlinear settings, and kernel-based conditional independence testing is computationally expensive (Zhang et al., 2012). We use χ_j in this paper because it was already known from previous work (Smucler et al., 2019; Chernozhukov et al., 2018b) to have the double robustness property, which means one can use regularized methods like the Lasso to estimate the empirical conditional expectation from a finite number of samples while the empirical estimator remains unbiased with controlled variance. Our work is related to the recent work of Shah & Peters (2020), which proposes a conditional independence test whose proofs rely heavily on Chernozhukov et al. (2018a). In this paper, we use such Double-ML-based tests for the first time for the search problem.

In this section, we perform an extensive empirical evaluation of our method. For every combination of number of nodes (#nodes), connectivity (p_s), noise level (σ^2), number of observations (z), and nonlinear probability (p_n) (see Table C.1), 100 examples (DAGs) are generated and stored as csv files (altogether 72,000 DAGs are simulated, comprising a dataset of overall >10 GB). For each DAG, z samples are generated. We provide more details about the parameters (#nodes, p_s, p_n and z) and the data generation process in Appendix C.1. For future benchmarking, the generated files will be made available together with the code. The baselines we compare our method against are: LiNGAM (Shimizu et al., 2006), order-independent PC (Colombo & Maathuis, 2014), rankPC, MMHC (Tsamardinos et al., 2006), GES (Chickering, 2003), rankGES, ARGES (adaptively restricted GES (Nandy et al., 2016)), rankARGES, FCI+ (Claassen et al., 2013), PCI (Shah & Peters, 2020), and the Lasso (Tibshirani, 1996), all of which are suitable for observational data. The CompareCausalNetworks R package is used to run most of the baseline methods. We use 10-fold cross-validation to choose the parameters of all approaches. Recall, Fall-out, Critical Success Index, Accuracy, F1 Score, and the Matthews correlation coefficient (Matthews, 1975) are considered as evaluation metrics. These metrics are described in Appendix C.2.
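Before turning to the results, the following minimal Python sketch illustrates the structure of the procedure described above (the one-vs-rest Double ML fits of Algorithm 1 combined with the normal-approximation test and Bonferroni correction of this section). It assumes scikit-learn's LassoCV for the nuisance regressions; the function name, the per-fold variance aggregation, and other details are simplifications for illustration, not the authors' released implementation.

    import numpy as np
    from sklearn.linear_model import LassoCV
    from scipy.stats import norm

    def corth_features(X, Y, alpha=0.05, K=2, seed=0):
        """Sketch of CORTH Features: for each X_j, cross-fit Lasso estimates of
        E[Y | X_-j] and E[X_j | X_-j], form chi_j = mean[(Y - qhat)(X_j - mhat)],
        and test chi_j = 0 with a normal approximation and Bonferroni correction."""
        n, d = X.shape
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(n), K)
        selected, z_scores = [], np.zeros(d)
        for j in range(d):
            Z = np.delete(X, j, axis=1)                 # controls X_-j
            D = X[:, j]                                 # putative cause X_j
            chi_folds, var_folds = [], []
            for k in range(K):
                test = folds[k]
                train = np.concatenate([folds[l] for l in range(K) if l != k])
                q = LassoCV(cv=5).fit(Z[train], Y[train])   # estimate of E[Y | X_-j]
                m = LassoCV(cv=5).fit(Z[train], D[train])   # estimate of E[X_j | X_-j]
                prod = (Y[test] - q.predict(Z[test])) * (D[test] - m.predict(Z[test]))
                chi_folds.append(prod.mean())
                var_folds.append(prod.var())
            chi_hat = np.mean(chi_folds)
            sigma_hat = np.sqrt(np.mean(var_folds))
            z_scores[j] = np.sqrt(n) * chi_hat / sigma_hat
            # two-sided normal test, Bonferroni-corrected over the d variables
            if abs(z_scores[j]) > norm.ppf(1 - alpha / (2 * d)):
                selected.append(j)
        return selected, z_scores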
As the direction of possible causes is determined in our setting, non-directional edges to the target variable inferred by some baselines (e.g., PC) are evaluated as direct causes of the target variable.

Regression technique and hyper-parameters: We use the Lasso as the estimator of the conditional expectations for our method because the variance bound for χ_j with a Lasso-type estimator of the conditional expectation (Equation (12)) is provided in Equation (11). Furthermore, using more than 2 splits in the experiments increases the performance of the parameter estimation relative to 2 splits.

Results aggregated by the number of nodes (18,000 simulations per entry in the table), by connectivity level (24,000 simulations per entry in the table), and by the number of observations (24,000 simulations per entry in the table) are shown in Tables 1 to 3, respectively. Our method performs better than the competing baselines in terms of accuracy and F1 score, especially for more connected structures, even though the data are generated according to DAG causal structures, which, unlike for our method, is an essential condition for the baselines. To provide a visual comparison, we plot the accuracy of all methods w.r.t. the connectivity parameter (p_s) in Figure 3 for different values of p_n and σ^2 on 1800 samples. It can be observed that the accuracies of the competing baselines drop significantly with increasing noise level and nonlinearity, while our method is more robust to both. We also extensively compare all the metrics (Recall, Fall-out, Critical Success Index, Accuracy, F1 Score, and Matthews correlation coefficient) for all methods in Appendix C.3. According to these metrics, our approach performs better than the baselines in most cases, regardless of the set of parameters used for generating the data. Our method shows in particular stability in performance w.r.t. the number of nodes (Table C.3), partially nonlinear relationships (Table C.4), connectivity (Table C.2), number of observations (Table C.6), and noise level (Table C.5). We also show the plot of the parameter estimates for direct causal parents vs. non-causal variables in Figure 4. In the plots and tables, we denote our approach as CORTH Features.

Figure 5 shows the runtime of the method in seconds as a function of the graph's size. Notice that the runtime of our algorithm in the log-log plot is roughly linear, supporting our statement above about the computational time being polynomial in d. Since we used 5000 observations, any additional overhead comes from cross-validation.

We also apply our algorithm to a recent COVID-19 dataset (Einstein, 2020), where the task is to predict COVID-19 cases (confirmed using RT-PCR) amongst suspected ones. For an existing and extensive analysis of the dataset with predictive methods, we refer to Schwab et al. (2020). We apply our algorithm to discover the features which directly cause the diagnosed infection. We found that the following were the most common causes across different runs of our approach: Patient age quantile, Arterial Lactic Acid, Promyelocytes, and Base excess (venous blood gas analysis). Lacking medical ground truth, we report these not as corroboration of our approach but rather as a potential contribution to causal discovery in this challenging problem. It is encouraging that some of these variables are consistent with other studies (Schwab et al., 2020). Details on data preprocessing and more results are available in Appendix D.
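As a usage illustration (assuming the corth_features sketch above is in scope), the following hypothetical example generates data in the spirit of Figure 1, where only X_1 and X_2 are direct causal parents of Y while some other covariates depend nonlinearly on them, and reports which variables the test selects; the coefficients and graph among the covariates are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 5000, 11
    theta = np.zeros(d)
    theta[0], theta[1] = 2.0, -1.5       # only X_1 and X_2 are direct parents of Y
    X = rng.normal(size=(n, d))
    X[:, 2] += np.tanh(X[:, 0])          # X_3: nonlinear child of X_1 (not of Y)
    X[:, 3] += 0.5 * X[:, 1] ** 2        # X_4: nonlinear child of X_2
    Y = X @ theta + rng.normal(size=n)   # exogenous noise independent of X (LDC and NFD hold)

    parents, z = corth_features(X, Y, alpha=0.05, K=2)
    print("selected parents (0-indexed):", parents)   # ideally [0, 1]
    print("z-scores:", np.round(z, 1))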
A recent empirical evaluation of different causal discovery methods highlighted the desirability of more efficient search algorithms (Heinze-Deml et al., 2018). In the present work, we provide identifiability results for the set of direct causal parents, including the case of partially nonlinear cyclic models, as well as a highly efficient algorithm that scales well w.r.t. the number of variables and exhibits state-of-the-art performance across extensive experiments. Our approach builds on the Double ML method for the partial regression setting of Chernozhukov et al. (2018a); however, we show that it can be applied to different underlying causal structures, which is key for the purpose of search, as this structure is not always known in advance. While not amounting to full causal graph discovery, identification of the causal parents is of major interest in real-world applications, e.g., when assaying the causal influence of genes on a phenotype. A natural direction worth exploring is to extend this approach to discovering direct causal parents when nonlinear relationships exist between the output variable and its direct causal parents.

Proof of Example 1. Let us start with the easier case first (see Figure A.1) and estimate the coefficient of the interaction between X_2 and Y: the estimation of θ_2 is unbiased because this setting precisely matches the Double ML setting. However, we will see in this example that, at the population level, θ_1 can be recovered as well. Let us write down the structural equation model first:

    Y := θ_1 X_1 + θ_2 X_2 + ε_3,    X_2 := a_12 X_1 + ε_2,    X_1 := ε_1.

Denote E[ε_1^2] = σ_1^2 and E[ε_2^2] = σ_2^2. From these equations we have E[X_1^2] = σ_1^2, E[X_1 X_2] = a_12 σ_1^2, and E[X_2^2] = a_12 E[X_1 X_2] + E[ε_2 X_2] = a_12^2 σ_1^2 + σ_2^2. Let us first find the regression coefficient obtained by fitting Y on X_2, i.e., Y = θ̂_2 X_2 + η_1. Similarly, fitting X_1 on X_2 gives X_1 = â_12^{-1} X_2 + η_2, with residual V = X_1 − â_12^{-1} X_2. Plugging these population regression coefficients and the residual V into the orthogonal score of Equation (4) and solving the resulting moment equation for θ shows, after a short calculation, that the resulting estimator equals θ_1.

In this section, we use the generic example shown in Figure 2, which we show again in Figure A.2, to illustrate the role of interactions between the covariates in the proposed causal discovery algorithm. The estimator discussed can simply be derived from the Neyman orthogonality condition. We now provide the proof of Proposition 2 below. For the sake of completeness, we also restate the proposition.

Proposition 4 (Restatement of Proposition 2). Assume the partially linear Gaussian model of Fig. A.2. Denote by X_{-k} = [Z_1, Z_2] the control variables, by γ = (γ_1, γ_2, γ_12) the parameter vector of the (possibly nonlinear) assignments between putative parents of Y, and by β = (β_1, β_2) the vector of causal coefficients encoding the linear effects of X_{-k} on the outcome Y. Then, independently of the γ parameters and of the functional form of the associated assignments between parents of Y, the score

    ψ(W; θ, η) = (Y − θ X_k − β^T X_{-k}) (X_k − µ^T X_{-k}),    (16)

with µ the coefficient vector of the linear regression of X_k on X_{-k}, satisfies the Neyman orthogonality condition for the estimation of θ with nuisance parameters η = (β, γ), which reads ∂_η E[ψ(W; θ_0, η)] |_{η = η_0} = 0.

Proof of Proposition 2.
Using the global Markov factorization for simple SCMs (Forré & Mooij, 2017; Bongers et al., 2021), P(W; θ, η) = P(Y | X_{-k}, X_k; θ, β) P(X_{-k}, X_k; γ). Due to the linearity and Gaussianity of the assignment of Y, we obtain a negative log-likelihood of the form (up to additive constants)

    (Y − θ X_k − β^T X_{-k})^2 / (2σ^2) + f(X_{-k}, X_k; γ),

where f stands for the negative log-likelihood of the second factor. Following Chernozhukov et al. (2018a) [eq. (2.7)], this leads to a Neyman orthogonal score of the form ψ(W; θ, η) = (Y − θ X_k − β^T X_{-k}) (X_k − µ^T X_{-k}). Following eq. (2.8) of the same paper, we derive the expression of µ as µ = E[X_{-k} X_{-k}^T]^{-1} E[X_{-k} X_k], i.e., µ^T X_{-k} is the linear regression (projection) of X_k on X_{-k}. Reintroducing µ into the expression of ψ leads to the result.

The result discussed in Proposition 2 is not directly intuitive. In simple words, there are two takeaways from Proposition 2: (i) the orthogonality condition remains invariant irrespective of the causal direction between X_k and Z, and (ii) the second term in Equation (16) suggests using a linear estimator for modeling all the relations, given that the relation between Z and Y is linear. To build more intuition, we provide a few examples. Let us go back to the three-variable interaction, now assuming the following structural equation model:

    Y := θ_1 X_1 + θ_2 X_2 + ε_3,    X_2 := f(X_1) + ε_2,    X_1 := ε_1,

where f is a nonlinear function and ε_1, ε_2 and ε_3 are zero-mean Gaussian noises.
• Consider the case f(x) = x^2. The goal is to estimate the parameter θ_1, whose estimate we call θ̂_1. We follow the standard Double ML procedure with policy variable X_1 and control X_2, although the ground-truth causal dependency X_1 → X_2 is in contradiction with that setting (see Equation (2)). The linear regression estimate of the effect of X_2 on Y, which we call θ̂_2, is θ̂_2 = E[X_2 Y] / E[X_2^2]. Similarly, we estimate X_1 = α X_2 + η, from which we get α = E[X_1 X_2] / E[X_2^2]. It is easy to see that E[X_1 X_2] = E[X_1^3] = 0. Hence α = 0, and it is easy to see that θ̂_1 = θ_1.
• Consider now the more general case where f is any nonlinear function. As in the previous example, the goal is to estimate θ_1. We compute the linear regression estimates θ̂_2 = E[X_2 Y] / E[X_2^2] and α = E[X_1 X_2] / E[X_2^2] as above and substitute them into the orthogonality condition of Equation (16); solving the resulting moment equation for θ again yields θ̂_1 = θ_1.
From the above two examples, it is clear that even though the internal relations between the variables are nonlinear, all we need is an unbiased linear estimate to estimate the causal parameter.

For every combination of number of nodes (#nodes), connectivity (p_s), noise level (σ^2), number of observations (z), and nonlinear probability (p_n) (see Table C.1), we generate 100 DAGs and, for each DAG, z observations, with Gaussian noise of variance σ^2 propagated starting from the roots of the DAG. For future benchmarking, the generated files will be made available together with the code. We generate DAGs (directed acyclic graphs) in multiple steps: i) a random permutation of the nodes is chosen as a topological order of the DAG. ii) Based on this order, directed edges are added from each node to its successors with a certain probability p_s (connectivity). iii) For each observation, values are assigned to the nodes according to the topological order of the DAG, in such a way that each node's value is determined by summing transformations (linear, or nonlinear with a certain probability p_n) of the values of its direct causes, with the addition of Gaussian noise. The nonlinear transformation used is α tanh(βx), with α = 0.5 and β = 1.5. If the set of parents of a node X' is denoted PA_{X'}, as before, then the value assignment for X' is

    X' = Σ_{X ∈ PA_{X'}} [ (1 − ι(X)) θ X + ι(X) α tanh(β X) ] + ε,

where ε ∼ N(0, σ^2), in which σ^2 represents the noise level, and ι(X) is an indicator function which decides between a linear or a nonlinear contribution of X to X' (a small code sketch of this generator follows below).
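A minimal sketch of this data-generating process, under the parameters just described, is given below; the function name and interface are illustrative assumptions rather than the released benchmark code, and the choice of which node to treat as the response Y is left to the caller.

    import numpy as np

    def simulate_dag(n_nodes, p_s, p_n, sigma2, n_obs, theta=2.0, alpha=0.5, beta=1.5, seed=0):
        """Sketch of the Appendix C.1 generator: random topological order, edges with
        probability p_s, each node equal to the sum over its parents of a linear
        (coefficient theta) or nonlinear (alpha*tanh(beta*x), with probability p_n)
        contribution, plus N(0, sigma2) noise."""
        rng = np.random.default_rng(seed)
        order = rng.permutation(n_nodes)                   # step i): topological order
        adj = np.zeros((n_nodes, n_nodes), dtype=bool)     # adj[i, j] means edge i -> j
        for a in range(n_nodes):                           # step ii): forward edges with prob p_s
            for b in range(a + 1, n_nodes):
                adj[order[a], order[b]] = rng.random() < p_s
        X = np.zeros((n_obs, n_nodes))
        for b in range(n_nodes):                           # step iii): assign values in order
            j = order[b]
            X[:, j] = rng.normal(scale=np.sqrt(sigma2), size=n_obs)
            for i in np.flatnonzero(adj[:, j]):
                if rng.random() < p_n:                     # nonlinear contribution of parent i
                    X[:, j] += alpha * np.tanh(beta * X[:, i])
                else:                                      # linear contribution of parent i
                    X[:, j] += theta * X[:, i]
        return X, adj

    # Example: one dataset with 10 nodes, connectivity 0.3, half-nonlinear edges.
    X, adj = simulate_dag(n_nodes=10, p_s=0.3, p_n=0.5, sigma2=1.0, n_obs=1800)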
We decide the value of ι(X) by generating a binary random number which is 1 with probability p_n and 0 with probability 1 − p_n. (The resulting node values in the experiments are not concentrated around zero and reach up to tens of thousands for large graphs (~50 nodes); taking the behaviour of α tanh(βx) for relatively large inputs into account, this is a good representative of nonlinear relationships.) In the experiments we vary the connectivity parameter, the number of nodes in the graph, the nonlinear probability, the number of observations, and the noise level, and generate 100 graphs for each setting. The value of θ is set to 2 for the small DAGs (number of nodes equal to 5 or 10) and to 0.5 for the large DAGs (number of nodes equal to 20 or 50), to avoid the explosion of values that can occur in large graphs. We vary and investigate the effect of nonlinear relationships, the number of nodes, the number of observations, the connectivity, and the noise level while simulating the data. We summarize the factors of the data generation in Table C.1.

Correctly and incorrectly inferred direct causes are counted as true and false, respectively. Letting the total numbers of true positives, false positives, true negatives, and false negatives be denoted by TP, FP, TN, and FN, we evaluate our method using the following metrics:
• Recall (true positive rate): TPR = TP / (TP + FN)
• Fall-out (false positive rate): FPR = FP / (FP + TN)
• Critical Success Index (CSI), also known as Threat Score: CSI = TP / (TP + FN + FP)
• Accuracy: (TP + TN) / (TP + TN + FP + FN)
• F1 Score, the harmonic mean of precision and sensitivity: F1 = 2TP / (2TP + FP + FN)
• Matthews correlation coefficient (MCC), a metric for evaluating the quality of binary classification introduced in (Matthews, 1975): MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))

In some rare cases, we encountered zero-divided-by-zero and divided-by-zero situations for some of these metrics. In these situations, the scores are reported as 1 and 0, respectively, while Fall-out is reported as 0 and 1, respectively.

In this section, supplementary tables supporting the superior performance of CORTH Features compared to well-established methods are provided (see Tables C.2 to C.6). This superiority is consistent w.r.t. connectivity (Table C.2), number of nodes (Table C.3), number of observations (Table C.6), nonlinearity (Table C.4), and noise (Table C.5), using different evaluation metrics.

The preprocessing stage for this dataset is the same as in (Schwab et al., 2020), except that, for each target variable, upsampling is used to resolve the data imbalance. The results obtained by leveraging CORTH Features are surprisingly consistent with (Schwab et al., 2020), which demonstrates the ability of this method in feature selection. The selected features are indicated in Tables D.1 to D.4.

References:
• Double/de-biased machine learning using regularized Riesz representers
• Learning L2 continuous regression functionals via regularized Riesz representers
• Optimal structure identification with greedy search
• Learning sparse causal models is not NP-hard
• Order-independent constraint-based causal structure learning
• Diagnosis of COVID-19 and its clinical spectrum
• Markov properties for graphical models with cycles and latent variables
• Learning causal structures using regression invariance
• Causal feature selection. In Computational methods of feature selection
• Introduction to mediation, moderation, and conditional process analysis: A regression-based approach
• Causal structure learning
• Nonlinear causal discovery with additive noise models
• Causal regularization
• Confidence intervals and hypothesis testing for high-dimensional regression
• Debiasing the Lasso: Optimal sample size for Gaussian designs
• Discovering cyclic causal models by independent components analysis
• Selecting causal brain features with a single conditional independence test per feature
• Comparison of the predicted and observed secondary structure of T4 phage lysozyme
• Methods for causal inference from gene perturbation experiments and validation
• High-dimensional consistency in score-based and hybrid structure learning
• Efficiency of coordinate descent methods on huge-scale optimization problems
• C(α) tests and their use
• Feature selection as causal inference: Experiments with text classification
• Identifiability of causal graphs using functional models
• Causal discovery with continuous additive noise models
• Causal inference by using invariant prediction: identification and confidence intervals
• Elements of causal inference: foundations and learning algorithms
• backShift: Learning causal cyclic graphs from unknown shift interventions
• Characterization of parameters with a mixed bias property
• Inferring causation from time series in Earth system sciences
• Causal protein-signaling networks derived from multiparameter single-cell data
• Estimating feedforward and feedback effective connections from fMRI time series: Assessments of statistical methods
• predCOVID-19: A systematic study of clinical predictive models for coronavirus disease
• The hardness of conditional independence testing and the generalised covariance measure
• A linear non-Gaussian acyclic model for causal discovery
• Mediation in experimental and nonexperimental studies: New procedures and recommendations
• A unifying approach for doubly-robust ℓ1-regularized estimation of causal contrasts
• Causation, prediction, and search
• Regression shrinkage and selection via the lasso
• Sparsity, the lasso, and friends
• The max-min hill-climbing Bayesian network structure learning algorithm
• On searching for generalized instrumental variables
• Cyclic and multilevel causation in evolutionary processes
• Prior knowledge driven Granger causality analysis on gene regulatory network discovery
• A unified view of causal and non-causal feature selection
• Confidence intervals for low dimensional parameters in high dimensional linear models
• Kernel-based conditional independence test and application in causal discovery
• DAGs with NO TEARS: Continuous optimization for structure learning