Analysis of cause-effect inference by comparing regression errors

Patrick Blöbaum 1, Dominik Janzing 2, Takashi Washio 1, Shohei Shimizu 3 and Bernhard Schölkopf 2
1 Osaka University, Osaka, Japan
2 MPI for Intelligent Systems, Tübingen, Germany
3 Shiga University, Shiga, Japan

ABSTRACT
We address the problem of inferring the causal direction between two variables by comparing the least-squares errors of the predictions in both possible directions. Under the assumption of an independence between the function relating cause and effect, the conditional noise distribution, and the distribution of the cause, we show that the errors are smaller in causal direction if both variables are equally scaled and the causal relation is close to deterministic. Based on this, we provide an easily applicable algorithm that only requires a regression in both possible causal directions and a comparison of the errors. The performance of the algorithm is compared with various related causal inference methods on different artificial and real-world data sets.

Subjects: Artificial Intelligence, Data Mining and Machine Learning
Keywords: Causality, Causal discovery, Cause-effect inference

INTRODUCTION
Causal inference (Spirtes, Glymour & Scheines, 2000; Pearl, 2009) is becoming an increasingly popular topic in machine learning. The results are often not only of interest for predicting the outcome of potential interventions, but also for general statistical and machine learning applications (Peters, Janzing & Schölkopf, 2017). While the causal relationship between variables can generally be discovered by performing specific randomized experiments, such experiments can be very costly, infeasible or unethical (further discussions about ethics in randomized experiments, especially in the context of clinical trials, can be found in Rosner (1987)). In particular, the identification of the causal direction between two variables without performing any interventions is a challenging task. However, recent developments in causal discovery allow, under certain assumptions, inference of the causal direction between two variables purely based on observational data (Kano & Shimizu, 2003; Comley & Dowe, 2003; Shimizu et al., 2006; Sun, Janzing & Schölkopf, 2006; Zhang & Hyvärinen, 2009; Hoyer et al., 2009; Janzing, Sun & Schölkopf, 2009; Daniušis et al., 2010; Peters, Janzing & Schölkopf, 2011; Janzing et al., 2012; Sgouritsa et al., 2015; Mooij et al., 2016; Marx & Vreeken, 2017).

In the present work, we further contribute to causal discovery in an unconfounded bivariate setting based on observational data, where one variable is the cause and the other variable is the effect. That is, given observed data X, Y drawn from a joint distribution $p_{X,Y}$, we are interested in inferring whether X caused Y or Y caused X. In this sense, we define X as the cause and Y as the effect if intervening on X changes the distribution of Y.
In the following, we use the term 'causal inference' to refer to the identification of the true causal direction. A possible application is the discovery of molecular pathways, which relies on the identification of causal molecular interactions in genomics data (Statnikov et al., 2012). Other examples in biomedicine where observational data can be used for causal discovery are discussed in the work by Ma & Statnikov (2017). An example of a bivariate relationship is provided in Fig. 1, where the national income of 194 countries is compared with the life expectancy at birth (the data is taken from https://webdav.tuebingen.mpg.de/cause-effect/ and further discussed in Mooij et al. (2016)). Here, the causal relationship is not immediately obvious. It has been argued that richer countries have a better health care system than poorer countries; hence, a higher national income leads to a higher life expectancy (Mooij et al., 2016). Based on the plots, this causal relationship is not clear at all. Nevertheless, we provide a way to correctly determine the causal direction by only using these data points.

Figure 1: A comparison of the national income of 194 countries and the life expectancy at birth. (A) The national income on the x-axis and the life expectancy on the y-axis. (B) The life expectancy on the x-axis and the national income on the y-axis.

Conventional approaches to causal inference rely on conditional independences and therefore require at least three observed variables. Given the observed pattern of conditional dependences and independences, one infers a class of directed acyclic graphs (DAGs) that is compatible with the respective pattern (subject to the Markov condition and faithfulness assumption (Spirtes, Glymour & Scheines, 2000; Pearl, 2009)). Whenever there are causal arrows that are common to all DAGs in the class, conditional (in)dependences yield definite statements about causal directions. In a bivariate setting, however, we rely on asymmetries between cause and effect that are already apparent in the bivariate distribution alone. One kind of asymmetry is given by restricting the structural equations relating cause and effect to a certain function class: for linear relations with non-Gaussian independent noise, the linear non-Gaussian acyclic model (LiNGAM) (Shimizu et al., 2006) provides a method to identify the correct causal direction. For nonlinear relations, the additive noise model (ANM) (Hoyer et al., 2009) and its generalization to post-nonlinear models (PNL) (Zhang & Hyvärinen, 2009) identify the causal direction by assuming an independence between cause and noise, where, apart from some exceptions such as the bivariate Gaussian, a model can only be fit in the correct causal direction such that the input is independent of the residual.
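To make the ANM idea just described more concrete, the following sketch (our own illustration, not part of this work and not the exact procedure of Hoyer et al. (2009)) fits a nonlinear regression in each direction and scores the dependence between input and residual with a simple biased HSIC estimate using Gaussian kernels; the direction with the weaker dependence is preferred. All function names, the polynomial regression class and the bandwidth heuristic are our own choices.

```python
import numpy as np

def _gram(v, bandwidth):
    """Gaussian kernel Gram matrix of a 1-D sample."""
    d = v[:, None] - v[None, :]
    return np.exp(-(d ** 2) / (2.0 * bandwidth ** 2))

def hsic(a, b):
    """Biased HSIC estimate measuring dependence between two 1-D samples."""
    n = len(a)
    bw_a, bw_b = np.std(a) + 1e-12, np.std(b) + 1e-12   # crude bandwidth heuristic
    K, L = _gram(a, bw_a), _gram(b, bw_b)
    H = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def anm_direction(x, y, degree=3):
    """ANM-style decision: prefer the direction whose residual depends less on the input."""
    f = np.polynomial.Polynomial.fit(x, y, degree)       # regression X -> Y
    g = np.polynomial.Polynomial.fit(y, x, degree)       # regression Y -> X
    score_xy = hsic(x, y - f(x))                         # dependence between input and residual
    score_yx = hsic(y, x - g(y))
    return "X causes Y" if score_xy < score_yx else "Y causes X"
```

A full ANM implementation would additionally use a proper independence test (for instance, an HSIC permutation test) rather than a raw comparison of the two scores.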
Further recent approaches for the bivariate setting are based on an informal independence assumption stating that the distribution of the cause (denoted by $p_C$) contains no information about the conditional distribution of the effect given the cause (denoted by $p_{E|C}$). Here, the formalization of 'no information' is a challenging task. For the purpose of foundational insights (rather than for practical purposes), Janzing & Schölkopf (2010) and Lemeire & Janzing (2012) formalize the idea via algorithmic information and postulate that knowing $p_C$ does not enable a shorter description of $p_{E|C}$ and vice versa. Using algorithmic information theory, one can, for instance, show that the algorithmic independence of $p_C$ and $p_{E|C}$ implies

$$K(p_C) + K(p_{E|C}) \le K(p_E) + K(p_{C|E}), \qquad (1)$$

where K denotes the description length of a distribution in terms of its Kolmogorov complexity (for details see Section 4.1.9 in Peters, Janzing & Schölkopf (2017)). In this sense, appropriate independence assumptions between $p_C$ and $p_{E|C}$ imply that $p_{E,C}$ has a simpler description in causal direction than in anticausal direction. An approximation of (1) is given by the SLOPE algorithm in the work by Marx & Vreeken (2017), where regression is utilized to estimate and compare the approximated Kolmogorov complexities. For this, a logarithmic error is used, which is motivated by a minimum description length perspective. Another work that is inspired by the independence assumption is the information-geometric approach for causal inference (IGCI) (Janzing et al., 2012). IGCI provides a method to infer the causal direction in deterministic nonlinear relationships subject to a certain independence condition between the slope of the function and the distribution of the cause. A related but different independence assumption is also used by a technique called unsupervised inverse regression (CURE) (Sgouritsa et al., 2015), where the idea is to estimate a prediction model for both possible causal directions in an unsupervised manner, i.e., only the input data is used for the training of the prediction models. With respect to the above independence assumption, the effect data may contain information about the relation between cause and effect that can be employed for predicting the cause from the effect, but the cause data alone does not contain any information that helps the prediction of the effect from the cause (as hypothesized in Schölkopf et al. (2013)). Accordingly, the unsupervised regression model in the true causal direction should be less accurate than the prediction model in the wrong causal direction.

In our approach, we address the causal inference problem by exploiting an asymmetry in the mean-squared error (MSE) of predicting the cause from the effect and the effect from the cause, respectively, and show that, under appropriate assumptions and in the regime of almost deterministic relations, the prediction error is smaller in causal direction.
A preliminary version of this idea can be found in Blöbaum, Washio & Shimizu (2017) and Blöbaum, Shimizu & Washio (2017), but in these works the analysis is based on a simple heuristic assuming that the regression of Y on X and the regression of X on Y yield functions that are inverse to each other, which holds approximately in the limit of small noise. Moreover, the analysis is also based on the assumption of an additive noise model in causal direction and on having prior knowledge about the functional relation between X and Y, which makes it impractical for generic causal inference problems. In this work, we aim to generalize and extend the two aforementioned works in several ways: (1) we explicitly allow a dependency between cause and noise; (2) we give a proper mathematical proof of the theory that justifies the method subject to clear formal assumptions; (3) we perform extensive evaluations for the application in causal inference and compare the method with various related approaches. The theorem stated in this work might also be of interest for general statistical purposes. A briefer version of this work with less extensive experiments, fewer details and without detailed proofs can be found in Blöbaum et al. (2018).

This paper is structured as follows: in 'Preliminaries', we define the problem setting and introduce the notations and assumptions that are necessary for the main theorem of this work stated in 'Theory'. An algorithm that utilizes this theorem is proposed in 'Algorithm' and evaluated on various artificial and real-world data sets in 'Experiments'.

PRELIMINARIES
In the following, we introduce the preliminary problem setting, notations and assumptions.

Problem setting and notation
In this work, we use the framework of structural causal models (Pearl, 2009) with the goal of correctly identifying cause and effect variables from given observations of X and Y. As illustrated in Fig. 2, this can be described as the problem of identifying whether the causal DAG X → Y or X ← Y is true.

Figure 2: An illustration of the goal of our proposed method. It aims to identify the causal DAG of two variables, where either X causes Y or Y causes X.

Throughout this paper, a capital letter denotes a random variable and a lowercase letter denotes values attained by the random variable. Variables X and Y are assumed to be real-valued and to have a joint probability density (with respect to the Lebesgue measure), denoted by $p_{X,Y}$. By slightly abusing terminology, we will not further distinguish between a distribution and its density, since the Lebesgue measure as a reference is implicitly understood. The notations $p_X$, $p_Y$, and $p_{Y|X}$ are used for the corresponding marginal and conditional densities, respectively. The derivative of a function f is denoted by f′.

General idea
As mentioned before, the general idea of our approach is to simply compare the MSE of regressing Y on X with the MSE of regressing X on Y.
If we denote cause and effect by C, E ∈ {X, Y}, respectively, our approach explicitly reads as follows. Let φ denote the function that minimizes the expected least-squares error when predicting E from C, which implies that φ is given by the conditional expectation φ(c) = E[E|c]. Likewise, let ψ be the minimizer of the least-squares error for predicting C from E, that is, ψ(e) = E[C|e]. Then we will postulate assumptions that imply

$$\mathbb{E}[(E-\phi(C))^2] \le \mathbb{E}[(C-\psi(E))^2] \qquad (2)$$

in the regime of almost deterministic relations. This conclusion certainly relies on some kind of scaling convention. For our theoretical results, we will assume that both X and Y attain values between 0 and 1. However, in some applications, we will also scale X and Y to unit variance to deal with unbounded variables. Equation (2) can be rewritten in terms of conditional variances as

$$\mathbb{E}[\operatorname{Var}[E|C]] \le \mathbb{E}[\operatorname{Var}[C|E]].$$

Assumptions
First, recall that we assume throughout the paper that either X is the cause of Y or vice versa in an unconfounded sense, i.e., there is no common cause. Therefore, the general structural equation is defined as

$$E = \zeta(C, \widetilde{N}), \qquad (3)$$

where $C \perp\!\!\!\perp \widetilde{N}$. For our analysis, we first define a function φ to be the conditional expectation of the effect given the cause, i.e., φ(c) := E[E|c], and, accordingly, we define a noise variable N as the residual

$$N := E - \phi(C). \qquad (4)$$

Note that (4) implies that E[N|c] = 0. The function φ is further specified below. Then, to study the limit of an almost deterministic relation in a mathematically precise way, we consider a family of effect variables $E_\alpha$ defined by

$$E_\alpha := \phi(C) + \alpha N, \qquad (5)$$

where $\alpha \in \mathbb{R}_+$ is a parameter controlling the noise level and N is a noise variable that has some (upper bounded) joint density $p_{N,C}$ with C. Note that N here does not need to be statistically independent of C (in contrast to ANMs), which allows the noise to be non-additive. Therefore, (5) does not, a priori, restrict the set of possible causal relations, because for any pair (C, E) one can always define the noise N as in (4) and thus obtain $E_{\alpha=1} = E$ for any arbitrary function φ. (Note that although the form of (5) is similar to that of an ANM, the core assumption of ANM is an independence between cause and noise, which we do not need in our approach: we assume the general structural equation defined in (3), whereas ANM assumes the more restrictive structural equation $E = \zeta(C) + \widetilde{N}$ with $C \perp\!\!\!\perp \widetilde{N}$.)

For this work, we make use of the following assumptions:

1. Invertible function: φ is a strictly monotonic, two times differentiable function φ: [0,1] → [0,1]. For simplicity, we assume that φ is monotonically increasing with φ(0) = 0 and φ(1) = 1 (similar results for monotonically decreasing functions follow by the reflection E → 1 − E). We also assume that $(\phi^{-1})'$ is bounded.

2. Compact supports: The distribution of C has compact support. Without loss of generality, we assume that 0 and 1 are, respectively, the smallest and the largest values attained by C. We further assume that the distribution of N has compact support and that there exist values $n_+ > 0 > n_-$ such that, for any c, $[n_-, n_+]$ is the smallest interval containing the support of $p_{N|c}$. This ensures that $[\alpha n_-, 1 + \alpha n_+]$ is the smallest interval containing the support of $p_{E_\alpha}$. Then the shifted and rescaled variable

$$\widetilde{E}_\alpha := \frac{1}{1 + \alpha n_+ - \alpha n_-}\,(E_\alpha - \alpha n_-) \qquad (6)$$

attains 0 and 1 as minimum and maximum values and thus is equally scaled as C.
3. Unit noise variance: The expected conditional noise variance is $\mathbb{E}[\operatorname{Var}[N|C]] = 1$. This is without loss of generality, since we can scale the noise arbitrarily via the parameter α and we are only interested in the limit α → 0.

4. Independence postulate: While the above assumptions are merely technical, we now state the essential assumption that generates the asymmetry between cause and effect. To this end, we consider the unit interval [0,1] as probability space with the uniform distribution as probability measure. The functions $c \mapsto \phi'(c)$ and $c \mapsto \operatorname{Var}[N|c]\,p_C(c)$ define random variables on this space, which we postulate to be uncorrelated, formally stated as

$$\operatorname{Cov}\big[\phi',\, \operatorname{Var}[N|\cdot]\,p_C\big] = 0. \qquad (7)$$

More explicitly, (7) reads

$$\int_0^1 \phi'(c)\operatorname{Var}[N|c]\,p_C(c)\,dc \;-\; \int_0^1 \phi'(c)\,dc \int_0^1 \operatorname{Var}[N|c]\,p_C(c)\,dc \;=\; 0. \qquad (8)$$

The justification of (7) is not obvious at all. For the special case where the conditional variance $\operatorname{Var}[N|c]$ is constant in c (e.g., for ANMs), (7) reduces to

$$\operatorname{Cov}[\phi', p_C] = 0, \qquad (9)$$

which is an independence condition for deterministic relations stated in Schölkopf et al. (2013). Conditions of a similar type as (9) have been discussed and justified in Janzing et al. (2012). They are based on the idea that φ contains no information about $p_C$. This, in turn, relies on the idea that the conditional $p_{E|C}$ contains no information about $p_C$.

To discuss the justification of (8), observe first that it cannot be justified as stating some kind of 'independence' between $p_C$ and $p_{E|C}$. To see this, note that (8) states an uncorrelatedness of the two functions $c \mapsto \phi'(c)$ and $c \mapsto \operatorname{Var}[N|c]\,p_C(c)$. While φ′ depends only on the conditional $p_{E|C}$ and not on $p_C$, the second function depends on both $p_{E|C}$ and $p_C$, since $\operatorname{Var}[N|c]$ is a property of $p_{E|C}$. Nevertheless, to justify (8), we assume that the function φ represents a law of nature that persists when $p_C$ and N change due to changing background conditions. From this perspective, it becomes unlikely that they are related to the background condition at hand. This idea follows the general spirit of 'modularity and autonomy' in structural equation modeling, that some structural equations may remain unchanged when other parts of a system change (see Chapter 2 in Peters, Janzing & Schölkopf (2017) for a literature review). (Note, however, that the assignment (5) is not a structural equation in a strict sense, because then C and N would need to be statistically independent.) To further justify (7), one could think of a scenario where someone changes φ independently of $p_{N,C}$, which then results in vanishing correlations. Typically, this assumption would be violated if φ is adjusted to $p_{N,C}$ or vice versa. This could happen due to an intelligent design by, for instance, first observing $p_{N,C}$ and then defining φ, or due to a long adaptation process in nature (see Janzing et al. (2012) for further discussions of possible violations in a deterministic setting).

A simple implication of (8) reads

$$\int_0^1 \phi'(c)\operatorname{Var}[N|c]\,p_C(c)\,dc = 1, \qquad (10)$$

due to $\int_0^1 \phi'(c)\,dc = 1$ and $\int_0^1 \operatorname{Var}[N|c]\,p_C(c)\,dc = \mathbb{E}[\operatorname{Var}[N|C]] = 1$.

In the following, the term independence postulate refers to the aforementioned postulate, and the term independence refers to a statistical independence; which one is meant should generally become clear from the context.
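To illustrate what the independence postulate demands, the following sketch (our own toy example, not part of the paper; the concrete φ, $p_C$ and $\operatorname{Var}[N|c]$ are hypothetical choices) numerically evaluates (8) and (10) on a grid. With a uniform cause density and a constant conditional noise variance, the covariance in (8) vanishes by construction, since $\int_0^1 \phi'(c)\,dc = \phi(1) - \phi(0) = 1$.

```python
import numpy as np

# Illustrative check of the independence postulate, Eqs. (8) and (10).
# Hypothetical choices: phi(c) = (e^c - 1)/(e - 1), uniform p_C, constant Var[N|c].

phi_prime = lambda c: np.exp(c) / (np.e - 1.0)  # derivative of phi(c) = (e^c - 1)/(e - 1)
p_C = lambda c: np.ones_like(c)                 # uniform cause density on [0, 1]
var_N = lambda c: np.ones_like(c)               # constant Var[N|c], so E[Var[N|C]] = 1

c = np.linspace(0.0, 1.0, 200001)               # integration grid on [0, 1]
integrate = lambda values: values.mean()        # approximates the integral over [0, 1]

# Left-hand side of Eq. (8): covariance of phi' and Var[N|c] p_C under the uniform measure
cov = integrate(phi_prime(c) * var_N(c) * p_C(c)) \
      - integrate(phi_prime(c)) * integrate(var_N(c) * p_C(c))
print("Eq. (8) covariance:", cov)               # close to 0

# Implication (10): the weighted integral of phi' equals 1
print("Eq. (10) integral :", integrate(phi_prime(c) * var_N(c) * p_C(c)))  # close to 1
```

For a non-uniform cause density or a non-constant conditional noise variance, the same computation quantifies how strongly the postulate would be violated.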
THEORY
As introduced in 'General idea', we aim to exploit an inequality of the expected prediction errors of the form $\mathbb{E}[\operatorname{Var}[E|C]] \le \mathbb{E}[\operatorname{Var}[C|E]]$ to infer the causal direction. In order to conclude this inequality and, thus, to justify an application to causal inference, we must restrict our analysis to the case where the noise variance is sufficiently small, since a more general statement is not possible under the aforementioned assumptions. The analysis can be formalized by the ratio of the expectations of the conditional variances in the limit α → 0. We will then show

$$\lim_{\alpha\to 0}\frac{\mathbb{E}[\operatorname{Var}[C|\widetilde{E}_\alpha]]}{\mathbb{E}[\operatorname{Var}[\widetilde{E}_\alpha|C]]} \ge 1.$$

Error asymmetry theorem
For our main theorem, we first need an important lemma:

Lemma 1 (Limit of variance ratio). Let Assumptions 1–3 in 'Assumptions' hold. Then the following limit holds:

$$\lim_{\alpha\to 0}\frac{\mathbb{E}[\operatorname{Var}[C|\widetilde{E}_\alpha]]}{\mathbb{E}[\operatorname{Var}[\widetilde{E}_\alpha|C]]} = \int_0^1 \frac{1}{\phi'(c)^2}\operatorname{Var}[N|c]\,p_C(c)\,dc. \qquad (11)$$

Proof: We first give some reminders of the definition of the conditional variance and some of its properties. For two random variables Z and Q, the conditional variance of Z given q is defined by

$$\operatorname{Var}[Z|q] := \mathbb{E}\big[(Z-\mathbb{E}[Z|q])^2 \,\big|\, q\big],$$

while $\operatorname{Var}[Z|Q]$ is the random variable attaining the value $\operatorname{Var}[Z|q]$ when Q attains the value q. Its expectation reads

$$\mathbb{E}[\operatorname{Var}[Z|Q]] := \int \operatorname{Var}[Z|q]\,p_Q(q)\,dq.$$

For any $a \in \mathbb{R}$, we have

$$\operatorname{Var}\!\left[\tfrac{Z}{a}\,\Big|\,q\right] = \frac{\operatorname{Var}[Z|q]}{a^2}.$$

For any function h, we have $\operatorname{Var}[h(Q)+Z\,|\,q] = \operatorname{Var}[Z|q]$, which implies $\operatorname{Var}[h(Q)|q] = 0$. Moreover, we have $\operatorname{Var}[Z|h(q)] = \operatorname{Var}[Z|q]$ if h is invertible.

To begin the main part of the proof, we first observe

$$\mathbb{E}[\operatorname{Var}[E_\alpha|C]] = \mathbb{E}[\operatorname{Var}[\phi(C)+\alpha N\,|\,C]] = \alpha^2\,\underbrace{\mathbb{E}[\operatorname{Var}[N|C]]}_{=1\ \text{(Assumption 3)}} = \alpha^2. \qquad (12)$$

Moreover, one easily verifies that

$$\lim_{\alpha\to 0}\frac{\mathbb{E}[\operatorname{Var}[C|\widetilde{E}_\alpha]]}{\mathbb{E}[\operatorname{Var}[\widetilde{E}_\alpha|C]]} = \lim_{\alpha\to 0}\frac{\mathbb{E}[\operatorname{Var}[C|E_\alpha]]}{\mathbb{E}[\operatorname{Var}[E_\alpha|C]]}, \qquad (13)$$

due to (6), provided that these limits exist. Combining Eqs. (12) and (13) yields

$$\lim_{\alpha\to 0}\frac{\mathbb{E}[\operatorname{Var}[C|\widetilde{E}_\alpha]]}{\mathbb{E}[\operatorname{Var}[\widetilde{E}_\alpha|C]]} = \lim_{\alpha\to 0}\frac{\mathbb{E}[\operatorname{Var}[C|E_\alpha]]}{\alpha^2} = \lim_{\alpha\to 0}\mathbb{E}\!\left[\operatorname{Var}\!\left[\tfrac{C}{\alpha}\,\Big|\,E_\alpha\right]\right]. \qquad (14)$$

Now, we can rewrite (14) as

$$\lim_{\alpha\to 0}\mathbb{E}\!\left[\operatorname{Var}\!\left[\tfrac{C}{\alpha}\,\Big|\,E_\alpha\right]\right]
= \lim_{\alpha\to 0}\mathbb{E}\!\left[\operatorname{Var}\!\left[\tfrac{\phi^{-1}(E_\alpha-\alpha N)}{\alpha}\,\Big|\,E_\alpha\right]\right]
= \lim_{\alpha\to 0}\int_{\phi(0)+\alpha n_-}^{\phi(1)+\alpha n_+} \operatorname{Var}\!\left[\tfrac{\phi^{-1}(e-\alpha N)}{\alpha}\,\Big|\,e\right] p_{E_\alpha}(e)\,de
= \lim_{\alpha\to 0}\int_{\phi(0)}^{\phi(1)} \operatorname{Var}\!\left[\tfrac{\phi^{-1}(e-\alpha N)}{\alpha}\,\Big|\,e\right] p_{E_\alpha}(e)\,de. \qquad (15)$$

In the latter step, the terms $\alpha n_+$ and $-\alpha n_-$ vanish in the limit, seeing that the function $e \mapsto \operatorname{Var}\big[\phi^{-1}(e-\alpha N)/\alpha \,\big|\, e\big]\,p_{E_\alpha}(e)$ is uniformly bounded in α. This is, firstly, because $\phi^{-1}$ attains only values in [0,1], and hence the variance is bounded by 1. Secondly, $p_{E_\alpha}(e)$ is uniformly bounded due to

$$p_{E_\alpha}(e) = \int_{n_-}^{n_+} p_{\phi(C),N}(e-\alpha n,\, n)\,dn = \int_{n_-}^{n_+} p_{C,N}\big(\phi^{-1}(e-\alpha n),\, n\big)\,(\phi^{-1})'(e-\alpha n)\,dn \le \|(\phi^{-1})'\|_\infty\,\|p_{C,N}\|_\infty\,(n_+ - n_-).$$

Accordingly, the bounded convergence theorem states

$$\lim_{\alpha\to 0}\int_{\phi(0)}^{\phi(1)} \operatorname{Var}\!\left[\tfrac{\phi^{-1}(e-\alpha N)}{\alpha}\,\Big|\,e\right] p_{E_\alpha}(e)\,de = \int_{\phi(0)}^{\phi(1)} \lim_{\alpha\to 0}\left(\operatorname{Var}\!\left[\tfrac{\phi^{-1}(e-\alpha N)}{\alpha}\,\Big|\,e\right] p_{E_\alpha}(e)\right) de.$$

To compute the limit of $\operatorname{Var}\big[\phi^{-1}(e-\alpha N)/\alpha \,\big|\, e\big]$, we use Taylor's theorem to obtain

$$\phi^{-1}(e-\alpha n) = \phi^{-1}(e) - \alpha n\,(\phi^{-1})'(e) - \frac{\alpha^2 n^2\,(\phi^{-1})''(E_2(n,e))}{2}, \qquad (16)$$

where $E_2(n,e)$ is a real number in the interval $(e-\alpha n, e)$. Since (16) holds for every $n \in \big[\tfrac{e-1}{\alpha}, \tfrac{e}{\alpha}\big]$ (note that φ and $\phi^{-1}$ are bijections of [0,1], thus $e-\alpha n$ lies in [0,1]), it also holds for the random variable N if $E_2(n,e)$ is replaced with the random variable $E_2(N,e)$ (here, we have implicitly assumed that the map $n \mapsto E_2(n,e)$ is measurable). Therefore, we see that

$$\lim_{\alpha\to 0}\operatorname{Var}\!\left[\tfrac{\phi^{-1}(e-\alpha N)}{\alpha}\,\Big|\,e\right] = \lim_{\alpha\to 0}\operatorname{Var}\!\left[-N(\phi^{-1})'(e) - \tfrac{\alpha N^2(\phi^{-1})''(E_2(N,e))}{2}\,\Big|\,e\right] = (\phi^{-1})'(e)^2\,\operatorname{Var}[N|e]. \qquad (17)$$

Moreover, we have

$$\lim_{\alpha\to 0} p_{E_\alpha}(e) = p_{E_0}(e). \qquad (18)$$
Inserting Eqs. (18) and (17) into (15) yields

$$\lim_{\alpha\to 0}\mathbb{E}\!\left[\operatorname{Var}\!\left[\tfrac{C}{\alpha}\,\Big|\,E_\alpha\right]\right]
= \int_{\phi(0)}^{\phi(1)} (\phi^{-1})'(e)^2\,\operatorname{Var}[N|e]\,p_{E_0}(e)\,de
= \int_0^1 (\phi^{-1})'(\phi(c))^2\,\operatorname{Var}[N|\phi(c)]\,p_C(c)\,dc
= \int_0^1 \frac{1}{\phi'(c)^2}\operatorname{Var}[N|c]\,p_C(c)\,dc,$$

where the second equality is a variable substitution using the deterministic relation $E_0 = \phi(C)$ (which implies $p_{E_0}(\phi(c)) = p_C(c)/\phi'(c)$ or, equivalently, the simple symbolic equation $p_{E_0}(e)\,de = p_C(c)\,dc$). This completes the proof due to (14). □

While the formal proof is a bit technical, the intuition behind this idea is quite simple: just think of the scatter plot of an almost deterministic relation as a thick line. Then $\operatorname{Var}[E_\alpha|c]$ and $\operatorname{Var}[C|E_\alpha=\phi(c)]$ are roughly the squared widths of the line at some point $(c,\phi(c))$ measured in vertical and horizontal direction, respectively. The quotient of the widths in vertical and horizontal direction is then given by the slope. This intuition yields the following approximate identity for small α:

$$\operatorname{Var}[C\,|\,\widetilde{E}_\alpha=\phi(c)] \approx \frac{1}{\phi'(c)^2}\operatorname{Var}[\widetilde{E}_\alpha\,|\,C=c] = \alpha^2\,\frac{1}{\phi'(c)^2}\operatorname{Var}[N|c]. \qquad (19)$$

Taking the expectation of (19) over C and recalling that Assumption 3 implies $\mathbb{E}[\operatorname{Var}[\widetilde{E}_\alpha|C]] = \alpha^2\,\mathbb{E}[\operatorname{Var}[N|C]] = \alpha^2$ already yields (11).

With the help of Lemma 1, we can now formulate the core theorem of this paper:

Theorem 1 (Error Asymmetry). Let Assumptions 1–4 in 'Assumptions' hold. Then the following limit always holds:

$$\lim_{\alpha\to 0}\frac{\mathbb{E}[\operatorname{Var}[C|\widetilde{E}_\alpha]]}{\mathbb{E}[\operatorname{Var}[\widetilde{E}_\alpha|C]]} \ge 1,$$

with equality only if the function φ stated in Assumption 1 is linear.

Proof: We first recall that Lemma 1 states

$$\lim_{\alpha\to 0}\frac{\mathbb{E}[\operatorname{Var}[C|\widetilde{E}_\alpha]]}{\mathbb{E}[\operatorname{Var}[\widetilde{E}_\alpha|C]]} = \int_0^1 \frac{1}{\phi'(c)^2}\operatorname{Var}[N|c]\,p_C(c)\,dc.$$

We then have

$$\begin{aligned}
\int_0^1 \frac{1}{\phi'(c)^2}\operatorname{Var}[N|c]\,p_C(c)\,dc
&= \int_0^1 \frac{1}{\phi'(c)^2}\operatorname{Var}[N|c]\,p_C(c)\,dc \cdot \underbrace{\int_0^1 \operatorname{Var}[N|c]\,p_C(c)\,dc}_{=1\ \text{(Assumption 3)}} \\
&= \int_0^1 \left(\sqrt{\frac{1}{\phi'(c)^2}\operatorname{Var}[N|c]}\right)^{\!2} p_C(c)\,dc \cdot \int_0^1 \left(\sqrt{\operatorname{Var}[N|c]}\right)^{\!2} p_C(c)\,dc \\
&\ge \left(\int_0^1 \sqrt{\frac{1}{\phi'(c)^2}\operatorname{Var}[N|c]}\;\sqrt{\operatorname{Var}[N|c]}\;p_C(c)\,dc\right)^{\!2}
= \left(\int_0^1 \frac{1}{\phi'(c)}\operatorname{Var}[N|c]\,p_C(c)\,dc\right)^{\!2}, \qquad (20)
\end{aligned}$$

where the inequality is just the Cauchy–Schwarz inequality applied to the bilinear form $(f,g)\mapsto \int f(c)\,g(c)\,p_C(c)\,dc$ on the space of functions f for which $\int f^2(c)\,p_C(c)\,dc$ exists. Note that if φ is linear, (20) becomes 1, since φ′ ≡ 1 according to Assumption 1.

We can bound the right-hand side of (20) in a similar way by using (10), which is implied by the independence postulate, and again applying the Cauchy–Schwarz inequality:

$$\begin{aligned}
\int_0^1 \frac{1}{\phi'(c)}\operatorname{Var}[N|c]\,p_C(c)\,dc
&= \int_0^1 \frac{1}{\phi'(c)}\operatorname{Var}[N|c]\,p_C(c)\,dc \cdot \underbrace{\int_0^1 \phi'(c)\operatorname{Var}[N|c]\,p_C(c)\,dc}_{=1\ \text{(Eq. (10))}} \\
&= \int_0^1 \left(\sqrt{\frac{1}{\phi'(c)}\operatorname{Var}[N|c]}\right)^{\!2} p_C(c)\,dc \cdot \int_0^1 \left(\sqrt{\phi'(c)\operatorname{Var}[N|c]}\right)^{\!2} p_C(c)\,dc \\
&\ge \left(\int_0^1 \sqrt{\frac{1}{\phi'(c)}\operatorname{Var}[N|c]}\;\sqrt{\phi'(c)\operatorname{Var}[N|c]}\;p_C(c)\,dc\right)^{\!2}
= \Bigg(\underbrace{\int_0^1 \operatorname{Var}[N|c]\,p_C(c)\,dc}_{=1\ \text{(Assumption 3)}}\Bigg)^{\!2} = 1. \qquad (21)
\end{aligned}$$

Combining Eqs. (20) and (21) with Lemma 1 completes the proof. □
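As a numerical sanity check of Lemma 1 and Theorem 1, the following sketch (our own illustration, not part of the paper; the specific φ, cause distribution and noise distribution are hypothetical choices satisfying Assumptions 1–4) simulates the family (5), rescales it according to (6), estimates both expected conditional variances by binning, and compares the resulting ratio with the integral in (11).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: phi(c) = (e^c - 1)/(e - 1), C ~ Uniform(0, 1),
# N ~ Uniform(-sqrt(3), sqrt(3)) independent of C, so E[Var[N|C]] = 1 and the
# independence postulate holds (uniform p_C, constant Var[N|c]).
phi = lambda c: (np.exp(c) - 1.0) / (np.e - 1.0)
phi_prime = lambda c: np.exp(c) / (np.e - 1.0)
n_minus, n_plus = -np.sqrt(3.0), np.sqrt(3.0)

def expected_conditional_var(z, q, n_bins=50):
    """Estimate E[Var[Z|Q]] by averaging within-bin variances of z over quantile bins of q."""
    bins = np.quantile(q, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.digitize(q, bins[1:-1]), 0, n_bins - 1)
    total, count = 0.0, 0
    for b in range(n_bins):
        zb = z[idx == b]
        if len(zb) > 1:
            total += np.var(zb) * len(zb)
            count += len(zb)
    return total / count

n = 200000
c = rng.uniform(0.0, 1.0, n)
noise = rng.uniform(n_minus, n_plus, n)

for alpha in [0.2, 0.1, 0.05]:
    e_alpha = phi(c) + alpha * noise                                            # Eq. (5)
    e_tilde = (e_alpha - alpha * n_minus) / (1.0 + alpha * n_plus - alpha * n_minus)  # Eq. (6)
    ratio = expected_conditional_var(c, e_tilde) / expected_conditional_var(e_tilde, c)
    print(f"alpha = {alpha:.2f}: estimated variance ratio = {ratio:.3f}")

# Limit predicted by Lemma 1, Eq. (11): integral of Var[N|c] p_C(c) / phi'(c)^2 over [0, 1]
grid = np.linspace(0.0, 1.0, 200001)
print("Predicted limit:", (1.0 / phi_prime(grid) ** 2).mean())  # >= 1, cf. Theorem 1
```

For decreasing α, the estimated ratio should approach the predicted value, which is strictly larger than 1 here because the chosen φ is nonlinear.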
Remark: Theorem 1 states that the inequality holds for all values of α smaller than a certain finite threshold. Whether this threshold is small, or whether the asymmetry with respect to regression errors already occurs for large noise, cannot be concluded from the theoretical insights. Presumably, this depends on the features of φ, $p_C$ and $p_{N|C}$ in a complicated way. However, the experiments in 'Experiments' suggest that the asymmetry often appears even for realistic noise levels. If the function φ is non-invertible, there is an information loss in anticausal direction, since multiple possible values can be assigned to the same input. Therefore, we can expect that the error difference becomes even larger in these cases, which is supported by the experiments in 'Simulated cause-effect pairs with strong dependent noise'.

ALGORITHM
A causal inference algorithm that exploits Theorem 1 can be formulated in a straightforward manner. Given observations X, Y sampled from a joint distribution $p_{X,Y}$, the key idea is to fit regression models in both possible directions and compare the MSE. We call this approach Regression Error based Causal Inference (RECI) and summarize it in Algorithm 1.

Although estimating the conditional expectations E[Y|X] and E[X|Y] by regression is a standard task in machine learning, we should emphasize that the usual issues of over- and underfitting are critical for our purpose (as for methods based on ANMs or PNLs), because they lead to under- or overestimated noise levels. It may, however, happen that the method even benefits from underfitting: if a simple regression model fits the data quite well in causal direction, but the conditional expectation in anticausal direction is more complex, an underfitting regression model increases the error more in the anticausal direction than in the causal direction.

Algorithm 1: The proposed causal inference algorithm.

function RECI(X, Y)                          ▷ X and Y are the observed data
    (X, Y) ← RescaleData(X, Y)
    f ← FitModel(X, Y)                       ▷ Fit regression model f: X → Y
    g ← FitModel(Y, X)                       ▷ Fit regression model g: Y → X
    MSE_{Y|X} ← MeanSquaredError(f, X, Y)
    MSE_{X|Y} ← MeanSquaredError(g, Y, X)
    if MSE_{Y|X} < MSE_{X|Y} then
        return "X causes Y"
    else if MSE_{X|Y} < MSE_{Y|X} then
        return "Y causes X"
    else
        return "undecided"
    end if
end function
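For concreteness, a minimal Python sketch of Algorithm 1 could look as follows (our own illustration; the paper does not prescribe a particular regression class or rescaling, so the low-degree polynomial least-squares fit and the min-max scaling below are assumptions).

```python
import numpy as np

def reci(x, y, degree=3):
    """Regression Error based Causal Inference (RECI), following Algorithm 1.

    Returns 'X causes Y', 'Y causes X', or 'undecided'. The polynomial degree
    and the min-max rescaling are illustrative choices, not prescribed by the paper.
    """
    # RescaleData: map both variables to [0, 1] (cf. the scaling convention of the theory)
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())

    # FitModel: least-squares polynomial regressions in both directions
    f = np.polynomial.Polynomial.fit(x, y, degree)   # f: X -> Y
    g = np.polynomial.Polynomial.fit(y, x, degree)   # g: Y -> X

    # MeanSquaredError in both directions
    mse_y_given_x = np.mean((y - f(x)) ** 2)
    mse_x_given_y = np.mean((x - g(y)) ** 2)

    if mse_y_given_x < mse_x_given_y:
        return "X causes Y"
    if mse_x_given_y < mse_y_given_x:
        return "Y causes X"
    return "undecided"

# Toy usage: a monotonic relation with small additive noise, in the spirit of Eq. (5)
rng = np.random.default_rng(0)
cause = rng.uniform(0.0, 1.0, 5000)
effect = np.exp(cause) + 0.05 * rng.normal(size=5000)
print(reci(cause, effect))  # the theory suggests 'X causes Y' for small noise
```

In practice, the choice of the regression model class corresponds to the FitModel step and, as discussed above, interacts with over- and underfitting.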