From Statistical to Causal Learning
Bernhard Schölkopf and Julius von Kügelgen
April 2022

We describe basic ideas underlying research to build and understand artificially intelligent systems: from symbolic approaches via statistical learning to interventional models relying on concepts of causality. Some of the hard open problems of machine learning and AI are intrinsically related to causality, and progress may require advances in our understanding of how to model and infer causality from data.

In 1958, the New York Times reported on a new machine called the perceptron. Frank Rosenblatt, its inventor, demonstrated that the perceptron was able to learn from experience. He predicted that later perceptrons would be able to recognize people or instantly translate spoken language. Now a reality, this must have sounded like distant science fiction at the time. In hindsight, we may consider it the birth of machine learning, the field fueling most of the current advances in artificial intelligence (AI). Around the same time, another equally revolutionary development took place: scientists understood that computers could do more than compute numbers: they can process symbols. Although this insight was also motivated by artificial intelligence, in hindsight it was the birth of the field of computer science. There was great optimism that the manipulation of symbols, in programs written by humans, implementing rules designed by humans, should be enough to generate intelligence. Below, we shall refer to this as the symbol-rule hypothesis. There was initially encouraging progress on seemingly hard problems such as automatic theorem proving and computer chess. One of the fathers of the field, Herb Simon, predicted in 1956 that "machines will be capable, within twenty years, of doing any work a man can do." However, problems that appeared simple, such as most things animals could do, turned out to be hard. This came to be known as Moravec's paradox. When IBM's Deep Blue

Given a distribution over the U_i, which are assumed independent, this also gives rise to a probabilistic model p(X). However, the model in (1.1) is more structured: the graph connectivity and the functions f_i create particular dependences between the observables. Moreover, it describes how the system behaves under intervention: by replacing functions by constants, we can compute the effect of setting some variables to specific values. Causal learning builds on assumptions different from standard machine learning (§5), and addresses a different level in the modeling hierarchy (§6). It also comes with new problems, such as causal discovery, where we seek to infer properties of graph and functions from data (§7). In some cases, conditional independences among the X_i contain information about the graph [145]; but novel assumptions let us handle some cases that were previously unsolvable [64]. Those assumptions have nontrivial implications for machine learning tasks such as semi-supervised learning, covariate shift adaptation, and transfer learning [130] (§8). Once provided with a causal model, causal reasoning (§9) allows us to identify and estimate certain causal queries of interest from observational data. We conclude with a list of some current and open problems (§10), with a particular emphasis on the topic of causal representation learning.
The presentation and notation will be somewhat informal in several respects. We generally assume that all distributions possess densities (w.r.t. a suitable reference measure). We sometimes write p(x) for the distribution (or density) of a random variable X. Accordingly, the same p can denote another distribution p(y), distinguished by the argument of p(·). We also sometimes use summation for marginalization, which supposes discrete variables; the corresponding expressions for continuous quantities would use integrals.

Suppose we have measured two statistically dependent observables and found the points to lie approximately on a straight line. An empirical scientist might be willing to hypothesize a corresponding law of nature (see Fig. 1). However, Leibniz already pointed out that if we scatter spots of ink randomly on a piece of paper by shaking a quill pen, we can also find a mathematical equation satisfied by these points [81]. He argued that we would not call this a law of nature, because no matter how the points are distributed, there always exists such an equation; we would only call it a law of nature if the equation is simple. This raises the question of what makes an equation simple. The physicist Rutherford took the pragmatic view that if there is a law, it should be directly evident from the data: "if your experiment needs statistics, you ought to have done a better experiment."

Figure 1: Given a small number of observations, how do we find a law underlying them? Leibniz argued that even if we generate a random set of points, we can always find a mathematical equation satisfied by these points.

This view may have been a healthy one when faced with low-dimensional inference problems where regularities are immediately obvious; however, modern AI is facing inference problems that are harder: they are often high-dimensional and nonlinear, yet we may have little prior knowledge about the underlying regularity (e.g., for medical data, we usually do not have a mechanistic model). Statistical learning theory studies the problem of how to still perform valid inference, provided that we have sufficiently large datasets and the computational means to process them.

Let us look at some theoretical results for the simplest learning scenario, drawing from [133]; for details, see [153]. Suppose we are given empirical observations,

(x_1, y_1), . . . , (x_m, y_m) ∈ X × Y,   (2.1)

where X is some nonempty set from which the inputs come, and Y = {±1} is the output set, in our case consisting of just two classes. This situation is called pattern recognition, and our goal is to use the training data (2.1) to infer a function f : X → {±1} (from some function class chosen a priori) which will produce the correct output for a new input x which we may not have seen before. To formalize what we mean by "correct", we make the assumption that all observations (x_i, y_i) have been generated independently by performing a random experiment described by an unknown probability distribution p(x, y), a setting referred to as i.i.d. (independent and identically distributed) data. Our goal will be to minimize the expected error (or risk)

R[f] = ∫ c(y, f(x)) dp(x, y),   (2.2)

where c is a so-called loss function, e.g., the misclassification error c(y, f(x)) = (1/2) |f(x) − y|, which takes the value 0 whenever f(x) = y and 1 otherwise. The difficulty of the task stems from the fact that we are trying to minimize a quantity that we cannot evaluate: since we do not know p, we cannot compute (2.2).
We do know, however, the training data (2.1) sampled from p. We can thus try to infer a function f from the training sample whose risk is close to the minimum of (2.2). To this end, we need what is called an induction principle. One way to proceed is to use the training sample to approximate (2.2) by a finite sum, referred to as the empirical risk

R_emp[f] = (1/m) ∑_{i=1}^m c(y_i, f(x_i)).   (2.3)

The empirical risk minimization (ERM) induction principle recommends that we choose (or "learn") an f that minimizes (2.3). We can then ask whether the ERM principle is statistically consistent: in the limit of infinitely many data points, will ERM lead to a solution which will do as well as possible on future data generated by p?

It turns out that if the function class over which we minimize (2.3) is too large, then ERM is not consistent. Hence, we need to suitably restrict the class of possible functions. For instance, ERM is consistent for all probability distributions, provided that the VC dimension of the function class is finite. The VC dimension is an example of a capacity measure. It is defined as the maximal number of points that can be separated (classified) in all possible ways using functions from the class. E.g., using linear classifiers (separating classes by straight lines) on R^2, we can realize all possible classifications for 3 suitably chosen points, but we can no longer do this once we have 4 points, no matter how they are placed (see Fig. 2). This means that the VC dimension of this function class is 3. More generally, for linear separations in R^d, the VC dimension is d + 1. Whenever the VC dimension is finite, our class of functions (or explanations) becomes falsifiable in the sense that, starting from a certain number of observations, no longer all possible labelings of the points can be explained (cf. Fig. 2). If we can nevertheless explain a sufficiently large set of observed data, we thus have reason to believe that this is a meaningful finding.

Figure 2: Using straight lines, we can separate three points in all possible ways; we cannot do this for four points, no matter how they are placed. The class of linear separations is not "falsifiable" using three points, but it becomes falsifiable once we have four or more points.

Much of machine learning research is concerned with restrictions on classes of functions to make inference possible, be it by imposing prior distributions on function classes, through other constraints, or by designing self-regularizing learning procedures, e.g., gradient descent methods for neural networks [79]. While there is a solid theoretical understanding of supervised machine learning as described above (i.e., function learning from input-output examples), there are still details under investigation, such as the recently observed phenomenon of "double descent" [7]. A popular constraint, implemented in the Support Vector Machine (SVM) [153, 133], is to consider linear separations with large margin: it turns out that for large margin separations in high-dimensional (or infinite-dimensional) spaces, the capacity can be much smaller than the dimensionality, making learning possible in situations where it would otherwise fail. For some learning algorithms, including SVMs and nearest neighbor classifiers, there are strong universal consistency results, guaranteeing convergence of the algorithm to the lowest achievable risk for any problem to be learned [28, 153, 133, 147]. Note, however, that this convergence can be arbitrarily slow.
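Returning to the VC dimension example above, the following is a minimal empirical sketch (not from the paper; it assumes scikit-learn is available and uses a large-C SVM as a stand-in for an arbitrary linear classifier) of the fact that linear classifiers in R^2 can shatter three generic points but not four:

```python
# Sketch: empirically checking shattering by linear classifiers in R^2.
# Assumptions: scikit-learn installed; SVC with large C approximates a hard margin.
import itertools
import numpy as np
from sklearn.svm import SVC

def shatters(points):
    """Return True if some linear classifier realizes every +/-1 labeling."""
    for labels in itertools.product([-1, 1], repeat=len(points)):
        if len(set(labels)) == 1:          # constant labelings are trivially realizable
            continue
        clf = SVC(kernel="linear", C=1e6)  # large C ~ hard-margin linear separator
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False                   # this labeling cannot be separated linearly
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])          # generic (non-collinear) points
four = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])   # contains the XOR labeling
print(shatters(three))  # True: 3 points can be shattered, so VC dimension >= 3
print(shatters(four))   # False: e.g. the XOR labeling is not linearly separable
```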
For a given sample size, it will depend on the problem being learned whether we achieve low expected error. In addition to asymptotic consistency statements, learning theory makes finite sample size statements: one can prove that with probability at least 1 − δ (for δ > 0), for all functions f in a class of functions with VC dimension h,

R[f] ≤ R_emp[f] + sqrt( (h (ln(2m/h) + 1) + ln(4/δ)) / m ).   (2.4)

This is an example of a class of results that relate the training error R_emp[f] and the test error R[f] using a confidence interval (the square root term) depending on a capacity measure of a function class (here, its VC dimension h). It says that with high probability, the expected error R[f] on future observations generated by the unknown probability distribution is small, provided the two terms on the right hand side are small: the training error R_emp[f] (i.e., the error on the examples we have already seen), and the square root term, which will be small whenever the capacity h is small compared to the number of training observations m. If, on the other hand, we try to learn something that may not make sense, such as the mapping from the names of people to their telephone numbers, we would find that to explain all the training data (i.e., to obtain a small R_emp[f]), we need a model whose capacity h is large, and the second term becomes large. In any case, it is crucial for both consistency results and finite sample error bounds such as (2.4) that we have i.i.d. data.

Kernel Methods. A symmetric function k : X² → R, where X is a nonempty set, is called a positive definite (pd) kernel if for arbitrary points x_1, . . . , x_m ∈ X and coefficients a_1, . . . , a_m ∈ R:

∑_{i,j} a_i a_j k(x_i, x_j) ≥ 0.

The kernel is called strictly positive definite if, for pairwise distinct points, the implication ∑_{i,j} a_i a_j k(x_i, x_j) = 0 ⟹ ∀i : a_i = 0 is valid. Any positive definite kernel induces a mapping

Φ : x ↦ k(x, ·)   (2.5)

into a reproducing kernel Hilbert space (RKHS) H satisfying

⟨k(x, ·), k(x′, ·)⟩ = k(x, x′)   (2.6)

for all x, x′ ∈ X. Although H may be infinite-dimensional, we can construct practical classification algorithms in H provided that all computational steps are carried out in terms of scalar products, since those can be reduced to kernel evaluations (2.6). In the SVM algorithm, the capacity of the function class is restricted by enforcing a large margin of class separation in H via a suitable RKHS regularization term. The solution can be shown to take the form

f(x) = sgn( ∑_{i=1}^m α_i y_i k(x, x_i) + b ),   (2.7)

where the learned parameters α_i and b are the solution of a convex quadratic optimization problem. A similar expansion of the solution in terms of kernel functions evaluated at training points holds true for a larger class of kernel algorithms beyond SVMs, regularized by an RKHS norm [127]. In kernel methods, the kernel plays three roles which are crucial for machine learning: it acts as a similarity measure for data points, induces a representation in a linear space via (2.5), and parametrizes the function class within which the solution is sought, cf. (2.7).

Kernel Mean Embeddings. Consider two sets of points X := {x_1, . . . , x_m} ⊂ X and Y := {y_1, . . . , y_n} ⊂ X. We define the mean map µ as [133]

µ(X) := (1/m) ∑_{i=1}^m k(x_i, ·).

If we use a polynomial kernel of degree d, then µ(X) = µ(Y) if all empirical moments up to order d coincide. For strictly pd kernels, the means coincide only if X = Y, rendering µ injective [134].
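A small numerical sketch (assumed setup, not from the paper) of the empirical mean embedding and of the squared RKHS distance ||µ(X) − µ(Y)||², often called the maximum mean discrepancy (MMD), computed purely from kernel evaluations as in (2.6):

```python
# Sketch: empirical kernel mean embeddings and squared MMD with a Gaussian kernel.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """||mu(X) - mu(Y)||_H^2 = mean K_XX + mean K_YY - 2 mean K_XY."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))   # sample from p
Y = rng.normal(0.5, 1.0, size=(500, 1))   # sample from q (shifted mean)
Z = rng.normal(0.0, 1.0, size=(500, 1))   # second sample from p
print(mmd2(X, Y))   # clearly positive: the embeddings differ
print(mmd2(X, Z))   # close to zero: same underlying distribution
```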
The mean map has some other interesting properties [144], e.g., µ(X) represents the operation of taking a mean of a function on the sample X:

⟨µ(X), f⟩ = (1/m) ∑_{i=1}^m f(x_i)   for all f ∈ H.

If E_{x∼p}[k(x, x)] < ∞, then the above statements, including the injectivity of µ, generalize to Borel measures p, q, if we define the mean map as

µ(p) := E_{x∼p}[k(x, ·)]

and replace the notion of strictly pd kernels by that of characteristic kernels [33]. This means that we do not lose information when representing a probability distribution in the RKHS. This enables us to work with distributions using Hilbert space methods, and construct practical algorithms analyzing distributions using scalar product evaluations. Note that the mean map µ can be viewed as a generalization of the moment generating function M_p of a random variable x with distribution p,

M_p(·) = E_{x∼p}[ e^{⟨x, ·⟩} ].

The map µ has applications in a number of tasks including computing functions of random variables [132], testing of homogeneity [41], and of independence [43]. The latter will be of particular interest to causal inference: we can develop a kernel-based independence test by computing the distance between sample-based embeddings of a joint distribution p(X, Y) and the product of its marginals p(X), p(Y) [42, 44, 43, 165, 115], and generalize it to conditional independence testing [33, 101], as required for certain causal discovery methods (see §7); a toy sketch of such a test statistic is given below.

Methods Relying on i.i.d. Data. In current successes of machine learning [79], we generally (i) have large amounts of data, often from simulations or large-scale human labeling, (ii) use high capacity machine learning models (e.g., neural networks with many adjustable parameters), and (iii) employ high performance computing. Statistical learning theory offers a partial explanation for recent successes of learning: huge datasets enable training complex models and thus solving increasingly difficult tasks.

Figure 3: Measurements of two genes (x-axis), gene A (left) and gene B (right), show the same strong positive correlation with a phenotype (y-axis). However, this statistical information alone is insufficient to predict the outcome of a knock-out experiment where the activity of a gene is set to zero (vertical lines at x = 0). Answering such interventional questions requires additional causal knowledge (inset causal graphs): knocking out gene A, which is a direct cause, would lead to a reduction in phenotype, whereas knocking out gene B, which shares a common cause, or confounder, with the phenotype but has no causal effect on it, would leave the phenotype unaffected. This shows that correlation alone is not enough to predict the outcome of perturbations to a system (toy data, figure from [112]).

However, a crucial aspect that is often ignored is that we (iv) assume that the data are i.i.d. This assumption is crucial for good performance in practice, and it underlies theoretical statements such as (2.4). When faced with problems that violate the i.i.d. assumption, all bets are off. Vision systems can be grossly misled if an object that is normally recognized with high accuracy is placed in a context that in the training set may be negatively correlated with the presence of the object. For instance, such a system may fail to recognize a cow standing on the beach. In order to successfully generalize in such settings, we would need to construct systems which do not merely rely on statistical dependences, but instead model mechanisms that are robust across certain violations of the i.i.d. assumption.
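To make the kernel-based independence test mentioned above concrete, here is a toy sketch (assumptions: Gaussian kernels with a fixed bandwidth, the simple biased estimator) of the HSIC statistic, which compares the embedding of the joint p(X, Y) with that of the product of its marginals:

```python
# Sketch: a biased HSIC independence statistic with Gaussian kernels.
import numpy as np

def gaussian_gram(x, sigma=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    n = len(x)
    K, L = gaussian_gram(x, sigma), gaussian_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2    # biased HSIC estimate

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y_dep = x ** 2 + 0.1 * rng.normal(size=1000)   # dependent on x, yet uncorrelated with it
y_ind = rng.normal(size=1000)                  # independent of x
print(hsic(x, y_dep))   # noticeably larger ...
print(hsic(x, y_ind))   # ... than this value (close to 0)
```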
As we will argue, causality provides a natural framework for capturing such stable mechanisms and reasoning about different types of distribution shifts.

It is a commonplace that correlation does not imply causation. Two popular and illustrative examples are the positive correlation between chocolate consumption and Nobel prizes per capita [91], and that between the number of stork breeding pairs and human birth rates [89], neither of which admits a sensible interpretation in terms of direct causation. These examples naturally lead to the following questions: What exactly do we mean by "causation"? What is its relationship to correlation? And, if correlation alone is not enough, what is needed to infer causation? Here, we adopt a notion of causality based on manipulability [159] and intervention [105] which has proven useful in fields such as agriculture [161], econometrics [46, 52], and epidemiology [119].

Definition 3.1 (Causal effect). We say that a random variable X has a causal effect on a random variable Y if there exist x ≠ x′ such that the distribution of Y after intervening on X and setting it to x differs from the distribution of Y after setting X to x′.

Inherent to the notion of causation is a directionality and asymmetry which does not exist for correlation: if X is correlated with Y, then Y is equally correlated with X; but, if X has a causal effect on Y, the converse (in the generic case) does not hold. We illustrate the intervention-based notion of causation and its difference from correlation (or, more generally, statistical dependence) in Fig. 3. Here, knocking out two genes X_A and X_B that are indistinguishable based on their correlation with a phenotype Y would have very different effects. Only intervening on X_A would change the distribution of Y, whereas X_B does not have a causal effect on Y: instead, their correlation arises from a different (confounded) causal structure. Such causal relationships are most commonly represented in the form of causal graphs where directed arrows indicate a direct causal effect.

Figure 4: Reichenbach's common cause principle [117] postulates that statistical dependence between two random variables X and Y has three elementary possible causal explanations, shown as causal graphs in (a)-(c). It thus states that association is always induced by an underlying causal process. In (a) the common cause Z coincides with X, and in (c) it coincides with Y. Grey nodes indicate observed and white nodes unobserved variables.

The example in Fig. 3 shows that the same correlation can be explained by multiple causal graphs which lead to different experimental outcomes, i.e., correlation does not imply causation. However, there is a connection between correlation and causation, expressed by Reichenbach [117] as the Common Cause Principle, see Fig. 4.

Principle 3.2 (Common Cause). If two random variables X and Y are statistically dependent (X ⊥̸⊥ Y), then there exists a random variable Z which causally influences both of them and which explains all their dependence in the sense of rendering them conditionally independent (X ⊥⊥ Y | Z). As a special case, Z may coincide with X or Y.

According to Principle 3.2, statistical dependence always results from underlying causal relationships by which variables, including potentially unobserved ones, influence each other. Correlation is thus an epiphenomenon, the byproduct of a causal process.
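A toy simulation (all parameters assumed, in the spirit of Fig. 3 and Defn. 3.1): gene A is a direct cause of the phenotype, gene B only shares a common cause with it; both look identical observationally, yet knock-out interventions have very different effects:

```python
# Sketch: same correlation, different behaviour under do(X := 0).
import numpy as np
rng = np.random.default_rng(0)
n = 100_000

# Scenario A: X_A -> Y (direct cause)
x_a = rng.normal(1, 1, n)
y_a = x_a + 0.5 * rng.normal(size=n)

# Scenario B: X_B <- Z -> Y (confounded; X_B has no effect on Y)
z = rng.normal(1, 1, n)
x_b = z + 0.01 * rng.normal(size=n)
y_b = z + 0.5 * rng.normal(size=n)

print(np.corrcoef(x_a, y_a)[0, 1], np.corrcoef(x_b, y_b)[0, 1])  # both ~ 0.89

# Knock-out interventions: replace the gene's assignment by the constant 0.
y_do_a = 0 + 0.5 * rng.normal(size=n)   # do(X_A := 0): mean phenotype drops to ~ 0
y_do_b = z + 0.5 * rng.normal(size=n)   # do(X_B := 0): phenotype unaffected, mean ~ 1
print(y_a.mean(), y_do_a.mean())        # ~ 1.0 vs ~ 0.0
print(y_b.mean(), y_do_b.mean())        # ~ 1.0 vs ~ 1.0
```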
For the example of chocolate consumption (X) and Nobel laureates (Y), common sense suggests that neither of the two variables should have a causal effect on the other, i.e., neither chocolate consumption driving scientific success (X → Y; Fig. 4a) nor Nobel laureates increasing chocolate consumption (Y → X; Fig. 4c) seems plausible. Principle 3.2 then tells us that the observed correlation must be explained by a common cause Z as in Fig. 4b. A plausible candidate for such a confounder could, for example, be economic factors driving both consumer spending and investment in education and science. Without such background knowledge or additional assumptions, however, we cannot distinguish the three cases in Fig. 4 through passive observation, i.e., in a purely data-driven way: the class of observational distributions over X and Y that can be realized by these models is the same in all three cases.

To be clear, this does not mean that correlation cannot be useful, nor that causal insight is always required. Both genes in Fig. 3 remain useful features for making predictions in a passive, or observational, setting in which we measure the activities of certain genes and are asked to predict the phenotype. Similarly, chocolate consumption remains predictive of winning Nobel prizes. However, if we want to answer interventional questions, such as the outcome of a gene-knockout experiment or the effect of a policy enforcing higher chocolate consumption, we need more than correlation: a causal model.

Causal inference has a long history in a variety of disciplines, including statistics, econometrics, epidemiology, and AI. As a result, different frameworks for causal modeling have emerged over the years and coexist today. The first framework described below (CGM) starts from the distribution of the observables, combining it with a directed graph to endow it with causal semantics. The second one (SCM) starts from a graph and a set of functional assignments, and generates the observed distribution as the push-forward of an unobserved noise distribution. Finally, we cover a non-graphical approach (PO) popular in statistics.

Causal Graphical Models (CGMs). The graphical models framework [78, 75] provides a compact way of representing joint probability distributions by encoding the dependence structure between variables in graphical form. Directed graphical models are also known as Bayesian networks [102]. While they do not offer a causal interpretation per se (indeed, different graphical models can be compatible with the same distribution, cf. Principle 3.2), when edges are endowed with the notion of direct causal effect (Defn. 3.1), we refer to them as causal graphical models (CGMs) [145].

Definition 4.1 (CGM). A CGM M = (G, p) over n random variables X_1, . . . , X_n consists of: (i) a directed acyclic graph (DAG) G in which directed edges (X_j → X_i) represent a direct causal effect of X_j on X_i; and (ii) a joint distribution p(X_1, . . . , X_n) which is Markovian w.r.t. G:

p(X_1, . . . , X_n) = ∏_{i=1}^n p(X_i | PA_i),   (4.1)

where PA_i = {X_j : (X_j → X_i) ∈ G} denotes the set of parents, or direct causes, of X_i in G.

We will refer to (4.1) as the causal (or disentangled) factorization. While many other, entangled, factorizations are possible, e.g.,

p(X_1, . . . , X_n) = ∏_{i=1}^n p(X_i | X_{i+1}, . . . , X_n),   (4.2)

only (4.1) decomposes the joint distribution into causal conditionals, or causal mechanisms, p(X_i | PA_i), which can have a meaningful physical interpretation, rather than being mere mathematical objects such as the factors on the RHS of (4.2).
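As a concrete worked instance (the three-variable graph used here is the one that reappears as Fig. 5 and Ex. 4.6 below), the causal factorization (4.1) for a DAG with edges X_1 → X_2, X_1 → X_3, and X_2 → X_3 reads

p(x_1, x_2, x_3) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2),

whereas applying the chain rule in the reverse variable ordering gives an entangled factorization of the same joint distribution,

p(x_1, x_2, x_3) = p(x_3) p(x_2 | x_3) p(x_1 | x_2, x_3),

whose factors mix several causal mechanisms and generally admit no physical interpretation.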
It turns out that (4.1) is equivalent to the following condition.

Definition 4.2 (Causal Markov condition). A distribution p satisfies the causal Markov condition w.r.t. a DAG G if every variable is conditionally independent of its non-descendants in G given its parents in G.

Defn. 4.2 can equivalently be expressed in terms of d-separation, a graphical criterion for directed graphs [105], by saying that d-separation in G implies (conditional) independence in p. The causal Markov condition thus provides a link between properties of p and G. What makes CGMs causal is the interpretation of edges as cause-effect relationships, which enables reasoning about the outcome of interventions using the do-operator [105] and the concept of graph surgery [145]. The central idea is that intervening on a variable, say by externally forcing it to take on a particular value, renders it independent of its causes and breaks their causal influence on it, see Fig. 5 for an illustration. For example, if a gene is knocked out, it is no longer influenced by other genes that were previously regulating it; instead, its activity is now solely determined by the intervention. This is fundamentally different from conditioning, since passively observing the activity of a gene provides information about its driving factors (i.e., its direct causes). To emphasize this difference between passive observation and active intervention, Pearl [105] introduced the notation do(X := x) to denote an intervention by which variable X is set to value x. The term graph surgery refers to the idea that the effect of such an intervention can be captured in the form of a modification to the original graph by removing all incoming edges to the intervened variable. Interventional queries can then be answered by performing probabilistic inference in the modified post-intervention graph, which typically implies additional (conditional) independences due to the removed edges.

Example 4.3. The interventional distribution p(X_3 | do(X_2 := x_2)) for the CGM in Fig. 5 is obtained via probabilistic inference w.r.t. the post-intervention graph (Fig. 5b), in which X_1 ⊥⊥ X_2:

p(X_3 | do(X_2 := x_2)) = ∑_{x_1} p(X_3 | x_1, x_2) p(x_1).   (4.3)

It differs from the conditional

p(X_3 | x_2) = ∑_{x_1} p(X_3 | x_1, x_2) p(x_1 | x_2),   (4.4)

for which inference is done over G, where X_1 ⊥̸⊥ X_2. Note the marginal p(x_1) in (4.3), in contrast to the conditional p(x_1 | x_2) in (4.4): this is precisely the link which is broken by the intervention do(X_2 := x_2), see Fig. 5b. The RHS of (4.3) is an example of covariate adjustment: it controls for the confounder X_1 of the causal effect of X_2 on X_3; see §9 for more details on adjustment and computing interventions.

CGMs have been widely used in constraint- and score-based approaches to causal discovery [145, 47], which we will discuss in §7. Due to their conceptual simplicity, they are a useful and intuitive model for reasoning about interventions. However, their capacity as a causal model is limited in that they do not support counterfactual reasoning, which is better addressed by the two causal modeling frameworks which we will discuss next.
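Before moving on, here is a numerical sketch of Ex. 4.3 (the conditional probability tables and binary variables are assumptions for illustration), contrasting the adjustment formula (4.3) with the observational conditional (4.4):

```python
# Sketch: p(X3 | do(X2 := x2)) via adjustment vs. p(X3 | x2) via conditioning,
# for the graph X1 -> X2, X1 -> X3, X2 -> X3 with assumed binary CPTs.
import numpy as np

p_x1 = np.array([0.7, 0.3])                              # p(X1)
p_x2_given_x1 = np.array([[0.9, 0.1],                    # p(X2 | X1=0)
                          [0.2, 0.8]])                   # p(X2 | X1=1)
p_x3_given_x1x2 = np.array([[[0.8, 0.2], [0.5, 0.5]],    # p(X3 | X1=0, X2=.)
                            [[0.4, 0.6], [0.1, 0.9]]])   # p(X3 | X1=1, X2=.)

x2 = 1
# (4.3): p(X3 | do(X2 := x2)) = sum_x1 p(X3 | x1, x2) p(x1)
p_do = sum(p_x3_given_x1x2[x1, x2] * p_x1[x1] for x1 in range(2))

# (4.4): p(X3 | x2) = sum_x1 p(X3 | x1, x2) p(x1 | x2), with p(x1 | x2) by Bayes' rule
joint_x1x2 = p_x1[:, None] * p_x2_given_x1
p_x1_given_x2 = joint_x1x2[:, x2] / joint_x1x2[:, x2].sum()
p_cond = sum(p_x3_given_x1x2[x1, x2] * p_x1_given_x2[x1] for x1 in range(2))

print(p_do)    # interventional: weights X1 by its marginal p(x1)
print(p_cond)  # observational: weights X1 by p(x1 | x2), generally different
```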
104): "We played around with the possibility of replacing the parents-child relationship p(X i | PA i ) with its functional counterpart X i = f i (PA i , U i ) and, suddenly, everything began to fall into place: We finally had a mathematical object to which we could attribute familiar properties of physical mechanisms instead of those slippery epistemic probabilities p(X i | PA i ) with which we had been working so long in the study of Bayesian networks." Definition 4.4 (SCM). An SCM M = (F, p U ) over a set X of n random variables X 1 , . . . , X n consists of (i) a set F of n assignments (the structural equations), where f i are deterministic functions computing each variable X i from its causal parents PA i ⊆ X \ {X i } and an exogenous noise variable U i ; and (ii) a joint distribution p U (U 1 , . . . , U n ) over the exogenous noise variables. The paradigm of SCMs views the processes f i by which each observable X i is generated from others as a physical mechanism. All randomness comes from the unobserved (also referred to as unexplained) noise terms U i which capture both possible stochasticity of the process, as well as uncertainty due to unmeasured parts of the system. Note also the assignment symbol ":=" which is used instead of an equality sign to indicate the asymmetry of the causal relationship: the LHS quantity is defined to take on the RHS value. For example, we cannot simply rewrite a structural equation for some g, as would be the case for a standard (invertible) equation. In parametric, linear form (i.e., with linear f i ), SCMs are also known as structural equation models and have a long history in path analysis [161] and economics [46, 52] . Each SCM induces a corresponding causal graph via the input variables to the structural equations which is useful as a representation and provides a link to CGMs. Definition 4.5 (Induced causal graph). The causal graph G induced by an SCM M is the directed graph with vertex set X and a directed edge from each vertex in PA i to X i for all i. Example 4.6. Consider an SCM over X = {X 1 , X 2 , X 3 } with some p U (U 1 , U 2 , U 3 ) and Following Defn. 4.5, the induced graph then corresponds to G in Fig. 5 . Defn. 4.4 allows for a rich class of causal models, including ones with cyclic causal relations and ones which do not obey the causal Markov condition (Defn. 4.2) due to complex covariance structures between the noise terms. While work exists on such cyclic or confounded SCMs [13] , it is common to make the following two assumptions. Assumption 4.7 (Acyclicity). The induced graph G is a DAG: it does not contain cycles. Assumption 4.8 (Causal sufficiency/no hidden confounders). The U i are jointly independent, i.e., their distribution factorises: Assumption 4.7 implies 5 the existence of a well-defined, unique (observational) distribution over X from which we can draw via ancestral sampling: 6 first, we draw the noise variables from p U , and then we iteratively compute the corresponding X i 's in topological order of the induced DAG (i.e., starting at the root node of the graph), substituting previously computed X i into the structural equations where necessary. Formally, the (observational) distribution p(X 1 , . . . , X n ) induced by an SCM under Assumption 4.7 is defined as the push-forward of the noise distribution p U through the structural equations F. Under Assumption 4.8, the causal conditionals are thus given by where f −1 pa i (X i ) denotes the pre-image of X i under f i for fixed PA i = pa i . 
Assumption 4.8 rules out the existence of hidden confounders, because any unmeasured variable affecting more than one of the X_i simultaneously would constitute a dependence between some of the noise terms (which account for any external, or exogenous, influences not explained by the observed X_i). In combination with Assumption 4.7, Assumption 4.8 (also known as causal sufficiency) thus ensures that the distribution induced by an SCM factorises according to its induced causal graph as in (4.1). In other words, it guarantees that the causal Markov condition is satisfied w.r.t. the induced causal graph [105]. Below, unless explicitly stated otherwise, we will assume causal sufficiency.

Due to the conceptual similarity between interventions and the assignment character of structural equations, the computation of interventional distributions fits naturally into the SCM framework. To model an intervention, we simply replace the corresponding structural equation and consider the resulting entailed distribution.

Definition 4.9 (Interventions in SCMs). An intervention do(X_i := x_i) in an SCM M is modelled by replacing the i-th structural equation in F by X_i := x_i; the interventional distribution p(X | do(X_i := x_i)) is the distribution entailed by the resulting intervened SCM.

This way of handling interventions coincides with that for CGMs: e.g., after performing do(X_2 := x_2) in Ex. 4.6, X_1 no longer appears in the structural equation for X_2, and the edge X_1 → X_2 hence disappears in the intervened graph, as in the post-intervention graph of Fig. 5b. In contrast to CGMs, SCMs also provide a framework for counterfactual reasoning. While (i) observations describe what is passively seen or measured and (ii) interventions describe active external manipulation or experimentation, (iii) counterfactuals are statements about what would or could have been, given that something else was in fact observed. These three modes of reasoning are sometimes referred to as the three rungs of the "ladder of causation" [108]. As an example, consider the following counterfactual query: Given that patient X received treatment A and their health got worse, what would have happened if they had been given treatment B instead, all else being equal? The "all else being equal" part highlights the difference between interventions and counterfactuals: observing the factual outcome (i.e., what actually happened) provides information about the background state of the system (as captured by the noise terms in SCMs) which can be used to reason about alternative, counterfactual, outcomes. This differs from an intervention, where such background information is not available. For example, observing that treatment A did not work may tell us that the patient has a rare condition and that treatment B would have therefore worked. However, given that treatment A has been prescribed, the patient's condition may have changed, and B may no longer work in a future intervention. Note that counterfactuals cannot be observed empirically by their very definition and are therefore unfalsifiable. Some therefore consider them unscientific [116] or at least problematic [26]. On the other hand, humans seem to perform counterfactual reasoning in practice, developing this ability in early childhood [14]. Counterfactuals are computed in SCMs through the following three-step procedure:

1. Update the noise distribution to its posterior given the observed evidence ("abduction").
2. Manipulate the structural equations to capture the hypothetical intervention ("action").
3. Use the modified SCM to infer the quantity of interest ("prediction").
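As a contrast to this counterfactual procedure, the following continues the sampling sketch above and implements a purely interventional query as in Defn. 4.9 (same assumed coefficients):

```python
# Sketch: modelling do(X2 := x2) by replacing the structural equation for X2.
import numpy as np
rng = np.random.default_rng(1)
alpha, beta, gamma, x2_val = 1.0, -1.0, 2.0, 3.0
n = 100_000

u1, u2, u3 = rng.normal(size=(3, n))

# intervened SCM: the equation for X2 no longer depends on X1 (or U2)
x1 = u1
x2 = np.full(n, x2_val)              # X2 := x2  (edge X1 -> X2 removed)
x3 = beta * x1 + gamma * x2 + u3     # unchanged mechanism for X3

print(x3.mean())                     # E[X3 | do(X2 := 3)] = gamma * 3 = 6.0
# The observational conditional E[X3 | X2 = 3] would additionally pick up
# information about X1 through the dependence of X2 on X1, and thus differ.
```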
Note that while computing interventions only involved manipulating the structural equations, counterfactuals also involve updating the noise distribution, highlighting the conceptual difference between the two. Updating p_U requires knowledge of the interaction between noise and observed variables, i.e., of the structural equations, which explains why additional assumptions are necessary. Note that the updated noise variables no longer need to be independent, even if the original system was causally sufficient (Assumption 4.8).

Example 4.11 (Computing counterfactuals with SCMs). Consider an SCM M defined by the structural equations

X := U_X,   Y := 3X + U_Y,

with U_X, U_Y ∼ N(0, 1) independent. Suppose we observe X = 2 and Y = 6.5 and want to answer the counterfactual "what would Y have been, had X = 1?", i.e., we are interested in p(Y_{X=1} | X = 2, Y = 6.5). Updating the noise using the observed evidence yields the point mass p(U_X, U_Y | X = 2, Y = 6.5) = δ(U_X − 2) δ(U_Y − 0.5), and we obtain the counterfactual SCM M_{X=2, Y=6.5} with structural equations

X := 2,   Y := 3X + 0.5,   (4.8)

where δ(·) denotes the Dirac delta measure. Performing the intervention do(X := 1) in (4.8) then gives the result p(Y_{X=1} | X = 2, Y = 6.5) = δ(Y − 3.5), i.e., "Y would have been 3.5". This differs from the interventional distribution p(Y | do(X := 1)) = N(3, 1), since the factual observation helped determine the background state.

The SCM viewpoint is intuitive and lends itself well to studying restrictions on function classes to enable induction (§2). For this reason, we will mostly focus on SCMs in the subsequent sections.

Potential Outcomes (PO). The potential outcomes framework was initially proposed by Neyman [98] for randomized studies [31], and later popularized and extended to observational settings by Rubin [125] and others. It is popular within statistics and epidemiology and perhaps best understood in the context of the latter. This is also reflected in its terminology: in the most common setting, we consider a binary treatment variable T, with T = 1 and T = 0 corresponding to treatment and control, respectively, whose causal effect on an outcome variable Y (often a measure of health) is of interest. One interpretation of the PO framework consistent with its roots in statistics is to view causal inference as a missing data problem. In the PO framework, for each individual (or unit) i and treatment value t there is a PO, or potential response, denoted Y_i(t), capturing what would happen if individual i received treatment t. The POs are considered deterministic quantities in the sense that for a given individual i, Y_i(1) and Y_i(0) are fixed and all randomness in the realized outcome Y_i stems from randomness in the treatment assignment:

Y_i = T_i Y_i(1) + (1 − T_i) Y_i(0).   (4.9)

To decide whether patient i should receive treatment, we need to reason about the individualized treatment effect (ITE) τ_i as captured by the difference of the two POs.

Definition 4.12 (ITE). The ITE for individual i under a binary treatment is defined as

τ_i := Y_i(1) − Y_i(0).   (4.10)

The "fundamental problem of causal inference" [51] is that only one of the POs is ever observed for each i. The other, unobserved PO becomes a counterfactual:

Y_i^CF = T_i Y_i(0) + (1 − T_i) Y_i(1).   (4.11)

Consequently, τ_i can never be measured or computed from data, i.e., it is not identifiable (without further assumptions), as illustrated in Tab. 1. Implicit in the form of (4.9) and (4.11) are the following two assumptions.

Assumption 4.13 (Stable unit treatment value; SUTVA). The observation on one unit should be unaffected by the particular assignment of treatments to the other units [23].

Assumption 4.14 (Consistency). If individual i receives treatment t, then the observed outcome is Y_i = Y_i(t), i.e., the potential outcome for t.
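Before discussing these assumptions further, here is a small simulation sketch (all quantities assumed) of the "fundamental problem": both POs exist for every unit, but only one is ever observed, so the ITE is not identifiable; under randomized treatment assignment, the average treatment effect nevertheless is:

```python
# Sketch: potential outcomes, the realized outcome (4.9), and ATE estimation.
import numpy as np
rng = np.random.default_rng(0)
n = 100_000

y0 = rng.normal(0, 1, n)               # potential outcome under control
y1 = y0 + rng.normal(1.0, 0.5, n)      # potential outcome under treatment
tau = y1 - y0                          # individual treatment effects (never observed)

t = rng.integers(0, 2, n)              # randomized treatment assignment
y_obs = t * y1 + (1 - t) * y0          # realized outcome, cf. (4.9)

ate_true = tau.mean()
ate_est = y_obs[t == 1].mean() - y_obs[t == 0].mean()   # difference in means
print(round(ate_true, 3), round(ate_est, 3))            # both close to 1.0
# tau_i itself cannot be recovered: for each unit, the other PO is missing.
```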
Assumption 4.13 is usually understood as (i) units do not interfere, and (ii) there is only one treatment level per group (treated or control), leading to well-defined POs [61]. It can be violated, e.g., through (i) population dynamics such as herd immunity from vaccination, or (ii) technical errors or varying within-group dosage, respectively. However, for many situations such as controlled studies it can be a reasonable assumption, and we can then view different units as independent samples from a population.

So far, we have considered POs for a given unit as deterministic quantities. However, most times it is impossible to fully characterize a unit, e.g., when dealing with complex subjects such as humans. Such lack of complete information introduces uncertainty, so that POs are often instead treated as random variables. This parallels the combination of deterministic structural equations with exogenous noise variables in SCMs. Indeed, there is an equivalence between POs and SCMs [105]: the PO Y_i(t) corresponds to the outcome of Y under the intervention do(T := t) in an SCM whose noise variables are fixed to the values describing individual i. An individual in the PO framework thus corresponds to a particular instantiation of the U_j in an SCM: the outcome is deterministic given U, but since we do not observe u_i (nor can we characterize a given individual based on observed covariates), the counterfactual outcome is treated as a random variable. In practice, all we observe is a featurised description x_i of an individual i, and we have to reason about expected POs, E[Y(t) | X = x_i].

Another common assumption is that of no hidden confounders, which we have already encountered in form of the causal Markov condition (Defn. 4.2) for CGMs and causal sufficiency (Assumption 4.8) for SCMs. In the PO framework, this becomes no hidden confounding between treatment and outcome and is referred to as (conditional) ignorability.

Assumption 4.15 (Conditional ignorability). Given a treatment T ∈ {0, 1}, potential outcomes Y(0), Y(1), and observed covariates X, we have:

Y(0), Y(1) ⊥⊥ T | X.

The PO framework is tailored toward studying the (confounded) effect of a typically binary treatment variable on an outcome and is mostly used for causal reasoning, i.e., estimating individual and population level causal effects (§9). In this context, it is sometimes seen as an advantage that an explicit graphical representation is not needed. At the same time, the lack of a causal graph and the need for special treatment and outcome variables make POs rather unsuitable for causal discovery, where other frameworks prevail.

Let us consider the disentangled factorization (4.1) of the joint distribution p(X_1, . . . , X_n). This factorization according to the causal graph is possible whenever the U_i are independent. We now consider an additional notion of independence, concerning how the factors in (4.1) relate to one another. Consider a dataset that consists of altitude A and average annual temperature T of weather stations [112]. Suppose we find that A and T are correlated, which we attribute to the fact that the altitude has a causal effect on the temperature. Suppose we had two such datasets, one for Austria and one for Switzerland. The two joint distributions may be different, since the marginal distributions p(A) over altitudes will likely differ. The conditionals p(T | A), however, may be rather similar, since they reflect the physical mechanisms generating temperature from altitude. The causal factorization p(A) p(T | A) thus contains a component p(T | A) that generalizes across countries, while the entangled factorization p(T) p(A | T) does not.
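A toy numerical check of this intuition (all numbers assumed): two "countries" with different altitude marginals p(A) but a shared mechanism p(T | A); the causal regression transfers across datasets, the anticausal one does not:

```python
# Sketch: invariance of the causal conditional p(T | A) across environments.
import numpy as np
rng = np.random.default_rng(0)

def sample_country(mean_alt, n=50_000):
    a = rng.gamma(shape=2.0, scale=mean_alt / 2.0, size=n)   # country-specific p(A)
    t = 15.0 - 0.0065 * a + rng.normal(0, 3, n)              # shared mechanism p(T | A)
    return a, t

a_at, t_at = sample_country(mean_alt=900.0)     # "Austria"
a_ch, t_ch = sample_country(mean_alt=1300.0)    # "Switzerland"

# Linear fits T ~ A in each country: the mechanism is (nearly) invariant.
for a, t in [(a_at, t_at), (a_ch, t_ch)]:
    slope, intercept = np.polyfit(a, t, deg=1)
    print(round(slope, 4), round(intercept, 2))   # ~ -0.0065 and ~ 15.0 in both

# The anticausal fits A ~ T change with the country, because they mix the
# invariant mechanism with the country-specific marginal p(A).
for a, t in [(a_at, t_at), (a_ch, t_ch)]:
    print(np.polyfit(t, a, deg=1).round(1))       # noticeably different coefficients
```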
A similar reasoning applies when we consider interventions in a system. For a model to correctly predict the effect of interventions, it needs to have components that are robust when moving from an observational distribution to certain interventional distributions. The above insights can be stated as follows [130, 112]:

Principle 5.1 (Independent Causal Mechanisms (ICM)). The causal generative process of a system's variables is composed of autonomous modules that do not inform or influence each other. In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other mechanisms.

This principle subsumes several notions important to causality, including separate intervenability of causal variables, modularity and autonomy of subsystems, and invariance [105, 112]. In the two-variable case, it reduces to an independence between the cause distribution and the mechanism producing the effect from the cause. Applied to the causal factorization (4.1), the principle tells us that the factors should be independent in two senses: (influence) changing (or intervening upon) one mechanism p(X_i | PA_i) does not change the other mechanisms p(X_j | PA_j) (i ≠ j), and (information) knowing some mechanisms p(X_i | PA_i) (i ≠ j) does not give us information about a mechanism p(X_j | PA_j).

Figure 6: If the cause distribution p(x) and a monotone mechanism f are chosen independently, the peaks of the effect distribution p(y) tend to occur in regions where f has small slope. As a result, p(y) contains information about f^{-1} (figure from [112]).

We view any real-world distribution as a product of causal mechanisms. A change in such a distribution (e.g., when moving from one setting/domain to a related one) will always be due to a change in at least one of those mechanisms. Consistent with Principle 5.1, we hypothesize [131]:

Principle 5.2 (Sparse Mechanism Shift (SMS)). Small distribution changes tend to manifest themselves in a sparse or local way in the causal/disentangled factorization (4.1), i.e., they should usually not affect all factors simultaneously.

In contrast, in a non-causal factorization, e.g., (4.2), many terms will be affected simultaneously if we change one of the physical mechanisms responsible for a system's statistical dependences. Such a factorization may thus be called entangled. The notion of disentanglement has recently gained popularity in machine learning [9, 50, 82, 148], sometimes loosely identified with statistical independence. The notion of invariant, autonomous, and independent mechanisms has appeared in various guises throughout the history of causality research, see [1, 105, 130, 112, 131].

Measures of Dependence of Mechanisms. Note that the dependence of two mechanisms p(X_i | PA_i) and p(X_j | PA_j) does not coincide with the statistical dependence of the random variables X_i and X_j. Indeed, in a causal graph, many of the random variables will be dependent even if all the mechanisms are independent. Consider two variables and structural assignments X := U and Y := f(X), i.e., the cause X is a noise variable (with density p(x)), and the effect Y is a deterministic function of the cause. Let us moreover assume that the ranges of X and Y are both [0, 1], and that f is strictly monotonically increasing. The ICM principle then reduces to an independence of p(x) and f. Let us consider p(x) and the derivative f′ as random variables on the probability space [0, 1] with Lebesgue measure, and use their correlation as a measure of dependence of mechanisms. It can be shown that for f ≠ id, independence of p(x) and f′ implies dependence between p(y) and (f^{-1})′ (see Fig. 6).
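A quick numerical illustration of the Fig. 6 intuition (the particular choices of p(x) and f are assumptions): if the cause distribution is chosen independently of a monotone mechanism, the effect density peaks where the mechanism is flat, so it carries information about f^{-1}:

```python
# Sketch: p(y) concentrates where the mechanism f has small slope.
import numpy as np
rng = np.random.default_rng(0)

x = rng.uniform(0, 1, 1_000_000)   # cause distribution chosen "independently" of f
f = lambda x: x ** 3               # monotone mechanism: flat near 0, steep near 1
y = f(x)

hist, edges = np.histogram(y, bins=10, range=(0, 1), density=True)
print(hist.round(2))
# The density of Y is largest in the first bins (where f is flat, i.e. (f^{-1})'
# is large) and smallest near 1: an asymmetry absent in the causal direction.
```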
Other measures are possible and admit information-geometric interpretations. Intuitively, under the ICM assumption (Principle 5.1), the "irregularity" of the effect distribution becomes a sum of (i) irregularity already present in the input distribution and (ii) irregularity introduced by the mechanism f, i.e., the irregularities of the two mechanisms add up rather than (partly) compensating each other. This would not be the case in the opposite ("anticausal") direction (for details, see [64]). Other dependence measures have been proposed for high-dimensional linear settings and time series [63, 137, 12, 68].

Algorithmic Independence. So far, we have discussed links between causal and statistical structures. The more fundamental of the two is the causal structure, since it captures the physical mechanisms that generate statistical dependences in the first place. The statistical structure is an epiphenomenon that follows if we make the unexplained variables random. It is awkward to talk about the (statistical) information contained in a mechanism, since deterministic functions in the generic case neither generate nor destroy information. This motivated us to devise an algorithmic model of causal structures in terms of Kolmogorov complexity [66]. The Kolmogorov complexity (or algorithmic information) of a bit string is essentially the length of its shortest compression on a Turing machine, and thus a measure of its information content. Independence of mechanisms can be defined as vanishing mutual algorithmic information: two conditionals are considered independent if knowing (the shortest compression of) one does not help achieve a shorter compression of the other one. Algorithmic information theory provides an elegant framework for non-statistical graphical models. Just like statistical CGMs are obtained from SCMs by making the unexplained variables U_i random, we obtain algorithmic CGMs by turning the U_i into bit strings (jointly independent across nodes), and viewing the node X_i as the output of a fixed Turing machine running the program U_i with input PA_i. Similar to the statistical case, one can define a local causal Markov condition, a global one in terms of d-separation, and a decomposition of the joint Kolmogorov complexity in analogy to (4.1), and prove that they are implied by the SCM [66]. This approach shows that concepts of causality are not intrinsically tied to statistics: causality is about mechanisms governing the flow of information, which may or may not be statistical.

The assumption of algorithmically independent mechanisms has interesting implications for physics: it implies the second law of thermodynamics (i.e., the arrow of time). Consider a process where an incoming ordered beam of photons (the cause) is scattered by an object (the mechanism). Then the outgoing beam (the effect) contains information about the object. Microscopically, the time evolution is reversible; however, the photons contain information about the object only after the scattering. What underlies this asymmetry, which is at the heart of Loschmidt's paradox [86]? It can be explained by applying the ICM Principle 5.1 to the initial state and the system dynamics, postulating that the two be algorithmically independent, i.e., knowing one does not allow a shorter description of the other one. The Kolmogorov complexity of the system's state can then be shown to be non-decreasing under time evolution [62].
If we view Kolmogorov complexity as a measure of entropy, this means that the entropy of the state can only stay constant or increase, amounting to the second law of thermodynamics. Note that the resulting state after time evolution is clearly not independent of the system dynamics: it is precisely the state that, when fed to the inverse dynamics, would return us to the original (ordered) state.

Coupled differential equations are the canonical way of modeling physical phenomena. They allow us to predict the future behavior of a system, to reason about the effect of interventions, and, by suitable averaging procedures, to predict statistical dependences that are generated by a coupled time evolution. They also allow us to gain insight into a system, explain its functioning, and, in particular, read off its causal structure. Consider a coupled set of ordinary differential equations,

dx/dt = f(x(t)),   x(t_0) = x_0,   (6.1)

which we assume to correctly describe the physical mechanisms of a system. The Picard-Lindelöf theorem states that, at least locally, if f is Lipschitz, there exists a unique solution x(t). This implies, in particular, that the immediate future of x is implied by its past values. In terms of infinitesimal differentials dt and dx = x(t + dt) − x(t), (6.1) reads:

x(t + dt) = x(t) + dt · f(x(t)).   (6.2)

From this, we can ascertain which entries of the vector x(t) cause the future of others x(t + dt), i.e., the causal structure.

Compared to a differential equation, a statistical model derived from the joint distribution of a set of (time-independent) random variables is a rather superficial description of a system. It exploits the fact that some of the variables allow the prediction of others as long as the experimental conditions do not change. If we drive a differential equation system with certain types of noise, or if we average over time, statistical dependences between components of x may emerge, which can be exploited by machine learning. In contrast to the differential equation model, such a model does not allow us to predict the effect of interventions; however, its strength is that it can often be learned from data. Causal modeling lies in between these two extremes. It aims to provide understanding and to predict the effect of interventions. Causal discovery and learning try to arrive at such models in a data-driven way, using only weak assumptions (see Table 2, from [112, 131]).

While we may naively think that causality is always about time, most existing causal models do not (and need not) consider time. For instance, returning to our example of altitude and temperature, there is an underlying dynamical physical process that results in higher places tending to be colder. On the level of microscopic equations of motion for the involved particles, there is a temporal causal structure. However, when we talk about dependence or causality between altitude and temperature, we need not worry about the details of this temporal structure; we are given a dataset where time does not appear, and we can reason about how that dataset would look if we were to intervene on temperature or altitude. Some work exists trying to build bridges between these different levels of description. One can derive SCMs that describe the interventional behavior of a coupled system that is in an equilibrium state and perturbed in an adiabatic way [96], with generalizations to oscillatory systems [123]. In this work, an SCM arises as a high-level abstraction of an underlying system of differential equations.
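A minimal sketch (the dynamics are assumed for illustration) of reading causal structure off a coupled ODE via its Euler discretisation (6.2): here x_1 drives x_2 but not vice versa, so holding x_1 fixed changes x_2's future, while the converse does not hold:

```python
# Sketch: Euler integration of a coupled ODE and crude "clamping" interventions.
import numpy as np

def f(x):                       # dx1/dt = -x1 ;  dx2/dt = x1 - x2
    return np.array([-x[0], x[0] - x[1]])

def simulate(x0, dt=0.01, steps=500, clamp=None):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * f(x)       # Euler step, cf. (6.2)
        if clamp is not None:   # crude intervention: hold one coordinate fixed
            x[clamp[0]] = clamp[1]
    return x

print(simulate([1.0, 0.0]))                    # free evolution
print(simulate([1.0, 0.0], clamp=(0, 1.0)))    # do(x1 := 1): x2 is pulled towards 1
print(simulate([1.0, 0.0], clamp=(1, 0.0)))    # do(x2 := 0): x1's trajectory unchanged
```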
Such an SCM can only be derived if suitable high-level variables can be defined [124], which in practice may well be the exception rather than the rule.

Sometimes, domain knowledge or the temporal ordering of events can help constrain the causal relationships between variables: e.g., we may know that certain attributes like age or sex are not caused by others; treatments influence health outcomes; and events do not causally influence their past. When such domain knowledge is unavailable or incomplete, we need to perform causal discovery: infer which variables causally influence which others, i.e., learn the causal structure (e.g., a DAG) from data. Since experiments are often difficult and expensive to perform while observational (i.e., passively collected) data is abundant, causal discovery from observational data is of particular interest. As discussed in §3 in the context of the Common Cause Principle 3.2, the case where we have two variables is already difficult, since the same dependence can be explained by multiple different causal structures. One might thus wonder if the case of more observables is completely hopeless. Surprisingly, this is not the case: the problem becomes easier (in a certain sense) because there are nontrivial conditional independence properties [146, 25, 35] implied by a causal structure. We first review two classical approaches to the multi-variate setting before returning to the two-variable case.

Constraint-Based Methods. Constraint-based approaches to causal discovery test which (conditional) independences can be inferred from the data and then try to find a graph which implies them. They are therefore also known as independence-based methods. Such a procedure requires a way of linking properties of the data distribution p to properties of the underlying causal graph G. This link is known as the faithfulness assumption: the only (conditional) independences satisfied by p are those implied by the causal Markov condition w.r.t. G. Faithfulness can thus be seen as the converse of the causal Markov condition. Together, they constitute a one-to-one correspondence between graphical separation in G and conditional independence in p. While the causal Markov condition is satisfied by construction, faithfulness is an assumption which may be violated. A classical example of a violation of faithfulness is when causal effects along different paths cancel.

Example 7.2 (Violation of faithfulness). Consider the SCM from Ex. 4.6 and let U_1, U_2, U_3 ∼ N(0, 1) be jointly independent. By substitution, we obtain X_3 = (β + αγ) X_1 + γ U_2 + U_3. Hence X_3 ⊥⊥ X_1 whenever β + αγ = 0, even though this independence is not implied by the causal Markov condition over the induced causal graph G, see Fig. 5. Here, faithfulness is violated if the direct effect of X_1 on X_3 (β) and the indirect effect via X_2 (αγ) cancel.

Apart from relying on faithfulness, a fundamental limitation of constraint-based methods is the fact that many different DAGs may encode the same d-separation / independence relations. This is referred to as Markov equivalence and illustrated in Fig. 7.

Figure 7: Illustration of Markov equivalence using common graph motifs. The chains in (a) and the fork in (b) all imply the relation X ⊥⊥ Z | Y (and no others). They thus form a Markov equivalence class, meaning they cannot be distinguished using conditional independence testing alone. The collider, or v-structure, in (c) implies X ⊥⊥ Z (but X ⊥̸⊥ Z | Y) and forms its own Markov equivalence class, so it can be uniquely identified from observational data. For this reason, v-structures are helpful for causal discovery.
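A small simulation sketch of the Fig. 7 discussion (linear Gaussian models with assumed coefficients): chain, fork, and collider share the same skeleton X - Y - Z, but only the collider can be told apart from observational data via the (partial) correlation pattern of X and Z:

```python
# Sketch: (conditional) dependence patterns of chain, fork and collider.
import numpy as np
rng = np.random.default_rng(0)
n = 200_000
def noise(): return rng.normal(size=n)

def partial_corr(x, z, y):
    """Correlation of X and Z after linearly regressing out Y from both."""
    rx = x - np.polyval(np.polyfit(y, x, 1), y)
    rz = z - np.polyval(np.polyfit(y, z, 1), y)
    return np.corrcoef(rx, rz)[0, 1]

# chain X -> Y -> Z
x = noise(); y = x + noise(); z = y + noise()
print(np.corrcoef(x, z)[0, 1].round(2), partial_corr(x, z, y).round(2))  # dep., indep. given Y
# fork X <- Y -> Z
y = noise(); x = y + noise(); z = y + noise()
print(np.corrcoef(x, z)[0, 1].round(2), partial_corr(x, z, y).round(2))  # dep., indep. given Y
# collider X -> Y <- Z
x = noise(); z = noise(); y = x + z + noise()
print(np.corrcoef(x, z)[0, 1].round(2), partial_corr(x, z, y).round(2))  # indep., dep. given Y
```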
It can be shown that two graphs are Markov equivalent iff they share the same skeleton and v-structures. Constraint-based algorithms typically first construct an undirected graph, or skeleton, which captures the (conditional) independences found by testing, and then direct as many edges as possible using Meek's orientation rules [90]. The first step carries most of the computational weight, and various algorithms have been devised to solve it efficiently. The simplest procedure is implemented in the IC [110] and SGS [145] algorithms. For each pair of variables (X, Y), these search through all subsets W of the remaining variables to check whether X ⊥⊥ Y | W. If no such set W is found, then X and Y are connected with an edge. Since this can be slow due to the large number of subsets, the PC algorithm [145] uses a much more efficient search procedure. It starts from a complete graph and then sequentially tests only subsets of the neighbors of X or Y of increasing size, removing an edge when a separating subset is found. This neighbor search is no longer guaranteed to give the right result for causally insufficient systems, i.e., in the presence of hidden confounders. The FCI (for Fast Causal Inference) algorithm [145] addresses this setting, and produces a partially directed causal graph as output.

Apart from being limited to recovering a Markov equivalence class, constraint-based methods can suffer from statistical issues. In practice, datasets are finite, and conditional independence testing is a notoriously difficult problem, especially if conditioning sets are continuous and multi-dimensional. So while in principle the conditional independences implied by the causal Markov condition hold true irrespective of the complexity of the functions appearing in an SCM, for finite datasets, conditional independence testing is hard without additional assumptions [136]. Recent progress in (conditional) independence testing heavily relies on kernel function classes to represent probability distributions in reproducing kernel Hilbert spaces, see §2.

Score-Based Methods. Score-based approaches to causal discovery assign a score to each graph G from a set of candidate graphs (usually the set of all DAGs). The score S is supposed to reflect how well G explains the observed data D = {x_1, . . . , x_m}, and we choose the graph Ĝ maximizing this score,

Ĝ = argmax_G S(G; D).

Various score functions have been proposed, but most methods assume a parametric model which factorises according to G, parametrised by θ ∈ Θ. Two common choices are multinomial models for discrete data [22] and linear Gaussian models for continuous data [34]. E.g., a penalized maximum likelihood approach using the BIC [135] as a score yields

S_BIC(G; D) = log p(D | θ_MLE, G) − (k/2) log m,   (7.1)

where k is the number of parameters and θ_MLE is the maximum likelihood estimate of θ for D under G. Note that k generally increases with the number of edges in G, so that the second term in (7.1) penalizes complex graphs which do not lead to substantial improvements.

Another choice of score function is the marginal likelihood, or evidence, in a Bayesian approach to causal discovery, which requires specifying prior distributions over graphs and parameters, p(G, θ) = p(G) p(θ | G). The score for G is then given by

S(G; D) = p(D | G) = ∫_Θ p(D | θ, G) p(θ | G) dθ.

This integral is intractable in general, but can be computed exactly for some models such as a Dirichlet-multinomial under some mild additional assumptions [47, 48].
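A sketch of score-based discovery under the linear Gaussian assumption, using a BIC-style score as in (7.1) (all modelling choices assumed): a few candidate DAGs over (X_1, X_2, X_3) are scored on data generated from the chain X_1 → X_2 → X_3:

```python
# Sketch: scoring candidate DAGs with a BIC-style score for linear Gaussian models.
import numpy as np
rng = np.random.default_rng(0)
m = 5_000
X1 = rng.normal(size=m)
X2 = 1.5 * X1 + rng.normal(size=m)
X3 = -1.0 * X2 + rng.normal(size=m)
data = {1: X1, 2: X2, 3: X3}

def node_loglik(child, parents):
    y = data[child]
    if parents:
        A = np.column_stack([data[p] for p in parents] + [np.ones(m)])
        resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    else:
        resid = y - y.mean()
    var = resid.var()
    return -0.5 * m * (np.log(2 * np.pi * var) + 1)   # Gaussian max. log-likelihood

def bic(dag):   # dag: {child: parent list}; k counts coefficients + intercept + variance
    loglik = sum(node_loglik(c, ps) for c, ps in dag.items())
    k = sum(len(ps) + 2 for ps in dag.values())
    return loglik - 0.5 * k * np.log(m)

candidates = {
    "chain 1->2->3":            {1: [], 2: [1], 3: [2]},
    "full DAG":                 {1: [], 2: [1], 3: [1, 2]},
    "collider only (no 1->2)":  {1: [], 2: [], 3: [1, 2]},
}
for name, dag in candidates.items():
    print(name, round(bic(dag), 1))
# The true chain attains the highest score: the full DAG fits equally well but
# is penalised for its extra parameter; dropping the edge 1->2 hurts the fit.
```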
The number of DAGs over n random variables grows super-exponentially and can be computed recursively (to account for acyclicity constraints) [120]. E.g., the number of DAGs for n = 5 and n = 10 nodes is 29,281 and 4,175,098,976,430,598,143, respectively. Finding the best scoring DAG is NP-hard [20]. To overcome this problem, greedy search techniques can be applied, e.g., greedy equivalence search (GES) [21], which optimizes the BIC.

In recent years, another class of methods has emerged that is based on assuming particular functional forms for the SCM assignments. Those arose from studying the cause-effect inference problem, as discussed below.

Cause-Effect Inference. In the case of only two variables, the ternary concept of conditional independences collapses and the causal Markov condition (Defn. 4.2) thus has no nontrivial implications. However, we have seen in § 5 that assuming an independence of mechanisms (Principle 5.1) lets us find asymmetries between cause and effect, and thus address the cause-effect inference problem previously considered unsolvable [64]. It turns out that this problem can also be addressed by making additional assumptions on function classes, as not only the graph topology leaves a footprint in the observational distribution, but so do the functions f_i in an SCM. Such assumptions are typical for machine learning, where it is well-known that finite-sample generalization without assumptions on function classes is impossible, and where much attention is devoted to properties of function classes (e.g., priors or capacity measures), as discussed in § 2.

Let us provide an intuition as to why assumptions on the functions in an SCM should help learn about them from data. Consider a toy SCM with only two observables X → Y. In this case, the structural equations (4.5) turn into X := U, Y := f(X, V), (7.3) with noises U ⊥⊥ V. Now think of V as acting as a random selector variable choosing from among a set of functions {f_v(·) := f(·, v) : v ∈ supp(V)}. If f(x, v) depends on v in a non-smooth way, it should be hard to glean information about the SCM from a finite dataset, given that V is not observed and it randomly switches between arbitrarily different f_v. 9 This motivates restricting the complexity with which f depends on V. A natural restriction is to assume an additive noise model Y := f(X) + V. (7.4) If f in (7.3) depends smoothly on V, and if V is relatively well concentrated, this can be motivated by a local Taylor expansion argument. Such assumptions drastically reduce the effective size of the function class; without them, the latter could depend exponentially on the cardinality of the support of V.

Restrictions of function classes can break the symmetry between cause and effect in the two-variable case: one can show that given a distribution over X, Y generated by an additive noise model, one cannot fit an additive noise model in the opposite direction (i.e., with the roles of X and Y interchanged) [53, 95, 114, 76, 6]. This is subject to certain genericity assumptions, and notable exceptions include the case where U, V are Gaussian and f is linear. This result generalizes the findings of [140] for linear functions, and it can be extended to include nonlinear rescaling [164], cycles [94], confounders [65], and multi-variable causal discovery [113]. There is now a range of methods that can detect causal direction better than chance [97]. We have thus gathered some evidence that ideas from machine learning can help tackle causality problems that were previously considered hard.
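The following sketch illustrates the additive-noise idea in its simplest form (our own illustration; the polynomial regression, the HSIC-style dependence statistic with a median-heuristic bandwidth, and all constants are ad hoc choices, not a specific published method): we fit a regression in both directions and prefer the direction in which the residuals are closer to being independent of the putative cause.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 500

# Ground truth additive noise model: X -> Y with Y = f(X) + V, nonlinear f.
X = rng.uniform(-2, 2, m)
Y = np.tanh(2 * X) + 0.3 * rng.standard_normal(m)

def rbf_gram(a):
    """RBF Gram matrix with a median-heuristic bandwidth."""
    d2 = (a[:, None] - a[None, :]) ** 2
    bw = np.median(d2[d2 > 0])
    return np.exp(-d2 / bw)

def hsic(a, b):
    """Biased HSIC estimate of the dependence between two 1-d samples (0 = independent)."""
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_gram(a), rbf_gram(b)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def residual_dependence(cause, effect, deg=5):
    """Fit a polynomial regression effect ~ cause and return HSIC(residuals, cause)."""
    coef = np.polyfit(cause, effect, deg)
    resid = effect - np.polyval(coef, cause)
    return hsic(resid, cause)

score_xy = residual_dependence(X, Y)  # forward fit: residuals should be ~independent of X
score_yx = residual_dependence(Y, X)  # backward fit: residuals remain dependent on Y
print(f"residual dependence, X -> Y: {score_xy:.4f}")
print(f"residual dependence, Y -> X: {score_yx:.4f}")
print("inferred direction:", "X -> Y" if score_xy < score_yx else "Y -> X")
```

On data generated from the forward model, the forward residuals carry little information about X, whereas no additive noise model fits the backward direction, so the backward residuals remain dependent on Y.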
Equally intriguing, however, is the opposite direction: can causality help us improve machine learning?

Nonstationarity-Based Methods The last family of causal discovery approaches we mention is based on ideas of nonstationarity and invariance [130]. These approaches do not apply to purely observational data collected in an i.i.d. setting. Instead, they aim to leverage heterogeneity of data collected from different environments. The main idea is the following: since causal systems are modular in the sense of the ICM Principle 5.1, changing one of the independent mechanisms should leave the other components, or causal conditionals, unaffected (SMS Principle 5.2). A correct factorization of the joint distribution according to the underlying causal structure should thus be able to explain heterogeneity by localized changes in one (or few) of the mechanisms while the others remain invariant. One of the first works to use this idea [151] analyzed which causal structures can be distinguished given data resulting from a set of mechanism changes. Recent work [54] additionally aims to learn a low-dimensional representation of the mechanism changes. Other works [111, 121] have proposed methods for finding the direct causes of a given target variable. Using a recent result on identifiability of non-linear ICA [59] which also relies on non-stationarity, a method for learning general non-linear SCMs was proposed [93]. Here the idea is to train a classifier to discriminate between the true value of some nonstationarity variable (such as a time-stamp or environment indicator) and a shuffled version thereof.

Semi-Supervised Learning Suppose our underlying causal graph is X → Y, and we wish to learn a mapping X → Y. The causal factorization (4.1) in this case is p(X, Y) = p(X) p(Y | X). The ICM Principle 5.1 posits that the modules in a joint distribution's causal factorization do not inform or influence each other. This means that, in particular, p(X) should contain no information about p(Y | X), which implies that semi-supervised learning [17] should be futile, as it is trying to use additional information about p(X) (from unlabeled data) to improve our estimate of p(Y | X = x). What about the opposite, anticausal direction, i.e., predicting a cause from its effect? Is there hope that semi-supervised learning should be possible in that case? It turns out the answer is yes, due to work on cause-effect inference using the ICM Principle 5.1 [24]. This work introduced a measure of dependence between the input and the conditional of output given input, and showed that if this dependence is zero in the causal direction, then it is strictly positive in the opposite direction. Independence of cause and mechanism in the causal direction thus implies that in the backward direction (i.e., for anticausal learning), the distribution of the input variable should contain information about the conditional of output given input, i.e., the quantity that machine learning is usually concerned with. This is exactly the kind of information that semi-supervised learning requires when trying to improve the estimate of output given input by using unlabeled inputs. This suggests that semi-supervised learning should be impossible for causal learning problems, but feasible otherwise, in particular for anticausal ones. A meta-analysis of published semi-supervised learning benchmark studies corroborated this prediction [130], and similar results apply for natural language processing [69].
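The asymmetry underlying these arguments can be seen in a toy simulation (our own sketch; the linear mechanism and the particular environment shifts are arbitrary choices): for data generated as X → Y, shifting p(X) across environments leaves the causal conditional p(Y | X) invariant, while the anticausal conditional p(X | Y) changes.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_env(mean_x, std_x, m=50_000):
    """X -> Y with a fixed mechanism Y := 2*X + N(0,1); only p(X) differs across environments."""
    X = mean_x + std_x * rng.standard_normal(m)
    Y = 2.0 * X + rng.standard_normal(m)
    return X, Y

def ols_slope(a, b):
    """Slope of the least-squares regression of b on a."""
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / (a @ a)

for env, (mu, sd) in {"env 1": (0.0, 1.0), "env 2": (3.0, 2.0)}.items():
    X, Y = sample_env(mu, sd)
    print(f"{env}: causal fit E[Y|X] slope = {ols_slope(X, Y):.3f}   "
          f"anticausal fit E[X|Y] slope = {ols_slope(Y, X):.3f}")
```

The regression in the causal direction recovers the same slope in both environments, whereas the backward regression coefficient changes with the input distribution, consistent with the ICM-based reasoning above.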
These findings are intriguing since they provide insight into physical properties of learning problems, thus going beyond the methods and applications that machine learning studies usually provide. Subsequent developments include further theoretical analyses [67, 112] and a form of conditional semi-supervised learning [158]. The view of semi-supervised learning as exploiting dependences between a marginal p(x) and a non-causal conditional p(y | x) is consistent with the common assumptions employed to justify semi-supervised learning [17, 126].

We have discussed the shortcomings of the i.i.d. assumption, which rarely holds true exactly in practice, and the fact that real-world intelligent agents need to be able to generalize not just within a single i.i.d. setting, but across related problems. This notion has been termed out-of-distribution (o.o.d.) generalization, attracting significant attention in recent years [131]. While most work so far has been empirical, it would be desirable to have statistical bounds generalizing (2.4) that include additional quantities measuring the distance between training and test distribution and that incorporate meaningful assumptions [138]. Such assumptions are necessary [8], and could be causal, or related to invariance properties. The recent phenomenon of "adversarial vulnerability" [149] shows that minuscule targeted violations of the i.i.d. assumption, generated by adding suitably chosen noise to images (imperceptible to humans), can lead to dangerous errors such as confusion of traffic signs. These examples are compelling as they showcase a non-robustness of artificial systems which is not shared by human perception. Our own perception thus exhibits invariance or robustness properties that are not easily learned from a single training set.

Early causal work related to domain shift [130] looked at the problem of learning from multiple cause-effect datasets that share a functional mechanism but differ in noise distributions. More generally, given (data from) multiple distributions, one can try to identify components which are robust, and find means to transfer them across problems [166, 4, 163, 36, 55]. According to the ICM Principle 5.1, invariance of conditionals or functions (also referred to as covariate shift in simple settings) should only hold in the causal direction, a reversal of the impossibility described above for SSL. Building on the work of [130, 111], the idea of invariance for prediction has also been used for supervised learning [121, 3, 87]. In particular, "invariant risk minimization" (IRM) was proposed as an alternative to ERM, cf. (2.3).

In contrast to causal discovery ( § 7), which aims to uncover the causal structure underlying a set of variables, causal reasoning starts from a known (or postulated) causal graph and answers causal queries of interest. While causal discovery often looks for qualitative relationships, causal reasoning usually aims to quantify them. This requires two steps: (i) identifying the query, i.e., deriving an estimand for it that only involves observed quantities; and (ii) estimating this estimand using data. Often, the quantities of interest can be described as treatment effects, i.e., contrasts between two interventions.

Definition 9.1 (Treatment effects). The conditional average treatment effect (CATE), conditioned on (a subset of) features x, is defined as τ(x) := E[Y | do(T = 1), X = x] − E[Y | do(T = 0), X = x]. (9.1) The average treatment effect (ATE) is defined as the population average of the CATE, τ := E_X[τ(X)]. (9.2)

While ITE (Defn.
4.12) and CATE (9.1) are sometimes used interchangeably, there is a conceptual difference: the ITE refers to the difference of two potential outcomes (POs) and is thus bound to an individual, while the CATE applies to subpopulations, e.g., the CATE for females in their 40s. Since the ITE is fundamentally impossible to observe, it is often estimated by the CATE conditional on an individual's features x_i, using suitable additional assumptions.

As is clear from Defn. 9.1, the treatment effects we want to estimate involve interventional expressions. However, we usually only have access to observational data. Causal reasoning can thus be cast as answering interventional queries using observational data and a causal model. This involves dealing with confounders, both observed and unobserved. Before discussing how to identify and estimate causal effects, we illustrate why causal assumptions are necessary using a well-known statistical phenomenon.

Simpson's paradox refers to the observation that aggregating data across subpopulations may yield opposite trends (and thus lead to reversed conclusions) from considering subpopulations separately [143]. We observed a textbook example of this during the Covid-19 pandemic by comparing case fatality rates (CFRs), i.e., the proportion of confirmed Covid-19 cases which end in fatality, across different countries and age groups, as illustrated in Fig. 8 [154]: for all age groups, CFRs in Italy are lower than in China, but the total CFR in Italy is higher. How can such a pattern be explained? The case demographic (see Fig. 8, right) is rather different across the two countries, i.e., there is a statistical association between country and age. In particular, Italy recorded a much larger proportion of cases in older patients, who are generally at higher risk of dying from Covid-19 (see Fig. 8, left). While this provides a consistent explanation in a statistical sense, the phenomenon may still seem puzzling as it defies our causal intuition. Humans appear to naturally extrapolate conditional probabilities to read them as causal effects, which can lead to inconsistent conclusions and may leave one wondering: how can the disease in Italy be less fatal for the young, less fatal for the old, but more fatal for the people overall? It is for this reason that the reversal of (conditional) probabilities in Fig. 8 is perceived as and referred to as a "paradox" [106, 49]. A small numerical illustration of such a reversal is given below.

If we consider the country as a treatment whose causal effect on fatality is of interest, then causal assumptions (e.g., in the form of a causal graph) are needed to decide how to handle covariates such as age that are statistically associated with the treatment, e.g., whether to stratify by (i.e., adjust for) age or not. This also explains why randomized controlled trials (RCTs) [31] are the gold standard for causal reasoning: randomizing the assignment breaks any potential links between the treatment variable and other covariates, thus eliminating potential problems of bias. However, RCTs are costly and sometimes unethical to perform, so that causal reasoning often relies on observational data only. 10

Figure 9: Treatment effect estimation with three observed covariates X_1, X_2, X_3: here, the valid adjustment sets for T → Y (see Prop. 9.3) are {X_1}, {X_2}, and {X_1, X_2}. Including X_3 opens the non-directed path T → X_3 ← X_2 → Y; moreover, X_3 lies on the directed path T → X_3 → Y. Both can introduce bias.

We first consider the simplest setting without hidden confounders and with overlap.
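The reversal itself is elementary arithmetic. The following snippet uses purely hypothetical case counts, chosen only to mimic the qualitative pattern of Fig. 8 (they are not the actual Covid-19 numbers): within each age group, country A has the lower fatality rate, yet its aggregate rate is higher because most of its cases fall in the high-risk group.

```python
# Hypothetical (cases, deaths) per age group -- illustrative numbers only.
data = {
    "country A": {"young": (1_000, 1),   "old": (10_000, 1_000)},
    "country B": {"young": (10_000, 20), "old": (1_000, 150)},
}

for country, groups in data.items():
    total_cases = sum(c for c, _ in groups.values())
    total_deaths = sum(d for _, d in groups.values())
    by_group = ", ".join(f"{g}: {d / c:.1%}" for g, (c, d) in groups.items())
    print(f"{country}: {by_group}, overall: {total_deaths / total_cases:.1%}")
```

Country A has the lower fatality rate in both age groups, yet the higher overall rate, because its case mix is shifted towards the old.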
We start with identification of treatment effects on the population level, and then discuss different techniques for estimating these from data.

Identification In the absence of unmeasured variables (i.e., without hidden confounding), and provided we know the causal graph, it is straightforward to compute causal effects by adjusting for covariates. A principled approach to do so for any given graph was proposed by Robins [118] and is known as the g-computation formula (where the g stands for general). It is also known as the truncated factorisation [105] or manipulation theorem [145]. It relies on the independence of causal mechanisms (Principle 5.1), i.e., the fact that intervening on a variable leaves the other causal conditionals in (4.1) unaffected: p(X_1, . . . , X_n | do(X_j = x̂_j)) = ∏_{i ≠ j} p(X_i | PA_i) if X_j = x̂_j, and 0 otherwise. (9.3) The interventional distribution of interest can then be obtained by marginalization. This is related to the idea of graph surgery (see Fig. 5), and leads to a set of three inference rules for manipulating interventional distributions known as do-calculus [105] that have been shown to be complete for identifying causal effects [56, 141]. Note that covariate adjustment may be needed even if there are no clear confounders directly influencing both treatment and outcome, as shown by the example in Fig. 9.

Example 9.2. Applying the g-computation formula (9.3) to the setting of Fig. 9, marginalizing over the covariates, and simplifying the resulting expression using the conditional independences implied by the graph, one finds that the interventional distribution can be written in the form p(Y | do(T = t)) = Σ_z p(Y | T = t, Z = z) p(Z = z) (9.7) for suitable subsets Z of the observed covariates. In this case we call Z a valid adjustment set for the effect of T on Y. Here, {X_1}, {X_2}, and {X_1, X_2} are all valid adjustment sets, but it can be shown that, e.g., {X_1, X_3} is not (see Fig. 9).

As computing the g-formula with many covariates can be cumbersome, graphical criteria for which subsets constitute valid adjustment sets are useful in practice, even in the absence of unobserved confounders.

Proposition 9.3 ([142]). Under causal sufficiency, a set Z is a valid adjustment set for the causal effect of a singleton treatment T on an outcome Y (in the sense of (9.7)) if and only if the following two conditions hold: (i) Z contains no descendant of any node on a directed path from T to Y (except for descendants of T which are not on a directed path from T to Y); and (ii) Z blocks all non-directed paths from T to Y.

Here, a path is called directed if all directed edges on it point in the same direction, and non-directed otherwise. A path is blocked (by a set of vertices Z) if it contains a triple of consecutive nodes connected in one of the following three ways: A → B → C or A ← B ← C (a chain) with B ∈ Z; A ← B → C (a fork) with B ∈ Z; or A → B ← C (a collider) such that neither B nor any of its descendants is in Z. Two well-known types of adjustment set implied by Prop. 9.3 are parent adjustment, where Z = PA_T; and the backdoor criterion, where Z is constrained to contain no descendants of T and to block all "back-door paths" from T to Y (paths of the form T ← . . . Y). Note that Prop. 9.3 only holds for singleton treatments (i.e., interventions on a single variable). For treatments T involving multiple variables, a slightly more complicated version of Prop. 9.3 can be given in terms of proper causal paths, and we refer to [103, 109] for details.

Let us briefly return to our earlier example of Simpson's paradox and Covid-19. Considering a plausible causal graph for this setting [154], we find that age A acts as a mediator C → A → F of the causal effect of country C on fatality F (there is likely also a direct effect C → F, potentially mediated by other, unobserved variables).
If we are interested in the (total) causal effect of C on F (i.e., the overall influence of country on fatality), A should not be included for adjustment according to Prop. 9.3, and, subject to causal sufficiency, the total CFRs can be interpreted causally. 11 For another classic example of Simpson's paradox in the context of kidney stone treatment [18], on the other hand, the size of the stone acts as a confounder and thus needs to be adjusted for to obtain sound causal conclusions.

Valid covariate adjustment and the g-formula tell us how to compute interventions from the observational distribution when there are no hidden confounders. To actually identify causal effects from data, however, we also need to be able to estimate the quantities involved in (9.7). This is a problem if a subgroup of the population never (or always) receives a certain treatment. We thus need the additional assumption of a non-zero probability of receiving each possible treatment, referred to as overlap, or common support.

Assumption 9.4 (Overlap/common treatment support). For any treatment t and any configuration of features x, it holds that 0 < p(T = t | X = x) < 1.

The combination of overlap and ignorability (i.e., no hidden confounders; see Assumption 4.15) is also referred to as strong ignorability and is a sufficient condition for identifying ATE and CATE: the absence of hidden confounders guarantees the existence of a valid adjustment set Z ⊆ X for which p(Y | do(T = t), Z) = p(Y | T = t, Z), and overlap guarantees that we can actually estimate the latter term for any z occurring with non-zero probability. 12

Regression Adjustment Having identified a valid adjustment set (using Prop. 9.3), regression adjustment works by fitting a regression function f̂ to E[Y | Z = z, T = t] = f(z, t) using an observational sample {(y_i, t_i, z_i)}_{i=1}^m. We can then use f̂ to impute counterfactual outcomes as ŷ_i^CF = f̂(z_i, 1 − t_i) in order to estimate the CATE. The ATE is then given by the population average of the CATE and can be estimated, e.g., by averaging the differences between factual and imputed counterfactual outcomes, τ̂ = (1/m) Σ_{i=1}^m (2t_i − 1)(y_i − ŷ_i^CF). (9.8) Note the difference to the RCT estimator, where no adjustment is necessary: τ̂_RCT = (1/m_1) Σ_{i: t_i = 1} y_i − (1/m_0) Σ_{i: t_i = 0} y_i, (9.9) where m_1 and m_0 are the number of observations from the treatment and control groups, respectively.

Matching and Weighting Approaches While regression adjustment indirectly estimates the ATE via the CATE, matching and weighting approaches aim to estimate the ATE directly. The general idea is to emulate the conditions of an RCT as well as possible. Matching approaches work by splitting the population into subgroups based on feature similarity. This can be done on an individual level (so-called one-to-one or nearest neighbor matching) by matching each individual i with the most similar one, j(i), from the opposite treatment group (i.e., t_i ≠ t_{j(i)}). The difference of their outcomes, y_i − y_{j(i)}, is then considered as a sample of the ATE, and their average taken as an estimate thereof, τ̂_match = (1/m) Σ_{i=1}^m (2t_i − 1)(y_i − y_{j(i)}). (9.10) Alternatively, the population can be split into larger subgroups with similar features (so-called strata). Each stratum is then treated as an independent RCT. If there are K strata containing m_1, . . . , m_K observations each, the stratified ATE estimator is τ̂_strat = Σ_{k=1}^K (m_k / m) τ̂_RCT^(k), (9.11) where τ̂_RCT^(k) is the estimator from (9.9) applied to the observations in the k-th stratum.

Weighting approaches, on the other hand, aim to counteract the confounding bias by reweighting each observation to make the population more representative of an RCT. This means that underrepresented treatment groups are upweighted and overrepresented ones downweighted.
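Before turning to the weighting estimator described next, here is a minimal sketch of the two estimators just discussed, regression adjustment with imputed counterfactuals and one-to-one nearest neighbor matching (our own illustration on simulated data with a single observed confounder; the data-generating process and all constants are arbitrary, and the true ATE is 2 by construction).

```python
import numpy as np

rng = np.random.default_rng(4)
m = 4_000

# Simulated observational data with a single observed confounder Z.
Z = rng.standard_normal(m)
p_treat = 1 / (1 + np.exp(-2 * Z))               # confounding: Z influences treatment...
T = rng.binomial(1, p_treat)
Y = 2.0 * T + 1.5 * Z + rng.standard_normal(m)   # ...and outcome; true ATE = 2

# Naive difference in means (biased: treated units tend to have larger Z).
naive = Y[T == 1].mean() - Y[T == 0].mean()

# Regression adjustment: fit E[Y | Z, T] and impute counterfactual outcomes.
A = np.column_stack([T, Z, np.ones(m)])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
f_hat = lambda z, t: coef[0] * t + coef[1] * z + coef[2]
tau_reg = np.mean(f_hat(Z, 1) - f_hat(Z, 0))

# One-to-one nearest neighbor matching on Z.
idx_t, idx_c = np.where(T == 1)[0], np.where(T == 0)[0]
def match_effect(i):
    opposite = idx_c if T[i] == 1 else idx_t
    j = opposite[np.argmin(np.abs(Z[opposite] - Z[i]))]
    return (2 * T[i] - 1) * (Y[i] - Y[j])
tau_match = np.mean([match_effect(i) for i in range(m)])

print(f"naive difference in means:  {naive:.2f}")
print(f"regression adjustment:      {tau_reg:.2f}")
print(f"nearest neighbor matching:  {tau_match:.2f}")
```

Both adjusted estimators recover a value close to 2, while the naive difference in group means is biased upwards because treated units tend to have larger values of the confounder.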
An example of such a weighting approach is the inverse probability weighting (IPW) estimator, τ̂_IPW = (1/m) Σ_{i=1}^m [ t_i y_i / p(T = 1 | z_i) − (1 − t_i) y_i / (1 − p(T = 1 | z_i)) ]. (9.12) The treatment probability p(T = 1 | Z) is also known as the propensity score. While from a theoretical point of view Z should be a valid adjustment set, practitioners sometimes use all covariates to construct a propensity score.

Propensity Score Methods To overcome the curse of dimensionality and gain statistical efficiency in high-dimensional, low-data regimes, propensity scores can be a useful tool, because covariates and treatment are rendered conditionally independent, T ⊥⊥ Z | s(Z), by the propensity score s(z) := p(T = 1 | Z = z) [122]. Rather than adjusting for large feature sets or performing matching in high-dimensional spaces, one can thus work with the scalar propensity score instead. Applying this idea to the above methods gives rise to propensity score adjustment and propensity score matching. For the latter, the difference in propensity scores is used as a similarity measure between instances to find nearest neighbors or to define strata. While this simplifies matters in one respect, the propensity score itself needs to be estimated from data, which introduces an additional source of error. The standard approach for this is to estimate s(z) by logistic regression, but more sophisticated methods are also possible. However, propensity score methods still rely on having identified a valid adjustment set Z to give unbiased results. Using all covariates to estimate s, without checking for validity as an adjustment set, can thus lead to wrong results.

Next, we consider the case of causal reasoning with unobserved confounders. While it is not possible to identify causal effects in the general case, we will discuss two particular situations in which the ATE can still be estimated. These are shown in Fig. 10a and b.

Front-Door Adjustment The first situation in which identification is possible even though a hidden variable H confounds the effect between treatment and outcome is known as front-door adjustment. The corresponding causal graph is shown in Fig. 10a. Front-door adjustment relies on the existence of an observed variable M which blocks all directed paths from T to Y, so that T only causally influences Y through M. For this reason M is also called a mediator. The other important assumption is that the hidden confounder does not influence the mediator other than through the treatment T, i.e., M ⊥⊥ H | T. In this case, and provided p(t, m) > 0 for all t and m, the causal effect of T on Y is identifiable and is given by the following.

Proposition 9.5 (Front-door adjustment). For the causal graph in Fig. 10a it holds that p(y | do(t)) = Σ_m p(m | t) Σ_{t'} p(y | m, t') p(t'). (9.13)

We give a sketch of the derivation, and refer to [105] for a proof using the rules of do-calculus. Since M mediates the causal effect of T on Y, we have that p(y | do(t)) = Σ_m p(m | do(t)) p(y | do(m)). (9.14) Since there are no backdoor paths from T to M, we have p(m | do(t)) = p(m | t). Moreover, {T} is a valid adjustment set for the effect of M on Y by Prop. 9.3, so p(y | do(m)) = Σ_{t'} p(y | m, t') p(t'). (9.15) Substituting into (9.14) then yields expression (9.13). We point out that the setting presented here is only the simplest form of front-door adjustment, which is sufficient to convey the main idea. It can be amended to include observed covariates X as well, as long as the conditions on the mediator remain satisfied.

Instrumental Variables (IV) The second setting for causal reasoning with hidden confounders is based on the idea of instrumental variables [2, 29, 160], see Fig. 10b. The IV approach relies on the existence of a special observed variable I, called an instrument, which (i) is independent of the hidden confounder H, (ii) is not independent of the treatment T, and (iii) affects the outcome Y only through T.
Condition (i) states that the instrument is independent of the hidden confounder. Since this assumption cannot be tested, background knowledge is necessary to justify the use of a variable as an IV in practice. Conditions (ii) and (iii) state that the instrument is correlated with the treatment, and only affects the outcome through T; they are referred to as relevance and exclusion restriction, respectively. Given a valid IV, we apply a two-stage procedure: first, obtain an estimate T̂ of the treatment variable T that is independent of H by predicting T from I. Having thus created an unconfounded version of the treatment, a regression of Y on T̂ then reveals the correct causal effect. This idea can be demonstrated for a simple linear model with a continuous treatment variable, where the causal effect can be obtained by two-stage least squares (2SLS). IVs have been studied extensively, and more sophisticated versions than this simple linear setting exist, allowing for non-linear interactions and observed covariates.

Having discussed some special settings to deal with hidden confounding, we briefly present a technique to deal with violations of the overlap assumption.

Regression Discontinuity Design In a regression discontinuity design (RDD), the treatment assignment mechanism behaves like a threshold function, i.e., the propensity score is discontinuous [60]. In the simplest setting, the assignment of treatment or control is determined by whether an observed score S is above a threshold s_0, T := 𝟙{S ≥ s_0}. This score in turn depends on other covariates which may or may not be observed. For example, patients may be assigned a risk score, and treatment is only prescribed if this score surpasses a given threshold. Since the score may be assigned by another institution, not all relevant covariates H are usually observed. However, it is assumed that the treatment decision only depends on the score, e.g., because doctors comply with the official rules. The causal graph for such a simple RDD setting is shown in Fig. 10c. While the score S constitutes a valid adjustment set in principle, the problem with RDDs is the lack of overlap: patients with low scores are always assigned T = 0 and patients with high scores are always assigned T = 1. Because of this, covariate adjustment, matching, or weighting approaches do not apply. The general idea of an RDD is to overcome this challenge by comparing observations with scores in a small neighborhood of the decision cut-off value s_0, motivated by the consideration that patients with close scores but on opposite sides of s_0 differ only in whether they received the treatment or not. For example, if the treatment cut-off value is 0.5 for a score in [0,1], then patients with scores of 0.49 and 0.51 are comparable and can be treated as samples from an RCT. An RDD (in its simplest form) thus focuses on the difference of the regression function E[Y | S = s] just above and just below the threshold s_0, which under suitable continuity assumptions reflects the treatment effect at the cut-off.

We conclude this section with a real-world application performing causal reasoning in a confounded additive noise model. Launched in 2009, NASA's Kepler space telescope initially observed 150,000 stars over four years, in search of exoplanet transits. These are events where a planet partially occludes its host star, causing a slight decrease in brightness, often orders of magnitude smaller than the influence of telescope errors. When looking at stellar light curves, we noticed that the noise structure was often shared across stars that were light years apart. Since this ruled out a direct interaction of the stars, it was clear that the shared information was due to the telescope acting as a confounder.
We thus devised a method that (a) regresses a given star of interest on a large set of other stars, chosen such that their measurements contain no information about the star's astrophysical signal, and (b) subtracts the resulting prediction in order to cancel the telescope's influence. 13 The method is called "half-sibling" regression, since target and predictors share a parent, namely the telescope. The method recovers the random variable representing the astrophysical signal almost surely (up to a constant offset), for an additive noise model (specifically, the observed light curve is a sum of the unknown astrophysical signal and an unknown function of the telescope noise), subject to the assumption that the telescope's effect on the star is in principle predictable from the other stars [128].

In 2013, the Kepler spacecraft suffered a technical failure, which left it with only two functioning reaction wheels, insufficient for the precise spatial orientation required by the original Kepler mission. NASA decided to use the remaining fuel to make further observations; however, the systematic error was significantly larger than before, a godsend for our method, designed to remove exactly these errors. We augmented it with models of exoplanet transits and an efficient way to search light curves, leading to the discovery of 36 planet candidates [32], of which 21 were subsequently validated as bona fide exoplanets [92]. Four years later, astronomers found traces of water in the atmosphere of the exoplanet K2-18b, the first such discovery for an exoplanet in the habitable zone, i.e., one allowing for liquid water [10, 152]. The planet turned out to be one that had been first detected in our work [32] (exoplanet candidate EPIC 201912552).

Conservation of Information We have previously argued that the mechanization of information processing currently plays a role similar to that which the mechanization of energy processing played in earlier industrial revolutions [126]. Our present understanding of information is rather incomplete, as was the understanding of energy during the course of the first two industrial revolutions. The profound modern understanding of energy came with Emmy Noether and the insight that energy conservation is due to a symmetry (or covariance) of the fundamental laws of physics: they look the same no matter how we shift time. One might argue that information, suitably conceptualized, should also be a conserved quantity, and that this might also be a consequence of symmetries. The notions of invariance/independence discussed above may be able to play a role in this respect. Mass seemingly played two fundamentally different roles (inertia and gravitation) until Einstein furnished a deeper connection in general relativity. It is noteworthy that causality introduces a layer of complexity underlying the symmetric notion of statistical mutual information. Discussing source coding and channel coding, Shannon [139] remarked: "This duality can be pursued further and is related to a duality between past and future and the notions of control and knowledge. Thus we may have knowledge of the past but cannot control it; we may control the future but have no knowledge of it."

What is an Object? Following the i.i.d. pattern recognition paradigm, machine learning learns objects by extracting patterns from many observations. A complementary view may consider objects as modules that can be separately manipulated or intervened upon [150].
The idea that objects are defined by their behavior under transformation has been influential in fields ranging from psychology to mathematics [74, 88].

Causal Representation Learning. In hindsight, it appears somewhat naive that first attempts to build AI tried to realize intelligence by programs written by humans, since existing examples of intelligent systems appear much too complex for that. However, there is a second problem, which is just as significant: classic AI assumed that the symbols which were the basis of algorithms were provided a priori by humans. When building a chess program, it is clear that the algorithms operate on chess board positions and chess pieces; however, if we want to solve a real-world problem in an unstructured environment (e.g., recognize spoken language), it is not clear what constitutes the basic symbols to be processed. Traditional causal discovery and reasoning assumed that the elementary units are random variables connected by a causal graph. Real-world observations, however, are usually not structured into such units to begin with. For instance, objects in images that permit causal reasoning first need to be discovered [85, 157, 150, 84]. The emerging field of causal representation learning strives to learn these variables from data, much like machine learning went beyond symbolic AI in not requiring that the symbols that algorithms manipulate be given a priori (see Fig. 11).

Figure 11: Causal representation learning aims to automatically learn representations that contain not just statistical information, but support interventions, reasoning, and planning. The long-term goal of this field is to learn causal world models supporting AI, or causal digital twins of complex systems. Classic AI: symbols provided a priori; rules provided a priori. Machine learning: representations (symbols) learned from data; only include statistical information. Causal modeling: structural causal models assume the causal variables (symbols) are given. Causal representation learning: capture interventions, reasoning, planning ("Thinking is acting in an imagined space", Konrad Lorenz).

Defining objects or variables, and structural models connecting them, can sometimes be achieved by coarse-graining of microscopic models, including microscopic SCMs [124], ordinary differential equations [123], and temporally aggregated time series [37]. While most causal models in economics, medicine, or psychology use variables that are abstractions of more elementary concepts, it is challenging to state general conditions under which coarse-grained variables admit causal models with well-defined interventions [15, 124, 16]. The task of identifying suitable units that admit causal models aligns with the general goal of modern machine learning to learn meaningful representations for data, where meaningful can mean robust, transferable, interpretable, explainable, or fair [77, 72, 162, 71, 70, 155]. To combine structural causal modeling (Defn. 4.4) and representation learning, we may try to devise machine learning models whose inputs may be high-dimensional and unstructured, but whose inner workings are (partly) governed by an SCM. Suppose that our high-dimensional, low-level observations X = (X_1, ..., X_d) are explained by a small number of unobserved, or latent, variables S = (S_1, ..., S_n) where n ≪ d, in that X is generated by applying an injective map g : R^n → R^d to S (see Fig. 12c): X = g(S). (10.1)
A common assumption regarding (10.1) is that the latent S_i are jointly independent, e.g., for independent component analysis (ICA) [57] (where g is referred to as a mixing) or disentangled representation learning [9] (where g is called a decoder). Presently, however, we instead want to think of the latent S_i as causal variables that support interventions and reasoning. The S_i may thus well be dependent, and possess a causal factorization (4.1), p(S_1, . . . , S_n) = ∏_{i=1}^n p(S_i | PA_i), (10.2) induced by an underlying (acyclic) SCM M = (F, p_U) with jointly independent U_i and structural assignments S_i := f_i(PA_i, U_i). (10.3)

Our goal is to learn a latent causal model consisting of (i) the causal representation S = g^{−1}(X), along with (ii) the corresponding causal graph and (iii) the mechanisms p(S_i | PA_i) or f_i. This is a challenging task, since none of them are directly observed or known a priori; instead, we typically only have access to observations of X. In fact, there is no hope in an i.i.d. setting, since already the simpler case with independent S_i (and n = d) is generally not identifiable (i.e., for arbitrary nonlinear g in (10.1)): even independence does not sufficiently constrain the problem to uniquely recover, or identify, the true S_i's up to any simple class of ambiguities such as permutations and element-wise invertible transformations of the S_i [58].

Figure 12: Overview of different causal learning tasks: (a) causal discovery ( § 7) aims to learn the causal graph (or SCM) connecting a set of observed variables; (b) causal reasoning ( § 9) aims to answer interventional or counterfactual queries based on a (partial) causal model over observed variables X_i; (c) causal representation learning ( § 10) aims to infer a causal model consisting of a small number of high-level, abstract causal variables S_i and their relations from potentially high-dimensional, low-level observations X = g(S).

To link causal representation learning to the well-studied ICA setting with independent latents in (10.1), we can consider the so-called reduced form of an (acyclic) SCM: by recursive substitution of the structural assignments (10.3) in topological order of the causal graph, we can write the latent causal variables S as a function of the noise variables only, S = f_RF(U). (10.4) Due to acyclicity, this mapping f_RF : R^n → R^n has a lower-triangular Jacobian (possibly after reordering the S_i w.l.o.g.). However, (10.4) is strictly less informative than (10.3): while they entail the same distribution (10.2), the former no longer naturally supports interventions on the S_i but only changes to the noise distribution p_U (an example of a so-called soft intervention [30]). At the same time, the reduced form (10.4) allows us to rewrite (10.1) as X = g(f_RF(U)) = (g ∘ f_RF)(U). (10.5) Through this lens, the task of learning the reduced form (10.4) could be seen as a structured form of nonlinear ICA (i.e., (10.1) with independent latents) where we additionally want to learn an intermediate representation through f_RF. However, as discussed, we cannot even solve the problem with independent latents (i.e., identify g ∘ f_RF in (10.5)) [58], let alone separate the SCM and mixing functions to recover the intermediate causal representation. It is thus not surprising that it is not possible to solve the strictly harder causal representation learning problem in an i.i.d. setting, and that additional causal learning signals are needed. This gives rise to the following questions: How can we devise causal training algorithms to learn the S_i?
And, what types of additional data, assumptions, and constraints would these algorithms require beyond the i.i.d. setting? Two general ideas are to (i) build on the ICM Principle 5.1 and enforce some form of (algorithmic) independence between the learned causal mechanisms p(S_i | PA_i) or f_i, and (ii) use heterogeneous (non-i.i.d.) data, e.g., from multiple views or different environments, arising from interventions in the underlying latent SCM (10.3). We briefly discuss some more concrete ideas based on recent work.

Generative Approach: Causal Auto-Encoders. One approach is to try to learn the generative causal model (10.1) and (10.3), or its reduced form (10.4), using an auto-encoder [73]. An auto-encoder consists of an encoder function q : R^d → R^n which maps X to a latent "bottleneck" representation (e.g., comprising the unexplained noise variables U), and a decoder function ĝ : R^n → R^d mapping back to the observations. For example, the decoder may directly implement the composition ĝ = g ∘ f_RF from (10.4). Alternatively, it could consist of multiple modules, implementing (10.1) and (10.3) separately. A standard procedure to train such an auto-encoder architecture is to minimise the reconstruction error, i.e., to satisfy ĝ ∘ q ≈ id on a training set of observations of X. As discussed, this alone is insufficient, so to make it causal we can impose additional constraints on the structure of the decoder [80] and try to make the causal mechanisms independent by ensuring that they are invariant across problems and can be independently intervened upon. For example, if we intervene on the causal variables S_i or the noise distribution p_U in our model of (10.3) or (10.4), respectively, this should still produce "valid" observations, as assessed, e.g., by the discriminator of a generative adversarial network [38]. While we ideally want to manipulate the causal variables, another way to intervene is to replace noise variables with the corresponding values computed from other input images, a procedure that has been referred to as hybridization [11]. Alternatively, if we have access to multiple environments, i.e., datasets collected under different conditions, we could rely on the Sparse Mechanism Shift Principle 5.2 by requiring that changes can be explained by shifts in only a few of the p(S_i | PA_i).

Discriminative Approach: Self-Supervised Causal Representation Learning. A different machine learning approach to unsupervised representation learning, which is not based on generative modeling but is discriminative in nature, is self-supervised learning with data augmentation. Here, the main idea is to apply some hand-crafted transformations to the observation to generate augmented views that are thought to share the main semantic characteristics with the original observation (e.g., random crops or blurs for images). One then directly learns a representation by maximizing the similarity across encodings of views related to each other by augmentations, while enforcing diversity across those of unrelated views. In recent work [156], we set out to better understand this approach theoretically, as well as to investigate its potential for learning causal representations.
Starting from (10.1), we postulate a latent causal model of the form S_c → S_s, where S_c is a (potentially multivariate) content variable, defined as the high-level semantic part of the representation S = (S_c, S_s) that is assumed invariant across views; and S_s is a (potentially multivariate) style variable, defined as the remaining part of the representation that may change. Within this setting, data augmentations have a natural interpretation as counterfactuals under a hypothetical intervention on the style variables, given the original view. It can be shown that in this case, subject to some technical assumptions, common contrastive self-supervised learning algorithms [19, 99, 45] as well as appropriately constrained generative models isolate, or recover, the true content variables S_c up to an invertible transformation. By extending this approach to use multiple augmented views of the same observation, and linking these to different counterfactuals in the underlying latent SCM, it may be possible to recover a more fine-grained causal representation.

Independent Mechanism Analysis. We also explored [40] to what extent the ICM Principle 5.1 may be useful for unsupervised representation learning tasks such as (10.1), particularly for imposing additional constraints on the mixing function g. It turns out that independence between p(S) and the mixing g (measured, e.g., as discussed in § 5 in the context of Fig. 6 and [64]) does not impose nontrivial constraints when S is not observed, even when the S_i are assumed independent as in ICA. However, by thinking of each S_i as independently influencing the observed distribution, we postulate another type of independence between the partial derivatives ∂g/∂S_i of the mixing g, which has a geometric interpretation as an orthogonality condition on the columns of the Jacobian of g. The resulting independent mechanism analysis (IMA) approach rules out some of the common examples of non-identifiability of nonlinear ICA [58, 82] mentioned above. Since IMA does not require independent sources, it may also be a useful constraint for causal representation learning algorithms.

Learning Transferable Mechanisms and Multi-Task Learning Machine learning excels in i.i.d. settings, and through the use of high-capacity learning algorithms we can achieve outstanding performance on many problems, provided we have i.i.d. data for each individual problem ( § 2). However, natural intelligence excels at generalizing across tasks and settings. Suppose we want to build a system that can solve multiple tasks in multiple environments. If we view learning as data compression, it would make sense for that system to utilize components that apply across tasks and environments, and thus need to be stored only once [126]. Indeed, an artificial or natural agent in a complex world is faced with limited resources. This concerns training data, i.e., we only have limited data for each individual task/domain, and thus need to find ways of pooling/re-using data, in stark contrast to the current industry practice of large-scale labelling work done by humans. It also concerns computational resources: animals have constraints on the resources (e.g., space, energy) used by their brains, and evolutionary neuroscience knows examples where brain regions get re-purposed. Similar constraints apply as machine learning systems get embedded in physical devices that may be small and battery-powered.
Versatile AI models that robustly solve a range of problems in the real world will thus likely need to re-use components, which requires that the components are robust across tasks and environments [129, 131]. This calls for a structure whose modules are maximally reusable. An elegant way to do this would be to employ a modular structure that mirrors the modularity that exists in the world. In other words, if the mechanisms at play in the world play similar roles across a range of environments, tasks, and settings, then it would be prudent for a model to employ corresponding computational modules [39]. For instance, if variations of natural lighting (the position of the sun, clouds, etc.) imply that the visual environment can appear in brightness conditions spanning several orders of magnitude, then visual processing algorithms in our nervous system should employ methods that can factor out these variations, rather than building separate sets of object recognizers for every lighting condition. If our brain were to model the lighting changes by a gain control mechanism, say, then this mechanism in itself need not have anything to do with the physical mechanisms bringing about brightness differences. It would, however, play a role in a modular structure that corresponds to the role the physical mechanisms play in the world's modular structure; in other words, it would represent the physical mechanism. Searching for the most versatile yet compact models would then automatically produce a bias towards models that exhibit certain forms of structural isomorphy to a world that we cannot directly recognize. A sensible inductive bias to learn such models is to look for independent causal mechanisms [83], and competitive training can play a role in this: for a pattern recognition task, learning causal models that contain independent mechanisms helps in transferring modules across substantially different domains [100].

Interventional World Models, Surrogate Models, Digital Twins, and Reasoning Modern representation learning excels at learning representations of data that preserve relevant statistical properties [9, 79]. It does so, however, without taking into account causal properties of the variables, i.e., it does not care about the interventional properties of the variables it analyzes or reconstructs. Going forward, causality will play a major role in taking representation learning to the next level, moving beyond the representation of statistical dependence structures towards models that support intervention, planning, and reasoning. This would realize Konrad Lorenz' notion of thinking as acting in an imagined space. It would also provide a means to learn causal digital twins that go beyond reproducing statistical dependences captured by surrogate models trained using machine learning.

The idea of surrogate modeling is that we may have a complex phenomenon for which we have access to computationally expensive simulation data. If the mappings involved (e.g., from parameter settings to target quantities) can be fitted from data, we can employ machine learning, which will often speed them up by orders of magnitude. Such a speed-up can qualitatively change the usability of a model: for instance, we have recently built a system to map gravitational wave measurements to a probability distribution of physical parameters of a black hole merger event, including sky position [27].
The fact that this model only requires seconds to evaluate makes it possible to immediately start electromagnetic follow-up observations using telescopes as soon as a gravitational wave event has been detected, enabling the analysis of transient events.

Going forward, we anticipate that surrogate modeling will benefit from respecting the causal factorization (4.1) decomposing the overall dependence structure into mechanisms (i.e., causal conditionals). We can then build an overall model of a system by modeling the mechanisms independently, each of them using the optimal method. Some of the conditionals we may know analytically, some we may be able to transfer from related problems, if they are invariant. For some, we may have access to real data to estimate them, and for others, we may need to resort to simulations, possibly fitted using surrogate models. If the model is required to fully capture the effects of all possible interventions, then all components should be fitted as described in the causal direction (i.e., we fit the causal mechanisms). Such a model then allows us to employ all the causal reasoning machinery described in § 4 and § 9 (e.g., computing interventional and, in the case of SCMs, counterfactual distributions). If, on the other hand, a model only needs to capture some of the possible interventions, and is used in a purely predictive/observational mode for other variables, then we can get away with also using and fitting some non-causal modules, i.e., using a decomposition which lies in between (4.1) and (4.2). We believe that this overall framework will be a principled and powerful approach to building such (causal) digital twins or causal surrogate models by combining a range of methods and bringing them to bear according to their strengths.

Concluding Remarks. Most of the discussed fields are still in their infancy, and the above account is biased by personal taste and knowledge. With the current hype around machine learning, there is much to say in favor of some humility towards what machine learning can do, and thus towards the current state of AI: the hard problems have not been solved yet, making basic research in this field all the more exciting.
References
Autonomy
Identification of causal effects using instrumental variables
Transportability from multiple environments with limited experiments: Completeness results
Causal inference and the data-fusion problem
The arrow of time in multivariate time series
Reconciling modern machine learning practice and the bias-variance trade-off
Impossibility theorems for domain adaptation
Representation learning: A review and new perspectives
Counterfactuals uncover the modular structure of deep generative models
Group invariance principles for causal generative models
Foundations of structural causal models with cycles and latent variables
The power of possibility: Causal learning, counterfactual reasoning, and pretend play
Multi-level cause-effect systems
Causal feature learning: an overview
Semi-supervised learning
Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy
A simple framework for contrastive learning of visual representations
Learning bayesian networks is np-complete
Optimal structure identification with greedy search
A bayesian method for the induction of probabilistic networks from data
Planning of experiments
Inferring deterministic causal relations
Conditional independence in statistical theory
Causal inference without counterfactuals
Real-time gravitational-wave science with neural posterior estimation
A probabilistic theory of pattern recognition
Assumptions of IV methods for observational epidemiology
Interventions and causal inference
The design of experiments
A systematic search for transiting planets in the K2 data
Kernel measures of conditional dependence
Learning gaussian networks
Logical and algorithmic properties of independence and their application to Bayesian networks
Domain adaptation with conditional transferable components
Causal discovery from temporally aggregated time series
Generative adversarial nets
Recurrent independent mechanisms
Independent mechanism analysis, a new concept?
A kernel method for the two-sample-problem
Measuring statistical dependence with Hilbert-Schmidt norms
A kernel statistical test of independence
Kernel methods for measuring independence
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
The probability approach in econometrics
Learning bayesian networks: The combination of knowledge and statistical data
A bayesian approach to causal discovery
The Simpson's paradox unraveled
Beta-VAE: Learning basic visual concepts with a constrained variational framework
Statistics and causal inference
Causality in macroeconomics
Nonlinear causal discovery with additive noise models
Causal discovery from heterogeneous/nonstationary data
Behind distribution shift: Mining driving forces of changes and causal arrows
Pearl's calculus of intervention is complete
Independent component analysis: algorithms and applications
Nonlinear independent component analysis: Existence and uniqueness results
Nonlinear ica using auxiliary variables and generalized contrastive learning
Regression discontinuity designs: A guide to practice
Causal inference in statistics, social, and biomedical sciences
Algorithmic independence of initial condition and dynamical law in thermodynamics and causal inference
Telling cause from effect based on high-dimensional observations
Information-geometric approach to inferring causal directions
Identifying confounders using additive noise models
Causal inference using the algorithmic Markov condition
Semi-supervised interpolation in an anticausal learning scenario
Detecting non-causal artifacts in multivariate linear regression models
Causal direction of data collection matters: Implications of causal and anticausal learning for nlp
Algorithmic recourse: from counterfactual explanations to interventions
Algorithmic recourse under imperfect causal knowledge: a probabilistic approach
Avoiding discrimination through causal reasoning
Auto-encoding variational Bayes
Vergleichende Betrachtungen über neuere geometrische Forschungen. Verlag von Andreas Deichert
Probabilistic graphical models: principles and techniques
Consistency of causal inference under the additive noise model
Counterfactual fairness
Graphical models
Deep learning
Structural autoencoders improve representations for generation and transfer
Discours de métaphysique. 1686
Challenging common assumptions in the unsupervised learning of disentangled representations
Competitive training of mixtures of independent deep generative models
Object-centric learning with slot attention
Discovering causal signals in images
Über den Zustand des Wärmegleichgewichtes eines Systems von Körpern mit Rücksicht auf die Schwerkraft
Nonlinear invariant risk minimization: A causal approach
Categories for the working mathematician
Storks deliver babies (p= 0.008)
Causal inference and causal explanation with background knowledge
Chocolate consumption, cognitive function, and nobel laureates
Stellar and planetary properties of K2 campaign 1 candidates and validation of 17 planets, including a planet receiving earth-like insolation
Causal discovery with general non-linear relationships using non-linear ica
On causal discovery with cyclic additive noise models
Regression by dependence minimization and its application to causal inference
From ordinary differential equations to structural causal models: The deterministic case
Distinguishing cause from effect using observational data: methods and benchmarks
On the application of probability theory to agricultural experiments. Essay on principles. Section 9
Representation learning with contrastive predictive coding
Learning independent causal mechanisms
A measure-theoretic approach to kernel conditional mean embeddings
Bayesian networks: A model of self-activated memory for evidential reasoning
Causal diagrams for empirical research
Direct and indirect effects
Causality: Models, reasoning, and inference
Comment: understanding simpson's paradox
External validity: From do-calculus to transportability across populations
The book of why: the new science of cause and effect
Confounding equivalence in causal inference
A theory of inferred causation
Causal inference by using invariant prediction: identification and confidence intervals
Elements of causal inference - foundations and learning algorithms
Identifiability of causal graphs using functional models
Causal discovery with continuous additive noise models
Kernel-based tests for joint independence
The logic of scientific discovery
The direction of time
A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect
Marginal structural models and causal inference in epidemiology
Counting labeled acyclic digraphs, new directions in the theory of graphs (proc. third ann arbor conf., univ. michigan, ann arbor, mich
Invariant models for causal transfer learning
The central role of the propensity score in observational studies for causal effects
From deterministic ODEs to dynamic structural causal models
Causal consistency of structural equation models
Estimating causal effects of treatments in randomized and nonrandomized studies
Probabilistic and Causal Inference: The Works of Judea Pearl
A generalized representer theorem
Modeling confounding by half-sibling regression
Causal and statistical learning
On causal and anticausal learning
Toward causal representation learning
Computing functions of random variables via reproducing kernel Hilbert space representations
Learning with kernels
RKHS representation of measures applied to homogeneity, independence, and Fourier optics
Estimating the dimension of a model
The hardness of conditional independence testing and the generalised covariance measure
Telling cause from effect in deterministic linear dynamical systems
Estimating individual treatment effect: generalization bounds and algorithms
Coding theorems for a discrete source with a fidelity criterion
A linear non-Gaussian acyclic model for causal discovery
Identification of joint interventional distributions in recursive semi-markovian causal models
On the validity of covariate adjustment for estimating causal effects
The interpretation of interaction in contingency tables
A Hilbert space embedding for distributions
Causation, prediction, and search
Grundlagen der Entscheidungstheorie
Support vector machines
Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness
Intriguing properties of neural networks. arXiv preprint 1312
Unsupervised object learning via common fate
Causal discovery from changes
Water vapour in the atmosphere of the habitable-zone eight-earth-mass planet K2-18b
Statistical learning theory
Simpson's paradox in Covid-19 case fatality rates: a mediation analysis of age-related causal effects
On the fairness of causal algorithmic recourse
Self-supervised learning with data augmentations provably isolates content from style
Towards causal generative scene models via competition of experts
Semi-supervised learning, causality and the conditional cluster assumption
Causation and manipulability
Tariff on animal and vegetable oils
Correlation and causation
Fairness in decision-making - the causal explanation formula
Multi-source domain adaptation: A causal view
On the identifiability of the post-nonlinear causal model
Kernel-based conditional independence test and application in causal discovery
Domain adaptation under target and conditional shift

Acknowledgements Many thanks to all past and present members of the Tübingen causality team, and to Cian Eastwood and Elias Bareinboim for feedback on the manuscript.