title: Maximum Moment Restriction for Instrumental Variable Regression
authors: Zhang, Rui; Imaizumi, Masaaki; Schölkopf, Bernhard; Muandet, Krikamol
date: 2020-10-15

We propose a simple framework for nonlinear instrumental variable (IV) regression based on a kernelized conditional moment restriction (CMR) known as a maximum moment restriction (MMR). The MMR is formulated by maximizing the interaction between the residual and functions of the IVs that belong to a unit ball of a reproducing kernel Hilbert space (RKHS). This allows us to tackle IV regression as an empirical risk minimization problem where the risk depends on the reproducing kernel on the instrument and can be estimated by a U-statistic or V-statistic. This simplification not only enables us to derive elegant theoretical analyses in both parametric and non-parametric settings, but also results in easy-to-use algorithms with a justified hyper-parameter selection procedure. We demonstrate the advantages of our framework over existing ones using experiments on both synthetic and real-world data.

Instrumental variables (IV) have become standard tools for economists, epidemiologists, and social scientists to uncover causal relationships from observational data without an unconfoundedness assumption (Angrist and Pischke, 2008; Klungel et al., 2015). Randomization of treatments or policies has been perceived as the gold standard for such tasks, but it is generally prohibitive in many real-world scenarios due to time constraints or ethical concerns. When treatment assignment is not randomized, it is generally impossible to discern between the causal effect of treatments and spurious correlations induced by unobserved factors. Instead, the use of an IV enables investigators to incorporate natural variation through an instrument that is associated with the treatments, but not with the outcome variable, other than through its effect on the treatments. In economics, for instance, the season of birth was used as an IV to study the return from schooling, which measures the causal effect of education on labor market earnings (Card, 1999). In genetic epidemiology, the idea of using genetic variants as IVs, known as Mendelian randomization, has also gained popularity in recent years (Burgess et al., 2017, 2020). Classical methods for IV regression often rely on a linearity assumption, under which two-stage least squares (2SLS) is the most popular technique (Angrist et al., 1996). The generalized method of moments (GMM) of Hansen (1982), which imposes orthogonality restrictions, can also be used for IV regression. However, the linearity assumption is often violated in real-world applications, motivating several works to extend both 2SLS and GMM to the nonlinear setting. Unfortunately, existing approaches suffer from various drawbacks such as the need to solve an ill-posed integral equation or to estimate a conditional density (Newey and Powell, 2003; Hall and Horowitz, 2005; Blundell et al., 2007; Horowitz, 2011), complicated first-stage estimation (Hartford et al., 2017; Singh et al., 2019), and cumbersome minimax optimization (Lewis and Syrgkanis, 2018; Bennett et al., 2019; Muandet et al., 2020b), which limit the prospects of these methods. To overcome these drawbacks, we propose a simple framework for nonlinear IV regression based on the maximum moment restriction (MMR) framework of Muandet et al.
(2020a) , which reformulates the original conditional moment restriction (CMR) by maximizing the interaction between the residual and functions of IVs that belong to a unit ball in a reproducing kernel Hilbert space (RKHS). We call this framework MMR-IV for brevity. Based on this formulation, we can view IV regression as an empirical risk minimization (ERM) problem (Vapnik, 1998) where the empirical risk depends on the reproducing kernel on instruments and can be expressed as a U-statistic or V-statistic. Not only does this formulation allow us to perform elegant theoretical analyses in both parametric and non-parametric settings, but it also results in easy-to-use algorithms with a justified hyperparameter selection procedure. In particular, we can arXiv:2010.07684v1 [cs. LG] 15 Oct 2020 X Y Z Figure 1 : A causal graph depicting an instrumental variable Z that satisfies an exclusion restriction and unconfoundedness (there may be a confounder ε acting on X and Y , but it is independent of Z). even solve the problem in closed-form if the solution is assumed to belong to a RKHS. We demonstrate the effectiveness of the proposed approach through experiments on both synthetic and real data. Following standard setting in the literature (Hartford et al., 2017; Lewis and Syrgkanis, 2018; Bennett et al., 2019; Singh et al., 2019; Muandet et al., 2020b) , let X be a treatment (endogenous) variable taking value in X ⊆ R d and Y a real-valued outcome variable. Our goal is to estimate a function f : X → R from a structural equation model (SEM) of the form where we assume that E[ε] = 0 and E[ν] = 0. Unfortunately, as we can see from (1), ε is correlated with the treatment X, i.e., E[ε|X] = 0, and hence standard regression methods cannot be used to estimate f . This setting arises when there exist unobserved confounders (i.e., common causes) between X and Y . To illustrate the problem, let us consider an example taken from Hartford et al. (2017) which aims at predicting sales of airline ticket Y under an intervention in price of the ticket X. However, there exist unobserved variables that may affect both sales and ticket price, e.g., conferences, COVID-19 pandemic, etc. This creates a correlation between ε and X in (1) that prevents us from applying the standard regression toolboxes directly on observational data. Instrumental variable(s). To address this problem, we assume access to an instrumental variable Z taking value in Z ⊆ R d . As we can see in (1), the instrument Z is associated with the treatments X, but not with the outcome Y , other than through its effect on the treatments. Formally, Z must satisfy (i) Relevance: Z has a causal influence on X, (ii) Exclusion restriction: Z affects Y only through X, i.e., Y ⊥ ⊥ Z|X, ε, and (iii) Unconfounded instrument(s): Z is independent of the error, i.e., ε ⊥ ⊥ Z. For example, the instrument Z may be the cost of fuel, which influences sales only via price. Intuitively, the instrument Z acts as a "natural experiment" that induces variation in X, as illustrated in Figure 1 . Generalized method of moment (GMM). The aforementioned conditions imply that E[ε | Z] = 0 for P Z -almost all z. This is a conditional moment restriction (CMR) which we can use to estimate f (Newey, 1993) . For any measurable function h, the CMR implies a continuum of unconditional moment restrictions (Lewis and Syrgkanis, 2018; Bennett et al., 2019) : That is, there exists an infinite number of moment conditions, each of which is indexed by the function h. 
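To make the role of the instrument concrete, the short simulation below generates data from a structural equation model with a hidden confounder and compares a naive least-squares fit with a simple IV estimate. The specific equations (a linear effect of Z on X, a true causal slope of one, and an additive confounder e) are illustrative assumptions rather than the paper's setup; the point is only that E[ε | X] ≠ 0 biases ordinary regression, while moments of the form E[ε h(Z)] remain approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical structural equations with a hidden confounder e:
#   X = Z + e + nu,   Y = 1.0 * X + e + delta   (true causal slope = 1).
z = rng.uniform(-3, 3, n)            # instrument: independent of the confounder e
e = rng.normal(0.0, 1.0, n)          # unobserved confounder acting on both X and Y
x = z + e + rng.normal(0.0, 0.1, n)
y = 1.0 * x + e + rng.normal(0.0, 0.1, n)

# Naive regression of Y on X is biased upward because E[eps | X] != 0.
ols_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# A simple IV (Wald / 2SLS) estimate recovers the causal slope,
# since Cov(Z, eps) = 0 under exclusion and unconfoundedness of Z.
iv_slope = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"naive OLS slope: {ols_slope:.2f}")   # noticeably larger than 1 (about 1.25 here)
print(f"IV slope:        {iv_slope:.2f}")    # close to the true value 1
```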
One of the key questions in econometrics is which moment conditions should be used as a basis for estimating the function f (Donald and Newey, 2001; Hall, 2005) . In this work, we show that, for the purpose of estimating f , it is sufficient to restrict h to be within a unit ball of a RKHS of real-valued functions on Z. Throughout this paper, we assume that h is a realvalued function on Z which belongs to a RKHS H k endowed with a reproducing kernel k : Z × Z → R. H k satisfies two important properties: (i) for all z ∈ Z and h ∈ H k , we have k(z, ·) ∈ H k and (ii) (reproducing property of H k ) h(z) = h, k(z, ·) H k where k(z, ·) is a function of the second argument. Furthermore, we define Φ k (z) as a canonical feature map of z in H k . It follows from the reproducing property that k(z, z ) = Φ k (z), Φ k (z ) H k for any z, z ∈ Z, i.e., an inner product between the feature maps of z and z can be evaluated implicitly through the kernel evaluation. Every positive definite kernel k uniquely determines the RKHS for which k is a reproducing kernel (Aronszajn, 1950) . For detailed exposition on kernel methods, see, e.g., Schölkopf and Smola (2002) , Berlinet and Thomas-Agnan (2004), and Muandet et al. (2017) . Instead of (2), we define a risk in terms of a maximum moment restriction (MMR) (Muandet et al., 2020a) : That is, instead of considering all measurable functions h as instruments, we only restrict to functions that lie within a unit ball of the RKHS H k . The risk is then defined as a maximum value of the moment restriction with respect to this function class. The benefits of our formulation are two-fold. First, it is computationally intractable to learn f from (2) using all measurable functions as instruments. By restricting the function class to a unit ball of the RKHS, the problem becomes computationally tractable, as will be shown in Lemma 1 below. Second, this restriction still preserves the consistency of parameter estimated using R k (f ). In other words, nothing is lost in terms of statistical estimation by this simplification (cf. Theorem 1). We assume throughout that the reproducing kernel k is integrally strictly positive definite (ISPD). Assumption 1. The kernel k is continuous, bounded (i.e., sup z∈Z k(z, z) < ∞) and satisfies the condition of integrally strictly positive definite (ISPD) kernels, i.e., for any function g that satisfies 0 < g 2 2 < ∞, we have Z g(z)k(z, z )g(z ) dz dz > 0. An important property of such kind of kernels is that k(z, z ) > 0 for any z, z ∈ Z. They are also related to the notion of characteristic and universal kernels; see, e.g., Simon-Gabriel and Schölkopf (2018) . More details on ISPD kernels are given in Appendix A. We propose to learn f by minimizing R k (f ) in (3). To this end, we define an optimal function f * as a minimizer of the above population risk with respect to a function class F of real-valued functions on X , i.e., It is instructive to note that population risk R k depends on the choice of the kernel k. The important insight for our work is that the population risk R k (f ) has an analytic solution. where (X , Y , Z ) is an independent copy of (X, Y, Z). Based on Assumption 1 and Lemma 1, we obtain the following result, which is a special case of Muandet et al. (2020a, Theorem 3.2) , showing that R k (f ) = 0 if and only if f satisfies the original conditional moment restriction (see Appendix B.2 for the proof). Theorem 1. Assume that the kernel k is integrally strictly positive definite (ISPD). 
Then, for any realvalued measurable function f , It is instructive to note that Theorem 1 holds as long as the kernel k belongs to a class of ISPD kernels. Hence, it allows for more flexibility in terms of the kernel choice. Moreover, it is not difficult to show that The previous results pave the way for an empirical risk minimization (ERM) framework (Vapnik, 1998) to be used in our work. That is, given a sample ∼ P n (X, Y, Z) of size n, an empirical estimate of the risk R k (f ) can be obtained as which is in the form of U-statistic (Serfling, 1980, Section 5) . Alternatively, an empirical risk based on Vstatistic can also be used, i.e., Both forms of empirical risk can be used as a basis for a consistent estimation of f . The advantage of (5) is that it is a minimum-variance unbiased estimator with appealing asymptotic properties, whereas (6) is a biased estimator of the population risk (4) However, the estimator based on V-statistics employs a full sample and hence may yield better estimate of the risk than the U-statistic counterpart. Let x := [x 1 , . . . , x n ] , y := [y 1 , . . . , y n ] and z := [z 1 , . . . , z n ] be column vectors. Let K z be the kernel matrix K(z, z) = [k(z i , z j )] ij evaluated on the instruments z. Then, both (5) and (6) can be rewritten as where f (x) := [f (x 1 ), . . . , f (x n )] and W V (U ) ∈ R n×n is a symmetric weight matrix that depends on the kernel matrix K z . Specifically, the weight matrix W corresponding to (5) is given by W U = (K z − diag(k(z 1 , z 1 ), . . . , k(z n , z n )))/(n(n − 1)) where diag(a 1 , . . . , a n ) denotes an n × n diagonal matrix whose diagonal elements are a 1 , . . . , a n . As shown in Appendix B.5, W U is indefinite and may cause problematic inferences. The weight matrix W for (6) is given by W V := K z /n 2 which is a positive definite matrix for the ISPD kernel k. Finally, our objective (7) also resembles the well-known generalized least regression with correlated noise (Kariya and Kurata, 2004, Chapter 2) where the covariance matrix is the Z-dependent invertible matrix W −1 V (U ) . Based on (5) and (6), we estimate f * by minimizing the regularized empirical risk over a function class F: where λ > 0 is a regularization constant satisfying lim n→∞ λ = 0, and Ω(f ) is the regularizer. Sincê f U andf V respectively minimizes objectives which are regularized U-statistic and V-statistic, they can be viewed a specific form of M-estimators; see, e.g., Van der Vaart (2000a, Ch. 5) . In this work, we focus on the V-statistic empirical risk and provide practical algorithms when F is parametrized by deep neural networks (NNs) and an RHKS of real-valued functions. Kernelized GMM. We may view the objective (7) from the GMM perspective (Hall, 2005) . The assumption that the instruments Z are exogenous implies that E[Φ k (Z)ε] = 0 where Φ k denotes the canonical feature map associated with the kernel k. This gives us an infinite number of moments, ). Hence, we can write the sample moments asĝ(f ) = (1/n) The intuition behind GMM is to choose a function f that sets these moment conditions as close to zero as possible, motivating the objective ). That is, our objective (7) defined using V-statistic is a special case of the GMM objective when the weighting matrix is the identity operator. Discovering more efficient weighting operators is also an interesting research direction; see, e.g., Carrasco et al. (2007, Ch. 6) for an earlier endeavour. 
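Concretely, once the residuals and the kernel matrix K_z are available, both empirical risks reduce to quadratic forms. The sketch below, with a placeholder Gaussian kernel and toy data, evaluates the V-statistic risk (y − f(x))ᵀ W_V (y − f(x)) with W_V = K_z / n² and its U-statistic counterpart with the indefinite weight matrix W_U.

```python
import numpy as np

def gaussian_gram(z, sigma):
    """Kernel matrix K_z with entries k(z_i, z_j) = exp(-||z_i - z_j||^2 / (2 sigma^2))."""
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmr_risks(f_vals, y, K_z):
    """Return (R_V, R_U): V- and U-statistic estimates of the MMR risk."""
    n = len(y)
    r = y - f_vals                                          # residuals y_i - f(x_i)
    W_V = K_z / n ** 2                                      # positive semi-definite weight
    W_U = (K_z - np.diag(np.diag(K_z))) / (n * (n - 1))     # unbiased but indefinite weight
    return r @ W_V @ r, r @ W_U @ r

# Toy usage with a placeholder candidate function f(x) = 0.5 * x.
rng = np.random.default_rng(0)
z = rng.uniform(-3, 3, size=(200, 1))
x = z[:, 0] + rng.normal(0, 1, 200)
y = np.abs(x) + rng.normal(0, 0.1, 200)
R_V, R_U = mmr_risks(0.5 * x, y, gaussian_gram(z, sigma=1.0))
```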
A workflow of our algorithm based on R V is summarized in Algorithm 1; we leave the R U based method to future work to solve the inference issues caused by indefinite W U . We provide examples of the class F in both parametric and non-parametric settings below. Deep neural networks. In the parametric setting, the function class F can often be expressed as F Θ = {f θ : θ ∈ Θ} where Θ ⊆ R m denotes a parameter space. We consider a very common nonlinear model in machine learning f ( ) denotes a nonlinear feature map of a depth-h NN. Here, W i for i = 1, . . . , h are parameter matrices and each σ i denotes the entrywise activation function of the i-th layer. In this case, θ = (b 0 , W 0 , W 1 , . . . , W h ). As a result, we can rewritê f V in terms of their parameters aŝ In what follows, we refer to this algorithm as MMR-IV (NN); see Algorithm 3 in Appendix D.1. Kernel machines. In a non-parametric setting, the function class F becomes an infinite dimensional space. In this work, we consider F to be an RKHS H l of real-valued functions on X with a reproducing kernel l : X × X → R. Then, the regularized solution can Algorithm 1 MMR-IV Input: Dataset D = {x, y, z}, kernel function k with parameters θ k , a function class F, regularization functional Ω(·), and regularization parameter λ Output: The estimate of f * in F. 1: Compute the kernel matrix K = k(z, z; θ k ). As per the representer theorem, any optimalf admits a form Schölkopf et al., 2001) , and based on this representation, we rewrite the objective aŝ For U-statistic version, the quadratic program (8) substitutes indefinite W U for W V , so it may not be positive definite. The value of λ needs to be sufficiently large to ensure that (8) is definite. On the other hand, the Vstatistic based estimate (8) is definite for all non-zero λ since W V is positive semi-definite. Thus, the optimal α can be obtained by solving the first-order stationary condition and if L is positive definite, the solution has a closed form expression, α = (LW V L + λL) −1 LW V y. In our experiments, we focus on the V-statistic version because of these advantages. In the following, we refer to this algorithm as MMR-IV (RKHS). Nyström approximation. The MMR-IV (RKHS) algorithm is computationally costly for large datasets as it requires a matrix inversion. To improve the scalability, we resort to Nyström approximation (Williams and Seeger, 2001) to accelerate the matrix inversion in α = (LW V L + λL) −1 LW V y. First, we randomly select a subset of m( n) samples from the original dataset and construct the corresponding sub-matrices of W V , namely, W mm and W nm based on this subset. Second, let V and U be the eigenvalue vector and the eigenvector matrix of W mm . Then, the Nyström approximation is obtained as W V ≈ U V U where U := ( m/n)W nm U V −1 and V := (n/m)V . We finally apply the Woodbury formula (Flannery et al., 1992, p. 75 ) to obtain ( We will refer to this algorithm as MMR-IV (Nyström); Algorithm 2 in Appendix D.1. The runtime complexity of this algorithm is O(nm 2 + n 2 ). The V-statistic empirical risks (6) provides a convenient hyper-parameter selection through cross validation (CV) procedures. In particular, one can observe that our RKHS model in (8) is similar to the kernel ridge regression, except the V-statistic empirical risk R(f ). Based on this observation, we provide an efficient way for MMR-IV (Nyström) to select the regularization parameter λ and parameters of l using CV, and the derivation is inspired by Vehtari et al. (2016) . 
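The closed-form RKHS estimator and its Nyström acceleration described above can be summarized in code. The sketch below is a minimal implementation under simplifying assumptions: Gaussian kernels for l and k, a positive definite L (so that α = (LW_VL + λL)⁻¹LW_Vy simplifies to (W_VL + λI)⁻¹W_Vy), and one particular way of combining the stated Nyström factors with the Woodbury identity, since the paper's exact expression was truncated in extraction.

```python
import numpy as np

def gram(a, b, sigma):
    """Gaussian Gram matrix exp(-||a_i - b_j||^2 / (2 sigma^2)); a and b are (n, d) arrays."""
    sq = np.maximum(np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T, 0)
    return np.exp(-sq / (2 * sigma**2))

def mmr_iv_rkhs_fit(x, y, z, lam, sigma_l, sigma_k):
    """Exact V-statistic MMR-IV estimator in an RKHS (O(n^3) direct solve)."""
    n = len(y)
    L = gram(x, x, sigma_l)               # kernel matrix of l on the treatments
    W_V = gram(z, z, sigma_k) / n**2      # V-statistic weight matrix from k on the instruments
    # First-order condition: alpha = (L W_V L + lam L)^{-1} L W_V y  (small jitter for stability)
    return np.linalg.solve(L @ W_V @ L + lam * L + 1e-10 * np.eye(n), L @ (W_V @ y))

def mmr_iv_rkhs_predict(alpha, x_train, x_new, sigma_l):
    """f_hat(x*) = sum_i alpha_i l(x_i, x*), by the representer theorem."""
    return gram(x_new, x_train, sigma_l) @ alpha

def mmr_iv_nystrom_fit(x, y, z, lam, sigma_l, sigma_k, m=300, seed=0):
    """Nystrom-accelerated variant: factor W_V with a random subset of size m and apply
       the Woodbury identity, assuming L is positive definite so that
       alpha = (W_V L + lam I)^{-1} W_V y."""
    rng = np.random.default_rng(seed)
    n = len(y)
    L = gram(x, x, sigma_l)
    W_V = gram(z, z, sigma_k) / n**2
    idx = rng.choice(n, size=m, replace=False)
    W_nm, W_mm = W_V[:, idx], W_V[np.ix_(idx, idx)]
    evals, evecs = np.linalg.eigh(W_mm)
    evals = np.maximum(evals, 1e-12)                   # guard against round-off negatives
    U_t = np.sqrt(m / n) * W_nm @ evecs / evals        # tilde U = sqrt(m/n) W_nm U V^{-1}
    V_t = (n / m) * evals                              # tilde V = (n/m) V, so W_V ~ U_t diag(V_t) U_t^T
    Wy = U_t @ (V_t * (U_t.T @ y))                     # approximate W_V y
    # Woodbury: (lam I + U_t diag(V_t) U_t^T L)^{-1} v
    #         = (1/lam) [ v - U_t (lam diag(1/V_t) + U_t^T L U_t)^{-1} U_t^T L v ]
    inner = np.diag(lam / V_t) + U_t.T @ (L @ U_t)
    return (Wy - U_t @ np.linalg.solve(inner, U_t.T @ (L @ Wy))) / lam
```

Hyper-parameters such as λ and the bandwidths would in practice be selected by the cross-validation procedure of Section 3.3; the values passed above are placeholders.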
While for NN, we follow the standard CV. Gaussian process (GP) interpretation. We first introduce a GP over f , and we show that maximizing its posterior distribution minimizes the regularized V-statistic objective (8), and equivalence between the predictive mean and the optimal function of (8) in Appendix C. The GP model has the prior of f (x) ∼ GP(0, δl(x, x)) and the likelihood p(D|f (x)) = (2π) n/2 |K z | −1/2 N (f (x) | y, K z ), where l is the kernel function, δ(= (λn 2 ) −1 ) plays the role of the regularization parameter justified in Appendix C, and D := {(x, y, z)}. The posterior distribution of f (x) can be straight-forwardly derived from the Gaussian-like prior and likelihood, As the predictive function value is the predictive mean f (x M ) = b, error of m LMOCV is expressed as , where we denote by (i) the LMOCV index and the residual r (i) := b (i) −y We provide the consistency and asymptotic normality of f V . We also develop the same results for f U , but we defer them to the appendix due to space limitation. All proofs are presented in the appendix. We first show the consistency off V , which depends on the convexity and uniform convergence of the risk functions. The result holds regardless of the shape of Ω(f ), so we can utilize the regularization θ 2 2 which is common for NN but non-convex in terms of f . Theorem 2 (Consistency off V ). Assume that the distribution of Y has bounded support, F is compact, and Assumption 1 holds, Ω(f ) is a bounded function, λ If the Ω(f ) is convex in f , the consistency can be obtained more easily by Newey and McFadden (1994, Theorem 2.7) . In this case, we can avoid several conditions. We provide an additional result with this setting in Section B.6 in the appendix. We analyze asymptotic normality of the estimatorf V , which is important to advanced statistical analysis such as tests. Here, we investigate two different cases: the estimator has finite-and infinite-dimension. Finite-dimension case. We consider thef V is characterized by a finite-dimensional parameter from a parameter space Θ. We rewrite the regularized V-statistic risk as a compact form R V,λ (f θ ) := → 0, f θ and Ω(θ) are twice continuously differentiable about θ, √ nλ p → 0, θ * is an interior point of Θ, and Assumption 1 holds. Then, ]H −1 and denotes a convergence in law. Infinite-dimension case. We show the asymptotic normality of an infinite-dimensional estimatorf V . That is, we show that an error off V weakly converges to a Gaussian process that takes values in a function space H l . We set Ω(f ) = f 2 H l and consider a minimizer: f * λ0 ∈ argmin f ∈F R k (f ) + λ 0 f 2 H l with arbitrary λ 0 > 0. We also define N (ε, H, · ) as an ε-covering number of a set H in terms of · . Then, we obtain the following: Theorem 4. Suppose Assumption 1 holds, l is a bounded kernel, k is a uniformly bounded function, and λ−λ 0 = o(n −1/2 ) holds. Also, suppose that X , Z, and Y are compact spaces, and there exists s ∈ (0, 2) and a constant C H > 0 such that log N (ε, H l , · L ∞ ) ≤ C H ε −s for any ε ∈ (0, 1). Then, there exists a Gaussian process G * P such that This result allows statistical inference on functional estimators such as kernel machines. Although there are many conditions, all of them are valid for many well-known kernels; see Appendix B.11 for examples. When the relationships between Z and X as well as between X and Y are nonlinear, several extensions of 2SLS and GMM have been proposed. 
In the two-stage approach, the function f , the first-stage regression is replaced by a conditional density estimation of P(X|Z) using a kernel density estimator. Estimating conditional densities is a difficult task and has been known to perform poorly in high dimension (Tsybakov, 2008) . The IV regression has also recently received attention in the machine learning community. Hartford et al. We present the experimental results in a wide range of settings for IV estimation. Following Lewis and Syrgkanis (2018) We refer readers to Appendix D for more experimental details due to space limitation, and our code is publicly available 1 . Following Bennett et al. (2019), we employ the following data generation process: where Z := (Z 1 , Z 2 ) ∼ Uniform([−3, 3] 2 ), e ∼ N (0, 1), and γ, δ ∼ N (0, 0.1 2 ). In words, Z is a twodimensional IV, but only the first instrument Z 1 has an .011 ± .002 .005 ± .000 .153 ± .019 .040 ± .004 MMR-IV (Nys) .011 ± .001 .001 ± .000 .006 ± .002 .020 ± .002 effect on X. The variable e is the confounding variable that creates the correlation between X and the residual Y − f * (X). We vary the true function f * between the following cases to enrich the datasets: (i) sin: We consider both small-sample (n = 200) and large-sample (n = 2000) regimes. The results are reported in Table 1 for the largesample regime (and Table 3 in Appendix D for the small-sample regime). Our findings are as follows: (i) MMR-IVs perform reasonably well in most cases. (ii) The linear methods (2SLS and Poly2SLS) perform best when the linearity assumption is satisfied. (iii) Some nonlinear but complicated methods (GMM+NN, DeepIV and DeepGMM) are unstable due to the sensitivity to hyper-parameters. It shows that MMR-IV (Nyström) has the advantage of adaptive hyper-parameter selection. Section D.2 provides more comments. We also record the runtimes of all methods on the large-sample regime and report them in Figure 2 . Compared to the NN-based methods, i.e., AGMM, DeepIV, DeepGMM, our MMR-IV (NN) is the most computationally efficient method, which is clearly a result of a simpler objective. Using a minimax optimization between two NNs, AGMM is the least efficient method. DeepGMM and DeepIV are more efficient than AGMM, but are less efficient than MMR-IV (NN RKHS-based methods, namely, KernelIV and MMR-IV (RKHS), have similar runtime complexity. We observe that both of them scale poorly to large datasets. In high-dimensional setting, we employ the same data generating process as in Section 6.1. We consider only the absolute function for f * , but map Z, X, or both X and Z to MNIST images (784-dim) (LeCun et al., 1998) . Let us denote the original outputs in Section 6.1 by X low , Z low and let π(u) := round(min(max(1.5u + 5, 0), 9)) be a transformation function mapping inputs to an integer between 0 and 9, and let RI(d) be a function that selects a random MNIST image from the digit class d. We set n = 10, 000. Then, the scenarios we consider are (i) MNIST Z : Z ← RI(π(Z low 1 )), (ii) MNIST X : X ← RI(π(X low )), and (iii) MNIST XZ : X ← RI(π(X low )), Z ← RI(π(Z low 1 )). We report the results in Table 2 . We summarize our findings: (i) MMR-IV (NN) performs well across scenarios. (ii) MMR-IV (Nyström) fails except MNIST Z , because the ARD kernel with PCA are not representative enough for MNIST X . (iii) Some methods (AGMM and DeepIV) do not work when X is high-dimensional, similar to Bennett et al. (2019). 
(iv) The two-step methods (Poly2SLS and Ridge2SLS) has large errors because the first-stage regression from Z to X is illposed. Section D.3 provides more comments. We demonstrate our method in the setting of Mendelian randomization which relies on genetic variants that satisfy the IV assumptions. The "exposure" X and outcome Y are univariate and generated from the simulation process by Hartwig et al. (2017): where . Linkage disequilibrium and population stratification are another common reasons for invalid IVs. This experiment aims to evaluate the sensitivity of different methods to the number of instruments (d ) and confounder strengths (c 1 , c 2 ). We consider three experiments: (i) d = 8, 16, 32; (ii) c 1 = 0.5, 1, 2; (iii) c 2 = 0.5, 1, 2; unmentioned parameters use default values: β = 1, d = 16, c 1 = 1, c 2 = 1. We draw 10, 000 samples for the training, validation and test sets, respectively, and train MMR-IV (Nyström) only on the training set. Other settings are the same as those of the low-dimensional scenario. Figure 3 depicts the experimental results. Overall, 2SLS performs well on all settings due to the linearity assumption, except particular sensitivity to the number of (weak) instruments, which is a well-known property of 2SLS (Angrist and Pischke, 2008) . Although imposing no such assumption, MMR-IVs perform competitively and even more stably, since the information of instruments is effectively captured by the kernel k and we only need to deal with a simple objective, and also the analytical CV error plays an essential role. Section D.5 contains additional findings. The Vitamin D data. Lastly, we apply our algorithm to the Vitamin D data (Sjolander and Martinussen, 2019, Sec. 5.1). The data were collected from a 10-year study on 2571 individuals aged 40-71 and 4 variables are employed: age (at baseline), filaggrin (binary indicator of filaggrin mutations), VitD (Vitamin D level at baseline) and death (binary indicator of death during study). The goal is to evaluate the potential effect of VitD on death. We follow Sjolander and Martinussen (2019) by controlling age in the analyses, using filaggrin as instrument, and then applying the MMR-IV (Nyström) algorithm. Sjolander and Martinussen (2019) modeled the effect of VitD on death by a generalized linear model and found the effect is insignificant by 2SLS (p-value on the estimated coefficient is 0.13 with the threshold of 0.05). More details can be found in Appendix D.6. The estimated effect is illustrated in Figure 4 in the appendix. We observe that: (i) the result without instruments is counterintuitive: at a young age, deficiency of VitD results in low mortality rate; (ii) In contrast, the result with instruments is more reasonable: a low VitD level at a young age has a slight effect on death, but a more adverse effect at an old age (Meehan and Penckofer, 2014). Learning causal relations when hidden confounders are present is a cornerstone of decision making. IV regression is a standard tool to tackle this task, but currently faces challenges in nonlinear settings. The present work presents a simple framework that overcomes some of these challenges. We employ RKHS theory in the reformulation of conditional moment restriction (CMR) as a maximum moment restriction (MMR) based on which we can approach the problem from the empirical risk minimization (ERM) perspective. 
As we demonstrate, this framework not only facilitates theoretical analysis, but also results in easyto-use algorithms that perform well in practice compared to existing methods. The paper also shows a way of combining the elegant theoretical approach of kernel methods with practical merits of deep neural networks. Despite these advantages, the optimal choice of the kernel k remains an open question which we hope to address in future work. where σ is a positive bandwidth parameter. Another important kernel is an inverse multiquadric (IMQ) kernel k(z, z ) = (c 2 + z − z 2 2 ) −γ where c and γ are positive parameters (Steinwart and Christmann, 2008, Ch. 4 ). This class of kernel functions is closely related to the notions of universal kernels (Steinwart, 2002) and characteristic kernels (Fukumizu et al., 2004) . The former ensures that kernel-based classification/regression algorithms can achieve the Bayes risk, whereas the latter ensures that the kernel mean embeddings can distinguish different probability measures. In principle, they guarantee that the corresponding RKHSs induced by these kernels are sufficiently rich for the tasks at hand. We refer the readers to Sriperumbudur et al. (2011) and Simon-Gabriel and Schölkopf (2018) for more details. This section contains detailed proofs of the results that are missing in the main paper. Proof. Since H k is the RKHS, we can rewrite (3) as where we used the reproducing property of H k in the first equality and the fact that H k is a vector space in the last equality. By assumption, E[(Y − f (X))k(Z, ·)] is Bochner integrable (Steinwart and Christmann, 2008, Def. A.5.20) . Hence, we can write (10) as as required. Proof. First, the law of iterated expectation implies that By Lemma 1, we know that To show the converse, we assume that R k (f ) = 0 and rewrite it as . Since k is ISPD by assumption, this implies that g is a zero function with respect to P Z , i.e., E[Y − f (X) | z] = 0 for P Z -almost all z. Theorem 5. If F is a convex set, then the risk R k given in (4) is strictly convex on F. Proof. Given α ∈ (0, 1) and any functions f, g : X → R, we will show that H k . Hence, we can rewrite the above function as The equality (a) is obtained by considering H k on the left hand side of (a). We note that the right hand side of (a) is quadratic in E[(Y − f (X))k(Z, ·)] H k and E[(Y − g(X))k(Z, ·)] H k , and can be further expressed as a square binomial as the right hand side of (b). Therefore, the convexity follows from the fact that k is the ISPD kernel and α(α − 1) < 0. The results presented in this section are used to prove the consistency off V andf U . Lemma 2 (Uniform consistency of R V (f )). Assume that the distribution of Y has bounded support, F is compact, and Assumption 1 holds. Then, the risk Proof. First, let u := (x, y, z), u := (x , y , z ), and h f (u, u ) := (y − f (x))(y − f (x ))k(z, z ) for some (x, y, z), (x , y , z ) ∈ X ×Y ×Z. To prove that R V converges uniformly to R k , we need to show that (i) h f (u, u ) is continuous at each f with probability one; (ii) E u,u sup f ∈F |h f (u, u )| < ∞, and E u,u sup f ∈F |h f (u, u)| < ∞ Newey and McFadden (1994, Lemma 8.5). To this end, it is easy to see that The third inequality follows from the Cauchy-Schwarz inequality. Since F is compact, every function f in F is bounded, say |f (x)| < M < ∞ for some constant M and E X [M ] < ∞. Since the distribution of Y also has a bounded support, it follows that E[|Y |] < ∞. 
The kernel k(·, ·) is continuous and bounded as per Assumption 1, i.e., sup z∈Z k(z, z) < ∞, so k(z, z)k(z , z ) < ∞ follows straightforwardly. Consequently, we have h f is continuous at each f with probability one and Hence, our assertion follows from Newey and McFadden (1994, Lemma 8.5). Lemma 3 (Uniform consistency of R U (f )). Assume that the distribution of Y has bounded support, F is compact and Assumption 1 holds. Then, Proof. First, let u := (x, y, z), u := (x , y , z ), and h f (u, u ) := (y − f (x))(y − f (x ))k(z, z ) for some (x, y, z), (x , y , z ) ∈ X × Y × Z. To prove the uniform consistency of R U , we need to show that (i) h f (u, u ) is continuous at each f with probability one; (ii) there is d(u, u ) with |h f (u, u )| ≤ d(u, u ) for all f ∈ F and E u,u [d(u, u )] < ∞ (Newey and McFadden, 1994, Lemma 2.4); (iii) (u i , u j ) n,n i =j has strict stationarity and ergodicity in the sense of Newey and McFadden (1994, Footnote 18 in P.2129). To this end, it is easy to see that The third inequality follows from the Cauchy-Schwarz inequality. Since F is compact, every function f in F is bounded, say |f (x)| < M < ∞ for some constant M and E X [M ] < ∞. Since the distribution of Y has bounded support, it follows that E[|Y |] < ∞. The kernel k(·, ·) is continuous and bounded as per Assumption 1, i.e., sup z∈Z k(z, z) < ∞, so k(z, z)k(z , z ) is integrable. Consequently, (i) h f (u, u ) is continuous at each f with probabiilty one and (ii) E u,u [d(u, u )] < ∞ . Furthermore, we show that (u i , u j ) n,n i =j has strict stationarity and ergodicity. Strict stationarity means that the distribution of a set of data (u i , u j =i ) i+m,j+m i=1,j=1 does not depend on the starting indices i, j for any m and m , which is easy to check. Ergodicity means that R U (f ) B.5 Indefiniteness of Weight Matrix W U Theorem 6. If Assumption 1 holds, W U is indefinite. Proof. By definition, we have where diag(a 1 , . . . , a n ) denotes an n × n diagonal matrix whose diagonal elements are a 1 , . . . , a n . We can see that the diagonal elements of K U are zeros and therefore trace(W U ) = 0. Let us denote the eigenvalues of W U by {λ i } n i=1 . Since n i=1 λ i = trace(W U ), we conclude that there exist both positive and negative eigenvalues (all eigenvalues being zeros yields trivial W U = 0). As a result, W U is indefinite. Theorem 7 (Consistency off V with convex Ω(f )). Assume that F is a convex set, f * is an interior point of F, Ω(f ) is convex about f , λ p → 0 and Assumption 1 holds. Then,f V exists with probability approaching one andf V p → f * . Given Ω(f ) is convex about f , we prove the consistency based on Newey and McFadden (1994, Theorem 2.7) which requires H k , and by the law of large number, we have that 1 → R k (f ) follows from the Continuous Mapping Theorem (Mann and Wald, 1943) based on the fact that the function g(·) = · 2 H k is continuous. As λ → R k (f ) by Slutsky's theorem (Van der Vaart, 2000b, Lemma 2.8). Besides, it is easy to see that R V (f ) is convex because the weight matrix W V is positive definite, and (ii) R V (f ) + λΩ(f ) is convex due to convex Ω(f ). Further, the condition (i) directly follows from Theorem 5, and given that f * is an interior point of the convex set F, our assertion follows from Newey and McFadden (1994, Theorem 2.7). Proof. From the conditions of Lemma 2, we know that F is compact, R k (f ) is continuous about f and is convex on F and is uniquely minimized at f * . 
Based on the conditions that Ω(f ) is bounded and λ p → 0, we obtain by Slutsky's theorem that Consequently, we assert the conclusion by Newey and McFadden (1994, Theorem 2.1) . Theorem 8 (Consistency off U ). Assume that conditions of Lemma 3 hold, F is convex, Ω(f ) is a bounded function and λ Proof. By the conditions of Lemma 3, we know that F is compact, is convex on F and is uniquely minimized at f * . Based on the conditions that Ω(f ) is bounded and λ p → 0, we obtain by Slutsky's theorem that Consequently, we assert the conclusion by Newey and McFadden (1994, Theorem 2.1). In this section, we consider the regularized U-statistic risk R U,λ (f θ ). For u i := (x i , y i , z i ) and u j := (x j , y j , z j ), we express it in a compact form We will assume that f θ and Ω(θ) are twice continuously differentiable about θ. The first-order derivative ∇ θ R U (f θ ) can also be written as Asymptotic normality of ∇ θ R U,λ (f θ * ). We first show the asymptotic normality of ∇ θ R U,λ (f θ * ). We assume that there exists z ∈ Z such that E X [∇ θ f θ * (X) | z]p(z) = 0 or E XY [Y − f θ * (X) | z]p(z) = 0. Both terms being equal to zeros for all z ∈ Z leads to a singular ∇ 2 θ R U (f θ * ) and the asymptotic distribution therefore becomes much more complicated to analyze. Lemma 4. Suppose that f θ and Ω(θ) are first continuously differentiable about θ, Proof. The proof follows from Serfling (1980, Section 5.5.1 and Section 5.5.2) and we need to show that (i) where equality holds if for any U , there is E U [∇ θ h θ * (U, U )] = 0, i.e., As the above equation holds for any Y , the coefficient of Y must be 0: where we note that E[∇ θ f θ * (X ) | Z ]p(Z ) = 0 for any Z implied by the second function above. Similarly, the coefficient of ∇ θ f θ * (X) must be zero, which implies that E X Y [(Y − f θ * (X )) | Z ]p(Z ) = 0 for any Z . The two coefficients cannot be zero at the same time (otherwise against the given conditions), ) as per Serfling (1980, Section 5.5.1). Finally, as √ nλ p → 0 and ∇ θ Ω(θ * ) < ∞ by the condition that Ω(θ) is first continuously differentiable, we assert the conclusion by Slutsky's theorem, This concludes the proof. Uniform consistency of ∇ 2 θ R U,λ (f θ ). Next, we consider the second derivative ∇ 2 θ R U,λ (f θ ) and show its uniform consistency. In what follows, we denote by · F the Frobenius norm. We can express ∇ 2 θ R U,λ (f θ ) as Lemma 5. Suppose that f θ and Ω(θ) are twice continuously differentiable about θ, Θ is compact, p(Y ) has bounded support, λ p → 0 and Assumption 1 holds. Then, Proof. The proof is similar to that of Lemma 3 and both applies extended results of Newey and McFadden (1994, Lemma 2.4). As (u i , u j ) i =j being strictly stationary in the sense of Newey and McFadden (1994, Footnote 18 in P.2129) has been shown in Lemma 3, we only need to show that there exists d(u, u ) ≥ |∇ 2 θ h θ (u, u )| for all u, u ∈ X × Y × Z and E[d(U, U )] < ∞. As f θ is twice continuously differentiable about θ and Θ is compact, f θ is bounded. We can easily find a constant function a(x) = A ∈ R such that |f θ (x)| < a(x) < ∞. By the same condition, we have each entry of ∇ θ f θ bounded as Consequently, we exploit the triangle inequality of the Frobenius norm and obtain where we assume that |y| < M < ∞ for all y ∈ Y in the second inequality because p(Y ) has bounded support and that θ is m-dimensional. Since k(z, z ) is bounded as implied by Assumption 1, p → 0 following from the extended results in the remarks of Newey and McFadden (1994, Lemma 2.4) . 
Furthermore, from the conditions that Ω(θ) is twice continuously differentiable and the parameter space Θ is compact, we obtain that ∇ 2 θ Ω(θ) F < ∞ for any θ ∈ Θ. Finally, it follows from the Slutsky's theorem that This concludes the proof. Theorem 9 (Asymptotic normality ofθ U ). Suppose that H = E[∇ 2 θ h θ * (U, U )] is non-singular, Θ is convex and compact, p(Y ) has bounded support, f θ and Ω(θ) are twice continuously differentiable about θ, √ nλ p → 0, R k (f θ ) is uniquely minimized at θ * and θ * is an interior point of Θ, and Assumption 1 holds. Proof. The proof follows by Newey and McFadden (1994, Theorem 3 .1) and we need to show that The proof of (i) is very similar to Theorem 8 except that we consider finite dimensional parameter space instead of functional space. For a neat proof, we would like to omit the detailed proof here. We can first show the uniform is continuous about θ similarly to Lemma 3, and then θ U p → θ * similarly to Theorem 8. From the conditions that Θ is compact, f θ is twice continuously differentiable about θ, p(Y ) has bounded support and k(z, z ) is bounded as implied by Assumption 1, we can obtain (ii) R V,λ (θ) is twice continuously differentiable and E[ ∇ θ h θ * (U, U ) 2 2 ] < ∞ by a similar proof to that of Lemma 5. Given is non-singular and R k (f θ ) is uniquely minimized at θ * , we can obtain that the Hessian matrix H is positive definite because R k (θ) is convex as per Theorem 5, ). Finally by Lemma 5, we know that H(θ) = E[∇ 2 θ h θ (U, U )] and H(θ * ) = H, so (iv) and (v) are satisfied. Now, conditions of Newey and McFadden (1994, Theorem 3.1) are all satisfied, so we assert the conclusion. We restate the notations Lemma 6. Suppose that conditions of Lemma 4 hold. Then has the same limit distribution as that of √ n∇ θ R U (f θ * ) by Serfling (1980, Section 5.7.3). Furthermore, by √ nλ p → 0 and ∇ θ Ω(θ * ) < ∞ from that Ω(θ) is first continuously differentiable, we assert the conclusion by Slutsky's theorem Lemma 7. Suppose that conditions of Lemma 5 hold. Then, Proof. From conditions of Lemma 5, we can obtain that ∇ 2 θ h θ (u, u ) F and ∇ 2 θ h θ (u, u) F are bounded, and E[∇ 2 θ h θ (U, U )] is continuous about θ as proved in Lemma 5. By ∇ 2 θ h θ (u, u ) F and ∇ 2 θ h θ (u, u) F are bounded, it follows that ∇ 2 Then, we assert the conclusion by Newey and McFadden (1994, Lemma 8.5 ). Proof of Theorem 3. The proof is the same as that of Theorem 9 except that R U is replaced by R V . We firstly state the asymptotic normality theorem forf U and its proof. Afterwards, we provide the proof of Theorem 4 whose proof is a slightly modified version of that off U . Theorem 10. Suppose Assumption 1 holds, l is a bounded kernel, k is a uniformly bounded function, and λ ≥ λ 0 holds. Also, suppose that X , Z, and Y are compact spaces, and there exists s ∈ (0, 2) and a constant C H > 0 such that log N (ε, H l , · L ∞ ) ≤ C H ε −s for any ε ∈ (0, 1). If λ − λ 0 = o(n −1/2 ) holds, then there exists a Gaussian process G * P such that An exact covariance of G * P is described in the proof. The proof is based on the uniform convergence of U-processes on the function space (Arcones and Gine, 1993) and the functional delta method using the asymptotic expansion of the loss function (Hable, 2012) . This asymptotic normality allows us to perform statistical inference, such as tests, even in the non-parametric case. We discuss the boundedness and covering number assumptions in Theorem 4 and Theorem 10. 
For the boundedness assumption, many common kernels, such as the Gaussian RBF kernel, the Laplacian kernel, and the Mercer kernel, satisfy it. For the covering number, the common kernels above also satisfy it with a certain parameter configuration. For example, the Gaussian kernel (see Section 4 in Steinwart and Christmann (2008)) and the Mercer kernel (explained in Zhou (2002)). To prove the theorem, we provide some notation. Let P be a probability measure which generates u = (x, y, z) and For a signed measure Q on W, we define a measure Q 2 := Q ⊗ Q on W × W. Then, we can rewrite the U-statistic risk as where U n is an empirical measure for the U -statistics. Similarly, we can rewrite the V-statistic risk as where P n is an empirical measure of u. Further, we define a functional associated with measure. We consider functional spaces G 1 : Note that G 1 contains a functional which corresponds to the U 2 n . Then, we consider a set of functionals and also let B 0 be the closed linear span of B S . For a functional F ∈ B S , let ι(F ) be a measure which satisfies F (g) = g dι(F ). Uniform Central Limit Theorem: We firstly achieve the uniform convergence. Note that a measure P 2 satisfies P 2 h f := E (U,U ) [h f (U, U )]. The convergence theorem for U-processes is as follows: Theorem 11 (Theorem 4.4 in Arcones and Gine (1993)). Suppose H is a set of uniformly bounded class and symmetric functions, such that {P 1 h f | h f ∈ H} is a Donsker class and lim n→∞ E[n −1/2 log N (n −1/2 ε, H, · L 1 (U 2 n ) )] = 0 holds, for all ε > 0. Then, we obtain √ n(U 2 n − P 2 ) 2G P 1 , in ∞ (H). Here, G P 1 denotes a Brownian bridge, which is a Gaussian process on H with zero mean and a covariance To apply the theorem, we have to show that H satisfies the condition in Theorem 11. We firstly provide the following bound: Proof of Lemma 8. We simply obtain the following: as required. Lemma 9. Suppose the assumptions of Theorem 4 hold. Then, the followings hold: 2. For any ε > 0, the following holds: lim n→∞ E[n −1/2 log N (n −1/2 ε, H, · L 1 (U 2 n ) )] = 0. Proof of Lemma 9. For preparation, fix ε > 0 and set N = N (ε, H l , · L ∞ ). Also, let Q be an arbitrary finite discrete measure. Then, by the definition of a bracketing number, there exist N functions For the first condition, as shown in Equation (2.1.7) in Van der Vaart and Wellner (1996) , it is sufficient to show for arbitrary δ ∈ (0, 1). Here, c > 0 is some constant, and Q is taken from all possible finite discrete measure. To this end, it is sufficient to show that log N (ε, with constants C, C > 0. The first inequality follows Lemma 8 with the bounded property of f, f and Y. The second inequality follows the bounded condition of k in Theorem 10. Hence, the entropy condition shows the first statement. For the second condition, we have the similar strategy. For any h f ∈ H, we consider i ∈ {1, 2, . . . , N } such that f − f i L ∞ ≤ ε. Then, we measure the following value with a constant C > 0. Hence, we have since s ∈ (0, 1). From Theorem 11 and Lemma 9, we rewrite the central limit theorem utilizing terms of functionals. Note that ι −1 (U 2 n ), ι −1 (P 2 ) ∈ B S holds. Then, we can obtain Learning Map and Functional Delta Method: We consider a learning map S : Obviously, we havef = S λ (ι −1 (U 2 n )), and f * λ0 = S λ0 (ι −1 (P 2 )). We consider a derivative of S λ in the sense of the Gateau differentiation by the following steps. Firstly, we define a partial derivative of the map R Q 2 (f ). 
To investigate the optimality of the minimizer of To this end, we consider the following derivative ∇R Q 2 ,λ [f ] : H l → H l with a direction f as Here, ∂ f,1 h f is a partial derivative of h f in terms of the input f (x) as and ∂ f,2 h f follows it respectively. The following lemma validates the derivative: Lemma 10. If the assumptions in Theorem 4 hold, then ∇R Q 2 ,λ [f ] is a Gateau-derivative of R Q 2 ,λ with the direction f ∈ H l . Proof of Lemma 10. We consider a sequence of functions h n ∈ H l for n ∈ N, such that h n (x) = 0, ∀x ∈ X and h n L ∞ → 0 as n → ∞. Then, for f ∈ H l , a simple calculation yields The convergence follows the definition of h n and the absolute integrability of k, which follows the bounded property of k and compactness of Z. Then, we obtain the statement. Here, we consider its RKHS-type formulation of ∇R Q 2 ,λ , which is convenient to describe a minimizer. Let Φ l : X → H l be the feature map associated with the RKHS H l , such that Φ l [x], f H l = f (x) for any x ∈ X and f ∈ H l . Let ∇R Q 2 ,λ : H l → H l be an operator such that Obviously, ∇R Q 2 ,λ [f ](·) = ∇R Q 2 ,λ (f ), · H l . Now, we can describe the first-order condition of the minimizer of the risk. Namely, we can state thatf = argmin This equivalence follows Theorem 7.4.1 and Lemma 8.7.1 in Luenberger (1997). Next, we apply the implicit function theorem to obtain an explicit formula of the derivative of S. To this end, we consider a second-order derivative ∇ 2R Q 2 ,λ : H l → H l as which follows (b) in Lemma A.2 in Hable (2012). Its basic properties are provided in the following result: Lemma 11. If Assumption 1 and the assumptions in Theorem 4 hold, then ∇ 2R Q 2 ,λ is a continuous linear operator and it is invertible. We define the Gateau derivative of S. For a functional F ∈ ∞ (G 1 ∪ G 2 ), we define the following function where f Q 2 = S λ (ι −1 (Q 2 )) and Q 2 is a signed measure on W × W. Then, we provide the following derivative theorem: Proposition 1. Suppose the assumptions in Theorem 4 hold. For F ∈ B S , F ∈ ∞ (G 1 ∪ G 2 ), and s ∈ R such that F + sF ∈ B S , ∇S ι(F ),λ (F ) is a Gateau-derivative of S λ , namely, Proof of Proposition 1. This proof has the following two steps, (i) define a proxy operator Γ, then (ii) prove the statement by the implicit function theorem. (i) Define Γ: Note that ι(F ) exists since F + sF ∈ B S implies F ∈ B 0 . We define the following operator Γ(s, f, λ) : H l → H l for f ∈ H l : For simple derivatives, Lemma A.2 in Hable (2012) provides (ii) Apply the implicit function theorem: By its definition and the optimal conditions, we have Γ(s, f, λ) = 0 ⇔ f = S λ (ι(F ) + sι(F )). Also, we obtain Then, for each λ > 0, by the implicit function theorem, there exists a smooth map ϕ λ : R → H l such that Γ(s, ϕ λ (s), λ) = 0, ∀s, and it satisfies Also, we have ϕ λ (s) = S(Q 2 + sµ 2 ). Then, we have Then, we obtain the statement. Now, we are ready to prove Theorem 4 and Theorem 10. Proof of Theorem 10. As a preparation, we mention that S λ is differentiable in the Hadamard sense, which is Gateau differentiable by Proposition 1. Lemma A.7 and A.8 in Hable (2012) show that ∇S ι(F ),λ,ι(G) is Hadamard-differentiable for any λ, G and F . Then, we apply the functional delta method. As shown in Theorem 11 and Lemma 9, we have Hence, we obtain Utilizing the result, we can obtain in ∞ (H). The convergence follows the functional delta method. Proof of Theorem 4. 
This proof is completed by substituting the Central Limit Theorem part (Theorem 11) of the proof of Theorem 10. From Section 3 in Akritas et al. (1986) , the V-and U-processes have the same limit distribution asymptotically, so the same result holds. We present the close connection between the non-parametric model and the Gaussian process (GP) in this section. We will show that the RKHS solution of our objective function (7) is equivalent to the maximum function of the posterior distribution of the GP. The relationship is inspired by the similarity of our objective function to that of GLS, and it will be used to derive an efficient cross validation error. By Mercer theorem, the kernel l(x, x ) can be expanded as l( , a space of square-integrable real-valued functions, and N < ∞ for degenerate kernels and N = ∞ otherwise. To satisfy for arbitrary f ( we choose φ i (x) such that φ i is orthogonal in H l as the Mercer expansion is not unique. Based on the reproducing property, we can obtain φ i , φ j H l = δ ij λ −1 i where δ ij = 1, if i = j, and zero otherwise. Similar settings can be found in Walder and Bishop (2017), for example. Let us consider a GP over a space of functions is the feature vector and w is the parameter vector. We assume a prior distribution of w ∼ N (0, δΛ) 2 where Λ = diag([λ 1 , λ 2 , . . . , λ N ] is a diagonal matrix with eigenvalues λ 1 , . . . , λ N , and δ > 0 is a hyper-parameter, which plays the same role as the regularization hyper-parameter (as we show later). This definition is equivalent to assuming that on a set of input x, f (x) ∼ GP(0, δl(x, x)) because We refer the interested readers to Rasmussen and Williams (2005) for further details on Gaussian process models. Given the prior distribution on w, we aim to characterize the posterior distribution p is an i.i.d. sample of size n, and f (x) = Φ(x)w where the i's row of Φ(x) contains the feature vector of x i , namely, Φ(x i ) . To define the likelihood p(D | w), we recall from Lemma 1 that the risk R k (f ) can be expressed in terms of two independent copies of random variables (X, Y, Z) and (X , Y , Z ). To this end, let D := {(x i , y i , z i )} n i=1 be an independent sample of size n with an identical distribution to D. Given a pair of samples (x, y, z) and (x , y , z ) from D and D , respectively, we then define the likelihood as Hence, the likelihood on both D and D can be expressed as In practice, however, we only have access to a single copy of sample, i.e., D, but not D . One way of constructing D is through data splitting: the original dataset is split into two halves of equal size where the former is used to construct D and the latter is used to form D . Unfortunately, this approach reduces the effective sample size that can be used for learning. Alternatively, we propose to estimate the original likelihood (28) by using M -estimators: given the full dataset D = {(x i , y i , z i )} n i=1 , our approximated likelihood can be defined as The matrix K z , as a kernel matrix defined above, plays a role of correlating the residuals y i − f (x i ). Besides, the standard GP regression has a similar likelihood, which is Gaussian but the covariance matrix is computed on x rather than z. Based on the likelihood p(D | w), the maximum likelihood (ML) estimator hence coincides with the minimizer of the unregularized version of our objective (7). 
Combining the above likelihood function with the prior on w, the maximum a posteriori (MAP) estimate of w can be obtained by solving Specifically, setting the first derivative of (33) to zero yields the first-order condition Note that we expressŵ using the implicit expression. Given a test data point x * , the corresponding prediction can be obtained by computing the posterior expectation, i.e., where we defineα ≡ δK z (y − Φ(x) ŵ). The second equality holds becauseŵ is the posterior mean. The third equality is obtained by substituting (34) We can see that without using the representer theorem here, we get the same expression of the optimal solutionf as the one we obtain by invoking the representer theorem in Section 3.2. By substitutingα = δK z (y − Φ(x) ŵ) back into (34), we obtainŵ = ΛΦ(x) α. Furthermore, putting (35) into (33) also yields the following optimization problem: where L = l(x, x) is the kernel matrix. This problem has the same form as that of the V-statistic objective (8) except that (n 2 δ) −1 replaces λ. Therefore, we conclude that the predictive mean is equivalent to the optimal function of the V-statistic objective (8) given that (n 2 δ) −1 = λ, and δ plays the role of the regularization parameter. Such a GP interpretation is used for an elegant derivation of the efficient cross validation error in Section 3.3. In this section, we provide additional details about the experiments as well as more discussions on the experimental results presented in Section 6. The details of MMR-IV (Nyström) and MMR-IV (NN) algorithms are given in Algorithm 2 and Algorithm 3, respectively. In the experiments, we compare our algorithms to the following baseline algorithms: • DirectNN: A standard least square regression on X and Y using a neural network (NN). • 2SLS: A vanilla 2SLS on raw X and Z. • Poly2SLS: The 2SLS that is performed on polynomial features of X and Z via ridge regressions. • DeepIV (Hartford et al., 2017): A nonlinear extension of 2SLS using deep NNs. We use the implementation available at https://github.com/microsoft/EconML. Input: Dataset D = {(x, y, z)} n i=1 , kernel functions k and l with parameters θ k and θ l , regularization parameter λ, leave out data size M , Nyström approximation sample size M , LMOCV times m Output: predictive value at x * 1: 3: Input: Dataset D = {x, y, z}, kernel function k with parameters θ k , NN f NN with parameters θ NN , regularization parameter λ Output: predictive value at x * 1: For the experiments in Section 6.1, we consider both small-sample (n = 200) and large-sample (n = 2000) regimes, in which n points are sampled for training, validation and test sets, respectively. In both regimes, we standardize the values of Y to have zero mean and unit variance for numerical stability. Hyper-parameters of NNs in Algorithm 3, including the learning rate and the regularization parameter, are chosen by 2-fold CV for fair comparisons with the baselines. These hyper-parameters are provided in Appendix D.4. Besides, the parameter selection of baselines has been tuned reasonably well in the package provided by Bennett et al. (2019), so we use them without changes. We fix the random seed to 527 for all data generation and model initialization. In contrast to our NN-based method, the RKHS-based method in Algorithm 2 has the analytic form of CV error. We combine the training and validation sets to perform leave-2-out CV to select parameters of the kernel l and the regularization parameter λ. 
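As a rough, brute-force stand-in for the analytic leave-M-out error (useful for sanity checks, not the paper's closed-form procedure), one can score each candidate pair (λ, σ_l) by the held-out V-statistic MMR risk over a small number of folds. The sketch below reuses the hypothetical gram, mmr_iv_rkhs_fit, and mmr_iv_rkhs_predict helpers from the earlier sketch and assumes x and z are (n, d) arrays.

```python
import numpy as np
from itertools import product

def heldout_mmr_risk(f_vals, y, K_z_val):
    """Validation criterion: V-statistic MMR risk of the held-out residuals."""
    r = y - f_vals
    return r @ (K_z_val / len(y)**2) @ r

def select_hyperparams(x, y, z, lambdas, sigmas_l, sigma_k, n_folds=2, seed=527):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best, best_score = None, np.inf
    for lam, sig_l in product(lambdas, sigmas_l):
        score = 0.0
        for i in range(n_folds):
            val = folds[i]
            tr = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            alpha = mmr_iv_rkhs_fit(x[tr], y[tr], z[tr], lam, sig_l, sigma_k)
            f_val = mmr_iv_rkhs_predict(alpha, x[tr], x[val], sig_l)
            score += heldout_mmr_risk(f_val, y[val], gram(z[val], z[val], sigma_k))
        if score < best_score:
            best, best_score = (lam, sig_l), score
    return best  # e.g. select_hyperparams(x, y, z, [1e-4, 1e-3, 1e-2], [0.5, 1.0, 2.0], 1.0)
```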
We choose the sum of Gaussian kernels for k and the Gaussian kernel for l. For the Nyström approximation, we subsample 300 points from the combined set. As a small subset of the Gram matrix is used as Nyström samples, which misses much information of data, we avoid outliers in test results by averaging 10 test errors with different Nyström approximations in each experiment. All methods are repeated 10 times on each dataset with random initialization. Detailed comments on the experimental results. First, under the influence of confounders, DirectNN performs worst as it does not use instruments. Second, MMR-IVs perform reasonably well in both small-sample and large-sample regimes. For the linear function, 2SLS and Poly2SLS tend to outperform other algorithms .019 ± .003 .004 ± .001 .292 ± .024 .075 ± .008 MMR-IV (RKHS) .030 ± .000 .011 ± .000 .075 ± .000 .057 ± .000 as the linearity assumption is satisfied in this case. For non-linear functions, some NN based methods show competitive performance in certain cases. Notably, GMM+NN has unstable performance as the function h in (2) is designed manually. KernelIV performs quite well but not as well as our method. Moreover, the performances of the complicated methods like DeepIV and DeepGMM deteriorate more in the large-sample regimes than the small-sample regimes. We suspect that this is because these methods rely on two NNs and are thus sensitive to different hyper-parameters. In contrast, MMR-IV (Nyström) has the advantage of adaptive hyper-parameter selection. For experiments in Section 6.2, we sample n = 10, 000 points for the training, validation, and test sets, and run each method 10 times. For MMR-IV (Nyström), we run it only on the training set to reduce computational workload, and use principal component analysis (PCA) to reduce dimensions of X from 728 to 8. We still use the sum of Gaussian kernels for k(z, z ), but use the automatic relevance determination (ARD) kernel for l(x, x ). We omit KernelIV in this experiment since its kernel parameter selection is not suitable for image data. Detailed comments on the experimental results. Like Bennett et al. (2019), we observe that the AGMM code crashes, and the DeepIV code returns NaN when X is high-dimensional. Besides, we run Ridge2SLS, which is Poly2SLS with fixed linear degree, in place of Poly2SLS to reduce computational workload. 2SLS has large errors for high-dimensional X because the first-stage regression from Z to X is ill-posed. The average performance of GMM+NN suggests that manually designed functions for instruments is insufficient to extract useful information. Furthermore, MMR-IV (NN) performs competitively across scenarios. On MNIST Z , MMR-IVs perform better than other methods, which implies using the sum of Gaussian kernels for the kernel k is proper. DeepGMM has competitive performance as well. On MNIST X and MNIST XZ , MMR-IV (NN) outperforms all other methods. The ARD kernel with PCA of MMR-IV (Nyström) fails because the features from PCA are not representative enough. In addition, we observe that DeepGMM often produces results that are unreliable across all settings. We suspect that hard-to-optimize objective and complicated optimization procedure are the causes. Compared with DirectNN, most of baselines can hardly deal with high-dimensional structured instrumental regressions. 
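For reference, the synthetic designs of Sections 6.1 and 6.2 can be reproduced along the following lines. The structural equations themselves were lost in extraction, so the sketch assumes the design of Bennett et al. (2019), X = Z_1 + e + γ and Y = f*(X) + e + δ, which is consistent with the stated distributions of Z, e, γ, and δ; the digit-mapping function π is quoted from Section 6.2, and the MNIST image lookup RI(·) is left as a stub.

```python
import numpy as np

def generate_low_dim(n, f_star, seed=527):
    """Assumed low-dimensional design (Bennett et al., 2019):
       Z ~ Uniform([-3, 3]^2), e ~ N(0, 1), gamma, delta ~ N(0, 0.1^2),
       X = Z_1 + e + gamma, Y = f*(X) + e + delta. Only Z_1 is a relevant instrument."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(-3, 3, size=(n, 2))
    e = rng.normal(0, 1, n)
    x = z[:, 0] + e + rng.normal(0, 0.1, n)       # reshape to (n, 1) before the kernel helpers
    y = f_star(x) + e + rng.normal(0, 0.1, n)
    return x, y, z

def pi(u):
    """Map a real input to a digit class in {0, ..., 9}: round(min(max(1.5u + 5, 0), 9))."""
    return np.round(np.clip(1.5 * u + 5, 0, 9)).astype(int)

def random_image(digit, mnist_by_class, rng):
    """Stub for RI(d): draw a random 784-dim MNIST image from the given digit class."""
    imgs = mnist_by_class[digit]
    return imgs[rng.integers(len(imgs))]

# Example: the MNIST_Z scenario replaces the instrument by an image of digit pi(Z_1^low).
x_low, y, z_low = generate_low_dim(10_000, np.abs)
digits = pi(z_low[:, 0])   # images would then be looked up with random_image(...)
```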
For the kernel function on instruments, we employ the sum of Gaussian kernels
$$k(z, z') = \frac{1}{3}\sum_{i=1}^{3} \exp\left(-\frac{\|z - z'\|_2^2}{2\sigma_{k_i}^2}\right),$$
and for l we use the Gaussian kernel $l(x, x') = \exp\left(-\|x - x'\|_2^2/(2\sigma_l^2)\right)$, where $\sigma_{k_1}$ is chosen as the median interpoint distance of $\{z_i\}_{i=1}^n$, $\sigma_{k_2} = 0.1\,\sigma_{k_1}$, and $\sigma_{k_3} = 10\,\sigma_{k_1}$. The motivation for such a kernel k is to optimize f over multiple kernels; we leave the parameter selection of k to future work (a short code sketch of this construction is given below).

Table 4: Model architectures and hyper-parameter grids for MMR-IV (NN). All settings share the learning-rate grid (10^{-12}, 10^{-11}, 10^{-10}, 10^{-9}, 10^{-8}, 10^{-7}, 10^{-6}) and the λ grid (5 × 10^{-5}, 10^{-4}, 2 × 10^{-4}).

Setting         | Model f
Low Dimensional | FCNN(100,100)
MNIST$_Z$       | FCNN(100,100)
MNIST$_X$       | CNN
MNIST$_{XZ}$    | CNN
Mendelian       | FCNN(100,100)

Depending on the dimensionality of X, we parametrize f either as a fully connected neural network with leaky-ReLU activations and two hidden layers of 100 units each for non-image data, or as a deep convolutional neural network (CNN) for the MNIST data. We denote the fully connected network by FCNN(100,100) and refer readers to our code release for the exact details of the CNN architecture. The learning rates and regularization parameters are summarized in Table 4.

For the experiments in Section 6.3, KernelIV also achieves competitive and stable performance across all settings, though not as good as ours. DirectNN is always among the worst approaches in all settings, as no instrument is used. Poly2SLS performs accurately in the last two experiments, but exhibits significant instability with respect to the number of instruments in Figure 3 (left) because its hyper-parameter selection fails. In Figure 3 (middle) and Figure 3 (right), we observe that the performance of most approaches deteriorates as the effect of the confounders becomes stronger; MMR-IV (Nyström) is more sensitive to c_2 than to c_1.

For the vitamin D experiment, we normalize each variable to have zero mean and unit variance to reduce the influence of the different scales. We consider two cases: (i) without instrument and (ii) with instruments. By "without instrument" we mean that the $W_V$ matrix of MMR-IV (Nyström) is replaced by an identity matrix. Following Sjolander and Martinussen (2019), we assess the effect of vitamin D (exposure) on mortality rate (outcome), control for age in the analyses, and use filaggrin as the instrument. The raw vitamin D, age, and death data are illustrated in Figure 5. We randomly select 300 Nyström samples (random seed 527) and use leave-2-out cross-validation to select the hyper-parameters. In this experiment, we use age as a control variable by considering a structural equation model similar to model (1), except for the presence of the controlled (exogenous) variable C:
$$Y = f(X, C) + \varepsilon, \qquad X = t(Z, C) + g(\varepsilon) + \nu,$$
where $\mathbb{E}[\varepsilon] = 0$ and $\mathbb{E}[\nu] = 0$. We further assume that the instrument Z satisfies the following three conditions: (i) Relevance: Z has a causal influence on X; (ii) Exclusion restriction: Z affects Y only through X, i.e., $Y \perp\!\!\!\perp Z \mid X, \varepsilon, C$; (iii) Unconfounded instrument(s): Z is conditionally independent of the error, i.e., $\varepsilon \perp\!\!\!\perp Z \mid C$. Unlike the conditions specified in the main text, (ii) and (iii) also involve the controlled variable C.
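The sketch below illustrates the kernel construction described above: a sum of three Gaussian kernels on the instruments whose bandwidths are the median interpoint distance and 0.1× / 10× that value. The helper names are hypothetical; this is a sketch of the stated construction, not the released code.

```python
# Minimal sketch of the instrument kernel: a sum of three Gaussian kernels with
# bandwidths sigma_k1 (median interpoint distance), 0.1*sigma_k1, and 10*sigma_k1.
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic(Z):
    """Median of the pairwise Euclidean distances between instrument samples."""
    return np.median(pdist(Z))

def sum_gaussian_kernel(A, B, sigma_k1):
    """k(z, z') = (1/3) * sum_i exp(-||z - z'||^2 / (2 sigma_{ki}^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    bandwidths = (sigma_k1, 0.1 * sigma_k1, 10.0 * sigma_k1)
    return sum(np.exp(-sq / (2 * s**2)) for s in bandwidths) / 3.0

# Example usage:
# sigma_k1 = median_heuristic(Z)
# K = sum_gaussian_kernel(Z, Z, sigma_k1)
```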
[Figure: Estimated effect of vitamin D level on mortality rate, controlled by age.]

Both plots depict normalized contours of $\hat{f}$ defined in Section D.6, where blue represents a low mortality rate and yellow a high one. We can divide each plot roughly into a left (young age) part and a right (old age) part. While the right parts convey similar information (i.e., a lower vitamin D level at an old age leads to a higher mortality rate), the left parts differ. In (a), a high level of vitamin D at a young age can result in a high mortality rate, which is counter-intuitive; a plausible explanation is that this is caused by unobserved confounders between vitamin D level and mortality rate. In (b), on the other hand, this spurious effect disappears when the filaggrin mutation is used as an instrument: a low vitamin D level at a young age has only a slight effect on mortality, but a more adverse effect at an old age (Meehan and Penckofer, 2014). This comparison demonstrates the benefit of an instrumental variable.

Proof of Lemma 11. By (b) of Lemma A.2 in Hable (2012), $\nabla^2 R_{Q_2,\lambda}$ is a continuous linear operator. In the following, we define the operator A and show that it is injective and compact. For the injectivity, we fix a non-zero $f \in \mathcal{H}_l$; the last equality in the resulting computation follows from the property of $\Phi_l$, and the last inequality follows from the ISPD property in Assumption 1. For the compactness, we follow Lemma A.5 in Hable (2012) and obtain that the operators constituting A, including $f \mapsto \int k(z, z') f(x') \Phi_l[x'](\cdot)\, dQ_2(u, u')$, are compact.