key: cord-0045857-sm202zuc
authors: Zou, Qing
title: Learning Functions Using Data-Dependent Regularization: Representer Theorem Revisited
date: 2020-05-22
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50420-5_23
sha: f9e7369d7d36a79ad720b8ccd3ff0eac5c322a43
doc_id: 45857
cord_uid: sm202zuc

We introduce a data-dependent regularization problem which uses the geometric structure of the data to learn functions from incomplete data. We also give another proof of the standard representer theorem while introducing the problem. At the end of the paper, two applications in image processing are used to illustrate the function learning framework.

Many machine learning problems involve learning multidimensional functions from incomplete training data. For example, a classification problem can be viewed as learning a function whose values give the classes to which the inputs belong. The direct representation of a function in a high-dimensional space often suffers from the curse of dimensionality: the large number of parameters in the representation translates into the need for extensive training data, which is expensive to obtain. However, many natural datasets exhibit considerable intrinsic structure, usually known as manifold structure, and this intrinsic structure can be used to improve the learning results. Assuming that the data lie on or close to a manifold has become more and more common in machine learning; this is called the manifold assumption. Although the theoretical reason why datasets have manifold structure is not fully understood, the assumption is useful for supervised learning and gives excellent performance. In this work, we exploit the manifold structure to learn functions from incomplete training data.

One of the main problems in numerical analysis is function approximation. During the last several decades, the following problem has usually been considered when applying the theory of function approximation to real-world problems:
$$\min_{f} \ \|Lf\|^2 \quad \text{subject to} \quad f(x_i) = y_i, \ i = 1, \dots, n, \qquad (1)$$
where $L$ is some linear operator, $\{(x_i, y_i)\}_{i=1}^{n} \subset X \times \mathbb{R}$ are the $n$ accessible observations, and $X \subset \mathbb{R}^d$, $d \ge 1$, is the input space.

We can use the method of Lagrange multipliers to solve Problem (1). Assume that the search space for the function $f$ is large enough (for example, the $L^2$ space). The Lagrangian function $C(f)$ is given by
$$C(f) = \|Lf\|^2 + \sum_{i=1}^{n} \eta_i \bigl(f(x_i) - y_i\bigr),$$
where the $\eta_i$ are Lagrange multipliers. Taking the gradient of the Lagrangian function with respect to the function $f$ gives
$$C'(f) = 2L^*Lf + \sum_{i=1}^{n} \eta_i\,\delta(\cdot - x_i),$$
where $\delta(\cdot - x)$ is the delta function and $L^*$ is the adjoint operator of $L$. Setting $C'(f) = 0$, we obtain
$$f(\cdot) = \sum_{i=1}^{n} a_i\, G(\cdot, x_i)$$
for some coefficients $a_i$, where $G(\cdot,\cdot)$ is the Green's function of the operator $L^*L$.

As machine learning has developed rapidly in recent years, kernel methods [1] have received much attention. Researchers found that working in the original data space often does not perform well, so the data are mapped into a high-dimensional space (the feature space) by some non-linear mapping (the feature map), where tasks such as classification become easier. When we talk about feature maps, one concept that is unavoidable is the kernel, which, loosely speaking, is the inner product of the features. Given a positive definite kernel, we have a corresponding reproducing kernel Hilbert space (RKHS) [2], denoted $\mathcal{H}_K$. We can then solve the problem analogous to (1) in the RKHS:
$$\min_{f \in \mathcal{H}_K} \ \|f\|_{\mathcal{H}_K}^2 \quad \text{subject to} \quad f(x_i) = y_i, \ i = 1, \dots, n.$$
A more feasible way is to consider a regularization problem in the RKHS:
$$\min_{f \in \mathcal{H}_K} \ \sum_{i=1}^{n} \bigl(f(x_i) - y_i\bigr)^2 + \lambda \|f\|_{\mathcal{H}_K}^2. \qquad (2)$$
The search space for $f$ is then $\mathcal{H}_K$, which is a Hilbert space.
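To make the idea that a kernel is "the inner product of the features" concrete, the following short Python sketch (our illustration, not code from the paper; the kernel choice and the point values are arbitrary) verifies that the homogeneous polynomial kernel $K(x, y) = (x^\top y)^2$ on $\mathbb{R}^2$ equals the inner product of an explicit feature map.

```python
import numpy as np

# For K(x, y) = (x^T y)^2 on R^2, an explicit feature map is
# Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), and K(x, y) = <Phi(x), Phi(y)>.

def poly_kernel(x, y):
    return float(np.dot(x, y)) ** 2

def feature_map(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(poly_kernel(x, y))                        # kernel value in the input space
print(np.dot(feature_map(x), feature_map(y)))   # same value via the feature space
```

Both print statements return 2.25, illustrating that the kernel computes the feature-space inner product without ever forming the features explicitly.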
Before solving Problem (2), we recall some basic concepts about the RKHS. Suppose we have a positive definite kernel $K : X \times X \to \mathbb{R}$, i.e.,
$$\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j K(x_i, x_j) \ge 0 \quad \text{for all } x_1, \dots, x_n \in X \text{ and } a_1, \dots, a_n \in \mathbb{R}.$$
Then $\mathcal{H}_K$ is the Hilbert space corresponding to the kernel $K(\cdot,\cdot)$. It is defined as the closure of all possible linear combinations of the kernel sections, i.e., $\mathcal{H}_K = \overline{\operatorname{span}}\{K(\cdot, x) : x \in X\}$. Thus, for any $f(\cdot) \in \mathcal{H}_K$, there exist points $x_i$ and coefficients $\alpha_i$ such that
$$f(\cdot) = \sum_i \alpha_i K(\cdot, x_i).$$

Since $\mathcal{H}_K$ is a Hilbert space, it is equipped with an inner product. The guiding principle in defining the inner product is to let $\mathcal{H}_K$ have the representer $K(\cdot, x)$, which behaves for functions in $\mathcal{H}_K$ the way the delta function behaves for functions in $L^2$ (note that the delta function itself is not in $L^2$). In other words, we want an analogue of the formula
$$\langle f, \delta(\cdot - x) \rangle_{L^2} = f(x).$$
This is called the reproducing relation or reproducing property. In $\mathcal{H}_K$, we want to define the inner product so that the reproducing relation holds:
$$\langle f, K(\cdot, x) \rangle_{\mathcal{H}_K} = f(x).$$
To achieve this goal, we can define
$$\Bigl\langle \sum_i \alpha_i K(\cdot, x_i),\ \sum_j \beta_j K(\cdot, y_j) \Bigr\rangle_{\mathcal{H}_K} = \sum_i \sum_j \alpha_i \beta_j K(x_i, y_j).$$
Then we have $\langle K(\cdot, x), K(\cdot, y) \rangle_{\mathcal{H}_K} = K(x, y)$. With the kernel, the feature map can be defined as $\Phi(x) = K(\cdot, x)$.

Having this knowledge about the RKHS, we can now look at the solution of Problem (2). It is characterized by the famous representer theorem, which states that the solution of Problem (2) is
$$f^*(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).$$
The standard proof of the representer theorem is well known and can be found in many places, see for example [3,4]. A drawback of the standard proof, however, is that it does not provide an expression for the coefficients $\alpha_i$. In the first part of this work, we provide another proof of the representer theorem. As a by-product, we also build a relation between Problem (1) and Problem (2).

To give another proof of the representer theorem, we first establish some relations between $\langle\cdot,\cdot\rangle_{\mathcal{H}_K}$ and $\langle\cdot,\cdot\rangle_{L^2}$. We endow the dataset $X$ with a measure $\mu$. The corresponding $L^2(X)$ inner product is
$$\langle f, g \rangle_{L^2(X)} = \int_X f(x)\, g(x)\, d\mu(x).$$
Consider the operator $\mathcal{L}$ acting on $f$ with respect to the kernel $K$:
$$(\mathcal{L}f)(x) = \int_X K(x, y)\, f(y)\, d\mu(y), \qquad (3)$$
which is the Hilbert-Schmidt integral operator [5]. This operator is self-adjoint, bounded and compact. By the spectral theorem [6], the eigenfunctions $e_1(x), e_2(x), \dots$ of the operator form an orthonormal basis of $L^2(X)$, i.e.,
$$\langle e_i, e_j \rangle_{L^2(X)} = \delta_{ij}.$$

With the operator $\mathcal{L}$ defined as in (3), we can look at the relations between $\langle\cdot,\cdot\rangle_{L^2(X)}$ and $\langle\cdot,\cdot\rangle_{\mathcal{H}_K}$. Suppose $e_i(x)$ are the eigenfunctions of the operator $\mathcal{L}$ and $\lambda_i$ are the corresponding eigenvalues; then
$$\int_X K(x, y)\, e_i(y)\, d\mu(y) = \lambda_i e_i(x). \qquad (4)$$
But by the reproducing relation, we have
$$\langle e_i, K(\cdot, x) \rangle_{\mathcal{H}_K} = e_i(x).$$

Now let us look at how to represent $K(x, y)$ by the eigenfunctions. We have the expansion
$$K(x, y) = \sum_i \lambda_i\, e_i(x)\, e_i(y),$$
and $\lambda_i$ can be computed by
$$\lambda_i = \int_X \int_X K(x, y)\, e_i(x)\, e_i(y)\, d\mu(x)\, d\mu(y).$$
To see that $K(x, y) = \sum_i \lambda_i e_i(x) e_i(y)$, we can plug the expansion into (4) to verify it:
$$\int_X \Bigl(\sum_i \lambda_i e_i(x) e_i(y)\Bigr) e_j(y)\, d\mu(y) = \sum_i \lambda_i e_i(x)\, \langle e_i, e_j \rangle_{L^2(X)} = \lambda_j e_j(x).$$

Since the eigenfunctions of $\mathcal{L}$ form an orthonormal basis of $L^2(X)$, any $f \in L^2$ can be written as $f = \sum_i a_i e_i(x)$. We will see that
$$\|f\|_{\mathcal{H}_K}^2 = \sum_i \frac{a_i^2}{\lambda_i},$$
while for the $L^2$ norm we have
$$\|f\|_{L^2(X)}^2 = \sum_i a_i^2.$$
Next we show that the orthonormal basis functions $e_i(x)$ lie in $\mathcal{H}_K$. Note that
$$e_i(\cdot) = \frac{1}{\lambda_i} \int_X K(\cdot, y)\, e_i(y)\, d\mu(y),$$
so, using the reproducing relation,
$$\langle e_i, e_j \rangle_{\mathcal{H}_K} = \frac{1}{\lambda_j} \int_X e_j(y)\, \langle e_i, K(\cdot, y) \rangle_{\mathcal{H}_K}\, d\mu(y) = \frac{1}{\lambda_j} \int_X e_j(y)\, e_i(y)\, d\mu(y) = \frac{\delta_{ij}}{\lambda_j}.$$
In particular, $\|e_i\|_{\mathcal{H}_K}^2 = 1/\lambda_i < \infty$, so each $e_i$ is in $\mathcal{H}_K$, and the expression for $\|f\|_{\mathcal{H}_K}^2$ above follows. We also need to investigate which functions in $L^2(X)$ belong to $\mathcal{H}_K$: for $f = \sum_i a_i e_i(x)$ to lie in $\mathcal{H}_K$, we need
$$\sum_i \frac{a_i^2}{\lambda_i} < \infty.$$
Combining all these analyses, we obtain the following relation between $\langle\cdot,\cdot\rangle_{L^2(X)}$ and $\langle\cdot,\cdot\rangle_{\mathcal{H}_K}$: for $f = \sum_i a_i e_i$ and $g = \sum_i b_i e_i$ in $\mathcal{H}_K$,
$$\langle f, g \rangle_{\mathcal{H}_K} = \sum_i \frac{a_i b_i}{\lambda_i} = \langle f, \mathcal{L}^{-1} g \rangle_{L^2(X)}, \qquad \text{whereas} \quad \langle f, g \rangle_{L^2(X)} = \sum_i a_i b_i.$$
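The eigen-expansion above can be checked numerically on a discretized domain. The following Python sketch (our illustration, not code from the paper) approximates the Hilbert-Schmidt operator of a Gaussian kernel on a uniform grid over [0, 1], using uniform weights 1/n for the integral, and verifies that $K(x_j, x_k) \approx \sum_i \lambda_i e_i(x_j) e_i(x_k)$ for the properly normalized discrete eigenfunctions $e_i$. The grid size, kernel, and bandwidth are arbitrary choices.

```python
import numpy as np

# Grid on [0, 1] and a Gaussian kernel; both are illustrative choices.
n = 200
x = np.linspace(0.0, 1.0, n)
sigma = 0.2
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

# Discretize the Hilbert-Schmidt operator (Lf)(x) = \int K(x, y) f(y) dmu(y)
# with the uniform measure mu, so the integral becomes a sum with weight 1/n.
L = K / n
lam, V = np.linalg.eigh(L)          # eigenvalues and orthonormal eigenvectors
lam, V = lam[::-1], V[:, ::-1]      # sort in decreasing order

# Discrete eigenfunctions normalized in L^2(mu): (1/n) * sum_j e_i(x_j)^2 = 1.
E = np.sqrt(n) * V

# Mercer-type reconstruction K(x_j, x_k) = sum_i lam_i e_i(x_j) e_i(x_k).
K_rec = (E * lam) @ E.T
print(np.max(np.abs(K - K_rec)))    # close to machine precision
```

With the same discretization, the coefficient $a_i$ of a function $f$ in the basis $\{e_i\}$ is approximated by $(1/n)\,\mathtt{E[:, i]}\cdot f$, and $\sum_i a_i^2/\lambda_i$ then approximates $\|f\|_{\mathcal{H}_K}^2$ in the sense described above.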
According to this relation, we can now give another proof of the representer theorem.

Proof. Suppose $e_1, e_2, \dots$ are the eigenfunctions of the operator $\mathcal{L}$. Then we can write the solution as
$$f^* = \sum_k a_k e_k(x).$$
We consider here a more general form of Problem (2),
$$\min_{f \in \mathcal{H}_K} \ \sum_{i=1}^{n} E\bigl(f(x_i), y_i\bigr) + \lambda \|f\|_{\mathcal{H}_K}^2,$$
where $E(\cdot,\cdot)$ is an error function that is differentiable with respect to each $a_i$, and we use the tools of the $L^2(X)$ space to obtain the solution. The cost function of the regularization problem is
$$C(f) = \sum_{i=1}^{n} E\bigl(f(x_i), y_i\bigr) + \lambda \|f\|_{\mathcal{H}_K}^2.$$
Substituting $f^*$ into the cost function, we have
$$C(f^*) = \sum_{i=1}^{n} E\Bigl(\sum_k a_k e_k(x_i),\, y_i\Bigr) + \lambda \sum_k \frac{a_k^2}{\lambda_k}.$$
Differentiating $C(f^*)$ with respect to each $a_k$ and setting the derivative equal to zero gives
$$\sum_{i=1}^{n} \frac{\partial E}{\partial f}\bigl(f^*(x_i), y_i\bigr)\, e_k(x_i) + \frac{2\lambda}{\lambda_k}\, a_k = 0.$$
Solving for $a_k$, we get
$$a_k = -\frac{\lambda_k}{2\lambda} \sum_{i=1}^{n} \frac{\partial E}{\partial f}\bigl(f^*(x_i), y_i\bigr)\, e_k(x_i).$$
Since $f^* = \sum_k a_k e_k(x)$, we have
$$f^*(x) = -\frac{1}{2\lambda} \sum_{i=1}^{n} \frac{\partial E}{\partial f}\bigl(f^*(x_i), y_i\bigr) \sum_k \lambda_k\, e_k(x_i)\, e_k(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i), \qquad \alpha_i = -\frac{1}{2\lambda}\,\frac{\partial E}{\partial f}\bigl(f^*(x_i), y_i\bigr).$$
This proves the representer theorem. Note that this argument not only proves the representer theorem but also gives the expression of the coefficients $\alpha_i$.

With the operator $\mathcal{L}$, we can also build a relation between Problem (1) and Problem (2). Define the operator in Problem (1) so that $L^*L$ is the inverse of the Hilbert-Schmidt integral operator (3); a discussion of the inverse of the Hilbert-Schmidt integral operator can be found in [8]. Note that for the delta function we then have
$$(L^*L)^{-1}\delta(\cdot - x) = \mathcal{L}\,\delta(\cdot - x) = \int_X K(\cdot, y)\,\delta(y - x)\, d\mu(y) = K(\cdot, x),$$
so the Green's function of $L^*L$ is exactly the kernel $K(\cdot,\cdot)$. The solution of Problem (1) therefore becomes
$$f(\cdot) = -\frac{1}{2} \sum_{i=1}^{n} \eta_i\,(L^*L)^{-1}\delta(\cdot - x_i) = \sum_{i=1}^{n} a_i\, K(\cdot, x_i),$$
by which we obtain that the solution of Problem (1) has exactly the kernel-expansion form given by the representer theorem for Problem (2).
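To make the proof concrete for the squared error $E\bigl(f(x_i), y_i\bigr) = \bigl(f(x_i) - y_i\bigr)^2$, the following Python sketch (ours; the Gaussian kernel, bandwidth, and toy data are illustrative assumptions, not the paper's settings) computes the representer-theorem coefficients by solving the linear system $(K + \lambda I)\alpha = y$ that results from the kernel expansion, and then checks them against the coefficient expression $\alpha_i = -\tfrac{1}{2\lambda}\,\partial E/\partial f\bigl(f^*(x_i), y_i\bigr)$ derived above.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.3):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))                      # toy inputs
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(40)
lam = 1e-2

# Representer theorem: f(x) = sum_i alpha_i K(x, x_i); for the squared error
# the coefficients solve the linear system (K + lam I) alpha = y.
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Check the coefficient formula from the proof: with E(f, y) = (f - y)^2,
# alpha_i = -(1 / (2 lam)) * dE/df(f*(x_i), y_i) = (y_i - f*(x_i)) / lam.
f_star = K @ alpha
alpha_check = (y - f_star) / lam
print(np.max(np.abs(alpha - alpha_check)))               # ~ machine precision
```

The agreement of the two coefficient computations is exactly the content of the coefficient expression obtained as a by-product of the proof.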
So far, we have introduced the standard representer theorem. As discussed at the very beginning, however, many natural datasets have manifold structure. Based on the classical Problem (2), we would therefore like to introduce a new learning problem which exploits the manifold structure of the data. We call it the data-dependent regularization problem. Regularization problems have a long history going back to Tikhonov [9], who proposed Tikhonov regularization to solve ill-posed inverse problems. To exploit the manifold structure of the data, we divide a function into two parts: the function restricted to the manifold and the function restricted outside the manifold. The problem can then be formulated as
$$\min_{f \in \mathcal{H}_K} \ \sum_{i=1}^{n} \bigl(f(x_i) - y_i\bigr)^2 + \alpha \|f_1\|_{\mathcal{M}}^2 + \beta \|f_2\|_{\mathcal{M}^c}^2, \qquad (5)$$
where $f_1 = f|_{\mathcal{M}}$ and $f_2 = f|_{\mathcal{M}^c}$. The norms $\|\cdot\|_{\mathcal{M}}$ and $\|\cdot\|_{\mathcal{M}^c}$ will be explained in detail below. The parameters $\alpha$ and $\beta$ control the degree to which the energy of the function on the manifold and outside the manifold is penalized. We will show later that, by controlling the two balancing parameters (setting $\alpha = \beta$), the standard representer theorem is recovered as a special case of Problem (5).

We now discuss the functions $f_1$ and $f_2$. Consider the ambient space $X \subset \mathbb{R}^n$ (or $\mathbb{R}^n$ itself) and a positive definite kernel $K$. Let us first look at the restriction of $K$ to the manifold $\mathcal{M} \subset X$. The restriction is again a positive definite kernel [2], and it therefore has a corresponding Hilbert space. We consider the relation between the RKHS $\mathcal{H}_K$ and the restricted RKHS in order to explain the norms $\|\cdot\|_{\mathcal{M}}$ and $\|\cdot\|_{\mathcal{M}^c}$.

Lemma 1. Let $K : X \times X \to \mathbb{R}$ (or $\mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$) be a positive definite kernel and let $\mathcal{M}$ be a subset of $X$ (or $\mathbb{R}^n$). Define
$$S(\mathcal{M}) = \{ f|_{\mathcal{M}} : f \in \mathcal{H}_K \}, \qquad \|f_r\|_{S(\mathcal{M})} = \min\{ \|f\|_{\mathcal{H}_K} : f \in \mathcal{H}_K,\ f|_{\mathcal{M}} = f_r \}.$$
Then $\bigl(S(\mathcal{M}), \|\cdot\|_{S(\mathcal{M})}\bigr)$ is a reproducing kernel Hilbert space whose kernel is the restriction of $K$ to $\mathcal{M} \times \mathcal{M}$, i.e.,
$$f_r(x) = \langle f_r, K(\cdot, x)|_{\mathcal{M}} \rangle_{S(\mathcal{M})} \quad \text{for all } f_r \in S(\mathcal{M}),\ x \in \mathcal{M}. \qquad (6)$$

Proof. Define the set $A = \{ \|f\|_{\mathcal{H}_K} : f \in \mathcal{H}_K,\ f|_{\mathcal{M}} = f_r \}$. We first show that $A$ attains a minimum for any $f_r \in S(\mathcal{M})$. Choose a minimizing sequence $\{f_i\}_{i=1}^{\infty} \subset \mathcal{H}_K$ with $f_i|_{\mathcal{M}} = f_r$. The sequence is bounded in the Hilbert space $\mathcal{H}_K$, so by the Banach-Alaoglu theorem [11] it is reasonable to assume that it converges weakly. Weak convergence implies pointwise convergence by the reproducing property, so the weak limit restricts to $f_r$ on $\mathcal{M}$ and attains the minimum of $A$. We then define $\|f_r\|_{S(\mathcal{M})} = \min A$.

We next show that $\bigl(S(\mathcal{M}), \|\cdot\|_{S(\mathcal{M})}\bigr)$ is a Hilbert space by verifying the parallelogram law. In other words, we show that
$$\|f_1 + g_1\|_{S(\mathcal{M})}^2 + \|f_1 - g_1\|_{S(\mathcal{M})}^2 = 2\|f_1\|_{S(\mathcal{M})}^2 + 2\|g_1\|_{S(\mathcal{M})}^2.$$
Since $\|f_r\|_{S(\mathcal{M})} = \min A$, for all $f_1, g_1 \in S(\mathcal{M})$ there exist minimal-norm extensions $f, g \in \mathcal{H}_K$ with $f|_{\mathcal{M}} = f_1$, $g|_{\mathcal{M}} = g_1$, $\|f_1\|_{S(\mathcal{M})} = \|f\|_{\mathcal{H}_K}$ and $\|g_1\|_{S(\mathcal{M})} = \|g\|_{\mathcal{H}_K}$. Since $f + g$ and $f - g$ are extensions of $f_1 + g_1$ and $f_1 - g_1$,
$$\|f_1 + g_1\|_{S(\mathcal{M})}^2 + \|f_1 - g_1\|_{S(\mathcal{M})}^2 \le \|f + g\|_{\mathcal{H}_K}^2 + \|f - g\|_{\mathcal{H}_K}^2 = 2\|f\|_{\mathcal{H}_K}^2 + 2\|g\|_{\mathcal{H}_K}^2 = 2\|f_1\|_{S(\mathcal{M})}^2 + 2\|g_1\|_{S(\mathcal{M})}^2.$$
The reverse inequality is obtained in the same way, starting from minimal-norm extensions of $f_1 + g_1$ and $f_1 - g_1$ and applying the parallelogram law in $\mathcal{H}_K$. Therefore the parallelogram law holds in $S(\mathcal{M})$.

Next, we show (6). Choose $f \in \mathcal{H}_K$ such that $f_r = f|_{\mathcal{M}}$ and $\|f_r\|_{S(\mathcal{M})} = \|f\|_{\mathcal{H}_K}$; this is possible because of the analysis above. In particular, we have $f_r(x) = f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}_K}$ for every $x \in \mathcal{M}$. Now, for any function $g \in \mathcal{H}_K$ such that $g|_{\mathcal{M}} = 0$, we have $\langle g, f \rangle_{\mathcal{H}_K} = 0$: otherwise $f$ could be perturbed by a multiple of $g$ to produce an extension of $f_r$ with smaller norm, contradicting minimality. Thus the $S(\mathcal{M})$ inner product of restrictions agrees with the $\mathcal{H}_K$ inner product of their minimal-norm extensions, and
$$\langle f_r, K(\cdot, x)|_{\mathcal{M}} \rangle_{S(\mathcal{M})} = \langle f, K(\cdot, x) \rangle_{\mathcal{H}_K} = f(x) = f_r(x) \quad \text{for } x \in \mathcal{M}.$$
This completes the proof of the lemma.

With this lemma, the solution of Problem (5) becomes easy to obtain. By the representer theorem we mentioned before, we can conclude that the solution of (5) is exactly
$$f^*(x) = \sum_{i=1}^{n} a_i K(x, x_i),$$
where the coefficients $a_i$ are controlled by the parameters $\alpha$ and $\beta$.

With the norms $\|\cdot\|_{\mathcal{M}}$ and $\|\cdot\|_{\mathcal{M}^c}$ well defined, we now seek the relation between $\|\cdot\|_{\mathcal{M}}$, $\|\cdot\|_{\mathcal{M}^c}$ and $\|\cdot\|_{\mathcal{H}_K}$. Before stating the relation, we restate some notation to make the statement clearer. Let $K_1 = K|_{\mathcal{M} \times \mathcal{M}}$ and $K_2 = K|_{\mathcal{M}^c \times \mathcal{M}^c}$ denote the restricted kernels, so that $\|\cdot\|_{\mathcal{M}}$ and $\|\cdot\|_{\mathcal{M}^c}$ are the norms of the corresponding restricted RKHSs. To find the relation between $\|\cdot\|_{\mathcal{M}}$, $\|\cdot\|_{\mathcal{M}^c}$ and $\|\cdot\|_{\mathcal{H}_K}$, we need to pull the restricted kernels $K_1$ and $K_2$ back to the original space. To do so, define pullback kernels $K_1^p$ and $K_2^p$ on the original space such that $K = K_1^p + K_2^p$. The corresponding Hilbert spaces for $K_1^p$ and $K_2^p$ are denoted $\mathcal{H}_{K_1^p}$ and $\mathcal{H}_{K_2^p}$. The following lemma shows the relation between $\|\cdot\|_{\mathcal{H}_{K_1^p}}$, $\|\cdot\|_{\mathcal{H}_{K_2^p}}$ and $\|\cdot\|_{\mathcal{H}_K}$, which also reveals the relation between $\|\cdot\|_{\mathcal{M}}$, $\|\cdot\|_{\mathcal{M}^c}$ and $\|\cdot\|_{\mathcal{H}_K}$ by the Moore-Aronszajn theorem [12].

Lemma 2. Suppose $K_1, K_2 : Y \times Y \to \mathbb{R}$ (or $\mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$) are two positive definite kernels. If $K = K_1 + K_2$, then for every $f \in \mathcal{H}_K$,
$$\|f\|_{\mathcal{H}_K}^2 = \min\bigl\{ \|f_1\|_{\mathcal{H}_{K_1}}^2 + \|f_2\|_{\mathcal{H}_{K_2}}^2 \; : \; f = f_1 + f_2,\ f_1 \in \mathcal{H}_{K_1},\ f_2 \in \mathcal{H}_{K_2} \bigr\}.$$
The idea of the proof of this lemma is exactly the same as the one for Lemma 1, so we omit it here.

A direct corollary of this lemma is the following.

Corollary 1. The identity above holds in particular for the decomposition $K = K_1^p + K_2^p$, i.e.,
$$\|f\|_{\mathcal{H}_K}^2 = \min\bigl\{ \|f_1\|_{\mathcal{H}_{K_1^p}}^2 + \|f_2\|_{\mathcal{H}_{K_2^p}}^2 \; : \; f = f_1 + f_2 \bigr\}.$$

If we go back to our scenario, we can get the following result by Corollary 1 and Lemma 1:
$$\|f\|_{\mathcal{H}_K}^2 = \min\bigl\{ \|f_1\|_{\mathcal{M}}^2 + \|f_2\|_{\mathcal{M}^c}^2 \; : \; f = f_1 + f_2 \bigr\}.$$
This means that if we set $\alpha = \beta$ in Problem (5), it reduces to Problem (2). Therefore, the standard representer theorem is a special case of our data-dependent regularization problem (5).

As we said in the introduction, many engineering problems can be viewed as learning multidimensional functions from incomplete data. In this section, we show two applications of function learning: image interpolation and patch-based image denoising.

Image interpolation tries to best approximate the color and intensity of a pixel based on the values at the surrounding pixels; see Fig. 1 for an illustration. From the function learning perspective, image interpolation is the problem of learning a function from the known pixels and their corresponding positions. We use the Lena image shown in Fig. 2(a) to give an example of image interpolation with the proposed framework; the zoomed image is shown in Fig. 2(d). In the image interpolation example, the two balancing parameters are set to be equal, and the Laplacian kernel [13], $K(x, y) = \exp(-\|x - y\|/\sigma)$, is used. Note that we can also use other kernels, for example the polynomial kernel and the Gaussian kernel, to perform image interpolation. Choosing the right kernel is an interesting problem in itself, and we do not have enough space to compare different kernels in this paper. In Fig. 2(b), we downsampled the original image by a factor of 3 in each direction; the zoomed image is shown in Fig. 2(e). The interpolation result and its zoomed version are shown in Fig. 2(c) and Fig. 2(f).

From the function learning point of view, the patch-based image denoising problem can be viewed as learning a function that maps noisy patches to their "noise-free" center pixels; see Fig. 3 for an illustration. In the patch-based image denoising application, we use the Laplacian kernel as well. We assume that the noisy patches lie close to some manifold, so we set the balancing parameter which controls the energy outside the manifold to be large. We use the images in Fig. 4 as known data to learn the function. Then, for a given noisy image, we can use the learned function to perform denoising. To speed up the learning process, we randomly choose only 10% of the known data to learn the function. We use the image Baboon to test the learned denoising function. The denoising results are shown in Fig. 5; each column shows the result corresponding to one noise level.
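To indicate how the patch-based denoising application can be set up computationally, the following Python sketch is a simplified stand-in, not the paper's implementation: it drops the data-dependent split (i.e., it treats the $\alpha = \beta$ case as plain kernel regression of Problem (2)), replaces the training images of Fig. 4 by a synthetic image, and uses illustrative values for the patch size, the Laplacian-kernel bandwidth, and the regularization weight.

```python
import numpy as np

def laplacian_kernel(A, B, sigma=1.0):
    """K[i, j] = exp(-||A_i - B_j|| / sigma)."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    return np.exp(-d / sigma)

rng = np.random.default_rng(0)

# Synthetic clean "training image" and its noisy version (illustrative only).
h, w, p = 40, 40, 5                      # image size and (odd) patch size
r = p // 2
rows, cols = np.mgrid[0:h, 0:w]
clean = np.sin(rows / 6.0) * np.cos(cols / 4.0)
noisy = clean + 0.2 * rng.standard_normal(clean.shape)

def extract_patches(img):
    """All p x p patches (flattened), row-major over valid centers."""
    return np.array([img[i - r:i + r + 1, j - r:j + r + 1].ravel()
                     for i in range(r, h - r) for j in range(r, w - r)])

X_patches = extract_patches(noisy)                    # noisy patches (inputs)
y_centers = clean[r:h - r, r:w - r].ravel()           # clean center pixels (targets)

# Subsample the training pairs (the paper uses about 10% of the known data).
idx = rng.choice(len(X_patches), size=len(X_patches) // 10, replace=False)
Xtr, ytr = X_patches[idx], y_centers[idx]

# Kernel regression from noisy patch to clean center pixel.
lam = 1e-2
K = laplacian_kernel(Xtr, Xtr)
coef = np.linalg.solve(K + lam * np.eye(len(Xtr)), ytr)

# Denoise: apply the learned function to every patch of the noisy image.
denoised = (laplacian_kernel(X_patches, Xtr) @ coef).reshape(h - 2 * r, w - 2 * r)
print(np.abs(denoised - clean[r:h - r, r:w - r]).mean())
```

Incorporating the data-dependent penalty with $\alpha \neq \beta$ would additionally require the approximation of $\|f_1\|_{\mathcal{M}}^2$ and $\|f_2\|_{\mathcal{M}^c}^2$ discussed in the conclusion below.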
In this paper, we introduced a framework for learning functions from part of the data. We gave a data-dependent regularization problem which helps us learn a function using the manifold structure of the data, and we used two applications to illustrate the learning framework. These two applications, however, are only special cases of the data-dependent regularization problem. For a general application, we need to compute $\|f_1\|_{\mathcal{M}}^2$ and $\|f_2\|_{\mathcal{M}^c}^2$, which is hard to do since we only have partial data. We therefore need to approximate $\|f_1\|_{\mathcal{M}}^2$ and $\|f_2\|_{\mathcal{M}^c}^2$ from incomplete data and to propose a new learning algorithm so that our framework can be used in general applications. This is part of our future work. Another line of future work is theoretical. We showed that the solution of the data-dependent regularization problem is a linear combination of kernel sections, so it can be viewed as a function approximation result. Since the learned function is an approximation, we can then consider the error analysis of the approximated function.

References

Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
Theory of reproducing kernels
A generalized representer theorem
When is there a representer theorem? Vector versus matrix regularizers
Hilbert-Schmidt operators
Introduction to Spectral Theory in Hilbert Space
Manifold regularization: a geometric framework for learning from labeled and unlabeled examples
Regularization of incorrectly posed problems
Theory of Reproducing Kernels and Applications
Adaptive learning for symbol detection: a reproducing kernel Hilbert space approach
Kernel Functions for Machine Learning Applications