Learning via variably scaled kernels
Campi, C.; Marchetti, F.; Perracchione, E.
Adv Comput Math, 2021-06-26. DOI: 10.1007/s10444-021-09875-6

Abstract. We investigate the use of the so-called variably scaled kernels (VSKs) for learning tasks, with a particular focus on support vector machine (SVM) classifiers and kernel regression networks (KRNs). Concerning the kernels used to train the models, under appropriate assumptions, the VSKs turn out to be more expressive and more stable than the standard ones. Numerical experiments and applications to breast cancer and coronavirus disease 2019 (COVID-19) data support our claims. For the practical implementation of the VSK setting, we need to select a suitable scaling function. To this aim, we propose different choices, including, for SVMs, a probabilistic approach based on the naive Bayes (NB) classifier. For the classification task, we also show numerically that the VSKs inspire an alternative scheme to the sometimes computationally demanding feature extraction procedures.

In the context of approximation theory, the variably scaled kernels (VSKs) were introduced in 2015 in [6]. The basic idea behind them is to map the initial set of examples via a scaling function and construct an augmented approximation space. Our main contribution consists in linking the VSKs to the field of machine learning, as the VSKs have a long-known equivalent in pattern analysis. Precisely, many methods based on feature augmentation, such as, e.g., zero padding and feature replication [9, 21, 27], fall into the general VSK setting that we are going to investigate. Focusing on kernel learning methods, and specifically on KRNs and SVMs (see, e.g., [16, 42]), we give a very general formulation of feature augmentation schemes via VSKs. In doing so, we turn our attention to the Gaussian and linear kernels, which are truly popular in learning applications. We provide theoretical results concerning their practical implementation and expressiveness [13], and we further analyze the spectrum of the kernel matrices constructed via VSKs. This study reveals the effectiveness of the proposed approach especially for the Gaussian kernel; indeed, the condition number of the VSK kernel matrix is less than or equal to the condition number of the matrix constructed via the standard kernel. This fact turns out to be relevant for KRNs, where one may need to compute the inverse of the kernel matrix, which is usually affected by severe ill-conditioning. Moreover, for the selection of the scaling function of the KRN-VSK, one can refer to the available literature in the context of approximation theory [10, 36]. Indeed, the scaling function might be selected so that it mimics the samples, and this might lead to an improvement in terms of accuracy and/or stability (see, e.g., [6, 10, 11]). Here, in particular, we propose to use a non-linear fit of the underlying function itself as an augmented feature. While for the KRN-VSK we can refer to the available literature for selecting the scaling function, for the SVM-VSK we consider a probabilistic solution. More precisely, focusing on binary classification problems, we first note that the VSK setting induces new feature maps and spaces that depend on the scaling function associated with the VSK.
To be competitive with the accuracy of the classical SVMs, as well as with other common classifiers, we have to select a suitable scaling function for the VSKs. To this aim, we remark that the SVM is characterized by a geometric point of view. Nevertheless, methods based on probability distributions, such as the NB classifiers, might outperform the SVM. For that reason, many efforts have been devoted to investigating which classifier performs better and under which conditions; for a general overview refer, e.g., to [7, 31, 47]. In this work, we thus fuse SVM and NB classifiers by means of VSKs, so that the mixed approach takes into account the probabilistic features of the NB algorithm and classifies geometrically with the SVM. Moreover, we conclude the paper by presenting a feature extraction algorithm that is inspired by the VSK framework and that might be considered in place of other feature extraction schemes; refer, e.g., to [19, 45]. The paper is organized as follows. In Section 2, we briefly review the use of kernels in the machine learning literature. In Section 3, we investigate the VSKs for two learning methods, specifically SVMs and KRNs. Then, in Sections 4 and 5, we turn our attention to the Gaussian and linear VSKs, as well as to the problem of selecting the scaling function. Section 6 is devoted to numerical experiments with both toy models and a real dataset. In Section 7, we present a feature extraction algorithm whose underlying idea is derived from the study of the variably scaled setting. The last section deals with conclusions and work in progress.

We consider a learning problem with training examples (x_i, y_i), i = 1, ..., N, where x_i ∈ Ω ⊆ R^n and y_i ∈ R. For the particular case of the classification setting, we fix y_i ∈ {−1, +1}. For both SVMs and KRNs, we turn our attention to (strictly) positive definite kernels κ : Ω × Ω −→ R, where Ω is a bounded set, that can be decomposed via Mercer's theorem as explained below (see, e.g., [15, Theorem 2.2, p. 107] or [26]). For such a Mercer kernel, we have the expansion

κ(x, y) = Σ_{k≥0} λ_k ρ_k(x) ρ_k(y),  x, y ∈ Ω,

where {λ_k}_{k≥0} are the (non-negative) eigenvalues and {ρ_k}_{k≥0} are the (L_2-orthonormal) eigenfunctions of the operator T : L_2(Ω) −→ L_2(Ω) given by

(Tf)(x) = ∫_Ω κ(x, y) f(y) dy,  f ∈ L_2(Ω), x ∈ Ω.

Moreover, such an expansion is absolutely and uniformly convergent. We point out that many relevant kernels, e.g., cases where Ω is unbounded or non-measurable, do not fall into the above Mercer decomposition. Thus, on the one hand, taking only Mercer kernels might be restrictive. On the other hand, it provides the adequate background for our purposes and it offers an easy way to introduce feature maps and spaces. Indeed, for kernels that admit a Mercer expansion (also called valid kernels according to the definition given in [42]), it is worth noting that we can interpret the series representation in terms of an inner product in the so-called feature space F, which is a Hilbert space. Indeed,

κ(x, y) = ⟨Φ(x), Φ(y)⟩_F,  x, y ∈ Ω,  (2.1)

where Φ : Ω −→ F is a feature map. For a given kernel, the feature map and space are not unique. A possible choice is to take the map Φ(x) = κ(·, x), which is linked to the characterization of F as a reproducing kernel Hilbert space; see [16, 42] for further details.
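As a finite-sample illustration of the feature-map viewpoint (ours, not code from the paper), the eigendecomposition of a Gram matrix yields empirical feature vectors whose Euclidean inner products reproduce the kernel values. The data, kernel, and parameter choices below are arbitrary placeholders.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.random((30, 2))                       # samples in Omega, a subset of R^2
K = rbf_kernel(X, X, gamma=1.0)               # Gaussian Gram matrix on the samples

# Spectral decomposition K = V diag(lam) V^T: the eigenvalues are non-negative,
# mirroring the Mercer expansion at the discrete level.
lam, V = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)                 # guard against tiny negative round-off
Phi = V * np.sqrt(lam)                        # empirical feature vectors, one per row

# The kernel acts as an inner product in this finite-dimensional feature space.
print(np.allclose(Phi @ Phi.T, K))            # True (up to round-off)
```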
In the classification context, many studies are devoted to investigating and measuring the complexity of a chosen model, such as the so-called VC dimension [44] and the empirical Rademacher complexity [4]. The complexity of a method is usually referred to as capacity or expressiveness. Indeed, complex models have the capability to perform complex tasks, by determining elaborate decision functions, and thus to express sophisticated links between the data. In any case, the capacity of a method needs to be tailored to the considered task, in order to avoid overfitting; for a general overview, we refer the reader, e.g., to [39]. To better investigate the concept of expressiveness in the kernel setting, we introduce the kernel matrix K constructed via the dataset Ξ = {x_1, ..., x_N} ⊆ Ω, i.e., the matrix of entries

(K)_{ij} = κ(x_i, x_j),  i, j = 1, ..., N,

where κ is a (strictly) positive definite kernel. Note that if κ is a strictly positive definite kernel then K is positive definite, while it is positive semi-definite if κ is a positive definite kernel. The expressiveness of a kernel-based model is related to the number of dichotomies achievable by a linear separator in the feature space. Moreover, concerning the rank of the kernel matrix, we have the following result (see, e.g., [42]).

Remark 1 If the kernel matrix K constructed on Ξ has full rank, then any dichotomy of Ξ can be realized by a linear separator in the feature space.

As a capacity measure dedicated to the kernel setting, we consider the spectral ratio introduced in [13]. It is defined as

C(K) := tr(K) / ‖K‖_F,

where tr(K) denotes the trace of K and ‖·‖_F the Frobenius norm. According to the following definition (see [13, Definition 1, p. 8]), such a quantity is an expressiveness measure for kernels. As a remark, we also point out that it is connected to the empirical Rademacher complexity [13, Theorem 4, p. 9].

Definition 1 Let κ_i, κ_j : Ω × Ω −→ R be two (strictly) positive definite kernels. We say that κ_j is more specific (or more expressive) than κ_i whenever, for any dataset Ξ,

C(K_i) ≤ C(K_j),  (2.3)

where K_i and K_j are the kernel matrices on Ξ obtained via κ_i and κ_j, respectively.

Remark 2 Technically, Definition 1, which is taken from [13], states that κ_j is more specific (or more expressive) than κ_i also when the equality in (2.3) holds true. In the latter case, we should use the term "equally or more specific than." To keep a common notation with the native definition, we simply use the term "more expressive" also for the trivial case.

The spectral ratio being an expressiveness measure, it is related to the rank of the kernel matrix (see also Remark 1); indeed,

1 ≤ C(K) ≤ √(rank(K)).

We conclude this brief review on kernels for machine learning by pointing out that the kernel matrices introduced above might suffer from severe ill-conditioning. In order to partially overcome instability issues in the approximation framework, a possible solution comes from the use of VSKs (see Definition 2 below), which have been recently introduced in [6]; refer also to [10, 11].

Definition 2 Let Λ ⊆ R^m and let ψ : Ω −→ Λ be a given scaling function. Given a (strictly) positive definite kernel κ on (Ω × Λ) × (Ω × Λ), the variably scaled kernel κ_Ψ on Ω × Ω is defined as

κ_Ψ(x, y) := κ((x, ψ(x)), (y, ψ(y))),  x, y ∈ Ω.  (2.4)

We point out that in Definition 2, we present a multidimensional extension of the scaling function ψ, which has been introduced as a real-valued function in the previous literature [6]. When dealing with Mercer kernels, the construction of a VSK as in Definition 2 provides a valid kernel. We now extend this general setting to work with KRNs and SVMs. To have a clear theoretical framework, we investigate the use of VSKs as a feature augmentation algorithm, where new features are added to the original dataset in order to possibly increase the performance of learning schemes. According to Definition 2, we define a function Ψ : Ω −→ Ω̄ as

Ψ(x) := (x, ψ(x)),  x ∈ Ω,

where Ω̄ := Ω × Λ ⊆ R^{n+m}. The function Ψ extends the data vector x ∈ Ω, including m features that depend on the original ones. The VSK kernel defined in (2.4) is a valid kernel, as it corresponds to an inner product in the associated feature space F_Ψ (see [42, Proposition 3.22, p. 75]).
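To make the VSK kernel and the spectral ratio concrete, the following snippet (our illustration, not code from the paper) computes C(K) = tr(K)/‖K‖_F and assembles a VSK Gram matrix by evaluating a standard Gaussian kernel on the augmented points Ψ(x) = (x, ψ(x)). The scaling function, the sample set, and the shape parameter are placeholders; note also that scikit-learn's rbf_kernel parametrizes the Gaussian as exp(−γ‖x − y‖²), which need not coincide with the paper's convention for γ.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def spectral_ratio(K):
    # C(K) = tr(K) / ||K||_F, the expressiveness measure of [13]
    return np.trace(K) / np.linalg.norm(K, 'fro')

def vsk_gram(X, psi, gamma=1.0):
    # VSK Gram matrix: evaluate the base kernel on Psi(x) = (x, psi(x)), cf. (2.4)
    X_aug = np.hstack([X, psi(X)])
    return rbf_kernel(X_aug, X_aug, gamma=gamma)

rng = np.random.default_rng(0)
X = rng.random((50, 2))                          # toy dataset Xi in Omega = [0, 1]^2
psi = lambda Z: np.sin(np.pi * Z[:, :1])         # illustrative scaling function psi: Omega -> R
K, K_vsk = rbf_kernel(X, X, gamma=1.0), vsk_gram(X, psi, gamma=1.0)
print(spectral_ratio(K), spectral_ratio(K_vsk))  # for the Gaussian, C(K) <= C(K_Psi), cf. Section 4
```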
Moreover, it induces a new feature map Θ : Ω −→ F Ψ so that Referring to (2.1), because of [6, Theorem 3.1], the spaces F Ψ and the classical feature space F , associated to κ :Ω ×Ω −→ R and induced by the feature map Υ :Ω −→ F , are isometric; see also [10, Proposition 2.3] . We now investigate the use of the VSKs for both SVMs and KRNs. In this section, we present the VSK setting in the SVM algorithm. For this general overview, we also refer the reader to [16, 42] . We take Ξ = {x i , i = 1, . . . , N} ⊆ Ω, where Ω ⊆ R n . The associate function values are so that y i ∈ {−1, +1}, i = 1, . . . , N. Indeed, for the binary classification problem via VSKs, we need to find a predictor, i.e., a decision function s Ψ : Ω −→ {−1, +1}, that assigns appropriate labels, i.e.,ỹ i ∈ {−1, +1}, to other unknown we define a non-linear SVM classifier that makes use of VSKs via the following decision function: Observe that in the computation of b any given index i so that 0 < α i < ζ can be used. However, to make b uniquely defined and for stability purposes, it is computed via an average over all such candidates. The equation of the SVM decision function s Ψ : Ω −→ {−1, +1}, i.e., w and b as in equation (3.1) and (3.2), is then found by imposing the Karush Kuhn Tucker conditions (see, e.g., [29] ) and thanks to (3.1), for x ∈ Ω, it reads as follows: If one uses the standard kernel κ : Ω × Ω −→ R, then we recover the classical SVM setting. As a second test case for the use of VSKs in the machine learning context, we investigate regression networks. Since here KRNs are used for regression/interpolation tasks, given distinct data Concerning supervised learning networks, the simplest strategy consists in learning the trend between inputs and outputs via a predictor s Ψ : Ω −→ R which is a linear combination of some basis functions, in this case VSKs. For a general overview on KRNs, we refer the reader to [16, 30] . We keep the general framework of KRNs and we adapt them to the use of VSKs. Here, we focus on kernels with centers at locations Z = {z i , i = 1, . . . , M} ⊆ Ω; and thus, our KRN-VSK predictor s Ψ : Ω −→ R is of the form for (strictly) positive definite kernels κ Ψ : Ω × Ω −→ R and for some real coefficients c 1 , . . . , c M . For KRN-VSK, we compute c = (c 1 , . . . , c M ) ∈ R M via the following minimization problem [15] where ν ∈ R + is a regularization parameter. In the following, we may take the set of kernel centers Z ≡ Ξ . In that case, the kernel matrix K Ψ of entries is square. Furthermore, if a strictly positive definite kernel as the Gaussian function is used, then the matrix is non-singular. Therefore, we may look at the special setting for which ν = 0. In that case, the solution can be found as c = (K Ψ ) −1 y, where y = (y 1 , . . . , y N ) and c = (c 1 , . . . , c N ) . In general, computing the inverse of the kernel matrix K might lead to serious instability issues due to the typical ill-conditioning of the kernel matrix. This problem may be somehow overcome by selecting a safe shape parameter γ , formally introduced below, and/or by using stable bases; refer, e.g., to [20, 24, 34] . In the incoming sections, we will point out that the use of VSKs might reduce the usual ill-conditioning of the kernel matrices. In this section, we focus on specific kernels providing the practical implementation of the variably scaled setting. Furthermore, we also study the expressiveness and the conditioning induced by the VSKs. Radial kernels are truly common. 
They are kernels for whom there exists a radial basis function (RBF) ϕ : R + −→ R, where R + := [0, ∞), and (possibly) a shape parameter γ > 0 such that, for all x, y ∈ Ω, Among all radial kernels, we remark that the Gaussian is given by We now discuss its practical implementation in the variably scaled setting. We point out that the Gaussian kernel is strictly positive definite; and thus, its associated kernel matrix turns out to be positive definite, provided that the data are distinct; see, e.g., [16] . Throughout this section, we take N data points The Gaussian VSK matrix can be seen as a Hadamard product; indeed, we have the following result. . . , N, and • denotes the Hadamard matrix product. Proof For x, y ∈ Ω, we have that Therefore, the entries of the VSK matrix built on Ξ = {x i , i = 1, . . . , N} are given by and thus About the Hadamard product, we report here a result that can be traced back to 1911 by Schur [40] . It will be helpful in what follows; refer also to [ This result allows us to infer about the spectrum of the kernel matrix (see [12] ) and to show that with the Gaussian VSK we gain both in terms of stability and expressiveness of the kernel. We now give upper and lower bounds for the Frobenius norm · F of the kernel matrix K in terms of its variably scaled setting. This turns out to be helpful when comparing the spectral ratio of the two matrices (K and K Ψ ). Proof Being the RBF ϕ : R + −→ R associated to the Gaussian kernel κ nonincreasing, for x, y ∈ Ω, we obtain which in particular implies that From this theorem, we can easily infer on the spectral ratio in the VSK setting. where ϕ : R + −→ R is the RBF associated to the Gaussian kernel κ : Ω ×Ω −→ R. Taking into account Theorem 5, we obtain Remark 3 The concept of expressiveness when the equality in (4.2) holds true has already been clarified in Remark 2. We further point out that the equality is satisfied, for instance, in the trivial case for which ψ(x) ≡ 0, for all x ∈ Ω. On one side, the fact that the Gaussian VSK is more expressive than the standard one tells us that the VSK-based learning might be able to deal with more complex tasks. In the next subsection, we focus on the stability of the kernel matrix. The smallest eigenvalue of a positive definite kernel matrix is of course linked to the ill-conditioning. Moreover, given Ξ = {x i , i = 1, . . . , N} ⊆ Ω, the stability is also related to the separation distance which only depends on the data. As shown in, e.g., [6] , we have that is the separation distance in the VSK setting. This gives the intuition of the fact that the VSKs might lead to possible improvements in terms of stability [6] . Indeed, in general, it is well-known that the smallest eigenvalue of the kernel matrix is related to the separation distance, meaning that the ill-conditioning usually grows as the separation distance decreases; refer, e.g., to [28] , where the authors make use of a result from [3] on the eigenvalues of distance matrices. These facts are the fruits on many studies on the so-called trade-off or uncertainty principle [37, 38] , which could be summarized in a conflict between accuracy and stability. As already mentioned, the VSKs are helpful for improving the stability, especially in view of the following property. 
We also refer the reader to [43, Proof First note that, since in this case the matrix is positive definite, the condition number can be computed as Moreover, from Theorem 4 and since the RBF ϕ : R + −→ R associated to the Gaussian kernel κ : This result turns out to be meaningful especially for the KRN-VSK approach. As a second case study, we now consider the linear kernel, which is truly popular for classification tasks. For x, y ∈ Ω, the linear kernel κ : Ω × Ω −→ R is given by κ(x, y) = x y. As for the Gaussian kernel, its implementation in the variably scaled setting turns out to be trivial. We remark that the linear kernel is positive definite; and thus, its associated kernel matrix turns out to be positive The linear VSK can be written as sum of matrices; indeed, we have the following result. Let Ξ = {x i , i = 1, . . . , N} ⊆ Ω be a set of data points. Let ψ : Ω −→ Λ be the scaling function for the VSK setting. Let κ : Ω × Ω −→ R be the linear kernel. Then, the VSK matrix constructed on Ξ via κ Ψ : Ω × Ω −→ R is given by Proof For x, y ∈ Ω we have that: and thus the kernel matrix is given by We now drive our attention towards the expressiveness of the linear VSK. Depending on the function ψ, we might have that the linear VSK is less expressive than the standard linear kernel; indeed, we have the following proposition. . . , N} ⊆ Ω be a set of data points. Let ψ : Ω −→ Λ be the scaling function for the VSK setting. Let κ : Ω × Ω −→ R be the linear kernel. Let us suppose that the associated kernel matrix K is non-negative, i.e., so that all the entries of K are non-negative. Given the VSK matrix Proof Under our assumptions, if ψ : Ω −→ Λ is so that K ψ is non-negative, we have that tr(K) tr(K Ψ ) ≤ 1. Moreover, since we suppose K to be non-negative, we get Finally, taking into account the definition of the spectral ratio, the statement follows. Note that the requirements of Proposition 2 are satisfied, e.g., if Ω ⊆ R n + and Λ ⊆ R m + . Being Gramian matrices, K Ψ and K ψ are positive semi-definite. Concerning the minimum eigenvalue of the VSK matrix K Ψ , by virtue of Weyl's inequality (see, e.g., [5, Section III.2, p. 62]), we obtain that: As for the Gaussian kernel, one can make many different choices for the function ψ. Some of them are discussed in the next section. In this section, we provided some theoretical findings concerning the expressiveness of VSKs in terms of the spectral ratio, without taking into account the tuning of the shape parameter γ (see (4.1)), which is considered fixed. While such a theoretical investigation concerns a relevant topic in the theory of machine learning, as we pointed out in Section 2, the spectral ratio represents a poor choice for model selection in practical applications. Indeed, in view of the maximization of a certain score (e.g., accuracy, AUC, f1-score), it is convenient to perform a classical tuning of the model parameters (e.g., SVM, KRN) instead of analyzing its capacity via the spectral ratio as it is. In the framework of approximation theory, as well as for KRNs, the choice of the scaling function can be guided by some characteristics concerning the data distribution or the underlying function that needs to be reconstructed (see, e.g., [35, 36] ). In the classification setting, the VSKs can be seen as feature augmentation methods. More precisely, our aim is to adopt this strategy to encode possible a priori information in the kernel. Let us take N data points Ξ = {x i , i = 1, . . . 
, N} ⊆ Ω, where Ω ⊆ R n and consider a subset Λ ⊆ R m , we now propose some techniques to define the scaling function of the VSK framework. Depending on the task and on the available knowledge, different choices for the scaling function could be taken into account. Here, we construct the scaling function ψ : Ω −→ Λ as follows. Given the dataset we introduce the classes C 1 and C 2 , associated to the labels y = −1 and y = +1, respectively. Letx = (x 1 , . . . ,x n ) be a new example that we need to classify. Treating the features as mutually independent, the NB classifier (see, e.g., [1, 25] ) computes The likelihood n i=1 P (x i |C j ) and the prior P (C j ) are typically estimated from the dataset Σ. In other cases, especially when the dataset is not too large, they could be obtained as a priori knowledge, for example by consulting the literature. In For x ∈ Ω, since P 2 (x) = 1 − P 1 (x) and P 1 (x) are not independent, we observe that it is sufficient to consider one of the two probabilities. Concerning the effectiveness of this scaling function Ψ : Ω −→Ω for the Gaussian VSK, we refer to the notation introduced in Theorem 3 and we point out that, We observe that if P 1 (x i ) ≈ P 1 (x j ), then K ψ ij ≈ 1 and so K Ψ ij ≈ K ij . Considering instead the linear VSK κ Ψ : Ω × Ω −→ R described in Section 4.2, we get We remark that, according to to Proposition 2, with the linear VSK we construct kernels that might be less expressive than the standard ones. For both kernels, this means that the matrices change according to our a priori knowledge on the dataset, leading to a different, possibly easier, learning task for SVM. Here, we take again N distinct data Ξ = {x i , i = 1, . . . , N} ⊆ Ω, where Ω ⊆ R n , and the associated measurements y i ∈ R, i = 1, . . . , N, and consider a subset Λ ⊆ R m . We now investigate some ideas to define the scaling function for KRNs. Therefore, concerning the choice of the scaling function ψ : Ω −→ Λ, we suppose to know the trend of data, which can be modelled via a specific class of functions, i.e., a model M : Ω × R l −→ R depending on x ∈ Ω, and on l parameters β = (β 1 , . . . , β l ). To determine β, we compute: Then, one possible solution to define the function ψ : Ω −→ Λ is Of course, this gives a recipe for the selection of the scaling function which is not unique. All the performed experiments have been carried out in PYTHON, using also the scientific module scikit-learn [32] , on a Intel(R) Core(TM) i7 CPU 4712MQ 2.13 GHz processor. In the following, we consider different toy datasets of various sizes, with precise probability information concerning the features' distributions, and we compare our SVM-VSK approach with standard SVM and NB classifiers. A freely available software can be downloaded at https://github.com/emmaA89/SVM-VSK. Moreover, in the validation and in the test steps, we evaluate the performance of the considered methods by means of the f 1 -score, weighted with respect to the classes. We remind that the f 1 -score is defined as the harmonic mean between precision and recall. More precisely, given the number of true positive (TP), false positive (FP), and false negative (FN) cases, where precision = TP TP + FP and recall = TP TP + FN . We proceed by constructing 12 toy datasets that differ in terms of number of features and examples. We now fix n = 64. Letting Ω ⊆ R n , they are extracted from the dataset where the two classes C 1 and C 2 , associated to the labels y = −1 and y = +1 respectively, are exactly balanced. 
The construction of such a dataset is explained in the following steps. 1. Each class C j , j = 1, 2, is characterized by two vectors μ j = μ 1 j , . . . , μ n j , σ j = σ 1 j , . . . , σ n j . More precisely, let us denote by U (a, b) a univariate uniform random distribution on the interval (a, b) ⊆ R and by p ∼ U (a, b) a sample from such distribution. Then, μ k j and σ k j , k = 1, . . . , n, j = 1, 2 are determined as follows: N (μ, σ ) the univariate normal distribution with mean μ and standard deviation σ . Let x i = x 1 i , . . . , x n i be an example in Ω belonging to a class C j , j = 1, 2. The elements x k i of x i ∈ Ω, k = 1, . . . , n, are then randomly generated as samples of N μ k j , σ k j + 1 = N μ k j , σ k j + N (0, 1), where N (0, 1) is Gaussian white noise. 3. Finally, the data are normalized between [0, 1]. From the so-constructed Γ , let n k ∈ {2, 4, 16, 64} and let x i,n k be the element x i ∈ Γ reduced in dimensions to n k randomly pre-selected features (that are the same for all x i ∈ Γ ). We extract the datasets Γ ξ = {(x i,n k , y i ), i = 1, . . . , N q , y i ∈ {−1, +1}}, with N q ∈ {100, 500, 2500}, ξ = (N q , n k ), and preserving the balance among the two classes. More precisely, fixed N q , we randomly select N q indices from the set {1, . . . , 5000}. Moreover, since all combinations examples-features are taken into account, we obtain 12 different datasets. In the following description, we fix one of the extracted datasets Γ ξ for some value of N q and n k . We divide such a dataset in a training set Σ ξ and a test set T ξ . These sets are so that card(Σ ξ ) ≈ 2card(T ξ ). In this experiment, we suppose to have a priori information and to encode it in the SVM-VSK method by means of the NB algorithm. More precisely, the NB classifier is trained considering both Σ ξ andΓ ξ , which is defined as the dataset containing the examples of Γ , whose number of features has been reduced to n k , that are not in Γ ξ , i.e.,Γ ξ := Γ (5000,n k ) \ Γ ξ . Therefore, in this test, we compare the performances on T ξ of the three methods constructed as follows. 1. The NB classifier, which is trained onΓ ξ ∪ Σ ξ . Given x = (x 1 , . . . , x n k ), we adopt the Gaussian likelihood [33] P ( for i = 1, . . . , n k , j = 1, 2. 2. The standard SVM method, which is trained on Σ ξ . 3. The SVM-VSK classifier, which is trained on Σ ξ and whose scaling map ψ : Ω −→ Λ, constructed as explained in Section 5, considers the probabilistic outcomes of the NB classifier. In order to tune the SVM hyperparameters ζ and γ , the latter in case of RBF kernel, we consider a 5-fold cross-validation on Σ ξ . We carry out the test for each dataset Γ ξ and we show the obtained results in Fig. 1 . The proposed SVM-VSK algorithm is competitive with the best among SVM and NB methods, slightly outperforming both in some cases. For the Gaussian kernel, we numerically verify Corollary 1 by reporting in Table 1 the spectral ratios related to the matrices K and K Ψ , obtained from the training sets Σ ξ with N q = 100, 500, 2500, and n k = 2. The results numerically confirm what was theoretically predicted, i.e., the Gaussian VSK is more expressive than the standard one for a fixed shape parameter. Moreover, for the linear kernel, we are in the hypothesis of Proposition 2. The quantities involved in that proposition are reported in Table 2 . The results support what was theoretically observed. Here, we refer the reader to [16, Program 18.1, p. 340 ] for a detailed software that deals with KRNs. 
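Before moving to the KRN example, the comparison just described can be reproduced in spirit with a few lines of scikit-learn. The snippet below is a simplified sketch of our own: synthetic data via make_classification stand in for the paper's Γ_ξ construction, the NB classifier is fitted on the training set only rather than on Γ̄_ξ ∪ Σ_ξ, and a fixed hyperparameter choice replaces the 5-fold cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Stand-in for one of the datasets Gamma_xi (balanced binary classification);
# the split mimics card(Sigma) ~ 2 * card(T).
X, y = make_classification(n_samples=500, n_features=16, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

# NB classifier with Gaussian likelihood; its posterior P_1(x) plays the role
# of the scaling function psi of Section 5 for the SVM-VSK.
nb = GaussianNB().fit(X_tr, y_tr)
psi = lambda Z: nb.predict_proba(Z)[:, [1]]

# Standard SVM versus SVM-VSK (an SVM trained on the augmented examples (x, psi(x))).
svm = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_tr, y_tr)
svm_vsk = SVC(kernel='rbf', C=1.0, gamma='scale').fit(np.hstack([X_tr, psi(X_tr)]), y_tr)

for name, model, Xe in [('NB', nb, X_te), ('SVM', svm, X_te),
                        ('SVM-VSK', svm_vsk, np.hstack([X_te, psi(X_te)]))]:
    print(name, f1_score(y_te, model.predict(Xe), average='weighted'))
```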
As an example for KRNs, we focus on the Italian data of the 2020 COVID-19 pandemic. The task we consider consists in learning the time series, i.e., Ω ⊆ R, of people that in Italy were hospitalized as intensive care unit (ICU) patients from 24 February 2020 to 26 April 2020. The dataset, provided by the "Dipartimento della Protezione Civile," is available at https://github.com/pcm-dpc/COVID-19/tree/ master/dati-andamento-nazionale. The dataset Γ consists of 63 samples and it is divided as follows. The first 58 days define the training set Σ, i.e., they are used to construct the regression model, which is then tested on the last t = 5 days,x i , i = 1, . . . , t. Referring to Section 3.2 we take the set of kernel centers Z as the set of available data in Ξ and we construct the model using the Gaussian kernel defined in (4.1). Moreover, the feature augmentation strategy outlined in ( In addition, we encode into the kernel also other available data. Precisely, thinking of time series, one usually disposes of other available and dependent data sampled at the same locations which can be used as additional features (see, e.g., [8, 41] ). In this direction, we take into account the total number of COVID19 infected (included death and recovered people), the daily number of new infected and the total number Table 1 The spectral ratios of the matrices K and K Ψ related to the normalized training sets Σ ξ , varying N q = 100, 500, 2500. We set n k = 2 and we considered a Gaussian kernel with γ = 1 of infected (excluded death and recovered people). Of course, this selection of the scaling function means that we are adding a priori knowledge to the selected time series. Therefore, the scaling function ψ is so that ψ : To analyze the performances of the variably scaled setting, we take the Gaussian kernel and we compute the condition number of the kernel matrix and the rounded mean error (RME). Let be the mean error, where A is a decision function as defined in Section 3.2 obtained via classical or variably scaled kernels. Since hospitalized patients are involved in the dynamics we consider, as accuracy indicator, the RME defined as RME = ME , if ME − ME ≤ 0.5, ME , if ME − ME > 0.5. In the first experiments, we set the parameter ν = 0. We remark that for regression networks the selection of the shape parameter plays a crucial role. Therefore, to make a fair comparison between classical and VSK regression networks, we report the condition numbers and the RME for 200 values of the shape parameter γ in the interval [0. 5, 20] . The results are reported in Fig. 2 . We observe that the computation carried out via VSKs is characterized by a lower condition number of the kernel matrix, as theoretically observed in Proposition 1. For such experiment, this directly reflects on the accuracy of the computation, meaning that the safe interval for the shape parameter γ is larger than for the classical method (see Fig. 2 , right). In Fig. 3 , we report two graphical results corresponding to ν = 0 and ν = 1e − 04, left and right respectively. In both cases we take the optimal shape parameter γ * , meaning that it leads to the smallest RME, in the same framework of Fig. 2 (right) . The associated RME is shown in Table 3 . We note that the VSK setting outperforms the classical method for ν = 0, while for ν = 1e − 04 the two approximations are comparable. 
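A minimal sketch of the KRN-VSK regression used in this experiment, under our own implementation assumptions: the scaling function juxtaposes the auxiliary series (total, new, and currently infected) to the day index, the coefficients are obtained from the linear system (K_Ψ + νI)c = y, which is one standard realization of the penalized problem of Section 3.2 and reduces to c = K_Ψ^{-1}y for ν = 0 as in the text, and the RME is read as the mean absolute error on the test days rounded to the nearest integer. Data loading and the exact shape-parameter convention are omitted or simplified.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def krn_vsk_fit_predict(t_train, y_train, t_test, psi, gamma=1.0, nu=0.0):
    # Augment the day index with the scaling function, build the Gaussian Gram
    # matrix on the training days and solve (K + nu*I) c = y; nu = 0 recovers
    # the interpolatory case c = K^{-1} y discussed in Section 3.2.
    A_tr = np.column_stack([t_train, psi(t_train)])
    A_te = np.column_stack([t_test, psi(t_test)])
    K = rbf_kernel(A_tr, A_tr, gamma=gamma)
    c = np.linalg.solve(K + nu * np.eye(len(y_train)), y_train)
    return rbf_kernel(A_te, A_tr, gamma=gamma) @ c, np.linalg.cond(K)

def rme(y_true, y_pred):
    # Rounded mean error: mean absolute error rounded to the nearest integer
    # (floor if the fractional part does not exceed 0.5, ceil otherwise).
    me = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    return int(np.floor(me)) if me - np.floor(me) <= 0.5 else int(np.ceil(me))

# usage (data loading omitted): pred, cond_K = krn_vsk_fit_predict(t_tr, icu_tr, t_te, psi, gamma, nu)
# with, e.g., psi = lambda t: np.column_stack([total_cases[t], new_cases[t], active_cases[t]])
```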
In this section, we propose a feature extraction method directly inspired by the presented variably scaled setting, which can be used as an alternative to other, possibly expensive, feature extraction algorithms. To this aim, we consider the Wisconsin Breast Cancer Database [22, 23], which consists of 699 instances described by 9 features, extracted from a digitized image of a fine needle aspirate of a breast mass. The task consists in predicting whether the mass is benign or malignant. From the original dataset, we exclude 16 instances that present missing values. The two classes are not equally distributed, with 444 benign instances and 239 malignant instances. At first, we divide the dataset into a training set, consisting of 226 benign and 116 malignant cases, and a test set, which is composed of 218 benign and 123 malignant cases. Then, taking the hyperparameter grids adopted in Section 6.1, we compare the performances on the test set of the following four methods.
1. An NB classifier with Gaussian likelihood.
2. A standard SVM classifier, whose hyperparameters ζ and γ (in the Gaussian case) are validated by means of 5-fold cross-validation on the training set.
3. An SVM classifier constructed after a feature selection process, as explained in what follows. Analyzing the resulting weights of the SVM classifier (in the linear case), we can rank the features by their influence in the classification; see, e.g., [17]. Then, we choose the n̄ most relevant features, here fixing n̄ = 2, and we consequently reduce our training and test sets by restricting to the two most relevant features. Finally, we take both linear and Gaussian kernels, we train an SVM classifier via 5-fold cross-validation on the reduced training set, and we evaluate the results on the reduced test set. We denote this method by SVM-Selection (SVM-S).
4. An SVM classifier constructed after a VSK-like feature extraction process, as described in the following lines and sketched below. We randomly select n̄ − 1 features (here n̄ = 2). The training set restricted to the remaining 8 features is used to train a Gaussian NB classifier. Reduced training and test sets are obtained by juxtaposing the previously selected n̄ − 1 features to the probabilistic output of the NB classifier. Then, we take both linear and Gaussian kernels, we train an SVM classifier via 5-fold cross-validation on the reduced training set, and we evaluate the results on the reduced test set. We denote this method by SVM-Extraction (SVM-E).
We point out that both SVM-S and SVM-E consider reduced training and test sets that are characterized by the same number of features n̄. Moreover, SVM-E presents some advantages in terms of computational complexity with respect to SVM-S, since training an auxiliary NB classifier to perform feature extraction is cheaper than training an SVM classifier to carry out the feature selection. In Table 4, we present the results obtained considering the SVM, NB, and SVM-S methods. In Table 5, we report the results concerning the SVM-E algorithm. For completeness, we vary the randomly selected feature, taking into account all the possibilities. We observe that the best score is achieved by the SVM-E algorithm. Moreover, for this dataset, we point out that such a method prefers the Gaussian kernel over the linear one, while the standard SVM and SVM-S obtain better classification scores when the linear kernel is considered.
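A minimal sketch of the SVM-E construction, under our own simplifications: a single random choice of the retained feature instead of looping over all possibilities, an illustrative hyperparameter grid, accuracy as the reported score instead of the weighted f1, and synthetic data standing in for the Wisconsin database (its loading is omitted).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def svm_e(X_tr, y_tr, X_te, y_te, n_keep=1, seed=0):
    # VSK-inspired feature extraction (SVM-E): keep n_keep randomly chosen
    # original features and replace the remaining ones by the posterior
    # probability of an auxiliary Gaussian NB classifier trained on them.
    rng = np.random.default_rng(seed)
    keep = rng.choice(X_tr.shape[1], size=n_keep, replace=False)
    rest = np.setdiff1d(np.arange(X_tr.shape[1]), keep)

    nb = GaussianNB().fit(X_tr[:, rest], y_tr)
    build = lambda X: np.hstack([X[:, keep], nb.predict_proba(X[:, rest])[:, [1]]])

    # Hyperparameter grid is illustrative; the paper validates via 5-fold CV.
    grid = GridSearchCV(SVC(), {'kernel': ['linear', 'rbf'],
                                'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 1]}, cv=5)
    grid.fit(build(X_tr), y_tr)
    return grid.score(build(X_te), y_te)

# stand-in data with the same dimensions as the cleaned Wisconsin dataset
X, y = make_classification(n_samples=683, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
print(svm_e(X_tr, y_tr, X_te, y_te))
```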
We investigated the link between VSKs and feature augmentation strategies. In doing so, we tailored the VSKs for SVMs and KRNs. The proposed methods turn out to be flexible and easy to implement. For KRNs, the use of VSKs has the advantage of improved stability, while for classification it merges the probabilistic features of NB with the geometric ones of SVM. This results in effective algorithms that can be used for many tasks. Applications to real datasets show the effectiveness of our approach. Work in progress consists in extending this concept to support vector regression as well as to greedy methods [2, 46].

References
Data Classification: Algorithms and Applications
Learning with subsampled kernel-based methods: Environmental and financial applications
Eigenvalues of Euclidean distance matrices
Rademacher and Gaussian complexities: risk bounds and structural results
Matrix Analysis
Interpolation with variably scaled kernels
Landslide susceptibility assessment in Vietnam using support vector machines, decision tree and Naïve Bayes models
Smoothing exponential-polynomial splines for multiexponential decay data
Frustratingly easy domain adaptation
Shape-Driven Interpolation with Discontinuous Kernels: Error Analysis, Edge Extraction and Applications in MPI
Jumping with variably scaled discontinuous kernels (VSDKs)
Improved estimates for condition numbers of radial basis function interpolation matrices
Learning deep kernels in the space of dot product polynomials
The spectrum of kernel random matrices
Kernel-based Approximation Methods Using Matlab
Kernel PCA for novelty detection
Bounds on the spectral radius of a Hadamard product of nonnegative or positive semidefinite matrices
Face recognition using kernel principal component analysis
Theoretical and computational aspects of multivariate interpolation with increasingly flat radial basis functions
Learning with augmented features for supervised and semisupervised heterogeneous domain adaptation
Wisconsin breast cancer database, UCI machine learning repository
Cancer diagnosis via linear programming
The extension of Rippa's algorithm beyond LOOCV
Automatic indexing: An experimental inquiry
Functions of positive and negative type and their connection with the theory of integral equations
Cyclic prefixing or zero padding for wireless multicarrier transmissions?
Norm estimates for the inverses of a general class of scattered-data radial-function interpolation matrices
Numerical Optimization
Introduction to radial basis function networks
Thumbs up? Sentiment Classification Using Machine Learning Techniques
Scikit-learn: Machine learning in Python
Naive Bayes classification of uncertain data
An algorithm for selecting a good value for the parameter c in radial basis function interpolation
Edge detection methods based on RBF interpolation
Interpolating functions with gradient discontinuities via variably scaled kernels
Error estimates and condition numbers for radial basis function interpolation
Multivariate interpolation and approximation by translates of a basis function
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen
A simple PSA-based computational approach predicts the timing of cancer relapse in prostatectomized patients
Kernel Methods for Pattern Analysis
Hadamard products and multivariate statistical analysis
On the uniform convergence of relative frequencies of events to their probabilities
Large scale image annotation: Learning to rank with joint word-image embeddings
A vectorial kernel orthogonal greedy algorithm
Question classification using support vector machines

Acknowledgements We sincerely thank the reviewers for helping us to significantly improve the manuscript. This research has been accomplished within the Rete ITaliana di Approssimazione (RITA) and partially funded by GNCS-INdAM, by the European Union's Horizon 2020 research and innovation programme ERA-PLANET, grant agreement no. 689443, via the GEOEssential project, and by the ASI-INAF grant "Artificial Intelligence for the analysis of solar FLARES data (AI-FLARES)."

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.