title: Cross-covariance based affinity for graphs
authors: Yadav, Rakesh Kumar; Abhishek; Verma, Shekhar; Venkatesan, S
date: 2020-11-20
journal: Appl Intell
DOI: 10.1007/s10489-020-01986-9
correspondence: Rakesh Kumar Yadav (pcl2014003@iiita.ac.in), Abhishek (rsi2016006@iiita.ac.in), Shekhar Verma (sverma@iiita.ac.in), S Venkatesan (venkat@iiita.ac.in), Department of IT, Indian Institute of Information Technology Allahabad, Deoghat, Jhalwa, Prayagraj, U.P., India

The accuracy of graph based learning techniques relies on the underlying topological structure and affinity between data points, which are assumed to lie on a smooth Riemannian manifold. However, the assumption of local linearity in a neighborhood does not always hold true. Hence, the Euclidean distance based affinity that determines the graph edges may fail to represent the true connectivity strength between data points. Moreover, the affinity between data points is influenced by the distribution of the data around them, which must be considered in the affinity measure. In this paper, we propose two techniques, CCGA_L and CCGA_N, that use cross-covariance based graph affinity (CCGA) to represent the relation between data points in a local region. CCGA_L also explores the additional connectivity between data points which share a common local neighborhood. CCGA_N considers the influence of the respective neighborhoods of two immediately connected data points, which further enhances the affinity measure. Experimental results of manifold learning on synthetic datasets show that CCGA is able to represent the affinity measure between data points more accurately. This results in better low dimensional representation. Manifold regularization experiments on standard image datasets further indicate that the proposed CCGA based affinity is able to accurately identify and include the influence of the data points and their common neighborhood, which increases the classification accuracy. The proposed method outperforms the existing state-of-the-art manifold regularization methods by a significant margin.

Manifold learning and regularization methods have been widely used for data representation and processing, respectively. A given dataset is represented as a weighted graph where the weights represent the similarity between data points. In an ideal case, similar data points should have a higher affinity, which satisfies the manifold assumption. In manifold learning, clusters of data points having larger affinity are kept spatially close when projected to a lower dimensional space. Similarly, in manifold regularization, the labels of data points in a large affinity neighborhood are assumed to be similar and, hence, the approximated function is penalized if it changes label in such a neighborhood. This shows that affinity is fundamental to both manifold learning and regularization methods and also influences the accuracy of the underlying model. Any manifold learning or regularization technique can be decomposed into three steps: (1) computation of the pairwise affinity, (2) creation of the graph, and (3) projection of the data into a low dimensional space, or approximation of the generative function with penalization. When only data points are given, an appropriate metric needs to be devised to capture the similarity between data points. This is crucial as the efficiency of a learning technique hinges on the affinity metric.
The affinity metric must be able to capture the dependence between data points and account for uneven sampling, if any, of the data points. Once the affinity is determined, geodesic distances may be computed using the heat kernel between data points assumed to be lying on the Riemannian manifold.

Table 1 Important mathematical symbols:
N(x_i): neighborhood of x_i
κ(x_i, x_j): similarity between x_i and x_j
Q, Z: random variables
F, G: RKHSs associated with Q and Z
HSIC: Hilbert-Schmidt Independence Criterion
d(x_i, x_j): distance between x_i and x_j
d_N(x_i)(·, ·): distance between points in a neighborhood
κ_N(x_i)(·, ·): kernel defined in a neighborhood

The heat kernel enforces the smoothness assumption, which depends on the kernel bandwidth (σ). The heat kernel affinity is governed by its bandwidth or scaling, which drastically reduces the affinity value for data points assumed to be distant or non-similar. The local scaling method further optimizes this bandwidth to fit the local neighborhoods of connected data points [1]. It often takes the distance between the point of interest and its 7th neighbor as the affinity bandwidth, which gives an increase in accuracy compared to baseline methods. Instead of finding the affinity from the given data points, [2] proposes to recreate the optimal affinity assuming the graph partition has been provided. The removal of noisy edges that artificially increased or decreased the affinity between data points, based on maximal cliques, further enhanced the modeling of the manifold [3]. A slightly different approach, introduced in [4], combined similar information from various neighborhoods to enhance the robustness of kNN. An adaptive neighborhood assignment technique [5] was proposed to learn the graph adjacency. It can extensively explore the local manifold structure but fails to decode the global structure of the manifold. A different approach, introduced by [6], used an information theoretic approach to decode the similarity information. These methods directly rely on the Euclidean distance to construct a data graph, assume knowledge of the extent of the locally linear region, and remain susceptible to noisy and highly correlated features.

In this work, we propose cross-covariance based graph affinity (CCGA) for the affinity approximation task. CCGA aims to preserve both linear and non-linear properties of the manifold accurately. The method begins by mapping the data points into a reproducing kernel Hilbert space (RKHS) using an appropriate kernel on the vertices of the underlying graph, followed by a similarity measure between data points identified as neighbors in the ambient space using the Hilbert-Schmidt independence criterion (HSIC). The cross-covariance based affinity metric minimizes the effect of noise and measurement error by making the affinity independent of the points' existing spatial locations. CCGA for the local region (CCGA_L) aims to strengthen the affinity by discovering the connectivity between data points which may or may not be immediately connected but share one or more common local neighborhoods, indicating an innate similarity. Additionally, CCGA for the neighborhood (CCGA_N) weighs the influence of the respective neighbors of two connected data points for a correct measure of similarity. The major contributions of the paper can be stated as follows: we propose a Cross-Covariance based Graph Affinity between two data points that incorporates nonlinear dependence, the influence of neighboring data points, and the neighborhoods around data points.
The first technique, CCGA_L, tries to include the influence of the data points in the common neighborhood in the affinity metric between two points. The second technique, CCGA_N, takes into consideration the influence of the neighborhoods of data points that are adjacent to both points. Table 1 describes the important mathematical symbols used in this paper. The rest of the paper is organized as follows: Section 2 describes the existing literature, highlighting state-of-the-art techniques and the various affinities used in machine learning. Section 3 illustrates the proposed technique and describes how measuring the dependence between data points leads to a more accurate affinity. Sections 3.2 and 3.3 explain the affinity obtained from similarity estimation between two data points in a local region and between two neighborhoods, respectively. Section 4 highlights the experiments performed on various real world and synthetic datasets to evaluate the performance of our techniques in terms of classification accuracy and dimensionality reduction. Section 5 concludes with the findings and robustness of our techniques.

It is known that the affinity between data points depends not only on their locations but also on the neighborhoods of the two points. The density of data has been considered by various metrics in supervised [7], unsupervised [8], semi-supervised [9] and deep learning [10]. The affinity between two data points with different neighborhood densities is different from the affinity when both neighborhoods have the same density [11]. Much of the existing literature adjusts local data structures to improve the characteristics of the affinity graph. However, these techniques are highly susceptible to the occurrence of noise and irrelevant features. They mainly focus on the parameter σ of the RBF heat kernel to determine the similarity between two data points. To solve such issues, a graph-theoretic approach is applied in [3] that rebuilds the tight neighborhood by picking a maximal clique, which, in other words, maximizes the average pairwise similarity. The use of contextual information is proposed in [12] to obtain affinity scores between pairs of data points; the similarity information propagates as a diffusion on the graph. The SSNMM-isomap framework [13] employs LLE to preserve the optimal local properties by minimizing the pairwise distances between intra-class data points that are assumed to lie on the same manifold. The inter-class distances between data points that lie on different manifolds are simultaneously maximized. The LLE based neighborhood reconstruction error for preserving local topological structures leads to an accurate low dimensional representation and, hence, increases the model's accuracy. The work in [14] carried out the diffusion on a locally constrained sparse graph to obtain a more accurate graph affinity. A similar work [15] performed the diffusion process on a tensor product graph that captures higher order information. However, in the aforementioned works, the diffusion process is highly susceptible to noise, which leads to inaccurate affinity propagation in a graph. DSCF-net [16] aims to learn an optimal representation matrix along with several sets of basis vectors; the deep factorized coefficient matrix accumulates the similarity information between samples and features. A technique [17] for 2D images is introduced to extract the neighborhood features by minimizing the Frobenius norm-based reconstruction error.
This makes the distance metric more reliable and encodes the neighborhood reconstruction error precisely. The RS2ACF framework [18] optimizes the representation by constraining it in both the projection and the label space. The basis vectors obtained from concept factorization are used to discover the latent semantic structure hidden in the data, which proves useful for clustering. The label constrained representation matrix preserves the local geometrical information, which propagates accurate labels to unlabeled data. In [19], a parameter free affinity based clustering is introduced that estimates distances between data points, from which a threshold affinity is obtained. The paper [20] proposes a Consistent Affinity Graph Learning (CAGL) algorithm for multiview data that selects robust graph Laplacian matrices from each view. It models the hypergraph Laplacians as points on a Grassmann manifold and fuses all the views with the CAGL algorithm. In [21], Axiomatic Fuzzy Set-based Spectral Clustering (AFSSC) is proposed, which creates a highly robust affinity graph by exploiting and identifying the discriminative features.

Kernel dependency measures like HSIC capture both linear and nonlinear relationships by mapping the data into a high dimensional feature space. A kernel defined on the vertices of a graph gives a representation of the data and a measure of similarity between data points in the RKHS. HSIC [22, 23] is calculated from the empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator. Empirical HSIC is used to measure the statistical dependence between two random variables or two datasets assumed to be drawn from different probability density functions. HSIC has been optimized with greedy algorithms on data features [23, 24]. The paper [25] proposed a feature space independent semi-supervised kernel matching method for domain adaptation. HSIC Lasso, using a dual augmented Lagrangian to reach the global optimum, has been introduced in [26] for regression with feature reduction. ICA [27], with a contrast function based on canonical correlation in the RKHS, has been proposed for feature selection by finding the minimally correlated dimensions. The HSIC based supervised feature selection [24] measures the dependence between the given features and their respective labels and selects those features that give the maximum correlation. HSIC has also been used for dimensionality reduction [28, 29] such that maximally independent features are identified using the kernel method. The problem of hyperspectral image classification [30] has also utilized surrogate kernel based HSIC to further increase the classification accuracy.

The CCGA based affinity measure has been proposed to capture the nonlinear dependence between data points of a point cloud lying on a Riemannian manifold in place of the Euclidean distance. It is recognized that, in addition to nonlinear dependence, pairwise affinities are also affected by the presence of common neighbors of data points. The first technique, CCGA_L, considers the dependence between neighboring points of a neighborhood to identify innate dependence and to incorporate it in the affinity measure between data points. The second technique, CCGA_N, assumes that common neighbors do not affect the affinity measure alone but together with their own neighborhoods. This effect has been captured in the affinity measure for increased classification accuracy and dimensionality reduction performance.
A pairwise affinity metric needs to capture the mutual dependence (linear or nonlinear) between data points. Semi-supervised learning assumes that the conditional density of the points given the labels varies smoothly along the geodesics in the intrinsic geometry of the marginal distribution of the data points. If two points on the manifold are joined by an edge, i.e. there is high mutual dependence between them, along which the marginal density is large, then the two points will generally share the same label. This entails that the affinity metric between two data points must also consider their neighborhoods. Nonlinear dependence can be computed using a kernel in an appropriate feature space. However, considering the effect of a neighborhood requires determining the neighborhoods of the nodes. This becomes paradoxical, as the affinity between data points defines the edges. Moreover, the implicit mapping into the feature space does not preserve neighborhoods. This can be solved by defining local neighborhoods created by kNN or ε-NN using the Euclidean distance. This neighborhood is used to determine the neighboring points and the effect of neighborhoods on affinity using a kernel based dependency measure in the feature space. The resulting affinity is used to find the geodesic on the Riemannian manifold using a heat kernel.

Given n data points (x_i), i = 1, ..., n, sampled on a manifold M, the local neighborhoods are designed using kNN or ε-NN and represented as G = (X, W), where X is the collection of all data points and W contains the affinities between connected data points. Assume that around each data point x_i, the local neighbors are contained in x_j ∈ N(x_i). The heat kernel based affinity between data points is obtained using

w_ij = exp(−‖x_i − x_j‖² / (2σ²))    (1)

where x_i is the point of interest, x_j ∈ N(x_i) is its neighbor, and σ is the heat kernel bandwidth. In graph-based semi-supervised learning methods, affinity measures between labeled and unlabeled data points are used to determine the labels of unlabeled points. The choice of affinity measure is vital in determining the classes of the unlabeled points. A graph created using kNN with a fixed number of neighbors may force data points beyond the linear region to be considered as neighbors; in other situations, genuine linear neighborhood data points may be discarded. Thus, the neighborhood of a data point cannot be guaranteed to satisfy the local linearity assumption. This requires a non-Euclidean measure that can capture the relation between x_i and x_j ∈ N(x_i). Kernel methods map the input data into a high-dimensional RKHS and define a linear method therein. The model results in a nonlinear relationship with respect to the input space. The mapping is implicit, and the nonlinear relations between data in the feature space are captured through the kernel function. A symmetric positive semidefinite matrix can be viewed as a proximity measure that has an inner product representation and satisfies the criteria required of a distance metric. Consider any proximity κ on a finite set X: if two data points x_i and x_j are connected, then the distance between them is obtained as d(x_i, x_j) = κ(x_i, x_i) + κ(x_j, x_j) − 2κ(x_i, x_j). This also indicates that the proximity measures can be combined to generate a distance metric that represents the pairwise affinity or dependence between data points. In this paper, we follow the proximity approach to encode the similarity among data points through a cross-covariance based affinity.
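For concreteness, the following is a minimal sketch of the baseline construction just described: a kNN graph whose edges are weighted with the heat kernel of (1). The function name, the choice of k, and the bandwidth σ are illustrative assumptions, not values from the paper.

```python
import numpy as np

def heat_kernel_knn_graph(X, k=10, sigma=1.0):
    """X: (n, D) data matrix. Returns a symmetric (n, n) affinity matrix W."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        # the k nearest neighbors of x_i (index 0 is x_i itself, so it is skipped)
        neighbors = np.argsort(sq_dists[i])[1:k + 1]
        # heat kernel affinity of (1) restricted to the kNN edges
        W[i, neighbors] = np.exp(-sq_dists[i, neighbors] / (2.0 * sigma ** 2))
    # symmetrize so that (i, j) and (j, i) carry the same weight
    return np.maximum(W, W.T)
```

This is the Euclidean-distance baseline that the CCGA techniques below replace with a dependence based affinity.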
The method captures both linear and nonlinear properties of the underlying data manifold by means of a statistical independence criterion. The cross-covariance operator assigns a value between 0 and 1, where the former value is assigned to two completely different data points and the latter value signifies that the two data points are exactly the same. HSIC of a data point with itself always results in a unity value. As the properties of the underlying manifold M are unknown, the sample data points cannot be guaranteed to have been drawn uniformly. Hence, the distribution parameters of the resulting neighborhoods differ from each other. The similarities and differences of the neighborhood distribution parameters can only be appropriately utilized when the data is mapped to its full feature space. However, due to the complexity of the mapping and the increased dimensions, the kernel proximity automatically identifies the small linear subspace where the manifold properties can be exploited in a linear manner. The proposed affinity framework relies on defining the local neighborhood with the kNN method based on Euclidean distance. This is necessary as the neighborhood is not preserved in the feature space; the implicit kernel mapping redistributes the points. Hence, a Euclidean distance based neighborhood is fixed ab initio. Then the affinities between connected data points are determined through the CCGA methods. The proposed techniques account for both linear as well as nonlinear relationships between each pair of connected data points by estimating the HSIC dependence.

HSIC is a kernel based similarity measure between two random variables. Let Q and Z be two random variables sampled from their respective probability density functions. Assume F and G are the two RKHSs for Q and Z respectively. Let φ(q) and ψ(z) be the feature mapping functions such that φ : Q → F and ψ : Z → G. The kernel functions are given by κ(q, q′) = ⟨φ(q), φ(q′)⟩_F and κ̄(z, z′) = ⟨ψ(z), ψ(z′)⟩_G. The empirical estimation of HSIC using a finite number of data samples is defined as

HSIC(Q, Z) = (n − 1)⁻² tr(κ C κ̄ C)    (2)

where n is the total number of observations, κ and κ̄ are the two kernel matrices for Q and Z respectively, and C = I − (1/n)11ᵀ is the centering matrix. The similarity value between two data vectors in the Hilbert space is given by (2).

Assume a data point x_i and its neighbors x_j, ..., x_k ∈ N(x_i). Based on the smoothness assumption, if N(x_i) is a local region with high affinity between neighborhood data points, then all the neighbors shall share a similar label with high probability. This entails that two data points x_j ∈ N(x_i) and x_k ∈ N(x_i) that are part of the neighborhood N(x_i) of x_i but not adjacent to each other, i.e. x_j ∉ N(x_k) and x_k ∉ N(x_j), may also generally share the same label. It also indicates that the affinity between such data points x_j and x_k should be large if they have multiple common neighborhoods. However, in general affinity calculation, the relation between such x_j and x_k is not considered unless either x_j ∈ N(x_k) or x_k ∈ N(x_j) is true. The proposed CCGA_L tries to identify such similarities among data points and accumulate them to define a revamped affinity between two data points based on two factors: first, whether they are neighbors, and second, the number of common local neighborhoods shared by them.
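As a reference point, the following is a minimal sketch of the empirical HSIC estimator of (2). The use of RBF kernels and the bandwidth value are illustrative assumptions; the paper only requires some kernel on each set of samples.

```python
import numpy as np

def rbf_kernel(A, sigma=1.0):
    """Gram matrix of an RBF kernel over the rows of A."""
    sq = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def empirical_hsic(Q, Z, sigma=1.0):
    """Empirical HSIC(Q, Z) = (n - 1)^(-2) * tr(K C Kbar C), as in (2)."""
    n = Q.shape[0]
    K = rbf_kernel(Q, sigma)              # kernel matrix for Q
    Kbar = rbf_kernel(Z, sigma)           # kernel matrix for Z
    C = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ C @ Kbar @ C) / (n - 1) ** 2
```

A value near zero indicates (near) independence of the two samples, while larger values indicate stronger dependence, which is what the CCGA techniques exploit as an affinity signal.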
Figure 1 shows a pictorial representation of CCGA_L, with x_i as the point of interest and data points x_j ∈ N(x_i) in its kNN neighborhood. Here, x_i is denoted by a black filled circle and its neighbors by blue filled circles. The direct connectivity between x_i and x_j ∈ N(x_i) is illustrated using solid lines and is approximated similarly by both the Euclidean distance and CCGA_L. CCGA_L additionally determines the statistical independence between the neighbors themselves, represented by dashed lines. If the data points x_j and x_k further share a common local neighborhood apart from x_i, each such distance is added to the previous value between them. The proximity between any two connected data points x_i and x_j can be defined as

d(x_i, x_j) = κ(x_i, x_i) + κ(x_j, x_j) − 2κ(x_i, x_j) = 2(1 − κ(x_i, x_j))    (3)

since κ of a data point with itself is 1. Similarly, when two data points x_j and x_k lie in the neighborhood N(x_i), their distance is calculated using

d_N(x_i)(x_j, x_k) = κ_N(x_i)(x_j, x_j) + κ_N(x_i)(x_k, x_k) − 2κ_N(x_i)(x_j, x_k)    (4)

where κ_N(x_i)(·, ·) signifies the kernel proximity applied within the neighborhood N(x_i), and w_N(x_i) represents the respective neighborhood distance. Further, if x_j and x_k are immediately connected or are part of the neighborhoods of other data points (N(·)), then the distance d(x_j, x_k) between them is obtained by averaging over all the individual distances d_N(·)(x_j, x_k) following [31]. The final affinity w_jk between any two data points x_j and x_k for which an accumulated distance exists (w(x_j, x_k) ≠ 0) can be obtained using the heat kernel as

w_jk = exp(−d(x_j, x_k) / (2σ²))    (5)

The algorithm for CCGA for the local region is explained in Algorithm 1. The major steps of CCGA_L are as follows (a sketch in code follows the list):
1. Given a dataset X having n points, create a kNN neighborhood N(x_i) around each data point x_i.
2. Apply the kernel κ_N(x_i) and compute the k × k distance matrix.
3. Convert this distance matrix to an affinity matrix as in (5).
4. For each pair of data points x_j and x_k contained in the current k × k affinity matrix, add the current affinity values to the previously computed values.
5. Repeat from step 2 until all n data points are processed.
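The following is a minimal sketch of one possible reading of the CCGA_L steps listed above, using an RBF kernel inside each neighborhood, the proximity-to-distance conversion of (3)-(4), the heat-kernel conversion of (5), and accumulation over shared neighborhoods. The function name, the bandwidth, and the choice of accumulating affinities by simple addition are assumptions for illustration, not the authors' reference implementation.

```python
import numpy as np

def ccga_l_affinity(X, k=10, sigma=1.0):
    """Accumulate CCGA_L-style affinities over all kNN neighborhoods of X (n x D)."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))                              # accumulated affinity matrix
    for i in range(n):
        nbrs = np.argsort(sq_dists[i])[:k + 1]        # x_i together with its k neighbors
        # kernel proximity restricted to the neighborhood N(x_i)
        local_kernel = np.exp(-sq_dists[np.ix_(nbrs, nbrs)] / (2.0 * sigma ** 2))
        local_dist = 2.0 * (1.0 - local_kernel)       # proximity-to-distance, as in (3)-(4)
        local_aff = np.exp(-local_dist / (2.0 * sigma ** 2))  # heat-kernel conversion, as in (5)
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                # add the current affinity to any previously computed value (step 4)
                W[nbrs[a], nbrs[b]] += local_aff[a, b]
                W[nbrs[b], nbrs[a]] += local_aff[a, b]
    return W
```

Points that share several neighborhoods therefore accumulate larger affinities than points that co-occur in only one, which is the behavior CCGA_L is designed to capture.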
CCGA_L is based on the assumption that common neighbors influence the mutual dependence between two data points. These individual influences may be further conditioned by their own neighborhoods. CCGA_N tries to assimilate this intangible effect of the neighborhoods of the neighbors themselves in the affinity metric. Again, to adhere to the manifold assumption, a neighborhood is identified using spatial proximity based on the Euclidean distance. CCGA_N proceeds as follows: given data points in X, a local neighborhood N(x_i) is constructed using kNN or ε-NN around every point x_i based on the Euclidean distance between all data points. Once the graph is created, we want to compute the pairwise affinity between the point of interest x_i and each of its neighbors x_j ∈ N(x_i), based on their nonlinear dependence, the effect of neighboring data points, and their respective neighborhoods. As shown in Fig. 2, the pairwise affinity between every such pair x_i and x_j further decides whether they are likely to share similar labels. We have two neighborhoods, N(x_i) around x_i and N(x_j) around x_j, such that x_j ∈ N(x_i). To compute the nonlinear dependence, we apply RBF kernels κ_N(x_i) and κ_N(x_j) for N(x_i) and N(x_j) respectively. Each kNN neighborhood contains k + 1 elements including the point of interest. The kernels over N(x_i) and N(x_j) allow the algorithm to appropriately weigh the importance of the neighbors. The HSIC based statistical independence between N(x_i) and N(x_j) is computed as in (6). The resulting distance matrix d(N(x_i), N(x_j)) is of size (k + 1) × (k + 1), where each entry d_ij corresponds to the distance between the i-th member of N(x_i) and the j-th member of N(x_j); it contains the distances calculated using HSIC between every pair of members of N(x_i) and N(x_j), represented as rows and columns respectively. Finally, the distances between data points that are members of such N(x_i) and N(x_j) are summed together to strengthen such relations, giving w(N(x_i), N(x_j)). This distance is calculated between a data point and each of its neighbors. Inside every neighborhood, w(N(x_i), N(x_j)) is computed k times. Over n data points, this computation is repeated k × n times to obtain the distance matrix. The final affinity w_ij between any two data points can then be obtained using (5). The algorithm for CCGA for the neighborhood is explained in Algorithm 2. CCGA_N can be summarized in the following steps (a sketch in code is given below):
1. Given data points in X, a local neighborhood N(x_i) is obtained with the kNN algorithm around each data point.
2. After obtaining this graph, based on the Euclidean distance, the affinity between the point of interest x_i and each of its neighbors x_j ∈ N(x_i) is computed. In this way, we have two neighborhoods, N(x_i) around x_i and N(x_j) around x_j.
3. On each of these neighborhoods, the RBF kernels κ_N(x_i) and κ_N(x_j) are applied.
4. The HSIC based statistical independence between these two neighborhoods is computed as in (6).
5. The pairwise HSIC values between data points are converted into distances using (7). This forms a distance matrix of size (k + 1) × (k + 1).
6. In every neighborhood, w(N(x_i), N(x_j)) is computed k times, which is repeated for n data points (k × n times in total).
7. The final affinity matrix is obtained by (3).
The techniques based on CCGA largely depend on the number of data points n and the ambient dimension D. For both CCGA_L and CCGA_N, an RBF kernel (O(n)) and the kNN algorithm (O(D × n × k)) have been used. The graph mainly depends on k, the number of nearest neighbors. The time complexity of the two techniques can be expressed as follows. For CCGA_L, the complexity is computed based on the steps defined in Algorithm 1. The overall time complexity for n data points using the CCGA_N technique is given as O(k³ × n × d)(d + k).
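The following is a rough sketch of one possible reading of the CCGA_N steps above: for every edge (x_i, x_j) of the Euclidean kNN graph, the empirical HSIC between the RBF kernel matrices of N(x_i) and N(x_j) is used as a dependence score, normalized to lie in [0, 1], converted to a distance in the same 2(1 − proximity) form as (3), and passed through a heat kernel. The normalization, the dependence-to-distance conversion, and the bandwidth are assumptions; in particular, the member-pair summation described in the paper is abstracted here into a single neighborhood-level HSIC value.

```python
import numpy as np

def ccga_n_affinity(X, k=10, sigma=1.0):
    """CCGA_N-style affinities between each point and its k nearest neighbors."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    nbrs = np.argsort(sq, axis=1)[:, :k + 1]                  # each row: x_i plus its k neighbors
    C = np.eye(k + 1) - np.ones((k + 1, k + 1)) / (k + 1)     # centering matrix
    hsic = lambda Ka, Kb: np.trace(Ka @ C @ Kb @ C) / k ** 2  # empirical HSIC, as in (2)
    W = np.zeros((n, n))
    for i in range(n):
        K_i = np.exp(-sq[np.ix_(nbrs[i], nbrs[i])] / (2.0 * sigma ** 2))
        for j in nbrs[i][1:]:
            K_j = np.exp(-sq[np.ix_(nbrs[j], nbrs[j])] / (2.0 * sigma ** 2))
            # normalized dependence between the two neighborhoods (1 for identical ones)
            h = hsic(K_i, K_j) / np.sqrt(hsic(K_i, K_i) * hsic(K_j, K_j) + 1e-12)
            dist = 2.0 * (1.0 - h)                            # dependence-to-distance (assumed form)
            W[i, j] = W[j, i] = np.exp(-dist / (2.0 * sigma ** 2))
    return W
```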
The performance of the proposed graph affinity based techniques CCGA_L and CCGA_N has been evaluated on both synthetic and real world datasets for manifold learning and graph Laplacian regularized classification, i.e. manifold regularization. The existing state-of-the-art graph affinity methods used for graph construction, like heat, binary, and EMR [32] (with both 24 and 72 graphs), are compared against our techniques in terms of classification error. In order to validate the robustness of our techniques, a wide range of real world datasets is used. The real world datasets mainly comprise images, handwritten characters, scene recognition and brain computer interface data. The binary classification error is estimated with two models, LapSVM and LapRLSC, for both the test and unlabeled sets. The mean classification error, along with the standard deviation obtained by varying the value of k, is summarized in the tables; the least classification error for each dataset is highlighted in bold in Tables 3 to 12. The synthetic datasets contain 3D manifold shapes whose low dimensional 2D representation is known. Non-linear dimensionality reduction based 2D projections using both the CCGA_L and CCGA_N techniques are obtained.

For comparison, many existing state-of-the-art non-linear dimensionality reduction techniques, namely LTSA (Local Tangent Space Alignment) [33], LLE (Locally Linear Embedding) [34], Isomap (Isometric Mapping) [35], LE (Laplacian Eigenmap) [36], and HE (Hessian Eigenmap) [37], are employed along with the linear dimensionality reduction technique PCA (Principal Component Analysis) [38]. The first row of figures in Table 2 contains the original 3D manifolds and the subsequent rows contain the 2D output from each of the employed techniques.
1. Punctured Sphere: The second column of Table 2 shows the punctured sphere dataset. This artificially generated dataset contains 4000 data points in 3D. It is originally a 2D disc, which is elevated along the third dimension to make it a sphere. The results show that PCA, LLE, Isomap and HE are unable to determine the true 2D structure as they fail to preserve the intrinsic geometry of the punctured segment marked with red color. Both CCGA_L and CCGA_N, along with LTSA and LE, preserve the true intrinsic geometry and are able to unfold the true lower dimensional structure. The HSIC based nonlinear dependence is able to find the true affinity among data points, which leads to a better association among data points belonging to the same class.
2. Sine on a hyperboloid: The third column of Table 2 shows 629 data points of a sine wave wrapped around a hyperboloid in 3D space, which is originally a smooth circle on a 2D plane. As evident, CCGA_L, CCGA_N, and Isomap give the best representation in 2D space while retaining the geometrical characteristics. All other techniques fail to preserve either the circle or its smoothness. These techniques rely only on the Euclidean distance for estimating the structural properties and discard a large chunk of the nonlinear properties.
3. Swiss roll with hole: This dataset contains 3947 data points modeled as a Swiss roll shape with a small hole between the green and blue portions. When unrolled correctly, it should appear as a 2D flat strip with a hole. Isomap is able to give the output closest to the intrinsic representation. LLE, CCGA_L, and CCGA_N are able to preserve the shape of the hole along with the strip. Other methods fail to preserve the connectivity around the hole. CCGA_L and CCGA_N are, however, not able to unroll the structure as well; it seems that they discard some spurious dependencies between points on the hole perimeter.
4. Twin peaks with hole: The last column of Table 2 shows the artificially generated twin peaks with a hole, which is originally a 2D flat strip with a hole, containing 6990 data points. The results show that PCA gives the most accurate representation, followed by CCGA_L and CCGA_N. Other methods either fail to unroll the strip to its original structure or leave the boundary data points cluttered. All nonlinear dimensionality reduction techniques are able to retain the hole, but the 2D shape distortion is maximum for Isomap as the shortest path is not the correct geodesic distance.
The manifold learning results shown in Table 2 on the synthetic datasets clearly suggest that both techniques are able to determine optimal affinities that preserve the maximum intrinsic geometrical properties. This leads to a better association among data points belonging to the same class.
It is observed that for all the synthetic datasets, both CCGA_L and CCGA_N are able to unroll the data and project it to 2D space while retaining the maximum structural characteristics.
1. USPS: This dataset [39] contains scanned images (Fig. 3a) of handwritten digits on envelopes of the US postal service. The original scanned digits vary in size.
Fig. 3 Real world dataset images.
Indoor Scenes: This dataset [44] comprises 67 classes of image data (Fig. 3f). Each class has more than 100 images. Each image has been reduced to an equal size of 32 × 32 × 3. Finally, a matrix of size 15620 × 3072 is obtained, which is further divided into a training set (70%) and a testing set.
9. Natural Images: This dataset [45] contains images (Fig. 3h) of different objects like car, cat, airplane, motorbike, flower, person and fruit. It has a total of 8 classes.
COVID-19 CT: This dataset [46] contains CT scan images (Fig. 3i) of COVID-19 positive and negative patients. It has a total of 451 grayscale CT scan images, out of which 275 belong to positive patients and the remaining 195 are negative. During pre-processing, all images were resized to a standard 512 × 512 pixels. The data was further split into 50% as the training and 50% as the testing set.
Extensive experiments are performed with the LapSVM and LapRLSC classifiers. The root mean square error with standard deviation is reported in separate tables for the unlabeled and test sets. Experiments on these datasets are performed by varying the kNN parameter, as the affinity matrix used for constructing the graph depends on k, the number of neighbors. This parameter k is varied from 6 to 15. It is observed that the classification accuracy for all datasets increases; thus, with an enriched neighborhood, a more accurate affinity is learned. On further increasing the value (k = 16), the accuracy suddenly dips, and a similar trend is followed on further increasing the neighborhood. The classification errors for k = 6 and k = 10 for both LapSVM and LapRLSC are reported in Tables 3, 4, 5, 6, 7, 8, 9, and 10.
On the USPS dataset, CCGA_L and CCGA_N have the highest classification accuracy as compared to the existing state-of-the-art techniques for both the test and unlabeled sets. On the USPS test set, CCGA_N achieves a classification accuracy of > 98% and CCGA_L stands closest to it with an accuracy > 97%. On the other hand, the classification accuracies with the heat and binary affinities are > 95% and > 94% respectively. In contrast, EMR 24 and EMR 72 have < 72% classification accuracy, which is the least among all. A similar trend in the classification accuracy is obtained for the unlabeled set, where CCGA_N has the best accuracy of > 95.5% and EMR 72 has the least accuracy of < 67%. This reveals that the HSIC based dependence value obtained between data points is able to capture the nonlinear dependence between data points, which leads to better classification accuracy. On the HASY v2 dataset, for the test as well as the unlabeled set, CCGA_N has the most accurate result (> 98%), followed by CCGA_L (> 98%), binary (> 97%), and EMR 24 (> 83%), whereas others like the heat affinity and EMR 72 have accuracies that lie between 78% and 80%. This shows that the intrinsic geometry of the dataset has been well explored by the given techniques in order to determine the connectivities by accommodating both linear and nonlinear characteristics. On the CIFAR-10 dataset, there is a similar trend in the classification results. For both the test and unlabeled sets, the classification accuracy does not vary much; the accuracy is ≈ 70%. CCGA_N leads among all by 1% with a classification accuracy of 71%.
On another object detection dataset, Lego Bricks, CCGA_L and CCGA_N give the best label prediction results. They achieve an accuracy of > 94% for both the test and unlabeled sets. The next most accurate one is the binary weight with an accuracy > 91%.
In order to further validate the efficacy of CCGA_L and CCGA_N, random walk plots are constructed with the state-of-the-art graph affinities on the BCI 5F dataset and are shown in Fig. 4. An affinity matrix is considered to be better than other affinity matrices if its random walk plot has clear clusters. As evident, the random walk shown in Fig. 4b, constructed using CCGA_N, contains cluster patches that define strong intra-neighbor connectivity and sparse inter-neighbor edges. This supports the superior performance of CCGA_N during classification, as the affinity matrix helps in identifying clear clusters of data points. The random walk using CCGA_L (Fig. 4a) gives a comparable result wherein the different classes of the dataset are discernible. The plots for EMR 24 (Fig. 4c) and EMR 72 (Fig. 4d) have less clear clusters. On the other hand, the random walks over the non-parametric methods KTA and HHG (explained in Section 4.3) do not have many recognizable clusters; while the former affinity does define some high strength groups, the latter gives uniform affinity to all data points. Similarly, the heat affinity random walk also dilutes the distinguishable clusters by assigning high affinity to every edge (Fig. 4e), as it captures linear dependencies only. It is evident from the results of non-linear dimensionality reduction and the random walk plots that CCGA_L and CCGA_N are capable of preserving the smooth intrinsic geometrical properties better than the other methods. Similarly, the increased classification accuracy across the standard handwritten, BCI, object and scene detection datasets shows that the obtained graph affinity gives information about otherwise unexplored connectivities in the data and outperforms the other affinity methods by a considerable margin.
CCGA_L and CCGA_N are based on HSIC, which is a non-parametric statistical measure. While in the previous experiments they have been compared with existing state-of-the-art parametric affinity methods, it is essential to compare them with state-of-the-art non-parametric statistical test measures. We have selected two such methods, namely kernel target alignment (KTA) [47] and Heller-Heller-Gorfine (HHG) [47]. KTA works in feature space by aligning two kernel functions, or a kernel and a target function, which makes it very similar to the HSIC based affinity method. On the other hand, HHG computes the statistical measure from the underlying Euclidean metric, which further allows the comparison to explore whether a statistical test based on the Euclidean norm can appropriately enforce the smoothness constraint in the presence of nonlinear relations. Both KTA and HHG aim to compute the dependence between the samples of two random variables, assuming they are not independent of each other. KTA tries to learn a data embedding in feature space such that the data points follow nice clustering properties, i.e. data points belonging to the same class should be mapped spatially closer to each other than data points from different classes. Thus, the intra-class distance between data points in feature space remains close to zero, and the inter-class distance remains greater than zero.
Over the data points in X, KTA employs two kernel matrices, κ₁ = [κ₁(x_i, x_j)]_{i,j=1..n} and κ₂ = [κ₂(x_i, x_j)]_{i,j=1..n}. Further, the alignment between κ₁ and κ₂ is computed as

A(κ₁, κ₂) = ⟨κ₁, κ₂⟩_F / (‖κ₁‖_F ‖κ₂‖_F)

which can be interpreted as the cosine of the angle between κ₁ and κ₂ viewed as bi-dimensional vectors. The measure of alignment between two different kernels on the same pair of data points allows the identification of true linear neighbors in feature space; this corresponds to a small spatial distance or a larger affinity. Similarly, non-aligning kernels of data points result in smaller affinity values, representing the absence of the linear relationship required by regularization to enforce function smoothness across the dense linear regions. The other non-parametric statistical test, HHG, estimates the dependence between two random variables through norm distances. HHG assumes that if two random variables X and X′ are dependent, then there exists a point (x₀, x₀′) and radii R_x and R_x′ such that the joint distribution of X and X′ differs from the product of the marginal distributions over the balls around (x₀, x₀′). The independence test performed with HHG aggregates a statistic over all pairs of observations, T = Σ_{i≠j} S(i, j), where S(i, j) is computed by observing the two dichotomous random variables over the n observations of both random variables.
Both the KTA and HHG values have been used as an affinity metric in place of CCGA_N. Further, LapSVM and LapRLSC were executed for both methods on all benchmark datasets. The performance of all three non-parametric affinity methods has been compared on the mean classification error over both the unlabeled and test sets. The mean errors for the test and unlabeled sets are reported in Table 11 and Table 12. On the USPS dataset, for both the test and unlabeled sets, CCGA_N outperformed both KTA and HHG by achieving an accuracy of > 98%. KTA also gave a close result and remained in second position, whereas HHG gave poor accuracy for LapSVM (> 71%). In the case of the HASY v2 dataset, the results of all three non-parametric affinity methods remained close and CCGA_N managed to marginally outperform the others. A similar trend can be observed on other datasets like MNIST, BCI HaLT, CIFAR-10 and the Lego Bricks dataset, though for all these datasets the classification accuracy of CCGA_N remained only marginally better than KTA and HHG. On the remaining datasets, out of the three non-parametric affinity methods, HHG gave the least accurate results, whereas the accuracy results of KTA and CCGA_N remained very close. The non-parametric statistical evaluation carried out on the various datasets in terms of classification accuracy suggests that the HSIC based affinity gives more accurate intrinsic information. From Tables 11 and 12, it can be concluded that KTA and CCGA_N performed similarly in classification accuracy, and the same trend was repeated for all datasets. However, the HHG based error values remained very high in comparison to both KTA and CCGA_N. The reason for this poor performance of HHG is its Euclidean distance based metric, which proves to be inefficient in encoding the relations between the data points accurately.
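For reference, a minimal sketch of the KTA alignment score used in this comparison is given below; the construction of the two kernel matrices (here RBF kernels with illustrative bandwidths) is an assumption for the example.

```python
import numpy as np

def rbf_gram(X, sigma):
    """RBF kernel (Gram) matrix over the rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kta(K1, K2):
    """Alignment A(K1, K2) = <K1, K2>_F / (||K1||_F * ||K2||_F)."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

# usage: alignment between two kernels on the same data, e.g. two bandwidths
# X = np.random.rand(100, 5)
# score = kta(rbf_gram(X, sigma=0.5), rbf_gram(X, sigma=2.0))
```

The score lies in [0, 1] for positive semidefinite kernels, with larger values indicating stronger agreement between the two kernels on the same pairs of points.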
In this paper, we proposed a Cross-Covariance based Graph Affinity between two data points that incorporates nonlinear dependence, the influence of neighboring data points, and the neighborhoods around data points. The first technique, CCGA_L, tried to include the influence of the data points in the common neighborhood in the affinity metric between two points. The second technique, CCGA_N, took into consideration the influence of the neighborhoods of data points that are adjacent to both points. Both techniques were able to determine the linear and nonlinear affinity between pairs of data points. The classification accuracy and dimensionality reduction results on both real world and synthetic datasets clearly suggest that the influence of adjacent points must be considered in the pairwise affinity. The classification accuracy on different categories of datasets, like object detection, handwritten character, and scene detection datasets, was found to be consistently better in comparison to the other methods by a margin of ≈ 1% to 5% under both LapSVM and LapRLSC. CCGA_N has better classification accuracy and dimensionality reduction results, which suggests that nonlinear dependence and neighborhoods have a significant influence on the affinity between data points.

References:
Self-tuning spectral clustering
Learning spectral clustering, with application to speech separation
Dominant sets and pairwise clustering
Consensus of k-NNs for robust neighborhood selection on graph-based manifolds
Clustering and projected clustering with adaptive neighbors
Constructing robust affinity graphs for spectral clustering
On density based transforms for uncertain data mining
A density-penalized distance measure for clustering
Density-sensitive semisupervised inference
Learning neural representations for network anomaly detection
Social distance metric: from coordinates to neighborhoods
Learning context-sensitive shape similarity by graph transduction
Semisupervised local multi-manifold isomap by linear embedding for feature extraction
Locally constrained diffusion process on locally densified distance spaces with applications to shape retrieval
Affinity learning with diffusion on tensor product graph
Deep self-representative concept factorization network for representation learning
Robust neighborhood preserving projection by nuclear/l2,1-norm regularization for image feature extraction
Joint label prediction based semi-supervised adaptive concept factorization for robust data representation
A parameter-free affinity based clustering
Learning robust affinity graph representation for multi-view clustering
Fuzzy based affinity learning for spectral clustering
Theory of reproducing kernels
Feature selection via dependence maximization
Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces
Feature space independent semisupervised domain adaptation via kernel matching
From transformation-based dimensionality reduction to feature selection
Kernel feature selection via conditional covariance minimization
Local tangent space alignment based on Hilbert-Schmidt independence criterion regularization
Sparse Hilbert-Schmidt independence criterion and surrogate-kernel-based feature selection for hyperspectral image classification
On a duality between metrics and σ-proximities (arXiv preprint math/0508183)
Ensemble manifold regularization
Linear local tangent space alignment and application to face recognition
Nonlinear dimensionality reduction by locally linear embedding
Global versus local methods in nonlinear dimensionality reduction
Laplacian eigenmaps and spectral techniques for embedding and clustering
Hessian eigenmaps: locally linear embedding techniques for high-dimensional data
Principal component analysis (Chemometrics and Intelligent Laboratory Systems)
A database for handwritten text recognition research
A large electroencephalographic motor imagery dataset for electroencephalographic brain computer interfaces
Learning multiple layers of features from tiny images
Bag-of-visual-words and spatial extensions for land-use classification
Recognizing indoor scenes
Effects of degradations on deep neural network architectures
On kernel-target alignment