title: SLGAT: Soft Labels Guided Graph Attention Networks
authors: Wang, Yubin; Zhang, Zhenyu; Liu, Tingwen; Guo, Li
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47426-3_40

abstract: Graph convolutional neural networks have been widely studied for semi-supervised classification on graph-structured data in recent years. They usually learn node representations by transforming, propagating, and aggregating node features and minimizing the prediction loss on labeled nodes. However, the pseudo labels generated on unlabeled nodes are usually overlooked during the learning process. In this paper, we propose a soft labels guided graph attention network (SLGAT) to improve the performance of node representation learning by leveraging generated pseudo labels. Unlike prior graph attention networks, SLGAT uses soft labels as guidance to learn different weights for neighboring nodes, which allows it to pay more attention to the features closely related to the central node labels during the feature aggregation process. We further propose a self-training based optimization method to train SLGAT on both labeled and pseudo labeled nodes. Specifically, we first pre-train SLGAT on labeled nodes and generate pseudo labels for unlabeled nodes. Then, in each iteration, we train SLGAT on the combination of labeled and pseudo labeled nodes and generate new pseudo labels for further training. Experimental results on semi-supervised node classification show that SLGAT achieves state-of-the-art performance.

In recent years, graph convolutional neural networks (GCNs) [26], which can learn from graph-structured data, have attracted much attention. The general approach with GCNs is to learn node representations by passing, transforming, and aggregating node features across the graph. The generated node representations can then be used as input to a prediction layer for various downstream tasks, such as node classification [12], graph classification [30], link prediction [17] and social recommendation [19]. Graph attention networks (GAT) [23], which is one of the most representative GCNs, learns the weights for neighborhood aggregation via a self-attention mechanism [22] and achieves promising performance on the semi-supervised node classification problem. The model is expected to learn to pay more attention to the important neighbors. However, it calculates importance scores between connected nodes based solely on the node representations, and the label information of nodes is usually overlooked. Besides, the cluster assumption [3] for semi-supervised learning states that the decision boundary should lie in regions of low density. This means that aggregating features from nodes with different classes could reduce the generalization performance of the model. This motivates us to introduce label information to improve the performance of node classification in the following two aspects: (1) we introduce soft labels to guide the feature aggregation for generating discriminative node embeddings for classification; (2) we use SLGAT to predict pseudo labels for unlabeled nodes and further train SLGAT on the combination of labeled and pseudo labeled nodes. In this way, SLGAT can benefit from unlabeled data.

In this paper, we propose soft labels guided graph attention networks (SLGAT) for semi-supervised node representation learning. The learning process consists of two main steps.
First, SLGAT aggregates the features of neighbors using a convolutional network and predicts soft labels for each node based on the learned embeddings. Then, it uses the soft labels to guide the feature aggregation via an attention mechanism. Unlike prior graph attention networks, SLGAT pays more attention to the features closely related to the central node labels. The weights for neighborhood aggregation are learned by a feedforward neural network based on both the label information of central nodes and the features of neighboring nodes, which leads to more discriminative node representations for classification.

We further propose a self-training based optimization method to improve the generalization performance of SLGAT using unlabeled data. Specifically, we first pre-train SLGAT on labeled nodes using the standard cross-entropy loss. Then we generate pseudo labels for unlabeled nodes using SLGAT. Next, in each iteration, we train SLGAT using a combined cross-entropy loss on both labeled nodes and pseudo labeled nodes, and then generate new pseudo labels for further training. In this way, SLGAT can benefit from unlabeled data by minimizing the entropy of predictions on unlabeled nodes.

We conduct extensive experiments on semi-supervised node classification to evaluate our proposed model. Experimental results on several datasets show that SLGAT achieves state-of-the-art performance. The source code of this paper can be obtained from https://github.com/jadbin/SLGAT.

Graph-Based Semi-supervised Learning. A large number of methods for semi-supervised learning using graph representations have been proposed in recent years, most of which can be divided into two categories: graph regularization-based methods and graph embedding-based methods. Different graph regularization-based approaches use different variants of the regularization term, among which the graph Laplacian regularizer is most commonly used in previous studies, including label propagation [32], local and global consistency regularization [31], manifold regularization [1] and deep semi-supervised embedding [25]. Recently, graph embedding-based methods inspired by the skip-gram model [14] have attracted much attention. DeepWalk [16] samples node sequences via uniform random walks on the network, and then learns embeddings via the prediction of the local neighborhood of nodes. Afterward, a large number of works including LINE [21] and node2vec [8] extended DeepWalk with more sophisticated random walk schemes. Such embedding-based methods require a two-step pipeline including embedding learning and semi-supervised training, where each step has to be optimized separately. Planetoid [29] alleviates this by incorporating label information into the process of learning embeddings.

Graph Convolutional Neural Networks. Recently, graph convolutional neural networks (GCNs) [26] have been successfully applied in many applications. Existing GCNs are often categorized as spectral methods and non-spectral methods. Spectral methods define graph convolution based on spectral graph theory. Early studies [2, 10] developed convolution operations based on the graph Fourier transform. Defferrard et al. [4] used polynomial spectral filters to reduce the computational cost. Kipf & Welling [12] then simplified the previous method by using a linear filter operating on one-hop neighboring nodes. Wu et al. [27] used graph wavelets to implement localized convolution. Xu et al.
[27] used a heat kernel to enhance low-frequency filters and enforce smoothness in the signal variation on the graph. Along with spectral graph convolution, defining the graph convolution in the spatial domain has also been investigated by many researchers. GraphSAGE [9] applies various aggregators, such as mean-pooling, over a fixed-size neighborhood of each node. Monti et al. [15] provided a unified framework that generalizes various GCNs. GraphSGAN [5] generates fake samples and trains generator-classifier networks in an adversarial learning setting. Instead of using fixed weights for aggregation, graph attention networks (GAT) [23] adopt an attention mechanism to learn the relative weights between two connected nodes. Wang et al. [24] generalized GAT to learn representations of heterogeneous networks using meta-paths. The shortest path graph attention network (SPAGAN) [28] explores high-order, path-based attention. Our method is based on spatial graph convolution. Unlike existing graph attention networks, we introduce soft labels to guide the feature aggregation of neighboring nodes, and experiments show that this can further improve semi-supervised classification performance.

In this paper, we focus on the problem of semi-supervised node classification. Many other applications can be reformulated into this fundamental problem. Let G = (V, E) be a graph, in which V is a set of nodes and E is a set of edges. Each node u ∈ V has an attribute vector x_u. Given a few labeled nodes V_L ⊆ V, where each node u ∈ V_L is associated with a label y_u ∈ Y, the goal is to predict the labels for the remaining unlabeled nodes V_U = V \ V_L.

In this section, we give more details of SLGAT. The overall structure of SLGAT is shown in Fig. 1. The learning process of our method consists of two main steps. We first use a multi-layer graph convolutional network to generate soft labels for each node based on the node features. We then leverage the soft labels to guide the feature aggregation via an attention mechanism to learn better representations of nodes. Furthermore, we develop a self-training based optimization method to train SLGAT on the combination of labeled nodes and pseudo labeled nodes. This ensures that SLGAT can further benefit from the unlabeled data under the semi-supervised learning setting.

In the initial phase, we need to first predict the pseudo labels for each node based on the node features x. The pseudo labels can be soft (a continuous distribution) or hard (a one-hot distribution). In practice, we observe that soft labels are usually more stable than hard labels, especially when the model has low prediction accuracy. Since the labels predicted by the model are not absolutely correct, the error from hard labels may propagate to the inference on other labels and hurt the performance, while using soft labels can alleviate this problem.

We use a multi-layer graph convolutional network [12] to aggregate the features of neighboring nodes. The layer-wise propagation rule of the feature convolution is as follows:

f^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} f^{(l)} W_f^{(l)}\right)    (1)

Here, \tilde{A} = A + I is the adjacency matrix with added self-connections, I is the identity matrix, \tilde{D} is the degree matrix of \tilde{A}, and W_f^{(l)} is a layer-specific trainable transformation matrix. \sigma(\cdot) denotes an activation function such as ReLU, and f^{(l)} denotes the hidden representations of nodes in the l-th layer. The representations f^{(l+1)} are obtained by aggregating information from the features f^{(l)} of neighboring nodes. Initially, f^{(0)} = x.
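To make the feature convolution concrete, below is a minimal PyTorch sketch of one layer of Eq. 1 for a dense adjacency matrix. It is an illustration under our own assumptions (module and variable names, dense matrices, ReLU as the nonlinearity), not the authors' released implementation.

```python
# Minimal sketch of the feature convolution in Eq. 1 (GCN-style propagation).
# Assumes a dense adjacency matrix; names and choices are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adjacency(adj: torch.Tensor) -> torch.Tensor:
    """Compute D̃^{-1/2} (A + I) D̃^{-1/2} for a dense adjacency matrix A."""
    a_tilde = adj + torch.eye(adj.size(0))
    deg = a_tilde.sum(dim=1)
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    return d_inv_sqrt @ a_tilde @ d_inv_sqrt

class FeatureConvolution(nn.Module):
    """One layer of Eq. 1: f^{(l+1)} = sigma(Â f^{(l)} W_f^{(l)})."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W_f^{(l)}

    def forward(self, a_hat: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        # a_hat is the normalized adjacency; f holds the node representations f^{(l)}.
        return F.relu(a_hat @ self.weight(f))
```

Stacking L such layers and applying a row-wise softmax to the final output f^{(L)} yields the soft labels of Eq. 2 below.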
After going through L layers of feature convolution, we predict the soft labels for each node u based on the output embeddings:

\hat{y}_u = \mathrm{softmax}\left(f_u^{(L)}\right)    (2)

Now we present how to leverage the previously generated soft labels of each node to guide the feature aggregation via an attention mechanism. The attention network consists of several stacked layers. In each layer, we first aggregate the label information of neighboring nodes. Then we learn the weights for neighborhood aggregation based on both the aggregated label information of central nodes and the feature embeddings of neighboring nodes.

We use a label convolution unit to aggregate the label information of neighboring nodes, and the layer-wise propagation rule is as follows:

g^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} g^{(l)} W_g^{(l)}\right)    (3)

where W_g^{(l)} is a layer-specific trainable transformation matrix, and g^{(l)} \in \mathbb{R}^{|V| \times d_g^{(l)}} denotes the hidden representations of the label information of nodes. The label information g^{(l+1)} is obtained by aggregating the label information g^{(l)} of neighboring nodes. Initially, g^{(0)} = \mathrm{softmax}(f^{(L)}) according to Eq. 2.

Then we use the aggregated label information to guide the feature aggregation via an attention mechanism. Unlike prior graph attention networks [23, 28], we use label information as guidance to learn the weights of neighboring nodes for feature aggregation, which enforces the model to pay more attention to the features closely related to the labels of the central nodes. A single-layer feedforward neural network is applied to calculate the attention scores between connected nodes based on the central node label information g^{(l+1)} and the neighboring node features h^{(l)}:

s_{ij}^{(l)} = \mathrm{LeakyReLU}\left(a^{\top}\left[W_a^{(l)} g_i^{(l+1)} \,\|\, W_b^{(l)} h_j^{(l)}\right]\right)    (4)

where a is the weight vector of the feedforward network, W_a^{(l)} and W_b^{(l)} are layer-specific trainable transformation matrices, h^{(l)} \in \mathbb{R}^{|V| \times d_h^{(l)}} denotes the hidden representations of node features, \cdot^{\top} represents transposition and \| is the concatenation operation. Then we obtain the attention weights by normalizing the attention scores with the softmax function:

\alpha_{ij}^{(l)} = \frac{\exp\left(s_{ij}^{(l)}\right)}{\sum_{k \in N_i} \exp\left(s_{ik}^{(l)}\right)}    (5)

where N_i is the neighborhood of node i in the graph. Then, the embedding of node i can be aggregated from the projected features of its neighbors with the corresponding coefficients as follows:

h_i^{(l+1)} = \sigma\left(\sum_{j \in N_i} \alpha_{ij}^{(l)} W_b^{(l)} h_j^{(l)}\right)    (6)

Finally, we can achieve better predictions for the label of each node u by replacing Eq. 2 as follows:

\hat{y}_u = \mathrm{softmax}\left(f_u^{(L)} \oplus h_u^{(L)}\right)    (7)

where \oplus is the mean-pooling aggregator.

Grandvalet & Bengio [7] argued that adding an extra loss to minimize the entropy of predictions on unlabeled data can further improve the generalization performance for semi-supervised learning. Thus we estimate pseudo labels for unlabeled nodes based on the learned node representations, and develop a self-training based optimization method to train SLGAT on both labeled and pseudo labeled nodes. In this way, SLGAT can further benefit from the unlabeled data.

For semi-supervised node classification, we can minimize the cross-entropy loss over all labeled nodes between the ground truth and the prediction:

\mathcal{L}_{labeled} = -\sum_{u \in V_L} \sum_{c=1}^{C} y_{uc} \ln \hat{y}_{uc}    (8)

where C is the number of classes. To achieve training on the combination of labeled and unlabeled nodes, we first estimate the labels of unlabeled nodes using the learned node embeddings as follows:

\tilde{y}_{uc} = \frac{\hat{y}_{uc}^{1/\tau}}{\sum_{c'=1}^{C} \hat{y}_{uc'}^{1/\tau}}    (9)

where \tau is an annealing parameter. We can set \tau to a small value (e.g. 0.1) to further reduce the entropy of the pseudo labels. Then the loss for minimizing the entropy of predictions on unlabeled data can be defined as:

\mathcal{L}_{unlabeled} = -\sum_{u \in V_U} \sum_{c=1}^{C} \tilde{y}_{uc} \ln \hat{y}_{uc}    (10)

where V_U is the set of unlabeled nodes. The joint objective function is defined as a weighted linear combination of the loss on labeled nodes and unlabeled nodes:

\mathcal{L} = \mathcal{L}_{labeled} + \lambda \mathcal{L}_{unlabeled}    (11)

where \lambda is a weight balance factor. We give a self-training based method to train SLGAT, which is listed in Algorithm 1.
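As a concrete illustration of the soft labels guided attention (Eqs. 4-6 above), the following is a minimal PyTorch sketch of one attention layer operating on a dense adjacency matrix. The class and parameter names (SoftLabelGuidedAttention, w_a, w_b, attn), the ELU nonlinearity, and the dense-matrix formulation are our own assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of one soft-labels guided attention layer (Eqs. 4-6).
# Assumes adj is dense and already contains self-connections; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftLabelGuidedAttention(nn.Module):
    def __init__(self, label_dim: int, feat_dim: int, out_dim: int):
        super().__init__()
        self.w_a = nn.Linear(label_dim, out_dim, bias=False)  # projects central-node label info g
        self.w_b = nn.Linear(feat_dim, out_dim, bias=False)   # projects neighbor features h
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)     # single-layer scorer, i.e. a^T[. || .]

    def forward(self, adj: torch.Tensor, g: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        n = adj.size(0)
        g_proj = self.w_a(g)   # [N, out_dim], label guidance of central nodes
        h_proj = self.w_b(h)   # [N, out_dim], projected neighbor features
        # Pairwise scores s_ij from central label info g_i and neighbor features h_j (Eq. 4).
        pairs = torch.cat(
            [g_proj.unsqueeze(1).expand(n, n, -1), h_proj.unsqueeze(0).expand(n, n, -1)], dim=-1
        )
        scores = F.leaky_relu(self.attn(pairs).squeeze(-1))   # [N, N]
        # Restrict scores to graph neighbors and normalize them (Eq. 5).
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=1)
        # Aggregate projected neighbor features with the attention weights (Eq. 6).
        return F.elu(alpha @ h_proj)  # ELU chosen here as the nonlinearity; an assumption
```

In this sketch the attention weights depend on the central node only through its aggregated label information g, which is the point of the guidance: neighbors whose features align with the central node's (soft) label receive larger weights.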
The inputs to the algorithm are both labeled and unlabeled nodes. We first use the labeled nodes to pre-train the model with the cross-entropy loss. Then we use the model to generate pseudo labels on unlabeled nodes. Afterward, we train the model by minimizing the combined cross-entropy loss on both labeled and unlabeled nodes. Finally, we iteratively generate new pseudo labels and further train the model.

In this section, we evaluate our proposed SLGAT on the semi-supervised node classification task using several standard benchmarks. We also conduct an ablation study on SLGAT to investigate the contribution of various components to the performance improvements.

We follow existing studies [12, 23, 29] and use three standard citation network benchmark datasets for evaluation: Cora, Citeseer and Pubmed. In all these datasets, nodes represent documents and edges are citation links. Node features correspond to elements of a bag-of-words representation of a document. Class labels correspond to research areas and each node has a class label. In each dataset, 20 nodes from each class are treated as labeled data. The statistics of the datasets are summarized in Table 1.

Table 1. Dataset statistics.

Dataset  | Nodes  | Edges  | Features | Classes | Training | Validation | Test
Cora     | 2,708  | 5,429  | 1,433    | 7       | 140      | 500        | 1,000
Citeseer | 3,327  | 4,732  | 3,703    | 6       | 120      | 500        | 1,000
Pubmed   | 19,717 | 44,338 | 500      | 3       | 60       | 500        | 1,000

We compare against several traditional graph-based semi-supervised classification methods, including manifold regularization (ManiReg) [1], semi-supervised embedding (SemiEmb) [25], label propagation (LP) [32], graph embeddings (DeepWalk) [16], the iterative classification algorithm (ICA) [13] and Planetoid [29]. Furthermore, since graph neural networks have proved to be effective for semi-supervised classification, we also compare with several state-of-the-art graph neural networks, including ChebyNet [4], MoNet [15], graph convolutional networks (GCN) [12], graph attention networks (GAT) [23], graph wavelet neural network (GWNN) [27], shortest path graph attention network (SPAGAN) [28] and graph convolutional networks using heat kernel (GraphHeat) [27].

We train a two-layer SLGAT model for semi-supervised node classification and evaluate the performance using prediction accuracy. The partition of the datasets is the same as in the previous studies [12, 23, 29], with an additional validation set of 500 labeled samples to determine hyper-parameters. Weights are initialized following Glorot and Bengio [6]. We adopt the Adam optimizer [11] for parameter optimization with an initial learning rate of 0.05 and weight decay of 0.0005. We set the hidden layer size of features to 32 for Cora and Citeseer and 16 for Pubmed, and the hidden layer size of soft labels to 16 for Cora and Citeseer and 8 for Pubmed. We apply dropout [20] with p = 0.5 to both layers' inputs, as well as to the normalized attention coefficients.

The proper setting of λ in Eq. 11 affects the semi-supervised classification performance. If λ is too large, it disturbs training on the labeled nodes; whereas if λ is too small, we cannot benefit from the unlabeled data. In our experiments, we set λ = 1. We anticipate that the results could be further improved by using sophisticated scheduling strategies such as deterministic annealing [7], and we leave this as future work. Furthermore, inspired by dropout [20], we ignore the loss in Eq. 10 with probability p = 0.5 during training to prevent overfitting on pseudo labeled nodes.
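The sketch below ties the training procedure of Algorithm 1 to the settings just described (λ = 1, the unlabeled loss of Eq. 10 randomly ignored with probability 0.5, pseudo labels sharpened with τ as in Eq. 9). The model interface (model(features, adj) returning log-probabilities), the helper names, and the epoch/round counts are hypothetical choices for illustration, not taken from the paper or its code.

```python
# Sketch of the self-training procedure (Algorithm 1) under the settings above.
# model(features, adj) -> log-probabilities is an assumed interface; counts are illustrative.
import random
import torch
import torch.nn.functional as F

def sharpen(probs: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Eq. 9: reduce the entropy of predicted distributions with temperature tau."""
    p = probs.pow(1.0 / tau)
    return p / p.sum(dim=1, keepdim=True)

def train_slgat(model, optimizer, features, adj, labels, idx_labeled, idx_unlabeled,
                pretrain_epochs=200, rounds=10, epochs_per_round=100, lam=1.0):
    # 1) Pre-train on labeled nodes with the standard cross-entropy loss.
    for _ in range(pretrain_epochs):
        optimizer.zero_grad()
        log_probs = model(features, adj)
        F.nll_loss(log_probs[idx_labeled], labels[idx_labeled]).backward()
        optimizer.step()
    # 2) Iteratively generate pseudo labels and train on the combined loss (Eq. 11).
    for _ in range(rounds):
        with torch.no_grad():
            pseudo = sharpen(model(features, adj).exp()[idx_unlabeled])
        for _ in range(epochs_per_round):
            optimizer.zero_grad()
            log_probs = model(features, adj)
            loss = F.nll_loss(log_probs[idx_labeled], labels[idx_labeled])
            if random.random() > 0.5:  # ignore the unlabeled loss with p = 0.5
                loss = loss + lam * (-(pseudo * log_probs[idx_unlabeled]).sum(dim=1)).mean()
            loss.backward()
            optimizer.step()
    return model
```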
We now validate the effectiveness of SLGAT on the semi-supervised node classification task. Following the previous studies [12, 23, 29], we use classification accuracy as the evaluation metric. Experimental results are summarized in Table 2. We present the mean classification accuracy (with standard deviation) of our method over 100 runs, and we reuse the results already reported in [5, 12, 23, 27, 28] for the baselines. We can observe that SLGAT achieves consistently better performance than all baselines. When directly compared to GAT, SLGAT gains 1.0%, 2.3% and 3.2% improvements for Cora, Citeseer and Pubmed, respectively. The performance gain comes from two sources. First, SLGAT uses soft labels to guide the feature aggregation of neighboring nodes, which leads to more discriminative node representations. Second, SLGAT is trained on both labeled and pseudo labeled nodes using our proposed self-training based optimization method, so it benefits from unlabeled data by minimizing the entropy of predictions on unlabeled nodes.

Following Shchur et al. [18], we further validate the effectiveness and robustness of SLGAT on random data splits. We created 10 random splits of Cora, Citeseer and Pubmed with the same sizes of training, validation and test sets as the standard split from Yang et al. [29]. We compare SLGAT with the most related competitive baselines, GCN [12] and GAT [23], on these random data splits. We run each method with 10 random seeds on each data split and report the overall mean accuracy in Table 3. We can observe that SLGAT consistently outperforms GCN and GAT on all datasets, which proves the effectiveness and robustness of SLGAT.

In this section, we conduct an ablation study to investigate the effectiveness of our proposed soft labels guided attention mechanism and the self-training based optimization method for SLGAT. We compare several variants of SLGAT on node classification, and the results are reported in Table 4. We observe that SLGAT performs better than the methods without soft labels guided attention in most cases. This demonstrates that using soft labels to guide the aggregation of neighboring nodes is effective for generating better node embeddings. Note that the attention mechanism seems to contribute little to the performance on Pubmed when self-training is used. The reason behind this phenomenon is still under investigation; we presume that it is due to the label sparsity of Pubmed. A similar phenomenon is reported in [23], where GAT shows little improvement on Pubmed compared to GCN. We also observe that SLGAT significantly outperforms all the methods without self-training. This indicates that our proposed self-training based optimization method is highly effective in improving the generalization performance of the model for semi-supervised classification.

In this work, we propose SLGAT for semi-supervised node representation learning. SLGAT uses soft labels to guide the feature aggregation of neighboring nodes for generating discriminative node representations. A self-training based optimization method is proposed to train SLGAT on both labeled data and pseudo labeled data, which is effective in improving the generalization performance of SLGAT. Experimental results demonstrate that SLGAT achieves state-of-the-art performance on several semi-supervised node classification benchmarks. One direction for future work is to make SLGAT deeper to capture the features of long-range neighbors, which may help to improve performance on datasets with sparse labels.
References

[1] Manifold regularization: a geometric framework for learning from labeled and unlabeled examples
[2] Spectral networks and locally connected networks on graphs
[3] Cluster kernels for semi-supervised learning
[4] Convolutional neural networks on graphs with fast localized spectral filtering
[5] Semi-supervised learning on graphs with generative adversarial nets
[6] Understanding the difficulty of training deep feedforward neural networks
[7] Semi-supervised learning by entropy minimization
[8] node2vec: scalable feature learning for networks
[9] Inductive representation learning on large graphs
[10] Deep convolutional networks on graph-structured data
[11] Adam: a method for stochastic optimization
[12] Semi-supervised classification with graph convolutional networks
[13] Link-based classification
[14] Efficient estimation of word representations in vector space
[15] Geometric deep learning on graphs and manifolds using mixture model CNNs
[16] DeepWalk: online learning of social representations
[17] SchNet: a continuous-filter convolutional neural network for modeling quantum interactions
[18] Pitfalls of graph neural network evaluation
[19] Deep collaborative filtering with multi-aspect information in heterogeneous networks
[20] Dropout: a simple way to prevent neural networks from overfitting
[21] LINE: large-scale information network embedding
[22] Attention is all you need
[23] Graph attention networks
[24] Heterogeneous graph attention network
[25] Deep learning via semi-supervised embedding
[26] A comprehensive survey on graph neural networks
[27] Graph wavelet neural network. In: International Conference on Learning Representations (ICLR)
[28] SPAGAN: shortest path graph attention network
[29] Revisiting semi-supervised learning with graph embeddings
[30] An end-to-end deep learning architecture for graph classification
[31] Learning with local and global consistency
[32] Semi-supervised learning using Gaussian fields and harmonic functions