key: cord-0043200-3ocr7481 authors: Park, Hogun; Neville, Jennifer title: Role Equivalence Attention for Label Propagation in Graph Neural Networks date: 2020-04-17 journal: Advances in Knowledge Discovery and Data Mining DOI: 10.1007/978-3-030-47436-2_42 sha: 53abf30c7822a14a271ae599b094d3e6c92ad4d3 doc_id: 43200 cord_uid: 3ocr7481

Semi-supervised relational learning methods aim to classify nodes in a partially-labeled graph. While popular, existing methods using Graph Neural Networks (GNN) for semi-supervised relational learning have mainly focused on learning node representations by aggregating nearby attributes, and it is still challenging to leverage inferences about unlabeled nodes with few attributes, particularly when trying to exploit higher-order relationships in the network efficiently. To address this, we propose a Graph Neural Network architecture that incorporates patterns among the available class labels and uses (1) a Role Equivalence attention mechanism and (2) a mini-batch importance sampling method to improve efficiency when learning over high-order paths. In particular, our Role Equivalence attention mechanism is able to use nodes' roles to learn how to focus on relevant distant neighbors, in order to adaptively reduce the increased noise that occurs when higher-order structures are considered. In experiments on six different real-world datasets, we show that our model (REGNN) achieves significant performance gains compared to other recent state-of-the-art baselines, particularly when higher-order paths are considered in the models.

Semi-supervised relational learning methods aim to classify unlabeled nodes in a partially-labeled graph by leveraging information about both the labeled and unlabeled nodes, and their connectivity. In particular, the methods exploit relational dependencies in the graph to jointly make predictions about unlabeled nodes. Prior work on semi-supervised learning in graphs has typically defined relational features via aggregation over the features of neighboring nodes, and then unknown class labels are inferred iteratively using approximate inference algorithms (e.g., Label Propagation (LP) [22], Gibbs sampling [15]). However, many previous methods are limited in their ability to leverage complex neighborhood patterns for learning and inference. For example, while LP works well in simple scenarios, it only exploits direct edges to make predictions on unlabeled examples. Although information can propagate across the graph, messages are only passed among direct neighbors, which can be inadequate in complex, sparse graphs with few labels. Recently, Graph Convolution Networks (GCNs) [10] were proposed to exploit message passing functions that aggregate over nearby neighbors and learn a latent representation of each node, which is then used for predicting node labels. However, GCNs mainly aim to learn neighbors' attribute patterns; they are not typically used in partially-labeled graphs with few attributes. In this case, collective inference is needed during learning, so that patterns among neighbor class labels can also be used in the model. In this work, we propose REGNN, a graph neural network architecture for enhancing the accuracy of label propagation, which uses a role equivalence attention mechanism to facilitate reasoning with higher-order relationships among labeled nodes.
To move beyond direct neighbors and exploit longer-range information in sparse graphs, recent work has effectively incorporated higher-order relationships and paths into relational models (e.g., high-order GCNs [1] and GraRep [2]). Our proposed approach, REGNN, considers high-order (i.e., k-th order) proximity matrices and extends existing high-order-based GCNs to leverage inferences about unlabeled nodes via neighborhoods at various distances. Since reasoning with higher-order paths (i.e., large k) increases the computational complexity of learning, we propose a more efficient mini-batch learning method. Incorporating higher-order paths can increase the relational signal by considering nearby but not directly linked nodes; however, it can also increase noise due to spurious connections as the neighborhood size increases. To account for this and enable the model to learn which distant nodes are more relevant, we propose a novel attention mechanism based on role equivalence, a social network property that quantifies similarity among nodes based on their relational context. The attention mechanism is used to merge the multiple node representations learned from the set of high-order-based GCNs. Our experiments show that attention based on role equivalence works significantly better in the context of label propagation.

Figure 1a-b shows examples of how high-order paths and attention can help capture roles in label propagation scenarios. Each node and edge indicates a user and an interaction during a semester, respectively. Note that colors represent class labels: yellow for student, blue for faculty, and green for staff. In Figure 1a, we are trying to predict the label of user D, whose true label is faculty. When we use just direct neighbors, it is not possible to predict the true label of node D by label propagation. However, once high-order paths from 2nd-order neighbors are considered, the label of D can be predicted successfully. As in this example, high-order paths are potentially useful for learning underlying hierarchical roles, such as advisor-student relationships in a citation network or admin-member relationships in a university group on Facebook. Simply stacking GCN layers multiple times also makes it difficult to learn this kind of information, because each aggregation is always over direct neighbors. Meanwhile, if a user is surrounded by nodes with diverse class labels, the sheer number of nearby labels can often mislead prediction. In this case, if latent representations are known or estimated, the model can put more importance on neighbors with similar representations. In Fig. 1b, we are trying to predict the label of node F, who is also a faculty member. Although node F has more students and staff as neighbors, nodes G and H are likely to have representations similar to that of node F in the latent space, so they can be weighted more heavily during aggregation. Our aim is to have REGNN exploit both of these ideas, by combining high-order paths with a role similarity-based attention layer.
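To make the Figure 1a intuition concrete, the following sketch (our own Python/NumPy illustration, with a small hypothetical graph standing in for the figure, not the authors' exact example) contrasts a one-step label vote over direct neighbors with a vote over the 2nd-order proximity matrix A^2: the direct neighbors point to the wrong class, while the 2-hop neighborhood recovers the assumed true one.

```python
# Minimal sketch (not the paper's code): why k-th order proximity can help
# label propagation. The toy graph and labels are hypothetical stand-ins.
import numpy as np

# Adjacency matrix of a small undirected graph: node 0 is unlabeled, its
# direct neighbors carry one class, but its 2-hop neighborhood is dominated
# by the other class.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
], dtype=float)

# One-hot label matrix Y_L (rows of unlabeled nodes are all zeros).
Y_L = np.array([
    [0, 0],   # node 0: unlabeled (true class assumed to be 1 here)
    [1, 0],   # class 0
    [1, 0],   # class 0
    [0, 1],   # class 1
    [0, 1],   # class 1
    [0, 1],   # class 1
], dtype=float)

def propagate(A_k, Y):
    """One step of degree-normalized label averaging over A_k."""
    deg = A_k.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    return (A_k / deg) @ Y

A2 = A @ A               # the (i, j) entry counts 2-hop paths from i to j
np.fill_diagonal(A2, 0)  # drop trivial walks back to the start node for the vote

print("1st-order vote for node 0:", propagate(A, Y_L)[0])   # favors class 0
print("2nd-order vote for node 0:", propagate(A2, Y_L)[0])  # favors class 1
```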
Semi-Supervised Node Classification. Previous semi-supervised node classification algorithms learn a model to predict unknown class labels in a partially labeled graph. For example, LP [22] and ICA [15] estimate labels of unlabeled nodes using local inference. Recently, graph embedding methods have been proposed to learn low-dimensional representations of nodes by leveraging relational properties such as random walks (e.g., Node2Vec [8]), high-order paths (e.g., GraRep [2] and NEU [19]), and structural similarities (e.g., Struc2Vec [13]).

Graph Neural Networks (GNN). In addition, graph neural network architectures (e.g., GCN [10]) have also attracted a lot of attention. Recently, high-order path information was also incorporated into GCNs. HA-GCN [23] and N-GCN [1] proposed joint graph neural network architectures that take attributed high-order proximity matrices as input. However, while high-order GCNs show more robust performance on node classification, computing the high-order paths from the proximity matrices can be quite inefficient.

Attention-Based GNNs. Graphs are often complex and noisy, so many researchers have incorporated the concept of "attention" into semi-supervised classification. For example, GAT [18] proposed a self-attention-based graph neural network, which determines importance via edge-wise weighted sums over rich attributes. However, this attention mechanism has been shown to be less effective when the data contain attributes with low homophily or there are no attributes. Meanwhile, VAIN [9] proposed kernel-based attention mechanisms for multi-agent modeling; however, it exploits similarity between nodes only along direct edges in an attributed graph. In contrast, we design a novel similarity-based kernel based on the concept of role equivalence attention and extend it to incorporate neighbors at various distances (thus, high-order paths) in the setting of label propagation.

First, we recall the definitions of structural equivalence [11] and regular equivalence [5]. Let N(u) refer to the neighbors of node u. Structural equivalence [11] states that two nodes u and v are equivalent if they have identical neighborhoods, i.e., N(u) = N(v). Regular equivalence states that nodes play the same role if they have connections to nodes with similar roles [6]; there can be many valid ways of grouping nodes into equivalent role sets for a given graph, and regular equivalence is often defined recursively. Based on the above definitions, we approximate the notion of regular equivalence via positions in latent space: two nodes u and v are role equivalent if their latent representations coincide, i.e., f(u) = f(v) for a learned representation function f (Definition 3). Using Definition 3, we propose a graph neural network architecture with an attention layer based on role equivalence among nodes in the following section. Note that the term role (or structural) equivalence has also been used in many different ways (e.g., similarities in triangles, betweenness, k-paths, k-stars, k-cliques, subgraph patterns/graphlets, and feature-based MF). In this paper, we use the term role equivalence to refer to Definition 3.

Problem Formulation and Notation. Consider an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, and let A be the adjacency matrix of G. V is composed of V_L (labeled vertices) and V_U (unlabeled vertices). Y_L is a |V| × C class label matrix: for each labeled node i ∈ V_L with class label y_i = c, we set Y_L[i, c] = 1 and the remaining entries of row i to 0; rows of unlabeled nodes are all zeros. This Y_L is fed to REGNN together with the adjacency matrix A. Thus, the goal of REGNN is to estimate the class labels of V_U from Y_L and A, which is a transductive learning setting. To learn high-order path-based GCNs, we initially construct K different high-order GCNs. Adjacency matrices of different orders, A, A^2, ..., A^K, are fed as input to these GCNs. A^k is obtained by multiplying the adjacency matrix A with itself k times, so the (i, j) entry of A^k is the number of k-hop paths from i to j.
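As a concrete illustration of this setup, the following minimal sketch (our own NumPy illustration; the toy matrix, node count, and helper names are hypothetical) builds the one-hot label matrix Y_L and the stack A, A^2, ..., A^K described above.

```python
# Minimal sketch of REGNN's inputs: the one-hot label matrix Y_L and the
# k-th order adjacency matrices A, A^2, ..., A^K. Illustrative only.
import numpy as np

def build_label_matrix(num_nodes, num_classes, labeled):
    """Y_L[i, c] = 1 for each labeled node i with class c; unlabeled rows stay 0."""
    Y_L = np.zeros((num_nodes, num_classes))
    for node, c in labeled.items():
        Y_L[node, c] = 1.0
    return Y_L

def high_order_adjacencies(A, K):
    """Return [A^1, ..., A^K]; the (i, j) entry of A^k counts k-hop paths i -> j."""
    powers, A_k = [], np.eye(A.shape[0])
    for _ in range(K):
        A_k = A_k @ A
        powers.append(A_k)
    return powers

# Hypothetical usage: a 5-node ring, 2 classes, nodes 0 and 3 labeled.
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
Y_L = build_label_matrix(num_nodes=5, num_classes=2, labeled={0: 0, 3: 1})
A_powers = high_order_adjacencies(A, K=3)
```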
With these high-order adjacency matrices, we can define K × M high-order convolution operators. The node representations in the m-th layer with adjacency matrix A^k are formulated as

H_k^(m+1) = σ( D̃_k^(-1/2) (A^k + I) D̃_k^(-1/2) H_k^(m) W_k^(m) ),   (1)

where H_k^(1) = Y_L represents the known class labels, which are fed to the first GCN layers. W_k^(m) is a trainable weight matrix for the m-th layer in the k-th order GCN, and D̃_k is the degree matrix of A^k + I. The symmetric normalization in Eq. (1) takes the average of neighboring nodes' representations under each of the high-order adjacency matrices. s^(m) is the input representation size at the m-th layer and s^(m+1) is the size of the output representation for the next layer. Therefore, W_k^(1) ∈ R^(C × s^(2)), where C is the number of class labels; this indicates that the propagated labels are transformed by the matrix multiplication. The representation of the last GCN layer, H_k^(M+1), is additionally passed through another softmax function to normalize the latent representations.

The last layers of REGNN play an important role in jointly learning multiple representations via different high-order paths. In this paper, our role equivalence attention uniquely merges their characteristics in the following ways.

Concatenation Layer. Outputs from the previous high-order GCNs are concatenated before they are fed into the self-attention layer. There are K outputs, one from each of the high-order GCNs, and they are concatenated along the representation (column) axis, yielding q_i^con = [H_1^(M+1)(i,:) || ... || H_K^(M+1)(i,:)] for each node i.

Attention Layer. The self-attention layer measures the degree of role equivalence (Definition 3) among nodes to place more importance on structurally similar neighbors. The intermediate representations of the last high-order GCN layers are used for defining the role, thus f(i) := q_i^con. In this layer, by considering role equivalence, we can incorporate structural information into node classification. To measure the degree of role equivalence, we define a quantitative measure of role equivalence in latent space, RE(f(i), f(j)), and use it in the attention layer below. Our self-attention layer takes inputs from the concatenation layer and produces a new vector q_i ∈ R^(K·s^(M+1)) as:

q_i = Σ_{j ∈ N(i)} RE(f(i), f(j)) q_j^con,  where  RE(f(i), f(j)) = (1/Z) exp(β cos(f(i), f(j))),   (2)

cos refers to cosine similarity, Z = Σ_{j ∈ N(i)} exp(β cos(f(i), f(j))), and β is a parameter that moderates attention, which we estimate during learning. Note that we do not consider self-loops when computing similarity. RE(f(i), f(j)) measures how close nodes i and j are to being role equivalent (i.e., if the latent representations are unit vectors, then the two are equal when their cosine similarity is 1).

To predict node class labels, a final softmax layer is used: ŷ_i = softmax(q_i W_final + b_final). Here, ŷ_i is the output of the softmax function, and each dimension of ŷ_i represents the predicted probability of the corresponding class label given the inputs. Note that W_final ∈ R^((K·s^(M+1)) × C). For learning, as in the original GCN, we use the categorical cross-entropy loss

L = − Σ_{i ∈ V_L} Σ_{c=1}^{C} Y_L[i, c] ln ŷ_i[c],   (3)

where C is the number of class labels. Since all activation functions are differentiable, learning is straightforward via back-propagation; all weights (W_k^(m) and W_final), b_final, and β are trained. ŷ_i is used to predict class labels for the unlabeled nodes V_U.
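The sketch below illustrates the two pieces reconstructed above: a normalized k-th order propagation step in the spirit of Eq. (1), and the role equivalence attention of Eq. (2) that re-weights concatenated neighbor representations by exp(β · cosine similarity). It is a NumPy illustration under stated assumptions (ReLU as the nonlinearity, dense matrices, hypothetical function names), not the authors' implementation.

```python
# Minimal sketch of Eq. (1)-(2) as reconstructed above. Illustrative only.
import numpy as np

def high_order_gcn_layer(A_k, H, W):
    """sigma(D^-1/2 (A^k + I) D^-1/2 H W), with ReLU assumed as sigma."""
    A_hat = A_k + np.eye(A_k.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = (A_hat * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

def role_equivalence_attention(Q, A, beta):
    """Q holds the concatenated representations q^con (one row per node).
    For each node i, return sum_j RE(f(i), f(j)) * q^con_j over direct
    neighbors j, where RE is a softmax over beta-scaled cosine similarities
    (no self-loops), as in Eq. (2)."""
    norms = np.linalg.norm(Q, axis=1, keepdims=True) + 1e-12
    cos = (Q / norms) @ (Q / norms).T          # pairwise cosine similarities
    out = np.zeros_like(Q)
    for i in range(Q.shape[0]):
        nbrs = np.flatnonzero(A[i])            # direct neighbors only
        if nbrs.size == 0:
            continue
        w = np.exp(beta * cos[i, nbrs])
        w = w / w.sum()                        # the 1/Z normalization
        out[i] = w @ Q[nbrs]
    return out
```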
Calculating Eq. (3) requires the loss to be summed over all labeled nodes together. A batch algorithm cannot handle large-scale datasets due to the difficulty of fitting the full graph in GPU memory and slow convergence. Even worse, the dense neighborhoods induced by high-order paths reduce the scalability of the model with respect to both time and space. To overcome this limitation, we propose an efficient sampling-based learning method. Consider the representation of a node u in the m-th GCN layer with k-th order paths from Eq. (1):

H_k^(m+1)(u, :) = σ( Σ_{v ∈ V} Â_k(u, v) H_k^(m)(v, :) W_k^(m) ),   (4)

where Â_k = D̃_k^(-1/2) (A^k + I) D̃_k^(-1/2). We use the same importance distribution as in [3] to approximate Eq. (4) with |S| samples for node u as follows:

H_k^(m+1)(u, :) ≈ σ( (1/|S|) Σ_{v ∈ S} (Â_k(u, v) / q(v)) H_k^(m)(v, :) W_k^(m) ),

with the importance distribution q(v) = ||Â_k[:, v]||^2 / Σ_{v' ∈ V} ||Â_k[:, v']||^2. Note that the distribution q is only calculated once (i.e., before training) given the normalized aggregated graph, and the input label matrix, H_k^(1), is restricted to the sampled nodes S accordingly. Our overall mini-batched training procedure is described in Algorithm 1. At every epoch, all nodes are randomly divided to create a mini-batch set B, which is composed of multiples of γ nodes. We set γ = 1024. B provides the candidate node set from which the |S| samples are drawn later. When a new mini-batch arrives, Ã_k is induced from Â_k according to S. Similarly, when REGNN gathers the neighboring nodes N(i) for the attention layer in Eq. (2), A^(k=1) is used. As a result, the loss function for mini-batch training becomes L_mini-batch = − Σ_{i ∈ S_L} Σ_{c=1}^{C} Y_L[i, c] ln ŷ_i[c], where S_L denotes the labeled nodes in the current sample.

Assume that Y_L ∈ R^(|V| × C) is a label input matrix. For each labeled node i ∈ V_L with class label y_i = c, we set Y_L[i, c] = 1 and the remaining entries of row i to 0. If node j is in V_U, we set Y_L[j, :] = 0. Let Ŷ be a prediction matrix; for each node i ∈ V_U, Ŷ[i, :] is used for the actual prediction, which is arg max_j Ŷ[i, j]. According to [22], the prediction converges to

Y_LP = (1 − α)(I − αS)^(−1) Y_L,  with  S = D^(−1/2) A D^(−1/2),

where α is a parameter in (0, 1) that specifies the relative amount of information from a node's neighbors versus the initial label information. Regarding an input graph, the normalized Laplacian matrix L of A is decomposed as L = ΦΛΦ^(−1) and can be modified using the frequency response [14] as L_p = Φ p(Λ) Φ^(−1), where p(·) is called the frequency response function of the graph. p(Λ) can further be written as diag(p(λ_1), ..., p(λ_n)). The graph filter is linear shift-invariant if and only if there exists such a function p(·) : R → R. Finally, the prediction of LP [22], Y_LP, can be reformulated from the perspective of eigen-decomposition as

Y_LP = Φ diag(p_LP(λ_1), ..., p_LP(λ_n)) Φ^(−1) Y_L,

where p_LP(λ_i), the frequency response function of LP, is equal to (1 − α) / (1 − α + αλ_i).

Similarly, we can reformulate the GCN. Denote D̃_ii = Σ_j (A + I)_ij. Then Â = D̃^(−1/2)(A + I)D̃^(−1/2) = I − D̃^(−1/2) L̃ D̃^(−1/2) = I − L̃_s, where L̃ is the Laplacian matrix of A + I. We denote L̃_s = ΦΛΦ^(−1). Using the above notation, a two-layered GCN can be characterized as

Y_GCN = softmax( Â ReLU( Â Y_L W_0 ) W_1 ),

where W_0 and W_1 are trainable weight matrices in the first and second GCN layers, respectively. Ignoring the non-linear activations, Â^2 = Φ (I − Λ)^2 Φ^(−1), so the frequency response function of the two-layer GCN is p_GCN(λ_i) = (1 − λ_i)^2. When LP [22] uses the frequency response function p(λ_i) = (1 − λ_i)^2, with two linear transformations, the new Y_LP will be the same as Y_GCN. Thus, Y_GCN ≈ Y_LP(W_0 W_1). In other words, GCN can approximate LP, and as such we expect better accuracy in label propagation with the help of the additional linear transformations and non-linear functions. When the k-th order adjacency matrix A^k is fed to the GCN, the response function becomes (1 − λ_i)^(2k), which means that we can estimate labels with a different eigenvalue function by considering different paths. This analysis implies that the high-order GCN layers in REGNN can obtain label-wise representations of unknown nodes in latent space using different eigenvalue functions. Furthermore, they can learn a joint representation using our proposed role equivalence attention layer. When batch size is considered, the time complexity of learning REGNN (before importance sampling) is dominated by the dense high-order adjacency matrices, which motivates the sampling method above.
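The frequency-response argument above can be checked numerically. The sketch below (our own NumPy illustration, not from the paper) verifies that with Â = I − L̃_s and L̃_s = ΦΛΦ^T, two linearized propagation steps correspond exactly to the filter (1 − λ_i)^2; the random test graph is an assumption for illustration.

```python
# Minimal numerical check of the (1 - lambda)^2 frequency response of a
# linearized two-layer GCN. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 20
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T                      # random undirected graph

A_self = A + np.eye(n)
d = 1.0 / np.sqrt(A_self.sum(axis=1))
A_hat = (A_self * d[:, None]) * d[None, :]          # normalized adjacency with self-loops
L_s = np.eye(n) - A_hat                             # so A_hat = I - L_s

lam, Phi = np.linalg.eigh(L_s)                      # L_s symmetric: L_s = Phi diag(lam) Phi^T
filtered = Phi @ np.diag((1.0 - lam) ** 2) @ Phi.T  # filter with response (1 - lambda)^2

print(np.allclose(A_hat @ A_hat, filtered))         # True: two steps == (1 - lambda)^2 filter
```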
Data and Experimental Setup. We use six real-world network datasets for evaluation. Cora, Citeseer, PubMed, and NELL are publicly available citation network datasets [20]. Facebook (|V| = 4038, |E| = 65794, C = 2) is drawn from the Purdue University network [12], and the data were randomly sampled to make the class label proportion 50/50. For Friendster [16] (|V| = 43880, |E| = 145407, C = 4), only 30% of the training data is used, to make the label ratio equal. Additionally, we generated a mirrored-Karate network to verify how roles are learned in our model. LR (Logistic Regression), LP [22], GhostEdge [7], graph neural networks (GCN, N-GCN [1], and GAT [18]), and graph embedding methods (Node2Vec [8], GraRep [2], NEU [19], Struc2Vec [13], and VERSE [17]) are used for comparison. We train the models on training/validation sets and report results on the test set. Every reported result is the average of 10 trials using randomly shuffled node sets, and 10% of the nodes are used for testing and validation, respectively. The number of nodes used for training is varied. For all neural network models, we set max epochs = 1000 and dropout rate = 0.

The Karate club network [21] is a graph composed of 34 members and 78 interactions among them. To interpret REGNN with respect to capturing roles, we construct a mirrored network, composed of two copies of the network connected by an edge between node 32 and node 66, as in Fig. 2a. We can assume that each of these nodes has its own structural role, connecting different communities. The colors in the graph are chosen according to community IDs after community detection [4]. Every node with the same color (i.e., role equivalent) should have a similar latent representation when structural roles are properly captured. Figure 2b and Fig. 2c show the learned representations of nodes from GCN and REGNN, respectively. As in the GCN experiment with the Karate network [10], a hidden layer of size 2 was inserted before the final softmax layer to visualize the latent representations. Labels for training were chosen from a total of 8 nodes (two nodes per community). In the results, GCN fails to distinguish the red and green nodes (i.e., the communities overlap), while REGNN separates the nodes from the two communities more effectively. This is evidence that REGNN's attention layer successfully learned the structural roles by measuring role equivalence.
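The mirrored-Karate construction described above can be sketched as follows. The use of networkx is our assumption (the paper does not state which tool was used), but the construction itself, i.e., two 34-node copies joined by an edge between node 32 and its mirror, node 66 (= 32 + 34), follows the text.

```python
# Minimal sketch of building the mirrored-Karate network. Illustrative only.
import networkx as nx

G1 = nx.karate_club_graph()                                # nodes 0..33
G2 = nx.relabel_nodes(G1, {i: i + 34 for i in G1.nodes()})  # mirrored copy, nodes 34..67
mirrored = nx.compose(G1, G2)
mirrored.add_edge(32, 66)                                  # bridge between the two copies

print(mirrored.number_of_nodes(), mirrored.number_of_edges())  # 68 nodes, 157 edges
```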
Tables 1, 2, 3, 4 and 5 show REGNN node classification performance on the Citeseer, Cora, Facebook, PubMed, and Friendster data as the proportion of labeled nodes is varied, compared to the other baselines. For GCN, Node2Vec, GraRep, Struc2Vec, VERSE, and NEU, we obtain results directly from the official implementations. Classification results of all methods are averaged at each proportion. Bold scores indicate that the corresponding model is significantly better than the others by paired t-tests (p-value < 0.05). On all datasets, REGNN obtains consistently good results across all label proportions. On the other hand, N-GCN is similar to GCN and LP, in Citeseer and Cora in particular. This indicates that the high-order path information alone did not help to find better node representations, but the attention over high-order paths helped to increase performance when only known labels are given. Node2Vec and GhostEdge exhibit similar results on most of the datasets, and both achieve good performance at lower label proportions. However, their relative performance often decreases when more labels are available (e.g., in Cora). Struc2Vec and VERSE are not as good as Node2Vec. Since Struc2Vec considers structural similarity only, it does not perform well on most of the datasets. VERSE learns similarities from Personalized PageRank, which is not helpful for our citation and social network datasets. For PubMed and Friendster, due to the heavy computation cost on their large edge sets, Struc2Vec and GhostEdgeNL are not included.

Table 6 shows classification performance on the NELL knowledge graph. The result uses the same train/test/validation sets as in [20]. REGNN shows the best performance but is almost on par with Node2Vec and VERSE. However, its execution time for training was much faster than that of Node2Vec and VERSE. In particular, Node2Vec and NEU incur a great deal of overhead to generate random walks (4,327.43 s), and their training time to learn embeddings after the generation was also slower than REGNN's. We also tested N-GCN with our importance sampling, but its accuracy was still lower than REGNN's.

REGNN uses role-based attention to leverage high-order paths. In this section, we report how high-order paths and role-based attention each contribute to REGNN's performance. Figure 3 shows comparisons from an ablation study. We compare REGNN (Order = 4), which is the best performing order chosen during parameter selection on the validation data, to REGNN (Order = 1), which denotes a simplified REGNN that still uses the role-based attention but does not consider high-order paths. N-GCN and GAT (Order = 4) correspond to versions of our model where the role equivalence attention is replaced by the mixing layer used in [1] and the edge-wise attention of [18], respectively. For the mixing layer, column-wise concatenation is used. We also compared with the softmax attention of [1] in our experimental setting, but the concatenation-based mixing layer was more accurate. In the ablation experiments, REGNN (Order = 4) again achieved the best results across all datasets. Specifically, it performed significantly better (assessed by paired t-tests) than REGNN (Order = 1), GAT (Order = 4), and GAT (Order = 1). In this ablation study, before computing the attentive weights, high-order GCNs are used in the same way for GAT for a fair comparison, but the result is still worse than REGNN (Order = 4). We also tested different numbers of attention heads for GAT, but this did not help much. This means that our attention mechanism can identify more meaningful neighbors than the one used in GAT, at least in our application settings, which focus on label propagation in graphs with few attributes. In addition, when high-order GCNs are not used, REGNN (Order = 1) is worse than the simple GCN (Order = 1) in Citeseer and Cora. This indicates that REGNN is more effective when it combines its latent representations with high-order paths.

In this paper, we propose REGNN, a Graph Neural Network architecture that uses a novel Role Equivalence attention mechanism with higher-order paths. REGNN is able to exploit nodes' roles to learn how to focus on relevant neighbors from high-order paths, in order to adaptively reduce the increased noise that occurs when higher-order structures are considered. In our experimental results, REGNN showed significant performance gains compared to state-of-the-art baselines that use alternative attention mechanisms and/or higher-order paths.
[1] N-GCN: multi-scale graph convolution for semi-supervised node classification
[2] GraRep: learning graph representations with global structural information
[3] FastGCN: fast learning with graph convolutional networks via importance sampling
[4] Finding community structure in very large networks
[5] Regular equivalence: general theory
[6] Ego-centered and local roles: a graph theoretic approach
[7] Using ghost edges for classification in sparsely labeled networks
[8] node2vec: scalable feature learning for networks
[9] VAIN: attentional multi-agent predictive modeling
[10] Semi-supervised classification with graph convolutional networks
[11] Structural equivalence of individuals in social networks
[12] Overcoming relational learning biases to accurately predict preferences in large scale networks
[13] struc2vec: learning node representations from structural identity
[14] Discrete signal processing on graphs
[15] Collective classification in network data
[16] Are graph neural networks miscalibrated?
[17] VERSE: versatile graph embeddings from similarity measures
[18] Graph attention networks
[19] Fast network embedding enhancement via high order proximity approximation
[20] Revisiting semi-supervised learning with graph embeddings
[21] An information flow model for conflict and fission in small groups
[22] Learning with local and global consistency
[23] Graph convolution: a high-order and adaptive approach

Acknowledgments. This research is supported by NSF and AFRL under contract numbers CCF-1918483, IIS-1618690, and FA8650-18-2-7879.