SubRank: Subgraph Embeddings via a Subgraph Proximity Measure
Oana Balalau, Sagar Goyal
Advances in Knowledge Discovery and Data Mining, 2020-04-17. DOI: 10.1007/978-3-030-47426-3_38

Representation learning for graph data has gained a lot of attention in recent years. However, state-of-the-art research is focused mostly on node embeddings, with little effort dedicated to the closely related task of computing subgraph embeddings. Subgraph embeddings have many applications, such as community detection, cascade prediction, and question answering. In this work, we propose a subgraph-to-subgraph proximity measure as a building block for a subgraph embedding framework. Experiments on real-world datasets show that our approach, SubRank, outperforms state-of-the-art methods on several important data mining tasks.

In recent years we have witnessed the success of graph representation learning in many tasks such as community detection [8, 19], link prediction [10, 20], graph classification [3], and cascade growth prediction [13]. A large body of work has focused on node embeddings, techniques that represent nodes as dense vectors that preserve the properties of nodes in the original graph [5, 9]. Representation learning of larger structures has generally been associated with embedding collections of graphs [3]. Embeddings of paths, subgraphs, and communities have received far less attention despite their importance in graphs. In homogeneous graphs, subgraph embeddings have been used in community prediction [1, 8] and cascade growth prediction [6, 13]. In heterogeneous graphs, subgraph embeddings have tackled tasks such as semantic user search [14] and question answering [4]. Nevertheless, the techniques proposed in the literature for computing subgraph embeddings have at least one of the following two drawbacks: i) they are supervised techniques and, as such, they depend on annotated data and do not generalize to other tasks; ii) they can handle only a specific type of subgraph.

Approach. In this work, we tackle the problem of computing subgraph embeddings in an unsupervised setting, where embeddings are trained for one task and will be tested on different tasks. We propose a subgraph embedding method based on a novel subgraph proximity measure. Our measure is inspired by the random walk proximity measure Personalized PageRank [11]. We show that our subgraph embeddings are comprehensive and achieve competitive performance on three important data mining tasks: community detection, link prediction, and cascade growth prediction.

Contributions. Our salient contributions in this work are:
• We define a novel subgraph-to-subgraph proximity measure;
• We introduce a framework that learns comprehensive subgraph embeddings;
• In a thorough experimental evaluation, we highlight the potential of our method on a variety of data mining tasks.

Node Embeddings. Methods for computing node embeddings aim to represent nodes as low-dimensional vectors that summarize properties of the nodes, such as their neighborhood. The numerous embedding techniques differ in the computational model and in which properties of nodes are preserved. For example, in matrix factorization approaches, the goal is to perform dimension reduction on a matrix that encodes the pairwise proximity of nodes, where proximity is defined as adjacency [2], k-step transitions [7], or Katz centrality [16].
Random walk approaches have been inspired by the important progress achieved in the NLP community in computing word embeddings [15]. These techniques optimize node embeddings such that nodes co-occurring in short random walks in the graph have similar embeddings [10, 18]. Another successful approach takes as input a node similarity distribution and an embedding similarity distribution, and minimizes the KL-divergence between the two [19, 20].

Subgraph Embeddings. A natural follow-up question is how to compute embeddings for larger structures in the graph, such as paths, arbitrary subgraphs, motifs, or communities. In [1], the authors propose a method inspired by ParagraphVector [12], where each subgraph is represented as a collection of random walks. Subgraph and node embeddings are learned such that, given a subgraph and a random walk, we can predict the next node in the walk using the subgraph embedding and the node embeddings. The approach is tested on link prediction and on community detection, using ego-networks to represent nodes. In [13], the authors present an end-to-end neural framework that, given the cascade graph as input, predicts the future growth of the cascade for a given time period. A cascade graph is sampled for a set of random walks, which are given as input to a gated neural network to predict the future size of the cascade. [6] is similarly an end-to-end neural framework for cascade prediction, but based on the Hawkes process. The method transforms the cascade into diffusion paths, where each path describes the process of information propagation within the observation time frame. Another very important type of subgraph is a community, and in [8] community embeddings are represented as multivariate Gaussian distributions.

Graph Embeddings. Given a collection of graphs, a graph embedding technique learns a representation for each graph. In [3], the authors propose an inductive framework for computing graph embeddings, based on training an attention network to predict a graph proximity measure, such as graph edit distance. Graph embeddings are closely related to graph kernels, functions that measure the similarity between pairs of graphs [21]. Graph kernels are used together with kernel methods such as SVM to perform graph classification [22].

Preliminaries. PageRank [17] is the stationary distribution of a random walk in which, at a given step, with probability α a surfer teleports to a random node, and with probability 1 − α moves along a randomly chosen outgoing edge of the current node. In Personalized PageRank (PPR) [11], instead of teleporting to a random node with probability α, the surfer teleports to a randomly chosen node from a set of predefined seed nodes. Let PR(u) be the PageRank of node u and PPR(u, v) be the PageRank score of node v personalized for seed node u.

Problem Statement. Given a directed graph G = (V, E), a set of subgraphs S_1, S_2, ..., S_k of G, and an integer d, compute the d-dimensional embeddings of the subgraphs.

We define a subgraph proximity measure inspired by Personalized PageRank. Let S_i and S_j be two subgraphs in a directed graph G. Their proximity in the graph is:

px(S_i, S_j) = \sum_{v_i \in S_i} \sum_{v_j \in S_j} PR_{S_i}(v_i) \cdot PPR(v_i, v_j) \cdot PR_{S_j}(v_j),    (1)

where PR_{S_i}(v_i) represents the PageRank of node v_i in the subgraph S_i, PPR(v_i, v_j) the PageRank of node v_j personalized for node v_i in the graph G, and PR_{S_j}(v_j) the PageRank of node v_j in the subgraph S_j. When considering how to define proximity between subgraphs, our intuition is as follows: important nodes in subgraph S_i should be close to important nodes in subgraph S_j. This condition is fulfilled, as PageRank gives high scores to important nodes in the subgraphs and Personalized PageRank gives high scores to nodes that are "close" or "similar". We note that our measure is a similarity measure, hence subgraphs that are similar will receive a high proximity score. We choose the term proximity to emphasize that our measure relates to nearness in the graph, as it is computed using random walks. We can interpret Eq. 1 using random walks as follows: Alice is a random surfer in the subgraph S_i, Bob is a random surfer in the subgraph S_j, and Carol is a random surfer in the graph G. Alice decides to send a message to Bob via Carol. Carol starts from the current node Alice is visiting (chosen with probability PR_{S_i}(v_i)) and reaches a node v_j ∈ S_j with probability PPR(v_i, v_j). Bob will be there to receive the message with probability PR_{S_j}(v_j).
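To make Eq. 1 concrete, the following is a small, non-optimized sketch that evaluates px directly with networkx. The helper names (subgraph_pagerank, proximity) and the per-seed PPR computation are our own illustration, not the paper's implementation; note that networkx's alpha parameter is the damping (walk-continuation) probability, i.e., 1 − α in the notation above.

```python
import networkx as nx

def subgraph_pagerank(G, nodes, damping=0.85):
    """PR_S: PageRank computed on the subgraph of G induced by `nodes`."""
    return nx.pagerank(G.subgraph(nodes), alpha=damping)

def proximity(G, S_i, S_j, damping=0.85):
    """Direct (non-sampled) evaluation of px(S_i, S_j) as in Eq. 1."""
    pr_i = subgraph_pagerank(G, S_i, damping)
    pr_j = subgraph_pagerank(G, S_j, damping)
    score = 0.0
    for v_i, pr_vi in pr_i.items():
        # Personalized PageRank over the full graph G, with seed node v_i.
        ppr = nx.pagerank(G, alpha=damping, personalization={v_i: 1.0})
        score += pr_vi * sum(ppr.get(v_j, 0.0) * pr_vj
                             for v_j, pr_vj in pr_j.items())
    return score
```

This direct evaluation is only meant to illustrate the definition; the sampling procedure described below is what makes the measure usable at scale.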
Normalized Proximity. Given a collection of subgraphs S = {S_1, S_2, ..., S_k}, we normalize the proximity px(S_i, S_j), ∀j ∈ 1, k, such that it can be interpreted as a probability distribution. The normalized proximity for a subgraph S_i is:

\widehat{px}(S_i, S_j) = \frac{px(S_i, S_j)}{\sum_{l=1}^{k} px(S_i, S_l)}.    (2)

Rank of a Subgraph. Similarly to PageRank, our proximity can inform us of the importance of a subgraph. The normalized proximity given a collection of subgraphs S_1, S_2, ..., S_k can be expressed as a stochastic matrix, where row i encodes the normalized proximity given subgraph S_i. The importance of subgraph S_i can then be computed by summing up the elements of column i.

Sampling According to the Proximity Measure. Given a subgraph S_i as input, we present a procedure for efficiently sampling from px(S_i, ·) introduced in Eq. 1. We suppose that all the PageRank vectors of the subgraphs {S_1, S_2, ..., S_k} have been precomputed. We first select a node n_i in S_i according to the distribution PR_{S_i}. Second, we start a random walk from n_i in the graph G and select n_j, the last node in the walk before teleportation. Lastly, node n_j may belong to several subgraphs S_{j_1}, S_{j_2}, ...; we return a subgraph S_j according to the normalized distribution over PR_{S_{j_1}}(n_j), PR_{S_{j_2}}(n_j), .... The procedure does not require computing the Personalized PageRank vectors, which saves us O(n^2) space. We shall use this procedure for computing embeddings, thus avoiding computing and storing the full proximity measure px.
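A minimal sketch of this sampling step is given below, under the assumption that the per-subgraph PageRank vectors are precomputed in `pr` (subgraph id → {node: score}) and that `node_to_subgraphs` maps a node to the ids of the subgraphs containing it; the function name and the handling of dangling nodes are ours, not the paper's released code.

```python
import random

def sample_subgraph(G, i, pr, node_to_subgraphs, damping=0.85):
    """Draw one subgraph id j, approximately according to px(S_i, .) of Eq. 1."""
    # 1) Pick a start node n_i in S_i with probability PR_{S_i}(n_i).
    nodes, weights = zip(*pr[i].items())
    n_i = random.choices(nodes, weights=weights, k=1)[0]

    # 2) Random walk in G from n_i; continue with probability `damping`,
    #    i.e. teleport (stop) with probability 1 - damping. Keep the last node.
    n_j = n_i
    while random.random() < damping:
        successors = list(G.successors(n_j))
        if not successors:      # dangling node: stop the walk here
            break
        n_j = random.choice(successors)

    # 3) n_j may belong to several subgraphs; pick one of them with
    #    probability proportional to PR_{S}(n_j).
    candidates = node_to_subgraphs.get(n_j, [])
    if not candidates:          # n_j lies outside every input subgraph
        return None             # the caller can simply resample
    scores = [pr[j].get(n_j, 0.0) for j in candidates]
    return random.choices(candidates, weights=scores, k=1)[0]
```

Repeated draws of (i, sample_subgraph(G, i, ...)) pairs are what the learning step described next consumes in place of the full px matrix.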
Given a graph G = (V, E) and a set of subgraphs of G, S = {S_1, S_2, ..., S_k}, we learn their representations as dense vectors, i.e., as embeddings. We extend the framework in [20], proposed for computing node embeddings, to an approach for subgraph embeddings. In [20], the authors learn node embeddings such that the embeddings preserve an input similarity distribution between nodes. The similarities of a node v to every other node in the graph are represented by the similarity distribution sim_G, where \sum_{w \in V} sim_G(v, w) = 1. The corresponding embedding similarity distribution is sim_E. The optimization function of the learning algorithm minimizes the Kullback-Leibler (KL) divergence between the two proximity distributions:

\mathcal{L} = \sum_{v \in V} KL\big(sim_G(v, \cdot) \,\|\, sim_E(v, \cdot)\big).

The authors propose several options for instantiating sim_G, such as Personalized PageRank and adjacency similarity. The similarity between embeddings, sim_E, is the normalized dot product of the vectors. In order to adapt this approach to our case, we define the subgraph-to-subgraph proximity sim_G to be the normalized proximity presented in Eq. 2. The embedding similarity sim_E is computed in the same manner, and the optimization function now minimizes the divergence between distributions defined on our input subgraphs, i.e., sim_G, sim_E : S × S → [0, 1]. In our experimental evaluation we use this method, which we refer to as SubRank. We note that sim_G is not fully computed, but approximated using the sampling procedure presented in Sect. 3.1.

Proximity of Ego-Networks. Two very important tasks in graph mining are community detection and link prediction. Suppose Alice is a computer scientist and she joins Twitter. She starts following the updates of Andrew Ng, but also the updates of her friends, Diana and John. Bob is also a computer scientist on Twitter, and he follows Andrew Ng, Jure Leskovec, and his friend Julia. As shown in Fig. 1, there is no path in the directed graph between Alice and Bob. A path-based similarity measure between nodes Alice and Bob, such as Personalized PageRank, will return similarity 0, while it will return high values between Alice and Andrew Ng and between Bob and Andrew Ng. An optimization algorithm for computing node embeddings will have to address this trade-off, with a potential loss in the quality of the representations. Thus, we might miss that both Alice and Bob are computer scientists. To address this issue we capture the information stored in the neighbors of a node by considering ego-networks. Therefore, in our work we represent a node v by its ego network of size k (the nodes reachable from v in k steps). In Sect. 4, we perform a quantitative analysis to validate our intuition.

Proximity of Cascade Subgraphs. In a graph, an information cascade can be modeled as a directed tree, where the root represents the original content creator and the remaining nodes represent the content reshares. When considering the task of predicting the future size of a cascade, the nodes already in the cascade are important, as it is very likely that their neighbors will be affected by the information propagation. However, nodes that have reshared the information more recently are more visible to their neighbors. When running PageRank on a directed tree, we observe that nodes on the same level have the same score, and that the score of nodes increases with the depth. Hence, two cascade trees will have a high proximity score px if nodes that have joined the cascades later (i.e., are on lower levels in the trees) are "close" or "similar" according to Personalized PageRank. In Sect. 5, we perform a quantitative analysis and show that our approach gives better results than a method that gives equal importance to all nodes in the cascade.

Datasets. We perform experiments on five real-world graphs, described below. We report their characteristics in Table 1.
• Citeseer is a citation network created from the CiteSeer digital library. Nodes are publications and edges denote citations. The node labels represent fields in computer science.
• Cora is also a citation network; the node labels represent subfields in machine learning.
• Polblogs is a directed network of hyperlinks between political blogs discussing US politics. The labels correspond to republican and democrat blogs.

Competitors. We evaluate our method, SubRank, against several state-of-the-art methods for node and subgraph embedding computation. For each method, we used the code provided by the authors. We compare with:
• DeepWalk [18] learns node embeddings by sampling random walks and then applying the SkipGram model. The parameters are set to the recommended values, i.e., walk length t = 80, γ = 80, and window size w = 10.
• node2vec [10] extends DeepWalk with biased random walks whose hyperparameters are tuned in a supervised manner. We fine-tuned the hyperparameters p and q on each dataset and task. In addition, r = 10, l = 80, k = 10, and the optimization is run for one epoch.
• LINE [19] proposes two proximity measures for computing two d-dimensional vectors for each node. In our experiments we use the second-order proximity, as it can be used for both directed and undirected graphs. We run experiments with T = 1000 samples and s = 5 negative samples, as described in the paper.
• VERSE [20] learns node embeddings that preserve the proximity of nodes in the graph. We use Personalized PageRank as the proximity measure, the default option proposed in the paper. We run the learning algorithm for 10^5 iterations.
• VerseAvg is an adaptation of VERSE in which the embedding of a node is the average of the VERSE embeddings of the nodes in its ego network.
• sub2vec [1] computes subgraph embeddings; for the experimental evaluation, we compute the embeddings of the ego networks. Following the guidelines of the authors, for Cora, Citeseer, and Polblogs we select ego networks of size 2, and for the denser networks Cithep and DBLP, ego networks of size 1.

For the first four methods, node embeddings are used to represent nodes. For sub2vec, SubRank, and VerseAvg, the ego network embedding is the node representation. The embeddings are used as node features for community detection and link prediction. We compute 128-dimensional embeddings.

Parameter Setting for SubRank. We represent each node by its ego network of size 1. We run the learning algorithm for 10^5 iterations. Our code is public.

Community Detection. We assess the quality of the embeddings in terms of their ability to capture communities in a graph. For this, we use the k-means algorithm to cluster the nodes embedded in the d-dimensional space. In Table 2 we report the Normalized Mutual Information (NMI) with respect to the original label distribution. On Polblogs, SubRank has a low NMI, while on Citeseer and Cora it outperforms the other methods. On DBLP it has performance comparable to VERSE.

Node classification is the task of predicting the correct node labels in a graph. For each dataset, we try several configurations by varying the percentage of nodes used in training. We evaluate the methods using the micro and macro F1 scores, and we report the micro F1, as both measures show similar trends. The results are presented in Table 3. On Citeseer and Cora, SubRank significantly outperforms the other methods. On Polblogs, SubRank performs similarly to the other baselines, even though its embeddings achieved a low NMI score. On DBLP, SubRank is the second best method.

Link Prediction. To create training data for link prediction, we randomly remove 10% of the edges, ensuring that each node retains at least one neighbor. The removed edges represent the ground truth in the test set, while we take the remaining graph as the training set. In addition, we randomly sample an equal number of node pairs that have no edge connecting them as negative samples for our test set. We then learn embeddings on the graph without the 10% of removed edges. Next, for each edge (u, v) in the training or the test set, we obtain the edge features by computing the Hadamard product of the embeddings of u and v. The Hadamard product has shown better performance than other operators for this task [10, 20]. We report the accuracy of the link prediction task in Table 4. Our method achieves the best performance on 4 out of 5 datasets.
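As an illustration of this setup, the sketch below builds the Hadamard edge features and scores them with a logistic regression classifier; the choice of classifier and the variable names are our assumptions, since the text does not specify which classifier was used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def edge_features(emb, pairs):
    """One row per node pair: the Hadamard (element-wise) product of the
    two endpoint embeddings. `emb` maps node id -> 128-dim numpy vector."""
    return np.array([emb[u] * emb[v] for u, v in pairs])

def link_prediction_accuracy(emb, train_pairs, y_train, test_pairs, y_test):
    # Labels are 1 for a held-out true edge and 0 for a sampled non-edge,
    # prepared as described above.
    clf = LogisticRegression(max_iter=1000)   # classifier choice is ours
    clf.fit(edge_features(emb, train_pairs), y_train)
    y_pred = clf.predict(edge_features(emb, test_pairs))
    return accuracy_score(y_test, y_pred)
```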
Cascade Growth Prediction. Given as input: i) a social network G = (V, E), captured at a time t_0; ii) a set of information cascades C that appear in G after timestamp t_0 and that are observed for a duration t_1 after their creation; iii) a time window t_2; our goal is to predict the growth of a cascade, i.e., the number of new nodes the cascade acquires by time t_1 + t_2 after its creation. Note that, given a cascade c = (V_c, E_c) ∈ C, we know that the nodes V_c are present in V; however, c can contain new edges not present in E.

Datasets. We select for evaluation two datasets from the literature:
• AMiner [13] represents cascades of scientific citations. We use the simplified version made available by the authors. The dataset contains a global citation graph and the cascade graphs. A node in a graph represents an author, and an edge from a_1 to a_2 represents a citation of a_2 in an article by a_1. A cascade shows all the citations of a given paper.

Competitors. We compare SubRank with the following state-of-the-art methods for the task of predicting the future size of cascades:
• DeepCas [13] is an end-to-end neural network framework that, given the cascade graph as input, predicts the future growth of the cascade for a given period. The parameters are set to the values specified in the paper: k = 200, T = 10, mini-batch size 5, and α = 0.01.
• DeepHawkes [6] is similarly an end-to-end deep learning framework for cascade prediction, based on the Hawkes process. We set the parameters to the defaults given by the authors: the learning rate for user embeddings is 5 × 10^{-4} and the learning rate for the other variables is 5 × 10^{-3}.
• In addition, we consider the node embedding method VERSE [20], as one of the top-performing baselines in the previous section. The node embeddings are learned on the original graph, and a cascade is represented as the average of the embeddings of the nodes it contains. We then train a multi-layer perceptron (MLP) regressor to predict the growth of the cascade.

Parameter Setting for SubRank. We recall that our subgraph proximity measure requires the computation of the PPR of nodes in the graph and the PR of nodes in the subgraphs. For this task, we consider the PPR of nodes in the global graph and the PR of nodes in the cascades. We obtain the cascade embeddings, which are then used to train an MLP regressor. For both VERSE and SubRank we perform a grid search for the optimal parameters of the regressor.

We report the mean squared error (MSE) on the logarithm of the cascade growth value, as done in previous work on cascade prediction [6, 13], in Table 5. We observe that SubRank outperforms VERSE, thus corroborating our intuition that nodes appearing later in a cascade should be given more importance. The best MSE overall is obtained by the end-to-end framework DeepHawkes, which is expected as the method is tailored for the task. We note, however, that SubRank achieves the best results on AMiner.

In this work, we introduce a new measure of proximity for subgraphs and a framework for computing subgraph embeddings. In a departure from previous work, we focus on general-purpose embeddings, and we shed light on why our method is suited for several data mining tasks. Our experimental evaluation shows that the subgraph embeddings achieve competitive performance on three downstream applications: community detection, link prediction, and cascade prediction.
References
[1] Sub2Vec: feature learning for subgraphs
[2] Distributed large-scale natural graph factorization
[3] Unsupervised inductive graph-level representation learning via graph-graph proximity
[4] Question answering with subgraph embeddings
[5] A comprehensive survey of graph embedding: problems, techniques, and applications
[6] DeepHawkes: bridging the gap between prediction and understanding of information cascades
[7] GraRep: learning graph representations with global structural information
[8] Learning community embedding with community detection and node embedding on graphs
[9] Graph embedding techniques, applications, and performance: a survey
[10] node2vec: scalable feature learning for networks
[11] Topic-sensitive PageRank: a context-sensitive ranking algorithm for Web search
[12] Distributed representations of sentences and documents
[13] DeepCas: an end-to-end predictor of information cascades
[14] Subgraph-augmented path embedding for semantic user search on heterogeneous social network
[15] Distributed representations of words and phrases and their compositionality
[16] Asymmetric transitivity preserving graph embedding
[17] The PageRank citation ranking: bringing order to the Web
[18] DeepWalk: online learning of social representations
[19] LINE: large-scale information network embedding
[20] VERSE: versatile graph embeddings from similarity measures
[21] Graph kernels
[22] RetGK: graph kernels based on return probabilities of random walks