Beyond Node Embedding: A Direct Unsupervised Edge Representation Framework for Homogeneous Networks

Sambaran Bandyopadhyay (IBM Research & IISc, Bangalore, sambband@in.ibm.com), Anirban Biswas (Indian Institute of Science, Bangalore, anirbanb@iisc.ac.in), Narasimha Murty (Indian Institute of Science, Bangalore, mnm@iisc.ac.in), and Ramasuri Narayanam (IBM Research, Bangalore, ramasurn@in.ibm.com)

Abstract. Network representation learning has traditionally been used to find lower dimensional vector representations of the nodes in a network. However, there are very important edge driven mining tasks of interest to the classical network analysis community, which have mostly been unexplored in the network embedding space. For applications such as link prediction in homogeneous networks, the vector representation (i.e., embedding) of an edge is derived heuristically just by using simple aggregations of the embeddings of the end vertices of the edge. Clearly, this method of deriving edge embeddings is suboptimal, and there is a need for a dedicated unsupervised approach for embedding edges by leveraging edge properties of the network. Towards this end, we propose the novel concept of converting a network to its weighted line graph, which is ideally suited for finding the embeddings of the edges of the original network. We further derive a novel algorithm to embed the line graph by introducing the concept of collective homophily. To the best of our knowledge, this is the first direct unsupervised approach for edge embedding in homogeneous information networks that does not rely on node embeddings. We validate the edge embeddings on three downstream edge mining tasks. Our proposed optimization framework for edge embedding also generates a set of node embeddings, which are not just aggregations of edges. Further experimental analysis shows the connection of our framework to the concept of node centrality.

1 Introduction

Network representation learning (also known as network embedding) has gained significant interest over the last few years. Traditionally, network embedding [22, 12, 28] maps the nodes of a homogeneous network (where nodes denote entities of similar type) to lower dimensional vectors, which can be used to represent the nodes. It has been shown that such continuous node representations outperform conventional graph algorithms [2] on several node based downstream mining tasks such as node classification and community detection.

Edges are also important components of a network. From the point of view of downstream network mining analytics, there are plenty of network applications - such as computing edge betweenness centrality [20] and information diffusion [24] - which heavily depend on the information flow in the network. Compared to the conventional downstream node embedding tasks (such as node classification), these tasks are more complex in nature. But, similar to node based analytics, there is significant scope to improve the performance of these tasks in a continuous lower dimensional vector space. Thus, it makes sense to address these problems in the context of network embedding via direct representation of the edges of a network. As a first step in this direction, it is important to design dedicated edge embedding schemes and validate the quality of those embeddings on some basic edge-centric downstream tasks.
Figure 1: Edge Visualization (panels: (a) Synthetic Graph, (b) node2vec, (c) line2vec). (a) We created a small synthetic network with two communities, so there are three types of edges: green (or red) edges with both end points belonging to the green (or red, respectively) community, and blue edges with end points belonging to two different communities. (b) node2vec embeddings (8 dimensional) of the edges obtained by taking the average of the embeddings of the end vertices, visualized with t-SNE. (c) Direct edge embeddings (8 dimensional) obtained by line2vec, visualized with t-SNE. Clearly, line2vec is superior, as it visually separates the edge communities, compared to the conventional way of aggregating node embeddings to obtain edge representations.

In the literature, there are indirect ways to compute the embedding of an edge in an information network. For tasks like link prediction, where a classifier needs to be trained on both positive (existing) and negative (non-existing) edge representations, a simple aggregation function [12] such as the vector average or the Hadamard product has been applied to the representations of the two end vertices to derive the vector representation of the corresponding edge. Typically, node embedding algorithms use the homophily property [18] by respecting different orders of node proximities in a network. As the inherent objective functions of these algorithms are focused on the nodes of the network, using an aggregation function on these node embeddings to get the edge embedding can be suboptimal. We demonstrate the shortcoming of this approach in Figure 1, where the visualization of the edge embeddings derived by aggregating node embeddings (taking the average of the two end nodes) from node2vec [12] on a small synthetic graph does not maintain the edge community structure of the network. In contrast, a direct edge embedding approach, line2vec, to be proposed in this paper, completely adheres to the community structure, as edges of different types are visually segregated in the corresponding t-SNE plot shown in Figure 1(c). So there is a need to develop algorithms for directly embedding edges (i.e., not via aggregating node embeddings) in information networks. We address this research gap in this paper in a natural way. Following are the contributions:

• We propose a novel edge embedding framework, line2vec, for homogeneous social and information networks. To the best of our knowledge, this is the first work to propose a dedicated unsupervised edge embedding scheme which avoids aggregation of the end node embeddings.
• We exploit the concept of the line graph for edge representation by converting the given network to a weighted line graph. We further introduce the concept of collective homophily to embed the line graph and produce the embeddings of the edges of the given network.
• We conduct experiments on three edge-centric downstream tasks. Though our approach is proposed for embedding edges, we further show that a set of robust node embeddings, which are not just aggregations of edges, is also generated in the process.
• We experimentally discover a non-trivial connection between the classical concept of node centrality and the optimization framework of line2vec. The source code of line2vec is available at https://bit.ly/2kfiS2l to ease the reproducibility of the results.
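For concreteness, the indirect strategy used by existing methods (and for the node2vec baseline in Figure 1(b)) reduces to the sketch below; it assumes a dictionary node_emb of precomputed node embeddings, and the function name is illustrative rather than part of any released code.

```python
import numpy as np

def aggregate_edge_embeddings(edges, node_emb, how="average"):
    """Indirect edge embeddings obtained by aggregating the two end-node
    embeddings, as in [12]; the approach this paper argues is suboptimal
    compared to a direct edge embedding."""
    edge_emb = {}
    for u, v in edges:
        xu, xv = node_emb[u], node_emb[v]
        if how == "average":
            edge_emb[(u, v)] = (xu + xv) / 2.0
        elif how == "hadamard":
            edge_emb[(u, v)] = xu * xv
        else:
            raise ValueError("unknown aggregation: " + how)
    return edge_emb
```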
Though edge centric network mining tasks such as edge centrality, network diffusion and link prediction can benefit from edge embeddings, applying edge embeddings to tackle them is non-trivial and needs a separate body of work. For example, finding central edges in the network amounts to detecting a subset of points in the embedding space which are diverse with respect to each other and represent a majority of the other points. We leave these to be addressed in future work.

2 Related Work and Research Gaps

Node embedding in information networks has received great interest from the research community. We refer the readers to the survey article [33] for a comprehensive treatment of network embedding and cite only some of the more prominent works in this paragraph. DeepWalk [22] and node2vec [12] are two node embedding approaches which employ different types of random walks to capture the local neighborhood of a node and maximize the likelihood of the node context. Struc2vec [23] is another random walk based strategy which finds similar embeddings for nodes which are structurally similar. A deep autoencoder based node embedding technique (SDNE) that preserves structural proximity is proposed in [31]. Different types of node embedding approaches for attributed networks are also present in the literature [35, 3, 9]. A semi-supervised graph convolutional network based node embedding approach is proposed in [14] and further extended in GraphSAGE [13], which learns node embeddings with different types of neighborhood aggregation methods on attributes. Recently, node embedding approaches based on semi-supervised attention networks [28], on maximizing mutual information [29], and in the presence of outliers [4] have been proposed.

Compared to the above, representing edges in information networks is significantly less mature. Some preliminary works exist which use random walks on edges for community detection in networks [15] or to classify large-scale documents into large-scale hierarchically-structured categories [11]. [1] focuses on the asymmetric behavior of the edges in a directed graph for deriving node embeddings, but it represents a potential edge just by a scalar which determines its chance of existence. [25, 30] derive embeddings for different types of edges in a heterogeneous network, but their proposed methods essentially use an aggregation function inside the optimization framework to generate edge embeddings from node embeddings. For knowledge bases, embedding entities and relation types in a low dimensional continuous vector space [5, 7, 10] has been shown to be useful, but several fundamental concepts of graph embedding, such as homophily, are not directly applicable there. [19] proposes a dual-primal GCN based semi-supervised node embedding approach which first aggregates edge features by convolution, and then learns node embeddings by employing graph attention on the incident edge features of a node. To the best of our knowledge, [36] is the only work which proposes a supervised approach, based on adversarial training and an auto-encoder, purely for edge representation learning in homogeneous networks. But their framework needs a large amount of labelled edges to train the GAN, which makes it restrictive for real world applications. Hence, in this paper, we propose a task-independent, unsupervised, dedicated edge embedding framework for homogeneous information networks to address these research gaps.
3 Problem Description

An information network is typically represented by a graph G = (V, E, W), where V = {v_1, v_2, ..., v_n} is the set of nodes (a.k.a. vertices), each representing a data object, and E ⊆ {(v_i, v_j) | v_i, v_j ∈ V} is the set of edges. We assume |E| = m. Each edge e ∈ E is associated with a weight w_{v_i,v_j} > 0 (1 if G is unweighted), which indicates the strength of the relation. The degree of a node v, denoted d_v, is the sum of the weights of its incident edges. N(v) is the set of neighbors of the node v ∈ V. For the given network G, edge representation learning is to learn a function f : e ↦ x ∈ R^K, i.e., it maps every edge e ∈ E to a K dimensional vector called the edge embedding, where K < m. These edge embeddings should preserve the underlying edge semantics of the network, as described below.

Edge Importance: Not all the edges in a network are equally important. For example, in a social network, millions of fans can be connected to a movie star, but any two fans of a movie star may not be similar to each other. So this type of connection is weaker compared to an edge which connects two friends who individually have a much smaller number of connections [16].

Edge Proximity: Edges which are close to each other in terms of their topology or semantics should have similar embeddings. Similar to the concepts of node proximities [31], it is easy to define first and higher order edge proximities via the incidence matrix.

4 Solution Approach: line2vec

We propose an elegant solution (referred to as line2vec) to embed each edge of the given network. First, we map the network to a weighted line graph, where each edge of the original network is transformed into a node. Then we propose a novel approach for embedding the nodes of the line graph, which essentially provides the edge embeddings of the original network. For simplicity of presentation, we assume that the given network is undirected. Nevertheless, the approach can trivially be generalized to directed graphs.

4.1 Line Graph Transformation

Given an undirected graph G = (V, E), the line graph L(G) is the graph such that each node of L(G) is an edge in G, and two nodes of L(G) are neighbors if and only if their corresponding edges in G share a common endpoint vertex [32]. Formally, L(G) = (V_L, E_L), where V_L = {(v_i, v_j) : (v_i, v_j) ∈ E} and E_L = {((v_i, v_j), (v_j, v_k)) : (v_i, v_j) ∈ E, (v_j, v_k) ∈ E}. Figure 2 shows how to convert a graph into its line graph [8]. Hence the line graph transformation induces a bijection l : e ↦ v from the set of edges of the given graph to the set of nodes of the line graph, where for every e ∈ E there exists a v ∈ V_L, and if two edges e_i, e_j ∈ E are adjacent, there is a corresponding edge e ∈ E_L in the line graph.

Figure 2: Transformation process of a graph into its line graph. (a) An information network G. (b) Each edge in the original graph has a corresponding node in the line graph; here the green edges represent the nodes in the line graph. (c) For each adjacent pair of edges in G there exists an edge in L(G); the dotted lines here are the edges in the line graph. (d) The line graph L(G) of the graph G.
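The unweighted transformation of Section 4.1 is available off the shelf in networkx; below is a minimal sketch on a toy graph in the spirit of Figure 2 (the variable names are ours, not from the released code).

```python
import networkx as nx

# Toy graph: node 1 has degree 3, so its three incident edges will
# form a triangle in L(G).
G = nx.Graph([(1, 2), (1, 3), (1, 4), (2, 3)])

# Every edge of G becomes a node of L(G); two such nodes are adjacent
# iff the corresponding edges of G share an endpoint (Section 4.1).
L = nx.line_graph(G)

print(L.number_of_nodes())  # 4, one node per edge of G
print(L.number_of_edges())  # 5: nodes 1, 2, 3 of G contribute 3 + 1 + 1 edges (d(d-1)/2 each)
```

line2vec then re-weights these line-graph edges, as described next in Section 4.2.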
4.2 Weighted Line Graph Formation

We propose to construct a weighted line graph for our problem even if the original graph is unweighted. These weights help the random walk in the later stage of line2vec to focus more on the relevant nodes in the line graph. It is evident from Section 4.1 that a node of degree k in the original graph G produces k(k-1)/2 edges in the line graph L(G). Therefore, high degree nodes in the original graph may get over-represented in the line graph. Often many of these incident edges are not that important to the concerned node in the given network, but they can potentially change the movement frequency of a random walk in the line graph. We follow a simple strategy to overcome this problem. The goal is to ensure that the line graph not only reflects the topology of the original graph G (which is guaranteed by the Whitney graph isomorphism theorem [32] in almost all cases), but also that the dynamics of the graph are not affected by the transformation process. The edge weights are defined to facilitate a random walk on L(G), as described in Section 4.3.1. Intuitively, if we start a random walk from a node v_{ij} ≡ (v_i, v_j) ∈ L(G) and want to traverse to v_{jk} ≡ (v_j, v_k) ∈ L(G), then it is equivalent to selecting the node v_j ∈ G from (v_i, v_j) and moving to v_k ∈ G. If G is undirected, we define the probability of choosing v_j to be proportional to d_{v_i} / (d_{v_i} + d_{v_j}). Here, d_{v_i} and d_{v_j} are the degrees of the end point nodes of the edge (v_i, v_j), and an edge is in general more important to the endpoint node having the lower degree than to the endpoint with the higher degree [16]. Then, selecting v_k is proportional to the weight of the edge e_{jk} ≡ (v_j, v_k) ∈ E. Hence, for any two adjacent edges e_{ij} ≡ (v_i, v_j) and e_{jk} ≡ (v_j, v_k), we define the weight of the edge (e_{ij}, e_{jk}) of the line graph L(G) as follows:

w(e_{ij}, e_{jk}) = \frac{d_i}{d_i + d_j} \times \frac{w_{jk}}{\sum_{r \in N(v_j)} w_{jr} - w_{ij}}    (1)

This completes the formation of the weighted line graph from any given network.
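A direct transcription of Eq. 1 is straightforward. The sketch below assumes the network is held as a networkx Graph whose edges may carry a 'weight' attribute (defaulting to 1 for unweighted graphs) and that the two input edges are oriented as (v_i, v_j) and (v_j, v_k), sharing v_j; the function name is ours.

```python
import networkx as nx

def line_edge_weight(G, e_ij, e_jk):
    """Weight of the line-graph edge between two adjacent edges of G,
    following Eq. (1). Degrees are weighted degrees, i.e. the sum of
    incident edge weights, as defined in Section 3."""
    v_i, v_j = e_ij
    v_j2, v_k = e_jk
    assert v_j == v_j2, "the two edges must share their middle endpoint v_j"

    d_i = G.degree(v_i, weight="weight")
    d_j = G.degree(v_j, weight="weight")
    w_ij = G[v_i][v_j].get("weight", 1.0)
    w_jk = G[v_j][v_k].get("weight", 1.0)
    # total weight incident on v_j, excluding the edge we arrived on
    denom = sum(G[v_j][r].get("weight", 1.0) for r in G.neighbors(v_j)) - w_ij

    return (d_i / (d_i + d_j)) * (w_jk / denom)
```

Applying this to every pair of adjacent edges of G populates the edge weights of L(G).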
4.3 Embedding the Line Graph

Here we propose a novel approach to embed the nodes of the line graph. The line graph is a special type of graph which comes with some nice properties. Below is one important observation that we exploit in embedding the line graph.

Lemma 1 Each (non-isolated) node in the graph G induces a clique in the corresponding line graph L(G).

Proof 1 Let us assume that a (non-isolated) node v in the graph G has n_v edges connected to it. These n_v edges are neighbors of each other. Hence, in the corresponding line graph L(G), each of these edges is mapped to a node, and each of these nodes is connected to all the other n_v - 1 nodes. Thus there is a clique of size n_v induced in the line graph by node v.

This can be visualized in Fig. 2, where node 1 in (a), with degree 3, induces a clique of size 3, consisting of the nodes (1,2), (1,3) and (1,4), in the corresponding line graph in (d). Lemma 1 is interesting because it tells us that the nodes of the line graph exhibit a collective property, rather than just a pairwise property. To clarify, in the given network, two nodes are pairwise connected by an edge, but in the line graph, a group of nodes forms a clique. Pairwise homophily [18], which has been the backbone of many standard embedding algorithms [31], is not sufficient for embedding the line graph. Hence we propose a new concept, 'collective homophily', applicable to the line graph. We explain it below.

Figure 3: Collective homophily ensures that the embeddings of the edges which are connected via a common node in the network stay within a sphere of small radius.

4.3.1 Collective Homophily and Cost Function Formulation

We emphasize that all the nodes which are part of a clique in the line graph should be close to each other in the embedding space. One way to enforce collective homophily is to introduce a sphere (of small radius R) in the embedding space and ensure that the embeddings of the nodes (in the line graph) which are part of a clique remain within the sphere. Hence, any two embeddings within a sphere are at most 2R apart from each other. The concept is explained in Fig. 3. The smaller the radius R, the closer the embeddings of neighboring edges are to each other, and hence the better the enforcement of collective homophily. Note that a sum of pairwise homophily losses in the embedding space may lead to some pairs being very close to each other while others remain quite far apart. So, we formulate the objective function to embed the (weighted) line graph as follows.

Let us introduce some notation. Bold face letters like u (or v) denote a node in the line graph L(G), which can also be denoted by u_{uv} when the correspondence with the edge (u, v) ∈ E in the original graph G is required. Normal face letters like u, v denote nodes in the given graph. x_v ∈ R^K (equivalently x_{uv}) denotes the embedding of the node v_{uv} in the line graph (or of the edge (u, v) ∈ E).

To map the nodes of the line graph to vectors, first we want to preserve different orders of node proximities in the line graph. For this, a truncated random walk based sampling strategy S is used to provide a set of nodes N_S(v) as context for a node v in the network. Here we employ the random walk proposed by [12], which balances between the BFS and DFS search strategies in the graph. As the generated line graph is weighted, we consider the weights of the edges while computing the node transition probabilities. Let X denote the matrix with each row being the embedding x_v of a node v of the line graph. Assuming conditional independence of the nodes, we seek to maximize (w.r.t. X) the log likelihood of the context of a node:

\sum_{v \in V_L} \log P(N_S(v) \mid x_v) = \sum_{v \in V_L} \sum_{v' \in N_S(v)} \log P(v' \mid x_v)

Each of the above probabilities can be represented using the standard softmax function parameterized by the dot product of x_{v'} and x_v. As usual, we approximate the computationally expensive denominator of the softmax function using a negative sampling strategy \bar{N}(v) for any node v. The above equation, after simple algebraic manipulations, leads to maximizing the following:

\sum_{v \in V_L} \Big[ \sum_{v' \in N_S(v)} x_{v'} \cdot x_v - |N_S(v)| \log \Big( \sum_{\bar{v} \in \bar{N}(v)} \exp(x_{\bar{v}} \cdot x_v) \Big) \Big]    (2)

Next, we implement the concept of collective homophily as proposed above. Each node u ∈ V (in the original network) induces a clique in the line graph (Lemma 1). An edge (u, v) ∈ E corresponds to the node v_{uv} ∈ V_L in the line graph. So we want all the nodes of the form v_{uv} ∈ V_L, with v ∈ N(u) (the neighbors of u), to belong to a sphere centered at c_u ∈ R^K and of radius R_u. As collective homophily suggests that the embeddings of these nodes must be close to each other, we minimize the sum of all such radii. Together with Eq. 2, this gives the final cost function of line2vec:

\min_{X, R, C} \sum_{v \in V_L} \Big[ |N_S(v)| \log \Big( \sum_{\bar{v} \in \bar{N}(v)} \exp(x_{\bar{v}} \cdot x_v) \Big) - \sum_{v' \in N_S(v)} x_{v'} \cdot x_v \Big] + \alpha \sum_{u \in V} R_u^2
such that  ||x_{uv} - c_u||_2^2 \le R_u^2, \forall v \in N(u), \forall u \in V;  and  R_u \ge 0, \forall u \in V    (3)

Here, α is a positive weight factor. The constraint ||x_{uv} - c_u||_2^2 ≤ R_u^2 ensures that nodes of the form x_{uv} belong to the sphere of radius R_u centered at c_u. We use R and C to denote the sets of all such radii and centers, respectively.
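The context sets N_S(v) above come from truncated random walks on the weighted line graph. As a simplified sketch, the walk below chooses the next line-graph node with probability proportional to the Eq. 1 weights; the actual line2vec implementation uses the p/q-biased second-order walk of node2vec [12], so treat this first-order version only as an illustrative stand-in.

```python
import random

def weighted_random_walk(LG, start, walk_length):
    """One truncated walk on the weighted line graph LG; transition
    probabilities are proportional to the line-graph edge weights."""
    walk = [start]
    for _ in range(walk_length - 1):
        cur = walk[-1]
        nbrs = list(LG.neighbors(cur))
        if not nbrs:
            break
        weights = [LG[cur][n].get("weight", 1.0) for n in nbrs]
        walk.append(random.choices(nbrs, weights=weights, k=1)[0])
    return walk

# A corpus of such walks, several per line-graph node, provides the
# contexts N_S(v) fed to the skip-gram term of Eqs. (2)-(3).
def build_corpus(LG, walks_per_node=10, walk_length=80):
    return [weighted_random_walk(LG, v, walk_length)
            for v in LG.nodes() for _ in range(walks_per_node)]
```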
4.3.2 Solving the Optimization

Equation 3 is a non-convex constrained optimization problem. We use the penalty function technique [6] to convert it into an unconstrained optimization problem as follows:

\min_{X, R, C} \sum_{v \in V_L} \Big[ |N_S(v)| \log \Big( \sum_{\bar{v} \in \bar{N}(v)} \exp(x_{\bar{v}} \cdot x_v) \Big) - \sum_{v' \in N_S(v)} x_{v'} \cdot x_v \Big] + \alpha \sum_{u \in V} R_u^2 + \lambda \sum_{u \in V} \sum_{v \in N(u)} g(||x_{uv} - c_u||_2^2 - R_u^2) + \sum_{u \in V} \gamma_u \, g(-R_u)    (4)

Here the function g : R → R is defined as g(t) = max(t, 0). It imposes a penalty on the cost function in Eq. 4 when the argument of g is positive, i.e., when a constraint of Eq. 3 is violated. We use a linear penalty g(t) because its gradient does not vanish even when t → 0+. To solve the unconstrained optimization in Eq. 4, we use stochastic gradient descent, computing gradients w.r.t. each of X, R and C, and take a subgradient of g(t) at t = 0. All the penalty parameters λ and γ_u corresponding to the penalty functions are positive. When there is any violation of a constraint (or a sum of constraints), the corresponding penalty parameter is increased to impose a larger penalty. We give more importance to the constraints of the type R_u ≥ 0, as their violation may change the intuition of the solution; so we use a separate penalty parameter γ_u for each of them, which makes it possible to impose a different penalty on each such constraint. One can show that, under appropriate assumptions, any convergent subsequence of solutions to the unconstrained penalized problems must converge to a solution of the original constrained problem [6]. Very small values of the penalty parameters might lead to violation of the constraints, while very large values would make the gradient descent algorithm oscillate. So, we start with smaller values of λ and the γ_u's and keep increasing them until all the constraints are satisfied or the gradients become too large, causing abrupt changes of the function value. Note that, theoretically, some of the constraints in Eq. 3 may still be violated, but experimentally we found them satisfied to a large extent (Section 5). In the final solution, x_v gives the vector representation of node v of the line graph, which is essentially the embedding of the corresponding edge in the original network.
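For illustration, the penalty terms of Eq. 4 can be evaluated as follows; the skip-gram term is omitted, and the dictionary-based layout of X, C and R is our own assumption rather than the data layout of the released implementation.

```python
import numpy as np

def g(t):
    """Linear penalty g(t) = max(t, 0) of Eq. (4)."""
    return max(t, 0.0)

def penalty_terms(X, C, R, incident, alpha, lam, gamma):
    """Radius and constraint-violation terms of Eq. (4).

    X        : dict edge (u, v) -> embedding x_uv (np.ndarray of shape (K,))
    C        : dict node u -> sphere centre c_u
    R        : dict node u -> sphere radius R_u (scalar)
    incident : dict node u -> list of edges (u, v) incident on u in G
    alpha    : positive weight factor of Eq. (3)
    lam      : spherical-constraint penalty parameter (lambda)
    gamma    : dict node u -> penalty parameter for the R_u >= 0 constraint
    """
    radius_term = alpha * sum(R[u] ** 2 for u in R)
    sphere_term = lam * sum(
        g(float(np.sum((X[e] - C[u]) ** 2)) - R[u] ** 2)
        for u, edges in incident.items() for e in edges)
    nonneg_term = sum(gamma[u] * g(-R[u]) for u in R)
    return radius_term + sphere_term + nonneg_term
```

In line2vec these terms are minimized jointly with the skip-gram part by SGD, with λ and the γ_u's increased over the iterations as described above.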
4.4 Key Observations and Analysis

Both of the edge embedding properties mentioned in Section 3 are preserved in the construction and embedding of the weighted line graph. In particular, if two edges have a common incident node in the original network, the corresponding two nodes in the transformed line graph are neighbors. Also, two edges having similar neighborhoods in the original network lead to two nodes having similar neighborhoods in the transformed line graph. The random walk and collective homophily preserve both the pairwise and the collective node proximities of the line graph in the embedding space. Thus, different orders of edge proximities of the original network are captured well in the edge embeddings. Further, the construction of the edge weights of the line graph (Sec. 4.2) ensures that the underlying importance of the edges of the original network is preserved in the transformed line graph, and hence in the embeddings obtained through the truncated random walk.

Time Complexity: Edge embedding is computationally harder than node embedding, as the number of edges in a real life network is larger than the number of nodes. From Lemma 1, each node u in the original network induces a clique of size d_u (the degree of u in G). Hence the total number of edges in the line graph is m_L = Σ_{u∈V} C(d_u, 2) = Σ_{u∈V} d_u(d_u - 1)/2 ≤ |V|d^2, where d is the maximum degree of a node in the given network. So, the construction of the line graph takes O(|V|d^2) time. Next, we use an alias table for fast generation of the corpus of node sequences by the random walks in O(m_L log m_L) = O(|V|d^2 log(|V|d)) time, assuming the number of random walks on the line graph, the maximum length of a random walk, the context window size and the number of negative samples per node to be constants, as they are hyper-parameters of the skip-gram model. Then, the first term (under the sum over the nodes in V_L) of Eq. 4 can be computed in O(|V_L|) = O(|E|) time. Next, the term weighted by α can be computed in O(|V|) time. For the term weighted by λ, we need to visit each node in V and, for each such node, its neighbors in the original graph, which takes a total of O(|E|) time. The last term of Eq. 4 can be computed in O(|V|) time. As we use penalty methods to solve it, the runtime of solving Eq. 4 is O(|E| + |V|). Hence the total runtime complexity of line2vec is O(|V|d^2 log(|V|d)). So in the worst case (e.g., a nearly complete graph), the runtime complexity is O(|V|^3 log|V|). But for most real life social networks, the maximum degree can be considered a constant (i.e., it does not grow with the number of nodes); hence, for them, the runtime complexity is O(|V| log|V|).

5 Experimental Evaluation

We conduct detailed experiments on three downstream edge centric network mining tasks and thoroughly analyze the proposed optimization framework of line2vec.

5.1 Design of Baseline Algorithms

Unsupervised direct edge embedding for information networks is itself a novel problem. Existing approaches only aggregate the embeddings of the two end nodes to find the embedding of an edge. So, as baselines, we only consider the publicly available implementations of a set of popular unsupervised node embedding algorithms which can work using only the link structure of the graph: DeepWalk, node2vec, SDNE, struc2vec and GraphSAGE (official unsupervised implementation for un-attributed networks). We have considered different types of node aggregation methods, such as taking the average, the Hadamard product, or vector norms of the two end node embeddings [12], to generate the edge embeddings for the baseline algorithms. It turns out that the average aggregation method performs the best among them. So we report the performance of the baseline methods with average node aggregation, where the embedding of an edge (u, v) is computed by taking the average of the node embeddings of u and v.

5.2 Datasets Used and Setting Hyper-parameters

We used five real world publicly available datasets for the experiments. A summary of the datasets is given in Table 1. For Zachary's karate club and the Dolphin social network (http://www-personal.umich.edu/~mejn/netdata/), there are no ground truth community labels given for the nodes. So we use a modularity based community detection algorithm and label the nodes based on the communities they belong to. For Cora, Pubmed (https://linqs.soe.ucsc.edu/data) and MSA [26], the ground truth node communities are available. The ground truth edge labels are derived as follows. If an edge connects two nodes of the same community (intra community edge), the label of that edge is the common community label. If an edge connects nodes of different communities (inter community edge), then that edge is not considered for calculating the accuracy of the downstream tasks. Note that all the edges (both intra and inter community) are considered for learning the edge embeddings. We also provide the sizes of the generated weighted line graphs in Table 1.
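Since L(G) is much larger than G, it is useful to know its size before materializing it. The counts follow from the construction of Section 4.1 (|V_L| = |E| and, by Lemma 1, each node u contributes d_u(d_u - 1)/2 line-graph edges); the helper below is a small sketch of ours and, for simple graphs, it should reproduce the L(G) sizes reported in Table 1.

```python
import networkx as nx

def line_graph_size(G):
    """Number of nodes and edges of L(G) for a simple graph G,
    computed from the degree sequence without building L(G)."""
    n_nodes = G.number_of_edges()                            # |V_L| = |E|
    n_edges = sum(d * (d - 1) // 2 for _, d in G.degree())   # |E_L|
    return n_nodes, n_edges
```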
Table 1: Summary of the datasets used.
Dataset | #Nodes | #Edges | #Edge-Labels | #Nodes in L(G) | #Edges in L(G)
Zachary's Karate club | 34 | 78 | 3 | 78 | 528
Dolphin social network | 62 | 159 | 4 | 159 | 923
Cora | 2708 | 5278 | 7 | 5278 | 52301
Pubmed | 19717 | 44327 | 3 | 44327 | 699385
MSA | 30101 | 204926 | 3 | 204926 | 6149555

Note that line graphs are still extremely sparse in nature, which enables the use of efficient data structures and computation for sparse graphs here.

We set the parameter α in Eq. 3 to 0.1 in the experiments. At that value, the two components of the cost function in Eq. 3 contribute roughly the same to the total cost in the first iteration of line2vec. The dimension (K) of the embedding space is set to 8 for the Karate club and Dolphin social networks, as they are small in size, and to 128 for the other three larger datasets (for all the algorithms). For faster convergence of SGD, we set the initial learning rate higher and decrease it over the iterations. We vary the penalty parameters in Eq. 4 over the iterations, as discussed in Section 4.3.2, to ensure that the constraints are satisfied at large.

5.3 Penalty Errors of line2vec Optimization

We show the values of the two different penalty errors (i.e., the constraint violation errors of the penalty method based optimization) over the iterations of line2vec in Figure 5. For all the datasets, the total spherical error Σ_{u∈V} Σ_{v∈N(u)} g(||x_{uv} - c_u||_2^2 - R_u^2) converges to a small value very close to zero, and the non-negative error Σ_{u∈V} g(-R_u) remains zero. This means that almost all the constraints of the line2vec formulation are satisfied in the final solution.

Figure 5: Both the spherical error Σ_{u∈V} Σ_{v∈N(u)} g(||x_{uv} - c_u||_2^2 - R_u^2) and the non-negative error Σ_{u∈V} g(-R_u) in the penalty function based optimization of line2vec converge to zero very fast on all the datasets (panels: (a) Spherical Error, (b) Non-negative Error).

5.4 Downstream Edge Mining Tasks

Edge visualization: It is important to understand whether the edge embeddings are able to separate the communities visually. We use the embeddings of the edges as input in R^K and use t-SNE [17] to plot the edge embeddings in a 2 dimensional space. Fig. 4 shows the edge visualizations by line2vec, along with the baseline algorithms, on the Cora dataset. Note that line2vec is able to visually separate the communities well compared to all the other baselines. The same trend was observed in Fig. 1 for the small synthetic network. Line2vec, being a direct approach for edge embedding via collective homophily, outperforms all the baselines, which aggregate node embeddings to generate the embeddings of the edges.

Figure 4: Edge visualization on the Cora dataset (panels: (a) DeepWalk, (b) node2vec, (c) SDNE, (d) struc2vec, (e) GraphSAGE, (f) line2vec). Different colors represent different edge communities.

Edge Clustering: Like node clustering, edge clustering is also important to understand the flow of information within and between the communities. For clustering the embeddings of the edges, we apply the KMeans++ algorithm. To evaluate the quality of clustering, we use unsupervised clustering accuracy [34], which considers different permutations of the labels and chooses the assignment giving the best possible accuracy. Figure 6a shows that line2vec outperforms all the baselines for edge clustering on all the datasets. DeepWalk and node2vec also perform well among the baselines.

Multi-class Edge Classification: We use only 10% of the edges with ground truth labels (as generated in Section 5.2) as the training set, because getting labels is expensive in networks. A logistic regression classifier is trained on the edge embeddings generated by the different algorithms. The performance on the test set is reported using the Micro F1 score. Figure 6b shows that line2vec is better than or highly competitive with the state-of-the-art embedding algorithms. node2vec and DeepWalk follow line2vec closely. On the Dolphin dataset, node2vec outperforms line2vec marginally. The performance of line2vec on edge classification again shows the superiority of a direct edge embedding scheme over the node aggregation approaches.
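A minimal sketch of the evaluation protocol of this subsection follows, assuming the edge embeddings are given as a matrix X (one row per labelled intra-community edge) with integer labels y in {0, ..., k-1}; the helper names are ours, and the exact experimental settings (seeds, solver options) are not specified in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def clustering_accuracy(y_true, y_pred):
    """Unsupervised clustering accuracy [34]: the best one-to-one
    mapping between cluster ids and labels (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(-cost)  # maximise matched pairs
    return cost[rows, cols].sum() / len(y_true)

def evaluate_edge_embeddings(X, y, n_clusters, train_frac=0.1, seed=0):
    """Edge clustering (KMeans++) and edge classification (logistic
    regression trained on 10% labelled edges, Micro F1 on the rest)."""
    pred = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X)
    cluster_acc = clustering_accuracy(y, pred)

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_frac, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    micro_f1 = f1_score(y_te, clf.predict(X_te), average="micro")
    return cluster_acc, micro_f1
```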
5.5 Ablation Study of line2vec

The idea of line2vec is to embed the line graph for generating the edge embeddings of a given network. There are two main novel components in line2vec: first, the construction of the weighted line graph; and second, more importantly, the concept of collective homophily proposed on the weighted line graph. In this subsection, we show the incremental benefit of each component through a small edge visualization experiment on the Dolphin dataset, as shown in Fig. 7. We use node2vec (N2V) as the starting point because the skip-gram objective component of line2vec (L2V) is similar to node2vec. Though visually there is not much difference between Sub-figures 7a and 7b, there is some improvement when we apply node2vec on the weighted line graph (without using collective homophily) in Sub-fig. 7c. Finally, the superiority of line2vec due to using collective homophily on the weighted line graph is clear from Sub-fig. 7d. Thus, both the novel components of line2vec have their incremental benefits for the overall algorithm.

Figure 7: Edge visualization on the Dolphin dataset by t-SNE (panels: (a) N2V, (b) N2V+LG, (c) N2V+WLG, (d) L2V). Edge embeddings are obtained (a) by using node2vec on the input graph and then taking the average of the end node embeddings for each edge, (b) by using node2vec on an unweighted (conventional) line graph, (c) by using node2vec on our proposed weighted line graph, and (d) by line2vec. Clearly, there is an incremental improvement in quality from using the weighted line graph and then collective homophily, as reflected in (c) and (d) respectively.
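The edge visualizations of Figures 1, 4 and 7 follow the usual recipe of projecting the K-dimensional edge embeddings to two dimensions with t-SNE [17] and colouring points by edge community; a small sketch is given below (the matplotlib styling is our own assumption, not a detail specified in the paper).

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_edge_embeddings(X, labels, title="edge embeddings"):
    """2-D t-SNE projection of edge embeddings, coloured by community."""
    xy = TSNE(n_components=2, random_state=0).fit_transform(X)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10", s=8)
    plt.title(title)
    plt.axis("off")
    plt.show()
```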
5.6 Parameter Sensitivity of line2vec

Figure 8 shows the sensitivity of line2vec with respect to the hyper-parameter α (in Eq. 3) on the Karate and Dolphin datasets. We show the variation of performance for edge classification (Micro F1 score) and edge clustering (unsupervised accuracy). From the figure, one can observe that optimal performance in most of the cases is obtained when the value of α is between 0.05 and 0.1. Around these values, the losses from the two components of line2vec in Eq. 3 are close to each other. For our other experiments, we fix α = 0.1 for all the datasets.

Figure 8: Sensitivity of line2vec with respect to the hyper-parameter α (in Eq. 3) on the Karate and Dolphin datasets (panels: (a) Edge Classification, (b) Edge Clustering): variation of performance for edge classification (Micro F1 score) and edge clustering (unsupervised accuracy).

5.7 Interpretation of c_u as Node Embedding

line2vec is dedicated to direct edge embedding in information networks. Lemma 1 suggests that each node in the given network G induces a clique in the line graph L(G). Based on the concept of collective homophily, corresponding to a node u in G, the clique in the line graph is enclosed by a sphere centered at c_u ∈ R^K (Eq. 3). Intuitively, the center acts as a point which is close to the embeddings of all the nodes in the clique induced by u (or equivalently, all the edges incident on u in G). Hence the role of this center in the embedding space is similar to the role of the node with respect to its adjacent edges in the graph. This motivates us to consider c_u as the node embedding of u ∈ V in G. If (u, v) ∈ E, then the edge embedding of (u, v) should be close to both c_u and c_v, which in turn pulls c_u and c_v close to each other. Thus, node proximities are also captured in c_u. We use clustering of the nodes (a.k.a. community detection) of the given network to validate the quality of the node embeddings obtained from the centers of the line2vec optimization. We use KMeans++ clustering, as before, on the set of points c_u, ∀u ∈ V, and validate the clustering quality using unsupervised accuracy [4]. Figure 6c shows that line2vec, though designed specifically for edge embedding, performs remarkably well on a node based mining task. Specifically, for the Karate and Dolphins networks, the gain over the best of the baselines is significant. This result is interesting because, while aiming to find edge embeddings, we also generate a set of efficient node embeddings, which are not just aggregations of the incident edges.

Figure 6: Performance comparisons (panels: (a) Edge Clustering, (b) Edge Classification, (c) Node Clustering): (a) edge clustering with KMeans++, (b) Micro F1 score of edge classification, (c) node clustering with KMeans++, where we use c_u as the embedding of the node u in the given network.

5.8 Connection of Node Centrality with R_u

This subsection analyzes the interpretation of the radius R_u of the sphere enclosing the clique induced by node u ∈ V in the embedding space. When a node u has a small number of incident edges, and its neighbors are very close to each other in the embedding space (e.g., they are all from the same sub-community), a small radius R_u is enough to enclose all the edges incident on u. But when the neighbors of the node u are diverse in nature, the corresponding edges would also differ in terms of strength and semantics. For example, an influential researcher may be directly connected to many other researchers in a research network, but only a few of them can be direct collaborators. Hence, a larger sphere is needed to enclose the clique in the line graph induced by such a node. This intuition connects the radius R_u of a sphere in the embedding space of the line graph to the centrality [27] of the node u in the given network. A node which is loosely connected (i.e., has few or very similar neighbors) in the network is less central, and a node which is strongly connected (many or a diverse set of neighbors) is considered highly central.

As real life networks are noisy [4], we first experiment with a small synthetic graph, shown in Figure 9, to illustrate the connection between R_u and the centrality of the node u ∈ V. It has three communities, and there is a central (red colored) node connecting all the communities. Each community has three sub-communities, which are connected via the green colored nodes. The degree of each node in this network is kept roughly the same. We use closeness centrality [21], which is used widely in the network analysis literature. The closeness centrality of the nodes is plotted in Fig. 9(b). The nodes on the y-axis are sorted based on their closeness centrality values and, as expected, the red node tops the list as it is well connected to all the communities, followed by the green nodes, with the yellow nodes placed at the bottom. We run line2vec on this synthetic graph and plot the radius R_u for each node u in Fig. 9(c). Here also, the nodes are sorted in the same order as in sub-figure 9(b).
As one can see, the red node has the highest value of the radius. As this node is connected to a diverse set of nodes in the network, it needs a larger sphere to enclose the induced clique in the line graph. We also observe that most of the green nodes have higher values of R_u than the yellow nodes. The correlation coefficient between the closeness centrality and the radius R_u is 0.56. A more prominent trend can be observed for betweenness centrality [27], where the correlation coefficient with the radius R_u is 0.86.

For all the real-world datasets, we show the correlation of R_u with the two centrality metrics over all the nodes in Table 2. The high positive correlation between them indicates that the radius R_u of a node is roughly proportional to the centrality of the node u in the network. However, a detailed analysis is required to see the scope of introducing a new type of node centrality based on the values of R_u.

Figure 9: Relationship between the radius R_u associated with each node and closeness centrality in a synthetic graph. (a) shows the structure of the synthetic network. (b) shows the closeness centrality of the nodes, where on the Y-axis the nodes are sorted based on their centrality values. (c) shows R_u for all the nodes; the nodes on the Y-axis of (c) are sorted in the same order as in (b). The colors of the lines in (b) and (c) correspond to the three different types of nodes (colored accordingly) in (a). This figure also shows the high overlap between the top few nodes in both lists.

Table 2: Pearson correlation coefficient (CC) values between the radius R_u and the centrality values of the nodes for different networks. The centrality measures considered here are betweenness and closeness centrality.
Dataset | Karate | Dolphins | Cora | Pubmed | MSA
Betweenness CC | 0.81 | 0.66 | 0.29 | 0.26 | 0.35
Closeness CC | 0.68 | 0.78 | 0.79 | 0.59 | 0.72

6 Discussion and Future Work

We proposed a novel unsupervised, dedicated edge embedding framework for homogeneous information and social networks. We convert the given network to a weighted line graph and introduce the concept of collective homophily to embed the weighted line graph. Our framework is quite generic: the skip-gram based component in the objective function of line2vec can easily be replaced with any other approach, such as graph convolution on the weighted line graph. Besides, we also plan to extend this methodology to heterogeneous information networks and knowledge bases. There are several edge centric applications in networks. This work, being the first towards direct edge embedding, can serve as a basis for solving some of them in the context of network embedding and help move network representation learning beyond node embedding.

REFERENCES

[1] Sami Abu-El-Haija, Bryan Perozzi, and Rami Al-Rfou, 'Learning edge representations via low-rank asymmetric projections', in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1787–1796. ACM, (2017).
[2] Lada A Adamic and Eytan Adar, 'Friends and neighbors on the web', Social Networks, 25(3), 211–230, (2003).
[3] Sambaran Bandyopadhyay, Harsh Kara, Aswin Kannan, and M Narasimha Murty, 'Fscnmf: Fusing structure and content via non-negative matrix factorization for embedding information networks', arXiv preprint arXiv:1804.05313, (2018).
[4] Sambaran Bandyopadhyay, N Lokesh, and M Narasimha Murty, 'Outlier aware network embedding for attributed networks', in Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 12–19, (2019).
[5] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko, 'Translating embeddings for modeling multi-relational data', in Advances in Neural Information Processing Systems, pp. 2787–2795, (2013).
[6] Kurt Bryan and Yosi Shibberu, 'Penalty functions and constrained optimization', Dept. of Mathematics, Rose-Hulman Institute of Technology, http://www.rosehulman.edu/~bryan/lottamath/penalty.pdf, (2005).
[7] Muhao Chen and Chris Quirk, 'Embedding edge-attributed relational hierarchies'. SIGIR, (2019).
[8] Tim S Evans and Renaud Lambiotte, 'Line graphs of weighted networks for overlapping communities', The European Physical Journal B, 77(2), 265–272, (2010).
[9] Hongchang Gao and Heng Huang, 'Deep attributed network embedding', in IJCAI, volume 18, pp. 3364–3370, (2018).
[10] Zheng Gao, Gang Fu, Chunping Ouyang, Satoshi Tsutsui, Xiaozhong Liu, Jeremy Yang, Christopher Gessner, Brian Foote, David Wild, Ying Ding, et al., 'edge2vec: Representation learning using edge semantics for biomedical knowledge discovery', BMC Bioinformatics, 20(1), 306, (2019).
[11] Mohammad Golam Sohrab, Toru Nakata, Makoto Miwa, and Yutaka Sasaki, 'Edge2vec: Edge representations for large-scale scalable hierarchical learning', Computación y Sistemas, 21(4), 569–579, (2017).
[12] Aditya Grover and Jure Leskovec, 'node2vec: Scalable feature learning for networks', in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. ACM, (2016).
[13] Will Hamilton, Zhitao Ying, and Jure Leskovec, 'Inductive representation learning on large graphs', in Advances in Neural Information Processing Systems, pp. 1025–1035, (2017).
[14] Thomas N Kipf and Max Welling, 'Semi-supervised classification with graph convolutional networks', arXiv preprint arXiv:1609.02907, (2016).
[15] Suxue Li, Haixia Zhang, Dalei Wu, Chuanting Zhang, and Dongfeng Yuan, 'Edge representation learning for community detection in large scale information networks', in International Workshop on Mobility Analytics for Spatio-temporal and Social Data, pp. 54–72. Springer, (2017).
[16] David Liben-Nowell and Jon Kleinberg, 'The link-prediction problem for social networks', Journal of the American Society for Information Science and Technology, 58(7), 1019–1031, (2007).
[17] Laurens van der Maaten and Geoffrey Hinton, 'Visualizing data using t-sne', Journal of Machine Learning Research, 9(Nov), 2579–2605, (2008).
[18] Miller McPherson, Lynn Smith-Lovin, and James M Cook, 'Birds of a feather: Homophily in social networks', Annual Review of Sociology, 27(1), 415–444, (2001).
[19] Federico Monti, Oleksandr Shchur, Aleksandar Bojchevski, Or Litany, Stephan Günnemann, and Michael M Bronstein, 'Dual-primal graph convolutional networks', arXiv preprint arXiv:1806.00770, (2018).
[20] M.E.J. Newman, Networks: An Introduction, Oxford University Press, Oxford, UK, 2010.
[21] Tore Opsahl, Filip Agneessens, and John Skvoretz, 'Node centrality in weighted networks: Generalizing degree and shortest paths', Social Networks, 32(3), 245–251, (2010).
[22] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena, 'Deepwalk: Online learning of social representations', in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710. ACM, (2014).
[23] Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo, 'struc2vec: Learning node representations from structural identity', in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394. ACM, (2017).
[24] E. Rogers, Diffusion of Innovations, Free Press, New York, USA, 1995.
[25] Yu Shi, Qi Zhu, Fang Guo, Chao Zhang, and Jiawei Han, 'Easing embedding learning by comprehensive transcription of heterogeneous information networks', in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2190–2199. ACM, (2018).
[26] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-june Paul Hsu, and Kuansan Wang, 'An overview of microsoft academic service (mas) and applications', in Proceedings of the 24th International Conference on World Wide Web, pp. 243–246. ACM, (2015).
[27] Oskar Skibski, Talal Rahwan, Tomasz P Michalak, and Makoto Yokoo, 'Attachment centrality: An axiomatic approach to connectivity in networks', in Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 168–176. International Foundation for Autonomous Agents and Multiagent Systems, (2016).
[28] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio, 'Graph attention networks', in International Conference on Learning Representations, (2018).
[29] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm, 'Deep graph infomax', in International Conference on Learning Representations, (2019).
[30] Janu Verma, Srishti Gupta, Debdoot Mukherjee, and Tanmoy Chakraborty, 'Heterogeneous edge embedding for friend recommendation', in European Conference on Information Retrieval, pp. 172–179. Springer, (2019).
[31] Daixin Wang, Peng Cui, and Wenwu Zhu, 'Structural deep network embedding', in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234. ACM, (2016).
[32] H. Whitney, 'Congruent graphs and the connectivity of graphs', American Journal of Mathematics, 54(1), 150–168, (1932).
[33] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu, 'A comprehensive survey on graph neural networks', arXiv preprint arXiv:1901.00596, (2019).
[34] Junyuan Xie, Ross Girshick, and Ali Farhadi, 'Unsupervised deep embedding for clustering analysis', in International Conference on Machine Learning, pp. 478–487, (2016).
[35] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang, 'Network representation learning with rich text information', in IJCAI, pp. 2111–2117, (2015).
[36] Yang Zhou, Sixing Wu, Chao Jiang, Zijie Zhang, Dejing Dou, Ruoming Jin, and Pengwei Wang, 'Density-adaptive local edge representation learning with generative adversarial network multi-label edge classification', in 2018 IEEE International Conference on Data Mining (ICDM), pp. 1464–1469. IEEE, (2018).