key: cord-0045888-izog355j authors: Selvarajah, Kalyani; Kobti, Ziad; Kargar, Mehdi title: Link Prediction by Analyzing Temporal Behavior of Vertices date: 2020-05-22 journal: Computational Science - ICCS 2020 DOI: 10.1007/978-3-030-50420-5_19 sha: 510d8fb6500932a7b469336d7b5bb908129af816 doc_id: 45888 cord_uid: izog355j Complexity and dynamics are challenging properties of real-world social networks. Link prediction in dynamic social networks is an essential problem in social network analysis. Although different methods have been proposed to enhance the performance of link prediction, these methods need significant improvement in accuracy. In this study, we focus on the temporal behavior of social networks to predict potential future interactions. We examine the evolving pattern of vertices of a given network [Formula: see text] over time. We introduce a time-varying score function to evaluate the activeness of vertices that uses the number of new interactions and the number of frequent interactions with existing connections. To consider the impact of timestamps of the interactions, the score function engages a time difference of the current time and the time of the interaction occurred. Many existing studies ignored the weight of the link in the given network [Formula: see text], which brings the time-varied details of the links. We consider two additional objective functions in our model: a weighted shortest distance between any two nodes and a weighted common neighbor index. We used Multi-Layer Perceptron (MLP), a deep learning architecture as a classifier to predict the link formation in the future and define our model as a binary classification problem. To evaluate our model, we train and test with six real-world dynamic networks and compare it with state-of-the-art methods as well as classic methods. The results confirm that our proposed method outperforms most of the state-of-the-art methods. Social Networks (SN) can be used to model a comprehensive range of real-life phenomena and examine the world around us. It ranges from online social interaction, including Facebook, Twitter, and LinkedIn to human interactions such as co-authorship, healthcare, and terrorist networks. Social networks analysis is the study of such networks to discover common structural patterns and explains 1. Active individuals in social networks are popular among both existing members and new members who like to join the network. They believe that active individuals, for instance, in the co-authorship network, always update their research with current trends as well as being open to new ideas. So, to evaluate the activeness of any member, we consider two factors on the temporal network. (a) The score for constructing new connections (b) The score for the increased number of interactions with existing connections (How much the existing link becomes strong). We introduce a new score function to incor-porate the impact of the timestamps and the gap between the current time and the time of the interaction occurred. Besides this, we introduce a probability function based on the activeness score of a pair of nodes to decide the likelihood of occurring a new link. 2. The smaller distance between any two individuals is higher the chances of future interaction. We incorporate the weighted shortest distance in LATB. In addition to this, we include another objective function, the weighted common neighbor index, which incorporates the time to evaluate the changes of strength of the neighbors' relationship. 3 . In LATB, we used Multi-Layer Perceptron (MLP) as a classifier to predict the link formation in the future and defined our model as a binary classification problem. The remaining of the paper is organized as follows. Section 2 describes related works. Section 3 specifies the problem definitions. Section 4 presents the experimental setup and the corresponding results. Finally, Sect. 5 concludes the research idea of this paper with directions for future work. Link prediction problem on the static network examines a single snapshot of a network structure at time t as an input, and then predicts possible unobserved links at time t (t ≤ t ) [13] . On the other hand, link prediction in dynamic networks investigates the evolution of networks over time as a sequence of snapshots and then predicts new links in the future. This section presents an overview of the link prediction problems on social networks. Several methods have been proposed to deal with the link prediction problem on the temporal network systems during the past decade. The researchers designed a lot of topology-based similarity metrics for link prediction such as Common Neighbors (CN) [16] , Adamic-Adar Coefficient (AA) [1] , and Katz (KZ) [9] . Since the weights of links are rarely taken into account, many researchers modified those metrics in order to adopt the dynamic features of the social networks. The authors [15] examine the link prediction based on connection weight score structural properties of a given network. Zhu et al. [32] proposed a weighted mutual information model which is to estimate the effect of network structures on the connection likelihood by considering the benefits of both structural properties and link weights. Potgieter et al. [21] showed that temporal metrics are valuable features in terms of accuracy. Tylenda et al. [25] proposed a graph-based link prediction algorithm and integrated it with temporal information and extended the local probabilistic model to involve time awareness. Yao et al. [31] used time-decay to manage the weight of the links and modified the common neighbor index to includes nodes in 2-hop. The authors [10] presented a time frame based unsupervised link prediction method for directed and weighted networks and derived a score for potential links in a time-weighted manner. Tong W et al. [29] examined the concepts of the temporal trend of nodes by considering the changes of the degree over time using the structural perturbation method. Munasinge et al. [14] studied the impact of a relationship between timestamps of interactions and strength of the link for the future. Xiaoyi Li et al. [12] proposed a deep learning method, conditional temporal restricted Boltzmann machine, which adopted a combination of feature engineering and CNN to predicts links. Recently, Goyal et al. [6] proposed DynGEM, which uses the recent advances in deep learning methods, autoencoders for graph embeddings to handle growing, dynamic graphs and for link prediction. Wang et al. [27] examined relational deep learning to jointly model high-dimensional node attributes and link structures with layers of latent variables and proposed generalized variational inference algorithm for learning the variables and predicting the links. A dynamic network is evolving over time and can be considered as a sequence of network snapshots within a time interval. The size of the network can occasionally shrink or expand as the network evolves. In this work, we focus on undirected weighted graphs. Given a series of snapshots {G 1 , G 2 , . . . , G t−1 } of an evolving graph G T = V, E T , where the edge e = (u, v) ∈ E t represents a link between u ∈ V t and v ∈ V t at a particular time t . The dynamic link prediction approaches attempt to predict the likelihoods of links in the next time step G t . The list of graphs {G 1 , G 2 , . . . , G t−1 } corresponding to a list of symmetric adjacency matrices {A 1 , A 2 , . . . , A t−1 }. The adjacency matrix A T of G T is a N ×N matrix where each element A T (i, j) takes 1 if the nodes v i ∈ V and v j ∈ V, are connected at least once within time period T and takes 0 if they are not. Given a sequence of (t − 1) snapshots {A 1 , A 2 , . . . , A t−1 }, the goal is to predict the adjacency matrix A t at future time t. The idea of node activeness is highly related to the temporal behaviors of nodes. We can determine the active nodes through the analysis of the time-varying historical information of the nodes. To decide the activeness of the nodes, we can examine how they interact with others (nodes) throughout the timeframe. With any temporal network involving humans, we believe that the following factors are highly relevant to decide the activeness of nodes. New Connections: A node can remain inactive or active. It can be decided based on how often a node made the new interactions throughout the time frame. Let us consider any two members A and B of a given dynamic network G with the same number of new connections in the past; A might be connected at an early stage and not creating any new connection later (can be named as sleeping node), while node B formed most of his or her connection at the later stage (active node). The node B would attract more new people and have a high probability of generating new connections in the near future. We believe that considering this behavior of the node is significant in predicting the new links. Connections) . Given a network G T V, E T and a set of nodes A = a 1 , a 2 , . . . , a n at time t m . Let's say the time windows to estimate the new connections are |t 1 The score of building new connection at time t n m of a member or node of the network a k can be defined as: where N C ti a k is the number of new connection made by the node a k at time t i . The term |t n m − t i | is the timestamp between current time and the selected time that the number of new connections has been made by the node a k . If the difference between t n m and t i is high, the value of N C ti a k over |t n m − t i | + 1 become smaller although a node made more number of new connections at early time (N C ti a k is large). However, this is opposite for the nodes which made new connection recently. The timestamp has an addition of one to avoid the denominator become infinite when the two timestamps are equal. The Frequency of Interaction: In addition to the new interactions, a node can continuously associate with the existing connected nodes. It is another way to decide the activeness of the node. Let us consider two nodes A and B with the same number of existing connections in the past; A might have several interactions at an early stage and not interacting with them later, while B is having frequent interactions with existing connections throughout the time frame. The later one (B) would attract more new nodes and have a high probability of generating new connections in the near future. The score of frequent collaborations with existing connection at time t n m of a node a k can be defined as: where EC ti a k is the number of frequent collaboration with existing connection by a node a k at time t i . For instance, let's consider a network of authors (nodes) at certain year (2012) and the number new connection which both A and C made through last seven years (till 2019) can be shown as Fig. 1 . Similarly, the value of SN (C) can be evaluated to 2.89 as the information given in Fig. 1 . Although both A and C made the equal number of new connections (= 9) in the past seven years, the score has a huge difference. The reason is C built new connections recently than A. This shows that nodes, which generate new connections recently have more attractive than others. Likewise, we can evaluate the score value of active nodes in terms of frequent interaction with existing connections. Since both building new interactions and frequent interactions with existing connections have an influence on deciding the active node, the combination of these scores is a proper way to determine the score of the active node as given in Eq. 3. where λ is a tradeoff value between the score for building new connections and expanding existing connections. Algorithm 1 describe the step by step process of calculating node activeness score for all nodes and store it in a lookup table. At this point, every node is assigned by a score based on their activity on dynamic networks. However, to maintain the range of values between 0 and 1, we normalize each score. Inspired by configuration model, the probability P ai,aj of a link exists between any two nodes a i and a j would be proportional to SA(a i ).SA(a j ). Since the probability should be between 0 and 1, we drive the Ex ← 0 Existing Connection Score 5: Nw ← 0 New Connection Score 6: nb ← Γ [Gt 0 (n)] collaborated nodes at time t = 0 7: for all ti in T do 8: for all m ∈ nb do 10: if m ∈ nb0 then where P ai,aj is the probability of existing link between node a i and a j , SA(a i ) and SA(a j ) are the popularity scores of nodes a i and a j respectively and n is the total number of nodes in the networks. We name our proposed method LATB (Link prediction by Analyzing Temporal Behaviour of vertices), because it highlights the behaviors of vertices. In the past, the majority of researches [13, 19] have examined the accuracy of several heuristics for link prediction such as Adamic Adar, Preferential Attachment, and Jaccard Coefficient. In this research, we consider two modified forms of heuristics: weighted shortest path distance and weighted common neighbors. In the undirected social network G, if some nodes have past interaction, their associated nodes in G are connected by an edge. If many levels of past interactions between two nodes are taken into account, then the input graph G is weighted. In this case, the smaller the edge weight between two nodes, the two nodes had more interactions in the past and have higher chances of interactions in the future. The distance between two nodes a i and a j , specified as dist(a i , a j ), is equal to the sum of the weights on the shortest path between them in the input graph G. If a i and a j are not connected in graph G, i.e., there is no path between a i and a j in G, the distance between them is set to ∞. SD ai,aj = wdist(a i , a j ) (5) The Common Neighbors (CN) is the most widely used index in link prediction and evidence to the network transitivity property. It counts the number of common neighbors between node pair a i and a j . Newman et al. [16] has estimated this quantity in the context of collaboration networks. The probability that a i and a j collaborate in the future can be written as 6. where Γ (a i ) and Γ (a j ) consists of number of neighbors of the node a i and a j in G respectively. As mentioned in the above definition, the common neighbors only consider the binary relations between nodes and ignore the time-varying nature and number of link occurrences. We adopt the time-varied weights into the common neighbors, which can give better predictions [8] . β is an attenuation factor and t is the time considered for prediction. We treat the link prediction problem as a binary classification problem. We used MLP as a classifier. In this regards, we generate a dataset for all existing links in a last time step of a given dynamic network G T , for the positive link class, where for any two vertices i and j, the link between i and j, ij ∈ E t . We generate negative link class, where any two vertices i and j, ij / ∈ E t by using downsampling technique to avoid imbalance problem. We assign the binary crossentropy as loss function, which can be written as: where y is binary indicator (0 or 1), p is predicted probability. Finally, we build and train a neural network for link prediction. MLP is one of the most common and a variant of the original Perceptron model [22] . Here, we only briefly discuss the components of an MLP since this paper is not about MLP innovations. A typical MLP system can be built with layers of neurons, as shown in Fig. 2 . Each neuron in a layer calculates the sum of its inputs (x) that are carried through an activation function (f). The output (O) from the network can be written as: where O jk is the neuron j th output at k th layer and β jk is bias weight for neuron j in layer k, respectively. We conduct extensive experiments to test our model with five real-world dynamic networks and use AUC (Area Under Curve) as evaluation metrics. We use six real world dynamic networks (Table 1 ): Enron corpus [11] and Radoslaw [23] are email communication networks. Each node specifies an employee and link represents email conversation among employees. Enron has the details from 6 January 1998 until 4 February 2004 while Randoslaw is from January 2 nd 2010 to September 30 th 2010. Contact [4] is data from wireless devices carried by people. Every node is people, and a link established when they contacted. The contact list represents the active contacts during 20-s intervals of the data collection. College Messages [17] have private messages sent on an online social network at the University among college people. EU-core [18] is an email data from a large European research institution. Link is the communication between members from 4 different departments. Mathoverflow [18] has the interactions on the stack exchange web site Math Overflow. In the beginning, we sort the dataset in ascending order of time, and then we process a sequence of snapshots for each dataset at a fixed interval. We split every dynamic networks into five time frames G 1 , G 2 , G 3 , G 4 , G 5 . We evaluate each scoring value at the last snapshot. We implement our model in Python3, processed the dataset on IBM cluster, the specification of POWER8 52 processor 256 GB of RAM. We trained and tested our model in an Nvidia GTX 1050Ti, 4 GB GPU with 768 CUDA cores. In the weighted common neighbor index, we set the attenuation factor β to 0.001. To evaluate the active score of the nodes, we assign tradeoff factor λ to 0.5, because we believe that both the score for new connections and the score for frequent interaction with existing connections are equally important to decide the activeness of a person. In the MLP, the first layer has four neurons with the ReLu activation function. We use two hidden layers of 32 neurons. The output layer contains a single neuron with the Sigmoid activation function. We train the neural network for 100 epochs. We use 80% training set, 10% validation set, and 10% testing set. We repeat the above process for ten times and find the average AUC. We use various methods as the baselines, including classical methods such as Common Neighbors (CN), Jaccard coefficient (JC), Adamic Adar (AA) and Preferential attachment (PA), and network embedding methods such as node2vec, LINE, DeepWalk, and SDNE. The brief introduction of these methods is listed as follow: -Common Neighbors (CN) [16] : It is one of the most common measurements used in link prediction problem. Having a large number of the common neighbors easily create a link. -Jaccard Coefficient (JC): It is a normalized form of the CN index. -Adamic-Adar Coefficient (AA) [1] : It evaluates the impotency of a node when having less number of neighbors when predicting links. -Preferential Attachment (PA) [3] : It generates the belief that nodes with large number of neighbors are more likely to form more in the future. -node2vec [7] : It is a node embedding method, which learns nodes representation of network by preserving higher-order proximity between nodes. It used a higher probability of node occurrence in a fixed-length random walk. -LINE [24] : It used an objective function to preserves the first-order and second-order neighborhoods to learn node representations, most similar to node2vec. It is useful to apply for large-scale network embedding. -DeepWalk [20] : It used random walk model to learn vertex representations. This embedding can be used to predict link existence. -SDNE [26] : It used both the first-order and second-order proximities together in an autoencoder based deep model to generate the vertex representations. Regarding the implementations, we evaluated the Link prediction problem from the original code by the authors for node2vec 1 , LINE 2 , DeepWalk 3 and SDNE 4 . Our model achieves a significant improvement compared to other methods in various dynamic networks except the dataset Enron, which is better in SDNE. The standard network embedding methods, node2vec and LINE, perform the worst than other methods in terms of AUC. LATB performs significantly better in Contact and Eu-Core networks above 96% while the classic and embedding methods achieve below 92%. Moreover, LATB has an AUC higher than 93% among all tested dynamic networks. We can conclude that LATB can perform well in both very sparse (Enron) and dense (Radoslaw) networks. Tables 2 and Fig. 3 present the results comparison of other methods with LATB. In this paper, we propose a model for link prediction in dynamic networks by analyzing temporal behaviors of vertices, named LATB. To model the evolving pattern of each vertex, we propose a new scoring method, which can engage the historical changes of vertices. To further address LATB, temporal changes of vertices are analyzed in two ways to measure the activeness of a node: how often a vertex interacts with existing connected nodes -to measure the strength of the relationship with its neighbors, and how fast and often collaborate with new nodes. Because both measures have a strong influence on deciding the node activeness, we introduce a probability function based on the activeness of nodes to evaluate the chances of being connected in the future. We also use two other weighted indexes: shortest distance and common neighbors, to incorporate the time-varying nature and number of link occurrences in neighbor nodes. In LATB, MLP is used as a classifier, which results in the status of link existence. Empirically, we compare LATB with classical methods and traditional embedding methods in five different real-world dynamic networks. Overall our model, LATB, achieves significant improvements and reaches above 93% of AUC. Friends and neighbors on the web Link prediction models for recommendation in academic collaboration network of Turkey Evolution of the social network of scientific collaborations Impact of human mobility on opportunistic forwarding algorithms Evolution of networks Dyngem: deep embedding method for dynamic graphs node2vec: scalable feature learning for networks Link prediction based on time-varied weight in co-authorship network A new status index derived from sociometric analysis Unsupervised link prediction based on time frames in weighted-directed citation networks The enron corpus: a new dataset for email classification research A deep learning approach to link prediction in dynamic networks The link-prediction problem for social networks Time aware index for link prediction in social networks Link prediction of social networks based on weighted proximity measures Clustering and preferential attachment in growing networks Patterns and dynamics of users' behavior and interaction: network analysis of an online community Motifs in temporal networks Finding experts by link prediction in co-authorship networks Deepwalk: online learning of social representations Temporality in link prediction: understanding social complexity The perceptron: a theory of statistical separability in cognitive systems The network data repository with interactive graph analytics and visualization Line: large-scale information network embedding Towards time-aware link prediction in evolving social networks Structural deep network embedding Relational deep learning: a deep latent variable model for link prediction Link prediction in evolving networks based on popularity of nodes Friend or frenemy?: predicting signed ties in social networks Link prediction based on common-neighbors for dynamic social network Link prediction in weighted networks: a weighted mutual information model Acknowledgment. This research work was supported by International Business Machines (IBM); experiments were conducted on a high performance IBM Power System S822LC Linux Server.