This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/authorsrights http://www.elsevier.com/authorsrights Author's personal copy Proximity measures for link prediction based on temporal events Paulo R.S. Soares, Ricardo B.C. Prudêncio ⇑ Center of Informatics (CIn), Federal University of Pernambuco, Recife, PE, Brazil a r t i c l e i n f o Keywords: Link prediction Temporal events Neighborhood-based measures Co-authorship networks a b s t r a c t Link prediction is a well-known task from the Social Network Analysis field that deals with the occur- rence of connections in a network. It consists of using the network structure up to a given time in order to predict the appearance of links in a close future. The majority of previous work in link prediction is focused on the application of proximity measures (e.g., path distance, common neighbors) to non-con- nected pairs of nodes at present time in order to predict new connections in the future. New links can be predicted for instance by ordering the pairs of nodes according to their proximity scores. A limitation usually observed in previous work is that only the current state of the network is used to compute the proximity scores, without taking any temporal information into account (i.e., a static graph representa- tion is adopted). In this work, we propose a new proximity measure for link prediction based on the con- cept of temporal events. In our work, we defined a temporal event related to a pair of nodes according to the creation, maintenance or interruption of the relationship between the nodes in consecutive periods of time. We proposed an event-based score which is updated along time by rewarding the temporal events observed between the pair of nodes under analysis and their neighborhood. The assigned rewards depend on the type of temporal event observed (e.g., if a link is conserved along time, a positive reward is assigned). Hence, the dynamics of links as the network evolves is used to update representative scores to pairs of nodes, rewarding pairs which formed or preserved a link and penalizing the ones that are no longer connected. In the performed experiments, we evaluated the proposed event-based measure in different scenarios for link prediction using co-authorship networks. Promising results were observed when the proposed measure was compared to both static proximity measures and a time series approach (a more competitive method) that also deploys temporal information for link prediction. � 2013 Elsevier Ltd. All rights reserved. 1. Introduction As the amount of interaction among individuals increases in vir- tual environments, more useful social data become available, serv- ing as a basis for social network analysis. Social networks are structures composed by individuals (entities) that can be con- nected by different forms of social relationship (such as friendship, in which two individuals are connected if they are friends) (Ama- ral, Scala, Barthelemy, & Stanley, 2000). In social networks, connec- tions and entities tend to appear and disappear along time, which turns them into highly dynamic and complex systems. Social Net- work Analysis (SNA) is a broad field of research that tries to deal with such complexity (Wasserman & Faust, 1994). Several tasks can be associated to SNA. In this paper, our specific aim is to inves- tigate the prediction of links in a social network, that is, we are fo- cused on predicting the most probable future connections based on previous states of the network. This is a well-known problem from the SNA field called link prediction (Hasan, Chaoji, Salem, & Zaki, 2006). A lot of work has been devoted to cope with the link prediction problem (Getoor & Diehl, 2005; Wang, Satuluri, & Parthasarathy, 2007; Xiang, 2008). The majority of previous work is based on the application of proximity measures to non-connected pairs of nodes in the network at present time in order to predict new con- nections at future time. The proximity measures are used to asso- ciate scores to the pairs of nodes, which can be used either: (1) in a unsupervised approach, in which a chosen score is simply ordered and links are predicted for the top ranked pairs; or (2) in a super- vised approach, in which the link prediction is treated as a classi- fication task and different scores are used as predictor attributes by a learning algorithm. A limitation that can be pointed out in the previous work is that the proximity scores are usually calcu- lated without taking into account the evolution of the network. The proximity measures are computed using all network data up to the present moment (i.e., the current network state) without considering when links were created. Hence, a potentially useful source of information for link prediction is not adequately leveraged. 0957-4174/$ - see front matter � 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.eswa.2013.06.016 ⇑ Corresponding author. Tel.: +55 8121268430. E-mail addresses: prss@cin.ufpe.br (P.R.S. Soares), rbcp@cin.ufpe.br (R.B.C. Prudêncio). Expert Systems with Applications 40 (2013) 6652–6660 Contents lists available at SciVerse ScienceDirect Expert Systems with Applications j o u r n a l h o m e p a g e : w w w . e l s e v i e r . c o m / l o c a t e / e s w a Author's personal copy In the current work, we propose a novel proximity measure in which temporal information is taken into account. In the proposed measure, we deployed the concept of temporal events, which are specific activities (e.g., creation or removal of a link) observed be- tween a pair of nodes in consecutive time intervals. For instance, an innovative event occurs when two nodes are not connected in a given time interval and a new link is created between them in the next interval. The proximity score for a pair of nodes is com- puted by monitoring along time the events observed around the nodes and their immediate neighbors. Each category of temporal event defined in our work (innovative, conservative and regressive) is associated to a numerical reward and the proximity score in- creases or decreases along time depending on the temporal events observed in consecutive network frames. Our proposal was sup- ported by the Homans’ work that associates the strength of a con- nection between individuals to their frequency of interaction (which was modeled in our work by the temporal events) (Homans, 1951) and the idea shared by many authors that a bigger common neighborhood between individuals is closely related to a higher probability of future connections (Newman, 2001). In order to verify the viability of the proposed measure, we per- formed experiments on co-authorship networks extracted from four sections of the physics e-Print arXiv.1 In these experiments, the proposed measure was evaluated in different scenarios by con- sidering for instance different reward values for the temporal events and by adopting a weighting function to give higher importance to more recent events. For a baseline comparison, experiments were performed with traditional proximity measures previously adopted in the literature. We also performed experiments with a time series approach (Potgieter, April, Cooke, & Osunmakinde, 2009) which, similarly to our approach, also aims to consider temporal informa- tion to improve link prediction. In all experiments, the unsupervised approach was adopted to perform the prediction task once the scores were calculated and the Area Under the ROC Curve (AUC) was used to evaluate the performance of prediction. The performance achieved by the event-based measure outperformed the results of the baseline methods in all the networks considered. Section 2 briefly presents the link prediction problem. Section 3 describes the event-based approach, showing the concepts used and the score calculation process. Section 4 presents the social net- work data adopted, the experiments and obtained results. Finally, Section 5 concludes this work by presenting some final consider- ations and future work. 2. Link prediction Link prediction consists of predicting new connections or detecting hidden links in a network. It is a very important task applicable to a wide variety of areas, such as bibliographic domain, molecule biology, criminal investigations and recommending sys- tems (Getoor & Diehl, 2005; Xiang, 2008). A traditional definition of the link prediction problem is expressed by: ‘‘Given a snapshot of a social network at time t, we seek to accurately predict the edges that will be added to the network during the interval from time t to a given future time t0’’ (Liben-Nowell et al., 2003). Among several approaches to treat the problem, the most widespread ones rely on the use of proximity measures between pairs of nodes (Lu & Zhou, 2011; Xiang, 2008). As previously mentioned, the starting point of this approach is to extract the values/scores of different measures that indicate the similarity between pairs of nodes. These scores can be used either by unsupervised (Liben-Nowell et al., 2003; Lu & Zhou, 2011; Murata & Moriyasu, 2008) or supervised link prediction (Hasan et al., 2006; Lichtenwalter, Lussier, & Chawla, 2010; Sá & Prudêncio, 2011). In the former approach, a proximity measure is chosen and deployed to rank node pairs in the network. The top ranked ones are predicted to be linked. In the supervised link prediction, a set of proximity measures is cho- sen and adopted as predictor attributes by a classifier, learned based on historical data. Each pair of non-connected nodes is de- scribed by the set of proximity scores. A classifier then uses these attributes to perform a binary classification to decide whether the link will be formed or not in the future. The proximity measures proposed and evaluated in the litera- ture can be broadly categorized into semantic or topological mea- sures (Xiang, 2008). In the semantic measures (or node-wise measures), the nodes’ content is considered to measure proximity. For instance, in a co-authorship network, the similarity between keywords extracted from published papers can be used to predict future interaction among the authors (Xiang (2008)). Different from the semantic measures, the topological strategy consists of deploying the network structure to compute the proximity scores (e.g. the number of common neighbors that two nodes share). Topological measures are more commonly adopted in the litera- ture since they are more general and do not require the definition of rich features to describe content (in fact, rich content is not al- ways available depending on the social network considered). Several topological measures were proposed in the literature mainly categorized into neighborhood-based or path-based mea- sures (Hasan & Zaki, 2011). The neighborhood-based measures take into account the immediate neighbors of the nodes. In general, these measures consider that two nodes are more likely to form a link if their sets of neighbors have a large overlap (Xiang, 2008). Among the neighborhood-based measures, we can mention Common Neighbors (Newman, 2001), Preferential Attachment (Barabasi et al., 2002; Newman, 2001),Adamic and Adar (2003) and the Jaccard’s coefficient (Salton & McGill, 1986). The path- based measures in turn define proximity between nodes by consid- ering the paths between them. The basic idea is that two nodes are more likely to form a link if there are short paths between them. The path-based measures range from the simple path-distance measure to more sophisticated definitions that consider ensembles of different paths, such as the Katz measure (Katz, 1953). In com- parative terms, the neighborhood-based methods are more wide- spread, due to both their computational eficiency and great performance observed in experiments (Huan, 2006; Liben-Nowell et al., 2003; Murata & Moriyasu, 2008). The measure proposed in our work can be categorized as neighborhood-based, since it uses information about the connections around nodes to assign scores to them (see Section 3). Most previous work performs link prediction by statically ana- lyzing the network data, that is, no temporal information is consid- ered to perform the prediction. However, temporal information (e.g., the moments when two nodes interacted in the past or the time when a connection was first observed) is an important aspect that should be considered during the link prediction (Hasan & Zaki, 2011). For instance, when computing a proximity score based on neighborhood, it would be interesting to consider not only how many but also when the links were formed between neighbors. Recent activity between common neighbors may be more impor- tant than older activity. Also, static approaches are appropriate to investigate whether a certain link will ever occur in a network but they are less useful, for instance, to applications for which the prediction of repeated link occurrences are of interest (Huang & Lin, 2009). In this section, we mention some previous work concerning the use of temporal information for link prediction. For instance, in Tylenda, Angelova, and Bedathur (2009), the authors modeled a network as a weighted graph, in which the weight of a link was the age of the most recent activity between the corresponding1 http://www.arxiv.org P.R.S. Soares, R.B.C. Prudêncio / Expert Systems with Applications 40 (2013) 6652–6660 6653 Author's personal copy nodes. The link prediction task was then accomplished by deploy- ing extended versions of the proximity measures adequate for weighted networks (e.g., weighted Adamic Adar). Bringmann, Berlingerio, Bonchi, and Gionis (2010) proposed to mine network data augmented with temporal information in order to discover association rules (in terms of frequent subgraphs) that best ex- plained the network evolution. In Juszczyszyn, Musial, and Budka (2011), the authors adopted a related approach in which the his- tory of the network (recorded during past time windows) is used to derive probabilities of transitions between triads of nodes. Although promising results can be obtained in this approach, min- ing frequent subgraphs has shown to be a very expensive task (Coenen, Jiang, & Zito, 2012). An alternative approach for deploying temporal information is to treat link prediction as a time series forecasting problem (Huang & Lin, 2009; Potgieter et al., 2009; Qiu, He, & Yen, 2011; Soares & Prudêncio, 2012). Huang and Lin (2009) built a time series for each pair of nodes, in which each series observation is the frequency of occurrence of links between the nodes during a specific time peri- od. The time series forecasts produced by ARIMA models (Box & Jenkins, 1970) were then used to estimate the probability of future link occurrence. In Potgieter et al. (2009) and Soares and Prudêncio (2012), the authors adopted a similar idea, but in this case time series models were used to predict proximity scores. In this ap- proach, given a chosen proximity measure, a time series is built for each pair of nodes by computing the score in a sequence of time periods (i.e., using different snapshots of the network along time). A final proximity score is then obtained for each pair by forecasting its corresponding time series. Different models were applied in the forecasting process, including ARIMA models, smoothing methods and linear regression models. Despite the good results obtained in experiments, a difficulty in the time series approach is to choose an adequate forecasting model, among the variety of models that can be applied. In fact, in Soares and Prudêncio (2012), the authors ob- served that a wrong decision concerning the forecasting model can have a negative impact in the link prediction performance. 3. Event-based link prediction Social networks are highly dynamic structures in which several connections and nodes tend to appear or disappear along time. The temporal evolution of these networks brings valuable information about how connections tend to be formed, and, for that, should be considered in the link prediction task. In the current work, we pro- pose a new proximity measure that takes into account the tempo- ral structure of a network and temporal events related to pairs of nodes in the network. The aim of our proposal is to combine Ho- mans’ thought that the strength of a connection between two indi- viduals is directly associated with how often they interact with one another (modeled here by means of temporal events) (Homans, 1951) and Newman’s idea that the bigger the number of common neighbors between two nodes, the higher is their probability to be connected in the future (Newman, 2001). The general idea of the proposed approach is to increase or de- crease the proximity score between two nodes depending on tem- poral events observed among the two nodes and their neighborhood. A temporal event, which will be explained better in the next section, is defined through the appearance or disap- pearance of links around nodes as the network evolves. In the pro- posed work, initially a temporal structure is created by extracting consecutive frames of the network at different time intervals. A proximity score is then computed for each pair of nodes by aggre- gating the rewards assigned for the temporal events observed dur- ing the transitions of frames. The proposed measure is explained in more detail in the next subsections. 3.1. Temporal structure In the proposed solution, we initially build a temporal structure N , by following a methodology described in Soares and Prudêncio (2012). First of all, the network is split into several time-sliced snapshots, which represent states of the network at different time intervals in the past. After that, frames are built by grouping con- secutive snapshots. The size of each frame is the same as the length of the prediction window, pre-defined in the link prediction task. Each frame represents one time step in N . This methodology is detailed below. Let GðV; EÞ be a graph representing a social network observed up to time T. Each edge in E is represented by a triple ðu; v; tÞ, indicat- ing that the nodes u and v 2 V had a social interaction at time t. Hence, more than one edge related to a single pair of nodes can be observed in E if the nodes interacted in different moments in the past. Let w be the length of the prediction window, i.e., our task is to predict new links concerning the future time interval from T þ 1 to T þ w. Let Gt be the sub-graph of G containing only the edges observed at time t. Let ½Gt; Gtþ1; . . . ; Gtþm� be the frame formed by the disjoint union of the graphs from time t to t þ m. In our work, a set of n con- secutive frames N ¼fF1; . . . ; F k; . . . ; F ng of size w is extracted from the graph G. Formally, N can be defined as N ¼f ½GT�nwþ1; GT�nwþ2; . . . ; GT�ðn�1Þw�; . . . ; ½GT�2wþ1; GT�2wþ2; . . . ; GT�w�; ½GT�wþ1; GT�wþ2; . . . ; GT�g ð1Þ The frame Fk ¼ ½GT�ðn�kþ1Þwþ1; . . . ; GT�ðn�kÞw� is hence the sub-graph of G containing all links observed in the k-th time interval of size w. As an example, consider a network observed up to the year T ¼ 2012 and a prediction window of length 2 (i.e., the task is to predict new links from 2013 and 2014). If we extract n ¼ 3 frames from the network, the following structure (N ¼fF1; F2; F 3g) is obtained N ¼f½G2007; G2008�; ½G2009; G2010�; ½G2011; G2012�g ð2Þ This structure can be used to analyze how the network evolved every 2 years since 2007. For this, we will introduce next section the concept of temporal events, that can be observed between sub- sequent frames. 3.2. Temporal events We define a temporal event as a specific activity between two nodes from a frame to its subsequent. A temporal event is the ac- tion that leads a dyad (a pair of nodes) from a state (connected or non-connected) to another. Events can be categorized into one of three mutually exclusive types: conservative, innovative or regres- sive, defined below. � Conservative A conservative event occurs when a relationship between two nodes is not dropped when the network evolves, that is, when two nodes share a link in a frame and this connection is pre- served in the subsequent one. For each pair of nodes ðu; vÞ, we define a reward Cðu; v; kÞ related to frame Fk , in order to take into account a conservative event during the transition from the ðk � 1Þth to the kth frame. Formally: Cðu; v; kÞ¼ c; if ðu; vÞ 2 Ek�1 \ Ek 0; otherwise � ð3Þ In the above definition, Ek�1 and Ek are the sets of edges observed in frames Fk�1 and Fk respectively. The constant c indicates the reward for conservative events, which should be a non-negative value since the strength of a tie between two nodes is preserved. 6654 P.R.S. Soares, R.B.C. Prudêncio / Expert Systems with Applications 40 (2013) 6652–6660 Author's personal copy � Innovative Innovative events represent the creation of a new link between two nodes on different frames. They happen when two nodes are not connected in a frame and a link is observed in the next frame. The Innovative reward Iðu; v; kÞ associated to a pair ðu; vÞ and a frame Fk is: Iðu; v; kÞ¼ i; if ðu; vÞ 2 Ek n Ek�1 0; otherwise � ð4Þ The constant i in the above equation indicates the reward for inno- vative events. Its value should be positive since the tie between two nodes is strengthened. � Regressive Regressive events are opposite to innovative ones. This kind of event represents the removal of an existing link between two nodes from a frame to its subsequent. The Regressive reward Rðu; v; kÞ is defined as: Rðu; v; kÞ¼ r; if ðu; vÞ 2 Ek�1 n Ek 0; otherwise � ð5Þ In this event, r should assume a non-positive value since the strength of the connection between the nodes tends to decrease. The values of the parameters c; i and r can be determined empir- ically by evaluating the link prediction performance in a validation set and choosing the best configuration of values considered. As it will be seen, in our experiments, the number of parameters to define was reduced to two by making the values of c and r propor- tional to i (in this case, i was set to 1 for the sake of simplicity). As it will be seen in the next subsection, given a pair of nodes, we will define as primary events the ones strictly related to both nodes in the pair, whereas secondary events happen in dyads com- posed by only one of the nodes in the pair and its neighbors. 3.3. Event-based score As previously said, many approaches for link prediction com- pute scores to pairs of nodes by deploying a chosen proximity mea- sure (see Section 2), aiming to determine how similar those nodes are and, consequently, how likely a connection between them will be formed in a close future. The proposed measure combines both: (1) the rewards associ- ated to primary events, which are the temporal events strictly re- lated to the pair of nodes under analysis; and (2) the rewards associated to secondary events, which are the temporal events ob- served in the nodes’ neighborhood. The proximity score associated to a given pair of nodes ðu; vÞ is defined as: scoreðu; vÞ¼ Xn k¼2 Pðu; v; kÞþ a:Sðu; v; kÞ ð6Þ Pðu; v; kÞ¼ Cðu; v; kÞþIðu; v; kÞþRðu; v; kÞ ð7Þ Sðu; v; kÞ¼ X y2CðuÞ\CðvÞ Pðu; y; kÞþ Pðy; v; kÞ ð8Þ In Eq. (7), Pðu; v; kÞ computes the reward of the event (conservative, innovative or regressive) for the pair of nodes ðu; vÞ observed in the transition from frame k � 1 to frame k. In Eq. (8), Sðu; v; kÞ indicates the aggregated reward of secondary events associated to the pair ðu; vÞ (i.e., the primary events observed in its common neighbor- hood). In this equation, CðxÞ is the set of neighbors of the node x in the network. In Eq. (6), the parameter a is an amortization factor that indi- cates how strong secondary events affect the tie between u and v. Pðu; v; kÞ and Sðu; v; kÞ associated to the first frame (i.e., k ¼ 1) are all null since this frame is not a result of any set of events, and hence for simplicity, they are not considered in Eq. (6). Fig. 1 illustrates the application of the proposed event-based score. Suppose that one wants to compute the proximity score for the pair ð1; 3Þ. The nodes 1 and 3 have node 2 as common neighbor in the network. Hence, the score of the connection between 1 and 3 is going to be calculated as a function of the events that occurred between themselves (primary events), plus the ones occurred in the dyads ð1; 2Þ and ð2; 3Þ (secondary events) along the network temporal structure N . From frame F 1 to frame F2 , it can be noticed that a conservative and a regressive event happened in the dyad ð1; 2Þ and ð2; 3Þ respectively. So far, no event occurred in the dyad ð1; 3Þ. Therefore, up to frame F 2, the partial score of the pair ð1; 3Þ is given by the combination of the amortized rewards associated with the second- ary events described above, that is, aðc þ rÞ. Looking at the next time step (frame F 3), a regressive event is associated to ð1; 2Þ, whereas an innovative one occurred in the pair ð2; 3Þ. In this step, there is also an innovative event related to the pair under analysis ð1; 3Þ; its presence must be counted too in the final score of the pair (its reward, however, will not be amor- tized since it is a primary event). Hence, the resultant score in this frame is given by aðr þ iÞþ i. By the definition of Eq. (6), the scores are cumulative in time. The final score of a pair of nodes is, then, the sum of all its partial scores along N . This, therefore, results in a final score of aðc þ 2r þ iÞþ i for the pair ð1; 3Þ. In the current work, we also propose an extension of the event- based score that increases the importance of more recent events observed for a pair of nodes. In Eq. (9), we propose a proximity score which considers a monotonically increasing function b : Z ! R to weight recent events more heavily than the old ones. In our work, we adopted a logarithmic function (see Eq. (10)), although other functions can be deployed. We highlight that Eq. (6) is a special case of Eq. (9) by setting bðkÞ¼ 1. scoreðu; vÞ¼ Xn k¼2 bðkÞ:½Pðu; v; kÞþ a:Sðu; v; kÞ� ð9Þ bðkÞ¼ logðkÞ ð10Þ 3.4. Prediction Finally, in order to perform the prediction of new links, we adopted an unsupervised strategy after the scores were computed for the pairs of nodes. It basically consists of ranking the pairs of nodes according to their scores, and then, selecting the top-ranked ones as the new links in the network, as explained in Section 2. Fig. 1. Example of event-based score computation using a network composed by a click of three nodes. N represents the temporal structure of the network divided in three frames (F1; F2 and F 3 ). The variables c; i and r are the rewards associated with conservative, innovative and regressive events respectively, and a is the amorti- zation factor used to deal with secondary events. P.R.S. Soares, R.B.C. Prudêncio / Expert Systems with Applications 40 (2013) 6652–6660 6655 Author's personal copy 4. Experiments and results In this section we describe the experiments performed to eval- uate the proposed approach as well as the obtained results. Ini- tially we describe the social network data used and the methodology of experiments (Section 4.1), followed by the base- line methods adopted for a comparative evaluation (Section 4.2). In Section 4.3, we present the initial results aimed to evaluate the proposed measure considering different configurations of re- wards for the temporal events. In Section 4.4, we evaluate the influence of the amortization factor a in the proposed measure. Fi- nally, in Section 4.5, we present the effect of using a weighting function for recent event as well as the comparative results with the baseline methods. 4.1. Data and settings In order to evaluate the performance of the proposed scores, data from four co-authorship networks were used in our experi- ments. We highlight that co-authorship networks are highly used in the literature to evaluate SNA techniques. Being a social struc- ture, this kind of network is very dynamic along time, what makes it a great source of data to be explored regarding the emergence of new connections. Besides, most of them are publicly available in digital environment (Barabasi et al., 2002; Newman, 2001). In a co-authorship network, a node represents an author and an edge indicates that two specific nodes have co-authored a paper. As seen in Section 3, in our work an edge also stores the time when the relationship was observed. More specifically, in the co-authorship networks adopted here, each edge stores the publication year of the co-authored paper. The network data used in our experiments were collected from the e-print arXiv,2 which maintains a large database of electronic scientific papers in several fields, such as mathematics, physics, astronomy, among others. In our experiments, the networks were built using collaborations from four distinct sub-areas, namely: as- tro-physics (astro-ph), condensed matter (cond-mat), high energy physics – lattice (hep-lat) and theoretical high energy physics (hep- th). Collaborations from 1993 to 2000 were used for astro-ph and cond-mat, and collaborations from 1992 to 2010 were used for hep- lat and hep-th. The summarized information about the adopted net- works are presented in Table 1. In our experiments, a prediction window of size one was adopted, and hence the aim of the analysis is to investigate how the networks evolved every year. By following the methodology described in Section 3.1, a temporal structure N was built for each network. For astro-ph and cond-mat, data from 1993 to 1999 were used to build N (with a total number of n ¼ 7 frames) and the snapshot from 2000 was used as the prediction frame. For hep-lat and hep-th, N was built with collaborations from 1992 to 2009 (with a total number of n ¼ 18 frames), whereas the prediction frame was composed by collaborations from 2010. As mentioned in Section 3.2, adequate values for the reward parameters can be estimated by evaluating the link prediction per- formance on a validation set. In our work, the last frame of N was adopted as the validation set in order to perform parameter selec- tion for each network: the frame of 1999 was used to validation for the astro-ph and cond-mat networks and the frame of 2009 was used to validation for hep-lat and hep-th. As usual in the link prediction field, our aim is to investigate the prediction of new links in a network. Therefore, for each network, the initial list of candidate pairs to be evaluated is formed by pairs that are not connected in the last frame of N . The final list is formed by picking (from the initial list) only the dyads that were directly or indirectly affected by some event during the network evolution. Table 2 shows the distribution of the candidate pairs of nodes regarding their class labels (connected or non-connected) both in the validation and the test sets. As it can be seen, the class distribution is highly imbalanced, which is a common problem related to the link prediction task. In order to minimize the effect of imbalanced data during evalua- tion, the Area Under ROC Curve (AUC) was adopted as the measure of performance. ROC curves relate the sensitivity (true positive rate) and specificity (true negative rate) of a classifier. The AUC has been traditionally used in imbalanced classification problems (Murata & Moriyasu, 2008; Wang et al., 2007) due to its higher robustness to class distribution anomalies (Fawcett, 2006). Finally, for more reliable results, we performed a bootstrap over the test set by taking 100 randomly stratified sub-samples and computing the AUC. The 10% of the best results and 10% of the worst results were discarded and the average AUC was computed. 4.2. Baseline methods In our experiments, the proposed measures were compared with different measures previously adopted in the literature of link prediction. Initially, four classical proximity scores based on neigh- borhood were considered: PA, CN, AA and JC. These are very popu- lar measures used in the state of the art of link prediction (Bringmann et al., 2010; Lichtenwalter et al., 2010). Each measure is explained below. The PA measure assumes that the probability of a future link between two nodes is proportional to their degrees. In a co-author- ship network, such probability is correlated to the product of the number of collaborators they have Barabasi et al. (2002). Hence, it is defined as: PAðu; vÞ¼ jjCðuÞjj � jjCðvÞjj ð11Þ The CN measure states that the bigger the number of neighbors two nodes share, the higher is their probability to form a link in the future (Newman, 2001). Formally, the measure is defined as: CNðu; vÞ¼ jjCðuÞ\ CðvÞjj ð12Þ The AA measure refines the CN by increasing the scores of pairs of nodes in which the neighbors in common possess less connections (Adamic & Adar, 2003). Its formal definition is: Table 1 Networks statistics. astro-ph cond-mat hep-lat hep-th Papers 19,077 20,664 9,367 38,569 Authors 16,978 18,070 4,718 17,887 Collaborations 140,157 56,731 32,309 58,855 Density (10�4 ) 9.7252 3.4746 29.0355 3.6792 Avg. degree 16.51 6.28 13.70 6.58 Table 2 Class distribution for the validation and test sets. astro-ph cond-mat hep-lat hep-th (a) Validation set No. of pairs 745,683 149,769 330,137 498,782 Connected (+) 7,797 1,737 1,099 1,232 Non-connected (�) 737,886 148,032 329,038 497,550 Proportion (+/�) 1.06% 1.17% 0.33% 0.25% (b) Test set No. of pairs 1,337.607 254,651 373,053 553,544 Connected (+) 9,791 2,665 608 1,626 Non-connected (�) 1,327.816 251,986 372,445 551,918 Proportion ((+/�)) 0.74% 1.06% 0.24% 0.29% 2 arXiv.org e-Print archive – Cornell University Library. 6656 P.R.S. Soares, R.B.C. Prudêncio / Expert Systems with Applications 40 (2013) 6652–6660 Author's personal copy AAðu; vÞ¼ X z2CðuÞ\CðvÞ 1 logðjjCðzÞjjÞ ð13Þ Finally, the JC assumes higher proximity values for pairs of nodes which share a higher proportion of common neighbors relative to to- tal number of neighbors they have Salton and McGill (1986). JCðu; vÞ¼ jjCðuÞ\ CðvÞjj jjCðuÞ[ CðvÞjj ð14Þ The above measures were statically applied to the whole graph G (i.e., in the current state of the network) without considering tem- poral information. In our work, we also performed experiments with the time series approach for link prediction (Potgieter et al., 2009; Soares & Prudêncio, 2012) described in Section 2. As our cur- rent work, this approach also considers temporal information dur- ing the link prediction. Initially the network is split into frames, using the same procedure described in Section 3.1. Given a chosen proximity measure, a time series is built for each pair of nodes by applying the measure in each network frame. A forecasting model is then used in order to predict the next value of the series. Such forecast value is defined as the final proximity score of the pairs of nodes. Compared to the proposed work, we can point out a diffi- culty related to the time series approach, which is the choice of a good forecasting model. In fact, in the experiments performed in Soares and Prudêncio (2012) using co-authorship networks, the choice of the forecasting model strongly affected the link prediction performance. In our work, we performed experiments adopting AA as the measure to build the time series and the linear regression (LR) model as the forecasting model. That was the best combination of proximity measure and forecasting model in the experiments presented in Soares and Prudêncio (2012). Finally, we highlight that similarly to the proposed measures, the baseline measures described in this section (both the static ones and the time series approach) were applied for unsupervised link prediction. 4.3. First-round experiments As described in Section 3.3, the event-based measure is poten- tially affected by the rewards associated to each of the three events: conservative, innovative and regressive. In the first round of experiments, we aimed: (1) to evaluate the effect of rewards’ values in the performance of the proposed measure; and (2) to ver- ify whether these values can be properly chosen using a validation set. We deployed the validation set to empirically evaluate and select the best combination of rewards in the link prediction task. In this round, as mentioned in Section 3.2, i was fixed to 1.0 and the other two parameters c and r varied proportionally to i : c ¼f0:0; 0:25; 0:5; 1:0; 2:0g and r ¼f�2:0;�1:0;�0:5;�0:25g. We consider here a fixed value for a, since our focus was to inves- tigate the adopted rewards’ values. The value of a was set to 0.05 through a preliminary and exploratory battery of experiments. Table 3 shows the AUC value obtained by each combination of the rewards in the validation set for all networks. The best four ob- served results for each network are presented in bold. As it can be seen, the most effective combinations of reward did not vary so much from one network to another. The best results were obtained by parameters (c and r) around the same region (lower central) in the tables for all networks. Besides, these results indicate that conservative and regressive rewards can balance each other. The best performance values were achieved when c 6 i, although some good results were also obtained when we consid- ered a higher weight (2.0) to the conservative reward. In our exper- iments, if regressive reward is roughly weighted, the event-based method did not perform well (this can be noticed by analyzing results obtained with r ¼�2:0 and �1:0). As one could expect, this shows that if the aim is to predict new links in a network, then the history of innovative and conservative events assumes a higher de- gree of importance in the prediction process. Table 4 shows the AUC value obtained by each combination of the rewards in the test set for all networks. The results observed in the test sets did not varied substantially from those ones ob- served in the validation set. The best configurations of parameters were also similar considering the validation and the test set (see Table 5). This supports the conclusions we made regarding rewards and indicates that they can be determined empirically using a val- idation set. 4.4. Second-round experiments In the second round of experiments, our aim was to investigate how the amortization factor a affects the performance of the event-based measure. As explained in Section 3.3, this parameter indicates the importance of the secondary events in relation to pri- mary events in our proposed measure. In this experiment, we eval- Table 3 AUCs obtained by different combinations of rewards on the validation sets. r=c 0.0 0.25 0.5 1.0 2.0 (a) astro-ph �2.0 0.3402 0.3552 0.3698 0.4074 0.4681 �1.0 0.6301 0.6657 0.6773 0.6919 0.6972 �0.5 0.7931 0.7998 0.8012 0.8018 0.7996 �0.25 0.7967 0.8021 0.8035 0.8041 0.8023 (b) cond-mat �2.0 0.3002 0.3087 0.3181 0.3530 0.4239 �1.0 0.5684 0.6202 0.6318 0.6483 0.6507 �0.5 0.7928 0.8083 0.8131 0.8128 0.8092 �0.25 0.7964 0.8086 0.8109 0.8123 0.8105 (c) hep-lat �2.0 0.3792 0.4553 0.5801 0.6849 0.7273 �1.0 0.8002 0.8330 0.8294 0.8206 0.8102 �0.5 0.8406 0.8476 0.8467 0.8424 0.8340 �0.25 0.8309 0.8374 0.8389 0.8378 0.8331 (d) hep-th �2.0 0.2610 0.2843 0.3276 0.4264 0.5223 �1.0 0.6668 0.7599 0.76554 0.7678 0.7584 �0.5 0.8501 0.8579 0.8555 0.8502 0.8409 �0.25 0.8433 0.8497 0.8491 0.8469 0.8411 Table 4 AUCs obtained for several combinations of c and r on the test sets. r=c 0.0 0.25 0.5 1.0 2.0 (a) astro-ph �2.0 0.3462 0.3684 0.3916 0.4401 0.5116 �1.0 0.6423 0.6781 0.6875 0.6979 0.7006 �0.5 0.7844 0.7911 0.7922 0.7916 0.7879 �0.25 0.7861 0.7919 0.7927 0.7929 0.7909 (b) cond-mat �2.0 0.3041 0.3109 0.3259 0.3743 0.4303 �1.0 0.5731 0.6325 0.6408 0.6544 0.6535 �0.5 0.8072 0.8154 0.8147 0.8122 0.8064 �0.25 0.8078 0.8126 0.8127 0.8123 0.8091 (c) hep-lat �2.0 0.2945 0.3618 0.4934 0.6431 0.7353 �1.0 0.7652 0.8480 0.8429 0.8403 0.8349 �0.5 0.8483 0.8573 0.8574 0.8557 0.8515 �0.25 0.8393 0.8478 0.8506 0.8527 0.8508 (d) hep-th �2.0 0.2258 0.2376 0.2706 0.3583 0.4634 �1.0 0.6624 0.7303 0.7331 0.7303 0.7220 �0.5 0.8605 0.8625 0.8587 0.8529 0.8441 �0.25 0.8555 0.8558 0.8541 0.8518 0.8466 P.R.S. Soares, R.B.C. Prudêncio / Expert Systems with Applications 40 (2013) 6652–6660 6657 Author's personal copy uated the proposed measure by adopting different values of a ranging from 0 to 1:5, with an increment of 0:05. We highlight that when a ¼ 0, secondary events are in fact not taken into account. For higher values (such as a ¼ 1:5) in turn, possibly an excessive importance is given to secondary events, which may harm the prediction performance. In this section, the rewards c; r and i were fixed according to the best configuration observed in the validation set in the previous round of experiments. The AUC values obtained as a varies are illustrated in Fig. 2. When a assumes a null value, the performance of the model is harmed because secondary events are not being taking into account in the prediction task. In this case, the measure will only be able to predict connections between pairs affected by primary events any time in the past, that is, pairs that were connected in at least one frame of N . This shows the effectiveness of secondary events in the prediction of new connections. The best AUC values were obtained by adopting a ¼ 0:05, which is a relatively small value considering the range of values evaluated in our experiments. When it approaches to higher values, the method performs worse since surrounding pairs are becoming as important as the main pair of nodes in analysis. Besides, the num- ber of secondary events is much bigger than the total amount of primary events (mainly if the network is dense). Hence, isolated secondary events should contribute with small values, which when aggregated, can outstand in the predictive process without, how- ever, mask the effects of the primary events. In this section, we suggest a simple heuristic to define the value of a based on the ratio between the total number of primary events and secondary events in N (see Table 6). As it can be seen, the AUC results obtained by adopting the heuristic value for a were similar to the best results observed in Fig. 2. 4.5. Last-round experiments In the previous experiments we evaluated the event-based score focusing on its parameters. In this section, we evaluated it aiming to determine if the age of the temporal events (event-based score with Eq. (9)) can bring any gain of performance to the predic- tion task. Aditionally, we compared the event-based scores to the baseline measures described in Section 4.2. Table 7 present the Table 5 Best rewards settings extracted from experiments on validation and test sets. astro-ph cond-mat hep-lat hep-th (a) Validation set c 1.0 0.5 0.25 0.25 i 1.0 1.0 1.0 1.0 r �0.25 �0.5 �0.5 �0.5 (b) Test set c 1.0 0.25 0.5 0.25 i 1.0 1.0 1.0 1.0 r �0.25 �0.5 �0.5 �0.5 Fig. 2. Plot of AUC versus the amortization factor a. 6658 P.R.S. Soares, R.B.C. Prudêncio / Expert Systems with Applications 40 (2013) 6652–6660 Author's personal copy AUC value obtained by the proposed scores and the baseline meth- ods. In this table, we refer the function assigned to b to distinguish between the scores generated by using the two aforementioned equations. The results showed that the methods that deploy temporal information (i.e., the event-based scores and the time series approach) in general outperformed the classical static approach based on PA, CN, AA and JC. These results confirm that temporal information is actually an important aspect to consider for link prediction. The event-based scores obtained the best results from all methods considered. The performance gain was higher for the cond-mat, hep-lat and hep-th networks, but less expressive for the as- tro-ph network, in which the event-based score was quite similar to AA alone. In our results, we observed a performance gain when the event-based score using b ¼ log was compared to the event-based score using b ¼ 1 in all networks considered, what supports our belief that recent events carry more information about the emer- gence of links.. Although, the performance gain was in general small in absolute values, it was statistically verified using a t-test with a 95% level of confidence. Finally, although both the time series approach and the event- based methods explore the temporal nature of the network, the proposed approach obtained the best comparative results. The per- formance gain was statistically verified with a 95% level of confi- dence. The proposed measure analyses the network by focusing on the connections as such, and for that, it was able to extract more relevant information to the prediction of ties that might emerge in a close future. The method based on time series, on the other hand, assumes that the topological measure history shows some ten- dency as the network evolves. Thus, it performs an indirect analy- sis of the connections in the network. In most cases, the time series approach outperformed the static approach, but it was not able to overcome the results achieved by an analysis strictly focused on the links, as done by the method based on events. 5. Conclusion In this work we introduced a new proximity measure for link prediction based on the notion of temporal events. Differently from the classical static approach that takes the network at time t in order to predict new connections at future time t0, our method takes temporal information into account by monitoring how the network evolved along time. Our method consists of computing scores by aggregating rewards of the temporal events observed at each step in the network evolution, which resulted in more rep- resentative scores to the dyads under analysis. Different experi- ments were performed in four co-authorship networks, initially aimed to evaluate the robustness of the proposed method concern- ing its parameters. The obtained results suggest that the method was quite stable concerning the definition of its parameters across the different networks adopted, although more experiments need to be performed in other domains of application. Also, we verified the importance of considering secondary events in the proposed measure. In the performed experiments, we also compared the event- based score to other baseline approaches proposed in the literature of link prediction. The obtained results showed that the event- based measure outperformed the baseline measures considered, indicating that the combination of interactions between nodes and the dynamics of common neighborhood can bring a gain in performance to the prediction task. Additionally, when the age of the temporal events was taken into account, the method per- formed slightly better, indicating that more recent events can bring more useful information to the prediction process. In our work, we adopted a log function in order to give a higher weight to more recent events. Nevertheless, other functions can be evaluated in the future (e.g., linear) to weight the recent events. Also, different functions can be associated to each category of temporal event if we assume for instance that recent events from one category are actually more important that the other events. Finally, we highlight that the event-based score described here was specific to undirected and unweighted networks. A possible future line of research is to extend the ideas discussed here to more complex categories of networks. For instance, a variety of temporal events could be defined if the direction of the links is considered (e.g., a conservative event is observed in a direction but a regres- sive event is observed in the other direction, thus indicating a non-reciprocal interaction between the nodes at a given moment). Also, in the case of weighted networks, temporal events could con- sider an increase or decrease of the weight between the nodes along time, instead of just considering the existence of the links. The rewards in such cases should be carefully defined in order to adequately reflect the important of the events in the link predic- tion process. References Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211–230. Amaral, L. A., Scala, A., Barthelemy, M., & Stanley, H. E. (2000). Classes of small- world networks. Proceedings of the National Academy of Sciences USA, 97(21), 11149–11152. Barabasi, A-L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., & Vicsek, T. (2002). Evolution of the social network of scientific collaborations. Physica A, 311(3–4), 590–614. Box, G., & Jenkins, G. (1970). Time series analysis: Forecasting and control. San Francisco: Holden-Day. Bringmann, B., Berlingerio, M., Bonchi, F., & Gionis, A. (2010). Learning and predicting the evolution of social networks. Intelligent Systems IEEE, 25(4), 26–35. de Sá, H. R., & Prudêncio, R. B. C. (2011). Supervised link prediction in weighted networks. In Proceedings of the 2011 international joint conference on neural networks (pp. 2281–2288). IEEE. Fawcett, T. (2006). An introduction to roc analysis. Pattern Recognition Letters, 27, 861–874. Getoor, L., & Diehl, C. P. (2005). Link mining: A survey. SIGKDD Explorations Newsletter, 7(2), 3–12. Hasan, M. A., Chaoji, V., Salem, S., Zaki, M. (2006). Link prediction using supervised learning. In Proceedings of SDM 06 workshop on link analysis counterterrorism and security. Hasan, M. A., & Zaki, M. J. (2011). A survey of link prediction in social networks. In Charu C. Aggarwal (Ed.), Social network data analytics (pp. 243–275). US: Springer. Homans, G. C. (1951). The human group. London: Routledge and Kegan. Zan Huan, (2006). Link prediction based on graph topology: The predictive value of the generalized clustering coefficient. In Proceedings of 12th ACM SIGKDD international conference on knowledge discovery and data mining. Table 6 Proportion between the number of primary events (P) and secondary events (S). astro-ph cond-mat hep-lat hep-th P/S 0.0116 0.0391 0.0089 0.0238 AUC with a ¼ P=S 0.7953 0.8160 0.8664 0.8627 Best AUC 0.7948 0.8269 0.8681 0.8830 Table 7 Performance of the methods (AUCs). astro-ph cond-mat hep-lat hep-th Event-based (b ¼ log) 0.7948 0.8269 0.8681 0.8830 Event-based (b ¼ 1) 0.7929 0.8147 0.8573 0.8625 Time series (AA + LR) 0.7584 0.7402 0.8537 0.7899 AA 0.7844 0.7346 0.8049 0.7513 CN 0.7391 0.6744 0.7598 0.6838 JC 0.7299 0.6114 0.7408 0.6481 PA 0.5043 0.5455 0.5668 0.4834 P.R.S. Soares, R.B.C. Prudêncio / Expert Systems with Applications 40 (2013) 6652–6660 6659 Author's personal copy Huang, Z., & Lin, Dennis K. J. (2009). The time-series link prediction problem with applications in communication surveillance. INFORMS Journal on Computing, 21(2), 286–303. Jiang, C., Coenen, F., & Zito, M. (2013). A survey of frequent subgraph mining algorithms. The Knowledge Engineering Review, 28(1), 75–105. Juszczyszyn, K., Musial, K., Budka, M. (2011). Link prediction based on subgraph evolution in dynamic social networks, In Privacy security risk and trust (PASSAR) 2011 IEEE third international conference on social computing (pp. 27–34). Katz, L. (1953). A new status index derived from sociometric analysis. Psychometrika, 18(1), 39–43. Liben-Nowell, D., Kleinberg, J. (2003). The link prediction problem for social networks. In Proceedings of the 2003 international conference on information and knowledge management (pp. 556–559). Lichtenwalter, R. N., Lussier, J. T., Chawla, N. V. (2010). New perspectives and methods in link prediction. In Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 243–252). Lu, L., & Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A, 390(6), 1150–1170. Murata, T., & Moriyasu, S. (2008). Link prediction based on structural properties of online social networks. New Generation Computing, 26(3), 245–257. Newman, M. E. J. (2001). Clustering and preferential attachment in growing networks. Physical Review E, 64. Newman, M. E. J. (2001). The structure of scientific collaboration networks. In Proceedings of the national academy of sciences of the United States of America (Vol. 98, pp. 404–409). Potgieter, A., April, K. A., Cooke, R. J. E., & Osunmakinde, I. O. (2009). Temporality in link prediction: Understanding social complexity. Journal of Emergence: Complexity and Organization, 11(1), 83–96. Qiu, B., He, Q., Yen, J. (2011). Evolution of node behavior in link prediction. In AAAI (pp. 1810–1811). Salton, G., & McGill, M. J. (1986). Introduction to modern information retrieval. McGraw-Hill Inc. Soares, P. R. S., Prudêncio, R. B. C. (2012). Time series based link prediction. In Proceedings of the 2012 international joint conference on neural networks (pp. 784–790). Tylenda, T., Angelova, R., Bedathur, S. (2009). Towards time-aware link prediction in evolving social networks. In Proceedings of the third workshop on social network mining and analysis, SNA-KDD ’09 (pp. 1–9). Wang, C., Satuluri, V., Parthasarathy, S. (2007). Local probabilistic models for link prediction. In Proceedings of the 2007 seventh IEEE international conference on data mining (pp. 322–331). Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications (structural analysis in the social sciences). Cambridge University Press. Xiang, E. W., (2008). A survey on link prediction models for social network data. PhD thesis, PhD Qualifying Exam, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology. 6660 P.R.S. Soares, R.B.C. Prudêncio / Expert Systems with Applications 40 (2013) 6652–6660