key: cord-0043144-1m7knvmh authors: Mohapatra, Dattatreya; Pal, Siddharth; De, Soham; Kumaraguru, Ponnurangam; Chakraborty, Tanmoy title: Modeling Citation Trajectories of Scientific Papers date: 2020-04-17 journal: Advances in Knowledge Discovery and Data Mining DOI: 10.1007/978-3-030-47436-2_47 sha: 378de60b4224a8422617a82ffc7510d47a608b84 doc_id: 43144 cord_uid: 1m7knvmh Several network growth models have been proposed in the literature that attempt to incorporate properties of citation networks. Generally, these models aim at retaining the degree distribution observed in real-world networks. In this work, we explore whether existing network growth models can realize the diversity in citation growth exhibited by individual papers – a new node-centric property observed recently in citation networks across multiple domains of research. We theoretically and empirically show that the network growth models which are solely based on degree and/or intrinsic fitness cannot realize certain temporal growth behaviors that are observed in real-world citation networks. To this end, we propose two new growth models that localize the influence of papers through an appropriate attachment mechanism. Experimental results on the real-world citation networks of Computer Science and Physics domains show that our proposed models can better explain the temporal behavior of citation networks than existing models. Over the past two decades, study of citation networks has drawn tremendous attention for various reasons [1] , such as for finding useful academic papers, understanding success of authors, papers and institutes, and decision making processes like promotion and fund disbursement. The study of complex networks has emerged as a field to explain nontrivial topological features that occur in a wide range of large networked systems. Citation network is one such example of a complex network, which captures citation relationships between paper sources or documents. A citation network is a directed and acyclic information network, with the documents being the nodes, and directed edges representing citations of one document by another, thereby capturing the flow of information or knowledge in a particular field. An important property of citation networks is the "in-degree distribution" of nodes. Several models have been proposed to illustrate this distribution [15] -but while these models aim at retaining the power-law degree distribution in synthetic networks, they fail to reproduce other important properties that real-world citation networks might possess. For instance, Ren et al. [16] pointed out that existing models underestimate the number of triangles and thus fail to model the high clustering in citation networks, which is closely related to network transitivity and the formation of communities. In a series of papers [5, 6] , Chakraborty et al. showed that the temporal growth of the in-degree of nodes (aka, citation trajectory) in citation networks can be grouped into five major patterns, and such patterns are prevalent across citation networks of different domains, e.g., Computer Science, Physics. Building on our previous work [14] , we first demonstrate that none of the established growth models can adequately realize citation trajectories as observed in the data. This immediately calls for the need to investigate more involved growth models. We address this issue by accounting for local and global influences exerted by individual nodes in a network. We introduce the concept of a 'location space' associated with the nodes in the network [3] , and propose two new preferential-attachment growth models based on this concept -Location-Based Model (LBM) and Location-Based Model with Gaussian active subspaces (LBM-G). LBM models both the local and global influences, with new nodes connecting preferentially based on the combined influence. LBM-G is an extension of LBM, which models regions of high activity that can be shifted periodically. We evaluate the proposed models on both the Computer Science and Physics citation networks. Experimental results show that LBM and LBM-G are indeed more accurate at realizing the citation trajectories of nodes compared to other network models. 1 Citation networks are growing networks that exhibit certain nontrivial statistical properties. In particular, they have been shown to possess a heavy-tailed powerlaw in-degree distribution [9, 15] . This was later incorporated by Barabasi and Albert [2] and others [10] in their growth models, that shed light on the concept of preferential attachment. The evolution of citation networks was also modeled by preferential attachment with a time dependent initial attractiveness [8] . Another class of growth models employ decay factors to model the temporal nature of certain node properties [7] . There was also a need to explore the effect of the aging factor on other global properties of the network, as exhibited by Zhu et al. [20] and Medo et al. [12] . Another growing body of research focused on the temporal growth of the indegree of nodes (citation trajectory) in citation networks. Motivated by Xie et al. [19] , Chakraborty et al. [5] studied a large citation network to identify various kinds of citation trajectories and unfolded many interesting causes responsible for these patterns [6] . We posit that a new network growth model is needed to explain these patterns, and our propositions in this work are a means to that end. Publication datasets of two domains (Computer Science and Physics) are used in this paper. Chakraborty et al. [5, 6] defined 'citation trajectory of a paper' as the (noncumulative) number of citations (normalized by maximum citation count at any year) per year the paper receives till the time of analysis. One can consider a citation trajectory as a time-ordered set of data points (integers). They observed that contrary to the general consensus that the shape of the citation trajectory of papers are the same [2, 9] , there are five different shapes of citation trajectories prevalent across different domains of citation network (see Fig. 1 for the toy examples of trajectories): (i) Early Risers (ER) with a single peak in citation count within the first five years (known as activation period) after publication; (ii) Frequent Risers (FR) with distinct multiple peaks; (iii) Late Risers (LR) with a single peak after five years of publication (but not in the last year); (iv) Steady Risers (SR) with monotonically increasing citation count; and (v) Others (OT) whose average numbers of citations per year are less than 1. The class of random growth models is a natural choice for analyzing citation networks owing to their growing nature. Network growth models define a sequence of random graphs {G t , t = 0, 1, . . .}, where G t = (V t , E t ), with V t and E t being the set of nodes and edges in G t , respectively. In a network growth model, we Figure 1 (b) defines some additional notations used in this paper. We want to investigate whether some popular classes of network growth models can exhibit the different citation trajectories mentioned in Sect. 4.1. To this end, we first look at three popular classes of network growth models. The BA model [2] is based on the principle of "preferential attachment", where new nodes connect preferentially to existing nodes with higher degree. The probability that a new node at time t+1 connects to node i ∈ V t is given by, . (1) Since new nodes have low initial degrees, owing to initial in-degree being zero, subsequent nodes that join the network will attach to older nodes with higher probability. Hence, it is evident that the BA model cannot simulate the initial growth in citations after publication, i.e., Early Risers, which is often observed in citation networks (See Table 2 ). This motivates us to incorporate the intrinsic 'attractiveness potential', or 'fitness', of a paper in the model. Section 5 gives theoretical justifications for the same. As argued previously, to model initial peak in the citation trajectory of papers, we associate a fitness value [4, 13, 17] with each node. We assume an independently and identically distributed (iid) sequence of random variables (rvs) {ξ, ξ i , i = 1, 2, . . .} randomly drawn from a power law distribution ξ, e.g., Pareto distribution, with ξ i denoting the fitness of node i. In the additive model the attachment probability of a node is directly proportional to the sum of its degree and fitness; while in the multiplicative model it is directly proportional to their product. The attachment probabilities that new nodes connect to node i ∈ V t at time t + 1 are given by Additive: As expected, the fitness-based models are able to achieve some amount of initial citation growth after the publication of a paper (See Table 2 ). Furthermore, the multiplicative model can achieve a non-zero fraction of Steady Risers category in the APS citation network, which is realized to a lesser extent in the BA model or the additive model. Section 5 presents theoretical justifications. The following theorem describes the change in attachment probabilities of nodes over time for the three growth models described in Sect. 4.3. We define Ξ t = i∈Vt ξ i and ψ t = i∈Vt ξ i D t (i), for t = 0, 1, . . .. For clarity in presentation, we assume that a single node enters at any time step t, and forms a connection with one node in the existing graph G t−1 . We label the incoming node by the time index of its entry to the network, i.e., V t = {0, 1, . . . , t} for t = 0, 1, . . .. All our results can be easily extended to more general scenarios where multiple nodes enter the network, and incoming nodes form multiple connections. (ii) Additive fitness (AF) model: (iii) Multiplicative fitness (MF) model: Proof. Fix t = 0, 1, . . ., and i in V t . The difference in the attachment probability of node i in the BA model between time t + 1 and t is given as Furthermore, by noting that when looking at the expected difference in attachment probability conditioned on the graph at time t − 1, S t is the only random variable in (5), we obtain and (2) follows. Similarly in the additive fitness model, the difference in the attachment probability of node i can be written as Taking expectation on both sides conditioned on G t−1 and ξ t leads to (3) . The difference in the attachment probability of node i can be written for the multiplicative model as follows where (7) follows by noting that ψ t = ψ t−1 + ξ St + ξ t , because only node S t gets a new edge from ξ t and its fitness degree product therefore only increases by its own fitness ξ St . Furthermore, we lower bound the expected change in visibility as follows and the result follows. Theorem 1 indicates that for both the BA model and the additive fitness model, the probability of attachment will reduce over time in expectation for all the nodes. However, for the multiplicative model the probability of attachment could increase over time if the node has sufficiently high initial fitness ξ i . This gives a strong theoretical justification for using multiplicative model for realizing Steady Risers, in that nodes with high initial fitness could possibly maintain or increase their attachment probability over time. However in Table 2 , we observe that the proportion of Steady Risers is negligible in comparison to the MAS data. This is probably because only a few nodes get the opportunity to be Steady Risers in the multiplicative model. We propose two models which introduce the concept of a 'location space' to model the local visibility of nodes in a network and capture different citation growth patterns, while keeping the global influence due to degree and fitness in a multiplicative fashion. This allows us to overcome the limitation of low number of Steady Risers in the multiplicative model by restricting the influence of nodes to their locality. Popular articles in scientific networks are prominent in their subfield of interest (say, Machine Learning in Computer Science domain), and do not exert the same amount of influence across other subfields. We overcome this limitation in the fitness models by restricting the region of influence for incoming nodes. The location-based growth model (LBM) proposed here captures the notion of 'local influence' of a node in the network. Each node is assigned a location vector drawn from a distribution over the location space L. This location vector serves as a representation of the subfield to which a research article belongs. Given a sequence of iid location vectors {χ, χ t , t = 0, 1, ...} and fitness vectors {ξ, ξ t , t = 0, 1, ...}, with the location vectors being uniformly distributed over the location space L, and the fitness vectors being Pareto distributed, the attachment rule is given as, where, d(·, ·) can be any distance metric, and γ is a decay factor governing how fast the attachment probability decays with distance in the location space. We use Euclidean distance metric d(·, ·) and report results for different values of γ. Algorithm 2 describes the exact formulation of the network growth using LBM. In LBM, sampling a node location over the entire space L implies a faulty scenario where new papers entering the network at the same time step seemingly belong to widely different subfields -whereas we generally observe heightened research activity in only a handful of subfields at a time. The growth model should take these regional spikes into account while assigning node locations and also incorporate their shifting nature. We introduce the concept of active subspaces which generates the locations for new papers entering the network. We realize these active subspaces through multi-dimensional Gaussian distributions over the location space L. For simplicity, in this paper we only assume one active subspace at every time step. The location vector of node t, χ t , is chosen as χ t ∼ N (μ t , σ 2 ), where μ t and σ 2 are the mean and variance of the Gaussian at time t. The Gaussian distribution is then updated either after entry of fixed number of nodes, or after certain number of times steps. For ease of exposition, we assume that single node enters at every time step. Under this assumption, both the update techniques are identical, which is afterwards relaxed in Sect. 7. After entry of S nodes or S time steps, the mean of the Gaussian distribution is updated as follows μ +1 ∼ N (μ , ρ 2 ), = kS and k = 1, 2, . . . (9) where, ρ 2 captures the variance in the shift of the active subspace. Once the location vectors are drawn, the attachment rule is identical to the LBM model. The pseudo-code of LBM-G is shown in Algorithm 4. Table 3 . Percentage of papers in different categories obtained from the simulated network of LBM for (a) MAS and (b) APS. We also vary γ and report the result as well as the similarity with the real data in terms of JSD. We adopt the experimental setup proposed by Chakraborty et al. [5] to identify the proportion of nodes belonging to each citation trajectory. We use (the square of) the Jensen-Shannon distance (JSD) [11] to measure the similarity between two distributions. Please refer to the full version of the paper where we discuss the reasons for choosing this metric as well. We report the values of JSD 2 with the results as it makes differentiating the effect of two sets of parameters easier. We analyze the network obtained from the LBM model and measure the proportion of papers in each category of citation trajectory. We also observe the effect of the scalar factor γ. The following regimes of γ are considered, given the number of nodes n: (a) constant scalar factor γ; (b) linear, γ = n; (c) sub-linear, γ = n 0.5 ; (d) logarithmic, γ = log n. Table 3 reports JSD between the paper category distribution of the LBM model and the real network. We notice that for both MAS and APS datasets, the best result (minimum JSD) is obtained with γ = log n. Therefore, we choose γ = log n as default for LBM. Along with S indicating the frequency at which LBM-G model shifts the active subspace, we have another hyper-parameter, σ, the standard deviation of each Gaussian distribution representing a subspace. We consider two different strategies of shifting the active subspace: in terms of months (Table 4) , and the number of nodes (Table 5) . We obtain the lowest JSD for S = 1 month and σ = 2.0, i.e., the active subspace is shifted every month and the new Gaussian subspace is chosen from a Gaussian distribution with standard deviation 2. However, the more interesting pattern to observe in Table 4 is that we consistently get good results (low JSD) with high σ values, while keeping the frequency of shifting the active subspace constant. This stands true for the APS dataset as well. Table 4 . Percentage of papers in different categories obtained from the simulated network of LBM-G for (a) MAS and (b) APS. We also vary S (in terms of months) and σ (standard deviation) and report the result as well as the similarity with the real data in terms of JSD. We can also vary S and σ to control the proportion of nodes belonging to each category at a coarser level. We observe the following patterns in Tables 4 and 5 : -The proportion of Late Risers has a direct correlation with σ. This is probably because Late Risers acquire a peak after the activation period. A high σ makes it more likely that the active subspace shifts further away from the node, thus delaying re-activation of the node's location. -The proportion of Early Risers is seen to have a positive correlation with S, in general. This can be explained by the effect S has on the activation period of a node. A higher S means the node's location stays activated for a longer period of time, thus giving it more opportunities to achieve a peak. -The proportion of Steady Risers has a negative correlation with S, in general. A node belonging to this category should not achieve a peak during the activation period. This means that the factor controlling the proportion of Early Risers can be used to control the proportion of Steady Risers as well. A lower S means the active subspace gets shifted frequently, thus not giving ample opportunity to form a very high peak. A higher S would instead keep the location activated for an extended period of time which may lead to higher peaks early on when the number of nodes in the network is low, and then lower peaks as the number of nodes increases. -The proportion of Frequent Risers should depend upon the frequency with which a certain subspace is re-activated. We observe that this proportion generally peaks for moderate values of S and σ. Table 5 . Percentage of papers in different categories obtained from the simulated network of LBM-G for (a) MAS and (b) APS. We also vary S (in terms of nodes) and σ (standard deviation) and report the result as well as the similarity with the real data in terms of JSD. The results reported throughout the paper are obtained by setting activation period to 5 years and peak threshold to 0.75 as default for categorization. We further run LBM and LBM-G (with the default parameter setting) and vary the two parameters associated with citation trajectory classification -activation period and peak threshold. Figure 2 hints upon the fact that the conclusions drawn throughout the paper remain invariant (p < 0.005) with the minor change in the parameters related to trajectory categorization. Please refer to the full version of the paper to read about this experiment in detail. In this paper, we proposed two models to explain an important characteristic of a citation network -the trajectory of citation growth. The models focus more on exploring the local neighborhood of node during edge formation, instead of looking at the network globally. This is important because edge formation in real networks is usually a local process. Experimental results showed significant improvement over the existing growth models on two real-world datasets in terms of realizing different citation trajectories of papers. A literature review on the state-of-the-art in patent analysis Emergence of scaling in random networks Models of social networks based on social distance attachment Scale-free networks from varying vertex intrinsic fitness On the categorization of scientific citation profiles in computer science Universal trajectories of scientific success Evolution of networks with aging of sites Characterizing and modeling citation dynamics Measuring preferential attachment in evolving networks Network growth by copying Divergence measures based on the shannon entropy Temporal effects in the growth of networks Fitness-based generative models for power-law networks Springer Optimization and Its Applications Visibility of nodes in network growth models Networks of scientific papers Modeling the clustering in citation networks Vertex intrinsic fitness: how to produce arbitrary scale-free networks An overview of microsoft academic service (MAS) and applications Modeling the citation network by network cosmology Effect of aging on network structure The full version of the paper is available at https://arxiv.org/abs/2002.06628. The project was partially supported by SERB (Ramanujan fellowship and ECR/2017/00l691) and the Infosys Centre of AI, IIIT Delhi, India. The authors would like to thank Dr. Ralucca Gera at Naval Postgraduate School, USA, for initial discussions and insights. This document does not contain technology or technical data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.