key: cord-0557336-hazgt0uh authors: cSimcsek, Aybike title: Estimating the Expected Influence Capacities of Nodes in Complex Networks under the Susceptible-Infectious-Recovered (SIR) Model date: 2021-03-03 journal: nan DOI: nan sha: f9463556ad208cfa87f215c798ad58695357a96c doc_id: 557336 cord_uid: hazgt0uh In recent years, epidemic modeling in complex networks has found many applications, including modeling of information or gossip spread in online social networks, modeling of malware spread in communication networks, and the most recent model of the COVID-19 pandemic. If the information disseminated is accurate, for example, maximizing its distribution is desirable, whereas if it is a rumor or a virus, its spread should be minimized. In this context, it is very important to identify super-spreaders that maximize or minimize propagation. Lately, studies for detecting super-spreaders have gained momentum. Most of the studies carried out aim to distinguish the influences of nodes under a specific propagation model (such as SIR) using network centrality measures and subsequently, to rank the nodes accordingly. However, in this study, we developed an algorithm that approximates the expected influence of nodes under the popular SIR model. By considering the behavior of the SIR model and only the shortest paths between nodes, the algorithm ranks the nodes according to this approximated value. Our developed algorithm is named the Expected Value Estimation (EVE). We compared the performance of EVE, using different SIR settings on real datasets, with that of many current well-known centrality measures. The experimental studies demonstrated that the solution quality (ranking capability) of EVE is superior to that of its competitors. Complex networks are highly suitable tools for modeling the real world. They have applications in many different fields such as natural sciences [1] , health [2] , cyber security [3] , economics [4] , and social networks [5] - [7] . Moreover, epidemic modeling in complex networks has attracted attention in recent years for its many practical benefits. The spread of a virus outbreak (such as can be estimated and precautions can be taken based on this [8] . By modeling the spread of gossip on the social network, the spread can be prevented [9] , [10] . Or, the desired information may reach the maximum number of people [11] . Whether you want to minimize the spread of gossip or maximize the spread of information, in any case, in order to do so, the set having the smallest number of the most influential individuals should be identified [12] , [13] . The influences of these individuals under certain epidemic models (such as SIR) should be calculated in order to identify the smallest number of the most influential individuals (i.e., key players). For this, it is necessary to model the propagation by selecting each node individually as the seed. Since propagation models are stochastic models, they must be repeated many times (e.g., about 10.000 iterations) and the average value taken. This operation requires very high processing power. On the other hand, researchers have noticed a correlation between the influence capacity of the nodes and network centrality measures, which have been used for a long time to determine the importance of nodes in complex networks. The basic expectation here is that as a centrality measure increases, the influence capacity increases, and as the centrality measure decreases, the influence capacity decreases. Since the calculation of centrality measures requires much less processing power than modeling the propagation thousands of times, studies have turned to this area. For this purpose, basic centrality measures such as Degree, Closeness, Betweenness [14] , Katz [15] , PageRank [16] were used and new centrality measures were developed. However, many of the measures developed only considered the local and global impacts of the nodes [17]- [21] or network communities [18] , [19] , [22] - [24] . Recently, another approach has been adopted that combines multiple centrality measures to develop new hybrid centrality measures [25] - [32] . However, many of these studies ignore the dynamics of the propagation model. In this study, we developed an algorithm that ranks nodes according to their influence capacity, taking into account the propagation behavior in the Susceptible-Infectious-Recovered (SIR) model. We named our developed algorithm the Expected Value Estimation (EVE) because it is based on approximating the expected influence of each node. It is worth mentioning here that the EVE algorithm does not calculate the importance of nodes contrary to the centrality measures. Instead, it calculates the approximate expected influence of the nodes under the SIR model and ranks the nodes accordingly. Under certain epidemic models (such as SIR), it is necessary to perform heavy Monte-Carlo simulations to distinguish the influence of nodes However, if the dynamics of the SIR propagation model are taken into account, the process can be simplified by ignoring some of the behaviors of this model. Thus, the approximate expected influence of nodes can be calculated and used to rank nodes (similar to a centrality measures). Generally speaking, in the SIR model, a node affects its neighbor nodes with a probability β. If not its direct neighbor, it is likely to affect its neighbors' neighbors with probability (β × β). If the network is a tree, the probability of a node influencing another l-hop away node can be calculated as since there can be only one path between each pair of nodes. Thus, the expected influence of a node can be calculated using its distance to all other reachable nodes by this node as the sum of values. However, real networks rarely exhibit tree structures. Hence, there can be many different paths of different lengths between any two nodes. It is also costly to use all paths to all other nodes to calculate the expected influence of a node. However, the probability of one node influencing another node decreases exponentially with the distance between them, although in practice, the value of is much less than 1. The natural consequence of this is ≫ +1 , where ∈ ℕ + . Based on this information, the expected probability of a node influencing another node can only be approximated using the shortest path between these two nodes. This is because the probability of influence calculated for routes other than the shortest path will be much lower. These calculated values can be used to distinguish the influence capacities of the nodes (similar to a centrality measure). Before discussing the details of EVE, it would be useful to give some preliminary information. Let = ( , ) be an undirected unweighted graph (network). Here, is the set of nodes (vertices), and is the set of edges (links). The Susceptible-Infectious-Recovered (SIR) model is a well-known model used for population-based epidemic modeling. In recent years, due to their popularity, SIR and SIR variations have been applied to network topologies [33] . In the SIR model, nodes are found in one of three states: Susceptible, Infected, and Recovered. The transition of nodes between states occurs according to certain probabilities. Susceptible nodes are more likely to be infected by neighbors who are already infected with probability . Infected nodes are also likely to go into a recovered state with probability . Initially, all other nodes are in a susceptible state, except for nodes that carry the disease (i.e., those that are infected). Starting from the nodes that are initially infected (called 'seed nodes'), the disease spreads over the network. After a certain period of time, there are no remaining infected nodes on the network and thus, the model is terminated. [34] : Let ( , ) and ( , ) be tuples of joint A and B ranking lists. If > and > or < and < , then the tuples are concordant. If > and < or < and > , then the tuples are discordant. If = or = , then the tuples are neither concordant nor discordant. Finally, tau is defined as in Equation (1). Here, is the number of concordant pairs, is the number of discordant pairs, and is the number of all combinations. Positive values indicate a positive correlation, and negative values indicate a negative correlation. [35] : Monotony is a metric of how well the centrality measure assigns each node to different rank levels. The ranking monotonicity (RM) will be '1' if all nodes are assigned to a different ranking level. If all nodes are assigned to the same ranking level, the RM will be '0'. Of course, for a centrality measure, the closer it is to RM 1, the better. The RM is calculated as follows: Here, n is the length of the L-ranking list and is the number of elements assigned to the same r rank. The working principle of EVE is based on expected value calculation. Therefore, it is useful to first look into the details of how a node infects its neighbor nodes in SIR and how this node recovers. This situation is shown for one iteration in Algorithm 1 [36] . The node u in the algorithm was initially selected as the infected node or one infected at any point in time. According to Algorithm 1, the node u infects its neighbors with probability β. After the node u infects its neighbors, this node is recovered with probability γ. If = 1, the node u has absolutely only one attempt to infect its neighbors since it will not be in the Infected state in the next iteration. If = 0.5, roughly speaking, the node u has two attempts to infect its neighbors since it will be in the Infected state in the next iteration with probability 0.5. If we generalize, the node u has at least 1 γ ⁄ attempts to infect its neighbors. Since the probability of the node u infecting its neighbors is β, the expected value of infecting a neighbor by node u would be 1⁄γ times β; that is, β⁄γ. Let us explain the situation in Figure 1 , where different topologies are shown. Notice that Figure 1 a, b, and c are trees. Therefore, there is only one path between all nodes. In Figure 1 -a, let the node u initially be selected as a seed (infected). The expected influence value (ev) of the node u becomes ev(u) = 1 + β⁄γ. Here, 1 has been added as node u is already infected. Figure 1 -b shows the expected influence value (ev) of the node as ev(u) = 1 + ⁄ + (probability of infecting ). In order to infect the node y, the node u must infect the node x. Next, the node x must infect the node y. The probability of these two events happening together can be obtained by multiplying the probabilities of their respective occurrence. Thus, the expected value of u infecting the node y is ( ⁄ × ⁄ ), i.e., The expected value of a node infecting another node decreases exponentially with the distance between them. If we generalize the ev calculation, we get Equation (3). Here, nn is the size of the set of node u's neighbors at h-hop distance. The situation is a little different in Figure 1 -d. The node y is both a 1-hop and a 2-hop neighbor of the node u. Therefore, the node u can infect the node y directly, as well as through the node x. Thus, the expected value of node u infecting the node y is the sum of these two possibilities, or 1 at most. value of node u infecting the node y would be 2. However, this value can be at most 1, since once a node is infected, it cannot be infected again. In large and complex networks, there can be many different paths having different lengths from one node to another. It is quite costly to consider all paths. Instead, only the shortest paths can be considered. Thus, as in Figure 1 -e, the (x, y) edge is ignored and the approximate ev can be calculated using Equation (3). However, instead of changing the structure of the graph, only neighbors with h-shortest path hop distance can be included when creating ℎ sets. Thus, it is guaranteed that ∩ = ∅ ; here ≠ ve , ∈ {1 … ℎ}. If we named as ℎ to the sets created by selecting only neighbors with h-shortest path hop distance, we can calculate the measure we call EVE as in Equation (4). Equation (4) does not take into account paths other than the shortest paths. In the literature, β is usually taken as very small (e.g., ≤0.1) and γ as large (e.g., = 1). The corollary of this is , where ∈ ℕ + . Thus, it can be considered reasonable to ignore paths other than the shortest paths. In practice, EVE can be calculated as in Algorithm 2. The EVE needs the shortest path information for all pairs. If the Floyd-Warshall algorithm is used, its time complexity is (| | 3 ). The algorithm has two nested "for" loops and each works × exactly with step | |. So, the time complexity is (| | 2 ). Finally, the resulting list is sorted. If an algorithm with a time complexity of log is used for this, the time complexity will be (| | log| |). As each these processes must follow one another, the time complexity is (| | 3 + | | 2 + | | log| |), that is, (| | 3 ). As a result, the time complexity of EVE is dominated by the shortest path calculation. To evaluate the performance of EVE, we determined five competitor centrality measures and experimented with different SIR settings over four real-world datasets. First, let us look at the competing centrality measures and datasets. DC (Degree Centrality) is calculated by dividing the degree of the node by the total number of nodes in the graph minus one [37] . EC (Eigenvector Centrality) is used to determine the importance of a node in the network. The basic logic of EC is that the more adjacent a node is to the important nodes, the more important it is [38] . CC (Closeness Centrality) is a measure of how close a node is to other nodes [39] . The closer the node is to other nodes, the larger the CC. Centrality) is the proportional information on how many of the shortest paths between all pairs are through a node [14] . GC (Gravitational Centrality) is a recent centrality measure inspired by Newton's gravitational formula [29] . Instead of the mass in the original formula, it uses the k-shell values of the nodes and instead of the distance, it uses the length of the shortest path between nodes. Its formula is as follows: Here, (•) is the length of the shortest path between nodes and ; Ν is the set of 3-hop neighbors of node . The GC was chosen as a competitor because it is a recent centrality measure that gives successful results. It is also similar to EVE because it is calculated using the shortest path length between nodes. We evaluated the performance of EVE and the competitor centrality measures from different angles. First, we looked at the Kendall ranking performances. Next, we evaluated the "rank index vs. SIR score" graphics created by the measures. We then compared their Monotonicity performances. Finally, we looked at how many of the nodes in the top 5% of the ranking lists created by the measures corresponded to the ranking lists created according to the SIR simulations. In the SIR simulations, we set each node as the only infected node in the network. We ended the simulations when there were no infected nodes left in the network. At the end of each simulation, we took the number of recovered nodes in the network as the influence of the node selected as the single infected node at the beginning of that simulation. We repeated the simulation for each node 1000 times and took the average of their influences as the final SIR score. For the simulations we used Python and NetworkX [44] . The ranking performances of EVE and the competitor centrality measures are shown in Figures 2-4 . Ranking performances were calculated using Definition 2, as the Kendall's tau ranking correlation coefficient. The ranking list created by the measure and the list created by SIR simulations were used in the calculation. We were inspired by [17] , [31] to use this type of graphic to compare the methods. The best results were given by EVE in six experiments, by GC in four experiments, and by EC in two experiments. In addition, the EVE tau values in all experiments are very close to 0.8 or higher. and Recovery rate: = . for all experiments. The graphics of the ranking indices created by the measures vs. the SIR scores are shown in Figures 5-7 . As the index increases (i.e., as the centrality decreases), the SIR score is expected to decrease. This is an indication that nodes have been assigned the correct rank level. We were inspired by [17] , [31] to use this type of graphic to compare the methods. In most experiments, it can be said that EVE created a more uniformly decreasing graphic than the other methods. for all experiments. The monotonicity values of the ranking lists created by EVE and the competitor centrality measures are shown in Tables 2. The values were calculated using Definition 3. Since the ranking lists produced by the centrality measures depends only on the network structure, their monotonicity values were calculated only once for each data set. Since the ranking list produced by EVE is dependent on SIR, its monotony values were calculated separately for different SIR settings. The monotonicity values calculated for EVE were 1 in seven experiments and very close to 1 in the other experiment. This means that EVE assigns a different rank to all nodes in seven experiments. Meanwhile, the EC, CC, and GC also yielded successful results. Finally, we examined how many of the nodes in the top 5% of the ranking lists created by the measure coincided with the nodes in the top 5% of the ranking list created according to the SIR simulations. The results are shown in Tables 3-5 . Nodes in the top rank levels formed by the measure are expected to be more influential nodes. Therefore, it is important that the nodes at the top of the list and those at the top of the ranking list created according to the SIR simulations are the same. We were inspired by [31] to use this type of graphic to compare the methods. According to the calculated values, EVE gave the best results in eight experiments. In this study, we proposed an approach that approximates the influences of nodes in complex networks under the SIR propagation model using the shortest paths between nodes and then applies this to rank the nodes. The EVE is similar to a centrality measure in that it is used for ranking nodes. However, EVE is not a centrality measure, but a metric specific to the SIR model. As a result of 12 simulations we made with four different real-world datasets and three different SIR settings, EVE performed better than state-of-the-art and well-known centrality measures. We compared EVE with well-known centrality measures as well as with a state-of-the-art measure such as Gravitational Centrality, which is successful and innovative method. The EVE demonstrated that he expected influences of nodes could be better distinguished by using the parameters of the propagation model and the shortest paths (without using the centrality measures of the nodes). The EVE is calculated using the shortest paths between nodes. This means that all other paths are ignored. In dense networks, there can be many different paths other than the shortest path between two nodes. Therefore, ignoring these paths increase the difference (error) between EVE and the actual expected influence. In future studies, we plan to develop approaches that produce more precise results without increasing the time complexity. Universal resilience patterns in complex networks Network medicine: a network-based approach to human disease Analyzing and Detecting Emerging Internet of Things Malware: A Graph-Based Approach A study of the spreading scheme for viral marketing based on a complex network model Comparisons of Karcı and Shannon entropies and their effects on centrality of social networks Predicting positive and negative links in online social networks Mobility network models of COVID-19 explain inequities and inform reopening A Novel Centrality of Influential Nodes Identification in Complex Networks Dynamics analysis of SIR epidemic model with correlation coefficients and clustering coefficient in networks A survey on influence maximization in a social network Maximizing the spread of influence through a social network Identifying sets of key players in a social network A Set of Measures of Centrality Based on Betweenness A new status index derived from sociometric analysis The PageRank Citation Ranking: Bringing Order to the Web Identifying influential nodes in complex networks based on global and local structure Ranking nodes in complex networks based on local structure and improving closeness centrality Centrality in modular networks A novel model to identify the influential nodes: Evidence theory centrality A novel measure of identifying influential nodes in complex networks A community-based approach to identifying influential spreaders Centrality in Complex Networks with Overlapping Community Structure Identification of influential nodes in social networks with community structure based on label propagation Convex combinations of centrality measures A Revisit to the Infection Source Identification Problem under Classical Graph Centrality Measures Efficient algorithms based on centrality measures for identification of top-K influential users in social networks Combined centrality measures for an improved characterization of influence spread in social networks Identifying influential spreaders in complex networks based on gravity formula Identifying influential spreaders in complex networks based on entropy weight method and gravity law Lexical Sorting Centrality to Distinguish Spreading Abilities of Nodes in Complex Networks under the Susceptible-Infectious-Recovered (SIR) Model Fast ranking nodes importance in complex networks based on LS-SVM method Simulating SIR processes on networks using weighted shortest paths A NEW MEASURE OF RANK CORRELATION Identifying and ranking influential spreaders in complex networks by neighborhood coreness NDlib: a python library to model and analyze diffusion processes over complex networks Power and Centrality: A Family of Measures The centrality index of a graph An Information Flow Model for Conflict and Fission in Small Groups The Network Data Repository with Interactive Graph Analytics and Visualization Self-similar Community Structure in a Network of Human Interactions Exploratory social network analysis with Pajek Exploring Network Structure, Dynamics, and Function using NetworkX