key: cord-0010893-mr4v0hkn authors: Radicchi, Filippo; Castellano, Claudio title: Fundamental difference between superblockers and superspreaders in networks date: 2017-01-18 journal: Phys Rev E DOI: 10.1103/physreve.95.012318 sha: 85fd457a8703c0b90cf51d8d553954461c611ea3 doc_id: 10893 cord_uid: mr4v0hkn Two important problems regarding spreading phenomena in complex topologies are the optimal selection of node sets either to minimize or maximize the extent of outbreaks. Both problems are nontrivial when a small fraction of the nodes in the network can be used to achieve the desired goal. The minimization problem is equivalent to a structural optimization. The “superblockers,” i.e., the nodes that should be removed from the network to minimize the size of outbreaks, are those nodes that make connected components as small as possible. “Superspreaders” are instead the nodes such that, if chosen as initiators, they maximize the average size of outbreaks. The identity of superspreaders is expected to depend not just on the topology, but also on the specific dynamics considered. Recently, it has been conjectured that the two optimization problems might be equivalent, in the sense that superblockers act also as superspreaders. In spite of its potential groundbreaking importance, no empirical study has been performed to validate this conjecture. In this paper, we perform an extensive analysis over a large set of real-world networks to test the similarity between sets of superblockers and of superspreaders. We show that the two optimization problems are not equivalent: superblockers do not act as optimal spreaders. The interplay between structure and function is at the heart of the interest attracted by the study of complex networks in recent years. Processes mediated by disordered interaction patterns are affected by the topological properties of the underlying graph in nontrivial ways [1] [2] [3] . Spreading phenomena are among the most fundamental and studied types of dynamics occurring on networks [4] . In this context, a natural question, with implications for practical applications, is the following: given a network and a spreading dynamics on top of it, how can we identify the set of n "superspreaders," i.e., the n vertices such that, if the spreading process is initiated simultaneously by all of them, the average number of nodes reached by the spreading event is maximal? This problem is often indicated also as "influence maximization," in particular in computer science, where fundamental results have been derived [5, 6] . An equally interesting and important problem is the identification of the set of n "superblockers," i.e., the n vertices such that, if immunized, and thus effectively removed from the network, lead to the minimal average size of the outbreak. As spreading may occur only if contacts are present, the identification of superblockers is equivalent to the solution of the so-called optimal percolation problem [7] , i.e., the identification of the minimum set of nodes to be eliminated in order to destroy the giant component of the network. Superblockers effectively correspond to the nodes that, when removed, minimize the size of the largest connected component in the network. Solving the optimal percolation problem is nontrivial, and many interesting results have appeared in the past few months [8] [9] [10] [11] . * claudio.castellano@roma1.infn.it In their seminal work on the problem of optimal percolation [7] , Morone and Makse hinted a strong connection between the identification problems of superspreaders and superblockers. The paper effectively describes the problem of identifying superblockers, but it always refers to superblockers as they were optimal spreaders, suggesting that essentially the two sets coincide. The analogy between superblockers and superspreaders may sound plausible from some point of view: it is natural to expect that both superblockers and superspreaders will be found among the nodes with largest connectivity. On the other hand, a conspicuous difference between the two problems is that optimal percolation depends only on the topological structure, while influence maximization depends (at least in principle) on the type of spreading process considered and on the detailed value of the parameters describing it. In Ref. [7] , caveats about the distinction between the two problems are put forward, specifying that the mapping between influence maximization and optimal percolation is exact only for the linear threshold model with a very particular choice of the thresholds. A similar approach, based purely on topological information, has been recently used also for more general choices of the threshold [12] . For different types of spreading models, such as, for example, those belonging to the susceptible-infected-removed (SIR) class, methods relying on the mapping to the optimal percolation problem are not a priori granted to work. Nonetheless, the idea that superspreaders and superblockers are equivalent in arbitrary spreading models has been rapidly adopted without further scrutiny [13] [14] [15] [16] [17] [18] [19] [20] [21] (Ref. [8] being an exception); this calls for a deeper and more careful investigation. In this paper, we perform a critical analysis of the conjectured coincidence between superblockers and superspreaders when the spreading process is described by the independent cascade model (ICM), a very simple dynamics belonging to the same class of SIR-like models for epidemic spreading. By applying algorithms to determine independently sets of optimal blockers and sets of optimal spreaders for a large collection of real-world topologies, we are able to show that in general they are very different. Moreover, we clarify that the identity of superspreaders strongly depends on the only parameter of the ICM model: characterizing optimal spreaders based on purely topological network properties (with no reference to the specific spreading dynamics) is thus an impossible task. The set of superblockers is defined as the minimal set of vertices such that their removal leaves no extensive component in the network. The identification of the superblockers is equivalent to optimizing a percolation process [7] . After its formalization, several different heuristic strategies have been introduced to perform this optimization task. Morone and Makse [7] proposed a greedy algorithm based on a quantity, collective influence (CI), to rank vertices according to their blocking power. Later on, Clusella et al. [8] modified the algorithm for explosive percolation [22, 23] to identify optimal blockers. Other nongreedy approaches are based on belief-propagation methods [9, 10] . Very recently an iterative method based on the exploitation of the (k=2)-core structure of the network has been shown to perform well while being computationally very efficient [11] . In our work, we use the first two methods (collective influence and Clusella et al.) to identify superblockers. In particular, we apply the CI 3 method, where the collective influence of a node is computed by summing over nodes at the frontier of balls of radius = 3 (see Ref. [7] for details). We use also a simplified version of the algorithm by Clusella et al., where node scores are not computed iteratively but they are set equal to their degree. This modification of the algorithm by Clusella et al. reduces slightly its performance, but it allows us to treat all networks in the same manner, without the need to determine additional ad hoc parameter values for every specific network. We remark that the CI and Clusella et al. algorithms provide only suboptimal solutions to the problem of identifying superblockers. However, the solutions they provide are sufficiently close to the optimum, so that we do not expect substantial variations if other, possibly more effective, algorithms for the identification of superblockers are used. To be more specific, all algorithms devised to obtain a solution of the optimal percolation problem identify the minimal set of superblockers able to destroy the giant component of the graph. The distinction between the extensive (giant) component of a network and subextensive components is clear cut only in the limit of infinite size. For finite networks such as those we consider here, the distinction is blurred and somehow arbitrary. In practice, in our study we adopt the same convention put forward by Clusella et al. [8] and consider a component to be extensive if its size is larger than √ N , where N is the overall number of network vertices. In general, the methods for the optimal percolation problem provide a set of a n (x) c vertices, n (x) c being a specific value depending on the method x considered. However, the goal of our analysis is to test whether superblockers are also optimal spreaders for a generic set size n. We will therefore use the methods by Morone and Makse, and by Clusella et al. to assign nodes with a rank ranging from 1 to N . In one case, the order of the nodes will be established as the inverse order in which they are added in the algorithm by Clusella et al. In the other case instead, it will coincide with the order in which nodes are removed from the network according to the collective influence score. We will measure the agreement, as a function of ρ = n/N , between the set S (x) of best n superblockers found by methods devised to optimally destroy a network and the set S (C) of the best n superspreaders identified by a method specifically deployed for their identification (see below). For reference, we will consider also two alternative rankings: a completely random one (dubbed 'random" in the following) and a ranking based on node degrees (with random ordering in the case of tie), denoted as 'degree." As spreading dynamics on networks, we consider the independent cascade model (ICM) [24] , in its simplest, unweighted version. This model is commonly considered in studies of influence maximization by computer scientists. One starts from a set S (of size n) of initially activated nodes at time t = 0. All other nodes are instead initially set as inactive. At each discrete time step t, two rules are applied in sequence: (i) each activated node i contacts all its neighbors j and, with an independent probability p, tries to activate each of those nodes that have been never activated during previous stages of the dynamics; (ii) all nodes that tried to activate their neighbors at step (i) become inactive, and they cannot be activated again in subsequent stages of the dynamics. The process is iterated until no more active nodes are present. This dynamics is a parallel version of the common SIR model [4] for epidemic spreading, with the time to recover fixed deterministically to 1. The identification of superspreaders (or influence maximization) in networks has attracted a huge interest in the past 15 years, since its formalization by Domingos and Richardson [5] . In a nutshell, the problem is the following: given a network of size N and a spreading dynamics on top of it, a set S of initially active nodes generates a cascade (outbreak) of average size R(S). We will denote R(S) also as "spreading power" of set S. Influence maximization aims at identifying, among all subsets of size n, the subset S * for which R(S) is maximal. The seminal paper by Kempe et al. [6] has shown that the influence maximization problem is computationally hard (NP-complete). However, the same paper provides, for the broad class of submodular dynamics (including the ICM), a greedy algorithm able to find a suboptimal solution, provably within 63% of the optimum [6] . More precisely, by adding at each time step to S the node that maximizes the marginal increment of R, one is guaranteed that R(S) (1 − 1/e)R(S * ), where e is the base of the natural logarithm. Many other works have followed, improving the poor computational efficiency of Kempe's greedy algorithm, while still preserving the original performance bounds [25, 26] . Nowadays it is possible to determine influence maximizers for networks with billions of nodes [27] . More recently, the statistical physics community has started to attack the problem with its tools and concepts [28, 29] . We are not crucially interested in computational efficiency as networks with order 10 4 nodes are sufficient for our purposes. Therefore, to identify superspreaders in the ICM we use the greedy algorithm version introduced by Chen et al. [26] , which is based on the well-known mapping between SIR dynamics and random percolation [30] . As for the case of superblockers identification, the algorithm we use for superspreaders identification does not determine the actual optimum, but just a suboptimal solution. Also in this case, we reasonably expect this approximation to have a very limited impact on the results. The outcome of Chen's algorithm is a ranking of all network nodes with an associated spreading power R(S (C) ,ρ): this value means that the set of superspreaders of size n = ρN is made by all nodes ranked from 1 to n and that the average number of nodes reached by a cascade initiated by them is R(S (C) ,ρ). Finally, it is important to remark that we are interested here in finding optimal multiple spreaders, i.e., sets of vertices that maximize the extent of the spreading process when seeded simultaneously in all of them. A similar but distinct problem is the search for optimal single spreaders, i.e., the nodes that are most influential when the process is initiated only in one node [31, 32] . The two problems are somehow related, but in a nontrivial manner: good single influencers may share large parts of their influence zone, so that seeding the outbreak in all of them at the same time leads to a cascade only slightly larger than those started by each of them separately. Each identification algorithm of superblockers provides a different ranking of all nodes in the network. For each ranking x, by means of ICM numerical simulations repeated 10 000 times, we compute the spreading power R(S (x) ,ρ) for any size ρN of the seed set S (x) . There are two possible ways to compare the sets of superblockers and superspreaders. The first possibility is to compare the identity of the individual nodes. Are the vertices identified as superblockers also those identified by Chen's algorithm as superspreaders? To answer this question, we consider the Jaccard index (or similarity) among the two sets S (x) and S (C) . This quantity is defined as where |A| stands for the number of elements in the set A. Clearly, if the two sets S (x) and S (C) coincide their similarity goes to 1, while it vanishes if they have null intersection. The second possibility is to compare not the identity but only the spreading power of the two different sets. Indeed, it is in principle possible that the set of superblockers does not coincide with the set of the best spreaders, yet it has comparable spreading power. In such a case one would conclude that the search for the best blockers effectively uncovers a set of almost optimal spreaders. To compare the spreading power of superblockers and of superspreaders, we consider the ratio R(S (x) ,ρ)/R(S (C) ,ρ). A value of this quantity equal to 1 indicates that the best blockers are also the best spreaders in the network, while small values show that blockers are not good spreaders. In the evaluation of these comparisons, one must always keep in mind that in the limit ρ → 1 all sets unavoidably coincide; hence, the quantities defined above tend to 1. As substrate of the optimal percolation and of the ICM spreading process we consider 51 real-world networks of very diverse origin, size, and topological features [33] . For each of them we compute the critical value p c separating the region of the phase-diagram where outbreaks are subextensive (p < p c ) from the supercritical phase (p > p c ) where outbreaks reach a finite fraction of the whole network. The value of p c is determined as the position of the maximum of the susceptibility s 2 / s 2 (where s n is the nth moment of the outbreak size distribution computed for random initial single spreaders) [34] . p c values for each network are reported in Ref. [33] . We start by comparing superblockers with superspreaders determined when p = p c . In Fig. 1 we plot, as a function of ρ, the value of the Jaccard similarity J (x) between the set of superspreaders determined by Chen's greedy algorithm and sets of superblockers determined using collective influence, Clusella et al. method, and, as a reference, the degree method. It becomes immediately clear that there is a huge variability among the different cases. However, by looking at the average value of J (x) (depicted in blue), two conclusions can be drawn. First, there are in general rather few superblockers nodes that belong also to the set of superspreaders; second, spreading power is more correlated to degree than to the blocking ability. One may wonder whether the results of Fig. 1 are due to the lack of accuracy of the method based on collective influence and the algorithm by Clusella et al. to establish a precise rank for superblockers. Both these methods are, in fact, devised to optimally destroy a network by finding the minimal set of size n (x) c = ρ (x) c N whose removal leads to the disappearance of the giant component in the graph, but these nodes are not chosen in a special order. We therefore extract from Fig. 1 the values of the Jaccard similarity corresponding to ρ (x) c and plot them in Figs. 2(a) and 2(b). We note that the similarity between superspreaders and superblockers is still very low, except for networks with high values of ρ (x) c (for ρ (x) c → 1 the two sets obviously tends to coincide). Figures 2(a) and 2(b) provide strong evidence that sets of superspreaders and superblockers are very different. Nevertheless, one could hypothesize that, even if superblockers are not the very best spreaders, they still are very good spreaders. For this reason, in Figs. 2(c) and 2(d) we plot the ratio R (x) /R (C) of the spreading power of blockers to the optimal spreading power obtained using Chen's algorithm. It turns out that the sets of blockers identified using both Clusella and CI methods are far from being optimal spreaders: their performance is often even worse than the one resulting by randomly selecting the same fraction ρ (x) c of seeds. Similar results are confirmed in Fig. 3 , where we consider the ratio R (x) /R (C) for arbitrary values of ρ. Superblockers are never good spreaders. Ranking nodes based on their degree is generally a much better strategy than ranking nodes using collective influence scores or the Clusella c , topological methods for the identification of superblockers never exceed the performance of random selection. The results displayed above are obtained when the independent cascade model for spreading is at criticality. We expect the difference between good and bad spreaders to be maximal for p = p c . We have repeated the same analysis for other values of p, both well below the critical point (p = p c /2) and well into the supercritical phase (p = 2p c ). The results, reported in Ref. [33] , confirm that in the whole phase-diagram the nodes that keep the network together (blockers) have no special spreading capability. The conjecture that superspreaders and superblockers are essentially the same nodes in a network rests implicitly on the assumption that the identity of best spreaders does not depend on the parameter p. We test this hypothesis in Fig. 4 , where we plot, as a function of ρ, the Jaccard distance among sets of optimal spreaders for sub-, super-, and critical values of p. Interestingly, the sets of optimal spreaders for subcritical and critical evolution are quite similar, while they are very different from the optimal set of spreaders in the supercritical regime. Whether a set of nodes has large spreading power crucially depends on p. Any attempt to relate sets of optimal spreaders to sets determined only by topology is ill-fated. In this paper we have shown that, for the independent cascade model in a network, superspreaders and superblockers are two distinct concepts, with no direct practical connec-tion. More in detail, our results indicate that the nodes whose removal leads to the breakdown of the topology into nonextensive components do not coincide with the best nodes for seeding a spreading process. Even the plain degree centrality identifies better spreaders than the methods aimed at identifying superblockers. In addition, as the identity of the optimal spreaders is strongly dependent on the parameter that regulates the dynamics, attempts to identify sets of superspreaders based only on topological properties without reference to the details of the spreading dynamics are bound to fail. With the benefit of hindsight these results appear rather easy to be anticipated: the choice of optimal seeds depends on the spreading dynamics (and its parameters) while optimal blocking does not. As most recent papers in the field have implicitly assumed the validity of this conjecture [13] [14] [15] [16] [17] [18] [19] [20] [21] , we believe that a detailed verification was in order. The minimization and the maximization of the extent of spreading processes mediated by complex topologies are both exciting examples of the nontrivial interplay between structure and function. They are deservedly attracting a huge interest in statistical physics, computer science, and other communities. The interest on these issues is by no means reduced by the awareness that they are fundamentally different problems. Networks: An Introduction Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01 Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03 Proc. Natl. Acad. Sci. USA Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07 Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09