key: cord-0010907-ymhngjrt authors: Liu, Yang; Wang, Xi; Kurths, Jürgen title: Optimization of targeted node set in complex networks under percolation and selection date: 2018-07-23 journal: nan DOI: 10.1103/physreve.98.012313 sha: 86d99e5e3dbe5a7b4eee270446089da6484ac7b0 doc_id: 10907 cord_uid: ymhngjrt Most of the existing methods for the robustness and targeted immunization problems can be viewed as greedy strategies, which are quite efficient but readily induce a local optimization. In this paper, starting from a percolation perspective, we develop two strategies, the relationship-related (RR) strategy and the prediction relationship (PR) strategy, to avoid a local optimum only through the investigation of interrelationships among nodes. Meanwhile, RR combines the sum rule and the product rule from explosive percolation, and PR holds the assumption that nodes with high degree are usually more important than those with low degree. In this manner our methods have a better capability to collapse or protect a network. The simulations performed on a number of networks also demonstrate their effectiveness, especially on large real-world networks where RR fragments each of them into the same size of the giant component; however, RR needs only less than [Formula: see text] of the number of nodes which are necessary for the most excellent existing methods. There has recently been an enormous amount of interest focusing on the targeted immunization and robustness problems of network science [1] [2] [3] [4] [5] , like investigating the critical threshold of structural collapse if an intentional attack happens [6] , or probing the optimal targeted-immunized threshold if a virus is in possible transmission [7] . These problems appear in, but are not limited to, effectively preventing viruses in computer or population related networks [8, 9] , information transmission in social networks [10] [11] [12] , or the breakdown of some infrastructure networks [13, 14] . For a network, the solution to the critical or optimal threshold is mathematically equivalent to finding the minimum set of nodes which can fragment the network into a certain situation, e.g., the size of the giant component is less than a given value after the removal of the minimum set. To achieve this, numerous methods have been proposed in the last few years, consisting of random immunization [15] , acquaintance strategies [7, 16] , targeted methods [3, 4, [17] [18] [19] , etc. [20] [21] [22] [23] , ranging from the need of local information to the whole network demand. With respect to random immunization, the immunization nodes are randomly selected from a certain network-without any priority about them. Similarly, random selection is also applied in the acquaintance strategy, but only one of the neighbors of a certain node is chosen to be immunized [7] . In addition, the targeted method is a widely accepted approach which first identifies the importance of each node and then removes the nodes in descending order of * yangliu@pik-potsdam.de importance until the network reaches the immunized demand [3, 4, 18, 19] . Within networks, there are numerous relationships among nodes. Generally, high-degree nodes tend to connect to other high-degree nodes in assortatively mixed networks, while they mostly have low-degree neighbors in disassortative networks [24] . Moreover, a node with a low degree might play a critical role, whereas those with high degree might not be of significance comparatively (e.g., the betweenness centrality of nodes [25] ). In the course of immunization, some subinfluential nodes would become influential after a few nodes are removed, while some others might lose their importance instead [heuristic immunized strategies [3, 4, 18] , including the high adaptive degree centrality strategy, etc.]. All of such methods, e.g., the Collective Influence method (CI) [3] (better results always obtained with larger radius ), show that a better immunization strategy could be discovered when more interrelationships of nodes are considered. This may be also a good interpretation why the high adaptive betweenness centrality strategy (HAB) is significantly effective in most situations, as well as the belief propagation-guided decimation (BPD) method [4, 26] in artificial networks. But HAB has a limitation due to its high time complexity [O(n 2 m)] and BPD is not so effective in real-world networks because there are always many loops. Most of those methods can also be viewed as greedy strategies, i.e., they repeat the process that recalculates the importance of nodes in the remaining network and then remove the most influential one or a part of it. For an optimization problem, the greedy strategy is quite efficient but readily induces a local optimum. In addition, taking Fig. 1 (a) as an example, the removal of a node would affect the status (remove or not) of other nodes. These facts motivate us to use another approach: can the local optimization be effec- In this network almost all of the methods mentioned in this paper will remove v 1 (marked with "+ ') in the demand case of splitting this network into isolated ones, while the optimal removal set should be the nodes marked by "×" apparently. Now, assuming that v 2 is removed first, then most of these methods will easily find the optimal solution in the remaining network. But the same situation cannot be induced by removing v 3 . In other words, the removal of v 2 or v 3 will influence the status of v 1 (remove or not), and as a result it directly determines whether the optimal solution can be reached. (b) An example of RR under the sum rule and τ = 2. In this temporary network (κ = 16), we consider two assumed cases: (1) (v 3 ,v 4 ) then (v 1 ,v 5 ) and (2) (v 2 ,v 4 ), then (v 1 ,v 5 ), i.e., two rounds of selection. For the first choice between v 3 and v 4 in case 1, the occupied node is either v 3 or v 4 since ξ (v 3 ) = ξ (v 4 ) = 5. After this, node v 1 would be chosen because of 1 = ξ (v 1 ) < ξ(v 5 ) = 3 (might induce the optimal solution of q c ). In contrast, v 2 would be selected to be occupied at first [3 = ξ (v 2 ) < ξ(v 4 ) = 5], and then v 5 [4 = ξ (v 1 ) > ξ(v 5 ) = 3] in case 2 (might be associated with the optimization of F ). In this example, we can also find how the status of a node influences the status of other nodes, e.g., v 2 to v 1 and v 5 . (c, d) An example of the PR method. In (c), some low-degree nodes are chosen and occupied first. tively avoided by investigating the interrelationship among nodes? Here, also from a percolation perspective [3, 19, 27, 28] , we propose two strategies: the relationship-related (RR) method and the prediction relationship (PR) method, which are capable of achieving excellent performance compared to other existing strategies. The main idea of the developed strategies is to explore and utilize the interrelationship among the nodes. In addition, RR combines the sum rule and the product rule from explosive percolation [29] , while PR holds the assumption that nodes with high degree are usually more important than those with low degree. In this way, our approaches can achieve a better capability of avoiding of local optimum and obtain smaller thresholds than other methods. To demonstrate the effectiveness of the proposed strategies, we conduct numerous simulations on a number of networks. The results show that our methods have significant advantages over other strategies, especially on large real-world networks where RR can collapse each of them into the same size of the giant component with less than 90% nodes of CI, BPD, or the Explosive Immunization method (EI) [19] . Moreover, our methods might also be used for the feedback vertex set (FVS) problem [26, 30, 31] . We consider an undirected network composed of n = |N | nodes tied by m = |M| edges where N and M are the node set and the edge set, accordingly. Let S a be an arbitrary configuration (sequence) of N , namely, {S a (i),i ∈ [1,n]} ≡ N where S a (i) corresponds to a unique node of the network. Then the threshold q S a c regarding S a is defined to be q S a c := min in which is a given value and G(S a ; q) represents the probability that a node is part of the giant (largest) connected component in the remaining network after the removal of all nodes in {S a (i),i ∈ [1, n × q ]}, including the incidental edges. Denoting by F S a , the average size fraction of giant components of S a , the solution associated with the targeted immunization or robustness problem is to search the optimal sequence S θ , which satisfies where i,a ∈ [1,n!] mean all the configurations of N . Apparently, finding the optimal solution is NP-hard. Inspired by Ref. [29] , we develop RR method in a percolation process, i.e., change the process from the removal of the most influential node to the occupation of the least important node. In other words, we start the RR method with an arbitrary configuration S a,0 of the node set N and a nonoccupied network [or a given strategy, e.g., high-degree centrality strategy (HD)], and then reverse the order of S a,0 to be a new sequence S a,0 , satisfying where i + j = n + 1,∀i,j ∈ [1,n]. Let r be the proportion of possible candidates and τ ∈ Z + be the selection times, respectively. Denoting the number of occupied nodes with κ, we then obtain S a,1 based on S a,0 through the following procedures: (i) Each time randomly select one node v i from the nearest nonoccupied node set N u (κ): (5) (ii) Independently repeat the selection (i) τ times to form the candidate node set N c (κ), and then choose the node v c from N c (κ) which minimizes ξ (·) to be occupied (randomly choose one if there are several nodes with the same minimum): where ξ (v j ) is defined as the following two cases (respectively correspond to the sum rule and the product rule [29] ): in which G c i is the size of the component c i and C v j denotes the component set that node v j would connect in the temporary network consisting of all the occupied nodes {S a,0 (j )|j ∈ [1,κ]} and the related edges. (iii) Update S a,0 by swapping S a,0 (κ + 1) and v c , i.e., exchange the places of S a,0 (κ + 1) and v c in S a,0 . (iv) Repeat the processes (i)-(iii) until all nodes are occupied, and we will get a new sequence S a,1 by reversing S a,0 . Next, replace S a,1 with S a,0 based on Eq. (3), namely, replace S a,1 with S a,0 if is given and q 0 . In this manner, we further obtain S a,2 based on S a,1 as well as S a,T with other r and τ where T denotes the time step. An illustration of RR is shown in Fig. 1(b) . Now let us focus our attention on the parameters r and τ as well as the two kinds of selection strategies [Eq. (7)]. A large r indicates that the selection happens on a large range of possible candidates, which on the one hand can make RR converge quickly at the early stage, but on the other hand it will induce RR saturating and no longer improving after T reaches some value since τ r × n. The value of τ is the main contribution to the time consumption of RR. To overcome this, here we associate the r and τ with T (may have other choices): where r s and τ s are the initial values of r and τ , δ r is the decrease rate of r and δ τ denotes the increases rate of τ , respectively. With respect to the two kinds of selection strategies, our simulation results demonstrate that the sum rule is more efficient than the product rule in small networks but less efficient in large network. Hence, we combine them and use the following adaptive probability to determine which one is adopted in each T : where p sr is the selection probability of the sum rule, otherwise the product rule. π sr and π pr correspond to the number of positive replacements under the sum rule and the product rule, respectively. In other words, if the sum rule promotes a better result (smaller F or q c ), then π sr = π sr + 1, vice versa. In this paper, we initialize π sr and π pr with 1. The PR method is developed based on an assumption that high-degree nodes are normally more influential than those nodes with low degree, i.e., PR tries to keep the occupied components away from as many high-degree nodes as possible. To achieve this, we first identify each node based on the distribution of node degree: where p(k v j ) is the probability of nodes with degree k v j . Then, similar to RR, construct the ξ (·) function with H v z (11) in which u (v j ) denotes all of the v j 's nearest-unoccupied neighbors (here view v i as occupied node). An example of PR is shown in Figs. 1(c) and 1(d). Following Ref. [4] , we further develop RR and PR to obtain the optimal FVS of a given network, which can help RR and PR to obtain better q c than the direct calculation in model networks. Let FVS be a subset of N , after the removal of it there is no loop in the remaining network (N \ FVS). Denoting with n FVS the number of nodes in FVS, the goal of optimizing FVS is to minimize n FVS . How can we achieve this in RR and PR? Considering the candidate node set N c (κ) [see Sec. II A (ii)], we construct the subset N c FVS (κ) of it in the following way: where ψ(v j ) is defined as in which | o (v j ,c i )| is the number of occupied nodes that belong to the component c i as well as the nearest neighbors of v j . Then we rewrite Eq. (6) as where v c corresponds to the node chosen to be occupied. In addition, another strategy is adopted for the FVS problem: if ψ(v c ) = 1 (this means that there are two neighbors of v c in the same component, i.e., the selected occupied node v c will induce a loop), then we further exchange the places of v c and one of its two corresponding neighbors (randomly) after the swap process [see Sec. II A (iii)]. Finally, without a loss of generality, we replace S a,1 with S a,0 if S a,0 has smaller n FVS than S a,1 . Obviously, there is no loop in the temporary network (composing of {S a,0 (j )|j ∈ [1,κ]}) if all occupied nodes satisfy ψ(v c ) = 0 [Eq. (14) ] in the occupied process. In other words, the minimization of n FVS is equivalent to the maximization of κ under the constraint of ψ(v c ) = 0. In this section, if there is no special explanation, of CI [3] is fixed to 4, each result of EI [19] is obtained with K = 6 and 2000 candidates, and BPD [4] is conducted with x = 12. Note that the results of BPD are slightly different from the results in Ref. [4] , since we fix the "Degree threshold" with the "Degree of top percent" in the BPD code (Table II) . To validate the effectiveness of the proposed methods in more detail, here we test RR and PR by considering different optimization objectives, F and q c , respectively. In addition, both RR and PR are based on HD with r s = F HD , τ s = 10, δ r = 0.001 and δ τ = 0.01 for networks with n 10 4 , δ r = 0.01, and δ τ = 0.01 for networks with 10 4 < n 10 5 , δ r = 0.1, and δ τ = 0.1 for networks with 10 5 < n 10 6 , and δ r = 0.5 and δ τ = 0.5 for networks with n > 10 6 , accordingly. The threshold q c is assumed to be obtained with G(S; q) < 0.01. We first conduct our validation on a number of real-world networks from various fields: one Power Grid network [32, 33] (Power), three Collaboration networks [34] (including ca-GrQc, ca-AstroPh, and ca-CondMat), one Internet peer-to-peer network [34, 35] , Autonomous systems graphs [36] (including as-733 and as-Skitter), the Scottish cattle movements network [19] , two Citation networks [36, 37] (including hep-th and cit-HepTh), two Communication networks (including email-Enron [38, 39] and email-EuAll [34] ), one Location-based online social network [40] (loc-Gowalla), the Amazon product co-purchasing network [41] (com-Amazon), the Google web graph [39] (web-Google), and two Road networks [39] (including roadNet-PA and roadNet-TX). The choices of these networks consider both the density of edges [1] and the assortativity of degrees [24, 42] , which are associated with robustness of a network. Some basic information regarding these networks is given in Table I . Note that for all networks studied here, the directed edges are simply replaced with undirected edges, and self-loops and isolated nodes are entirely deleted. In Fig. 2 the proportion G(q) of the largest component versus the fraction q of removed nodes is plotted by comparing RR, CI, BPD, and EI on the CA-AstroPh network, the Cit-HepPh network, the TXroad network, and the as-Skitter network. In almost all the situations studied here, RR exhibits notable superiority of less nodes to be removed for same size of giant component compared to the other strategies. Further regarding certain metrics ( Fig. 3 and Table II) , RR also shows better threshold q c in most networks and represents minimal average giant fraction F in all cases compared to HD, CI, BPD, and EI, especially for the four largest networks where both F and q c of RR are significantly smaller than the other strategies, e.g., RR needs less than half of nodes of HD, CI, and BPD to split the two road networks into fragments with G(S; q) < 0.01. In addition, PR can also achieve smaller q c in 12/17 networks than HD, CI, BPD, and EI. We further evaluate the performance of the proposed strategies (both RR and PR) by focusing on artificial model networks [including Erdős-Rényi (ER) [45] and scale-free (SF) [33] networks]. Note that here RR is in the normal way (optimizing F ) and PR is to optimize FVS (following the idea of BPD). As illustrated in Fig. 4 , RR significantly outperforms CI of lower G(q) curves on both ER and SF networks. Considering the threshold q c on the ER networks, we respectively obtain q c = 0.1767 by BPD, q c = 0.1843 through EI (slightly larger, 0.0005, than the results in Ref. [19] ) and q c = 0.1809 with PR. Meanwhile, PR with q c = 0.0977 is closer to BPD with q c = 0.0965 compared to EI with q c = 0.0996 in the SF networks. Besides, the results of q c versus the average degree k are exhibited in Fig. 5 . Interestingly, when tied by K = 6, EI performs worse and worse with the increase of k . The reason why this happens is ascribed to the fact that k (eff) v i (used to measure the spreading ability of a node [19] ) is harder and harder to identify the nodes with similar degree as k rising, where (v i ) consists of all v i 's nearest neighbors, L v i is the number of leaves (nodes with degree 1) in (v i ) and In other words, more and more nodes have a degree larger than K when the network becomes dense. Hence, we also report the results of EI with K = k + 2 (EI 2 ) in Fig. 5 (but this adaptation is invalid for real-world networks). To summarize: considering the threshold, PR performs better than both CI and EI but slightly worse [(q PR c − q BPD c )/q PR c < 2.58% for all cases] than BPD in the model networks. In contrast to this, RR obtains a quite small F compared to other methods, e.g., (F CI − F RR )/F CI ≈ 8.25% in Fig. 4 (a) and 10.10% in Fig. 4(b) , respectively. The different performances of BPD in model and realworld networks arouse our interest in another question: how do the loops influence the effectiveness of BPD, since the belief propagation (BP) algorithm is actually sensitive to the existence of circles in a network and most of the real-world networks have a lot of loops (see Table I )? We still employ the paradigmatic ER and SF models to construct our basis networks. Then, for each network, the following strategies are used to enhance the clustering coefficients, i.e., increase the local loops. (i) Randomly choose one node v i and its two corresponding neighbors v j and v k subject to (v j ,v k ) = 0, which means that there is no edge between v j and v k , namely, (v i ,v j ) = (v i ,v k ) = 1 and v j = v k . (ii) In the same TABLE II. The threshold q c (G(S; q) < 0.01), the average giant fraction F and the size of the feedback vertex set n FVS of HD, CI a , BPD a , EI, PR, and RR on the 17 real-world networks. Here CI is with = 3 for the Email-EuAll network and = 2 for the as-Skitter network. Each result of EI, PR, and RR is obtained by averaging 20 independent realizations. The bold numbers are the minimal value of each objective among these methods for a same network. way to respectively choose one of the neighbors of v j and v k , assuming they are v jj and v kk satisfying (v j ,v jj ) = (v k ,v kk ) = 1, (v jj ,v kk ) = 0, and v jj = v kk . (iii) Cut (delete) the edges (v j ,v jj ) and (v k ,v kk ) and at the same time add two new edges (v j ,v k ) = (v jj ,v kk ) = 1. (iv) repeat (i)-(iii) until the network reaches our demand, i.e., a given clustering coefficient. In this manner, the clustering coefficients of these networks can be improved and, apparently, the degree distribution of them is kept constant. As illustrated in Fig. 6 , the fraction n FVS /n of PR rise more slowly than BPD with the increase of the clustering coefficients CC in both ER and SF networks, while the threshold q c of PR decreases more quickly than BPD. This may indicate that PR is more suitable than BPD for real-world networks. Therefore, we also show the performances of BPD, PR, and RR for the FVS problem in Table II where PR finds a smaller FVS than BPD in almost all the networks (16/17). Moreover, we consider the two largest networks, the TXroad network (with maximal degree 12) and the as-Skitter network (with maximal degree 35 455), to demonstrate the efficiency of the proposed methods. Since it is hard to analyze the computational complexity of RR and PR in detail, we here put them as well as CI and BPD (open-source codes written by either C or C++ program) in the same simulated environment and compare their time consumptions. As illustrated in Fig. 7 , both PR and RR get smaller thresholds than CI and BPD within a quite short time, in particular, RR takes only 3.6 s to obtain a better result than CI and BPD in the TXroad network. Note that the running time of CI and BPD reported here may be as a reference but not as a standard. Finally, the susceptible-infectious-recovery (SIR) epidemic spreading model [5, 18, 46] is used to investigate the spreading process of a virus on the email-Enron network and the loc-Gowalla network by comparing the CI, EI, and RR methods. For a given network under SIR simulation, its nodes belong to either the susceptible, infected, or recovered state. And before the start of the simulation, a part of nodes are previously identified and removed from the network based on a certain strategy. Then one random node is selected from the remaining network as the infected source and the others are to be susceptible. In each time step, the infected nodes infect their susceptible neighbors with the infection rate λ, and then they recover with rate η. The recovered nodes are removed from the network too. This process is repeated until there is no infected node in the network. The simulation results are shown in Fig. 8 where λ and η are fixed to 0.2 and 0.05, respectively. On both networks, RR has a significantly lower value (9.5 to 20.0 times) of recovered individuals than EI under the same immunized fraction q [Figs. 8(a) and 8(b) ]. Considering the final recovered fraction R f [Figs. 8(c) and 8(d)], RR also outperforms CI and EI in almost all situations. In this paper, two methods as effective strategies have been developed for the robustness and target immunization problems based on percolation transition. The proposed strategies choose the removed (immunized) fraction by repeatedly investigating and capturing the interrelationship among nodes. To evaluate the effectiveness of both proposed methods, we conduct numerous simulations on two types of model networks as well as 17 real-world networks from different fields. The results, especially most of the empirical networks, clearly illustrate that our strategy considerably outperforms the existing well-known strategies, like CI [3] and EI [19] . In addition, our strategies might open up a new path to investigate more effective solutions to the robustness and immunization problems as well as obtain the minimal feedback set [4] in network science. Proc. Natl. Acad. Sci. USA Proc. Natl. Acad. Sci. USA Proc. Natl. Acad. Sci. USA Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining CEAS 2004 -First Conference on Email and Anti-Spam Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining The authors would like to acknowledge the European Regional Development Fund (ERDF), the German Federal Ministry of Education and Research, and the Land Brandenburg for supporting this project by providing resources on the high-performance computer system at the Potsdam Institute for Climate Impact Research. Y.L. gratefully acknowledges support from a China Scholarship Council (CSC) scholarship and J.K. for the support of the Russian Science Foundation (Grant No. 16-12-10198).