key: cord-0005246-l676wo9t authors: Gao, Chao; Liu, Jiming; Zhong, Ning title: Network immunization and virus propagation in email networks: experimental evaluation and analysis date: 2010-07-14 journal: Knowl Inf Syst DOI: 10.1007/s10115-010-0321-0 sha: 445fa0ea39576aac8a7a809e93474a8bd62d8b33 doc_id: 5246 cord_uid: l676wo9t Network immunization strategies have emerged as possible solutions to the challenges of virus propagation. In this paper, an existing interactive model is introduced and then improved in order to better characterize the way a virus spreads in email networks with different topologies. The model is used to demonstrate the effects of a number of key factors, notably nodes’ degree and betweenness. Experiments are then performed to examine how the structure of a network and human dynamics affects virus propagation. The experimental results have revealed that a virus spreads in two distinct phases and shown that the most efficient immunization strategy is the node-betweenness strategy. Moreover, those results have also explained why old virus can survive in networks nowadays from the aspects of human dynamics. the Internet, the scientific collaboration network and the social network [15, 32] . In these networks, nodes denote individuals (e.g. computers, Web pages, email-boxes, people, or species) and edges represent the connections between individuals (e.g. network links, hyperlinks, relationships between two people or species) [26] . There are many research topics related to network-like environments [23, 34, 46] . One interesting and challenging subject is how to control virus propagation in physical networks (e.g. trojan viruses) and virtual networks (e.g. email worms) [26, 30, 37] . Currently, one of the most popular methods is network immunization where some nodes in a network are immunized (protected) so that they can not be infected by a virus or a worm. After immunizing the same percentages of nodes in a network, the best strategy can minimize the final number of infected nodes. Valid propagation models can be used in complex networks to predict potential weaknesses of a global network infrastructure against worm attacks [40] and help researchers understand the mechanisms of new virus attacks and/or new spreading. At the same time, reliable models provide test-beds for developing or evaluating new and/or improved security strategies for restraining virus propagation [48] . Researchers can use reliable models to design effective immunization strategies which can prevent and control virus propagation not only in computer networks (e.g. worms) but also in social networks (e.g. SARS, H1N1, and rumors). Today, more and more researchers from statistical physics, mathematics, computer science, and epidemiology are studying virus propagation and immunization strategies. For example, computer scientists focus on algorithms and the computational complexities of strategies, i.e. how to quickly search a short path from one "seed" node to a targeted node just based on local information, and then effectively and efficiently restrain virus propagation [42] . Epidemiologists focus on the combined effects of local clustering and global contacts on virus propagation [5] . Generally speaking, there are two major issues concerning virus propagation: 1. How to efficiently restrain virus propagation? 2. How to accurately model the process of virus propagation in complex networks? In order to solve these problems, the main work in this paper is to (1) systematically compare and analyze representative network immunization strategies in an interactive email propagation model, (2) uncover what the dominant factors are in virus propagation and immunization strategies, and (3) improve the predictive accuracy of propagation models through using research from human dynamics. The remainder of this paper is organized as follows: Sect. 2 surveys some well-known network immunization strategies and existing propagation models. Section 3 presents the key research problems in this paper. Section 4 describes the experiments which are performed to compare different immunization strategies with the measurements of the immunization efficiency, the cost and the robustness in both synthetic networks (including a synthetic community-based network) and two real email networks (the Enron and a university email network), and analyze the effects of network structures and human dynamics on virus propagation. Section 5 concludes the paper. In this section, several popular immunization strategies and typical propagation models are reviewed. An interactive email propagation model is then formulated in order to evaluate different immunization strategies and analyze the factors that influence virus propagation. Network immunization is one of the well-known methods to effectively and efficiently restrain virus propagation. It cuts epidemic paths through immunizing (injecting vaccines or patching programs) a set of nodes from a network following some well-defined rules. The immunized nodes, in most published research, are all based on node degrees that reflect the importance of a node in a network, to a certain extent. In this paper, the influence of other properties of a node (i.e. betweenness) on immunization strategies will be observed. Pastor-Satorras and Vespignani have studied the critical values in both random and targeted immunization [39] . The random immunization strategy treats all nodes equally. In a largescale-free network, the immunization critical value is g c → 1. Simulation results show that 80% of nodes need to be immunized in order to recover the epidemic threshold. Dezso and Barabasi have proposed a new immunization strategy, named as the targeted immunization [9] , which takes the actual topology of a real-world network into consideration. The distributions of node degrees in scale-free networks are extremely heterogeneous. A few nodes have high degrees, while lots of nodes have low degrees. The targeted immunization strategy aims to immunize the most connected nodes in order to cut epidemic paths through which most susceptible nodes may be infected. For a BA network [2] , the critical value of the targeted immunization strategy is g c ∼ e −2 mλ . This formula shows that it is always possible to obtain a small critical value g c even if the spreading rate λ changes drastically. However, one of the limitations of the targeted immunization strategy is that it needs to know the information of global topology, in particular the ranking of the nodes must be clearly defined. This is impractical and uneconomical for handling large-scale and dynamic-evolving networks, such as P2P networks or email networks. In order to overcome this shortcoming, a local strategy, namely the acquaintance immunization [8, 16] , has been developed. The motivation for the acquaintance immunization is to work without any global information. In this strategy, p % of nodes are first selected as "seeds" from a network, and then one or more of their direct acquaintances are immunized. Because a node with higher degree has more links in a scale-free network, it will be selected as a "seed" with a higher probability. Thus, the acquaintance immunization strategy is more efficient than the random immunization strategy, but less than the targeted immunization strategy. Moreover, there is another issue which limits the effectiveness of the acquaintance immunization: it does not differentiate nodes, i.e. randomly selects "seed" nodes and their direct neighbors [17] . Another effective distributed strategy is the D-steps immunization [12, 17] . This strategy views the decentralized immunization as a graph covering problem. That is, for a node v i , it looks for a node to be immunized that has the maximal degree within d steps of v i . This method only uses the local topological information within a certain range (e.g. the degree information of nodes within d steps). Thus, the maximal acquaintance strategy can be seen as a 1-step immunization. However, it does not take into account domain-specific heuristic information, nor is it able to decide what the value of d should be in different networks. The immunization strategies described in the previous section are all based on node degrees. The way different immunized nodes are selected is illustrated in Fig. 1 1 An illustration of different strategies. The targeted immunization will directly select v 5 as an immunized node based on the degrees of nodes. Suppose that v 7 is a "seed" node. v 6 will be immunized based on the maximal acquaintance immunization strategy, and v 5 will be indirectly selected as an immunized node based on the D-steps immunization strategy, where d = 2 Fig. 2 An illustration of betweenness-based strategies. If we select one immunized node, the targeted immunization strategy will directly select the highest-degree node, v 6 . The node-betweenness strategy will select v 5 as it has the highest node betweenness. The edge-betweenness strategy will select one of v 3 , v 4 and v 5 because the edges, L 1 and L 2 , have the highest edge betweenness the highest-degree nodes from a network, many approaches cut epidemic paths by means of increasing the average path length of a network, for example by partitioning large-scale networks based on betweenness [4, 36] . For a network, node (edge) betweenness refers to the number of the shortest paths that pass through a node (edge). A higher value of betweenness means that the node (edge) links more adjacent communities and will be frequently used in network communications. Although [19] have analyzed the robustness of a network against degree-based and betweenness-based attacks, the spread of a virus in a propagation model is not considered, so the effects of different measurements on virus propagation is not clear. Is it possible to restrain virus propagation, especially from one community to another, by immunizing nodes or edges which have higher betweenness. In this paper, two types of betweenness-based immunization strategies will be presented, i.e. the node-betweenness strategy and the edge-betweenness strategy. That is, the immunized nodes are selected in the descending order of node-and edge-betweenness, in an attempt to better understand the effects of the degree and betweenness centralities on virus propagation. Figure 2 shows that if v 4 is immunized, the virus will not propagate from one part of the network to another. The node-betweenness strategy will select v 5 as an immunized node, which has the highest node betweenness, i.e. 41. The edge-betweenness strategy will select the terminal nodes of L 1 or L 2 (i.e. v 3 , v 4 or v 4 , v 5 ) as they have the highest edge betweenness. As in the targeted immunization, the betweenness-based strategies also require information about the global betweenness of a network. The experiments presented in this paper is to find a new measurement that can be used to design a highly efficient immunization strategy. The efficiency of these strategies is compared both in synthetic networks and in real-world networks, such as the Enron email network described by [4] . In order to compare different immunization strategies, a propagation model is required to act as a test-bed in order to simulate virus propagation. Currently, there are two typical models: (1) the epidemic model based on population simulation and (2) an interactive email model which utilizes individual-based simulation. Lloyd and May have proposed an epidemic propagation model to characterize virus propagation, a typical mathematical model based on differential equations [26] . Some specific epidemic models, such as SI [37, 38] , SIR [1, 30] , SIS [14] , and SEIR [11, 28] , have been developed and applied in order to simulate virus propagation and study the dynamic characteristics of whole systems. However, these models are all based on the mean-filed theory, i.e. differential equations. This type of black-box modeling approach only provides a macroscopic understanding of virus propagation-they do not give much insight into microscopic interactive behavior. More importantly, some assumptions, such as a fully mixed (i.e. individuals that are connected with a susceptible individual will be randomly chosen from the whole population) [33] and equiprobable contacts (i.e. all nodes transmit the disease with the same probability and no account is taken of the different connections between individuals) may not be valid in the real world. For example, in email networks and Instant Message (IM) networks, communication and/or the spread of information tend to be strongly clustered in groups or communities that have more closer relationships rather than being equiprobable across the whole network. These models may also overestimate the speed of propagation [49] . In order to overcome the above-mentioned shortcomings, [49] have built an interactive email model to study worm propagation, in which viruses are triggered by human behavior, not by contact probabilities. That is to say, the node will be infected only if a user has checked his/her email-box and clicked an email with a virus attachment. Thus, virus propagation in the email network is mainly determined by two behavioral factors: email-checking time intervals (T i ) and email-clicking probabilities (P i ), where i ∈ [1, N ] , N is the total number of users in a network. T i is determined by a user's own habits; P i is determined both by user security awareness and the efficiency of the firewall. However, the authors do not provide much information about how to restrain worm propagation. In this paper, an interactive email model is used as a test-bed to study the characteristics of virus propagation and the efficiency of different immunization strategies. It is readily to observe the microscopic process of worm propagating through this model, and uncover the effects of different factors (e.g. the power-law exponent, human dynamics and the average path length of the network) on virus propagation and immunization strategies. Unlike other models, this paper mainly focuses on comparing the performance of degree-based strategies and betweenness-based strategies, replacing the critical value of epidemics in a network. A detailed analysis of the propagation model is given in the following section. An email network can be viewed as a typical social network in which a connection between two nodes (individuals) indicates that they have communicated with each other before [35, 49] . Generally speaking, a network can be denoted as E = V, L , where V = {v 1 , v 2 , . . . , v n } is a set of nodes and L = { v i , v j | 1 ≤ i, j ≤ n} is a set of undirected links (if v i in the hit-list of v j , there is a link between v i and v j ). A virus can propagate along links and infect more nodes in a network. In order to give a general definition, each node is represented as a tuple . -ID: the node identifier, v i .I D = i. -State: the node state: i f the node has no virus, danger = 1, i f the node has virus but not in f ected, in f ected = 2, i f the node has been in f ected, immuni zed = 3, i f the node has been immuni zed. -NodeLink: the information about its hit-list or adjacent neighbors, i.e. v i .N odeLink = { i, j | i, j ∈ L}. -P behavior : the probability that a node will to perform a particular behavior. -B action : different behaviors. -VirusNum: the total number of new unchecked viruses before the next operation. -NewVirus: the number of new viruses a node receives from its neighbors at each step. In addition, two interactive behaviors are simulated according to [49] , i.e. the emailchecking time intervals and the email-clicking probabilities both follow Gaussian distributions, if the sample size goes to infinity. For the same user i, the email-checking interval T i (t) in [49] has been modeled by a Poisson distribution, i.e. T i (t) ∼ λe −λt . Thus, the formula for P behavior in the tuple can be written as P 1 behavior = Click Prob and P 2 behavior = CheckT ime. -ClickProb is the probability of an user clicking a suspected email, -CheckRate is the probability of an user checking an email, -CheckTime is the next time the email-box will be checked, v i .P 2 behavior = v i .CheckT ime = ex pGenerator(v i .Check Rate). B action can be specified as B 1 action = receive_email, B 2 action = send_email, and B 3 action = update_email. If a user receives a virus-infected email, the corresponding node will update its state, i.e. v i .State ← danger. If a user opens an email that has a virus-infected attachment, the node will adjust its state, i.e. v i .State ← in f ected, and send this virus email to all its friends, according to its hit-list. If a user is immunized, the node will update its state to v i .State ← immuni zed. In order to better characterize virus propagation, some assumptions are made in the interactive email model: -If a user opens an infected email, the node is infected and will send viruses to all the friends on its hit-list; -When checking his/her mailbox, if a user does not click virus emails, it is assumed that the user deletes the suspected emails; -If nodes are immunized, they will never send virus emails even if a user clicks an attachment. The most important measurement of the effectiveness of an immunization strategy is the total number of infected nodes after virus propagation. The best strategy can effectively restrain virus propagation, i.e. the total number of infected nodes is kept to a minimum. In order to evaluate the efficiency of different immunization strategies and find the relationship between local behaviors and global dynamics, two statistics are of particular interest: 1. SID: the sum of the degrees of immunized nodes that reflects the importance of nodes in a network 2. APL: the average path length of a network. This is a measurement of the connectivity and transmission capacity of a network where d i j is the shortest path between i and j. If there is no path between i and j, d i j → ∞. In order to facilitate the computation, the reciprocal of d i j is used to reflect the connectivity of a network: if there is no path between i and j, d −1 i j = 0. Based on these definitions, the interactive email model given in Sect. 2.3 can be used as a test-bed to compare different immunization strategies and uncover the effects of different factors on virus propagation. The specific research questions addressed in this paper can be summarized as follows: 1. How to evaluate network immunization strategies? How to determine the performance of a particular strategy, i.e. in terms of its efficiency, cost and robustness? What is the best immunization strategy? What are the key factors that affect the efficiency of a strategy? 2. What is the process of virus propagation? What effect does the network structure have on virus propagation? 3. What effect do human dynamics have on virus propagation? The simulations in this paper have two phases. First, a existing email network is established in which each node has some of the interactive behaviors described in Sect. 2.3. Next, the virus propagation in the network is observed and the epidemic dynamics are studied when applying different immunization strategies. More details can be found in Sect. 4. In this section, the simulation process and the structures of experimental networks are presented in Sects. 4.1 and 4.2. Section 4.3 uses a number of experiments to evaluate the performance (e.g. efficiency, cost and robustness) of different immunization strategies. Specifically, the experiments seek to address whether or not betweenness-based immunization strategies can restrain worm propagation in email networks, and which measurements can reflect and/or characterize the efficiency of immunization strategies. Finally, Sects. 4.4 and 4.5 presents an in-depth analysis in order to determine the effect of network structures and human dynamics on virus propagation. The experimental process is illustrated in Fig. 3 . Some nodes are first immunized (protected) from the network using different strategies. The viruses are then injected into the network in order to evaluate the efficiency of those strategies by comparing the total number of infected nodes. Two methods are used to select the initial infected nodes: random infection and malicious infection, i.e. infecting the nodes with maximal degrees. The user behavior parameters are based on the definitions in Sect. 2.3, where μ p = 0.5, σ p = 0.3, μ t = 40, and σ t = 20. Since the process of email worm propagation is stochastic, all results are averaged over 100 runs. The virus propagation algorithm is specified in Alg. 1. Many common networks have presented the phenomenon of scale-free [2, 21] , where nodes' degrees follow a power-law distribution [42] , i.e. the fraction of nodes having k edges, p(k), decays according to a power law p(k) ∼ k −α (where α is usually between 2 and 3) [29] . Recent research has shown that email networks also follow power-law distributions with a long tail [35, 49] . Therefore, in this paper, three synthetic power-law networks and a synthetic community-based network, generated using the GLP algorithm [6] where the power can be tuned. The three synthetic networks all have 1000 nodes with α =1.7, 2.7, and 3.7, respectively. The statistical characteristics and visualization of the synthetic community-based network are shown in Table 1 and Fig. 4c , f, respectively. In order to reflect the characteristics of a real-world network, the Enron email network 1 which is built by Andrew Fiore and Jeff Heer, and the university email network 2 which is complied by the members of the University Rovira i Virgili (Tarragona) will also be studied. The structure and degree distributions of these networks are shown in Table 2 and Fig. 4 . In particular, the cumulative distributions are estimated with maximum likelihood using the method provided by [7] . The degree statistics are shown in Table 9 . In this section, a comparison is made of the effectiveness of different strategies in an interactive email model. Experiments are then used to evaluate the cost and robustness of each strategy. Input: NodeData[NodeNum] stores the topology of an email network. Timestep is the system clock. v 0 is the set of initially infected nodes. Output: SimNum[timeStep] [k] stores the number of infected nodes in the network in the k th simulation. (1) For k=1 to Runtime //We run 100 times to obtain an average value (2) NodeData[NodeNum] ← Initializing an email network as well as users' checking time and clicking probability; (3) NodeData[NodeNum] ← Choosing immunized nodes based on different immunization strategies and adjusting their states; (4) While Timestep < EndSimul //There are 600 steps at each time (5) For i=1 to NodeNum (6) If NodeData[i].Checktime==0 (7) prob← computing the probability of opening a virus-infected email based on user's ClickProb and VirusNum (8) If Send a virus to all friends according to its hit-list (12) EndIf (13) EndIf (14) EndFor (15) For i=1 to NodeNum (16) Update the next CheckTime based on user's CheckRate (17) NodeData The immunization efficiency of the following immunization strategies are compared: the targeted and random strategies [39] , the acquaintance strategy (random and maximal neighbor) [8, 16] , the D-steps strategy (d = 2 and d = 3) [12, 17] (which is introduced in Sect. 2.1), The bridges between different communities: 100 The whole network: α=1.77, k =8.34 and the proposed betweenness-based strategy (node-and edge-betweenness). In the initial set of experiments, the proportion of immunized nodes (5, 10, and 30%) are varied in the synthetic networks and the Enron email network. Table 3 shows the simulation results in the Enron email network which is initialized with two infected nodes. Figure 5 shows the average numbers of infected nodes over time. Tables 4, 5 , and 6 show the numerical results in three synthetic networks, respectively. The simulation results show that the node-betweenness immunization strategy yields the best results (i.e. the minimum number of infected nodes, F) except for the case where 5% of the nodes in the Enron network are immunized under a malicious attack. The average degree of the Enron network is k = 3.4. This means that only a few nodes have high degrees, others have low degrees (see Table 9 ). In such a network, if nodes with maximal degrees are infected, viruses will rapidly spread in the network and the final number of infected nodes will be larger than in other cases. The targeted strategy therefore does not perform any better than the node-betweenness strategy. In fact, as the number of immunized nodes increases, the efficiency of the node-betweenness immunization increases proportionally There are two infected nodes with different attack modes. If there is no immunization, the final number of infected nodes is 937 with a random attack and 942 with a malicious attack, and AP L = 751.36(10 −4 ). The total simulation time T = 600 more than the targeted strategy. Therefore, if global topological information is available, the node-betweenness immunization is the best strategy. The maximal S I D is obtained using the targeted immunization. However, the final number of infected nodes (F) is consistent with the average path length (AP L) but not with the S I D. That is to say, controlling a virus epidemic does not depend on the degrees of immunized nodes but on the path length of a whole network. This also explains why the efficiency of the node-betweenness immunization strategy is better than that of the targeted immunization strategy. The node-betweenness immunization selects nodes based on the average path length, while the targeted immunization strategy selects based on the size of degrees. A more in-depth analysis is undertaken by comparing the change of the AP L with respect to the different strategies used in the synthetic networks. The results are shown in Fig. 6 . Figure 7a , b compare the change of the final number of infected nodes over time, which correspond to Fig. 6c , d, respectively. These numerical results validate the previous assertion that the average path length can be used as a measurement to design an effective immunization strategy. The best strategy is to divide the whole network into different sub-networks and increase the average path length of a network, hence cut the epidemic paths. In this paper, all comparative results are the average over 100 runs using the same infection model (i.e. the virus propagation is compared for both random and malicious attacks) and user behavior model (i.e. all simulations use the same behavior parameters, as shown in Sect. 4.1). Thus, it is more reasonable and feasible to just evaluate how the propagation of a virus is affected by immunization strategies, i.e. avoiding the effects caused by the stochastic process, the infection model and the user behavior. It can be seen that the edge-betweenness strategy is able to find some nodes with high degrees of centrality and then integrally divide a network into a number of sub-networks (e.g. v 4 in Fig. 2) . However, compared with the nodes (e.g. v 5 in Fig. 2 ) selected by the node-betweenness strategy, the nodes with higher edge betweenness can not cut the epidemic paths as they can not effectively break the whole structure of a network. In Fig. 2 , the synthetic community-based network and the university email network are used as examples to illustrate why the edge-betweenness strategy can not obtain the same immunization efficiency as the node-betweenness strategy. To select two nodes as immunized nodes from Fig. 2 , the node-betweenness immunization will select {v 5 , v 3 } by using the descending order of node betweenness. However, the edge-betweenness strategy can select {v 3 , v 4 } or {v 4 , v 5 } because the edges, L 1 and L 2 , have the highest edge betweenness. This result shows that the node-betweenness strategy can not only effectively divide the whole network into two communities, but also break the interior structure of communities. Although the edgebetweenness strategy can integrally divided the whole network into two parts, viruses can also propagate in each community. Many networks commonly contain the structure shown in Fig. 2 , for example, the Enron email network and university email networks. Table 7 and Fig. 8 present the results of the synthetic community-based network. Table 8 compares different strategies in the university email network, which also has some self-similar community structures [18] . These results further validate the analysis stated above. From the above experiments, the following conclusions can be made: Tables 4-8 , AP L can be used as a measurement to evaluate the efficiency of an immunization strategy. Thus, when designing a distributed immunization strategy, attentions should be paid on those nodes that have the largest impact on the APL value. 2. If the final number of infected nodes is used as a measure of efficiency, then the nodebetweenness immunization strategy is more efficient than the targeted immunization strategy. 3. The power-law exponent (α) affects the edge-betweenness immunization strategy, but has a little impact on other strategies. In the previous section, the efficiency of different immunization strategies is evaluated in terms of the final number of infected nodes when the propagation reaches an equilibrium state. By doing experiments in synthetic networks, synthetic community-based network, the Enron email network and the university email network, it is easily to find that the node-betweenness immunization strategy has the highest efficiency. In this section, the performance of the different strategies will be evaluated in terms of cost and robustness, as in [20] . It is well known that the structure of a social network or an email network constantly evolves. It is therefore interesting to evaluate how changes in structure affect the efficiency of an immunization strategy. -The cost can be defined as the number of nodes that need to be immunized in order to achieve a given level of epidemic prevalence ρ. Generally, ρ → 0. There are some parameters which are of particular interest: f is the fraction of nodes that are immunized; f c is the critical value of the immunization when ρ → 0; ρ 0 is the infection density when no immunization strategy is implemented; ρ f is the infection density with a given immunization strategy. Figure 9 shows the relationship between the reduced prevalence ρ f /ρ 0 and f. It can be seen that the node-betweenness immunization has the lowest prevalence for the smallest number of protected nodes. The immunization cost increases as the value of α increases, i.e. in order to achieve epidemic prevalence ρ → 0, the node-betweenness immunization strategy needs 20, 25, and 30% of nodes to be immunized, respectively, in the three synthetic networks. This is because the node-betweenness immunization strategy can effectively break the network structure and increase the path length of a network with the same number of immunized nodes. -The robustness shows a plot of tolerance against the dynamic evolution of a network, i.e. the change of power-law exponents (α). Figure 10 shows the relationship between the immunized threshold f c and α. A low level of f c with a small variation indicates that the immunization strategy is robust. The robustness is important when an immunization strategy is deployed into a scalable and dynamic network (e.g. P2P and email networks). Figure 10 also shows the robustness of the D-steps immunization strategy is close to that of the targeted immunization; the node-betweenness strategy is the most robust. [49] have compared virus propagation in synthetic networks with α = 1.7 and α = 1.1475, and pointed out that initial worm propagation has two phases. However, they do not give a detailed explanation of these results nor do they compare the effect of the power-law exponent on different immunization strategies during virus propagation. Table 9 presents the detailed degree statistics for different networks, which can be used to examine the effect of the power-law exponent on virus propagation and immunization strategies. First, virus propagation in non-immunized networks is discussed. Figure 11a shows the changes of the average number of infected nodes over time; Fig. 11b gives the average degree of infected nodes at each time step. From the results, it can be seen that 1. The number of infected nodes in non-immunized networks is determined by attack modes but not the power-law exponent. In Figs. 11a , b, three distribution curves (α = 1.7, 2.7, and 3.7) overlap with each other in both random and malicious attacks. The difference between them is that the final number of infected nodes with a malicious attack is larger than that with a random attack, as shown in Fig. 11a , reflecting the fact that a malicious attack is more dangerous than a random attack. 2. A virus spreads more quickly in a network with a large power-law exponent than that with a small exponent. Because a malicious attack initially infects highly connected nodes, the average degree of the infected nodes decreases in a shorter time comparing to a random attack (T 1 < T 2). Moreover, the speed and range of the infection is amplified by those highly connected nodes. In phase I, viruses propagate very quickly and infect most nodes in a network. However, in phase II, the number of total infected nodes grows slowly (Fig. 11a) , because viruses aim to infect those nodes with low degrees (Fig. 11b) , and a node with fewer links is more difficult to be infected. In order to observe the effect of different immunization strategies on the average degree of infected nodes in different networks, 5% of the nodes are initially protected against random and malicious attacks. Figure 12 shows the simulation results. From this experiment, it can be concluded that 1. The random immunization has no effect on restraining virus propagation because the curves of the average degree of the infected nodes are basically coincident with the curves in the non-immunization case. 2. Comparing Fig. 12a , b, c and d, e, f, respectively, it can be seen that the peak value of the average degree is the largest in the network with α=1.7 and the smallest in the network with α=3.7. This is because the network with a lower exponent has more highly connected nodes (i.e. the range of degrees is between 50 and 80), which serve as amplifiers in the process of virus propagation. 3. As α increases, so does the number of infected nodes and the virus propagation duration (T 1 < T 2 < T 3). Because a larger α implies a larger AP L , the number of infected nodes will increase; if the network has a larger exponent, a virus need more time to infect those nodes with medium or low degrees. Fig. 14 The average number of infected nodes and the average degree of infected nodes, with respect to time when virus spreading in different networks. We apply the targeted immunization to protect 30% nodes in the network First, consider the process of virus propagation in the case of a malicious attack where 30% of the nodes are immunized using the edge-betweenness immunization strategy. There are two intersections in Fig. 13a . Point A is the intersection of two curves NET1 and NET3, and point B is the intersection of NET2 and NET1. Under the same conditions, Fig. 13a shows that the total number of infected nodes is the largest in NET1 in Phase I. Corresponding to Fig. 13b , the average degree of infected nodes in NET1 is the largest in Phase I. As time goes on, the rate at which the average degree falls is the fastest in NET1, as shown in Fig. 13b . This is because there are more highly connected nodes in NET1 than in the others (see Table 9 ). After these highly connected nodes are infected, viruses attempt to infect the nodes with low degrees. Therefore, the average degree in NET3 that has the smallest power-law exponent is larger than those in Phases II and III. The total number of infected nodes in NET3 continuously increases, exceeding those in NET1 and NET2. The same phenomenon also appears in the targeted immunization strategy, as shown in Fig. 14. The email-checking intervals in the above interactive email model (see Sect. 2.3) is modeled using a Poisson process. The Poisson distribution is widely used in many real-world models to statistically describe human activities, e.g. in terms of statistical regularities on the frequency of certain events within a period of time [25, 49] . Statistics from user log files to databases that record the information about human activities, show that most observations on human behavior deviate from a Poisson process. That is to say, when a person engages in certain activities, his waiting intervals follow a power-law distribution with a long tail [27, 43] . Vazquez et al. [44] have tried to incorporate an email-sending interval distribution, characterized by a power-law distribution, into a virus propagation model. However, their model assumes that a user is instantly infected after he/she receives a virus email, and ignores the impact of anti-virus software and the security awareness of users. Therefore, there are some gaps between their model and the real world. In this section, the statistical properties associated with a single user sending emails is analyzed based on the Enron dataset [41] . The virus spreading process is then simulated using an improved interactive email model in order to observe the effect of human behavior on virus propagation. Research results from the study of statistical regularities or laws of human behavior based on empirical data can offer a valuable perspective to social scientists [45, 47] . Previous studies have also used models to characterize the behavioral features of sending emails [3, 13, 22] , but their correctness needs to be further empirically verified, especially in view of the fact that there exist variations among different types of users. In this paper, the Enron email dataset is used to identify the characteristics of human email-handling behavior. Due to the limited space, Table 10 presents only a small amount of the employee data contained in the database. As can be seen from the table, the interval distribution of email sent by the same user is respectively measured using different granularities: Day, Hour, and Minute. Figure 15 shows that the waiting intervals follow a heavy-tailed distribution. The power-law exponent as the Day granularity is not accurate because there are only a few data points. If more data points are added, a power-law distribution with long tail will emerge. Note that, there is a peak at t = 16 as measured at an Hour granularity. Eckmann et al. [13] have explained that the peak in a university dataset is the interval between the time people leave work and the time they return to their offices. After curve fitting, see Fig. 15 , the waiting interval exponent is close to 1.3, i.e. α ≈ 1.3 ± 0.5. Although it has been shown that an email-sending distribution follows a power-law by studying users in the Enron dataset, it is still not possible to assert that all users' waiting intervals follow a power-law distribution. It can only be stated that the distribution of waiting intervals has a long-tail characteristic. It is also not possible to measure the intervals between email checking since there is no information about login time in the Enron dataset. However, combing research results from human Web browsing behavior [10] and the effect of non-Poisson activities on propagation in the Barabasi group [44] , it can be found that there are similarities between the distributions of email-checking intervals and email-sending intervals. The following section uses a power-law distribution to characterize the behavior associated with email-checking in order to observe the effect human behavior has on the propagation of an email virus. Based on the above discussions, a power-law distribution is used to model the email-checking intervals of a user i, instead of the Poisson distribution used in [49] , i.e. T i (τ ) ∼ τ −α . An analysis of the distribution of the power-law exponent (α) for different individuals in Web browsing [10] and in the Enron dataset shows that the power-law exponent is approximately 1.3. In order to observe and quantitatively analyze the effect that the email-checking interval has on virus propagation, the email-clicking probability distribution (P i ) in our model is consistent with the one used by [49] , i.e. the security awareness of different users in the network follows a normal distribution, P i ∼ N (0.5, 0.3 2 ). Figure 16 shows that following a random attack viruses quickly propagate in the Enron network if the email-checking intervals follow a power-law distribution. The results are consistent with the observed trends in real computer networks [31] , i.e. viruses initially spread explosively, then enter a long latency period before becoming active again following user activity. The explanation for this is that users frequently have a short period of focused activity followed by a long period of inactivity. Thus, although old viruses may be killed by anti-virus software, they can still intermittently break out in a network. That is because some viruses are hidden by inactive users, and cannot be found by anti-virus software. When the inactive users become active, the virus will start to spread again. The effect of human dynamics on virus propagation in three synthetic networks is also analyzed by applying the targeted [9] , D-steps [17] and AOC-based strategy [24] . The numerical results are shown in Table. 11 and Fig. 17 . From the above experiments, the following conclusions can be made: 1. Based on the Enron email dataset and recent research on human dynamics, the emailchecking intervals in an interactive email model should be assigned based on a power-law distribution. 2. Viruses can spread very quickly in a network if users' email-checking intervals follow a power-law distribution. In such a situation, viruses grow explosively at the initial stage and then grow slowly. The viruses remain in a latent state and await being activated by users. In this paper, a simulation model for studying the process of virus propagation has been described, and the efficiency of various existing immunization strategies has been compared. In particular, two new betweenness-based immunization strategies have been presented and validated in an interactive propagation model, which incorporates two human behaviors based on [49] in order to make the model more practical. This simulation-based work can be regarded as a contribution to the understanding of the inter-reactions between a network structure and local/global dynamics. The main results are concluded as follows: 1. Some experiments are used to systematically compare different immunization strategies for restraining epidemic spreading, in synthetic scale-free networks including the community-based network and two real email networks. The simulation results have shown that the key factor that affects the efficiency of immunization strategies is APL, rather than the sum of the degrees of immunized nodes (SID). That is to say, immunization strategy should protect nodes with higher connectivity and transmission capability, rather than those with higher degrees. 2. Some performance metrics are used to further evaluate the efficiency of different strategies, i.e. in terms of their cost and robustness. Simulation results have shown that the D-steps immunization is a feasible strategy in the case of limited resources and the nodebetweenness immunization is the best if the global topological information is available. 3. The effects of power-law exponents and human dynamics on virus propagation are analyzed. More in-depth experiments have shown that viruses spread faster in a network with a large power-law exponent than that with a small one. Especially, the results have explained why some old viruses can still propagate in networks up till now from the perspective of human dynamics. The mathematical theory of infectious diseases and its applications Emergence of scaling in random networks The origin of bursts and heavy tails in human dynamics Cluster ranking with an application to mining mailbox networks Small worlds' and the evolution of virulence: infection occurs locally and at a distance On distinguishing between internet power law topology generators Power-law distribution in empirical data Efficient immunization strategies for computer networks and populations Halting viruses in scale-free networks Dynamics of information access on the web A simple model for complex dynamical transitions in epidemics Distance-d covering problem in scalefree networks with degree correlation Entropy of dialogues creates coherent structure in email traffic Epidemic threshold in structured scale-free networks On power-law relationships of the internet topology Improving immunization strategies Immunization of real complex communication networks Self-similar community structure in a network of human interactions Attack vulnerability of complex networks Targeted local immunization in scale-free peer-to-peer networks The large scale organization of metabolic networks Probing human response times Periodic subgraph mining in dynamic networks. Knowledge and information systems Autonomy-oriented search in dynamic community networks: a case study in decentralized network immunization Characterizing web usage regularities with information foraging agents How viruses spread among computers and people On universality in human correspondence activity Enhanced: simple rules with complex dynamics Network motifs simple building blocks of complex networks Epidemics and percolation in small-world network Code-red: a case study on the spread and victims of an internet worm The structure of scientific collaboration networks The spread of epidemic disease on networks The structure and function of complex networks Email networks and the spread of computer viruses Partitioning large networks without breaking communities Epidemic spreading in scale-free networks Epidemic dynamics and endemic states in complex networks Immunization of complex networks Computer virus propagation models The Enron email dataset database schema and brief statistical report Exploring complex networks Modeling bursts and heavy tails in human dynamics Impact of non-poissonian activity patterns on spreading process Predicting the behavior of techno-social systems A decentralized search engine for dynamic web communities A twenty-first century science An environment for controlled worm replication and analysis Modeling and simulation study of the propagation and defense of internet e-mail worms Chao Gao is currently a PhD student in the International WIC Institute, College of Computer Science and Technology, Beijing University of Technology. He has been an exchange student in the Department of Computer Science, Hong Kong Baptist University. His main research interests include Web Intelligence (WI), Autonomy-Oriented Computing (AOC), complex networks analysis, and network security. Department at Hong Kong Baptist University. He was a Professor and the Director of School of Computer Science at University of Windsor, Canada. His current research interests include: Autonomy-Oriented Computing (AOC), Web Intelligence (WI), and self-organizing systems and complex networks, with applications to: (i) characterizing working mechanisms that lead to emergent behavior in natural and artificial complex systems (e.g., phenomena in Web Science, and the dynamics of social networks and neural systems), and (ii) developing solutions to large-scale, distributed computational problems (e.g., distributed scalable scientific or social computing, and collective intelligence). Prof. Liu has contributed to the scientific literature in those areas, including over 250 journal and conference papers, and 5 authored research monographs, e.g., Autonomy-Oriented Computing: From Problem Solving to Complex Systems Modeling (Kluwer Academic/Springer) and Spatial Reasoning and Planning: Geometry, Mechanism, and Motion (Springer). Prof. Liu has served as the Editor-in-Chief of Web Intelligence and Agent Systems, an Associate Editor of IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Systems, Man, and Cybernetics-Part B, and Computational Intelligence, and a member of the Editorial Board of several other international journals. Laboratory and is a Professor in the Department of Systems and Information Engineering at Maebashi Institute of Technology, Japan. He is also an Adjunct Professor in the International WIC Institute. He has conducted research in the areas of knowledge discovery and data mining, rough sets and granular-soft computing, Web Intelligence (WI), intelligent agents, brain informatics, and knowledge information systems, with more than 250 journal and conference publications and 10 books. He is the Editor-in-Chief of Web Intelligence and Agent Systems and Annual Review of Intelligent Informatics, an Associate Editor of IEEE Transactions on Knowledge and Data Engineering, Data Engineering, and Knowledge and Information Systems, a member of the editorial board of Transactions on Rough Sets.