key: cord-0888921-ejsmyo0h authors: Zhao, Na; Wang, Jian; Yu, Yong; Zhao, Jun-Yan; Chen, Duan-Bing title: Spreading predictability in complex networks date: 2021-07-12 journal: Sci Rep DOI: 10.1038/s41598-021-93611-z sha: b2b9af2130c4ff0f1f401389d54384354ba500f1 doc_id: 888921 cord_uid: ejsmyo0h Many state-of-the-art researches focus on predicting infection scale or threshold in infectious diseases or rumor and give the vaccination strategies correspondingly. In these works, most of them assume that the infection probability and initially infected individuals are known at the very beginning. Generally, infectious diseases or rumor has been spreading for some time when it is noticed. How to predict which individuals will be infected in the future only by knowing the current snapshot becomes a key issue in infectious diseases or rumor control. In this report, a prediction model based on snapshot is presented to predict the potentially infected individuals in the future, not just the macro scale of infection. Experimental results on synthetic and real networks demonstrate that the infected individuals predicted by the model have good consistency with the actual infected ones based on simulations. Spreading dynamics is an important issue in spread and control 1,2 of rumor 3,4 and disease [5] [6] [7] [8] , marketing 9 , recommending [10] [11] [12] , source detecting 13, 14 , and many other interesting topics [15] [16] [17] [18] . Generally speaking, we can not observe the transmission process of infectious diseases, but can only observe the snapshot at a certain time. How to predict the infection probability 19 , infection scale 20, 21 , or even the infected nodes precisely from a given snapshot has been gotten much attention in recent years. Researchers have gotten many achievements on macro level of spread such as phase transition of spread 22 and basic reproduction number 23 . Up to now, many researches focus on estimating of infection scale. The simplest one is mean-field model, in which, the spread coverage can be predicted by using differential equations 20 . Besides mean-field model, some more realistic models such as pair approximation 21 and permutation entropy 24 are considered to predict the infection scale or infectious disease outbreaks. The main difference between mean-field and pair approximation is that the former(latter) approximates high-order moments in term of first (second) order ones. Researchers studied the predictability of a diverse collection of outbreaks and identified a fundamental entropy barrier for disease time series forecasting through adopting permutation entropy as a model independent measure of predictability 24 . Funk et al. 25 presented a stochastic semi-mechanistic model of infectious disease dynamics that was used in real time during the 2013-2016 West African Ebola epidemic to fit the simulated trajectories in the Ebola Forecasting Challenge, and to produce forecasts that were compared to following data points. Zhang et al. 26 proposed a measurement to state the efforts of users on Twitter to get their information propagation. They found that small fraction of users with special performance on participation can gain great influence, while most other users play an intermediate role during the information propagation. Up to now, most researches focus on macro level of spreading prediction. Besides analysis on macro level, we also should pay attention to the details of infected individuals so as to contain the spread of serious infectious diseases such as SARS 27 , H7N7 28 and COVID-19 29 . Chen et al. did some interesting works on this area 19 . They presented an iterative algorithm to estimate the infection probability of the spreading process and then to predict the spreading coverage from a given snapshot. In this report, we present a probability based prediction model to estimate the probability of a node to be infected, further, to determine the potentially infected nodes in the future rather than macro scale. Figure 1 is a toy network with 24 nodes. The snapshot includes 5 recovered nodes and 1 infected node, as shown in Fig. 1a We evaluate the model on synthetic and real networks. Synthetic networks are Wattes-Strogatz (WS) networks 30 , Barabási-Albert (BA) networks 31 and Given-Newman (GN) community networks 32 . Each synthetic network has 4000 nodes and each GN community network has 40 communities. Eleven real networks are cond-mat, astro-Ph, email, c.elegens, ecoli, internet, PGP, TAP, HEP, Y2H and power. The number of nodes and edges are listed in Table 1 . In order to evaluate the model, we employ the Susceptible-Infected-Removed (SIR) model 33 to simulate the spreading process on networks. In a network, we randomly select one node as the initial spreader. The information from this node will infect each of its susceptible neighbors with probability µ . For simplicity, we assume www.nature.com/scientificreports/ that the node will immediately recover (i.e., the recovering probability is 1) after infecting neighbors. Of course, if the recovering probability is less than 1, it can be analyzed similarly, we will study this in the future. The new infected nodes continue to infect their neighbors in next step. If it is not specially stated, we take the snapshot after five steps of spreading from the initial node as the known information. The correlations ρ on 11 real networks are shown in Table 1 , where ρ is the Pearson correlation between the results of prediction and actual ones based on simulations. From Table 1 , it can be seen that the results obtained by prediction model are good consistency with actual results based on simulations, especial for the case of large number of infected nodes N I of snapshot. It is noted that the correlation ρ and N I have strong positive correlation. For networks Y2H and power, the correlation ρ is extremely low since N I is very small. Actually, in these cases, there are few infected nodes in snapshot. Furthermore, the networks are very sparse, so, it is hard to predict the nodes being infected from snapshot in the future. While for networks cond-mat, astro-Ph, email, c.elegens, ecoli and internet, the correlations ρ are larger than 0.9, this indicates that the infected individuals predicted by model are basically consistent with the actual ones based on simulations. Moreover, we also deeply analyze the effect of some parameters on the prediction model by using synthetic networks, including: (1) the effect of infection probability, (2) the effect of network structure, and (3) the effect of stage of snapshot. The effect of infection probability. Figure 2 shows the Pearson correlation ρ between the results of averaging on 200 simulations and that of probability prediction model under different infection probability µ on WS, BA and GN networks. Generally, the correlation get larger while µ getting larger. For large µ , e.g., µ = 0.3 , the correlation approach to 1 since most of nodes will be infected. From Fig. 2 , it can be seen that there exists a transition point, in detail, the transition point at µ = 0.15 for WS network (see Fig. 2b ) and at µ = 0.1 for GN network (see Fig. 2c ). This can be explained as follows: the information almost do not diffuse if µ is small ( µ < 0.15 for WS networks and µ < 0.1 for GN network), and the infected nodes are highly random for different simulations. It is noted that there hardly exist transition point in BA network. It can be explained as follows: the information will easily reach to the node with large degree regardless the location of initially infected node, eventually, reach to other nodes for its heterogenous structure. Interestingly, if µ is very small (e.g., µ = 0.02 ), the correlation is getting large in BA network, as shown in Fig. 2a . Actually, for very small µ , only a few snapshots in 200 simulations can be utilized to analyze the correlation ρ since spread stops in two or three steps in most simulations, the results have no statistical significance. Besides, the distribution of correlation ρ under the results of 200 independent runs are listed in Fig. 2d -f. From these three subfigures, it can be seen that the distributions of correlation ρ of BA and GN networks are similar, while that of WS network are generally large comparing with BA and GN networks. The effect of network structure. Figure 3 shows the correlation for three types of networks with different structural parameters. For WS network, we study the effect of the rewiring parameter p on correlation. For BA network, we consider a variant form of it in which each new node u connects to an existing node v with probability p u = (k u + B)/ v (k v + B) 34, 35 . This modified model allows a selection of the exponent of the power-law scaling in the degree distribution p(k) ∼ k −γ with γ = 3 + B/m in the thermodynamic limit where m is the number of nodes should be connected when a new node is added and B is tunable parameter. With this network, we study the effect of B on correlation. For GN network, we study the effect of k in on correlation, where k in is the average internal degree of nodes in community. For a node u in community C, its internal degree k in u can be written as: For standard BA network, i.e., B = 0 , there are a few nodes with extremely large degree, the information can be spread out easily so long as it reaches to a node with www.nature.com/scientificreports/ large degree. So, it is relatively easy to predict which node will be infected in the future. As B increasing, the network evolves to random, a node getting infected or not will be hard to predict relatively, so the correlation decreases when B increases, as shown in Fig. 3a . In WS network, if rewiring probability p < 0.2 , the information hardly diffuse to other nodes since the WS network is almost regular, so it is hard to predict the infected nodes. As rewiring probability p getting larger, the network getting more random, the information reaches to other www.nature.com/scientificreports/ nodes easily, consequently, it is easy to predict the infected nodes, as shown in Fig. 3b . In GN network, if average internal degree k in is larger, the community structure is clearer, correspondingly, the information is harder to escape the community boundary, and the correlation will getting worse, as shown in Fig. 3c . Besides the network parameter listed above, the density of network, i.e., average node degree k , also affects the correlation, as shown in Fig. 3d-f . It can be seen that the correlation is small for small average node degree k . Especially in WS and GN networks, for a large scope of average node degree ( k < 12 in WS and k < 8 in GN), the correlation is extremely small, there exists an obvious transition points, as shown in Fig. 3d ,f. Actually, in WS and GN networks, when the average degree is small, the snapshot contains few infected nodes, so, little usable information can be used to predict. This leads to the inaccurate estimation of µ 19 , further, the prediction of subsequent infected nodes is also inaccurate. In fact, it can be seen from Table 1 that if the number of infected nodes N I is small, the correlation ρ is also low. We will study the essential reason of this issue in the future. The effect of stage of snapshot. We further analyze the correlation ρ under different stage of snapshot, as shown in Fig. 4 . In Fig. 4 , T is the spreading steps of snapshot. Generally, it is difficult to estimate the infected rate precisely if just the snapshot in the early stage is given since there is little usable information, so, it is hard to predict the infected nodes. As T increases, more information could be used, the correlation ρ is getting larger. In the late stage, many nodes of snapshot are infected or recovered, the left nodes are hard to be infected, so the correlation ρ are getting smaller, especially in BA network since most of all nodes are recovered. From Figs. 3 and 4 , it is interesting that the prediction results fluctuate greatly for different parameters, while the fluctuation of the results is very small under determined parameter. For example, as shown in Fig. 3b , if the reconnection parameter p is small, the correlation ρ is low. However, if the reconnection parameter p is high, the correlation ρ is high. No matter what the value of p, the error bar is small under a certain p, this indicates that the correlation on determined parameter p has little change. In Fig. 4 , although the correlation fluctuates greatly with stage of the snapshot, it changes very little for a determined snapshot. In conclusion, this report mainly predicts the potential infected individuals according to the currently observed snapshot, which has significance on prevention and control of infectious diseases such as COVID-19. Due to the popularity of mobile device, it is relatively easy to obtain users' contact network, which provides certain basic conditions of prediction on potential infected individuals. For a given snapshot, we use IAIP 19 method to estimate the infection probability. In IAIP model, we denote the number of infected nodes as N I , the number of susceptible nodes as N S and the number of recovered nodes as N R . N S + N I + N R = N since we use SIR spreading model. If a susceptible node j contacts an infected node i, node j has an opportunity to be infected. For an infected node i at step t (recovered at step t + 1 ), the contact times with its susceptible neighbors are k i − m i ( m i neighbors have been infected before step t). So, the total contact times before step T are i∈R k i − m i where T is the spreading steps of snapshot. In these i∈R k i − m i contacts, N R + N I − 1 nodes are infected, so the infection probability µ can be approximately calculated by: where k i is the degree of node i and m i is the number of infected or recovered nodes in the neighbors of node i when it is infected at step t ( t < T where T is the spreading steps of snapshot). Since the exact value of m i cannot be directly extracted from the snapshot, we use its expected value m i to approximate. Actually, m i is the weighted average of M i states where each state S l (1 l M i ) has exactly l neighbors having been infected before node i is infected. When node i is infected, the probability that exactly one neighbor has been infected is µ and the probability that exactly two neighbors have been infected is µ · (1 − µ) . Generally, the probability that l(1 l M i ) neighbors have been infected is µ · (1 − µ) l−1 where M i is the total number of infected or recovered nodes of i ′ s neighbors in the snapshot. Moreover, the number of infected or recovered neighbors will not exceed M i when i is infected. So, the probability that exactly l(1 ≤ l ≤ M i ) neighbors have been infected is approximated by the normalized value of µ · (1 − µ) q−1 (1 ≤ q ≤ M i ) . Based on these, the expected value m i of recovered or infected neighbors when i is infected can be calculated by the weighted average of l(1 ≤ l ≤ M i ) , that is: www.nature.com/scientificreports/ On the basis of Eq. 2 and Eq. 3, µ and m i are expected to respectively approach their true values. In real situation, it is very difficult to estimate m i accurately since we just can obtain the information at time T from snapshot. In the future study, we will combine other strategies such as source detection from snapshot 13, 36 to estimate m i more accurately. In the proposed model, a group of infected individuals try to infect a node i until it is infected. Actually, we hold a reactive process in this report since an infected individual effectively contacts all its neighbors to expand the epidemics or information 37 . For a given snapshot, a node u will be converted into infected one with a probability P u (t) at time t, we have, where Ŵ u is the neighbors of node u and infection probability µ can be estimated by IAIP model (Iterative Algorithm for estimating the Infection Probability) 19 . For node v in Eq. (4), it is reasonable to assume P v (t) = 1 for infected node and P v (t) = 0 for susceptible or recovered node. Obviously, the initial condition is, By solving Eq. (4) under initial condition Eq. (5), P u (t) will be converged to a steady state denoted by P u (t c ) , where t c is the convergence time. The final score P u = P u (t c ) is the probability to be infected of susceptible node while spreading achieves steady state. More precisely, we run the percolation according to Eq. (4) until the process dies, that is, for each node u, P u (t) = P u (t − 1) under given permissible error. Of course, we can also predict the probability of each node being infected at a certain step t after the snapshot. In order to evaluate the performance of the proposed model, we use Pearson correlation ρ between the result of averaging on N simulations and that of probability prediction model, that is: where − → p r = (x 1 , x 2 , . . . , x N ) and − → p e = (y 1 , y 2 , . . . , y N ) are the vector of infected probability of nodes obtained by simulations and by probability prediction model respectively, and N is the number of nodes of networks. P u (0) = 0 if node u is susceptible or recovered 1 if node u is infected , Impact of human activity patterns on the dynamics of information diffusion Impact of human activity patterns on the dynamics of information diffusion Rumor evolution in social networks Rumor detection over varying time windows Modeling human mobility responses to the large-scale spreading of infectious diseases Localization and spreading of diseases in complex networks Competing spreading processes on multiplex networks: Awareness and epidemics Authoritative sources in a hyperlinked environment Viral marketing through e-mail: The link consumer-company Recommender systems Avoiding congestion in recommender systems Adaptive social recommendation in a multiple category landscape Locating the source of diffusion in large-scale networks Locating the source of diffusion in complex networks by time reversal backward spreading Similarity measures in scientometric research: The jaccard index versus saltons cosine formula The small world yields the most effective information spreading Enhancing topology adaptation in information sharing social networks The spread of behavior in an online social network experiment Predicting the evolution of spreading on complex networks Epidemic spreading in scale-free networks Heterogeneous pair-approximation for the contact process on complex networks Dynamic phase transitions in cell spreading Dengue disease, basic reproduction number and control On the predictability of infectious disease outbreaks Real-time forecasting of infectious disease dynamics with a stochastic semi-mechanistic model Users participation and social influence during information spreading on twitter Characterization of a novel coronavirus associated with severe acute respiratory syndrome Avian influenza a virus (h7n7) associated with human conjunctivitis and a fatal case of acute respiratory distress syndrome Nowcasting and forecasting the potential domestic and international spread of the 2019-ncov outbreak originating in wuhan, china: a modelling study Collective dynamics of small-world networks Emergence of scaling in random networks Finding and evaluating community structure in networks Infectious Diseases of Humans: Dynamics and Control Statistical mechanics of complex networks Evolution of networks Inferring the origin of an epidemic with a dynamic message-passing algorithm Discrete-time markov chain approach to contact-based disease spreading in complex networks N.Z. and D.-B.C. designed the research and prepared all figures. N.Z., J.W. and Y.Y. performed the experiments and analyzed the data. All authors wrote the manuscript. The authors declare no competing interests. Correspondence and requests for materials should be addressed to D.-B.C. Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.