key: cord-0010912-0undodih authors: Iannelli, Flavio; Koher, Andreas; Brockmann, Dirk; Hövel, Philipp; Sokolov, Igor M. title: Effective distances for epidemics spreading on complex networks date: 2017-01-17 journal: Phys Rev E DOI: 10.1103/physreve.95.012313 sha: 3752edae8a211c666dc6493abd2ba499288f1681 doc_id: 10912 cord_uid: 0undodih We show that the recently introduced logarithmic metrics used to predict disease arrival times on complex networks are approximations of more general network-based measures derived from random walks theory. Using the daily air-traffic transportation data we perform numerical experiments to compare the infection arrival time with this alternative metric that is obtained by accounting for multiple walks instead of only the most probable path. The comparison with direct simulations reveals a higher correlation compared to the shortest-path approach used previously. In addition our method allows to connect fundamental observables in epidemic spreading with the cumulant-generating function of the hitting time for a Markov chain. Our results provides a general and computationally efficient approach using only algebraic methods. Networks have received growing attention in the past decade particularly due to their applicability in describing a wide range of phenomena. Prominent examples are the mapping of the World Wide Web and structure of the Internet, social and financial networks, epidemiology, and language dynamics [1] [2] [3] [4] . In the context of epidemic spreading the prediction of outbreaks has become particularly important for public health issues. The rapid growth in the velocity of transportation means and frequency of movements has further increased the risk that global emergent diseases such as H1N1 [5] , SARS [6] , or EBOV [7] , and more recently ZIKV [8] , will spread worldwide. The underlying mobility networks are usually scale-free [9] . This implies the absence of an epidemic threshold [10] that allows any transmittable disease to spread through the global population. The large amount of traffic data both at the local and global scale, which became available in recent years, provides a new opportunity to understand such processes. On the one hand numerical simulations of infection spreading offer a practical tool for estimating the infection arrival time [11] . In this regard metapopulation models [12] [13] [14] provide a reasonable tool for maintaining a high level of complexity in the simulation of pandemics [11] . At the global scale the different subpopulations, defined by the nodes of the network, are cities that can be connected by directed or undirected fluxes of individuals, provided by the worldwide transportation network data. On the other hand algebraic methods give a solid foundation for drawing general conclusions and in many cases provide numerical instruments superior to direct simulations. In this work we introduce a network-based measure that generalizes the concept of distance and that provides fundamental insights into the dynamics of disease transmission as well as an efficient numerical estimation of the infection * iannelli.flavio@gmail.com arrival time. We compare this effective distance [16] with the numerical estimate of the transmission times using a metapopulation model to validate the method. A series of papers have already been devoted to this problem [17] [18] [19] [20] [21] . Most of them rely on the concept of most probable path, the shortest-path effective distance D SP ij for each source i and target j in the network. The latter can be defined, for both directed and undirected networks, as the geodesic graph distance with edge weights given by the first moment of a Gumbel distributed variable which depends only on the network topology and the infection rate. This shortest-path approach, however, significantly overestimates the infection arrival times [18, 22] . A more realistic scenario takes into account all possible paths [19] yielding the multiple-path distance D MP ij , which is better suited to estimate the arrival times of the infection. This method allows, at least in principle, to take into account all possible directed transmission paths, although the computation becomes infeasible as their number grows exponentially with the the number of vertices in the network. The lack of a practical computational approach leads back to considering only the shortest path. A logarithmic ad hoc edge weight transformation was introduced in Ref. [20] by simply requiring that adding edges should translate to multiplying the associated probabilities. This follows the intuitive argument that a higher number of passengers reduces the separation distance between neighboring nodes. This logarithmic transformation can be viewed as a log-space reduction [23] and, as we will show later, it can be derived as a special case of the shortest-path effective distance defined previously in Ref. [18] . The estimated arrival time T AR ij from node i to node j , obtained from numerical simulations, correlates highly with the shortest-path effective distance. Using this metrics the complexity of disease transmission can be understood in terms of circular waves propagating at constant velocity from the infected node at time zero to all other nodes in the network [20] . The measure we introduce here aims to generalize previous definitions by including all walks that connect source and target. Let us consider a metapopulation susceptible-infectedremoved dynamics S βSI − − → I μI − → R, where β and μ are the infection and recovery rate, respectively. The nodes of the metapopulation network consists of subpopulations of constant size N j , which divides into susceptible (S j ), infected (I j ), and removed (R j ) individuals In the metapopulation in addition to the local SIR reaction dynamics, the movement of a host between populations i and j is governed by the master equation Here we introduced j } as a placeholder variable, and the transition rates Q ij are defined as the conditional probability of a randomly chosen individual to jump from location i to location j within one time step The transition rates Q ij can equivalently be defined in terms of the weighted adjacency matrix A W ij as Q ij = A W ij /N i ∈ [0,1]. The latter is obtained from the actual passenger fluxes between any two airports in the global mobility network used in the simulations [15] . This network consists of V = 3865 vertices (airports) and E = 51 440 directed edges (fluxes), with very broad degree and weight distributions (see Figs. 1 and 2), where the high heterogeneity in the network connectivity is graphically reproduced. For the network diameter we found D = 16 (connecting Stuart Island to Narsaq Kujalleq Heliport) and the global clustering coefficient is c = 0.26 ± 0.01. A peculiar feature of this network is that the antisymmetric part of the fluxes χ = A W ij − A W ji is vanishing to a high degree of accuracy so that it can be considered as undirected [24] . The weighted adjacency matrix of the undirected air traffic network is then defined by A W ij = A ij W ij , where A ij is the adjacency matrix element and W ij 0 the corresponding weight. The symmetry in the adjacency matrix implies that for adjacent The Markov transition matrix associated to the network can be written in terms of both the fluxes A W ij and the local transition rates Q ij : and it is row stochastic by construction. From (4) we also have the detailed balance where we have introduced the weighted degree k W i = j A W ij , sometimes denoted as strength [25] , as the asymptotic probability distribution for the corresponding Markov chain. Thus using (6) we can rewrite (2), in terms of the compartment densities Furthermore, we can remove the dependence on the population size N j by introducing a constant global mobility rate α. This parameter is defined as the ratio α = /N , between the total daily passenger flux = ij A W ij and the total population N = i N i , i.e., the rate to leave a node for a randomly chosen individual. A local node dependent mobility rate can also be defined as In the global mobility network data the total traffic of each node, i.e., its weighted degree k W j , is with a good accuracy proportional to its population N i via the global mobility rate Normalized weight distribution P(W ) ∼ W −γw with scaling exponent γ w = 3.60 ± 0.14. Inset: Unweighted degree distribution P(k) ∼ k −γ k with scaling exponent γ k = 1.79 ± 0.10. Scaling exponents are obtained using the method presented in Ref. [27] . α, thus in our case α i = α ∀i. The complete SIR dynamics of the Rvachev-Longini model [20, 26] becomes where the mobility function for each compartment density accounts for diffusion along the nodes. Integrating system (9) we obtain the contagion dynamics on the transportation network A W ij with the global mobility rate α and the infection parameters β and μ. Finally, the arrival time T AR ij for each source-target pair in the global mobility network is computed by considering a node j infected as soon as one infected individual is present. After introducing the effective distance we use T AR ij to compare the goodness of the different measures. The fundamental metric on a network is given by the (geodesic) shortest-path length over all paths ij connecting node i to node j . In a weighted network for each edge (k,l) ∈ ij no node is visited more than once and contributes to the total length with its corresponding weight [1] where the inverse is used because in our case a higher flux of passengers reduces the distance between two nodes. Starting from this definition it is possible to extend the notion of distance by replacing the adjacency matrix weight with an effective quantity that captures the essence of the problem of predicting when the disease will arrive at a certain place. In Ref. [18] the authors defined the shortest-path distance D SP ij by first considering the susceptible-infected model with only two cities, by then generalizing to arbitrary topologies. We first show how their analytical approach leads to an equivalent definition of effective distance used in Ref. [20] , and then we generalize it to more realistic propagation scenarios where all paths are taken into account. Let us consider two susceptible populations i and j and place an infected individual in i at time t i = 0. Assuming Q ij t 1 we can derive a probability density function for the infection hitting time h j to city j of the Gumbel type [18] . The first moment of this distribution is given by where γ e ≈ 0.5772 is the Euler-Mascheroni constant, and the average . . . i is taken over times starting at t i = 0. Using (5) we find where we assume the mobility rate (8) to be node independent, i.e., α i = α ∀i and δ = ln(β/α) − γ e is a dimensionless parameter. This result can easily be generalized to the SIR model and a network with an arbitrary number of nodes simply by minimizing the quantity h l k for each consecutive link (k,l) that belongs to the path ij connecting source i to target j . For the arbitrary heterogenous population and network topology with an arbitrary number of nodes by taking the minimum over all paths yields the shortest-path effective distance 012313-3 where for the SIR metapopulation dynamics we have Since each term in the sum is strictly positive, one can obtain the complete shortest-path distance matrix for each source and target pair using the Dijkstra algorithm [28] in a time , where E and V are the graph size and order, respectively. The effective distance defined in Ref. [20] can then be recovered as a special case of (14) simply by setting the scale parameter δ to unity. This fact allows for a deeper and more complete understanding of this effective distance and offers a more solid explanation to the extremely high correlation with the infection arrival time found in Ref. [20] . The most important limitation of (14) is that only one path is considered, namely, the path that in addition to minimizing the topological length also maximizes the probability associated to that. Thus in this scenario the contribution to the disease spread comes only from a single route. The effective infection arrival time D SP ij (δ)/V EF , where V EF ≈ β − μ is the linearized effective speed of the infection [18, 20] , is in fact overestimated with respect to the simulations result T AR ij [22] . To take into account the most realistic disease spread scenario one has to consider instead the multiplicity of transmission routes. For two distinct paths ij and ij connecting i with j , the authors in Ref. [19] found that a two-path distance D 2P ij satisfies the equation where is the effective distance associated to the path ij of arbitrary length, which corresponds to a Gumbel distributed hitting time. Relation (16) can then be easily generalized to an arbitrary number of multiple paths as so that we obtain Here we have defined the total probability associated to the path ij as By grouping all probabilities associated to paths of the same length in the quantity F ij (n) = | ij |=n F ij ( ij ), we can replace in (19) the sum over all paths connecting i to j with a sum over integer path lengths to get where n max is the maximum path length in the network. If we select the path ij of length n that is associated to the dominant contribution, i.e., the path that maximize its associated probability and minimizes the topological path length, one recovers the shortest-path effective distance of (14) Therefore the multiple-path distance gives a more accurate estimate of the infection arrival time, as it allows to take into account the most probable route as well as all possible alternative transmission routes. However, since the total number of paths between i and j can scale as O(V !), the measure D MP ij becomes computationally infeasible for large graphs. Both measures D SP ij and D MP ij rely on the fact that the epidemic will spread along simple paths, i.e., routes that do not cross themselves. Here we follow instead a different approach and introduce a distance that includes all possible random walks from source to target. Relaxing the assumption of directed spread is equivalent to effectively erasing the memory from the system at each time step. This is achieved by including in (19) all walks ij , which contrary to the paths ij , allow also already visited nodes. We define the random-walk effective distance by generalizing (19) as where H ij ( ij ) is the probability associated to a walk that starts in i and arrives to j . As for the probabilities F ij we can group the probabilities associated to walks of the same length into H ij (n) = | ij |=n H ij ( ij ). The latter is precisely the hitting time probability for a Markov chain defined recursively as [29] H ij (n) = k =j P ik H kj (n). Thus H ij (n) is simply the nth power of the subtransition probability matrix obtained by removing the j th row and column. Contrary to the multiple-path scenario now the walks are unbounded and so becomes n max . Furthermore since each term in the sum (23) is positive, assuming the convergence of the sum, we can rearrange it as In Fig. 3 we use the Pearson correlation coefficient R 2 for quantifying the accuracy of the different measures using São Paulo airport as the source of the infection. Each dot in the scatter plot corresponds to an airport, which is labeled infected in the simulations when the the infection density is greater than zero. The high correlation with the infection arrival time found in Ref. [20] using a shortest-path approach (light blue) is improved when considering the random-walk effective distance (violet). The points on the dashed diagonal indicate a perfect agreement between the simulation and the effective distance. The correlation distribution considering all nodes in the network as initial infected seed shows that not only the measure proposed here possesses a higher averaged correlation but it is also more peaked around it (see Fig. 4) . A remarkable interpretation of the random-walk effective distance can be found by noticing that by definition D RW ij (δ) = − ln e −δh j i , where h j is the hitting time to node j [30] . Thus, since H ij (0) = 0 for i = j , we have the correspondence Hence one obtains the cumulants of the hitting time by differentiating the random-walk effective distance with respect to δ at δ = 0. This interesting correspondence allows one to rigorously relate epidemiological quantities such as the arrival time and the speed of infection in a reaction-diffusion model to the fluctuations of the hitting time. Then one can interpret D RW ij (δ) as a generalized free energy in a statistical physics perspective [31] and providing a more profound theoretical framework than the ad hoc measure proposed in Ref. [20] . From the computational side, in order to evaluate the infinite sum in (25), we can restrict ourselves to the first nonvanishing contributions, which dominate due to the decreasing exponential in the walk length n. However, we can also solve the complete expression by rewriting (25) into a geometric series. This requires to vectorize D RW ij with respect to the arrival node j to obtain where P(j |j ) and I(j |j ) are the transition and identity submatrices obtained by deleting row and column j [32] , while p(j ) is the j th column of P with element j removed. To obtain the previous expression we have used that for δ > 0, all eigenvalues of the matrix e −δ P(j |j ) are strictly smaller than unity. For each arrival node the random-walk effective distance can then be obtained in polynomial time O(V 3.4 ) using, for instance, the Coppersmith-Winograd algorithm for matrix inversion [33] , making the problem of parallel transmission routes feasible even for large networks as the one used in our simulations. For highly heterogeneous topologies, such as the air-traffic network [24] , only a small number of paths contributes to D MP ij . Taking the limit of the dominant contribution in (23) , which corresponds to selecting the dominant path in (19) , allows one to neglect the sum over the walks (paths), and it yields as for (22) In Fig. 5 the comparison between the shortest-path and random-walk approach for the U.S. air-traffic network [13] shows that the results presented here are robust also for Erdős-Rényi networks. The correlation coefficients are respectively R 2 SP = 0.99 and R 2 RW = 1.00 for the U.S. air-traffic network and R 2 SP = 0.94 and R 2 RW = 1.00 for the randomized Erdős-Rényi network. An higher correlation and stability of the random-walk approach is also observed in the case of artificial networks, as for unweighted Barabási-Albert [9] and lattice models (see Fig. 6 ). The correlation coefficients for the latter are R 2 SP = 1.00 and R 2 RW = 1.00 for the Barabási-Albert network and R 2 SP = 0.99 and R 2 RW = 1.00 for the lattice. In summary we have presented a generalization of the concept of effective distance by overcoming the restriction of simple path propagation of a disease. The proposed random-walk effective distance includes the previously defined shortest-path measure as a particular case. The remarkable correlation found with the infection arrival time can be explained as follows. The contribution of looped trajectories in the propagation of physical information is neglected because of the decreasing exponential in the walk length. The latter serves as damping such contributions for long walks, and in particular allows us to neglect infinite loop contributions. In scenarios where multiple parallel paths are important, for instance, in Erdős-Rényi graphs or regular lattices, the assumption of a single dominant path breaks down, and the measure proposed here can be used as an efficient alternative. The predictive power of the random-walk effective distance can be used for containment strategies and estimation of arrival times for real global pandemics from the underlying networks topology. The random-walk metric can in fact be generally applied to any weighted and directed network besides the transportation ones, for instance, in the context of social interactions and rumor spreading. For unweighted locally treelike networks both the shortest-path and random-walk effective distances yield maximum correlation with the simulated arrival time, as the shortest path tends to dominate. From a theoretical point of view our results show that the average infection arrival time in a metapopulation model can be approximated by the cumulant-generating function of the hitting time for a Markov chain. In fact, the generating function approach can also be used to formally derive the latter from the first moments of the Gumbel distribution [18] . The connection with the cumulant-generating function allows for an interpretation within statistical physics. In particular this would explain how the different approaches are connected in terms of the entropy associated to paths of fixed length [31, 34] . This observation links disease spreading on complex networks with a generic diffusion process. Further developments and extensions of our results include the generalization to temporal networks by considering a set of transition matrices, one for each time step [35, 36] . Dynamical Processes on Complex Networks Evolution and Structure of the Internet: A Statistical Physics Approach Global mobility data of air traffic was provided by OAG (www.oag.com) Scale-Free Neworks Markov Chains Matrix Theory: Basic Results and Techniques Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing Social Informatics: Proceedings of the 4th International Conference We thank T. Isele for insightful discussions and technical assistance. This work was developed within the scope of the IRTG 1740/TRP 2015/50122-0 and funded by the DFG/FAPESP. A.K. and P.H. acknowledge support by Deutsche Forschungsgemeinschaft in the framework of SFB910.F.I. and A.K. contributed equally to this work.