key: cord-0570722-8qn0p2nr
authors: Jamieson-Lane, Alastair; Blasius, Bernd
title: Epidemic Arrival Times; Theory, Discussion, and Limitations
date: 2020-04-12
journal: nan
DOI: nan
sha: 2bcda36d390006af3c02fc70d34528825872df92
doc_id: 570722
cord_uid: 8qn0p2nr

The rise of the World Airline Network over the past century has lead to sharp changes in our notions of `distance' and `closeness' - both in terms of trade and travel, but also (less desirably) with respect to the spread of disease. Using flight data from the WAN, along with a drastically simplified epidemic model, we are able to predict epidemic arrival times, for arbitrary initial conditions, in a computationally efficient manner. Our framework provides some theoretical justification to the `effective distance' originally introduced by Brockmann &Helbing (2013), however we also observed that the predictive power of such heuristics is significantly lower than previously reported. Further improvements to our framework allow predictions to be made, even in parameter regimes where past methods are known to fail, and illuminate the circumstances under which this class of methods can be expected to fail.

It was said by Jules Verne in 1873 that "The world has grown smaller, since a man can now go round it ten times more quickly than a hundred years ago," that one could, travel around the world in eighty days [16] . One can't help but wonder what Verne would think of the advances in technology that have occurred in the century and a half since, which now allow the circumnavigation of the globe in less than eighty hours. The World Aviation Network (WAN) has, over the past century, shrunk the globe to a fraction of its former size, allowing for vast increases in tourism, immigration, and trade. At the same time, diseases that might once have travelled at the speed of cart, ship or train now cross the world in a number of days or hours. In 2003 Severe Acute Respiratory Syndrome (SARS) was able to spread from China to Vietnam, Hong Kong and Canada within two weeks of being reported to the WHO [18] . Similarly, the 2009 H1N1 flu pandemic first reported in Mexico was able to reach both Europe and Asia within a fortnight [19] . Now, in 2020 we see that coronavirus SARS-CoV-2 has spread quickly across the world [20] , reaching all but a handful of countries.

Modelling of disease spread and disease arrival time is critical, both to mitigate the effects of a disease, and determine where best to apply quarantine and screening protocols so as to minimize the spread of infection. In order to model the spread of disease, given our current WAN, a number of highly detailed epidemic simulation models have been created, such as GLEAMviz [3] . Depending 1 arXiv:2004.05557v1 [physics.soc-ph] 12 Apr 2020 on how they are set up, these models can include such factors as vaccination, incubation times, multiple susceptibility classes, non-linear responses, seasonal forcing, quarantine, and the stochastic movement of individual agents, providing a detailed and comprehensive modelling framework. Unfortunately, in many cases, such models quickly become black boxes in and of themselves-objects to be studied and analyzed only via costly simulation-predictions of the future that can be observed, but seldom understood. This difficulty is further exacerbated by the difficulties of determining parameter values, often a delicate task during the initial stages of a disease outbreak when information is scarce.

Depending on the questions being asked, explicit modelling of such intricate details is necessary, however, for other questions, such as the determination of epidemic arrival time, it seems that simpler methods may suffice. For this reason, a number of authors [2, 9] have produced "distance metrics" -artificial measures of distance based on flight data from the WAN. Effective distance measures predict relative arrival times for an epidemic starting in a specified location, predicting for example that an epidemic starting in Vancouver will take twice as long to reach Istanbul as it will to reach Rio de Janeiro. The absolute arrival time depends on disease parameters.

Such measures frequently rely on synthesizing a 'direct distance' between connected locations, and then determining the 'shortest path' between locations which are not connected directly. An example of such an effective distance, proposed by Brockmann & Helbing [2] , defines the direct distance between a pair of vertices as

where P ij is the probability that a particular individual who is leaving location i is travelling to location j. The total effective distance is then given as the shortest path between a given pair of nodes:

where here we minimize over all possible paths from i to j, and d a,b denotes the effective distance along each step of this shortest path (and P a,b the corresponding probability). A number of papers by Gautreau, Ianelli, et al. provide theoretical justifications to the effective distance, by first calculating the expected time for a growing epidemic to make its first 'jump', [6] , and then summing overall possible paths joining a pair of airports (in an appropriate manner) [9] . More recent work by Chen et al. [4] made use of linear spreading theory, approximating the growth of the epidemic as simple exponential growth. In doing so, they achieve the same high accuracy as Ianelli et al., but at significantly lower conceptual and computational complexity.

In what follows, we will improve upon Chen et al. linear spreading method, before going on to explore the strengths and, just as importantly, the limitations of the general class of 'effective distance' methods discussed in the previous paragraph. In section 2 we introduce the 'unbound growth arrival time' and demonstrate some of its key properties. We then show how a modified version of Brockmann & Helbing's "effective distance" can be found as an approximation of our results under appropriate circumstances. In section 3 we compare the predictions made by these methods to both simple deterministic ODE models, and large scale stochastic simulations. We then compare to arrival times as observed for H1N1 and SARS. In general, it is observed that both effective distance and unbound growth methods have reasonable predictive power, but that neither method provide the R 2 = 0.973 predictive power observed in the original paper ( [2] , Fig. 2 ), especially when compared to real world data.

In section 4 we explore possible sources of error in the unbound growth model, demonstrating both where the model is robust, and what types of changes will cause it to break down or give erroneous results. A slight tweak of the model allows us to maintain accuracy even in cases where the past approaches give poor results, at modest computational cost. The more technical discussion of efficient computational techniques, and comparison to the work of Gautreau, Iannelli, et al. [6, 9] can be found in the appendices.

Our primary goal in this paper is to demonstrate how previous results can be reached more directly via a minimal framework which is both conceptually and computationally simpler than those commonly used in the literature. Along the way, we are able to demonstrate the limitations and underlying assumption of these past models, and possible corrections and improvements -allowing us to model epidemics with more complex initial conditions.

Consider a network of N connected nodes, each node representing a particular location. In our particular case these nodes will refer to cities, or, to be more specific, particular airports in the WAN. While traditionally one might model the spread of a disease on such a network using SIS, SIR dynamics or similar [17] , we are here interested only in the initial time of arrival at a given location, and not in the various non-linear effects observed as an epidemic saturates in a population after its arrival. We therefore model only the infected population, and neglect changes in the susceptible and recovered population (the S and R components of classical SIR dynamics).

We model the spread of infection within each population as unbound exponential growth, with some rate α which accounts for both infection and recovery. We do not consider any manner of population size constraint either globally or locally. Our approach is thus similar to that used by Chen et al. [4] . In order to link our various sub-populations, we allow individuals to travel between nodes at some transition rate γF , where F is our 'flight matrix'. The off diagonal entries F i,j give the rate of travel from j to i, while the diagonal entries F i,i indicate the rate at which individuals leave location i, selected such that i F i,j = 0 for each source location j. Hence, F is mass preserving. The parameter γ is a simple scaling constant and gives the global rate of flying. It is assumed to be small relative to α, as the average infected individual does not travel on a daily basis, but is assumed to recover and/or infect others on that time scale. The expected number of infected individuals at a given location is thus governed by the equation:

where x is a vector such that x i gives the expected number of infected individuals in location i. Unlike the various non-linear models used commonly throughout the literature, equation 3 can be solved explicitly using the matrix exponential:

x(t) = e αt e γF t x(0).

With this we can calculate τ i , the unbound-growth arrival time in location i, by solving the equation;

x i (τ i ) = χ.

Here χ is some arbitrarily defined 'epidemic threshold'; for our purposes it is assumed that x is normalized such that χ = 1. For any given initial conditions, x(0), the value of τ i will depend both on the epidemic growth rate α and on our transition matrix γF . Unlike previous distance metrics [2, 9] , multiple starting points such as one might observe midway through an epidemic, are allowable (x(0) can have multiple non-zero entries).

At this stage, we part ways with Chen et al. and their work. While Chen approximate e F t using powers of F , in what follows we will instead be solving the full equation (5) numerically, making use of eigenvector decomposition so as to efficiently calculate of all τ i for arbitrary x(0). Full details of these numerical methods are provided in Appendix A. If we restrict our focus to epidemics originating from a single location j (that is to say x(0) with a single non-zero entry at j), it is possible to calculate τ i,j , the expected epidemic travel time from j to i. Eigenvector decomposition allows for the calculation of all τ i,j in O(N 3 ) time. These costs are similar to those of determining all pairwise shortest path lengths, and are largely dominated by the eigenvector decomposition, which need only be calculated once for a given network.

The unbound growth metric differs from effective distance in two major ways. The first is in the normalization of flight rates: while effective distance makes use of the flight probability matrix P , unbound growth considers the full transition matrix γF -in which flight intensity is explicitly considered. The second major differences is in how multiplicity of paths is considered: the effective distance is calculated using only the shortest path. The unbound growth arrival time is calculated from e γF t , and thus considers all possible paths through our flight network. In this section, we will illustrate the importance of these two differences through two toy examples.

The contrast between using F and P matrices is best seen by considering a simple directed chain of three nodes. We consider the case where individuals flow from node A to node B at rate 1, and individuals transition from node B to node C at some dramatically reduced rate = 10 −4 . (see figure 1 )

The lack of branching in our chain makes effective distance calculation rather straight forward; all transition probabilities are equal to either one or zero, and only a single path exists from A to C. Figure 1 : (Left) Sketch of small chain graph. Initial infection is assumed to take place in node A, the root node (blue). (Right) Flight matrix F of the small chain graph. Each column of the matrix represents the flow from the corresponding node.

Calculating x(t) explicitly for this chain we find:

and hence τ B ≈ −logγ/α and τ C ≈ −(2 log γ + log )/α. In order to test these predictions, we compare to the results of a SIR model [10] , with reaction kinetics given by:

For our simulation we consider a individual based model, in which all reactions take place according to a Poisson process, with rates ρS i I i , σI i and γF ji I i respectively. Simulation is done using τ -leaping method [8] . The time of arrival of the first infected individual at node i we denote as T i . T i is a random variable, and our intention is to predict its expected value as accurately as possible.

Comparing the observed arrival time, averaged over 500 simulations, with both the effective distance D and unbound-growth arrival time τ , we observe that normalizing away the rate of flow reduces the predictive power of our method (figure 2). That said, the discrepancy observed in the figure is the result of selecting γ; for more realistic choices of F , where the normalization constants required for each columns do not vary across many orders of magnitude, P is a good approximation. In simulations where F can be known exactly, F is preferable, however, when comparing to real life data, F is, to some extent, unknowable, depending on both the number of flights taken from a particular airport each day, but also on the 'effective population' that said airport serves. Raw flight data alone is not enough to calculate F . In contrast P is easily calculated directly from simple flight data, and gives a good approximation of F up to a constant of proportionality. We now turn our attention to considering the effects of multiplicity of paths. In order to best observe this effect consider the case of a "Spinning top" graph, as depicted in figure 3 The spinning top graph consists of k + 4 nodes; a single central node (green) with no incoming edges, our initial site of infection, a 'top' and 'bottom' node with no outgoing edges (blue), a single "up" node directly between the center node and the top (red), and k "down" nodes between the center node and the bottom (yellow). Infected individuals in the central node will travel either up or down at a rate of γ -those travelling down select amongst the k 'down' nodes uniformly at random.

While the spinning top graph is neither the simplest case, nor the most realistic one, it is simple enough to remain analytically tractable, and acts as a clear illustration of several key distinctions between τ i and D ij .

Calculating x(t) explicitly, we find

we see that the expected arrival at the top and bottom of our graph should be equal -indeed this is not surprising; because we exclude all non-linear effects the division of the 'down' compartment into k sub nodes has no effect on the behavior of the system. The behavior of the sum of all 'down' nodes is identical to the behavior of the single 'up' node. Hence we find that τ t = τ b ≈ (2 log α − 2 log γ)/α. The unbound-growth arrival time at each of the 'down' nodes is given by τ d ≈ (log k − log γ)/α. Unlike the case of effective distance, where the predicted order of arrival depends only on the geometry of the network, we see here that the order of τ t and τ d is dependent on the value of γ relative to 1/k.

: : Initial infection is assumed to take place in the center node (green). (Right) Flight matrix F of the Spinning top graph. Each column of the matrix represents the flow from the corresponding node. Infection spreads outwards from the initially infected "center node", both up and down towards the peripheries.

Comparison of D and τ to simulations (see figure 4 ) demonstrates that effective distance distinguishes between the 'top' and 'bottom' node of the graph, while unbound growth correctly handles multiplicity of paths, and assigns the pair identical arrival times. We also see that τ i is also responsive to changes in γ, and correctly predicts the change in expected order of arrival when γ is changed from 10 −2 to 10 −5 .

Before moving on to the next section, it is worth emphasizing that both toy examples considered here have been explicitly crafted so as to highlight differences between our methods.

The simple variable flow chain demonstrates that normalization of transport rates does lead to some loss of accuracy, but also that this loss of accuracy is likely to be negligible for realistic transport networks. The spinning top graph is constructed to demonstrate the importance of multiplicity of paths. In this case however, even in the minimal case k = 2, the effects of multiple paths are noticeable (see Fig. 13 , Appendix C).

While in our examples we have considered a particularly extreme case where multiplicity of paths causes the effective distance metric to break down, when dealing with more realistic graphs (such as the WAN), we observe that in many cases a single 'most likely' path does dominate the traffic for any given pair of nodes. Suppose such a dominant path does exist, and that this path consists of s steps. By approximating the matrix exponential by the dominant term in its fig. 3 . In all cases, we assume α = ρ = 0.5, σ = 0, k = 250, and γ as specified on our scatter plot. The initial susceptible population in each node is 80 million. Points are color coded so as to match the graph in figure  3 , that is to say, blue are top and bottom nodes, red represents our 'up' node, yellow our 'down' nodes, and green the initially infected 'center' node. power series, and then taking logarithms, we find

Ignoring small terms and decomposing log(F s ij ) we find

where here F a,b represent transition rates along the most probable path. Hence, in the case of transmissions with a single dominant path, the unbound-growth arrival time we describe here implies a modified "effective distance" as a particular approximation. The logarithm first discussed by Brockmann & Helbing while introducing the effective distance can be seen to be a direct consequence of inverting exponential growth. The constant +1 is replaced by a − log γ, more accurately reflecting where this constant term comes from, and how to adapt our distance measure in response to variable transport intensity. This modification is in line with the analytic work of previous authors [7] . Equation 14 can be thought of as a halfway point between the effective distance and the unbound growth arrival time -accounting correctly for rate of movement, but ignoring the issue of multiplicity of paths. The interested reader can find a comparison of this metric to various simulation results in appendix C.

3 Comparison to simulations, past metrics, and real world data

As always, the purpose of mathematical modelling is to simplify our description of the universe as far as possible, but no further. In order to evaluate the validity of the rather drastic simplifications made thus far, we compare our predicted arrival time to both simulations and real world data. Let us start by considering the most straight forward and direct comparison; namely, how does the unbound growth metric compare to deterministic SIR dynamics on some arbitrary artificial network. We construct a scale free network [12] , and then simulate simple SIR dynamics (9) as a system of ODE's. As might be hoped, we observe near perfect correlation for the the unbound growth arrival time metric (see figure 5 ). This is to be expected, as the system we compare to is perfectly deterministic, and a close approximation to the equation that τ i is defined by. Figure 5 : Comparison of various distance metrics to the deterministic time at which I k (t) = 1. Each point represents a single node, with its y position determined by the predictions of a full SIR model, and the x coordinate given by one of three different prediction metrics. Color is determined by the 'number of steps' from our initial infection location. Here we consider a scale free network of 1500 nodes constructed using Mathew George's Scale-Free Network Generation tool [12] . Our epidemic parameters are ρ = 0.5, σ = 0.25 and γ = 0.0001. Qualitatively similar results are obtained using other network structures. Different choices of Parameter values will either increase or reduce the R 2 value associated with the Effective Distance.

We next compare to a significantly more intricate model-namely the stochastic global epidemic model GLEAMviz [3] . We make use of a standard epidemic model previously used to study H1N1 [1] , and consider a simple epidemic starting in Rio de Janeiro ( figure 6 ). We find that use of the matrix exponential rather than only the shortest path gives reasonable improvements, but also that both methods are still limited in their predictive power when faced with a truly stochastic model. We also note that these predictions were not made using the same underlying airline network as GLEAMviz uses internally; it is likely that the GLEAMviz team would find a tighter fit when making use of a adjacency network that perfectly matched their simulation. Finally, while comparison to simulations provides an effective test bed, useful in terms of repeatability, we are, generally speaking not interested in the results of simulation, but instead in the real world. We follow in the footsteps of Brockmann & Helbing [2] , and compare observed and predicted arrival time for both SARS and H1N1 (see figure 7) . We find that the unbound growth method gives results largely similar to the effective distance metric, with a slight improvement in R 2 value, given suitable α, γ. Unfortunately, neither method is able to reproduce the very high R 2 values observed in the original paper. It is unclear if the observed discrepancies is due differences in implementation of the algorithm, differences in the underlying data, or some other cause. While it is possible that changes in the global flight network are to blame for these discrepancies, this explanation seems unlikely -increasing d i,j by one requires a 2.72 fold change in P i,j . Discrepancies greater than 5 require that P i,j change by more than two orders of magnitude between different flight networks. Such perturbations do not appear plausible.

Overall, it would appear that, when faced with the inherent stochasticity of the real world, both the unbound-growth arrival time, and all methods based on similar underling principles face significant difficulty. While we may effectively calculate the expected arrival time, the underlying variance in the system, and the exact details of the epidemic may lead to significant noise.

So far we have tested our unbound growth model in what may be considered close to optimal conditions; we have assumed that F ,α and γ are all well known, that γ 1 and α ≈ 1 and that the susceptible population of each node is large enough that epidemic saturation is irrelevant. Each of the above assumptions may be violated in one context or another, and for this reason it is critical to understand how far each can be stretched before necessitating the use of a more complex model. Such understanding sheds light not only on the unboundgrowth arrival time itself, but also on the effective distance metric, which we have shown to rely on a similar theoretical foundation (see equation 14) The first, and most likely assumption to be violated in practice is the notion that we know α and γ. While γ can be determined to some reasonable level of accuracy via commercial flight data, α is a number that will vary from illness to illness and may be known only to a limited degree of accuracy, particularly in the early stages of a pandemic. Fortunately the exponential model turns out to be robust against variations in both α and γ -even when varying these parameters over an order of magnitude compared to the 'true' values used in simulations, predicted arrival times remain highly correlated with those observed in simulation (see figure 8 ). When using incorrect α the constant of proportionality between τ i and observed arrival times is no longer equal to one, suggesting that the unbound growth model robustly estimates ατ i , such that the relative time of arrival at any two locations is largely independent of α and γ, but that incorrect estimates of α will lead to corresponding inaccuracy in the absolute values of τ i ; if α is estimated a factor of two high, then τ i will be a factor of two low. Errors in estimating γ have an even smaller impact -of the order log(γ true /γ est )/α-as is implied by eq. 14.

Another systematic source of error is inaccuracies in our flight network data. These can be produced due to uncounted or unregistered flights (or other forms of transportation), out of date data, or as a direct result of changes to individual behavior during an epidemic (for example, quarantine measures, or panic amongst the populous). In order to test robustness to noise in our flight data, we compare the results of a full deterministic SIR model on some flight network F , to arrival times predicted using the perturbed matrixF , whereF ij = F ij e ξN (0,1) . We consider perturbations to both random and scale free networks (figure 9).

We also consider how the unbound growth model might behave in cases where α is very small-such as when dealing with diseases with slower spread and long incubation time such as HIV. In such a case, the time until the expected number of individuals at a location reaches one is generally large. Consider for example the extreme case α = 0: we can easily imagine a single "infected" individual travelling throughout our network, its probability density slowly diffusing to steady state, and the expected number of infections never reaching x i (t) = 1 for any location. Despite this, each location will certainly have a time at which our single infected individual firsts arrives, and hence τ i = ∞ proves to be a Figure 2D ; we find that numerous details of the general shape are reconstructed (for example the four uppermost points have roughly the same geometry), but that the figure as presented in [2] is slanted significantly closer to a straight line. (Top Right) For suitably chosen α and γ values, the unbound growth method gives slightly improved results, but is still limited by the accuracy of the underlying network, and the inherently noisy details of real world epidemiology. (Bottom) We compare effective distance and the unbound growth metric, with the SARS data. We were not able to reproduce the extremely tight correlation previously reported [2] . Once again, use of the unbound growth metric yields modest improvements in R 2 compared to our reconstruction the effective distance metric. The three filled nodes in these images corresponding to (from top to bottom) South Korea, Hong Kong, and the United States. As might be expected for distance measure based on flight data, both Korea and Hong Kong are 'close' to China according to both of metrics, while the USA is further away. This would appear to be in contrast to previously published results [2] , which appear to indicate that the United States is closest to China, while Korea is rather distant. These inconsistencies will be resolved with the original authors, after the present Pandemic situation has passed. Figure 8 : Calculating the unbound growth arrival time using incorrect growth and transport parameters leads to suboptimal results. Here, we consider the case of using α and γ which are either an order of magnitude too high or to low compared to their true values (α = 0.25, γ = 10 −4 ). Data is coloured according to the minimum number of steps required to reach the associated node from our starting location. As can be seen, while errors change the magnitude of our distance (note varying axis scales), the effect on relative distances is limited in eight of the nine cases considered, despite the significant error in parameter values. Figure 9 : (Top) Comparing the results of a full SIR simulation with the predictions made using a perturbed version of the original random network, we are still able to accurately predict 'arrival time', indicating that the method is robust even when link weights are increased or decreased by an order of magnitude. As previously, nodes are colour coded based on the number of steps to the initial infection site. Note that here, after noise is applied to calculateF , we demand symmetry, and recalculate diagonal entries so as to preserve mass. (Bottom) As above, but here we consider perturbations to a random scale free network. The structure of the network leads to a sharper separation based on "number of steps", but overall exhibits a roughly similar response to to noise; namely significant noise in F is required to damage the predictive power of the unbound-growth metric. Note that here we are comparing to a deterministic SIR model; this is done so as to highlight inaccuracies caused by changes to F , while ignoring the more general effects of noise. rather poor estimate. This breakdown of τ i can be seen in figure 10 .

At this juncture, it becomes obvious that the unbound-growth arrival time proposed in equation 5 gives poor results when α is sufficiently small. We thus seek modifications to render the method more robust. So far, our approach has relied upon using x(t), a vector containing expected infected population in each location, in order to estimate T i , the time at which the first infection is imported. More accurate results can be obtained by instead calculating the expected number of imported infections. Define y(t), to be the vector valued function for the expected number of imported infections at a given location prior to time t. The rate of infection importation is given bẏ

whereF is equal to the flight matrix F , except that all diagonal elements are replaced by zero. We thus see that y(t) counts the number of arrivals in each location but not departures nor infections. The system of equations formed by equations 3 and 15 can be re-written as a single matrix differential equation:

This equation can be solved for any given time using the matrix exponential, which can be calculated efficiently so long as the eigen-decomposition is known. If we make the (somewhat dubious) assumption that the actual number of arrivals is a Poisson distributed random variable, then the probability of zero arrivals by a given time is given by e −y(t) . The expected time for the first arrival can be found using:

This formula gives more robust results than eq. 5, all be it at a higher computational cost. Here we have made use of similar techniques to a number of past papers in the literature [6, 9] , all be it in a different order. Theoretical discussion comparing equation 17 to these papers, along with a detailed explanation of efficient calculation of τ + i is provided in Appendix B. Regardless of the algorithm used, calculation of y i (t) and the corresponding integral comes at non-trivial computationally cost. In cases where α γ the more detailed method gives negligible improvements compared to equation (5) , and the original unbound growth arrival time based only on x is recommended. In cases where α is small and τ i gives poor predictions, τ + i can be used to obtain more robust results (see figure 10) .

Finally, while we have demonstrated the effectiveness of the exponential growth model when applied on the scale of the global aviation network, effective distance measures have also been considered on a much smaller scale in order to study the spread of antibiotic resistance between wards within a hospital [5] . In this case, assuming unbound growth becomes problematic, as the susceptible population in each node is rather small; a single ward might contain (for example) 30 beds, a far cry from the tens or hundreds of thousands found when each node represents an entire city. 16 ,17) . In all cases, we consider a random network of 600 nodes, arrival times are averaged over 500 simulations. In each case we provide both the R 2 value, and the computational time required to find the given metric. We observe that for α γ (top row) all three metrics give reasonable results, with the two unbound-growth based methods giving a noticeable increase in accuracy. However, for α ≈ γ (bottom row), the simple unbound growth method based on equation 5 breaks down, and equations 16 and 17 are required for accurate prediction. where the susceptible population in each node has mean 8.5. We assume infection rate ρ = 2.5, and recovery rate σ = 0. Note that for small populations, σ > 0 generally leads to extinction, particularly when γ < σ.

In such cases, where the local susceptible population is very small, we may encounter situations where our local epidemic is no longer in exponential growth phase by the time it spreads to a neighbouring node. Small population sizes also increase the importance of stochastic effects. This may, in certain cases lead to our simplified model being inaccurate (see figure 11 ).

In practice, for the SIR type models the parameter window where this concern is relevant is relatively narrow; infections which grow quickly enough to violate our assumptions are liable to burn through the entire local population and drive themselves to extinction before spreading. For SIS and SI type models, or any infection which is expected to reach a local endemic equilibrium, it is important to determine whether exponential growth is a good approximation before making use of the unbound-growth arrival time, or any effective distance metric that relies upon the same underlying assumptions.

Detailed, multi-compartment epidemic models allow us to answer a wide range of questions, at the cost of additional complexity. In cases where we have a particular question in mind, simpler models will suffice, and knowing which details to include and which to ignore provides not only convenience, but also insight into the driving factors behind the phenomena we wish to study.

In studying epidemic arrival time in the world aviation network, it turns out that we need only two ingredients: transport and exponential growth. As indicated by Chen et al. we can ignore the more complex non-linear dynamics of saturation and recovery [4] , as might be encountered in an SIR model, and need only consider a simple "single compartment" epidemic model; one that is ana-lytically tractable yet nonetheless predicts the expected arrival times observed by more complicated models, both deterministic and stochastic.

In using this simple unbound-growth model, we are able to explicitly state the assumptions underlying previously described effective distance measures: namely that the wave front is perpetually in its "exponential growth" phase during which non-linear effects are irrelevant, and thus arrival time is governed entirely by system dynamics in a naive population. In the language of both ecology and physics, we are dealing with a "pulled wave" phenomena [11, 15] , or alternatively "linear spreading theory" [4] . While here we have discussed these phenomena purely in the context of epidemic spread through the global flight network, it seems likely that a similar framework might apply when studying any suitable spreading phenomena in networks -for example the spreading of rumours through online communities [13] , or the spread of invasive species between distant ports [14] .

In identifying the assumptions implicitly underlying past methods, and considering an explicit model, we are able to demonstrate both the robustness and limitations of the proposed model, and provide alternative methods that remain effective in parameter regime where the simple "unbound-growth" model is demonstrated to give poor results.

Unfortunately, our attempts to reproduce past results appear to indicate that in practice, the predictive powers of effective distance metrics may be significantly lower than previously reported [2] -a disappointing, though not entirely unexpected result given the complexity and contingent nature of global epidemic spread. Knowledge of the WAN allows foresight, but can only be relied on for limited predictive power; when comparing to 'best practice' simulations we find that such simple methods such as those described here predict roughly 60% of variance, when compared to real world date, this predictive power reduces to 50% in the cases of H1N1, and a mere 30% in the case of SARS. Noise and chance are, as always, powerful forces within our world.

A Computationally efficient methods for solving equation (5) In order to determine the arrival time, τ i , it is necessary to solve

here, χ is an arbitrarily choosen thresholed, assumed to be normalized to 1 and v i t is an indicator vector of the form [0, 0...1, 0...0] such that all entries are zero, save for the i th , which is one.

Generally speaking, this equation can not be solved exactly, and must instead be solved using numerical methods. Because e αt varies far faster than v i t e γF t x(0), we can form a simple iteration scheme by asserting

Because e γF τ (n)

i ≈ e γF τ (n+1) i this iteration is almost equivalent to solving the exact equation, and converges quickly. Computing the matrix exponential however is a costly operation, one we would prefer to streamline wherever possible. To this end, it proves useful to first decompose of F using its eigenvectors: F = QDQ −1 . Here Q is a matrix such that each column contains an eigenvector of F , and D is a diagonal matrix containing the eigenvalues of F in corresponding order. We can then compute:

Because D is a diagonal matrix e γDτ (n) i can be computed using an elementwise exponentiation of only the diagonal terms, as opposed to full matrix exponentiation. All matrix-vector multiplications can be replaced by suitable vectorvector multiplications. (Q −1 x(0)) need only be computed once, and (v i t Q) is equivalent to examining the i th row of Q. Eigenvalue decomposition, the most costly operation involved in this procedure need only be computed once ever for a given F , and need not be repeated when varying α, γ, i or x(0).

Suitable matlab code for the above iteration scheme is given by: B Computationally efficient methods for solving equation (17) , and a comparison to Gautreau and Iannelli

In order to determine the improved unbound growth arrival time, τ + i , we must first have an efficient and reliable way of calculating y(t), and then use this method to approximate e −yi(t) dt as accurately as possible.

Calculation of y(t) can be completed using a similar approach to our calculation of x(t) -namely, we determine the eigendecomposition of the relevant 2N ×2N matrix (see equation 16) , and then use this eigendecomposition to take matrix exponentials using only vector-vector multiplication rather than matrixmatrix multiplication. We can avoid solving the 2N × 2N eigendecomposition problem by noting:

where D is a diagonal matrix containing the eigenvalues of F , Q is a full rank matrix containing the corresponding eigenvectors, and Q = (D + α/γ) −1F Q.

The expanded system has the same eigenvalues as the original system, along with N additional zeros. We find y(t) = Q exp(αt) exp(γDt)Q −1 x(0). Now that we have an method for calculating y(t) we would like to calculate e −yi(t) dt as efficiently as possible. This process is made significantly easier by first noting that for each i, as y i (t) passes through 1, it can be well approximated by a simple exponential growth function, that is to say, there exists some β i , µ i such that y i (t) ≈ exp[−(t − µ i )/β i ] in the vicinity of µ i . Because P (t < T i ) ≈ 1 when y i 1, and P (t < T i ) ≈ 0 when y i 1, we can safely use this exponential approximation of y with negligible impact on e −yi(t) dt. In making this approximation we effectively assert that T i is well approximated by a Gumbel distribution: P (t < T ) ≈ e − exp((t−µi)/βi) . The Gumbel distribution has expected value µ i + 0.5772β i . Hence, τ + i can be approximated using knowledge only of the time and slope with which y i crosses 1. The following code implements these calculations in an efficient manner. %Set up initial value, and matrix for finding slope. SlopeFinder= (F-diag(diag(F)))*[V,zeros(NumCities,NumCities)]; InfectedLevel=zeros(2*NumCities,1); InfectedLevel(1)=1; initV= Vbeta\InfectedLevel; %Variables will be stored here. GumbelMu=-ones(NumCities,1); GumbelBeta=-ones(NumCities,1);; %We will consider a number of times as starting points for finding Mu. t=0; dt=0.5*(log(Threshold)-log(gam))/(alpha-recoveryRate); order of operations we simplify our work conceptually, and reduce computational time significantly. The resulting prediction however must be considered, at best, a heuristic, and unlike Gautreau, Iannelli et al. we make no claims of mathematical rigor.

In practice, using the unbound-growth metric, we observe results similar to those found using Iannelli at al's methods (see figure 12 ). In addition, the improved unbound growth metric presented here appears to be more robust, in that it gives accurate results even in the case α/γ ≈ 1, where both Iannelli's technique and the basic-unbound growth method suffer difficulties. Figure 12 : For a random network containing 4000 nodes, we compare the average arrival time (averaged over 350 simulations), to time predicted using effective distance [2] , Ianelli's 'log inverse' formula (equation 25), and the improved unbound growth arrival time. In all cases, R 2 values, and calculation times are provided-calculation times given in seconds. Nodes are colour coded based on their 'number of steps' from the epidemic origin. As can be seen, Ianelli's technique and the improved unbound growth estimates give very similar predictions, however, the later is observed to be roughly 80 times faster.

Here we provide a number of additional figures, that may be of interest to some readers, but are not of enough importance to place in the main text. Figure 13 : A repetition of our Spinning top graph experiment, with k = 2. Here we plot mean arrival time (averaged over 150 simulations) vs either effective distance, D, or unbound growth arrival time, τ i , for the Spinning top graph depicted in fig. 3 . In all cases, we assume α = ρ = 0.5, σ = 0, k =, and γ as specified on our scatter plot. Figure 14 : Here we compare the predictions of effective distance and the unbound-growth arrival time to those observed for a deterministic SIR modelin a manner similar to figure 5 of the main text. In contrast to the scale free network used for that figure, here we consider a random network, where, for each node we select five other nodes uniformly at random to link it to, and then impose symmetry by making all edges bidirectional. All nodes have an initial susceptible population of S = 10 9 . We use the parameter values ρ = α = 0.5 and σ = 0. For the two top figures we select γ = 10 − 5, and for the lower two we select γ = 10 − 8. We observe that the performance of the effective distance is sensitive to γ, and that in this deterministic case the unbound growth arrival time gives near perfect prediction. Here we compare to the deterministic SIR simulations on a random network, as shown in figure 14 . The Modified effective distance gives strong predictions. The 'line separation' is generated as a result of the ignored −s log t. (Middle) Comparison to results of our GLEAMviz simulation (as described in figure 6 ). As can be seen, the use of − log γ and F as opposed to 1 and P significantly improves the accuracy of the method, however the existence of multiple paths still results in the unbound growth model giving slightly better results. (Bottom) Here we compare the modified effective distance to real world data for H1N1 and SARS. Surprisingly, in this case, results are worse than those obtained by Brockmann & Helbing's original effective distance.

Human Mobility Networks, Travel Restrictions, and the Global Spread of 2009 H1N1 Pandemic

The Hidden Geometry of Complex, Network-Driven Contagion Phenomena

The GLEaMviz computational tool, a publicly available software to explore realistic epidemic spreading scenarios at the global scale

Estimating epidemic arrival times using linear spreading theory

Network analysis as a tool to improve infection prevention & outbreak management

Arrival Time Statistics in Global Disease Spread

Global disease spread: statistics and estimation of arrival times

Approximate accelerated stochastic simulation of chemically reacting systems

Philipp Hoevel, and Igor M. Sokolov. Effective Distances for Epidemics Spreading on Complex Networks

A contribution to the mathematical theory of epidemics

Finding the sweet spot for invasion theory

B-A Scale-Free Network Generation and Visualization

Statistical mechanics of rumour spreading in network communities

The risk of marine bioinvasion caused by global shipping

Front propagation into unstable states

Around the World in Eighty Days. Porter & Coates, 1873. Google-Books-ID: sd06RZq4mN4C

Spatial epidemiology of networked metapopulation: An overview. bioRxiv

WHO | Situation updates -Avian influenza

WHO. Novel Coronavirus (2019-nCoV) situation reports

while(min(GumbelMu)==- 1) t=t+dt; %Determine y. Select y which are close to y=1. y=Vbeta(((NumCities+1):end),:)*(exp(Dbeta*t).*initV); select= (GumbelMu==-1)&(y>=0.3);%Find slope for selected y. Use this slope to find mu more effectively. GumbelBeta(select)= (SlopeFinder(select,:)*(exp(Dbeta*t).*initV))./y(select); GumbelMu(select)= t-log(wob(select))./GumbelBeta(select); end tauPlus=GumbelMu-0.5772./GumbelBeta; %Use mu, and slope to find tau. tauPlus(1)=0;

Our assumptions and methodology here have made use of many of the insights of Gautreau et al [7] and Iannelli et al [9] , all be it in a different order, and while such technical discussion would be out of place in the main body of this paper, we would be remiss not to highlight these past works her.Gautreau et al. demonstrated in their 2008 paper that the time taken for an infection to pass from one city to a neighbouring city was distributed according to a Gumbel distribution: P (t < T ) = e − exp((t−µ)/β) . This is most easily interpreted as an outer exponential resulting from the underlying Poisson like process of disease spread, and the inner exponential as the result of the ever increasing force of infection. Gautreau et al.'s approach can be used to formulate a modified 'direct distance; based on expected value of the Gumbel distribution, and these distances can be summed along shortest paths in order to estimate arrival times. Gautreau et al. thus arrive at results rather similar to our equation 14. In cases where there are multiple paths connect nodes i and j, the multiplicity of paths causes shortest path methods to overestimate arrival time. Gautreau et al provide a formula for predicting the expected arrival time given two parallel paths:Iannelli et al. were subsequently able to provide a formula for combining all paths joining a pair of nodes:Here P ¬j,¬j represents the probability sub matrix with the j th row and column excluded, P ¬j,j represents the j th column of P with the j th row excluded, and C = log(α/γ) − 0.57721. This formula is given as formula 28 in Iannelli et al.'s paper [9] . While very effective, this formula is also rather expensive computationally, requiring j separate large matrix divisions. Overall, the methods presented in this paper can be seen as a simple reversal of the operations used by Gautreau, Iannelli et al. Rather than calculate step times, and then combine the effects of multiple paths, we instead use the matrix exponential to consider all possible paths, and integrate over our Gumbel-like probability distribution only as our final step. In swapping the