key: cord-0260553-p79hblhi
authors: Luo, Wuqiong; Tay, Wee Peng; Leng, Mei
title: Identifying Infection Sources and Regions in Large Networks
date: 2012-04-02
journal: nan
DOI: 10.1109/tsp.2013.2256902
sha: 03fe33ffc8d87e408f2e8411a624381efd8b66db
doc_id: 260553
cord_uid: p79hblhi

Identifying the infection sources in a network, including the index cases that introduce a contagious disease into a population network, the servers that inject a computer virus into a computer network, or the individuals who started a rumor in a social network, plays a critical role in limiting the damage caused by the infection through timely quarantine of the sources. We consider the problem of estimating the infection sources and the infection regions (subsets of nodes infected by each source) in a network, based only on knowledge of which nodes are infected and their connections, and when the number of sources is unknown a priori. We derive estimators for the infection sources and their infection regions based on approximations of the infection sequences count. We prove that if there are at most two infection sources in a geometric tree, our estimator identifies the true source or sources with probability going to one as the number of infected nodes increases. When there are more than two infection sources, and when the maximum possible number of infection sources is known, we propose an algorithm with quadratic complexity to estimate the actual number and identities of the infection sources. Simulations on various kinds of networks, including tree networks, small-world networks and real world power grid networks, and tests on two real data sets are provided to verify the performance of our estimators.

With rapid urbanization and advancements in transportation technologies, the world has become more interconnected. A contagious disease like Severe Acute Respiratory Syndrome (SARS) can spread quickly through a population and lead to an epidemic [1] . It is crucial to quickly identify the index cases of a contagious disease since it allows us to study the causes, and hence facilitate the search for antiviral drugs and efficacious therapies.

Moreover, by inferring the the set of individuals infected by each source, potential containment policies can be formulated to prevent further spreading of the disease due to new index cases [2] , [3] . In a similar vein, a computer virus on a few servers of a computer network can quickly spread to other servers or computers in the network. any of its neighbors at each time step. This simple infection model is based on the classical susceptible-infected (SI) model [15] , which has been widely used in modeling viral epidemics [16] - [21] . An algorithm for evaluating the single source estimator was proposed in [14] , and it was shown to have complexity 1 O(n) for tree networks, where n is the total number of infected nodes. Furthermore, it was shown that this estimator performs well in a very general class of tree networks known as the geometric trees (cf. Section III-D), and identifies the infection source with probability going to one as n increases.

In many applications, there may be more than one infection source in the network. For example, an infectious disease may be brought into a country through multiple individuals. Multiple individuals may collude in spreading a rumor or malicious piece of information in a social network. In this paper, we investigate the case where there may be multiple infection sources, and when the number of infection sources is unknown a priori. We also consider the problem of estimating the infection region of each source, and show that a direct application of the algorithm in [14] performs significantly worse than our proposed algorithms if there are more than one infection sources.

We also note that [14] provides theoretical performance measures for several classes of tree networks, which we are unable to do here except for the class of geometric trees, because of the greater complexity of our proposed algorithms. Instead, we provide simulation results to verify the performance of our algorithms.

A related problem is the detection and localization of diffusive sources using wireless sensor networks [22] - [27] .

The diffusion models used under this framework are based on spatio-temporal diffusion models [22] or state-space models with linear dynamics [23] , where information like the physical positions of sensors are known. There is no natural translation of the source detection and localization problem in a sensor network to other networks like a computer network, without performing discretization and introducing a combinatorial aspect to the problem, as is done in [28] and [29] . Similarly, inference of viral epidemic processes in populations has been studied in [10] , [12] , [15] , where various features related to the propagation of a viral epidemic, such as the rates of infection and the length of latency periods are investigated. These works' focus is on specific viral infection processes with assumptions that do not naturally hold for infection processes in other networks. Moreover, there is little work on determining the sources or index cases of a disease.

On the other hand, the infection source estimation algorithms we consider in this paper can be useful in applications like pollution source localization, where we are limited to inexpensive sensors capable only of detecting the presence or absence of a pollutant, and the identities of its neighbors. In this case, spatio-temporal diffusion models are not applicable as we only have knowledge of which nodes are "infected". The algorithms we study in this paper are also applicable to inferring infection sources in viral epidemics, when little information about the epidemic propagation characteristics is available.

In this paper, we consider the estimation of multiple infection sources when the number of infection sources is unknown a priori. We adopt the same SI diffusion model as in [14] , as this has been widely used to model various infection spreading processes [16] - [21] . The results of this work are applicable to scenarios where the infection spreads in an approximately homogeneous way, with infections happening independently. Examples include the spreading of a new disease in a human population, where nobody has yet developed any immunity to the disease.

A novel computer virus attacking a network can also be modeled using a homogeneous spreading process. On the other hand, our model is highly simplistic and does not model many other spreading processes of practical interest.

However, as alluded to earlier, finding the infection sources in this simple model is already very challenging.

The focus of our work is not on modeling infection processes. Rather, by restricting our analysis to the simplest homogeneous exponential spreading model, we hope to gain insights into identifying multiple infection sources in real networks. We show that unlike the single source estimation problem, the multiple source estimation problem is much more complex and cannot be solved exactly even for regular trees. Our main contributions are the following. (iv) For general graphs, when there are at most k max infection sources, we provide an estimation procedure for the infection sources and infection regions. Simulations suggest that on average, our estimators are within a few hops of the true infection sources in the infection graph. 2 (v) We test our estimators on real data in Section V-C. The first test is based on real contact tracing data of a patient cluster during the SARS outbreak in Singapore in 2003. Our estimator correctly identifies the number of index cases for the cluster to be one and successfully finds this index case. The second test considers the Arizona-Southern California cascading power outages in 2011. Our estimator correctly identifies the number of outage sources for the main affected power network to be two, and the distance between our estimators and the real sources are within 1 hop. These tests suggest that our estimator has reasonable performance in 2 In general, we do not know the whole underlying network, but rather the subgraph of infected nodes. For example, in the case of a contagious disease spreading in a population, we only perform contact tracing on the patients to construct the connections among them. From our simulation studies, the infection graph typically has an average diameter of more than 27 hops even though the underlying network's diameter is much smaller. some applications even though we have adopted a simplistic infection model.

The rest of the paper is organized as follows. In Section II, we present the system model and problem formulation.

In Section III, we derive estimators for infection sources and regions for tree networks, and present algorithms to evaluate them. We also show asymptotic results for geometric tree networks. We discuss estimation algorithms for general graphs in Section IV. In Section V, we present simulations and tests on real data to verify the performance of our proposed estimators. Finally we conclude and summarize in Section VI.

In this section, we describe our model and assumptions, introduce some notations, and present some preliminary results. Consider an undirected graph G = (V, E), where V is the set of nodes and E is the set of edges. If there is an edge connecting two nodes, we say that they are neighbors. The neighborhood N G (v) of a node v is the set of all neighbors of v in G. The length of the shortest path between u and v is denoted as d(u, v). In a computer network, the graph G models the interconnections between computers in the network. In the example of a population or a social network, V is the set of individuals, while an edge in E represents a relationship between two individuals. We define an infection to be a property that a node in G possesses, and can be transmitted to a neighboring node. When a node has an infection, we say that it is infected. The neighbors of an infected node is said to be susceptible. We assume the susceptible-infected model [15] , where once a node has been infected, it will not lose its infection. We adopt the same infection spreading process as in [14] , where the time taken for an infected node to infect a susceptible neighbor is exponentially distributed with rate 1. All infections are independent of each other. Therefore, if a susceptible node has more than one infected neighbors and subsequently becomes infected, its infection is transmitted by one of its infected neighbors, chosen uniformly at random. For mathematical convenience, we also assume that G is large so that boundary effects can be ignored in our analysis.

Suppose that at time 0, there are k ≥ 1 nodes in the infected node set S * = {s 1 , . . . , s k } ⊂ V . These are the infection sources from which all other nodes get infected. Suppose that after the infection process has run for some time, and n nodes are observed to be infected. Typically, n is much larger than k. These nodes form an infection

i be a partition of the infected nodes V n so that A n,i ∩ A n,j = ∅ for i = j, with each partition A n,i being connected in G n , and consisting of the nodes whose infection can be traced back to the source node s i . The set A n,i is called the infection region of s i , and we say that A * n is the infection partition. Given G n , our objective is to infer the sources of infection S * and to estimate A * n . In addition, if we do not have prior knowledge of the number of infection sources k, we also aim to infer the number of infection sources. Without loss of generality, we assume that G n is connected, otherwise the same estimation procedure can be performed on each of the components of the graph. We also assume that there are at most k max infection sources, i.e., the number of infection sources k ≤ k max . From a practical point of view, if two infection sources are close to each other, we can ignore either one of them and treat the infection as spreading from a single source. Therefore, we are interested in cases where the infection sources are separated by a minimum distance. These assumptions are summarized in the following. Assumption 1. The number of infection sources is at most k max , and the infection graph G n is connected.

For all s i , s j ∈ S * , the length of the shortest path between them d(s i , s j ) ≥ τ , where τ is a constant greater than 1.

Assumption 3. Every node in G has bounded degree, with d * being the maximum node degree.

Suppose that our priors for S * and A * n are uniform over all possible realizations, and let P be the probability measure of the infection process. We seek S and A n that maximize the posterior probability of S * and A * n given G n ,

where P (G n | S) is the probability of observing G n if S is the set of infection sources, and P (A n | S, G n ) is the probability that A * n = A n conditioned on S being the infection source set and the infection graph being G n . For any source set S, let an infection sequence σ = (σ 1 , . . . , σ n−k ) be a sequence of the nodes in G n , excluding the the k source nodes in S, arranged in ascending order of their infection times (note that with probability one, no two infection times are the same). For any sequence to be an infection sequence, a necessary and sufficient condition is that any infected node σ i , i = 1, . . . , n − k, has a neighbor in S ∪ {σ 1 , . . . , σ i−1 }. We call this the infection sequence property. An example is shown in Figure 1 . Let Ω(G n , S) be the set of infection sequences for an infection graph G n and source set S, and let C(S | G n ) = |Ω(G n , S)| be the number of infection sequences.

We have

where P (σ | S) is the probability of obtaining the infection sequence σ conditioned on S being the infection sources.

Evaluating the expression (2) and maximizing (1) for a general G n is a computationally hard problem as it involves combinatorial quantities. As shown in [14] , if G is a regular tree and |S| = 1, P (G n | S) is proportional to |Ω(G n , S)|, which is equivalent to the number of linear extensions of a poset. It is known that evaluating the linear extensions count is a #P-complete problem [30] . As such, we will make a series of approximations to simplify the problem, and present numerical results in Section V to verify our algorithms. The first approximation we make is to evaluate the estimatorsŜ

A n (Ŝ) = arg max instead of the exact maximum a posteriori (MAP) estimators for (1) . Even with this approximation, the optimal estimators are difficult to compute exactly, and may not be unique in general. Therefore, our goal is to design algorithms that are approximately optimal but computationally efficient. In Section III, we make further approximations and design algorithms to evaluate the estimatorsŜ andÂ n (Ŝ) when G is a tree. In Section IV, we consider the case when G is a general graph. For the reader's convenience, we summarize some notations commonly used in this paper in Table I . Several notations have been introduced previously, while we formally define the remaining ones in the sequel where they first appear.

In this section, we consider the problem of estimating the infection sources and regions when the underlying network G is a tree. We first derive an estimator for the infection partition in (4), given any source node set S and G n . Then, we derive an estimator based on the number of infection sequences. Next, we consider the case where there are two infection sources, propose approximations that allow us to compute the estimator with reasonable complexity, and show that our proposed estimator works well in an asymptotically large geometric tree under some simplifying assumptions. In most practical applications, the number of infection sources is not known a priori.

We present a heuristic algorithm for general trees to estimate the infection sources when the number of infection sources is unknown, but bounded by k max .

In this section, we derive an approximate infection partition estimator for (4) given any infection source set S.

This estimator is exact under a simplifying technical condition given in Theorem 1 below, the proof of which is provided in Appendix A. A Voronoi partition may not produce the optimal estimator for the infection partition in a general infection graph.

However, it is intuitively appealing as nodes closer to a particular source are more likely to be infected by that source. For simplicity, we will henceforth use the Voronoi partition of the infection graph G n as an estimator for A * n , and present simulation results in Section V to verify its performance. We will also see in Section III-E that this approximation allows us to design an infection source estimation algorithm with low complexity.

We now consider the problem of estimating the set of infection sources S * . When |S * | = 1, our estimation problem reduces to that in [14] , which considers only the single source infection problem. In the following, we introduce some notations, and briefly review some relevant results from [14] .

A path between any two nodes u and v in the tree G n is denoted as ρ(u, v). For any set of nodes S in G n , consider the connected subgraph H n ⊂ G n consisting of all paths between any pair of nodes in S. Treat this

subgraph as a "super" node, with the tree G n rooted at this "super" node. For any node v ∈ G n \H n , we define

T v (S) to be the tree rooted at v with the path from v to H n removed. For v ∈ H n , we define T v (S) to be the tree rooted at v so that all edges between v and its neighbors in H n are removed. 3 We say that T v (S) is the tree rooted

of these definitions is shown in Figure 2 . If S = {s 1 , . . . , s k }, we will sometimes use the notation T v (s 1 , . . . , s k )

instead.

Recall that C(S | G n ) is the number of infection sequences if S is the infection source set. If there is a single infection source node S = {s}, and G is a regular tree where each node has the same degree, it is shown in [14] that the MAP estimator for the infection source is obtained by evaluatingŜ = arg max v∈Gn C(v | G n ), which seeks to maximize C(v | G n ) over all nodes. Therefore, it has been suggested that C(v | G n ) can be used as the infection source estimator for general trees. The following result is provided in [14] .

Suppose that G n is a tree. For any node s ∈ G n , we have

We observe that each term |T u (s)| in the product on the right hand side (R.H.S.) of (5) is the number of nodes in the sub-tree T u (s) (and which appears when we account for the number of permutations of these nodes). We can think of the terms in the product being ordered according to the infection spreading sequence, i.e., each time we reach a particular node u, we include terms corresponding to the nodes u can potentially infect. This interpretation is useful in helping us understand the characterization in Lemma 2 for the case where there are two infection sources.

To compute C(v | G n ), an O(n) algorithm based on Lemma 1 was provided in [14] . We call this algorithm the Single Source Estimation (SSE) algorithm. We refer the reader to [14] for details about the implementation of the algorithm. Although findingŜ by maximizing C(s | G n ) is exact only for regular trees, it was shown in [14] that this estimator has good performance for other classes of trees. In particular, if G is a geometric tree (cf. Section III-D), then the probability, conditioned on S * = {s}, of correctly identifying s using C(s | G n ) goes to one as 3 As Tv(S) is defined on Gn, its notation should include Gn. However, in order to avoid cluttered expressions, we drop Gn in our notations. Confusion will be avoided through the context in which these trees are referenced.

can find the corresponding reverse infection sequence ξ = (u2, u1, u3). We have I1(ξ; s1, s2) = |Tu 2 (s1, s2)| = 1, I2(ξ; s1, s2) =

n → ∞. Inspired by this result, we propose estimators based on quantities related to C(S | G n ) for cases where

In the following, we first discuss the case where |S * | = 2, and extend the results to the general case where |S * | is unknown in Section III-E. We then numerically compare our proposed algorithms with a modified SSE algorithm adapted for finding multiple sources in Section V.

In this section, we assume that there are two infection sources S = {s 1 , s 2 }. Given two nodes u and v in G n ,

be the total number of nodes in the trees rooted at the first i nodes in the permutation ξ. We have the following characterization for C(s 1 , s 2 | G n ), whose proof is given in Appendix B.

Suppose that G n is a tree. Consider any two nodes s 1 and s 2 in G n , and suppose that ρ(s 1 , s 2 ) = (s 1 , u 1 , . . . , u m , s 2 ). We have

where for

and Γ(u 1 , u m ) is the set of all permutations ξ = (ξ 1 , . . . , ξ m ) of nodes in ρ(u 1 , u m ) such that (ξ m , . . . , ξ 1 ) is an infection sequence starting from s 1 and s 2 and resulting in ρ(s 1 , s 2 ).

The characterization for C(s 1 , s 2 | G n ) is similar to that for the single source case in (5), except for the additional q(u 1 , u m ; s 1 , s 2 ) term. We first clarify the meaning of Γ(u 1 , u m ). Given any infection sequence σ that starts with

, we can find a permutation ξ = (ξ 1 , . . . , ξ m ) of nodes in ρ(u 1 , u m ) such that ξ i = σ m−i+1 for i = 1, . . . , m. In other words, ξ can be interpreted as the reverse infection sequence corresponding to σ. Then Γ(u 1 , u m ) is the set of all such reverse infection sequences corresponding to Ω(ρ(s 1 , s 2 ), {s 1 , s 2 }). We show an illustration of these definitions in Figure 3 . Each term |T u (s)| in the product in the R.H.S. of (5) can be interpreted as the number of nodes that can be infected via u once u has been infected. Similarly, the sum in (9) is over all possible reverse infection sequences ξ of the nodes in ρ(u 1 , u m ), and each term I i (ξ; s 1 , s 2 ) in the product within the sum is the number of nodes that can be infected once ξ i+1 , . . . , ξ m have been infected.

By utilizing Lemma 2, we can compute C(u, v | G n ) for any two nodes u and v in G n by evaluating |T w (u, v)| for all nodes w ∈ G n , and the quantity q(

Algorithm 1 allows us to compute f w (u) = |T w (u)| and g w (u) = v∈Tw(u) |T v (u)| for all neighbors u of w, and for all w ∈ G n in O(n) time complexity. To do this, we first choose any node r ∈ G n , and consider G n as a directed tree with r as the root node, and with edges in G n pointing away from r. Let pa(w) and ch(w) be the parent and the set of children of w in the directed tree G n , respectively. Starting from the leaf nodes, let each non-root node w ∈ G n pass two messages containing f w (pa(w)) and g w (pa(w)) to its parent. Each node stores the values of these two messages from each of its children, and computes its two messages to be passed to its parent.

When r has received all messages from its children, a reverse sweep down the tree is done so that at the end of the algorithm, every node w ∈ G n has stored the values {f u (w), g u (w) : u ∈ N Gn (w)}. The algorithm is formally described in Algorithm 1. The last product term on the R.H.S. of (7) can then be computed using

and taking its reciprocal.

To compute C(s 1 , s 2 | G n ) in (7), we still need to compute q(u 1 , u m ; s 1 , s 2 ). The recurrence (8) allows us to compute q(u 1 , u m ; s 1 , s 2 ) for all s 1 , s 2 ∈ G n in O(n 2 d 2 * ) complexity, where d * is the maximum node degree. The computation proceeds by first considering each pair of neighbors (u, v). Both nodes have at most d * neighbors each, so that we need to evaluate q(u, v; s 1 , s 2 ) for all s 1 ∈ N Gn (u)\ρ(u, v) and

The computed values and T ρ(u,v) (s 1 , s 2 ) are stored in a hash table. In the next step, we repeat the same procedure for node pairs that are two hops apart, and so on until we have considered every pair of nodes in G n . Note that for a path (u 1 , . . . , u m ) and s 1 , s 2 neighbors of u 1 and u m respectively, q(u 1 , u m ; s 1 , s 2 ) can be computed in constant time from (8) as q(u 2 , u m ; s 1 , s 2 ) = q(u 2 , u m ; u 1 , s 2 ) and q(u 1 , u m−1 ; s 1 , s 2 ) = q(u 1 , u m−1 ; s 1 , u m ). A similar remark applies for the computation of |T ρ(u1,um) (s 1 , s 2 )|. In addition, each lookup of the hash table takes O(1) complexity since G n is known and collision-free hashing can be used. Therefore, the overall complexity is O(n 2 d 2 * ). The algorithm to compute the infection sources estimator is formally given in Algorithm 2. We call this the Two Source Estimation (TSE) algorithm, and it forms the basis of our algorithm for multiple sources estimation in the sequel. Algorithm 1 Tree Sizes and Products Computation 1: Inputs: G n 2: Choose any node r ∈ G n as the root node.

3: for w ∈ G n do 4: Store received messages f x (w) and g x (w), for each x ∈ ch(w).

if w is a leaf then 6: f w (pa(w)) = 1 7: g w (pa(w)) = 1 8:

else 9 :

13:

Pass f w (pa(w)) and g w (pa(w)) to pa(w).

14: end for 15 : Set g pa(r) (r) = 1.

16: for w ∈ G n do 17: Store received message g pa(w) (w) from pa(w).

if w is not a leaf then 19: for x ∈ ch(w) do 20 :

Pass g w (x) to x.

end for 23: end if 24 : end for

In this section, we study the special case of geometric trees, propose an approximate estimator for geometric trees, and provide theoretical analysis for its performance. First, we give the definition of geometric trees and prove some of its key properties. Then, we derive a lower bound for C(S | G n ), and propose an estimator based on this lower bound. We show that our proposed estimator is asymptotically correct, i.e., it identifies the actual infection sources with probability (conditioned on the infection sources) going to one as the infection graph G n becomes large. For mathematical convenience, instead of letting the number of infected nodes n grow large, we let the time t from the start of the infection process to our observation time become large.

The geometric tree network is defined in [14] w.r.t. a single infection source. In the following, we extend this Algorithm 2 Two Source Estimation (TSE) 1: Input: G n 2: Let (s * 1 , s * 2 ) be the maximizer of C(·, · | G n ). Set C * = 0.

3: for d = 1 to diameter of G n do 4: for each s 1 ∈ G n do 5: for each s 2 such that d(s 1 , s 2 ) = d do 6: Let ρ(s 1 , s 2 ) = (s 1 , u 1 , . . . , u d−1 , s 2 ). Look up |T ρ(u1,ud−2) (s 1 , u d−1 )|, q(u 2 , u d−1 ; u 1 , s 2 ), and q(u 1 , u d−2 ; s 1 , u d−1 ).

Store

14:

Store q(u 1 , u d−1 ; s 1 , s 2 ) = q(u 2 , u d−1 ; u 1 , s 2 ) + q(u 1 , u d−2 ; s 1 , u d−1 ) |T ρ(u1,ud−1) (s 1 , s 2 )| .

end if 16: Compute g(s 1 , s 2 ) from (10).

C(s 1 , s 2 | G n ) = (n − 2)!q(u 1 , u d−1 ; s 1 , s 2 )/g(s 1 , s 2 ).

Update (s * 1 , s * 2 ) and C * if C(s 1 , s 2 | G n ) > C * . 19: end for 20: end for 21: end for definition to the case where there are two sources. Let S * = {s 1 , s 2 } be the infection sources, and let T ′ u (s 1 , s 2 ) be defined in the graph G in the same way as T u (s 1 , s 2 ) is defined for G n . Let N G (ρ(s 1 , s 2 )) be the set of nodes that have a neighboring node in ρ(s 1 , s 2 ). For each node u, let n(u, r) be the number of nodes in T ′ u (s 1 , s 2 ) that are at a distance r from u. We say that G is a geometric tree if for all u ∈ N (ρ(s 1 , s 2 )), we have br α ≤ n(u, r) ≤ cr α ,

where α, b, and c are fixed positive constants with b ≤ c. The condition (11) implies that all trees defined w.r.t.

the infection sources are growing polynomially fast at about the same rate. As we have assumed that the infection rates are homogeneous for every node, the resulting infection graph G n will also be approximately regular with high probability. We have the following properties for a geometric tree, whose proofs are in Appendix C. 

where

and

In addition, for t ≥ t 0 , we have

The infection sequences count in (7) is not amendable to analysis. In the following, we seek an approximation to simplify our analysis. For s 1 , s 2 ∈ G n , suppose that ρ(s 1 , s 2 ) = (s 1 , u 1 , . . . , u m , s 2 ), with p = |ρ(s 1 , s 2 )| = m + 2.

Instead of computing C(s 1 , s 2 | G n ), we consider a new infection graph G ′ n with two "virtual" nodes x i , i = 1, 2 added, where x i is attached to s i (see Figure 4) . We now consider the infection sequence count C(x 1 , x 2 | G ′ n ) ≥ C(s 1 , s 2 | G n ). Since the trees rooted at x i are single node trees, we have

where the last inequality follows because if s 1 and x 2 are sources, then s 2 can be inserted in any of at most n − 1 positions in an infection sequence from Ω(G n , {s 1 , s 2 }), so that C(s 1 ,

Let ξ * = (ξ * 1 , . . . , ξ * p ) be a permutation of the nodes in ρ(s 1 , s 2 ) such that |T ξ * i (s 1 , s 2 )| ≥ |T ξ * j (s 1 , s 2 )| for all 1 ≤ i ≤ j ≤ p, i.e., the nodes in ξ * are arranged in descending order of the size of the sub-trees rooted at them.

Let I * i (s 1 , s 2 ) = I i (ξ * ; s 1 , s 2 ) (cf. the definition in (6)) be the total number of nodes in the i biggest sub-trees in {T u (s 1 , s 2 ) : u ∈ ρ(s 1 , s 2 )}. From Lemma 2, we have

where the inequality holds because |Γ(s 1 , s 2 )| = 2 p−1 , and each term in the sum on the R.H.S. of (9) is lower bounded by p i=1 I * i (s 1 , s 2 ) −1 . We use the lower bound in (15) as a proxy for C(s 1 , s 2 | G n ). However, we have used a very loose lower bound in (15) , so we propose the estimator

whereC

and δ is a fixed positive constant, to be chosen based on prior knowledge about the graph G. Algorithm 2 can be modified to find the maximizer forC(·, · | G n ). We call this the geometric tree TSE algorithm. The following result provides a way to choose δ, and shows that our proposed estimatorS is asymptotically correct in a geometric tree.

A proof is provided in Appendix D.

Suppose that G is a geometric tree with two infection sources S * = {s * 1 , s * 2 }. Let d min and d max be constants such that deg G (s i ) ∈ [d min , d max ] for i = 1, 2. Let b and c be fixed positive constants satisfying (11) for the geometric tree G. Suppose that

Then, for any δ in the non-empty interval

we have lim t→∞ P(S = S * | S * ) = 1.

Theorem 2 implies that if we know the constants governing the regularity condition (11) for G, we can choose a δ so that our estimatorS gives the true infection sources with high probability if the infection graph G n is large. The class of geometric trees as defined by (11) can be used to model various scenarios in practice, e.g., a tree spanning a wireless sensor network with nodes randomly scattered. However, the assumption (11) may also be overly strong for other applications. In Section V, we perform numerical studies to gain insights into the performance of our proposed estimator for different classes of tree networks.

In most practical applications, the number of infection sources is not known a priori. However, typically we may be able to guess the maximum number of infection sources k max , or we can choose a reasonable value of k max depending on the size of the infection graph G n . In this section, we present a heuristic algorithm that allows us to estimate the infection sources with a given k max .

We first consider the instructive case where k max = 2 and G is a geometric tree. In this case, the number of infection sources can be either one or two. Suppose we run the geometric tree TSE algorithm on G n . We have the following result, whose proof is in Appendix E. (11) (i) Randomly choose k max nodes satisfying Assumption 2 as the infection sources and find a Voronoi partition for G n . Use the SSE algorithm to find a source node for each infection region. Repeat these steps until for every region, the distance between estimated source nodes between iterations is below a fixed threshold or a maximum number of iterations is reached. We call this the Infection Partition (IP) Algorithm (see Algorithm 3).

(ii) For any two regions in the partition obtained from step (i) that are connected by an edge in G n , run the TSE algorithm in the combined region to find two source estimates. If the two estimates have distance less than τ , we decrement the number of source nodes, and repeat step (i). Run the Voronoi partitioning algorithm with centers in S (l−1) to obtain the infection partition

i .

for i = 1 to m do 6: Run SSE algorithm in A 

In this section, we generalize the MSEP algorithm to identify multiple infection sources in general graphs G.

In [14] , the SSE algorithm is extended to general graphs when it is known that there is only a single infection source in the network using a heuristic. The algorithm first chooses a node s of G n as the root node, and generates a spanning tree T bfs (s, G n ) of G n rooted at s using the breadth-first-search (BFS) procedure. The SSE algorithm is then applied on this spanning tree to compute C(s | T bfs (s, G n )). In addition, the infection sequences count is weighted by the likelihood of the BFS tree. This is repeated using every node in G n as the root node, and the nodê s with the maximum weighted infection sequences count is chosen as the source estimator, i.e.,

where σ v is the sequence of nodes that corresponds to an infection spreading from v along the BFS tree. It can be shown that this algorithm has complexity O(n 2 ). For further details, the reader is referred to [14] . We call this algorithm the SSE-BFS algorithm in this paper.

We adapt the MSEP algorithm for general graphs using the same BFS heuristic. Specifically, we replace the SSE algorithm in line 6 of the IP algorihm with the SSE-BFS algorithm. In addition, in line 9, we run the TSE for all regions A i and A j in the partition A such that there exists an edge (u, v) in G n with u ∈ A i and v ∈ A j do 9: Set (u, v) = Algorithm TSE(A i ∪ A j ). 10: if d(u, v) < τ then 11: Merge A i and A j , set s i = u and discard s j

k := k − 1 13: break 14: end if 15: end for 16: if S = S ′ then 17: break 18: end if 19 : end while 20: return (S, A) (u, v) in G n with u ∈ T bfs (s i , A i ) and v ∈ T bfs (s j , A j ). We call this modified algorithm the MSEP-BFS algorithm.

Since the worst case complexity for the SSE-BFS algorithm is O(n 2 ), the complexity of the MSEP-BFS algorithm can be shown to be O(k 3 max n 2 ), which is the same complexity as the MSEP algorithm. To verify the effectiveness of the MSEP-BFS algorithm, we conduct simulations on both synthetic and real world networks in Section V.

In this section, we present results from simulations and tests on real data to verify our proposed algorithms. We first consider geometric tree networks and regular tree networks with various numbers of infection sources, and then we present results on small-world networks and a real world power grid network. We also apply our algorithms to the contact tracing data obtained during the SARS outbreak in Singapore in 2003 [1] and the Arizona-Southern California cascading power outages in 2011 [31] . 

We perform simulations on geometric trees, regular trees, and small-world networks. The number of infection sources |S * | are chosen to be 1, 2, or 3, and we set k max = 3. Figure 5 . It can be seen that our algorithm correctly finds the number of infection sources more than 93% of the time for geometric trees, and more than 71% of the time for regular trees. The accuracy of about 69.2% for small-world networks is worse than that for the tree networks, as the infection tree for a small-world network has to be estimated using the BFS heuristics, thus additional errors are introduced into the procedure.

When there are more than one infection sources, we compare the performance of the MSEP algorithm with a naive estimator based on the SSE algorithm. We call this the nSSE algorithm. Specifically, in the estimator for tree networks, we first compute C(u | G n ) for all nodes u ∈ G n , and choose the |S * | nodes with the largest counts as the source nodes. In non-tree networks, we use the SSE-BFS algorithm. Since the nSSE algorithm can not estimate |S * |, we consider two variants. In the first variant, we assume the nSSE algorithm has prior knowledge of |S * |. In the second variant, we guess |S * | by choosing uniformly from {1, . . . , k max }.

To quantify the performance of each algorithm, we first match the estimated source nodesŜ = {ŝ i : i = 1, . . . , |Ŝ|} with the actual sources S * so that the sum of the error distances between each estimated source and its match is minimized. Let this matching be denoted by the function π, which matches each actual source s i tô s π(i) . If we have incorrectly estimated the number of infection sources, i.e., |Ŝ| = |S * |, we add a penalty term to this sum. The average error distance is then given by

where η is a penalty weight for incorrectly estimating the number of infection sources. For different applications, we may assign different values to η depending on how important it is to estimate correctly the number of infection sources. In our simulations, we consider the cases where η = 0, and where η is the diameter of the infection graph.

The average error distances for the different types of networks are provided in Table II . Clearly, the MSEP/MSEP-BFS algorithm outperforms the nSSE algorithm, even when the nSSE algorithm has prior knowledge of the number of sources. When |S * | is known a priori, the performance of the nSSE algorithm deteriorates with increasing |S * |. This is to be expected as the SSE algorithm assumes that the node with the largest infection sequence count is the only source, and this node tends to be close to the distance center [32] of the infection graph. The histogram of the average error distances when η = 0 are shown in Figure 6 .

The MSEP/MSEP-BFS algorithm also estimates the infection region of each source. To evaluate its accuracy, we first perform the matching process described previously. Let the true infection region of s i be A n,i and the estimated infection region ofŝ π(i) beÂ n,i , where we setÂ n,i = ∅, if we have underestimated the number of sources and s i is unmatched. We define the correct infection region covering percentage for s i as the ratio between |Â n,i ∩ A n,i | and |A n,i |, and we compute the minimum (or worst case) infection region covering percentage as min i∈{1,··· ,|S * |}

This is then averaged over all simulation runs. We find that the average minimum infection region covering percentage is more than 59% for all networks, as shown in Table II .

We verify the performance of the MSEP-BFS algorithm on the western states power grid network of the United

States [33] . We simulate the infection spreading process on the power grid network, which contains 4941 nodes.

For each simulation run, 1, 2 or 3 infection sources are randomly chosen from the power grid network under Assumption 2, and the spreading process is simulated so that a total of 500 nodes are infected. For each value of |S * |, 1000 simulation runs are performed. The simulation results are shown in Figures 5 and 6(d) , and Table II . We see that the MSEP-BFS algorithm outperforms the nSSE algorithm in every scenario. The average infection region covering percentage is above 59%.

In order to get some insights in the performance of the MSEP-BFS algorithm in real infection spreads, we conduct two tests on real infection spreads data. We first apply the MSEP-BFS algorithm to to a network of nodes that represent the individuals who were infected with the SARS virus during an epidemic in Singapore in the year 2003. The data is collected using contact tracing of patients [1] , where an edge between two nodes indicate that there is some form of interaction or relationship between the individuals (e.g., they are family members, classmates, colleagues, or commuters who shared the same public transport system). A part of the SARS infection network corresponding to a cluster of 193 patients is shown in Figure 7 . We test the MSEP-BFS algorithm on the network in Figure 7 , assuming that there are at most k max = 3 infection sources. It turns out that the MSEP-BFS algorithm correctly estimates the number of infection sources to be one, and correctly identifies the real infection source.

We next consider the Arizona-Southern California cascading power outages in 2011 [31] . The affected power network is represented by a graph where a node represents a key facility (substation or generating plant) affected by an outage, and an edge between two nodes indicate that there is a transmission line between these two facilities. The Figure 8 . We test the MSEP-BFS algorithm on the network in Figure 8 , and assume that there are at most k max = 3 infection sources. We can see that the MSEP-BFS algorithm correctly estimates the number of infection sources to be two. We also found one of the sources correctly, and one estimate 1 hop away from the real source. In this paper, we have adopted a simple SI infection model with homogeneous spreading rates, allowing us to derive analytical results that provide useful insights into infection source estimation for practical networks.

However, this simplistic diffusion model does not adequately capture the real world dynamics of many networks.

Future research includes the use of richer diffusion models that allow the inclusion of drifts and other dynamics in the infection spreading process, and tools from statistics to approximate optimal estimators for the infection sources. Our proposed algorithms find a set of nodes most likely to infect or influence a network, and are thus potentially useful for various practical applications. For example, our algorithm may be integrated with non-modelbased consensus methods [34] , [35] to design multi-agent control systems that uses only a small subset of agents as controllers. In cloud-centric media platforms [36] , variants of our proposed algorithm may be used for intelligent 

where Ω(H n , S, A n ) = {σ ∈ Ω(H n , S) : σ ∩ A n,i is an infection sequence, for all i = 1, . . . , k.}, and σ ∩ A n,i is the subsequence of σ containing only nodes that are in A n,i .

Let h = |H n | − k, and consider an infection sequence σ = (σ 1 , . . . , σ h ) ∈ Ω(H n , S, A n ). Let the set of edges connecting susceptible nodes to infected nodes be called the susceptible edge set. We have assumed that the infection times of susceptible nodes are independent and identically exponentially distributed. Therefore, given the infection sequence σ 1 , . . . , σ l−1 , the next edge along which the infection is spread is chosen uniformly at random from the susceptible edge set at time index l − 1. Since H n is a tree where all nodes except those in S have degree 2, after infection of a new node, the susceptible edge set size remains the same except in the case where the infected node is the last node to be infected amongst those on a path connecting two infection sources. In that case, the susceptible edge set size reduces by 2. Let J σ be the set of indices of the last infected nodes on every path connecting infection sources. Letting n l = 1 if l / ∈ J σ and 2 otherwise, we then have

where p is the number of paths connecting infection sources, and This shows that the estimatorÂ n (S) is a Voronoi partition of G n , and the proof is complete.

To simplify notations, we write T u (s 1 , s 2 ) as T u , with the implicit understanding that all trees are defined w.r.t. The n − 2 slots in a sequence are occupied by nodes from T si \{s i }, i = 1, 2, and T ρ(u1,um) . Therefore, we have

where R(u i , u j ) for i ≤ j is the number of ways of permuting the nodes in T ρ(ui,uj) such that the infection sequence property is maintained, and the last equality follows from Lemma 1. To simplify the notations, for

For example, from Lemma 1, we have C(u i | T ui ) = (|T ui | − 1)!J(u i , u i ). In the following, we show that for

The proof proceeds by induction on j − i. If j = i, we have R(u i , u i ) = C(u i | T ui ) and the claim follows from Lemma 1. Suppose that the claim (23) holds for all nodes u k and u p such that p − k < j − i. The number of permutations R(u i , u i ) can be computed by considering a sequence with m = |T ρ(ui,uj) | slots. The first slot can be filled with either u i or u j . Therefore, we have

where the penultimate equality follows from the inductive hypothesis and Lemma 1, and the last equality follows

The claim (23) now follows from (8).

Finally, (9) follows by an inductive argument using (8), which we omit. The proof is now complete.

APPENDIX C PROOF OF LEMMA 3

The proof follows easily from Theorems 5 and 6 of [14] . Consider the infection spreading along a path in G n . Let Π(t) be the counting process of the number of infected nodes in this path. The process Π(t) consists of exponentially distributed arrivals with rate 1, and at most one arrival with rate 2 if the path is between the two infection sources.

Let Π 1 (t) be a unit rate Poisson process corresponding to the rate 1 arrivals. Then Π 1 (t) ≤ Π(t) ≤ Π 1 (t) + 1.

From Theorem 6 of [14] , we have for any positive γ < 0.2,

The rest of the proof is the same as that of Theorem 5 of [14] , and the proof is complete.

APPENDIX D

We first show that under (18) , the interval (19) is non-empty. The condition (18) implies that Fig. 9 . Illustration of the network structure when u0 = v0. Not all nodes are shown.

which after some algebraic manipulations yields

Therefore (19) is a non-empty interval. Fix a δ in the interval. Then for all ǫ > 0 sufficiently small, we have

Recall that t is the time from the start of the infection spreading to our observation of G n . From Lemma 3, for each ǫ, there exists t 0 such that if t ≥ t 0 , we have

We will make use of the two inequalities (24) and (25) extensively in the following proof steps. Let E t be the event defined in Lemma 3. Then from Lemma 3, we have for t ≥ t 0 ,

In the following, we show that P(S = S | S, E t ) = 1 for t ≥ t 0 . The proof then follows from (26) as ǫ can be chosen arbitrarily small.

To show that P(S = S | S, E t ) = 1 is equivalent to showing that with probability one,

for all node pairs u m , v l ∈ G n such that at least one of them is not in S. Let u 0 and v 0 be the first nodes in ρ(s 1 , s 2 )

that are connected to u m and v l respectively. We divide the proof into two cases, depending on whether u 0 and v 0 are distinct or not, as shown in Figures 9 and 10 .

Suppose that u 0 = v 0 . A typical network for this case is shown in Figure 9 , where m, l, n, p, and k are nonnegative integers, and at least one of u m and v l is not in S, i.e., either m + l > 0 or n + p > 0. We let u 0 = s 1 if n = 0, and v 0 = s 2 if p = 0.

We will show that if either m + l > 0 or n + p > 0, we have for t ≥ t 0 ,

The proof follows by showing thatC(u 0 , v 0 | G n ) ≥C(u m , v l | G n ), where strict inequality holds if m + l > 0, andC(s 1 , s 2 | G n ) ≥C(u 0 , v 0 | G n ) with strict inequality holding if n + p > 0. From (17), we have 4

if m + l > 0. The first inequality follows because I * m+l+i (u m , v l ) ≥ I * i (u 0 , v 0 ) for i = 1, . . . , k + 2, and the last inequality follows from (25) when t ≥ t 0 .

Let ψ = deg G (s 1 ) + deg G (s 1 ). We have for t ≥ t 0 ,

· w∈ρ(s1,x1)∪ρ(y1,s2)

|T w (u 0 , v 0 )| ≥ [2(1 + δ)] n+p · n+p+k+2 i=k+3 I * i (s 1 , s 2 ) −1 · w∈ρ(s1,x1)∪ρ(y1,s2)

where the first inequality follows because I * i (u 0 , v 0 ) ≥ I * i (s 1 , s 2 ) for i = 1, . . . , k + 2, and the last inequality follows from (24) if n + p > 0. The bound (27) is now proved.

We next consider the case where u 0 = v 0 = w 0 in Figure 10 , where k, m and l are non-negative integers. When t ≥ t 0 , we have the following bounds, which are straight forward to verify and whose proofs are omitted here. where the last inequality follows from (24) and (25) . The theorem is now proved.

Let t be the elapsed time from the start of an infection spreading from a single s to the time we observe G n . We wish to show that Algorithm TSE estimates as sources s and one of its neighbors with probability (conditioned on s being the infection source) converging to 1 as t → ∞. This is equivalent to showing that for t sufficiently large, and for each pair of nodes u m , v l ∈ G n where either d(u m , s) > 1 or d(v l , s) > 1, there exists a neighbor r of s such thatC(s, r | G n ) >C(u m , v l | G n ).

A typical network is shown in Figure 11 , where k, m and l are non-negative integers. If m, l and k are positive, we let r be the neighbor of s that lies on the path connecting s to u m (i.e., the node w 1 in Figure 11 ). If m and l are positive and k = 0, then r is chosen to be either u 1 or v 1 . If m = 0, we must have k > 0 so that w k = u m and r = w 1 . A similar remark applies for the case l = 0. Note that m + l > 0. For t sufficiently large, we havẽ C(s, r | G n ) C(u m , v l | G n ) = Q(s, r) Q(u m , v l ) · w∈Gn\ρ(um,vl)

|T w (u m , v l )| w∈Gn\{s,r} |T w (s, r)|

|T w (s, r)|

|T w (s, r)|

where the last inequality follows from (25) and Lemma 3. The proof of the theorem is now complete.

Epidemiology and control of SARS in Singapore

Efficient mitigation strategies for epidemics in rural regions

Finding and evaluating community structure in networks

Directed-graph epidemiological models of computer viruses

Source tracing and pursuing of network virus

Twitterrank: finding topic-sensitive influential twitterers

Everyone's an influencer: quantifying influence on Twitter

Determining content power users in a blog network: an approach and its applications

Identifying the productive and influential bloggers in a community

Epidemics and percolation in small-world networks

Spread of epidemic disease on networks

A tutorial introduction to Bayesian inference for stochastic epidemic models using Markov chain Monte Carlo methods

The effect of network topology on the spread of epidemics

Rumors in a network: Who's the culprit?

The Mathematical Theory of Infectious Diseases and its Applications

Some discrete-time SI, SIR, and SIS epidemic models

Velocity and hierarchical spread of epidemic outbreaks in scale-free networks

Epidemic spread in weighted scale-free networks

Maximal planar networks with large clustering coefficient and power-law degree distribution

Behaviors of susceptible-infected epidemics on scale-free networks with identical infectivity

An epidemic model with adaptive virus spread control for wireless sensor networks

Distributed sequential Bayesian estimation of a diffusive source in wireless sensor networks

Detection and localization of material releases with sparse sensor configurations

Bayesian detection in bounded height tree networks

On the impact of node failures and unreliable communications in dense sensor networks

Performance of statistical tests for single-source detection using random matrix theory

Robust distributed detection, localization and estimation of a diffusive target in clustered wireless sensor networks

Maximizing the spread of influence through a social network

Extracting influential nodes on a social network for information diffusion

Counting linear extensions is #P-complete

Federal Energy Regulatory Commission and North American Electric Reliability Corporation

The centrality index of a graph

Collective dynamics of 'small-world' networks

Robust consensus tracking for an integrator-type multi-agent system with disturbances and unmodelled dynamics

Robust consensus tracking of a class of second-order multi-agent dynamic systems

Designing an inter-cloud messaging protocol for content distribution as a service (CoDaas) over future internet