title: Inference of a Rumor's Source in the Independent Cascade Model
authors: Berenbrink, Petra; Hahn-Klimroth, Max; Kaaser, Dominik; Krieg, Lena; Rau, Malin
date: 2022-05-24

We consider the so-called Independent Cascade Model for rumor spreading or epidemic processes popularized by Kempe et al. [2003]. In this model, a small subset of nodes from a network are the source of a rumor. In discrete time steps, each informed node "infects" each of its uninformed neighbors with probability $p$. While many facets of this process are studied in the literature, less is known about the inference problem: given a number of infected nodes in a network, can we learn the source of the rumor? In the context of epidemiology this problem is often referred to as the patient zero problem. It belongs to a broader class of problems where the goal is to infer parameters of the underlying spreading model, see, e.g., Lokhov [NeurIPS'16] or Mastakouri et al. [NeurIPS'20]. In this work we present a maximum likelihood estimator for the rumor's source, given a snapshot of the process in terms of a set of active nodes $X$ after $t$ steps. Our results show that, for cycle-free graphs, the likelihood estimator undergoes a non-trivial phase transition as a function of $t$. We provide a rigorous analysis for two prominent classes of acyclic networks, namely $d$-regular trees and Galton-Watson trees, and verify empirically that our heuristics work well in various general networks.

In this paper we consider a stochastic diffusion process which models the spread of information or influence in networks. Influence propagation is motivated by many applications from various fields: in marketing, these processes are studied to maximize the adoption of a new product; in the study of social media, they describe how influencers manipulate users in social networks; in epidemiology, they are used to study the spread of viruses or diseases [4, 19]; and, in general, they serve as models for how information spreads in networks [1, 10, 18, 21, 27, 30]. On a very high level these processes work as follows. Initially, a small subset of the vertices is in a distinguished state (they might have a piece of information, or they are infected, depending on the application in mind). In this paper we call these vertices, in the spirit of rumor spreading, informed vertices. Informed vertices can inform their neighbors, and the rumor spreads through the network as time passes. Most publications studying stochastic diffusion processes observe these processes in a forward direction, i.e., they consider how information spreads in a network, how many nodes become informed, or which nodes have the largest influence on the vertices of a social network. These processes are well understood in simple networks [5, 26]. In this paper we study the reverse learning problem of inferring the rumor's source ω, a problem which has received far less attention. In the disease spreading setting this problem is referred to as the patient zero problem. This inference problem has been studied rigorously with respect to the SI model from epidemiology [33]. In the simple SI disease spreading model, once infected, nodes stay infected and can infect randomly chosen neighboring nodes.
Understanding inference problems of this kind better will help us to find the source of outbreaks of infectious diseases like COVID-19 or to find the source of rumors. The latter might help to prevent political elections from being influenced from the outside. In this paper we employ the well-known Independent Cascade Model (see, e.g., [18]). The process starts with an initial set of active nodes I_0 and works in discrete steps. When node v first becomes active in step t, it is given a single chance (one shot) to activate each currently inactive neighbor, each independently with probability p. Whether or not v succeeds in activating any of its neighbors, it cannot make any further activation attempts in later steps t' > t. The process runs until no more activations are possible. We assume that I_0 (the rumor's source) contains only a single node denoted by ω. We call all nodes that were activated at some point during the process informed. Note that this one-shot property is a very fitting model for rumor or disease spreading in social networks. Indeed, once a user hears about an article supporting her opinion, she will either ignore it or share it with her social contacts in the near future. Each of the possible recipients either ignores her opinion (does not get activated) or decides to share it again with her peers (gets activated). Furthermore, users are unlikely to share the same article twice. In the case of disease spreading, the informed vertices model the persons who caught the illness, and the active ones model the persons who are infectious at any point in time.

Our problem fits very well into the so-called teacher-student model, a framework which is frequently used to model (machine) learning tasks. The model was introduced by Gardner [9] in the context of studying fundamental properties of the Perceptron, a fairly simple binary classifier, and has been used in various learning settings since [6, 7, 36, 38]. Suppose a teacher samples a ground-truth σ from a distribution called the teacher's prior µ_P. Rather than directly revealing this ground-truth to a student, the teacher creates a condensed version σ̂ of the ground-truth by means of a teacher's model µ_M. Now, in the so-called Bayes optimal case, the teacher conveys σ̂, µ_P, and µ_M to her student. The student's task is to infer a non-trivial guess of σ from the observed data. In the context of our contribution, the teacher's prior is, given a network G, to select one node uniformly at random as the rumor's source ω. The teacher's model is the t-step forward process of the Independent Cascade Model on G starting with the source ω. The condensed information σ̂ that the student receives consists of the network G and the set of currently active nodes X. The student's task now is to infer ω from G and X. In our first result, we prove that our model is Bayes optimal (or, in terms of statistical physics, that the Nishimori property holds). Secondly, with respect to d-regular trees, we show that for a small spreading parameter p (in comparison to the degree d of the tree) it is not possible to infer the source of the rumor. For a large spreading parameter p, strong detection is possible: we show that we can detect the source node with a very large probability, which approaches one as a function of the time t at which the snapshot was taken. For intermediate p we show that a proposed node ω_c is the source of the rumor with constant probability. Furthermore, we bound the probability that the real rumor source ω is far away from ω_c.
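As a reading aid, the following self-contained C++ sketch implements one run of the forward process just described. It is our own minimal illustration of the model, not the authors' simulation code; all identifiers (independent_cascade, adj, and so on) are hypothetical.

```cpp
#include <random>
#include <vector>

// One run of the Independent Cascade process described above. The graph is
// given as an adjacency list over nodes 0, ..., n-1. The function returns the
// sets I_0, I_1, ..., where steps[t] holds the nodes that become active in
// step t. These names are illustrative, not the authors' implementation.
std::vector<std::vector<int>> independent_cascade(
    const std::vector<std::vector<int>>& adj, int source, double p,
    int max_steps, std::mt19937_64& rng) {
  std::bernoulli_distribution coin(p);
  std::vector<bool> informed(adj.size(), false);  // activated at some point
  std::vector<std::vector<int>> steps{{source}};  // steps[t] = I_t
  informed[source] = true;
  for (int t = 1; t <= max_steps && !steps.back().empty(); ++t) {
    std::vector<int> next;
    for (int v : steps.back()) {
      // One-shot rule: v gets a single chance per uninformed neighbor, and
      // since v never re-enters the active set, it never retries.
      for (int w : adj[v]) {
        if (!informed[w] && coin(rng)) {
          informed[w] = true;
          next.push_back(w);
        }
      }
    }
    steps.push_back(std::move(next));
  }
  return steps;
}
```

Calling the function with the adjacency list of any graph, a source node, and parameters p and t yields the sets I_0, ..., I_t; the union of these sets is the set of informed vertices.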
Another way to read our results is that inference of ω gets easier for increasing values of (d − 1) · p. Finally, we establish a similar phase transition with respect to Galton-Watson processes with offspring distribution Po(λ).

Related Work. Forward propagation processes, like epidemic models [4, 17, 19, 37], rumor spreading [10, 21, 27, 30], information cascades [12, 18, 39], blog propagation models [22], and marketing strategies [1, 18], have been studied extensively and for a long time within different research communities, and we refrain from discussing the extensive literature here. In contrast, rigorous contributions on the corresponding inference problem, the source detection task, are very rare. To the best of our knowledge, all existing rigorous results on how to infer a rumor's source concern the SI model [16, 32, 33]. In our notation, the authors of [32, 33] prove that on specific infinite acyclic networks like d-regular trees, super-critical Galton-Watson trees, and geometric graphs, approximate inference of ω is always possible in the SI model as long as the infinite tree satisfies certain expansion properties (for instance, in the d-regular case, this requires d ≥ 3). Furthermore, they give strong tail bounds showing that the probability of declaring a far-away node as the rumor's source is small. The results are proved via a fundamental connection between a generalized Pólya urn and the SI model on acyclic networks. Beyond these rigorous results, there are many well-explained heuristics for the source detection task on various network types which are supported by extensive simulation studies [3, 13, 15, 29, 35], in recent contributions also based on neural networks [2, 31, 34]. Finally, a closely related problem that has attracted attention recently is not to infer ω given X but to infer the parameters of the underlying spreading model. Over the last years, learning strategies for this problem were proposed and studied experimentally [23, 25].

We are given a communication network G = (V, E) with n vertices x_1, ..., x_n and m edges. We assume I_0 = {ω}, where ω is chosen uniformly at random from all vertices. Hence, the rumor originates at a single source. A spreading parameter p ∈ (0, 1) determines the virality of the rumor. The diffusion process in the Independent Cascade Model runs in discrete, synchronous steps. Let I_t be the set of vertices which are active in step t. In step t we call the vertices in I_0 ∪ I_1 ∪ ... ∪ I_t informed. In every step t ≥ 1, every active node x_i ∈ I_t activates each of its uninformed neighbors independently with probability p. All these newly activated vertices form the set I_{t+1}. Note that every node becomes active at most once, but a vertex potentially has the chance to be activated by each of its neighbors. Note that it can happen that the process dies out at some step t; in this case, I_{t'} = ∅ for all t' ≥ t. The inference problem is defined as follows. We observe the state of the network at an arbitrary time t. The task is to infer ω given only (G, I_t) and the parameter p. In this paper we study the following variants of the problem.
• Strong detection: here we have to infer ω correctly with high probability, i.e., with probability tending to one.
• Weak detection: inference of ω is only required with positive (constant) probability.
Our first result relates the probability that a given set X is active conditioned on some node v being the source to the probability that v is the source conditioned on X being active.
This establishes the so-called Nishimori property (or Bayes optimality) of our inference problem. In terms of the teacher-student model, it states that the student has access to the teacher's prior and the teacher's model. Equivalently, in expectation there is no statistical difference between the ground-truth and a uniform sample from the posterior distribution [38]. Theorem 2.1 applies to all types of networks as long as the rumor's source ω is chosen uniformly at random.

Theorem 2.1. Let G = (V, E) be an arbitrary network and fix an arbitrary step t. Let X be the set of active vertices in step t. For any v ∈ V,
P(ω = v | X = X) = P(X = X | ω = v) / Σ_{w ∈ V} P(X = X | ω = w).

The above theorem is used to show the main results of this paper. It allows us to calculate P(X = X | ω = v) instead of P(ω = v | X = X), which is often more accessible. Note that calculating the first probability is still quite challenging in general networks G, since the entropy of X is very large. For the remaining analytical part of the paper we consider acyclic networks, namely d-regular trees and Galton-Watson trees with offspring distribution Po(λ). Galton-Watson trees with offspring distribution D are defined by the following experiment. We start with one node which spawns d_0 ∼ D children. Recursively, each of the children w_1, ..., w_{d_0} spawns η_1, ..., η_{d_0} ∼ D children (and so on). We call such a process super-critical if, with positive probability, the vertices spawned during the process form an infinite tree.

Figure 1: Visualization of a possible snapshot of the spreading process. Here, ω_c spawned four sub-trees, out of which three contain active elements of X (orange nodes) and one does not contain active elements (purple). Thus, the candidate set C of possible rumor sources consists of all vertices in the purple sub-tree. Note that only a finite part of the infinitely expanding 4-regular tree is shown.

It is well known that a (not too dense) instance of an Erdős-Rényi random graph G(n, d/n) locally looks like a Galton-Watson tree with offspring distribution Po(d − 1). In contrast, a random d-regular graph locally looks like a d-regular tree, provided d is not too large [8]. We assume that a teacher fixes a time t at which she observes the network. We assume that the student is not aware of the time when the process started. We define X as the set of nodes which are active after t steps of the Independent Cascade Model (starting from an unknown and randomly chosen source ω). Note that X can be empty. The set of candidate nodes C consists of all nodes that have the same distance in G to each node in X. Note that the set C is not empty since ω ∈ C. The closest candidate ω_c ∈ C is defined as the node with minimum distance to all nodes in X. Our source detection heuristic calculates the closest candidate ω_c ∈ C and returns it as the estimated rumor source. If X contains at most one node (in particular, if the process died out before time t), the heuristic returns a failure. The following results describe the probability of success or failure for this source detection heuristic with respect to the model's parameters. The first theorem shows a phase transition between weak and strong detection for d-regular trees. We show that for small p · (d − 1) it is not possible to infer the source of the rumor. This is due to the high probability that, in this setting, the process dies out and X = ∅, or the corresponding set X is very small, which makes the inference impossible. For large p · (d − 1) we show that the source node is the closest candidate with probability 1 − o_d(1).
Note that in this scenario each active node will infect several nodes, and it is unlikely that the process dies out. The size of X grows as a function of d and t, which makes the prediction more and more reliable. For intermediate p · (d − 1) we show that the closest candidate ω_c is the source of the rumor with constant probability. Furthermore, we bound the probability that the real rumor source ω is far away from ω_c. Intuitively this means that inference of ω gets easier for increasing values of (d − 1) · p.

Theorem 2.2 (d-regular trees). Let G = (V, E) be an infinite d-regular tree and let X be the set of active nodes generated by the Independent Cascade Process with spreading parameter p after t = ω(1) steps. Then, the following phase transitions occur.
• If (d − 1)p ≤ 1, any estimator fails at weak detection with probability 1 − o_t(1).
• If 1 < (d − 1)p = Θ(1), then the closest candidate ω_c is the source of the rumor ω with positive probability (weak detection). Furthermore, the probability that dist(ω_c, ω) > k is at most exp(−Ω(k)).
• If (d − 1)p = ω(1), then the closest candidate ω_c is the source of the rumor ω with probability 1 − o_d(1) (strong detection).

Since, in expectation, a Po(λ)-Galton-Watson tree with λ = d − 1 spawns exactly as many children as a d-regular tree, it might not be too surprising that a similar phase transition occurs in this model.

Theorem 2.3 (Galton-Watson processes). Let G = (V, E) be an infinite tree generated by a Po(λ)-Galton-Watson process. Let X be the set of active nodes generated by the Independent Cascade Process with spreading parameter p after t = ω(1) steps. Then, the following phase transitions occur.
• If λp ≤ 1, any estimator fails at weak detection with probability 1 − o_t(1).
• If 1 < λp = Θ(1), then the closest candidate ω_c is the source of the rumor ω with positive probability (weak detection). Furthermore, the probability that dist(ω_c, ω) > k is at most exp(−Ω(k)).
• If λp = ω(1), then the closest candidate ω_c is the source of the rumor ω with probability 1 − o_λ(1) (strong detection).

In Theorems 2.2 and 2.3 we assume that the underlying tree network is infinitely large. This is conceptually necessary. Indeed, the trivial algorithm that outputs one node uniformly at random succeeds at weak detection in finite networks. In the next section, we provide extensive simulations that verify the asymptotic statements of the theorems on small networks. Furthermore, we present simulation results on cyclic networks such as random geometric graphs and show that (the natural extension of) the closest candidate heuristic works well. The main proof strategy of Theorems 2.2 and 2.3 is to interpret the transmission process as a special type of percolation on the underlying network. As in Theorem 2.3 the underlying network itself is random, it turns out to be technically non-trivial to pin down the exact distributions involved in this process, due to subtle rare events that might yield either very small or very large node degrees. Fortunately, the Poisson thinning technique allows us to carry out the calculations smoothly.

In this section we present simulation results that support and complement our main theorems for cyclic graphs. Our simulation software is implemented in the C++ programming language. As a source of randomness we use the Mersenne Twister mt19937_64 provided by the C++11 standard library. All simulations have been carried out on four machines equipped with two Intel(R) Xeon(R) E5-2630 v4 CPUs and 128 GiB of memory each, running the Linux 5.13 kernel. A simulation run consists of three parts. First, we generate a network G = (V, E). To this end, we have implemented generators for Erdős-Rényi graphs, random d-regular graphs (configuration model [14]), and random geometric graphs [28]. Secondly, we run the Independent Cascade Process for t rounds starting from a randomly chosen node ω ∈ V. Finally, we generalize our source detection algorithm to cyclic graphs as follows. For v ∈ V and t ≥ 0 let N_t(v) denote the set of nodes w ∈ V that have distance at most t to v. Then we calculate
N_t = ∩_{x ∈ X} N_t(x).
Hence, N_t is the set of nodes with distance at most t to every node in X. We pick the minimum t such that N_t ≠ ∅ and return N_t as the candidate set. Note that such a t exists since ω ∈ N_t.
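Concretely, the first non-empty set N_t consists exactly of the nodes minimizing the maximal distance to X, so one breadth-first search per node of X suffices. The following sketch is our own illustration of this observation (identifiers are hypothetical; we assume |X| ≥ 2, since the heuristic reports a failure otherwise).

```cpp
#include <algorithm>
#include <climits>
#include <queue>
#include <vector>

// Generalized source-detection heuristic: return the first non-empty N_t,
// i.e., all nodes minimizing the maximal distance to the active set X.
// All identifiers are our own, not taken from the authors' code.
std::vector<int> candidate_set(const std::vector<std::vector<int>>& adj,
                               const std::vector<int>& active) {
  const int n = static_cast<int>(adj.size());
  std::vector<int> max_dist(n, 0);  // max over x in X of dist(x, v)
  for (int x : active) {            // one BFS per node of X
    std::vector<int> dist(n, -1);
    std::queue<int> q;
    dist[x] = 0;
    q.push(x);
    while (!q.empty()) {
      int v = q.front();
      q.pop();
      for (int w : adj[v])
        if (dist[w] == -1) { dist[w] = dist[v] + 1; q.push(w); }
    }
    for (int v = 0; v < n; ++v)
      max_dist[v] = (dist[v] == -1) ? INT_MAX
                                    : std::max(max_dist[v], dist[v]);
  }
  const int best = *std::min_element(max_dist.begin(), max_dist.end());
  std::vector<int> candidates;
  for (int v = 0; v < n; ++v)
    if (max_dist[v] == best) candidates.push_back(v);
  return candidates;
}
```

Note that this returns all of N_t for the smallest feasible t; on trees this set typically contains just the closest candidate ω_c.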
To generate our data we execute 100 independent simulation runs for each network, where we simulate the Independent Cascade Model for 8 rounds. Our data for Erdős-Rényi graphs with n = 10^5 nodes and expected node degree 4 are shown in Table 1. Similar data for random 4-regular graphs in the configuration model and 2-dimensional random geometric graphs with expected node degree 16 can be found in Appendix B. The tables show, for various spreading probabilities p, the number of successes in detecting the source, the numbers of errors for the two error cases ω ∉ N_t and X = ∅, and the average and maximum distance of ω to the nodes in N_t. The success rates are also visualized in Fig. 2. We remark that our empirical data confirm the phase transitions as claimed in Theorems 2.2 and 2.3 for the even more general case of locally tree-like graphs, see below.

Varying Numbers of Rounds. In Fig. 3 we present simulation data for random geometric graphs with expected node degree 16, where we increased the number of rounds of the Independent Cascade Model. We show the success rates after 8, 16, and 32 rounds.

Phase Transitions. Finally, in Fig. 4 we present an additional plot that highlights the phase transition behavior of our algorithm. Recall that our heuristic in cyclic networks may return more than one candidate. In our simulation we compute the distance of each candidate found by our heuristic to ω. The plot shows a histogram of these distances for three values of p, namely 0.45, 0.50, and 0.55. From this histogram, a phase transition at p = 0.5 becomes evident: while for p = 0.45 the majority of the candidates has a large distance to ω, this changes drastically for p = 0.55, where the largest distance found is as small as 3.

In this section, we prove Theorems 2.1 and 2.2. Furthermore, the proof of Theorem 2.3 is sketched, as it is similar to the one of Theorem 2.2. All omitted details can be found in the appendix.

Proof of Theorem 2.1. Observe that, by definition, we have P(ω = v) = P(ω = w) for any v, w ∈ V. Thus, Bayes' rule and the law of total probability imply
P(ω = v | X = X) = P(X = X | ω = v) P(ω = v) / Σ_{w ∈ V} P(X = X | ω = w) P(ω = w) = P(X = X | ω = v) / Σ_{w ∈ V} P(X = X | ω = w),
and the theorem follows.
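Spelled out, Theorem 2.1 says that the posterior is proportional to the likelihood, so the Bayes-optimal guess is the maximum likelihood estimator; we state this explicitly as a reading aid (the formulation is ours, but it follows directly from the theorem):

```latex
\hat{\omega} \in \operatorname*{arg\,max}_{v \in V} \Pr\left(\omega = v \mid \mathcal{X} = X\right)
           = \operatorname*{arg\,max}_{v \in V} \Pr\left(\mathcal{X} = X \mid \omega = v\right),
```

since the normalizing sum Σ_{w ∈ V} P(X = X | ω = w) does not depend on v.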
A crucial observation in the proof of Theorem 2.2 is the following. If node v gets activated during the spreading process by node w, it has d − 1 additional neighbors v_1, ..., v_{d−1} besides w, which we call the children of v. Each of those children gets activated in the next step with probability p, independently of everything else. Suppose without loss of generality that v_1, ..., v_{d_0} are the activated children, where d_0 ∼ Bin(d − 1, p). In each of those children v_i, a new and independent rumor spreading process starts in the tree rooted at v_i and directed away from v. As this tree is itself d-regular, this process has the same distribution as starting d_0 independent Galton-Watson processes with offspring distribution Bin(d − 1, p). Depending on (d − 1)p, few, some, or many of those processes will eventually die out. To prove our result, we need some additional notation, see Section 4.2. Given a node v, we can direct the tree away from v and denote the set of subtrees rooted at v's children by T_v. Most interesting to our proof are the subtrees that contain active nodes. We denote them by T_v^X and write Y_v = Y_v(X) = |T_v^X|. Note that all candidates but one have at most one subtree containing active nodes; only the closest candidate can have more than one. Finally, let t_v^X denote the distance from v to any of the vertices in X.

Figure 5: Here, ω_c spawned three sub-trees, out of which two contain active elements of X (orange) and one does not contain active elements (purple). Thus, C consists of all vertices in the purple sub-tree rooted at ω_c.

Proposition 4.1. Let x_t be the probability that a Galton-Watson process with offspring distribution Bin(d − 1, p) has no active vertices at time t. Then x_t = (1 − p + p x_{t−1})^{d−1}, and the ultimate extinction probability is the smallest fixed point of x → (1 − p + px)^{d−1} in [0, 1]. Furthermore, conditioned on the event that v activates exactly d_0 of its children, the number Y_v of active sub-trees is distributed as Bin(d_0, 1 − x_{t_v^X}) if v = ω_c, whereas for any other candidate v exactly one of the d_0 processes survives and d_0 − 1 are extinct at time t_v^X.

Proof. The first part, namely that x_t = (1 − p + p x_{t−1})^{d−1}, is an application of probability generating functions and their connection to branching processes. The probability generating function f_{Bin(n,p)} of a Binomial distribution with parameters n and p reads f_{Bin(n,p)}(s) = E[s^{Bin(n,p)}] = (1 − p + ps)^n. Now, let p_k = P(Bin(d − 1, p) = k). Then it is immediate that
x_t = Σ_{k≥0} p_k x_{t−1}^k = (1 − p + p x_{t−1})^{d−1},
which is exactly the probability generating function of the Binomial distribution evaluated at x_{t−1}. A detailed explanation and a formal proof of this statement can, for instance, be found in [11]. Let us now suppose that v = ω_c. Let V_0 be the event that v activates exactly d_0 children, where k ≤ d_0 ≤ d. Clearly, P(V_0) = P(Bin(d, p) = d_0), and of course d_0 needs to be at least k, as otherwise the probability of having k active sub-trees would be zero. Given V_0, we find that the probability of observing exactly k active sub-trees is the probability that exactly k out of d_0 independent Galton-Watson processes with offspring distribution Bin(d − 1, p) survived the first t_v^X steps. Therefore, the number of active sub-trees at time t is distributed as Bin(d_0, 1 − x_{t_v^X}) given V_0, and the first part of the formula follows. If, on the contrary, v is not the closest candidate but a node that has a different distance from X, we observe that from the originally 1 ≤ d_0 ≤ d Galton-Watson processes originating in the children of v, exactly one process needs to survive and d_0 − 1 need to be extinct at time t_v^X. The proposition follows.

Proof of Theorem 2.2 (i). To prove the first part of Theorem 2.2, it suffices to apply the first part of Proposition 4.1. Indeed, if (d − 1)p ≤ 1, we find that the smallest fixed point of x → (1 − p + px)^{d−1} is x = 1. Therefore, x_t = 1 − o_t(1). Furthermore, as p is a constant, we have in this case d = Θ(1) as well. Therefore, a union bound over the at most d independent Galton-Watson processes with offspring distribution Bin(d − 1, p) originating in the children of ω yields that, with probability 1 − o_t(1), we find X = ∅. In this case, detection clearly fails with high probability.
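For intuition, the recurrence of Proposition 4.1 (and its Poisson analogue from the appendix, x̃_t = exp(−λp(1 − x̃_{t−1}))) can be iterated numerically. The following stand-alone C++ snippet does this for arbitrarily chosen example parameters; it illustrates the fixed-point behavior and is not part of the authors' simulation software.

```cpp
#include <cmath>
#include <cstdio>

// Iterate the extinction-probability recurrences from x_0 = 0:
//   binomial offspring (Proposition 4.1): x_t = (1 - p + p * x_{t-1})^(d-1)
//   Poisson offspring (appendix):         y_t = exp(-lambda*p*(1 - y_{t-1}))
// The iterates increase towards the smallest fixed point, which equals 1
// exactly in the subcritical regimes (d-1)p <= 1 and lambda*p <= 1.
int main() {
  const double p = 0.4;       // example spreading parameter (our choice)
  const int d = 4;            // (d-1)p = 1.2 > 1: supercritical
  const double lambda = 3.0;  // lambda*p = 1.2 > 1: supercritical
  double x = 0.0, y = 0.0;
  for (int t = 1; t <= 100; ++t) {
    x = std::pow(1.0 - p + p * x, d - 1);
    y = std::exp(-lambda * p * (1.0 - y));
  }
  std::printf("extinction probability, Bin(d-1,p): %.6f  Po(lambda*p): %.6f\n",
              x, y);
  return 0;
}
```

In the subcritical regimes both iterates approach 1, matching part (i) of Theorems 2.2 and 2.3; above the threshold they converge to a fixed point strictly below 1.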
Proof of Theorem 2.2 (ii). We start with the part of the theorem that claims that weak detection succeeds by the source detection heuristic, namely that P(ω_c = ω) = Ω(1). We find that ω_c = ω with probability one if the set of possible candidate nodes C has size 1. Therefore, it suffices to prove that P(|C| = 1) = Ω(1). This is the case if (and only if) the rumor's source ω propagated the rumor to all of its d children and all d independent Galton-Watson processes with offspring distribution Bin(d − 1, p) originating in the children of ω did not become extinct. Let d_0 denote the number of children of ω that get activated. Clearly,
P(d_0 = d) = P(Bin(d, p) = d) = p^d = Ω(1), (1)
as, by assumption, p and d are constants. Furthermore, since 1 < (d − 1)p = Θ(1) holds, the smallest fixed point of x → (1 − p + px)^{d−1} is a real number between zero and one. Therefore, by Proposition 4.1, there are constants 0 < γ_1 ≤ γ_2 < 1 such that
γ_1 ≤ P(none of the d processes started in the children of ω is extinct at time t | d_0 = d) ≤ γ_2. (2)
Thus, it follows from (1)–(2) that the source detection heuristic succeeds at weak detection if 1 < (d − 1)p = Θ(1). It is left to prove that under the same assumptions we have
P(dist(ω_c, ω) ≥ k) ≤ exp(−Ω(k)). (3)
Suppose that t_ω^X − t_{ω_c}^X = dist(ω_c, ω) = k > 3. Then there is a unique path P_{ω_c,ω} in G that connects ω_c and ω with k − 2 internal vertices. All of those internal vertices get activated exactly once during the spreading process. Therefore, for each of those k − 2 steps, the process needs to either activate exactly one child or activate more than one child such that the remaining processes died out by the observation time. Of course, the probability that a node spawns exactly one active child is given by
P(Bin(d, p) = 1) = dp(1 − p)^{d−1}, (4)
which is a real number bounded away from zero and one if d, p = Θ(1). By (1)–(2) as well as (4), we find that there is a sequence of constants {γ_i}_{i=1,...,k−2}, dependent only on p, d, and k, all of which are uniformly bounded away from zero and one. Therefore,
P(dist(ω_c, ω) = k) ≤ ∏_{i=1}^{k−2} γ_i ≤ exp(−Ω(k)),
which implies (3).

Proof of Theorem 2.2 (iii). The last part of Theorem 2.2 states that the source detection heuristic succeeds at strong detection with high probability if (d − 1)p = ω(1). We start with the following simple observation, which is an immediate consequence of the Chernoff bound applied to the Bin(d − 1, p) distribution.

Fact 4.2. If (d − 1)p = ω(1), then, if node v gets activated, the number of its activated children is ω(1) with high probability.

As a next observation, we claim that if d_0 = ω(1) Galton-Watson processes with offspring distribution Bin(d − 1, p) start independently, at least a (1 − ε)-fraction of them will not become extinct with high probability, for any ε > 0.

Lemma 4.3. Suppose that d_0 = ω(1) Galton-Watson processes with offspring distribution Bin(d − 1, p) start independently, under the condition that (d − 1)p = ω(1). Let Y denote the number of processes that do not ultimately become extinct. Then Y ≥ (1 − ε) d_0 with high probability for any constant ε > 0.

Proof. By Proposition 4.1, the probability that one of the processes becomes extinct is p_e = o(1). Thus, the number of non-extinct processes is Binomially distributed with parameters d_0 and 1 − p_e. Therefore, the lemma follows from the Chernoff bound.

By Fact 4.2 and Lemma 4.3 we directly find that, with high probability, all but o(dp) of the processes started in the children of ω are still active at the observation time. Suppose that ω ≠ ω_c and dist(ω, ω_c) = k ≥ 1. This implies that either k times only exactly one child is activated or, given that multiple children are activated, only exactly one of those spreading processes eventually survived. For a specific step 1 ≤ i ≤ k − 1, the probability that this occurs is, by Fact 4.2 and Lemma 4.3, at most γ_i = o(1). Therefore, P(ω_c ≠ ω | X) = o(1), which implies the third part of Theorem 2.2.

The main proof strategy for Theorem 2.3 is analogous to the proof of Theorem 2.2, with one fundamental difference which makes the analysis more involved. While in the d-regular case a once activated node spawns a random number of independent Galton-Watson processes with offspring distribution Bin(d − 1, p), this is not true in the setting of Theorem 2.3. Here, the underlying network itself is a Galton-Watson tree with offspring distribution Po(λ). Fortunately, the Poisson thinning principle [20] implies that, in distribution, we can analyze the spreading process described next.
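Before turning to that process, we record the standard computation behind the thinning principle (stated as Fact A.1 in the appendix); this derivation is textbook material [20] and is included here only as a reading aid. If X ∼ Po(λ) and, given X, Y ∼ Bin(X, p), then for every k ≥ 0,

```latex
\begin{align*}
\Pr[Y = k]
  &= \sum_{n \ge k} e^{-\lambda} \frac{\lambda^n}{n!} \binom{n}{k} p^k (1-p)^{n-k}
   = e^{-\lambda} \frac{(\lambda p)^k}{k!} \sum_{m \ge 0} \frac{\bigl(\lambda (1-p)\bigr)^m}{m!} \\
  &= e^{-\lambda} \frac{(\lambda p)^k}{k!}\, e^{\lambda (1-p)}
   = e^{-\lambda p} \frac{(\lambda p)^k}{k!},
\end{align*}
```

so that Y ∼ Po(λp): the number of activated children of a node is again Poisson distributed.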
Once v gets activated, it spawns d_0 ∼ Po(λp) active children and thus d_0 independent Galton-Watson processes with offspring distribution Po(λp). The extinction probability x̃_t can be derived similarly as in Proposition 4.1 and reads x̃_t = exp(−λp(1 − x̃_{t−1})). Now we have all ingredients to prove (i)–(iii) of Theorem 2.3. All proofs are based on similar ideas as their counterparts in the proof of Theorem 2.2. The main challenge arising is that, in comparison to the Binomial distribution, the Poisson distribution has heavy tails which must be taken into account. This challenge is technically non-trivial, but standard concentration tools for large deviation analyses suffice in order to prove the result. All omitted proof details can be found in the appendix.

We pin down exact information-theoretic phase transitions in the source detection task on important tree-network models by proving that, as soon as weak detection is possible information-theoretically, the efficient closest candidate heuristic succeeds at this task. Those findings imply, of course, the same result for G(n, p) and random d-regular graphs as long as they are sufficiently sparse and the spreading process ran for only o(ln(n)) steps, as those random networks are then, locally, given by the described tree networks with high probability. Furthermore, we show empirically that the source detection heuristic performs well on cyclic networks and seems to be a very decent and efficient estimator of the rumor's source. A natural question is whether it is possible to prove similar information-theoretic bounds for non-acyclic networks. While this seems to be a very challenging task in general, it might become accessible if we restrict ourselves to specific random networks or networks with a specific tree-width. Finally, on the empirical side, it is an interesting question whether the rumor's source of the Independent Cascade Process can be learned by graph neural networks. This seems challenging, as only few vertices are active and the network would need to learn possible propagation paths in a graph given only a snapshot of the network.
A.1 Proof in the Galton-Watson model, Theorem 2.3

The proof is similar to the proof of Theorem 2.2. The most fundamental difference is that, while in the d-regular case a once activated node spawns a random number of independent Galton-Watson processes with offspring distribution Bin(d − 1, p), this is not true if the underlying network itself is a Galton-Watson tree with offspring distribution Po(λ). The Poisson thinning principle [20] allows us to analyze the resulting distribution.

Fact A.1 (Application of the Poisson thinning principle). Let X ∼ Po(λ) and, given X, define Y = Bin(X, p). Then Y ∼ Po(λp).

Thus, once v gets activated, it spawns d_0 ∼ Po(λp) active children and thus d_0 independent Galton-Watson processes with offspring distribution Po(λp). The following proposition characterizes the extinction probability of such processes.

Proposition A.2. If x̃_t is the probability that at time t there are no more active vertices under the spreading model on a super-critical Galton-Watson tree with offspring distribution Po(λp), we find x̃_t = exp(−λp(1 − x̃_{t−1})), and the ultimate extinction probability of the spreading process is the smallest fixed point of x̃ → exp(−λp(1 − x̃)).

Again, we refer to [11] for a detailed explanation of the connection between the probability generating function and the extinction probability of branching processes. Now, for brevity, suppose that v = ω_c. Let V_0 be the event that v has exactly d_0 ≥ k children that get activated by v. Similarly as before, P(V_0) = P(Po(λp) = d_0), and of course d_0 needs to be at least k, as otherwise the probability of having k active sub-trees would be zero. Given V_0, we again start d_0 independent Galton-Watson processes with offspring distribution Po(λp) in the children. Therefore, the probability of observing exactly k active sub-trees is the probability that exactly k out of d_0 of those processes are not extinct after t_v^X steps. Of course, the number of such active sub-trees at time t is distributed as Bin(d_0, 1 − x̃_t) given V_0, and the first part of the formula follows. As in the d-regular case, if on the contrary v is not the closest candidate but a node further apart from X, we observe that from the originally d_0 ≥ 1 Galton-Watson processes originating in the children of v, exactly one process needs to survive and d_0 − 1 need to be extinct at time t_v^X.

Proof of Theorem 2.3 (i). As in the d-regular case, the first part of Theorem 2.3 follows from the first part of Proposition A.2. If λp ≤ 1, the smallest fixed point of x̃ → exp(−λp(1 − x̃)) is x̃ = 1. Therefore, x̃_t = 1 − o_t(1) describes the probability that the underlying spreading process died out by time step t. More precisely, suppose that ε_t = 1 − x̃_t = o_t(1) denotes the convergence speed towards 1. Then, by the recurrence equation and a Taylor approximation, we have
ε_t = 1 − exp(−λp ε_{t−1}) = λp ε_{t−1} − Θ(ε_{t−1}^2).
If λp < 1, we directly find that ε_t = O((λp)^t) decays exponentially fast in t. If λp = 1, the situation is much more subtle. Indeed, we find ε_t = ε_{t−1} − Θ(ε_{t−1}^2), and therefore we only get ε_t = O(t^{−1}) in this case. Since we assume p to be a constant, clearly λ = O(1) as well. Unfortunately, the Poisson tails are relatively heavy. More precisely, even if λ is a constant, the probability that a Po(λ) variable becomes large is not negligible. We analyze this by a careful application of limits. Recall that we assume the underlying tree network to be infinite. We model this as follows. Suppose that the tree network consists of n vertices and let n → ∞. Let C > 0; then the probability that the number of neighbors of a specific node v exceeds C is, for large C, bounded by Chernoff's bound as
P(deg(v) ≥ C) ≤ exp(−C/2).
As the number of spawned children is independent for all vertices, the number of vertices of degree at least C is stochastically dominated by a Bin(n, exp(−C/2)) random variable. Thus, with probability 1 − o_n(1), there are no more than O(n ln(n) exp(−D)) vertices of degree at least D, for a sufficiently large constant D (independent of n), as n → ∞. We denote by D the event that this is actually true. Thus, conditioned on D, there are only O(n ln(n) exp(−D)) vertices of degree at least D. Now, we choose ω uniformly at random out of all vertices. Therefore, given D, the probability that ω has degree less than D is
P(deg(ω) < D | D) = 1 − O(ln(n) exp(−D)).
Clearly, this becomes a high probability event only if D = Ω(ln ln n). In the worst case, we find that a union bound over all activated children of ω yields ultimate extinction of all processes only if D · ε_t = o(1), or, put differently, if t = Ω(ln ln n). As the theorem only claims the assertion in the limit t → ∞ and we assume the underlying tree network to be infinite, this proves the claim of the theorem. We remark at this point that the assumption that t depends slightly on n does no harm to applications as, on real networks, ln ln n can be seen as a constant.

Proof of Theorem 2.3 (ii). Again, as in the d-regular case, we start by proving the weak detection property of the source detection heuristic. Thus, we aim to prove P(|C| = 1) = Ω(1). This is the case if (and only if) ω propagated the rumor to all of its d_ω ∼ Po(λ) children and all d_0 independent Galton-Watson processes with offspring distribution Po(λp) rooted at the children of ω did not die out eventually. Let d_0 denote the number of children of ω that get activated. We first need to calculate the probability that d_0 = d_ω. To this end, let
I_0(x) = Σ_{k≥0} (x/2)^{2k} / (k!)^2
denote the modified Bessel function of order zero, for which tight analytic approximations are well known (see, for instance, Equation (2) of [24]). By the law of total probability,
P(d_0 = d_ω) = Σ_{d≥0} e^{−λ} (λp)^d / d! = e^{−λ(1−p)}. (5)
If λ and p are constants, it is immediate from (5) that there is a constant γ > 0 such that P(all children of ω get activated) > γ. Finally, since 1 < λp = Θ(1) by assumption, the smallest fixed point of x̃ → exp(−λp(1 − x̃)) is a real number between zero and one. Therefore, by Proposition A.2, there are constants 0 < γ_1 ≤ γ_2 < 1 bounding the probability that none of the processes started in the children of ω has died out by time t. (7) Therefore, by (5)–(7), the source detection heuristic succeeds at weak detection if 1 < λp = Θ(1). Again, we are left to prove the decay property, namely
P(dist(ω_c, ω) ≥ k) ≤ exp(−Ω(k)). (8)
Suppose that t_ω^X − t_{ω_c}^X = dist(ω_c, ω) = k > 3. As previously, we find a unique path P_{ω_c,ω} in G that connects ω_c and ω with k − 2 internal vertices. Exactly as in the d-regular case, each of those internal vertices gets activated exactly once during the spreading process, and so, in every step, the process needs to either activate exactly one child or all but one of the remaining processes ultimately become extinct. The probability that a node spawns exactly one active child is given by
P(Po(λp) = 1) = λp exp(−λp), (9)
which is a real number bounded away from zero and one if λ = Θ(1). By (5)–(7) and (9), there is a sequence of constants {γ_i}_{i=1,...,k−2}, dependent only on p, λ, and k and uniformly bounded away from zero and one, such that P(dist(ω_c, ω) = k) ≤ ∏_{i=1}^{k−2} γ_i, and (8) follows.

Proof of Theorem 2.3 (iii). In order to prove the last part of Theorem 2.3, we start with the following observation.

Lemma A.3. If λp → ∞, then an activated node v satisfies the following.
• deg(v) ≥ λ/2 = ω(1) with high probability,
• the number of activated children ω_v satisfies ω_v ≥ λp/2 = ω(1) with high probability.

Proof. This is an immediate consequence of the Chernoff bound applied to the Po(λ) and, respectively, Po(λp) distributions, given that λp → ∞.

As a next observation, we claim that if ω_v (as given by Lemma A.3) Galton-Watson processes with offspring distribution Po(λp) are initialized independently, at least a (1 − ε)-fraction will survive with high probability for any ε > 0.
Lemma A.4. Suppose that X Galton-Watson processes with offspring distribution Po(γ) start independently, under the condition that γ → ∞. Let Y denote the number of such processes that do not ultimately become extinct. Then Y ≥ (1 − ε)X with high probability for any constant ε > 0.

Proof. By Proposition A.2, the probability that one specific process out of the X processes becomes extinct is p_e = o(1). Thus, as in the d-regular case, the number of non-extinct processes is Bin(X, 1 − p_e)-distributed, and the lemma follows from the Chernoff bound.

As in the d-regular case, the previous lemmas imply that, with high probability, ω(1) of the processes started in the children of ω are still active at the observation time. Suppose that ω ≠ ω_c and dist(ω, ω_c) = k ≥ 1.
This implies that either k times only exactly one active child is spawned or, if multiple children are spawned, only exactly one rumor spreading process rooted in those children eventually survives. As in the d-regular case, the probability of this event in each step is o(1), and therefore P(ω_c ≠ ω) = o(1), which implies the third part of Theorem 2.3.

B Additional Simulation Data

In this appendix we present the additional simulation data for random regular graphs (configuration model) with d = 4 (see Table 2) and random geometric graphs with an expected node degree of 16 (see Table 3). We initialized both networks with n = 10^5 nodes.

References
[1] Balancing spreads of influence in a social network.
[2] Epidemic inference through generative neural networks.
[3] Predicting epidemic evolution on contact networks from partial observations.
[4] Mathematical epidemiology: Past, present, and future.
[5] Graphs with specified degree distributions, simple epidemics, and local vaccination strategies.
[6] Optimal group testing. Proceedings of the Thirty-Third Conference on Learning Theory (COLT).
[7] Locality defeats the curse of dimensionality in convolutional teacher-student scenarios.
[8] Introduction to Random Graphs.
[9] Three unfinished works on the optimal storage capacity of networks.
[10] Generalization of epidemic theory: An application to the transmission of ideas.
[11] Probability and Random Processes.
[12] Information diffusion through blogspace.
[13] Fast rumor source identification via random walks. Social Network Analysis and Mining.
[14] Random Graphs.
[15] An algorithmic framework for estimating rumor sources with different start times.
[16] Approximate identification of the optimal epidemic source in complex networks.
[17] Modeling Infectious Diseases in Humans and Animals.
[18] Maximizing the spread of influence through a social network.
[19] A contribution to the mathematical theory of epidemics.
[20] Poisson Processes.
[21] Information contagion: An empirical study of the spread of news on digg and twitter social networks.
[22] Patterns of cascading behavior in large blog graphs.
[23] Reconstructing parameters of spreading models from partial observations.
[24] Quasi-rational analytic approximation for the modified Bessel function I1(x) with high accuracy.
[25] Causal analysis of COVID-19 spread in Germany.
[26] SIR epidemics on a Bernoulli random graph.
[27] Theory of rumour spreading in complex social networks.
[28] Random Geometric Graphs.
[29] Efficiently spotting the starting points of an epidemic in a large graph.
[30] Modeling spread of disease from social interactions.
[31] Finding patient zero: Learning contagion source with graph neural networks. ArXiv, abs.
[32] Detecting sources of computer viruses in networks: Theory and experiment.
[33] Rumor centrality: A universal source detector. SIGMETRICS Perform.
[34] Information Source Estimation with Multi-Channel Graph Neural Network.
[35] Multiple source detection without knowing the underlying propagation model.
[36] Adversarial teacher-student representation learning for domain generalization.
[37] Gonorrhea: Transmission dynamics and control.
[38] Statistical physics of inference: thresholds and algorithms.
[39] Information propagation in online social networks: a tie-strength perspective.