Diffusion Source Identification on Networks with Statistical Confidence
Quinlan Dawkins, Tianxi Li, Haifeng Xu
2021-06-09

Diffusion source identification on networks is a problem of fundamental importance in a broad class of applications, including rumor controlling and virus identification. Though this problem has received significant recent attention, most studies have focused only on very restrictive settings and lack theoretical guarantees for more realistic networks. We introduce a statistical framework for the study of diffusion source identification and develop a confidence set inference approach inspired by hypothesis testing. Our method efficiently produces a small subset of nodes which provably covers the source node with any pre-specified confidence level, without restrictive assumptions on network structure. Moreover, we propose multiple Monte Carlo strategies for the inference procedure, based on network topology and the probabilistic properties of the model, that significantly improve scalability. To our knowledge, this is the first diffusion source identification method with a practically useful theoretical guarantee on general networks. We demonstrate our approach via extensive synthetic experiments on well-known random network models and on a mobility network between cities concerning the spread of COVID-19.

One pressing problem today is the spreading of misinformation or malicious attacks/viruses in various cyberspaces. For example, rumors and fake news on social networks may result in many serious political, economic, and social issues (Vosoughi et al., 2018). Viruses that spread via emails and computer communication may cause severe privacy and leakage problems (Newman et al., 2002; Halperin & Almogy, 2002; Xu & Ren, 2016).
In such events, the negative impacts stem from a few source users/locations and then spread over the network via a diffusion process. One crucial step in reducing the loss from such an event is to quickly identify the sources so that counter-measures can be taken in a timely fashion. Though early efforts have been made on this important problem with motivations from various domains, systematic research only began recently, arguably starting from the seminal work of Shah & Zaman (2011), which proposed the rumor center estimator, locatable by an efficient message-passing algorithm with linear time complexity. Despite the significant interest and progress on this problem in recent years (Shah & Zaman, 2012; Dong et al., 2013; Khim & Loh, 2016; Bubeck et al., 2017; Yu et al., 2018; Crane & Xu, 2020), many challenges remain unaddressed. First, the theoretical understanding of these methods is currently available only under very restrictive and somewhat unrealistic structural assumptions on the networks, such as regular trees. This is perhaps partially explained by the well-known computational hardness of probabilistic inference for diffusion processes on general graphs (Shapiro & Delgado-Eckert, 2012). Therefore, intuitive approximations have been used for general networks (Nguyen et al., 2016; Kazemitabar & Amini, 2020). However, such methods lack theoretical guarantees. Second, even for regular trees, the available performance guarantee is far from useful in practice. Even in the most idealized situation of infinite regular trees, the probability that the rumor center is correct is almost always below 0.3 (Shah & Zaman, 2011; Dong et al., 2013; Yu et al., 2018). For general graphs, as we show later, the success rate of such single-point estimation methods is too low to be practical.
To guarantee a higher success probability, a typical approach, as in both machine learning theory (Valiant, 1984) and data-driven applied models (LeCun et al., 2015), is to obtain more data. However, a fundamental challenge in diffusion source identification (DSI) is that the problem by nature has only one snapshot of the network information, i.e., the earliest observation of the infection status of the network. Therefore, compared to classic learning tasks, DSI poses a fundamentally different challenge for inference. It is this crucial understanding that motivates our adoption of a different statistical inference technique: the confidence set. Previous systematic statistical studies have adopted the confidence set approach for DSI on trees (Bubeck et al., 2017; Khim & Loh, 2016; Crane & Xu, 2020). Though they enjoy good theoretical properties, those methods are applicable only to infinite trees. This paper aims to bridge the gap between practically useful algorithms and theoretical guarantees for the DSI problem. We introduce a new statistical inference framework that provably includes many previous methods (Shah & Zaman, 2011; Nguyen et al., 2016) as special cases. Our new framework not only highlights the drawbacks of previous methods but, more importantly, also leads to the design of our confidence set inference approach with a finite-sample theoretical guarantee on any network structure. As a demonstration, consider the spread of COVID-19 in early 2020. Figure 1 shows a travel mobility network between 49 major cities in China, constructed from the two-week travel volume (Lab, 2020; Hu et al., 2020) before the virus caught wide attention. The square nodes (21 out of 49) are all cities with at least five confirmed cases of the virus on Jan 24, 2020.
The DSI problem is: given only knowledge of the mobility network and of which cities have detected a notable number of confirmed cases (in this case, at least 5), can we identify the city in which the virus was first detected? This problem turns out to be too difficult for precise identification. None of the single-point source identification methods under evaluation can successfully identify Wuhan, due to its relatively non-central position in the network (details in Section 5). Nevertheless, both our 80% and 90% confidence sets cover Wuhan correctly, giving recommendations of 6 and 11 nodes (out of 49 cities), respectively. In fact, evaluation over the whole week after the lockdown of Wuhan reveals that both confidence sets correctly cover Wuhan on all seven days, while the single-point estimation methods are rarely effective. Such a result evidently shows the necessity of adopting the confidence set approach and the effectiveness of our solution. Our contributions in this paper are threefold. 1. We introduce an innovative statistical framework for the DSI problem. It includes several previous methods as special cases but has the potential for more effective inference. 2. Under our framework, we propose a general way to construct the source node confidence set, whose validity is guaranteed for finite sample sizes and any network structure. To the best of our knowledge, it is the first DSI method with a theoretical performance guarantee on general networks. 3. We propose techniques that dramatically improve the computational efficiency of our inference algorithm. En route, we develop a generalized importance sampling method, which may be of independent interest. A high-level message of the paper is that the confidence set approach, which has not received adequate attention in the machine learning literature, can be an important tool for inference tasks, especially for challenging problems with limited available data.
We start by formalizing the Diffusion Source Identification (DSI) problem, introduced in the seminal work of Shah & Zaman (2011). Consider a network G with node set V = {1, ..., n} and edge set E. For ease of presentation, we focus on unweighted and undirected networks, but it is straightforward to generalize the model and our framework to weighted networks. We write (u, v) ∈ E if nodes u and v are connected. The network can be equivalently represented by its n × n binary adjacency matrix A, where A_uv = A_vu = 1 if and only if (u, v) ∈ E. There is a source node s* ∈ V on the network G initiating a diffusion of a certain effect (a rumor, fake news, or some virus) over G. We embed our inference of the diffusion procedure in the widely adopted "Susceptible-Infected" (SI) model (Anderson & May, 1992; Shah & Zaman, 2011), though our approach can easily be tailored to other diffusion procedures as well. In the SI model, the source node s* is the only "infected" node initially. The infection diffuses as follows: given the set of currently infected nodes after t − 1 infections, the next infection happens by sampling uniformly at random one of the edges connecting an infected node and a susceptible node. Consequently, a full diffusion path with T infections can be represented by a sequence of T + 1 nodes in the infection order. We define the diffusion path space to be

Z_T = {z = (z_0, z_1, ..., z_T) : z_0 ∈ V, and each z_t with t ≥ 1 is adjacent to at least one of z_0, ..., z_{t−1}}.

However, in practice, when the occurrence of the infection is noticed, we have already lost the information about the diffusion path. Instead, the available data contain only the snapshot of the current infection status of the network, without the infection order. Formally, the data can be represented as an n-dimensional binary vector y with y_i = 1(i is infected) ∈ {0, 1}, where 1 is the standard indicator function. Therefore, the sample space of the DSI problem can be defined as

Y_T = {y ∈ {0, 1}^n : Σ_i y_i = T + 1 and the set {i : y_i = 1} induces a connected subgraph of G}.
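To make the SI sampling rule concrete, here is a minimal Python sketch of one draw from Z_T; the adjacency-dict representation and the function name are our own illustrative choices, not the paper's:

```python
import random

def sample_si_path(adj, source, T, rng=None):
    """Sample one SI diffusion path with T infections from `source`.

    adj: dict mapping each node to a list of its neighbors.
    At every step, one edge between an infected and a susceptible node
    is drawn uniformly at random, matching the SI rule described above.
    """
    rng = rng or random.Random()
    infected = {source}
    path = [source]
    # Multiset of boundary edges (infected node, susceptible neighbor).
    boundary = [(source, v) for v in adj[source]]
    for _ in range(T):
        if not boundary:            # diffusion exhausted the component
            break
        u, v = rng.choice(boundary)  # uniform over boundary edges
        infected.add(v)
        path.append(v)
        # Drop edges now internal; add v's edges to susceptible nodes.
        boundary = [e for e in boundary if e[1] != v]
        boundary += [(v, w) for w in adj[v] if w not in infected]
    return path
```

Note that the boundary list holds one entry per (infected, susceptible) edge, so a susceptible node adjacent to several infected nodes is proportionally more likely to be infected next, which is exactly the uniform-over-edges rule of the SI model.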
Equivalently, we will also think of any y ∈ Y_T as an infected subset of nodes V_I ⊂ V of size T + 1. The DSI problem can then be defined as follows. Definition 1 (Diffusion Source Identification). Given one sample y ∈ Y_T, identify the source node s* of the diffusion process that generated y. Challenges. The challenge of DSI intrinsically arises from the loss of information in the observed data. Specifically, by definition, we have a many-to-one mapping ζ : Z_T → Y_T, such that ζ(·) maps a diffusion path to the corresponding infection snapshot of the network. Information about the infection order has been lost upon the observation of the data y. Nevertheless, the DSI problem looks to identify the first node in the infection order, with access to only one snapshot of the infection status. Note that obtaining multiple snapshots over time does not reduce the difficulty of DSI: given the current snapshot, later observed data carry no additional information about the source node, due to the Markov property of the SI model. We start by formulating DSI within a systematic statistical framework, which will help in our design of better inference methods later on. Treating the network G as fixed and s* as the model parameter, the probability of generating data y ∈ Y_T can be represented by P_{s*}(Y = y) = p(y|s*), where the random variable Y denotes the observed data. The identification of s* can then be treated as a parameter estimation problem. Specifically, we consider the following general parameter estimation framework. Given any discrepancy function ℓ : Y_T × Z_T → [0, ∞), we want to find an estimator of s* based on the following optimization problem:

ŝ = argmin_{s ∈ V} E_s ℓ(y, Z),   (1)

in which Z ∈ Z_T is the random diffusion path following the SI model starting from parameter s, and E_s denotes the expectation over Z. That is, we select the s for which the random diffusion path Z it generates has the minimum expected discrepancy from our observed data y. Remark 1.
An important design choice here is that the discrepancy function is defined on Y_T × Z_T, not on Y_T × Y_T. That is, y is compared with the random diffusion path, not merely the snapshot induced by the path. This is because Z contains richer information about the diffusion process. As we show later, this turns out to be crucial for designing effective discrepancy functions. Notice that our framework includes a few previous methods as special cases. Due to the space limit, all formal proofs in this paper are deferred to the Appendix; intuition and explanations are provided as needed. Proposition 1. 1. If ℓ_rc(y, z) = 1 − 1(y = ζ(z)), then when the network is an infinite regular tree, procedure (1) gives the rumor center of Shah & Zaman (2011). 2. If ℓ_se(y, z) = ||y − ζ(z)||_2^2, the squared Euclidean distance, procedure (1) recovers the estimator of Nguyen et al. (2016). Note that ℓ_rc rewards only exact matches between the observed snapshot and the sampled one; therefore, such a function may not be sufficiently sensitive for general networks. From this perspective, ℓ_se is potentially better. Second, and importantly, both of the above discrepancy functions depend only on ζ(z), failing to leverage the diffusion order of z. Ignoring such information may also undermine performance. To overcome these drawbacks, we propose the following family of discrepancy functions as a better alternative, which we call the canonical family of discrepancy functions. Definition 2 (Canonical Discrepancy Functions). Consider the class of discrepancy functions that can be written in the form

ℓ_h(y, z) = C − Σ_{v : y_v = 1, v ∈ z} h(t_z(v)),   (2)

where t_z(v) denotes the position of node v in the infection order of z, h is a non-negative weight function, and C is a constant keeping ℓ_h non-negative. The canonical form (2) is essentially a negative similarity function. It incorporates both the infection status and the infection order of z. The weight function h incorporates the diffusion order such that if z deviates from y at an early stage, the deviation is treated as a stronger signal of their discrepancy than when they deviate only at a later stage of the diffusion. Conceptually, this canonical family is general enough to incorporate the needed information about the diffusion process.
In addition, as shown in Section 3.4, it admits fundamental properties that make the computation very efficient. As a special case, ℓ_se is equivalent to a discrepancy function with h(t_z(v)) ≡ 2, since

ℓ_se(y, z) = ||y − ζ(z)||_2^2 = 2(T + 1) − Σ_{v : y_v = 1, v ∈ z} 2.

In this paper, we are particularly interested in the following natural configuration of the discrepancy function, which we call the "Averaged Deviation - Inverse Time" (ADiT): it takes the canonical family form (2) with the inverse time weights h(t) = 1/t. In Table 1 of Section 5, we show the simulation performance of single-point estimation under our framework compared to other methods. Though our methods demonstrate improvements, the accuracy is universally low in all situations for all methods. This observation indicates that it is generally impossible to recover the source node by a single estimator with high accuracy. Indeed, as shown in Shah & Zaman (2011), Dong et al. (2013), and Yu et al. (2018), even on the ideal infinite regular tree, for which the rumor center is proved to be optimal in the MLE sense, the probability of correct source node identification is still low (≤ 0.3). Such a low accuracy is far from useful in real-world applications, suggesting the necessity of developing alternative forms of inference, which we embark on in the next section. As mentioned previously, single-point estimators suffer from low success rates, rendering them unsatisfactory in real-world applications. To identify the source node with a nontrivial performance guarantee, we propose constructing a small subset of nodes that provably contains the source node with any pre-defined confidence. This insight motivates our use of the confidence set as the DSI method. Definition 3. Let Y be the random infection status of the stochastic diffusion process starting from s*.
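As an illustration, the ADiT discrepancy is straightforward to evaluate given a snapshot and a path. In this sketch we count positions from 1 and drop the additive constant C, neither of which changes the minimizer in (1); both conventions are ours, not necessarily the paper's exact ones:

```python
def adit_discrepancy(y, z):
    """ADiT: canonical-form discrepancy with inverse-time weights h(t) = 1/t.

    y: set of observed infected nodes; z: a diffusion path, i.e. a list of
    nodes in infection order with the source first.  Nodes on which y and z
    agree reduce the discrepancy, with early agreement weighted more heavily.
    """
    return -sum(1.0 / t for t, v in enumerate(z, start=1) if v in y)
```

For example, `adit_discrepancy({0, 1}, [0, 1, 2])` gives −(1/1 + 1/2) = −1.5, while a path agreeing with y only at its third position would score just −1/3.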
A level 1 − α confidence set of the source node is a random set S(Y) ⊂ V depending on Y for which

P_{s*}(s* ∈ S(Y)) ≥ 1 − α.

Surprisingly, the idea of using a confidence set to infer the diffusion source, though arguably a natural one in statistics, has not been explored much in the context of DSI. The works most relevant to ours are probably Bubeck et al. (2017), Khim & Loh (2016), and Crane & Xu (2020). Bubeck et al. (2017) considered identifying the first node of a growing tree, but not a diffusion process. Khim & Loh (2016) extended the idea to the SI model, but only for infinite regular trees and an asymptotic setting. Despite its theoretical merits, this method is not practical. For example, even in the situation of an infinite 4-regular tree as the network structure, applying the method of Khim & Loh (2016) would indicate a confidence set of size 4^11 ≈ 5 × 10^6, regardless of the infected size T. This is far too large for almost any application, let alone the fact that the infinite regular tree itself is unrealistic. Crane & Xu (2020) makes the inference more effective but still relies on the tree-structure assumption. We instead take a completely different yet natural approach based on our statistical framework for the problem. To ensure the validity of the inference for any network structure, we rely on the general statistical inference strategy for confidence set construction. We first introduce a testing procedure for the hypothesis H_0 : s* = s against the alternative hypothesis H_1 : s* ≠ s. Given a discrepancy function ℓ, data y, and the node s under evaluation, define the test statistic to be our loss T_s(y) = E_s ℓ(y, Z). Then the p-value of the test is defined to be

ψ_s(y) = P_s(T_s(ζ(Z)) ≥ T_s(y)),

where the probability P_s is over the randomness of the path Z generated from the random diffusion process starting from s.
The p-value is the central concept in statistical hypothesis testing, and it gives a probabilistic characterization of how far the observed y deviates from the expected range of random paths that are truly from s (Lehmann & Romano, 2006). For a level 1 − α confidence set, we compute ψ_s for all nodes s and construct the confidence set by

S(y) = {s : ψ_s(y) > α}.   (5)

The following result guarantees the validity of the confidence set constructed above. Theorem 1. The confidence set constructed by (5) is a valid 1 − α confidence set. Notice that Theorem 1 is a general result, independent of the network structure or the specific test statistic we use. However, validity gives only one aspect of the inference. We would also like the confidence set to be small in practice, since a small set narrows down the investigation more effectively. The confidence set size depends on the network structure and the corresponding effectiveness of the discrepancy function (the test statistic). We will use the proposed ADiT to define our test statistic; as shown in our empirical study, it gives excellent and robust performance across various settings. The exact evaluation of the statistic T_s(y) and the p-value ψ_s is infeasible for general graphs, since the probability mass function of the SI model is intractable. To overcome this barrier, we resort to the Monte Carlo (MC) method for approximate calculation, with details in Algorithm 1. This vanilla version turns out to be computationally inefficient; however, we will introduce techniques to significantly improve its computational efficiency afterwards. Remark 2 (Monte Carlo setup). Note that we have two layers of Monte Carlo evaluation. The first layer is the loss function calculation in (6) and (7), while the second layer is the p-value evaluation (8). The first layer shares the same m samples. This differs from classical Monte Carlo but does not break the validity of the p-value calculation.
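The two-layer Monte Carlo setup of Remark 2 can be sketched as follows, with the path sampler and discrepancy function passed in as arguments; this is a simplified illustration of the scheme, not the authors' reference code:

```python
def mc_p_value(sample_path, discrepancy, y, m):
    """Two-layer Monte Carlo estimate of the p-value psi_s(y).

    sample_path(): draws one diffusion path from the candidate source s.
    2m paths are drawn: the last m (shared across evaluations) estimate
    the test statistic T_s(.), the first m supply its null distribution.
    """
    paths = [sample_path() for _ in range(2 * m)]
    ref = paths[m:]                  # shared samples for loss estimation

    def t_stat(obs):                 # T-hat_s(obs): average discrepancy
        return sum(discrepancy(obs, z) for z in ref) / m

    t_y = t_stat(y)
    # Snapshot of a path = its infected set; compare null statistics to t_y.
    null = [t_stat(set(z)) for z in paths[:m]]
    return sum(1 for t in null if t >= t_y) / m
```

Note the O(m^2) cost of the nested averages per candidate node; this is exactly the bottleneck that the caching strategy for canonical discrepancy functions removes.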
The properties of p-value calculation by the Monte Carlo method have been studied in detail by Jockel (1986) and Besag & Clifford (1989).

Algorithm 1: Vanilla MC for Confidence Set Construction
1: Input: MC sample number m, confidence level α
2: Input: Network G, data y, discrepancy function ℓ
3: for each infected node s ∈ y do
4:   Generate 2m samples z_i ∈ Z_T, i = 1, ..., 2m, from the T-round diffusion process with source s on G.
5:   Estimate the expected loss T_s(y) of data y as T̂_s(y) = (1/m) Σ_{i=m+1}^{2m} ℓ(y, z_i).   (6)
6:   For each path z_j, j = 1, ..., m, estimate T_s(ζ(z_j)) as T̂_s(ζ(z_j)) = (1/m) Σ_{i=m+1}^{2m} ℓ(ζ(z_j), z_i).   (7)
7:   Estimate the p-value ψ_s(y) as ψ̂_s(y) = (1/m) Σ_{j=1}^{m} 1(T̂_s(ζ(z_j)) ≥ T̂_s(y)).   (8)
8: end for
9: return the level 1 − α confidence set C_α(y) = {s ∈ V_I : ψ̂_s(y) > α}

Remark 3 (Choice of the sample number m). In theory, the computation in Algorithm 1 is exact when m → ∞. In practice, simple guidance on the choice of m can be derived as follows. The critical step in Algorithm 1 is Step 7, the p-value calculation, since the MC errors from the previous steps are usually of lower order. For correctness, we only need to worry about the evaluation at node s* when the true p-value is close to α. Step 7 averages over m indicators; by the central limit theorem, the MC estimate misses the true p-value by at most roughly 2√(α(1 − α)/m). For example, if we are aiming for a 90% confidence set, where α = 0.1, setting m = 10000 implies that the MC misses the target confidence level by at most about 0.006, which is usually good enough in most applications. In our experiments, we use m = 10000, and it has been sufficient in all situations. Notice that this recommendation is more conservative than those used in classical statistical inference problems (Jockel, 1986). In our experience, it may still be acceptable to use a smaller m. Remark 4 (Time complexity of the vanilla MC, and its trivial parallelization).
The time complexity of a standard sequential implementation of Algorithm 1 is Õ(mT² + m²T²): (1) the first term is due to the MC sampling (Bringmann & Panagiotou, 2017); (2) the second term comes from the statistic calculation (7) given the MC samples. However, our algorithm can be trivially parallelized. In particular, the for-loop in Step 3 can be distributed across different s ∈ V_I without any communication. This leads to a parallel time complexity of Õ(mT + m²T). It is worthwhile to compare this time cost with the rumor center of Shah & Zaman (2011), which has Õ(dT) linear complexity, where d is the maximum node degree, but whose algorithm has to be sequential (thus non-parallelizable). In summary, Algorithm 1 has a better dependence on the network density, captured by d, but an additional quadratic dependence on the number of samples m. A major computational bottleneck of Algorithm 1 is the O(m²T) time for estimating E_s(ℓ(y, Z)) in Equation (7): we have to compute T̂_s for m samples, and each is the average over another m samples. Fortunately, it turns out that for the canonical discrepancy family this step can be done in O(mT) time, highlighting another advantage of our proposed family of discrepancy functions. Instead of computing T̂_s in Equation (7) by summing over the samples i = m + 1, ..., 2m, we can compute T̂_s directly using only "summary information" about these samples that can be computed and cached in advance. This insight rests on the following alternative representation of the T̂_s(y) function in Equation (7):

T̂_s(y) = C − (1/m) Σ_{v : y_v = 1} Σ_{k=1}^{T} M_{v,k} h(k),

where M_{v,k} counts the total number of samples among z_{m+1}, ..., z_{2m} in which node v is the k-th infected node of the infection path. Let M ∈ R^{n×T} be the matrix containing the entries M_{v,k}. Note that there are at most mT nonzero entries in M, since each sample has only T nodes. These entries can be computed in O(mT) time simply by updating the corresponding M_{v,k} entries during sampling.
With these nonzero M_{v,k} entries, we can then compute ĥ(v) = Σ_{k=1}^{T} M_{v,k} h(k) for all v that appear in our samples, in O(mT) time. Finally, given these ĥ(v), we can compute any T̂_s(y) in O(T) time for y = ζ(z_1), ..., ζ(z_m), which in total takes an additional O(mT) time. The whole computation therefore takes O(mT) time. In subsection 3.4, we reduced the computation time for estimating a single p-value to Õ(mT), which is arguably the minimum possible in our framework, since merely generating m samples already takes Õ(mT). In this section, we introduce efficient strategies to reduce the other major computational cost of our algorithm: the MC sampling. Our techniques "borrow" the MC samples of one node for the inference task of another node, by leveraging the network structure and the properties of the SI model. Consequently, we only need to generate MC samples for a subset of the infected nodes, which can effectively reduce the computational cost. A node with only one connection in the network is called a single-degree node. Suppose node u ∈ V_I is a single-degree node whose only neighbor v_0 is also infected. Since any diffusion process starting from u must pass through v_0, we can use the distribution of paths from v_0 to infer the distribution of paths from u. However, the converse is not true: a diffusion path from v_0 may not pass through u, and even if it does, this may not occur as the first infection. Therefore, a certain mapping is needed to connect the two diffusion processes. The following theorem formalizes this intuition. Theorem 2. Let u be a single-degree node in the graph G with only neighbor v_0. If a path z ∈ Z_T starting from v_0 contains u, say z = (v_0, s_1, s_2, ..., s_{K−1}, u, s_{K+1}, ..., s_T), define z's matching path from u as

f_u(z) = (u, v_0, s_1, ..., s_{K−1}, s_{K+1}, ..., s_T).   (10)

In this case, the likelihood ratio between z and f_u(z) is

p(f_u(z)|u) / p(z|v_0) = |∂A_{K−1}| · Π_{j=0}^{K−2} [ |∂A_j| / (|∂A_j| − 1) ],   (11)

where A_j = {v_0, s_1, ..., s_j} (so A_0 = {v_0}) and ∂A denotes the set of edges between A and V \ A. If a path z from v_0 does not contain u, we define the ratio p(f_u(z)|u)/p(z|v_0) to be 0.
Notice that all terms on the right-hand side of (11) are available when we sample a path from the diffusion process starting at v_0; thus, given a sampled path z, computing the likelihood ratio introduces only negligible computational cost. Intuitively, according to Theorem 2, when the MC samples of v_0 are available, they can be used to compute the p-value for node u based on an idea similar to importance sampling (L'Ecuyer & Owen, 2009). However, regular importance sampling cannot be applied directly, because the likelihood ratio is only available between z and f_u(z) under the mapping f_u. Therefore, we need a generalized version of importance sampling. We name this procedure surjective importance sampling and give its property in the following theorem. We believe this theorem could be of general interest beyond our context. Theorem 3 (Surjective Importance Sampling). Suppose p_1 and p_2 are two probability mass functions for a discrete random vector Z, defined on C_1 and C_2, respectively. Let E_1 and E_2 denote the expectations with respect to p_1 and p_2. Given a surjection φ : C̃_1 → C_2 defined on a subset C̃_1 ⊂ C_1, define the inverse mapping φ^{−1}(z̃) = {z ∈ C̃_1 : φ(z) = z̃} for any z̃ ∈ C_2. For a given bounded real function of interest g, let η = E_2[g(Z)] and define

η̂ = (1/m) Σ_{i=1}^{m} g(φ(Z_i)) p_2(φ(Z_i)) / ( |φ^{−1}(φ(Z_i))| p_1(Z_i) ),

where Z_1, Z_2, ..., Z_m is a size-m i.i.d. sample from distribution p_1, and if Z_i ∉ C̃_1, we define p_2(φ(Z_i)) = 0. We have lim_{m→∞} η̂ = η a.s. Notice that standard importance sampling is a special case of Theorem 3 in which φ is the identity mapping. Theorems 2 and 3 together serve as the cornerstone for our use of the MC samples from v_0 to make inference about u. Corollary 1. For a single-degree node u and its neighbor v_0, let z_i, i = 1, ..., m, be m i.i.d. paths generated from the diffusion process with source v_0. For any bounded function g, we have

lim_{m→∞} (1/m) Σ_{i=1}^{m} g(f_u(z_i)) p(f_u(z_i)|u) / ( |f_u^{−1}(f_u(z_i))| p(z_i|v_0) ) = E_u[g(Z)] a.s.,

in which f_u(z_i) and the likelihood ratio are given by Theorem 2.
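To see the surjective importance sampling identity in action, the following toy computation checks exact unbiasedness on a three-point space, assuming the estimator weight takes the form p_2(φ(z)) / (|φ^{-1}(φ(z))| · p_1(z)) (our reading of Theorem 3); the distributions and the function g are arbitrary illustrative choices:

```python
from fractions import Fraction

C1 = ["a", "b", "c"]
p1 = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
phi = {"a": "x", "b": "x", "c": "y"}            # surjection C1 -> C2
p2 = {"x": Fraction(1, 3), "y": Fraction(2, 3)}
g = {"x": 5, "y": 11}

# Fiber sizes |phi^{-1}(z_tilde)| for each point of C2.
fiber = {zt: sum(1 for z in C1 if phi[z] == zt) for zt in p2}

# E_1 of the weighted estimator, computed exactly over C1:
estimate = sum(p1[z] * g[phi[z]] * p2[phi[z]] / (fiber[phi[z]] * p1[z])
               for z in C1)
target = sum(g[zt] * p2[zt] for zt in p2)        # eta = E_2[g]
assert estimate == target                        # unbiasedness holds exactly
```

The 1/|φ^{-1}(·)| factor is what corrects for the mapping being many-to-one: without it, each point of C_2 would be counted once per preimage.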
Based on Corollary 1, when g(z) = ℓ(y, z) or g(z) = 1(T_u(ζ(z)) ≥ T_u(y)), E[g] corresponds to the test statistic T_u(y) or the p-value ψ_u(y), respectively. Consequently, the MC sampling for u can be avoided. Instead, to find the p-value for u, Equation (7) in Algorithm 1 is replaced by the importance-weighted average

T̂_u(ζ(f_u(z_j))) = (1/m) Σ_{i=m+1}^{2m} w_i ℓ(ζ(f_u(z_j)), f_u(z_i)),

and Equation (8) is replaced by

ψ̂_u(y) = (1/m) Σ_{j=1}^{m} w_j 1(T̂_u(ζ(f_u(z_j))) ≥ T̂_u(y)),

where z_j, j = 1, ..., 2m, are the MC samples generated from v_0 and w_i denotes the surjective importance weight of z_i given by Corollary 1. The same operation can be used for T̂_u(y). The computational strategy for canonical discrepancy functions can also be extended to this setting (see Appendix C). When the network structure is in some sense "symmetric" for two nodes, the inferential properties of the MC samples from one node can be viewed as stochastically equivalent to those of the MC samples from the other node after the symmetric reflection. We call such a property isomorphism. Denote node u's k-th order neighborhood, the set of all nodes exactly k hops away from u, by N_k(u). The following definition of isomorphism rigorously formulates the aforementioned idea. Definition 4. Two nodes u, v in a network are first-order isomorphic if there exists a permutation π : {1, 2, ..., n} → {1, 2, ..., n} such that π(u) = v, π(w) = w for all w ∉ {u, v} ∪ N_1(u) ∪ N_1(v), and A_{π,π} = A, where A_{π,π} is the matrix resulting from applying the permutation π to the rows and columns of A simultaneously. For illustration, consider the simplified case of isomorphism in which u and v have exactly the same connections. In this case, π only swaps u and v and remains the identity mapping for all other nodes. For such a pair u, v, the diffusion process properties would be the same if we swapped the positions of u and v. Definition 4 is more general than this simplified case, as it also allows permutations of the first-order neighbors. Under this definition of isomorphism, the following theorem shows that we can use the MC samples from one node to make inference about its isomorphic nodes after applying the permutation. Theorem 4.
Suppose u and v are first-order isomorphic under the permutation π, and let Z = (u, v_1, v_2, ..., v_{T−1}) be a random diffusion path from source u. Define the permuted path Z_π = (π(u), π(v_1), ..., π(v_{T−1})). Then Z_π has the same distribution as a random diffusion path from source v. To use the MC samples of one node for its isomorphic nodes according to Theorem 4, we need an efficient algorithm to identify all isomorphic pairs and the corresponding permutations. Directly checking Definition 4 is costly. To speed up the computation, we identify necessary conditions for isomorphism in Proposition 2. Proposition 2. If u and v are first-order isomorphic, we must have d_u = d_v and D_1(u) = D_1(v), where d_u and d_v are the degrees of u and v, and D_1(u) and D_1(v) are the degree sequences (sorted in ascending order) of N_1(u) and N_1(v). Furthermore, u and v have the same second-order neighbor sets; that is, N_2(u) = N_2(v). Based on Proposition 2, we can efficiently identify isomorphism using pre-screening steps, which turns out to significantly speed up our computation. Details of the procedure are described in Algorithm 2 in Appendix B. With the isomorphic relations available, we can partition the nodes into isomorphic groups. MC sampling is then needed for only one node in each group, and the MC samples can be shared within the group according to Theorem 4. Specifically, suppose Z_1, ..., Z_{2m} are sampled from the diffusion process starting from u. If u and v are isomorphic with permutation π, we can use (Z_1)_π, (Z_2)_π, ..., (Z_{2m})_π as the MC samples of v in Algorithm 1. Definition 4 can be extended to higher-order neighborhoods, identifying more isomorphic pairs. However, the complexity of identifying such pairs increases exponentially with the neighborhood order, which may overwhelm the time saved on MC sampling. First-order isomorphism turns out to give the most desirable tradeoff in terms of computational efficiency.
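The necessary conditions of Proposition 2 suggest a cheap pre-screening pass: hash each node by its degree and the sorted degree sequence of its neighbors, and only run the exact permutation check within groups sharing a signature. A sketch (function name ours):

```python
from collections import defaultdict

def isomorphism_candidates(adj):
    """Group nodes by the necessary conditions (d_u, D_1(u)) of Proposition 2.

    adj: dict mapping each node to a list of its neighbors.
    Returns groups of nodes sharing the same degree and the same sorted
    neighbor-degree sequence; only these can contain first-order isomorphic
    pairs, so the exact (more expensive) check runs within groups only.
    """
    groups = defaultdict(list)
    for u, nbrs in adj.items():
        signature = (len(nbrs), tuple(sorted(len(adj[v]) for v in nbrs)))
        groups[signature].append(u)
    # Singleton signatures cannot contain isomorphic pairs; discard them.
    return [g for g in groups.values() if len(g) > 1]
```

On a star graph, for example, all leaves share one signature while the hub is alone, so only leaf pairs proceed to the exact permutation check.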
In this section, we evaluate our proposed methods on well-studied random network models. We generate networks from three random network models: random 4-regular trees, the preferential attachment model (Barabási & Albert, 1999), and the small-world (S-W) network model (Watts & Strogatz, 1998). In network science, the preferential attachment model is usually used to model the scale-free property of networks, which is conjectured by many to be ubiquitous in real-world networks (Barabási, 2013); the small-world property is believed to be prevalent in social networks (Watts & Strogatz, 1998). The network size is N = 1365 (the size of a regular tree with degree 4 and depth 6) with T = 150. The networks are sparse, with average degree below 4. The Monte Carlo size m is 4000. The source node in all settings is set to be the one with median eigen-centrality, to be more realistic. All reported results are averaged over 200 independent replications. The single-point estimation results are shown in Table 1. Though the two estimators based on our framework are always better, the overall message of the table is not promising: all of the methods, including ours, give accuracy that is too low to be useful in applications. Such a negative result convincingly shows that the DSI problem is generally too difficult for the single-point estimation strategy to work, and exploring the alternative confidence set inference is necessary. Table 2 shows the coverage rates of the confidence sets, with the squared Euclidean distance and the ADiT as the discrepancy functions. Notably, the proposed confidence set procedure delivers the desired coverage in all settings (up to the simulation error). Meanwhile, the size of the confidence set varies substantially depending on the network structure. For regular trees and scale-free networks, the ADiT works much better than the Euclidean distance, indicating that the diffusion order is informative for this type of network structure. For the small-world networks, the two are very similar.
This may indicate that for well-connected networks, the diffusion order is less informative. In general, we believe the adaptivity of the ADiT-based confidence set makes it always preferable. To obtain a comprehensive view of the tradeoff between set size and confidence level, Figure 2 shows the relationship between the confidence set's average size and the confidence level. The relation is slightly super-linear. In connection with the single-point estimation results, notice that for small-world networks, the confidence set at confidence level 20% has an average size of around 1. In contrast, the regular tree and preferential attachment networks are more difficult: to guarantee even a 10% level, the average size of the confidence set is already about 5. These observations corroborate the results in Table 1 and support our argument that, in general, inferring the source by a single-point estimator is hopeless. Finally, we evaluate the timing improvements achieved by the pooled MC strategies. As expected, their power depends on the network structure. The timing comparison for the pooled MC strategies is included in Table 3. Only the sequential version of our method is timed, for a fair comparison with the rumor center. As can be seen, with both pooled MC strategies in use, the timing is reduced by about 60% for the tree structure and the preferential attachment networks, but the effect on small-world networks is negligible. In applications, network structures are often observed to exhibit core-periphery patterns (Miao & Li, 2021), for which the pooled MC strategies are expected to be effective. For reference, the average time for finding the rumor center is about 1.25 seconds. With the extra computational cost, however, our method provides confidence sets at any specified level, with guaranteed accuracy for any network structure. We believe this is generally a wise tradeoff.
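To illustrate the overall procedure evaluated above, the following self-contained Python sketch pairs a uniform-edge SI diffusion sampler (matching the process described in the appendix proofs) with a Monte Carlo confidence set. The hop-distance discrepancy below is only an illustrative stand-in for the paper's squared Euclidean and ADiT statistics, and all function names are our own.

```python
import random
from collections import deque


def sample_diffusion(adj, source, T, rng):
    """Sample one SI-style diffusion until T nodes are infected: at each
    step, an edge from an infected node to an uninfected neighbor is drawn
    uniformly at random (the uniform-edge process used in the proofs)."""
    infected = {source}
    order = [source]
    while len(order) < T:
        frontier = [(a, b) for a in order for b in adj[a] if b not in infected]
        if not frontier:  # the diffusion exhausted its component
            break
        _, new_node = rng.choice(frontier)
        infected.add(new_node)
        order.append(new_node)
    return infected


def bfs_dist(adj, src):
    """Hop distances from src to every reachable node (plain BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        a = queue.popleft()
        for b in adj[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist


def confidence_set(adj, y, T, alpha, m, rng):
    """Keep node v iff the Monte Carlo p-value of the observed infected set
    y under source v exceeds alpha, using the (m + 1)-corrected p-value."""
    kept = []
    for v in adj:
        dist = bfs_dist(adj, v)

        def disc(infected):  # stand-in discrepancy: total distance from v
            return sum(dist[w] for w in infected)

        t_obs = disc(y)
        t_mc = [disc(sample_diffusion(adj, v, T, rng)) for _ in range(m)]
        pval = (1 + sum(t >= t_obs for t in t_mc)) / (m + 1)
        if pval > alpha:
            kept.append(v)
    return kept
```

On the path graph 0–1–2–3–4 with observed infections {1, 2, 3}, the resulting set always retains the central candidate 2 and discards the leaves 0 and 4, mirroring at toy scale the coverage-versus-size behavior discussed above.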
Meanwhile, notice that our inference procedure can be parallelized. We give a parallel algorithm in the appendix (see Algorithm 3 in Appendix D). It needs MC sampling for only one node in each group, and the calculations for the other nodes can be done using the pooled MC methods. Table 4 includes the timing results of the parallel implementation based on 20 cores, under the same settings as Table 3 of the main paper. With 20 cores, a confidence set can be constructed in less than 30 seconds in the cases where the pooled MC methods are effective. We have introduced a statistical inference framework for diffusion source identification on networks. Compared with previous methods, our framework is more general and renders salient insights about the problem. More importantly, within this framework, we can construct a confidence set for the source node in a natural and principled way such that the success rate is guaranteed on any network structure. To our knowledge, our method is the first DSI method with theoretical guarantees for general network structures. We also propose efficient computational strategies that are potentially useful in other problems as well.

A.1 Proof of Theorem 1

Define q^α_{T,s*} = inf_t {t : P_{s*}(T_{s*}(ζ(Z)) ≥ t) ≤ α}. Notice that q^α_{T,s*} can be seen as a generalized definition of the right quantile of the distribution of the random variable T_{s*}(Ỹ), where Ỹ := ζ(Z) is a random infection status of the network generated by the diffusion process starting from s*. Now assume Y is a random infection status from the diffusion process starting from s*.
According to the definition of the p-value, and since s* is the true source node so that T_{s*}(Y) and T_{s*}(Ỹ) follow exactly the same distribution, we obtain P_{s*}(ψ_{s*}(Y) ≤ α) ≤ α, which is the claimed coverage guarantee.

Given a path generated by the diffusion process starting from v_0 and containing u, denoted by z = {v_0, s_1, s_2, · · ·, s_{K−1}, u, s_{K+1}, · · ·, s_T}, we match it to the path f_u(z) defined as f_u(z) = {u, v_0, s_1, · · ·, s_{K−1}, s_{K+1}, · · ·, s_T}. We start from the probability mass of z as a path starting from v_0. By the Markov property, we have

p(z|v_0) = P(s_1|v_0) P(s_2|v_0, s_1) · · · P(s_{K−1}|v_0, s_1, · · ·, s_{K−2})
           × P(s_{K+1}|v_0, · · ·, u) · · · P(s_T|v_0, · · ·, s_{T−1})
           × P(u|v_0, s_1, · · ·, s_{K−1}).                               (12)

In contrast, for the path f_u(z), we have

P(f_u(z)|u) = P(v_0|u) P(s_1|u, v_0) P(s_2|u, v_0, s_1) · · · P(s_{K−1}|u, v_0, s_1, · · ·, s_{K−2})
              × P(s_{K+1}|u, v_0, · · ·, s_{K−1}) · · · P(s_T|u, v_0, · · ·, s_{T−1})
            = P(s_1|u, v_0) P(s_2|u, v_0, s_1) · · · P(s_{K−1}|u, v_0, s_1, · · ·, s_{K−2})
              × P(s_{K+1}|u, v_0, · · ·, s_{K−1}) · · · P(s_T|u, v_0, · · ·, s_{T−1}),         (13)

where the second equality holds because v_0 is the only neighbor of u, so P(v_0|u) = 1.

Notice that the conditional probability P(s_{k+1}|v_0, · · ·, u, s_{K+1}, · · ·, s_k), for k > K, depends only on the infection status before the (k+1)th infection and is invariant to the infection order. This property indicates that all terms after the (K+1)th (those in the second rows) of (12) and (13) are equal. Next, we compare the terms in the first line of each of (12) and (13). Notice that for each k < K, the term P(s_k|v_0, s_1, · · ·, s_{k−1}) is uniform over the available infecting edges given the infected nodes v_0, s_1, · · ·, s_{k−1}, while the term P(s_k|u, v_0, s_1, · · ·, s_{k−1}) is uniform over the available infecting edges given the infected nodes u, v_0, s_1, · · ·, s_{k−1}. The only difference between the two infected sets is u. Since u has only one connection, to v_0, at each such step the number of available infecting edges is one more in the former case.
Therefore, we have

P(s_k|u, v_0, s_1, · · ·, s_{k−1}) = P(s_k|v_0, s_1, · · ·, s_{k−1}) / (1 − P(u|v_0, s_1, · · ·, s_{k−1})), for k < K,

since removing the single edge (v_0, u) from the pool of available infecting edges rescales every remaining edge probability by 1/(1 − P(u|v_0, s_1, · · ·, s_{k−1})). In addition, notice that the third line of (12) contains one extra term that does not appear in (13). Combining the aforementioned three relations, we finally obtain the probability mass ratio

P(f_u(π)|S_0 = u) / P(π|S_0 = v_0) = 1 / [(1 − P(u|v_0))(1 − P(u|v_0, s_1)) · · · (1 − P(u|v_0, s_1, · · ·, s_{K−2})) P(u|v_0, s_1, · · ·, s_{K−1})].   (14)

Moreover, if π does not contain u, we set the ratio to be 0. Since the Z_i's are a random sample from p_1, under the current assumption, by the strong law of large numbers, we have P̂_η → E_1[ g(φ(Z)) p_2(φ(Z)) / (|φ^{−1}(φ(Z))| p_1(Z)) ] = 1.

Input: Graph G = (V, E)
Initialize L = ∅ to store the list of isomorphic node pairs
for every node u ∈ V do
  Compute N_{1−4}(u), the set of all neighbors of u within 4 hops
    // N_{1−4}(u) includes all nodes that could possibly be isomorphic to u
  for v ∈ N_{1−4}(u) do
    if d_v == d_u then
      Compute D_1(u) = {d_w : w ∈ N_1(u)}, the multi-set of degrees of nodes in N_1(u)
      Similarly, compute D_1(v), the multi-set of degrees of all one-hop neighbors of v
      if D_1(u) == D_1(v) then
        Compute Ñ_2(u) = N_2(u) − N_1(u) − N_1(v) − {u, v}
          // Ñ_2(u) contains all (exactly) two-hop neighbors of u, with all (exactly) one-hop neighbors of u, v removed
        Similarly, compute Ñ_2(v)
        if Ñ_2(u) == Ñ_2(v) then
          Do an exhaustive search to check whether u, v are isomorphic by enumerating all possible matchings of their neighbors, and if so, add (u, v) to the list L
            // Hopefully, not many pairs need to go through this step
        end if
      end if
    end if
  end for
end for

Based on the properties in Proposition 2, Algorithm 2 finds all isomorphic pairs and the corresponding permutations in the network. Next, we provide a proof of Proposition 2. Proof of Proposition 2. d_u = d_v because π gives a one-to-one mapping from N_1(u) to N_1(v). Furthermore, we have π(N_1(u)) = N_1(v). By condition 3 of Definition 4, we also have D_1(u) = D_1(v).
For the last claim, we prove by contradiction. Suppose there exists a node w such that w ∈ N_2(u) but w ∉ N_2(v). Since π(N_1(u)) = N_1(v) while π(w) = w, after applying the permutation to the network, π(w) becomes disconnected from π(N_1(w)). This contradicts condition 3 of Definition 4.

The calculation strategy for canonical discrepancy functions can also be generalized to the weighted-averaging scenario used for the single-degree nodes in Section 4: each MC summand evaluates the discrepancy at (y, f_u(z_i)) and weights it by the ratio P(f_u(z_i)|u)/P(z_i|v_0). Therefore, to use this strategy in Section 4.1, when generating MC samples from v_0, in addition to caching M, we also want to cache the matrix adjusted by this factor.

As discussed in Section 3.3, our confidence set construction algorithm can be implemented in parallel, further boosting its speed. In the main paper, we only include the details and timing for the sequential version. The parallelized algorithm is described in Algorithm 3.

References
- Infectious diseases of humans: dynamics and control
- Emergence of scaling in random networks
- Generalized Monte Carlo significance tests
- Efficient sampling methods for discrete distributions
- Finding Adam in random growing trees
- Inference on the history of a randomly growing tree
- Rooting out the rumor culprit from suspects
- System and method of virus containment in computer networks
- Building an open resources repository for COVID-19 research
- Finite sample properties and asymptotic efficiency of Monte Carlo tests. The Annals of Statistics
- Approximate identification of the optimal epidemic source in complex networks
- Confidence sets for the source of a diffusion in regular trees
- Baidu Mobility Data
- Deep learning. Nature
- Monte Carlo and Quasi-Monte Carlo Methods
- Testing statistical hypotheses
- Informative core identification in complex networks
- Email networks and the spread of computer viruses
- Multiple infection sources identification with provable guarantees
- Rumors in a network: Who's the culprit?
- Finding rumor sources on random graphs
- Finding the probability of infection in an SIR network is NP-hard
- A theory of the learnable
- The spread of true and false news online
- Collective dynamics of 'small-world' networks
- Propagation effect of a virus outbreak on a network with limited anti-virus ability
- Rumor source detection in finite graphs with boundary effects by message-passing algorithms

Algorithm 3 Parallel Confidence Set Construction
1: Input: MC sample number m, confidence level α, network G, data y, discrepancy function
2: Compute S = {g_1, g_2, · · ·, g_M}, the isomorphic groups
10: for each v ∈ g that is isomorphic to s do
11:   Calculate ψ̂_v(y) according to Theorem 4.
12: end for
13: for each single-degree node v ∈ g do
14:   Calculate the p-value ψ̂_v(y) according to the surjective importance sampling

The work is supported by the Quantitative Collaborative Award from the College of Arts and Sciences at the University of Virginia. T. Li is supported in part by the NSF grant DMS-2015298. H. Xu is supported by a GIDI award from the UVA Global Infectious Diseases Institute.

Notice that φ is a surjection. Therefore, the term E_1[ g(φ(Z)) p_2(φ(Z)) / (|φ^{−1}(φ(Z))| p_1(Z)) ] can be rewritten as

E_1[ g(φ(Z)) p_2(φ(Z)) / (|φ^{−1}(φ(Z))| p_1(Z)) ] = Σ_y Σ_{z ∈ φ^{−1}(y)} g(y) p_2(y) / |φ^{−1}(y)| = Σ_y g(y) p_2(y) = E_2[g(Y)].

We will use f_u(z) in place of φ to apply Theorem 3. The only remaining step is to find |f_u^{−1}(f_u(z))|. By the definition of f_u in Theorem 3, it is easy to see that the T − 1 nodes other than u and v_0, together with their order, uniquely determine the mapped path, while u may occupy any of the T positions in a preimage. Therefore, |f_u^{−1}(f_u(z))| is always T in this situation.

Notice that, given T, the probability mass function of a diffusion path depends only on the network A and the source node. Condition 1 of Definition 4 indicates that Z_π starts from v. Conditions 2 and 3 of Definition 4 together indicate that Z_π is also a valid diffusion path. Now define π^{−1} to be the inverse of π.
For any Z̃ starting from v, by the same reasoning, Z̃_{π^{−1}} is also a valid diffusion path starting from u. In particular, we have Z = (Z_π)_{π^{−1}}. Therefore, Z_π has the same sample space and probability mass function as a random diffusion path from v.

B Details for Isomorphism Identification in Section 4.2
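To complement the description of Algorithm 2, here is a minimal Python sketch of the pre-screening from Proposition 2 followed by the exhaustive matching step. Since Definition 4 is not reproduced in this excerpt, the acceptance test below (an edge-preserving permutation that maps u to v, maps N_1(u) onto N_1(v), and fixes every node outside the joint neighborhood) is our assumption about what the exhaustive search verifies, and all names are illustrative.

```python
from itertools import permutations


def first_order_isomorphic(adj, u, v):
    """Return a permutation (dict) witnessing first-order isomorphism of u
    and v, or None. `adj` maps each node to the set of its neighbors."""
    if u == v or len(adj[u]) != len(adj[v]):
        return None  # pre-screen 1: degrees must match (d_u == d_v)
    if sorted(len(adj[w]) for w in adj[u]) != sorted(len(adj[w]) for w in adj[v]):
        return None  # pre-screen 2: sorted neighbor degrees (D_1(u) == D_1(v))
    # Exhaustive search over permutations of the joint neighborhood S; nodes
    # outside S are treated as fixed points. This is practical only because
    # the pre-screens leave few candidate pairs and S is small.
    S = sorted({u, v} | set(adj[u]) | set(adj[v]))
    for image in permutations(S):
        pi = dict(zip(S, image))
        if pi[u] != v:
            continue  # the permutation must map u to v ...
        if {pi[w] for w in adj[u]} != set(adj[v]):
            continue  # ... and N_1(u) onto N_1(v)
        full = lambda w: pi.get(w, w)
        # accept only if every edge is preserved under the permutation
        if all(full(b) in adj[full(a)] for a in S for b in adj[a]):
            return pi
    return None
```

On the path graph 0–1–2–3–4, the two leaves 0 and 4 are identified as isomorphic (with π mapping 0 to 4 and 1 to 3), while the pair (0, 2) is rejected already by the degree pre-screen, illustrating how the cheap checks shield the expensive enumeration.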