key: cord-0011561-zyjd9rmp authors: Peixoto, Tiago P. title: Network Reconstruction and Community Detection from Dynamics date: 2019-09-18 journal: nan DOI: 10.1103/physrevlett.123.128301 sha: a9a35792b56735056d79a82e6cf1e0a5e1665d81 doc_id: 11561 cord_uid: zyjd9rmp We present a scalable nonparametric Bayesian method to perform network reconstruction from observed functional behavior that at the same time infers the communities present in the network. We show that the joint reconstruction with community detection has a synergistic effect, where the edge correlations used to inform the existence of communities are also inherently used to improve the accuracy of the reconstruction which, in turn, can better inform the uncovering of communities. We illustrate the use of our method with observations arising from epidemic models and the Ising model, both on synthetic and empirical networks, as well as on data containing only functional information. The observed functional behavior of a wide variety largescale system is often the result of a network of pairwise interactions. However, in many cases, these interactions are hidden from us, either because they are impossible to measure directly, or because their measurement can be done only at significant experimental cost. Examples include the mechanisms of gene and metabolic regulation [1] , brain connectivity [2] , the spread of epidemics [3] , systemic risk in financial institutions [4] , and influence in social media [5] . In such situations, we are required to infer the network of interactions from the observed functional behavior. Researchers have approached this reconstruction task from a variety of angles, resulting in many different methods, including thresholding the correlation between time series [6] , inversion of deterministic dynamics [7] [8] [9] , statistical inference of graphical models [10] [11] [12] [13] [14] and of models of epidemic spreading [15] [16] [17] [18] [19] [20] , as well as approaches that avoid explicit modeling, such as those based on transfer entropy [21] , Granger causality [22] , compressed sensing [23] [24] [25] , generalized linearization [26] , and matching of pairwise correlations [27, 28] . In this Letter, we approach the problem of network reconstruction in a manner that is different from the aforementioned methods in two important ways. First, we employ a nonparametric Bayesian formulation of the problem, which yields a full posterior distribution of possible networks that are compatible with the observed dynamical behavior. Second, we perform network reconstruction jointly with community detection [29] , where, at the same time as we infer the edges of the underlying network, we also infer its modular structure [30] . As we will show, while network reconstruction and community detection are desirable goals on their own, joining these two tasks has a synergistic effect, whereby the detection of communities significantly increases the accuracy of the reconstruction, which in turn improves the discovery of the communities, when compared to performing these tasks in isolation. Some other approaches combine community detection with functional observation. Berthet et al. [31] derived necessary conditions for the exact recovery of group assignments for dense weighted networks generated with community structure given observed microstates of an Ising model. Hoffmann et al. [32] proposed a method to infer community structure from time-series data that bypasses network reconstruction by employing a direct modeling of the dynamics given the group assignments, instead. However, neither of these approaches attempt to perform network reconstruction together with community detection. Furthermore, they are tied down to one particular inverse problem, and as we will show, our general approach can be easily extended to an open-ended variety of functional models. Bayesian network reconstruction.-We approach the network reconstruction task similarly to the situation where the network edges are measured directly, but via an uncertain process [33, 34] : If D is the measurement of some process that takes place on a network, we can define a posterior distribution for the underlying adjacency matrix A via Bayes' rule where PðDjAÞ is an arbitrary forward model for the dynamics given the network, PðAÞ is the prior information on the network structure, and PðDÞ ¼ P A PðDjAÞPðAÞ is a normalization constant comprising the total evidence for the data D. We can unite reconstruction with community detection via an, at first, seemingly minor, but ultimately consequential modification of the above equation where we introduce a structured prior PðAjbÞ where b represents the partition of the network in communities, i.e., b ¼ fb i g, where b i ∈ f1; …; Bg is group membership of node i. This partition is unknown, and is inferred together with the network itself, via the joint posterior distribution The prior PðAjbÞ is an assumed generative model for the network structure. In our work, we will use the degreecorrected stochastic block model (DC-SBM) [35] , which assumes that, besides differences in degree, nodes belonging to the same group have statistically equivalent connection patterns, according to the joint probability with λ rs determining the average number of edges between groups r and s and κ i the average degree of node i. The marginal prior is obtained by integrating over all remaining parameters weighted by their respective prior distributions, which can be computed exactly for standard prior choices, although it can be modified to include hierarchical priors that have an improved explanatory power [36] (see Supplemental Material [37] for a concise summary.). The use of the DC-SBM as a prior probability in Eq. (2) is motivated by its ability to inform link prediction in networks where some fraction of edges have not been observed or have been observed erroneously [34, 39] . The latent conditional probabilities of edges existing between groups of nodes is learned by the collective observation of many similar edges, and these correlations are leveraged to extrapolate the existence of missing or spurious ones. The same mechanism is expected to aid the reconstruction task, where edges are not observed directly, but the observed functional behavior yields a posterior distribution on them, allowing the same kind of correlations to be used as an additional source of evidence for the reconstruction, going beyond what the dynamics alone says. Our reconstruction approach is finalized by defining an appropriate model for the functional behavior, determining PðDjAÞ. Here, we will consider two kinds of indirect data. The first comes from a susceptible-infected-susceptible (SIS) epidemic spreading model [40] , where σ i ðtÞ ¼ 1 means node i is infected at time t, 0, otherwise. The likelihood for this model is where is the transition probability for node i at time t, with fðp; σÞ ¼ ð1 − pÞ σ p 1−σ , and where m i ðtÞ ¼ P j A ij lnð1 − τ ij Þσ j ðtÞ is the contribution from all neighbors of node i to its infection probability at time t. In the equations above, the value τ ij is the probability of an infection via an existing edge ði; jÞ, and γ is the 1 → 0 recovery probability. With these additional parameters, the full posterior distribution for the reconstruction becomes Since τ ij ∈ ½0; 1, we use the uniform prior PðτÞ ¼ 1. Note, also, that the recovery probability γ plays no role on the reconstruction algorithm, since its term in the likelihood does not involve A [and, hence, gets cancelled out in the denominator PðσjγÞ ¼ PðγjσÞPðσÞ=PðγÞ]. This means that the above posterior only depends on the infection events 0 → 1 and, thus, is also valid without any modifications to all epidemic variants susceptible-infected (SI), susceptibleinfected-recovered (SIR), susceptible-exposed-infectedrecovered (SEIR), etc., [40] , since the infection events occur with the same probability for all these models. The second functional model we consider is the Ising model, where spin variables on the nodes s ∈ f−1; 1g N are sampled according to the joint distribution where β is the inverse temperature, J ij is the coupling on edge ði; jÞ, h i is a local field on node i, and ZðA; β; J; hÞ ¼ P s expðβ P i c à ; 0; otherwiseg. The value of c à was chosen to maximize the posterior similarity, which represents the best possible reconstruction achievable with this method. Nevertheless, the network thus obtained is severely distorted. The inverse correlation method comes much closer to the true network, but is superseded by the joint inference with community detection. Empirical dynamics.-We turn to the reconstruction from observed empirical dynamics with unknown underlying interactions. The first example is the sequence of M ¼ 619 votes of N ¼ 575 deputies in the 2007 to 2011 session of the lower chamber of the Brazilian congress. Each deputy voted yes, no, or abstained for each legislation, which we represent as f1; −1; 0g, respectively. Since the temporal ordering of the voting sessions is likely to be of secondary importance to the voting outcomes, we assume the votes are sampled from an Ising model [the addition of zero-valued spins changes Eq. (9) only slightly by replacing 2 coshðxÞ → 1 þ 2 coshðxÞ]. Figure 4 shows the result of the reconstruction, where the division of the nodes uncovers a cohesive government and a split opposition, as well as a marginal center group, which correlates very well with the known party memberships and can be used to predict unseen voting behavior (see Supplemental Material [37] for more details). In Fig. 5 , we show the result of the reconstruction of the directed network of influence between N ¼ 1833 twitter users from 58224 retweets [50] using a SI epidemic model (the act of "retweeting" is modeled as an infection event, using Eqs. (5) and (6) with γ ¼ 0) and the nested DC-SBM. The reconstruction uncovers isolated groups with varying propensities to retweet, as well as groups that tend to influence a large fraction of users. By inspecting the geolocation metadata on the users, we see that the inferred groups amount, to a large extent, to different countries, although clear subdivisions indicate that this is not the only factor governing the influence among users (see Supplemental Material [37] for more details). Conclusion.-We have presented a scalable Bayesian method to reconstruct networks from functional observations that uses the SBM as a structured prior and, hence, performs community detection together with reconstruction. The method is nonparametric and, hence, requires no prior stipulation of aspects of the network and size of the model, such as number of groups. By leveraging inferred correlations between edges, the SBM includes an additional source of evidence and, thereby, improves the reconstruction accuracy, which in turn also increases the accuracy of the inferred communities. The overall approach is general, requiring only appropriate functional model specifications, and can be coupled with an open ended variety of such models other than those considered here. [51, 52] for details on the layout algorithm), and the edge colors indicate the infection probabilities τ ij as shown in the legend. The text labels show the dominating country membership for the users in each group. Inferring gene regulatory networks from multiple microarray datasets Dynamic models of large-scale brain activity Estimating spatial coupling in epidemiological systems: a mechanistic approach Bootstrapping topological properties and systemic risk of complex networks using the fitness model The role of social networks in information diffusion Network inference with confidence from multivariate time series Revealing Network Connectivity from Response Dynamics Inferring network topology from complex dynamics Revealing physical interaction networks from statistics of collective dynamics Learning factor graphs in polynomial time and sample complexity Reconstruction of Markov random fields from samples: Some observations and algorithms, in Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques Which graphical models are difficult to learn Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods Inverse statistical problems: From the inverse Ising problem to data science Inferring networks of diffusion and influence On the convexity of latent social network inference Learning the graph of epidemic cascades Statistical inference approach to structural reconstruction of complex networks from binary time series Maximum-likelihood network reconstruction for SIS processes is NP-hard Network reconstruction from infection cascades Escaping the Curse of Dimensionality in Estimating Multivariate Transfer Entropy Causal network inference by optimal causation entropy Reconstructing propagation networks with natural diversity and identifying hidden sources Efficient reconstruction of heterogeneous networks from time series via compressed sensing Robust Reconstruction of Complex Networks from Sparse Data Universal data-based method for reconstructing complex networks with binary-state dynamics Reconstructing weighted networks from dynamics Reconstructing network topology and coupling strengths in directed networks of discrete-time dynamics Community detection in networks: A user guide Bayesian stochastic blockmodeling Exact recovery in the Ising blockmodel Community detection in networks with unobserved edges Network structure from rich but noisy data Reconstructing Networks with Unknown and Heterogeneous Errors Stochastic blockmodels and community structure in networks Nonparametric Bayesian inference of the microcanonical stochastic block model for summary of the full generative model used, details of the inference algorithm and more information on the analysis of empirical data Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models Missing and spurious interactions and the reconstruction of complex networks Epidemic processes in complex networks Spatial interaction and the statistical analysis of lattice systems Equation of state calculations by fast computing machines Monte Carlo sampling methods using Markov chains and their applications Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications Artifacts or attributes? Effects of resolution on the little rock lake food web Note that, in this case, our method also exploits the heterogeneous degrees in the network via the DC-SBM, which can Refinements of this approach including Thouless-Anderson-Palmer (TAP) and Bethe-Peierls (BP) corrections [14] yield the same performance for this example Pseudolikelihood Decimation Algorithm Improving the Inference of the Interaction Network in a General Class of Ising Models The simple rules of social contagion Hierarchical Block Structures and High-Resolution Model Selection in Large Networks Hierarchical edge bundles: Visualization of adjacency relations in hierarchical data