Nucleus Decomposition in Probabilistic Graphs: Hardness and Algorithms
Fatemeh Esfahani, Venkatesh Srinivasan, Alex Thomo, Kui Wu
June 2, 2020

Abstract—Finding dense components in graphs is of great importance in analyzing the structure of networks. Popular and computationally feasible frameworks for discovering dense subgraphs are core and truss decompositions. Recently, Sarıyüce et al. introduced nucleus decomposition, which uses r-cliques contained in s-cliques, where s > r, as the basis for defining dense subgraphs. Nucleus decomposition can reveal interesting subgraphs that can be missed by core and truss decompositions. In this paper, we present nucleus decomposition in probabilistic graphs. The major questions we address are: How to define nucleus decomposition in probabilistic graphs meaningfully? How hard is computing nucleus decomposition in probabilistic graphs? Can we devise efficient algorithms for exact or approximate nucleus decomposition in large graphs? We present three natural definitions of nucleus decomposition in probabilistic graphs: local, global, and weakly-global. We show that the local version is in PTIME, whereas global and weakly-global are #P-hard and NP-hard, respectively. We present an efficient and exact dynamic programming approach for the local case. Further, we present statistical approximations that can scale to bigger datasets without much loss of accuracy.
For global and weakly-global decompositions, we complement our intractability results by proposing efficient algorithms that give approximate solutions based on search-space pruning and Monte-Carlo sampling. Extensive experiments show the scalability and efficiency of our algorithms. Compared to probabilistic core and truss decompositions, nucleus decomposition significantly outperforms in terms of density and clustering metrics.

Index Terms—Probabilistic Graphs, Dense Subgraphs, Nucleus Decomposition

Probabilistic graphs are graphs where each edge has a probability of existence (cf. [1]-[8]). Many real-world graphs, such as social, trust, and biological networks, are associated with intrinsic uncertainty. For instance, in social and trust networks, an edge can be weighted by the probability of influence or trust between the two users that the edge connects [9]-[11]. In biological networks of protein-protein interactions (cf. [12]), an edge can be assigned a probability value representing the strength of the prediction that a pair of proteins will interact in a living organism [13]-[15].

Mining dense subgraphs and discovering hierarchical relations among them is a fundamental problem in graph analysis tasks. For instance, it can be used in visualizing complex networks [16], finding correlated genes and motifs in biological networks [17], [18], detecting communities in social and web graphs [19], [20], summarizing text [21], and revealing new research subjects in citation networks [22].

Core and truss decompositions are popular tools for finding dense subgraphs. A k-core is a maximal subgraph in which each vertex has at least k neighbors, and a k-truss is a maximal subgraph whose edges are contained in at least k triangles. Core and truss decompositions have been extensively studied for deterministic as well as probabilistic graphs (cf. [1], [23]-[27]).

A recent notion of dense subgraphs is the nucleus, introduced by Sarıyüce et al. [28], [29]. Nucleus decomposition is a generalization of core and truss decompositions that uses higher-order structures to detect dense regions. It can reveal interesting subgraphs that can be missed by core and truss decompositions. In a nutshell, a k-(r, s)-nucleus is a maximal subgraph whose r-cliques are contained in at least k s-cliques, where s > r. For r = 1, s = 2 and r = 2, s = 3 we obtain the notions of k-core and k-truss, respectively. For r = 3, s = 4, r-cliques are triangles, s-cliques are 4-cliques, and the k-(3, 4)-nucleus is strictly stronger than k-truss and k-core. Sarıyüce et al. [28], [29] observed that, in practice, the k-(3, 4)-nucleus is the most interesting in terms of the quality of subgraphs produced for a large variety of graphs. As such, in this paper we also focus on this decomposition.

To the best of our knowledge, nucleus decomposition over probabilistic graphs has not been studied yet. As pointed out by [28], [29], nucleus decomposition can uncover a finer-grained structure of dense groups not possible using other dense-subgraph mining methods; as such, nucleus decomposition can be beneficial for a large variety of applications, e.g.,
community structure discovery [30], mining dense regions in the Internet of Things [31], financial fraud detection [32], extracting brain connectome subgraph hierarchies [33], detection of complexes in biological networks [34], etc. All these applications of nucleus decomposition extend naturally to probabilistic networks. Ignoring probabilities and using deterministic methods amounts to setting all probabilities to 1, which not only misses salient information, but could prove detrimental in applications such as finding cohesive subnetworks of proteins from probabilistic PPI networks, which has valuable implications for disease diagnosis [13]. Last but not least, computing probabilistic nuclei is highly beneficial for task-driven team formation in probabilistic social networks, as demonstrated later in our case study using a DBLP network.

We are the first to study nucleus decomposition in probabilistic graphs. The major questions we address are: How to define nucleus decomposition in probabilistic graphs meaningfully? How hard is computing nucleus decomposition in probabilistic graphs? Can we devise efficient algorithms for exact or approximate nucleus decomposition in large graphs?

Definitions. We start by introducing three natural notions of probabilistic nucleus decomposition (Section III). They are based on the concept of possible worlds (PW's), which are instantiations of a probabilistic graph obtained by flipping a biased coin for each edge independently, according to its probability. We define local, global, and weakly-global notions of nucleus as a maximal probabilistic subgraph H satisfying different structural conditions for each case. In the local case, we require a good number of PW's of H to satisfy a high level of density around each triangle (in terms of 4-cliques containing it) in H. This is local in nature because the triangles are considered independently of each other. To contrast this, we introduce the global notion, where we request that the PW's themselves be deterministic nuclei. This way, not only do we achieve density around each triangle, but we also ensure the same is achieved for all the triangles of H simultaneously. Finally, we relax this strict requirement for the weakly-global case by requiring that PW's only contain a deterministic nucleus that includes the triangles of H.

Global and Weakly-Global Cases. We show that computing global and weakly-global decompositions is intractable, namely #P-hard and NP-hard, respectively (Section IV). We complement these results with efficient algorithms for these two cases that give approximate solutions based on search-space pruning combined with Monte-Carlo sampling (Section VI).

Local Case. We show that local nucleus decomposition is in PTIME (Section V). The main challenge is to compute the probability of each triangle to be contained in k 4-cliques. We present a dynamic programming (DP) solution for this task, which, combined with a triangle-peeling approach, solves the problem of local nucleus decomposition efficiently. While this is a welcome result, we further propose statistical methods to speed up the computation. Namely, we provide a framework where well-known distributions, such as Poisson, Normal, and Binomial, can be employed to approximate the DP results. We provide detailed conditions under which the approximations can be used reliably; otherwise DP is used as fallback. This hybrid approach speeds up the computation significantly and is able to handle datasets which DP alone cannot.

Experiments.
We present extensive experiments which show that our DP method for local nucleus decomposition is efficient and can handle large datasets; when combined with our statistical approximations, the process is significantly sped up and can handle much larger datasets. We demonstrate the importance of nucleus decomposition by comparing it to probabilistic core and truss decompositions using density and clustering metrics. The results show that nucleus decomposition significantly outperforms core and truss decompositions in terms of these metrics.

Let G = (V, E) be an undirected graph, where V is a set of vertices and E is a set of edges. For a vertex v ∈ V, let N(v) be the set of v's neighbors: N(v) = {u ∈ V : (u, v) ∈ E}.

Nucleus decomposition in deterministic graphs. Nucleus decomposition is a generalization of core and truss decompositions [28], [29]. Each nucleus is a subgraph which contains a dense cluster of cliques. The formal definitions are as follows. Let r, s with r < s be positive integers. We call cliques of size r, r-cliques, and denote them by R, R', etc. Analogously, we call cliques of size s, s-cliques, and denote them by S, S', etc.

Definition 1: The s-support of an r-clique R in G, denoted by s-supp_G(R), is the number of s-cliques in G that contain R.

Definition 2: Two r-cliques R and R' in G are s-connected if there exists a sequence R = R_1, R_2, ..., R_k = R' of r-cliques in G such that for each i, there exists some s-clique in G that contains R_i ∪ R_{i+1}.

Now nucleus decomposition is as follows.

Definition 3: Let k be a positive integer. A k-(r, s)-nucleus is a maximal subgraph H of G with the following properties. 1) H is a union of s-cliques: every edge in H is part of an s-clique in H. 2) s-supp_H(R) ≥ k for every r-clique R in H. 3) Each pair of r-cliques in H is s-connected.

For simplicity, whenever clear from the context, we will drop the use of the prefix s from the definitions of support and connectedness. When r = 1, s = 2, r-cliques are nodes, s-cliques are edges, and the k-(1, 2)-nucleus is the well-known notion of k-core. When r = 2, s = 3, r-cliques are edges, s-cliques are triangles, and the k-(2, 3)-nucleus is the well-known notion of k-truss. [28] shows that the k-(3, 4)-nucleus, where we consider triangles contained in 4-cliques, provides much more interesting insights compared to k-core and k-truss in terms of density and hierarchical structure. As such, in this paper, we also focus on the r = 3, s = 4 case. For simplicity, we will drop using r and s and assume them to be 3 and 4, respectively. In particular, we will refer to a k-(3, 4)-nucleus as simply a k-nucleus.

Probabilistic Graphs. A probabilistic graph is a triple G = (V, E, p), where V and E are as before and p : E → (0, 1] is a function that maps each edge e ∈ E to its existence probability p_e. In the most common probabilistic model (cf. [1], [3], [4]), the existence probability of each edge is assumed to be independent of the other edges. In order to analyze probabilistic graphs, we use the concept of possible worlds, which are deterministic graph instances of G in which only a subset of the edges appears. Conceptually, the possible worlds are obtained by flipping a biased coin for each edge independently, according to its probability. We write G ⊑ G to say that G is a possible world of G. The probability of a possible world G = (V, E_G) ⊑ G is:

Pr[G | G] = ∏_{e ∈ E_G} p_e · ∏_{e ∈ E \ E_G} (1 − p_e).

We will use G, G', H, H' to denote probabilistic graphs.
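To make the possible-world semantics concrete, here is a minimal Java sketch (our own illustrative code, not the authors' implementation; the Edge record and method names are assumptions) that samples a possible world by flipping a biased coin per edge and evaluates Pr[G | G] for a given edge subset.

```java
import java.util.*;

// Sampling a possible world of a probabilistic graph and computing its
// probability, per the formula above. Illustrative sketch only.
public class PossibleWorlds {
    record Edge(int u, int v, double p) {}

    // Sample one possible world: each edge survives independently with prob p.
    static Set<Edge> sampleWorld(List<Edge> edges, Random rnd) {
        Set<Edge> world = new HashSet<>();
        for (Edge e : edges)
            if (rnd.nextDouble() < e.p) world.add(e);
        return world;
    }

    // Pr[G | G] = product of p_e over kept edges times (1 - p_e) over dropped edges.
    static double worldProbability(List<Edge> edges, Set<Edge> world) {
        double pr = 1.0;
        for (Edge e : edges)
            pr *= world.contains(e) ? e.p : (1.0 - e.p);
        return pr;
    }

    public static void main(String[] args) {
        List<Edge> g = List.of(new Edge(1, 2, 0.9), new Edge(2, 3, 0.8), new Edge(1, 3, 0.5));
        Set<Edge> world = sampleWorld(g, new Random(42));
        System.out.println("Sampled world size: " + world.size()
                + ", probability: " + worldProbability(g, world));
    }
}
```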
Nucleus decomposition in probabilistic graphs. We now define three variants of nucleus decomposition in probabilistic graphs, which are based on Definitions 4 and 5 given below. These variants relate to the nature of the nucleus, and we refer to them as local (ℓ), global (g), and weakly-global (w).

Definition 4: Let H be a probabilistic graph, △ a triangle, and µ a mode in the set {ℓ, g, w}. Then, X_{H,△,µ} is a random variable that takes integer values k with tail probability

Pr(X_{H,△,µ} ≥ k) = Σ_{H' ⊑ H} Pr[H' | H] · 1_µ(H', △, k),

where the indicator variable 1_µ(H', △, k) is defined depending on mode µ as follows. 1_ℓ(H', △, k) = 1 if △ is in H' and the support of △ in H' is at least k. 1_g(H', △, k) = 1 if △ is in H' and H' is a deterministic k-nucleus. 1_w(H', △, k) = 1 if △ is in H' and H' contains a subgraph which is a deterministic k-nucleus containing △.

In the above definition, 1_ℓ(H', △, k) has a local quality, because a possible world H' satisfies its condition if it provides sufficient support to triangle △, without considering the other triangles in H. On the other hand, 1_g(H', △, k) and 1_w(H', △, k) have a global quality, because a possible world H' satisfies their conditions only when the other triangles in H are considered as well (creating a nucleus together).

In the following, as preconditions for cohesiveness, we will assume cliqueness and connectedness for the nuclei subgraphs we define. Specifically, we will only consider subgraphs H which, ignoring edge probabilities, are unions of 4-cliques, and where each pair of triangles in H is connected in H.

Definition 5: Let G = (V, E, p) be a probabilistic graph. Given threshold θ ∈ [0, 1], integer k ≥ 0, and µ ∈ {ℓ, g, w}, a µ-(k, θ)-nucleus H is a maximal subgraph of G such that Pr(X_{H,△,µ} ≥ k) ≥ θ for each triangle △ in H. Moreover, the µ-(k, θ)-nucleusness (or simply nucleusness, when µ, k, and θ are clear from context) of a triangle △ is the largest value of k such that △ is contained in a µ-(k, θ)-nucleus.

Intuitively, for µ = ℓ, from a probabilistic perspective, a subgraph H of G can be regarded as a cohesive subgraph of G if the support of every triangle in H is no less than k with high probability (no less than a threshold θ). We call this version local nucleus. Local nucleus is a nice concept for probabilistic subgraph cohesiveness; however, it has the following shortcoming. While it ensures that every triangle △ in H has support at least k in a good number of instantiations of H, it does not ensure that those instantiations are deterministic nuclei themselves, or that they contain some nucleus which in turn contains △. Obviously, nucleusness is a desirable property to ask for in order to achieve a higher degree of cohesiveness, and this leads to the other two versions of probabilistic nucleus of a global nature, which we call global and weakly-global (obtained for µ = g and µ = w). In general, g-(k, θ)-nuclei are smaller and more cohesive than w-(k, θ)-nuclei. We remark that every g-(k, θ)-nucleus is contained in a w-(k, θ)-nucleus, which in turn is contained in an ℓ-(k, θ)-nucleus.

Example 1: Consider graph G shown in Figure 1 [Left]. Let us assume that the red edges have probability 0.9, the blue edges have probability 0.8, the green dashed edge has probability 1, and the black dot-dashed edge has probability 0.5. Let θ = 0.13. It can be verified that each triangle △ in G is contained in at least 2 4-cliques with probability at least 0.134, i.e. Pr(X_{G,△,ℓ} ≥ 2) ≥ 0.134 ≥ θ. Thus, G is an ℓ-(2, 0.13)-nucleus. However, G cannot be a w-(2, 0.13)- or g-(2, 0.13)-nucleus. For instance, consider triangle △ = (3, 5, 6). In all the valid possible worlds of G, the clique on vertices {3, 4, 5, 6, 8} should exist, since this is the only deterministic 2-nucleus which contains △. Thus, the edges of this clique (9 blue and 1 red) should exist, and the other edges in G can either exist or not in the possible worlds of G.
As a result, we get Pr(X_{G,△,w} ≥ 2) = 0.8^9 · 0.9 ≈ 0.120 < θ.

Now, consider the subgraph H induced by vertices {1, 2, 3, 4, 6, 7}, Figure 1 [Right]. We show that H is a w-(2, 0.13)-nucleus. Our reasoning is as follows. Ignoring probabilities, H consists of two deterministic 2-nuclei, one induced by {1, 2, 3, 4, 7} (call it cl_1) and the other induced by {2, 3, 4, 6, 7} (call it cl_2). Triangles in H can belong to either cl_1 or cl_2. Consider an arbitrary triangle △ in cl_1. To compute Pr(X_{H,△,w} ≥ 2), all the possible worlds of H which contain cl_1 as a deterministic 2-nucleus are valid. As a result, all the edges in cl_1 should exist, and the edges (2, 6), (3, 6), and (4, 6) can either exist or not exist in the valid possible worlds (edge (7, 6) has probability 1). As such, we have 2^3 = 8 valid possible worlds. Summing over the existence probabilities of these possible worlds, we get Pr(X_{H,△,w} ≥ 2) = 0.9^10 ≈ 0.348 > θ. A similar reasoning can be applied for an arbitrary triangle △ in cl_2, which gives Pr(X_{H,△,w} ≥ 2) = 1 · 0.5 · 0.8^2 · 0.9^6 ≈ 0.170 > θ. Thus, we can say that H is a w-(2, 0.13)-nucleus.

Let us consider H in more detail. This subgraph cannot be a g-(2, 0.13)-nucleus. For instance, consider triangle △ = (1, 2, 3). For this triangle, there are only two valid possible worlds which are deterministic 2-nuclei: (1) the one in which all the edges exist (H_1), and (2) the one in which none of the edges (2, 6), (3, 6), and (4, 6) exists (H_2). Adding one of the edges (2, 6), (3, 6), and (4, 6) creates one extra triangle which will not be part of two 4-cliques; this results in the possible world not being a deterministic 2-nucleus. So, summing over these two possible worlds, we get: Pr(X_{H,△,g} ≥ 2) = 0.9^10 · 1 · 0.5 · 0.8^2 + 0.9^10 · 1 · (1 − 0.5) · (1 − 0.8)^2 ≈ 0.118 < θ. However, consider the subgraphs H_1 and H_2 induced by {1, 2, 3, 4, 7} and {2, 3, 4, 6, 7}, respectively. The only possible worlds of H_1 and H_2 that are deterministic 2-nuclei are the ones in which all their edges exist. So for each triangle △ in H_1 and △' in H_2, we have Pr(X_{H_1,△,g} ≥ 2) = 0.9^10 ≈ 0.348 > θ and Pr(X_{H_2,△',g} ≥ 2) = 1 · 0.5 · 0.8^2 · 0.9^6 ≈ 0.170 > θ. Thus, H_1 and H_2 are global g-(2, 0.13)-nuclei.

Nucleus Decomposition. The nucleus decomposition finds the set of all µ-(k, θ)-nuclei for different values of k. We study the problem in the three different modes we consider. Specifically, we call the nucleus-decomposition problems for the different modes ℓ-NuDecomp, g-NuDecomp, and w-NuDecomp, respectively.

In the following, we prove the uniqueness and hierarchical-containment properties of probabilistic nucleus decomposition.

Proposition 1: The local, weakly-global, and global nucleus decompositions are unique.

Proof: The uniqueness is based on the definitions of local, weakly-global, and global nucleus decomposition. Specifically, it follows from the property that each nucleus (local, weakly-global, or global) is a maximal subgraph satisfying the required property. As such, the set of maximal nuclei is unique.

Proposition 2: There exists a hierarchical-containment property for local, weakly-global, and global decomposition.

Proof: Let θ be an arbitrary and fixed user-defined threshold. To prove the hierarchical-containment property for local nucleus decomposition, let F be a local ℓ-(k + 1, θ)-nucleus. By the definition of local nucleus, each triangle in F has support at least k + 1 in F, with probability no less than θ. Since k + 1 > k, each triangle in F also has support at least k in F.
Thus, F is contained in an ℓ-(k, θ)-nucleus. This proves the property for local nucleus decomposition. For weakly-global decomposition, let H be a weakly-global w-(k + 1, θ)-nucleus. Referring to Definition 4, for each triangle △ ∈ H we have Pr(X_{H,△,w} ≥ k + 1) ≥ θ, and since X_{H,△,w} ≥ k + 1 implies X_{H,△,w} ≥ k, also Pr(X_{H,△,w} ≥ k) ≥ θ. Thus, H is contained in a w-(k, θ)-nucleus. A similar reasoning holds for the global case, as its definition is based on possible worlds which are deterministic nuclei.

We show that ℓ-NuDecomp can be computed in polynomial time and, furthermore, we give several algorithms to achieve efficiency for large graphs. Before this, we start by showing that g-NuDecomp and w-NuDecomp are #P-hard and NP-hard, respectively. Nevertheless, as we show later in the paper, once we obtain the ℓ-NuDecomp, we can use it as a basis, combined with sampling techniques, to effectively approximate g-NuDecomp and w-NuDecomp.

In this section, we show that g-NuDecomp and w-NuDecomp are NP-hard. For this we use a reduction from the k-clique problem. Furthermore, we can show that g-NuDecomp is even harder, namely #P-hard, using a reduction from the network reliability problem.

Definition 6: The k-clique Problem [3]. Given a graph G and input parameter k, the k-clique problem is to check whether there is a clique of size k in the graph. The k-clique problem is NP-complete.

We note the following interesting property of the k-nucleus.

Lemma 1: For any k, the only graph on (k + 3) vertices which is a deterministic k-nucleus is a (k + 3)-clique.

Proof: Recall that, based on the definition of the k-nucleus, each triangle is contained in at least k 4-cliques. Given the vertices {v_1, v_2, ..., v_{k+3}}, without loss of generality, let △_{123} = (v_1, v_2, v_3) be a triangle with vertices v_1, v_2, and v_3. The triangle △_{123} must be part of k 4-cliques; therefore, there must be an edge between each of the remaining k vertices and all the vertices of the triangle △_{123}. Now, new triangles are created, containing vertices {v_4, ..., v_{k+3}}. Let △_{ijt} be one of them, where i, j ∈ {1, 2, 3}, i ≠ j, and t = 4, ..., k + 3. This triangle must be contained in k 4-cliques as well. Thus, there should be edges between each vertex in the triangle △_{ijt} and the other k vertices. Thus, each vertex v_t becomes connected to all the other vertices, creating a clique on k + 3 vertices.

Theorem 4.1: w-NuDecomp and g-NuDecomp are NP-hard.

Proof: Given a graph G = (V, E), we define a probabilistic graph G = (V, E, p) as follows: for each edge e in G, we set p_e = p, for a suitably chosen constant p ∈ (0, 1), and we set θ = p^{(k+3)(k+2)/2}. We prove that a w-(k, θ)-nucleus (resp. g-(k, θ)-nucleus) of G exists if and only if a (k + 3)-clique exists in G. Let C be a (k + 3)-clique in G. Since C has (k + 3)(k + 2)/2 edges, its existence probability is p^{(k+3)(k+2)/2} = θ. In addition, in a (k + 3)-clique, each triangle is contained in exactly k 4-cliques. Thus, as a subgraph, C is both a w-(k, θ)-nucleus and a g-(k, θ)-nucleus of G. In the following, we show that if G does not contain a (k + 3)-clique, then the w-(k, θ)-nucleus and g-(k, θ)-nucleus are empty. We prove the case for weakly-global; the same reasoning applies to the global case as well. Suppose that G does not contain a (k + 3)-clique. For a contradiction, let us assume that a w-(k, θ)-nucleus of G exists and denote it by H. Based on Lemma 1, a (k + 3)-clique is the only graph which has k + 3 vertices and is a deterministic k-nucleus. Since, ignoring edge probabilities, H cannot be a (k + 3)-clique, it must have at least (k + 4) vertices. Furthermore, since it contains a k-nucleus for each triangle in it, the degree of each vertex is at least (k + 2). Hence, any possible world of H that contains a deterministic k-nucleus must contain strictly more than (k + 3)(k + 2)/2 edges, and for a sufficiently small choice of p the total probability of such worlds is less than θ = p^{(k+3)(k+2)/2}, a contradiction.
In the extended version of this paper [35], we show that g-NuDecomp is even harder, namely #P-hard.

Here we propose efficient algorithms for solving ℓ-NuDecomp. Peeling is a general strategy that has been used broadly in core and truss decompositions, as well as in deterministic nucleus decomposition [28]. However, generalizing peeling to compute ℓ-NuDecomp creates significant computational challenges. For example, one challenge is finding the support score for each triangle. This is because of the combinatorial nature of finding the maximum value of k such that Pr(X_{G,△,ℓ} ≥ k) ≥ θ for a triangle △. In particular, a triangle △ in a probabilistic graph can be part of different numbers of 4-cliques with different probabilities. As a result, considering all the subsets of 4-cliques which contain △ results in exponential time complexity.

In our algorithm, we identify two challenging tasks, namely computing and updating nucleus scores. Our process starts by computing a nucleus score κ(△) for each triangle △, which initially is the maximum k for which Pr(X_{G,△,ℓ} ≥ k) ≥ θ. Let △ = (u, v, w), and let z_1, ..., z_{c_△} be the vertices such that S_i = {u, v, w, z_i} is a 4-clique in G. In other words, for each i, S_i is the set of vertices of a 4-clique that contains △. For notational simplicity, we will also denote by S_i the 4-clique on {u, v, w, z_i}. Similarly, for each i, let E_i = {(u, z_i), (v, z_i), (w, z_i)} be the set of edges which connect vertex z_i to the vertices of △. Let Pr(E_i) = p(u, z_i) · p(v, z_i) · p(w, z_i) be the existence probability of E_i. We have:

Pr(X_{G,△,ℓ} ≥ k) = Σ_{i=k}^{c_△} Pr(X_{G,△,ℓ} = i).   (3)

Thus, we need to compute Pr(X_{G,△,ℓ} = k) for any k, and find the maximum value of k for which the probability on the left-hand side of Equation 3 is greater than or equal to θ. In fact, Pr(X_{G,△,ℓ} = k) gives the probability that △ is contained in k 4-cliques in G. Under the condition that △ exists, we denote by X_△(S_△, k, j) the probability that △ is contained in k of the 4-cliques from {S_1, ..., S_j} ⊆ S_△, where S_△ is the set of 4-cliques containing △ in G. In other words, X_△(S_△, k, j) is a conditional probability (conditioning on the existence of △). We fix an arbitrary order on S_△. The event that △ is contained in k of the 4-cliques from {S_1, ..., S_j} can be expressed as the union of the following two sub-events: (1) the event that the 4-clique S_j exists and △ is contained in (k − 1) of the 4-cliques from {S_1, ..., S_{j−1}}, and (2) the event that S_j does not exist and △ is part of k of the 4-cliques from {S_1, ..., S_{j−1}}. Thus, we have the following recursive formula:

X_△(S_△, k, j) = Pr(E_j) · X_△(S_△, k − 1, j − 1) + (1 − Pr(E_j)) · X_△(S_△, k, j − 1).   (4)

Multiplying X_△(S_△, k, c_△), computed by Equation 4, by Pr(△) (the existence probability of △) gives the desired probability Pr(X_{G,△,ℓ} = k). Thus, we have:

Pr(X_{G,△,ℓ} = k) = Pr(△) · X_△(S_△, k, c_△).   (5)

Given a triangle △, let the neighbor triangles of △ be those triangles which form a 4-clique with △. In the following, we show how we can update Pr(X_{G,△,ℓ} ≥ k) when a neighbor triangle is processed in the decomposition. Once the κ scores have been initialized as described above, a process of peeling "removes" the triangle △* of lowest κ-score, specifically marks it as processed, and updates the neighboring triangles (those contained in the same 4-cliques as the removed triangle) in terms of Pr(X_{G,△,ℓ} ≥ k). Because of the removal of △*, the cliques containing it cease to exist; thus the Pr(X_{G,△,ℓ} ≥ k) of the neighbors will change. We recompute this probability using the formula in Equation 4, where the set of cliques S_△ is updated to remove the cliques containing △*.
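To illustrate, here is a minimal Java sketch (our own illustrative code, not the authors' implementation) of the dynamic program in Equations 4 and 5: given Pr(△) and the probabilities Pr(E_i) of the c_△ extension-edge sets, it computes the distribution of the support and returns the largest k with Pr(X_{G,△,ℓ} ≥ k) ≥ θ.

```java
// A minimal sketch of the DP in Eq. (4)-(5), under our own naming
// assumptions. probE[i] = Pr(E_{i+1}); prTriangle = Pr(triangle exists).
public class LocalNucleusDP {
    // Returns the largest k >= 1 such that Pr(X >= k) >= theta (Eq. 3), or 0.
    static int maxSupportLevel(double prTriangle, double[] probE, double theta) {
        int c = probE.length;
        // x[k] = X_triangle(S, k, j) after processing the first j cliques.
        double[] x = new double[c + 1];
        x[0] = 1.0; // with zero cliques considered, support is 0 w.p. 1
        for (int j = 1; j <= c; j++) {
            double pj = probE[j - 1];
            // Iterate k downward so x[k - 1] still holds the stage-(j-1) value (Eq. 4).
            for (int k = j; k >= 1; k--)
                x[k] = pj * x[k - 1] + (1 - pj) * x[k];
            x[0] = (1 - pj) * x[0]; // support 0 requires E_j to be absent
        }
        // Scan tails: Pr(X >= k) = Pr(triangle) * sum_{i >= k} x[i] (Eq. 3 and 5).
        double tail = 0.0;
        for (int k = c; k >= 1; k--) {
            tail += x[k];
            if (prTriangle * tail >= theta) return k;
        }
        return 0;
    }

    public static void main(String[] args) {
        double[] probE = {0.72, 0.72, 0.40}; // Pr(E_i) for c = 3 cliques
        System.out.println(maxSupportLevel(0.9, probE, 0.13));
    }
}
```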
Algorithm 1 computes the nucleusness of each triangle in G. In line 3, for each triangle △, κ(△) is initialized using Equation 4.

Algorithm 1: ℓ-NuDecomp(G, θ)
1: for all triangles △ ∈ G do
2:   compute Pr(X_{G,△,ℓ} ≥ k) using Equation 4
3:   κ(△) ← max k such that Pr(X_{G,△,ℓ} ≥ k) ≥ θ
4:   processed[△] ← false
5: for all unprocessed △ ∈ G with minimum κ(△) do
6:   ν(△) ← κ(△)
7:   find the set S_△ of 4-cliques containing △
8:   for all S ∈ S_△ with no processed triangles do
9:     for all unprocessed triangles △' ∈ S, △' ≠ △ do
10:      update κ(△') using Equation 4
11:  processed[△] ← true
12: return array ν(·)

Array processed records whether a triangle has been processed or not in the algorithm (line 4). At each iteration (lines 5-11), an unprocessed triangle △ with minimum κ(△) is considered, and its nucleus score is set and stored in array ν (line 6). Then, the κ values of all the neighboring triangles are updated using Equation 4. The affected triangles are those unprocessed triangles which are part of the same 4-clique as triangle △. The algorithm continues until all the triangles are processed. At the end, each triangle has obtained its nucleus score, and the array ν with these scores is returned (line 12). Once all the nucleus scores are obtained, we build the ℓ-(k, θ)-nuclei for each value of k.

Observe that the κ values for each triangle decrease or stay the same at each iteration. This implies that κ is a monotonic property function for triangles, similar to the properties described in [36] for vertices. Now, we can use a reasoning similar to the one in [36] to show that our algorithm, which repeatedly removes a triangle with the smallest κ value, gives the correct nucleusness for each triangle.

Time complexity: Using dynamic programming, lines 2-3 take O(Σ_{△∈G} κ_△ · c_△) time, where κ_△ is the nucleusness obtained for each triangle △ in line 3 and c_△ is the number of 4-cliques containing △. Let κ_max be the maximum κ_△ over all the triangles in G. Since c_△ ≤ d_max, this is O(κ_max · d_max · T_G), where T_G is the total number of triangles in the graph and d_max is the maximum degree in G. For each triangle △, finding all the S_i's takes O(d_max) time, and updating each of its neighbor triangles costs O(κ_max · d_max); therefore, the running time for processing all the triangles in the peeling phase is O(κ_max · d_max^2 · T_G). Thus, the total running time of Algorithm 1 is bounded by O(κ_max · d_max^2 · T_G), and we can state the following.

Theorem 5.1: ℓ-NuDecomp can be computed in polynomial time.

The space complexity is O(T_G). This space is needed to store the triangles (not the 4-cliques) and their κ values. This is the same as the space complexity of deterministic nucleus decomposition.

While being able to compute ℓ-NuDecomp in polynomial time is good news, finding the maximum k such that Pr(X_{G,△,ℓ} ≥ k) ≥ θ is quadratic in c_△, which is not efficient for large probabilistic graphs. As an alternative approach, we now propose efficient methods to approximate Pr(X_{G,△,ℓ} ≥ k) in O(c_△) time, such that the results are practically indistinguishable from the exact values. The approximation is based on limit theorems, such as Le Cam's Poisson Limit Theorem [37] and Lyapunov's Central Limit Theorem [38].

With slight abuse of notation, we also define each E_i as an indicator random variable which takes on value 1 if all the edges in E_i exist, and value 0 if at least one of the edges in the set does not exist. We observe that the variables E_i are mutually independent, since the sets E_i do not share any edges. Also, each Bernoulli variable E_i takes value 1 with probability Pr(E_i). Let ζ = Σ_{i=1}^{c_△} E_i be the (conditional) support of △. We can verify the following proposition.

Proposition 3: The expected value and variance of ζ are µ = Σ_{i=1}^{c_△} Pr(E_i) and σ² = Σ_{i=1}^{c_△} Pr(E_i) · (1 − Pr(E_i)), respectively.

Now we show that we can approximate the distribution of ζ using Le Cam's Theorem, which makes use of the Poisson distribution [37].

Poisson Distribution [39]: A discrete random variable X is said to have a Poisson distribution with positive parameter λ if the probability mass function of X is given by:

Pr(X = k) = e^{−λ} · λ^k / k!,  k = 0, 1, 2, ...   (6)

The expected value of a Poisson random variable is λ. Setting λ to µ, we can approximate the distribution of ζ by the Poisson distribution.
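As a quick illustration (our own sketch, assuming λ = µ as above), the Poisson tail Pr(X ≥ k) can be computed in O(k) time via the recurrence Pr(X = i + 1) = Pr(X = i) · λ/(i + 1); to approximate Pr(X_{G,△,ℓ} ≥ k), the tail is multiplied by Pr(△), per Equation 5.

```java
// Poisson tail Pr(X >= k) for X ~ Poisson(lambda), using the pmf recurrence.
// Illustrative sketch, not the authors' code.
public class PoissonTail {
    static double tail(double lambda, int k) {
        double pmf = Math.exp(-lambda); // Pr(X = 0)
        double cdf = 0.0;
        for (int i = 0; i < k; i++) {
            cdf += pmf;                 // accumulate Pr(X = i)
            pmf *= lambda / (i + 1);    // advance to Pr(X = i + 1)
        }
        return 1.0 - cdf;               // Pr(X >= k)
    }

    public static void main(String[] args) {
        System.out.println(tail(1.84, 2)); // e.g. lambda = mu = sum of Pr(E_i)
    }
}
```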
Using Le Cam's Theorem [37], the error bound on the approximation is as follows:

Σ_{k=0}^{∞} | Pr(ζ = k) − e^{−λ} λ^k / k! | < 2 · Σ_{i=1}^{c_△} Pr(E_i)².   (7)

Equation 7 shows that the Poisson distribution is reliable if the Pr(E_i)'s and c_△ are small. We observe that computing tail probabilities for the Poisson distribution is easy in practice, as these probabilities satisfy a simple recursive relation (see the sketch above).

The Poisson approximation becomes unreliable when the Pr(E_i)'s are large: the variance of the Poisson distribution equals its mean λ = µ, where µ = Σ Pr(E_i) is the expected value of the distribution of ζ, whereas the variance of ζ is σ², which is considerably smaller than µ in this case. To tackle the problem, we use a Translated Poisson distribution [40]: Y = ⌊λ_2⌋ + Z, where λ_2 = µ − σ², Z is a Poisson random variable with parameter σ² + {λ_2}, and {λ_2} = λ_2 − ⌊λ_2⌋ denotes the fractional part. Thus, the difference between the variance of Y and that of ζ can be written as:

Var(Y) − Var(ζ) = (σ² + {λ_2}) − σ² = {λ_2} < 1,

where {λ_2} = λ_2 − ⌊λ_2⌋. As can be seen, the difference between the variances becomes small in this case, and the complexity of obtaining Pr(X_{G,△,ℓ} ≥ k) remains the same.

We now consider the scenario when c_△ is large. In this case, the variance of ζ will be large. In the following, we show the use of the Central Limit Theorem for this case. An important theorem in statistics, Lyapunov's Central Limit Theorem (CLT) [38], states that, given a set of random variables (not necessarily i.i.d.), their properly scaled sum converges to a normal distribution under certain conditions. If c_△, and hence σ², is large, then by [38], using the CLT we can subtract µ = Σ_{i=1}^{c_△} Pr(E_i) from the sum of the E_i's and divide by σ. As a result, we have:

Pr(X_{G,△,ℓ} ≥ k) ≈ Pr(△) · Pr(Z ≥ (k − µ)/σ),   (11)

where Z has the standard normal distribution. We can then find the maximum value of k such that the right-hand side of Equation 11 is equal to or greater than the threshold. Evaluation of each probability can be done in constant time; thus, finding the maximum value of k can be done in O(c_△) time.

Binomial Distribution. In many networks, edge probabilities are close to each other and, as a result, for each triangle △, the Pr(E_i)'s are also close to each other. In that case, the distribution of the support of the triangle can be well approximated by the Binomial distribution. A random variable X is said to have a Binomial distribution with parameters p and n if the probability mass function of X is given by [41]:

Pr(X = k) = C(n, k) · p^k · (1 − p)^{n−k}.

In the above equation, p is the success probability, and n is the number of experiments. In statistics, the sum of non-identically distributed and independent Bernoulli random variables can be approximated by the Binomial distribution [42]. As discussed in [42], the Binomial distribution provides a good approximation if its variance is close to the variance of ζ. For the approximation, we set n = c_△ and n · p = µ. We observe that tail probabilities for the Binomial distribution can be calculated inexpensively, as these probabilities satisfy the well-known recursive relation

Pr(X = k + 1) = Pr(X = k) · ((n − k)/(k + 1)) · (p/(1 − p)).

In summary, our hybrid framework chooses a method for each triangle △ as follows: 1) If c_△ is at least a number A and σ² is at least a number B, the Normal approximation (CLT) is used. 2) If B ≤ c_△ < A and each Pr(E_i) is less than a number C, the Poisson approximation is used. 3) If B ≤ c_△ < A and some Pr(E_i) ≥ C, the Translated Poisson approximation is used. 4) If the ratio of σ² to the variance of the Binomial distribution with n = c_△ and p = µ/n is close to 1 (e.g. not less than a number D), the Binomial approximation is used. 5) Otherwise, Dynamic Programming is used.

For selecting the thresholds, we refer to the statistics literature. In particular, the CLT gives a good approximation if the number of random variables in the sum (for our problem, c_△) is at least 30 ([43], p. 547). In fact, we set our threshold A = 200, much higher than what is suggested by the literature. Also, regarding the Poisson distribution, the existence probabilities of the indicator random variables in the sum (for our problem, the Pr(E_i)'s) should be less than 0.25 (see [37]). So, we set C = 0.25. We set B to be half of A so that it is considerably far from A (the threshold on c_△). We set D = 0.9, which is close enough to 1.
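The selection logic can be summarized in code. The following Java sketch is our own illustration: the constants mirror the text, but the branch structure is a plausible reading of the framework rather than the authors' exact rules.

```java
// Illustrative method selector for the hybrid framework (assumed structure).
public class HybridSelector {
    enum Method { NORMAL, POISSON, TRANSLATED_POISSON, BINOMIAL, DP }

    static final int A = 200, B = 100;      // thresholds related to c (number of 4-cliques)
    static final double C = 0.25, D = 0.9;  // thresholds on Pr(E_i) and the variance ratio

    static Method choose(double[] probE) {
        int c = probE.length;
        double mu = 0, sigma2 = 0, maxP = 0;
        for (double p : probE) {
            mu += p;
            sigma2 += p * (1 - p);
            maxP = Math.max(maxP, p);
        }
        if (c >= A) return Method.NORMAL;               // CLT regime
        if (c >= B) return maxP < C ? Method.POISSON    // Le Cam regime
                                    : Method.TRANSLATED_POISSON;
        double binomVar = c * (mu / c) * (1 - mu / c);  // n = c, p = mu / n
        if (binomVar > 0 && sigma2 / binomVar >= D) return Method.BINOMIAL;
        return Method.DP;                               // exact fallback
    }
}
```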
When using A = 200, B = 100, C = 0.25, D = 0.9, we observed that the results of computing Pr(X_{G,△,ℓ} ≥ k) using an approximation are practically indistinguishable from the solution of dynamic programming. Furthermore, as we observed in our experiments, falling back to dynamic programming in point (5) happens only in a small number of cases, i.e. most triangles in real-world networks satisfy one of the earlier conditions (1)-(4). This means we can avoid dynamic programming for most of the triangles.

In this section, we propose algorithms for computing global and weakly-global nucleus decomposition. Given a graph H, computing Pr(X_{H,△,g} ≥ k) and Pr(X_{H,△,w} ≥ k) requires finding all the possible worlds of H, which are 2^{|E(H)|} in total, where |E(H)| is the number of edges in H. This is prohibitive. Thus, we use Monte Carlo sampling to estimate the probabilities, denoted by P̂r(X_{H,△,g} ≥ k) and P̂r(X_{H,△,w} ≥ k). The following lemma states a special version of Hoeffding's inequality [44] that provides the minimum number of samples required to obtain an unbiased estimate.

Lemma 2: Let Y_1, ..., Y_n be independent random variables bounded in the interval [0, 1]. Also, let Ȳ = (1/n) · Σ_{i=1}^{n} Y_i. Then, we have that

Pr(|Ȳ − E[Ȳ]| ≥ ε) ≤ 2 · exp(−2nε²).

In other words, for any ε, δ ∈ (0, 1], if n ≥ (1/(2ε²)) · ln(2/δ), then Pr(|Ȳ − E[Ȳ]| ≥ ε) ≤ δ.

Based on the above, using Monte Carlo sampling, we can obtain an estimate of Pr(X_{H,△,g} ≥ k) and Pr(X_{H,△,w} ≥ k) for any subgraph H by sampling n ≥ (1/(2ε²)) · ln(2/δ) possible worlds H_1, ..., H_n of H, where ε is an error bound and δ is a probability guarantee. In particular, we have:

P̂r(X_{H,△,µ} ≥ k) = (1/n) · Σ_{i=1}^{n} 1_µ(H_i, △, k),   (15)

where µ = g or w, and the indicator function 1_µ(H_i, △, k) is given in Definition 4. Based on Lemma 2, what we obtain is an unbiased estimate. Thus, setting µ = g, w, we have

Pr(| P̂r(X_{H,△,µ} ≥ k) − Pr(X_{H,△,µ} ≥ k) | ≥ ε) ≤ δ.   (16)
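A minimal Java sketch of this estimator (our own illustration; the predicate stands in for the mode-specific indicator 1_µ of Definition 4, whose deterministic nucleus check we leave abstract):

```java
import java.util.Random;
import java.util.function.Predicate;

// Monte Carlo estimate of Pr(X_{H,triangle,mu} >= k) per Eq. (15),
// with the sample size n from Lemma 2. Names are our own assumptions.
public class NucleusSampler {
    // Minimum n such that 2 * exp(-2 * n * eps^2) <= delta (Lemma 2).
    static int sampleSize(double eps, double delta) {
        return (int) Math.ceil(Math.log(2.0 / delta) / (2.0 * eps * eps));
    }

    // indicator.test(world) plays the role of 1_mu(H_i, triangle, k):
    // it must check the triangle's mode-mu condition in the sampled world.
    static double estimate(double[] edgeProbs, Predicate<boolean[]> indicator,
                           double eps, double delta, long seed) {
        int n = sampleSize(eps, delta);
        Random rnd = new Random(seed);
        int hits = 0;
        for (int s = 0; s < n; s++) {
            boolean[] world = new boolean[edgeProbs.length];
            for (int e = 0; e < edgeProbs.length; e++)
                world[e] = rnd.nextDouble() < edgeProbs[e]; // flip biased coin
            if (indicator.test(world)) hits++;
        }
        return (double) hits / n; // unbiased; |error| <= eps w.p. >= 1 - delta
    }
}
```

For example, sampleSize(0.1, 0.1) evaluates to 150, matching the sample size reported for ε = δ = 0.1 in the experiments below.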
In what follows, we propose an algorithm for finding all g-(k, θ)-nuclei for different values of k = 1, ..., k_max, where k_max is the largest value for which the local nucleus is non-empty. This suffices because we extract global nuclei from local ones, since every g-(k, θ)-nucleus is part of an ℓ-(k, θ)-nucleus. The main steps of our proposed algorithm are given in Algorithm 2. Given a positive integer k, threshold θ, error bound ε, and confidence level δ, the algorithm starts by creating the subgraph C_k as the union of all ℓ-(k, θ)-nuclei (line 4). Then, the algorithm incrementally builds a candidate g-(k, θ)-nucleus H as follows. For each triangle △ in C_k, it adds to H all the 4-cliques in C_k containing △ (line 6). By this process, other triangles △' could potentially be added to H such that the number of 4-cliques containing △' is less than k. In order to remedy this, the algorithm adds all the 4-cliques of C_k containing △' to H. This process continues until all the triangles in H are contained in at least k 4-cliques (lines 7-8). Once the candidate graph H is obtained, n sampled possible worlds of H are obtained (line 10). Then, the algorithm checks if the condition P̂r(X_{H,△,g} ≥ k) ≥ θ is satisfied for each triangle △ in H (lines 11-14). At the end, the algorithm returns all g-(k, θ)-nuclei H (lines 15-16), for all the possible values of k.

Algorithm 2: finding all g-(k, θ)-nuclei
1: solution ← ∅
2: n ← number of samples for the given ε and δ
3: for all k ← 1 to k_max do
4:   C_k ← the union of ℓ-(k, θ)-nuclei computed by Algorithm 1
5:   for all △ ∈ C_k do
6:     H ← all 4-cliques in C_k containing △
7:     while ∃ △' ∈ H with fewer than k 4-cliques in H containing it do
8:       add all 4-cliques of C_k containing △' to H
9:     condition_hold ← true
10:    sample ← {H_1, ..., H_n}
11:    for all △' ∈ H do
12:      P̂r(X_{H,△',g} ≥ k) ← Eq. (15)
13:      if P̂r(X_{H,△',g} ≥ k) < θ then
14:        condition_hold ← false; break
15:    if condition_hold == true then
16:      solution ← solution ∪ H
17: return solution

w-(k, θ)-nucleus. In what follows, we propose an algorithm for finding all w-(k, θ)-nuclei, for different values of k = 1, ..., k_max, where k_max is as before. We begin by noting that each w-(k, θ)-nucleus is an ℓ-(k, θ)-nucleus. The steps of our proposed algorithm are given in Algorithm 3. We use an array global_score to store the number of deterministic k-nuclei that each triangle belongs to. The array is initialized to zero for all the triangles in the candidate graph (line 5). For each candidate graph, which is an ℓ-(k, θ)-nucleus, we obtain the required number n of possible worlds for the given ε and δ. Then, we perform deterministic nucleus decomposition on each world (lines 6-8). If a triangle △ is in a deterministic k-nucleus of that possible world, the corresponding index of △ in array global_score is incremented by one (lines 9-10). In line 12, the approximate value P̂r(X_{H,△,w} ≥ k) is obtained for each triangle. Then, we start creating the connected components using the triangles with P̂r(X_{H,△,w} ≥ k) ≥ θ (line 13). At the end, the algorithm returns all w-(k, θ)-nuclei, for all the possible values of k.

Algorithm 3: finding all w-(k, θ)-nuclei
1: solution ← ∅
2: n ← number of samples for the given ε and δ
3: for all k ← 1 to k_max do
4:   for all candidate graphs H that are ℓ-(k, θ)-nuclei do
5:     global_score[△] ← 0 for all △ ∈ H
6:     sample ← {H_1, ..., H_n}
7:     for all H_i ∈ sample do
8:       perform deterministic nucleus decomposition on H_i
9:       for all △ in a deterministic k-nucleus of H_i do
10:        global_score[△] ← global_score[△] + 1
11:    for all △ ∈ H do
12:      P̂r(X_{H,△,w} ≥ k) ← global_score[△]/n
13:    solution ← solution ∪ connected union of △'s with P̂r(X_{H,△,w} ≥ k) ≥ θ
14: return solution

Remark. Both of these algorithms run in polynomial time. They compute the correct answer provided that the estimation of the threshold probabilities using Monte-Carlo sampling is close to the true value. If not, they give approximate solutions.

Space Complexity. For global and weakly-global decompositions, the space needed is O(T_G + m · n), where m is the number of edges in H and n is the number of possible worlds of H we sample. From a theoretical point of view, n, the number of samples, is constant for fixed values of ε and δ, and since m, the number of edges, is absorbed by T_G, we can say that the above complexity is again O(T_G), i.e. the same as the space complexity for deterministic nucleus decomposition. From a practical point of view, for each sample graph (possible world) we use a bit array to record whether an edge exists in the sample or not. For practical values of ε and δ, m · n is about 200 · m bits, which is 200/(32 + 32) ≈ 3 times more than the space needed to store the edges as adjacency lists (assuming an integer node id is 32 bits, and the graph is undirected, i.e. each edge is represented as two directed edges). In other words, to store the n possible worlds we only need about three times more space than what is needed to store G.
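To illustrate the bit-array representation (our own sketch of the storage scheme described above, not the authors' code): each sampled world stores one bit per edge, so n worlds cost n bits per edge, versus 64 bits per undirected edge in adjacency lists.

```java
import java.util.BitSet;
import java.util.Random;

// Storing n sampled possible worlds of a graph with m edges as bit arrays:
// world i keeps bit e set iff edge e exists in that sample.
public class SampledWorlds {
    final BitSet[] worlds;

    SampledWorlds(double[] edgeProbs, int n, long seed) {
        Random rnd = new Random(seed);
        worlds = new BitSet[n];
        for (int i = 0; i < n; i++) {
            worlds[i] = new BitSet(edgeProbs.length);
            for (int e = 0; e < edgeProbs.length; e++)
                if (rnd.nextDouble() < edgeProbs[e]) worlds[i].set(e);
        }
    }

    boolean edgeExists(int world, int edge) {
        return worlds[world].get(edge);
    }
}
```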
We present our extensive experimental results to test the efficiency, effectiveness, and accuracy of our proposed algorithms. Our implementations are in Java, and the experiments are conducted on a commodity machine with an Intel i7 2.2 GHz CPU and 12 GB RAM, running Ubuntu 18.04.

Datasets and Experimental Framework. Statistics for our datasets are in Table I. We order the datasets based on the number of triangles they contain. Datasets with real probabilities are flickr, dblp, and biomine from [1], [45], and krogan from [46]. We also consider the datasets ljournal-2008 and pokec. ljournal-2008 is obtained from the Laboratory of Web Algorithmics (http://law.di.unimi.it/datasets.php) and pokec is from the Stanford Network Analysis Project (http://snap.stanford.edu). For these networks, we generated edge probabilities uniformly distributed in (0, 1].

We evaluate our algorithms on three important aspects. First is efficiency; for this, we report the running times of our algorithms in Subsection VII-A. Second is the accuracy and closeness of our approximation methods; we discuss this in Subsection VII-B. Third is the quality of the output nuclei, as measured by density and probabilistic clustering coefficient, which are discussed in Subsection VII-C. Finally, in Subsection VII-D we show the usefulness of our nucleus definitions over probabilistic graphs by presenting a detailed use case.

In this section, we report the running times of our proposed algorithms for local nucleus decomposition: one that uses dynamic programming and the other that uses statistical approximations for computing and updating the support of triangles. We denote them by DP and AP, respectively. Next, we report the running times of our (fully) global and weakly-global nucleus decomposition algorithms, which we denote by FG and WG. We set error bound ε = 0.1 and confidence level δ = 0.1. Based on these values and Lemma 2, we set the number of samples to n = 200 > (1/(2ε²)) · ln(2/δ) (i.e. greater than what is required by Hoeffding's inequality). As such, our results for the global and weakly-global notions of nucleus are approximate but come with strong quality guarantees.

Running time results for DP and AP are shown in Figure 2 for different values of θ. The y-axis for the last 3 plots is in log-scale. Both algorithms perform well on the medium-size datasets, dblp and flickr; computing the nucleus decomposition of these two graphs takes less than 1 sec. For a larger dataset, pokec, both algorithms complete in less than 10 min. Note that AP clearly outperforms DP on large datasets such as biomine and ljournal-2008 for small values of θ. For instance, for ljournal-2008 with θ = 0.1, only AP can run to completion, whereas DP could not complete after one day. Nevertheless, both DP and AP are able to run in reasonable time for all the other cases, which is good considering that nucleus decomposition is a harder problem than core and truss decomposition. In general, the running times of both DP and AP decrease significantly as θ increases. This is because the number of triangles which (a) exist with probability greater than θ and (b) have support at least k, again with probability greater than θ, decreases. As can be seen, AP is faster than DP on all datasets for different values of θ. In addition to the ljournal-2008 case, for which only AP could complete when θ = 0.1, the gain of AP over DP is about 24% and 30% for biomine and pokec, respectively.

For the speed-up evaluation of AP vs. DP, we added two more datasets. The statistics of these datasets are given in Table II. The first dataset is enwiki-2013. What is special about this dataset is that its maximum initial nucleus score is 2,813, which is much larger than in the other graphs we consider. We set θ = 0.1; when θ is small, more triangles can have enough probability to be part of a much larger number of 4-cliques. This can cause too much work for DP to compute nucleus scores and update these values when the triangles are being processed in the peeling step. For this dataset, DP was not able to complete the computation within one week. In contrast, AP completed in about 80K sec (less than a day). The other additional dataset we considered is itwiki-2013. The maximum initial nucleus score in this dataset is 1,866. In this graph, using the same θ = 0.1, DP needs about 40h, whereas AP needs 16h, i.e. AP is 2.5 times faster than DP.
Moreover, we ran DP and AP on biomine with θ = 0.01. DP took about 37.5h, whereas AP took 2.5h, thus being 15 times faster than DP.

We report the running times of FG and WG in Figure 3, along with the running time of local nucleus decomposition (denoted by L in the figure) for θ = 0.1 (which, as explained above, is more difficult than θ = 0.2, ..., 0.5). Note that the global and weakly-global nuclei are obtained from the local ones using Algorithms 2 and 3. Therefore, their running time includes the time required for obtaining local nuclei. For local decomposition, we use DP to obtain the probabilistic support of the triangles, except for ljournal-2008, for which we use AP, since DP does not scale for this threshold. Also, we report running times averaged across 5 runs, since the solutions of FG and WG depend on the random sampling steps. In general, WG is faster than FG. This is because WG performs deterministic nucleus decomposition only on a fixed number of sample graphs, while FG does the decomposition every time a candidate graph is detected. We also note that as the graph becomes larger, WG has to perform nucleus decomposition on larger sample graphs, leading to increased running time. For FG, the candidate graphs are usually small, even for large graphs. So, as the graph becomes larger, the runtime of WG increases more than that of FG.

Moreover, we compare the running times of the local, weakly-global, and global nucleus decomposition algorithms on biomine with θ = 0.1 and θ = 0.01 in Figure 4. For the local decomposition (L) we used DP, because we are interested in the relative difference in running time between the different nucleus notions, and L is the initial step for computing WG and FG. When θ decreases, running times increase, since more triangles can have enough probability to be contained in a local nucleus subgraph. In terms of the size of the results, Table III shows the average number of vertices and edges of the L, WG, and FG subgraphs, aggregated over all k ∈ [1, k_max]. In general, the average values increase as we decrease the threshold. This is due to the fact that by decreasing θ, more triangles can have enough probability to be contained in 4-cliques.

To evaluate the accuracy of the AP algorithm, we compare the final nucleus scores obtained by the DP and AP algorithms. We report the results in Table IV. We show the results for θ equal to 0.2 and 0.4, since for the remaining values the error results do not differ significantly. The second column shows the average difference (error) from the true value over the total number of triangles. The last column shows the percentage of triangles whose value is different from their exact value. As can be seen, the average error is quite small for all the datasets we consider. In particular, for flickr with θ = 0.4, and for biomine with θ = 0.2 and θ = 0.4, AP computes the nucleus decomposition with zero error. Also, the percentage of triangles with an error in their score is very small, namely less than 1% for all the datasets except krogan and ljournal-2008. For these two, the percentages are still small, 5.24% and 1.79%, respectively. These results show that the output of AP is very close to that of exact computation by DP, and thus AP is a reliable approximation methodology.

Here we compare the cohesiveness of the ℓ-(k, θ)-nucleus with the cohesiveness of the local (k, γ)-truss [24] and the (k, η)-core [1]. We use two metrics. The first metric is the probabilistic density (PD) of a graph G, which we denote by PD(G) and define as follows [24]:

PD(G) = (Σ_{e∈E} p_e) / (|V| · (|V| − 1)/2).   (17)

In words, the PD of a probabilistic graph is the ratio of the sum of edge probabilities to the number of possible edges in the graph. The second metric is the probabilistic clustering coefficient (PCC). It measures the level of tendency of the nodes to cluster together. Given a probabilistic graph G, its PCC is defined as follows [24], [47]:

PCC(G) = (3 · Σ_{△_{uvw} ∈ G} p_{uv} · p_{uw} · p_{vw}) / (Σ_{(u,v),(u,w) ∈ E, v ≠ w} p_{uv} · p_{uw}).   (18)
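A small Java sketch (our own, assuming an adjacency-map representation where adj.get(u).get(v) = p_uv and every vertex appears as a key) of these two metrics:

```java
import java.util.*;

// Computes PD (Eq. 17) and PCC (Eq. 18) for a probabilistic graph.
// Illustrative sketch, not the authors' code.
public class CohesivenessMetrics {
    static double pd(Map<Integer, Map<Integer, Double>> adj) {
        int n = adj.size();
        double sum = 0;
        for (var u : adj.keySet())
            for (var e : adj.get(u).entrySet())
                if (u < e.getKey()) sum += e.getValue(); // count each edge once
        return sum / (n * (n - 1) / 2.0);
    }

    static double pcc(Map<Integer, Map<Integer, Double>> adj) {
        double closed = 0, wedges = 0;
        for (var u : adj.keySet()) {
            List<Integer> nbrs = new ArrayList<>(adj.get(u).keySet());
            for (int i = 0; i < nbrs.size(); i++)
                for (int j = i + 1; j < nbrs.size(); j++) {
                    int v = nbrs.get(i), w = nbrs.get(j);
                    double pw = adj.get(u).get(v) * adj.get(u).get(w);
                    wedges += pw;                        // wedge centered at u
                    Double pvw = adj.get(v).get(w);
                    if (pvw != null) closed += pw * pvw; // triangle, counted 3x
                }
        }
        // Each triangle is counted once per center, i.e. 3 times in total,
        // so "closed" already equals the numerator of Eq. (18).
        return closed / wedges;
    }
}
```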
For the probabilistic nucleus, probabilistic truss, and probabilistic core subgraphs, we use the same threshold θ = γ = η, set to 0.1 and 0.3 (γ is used as the threshold in the truss case [24], and η is used as the threshold in the core case [1]). Table V reports results on dblp, pokec, and biomine; results for the other datasets are similar. For a given threshold, we report the statistics of the local (k_Nmax, θ)-nucleus, (k_Tmax, γ)-truss, and (k_Cmax, η)-core, where k_Nmax, k_Tmax, and k_Cmax are the maximum nucleus, truss, and core scores, respectively. Also, for k_Nmax, k_Tmax, and k_Cmax, we might obtain more than one connected component; we report the average statistics over such components. We denote by V_N, V_T, V_C the sets of nodes, by E_N, E_T, E_C the sets of edges, by PD_N, PD_T, PD_C the PD's, and by PCC_N, PCC_T, PCC_C the PCC's of the nucleus, truss, and core components, respectively. The last column shows the running time for computing each decomposition. We observe that sometimes nucleus decomposition is faster than truss decomposition. This is because in nucleus decomposition there could be fewer triangles that survive the specified threshold in terms of support than there are edges in truss decomposition.

As can be seen in the table, the (k_Nmax, θ)-nucleus produces high-quality results in terms of PD and PCC. We achieve a significantly higher PD and PCC for nucleus compared to truss and core. For instance, for dblp the PD for nucleus is 0.8 versus 0.611 and 0.264 for truss and core, which translates to nucleus being about 30% and 200% more dense than truss and core, respectively. Similar conclusions can be drawn for PCC as well.

Moreover, Figure 5 reports the average PD, average PCC, average number of edges in each ℓ-(k, θ)-nucleus, and number of connected components (ℓ-(k, θ)-nuclei) for an example dataset, flickr, with fixed θ = 0.3 and varying k. We see that even for small values of k, PD and PCC are considerably high (above 70-80%). In general, PD and PCC become larger as k increases, since denser nuclei will be detected by removing triangles having low probability of support by 4-cliques. This causes the final subgraphs to have edges with high probability only. Furthermore, since the ℓ-(k, θ)-nucleus implies connectivity, the number of connected components increases as k decreases; this results in an increase in the average number of edges in each ℓ-(k, θ)-nucleus.

Table V: Cohesiveness statistics of the ℓ-(k, θ)-nucleus (N), (k, γ)-truss (T), and (k, η)-core (C) on dblp, pokec, and biomine.

Finally, we compare the PD and PCC values of the g-(k, θ)-nucleus and w-(k, θ)-nucleus, over 5 runs of these algorithms, and the ℓ-(k, θ)-nucleus, for the krogan, flickr, and dblp datasets, using θ = 0.001 and averaging over all the possible values of k. The results are shown in Figure 6. We see that the g-(k, θ)-nucleus achieves higher cohesiveness, as expected. In addition, the w-(k, θ)-nucleus exhibits quite good PD and PCC values, higher than those for the ℓ-(k, θ)-nucleus.

Effect of ε and δ. We consider the krogan dataset with θ = 0.1. The choice of ε and δ influences the number n of possible worlds we sample.
For ε = 0.1 and δ = 0.1 we obtain n = 150. In order to see the fidelity of our results, we experiment by increasing n to higher values, namely 300, 500, 1000, 2000. As the results in Table VI show, the following metrics about global and weakly-global nuclei change very little: average PD, average PCC, average number of vertices, and average number of edges. Specifically, the first two metrics are dispersed by not more than 0.4% around their mean over the different values of n, and the last two metrics are dispersed by not more than 1.8%. There can be many ε and δ values corresponding to a given sample size; for illustration, for n = 150 we can have ε = 0.1, δ = 0.1, whereas for n = 2000 we can have ε = 0.03, δ = 0.05, i.e. even though in the latter case ε and δ decrease by factors of about 3 and 2, respectively, the nuclei results in terms of the aforementioned metrics are almost the same. This validates setting ε and δ to 0.1, since lower values do not offer significant improvement in the quality of the results.

Analysis of a DBLP collaboration network for task-driven team formation. To show the usefulness of nucleus decomposition in probabilistic graphs, we apply our decomposition algorithms to solve the task-driven team formation problem for a DBLP network. In task-driven team formation [1], we are given a probabilistic graph G_T = (V, E, p_T), which is obtained specifically for a task T. Vertices in G_T are individuals, and edge probabilities are obtained with respect to task T as described in [1]. Given a query ⟨Q, T⟩, where Q ⊂ V and T is a set of keywords describing a task, the goal is to find a set of vertices that contains Q and makes a good team to perform the task described by the keywords in T. By a good team we mean good affinity among the team members in terms of collaboration for the given task.

To solve task-driven team formation using nucleus decomposition, we extend the definition of [1] to employ probabilistic nuclei: Given a probabilistic graph G_T = (V, E, p_T) with respect to a task T, a query set Q of vertices, and a threshold θ, apply nucleus decomposition on G_T, find a (k, θ)-nucleus (local/weakly-global/global) which (1) contains the vertices in Q and (2) has the highest value of k for the given θ, and return the obtained subgraph as a solution.

In our experiment, we use a DBLP collaboration network from [1], where vertices are authors and edges represent collaboration on at least one paper. The dataset has 1,089,442 vertices and 4,144,697 edges. For each edge, we take the bag of words of the titles of all papers coauthored by the two authors connected by the edge, and apply Latent Dirichlet Allocation (LDA) [1], [48] to infer its topics and calculate the edge probability. Given a task T with keywords and the input collaboration network, we obtain a probabilistic graph G_T, in which p(u, v) represents the collaboration level in the papers co-authored by u and v related to task T ([1], [24]).

Fig. 7: a) A case study of task-driven team formation with keyword {"algorithm"} and query vertices {"Erik D. Demaine", "J. Ian Munro", "John Iacono"}, k = 2, and θ = 10^{-11}. The subgraph with thick blue edges corresponds to a g-(k, θ)-nucleus. The whole graph (of 10 vertices) is an ℓ-(k, θ)-nucleus, which coincides with a w-(k, θ)-nucleus in this example. b) A weakly-global w-(k, θ)-nucleus for task-driven team formation with query nodes {"Xindong Wu", "Bing Liu 0001", "Vipin Kumar"} and keyword {"algorithm"}, k = 1, and θ = 10^{-11}.
The first sample query we consider is ⟨{"algorithm"}, {"Erik D. Demaine", "J. Ian Munro", "John Iacono"}⟩. Figure 7a shows the subgraph obtained by the ℓ-(k, θ)-nucleus and w-(k, θ)-nucleus decompositions, where k = 2 and θ = 10^{-11}. The threshold is the same as the ones used in the case studies of previous works (on truss and core). As discussed in [1], the edge probabilities in the data are very small, which requires setting threshold θ to a small value. Remark: it should be noted that picking an appropriate value for the threshold can be done using binary search over (0, b], where b ≤ 1.

The subgraph contains all three authors in the query. It has 10 vertices and 33 edges. As can be seen, the obtained subgraph is quite good for task-driven team formation. All the authors in the subgraph are well-known and have strong collaboration affinity for working on a research paper related to algorithms. A g-(k, θ)-nucleus (same k and θ) that contains the query vertices is shown with thick blue edges in the same figure. As expected, this subgraph is more cohesive, and it happens to be a clique of size 6. Its density and clustering coefficient (PCC) are 0.138 and 0.140, as opposed to 0.099 and 0.110 for the local and weakly-global subgraphs. From a research perspective, the collaborations of the academics in the blue subgraph are more focused on designing efficient data structures.

We ran the global truss algorithm on the dataset. As expected, the global truss subgraph which contains the query authors (9 vertices and 18 edges) is bigger than the global nucleus, and its density and PCC are lower (0.067 and 0.086). We also ran global core decomposition as in [26] for the same k and θ. It should be noted that the global core definition is different from those of global truss and global nucleus; also, it does not assume connectivity between nodes. However, for fairness of comparison, we considered a connected component of this subgraph which contains the query authors. The obtained subgraph contains 569 vertices and 5294 edges, with density 0.003 and PCC 0.061. Regarding local truss, we obtained a subgraph with 170 vertices and 1033 edges, with density 0.008 and PCC 0.0872. On the other hand, local core decomposition results in density and PCC equal to 0.0084 and 0.0659, with 226 vertices and 2631 edges. As can be seen, our nucleus decomposition algorithm results in much better subgraphs in terms of vertex-set size and cohesiveness.

The second query we use shows the usefulness of the weakly-global notion. It has keyword {"algorithm"} and vertices {"Xindong Wu", "Bing Liu 0001", "Vipin Kumar"}. Figure 7b shows the w-(k, θ)-nucleus for this query, where the threshold is the same as before, and k = 1. The local nucleus containing the query authors had more than 100 nodes, while the global nucleus containing these three query authors was empty. This example shows that the weakly-global notion can discover interesting teams when the other two notions produce teams that are too big or too small (or empty). In particular, all the authors in the resulting subgraph are very well-known and have similar research areas, so they can form a good team related to the query keyword "algorithm". On the other hand, using global truss decomposition we could not obtain any subgraph. In addition, both local truss and local core decompositions did not lead to a desired team, as the number of vertices in such graphs is very large: 16,663 and 31,300, respectively.
In fact, it is not realistic for this many authors to collaborate on a paper related to algorithms. The density and PCC of the weakly-global subgraph are 0.036 and 0.0388, as opposed to density 0.00005 and PCC 0.0280 for local truss, and density 0.000001 and PCC 0.0236 for local core. The same argument holds for global core, with 2,997 vertices, 35,354 edges, density 0.0004, and PCC 0.0294. Local nucleus decomposition yields density 0.03 and PCC 0.0331 with 100 vertices, which is much smaller than local core and local truss.

Compared to other notions of dense subgraphs in probabilistic graphs, such as the truss decomposition of [24], we observed that our nucleus notions capture dense subgraphs better than their truss counterparts. For instance, in the example of [24] for the task-driven query ⟨{"data", "algorithm"}, {"Jeffrey D. Ullman", "Piotr Indyk"}⟩, local nucleus gives a smaller community than local truss, namely the community obtained by the global truss: ("Jeffrey D. Ullman", "Shinji Fujiwara", "Aristides Gionis", "Rajeev Motwani", "Mayur Datar", "Edith Cohen", "Cheng Yang", "Piotr Indyk"). This is interesting as it shows that, in some cases, communities obtained by the exponential-time algorithm of [24] for global truss can be obtained by our polynomial-time algorithm for local nucleus.

In summary, our case study shows that we can discover good communities with reasonable cohesiveness using the efficient algorithm for local nucleus. However, some local nuclei can be too big. If so, we can apply the algorithms for weakly-global or global nucleus decomposition on the local nuclei to get smaller and denser communities.

Nucleus Decomposition on the Human Biomine Dataset. We use the human Biomine dataset [49], which has 861,812 nodes and 8,666,287 edges. This dataset is different from the Biomine dataset we used for our efficiency evaluation. We consider how our notions perform in detecting proteins/genes that interact with the SARS-CoV-2 coronavirus. Bouhaddou et al. [50] found that the activities of human kinases can change during SARS-CoV-2 infection. We select three proteins, P04626, P12931, and P42684; they are tyrosine kinase-related proteins and come from UniProt, a freely accessible database of protein sequences and functional information. The gene names associated with these proteins are SRC, ERBB2, and ABL2. These proteins have received literature support for interaction with the SARS-CoV-2 coronavirus [50]-[56]. We refer to them as the proteins of interest. We find the subgraphs obtained by local, weakly-global, and global nucleus decomposition that contain these three nodes, and at the same time we compare these subgraphs with their truss and core counterparts in terms of density and size. For all notions we set the threshold θ = 0.001. Table VII shows the comparison of the different dense subgraph notions with respect to (1) the largest k for which the subgraph contains the proteins of interest, (2) the number of nodes in the subgraph, and (3) the density of the subgraph.

Table VII: Comparison of the dense subgraph notions containing the proteins of interest on the human Biomine dataset: largest k, number of nodes, and density; θ = 0.001 for all notions.
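Density and the probabilistic clustering coefficient (PCC) serve as the cohesiveness metrics throughout these comparisons. The sketch below shows how such metrics can be computed for a probabilistic subgraph under common expected-value definitions (expected edge mass over the number of vertex pairs for density, and the ratio of expected triangle mass to expected wedge mass for the clustering coefficient); these definitions are assumptions for illustration, and the exact definitions of PD and PCC used in the paper may differ.

```python
from itertools import combinations

def expected_density(vertices, p):
    """Expected density: sum of edge probabilities divided by the
    number of vertex pairs. `p` maps frozenset({u, v}) to the edge
    probability; absent pairs default to 0."""
    n = len(vertices)
    if n < 2:
        return 0.0
    mass = sum(p.get(frozenset(e), 0.0) for e in combinations(vertices, 2))
    return mass / (n * (n - 1) / 2)

def probabilistic_cc(vertices, p):
    """A probabilistic global clustering coefficient:
    3 * (expected triangle mass) / (expected wedge mass), where a
    triangle contributes the product of its three edge probabilities
    and a wedge the product of its two edge probabilities.
    O(n^3), which is fine for the small subgraphs compared here."""
    tri = wedge = 0.0
    for u, v, w in combinations(vertices, 3):
        puv = p.get(frozenset((u, v)), 0.0)
        puw = p.get(frozenset((u, w)), 0.0)
        pvw = p.get(frozenset((v, w)), 0.0)
        tri += puv * puw * pvw
        # each unordered triple contributes up to three wedges,
        # centered at u, v, and w respectively
        wedge += puv * puw + puv * pvw + puw * pvw
    return 3 * tri / wedge if wedge > 0 else 0.0
```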
We see that l-nucleus is denser than l-truss, as well as both l-core and g-core. Also, w-nucleus and g-nucleus are denser than g-truss. In terms of nodes, l-nucleus gives a subgraph that is much smaller than the subgraphs of l-core, g-core, and l-truss. More precisely, with respect to l-nucleus, the three proteins of interest appear in a nucleus of 95 vertices and 509 edges; a graph of this size is more amenable to further processing by human analysts.

To see which kind of biological function/process our detected community represents, we use Metascape (https://metascape.org/gp/index.html#/main/step1). Metascape [57] is a web-based portal that provides comprehensive gene list annotation and analysis resources. Using Metascape, we find that the proteins in the local nucleus are associated with several diseases, most of them forms of cancer (16 out of 20). The p-values of the associations are less than 10^-18, which is statistically very significant.

Figure 8 shows the weakly-global and global nuclei that contain the proteins of interest. All the nodes (green and pink) comprise the weakly-global subgraph; the pink nodes comprise the global nucleus. Using Metascape, we find that the proteins in our weakly-global and global subgraphs are associated with more specific forms of cancer, such as Uterine Carcinosarcoma and Hormone Refractory Prostate Cancer, respectively, with p-values less than 10^-6. These associations are statistically quite significant, especially given that these subgraphs are much smaller than the local nucleus (in general, the more observations we have, the smaller the p-values become). These findings are useful to biologists for performing targeted tests of whether drugs for the treatment of these diseases can also be repurposed for treating COVID-19 [54]. There are over 250 anticancer drugs approved by the FDA, but far fewer for specific kinds of cancer. Thus, showing connections to specific forms of cancer helps narrow the choice of drugs to repurpose. In summary, it is by running all three versions of nucleus decomposition on the Biomine dataset that we obtain these striking subgraphs, which point to potentially useful further investigation by biologists. Running only local nucleus decomposition would miss such interesting groups, no matter how we set the values of k and θ.

Difference Between the Weakly-Global and Global Definitions on BrightKite. To further illustrate the difference between the weakly-global and global definitions of nucleus decomposition in probabilistic graphs, we consider social network data from BrightKite (https://snap.stanford.edu/data/loc-brightkite.html). BrightKite was a location-based social networking service where users shared their locations by checking in. The friendship network was collected using their public API and consists of 58,228 nodes, 214,078 edges, and 4,491,143 check-ins between April 2008 and October 2010. We generated probabilities for each edge based on the Jaccard similarity between the neighborhoods of its two endpoints. Running weakly-global and global nucleus decomposition on this dataset with θ = 0.1, we retrieve 300 g-(k, θ) nuclei and 20 w-(k, θ) nuclei. For weakly-global subgraphs, k ranges in [1, 5], and for global subgraphs, k takes the values 1 and 2. As expected, global nuclei achieve better cohesiveness in terms of density and clustering coefficient. In particular, the average density and clustering coefficient of global nuclei over all values of k are 0.6951 and 0.6947, respectively, as opposed to 0.4844 and 0.5052 for weakly-global nuclei.
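For concreteness, here is a minimal sketch of the Jaccard-based edge-probability generation described above. It assumes the friendship network is given as an undirected edge list; whether the endpoints themselves are included in the neighborhoods is a convention choice not specified here, and the sketch uses open neighborhoods.

```python
from collections import defaultdict

def jaccard_probabilities(edges):
    """Assign each undirected edge (u, v) the probability
    |N(u) & N(v)| / |N(u) | N(v)|, the Jaccard similarity of the
    (open) neighborhoods of its endpoints."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    probs = {}
    for u, v in edges:
        inter = nbrs[u] & nbrs[v]
        union = nbrs[u] | nbrs[v]
        probs[(u, v)] = len(inter) / len(union)  # union is nonempty for an edge
    return probs

# Toy example: a triangle plus a pendant vertex.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(jaccard_probabilities(edges))
# {(1, 2): 0.333..., (2, 3): 0.25, (1, 3): 0.25, (3, 4): 0.0}
```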
We also report another interesting observation on this dataset. We computed the average number of check-ins by the users in the detected subgraphs. The average number of user check-ins in global nuclei is about 6% higher than in weakly-global nuclei. Moreover, there exist periods, for instance between August 2008 and April 2009, in which the average number of check-ins of the users in the global nuclei is 57% higher than in the weakly-global subgraphs. These results show that global nuclei capture user engagement (as measured by the number of check-ins) better than weakly-global nuclei.

Remark. All three of our models are useful, and they should be used in tandem. Local nuclei help identify dense subgraphs of interest, and we can adjust k and θ to obtain smaller and denser subgraphs. However, global and weakly-global nuclei can identify pockets that are impossible to obtain with local nuclei no matter how we adjust k and θ. For instance, in Example 1, it is only the global nucleus that can identify H_1 and H_2; no other notion can. In our DBLP use case, the local and weakly-global notions helped us identify a dense subgraph of researchers working on algorithms, whereas the global nucleus gave a particular pocket of researchers who, on closer examination, turned out to be especially focused on designing efficient data structures. In the same case study, we were able to identify a useful weakly-global nucleus of five researchers who are well known for their work on algorithms for data mining; the local nucleus was too big (more than 100 nodes), whereas the global nucleus was empty. These examples show that an analyst should run all three versions of nucleus decomposition in tandem on a dataset and then closely examine the results. We stress that the point is not just to obtain denser subgraphs as we go from local to weakly-global and global; more important than density is the detection of small pockets of nodes with nice properties that escape identification by the other notions. Finally, on the Biomine dataset, we observed that the group of proteins in a local nucleus containing the three proteins of interest was related to many forms of cancer, even though the proteins of interest have received literature support related to COVID-19. Based on consultations with bioinformatics researchers, this observation is of great importance for uncovering relationships between seemingly distant diseases. The weakly-global and global notions, in turn, were able to find subgraphs of the local nucleus composed of proteins related to more specific cancers. Investigating the connection of these diseases to COVID-19 is an interesting avenue for biologists to explore. To reiterate, a researcher should use all three notions of nucleus decomposition, as they provide different viewpoints and can reveal subgraphs that would be missed by the other notions.

VIII. RELATED WORK

In deterministic graphs, core and truss decompositions have been studied extensively [58]-[69]. Core decomposition in probabilistic graphs has been studied in [1], [23], [26], [70]. Bonchi et al. [1] were the first to introduce core decomposition for such graphs; they focus on finding a subgraph in which each vertex is connected to k neighbors within that subgraph with high probability. In [23], more efficient algorithms were proposed, which can also handle graphs that do not fit in main memory.
In [26], the authors focus on finding a subgraph that contains nodes with a high probability of being k-core members in the probabilistic graph. In [70], an index-based structure is defined for processing core decomposition in probabilistic graphs. In the probabilistic context, the notion of local (k, η)-truss was introduced by Huang, Lu, and Lakshmanan in [24]. Their algorithm for computing the local (k, η)-truss is based on iteratively peeling edges with support less than k and updating the support of affected edges. In addition, [24] proposed the notion of global (k, η)-truss based on the probability of each edge belonging to a k-truss in a possible world. In [71], an approximation algorithm for local truss decomposition was proposed to efficiently compute the tail probabilities of edge supports in the peeling process of [24]. In [72], truss decomposition is computed using an index-based approach.
Building on the well-studied notions of core and truss decomposition, Sarıyüce et al. [28] introduced nucleus decomposition in deterministic graphs and proposed an algorithm for computing (3, 4)-nuclei. In more recent work, Sarıyüce et al. [22] proposed efficient distributed algorithms for nucleus decomposition. Our work is the first to study nucleus decomposition in probabilistic graphs.

IX. CONCLUSION

In this work, we made several key contributions. We introduced the notions of local, weakly-global, and global nuclei for probabilistic graphs. We showed that computing weakly-global and global nuclei is intractable. We complemented these hardness results with effective algorithms for approximating them using Monte-Carlo sampling techniques. We designed a polynomial-time, peeling-based algorithm for computing local nuclei based on dynamic programming, and showed that its efficiency can be much improved using novel approximations based on the Poisson, Binomial, and Normal distributions. Finally, through an in-depth experimental study, we demonstrated the efficiency, scalability, and accuracy of our algorithms for nucleus decomposition on real-world datasets.

REFERENCES

[1] Core decomposition of uncertain graphs.
[2] Mining maximal cliques from an uncertain graph.
[3] Discovering highly reliable subgraphs in uncertain graphs.
[4] Maximizing the spread of influence through a social network.
[5] Limiting the spread of misinformation in social networks.
[6] Influence maximization: Near-optimal time complexity meets practical efficiency. In ACM SIGMOD International Conference on Management of Data.
[7] Distance-constraint reachability computation in uncertain graphs.
[8] Discovering frequent subgraphs over uncertain graph databases under probabilistic semantics.
[9] Learning influence probabilities in social networks.
[10] Trust prediction from user-item ratings.
[11] Using probabilistic confidence models for trust inference in web-based social networks.
[12] Genome-wide analysis of eukaryotic twin CX9C proteins.
[13] Identifying functional modules in protein-protein interaction networks: an integrated exact approach.
[14] Understanding network concepts in modules.
[15] Network-based prediction of protein function.
[16] Large scale cohesive subgraphs discovery for social network visual analysis.
[17] A general framework for weighted gene coexpression network analysis.
[18] MotifCut: regulatory motifs finding with maximum density subgraphs.
[19] A survey of community search over big graphs.
[20] Finding skyline communities in multi-valued networks.
[21] A complex network approach to text summarization.
[22] Local algorithms for hierarchical dense subgraph discovery.
[23] Efficient computation of probabilistic core decomposition at web-scale.
[24] Truss decomposition of probabilistic graphs: Semantics and algorithms.
[25] K-core decomposition of large networks on a single PC.
[26] Efficient probabilistic k-core computation on uncertain graphs.
[27] Truss decomposition in massive networks.
[28] Finding the hierarchy of dense subgraphs using nucleus decompositions.
[29] Nucleus decompositions for identifying hierarchy of dense subgraphs.
[30] Social centrality using network hierarchy and community structure.
[31] Effective and efficient dense subgraph query in large-scale social internet of things.
[32] HiDDen: hierarchical dense subgraph detection with application to financial fraud detection.
[33] Extracting brain disease-related connectome subgraphs by adaptive dense subgraph discovery.
[34] Detection of complexes in biological networks through diversified dense subgraph mining.
[35] Nucleus decomposition in probabilistic graphs: Hardness and algorithms.
[36] Fast algorithms for determining (generalized) core groups in social networks.
[37] An approximation theorem for the Poisson binomial distribution.
[38] Nouvelle forme du théorème sur la limite de probabilité.
[39] Handbook of the Poisson distribution.
[40] Translated Poisson approximation using exchangeable pair couplings.
[41] Probability, random variables, and stochastic processes. Tata McGraw-Hill Education.
[42] Binomial approximation to the Poisson binomial distribution.
[43] Probability and statistical inference.
[44] Probability inequalities for sums of bounded random variables.
[45] K-nearest neighbors in uncertain graphs.
[46] Global landscape of protein complexes in the yeast Saccharomyces cerevisiae.
[47] Methods to determine node centrality and clustering in graphs with uncertain structure.
[48] Latent Dirichlet allocation.
[49] Interactive exploration of heterogeneous biological networks with Biomine Explorer.
[50] The global phosphorylation landscape of SARS-CoV-2 infection.
[51] COVID-19-driven endothelial damage: complement, HIF-1, and ABL2 are potential pathways of damage and targets for cure.
[52] Examining the effector mechanisms of Xuebijing injection on COVID-19 based on network pharmacology.
[53] Increased expression of hypoxia-induced factor 1α mRNA and its related genes in myeloid blood cells from critically ill COVID-19 patients.
[54] Integrative COVID-19 biological network inference with probabilistic core decomposition.
[55] Imatinib is not a potent anti-SARS-CoV-2 drug.
[56] Interferon-stimulated gene products as regulators of central carbon metabolism.
[57] Metascape provides a biologist-oriented resource for the analysis of systems-level datasets.
[58] Network structure and minimum degree.
[59] Querying k-truss community in large and dynamic graphs. In ACM SIGMOD International Conference on Management of Data.
[60] Trusses: Cohesive subgraphs for social network analysis.
[61] Extracting, analyzing and visualizing triangle k-core motifs within networks.
[62] Distributed algorithms for k-truss decomposition.
[63] Efficient computation of importance based communities in web-scale networks using a single machine.
[64] Distributed k-core decomposition.
[65] When engagement meets similarity: Efficient (k,r)-core computation on social networks.
[66] Efficient core decomposition in massive networks.
[67] Streaming algorithms for k-core decomposition.
[68] Efficient bitruss decomposition for large-scale bipartite graphs.
[69] STruD: Truss decomposition of simplicial complexes.
[70] Index-based optimal algorithm for computing k-cores in large uncertain graphs.
[71] Fast truss decomposition in large-scale probabilistic graphs.
[72] Efficient probabilistic truss indexing on uncertain graphs.