key: cord-0047120-r7g9toun authors: Jaiswal, Rajesh; Ramanna, Sheela title: Detecting Overlapping Communities Using Distributed Neighbourhood Threshold in Social Networks date: 2020-06-10 journal: Rough Sets DOI: 10.1007/978-3-030-52705-1_32 sha: e6f4f198f41c8f02301443b389632cf429293df7 doc_id: 47120 cord_uid: r7g9toun In this work, we have proposed a simple overlapping community detection algorithm based on a distributed neighbourhood threshold method (DNTM). DNTM uses pre-partitioned disjoint communities and then analyzes the neighbourhood distribution of boundary nodes in disjoint communities to detect overlapping communities. It is a form of seed-based global method since boundary nodes are considered as seeds and become the starting point for detecting overlapping communities. Threshold value for each boundary node is used as minimum influence by the neighbours of a node in order to determine its belongingness to any community. The effectiveness of the DNTM algorithm has been demonstrated by testing on fifteen real-world datasets and compared with seven overlapping community detection algorithms. DNTM outperforms comparable algorithms with 10 out of 15 datasets and gives comparable results for the remaining 5 datasets in terms of the extended modularity [Formula: see text] measure. Experiments with various disjoint algorithms on 15 datasets reveal that DNTM with tolerance community detection (TCD) as a preprocessing algorithm gives the best result. There are a plethora of methods for detecting overlapping communities in social networks for both synthetic and real-world datasets starting from [19] . Classical strategies include: local expansion of seed nodes [20, 22] , label propagation [7, 13, 33] , clique-based [26] and ensemble-based methods [3, 4] to name a few. In this paper, we propose a new method based on detecting overlapping communities by i) utilizing disjoint communities, and ii) analyzing the neighbourhood distribution of boundary nodes in disjoint communities to detect overlapping clusters. Our method is akin to the more recent class of ensemble methods [3] that uses disjoint methods as a starting point for development of overlapping method. In this paper, we propose a distributed neighbourhood threshold method (DNTM) which depends on the neighbourhood distribution of boundary nodes in disjoint communities. The threshold for each boundary node is used as minimum neighbour influence for a node to belong in any community. DNTM can be considered as global method since we are not performing any local expansion on a set of initial seed nodes for generating overlapping clusters. Instead, we are using boundary nodes and exploring the clusters external to the home clusters of boundary nodes to generate overlapping clusters. It is also a form of seed-based method since boundary nodes are considered as seeds and become the starting point for detecting overlapping clusters. There is only a user-defined maximum threshold (tolerance) criteria to form a neighbourbood. Four disjoint methods have been considered in this work with the primary method based on a tolerance community detection (TCD) [15] . The other partitioning methods include: Louvain [1] , Girvan-Newman [10] and Greedy Modularity [5] . Typical metrics such as Overlapping Normalized Mutual Information (ONMI), Precision, Recall, or F-measure require ground-truth communities. However, ground-truth communities are readily available for large real networks. In their absence, computer generated benchmark networks with built-in ground-truth communities, called synthetic networks such as LFR [19] must be used, to first generate the ground-truth communities. In this paper, DNTM uses an extended modularity Q ov measure introduced by Nicosia et al. [24] as a performance metric. The effectiveness of the DNTM algorithm has been demonstrated by testing on fifteen real-world datasets and compared with seven overlapping community detection algorithms. The contribution of this paper is a simple algorithm which outperforms comparable algorithms with 10 out of 15 datasets and gives comparable results for the remaining 5 datasets in terms of extended modularity Q ov measure. Another noteworthy feature of DNTM is that no optimization strategy such as satisfying some fitness function criteria has been used. Experiments with various partitioning methods on 15 datasets reveal that: TCD gives the best result with 7 datasets, Greedy Modularity method gives the best result with 4 datasets and both Louvain and Girvan-Newman methods with 4 datasets. Our paper is organized as follows: In Sect. 2, we briefly review some representative overlapping community detection algorithms. In Sect. 3, we give a brief overview of definitions and cluster quality measure used in this paper. In Sect. 4, we give details of the proposed DNTM algorithm and its complexity. In Sect. 5, we present experimental results and analysis. Lastly, we give concluding remarks in Sect. 6. In this section, we briefly review some representative algorithms in terms of general strategies used by these algorithms. The general strategy is to start with a set of initial nodes as seeds and then expand to communities based on a fitness function criteria. OSLOM [20] : Introduced in 2011 by Lancichinetti et al., this method was the first that detected communities based on their statistical significance that takes into account different types of graphs, edge direction, edge weights, overlapping communities, network hierarchy and to recognize the absence of community structure and/or the presence of randomness in graphs. It is based on a local expansion and optimization strategy where community expansion is performed by comparing the statistical significance of clusters defined with respect to a global null model (which is the configuration model). LEMON [22] : This algorithm proposed in 2018 by Li et al., is based on the concepts of seed sets, local spectral diffusion, and local spectra. Here, a subspace around the initial seed sets called local spectra is explored using a short random walk also known as local spectral diffusion. Local spectra avoids computation burden by replacing a large number of singular vectors with short random walks. The running time of LEMON scales with the size of the community rather than that of the entire graph and has been tested on large networks. The general strategy is to label every node with a unique value and replace the node's label value with that of its most commonly detected neighbour. Once this process terminates, the nodes having the same label form a community. COPRA [13] : Introduced in 2010, this method extends the label propagation algorithm(LPA) method by Raghavan et al. [27] to detect overlapping communities with a novel termination condition. This method is dependent on parameters such as node belonging coefficient and maximum number of communities a node can belong to, and can handle weighted and bipartite graphs. COPRA usually produces results that are better (in terms of modularity) for large networks. SLPA [33] : This algorithm is based on speaker-listener mechanism to transfer the information known as labels between the nodes. Each node in this method maintains a list of labels and a randomly selected label from this list is propagated further to the node under consideration presently for detecting communities. [7, 8] : Label propagation algorithm is applied at the core of DEMON method to merge the locally generated clusters using merging function to obtain overlapping communities. The general strategy here is to leverage disjoint clusters produced by various disjoint community detection algorithms to discover the overlapping communities. MEDOC [4] : Introduced in 2016 by Chakraborty et al., this is the first ensemble based method for discovering overlapping communities by using metacommunities created from combining various similar clusters produced by disjoint communities detection methods. Further an association matrix which records the probability of a vertex belonging to a meta-community is utilized to generate both non-overlapping and overlapping communities. EnCoD [3] : This method uses various disjoint community detection algorithms to generate disjoint clusters and further utilize the good qualities of these clusters to create an ensemble solution. This algorithm uses node membership as a feature and similarity of node pairs to form a network. CPM [26] : Introduce by Gergely Palla et al. in 2005, this classical algorithm is the first method to detect overlapping communities based on clique-percolation technique. NECTAR [6] : It is a node-centric overlapping community detection algorithm in which the best communities for a given node are found using objective function and further this node is added to these communities to obtain the overlapping communities. In this method, Louvain's local search heuristic approach is generalized to discover overlapping communities. This algorithm tries to maximize the dynamically chosen objective function (i.e. WOCC and Q E ) by testing every possible existence of each node in it's neighbouring cluster in order to generate overlapping communities. All the clusters with a maximum value of objective function are considered to obtain the overlapping communities. IEDC [14] : This algorithm provides an integrated framework for discovering both overlapping and non-overlapping communities. It uses a node-based criteria with a probabilistic model. It includes computation of internal associations (non-overlapping communities), computation of external associations (overlapping communities) using interaction matrix and a community propagation probability of its neighbours. Here, we give a brief overview of definitions and cluster quality measure used in this paper. A graph G is defined as a pair of (V, E) where V is a set consisting all the nodes and E is set consisting all the edges E ⊆ V × V . Undirected graphs are such graphs in which if an edge (x, y) ∈ E then edge (y, x) must also be in E. The degree of a node v is defined as the number of edges containing v. Two nodes are adjacent if they share a common edge. The path length of P is measured as n − 1 where n is the total number of nodes in path P . It is also measured as the number edge(s) in that path. The path with minimum length (or number of edge(s)) from a source node s to a destination node d is called the shortest path sp from s to d. Neighbourhood of a Node: The neighbourhood of a node x for a graph G = (V, E) is defined as: ε is a user-defined positive real threshold value, sp is the shortest path from x to y and |sp| is the number of edge(s) in sp. A breadth first search is used for traversing the graph in order to find the neighbourhood of any given node. In Fig. 1 , the neighbourhood cluster(s) for the green node belonging to cluster C 1 are: clusters C 2 and C 3 . Note, for the green node, cluster C 1 is considered as the home cluster. Distributed Neighbourhood Threshold: Equaion 4 defines this threshold as the ratio of total number of the neighbours of a given node v over the total number of neighbourhood clusters of v plus the home cluster of v. Overlapping Candidate Node: Let v ∈ C j , then v is a candidate overlapping node if it satisfies the following equation: Overlapping Node: Node v is a overlapping node if for any C i ∈ NC(v) it satisfies the following equation: Example 1. In Fig. 1 , the green node in cluster C 1 is an overlapping candidate node since it has neighbours in clusters C 2 and C 3 . All nodes that have neighbours outside their home clusters are considered as overlapping candidate nodes. Using Eq. 4, |N r (green node)| = 8 and |NC(green node)| = 2, hence D t (v) = 2. In other words, D t (v) is considered as the minimum threshold value for a node v to be classified as overlapping node. As shown in Fig. 1 green node shares 3 edges with C 3 which also means |N r (green node)| in C 3 is 3. Since cluster C 3 includes neighbours of green node and D t (green node) meets the threshold requirement, the green node will be shared with C 3 as shown in Fig. 2 . Cluster Quality Measure: Extended Modularity: In this work we have used the extended modularity Q ov measure introduced by Nicosia in [24, 25] given in Eq. 7 where V is the set of nodes, |V | represents the number of nodes, C represents the set of overlapping cluster, m is the total number of edges and A i,j is the adjacency matrix for the graph. We have chosen to use this measure since it does not require the ground-truth to measure the quality of the generated clusters. Generally, good quality overlapping clusters have higher Q ov value. The value of Q ov will be 0 when only one cluster is obtained with all the nodes in it. Details about various coefficients in Eq. 7 can also be found in [25] . In overlapping communities, each node can belong to multiple communities but with different strengths of belonging. An array of such belonging factor [α i,1 , α i,1 , α i,1 , . ......α i,|C| ] is calculated and allotted to each node i in the graph G. The strength of node i belonging to community c is depicted by coefficient α i,c . Since the belonging coefficient for each node is already defined, it is also possible to define the belonging coefficient to each community for edges incoming to or outgoing from a node. Belonging coefficient of edge l = (i, j) with source node i and target node j to community c is represented by function β l,c . Further, the belonging coefficient for link l(i, j) pointing to a node going into the community c is represented by β in l(i,j),c and given by Eq. 8 similarly the belonging coefficient for link l(i, j) pointing to a node going out of the community c is obtained by using Eq. 9 and is represented by β out l(i,j),c . Extended Modularity measures for overlapping cluster depends on F (α i,c , α j,c ) which is defined in the Eq. 10 where f (α i,c ) is a simple linear scaling function given in Eq. 11 . The value of p is set to 30 in [25] . Generally, good quality overlapping clusters have higher Q ov value. The value of Q ov will be 0 when only one cluster is obtained with all the nodes in it. Datasets: Various sized real-world datasets were used in this study: Karate [34] , Dolphin [23] , Lesmis [16] , Football [10] , Polbooks [17] , Jazz [11] , Power grid [31] , Durgnet [32] , Highschool [18] , Netscience [29] , C.elegans [9] , Bible-names [18] , Protein [18] , Internet-Route [21] and PGP [2] . In Fig. 3 , the flow of the DNTM algorithm is given where DNTM takes crisp partitioned clusters as input irrespective of the algorithm used. We first generate non-overlapping clusters and use these clusters to examine all such nodes which have neighbours in other clusters to find overlapping nodes. Once an overlapping node is found, we update the respective clusters by including this overlapping node to obtain the resultant overlapping clusters. The main steps of DNTM algorithm are as follows: i) generate nonoverlapping clusters, ii) find candidate overlapping nodes using Eq. 5, iii) calculate distributed neighbourhood threshold using Eq. 4, iv) filter overlapping return Lo nodes using Eq. 6, and v) update the clusters with overlapping nodes to obtain the resultant overlapping clusters. Note, DNTM takes crisp partitioned clusters as input, irrespective of the algorithm used (see Fig. 6 and 7) . Algorithm 1 includes the following data structures: list of overlapping clusters L o is used to store generated overlapping clusters, Node-Cluster Dictionary NC dic to store cluster id of each node, Cluster-Node Dictionary L dic to store nodes in each cluster, Neighbour Node-Cluster Dictionary N r C dic to store cluster id of neighbourhood nodes, Overlapping-Candidate-Node Dictionary C o N dic to store overlapping candidate nodes and its neighbours N r from neighbourhood cluster NC, Node-Neighbour Dictionary N r N dic to store node and its neighbours. In DNTM algorithm for a graph G (V, E) To examine the performance of DNTM, 15 real world data-sets were used and compared with the following overlapping communities detection algorithms: CPM [26] , OSLOM [20] , COPRA [13] , SLPA [33] , Node Perception [30] , DEMON [7, 8] and CONGO [12] with h = 2 and h = 3. Except for OSLOM and COPRA, all other algorithms were taken from CDlib [28] Python package. Table 1 gives the results of our experiments where DNTM (TCD) is the proposed algorithm which uses TCD method to generate non-overlapping clusters with ε = 2 with source code made available by the authors. TCD method relies on a tolerance relation where a tolerance class represents members of the same community and uses an objective function based on two well-known quality functions, modularity and coverage. Since most of the algorithms have a non-unique output for Q ov for each execution, hence these algorithms were executed 10 times and the average of the 5 best scores for Q ov was used in our reporting shown in Table 1 and bold values represent the best score for each dataset. In additon, the number of clusters generated by majority of the algorithms is used as input for those algorithms that require number of clusters as input. Based on the results in Table 1 and Fig. 4 and Fig. 5 , we can observe that the proposed DNTM algorithm outperforms comparable algorithms with 10 out of 15 datasets and gives comparable results for the remaining 5 datasets. The quality of generated overlapping clusters from DNTM is greatly affected by the number of disjoint clusters passed as input, generated by the initial disjoint algorithm. From Eq. 4 it can be observed that D t has an inverse relation with number of communities. D t is highly sensitive and dependent on the number of communities. As a result, increasing number of communities, will decrease the value of D t , which will in turn affect the overlap between the communities. In our experiments, the number of communities, range from 2 to 109. We also observed that in general, for the datasets, where the number of communities is greater than 4, DNTM achieves the best result. Also, DNTM depends on the boundary nodes in the disjoint clusters as well their internal and external links (edges). If the number of external links of a node is extremely less as compared to its internal links, this node is less likely to qualify the condition in Eq. 6 to be classified as an overlapping node. Most algorithms use an internal objective function to obtain good quality clusters which entails parameter selection. DNTM does not have this limitation as it does not use an internal objective function and the major computation is done for overlapping candidate nodes which is comparatively less than |V |. Hence DNTM is computationally efficient. Figure 6 and 7 show overlapping clusters generated with the proposed DNTM algorithm where the input (disjoint clusters) was obtained using Louvain [1] and Girvan-Newman [10] methods on the Karate dataset. In Fig. 6 , three overlapping nodes [3, 14, 20] were detected, whereas using TCD as input method, five overlapping nodes [9, 10, 20, 29, 31] were detected. In Fig. 7 , 12 overlapping nodes were detected including a hierarchical cluster where nodes [28, 29] are present in 3 clusters. In this paper, we have proposed a new overlapping community detection algorithm (DNTM) based on: i) utilizing disjoint communities produced by community detection algorithm(s), and ii) analyzing the neighbourhood distribution of boundary nodes of discovered disjoint communities to detect overlapping clusters. The effectiveness of the DNTM algorithm has been demonstrated by testing on fifteen real-world datasets and compared with seven overlapping community detection algorithms in terms of an extended modularity Q ov measure. Three other well-known disjoint methods have been considered in this work with the primary method based on a tolerance community detection. DNTM outperforms comparable algorithms with 10 out of 15 datasets and gives comparable results for the remaining 5 datasets. Experiments with various disjoint algorithms on 15 datasets reveal that DNTM with TCD as a preprocessing algorithm gives the best result. Another noteworthy feature of DNTM is that no any optimization strategy has been used during or after the clustering process. Future work with DNTM will include: i) considering an ensemble mechanism to use various disjoint methods to select the best disjoint clusters in terms of quality and number of clusters as a preprocessing step to the DNTM algorithm, ii) defining an internal objective function to obtain good quality clusters, iii) testing and analyzing the behavior of DNTM on synthetic networks and iv) implementing a parallel DNTM to be able to handle datasets with larger nodes and communities. Fast unfolding of communities in large networks Models of social networks based on social distance attachment Ensemble-based overlapping community detection using disjoint community structures Ensemble-based algorithms to detect disjoint and overlapping communities in networks Finding community structure in very large networks Node-centric detection of overlapping communities in social networks DEMON: a local-first discovery method for overlapping communities Uncovering hierarchical and overlapping communities with a local-first approach Community detection in complex networks using extremal optimization Community structure in social and biological networks Community structure in jazz A fast algorithm to find overlapping communities in networks Finding overlapping communities in networks by label propagation IEDC: an integrated approach for overlapping and non-overlapping community detection. Knowl.-Based Syst Tolerance methods in graph clustering: application to community detection in social networks The Stanford GraphBase: A Platform for Combinatorial Computing Books about us politics KONECT -the Koblenz network collection Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities Finding statistically significant communities in networks Graph evolution: densification and shrinking diameters Local spectral clustering for overlapping community detection Identifying the role that animals play in their social networks Extending the definition of modularity to directed graphs with overlapping communities Extending modularity definition for directed graphs with overlapping communities Uncovering the overlapping community structure of complex networks in nature and society Near linear time algorithm to detect community structures in large-scale networks CDLIB: a Python library to extract, compare and evaluate communities from complex networks The network data repository with interactive graph analytics and visualization Use of local group information to identify communities in networks Collective dynamics of 'small-world' networks Social networks of drug users in high-risk sites: finding the connections SLPA: uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process An information flow model for conflict and fission in small groups