key: cord-028660-hi35xvni
authors: Chen, Jie; Li, Yang; Zhao, Shu; Wang, Xiangyang; Zhang, Yanping
title: Three-Way Decisions Community Detection Model Based on Weighted Graph Representation
date: 2020-06-10
journal: Rough Sets
DOI: 10.1007/978-3-030-52705-1_11
sha: 
doc_id: 28660
cord_uid: hi35xvni

Community detection is of great significance to the study of complex networks. Community detection algorithm based on three-way decisions (TWD) forms a multi-layered community structure by hierarchical clustering and then selects a suitable layer as the community detection result. However, this layer usually contains overlapping communities. Based on the idea of TWD, we define the overlapping part in the communities as boundary region (BND), and the non-overlapping part as positive region (POS) or negative region (NEG). How to correctly divide the nodes in the BND into the POS or NEG is a challenge for three-way decisions community detection. The general methods to deal with boundary region are modularity increment and similarity calculation. But these methods only take advantage of the local features of the network, without considering the information of the divided communities and the similarity of the global structure. Therefore, in this paper, we propose a method for three-way decisions community detection based on weighted graph representation (WGR-TWD). The weighted graph representation (WGR) can well transform the global structure into vector representation and make the two nodes in the boundary region more similar by using frequency of appearing in the same community as the weight. Firstly, the multi-layered community structure is constructed by hierarchical clustering. The target layer is selected according to the extended modularity value of each layer. Secondly, all nodes are converted into vectors by WGR. Finally, the nodes in the BND are divided into the POS or NEG based on cosine similarity. Experiments on real-world networks demonstrate that WGR-TWD is effective for community detection in networks compared with the state-of-the-art algorithms.

Nowadays, there are all kinds of complex systems with specific functions in the real world such as online social systems, medical systems and computer systems. These systems can be abstracted into networks with complex internal structures, called complex networks. The research of complex networks has received more and more attention due to the development of the Internet. Community structure [6, 23] is a common feature of complex networks, which means that a network consists of several communities, the connections between communities are sparse and the connections within a community are dense [10] . Mining the community structure in the network is of great significance to understand the network structure, analyze the network characteristics and predict the network behavior. Thus, community detection has become one of the most important issues in the study of complex networks.

In recent years, a great deal of research is devoted to community detection in networks. Most community detection methods are used to identify nonoverlapping communities (i.e., a node belongs to only one community). The main approaches include graph partitioning and clustering [9, 10, 13] , modularity maximization [1, 20] , information theory [12, 25] and non-negative matrix factorization [16, 27] . The Kernighan-Lin algorithm [13] is a heuristic graph partitioning method that detects communities by optimizing the edges within and between communities. GN algorithm [10] is a representative hierarchical clustering method, which can find communities by removing the links between communities. Blondel et al. proposed the Louvain algorithm [1] , which is a well-known optimization method based on modularity. It is used to handle large-scale networks due to low time complexity. Liu et al. [16] put forward a community detection method by using non-negative matrix factorization. Zhao et al. [33] introduced the idea of granular computing into the community detection of network and proposed a community detection method based on clustering granulation.

The existing non-overlapping community detection algorithms have made great achievements, but these algorithms only use the traditional two-way decisions [29, 30] method (the acceptance or rejection decision) to deal with the overlapping nodes between communities. Compared with the two-way decisions method, the three-way decisions theory (TWD) [28] adds a non-commitment decision. The main idea of TWD is to divide an entity set into three disjoint regions, which are denoted as positive region (POS), negative region (NEG) and boundary region (BND) respectively. The POS adopts the acceptance decision, the NEG adopts the rejection decision, and the BND adopts the noncommitment decision (i.e., entities that cannot make a decision based on the current information are placed in the BND). For entities in the BND, we can further mine more information to realize their final partition. The introduction of non-commitment decision can effectively solve the decision-making errors caused by insufficient information, which is more flexible and closer to the actual situation.

How to deal with the boundary region has become a key issue for threeway decisions community detection. At present, the commonly used methods to process the boundary region include modularity increment [20] and similarity calculation [2, 8] . But these methods only take advantage of the local features of the network, without considering the information of the divided communities and the similarity of the global structure. Therefore, how to tackle the boundary region effectively is a challenge.

In this paper, we propose a three-way decisions community detection model based on weighted graph representation (WGR-TWD). The graph representation can well transform the global structure of the network into vector representation and make the two nodes in the boundary region that appear in the same community more similar by using the weight. Firstly, the multi-layered community structure is constructed by hierarchical clustering. The target layer is selected according to the extended modularity value of each layer. Secondly, all nodes are converted into vectors by weighted graph representation. Finally, nodes in the boundary region are divided into positive or negative region based on cosine similarity. Thus, non-overlapping community detection is realized.

The key contributions of this paper can be summarized as follows:

(1) We use weighted graph representation to obtain the global structure information of the network to guide the processing of the boundary region, which gets a better three-way decisions community detection method. (2) Based on the knowledge of the communities in the target layer, we make the two nodes connected by a direct edge in the boundary region more similar by using frequency of appearing in the same community as the weight. Then the walk sequences are constructed according to the weight of the edge. Finally, the Skip-Gram model is used to obtain the vector representation of nodes. Therefore, the weighted graph representation method is realized. The rest of this paper is organized as follows. We introduce related work in Sect. 2. We give the detailed description of our algorithm in Sect. 3. Experiments on real-world networks are reported in Sect. 4. Finally, we conclude the paper in Sect. 5.

Hierarchical clustering method has been widely used in community detection due to the hierarchical nature of the network structure. This approach can be divided into two forms: divisive method and agglomerative method. The divisive method removes the link with the lowest similarity index repeatedly, while the agglomerative method merges the pair of clusters with the highest similarity index repeatedly. These two methods eventually form a dendrogram, and communities are detected by cutting the tree.

The research of community detection based on hierarchical clustering has received widespread attention from scholars. Girvan and Newman proposed the GN algorithm [10] , which is a typical divisive method. Clauset et al. [5] proposed a community detection algorithm based on data analysis, which is a representative agglomerative method. Fortunato et al. [9] presented an algorithm to find community structures based on node information centrality. Chen et al. proposed the LCV algorithm [4] which detects communities by finding local central nodes. Zhang et al. [32] introduced a hierarchical community detection algorithm based on partial matrix convergence using random walks.

Combining hierarchical clustering with granular computing, we introduce an agglomerative method based on variable granularity to build a dendrogram. Given an undirected and unweighted graph G = (V, E), where V is the set of nodes, E denotes the set of edges. The set of neighbor nodes to a node v i is denoted as

The formation process of the initial granules is as follows. First, we calculate the local importance of each node in the network. The local importance of a node v i is defined as follows:

where

is the degree of node v i , and |·| denotes the number of elements in a set. Second, all important nodes are found according to the local importance of nodes. The node v i is an important node if I (v i ) > 0. Finally, for any important node, an initial granule is composed of all neighbor nodes of the important node and the important node itself. After all the initial granules are obtained, the hierarchical clustering method based on variable granularity is described. The clustering coefficient between the two granules is defined as 

where med {} is a median function. The clustering process is as follows. Firstly, for ∀C m i , C m j ∈ H m , the clustering coefficient between them is calculated. Then the clustering threshold λ m of the current layer is calculated. And the maximum clustering coefficient is found, which is denoted as f C m α , C m β .

If f C m α , C m β λ m , the two granules C m α and C m β are merged to form a new granule and the new granule is added to H m+1 . Otherwise, all the granules in H m are added to H m+1 and H m is set to empty. For each layer, repeat above clustering process until all nodes in the network are in a granule. Therefore, a dendrogram is built.

Traditional network representation usually uses high-dimensional sparse vectors, which takes more running time and computational space in statistical learning. Network representation learning (NRL) is proposed to address the problem. NRL aims to learn the low-dimensional potential representations of nodes in networks. The learned representations can be used as features of the graph for various graph-based tasks, such as classification, clustering, link prediction, community detection, and visualization.

DeepWalk [24] is the first influential NRL model in recent years, which adopts the approach of natural language processing by using the Skip-Gram model [18, 19] to learn the representation of nodes in the network. The goal of Skip-Gram is to maximize the probability of co-occurrence among the words that appear within a window. DeepWalk first generates a large number of random walk sequences by sampling from the network. These walk sequences can be analogized to the sentences of the article, and the nodes are analogized to the words in the sentence. Then Skip-Gram can be applied to these walk sequences to acquire network embedding. DeepWalk can express the connection of the network well, and has high efficiency when the network is large.

To effectively deal with overlapping communities in the target layer, a weighted graph representation approach is proposed. At first, a weighted graph is constructed according to the community structure of the target layer. The weights of edges in an unweighted graph are defined as follows

where σ ij is the number of communities in which nodes v i and v j appear in a community at the same time, N c is the total number of communities in the target layer. After that, an improved DeepWalk (IDW) model is used to acquire the vector representation of all nodes in the graph. Unlike DeepWalk, the IDW model constructs the walk sequences according to the weight of the edge. The greater the weight, the higher the walk probability. Assume that the current walk node is v i , if v j ∈ N (v i ), then the walk probability from node v i to node v j is

After obtaining all the walk sequences, the Skip-Gram model is used to learn the vector representation of nodes from the walk sequences. The objective function of IDM is as follows

where R (v i ) is the vector representation of node v i , ω is the window size which is maximum distance between the current and predicted node within a walk sequence. Thus, the vector representation of all nodes in the network is obtained.

We will present the proposed WGR-TWD algorithm in this section. Figure 1 shows the overall framework of the proposed algorithm. Our algorithm consists of two parts: the construction of multi-layered community structure and boundary region processing. The first part, we employ the hierarchical clustering method based on variable granularity to construct a multi-layered community structure according to Sect. 2.1. Some overlapping communities exist in the multi-layered community structure because of clustering mechanism, so we use the extended modularity (EQ) [26] to measure the partition quality of each layer. It is defined as follows where m is the number of edges in the network, C i represents a community, O u is the number of communities that node u belongs to, A uv is the element of adjacent matrix, and d u is the degree of node u. A larger EQ value means better performance for overlapping community division. Thus, we select the layer corresponding to the largest EQ value as the target layer. The second part introduces the method of dealing with overlapping communities in the target layer. Since there are overlapping communities in the target layer, we need to further divide the target layer to achieve non-overlapping community detection. Therefore, the three-way decisions theory (TWD) is introduced to handle overlapping communities. Based on the idea of TWD, we define the overlapping part in the communities as boundary region (BND), and the non-overlapping part as positive region (POS) or negative region (NEG). And our goal is to process nodes in the BND. First of all, we adopt the weighted graph representation method to learn the vector representation of all nodes in the network. After that, the nodes in the BND are divided into the POS or NEG by using cosine similarity. Suppose the vector of node u is u = (x 1 , x 2 , ..., x n ) , node v is v = (y 1 , y 2 , ..., y n ) , then the cosine similarity is defined as

For arbitrary node v i in the BND, find out all communities containing node v i in the target layer, calculate the average value of cosine similarity between node v i and non-overlapping nodes in each community as the similarity between node v i and this community, then join node v i into the community corresponding to the maximum similarity and update the community structure of the target layer. Repeat the above operation until all nodes in the BND are processed. The WGR-TWD algorithm is described in Algorithm 1.

We test the performance of our method on eight real-world datasets in which each dataset is described as follows, and the main information of those datasets are shown in Table 1 .

Zachary's karate club [31] . This is a social network of friendships between 34 members of a karate club at a US university in the 1970s.

Dolphin social network [17] . It is an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand.

Books about US politics [22] . A network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. Edges between books represent frequent co-purchasing of books by the same buyers.

American college football [10] . A network of American football games between Division IA colleges in 2000.

Email communication network [21] . It is a complex network which indicates the email communications of a university. The network was composed by Alexandre Arenas.

Facebook [15] . The network was collected from survey participants using Facebook app.

Geom [11] . The authors collaboration network in computational geometry. Collaboration [14] . The network is from the e-print arXiv and covers scientific collaborations between authors papers submitted to High Energy Physics Theory category.

In this paper, two representative algorithms are chosen to compare with the proposed WGR-TWD, as shown below:

-Modularity increment (MI) [3] . A hierarchical clustering method based on variable granularity, and the overlapping nodes between communities are divided according to modularity optimization. -DeepWalk [24] . It is a network representation learning method. This approach is used to handle the overlapping communities in the target layer. 

We employ two widely used criteria to evaluate the performance of community detection algorithms. The first index is modularity (Q) [5] , which is often used when the real community structure is not known. Q is defined as follows

where m is the number of edges in the network, A is the adjacent matrix, d i is the degree of node i, c i represents the community to which node i belongs, and δ (c i , c j ) = 1 when c i = c j , else δ (c i , c j ) = 0. The higher the modularity value, the better the result of community detection.

Another index is normalized mutual information (NMI) [7] , which is defined as follows

where C A (C B ) denotes the number of communities in partition A (B), C ij is the number of nodes shared by community i in partition A and by community j in partition B, C i. (C .j ) represents the sum of elements of matrix C in row i (column j), and n is the number of nodes in the network. A higher value of NMI indicates the detected community structure is closer to the real community structure.

In the networks with known real partition (the first four small networks), we use two indicators (Q and NMI) to evaluate our algorithm. Table 2 presents the community detection results of the proposed algorithm and the baseline algorithms on networks with known real partition. We can see that our method To further verify the effectiveness of the proposed algorithm, the MI method is used to deal with each layer in the multi-layered community structure. And we select the layer corresponding to the maximum Q value as the target layer. The experimental results are shown in Table 3 . Compared with Table 2, Table 3 can obtain higher Q value. Combined with Tables 2 and 3, our method can get better community detection results compared with the baseline methods.

We also conducted experiments on four large networks. On these networks, the real partition is unknown. Therefore, we only use modularity to evaluate the performance of different methods. Table 4 shows the community detection results of the proposed method and baseline methods. On the first three networks, we can see that our method obtains better results compared with the two baseline methods. On the Collaboration dataset, the MI method achieves the best performance which is a little bit higher than our method. The main reason is that the Collaboration network is very sparse which leads to poor vector representation of the nodes. In conclusion, the proposed method effectively addresses the problem of non-overlapping community detection in networks.

In this paper, we propose a method for three-way decisions community detection based on weighted graph representation. The target layer in multi-layered community structure is selected according to the extended modularity value of each layer. For the overlapping communities in the target layer, the weighted graph representation can well transform the global structure into vector representation and make the two nodes in the boundary region more similar by using frequency of appearing in the same community as the weight. Finally, the nodes in the boundary region are divided according to cosine similarity. Experiments on real-world networks demonstrate that the proposed method is effective for community detection in networks.

Fast unfolding of communities in large networks

Three-way dicision community detection algorithm based on local group information

VGHC: a variable granularity hierarchical clustering for community detection

A method for local community detection by finding maximaldegree nodes

Finding community structure in very large networks

Detecting community structure via the maximal sub-graphs and belonging degrees in complex networks

Comparing community structure identification

Three-way decision based on non-overlapping community division

Method to find community structures based on information centrality

Community structure in social and biological networks

Complex network clustering by multiobjective discrete particle swarm optimization based on decomposition

Information limits for recovering a hidden community

An efficient heuristic procedure for partitioning graphs

Graph evolution: densification and shrinking diameters

Learning to discover social circles in ego networks

Semi-supervised community detection based on non-negative matrix factorization with node popularity

The emergent properties of a dolphin social network

Efficient estimation of word representations in vector space

Distributed representations of words and phrases and their compositionality

Fast algorithm for detecting community structure in networks

Finding community structure in networks using the eigenvectors of matrices

Modularity and community structure in networks

Finding and evaluating community structure in networks

Deepwalk: online learning of social representations

An information-theoretic framework for resolving community structure in complex networks

Detect overlapping and hierarchical community structure in networks

Nonnegative matrix factorization with mixed hypergraph regularization for community detection

Three-way decision: an interpretation of rules in rough set theory

Three-way decisions with probabilistic rough sets

Two semantic issues in a probabilistic rough set model

An information flow model for conflict and fission in small groups

Hierarchical community detection based on partial matrix convergence using random walks

Community detection algorithm based on clustering granulation

Acknowledgments. This work is supported by the National Natural Science Foundation of China (Grants Numbers 61876001) and the Major Program of the National Social Science Foundation of China (Grant No. 18ZDA032).