Node Conductance: A Scalable Node Centrality Measure on Big Networks
Tianshu Lyu, Fei Sun, Yan Zhang
Advances in Knowledge Discovery and Data Mining, 2020-04-17. DOI: 10.1007/978-3-030-47436-2_40

Abstract. Node centralities such as Degree and Betweenness help detect influential nodes from a local or global view. Existing global centrality measures suffer from high computational complexity and unrealistic assumptions, limiting their use in real-world applications. In this paper, we propose a new centrality measure, Node Conductance, to effectively detect spanning structural hole nodes and predict the formation of new edges. Node Conductance is the sum of the probability that node i is revisited at the r-th step of a random walk, where r is an integer between 1 and infinity. Moreover, with the help of node embedding techniques, Node Conductance can be calculated approximately on big networks both effectively and efficiently. Thorough experiments present the differences between existing centralities and Node Conductance, its outstanding ability to detect influential nodes on both static and dynamic networks, and its superior efficiency compared with other global centralities.

Electronic supplementary material: The online version of this chapter (10.1007/978-3-030-47436-2_40) contains supplementary material, which is available to authorized users.

Social network analysis is used widely in the social and behavioral sciences, as well as in economics and marketing. Centrality is an old but essential concept in network analysis. Central nodes mined by centrality measures are more likely to help disseminate information, stop epidemics, and so on [19, 21].

Local and global centralities are classified according to the scope of node influence being considered. Local centralities, for instance Degree and Clustering Coefficient, are simple yet effective metrics of ego-network influence. On the contrary, tasks such as information diffusion and influence maximization put more attention on a node's spreading capability, which requires centrality measurements at long range. Betweenness and Closeness capture structural characterization from a global view. As these measures operate on the entire network, they are informative and have been used extensively in the analysis of social-interaction networks [11]. However, exact computation of these centralities is infeasible on large networks, and their assumption that information travels along the ideal route (shortest path or maximum flow) does not hold at most times. Random walk centrality [19] counts random walks instead of the ideal routes. Nevertheless, its computational complexity is still too high. Subgraph centrality [7], the measure most similar to our work, is defined as the sum of closed walks of different lengths starting and ending at the vertex under consideration; it characterizes nodes according to their participation in subgraphs. As subgraph centrality is obtained mathematically from the spectrum of the adjacency matrix, it also runs into huge computational complexity.

Advances in NLP Research. Neural language models have attracted great attention for their effective and efficient performance at extracting similarities between words. Skip-gram with negative sampling (SGNS) [16] has been proved to be, in fact, a factorization of the co-occurrence matrix [12]. Many works concern the different usages and meanings of the two vectors in SGNS.
The authors of [13] seek to combine the input and output vectors for better representations. Similarly, in the area of Information Retrieval, input and output embeddings are considered to carry different kinds of information [18]: input vectors are more reflective of function (type), while output vectors are more reflective of topical similarity. In our work, we further analyze the relationship between the learned input and output vectors and the network topology, bringing more insight to network embedding techniques. Moreover, we bridge the gap between node embedding and the proposed centrality, Node Conductance.

Conductance measures how hard it is to leave a set of nodes. We name the new metric Node Conductance because it measures how hard it is to leave a certain node. We consider an undirected graph G; for simplicity, we assume that G is unweighted, although all of our results apply equally to weighted graphs. A random walk on G defines an associated Markov chain, and we define the Node Conductance of a vertex i, NC_∞(i), as the sum, over all integers s from 1 to ∞, of the probability that the walk revisits i at the s-th step. The next section demonstrates that the number of times two nodes co-occur in a random walk is determined by the sub-network the two nodes share. Node Conductance concerns the co-occurrence of the target node with itself, and is thus able to measure how dense the connections around the target node are.

The graph G is assumed to be connected and to have no periodically revisited nodes (ruling out, e.g., bipartite graphs). The adjacency matrix A is symmetric, with entries equal to 1 if there is an edge between two nodes and 0 otherwise. The vector d = A1, where 1 is an n × 1 vector of ones and n is the number of nodes, contains the node degrees, and D = diag(d) is the diagonal degree matrix. Graph G has an associated random walk in which the probability of leaving a node is split uniformly among its edges. For a walk starting at node i, the probability that we find it at j after exactly s steps is given by

$$p^{(s)}_{ij} = \left[(D^{-1}A)^s\right]_{ij}. \tag{1}$$

NC_r denotes the sum of the probabilities that the node is revisited at step s, for s between 1 and r:

$$\mathrm{NC}_r(i) = \sum_{s=1}^{r} \left[(D^{-1}A)^s\right]_{ii}, \tag{2}$$

where the subscript ii selects the entry in the i-th row and i-th column. As r approaches infinity, NC_∞ becomes a global node centrality measure. In order to compute the infinite sum of matrix powers, the s = 0 term is added for convenience:

$$\mathrm{NC}_\infty(i) = \sum_{s=0}^{\infty}\left[(D^{-1}A)^s\right]_{ii} - 1 = \left[(I - D^{-1}A)^{-1}\right]_{ii} - 1 = \left[(D-A)^{-1}D\right]_{ii} - 1. \tag{3}$$

D − A, the Laplacian matrix L of the network, is singular and cannot be inverted directly, so we introduce its pseudo-inverse. The spectral decomposition of L is $L_{ij} = \sum_{k=1}^{N} \lambda_k u_{ik} u_{jk}$, where λ_k and u_k are the eigenvalues and eigenvectors. As the all-ones vector [1, 1, ...] is always an eigenvector with eigenvalue zero, the eigenvalues of the pseudo-inverse L† are defined as follows:

$$\lambda^{\dagger}_k = \begin{cases} 1/\lambda_k & \lambda_k \neq 0 \\ 0 & \lambda_k = 0. \end{cases} \tag{4}$$

NC_∞(i) only depends on the diagonal of L†:

$$\mathrm{NC}_\infty(i) = d_i\,L^{\dagger}_{ii} - 1, \tag{5}$$

where d_i, the i-th entry of d, is the degree of node i.

Although Node Conductance is a global node centrality measure, its value is mostly determined by local topology. As shown in Eq. (2), the entries of (D^{-1}A)^s are in most cases quite small when s is large, which corresponds to the fact that a random walk becomes less and less likely to revisit its starting point as the walk length increases. In the supplementary material, we prove that Node Conductance can be well approximated from local subgraphs. Moreover, as the formal computation of Node Conductance is mainly based on matrix powers and inversion, a fast calculation method is also required; we discuss it in Sect. 4.
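To make the definitions concrete, here is a minimal Python sketch (our own illustration, not the authors' released code; it assumes NumPy and NetworkX and a small, connected, non-bipartite graph) that computes the truncated sum NC_r of Eq. (2) directly and NC_∞ via the Laplacian pseudo-inverse of Eq. (5):

```python
import numpy as np
import networkx as nx

def node_conductance_truncated(G, r=6):
    """NC_r(i) = sum_{s=1}^{r} [(D^{-1}A)^s]_{ii}  (Eq. 2), dense version."""
    A = nx.to_numpy_array(G)
    P = A / A.sum(axis=1, keepdims=True)   # transition matrix D^{-1}A
    Ps = np.eye(len(A))
    nc = np.zeros(len(A))
    for _ in range(r):
        Ps = Ps @ P                        # (D^{-1}A)^s
        nc += np.diag(Ps)                  # return probability at step s
    return nc

def node_conductance_exact(G):
    """NC_inf(i) = d_i * L†_ii - 1  (Eq. 5); G must be connected."""
    A = nx.to_numpy_array(G)
    d = A.sum(axis=1)
    L_pinv = np.linalg.pinv(np.diag(d) - A)  # pseudo-inverse of Laplacian
    return d * np.diag(L_pinv) - 1

G = nx.karate_club_graph()
print(np.argsort(-node_conductance_exact(G))[:5])  # five most central nodes
```

Both functions are meant only for small graphs: the dense matrix power and the pseudo-inverse are exactly the O(V^3) bottlenecks that motivate the approximations discussed next.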
Node Conductance may seem to have a definition very similar to Subgraph Centrality (SC) [7] and PageRank (PR) [20]. In particular, Node Conductance only counts the walks that start and end at the node in question. PR, in contrast, is the stationary distribution of the random walk: the probability that a random walk of infinitely many steps, starting from any node, hits the node under consideration. It can be written as $\mathrm{PR} = D(D - \alpha A)^{-1}\mathbf{1}$, where the agent jumps to any other node with probability α. The difference between PR and Eq. (3) lies in the random walks taken into account. Through the multiplication by the vector 1, the PR value of node i is the sum of the entries in the i-th row of D(D − αA)^{-1}; in Eq. (3), the NC value of node i is the single entry in the i-th row and i-th column. In summary, NC is more about the node neighborhood, while PR takes a global view. This difference makes PageRank a good metric in Information Retrieval but less effective in social network analysis; after all, social behavior has almost nothing to do with global influence.

SC counts the number of subgraphs the node takes part in, which is equivalent to the number of closed walks starting and ending at the target node:

$$\mathrm{SC}(i) = \sum_{s=0}^{\infty} \frac{\left[A^s\right]_{ii}}{s!},$$

where the factorial scaling in the denominator is added to make the value converge, at the cost of interpretability. NC, on the contrary, is easy to follow and converges by definition.

As the calculation of Node Conductance involves matrix multiplication and inversion, it is hard to apply to large networks. Fortunately, the proof in our Supplementary Material indicates that Node Conductance can be approximated from the induced subgraph G_i formed by the k-neighborhood of node i, with an approximation error that decreases at least exponentially with k (see the sketch at the end of this section). Random walks, on which Node Conductance is based, are also an effective sampling strategy for capturing node neighborhoods in recent network embedding studies [10, 21]. Next, we tease out the relationship between node embeddings and network structure, and then introduce the approximation of Node Conductance.

word2vec is highly efficient to train and provides state-of-the-art results on various linguistic tasks [16]. It tries to maximize the dot product between the vectors of frequent word-context pairs and to minimize it for random word-context pairs. Each word has two representations in the model, namely the input vector (word vector w) and the output vector (context vector c). DeepWalk [21] was the first to point out the connection between texts and graphs and to bring the word2vec technique into network embedding. Although DeepWalk and word2vec always treat the input vector w as the final result, the context vector c still plays an important role [18], especially in networks. (1) Syntagmatic: if word i and word j always co-occur in the same region (or two nodes have a strong connection in the network), the value of w_i · c_j is large. (2) Paradigmatic: if word i and word j have quite similar contexts (or two nodes have similar neighbors), the value of w_i · w_j is high. In NLP tasks, the latter relationship enables us to find words with similar meaning and, more importantly, similar part of speech; that is the reason why only input embeddings are preserved in word2vec. We have no such concerns about networks, and moreover, we tend to believe that both relationships indicate close proximity of two nodes. In the following, we analyze the detailed meanings of these two vectors based on the loss function of word2vec.
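Before turning to the SGNS derivation, here is the promised sketch of the k-neighborhood approximation (again our own illustration of the paper's stated claim, reusing node_conductance_exact from the previous snippet):

```python
import networkx as nx

def node_conductance_local(G, i, k=3):
    """Approximate NC_inf(i) on the induced k-neighborhood subgraph G_i.
    Per the supplementary claim, the error decays at least
    exponentially with k."""
    ball = nx.single_source_shortest_path_length(G, i, cutoff=k)
    H = G.subgraph(ball.keys())
    nodes = list(H.nodes())
    return node_conductance_exact(H)[nodes.index(i)]
```

Only the ball of radius k around i enters the computation, so the cost depends on the neighborhood size rather than on the full network.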
SGNS is the technique behind word2vec and DeepWalk, guaranteeing the high performance of these two models. Our discussion of DeepWalk consequently starts from SGNS. Let V_W be the vocabulary set, i the target word, and V_C its set of context words; #(i, j)_r is the number of times j appears in the r-sized window with i being the target word, and #(i)_r = Σ_{j∈V_W} #(i, j)_r is the number of times i appears in the training pairs; w_i and c_i are the input and output vectors of i. The loss function L of SGNS is as follows [12, 14]:

$$\mathcal{L} = \sum_{i \in V_W}\sum_{j \in V_C} \#(i,j)_r \log\sigma(w_i \cdot c_j) + \sum_{i \in V_W} \#(i)_r \cdot k \cdot \mathbb{E}_{neg \sim P}\left[\log\sigma(-w_i \cdot c_{neg})\right], \tag{6}$$

where neg is a word sampled from the distribution P(i) = #(i)/|D|, corresponding to the negative-sampling part, and D is the collection of observed word-context pairs. Note that word2vec uses a smoothed distribution in which all context counts are raised to the power of 0.75, giving frequent words a lower probability of being chosen. This trick addresses word-frequency imbalance (a non-negligible amount of very frequent and very rare words), whereas we found that node degree has no such imbalanced distribution in any of the datasets we test (also reported in Fig. 2 of the DeepWalk paper [21]). We therefore do not use the smoothed version in our experiments.

SGNS aims to optimize the loss function L presented above. The authors of [12] provide the detailed derivation as follows. We define x = w_i · c_j and take the partial derivative of L (Eq. 6) with respect to x:

$$\frac{\partial \mathcal{L}}{\partial x} = \#(i,j)_r\,\sigma(-x) - k \cdot \#(i)_r\,P(j)\,\sigma(x). \tag{7}$$

Comparing the derivative to zero, we derive

$$w_i \cdot c_j = \log\frac{\#(i,j)_r}{\#(i)_r \cdot P(j)} - \log k, \tag{8}$$

where k is the number of negative samples.

Above, we derived the dot product of the input and output vectors of two different words. Now, for a certain node i, we calculate the dot product of its own input and output vectors: $w_i \cdot c_i = \log\frac{\#(i,i)_r}{\#(i)_r \cdot P(i)} - \log k$. Usually, a probability is estimated by the actual number of observations: P(i), the probability of a node being visited in a random walk, is proportional to the node degree. Thus, we have

$$\mathrm{NC}_r(i) \propto \frac{\#(i,i)_r}{\#(i)_r} \propto \exp(w_i \cdot c_i)\cdot \deg(i). \tag{9}$$

In our experiments, the value of exp(w_i · c_i) · deg(i) is used as the relative approximate Node Conductance of node i. The exact Node Conductance value of each node is actually not necessary; retaining the relative ranks is enough to estimate centrality.

The variants of DeepWalk produce similar node embeddings. For example, node2vec is more sensitive to certain local structures [15], and its embeddings have a lower capacity for generalization. We discuss only DeepWalk in this paper for its tight connection to random walks, which brings more interpretability than other embedding algorithms. DeepWalk generates m random walks starting at each node, with walk length l, sliding window size w, and node embedding size d. We set m = 80, l = 40, w = 6, and d = 128. To compute the node embeddings, DeepWalk uses word2vec optimized by SGNS in gensim and preserves the default settings: the embeddings are initialized randomly, the initial learning rate is 0.025 and drops linearly to 0.0001, the number of epochs is 5, and the number of negative samples is 5.

The formal computation of Node Conductance is based on eigendecomposition, which scales as O(V^3), where V is the number of nodes. Using DeepWalk with SGNS, the computational complexity per training instance is O(nd + wd), where n is the number of negative samples, w the window size, and d the embedding dimension. The number of training instances is determined by the random-walk settings and is usually O(V).
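An end-to-end sketch of this DeepWalk-based approximation follows (our own code, not the authors'; it assumes gensim 4, where the input vectors live in model.wv.vectors and the output/context vectors of a negative-sampling model are exposed as model.syn1neg):

```python
import random
import numpy as np
import networkx as nx
from gensim.models import Word2Vec

def deepwalk_corpus(G, num_walks=80, walk_len=40):
    """Uniform random walks from every node, as in DeepWalk."""
    walks, nodes = [], list(G.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_len:
                walk.append(random.choice(list(G.neighbors(walk[-1]))))
            walks.append([str(v) for v in walk])
    return walks

G = nx.karate_club_graph()
model = Word2Vec(deepwalk_corpus(G), vector_size=128, window=6, sg=1,
                 negative=5, ns_exponent=1.0,  # unsmoothed P(i), per the paper
                 min_count=1, epochs=5)

def nc_deepwalk(node):
    idx = model.wv.key_to_index[str(node)]
    w, c = model.wv.vectors[idx], model.syn1neg[idx]  # input / output vectors
    return np.exp(w @ c) * G.degree(node)             # Eq. 9, relative value

ranking = sorted(G.nodes(), key=nc_deepwalk, reverse=True)
```

Note that ns_exponent=1.0 switches off gensim's default 0.75-smoothed negative-sampling distribution, matching the unsmoothed P(i) the paper uses.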
Although different measures are designed to capture the centrality of nodes in the network, strong correlations have been proved to exist among these measures [23]. We compute different centrality measures on several small datasets. NC_∞ is computed by Eq. (5); NC_DW is computed by DeepWalk with window size 6. As presented in Table 1, we calculate their correlations using Spearman's rank correlation coefficient. NC_∞ and Network Flow Betweenness cannot be computed on the polblog dataset, as that graph is disconnected. Apart from the football dataset, Degree, NC_∞, and PageRank show a significant relation with NC_DW on all the remaining datasets. Node Conductance is not sensitive to the window size on these datasets.

We visualize the special case, the football network, to give an intuitive sense of the properties of Degree, Betweenness, and Node Conductance (other centralities are presented in the Supplementary Material), and to shed more light on why Node Conductance does not correlate with Degree on this dataset. Figure 1 presents the football network. The color represents the ranking of nodes produced by different metrics (low value: red; medium value: light yellow; high value: blue); the values produced by these four metrics are each normalized into the range [0, 1]. Comparing Fig. 1a and Fig. 1b with Fig. 1d, the result provided by Node Conductance (window = 6) appears to synthesize the evaluations from Degree and Betweenness: Node Conductance gives low values to nodes with low degree (nodes 36, 42, 59) and high betweenness centrality (nodes 58, 80, 82). We thus have an intuitive understanding that Node Conductance captures both local and global structural characteristics. With a bigger window size, the distribution of node colors in Fig. 1c is basically consistent with Fig. 1d; some clusters of nodes get lower values in Fig. 1c because of the different levels of granularity being considered.

We apply Node Conductance computed by DeepWalk to both static and dynamic networks to demonstrate its validity and efficiency. Node Conductance with different window sizes is tested, and size 6 proves to be the best choice. We try our best to calculate the baseline centralities accurately, although some of them do not scale to the big network datasets.

Static Networks (Table 2). We employ the collaboration network of DBLP, the Amazon product co-purchasing network, and the Youtube social network provided by SNAP. In DBLP, two authors are connected only if they are co-authors, and the publication venues are considered the ground-truth communities. DBLP has highly connected clusters and consequently the highest clustering coefficient (CC). In the Amazon network, an edge means that two products are frequently co-purchased, and the ground-truth communities are the groups of products in the same category. Users in the Youtube social network create or join groups based on their own interests, which can be seen as the ground truth; a link between two users represents a friend relationship. The CC of the Youtube network is very low.

Dynamic Network (Table 3). We use the Flickr network [17] between November 2nd, 2006 and May 18th, 2007. As shown in Table 3, there are altogether 4 snapshots during this period. This unweighted and undirected network gains about 300,000 new users and over 3.8 million new edges. The configuration of our computer is two Intel(R) Xeon(R) CPU E5-2620 at 2.00 GHz with 64 GB of RAM; running times are reported in Table 4.
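The correlation entries of Table 1 are plain Spearman coefficients; a small sketch with SciPy (dataset loading elided; karate_club_graph is a stand-in for the paper's small datasets, and node_conductance_exact is the Eq. (5) helper from earlier):

```python
from scipy.stats import spearmanr
import networkx as nx

G = nx.karate_club_graph()                  # stand-in small dataset
order = list(G.nodes())
degree = [G.degree(v) for v in order]
pagerank = [nx.pagerank(G)[v] for v in order]
nc_inf = node_conductance_exact(G)          # Eq. 5, from the earlier snippet
rho, _ = spearmanr(nc_inf, degree)          # one cell of a Table-1-style matrix
print(f"Spearman(NC_inf, Degree) = {rho:.3f}")
```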
Node Conductance is calculated by DeepWalk with the settings m = 80, l = 40, w = 6, and d = 128, the same as in [21]. As Node Conductance is a by-product of DeepWalk, its actual running time is the same as DeepWalk's. As presented at the beginning of the section, Eigenvector centrality and PageRank are calculated approximately, and we set the error tolerance used to check convergence in the power-method iteration to 1e−10. Betweenness is approximated by randomly choosing 1000 pivots; more pivots require more running time. Subgraph Centrality and Network Flow Betweenness have no corresponding approximations. Time costs of some global centralities are listed in Table 4. Approximate Eigenvector centrality, Subgraph Centrality, and Network Flow Betweenness cannot finish computing in a reasonable amount of time on these three datasets. Node Conductance calculated by DeepWalk is as fast as approximate PageRank and costs much less time than approximate Betweenness. Compared with the existing global centralities, Node Conductance computed by DeepWalk is much more scalable and can be run on big datasets.

We use Node Conductance to find nodes spanning several communities; such nodes are sometimes called structural holes. The Amazon, DBLP, and Youtube datasets provide node affiliations, and we count the number of communities each node belongs to. In our experiments, nodes are ranked decreasingly by their centrality values. We first calculate the Spearman ranking coefficient between the rank produced by each centrality measure and the number of communities. The error tolerance of approximate Eigenvector centrality is set to 1e−6; other settings are the same as in Sect. 6.1. Results are shown in Table 5: Node Conductance performs the best, and PageRank performs poorly.

We further explore the differences between the ranks of these centralities and plot the number of communities each node belongs to (y-axis) in the order given by each centrality measure (x-axis). To smooth the curve, we average the number of communities over every 1000 nodes; for example, the point (x, y) denotes that the nodes ranked from 1000x to 1000(x + 1) belong to y communities on average. In Fig. 2, all six metrics reflect the decreasing trend of the number of spanned communities. Node Conductance clearly provides the smoothest curve of the six, indicating its outstanding ability to capture node status from a structural point of view. The consistency of performance across datasets (please refer to the Supplementary Material) demonstrates that Node Conductance is an effective tool for graphs with different clustering coefficients.

Degree and PageRank seem to perform very differently in Table 5 and Fig. 2. The ground-truth centrality is the number of communities each node belongs to, so many nodes share the same ground-truth rank; similarly, many nodes share the same degree. Under the other centrality measures, however, nodes have distinct centrality values and ranks. Thus, Degree has an advantage in achieving a higher ranking coefficient in Table 5 but performs badly in Fig. 2. As for the curves of PageRank, their tails are quite different from those of Node Conductance: in Fig. 2e, the tail is not smooth. In other words, PageRank does not perform well for less active nodes and thus achieves a poor score in Table 5.
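The curves of Fig. 2 come from simple binning; a sketch (our own helper, assuming per-node centrality scores and ground-truth community counts are available as arrays):

```python
import numpy as np

def binned_community_curve(centrality, community_counts, bin_size=1000):
    """Sort nodes by decreasing centrality, then average the number of
    communities over consecutive bins of bin_size nodes (Fig. 2 curves)."""
    order = np.argsort(-np.asarray(centrality))
    counts = np.asarray(community_counts, dtype=float)[order]
    n_bins = len(counts) // bin_size
    return counts[:n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)
```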
The calculation of Node Conductance is based entirely on topology, while node affiliation (communities) is determined entirely by fields and applications. Node affiliation is nevertheless reflected in the network topology to some extent, and Node Conductance has a better ability to capture it.

In this experiment, we focus on the mechanism of network growth. It is well known that network growth can be described by the preferential attachment process [3]: the probability of a node getting connected to a new node is proportional to its degree. We consider the expansion of the Flickr network [17] from Dec. 3rd, 2006 to Feb. 3rd, 2007. The results are similar for other snapshots; given space limitations, we only show this expansion in the paper. Nodes in the first snapshot are ranked decreasingly by degree, and we count the newly created connections for every node. Figure 3 presents strong evidence of preferential attachment. However, there are peaks in the long tail of the curve, and they should not be ignored, as they almost reach 50 and show up repeatedly. Figure 3b presents the relationship between degree increase and Node Conductance. Comparing the left parts of the two curves, Node Conductance fails to capture the node with the biggest degree change. On the other hand, the Node Conductance curve is smoother, and no peak shows up in its long tail. Degree-based preferential attachment applies to high-degree nodes, while for nodes with fewer edges, this experiment suggests a new expression of preferential attachment: the probability of a node getting connected to a new node is proportional to its Node Conductance.

In this paper, we propose a new node centrality, Node Conductance, measuring node influence from a global view. The intuition behind Node Conductance is the probability of revisiting the target node in a random walk. We also rethink the widely used network representation model, DeepWalk, and calculate Node Conductance approximately as the dot product of the input and output vectors. Experiments present the differences between Node Conductance and other existing centralities. Node Conductance also shows its effectiveness at mining influential nodes on both static and dynamic networks.

References (titles only; author and venue fields were lost in extraction)
1. Internet: diameter of the world-wide web
2. Approximating betweenness centrality
3. Emergence of scaling in random networks
4. Factoring and weighting approaches to status scores and clique identification
5. Centrality and network flow
6. Centrality estimation in large networks
7. Subgraph centrality in complex networks
8. A set of measures of centrality based on betweenness
9. Centrality in social networks: conceptual clarification
10. node2vec: scalable feature learning for networks
11. Identification of influential spreaders in complex networks
12. Neural word embedding as implicit matrix factorization
13. Improving distributional similarity with lessons learned from word embeddings
14. Word embedding revisited: a new representation learning and explicit matrix factorization perspective
15. Enhancing the network embedding quality with structural similarity
16. Distributed representations of words and phrases and their compositionality
17. Growth of the Flickr social network
18. Improving document ranking with dual word embeddings
19. A measure of betweenness centrality based on random walks
20. The PageRank citation ranking: bringing order to the web
21. DeepWalk: online learning of social representations
22. Collective dynamics of 'small-world' networks
23. Centers of complex networks