key: cord-0208909-p7w6c1gn authors: Dorodnykh, Aleksandr; Prokhorenkova, Liudmila Ostroumova; Samosvat, Egor title: Preferential placement for community structure formation date: 2017-06-18 journal: nan DOI: nan sha: d23a9cb4a5d349b818b8d4a40fbd0d554457fb79 doc_id: 208909 cord_uid: p7w6c1gn Various models have been recently proposed to reflect and predict different properties of complex networks. However, the community structure, which is one of the most important properties, is not well studied and modeled. In this paper, we suggest a principle called"preferential placement", which allows to model a realistic community structure. We provide an extensive empirical analysis of the obtained structure as well as some theoretical results. The evolution of complex networks attracted a lot of attention in recent years. Empirical studies of different real-world networks have shown that such structures have some typical properties: small diameter, power-law degree distribution, clustering structure, and others [10, 17, 44] . Therefore, numerous random graph models have been proposed to reflect and predict such quantitative and topological aspects of growing real-world networks [10, 12, 17, 47, 51] . The most extensively studied property of complex networks is their vertex degree distribution. For the majority of studied real-world networks, the portion of vertices of degree d was observed to decrease as d −γ , usually with 2 < γ < 3 [5, 23, 45] . Such networks are often called scale-free. The most well-known approach to the modeling of scale-free networks is called preferential attachment. The main idea of this approach is that new vertices emerging in a graph connect to some already existing vertices chosen with probabilities proportional to their degrees. Preferential attachment is a natural process allowing to obtain a graph with a power-law degree distribution, and many random graph models are based on this idea, see, e.g., [11, 14, 28, 33, 56] . Another important characteristic of complex networks is their community (or clustering) structure, i.e., the presence of densely interconnected sets of vertices, which are usually called clusters or communities [24, 26] . Several empirical studies have shown that community structure of different real-world networks has some typical properties. In particular, it was observed that the cumulative community size distribution obeys a power law with some parameter λ. For instance, [16] reports that λ = 1 for some networks; [3] obtains either λ = 0.5 or λ = 1; [27] also observes a power law with λ close to 0.5 in some range of cluster sizes; [48] studies the overlapping communities and shows that λ is ranging between 1 and 1. 6 . Community structure is an essential property of complex networks. For example, it highly affects the spreading of infectious diseases in social networks [29, 37] , spread of viruses over computer networks [54] , promotion of products via viral marketing [31] , propagation of information [52] , etc. Therefore, it is crucial to be able to model realistic community structures. Nowadays, there are a few random graph models allowing to obtain clustering structures. Probably the most well-known model was suggested in [35] as a benchmark for comparing community detection algorithms. In this model, the distributions of both degrees and community sizes follow power laws with predetermined exponents. However, there are two drawbacks of this model. First, it does not explain the power-law distribution of community sizes, these sizes are just sampled from a power-law distribution at the beginning of the process. Second, a subgraph induced by each community is very similar to the configuration model [8] , which does not allow to model, e.g., hierarchical community structure often observed in real-world networks [3, 16] . A weighted model which naturally generates communities was proposed in [34] . However, the community structure in this model is not analyzed in details and only the local clustering coefficient is shown. From the figures presented in [34] it seems that the community size distribution does not have a heavy tail as it is observed in real-world complex networks. Finally, let us mention a paper [50] which analyzes a community graph, where vertices refer to communities and edges correspond to shared members between the communities. The authors show that the development of the community graph seems to be driven by preferential attachment. They also introduce a model for the dynamics of overlapping communities. Note that [50] only models the membership of vertices and does not model the underlying network. In this paper, we propose a process which naturally generates clustering structures. Our approach is called preferential placement and it is based on the idea that vertices can be embedded in a multidimensional space of latent features. The vertices appear one by one and their positions are defined according to preferential placement: new vertices are more likely to fall into already dense regions. We present a detailed description of this process in Section 2. After n steps we obtain a set of n vertices placed in a multidimensional space. In Section 3 we empirically and theoretically analyze the obtained structure: in particular, we show that the communities are clearly visible and their sizes are distributed according to a power law. Note that after the placement of all vertices is defined, one can easily construct an underlying network, using, e.g., the threshold model [13, 39] . We discuss possible models and their properties in Section 4. This paper is a journal version of [20] . It contains additional theoretical and empirical results for the vertex configuration obtained by preferential placement. We also discuss some new ideas on generating graphs from this configuration and analyze the diameter of such graphs. In this section, we describe the proposed approach which we call preferential placement. We assume that all vertices are embedded in R d for some d ≥ 1. One can think that coordinates of this space correspond to latent features of vertices. Introducing latent features has recently become a popular approach both in predictive and generative models. These models are known by different names such as latent feature models [41, 42] , matrix factorization models [4, 21, 40] , spatial models [2, 6, 7] , or geographical models [13, 39] . The basic idea behind all these models is that vertices having similar latent features are more likely to be connected by an edge. Preferential placement is a procedure describing the embedding of vertices in the space R d . After that, given the coordinates of all vertices, one can construct a graph using one of many well-known approaches (see Section 4 for the discussion of possible variants). Our model is parametrized by a distribution Ξ taking nonnegative values. The proper choice of Ξ is discussed further in this section. We construct a random configuration of vertices (or points) S n = {x 1 , . . . , x n }, where x 1 is the origin. Now assume that we have constructed S t for t ≥ 1, then we obtain S t+1 by adding a vertex v t+1 with the coordinates x t+1 chosen in the following way: • Choose a vertex v i t+1 from v 1 , . . . , v t uniformly at random. • Sample ξ t+1 from the distribution Ξ. • Sample a direction e t+1 from a uniform distribution on a multidimensional sphere e t+1 = 1, where by · we denote the Euclidean norm in R d . We argue in this paper that in order to obtain a realistic clustering structure one should take Ξ to be a heavy tailed distribution. In this case, according to the procedure described above, new vertices will usually appear in the dense regions, close to some previously added vertices; however, due to the heavy tail of Ξ, from time to time we get outliers, which originate new clusters. We call the described above procedure "preferential placement" due to its analogy with preferential attachment. Assume that at some step of the algorithm we have several clusters, i.e., groups of vertices located close to each other, and a new vertex appears. Then the probability that this vertex will join a cluster C is roughly proportional to its size, i.e., the number of vertices already belonging to this cluster. This is the basic intuition which we discuss further in this paper in more details. In some cases, it would be convenient to think of the definition given above in the following way. Let us first construct a uniform random recursive tree. Recall that a recursive tree of order n is a rooted tree on n vertices labeled v 1 , . . . , v n , with the property that for each k, 2 ≤ k ≤ n, the indices of the vertices on the unique path from the root v 1 to the vertex v k form an increasing sequence [53] . Uniform random recursive tree (or URRT) is a tree sampled from the set of recursive trees of order n uniformly at random. An equivalent way to sample a URRT is to start from a graph T 1 consisting of a root vertex v 1 and at each step t + 1 add a vertex v t+1 connected to a vertex v i t+1 chosen from v 1 , . . . , v t uniformly at random. After n steps we get a URRT T n . Random recursive trees were extensively studied in the literature, see, e.g., [18, 19, 43, 49] . To construct a random configuration of vertices S n = {x 1 , . . . , x n }, we first construct a URRT T n using a recursive procedure described above (we will further call this tree genealogical ). Then we label all edges of T n : for each edge (v t , v it ) we set its label w t to be a vector ξ t e t , where ξ t is sampled from the distribution Ξ and e t is, as before, a random direction of length 1 in R d . Finally, the value x k for each vertex v k is obtained by summing all labels on the unique path from v 1 to v k , i.e., x k = w j 1 + . . . + w j l such that for all 1 ≤ t ≤ l we have i jt = j t−1 , j l = k, j 0 = 1. It is easy to see that this definition is equivalent to the one given in Section 2.1. Note that each x k = w j 1 + . . . + w j l is obtained by a random walk with lengths of jumps distributed according to Ξ, which we further assume to be a power law. Such random walks are known in the literature as Lévy flights [15] . In other words, preferential placement is essentially a combination of URRT and Lévy flights. 3 Analysis of preferential placement 3 In this section, we analyze structures obtained using the preferential placement procedure described above. We take Ξ to be a slightly modified Pareto distribution with the density In all the experiments we take d = 2 since the obtained structures are easy to visualize. However, we also tried other values of d ≥ 1 and obtained results similar to shown. e.g., on Figures 5, 7 and 8. Also, if not specified otherwise, we generated structures with the number of points n = 100K. First, let us visualize the structures obtained by our algorithm. We tried several values of β, β ∈ {0.5, 1, 1.5, 2.5, 4}. The results are presented on Figures 1 and 2 . The value β = 0.5 produces the heaviest tail, in this case the distribution Ξ does not have a finite expectation. Although some clusters are clearly visible in this case (Figure 1a ), they are located far apart from each other and there are too many single outliers far away from the main picture, which seems to be not very realistic. Note that for too large β, e.g., for β = 4, the variance is too low and we obtain only one giant cluster with minor fluctuations, as presented on Figure 1d . Further in this paper we discuss the case β = 1.5 presented on Figure 2 . In this case Ξ has a finite expectation but an infinite variance. Another interesting observation is a hierarchical clustering structure produced by our algorithm. To illustrate this, we take the figure obtained for β = 1.5 and zoom it to see more details. Figure 2 shows that the largest cluster further consists of several subclusters. In order to further analyze this self-similarity property we empirically estimated a fractal dimension of the obtained structures [32] . We used a box-counting algorithm 1 which computes the number of boxes N (ε) with the edge size equal to ε needed to cover the structure. Then the fractal dimension d f is defined using the following relation: N (ε) ∝ ε −d f . The results are presented on Figure 3 (crosses represent the empirically observed pairs (ε, N (ε)), while lines correspond to the approximations N (ε) ∝ ε −d f ). One can see that the fractal dimension d f increases from 0.4 to 1.3 while β increases from 0.5 to 4. Note that the fractal dimension of Lévy flights is also known to monotonically increase with the parameter of the power-law distribution [36] . In this section, we analyze the distribution of cluster sizes produced by preferential placement. We present both theoretical and empirical observations. The main difficulty with the analysis of clustering structure is the fact that there are no standard definitions of clusters, both in graphs and metric spaces. For example, clusters are often defined as a result of some clustering algorithm. 2 This causes a lot of difficulties for both theoretical and empirical analysis. First, let us discuss why we expect to observe a power-law distribution of cluster sizes in our model. As we discussed above, due to the absence of a rigorous definition of a cluster, further in this section we are able to present only some heuristic theory. Let F t (s) denote the number of clusters of size s at step t. In order to analyze F t (s) we consider its dynamics inductively. Assume that after a step t we obtain some clustering structure. At step t + 1 we add a vertex v t+1 and choose its "parent" v i t+1 from v 1 , . . . , v t uniformly at random. Clearly, the probability to choose a parent from some cluster C with |C| = s is equal to s t . In this case, we call C a parent cluster for v t+1 . Now let us make the following assumptions: 1. All clusters can only grow, they cannot merge or split. 2. At step t + 1 a new cluster appears with probability p(t) = c t α , c > 0, 0 ≤ α ≤ 1. 3. Given that a vertex t + 1 does not create a new cluster, the probability to join a cluster C with |C| = s is equal to s t . These assumptions are quite strong and even not very realistic. For instance, it seems reasonable that two clusters can merge if many vertices appear somewhere between them. Regarding the second assumption, p(t) can possibly depend on the current configuration S t . However, these assumptions allow us to analyze the behavior of F t (s) formally. Namely, we prove the following theorem. Theorem 1 Under the assumptions described above the following holds. 2. If 0 < α ≤ 1, then for any > 0 To sum up, if the probability p(n) of creating a new cluster is of order 1 n α for α > 0, then the distribution of cluster sizes follows a power law with parameter 2 − α growing with p(n) from 1 to 2; if p(n) = c, 0 < c < 1, then the parameter grows with c from 2 to infinity. Recall that the parameter of the cumulative distribution is one less than discussed above. The proof of Theorem 1 is technical and we place it to Appendix. Let us also explain why we do not consider p(n) decreasing faster than c n . It is natural to assume that a new cluster appears if a new vertex chooses a parent vertex near the border of some cluster and then ξ t+1 and e t+1 are chosen such that x t+1 = x i t+1 + ξ t+1 · e t+1 falls quite away from the parent cluster. This probability is roughly proportional to the number of vertices located near the borders of the clusters. Extreme case, 1 vertex, provides the bound c n . Finally, let us mention that in practice the probability p(n) of creating a new cluster can depend not only on Ξ, but also on the definition of clusters. Further in this section we demonstrate that parameters of a clustering algorithm can affect the parameter of the obtained power law. As we already mentioned, there is no standard definition of a clustering structure. In many cases, clusters and communities are defined just as a result of some clustering algorithm. Therefore, we first analyze the performance of several clustering algorithms, then choose the most appropriate one and analyze clusters it produces. We compare the following algorithms: k-means [38] , EM (expectation maximization), and DBSCAN (density-based spatial clustering of applications with noise) [22] . For k-means and EM one has to specify the number of clusters. We tried several values of k, k ∈ {10, 50, 100, 500, 1000}, but both algorithms turned out to be not suitable for out problem. As expected, in all cases they unnaturally split the largest cluster into several small ones (see Figures 4a and 4b) . On the contrary, DBSCAN produces more realistic results. It requires two parameters: radius of neighborhood ε and the minimum number of neighbors required to form a dense region k. We consider k ∈ {1, 2, 3} and ε is chosen in such a way that if we connect all vertices i, j such that i − j < ε, then we get Ln edges, L ∈ {5, 25, 125}, where n is the number of vertices. For all parameters we get reasonable clustering structures. The result Figure 4c . For these parameters we also analyze the distribution of cluster sizes (see Figure 5a ). Note that for not too large values of s (s < 300) the cumulative distribution follows a power law with parameter λ ≈ 0.95. In Theorem 1 this value corresponds to the case α = 0.05, i.e., p(n) ∝ n −0.05 . Based on this, we expect the number of clusters to grow as n 0.95 , i.e., close to linearly. On Figure 6 we plot the empirical number of clusters and fit it by n 0.95 . Finally, as we promised above, we show that λ can depend on the clustering algorithm. Figure 5b shows the cumulative cluster size distribution for DBSCAN with L = 5, k = 1. Note that λ = 1.44, so it is larger in this case. Intuitively, the reason is that p(n) is larger for L = 5 than for L = 125. Smaller values of L correspond to smaller ε, which means that it is harder for a new vertex to join some existing cluster, which makes p(n) larger. In this section, we analyze the scatter of vertices in R d . Namely, let us first consider the variable which is the average moment of order δ for the distance from the origin to a vertex. Recall that the most interesting case is 1 < β < 2, for such β the expected average squared distance EX 2 (n) diverges, therefore we analyze EX δ (n) for 1 ≤ δ ≤ 2. The following theorem holds. Theorem 2 Let ξ be a random variable sampled from the distribution Ξ. For any δ, 1 ≤ δ < min{β, 2}: EX δ (n) ≤ Eξ δ n t=2 1 t ∼ Eξ δ log n . If β > 2, then In the proof we will use the following lemma. Lemma 1 Let ξ and η be two independent random vectors. Assume that the distribution of ξ is symmetrical, i.e., the distributions of ξ and −ξ are identical. Then for any 1 ≤ δ ≤ 2: We place the proof of this lemma to Appendix. Now we are ready to prove Theorem 2. Proof. Let us denote by ξ a vector of length ξ (sampled from the distribution Ξ) pointing to a random direction, i.e., ξ = ξe. For β > 2 the following recurrent formula holds: Using the above recurrent formula, we prove the second part of the theorem: To prove the first part we use Lemma 1: so we obtain the recurrent formula for EX δ (n) from which the theorem follows. One can also easily obtain the lower bound for the following variable: Namely, the following lemma holds. Let ω(n) be any function tending to infinity as n tends to infinity. Then Proof. Let r = n 1 β ω(n) . If for at least one step t we have ξ t > 2r, then X max > r, therefore 4 Graph models 4 In this section, we discuss how a graph can be constructed based on the vertices embedding produced by the preferential placement procedure. Spatial distance. The basic idea behind many known spatial models is that we want to increase the probability of connecting two vertices if they have similar latent features. Various methods can be found in the literature, which are usually combined with some other ideas like introducing weights of vertices or taking into account degrees of vertices (see, e.g., a survey of spatial models in [7] ). We now briefly describe some possible approaches: • threshold model [13, 39] : • p-threshold model : • p-threshold model with random edges (as in spatial small-world models [7] ): • Waxman model [55] : Here we denote by E the set of edges. We assume that all edges are mutually independent, hence to describe a random graph it is enough to define the probability of each edge. Genealogical distance. It is also reasonable to take into account the genealogical tree used to create the configuration of vertices (see Section 2.2 for the description of this tree). For example, it seems very natural to connect each vertex to its parent in the genealogical tree. This serves several purposes: (i) the underlying genealogical process is reflected in the obtained graph, (ii) the graph is guaranteed to be connected, (iii) the graph has small diameter, as we show further in this section. More generally, let d tree (v i , v j ) be the length of the unique path between v i and v j in the genealogical tree. Then let where F (x, y) : R + × N → [0, 1] can be any function non-increasing with both x and y. Further we deeper analyze the following threshold model: We choose θ such that we have 5n edges with x i − x j ≤ θ in our graph. As before, we take Ξ to be a distribution with the density function f β (x) = β (x+1) β+1 , x ≥ 0 for β = 1.5. Figure 7 visualizes the obtained graph with n = 10K 3 . Degree distribution. Let us empirically analyze the degree distribution for the threshold model (1). The cumulative degree distribution for this case is presented on Figure 8 . Observe that the cumulative degree distribution does not follow a power law. However, it is very similar to degree distributions obtained in many real-world networks (numerous examples can be found in [1]). Figure 7 : Visualization of the obtained graph, laid out with ForceAtlas 2 [30] ; vertices are colored according to a partition obtained by Louvain community detection algorithm [9] Diameter. Now let us prove that graphs obtained according to (1) have small (at most logarithmic) diameter. In order to prove this, we use the following theorem [49] . Theorem 3 Let H n be the height of a uniform random recursive tree T n . Then, with probability one, Our graph contains the underlying genealogical tree by the definition, and this tree is URRT, so we obtain the following result. Theorem 4 For a graph G n constructed according to (1) we have, with probability one, log n ≤ 2e . In this paper, we introduced a principle called preferential placement. Our method is designed to model a realistic clustering structure in R d . The algorithm is parametrized only by a distribution Ξ, and if Ξ is a Pareto distribution, which is the most natural choice, then we essentially have only one parameter -the exponent β. The proposed algorithm naturally models clusters and the distribution of cluster sizes follows a power law, which is a desirable property. Although preferential placement only generates the coordinates of vertices, one can easily construct a graph based on the obtained structure using one of the methods discussed in this paper. We showed that applying a genealogical-based variant of a threshold model to the configuration generated by preferential placement leads to a realistic degree distribution and small diameter. In this paper, we made only a first step to understanding the cluster formation in complex structures and there are many directions for future research. First of all, more formal analysis of the distribution of cluster sizes would be useful. As we discussed, the main problem here is the lack of any suitable formal definition of clusters. However, one can try, e.g., to analyze clusters produced by one of well-known clustering algorithms. Second, the radius of obtained configuration can be deeper analyzed: e.g., we are going to study the lower bound for X δ (n) and the upper bound for X max (n) (see Section 3.4). Finally, there are many open questions in the analysis of the obtained graph structures: comparison of different models, theoretical analysis of the degree distribution, analysis of the local clustering coefficient, and so on. Now we introduce the following variables and it remains to show that 2 δ/2−1 u δ/2 + v δ/2 ≤ 1, which is true since the maximum of u δ/2 + v δ/2 = u δ/2 + (1 − u) δ/2 for 0 ≤ u ≤ 1 is equal to 2 −δ/2+1 at u = 1 2 . Now we note that the distributions of ξ + η and ξ − η are identical, so E ξ + η δ = 1 2 E ξ + η δ + E ξ − η δ and now the lemma follows from Equation (5). Jeanette Janssen, and Pawe l Pra lat. A spatial web graph model with local influence regions Community analysis in social networks Factorization threshold models for scale-free networks generation Emergence of scaling in random networks Crossover from scale-free to spatial networks Spatial networks The asymptotic number of labeled graphs with given degree sequences Fast unfolding of communities in large networks Complex networks: Structure and dynamics The degree sequence of a scale-free random graph process Mathematical results on scale-free random graphs. Handbook of graphs and networks: from the genome to the internet The structure of geographical threshold graphs Popularity based random graph models leading to a scale-free degree sequence Introduction to the theory of lévy flights. Anomalous transport: Foundations and applications Finding community structure in very large networks and Paulino Ribeiro Villas Boas. Characterization of complex networks: A survey of measurements The strong convergence of maximal degrees in uniform random recursive trees and dags Total path length for random recursive trees Preferential placement for community structure formation Temporal link prediction using matrix and tensor factorizations A density-based algorithm for discovering clusters in large spatial databases with noise On power-law relationships of the internet topology Community detection in graphs Resolution limit in community detection Community structure in social and biological networks Self-similar community structure in a network of human interactions Growing scale-free networks with tunable clustering Forecast and control of epidemics in a globalized world Forceatlas2, a continuous graph layout algorithm for handy network visualization designed for the gephi software Maximizing the spread of influence through a social network Fractal geometry Local clustering coefficient in generalized preferential attachment models Model of community emergence in weighted social networks Benchmark graphs for testing community detection algorithms Fractional quantum mechanics Transmission dynamics and control of severe acute respiratory syndrome Least squares quantization in pcm Geographical threshold graphs with small-world and scale-free properties Link prediction via matrix factorization. Machine Learning and Knowledge Discovery in Databases A log-linear model with latent features for dyadic prediction Nonparametric latent feature models for link prediction On the number of terminal vertices in certain random trees with an application to stemma construction in philology The structure and function of complex networks Power laws, pareto distributions and zipf's law. Contemporary physics Finding and evaluating community structure in networks Recency-based preferential attachment models Uncovering the overlapping community structure of complex networks in nature and society Note on the heights of random recursive trees and random m-ary search trees Preferential attachment of communities: The same principle, but a higher level Small subgraphs in preferential attachment networks Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter A survey of recursive trees On computer viral infection and the effect of immunization Routing of multipoint connections Maximal planar networks with large clustering coefficient and power-law degree distribution This work is supported by Russian President grant MK-527.2017.1. Special thanks to Alexey Tikhonov for stimulating discussions. First, recall the process of cluster formation:• At the beginning of the process we have one vertex which forms one cluster.• At n-th step with probability p(n) a new cluster consisting of v n is created.• With probability 1 − p(n) new vertex joins already existing cluster C with probability proportional to |C|.So, we can write the following equations:Now we can take expectations of the both sides of the above equations and analyze the behavior of EF t (s) inductively. Consider the case α = 0, i.e., p(n) = c. Let us prove that in this casewhere θ n,s ≤ C s 1 1−c for some constant C > 0. We prove this result by induction on s and for each s the proof is by induction on n. Note that for n = 1 Equation (4) holds for all s. Consider now the case s = 1. We want to prove that EF n (1) = c 2 − c (n + θ n,1 ) .For the inductive step we use Equation (2) and getFor s > 1 we use Equation (3) and getTo finish the proof we need to show thatIt is easy to show that the above inequality holds.Now we consider the case p(n) = cn −α for 0 < α ≤ 1. Let us prove that in this casewhere θ n,s ≤ Cn max{0,1−2α} s 1−α+ for some constant C > 0 and for any > 0. The proof is similar to the case α = 0. Again, for n = 1 the theorem holds. Consider s = 1. We want to prove that EF n (1) = c 2 − α n 1−α + θ n,1 .Inductive step in this case becomesIn order to finish the proof for the case s = 1 it is sufficient to show thatwhich holds for sufficiently large C.For s > 1 we have:In order to finish the proof, it remains to show that,1−2α} s 1−α+ , which holds for sufficiently large C. Let us first prove that for any vectors x, y ∈ R d and for any δ, 1 ≤ δ ≤ 2, we have:It x = 0 and y = 0, then this inequality holds. Otherwise, without loss of generality, we can assume that x 2 + y 2 = 1 (if x 2 + y 2 = a > 0, we divide x and y by √ a). In order to prove (5) , it is sufficient to show that 1 2 x + y δ + x − y δ ≤ 1 and 1 ≤ x δ + y δ . The second inequality is trivial:x δ + y δ ≥ x 2 + y 2 = 1 since x ≤ 1, y ≤ 1, and δ ≤ 2.It remains to show that 1 2x + y δ + x − y δ ≤ 1. First let us note thatx + y 2 + x − y 2 = 2( x 2 + y 2 ) = 2.