title: dK-Microaggregation: Anonymizing Graphs with Differential Privacy Guarantees
authors: Iftikhar, Masooma; Wang, Qing; Lin, Yu
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_15

With the advances of graph analytics, preserving privacy in publishing graph data becomes an important task. However, graph data is highly sensitive to structural changes. Perturbing graph data to achieve differential privacy inevitably requires injecting a large amount of noise, and the utility of anonymized graphs is severely limited. In this paper, we propose a microaggregation-based framework for graph anonymization which meets the following requirements: (1) the topological structures of an original graph can be preserved at different levels of granularity; (2) ε-differential privacy is guaranteed for an original graph through adding controlled perturbation to its edges (i.e., edge privacy); (3) the utility of graph data is enhanced by reducing the magnitude of noise needed to achieve ε-differential privacy. Within the proposed framework, we further develop a simple yet effective microaggregation algorithm under a distance constraint. We have empirically verified the noise reduction and privacy guarantee of our proposed algorithm on three real-world graph datasets. The experiments show that our proposed framework can significantly reduce the noise added to achieve ε-differential privacy over graph data, and thus enhance the utility of anonymized graphs.

Graph data analysis has been widely performed in real-life applications. For instance, online social networks are explored to analyze human social relationships, election networks are studied to discover different opinions in a community, and co-author networks are used to understand collaboration relationships among researchers [22]. However, such networks often contain sensitive or personally identifiable information, such as social contacts, personal opinions and private communication records. Publishing graph data can thus pose a privacy threat. To preserve graph data privacy, various anonymization techniques for graph data publishing have been proposed in the literature [1, 11, 14, 24]. Nonetheless, even when a graph is anonymized without publishing any identity information, an individual may still be revealed based on the structural information of the graph [11].

In recent years, differential privacy [5] has emerged as a widely recognized mathematical framework for privacy. A number of studies [10, 18] have investigated the problem of publishing anonymized graphs under a guarantee of differential privacy. However, graph data is highly sensitive to structural changes. Directly perturbing graph data often requires injecting a large amount of random noise, which severely impacts the utility of anonymized graphs. To deal with this issue, several works [19-22] have explored techniques of indirectly perturbing graph data through a graph abstraction model, such as the dK-graph model [16] and the hierarchical random graph (HRG) model [2], or through spectral graph methods. The central idea behind these works is to first project a graph into a statistical representation (e.g., degree distribution and dendrogram) or a spectral representation (e.g., adjacency matrix), and then add random noise to perturb such representations.
Although these techniques are promising, they can only achieve ε-differential privacy over a graph by injecting random noise whose magnitude is proportional to the sensitivity of queries, which is fixed to the global sensitivity. Due to the high sensitivity of graph data to structural changes, the utility of the anonymized graphs published by these works is still limited. To alleviate this limitation, we aim to develop a general framework for anonymizing graphs which meets the following requirements: (1) the topological structures of an original graph can be preserved at different levels of granularity; (2) ε-differential privacy is guaranteed for an original graph through adding controlled perturbation to its edges (i.e., edge privacy [13]); (3) the utility of graph data is enhanced by reducing the magnitude of noise needed to achieve ε-differential privacy.

We observe that the dK-graph model [15, 16] for analyzing network topologies can serve as a good basis for generating structure-preserving anonymized graphs. Essentially, the dK-graph model generates dK-graphs by retaining a series of network topology properties extracted from d-sized subgraphs of an original graph. In order to reduce the amount of random noise without compromising ε-differential privacy, we incorporate microaggregation techniques [4] into the dK-graph model to reduce the sensitivity of queries. This makes it possible to apply perturbation to network topology properties at a flexible level of granularity, depending on the degree of microaggregation.

Figure 1 provides a high-level overview of our proposed framework. Given two neighboring graphs G ∼ G', network topology properties such as dK-distributions [16] are first extracted from each graph. Then a dK-distribution goes through a microaggregation procedure, which consists of partition and aggregation. After that, the microaggregated dK-distribution is perturbed, yielding an ε-differentially private dK-distribution. Finally, based on the perturbed dK-distribution, ε-differentially private dK-graphs are generated. That is, for two neighboring graphs G ∼ G', their corresponding anonymized graphs generated by this framework are ε-indistinguishable.

To summarize, our work has the following contributions: (1) We present a novel framework, called dK-microaggregation, that can leverage a series of network topology properties to generate ε-differentially private anonymized graphs. (2) We propose a distance-constrained algorithm for approximating the dK-distributions of a graph via microaggregation within the proposed framework, which enables us to reduce the amount of noise added to ε-differentially private anonymized graphs. (3) We have empirically verified the noise reduction of our proposed framework on three real-world networks. The results show that our algorithm can effectively enhance the utility of generated anonymized graphs by providing better within-cluster homogeneity and reducing the amount of noise, in comparison with the state-of-the-art methods.

Let G = (V, E) be a simple undirected graph, where V is the set of nodes and E the set of edges in G. We use deg(v) to denote the degree of a node v, and deg(G) to denote the maximum degree of G.

Definition 1 (Neighboring graphs). Two graphs G = (V, E) and G' = (V', E') are said to be neighboring graphs, denoted as G ∼ G', iff V = V', E ⊂ E' and |E| + 1 = |E'|.

The dK-graph model [16] provides a systematic way of extracting subgraph degree distributions from a given graph, i.e., dK-distributions.
Specifically, the 1K-distribution captures a degree distribution, the 2K-distribution captures a joint degree distribution, i.e., the number of edges between nodes of different degrees, and the 3K-distribution captures a clustering coefficient distribution, i.e., the number of triangles and wedges connecting nodes of different degrees. When d = |V|, the dK-distribution specifies the entire graph. For larger values of d, dK-distributions capture more complex properties of a graph at the expense of higher computational overhead [16]. To describe how a dK-distribution is extracted from a graph, we define the notion of a dK function γ_dK, which maps a graph G to its dK-distribution dK(G). Following the previous work [16], we define a dK-graph as a graph that can be constructed by reproducing the corresponding dK-distribution.

Definition 4 (dK-graph). A dK-graph over dK(G) is a graph in which connected subgraphs of size d satisfy the probability distribution in dK(G).

Conceptually, a dK-graph is considered as an anonymized version of an original graph G that retains certain topological properties of G at a chosen level of granularity. In this paper, we aim to generate dK-graphs with an ε-differential privacy guarantee for preserving the privacy of structural information between nodes of a graph (edge privacy). We formally define differentially private dK-graphs below. A randomized mechanism K provides ε-differentially private dK-graphs if, for each pair of neighboring graphs G ∼ G' and all possible outputs 𝒢 ⊆ range(K), the following holds:

    Pr[K(G) ∈ 𝒢] ≤ e^ε · Pr[K(G') ∈ 𝒢],   (1)

where 𝒢 is a family of dK-graphs and ε > 0 is the differential privacy parameter. Smaller values of ε provide stronger privacy guarantees [5].

In this section, we present a novel framework, dK-microaggregation, for generating ε-differentially private dK-graphs. Without loss of generality, we will use the 2K-distribution to illustrate our proposed framework, for two reasons: (1) as previously discussed in [15, 16], the d = 2 case is sufficient for most practical purposes; (2) dK-generators for d = 2 have been well studied [9, 15], whereas dK-generators for d ≥ 3 have not yet been discovered [9]. Previous studies [19, 20] have shown that changing a single edge in a graph may result in one or more changes to tuples in its corresponding dK-distribution. The following lemma states the maximum number of changes between the 2K-distributions of two neighboring graphs.

In our work, for each dK-distribution D, we want to generate D_ε, an anonymized version of D satisfying ε-differential privacy. Thus, we view the response to a dK function γ_dK for d = 2 as a collection of responses to degree queries, one for each tuple in a 2K-distribution. To guarantee ε-differential privacy for each degree query q_t, we can add random noise to the real response q_t(G), yielding a randomized response q_t(G) + Lap(Δ(q_t)/ε), where Δ(q_t) denotes the sensitivity of q_t and Lap(Δ(q_t)/ε) denotes random noise drawn from a Laplace distribution. If we query D with a set of degree queries {q_t}_{t∈D} and the response to each q_t satisfies ε-differential privacy, then by the parallel composition property of differential privacy [17] we can generate a D_ε that satisfies ε-differential privacy. However, the total amount of random noise added to the responses can be very high, particularly when a graph is large. To control the amount of random noise and thus increase the utility of D_ε, we microaggregate similar tuples in D before adding noise.
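As a concrete illustration of the per-tuple baseline just described, the following minimal sketch extracts a 2K-distribution from a graph and perturbs each tuple's frequency with Laplace noise. This is not the authors' implementation (they use Orbis [15] for dK-distributions); the function names and the use of networkx/numpy are our own assumptions, and the sensitivity of a single degree query is left as a parameter.

```python
# Minimal sketch (not the authors' code): extract a 2K-distribution and
# perturb each tuple's frequency with Laplace noise Lap(sensitivity/eps).
from collections import Counter

import networkx as nx
import numpy as np


def two_k_distribution(G):
    """2K-distribution of G: {(deg_u, deg_v): number of edges joining a node
    of degree deg_u to a node of degree deg_v}, with deg_u <= deg_v."""
    dist = Counter()
    for u, v in G.edges():
        d1, d2 = sorted((G.degree(u), G.degree(v)))
        dist[(d1, d2)] += 1
    return dict(dist)


def perturb_per_tuple(dist, eps, sensitivity):
    """Baseline eps-DP: add Lap(sensitivity/eps) noise independently to the
    frequency of every tuple (one degree query q_t per tuple)."""
    return {t: f + np.random.laplace(0.0, sensitivity / eps)
            for t, f in dist.items()}


# Example usage on a small random graph.
G = nx.gnp_random_graph(50, 0.1, seed=1)
D = two_k_distribution(G)
D_noisy = perturb_per_tuple(D, eps=1.0, sensitivity=1.0)
```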
Thus, the dK function γ_dK is replaced by the composed query γ_dK • M: the microaggregation algorithm M is run over the dK-distribution D extracted by γ_dK, yielding a microaggregated dK-distribution D̄. The response to γ_dK • M is a collection of responses to microaggregate degree queries, one for each cluster in D̄. Indeed, we can see that q_t is a special case of q*_T, since q_t(G) = q*_T(G) holds for T = {t}. By Lemma 1, we have the following lemma about q_t and q*_T. For each cluster in D̄ resulting from running M, a microaggregate degree query returns only an aggregated frequency value for the cluster of tuples. Thus, γ_dK • M is less "sensitive" when the number of clusters in D̄ is smaller. By Lemma 2 and the fact that changing one edge in a graph may lead to changes in multiple clusters of D̄, we have the following lemma about the sensitivity of γ_dK • M.

Generally, dK-microaggregation works in the following steps. First, it extracts a dK-distribution from a graph. Then, it microaggregates the dK-distribution and perturbs the microaggregated dK-distribution to generate an ε-differentially private dK-distribution. Finally, a dK-graph is generated.

In this section, we discuss algorithms for microaggregating dK-distributions. Generally, a microaggregation algorithm for dK-distributions M = (C, A) consists of two phases: (a) Partition - similar tuples in a dK-distribution are partitioned into the same cluster; (b) Aggregation - the frequency values of tuples in the same cluster are aggregated, as illustrated in Fig. 2.

MDAV-dK Algorithm. Given a dK-distribution D = γ_dK(G) over a graph G, a simple way of microaggregating D is to partition D in such a way that each cluster contains at least k tuples. For this, we use a simple microaggregation heuristic, called Maximum Distance to Average Vector (MDAV) [4], which generates clusters of the same size k, except one cluster of size between k and 2k − 1. However, unlike the standard version of MDAV, which aggregates each cluster by replacing each record in the cluster with a representative record, we aggregate the frequency values of tuples in each cluster into an aggregated frequency value. To avoid ambiguity, we call our MDAV-based algorithm for microaggregating dK-distributions the MDAV-dK algorithm.

It is well known that, for many real-world networks such as Twitter, degree distributions are often highly skewed. This often leads to highly skewed dK-distributions for such networks. However, due to inherent limitations of MDAV, e.g., the fixed-size constraint, MDAV-dK would suffer significant information loss when evenly partitioning a highly skewed dK-distribution into clusters of the same size. To address this issue, we propose an algorithm called Maximum Pairwise Distance Constraint (MPDC-dK). Unlike MDAV-dK, MPDC-dK aims to partition a dK-distribution into clusters under a distance constraint. Specifically, after partitioning, the distances between the corresponding degrees in any two tuples within a cluster should be no greater than a specified distance interval τ. Take a 2K-distribution D as an example, and let t_1 = (g_1, g_1', m_1) and t_2 = (g_2, g_2', m_2) be two tuples in a cluster after applying MPDC-dK on D. Then we say that these two tuples satisfy the distance constraint τ iff max(|g_1 − g_2|, |g_1' − g_2'|) ≤ τ. The clustering problem addressed by MPDC-dK is thus to generate the minimum number of clusters in which every pair of tuples from the same cluster satisfies this distance constraint τ; a small sketch of the constraint check and the aggregation phase is given below.
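The following is a minimal, illustrative sketch (our own code, not the paper's) of the distance constraint over 2K-distribution tuples t = (g, g', m) and of the aggregation phase (b).

```python
# Sketch of the distance constraint and the aggregation phase for
# 2K-distribution tuples t = (g, g', m); names are illustrative only.

def satisfies_constraint(t1, t2, tau):
    """True iff two tuples satisfy the distance constraint tau, i.e.
    max(|g1 - g2|, |g1' - g2'|) <= tau."""
    return max(abs(t1[0] - t2[0]), abs(t1[1] - t2[1])) <= tau


def aggregate_clusters(clusters):
    """Aggregation phase: for each cluster (a list of tuples), keep its
    member degree pairs together with one aggregated frequency value."""
    return [([(g, gp) for g, gp, _ in cluster],
             sum(m for _, _, m in cluster))
            for cluster in clusters]
```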
The conceptual idea behind our MPDC-dK algorithm design is to consider each degree pair (g, g') as coordinates in a two-dimensional space, and to treat the above distance constraint τ as a τ-by-τ box, denoted by ((x, x'), τ) and centered at (x, x'), in the same two-dimensional space. Clearly, such a box corresponds to a cluster that satisfies the distance constraint τ, and a box ((x, x'), τ) covers a degree pair (g, g') iff x − τ/2 ≤ g ≤ x + τ/2 and x' − τ/2 ≤ g' ≤ x' + τ/2. MPDC-dK employs a greedy algorithm to find the minimum number of boxes (i.e., clusters) that cover all degree pairs. MPDC-dK first enumerates all boxes that cover at least one degree pair and records the corresponding counts as the number of degree pairs covered by these boxes. MPDC-dK then recursively selects a box with the maximum count (i.e., covering the maximum number of degree pairs) in a greedy manner, assigns these degree pairs to a new cluster, and removes them from the other boxes, until all boxes are empty. After that, MPDC-dK performs aggregation to aggregate the frequency values of tuples in each cluster into an aggregated frequency value.

Algorithm 1 describes the details of our MPDC-dK algorithm. Given a dK-distribution D, we start by initializing an empty cluster list D̄ (Line 1) and a list b_list that records each box, its corresponding degree pairs, and the total number of degree pairs covered by the box (Line 2). For each degree pair (g, g') in D, we enumerate the boxes that cover (g, g') using a function covering_boxes (Line 4). For each enumerated box b_i, we update the list by adding (g, g') to b_i and incrementing the count of b_i by 1 (Lines 5-6). After creating b_list, we iteratively select a box b_max with the maximum count of degree pairs (Line 8), then generate a new cluster from the degree pairs in b_max and add it to the cluster list (Lines 9-10). We further remove b_max and all degree pairs in b_max from b_list and update the counts of the affected boxes in b_list (Lines 11-15). The algorithm terminates when b_list is empty and returns the set of generated clusters D̄.

Privacy Analysis. Here, we theoretically show that dK-graphs generated in our proposed framework are differentially private. First, by Lemmas 2 and 3, we can obtain an ε-differentially private dK-distribution D_ε by microaggregating a dK-distribution and calibrating the amount of random noise according to the sensitivity of microaggregate degree queries. As D_ε only contains aggregated frequency values for clusters of tuples in a dK-distribution, we perform post-processing using a randomized algorithm f that randomly selects tuples within each cluster of D_ε until the aggregated frequency value of the cluster is reached. Dwork and Roth [6] proved that differential privacy is immune to post-processing, i.e., the composition of a randomized algorithm with a differentially private algorithm is differentially private. This leads to the lemma below. Based on f(D_ε), a dK-graph can be generated using a dK-graph generator [15, 16]. Following Lemma 4, Definition 5, and the proposition of Dwork and Roth [6] on post-processing, we have the following theorem for our framework, which corresponds to the randomized mechanism K = γ_dK • M • K_dK • f • γ̄_dK, where γ̄_dK : D → G is a dK-graph generator.
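Returning to the clustering step, here is a compact sketch of the greedy box-covering procedure of Algorithm 1. It is our own simplified rendering: it enumerates integer box centres and recomputes coverage in each round, rather than maintaining counts incrementally as Algorithm 1 does, and the names are not taken from the paper's code.

```python
from collections import defaultdict


def mpdc_partition(degree_pairs, tau):
    """Greedy covering of degree pairs by tau-by-tau boxes; each selected
    box becomes one cluster. degree_pairs is a list of (g, g') tuples."""
    half = tau / 2.0
    # For every candidate (integer) box centre, record the indices of the
    # degree pairs it covers; each pair lies in at most (tau+1)^2 boxes.
    boxes = defaultdict(set)
    for i, (g, gp) in enumerate(degree_pairs):
        for x in range(int(g - half), int(g + half) + 1):
            for xp in range(int(gp - half), int(gp + half) + 1):
                boxes[(x, xp)].add(i)

    clusters, remaining = [], set(range(len(degree_pairs)))
    while remaining:
        # Greedily pick the box covering the most not-yet-clustered pairs.
        _, covered = max(boxes.items(), key=lambda kv: len(kv[1] & remaining))
        members = covered & remaining
        clusters.append([degree_pairs[i] for i in sorted(members)])
        remaining -= members
    return clusters
```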
Complexity Analysis. We analyze the time complexity of the algorithms MDAV-dK and MPDC-dK. For MDAV-dK with a constraint on the minimum size k of clusters, given a dK-distribution D with n tuples as input, the complexity of clustering is similar to that of MDAV [4], i.e., O(n²). For MPDC-dK with a constraint on the distance interval τ, in order to generate clusters, MPDC-dK needs to perform a sequential search over all degree pairs in D. First, MPDC-dK enumerates the boxes for all degree pairs, and each degree pair is covered by at most (τ + 1)² boxes (Line 4 of Algorithm 1); hence the cost of enumerating boxes is O(τ²n) (Lines 3-6 of Algorithm 1). Second, MPDC-dK sorts the boxes by the number of degree pairs they cover, and iteratively selects and removes the box with the maximum count. Although it takes O(n log n) to sort and greedily select the box with the maximum count in the first iteration, each later iteration costs only O(τ² log n) (Line 8 of Algorithm 1), because each box overlaps with at most 4τ² other boxes and removing one box only affects the counts of O(τ²) boxes (Lines 11-15 of Algorithm 1). Hence, the cost of selecting and removing boxes is O(τ² n log n) (Lines 7-15 of Algorithm 1). The overall complexity of MPDC-dK for clustering is O(τ² n log n).

We have evaluated the proposed framework to answer the following questions:
- Q1. How does dK-microaggregation reduce the amount of noise added to dK-distributions while still providing an ε-differential privacy guarantee?
- Q2. How do our microaggregation algorithms perform in providing better within-cluster homogeneity for dK-distributions?
- Q3. What are the trade-offs between utility and privacy when generating differentially private dK-graphs?

Baseline Methods. To evaluate our proposed framework, we considered the following methods: (1) ε-DP, a standard ε-differential privacy algorithm in which noise is added using the Laplace mechanism [5]; (2) MDAV-dK, which extends the standard microaggregation algorithm MDAV [4] to handle dK-distributions; (3) MPDC-dK, our proposed dK-microaggregation algorithm. We used Orbis [15] to generate 2K-distributions.

Evaluation Measures. We used the Euclidean distance [19] to measure the network structural error between original and perturbed dK-distributions. For clustering algorithms, we measure within-cluster homogeneity using the sum of absolute error (SAE) [7], SAE = Σ_i Σ_{x_j ∈ c_i} |x_j − μ_i|, where c_i is the set of tuples in cluster i and μ_i is the mean of cluster i (a brief sketch of both measures is given below).

Experimental Results. To verify the overall utility of ε-differentially private dK-distributions, we first conducted experiments to compare the structural error between original and perturbed dK-distributions generated by our algorithms MDAV-dK and MPDC-dK and by the baseline method ε-DP. Figure 3 presents our experimental results. For ε-DP, we used the privacy parameters ε = [0.01, 0.1, 1.0, 10.0], which cover the range of differential privacy levels widely used in the literature [12]. The results for ε-DP are displayed as horizontal lines, as ε-DP does not depend on the parameters k and τ. From Fig. 3, we can see that, for all three datasets, our proposed algorithms MDAV-dK and MPDC-dK lead to less structural error than ε-DP for every value of ε. This is because, by approximating a query γ with γ • M via dK-microaggregation, the error caused by the random noise needed to achieve ε-differential privacy is reduced significantly. Thus, dK-microaggregation introduces overall less noise to achieve differential privacy.
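The two evaluation measures can be computed as in the following minimal sketch; this is our reading of the measures described above, not code published with the paper.

```python
import math


def structural_error(original, perturbed):
    """Euclidean distance between two dK-distributions, viewed as frequency
    vectors over the union of their tuples."""
    keys = set(original) | set(perturbed)
    return math.sqrt(sum((original.get(t, 0) - perturbed.get(t, 0)) ** 2
                         for t in keys))


def sum_absolute_error(clusters):
    """SAE: for each cluster of frequency values, sum |x - mean(cluster)|,
    then sum over all clusters (lower means more homogeneous clusters)."""
    sae = 0.0
    for freqs in clusters:
        mu = sum(freqs) / len(freqs)
        sae += sum(abs(x - mu) for x in freqs)
    return sae
```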
We then conducted experiments to compare the quality of the clusters, in terms of within-cluster homogeneity, generated by MDAV-dK and MPDC-dK. The results are shown in Tables 1 and 2. We observe that, for values of k and τ at which MDAV-dK and MPDC-dK generate almost the same number of clusters, as highlighted in bold, MPDC-dK outperforms MDAV-dK by producing clusters with lower SAE over all three datasets. This is consistent with the previous discussion in Sect. 4: as MPDC-dK always partitions degree pairs under a distance constraint rather than a fixed-size constraint, it generates more homogeneous clusters than MDAV-dK.

Finally, we analyze the trade-offs between utility and privacy of dK-graphs generated in the proposed framework. To enhance the utility of differentially private dK-graphs, we approximate an original query γ by γ • M. This introduces two kinds of error: one is the random noise added to guarantee ε-differential privacy, and the other is due to microaggregation. The second kind of error can be reduced by generating homogeneous clusters during microaggregation. The first kind of error, which depends on the sensitivity of γ • M, dominates the impact on the utility of differentially private dK-graphs generated via dK-microaggregation. By reducing this sensitivity, we can increase the utility of dK-graphs without compromising privacy.

Graph data anonymization has been widely studied in the literature, and many anonymization techniques [1, 11, 14, 24] have been proposed to enforce privacy over graph data. These techniques can be broadly categorized into three areas: node and edge perturbation, k-anonymity, and differential privacy. Perturbation-based approaches follow certain principles to process nodes and edges, including identity removal [14], edge modification [23], and node clustering [11]. Generally, k-anonymity approaches divide an original graph into blocks of size at least k so that the probability that an adversary can re-identify a node's identity is at most 1/k. Popular k-anonymity approaches for graph anonymization include k-candidate [11], k-neighborhood anonymity (k-NA) [24], k-degree anonymity (k-DA) [14], k-automorphism, and k-isomorphism (k-iso) [1]. Differential privacy on graph data can be roughly divided into two categories, namely node differential privacy [3] and edge differential privacy [13]. In general, unlike k-anonymity, differential privacy approaches come with mathematical proofs of their privacy guarantees. Nevertheless, applying differential privacy to graph data limits utility, because graph data is highly sensitive to structural changes and adding noise directly to graph data can significantly degrade its utility. To address this issue, many approaches [19-22] perturb various statistical information of a graph by projecting graph data into other domains using feature-abstraction models [2, 16]. This idea is appealing; however, it still yields limited data utility because random noise calibrated to the global sensitivity must be injected to guarantee ε-differential privacy. Our aim is to anonymize graphs under ε-differential privacy using less sensitive queries. In this regard, we have proposed a microaggregation-based framework which reduces the sensitivity via microaggregation, thus reducing the overall noise needed to achieve ε-differentially private graphs.
In this paper, we have formalized a general microaggregation-based framework for anonymizing graphs that preserves the utility of dK-graphs while enforcing ε-differential privacy. Based on the proposed framework, we have proposed an algorithm for microaggregating dK-distributions under a distance constraint. We have theoretically analyzed the privacy property of our framework and the complexity of our algorithm. The effectiveness of our work has been empirically verified over three real-world datasets. Future extensions of this work will consider zero-knowledge privacy (ZKP) [8] to release statistics about social groups in a network while protecting the privacy of individuals.

References:
- K-isomorphism: privacy preserving network publication against structural attacks
- Hierarchical structure and the prediction of missing links in networks
- Publishing graph degree distribution with node differential privacy
- Ordinal, continuous and heterogeneous k-anonymity through microaggregation
- Calibrating noise to sensitivity in private data analysis
- The algorithmic foundations of differential privacy
- Fast and robust general purpose clustering algorithms
- Towards privacy for social networks: a zero-knowledge based definition of privacy
- 2.5K-graphs: from sampling to generation
- Accurate estimation of the degree distribution of private networks
- Publishing differentially private datasets via stable microaggregation
- Publishing attributed social graphs with formal privacy guarantees
- Towards identity anonymization on graphs
- Orbis: rescaling degree correlations to generate annotated internet topologies
- Systematic topology analysis and generation using degree correlations
- Privacy integrated queries: an extensible platform for privacy-preserving data analysis
- Calibrating data to sensitivity in private data analysis
- Sharing graphs using differentially private graph models
- Preserving differential privacy in degree-correlation based graph generation
- Differential privacy preserving spectral graph analysis
- Differentially private network data release via structural inference
- Randomizing social networks: a spectrum preserving approach
- Preserving privacy in social networks against neighborhood attacks