key: cord-0045711-sjd9py87
authors: Li, Jun; Ma, Huan; Wu, Guangjun; Zhang, Yanqin; Ma, Bingnan; Hui, Zhen; Zhang, Lei; Zhu, Bingqing
title: A Workload Division Differential Privacy Algorithm to Improve the Accuracy for Linear Computations
date: 2020-06-15
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50417-5_33
sha: 9168a599013f90667cc8d00ff4a0599dfdcaef87
doc_id: 45711
cord_uid: sjd9py87

Differential privacy is an effective technology for protecting data privacy; there is a large body of research on it and there are practical deployments by Internet companies such as Apple and Google. With differential privacy, data organizations can allow external data scientists to explore their sensitive datasets while giving the data owners provable privacy guarantees. Because a differential privacy algorithm perturbs the data, some error in the query results is inevitable, and many differential privacy algorithms aim to reduce the introduced noise. However, those algorithms are only suited to simple or relatively uniform data; when the data distribution is complex, some of them lose efficiency. In this paper, we propose a new, simple ε-differential privacy algorithm. Our approach has two key points. First, we use Laplace-based, workload-aware noise to perturb answers, which reduces the error of linear computation queries over intensive data items. Second, we propose an optimized workload division method: we divide the queries recursively to reduce the added noise, which reduces computation error when there is a query hot spot in the workload. We conduct an extensive evaluation over six real-world datasets to examine the performance of our approach. The experimental results show that our approach can reduce computation error for linear computations by nearly 40% compared with MWEM, DAWA, and Identity. Meanwhile, our approach achieves better response time on the query cases than the state-of-the-art algorithms.

Among data privacy protection technologies, existing research approaches the problem from the following perspectives: anonymity-based methods, encryption-based methods, noise-based methods, and differential privacy-based methods. There are many reliable encryption technologies, such as DES [7], 3DES, Blowfish [23], RC5 [21], IDEA, and RSA. The advantage of encryption is its security; however, the analyzability of the data is lost once it is encrypted. Anonymity-based methods protect data privacy while keeping the data analyzable; the main anonymity technologies are k-anonymity [24], l-diversity [3], and t-closeness [18]. However, anonymity-based methods have a fatal weakness: the anonymized data may be de-anonymized. For data organizers, security and privacy problems arise in data collection and publishing. Among attacks on data privacy, a differential attack is one in which the attacker infers private data from statistical information over two homogeneous datasets. For example, an attacker can infer a person's specific purchases by differential attacks via different queries.
To learn whether a person bought an object, the attacker can issue two queries: one obtains the count of persons who bought the object, and the other obtains the same count on the dataset that excludes the person, selected by quasi-identifiers such as timestamp, gender, region, and age. From the two results, the attacker can infer whether the person bought the object. Many differential privacy algorithms can defend against differential attacks, such as the matrix mechanism [17], the DAWA algorithm [16], MWEM [13], and RAPPOR [10], and differential privacy technology is used in many fields [5, 6, 11, 20, 26].

Differential privacy was first defined by Dwork et al. [8, 9]; it protects individual data by injecting noise into the results according to a privacy budget. A number of ε-differential privacy algorithms have been proposed [2, 13, 15-17], some of which are workload-aware and data-dependent [2, 13, 16, 17]. From the viewpoint of how results are perturbed, ε-differential privacy adopts three mechanisms: the Laplace Mechanism [8], the Exponential Mechanism [19], and Randomized Response [25]. Randomized response is an effective way to protect the privacy of a frequency vector, and it has been used for privacy protection when collecting sensitive data since the 1960s. RAPPOR [10] is an ε-differential privacy technology based on randomized response that Google has deployed in its browser. MWEM [13] is a classical ε-differential privacy algorithm based on a combination of the Exponential Mechanism and the Multiplicative Weights update rule: it selects and poses queries using the Exponential and Laplace Mechanisms, and it improves the approximation using the Multiplicative Weights update rule. DAWA [16] is a data-dependent and workload-aware algorithm that adds noise according to the input data and the given workload; it is a two-stage mechanism for answering range queries under ε-differential privacy. In 2016, Michael Hay et al. proposed DPBENCH [14], a framework for standardized evaluation of privacy algorithms. In 2018, Dan Zhang et al. [27] proposed Ektelo, a programming framework and system that implements the existing algorithms; for the task of answering linear counting queries, Ektelo allows both privacy novices and experts to design algorithms easily. APEx [12] is a novel differential privacy system that allows data analysts to pose adaptively chosen sequences of queries along with required accuracy bounds.

Most of these algorithms depend on the data distribution. In particular, when the data items are sparse, i.e., a large number of items are empty, these algorithms can effectively reduce the introduced error; the same conclusion is reached in [14, 16]. However, these algorithms are not suitable for all data situations: when the data items are intensive and the data distribution is complex they lose their advantage, as is also shown in [14, 16]. Current ε-differential privacy algorithms cause computation error for linear computations over intensive data domains. Inspired by partitioning of the data domain, we propose a novel ε-differential privacy algorithm that uses Laplace-based noise and optimized workload division to decrease the computation error in complex data situations. We make the following contributions:

(1) We propose a novel ε-differential privacy algorithm for complex data situations. We use Laplace-based noise to perturb the query results.
This perturbation reduces the error of linear computation queries over intensive data items via workload-aware noise. (2) We propose an optimized workload division method. We divide the queries recursively to reduce the added noise; this division effectively reduces computation error when there is a hot spot, i.e., some part of the domain is frequently queried in the workload. (3) We conduct extensive experiments on six real-world datasets and compare our algorithm with existing differential privacy algorithms (MWEM, DAWA, and Identity). The evaluation results show that the proposed algorithm effectively reduces the computation error and has comparatively good efficiency.

We propose an ε-differential privacy algorithm for linear computation queries. The algorithm aims to reduce the error of the results in the case where the sensitivity of the workload is high, some dom(B) items are frequently queried due to hot topics or statistical attack queries, and the frequency vector x is complex. The algorithm adopts the Laplace Mechanism to perturb the query results. To reduce the random added noise, we propose a new perspective: the added noise may be reduced by dividing the queries into several clusters and adding Laplace-based noise to each cluster separately. Furthermore, based on this division, we propose a simple and effective recursive division of the query workload and the privacy budget: the method recursively divides the workload and the budget into two parts whenever the expected noise after dividing is less than before. In summary, the algorithm addresses three problems: (1) current ε-differential privacy algorithms reduce the error only to a limited extent, while consuming considerable computational resources; (2) when the sensitivity of a query workload is large, current algorithms cannot reduce the noise significantly, as shown in [16]; (3) when the data distribution is intensive, current algorithms do not fit it and introduce large error into the answers to the query workload.

The method we propose satisfies ε-differential privacy rigorously. ε-differential privacy is the privacy protection mechanism proposed by Dwork in 2006, and it regulates privacy protection. We define it formally as follows.

Definition 1 (ε-differential privacy). An algorithm M is an ε-differential privacy algorithm if for any neighboring databases I and I' (|I - I'| <= 1) and any subset of outputs S, the following holds:

Pr[M(I) in S] <= e^ε * Pr[M(I') in S]

An ε-differential privacy algorithm protects private data by perturbing the answer so that attackers cannot distinguish the results over the neighboring databases I and I'. The parameter ε is the privacy budget, and it determines the privacy-preserving capacity: the lower the privacy budget, the more effectively the algorithm protects privacy. For a random algorithm M, if the results over the two adjacent datasets I and I' are close to each other, it is difficult to infer whether a particular data item exists from M(I) and M(I'). ε-differential privacy has the following primary properties.

Property 1 (post-processing). Let M_1 be a random algorithm satisfying ε_1-differential privacy, and let M: R -> R' be an arbitrary deterministic function. Then M(M_1(X)) still satisfies ε_1-differential privacy.

Property 2 (sequential composition). Let each random algorithm M_i satisfy ε_i-differential privacy, and define a random function M as a sequence of the M_i. Then M satisfies (sum_{i=1}^{k} ε_i)-differential privacy.
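To make Definition 1 and the budget accounting concrete, the following minimal Python sketch (illustrative, not from the paper; the variable names and example counts are hypothetical) checks the density-ratio bound for a Laplace-noised count query, whose sensitivity is 1, on a pair of neighboring databases whose true counts differ by one:

```python
import numpy as np

# Illustrative check of Definition 1 for a Laplace-noised count query:
# on neighboring databases the output densities differ at every point s
# by a factor of at most e^eps.
eps = 1.0
count_I, count_I_prime = 100, 101  # true counts on neighboring I, I'

def laplace_pdf(s, mu, b):
    """Density of the Laplace distribution with mean mu and scale b."""
    return np.exp(-abs(s - mu) / b) / (2 * b)

for s in (98.0, 100.5, 103.0):
    ratio = laplace_pdf(s, count_I, 1 / eps) / laplace_pdf(s, count_I_prime, 1 / eps)
    assert ratio <= np.exp(eps) + 1e-9  # the epsilon-DP guarantee of Definition 1
    print(f"output {s}: density ratio {ratio:.3f} <= e^eps = {np.exp(eps):.3f}")
```

Property 2 then says that running several such mechanisms with budgets ε_1, ..., ε_k consumes sum ε_i in total, which is exactly the accounting that the workload division in the later sections relies on.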
Our algorithm reduces the error of the results when answering linear computation queries under ε-differential privacy; we now define the linear computation query. Consider a database instance I whose relational schema has attributes A = {A_1, A_2, ..., A_l}. Each attribute in A can be discrete or continuous; continuous data can be treated as discrete over its data domain as well. A workload is a set of queries over the attributes; for example, a workload may consist of three-dimensional range queries over attributes A_1, A_2, and A_3. We then build a frequency vector x over the domain dom(B) of the queried attributes; for example, dom(B) = {(1, 1, 1), (1, 1, 2), ...}, and for each dom(B)_i, x_i is the frequency of tuples with value dom(B)_i. A linear computation query computes a linear combination of the frequencies in x (for instance, a SQL COUNT query with a range predicate), and we define it formally as follows.

Definition 2 (linear computation query). A linear computation query is a length-n vector q = (q_1, ..., q_n) with q_i in {0, 1}. The answer to a linear query q on x is the vector product q · x = q_1 x_1 + q_2 x_2 + ... + q_n x_n.

A linear computation query is called a range count query, a linear count query, or a point count query when q can be expressed as a range, a general length-n vector, or a single position in x, respectively. In the data collection setting, the data organizers can calculate the frequency vector x, and they have the capability to answer linear computation queries over x. A workload W consists of a set of linear computation queries. If W is an m x n matrix, it represents m length-n linear computation queries, and the query results can be computed as the matrix product W · x. The linear computation query is one of the most important and common queries in data mining and data analysis: it helps analysts understand the distribution of the data and supports intelligent decisions and data prediction. Figure 1 shows a workload W, a frequency vector x, and the answer to W over x.

Our algorithm adopts the Laplace Mechanism to add noise, and we adapt the Laplace Mechanism [8] to fit the query workload and the data distribution. To ensure that our algorithm satisfies ε-differential privacy, it adds random noise that rigorously conforms to the Laplace distribution. The Laplace Mechanism [8] was proposed by Dwork; its key idea is to add noise randomly generated from the Laplace distribution to the query results. The probability density function of the Laplace distribution is

Lap(x; b, μ) = (1/(2b)) * exp(-|x - μ| / b)

The variance of a random variable that follows the Laplace distribution is σ² = 2b². To make the algorithm satisfy ε-differential privacy, we can add random noise drawn from Lap(x; a, 0); we denote a Laplace-distributed random variable with mean 0 and scale a by Lap(a) in the following sections. For different queries or query workloads, the amount of noise the Laplace Mechanism adds differs according to the sensitivity of the query or workload.

Definition 3 (query sensitivity). For neighboring frequency vectors x and x' (|x - x'|_1 <= 1), the sensitivity of a query q is

Δ_q = max_{x, x'} |q · x - q · x'|

That is, the sensitivity of a query is the maximum change of the answer to the query over neighboring frequency vectors. When the sensitivity of a query is high, the private data has a high probability of being attacked: the presence or absence of certain data can greatly change the result of the query, which makes it easier to infer the sensitive data. For a query workload W, we use an m x n matrix to represent W, as shown in Fig. 2.
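As a sketch of how the Laplace Mechanism answers a single linear computation query, the following Python snippet (illustrative; the function name and the data are hypothetical, not from the paper) draws noise scaled to the query's sensitivity, which is 1 for a {0, 1}-valued query over neighboring frequency vectors:

```python
import numpy as np

def answer_linear_query(q, x, epsilon, rng=np.random.default_rng()):
    """Answer a linear computation query q (entries in {0, 1}) over the
    frequency vector x under epsilon-differential privacy.

    Neighboring frequency vectors differ by 1 in a single count, so the
    sensitivity of q is max_i |q_i| = 1 and Lap(1/epsilon) noise suffices.
    """
    return float(q @ x) + rng.laplace(loc=0.0, scale=1.0 / epsilon)

x = np.array([4, 0, 9, 2, 7])                  # frequency vector over dom(B)
q = np.array([0, 1, 1, 1, 0])                  # range count over items 2..4
print(answer_linear_query(q, x, epsilon=0.5))  # noisy answer near 11
```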
According to the sensitivity of a query, the sensitivity of a query workload W can be defined as

Δ_W = max_j sum_i |W_ij|

i.e., the maximum L_1 norm over the columns of W. Given the definitions of the Laplace distribution and sensitivity, we can define the Laplace Mechanism formally: the mechanism answers a workload W over x as

M(W, x) = W · x + (Y_1, ..., Y_m)

where each random variable Y_i is generated independently by Lap(Δ_W * 1/ε). The proof that this mechanism satisfies ε-differential privacy proceeds as follows. For databases I and I' that differ in at most one record, with frequency vectors x and x', let p_I(s) be the probability that the output for the query database I is s. For any output s,

p_I(s) / p_I'(s) = prod_{i=1}^{m} exp((ε/Δ_W) * (|(W·x')_i - s_i| - |(W·x)_i - s_i|)) <= exp((ε/Δ_W) * |W·x - W·x'|_1) <= exp(ε)

since |W·x - W·x'|_1 <= Δ_W for neighboring x and x'.

To reduce the noise, we divide the queries of the workload into several workloads; meanwhile, the privacy budget ε is divided into the same number of budgets. After dividing, each workload receives noise corresponding to its divided budget. Formally, for the workload W, we divide it into {W_1, W_2, ..., W_m} and the privacy budget into ε = {ε_1, ε_2, ..., ε_m} with sum_i ε_i = ε, and we add random noise drawn from Lap(Δ_{W_i} * 1/ε_i) to each part. That is, the answer to the workload is

S = {W_1 · x + Y^(1), ..., W_m · x + Y^(m)}, where Y^(i) ~ Lap(Δ_{W_i} / ε_i)

By Property 2, the algorithm satisfies ε-differential privacy. It can also be proved directly: for neighboring databases I and I', that is, |I - I'|_1 <= 1, let p_I(s) represent the probability distribution of the noisy answer over W for s in R^k; the per-part density ratios multiply, and since sum_i ε_i = ε their product is bounded by exp(ε).

We now discuss how the dividing changes the error for workload W = {W_1, W_2, ..., W_m} and privacy budget ε = {ε_1, ε_2, ..., ε_m}. We calculate the average L_1 error of the answer to the workload. Since E|Lap(b)| = b, the expected average L_1 error per query before and after dividing the workload is

E(L_1)_before = Δ_W / ε,   E(L_1)_after = (1/|W|) * sum_i |W_i| * Δ_{W_i} / ε_i

Taking the workload in Fig. 2 as an example: when the workload W adopts 1-differential privacy via the Laplace Mechanism, the expected L_1 error is Δ_W/ε = 4; after dividing the workload into W_1 and W_2 and the privacy budget into ε_1 = 0.58 and ε_2 = 0.42, the L_1 error is 3.4275.

Based on this division of workload and privacy budget, we propose a specific division for the situation where the data is relatively large and the distribution of the data is complex. We take the mean square error of the frequency vector x to measure the complexity of the data distribution, and in the query workload some parts of the data domain are queried with high frequency. To reduce the added Laplace-based noise, we divide the privacy budget into two equal parts iteratively, and the workload is divided according to the sensitivity; the process is described in Algorithm 1, and the dividing continues until the recursion finishes. We now discuss the rationality of the division. A dichotomy is used because the query workload and privacy budget are divided into two parts W_1, W_2, ε_1, ε_2, and for E(L_1) = Δ_{W_1}/ε_1 * |W_1| + Δ_{W_2}/ε_2 * |W_2|, ε_1 = ε_2 is the minimum extreme point of the function. As described in Algorithm 1, the procedure first finds the most frequently queried item and sizes the split accordingly; it returns W unchanged, i.e., stops dividing, when the expected noise no longer decreases, since it is unnecessary to divide domain items that are not queried with high frequency.

Algorithm 1. Workload dividing
1: procedure divideWorkload(W, ε)
2:   get the most frequently queried item in x as x_i
3:   |W_1| = (|W| + Δ_W + g) / 4
4:   randomly select |W_1| queries that query x_i from W as W_1, and let the remaining queries be W_2
5:   noise = |W| * Δ_W;  noise_divided = |W_1| * Δ_{W_1} + (|W| - |W_1|) * Δ_{W_2}
6:   if noise >= noise_divided then        // keep dividing while the noise decreases
7:     return (divideWorkload(W_1, ε/2), divideWorkload(W_2, ε/2))
8:   return W                              // stop dividing when the noise does not reduce
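A possible implementation of Algorithm 1 is sketched below in Python. It is a reconstruction under stated assumptions: the extracted pseudocode recurses on W_1 twice, which we read as a typo for (W_1, W_2); Δ_W is computed as the maximum column L_1 norm from the workload-sensitivity definition; and the stopping test compares |W| * Δ_W against |W_1| * Δ_{W_1} + |W_2| * Δ_{W_2} as in the paper.

```python
import numpy as np

def workload_sensitivity(W):
    """Delta_W: the maximum L1 norm over the columns of W."""
    return int(np.abs(W).sum(axis=0).max()) if W.size else 0

def divide_workload(W, eps, g=0.0, rng=np.random.default_rng()):
    """Recursively divide workload W and budget eps (a sketch of Algorithm 1).

    Returns a list of (W_i, eps_i) parts; each part is later answered with
    Lap(Delta_{W_i} / eps_i) noise.
    """
    m = W.shape[0]
    hot = int(np.abs(W).sum(axis=0).argmax())   # most frequently queried item x_i
    m1 = int((m + workload_sensitivity(W) + g) / 4)
    hits = np.flatnonzero(W[:, hot] != 0)       # queries that touch the hot item
    if m1 < 1 or m1 >= m or len(hits) < m1:
        return [(W, eps)]                       # division not possible; stop
    chosen = rng.choice(hits, size=m1, replace=False)
    mask = np.zeros(m, dtype=bool)
    mask[chosen] = True
    W1, W2 = W[mask], W[~mask]
    noise = m * workload_sensitivity(W)         # stopping test from Algorithm 1
    noise_divided = (m1 * workload_sensitivity(W1)
                     + (m - m1) * workload_sensitivity(W2))
    if noise >= noise_divided:                  # dividing still reduces the noise
        return divide_workload(W1, eps / 2, g, rng) + divide_workload(W2, eps / 2, g, rng)
    return [(W, eps)]                           # stop once the noise no longer reduces
```

On a workload where the hot item appears in only a few queries, the division stops immediately, matching the remark above that low-frequency domain items need not be divided.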
For a workload W with division W_1 and W_2, suppose that the sensitivity of W is the sum of the sensitivities of W_1 and W_2. The total expected noise under ε-differential privacy is

E(noise(ε_1, ε_2, W_1, W_2)) = Δ_{W_1}/ε_1 * |W_1| + Δ_{W_2}/ε_2 * |W_2|

We can infer that (ε/2, ε/2, W_1, W_2) is a minimum point, so we adopt ε_1 = ε_2 = ε/2. To find min E(noise(ε_1, ε_2, W_1, W_2)), we set Δ_{W_1} = |W_1| (every query in W_1 touches the hot item x_i, so the column sum of x_i in W_1 is |W_1|), substitute Δ_{W_2} = Δ_W - |W_1| from the assumption, and minimize the resulting quadratic in |W_1|: E(noise(ε_1, ε_2, W_1, W_2)) attains its minimum when |W_1| = (|W| + Δ_W)/4. The parameter g refines this result: for a real workload W and its divisions W_1, W_2, in general Δ_W < Δ_{W_1} + Δ_{W_2}, which deviates from our assumption. We therefore introduce g to regulate the result, and g can be estimated from the specific workload.

We now evaluate the performance of our approach on multiple datasets and workloads and compare our algorithm with state-of-the-art differential privacy algorithms. The main metric is the average error, which we evaluate on different datasets and workloads (Fig. 3). In the following evaluation, we measure the average L_1 error per query of a given workload. The workloads are generated randomly and the datasets come from real public databases. To make the results more convincing, we run 5 trials for each evaluation; furthermore, we test the time efficiency of our algorithm. The same synthetic workloads are used by all the compared algorithms. In the experiments, the privacy budget varies over {10.0, 5.0, 1.0, 0.5, 0.1}. In the following, we describe the datasets, the workloads, and the parameters of the experiments.

We described the properties of the linear count query above. Even when there are multiple attributes in a dataset, we can still use a one-dimensional frequency vector x to represent it; in the experiments we use one-dimensional datasets. We use six real datasets. Adult comes from American census data [4]; its frequency vector x is built on the attribute "capital loss", as also used in the experiments of [13]. Adult is sparse, and many frequency counts in x are zero. Income is from the IPUMS American community survey data from 2001-2011; its frequency vector x counts personal income [22], and it is also used in DAWA [16]. Patent is a citation network among a subset of US patents [16]. Fidaeta is from the census of fatal occupational injuries of the US Bureau of Labor Statistics [1], and both Nw and Nw2 are from a survey of compensation of the US Bureau of Labor Statistics [1]; they are the frequency vectors obtained by setting the unit to 1 and 2, respectively, in the continuous-valued attribute. We take the length of x as 4096 for the datasets. An overview of the datasets is given in Table 1.

For the query workloads, we conduct the experiment on eight synthetic workloads W. For the frequently queried item in the frequency vector x, we set the probability of being queried to p = {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. A workload has 2000 queries; each query q in W randomly selects a cluster center c, and the positions it queries in x are randomly generated via the normal distribution with c as the center and 10 as the variance.
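To mirror this setup, the following sketch builds one synthetic workload under assumptions consistent with the description above (2000 queries, a hot item queried with probability p, query positions scattered around a cluster center) and reports the expected per-query L_1 error of the undivided Laplace baseline. All names and parameters here are hypothetical, and the generator is a simplification of the paper's.

```python
import numpy as np

def expected_l1_error(parts):
    """Average expected L1 error per query: E|Lap(b)| = b, so each query in
    part (W_i, eps_i) contributes Delta_{W_i} / eps_i on average."""
    sens = lambda M: int(np.abs(M).sum(axis=0).max()) if M.size else 0
    total_queries = sum(W.shape[0] for W, _ in parts)
    return sum(W.shape[0] * sens(W) / eps for W, eps in parts) / total_queries

rng = np.random.default_rng(0)
n, m, p, hot = 4096, 2000, 0.8, 1234   # domain size, #queries, hot-item probability
W = np.zeros((m, n), dtype=int)
for i in range(m):
    # with probability p the query clusters on the hot item, otherwise anywhere
    c = hot if rng.random() < p else int(rng.integers(n))
    center = int(np.clip(rng.normal(c, np.sqrt(10)), 0, n - 1))
    lo, hi = max(0, center - 5), min(n, center + 5)
    W[i, lo:hi] = 1                     # a small range query around the center

print(expected_l1_error([(W, 1.0)]))    # undivided baseline at eps = 1.0
```

Feeding W through the divide_workload sketch above and calling expected_l1_error on the resulting parts shows whether the division lowers the expected error for a given workload.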
Furthermore, we compare three algorithms with ours. The Identity [8] algorithm adopts the plain Laplace Mechanism: the query answers are directly perturbed with Laplace-distributed noise. MWEM [13] achieves differential privacy by obtaining an estimate of x through the Laplace Mechanism and the Exponential Mechanism. DAWA [16] adopts a partitioning method to achieve differential privacy for range count and linear count workloads.

Among the experimental datasets, Adult is a "sparse" dataset: its distribution is relatively even, as shown in Table 1, and zeros account for more than 97% of the frequency vector x. The other experimental datasets are "complex" datasets with large scale and complex data distributions. Figures 4, 5, and 6 show the average L_1 error for the parameter p set to 0.9, 0.6, and 0.2. It can be seen that MWEM [13] and DAWA [16] add more noise than the Identity [8] algorithm, while our algorithm always adds less noise than Identity. The result figures show that MWEM [13] and DAWA [16] are dataset-aware: facing different datasets, both algorithms perform differently over the same workload. MWEM is the most erratic: when the datasets are simple or approximately evenly distributed, it can add less noise than the other algorithms, but not for complex data. In Fig. 6, we compare the reduction of the average L_1 error relative to the Identity [8] algorithm. In the experiments we compare the performance for different values of the parameter p, which represents the query frequency of a certain dom(B)_i in x. The results show that on the experimental datasets, when p = 0.2, our algorithm reduces the L_1 error by more than 40% compared with Identity.

ε-differential privacy is an effective privacy-preserving technology for linear computation. It can prompt data organizers to provide a secure third-party interface for statistical queries. In this paper, we propose a novel ε-differential privacy algorithm, which uses Laplace-based noise and optimized workload division to decrease the computation error of linear computations over complex data distributions. The evaluation results show that our approach can reduce computation error by nearly 40% compared with the state-of-the-art differential privacy algorithms MWEM, DAWA, and Identity. As future work, we plan to extend our approach by optimizing the proposed workload division to further reduce the introduced error.
References
[2] Differentially private histogram publishing through lossy compression
[3] L-diversity: privacy beyond k-anonymity
[4] UCI machine learning repository
[5] Differentially private smart metering with battery recharging
[6] Differentially private data analysis of social networks via restricted sensitivity
[7] Special feature exhaustive cryptanalysis of the NBS data encryption standard
[8] Calibrating noise to sensitivity in private data analysis
[9] The algorithmic foundations of differential privacy
[10] RAPPOR: randomized aggregatable privacy-preserving ordinal response
[11] Differential privacy protection recommendation algorithm based on student learning behavior
[12] APEx: accuracy-aware differentially private data exploration
[13] A simple and practical algorithm for differentially private data release
[14] Principled evaluation of differentially private algorithms using DPBench
[15] Top-k frequent itemsets via differentially private FP-trees
[16] A data- and workload-aware algorithm for range queries under differential privacy
[17] Optimizing linear counting queries under differential privacy
[18] t-closeness: privacy beyond k-anonymity and l-diversity
[19] Mechanism design via differential privacy
[20] Differential privacy for positive and unlabeled learning with known class priors
[21] The RC5 encryption algorithm
[22] Integrated public use microdata series: version 5.0
[23] Description of a new variable-length key, 64-bit block cipher (Blowfish)
[24] k-anonymity: a model for protecting privacy
[25] Randomized response: a survey technique for eliminating evasive answer bias
[26] Differentially private histogram publication
[27] Ektelo: a framework for defining differentially-private computations