A novel anonymization algorithm: Privacy protection and knowledge preservation

Weijia Yang a,*, Sanzheng Qiao b

a Department of Computer Science, Shanghai Jiao Tong University, Shanghai 200030, China
b Department of Computing and Software, McMaster University, Hamilton, Ont., Canada L8S 4K1
* Corresponding author. E-mail addresses: weijia.yang@yahoo.com.cn (W. Yang), qiao@cas.mcmaster.ca (S. Qiao).

Keywords: Data mining; Privacy protection; Data anonymization; Knowledge preservation

Abstract

In data mining and knowledge discovery, there are two conflicting goals: privacy protection and knowledge preservation. On the one hand, we anonymize data to protect privacy; on the other hand, we allow miners to discover useful knowledge from anonymized data. In this paper, we present an anonymization method which provides both privacy protection and knowledge preservation. Unlike most anonymization methods, where data are generalized or permuted, our method anonymizes data by randomly breaking links among attribute values in records. Through data randomization, our method maintains statistical relations among data to preserve knowledge, whereas in most anonymization methods, knowledge is lost. Thus the data anonymized by our method maintains useful knowledge for statistical study. Furthermore, we propose an enhanced algorithm for extra privacy protection to tackle the situation where the user's prior knowledge of the original data may cause privacy leakage. The privacy levels and the accuracy of knowledge preservation of our method, along with their relations to the parameters in the method, are analyzed. Experimental results demonstrate that our method is effective on both privacy protection and knowledge preservation compared with existing methods.

1. Introduction

Privacy protection is an important issue in data processing. Suppose the data set shown in Fig. 1 is used in a medical research article. Before the article is published, the data must be anonymized to protect privacy. For example, if Michael knows that Hannah is 32 and works as a clerk in the UK, then he can infer that Hannah has diabetes, even if the data set is published without names. A group of attributes, such as {Age, Job, Country} in the above example, which can be used to identify individuals is commonly called a quasi-identifier (Q-I). An attribute like Disease is called a sensitive attribute. In medical research, associations such as "clerk → hypertension" and "[50–60] → diabetes" are often studied. Such associations constitute non-sensitive knowledge, provided that individual identities cannot be inferred from them. Thus, it is important that a data set be protected in such a way that the miner can hardly infer the sensitive data of an individual and, at the same time, can discover non-sensitive knowledge.
In recent years, many methods have been proposed to protect data privacy in data mining (Agrawal & Aggarwal, 2001; Agrawal & Srikant, 2000; Aggarwal & Yu, 2008; Clifton, Kantarcioglu, Vaidya, Lin, & Zhu, 2002; Du & Zhan, 2003; Lindell & Pinkas, 2000). Among these methods, data anonymization (Ciriani, Vimercati, Foresti, & Samarati, 2008; Machanavajjhala, Gehrke, Kifer, & Venkitasubramaniam, 2006; Sweeney, 2002) provides an effective yet simple way of preventing the user from learning sensitive data. In the k-anonymity method (Sweeney, 2002), any individual is indistinguishable from at least k − 1 others in the anonymized data set. Machanavajjhala et al. (2006), however, pointed out that the user may guess the sensitive values with high confidence when the sensitive data lacks diversity, and introduced the l-diversity method. Both k-anonymity and l-diversity have drawbacks. Most of the related works (Kifer & Gehrke, 2006; LeFevre, DeWitt, & Ramakrishnan, 2005, 2006; Li, Li, & Venkatasubramanian, 2007; Meyerson & Williams, 2004) generalize the data values. Some attributes in the quasi-identifier may even be totally suppressed. Both attribute generalization and suppression lead to a loss of data details (Aggarwal, 2005). As a result, useful knowledge such as non-sensitive associations and data distributions can be obtained only in general forms, which may be of little value to the miner.

Recently, the permutation method has been proposed by Xiao and Tao (2006) and Koudas et al. (2007). Instead of generalizing attribute values, it randomly permutes the sensitive values within record groups. In this way, the user can hardly match the sensitive values with the right individuals. However, it is also difficult to discover data knowledge from the permuted data.

In this paper, we propose a novel anonymization method. Instead of generalizing or permuting attribute values as in most anonymization methods, our method randomly breaks the links between the value combinations of the quasi-identifier and the values of the sensitive data, while maintaining the statistical relations between Q-I data and sensitive data, thus preserving non-sensitive knowledge for statistical study. It protects privacy and, at the same time, allows the miner to discover non-sensitive knowledge with high confidence. Furthermore, an enhanced method is proposed to prevent privacy leakage when the user has some prior knowledge of the original data associations.

The paper is organized as follows: in Section 2, we introduce our privacy definition after presenting the preliminary knowledge of data anonymization. Our random anonymization (RA) algorithm is presented in Section 3. We then analyze its level of privacy protection in Section 4. The accuracy of knowledge preservation is discussed in Section 5. Afterwards, we present in Section 6 an enhanced RA algorithm that limits the extra privacy leakage. Finally, we report our experiments in Section 7.
2. Data privacy

Since our method provides a way of anonymizing data, it is necessary to first introduce some preliminary definitions of data anonymization in Section 2.1. Privacy is disclosed when the user is able to correctly link individuals with their sensitive values. In Section 2.2, we introduce our new definition of privacy based on data associations.

2.1. Preliminary

In this section, we adopt several definitions commonly used in the k-anonymization and l-diversity methods from Sweeney (2002) and Machanavajjhala et al. (2006). First, we present the formal definition of a quasi-identifier.

Definition 1 (Quasi-identifier (Q-I)). Given a data set D(A_1, A_2, …, A_m) and an external table D_E, if for every record r_i ∈ D the value combination r_i(A_j, …, A_k), j, k ≤ m, where {A_j, …, A_k} contains no identifiers, can be uniquely located in D_E, we call the set of attributes {A_j, …, A_k} a quasi-identifier.

For example, for the data set in Fig. 1, the external table D_E would consist of Name, Age, Job, and Country, and a quasi-identifier would be {Age, Job, Country}.

The k-anonymization method uses the generalization technique, formally defined by:

Definition 2 (Generalization). Suppose that a domain M consists of disjoint partitions {P_i}, i = 1, …, n, with ∪_i P_i = M. For a given value combination v, generalization is the process of returning the unique partition P_i containing v.

By generalizing the quasi-identifier, each individual in a k-anonymous table is identical to at least k − 1 others with respect to the quasi-identifier:

Definition 3 (k-Anonymity). Given a data set D(A_1, A_2, …, A_m) and its quasi-identifier QI, if for any subset C ⊆ QI and any record r_i ∈ D there exist at least k − 1 other records sharing the same values with r_i on the attribute set C, then the data set D is k-anonymous.

When the data set is k-anonymous, we can group together the records with the same value combinations of the quasi-identifier:

Definition 4 (Q-I group). Given a data set D and its quasi-identifier QI, we define a Q-I group as the set of all records with the same values on QI.

Fig. 2 shows a 2-anonymous data set generalized from Fig. 1, where the data set is partitioned into three groups. As shown in Fig. 2, the values of Age are grouped into wider intervals and the Jobs are clustered. We denote by the symbol "*" a wildcard in attributes. As a result, each record is indistinguishable from at least one other record by the quasi-identifier.

Fig. 1. A sample of original confidential data.
Fig. 2. A 2-anonymous data set by generalizing Fig. 1.

However, k-anonymity has drawbacks. When an individual belongs to the last group in Fig. 2, the user can infer that the individual has hypertension with more than 66% confidence, since two out of the three records in that group have hypertension. Moreover, since quite a few data values have been generalized in the anonymous data set, non-sensitive associations such as "[50–60] → diabetes" cannot be obtained.

While the k-anonymization method only focuses on hiding the Q-I information of the individuals, the l-diversity model (Machanavajjhala et al., 2006) pays attention to the relations between the Q-I and sensitive data:

Definition 5 (l-Diversity). A Q-I group is said to satisfy l-diversity if it contains at least l "well-represented" values of the sensitive attribute. A data set is said to have l-diversity if every Q-I group in it satisfies l-diversity.

In particular, if for each Q-I group the entropy of the sensitive attribute is greater than ln l, then the data set is said to satisfy "entropy l-diversity".
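To make the entropy criterion concrete, here is a minimal Python sketch (ours, not from the paper; the record layout and group key are illustrative) that checks whether every Q-I group of a table satisfies entropy l-diversity.

```python
import math
from collections import Counter
from itertools import groupby

def entropy(values):
    """Shannon entropy (natural log) of a list of sensitive values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def satisfies_entropy_l_diversity(records, qi_attrs, sensitive, l):
    """True if every Q-I group has sensitive-attribute entropy > ln(l)."""
    key = lambda r: tuple(r[a] for a in qi_attrs)
    for _, group in groupby(sorted(records, key=key), key=key):
        if entropy([r[sensitive] for r in group]) <= math.log(l):
            return False
    return True

# Illustrative records mirroring the schema of Fig. 1
records = [
    {"Age": "[30-40]", "Job": "Clerk", "Country": "UK", "Disease": "diabetes"},
    {"Age": "[30-40]", "Job": "Clerk", "Country": "UK", "Disease": "hypertension"},
]
print(satisfies_entropy_l_diversity(records, ["Age", "Job", "Country"], "Disease", 2))
```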
When a user tries to guess the sensitive value of an individual from anonymized data, the diversity quantifies the average number of possible sensitive values. Fig. 3, also derived from Fig. 1, shows an anonymized data set satisfying more than (entropy) 2-diversity. From this data set, the user cannot infer the sensitive value of any individual with more than 50% confidence. However, the Q-I values have been generalized further, so the non-sensitive knowledge is even harder to obtain. Moreover, the diversity of the sensitive values in each Q-I group is limited by the distribution of sensitive data in the whole data set.

To keep the data details, the permutation method permutes the sensitive values within each Q-I group. As shown in Fig. 4, the data set in Fig. 1 is permuted so that each group satisfies at least (entropy) 2-diversity. The permuted data set is actually divided into two views: the view of the quasi-identifier and the view of the sensitive attribute. During data mining, the user joins these two views for each record group. Although the data details are preserved, the non-sensitive associations can hardly be discovered from Fig. 4, because their confidences decrease dramatically.

2.2. Measuring data privacy

Most k-anonymization methods (Aggarwal, 2005; Kifer & Gehrke, 2006; LeFevre et al., 2005, 2006; Meyerson & Williams, 2004; Sweeney, 2002) emphasize the protection of the quasi-identification information. The minimal size of the Q-I groups is used to quantify the privacy level of the anonymized data. The larger the minimal size, the more difficult it is to identify an individual from a value combination of Q-I. Similarly, it is difficult to identify an individual from a value of the sensitive data alone. However, if the user can establish a strong association between a value combination of Q-I and a value of the sensitive data, then privacy can be compromised. Thus we propose that privacy be measured by the association between the value combinations of Q-I and the values of the sensitive data. A probabilistic measurement of privacy, or anonymity, is given in Section 4. Also, the anti-monotone property of k-anonymity shows that if the quasi-identifier is k-anonymous, any subset of it is also k-anonymous. The value combinations of a subset do not disclose more privacy than those of the quasi-identifier. Thus, unlike the k-anonymization methods, we do not group the value combinations of Q-I. How do we prevent the user from obtaining these important associations? We present our anonymization algorithm in the next section.

3. Anonymization algorithm

To break the associations between the value combinations of Q-I and the values of sensitive data, it is not necessary to anonymize all the attributes in Q-I. Our main idea is to randomly replace part of the Q-I data in each record by using the distributions of the original values of the Q-I attributes. In this way, no new information is added to the anonymized data, the associations between Q-I and sensitive data are broken, and the original data distribution is preserved. Consequently, the associations in individual records are broken, whereas the statistics of the associations over the whole data set are preserved. We present the anonymization process in Algorithm 1.
Algorithm 1. The RA Algorithm

1: Input: the original data set D, the Q-I attributes Q, and the probability distribution {p_1, …, p_m}, where m = |Q| is the number of attributes in Q-I
2: Output: the original data set D, overwritten by an anonymized one
3: begin
4:   n := |D|;
5:   Dist := ∅;
6:   for i := 1 to m do
7:   begin
8:     Dist_i := the distribution of the values of Q_i;
9:   end
10:  for j := 1 to n do
11:  begin
12:    Randomly select an attribute Q_k in the Q-I of the jth record with probability p_k;
13:    Randomly generate a new value for Q_k based on Dist_k;
14:    Replace the value of Q_k with the new value;
15:  end
16: end

For example, in the data set in Fig. 1, the Q-I attributes are Q = {Q_1, Q_2, Q_3} = {Age, Job, Country}. The values of Q_i and their corresponding distributions are:

Values of Q_1: [20–30], [30–40], [40–50], [50–60], [60–70]
Dist_1:        1/10,    4/10,    1/10,    3/10,    1/10

Values of Q_2: Doctor, Clerk, Trader, Engineer, Banker
Dist_2:        1/10,   4/10,  2/10,   1/10,     2/10

Values of Q_3: USA,  Germany, UK,   India
Dist_3:        6/10, 1/10,    2/10, 1/10

Fig. 3. A 2-diversity data set derived from Fig. 1.
Fig. 4. An anonymous data set by permuting Fig. 1.

Setting p_1 = p_2 = p_3 = 1/3, for every record in Fig. 1 we select one of the three Q-I attributes Q_k with equal probability. Then we randomly generate a value for the selected Q_k based on Dist_k and replace the original value of Q_k with the new value. Fig. 5 shows an anonymized data set produced by Algorithm 1.

Fig. 5. An anonymized data set produced by Algorithm 1 from Fig. 1.

How do we choose the p_i in this algorithm? How does the algorithm protect privacy while preserving non-sensitive knowledge? We address these issues in the following two sections.
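To make Algorithm 1 concrete, the following minimal Python sketch (ours, not the authors' implementation; the column names follow the Fig. 1 example) estimates each attribute's empirical distribution and, per record, redraws one attribute chosen with probability p_k.

```python
import random

def ra_anonymize(records, qi_attrs, probs, rng=random.Random(0)):
    """Algorithm 1 (RA): for each record, pick one Q-I attribute Q_k with
    probability p_k and replace its value with a draw from Q_k's empirical
    distribution in the original data."""
    # Empirical distribution of each Q-I attribute (Dist_i in the paper),
    # kept as the raw value list: uniform sampling from it equals Dist_i.
    dists = {a: [r[a] for r in records] for a in qi_attrs}
    for r in records:
        k = rng.choices(range(len(qi_attrs)), weights=probs)[0]
        attr = qi_attrs[k]
        r[attr] = rng.choice(dists[attr])
    return records

# Usage with equal selection probabilities p_i = 1/3, as in the Fig. 1 example:
# ra_anonymize(records, ["Age", "Job", "Country"], [1/3, 1/3, 1/3])
```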
4. Privacy analysis

Since our RA algorithm, unlike the k-anonymity and l-diversity methods, anonymizes a data set by randomly breaking the associations between its Q-I and sensitive data, we first define a statistical measurement of anonymity in Section 4.1 and use it to analyze the anonymity of our method. Moreover, since we randomize the values of the selected attributes, for comparison we also use the measurement of Evfimievski, Gehrke, and Srikant (2003) to analyze the "privacy breaches" of our method in Section 4.2.1. In addition, in Section 4.2.2, we further discuss the privacy leakage when the user has some prior knowledge of the data distribution.

4.1. Probabilistic anonymity

For a good anonymization, it should be very unlikely that the user can infer the original associations from the corresponding associations in the anonymized data set. Thus we propose the following definition of anonymity.

Definition 6 (Probabilistic anonymity). Suppose that a data set D is anonymized to D′. Let r be a record in D and r′ ∈ D′ be its anonymized form. Denote by r(QI) the value combination of the quasi-identifier in r. The probabilistic anonymity of the data set D′ is defined as 1/P(r(QI) | r′(QI)), where P(r(QI) | r′(QI)) is the probability that r(QI) (for all r ∈ D) may be inferred given r′(QI).

The probabilistic anonymity measures how unlikely it is that the user can infer the original associations. The greater the probabilistic anonymity, the less probable it is that the user can guess the original data. Now that we have a measurement of anonymity, we show how to determine the p_i in Algorithm 1 so as to maximize the anonymity of the data set produced by the algorithm.

Proposition 7. Let Q_i, i = 1, …, m, be the ith Q-I attribute (a category attribute) in a data set D and Entropy(Q_i) be the entropy of Q_i. Then the probabilistic anonymity of D′, the anonymized form of D, reaches its maximum when each p_i in Algorithm 1 is directly proportional to e^{Entropy(Q_i)}.

Proof 1. Let c_ij be the jth category value of Q_i and Freq(c_ij) be the frequency of c_ij in Q_i. We calculate Pa(D′), the probabilistic anonymity of D′, by

\[
\ln Pa(D') = -\sum_{i=1}^{m}\sum_{j=1}^{J_i} p_i\,Freq(c_{ij})\ln\left(p_i\,Freq(c_{ij})\right)
= \sum_{i=1}^{m}\sum_{j=1}^{J_i} p_i\,Freq(c_{ij})\ln\frac{1}{p_i}
+ \sum_{i=1}^{m}\sum_{j=1}^{J_i} p_i\,Freq(c_{ij})\ln\frac{1}{Freq(c_{ij})},
\]

where J_i is the number of category values of Q_i. Since \(\sum_{j=1}^{J_i} Freq(c_{ij}) = 1\), we can also derive

\[
\ln Pa(D') = \sum_{i=1}^{m} p_i\left(-\ln p_i + Entropy(Q_i)\right). \tag{1}
\]

Next, we use the method of Lagrange multipliers (Chen, Jin, Zhu, & Ouyang, 1983) to find the maximum of Pa(D′). We incorporate the constraint \(\sum_{i=1}^{m} p_i = 1\) into (1):

\[
F(p_1,\ldots,p_m,\mu) = \sum_{i=1}^{m}\left[p_i\left(-\ln p_i + Entropy(Q_i)\right)\right] + \mu\left(\sum_{i=1}^{m} p_i - 1\right),
\]

where μ is an unknown scalar. Setting \(\nabla_{p_1,\ldots,p_m} F(p_1,\ldots,p_m,\mu) = 0\), we have

\[
\frac{\partial F}{\partial p_i} = Entropy(Q_i) - \ln p_i - 1 + \mu = 0, \quad i = 1,\ldots,m.
\]

Thus, for all i, j ∈ [1, m],

\[
\ln\frac{p_i}{p_j} = Entropy(Q_i) - Entropy(Q_j),
\quad\text{implying}\quad
\frac{p_i}{p_j} = \frac{e^{Entropy(Q_i)}}{e^{Entropy(Q_j)}}. \tag{2}
\]

From (2), Pa(D′) reaches its maximum when

\[
p_i = \frac{e^{Entropy(Q_i)}}{\sum_{j=1}^{m} e^{Entropy(Q_j)}}. \qquad\square
\]

The above proposition also gives a general method for calculating the probabilistic anonymity. When p_i = 1/m, i = 1, …, m, we have a quick estimate of the scaled Pa(D′) by taking the geometric mean of the diversities of all Q-I attributes:

\[
\ln Pa(D') = \ln m + \sum_{i=1}^{m}\frac{1}{m}Entropy(Q_i)
= \ln\left(m\left(\prod_{i=1}^{m} Diversity_i\right)^{1/m}\right), \tag{3}
\]

where Diversity_i = e^{Entropy(Q_i)} is the entropy diversity of Q_i (Machanavajjhala et al., 2006).

For an arbitrary record, the probability that the user may guess its original Q-I values, given an anonymized data set, is 1/Pa(D′). This is also the user's confidence in associating a sensitive value with an individual. From (3), Pa(D′) is usually greater than the geometric mean of the diversities of all Q-I attributes. Similarly, Pa(D′) is usually greater than the diversity of the sensitive attribute. Even when the user is sure that an individual is in the data set, the user's maximal confidence in inferring the corresponding sensitive value is 1/Diversity_s, where Diversity_s is the diversity of the sensitive attribute. In comparison, in the l-diversity method, the user's confidence in guessing the data privacy is 1/l, which is usually greater than 1/Diversity_s. Therefore, by disassociating the relationships between the Q-I and sensitive values, the RA method provides better privacy protection than the l-diversity method does.
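The optimal selection probabilities and the resulting anonymity level are easy to compute. The sketch below (our illustration, not the authors' code) derives the p_i of Proposition 7 from the attribute entropies and evaluates ln Pa(D′) via Eq. (1); the toy columns mirror the Fig. 1 distributions.

```python
import math
from collections import Counter

def attr_entropy(column):
    """Shannon entropy (natural log) of one Q-I attribute's values."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def optimal_probs(columns):
    """Proposition 7: p_i proportional to exp(Entropy(Q_i))."""
    divs = [math.exp(attr_entropy(col)) for col in columns]  # entropy diversities
    total = sum(divs)
    return [d / total for d in divs]

def ln_pa(columns, probs):
    """Eq. (1): ln Pa(D') = sum_i p_i * (-ln p_i + Entropy(Q_i))."""
    return sum(p * (-math.log(p) + attr_entropy(col))
               for p, col in zip(probs, columns))

# Toy columns reproducing the Age/Job/Country distributions of Fig. 1
age = ["[20-30]"] + ["[30-40]"] * 4 + ["[40-50]"] + ["[50-60]"] * 3 + ["[60-70]"]
job = ["Doctor"] + ["Clerk"] * 4 + ["Trader"] * 2 + ["Engineer"] + ["Banker"] * 2
country = ["USA"] * 6 + ["Germany"] + ["UK"] * 2 + ["India"]
p = optimal_probs([age, job, country])
print(p, math.exp(ln_pa([age, job, country], p)))  # Pa(D') comes out around 11 (cf. Section 6)
```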
4.2. Privacy breaches

The measurement of probabilistic anonymity evaluates the general level of privacy protection. Although privacy is protected on average, certain privacy breaches (Evfimievski et al., 2003) may still take place.

4.2.1. Common breaches

Suppose q ∈ D is one of the original value combinations of the quasi-identifier and q′ ∈ D′ is its anonymized form. A privacy breach can be represented by:

there exist q ∈ D and q′ ∈ D′ such that P(q ⊆ t | q′ ⊆ t′) ≫ P(o ⊆ t | q′ ⊆ t′) for all o ≠ q,

where t ∈ D is the original record and t′ ∈ D′ is its anonymized form. By this definition, if the user is able to recover the original q with high confidence once q′ appears in the anonymized data, then there is a privacy breach. Privacy breaches are quantified by the γ-amplification (Evfimievski et al., 2003):

Definition 8 (γ-Amplification). Suppose a data set D is anonymized to D′. For some value v′ ∈ D′, if for all v_1, v_2 ∈ D, P(v_1 → v′)/P(v_2 → v′) ≤ γ, then the anonymization is at most γ-amplifying for v′. The anonymization is at most γ-amplifying if it is at most γ-amplifying for all v′ ∈ D′.

By the definition, once the anonymized value v′ is revealed, if all candidate values are equally possible a priori, i.e., they share a common probability, then the probability that the user may recover the original value is at most γ times that common probability. Thus, the closer γ is to 1, the more difficult it is for the user to infer the original data. Note that γ ≥ 1.

In the RA algorithm, each record is anonymized without referring to its original Q-I data. It would seem that γ = 1. However, the original data distribution is used to generate the new values of the selected attributes. How does this affect γ? In the following, we analyze the privacy breaches of our method.

Suppose a data set D is anonymized to D′ by Algorithm 1. Let q′ = {q′_1, q′_2, …, q′_m} be a value combination of the quasi-identifier in D′, where q′_i is the value of the ith Q-I attribute Q_i. We calculate the probability P(q → q′) for each possible value combination q ∈ D:

\[
\begin{cases}
P(q = \{*, q'_2, q'_3, \ldots, q'_m\} \to q') = p_1\,P(Q_1 = q'_1) \\
P(q = \{q'_1, *, q'_3, \ldots, q'_m\} \to q') = p_2\,P(Q_2 = q'_2) \\
\quad\vdots \\
P(q = \{q'_1, \ldots, q'_{m-1}, *\} \to q') = p_m\,P(Q_m = q'_m)
\end{cases}
\]

If there exists a k such that p_k P(Q_k = q′_k) ≫ p_i P(Q_i = q′_i) for all i = 1, …, m, then there is a privacy breach by Definition 8. Although the user can infer with high confidence that q′ was generated by anonymizing the kth attribute, he is unable to find the original value of the kth attribute, since each value is anonymized independently of the original one. Thus the user can hardly leverage such breaches to identify the individuals or their sensitive data.

However, things are different when the user has some prior knowledge of the distributions of the Q-I data, especially the supports of the value combinations of the quasi-identifier. This is probable when the data set D contains all the individuals in some public data. In the next subsection, we analyze the data privacy that our method preserves in this situation.

4.2.2. Privacy breaches by prior knowledge

Let t′ ∈ D′ be an anonymized record and q′ ⊆ t′ be its Q-I value combination. Then t′ may have been anonymized from one of the following two types of records in D:

• records which contain q′ both before and after the anonymization;
• records whose Q-I values intersect q′ in m − 1 values and are anonymized to q′.

Definition 9 (Support). The support of a value combination q in D, denoted supp_D(q), is the percentage of the records in D which contain all the attribute values of q. The support of a record t, denoted supp_D(t), is the support of its value combination.

A support gives the frequency of occurrence of a value combination in D.
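Supports are straightforward to estimate from the table itself. The following small helper (ours, for illustration) computes supp_D(q) for a partial value combination q given as attribute–value pairs.

```python
def support(records, q):
    """supp_D(q): fraction of records containing every attribute value in q.

    `q` is a dict of attribute -> value, e.g. {"Job": "Clerk", "Country": "USA"}.
    """
    hits = sum(all(r.get(a) == v for a, v in q.items()) for r in records)
    return hits / len(records)
```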
Assuming p_i = 1/m, i = 1, …, m, we have the expectation of the support for the first type of records in D′:

\[
E\left(supp_{D'}(\{t' \mid q' \subseteq t \ \&\ q' \subseteq t'\})\right)
= supp_D(q')\,P(q'\text{ is anonymized to itself})
= supp_D(q')\,\frac{1}{m}\sum_{i=1}^{m} P(Q_i = q'_i). \tag{4}
\]

For the second type, |q′ ∩ q| = m − 1, and again assuming p_i = 1/m, i = 1, …, m, the expectation of the support is:

\[
E\left(supp_{D'}(\{t' \mid q' \subseteq t' \ \&\ q \subseteq t\})\right)
= \sum_{q}\left(supp_D(q)\,P(q\text{ is anonymized to }q')\right)
= \sum_{q}\left(supp_D(q)\,\frac{1}{m}\,P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)\right), \tag{5}
\]

where w(q′, q) is the index of the Q-I attribute in which q′ and q differ. This time, with the prior knowledge, a breach allowing the user to guess the original value of the anonymized attribute may occur. Combining (4) and (5), we can derive the γ-amplification of the breach:

\[
\gamma_{q'} = \max_{q}\left(
\frac{supp_D(q')\sum_{i=1}^{m} P(Q_i = q'_i)}
{supp_D(q)\,P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)}
\right).
\]

If γ_{q′} < 1, we set γ_{q′} = 1/γ_{q′}. The value of γ_{q′} varies with the distributions of different data sets. It is desirable to have a constant measurement. Thus, we derive an upper bound for the maximal probability that the user may infer the original q from q′. Let maxSupp be the larger of

\[
\max_{|q \cap q'| = m-1}\left(supp_D(q)\,P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)\right)
\quad\text{and}\quad
supp_D(q')\sum_{i=1}^{m} P(Q_i = q'_i);
\]

then the maximal probability is

\[
\frac{maxSupp}
{\sum_{q} supp_D(q)\,P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right) + supp_D(q')\sum_{i=1}^{m} P(Q_i = q'_i)}. \tag{6}
\]

The following proposition gives an upper bound for this maximal probability.

Proposition 10. Suppose p_i = 1/m for i = 1, …, m. Then the maximal probability that the user may guess the original data is bounded above by

\[
\max\left(
\frac{supp_D(q')}{\min_i\left(supp_D(q' \setminus q'_i)\right)},\;
\frac{P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)}{\sum_{i=1}^{m} P(Q_i = q'_i)}
\right).
\]

Proof 2. We consider two cases based on maxSupp in (6). In the first case, when maxSupp = supp_D(q′) Σ_{i=1}^{m} P(Q_i = q′_i), the probability that the user may guess the data is

\[
P(q \in D \mid q' \in D')
= \frac{supp_D(q')\sum_{i=1}^{m} P(Q_i = q'_i)}
{\sum_{i=1}^{m} P(Q_i = q'_i)\left(supp_D(q') + \sum_{w(q',q)=i} supp_D(q)\right)}
= \frac{supp_D(q')\sum_{i=1}^{m} P(Q_i = q'_i)}
{\sum_{i=1}^{m} P(Q_i = q'_i)\,supp_D(q' \setminus q'_i)}
\le \frac{supp_D(q')}{\min_i\left(supp_D(q' \setminus q'_i)\right)}, \tag{7}
\]

where q′ \ q′_i is the value combination obtained by deleting q′_i from q′. If q′ = {q′_i}, then supp_D(q′ \ q′_i) = 1.

In the second case, maxSupp = supp_D(q) P(Q_{w(q′,q)} = q′_{w(q′,q)}) for some q, and we have

\[
P(q \in D \mid q' \in D')
= \frac{supp_D(q)\,P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)}
{\sum_{i=1}^{m} P(Q_i = q'_i)\,supp_D(q' \setminus q'_i)}
\le \frac{P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)}{\sum_{i=1}^{m} P(Q_i = q'_i)}. \tag{8}
\]

Combining (7) and (8) completes the proof. □

In summary, without any prior knowledge, the user is unlikely to associate sensitive data with individuals from the anonymized data set. Even with knowledge of the distribution of the quasi-identifier, the user's ability is limited by the upper bound given in Proposition 10.
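To see what the bound of Proposition 10 looks like numerically, the sketch below (our reading of the bound, not the authors' code; it reuses the hypothetical `support` helper above) evaluates both terms for a candidate anonymized combination q′ under p_i = 1/m, taking the worst case over attributes for the second term.

```python
def prop10_bound(records, q_prime, marginals):
    """Upper bound of Proposition 10 for a Q-I combination q_prime.

    `q_prime`  : dict attribute -> value (the anonymized combination q').
    `marginals`: dict attribute -> P(Q_i = q'_i) in the original data.
    Assumes p_i = 1/m and that q' actually occurs in D (so the
    sub-supports supp_D(q' \\ q'_i) are nonzero).
    """
    attrs = list(q_prime)
    # supp_D(q' \ q'_i) for each i; defined as 1 when q' is a single value
    if len(attrs) > 1:
        sub_supports = [
            support(records, {a: v for a, v in q_prime.items() if a != i})
            for i in attrs
        ]
    else:
        sub_supports = [1.0]
    term1 = support(records, q_prime) / min(sub_supports)
    term2 = max(marginals.values()) / sum(marginals.values())  # worst-case w(q', q)
    return max(term1, term2)
```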
5. Quasi-sensitive knowledge preservation

Now that we have discussed the security of our anonymization method, we turn in this section to the issue of knowledge preservation, which is important in data mining. Many k-anonymization methods (LeFevre et al., 2005, 2006; Meyerson & Williams, 2004) try to minimize the loss of data details while maximizing the level of privacy protection. Some recent algorithms (Fung, Wang, & Yu, 2007; Kifer & Gehrke, 2006) also try to retain the data utility for the purpose of data analysis and data mining. But few consider the issue of non-sensitive knowledge preservation. In Section 5.1, we first explain why we preserve the non-sensitive knowledge and define its general form under the term "quasi-sensitive knowledge". Then, in Section 5.2, we discuss the accuracy of knowledge recovery.

5.1. Quasi-sensitive knowledge

In most applications of data analysis and data mining, we are interested in finding the relationships among attributes, especially the associations between the Q-I and sensitive values. We call these associations the quasi-sensitive associations.

Definition 11 (Quasi-sensitive associations). Let Q be the set of Q-I attributes in a data set D and S be the set of sensitive attributes. Suppose attribute sets Q̂ ⊆ Q and Ŝ ⊆ S. For any value q̂ of Q̂ and any value ŝ of Ŝ, we call the association q̂ → ŝ a quasi-sensitive association in D. All the quasi-sensitive associations make up the quasi-sensitive knowledge of D.

For example, the data set in Fig. 1 contains the two quasi-sensitive associations

• Clerk → Hypertension (confidence: 75%)
• [30–40], Clerk → Hypertension (confidence: 67%)

which are helpful in analyzing the causes of hypertension.

We denote by |q̂| the number of values in q̂ and by |Q| the number of attributes in Q, and call |q̂| the length of the association q̂ → ŝ. Normally, the closer |q̂| is to |Q|, the more sensitive the related association is. In particular, when |q̂| = |Q| the association is most sensitive, because the antecedent of the association rule can identify a specific individual. To preserve the less sensitive associations, we try to make the discovery of the "shorter" associations more accurate than that of the "longer" ones. In the following section, we show how to achieve this goal.

5.2. Accuracy evaluation of knowledge recovery

We define the relative difference

\[
R(q) = \frac{supp_{D'}(q) - supp_D(q)}{supp_D(q)} \times 100\%.
\]

The frequently used measurement, the relative bias, is B(q) = |E(R(q))|. It measures the difference between the support in D′ and the support in D. Thus, from the definition of support, the smaller B(q) is, the more accurate the knowledge discovery can be. In this section, we first derive the general form of the relative bias in our method. Then we analyze how the quasi-sensitive associations are treated with respect to their length.

Recall that Algorithm 1 randomly selects a Q-I attribute for anonymization. Intuitively, the more Q-I attributes q involves, the more likely it is that q will be changed, and consequently the more likely it is that the difference R(q) will be large. The following proposition shows the relation between the relative bias B(q) and the length of q. First we introduce some notation. Again, let q be a value combination of the original Q-I data. We denote by ¬q the set of value combinations {q̂ : |q̂| = |q|, QI(q̂) = QI(q), q̂ ≠ q}. For q = {q_1, …, q_l}, l ≤ m, we denote by q \ q_i, i = 1, …, l, the value combination {q_1, …, q_{i−1}, q_{i+1}, …, q_l}, and by q ∪ q_x the value combination {q_1, …, q_l, q_x}. When |q| = 1, we define supp_D(q \ q) = 1. We also use

\[
lift(q_i,\, q \setminus q_i) = \frac{supp(q)}{supp(q \setminus q_i)\,P(Q_i = q_i)}
\]

to measure the strength of the relationship between q_i and q \ q_i. A lift value greater than 1 indicates a positive association between q_i and q \ q_i, whereas a value less than 1 indicates a negative association (Giudici, 2003).
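As a quick illustration of these quantities, the sketch below (ours; it reuses the hypothetical `support` helper from Section 4) computes lift(q_i, q \ q_i) and an empirical R(q) by comparing supports before and after anonymization.

```python
def lift(records, q, attr):
    """lift(q_i, q \\ q_i) = supp(q) / (supp(q \\ q_i) * P(Q_i = q_i))."""
    rest = {a: v for a, v in q.items() if a != attr}
    supp_rest = support(records, rest) if rest else 1.0  # supp_D(q \ q) = 1 when |q| = 1
    p_attr = support(records, {attr: q[attr]})           # marginal P(Q_i = q_i)
    return support(records, q) / (supp_rest * p_attr)

def relative_difference(original, anonymized, q):
    """R(q) as a percentage, per Section 5.2."""
    s, s_prime = support(original, q), support(anonymized, q)
    return (s_prime - s) / s * 100.0
```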
Proposition 12. For a value combination q in D, the relative bias B(q) is positively proportional to

\[
\sum_{i}\left[\frac{1}{lift(q_i,\, q \setminus q_i)} - 1\right].
\]

Proof 3. From the definition of R(q), we have

\[
E(R(q)) = E\left(\frac{supp_{D'}(q) - supp_D(q)}{supp_D(q)}\right)
= \frac{supp_D(q)P(q \to q) + supp_D(\lnot q)P(\lnot q \to q) - supp_D(q)}{supp_D(q)}
= \frac{supp_D(\lnot q)}{supp_D(q)}P(\lnot q \to q) - P(q \to \lnot q). \tag{9}
\]

Then we get

\[
B(q) = |E(R(q))|
= \frac{1}{m}\left|\sum_{i=1}^{|q|}\left[\frac{supp_D(q \setminus q_i) - supp_D(q)}{supp_D(q)}P(Q_i = q_i) - P(Q_i \ne q_i)\right]\right|
= \frac{1}{m}\left|\sum_{i=1}^{|q|}\left[\frac{1}{lift(q_i,\, q \setminus q_i)} - 1\right]\right|, \tag{10}
\]

which completes the proof. □

The proposition shows that the more attributes q has, the larger the range of the relative bias B(q). In particular, when lift(q_i, q \ q_i) ≥ 0.5, we have B(q) ≤ |q|/m, implying that the range of the expected relative difference between supp_{D′} and supp_D increases linearly with the length |q|.

As for the variance of the relative difference, we have

\[
Var(R(q)) = \frac{1}{m^2}\sum_{i=1}^{|q|}\left[\frac{supp_D(q \setminus q_i)^2}{supp_D(q)^2}\left(P(Q_i = q_i) - P(Q_i = q_i)^2\right)\right]. \tag{11}
\]

Let the value combination q̃ = q ∪ q_x, q_x ∉ q. If supp_D(q \ q_i)/supp_D(q) varies little from supp_D(q̃ \ q_i)/supp_D(q̃) for i = 1, …, |q|, then Var(R(q̃)) ≥ Var(R(q)). That means Var(R(q)) grows almost linearly with |q|. In other words, the relative difference between supp_{D′}(q) and supp_D(q) becomes more uncertain as the value combination gets longer. In particular, in the case of a single-attribute Q-I, where m = 1, (10) yields zero bias:

\[
B(q) = \left|\frac{1}{lift(q,\, q \setminus q)} - 1\right| = 0.
\]

In summary, the shorter the value combination, the better the quasi-sensitive knowledge is preserved and the more accurate the knowledge recovery is. Also, since short associations are usually less sensitive than long ones, the user is more likely to discover the less sensitive associations than the sensitive ones.

6. Enhancement of privacy protection

Although the RA algorithm prevents the user from associating individuals with their sensitive data, prior knowledge of the associations within the Q-I data may worsen the privacy leakage. In Fig. 5, the probabilistic anonymity Pa(D′) of the anonymized data set is about 11 by Proposition 7. Suppose the user knows in advance the association "Clerk, USA → [30–40]" (confidence 50%). Once the value combination {[20–30], Clerk, USA} occurs in the anonymized data, the user is able to guess with confidence a that the original Q-I value combination is {[30–40], Clerk, USA}, where

\[
a = P(\text{Age is the anonymized attribute}) \times 50\% \approx 17\% > \frac{1}{Pa(D')}.
\]

Thus, the user's inference about the data privacy is improved.

To tackle this problem, we randomly anonymize more than one Q-I attribute. This improves privacy protection; however, knowledge recovery may become less accurate. In this section, we generalize Algorithm 1 by uniformly choosing k (k ≥ 1) Q-I attributes for each record and replacing their values by using their original distributions. We first present the algorithm, then analyze its privacy protection and knowledge preservation.

Algorithm 2. The Enhanced RA Algorithm

1: Input: the original data set D, the Q-I attributes Q, and the number k (k ≤ m) of attributes to be anonymized
2: Output: the original data set D, overwritten by an anonymized one
3: begin
4:   m := |Q|;
5:   n := |D|;
6:   Dist := ∅;
7:   for i := 1 to m do
8:   begin
9:     Dist_i := the distribution of the values of Q_i;
10:  end
11:  for j := 1 to n do
12:  begin
13:    Randomly select k attributes with equal probability in the Q-I of the jth record;
14:    for t := 1 to k do
15:    begin
16:      attr := the tth selected Q-I attribute;
17:      Randomly generate a new value for attr based on Dist_attr;
18:      Replace the value of attr with the new value;
19:    end
20:  end
21: end

Apparently, Algorithm 2 enhances privacy protection as k gets larger.
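A minimal Python sketch of Algorithm 2 (ours, not the authors' code) differs from the earlier RA sketch only in drawing k distinct attributes per record.

```python
import random

def enhanced_ra_anonymize(records, qi_attrs, k, rng=random.Random(0)):
    """Algorithm 2 (enhanced RA): per record, uniformly pick k distinct Q-I
    attributes and redraw each from its original empirical distribution."""
    dists = {a: [r[a] for r in records] for a in qi_attrs}
    for r in records:
        for attr in rng.sample(qi_attrs, k):  # k attributes, equal probability
            r[attr] = rng.choice(dists[attr])
    return records
```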
What is the impact of k on knowledge preservation? In the following, we show that both the bias B(q) and the variance Var(R(q)) increase as k gets larger.

Since we choose the Q-I attributes uniformly at random, the probability that any particular combination of k Q-I attributes is selected is \(1\big/\binom{m}{k}\). In the algorithm, a value combination cannot be anonymized into every other form. If two value combinations q and q′ differ in g values (g ≤ k), then the two value combinations are convertible into each other, since all the g values may be anonymized. Before discussing B(q) and Var(R(q)), we investigate in the following proposition the impact of k on the probability that a value combination is anonymized to a convertible value combination. We show that both probabilities P(q → q′) and P(q′ → q) increase as k increases.

Proposition 13. Suppose k is the number of anonymized attributes in each record. Let q ∈ D and q′ ∈ D′ be partial value combinations of Q-I with |q| = |q′| ≤ m that are convertible into each other. Then the probabilities P(q → q′) and P(q′ → q) increase directly with k.

Proof 4. Suppose that q and q′ differ in g attributes, g ≤ k. Let AttrSets^{g+κ} be the set of all possible combinations of g + κ Q-I attributes, where each combination AttrSets_i^{g+κ}, i = 1, …, \(\binom{|q|-g}{\kappa}\), contains those g attributes and κ (κ ≤ |q| − g) other arbitrary Q-I attributes of q. Let P(AttrSets_i^{g+κ}, q′) be the probability that the values of the g + κ attributes in AttrSets_i^{g+κ} are anonymized to the corresponding values of q′. Then the probability that q is anonymized to q′ is

\[
P(q \to q') = \sum_{\kappa=0}^{\min(k-g,\,|q|-g)}
\left[\frac{\binom{m-|q|}{k-g-\kappa}}{\binom{m}{k}}
\sum_{i=1}^{|AttrSets^{g+\kappa}|} P\!\left(AttrSets_i^{g+\kappa}, q'\right)\right]
= \sum_{\kappa=0}^{\min(k-g,\,|q|-g)} Pr_{k,g}^{\kappa},
\]

where, for simplicity, \(Pr_{k,g}^{\kappa}\) denotes \(\frac{\binom{m-|q|}{k-g-\kappa}}{\binom{m}{k}}\sum_{i} P(AttrSets_i^{g+\kappa}, q')\). As k is increased by 1,

\[
\frac{Pr_{k+1,g}^{\kappa}}{Pr_{k,g}^{\kappa}}
= \frac{\binom{m-|q|}{k+1-g-\kappa}\Big/\binom{m}{k+1}}{\binom{m-|q|}{k-g-\kappa}\Big/\binom{m}{k}}
= \frac{(k+1)(m-k+1)}{(k+1-g-\kappa)(m-k+1-|q|+g+\kappa)}.
\]

It then follows that

\[
1 \le \frac{Pr_{k+1,g}^{\kappa}}{Pr_{k,g}^{\kappa}} \le k+1,
\]

since g + κ ≤ min(k, |q|). This shows that the probability P(q → q′) increases directly with k. The proof for P(q′ → q) is similar. □

Now let us look at B(q′) and Var(R(q′)). From (9), we get

\[
B(q') = \left|\sum_{q \in \lnot q'}\left[\frac{supp_D(q)}{supp_D(q')}P(q \to q')\right] - P(q' \to \lnot q')\right|
\]

and

\[
Var(R(q')) = \sum_{q \in \lnot q'}\left[\left(\frac{supp_D(q)}{supp_D(q')}\right)^2
\left(P(q \to q') - P(q \to q')^2\right)\right]
+ P(q' \to \lnot q') - P(q' \to \lnot q')^2.
\]

As k increases, there are more value combinations q in D convertible to q′. Also, from Proposition 13, both P(q → q′) and P(q′ → q) vary directly with k. Therefore, the range of B(q′) increases with k. As for the variance, P(q → q′) and P(q′ → ¬q′) are usually less than 1/2, so P(q → q′) − P(q → q′)² and P(q′ → ¬q′) − P(q′ → ¬q′)² increase with k. Thus, the variance also varies directly with k.

In summary, by tuning the value of k, we can adjust the level of the average bias and the variance. Thus we need to find a compromise between the protection of data privacy and the preservation of useful knowledge, as demonstrated in our experiments.
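The trade-off described in this section can be observed directly by simulation. The sketch below (our illustration, reusing the hypothetical `enhanced_ra_anonymize` and `support` helpers) estimates the average |R(q)| of one value combination for several values of k; larger k should yield larger distortion.

```python
import copy
import random

def distortion_vs_k(records, qi_attrs, q, ks, trials=200):
    """Monte Carlo estimate of the average |R(q)| for each k in `ks`.

    Assumes q occurs in `records` (so its base support is nonzero)."""
    base = support(records, q)
    out = {}
    for k in ks:
        diffs = []
        for t in range(trials):
            anon = enhanced_ra_anonymize(copy.deepcopy(records), qi_attrs, k,
                                         rng=random.Random(t))
            diffs.append(abs(support(anon, q) - base) / base)
        out[k] = sum(diffs) / trials
    return out

# e.g. distortion_vs_k(records, ["Age", "Job", "Country"],
#                      {"Job": "Clerk", "Country": "USA"}, ks=[1, 2, 3])
```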
7. Experiments

In this section, we present experimental results demonstrating the effectiveness of our methods. Using common aggregate queries, we first compare our RA Algorithm 1 with the l-diversity and permutation methods on SQL queries. Then we compare our method with the permutation method on quasi-sensitive knowledge discovery. At the same time, we demonstrate the effects of the parameter k in our enhanced RA Algorithm 2 and of the association length on knowledge discovery.

7.1. Experiment setup

The data set adopted in the experiments is the adult data set from the UCI machine learning repository (Hettich & Bay, 1999). We used the following nine original attributes to compose the quasi-identifier: "education", "race", "sex", "work-class", "marital-status", "age", "relationship", "native-country" and "salary". The attribute "occupation" was retained as the sensitive attribute. The records with missing values were removed, and the resulting data set contained 30,162 records.

7.2. Accuracy of SQL queries

In the first part of the experiments, we generated three anonymized data sets by applying the entropy l-diversity, permutation, and RA (Algorithm 1) methods to the test data set. We implemented the entropy l-diversity method based on the k-anonymization algorithm in LeFevre et al. (2006), and the permutation algorithm as in Xiao and Tao (2006). Then we performed random queries on the three anonymized data sets and compared the average accuracy of their answers. The queries were of forms similar to those in Xiao and Tao (2006):

select count(*) from tablename
where Q_1 in (R_1) and … and Q_m in (R_m) and S in (R_s)

where Q_i is the ith Q-I attribute, S the sensitive attribute, and R_i the query values for the ith attribute.

7.2.1. Record count estimation

When a data set is anonymized by the l-diversity algorithm, we cannot run the queries directly, since the data is generalized. Instead, the record count of each query is obtained by the following estimate (Xiao & Tao, 2006):

\[
\sum_{p} \#_p(R_s)\prod_{i=1}^{m}\frac{\#_p(R_i \cap Values(Q_i))}{\#_p(Values(Q_i))}, \tag{12}
\]

where Values(Q_i) is the set of distinct values of attribute Q_i and the function #_p(x) counts, in the pth Q-I group, the occurrences of the values in the set x. By summing the estimated record counts over all the Q-I groups, we get an estimate for the query.

When a data set is anonymized by the permutation method, the answers to the queries are estimated by

\[
\sum_{p}\left[\#_p(R_s)\,\frac{\#_p(\text{records matching the Q-I conditions})}{\#_p(\text{records in the }p\text{th group})}\right]. \tag{13}
\]

In this case, in the pth Q-I group, the records matching the Q-I conditions of the where clause can be retrieved directly as #_p(records matching the Q-I conditions). The product in the above equation assumes that all the associations between the value combinations of Q-I and the sensitive data are equally likely. We call such associations the uniform associations between the Q-I and sensitive values.

When the data set is anonymized by the RA algorithm, we answer the queries directly, without estimation.

7.2.2. The results

We conducted eight sets of experiments. In the jth set, 1 ≤ j ≤ 8, we performed 1000 queries on each of the three anonymized data sets; each query was composed of the sensitive attribute and j attributes selected uniformly at random from the quasi-identifier. For each selected attribute, we chose a random number of categorical values from its domain to form its query values R_i. In each set of experiments, we calculated the average relative error of the query results. The relative error is defined by

\[
\frac{|Ans(query, D) - Ans(query, D')|}{Ans(query, D)},
\]

where Ans(query, D) returns the answer to the query on data set D.
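For concreteness, here is a small sketch (ours) of the permutation-style estimate (13) and of the relative error, for in-memory records; `qi_cond` and `s_values` are illustrative stand-ins for the where-clause conditions.

```python
def permutation_estimate(groups, qi_cond, s_values, sensitive):
    """Eq. (13): per Q-I group, scale the sensitive-value count by the
    fraction of records matching the Q-I conditions."""
    total = 0.0
    for group in groups:  # each group is a list of records forming one Q-I group
        n_s = sum(r[sensitive] in s_values for r in group)
        n_qi = sum(all(r[a] in vals for a, vals in qi_cond.items()) for r in group)
        total += n_s * n_qi / len(group)
    return total

def relative_error(true_count, est_count):
    """|Ans(query, D) - Ans(query, D')| / Ans(query, D); may exceed 1."""
    return abs(true_count - est_count) / true_count
```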
Since the numerator |Ans(query, D) − Ans(query, D′)| compares aggregated counts between D and D′, this relative error may exceed 1.

The probabilistic anonymity of the test data set reached 34 under our RA anonymization, meaning that the average probability that a user could correctly associate an individual with the corresponding sensitive value was 1/34. The l-diversity and permutation methods limit that probability to 1/l, where l is further limited by the distribution of the sensitive values: in the test data set, the maximal value of l was about 10 (Entropy(occupation) ≈ ln 10). Thus, we set l = 10 for the l-diversity and permutation methods.

As shown in Fig. 6a, as the queries become more sensitive, involving more attributes, the relative error on the permuted data varies less than those on the l-diverse data and the RA-anonymized data. This is because the estimate (12) for the results of the queries on the l-diverse data assumes a uniform value distribution for each Q-I attribute; the more Q-I attributes are involved, the farther their joint distribution is from the multivariate uniform distribution. As for the RA method, from Proposition 12, the frequent and infrequent value combinations usually have higher values of B(q) than the other value combinations. Since the proportions of the frequent and infrequent value combinations decrease as the number of attributes involved increases, the rise of the relative error slows down as the number of attributes grows, as shown in Fig. 6a. For the permuted data set, the data links in the original data set are not completely the uniform associations assumed by (13), which causes the errors in the queries. Since the strength of those associations is not influenced by the number of attributes involved in the queries, the relative error varies the least.

We then decreased the diversity l to 8 and 5 and repeated the comparison. This time, each query consisted of the sensitive attribute and five random Q-I attributes. As shown in Fig. 6b, although the accuracy of the estimates from the l-diverse data and the permuted data improved as l decreased, the estimates from the data set anonymized by our RA method consistently had the least error. This is because the estimation from the RA-anonymized data set assumes no relationships between the attributes; the RA algorithm preserves the data relationships by perturbing only a small portion of the data set.

Fig. 6. (a) Comparison of the query accuracy among the data sets produced by different anonymization methods (l = 10); (b) comparison of the query accuracy with different diversity values.

7.3. Quasi-sensitive knowledge preservation

In this section, we investigate knowledge preservation under several anonymization techniques, the effect of the parameter k in the enhanced RA Algorithm 2 on knowledge discovery, and the impact of the association length on knowledge discovery.

We first ran the Apriori algorithm (Agrawal & Srikant, 1994) on the original data set D to obtain the quasi-sensitive associations Asso_D, which were used as the baseline in the following experiments. Then we compared the baseline with the quasi-sensitive associations Asso_{D′} mined from each anonymized data set D′. In the comparison, we evaluated the accuracy of the knowledge discovery by the "relative percentage"

\[
\frac{|Asso_D \cap Asso_{D'}|}{|Asso_D|} \times 100\%.
\]

In the test data set, the highest frequency of the values of the attribute "occupation" was about 13%. Thus, in the Apriori algorithm, we set the confidence threshold to 15% and the support threshold to 8%.
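Computing the relative percentage is a simple set operation once the rule sets are extracted; a sketch (ours, with rules represented as hypothetical (antecedent, consequent) pairs) follows.

```python
def relative_percentage(asso_d, asso_d_prime):
    """|Asso_D ∩ Asso_D'| / |Asso_D| * 100, per Section 7.3.

    Rules are hashable pairs, e.g. (frozenset({"Clerk"}), "Hypertension")."""
    asso_d = set(asso_d)
    return len(asso_d & set(asso_d_prime)) / len(asso_d) * 100.0
```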
Fig. 7 compares the effect of our RA Algorithm 1 with that of the permutation method with l = 5, 8, 10. (We did not compare with the l-diversity and k-anonymization methods, since most data associations are suppressed there.) More quasi-sensitive associations were recovered from the RA-anonymized data than from the permuted data. Although the previous experiments on aggregate queries showed that the accuracy of the answers from the permuted data varied the least, in the current experiment only a small part of the associations was recovered from it. This is caused by the assumption of uniform associations between the Q-I and sensitive values. In the previous experiments, most queries consisted of independent attribute values. For example, in the first set of the previous experiments, the queries consisted of the sensitive attribute and one random Q-I attribute. We ran the chi-square test on each contingency table consisting of a pair of attributes from the queries; most values of χ²/|D| were less than 0.2. Thus, the relative errors of the queries consisting of associated attribute values contributed little to the average error. Association discovery, however, finds exactly the highly associated attribute values: the more interesting the associations are, the farther they are from the uniform relationship. In our RA method, the data randomization has a relatively small impact on the data relationships, and it was therefore capable of preserving more associations.

Fig. 7. Comparison of the quasi-sensitive associations among different anonymization methods.

To test the influence of k on the discovery of the quasi-sensitive knowledge, we ran the enhanced RA Algorithm 2 with various numbers of anonymized Q-I attributes, setting the parameter k to 1, 3, 5, 7, and 9. As shown in Fig. 8a, as k increases, more artificial associations are generated and fewer original associations are discovered. This confirms our conclusions in Section 6.

Since a quasi-sensitive association may pose a greater threat to the data privacy when it involves more attribute values, we further compared the discovery of the quasi-sensitive associations of various lengths. In this comparison, we set k = 5. Fig. 8b shows that the shorter associations are recovered more accurately than the longer ones. Most of the associations with 1, 2 or 3 attributes were discovered, even though more than half of the attribute values were anonymized in each record. (Here, length 1 denotes the frequent items in the data set.) This means that the user is more likely to discover the less sensitive associations.

Fig. 8. (a) Discovery of the quasi-sensitive associations with different values of k in the enhanced RA method; (b) discovery of the quasi-sensitive associations with different lengths (k = 5).

The above experiments demonstrate the following characteristics of our methods. The RA Algorithm 1 is capable of protecting data privacy with only a minor impact on the data distribution. The enhanced RA Algorithm 2 can further decrease the risk of privacy leakage when some original data associations are disclosed. Moreover, by tuning the value of k, it can strike a balance between useful knowledge preservation and privacy protection.

8. Conclusions

In this paper, we introduced a novel data anonymization method based on randomizing the data records.
Unlike generalization methods such as k-anonymization, we regard the data privacy as residing in the links between the Q-I and sensitive values. By randomly replacing part of the values in each record while maintaining the statistical relations in the whole data set, the RA method not only achieves a higher level of privacy protection but also preserves more non-sensitive knowledge than the other anonymization methods, as demonstrated by our experiments. Moreover, the useful associations which are less sensitive can be discovered more accurately than the sensitive ones. In particular, when the association length is 1, our method delivers zero bias. Furthermore, the enhanced RA method provides extra privacy protection when the user has some prior knowledge of the Q-I data.

Data anonymization is a popular direction in the research of privacy-preserving data mining. By integrating our method with other privacy-preserving techniques, we may accomplish more tasks. This work serves as an initial step. Future research will include work on other types of data knowledge and on the definition of data privacy.

References

Aggarwal, C. C. (2005). On k-anonymity and the curse of dimensionality. In Proceedings of the 31st international conference on very large data bases (pp. 901–909). Trondheim, Norway.
Agrawal, D., & Aggarwal, C. C. (2001). On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (pp. 247–255). Santa Barbara, California.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the 20th international conference on very large data bases (pp. 487–499). Santiago, Chile.
Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD international conference on management of data (pp. 439–450). Dallas, Texas.
Aggarwal, C. C., & Yu, P. S. (2008). A general survey of privacy-preserving data mining models and algorithms. In Privacy-preserving data mining (pp. 11–52). US: Springer.
Chen, C., Jin, F., Zhu, X., & Ouyang, G. (1983). Mathematical analysis. Higher Education Press.
Ciriani, V., Vimercati, S., Foresti, S., & Samarati, P. (2008). k-Anonymous data mining: A survey. In Privacy-preserving data mining (pp. 105–136). US: Springer.
Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., & Zhu, M. Y. (2002). Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter, 4(2), 28–34.
Du, W., & Zhan, Z. (2003). Using randomized response techniques for privacy-preserving data mining. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 505–510). Washington, DC.
Evfimievski, A., Gehrke, J., & Srikant, R. (2003). Limiting privacy breaches in privacy preserving data mining. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (pp. 211–222). San Diego, California.
Fung, B. C. M., Wang, K., & Yu, P. S. (2007). Anonymizing classification data for privacy preservation. IEEE Transactions on Knowledge and Data Engineering, 19(5), 711–725.
Giudici, P. (2003). Applied data mining: Statistical methods for business and industry. John Wiley & Sons, Ltd.
Hettich, S., & Bay, S. D. (1999). The UCI KDD archive. URL: http://kdd.ics.uci.edu.
Kifer, D., & Gehrke, J. (2006). Injecting utility into anonymized datasets.
In Proceedings of the 2006 ACM SIGMOD international conference on management of data (pp. 217–228). Chicago, Illinois.
Koudas, N., Srivastava, D., Yu, T., & Zhang, Q. (2007). Aggregate query answering on anonymized tables. In Proceedings of the 23rd international conference on data engineering (pp. 116–125). Istanbul, Turkey.
LeFevre, K., DeWitt, D. J., & Ramakrishnan, R. (2005). Incognito: Efficient full-domain k-anonymity. In Proceedings of the 2005 ACM SIGMOD international conference on management of data (pp. 49–60). Baltimore, Maryland.
LeFevre, K., DeWitt, D. J., & Ramakrishnan, R. (2006). Mondrian multidimensional k-anonymity. In Proceedings of the 22nd international conference on data engineering (p. 25). Atlanta, Georgia.
Li, N., Li, T., & Venkatasubramanian, S. (2007). t-Closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of the 23rd international conference on data engineering (pp. 106–115). Istanbul, Turkey.
Lindell, Y., & Pinkas, B. (2000). Privacy preserving data mining. In Proceedings of the 20th annual international cryptology conference on advances in cryptology (pp. 36–54). Santa Barbara, California.
Machanavajjhala, A., Gehrke, J., Kifer, D., & Venkitasubramaniam, M. (2006). l-Diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd international conference on data engineering (p. 24). Los Alamitos, California.
Meyerson, A., & Williams, R. (2004). On the complexity of optimal k-anonymity. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (pp. 223–228). Paris, France.
Sweeney, L. (2002). Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 571–588.
Xiao, X., & Tao, Y. (2006). Anatomy: Simple and effective privacy preservation. In Proceedings of the 32nd international conference on very large data bases (pp. 139–150). Seoul, Korea.