A novel anonymization algorithm: Privacy protection and knowledge preservation

Weijia Yang a,*, Sanzheng Qiao b

a Department of Computer Science, Shanghai Jiao Tong University, Shanghai 200030, China
b Department of Computing and Software, McMaster University, Hamilton, Ont., Canada L8S 4K1
* Corresponding author. E-mail addresses: weijia.yang@yahoo.com.cn (W. Yang), qiao@cas.mcmaster.ca (S. Qiao).

Keywords: Data mining; Privacy protection; Data anonymization; Knowledge preservation

Abstract

In data mining and knowledge discovery, there are two conflicting goals: privacy protection and knowledge preservation. On the one hand, we anonymize data to protect privacy; on the other hand, we allow miners to discover useful knowledge from anonymized data. In this paper, we present an anonymization method which provides both privacy protection and knowledge preservation. Unlike most anonymization methods, where data are generalized or permuted, our method anonymizes data by randomly breaking links among attribute values in records. Through data randomization, our method maintains statistical relations among data to preserve knowledge, whereas in most anonymization methods, knowledge is lost. Thus the data anonymized by our method maintains useful knowledge for statistical study. Furthermore, we propose an enhanced algorithm for extra privacy protection to tackle the situation where the user's prior knowledge of the original data may cause privacy leakage. The privacy levels and the accuracy of knowledge preservation of our method, along with their relations to the parameters in the method, are analyzed. Experimental results demonstrate that our method is effective on both privacy protection and knowledge preservation compared with existing methods.

1. Introduction

Privacy protection is an important issue in data processing. Suppose the data set shown in Fig. 1 is used in a medical research article. Before the article is published, the data must be anonymized to protect privacy. For example, if Michael knows that Hannah is 32 and works as a clerk in the UK, then he can infer that Hannah has diabetes, even if the data set is published without names. A group of attributes, such as {Age, Job, Country} in the above example, which can be used to identify individuals is commonly called a quasi-identifier (Q-I). An attribute like Disease is called a sensitive attribute. In medical research, associations such as "clerk → hypertension" and "[50–60] → diabetes" are often studied. Such associations constitute non-sensitive knowledge, provided that individual identities cannot be inferred from them. Thus, it is important that a data set be protected in such a way that the miner can hardly infer the sensitive data of an individual and, at the same time, can discover non-sensitive knowledge.
In recent years, many methods have been proposed to protect data privacy in data mining (Agrawal & Aggarwal, 2001; Agrawal & Srikant, 2000; Aggarwal & Yu, 2008; Clifton, Kantarcioglu, Vaidya, Lin, & Zhu, 2002; Du & Zhan, 2003; Lindell & Pinkas, 2000). Among these methods, data anonymization (Ciriani, Vimercati, Foresti, & Samarati, 2008; Machanavajjhala, Gehrke, Kifer, & Venkitasubramaniam, 2006; Sweeney, 2002) provides an effective yet simple way of preventing the user from learning sensitive data. In the k-anonymity method (Sweeney, 2002), any individual is indistinguishable from at least k − 1 others in the anonymized data set. Machanavajjhala et al. (2006), however, pointed out that the user may guess the sensitive values with high confidence when the sensitive data lacks diversity, and introduced the l-diversity method. Both k-anonymity and l-diversity have drawbacks. Most of the related works (Kifer & Gehrke, 2006; LeFevre, DeWitt, & Ramakrishnan, 2005, 2006; Li, Li, & Venkatasubramanian, 2007; Meyerson & Williams, 2004) generalize the data values. Some attributes in the quasi-identifier may even be totally suppressed. Both attribute generalization and suppression lead to a loss of data details (Aggarwal, 2005). As a result, useful knowledge such as non-sensitive associations and data distributions can be obtained only in general forms, which may be of little value to the miner.

Recently, the permutation method has been proposed by Xiao and Tao (2006) and Koudas et al. (2007). Instead of generalizing attribute values, it randomly permutes the sensitive values within record groups. In this way, the user can hardly match the sensitive values with the right individuals. However, it is also difficult to discover data knowledge from the permuted data.

In this paper, we propose a novel anonymization method. Instead of generalizing or permuting attribute values as in most anonymization methods, our method randomly breaks the links between the value combinations of the quasi-identifier and the values of the sensitive data, while maintaining the statistical relations between Q-I data and sensitive data, thus preserving non-sensitive knowledge for statistical study. It protects privacy and, at the same time, allows the miner to discover non-sensitive knowledge with high confidence. Furthermore, an enhanced method is proposed to prevent privacy leakage when the user has some prior knowledge of the original data associations.

The paper is organized as follows: in Section 2, we introduce our privacy definition after presenting the preliminary knowledge of data anonymization. Our random anonymization (RA) algorithm is presented in Section 3. We then analyze its level of privacy protection in Section 4. The accuracy of knowledge preservation is discussed in Section 5. Afterwards, we present in Section 6 an enhanced RA algorithm that limits the extra privacy leakage. Finally, we report our experiments in Section 7.
2. Data privacy

Since our method provides a way of anonymizing data, it is necessary to first introduce some preliminary definitions of data anonymization in Section 2.1. Privacy is disclosed when the user is able to correctly link individuals with their sensitive values. In Section 2.2, we introduce our new definition of privacy based on data associations.

2.1. Preliminary

In this section, we adopt several definitions commonly used in the k-anonymization and l-diversity methods from Sweeney (2002) and Machanavajjhala et al. (2006). First, we present the formal definition of a quasi-identifier.

Definition 1 (Quasi-identifier (Q-I)). Given a data set D(A_1, A_2, …, A_m) and an external table D_E, if for every record r_i ∈ D the value combination r_i(A_j, …, A_k), j, k ≤ m, where {A_j, …, A_k} contains no identifiers, can be uniquely located in D_E, we call the set of attributes {A_j, …, A_k} a quasi-identifier.

For example, for the data set in Fig. 1, the external table D_E would consist of Name, Age, Job, and Country, and a quasi-identifier would be {Age, Job, Country}.

The k-anonymization method uses the generalization technique, formally defined by:

Definition 2 (Generalization). Suppose that a domain M consists of disjoint partitions {P_i}, i = 1, …, n, with ∪_i P_i = M. For a given value combination v, generalization is the process of returning the unique partition P_i containing v.

By generalizing the quasi-identifier, each individual in a k-anonymous table is identical to at least k − 1 others with respect to the quasi-identifier:

Definition 3 (k-Anonymity). Given a data set D(A_1, A_2, …, A_m) and its quasi-identifier QI, if for any subset C ⊆ QI and any record r_i ∈ D there exist at least k − 1 other records sharing the same values with r_i on the attribute set C, then the data set D is k-anonymous.

When the data set is k-anonymous, we can group together the records with the same value combinations of the quasi-identifier:

Definition 4 (Q-I group). Given a data set D and its quasi-identifier QI, we define a Q-I group as the set of all records with the same values on QI.

Fig. 2 shows a 2-anonymous data set generalized from Fig. 1, where the data set is partitioned into three groups. As shown in Fig. 2, the values of Age are grouped into wider intervals and the Jobs are clustered. We denote by the symbol "*" a wildcard in attributes. As a result, each record is indistinguishable from at least one other record by the quasi-identifier.

Fig. 1. A sample of original confidential data.
Fig. 2. A 2-anonymous data set by generalizing Fig. 1.

However, k-anonymity has drawbacks. When an individual belongs to the last group in Fig. 2, the user can infer that the individual has hypertension with more than 66% confidence, since two out of the three records in that group have hypertension. Moreover, since quite a few data values have been generalized in the anonymous data set, non-sensitive associations such as "[50–60] → diabetes" cannot be obtained.

While the k-anonymization method only focuses on hiding the Q-I information of the individuals, the l-diversity model (Machanavajjhala et al., 2006) pays attention to the relations between the Q-I and sensitive data:

Definition 5 (l-Diversity). A Q-I group is said to satisfy l-diversity if it contains at least l "well-represented" values of the sensitive attribute. A data set is said to have l-diversity if every Q-I group in it satisfies l-diversity.

In particular, if for each Q-I group the entropy of the sensitive attribute is greater than ln l, then the data set is said to satisfy "entropy l-diversity".
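To make the entropy criterion concrete, here is a minimal Python sketch (ours, not from the paper; the record layout and group key are illustrative) that checks whether every Q-I group of a table satisfies entropy l-diversity.

```python
import math
from collections import Counter
from itertools import groupby

def entropy(values):
    """Shannon entropy (natural log) of a list of sensitive values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def satisfies_entropy_l_diversity(records, qi_attrs, sensitive, l):
    """True if every Q-I group has sensitive-attribute entropy > ln(l)."""
    key = lambda r: tuple(r[a] for a in qi_attrs)
    for _, group in groupby(sorted(records, key=key), key=key):
        if entropy([r[sensitive] for r in group]) <= math.log(l):
            return False
    return True

# Illustrative records mirroring the schema of Fig. 1
records = [
    {"Age": "[30-40]", "Job": "Clerk", "Country": "UK", "Disease": "diabetes"},
    {"Age": "[30-40]", "Job": "Clerk", "Country": "UK", "Disease": "hypertension"},
]
print(satisfies_entropy_l_diversity(records, ["Age", "Job", "Country"], "Disease", 2))
```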
When a user tries to guess the sensitive value of an individual from anonymized data, the diversity quantifies the average number of possible sensitive values. Fig. 3, also derived from Fig. 1, shows an anonymized data set satisfying more than (entropy) 2-diversity. From this data set, the user cannot infer the sensitive value of any individual with more than 50% confidence. However, the Q-I values have been generalized further, so the non-sensitive knowledge is even harder to obtain. Moreover, the diversity of the sensitive values in each Q-I group is limited by the distribution of sensitive data in the whole data set.

To keep the data details, the permutation method permutes the sensitive values within each Q-I group. As shown in Fig. 4, the data set in Fig. 1 is permuted so that each group satisfies at least (entropy) 2-diversity. The permuted data set is actually divided into two views: the view of the quasi-identifier and the view of the sensitive attribute. During data mining, the user joins these two views for each record group. Although the data details are preserved, the non-sensitive associations can hardly be discovered from Fig. 4, because their confidences decrease dramatically.

2.2. Measuring data privacy

Most k-anonymization methods (Aggarwal, 2005; Kifer & Gehrke, 2006; LeFevre et al., 2005, 2006; Meyerson & Williams, 2004; Sweeney, 2002) emphasize the protection of the quasi-identification information. The minimal size of the Q-I groups is used to quantify the privacy level of the anonymized data. The larger the minimal size, the more difficult it is to identify an individual from a value combination of Q-I. Similarly, it is difficult to identify an individual from a value of the sensitive data alone. However, if the user can establish a strong association between a value combination of Q-I and a value of the sensitive data, then privacy can be compromised. Thus we propose that privacy be measured by the association between the value combinations of Q-I and the values of the sensitive data. A probabilistic measurement of privacy, or anonymity, is given in Section 4. Also, the anti-monotone property of k-anonymity shows that if the quasi-identifier is k-anonymous, any subset of it is also k-anonymous. The value combinations of a subset do not disclose more privacy than those of the quasi-identifier. Thus, unlike the k-anonymization methods, we do not group the value combinations of Q-I. How do we prevent the user from obtaining these important associations? We present our anonymization algorithm in the next section.

3. Anonymization algorithm

To break the associations between the value combinations of Q-I and the values of sensitive data, it is not necessary to anonymize all the attributes in Q-I. Our main idea is to randomly replace part of the Q-I data in each record by using the distributions of the original values of the Q-I attributes. In this way, no new information is added to the anonymized data, the associations between Q-I and sensitive data are broken, and the original data distribution is preserved. Consequently, the associations in individual records are broken, whereas the statistics of the associations over the whole data set are preserved. We present the anonymization process in Algorithm 1.
Algorithm 1. The RA Algorithm

1: Input: the original data set D, the Q-I attributes Q, and the probability distribution {p_1, …, p_m}, where m = |Q| is the number of attributes in Q-I
2: Output: the original data set D, overwritten by an anonymized one
3: begin
4:   n := |D|;
5:   Dist := ∅;
6:   for i := 1 to m do
7:   begin
8:     Dist_i := the distribution of the values of Q_i;
9:   end
10:  for j := 1 to n do
11:  begin
12:    Randomly select an attribute Q_k in the Q-I of the jth record with probability p_k;
13:    Randomly generate a new value for Q_k based on Dist_k;
14:    Replace the value of Q_k with the new value;
15:  end
16: end

For example, in the data set in Fig. 1, the Q-I attributes are Q = {Q_1, Q_2, Q_3} = {Age, Job, Country}. The values of Q_i and their corresponding distributions are:

Values of Q_1: [20–30], [30–40], [40–50], [50–60], [60–70]
Dist_1:        1/10,    4/10,    1/10,    3/10,    1/10

Values of Q_2: Doctor, Clerk, Trader, Engineer, Banker
Dist_2:        1/10,   4/10,  2/10,   1/10,     2/10

Values of Q_3: USA,  Germany, UK,   India
Dist_3:        6/10, 1/10,    2/10, 1/10

Fig. 3. A 2-diversity data set derived from Fig. 1.
Fig. 4. An anonymous data set by permuting Fig. 1.

Setting p_1 = p_2 = p_3 = 1/3, for every record in Fig. 1 we select one of the three Q-I attributes Q_k with equal probability. Then we randomly generate a value for the selected Q_k based on Dist_k and replace the original value of Q_k with the new value. Fig. 5 shows an anonymized data set produced by Algorithm 1.

Fig. 5. An anonymized data set produced by Algorithm 1 from Fig. 1.

How do we choose the p_i in this algorithm? How does the algorithm protect privacy while preserving non-sensitive knowledge? We address these issues in the following two sections.
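To make Algorithm 1 concrete, the following minimal Python sketch (ours, not the authors' implementation; the column names follow the Fig. 1 example) estimates each attribute's empirical distribution and, per record, redraws one attribute chosen with probability p_k.

```python
import random

def ra_anonymize(records, qi_attrs, probs, rng=random.Random(0)):
    """Algorithm 1 (RA): for each record, pick one Q-I attribute Q_k with
    probability p_k and replace its value with a draw from Q_k's empirical
    distribution in the original data."""
    # Empirical distribution of each Q-I attribute (Dist_i in the paper),
    # kept as the raw value list: uniform sampling from it equals Dist_i.
    dists = {a: [r[a] for r in records] for a in qi_attrs}
    for r in records:
        k = rng.choices(range(len(qi_attrs)), weights=probs)[0]
        attr = qi_attrs[k]
        r[attr] = rng.choice(dists[attr])
    return records

# Usage with equal selection probabilities p_i = 1/3, as in the Fig. 1 example:
# ra_anonymize(records, ["Age", "Job", "Country"], [1/3, 1/3, 1/3])
```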
4. Privacy analysis

Since our RA algorithm, unlike the k-anonymity and l-diversity methods, anonymizes a data set by randomly breaking the associations between its Q-I and sensitive data, we first define a statistical measurement of anonymity in Section 4.1 and use it to analyze the anonymity of our method. Moreover, since we randomize the values of the selected attributes, for comparison we also use the measurement of Evfimievski, Gehrke, and Srikant (2003) to analyze the "privacy breaches" of our method in Section 4.2.1. In addition, in Section 4.2.2, we further discuss the privacy leakage when the user has some prior knowledge of the data distribution.

4.1. Probabilistic anonymity

For a good anonymization, it should be very unlikely that the user can infer the original associations from the corresponding associations in the anonymized data set. Thus we propose the following definition of anonymity.

Definition 6 (Probabilistic anonymity). Suppose that a data set D is anonymized to D′. Let r be a record in D and r′ ∈ D′ be its anonymized form. Denote by r(QI) the value combination of the quasi-identifier in r. The probabilistic anonymity of the data set D′ is defined as 1/P(r(QI) | r′(QI)), where P(r(QI) | r′(QI)) is the probability that r(QI) (for all r ∈ D) may be inferred given r′(QI).

The probabilistic anonymity measures how unlikely it is that the user can infer the original associations. The greater the probabilistic anonymity, the less probable it is that the user can guess the original data. Now that we have a measurement of anonymity, we show how to determine the p_i in Algorithm 1 so as to maximize the anonymity of the data set produced by the algorithm.

Proposition 7. Let Q_i, i = 1, …, m, be the ith Q-I attribute (a category attribute) in a data set D and Entropy(Q_i) be the entropy of Q_i. Then the probabilistic anonymity of D′, the anonymized form of D, reaches its maximum when each p_i in Algorithm 1 is directly proportional to e^{Entropy(Q_i)}.

Proof 1. Let c_ij be the jth category value of Q_i and Freq(c_ij) be the frequency of c_ij in Q_i. We calculate Pa(D′), the probabilistic anonymity of D′, by

\[
\ln Pa(D') = -\sum_{i=1}^{m}\sum_{j=1}^{J_i} p_i\,Freq(c_{ij})\ln\left(p_i\,Freq(c_{ij})\right)
= \sum_{i=1}^{m}\sum_{j=1}^{J_i} p_i\,Freq(c_{ij})\ln\frac{1}{p_i}
+ \sum_{i=1}^{m}\sum_{j=1}^{J_i} p_i\,Freq(c_{ij})\ln\frac{1}{Freq(c_{ij})},
\]

where J_i is the number of category values of Q_i. Since \(\sum_{j=1}^{J_i} Freq(c_{ij}) = 1\), we can also derive

\[
\ln Pa(D') = \sum_{i=1}^{m} p_i\left(-\ln p_i + Entropy(Q_i)\right). \tag{1}
\]

Next, we use the method of Lagrange multipliers (Chen, Jin, Zhu, & Ouyang, 1983) to find the maximum of Pa(D′). We incorporate the constraint \(\sum_{i=1}^{m} p_i = 1\) into (1):

\[
F(p_1,\ldots,p_m,\mu) = \sum_{i=1}^{m}\left[p_i\left(-\ln p_i + Entropy(Q_i)\right)\right] + \mu\left(\sum_{i=1}^{m} p_i - 1\right),
\]

where μ is an unknown scalar. Setting \(\nabla_{p_1,\ldots,p_m} F(p_1,\ldots,p_m,\mu) = 0\), we have

\[
\frac{\partial F}{\partial p_i} = Entropy(Q_i) - \ln p_i - 1 + \mu = 0, \quad i = 1,\ldots,m.
\]

Thus, for all i, j ∈ [1, m],

\[
\ln\frac{p_i}{p_j} = Entropy(Q_i) - Entropy(Q_j),
\quad\text{implying}\quad
\frac{p_i}{p_j} = \frac{e^{Entropy(Q_i)}}{e^{Entropy(Q_j)}}. \tag{2}
\]

From (2), Pa(D′) reaches its maximum when

\[
p_i = \frac{e^{Entropy(Q_i)}}{\sum_{j=1}^{m} e^{Entropy(Q_j)}}. \qquad\square
\]

The above proposition also gives a general method for calculating the probabilistic anonymity. When p_i = 1/m, i = 1, …, m, we have a quick estimate of the scaled Pa(D′) by taking the geometric mean of the diversities of all Q-I attributes:

\[
\ln Pa(D') = \ln m + \sum_{i=1}^{m}\frac{1}{m}Entropy(Q_i)
= \ln\left(m\left(\prod_{i=1}^{m} Diversity_i\right)^{1/m}\right), \tag{3}
\]

where Diversity_i = e^{Entropy(Q_i)} is the entropy diversity of Q_i (Machanavajjhala et al., 2006).

For an arbitrary record, the probability that the user may guess its original Q-I values, given an anonymized data set, is 1/Pa(D′). This is also the user's confidence in associating a sensitive value with an individual. From (3), Pa(D′) is usually greater than the geometric mean of the diversities of all Q-I attributes. Similarly, Pa(D′) is usually greater than the diversity of the sensitive attribute. Even when the user is sure that an individual is in the data set, the user's maximal confidence in inferring the corresponding sensitive value is 1/Diversity_s, where Diversity_s is the diversity of the sensitive attribute. In comparison, in the l-diversity method, the user's confidence in guessing the data privacy is 1/l, which is usually greater than 1/Diversity_s. Therefore, by disassociating the relationships between the Q-I and sensitive values, the RA method provides better privacy protection than the l-diversity method does.
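The optimal selection probabilities and the resulting anonymity level are easy to compute. The sketch below (our illustration, not the authors' code) derives the p_i of Proposition 7 from the attribute entropies and evaluates ln Pa(D′) via Eq. (1); the toy columns mirror the Fig. 1 distributions.

```python
import math
from collections import Counter

def attr_entropy(column):
    """Shannon entropy (natural log) of one Q-I attribute's values."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def optimal_probs(columns):
    """Proposition 7: p_i proportional to exp(Entropy(Q_i))."""
    divs = [math.exp(attr_entropy(col)) for col in columns]  # entropy diversities
    total = sum(divs)
    return [d / total for d in divs]

def ln_pa(columns, probs):
    """Eq. (1): ln Pa(D') = sum_i p_i * (-ln p_i + Entropy(Q_i))."""
    return sum(p * (-math.log(p) + attr_entropy(col))
               for p, col in zip(probs, columns))

# Toy columns reproducing the Age/Job/Country distributions of Fig. 1
age = ["[20-30]"] + ["[30-40]"] * 4 + ["[40-50]"] + ["[50-60]"] * 3 + ["[60-70]"]
job = ["Doctor"] + ["Clerk"] * 4 + ["Trader"] * 2 + ["Engineer"] + ["Banker"] * 2
country = ["USA"] * 6 + ["Germany"] + ["UK"] * 2 + ["India"]
p = optimal_probs([age, job, country])
print(p, math.exp(ln_pa([age, job, country], p)))  # Pa(D') comes out around 11 (cf. Section 6)
```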
4.2. Privacy breaches

The measurement of probabilistic anonymity evaluates the general level of privacy protection. Although privacy is protected on average, certain privacy breaches (Evfimievski et al., 2003) may still take place.

4.2.1. Common breaches

Suppose q ∈ D is one of the original value combinations of the quasi-identifier and q′ ∈ D′ is its anonymized form. A privacy breach can be represented by:

there exist q ∈ D and q′ ∈ D′ such that P(q ⊆ t | q′ ⊆ t′) ≫ P(o ⊆ t | q′ ⊆ t′) for all o ≠ q,

where t ∈ D is the original record and t′ ∈ D′ is its anonymized form. By this definition, if the user is able to recover the original q with high confidence once q′ appears in the anonymized data, then there is a privacy breach. Privacy breaches are quantified by the γ-amplification (Evfimievski et al., 2003):

Definition 8 (γ-Amplification). Suppose a data set D is anonymized to D′. For some value v′ ∈ D′, if for all v_1, v_2 ∈ D, P(v_1 → v′)/P(v_2 → v′) ≤ γ, then the anonymization is at most γ-amplifying for v′. The anonymization is at most γ-amplifying if it is at most γ-amplifying for all v′ ∈ D′.

By the definition, once the anonymized value v′ is revealed, if all candidate values are equally possible a priori, i.e., they share a common probability, then the probability that the user may recover the original value is at most γ times that common probability. Thus, the closer γ is to 1, the more difficult it is for the user to infer the original data. Note that γ ≥ 1.

In the RA algorithm, each record is anonymized without referring to its original Q-I data. It would seem that γ = 1. However, the original data distribution is used to generate the new values of the selected attributes. How does this affect γ? In the following, we analyze the privacy breaches of our method.

Suppose a data set D is anonymized to D′ by Algorithm 1. Let q′ = {q′_1, q′_2, …, q′_m} be a value combination of the quasi-identifier in D′, where q′_i is the value of the ith Q-I attribute Q_i. We calculate the probability P(q → q′) for each possible value combination q ∈ D:

\[
\begin{cases}
P(q = \{*, q'_2, q'_3, \ldots, q'_m\} \to q') = p_1\,P(Q_1 = q'_1) \\
P(q = \{q'_1, *, q'_3, \ldots, q'_m\} \to q') = p_2\,P(Q_2 = q'_2) \\
\quad\vdots \\
P(q = \{q'_1, \ldots, q'_{m-1}, *\} \to q') = p_m\,P(Q_m = q'_m)
\end{cases}
\]

If there exists a k such that p_k P(Q_k = q′_k) ≫ p_i P(Q_i = q′_i) for all i = 1, …, m, then there is a privacy breach by Definition 8. Although the user can infer with high confidence that q′ was generated by anonymizing the kth attribute, he is unable to find the original value of the kth attribute, since each value is anonymized independently of the original one. Thus the user can hardly leverage such breaches to identify the individuals or their sensitive data.

However, things are different when the user has some prior knowledge of the distributions of the Q-I data, especially the supports of the value combinations of the quasi-identifier. This is probable when the data set D contains all the individuals in some public data. In the next subsection, we analyze the data privacy that our method preserves in this situation.

4.2.2. Privacy breaches by prior knowledge

Let t′ ∈ D′ be an anonymized record and q′ ⊆ t′ be its Q-I value combination. Then t′ may have been anonymized from one of the following two types of records in D:

• records which contain q′ both before and after the anonymization;
• records whose Q-I values intersect q′ in m − 1 values and are anonymized to q′.

Definition 9 (Support). The support of a value combination q in D, denoted supp_D(q), is the percentage of the records in D which contain all the attribute values of q. The support of a record t, denoted supp_D(t), is the support of its value combination.

A support gives the frequency of occurrence of a value combination in D.
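Supports are straightforward to estimate from the table itself. The following small helper (ours, for illustration) computes supp_D(q) for a partial value combination q given as attribute–value pairs.

```python
def support(records, q):
    """supp_D(q): fraction of records containing every attribute value in q.

    `q` is a dict of attribute -> value, e.g. {"Job": "Clerk", "Country": "USA"}.
    """
    hits = sum(all(r.get(a) == v for a, v in q.items()) for r in records)
    return hits / len(records)
```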
Assuming p_i = 1/m, i = 1, …, m, we have the expectation of the support for the first type of records in D′:

\[
E\left(supp_{D'}(\{t' \mid q' \subseteq t \ \&\ q' \subseteq t'\})\right)
= supp_D(q')\,P(q'\text{ is anonymized to itself})
= supp_D(q')\,\frac{1}{m}\sum_{i=1}^{m} P(Q_i = q'_i). \tag{4}
\]

For the second type, |q′ ∩ q| = m − 1, and again assuming p_i = 1/m, i = 1, …, m, the expectation of the support is:

\[
E\left(supp_{D'}(\{t' \mid q' \subseteq t' \ \&\ q \subseteq t\})\right)
= \sum_{q}\left(supp_D(q)\,P(q\text{ is anonymized to }q')\right)
= \sum_{q}\left(supp_D(q)\,\frac{1}{m}\,P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)\right), \tag{5}
\]

where w(q′, q) is the index of the Q-I attribute in which q′ and q differ. This time, with the prior knowledge, a breach allowing the user to guess the original value of the anonymized attribute may occur. Combining (4) and (5), we can derive the γ-amplification of the breach:

\[
\gamma_{q'} = \max_{q}\left(
\frac{supp_D(q')\sum_{i=1}^{m} P(Q_i = q'_i)}
{supp_D(q)\,P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)}
\right).
\]

If γ_{q′} < 1, we set γ_{q′} = 1/γ_{q′}. The value of γ_{q′} varies with the distributions of different data sets. It is desirable to have a constant measurement. Thus, we derive an upper bound for the maximal probability that the user may infer the original q from q′. Let maxSupp be the larger of

\[
\max_{|q \cap q'| = m-1}\left(supp_D(q)\,P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)\right)
\quad\text{and}\quad
supp_D(q')\sum_{i=1}^{m} P(Q_i = q'_i);
\]

then the maximal probability is

\[
\frac{maxSupp}
{\sum_{q} supp_D(q)\,P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right) + supp_D(q')\sum_{i=1}^{m} P(Q_i = q'_i)}. \tag{6}
\]

The following proposition gives an upper bound for this maximal probability.

Proposition 10. Suppose p_i = 1/m for i = 1, …, m. Then the maximal probability that the user may guess the original data is bounded above by

\[
\max\left(
\frac{supp_D(q')}{\min_i\left(supp_D(q' \setminus q'_i)\right)},\;
\frac{P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)}{\sum_{i=1}^{m} P(Q_i = q'_i)}
\right).
\]

Proof 2. We consider two cases based on maxSupp in (6). In the first case, when maxSupp = supp_D(q′) Σ_{i=1}^{m} P(Q_i = q′_i), the probability that the user may guess the data is

\[
P(q \in D \mid q' \in D')
= \frac{supp_D(q')\sum_{i=1}^{m} P(Q_i = q'_i)}
{\sum_{i=1}^{m} P(Q_i = q'_i)\left(supp_D(q') + \sum_{w(q',q)=i} supp_D(q)\right)}
= \frac{supp_D(q')\sum_{i=1}^{m} P(Q_i = q'_i)}
{\sum_{i=1}^{m} P(Q_i = q'_i)\,supp_D(q' \setminus q'_i)}
\le \frac{supp_D(q')}{\min_i\left(supp_D(q' \setminus q'_i)\right)}, \tag{7}
\]

where q′ \ q′_i is the value combination obtained by deleting q′_i from q′. If q′ = {q′_i}, then supp_D(q′ \ q′_i) = 1.

In the second case, maxSupp = supp_D(q) P(Q_{w(q′,q)} = q′_{w(q′,q)}) for some q, and we have

\[
P(q \in D \mid q' \in D')
= \frac{supp_D(q)\,P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)}
{\sum_{i=1}^{m} P(Q_i = q'_i)\,supp_D(q' \setminus q'_i)}
\le \frac{P\!\left(Q_{w(q',q)} = q'_{w(q',q)}\right)}{\sum_{i=1}^{m} P(Q_i = q'_i)}. \tag{8}
\]

Combining (7) and (8) completes the proof. □

In summary, without any prior knowledge, the user is unlikely to associate sensitive data with individuals from the anonymized data set. Even with knowledge of the distribution of the quasi-identifier, the user's ability is limited by the upper bound given in Proposition 10.
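To see what the bound of Proposition 10 looks like numerically, the sketch below (our reading of the bound, not the authors' code; it reuses the hypothetical `support` helper above) evaluates both terms for a candidate anonymized combination q′ under p_i = 1/m, taking the worst case over attributes for the second term.

```python
def prop10_bound(records, q_prime, marginals):
    """Upper bound of Proposition 10 for a Q-I combination q_prime.

    `q_prime`  : dict attribute -> value (the anonymized combination q').
    `marginals`: dict attribute -> P(Q_i = q'_i) in the original data.
    Assumes p_i = 1/m and that q' actually occurs in D (so the
    sub-supports supp_D(q' \\ q'_i) are nonzero).
    """
    attrs = list(q_prime)
    # supp_D(q' \ q'_i) for each i; defined as 1 when q' is a single value
    if len(attrs) > 1:
        sub_supports = [
            support(records, {a: v for a, v in q_prime.items() if a != i})
            for i in attrs
        ]
    else:
        sub_supports = [1.0]
    term1 = support(records, q_prime) / min(sub_supports)
    term2 = max(marginals.values()) / sum(marginals.values())  # worst-case w(q', q)
    return max(term1, term2)
```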
5. Quasi-sensitive knowledge preservation

Now that we have discussed the security of our anonymization method, we turn in this section to the issue of knowledge preservation, which is important in data mining. Many k-anonymization methods (LeFevre et al., 2005, 2006; Meyerson & Williams, 2004) try to minimize the loss of data details while maximizing the level of privacy protection. Some recent algorithms (Fung, Wang, & Yu, 2007; Kifer & Gehrke, 2006) also try to retain the data utility for the purpose of data analysis and data mining. But few consider the issue of non-sensitive knowledge preservation. In Section 5.1, we first explain why we preserve the non-sensitive knowledge and define its general form under the term "quasi-sensitive knowledge". Then, in Section 5.2, we discuss the accuracy of knowledge recovery.

5.1. Quasi-sensitive knowledge

In most applications of data analysis and data mining, we are interested in finding the relationships among attributes, especially the associations between the Q-I and sensitive values. We call these associations the quasi-sensitive associations.

Definition 11 (Quasi-sensitive associations). Let Q be the set of Q-I attributes in a data set D and S be the set of sensitive attributes. Suppose attribute sets Q̂ ⊆ Q and Ŝ ⊆ S. For any value q̂ of Q̂ and any value ŝ of Ŝ, we call the association q̂ → ŝ a quasi-sensitive association in D. All the quasi-sensitive associations make up the quasi-sensitive knowledge of D.

For example, the data set in Fig. 1 contains the two quasi-sensitive associations

• Clerk → Hypertension (confidence: 75%)
• [30–40], Clerk → Hypertension (confidence: 67%)

which are helpful in analyzing the causes of hypertension.

We denote by |q̂| the number of values in q̂ and by |Q| the number of attributes in Q, and call |q̂| the length of the association q̂ → ŝ. Normally, the closer |q̂| is to |Q|, the more sensitive the related association is. In particular, when |q̂| = |Q| the association is most sensitive, because the antecedent of the association rule can identify a specific individual. To preserve the less sensitive associations, we try to make the discovery of the "shorter" associations more accurate than that of the "longer" ones. In the following section, we show how to achieve this goal.

5.2. Accuracy evaluation of knowledge recovery

We define the relative difference

\[
R(q) = \frac{supp_{D'}(q) - supp_D(q)}{supp_D(q)} \times 100\%.
\]

The frequently used measurement, the relative bias, is B(q) = |E(R(q))|. It measures the difference between the support in D′ and the support in D. Thus, from the definition of support, the smaller B(q) is, the more accurate the knowledge discovery can be. In this section, we first derive the general form of the relative bias in our method. Then we analyze how the quasi-sensitive associations are treated with respect to their length.

Recall that Algorithm 1 randomly selects a Q-I attribute for anonymization. Intuitively, the more Q-I attributes q involves, the more likely it is that q will be changed, and consequently the more likely it is that the difference R(q) will be large. The following proposition shows the relation between the relative bias B(q) and the length of q. First we introduce some notation. Again, let q be a value combination of the original Q-I data. We denote by ¬q the set of value combinations {q̂ : |q̂| = |q|, QI(q̂) = QI(q), q̂ ≠ q}. For q = {q_1, …, q_l}, l ≤ m, we denote by q \ q_i, i = 1, …, l, the value combination {q_1, …, q_{i−1}, q_{i+1}, …, q_l}, and by q ∪ q_x the value combination {q_1, …, q_l, q_x}. When |q| = 1, we define supp_D(q \ q) = 1. We also use

\[
lift(q_i,\, q \setminus q_i) = \frac{supp(q)}{supp(q \setminus q_i)\,P(Q_i = q_i)}
\]

to measure the strength of the relationship between q_i and q \ q_i. A lift value greater than 1 indicates a positive association between q_i and q \ q_i, whereas a value less than 1 indicates a negative association (Giudici, 2003).
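As a quick illustration of these quantities, the sketch below (ours; it reuses the hypothetical `support` helper from Section 4) computes lift(q_i, q \ q_i) and an empirical R(q) by comparing supports before and after anonymization.

```python
def lift(records, q, attr):
    """lift(q_i, q \\ q_i) = supp(q) / (supp(q \\ q_i) * P(Q_i = q_i))."""
    rest = {a: v for a, v in q.items() if a != attr}
    supp_rest = support(records, rest) if rest else 1.0  # supp_D(q \ q) = 1 when |q| = 1
    p_attr = support(records, {attr: q[attr]})           # marginal P(Q_i = q_i)
    return support(records, q) / (supp_rest * p_attr)

def relative_difference(original, anonymized, q):
    """R(q) as a percentage, per Section 5.2."""
    s, s_prime = support(original, q), support(anonymized, q)
    return (s_prime - s) / s * 100.0
```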
Proposition 12. For a value combination q in D, the relative bias B(q) is positively proportional to

\[
\sum_{i}\left[\frac{1}{lift(q_i,\, q \setminus q_i)} - 1\right].
\]

Proof 3. From the definition of R(q), we have

\[
E(R(q)) = E\left(\frac{supp_{D'}(q) - supp_D(q)}{supp_D(q)}\right)
= \frac{supp_D(q)P(q \to q) + supp_D(\lnot q)P(\lnot q \to q) - supp_D(q)}{supp_D(q)}
= \frac{supp_D(\lnot q)}{supp_D(q)}P(\lnot q \to q) - P(q \to \lnot q). \tag{9}
\]

Then we get

\[
B(q) = |E(R(q))|
= \frac{1}{m}\left|\sum_{i=1}^{|q|}\left[\frac{supp_D(q \setminus q_i) - supp_D(q)}{supp_D(q)}P(Q_i = q_i) - P(Q_i \ne q_i)\right]\right|
= \frac{1}{m}\left|\sum_{i=1}^{|q|}\left[\frac{1}{lift(q_i,\, q \setminus q_i)} - 1\right]\right|, \tag{10}
\]

which completes the proof. □

The proposition shows that the more attributes q has, the larger the range of the relative bias B(q). In particular, when lift(q_i, q \ q_i) ≥ 0.5, we have B(q) ≤ |q|/m, implying that the range of the expected relative difference between supp_{D′} and supp_D increases linearly with the length |q|.

As for the variance of the relative difference, we have

\[
Var(R(q)) = \frac{1}{m^2}\sum_{i=1}^{|q|}\left[\frac{supp_D(q \setminus q_i)^2}{supp_D(q)^2}\left(P(Q_i = q_i) - P(Q_i = q_i)^2\right)\right]. \tag{11}
\]

Let the value combination q̃ = q ∪ q_x, q_x ∉ q. If supp_D(q \ q_i)/supp_D(q) varies little from supp_D(q̃ \ q_i)/supp_D(q̃) for i = 1, …, |q|, then Var(R(q̃)) ≥ Var(R(q)). That means Var(R(q)) grows almost linearly with |q|. In other words, the relative difference between supp_{D′}(q) and supp_D(q) becomes more uncertain as the value combination gets longer. In particular, in the case of a single-attribute Q-I, where m = 1, (10) yields zero bias:

\[
B(q) = \left|\frac{1}{lift(q,\, q \setminus q)} - 1\right| = 0.
\]

In summary, the shorter the value combination, the better the quasi-sensitive knowledge is preserved and the more accurate the knowledge recovery is. Also, since short associations are usually less sensitive than long ones, the user is more likely to discover the less sensitive associations than the sensitive ones.

6. Enhancement of privacy protection

Although the RA algorithm prevents the user from associating individuals with their sensitive data, prior knowledge of the associations within the Q-I data may worsen the privacy leakage. In Fig. 5, the probabilistic anonymity Pa(D′) of the anonymized data set is about 11 by Proposition 7. Suppose the user knows in advance the association "Clerk, USA → [30–40]" (confidence 50%). Once the value combination {[20–30], Clerk, USA} occurs in the anonymized data, the user is able to guess with confidence a that the original Q-I value combination is {[30–40], Clerk, USA}, where

\[
a = P(\text{Age is the anonymized attribute}) \times 50\% \approx 17\% > \frac{1}{Pa(D')}.
\]

Thus, the user's inference about the data privacy is improved.

To tackle this problem, we randomly anonymize more than one Q-I attribute. This improves privacy protection; however, knowledge recovery may become less accurate. In this section, we generalize Algorithm 1 by uniformly choosing k (k ≥ 1) Q-I attributes for each record and replacing their values by using their original distributions. We first present the algorithm, then analyze its privacy protection and knowledge preservation.

Algorithm 2. The Enhanced RA Algorithm

1: Input: the original data set D, the Q-I attributes Q, and the number k (k ≤ m) of attributes to be anonymized
2: Output: the original data set D, overwritten by an anonymized one
3: begin
4:   m := |Q|;
5:   n := |D|;
6:   Dist := ∅;
7:   for i := 1 to m do
8:   begin
9:     Dist_i := the distribution of the values of Q_i;
10:  end
11:  for j := 1 to n do
12:  begin
13:    Randomly select k attributes with equal probability in the Q-I of the jth record;
14:    for t := 1 to k do
15:    begin
16:      attr := the tth selected Q-I attribute;
17:      Randomly generate a new value for attr based on Dist_attr;
18:      Replace the value of attr with the new value;
19:    end
20:  end
21: end

Apparently, Algorithm 2 enhances privacy protection as k gets larger.
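A minimal Python sketch of Algorithm 2 (ours, not the authors' code) differs from the earlier RA sketch only in drawing k distinct attributes per record.

```python
import random

def enhanced_ra_anonymize(records, qi_attrs, k, rng=random.Random(0)):
    """Algorithm 2 (enhanced RA): per record, uniformly pick k distinct Q-I
    attributes and redraw each from its original empirical distribution."""
    dists = {a: [r[a] for r in records] for a in qi_attrs}
    for r in records:
        for attr in rng.sample(qi_attrs, k):  # k attributes, equal probability
            r[attr] = rng.choice(dists[attr])
    return records
```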
What is the impact of k on knowledge preservation? In the following, we show that both the bias B(q) and the variance Var(R(q)) increase as k gets larger.

Since we choose the Q-I attributes uniformly at random, the probability that any particular combination of k Q-I attributes is selected is \(1\big/\binom{m}{k}\). In the algorithm, a value combination cannot be anonymized into every other form. If two value combinations q and q′ differ in g values (g ≤ k), then the two value combinations are convertible into each other, since all the g values may be anonymized. Before discussing B(q) and Var(R(q)), we investigate in the following proposition the impact of k on the probability that a value combination is anonymized to a convertible value combination. We show that both probabilities P(q → q′) and P(q′ → q) increase as k increases.

Proposition 13. Suppose k is the number of anonymized attributes in each record. Let q ∈ D and q′ ∈ D′ be partial value combinations of Q-I with |q| = |q′| ≤ m that are convertible into each other. Then the probabilities P(q → q′) and P(q′ → q) increase directly with k.

Proof 4. Suppose that q and q′ differ in g attributes, g ≤ k. Let AttrSets^{g+κ} be the set of all possible combinations of g + κ Q-I attributes, where each combination AttrSets_i^{g+κ}, i = 1, …, \(\binom{|q|-g}{\kappa}\), contains those g attributes and κ (κ ≤ |q| − g) other arbitrary Q-I attributes of q. Let P(AttrSets_i^{g+κ}, q′) be the probability that the values of the g + κ attributes in AttrSets_i^{g+κ} are anonymized to the corresponding values of q′. Then the probability that q is anonymized to q′ is

\[
P(q \to q') = \sum_{\kappa=0}^{\min(k-g,\,|q|-g)}
\left[\frac{\binom{m-|q|}{k-g-\kappa}}{\binom{m}{k}}
\sum_{i=1}^{|AttrSets^{g+\kappa}|} P\!\left(AttrSets_i^{g+\kappa}, q'\right)\right]
= \sum_{\kappa=0}^{\min(k-g,\,|q|-g)} Pr_{k,g}^{\kappa},
\]

where, for simplicity, \(Pr_{k,g}^{\kappa}\) denotes \(\frac{\binom{m-|q|}{k-g-\kappa}}{\binom{m}{k}}\sum_{i} P(AttrSets_i^{g+\kappa}, q')\). As k is increased by 1,

\[
\frac{Pr_{k+1,g}^{\kappa}}{Pr_{k,g}^{\kappa}}
= \frac{\binom{m-|q|}{k+1-g-\kappa}\Big/\binom{m}{k+1}}{\binom{m-|q|}{k-g-\kappa}\Big/\binom{m}{k}}
= \frac{(k+1)(m-k+1)}{(k+1-g-\kappa)(m-k+1-|q|+g+\kappa)}.
\]

It then follows that

\[
1 \le \frac{Pr_{k+1,g}^{\kappa}}{Pr_{k,g}^{\kappa}} \le k+1,
\]

since g + κ ≤ min(k, |q|). This shows that the probability P(q → q′) increases directly with k. The proof for P(q′ → q) is similar. □

Now let us look at B(q′) and Var(R(q′)). From (9), we get

\[
B(q') = \left|\sum_{q \in \lnot q'}\left[\frac{supp_D(q)}{supp_D(q')}P(q \to q')\right] - P(q' \to \lnot q')\right|
\]

and

\[
Var(R(q')) = \sum_{q \in \lnot q'}\left[\left(\frac{supp_D(q)}{supp_D(q')}\right)^2
\left(P(q \to q') - P(q \to q')^2\right)\right]
+ P(q' \to \lnot q') - P(q' \to \lnot q')^2.
\]

As k increases, there are more value combinations q in D convertible to q′. Also, from Proposition 13, both P(q → q′) and P(q′ → q) vary directly with k. Therefore, the range of B(q′) increases with k. As for the variance, P(q → q′) and P(q′ → ¬q′) are usually less than 1/2, so P(q → q′) − P(q → q′)² and P(q′ → ¬q′) − P(q′ → ¬q′)² increase with k. Thus, the variance also varies directly with k.

In summary, by tuning the value of k, we can adjust the level of the average bias and the variance. Thus we need to find a compromise between the protection of data privacy and the preservation of useful knowledge, as demonstrated in our experiments.
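The trade-off described in this section can be observed directly by simulation. The sketch below (our illustration, reusing the hypothetical `enhanced_ra_anonymize` and `support` helpers) estimates the average |R(q)| of one value combination for several values of k; larger k should yield larger distortion.

```python
import copy
import random

def distortion_vs_k(records, qi_attrs, q, ks, trials=200):
    """Monte Carlo estimate of the average |R(q)| for each k in `ks`.

    Assumes q occurs in `records` (so its base support is nonzero)."""
    base = support(records, q)
    out = {}
    for k in ks:
        diffs = []
        for t in range(trials):
            anon = enhanced_ra_anonymize(copy.deepcopy(records), qi_attrs, k,
                                         rng=random.Random(t))
            diffs.append(abs(support(anon, q) - base) / base)
        out[k] = sum(diffs) / trials
    return out

# e.g. distortion_vs_k(records, ["Age", "Job", "Country"],
#                      {"Job": "Clerk", "Country": "USA"}, ks=[1, 2, 3])
```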
7. Experiments

In this section, we present experimental results demonstrating the effectiveness of our methods. Using common aggregate queries, we first compare our RA Algorithm 1 with the l-diversity and permutation methods on SQL queries. Then we compare our method with the permutation method on quasi-sensitive knowledge discovery. At the same time, we demonstrate the effects of the parameter k in our enhanced RA Algorithm 2 and of the association length on knowledge discovery.

7.1. Experiment setup

The data set adopted in the experiments is the adult data set from the UCI machine learning repository (Hettich & Bay, 1999). We used the following nine original attributes to compose the quasi-identifier: "education", "race", "sex", "work-class", "marital-status", "age", "relationship", "native-country" and "salary". The attribute "occupation" was retained as the sensitive attribute. The records with missing values were removed, and the resulting data set contained 30,162 records.

7.2. Accuracy of SQL queries

In the first part of the experiments, we generated three anonymized data sets by applying the entropy l-diversity, permutation, and RA (Algorithm 1) methods to the test data set. We implemented the entropy l-diversity method based on the k-anonymization algorithm in LeFevre et al. (2006), and the permutation algorithm as in Xiao and Tao (2006). Then we performed random queries on the three anonymized data sets and compared the average accuracy of their answers. The queries were of forms similar to those in Xiao and Tao (2006):

select count(*) from tablename
where Q_1 in (R_1) and … and Q_m in (R_m) and S in (R_s)

where Q_i is the ith Q-I attribute, S the sensitive attribute, and R_i the query values for the ith attribute.

7.2.1. Record count estimation

When a data set is anonymized by the l-diversity algorithm, we cannot run the queries directly, since the data is generalized. Instead, the record count of each query is obtained by the following estimate (Xiao & Tao, 2006):

\[
\sum_{p} \#_p(R_s)\prod_{i=1}^{m}\frac{\#_p(R_i \cap Values(Q_i))}{\#_p(Values(Q_i))}, \tag{12}
\]

where Values(Q_i) is the set of distinct values of attribute Q_i and the function #_p(x) counts, in the pth Q-I group, the occurrences of the values in the set x. By summing the estimated record counts over all the Q-I groups, we get an estimate for the query.

When a data set is anonymized by the permutation method, the answers to the queries are estimated by

\[
\sum_{p}\left[\#_p(R_s)\,\frac{\#_p(\text{records matching the Q-I conditions})}{\#_p(\text{records in the }p\text{th group})}\right]. \tag{13}
\]

In this case, in the pth Q-I group, the records matching the Q-I conditions of the where clause can be retrieved directly as #_p(records matching the Q-I conditions). The product in the above equation assumes that all the associations between the value combinations of Q-I and the sensitive data are equally likely. We call such associations the uniform associations between the Q-I and sensitive values.

When the data set is anonymized by the RA algorithm, we answer the queries directly, without estimation.

7.2.2. The results

We conducted eight sets of experiments. In the jth set, 1 ≤ j ≤ 8, we performed 1000 queries on each of the three anonymized data sets; each query was composed of the sensitive attribute and j attributes selected uniformly at random from the quasi-identifier. For each selected attribute, we chose a random number of categorical values from its domain to form its query values R_i. In each set of experiments, we calculated the average relative error of the query results. The relative error is defined by

\[
\frac{|Ans(query, D) - Ans(query, D')|}{Ans(query, D)},
\]

where Ans(query, D) returns the answer to the query on data set D.
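For concreteness, here is a small sketch (ours) of the permutation-style estimate (13) and of the relative error, for in-memory records; `qi_cond` and `s_values` are illustrative stand-ins for the where-clause conditions.

```python
def permutation_estimate(groups, qi_cond, s_values, sensitive):
    """Eq. (13): per Q-I group, scale the sensitive-value count by the
    fraction of records matching the Q-I conditions."""
    total = 0.0
    for group in groups:  # each group is a list of records forming one Q-I group
        n_s = sum(r[sensitive] in s_values for r in group)
        n_qi = sum(all(r[a] in vals for a, vals in qi_cond.items()) for r in group)
        total += n_s * n_qi / len(group)
    return total

def relative_error(true_count, est_count):
    """|Ans(query, D) - Ans(query, D')| / Ans(query, D); may exceed 1."""
    return abs(true_count - est_count) / true_count
```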
Since the numerator |Ans(query, D) − Ans(query, D′)| compares aggregated counts between D and D′, this relative error may exceed 1.

The probabilistic anonymity of the test data set reached 34 under our RA anonymization, meaning that the average probability that a user could correctly associate an individual with the corresponding sensitive value was 1/34. The l-diversity and permutation methods limit that probability to 1/l, where l is further limited by the distribution of the sensitive values: in the test data set, the maximal value of l was about 10 (Entropy(occupation) ≈ ln 10). Thus, we set l = 10 for the l-diversity and permutation methods.

As shown in Fig. 6a, as the queries become more sensitive, involving more attributes, the relative error on the permuted data varies less than those on the l-diverse data and the RA-anonymized data. This is because the estimate (12) for the results of the queries on the l-diverse data assumes a uniform value distribution for each Q-I attribute; the more Q-I attributes are involved, the farther their joint distribution is from the multivariate uniform distribution. As for the RA method, from Proposition 12, the frequent and infrequent value combinations usually have higher values of B(q) than the other value combinations. Since the proportions of the frequent and infrequent value combinations decrease as the number of attributes involved increases, the rise of the relative error slows down as the number of attributes grows, as shown in Fig. 6a. For the permuted data set, the data links in the original data set are not completely the uniform associations assumed by (13), which causes the errors in the queries. Since the strength of those associations is not influenced by the number of attributes involved in the queries, the relative error varies the least.

We then decreased the diversity l to 8 and 5 and repeated the comparison. This time, each query consisted of the sensitive attribute and five random Q-I attributes. As shown in Fig. 6b, although the accuracy of the estimates from the l-diverse data and the permuted data improved as l decreased, the estimates from the data set anonymized by our RA method consistently had the least error. This is because the estimation from the RA-anonymized data set assumes no relationships between the attributes; the RA algorithm preserves the data relationships by perturbing only a small portion of the data set.

Fig. 6. (a) Comparison of the query accuracy among the data sets produced by different anonymization methods (l = 10); (b) comparison of the query accuracy with different diversity values.

7.3. Quasi-sensitive knowledge preservation

In this section, we investigate knowledge preservation under several anonymization techniques, the effect of the parameter k in the enhanced RA Algorithm 2 on knowledge discovery, and the impact of the association length on knowledge discovery.

We first ran the Apriori algorithm (Agrawal & Srikant, 1994) on the original data set D to obtain the quasi-sensitive associations Asso_D, which were used as the baseline in the following experiments. Then we compared the baseline with the quasi-sensitive associations Asso_{D′} mined from each anonymized data set D′. In the comparison, we evaluated the accuracy of the knowledge discovery by the "relative percentage"

\[
\frac{|Asso_D \cap Asso_{D'}|}{|Asso_D|} \times 100\%.
\]

In the test data set, the highest frequency of the values of the attribute "occupation" was about 13%. Thus, in the Apriori algorithm, we set the confidence threshold to 15% and the support threshold to 8%.
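Computing the relative percentage is a simple set operation once the rule sets are extracted; a sketch (ours, with rules represented as hypothetical (antecedent, consequent) pairs) follows.

```python
def relative_percentage(asso_d, asso_d_prime):
    """|Asso_D ∩ Asso_D'| / |Asso_D| * 100, per Section 7.3.

    Rules are hashable pairs, e.g. (frozenset({"Clerk"}), "Hypertension")."""
    asso_d = set(asso_d)
    return len(asso_d & set(asso_d_prime)) / len(asso_d) * 100.0
```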
Fig. 7 compares the effect of our RA Algorithm 1 with that of the permutation method with l = 5, 8, 10. (We did not compare with the l-diversity and k-anonymization methods, since most data associations are suppressed there.) More quasi-sensitive associations were recovered from the RA-anonymized data than from the permuted data. Although the previous experiments on aggregate queries showed that the accuracy of the answers from the permuted data varied the least, in the current experiment only a small part of the associations was recovered from it. This is caused by the assumption of uniform associations between the Q-I and sensitive values. In the previous experiments, most queries consisted of independent attribute values. For example, in the first set of the previous experiments, the queries consisted of the sensitive attribute and one random Q-I attribute. We ran the chi-square test on each contingency table consisting of a pair of attributes from the queries; most values of χ²/|D| were less than 0.2. Thus, the relative errors of the queries consisting of associated attribute values contributed little to the average error. Association discovery, however, finds exactly the highly associated attribute values: the more interesting the associations are, the farther they are from the uniform relationship. In our RA method, the data randomization has a relatively small impact on the data relationships, and it was therefore capable of preserving more associations.

Fig. 7. Comparison of the quasi-sensitive associations among different anonymization methods.

To test the influence of k on the discovery of the quasi-sensitive knowledge, we ran the enhanced RA Algorithm 2 with various numbers of anonymized Q-I attributes, setting the parameter k to 1, 3, 5, 7, and 9. As shown in Fig. 8a, as k increases, more artificial associations are generated and fewer original associations are discovered. This confirms our conclusions in Section 6.

Since a quasi-sensitive association may pose a greater threat to the data privacy when it involves more attribute values, we further compared the discovery of the quasi-sensitive associations of various lengths. In this comparison, we set k = 5. Fig. 8b shows that the shorter associations are recovered more accurately than the longer ones. Most of the associations with 1, 2 or 3 attributes were discovered, even though more than half of the attribute values were anonymized in each record. (Here, length 1 denotes the frequent items in the data set.) This means that the user is more likely to discover the less sensitive associations.

Fig. 8. (a) Discovery of the quasi-sensitive associations with different values of k in the enhanced RA method; (b) discovery of the quasi-sensitive associations with different lengths (k = 5).

The above experiments demonstrate the following characteristics of our methods. The RA Algorithm 1 is capable of protecting data privacy with only a minor impact on the data distribution. The enhanced RA Algorithm 2 can further decrease the risk of privacy leakage when some original data associations are disclosed. Moreover, by tuning the value of k, it can strike a balance between useful knowledge preservation and privacy protection.

8. Conclusions

In this paper, we introduced a novel data anonymization method based on randomizing the data records.
Unlike generalization methods such as k-anonymization, we regard the data privacy as residing in the links between the Q-I and sensitive values. By randomly replacing part of the values in each record while maintaining the statistical relations in the whole data set, the RA method not only achieves a higher level of privacy protection but also preserves more non-sensitive knowledge than the other anonymization methods, as demonstrated by our experiments. Moreover, the useful associations which are less sensitive can be discovered more accurately than the sensitive ones. In particular, when the association length is 1, our method delivers zero bias. Furthermore, the enhanced RA method provides extra privacy protection when the user has some prior knowledge of the Q-I data.

Data anonymization is a popular direction in the research of privacy-preserving data mining. By integrating our method with other privacy-preserving techniques, we may accomplish more tasks. This work serves as an initial step. Future research will include work on other types of data knowledge and on the definition of data privacy.

References

Aggarwal, C. C. (2005). On k-anonymity and the curse of dimensionality. In Proceedings of the 31st international conference on very large data bases (pp. 901–909). Trondheim, Norway.
Agrawal, D., & Aggarwal, C. C. (2001). On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (pp. 247–255). Santa Barbara, California.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the 20th international conference on very large data bases (pp. 487–499). Santiago, Chile.
Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD international conference on management of data (pp. 439–450). Dallas, Texas.
Aggarwal, C. C., & Yu, P. S. (2008). A general survey of privacy-preserving data mining models and algorithms. In Privacy-preserving data mining (pp. 11–52). US: Springer.
Chen, C., Jin, F., Zhu, X., & Ouyang, G. (1983). Mathematical analysis. Higher Education Press.
Ciriani, V., Vimercati, S., Foresti, S., & Samarati, P. (2008). k-Anonymous data mining: A survey. In Privacy-preserving data mining (pp. 105–136). US: Springer.
Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., & Zhu, M. Y. (2002). Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter, 4(2), 28–34.
Du, W., & Zhan, Z. (2003). Using randomized response techniques for privacy-preserving data mining. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 505–510). Washington, DC.
Evfimievski, A., Gehrke, J., & Srikant, R. (2003). Limiting privacy breaches in privacy preserving data mining. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (pp. 211–222). San Diego, California.
Fung, B. C. M., Wang, K., & Yu, P. S. (2007). Anonymizing classification data for privacy preservation. IEEE Transactions on Knowledge and Data Engineering, 19(5), 711–725.
Giudici, P. (2003). Applied data mining: Statistical methods for business and industry. John Wiley & Sons, Ltd.
Hettich, S., & Bay, S. D. (1999). The UCI KDD archive. URL: http://kdd.ics.uci.edu.
Kifer, D., & Gehrke, J. (2006). Injecting utility into anonymized datasets.
In Proceedings of the 2006 ACM SIGMOD international conference on management of data (pp. 217–228). Chicago, Illinois.
Koudas, N., Srivastava, D., Yu, T., & Zhang, Q. (2007). Aggregate query answering on anonymized tables. In Proceedings of the 23rd international conference on data engineering (pp. 116–125). Istanbul, Turkey.
LeFevre, K., DeWitt, D. J., & Ramakrishnan, R. (2005). Incognito: Efficient full-domain k-anonymity. In Proceedings of the 2005 ACM SIGMOD international conference on management of data (pp. 49–60). Baltimore, Maryland.
LeFevre, K., DeWitt, D. J., & Ramakrishnan, R. (2006). Mondrian multidimensional k-anonymity. In Proceedings of the 22nd international conference on data engineering (p. 25). Atlanta, Georgia.
Li, N., Li, T., & Venkatasubramanian, S. (2007). t-Closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of the 23rd international conference on data engineering (pp. 106–115). Istanbul, Turkey.
Lindell, Y., & Pinkas, B. (2000). Privacy preserving data mining. In Proceedings of the 20th annual international cryptology conference on advances in cryptology (pp. 36–54). Santa Barbara, California.
Machanavajjhala, A., Gehrke, J., Kifer, D., & Venkitasubramaniam, M. (2006). l-Diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd international conference on data engineering (p. 24). Los Alamitos, California.
Meyerson, A., & Williams, R. (2004). On the complexity of optimal k-anonymity. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (pp. 223–228). Paris, France.
Sweeney, L. (2002). Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 571–588.
Xiao, X., & Tao, Y. (2006). Anatomy: Simple and effective privacy preservation. In Proceedings of the 32nd international conference on very large data bases (pp. 139–150). Seoul, Korea.