Relative Privacy Threats and Learning From Anonymized Data

Michele Boreale, Fabio Corradi, and Cecilia Viscardi

Abstract— We consider group-based anonymization schemes, a popular approach to data publishing. This approach aims at protecting the privacy of the individuals involved in a dataset by releasing an obfuscated version of the original data, where the exact correspondence between individuals and attribute values is hidden. When publishing data about individuals, one must typically balance the learner's utility against the risk posed by an attacker, potentially targeting individuals in the dataset. Accordingly, we propose a unified Bayesian model of group-based schemes and a related MCMC methodology to learn the population parameters from an anonymized table. This allows one to analyze the risk for any individual in the dataset to be linked to a specific sensitive value, when the attacker knows the individual's nonsensitive attributes, beyond what is implied for the general population. We call this relative threat analysis. Finally, we illustrate the results obtained with the proposed methodology on a real-world dataset.

Index Terms— Privacy, anonymization, k-anonymity, MCMC methods.

I. INTRODUCTION

WE CONSIDER a scenario where datasets containing personal microdata are released in anonymized form. The goal here is to enable the computation of general population characteristics with reasonable accuracy, while at the same time preventing leakage of sensitive information about individuals in the dataset. The Database of Genotypes and Phenotypes [32], the U.K. Biobank [36] and the UCI Machine Learning repository [47] are well-known examples of repositories providing this type of dataset.

Anonymized datasets always have "personally identifiable information", such as names, SSNs and phone numbers, removed. At the same time, they include information derived from nonsensitive (say, gender, ZIP code, age, nationality) as well as sensitive (say, disease, income) attributes. Certain combinations of nonsensitive attributes, like ⟨gender, date of birth, ZIP code⟩, may be used to uniquely identify a significant fraction of the individuals in a population, thus forming so-called quasi-identifiers. For a given target individual, the victim, an attacker might easily obtain this piece of information (e.g. from personal web pages, social networks etc.), use it to identify him/her within a dataset and learn the corresponding sensitive attributes. This attack was famously demonstrated by L. Sweeney, who identified Massachusetts' Governor Weld's medical record within the Group Insurance Commission (GIC) dataset [46].

Manuscript received January 18, 2019; revised May 2, 2019 and August 7, 2019; accepted August 15, 2019. This paper was presented at the Proceedings of SIS 2017 [5]. The associate editor coordinating the review of this article and approving it for publication was Prof. Xiaodong Lin. (Corresponding author: Fabio Corradi.) The authors are with the Dipartimento di Statistica, Informatica, Applicazioni (DiSIA), Università di Firenze, Florence, Italy (e-mail: fabio.corradi@unifi.it). Digital Object Identifier 10.1109/TIFS.2019.2937640
Note that identity disclosure, that is, the precise identification of an individual's record in a dataset, is not necessary to arrive at a privacy breach: depending on the dataset, an attacker might infer the victim's sensitive information, or even a few highly probable candidate values for it, without any identity disclosure being involved. This more general type of threat, sensitive attribute disclosure, is the one we focus on here.¹

In an attempt to mitigate such privacy threats, regulatory bodies mandate complex, often baroque syntactic constraints on the published data. As an example, here is an excerpt from the HIPAA Safe Harbor de-identification standard [48], which prescribes a list of 18 identifiers that should be removed or obfuscated, such as

    all geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census: (1) the geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and (2) the initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000.

There exists a large body of research, mainly in Computer Science, on syntactic methods. In particular, group-based anonymization techniques have been systematically investigated, starting with L. Sweeney's proposal of k-anonymity [46], followed by its variants, like ℓ-diversity [30] and Anatomy [49]. In group-based methods, the anonymized - or obfuscated - version of a table is obtained by partitioning the set of records into groups, which are then processed to enforce certain properties. The rationale is that, even knowing that an individual belongs to a group of the anonymized table, it should not be possible for an attacker to link that individual to a specific sensitive value in the group. Two examples of group-based anonymization are in Table I, adapted from [9]. The topmost, original table collects medical data from eight individuals; here Disease is considered as the only sensitive attribute. The central table is a 2-anonymous, 2-diverse table: within each group the nonsensitive attribute values have been generalized following group-specific rules (local recoding) so as to make them indistinguishable; moreover, each group features 2 distinct sensitive values.

TABLE I
A TABLE (TOP) ANONYMIZED ACCORDING TO 2-ANONYMITY VIA LOCAL RECODING (MIDDLE) AND ANATOMY (BOTTOM)

¹ Depending on the nature of the dataset, mere membership disclosure, i.e. revealing that an individual is present in a dataset, may also be considered as a privacy breach: think of data about individuals who in the past have been involved in some form of felony. We will not discuss membership disclosure privacy breaches in this paper.
In general,89 each group in a k-anonymous table consists of at least k90 records, which are indistinguishable when projected on the91 nonsensitive attributes; �-diversity additionally requires the92 presence in each group of at least � distinct sensitive values,93 with approximately the same frequency. This is an example94 of horizontal scheme. Table I (c) is an example of application95 of the Anatomy scheme: within each group, the nonsensitive96 part of the rows are vertically and randomly permuted, thus97 breaking the link between sensitive and nonsensitive values.98 Again, the table is 2-diverse.99 In recent years, the effectiveness of syntactic anonymization100 methods has been questioned, as offering weak guarantees101 against attackers with strong background knowledge – very102 precise contextual information about their victims. Differen-103 tial privacy [18], which promises protection in the face of104 arbitrary background knowledge, while valuable in the release105 of summary statistics, still appears not of much use when it 106 comes to data publishing (see the Related works paragraph). 107 As a matter of fact, release of syntactically anonymized tables 108 appears to be the most widespread data publishing practice, 109 with quite effective tool support (see e.g. [37]). 110 In the present paper, discounting the risk posed by attackers 111 with strong background knowledge, we pose the problem in 112 relative terms: given that whatever is learned about the general 113 population from an anonymized dataset represents legitimate 114 and useful information (“smoke is associated with cancer”), 115 one should prevent an attacker from drawing conclusions about 116 specific individuals in the table (“almost certainly the target 117 individual has cancer”): in other words, learning sensitive 118 information for an individual in the dataset, beyond what is 119 implied for the general population. To see what is at stake 120 here, consider dataset (b) in Table I. Suppose that the attacker’s 121 victim is a Malaysian living at ZIP code 45501, and known 122 to belong to the original table. The victim’s record must 123 therefore be in the first group of the anonymized table. The 124 attacker may reason that, with the exception of the first group, 125 a Japanese is never connected to Heart Disease; this hint 126 can become a strong evidence in a larger, real-world table. 127 Then the attacker can link with high probability the Malaysian 128 victim in the first group to Heart Disease. In this attack, 129 the attacker combines knowledge of the nonsensitive attributes 130 of the victim (Malaysian, ZIP code 45501) with the group 131 structure and the knowledge learned from the anonymized 132 table. 133 We propose a unified probabilistic model to reason about 134 such forms of leakage. In doing so, we clearly distinguish the 135 position of the learner from that of the attacker: the resulting 136 notion is called relative privacy threat. In our proposal, both 137 the learner and the attacker activities are modeled as forms 138 of Bayesian inference: the acquired knowledge is represented 139 as a joint posterior probability distribution over the sensitive 140 and nonsensitive values, given the anonymized table and, 141 in the case of the attacker, knowledge of the victim’s presence 142 in the table. A comparison between these two distributions 143 determines what we call relative privacy threat. 
Since posterior distributions are in general impossible to express analytically, we also put forward an MCMC method to practically estimate such posteriors, and we illustrate the results of applying our method to the Adult dataset from the UCI Machine Learning repository [47], a common benchmark in anonymization research.

A. Related Works

Sweeney's k-anonymity [46] is among the most popular proposals aiming at a systematic treatment of syntactic anonymization of microdata. The underlying idea is that every individual in the released dataset should be hidden in a "crowd of k". Over the years, k-anonymity has proven to provide weak guarantees against attackers who know much about their victims, that is, have strong background knowledge. For example, an attacker may know from sources other than the released data that his victim does not suffer from certain diseases, thus ruling out all possibilities but one in the victim's group. Additional constraints may be enforced in order to mitigate those attacks, like ℓ-diversity [30] and t-closeness [27]. Differential Privacy [18] promises protection in the face of arbitrary background knowledge. In its basic, interactive version, this means that, when querying a database via a differentially private mechanism, one will get approximately the same answers whether the data of any specific individual is included in the database or not. This is typically achieved by injecting controlled levels of noise into the reported answer, e.g. Laplacian noise. Differential Privacy is very effective when applied to certain summary statistics, such as histograms. However, it raises a number of difficulties when applied to table publishing: in concrete cases, the level of noise necessary to guarantee an acceptable degree of privacy would destroy utility [12], [13], [44]. Moreover, due to correlation phenomena, it appears that Differential Privacy cannot in general be used to control evidence about the participation of individuals in a database [4], [26]. In fact, the no-free-lunch theorem of Kifer and Machanavajjhala [26] implies that it is impossible to guarantee both privacy and utility without making assumptions about how the data have been generated (e.g., independence assumptions). Clifton and Tassa [10] critically review issues and criticisms involved in both syntactic methods and Differential Privacy, concluding that both have their place, in Privacy-Preserving Data Publishing and Data Mining, respectively. Both approaches have issues that call for further research. A few proposals involve blending the two approaches, with the goal of achieving both strong privacy guarantees and utility, see e.g. [28].

A major source of inspiration for our work has been Kifer's [25]. The main point of [25] is to demonstrate a pitfall of the random worlds model, where the attacker is assumed to assign equal probability to all cleartext tables compatible with the given anonymized one. Kifer shows that a Bayesian attacker willing to learn from the released table can draw sharper inferences than those possible in the random worlds model.
In particular, Kifer shows that it is possible to extract from (anatomized) ℓ-diverse tables belief probabilities greater than 1/ℓ, by means of the so-called deFinetti attack. While pinpointing a deficiency of the random worlds model, it is questionable whether this should be considered an attack, or just a legitimate learning strategy. Quoting [10] on the deFinetti attack:

    The question is whether the inference of a general behavior of the population in order to draw belief probabilities on individuals in that population constitutes a breach of privacy (...). To answer this question positively for an attack on privacy, the success of the attack when launched against records that are part of the table should be significantly higher than its success against records that are not part of the table. We are not aware of such a comparison for the deFinetti attack.

It is this very issue that we tackle in the present paper. Specifically, our main contribution here is to put forward a concept of relative privacy threat, as a means to assess the risks implied by publishing tables anonymized via group-based methods. To this end, we introduce: (a) a unified probabilistic model for group-based schemes; (b) rigorous characterizations of the learner's and the attacker's inference, based on Bayesian reasoning; and (c) a related MCMC method, which generalizes and systematizes that proposed in [25].

Very recently, partly inspired by differential privacy, a few authors have considered what might be called a relative or differential approach to assessing privacy threats, in conjunction with some notion of learning or inference from the anonymized data. Especially relevant to our work is differential inference, introduced in a recent paper by Kassem et al. [24]. These authors make a clear distinction between two different types of information that can be inferred from anonymized data: learning of "public" information, concerning the population, should be considered as legitimate; on the contrary, leakage of "private" information about individuals should be prevented. To make this distinction formal, given a dataset, they compare two probability distributions that can be machine-learned from two distinct training sets: one including and one excluding a target individual. An attack exists if there is a significant difference between the two distributions, measured e.g. in terms of Earth Mover's Distance. While similar in spirit to ours, this approach is conceptually and technically different from what we do here. Indeed, in our case the attacker explicitly takes advantage of the extra piece of information concerning the presence of the victim in the dataset to attack the target individual, which leads to a more direct notion of privacy breach. Moreover, in [24] a Bayesian approach to inference is not clearly posed, so the obtained results lack a semantic foundation and strongly depend on the adopted learning algorithm. Pyrgelis et al. [39] use Machine Learning for membership inference on aggregated location data, building a binary classifier that can be used to predict whether a target user is part of the aggregate data or not. A similar goal is pursued in [35]. Again, a clear semantic foundation of these methods is lacking, and the obtained results can be validated only empirically.
In a similar vein, [3] and [17] have proposed statistical techniques to detect privacy violations, but they only apply to differential privacy. Other works, such as [23] and [33], have considered only the problem of how to effectively learn from anonymized datasets, not that of how to characterize legitimate, as opposed to non-legitimate, inference.

On the side of the random worlds model, Chi-Wing Wong et al.'s work [9] shows how information on the population extracted from the anonymized table - in the authors' words, the foreground knowledge - can be leveraged by the attacker to violate the privacy of target individuals. The underlying reasoning, though, is based on the random worlds model, hence is conceptually and computationally very different from the Bayesian model adopted in the present paper. Bewong et al. [2] assess relative privacy threat for transactional data by a suitable extension of the notion of t-closeness, which is based on comparing the relative frequency of the victim's sensitive attribute in the whole table with that in the victim's group. Here the underlying assumption is that the attacker's prior knowledge about sensitive attributes matches the public knowledge, and that the observed sensitive attribute frequencies provide good estimates both for the public knowledge and the attacker's belief. Our proposal yields more sophisticated estimates via a Bayesian inferential procedure. Moreover, in our scenario the assumption on the attacker's knowledge is relaxed, requiring only knowledge of the victim's presence in some group of the table.

A concept very different from the previously discussed proposals is Rubin's multiple imputation approach [43], by which only tables of synthetic data, generated by sampling from a predictive distribution learned from the original table, are released. This avoids syntactic masking/obfuscation, whose analysis requires customized algorithms on the part of the learner, and leaves to the data producer the burden of synthesis. Note that this task can be nontrivial and raises a number of difficulties concerning the availability of auxiliary variables for non-sampled units, see [42]. In Rubin's view, synthetic data overcome all privacy concerns, in that no real individual's data is actually released. However, this position has been questioned, on the grounds that information about participants may leak through the chain: original table → posterior parameters → synthetic tables. In particular, Machanavajjhala et al. [31] study Differential Privacy of synthetic categorical data. They show that the release of such data can be made differentially private, at the cost of introducing very powerful priors. However, such priors can lead to a serious distortion in whatever is learned from the data, thus compromising utility. In fact, [50] argues that, in concrete cases, the required pseudo sample size hyperparameter could be larger than the size of the table. Experimental studies [7], [8] appear to confirm that such distorting priors are indeed necessary for released synthetic data to provide acceptable guarantees, in the sense of Differential Privacy. See [50] for a recent survey of results about synthetic data release and privacy.
An outline of the model presented here, with no proofs of correctness, appeared in the conference paper [5].

B. Structure of the Paper

The rest of the paper is organized as follows. In Section II we propose a unified formal definition of vertical and horizontal schemes. In Section III we put forward a probabilistic model to reason about the learner's and the attacker's inference; the case of prior partial knowledge of the victim's attributes on the part of the attacker is also covered. Based on that, measures of (relative) privacy threat and utility are introduced in Section IV. In Section V, we study an MCMC algorithm to learn the population parameters posterior and the attacker's probability distribution from the anonymized data. In Section VI, we illustrate the results of an experiment conducted on a real-world dataset. A few concluding remarks and perspectives for future work are reported in Section VII. Some technical material has been confined to Appendix A.

II. GROUP BASED ANONYMIZATION SCHEMES

A dataset consists of a collection of rows, where each row corresponds to an individual. Formally, let R and S, ranged over by r and s respectively, be finite non-empty sets of nonsensitive and sensitive values, respectively. A row is a pair (s, r) ∈ S × R. There might be more than one sensitive and nonsensitive characteristic, so s and r can be thought of as vectors.

A group-based anonymization algorithm A is an algorithm that takes a multiset of rows as input and yields an obfuscated table as output, according to the scheme

    multiset of rows → cleartext table → obfuscated table.

Formally, fix N ≥ 1. Given a multiset of N rows, d = {|(s1, r1), . . . , (sN, rN)|}, A will first arrange d into a sequence of groups, t = g1, . . . , gk, the cleartext table. Each group is in turn a sequence of ni rows, gi = (si,1, ri,1), . . . , (si,ni, ri,ni), where ni can vary from group to group. Note that both the number of groups, k ≥ 1, and the number of rows in each group, ni, depend in general on the original multiset d as well as on properties of the considered algorithm – such as ensuring k-anonymity and ℓ-diversity (see below). The obfuscated table is then obtained as a sequence t∗ = g∗1, . . . , g∗k, where the obfuscation of each group gi is a pair g∗i = (mi, li). Here, each mi = si,1, . . . , si,ni is the sequence of sensitive values occurring in gi; each li, called generalized nonsensitive value, is one of the following:

• for horizontal schemes, a superset of gi's nonsensitive values: li ⊇ {ri,1, . . . , ri,ni};
• for vertical schemes, the multiset of gi's nonsensitive values: li = {|ri,1, . . . , ri,ni|}.

Note that the generalized nonsensitive values in vertical schemes include all and only the values, with multiplicities, found in the corresponding original group. On the other hand, generalized nonsensitive values in horizontal schemes may include additional values, thus generating a superset. What values enter the superset depends on the adopted technique, e.g. micro-aggregation, generalization or suppression; in any case this makes the rows in each group indistinguishable when projected onto the nonsensitive attributes. For example, each of 45501, 45502 is generalized to the superset 4550∗ = {45500, 45501, . . . , 45509} in the first group of Table I(b).
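To make the two kinds of generalized nonsensitive value concrete, the following Python sketch (ours, purely illustrative and not from the paper; the ZIP-prefix rule generalize_zip is an assumed example, not prescribed by the schemes) obfuscates a single group under a horizontal scheme and under a vertical scheme:

```python
from collections import Counter

# A group is a list of (sensitive, nonsensitive) rows, here (disease, ZIP code).
group = [("Heart", "45501"), ("Flu", "45502")]

def generalize_zip(zip_code):
    # 4550* stands for the superset {45500, ..., 45509}
    return zip_code[:-1] + "*"

def obfuscate_horizontal(group):
    # m_i: sequence of sensitive values; l_i: a superset of the nonsensitive
    # values, namely all ZIP codes sharing the generalized prefix.
    m = [s for s, _ in group]
    prefixes = {generalize_zip(r) for _, r in group}
    assert len(prefixes) == 1, "all rows must share the same generalization"
    prefix = prefixes.pop()[:-1]
    l = {prefix + str(d) for d in range(10)}
    return m, l

def obfuscate_vertical(group):
    # m_i: sequence of sensitive values; l_i: the multiset of the original
    # nonsensitive values (the row-to-value link is lost).
    m = [s for s, _ in group]
    l = Counter(r for _, r in group)
    return m, l

print(obfuscate_horizontal(group))   # (['Heart', 'Flu'], {'45500', ..., '45509'})
print(obfuscate_vertical(group))     # (['Heart', 'Flu'], Counter of the two ZIPs)
```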
Sometimes it will be notationally convenient to ignore the group structure of t altogether, and regard the cleartext table t simply as a sequence of rows, (s1, r1), (s2, r2), . . . , (sN, rN). Each row (sj, rj) is then uniquely identified within the table t by its index 1 ≤ j ≤ N.

An instance of horizontal schemes is k-anonymity [46]: in a k-anonymous table, each group consists of at least k ≥ 1 rows, where the different nonsensitive values appearing within each group have been generalized so as to make them indistinguishable. In the most general case, different occurrences of the same nonsensitive value might be generalized in different ways, depending on their position (index) within the table t: this is the case of local recoding. Alternatively, each occurrence of a nonsensitive value is generalized in the same way, independently of its position: this is the case of global recoding. Further conditions may be imposed on the resulting anonymized table, such as ℓ-diversity, requiring that at least ℓ ≥ 1 distinct values of the sensitive attribute appear in each group. Table I (center) shows an example of a k = 2-anonymous and ℓ = 2-diverse table: in each group the nonsensitive values are indistinguishable and two different sensitive values (diseases) appear in each group.

An instance of vertical schemes is Anatomy [49]: within each group, the link between the sensitive and nonsensitive values is hidden by randomly permuting one of the two parts, for example the nonsensitive one. As a consequence, an anatomized table may be seen as consisting of two sub-tables: a sensitive and a nonsensitive one. Table I (c) shows an example of an anatomized table: in the nonsensitive sub-table, the reference to the corresponding sensitive values is lost; only the multiset of nonsensitive values appears for each group.

Remark 1 (disjointness): Some anonymization schemes enforce the following disjointness property on the obfuscated table t∗:

    Any two generalized nonsensitive values in t∗ are disjoint: i ≠ j implies li ∩ lj = ∅.

We need not assume this property in our treatment - although assuming it may be computationally useful in practice (see Section III).

For ease of reference, we provide a summary of the notation that will be used throughout the paper in Table II.

TABLE II
SUMMARY OF NOTATION

III. A UNIFIED PROBABILISTIC MODEL

We provide a unified probabilistic model for reasoning on group-based schemes. We first introduce the random variables of the model together with their joint density function. On top of these variables, we then define the probability distributions on S × R that formalize the learner's and the attacker's knowledge, given the obfuscated table.

A. Random Variables

The model consists of the following random variables.
• Θ, taking values in the set of full support probability distributions D over S × R, is the joint probability distribution of the sensitive and nonsensitive attributes in the population.
• T = G1, . . . , Gk, taking values in the set of cleartext tables T. Each group Gi is in turn a sequence of ni ≥ 1 consecutive rows in T, Gi = (Si,1, Ri,1), . . . , (Si,ni, Ri,ni).
The number of groups k is not fixed, but depends on the anonymization scheme and the specific tuples composing T.
• T∗ = G∗1, . . . , G∗k, taking values in the set of obfuscated tables T∗.

We assume that the above three random variables form a Markov chain:

    Θ → T → T∗.   (1)

In other words, uncertainty on T is driven by Θ, and T∗ solely depends on the table T and the underlying obfuscation algorithm. As a result, T∗ ⊥⊥ Θ | T. Equivalently, the joint probability density function f of these variables can be factorized as follows, where π, t, t∗ range over D, T and T∗, respectively:

    f(π, t, t∗) = f(π) f(t|π) f(t∗|t).   (2)

Additionally, we shall assume the following:
• π ∈ D is encoded as a pair π = (πS, πR|S) where πR|S = {πR|s : s ∈ S}. Here, πS are the parameters of a full support categorical distribution over S, and, for each s ∈ S, πR|s are the parameters of a full support categorical distribution over R. For each (s, r) ∈ S × R

    f(s, r|π) = f(s|πS) · f(r|πR|s).

We also posit that πS and the πR|s's are chosen independently, according to Dirichlet distributions of hyperparameters α = (α1, . . . , α|S|) and β^s = (β^s_1, . . . , β^s_|R|), respectively. In other words

    f(π) = Dir(πS | α) · ∏_{s∈S} Dir(πR|s | β^s).   (3)

The hyperparameters α and β may incorporate prior (background) knowledge on the population, if this is available. Otherwise, an uninformative prior can be chosen by setting αi = β^s_j = 1 for each i, s, j. When r ∈ R is a tuple of attributes, we shall assume conditional independence of those attributes given s, so that the joint probability of r|s can be determined by factorization.
• The N individual rows composing the table t, say (s1, r1), . . . , (sN, rN), are assumed to be drawn i.i.d. according to f(·|π). Equivalently

    f(t|π) = f(s1, r1|π) · · · f(sN, rN|π).   (4)

Instances of the above model can be obtained by specifying an anonymization mechanism A. In particular, the distribution f(t∗|t) only depends on the obfuscation algorithm that is adopted, say obf(t). In the important special case where obf(t) acts as a deterministic function on tables, f(t∗|t) = 1 if and only if obf(t) = t∗, otherwise f(t∗|t) = 0.

B. Learner and Attacker Knowledge

We shall denote by pL the probability distribution over S × R that can be learned given the anonymized table t∗. This distribution we take to be the average of f(s, r|π) with respect to the density f(Θ = π | T∗ = t∗). Formally, for each (s, r) ∈ S × R:

    pL(s, r|t∗) := E_{π∼f(π|t∗)}[ f(s, r|π) ] = ∫_D f(s, r|π) f(π|t∗) dπ.   (5)

Of course, we can condition pL on any given r and obtain the conditional probability pL(s|r, t∗). Equivalently, we can compute

    pL(s|r, t∗) := E_{π∼f(π|t∗)}[ f(s|r, π) ] = ∫_D f(s|r, π) f(π|t∗) dπ.   (6)

In particular, one can read off this distribution on a victim's nonsensitive attribute, say rv, and obtain the corresponding distribution on S.
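The expectations in (5) and (6) can be approximated by plain Monte Carlo once draws from f(π|t∗) are available (obtaining such draws is the subject of Section V). The following Python sketch is ours, not part of the paper; the data layout (lists S and R of values, and pi_samples as a list of pairs (πS, πR|S), the latter a dictionary from s to a probability vector over R) is an assumption made for illustration:

```python
import numpy as np

def estimate_p_L(pi_samples, S, R):
    # Monte Carlo version of (5): average f(s, r | pi) over the draws of pi.
    p = np.zeros((len(S), len(R)))
    for pi_S, pi_R_given_S in pi_samples:
        p += pi_S[:, None] * np.array([pi_R_given_S[s] for s in S])
    return p / len(pi_samples)

def estimate_p_L_conditional(pi_samples, S, R, r):
    # Monte Carlo version of (6): average f(s | r, pi) = f(s, r | pi) / f(r | pi).
    p = np.zeros(len(S))
    j = R.index(r)
    for pi_S, pi_R_given_S in pi_samples:
        joint = pi_S * np.array([pi_R_given_S[s][j] for s in S])
        p += joint / joint.sum()
    return p / len(pi_samples)
```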
We shall assume the attacker knows the values of T∗ = t∗ and the nonsensitive value rv of a target individual, the victim; moreover, the attacker knows the victim is an individual in the table. Accordingly, in what follows we fix once and for all t∗ and rv: these are the values observed by the attacker. Given knowledge of a victim's nonsensitive attribute rv and knowledge that the victim is actually in the table T, we can define the attacker's distribution on S as follows.

Let us introduce in the above model a new random variable V, identifying the index of the victim within the cleartext table T. We posit that V is uniformly distributed on {1, . . . , N}, and independent from Θ, T, T∗. Recalling that each row (Sj, Rj) is identified within T by a unique index j, we can define the attacker's probability distribution on S, after seeing t∗ and rv, as follows, where it is assumed that f(RV = rv, t∗) > 0, that is, the observed victim's rv is compatible with t∗:

    pA(s|rv, t∗) := f(SV = s | RV = rv, t∗).   (7)

The following crucial lemma provides us with a characterization of the above probability distribution that is based only on a selection of the marginals Rj given t∗. This will be the basis for actually computing pA(s|rv, t∗). Note that, on the right-hand side, only those rows whose sensitive value - known from t∗ - is s contribute to the summation. A proof of the lemma is reported in Appendix A.

Lemma 1: Let T = (Sj, Rj), j = 1, . . . , N. Let sj be the sensitive value in the j-th entry of t∗. Let rv and t∗ be such that f(RV = rv, t∗) > 0. Then

    pA(s|rv, t∗) ∝ Σ_{j: sj = s} f(Rj = rv | t∗).   (8)

Note that the disjointness of the generalized nonsensitive values of the groups can make the computation of (8) more efficient, restricting the summation on the right-hand side to a unique group.

Example 1: In order to illustrate the difference between the learner's and the attacker's inference, we reconsider the toy example in the Introduction. Let t∗ be the 2-anonymous, 2-diverse Table I(b). Assume the attacker's victim is the first individual of the original dataset, who is from Malaysia (= M) and lives in the ZIP code 45501 area, hence rv = (M, 45501). Table III shows the belief probabilities of the learner, pL(s|rv, t∗), and of the attacker, pA(s|rv, t∗), for the victim's disease s. We also include the random worlds model probabilities, pRW(s|rv, t∗), which are just proportional to the frequency of each sensitive value within the victim's group. Note that the learner and the attacker distributions have the same mode, but the attacker is more confident about his prediction of the victim's disease. The random worlds model produces a multi-modal solution.

TABLE III
POSTERIOR DISTRIBUTIONS OF DISEASES FOR A VICTIM WITH rv = (M, 45501), FOR THE ANONYMIZED t∗ IN TABLE I(B). NB: FIGURES AFFECTED BY ROUNDING ERRORS

As to the computation of the probabilities in Table III, a routine application of the equations (2)–(8) shows that pL and pA reduce to the expressions (9) and (10) below, given in terms of the model's density (2). The crucial point here is that the adversary knows the group his victim is in, i.e. the first two lines of t∗ in the example. Below, s ∈ S; for j = 1, 2, sj denotes the sensitive value of the j-th row, while t is a cleartext table, from which t−j is obtained by removing (sj, rv). It is assumed that the obfuscation algorithm A is deterministic, so that f(t∗|t) ∈ {0, 1}.
    pL(s|rv, t∗) ∝ ∫_D f(π) f(s, rv|π) Σ_{t: A(t)=t∗} f(t|π) dπ   (9)

    pA(sj|rv, t∗) ∝ ∫_D f(π) f(sj, rv|π) Σ_{t−j: A(t)=t∗} f(t−j|π) dπ.   (10)

Unfortunately, the analytic computation of the above integrals, even for the considered toy example, is a daunting task. For instance, the summation in (9) has as many terms as t∗-compatible tables t, that is 6.4 × 10^5 for Example 1 – although the resulting expression can be somewhat simplified using the independence assumption (4). Accordingly, the figures in Table III have been computed resorting to simulation techniques, see Section V.

An alternative, more intuitive description of the inference process is as follows. The learner and the attacker first learn the parameters π given t∗, that is, they evaluate f(πDis|t∗), f(πZIP|s|t∗) and f(πNat|s|t∗), for all s ∈ S. Due to the uncertainty on the ZIP code and/or Nationality, learning π takes the form of a mixture (this is akin to learning with soft evidence, see Corradi et al. [11]). After that, the learner, ignoring that the victim is in the table, predicts the probability of rv, pL(rv|s, t∗), for all s, by using a mixture of Multinomial-Dirichlet. The attacker, on the other hand, while still basing his prediction pA(rv|s, t∗) on the parameter learning outlined above, restricts his attention to the first two lines of t∗, thus realizing that s ∈ {Heart, Flu}. Then, by Bayes' theorem, and adopting the relative frequencies of the diseases in t∗ as an approximation of f(s|t∗), the posterior probability of the diseases for the victim can be computed.

Remark 2 (attacker's inference and forensic identification): The attacker's inference is strongly reminiscent of two famous settings in forensic science: the Island Problem (IP) and the Database Search Problem (DBS), see e.g. [1], [14] and more recently [45]. On an island with N inhabitants a crime is committed; a characteristic of the criminal (e.g. a DNA trait) is found at the crime scene. It is known that the island's inhabitants possess this characteristic independently with probability p. It is assumed that there exists exactly one culprit C on the island. In IP, one of the island's inhabitants, I, the suspect, is found to have the given characteristic, while the others are not tested. An investigator is interested in the probability that I = C.

When we cast this scenario in our framework, the individuals in the table play the role of the inhabitants (including the culprit), while rv plays the role of the characteristic found at the crime scene, matching that of the suspect. In other words - perhaps ironically - our framework's victim plays here the role of the suspect, while our attacker is essentially the investigator. Letting S = {0, 1} (innocent/guilty) and R = {0, 1} (characteristic absent/present), the investigator's information is then summarized by an obfuscated horizontal table t∗ of N rows with as many groups, where exactly one row, say the j-th, has Sj = 1 and R∗j = Rj = 1 (the culprit), while for i ≠ j, Si = 0 and R∗i = ∗ (the N − 1 innocent inhabitants).
Recalling that the variable V in our framework represents the suspect's index within the table, the probability that I = C is

    Pr(V = j | RV = 1, t∗) = Pr(SV = 1 | RV = 1, t∗) = pA(s = 1 | rv = 1, t∗).

Then applying (8), we find

    pA(s = 1|rv = 1, t∗) = f(Rj = 1|t∗) / [ f(Rj = 1|t∗) + (N − 1) f(R_{i≠j} = 1|t∗) ]
                         = 1 / [ 1 + (N − 1) f(R_{i≠j} = 1|t∗) ].   (11)

By taking suitable prior hyperparameters, f(R_{i≠j} = 1|t∗) can be made arbitrarily close to p. For ease of comparison with the classical IP and DBS settings, rather than relying on a learning procedure, we just assume here f(Ri = 1|t∗) = p for i ≠ j, so that (11) simplifies to

    pA(s = 1|rv = 1, t∗) = 1 / (1 + (N − 1)p)   (12)

which is the classical result known from the literature.

In DBS, the indicted individual exhibiting rv is found after testing 1 ≤ k < N individuals that do not exhibit rv. This means the table t∗ now consists of k rows (s, r) = (0, 0) (the k innocent, tested inhabitants not exhibiting rv), one row (s, r) = (1, 1) (the culprit) and N − 1 − k rows (s, r∗) = (0, ∗) (the N − 1 − k innocent, non-tested inhabitants). Accordingly, letting j = k + 1, and possibly after rearranging indices, (11) becomes

    pA(s = 1|rv = 1, t∗) = f(R_{k+1} = 1|t∗) / [ f(R_{k+1} = 1|t∗) + k·f(R_{i∈{1,...,k}} = 1|t∗) + (N − 1 − k)·f(R_{i>k+1} = 1|t∗) ].   (13)

Letting f(Ri = 1|t∗) = p for i > k + 1, equation (13) becomes

    pA(s = 1|rv = 1, t∗) = 1 / (1 + (N − 1 − k)p)

which again is the classical result known from the literature. Finally, note that our methodology also covers the possibility of learning about the probability of the characteristic, f(Ri = 1|t∗), but here we have only stressed how the attacker strategy solves the IP and DBS forensic problems. Uncertainty about population parameters and identification has been considered elsewhere by one of us [6].

We now briefly discuss an extension of our framework to the more general case where the attacker has only partial information about his victim's nonsensitive attributes. For a typical application, think of a dataset where R and S are individuals' genetic profiles and diseases, respectively, with an adversary knowing only a partial DNA profile of his victim, e.g. only the alleles at a few loci. Formally, fix a nonempty set Y and let g : R → Y be a (typically non-injective) function, modeling the attacker's observation of the victim's nonsensitive attribute. With the notation introduced above, consider the random variable Y := g(RV). It is natural to extend definition (7) as follows, where g(rv) = yv ∈ Y and f(Y = yv, t∗) > 0:

    pA(s|yv, t∗) := f(SV = s | Y = yv, t∗).   (14)

It is a simple matter to check that (8) becomes the following, where g−1(y) ⊆ R denotes the counter-image of y according to g:

    pA(s|yv, t∗) ∝ Σ_{j: sj = s} f(Rj ∈ g−1(yv) | t∗).   (15)

Also note that one has f(Rj ∈ g−1(yv) | t∗) = Σ_{r∈g−1(yv)} f(Rj = r | t∗). An extension to the case of partial and noisy observations can be modeled similarly, by letting Y = g(RV, E), where E is a random variable representing an independent source of noise. We leave the details of this extension for future work.

IV. MEASURES OF PRIVACY THREAT AND UTILITY

We are now set to define the measures of privacy threat and utility we are after.
We will do so from the point of view of a person or entity, the evaluator, who:

(a) has a copy of the cleartext table t, and can build an obfuscated version t∗ of it;
(b) must decide whether to release t∗ or not, weighing the privacy threats and the utility implied by this act.

The evaluator clearly distinguishes the position of the learner from that of the attacker. The learner is interested in learning from t∗ the characteristics of the general population, via pL. The attacker is interested in learning from t∗ the sensitive value of a target individual, the victim, via pA. The latter probability distribution is derived by exploiting the additional piece of information that the victim is an individual known to be in the original table, of whom the attacker gets to know the nonsensitive values. As pointed out in [34], information about the victim's nonsensitive attributes can be easily gathered from other sources such as personal blogs and social networks. These assumptions about the attacker's knowledge allow a comparison between the risks of a sensitive attribute disclosure for an individual who is part of the table and for individuals who are not. The evaluator adopts the following relative, or differential, point of view:

    a situation where, for some individual, pA conveys much more information than that conveyed by pL (the learner's legitimate inference on the general population), must be deemed a privacy threat.

Generally speaking, the evaluator should refrain from publishing t∗ if, for some individual, the level of relative privacy threat exceeds a predefined threshold. Concerning the definition of the level of threat, the evaluator adopts the following Bayesian decision-theoretic point of view. Whatever distribution p is adopted to guess the victim's sensitive value, the attacker is faced with some utility function. Here, we consider a simple 0-1 utility function for the attacker, yielding 1 if the sensitive attribute is guessed correctly and 0 otherwise. The resulting attacker's expected utility is maximized by the Bayes act, i.e. by choosing s = argmax_{s′∈S} p(s′), and equals p(s). The above discussion leads to the following definitions. Note that we consider threat measures both for individual rows and for the overall table. For each threatened row, the relative threat index Ti says how many times the probability of correctly guessing the secret is increased by the attacker's activity, i.e. by exploiting the knowledge of the victim's presence in the table. At a global, table-wise level, the evaluator also considers the fraction GTA of rows threatened by the attacker.

Definition 1 (privacy threat): We define the following privacy threat measures.
• Let q be a full support distribution on S and (s, r) be a row in t. We say (s, r) is threatened under q if q(s) = max_{s′} q(s′), and that its threat level under q is q(s).
• For a row (s, r) in t that is threatened by pA(·|r, t∗), its relative threat level is

    Ti(s, r, t, t∗) := pA(s|r, t∗) / pL(s|r, t∗).   (16)

• Let NA(t, t∗) be the number of rows (s, r) in t threatened by pA(·|r, t∗).
The global threat level GTA(t, t∗) is the fraction of rows that are threatened, that is

    GTA(t, t∗) := NA(t, t∗) / N.   (17)

Similarly, we denote by GTL(t, t∗) the fraction of rows (s, r) in t that are threatened under pL(·|r, t∗).
• As a measure of how much better the attacker performs than the learner at a global level, we introduce the relative global threat:

    RGTA(t, t∗) := max{0, GTA(t, t∗) − GTL(t, t∗)}.   (18)

Remark 3 (setting a threshold for Ti): A difficult issue is how to set an acceptable threshold for the relative threat level Ti. This is conceptually very similar to the question of how to set the level of ε in differential privacy: its proponents have always maintained that the setting of ε is a policy question, not a technical one. Much depends on the application at hand. For instance, when the US Census Bureau adopted differential privacy, this task was delegated to a committee (the Data Stewardship Executive Policy committee, DSEP); details on the operations of this committee can be found in [19, Sect. 3.1]. We think that similar considerations apply when setting the threshold of Ti. For instance, an evaluator might consider the distribution of the Ti values in the dataset (see Fig. 3a–3h in Section VI) and then choose a percentile as a cutoff.

The evaluator is also interested in the potential utility conveyed by an anonymized table for a learner. Note that the learner's utility is distinct from the attacker's. Indeed, the learner's interest is to make inferences that are as close as possible to the ones that could be made using the cleartext table. Accordingly, obfuscated tables that are faithful to the original table are the most useful. This leads us to compare two distributions on the population: the distribution learned from the anonymized table, pL, and the ideal (I) distribution, pI, one can learn from the cleartext table t. The latter is formally defined as the expectation² of f(s, r|π) under the posterior density f(π|t). Explicitly, for each (s, r)

    pI(s, r|t) := ∫_D f(s, r|π) f(π|t) dπ.   (19)

Note that the posterior density f(π|t) is in turn a Dirichlet density (see next section) and therefore a simple closed form of the above expression exists, based on the frequencies of the pairs (s, r) in t. In particular, recalling the αs, β^s_r notation for the prior hyperparameters introduced in Section III, let α0 = Σ_s αs and β^s_0 = Σ_r β^s_r, and let γs(t) and δ^s_r(t) denote the frequency counts of s and (s, r), respectively, in t. Then we have

    pI(s, r|t) = [(αs + γs(t)) / (α0 + N)] · [(β^s_r + δ^s_r(t)) / (β^s_0 + γs(t))].   (20)

² Another sensible choice would be taking pI(s, r|t) = f(s, r|πMAP), where πMAP = argmax_π f(π|t) is the maximum a posteriori distribution given t. This choice would lead to essentially the same results.

The comparison between pL and pI can be based on some form of distance between distributions. One possibility is to rely on the total variation (aka statistical) distance. Recall that, for discrete distributions q, q′ defined on the same space X, the total variation distance is defined as

    TV(q, q′) := sup_{A⊆X} |q(A) − q′(A)| = (1/2) Σ_x |q(x) − q′(x)|.

Note that TV(q, q′) ∈ [0, 1]. Note also that this is a quite conservative notion of diversity, since it is based on the event that shows the largest difference between the two distributions.

Definition 2 (faithfulness): The relative faithfulness level of t∗ w.r.t. t is defined as

    RF(t, t∗) := 1 − TV( pI(·|t), pL(·|t∗) ).
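In code form, the quantities of Definitions 1 and 2 amount to a few lines. The following Python sketch is illustrative only (not the authors' implementation); it assumes that the conditional distributions pA, pL are available as dictionaries mapping a nonsensitive value r to a probability vector over S, and that the joint pI, pL are available as arrays over S × R for the faithfulness computation:

```python
import numpy as np

# rows is the cleartext table t as a list of (s, r) pairs; S is the list of
# sensitive values, so that S.index(s) gives the position of s in each vector.

def is_threatened(q, s_idx):
    # (s, r) is threatened under q if s attains the maximum of q
    return np.isclose(q[s_idx], q.max())

def relative_threat(p_A_cond, p_L_cond, s_idx, r):
    # Ti of (16): how many times the attacker improves over the learner
    return p_A_cond[r][s_idx] / p_L_cond[r][s_idx]

def global_threats(rows, p_A_cond, p_L_cond, S):
    # GT_A, GT_L of (17) and RGT_A of (18)
    N = len(rows)
    n_A = sum(is_threatened(p_A_cond[r], S.index(s)) for s, r in rows)
    n_L = sum(is_threatened(p_L_cond[r], S.index(s)) for s, r in rows)
    GT_A, GT_L = n_A / N, n_L / N
    return GT_A, GT_L, max(0.0, GT_A - GT_L)

def relative_faithfulness(p_I_joint, p_L_joint):
    # RF of Definition 2; arguments are joint arrays over S x R
    return 1.0 - 0.5 * np.abs(p_I_joint - p_L_joint).sum()
```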
Remark 4: In practice, the total variation of two high-dimensional distributions might be very hard to compute. Pragmatically, we note that, for M large enough, TV(q, q′) = (1/2) E_{x∼q}[ |1 − q′(x)/q(x)| ] ≈ (1/(2M)) Σ_{i=1}^M |1 − q′(xi)/q(xi)|, where the xi are drawn i.i.d. according to q(x). Then a proxy to total variation is the empirical total variation defined below, where (si, ri), for i = 1, . . . , M, are generated i.i.d. according to pI(·, ·|t):

    ETV(t, t∗) := (1/(2M)) Σ_{i=1}^M | 1 − pL(si, ri|t∗) / pI(si, ri|t) |.   (21)

Remark 5 (ideal knowledge vs. attacker's knowledge): The following scenario is meant to further clarify the extra power afforded to the attacker by the mere knowledge that his victim is in the table. Consider a trivial anonymization mechanism that simply releases the cleartext table, that is, t∗ = t. As pL = pI in this case, it would be tempting to conclude that the attacker cannot do better than the learner, hence there is no relative risk involved. However, this conclusion is wrong: for instance, pI(·|rv, t) can fail to predict the victim's correct sensitive value if this value is rare, as we show below.

For the sake of simplicity, consider the case where the observed victim's nonsensitive attribute rv occurs just once in t, in a row (s0, rv). Also assume a noninformative Dirichlet prior, that is, in the notation of Section III, set the hyperparameters to αs = β^s_r = 1 for each s ∈ S, r ∈ R. Then, simple calculations based on (20) and the attacker's distribution characterization (8) show the following. Here, for each s ∈ S, γs = γs(t) denotes the frequency count of s in t, and c is a suitable normalizing constant:

    pI(s|rv, t) = c · (1 + γs) / (|R| + γs),          if s ≠ s0
    pI(s|rv, t) = c · 2(1 + γs0) / (|R| + γs0),       if s = s0

    pA(s|rv, t∗) = 0, if s ≠ s0;   pA(s|rv, t∗) = 1, if s = s0.   (22)

As far as the target individual (s0, rv) ∈ t is concerned, we see that while pA predicts s0 with certainty, predictions based on pL = pI will be blatantly wrong if there are values s ≠ s0 that occur very frequently in t, while s0 is rare, and N is large compared to |R|. To make an extreme numeric case, consider |S| = 2, |R| = 1000 and γs0 = 1 in a table t of N = 10^6 rows: plugging these values into (22) yields pL(s0|rv, t∗) = pI(s0|rv, t) ≈ 0.004, hence a relative threat for (s0, rv) of 1/pL(s0|rv, t∗) ≈ 250.

V. LEARNING FROM THE OBFUSCATED TABLE BY MCMC

Estimating the privacy threat and faithfulness measures defined in the previous section, for specific tables t and t∗, implies being able to compute the distributions (5), (6) and (8). Unfortunately, these distributions, unlike (19), are not available in closed form, since f(Θ = π | T∗ = t∗) = f(π|t∗) cannot be derived analytically. Indeed, in order to do so, one should integrate f(π, t|t∗) with respect to the density f(t|t∗), which appears not to be feasible.

To circumvent this difficulty, we will introduce a Gibbs sampler, defining a Markov chain (Xi)_{i≥0}, with Xi = (Θi, Ti), converging to the density
    f(Θ = π, T = t | t∗) = f(Θ = π, S1 = s1, R1 = r1, . . . , SN = sN, RN = rN | t∗)

(note that the sensitive values sj in T are in fact fixed and known, given t∗). General results (see e.g. [41]) ensure that, if Θ^(0), Θ^(1), . . . are the samples drawn from the Θ-marginal of such a chain, then for each (s, r) ∈ S × R

    (1/M) Σ_{ℓ=0}^{M} f(s, r | Θ^(ℓ)) → ∫_D f(s, r|π) f(π|t∗) dπ = pL(s, r|t∗)   (23)

    (1/M) Σ_{ℓ=0}^{M} f(s | r, Θ^(ℓ)) → ∫_D f(s|r, π) f(π|t∗) dπ = pL(s|r, t∗)   (24)

almost surely as M → +∞. Therefore, by selecting an appropriately large M, one can build approximations of pL(s, r|t∗) and pL(s|r, t∗) using the arithmetical means on the left-hand side of (23) and (24), respectively. Moreover, for each index 1 ≤ j ≤ N, using samples drawn from the Rj-marginals of the same chain, one can build an estimate of f(Rj = rj | t∗). Consequently, using (8) (resp. (15), in the case of partial observation) one can estimate pA(s|rv, t∗) (resp. pA(s|yv, t∗)) for any given rv (resp. yv).

In the rest of the section, we will first introduce the MCMC for this problem and then show its convergence. We will then discuss details of the sampling procedures for each of the two possible schemes, horizontal and vertical.

A. Definition and Convergence of the Gibbs Sampler

Simply stated, our problem is sampling from the marginals of the following target density function, where t∗ = g∗1, . . . , g∗k and t = g1, . . . , gk (note that the number of groups k is known and fixed, given t∗):

    f(π, t|t∗).   (25)

Note that the rj's of interest, for 1 ≤ j ≤ N, are the elements of the groups gi, for 1 ≤ i ≤ k. The Gibbs scheme allows for some freedom as to the blocking of variables. Here we consider k + 1 blocks, coinciding with π and g1, . . . , gk. This is natural as, in the considered schemes, (Ri, Si) ⊥⊥ (Rj, Sj) | π, t∗ for (Ri, Si) and (Rj, Sj) occurring in distinct groups. Formally, let x^(0) = (π^(0), t^(0)) (with t^(0) = g^(0)_1, . . . , g^(0)_k) denote any initial state satisfying f(π^(0), t^(0)|t∗) > 0. Given a state at step h, x^(h) = (π^(h), t^(h)) (t^(h) = g^(h)_1, . . . , g^(h)_k), one lets x^(h+1) := (π^(h+1), t^(h+1)), where t^(h+1) = g^(h+1)_1, . . . , g^(h+1)_k and

    π^(h+1) is drawn from f(π | t^(h), t∗)   (26)

    g^(h+1)_i is drawn from f(g | π^(h+1), g^(h+1)_1, . . . , g^(h+1)_{i−1}, g^(h)_{i+1}, . . . , g^(h)_k, t∗)   (1 ≤ i ≤ k).   (27)

Running this chain presupposes we know how to sample from the full conditional distributions on the right-hand side of (26) and (27). In particular, there are several possible approaches to sample g. In this subsection we provide a general discussion about convergence, postponing the details of sampling from the full conditionals to the next subsection.

Let us denote by t−i := g1, . . . , g_{i−1}, g_{i+1}, . . . , gk the table obtained by removing the i-th group gi from t. The following relations for the full conditionals of interest can be readily checked, relying on the conditional independencies of the model (2) and (4) (we presuppose that in each case the conditioning event has nonzero probability):

    f(π|t, t∗) = f(π|t)   (28)

    f(g|π, t−i, t∗) ∝ f(g|π) f(t∗|g, t−i)   (1 ≤ i ≤ k).   (29)

As we shall see, each of the above two relations enables sampling from the densities on the left-hand side.
Indeed, (28) is a posterior Dirichlet distribution, from which effective sampling can be easily performed (see next subsection). A straightforward implementation of (29) in an Acceptance-Rejection (AR) sampling perspective is as follows: draw g according to f(g|π) and accept it with probability f(t∗|g, t−i) = f(t∗|t). Here, f(t∗|t) is just the probability that the obfuscation algorithm returns t∗ as output when given t = g, t−i as input. Actually, to make sampling from the RHS of (29) effective, further assumptions will be introduced (see next subsection). Note that, since the sensitive values are fixed in t and known from the given t∗, sampling g in (29) is actually equivalent to sampling the nonsensitive values of the group.

In addition to (29), to simplify our discussion about convergence, we shall henceforth assume that, for each group index 1 ≤ i ≤ k, the set of instances of the i-th group that are compatible with t∗ does not depend on the rest of the table, t−i. That is, we assume that for each i (1 ≤ i ≤ k):

    {g : f(t∗|g, t−i) > 0} = {g : f(t∗|g, t′−i) > 0}   for all t−i and t′−i;   we denote this common set by Gi.   (30)

For instance, (30) holds true if the anonymization algorithm ensures t∗ is independent from t−i given the i-th group g: t∗ ⊥⊥ t−i | g.

Let x = (π, g1, . . . , gk) denote a generic state of this Markov chain. Under the assumption (30), the support of the target density f(x|t∗) is the product space

    X := D × G1 × · · · × Gk.   (31)

By this, we mean that {x : f(x|t∗) > 0} = X. This is a consequence of: (a) the fact that the Dirichlet only considers full support distributions; and (b) equation (29), taking into account the assumption (30). Let X0, X1, . . . denote the Markov chain defined by the sampler over X and denote by κ(·|·) its conditional kernel density over X. Slightly abusing notation, let us still indicate by f(·|t∗) the probability distribution over X induced by the density f(x|t∗). Convergence in distribution follows from the following proposition, which is an instance of general results – see e.g. the discussion following Corollary 1 of [41].

Proposition 1 (convergence): Assume (30). For each (measurable) set A ⊆ X such that f(A|t∗) > 0 and each x0 ∈ X, we have κ(X1 ∈ A | X0 = x0) > 0. As a consequence, the Markov chain (Xi)_{i≥0} is irreducible and aperiodic, and its stationary density is f(x|t∗) in (25).

B. Sampling From the Full Conditionals

Let us consider (28) first. It is a standard fact that the posterior of the Dirichlet distribution f(π|t), given the N i.i.d. observations t drawn from the categorical distribution f(·|π), is still a Dirichlet, where the hyperparameters have been updated as follows. Denote by γ(t) = (γ1, . . . , γ|S|) the vector of the frequency counts γi of each si in t. Similarly, given s, denote by δ^s(t) = (δ^s_1, . . . , δ^s_|R|) the vector of the frequency counts δ^s_i of the pairs (ri, s), for each ri, in t. Then, for each π = (πS, πR|S), we have

    f(π|t) = Dir(πS | α + γ(t)) · ∏_{s∈S} Dir(πR|s | β^s + δ^s(t)).   (32)

Let us now discuss (29). In what follows, for the sake of notation we shall write a generic i-th group as gi = (s1, r1), . . . , (sn, rn) (thus avoiding double subscripts), and let g∗i = (mi, li) denote the corresponding obfuscated group in t∗.
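For concreteness, the sampler of (26)–(27), with the π-step implemented through the conjugate update (32), can be sketched as follows. This is our own illustrative Python code, not the authors' released implementation; sample_group and init_group are placeholders for the scheme-specific group step discussed below, and the returned π-draws and row marginals feed, respectively, the averages (23)–(24) (see the sketch after (6)) and Lemma 1 (8):

```python
import numpy as np

def dirichlet_posterior_draw(alpha, beta, table, S, R):
    # Full conditional (28): conjugate Dirichlet update (32) given the current
    # cleartext table, followed by one draw of pi = (pi_S, pi_{R|S}).
    gamma = np.array([sum(1 for s, _ in table if s == sv) for sv in S])
    pi_S = np.random.dirichlet(alpha + gamma)
    pi_R_given_S = {
        sv: np.random.dirichlet(
            beta[sv] + np.array([sum(1 for s, r in table if s == sv and r == rv)
                                 for rv in R]))
        for sv in S}
    return pi_S, pi_R_given_S

def gibbs(t_star_groups, sample_group, init_group, alpha, beta, S, R, M):
    # One chain of (26)-(27); burn-in and thinning omitted for brevity.
    # t_star_groups: obfuscated groups (m_i, l_i); sample_group draws a group
    # from the full conditional (29); init_group returns any compatible group.
    groups = [init_group(g_star) for g_star in t_star_groups]
    pi_samples = []
    row_counts = {}                      # counts for estimating f(R_j = r | t*)
    for _ in range(M):
        table = [row for g in groups for row in g]
        pi = dirichlet_posterior_draw(alpha, beta, table, S, R)
        groups = [sample_group(g_star, pi) for g_star in t_star_groups]
        pi_samples.append(pi)
        for j, (_, r) in enumerate(row for g in groups for row in g):
            row_counts[(j, r)] = row_counts.get((j, r), 0) + 1
    row_marginals = {jr: c / M for jr, c in row_counts.items()}
    return pi_samples, row_marginals

def attacker_distribution(row_marginals, sens_of_row, r_v, S):
    # Lemma 1, eq. (8): for each s, sum the estimated f(R_j = r_v | t*) over
    # the rows j whose sensitive value (known from t*) is s, then normalize.
    w = np.array([sum(row_marginals.get((j, r_v), 0.0)
                      for j, s in enumerate(sens_of_row) if s == sv) for sv in S])
    return w / w.sum()
```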
Let us now discuss (29). In what follows, for the sake of notation we shall write a generic $i$-th group as $g_i = (s_1, r_1), \ldots, (s_n, r_n)$ (thus avoiding double subscripts), and let $g^*_i = (l_i, m_i)$ denote the corresponding obfuscated group in $t^*$. As already observed, when sampling an $i$-th group $g$ from (29), one actually needs to generate only the nonsensitive values of $g$, which are constrained by $l_i$, as the sensitive ones are already fixed by the sequence $m_i$. In what follows, to make sampling from (29) effective, we shall work under the following assumptions, which are stronger than (30).

(a) Deterministic obfuscation function: for each $t$ and $t^*$, $f(t^* \mid t)$ is either 0 or 1.

(b) For each $1 \le i \le k$, letting $g^*_i = (l_i, m_i)$, with $m_i = s_1, \ldots, s_n$, the $i$-th obfuscated group in $t^*$, the following holds true:

Horizontal schemes
$$G_i = \{\, g = (s_1, r_1), \ldots, (s_n, r_n) : r_\ell \in l_i \text{ for } 1 \le \ell \le n \,\} \qquad (33)$$

Vertical schemes
$$G_i = \{\, g = (s_1, r_{i_1}), \ldots, (s_n, r_{i_n}) : r_{i_1}, \ldots, r_{i_n} \text{ a permutation of } l_i \,\}. \qquad (34)$$

Assumption (a) is realistic in practice. In horizontal schemes, assumption (b) makes the considered sets $G_i$ possibly larger than the real ones, that is $l_i \supset \{r_1, \ldots, r_n\}$. This happens, for instance, if in certain groups the ZIP code is constrained to just, say, two values, while the generalized code “5013*” allows for all values in the set $\{50130, \ldots, 50139\}$. We will not attempt here a formal analysis of this assumption. In some cases, such as in schemes based on global recoding, this assumption is realistic. Otherwise, we only note that the support $X$ of the resulting Markov chain may be (slightly) larger than the one that would be obtained without assuming (33) or (34). Heuristically, this leads one to sample from a more dispersed density than the target one. At least, the resulting distributions can be taken to represent a lower bound on what the attacker can actually learn.

Fig. 1. Sampling from $f(g \mid \pi, t_{-i}, t^*)$ ($g \in G_i$) for horizontal schemes, across all the groups.

Under assumptions (a) and (b) above, for each $1 \le i \le k$, it holds that $g \in G_i$ if and only if $f(t^* \mid g, t_{-i}) = 1$. Therefore sampling according to the right-hand side of (29) reduces to the following:

$$\text{draw } g \in G_i \text{ with probability } \propto f(g \mid \pi) \quad (1 \le i \le k). \qquad (35)$$

We now discuss how to implement (35) effectively. This will achieve sampling from the full conditionals (29) without resorting to a presumably inefficient AR method. We deal with the two cases, horizontal and vertical, separately.

a) Horizontal schemes: In order to generate $g = (r_1, s_1), \ldots, (r_n, s_n) \in G_i$, for each $\ell = 1, \ldots, n$, we draw $r_\ell \in l_i$ with probability $\propto f(r_\ell \mid s_\ell, \pi)$. Explicitly, (29) now becomes

$$f(g \mid \pi, t_{-i}, t^*) = \begin{cases} 0, & \text{if } g \notin G_i \\[4pt] \displaystyle\prod_{\ell=1}^{n} \frac{f(r_\ell \mid s_\ell, \pi)}{\sum_{r \in l_i} f(r \mid s_\ell, \pi)}, & \text{if } g \in G_i \end{cases} \qquad (36)$$

thus satisfying (35). Note that this is equivalent to sampling each row independently. The sampling process of $f(g \mid \pi, t_{-i}, t^*)$ for horizontal schemes across all the groups of the table is illustrated graphically in Fig. 1.
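A minimal sketch of the row-by-row sampler (36) for horizontal schemes is reported below; as before, the integer coding of attribute values and the layout of the arguments are illustrative assumptions of this sketch.

```python
import numpy as np

def sample_group_horizontal(m_i, l_i, pi_R_given_S, rng=None):
    """Draw g in G_i for a horizontal scheme, eq. (36).

    m_i          : sequence s_1, ..., s_n of (fixed) sensitive values of the group
    l_i          : list of nonsensitive values admitted by the generalized group, eq. (33)
    pi_R_given_S : array with pi_R_given_S[s, r] = f(r | s, pi)
    """
    rng = np.random.default_rng() if rng is None else rng
    g = []
    for s in m_i:
        w = np.array([pi_R_given_S[s, r] for r in l_i], dtype=float)
        w /= w.sum()                        # normalize over l_i, as in (36)
        r = l_i[rng.choice(len(l_i), p=w)]  # each row is sampled independently
        g.append((s, r))
    return g
```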
b) Vertical schemes: Let $l_i = \{\!|\, r_1, \ldots, r_n \,|\!\}$. We have that $g \in G_i$ if and only if $g = (s_1, r_{i_1}), \ldots, (s_n, r_{i_n})$, for some permutation $(r_{i_\ell})_{1 \le \ell \le n}$ of $r_1, \ldots, r_n$. Here, sampling the nonsensitive values of $g$ row by row would involve gradually reducing the sample space. A sampling procedure along these lines is possible, but nontrivial; see Appendix B. We discuss here a more straightforward sampling procedure, based on generating $g_i \in G_i$ in a single shot. We adopt a single-iteration Metropolis-within-Gibbs scheme. Essentially, this consists in running a Metropolis method that targets the distribution $\propto f(g \mid \pi)$ with support $G_i$, for one iteration. Specifically, let us write the current value of the $i$-th group in the Gibbs Markov chain as $g^h_i$. Following Robert and Casella [40, Ch. 10], this step consists in drawing $g \in G_i$ according to a proposal distribution $J(g \mid g^h_i)$ and accepting it, that is letting $g^{h+1}_i = g$, with probability

$$\rho := \min\left\{1, \; \frac{f(g \mid \pi)\, J(g^h_i \mid g)}{f(g^h_i \mid \pi)\, J(g \mid g^h_i)}\right\} \qquad (37)$$

while keeping $g^{h+1}_i = g^h_i$ with probability $1 - \rho$. The resulting MCMC method is still theoretically sound: see Robert and Casella [40, Ch. 10.3.3]. As to the proposal distribution $J(g \mid g^h_i)$, a possibility is generating $g \in G_i$ via a pure random permutation of the $n$ nonsensitive values in $l_i$; or just swapping the nonsensitive values of two randomly chosen positions in $g^h_i$. In both cases, the proposal is symmetric, and (37) simplifies accordingly as follows, where $r_1, \ldots, r_n$ is the sequence of nonsensitive values in the proposed $g$:

$$\rho = \min\left\{1, \; \frac{\prod_{\ell=1}^{n} f(r_\ell \mid s_\ell, \pi)}{\prod_{\ell=1}^{n} f(r^h_\ell \mid s_\ell, \pi)}\right\}.$$
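The swap-proposal variant of this Metropolis step can be sketched as follows; since only two rows change, the simplified acceptance ratio above reduces to four factors. The data layout is, again, an illustrative assumption of this sketch.

```python
import numpy as np

def metropolis_swap_step(g_curr, pi_R_given_S, rng=None):
    """One Metropolis-within-Gibbs step on a group of a vertical scheme.

    g_curr : current group, a list of (s, r) pairs whose r's are a permutation of l_i.
    Proposes to swap the nonsensitive values of two random positions and
    accepts with the simplified ratio derived from (37).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(g_curr)
    a, b = rng.choice(n, size=2, replace=False)
    (s_a, r_a), (s_b, r_b) = g_curr[a], g_curr[b]
    g_prop = list(g_curr)
    g_prop[a], g_prop[b] = (s_a, r_b), (s_b, r_a)   # swap the nonsensitive values
    # Only the two swapped rows change, so the ratio reduces to four factors.
    num = pi_R_given_S[s_a, r_b] * pi_R_given_S[s_b, r_a]
    den = pi_R_given_S[s_a, r_a] * pi_R_given_S[s_b, r_b]
    rho = min(1.0, num / den)
    return g_prop if rng.random() < rho else g_curr
```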
VI. EXPERIMENTS

We have put a proof-of-concept implementation³ of our methodology at work on a subset of the Adult dataset, extracted by Barry Becker from the 1994 US Census database and available from the UCI Machine Learning repository [47]. This is a common benchmark for experiments on anonymization [38]. In particular, we have focused on the subset of 5692 rows also considered by the authors of [38], with the following categorical attributes: sex, age, race, marital status, education, native country, workclass, salary class, occupation, with occupation (14 values) considered as the only sensitive attribute. We will discuss implementation and results details separately for vertical and horizontal schemes. We will then briefly discuss convergence issues of the employed MCMC method.

³ Python code and data available from the authors.

A. Horizontal Schemes: k-Anonymity

Using the ARX anonymization tool [37] we obtained two different anonymized versions of the considered dataset, enjoying $k$-anonymity and $\ell$-diversity⁴ for $k = \ell = 4$ and $k = \ell = 6$, respectively. The average size of the groups was respectively of 38 rows ($k = \ell = 4$) and of 355 rows ($k = \ell = 6$).

⁴ Recall that $\ell$-diversity requires at least $\ell$ distinct values of the sensitive attribute in each group.

TABLE IV. Summary of threat and faithfulness measures for anonymization according to $k$-anonymity and $\ell$-diversity.

The results we have obtained are summarized in Table IV. For reference, we include the following information in the last two lines: baseline accuracy, the fraction of rows correctly classified using the empirical distribution obtained from the frequencies of the sensitive values in the anonymized table, i.e., the fraction of the most frequent sensitive value; and ideal accuracy, the fraction of tuples threatened under $p_I$. As a further element of comparison, we also consider an attacker whose reasoning is based on the random worlds model, and include in the table $GT_{RW}$, the fraction of rows correctly classified assuming all tables compatible with $t^*$ equally likely. Like in [25], we compute $ABS_A$ and $ABS_{RW}$, the absolute error under the distribution derived under $p_A$ and under the random worlds distribution $p_{RW}$, respectively. ABS is defined as $\sum_{i=1}^{N} \sum_{s \in S} |1_{\{s_i = s\}} - p(s \mid r_i, t^*)|$, where $p(\cdot)$ might be either of $p_A(\cdot)$ or $p_{RW}(\cdot)$. Note that, since the considered anonymized tables do not enjoy disjointness between groups (see Remark 1), also in the random worlds perspective the probability of each sensitive attribute may well be $\ge 1/\ell$. In our experiments, when $\ell = 4$ the attacker outperforms random worlds classification, while when a more powerful obfuscation is adopted the two results are quite similar.

The remaining rows in Table IV consider the privacy threat and faithfulness measures introduced in Section IV. As a general comment, small variations of $\ell$ and/or $k$ do not produce dramatic changes. The faithfulness level is stable, but does not reach a satisfactory level. The attacker is anyway in a position to correctly classify the sensitive attribute of individuals in the table $\approx 2.3$–$2.5\%$ more often than the learner. We found that the maximum value of $T_i^A$ for the threatened rows is about 13.8, meaning the attacker can be up to $\approx 14$ times more confident than the learner about the guessed value.

Fig. 2. Results for $k$-anonymity. Top ($\ell = k = 6$): scatter plots of $p_L$ vs $p_A$ for tuples threatened under $p_A$ (a), and under $p_L$ (c); (b) and (d) are the histograms of $\log_2 T_i^A$ for these two cases. Bottom: same for $\ell = k = 4$. The skewness value ($\gamma$) represents the third standardized moment of the empirical distribution. Dark red areas show where the attacker performs better than the learner.

A more informative summary of our analysis is provided by the scatter plots and histograms of Figure 2. The scatter plots are obtained from the threat levels under $p_L$ and under $p_A$. The number of rows $(s, r)$ in which $p_A(s \mid r, t^*) \ge p_L(s \mid r, t^*)$ roughly equals those in which $p_A(s \mid r, t^*) \le p_L(s \mid r, t^*)$, although globally the attacker has a slight advantage in terms of number of threatened rows. In Figure 2 we also report the empirical distribution of $\log_2 T_i^A$ for tuples threatened under $p_A$ and under $p_L$. We also have evidence of positive skewness, as shown by the value of $\gamma$ (the third standardized moment of the empirical distributions). Recalling that $\log_2 T_i^A = 1$ means $p_A(s \mid r, t^*) = 2\, p_L(s \mid r, t^*)$, the histograms show that $p_A(s \mid r, t^*)$ is often more than twice $p_L(s \mid r, t^*)$, leading to $\log_2 T_i^A \ge 1$. In particular, when $k = \ell = 4$, $\log_2 T_i^A$ is at least 1 for $\approx 6\%$ of the individuals threatened under $p_A$, meaning $\approx 0.6\%$ of the whole table. Conversely, $\log_2 T_i^A$ is close to 0 for most of the rows in which $p_A(s \mid r, t^*) \le p_L(s \mid r, t^*)$.
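As an indication of how part of the summary measures of Table IV can be computed from the estimated row-wise posteriors, a minimal sketch follows; the array layout assumed here is for illustration only, and the snippet covers only the ABS error and a plain classification accuracy, not the threat measures of Section IV.

```python
import numpy as np

def abs_error(p, s_true):
    """ABS = sum_i sum_s |1{s_i = s} - p[i, s]|, as defined above.

    p      : array of shape (N, |S|); p[i, s] is an estimate of p(s | r_i, t*)
    s_true : integer array of shape (N,), the true sensitive value s_i of each row
    """
    N, n_s = p.shape
    indicator = np.zeros((N, n_s))
    indicator[np.arange(N), s_true] = 1.0
    return float(np.abs(indicator - p).sum())

def classification_accuracy(p, s_true):
    """Fraction of rows whose sensitive value is guessed by taking the argmax of p."""
    return float(np.mean(p.argmax(axis=1) == s_true))
```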
B. Vertical Schemes: Anatomy

Using a freely available anonymization tool [22], we have obtained two anatomized versions of the considered dataset, with groups of size $\ell = 4$ and $\ell = 6$, respectively. The resulting tables also enjoy $\ell$-diversity. The results we have obtained are summarized in Table V. Concerning the random worlds approach, we note the following. Anatomy partitions the tables into groups all of size $\ell$. Therefore, although disjointness is not satisfied, just as in the horizontal case, the sensitive attribute frequencies equal $1/\ell$ in each group. This implies that the probability of a sensitive value depends on how many groups contain the victim's nonsensitive attributes and on their frequencies in each group, often leading to multimodal distributions. We assume that a guess may be obtained by randomly choosing among the equally likely sensitive attributes. Accordingly, the fractions of threatened rows, $GT_{RW}$, are averaged over 500 independent samplings. Here, it is apparent that our attacker is able to classify better than in the random worlds scenario. We note that, as $\ell$ increases from 4 to 6, the fraction of rows threatened under the distributions derived by the learner ($GT_L$) and by the attacker ($GT_A$) decreases significantly. Moreover, as $\ell$ grows both the relative threat $RGT_A$ and the faithfulness level $RF$ decrease, which implies a trade-off between privacy and the utility conveyed by the table.

TABLE V. Summary of threat and faithfulness measures for anonymization according to Anatomy.

Fig. 3. Results for Anatomy. Top ($\ell = 6$): scatter plots of $p_L$ vs $p_A$ for tuples threatened under $p_A$ (a), and under $p_L$ (c); (b) and (d) are the histograms of $\log_2 T_i^A$ for these two cases. Bottom: same for $\ell = 4$. The skewness value ($\gamma$) represents the third standardized moment of the empirical distribution. Dark red areas show where the attacker performs better than the learner.

Again, for a more informative summary of our analysis, we look at the scatter plots and histograms displayed in Figure 3, where we compare $p_A$ and $p_L$ on threatened rows. It is apparent here that the attacker is more confident than the learner in the majority of the cases, even when focusing on the rows threatened under $p_L$. This is in contrast with the horizontal case, where the attacker exhibits smaller threat levels on the rows threatened under $p_L$ (Figure 2, (d) and (h)). As far as the histograms are concerned, an even greater skewness than in the horizontal case is evident here. In particular, the attacker can be up to $\approx 287$ times more confident than the learner, the maximum $T_i^A$ being about 286.19. Moreover, when $\ell = 4$, the individuals with $\log_2 T_i^A \ge 1$ are $\approx 26\%$ of the rows threatened under $p_A$ ($\approx 8\%$ of the whole table). This means that there are 483 individuals in the dataset for which the threat level under $p_A$ is at least twice the threat level under $p_L$.

C. Discussion

Comparing the horizontal and the vertical cases for the considered dataset, the following considerations are in order.

• In the horizontal case, we have a situation of low faithfulness and low privacy threat, irrespective of the values of $k$ and $\ell$. Indeed, in both cases the average group size is well above $k$, and this has a negative effect on the inference capabilities of both the learner and the attacker. The slight numerical differences observed between the cases $k = \ell = 4$ and $k = \ell = 6$ are basically an artifact of the anonymization tool. Yet, in relative terms, one can observe a significant increase in the number of tuples threatened by the attacker, over the learner.

• In the vertical case, one obtains a greater faithfulness at the price of a greater privacy threat. This difference from the horizontal case is partly explained by the smaller group size, which now coincides with $\ell$.
Moving now from $\ell = 4$ to $\ell = 6$ has a tangible negative impact on the inference capabilities of both the learner and the attacker. In relative terms, one can observe an even more marked increase in the number of tuples threatened by the attacker, over the learner.

The above considerations partly depend on both the original dataset and the details of the employed anonymization tool.

D. Assessing MCMC Convergence

For each of the considered anonymized datasets, we ran an MCMC chain as introduced in Section V for $M = 100{,}000$ iterations. The convergence of each chain to the stationary distribution was assessed via a methodology based on comparing sub-sequences of the sample sequences with one another. More precisely, as for the population parameters distribution (32), we used the method proposed by Geweke [21]. The Geweke proposal is based on an adapted two-sample test on the means in sub-sequences of the chain. After a burn-in of 50,000 iterations, we compared the last 25,000 samples against 5 blocks of 5,000 consecutive samples each, taken starting from the 50,000-th iteration. We found that all the distributions $\pi_{R|S}$ produced a test statistic within two standard deviations from zero, thus providing evidence of convergence.

As for the distribution of the cleartext table, $f(t \mid \pi, t^*)$, we used a test specifically designed for categorical distributions by Deonovic and Smith, the Weiß procedure [15]. The approach is based on a $\chi^2$ test adjusted for the autocorrelation induced by the chain. The test is based on partitioning the whole sample sequence into sub-sequences, and then testing the homogeneity between the empirical distribution of each sub-sequence and the empirical distribution of the whole chain. After a burn-in of 50,000 observations, we compared 5 sub-sequences of 10,000 consecutive samples each. For the vertical scheme, we assessed the convergence for each row of the table, thereby supporting stationarity of $f(t \mid \pi, t^*)$. For the horizontal scheme, some of the rows did not exhibit evidence of convergence. However, we found that, starting with several independent chains, very similar results in terms of the proposed assessment measures were obtained.

In the vertical case, within the Metropolis step, both the pure random permutation and the swap group generation strategies (Section V-B) were tried. The obtained results are consistent; however, the pure random permutation strategy shows a much higher rejection rate, suggesting that the swap strategy should be preferred.
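As an illustration of the kind of check we performed, a simplified Geweke-style diagnostic for a scalar component of the chain can be written as follows; unlike the original proposal [21], this sketch uses plain sample variances instead of spectral density estimates, so it should be read as a rough indication only.

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke-style z-score for one scalar component of the chain.

    Compares the mean of the first `first` fraction of the (post burn-in) samples
    with the mean of the last `last` fraction. Plain sample variances are used
    here instead of the spectral estimates of the original proposal [21].
    """
    x = np.asarray(chain, dtype=float)
    a = x[: int(first * len(x))]
    b = x[-int(last * len(x)):]
    z = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return z   # |z| < 2 is taken as (weak) evidence of convergence
```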
VII. CONCLUSION

We have put forward a notion of relative privacy threat that applies to group-based anonymization schemes. Our proposal is based on a rigorous characterization of the learner's and of the attacker's inference, in a unified Bayesian model of group-based schemes. A related MCMC algorithm for posterior parameter estimation has also been introduced. Experiments conducted on the well-known Adult dataset [47] have been illustrated.

Our analysis emphasizes the risks posed by the mere fact that an attacker can look up a released anonymized table. This prompts an obvious alternative: release the parameters of the posterior distribution learned from the cleartext table ($p_I$, in our notation). This may not always be possible, or be a good idea, for several reasons. First, certain organizations must release datasets as part of their mission, e.g. census bureaus. Second, especially in the case of high-dimensional data, the computation of the posterior is feasible only assuming suitable conditional independencies, whereby potentially important correlations are lost; see [10] and references therein. Third, parameter release itself is not exempt from privacy risks. In particular, although differentially private release of the parameters is possible [16], it seems that quite strong priors are necessary to obtain acceptable guarantees; see [50, Ch. 6] and references therein. In conclusion, further research is called for in order to understand under what circumstances data and/or parameter release can be done safely.

APPENDIX A
PROOF OF LEMMA 1

We first characterize the probability $f(V = j \mid R_V = r_v, t^*)$, for an arbitrary $j \in \{1, \ldots, N\}$. Bayes' theorem yields

$$f(V = j \mid R_V = r_v, t^*) \propto f(R_V = r_v \mid V = j, t^*)\, f(V = j \mid t^*)$$
$$= f(R_j = r_v \mid V = j, t^*)\, f(V = j \mid t^*)$$
$$\propto f(R_j = r_v \mid V = j, t^*) \qquad (38)$$
$$= f(R_j = r_v \mid t^*) \qquad (39)$$

where (38) follows from $f(V = j \mid t^*) = f(V = j) = 1/N$ (independence of $V$), and (39) follows because, as easily checked, for any fixed $j$, independence of $R_j$ and $V$ is preserved by conditioning on $t^*$. Now we have, for every $s \in S$,

$$p_A(s \mid r_v, t^*) \qquad (40)$$
$$= f(S_V = s \mid R_V = r_v, t^*)$$
$$= \textstyle\sum_j f(S_V = s, V = j \mid R_V = r_v, t^*)$$
$$= \textstyle\sum_j f(S_V = s \mid V = j, R_V = r_v, t^*)\, f(V = j \mid R_V = r_v, t^*)$$
$$= \textstyle\sum_j f(S_j = s \mid V = j, R_j = r_v, t^*)\, f(V = j \mid R_V = r_v, t^*)$$
$$= \textstyle\sum_{j : s_j = s} f(S_j = s \mid V = j, R_j = r_v, t^*)\, f(V = j \mid R_V = r_v, t^*) \qquad (41)$$
$$= \textstyle\sum_{j : s_j = s} f(V = j \mid R_V = r_v, t^*) \qquad (42)$$
$$\propto \textstyle\sum_{j : s_j = s} f(R_j = r_v \mid t^*) \qquad (43)$$

where (41) and (42) follow from the fact that, for $s_j \neq s$, $f(S_j = s, t^*) = 0$, while for $s_j = s$ obviously $f(S_j = s \mid V = j, R_j = r_v, t^*) = 1$. Finally, (43) follows from (39). Note that each term on the RHS of (43) actually is the joint probability $f(R_j = r_v, S_j = s \mid t^*)$, since $s_j = s$ is embedded in the range of the summation.

APPENDIX B
AN ALTERNATIVE GROUP SAMPLING METHOD FOR VERTICAL SCHEMES

We consider the following method for sampling $g \in G_i$ (a code sketch follows the steps below). Draw $n$ values $r_{i_\ell}$, $\ell = 1, \ldots, n$, as follows:

1. draw $r_{i_1}$ from $l_i$ according to a distribution $\propto f(r \mid s_1, \pi)$;
2. draw $r_{i_2}$ from $l_i \setminus \{\!|\, r_{i_1} \,|\!\}$ according to a distribution $\propto f(r \mid s_2, \pi)$;
…
n. draw $r_{i_n}$ from $l_i \setminus \{\!|\, r_{i_1}, \ldots, r_{i_{n-1}} \,|\!\}$ according to a distribution $\propto f(r \mid s_n, \pi)$.
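Under the same illustrative data layout used in Section V-B, the sequential procedure above can be sketched as follows; note that, as discussed next, the probability it assigns to a group depends on the ordering of the sensitive values.

```python
import numpy as np

def sample_group_sequential(m_i, l_i, pi_R_given_S, rng=None):
    """Sequential without-replacement sampling of a vertical-scheme group (Appendix B).

    m_i : sequence s_1, ..., s_n of sensitive values of the group
    l_i : multiset of nonsensitive values, given as a list (repetitions allowed)
    """
    rng = np.random.default_rng() if rng is None else rng
    remaining = list(l_i)            # l_i minus the values drawn so far
    g = []
    for s in m_i:
        w = np.array([pi_R_given_S[s, r] for r in remaining], dtype=float)
        w /= w.sum()                 # distribution proportional to f(r | s, pi) on the remaining values
        j = rng.choice(len(remaining), p=w)
        g.append((s, remaining.pop(j)))
    return g
```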
For a multiset $l'$, let $\sigma(l' \mid s_\ell, \pi) := \sum_{r \in l'} f(r \mid s_\ell, \pi)$ denote the probability of extracting some element appearing in $l'$ (disregarding multiplicities) according to $f(\cdot \mid s_\ell, \pi)$. Using this notation, the probability of returning exactly the sequence $r_{i_1}, \ldots, r_{i_n}$, hence $g = (s_1, r_{i_1}), \ldots, (s_n, r_{i_n}) \in G_i$, as a result of the above $n$ drawings, can be written as

$$\theta(g \mid \pi, t^*) := \frac{f(r_{i_1} \mid s_1, \pi)}{\sigma(l_i \mid s_1, \pi)} \cdot \frac{f(r_{i_2} \mid s_2, \pi)}{\sigma(l_i \setminus \{\!|\, r_{i_1} \,|\!\} \mid s_2, \pi)} \cdots \frac{f(r_{i_n} \mid s_n, \pi)}{\sigma(l_i \setminus \{\!|\, r_{i_1}, \ldots, r_{i_{n-1}} \,|\!\} \mid s_n, \pi)} = \frac{\prod_{\ell=1}^{n} f(r_{i_\ell} \mid s_\ell, \pi)}{\nu(g \mid \pi)}$$

where we denote by $\nu(g \mid \pi)$ the denominator of the expression on the right-hand side of $:=$ above. The sampling process of $\theta(g \mid \pi, t^*)$ for vertical schemes across all the groups of the table is illustrated in Fig. 4. We note that $\theta(g \mid \pi, t^*)$ depends on the chosen ordering of the sensitive values $s_1, \ldots, s_n$, which may invalidate condition (35). A possible solution could be to sweep the order of sampling according to the Random Sweep Gibbs sampler scheme originally proposed by [20] and further developed by [29].

Fig. 4. Sampling from $\theta(g \mid \pi, t^*)$ for vertical schemes.

REFERENCES

[1] D. J. Balding and P. Donnelly, “Inference in forensic identification,” J. Roy. Stat. Soc. A, Statist. Soc., vol. 158, no. 1, pp. 21–53, 1995.
[2] M. Bewong, J. Liu, L. Liu, J. Li, and K. K. R. Choo, “A relative privacy model for effective privacy preservation in transactional data,” in Proc. IEEE Trustcom/BigDataSE/ICESS, Aug. 2017, pp. 394–401.
[3] B. Bichsel, T. Gehr, D. Drachsler-Cohen, P. Tsankov, and M. Vechev, “DP-finder: Finding differential privacy violations by sampling and optimization,” in Proc. ACM CCS, 2018, pp. 508–524.
[4] M. Boreale and M. Paolini, “Worst- and average-case privacy breaches in randomization mechanisms,” Theoretical Computer Science, vol. 597, pp. 40–61, 2015.
[5] M. Boreale and F. Corradi, “Relative privacy risks and learning from anonymized data,” in Proc. SIS, A. Petrucci, F. Racioppi, and R. Verde, Eds. Firenze Univ. Press, 2017, pp. 199–204.
[6] D. Cavallini and F. Corradi, “Forensic identification of relatives of individuals included in a database of DNA profiles,” Biometrika, vol. 93, pp. 525–536, Sep. 2006.
[7] A.-S. Charest, “How can we analyze differentially-private synthetic datasets?” J. Privacy Confidentiality, vol. 2, no. 2, 2011.
[8] A.-S. Charest, “Empirical evaluation of statistical inference from differentially-private contingency tables,” in Proc. Int. Conf. Privacy Stat. Databases (PSD). Berlin, Germany: Springer-Verlag, 2012, pp. 257–272.
[9] R. C.-W. Wong, A. W. C. Fu, K. Wang, P. S. Yu, and J. Pei, “Can the utility of anonymized data be used for privacy breaches?” ACM Trans. Knowl. Discovery Data, vol. 5, no. 3, pp. 16-1–16-24, 2011.
[10] C. Clifton and T. Tassa, “On syntactic anonymity and differential privacy,” Trans. Data Privacy, vol. 6, pp. 161–183, 2013.
[11] F. Corradi, V. Pinchi, S. Garatti, and I. Barsanti, “Probabilistic classification of age by third molar development: The use of soft evidence,” J. Forensic Sci., vol. 58, no. 1, pp. 51–59, 2013.
[12] F. K. Dankar and K. El Emam, “The application of differential privacy to health data,” in Proc. Joint EDBT/ICDT Workshops (EDBT-ICDT), 2012, pp. 158–166.
[13] F. K. Dankar and K. El Emam, “Practicing differential privacy in health care: A review,” Trans. Data Privacy, vol. 6, no. 1, pp. 35–67, 2013.
[14] A. P. Dawid, “The island problem: Coherent use of identification evidence,” in Aspects of Uncertainty: A Tribute to D. V. Lindley, P. R. Freeman and A. F. M. Smith, Eds. Hoboken, NJ, USA: Wiley, 1994, pp. 159–170.
[15] B. E. Deonovic and B. J. Smith, “Convergence diagnostics for MCMC draws of a categorical variable,” 2017, arXiv:1706.04919. [Online]. Available: https://arxiv.org/abs/1706.04919
[16] C. Dimitrakakis, B. Nelson, A. Mitrokotsa, and B. I. P. Rubinstein, “Robust and private Bayesian inference,” in Proc. 25th Int. Conf. Algorithmic Learn. Theory (ALT), Bled, Slovenia, Oct. 2014, pp. 291–305.
[17] Z. Ding, Y. Wang, G. Wang, D. Zhang, and D. Kifer, “Detecting violations of differential privacy,” in Proc. ACM CCS, 2018, pp. 475–489.
[18] C. Dwork, “Differential privacy,” in Proc. 33rd Int. Conf. Automata, Lang. Program. Berlin, Germany: Springer-Verlag, 2006, pp. 1–12.
[19] S. L. Garfinkel, J. M. Abowd, and S. Powazek, “Issues encountered deploying differential privacy,” in Proc. ACM Workshop Privacy Electron. Soc., 2018, pp. 133–137.
[20] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-6, no. 6, pp. 721–741, Nov. 1984.
[21] J. Geweke, “Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments,” in Bayesian Statistics. Oxford, U.K.: Oxford Univ. Press, 1992, pp. 169–193.
[22] Q. Gong. (2014). Anatomize. GitHub. [Online]. Available: https://github.com/qiyuangong/Anatomize
[23] A. Inan, M. Kantarcioglu, and E. Bertino, “Using anonymized data for classification,” in Proc. IEEE Int. Conf. Data Eng. (ICDE), Washington, DC, USA, Mar./Apr. 2009, pp. 429–440.
[24] A. Kassem, G. Acs, C. Castelluccia, and C. Palamidessi, “Differential inference testing: A practical approach to evaluate anonymized data,” INRIA, Res. Rep., 2018, pp. 1–21.
[25] D. Kifer, “Attacks on privacy and deFinetti’s theorem,” in Proc. SIGMOD Conf., 2006, pp. 127–138.
[26] D. Kifer and A. Machanavajjhala, “No free lunch in data privacy,” in Proc. SIGMOD, 2011, pp. 193–204.
[27] N. Li, T. Li, and S. Venkatasubramanian, “t-closeness: Privacy beyond k-anonymity and ℓ-diversity,” in Proc. ICDE, 2007, pp. 106–115. doi: 10.1109/ICDE.2007.367856.
[28] N. Li, W. Qardaji, and D. Su, “On sampling, anonymization, and differential privacy: Or, k-anonymization meets differential privacy,” in Proc. 7th ACM Symp. Inf., Comput. Commun. Secur. (ASIACCS), 2012, pp. 32–33.
[29] J. S. Liu, “Markov chain Monte Carlo and related topics,” Dept. Statist., Stanford Univ., Stanford, CA, USA, Tech. Rep., 1995.
[30] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “ℓ-diversity: Privacy beyond k-anonymity,” in Proc. ICDE, 2006, p. 24.
[31] A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, and L. Vilhuber, “Privacy: Theory meets practice on the map,” in Proc. IEEE 24th Int. Conf. Data Eng., Apr. 2008, pp. 277–286. doi: 10.1109/ICDE.2008.4497436.
[32] M. D. Mailman et al., “The NCBI dbGaP database of genotypes and phenotypes,” Nature Genet., vol. 39, no. 10, pp. 1181–1186, 2007. doi: 10.1038/ng1007-1181.
[33] K. Mancuhan and C. Clifton, “Statistical learning theory approach for data classification with ℓ-diversity,” 2016, arXiv:1610.05815. [Online]. Available: https://arxiv.org/abs/1610.05815
[34] A. Narayanan and V. Shmatikov, “Robust de-anonymization of large datasets (how to break anonymity of the Netflix prize dataset),” Univ. Texas Austin, Austin, TX, USA, Tech. Rep., 2008.
[35] M. Nasr, R. Shokri, and A. Houmansadr, “Machine learning with membership privacy using adversarial regularization,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Toronto, ON, Canada, 2018, pp. 634–646. doi: 10.1145/3243734.3243855.
[36] W. Ollier, T. Sprosen, and T. Peakman, “UK Biobank: From concept to reality,” Future Med., vol. 6, no. 6, pp. 639–646, 2005.
[37] F. Prasser and F. Kohlmayer, “Putting statistical disclosure control into practice: The ARX data anonymization tool,” in Medical Data Privacy Handbook, A. Gkoulalas-Divanis and G. Loukides, Eds. Cham, Switzerland: Springer, Nov. 2015.
[38] F. Prasser, F. Kohlmayer, and K. A. Kuhn, “A benchmark of globally-optimal anonymization methods for biomedical data,” in Proc. 27th IEEE Int. Symp. Comput.-Based Med. Syst., New York, NY, USA, May 2014, pp. 66–71.
[39] A. Pyrgelis, C. Troncoso, and E. De Cristofaro, “Knock knock, who’s there? Membership inference on aggregate location data,” in Proc. Netw. Distrib. Syst. Secur. Symp. (NDSS), San Diego, CA, USA, Feb. 2018, pp. 18–21. doi: 10.14722/ndss.2018.23183.
[40] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, 2nd ed. Springer, 2004.
[41] G. O. Roberts and A. F. M. Smith, “Simple conditions for the convergence of the Gibbs sampler and Metropolis–Hastings algorithms,” Stochastic Processes Appl., vol. 49, pp. 207–216, Feb. 1994.
[42] T. E. Raghunathan, J. P. Reiter, and D. B. Rubin, “Multiple imputation for statistical disclosure limitation,” J. Off. Statist., vol. 19, no. 1, pp. 1–16, 2003.
[43] D. B. Rubin, “Statistical disclosure limitation,” J. Off. Statist., vol. 9, no. 2, pp. 461–468, 1993.
[44] R. Sarathy and K. Muralidhar, “Evaluating Laplace noise addition to satisfy differential privacy for numeric data,” Trans. Data Privacy, vol. 4, no. 1, pp. 1–17, 2011.
[45] K. Slooten and R. Meester, “Forensic identification: The island problem and its generalisations,” 2017, arXiv:1201.4647. [Online]. Available: https://arxiv.org/abs/1201.4647
[46] L. Sweeney, “k-anonymity: A model for protecting privacy,” Int. J. Uncertainty, Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 557–570, 2002.
[47] UCI Machine Learning Repository. (1996). Adult Dataset. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Adult
[48] U.S. Office for Civil Rights. (Nov. 26, 2012). Guidance Regarding Methods for De-Identification of Protected Health Information in Accordance With the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. [Online]. Available: https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/understanding/coveredentities/De-identification/hhs_deid_guidance.pdf
[49] X. Xiao and Y. Tao, “Anatomy: Simple and effective privacy preservation,” in Proc. VLDB, 2006, pp. 139–150.
[50] S. Zheng, “The differential privacy of Bayesian inference,” B.S. thesis, Harvard College, Cambridge, MA, USA, 2015. [Online]. Available: https://dash.harvard.edu/handle/1/14398533