Relative Privacy Threats and Learning From Anonymized Data

Michele Boreale, Fabio Corradi, and Cecilia Viscardi

Abstract— We consider group-based anonymization schemes, a popular approach to data publishing. This approach aims at protecting the privacy of the individuals involved in a dataset by releasing an obfuscated version of the original data, where the exact correspondence between individuals and attribute values is hidden. When publishing data about individuals, one must typically balance the learner's utility against the risk posed by an attacker, potentially targeting individuals in the dataset. Accordingly, we propose a unified Bayesian model of group-based schemes and a related MCMC methodology to learn the population parameters from an anonymized table. This allows one to analyze the risk for any individual in the dataset to be linked to a specific sensitive value, when the attacker knows the individual's nonsensitive attributes, beyond what is implied for the general population. We call this relative threat analysis. Finally, we illustrate the results obtained with the proposed methodology on a real-world dataset.

Index Terms— Privacy, anonymization, k-anonymity, MCMC methods.

I. INTRODUCTION

WE CONSIDER a scenario where datasets containing personal microdata are released in anonymized form. The goal here is to enable the computation of general population characteristics with reasonable accuracy, while at the same time preventing leakage of sensitive information about individuals in the dataset. The Database of Genotypes and Phenotypes [32], the U.K. Biobank [36] and the UCI Machine Learning repository [47] are well-known examples of repositories providing this type of dataset.

Anonymized datasets always have "personally identifiable information", such as names, SSNs and phone numbers, removed. At the same time, they include information derived from nonsensitive (say, gender, ZIP code, age, nationality) as well as sensitive (say, disease, income) attributes. Certain combinations of nonsensitive attributes, like ⟨gender, date of birth, ZIP code⟩, may be used to uniquely identify a significant fraction of the individuals in a population, thus forming so-called quasi-identifiers. For a given target individual, the victim, an attacker might easily obtain this piece of information (e.g. from personal web pages, social networks etc.), use it to identify him/her within a dataset and learn the corresponding sensitive attributes. This attack was famously demonstrated by L. Sweeney, who identified Massachusetts' Governor Weld's medical record within the Group Insurance Commission (GIC) dataset [46].

Manuscript received January 18, 2019; revised May 2, 2019 and August 7, 2019; accepted August 15, 2019. This paper was presented at the Proceedings of SIS 2017 [5]. The associate editor coordinating the review of this article and approving it for publication was Prof. Xiaodong Lin. (Corresponding author: Fabio Corradi.) The authors are with the Dipartimento di Statistica, Informatica, Applicazioni (DiSIA), Università di Firenze, Florence, Italy (e-mail: fabio.corradi@unifi.it). Digital Object Identifier 10.1109/TIFS.2019.2937640
Note that identity disclosure, that is, the precise identification of an individual's record in a dataset, is not necessary to arrive at a privacy breach: depending on the dataset, an attacker might infer the victim's sensitive information, or even a few highly probable candidate values for it, without any identity disclosure being involved. This more general type of threat, sensitive attribute disclosure, is the one we focus on here.¹

In an attempt to mitigate such privacy threats, regulatory bodies mandate complex, often baroque syntactic constraints on the published data. As an example, here is an excerpt from the HIPAA Safe Harbor de-identification standard [48], which prescribes a list of 18 identifiers that should be removed or obfuscated, such as

    all geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census: (1) the geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and (2) the initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000.

There exists a large body of research, mainly in Computer Science, on syntactic methods. In particular, group-based anonymization techniques have been systematically investigated, starting with L. Sweeney's proposal of k-anonymity [46], followed by its variants, like ℓ-diversity [30] and Anatomy [49]. In group-based methods, the anonymized - or obfuscated - version of a table is obtained by partitioning the set of records into groups, which are then processed to enforce certain properties. The rationale is that, even knowing that an individual belongs to a group of the anonymized table, it should not be possible for an attacker to link that individual to a specific sensitive value in the group. Two examples of group-based anonymization are in Table I, adapted from [9]. The topmost, original table collects medical data from eight individuals; here Disease is considered as the only sensitive attribute. The central table is a 2-anonymous, 2-diverse table: within each group the nonsensitive attribute values have been generalized following group-specific rules (local recoding) so as to make them indistinguishable; moreover, each group features 2 distinct sensitive values.

TABLE I
A TABLE (TOP) ANONYMIZED ACCORDING TO 2-ANONYMITY VIA LOCAL RECODING (MIDDLE) AND ANATOMY (BOTTOM)

¹ Depending on the nature of the dataset, mere membership disclosure, i.e. revealing that an individual is present in a dataset, may also be considered as a privacy breach: think of data about individuals who in the past have been involved in some form of felony. We will not discuss membership disclosure privacy breaches in this paper.
In general,89 each group in a k-anonymous table consists of at least k90 records, which are indistinguishable when projected on the91 nonsensitive attributes; �-diversity additionally requires the92 presence in each group of at least � distinct sensitive values,93 with approximately the same frequency. This is an example94 of horizontal scheme. Table I (c) is an example of application95 of the Anatomy scheme: within each group, the nonsensitive96 part of the rows are vertically and randomly permuted, thus97 breaking the link between sensitive and nonsensitive values.98 Again, the table is 2-diverse.99 In recent years, the effectiveness of syntactic anonymization100 methods has been questioned, as offering weak guarantees101 against attackers with strong background knowledge – very102 precise contextual information about their victims. Differen-103 tial privacy [18], which promises protection in the face of104 arbitrary background knowledge, while valuable in the release105 of summary statistics, still appears not of much use when it 106 comes to data publishing (see the Related works paragraph). 107 As a matter of fact, release of syntactically anonymized tables 108 appears to be the most widespread data publishing practice, 109 with quite effective tool support (see e.g. [37]). 110 In the present paper, discounting the risk posed by attackers 111 with strong background knowledge, we pose the problem in 112 relative terms: given that whatever is learned about the general 113 population from an anonymized dataset represents legitimate 114 and useful information (“smoke is associated with cancer”), 115 one should prevent an attacker from drawing conclusions about 116 specific individuals in the table (“almost certainly the target 117 individual has cancer”): in other words, learning sensitive 118 information for an individual in the dataset, beyond what is 119 implied for the general population. To see what is at stake 120 here, consider dataset (b) in Table I. Suppose that the attacker’s 121 victim is a Malaysian living at ZIP code 45501, and known 122 to belong to the original table. The victim’s record must 123 therefore be in the first group of the anonymized table. The 124 attacker may reason that, with the exception of the first group, 125 a Japanese is never connected to Heart Disease; this hint 126 can become a strong evidence in a larger, real-world table. 127 Then the attacker can link with high probability the Malaysian 128 victim in the first group to Heart Disease. In this attack, 129 the attacker combines knowledge of the nonsensitive attributes 130 of the victim (Malaysian, ZIP code 45501) with the group 131 structure and the knowledge learned from the anonymized 132 table. 133 We propose a unified probabilistic model to reason about 134 such forms of leakage. In doing so, we clearly distinguish the 135 position of the learner from that of the attacker: the resulting 136 notion is called relative privacy threat. In our proposal, both 137 the learner and the attacker activities are modeled as forms 138 of Bayesian inference: the acquired knowledge is represented 139 as a joint posterior probability distribution over the sensitive 140 and nonsensitive values, given the anonymized table and, 141 in the case of the attacker, knowledge of the victim’s presence 142 in the table. A comparison between these two distributions 143 determines what we call relative privacy threat. 
Since posterior distributions are in general impossible to express analytically, we also put forward an MCMC method to practically estimate such posteriors, and we illustrate the results of applying our method to the Adult dataset from the UCI Machine Learning repository [47], a common benchmark in anonymization research.

A. Related Works

Sweeney's k-anonymity [46] is among the most popular proposals aiming at a systematic treatment of syntactic anonymization of microdata. The underlying idea is that every individual in the released dataset should be hidden in a "crowd of k". Over the years, k-anonymity has proven to provide weak guarantees against attackers who know much about their victims, that is, have strong background knowledge. For example, an attacker may know from sources other than the released data that his victim does not suffer from certain diseases, thus ruling out all possibilities but one in the victim's group. Additional constraints may be enforced in order to mitigate those attacks, like ℓ-diversity [30] and t-closeness [27]. Differential Privacy [18] promises protection in the face of arbitrary background knowledge. In its basic, interactive version, this means that, when querying a database via a differentially private mechanism, one will get approximately the same answers whether the data of any specific individual is included in the database or not. This is typically achieved by injecting controlled levels of noise into the reported answer, e.g. Laplacian noise. Differential Privacy is very effective when applied to certain summary statistics, such as histograms. However, it raises a number of difficulties when applied to table publishing: in concrete cases, the level of noise necessary to guarantee an acceptable degree of privacy would destroy utility [12], [13], [44]. Moreover, due to correlation phenomena, it appears that Differential Privacy cannot in general be used to control evidence about the participation of individuals in a database [4], [26]. In fact, the no-free-lunch theorem of Kifer and Machanavajjhala [26] implies that it is impossible to guarantee both privacy and utility without making assumptions about how the data have been generated (e.g., independence assumptions). Clifton and Tassa [10] critically review issues and criticisms involved in both syntactic methods and Differential Privacy, concluding that both have their place, in Privacy-Preserving Data Publishing and Data Mining, respectively. Both approaches have issues that call for further research. A few proposals involve blending the two approaches, with the goal of achieving both strong privacy guarantees and utility, see e.g. [28].

A major source of inspiration for our work has been Kifer's [25]. The main point of [25] is to demonstrate a pitfall of the random worlds model, where the attacker is assumed to assign equal probability to all cleartext tables compatible with the given anonymized one. Kifer shows that a Bayesian attacker willing to learn from the released table can draw sharper inferences than those possible in the random worlds model.
In particular, Kifer shows that it is possible to extract from (anatomized) ℓ-diverse tables belief probabilities greater than 1/ℓ, by means of the so-called deFinetti attack. While pinpointing a deficiency of the random worlds model, it is questionable whether this should be considered an attack, or just a legitimate learning strategy. Quoting [10] on the deFinetti attack:

    The question is whether the inference of a general behavior of the population in order to draw belief probabilities on individuals in that population constitutes a breach of privacy (...). To answer this question positively for an attack on privacy, the success of the attack when launched against records that are part of the table should be significantly higher than its success against records that are not part of the table. We are not aware of such a comparison for the deFinetti attack.

It is this very issue that we tackle in the present paper. Specifically, our main contribution here is to put forward a concept of relative privacy threat, as a means to assess the risks implied by publishing tables anonymized via group-based methods. To this end, we introduce: (a) a unified probabilistic model for group-based schemes; (b) rigorous characterizations of the learner's and the attacker's inference, based on Bayesian reasoning; and (c) a related MCMC method, which generalizes and systematizes that proposed in [25].

Very recently, partly inspired by differential privacy, a few authors have considered what might be called a relative or differential approach to assessing privacy threats, in conjunction with some notion of learning or inference from the anonymized data. Especially relevant to our work is differential inference, introduced in a recent paper by Kassem et al. [24]. These authors make a clear distinction between two different types of information that can be inferred from anonymized data: learning of "public" information, concerning the population, should be considered as legitimate; on the contrary, leakage of "private" information about individuals should be prevented. To make this distinction formal, given a dataset, they compare two probability distributions that can be machine-learned from two distinct training sets: one including and one excluding a target individual. An attack exists if there is a significant difference between the two distributions, measured e.g. in terms of Earth Mover's Distance. While similar in spirit to ours, this approach is conceptually and technically different from what we do here. Indeed, in our case the attacker explicitly takes advantage of the extra piece of information concerning the presence of the victim in the dataset to attack the target individual, which leads to a more direct notion of privacy breach. Moreover, in [24] a Bayesian approach to inference is not clearly posed, so the obtained results lack a semantic foundation and strongly depend on the adopted learning algorithm. Pyrgelis et al. [39] use Machine Learning for membership inference on aggregated location data, building a binary classifier that can be used to predict whether a target user is part of the aggregate data or not. A similar goal is pursued in [35]. Again, a clear semantic foundation of these methods is lacking, and the obtained results can be validated only empirically.
In a similar vein, [3] and [17] have proposed statistical techniques to detect privacy violations, but they only apply to differential privacy. Other works, such as [23] and [33], have considered only the problem of how to effectively learn from anonymized datasets, not that of how to characterize legitimate, as opposed to non-legitimate, inference.

On the side of the random worlds model, Chi-Wing Wong et al.'s work [9] shows how information on the population extracted from the anonymized table - in the authors' words, the foreground knowledge - can be leveraged by the attacker to violate the privacy of target individuals. The underlying reasoning, though, is based on the random worlds model, hence is conceptually and computationally very different from the Bayesian model adopted in the present paper. Bewong et al. [2] assess relative privacy threat for transactional data by a suitable extension of the notion of t-closeness, which is based on comparing the relative frequency of the victim's sensitive attribute in the whole table with that in the victim's group. Here the underlying assumption is that the attacker's prior knowledge about sensitive attributes matches the public knowledge, and that the observed sensitive attribute frequencies provide good estimates both for the public knowledge and the attacker's belief. Our proposal yields more sophisticated estimates via a Bayesian inferential procedure. Moreover, in our scenario the assumption on the attacker's knowledge is relaxed, requiring only knowledge of the victim's presence in some group of the table.

A concept very different from the previously discussed proposals is Rubin's multiple imputation approach [43], by which only tables of synthetic data, generated by sampling from a predictive distribution learned from the original table, are released. This avoids syntactic masking/obfuscation, whose analysis requires customized algorithms on the part of the learner, and leaves to the data producer the burden of synthesis. Note that this task can be nontrivial and raises a number of difficulties concerning the availability of auxiliary variables for non-sampled units, see [42]. In Rubin's view, synthetic data overcome all privacy concerns, in that no real individual's data is actually released. However, this position has been questioned, on the grounds that information about participants may leak through the chain: original table → posterior parameters → synthetic tables. In particular, Machanavajjhala et al. [31] study Differential Privacy of synthetic categorical data. They show that the release of such data can be made differentially private, at the cost of introducing very powerful priors. However, such priors can lead to a serious distortion in whatever is learned from the data, thus compromising utility. In fact, [50] argues that, in concrete cases, the required pseudo sample size hyperparameter could be larger than the size of the table. Experimental studies [7], [8] appear to confirm that such distorting priors are indeed necessary for released synthetic data to provide acceptable guarantees, in the sense of Differential Privacy. See [50] for a recent survey of results about synthetic data release and privacy.
An outline of the model presented here, with no proofs of correctness, appeared in the conference paper [5].

B. Structure of the Paper

The rest of the paper is organized as follows. In Section II we propose a unified formal definition of vertical and horizontal schemes. In Section III we put forward a probabilistic model to reason about the learner's and the attacker's inference; the case of prior partial knowledge of the victim's attributes on the part of the attacker is also covered. Based on that, measures of (relative) privacy threat and utility are introduced in Section IV. In Section V, we study an MCMC algorithm to learn the population parameters posterior and the attacker's probability distribution from the anonymized data. In Section VI, we illustrate the results of an experiment conducted on a real-world dataset. A few concluding remarks and perspectives for future work are reported in Section VII. Some technical material has been confined to Appendix A.

II. GROUP BASED ANONYMIZATION SCHEMES

A dataset consists of a collection of rows, where each row corresponds to an individual. Formally, let R and S, ranged over by r and s respectively, be finite non-empty sets of nonsensitive and sensitive values, respectively. A row is a pair (s, r) ∈ S × R. There might be more than one sensitive and nonsensitive characteristic, so s and r can be thought of as vectors.

A group-based anonymization algorithm A is an algorithm that takes a multiset of rows as input and yields an obfuscated table as output, according to the scheme

    multiset of rows → cleartext table → obfuscated table.

Formally, fix N ≥ 1. Given a multiset of N rows, d = {|(s1, r1), . . . , (sN, rN)|}, A will first arrange d into a sequence of groups, t = g1, . . . , gk, the cleartext table. Each group is in turn a sequence of ni rows, gi = (si,1, ri,1), . . . , (si,ni, ri,ni), where ni can vary from group to group. Note that both the number of groups, k ≥ 1, and the number of rows in each group, ni, depend in general on the original multiset d as well as on properties of the considered algorithm – such as ensuring k-anonymity and ℓ-diversity (see below). The obfuscated table is then obtained as a sequence t∗ = g∗1, . . . , g∗k, where the obfuscation of each group gi is a pair g∗i = (mi, li). Here, each mi = si,1, . . . , si,ni is the sequence of sensitive values occurring in gi; each li, called generalized nonsensitive value, is one of the following:

• for horizontal schemes, a superset of gi's nonsensitive values: li ⊇ {ri,1, . . . , ri,ni};
• for vertical schemes, the multiset of gi's nonsensitive values: li = {|ri,1, . . . , ri,ni|}.

Note that the generalized nonsensitive values in vertical schemes include all and only the values, with multiplicities, found in the corresponding original group. On the other hand, generalized nonsensitive values in horizontal schemes may include additional values, thus generating a superset. What values enter the superset depends on the adopted technique, e.g. micro-aggregation, generalization or suppression; in any case this makes the rows in each group indistinguishable when projected onto the nonsensitive attributes. For example, each of 45501, 45502 is generalized to the superset 4550∗ = {45500, 45501, . . . , 45509} in the first group of Table I(b).
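To make the two kinds of generalized nonsensitive value concrete, the following Python sketch (ours, purely illustrative and not from the paper; the ZIP-prefix rule generalize_zip is an assumed example, not prescribed by the schemes) obfuscates a single group under a horizontal scheme and under a vertical scheme:

```python
from collections import Counter

# A group is a list of (sensitive, nonsensitive) rows, here (disease, ZIP code).
group = [("Heart", "45501"), ("Flu", "45502")]

def generalize_zip(zip_code):
    # 4550* stands for the superset {45500, ..., 45509}
    return zip_code[:-1] + "*"

def obfuscate_horizontal(group):
    # m_i: sequence of sensitive values; l_i: a superset of the nonsensitive
    # values, namely all ZIP codes sharing the generalized prefix.
    m = [s for s, _ in group]
    prefixes = {generalize_zip(r) for _, r in group}
    assert len(prefixes) == 1, "all rows must share the same generalization"
    prefix = prefixes.pop()[:-1]
    l = {prefix + str(d) for d in range(10)}
    return m, l

def obfuscate_vertical(group):
    # m_i: sequence of sensitive values; l_i: the multiset of the original
    # nonsensitive values (the row-to-value link is lost).
    m = [s for s, _ in group]
    l = Counter(r for _, r in group)
    return m, l

print(obfuscate_horizontal(group))   # (['Heart', 'Flu'], {'45500', ..., '45509'})
print(obfuscate_vertical(group))     # (['Heart', 'Flu'], Counter of the two ZIPs)
```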
Sometimes it will be notationally convenient to ignore the group structure of t altogether, and regard the cleartext table t simply as a sequence of rows, (s1, r1), (s2, r2), . . . , (sN, rN). Each row (sj, rj) is then uniquely identified within the table t by its index 1 ≤ j ≤ N.

An instance of horizontal schemes is k-anonymity [46]: in a k-anonymous table, each group consists of at least k ≥ 1 rows, where the different nonsensitive values appearing within each group have been generalized so as to make them indistinguishable. In the most general case, different occurrences of the same nonsensitive value might be generalized in different ways, depending on their position (index) within the table t: this is the case of local recoding. Alternatively, each occurrence of a nonsensitive value is generalized in the same way, independently of its position: this is the case of global recoding. Further conditions may be imposed on the resulting anonymized table, such as ℓ-diversity, requiring that at least ℓ ≥ 1 distinct values of the sensitive attribute appear in each group. Table I (center) shows an example of a k = 2-anonymous and ℓ = 2-diverse table: in each group the nonsensitive values are indistinguishable and two different sensitive values (diseases) appear in each group.

An instance of vertical schemes is Anatomy [49]: within each group, the link between the sensitive and nonsensitive values is hidden by randomly permuting one of the two parts, for example the nonsensitive one. As a consequence, an anatomized table may be seen as consisting of two sub-tables: a sensitive and a nonsensitive one. Table I (c) shows an example of an anatomized table: in the nonsensitive sub-table, the reference to the corresponding sensitive values is lost; only the multiset of nonsensitive values appears for each group.

Remark 1 (disjointness): Some anonymization schemes enforce the following disjointness property on the obfuscated table t∗:

    Any two generalized nonsensitive values in t∗ are disjoint: i ≠ j implies li ∩ lj = ∅.

We need not assume this property in our treatment - although assuming it may be computationally useful in practice (see Section III).

For ease of reference, we provide a summary of the notation that will be used throughout the paper in Table II.

TABLE II
SUMMARY OF NOTATION

III. A UNIFIED PROBABILISTIC MODEL

We provide a unified probabilistic model for reasoning on group-based schemes. We first introduce the random variables of the model together with their joint density function. On top of these variables, we then define the probability distributions on S × R that formalize the learner's and the attacker's knowledge, given the obfuscated table.

A. Random Variables

The model consists of the following random variables.
• Θ, taking values in the set of full support probability distributions D over S × R, is the joint probability distribution of the sensitive and nonsensitive attributes in the population.
• T = G1, . . . , Gk, taking values in the set of cleartext tables T. Each group Gi is in turn a sequence of ni ≥ 1 consecutive rows in T, Gi = (Si,1, Ri,1), . . . , (Si,ni, Ri,ni).
The number of groups k is not fixed, but depends on the anonymization scheme and the specific tuples composing T.
• T∗ = G∗1, . . . , G∗k, taking values in the set of obfuscated tables T∗.

We assume that the above three random variables form a Markov chain:

    Θ → T → T∗.   (1)

In other words, uncertainty on T is driven by Θ, and T∗ solely depends on the table T and the underlying obfuscation algorithm. As a result, T∗ ⊥⊥ Θ | T. Equivalently, the joint probability density function f of these variables can be factorized as follows, where π, t, t∗ range over D, T and T∗, respectively:

    f(π, t, t∗) = f(π) f(t|π) f(t∗|t).   (2)

Additionally, we shall assume the following:
• π ∈ D is encoded as a pair π = (πS, πR|S) where πR|S = {πR|s : s ∈ S}. Here, πS are the parameters of a full support categorical distribution over S, and, for each s ∈ S, πR|s are the parameters of a full support categorical distribution over R. For each (s, r) ∈ S × R

    f(s, r|π) = f(s|πS) · f(r|πR|s).

We also posit that πS and the πR|s's are chosen independently, according to Dirichlet distributions of hyperparameters α = (α1, . . . , α|S|) and β^s = (β^s_1, . . . , β^s_|R|), respectively. In other words

    f(π) = Dir(πS | α) · ∏_{s∈S} Dir(πR|s | β^s).   (3)

The hyperparameters α and β may incorporate prior (background) knowledge on the population, if this is available. Otherwise, an uninformative prior can be chosen by setting αi = β^s_j = 1 for each i, s, j. When r ∈ R is a tuple of attributes, we shall assume conditional independence of those attributes given s, so that the joint probability of r|s can be determined by factorization.
• The N individual rows composing the table t, say (s1, r1), . . . , (sN, rN), are assumed to be drawn i.i.d. according to f(·|π). Equivalently

    f(t|π) = f(s1, r1|π) · · · f(sN, rN|π).   (4)

Instances of the above model can be obtained by specifying an anonymization mechanism A. In particular, the distribution f(t∗|t) only depends on the obfuscation algorithm that is adopted, say obf(t). In the important special case where obf(t) acts as a deterministic function on tables, f(t∗|t) = 1 if and only if obf(t) = t∗, otherwise f(t∗|t) = 0.

B. Learner and Attacker Knowledge

We shall denote by pL the probability distribution over S × R that can be learned given the anonymized table t∗. This distribution we take to be the average of f(s, r|π) with respect to the density f(Θ = π | T∗ = t∗). Formally, for each (s, r) ∈ S × R:

    pL(s, r|t∗) := E_{π∼f(π|t∗)}[ f(s, r|π) ] = ∫_D f(s, r|π) f(π|t∗) dπ.   (5)

Of course, we can condition pL on any given r and obtain the conditional probability pL(s|r, t∗). Equivalently, we can compute

    pL(s|r, t∗) := E_{π∼f(π|t∗)}[ f(s|r, π) ] = ∫_D f(s|r, π) f(π|t∗) dπ.   (6)

In particular, one can read off this distribution on a victim's nonsensitive attribute, say rv, and obtain the corresponding distribution on S.
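The expectations in (5) and (6) can be approximated by plain Monte Carlo once draws from f(π|t∗) are available (obtaining such draws is the subject of Section V). The following Python sketch is ours, not part of the paper; the data layout (lists S and R of values, and pi_samples as a list of pairs (πS, πR|S), the latter a dictionary from s to a probability vector over R) is an assumption made for illustration:

```python
import numpy as np

def estimate_p_L(pi_samples, S, R):
    # Monte Carlo version of (5): average f(s, r | pi) over the draws of pi.
    p = np.zeros((len(S), len(R)))
    for pi_S, pi_R_given_S in pi_samples:
        p += pi_S[:, None] * np.array([pi_R_given_S[s] for s in S])
    return p / len(pi_samples)

def estimate_p_L_conditional(pi_samples, S, R, r):
    # Monte Carlo version of (6): average f(s | r, pi) = f(s, r | pi) / f(r | pi).
    p = np.zeros(len(S))
    j = R.index(r)
    for pi_S, pi_R_given_S in pi_samples:
        joint = pi_S * np.array([pi_R_given_S[s][j] for s in S])
        p += joint / joint.sum()
    return p / len(pi_samples)
```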
We shall assume the attacker knows the values of T∗ = t∗ and the nonsensitive value rv of a target individual, the victim; moreover, the attacker knows the victim is an individual in the table. Accordingly, in what follows we fix once and for all t∗ and rv: these are the values observed by the attacker. Given knowledge of a victim's nonsensitive attribute rv and knowledge that the victim is actually in the table T, we can define the attacker's distribution on S as follows.

Let us introduce in the above model a new random variable V, identifying the index of the victim within the cleartext table T. We posit that V is uniformly distributed on {1, . . . , N}, and independent from Θ, T, T∗. Recalling that each row (Sj, Rj) is identified within T by a unique index j, we can define the attacker's probability distribution on S, after seeing t∗ and rv, as follows, where it is assumed that f(RV = rv, t∗) > 0, that is, the observed victim's rv is compatible with t∗:

    pA(s|rv, t∗) := f(SV = s | RV = rv, t∗).   (7)

The following crucial lemma provides us with a characterization of the above probability distribution that is based only on a selection of the marginals Rj given t∗. This will be the basis for actually computing pA(s|rv, t∗). Note that, on the right-hand side, only those rows whose sensitive value - known from t∗ - is s contribute to the summation. A proof of the lemma is reported in Appendix A.

Lemma 1: Let T = (Sj, Rj), j = 1, . . . , N. Let sj be the sensitive value in the j-th entry of t∗. Let rv and t∗ be such that f(RV = rv, t∗) > 0. Then

    pA(s|rv, t∗) ∝ Σ_{j: sj = s} f(Rj = rv | t∗).   (8)

Note that the disjointness of the generalized nonsensitive values of the groups can make the computation of (8) more efficient, restricting the summation on the right-hand side to a unique group.

Example 1: In order to illustrate the difference between the learner's and the attacker's inference, we reconsider the toy example in the Introduction. Let t∗ be the 2-anonymous, 2-diverse Table I(b). Assume the attacker's victim is the first individual of the original dataset, who is from Malaysia (= M) and lives in the ZIP code 45501 area, hence rv = (M, 45501). Table III shows the belief probabilities of the learner, pL(s|rv, t∗), and of the attacker, pA(s|rv, t∗), for the victim's disease s. We also include the random worlds model probabilities, pRW(s|rv, t∗), which are just proportional to the frequency of each sensitive value within the victim's group. Note that the learner and the attacker distributions have the same mode, but the attacker is more confident about his prediction of the victim's disease. The random worlds model produces a multi-modal solution.

TABLE III
POSTERIOR DISTRIBUTIONS OF DISEASES FOR A VICTIM WITH rv = (M, 45501), FOR THE ANONYMIZED t∗ IN TABLE I(B). NB: FIGURES AFFECTED BY ROUNDING ERRORS

As to the computation of the probabilities in Table III, a routine application of the equations (2)–(8) shows that pL and pA reduce to the expressions (9) and (10) below, given in terms of the model's density (2). The crucial point here is that the adversary knows the group his victim is in, i.e. the first two lines of t∗ in the example. Below, s ∈ S; for j = 1, 2, sj denotes the sensitive value of the j-th row, while t is a cleartext table, from which t−j is obtained by removing (sj, rv). It is assumed that the obfuscation algorithm A is deterministic, so that f(t∗|t) ∈ {0, 1}.
    pL(s|rv, t∗) ∝ ∫_D f(π) f(s, rv|π) Σ_{t: A(t)=t∗} f(t|π) dπ   (9)

    pA(sj|rv, t∗) ∝ ∫_D f(π) f(sj, rv|π) Σ_{t−j: A(t)=t∗} f(t−j|π) dπ.   (10)

Unfortunately, the analytic computation of the above integrals, even for the considered toy example, is a daunting task. For instance, the summation in (9) has as many terms as t∗-compatible tables t, that is 6.4 × 10^5 for Example 1 – although the resulting expression can be somewhat simplified using the independence assumption (4). Accordingly, the figures in Table III have been computed resorting to simulation techniques, see Section V.

An alternative, more intuitive description of the inference process is as follows. The learner and the attacker first learn the parameters π given t∗, that is, they evaluate f(πDis|t∗), f(πZIP|s|t∗) and f(πNat|s|t∗), for all s ∈ S. Due to the uncertainty on the ZIP code and/or Nationality, learning π takes the form of a mixture (this is akin to learning with soft evidence, see Corradi et al. [11]). After that, the learner, ignoring that the victim is in the table, predicts the probability of rv, pL(rv|s, t∗), for all s, by using a mixture of Multinomial-Dirichlet. The attacker, on the other hand, while still basing his prediction pA(rv|s, t∗) on the parameter learning outlined above, restricts his attention to the first two lines of t∗, thus realizing that s ∈ {Heart, Flu}. Then, by Bayes' theorem, and adopting the relative frequencies of the diseases in t∗ as an approximation of f(s|t∗), the posterior probability of the diseases for the victim can be computed.

Remark 2 (attacker's inference and forensic identification): The attacker's inference is strongly reminiscent of two famous settings in forensic science: the Island Problem (IP) and the Database Search Problem (DBS), see e.g. [1], [14] and more recently [45]. On an island with N inhabitants a crime is committed; a characteristic of the criminal (e.g. a DNA trait) is found at the crime scene. It is known that the island's inhabitants possess this characteristic independently with probability p. It is assumed that there exists exactly one culprit C on the island. In IP, one of the island's inhabitants, I, the suspect, is found to have the given characteristic, while the others are not tested. An investigator is interested in the probability that I = C.

When we cast this scenario in our framework, the individuals in the table play the role of the inhabitants (including the culprit), while rv plays the role of the characteristic found at the crime scene, matching that of the suspect. In other words - perhaps ironically - our framework's victim plays here the role of the suspect, while our attacker is essentially the investigator. Letting S = {0, 1} (innocent/guilty) and R = {0, 1} (characteristic absent/present), the investigator's information is then summarized by an obfuscated horizontal table t∗ of N rows with as many groups, where exactly one row, say the j-th, has Sj = 1 and R∗j = Rj = 1 (the culprit), while for i ≠ j, Si = 0 and R∗i = ∗ (the N − 1 innocent inhabitants).
Recalling that the variable V in our framework represents the suspect's index within the table, the probability that I = C is

    Pr(V = j | RV = 1, t∗) = Pr(SV = 1 | RV = 1, t∗) = pA(s = 1 | rv = 1, t∗).

Then applying (8), we find

    pA(s = 1|rv = 1, t∗) = f(Rj = 1|t∗) / [ f(Rj = 1|t∗) + (N − 1) f(R_{i≠j} = 1|t∗) ]
                         = 1 / [ 1 + (N − 1) f(R_{i≠j} = 1|t∗) ].   (11)

By taking suitable prior hyperparameters, f(R_{i≠j} = 1|t∗) can be made arbitrarily close to p. For ease of comparison with the classical IP and DBS settings, rather than relying on a learning procedure, we just assume here f(Ri = 1|t∗) = p for i ≠ j, so that (11) simplifies to

    pA(s = 1|rv = 1, t∗) = 1 / (1 + (N − 1)p)   (12)

which is the classical result known from the literature.

In DBS, the indicted individual exhibiting rv is found after testing 1 ≤ k < N individuals that do not exhibit rv. This means the table t∗ now consists of k rows (s, r) = (0, 0) (the k innocent, tested inhabitants not exhibiting rv), one row (s, r) = (1, 1) (the culprit) and N − 1 − k rows (s, r∗) = (0, ∗) (the N − 1 − k innocent, non-tested inhabitants). Accordingly, letting j = k + 1, and possibly after rearranging indices, (11) becomes

    pA(s = 1|rv = 1, t∗) = f(R_{k+1} = 1|t∗) / [ f(R_{k+1} = 1|t∗) + k·f(R_{i∈{1,...,k}} = 1|t∗) + (N − 1 − k)·f(R_{i>k+1} = 1|t∗) ].   (13)

Letting f(Ri = 1|t∗) = p for i > k + 1, equation (13) becomes

    pA(s = 1|rv = 1, t∗) = 1 / (1 + (N − 1 − k)p)

which again is the classical result known from the literature. Finally, note that our methodology also covers the possibility of learning about the probability of the characteristic, f(Ri = 1|t∗), but here we have only stressed how the attacker strategy solves the IP and DBS forensic problems. Uncertainty about population parameters and identification has been considered elsewhere by one of us [6].

We now briefly discuss an extension of our framework to the more general case where the attacker has only partial information about his victim's nonsensitive attributes. For a typical application, think of a dataset where R and S are individuals' genetic profiles and diseases, respectively, with an adversary knowing only a partial DNA profile of his victim, e.g. only the alleles at a few loci. Formally, fix a nonempty set Y and let g : R → Y be a (typically non-injective) function, modeling the attacker's observation of the victim's nonsensitive attribute. With the notation introduced above, consider the random variable Y := g(RV). It is natural to extend definition (7) as follows, where g(rv) = yv ∈ Y and f(Y = yv, t∗) > 0:

    pA(s|yv, t∗) := f(SV = s | Y = yv, t∗).   (14)

It is a simple matter to check that (8) becomes the following, where g−1(y) ⊆ R denotes the counter-image of y according to g:

    pA(s|yv, t∗) ∝ Σ_{j: sj = s} f(Rj ∈ g−1(yv) | t∗).   (15)

Also note that one has f(Rj ∈ g−1(yv) | t∗) = Σ_{r∈g−1(yv)} f(Rj = r | t∗). An extension to the case of partial and noisy observations can be modeled similarly, by letting Y = g(RV, E), where E is a random variable representing an independent source of noise. We leave the details of this extension for future work.

IV. MEASURES OF PRIVACY THREAT AND UTILITY

We are now set to define the measures of privacy threat and utility we are after.
We will do so from the point of view of a person or entity, the evaluator, who:

(a) has a copy of the cleartext table t, and can build an obfuscated version t∗ of it;
(b) must decide whether to release t∗ or not, weighing the privacy threats and the utility implied by this act.

The evaluator clearly distinguishes the position of the learner from that of the attacker. The learner is interested in learning from t∗ the characteristics of the general population, via pL. The attacker is interested in learning from t∗ the sensitive value of a target individual, the victim, via pA. The latter probability distribution is derived by exploiting the additional piece of information that the victim is an individual known to be in the original table, of whom the attacker gets to know the nonsensitive values. As pointed out in [34], information about the victim's nonsensitive attributes can be easily gathered from other sources such as personal blogs and social networks. These assumptions about the attacker's knowledge allow a comparison between the risks of a sensitive attribute disclosure for an individual who is part of the table and for individuals who are not. The evaluator adopts the following relative, or differential, point of view:

    a situation where, for some individual, pA conveys much more information than that conveyed by pL (the learner's legitimate inference on the general population), must be deemed a privacy threat.

Generally speaking, the evaluator should refrain from publishing t∗ if, for some individual, the level of relative privacy threat exceeds a predefined threshold. Concerning the definition of the level of threat, the evaluator adopts the following Bayesian decision-theoretic point of view. Whatever distribution p is adopted to guess the victim's sensitive value, the attacker is faced with some utility function. Here, we consider a simple 0-1 utility function for the attacker, yielding 1 if the sensitive attribute is guessed correctly and 0 otherwise. The resulting attacker's expected utility is maximized by the Bayes act, i.e. by choosing s = argmax_{s′∈S} p(s′), and equals p(s). The above discussion leads to the following definitions. Note that we consider threat measures both for individual rows and for the overall table. For each threatened row, the relative threat index Ti says how many times the probability of correctly guessing the secret is increased by the attacker's activity, i.e. by exploiting the knowledge of the victim's presence in the table. At a global, table-wise level, the evaluator also considers the fraction GTA of rows threatened by the attacker.

Definition 1 (privacy threat): We define the following privacy threat measures.
• Let q be a full support distribution on S and (s, r) be a row in t. We say (s, r) is threatened under q if q(s) = max_{s′} q(s′), and that its threat level under q is q(s).
• For a row (s, r) in t that is threatened by pA(·|r, t∗), its relative threat level is

    Ti(s, r, t, t∗) := pA(s|r, t∗) / pL(s|r, t∗).   (16)

• Let NA(t, t∗) be the number of rows (s, r) in t threatened by pA(·|r, t∗).
The global threat level GTA(t, t∗) is the fraction of rows that are threatened, that is

    GTA(t, t∗) := NA(t, t∗) / N.   (17)

Similarly, we denote by GTL(t, t∗) the fraction of rows (s, r) in t that are threatened under pL(·|r, t∗).
• As a measure of how much better the attacker performs than the learner at a global level, we introduce the relative global threat:

    RGTA(t, t∗) := max{0, GTA(t, t∗) − GTL(t, t∗)}.   (18)

Remark 3 (setting a threshold for Ti): A difficult issue is how to set an acceptable threshold for the relative threat level Ti. This is conceptually very similar to the question of how to set the level of ε in differential privacy: its proponents have always maintained that the setting of ε is a policy question, not a technical one. Much depends on the application at hand. For instance, when the US Census Bureau adopted differential privacy, this task was delegated to a committee (the Data Stewardship Executive Policy committee, DSEP); details on the operations of this committee can be found in [19, Sect. 3.1]. We think that similar considerations apply when setting the threshold of Ti. For instance, an evaluator might consider the distribution of the Ti values in the dataset (see Fig. 3a–3h in Section VI) and then choose a percentile as a cutoff.

The evaluator is also interested in the potential utility conveyed by an anonymized table for a learner. Note that the learner's utility is distinct from the attacker's. Indeed, the learner's interest is to make inferences that are as close as possible to the ones that could be made using the cleartext table. Accordingly, obfuscated tables that are faithful to the original table are the most useful. This leads us to compare two distributions on the population: the distribution learned from the anonymized table, pL, and the ideal (I) distribution, pI, one can learn from the cleartext table t. The latter is formally defined as the expectation² of f(s, r|π) under the posterior density f(π|t). Explicitly, for each (s, r)

    pI(s, r|t) := ∫_D f(s, r|π) f(π|t) dπ.   (19)

Note that the posterior density f(π|t) is in turn a Dirichlet density (see next section) and therefore a simple closed form of the above expression exists, based on the frequencies of the pairs (s, r) in t. In particular, recalling the αs, β^s_r notation for the prior hyperparameters introduced in Section III, let α0 = Σ_s αs and β^s_0 = Σ_r β^s_r, and let γs(t) and δ^s_r(t) denote the frequency counts of s and (s, r), respectively, in t. Then we have

    pI(s, r|t) = [(αs + γs(t)) / (α0 + N)] · [(β^s_r + δ^s_r(t)) / (β^s_0 + γs(t))].   (20)

² Another sensible choice would be taking pI(s, r|t) = f(s, r|πMAP), where πMAP = argmax_π f(π|t) is the maximum a posteriori distribution given t. This choice would lead to essentially the same results.

The comparison between pL and pI can be based on some form of distance between distributions. One possibility is to rely on the total variation (aka statistical) distance. Recall that, for discrete distributions q, q′ defined on the same space X, the total variation distance is defined as

    TV(q, q′) := sup_{A⊆X} |q(A) − q′(A)| = (1/2) Σ_x |q(x) − q′(x)|.

Note that TV(q, q′) ∈ [0, 1]. Note also that this is a quite conservative notion of diversity, since it is based on the event that shows the largest difference between the two distributions.

Definition 2 (faithfulness): The relative faithfulness level of t∗ w.r.t. t is defined as

    RF(t, t∗) := 1 − TV( pI(·|t), pL(·|t∗) ).
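In code form, the quantities of Definitions 1 and 2 amount to a few lines. The following Python sketch is illustrative only (not the authors' implementation); it assumes that the conditional distributions pA, pL are available as dictionaries mapping a nonsensitive value r to a probability vector over S, and that the joint pI, pL are available as arrays over S × R for the faithfulness computation:

```python
import numpy as np

# rows is the cleartext table t as a list of (s, r) pairs; S is the list of
# sensitive values, so that S.index(s) gives the position of s in each vector.

def is_threatened(q, s_idx):
    # (s, r) is threatened under q if s attains the maximum of q
    return np.isclose(q[s_idx], q.max())

def relative_threat(p_A_cond, p_L_cond, s_idx, r):
    # Ti of (16): how many times the attacker improves over the learner
    return p_A_cond[r][s_idx] / p_L_cond[r][s_idx]

def global_threats(rows, p_A_cond, p_L_cond, S):
    # GT_A, GT_L of (17) and RGT_A of (18)
    N = len(rows)
    n_A = sum(is_threatened(p_A_cond[r], S.index(s)) for s, r in rows)
    n_L = sum(is_threatened(p_L_cond[r], S.index(s)) for s, r in rows)
    GT_A, GT_L = n_A / N, n_L / N
    return GT_A, GT_L, max(0.0, GT_A - GT_L)

def relative_faithfulness(p_I_joint, p_L_joint):
    # RF of Definition 2; arguments are joint arrays over S x R
    return 1.0 - 0.5 * np.abs(p_I_joint - p_L_joint).sum()
```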
Remark 4: In practice, the total variation of two high-dimensional distributions might be very hard to compute. Pragmatically, we note that, for M large enough, TV(q, q′) = (1/2) E_{x∼q}[ |1 − q′(x)/q(x)| ] ≈ (1/(2M)) Σ_{i=1}^M |1 − q′(xi)/q(xi)|, where the xi are drawn i.i.d. according to q(x). Then a proxy to total variation is the empirical total variation defined below, where (si, ri), for i = 1, . . . , M, are generated i.i.d. according to pI(·, ·|t):

    ETV(t, t∗) := (1/(2M)) Σ_{i=1}^M | 1 − pL(si, ri|t∗) / pI(si, ri|t) |.   (21)

Remark 5 (ideal knowledge vs. attacker's knowledge): The following scenario is meant to further clarify the extra power afforded to the attacker by the mere knowledge that his victim is in the table. Consider a trivial anonymization mechanism that simply releases the cleartext table, that is, t∗ = t. As pL = pI in this case, it would be tempting to conclude that the attacker cannot do better than the learner, hence there is no relative risk involved. However, this conclusion is wrong: for instance, pI(·|rv, t) can fail to predict the victim's correct sensitive value if this value is rare, as we show below.

For the sake of simplicity, consider the case where the observed victim's nonsensitive attribute rv occurs just once in t, in a row (s0, rv). Also assume a noninformative Dirichlet prior, that is, in the notation of Section III, set the hyperparameters to αs = β^s_r = 1 for each s ∈ S, r ∈ R. Then, simple calculations based on (20) and the attacker's distribution characterization (8) show the following. Here, for each s ∈ S, γs = γs(t) denotes the frequency count of s in t, and c is a suitable normalizing constant:

    pI(s|rv, t) = c · (1 + γs) / (|R| + γs),          if s ≠ s0
    pI(s|rv, t) = c · 2(1 + γs0) / (|R| + γs0),       if s = s0

    pA(s|rv, t∗) = 0, if s ≠ s0;   pA(s|rv, t∗) = 1, if s = s0.   (22)

As far as the target individual (s0, rv) ∈ t is concerned, we see that while pA predicts s0 with certainty, predictions based on pL = pI will be blatantly wrong if there are values s ≠ s0 that occur very frequently in t, while s0 is rare, and N is large compared to |R|. To make an extreme numeric case, consider |S| = 2, |R| = 1000 and γs0 = 1 in a table t of N = 10^6 rows: plugging these values into (22) yields pL(s0|rv, t∗) = pI(s0|rv, t) ≈ 0.004, hence a relative threat for (s0, rv) of 1/pL(s0|rv, t∗) ≈ 250.

V. LEARNING FROM THE OBFUSCATED TABLE BY MCMC

Estimating the privacy threat and faithfulness measures defined in the previous section, for specific tables t and t∗, implies being able to compute the distributions (5), (6) and (8). Unfortunately, these distributions, unlike (19), are not available in closed form, since f(Θ = π | T∗ = t∗) = f(π|t∗) cannot be derived analytically. Indeed, in order to do so, one should integrate f(π, t|t∗) with respect to the density f(t|t∗), which appears not to be feasible.

To circumvent this difficulty, we will introduce a Gibbs sampler, defining a Markov chain (Xi)_{i≥0}, with Xi = (Θi, Ti), converging to the density
    f(Θ = π, T = t | t∗) = f(Θ = π, S1 = s1, R1 = r1, . . . , SN = sN, RN = rN | t∗)

(note that the sensitive values sj in T are in fact fixed and known, given t∗). General results (see e.g. [41]) ensure that, if Θ^(0), Θ^(1), . . . are the samples drawn from the Θ-marginal of such a chain, then for each (s, r) ∈ S × R

    (1/M) Σ_{ℓ=0}^{M} f(s, r | Θ^(ℓ)) → ∫_D f(s, r|π) f(π|t∗) dπ = pL(s, r|t∗)   (23)

    (1/M) Σ_{ℓ=0}^{M} f(s | r, Θ^(ℓ)) → ∫_D f(s|r, π) f(π|t∗) dπ = pL(s|r, t∗)   (24)

almost surely as M → +∞. Therefore, by selecting an appropriately large M, one can build approximations of pL(s, r|t∗) and pL(s|r, t∗) using the arithmetical means on the left-hand side of (23) and (24), respectively. Moreover, for each index 1 ≤ j ≤ N, using samples drawn from the Rj-marginals of the same chain, one can build an estimate of f(Rj = rj | t∗). Consequently, using (8) (resp. (15), in the case of partial observation) one can estimate pA(s|rv, t∗) (resp. pA(s|yv, t∗)) for any given rv (resp. yv).

In the rest of the section, we will first introduce the MCMC for this problem and then show its convergence. We will then discuss details of the sampling procedures for each of the two possible schemes, horizontal and vertical.

A. Definition and Convergence of the Gibbs Sampler

Simply stated, our problem is sampling from the marginals of the following target density function, where t∗ = g∗1, . . . , g∗k and t = g1, . . . , gk (note that the number of groups k is known and fixed, given t∗):

    f(π, t|t∗).   (25)

Note that the rj's of interest, for 1 ≤ j ≤ N, are the elements of the groups gi, for 1 ≤ i ≤ k. The Gibbs scheme allows for some freedom as to the blocking of variables. Here we consider k + 1 blocks, coinciding with π and g1, . . . , gk. This is natural as, in the considered schemes, (Ri, Si) ⊥⊥ (Rj, Sj) | π, t∗ for (Ri, Si) and (Rj, Sj) occurring in distinct groups. Formally, let x^(0) = (π^(0), t^(0)) (with t^(0) = g^(0)_1, . . . , g^(0)_k) denote any initial state satisfying f(π^(0), t^(0)|t∗) > 0. Given a state at step h, x^(h) = (π^(h), t^(h)) (t^(h) = g^(h)_1, . . . , g^(h)_k), one lets x^(h+1) := (π^(h+1), t^(h+1)), where t^(h+1) = g^(h+1)_1, . . . , g^(h+1)_k and

    π^(h+1) is drawn from f(π | t^(h), t∗)   (26)

    g^(h+1)_i is drawn from f(g | π^(h+1), g^(h+1)_1, . . . , g^(h+1)_{i−1}, g^(h)_{i+1}, . . . , g^(h)_k, t∗)   (1 ≤ i ≤ k).   (27)

Running this chain presupposes we know how to sample from the full conditional distributions on the right-hand side of (26) and (27). In particular, there are several possible approaches to sample g. In this subsection we provide a general discussion about convergence, postponing the details of sampling from the full conditionals to the next subsection.

Let us denote by t−i := g1, . . . , g_{i−1}, g_{i+1}, . . . , gk the table obtained by removing the i-th group gi from t. The following relations for the full conditionals of interest can be readily checked, relying on the conditional independencies of the model (2) and (4) (we presuppose that in each case the conditioning event has nonzero probability):

    f(π|t, t∗) = f(π|t)   (28)

    f(g|π, t−i, t∗) ∝ f(g|π) f(t∗|g, t−i)   (1 ≤ i ≤ k).   (29)

As we shall see, each of the above two relations enables sampling from the densities on the left-hand side.
Indeed, (28) is a posterior Dirichlet distribution, from which effective sampling can be easily performed (see next subsection). A straightforward implementation of (29) in an Acceptance-Rejection (AR) sampling perspective is as follows: draw g according to f(g|π) and accept it with probability f(t∗|g, t−i) = f(t∗|t). Here, f(t∗|t) is just the probability that the obfuscation algorithm returns t∗ as output when given t = g, t−i as input. Actually, to make sampling from the RHS of (29) effective, further assumptions will be introduced (see next subsection). Note that, since the sensitive values are fixed in t and known from the given t∗, sampling g in (29) is actually equivalent to sampling the nonsensitive values of the group.

In addition to (29), to simplify our discussion about convergence, we shall henceforth assume that, for each group index 1 ≤ i ≤ k, the set of instances of the i-th group that are compatible with t∗ does not depend on the rest of the table, t−i. That is, we assume that for each i (1 ≤ i ≤ k):

    {g : f(t∗|g, t−i) > 0} = {g : f(t∗|g, t′−i) > 0}   for all t−i and t′−i;   we denote this common set by Gi.   (30)

For instance, (30) holds true if the anonymization algorithm ensures t∗ is independent from t−i given the i-th group g: t∗ ⊥⊥ t−i | g.

Let x = (π, g1, . . . , gk) denote a generic state of this Markov chain. Under the assumption (30), the support of the target density f(x|t∗) is the product space

    X := D × G1 × · · · × Gk.   (31)

By this, we mean that {x : f(x|t∗) > 0} = X. This is a consequence of: (a) the fact that the Dirichlet only considers full support distributions; and (b) equation (29), taking into account the assumption (30). Let X0, X1, . . . denote the Markov chain defined by the sampler over X and denote by κ(·|·) its conditional kernel density over X. Slightly abusing notation, let us still indicate by f(·|t∗) the probability distribution over X induced by the density f(x|t∗). Convergence in distribution follows from the following proposition, which is an instance of general results – see e.g. the discussion following Corollary 1 of [41].

Proposition 1 (convergence): Assume (30). For each (measurable) set A ⊆ X such that f(A|t∗) > 0 and each x0 ∈ X, we have κ(X1 ∈ A | X0 = x0) > 0. As a consequence, the Markov chain (Xi)_{i≥0} is irreducible and aperiodic, and its stationary density is f(x|t∗) in (25).

B. Sampling From the Full Conditionals

Let us consider (28) first. It is a standard fact that the posterior of the Dirichlet distribution f(π|t), given the N i.i.d. observations t drawn from the categorical distribution f(·|π), is still a Dirichlet, where the hyperparameters have been updated as follows. Denote by γ(t) = (γ1, . . . , γ|S|) the vector of the frequency counts γi of each si in t. Similarly, given s, denote by δ^s(t) = (δ^s_1, . . . , δ^s_|R|) the vector of the frequency counts δ^s_i of the pairs (ri, s), for each ri, in t. Then, for each π = (πS, πR|S), we have

    f(π|t) = Dir(πS | α + γ(t)) · ∏_{s∈S} Dir(πR|s | β^s + δ^s(t)).   (32)

Let us now discuss (29). In what follows, for the sake of notation we shall write a generic i-th group as gi = (s1, r1), . . . , (sn, rn) (thus avoiding double subscripts), and let g∗i = (mi, li) denote the corresponding obfuscated group in t∗.
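For concreteness, the sampler of (26)–(27), with the π-step implemented through the conjugate update (32), can be sketched as follows. This is our own illustrative Python code, not the authors' released implementation; sample_group and init_group are placeholders for the scheme-specific group step discussed below, and the returned π-draws and row marginals feed, respectively, the averages (23)–(24) (see the sketch after (6)) and Lemma 1 (8):

```python
import numpy as np

def dirichlet_posterior_draw(alpha, beta, table, S, R):
    # Full conditional (28): conjugate Dirichlet update (32) given the current
    # cleartext table, followed by one draw of pi = (pi_S, pi_{R|S}).
    gamma = np.array([sum(1 for s, _ in table if s == sv) for sv in S])
    pi_S = np.random.dirichlet(alpha + gamma)
    pi_R_given_S = {
        sv: np.random.dirichlet(
            beta[sv] + np.array([sum(1 for s, r in table if s == sv and r == rv)
                                 for rv in R]))
        for sv in S}
    return pi_S, pi_R_given_S

def gibbs(t_star_groups, sample_group, init_group, alpha, beta, S, R, M):
    # One chain of (26)-(27); burn-in and thinning omitted for brevity.
    # t_star_groups: obfuscated groups (m_i, l_i); sample_group draws a group
    # from the full conditional (29); init_group returns any compatible group.
    groups = [init_group(g_star) for g_star in t_star_groups]
    pi_samples = []
    row_counts = {}                      # counts for estimating f(R_j = r | t*)
    for _ in range(M):
        table = [row for g in groups for row in g]
        pi = dirichlet_posterior_draw(alpha, beta, table, S, R)
        groups = [sample_group(g_star, pi) for g_star in t_star_groups]
        pi_samples.append(pi)
        for j, (_, r) in enumerate(row for g in groups for row in g):
            row_counts[(j, r)] = row_counts.get((j, r), 0) + 1
    row_marginals = {jr: c / M for jr, c in row_counts.items()}
    return pi_samples, row_marginals

def attacker_distribution(row_marginals, sens_of_row, r_v, S):
    # Lemma 1, eq. (8): for each s, sum the estimated f(R_j = r_v | t*) over
    # the rows j whose sensitive value (known from t*) is s, then normalize.
    w = np.array([sum(row_marginals.get((j, r_v), 0.0)
                      for j, s in enumerate(sens_of_row) if s == sv) for sv in S])
    return w / w.sum()
```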
Let us now discuss (29). In what follows, for the sake of notation we shall write a generic $i$-th group as $g_i = (s_1, r_1), \ldots, (s_n, r_n)$ (thus avoiding double subscripts), and let $g^*_i = (l_i, m_i)$ denote the corresponding obfuscated group in $t^*$. As already observed, when sampling an $i$-th group $g$ from (29), one actually needs to generate only the nonsensitive values of $g$, which are constrained by $l_i$, as the sensitive ones are already fixed by the sequence $m_i$. In what follows, to make sampling from (29) effective, we shall work under the following assumptions, which are stronger than (30).

(a) Deterministic obfuscation function: for each $t$ and $t^*$, $f(t^* \mid t)$ is either 0 or 1.

(b) For each $1 \le i \le k$, letting $g^*_i = (l_i, m_i)$, with $m_i = s_1, \ldots, s_n$, the $i$-th obfuscated group in $t^*$, the following holds true:

Horizontal schemes
$$G_i = \{\, g = (s_1, r_1), \ldots, (s_n, r_n) : r_\ell \in l_i \text{ for } 1 \le \ell \le n \,\} \qquad (33)$$

Vertical schemes
$$G_i = \{\, g = (s_1, r_{i_1}), \ldots, (s_n, r_{i_n}) : r_{i_1}, \ldots, r_{i_n} \text{ a permutation of } l_i \,\}. \qquad (34)$$

Assumption (a) is realistic in practice. In horizontal schemes, assumption (b) makes the considered sets $G_i$ possibly larger than the real ones, that is $l_i \supset \{r_1, \ldots, r_n\}$. This happens, for instance, if in certain groups the ZIP code is constrained to just, say, two values, while the generalized code “5013*” allows for all values in the set $\{50130, \ldots, 50139\}$. We will not attempt here a formal analysis of this assumption. In some cases, such as in schemes based on global recoding, this assumption is realistic. Otherwise, we only note that the support $X$ of the resulting Markov chain may be (slightly) larger than the one that would be obtained without assuming (33) or (34). Heuristically, this leads one to sample from a more dispersed density than the target one. At least, the resulting distributions can be taken to represent a lower bound on what the attacker can actually learn.

Fig. 1. Sampling from $f(g \mid \pi, t_{-i}, t^*)$ ($g \in G_i$) for horizontal schemes, across all the groups.

Under assumptions (a) and (b) above, for each $1 \le i \le k$, it holds that $g \in G_i$ if and only if $f(t^* \mid g, t_{-i}) = 1$. Therefore sampling according to the right-hand side of (29) reduces to the following:

$$\text{draw } g \in G_i \text{ with probability } \propto f(g \mid \pi) \quad (1 \le i \le k). \qquad (35)$$

We now discuss how to implement (35) effectively. This will achieve sampling from the full conditionals (29) without resorting to a presumably inefficient AR method. We deal with the two cases, horizontal and vertical, separately.

a) Horizontal schemes: In order to generate $g = (r_1, s_1), \ldots, (r_n, s_n) \in G_i$, for each $\ell = 1, \ldots, n$, we draw $r_\ell \in l_i$ with probability $\propto f(r_\ell \mid s_\ell, \pi)$. Explicitly, (29) now becomes

$$f(g \mid \pi, t_{-i}, t^*) = \begin{cases} 0, & \text{if } g \notin G_i \\[4pt] \displaystyle\prod_{\ell=1}^{n} \frac{f(r_\ell \mid s_\ell, \pi)}{\sum_{r \in l_i} f(r \mid s_\ell, \pi)}, & \text{if } g \in G_i \end{cases} \qquad (36)$$

thus satisfying (35). Note that this is equivalent to sampling each row independently. The sampling process of $f(g \mid \pi, t_{-i}, t^*)$ for horizontal schemes across all the groups of the table is illustrated graphically in Fig. 1.
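A minimal sketch of the row-by-row sampler (36) for horizontal schemes is reported below; as before, the integer coding of attribute values and the layout of the arguments are illustrative assumptions of this sketch.

```python
import numpy as np

def sample_group_horizontal(m_i, l_i, pi_R_given_S, rng=None):
    """Draw g in G_i for a horizontal scheme, eq. (36).

    m_i          : sequence s_1, ..., s_n of (fixed) sensitive values of the group
    l_i          : list of nonsensitive values admitted by the generalized group, eq. (33)
    pi_R_given_S : array with pi_R_given_S[s, r] = f(r | s, pi)
    """
    rng = np.random.default_rng() if rng is None else rng
    g = []
    for s in m_i:
        w = np.array([pi_R_given_S[s, r] for r in l_i], dtype=float)
        w /= w.sum()                        # normalize over l_i, as in (36)
        r = l_i[rng.choice(len(l_i), p=w)]  # each row is sampled independently
        g.append((s, r))
    return g
```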
b) Vertical schemes: Let $l_i = \{\!|\, r_1, \ldots, r_n \,|\!\}$. We have that $g \in G_i$ if and only if $g = (s_1, r_{i_1}), \ldots, (s_n, r_{i_n})$, for some permutation $(r_{i_\ell})_{1 \le \ell \le n}$ of $r_1, \ldots, r_n$. Here, sampling the nonsensitive values of $g$ row by row would involve gradually reducing the sample space. A sampling procedure along these lines is possible, but nontrivial; see Appendix B. We discuss here a more straightforward sampling procedure, based on generating $g_i \in G_i$ in a single shot. We adopt a single-iteration Metropolis-within-Gibbs scheme. Essentially, this consists in running a Metropolis method that targets the distribution $\propto f(g \mid \pi)$ with support $G_i$, for one iteration. Specifically, let us write the current value of the $i$-th group in the Gibbs Markov chain as $g^h_i$. Following Robert and Casella [40, Ch. 10], this step consists in drawing $g \in G_i$ according to a proposal distribution $J(g \mid g^h_i)$ and accepting it, that is letting $g^{h+1}_i = g$, with probability

$$\rho := \min\left\{1, \; \frac{f(g \mid \pi)\, J(g^h_i \mid g)}{f(g^h_i \mid \pi)\, J(g \mid g^h_i)}\right\} \qquad (37)$$

while keeping $g^{h+1}_i = g^h_i$ with probability $1 - \rho$. The resulting MCMC method is still theoretically sound: see Robert and Casella [40, Ch. 10.3.3]. As to the proposal distribution $J(g \mid g^h_i)$, a possibility is generating $g \in G_i$ via a pure random permutation of the $n$ nonsensitive values in $l_i$; or just swapping the nonsensitive values of two randomly chosen positions in $g^h_i$. In both cases, the proposal is symmetric, and (37) simplifies accordingly as follows, where $r_1, \ldots, r_n$ is the sequence of nonsensitive values in the proposed $g$:

$$\rho = \min\left\{1, \; \frac{\prod_{\ell=1}^{n} f(r_\ell \mid s_\ell, \pi)}{\prod_{\ell=1}^{n} f(r^h_\ell \mid s_\ell, \pi)}\right\}.$$
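The swap-proposal variant of this Metropolis step can be sketched as follows; since only two rows change, the simplified acceptance ratio above reduces to four factors. The data layout is, again, an illustrative assumption of this sketch.

```python
import numpy as np

def metropolis_swap_step(g_curr, pi_R_given_S, rng=None):
    """One Metropolis-within-Gibbs step on a group of a vertical scheme.

    g_curr : current group, a list of (s, r) pairs whose r's are a permutation of l_i.
    Proposes to swap the nonsensitive values of two random positions and
    accepts with the simplified ratio derived from (37).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(g_curr)
    a, b = rng.choice(n, size=2, replace=False)
    (s_a, r_a), (s_b, r_b) = g_curr[a], g_curr[b]
    g_prop = list(g_curr)
    g_prop[a], g_prop[b] = (s_a, r_b), (s_b, r_a)   # swap the nonsensitive values
    # Only the two swapped rows change, so the ratio reduces to four factors.
    num = pi_R_given_S[s_a, r_b] * pi_R_given_S[s_b, r_a]
    den = pi_R_given_S[s_a, r_a] * pi_R_given_S[s_b, r_b]
    rho = min(1.0, num / den)
    return g_prop if rng.random() < rho else g_curr
```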
VI. EXPERIMENTS

We have put a proof-of-concept implementation³ of our methodology at work on a subset of the Adult dataset, extracted by Barry Becker from the 1994 US Census database and available from the UCI Machine Learning repository [47]. This is a common benchmark for experiments on anonymization [38]. In particular, we have focused on the subset of 5692 rows also considered by the authors of [38], with the following categorical attributes: sex, age, race, marital status, education, native country, workclass, salary class, occupation, with occupation (14 values) considered as the only sensitive attribute. We will discuss implementation and results details separately for vertical and horizontal schemes. We will then briefly discuss convergence issues of the employed MCMC method.

³ Python code and data available from the authors.

A. Horizontal Schemes: k-Anonymity

Using the ARX anonymization tool [37] we obtained two different anonymized versions of the considered dataset, enjoying $k$-anonymity and $\ell$-diversity⁴ for $k = \ell = 4$ and $k = \ell = 6$, respectively. The average size of the groups was respectively of 38 rows ($k = \ell = 4$) and of 355 rows ($k = \ell = 6$).

⁴ Recall that $\ell$-diversity requires at least $\ell$ distinct values of the sensitive attribute in each group.

TABLE IV. Summary of threat and faithfulness measures for anonymization according to $k$-anonymity and $\ell$-diversity.

The results we have obtained are summarized in Table IV. For reference, we include the following information in the last two lines: baseline accuracy, the fraction of rows correctly classified using the empirical distribution obtained from the frequencies of the sensitive values in the anonymized table, i.e., the fraction of the most frequent sensitive value; and ideal accuracy, the fraction of tuples threatened under $p_I$. As a further element of comparison, we also consider an attacker whose reasoning is based on the random worlds model, and include in the table $GT_{RW}$, the fraction of rows correctly classified assuming all tables compatible with $t^*$ equally likely. Like in [25], we compute $ABS_A$ and $ABS_{RW}$, the absolute error under the distribution derived under $p_A$ and under the random worlds distribution $p_{RW}$, respectively. ABS is defined as $\sum_{i=1}^{N} \sum_{s \in S} |1_{\{s_i = s\}} - p(s \mid r_i, t^*)|$, where $p(\cdot)$ might be either of $p_A(\cdot)$ or $p_{RW}(\cdot)$. Note that, since the considered anonymized tables do not enjoy disjointness between groups (see Remark 1), also in the random worlds perspective the probability of each sensitive attribute may well be $\ge 1/\ell$. In our experiments, when $\ell = 4$ the attacker outperforms random worlds classification, while when a more powerful obfuscation is adopted the two results are quite similar.

The remaining rows in Table IV consider the privacy threat and faithfulness measures introduced in Section IV. As a general comment, small variations of $\ell$ and/or $k$ do not produce dramatic changes. The faithfulness level is stable, but does not reach a satisfactory level. The attacker is anyway in a position to correctly classify the sensitive attribute of individuals in the table $\approx 2.3$–$2.5\%$ more often than the learner. We found that the maximum value of $T_i^A$ for the threatened rows is about 13.8, meaning the attacker can be up to $\approx 14$ times more confident than the learner about the guessed value.

Fig. 2. Results for $k$-anonymity. Top ($\ell = k = 6$): scatter plots of $p_L$ vs $p_A$ for tuples threatened under $p_A$ (a), and under $p_L$ (c); (b) and (d) are the histograms of $\log_2 T_i^A$ for these two cases. Bottom: same for $\ell = k = 4$. The skewness value ($\gamma$) represents the third standardized moment of the empirical distribution. Dark red areas show where the attacker performs better than the learner.

A more informative summary of our analysis is provided by the scatter plots and histograms of Figure 2. The scatter plots are obtained from the threat levels under $p_L$ and under $p_A$. The number of rows $(s, r)$ in which $p_A(s \mid r, t^*) \ge p_L(s \mid r, t^*)$ roughly equals those in which $p_A(s \mid r, t^*) \le p_L(s \mid r, t^*)$, although globally the attacker has a slight advantage in terms of number of threatened rows. In Figure 2 we also report the empirical distribution of $\log_2 T_i^A$ for tuples threatened under $p_A$ and under $p_L$. We also have evidence of positive skewness, as shown by the value of $\gamma$ (the third standardized moment of the empirical distributions). Recalling that $\log_2 T_i^A = 1$ means $p_A(s \mid r, t^*) = 2\, p_L(s \mid r, t^*)$, the histograms show that $p_A(s \mid r, t^*)$ is often more than twice $p_L(s \mid r, t^*)$, leading to $\log_2 T_i^A \ge 1$. In particular, when $k = \ell = 4$, $\log_2 T_i^A$ is at least 1 for $\approx 6\%$ of the individuals threatened under $p_A$, meaning $\approx 0.6\%$ of the whole table. Conversely, $\log_2 T_i^A$ is close to 0 for most of the rows in which $p_A(s \mid r, t^*) \le p_L(s \mid r, t^*)$.
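As an indication of how part of the summary measures of Table IV can be computed from the estimated row-wise posteriors, a minimal sketch follows; the array layout assumed here is for illustration only, and the snippet covers only the ABS error and a plain classification accuracy, not the threat measures of Section IV.

```python
import numpy as np

def abs_error(p, s_true):
    """ABS = sum_i sum_s |1{s_i = s} - p[i, s]|, as defined above.

    p      : array of shape (N, |S|); p[i, s] is an estimate of p(s | r_i, t*)
    s_true : integer array of shape (N,), the true sensitive value s_i of each row
    """
    N, n_s = p.shape
    indicator = np.zeros((N, n_s))
    indicator[np.arange(N), s_true] = 1.0
    return float(np.abs(indicator - p).sum())

def classification_accuracy(p, s_true):
    """Fraction of rows whose sensitive value is guessed by taking the argmax of p."""
    return float(np.mean(p.argmax(axis=1) == s_true))
```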
B. Vertical Schemes: Anatomy

Using a freely available anonymization tool [22], we have obtained two anatomized versions of the considered dataset, with groups of size $\ell = 4$ and $\ell = 6$, respectively. The resulting tables also enjoy $\ell$-diversity. The results we have obtained are summarized in Table V. Concerning the random worlds approach, we note the following. Anatomy partitions the tables into groups all of size $\ell$. Therefore, although disjointness is not satisfied, just as in the horizontal case, the sensitive attribute frequencies equal $1/\ell$ in each group. This implies that the probability of a sensitive value depends on how many groups contain the victim's nonsensitive attributes and on their frequencies in each group, often leading to multimodal distributions. We assume that a guess may be obtained by randomly choosing among the equally likely sensitive attributes. Accordingly, the fractions of threatened rows, $GT_{RW}$, are averaged over 500 independent samplings. Here, it is apparent that our attacker is able to classify better than in the random worlds scenario. We note that, as $\ell$ increases from 4 to 6, the fraction of rows threatened under the distributions derived by the learner ($GT_L$) and by the attacker ($GT_A$) decreases significantly. Moreover, as $\ell$ grows both the relative threat $RGT_A$ and the faithfulness level $RF$ decrease, which implies a trade-off between privacy and the utility conveyed by the table.

TABLE V. Summary of threat and faithfulness measures for anonymization according to Anatomy.

Fig. 3. Results for Anatomy. Top ($\ell = 6$): scatter plots of $p_L$ vs $p_A$ for tuples threatened under $p_A$ (a), and under $p_L$ (c); (b) and (d) are the histograms of $\log_2 T_i^A$ for these two cases. Bottom: same for $\ell = 4$. The skewness value ($\gamma$) represents the third standardized moment of the empirical distribution. Dark red areas show where the attacker performs better than the learner.

Again, for a more informative summary of our analysis, we look at the scatter plots and histograms displayed in Figure 3, where we compare $p_A$ and $p_L$ on threatened rows. It is apparent here that the attacker is more confident than the learner in the majority of the cases, even when focusing on the rows threatened under $p_L$. This is in contrast with the horizontal case, where the attacker exhibits smaller threat levels on the rows threatened under $p_L$ (Figure 2, (d) and (h)). As far as the histograms are concerned, an even greater skewness than in the horizontal case is evident here. In particular, the attacker can be up to $\approx 287$ times more confident than the learner, the maximum $T_i^A$ being about 286.19. Moreover, when $\ell = 4$, the individuals with $\log_2 T_i^A \ge 1$ are $\approx 26\%$ of the rows threatened under $p_A$ ($\approx 8\%$ of the whole table). This means that there are 483 individuals in the dataset for which the threat level under $p_A$ is at least twice the threat level under $p_L$.

C. Discussion

Comparing the horizontal and the vertical cases for the considered dataset, the following considerations are in order.

• In the horizontal case, we have a situation of low faithfulness and low privacy threat, irrespective of the values of $k$ and $\ell$. Indeed, in both cases the average group size is well above $k$, and this has a negative effect on the inference capabilities of both the learner and the attacker. The slight numerical differences observed between the cases $k = \ell = 4$ and $k = \ell = 6$ are basically an artifact of the anonymization tool. Yet, in relative terms, one can observe a significant increase in the number of tuples threatened by the attacker, over the learner.

• In the vertical case, one obtains a greater faithfulness at the price of a greater privacy threat. This difference from the horizontal case is partly explained by the smaller group size, which now coincides with $\ell$.
Moving now from $\ell = 4$ to $\ell = 6$ has a tangible negative impact on the inference capabilities of both the learner and the attacker. In relative terms, one can observe an even more marked increase in the number of tuples threatened by the attacker, over the learner.

The above considerations partly depend on both the original dataset and the details of the employed anonymization tool.

D. Assessing MCMC Convergence

For each of the considered anonymized datasets, we ran an MCMC chain as introduced in Section V for $M = 100{,}000$ iterations. The convergence of each chain to the stationary distribution was assessed via a methodology based on comparing sub-sequences of the sample sequences with one another. More precisely, as for the population parameters distribution (32), we used the method proposed by Geweke [21]. The Geweke proposal is based on an adapted two-sample test on the means in sub-sequences of the chain. After a burn-in of 50,000 iterations, we compared the last 25,000 samples against 5 blocks of 5,000 consecutive samples each, taken starting from the 50,000-th iteration. We found that all the distributions $\pi_{R|S}$ produced a test statistic within two standard deviations from zero, thus providing evidence of convergence.

As for the distribution of the cleartext table, $f(t \mid \pi, t^*)$, we used a test specifically designed for categorical distributions by Deonovic and Smith, the Weiß procedure [15]. The approach is based on a $\chi^2$ test adjusted for the autocorrelation induced by the chain. The test is based on partitioning the whole sample sequence into sub-sequences, and then testing the homogeneity between the empirical distribution of each sub-sequence and the empirical distribution of the whole chain. After a burn-in of 50,000 observations, we compared 5 sub-sequences of 10,000 consecutive samples each. For the vertical scheme, we assessed the convergence for each row of the table, thereby supporting stationarity of $f(t \mid \pi, t^*)$. For the horizontal scheme, some of the rows did not exhibit evidence of convergence. However, we found that, starting with several independent chains, very similar results in terms of the proposed assessment measures were obtained.

In the vertical case, within the Metropolis step, both the pure random permutation and the swap group generation strategies (Section V-B) were tried. The obtained results are consistent; however, the pure random permutation strategy shows a much higher rejection rate, suggesting that the swap strategy should be preferred.
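As an illustration of the kind of check we performed, a simplified Geweke-style diagnostic for a scalar component of the chain can be written as follows; unlike the original proposal [21], this sketch uses plain sample variances instead of spectral density estimates, so it should be read as a rough indication only.

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke-style z-score for one scalar component of the chain.

    Compares the mean of the first `first` fraction of the (post burn-in) samples
    with the mean of the last `last` fraction. Plain sample variances are used
    here instead of the spectral estimates of the original proposal [21].
    """
    x = np.asarray(chain, dtype=float)
    a = x[: int(first * len(x))]
    b = x[-int(last * len(x)):]
    z = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return z   # |z| < 2 is taken as (weak) evidence of convergence
```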
VII. CONCLUSION

We have put forward a notion of relative privacy threat that applies to group-based anonymization schemes. Our proposal is based on a rigorous characterization of the learner's and of the attacker's inference, in a unified Bayesian model of group-based schemes. A related MCMC algorithm for posterior parameter estimation has also been introduced. Experiments conducted on the well-known Adult dataset [47] have been illustrated.

Our analysis emphasizes the risks posed by the mere fact that an attacker can look up a released anonymized table. This prompts an obvious alternative: release the parameters of the posterior distribution learned from the cleartext table ($p_I$, in our notation). This may not always be possible, or be a good idea, for several reasons. First, certain organizations must release datasets as part of their mission, e.g. census bureaus. Second, especially in the case of high-dimensional data, the computation of the posterior is feasible only assuming suitable conditional independencies, whereby potentially important correlations are lost; see [10] and references therein. Third, parameter release itself is not exempt from privacy risks. In particular, although differentially private release of the parameters is possible [16], it seems that quite strong priors are necessary to obtain acceptable guarantees; see [50, Ch. 6] and references therein. In conclusion, further research is called for in order to understand under what circumstances data and/or parameter release can be done safely.

APPENDIX A
PROOF OF LEMMA 1

We first characterize the probability $f(V = j \mid R_V = r_v, t^*)$, for an arbitrary $j \in \{1, \ldots, N\}$. Bayes' theorem yields

$$f(V = j \mid R_V = r_v, t^*) \propto f(R_V = r_v \mid V = j, t^*)\, f(V = j \mid t^*)$$
$$= f(R_j = r_v \mid V = j, t^*)\, f(V = j \mid t^*)$$
$$\propto f(R_j = r_v \mid V = j, t^*) \qquad (38)$$
$$= f(R_j = r_v \mid t^*) \qquad (39)$$

where (38) follows from $f(V = j \mid t^*) = f(V = j) = 1/N$ (independence of $V$), and (39) follows because, as easily checked, for any fixed $j$, independence of $R_j$ and $V$ is preserved by conditioning on $t^*$. Now we have, for every $s \in S$,

$$p_A(s \mid r_v, t^*) \qquad (40)$$
$$= f(S_V = s \mid R_V = r_v, t^*)$$
$$= \textstyle\sum_j f(S_V = s, V = j \mid R_V = r_v, t^*)$$
$$= \textstyle\sum_j f(S_V = s \mid V = j, R_V = r_v, t^*)\, f(V = j \mid R_V = r_v, t^*)$$
$$= \textstyle\sum_j f(S_j = s \mid V = j, R_j = r_v, t^*)\, f(V = j \mid R_V = r_v, t^*)$$
$$= \textstyle\sum_{j : s_j = s} f(S_j = s \mid V = j, R_j = r_v, t^*)\, f(V = j \mid R_V = r_v, t^*) \qquad (41)$$
$$= \textstyle\sum_{j : s_j = s} f(V = j \mid R_V = r_v, t^*) \qquad (42)$$
$$\propto \textstyle\sum_{j : s_j = s} f(R_j = r_v \mid t^*) \qquad (43)$$

where (41) and (42) follow from the fact that, for $s_j \neq s$, $f(S_j = s, t^*) = 0$, while for $s_j = s$ obviously $f(S_j = s \mid V = j, R_j = r_v, t^*) = 1$. Finally, (43) follows from (39). Note that each term on the RHS of (43) actually is the joint probability $f(R_j = r_v, S_j = s \mid t^*)$, since $s_j = s$ is embedded in the range of the summation.

APPENDIX B
AN ALTERNATIVE GROUP SAMPLING METHOD FOR VERTICAL SCHEMES

We consider the following method for sampling $g \in G_i$ (a code sketch follows the steps below). Draw $n$ values $r_{i_\ell}$, $\ell = 1, \ldots, n$, as follows:

1. draw $r_{i_1}$ from $l_i$ according to a distribution $\propto f(r \mid s_1, \pi)$;
2. draw $r_{i_2}$ from $l_i \setminus \{\!|\, r_{i_1} \,|\!\}$ according to a distribution $\propto f(r \mid s_2, \pi)$;
…
n. draw $r_{i_n}$ from $l_i \setminus \{\!|\, r_{i_1}, \ldots, r_{i_{n-1}} \,|\!\}$ according to a distribution $\propto f(r \mid s_n, \pi)$.
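Under the same illustrative data layout used in Section V-B, the sequential procedure above can be sketched as follows; note that, as discussed next, the probability it assigns to a group depends on the ordering of the sensitive values.

```python
import numpy as np

def sample_group_sequential(m_i, l_i, pi_R_given_S, rng=None):
    """Sequential without-replacement sampling of a vertical-scheme group (Appendix B).

    m_i : sequence s_1, ..., s_n of sensitive values of the group
    l_i : multiset of nonsensitive values, given as a list (repetitions allowed)
    """
    rng = np.random.default_rng() if rng is None else rng
    remaining = list(l_i)            # l_i minus the values drawn so far
    g = []
    for s in m_i:
        w = np.array([pi_R_given_S[s, r] for r in remaining], dtype=float)
        w /= w.sum()                 # distribution proportional to f(r | s, pi) on the remaining values
        j = rng.choice(len(remaining), p=w)
        g.append((s, remaining.pop(j)))
    return g
```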
For a multiset $l'$, let $\sigma(l' \mid s_\ell, \pi) := \sum_{r \in l'} f(r \mid s_\ell, \pi)$ denote the probability of extracting some element appearing in $l'$ (disregarding multiplicities) according to $f(\cdot \mid s_\ell, \pi)$. Using this notation, the probability of returning exactly the sequence $r_{i_1}, \ldots, r_{i_n}$, hence $g = (s_1, r_{i_1}), \ldots, (s_n, r_{i_n}) \in G_i$, as a result of the above $n$ drawings, can be written as

$$\theta(g \mid \pi, t^*) := \frac{f(r_{i_1} \mid s_1, \pi)}{\sigma(l_i \mid s_1, \pi)} \cdot \frac{f(r_{i_2} \mid s_2, \pi)}{\sigma(l_i \setminus \{\!|\, r_{i_1} \,|\!\} \mid s_2, \pi)} \cdots \frac{f(r_{i_n} \mid s_n, \pi)}{\sigma(l_i \setminus \{\!|\, r_{i_1}, \ldots, r_{i_{n-1}} \,|\!\} \mid s_n, \pi)} = \frac{\prod_{\ell=1}^{n} f(r_{i_\ell} \mid s_\ell, \pi)}{\nu(g \mid \pi)}$$

where we denote by $\nu(g \mid \pi)$ the denominator of the expression on the right-hand side of $:=$ above. The sampling process of $\theta(g \mid \pi, t^*)$ for vertical schemes across all the groups of the table is illustrated in Fig. 4. We note that $\theta(g \mid \pi, t^*)$ depends on the chosen ordering of the sensitive values $s_1, \ldots, s_n$, which may invalidate condition (35). A possible solution could be to sweep the order of sampling according to the Random Sweep Gibbs sampler scheme originally proposed by [20] and further developed by [29].

Fig. 4. Sampling from $\theta(g \mid \pi, t^*)$ for vertical schemes.

REFERENCES

[1] D. J. Balding and P. Donnelly, “Inference in forensic identification,” J. Roy. Stat. Soc. A, Statist. Soc., vol. 158, no. 1, pp. 21–53, 1995.
[2] M. Bewong, J. Liu, L. Liu, J. Li, and K. K. R. Choo, “A relative privacy model for effective privacy preservation in transactional data,” in Proc. IEEE Trustcom/BigDataSE/ICESS, Aug. 2017, pp. 394–401.
[3] B. Bichsel, T. Gehr, D. Drachsler-Cohen, P. Tsankov, and M. Vechev, “DP-finder: Finding differential privacy violations by sampling and optimization,” in Proc. ACM CCS, 2018, pp. 508–524.
[4] M. Boreale and M. Paolini, “Worst- and average-case privacy breaches in randomization mechanisms,” Theoretical Computer Science, vol. 597, pp. 40–61, 2015.
[5] M. Boreale and F. Corradi, “Relative privacy risks and learning from anonymized data,” in Proc. SIS, A. Petrucci, F. Racioppi, and R. Verde, Eds. Firenze Univ. Press, 2017, pp. 199–204.
[6] D. Cavallini and F. Corradi, “Forensic identification of relatives of individuals included in a database of DNA profiles,” Biometrika, vol. 93, pp. 525–536, Sep. 2006.
[7] A.-S. Charest, “How can we analyze differentially-private synthetic datasets?” J. Privacy Confidentiality, vol. 2, no. 2, 2011.
[8] A.-S. Charest, “Empirical evaluation of statistical inference from differentially-private contingency tables,” in Proc. Int. Conf. Privacy Stat. Databases (PSD). Berlin, Germany: Springer-Verlag, 2012, pp. 257–272.
[9] R. C.-W. Wong, A. W. C. Fu, K. Wang, P. S. Yu, and J. Pei, “Can the utility of anonymized data be used for privacy breaches?” ACM Trans. Knowl. Discovery Data, vol. 5, no. 3, pp. 16-1–16-24, 2011.
[10] C. Clifton and T. Tassa, “On syntactic anonymity and differential privacy,” Trans. Data Privacy, vol. 6, pp. 161–183, 2013.
[11] F. Corradi, V. Pinchi, S. Garatti, and I. Barsanti, “Probabilistic classification of age by third molar development: The use of soft evidence,” J. Forensic Sci., vol. 58, no. 1, pp. 51–59, 2013.
[12] F. K. Dankar and K. El Emam, “The application of differential privacy to health data,” in Proc. Joint EDBT/ICDT Workshops (EDBT-ICDT), 2012, pp. 158–166.
[13] F. K. Dankar and K. El Emam, “Practicing differential privacy in health care: A review,” Trans. Data Privacy, vol. 6, no. 1, pp. 35–67, 2013.
[14] A. P. Dawid, “The island problem: Coherent use of identification evidence,” in Aspects of Uncertainty: A Tribute to D. V. Lindley, P. R. Freeman and A. F. M. Smith, Eds. Hoboken, NJ, USA: Wiley, 1994, pp. 159–170.
[15] B. E. Deonovic and B. J. Smith, “Convergence diagnostics for MCMC draws of a categorical variable,” 2017, arXiv:1706.04919. [Online]. Available: https://arxiv.org/abs/1706.04919
[16] C. Dimitrakakis, B. Nelson, A. Mitrokotsa, and B. I. P. Rubinstein, “Robust and private Bayesian inference,” in Proc. 25th Int. Conf. Algorithmic Learn. Theory (ALT), Bled, Slovenia, Oct. 2014, pp. 291–305.
[17] Z. Ding, Y. Wang, G. Wang, D. Zhang, and D. Kifer, “Detecting violations of differential privacy,” in Proc. ACM CCS, 2018, pp. 475–489.
[18] C. Dwork, “Differential privacy,” in Proc. 33rd Int. Conf. Automata, Lang. Program. Berlin, Germany: Springer-Verlag, 2006, pp. 1–12.
[19] S. L. Garfinkel, J. M. Abowd, and S. Powazek, “Issues encountered deploying differential privacy,” in Proc. ACM Workshop Privacy Electron. Soc., 2018, pp. 133–137.
[20] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-6, no. 6, pp. 721–741, Nov. 1984.
[21] J. Geweke, “Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments,” in Bayesian Statistics. Oxford, U.K.: Oxford Univ. Press, 1992, pp. 169–193.
[22] Q. Gong. (2014). Anatomize. GitHub. [Online]. Available: https://github.com/qiyuangong/Anatomize
[23] A. Inan, M. Kantarcioglu, and E. Bertino, “Using anonymized data for classification,” in Proc. IEEE Int. Conf. Data Eng. (ICDE), Washington, DC, USA, Mar./Apr. 2009, pp. 429–440.
[24] A. Kassem, G. Acs, C. Castelluccia, and C. Palamidessi, “Differential inference testing: A practical approach to evaluate anonymized data,” INRIA, Res. Rep., 2018, pp. 1–21.
[25] D. Kifer, “Attacks on privacy and deFinetti’s theorem,” in Proc. SIGMOD Conf., 2006, pp. 127–138.
[26] D. Kifer and A. Machanavajjhala, “No free lunch in data privacy,” in Proc. SIGMOD, 2011, pp. 193–204.
[27] N. Li, T. Li, and S. Venkatasubramanian, “t-closeness: Privacy beyond k-anonymity and ℓ-diversity,” in Proc. ICDE, 2007, pp. 106–115. doi: 10.1109/ICDE.2007.367856.
[28] N. Li, W. Qardaji, and D. Su, “On sampling, anonymization, and differential privacy: Or, k-anonymization meets differential privacy,” in Proc. 7th ACM Symp. Inf., Comput. Commun. Secur. (ASIACCS), 2012, pp. 32–33.
[29] J. S. Liu, “Markov chain Monte Carlo and related topics,” Dept. Statist., Stanford Univ., Stanford, CA, USA, Tech. Rep., 1995.
[30] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “ℓ-diversity: Privacy beyond k-anonymity,” in Proc. ICDE, 2006, p. 24.
[31] A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, and L. Vilhuber, “Privacy: Theory meets practice on the map,” in Proc. IEEE 24th Int. Conf. Data Eng., Apr. 2008, pp. 277–286. doi: 10.1109/ICDE.2008.4497436.
[32] M. D. Mailman et al., “The NCBI dbGaP database of genotypes and phenotypes,” Nature Genet., vol. 39, no. 10, pp. 1181–1186, 2007. doi: 10.1038/ng1007-1181.
[33] K. Mancuhan and C. Clifton, “Statistical learning theory approach for data classification with ℓ-diversity,” 2016, arXiv:1610.05815. [Online]. Available: https://arxiv.org/abs/1610.05815
[34] A. Narayanan and V. Shmatikov, “Robust de-anonymization of large datasets (how to break anonymity of the Netflix prize dataset),” Univ. Texas Austin, Austin, TX, USA, Tech. Rep., 2008.
[35] M. Nasr, R. Shokri, and A. Houmansadr, “Machine learning with membership privacy using adversarial regularization,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Toronto, ON, Canada, 2018, pp. 634–646. doi: 10.1145/3243734.3243855.
[36] W. Ollier, T. Sprosen, and T. Peakman, “UK Biobank: From concept to reality,” Future Med., vol. 6, no. 6, pp. 639–646, 2005.
[37] F. Prasser and F. Kohlmayer, “Putting statistical disclosure control into practice: The ARX data anonymization tool,” in Medical Data Privacy Handbook, A. Gkoulalas-Divanis and G. Loukides, Eds. Cham, Switzerland: Springer, Nov. 2015.
[38] F. Prasser, F. Kohlmayer, and K. A. Kuhn, “A benchmark of globally-optimal anonymization methods for biomedical data,” in Proc. 27th IEEE Int. Symp. Comput.-Based Med. Syst., New York, NY, USA, May 2014, pp. 66–71.
[39] A. Pyrgelis, C. Troncoso, and E. De Cristofaro, “Knock knock, who’s there? Membership inference on aggregate location data,” in Proc. Netw. Distrib. Syst. Secur. Symp. (NDSS), San Diego, CA, USA, Feb. 2018, pp. 18–21. doi: 10.14722/ndss.2018.23183.
[40] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, 2nd ed. Springer, 2004.
[41] G. O. Roberts and A. F. M. Smith, “Simple conditions for the convergence of the Gibbs sampler and Metropolis–Hastings algorithms,” Stochastic Processes Appl., vol. 49, pp. 207–216, Feb. 1994.
[42] T. E. Raghunathan, J. P. Reiter, and D. B. Rubin, “Multiple imputation for statistical disclosure limitation,” J. Off. Statist., vol. 19, no. 1, pp. 1–16, 2003.
[43] D. B. Rubin, “Statistical disclosure limitation,” J. Off. Statist., vol. 9, no. 2, pp. 461–468, 1993.
[44] R. Sarathy and K. Muralidhar, “Evaluating Laplace noise addition to satisfy differential privacy for numeric data,” Trans. Data Privacy, vol. 4, no. 1, pp. 1–17, 2011.
[45] K. Slooten and R. Meester, “Forensic identification: The island problem and its generalisations,” 2017, arXiv:1201.4647. [Online]. Available: https://arxiv.org/abs/1201.4647
[46] L. Sweeney, “k-anonymity: A model for protecting privacy,” Int. J. Uncertainty, Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 557–570, 2002.
[47] UCI Machine Learning Repository. (1996). Adult Dataset. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Adult
[48] U.S. Office for Civil Rights. (Nov. 26, 2012). Guidance Regarding Methods for De-Identification of Protected Health Information in Accordance With the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. [Online]. Available: https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/understanding/coveredentities/De-identification/hhs_deid_guidance.pdf
[49] X. Xiao and Y. Tao, “Anatomy: Simple and effective privacy preservation,” in Proc. VLDB, 2006, pp. 139–150.
[50] S. Zheng, “The differential privacy of Bayesian inference,” B.S. thesis, Harvard College, Cambridge, MA, USA, 2015. [Online]. Available: https://dash.harvard.edu/handle/1/14398533