Title: Expected 10-anonymity of HyperLogLog sketches for federated queries of clinical data repositories
Authors: Tao, Ziye; Weber, Griffin M; Yu, Yun William
Date: 2021-07-12
Journal: Bioinformatics
DOI: 10.1093/bioinformatics/btab292

MOTIVATION: The rapid growth of electronic medical records offers immense potential to researchers, but the records are often siloed at separate hospitals. As a result, federated networks have arisen, which allow simultaneously querying medical databases at a group of connected institutions. The most basic such query is the aggregate count, e.g. how many patients have diabetes? However, depending on the protocol used to estimate that total, there is always a tradeoff between the accuracy of the estimate and the risk of leaking confidential data. Prior work has shown that it is possible to empirically control that tradeoff by using the HyperLogLog (HLL) probabilistic sketch.

RESULTS: In this article, we prove complementary theoretical bounds on the k-anonymity privacy risk of using HLL sketches, as well as exhibit code to efficiently compute those bounds.

AVAILABILITY AND IMPLEMENTATION: https://github.com/tzyRachel/K-anonymity-Expectation

Clinical data containing patients' personal medical records are an important resource for biomedical research. Fully centralizing those data may permit the widest array of potential analyses, but this is often not feasible due to privacy and confidentiality requirements (Benitez and Malin, 2010; Emam et al., 2009; Heatherly et al., 2013). During times of pressing need, such as a global pandemic, these privacy requirements may be justifiably relaxed (Haendel et al., 2020), for example by using trusted third-party vendors such as Datavant (Kho and Goel, 2019), but even then it is important to keep in mind the various privacy-utility tradeoffs (Bengio et al., 2020, 2021).
A more privacy-friendly alternative is to use a federated network, which gives hospitals control over their local databases; a distributed query tool then enables researchers to send queries to the network, such as 'how many patients across the network have diabetes' (Brat et al., 2020; Weber, 2015). A number of these hospital networks have emerged, including the Shared Health Research Information Network for Harvard-affiliated hospitals (Weber et al., 2009), the Federated Aggregate Cohort Estimator developed through a collaboration of five universities and institutions (Wyatt et al., 2014), the open-source PopMedNet (Davies et al., 2016) and PCORnet, a network of networks launched by the Patient-Centered Outcomes Research Institute (Fleurence et al., 2014). However, patients often receive medical care from multiple hospitals, so medical records at different hospitals may be duplicated or incomplete. Depending on the aggregation method used to combine results from the network, this can produce errors. For example, consider a simple summation of aggregate counts: if a patient with hypertension receives medical care from both Hospital A and Hospital B, the sum may double count that patient, overestimating the number of patients with hypertension (Weber, 2013). This problem can be mostly alleviated by sending a hashed identifier of the patients matching each hospital's query to a trusted third party, but that again raises privacy concerns (Oechslin, 2003). There is a natural tradeoff between the privacy guaranteed to individual patients and the accuracy of the aggregate query, and hashed identifiers and simple summation fall at opposite ends of that spectrum.
Several of the authors of this article recently proposed using the HyperLogLog (HLL) 'probabilistic sketch' (Durand and Flajolet, 2003; Flajolet and Martin, 1985; Flajolet et al., 2007) to access intermediate tradeoffs of privacy versus accuracy (Yu and Weber, 2020). Probabilistic counting was introduced to the computer science literature decades ago, and has found use in analyzing large streaming data in a variety of settings, ranging from internet routers (Cai et al., 2005) to text corpora comparisons (Broder, 1997) to genomic sequences (Baker and Langmead, 2019; Ondov et al., 2016; Solomon and Kingsford, 2018). Instead of sharing a single aggregate count, or sharing the full list of matching patient IDs (Weber, 2013), each hospital shares a smaller 'summary sketch' built by taking the logarithm of a coordinated random sample of m matching patient hashed IDs (Yu and Weber, 2020). Because only m patient IDs are used, and those are obfuscated through taking a logarithm, these HLL sketches are significantly more private than sending a full list of matching IDs. Due to the way the estimators work, HLL sketches have an error of about $1.04/\sqrt{m}$, which can be much less than expected from simple summation. But whenever any data are shared by a hospital with a third party, there is a risk of accidental leakage. For example, consider the case where a hospital finds that there is only one patient satisfying the criterion for a query. If this hospital returns the aggregate count of one, then this unique patient's personal information is linked and can potentially be re-identified through a linkage attack (Emam and Dankar, 2008; Yu and Weber, 2020). Advances in homomorphic encryption and secure multi-party computation (Lindell, 2005) may eventually solve this problem by never giving the third party any unencrypted data, but these are currently still impractical for deployment due to both computational and communication complexity.
To properly compare the privacy of various methods of data aggregation, we turn to the concept of k-anonymity. The basic idea behind k-anonymity is that if a method or dataset is k-anonymous, then each patient is similar to at least $k - 1$ other patients with respect to potentially identifying variables, so that it is hard to determine the identity of any single patient in the dataset (Emam and Dankar, 2008; Sweeney, 2002). Although other mathematical formalisms like differential privacy (Dwork, 2008) are much stronger, they are harder to work with, as they require injecting deliberate noise, and they are not currently in wide use by clinical databases. Furthermore, it is provably impossible for composable cardinality estimators (such as HLL) to be differentially private, because the ability to deduplicate runs counter to the base assumptions of differential privacy (Desfontaines et al., 2019). In this article, we assume that hospitals in a federated network implement the HLL algorithm for queries. We then prove bounds on the expected k-anonymity of the shared sketches, and provide fast algorithms for computing that expected k-anonymity. This study is an extension of previous work (Yu and Weber, 2020), which operated under the same setting and assumptions, but only provided empirical results and no proofs on the levels of privacy achieved. Here, we provide rigorous theoretical justification for those empirical claims. In this article, we adopt the HLL sketch federated clinical network setting given in prior work (Yu and Weber, 2020). For completeness, we duplicate the salient points below. Assume that every patient has a single invariant ID that is used across hospitals. Prototypically, one might consider using social security numbers in the USA for that purpose.
Even without a single unique identifier, it is possible to generate an ID based on a combination of other possibly non-unique attributes, such as first and last name, zip code, address, birthdate, etc. Unfortunately, there may be errors in these records due to character recognition errors (e.g. S and 8), phonetic errors (e.g. ph and f) and typographic errors including insertions, transpositions and substitutions. Luckily, there is a substantial existing literature on this problem, and methods such as BIN-DET and BIN-PROB (Durham et al., 2010) have been proposed to deal with the issue. Thus, in this article, we treat this problem as out of scope and assume for simplicity that every patient has a unique stable ID known to all institutions. Further assume that there is a federated network of hospitals (or other institutions) responding to clinical queries, along with a central party that manages and relays messages. When hospitals receive a query, they generate a list of the IDs of patients who match the query. Each hospital uses a publicly known hash function to first pseudorandomly partition the patients into m buckets and then assign a uniform pseudorandom number between 0 and 1 to each patient. We also assume that the hash function is known to the attacker, because the attacker may have compromised one of the other hospitals or the central party. The hospital then stores the order of magnitude of the smallest number within each bucket, and sends these m smallest bucket values to the central party. By applying the HLL estimator, the central party is then able to compute the aggregate count for the query with a relative error of around $1.04/\sqrt{m}$ (Flajolet et al., 2007). Here, we focus on an individual hospital and want to determine the expected exposure to accidentally disclosing private information if the central party is compromised.
As the HLL sketch aggregates information within each of the m buckets, our goal is to compute the expected number of buckets which are not k-anonymized. In line with common practice, we set $k = 10$ for most of our results, though the algorithms and proofs hold for other k. Below, we provide two approximation formulas for the expected value and, in Section 4, construct a table for the user to determine which approximation should be chosen based on the number of distinct patients and other relevant parameters. The HLL (Flajolet et al., 2007) probabilistic sketching algorithm is widely used to estimate the cardinality (number of distinct elements) of a set. Assume we have a database of electronic medical records; we can estimate the number of distinct patients by applying the HLL algorithm. The basic idea behind HLL is that the minimum value of a collection of random numbers between 0 and 1 is inversely proportional to the size of the collection. Therefore, we can estimate the cardinality of a set by first applying a hash function which maps all the elements uniformly onto $(0, 1]$ and then considering the minimum value. For the purposes of this article, we operate in the random oracle model, where we assume that the hash function actually maps to a random number; in practice, a standard hash function like SHA-256 would probably be employed. In order to increase the accuracy of the estimate, we randomly divide the set into m partitions and then estimate the cardinality of the original set via the harmonic mean over the m partitions. Furthermore, the HLL algorithm only needs to store the position of the first 1 bit of the 64-bit hash value, rather than the full patient ID hash, providing partial privacy protection. As the expected error in the final estimate is around $1.04/\sqrt{m}$, increasing m reduces the error of an HLL query but increases the risk of privacy leaks.
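To make the bucket-and-minimum construction concrete, the following is a minimal Python sketch of the register-building and harmonic-mean estimation steps described above. This is our own illustrative code, not the authors' released implementation; the function names are ours, and the raw estimator below omits HLL's small- and large-range corrections.

```python
import hashlib
import math

def hll_registers(ids, m):
    """Build HLL registers: each bucket keeps the largest number of
    leading zero bits seen, i.e. max over x of floor(-log2(u(x)))."""
    registers = [0] * m  # 0 doubles as "empty bucket" in this toy version
    for pid in ids:
        h = int.from_bytes(hashlib.sha256(str(pid).encode()).digest(), "big")
        bucket = h % m                        # pseudorandom partition p(x)
        u = ((h >> 64) % 2**53 + 1) / 2**53   # uniform hash value in (0, 1]
        rho = math.floor(-math.log2(u))       # zeros before the first 1 bit
        registers[bucket] = max(registers[bucket], rho)
    return registers

def hll_estimate(registers):
    """Raw harmonic-mean cardinality estimate from the registers."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)  # standard HLL bias constant
    return alpha * m * m / sum(2.0 ** -(r + 1) for r in registers)
```

With 10 000 distinct IDs and m = 64 buckets, the estimate lands within the expected $1.04/\sqrt{64} \approx 13\%$ relative-error band with high probability, while the hospital only ever shares the 64 small register values.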
In our setting, when a hospital is sent a query, there are two relevant sets to consider: (i) the background population (often, the set of all patients at the hospital) and (ii) the set of patients matching the query. The reason for considering the background population is that it can 'hide' patients who match the query by providing plausible deniability. The hospital will return an HLL sketch, which contains m values: the maximum position of the first 1 bit within each bucket. We define an HLL bucket with value x to be 'k-anonymous' if at least $k - 1$ patients in the background population have hash value x; we call these corresponding hash values in the background population hash collisions (Yu and Weber, 2020). This means that if an attacker gets access to the sketch and can narrow down the set of potential patients to the background population, they cannot determine with certainty which of the k patients with that hash value was in the set of patients matching the query. Our goal is to determine the expected number of buckets that are not at least 10-anonymous (Fig. 1).

Fig. 1. Illustration of HyperLogLog k-anonymity. A hospital has an identified set B contained within the background population A. Binary hashes are taken of all patient identifiers. Those hashes are first used to partition the patients into four buckets. Within each bucket of B, the smallest value is chosen as the representative. Then the k-anonymity of that bucket is the number of hashes in the corresponding bucket of the background population that share the same position of the leading 1 bit.

We wish to note that in this article, we deliberately use the much weaker notion of privacy provided by k-anonymity (Emam and Dankar, 2008), rather than stronger alternatives like differential privacy (Dwork, 2008), which have provable protection against inference attacks.
Unfortunately, differential privacy (and similar alternatives) is provably incompatible with any composable cardinality estimator (Desfontaines et al., 2019). In practice, hospital IRBs accept 10-anonymity of query set patients as a useful metric, despite the known vulnerability of k-anonymity to inference attacks. Our article thus focuses on analyzing probabilistic sketches as a more private alternative to the standard practice of sending full hashed IDs. Let us recast the textual description above more rigorously as the following mathematical problem. Let $A$ be a set and $B \subseteq A$ a non-empty subset of $A$; $A$ represents the background population and $B$ the patients satisfying the query. We define $r = |B|/|A|$ as the ratio of the number of patients satisfying the query to the background population (also sometimes known as concept prevalence). Let $\sigma : A \to (0, 1]$ be a one-way hash function. In theory, we assume that $\sigma$ is a shared random oracle available to both parties; in practice, a cryptographic hash function such as SHA-1, SHA-224 or SHA-256 (Johnson, 2020) is generally used. $\sigma$ uniformly maps each element of $A$ to a random real number in the interval $(0, 1]$. Let $\sigma^- : A \to \mathbb{Z}_{\geq 0}$ be defined by $\sigma^-(x) = \lfloor -\log_2 \sigma(x) \rfloor$; this function returns the number of 0 bits before the first 1 bit of $\sigma(x) \in (0, 1]$ under a base-2 expansion. Let $p : A \to \{1, \ldots, m\}$ be a map that randomly partitions patients into m buckets; in practice, this map can also be derived from a cryptographic hash function. From the partition function p, we define $A_i = \{x \in A \mid p(x) = i\}$ and $B_i = B \cap A_i$, which respectively represent the ith bucket of the whole database and of the query sample. Let $h_i(B) = \max\{\sigma^-(x) \mid x \in B_i\}$ be the maximum number of zeros before the first one among all hash values, represented in base 2, in the ith bucket $B_i$ of B. Let $e_i = \{x \in A_i \mid \sigma^-(x) = h_i(B)\}$ be the set of elements in the ith bucket of A which collide with the elements of $B_i$.
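These definitions translate directly into a small simulation. The sketch below is our own illustrative code (following the article's later convention of replacing real hashes with uniform random numbers, which have the same distribution): it draws a background population A, takes B as a subset, and counts buckets whose reported value $h_i(B)$ is matched by at most $k-1$ hashes in $A_i$.

```python
import math
import random

def sigma_minus(u):
    """Number of zero bits before the first 1 bit of u in (0, 1]."""
    return math.floor(-math.log2(u))

def count_non_k_anonymous(n_A, n_B, m, k=10, seed=0):
    rng = random.Random(seed)
    # Each patient gets a bucket p(x) and a value sigma_minus(x);
    # 1 - random() lies in (0, 1], so the log is always defined.
    A = [(rng.randrange(m), sigma_minus(1.0 - rng.random())) for _ in range(n_A)]
    B = A[:n_B]  # B is a subset of A: the patients matching the query
    non_anon = 0
    for i in range(m):
        b_vals = [v for (bkt, v) in B if bkt == i]
        if not b_vals:
            continue  # empty bucket: no patient value is revealed
        h_i = max(b_vals)                                  # reported value h_i(B)
        e_i = sum(1 for (bkt, v) in A if bkt == i and v == h_i)  # |e_i|
        if 0 < e_i <= k - 1:
            non_anon += 1
    return non_anon
```

This brute-force count is exactly what the remainder of the article approximates analytically, since running it requires access to the full population rather than just the sketch.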
We want to compute $E(|\{e_i : i = 1, \ldots, m \text{ and } 0 < |e_i| \leq k - 1\}|)$, the expected number of non-k-anonymous buckets. As described above, we need to consider the collisions across all m buckets. Here, however, we first give a simple analysis with no partition function (i.e. the case $m = 1$) and compute the probability of each possible number of collisions, so that in later sections we can use this result to compute the desired expected number of non-k-anonymous buckets. Since there is only one bucket, there are only two sets, A and B, which represent the set of all patients and the set of patients matching the query, respectively. We denote by $h(B) = \max\{\sigma^-(x) \mid x \in B\}$ the maximum number of zeros before the first one among all hash values in base 2 in B, and by $e = \{x \in A \mid \sigma^-(x) = h(B)\}$ the set of collisions. We want to compute the probability that the number of collisions is at most k, which is $P(|e| \leq k \mid A, B)$. Each element of $\sigma(A)$ can be thought of as an i.i.d. random variable with distribution Unif(0, 1). Therefore, $\sigma^-(x) = n$ if and only if $2^{-(n+1)} \leq \sigma(x) \leq 2^{-n}$, and hence $P(\sigma^-(x) = n) = 2^{-(n+1)}$. Thus,

Lemma 2.1. Given sets $B \subseteq A$, the probability of exactly $n \in \mathbb{Z}^+$ collisions is

$$P(|e| = n \mid A, B) = \sum_{i=0}^{\infty} \sum_{k=1}^{\min(|B|, n)} f(i, k, |A|, |B|) \, g(i, n - k, |A|, |B|),$$

where

$$f(i, k, |A|, |B|) = \binom{|B|}{k} 2^{-k(i+1)} \left(1 - 2^{-i}\right)^{|B| - k}, \qquad g(i, j, |A|, |B|) = \binom{|A| - |B|}{j} 2^{-j(i+1)} \left(1 - 2^{-(i+1)}\right)^{|A| - |B| - j}.$$

Proof. Since the sets A and B are fixed, we write $P(|e| = n)$ for $P(|e| = n \mid A, B)$ for notational simplicity. By the law of total probability,

$$P(|e| = n) = \sum_{i=0}^{\infty} \sum_{k=1}^{\min(|B|, n)} P(|e \cap B| = k \text{ and } h(B) = i) \cdot P(|e \cap (A - B)| = n - k \mid h(B) = i).$$

First we consider the case where we have k collisions in $e \cap B$: the event $\{|e \cap B| = k \text{ and } h(B) = i\}$ requires exactly k elements of B to have $\sigma^-$ value equal to i (each with probability $2^{-(i+1)}$) and the remaining $|B| - k$ elements to have $\sigma^-$ value strictly less than i (each with probability $1 - 2^{-i}$), which gives $f(i, k, |A|, |B|)$. Next we consider the case where we have $n - k$ collisions in $e \cap (A - B)$: conditioned on $h(B) = i$, each of the $|A| - |B|$ elements of $A - B$ independently has $\sigma^-$ value equal to i with probability $2^{-(i+1)}$, which gives $g(i, n - k, |A|, |B|)$. Summing over i and k yields $P(|e| = n)$. □

Recall that A is the background population and B the set of patients satisfying the query criteria. We denote the buckets of A and B under our partition function by $A_1, \ldots, A_m$ and $B_1, \ldots, B_m$.
Here $B_i = B \cap A_i$ for $i = 1, \ldots, m$, and $e_i$ is the set of collisions in the ith bucket. Thus, the expected number of buckets with no more than k collisions is $E(|\{e_i : |e_i| \leq k, i = 1, \ldots, m\}|)$. Note that $(|A_1|, \ldots, |A_m|) \sim \mathrm{Multinomial}(|A|; p_1, \ldots, p_m)$ with $p_i = 1/m$. Therefore, for a single bucket, say $A_1$, its cardinality follows a binomial distribution: $|A_1| \sim \mathrm{Binomial}(|A|, 1/m)$. Given $|A_i|$, we have $|B_i| \sim \mathrm{Hypergeometric}(|A|, |A_i|, |B|)$, where $|B_i| \in \{0, 1, \ldots, \min(|A_i|, |B|)\}$ and $E(|B_i| \mid |A_i|) = |B| \cdot |A_i| / |A| = r|A_i|$.

Theorem 2.2. The expected number of buckets which have at least 1 collision but no more than k collisions is

$$E = m \sum_{|A_1| = 1}^{|A|} \; \sum_{|B_1| = 1}^{\min(|B|, |A_1|)} P(0 < |e_1| \leq k \mid |A_1|, |B_1|) \, P(|B_1| \mid |A_1|) \, P(|A_1|).$$

Proof. By linearity of expectation and the law of total probability,

$$E(|\{e_i : 0 < |e_i| \leq k\}|) = \sum_{i=1}^{m} \sum_{(|A_1|, \ldots, |A_m|)} P(|A_1|, \ldots, |A_m|) \, P(0 < |e_i| \leq k \mid |A_i|)$$

(by the independence of $|e_i|$ and $|A_j|$ for all $j \neq i$)

$$= m \sum_{|A_1|} P_{k, |A_1|} \, P(|A_1|)$$

(by separating the summation, the law of total probability and the symmetry of the buckets), where $P_{k, |A_1|} = P(0 < |e_1| \leq k \mid |A_1|)$. In order to compute $P(0 < |e_1| \leq k \mid |A_1|)$, we have to consider the range of $|B_1|$, which is $\{0, 1, \ldots, \min(|B|, |A_1|)\}$:

$$P(0 < |e_1| \leq k \mid |A_1|) = \sum_{|B_1|} P(0 < |e_1| \leq k \mid |A_1|, |B_1|) \, P(|B_1| \mid |A_1|).$$

In contrast to the simple case of Section 2.3, here $B_1$ is not necessarily a non-empty subset of $A_1$, because $A_1$ can be empty, in which case $B_1$ is empty as well. The number of collisions is zero if and only if $B_1$ is empty (which in particular happens whenever $A_1$ is empty). Therefore, we expand the formula of Lemma 2.1 to compute $P(0 < |e_1| \leq k \mid |A_1|, |B_1|)$. Furthermore, since we want to rule out the case of zero collisions (when a bucket is empty, there is no patient ID for which we need to guarantee k-anonymity), we restrict the ranges of $|A_1|$ and $|B_1|$ to $\{1, 2, \ldots, |A|\}$ and $\{1, 2, \ldots, \min(|A_1|, |B|)\}$, respectively. This yields the stated formula. □

Again, recall that A is the background population, B is the set of patients satisfying the query criteria and e is the set of collisions. In Section 2.4, we gave an explicit formula for computing $P(|e| \leq k \mid |A|, |B|)$.
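The single-bucket formula of Lemma 2.1, which Theorem 2.2 invokes for each bucket, can be checked numerically. Below is our own direct Python transcription (not the authors' released code); the infinite sum over i is truncated, which is harmless since the tail decays geometrically. A useful sanity check is that the probabilities over $n = 1, \ldots, |A|$ sum to 1 whenever B is non-empty.

```python
import math

def p_exactly_n_collisions(n, size_A, size_B, i_max=200):
    """Lemma 2.1: P(|e| = n) for a single bucket (m = 1)."""
    total = 0.0
    for i in range(i_max):
        p = 2.0 ** -(i + 1)   # P(sigma_minus(x) = i)
        q = 1.0 - 2.0 ** -i   # P(sigma_minus(x) < i)
        for k in range(1, min(size_B, n) + 1):
            j = n - k
            if j > size_A - size_B:
                continue
            # f: k elements of B attain the maximum i, the rest fall below it
            f = math.comb(size_B, k) * p**k * q**(size_B - k)
            # g: j elements of A - B independently also hash to the value i
            g = math.comb(size_A - size_B, j) * p**j * (1 - p)**(size_A - size_B - j)
            total += f * g
    return total
```

For instance, with $|A| = 2, |B| = 1$ the collision count is 2 exactly when both hashes share the same $\sigma^-$ value, which happens with probability $\sum_i 2^{-2(i+1)} = 1/3$, and the code reproduces this.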
However, the time complexity of carrying out that computation is troublesome. Usually, k is smaller than $|B|$, and the infinite sum over i is in practice truncated at 64 (or some other constant below 100), because i represents the maximum number of zeros before the first one among all hash values in base 2; as there are only about 7 billion people on Earth, 64 bits are sufficient for the original hash function to have a low probability of collisions. Therefore, the time complexity of evaluating $P(|e| \leq k \mid |A|, |B|)$ is $O(k^2)$ for at most k collisions. We now consider the time complexity of computing the desired expectation. Theoretically, the range of $|A_1|$ is $\{1, 2, \ldots, |A|\}$ and the range of $|B_1|$ is $\{1, 2, \ldots, \min(|B|, |A_1|)\}$; the resulting time complexity is $O(k^2 |A|^2)$, which is quadratic in the size of the population for at most k collisions. In practice, for large set sizes it is computationally infeasible to use this theoretical formula to compute the desired expectation; thus, in the remainder of this article we analyze fast approximations. When $|A|$ is large, it is impossible to sum over the whole range of $|A_1|$. Therefore, we use concentration inequalities to restrict $|A_1|$ and $|B_1|$ to smaller ranges. Because there is only an exponentially small probability that $|A_1|$ and $|B_1|$ fall outside these restricted windows, this has minimal effect on the final answer while reducing the computation time from quadratic to linear in the size of A. Recall that $|A_1| \sim \mathrm{Binomial}(|A|, 1/m)$, with $E(|A_1|) = |A|/m =: \mu_a$ and $\mathrm{Var}(|A_1|) = |A| \frac{1}{m}\left(1 - \frac{1}{m}\right) =: \sigma_a^2$. In order to reduce the time complexity, we restrict $|A_1|$ in our computations to the interval $(L_a, U_a) := (\mu_a - 5\sigma_a, \mu_a + 5\sigma_a)$. Similarly, recall that $|B_1| \sim \mathrm{Hypergeometric}(|A|, |A_1|, |B|)$ for a given $|A_1|$, with $E(|B_1| \mid |A_1|) = r|A_1| =: \mu_b$; we restrict $|B_1|$ to the interval $(L_b, U_b) := (\mu_b - 5\sigma_b', \mu_b + 5\sigma_b')$, where $\sigma_b'^2 := r(1 - r)|A_1|$ is the variance of the corresponding binomial distribution, chosen in order to compute the error bound more easily below in Section 3.2.1. After concentration, we can ensure that $P(||A_1| - \mu_a| \geq 5\sigma_a) \lesssim 9.6 \times 10^{-4}$ and $P(||B_1| - \mu_b| \geq 5\sigma_b') \lesssim 9.6 \times 10^{-4}$, as shown in detail in Section 3.2.1. As an aside, while these two intervals for $|A_1|$ and $|B_1|$ have been chosen to make the error bound and time complexity analytically tractable, in the computing code we can directly use built-in functions to compute the relevant confidence intervals for $|A_1|$ and $|B_1|$. By the concentration inequalities on $|A_1|$ and $|B_1|$, the desired expectation is approximated by

$$E_1(k) = m \sum_{|A_1| \in (L_a, U_a)} \; \sum_{|B_1| \in (L_b, U_b)} P_{k, |A_1|, |B_1|} \, P_{|A_1|, |B_1|} \, P_{|A_1|},$$

where $P_{|A_1|, |B_1|} = P(|B_1| \mid |A_1|)$, $P_{|A_1|} = P(|A_1|)$ and $P_{k, |A_1|, |B_1|} = P(0 < |e_1| \leq k \mid |A_1|, |B_1|)$. The computation time after concentration is proportional to the product of the window widths, so the time complexity is $O(k^2 |A|/m)$, which is linear in $|A|/m$. After concentration, the approximate expected value $E_1(k)$ is smaller than the actual $E(k)$, but we can bound the error. Recall that $|A_1| \sim \mathrm{Binomial}(|A|, 1/m)$, with $\mu_a = |A|/m$ and $\sigma_a = \sqrt{|A| \frac{1}{m}\left(1 - \frac{1}{m}\right)}$, and that we concentrate $|A_1|$ in the interval $(L_a, U_a) = (\mu_a - 5\sigma_a, \mu_a + 5\sigma_a)$. We define $F_a(x) := P(|A_1| \leq x)$, the cumulative distribution function of $|A_1|$.

(Table 1 note: A is the total size of the hospital background population, m is the number of buckets used in the HyperLogLog sketch and r is the fraction of the background population that matches the query criteria. 'A1' and 'A2', respectively, denote Approximations 1 and 2. For every one of the parameter regimes, we used simulations to determine which of the approximation methods is more suitable for the practitioner.)

First, we consider the concentration of $|A_1|$. We apply the higher-moment inequality to $|A_1| - \mu_a$ (Blum et al., 2020):

$$P(||A_1| - \mu_a| \geq a) \leq \frac{E(||A_1| - \mu_a|^r)}{a^r} \quad \text{for any positive even integer } r.$$

Choosing $r = 6$, we get

$$P(||A_1| - \mu_a| \geq 5\sigma_a) \leq \frac{E(||A_1| - \mu_a|^6)}{(5\sigma_a)^6} = \frac{1}{5^6} \cdot \frac{1 - 30d + 120d^2 + 25|A|d - 130|A|d^2 + 15|A|^2 d^2}{|A|^2 d^2} \leq \frac{15}{5^6} = 9.6 \times 10^{-4},$$

where $d = \frac{m - 1}{m^2}$.
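The coverage of the $\mu_a \pm 5\sigma_a$ window can be verified with an exact binomial tail sum. The helper below is our own sketch (standard library only, using log-space arithmetic to avoid overflow for large populations); it computes the probability mass the window captures, which should exceed $1 - 9.6 \times 10^{-4}$ in line with the moment bound above.

```python
import math

def binom_logpmf(n, p, x):
    """log P(X = x) for X ~ Binomial(n, p), via log-gamma."""
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(p) + (n - x) * math.log(1 - p))

def mass_in_window(size_A, m, width=5.0):
    """P(L_a <= |A_1| <= U_a) for |A_1| ~ Binomial(|A|, 1/m), with the
    window (L_a, U_a) = (mu_a - width*sigma_a, mu_a + width*sigma_a)."""
    p = 1.0 / m
    mu = size_A * p
    sigma = math.sqrt(size_A * p * (1 - p))
    lo = max(0, math.ceil(mu - width * sigma))
    hi = min(size_A, math.floor(mu + width * sigma))
    return sum(math.exp(binom_logpmf(size_A, p, x)) for x in range(lo, hi + 1))
```

In practice the 5σ window is far more conservative than the sixth-moment bound suggests: for a binomial distribution the mass outside it is closer to $10^{-6}$ than to $10^{-3}$, which is why the built-in confidence-interval routines mentioned above can safely use even tighter windows.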
Then we consider $|B_1|$ for a given $|A_1|$. For a given $|A_1|$, we know $|B_1| \sim \mathrm{Hypergeometric}(|A|, |A_1|, |B|)$, with $\mu_b := E(|B_1| \mid |A_1|) = r|A_1|$ and $\sigma_b^2 := \mathrm{Var}(|B_1| \mid |A_1|) = r|A_1| \cdot \frac{|A| - |A_1|}{|A|} \cdot \frac{|A| - |B|}{|A| - 1}$. We concentrate $|B_1|$ in the interval $(L_b, U_b) := (\mu_b - 5\sigma_b', \mu_b + 5\sigma_b')$, and define $F_b(x) := P(|B_1| \leq x \mid |A_1|)$, the cumulative distribution function of $|B_1|$ for a given $|A_1|$. Note that for $X \sim \mathrm{Binomial}(|A_1|, r)$, we have $E(X) = r|A_1|$ and $\mathrm{Var}(X) = r(1 - r)|A_1| = \sigma_b'^2$. The expected value of X equals $E(|B_1| \mid |A_1|)$, while its variance $\sigma_b'^2$ is bigger than the variance of $|B_1|$ for this given $|A_1|$; this reflects the fact that the hypergeometric distribution is more concentrated about its mean than the corresponding binomial distribution (Kalbfleisch, 1985). Therefore, we can use this binomial distribution to bound the tail of our hypergeometric distribution, obtaining $P(||B_1| - \mu_b| \geq 5\sigma_b' \mid |A_1|) \lesssim 9.6 \times 10^{-4}$ by the same sixth-moment argument as above. In the computing code, however, we can use built-in functions to find intervals $(L_a, U_a)$ and $(L_b, U_b)$ such that $P(L_a \leq |A_1| \leq U_a) \geq 1 - \alpha$ and $P(L_b \leq |B_1| \leq U_b) \geq 1 - \alpha$. This does not affect the time complexity and can ensure that the absolute error between the estimated and actual expected values is below 1 by choosing a proper $\alpha$. Obviously, the smaller $\alpha$ is, the smaller the error will be, but the intervals $(L_a, U_a)$ and $(L_b, U_b)$ become bigger, which means a longer computing time; there is thus a tradeoff between accuracy and speed (see Table 2 for real computing times). Fortunately, in all cases we explore, the $L_a$ and $U_a$ given above ensure that $\alpha < 5 \times 10^{-5}$. Although the time complexity after concentration is linear in $|A|/m$, for large $|A|$ and small m this speedup is often still not enough.
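Before moving to the second approximation, the binomial-versus-hypergeometric variance comparison above is easy to verify directly. The two helpers below are our own code, transcribing the variance formulas just stated; for every admissible parameter choice with $|A_1| \geq 1$, the binomial variance dominates.

```python
def binom_var(n_A1, r):
    """Var(X) for X ~ Binomial(|A_1|, r); X has the same mean r*|A_1| as |B_1|."""
    return n_A1 * r * (1 - r)

def hypergeom_var(n_A, n_A1, n_B):
    """Var(|B_1| | |A_1|) for |B_1| ~ Hypergeometric(|A|, |A_1|, |B|):
    |B| draws without replacement from |A| items, |A_1| of them marked."""
    return n_B * (n_A1 / n_A) * (1 - n_A1 / n_A) * (n_A - n_B) / (n_A - 1)
```

Algebraically, the ratio of the two variances is $(|A| - 1)/(|A| - |A_1|) \geq 1$ whenever $|A_1| \geq 1$, confirming that the binomial tail bound is always valid for the hypergeometric window.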
We can further approximate $P(|e_1| \leq k \mid |A_1|, |B_1|)$ by $P(|e_1| \leq k \mid |A_1|, |B_1| = r|A_1|)$ and get the following approximation of the expectation:

$$E_2(k) = m \sum_{|A_1| \in (L_a, U_a)} P(|A_1|) \, P(0 < |e_1| \leq k \mid |A_1|, r|A_1|).$$

This is a 'mean-field' approximation based on Approximation A1. The basic idea is to use the probability at the mean value, $P(0 < |e_1| \leq k \mid |A_1|, r|A_1|)$, to represent all the probabilities $P(0 < |e_1| \leq k \mid |A_1|, |B_1|)$ for $|B_1| \in (L_b, U_b)$, because $P(0 < |e_1| \leq k \mid |A_1|, |B_1|)$ is monotonically increasing in $|B_1|$ and the interval $(L_b, U_b)$ is small compared with the theoretical range $(0, \min(|A_1|, |B|))$. The range of $|A_1|$ is still $(L_a, U_a) = (\mu_a - 5\sigma_a, \mu_a + 5\sigma_a)$, but the inner sum over $|B_1|$ is replaced by a single evaluation, reducing the computation time accordingly; the real computing time is discussed in Section 4. Unfortunately, we do not have a strong provable guarantee for this approximation, but empirically it works well in practice. In order to assess the accuracy-speed tradeoffs of our two approximations, we ran simulations measuring the ground-truth empirical k-anonymity of patients in several different regimes using HLL sketches. Those simulations serve as ground truth because they have the same distribution as hashing real patient identifiers with a random seed, without needing to use real patient data for this article. We then compared those empirical values against the approximations described in this article. In the large-cardinality regimes, it is computationally infeasible to run full simulations, so we only compare the run-times of the two approximation methods. In Table 2, we provide full tables of these results. In Table 1, we provide a high-level summary giving a practitioner guidance on which method is appropriate under particular parameter choices. All computations were run in single-thread mode on an AMD Ryzen Threadripper 3970X 32-core CPU machine running Ubuntu 18.04.5 LTS (bionic) with 256 GiB of RAM.
It is worth mentioning that the steps in these computations are trivially parallelizable, but for benchmarking purposes all our results reflect single-threaded performance. Additionally, instead of using actual hash functions (e.g. SHA-256), we generate uniform random numbers as the hashed values, which has the same probability distribution. Code is available on GitHub and relies on the numpy, scipy.stats and decimal packages for simulation of patient hashes and explicit computation of probability distributions: https://github.com/tzyRachel/K-anonymity-Expectation. Recall that A represents the set of all patients, B represents the set of patients who meet some query criterion and m is the number of buckets in the HLL process. We introduce $r = |B|/|A|$ to represent the ratio of $|B|$ to $|A|$, because, as we will see, this ratio controls to a large extent the number of collisions. Intuitively, $1/r$ is the number of background-population members who could be used to provide plausible deniability for each patient in the query set. Our simulations sweep over different combinations of the parameters $|A|$, r and m to construct a table assessing Approximations A1 and A2. In all simulations, we restrict $|A|$ to the interval $[10^4, 10^7]$ and m to the interval $[100, 50\,000]$. In this article, the total number of distinct patients in a hospital is assumed to be over $10^4$: not only are approximation methods unnecessary for $|A| < 10^4$, because exact computations are feasible, but there is also not a sufficiently large background population to hide the query set when $|A|$ is small, and the privacy characteristics then become equivalent to sending hashed IDs (Yu and Weber, 2020). Since the simulations are run under the condition of '10-anonymity', we make sure that the mean single-bucket size satisfies $|A|/m > 20$.
Also, r is restricted to the interval $[0.001, 0.1]$, and we choose six different values of r, namely 0.1, 0.08, 0.05, 0.01, 0.005 and 0.001, to run the simulations and compare the simulation results with the computed results. As we discussed in Section 2, we can estimate the desired expected value by both Approximations A1 and A2. The final choice between them seems to depend primarily on $|A|/m$. In most cases, when $|A|/m \geq 1500$, Approximation A2 is good enough and the computing time is no longer than 3 min. When $|A|/m < 1500$, Approximation A2 is not accurate enough and we have to choose Approximation A1. The computing time of Approximation A1 is proportional to $\sqrt{r} \cdot |A|/m$, which is sometimes a concern. When $|A|/m \leq 1500$, the computing time is usually no longer than 8 min. But there are several special cases, such as r = 0.1 and r = 0.08, where the computing time at $|A| = 10^7, m = 6666$ is around 10 min, which might be acceptable but is not ideal. Furthermore, in extreme cases, the approximate expected k-anonymity returned by Approximations A1 and A2 can differ by around 10 (Table 2). To make things easier for the end-practitioner, we provide a summary 'choice' table (Table 1) indicating which approximation is suggested, based on different numbers of patients, numbers of buckets and ratios of the number of patients matching the query to all patients. Choosing between Approximations A1 and A2 is an accuracy/running-time tradeoff: A1 is usually both more accurate and more expensive than A2.
For the purpose of the choice table, to give a concrete recommendation, we aim to have single-threaded running times below 10 min; many modern multi-core machines can run over a dozen threads at once, and given that the approximation algorithms are trivially parallelizable, this amounts implicitly to a goal of real wall-clock time of less than a minute. The choice table is filled out by selecting the approximation method with the least error given that time constraint. In most cases, we choose A1 if its running time is below 10 min. Sometimes the computation results from A1 and A2 are almost the same, so the faster method can be chosen. Based on this rule, we compare gold-standard simulation results against the approximations in Table 2 to construct the 'choice' table. Note that our choice of a 10 min single-threaded run-time is arbitrary; given extra computational resources, the ideal switch-over point between approximations will vary. Figure 2 shows the errors between the approximation results (based on choice Table 1) and the simulation results (Table 2) when the number of distinct patients is $10^7$ and the number of buckets is 100 or 1000, respectively. The absolute values of all the errors are no more than 4.

(Figure note: It is the relationship to prevalence rate that is more complicated and nonlinear, as shown by focusing on the behavior for 100 and 500 buckets.)

(Table 2 note: Some entries are empty because the computation time was infeasibly long. We have highlighted, in yellow or green, the more accurate approximation that finished within 10 min. Full simulation and computation results for $r \in \{0.1, 0.08, 0.05, 0.01, 0.005, 0.001\}$ are available on GitHub in machine-readable format.)

We first note that all of the approximations we have provided finish on the order of minutes. As they are analytical approximations, there is also no need to run them multiple times.
Although we have not shown explicit simulation run-times in the tables above, the larger simulations take upwards of hours; furthermore, we did not perform simulations for the largest parameter ranges because we expected those to take significantly longer. Our approximations thus speed up determining the expected privacy loss from distributing HLL sketches. We are also able to draw some general conclusions about the expected privacy of HLL sketches. As mentioned above, the prevalence ratio r = |B|/|A|, where A and B are, respectively, the background population and the query population, can be interpreted as the ratio of patients matching a query (e.g. 'How many patients have been diagnosed with diabetes?'). In the HLL sketch, m is the number of buckets and A_i and B_i are the ith buckets of A and B. Figure 3 plots the number of buckets and the prevalence rate against the estimated expected number of non-'k-anonymous' buckets, as well as the number of buckets versus the percentage of non-'k-anonymous' buckets. The two top plots simply show the number of non-k-anonymous buckets against the number of buckets while varying the other parameters, but this turns out not to be the right set of variables to control. Instead, as evidenced by the lower-left plot (Fig. 3), a roughly constant fraction of the buckets are not k-anonymized when r is constant. This is unsurprising because, as mentioned earlier, 1/r is intuitively the number of background population members that could be used to hide each patient. Of course, random chance also plays a large role. More precisely, this constant is close to 100·P(0 < |e| < 10 | |A|/m, r·|A|/m), where P(0 < |e| < 10 | |A|/m, r·|A|/m) is the probability that the number of collisions is >0 and <10 when the bucket size is at its mean value |A|/m. It is not quite equal for two reasons. The first reason is the obvious one: we are using the approximations that form the subject of this article.
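To make the quantity above concrete, the fraction of non-k-anonymous buckets can be estimated by direct Monte Carlo simulation. The sketch below reflects our own simplified reading of the model (all names are hypothetical): each background patient is assigned a uniform bucket and a Geometric(1/2) leading-zero count rho, the query population B is taken as the first r·|A| patients, and a bucket is flagged when the number e of background patients colliding with the bucket's stored maximum satisfies 0 < e < k. It is a stand-in for the gold-standard simulations referenced above, not the paper's actual simulation code.

```python
import random

def frac_non_k_anon(num_patients: int, num_buckets: int, r: float,
                    k: int = 10, seed: int = 0) -> float:
    """One Monte Carlo trial: fraction of HLL buckets that are not
    k-anonymous, under a simplified model (our own reading, not the
    paper's exact simulation)."""
    rng = random.Random(seed)
    num_query = int(r * num_patients)          # B is a subset of A
    bucket_max_b = [0] * num_buckets           # per-bucket max rho over B
    bucket_counts = [dict() for _ in range(num_buckets)]  # rho counts over A
    for i in range(num_patients):
        b = rng.randrange(num_buckets)
        rho = 1                                # Geometric(1/2) leading zeros
        while rng.random() < 0.5:
            rho += 1
        bucket_counts[b][rho] = bucket_counts[b].get(rho, 0) + 1
        if i < num_query and rho > bucket_max_b[b]:
            bucket_max_b[b] = rho
    bad = 0
    for b in range(num_buckets):
        v = bucket_max_b[b]
        if v == 0:                             # empty query bucket: no value released
            continue
        e = bucket_counts[b].get(v, 0)         # background collisions with the max
        if 0 < e < k:
            bad += 1
    return bad / num_buckets
```

Averaging this over many seeds would approximate the expected fraction plotted in Figure 3, at far greater computational cost than the analytical approximations.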
The second reason is that a single bucket size |A_1| follows a Binomial distribution with mean |A|/m and p = 1/m. When |A| and |A|/m are big enough, we can get P(0 < |e_1| <= 10 | |A_1|, |B_1|) ≈ P(0 < |e_1| <= 10 | |A|/m, r·|A|/m) by concentrating |A_1| and |B_1| in an interval centered at their means, similar to what we did in Approximation A2, but simpler. However, when |A|/m is not that big, for example |A| = 100 and m = 5, then P(0 < |e_1| <= 10 | |A|/m, r·|A|/m) and P(0 < |e_1| <= 10 | |A_1|, |B_1|) will differ substantially for different values of |A_1| and |B_1|. Now that it is clear that r is the value of primary importance, we see in the lower-right plot of Figure 3 that as the prevalence rate r increases, more buckets are non-'k-anonymized'. This is because a bigger r means more overlap between the sets A and B, and also between each pair of buckets A_i and B_i; thus, the maximum number of zeros before the first one among all hash values in B_i is more likely to equal that in A_i. Thus, a hospital IRB or clinical query system seeking to understand the 10-anonymity of a particular query can use a first-order approximation based only on r, without even needing to run our code. Indeed, they need only consult the lower-right plot in Figure 3 and scale to the size of their background population to determine that first-order approximation. When a more precise result is needed, however, our two Approximations can provide that answer in only a few minutes. Of course, if even that is insufficient, the practitioner may choose to directly measure the k-anonymity of a particular HLL sketch; this is not in the scope of this article, but was done empirically in prior work (Yu and Weber, 2020). In this article, we have developed a method to quickly compute the expected number of non-'k-anonymous' buckets in the HLL sketch.
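The concentration argument above is easy to check empirically: with uniform hashing, the bucket sizes |A_i| are Binomial(|A|, 1/m), so their relative spread around the mean |A|/m shrinks as the mean grows. The sketch below (names are our own, for illustration only) measures the largest relative deviation of any bucket size from the mean.

```python
import random

def bucket_size_spread(num_patients: int, num_buckets: int,
                       seed: int = 0) -> float:
    """Assign patients to buckets uniformly at random, so each bucket size
    |A_i| ~ Binomial(|A|, 1/m); return the largest relative deviation of
    any bucket size from the mean |A|/m. Illustrative sketch only."""
    rng = random.Random(seed)
    sizes = [0] * num_buckets
    for _ in range(num_patients):
        sizes[rng.randrange(num_buckets)] += 1
    mean = num_patients / num_buckets
    return max(abs(s - mean) for s in sizes) / mean
```

With |A| = 10^5 and m = 100 (mean bucket size 1000), deviations are on the order of a few percent, justifying concentrating |A_1| at its mean; with |A| = 100 and m = 5 (mean bucket size 20), the relative deviations are far larger, which is why the two conditional probabilities above diverge in that regime.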
Because the number of patients (denoted |A| in our model) is too big to compute the precise expected value, we introduced two approximations based on concentration inequalities. In general, Approximation A1 is suitable when the expected single-bucket size |A|/m is 'small', for example when the total number of patients |A| is 10^5 and the number of buckets m is 100, or when |A| is 10^7 and m is 10^5. Approximation A2 is suitable when |A|/m is 'big', for example when |A| is 10^7 and m is 100 (see the choice table in Section 4). By an appropriate choice of approximation method, we can keep the computing time under 300 s in almost all cases. In other words, when an individual hospital is asked to return aggregate counts by sharing HLL sketches, we can compute the expected number of buckets that match fewer than 10 patients in the background population. If this number is too high, that is a signal to the clinical query system that the particular query is unsafe to release using HLL sketches. It is then up to the clinical query system to decide whether to fall back on another aggregation method, or whether to simply not respond to the query. Our results further give some guidance on the parameter ranges in which HLL sketches are likely to be safe to release. HLL sketches are especially useful for rare diseases, where the prevalence ratio in the population is low. Note that this is in marked contrast to sending raw counts, where rare diseases are precisely the least k-anonymous; thus, HLL sketches fill a complementary role. Indeed, at the heart of the problem is the tradeoff between the utility/accuracy of HLL sketches and privacy, which respectively increase and decrease with the number of buckets.
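The release decision described above amounts to a simple threshold check around the computed expectation. The following is hypothetical policy glue (the function name and default threshold are our own, not the paper's): given the expected number of non-10-anonymous buckets returned by A1 or A2, the query system either releases the sketch or falls back.

```python
def release_decision(expected_non_anon: float, num_buckets: int,
                     max_frac: float = 0.01) -> str:
    """Hypothetical sketch of the release policy: the threshold max_frac
    (tolerated fraction of non-10-anonymous buckets) is an assumption of
    ours, to be set by the clinical query system or IRB."""
    if expected_non_anon / num_buckets <= max_frac:
        return "release"
    # Too many risky buckets: the system must use another aggregation
    # method or simply not respond to the query.
    return "fall back or decline"
```

The appropriate threshold is a policy question, not a statistical one; a conservative system might set max_frac = 0 and require every bucket to be expected 10-anonymous.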
The average k-anonymity of a bucket is roughly inversely proportional to the square of the estimation error; our work instead computes the number of buckets that are not at least 10-anonymous. For more guidance on this tradeoff, we refer the reader to prior work, where we graphed this tradeoff empirically (Yu and Weber, 2020). Ultimately, our work is primarily useful in contexts where federated clinical query systems are used in biomedical research. The past year has seen increasing amounts of data centralization to combat the Covid-19 pandemic. The cost to privacy has been accepted because of the urgent, clear and present need. However, in the future post-pandemic era, as the pendulum swings the other direction, privacy may again take center stage. We hope that our work will be useful in analyzing the privacy consequences of distributed query systems and will help inform policy-makers and institutional IRBs about the privacy-utility tradeoffs at hand. All of the data and code used to generate benchmarks are available on GitHub: https://github.com/tzyRachel/K-anonymity-Expectation

References
Dashing: fast and accurate genomic distances with HyperLogLog.
The need for privacy with public digital contact tracing during the COVID-19 pandemic.
Inherent privacy limitations of decentralized contact tracing apps.
Evaluating re-identification risks with respect to the HIPAA privacy rule.
Foundations of Data Science.
International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium.
On the resemblance and containment of documents.
Fast and accurate traffic matrix measurement using adaptive cardinality counting.
Software-enabled distributed network governance: the PopMedNet experience.
Cardinality estimators do not preserve privacy.
Loglog counting of large cardinalities.
Private medical record linkage with approximate matching.
Differential privacy: a survey of results.
Protecting privacy using k-anonymity.
Evaluating the risk of re-identification of patients from hospital prescription records.
Probabilistic counting algorithms for data base applications.
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm.
Launching PCORnet, a national patient-centered clinical research network.
The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment.
Enabling genomic-phenomic association discovery without sacrificing anonymity.
Security Controls Evaluation, Testing, and Assessment Handbook.
Probability and Statistical Inference.
Systems and methods for enabling data de-identification and anonymous data linkage.
Secure multiparty computation for privacy preserving data mining.
Making a faster cryptanalytic time-memory trade-off.
Mash: fast genome and metagenome distance estimation using MinHash.
Improved search of large transcriptomic sequencing databases using split sequence bloom trees.
k-Anonymity: a model for protecting privacy.
Federated queries of clinical data repositories: the sum of the parts does not equal the whole.
Federated queries of clinical data repositories: scaling to a national network.
The Shared Health Research Information Network (SHRINE): a prototype federated query tool for clinical data repositories.
Federated Aggregate Cohort Estimator (FACE): an easy to deploy, vendor neutral, multi-institutional cohort query architecture.
Balancing accuracy and privacy in federated queries of clinical data repositories: algorithm development and validation.