key: cord-0567481-r7c2i35x authors: Wang, Dong; Liu, Fang title: Privacy-preserving Publication and Sharing of COVID-19 Pandemic Data date: 2021-06-18 journal: nan DOI: nan sha: 8f441a75ad8216083ced7f4f1e578750c65cc657 doc_id: 567481 cord_uid: r7c2i35x

A huge amount of data of various types has been collected during the COVID-19 pandemic, the analysis and interpretation of which has been indispensable for curbing the spread of the coronavirus. As the pandemic slows down, the data collected during the pandemic will continue to be rich sources for further studying the pandemic and understanding its impacts on public health, economics, and societies. On the other hand, naïve release and sharing of the information can be associated with serious privacy concerns. In this paper, aiming at shedding light on privacy-preserving sharing of pandemic data and thus promoting and encouraging more data sharing for research and public use, we examine three common data types -- case surveillance, patient location histories and hot spot maps, and contact tracing networks -- collected during the pandemic and develop and apply privacy-preserving approaches for publishing or sharing each data type. We illustrate the applications and examine the utility of the released privacy-preserving data in examples and experiments at various levels of privacy guarantees.

As the world continues to cope with the COVID-19 pandemic and some regions are on track to slowly and cautiously return to pre-pandemic life, the impact brought by the pandemic is here to stay for a very long time. A huge amount of data of various types has been collected during the pandemic, the analysis and interpretation of which has been indispensable for governments, health authorities, and experts in various fields to understand disease severity and identify risk factors, to monitor and forecast the spread of COVID-19, to evaluate the impacts of the pandemic on public health, the capacity of healthcare systems, economics and employment, education, and mental health, and to plan and implement strategies to mitigate the negative impacts. As COVID-19 slows down and eventually comes under control, the collected data will continue to serve as rich sources for further studying the disease and understanding its impacts on societies. On the other hand, naïve release and sharing of the information can be associated with serious privacy concerns, especially considering that a huge amount and a great variety of data were collected very quickly in a short period of time to deal with the global health crisis and that data privacy and ethics regulations were lagging behind, at least in the initial stage of the pandemic. Many types of data collected during the pandemic are known to be associated with high privacy risk, such as disease status, medical history, insurance status, location history, close contacts, and employment/income status. A balance must be found between individual privacy protection and sharing the data for research use. Fortunately, this is not an unsolvable problem, though further research is needed to ensure the solutions are practically feasible, and software and tools will need to be developed to facilitate practical applications. This is because many research questions revolve around extracting population-level aggregate information and understanding patterns rather than focusing on individual-level information; the latter is often the goal of privacy attacks.
In other words, if a privacy-preserving data release mechanism can maintain accurate and useful aggregate information while guaranteeing individual-level privacy, it may serve as an effective approach for pandemic data sharing. In this work, we examine three specific types of data collected during the pandemic and develop and apply privacy-preserving approaches for publishing or sharing each data type. We hope this work sheds light on sharing COVID-19 data and helps promote and encourage more data sharing to assist us in better understanding the disease and its current and future impacts on societies, among others.

Privacy-preserving collection and analysis of COVID-19 data have been developed and applied during the pandemic. The Google research teams apply differential privacy (DP) to generate anonymized metrics from the data of Google users who opted in to the Location History setting in their Google accounts and produce the COVID-19 community mobility reports [1], to understand the impacts of social distancing policies on mobility and COVID-19 case growth in the US [2], to generate anonymized trends in Google searches for COVID-19 symptoms and related topics [3], and to forecast COVID-19 trends using spatio-temporal graph neural networks [4]. DP is also integrated in deep learning to predict COVID-19 infections from imaging data [5, 6]. Butler et al. [7] apply DP to generate individual-level health tokens/randomized health certificates while allowing useful aggregate risk estimates to be calculated. In all the above work, the randomization and anonymization processes are designed to ensure that minimal individual-level information -- such as locations, movement, behaviors, contacts, medical status -- can be derived from the released information while preserving the usefulness of the collective information at the population level.

Contact tracing apps have been deployed around the world during the pandemic to track and curb the spread of the coronavirus. The apps collect users' location data (e.g., GPS) or proximity data (e.g., Bluetooth) to identify and notify those who might have been near a COVID-19 patient and at high risk of contracting the disease. It is well known that location data are highly revealing of personal information [8-11]. Many location-based contact tracing apps employ the centralized model without integrating any formal privacy concepts when collecting or releasing information. Specifically, data are collected and stored on servers owned and maintained by government or health authorities, who process the data, identify close contacts, and notify users of potential exposure. This is the model adopted by the Alipay Health Code (China), WeChat (China) [12, 13], Corona100m (South Korea) [14], COVIDTracker (Thailand) [15], and ProteGo (Poland) [16]. By contrast, in the decentralized model, only users know about the potential exposure and the contact logs are never transmitted to or stored in a central server. Safe Paths (US) [17] is one of the few applications that utilize the decentralized model for location-based contact tracing. It redacts and anonymizes location data on a semi-trusted server, and users periodically check the server to see if they have crossed paths with COVID-19 patients. Vepakomma et al. [18] utilize the decentralized scheme and develop an improved count-mean-sketch data structure for privacy-preserving contact tracing.
The proximity-based apps, such as the Pan-European Privacy-Preserving Proximity Tracing (PEPP-PT) (EU) [19] (a centralized model) and the Google/Apple Exposure Notification (GAEN) system (a decentralized model) [20], do not collect location data but rather proximity information. Similar to location-based contact tracing, proximity-based contact tracing has both centralized and decentralized models. The former processes and stores contact logs and notifies clients of potential contacts via a central server, whereas the latter does this locally on users' devices. In summary, the decentralized model arguably provides better privacy protection than the centralized model, and the privacy protection measures taken in the decentralized model are often encryption-based. There are also approaches that aim at protecting simultaneously the privacy associated with different data types. For example, Iyer et al. [21] explore and apply the spatial k-anonymity concept to data sharing and management of both contact tracking and hot spot maps that display patients' whereabouts. Cao et al. [22] propose a location sanitization mechanism based on privacy policy graphs for epidemic surveillance that includes location monitoring, epidemic analysis, and contact tracing.

In summary, many privacy-preserving methods and applications in the pandemic, including the work reviewed above, have focused on data collection and on information sharing with governments, health officials, and clients so as to curb the spread of the disease in a timely manner and facilitate quick decision making during the pandemic. COVID-19 related data sharing for research use has not received much attention. Our work has a different focus from the work presented in Sec 1.2. We explore new applications of existing privacy-preserving techniques and develop new techniques to share COVID-19 data collected during the pandemic for research use. The research will not only have intellectual merits, but will also produce research outcomes of broader impacts, including but not limited to better understanding of the pandemic, solving problems that we were not able to solve during the pandemic due to time constraints, and generating new insights into how we can better handle similar crises in the future.

We examine three common data types collected during the pandemic and develop privacy-preserving approaches for publishing and sharing such data. The methods are based on state-of-the-art privacy notions and models, including DP, geo-indistinguishability (GI), and randomized response (RR), to provide formal privacy guarantees on the shared data. In each application, we present the motivation, the methodology, and examples and experiments to examine their privacy guarantees and utility. Specifically,

• We share subgroup case surveillance count data by employing DP mechanisms for multi-dimensional marginals and histograms. The subgroups are often formed by demographic attributes such as race/ethnicity, socioeconomic status, gender, and age, which are often regarded as pseudo-identifiers that can re-identify individuals and disclose sensitive information in the absence of proper privacy-preserving measures. For that reason, such granular information is often not shared, but the availability of this information can help better understand risk factors and the heterogeneity of the disease, identify groups vulnerable to the disease, among others.

• Based on the notion of GI, we propose the doppelganger method for case location publication and hot spot map generation while ensuring individual privacy.
The formulation of the doppelganger definitions considers both data utility and the adversary's error in learning an individual's location given sanitized location records. Hot spot maps can be linked with residential attributes in an area to study topics such as residential racism and structural segregation.

• We propose a new method based on GI and examine the feasibility of a previously studied differentially private RR method for social networks for publishing contact tracing networks. Contact tracing networks are of research interest as they help understand how physical proximity may affect the spread of the disease, among others. The GI-based method perturbs the locations of individuals and thus their proximity to others. The RR sanitization approach works directly on the edges/non-edges in a network. Both the GI-based and the RR methods release sanitized contact tracing networks, aiming at providing privacy guarantees for individuals in the networks while maintaining useful structural information of the networks.

We provide a brief overview of the basic concepts and notions that are employed in the privacy-preserving methods in Sec 3.

Definition 1 (ε-differential privacy (DP) [23]) A randomization mechanism M is ε-differentially private if, for any pair of data sets X and X′ that differ by one record and every possible outcome subset Ω of the image of a query,

Pr(M(X) ∈ Ω) ≤ e^ε Pr(M(X′) ∈ Ω),   (1)

where ε > 0 is the privacy budget or privacy loss parameter. The smaller ε is, the more privacy protection there is on the individuals in the data. X and X′ differing by one record (denoted by d(X, X′) = 1) may refer to the case where X and X′ are of the same size but differ in at least one attribute value in exactly one record, or the case where X has one record more than X′ or vice versa. Definition 1 is the basic form of DP; relaxed versions and extensions exist, such as (ε, δ)-DP [24, 25], Rényi DP [26], and Gaussian DP [27]. DP provides a mathematically rigorous framework for protecting individual privacy when releasing and sharing information. There exist many randomization mechanisms to achieve DP, such as the Laplace mechanism [23], the Exponential mechanism [28], and the Gaussian mechanism [29, 30]. Sec 3 uses the Laplace mechanism to illustrate the methods and applications, so we provide its definition below.

Definition 2 (Laplace mechanism) Let s = (s_1, ..., s_r) be an r-dimensional statistic, ∆ be the global sensitivity of s, and e comprise r independent random samples from Laplace(0, ∆ε^{-1}). The sanitized s* via the Laplace mechanism of ε-DP is s* = s + e.

Definition 3 (Geo-indistinguishability (GI) [31]) Let d(P, P′) be the Euclidean distance between any two distinct locations P and P′, and ε be the unit-distance privacy loss. A randomization mechanism M satisfies ε-GI iff, for any possible released location Q and any pair of locations P and P′ such that d(P, P′) ≤ γ,

Pr(M(P) = Q) ≤ e^{ε·d(P,P′)} Pr(M(P′) = Q).   (2)

GI is an extension of the basic DP concept in Definition 1 to the location setting. M in Eq (2) enjoys (εγ)-privacy for any specified γ > 0. To achieve GI when collecting or releasing location data, the planar Laplace mechanism can be used to perturb the location information in polar coordinates.
Definition 4 (polar Laplace mechanism [31]) The sanitized location Q, given the actual location P with coordinates (x, y) in the Euclidean space, satisfies ε-GI with coordinates

(x*, y*) = (x + r cos(θ), y + r sin(θ)),   (3)

where the joint distribution of r and θ is

D_ε(r, θ) = (ε²/(2π)) r e^{-εr}.   (4)

Eq (4) implies that r and θ are independently distributed, with θ uniform on [0, 2π) and r following a Gamma distribution with shape 2 and rate ε.

Randomized response (RR) [32] is a research method often used in surveys to allow respondents to answer sensitive questions while maintaining confidentiality. There are multiple ways to achieve randomized response; one approach is as follows. Given a sensitive question (e.g., did you ever steal? what is your sexuality?), the respondent flips a coin. If it lands on tails, the respondent answers the question truthfully. If it lands on heads, the respondent flips a second coin and answers "Yes" for heads and "No" for tails. This particular randomized response procedure corresponds to ε = log(3) in DP for the individual-level response (Dwork and Roth [29]). Though the individual-level responses are randomized, population-level information can still be recovered; by Bayes' theorem, the probability that a respondent's true answer is Yes given a reported Yes is 3p/(2p + 1), where p is the true probability of answering Yes.

We propose approaches to sharing several types of pandemic data in a privacy-preserving manner. The data types we focus on are subgroup case surveillance data (Sec 3.1), case location data (Sec 3.2), and contact tracing networks (Sec 3.3). The sanitization approaches we develop and apply are based on the DP, GI, and RR concepts presented in Sec 2.

Collection and publication of case numbers are necessary during the pandemic to monitor and forecast the spread of the disease, to understand how COVID-19 impacts the capacity of healthcare systems, to provide necessary information to health authorities for quick decision making, and to keep the public informed about the scale of the spread. Oftentimes, case numbers are aggregated at the organizational (e.g., universities and colleges), city, county, state, and national levels, without breaking them down by demographic group. On the other hand, case numbers by demographic groups such as age, gender, and race and ethnicity provide valuable information for identifying risk factors and groups vulnerable to the disease and for understanding the heterogeneity of the susceptibility to the disease. However, publishing such granular information carries re-identification and disclosure risks, especially in data sparsity situations. The good news is that there exist many methods for publishing privacy-preserving counts from multi-dimensional histograms in the DP framework [33-37], and publishing subgroup case numbers is basically this problem. For demonstration purposes, we adopt the universal histogram (UH) approach [38] to publish subgroup case numbers; other approaches may also be used. Compared to a flat Laplace sanitizer that injects noise sampled from Lap(0, ε^{-1}) into the count in each of the histogram bins, the UH approach forms a hierarchical tree among the data attributes, injects noise into the node counts in each layer of the tree, and then calculates and releases the final node counts, exploiting the equality constraints among the tree nodes in different layers. The UH approach provides improved accuracy for the sanitized node counts located near the root of the tree (low-dimensional marginals).
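To make the budget-splitting and noise-injection steps concrete, below is a minimal Python sketch of hierarchical Laplace sanitization in the spirit of the UH approach. The tree layout (age, then race, then gender), the hypothetical leaf counts, and the simple top-down adjustment are illustrative assumptions made for this sketch; the weighted consistency estimates in Eqs (7) and (8) are not reproduced here.

```python
# A minimal sketch (not the exact UH implementation) of hierarchical Laplace
# sanitization for subgroup case counts, loosely following the idea of [38].
import numpy as np

rng = np.random.default_rng(0)

def laplace_sanitize(count, scale):
    """Add Laplace(0, scale) noise to a single count."""
    return count + rng.laplace(0.0, scale)

# Hypothetical leaf counts for the 2x2x2 subgroups (age x race x gender).
leaves = {("elderly", "minority", "F"): 6,  ("elderly", "minority", "M"): 5,
          ("elderly", "majority", "F"): 30, ("elderly", "majority", "M"): 28,
          ("young", "minority", "F"): 12,   ("young", "minority", "M"): 13,
          ("young", "majority", "F"): 52,   ("young", "majority", "M"): 54}

eps, h = 0.5, 4                    # total budget and number of tree layers
scale = h / eps                    # each layer gets eps/h; sensitivity is 1

# Layer-by-layer noisy counts: root, age, age x race, age x race x gender.
noisy = {(): laplace_sanitize(sum(leaves.values()), scale)}
for depth in (1, 2, 3):
    groups = {}
    for key, c in leaves.items():
        groups[key[:depth]] = groups.get(key[:depth], 0) + c
    for key, c in groups.items():
        noisy[key] = laplace_sanitize(c, scale)

# Simple top-down adjustment so that children sum to their parent
# (an illustrative stand-in for the consistency step of Eqs (7)-(8)).
consistent = {(): noisy[()]}
for depth in (1, 2, 3):
    parents = {k for k in noisy if len(k) == depth - 1}
    for p in parents:
        kids = [k for k in noisy if len(k) == depth and k[:depth - 1] == p]
        slack = (consistent[p] - sum(noisy[k] for k in kids)) / len(kids)
        for k in kids:
            consistent[k] = noisy[k] + slack

print({k: round(v, 1) for k, v in consistent.items() if len(k) == 3})
```

After the adjustment, the sanitized subgroup counts sum exactly to their sanitized parent counts, which is the role the equality constraints play in the UH approach; node counts near the root aggregate more records and therefore retain a smaller relative error for the same absolute noise scale.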
Correspondingly, we can place the attributes that are of more practical importance, and whose information should therefore be more accurate, near the root of the tree, and those that are less important near the terminal nodes of the tree. We use a specific example to illustrate how to apply the UH approach in publishing demographic subgroup case numbers. Fig 1(a) depicts a tree formed from a COVID-19 dataset with 200 positive cases collected in a county. The tree has h = 4 layers, with each layer below the root formed by a demographic attribute: age (elderly or not), race (minority and majority), and gender (F and M). During the DP sanitization, each layer receives a portion of the total budget following the sequential composition principle in DP [28]. For illustration purposes, we assume each layer receives 1/h of the total budget ε (other allocation schemes can be used), and the node counts are sanitized via the Laplace mechanism Lap(0, hε^{-1}). The first-step estimate z[v] is obtained from Eq (7), where v denotes a node, succ(v) denotes its set of children nodes, and k = 2 for the binary tree in Fig 1(a). The estimate z[v] is inconsistent in the sense that it might not equal the sum of the node counts of its children, a violation of the equality constraints. The inconsistency can be corrected via Eq (8), which yields the final sanitized count for node v in the tree (u denotes node v's parent in Eq (8)).

The estimated regression coefficients for Age suggest that the elderly have a higher rate of COVID-19 than the non-elderly if they have the same minority status and gender; those for Minority are 1.03, 1.03, 0.93, and 0.83 for the original data and the sanitized data at ε = 0.5, 0.3, and 0.1, respectively, meaning that the minority group has a higher rate of COVID-19 than the majority group if they have the same age and gender; those for Gender are -0.18, -0.16, -0.25, and 0.07 for the original data and the sanitized data at ε = 0.5, 0.3, and 0.1, respectively, meaning that males have a lower rate of COVID-19 than females if they are of the same age and minority status, except for the result from the sanitized data at ε = 0.1, where the sign of the estimate differs from the original estimate. Fig 2 focuses on the point estimates for the regression coefficients. If statistical inference is of interest, to take into account the sanitization randomness so as to ensure the validity of the inference, one may release multiple sets of sanitized data and combine the results across the multiple sets [39] or explicitly model the sanitization mechanism during the analysis.

When a person is diagnosed with COVID-19, health authorities may interview the person for his or her whereabouts and location history in the past few weeks [40], and that information may be shared with the public [41]. Knowing accurate information on a patient's travel history is critical for health authorities to track and contain the spread of the disease. However, that information does not have to be shared with the public in its original accurate form. In some cases, the release of the exact location history has caused serious privacy risks and considerable psychological harm for the patient, who was doxxed and cyber-bullied [42]. We propose a privacy-preserving approach, namely the doppelganger, for releasing location information. The doppelganger would be particularly useful for protecting individual privacy when sharing location information at the local level, especially considering that hot spot maps are often built on a relatively fine scale.
The finer the scale is, the sparser the data become, the higher the re-identification risk from releasing location data, and thus the greater the need for effective privacy protection approaches. As the scale gets coarser, say from the city level to the regional, state, or national levels, the information released by the doppelganger would deviate little from the actual information. The main idea behind the doppelganger, as suggested by its name, is as follows: rather than releasing the true location P of an individual, we release K ≥ 2 perturbed versions of P, guided by the GI concept. The reason for releasing multiple perturbed locations for a given original location instead of just one is to cause confusion on the adversary's end when he or she tries to figure out the target's true location, while providing researchers with data to quantify the sanitization uncertainty if needed and with improved utility at some K compared to a single sanitized location release. To preserve the utility of the released locations, we require at least one sanitized location to fall within a certain radius r of the original location; to confuse the adversary, we require at least one sanitized location to fall outside a certain radius r′ ≥ r of the original location. Taken together, we require the probability that the two events occur simultaneously to be at least 1 − β, as stated in Definition 5.

Definition 5 (doppelganger set D(K, r, r′, β, ε)) Let D be the set of K ≥ 2 sanitized locations for the original location P with the total privacy loss ε > 0. D is a doppelganger set if, for r′ ≥ r > 0 and β ∈ (0, 1),

Pr(∃ P*_1, P*_2 ∈ D : (d(P, P*_1) < r) ∩ (d(P, P*_2) > r′)) ≥ 1 − β.

Therefore, the definition of the doppelganger covers both the usefulness and the privacy protection aspects. r and r′ can be set to the same value or not. For a given r, the further away r′ is from r, the less useful the positions in D will tend to be. As the data curator, we would like β associated with a set D to be small and call 1 − β the effectiveness rate of D. To examine the robustness of the doppelganger against potential privacy attacks, we define a threat model in Definition 6, based on the rate at which the adversary re-identifies the target's true location from the aggregate location information in set D. Since it is impossible to pinpoint the exact location given the randomness injected by sanitization, we introduce a cutoff point l for re-identification success.

Definition 6 (location re-identification) Let P̂* be the inferred location given the sanitized locations in D, and let

Pr(d(P, P̂*) < l) = 1 − α

for l > 0 and α ∈ (0, 1). In other words, 1 − α is the location re-identification success rate for a given cutoff l. Various ways exist for the adversary to infer P̂* given the doppelganger set D, such as using the centroid or geometric center of all the points in D.

We conduct experiments to evaluate the effectiveness of the doppelganger. We take into account privacy loss composability when sanitizing multiple locations; that is, each location receives ε/K of the privacy budget during sanitization per the sequential composition theorem [28]. We set r′ = r and examine the impact of the number of published locations K ∈ [2, 10] on the effectiveness of the doppelganger set for r ranging from 2.5 to 15. Note that it is the product εr that determines the effectiveness and privacy level of D rather than r alone or ε alone; for example, (r = 5, ε = 1) and (r = 20, ε = 0.25) lead to the same 1 − β and 1 − α.
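To make the release procedure and the attack concrete, here is a minimal Python sketch, assuming the Gamma/uniform polar-coordinate sampling implied by Definition 4; the helper names and the example coordinates are hypothetical and used only for illustration.

```python
# A minimal sketch (illustrative, not the exact experimental code) of generating
# a doppelganger set via the polar Laplace mechanism of Definition 4 and of the
# centroid-based inference attack described next.
import numpy as np

rng = np.random.default_rng(1)

def polar_laplace(x, y, eps):
    """One draw from the planar Laplace mechanism centered at (x, y)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)      # angle ~ Uniform[0, 2*pi)
    r = rng.gamma(shape=2.0, scale=1.0 / eps)  # radius ~ Gamma(2, rate=eps)
    return x + r * np.cos(theta), y + r * np.sin(theta)

def doppelganger_set(x, y, K, eps_total):
    """Release K perturbed copies of (x, y), each with budget eps_total/K."""
    return [polar_laplace(x, y, eps_total / K) for _ in range(K)]

def is_effective(P, D, r, r_prime):
    """Check the Definition 5 event: one point within r AND one beyond r_prime."""
    dists = [np.hypot(px - P[0], py - P[1]) for (px, py) in D]
    return min(dists) < r and max(dists) > r_prime

P = (0.0, 0.0)                                 # hypothetical true location
D = doppelganger_set(*P, K=5, eps_total=1.0)
print("effective:", is_effective(P, D, r=5.0, r_prime=5.0))

# Centroid attack: average the released coordinates and measure the error.
centroid = np.mean(np.array(D), axis=0)
print("re-identification error:", np.hypot(*(centroid - np.array(P))))
```

Repeating the release and the centroid attack over many simulated draws yields Monte Carlo estimates of the effectiveness 1 − β and the re-identification rate 1 − α of the kind summarized in Fig 4.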
We infer the location P̂* given D using the centroid of the set in this example, by taking the averages of the X and Y coordinates of the sanitized locations P* in D, and calculate the Euclidean distance between the inferred P̂* and the original location P. The main results are presented in Fig 4 and summarized as follows. (1) As K increases, 1 − β increases at the beginning, reaches a maximum at a certain K, and then either levels off or decreases as K continues to increase. (2) The maximum possible usefulness depends on r′. The smaller r′ is, the less effective D is in general; the effectiveness can go up to ∼ 100% when r′ is large; and the larger r′ is, the larger the K needed to achieve the highest effectiveness. (3) As r′ gets larger, 1 − α increases; when r′ = 15, the re-identification rate is close to 100%. (4) 1 − α decreases as K increases, but the rate of the decrease slows down as r′ increases and is barely noticeable for large r′.

Fig 5 plots the difference between α and β presented in Fig 4. K = 1 is included as a reference baseline where α = β (technically speaking, effectiveness is not defined per Definition 5 for K = 1; when K = 1 and r = r′, the effectiveness reduces to the location re-identification rate and thus α = β). Scenarios below the reference line α = β imply no gain from releasing K ≥ 2 locations compared to releasing a single sanitized location. If 1 − β > 1 − α, i.e., α > β, then the effectiveness of set D overshadows the privacy risk from releasing the set (the green area in the plot); otherwise, releasing D is not worth the privacy risk as its effectiveness is not high. The results suggest that larger K and smaller r would be practically good choices for achieving effectiveness with relatively low re-identification rates.

Figure 6: COVID-19 hot spot maps created from a doppelganger set (10 original locations; K = 5; r = r′ = 10); panels: small bandwidth and large bandwidth.

Contact tracing is considered an effective approach for curbing the spread of COVID-19 during the pandemic. Contact tracing can be carried out manually by human contact tracers or digitally via GPS or Bluetooth devices. The data collected from contact tracing form contact tracing networks (CTNs), which can be regarded as a type of social network with persons as nodes and an edge between two people representing that they were within 6 feet of each other. CTNs are of research interest since they provide valuable information for better understanding how physical proximity affects the spread of the disease and human contact behaviors during the pandemic and how they evolve, among others. However, directly sharing CTN data, even for research purposes, is associated with privacy concerns. For example, adversaries may link a CTN with other databases or use background knowledge to infer who was infected with COVID-19 and tell who was physically close to whom (appearing in the same place at the same time) based on the edge information given in a CTN. We investigate two approaches for sharing CTN data in a privacy-preserving manner. The first approach, which we propose, is based on the GI notion and perturbs people's location data from which CTNs are constructed; the location data may be collected manually or digitally. The second approach is based on direct perturbation, via a DP mechanism, of the edge information in CTNs constructed from either proximity information or location data. Algorithm 1 presents the GI-based CTN sanitization steps.
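A minimal Python sketch of these steps is given below (the formal statement follows in Algorithm 1): perturb each person's location with the planar Laplace mechanism, then connect two people whenever their sanitized locations are within the contact cut point a. The location values are hypothetical, and the polar Laplace sampler repeats the one sketched earlier for the doppelganger.

```python
# A minimal sketch (assumptions: planar coordinates measured in feet) of the
# GI-based CTN sanitization summarized in Algorithm 1 below.
import numpy as np

rng = np.random.default_rng(2)

def polar_laplace(x, y, eps):
    theta = rng.uniform(0.0, 2.0 * np.pi)
    r = rng.gamma(shape=2.0, scale=1.0 / eps)
    return x + r * np.cos(theta), y + r * np.sin(theta)

def gi_sanitized_ctn(locations, eps, a=6.0):
    """Return the edge set of the sanitized contact tracing network."""
    sanitized = [polar_laplace(x, y, eps) for (x, y) in locations]
    edges = set()
    n = len(sanitized)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dij = np.hypot(sanitized[i][0] - sanitized[j][0],
                           sanitized[i][1] - sanitized[j][1])
            if dij <= a:            # within a (e.g., 6 feet) => contact edge
                edges.add((i, j))
    return edges

# Hypothetical locations (in feet) for five individuals.
locs = [(0, 0), (4, 3), (30, 2), (31, 5), (100, 80)]
print(gi_sanitized_ctn(locs, eps=1.0))
```

The sanitized edges are derived entirely from the perturbed coordinates, so no exact pairwise proximity ever needs to be released.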
Algorithm 1: GI-based sanitization of CTNs
Input: location coordinates (x_i, y_i) for individuals i = 1, ..., n; privacy budget ε; contact cut point a (e.g., a = 6 feet)
Output: CTN with sanitized edges
1. Perturb (x_i, y_i) via the planar Laplace mechanism of ε-GI in Eq (3) to yield (x*_i, y*_i) for i = 1, ..., n.
2. For i = 1, ..., n − 1 and j = i + 1, ..., n, compute the Euclidean distance d*_ij between (x*_i, y*_i) and (x*_j, y*_j); if d*_ij ≤ a, set e*_ij = 1 between nodes i and j, otherwise set e*_ij = 0.

For demonstration purposes, we assume that the nodes are independent of each other in the algorithm; thus, the overall privacy cost for releasing the sanitized CTN is the same as the per-node privacy loss per the parallel composition principle [28]. This assumption can be relaxed.

There exist several methods for differentially private synthesis of edge information in a network. We illustrate the direct perturbation of CTN edges with the RR mechanism [43]; other approaches for privacy-preserving release of graph edges can also be used. The RR mechanism satisfies edge DP, which ensures that the mechanism's output does not reveal more information on a particular relation on top of what the adversary already knows. Let p_ij denote the probability that the RR mechanism retains edge e_ij = 1 after sanitization and q_ij be the probability that the mechanism retains non-edge e_ij = 0. Karwa et al. [43] recommend p_ij = q_ij = e^{ε_ij}/(1 + e^{ε_ij}) for node pairs i ≠ j; that is, the probability of retaining the original relationship between any pair of nodes is the same regardless of whether there is an edge or not. In our experiments, we set ε_ij ≡ ε, similar to what is used in the experiments in [43]; that is, the probability of edge flipping is π_ij ≡ 1/(1 + e^ε), where the privacy cost ε is per pair of nodes. If all edges are mutually independent, the total cost for sanitizing the whole network is also ε per the parallel composition theorem [28] (the assumption made in [43]). When this assumption does not hold, the total privacy cost for sanitizing the network would be larger than ε, by how much depending on how much information is shared through the connection patterns in the network.

The GI-based sanitization yields CTNs that are more structurally similar to the original CTN than those from the RR approach at the same privacy cost (per node in the former and per edge in the latter). As ε increases, the RR-sanitized CTNs approach the original network. However, even for ε as large as 3, the sanitized CTN is still much denser than the original CTN in Fig 7. In addition to the visual display of the sanitized CTNs in Fig 8, we also calculate common structural statistics for networks. Specifically, we examine the following statistics: the # of edges, the # of triangles, betweenness centrality, closeness centrality, the degree distribution (DD), the edgewise shared partners distribution (ESPD), and the geodesic distance distribution (GDD), where the geodesic distance between two nodes equals the length of the shortest path joining them (∞ if there is no such path). Table 1 presents the # of edges and # of triangles of the sanitized CTNs. The results are consistent with the observations in Fig 8. For the GI-sanitized CTNs, the two statistics are similar to those of the original network, and there is not much difference in these statistics across different ε values. The high density of the networks sanitized via the RR mechanism is reflected by their much higher #'s of edges and triangles compared to the original statistics.
At ε = 5, the # of triangles is similar to the original, but the # of edges is still 50% higher. Similar to the results in Table 1, the GI-based approach is robust to ε and the statistics are stable across different ε values. By contrast, the three distributions of the CTNs generated via the RR approach deviate significantly from the original distributions even at ε as large as 5; only at such a large ε do the statistics start to approach the original, and that privacy cost would be too high to provide practically meaningful privacy protection.

In summary, we may draw the following conclusions based on the results in this subsection. The GI-based sanitization can produce privacy-preserving CTNs that are structurally similar to the original CTNs, per various statistical measures for networks. In addition, the utility of the sanitized CTNs is relatively insensitive to ε, at least for the examined range of ε ∈ [0.5, 2] and the examined CTN types, implying that a small ε can be used to provide strong privacy guarantees without sacrificing much utility. The sanitized CTNs can be shared with researchers who are interested in learning more about CTNs during the pandemic, without compromising individual privacy, at a pre-specified privacy cost. Comparatively, the RR mechanism examined in this experiment does not generate useful CTNs unless the privacy cost is large.

We present several examples of how to release and share data and information collected during the COVID-19 pandemic in a privacy-preserving manner. Each examined data type (case numbers, case location information/hot spot maps, and contact tracing networks) is commonly seen during the pandemic. We either apply existing differentially private mechanisms and approaches or devise new concepts and approaches based on formal privacy notions, such as DP and GI, to sanitize and release information. All the presented approaches aim to preserve the utility of the released information for inferring underlying population parameters or patterns. The approaches do not aim at learning individual-level information, which not only conflicts with the goal of privacy protection, but is also unnecessary for the purposes of mining and understanding population-level information. We hope our examples and methods shed light on privacy-preserving sharing of pandemic data and help promote and encourage more data sharing for research use.
[1] Google covid-19 community mobility reports: Anonymization process description (version 1.0)
[2] Impacts of social distancing policies on mobility and covid-19 case growth in the us
[3] Google covid-19 search trends symptoms dataset: Anonymization process description (version 1.0)
[4] Examining covid-19 forecasting using spatio-temporal graph neural networks
[5] Differential privacy practice on diagnosis of covid-19 radiology imaging using efficientnet
[6] Covid-19 imaging data privacy by federated learning design: A theoretical framework
[7] Differentially private health tokens for estimating covid-19 risk
[8] Unique in the crowd: The privacy bounds of human mobility
[9] Privacy in location-based social networks: Researching the interrelatedness of scripts and usage
[10] A geoprivacy manifesto
[11] Quantifying location privacy
[12] In coronavirus fight, china gives citizens a color code, with red flags
[13] What the us can learn from other countries using phones to track covid-19
[14] South korea to step-up online coronavirus tracking
[15] Covid-19 news tracker: location-based news about covid-19 in thailand
[16] Covid-19: Poland launches an official tracking app
[17] Safe path
[18] Dams: Meta-estimation of private sketch data structures for differentially private covid-19 contact tracing
[19] Use of smartphone data to manage covid-19 must respect eu data protection rules
[20] Privacy-preserving contact tracing
[21] Spatial k-anonymity: A privacy-preserving method for covid-19 related geospatial technologies
[22] Panda: Policy-aware location privacy for epidemic surveillance
[23] Calibrating noise to sensitivity in private data analysis
[24] Our data, ourselves: Privacy via distributed noise generation
[25] Privacy: Theory meets practice on the map
[26] Rényi differential privacy
[27] Gaussian differential privacy
[28] Mechanism design via differential privacy
[29] The algorithmic foundations of differential privacy
[30] Generalized gaussian mechanism for differential privacy
[31] Geo-indistinguishability: Differential privacy for location-based systems
[32] Randomized response: A survey technique for eliminating evasive answer bias
[33] Dpcube: Releasing differentially private data cubes for health information
[34] Differentially private histogram publication
[35] Towards accurate histogram publication under differential privacy
[36] Differentially private histogram publishing through fractal dimension for dynamic datasets
[37] A privacy preserving algorithm to release sparse high-dimensional histograms
[38] Boosting the accuracy of differentially private histograms through consistency
[39] Model-based differentially private data synthesis
[40] location history
[41] publish location information
[43] Sharing social network data: differentially private estimation of exponential-family random graph models