key: cord-0736347-wxa8gc21
authors: Wu, Xiaotong; Khosravi, Mohammad Reza; Qi, Lianyong; Ji, Genlin; Dou, Wanchun; Xu, Xiaolong
title: Locally private frequency estimation of physical symptoms for infectious disease analysis in Internet of Medical Things
date: 2020-08-27
journal: Comput Commun
DOI: 10.1016/j.comcom.2020.08.015
sha: 7b8e7f6f0615755d65bff0f3415f6e6431c5d6a0
doc_id: 736347
cord_uid: wxa8gc21

Frequency estimation of physical symptoms for peoples is the most direct way to analyze and predict infectious diseases. In Internet of medical Things (IoMT), it is efficient and convenient for users to report their physical symptoms to hospitals or disease prevention departments by various mobile devices. Unfortunately, it usually brings leakage risk of these symptoms since data receivers may be untrusted. As a strong metric for health privacy, local differential privacy (LDP) requires that users should perturb their symptoms to prevent the risk. However, the widely-used data structure called sketch for frequency estimation doesn’t satisfy the specified requirement. In this paper, we firstly define the problem of frequency estimation of physical symptoms under LDP. Then, we propose four different protocols, i.e., CMS-LDP, FCS-LDP, CS-LDP and FAS-LDP to solve the above problem. Next, we demonstrate that the designed protocols satisfy LDP and unbiased estimation. We also present two approaches to implement the key component (i.e., universal hash functions) of protocols. Finally, we conduct experiments to evaluate four protocols on two real-world datasets, representing two different distributions of physical symptoms. The results show that CMS-LDP and CS-LDP have relatively optimal utility for frequency estimation of physical symptoms in IoMT.

With the explosive development of Internet of Medical Things (IoMT), there have been various medical applications and services in a large number of mobile devices (e.g., smart phones and wearable devices) [40, 47, 45] . It is efficient and convenient for hospitals or disease prevention departments (i.e., the third party) to estimate the frequency of physical symptoms for mobile users by their devices so as to monitor and predict infectious diseases. For example, if disease prevention departments want to know potential spread range and speed of coronavirus disease 2019 (COVID-19) [37] , it is secure for them to remotely estimate the frequency of typical symptoms of peoples by smart phones, including fever, cough and shortness of breath. The novel estimation model reduces the labor costs and improves the detection efficiency. Although human beings benefit from the estimation model in IoMT, users inevitably face leakage risks of disease information, especially when the third party are untrusted [9, 17] . Once disease information is leaked, it may cause a heavier psychological burden for the society and individuals.

In fact, health privacy has been one of the biggest concerns of not only individuals but also the whole society (e.g., European GDPR law) [33, 24, 25, 39, 49, 38] . Therefore, it is necessary to protect individuals' health privacy when collecting and analyzing their symptoms. Differential privacy (DP) [14, 48] is a strong privacy metric in a central setting. It assumes that there is a trusted third party to collect and perturb symptoms from users and then share them with others. However, DP ignores that it is possible for the third party to threaten users' health privacy, especially in IoMT. To this end, local differential privacy (LDP) [12] is proposed in the local setting. It requires that each mobile user should locally perturb his/her symptoms before sending to the third party and make them indistinguishable.

In recent years, there have been a series of works to design various protocols for frequency estimation of physical symptoms under LDP in both academia and industry [7, 13, 30, 43] . In the existing privacy protocols, physical symptom is defined as categorical data. For categorical data, it consists of three important components, i.e., encoding, perturbation and aggregation [1, 2, 3, 26, 32] . The first two components are operated by the user while the last one is executed by the third party. Most of perturbation mechanisms to guarantee privacy are based on the Randomized Response (RR) technique, in which reporting a real answer for a Boolean query is derived from a certain probability [34] . Wang et al. [32] surveyed the previous protocols and classified encoding methods into four types, including direct [31] , histogram [3] , unary [16, 26] and local hashing [1, 2, 32] encoding (i.e., DE, HE, UE and LHE). Protocols based on UE need higher communication cost, while those based on DE and HE have relatively lower accuracy. As a result, protocols based on LHE are a promising solution to balance computa-tion complexity and data utility.

However, the previous protocols don't make full use of the widely applied data structures for frequency estimation of physical symptoms. Sketch is one of the most fundamental and most efficient data structures to estimate frequency for physical symptoms. In general, it uses a small vector to index the position of a symptom by a certain number of hash functions, which saves space but faces possible conflict. There are some basic and classic sketches, such as Count Sketch (CS) [4] , Count-Min Sketch (CMS) [8] , Fast-AGMS Sketch (FAS) [6] and Fast-Count Sketch (FCS) [28] . The focus of the paper is on how to design the sketch-based protocols for frequency estimation of physical symptoms satisfying local differential privacy. If so, it is convenient for others to directly apply the proposed protocols to estimate the frequency of physical symptoms.

There are two main challenges to solve the above problem. The first one is that the existing private protocols can't be directly used to perturb query result by sketches. The reasons are that (i) perturbation and aggregation of symptoms are executed by the user and the third party, respectively. It implies that only the RR technique can be leveraged; and (ii) hash functions used in the sketches need to stay the same implying that the existing encoding methods (e.g., DE, HE and UE) are not applicable. The second one is that the designed private sketches should satisfy two important properties, including unbiased estimation and local differential privacy. In other words, it needs to balance privacy guarantee and utility of perturbed data.

To address the above challenges, we make full use of LHE and RR. In brief, we leverage the sketch vector to implement encoding of physical symptoms and then perturb it by RR to achieve privacy guarantee. At first, we propose two CMS-and FCS-based protocols, namely CMS-LDP and FCS-LDP to guarantee privacy. Then, we propose two CSand FAS-based protocols, namely CS-LDP and FAS-LDP to ensure privacy. In particular, the formers utilize perturbation mechanism of optimal local hashing (OLH) in [32] , while the latter directly use RR to perturb each entry of the sketch vector. Finally, we do other operations to make sure that the protocols satisfy unbiased estimation. It is noted that our protocols only increase computation cost rather than space overhead. This is consistent with the initial purpose, i.e., space save. To the best of our knowledge, there are few works to utilize the sketch under LDP to estimate the frequency of symptoms. The contributions of this paper can be summarized as follows:

• We formulate the problem of designing sketches under LDP by mathematical definitions and strong privacy metric to estimate the frequency of physical symptoms in IoMT.

• We propose four different protocols for frequency estimation of symptoms with privacy protection, including CMS-LDP, FCS-LDP, CS-LDP and FAS-LDP. We also demonstrate that all of the protocols satisfy two important properties, including unbiased estimation and LDP. We also analyze the variance of frequency estimation in four protocols.

• We present two approaches to implement universal hash functions. We also evaluate and compare the proposed protocols on two real-world datasets, representing two different distributions of physical symptoms.

The rest of this paper is organized as follows. Section 2 surveys related work about privacy protection in IoMT. Section 3 introduces the preliminaries and gives problem definitions. Section 4 proposes four different protocols for sketches under LDP to estimate the frequency of symptoms. Section 5 analyzes unbiased estimation, privacy guarantee and variance. Section 6 presents two methods to implement universal hash functions. Section 7 evaluates proposed protocols on real datasets. Finally, Section 8 concludes this paper.

In IoMT, there are vast amounts of users' healthy information generated by various mobile devices every day. However, the information is left from mobile users and may be faced with leakage risk from untrusted data collectors or data adversaries [17] . In order to guarantee privacy, there have been a number of protection techniques, including identification, anonymity and differential privacy [33, 48, 9] . In detail, identification technique utilizes authentication mechanism for each mobile user, who must login the system by his/her identity. On the other hand, data collectors use perturbation techniques (e.g., -anonymity [27] , -diversity [21] and DP [14] ) to perturb raw data from mobile users. These techniques provide a certain degree of privacy protection in some specified scenes.

However, these techniques are not fully suited to applications in IoMT due to their respective shortcomings. Identification and anonymity can't offer enough privacy protection, while DP usually assumes that the third party is trusted. That is, the third party doesn't illegally leak users' health information. Unfortunately, it doesn't hold in some real environments of IoMT. The reason is that users can't know the receivers of their healthy information in the local setting so that they don't trust them. In order to overcome the disadvantages, local differential privacy [12] has been proposed to protect private information of each user against the threat from the untrusted third party. Both academical researchers and industrial engineers focus on how to achieve local differential privacy from theory and application.

Different from DP, LDP has higher privacy requirement in a local setting, which requires probability limit between any two values, instead of any two datasets in DP [12] . To this end, the existing mechanisms of DP can't be fully applied to LDP, including the Laplace mechanism [15] and the exponential mechanism [23] . In order to implement LDP, there have been a number of works to design proper mechanisms [7, 13, 30, 43] . Most of them leveraged the Randomized Response (RR) mechanism [34] or its refinements. RR reports a random answer derived from a certain amount of probability for a Boolean question. For categorical data, Kairouz et al. [18, 19] proposed an extremal privatization mechanism ( -RR), which is universally optimal in the low and high privacy regimes. For ordinal data, the value is firstly encoded to a binary one and then perturb it by RR [11] [13] [44] . In general, the proposed algorithms need to satisfy both unbiased estimation and LDP. In addition, these perturbation privacy mechanisms are widely applied to various scenes, including heavy hitter identification, itemset mining, marginal release, data mining, medical analysis and infectious disease prediction [20] .

In the industry, there have been a lot of applications to satisfy local differential privacy. RAPPOR is the first product from Google to collect users' data and ensure privacy [16] . The user firstly encodes his/her data by a special data structure called Bloom filter and then perturbs the encoded output by the randomized response mechanism. The third party collects and decodes perturbed data from the users to get statistical information. In 2017, Apple [10] published a white paper to propose efficient and scalable algorithms satisfying local differential privacy. Microsoft [11] also presented new algorithms with local differential privacy to estimate mean and histogram for telemetry data.

Frequency estimation under LDP for physical symptoms is an efficient and convenient way for infectious disease prediction. There have been a few of works to design effective mechanisms for different types of health information, including categorical (e.g., sex and symptoms), ordinal (e.g., age) [18, 19, 20, 5] . The main objective of the existing mechanisms is to maximize the utility and reduce communication cost and computation complexity.

Bassily et al. [2] proposed efficient mechanisms by utilizing Random Matrix Projection. Running time of the third party and the user is ( 5∕2 ) and ( 3∕2 ), respectively. Meanwhile, the estimation error is ( √ log( )∕( 2 )) with high probability where is the number of users and is the size of domain. They also proposed improved algorithms, in which server time is ( ) and user time is (1) [1] . Bun et al. [3] presented an algorithm called PrivateExpanderSketch, which achieved optimal worst-case error as a function of all possible parameters. Wang et al. [32] introduced a framework to analyze all the previous mechanisms for frequency estimation under LDP. They also proposed an Optimal Local Hashing (OLH) to estimate the frequency with better utility. Different from the previous environment with only an attribute, Qin et al. [26] estimated frequency over set-valued data under LDP. However, to the best of our knowledge, there are few works to combine sketch under LDP to estimate the frequency of physical symptoms in IoMT. Therefore, the focus of this paper is to solve the problem. 

As aforementioned, the focus of the paper is on frequency estimation of physical symptoms in IoMT. There are two key roles, i.e., an untrusted third party and a large number of mobile users. The third party collects perturbed health information from the users. In the following section, we formulate the frequency estimation problem in the context of local differential privacy.

Suppose that there are a set of users denoted by  = { 1 , 2 , ⋯ , }. Each user has a symptom . Then,  = { 1 , 2 , ⋯ , } is defined as a set of symptoms from users. The domain of elements in  is  with size . An untrusted third party plans to collect statistical information of users' symptoms. Assume that the untrusted third party has known the domain . Each element in  is generalized to an integer Fig. 1 , the domain  consists of types of symptoms of COVID-19, such as fever, cough and shortness of breath. Each symptom is labelled as an integer (e.g., fever is labelled by 1). Therefore, in the following section, we have ∈  = [ ].

Here, we focus on the frequency estimation of physical symptoms as follows:

A query ∶  → for frequency estimation of some symptom is to compute the frequency of users whose symptom is the same with ∈ . Formally, for any ∈ ,

Obviously, the range of ( ) is [0, ].

However, in infectious disease prediction, it is not necessary and impossible to count accurate frequency estimation for physical symptoms. For example, if some disease prevention department wants to estimate the speed or range of COVID-19 by the number of patients with specified symptoms (e.g., fever, cough or shortness of breath), there is no need for the department to get the accurate count for each symptom. What's more, it is not realistic to spend a large amount of labor cost in dealing with it. Thus, an approximate frequency estimation is proposed as follows: if and only if for any symptom ∈ ,̂ ( ) satisfies the following requirement:

where ∈ [0, 1] represents the expected error and ∈ [0, 1] is the confidence probability.

Unfortunately, the third party may be untrusted because that he/she possibly utilizes health information of users to get illegal profit. Therefore, the users have to perturb their information so as to avoid privacy leakage. Differential privacy [14, 48] is the most common and strong privacy protection metric, which is usually suited for the central setting. Different from DP, local differential privacy is proposed in the local setting, in which it assumes that the third party is untrusted. This means that before the users send their symptoms, perturbation of symptoms should satisfy the following privacy measurement:

where () denotes the set of all possible outputs of the algorithm  and is privacy budget.

represents the strength of privacy protection. The lower value of corresponds to the stronger privacy protection. With the same as DP, LDP has also the composition properties as follows. [22] ) Assume that each privacy algorithm  satisfies -local differential privacy. A group of  (⋅) applied to the same dataset satisfy ( ∑ )local differential privacy. [22] ) Assume that each privacy algorithm  satisfies -local differential privacy. A group of  (⋅) applied to a set of disjoint datasets satisfy max{ }-local differential privacy. In the algorithms designed by the paper, we will make full use of the composition properties of local differential privacy.

In order to solve the frequency estimation problem under LDP for symptoms, the proposed mechanism should satisfy the privacy measurement as follows:

for frequency estimation of symptoms satisfies -local differential privacy if and only if  = ( 1 (⋅),  2 (⋅), ⋯ ,  (⋅)), in which each  ∶  → is a local random processor that satisfies -local differential privacy, and  ∶ → is some postprocessing function.

Based on the above definition, the key functions are the local random processor  (⋅) and the post-processing function (⋅). Each user executes the local random processor, while the third party executes the post-processing function. Furthermore, Wang et al. [32] surveyed the previous work and concluded that a protocol for frequency estimation under LDP generally consists of the following steps:

• Encoding. This operation (⋅) is executed by the user. In detail, input of the user is generally encoded as a histogram, a vector [16] , a single bit [2] , or an integer [32] by encoding and decoding algorithms or hash functions.

• Perturbing. The mechanism (⋅) is also executed by the user. Meanwhile, the perturbed mechanism should satisfy -local differential privacy. The input of (⋅) is the output of (⋅). Therefore,  (⋅) in Def. 4 is ( (⋅)). • Aggregating. The operation (⋅) is executed by the third party. At first, the third party receives data from users, which is perturbed with the combination of (⋅) and (⋅). Then, he/she aggregates the data and extracts approximate frequency estimation. Fig.2 shows the concrete operations and interactions between mobile users and the third party. It is obvious to see that users need to encode and perturb their symptoms locally, while the third party aggregates perturbed data. As a result, these operations increase extra overhead for users, including computation, communication and storage. To this end, it is necessary to make full use of the existing data structure named sketch to decrease the overhead.

In order to solve the approximate frequency estimation of categorical data (e.g., physical symptoms), Charikar et al. [4] firstly proposed an algorithm by using a special data structure, i.e., Count Sketch, which achieved good space bounds. After that, the refined versions were proposed, such as Fast-AGMS Sketch [6] , Count-Min Sketch [8] and Fast-Count Sketch [28] . However, to the best of our knowledge, there are few works to design the new data structure Sketch under LDP to solve the approximate frequency estimation of physical symptoms. In the following sections, we propose improved protocols based on Count Sketch, Fast-AGMS Sketch,

, error parameter , probability confidence , privacy parameter Output: a set of approximate frequency estimation {̂ ( )} 1: 

Input: symptom , hash functions  = {ℎ 1 , ℎ 2 , ⋯ , ℎ }, privacy parameter , the number of hash functions , the bit number

perturb as follows:

[ ][ ] = 1 7: end for 8: return  Count-Min Sketch and Fast-Count Sketch, which satisfy unbiased estimation, i.e., ∀ ∈ , [̂ ( )] = ( ) and local differential privacy. In addition, we also compare the proposed protocols by experiments.

In this section, we propose four novel protocols to solve the approximate frequency estimation of physical symptoms under LDP. These protocols are based on CS, FAS, CMS and FCS. We will analyze unbiased estimation and privacy protection of proposed protocols in the next section.

Under no privacy preservation, the third party receives physical symptoms and uses a small vector to count the frequency of each element in , while users do nothing. However, in order to protect privacy, the user should send perturbed symptoms rather than raw values. Therefore, the user 

needs to generate a small vector locally and then send it to the third party. Finally, the third party aggregates these vectors and derives the full sketch.

CMS is a simple and effective data structure, which was proposed by Cormode et al. [8] . For a frequency estimation querŷ ( ), its values by CMS is ( ), ( ) + ‖ − ‖ 1 with a probability 1 − , in which − is the real frequency except . Here, we propose corresponding protocols based on optimal local hashing in [32] to implement CMS under local differential privacy.

Algorithm 1 shows a protocol CMS-LDP for privacy estimation of symptoms under LDP. The protocol consists of four key components. The first one is an initial process, in which and are computed by the input parameters and (line 1 ∼ 3). represents the domain of universal hash functions, while is the number of hash functions. Then, each user calls  client-CMS-LDP to report the perturbed record to the server-side (line 4 ∼ 6). Next, Line 7 implements the key component, namely Sketch-CMS-LDP. The last one implements the query of server-side (line 8 ∼ 10). By the above four parts, the algorithm solves the approximate frequency estimation problem of physical symptoms.

Each user executes Algorithm 2 to implement the function of Perturb(Encode(⋅)) in Section 3.2. Algorithm 2 utilizes an existing technique, namely -Randomized Response ( -RR), to ensure -local differential privacy, which was designed by Kairouz et al. [19] . In detail, Algorithm 2 firstly implements (⋅) by universal hash function ℎ to map the input into a -length bit array in line 4. Then, a -RR technique is used to perturb the output of (⋅) in line 5. Finally, the user returns the result to the server-side.

After the third party receives perturbed symptoms from users, he/she computes a data structure, namely Sketch-CMS-LDP by Algorithm 3. Compared to CMS, our proposed mechanism  Sketch-CMS-LDP has extra operations, i.e., Line 7 ∼ 9. Sketch-CMS-LDP firstly generates an initial sketch, which is a multidimensional array in line 1. Then, all arrays from users are added to the sketch in line 2 ∼ 6, which is the same with CMS. Finally, Line 7 ∼ 9 is to ensure unbiased estimation of CMS-LDP, which will be proved in Section 5.1. Algorithm 4 gets the approximate estimation for each frequency querŷ , which is executed by the third party. The combination of Algorithm 3 and 4 implement the function of (⋅) in Section 3.2. For each querŷ ( ), the algorithm chooses the minimum of values, each of which is from an array [ ], ∈ [ ]. The position of each value in the array is decided by hash function ℎ and . By Algorithm 1∼4, we solve the problem of frequency estimation under local differential privacy.

For each user in the client-side, the time complexity is (log 2 (1∕ )) and the space complexity in ( (⋅)) is ( 1 log 2 (1∕ )). For  server-CMS-LDP in the server-side, the time complexity is ( 1 log 2 (1∕ )) and the space complexity is ( 1 log 2 (1∕ )). Besides, the communication overhead between each user and the server is ( 1 log 2 (1∕ )). Thus, for the third party, our proposed protocol doesn't increase the space overhead. Section 5.1 will analyze the proposed algorithms in terms of the error bound, unbiased estimation and local differential privacy.

Thorup et al. [28] proposed a refined version of Count-Min Sketch, namely Fast-Count Sketch (FCS). Compared with CMS, FCS uses a family of 4-universal hash functions, instead of 2-universal hash function.

The protocol to implement Fast-Count Sketch under local differential privacy, namely Algorithm FCS-LDP, is similar with Algorithm 1. 

Among all of various formats of the special data structure called sketch, Count Sketch was firstly proposed in 2002 [4] . The data structure is generally used to point query, range query, inner product query and so on. Compared with Count-Min Sketch, CS needs higher space cost, i.e., (log 2 ( )∕ 2 ). Meanwhile, CS also satisfies unbiased estimation and the value for a frequency estimation querŷ ( ) about is ( )− ‖ − ‖ 2 , ( )+ ‖ − ‖ 2 with a confidence probability 1− . In the following, we will introduce algorithms to implement CS with LDP.

In Algorithm 5, we describe the overview of protocol CS-LDP to implement CS under LDP. It consists of four key parts, i.e., initialization, perturbation in client-side, construction of sketch and frequency estimation in server-side.

In the initial process (line 1 ∼ 4), it requires hash function pairs ⟨ℎ , ⟩. Hash function ℎ maps some symptom into ⌈ ∕ 2 ⌉-length bits in which c is some constant, while function maps some input value into {−1, +1}. Both two classes of hash functions are independent with each other. Compared with ⌈1∕ ⌉-length bits of hash function ℎ in Algorithm 1, CS-LDP needs more space cost. Then, each user executes  client-CS-LDP to perturb his/her data (line 5 ∼ 7), which implements the function of ( (⋅)). Next, the third party constructs a special data structure, namely Sketch-CS-LDP, to collect data from users by  Sketch-CS-LDP (line 8). Finally, for each frequency estimation̂ ( ), the third party calls  server-CS-LDP to compute the result (line 9 ∼ 11).

Before each user sends his/her symptom to the third party, Algorithm 6 is executed.  client-CS-LDP firstly initializes a multidimensional integer array , i.e., {−1} × (line 1). The array only consists of two elements, i.e., −1 and +1. Then, because of the property of sequential composition of local differential privacy, it computes smaller privacy parameter ′ for each hash function (line 2). Next, the specified position ℎ ( ) of each row in  is set as ( ) (line 3 ∼ 5), which implements the function of (⋅). For each entry in array , it uses Randomized Response (RR) [34] to ensure ′ -local differential privacy (line 6 ∼ 10), which implements the function of (⋅). At last, the result is returned to the third party.

After the third party receives perturbed symptoms from users, he/she uses Algorithm 7 to get the data structure Sketch-CS-LDP. Firstly,  Sketch-CS-LDP initializes a × -integer array, i.e., {0} × (line 1). Then, line 2 is to compute a parameter , which is used to ensure unbiased estimation due to randomized response. Since data from perturbed mechanism  client-CS-LDP is biased, it needs to be calibrated. Therefore, Algorithm 7 adjusts the data from each user (line 3 ∼ 5) to get the estimated frequency of each entry in each array  . Finally, all adjusted arrays are added to sketch (line 6 ∼ 9). In Section 5.2, we will prove unbiased estimation and local differential privacy of our proposed protocol CS-LDP.

Algorithm 8 is used to answer the frequency estimation ( ) for the third party. The combination of Algorithm 7 and 8 implements the function of (⋅). We use function (⋅) to choose a proper value from the candidates as the final estimation for , rather than function (⋅). Charikar et al. [4] has demonstrated that function for ∈ [ ] do 8: sample some value as follows: (⋅) is robust. By Algorithm 5∼8, we solve the problem of frequency estimation with local differential privacy.

Compared with CMS-LDP, CS-LDP needs the higher time and space cost, and has the shaper bound for the frequency estimation. For each user in  client-CS-LDP , the time cost is ( 1 2 log 2 (1∕ )) due to sampling each entry of the matrix by Randomized Response, while the space cost is respectively. The cost of time and space is (log 2 (1∕ )) and ( 1 2 log 2 (1∕ )) in  server-CS-LDP , respectively. Therefore, the total time cost for the third party is ( 1 2 log 2 (1∕ )), while the total space cost is ( 1 2 log 2 (1∕ )).

Fast-AGMS Sketch (FAS) is a refined version of Count Sketch [6] , which guarantees logarithmic-time sketch update and tracking costs. The only difference between Count Sketch and Fast-AGMS Sketch is that the latter replaces 2universal hash functions  of the former with 4-universal hash functions.

Protocol FAS-LDP to achieve Fast-AGMS Sketch under local differential privacy is similar with CS-LDP. The only change is that a set of hash functions, namely , are chosen independently from a 4-universal family of hashing functions mapping  to {−1, +1}. Besides, other mechanisms, i.e.,  client-FAS-LDP ,  Sketch-FAS-LDP and  server-FAS-LDP in client-side and server-side are the same with  client-CS-LDP ,  Sketch-CS-LDP and  server-CS-LDP , respectively. As a result, these algorithms have the same time and space cost with Algorithm 5∼8 for each user and the third party.

The protocols that implement the approximate frequency estimation of physical symptoms under LDP must satisfy two fundamental requirements, i.e., unbiased estimation and local differential privacy. In the section, we will demonstrate that our proposed protocols satisfy two requirements. Be- 

For the proposed protocols, the most important property is to satisfy local differential privacy. We prove that protocol CMS-LDP satisfies the requirement as follows: Therefore, for each hash function, perturbation satisfies ′local differential privacy. Because that there are hash functions for the same symptom , the combination of hash functions satisfy -local differential privacy for each user according to Lemma 1. On the other hand, the third party receives perturbed symptoms from all users. Since records between different users are disjoint, it satisfies -differential privacy for all perturbed symptoms according to Lemma 2. PROOF. For each hash function ℎ , the third party receives reported symptoms from all users. Let any possible output and  the set of symptoms that are hashed to by (⋅). Then, the probability that any symptom in  is perturbed to is . That is, for any ∈ , we have For another input value ∉ , it is still possibly mapped to . Let ′ the probability that is mapped to . In line 10 of mechanism  Sketch-LMS-LDP , the third party makes the extra operations for the raw records. Thus, the probability ′ is computed by as follows:

In mechanism  Sketch-LMS-LDP , we get the final sketch by line 11. The expectation that the number of elements in  which are mapped to is

where is the number of elements in . Therefore, for each 2-universal hash functions, protocol CMS-LDP satisfies unbiased estimation. The variance is

By the above proof, CMS-LDP satisfies unbiased estimation and LDP. Since protocol FCS-LDP is similar with CMS-LDP except for hash functions, FCS-LDP also satisfies unbiased estimation and local differential privacy. Besides, it has the same variance. The corresponding proof is omitted.

Here, we firstly prove that protocol CS-LDP satisfies local differential privacy as follows:

Theorem 5. Protocol CS-LDP satisfies -local differential privacy.

PROOF. Assume that there are two users and . Both of them have different input values and and the same output  ∈ {−1, +1} . For each pair (ℎ , ) of hash functions, the third party receives two perturbed arrays and from these two users. Because that all of hash functions ℎ and are independent with each other, we have

The reason for ▿ ⟹▵ is presented as follows. In detail,

Each entry in is derived from {−1⋅ , +1⋅ }, in which ∈ {−1, +1} is sampled from Eq. 5 and only an entry is +1 with position decided by hash value ℎ ( ). is similar with . In Eq. ▿, only two entries in ℎ ( ) are different from two entries in ℎ ( ) , while the others have the same probability. Therefore, Eq. ▿ is derived to Eq. ▵.

For Eq. ▵, there are two different cases, i.e., ℎ ( ) = ℎ ( ) and ℎ ( ) ≠ ℎ ( ). For the first case in which ℎ ( ) = ℎ ( ), we have

For the second case in which ℎ ( ) ≠ ℎ ( ), we have

Assume that ( ) = +1 and ( ) = +1. Then, ℎ ( ) = +1 ⋅ 1 , ℎ ( ) = −1 ⋅ 2 , ℎ ( ) = −1 ⋅ 3 and ℎ ( ) = +1 ⋅ 4 . If  ℎ ( ) = +1, then 1 = +1, 2 = −1, 3 = −1 and 4 = +1. Therefore, according to Eq. 5, we have

in which ′ = ∕2 . Therefore, for each pair (ℎ , ) of hash functions in the above cases, it satisfies ∕ -local differential privacy. Since there are hash functions to perturb the same symptom,  Client-CS-LDP of each user satisfies -local differential privacy according to Lemma 1. Since records between different users are disjoint, it satisfies -differential privacy for all perturbed records according to Lemma 2. ■ For the variance of̂ ( * ), Var[̂ ( * )] is derived as follows: Therefore, we complete the proof. ■ Since protocol FAS-LDP is similar with CS-LDP except for hash functions, FAS-LDP also satisfies unbiased estimation and local differential privacy. Besides, it has the same variance. The corresponding proof is omitted.

In the section, we introduce the implementation ofuniversal hash functions. In detail, we propose two different methods to implement universal hash functions. That is, CW-Trick hashing is used to the case in which = 2 while Tabulation based hashing is used to = 4.

In the protocols proposed by Section 4, the most important component is -universal hash functions, which are defined as follows:

where is an positive integer. According to Definition 5, a large number of values can be hashed to a relatively small set of keys. Meanwhile, there may be collisions, in which two different values are hashed to a key. Parameter is the factor that influences the collision. For example, as parameter increases from 2 to 4, the probability of collision decreases from 1∕ 2 to 1∕ 4 . Here, we focus on how to design the proper hash functions that satisfy Definition 5. In the following section, we present two different implementations for -universal hash functions, i.e., CW-Trick and Tabulation based hash functions.

The first method for -universal hash functions is implemented by the following equation:

where is a prime that is greater than value and is picked randomly from [ ]. This equation was proposed by Wegman et al. [35] . It is obvious to see that this method is very simple and easy to implement. However, it has a disadvantage, especially when is very large. That is, if is an arbitrary prime, this method is fairly slow because the 'mod ' is relatively slow. In order to solve the problem, Thorup et al. [28] proposed a simple method, namely CW-trick, in which is a so-called Mersenne prime of the form 2 − 1. In our experiments, we will use = 2 61 − 1. The CW-Trick hashing is used to = 2.

If CW-Trick hashing is used to = 4, the time is slow. Thorup et al. [28] proposed another method, namely Tabulation based hashing, to implement -universal hash functions. They compared their method with CW-trick in terms of running time, which shows that the former is faster than the latter by at least a factor of 5. However, the weakness of Tabulation based hashing is that it requires large pre-computed tables and thus needs the extra space to store.

We firstly introduce Theorem 2.3 in [28] to compute hash value of by 4-universal hash functions.

There are any characters ⃗ = ( 0 1 ⋯ −1 ), ∈ [2 ] .

is a × generator matrix, which satisfies that any square sub-matrix has full rank over prime field , where ≥ max{2 , + } is an odd prime. ⃗ = ⃗ are additional characters ( 0 1 ⋯ ), = − 1. Then, hash function of ⃗ is

ℎ(⋅) is a 4-universal hashing function if ℎ andh are independent 4-universal hash functions into [2 ] .

In our experiments, assume that value of each user has 32 bits and each character has 16 bits 1 . Therefore, = 2 and = 1. In detail, value can be divided into two sub-values with 16-bits, i.e., 1 and 2 as the following equation: Besides, 4-universal hash functions ℎ andh are implemented by tabulation based hashing. That is, we use precomputed tables to replace hash functions ℎ andh . That is, Eq. 8 is changed as

where is a fully random table of size 2 with values from . Tabulation based hashing thus takes time ( + ) and space (( + )2 ).

In the section, our main goal is to study (1) the impact of different parameters on data utility; and (2) the application of protocols for frequency estimation of physical symptoms. The experiments are performed over two real datasets, representing two different distributions of symptoms. 

The evaluation is performed on a desktop computer, which has 4G of RAM and Inter(R) Pentium(R) CPU P6200 running Windows 7 operating system. All of protocols are implemented by Python. All of testing datasets are real.

Parameter settings. The number of hash function ⟨ℎ , ⟩ is 100 and we will choose hash function pairs from them. The implementation of hash functions is based on CW-Trick and Tabulation-based hashing in Section 6. The values of and depend on error and confidence probability . We mainly evaluate four protocols on different values of privacy budget , error and confidence probability .

We use two real datasets to evaluate our protocols, including WISDM and BMBD. We select an attribute to estimate its frequency in these two datasets, which represent two different distributions of physical symptoms. The datasets are presented as follows:

• WISDM [36] . The dataset is collected from the accelerometer and gyroscope sensors of a smart-phone and smart-watch as 51 subjects. It is about diverse activities of daily living and has 15, 630, 426 records. We select attribute ACTIVITY from seven attributes to • BMBD [29] . The dataset is collected from a public website Bixi 2 , which offers bike rental service. It contains bike share information for Bixi Monetreal from Table 1 shows the statistical information of attribute ACTIV-ITY in WISDM and START_STATION_CODE in BMBD. The distributions of these two attributes are different. ACTIV-ITY is uniformly distributed, while START_STATION_CODE is loose. Therefore, we use them to evaluate our protocols for different distributions of symptoms.

Evaluation Metrics. We evaluate the top-frequent elements  = { 1 , 2 , ⋯ , } in by RE and MSE, which have the most frequencies [46, 41, 42] . The concrete metrics are presented as follows:

• Relative Error (RE). The metric is usually used for SUM/COUNT/AVG queries and measures how large the error is relatively to the true answer for each querŷ ( ) by the following equation:

Furthermore, for the top-elements, we evaluate them by the average and median value of relative errors, i.e., ARE and MRE.

• Mean Square Error (MSE). It is used to measure the estimation accuracy by the average of the squared errors, namely,

where ( ) is the true frequency of users taking value .

In the following section, we show the impact of three key parameter , and in terms of RE and MSE for different datasets. Then, we show the effectiveness of protocols by estimating top-in different cases. In order to reduce the error, we repeat each experiment 10 times. Since frequency of different records in BMBD is different, we will consider two cases, including top-10 and top-30. That is, we estimate 10 and 30 elements with the most frequency, respectively.

In order to analyze impact of parameter , we firstly set the parameters. That is, confidence probability is 0.1 and Fig. 3 (a) and 3(c). This is consistent to CMS and FCS without LDP. This implies that addition of privacy preservation doesn't change influence of parameter . However, Fig.  3 (b) and 3(d) show that has little impact on CS-LDP and FAS-LDP, since RE and MSE don't change as increases from 0.15 to 0.25. Fig. 4 (a) -4(d) and Fig. 5 (a) -5(d) show RE and MSE of BMBD in two different cases by four private protocols. Compared to WISDM, it has the opposite results for protocol CMS-LDP and FCS-LDP. That is, with increasing from 0.004 to 0.00571, RE and MSE decrease. The key factor for this opposition is size of mapped by hash function , in which the size in BMBD is far greater than that in WISDM. Thus, perturbation for privacy preservation is the dominant factor for BMBD, while conflict probability is relatively less. As size of decreases, it leads to less expected error. Fig. 4 (b) and 4(d) and Fig. 5 (b) and 5(d) show that RE and MSE of BMBD by CS-LDP and FAS-LDP have no much difference with increasing. That is, has little impact on CS-LDP and FAS-LDP.

Therefore, the impact of on CMS-LDP and FCS-LDP depends on the size of . In contrast, it has little impact on CS-LDP and FAS-LDP.

Here, we consider the impact of parameter for four LDP protocols. The expected error is 0.18 for WISDM, while that is 0.005 for BMBD. Privacy budget is 3. The range of is [0.005, 0.02, 0.1, 0.25].

For WISDM, Fig. 3 (e) -3(h) show RE and MSE of four protocols. It is intuitive to find that RE and MSE increase with increasing. This implies that adding the number of hash functions can efficiently improve the accuracy for perturbed data. Meanwhile, it needs more time for the users and the third party. From Fig. 3 Thus, the impact of on CS-LDP and FAS-LDP is consistent, while that on CMS-LDP and FCS-LDP is changing in different datasets.

In order to analyze the impact of , for WISDM is 0.18, while that is 0.005 for BMBD. Confidence probability is 0.1. The range of privacy budget is [1, 3, 5, 7] .

For WISDM, Fig. 3 (i) -3(l) show RE and MSE of four protocols. Fig. 3 (i) and 3(k) show that RE and MSE of CMS-LDP and FCS-LDP don't decrease with increasing. In contrast, they have big fluctuation. Compared to those for BMBD in Fig. 4 (i), 4(k), 5(i) and 5(k), the possible reason is the size of . For CS-LDP and FAS-LDP, the accuracy increases as privacy budget increases. However, it doesn't change once privacy budget exceeds some threshold (e.g., 5 in Fig. 3 (b) and 3(d)).

For BMBD, Fig. 4 (i) -4(l) and Fig. 5 (i) -5(l) show RE and MSE of four protocols. We can find that top-10 and top-30 have the same results about RE and MSE. For four protocols, both of RE and MSE decrease as privacy budget increases. In other words, the increase of RE and MSE in CMS-LDP and FCS-LDP is fast, while that in FCS-LDP and FAS-LDP is not obvious. Once privacy budget exceeds some threshold (e.g., 3 in Fig. 4(j) ), RE and MSE have little change for CS-LDP and FAS-LDP. The threshold is 5 for

According to the above results, we can derive that privacy budget has the important impact on four protocols. Meanwhile, once exceeds some threshold, the impact can be ignored.

By our experiments, we analyze two different distributions of symptoms for some infectious disease in IoMT. That is, the symptoms are uniformly distributed in case (i), while those are loose in case (ii). Generally speaking, we have two important findings in Fig. 3, 4 and 5. The first one is that the utility of CMS-LDP and FCS-LDP is greater than that of CS-LDP and FAS-LDP. Since the time of CMS-LDP and FCS-LDP is smaller than that of CMS-LDP and FAS-LDP, Therefore, CMS-LDP and FCS-LDP are relatively optimal in terms of utility and computation overhead. The second one is that protocol CMS-LDP is not absolutely better than FCS-LDP, and vice versa. Furthermore, different distributions of symptoms have an important influence for the choices of different parameters, including , and . In case (i), FCS-LDP has better stability than CMS-LDP as different parameters change. In case (ii), CMS-LDP and FCS-LDP have the similar changes. Parameter , and should be as small as possible to maximize the utility.

The focus of the paper is on how to estimate the frequency of symptoms under local differential privacy for infectious disease analysis in IoMT. Based on basic sketches, this paper has proposed four protocols, including CMS-LDP, FCS-LDP, CS-LDP and FAS-LDP. We have proved that our proposed protocols satisfy two important properties, i.e., unbiased estimation and local differential privacy. Meanwhile, we have presented the variance of four protocols. Through empirical analysis, CMS-LDP and CS-LDP are relatively optimal protocols for frequency estimation of physical symptoms in IoMT. We plan to design better encoding methods to reduce the computation complexity and perturbation techniques to improve the accuracy of our protocols in the future.

Practical locally private heavy hitters

Local, private, efficient protocols for succinct histograms

Heavy hitters and the structure of local privacy

Finding frequent items in data streams

Private spatial data aggregation in the local setting

Sketching streams through the net: Distributed approximate query tracking

Marginal release under local differential privacy

An improved data stream summary: The count-min sketch and its applications

A survey of security and privacy issues in the internet of things from the layered context

Learning with privacy at scale

Collecting telemetry data privately

Local privacy and statistical minimax rates

Minimax optimal procedures for locally private estimation

Differential privacy: A survey of results

Calibrating noise to sensitivity in private data analysis

RAPPOR: randomized aggregatable privacy-preserving ordinal response

Review of security and privacy for the internet of medical things (iomt)

Extremal mechanisms for local differential privacy

Extremal mechanisms for local differential privacy

Mobile data collection and analysis with local differential privacy

L-diversity: Privacy beyond k -anonymity

Privacy integrated queries: an extensible platform for privacy-preserving data analysis

Mechanism design via differential privacy

Viola: Trustworthy sensor notifications for enhanced privacy on mobile systems

Privacy-aware cross-platform service recommendation based on enhanced localitysensitive hashing

Heavy hitter estimation over set-valued data with local differential privacy

k-anonymity: a model for protecting privacy

Tabulation based 4-universal hashing with applications to second moment estimation

Bixi montreal bikeshare data -bikeshare information for bixi montreal

Collecting and analyzing multidimensional data with local differential privacy

Private weighted histogram aggregation in crowdsourcing

Locally differentially private protocols for frequency estimation

Finprivacy: A privacy-preserving mechanism for fingerprint identification

Randomized response: A survey technique for eliminating evasive answer bias

New hash functions and their use in authentication and set equality

Wisdm smartphone and smartwatch activity and biometrics dataset data set

A new coronavirus associated with human respiratory disease in china

Joint optimization of offloading utility and privacy for edge computing enabled iot

A blockchain-powered crowdsourcing method with privacy preservation in mobile environment

Trustoriented iot service placement for smart cities in edge computing

Computational experiment-based evaluation on context-aware O2O service recommendation

Social learning evolution (SLE): computational experiment-based modeling framework of social manufacturing

Optimal schemes for discrete distribution estimation under locally differential privacy

Privkv: Key-value data collection with local differential privacy

Efficient query of quality correlation for service composition

Covering-based web service quality prediction via neighborhood-aware matrix factorization

Location-aware deep collaborative filtering for service recommendation

Differentially private high-dimensional data publication in internet of things

Why does your data leak? uncovering the data leakage in cloud from mobile apps

Our thanks to the reviewers for their constructive comments and suggestions to improve the quality of the manuscript.