title: Smoothed Differential Privacy  authors: Liu, Ao; Wang, Yu-Xiang; Xia, Lirong  date: 2021-07-04

Differential privacy (DP) is a widely-accepted and widely-applied notion of privacy based on worst-case analysis. Often, DP classifies most mechanisms without external noise as non-private [Dwork et al., 2014], and external noise, such as Gaussian or Laplacian noise [Dwork et al., 2006], is introduced to improve privacy. In many real-world applications, however, adding external noise is undesirable and sometimes prohibited. For example, presidential elections often require a deterministic rule to be used [Liu et al., 2020], and even small noise can lead to dramatic decreases in the prediction accuracy of deep neural networks, especially for underrepresented classes [Bagdasaryan et al., 2019]. In this paper, we propose a natural extension and relaxation of DP following the worst average-case idea behind the celebrated smoothed analysis [Spielman and Teng, 2004]. Our notion, smoothed DP, can effectively measure the privacy leakage of mechanisms without external noise under realistic settings. We prove several strong properties of smoothed DP, including composability and robustness to post-processing. We prove that any discrete mechanism with a sampling procedure is more private than what DP predicts, whereas many continuous mechanisms with sampling procedures remain non-private under smoothed DP. Experimentally, we first verify that discrete sampling mechanisms are private in real-world elections. We then apply the smoothed DP notion to quantized gradient descent, which indicates that some neural networks can be private without adding any extra noise. We believe that these results contribute to the theoretical foundation of realistic privacy measures beyond worst-case analysis.

Differential privacy (DP), a de facto measure of privacy in academia and industry, is often achieved by adding external noise to published information [Dwork et al., 2014]. However, external noise is procedurally or practically unacceptable in many real-world applications. For example, presidential elections often require a deterministic rule to be used [Liu et al., 2020]. In such cases, though, internal noise often exists, as shown in the following example.

Example 1 (Election with Internal Noise). Due to COVID-19, many voters in the 2020 US presidential election chose to submit their votes by mail. Unfortunately, it was estimated that the US postal service might have lost up to 300,000 mail-in ballots (0.2% of all votes) [Bogage and Ingraham, 2020]. For the purpose of illustration, suppose these votes are distributed uniformly at random, and the histogram of votes is announced after election day. A critical public concern about elections is: should publishing the histogram of votes be viewed as a significant threat to privacy?

Notice that with internal noise such as in Example 1, the histogram mechanism can be viewed as a randomized mechanism, formally called sampling-histogram in this paper. The same question can be asked about publishing the winner under a deterministic voting rule, where the input is sampled from the real votes.
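To make Example 1 concrete, the following minimal Python sketch (ours, not the paper's code; the vote shares and variable names are illustrative) simulates the random loss of 0.2% of the ballots and publishes the histogram of the surviving votes, which is exactly the sampling-histogram mechanism described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampling_histogram(votes, loss_rate=0.002):
    """Publish the histogram of the votes that survive random loss.

    `votes` is an array of candidate labels; a uniformly random
    (1 - loss_rate) fraction of ballots is kept (sampling without
    replacement), mimicking the internal noise of Example 1.
    """
    n = len(votes)
    kept = rng.choice(n, size=int((1 - loss_rate) * n), replace=False)
    labels, counts = np.unique(votes[kept], return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))

# Illustrative two-candidate electorate (shares are made up).
votes = rng.choice(["R", "D"], size=100_000, p=[0.52, 0.48])
print(sampling_histogram(votes))  # counts close to 0.52/0.48 of the ~99,800 surviving ballots
```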
If we apply DP to answer this question, we would conclude that publishing the histogram can pose a significant threat to privacy, as the privacy parameter δ ≈ 1 (see Section 2 for the formal definition) in the following worst-case scenario: all but one of the votes are for the Republican candidate, and there is one vote for the Democratic candidate. Notice that δ ≈ 1 is much worse than the threshold for private mechanisms, δ = o(1/n) (where n is the number of agents). Moreover, using the adversary's utility as the measure of privacy loss (see Section 2 for the formal definition), in this (worst) case the privacy loss is large (≈ 1), which means the adversary can make accurate predictions about every agent's preferences.

However, DP does not tell us whether publishing the histogram poses a significant threat to privacy in general. In particular, the worst-case scenario described in the previous paragraph never happened, even approximately, in the modern history of US presidential elections. In fact, no candidate has received more than 70% of the votes since 1920 [Leip, 2021], when the Progressive Party dissolved. It turns out that the privacy loss (measured by the adversary's utility) may not be as high as measured by DP. To see this, suppose 0.2% of the votes were randomly lost in the presidential election of each year since 1920 (in light of Example 1); we present the adversary's utility of predicting the unknown votes in Figure 1. It can be seen that the adversary has very limited utility (on the order of 10^{-32} to 10^{-8}, which is always smaller than the threshold n^{-1} for private mechanisms), meaning that the adversary cannot learn much from the published histogram of votes. We also observe an interesting decreasing trend in δ, which implies that the elections become more private in more recent years. This is primarily due to the growth of the voting population, to which the adversary's utility is exponentially related (Theorem 2). In Appendix A, we show that the elections are still private when only 0.01% of the votes are lost.

Figure 1: The privacy loss in US presidential elections (adversary's utility and our proposed δ(x), compared with the n^{-1} threshold of DP). The lower δ is, the more private the election is.

As another example, for deep neural networks (DNNs), adding even slight noise can lead to dramatic decreases in prediction accuracy, especially when predicting underrepresented classes [Bagdasaryan et al., 2019]. Internal noise also widely exists in machine learning, for example in the standard practice of cross-validation as well as in training (e.g., batch sampling when training DNNs). As shown in these examples, the worst-case privacy according to DP might be too loose to serve as a practical measure for evaluating and comparing mechanisms without external noise in real-world applications. This motivates us to ask the following question.

How can we measure privacy for mechanisms without external noise under realistic models?

The choice of model is critical and highly challenging. A model based on worst-case analysis (such as DP) provides upper bounds on the privacy loss, but as we have seen in Figure 1, in some situations such upper bounds are too loose to be informative in practice.
This is similar to the runtime analysis of an algorithm: an algorithm with exponential worst-case runtime, such as the simplex algorithm, can be faster in practice than some algorithms with polynomial worst-case runtime. Average-case analysis is a natural choice of model, but since "all models are wrong" [Box, 1979], any privacy measure designed for a certain distribution over data may not work well for other distributions. Moreover, the new measure should ideally satisfy the desirable properties that played a central role in the success of DP, including composability and robustness to post-processing. These properties make it easier for mechanism designers to figure out the privacy level of mechanisms. Unfortunately, we are not aware of a measure based on average-case analysis that has these properties.

We believe that smoothed analysis [Spielman, 2005] provides a promising framework for addressing this question. Smoothed analysis is an extension and combination of worst-case and average-case analyses that inherits advantages of both. It measures the expected performance of algorithms under slight random perturbations of worst-case inputs. Compared with average-case analysis, the assumptions of smoothed analysis are much more natural. Compared with worst-case analysis, smoothed analysis can better describe the real-world performance of algorithms. For example, it successfully explained why the simplex algorithm is faster than some polynomial-time algorithms in practice.

Our Contributions. The main merit of this paper is a new notion of privacy for mechanisms without external noise, called smoothed differential privacy (smoothed DP for short), which applies smoothed analysis to the privacy parameter δ(x) (Definition 2) as a function of the database x. In our model, the "ground truth" distribution of each agent comes from a set of distributions Π over data points, on top of which Nature adds random noise. Formally, the "smoothed" δ is the worst-case (over distributions in Π) expectation of δ(x) over databases x drawn from those distributions (Definition 3). We prove that smoothed DP satisfies several desirable properties, including composability, robustness to post-processing, and a pre-processing property that helps the mechanism designer figure out the privacy level when the set of distributions Π is hard to estimate.

Using smoothed DP, we find that many discrete mechanisms without external noise (and with small internal noise) are significantly more private than guaranteed by DP. For example, the sampling-histogram mechanism in Example 1 has an exponentially small δ_SDP (Theorem 2), which implies that the mechanism protects voters' privacy in elections, and this is in accordance with the observation on US election data in Figure 1. We also note that the sampling-histogram mechanism is widely used in machine learning (e.g., SGD in quantized DNNs). In comparison, smoothed DP implies a privacy level similar to standard DP for many continuous mechanisms. We prove that smoothed DP and standard DP give the same privacy level for the widely-used sampling-average mechanism when the inputs are continuous (Theorem 3).

Experimentally, we numerically evaluate the privacy level of the sampling-histogram mechanism using US presidential election data. Simulation results show an exponentially small δ_SDP, which is in accordance with Theorem 2. Our second experiment shows that a one-step stochastic gradient descent (SGD) update in quantized DNNs [Banner et al., 2018, Hubara et al., 2017] also has an exponentially small δ_SDP. This result implies that SGD with gradient quantization can already be private in practice, without adding any external noise.
In comparison, the standard DP notion always requires extra (external) noise to make the network private, at the cost of a significant reduction in accuracy.

Related Work and Discussions. There is a large body of literature on the theory and practice of DP and its extensions. We believe that the smoothed DP introduced in this paper is novel. To the best of our knowledge, it is most similar to distributional DP [Bassily et al., 2013], which measures privacy given the adversary's (probabilistic) belief about the data he/she is interested in. Our smoothed DP is different both conceptually and technically. Conceptually, the adversary in distributional DP only has probabilistic information about the database and is much weaker than the smoothed DP adversary, who has complete information. Technically, distributional DP considers randomness in both the mechanism and the adversary's belief about the database, while smoothed DP only considers the randomness in the dataset. We prove that smoothed DP serves as an upper bound to distributional DP (Proposition 1). Rényi DP [Mironov, 2017] is another relaxation of DP, based on the Rényi divergence between the output distributions on neighboring databases.

Quantized neural networks [Marchesi et al., 1993, Tang and Kwan, 1993, Balzer et al., 1991, Fiesler et al., 1990] were initially designed to make hardware implementations of DNNs easier. In the recent decade, quantized neural networks have again become a research hotspot owing to their growing applications on mobile devices [Hubara et al., 2016, 2017, Guo, 2018]. In quantized neural networks, the weights [Anwar et al., 2015, Kim and Smaragdis, 2016, Zhou et al., 2017, Lin et al., 2016, 2017], activation functions [Vanhoucke et al., 2011, Courbariaux et al., 2015, Rastegari et al., 2016, Mishra et al., 2018] and/or gradients [Seide et al., 2014, Banner et al., 2018, Alistarh et al., 2017, Du et al., 2020] are quantized. When the gradients are quantized, both the training and the inference of DNNs are accelerated [Banner et al., 2018, Zhu et al., 2020]. Gradient quantization can also reduce the communication cost when DNNs are trained on distributed systems [Guo, 2018].

Smoothed analysis [Spielman and Teng, 2004] is a widely-accepted analysis tool in mathematical programming, machine learning [Kalai et al., 2009, Manthey and Röglin, 2009], computational social choice [Xia, 2020, Baumeister et al., 2020, Xia, 2021, Liu and Xia, 2021], and other topics [Brunsch et al., 2013, Bhaskara et al., 2014, Blum and Dunagan, 2002]. Lastly, we note that smoothed DP is very different from the smooth sensitivity framework [Nissim et al., 2007]. The latter is an algorithmic tool that smooths the changes of the noise level across neighboring datasets (to achieve standard DP), while we use smoothing as a theoretical tool to analyze the intrinsic privacy properties of non-randomized algorithms in practice.

In this paper, we use n to denote the number of records (entries) in a database x ∈ X^n, where X denotes the set of all possible values of a single entry. n also represents the number of agents when each individual contributes exactly one record. Let ‖x − x′‖_1 denote the number of records on which x and x′ differ (i.e., their Hamming distance).

Definition 1 (Differential privacy). Let M denote a randomized algorithm and S be a subset of the image space of M. M is said to be (ε, δ)-differentially private for some ε > 0, δ > 0, if for any S and any pair of inputs x and x′ such that ‖x − x′‖_1 ≤ 1,

    Pr[M(x) ∈ S] ≤ e^ε · Pr[M(x′) ∈ S] + δ.    (1)

Notice that the randomness comes from the mechanism M, and the requirement is imposed on the worst-case x.
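As a quick illustration of Definition 1 (ours, not from the paper), the sketch below computes, for a mechanism with a finite output space, the smallest δ for which Inequality (1) holds at a given ε; randomized response is used only because its output distributions are available in closed form.

```python
import numpy as np

def dp_violation(p_x, p_xp, eps):
    """Smallest delta for which Pr[M(x) in S] <= e^eps * Pr[M(x') in S] + delta
    holds for every event S, given the two output distributions as vectors.
    For discrete mechanisms the worst S keeps exactly the outputs where
    p_x exceeds e^eps * p_xp, so the max over S reduces to a sum."""
    gap = p_x - np.exp(eps) * p_xp
    return float(np.clip(gap, 0.0, None).sum())

# Randomized response on one bit: report the true bit w.p. q, flip it w.p. 1-q.
q = 0.75
p_when_bit_is_0 = np.array([q, 1 - q])   # Pr[output = 0], Pr[output = 1]
p_when_bit_is_1 = np.array([1 - q, q])

eps = np.log(q / (1 - q))                # = ln 3 here
print(dp_violation(p_when_bit_is_0, p_when_bit_is_1, eps))  # 0.0 -> (ln 3, 0)-DP
print(dp_violation(p_when_bit_is_0, p_when_bit_is_1, 0.5))  # > 0 -> needs a positive delta
```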
DP guarantees immunity to many kinds of attacks (e.g., linkage attacks [Nguyen et al., 2013] and reconstruction attacks [Dwork et al., 2014]). Take reconstruction attacks as an example: the adversary has access to a subset of the database (such information may come from public databases, social media, etc.). In an extreme situation, the adversary knows all but one agent's records. To protect the data of every single agent, DP uses δ = o(1/n) as a common requirement for private mechanisms [Dwork et al., 2014, p. 18]. Next, we recall two common interpretations/justifications of how DP helps protect privacy even in the extreme situation of reconstruction attacks.

Justification 1: DP guarantees that the adversary's prediction cannot be too accurate [Wasserman and Zhou, 2010, Kairouz et al., 2015]. Assume that the adversary knows all entries except the i-th. Let x_{−i} denote the database x with its i-th entry removed. With the information provided by the output M(x), the adversary can infer the missing entry by testing the following two hypotheses: H_0: the missing entry is X (or equivalently, the database is x = x_{−i} ∪ {X}); H_1: the missing entry is X′ ≠ X (the database is x′ = x_{−i} ∪ {X′}). Suppose that after observing the output of M, the adversary uses a rejection region rule for hypothesis testing, where H_0 is rejected if and only if the output is in the rejection region S. (The adversary can use any decision rule; the rejection region rule is adopted just as an example.) For any fixed S, the decision rule can be wrong in two possible ways, false positive (Type I error) and false negative (Type II error). Thus, the Type I error rate is E_I(x) = Pr[M(x) ∈ S], and the Type II error rate is E_II(x′) = Pr[M(x′) ∉ S]. According to the definition of DP, for any neighboring x and x′, the adversary always has

    E_I + e^ε · E_II ≥ 1 − δ  and  e^ε · E_I + E_II ≥ 1 − δ,

which implies that E_I and E_II cannot both be small at the same time. When ε and δ are both small, both E_I and E_II become close to 0.5 (the error rates of a random guess), which means that the adversary cannot get much information from the output of M.

Justification 2: With probability at least 1 − 2δ, M is insensitive to the change of one record [Dwork et al., 2014]. In more detail, (ε, δ)-DP guarantees that the distribution of M's output does not change significantly when one record changes. Here, "change" corresponds to adding, removing, or replacing one record of the database. Mathematically, [Dwork et al., 2014, p. 18] showed that given any pair of neighboring databases x and x′,

    Pr_{a ∼ M(x)}[ e^{−ε} ≤ Pr[M(x) = a] / Pr[M(x′) = a] ≤ e^{ε} ] ≥ 1 − 2δ,

where the probability is taken over a (the output of M). The above inequality shows that the change of one record cannot make an output significantly more likely or significantly less likely (with probability at least 1 − 2δ). Dwork et al. [2014, p. 25] also argued that the above formula guarantees that the adversary cannot learn too much information about any single record of the database by observing the output of M.

Recall that DP is based on worst-case analysis over all possible databases. However, as shown in Example 1 and Figure 1, the worst-case nature of DP sometimes leads to an overly pessimistic measure of privacy loss, which may bring unnecessary external noise in the hope of improving privacy, at the cost of accuracy. In this section, we introduce smoothed DP, which applies smoothed analysis to the privacy parameter δ(x), and prove its desirable properties. Due to the space constraint, all proofs of this section can be found in Appendix E. We first introduce the database-wise parameter δ_{ε,M}(x), which measures the privacy leakage of mechanism M when its input is x.
Definition 2 (Database-wise privacy parameter δ_{ε,M}(x)). Let M : X^n → A denote a randomized mechanism. For any pair of databases x_1, x_2 and any S ⊆ A, let d_{M,S,ε}(x_1, x_2) = max{0, Pr[M(x_1) ∈ S] − e^ε · Pr[M(x_2) ∈ S]} and d_{ε,M}(x_1, x_2) = max_{S ⊆ A} d_{M,S,ε}(x_1, x_2). Given any database x ∈ X^n and any ε > 0, define the database-wise privacy parameter as:

    δ_{ε,M}(x) = max_{x′: ‖x−x′‖_1 ≤ 1} max{ d_{ε,M}(x, x′), d_{ε,M}(x′, x) }.

In words, δ_{ε,M}(x) is the minimum δ such that the (ε, δ)-DP requirement on M (Inequality (1)) holds at the neighboring pairs (x, x′) and (x′, x). In Lemma 8, we will show that d_{ε,M}(x, x′) measures the utility of the adversary.

DP as the worst-case analysis of δ_{ε,M}(x). In Lemma 1 (Appendix E), we show that the privacy measure based on the worst-case analysis of δ_{ε,M}, namely max_{x ∈ X^n} δ_{ε,M}(x) ≤ δ, is equivalent to the standard (ε, δ)-DP.

With the database-wise privacy parameter δ_{ε,M}(x) defined in the last subsection, we now formally define smoothed DP, where the worst-case "ground truth" distribution of every agent is allowed to be any distribution from a set Π, on top of which Nature adds random noise to generate the database.

Definition 3 (Smoothed DP). Given ε, δ > 0 and a set of distributions Π over X, a mechanism M : X^n → A is (ε, δ, Π)-smoothed differentially private if

    max_{π ∈ Π^n} E_{x ∼ π}[ δ_{ε,M}(x) ] ≤ δ,

where x ∼ π = (π_1, · · · , π_n) means that for every 1 ≤ i ≤ n, the i-th entry of the database follows π_i.

Like DP, smoothed DP bounds privacy leakage (in an arguably more realistic setting), via the following two justifications, which parallel the two common justifications of DP in Section 2.

Justification 1: Smoothed DP bounds the (smoothed) accuracy of the adversary's predictions. The proof follows by bounding the Type I and Type II errors in terms of δ_{ε,M}(x) when the input is x. That is, for a fixed database x, it is not hard to verify that E_I and E_II satisfy the same trade-off as in Section 2 with δ replaced by δ_{ε,M}(x), which means that E_I and E_II cannot both be small when the database is x. It follows that smoothed DP, which is a smoothed analysis of δ_{ε,M}, bounds the smoothed Type I and Type II errors.

Justification 2: A smoothed DP mechanism M is insensitive to the change of one record with probability at least 1 − 2δ. The proof is, again, done by analyzing δ_{ε,M}(x) for every mechanism M, every ε ∈ R+, and every pair of neighboring databases x, x′. The formal claim can be found in Lemma 6 in Appendix B.

As smoothed DP replaces worst-case analysis by smoothed analysis, we also view δ = o(1/n) as the requirement for private mechanisms under smoothed DP.

We first reveal a relationship between DP, smoothed DP, and distributional DP (DDP). Let us first recall the definition of DDP [Bassily et al., 2013]. To simplify notation, we let x_{−i} (or π_{−i}) denote the database (or the vector of distributions) with the i-th entry removed.

Definition 4 (Distributional Differential Privacy (DDP)). A mechanism M is (ε, δ, Π)-distributionally differentially private (DDP) if there is a simulator Sim such that for any π ∈ Π^n, any i ∈ [n], any data X, and any subset S of the image space of M, the (ε, δ)-DP-style inequalities of (1) hold between M(x_{−i} ∪ {X}) and Sim(x_{−i}), where the probability is taken over both the mechanism and x_{−i} ∼ π_{−i}.

Proposition 1 (DP ⇒ Smoothed DP ⇒ DDP). Given any mechanism M with domain X^n and any set of distributions Π over X, if M is (ε, δ)-DP then M is (ε, δ, Π)-smoothed DP, and if M is (ε, δ, Π)-smoothed DP then M is (ε, δ, Π)-DDP.

The above proposition shows that DP guarantees smoothed DP, and smoothed DP guarantees DDP. The proof and additional discussion of Proposition 1 can be found in Appendix C.

Next, we present four properties of smoothed DP and discuss how they can help mechanism designers figure out the smoothed DP level of mechanisms. We first present robustness to post-processing, which says that no function can make a mechanism less private without adding extra knowledge about the database. The post-processing property of smoothed DP can be used to upper bound the privacy level of many mechanisms: with it, we know that a private data-preprocessing step guarantees the privacy of the whole mechanism, and the rest of the mechanism can then be designed arbitrarily.
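To make Definition 2 concrete, the minimal sketch below (ours, not the paper's code) computes δ_{ε,M}(x) exactly for the sampling-histogram mechanism of Example 1 with a two-valued X, using the fact that the published histogram is determined by a hypergeometric draw; the database sizes and ε = 3 are illustrative.

```python
import numpy as np
from scipy.stats import hypergeom

def shm_output_pmf(n_ones, n, T):
    """PMF over the published histogram of a two-valued database: the number
    of 1-records among T samples drawn without replacement from a database
    with n_ones ones out of n records (hypergeometric)."""
    k = np.arange(T + 1)
    return hypergeom.pmf(k, n, n_ones, T)

def delta_eps(p, q, eps):
    """max over events S of max(0, P(S) - e^eps Q(S), Q(S) - e^eps P(S));
    for discrete outputs each max over S is a sum of positive gaps."""
    a = np.clip(p - np.exp(eps) * q, 0, None).sum()
    b = np.clip(q - np.exp(eps) * p, 0, None).sum()
    return max(a, b)

def database_wise_delta(n_ones, n, T, eps):
    """delta_{eps,M}(x) of Definition 2 for the sampling-histogram mechanism,
    maximized over the neighbors obtained by replacing one record."""
    p = shm_output_pmf(n_ones, n, T)
    neighbors = [m for m in (n_ones - 1, n_ones + 1) if 0 <= m <= n]
    return max(delta_eps(p, shm_output_pmf(m, n, T), eps) for m in neighbors)

# A balanced database is far more private than the all-one-type worst case.
print(database_wise_delta(500, 1000, 900, eps=3.0))   # tiny
print(database_wise_delta(0,   1000, 900, eps=3.0))   # about T/n = 0.9
```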
The proofs of all four properties of smoothed DP can be found in Appendix F. Next, we introduce the composition theorem for smoothed DP, which bounds the smoothed DP guarantee when two or more mechanisms publish their outputs about the same database.

In practice, Π might be hard to characterize accurately. The next proposition introduces the pre-processing property of smoothed DP, which says that the distribution of the data can be replaced by the distribution of features (extracted using any deterministic function). For example, in deep learning, the distribution of the data can be replaced by the distribution of gradients, which is usually much easier to estimate in real-world training processes. More technically, the pre-processing property guarantees that any deterministic way of pre-processing the data is not harmful to privacy. To simplify notation, we let f(π) be the distribution of f(X) where X ∼ π. For any set of distributions Π = {π_1, · · · , π_m}, we let f(Π) = {f(π_1), · · · , f(π_m)}.

Proposition 4 (Pre-processing for deterministic functions). Let f : X^n → X̃^n be a deterministic function and M : X̃^n → A be a randomized mechanism. Then, if M is (ε, δ, f(Π))-smoothed DP, the composed mechanism M ∘ f is (ε, δ, Π)-smoothed DP.

The following proposition shows that any two sets of distributions with the same convex hull have the same privacy level under smoothed DP. With this, mechanism designers can ignore all inner points and only consider the vertices of the convex hull when calculating the privacy level of mechanisms. Let CH(Π) denote the convex hull of Π.

Proposition 5 (Distribution reduction). Given any ε, δ ∈ R+ and any Π_1 and Π_2 such that CH(Π_1) = CH(Π_2), an anonymous mechanism M is (ε, δ, Π_1)-smoothed DP if and only if it is (ε, δ, Π_2)-smoothed DP.

In this section, we use smoothed DP to measure the privacy of some commonly-used mechanisms whose randomness is intrinsic and unavoidable (as opposed to external noise such as Gaussian or Laplacian noise). Our analysis focuses on two widely-used algorithms whose intrinsic randomness comes from sampling (without replacement). We also compare the privacy levels under smoothed DP and under DP. All missing proofs of this section can be found in Appendix G.

In this subsection, we study the smoothed DP property of the (discrete) sampling-histogram mechanism (SHM), which is widely used as a pre-processing step in many real-world applications such as the training of DNNs. As smoothed DP satisfies post-processing (Proposition 2) and pre-processing (Proposition 4), the smoothed DP guarantee of SHM upper bounds the smoothed DP of many mechanisms used in practice that are built on SHM. SHM first samples T records from the database and then outputs the histogram of the T samples. Formally, we define the sampling-histogram mechanism in Algorithm 1. Note that we require all records in the database to be chosen from a finite set X.

Algorithm 1: Sampling-histogram mechanism M_H
1: Inputs: A finite set X, the number of samples T ∈ Z+, and a database x = {X_1, · · · , X_n} where X_i ∈ X for all i ∈ [n]
2: Randomly sample T records from x without replacement. The sampled records are denoted by X_{j_1}, · · · , X_{j_T}.
3: Output: The histogram hist(X_{j_1}, · · · , X_{j_T})

Smoothed DP of mechanisms based on SHM. The smoothed DP of SHM can be used to upper bound the smoothed DP of the following three groups of mechanisms/algorithms. The first group consists of deterministic voting rules, as presented in the motivating example in the Introduction. The sampling procedure in SHM mimics the votes that got lost.
The second group consists of machine learning algorithms based on randomly-sampled training data, such as cross-validation. The (random) selection of the training data corresponds to SHM. Notice that many training algorithms are essentially based on the histogram of the training data (instead of the ordering of the data points). Therefore, the overall training procedure can be viewed as SHM plus a post-processing function (the learning algorithm). Consequently, the smoothed DP of SHM can be used to upper bound the smoothed DP of such procedures. The third group consists of SGD for DNNs with gradient quantization [Zhu et al., 2020, Banner et al., 2018], where the gradients are rounded to 8 bits in order to accelerate the training and inference of DNNs. The smoothed DP of SHM can be used to bound the privacy leakage of each SGD step of the DNN, where a batch (a subset of the training set) is first sampled and the gradient is the average of the gradients of the sampled data.

DP vs. Smoothed DP for SHM. We are ready to present the main theorem of this paper, which indicates that SHM is very private under some mild assumptions. We say a distribution π is strictly positive if there exists a positive constant c such that π(X) ≥ c for any X in the support of π. A set of distributions Π is strictly positive if there exists a positive constant c such that every π ∈ Π is strictly positive (with constant c). The strictly positive assumption is often considered mild in elections [Xia, 2020] and discrete machine learning [Laird and Saul, 1994], even though it may not hold for every step of SGD.

Theorem 2 (DP vs. Smoothed DP for SHM). For any SHM, denoted by M_H, given any strictly positive set of distributions Π, any finite set X, and any n, T ∈ Z+, we have:
(i) (Smoothed DP) M_H is (ε, exp(−Θ(n)), Π)-smoothed DP for any ε ≥ ln(2n/(n−T)).
(ii) (Tightness of the smoothed DP bound) For any ε > 0, there does not exist δ = exp(−ω(n)) such that M_H is (ε, δ, Π)-smoothed DP.
(iii) (DP) For any ε > 0, there does not exist 0 ≤ δ < T/n such that M_H is (ε, δ)-DP.

This theorem says that the privacy leakage is exponentially small under realistic application scenarios. In comparison, DP cares too much about extremely rare cases and predicts a constant privacy leakage. Also, note that our theorem allows T to be of the same order as n. For example, when setting T = 90% × n, SHM is (3, exp(−Θ(n)), Π)-smoothed DP, which is an acceptable privacy level in many real-world applications. We also prove similar bounds for the SHM with replacement in Appendix H.

In this section, we show that sampling mechanisms with continuous support are still not privacy-preserving under smoothed DP. Our result indicates that neural networks without quantized parameters are not private unless external noise (e.g., Gaussian or Laplacian noise) is added.

Algorithm 2: Continuous sampling average M_A
1: Inputs: The number of samples T and a database x = {X_1, · · · , X_n} where X_i ∈ [0, 1] for all i ∈ [n]
2: Randomly sample T records from x without replacement. The sampled records are denoted by X_{j_1}, · · · , X_{j_T}.
3: Output: The average (1/T) · Σ_{t∈[T]} X_{j_t}

We use the sampling-average algorithm (Algorithm 2) as the standard algorithm for continuous mechanisms. Because sampling-average can be treated as SHM plus an averaging step, the non-privacy of sampling-average also implies that SHM with continuous support is non-private, by the post-processing property of smoothed DP.

Smoothed DP in elections. We use a similar setting as Example 1, where 0.2% of the votes are randomly lost. We numerically calculate the δ parameter of smoothed DP.
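Before turning to the experiments, both mechanisms of this section are a few lines of NumPy; the sketch below (ours) is a direct transcription of Algorithms 1 and 2, with made-up toy inputs at the bottom.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def sampling_histogram(x, T):
    """Algorithm 1 (sampling-histogram mechanism M_H): sample T records
    from the database x without replacement and output their histogram."""
    sampled = rng.choice(np.asarray(x), size=T, replace=False)
    return Counter(sampled.tolist())

def sampling_average(x, T):
    """Algorithm 2 (continuous sampling average M_A): sample T records
    from x (values in [0, 1]) without replacement and output their average."""
    sampled = rng.choice(np.asarray(x, dtype=float), size=T, replace=False)
    return float(sampled.mean())

# Toy usage with made-up data.
votes = ["R"] * 600 + ["D"] * 400
print(sampling_histogram(votes, T=900))   # roughly 90% of each candidate's count
scores = rng.uniform(0, 1, size=1000)
print(sampling_average(scores, T=900))
```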
Here, the set of distributions Π includes the vote distributions of all 57 congressional districts in the 2020 presidential election. Using the distribution reduction property of smoothed DP (Proposition 5), we can remove all distributions in Π except DC and NE-2, which are the vertices of the convex hull of Π. Figure 2 (left) shows that the smoothed δ parameter is exponentially small in n when ε = 7, which matches Theorem 2. We find that the smoothed δ is also exponentially small when ε = 0.5, 1, or 2, which indicates that the sampling-histogram mechanism is more private than DP predicts under our settings. The smoothed δ is also exponentially small when 1% or 10% of the votes are randomly lost (see the middle and right sub-figures of Figure 2). Also, see Appendix I.2 for experiments with different settings of Π.

SGD with 8-bit gradient quantization. According to the pre-processing property of smoothed DP, the smoothed DP of the (discrete) sampling-average mechanism upper bounds the smoothed DP of a one-step SGD update with 8-bit quantized gradients; Figure 3 shows that the corresponding smoothed δ is again exponentially small in the number of agents n.

We propose a novel notion to measure the privacy leakage of mechanisms without external noise under realistic settings. One promising next step is to apply our smoothed DP notion to the entire training process of quantized DNNs. Is a quantized DNN private without external noise? If not, what level of external noise needs to be added, and how should we add it in an optimal way? More generally, we believe that our work has the potential of making many algorithms private without requiring too much external noise.

In the motivating example, the adversary's utility represents max_{x′: ‖x−x′‖_1 ≤ 1} u(x, x′, t), where u(x, x′, t) is the adjusted utility defined above Lemma 8. We set the threshold of accuracy t = 51% in Figure 1. In other words, the adversary omits the utility coming from predictors with no more than 51% accuracy. To compare the privacy of different years, we assume that the US government only publishes the numbers of votes of the top-2 candidates. The δ parameter plotted in Figure 1 is the database-wise privacy parameter δ_{ε,M}(x) of Definition 2, where ε = ln(0.51/0.49). One can see that the threshold t = e^ε/(e^ε + 1), which matches the setting of Lemma 8. In all experiments related to our motivating example, we directly calculate the probabilities, so our numerical results do not involve any randomness.

Figure 5 presents the privacy level of US presidential elections (the motivating example) under different settings. In all settings, we set t = e^ε/(e^ε + 1). Figure 5 shows similar information as Figure 1 under different settings (threshold of accuracy t ∈ {0.5, 0.51, 0.6} and ratio of lost votes 0.01% or 0.2%). In all settings of Figure 5, the US presidential election is much more private than what DP predicts. Figure 5(d) shows that the US presidential election is also private (δ = o(1/n) and ε ≈ 0.4) when only 0.01% of the votes are lost.

In this section, we present the formal definition of the "probabilities" used in Justification 2 of our interpretations of DP and smoothed DP. To simplify notation, we define the δ-approximate max divergence between two random variables Y and Z as

    D^δ_∞(Y ‖ Z) = max_{S ⊆ Supp(Y): Pr[Y ∈ S] ≥ δ} ln( (Pr[Y ∈ S] − δ) / Pr[Z ∈ S] ),

where Supp(Y) represents the support of the random variable Y. In particular, we use D_∞(Y ‖ Z) to denote D^0_∞(Y ‖ Z), which is the max divergence. One can see that a mechanism M is (ε, δ)-DP if and only if for any pair of neighboring databases x, x′: D^δ_∞(M(x) ‖ M(x′)) ≤ ε. The next lemma shows the connection between D_∞(Y ‖ Z) and D^δ_∞(Y ‖ Z).
Lemma 6 (Lemma 3.17 in [Dwork et al., 2014]). D^δ_∞(Y ‖ Z) ≤ ε if and only if there exists a random variable Y′ such that D_∞(Y ‖ Y′) ≤ ε and Δ(Y′, Z) ≤ δ.

In [Dwork et al., 2014], "Pr[M(x) ∈ S] / Pr[M(x′) ∈ S] ≥ e^ε happens with no more than δ probability" means that there exists a random variable Y such that D_∞(M(x) ‖ Y) ≤ ε and Δ(Y, M(x′)) ≤ δ. In other words: after modifying the distribution of M(x′) by at most δ of its mass, the probability ratio between M(x) and the modified distribution is always upper bounded by e^ε. This explains the meaning of "probability" in [Dwork et al., 2014].

As smoothed DP is the smoothed analysis of δ_{ε,M} while DP is its worst-case analysis, DP ⇒ smoothed DP follows immediately. In this section, we focus on showing that smoothed DP is an upper bound of DDP. We may rewrite the DDP conditions as δ ≥ max{0, δ_1^DDP, δ_2^DDP}. Next, we compare δ_1 and δ_1^DDP, where the simulator Sim_{π_i} is defined by averaging the output distribution of M over X_i ∼ π_i. It follows that δ_1 ≥ δ_1^DDP; by symmetry, we also have δ_2 ≥ δ_2^DDP, and the proposition follows by the definition of DDP.

Justification 3: d_{ε,M}(x, x′) tightly bounds the Bayesian adversary's utility. We consider the same adversary as in Justification 1. Since the adversary has no information about the missing entry, he/she may assume a uniform prior over the missing entry. Let X_i ∈ X denote the missing entry. Observing output a from mechanism M, the adversary's posterior distribution over the missing entry is proportional to the likelihood Pr[M(x_{−i} ∪ {X}) = a] (by the uniform prior). A Bayesian predictor predicts the missing entry X_i by maximizing the posterior probability. Then, for the uniform prior, when the output is a, the 0/1 loss ℓ_{0-1} of the Bayesian predictor is defined as usual: a correct prediction has loss zero and any incorrect prediction has loss one. We then define the adjusted utility of the adversary (in Bayesian prediction), which is the expectation of a normalized version of 1 − ℓ_{0-1}. For a database x, the adjusted utility with threshold t, denoted u(t, x_{−i}), is the worst-case expectation of 1 − ℓ_{0-1} in which the contribution from predictors with loss larger than 1 − t is omitted. In particular, when the threshold t ≥ 1/|X|, an always-correct predictor (ℓ_{0-1} = 0) has utility 1 and a random-guess predictor (ℓ_{0-1} = 1 − 1/|X|) has utility 0.

For example, let X = {0, 1} and consider the coin-flipping mechanism M_CF with support X, which outputs X_i with probability p and outputs 1 − X_i with probability 1 − p. When p = 1, the entry X_i is non-private because the adversary can learn it directly from the output of M_CF; correspondingly, the adjusted utility of the adversary is 1 for any threshold t ∈ (0, 1). When p = 0.5, the mechanism outputs an element of X uniformly at random; in this case, the output of M_CF provides no information to the adversary, and the adjusted utility of the adversary is 0 for any threshold t ∈ (0.5, 1).

In the next lemma, we reveal a connection between the adversary's utility u and d_{ε,M}(x, x′) + d_{ε,M}(x′, x).

Lemma 8. Given a mechanism M : X^n → A and any pair of neighboring databases x, x′ ∈ X^n, the adjusted utility satisfies u(e^ε/(e^ε + 1), x_{−i}) ≤ 2/(e^ε + 1) · (d_{ε,M}(x, x′) + d_{ε,M}(x′, x)).

Lemma 8 shows that the adjusted utility is upper bounded by d_{ε,M}. In particular, when |X| = 2, we provide both upper and lower bounds on the adjusted utility in Lemma 10 in Appendix E.1, which means that d_{ε,M}(x, x′) is a good measure of the privacy level of M when |X| = 2. In the following corollary, we show that δ_{ε,M}(x) upper bounds the adjusted utility of the adversary.

Corollary 9. Given a mechanism M : X^n → A and any pair of neighboring databases x, x′ ∈ X^n, the adjusted utility satisfies u(e^ε/(e^ε + 1), x_{−i}) ≤ 4/(e^ε + 1) · δ_{ε,M}(x).
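To make the coin-flipping example above concrete, the sketch below (ours; it computes only the Bayesian posterior and the resulting prediction accuracy, not the paper's thresholded and normalized adjusted utility) shows how the adversary's inference degrades as p approaches 0.5.

```python
def posterior_and_accuracy(p):
    """Coin-flipping mechanism M_CF on X = {0, 1}: the output equals the
    missing entry X_i with probability p.  With a uniform prior on X_i,
    the posterior after observing output a is Pr[X_i = a | a] = p, so the
    Bayesian (MAP) predictor reports a when p > 0.5 and its accuracy is
    max(p, 1 - p)."""
    posterior_match = p                  # Pr[X_i = a | output a]
    accuracy = max(p, 1.0 - p)           # accuracy of the MAP prediction
    return posterior_match, accuracy

for p in (1.0, 0.9, 0.6, 0.5):
    post, acc = posterior_and_accuracy(p)
    print(f"p = {p:.1f}: posterior that X_i equals the output = {post:.2f}, "
          f"prediction accuracy = {acc:.2f}")
```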
Lemma 10. Let X be such that |X| = 2, let M : X^n → A be a mechanism, and let x, x′ ∈ X^n be any pair of neighboring databases. Then

    2/(e^ε + 3) · (max_S d_{M,S,ε}(x, x′) + max_S d_{M,S,ε}(x′, x)) ≤ u(e^ε/(e^ε + 1), x_{−i}) ≤ 2/(e^ε + 1) · (max_S d_{M,S,ε}(x, x′) + max_S d_{M,S,ε}(x′, x)).

Proof. To simplify notation, for any output a, let p(a) = Pr[M(x) = a] and p′(a) = Pr[M(x′) = a], and let X_i be the entry on which x and x′ differ. We further define the adjusted utility with threshold (e^ε − 1)/(e^ε + 1) for output a as

    u(a, x, x′, (e^ε − 1)/(e^ε + 1)) = max{ 0, |p(a) − p′(a)| / (p(a) + p′(a)) − (e^ε − 1)/(e^ε + 1) }.

Note that this threshold is on the utility (not on the accuracy). It is then easy to verify that

    u(x, x′, e^ε/(e^ε + 1)) = E_{a ∼ M(x)}[ u(a, x, x′, (e^ε − 1)/(e^ε + 1)) ].

Using the above notation and writing r = p(a)/p′(a), the quantity 2·p(a)·(r − e^ε) / ((e^ε + 1)·r), summed over the outputs a with p(a) > e^ε·p′(a), bounds the first term of E_{a ∼ M(x)}[u(a, x, x′, (e^ε − 1)/(e^ε + 1))]; by symmetry, 2/(e^ε + 1) · max_S d_{M,S,ε}(x′, x) bounds the second term. The upper bound of the lemma follows by combining these two bounds. For the lower bound, the analogous expressions with e^ε + 3 in place of e^ε + 1 bound the two terms from the other side, and the lower bound of the lemma follows by combining them.

Lemma 8. Given a mechanism M : X^n → A and any pair of neighboring databases x, x′ ∈ X^n, u(e^ε/(e^ε + 1), x_{−i}) ≤ 2/(e^ε + 1) · (d_{ε,M}(x, x′) + d_{ε,M}(x′, x)).

Proof. Assume that X = {X^{(1)}, · · · , X^{(m)}}. Using the notation in the proof of Lemma 10, one can restrict attention to a set X* ⊆ X with |X*| = 2 such that arg max_{X′ ∈ X} Pr[M(x_{−i} ∪ {X′}) = a] ∈ X*. In words, X* is a set of two possible values for the missing entry that contains the MLE prediction. This reduces u(t, x_{−i}) for a general X to the case of two possible missing values, and Lemma 8 follows by a direct application of Lemma 10.

Lemma 1 (DP in terms of δ_{ε,M}(x)). A mechanism M : X^n → A is (ε, δ)-differentially private if and only if max_{x ∈ X^n} δ_{ε,M}(x) ≤ δ.

Proof. The "if" direction: by the definition of δ_{ε,M}(x), for all x ∈ X^n and all neighboring x′, Inequality (1) holds with δ_{ε,M}(x) ≤ δ in place of δ, which is exactly the requirement of DP in Definition 1. The "only if" direction: in the definition of DP, the databases x and x′ are interchangeable, so the DP inequality bounds both d_{M,S,ε}(x, x′) and d_{M,S,ε}(x′, x) by δ. Combining the two bounds, for all x ∈ X^n, δ_{ε,M}(x) ≤ max{0, δ, δ} = δ. This proves both directions of Lemma 1.

Proposition 2 (Post-processing). Let M : X^n → A be an (ε, δ, Π)-smoothed DP mechanism and let f : A → A′ be any function (deterministic or randomized). Then f ∘ M : X^n → A′ is also (ε, δ, Π)-smoothed DP.

Proof. For any fixed database x, any database x′ such that ‖x − x′‖_1 ≤ 1, and any set of outputs S′ ⊆ A′, let T = {y ∈ A : f(y) ∈ S′}. Then the probability that f(M(x)) falls in S′ equals the probability that M(x) falls in T, and likewise for x′. Since this holds for any pair of neighboring x, x′ and any S′, we have d_{ε,f∘M}(x, x′) ≤ d_{ε,M}(x, x′); by the same argument in the other direction, and combining the two inequalities, δ_{ε,f∘M}(x) ≤ δ_{ε,M}(x). Taking expectations over x ∼ π for any π, and then the maximum over π ∈ Π^n, Proposition 2 follows by the definition of smoothed DP.

Proposition 3 (Composition). Let M_i : X^n → A_i be an (ε_i, δ_i, Π)-smoothed DP mechanism for every i ∈ [k], and define their composition M(x) = (M_1(x), · · · , M_k(x)). Then M is (Σ_{i∈[k]} ε_i, Σ_{i∈[k]} δ_i, Π)-smoothed DP.

Proof of Proposition 4. By symmetry, we have the corresponding bound on d_{M,S,ε}(a′, a). Combining the two inequalities with the definition of smoothed DP, for any fixed x and a = f(x), we obtain a bound on the database-wise parameter of M ∘ f at x in terms of that of M at a. When x ∼ π, we know that a ∼ f(π). The claim then follows, where δ_M (resp. δ_{M∘f}) denotes the smoothed DP parameter of M (resp. M ∘ f).

Proposition 5 (Distribution reduction). Given any ε, δ ∈ R+ and any Π_1 and Π_2 such that CH(Π_1) = CH(Π_2), an anonymous mechanism M is (ε, δ, Π_1)-smoothed DP if and only if M is (ε, δ, Π_2)-smoothed DP.

Proof. Let Π* = {π*_1, · · · , π*_p} denote the set of vertices of CH(Π_1) = CH(Π_2).
Then, for any distribution π ∈ Π_1, we can write π = Σ_{j=1}^{p} α_j · π*_j, where Σ_{j=1}^{p} α_j = 1 and α_j ∈ [0, 1] for every j ∈ [p]. Let π_{−i} denote the distributions of the agents other than the i-th agent. The distribution of each agent admits such a decomposition, with coefficients α_{j,i} satisfying Σ_{j=1}^{p} α_{j,i} = 1 and α_{j,i} ∈ [0, 1] for every j ∈ [p]. Since the decomposition works for every agent, by induction the probability of any database x under π is a convex combination, with weights Π_i α_{j_i,i}, of the probabilities Pr[x | (π*_{j_1}, · · · , π*_{j_n})]. It follows that for any π ∈ Π_1, E_{x∼π}[δ_{ε,M}(x)] is a convex combination of the corresponding expectations under vertex profiles. Since this holds for any π ∈ CH(Π_1), and since Π* ⊆ CH(Π_1), the maximum over Π_1 is attained at the vertex profiles; by symmetry, the same holds for Π_2.

By replacing the Gaussian distribution with the Laplacian distribution, we define the 8-bit quantized Laplacian distribution L_8-bit(µ, σ^2), which is used in our extra experiments on SGD with quantized gradients.

Smoothed DP in elections. Figure 6 presents the numerical results for elections under different settings. The congressional-districts (CD) setting uses the same Π as Figure 2, namely the vote distributions of all 57 congressional districts in the 2020 presidential election. In the history setting, the set of distributions Π includes the distributions of all historical elections since 1920. All results in Figure 6 show a similar trend as Figure 2: no matter what the set of distributions Π is and no matter what fraction of the votes is lost, the δ parameter of smoothed DP is always exponentially small in the number of agents n.

SGD with 8-bit gradient quantization. Figure 7 presents the SGD with 8-bit gradient quantization experiment under different settings. All four settings in Figure 7 share the same setup as Figure 3, except for the set of distributions Π. The detailed settings of Π are as follows.
• Setting 3: Π = {N_8-bit(0, 0.12^2), L_8-bit(0, 0.12^2)},
• Setting 4: Π = {L_8-bit(0, 0.12^2), L_8-bit(0.2, 0.12^2)},
where L_8-bit(µ, σ^2) is the 8-bit quantized Laplacian distribution defined in Appendix I.1. All results in Figure 7 show similar information as Figure 3: the δ of smoothed DP is exponentially small in the number of agents n. As the number of iterations used for Figure 7 is only about 10% of that used for Figure 3, we observe some random fluctuations in Figure 7.

Theorem 2 (DP vs. Smoothed DP for SHM, restated). Using the notation of Algorithm 1, given any strictly positive set of distributions Π, any finite set A, and any n, T ∈ Z+, the properties stated in Theorem 2 hold for M_H.

Proof. We present the proof of part (i) in the following three steps.

Step 0. Notation and preparation. Let H = hist(X_1, · · · , X_n) denote the histogram of the whole database and h = hist(X_{j_1}, · · · , X_{j_T}) the histogram of the T samples. H_i (resp. h_i) denotes the i-th component of the vector H (resp. h). Let m = |A| be the size of the finite set A. Note that the sampling process in M_H is without replacement.

Step 1. Bound δ(H) and max_S δ_S(H, H′). Without loss of generality, assume that H′ has one more record of the i_1-th type and one fewer record of the i_2-th type. By the definition of δ_h(H, H′), and noting that δ_h(H, H′) ≤ Pr[h | H] for every h, we combine these bounds with (3) and apply the Chernoff bound; the same argument applies in the other direction. By the definition of smoothed DP, this yields the bound used in the next step.

Step 2. The smoothed analysis of δ(H). We note that Π is f-strictly positive, which means that the PMF of every distribution π ∈ Π is at least f.
Here, f can be a function of m or n; f-strict positivity reduces to the standard definition of strict positivity when f is a constant function. Given the f-strictly positive condition, by the definition of smoothed DP and for any constant c_2 ∈ (0, 1), the expectation splits into two terms. We then apply the Chernoff bound and the union bound to the second term, and for the first term we directly apply Inequality (4).

The election experiment. As in the motivating example, we only consider the top-2 candidates in each congressional district; the top-2 candidates are Trump and Biden in every congressional district. In the discussion in the main text, DC refers to Washington, D.C. and NE-2 refers to Nebraska's 2nd congressional district. The distribution of each congressional district is calculated using the results of the 2020 US presidential election. For example, in DC, each agent votes for Biden with probability 94.46% and for Trump with probability 5.54%.

The SGD experiment. We formally define the 8-bit quantized Gaussian distribution N_8-bit, which is used in the set of distributions Π of our SGD experiment. To simplify notation, for any interval I = (i_1, i_2], we define p_{I,µ,σ} as the probability mass that the Gaussian distribution N(µ, σ^2) puts on I, i.e., p_{I,µ,σ} = Pr_{Z ∼ N(µ,σ^2)}[Z ∈ I]. Then, we formally define the 8-bit quantized Gaussian distribution as follows.

Definition 5 (8-bit quantized Gaussian distribution). Using the notation above, given any µ and σ, the probability mass function of N_8-bit(µ, σ^2) assigns to each of the 2^8 quantization levels v in (−0.5, 0.5] the normalized Gaussian mass of the corresponding interval I_v, that is, Pr[N_8-bit(µ, σ^2) = v] = p_{I_v,µ,σ} / Z_{µ,σ}, where the normalization factor Z_{µ,σ} = p_{(−0.5,0.5],µ,σ}.
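A small sketch of Definition 5 (ours; the exact quantization grid is not spelled out in the surviving text, so the code assumes 2^8 equal-width bins partitioning (−0.5, 0.5], with each level placed at the right endpoint of its bin):

```python
import numpy as np
from scipy.stats import norm

def quantized_gaussian_pmf(mu, sigma, bits=8):
    """PMF of the b-bit quantized Gaussian of Definition 5, assuming the 2^b
    quantization levels are the right endpoints of 2^b equal-width bins that
    partition (-0.5, 0.5].  Each level receives the Gaussian mass of its bin,
    normalized by the total Gaussian mass on (-0.5, 0.5] (the factor Z)."""
    k = 2 ** bits
    edges = np.linspace(-0.5, 0.5, k + 1)                        # bin boundaries
    mass = norm.cdf(edges[1:], mu, sigma) - norm.cdf(edges[:-1], mu, sigma)
    z = norm.cdf(0.5, mu, sigma) - norm.cdf(-0.5, mu, sigma)     # Z_{mu,sigma}
    return edges[1:], mass / z                                   # levels, PMF

levels, pmf = quantized_gaussian_pmf(mu=0.0, sigma=0.12)
print(len(levels), float(pmf.sum()))                             # 256 levels, total mass 1.0
```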