title: Tight and Robust Private Mean Estimation with Few Users
authors: Esfandiari, Hossein; Mirrokni, Vahab; Narayanan, Shyam
date: 2021-10-22

In this work, we study high-dimensional mean estimation under user-level differential privacy, and attempt to design an $(\varepsilon,\delta)$-differentially private mechanism using as few users as possible. In particular, we provide a nearly optimal trade-off between the number of users and the number of samples per user required for private mean estimation, even when the number of users is as low as $O(\frac{1}{\varepsilon}\log\frac{1}{\delta})$. Interestingly, our bound of $O(\frac{1}{\varepsilon}\log\frac{1}{\delta})$ on the number of users is independent of the dimension, unlike the previous work that depends polynomially on the dimension, solving a problem left open by Amin et al. (ICML'2019). Our mechanism enjoys robustness up to the point that even if the information of $49\%$ of the users is corrupted, our final estimation is still approximately accurate. Finally, our results also apply to a broader range of problems such as learning discrete distributions, stochastic convex optimization, empirical risk minimization, and a variant of stochastic gradient descent via a reduction to differentially private mean estimation.

Without any doubt, user privacy is one of the greatest matters of the modern era. Differential privacy is one of the most popular concepts used to measure and enforce privacy. In a nutshell, differential privacy statistically protects the privacy of every single unit of data, regardless of how much information about the rest of the data is revealed. Depending on how a single unit of data is defined, we have two general notions of differential privacy: user-level differential privacy and item-level differential privacy. Let us differentiate these two notions with a simple example. Consider a dataset that consists of the daily blood pressures of the patients in a hospital. For privacy purposes, we may define a unit of data to be one single value of a blood pressure, or the whole collection of the blood pressures of a patient. The former leads to item-level differential privacy, while the latter leads to user-level differential privacy. Roughly speaking, item-level differential privacy in this example protects one single value of a patient's blood pressure; however, an estimate of the average blood pressure of an individual might be released under item-level differential privacy. On the other hand, user-level differential privacy protects any information about the blood pressure of every single patient, while it still allows statistical analysis of a population of patients. Clearly, user-level differential privacy provides a higher level of privacy, while item-level differential privacy is an easier measure to guarantee. While both item-level and user-level differential privacy are respected measures, in this work our main focus is on user-level differential privacy.

Ignoring computational constraints, it is folklore that providing differential privacy gets easier as the number of users in the dataset grows, and for many tasks we can obtain simple differentially private mechanisms using a large enough number of users in the dataset. However, the question has always been "how large should the dataset be?".
Being able to provide differential privacy using only a few users is of particular importance when we want to analyse local information, for example analysing the information of each hospital separately, or analysing the information of each neighborhood of a city separately (e.g. for advertising an anti-drug campaign or a suicide prevention lifeline). Another example where we face having few users is when we are analysing a rare situation, such as the patients of a rare or novel disease (e.g. the Covid-19 pandemic in its early days [6]). Motivated by this, we consider the problem of mean estimation with user-level differential privacy and attempt to understand the limits on the number of users one needs to provide a differentially private mechanism for this problem. Indeed, differentially private mean estimation is of high technical and practical importance, since it can be used directly to develop differentially private mechanisms for several other tasks such as stochastic convex optimization, empirical risk minimization, and a variant of stochastic gradient descent [18].

We start with a formal definition of differential privacy. First, let X = {X_{i,j}} represent a dataset, where each user i (ranging from 1 to n) has m samples (j ranging from 1 to m). In the context of user-level privacy, we say that two datasets X = {X_{i,j}} and X′ = {X′_{i,j}} are neighboring if there is only one user i where the datasets differ, i.e., for all i′ ≠ i and all j ∈ [m], X_{i′,j} = X′_{i′,j}. In the context of item-level privacy, the definition of neighborhood is more restrictive. Specifically, in item-level differential privacy two neighboring datasets only differ in a single sample (i.e., only one pair (i, j) can have X_{i,j} ≠ X′_{i,j}).

Definition 1. We say that a (randomized) algorithm M is (ε, δ)-differentially private if for any two neighboring datasets X and X′ and any subset S of the output space, we have P[M(X) ∈ S] ≤ e^ε · P[M(X′) ∈ S] + δ.

As noted before, Definition 1 applies to both user-level and item-level differential privacy; their only difference is in the definition of neighborhood. Although in this work we focus on user-level privacy, we first study the case where each user has one sample (i.e., m = 1), in which case user-level privacy and item-level privacy are equivalent.
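To make the two neighborhood notions concrete, the following is a minimal sketch (in Python; the function names are ours, not the paper's) of the neighbor relations underlying Definition 1, for datasets stored as n × m × d arrays:

import numpy as np

def user_level_neighbors(X, Y):
    # Neighboring for user-level privacy: the datasets differ in at most one user's rows.
    differing = [i for i in range(X.shape[0]) if not np.array_equal(X[i], Y[i])]
    return len(differing) <= 1

def item_level_neighbors(X, Y):
    # Neighboring for item-level privacy: at most one single sample X[i, j] differs.
    differing = [(i, j) for i in range(X.shape[0]) for j in range(X.shape[1])
                 if not np.array_equal(X[i, j], Y[i, j])]
    return len(differing) <= 1

Every pair of datasets that is item-level neighboring is also user-level neighboring, but not vice versa, which is why user-level privacy is the stronger guarantee.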
Our main contribution in this work is to provide a nearly optimal trade-off between the number of users n and the number of samples m per user required so that one can accurately and privately learn the mean of a distribution up to low error. Although such a trade-off was known when the number of users n is sufficiently large, it was poorly understood when n is small. Similar to the previous work on user-level differential privacy [19, 18], we assume that each sample is drawn independently from some unknown distribution D, with support entirely in an unknown ball B ⊂ R^d with center x ∈ R^d and radius r. (For simplicity, we suppose r = 1 in this subsection.) However, it is worth noting that our algorithms enjoy a high level of robustness. Specifically, we show that our algorithms still succeed even if the samples of each user are mildly correlated, or even if the data of 49% of the users is adversarially corrupted. It is easy to observe that it is impossible to recover the mean if the data of 51% of the users is adversarially corrupted. Next, we describe our contributions in more detail.

First, we provide an (ε, δ) user-level differentially private algorithm over n ≥ O((1/ε)·log(1/δ)) users that estimates the mean µ of the unknown distribution D up to error O(√(d/m)) (see Theorem 9 for details). Hence, in the infinite sample limit (i.e., when each user is given m → ∞ i.i.d. samples), we can estimate the mean µ perfectly with very high probability as long as we have n ≥ O((1/ε)·log(1/δ)) users. We complement this positive result with a matching lower bound that shows that, even if there are infinitely many samples per user, it is necessary to have Ω((1/ε)·log(1/δ)) users in order to privately estimate µ up to any finite accuracy (see Lemma 11 for details). Together, these fully resolve the question of how many users are needed to estimate the mean in the infinite sample limit, a question first investigated by Amin et al. [3] and then further studied by Levy et al. [18]. Previously, this was only established when the number of users n was at least Õ(√(d·log(1/δ))/ε), where the Õ hides logarithmic dependencies on d, m, and 1/ε, and doubly logarithmic dependencies on 1/δ [18].

Next, we obtain nearly tight bounds for the user-sample trade-off needed for private mean estimation. Specifically, when n ≥ O((1/ε)·log(1/δ)), we provide an (ε, δ) user-level differentially private algorithm that estimates the mean µ up to ℓ_2 error Õ(√(1/(mn)) + √(d/(m·n²·ε²))) with probability 1 − α. It is worth noting that the running time of our algorithm can be made almost linear in n, m, d, and log(1/α) (see Theorem 9 for details). This extends the algorithm of Levy et al. [18], which provided similar bounds but only when n ≥ Õ((1/ε)·√(d·log(1/δ))), to the regime where n can be very low, with no dependency on the dimension d. Moreover, our ℓ_2 error is known to be tight up to logarithmic factors when 1/δ is a sufficiently large polynomial in the dimension d [16, 18], implying that the error of our algorithm is nearly optimal.

As a corollary, we also obtain user-level private algorithms for the problem of learning discrete distributions. Specifically, we show that our results on mean estimation directly imply that one can learn a discrete distribution up to a total variation distance of Õ(√(d/(mn)) + d/(√m·n·ε)) as long as n ≥ O((1/ε)·log(1/δ)) (see Corollary 10 for details). Hence, our work extends the algorithm of Liu et al. [19], which provided comparable bounds for user-level private distribution learning but only when n ≥ Õ((1/ε)·√(d·log(1/δ))).

Our mean estimation algorithm is also directly applicable to a variant of stochastic gradient descent, and to the problems of empirical risk minimization and stochastic convex optimization. Indeed, [18] uses mean estimation as a black box to solve these problems with user-level privacy. As this part is essentially unchanged from [18], we briefly discuss how we extend these results in Appendix D.

Finally, we prove that our private algorithm for learning discrete distributions is nearly optimal in sample complexity when 1/δ is approximately polynomial in the dimension, both when the number of users n is low and when it is high. This resolves an open question of Liu et al. [19], who provided a comparable lower bound for (ε, 0)-differential privacy and asked whether one could extend the lower bound to (ε, δ)-differential privacy for nonzero δ. In addition, as a direct consequence, we have that our mean estimation algorithm is nearly optimal, though such a bound was already known [16, 18]. We reiterate that all of our algorithms work under a certain notion of robustness as well.
Namely, we show that if the samples for each individual user are slightly correlated, and even if 49% of the users have all of their data adversarially corrupted, we can still estimate the mean of the distribution accurately in the regime when the number of users n is at most O(√d).

Differential privacy was first introduced by Dwork et al. [10] and has since drawn significant attention. There have been several previous works on this topic, with a main focus on item-level differential privacy [11, 25, 15, 8, 1, 22, 13, 12, 27, 9]. Recently, Liu et al. [19] studied the problem of learning discrete distributions with user-level differential privacy and provided the first non-trivial user-level differentially private algorithm for this problem. Note that private learning of discrete distributions can be done via a private mean estimation algorithm. Later, Levy et al. [18] extended the work of Liu et al. and provided user-level differential privacy mechanisms for several learning tasks, namely, high-dimensional mean estimation, empirical risk minimization with smooth losses, stochastic convex optimization, and learning hypothesis classes with finite metric entropy. A core part of these algorithms was their algorithm for user-level differentially private mean estimation. User-level differential privacy has also been recently studied for learning models in the federated learning setting [4, 20, 21]. The concept of bounding user contribution, motivated by user-level differential privacy, has also been studied in the contexts of histogram estimation [3] and SQL [26].

In this section we explain our algorithmic techniques and the challenges we face. Our main innovation is a new private estimation method for the case where the number of users is very small and each user only has 1 item. From here, we show that it is simple to convert this to a general user-level private algorithm by having each user take the mean of their samples and treating the mean as a single item. So, we now focus on item-level differential privacy. We will create an item-level private algorithm that, if most items are concentrated in an unknown ball of radius 1, estimates the ball up to radius √d. While this may seem like a very poor estimate, it is in fact nearly optimal when the number of users is only O((1/ε)·log(1/δ)). In addition, in the user-level setting, the sample mean of each (uncorrupted) user's data will be tightly concentrated if the number of samples per user is large, so if we can get concentration γ/√d for the mean of each user, then we can privately estimate the distribution mean up to error γ.

The main challenge here is that previous algorithms decomposed the distribution into 1-dimensional components and combined the estimates from each dimension. By the strong composition theorem (see Appendix A or [11, Theorem 3.20] for the statement), one needs approximately O(ε/√d)-level differential privacy in each dimension to ensure ε-level differential privacy overall, which means that at least √d/ε samples are needed. Hence, we need a method that avoids decomposing the distribution into 1-dimensional pieces.

We develop an algorithm loosely based on the exponential mechanism. The rough outline of the standard exponential mechanism, used for 1-dimensional mean estimation, is to split the real line into intervals of length 1, and then select an interval [a, a + 1] with probability proportional to e^{ε·f(a)}, where f(a) is the number of data points in the interval [a, a + 1].
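As a concrete reference point, here is a minimal sketch of that 1-dimensional selection rule (a sketch only, following the description above; the bounded search window [lo, hi] is an assumption we add, and the variable names are ours):

import numpy as np

def exp_mech_interval(data, eps, lo, hi):
    # Candidate intervals [a, a+1) for integer a in [lo, hi).
    starts = np.arange(lo, hi)
    # f(a) = number of data points falling in [a, a+1).
    f = np.array([np.sum((data >= a) & (data < a + 1)) for a in starts])
    # Select an interval with probability proportional to e^{eps * f(a)};
    # subtracting the max score keeps the exponentials numerically stable.
    w = np.exp(eps * (f - f.max()))
    a = np.random.choice(starts, p=w / w.sum())
    return float(a)  # left endpoint of the chosen interval

Since moving one data point changes each count f(a) by at most 1, the selection probabilities change by at most an e^{±2ε} factor between neighboring datasets, which is the usual accounting for the exponential mechanism.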
In d dimensions, a first attempt at modifying the algorithm is as follows. Sample each point p ∈ R^d with probability proportional to e^{ε·f(p)}, where f(p) is the number of data points that are within some distance T of p. Unfortunately, this attempt runs into a major problem. Namely, if we are promised that the mean is in some large ball of radius B, the total volume of candidate points is proportional to B^d, whereas the set of points within T of the data points has volume proportional to approximately T^d. So, to ensure that we usually pick a point within T of the data points, we must have that e^{ε·n} · T^d > B^d, since e^{ε·0} = 1 and e^{ε·f(p)} ≤ e^{ε·n} even for points very close to the ball of radius 1 containing all points. So, unless T is almost the same size as B, we will need n ≥ Ω(d/ε), which is far too large.

To fix this, we only sample points p with f(p) ≥ 1. However, we then lose differential privacy, since changing a single data point may convert f(p) from 1 to 0 for many points p, which means that the probability of sampling such a p goes from proportional to e^ε to proportional to 0, which is not an e^{±ε} multiplicative change. To fix this, we add a "garbage bucket" that we sample from with a certain probability. By choosing the garbage bucket probability correctly, we ensure approximate (ε, δ)-differential privacy, because we show that we sample from the garbage bucket with probability significantly more than we sample points p with f(p) = 1. In addition, we show that as long as T = O(√d), we sample a point with large f(p) with high probability and select the garbage bucket with very low probability, assuming that most points are close together. To establish this, we prove a technical lemma showing that the intersection of several close balls of radius √d still has a comparatively large volume, which means there is a large volume of points p with large f(p).

Unfortunately, this sampling method is very hard to run, and in fact takes exponential time, because the way that n balls intersect in high dimensions can be very complicated. To make this efficient, we develop an efficient rejection sampling procedure. While this rejection sampling procedure is fast if all data points are close together, it can be very slow for worst-case datasets, so we have to stop the rejection sampling after a certain number of rounds. Unfortunately, stopping the rejection sampling after a fixed number of rounds can cause privacy to break, because changing a single point may cause the acceptance probability of the rejection sampling algorithm to change significantly. To fix this, we develop a privacy-preserving rejection sampling algorithm: we initialize a geometric random variable X and run the rejection sampling algorithm X times. By setting the decay rate of X appropriately, we are able to obtain privacy while ensuring an efficient runtime. (A runnable sketch of this capped rejection loop appears after the pseudocode in Subsection 3.2 below.)

Our techniques for establishing a tight user-sample tradeoff (when (1/ε)·log(1/δ) ≪ n ≪ √d·(1/ε)·log(1/δ)) are based on combining our methods for n = O((1/ε)·log(1/δ)) with the Fast Johnson-Lindenstrauss method [2] and strong composition theorems. Since this idea of using Fast Johnson-Lindenstrauss was used in [18] (in the regime n ≫ √d·(1/ε)·log(1/δ)) and we apply it in a similar manner, we defer our discussion of this part to Subsection 3.3. We now move on to describing our lower bound techniques.
Our main contribution is to show a tight lower bound for user-level privately learning an unknown discrete distribution, which can be thought of as a special case of mean estimation. Hence, we obtain a tight lower bound for user-level mean estimation as well. For mean estimation, a user-level lower bound follows from the item-level lower bound techniques of [16]. However, their techniques crucially use the fact that each coordinate is independent (such as for spherical Gaussian distributions), whereas in the setting of learning discrete distributions, the coordinates are in fact negatively correlated. However, we are able to combine their techniques with a Poissonization method. Namely, rather than treating each user as having received m samples from a distribution over [d], we assume that each user has received Pois(m) samples. In this case, if the distribution were known to have p_i as the probability of sampling i, then the number of samples that equal i would have distribution Pois(m·p_i) and would be independent across i. In the case when each p_i is generated independently and each user is given an independent sample from Pois(m·p_i) for all i ∈ [d], we are able to apply the methods of [16], albeit with significantly different computations.

Our final piece is to show that if we can privately learn discrete distributions, then we can privately learn the Poisson variables as well, which allows us to also get a lower bound for discrete distribution learning. For this, we treat each p_i as an independent random variable, and consider the discrete distribution p̃ with probabilities p̃_i := p_i/(p_1 + · · · + p_d), i.e., normalized to have overall probability 1. From here, we show how to simulate the multivariate Poisson distribution (Pois(m·p_i))_{i=1}^d given m samples from the discrete distribution p̃, which we use to prove that privately learning discrete distributions is at least as hard as privately learning Poisson distributions. One issue is that we do not know the sum Σ_{i=1}^d p_i, since it has been normalized away in forming p̃, but we show how to deal with this by privately learning a one-dimensional Poisson.

In this section, we assume that every user only has a single point (item) in R^d, and we develop a robust and differentially private algorithm for estimating a crude approximation of their location when the number of points is very small. In the next section, we use the methods of the item-level algorithm to develop a user-level differentially private algorithm. In Subsection 3.1, we give an algorithm for estimating O((1/ε)·log(1/δ)) points that may run in exponential time. In Subsection 3.2, we improve the algorithm to run efficiently. Finally, in Subsection 3.3, we give an algorithm for estimating n points accurately when n is between these two regimes.

We start by stating the theorem, which establishes an (ε, δ)-differentially private estimation algorithm, but with no guarantees on the runtime.

Theorem 2. Let n ≥ C·(1/ε)·log(1/δ) for a sufficiently large constant C, and fix a positive real number r. Then, there exists an algorithm over n points x_1, . . . , x_n ∈ R^d with the following guarantees. 1. The algorithm is (ε, δ)-differentially private over arbitrary datasets {x_1, . . . , x_n}. 2. If there exists a ball of radius r, centered at some unknown point x, that contains at least 2/3 of the points x_1, . . . , x_n, then with success probability 1 the algorithm will return a point x′ such that ‖x′ − x‖_2 ≤ O(√d · r).

The algorithm works as follows.
Define B_i to be the (closed) ball of radius r·√d around each point x_i, and for each point p ∈ R^d, define f(p) to be the number of balls B_i containing p. Now, consider the following (unnormalized) distribution over R^d. Let the distribution have density proportional to D(p) = e^{ε·min(f(p), 2n/3)} if f(p) > 0, and density D(p) = 0 if f(p) = 0. Finally, we add a "garbage bucket" with a point mass proportional to (4/δ)·V_B, where V_B is the volume of each ball B_i.

To establish accuracy of our algorithm, we will require the following lemma (Lemma 3), which states that the volume of the intersection of the balls B_i is large if they are all close together. This will become crucial in verifying Condition 2 of Theorem 2, because it establishes that the probability of picking a point in the intersection of the balls that are close together overwhelms the probability of picking a point far away from these balls. In addition, we will need to check privacy (i.e., Condition 1 of Theorem 2). Roughly, this will follow because the garbage bucket has mass much more than the total density of points with f(p) = 1 (which are the only points whose density can change by more than an e^{±ε} factor after a single item is modified). However, the mass of the garbage bucket will be much less than the total density of points with large f(p), after scaling by the e^{ε·min(f(p), 2n/3)} factor, using Lemma 3. Combined, these observations will allow us to prove Theorem 2. The formal proofs of Lemma 3 and Theorem 2 are deferred to Appendix B.1.

The main issue with the algorithm described in Subsection 3.1 is that it is difficult to run efficiently. However, in this subsection, we demonstrate a modification of the previous algorithm that ensures the algorithm runs efficiently, even for worst-case datasets. We now describe our faster algorithm. First, initialize a random variable X ∼ Geom(1/N), where N = Θ(e^{10·√(log n)} · α^{−1}). Here, n is the number of data points, and α will represent our (preliminary) failure probability. Now, we consider the following rejection sampling-based algorithm. We first pick an index i ∈ [n + 1], where we choose each i ∈ [n] with probability proportional to e^{ε·2n/3}, and choose n + 1 with probability proportional to 4n/δ. If we picked i ≤ n, we sample a uniform point p in B_i, and accept p with probability (1/3)·(n/f(p))·e^{ε·(min(f(p), 2n/3) − 2n/3)}. Else, if we picked n + 1, we pick a garbage bucket that we label G_1 and accept with probability 1/3. If we reject, then we retry the algorithm. We retry this algorithm a total of X ∼ Geom(1/N) times, and if the algorithm rejects all X times, we automatically select a second garbage bucket G_2 as the estimate. We provide the pseudocode for this procedure in Figure 1.

Next, we check that Algorithm 1 (in Figure 1) is a valid algorithm. To do so, we have the following proposition.

Proposition 4. For any point p with f(p) ≥ 1, the acceptance probability (1/3)·(n/f(p))·e^{ε·(min(f(p), 2n/3) − 2n/3)} is at most 1/2. Hence, it is a valid probability.

Thus, our algorithm is actually a valid algorithm, and in fact in each round of the rejection sampling, we never accept with probability more than 1/2. Next, we show that Algorithm 1 returns a point from the same distribution as the previous algorithm of Subsection 3.1, conditioned on not choosing G_2. Specifically, we show the following proposition.

Proposition 5. Consider a single rejection round of Algorithm 1.
Then, conditioned on accepting p or returning G_1, we sample each point p with density proportional to e^{ε·min(f(p), 2n/3)} and sample G_1 with mass proportional to (4/δ)·V_B.

Therefore, if we let X = ∞ and keep repeating the loop until we return a point (or G_1), the final point that we choose has the same distribution as in the algorithm described in Subsection 3.1, where the garbage bucket is G_1. However, to prevent the algorithm from possibly taking exponential time, we only perform the rejection sampling X ∼ Geom(1/N) times.

Algorithm 1 Private Mean Estimation over Few Samples
1: procedure DP-Estimate-1(x_1, . . . , x_n, α, ε, δ)
2: Let N ← Θ(e^{10·√(log n)} · α^{−1}), and sample X ∼ Geom(1/N).
3: for rep = 1 to X do
4: Sample i ∼ [n + 1], with each 1 ≤ i ≤ n proportional to e^{ε·2n/3} and n + 1 proportional to 4n/δ.
5: if i ≤ n then
6: Sample a uniform point p in B_i.
7: Return p with probability (1/3)·(n/f(p))·e^{ε·(min(f(p), 2n/3) − 2n/3)}.
8: else
9: Return G_1 with probability 1/3.
10: end if
11: end for
12: Return G_2.
13: end procedure

Our main additional challenge is to show that we still preserve differential privacy, while also maintaining accuracy, even with a limited number of rounds. We say that Algorithm 1 accepts if the final estimate is not G_2. From Proposition 5, we clearly have that conditioned on Algorithm 1 accepting, the output distribution is the same as the output distribution of the exponential-time algorithm. Hence, the main additional results we must prove are, first, that the probability of Algorithm 1 accepting is high if most of the points x_1, . . . , x_n are concentrated, and second, that the probability of Algorithm 1 accepting does not change by much among neighboring datasets. Indeed, we prove such a result: Lemma 18, whose formal statement and proof we defer to Appendix B.2. From here, we are able to establish our main result for item-level privacy with few users.

Theorem 6. Suppose that we are given input points x_1, . . . , x_n ∈ R^d, where n ≥ C·(1/ε)·log(1/δ) for a sufficiently large constant C and for ε, δ, α ≤ 1/3. Then, if at least 2/3 of the points x_i are contained in a ball centered around some x of radius r, where r is known but x is unknown, Algorithm 1 finds a point within O(√d · r) of x with probability at least 1 − 2α; moreover, it is (O(ε), O(δ))-differentially private even for arbitrary datasets, and runs in expected time O(e^{10·√(log n)}·n·d·α^{−1}).

Note that if α^{−1} = (nd)^{o(1)}, then our algorithm runs in time almost linear in n and d, so the runtime is nearly optimal. In addition, as we will show in Lemma 11, the choice of n = C·(1/ε)·log(1/δ) is optimal. One potential downside of Algorithm 1 is that the runtime depends inversely on the failure probability α. If we wish to succeed with very low (such as exponentially small) failure probability, the runtime can be very slow. This, however, can be fixed with only a slight loss in the number of samples. To do so, we let 1/3 represent the "initial" failure probability. If we wish for an overall 1 − α success probability, we run the algorithm O(log(1/α)) times, and return the coordinate-wise median of all returned points which are not G_1 or G_2. In this case we obtain the following corollary of Theorem 6:

Corollary 7. Consider the same setup as in Theorem 6, except that n ≥ C·((1/ε)·log(1/δ)·log(1/α)). Then, for any α ≤ 1/3, there is an algorithm that runs in time O(n^{1+o(1)}·d·log(1/α)) and is (ε, δ)-differentially private even for arbitrary datasets, such that if at least 2/3 of the points x_1, . . . , x_n are in a ball of radius O(r), with probability 1 − α the output is within O(√d · r) of this ball.

Remark. As long as α ≥ exp(−d^{o(1)}), the required dependence of n on d is subpolynomial, i.e., d^{o(1)}, and the runtime is almost linear, i.e., (nd)^{1+o(1)}. Thus, we can even have cryptographically small failure probability in almost linear time.
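For readers who want to experiment, here is a compact Python sketch of Algorithm 1 (a sketch only: the constants follow the description above, the ball radius is taken as r·√d, and the overflow guard is ours):

import numpy as np

def f_count(p, xs, radius):
    # f(p): number of balls B_i (radius r*sqrt(d) around each x_i) containing p.
    return int(np.sum(np.linalg.norm(xs - p, axis=1) <= radius))

def sample_in_ball(center, radius):
    # Uniform sample from the d-dimensional ball of the given radius.
    d = center.shape[0]
    g = np.random.randn(d)
    u = np.random.rand() ** (1.0 / d)
    return center + radius * u * g / np.linalg.norm(g)

def dp_estimate_1(xs, alpha, eps, delta, r=1.0):
    n, d = xs.shape
    radius = r * np.sqrt(d)
    N = int(np.exp(10 * np.sqrt(np.log(n))) / alpha)  # N = Theta(e^{10 sqrt(log n)} / alpha)
    X = np.random.geometric(1.0 / N)                   # number of rejection rounds
    # Index distribution: each i in [n] has weight e^{eps*2n/3}; index n+1 has weight 4n/delta.
    # Computed in log-space to avoid overflow.
    log_w_pt, log_w_g = eps * 2 * n / 3, np.log(4 * n / delta)
    p_garbage = 1.0 / (1.0 + n * np.exp(min(log_w_pt - log_w_g, 700.0)))
    for _ in range(X):
        if np.random.rand() < p_garbage:               # picked index n+1
            if np.random.rand() < 1 / 3:
                return "G1"
        else:                                          # picked a uniform index i in [n]
            p = sample_in_ball(xs[np.random.randint(n)], radius)
            fp = f_count(p, xs, radius)
            accept = (n / (3 * fp)) * np.exp(eps * (min(fp, 2 * n / 3) - 2 * n / 3))
            if np.random.rand() < accept:
                return p
    return "G2"

By Proposition 4 the acceptance probability above never exceeds 1/2, and by Proposition 5 the accepted point follows the exponential-mechanism-style density from Subsection 3.1.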
To obtain Corollary 7 from Theorem 6, the privacy guarantee roughly follows from standard composition theorems of differential privacy. The accuracy guarantee follows from the fact that if at least 2/3 of a set of points p_1, . . . , p_k (where we set k = O(log(1/α)) and think of the p_i's as the returned points) are all within distance O(√d · r) of the unknown center x, then so is their coordinate-wise median. We provide the full proofs of Propositions 4 and 5, Theorem 6, and Corollary 7 (as well as the statement and proof of Lemma 18) in Appendix B.2. We also provide pseudocode corresponding to Corollary 7 in Appendix B.2.

In Subsections 3.1 and 3.2, we focused on the problem when n ≈ (1/ε)·log(1/δ), and obtained error O(r·√d) in our estimate. This contrasts with the previous work of Levy et al. [18], which proved that as long as n ≥ Õ(√(d·log(1/δ))/ε), we are able to estimate the sample mean up to error r·Õ(1/√n + √(d·log(1/δ))/(n·ε)). Here, the Õ also hides logarithmic factors in the ratio B/r, where B is a promise on the maximum distance of any point from the origin. Thus, ignoring all logarithmic factors, [18] proved that we can estimate the mean up to error roughly r·(1/√n + √d/(n·ε)) as long as n is greater than roughly √d/ε. In Subsections 3.1 and 3.2, we proved this claim even when n is approximately log(1/δ)/ε. In this subsection, we combine our algorithm from Subsection 3.2 with the method used in [18], which allows us to extend their result from the regime n ≥ Õ(√(d·log(1/δ))/ε) to the full regime n ≥ O((1/ε)·log(1/δ)).

The idea of Levy et al. [18] was to use the "Fast Johnson-Lindenstrauss" technique of Ailon and Chazelle [2], which randomly rotates the data efficiently, and then to perform a private 1-dimensional mean estimation algorithm in each coordinate. To briefly describe the Fast Johnson-Lindenstrauss method: applied to a vector v, it outputs (1/√d)·H·D·v. Here, D is a diagonal d × d matrix with each diagonal entry randomly chosen from ±1, and H is the d × d Hadamard matrix (we will assume that d is a power of 2). This procedure is an invertible linear map, and is known to be computable in time O(d·log d) for an individual vector, so it takes time O(d·log d·n) to perform on all of x_1, . . . , x_n (where we use the same matrix D for each x_i, rather than re-initializing the randomness, to preserve linearity). The main advantage of this procedure is that if the points are concentrated in a ball of radius r, then after the random rotation, the points are concentrated in an interval of length O(√(log(d·n/α)/d)) in all d dimensions, with probability at least 1 − α. Hence, Levy et al. [18] reduce d-dimensional estimation to 1-dimensional estimation.

We now describe our algorithm, which proceeds in a similar manner to Levy et al. [18], but we reduce to estimation in dimension d/k² for some 1 ≤ k² ≤ d and apply Theorem 6, rather than reducing to 1-dimensional estimation. Fix some k ≤ √d, and assume WLOG that k, d are powers of 2 (only the order of magnitude of k and d will matter). Next, after applying the Fast JL rotation, we partition the d coordinates into k² blocks of dimension d/k², and run the algorithm of Theorem 6 on each block x^(s) to obtain an estimate x̃^(s) for x^(s); we then concatenate these to obtain x̃ ∈ R^d. Finally, we undo the Fast JL operation by returning √d·(HD)^{−1}·x̃. Using [2], strong composition of privacy, and Theorem 6, we obtain the following theorem.

Theorem 8. Fix 1 ≤ k ≤ √d, and suppose n ≥ C·(k/ε)·log(k/δ)·√(log(1/δ)) for a sufficiently large constant C. Then, there is an (ε, δ)-differentially private algorithm on x_1, . . . , x_n in the worst case, such that if at least 2/3 of the points x_1, . . . , x_n are all in a ball of radius 1 centered at some unknown point x, then with probability 1 − α, the algorithm will return a point within O(√(d·log(dn/α))/k) of x.

Remark.
If one wishes for a runtime that is almost linear in n and has logarithmic dependence on the failure probability α, as in Corollary 7, this can be done by dividing the privacy parameters by log(1/α). So, if the number of samples n is an additional log(1/α) factor larger, the same error guarantee holds with probability 1 − α, where the Õ hides logarithmic factors in d, α^{−1}, and n, and the runtime is improved to O(n^{1+o(1)}·d·log(1/α) + n·d·log d). We prove Theorem 8 in Appendix B.3.

In the user-level setup, we suppose that there are n users, each of which is given m data samples from R^d. Let X_{i,j} ∈ R^d represent the jth sample of the ith user. We suppose that each sample X_{i,j} has marginal distribution D, which has unknown mean µ ∈ R^d, but with a promise that E‖X_{i,j} − µ‖²_2 ≤ r² for some known r > 0. In addition, we assume that the data is independent across users. However, among the m samples of any individual uncorrupted user, the data may not be entirely independent, but the samples do not have significant covariance between them. Namely, for any user i and samples j ≠ j′, we assume E⟨X_{i,j} − µ, X_{i,j′} − µ⟩ ≤ r²/m. We do not make any restrictions on how negative E⟨X_{i,j} − µ, X_{i,j′} − µ⟩ could be. Finally, we allow for up to n/4 of the users to have their data corrupted adversarially, which means that after the data X = {X_{i,j}} has been generated, an adversary may look at X, and then choose n/4 of the users and alter all of their data in any way. We remark that n/4 could be increased to κ·n for any constant κ < 1/2, with minor modifications to the overall algorithm.

Our results in the user-level setting are a simple extension of the item-level algorithms (either Theorem 6, Corollary 7, or Theorem 8). Indeed, the algorithm simply takes the mean of each user's data, and then applies the item-level algorithm on the user means. So, we will just directly state the main results. For the setup as described in the above paragraph, we prove the following.

Theorem 9. Let n ≥ C·(1/ε)·log(1/(α·δ)) for a sufficiently large constant C. There exists an algorithm taking as input parameters α, ε, δ and a dataset X = {X_{i,j}} with n users, m samples per user, and each X_{i,j} ∈ R^d, with the following properties: 1. If the data is generated via the procedure described above, then the algorithm reports a point µ̂ such that with probability at least 1 − α, ‖µ̂ − µ‖_2 ≤ Õ(r·(√(1/(mn)) + √(d/(m·n²·ε²)))). 2. The algorithm is (ε, δ) user-level differentially private even over worst-case datasets, where the ith user is given the samples X_{i,1}, . . . , X_{i,m}. 3. The algorithm runs in expected time O(mnd + n·d·log d + n^{1+o(1)}·d·α^{−1}) over worst-case datasets. In addition, the algorithm can be made to run in time O(mnd + n·d·log d + n^{1+o(1)}·d·log(1/α)) with failure probability α, at the cost of an additional log(1/α) factor in n.

More simply, by ignoring covariance between users, adversarial robustness, and all logarithmic factors, and by combining with the previous results of [19, 18], we can state a simpler version of the main theorem, which also takes into account the question of learning a discrete distribution up to low total variation distance.

Corollary 10. Let n ≥ C·(1/ε)·log(1/(α·δ)). Suppose we have n users, each with m data samples X = {X_{i,j}}. Then, there is an algorithm taking as input X that is (ε, δ) user-level private in the worst case, such that if each X_{i,j} were drawn i.i.d. from an unknown distribution D entirely supported in an (unknown) ball of radius r, the algorithm learns the mean µ of D up to ℓ_2 error Õ(r·(√(1/(mn)) + √(d/(m·n²·ε²)))) with failure probability α. Here, we use Õ to drop all logarithmic dependencies on α^{−1}, δ^{−1}, ε^{−1}, m, n, and d. Finally, if instead each data sample X_{i,j} were drawn from a discrete set [d], there is an (ε, δ) user-level private algorithm such that, if each sample were drawn i.i.d. from a distribution D over [d], the algorithm learns D up to total variation distance Õ(√(d/(mn)) + d/(√m·n·ε)) with failure probability α.
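The reduction behind the discrete-distribution part of Corollary 10 is mechanical, and the sketch below makes it explicit (assuming the dp_estimate_1 sketch from earlier as the item-level routine; the radius handling and the treatment of failure outputs are simplifications of ours):

import numpy as np

def one_hot(samples, d):
    # Encode each sample in [d] (0-indexed here) as a standard basis vector of R^d.
    v = np.zeros((len(samples), d))
    v[np.arange(len(samples)), samples] = 1.0
    return v

def dp_learn_discrete(samples_per_user, d, alpha, eps, delta):
    # User-level reduction: each user averages the one-hot encodings of their
    # m samples; the item-level estimator then runs on the n user means.
    means = np.stack([one_hot(s, d).mean(axis=0) for s in samples_per_user])
    m = len(samples_per_user[0])
    p_hat = dp_estimate_1(means, alpha, eps, delta, r=10.0 / np.sqrt(m))
    if isinstance(p_hat, str):
        raise RuntimeError("estimation returned a garbage bucket")
    # Clip and renormalize so the output is a valid probability vector.
    p_hat = np.clip(p_hat, 0.0, None)
    return p_hat / p_hat.sum()

Since the total variation distance is half the ℓ_1 distance, and ℓ_1 ≤ √d·ℓ_2, an ℓ_2 guarantee for the mean immediately yields the total variation guarantee stated in Corollary 10.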
Note that Corollary 10 holds both in the low-user and the high-user setting. We prove both Theorem 9 and Corollary 10, and also provide pseudocode, in Appendix B.4.

In this section, we provide lower bounds complementing our results from the previous sections. The first lower bound is simple, and states that any private learning of the distribution requires at least Ω((1/ε)·log(1/δ)) users under reasonable conditions on ε, δ.

Lemma 11. Suppose we have n users, each with m data points all contained in an unknown ball of radius 1. Then, learning this ball up to any finite error T, with probability 2/3 and with (ε, δ) user-level privacy, cannot be done if n ≤ (1/3)·(1/ε)·log(1/δ), assuming that δ ≤ ε².

Next, we provide a lower bound for privately learning discrete distributions in the user-level setting, answering an open question of [19] by allowing δ to be just moderately small, as opposed to 0. Our overall method is similar to the lower bounds of [16], but with an additional "Poissonization trick" to overcome the issue of lack of independence between coordinates. In addition, we also must provide a reduction from learning discrete distributions to learning Poisson variables.

Theorem 12. Suppose there exists an (ε, δ) user-level differentially private algorithm M that takes as input X = {X_{i,j}}, where i ranges from 1 to n, j ranges from 1 to m, and X_{i,j} ∈ [d] represents the jth sample of user i. Then there exists a distribution p̃ over [d] such that, if each X_{i,j} is drawn i.i.d. ∼ p̃, then with probability at least 1/3, the total variation distance between M(X) and p̃ is at least Ω̃(√(d/(mn)) + d/(√m·n·ε)).

Note that Theorem 12 is almost tight compared to Corollary 10, up to logarithmic factors, and holds in both the low-user and high-user settings. We did not attempt to optimize the logarithmic factors. We prove Lemma 11 and Theorem 12 in Appendix C. Finally, we note that Corollary 10 does not hold if we also require adversarial robustness when the number of users n is much more than √d. Specifically, we show that if even a small constant fraction of users is corrupted, one cannot estimate the mean to better than r/√m with high probability, even if all points are in a known ball of radius r. We defer the formal statement and proof to Appendix C as well.

In this section, we state the well-known weak and strong composition theorems of differential privacy, for completeness. First, we state the Weak Composition Theorem, which can be found, for instance, at [11, Theorem 3.16].

Theorem 13. (Weak Composition Theorem) Let M_1, . . . , M_k be algorithms on a dataset X, such that each M_i is (ε_i, δ_i)-differentially private for some ε_i, δ_i ≥ 0. Then, the concatenation M(X) = (M_1(X), . . . , M_k(X)) is (ε, δ)-differentially private, where ε = ε_1 + · · · + ε_k and δ = δ_1 + · · · + δ_k.

Next, we state the Strong Composition Theorem, which can be found, for instance, at [11, Theorem 3.20].

Theorem 14. (Strong Composition Theorem) Let ε, δ, δ′ ≥ 0 be arbitrary parameters. Let M_1, . . . , M_k be algorithms on a dataset X, such that each M_i is (ε, δ)-differentially private. Then, the concatenation M(X) = (M_1(X), . . . , M_k(X)) is (ε̃, k·δ + δ′)-differentially private, where ε̃ = √(2k·ln(1/δ′))·ε + k·ε·(e^ε − 1).
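In this numerical form, the strong composition bound is easy to evaluate; the following one-function sketch (ours, for illustration only) computes ε̃ from Theorem 14:

import math

def strong_composition_eps(eps, k, delta_prime):
    # epsilon-tilde = sqrt(2k ln(1/delta')) * eps + k * eps * (e^eps - 1)
    return math.sqrt(2 * k * math.log(1 / delta_prime)) * eps \
        + k * eps * (math.exp(eps) - 1)

# Example: composing k = 100 mechanisms, each (0.1, 0)-DP, with slack delta' = 1e-6,
# gives total epsilon of roughly 6.3 (versus 10 under weak composition).
print(strong_composition_eps(0.1, 100, 1e-6))

This is the accounting used throughout the paper: to keep a target ε after composing k pieces, each piece is run with privacy parameter roughly ε/√(k·log(1/δ′)).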
In this section, we provide missing proofs for both the item-level and user-level algorithms. In Subsection B.1, we prove all of the results from Subsection 3.1. In Subsection B.2, we prove all of the results from Subsection 3.2. In Subsection B.3, we prove all of the results from Subsection 3.3. Finally, in Subsection B.4, we prove all of the results from Section 4.

In this subsection, our goal is to prove Theorem 2. We start by proving Lemma 3, but before we do, we first prove an auxiliary result bounding the probability that a point on a ball has too large a first coordinate. This result is relatively standard, but we prove it here to get explicit constants in our bound.

Proposition 15. Let n, d both be at least some sufficiently large constant, and choose a uniformly random point (x_1, . . . , x_d) in the closed ball of radius √d centered at the origin. Then, the probability that x_1 ≥ 5·√(log n) is at most 1/(2n).

Proof. It suffices to prove the same result with (x_1, . . . , x_d) chosen uniformly from the sphere of radius √d (i.e., where x_1² + · · · + x_d² = d), and such a point can be written as √d·(Z_1, . . . , Z_d)/‖Z‖_2, where Z_1, . . . , Z_d are i.i.d. standard Gaussians. So, the probability that x_1 ≥ 5·√(log n) is at most the probability that Z_1 ≥ 2·√(log n) plus the probability that Z_1² + · · · + Z_d² ≤ d/6, since one of these two events must occur in order for x_1 to be at least 5·√(log n). The probability that a standard Gaussian is at least 2·√(log n), where n is sufficiently large, is at most e^{−(2·√(log n))²/2} ≤ 1/n². To bound the probability that Z_1² + · · · + Z_d² ≤ d/6, note that Z_1² + · · · + Z_d² ∼ χ²_d, and it is well known that the chi-square χ²_d random variable has PDF proportional to z^{d/2−1}·e^{−z/2}; integrating this density over [0, d/6] shows that the probability is at most d^{O(1)}·(1/6)^{d/2}·e^{d/2 − d/12} ≤ e^{−0.4·d} for sufficiently large d. So, overall, the probability that x_1 ≥ 5·√(log n) is at most 1/n² + e^{−0.4·d}, which is at most 2/n² ≤ 1/(2n) for sufficiently large n, unless e^{−0.4·d} > 1/n², which would mean that d ≤ 5·log n. But in this case, x_1 ≤ √d ≤ √(5·log n), so we have that x_1 ≥ 5·√(log n) with probability 0. This concludes the proof.

Proof of Lemma 3. First, assume WLOG that r = 1, which we can assume by scaling, and that the center of the radius-1 ball containing the points is the origin. Next, note that the ball of radius √d − 2 around x_1 is contained in ∩_i B_i, so the intersection is nonempty; to obtain the sharper volume bound, we lower bound the volume of points p with ‖p‖²_2 ≤ d − 1 − 10·√(log n) that lie in every ball B_i. For a fixed i, since ‖x_i‖_2 ≤ 1, such a point p lies in B_i whenever ⟨p, x_i/‖x_i‖_2⟩ ≥ −5·√(log n). However, by rotational symmetry of the sphere, the probability of this event (for a uniformly random such p) is equal to the probability that the first coordinate of p is at least −5·√(log n). Since the ball has radius √(d − 1 − 10·√(log n)) ≤ √d, the probability of this event is at least the probability that the first coordinate of a point picked in the ball of radius √d is at least −5·√(log n). By Proposition 15, the probability of this not occurring is at most 1/(2n), which means, by the union bound, that the probability of a random point p with ‖p‖²_2 ≤ d − 1 − 10·√(log n) not being in all of the balls is at most (1/(2n))·n = 1/2, for any fixed points x_1, . . . , x_n. Therefore, Vol(∩_i B_i) ≥ (1/2)·((d − 1 − 10·√(log n))/d)^{d/2}·V_B ≥ (1/2)·e^{−(1 + 10·√(log n))}·V_B. To see why the last inequality holds: since d is sufficiently large and 10·√(log n) ≤ 5·√d, we have (1 + 10·√(log n))/d ≤ 1/2, and (1 − y)^{d/2} ≥ e^{−y·d} for 0 ≤ y ≤ 1/2, applied with y = (1 + 10·√(log n))/d.

Proof of Theorem 2. First, we note that we can make the following three assumptions. • r = 1. This is an assumption we may trivially make by scaling. • ε ≥ δ. Note that if ε ≤ δ, then any (δ/2, δ/2)-differentially private algorithm is known to be (0, δ)-differentially private, and in this case it suffices to design a (δ/2, δ/2)-differentially private algorithm, for which the assumption holds. • In Condition 2, the success probability is only 1 − δ instead of 1. Since we are allowing exponential-time algorithms, if we have an algorithm only succeeding with probability 1 − δ, the algorithm can verify that at least 2/3 of the points lie in a ball of radius r = 1, and in this case condition on returning a point x′ of distance at most O(√d) from each of those x_i's, via rejection sampling. This changes the total variation distance of the algorithm's output distribution by at most δ, so the algorithm is still (ε, O(δ))-differentially private. Now, we state and prove two propositions, which establish accuracy and privacy, respectively, of our algorithm.

Proposition 16.
If n ≥ O((1/ε)·log(1/(α·ε·δ))), and at least 2/3 of the points x_1, . . . , x_n are all in some unknown ball B centered at x of radius 1, then with probability at least 1 − α, the algorithm chooses a point within 1 + √d of x.

Proof. Suppose we do not choose such a point. Then, either we chose the garbage bucket, or we chose a point p that was not in B_i for any i with ‖x_i − x‖_2 ≤ 1, in which case f(p) ≤ (1/3)·n. Now, note that the volume of ∩_{i: ‖x_i − x‖_2 ≤ 1} B_i is at least (1/2)·e^{−10·√(log n)}·V_B by Lemma 3 (replacing n with 2n/3), and that f(p) ≥ (2/3)·n for any p ∈ ∩_{i: ‖x_i − x‖_2 ≤ 1} B_i. Therefore, the probability density function in this region is proportional to e^{(2/3)·ε·n}, whereas the mass of the garbage bucket is proportional to (4/δ)·V_B. So, we choose the garbage bucket with probability at most ((4/δ)·V_B) / (e^{(2/3)·ε·n}·(1/2)·e^{−10·√(log n)}·V_B) = (8/δ)·e^{−((2/3)·ε·n − 10·√(log n))}. So, as long as (2/3)·ε·n − 10·√(log n) ≥ log(8/δ) + log(2/α) = log(16/(α·δ)), we choose the garbage bucket with probability at most α/2. Next, note that the total volume of points for which f(p) ≥ 1 is at most n·V_B, and each such point outside ∩_{i: ‖x_i − x‖_2 ≤ 1} B_i has density at most e^{(1/3)·ε·n}, which means that we choose a point p not in ∩_{i: ‖x_i − x‖_2 ≤ 1} B_i with probability at most (n·V_B·e^{(1/3)·ε·n}) / ((1/2)·e^{−10·√(log n)}·V_B·e^{(2/3)·ε·n}) = 2n·e^{10·√(log n)}·e^{−(1/3)·ε·n}. So, as long as (1/3)·ε·n − 11·log n ≥ log(4/α), we choose a point p not within 1 + √d of x with probability at most α/2. Therefore, it suffices for n to satisfy (2/3)·ε·n − 10·√(log n) ≥ log(16/(α·δ)) and (1/3)·ε·n − 11·log n ≥ log(4/α). For these to be true, it suffices that n ≥ (30/ε)·√(log n), that n ≥ (3/ε)·log(16/(α·δ)), that n ≥ (6/ε)·log(4/α), and that n ≥ (66/ε)·log n. These are all clearly true as long as n ≥ C·(1/ε)·log(1/(α·ε·δ)) for a sufficiently large constant C. This concludes the proof.

Proposition 17. This algorithm is (ε, δ)-differentially private.

Proof. We just need to see what happens when a single ball is moved. For any point p, the number of balls B_i containing p (which we called f(p)) does not change by more than 1. Hence, the relative (unnormalized) density function at p does not change by more than an e^ε factor, unless f(p) changes from 0 to 1 or from 1 to 0. However, this only occurs within a region of volume at most V_B, so the unnormalized density of points p for which f(p) goes from 1 to 0 or from 0 to 1 is at most e^ε·V_B (for each), and the total such unnormalized density is at most 2·e^ε·V_B ≤ 4·V_B. So, the corresponding amount of normalized density cannot be more than δ, since the garbage bucket has unnormalized density (4/δ)·V_B, which is at least 1/δ times the total density of what we are modifying by more than an e^ε factor. Thus, we have (ε, δ)-differential privacy.

By the initial WLOG assumptions, we may assume that r = 1, that ε ≥ δ, and that α = δ. So, by Proposition 16, we only need O((1/ε)·log(1/δ)) samples. This proves Theorem 2.

In this subsection, we prove all results omitted from Subsection 3.2, and provide pseudocode for Corollary 7. First, we prove Propositions 4 and 5.

Proof of Proposition 4. If f(p) ≥ 2n/3, the acceptance probability equals n/(3·f(p)) ≤ 1/2, since e^{ε·(min(f(p), 2n/3) − 2n/3)} = 1. Otherwise, the acceptance probability is (1/3)·(n/f(p))·e^{ε·(f(p) − 2n/3)}, so if we define x = ε·2n/3 and y = ε·f(p), then the probability equals (1/2)·(e^y/y)/(e^x/x). For n ≥ C·(1/ε)·log(1/ε), we have e^x/x ≥ ε^{−2}. However, if y ≤ 1 then e^y/y ≤ e/y = O(1/ε), since y = ε·f(p) ≥ ε; and if 1 ≤ y ≤ x, then since e^t/t is increasing for t ≥ 1, we have e^y/y ≤ e^x/x. Therefore, we always have that (e^y/y)/(e^x/x) ≤ 1, which concludes the proof.

Proof of Proposition 5. Consider a single iteration of the main for loop (i.e., lines 4-10 in the pseudocode in Figure 1).
A point p can only be sampled if p ∈ ∪_i B_i, and if so, it is sampled with probability density proportional to e^{ε·min(f(p), 2n/3)}. This is because we sample some index i with p ∈ B_i with probability proportional to f(p)·e^{ε·2n/3} (there are f(p) such indices, each chosen with probability proportional to e^{ε·2n/3}), then select p with density 1/V_B, and then accept with probability (1/3)·(n/f(p))·e^{ε·(min(f(p), 2n/3) − 2n/3)}. However, we select G_1 with probability proportional to (4n/δ)·(1/3) = (n/(3·V_B))·((4/δ)·V_B). So, by scaling all probabilities by a factor of 3·V_B/n, the proportionalities are correct.

Next, we state and prove Lemma 18, which establishes that a modified version of Algorithm 1 accepts with high probability if most of x_1, . . . , x_n are in a ball of radius r, and is differentially private in the worst case.

Lemma 18. Suppose that ε, δ, α ≤ 1/3, and n ≥ C·(1/ε)·log(1/(α·ε·δ)). Consider a modified algorithm where we run Algorithm 1, but output 1 if the algorithm returns G_2 and output 0 otherwise. Then, if at least 2n/3 of the points x_1, . . . , x_n are all contained in a ball of radius r, the algorithm outputs 1 with probability at most α. In addition, this algorithm is (3ε, 3δ)-differentially private even in the worst case.

Proof. By scaling, we may assume that r = 1. We first prove worst-case differential privacy. To do so, we will need the following auxiliary result, for which the proof is straightforward yet somewhat tedious.

Proposition 19. Suppose that 0 ≤ q, q′ ≤ 1/2 and 0 < ε, δ < 1/3, and suppose that e^{−ε}·q − δ/N ≤ q′ ≤ e^ε·q + δ/N. Define p_0 = N·q/((N − 1)·q + 1) and p_1 = (1 − q)/((N − 1)·q + 1), and define p′_0, p′_1 analogously in terms of q′. Then, e^{−ε}·p_0 − δ ≤ p′_0 ≤ e^ε·p_0 + δ, and e^{−3ε}·p_1 − 2δ ≤ p′_1 ≤ e^{2ε}·p_1 + 3δ.

Proof of Proposition 19. First, suppose that q′ = e^ε·q + δ/N. Then, p′_0 = N·q′/((N − 1)·q′ + 1) ≤ (e^ε·N·q + δ)/((N − 1)·q + 1) ≤ e^ε·p_0 + δ, since the denominator only increases when q is replaced by q′. The first part of the proposition now follows from the fact that N·x/((N − 1)·x + 1) is a strictly increasing function on the range (−1/N, ∞); the lower bound on p′_0 follows symmetrically. For the second part of the proposition, we first suppose that q′ = e^ε·q + δ/N. In this case, note that (N − 1)·q′ ≤ e^ε·(N − 1)·q + δ, and that 1 − q′ = 1 − (e^ε·q + δ/N) ≥ (1 − q) − (e^ε − 1)·q − δ ≥ (2 − e^ε)·(1 − q) − δ, where the last inequality follows from q ≤ 1/2, so 1 − q ≥ q. Then, as long as ε ≤ 1/3, we have 2 − e^ε ≥ e^{−2ε}, so 1 − q′ ≥ e^{−2ε}·(1 − q) − δ. Thus, p′_1 = (1 − q′)/((N − 1)·q′ + 1) ≥ (e^{−2ε}·(1 − q) − δ)/(e^ε·((N − 1)·q + 1) + δ) ≥ e^{−3ε}·p_1 − 2δ, and a symmetric computation gives p′_1 ≤ e^{2ε}·p_1 + 3δ. Above, we use the fact that (a + δ)/(b − δ) − a/b = ((a + b)/(b·(b − δ)))·δ ≤ ((a + b)/((2/3)·b²))·δ ≤ 3δ whenever a ≤ 1 ≤ b and δ ≤ 1/2. Thus, the second part of the proposition follows from the fact that (1 − x)/((N − 1)·x + 1) is a strictly decreasing function on the range (−1/N, ∞).

We now return to the algorithm. First, suppose that a single round of the algorithm accepts with probability q. Then, note that P[Geom(q) ≤ Geom(r)] = q/(q + r − q·r) for any q, r > 0, so by setting r = 1/N, the probability of the modified algorithm outputting 0 is q/(q + 1/N − q/N) = N·q/((N − 1)·q + 1), and the probability of the modified algorithm outputting 1 is (1 − q)/((N − 1)·q + 1).

We first prove worst-case privacy. For simplicity, define g(p) = min(f(p), 2n/3). For a single round of rejection sampling, conditioned on not selecting i = n + 1 (for which the probability is independent of the dataset x_1, . . . , x_n), the probability of acceptance is (1/(3·V_B))·∫_{f(p) ≥ 1} e^{ε·(g(p) − 2n/3)} dp. So, changing one of the points x_1, . . . , x_n does not affect e^{ε·g(p)} by more than an e^{±ε} factor, except for points that have f(p) go from 1 to 0 or vice versa. Then, as long as n ≥ C·(1/ε)·log(1/(α·δ)) for a sufficiently large constant C, we have e^{ε·(g(p) − 2n/3)} ≤ e^{ε·(1 − 2n/3)} ≪ (α·δ)/n ≪ δ/N for any point p with f(p) ≤ 1, where we are assuming WLOG that ε ≥ δ. Since the total volume of points p with f(p) going from 1 to 0 is at most Vol(B) (and similarly for 0 to 1),
overall we have that if q_x is the probability of acceptance in a single round of rejection sampling for the dataset x = (x_1, . . . , x_n), then for any x, x′ differing in at most one point, q_x·e^{−ε} − δ/N ≤ q_{x′} ≤ q_x·e^ε + δ/N. Therefore, by Proposition 19, if p_0 is the probability that the modified algorithm outputs 0 on input x and p′_0 is the probability that the modified algorithm outputs 0 on input x′, then e^{−ε}·p_0 − δ ≤ p′_0 ≤ e^ε·p_0 + δ. Likewise, if p_1 is the probability that the modified algorithm outputs 1 on input x and p′_1 is the probability that the modified algorithm outputs 1 on input x′, then e^{−3ε}·p_1 − 2δ ≤ p′_1 ≤ e^{2ε}·p_1 + 3δ. (Note that q_x, q_{x′} ≤ 1/2, since we never accept with probability more than 1/2, so we may apply Proposition 19.) Thus, our modified algorithm is (3ε, 3δ)-differentially private, even for worst-case datasets.

Next, we prove that the modified algorithm outputs 1 with probability at most α, assuming that at least 2/3 of the points x_1, . . . , x_n are all contained in a ball of radius 1. In this case, if n ≥ C·(1/ε)·log(1/(α·ε·δ)), then e^{ε·2n/3} ≥ (α·ε·δ)^{−1} ≥ 4/δ, so a single round of the rejection sampling algorithm first chooses some index i ∈ [n] with probability at least 1/2, and hence selects an index i with ‖x_i − x‖_2 ≤ 1 with probability at least 1/3. Conditioned on this event, we select a point p ∈ ∩_{i: ‖x_i − x‖_2 ≤ 1} B_i with probability at least (1/2)·e^{−10·√(log n)}, by Lemma 3. Finally, conditioned on this event, we have that g(p) = 2n/3, so we accept with probability (1/3)·(n/f(p))·e^{ε·(g(p) − 2n/3)} ≥ 1/3. Therefore, each round of rejection sampling succeeds with probability at least q = 1/(18·e^{10·√(log n)}), so the probability of the modified algorithm outputting 1 is at most (1 − q)/((N − 1)·q + 1) ≤ 1/(N·q) ≤ α, by our choice of N = Θ(e^{10·√(log n)}·α^{−1}).

We now prove Theorem 6.

Proof of Theorem 6. By scaling, we may assume WLOG that r = 1, and we may also assume that ε ≥ δ, for the same reason as in Theorem 2. In addition, we may assume that α ≥ δ. To see why, suppose we have an algorithm succeeding with probability 1 − δ if at least 2/3 of the points x_1, . . . , x_n are in a ball of radius 1. Now, in O(nd) time we can verify whether the coordinate-wise median of x_1, . . . , x_n is within 1 of at least 2/3 of the points x_1, . . . , x_n. In this specific case, we condition on the algorithm returning a point within O(√d) of the coordinate-wise median. This happens with probability 1 − δ, so we can repeat the algorithm until we get such a point, which takes an expected O(1) repetitions, so the expected runtime is the same. In addition, this changes the output distribution by at most δ in total variation distance, so the algorithm is still (O(ε), O(δ))-differentially private. Finally, if at least 2/3 of the points x_1, . . . , x_n are in a ball of radius 1/10, [23, Theorem 1] proves that the coordinate-wise median of x_1, . . . , x_n is within 1 of those points, which means that this modified algorithm in fact succeeds with probability 1. By scaling the algorithm by a factor of 10, the algorithm now succeeds with probability 1 if at least 2/3 of the points x_1, . . . , x_n are in a ball of radius 1. Because of our assumptions, we may assume n ≥ C·(1/ε)·log(1/(ε·δ·α)), since α, ε ≥ δ.

First, we show that our algorithm successfully finds a point within O(√d) of x with probability at least 1 − 2α, if at least 2/3 of the points x_i are contained in a ball of radius 1. To see why, note that the overall distribution of our output is the same as first running the modified algorithm of Lemma 18, and if the output is 1 returning G_2, and otherwise running the algorithm of Subsection 3.1.
The probability of outputting G_2 is at most α by Lemma 18, and conditioned on not outputting G_2, the probability of success is at least 1 − α by Proposition 16. Therefore, the algorithm succeeds with probability at least 1 − 2α.

Next, we bound the expected runtime, even for worst-case datasets. Note that in each round of the algorithm, we pick an index i ∈ [n + 1], which can be done in O(n) time. We then have to pick a uniform point p in the ball of radius √d around x_i if we choose i ≤ n, which is well known to be doable in O(d) time. To decide whether we wish to accept or reject, we need to compute f(p), which takes time O(n·d), since we need to compute the distances between p and each point x_i for 1 ≤ i ≤ n. So overall, a single step of the rejection sampling algorithm takes time O(n·d). We repeat this process an expected O(N) times, so the total expected runtime is O(N·n·d). As a note, the number of times we repeat this loop decays geometrically, so our runtime is O(N·n·d·K) with probability at least 1 − e^{−Θ(K)}.

Finally, to check privacy, we recall that our algorithm can be thought of as first running the modified algorithm of Lemma 18, and if the output is 0, running the algorithm of Subsection 3.1. These procedures are (3ε, 3δ)- and (ε, δ)-differentially private (by Lemma 18 and Proposition 17, respectively). Therefore, by weak composition of differential privacy, the overall algorithm is (4ε, 4δ)-differentially private.

We now prove Corollary 7.

Proof. The algorithm simply runs O(log(1/α)) parallel copies of Algorithm 1, but with the failure probability α replaced by 1/3. This gives us k = O(log(1/α)) points p_1, . . . , p_k, each generated by an (ε′, δ′)-differentially private algorithm with failure probability 1/3, according to Theorem 6. Here, we will set ε′ = Θ(ε)/min(log(1/α), √(log(1/α)·log(1/δ))) and δ′ = Θ(δ)/log(1/α). The final estimate p will simply be the coordinate-wise median of the points p_1, . . . , p_k (where we discard any p_i which is a garbage bucket G_1 or G_2). First, by Theorem 6 and our new values ε′, δ′, the sample complexity is n = O((1/ε′)·log(1/δ′)) = O((1/ε)·min(log(1/α), √(log(1/α)·log(1/δ)))·(log(1/δ) + log log(1/α))). We note that this is always at most O((1/ε)·log(1/δ)·log(1/α)): indeed, log(1/δ′) = O(log(1/δ)) unless 1/α ≥ e^{1/δ}, in which case min(log(1/α), √(log(1/α)·log(1/δ)))·log log(1/α) = Õ(log(1/α)) = O(log(1/δ)·log(1/α)).

Next, the runtime is simply k times the runtime of generating a single point p_i, since coordinate-wise medians can be computed in O(nd) total time. Thus, by Theorem 6, the runtime is O(n·e^{10·√(log n)}·d·log(1/α)). To bound privacy, note that since each p_i is independently generated by an (ε′, δ′)-differentially private algorithm, and the final point p only depends on the p_i's, we may use either weak or strong composition to directly get privacy guarantees of either (O(ε′·log(1/α)), O(δ′·log(1/α))) or (O(ε′·√(log(1/α)·log(1/δ))), O(δ′·log(1/α)) + δ/2). Consequently, by setting δ′ = Θ(δ)/log(1/α) and ε′ = Θ(ε)/min(log(1/α), √(log(1/α)·log(1/δ))), we obtain an (ε, δ)-differential privacy bound.

Finally, if at least 2n/3 of the points x_1, . . . , x_n are in some ball of radius r centered at some unknown x, then each p_i is within O(√d·r) of x with probability at least 2/3. Therefore, by a Chernoff bound, at least 3/5 of the points p_1, . . . , p_k are within O(√d·r) of x with probability at least 1 − α, assuming k ≥ C·log(1/α) for a sufficiently large constant C. Conditioned on this, it is known that the coordinate-wise median is deterministically within O(√d·r) of x as well [23], so the algorithm succeeds with probability at least 1 − α.
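In code, this boosting wrapper is only a few lines; the sketch below (in terms of the dp_estimate_1 sketch from earlier, with constants elided — ours, not the paper's reference implementation) mirrors the pseudocode that follows in Figure 2:

import numpy as np

def dp_estimate_2(xs, alpha, eps, delta):
    k = max(1, int(np.ceil(np.log(1 / alpha))))       # k = O(log(1/alpha)) parallel runs
    log_a, log_d = np.log(1 / alpha), np.log(1 / delta)
    eps_p = eps / max(1.0, min(log_a, np.sqrt(log_a * log_d)))  # eps' as in Corollary 7
    delta_p = delta / max(1.0, log_a)                            # delta' as in Corollary 7
    pts = [dp_estimate_1(xs, 1 / 3, eps_p, delta_p) for _ in range(k)]
    pts = [p for p in pts if not isinstance(p, str)]             # discard G1 / G2 outputs
    assert pts, "all runs returned garbage buckets"
    return np.median(np.stack(pts), axis=0)                      # coordinate-wise median

With probability 1 − α, at least 3/5 of the returned points land near the true center, so the coordinate-wise median does as well.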
Finally, we provide the pseudocode corresponding to Corollary 7, in Figure 2.

1: procedure DP-Estimate-2(x_1, . . . , x_n, α, ε, δ)
2: Set k = O(log(1/α)).
3: Set ε′ = Θ(ε)/min(log(1/α), √(log(1/α)·log(1/δ))), δ′ = Θ(δ)/log(1/α)
4: for i = 1 to k do
5: p_i = DP-Estimate-1(x_1, . . . , x_n, 1/3, ε′, δ′)
6: end for
7: Return the coordinate-wise median of p_1, . . . , p_k (discarding any p_i ∈ {G_1, G_2}).
8: end procedure

In this subsection, we prove Theorem 8. First, we formally state the main lemma for the Fast Johnson-Lindenstrauss method (Lemma 20). We now prove Theorem 8.

Proof of Theorem 8. Again, we may assume WLOG that r = 1. Set α′ = α/(n²·d), and let M = (1/√d)·H·D denote the Fast JL map. In addition, fix 1 ≤ k ≤ √d as a power of 2, and set ε′ = Θ(ε/(k·√(log(1/δ)))) and δ′ = Θ(δ/k²). Suppose that n ≥ C·(1/ε′)·log(1/δ′) = Θ(C)·(k/ε)·log(k/δ)·√(log(1/δ)). Then, by Lemma 20 and the union bound, we obtain that if at least 2/3 of the points x_i are within 1 of some point x in Euclidean distance, then with probability at least 1 − α, each block satisfies ‖x_i^(s) − x^(s)‖_2 ≤ O(√(log(dn/α))/k) for at least 2/3 of the points i. Since n ≥ C·(1/ε′)·log(1/δ′), we can use Theorem 6 to obtain an (ε′, δ′)-differentially private algorithm on {x_i^(s)} that, with probability at least 1 − α′, returns an estimate x̃^(s) within O(√(d/k²)·√(log(dn/α))/k) = O(√(d·log(dn/α))/k²) of x^(s). Summing the squared errors over all k² blocks, ‖x̃ − M·x‖_2 ≤ O(√(d·log(dn/α))/k). Thus, since M is a linear, ℓ_2-norm preserving map, we have that ‖M^{−1}·x̃ − x‖_2 ≤ O(√(d·log(dn/α))/k).

Next, we verify that the overall algorithm is (ε, δ)-differentially private. To see this, note that since D is oblivious to (i.e., has no dependence on) the dataset, multiplying the data by M at the beginning and by M^{−1} at the end does not degrade privacy. So, we can bound the privacy using the strong composition theorem, as a concatenation of k² (ε′, δ′)-differentially private functions. Since ε′ = Θ(ε/(k·√(log(1/δ)))) and δ′ = Θ(δ/k²), the strong composition theorem tells us that the overall algorithm is (ε′·√(2·k²·log(1/δ)) + O(k²·(ε′)²), k²·δ′ + δ) = (O(ε), O(δ))-differentially private.

Next, we bound the runtime of this algorithm. First, initializing D and multiplying each x_i by M takes time O(n·d·log d), using the fast Hadamard transform. Running the algorithm of Theorem 6 on each of the k² blocks takes total expected time O(n^{1+o(1)}·d·α^{−1}). Finally, we concatenate our estimates x̃^(s) and apply M^{−1}, which takes time O(d·log d). Overall, the runtime is O(n^{1+o(1)}·d·α^{−1} + n·d·log d).

Now, if we assume n = Θ((k/ε)·log(k/δ)·√(log(1/δ))), then the error is e = O(√(d·log(dn/α))/k) with probability 1 − α. This proves the algorithm is sufficiently accurate as long as C·(1/ε)·(log(1/δ))^{3/2} ≤ n ≤ C·(√d/ε)·log(√d/δ)·√(log(1/δ)), since we can always choose such a k between 1 and √d, and round it to a power of 2 if necessary. Finally, if C·(1/ε)·log(1/δ) ≤ n ≤ C·(1/ε)·(log(1/δ))^{3/2}, we can directly use Theorem 6 to get that the error is at most O(√d), which is at most O(√d·((1/ε)·(log(1/δ))^{3/2})/n). So, the algorithm is also sufficiently accurate in this case.

In this subsection, we prove Theorem 9 and Corollary 10, and provide pseudocode for Theorem 9.

Proof of Theorem 9. First, let us assume that no user is corrupted. In this case, a fixed user i is given m data points X_{i,1}, . . . , X_{i,m}, and computes the mean of their data, which we denote X̄_i. Each X_{i,j} has mean µ, and by our variance and covariance assumptions, E‖X̄_i − µ‖²_2 ≤ 2·r²/m. So, by Markov's inequality, ‖X̄_i − µ‖_2 ≤ (10/√m)·r with probability at least 49/50, so by a Chernoff bound, at least (11/12)·n of the users will have ‖X̄_i − µ‖_2 ≤ (10/√m)·r with probability at least 1 − e^{−Ω(n)} ≥ 1 − α. Therefore, even if an adversary looks at the users' data and arbitrarily corrupts n/4 of the users' data, at least (2/3)·n of the users will still have ‖X̄_i − µ‖_2 ≤ (10/√m)·r.
In this subsection, we prove Theorem 9 and Corollary 10, and provide pseudocode for Theorem 9.

Proof of Theorem 9. First, let us assume that no user is corrupted. In this case, a fixed user i is given m data points X_{i,1}, . . . , X_{i,m}, and computes the mean of their data, which we denote X̄_i. Each X_{i,j} has mean µ, and since the samples are independent, E‖X̄_i − µ‖₂² ≤ r²/m. So, by Markov's inequality, ‖X̄_i − µ‖₂ ≤ (10/√m) · r with probability at least 49/50, so by a Chernoff bound, at least (11/12) · n of the users will have ‖X̄_i − µ‖₂ ≤ (10/√m) · r with probability at least 1 − e^{−Ω(n)} ≥ 1 − α. Therefore, even if an adversary looks at the users' data and arbitrarily corrupts n/4 of the users' data, at least (2/3) · n of the users will still have ‖X̄_i − µ‖₂ ≤ (10/√m) · r.

Under this event, by replacing r with r′ = (10/√m) · r, we can directly apply Theorem 6 or Theorem 8 to obtain the desired result with runtime O(mnd + n^{1+o(1)} · d · α⁻¹ + nd log d), since computing all of the means X̄_i takes O(mnd) time. Alternatively, we could apply Corollary 7 or the remark after Theorem 8 to obtain the improved runtime of O(mnd + n^{1+o(1)} · d · log(1/α) + nd log d), at the cost of the number of users n being an additional log(1/α) factor larger.

Proof of Corollary 10. First, we consider the mean estimation question. In the case where C · (1/ε) · log(1/(α·δ)) ≤ n ≤ C · √d · (1/ε) · log(1/δ), we can directly apply Theorem 9. Indeed, since all points X_{i,j} are drawn from a distribution D supported on a ball of radius r, we have that E‖X_{i,j} − µ‖₂² ≤ E‖X_{i,j} − x‖₂² ≤ r², where x is the center of the ball. In addition, since the X_{i,j} are independent, E‖X̄_i − µ‖₂² ≤ r²/m. Therefore, the conditions to apply Theorem 9 hold. In the case where n ≥ C · √d · log(1/δ) · (1/ε) · log(m · (dn + n²ε²)), we can directly apply [18, Corollary 1] to obtain the desired bound (where we ignore all logarithmic factors). Finally, in the potential middle regime C · √d · (1/ε) · log(1/δ) ≤ n ≤ C · √d · log(1/δ) · (1/ε) · log(m · (dn + n²ε²)), we can simply ignore all users after the first C · √d · (1/ε) · log(1/δ), and use our already existing bounds. Since the upper end of this regime is at most an O(log(m · d · n)) factor more than its lower end, the result still holds in this case, since we are ignoring logarithmic factors in the bound.

We now move to our result on the distribution learning question. This is quite direct from our results on the mean estimation question. Indeed, we can encode each i ∈ [d] as a one-hot vector (0, . . . , 0, 1, 0, . . . , 0) ∈ R^d, with a 1 precisely in the i-th position. Then, this distribution is supported on the unit ball around 0, so we can privately learn its mean up to low ℓ₂ error with failure probability α. However, note that the ℓ₁ distance is at most √d times the ℓ₂ distance, and the total variation distance between two distributions is exactly 1/2 of their ℓ₁ distance. So, learning the mean up to ℓ₂ error e lets us learn the distribution D up to total variation distance (√d/2) · e, in an (ε, δ) user-level differentially private manner. (A short illustrative sketch of this reduction appears below, after Figure 3.)

Finally, we include pseudocode for the algorithm corresponding to Theorem 9, in Figure 3.

1: procedure DP-Estimate-3({X_{i,j}}_{i∈[n], j∈[m]}, α, ε, δ)
2:  for i = 1 to n do
3:   X̄_i = (1/m) · Σ_{j=1}^m X_{i,j}
4:  end for
5:  Set k so that n = Θ((k/ε) · log(k/δ) · √(log(1/δ))); if n = O((1/ε) · (log(1/δ))^{3/2}), set k = 1
6:  Sample D uniformly over ±1-valued diagonal matrices
7:  for i = 1 to n do
8:   y_i = (1/√d) · HD · X̄_i
9:  end for
10:  if k > 1 then  ⊲ When k = 1, there is no need to use any composition.
11:   Set ε′ = Θ(ε/(k · √(log(1/δ)))) and δ′ = Θ(δ/k²)
12:   Split the coordinates into k² blocks of size d/k²
13:   For each s ∈ [k²], compute x̂^{(s)} by running Algorithm 1 with parameters (ε′, δ′) on {y_i^{(s)}}
14:   Compute x̂ as the concatenation of all x̂^{(s)}
15:  else
16:   Compute x̂ by running Algorithm 1 (or Algorithm 2) with parameters (ε, δ) on y₁, . . . , yₙ
17:  end if
18:  Return √d · (HD)⁻¹ · x̂
19: end procedure

Here, we set k so that n = Θ((k/ε) · log(k/δ) · √(log(1/δ))), unless n = O((1/ε) · (log(1/δ))^{3/2}), in which case we set k = 1. Note that when k = 1, we do not change ε, δ and do not need to use any composition theorem, so we get the bounds based on applying Theorem 6 (or Corollary 7 if we apply Algorithm 2 instead of Algorithm 1), whereas for 1 < k ≤ √d, we get bounds based on applying Theorem 8.
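As promised above, here is a minimal sketch of the reduction from learning a discrete distribution over [d] to private mean estimation; dp_mean is a hypothetical stand-in for the user-level mean estimator of Theorem 9, and the projection step is a simple post-processing heuristic.

```python
import numpy as np

def learn_discrete_distribution(samples, d, dp_mean):
    """samples[i][j] in {0, ..., d-1} is the j-th sample of user i.
    Encodes samples as one-hot vectors (supported on the unit ball),
    privately estimates their mean, and reads off the distribution."""
    one_hot = np.zeros((len(samples), len(samples[0]), d))
    for i, user in enumerate(samples):
        for j, x in enumerate(user):
            one_hot[i, j, x] = 1.0

    p_hat = dp_mean(one_hot)          # user-level (eps, delta)-DP estimate
    # Post-process back to the probability simplex (clip + renormalize);
    # post-processing preserves differential privacy.
    p_hat = np.clip(p_hat, 0.0, None)
    total = p_hat.sum()
    # TV distance to the truth is (1/2) * l1 error <= (sqrt(d)/2) * l2 error.
    return p_hat / total if total > 0 else np.full(d, 1.0 / d)
```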
In this section, we prove all of our lower bound results from Section 5. In Subsection C.1, we prove Lemma 11. In Subsection C.2, we prove Theorem 12. (We defer one piece of this proof to Subsection C.3.) Finally, in Subsection C.4, we state and prove a simple result showing that one cannot learn the mean of a distribution up to error better than O(r/√m) if a constant fraction of users are corrupted, even if privacy is not required. This roughly implies that beyond n ≈ √d users, having more users does not help with robust mean estimation.

In this subsection, we prove Lemma 11.

Proof. Suppose otherwise, i.e., that there is such an algorithm, which we call M. Consider a ball B of radius 1 centered at some point b, and for any r > 0, let B_r represent the radius-r ball concentric with B. Suppose we have data {X_{i,j}} where each point X_{i,j} is arbitrary. Let p₀ represent the probability that our user-level differentially private algorithm M outputs a point x ∈ B_T. Now, for 1 ≤ i ≤ n, let p_i represent the probability that M outputs a point x ∈ B_T if we replace the first i users' data each with m copies of the center of B. Then, pₙ is the probability that M outputs x ∈ B_T when every point X_{i,j} is just the center of B. By our assumption on M, and since each point X_{i,j} is in B, pₙ ≥ 2/3.

In addition, this algorithm is (ε, δ) user-level differentially private, which means that p_{i+1} ≤ e^ε · p_i + δ. Equivalently, by adding δ · 1/(e^ε − 1) to both sides, we have that p_{i+1} + δ · 1/(e^ε − 1) ≤ e^ε · p_i + δ · e^ε/(e^ε − 1) = e^ε · (p_i + δ/(e^ε − 1)), for all 0 ≤ i ≤ n − 1. So, this means that p₀ ≥ e^{−nε} · (pₙ + δ/(e^ε − 1)) − δ/(e^ε − 1) ≥ e^{−nε} · (2/3) − δ/(e^ε − 1).

If n ≤ (1/3) · (1/ε) · log(1/δ), then e^{−nε} · (2/3) ≥ (2/3) · δ^{1/3}, which means that p₀ ≥ (2/3) · δ^{1/3} − δ/(e^ε − 1) ≥ δ, assuming that δ ≤ ε² and that δ is sufficiently small. Therefore, even if the data {X_{i,j}} is arbitrary, we still have that with probability at least δ, M outputs a point in B_T, i.e., within T of b. But this is true for an arbitrary b, so by constructing N > 1/δ disjoint balls of radius T centered at points b₁, . . . , b_N, which have pairwise distances at least 2T from each other, we see that the probability that M outputs a point within T of some b_i is at least N · δ > 1, which is a contradiction. Therefore, we must have that n ≥ (1/3) · (1/ε) · log(1/δ).

In this subsection, we prove Theorem 12. Suppose we have n users and m samples (items) per user, where each sample is drawn from an unknown distribution over [d] characterized by p = [p₁, . . . , p_d]. The goal of the algorithm is to privately estimate p up to low total variation (equivalently, ℓ₁) distance, given these samples. Our lower bound will show that one cannot privately estimate p significantly better than the guarantees of Corollary 10. In addition, our lower bound will also imply a nearly optimal lower bound for user-level mean estimation (up to ℓ₂ distance), both in the case where there are few users and where there are many users. This is because we can think of each sample i ∼ [d] as a one-hot vector (0, 0, . . . , 1, 0, . . . , 0), with a 1 in the i-th coordinate. Hence, the samples are concentrated in a ball of radius 1, and we are trying to estimate the mean (p₁, . . . , p_d).

Before dealing with multinomial distributions over [d], we start by just considering the number of samples that equal some k ∈ [d] for each user. However, instead of working with binomial distributions, which give the actual distribution of these counts, we prove the following two lemmas about Poisson random variables. Our method of attack is to show that if, instead of each user receiving m samples from a distribution over [d], each user were given d independent Poisson variables, where the k-th variable roughly corresponds to the number of samples equal to k that the user receives, then privately estimating the distribution is hard. This step follows a similar technique to [16], but with very different computations. We then show how to reduce the problem of each user receiving samples from [d] to the problem of each user receiving Poisson variables, which shows that learning discrete distributions privately is also difficult.
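The Poissonization step can be made concrete with a short simulation: if the k-th count is drawn as an independent Pois(m · p_k) variable and the user lays out that many copies of item k in random order, the result is distributed as a Pois(m · Σ_k p_k)-sized i.i.d. sample from the normalized distribution. The sketch below is a numerical illustration rather than part of the proof; the parameter regime matches the one used later.

```python
import numpy as np

def poissonized_user(p, m, rng):
    """One user's data via Poissonization: independent Pois(m * p_k) counts."""
    counts = rng.poisson(m * p)
    items = np.repeat(np.arange(len(p)), counts)
    rng.shuffle(items)
    return items

def multinomial_user(p, m, rng):
    """One user's data the 'real' way: m i.i.d. draws from p / sum(p)."""
    return rng.choice(len(p), size=m, p=p / p.sum())

rng = np.random.default_rng(0)
d, m = 8, 1000
p = rng.uniform(0.5 / d, 1.5 / d, size=d)   # the regime used in the proof
# Empirical frequencies under the two schemes agree up to sampling noise.
a = poissonized_user(p, m, rng)
b = multinomial_user(p, m, rng)
print(np.bincount(a, minlength=d) / len(a))
print(np.bincount(b, minlength=d) / len(b))
```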
Lemma 21. Fix positive integers m, n, and let 0 < p < 1 be a parameter. Let f be a nonnegative, uniformly bounded function of n variables X₁, . . . , Xₙ, where each X_i ∼ Pois(m · p) independently. For simplicity, we write X = (X₁, . . . , Xₙ), write E_X for the expectation over drawing X, and define g(p) := E_X[f(X)]. Then,

g′(p) = (1/p) · E_X[f(X) · Σᵢ₌₁ⁿ (X_i − m·p)],

where g′(p) is the derivative of g at p.

Proof. Note that by the formula for the Poisson probability mass function,

g(p) = Σ_{x₁, . . . , xₙ ≥ 0} f(x₁, . . . , xₙ) · Πᵢ₌₁ⁿ e^{−mp} · (mp)^{x_i}/(x_i!).

Now, since |f(X)| is uniformly bounded, one can show that differentiation with respect to p commutes with the summation. (We defer the proof of this fact to Lemma 26 in Subsection C.3.) So, using the product rule, differentiating each summand multiplies it by Σᵢ (x_i/p − m) = (1/p) · Σᵢ (x_i − m·p), which gives exactly g′(p) = (1/p) · E_X[f(X) · Σᵢ₌₁ⁿ (X_i − m·p)].
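The identity in Lemma 21 is easy to sanity-check numerically; the snippet below, included purely as an illustration, compares a finite-difference estimate of g′(p) with the score-weighted expectation for an arbitrary bounded test function f. The two printed values should roughly agree, up to Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p, trials = 20, 5, 0.3, 500_000

def f(X):
    # An arbitrary bounded test function of the counts.
    return np.minimum(X.sum(axis=1), 25.0)

X = rng.poisson(m * p, size=(trials, n))
score = (X - m * p).sum(axis=1) / p
lhs = (f(X) * score).mean()                      # (1/p) E[f(X) * sum(X_i - mp)]

h = 1e-3                                          # finite-difference step
g_plus = f(rng.poisson(m * (p + h), size=(trials, n))).mean()
g_minus = f(rng.poisson(m * (p - h), size=(trials, n))).mean()
print(lhs, (g_plus - g_minus) / (2 * h))          # both approximate g'(p)
```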
The next lemma, sometimes called a "fingerprinting" lemma, is based on techniques similar to those used in [16, 24] for lower bounds against private estimation from samples; it converts the derivative identity of Lemma 21 into a lower bound on the correlation between an accurate estimator and the centered data.

Lemma 22. Abbreviate X = (X₁, . . . , Xₙ), and suppose p ∼ Unif[0.5/d, 1.5/d] and, conditioned on p, X₁, . . . , Xₙ are i.i.d. ∼ Pois(m · p). Then, for f as in Lemma 21, the correlation E_{p,X}[f(X) · Σᵢ₌₁ⁿ (X_i − m·p)] is bounded below in terms of how much g(p) = E_X[f(X)] must vary over the interval [0.5/d, 1.5/d]; in particular, the correlation must be large whenever f accurately tracks p.

Proof. First, by Lemma 21, for each fixed p we have E_X[f(X) · Σᵢ (X_i − m·p)] = p · g′(p). The claim follows by taking the expectation over p ∼ Unif[0.5/d, 1.5/d] and rewriting E_p[p · g′(p)], where the second step follows from integration by parts.

Rather than thinking of X = {X_{i,j}} where each X_{i,j} ∈ [d] and 1 ≤ i ≤ n, 1 ≤ j ≤ m, we consider X = {X_{i,k}}, where 1 ≤ i ≤ n and 1 ≤ k ≤ d: here, X_{i,k} represents the count of the number of times user i received the item k ∈ [d]. The i-th user is given {X_{i,k}}_{k=1}^d, but instead of these counts following the multinomial distribution Mult(m; p) for p = (p₁, . . . , p_d), we instead suppose that X_{i,k} ∼ Pois(m · p_k) for each i and k, where m is fixed and the values X_{i,k} are independent (both across i and k). Finally, we assume that each p_k ∈ [0.5/d, 1.5/d], and that the algorithm does not know p = (p₁, . . . , p_d).

We now prove a lower bound on the number of users required when the users are given samples from a Poisson distribution, rather than a multinomial or discrete distribution. This follows a method similar to [16, Theorem 6.1].

Lemma 23. Suppose that M is an (ε, δ)-differentially private algorithm that takes as input X = {X_{i,k}}, with all notation as in the above paragraphs, and outputs an estimate M(X) = (M₁(X), . . . , M_d(X)), where each M_k(X) ∈ [0.5/d, 1.5/d], and that E_{p,X,M}[‖M(X) − p‖₂²] ≤ α²/(24d) for some 0 < α ≤ 1, where the expectation is also over the randomness of the algorithm M. Then, the number of users n must satisfy n ≥ Ω(d/(α · ε · √m)).

Proof. We wish to apply Lemma 22 coordinate-wise, with f = M_k for each k ∈ [d]. While this implication may not seem immediate, because M_k(X) is allowed to depend on the counts X_{i,k′} for k′ ≠ k as well, we note that since we chose the p_k values independently, the set {X_{i,k′}} over 1 ≤ i ≤ n and k′ ≠ k is independent of the set {X_{i,k}} over 1 ≤ i ≤ n. Therefore, Lemma 22 still holds in our case. Define Z_{i,k} := c(p_k) · (M_k(X) − p_k) · (X_{i,k} − m·p_k), where c(p_k) is a suitable normalization depending only on p_k (arising from Lemma 22), and let Z_i := Σ_{k=1}^d Z_{i,k}. Therefore, since E_{p,X,M}[‖M(X) − p‖₂²] ≤ α²/(24d), applying Lemma 22 to each coordinate and summing over k, we must have that Σᵢ₌₁ⁿ E[Z_i] = Ω(1/d). We spend the remainder of the proof working towards a matching upper bound for Σᵢ₌₁ⁿ E[Z_i], which will force n to be large.

Fix p, and let X^{∼i} represent the dataset where X_{i,·} is replaced by an independent fresh draw, i.e., X_{i,k} is redrawn from Pois(m · p_k) for each 1 ≤ k ≤ d. (In other words, we have replaced all of the i-th user's data.) Define Z̃_{i,k} analogously to Z_{i,k}, but with M(X^{∼i}) in place of M(X) (so only the middle term in the product is changed), and let Z̃_i := Σ_{k=1}^d Z̃_{i,k}. Due to the resampling, {X_{i,k}}_{k=1}^d is independent of X^{∼i} conditioned on p, so E_{X,M}[Z̃_i] = 0.

Now, note that E_{X,X^{∼i},M}[Z̃_{i,k} · Z̃_{i,k′}] = 0 for any k ≠ k′ if we condition on p, since, up to a scaling factor depending only on p, the factors (X_{i,k} − m·p_k) and (X_{i,k′} − m·p_{k′}) are independent of M(X^{∼i}) (because X^{∼i} and X_{i,·} are independent), are independent of each other, and both have expectation 0; so the overall product has expectation 0. Thus, conditioning on p, the cross terms vanish and E[Z̃_i²] = Σ_{k=1}^d E[Z̃_{i,k}²]; since Var(X_{i,k}) = m · p_k = O(m/d), this is bounded in terms of E[‖M(X^{∼i}) − p‖₂²]. Note that in the final step, we use the fact that M(X^{∼i}) and M(X) have the same distribution, and we are assuming that E[‖M(X) − p‖₂²] ≤ α²/(24d). Hence, even when we remove the conditioning on p, we still have that E[|Z̃_i|] ≤ √(E[Z̃_i²]) = O(√m · α/d²).

To finish, we note that if we fix p, X, and X^{∼i} and just take the expectation over the randomness of M, then Z_i and Z̃_i are the same function, except one is applied to M(X) and the other to M(X^{∼i}). Therefore, the distributions of Z_i and Z̃_i conditioned on p, X, and X^{∼i} are (ε, δ)-close. Thus, the distributions are still (ε, δ)-close when we remove the conditioning as well. In this case, it is well known (see [16, Equation 18]) that for any T > 0,

|E[Z_i] − E[Z̃_i]| ≤ 2ε · E[|Z̃_i|] + 2δ · T + 2 ∫_T^∞ P[|Z̃_i| > t] dt.   (2)

To bound P[|Z̃_i| > t], it is well known that for any variable Y ∼ Pois(λ), P[Y > λ + t] ≤ exp(−c · t) whenever t > λ, for some absolute constant c > 0. Therefore, the probability that |X_{i,k} − m·p_k| > t is at most exp(−c · t) for any t > 1.5 · m/d, since p_k ∈ [0.5/d, 1.5/d]. We note that |Z̃_{i,k}| ≤ O(1/d²) · |X_{i,k} − m·p_k|, which means that for any t > Ω(m/d³), the probability that |Z̃_{i,k}| > t is at most exp(−Ω(d²) · t). So, as Z̃_i = Σ_{k=1}^d Z̃_{i,k}, the probability that |Z̃_i| > t is at most d · exp(−Ω(d) · t) for any t > T = Θ(m), by a basic union bound. Thus, we can apply (2) together with the facts that E[Z̃_i] = 0 and E[|Z̃_i|] = O(√m · α/d²) to get E[Z_i] ≤ O(ε · √m · α/d² + δ · m). Summing over the n users and comparing with the lower bound Σᵢ₌₁ⁿ E[Z_i] = Ω(1/d), and assuming that δ = o(ε · α/(d² · √m)), this means that n = Ω((1/ε) · α⁻¹ · d/√m), as desired.
We now provide a reduction from privately learning multinomial distributions to privately learning Poisson distributions as above. Given this reduction, we can prove that learning a multinomial distribution accurately also requires n = Ω̃((1/ε) · α⁻¹ · d/√m). First, we note the following result about privately learning 1-dimensional distributions, which will be useful in proving our reduction.

Theorem 24. [18, Theorem 1, rephrased] Let τ < B and ε < 1 be fixed. Then, there exists an (ε, 0)-differentially private algorithm D̃ with input X₁, . . . , Xₙ ∈ [0, B], such that if there exists an unknown interval I ⊂ [0, B] of radius 2τ that contains all of the X_i's, then for any T > 0, the probability that |D̃(X₁, . . . , Xₙ) − (X₁ + · · · + Xₙ)/n| ≥ (8τ/(nε)) · T is at most e^{−T} + (B/τ) · e^{−nε/8}.

Theorem 25. Suppose there exists an (ε, δ) user-level differentially private algorithm M, where δ = o((ε · α)/(d² · √m · log²(d/(ε·α)))), that takes as input X = {X_{i,j}}, where i ranges from 1 to n, j ranges from 1 to m, and X_{i,j} ∈ [d]. Moreover, suppose that if X_{i,j} i.i.d. ∼ p̃ for some distribution p̃ over [d], then E[‖M(X) − p̃‖₁²] ≤ α²/1000. Then, n · √m = Ω(ε⁻¹ · α⁻¹ · d/log⁶(d/(α·ε))).

Proof. First, note that n ≥ 1/(6ε). To see why, since δ ≤ ε, for two adjacent datasets X = {X_{i,j}} and X′ = {X′_{i,j}} that differ on only one user, the total variation distance between M(X) and M(X′) is at most ε + δ ≤ 2ε. So, the total variation distance between M(X) and M(X′) for two arbitrary datasets X, X′, which may differ in up to all n users, is at most 2n · ε. So, if n < 1/(6ε), then the total variation distance between M(X) and M(X′) is less than 1/3. Therefore, if X were drawn from the distribution where every element is 1, and X′ were drawn from the distribution where every element is 2, either M(X) estimates p̃₁ > 1/2 with at most 2/3 probability, or M(X′) estimates p̃₂ > 1/2 with at most 2/3 probability. Therefore, the expected value of ‖M(X) − p̃‖₁² is at least (1/3) · (1/2)² = 1/12, either for p̃ = (1, 0, . . . , 0) or for p̃ = (0, 1, 0, . . . , 0). So, we may assume n ≥ 1/(6ε).

Now, let A = (6C · log(d/(α·ε)))⁴ for some large constant C, and let n′ = n·A and m′ = m·A. It suffices to show that n′ · √m′ = Ω(d/(α·ε)). In addition, we may assume that both n′, m′ ≤ O(d/(α·ε)), as otherwise the result is immediate.

Consider sampling a vector p where each p_i ∼ Unif[0.5/d, 1.5/d]. Let p̃ = p/(p₁ + · · · + p_d). Since p₁ + · · · + p_d ∈ [0.5, 1.5], this means that p̃_i ∈ [1/(3d), 3/d] for all i. We define s := Σ_{k=1}^d p_k; note that s ∈ [0.5, 1.5].

Now, consider the case where each user i is given {X_{i,k}}_{k=1}^d, where i ranges from 1 to n′ and each X_{i,k} is a nonnegative integer. It is well known that if each X_{i,k} ∼ Pois(m′ · p_k) (independently), and the user generates X_{i,k} copies of each integer k and randomly permutes them, the resulting sequence has the same distribution as first sampling m′′ ∼ Pois(m′ · (p₁ + · · · + p_d)) and then generating m′′ independent samples from the discrete distribution p̃. Since m′ · (p₁ + · · · + p_d) ≥ m′/2, with probability at least 1 − n′ · e^{−Ω(m′)} ≥ 1 − α¹⁰, every user i is given at least m′/4 samples, by a Chernoff bound. Similarly, with probability at least 1 − n′ · e^{−Ω(m′)} ≥ 1 − α¹⁰, every user i is given at most 4m′ samples, since m′ · (p₁ + · · · + p_d) ≤ 3m′/2.

Our overall algorithm M̃, operating on {X_{i,k}} for 1 ≤ i ≤ n′ and 1 ≤ k ≤ d, will work as follows. First, for each user i, if Σ_k X_{i,k} < m′/4, user i generates m random samples {X_{i,j}}_{j=1}^m from the uniform distribution over [d]. Otherwise, user i generates X_{i,k} copies of the integer k for each k ∈ [d], randomly permutes them, and lets {X_{i,j}}_{j=1}^m be the first m ≤ m′/4 entries of the randomly permuted list. Next, the algorithm applies M to {X_{i,j}}, but just on the first n users (so over 1 ≤ i ≤ n, 1 ≤ j ≤ m), to output an estimate of the normalized distribution p̃ = p/Σᵢ p_i. Finally, M̃ estimates s = p₁ + · · · + p_d. To do this, let X_i = Σ_{k=1}^d X_{i,k}, and let X̃_i = min(max(X_i, m′/4), 4m′). We apply Theorem 24 to the values X̃₁, . . . , X̃_{n′}, with B = 4m′, τ = √m′ · O(log(d/(α·ε))), and T = O(log(d/(α·ε))), to get an estimate D̃ = D̃(X̃₁, . . . , X̃_{n′}). Letting s̃ = D̃/m′, our final estimate for p will be s̃ · M({X_{i,j}}).

We show that if M is (ε, δ)-private, then M̃ is (2ε, δ)-private. To see why, note that the generation of the samples {X_{i,j}}_{j=1}^m from {X_{i,k}}_{k=1}^d is entirely user-specific, so any algorithm that is private with respect to {X_{i,j}} is equally private with respect to {X_{i,k}}. In addition, M̃ throws out all data after the first n users, which does not worsen privacy. Therefore, the function M({X_{i,j}}) is still (ε, δ)-differentially private as a function of {X_{i,k}}. Next, we similarly just have to check that s̃ is a private function of the X̃_i values, because each X̃_i only depends on the i-th user's input. However, Theorem 24 tells us that in fact s̃ (as a scaled version of D̃) is (ε, 0)-differentially private. Therefore, the overall algorithm generating the estimate of p is (2ε, δ)-differentially private.

Next, we verify the accuracy of this procedure. Assuming that every user is given {X_{i,k}}_{k=1}^d where each X_{i,k} is independently drawn from Pois(m′ · p_k), then X_i = Σ_k X_{i,k} ∼ Pois(m′ · Σ_k p_k). In addition, with probability at least 1 − n′ · e^{−Ω(m′)} ≥ 1 − (α¹⁰ε¹⁰)/d¹⁰ (by our assumptions on the sizes of m′ and n′), X_i ∈ [m′/4, 4m′], since p_k ∈ [0.5/d, 1.5/d] for all k ∈ [d]. So, X̃_i = X_i for all i ∈ [n′] with O(α⁹ε⁹/d⁹) failure probability (since n′ ≤ O(d/(α·ε))), which means the total variation distance between X₁ + · · · + X_{n′} and X̃₁ + · · · + X̃_{n′} is at most O(α⁹ε⁹/d⁹).

Now, note that X̃_i ∈ [0, B] for B = 4m′, and X_i = m′ · s ± O(√m′ · log(d/(α·ε))) with probability at least 1 − (α¹⁰ε¹⁰)/d¹⁰ for each i, by concentration of the Poisson distribution (and since 0.5 < s < 1.5). So, for τ = C · √m′ · log(d/(α·ε)), we have that with probability at least 1 − O(α⁹ε⁹/d⁹), all of the values X̃_i for i ∈ [n′] are in the range [m′·s − τ, m′·s + τ]. Overall, this means that if T = C · log(d/(α·ε)), then, by Theorem 24, |D̃ − (X̃₁ + · · · + X̃_{n′})/n′| ≤ (8τ/(n′ε)) · T except with small failure probability, so s̃ = D̃/m′ approximates s up to a lower-order additive error; hence the final estimate s̃ · M({X_{i,j}}) is, up to this error, as accurate for p as M is for p̃. However, by Lemma 23 applied to M̃, if n′ · √m′ = o(d/(α·ε)), then M̃ cannot achieve this accuracy. Therefore, assuming the dimension d is at least a sufficiently large constant, and since n′ ≥ log⁴(d/(α·ε)), we must have that m·n = o(α⁻²). This, however, is impossible, since even the optimal non-private estimation of p̃ (with n users and m samples per user) has ℓ₂² error at least Ω(1/(mn)) in expectation, and thus ℓ₁² error at least that; yet we assumed there is a private algorithm (with n users and m samples per user) that estimates p̃ with error E[‖M(X) − p̃‖₁²] ≤ α²/1000, which forces m·n = Ω(α⁻²).
This is a contradiction, so we must have that in fact n′ · √m′ = Ω(d/(α·ε)).

We can now complete the proof of Theorem 12. The only change we must make is to convert a lower bound on the expected error into a lower bound on the error achievable with probability at least 2/3, which we do via amplification, at the expense of a logarithmic decrease in our lower bound on n√m.

Proof of Theorem 12. Suppose that, with probability at least 2/3, ‖M(X) − p̃‖₁ ≤ c · α for some very small constant c and some parameter α ≤ 1. Then, if we have n · log(1/α) users, each with m samples, we can obtain log(1/α) independent (ε, δ)-differentially private estimates of p̃ by splitting the users into log(1/α) blocks of size n. With probability at least 1 − c · α, at least 3/5 of the estimates are within c · α of p̃ in ℓ₁ distance. In this case, we can take the coordinate-wise median of the estimates to obtain an estimate within O(c) · α of p̃ in ℓ₁ distance with probability at least 1 − c² · α², by a Chernoff bound and by [23]. Otherwise, the estimate is still within 1 of p̃ in ℓ₁ distance.

So, this algorithm provides an estimate that deviates from p̃ in expected ℓ₁² distance by at most (O(c) · α)² + c² · α² ≤ α²/1000. In addition, the algorithm is still (ε, δ)-differentially private, since the output is a function of (ε, δ)-differentially private algorithms run on disjoint groups of users. By Theorem 25, this implies that n · log(1/α) · √m = Ω(ε⁻¹ · α⁻¹ · d/log⁶(d/(α·ε))), so n · √m = Ω(ε⁻¹ · α⁻¹ · d/log⁷(d/(α·ε))). Finally, even in the non-private case, obtaining error α in ℓ₁ distance with probability 2/3 requires m · n ≥ α⁻² · d. So, in total, we need both n · √m = Ω̃(ε⁻¹ · α⁻¹ · d) and m · n = Ω(α⁻² · d).

Finally, we prove the result described at the start of this section: if an O(κ) fraction of users may be adversarially corrupted, then, even without privacy constraints, one cannot estimate the mean up to error better than O(κ/√m) (stated here for distributions supported in a ball of radius 1). Fix a unit vector v, and let D be the distribution over {±v} under which a sample equals +v with probability 1/2 + κ/√m and equals −v with probability 1/2 − κ/√m; let D′ be the distribution under which a sample equals +v with probability 1/2 − κ/√m and equals −v with probability 1/2 + κ/√m. It is well known that, for large values of m, the total variation distance between the distributions Bin(m, 1/2 + κ/√m) and Bin(m, 1/2 − κ/√m) is O(κ). Because of this, there exists a method that takes a sequence of m samples drawn from D and corrupts it with probability at most O(κ) (where the choice of corruption is allowed to depend on the samples, but the overall probability of corruption is at most O(κ)), such that the final sequence has the same distribution as m samples drawn from D′. So, if we do this for all n users, with high probability we corrupt no more than an O(κ) fraction of the users, yet we have completely converted the dataset {X_{i,j}} from one in which each sample is drawn from D to one in which each sample is drawn from D′. Therefore, no algorithm can distinguish between these two cases with high probability. But the difference between the means of these two distributions is Θ(κ/√m), so we cannot learn the mean of an unknown distribution up to error better than O(κ/√m).
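The key fact driving this argument, that Bin(m, 1/2 + κ/√m) and Bin(m, 1/2 − κ/√m) are O(κ)-close in total variation uniformly in m, is easy to check numerically; the snippet below is a standalone illustration.

```python
import numpy as np
from scipy.stats import binom

def tv_distance(m, kappa):
    """Total variation distance between Bin(m, 1/2 + kappa/sqrt(m))
    and Bin(m, 1/2 - kappa/sqrt(m))."""
    ks = np.arange(m + 1)
    p1 = binom.pmf(ks, m, 0.5 + kappa / np.sqrt(m))
    p2 = binom.pmf(ks, m, 0.5 - kappa / np.sqrt(m))
    return 0.5 * np.abs(p1 - p2).sum()

# The distance stays O(kappa) as m grows.
for m in [100, 1_000, 10_000]:
    print(m, round(tv_distance(m, kappa=0.1), 4))
```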
We note that Levy et al. [18] showed that user-level private mean estimation can be used in a black-box manner to solve user-level private empirical risk minimization, stochastic convex optimization, and a variant of stochastic gradient descent. In this section, we briefly discuss how our improved mean estimation algorithm in the low-user regime improves their algorithms for these three problems.

We recall that the general goal in both empirical risk minimization and stochastic convex optimization is to minimize an average loss. We have a parameter space Θ and a data space Z, where each user i is given m samples X_{i,j} drawn from some unknown distribution D over Z, as well as a loss function ℓ : Θ × Z → R. The goal in empirical risk minimization is to minimize the empirical risk L(θ; X) := (1/(mn)) · Σ_{i,j} ℓ(θ; X_{i,j}) over θ ∈ Θ for the given data points X = {X_{i,j}}, whereas the goal in stochastic convex optimization is to minimize the expected population risk L(θ; D) := E_{z∼D}[ℓ(θ; z)] over θ ∈ Θ, assuming each X_{i,j} is drawn i.i.d. from D.

We make some assumptions about the loss function and the input space. First, we assume that Θ is closed, convex, and contained in a ball of radius R, i.e., there exists θ₀ such that ‖θ − θ₀‖ ≤ R for all θ ∈ Θ. Next, we assume that ℓ(·, z) is G-Lipschitz, meaning that for all z ∈ Z and all θ ∈ Θ, ‖∇ℓ(θ; z)‖₂ ≤ G, where the gradient is with respect to θ. Next, we assume that the function ℓ(·, z) is H-smooth, meaning that for all z ∈ Z, the gradient ∇ℓ(θ; z) is H-Lipschitz as a function of θ. Finally, for any θ, we assume that ∇ℓ(θ; Z) is σ²-subgaussian for Z ∼ D, which means that for all unit vectors v ∈ R^d, all θ ∈ Θ, and all λ ∈ R, E[exp(λ · ⟨v, ∇ℓ(θ; Z) − E[∇ℓ(θ; Z)]⟩)] ≤ exp(λ²σ²/2). In certain situations, the guarantees may also depend on µ, a strong-convexity guarantee: namely, we may assume that ℓ(·, z) is µ-strongly convex, which means that for any θ₁, θ₂ ∈ Θ and any z ∈ Z, ⟨∇ℓ(θ₁; z) − ∇ℓ(θ₂; z), θ₁ − θ₂⟩ ≥ µ · ‖θ₁ − θ₂‖₂². Finally, the algorithms' guarantees depend on n, the number of users, m, the number of samples per user, and d, the dimension.

To perform stochastic gradient descent, we assume that for a function F : Θ → R, we have access to a first-order stochastic oracle O_{F,ν²}, which is a random mapping such that for all θ ∈ Θ, E[O_{F,ν²}(θ)] = ∇F(θ) and Var[O_{F,ν²}(θ)] ≤ ν². We now consider an abstract optimization algorithm framework using this oracle (in the same way as [18]). Given such an oracle O_{F,ν²}, the optimization algorithm is comprised of three functions: Query, Update, and Aggregate. There is also an output space O that, when fed into Query, tells us the new value of θ ∈ Θ. We start with some initial output o₀. Then, for T iterations (i.e., for t = 0 to T − 1), we first compute θ_t ← Query(o_t) and g_t = O_{F,ν²}(θ_t) (note that g_t is a random output, since the oracle is stochastic), and then compute o_{t+1} ← Update(o_t, θ_t, g_t); at the end, the algorithm outputs an aggregate of the iterates via Aggregate.
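As an illustration only, the abstract Query/Update/Aggregate loop can be written in a few lines of Python. The oracle below is a hypothetical stand-in; in the user-level private setting of [18], it would be implemented by privately estimating the mean of the users' gradients.

```python
from typing import Any, Callable, List
import numpy as np

def run_first_order_method(
    oracle: Callable[[Any], Any],             # theta -> stochastic gradient g
    query: Callable[[Any], Any],              # o_t -> theta_t
    update: Callable[[Any, Any, Any], Any],   # (o_t, theta_t, g_t) -> o_{t+1}
    aggregate: Callable[[List[Any]], Any],    # iterates -> final output
    o0: Any,
    T: int,
):
    """Abstract first-order framework: T rounds of Query -> oracle -> Update."""
    o, history = o0, [o0]
    for _ in range(T):
        theta = query(o)
        g = oracle(theta)    # unbiased, variance <= nu^2 by assumption
        o = update(o, theta, g)
        history.append(o)
    return aggregate(history)

# Example: plain SGD on F(theta) = ||theta||^2 / 2 with a noisy gradient oracle.
rng = np.random.default_rng(0)
theta_hat = run_first_order_method(
    oracle=lambda th: th + 0.1 * rng.standard_normal(th.shape),
    query=lambda o: o,
    update=lambda o, th, g: th - 0.05 * g,
    aggregate=lambda hist: np.mean(hist, axis=0),
    o0=np.ones(3),
    T=200,
)
print(theta_hat)  # should be close to the minimizer, the zero vector
```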
We first state the following proposition about stochastic gradient methods, which records the standard convergence guarantees of (projected) stochastic gradient descent when run with a first-order stochastic oracle; we describe the setting of its three cases and refer to [18] for the explicit rates.

Proposition 28. Suppose we run a projected stochastic gradient method for T steps using a first-order stochastic oracle O_{F,ν²}.
1. If F is convex, then the expected suboptimality of the (averaged) iterate is bounded in terms of R, ν, and T.
2. If F is a µ-strongly convex function and we have access to O_{F,ν²}, then the expected suboptimality improves to a rate depending on ν²/(µT).
3. For a point x in the Euclidean space that contains Θ, define Π_Θ(x) := arg min_{θ∈Θ} ‖x − θ‖₂. In addition, for γ > 0, define the gradient mapping G_{F,γ} as

G_{F,γ}(θ) := (θ − Π_Θ(θ − γ · ∇F(θ)))/γ.   (3)

Assuming that we have access to θ₀ ∈ Θ such that ‖G_{F,1/H}(θ₀)‖₂ − inf_{θ∈Θ} ‖G_{F,1/H}(θ)‖₂ ≤ ∆₁ for some ∆₁ ≥ 0, the expected norm of the gradient mapping at the output is bounded in terms of ∆₁, ν, H, and T.

Recall our assumptions on Θ and the loss function ℓ. Also, recall the definition of L(θ; X) for X = {X_{i,j}}. We are now ready to state the main guarantee of Levy et al. [18] regarding private user-level empirical risk minimization. We then describe how our mean estimation algorithm can be used to improve it.

Theorem 29. [18, Theorem 4] Suppose that n = Ω̃(√(d · T)/ε), where the Ω̃ hides logarithmic factors in 1/δ, T, d, m, B/σ, and 1/α, where α is the failure probability and B is (the radius of) a ball that is promised to contain all data points X_{i,j} even in the worst case. Then, there exists an algorithm ([18, Algorithm 4]) that takes as input X = {X_{i,j}}, is instantiated with a first-order gradient algorithm, and outputs some θ ∈ Θ, with the following properties. First, the algorithm is (ε, δ) user-level differentially private for any 0 < ε, δ ≤ 1. In addition, there exist variants of projected stochastic gradient descent (the algorithm with the guarantees described in Proposition 28) such that, if [18, Algorithm 4] is instantiated with this projected SGD, then:
1. If for all z ∈ Z, ℓ(·, z) is convex, then the corresponding convex guarantee of Proposition 28 carries over to the empirical risk.
2. If for all z ∈ Z, ℓ(·, z) is µ-strongly convex, then the improved strongly convex rate carries over.
3. Defining G_{F,γ}(θ) as in Equation (3), the analogous bound on the gradient mapping holds in the smooth case.
(The explicit right-hand sides are those of [18, Theorem 4] and are not needed for the discussion below.)

The algorithm of [18] importantly uses high-dimensional mean estimation T times, and so by strong composition requires (ε′, δ′)-differential privacy for each iteration, where ε′ = Θ(ε/√T) and δ′ = Θ(δ/T). Because of this, they require n = Ω̃(√d/ε′), as they were only able to obtain mean estimation algorithms in this regime.
However, because of our extension of user-level private mean estimation to n = Ω̃((1/ε′) · log(1/δ′)), their theorem can be immediately improved as follows.

Theorem 30. Theorem 29 still holds even if n = Ω̃(√T/ε), where Ω̃ hides logarithmic factors in 1/δ, T, d, m, B/σ, and 1/α.

Hence, we provide an extension of user-level private empirical risk minimization, which uses a variant of projected stochastic gradient descent, to the regime where n is relatively small, i.e., n ≥ Ω̃(√T/ε) (which has almost no dependence on d), as opposed to n ≥ Ω̃(√(dT)/ε).

Next, we describe how we can also obtain improvements for the stochastic convex optimization problem. Here, instead of trying to minimize L(θ; X), we wish to minimize L(θ; D), i.e., minimize the expected loss over the full unknown distribution rather than just over the known samples. Recall our assumptions on Θ and the loss function ℓ, and the distribution D over the data space Z. We are now ready to describe the stochastic convex optimization guarantees of Levy et al. [18]. We then describe how our mean estimation algorithm can be used to improve them.

Theorem 31. Define Ḡ := min(G, σ√d), and suppose that n = Ω̃(min((d² · m · H² · R²/(G · Ḡ · ε⁴))^{1/3}, H · R · √m/(σ · ε))). Then, there exists a stochastic convex optimization algorithm ([18, Algorithm 5]) that takes as input X = {X_{i,j}} and has the following properties. The algorithm is (ε, δ) user-level differentially private, and if D and ℓ satisfy our assumptions and each X_{i,j} ∈ Z is drawn i.i.d. from D, then the output θ satisfies the population-risk guarantee stated in [18], whose explicit form we do not need below.

Indeed, their proof at one point assumes that n ≥ Ω̃((1/ε) · √(d · H/λ)), where they set T = Θ(H/λ), as that was the necessary assumption from Theorem 29, and where λ = (σ² · d²/(n² · m · ε²) + G · Ḡ/(n · m))/R. (We allow the Ω̃ to hide all logarithmic factors.) However, because of our improved Theorem 30, we only need that n ≥ Ω̃((1/ε) · √(max(1, H/λ))) instead. Hence, we obtain the following theorem.

Theorem 32. Theorem 31 holds even if n = Ω̃(ε⁻¹ + min((m · H² · R²/(G · Ḡ · ε⁴))^{1/3}, H · R · √m/(d · σ · √ε))), where Ω̃ hides all logarithmic factors.

Hence, we provide an extension of user-level private stochastic convex optimization to a smaller regime of n, again having almost no dependence on d.

Acknowledgments

We thank Piotr Indyk for several helpful discussions. We also thank Srivats Narayanan for pointing us to the reference [6].

References

[1] Deep learning with differential privacy.
[2] The fast Johnson-Lindenstrauss transform and approximate nearest neighbors.
[3] Bounding user contributions: A bias-variance trade-off in differential privacy.
[4] Generative models for effective ML on private, decentralized datasets.
[5] Convex optimization: Algorithms and complexity.
[6] Impact of small-n studies during a pandemic.
[7] Stochastic model-based minimization of weakly convex functions.
[8] Differential privacy: A survey of results.
[9] Differential privacy and robust statistics.
[10] Calibrating noise to sensitivity in private data analysis.
[11] The algorithmic foundations of differential privacy.
[12] Boosting and differential privacy.
[13] Data mining with differential privacy.
[14] Introduction to property testing.
[15] Boosting the accuracy of differentially-private histograms through consistency.
[16] Privately learning high-dimensional distributions.
[17] Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise.
[18] Learning with user-level privacy.
[19] Learning discrete distributions: User vs item-level privacy.
[20] A general approach to adding differential privacy to iterative training procedures.
[21] Learning differentially private recurrent language models.
[22] Mechanism design via differential privacy.
[23] Deterministic O(1)-approximation algorithms for 1-center clustering with outliers.
[24] Interactive fingerprinting codes and the hardness of preventing false discovery.
[25] A statistical framework for differential privacy.
[26] Differentially private SQL with bounded user contribution.
[27] Differential privacy via wavelet transforms.