Location Leakage in Federated Signal Maps

Evita Bakopoulou, Jiang Zhang, Justin Ley, Konstantinos Psounis, Athina Markopoulou

December 7, 2021. Preprint, under review.

Abstract. We consider the problem of predicting cellular network performance (signal maps) from measurements collected by several mobile devices. We formulate the problem within the online federated learning framework: (i) federated learning (FL) enables users to collaboratively train a model while keeping their training data on their devices; (ii) measurements are collected as users move around over time and are used for local training in an online fashion. We consider an honest-but-curious server, who observes the updates from target users participating in FL and infers their location using a deep leakage from gradients (DLG) type of attack, originally developed to reconstruct training data of DNN image classifiers. We make the key observation that a DLG attack, applied to our setting, infers the average location of a batch of local data, and can thus be used to reconstruct the target users' trajectory at a coarse granularity. We show that a moderate level of privacy protection is already offered by the averaging of gradients, which is inherent to Federated Averaging. Furthermore, we propose an algorithm that devices can apply locally to curate the batches used for local updates, so as to effectively protect their location privacy without hurting utility. Finally, we show that the effect of multiple users participating in FL depends on the similarity of their trajectories. To the best of our knowledge, this is the first study of DLG attacks in the setting of FL from crowdsourced spatio-temporal data.

Mobile crowdsourcing is widely used to collect data from a large number of mobile devices; the data are useful on their own and/or are used to train models for properties of interest, such as cellular/WiFi coverage, sentiment, occupancy, temperature, COVID-related statistics, etc. Within this broader class of spatio-temporal models trained on mobile crowdsourced data [11], we focus on the representative and important case of cellular signal maps. Cellular operators rely on key performance indicators (a.k.a. KPIs) to understand the performance and coverage of their network, in their effort to provide the best user experience. These KPIs include wireless signal strength measurements, especially Reference Signal Received Power (RSRP), which is the focus of this paper, as well as other performance metrics (e.g., coverage, throughput, delay) and information associated with the measurement (e.g., location, time, frequency band, device type). Cellular signal strength maps consist of KPIs at several locations. Traditionally, cellular operators collected such measurements by hiring dedicated vans (a.k.a. wardriving [48]) with special equipment, to drive through, measure and map the performance in a particular area of interest. However, in recent years, they increasingly outsource the collection of signal maps to third parties [4]. Mobile analytics companies (e.g., OpenSignal [42], Tutela [2]) crowdsource measurements directly from end-user devices, via standalone mobile apps or SDKs integrated into popular partnering apps, typically games, utility or streaming apps.
The upcoming dense deployment of small cells for 5G at metropolitan scales will only increase the need for accurate and comprehensive signal maps [22, 27]. Because cellular measurements are expensive to obtain, they may not be available for all locations, times and other parameters of interest; there is therefore a need for signal map prediction based on the limited available measurements. Signal map prediction is an active research area and includes: propagation models [12, 49], data-driven approaches [13, 20, 25], combinations thereof [43], and increasingly sophisticated ML models for RSRP [4, 19, 30, 45] and throughput [47, 50]. However, all these prediction techniques consider a centralized setting: mobile devices upload their measurements to a server, which then trains a global model to predict cellular performance (typically RSRP) based on at least location, and potentially time and other features. Clearly, this introduces privacy risks for the participating users.

Fig. 1 depicts an example of a dataset collected on a university campus, which is one of the datasets we use throughout this paper. (Figure 1: (a) RSRP measurements of one user. (b) RSRP measurements of another user. (c) RSRP measurements from all users.) Fig. 1(a) and (b) show the locations where measurements of signal strength (RSRP) were collected by two different volunteers as they moved around the campus. The measurements from all users are uploaded to a server, which then creates a signal map for the campus. The server can then merge the datasets from several participating users and store them (shown in Fig. 1(c)); and/or it may train a global model for predicting signal strength based on location and other features. However, this utility comes at the expense of privacy: as evident in Fig. 1, frequently visited locations may reveal the user's home, work, or other important locations, as well as their mobility pattern; these may further reveal sensitive information such as medical conditions, political beliefs, etc. [3]. The trajectories of the two example users are also sufficiently different from each other, and can be used to distinguish between them, even if pseudo-ids are used.

In this paper, we make the following contributions. First, we formulate the signal strength prediction problem within an online federated learning framework. Second, we study, for the first time, a DLG-based location privacy attack launched by an honest-but-curious server.

First, w.r.t. the signal map prediction problem: we formulate the simplest version that captures the core of the problem. More specifically, we train a DNN model to predict signal strength (RSRP) based on GPS location (latitude, longitude), while local training data arrive in an online fashion. The problem lends itself naturally to FL: training data are collected by the mobile devices, which want to collaborate without uploading sensitive location information. FL enables mobiles to do exactly that, by exchanging model parameters with the server while keeping their training data away from the server [39]. The problem further lends itself to online learning, because the training data are collected over time [10, 33] as the users move around. Second, w.r.t. the location inference: we consider an honest-but-curious server, which implements online FL accurately but attempts to infer the location of users.
Since gradient updates are sent from users to the server in every FL round (explicitly in Federated SGD (FedSGD), or computed from model updates in Federated Averaging (FedAvg) [39]), FL lends itself naturally to inference attacks from gradients. We adapt the DLG attack, originally developed to reconstruct training images and text used for training DNN classifiers [23, 52]. A key observation, which we confirm both empirically and analytically, is that a DLG attacker who observes a single gradient (SGD) update from a target user can reconstruct the average location of the points in the batch. Over multiple rounds of FL, this allows the reconstruction of the target(s)' mobility pattern. We show that the averaging of gradients inherent in FedAvg provides a moderate level of protection against DLG, while simultaneously improving utility; we systematically evaluate the effect of multiple federated learning parameters (E, B, R, η) on the convergence and success of the attack. We also propose Diverse Batch, an algorithm that mobile devices can apply locally to curate their batches, so as to further protect their location privacy. Finally, we show that the effect of multiple users participating in FL, w.r.t. the success of the DLG attack, depends on the similarity of the trajectories of those users. Throughout this paper, we use two real-world datasets: (i) the geographically small but dense Campus Dataset [5], introduced in Fig. 1; and (ii) the larger but sparser, publicly available Radiocells Dataset [1], especially its subset from the London metropolitan area. We purposely do not consider additional privacy-preserving defense techniques such as Differential Privacy (DP) [18, 41] or Secure Aggregation (SA) [8], in order not to distort the data (so that utility remains intact) and to keep the overhead at the device low. We find that, in this setting, we can still achieve meaningful privacy without such mechanisms. If stronger privacy guarantees are desired, one may apply DP or SA on top.

The outline of the paper is as follows. Section 2 formulates the federated online signal map prediction problem and the corresponding DLG attack, and provides key insights. Section 3 describes the evaluation setup, including datasets and privacy metrics. Section 4 presents the evaluation results for different simulation scenarios. Section 5 discusses related work. Section 6 concludes and outlines future directions.

In Section 2.1, we model the problem within the online federated learning framework and we define the DLG attack that allows an honest-but-curious server to infer the whereabouts of target user(s). In Section 2.2, we provide analytical insights that help explain the performance of the DLG attack for various user trajectories and parameter tunings, and that guide the algorithm design choices and recommendations.

Signal Maps Prediction. Signal map prediction typically trains a model to predict a key performance indicator (KPI) y_i based on an input feature vector x_i ∈ R^m, where i denotes the i-th datapoint in the training set. W.l.o.g. we consider the following: y is a metric capturing the signal strength, and we focus specifically on Reference Signal Received Power (RSRP), which is the most widely used KPI in LTE. For the features x used for prediction of y, we focus on the spatial coordinates (longitude, latitude), i.e., m = 2. (Prior work [4] has shown that the most important features for RSRP prediction are the coordinates and time.
In this paper we handle time not as a feature, but within the online learning framework.) We train a DNN model with weights w, per cell tower, y_i = F(x_i, w); the loss function is the Mean Square Error (MSE), and we report the commonly used Root Mean Squared Error (RMSE) as our utility metric. We consider a general DNN architecture, and we tune its hyperparameters (depth, width, type of activation functions, learning rate η) via the Hyperband tuner [32] to maximize utility. Tuning the DNN architecture can be done using small datasets per cell tower, which are collected directly or contributed by users or third parties willing to share/sell their data. Thus, unlike prior work [52], our DNN architecture is quite general. In Sec. 4, the tuned DNN architecture is optimized for the datasets used there, namely the Campus Dataset and the Radiocells Dataset. For example, when tuned for Campus Dataset cell x204, the DNN consists of two hidden layers with 224 and 640 units and ReLU and sigmoid activation functions, respectively, and dropout layers between the hidden layers with 5% of units being dropped randomly. The dropout layers are necessary to prevent overfitting in overparameterized DNN architectures [46].
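For concreteness, the tuned architecture just described can be sketched as follows. This is a minimal Keras sketch under our stated hyperparameters; the function name is illustrative, and we show a single 5% dropout layer between the two hidden layers.

    import tensorflow as tf

    def build_rsrp_model(lr=1e-3):
        """Per-cell signal-map model: (latitude, longitude) -> RSRP (dBm)."""
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(2,)),                     # x = (lat, lon)
            tf.keras.layers.Dense(224, activation="relu"),
            tf.keras.layers.Dropout(0.05),                  # 5% dropout between hidden layers
            tf.keras.layers.Dense(640, activation="sigmoid"),
            tf.keras.layers.Dense(1),                       # y = predicted RSRP
        ])
        model.compile(optimizer=tf.keras.optimizers.SGD(lr),
                      loss="mse",                           # training loss (MSE)
                      metrics=[tf.keras.metrics.RootMeanSquaredError()])
        return model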
Measurement collection over time and space. We consider several users, each with a mobile device, who collect several signal strength measurements {(x_i, y_i)}_{i=1}^N as they move around throughout the day, and who occasionally upload some information to the server. Examples are shown in Fig. 1 for users moving around a university campus; Fig. 2 focuses on a single such user and the locations where they collected measurements on three different days. It is important to note that the measurement data are not static but collected in an online fashion. Users continuously collect such measurements as they move around throughout the day, and they periodically upload them to the server, e.g., at night when the mobile is plugged in to charge and connected to WiFi. Let time be divided into intervals of duration T, indexed by t = 1, ..., R. (In the previous example, T was one day, but we also consider 1h, 3h, one week, etc.) At the end of each such T-interval, user k processes the set of measurement data D_t^k collected over that last round and sends an update to the server. We also refer to D_t^k as the local data that "arrive" at user client k in interval/round t. Collected over t = 1, ..., R, the D_t^k reveal a lot about user k's whereabouts, as evident from the examples of Fig. 1 and Fig. 2. Human mobility is well known to have strong patterns: people spend most of their time in a few important places (e.g., home and work), and move in continuous trajectories. The locations x collected as part of the signal map measurements {(x_i, y_i)}_{i=1}^N essentially sample the user's real trajectory.

Federated Signal Maps. State-of-the-art mobile crowdsourcing practices [2, 11] rely on the server to collect the raw measurements (in our context, locations and associated RSRP measurements) from multiple users, aggregate them into a single map, and perhaps train a centralized prediction model, with the associated location privacy risks. In this paper, we apply for the first time FL [39] to the signal map prediction problem, which allows users to keep their data on their devices, while still collaborating to train a global model by exchanging model parameter updates with the server.

In the federated learning framework [29, 39], the server and the users agree on a DNN architecture and communicate in rounds (t = 1, ..., R) to train it. In every round t, the server initializes the global parameters w_t and asks for a local update from a subset (fraction C) of the users. Each user k trains a local model on its local data and sends its update for the local model parameters w_t^k to the server. The server averages all received updates, updates the global parameters to w_{t+1}, and initiates another round, until convergence. If a single gradient descent step is performed on the local data, the scheme is called Federated SGD (FedSGD). If there are multiple local steps (i.e., the local data are partitioned into mini-batches of size B each, there is one gradient descent step per mini-batch, and there are multiple passes, E epochs, over the local data), the scheme is called Federated Averaging (FedAvg) [39]. FedSGD is FedAvg with E = 1, B = ∞, C = 1, where B = ∞ indicates that the entire local batch is treated as one mini-batch.

(Figure 2: Example of the Online Federated Learning Framework with DLG Attack, using data from the campus. The target user k collects data in an online fashion and processes them in intervals of duration T = 1 day. The right part of the picture shows the real locations (in light blue) the target user visited on different days. During round t, the target collects local data D_t^k, uses it for local training, updates the local model weights w_t^k and shares them with the server. During local training, the local data D_t^k is further split into a list B_t^k of mini-batches, each of size B. The server observes the model parameter update w_t^k at time t, computes the gradient (∇w = w_t^k − w_{t−1}^k) and launches a DLG attack using Algorithm 2. For each day t, it manages to reconstruct the centroid (average location) of the points in D_t^k (shown in dark blue). During the last day, when the user did not move much, the centroid conveys quite a lot of information.)

Online Federated Learning. A significant difference in our setting compared to the classic FL setting [39] is that the local data of user k are not available all at once, but arrive in an online fashion as the user collects measurements. We consider that the interval T (for processing online data) coincides with one round t of federated learning. At the end of each such interval/round t, the user processes the local data D_t^k that arrived during the last time interval T; it then updates its local model parameters w_t and sends the update to the server. We introduce a new local pre-processing step in line 16 of Algorithm 1: the user may choose to use all recent local data D_t^k, or a subset of it, as its LocalBatch. (Unless explicitly noted, we mean LocalBatch = D_t^k, except for Section 4.3, where Diverse Batch is introduced to pick LocalBatch ⊂ D_t^k so as to increase location variance and privacy.) Once LocalBatch is selected, FedAvg can further partition it into a set of mini-batches (B_t^k) of size B. An example is depicted in Fig. 2, where data are collected and processed by user k in rounds of T = one day. How to update the model parameters based on the stream of local data ({D_t^k}, t = 1, ..., R) is the subject of the active research area of online learning [10, 33]. We adopt the following approach: in every round t, user k uses only its latest batch D_t^k for local training and for computing w_t^k.
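A minimal sketch of this client-side round follows, assuming a Keras-style model object; the helper names and the uniform shuffling are illustrative, not part of our system.

    import numpy as np

    def client_update(model, global_weights, local_batch, B=20, E=5):
        """One online FedAvg round at user k: E local epochs of SGD over
        B-sized mini-batches of LocalBatch (here, the latest data D_t^k)."""
        model.set_weights(global_weights)           # start from the global w_t
        x, y = local_batch
        for _ in range(E):                          # E passes over the local data
            order = np.random.permutation(len(x))
            for s in range(0, len(x), B):           # one SGD step per mini-batch
                j = order[s:s + B]
                model.train_on_batch(x[j], y[j])
        return model.get_weights()                  # w_t^k, sent to the server

    def fedavg_aggregate(updates, sizes):
        """Server side: average received updates, weighted by local data size."""
        n = float(sum(sizes))
        return [np.sum([w[i] * (m / n) for w, m in zip(updates, sizes)], axis=0)
                for i in range(len(updates[0]))]

With E = 1 and B set to the full LocalBatch size, client_update reduces to the FedSGD special case described above.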
The data collected in previous rounds (D_1^k, ..., D_{t−1}^k) have already been used to compute the previous local (w_{t−1}^k) and global (w_{t−1}) model parameters, but are not used for the new local update. This is one of the state-of-the-art approaches in federated learning [14, 15, 37]. In this paper, we do not consider training on data from previous rounds. (See Appendix A.3 on "Cumulative Online FL" for a discussion of the other extreme, where training is performed on all available data at any given time.)

Honest-but-Curious Server. We assume an honest-but-curious server who receives and stores model updates from each user, and whose goal is to infer the user's locations. The server may be interested in various location inference goals: e.g., the user trajectory at various spatio-temporal granularities, important locations (e.g., home or work), or presence near points-of-interest (e.g., has the user been at the doctor's office?). W.l.o.g., the server targets a user k who participates in round t: it compares updates across successive rounds, w_{t−1}^k and w_t^k, and computes the gradient for round t; see Algorithm 1 and Fig. 2. It then uses this gradient to infer user k's location in round t, as described next.

Algorithm 1: Online FedAvg with DLG Attack.
  1  Given: K users (indexed by k); B local mini-batch size; E number of local epochs; R number of rounds of duration T each; C fraction of clients; n_t the total data size from all users at round t; η the learning rate. The server aims to reconstruct the local data of target user k.
  2  Server executes:
  3    Initialize w_0
  4    for each round t = 1, 2, ..., R do
  5      m ← max(C · K, 1)
  6      S_t ← (random subset (C) of users)
  7      for each user k ∈ S_t in parallel do
         ...

DLG Attack against a Single Update. At line 11 of Algorithm 1, the server launches a DLG attack to infer the location of user k in round t. The DLG attack is defined in Algorithm 2, and an example is shown in Fig. 3. In each iteration i, the DLG algorithm: 1) randomly initializes a dummy location x_0 (shown in yellow); 2) obtains the gradient at the dummy location, ∇w_i, a.k.a. the dummy gradient; 3) updates the dummy location towards the direction that minimizes the cosine distance between the dummy and true gradients. We choose to minimize the cosine, as opposed to the Euclidean, loss in order to match the direction, not the magnitude, of the true gradient [23].

Implementation details. (1) The attacker reconstructs both the location x (i.e., latitude and longitude coordinates) and the RSRP value y; we cannot use the analytical reconstruction of the label proposed in [51], since we have regression instead of classification. (2) We observe that different location initializations converge to the same point in practice, as shown in Fig. 6. We initialize the prediction label with the mean RSRP from the training data, which is realistic: the attacker can have access to public cellular signal strength measurements, or can collect a few measurements around each cell tower. (3) We set the maximum number of iterations to m = 400,000, and add an early stopping condition: if the reconstructed location does not change for 10 DLG iterations, then we declare convergence.

Key observation O1: DLG on one batch. An example of our DLG attack on one SGD update from user k is depicted in Fig. 3: it reconstructs x_DLG, which ends up being close to the average location x̄ of the batch D_t^k. We experimented with multiple initializations and found this to hold in practice, independently of the initialization; see Fig. 6.
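The attack loop of Algorithm 2 can be sketched as follows. This is a minimal PyTorch sketch assuming a torch reimplementation of the per-cell DNN (the `model` argument); the optimizer choice (Adam) and all names are illustrative, and `true_grads` stands for the gradient the server derives from the target's update.

    import torch
    import torch.nn.functional as F

    def dlg_attack(model, true_grads, x_init, y_init,
                   max_iters=400_000, patience=10, lr=0.01):
        # Dummy location (lat, lon) and dummy RSRP label, both optimized.
        x = torch.tensor(x_init, dtype=torch.float32, requires_grad=True)
        y = torch.tensor([y_init], dtype=torch.float32, requires_grad=True)
        opt = torch.optim.Adam([x, y], lr=lr)
        target = torch.cat([g.flatten() for g in true_grads])
        prev, stalled = None, 0
        for _ in range(max_iters):
            opt.zero_grad()
            loss = F.mse_loss(model(x.view(1, 2)).view(-1), y)
            dummy = torch.autograd.grad(loss, list(model.parameters()),
                                        create_graph=True)
            # Match gradient direction (cosine distance), not magnitude [23].
            d = torch.cat([g.flatten() for g in dummy])
            cos_loss = 1.0 - F.cosine_similarity(d, target, dim=0)
            cos_loss.backward()
            opt.step()
            cur = x.detach().clone()
            # Early stopping: location unchanged for `patience` iterations.
            stalled = stalled + 1 if prev is not None and torch.allclose(cur, prev) else 0
            prev = cur
            if stalled >= patience:
                break
        return x.detach(), y.detach()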
Therefore, when applied to a single local update, Algorithm 2 reconstructs one location x_DLG, which is close to the average location x̄ of the N points in that batch. This is in contrast to the original DLG attacks on images [23, 52], which aimed at reconstructing all N points in the batch. Since local data arrive in an online fashion, the server can launch one DLG attack per round and reconstruct one (the average) location in each round. All these reconstructed locations together can reveal the user's whereabouts, as discussed next.

Key observation O2: DLG across several rounds. Human mobility is well known to exhibit strong spatio-temporal patterns: (i) people move in continuous trajectories, and (ii) they frequently visit a few important locations, such as home, work, gym, etc. [7, 28]. Because of (i), inferring the average location in successive rounds essentially reveals the trajectory at the granularity of interval T. Because of (ii), there is inherent clustering around these important locations, as exemplified in Fig. 4a. When successfully inferred, these can reveal sensitive information [3] and help identify the user [16].

In this section, we analyze how close the location reconstructed by the DLG attack (x_DLG) is to the true average location of the batch (x̄). The analysis explains our empirical observations analytically, provides insights into the performance of DLG attacks depending on the input data characteristics, and guides our design choices for maximizing privacy.

Lemma 1. Given the gradient of a single SGD step computed on a mini-batch {(x_i, y_i)}_{i=1}^B, the DLG attacker can reconstruct a unique user location x_DLG, which satisfies:

    x_DLG = ( Σ_{i=1}^B g_i(w) x_i ) / ( Σ_{i=1}^B g_i(w) ),  where g_i(w) = ∂ℓ(F(x_i, w), y_i) / ∂b_h^1.    (1)

Proof: We prove this lemma under two assumptions on the DNN model architecture: (1) the DNN model starts with a biased fully-connected layer (Assumption 1 in Appendix A.1.1); and (2) the bias vector of its first layer (b^1) has not converged (Assumption 2 in Appendix A.1.1).

Next, we bound the distance between the user location reconstructed by the DLG attacker and the centroid of the user locations in a data mini-batch, as follows:

Theorem 1. Suppose a data mini-batch {(x_i, y_i)}_{i=1}^B is used to update the DNN model y_i = F(x_i, w) during a gradient descent step. Then the reconstruction error of the DLG attacker, defined as the L_2 distance between the user location reconstructed by the DLG attacker, x_DLG, and the centroid of the user locations in this mini-batch, x̄ = (1/B) Σ_{i=1}^B x_i, can be bounded by the following expression:

    ||x_DLG − x̄||_2 = ||Cov(g_i(w), x_i)||_2 / |ḡ(w)| ≤ ( Var(g_i(w)) + Var(||x_i||_2) ) / ( 2 |ḡ(w)| ),    (2)

where Cov and Var denote the sample covariance and variance over the mini-batch, and ḡ(w) = (1/B) Σ_{i=1}^B g_i(w).

Proof: See Appendix A.1.2.

Theorem 1 says that the reconstruction error of the DLG attacker is equal to the L_2 norm of the sample covariance matrix between the partial derivative over the bias, g_i(w), and the user location x_i, divided by the (absolute value of the) average partial derivative within a mini-batch, ḡ(w). Moreover, this error can be upper bounded by the sum of the sample variance of g_i(w) and the sample variance of ||x_i||_2 in the mini-batch.

The above theorem involves the partial derivatives g_i(w), whose values are hard to predict when x_i varies. To bound the error of the DLG attacker without involving such derivatives, we further make the mild assumption that the gradient function is Lipschitz continuous (see Assumption 3 in Appendix A.1.3), and state the following theorem:

Theorem 2. Subject to the Lipschitz continuity assumption about ∇ℓ(F(x_i, w), y_i), the reconstruction error of the DLG attacker can be bounded by:

    ||x_DLG − x̄||_2 ≤ ( (L^2 + 1) Var(||x_i||_2) + L^2 Var(y_i) ) / ( 2 |ḡ(w)| ),    (3)

where L is the Lipschitz constant of Assumption 3. Theorem 2 says that the reconstruction error of the DLG attacker can be bounded by a weighted sum of the sample variance of the user data ||x_i||_2 and of the labels y_i in the mini-batch.
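The closed form of Lemma 1 can be sanity-checked numerically; below is a minimal PyTorch sketch with a toy network and random data (all values and names illustrative). It computes the first-layer weight and bias gradients and verifies that their ratio is the g-weighted average of the batch inputs.

    import torch

    torch.manual_seed(0)
    # Toy model starting with a biased fully-connected layer (Assumption 1).
    net = torch.nn.Sequential(torch.nn.Linear(2, 8), torch.nn.ReLU(),
                              torch.nn.Linear(8, 1))
    xb, yb = torch.randn(16, 2), torch.randn(16, 1)   # mini-batch, B = 16

    loss = torch.nn.functional.mse_loss(net(xb), yb)
    gW, gb = torch.autograd.grad(loss, [net[0].weight, net[0].bias])

    # For a unit h with non-zero bias gradient (Assumption 2), the ratio of
    # the first-layer weight gradient to the bias gradient is the g-weighted
    # average of the inputs, i.e., exactly the x_DLG of Eq. (1).
    h = torch.argmax(gb.abs())
    x_dlg = gW[h] / gb[h]
    print(x_dlg, xb.mean(dim=0))  # close to the centroid when the g_i are similar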
(I1) Impact of data mini-batch variance. Theorem 1 shows that the variance of the user locations affects the upper bound of the DLG attacker's reconstruction error: the smaller the data mini-batch variance, the smaller the upper bound of the reconstruction error. Theorem 1 also shows that the DLG error depends on the variance of the gradients. One may intuitively argue that, since the randomness of the gradients comes from the randomness of the data, the larger the data variance, the larger the gradient variance too, and thus the larger the error. Theorem 2 does not involve gradients: it directly shows how the variance of the local user data and the associated labels affects the upper bound of the reconstruction error. Motivated by the above discussion, we propose a novel algorithm, which we refer to as Diverse Batch, to increase the mini-batch data variance of each user during training; see Sec. 4.3. It is worth noting that, in theory, an increasing upper bound does not guarantee that the actual reconstruction error will increase. That said, and as expected by intuition, we empirically show this to be the case (see Table 2 in Sec. 4.3).

(I2) Impact of model convergence rate. As shown in the above theorems, another key component affecting the upper bound of the attacker's reconstruction error is |ḡ(w)|, the average partial derivative over the bias, which reflects the convergence level of the global model. As the global model converges (i.e., as the training loss converges to zero), |ḡ(w)| also converges to zero, and hence the upper bound of the reconstruction error diverges to infinity. This is expected, since the attacker needs user information leaking from the gradient in order to reconstruct the user's location. Recall that the attacker attempts to reconstruct one user location per mini-batch and FL round. Hence, the faster the model converges, the faster the reconstruction error diverges, and thus a smaller fraction of the reconstructed user locations will be accurate, namely the ones corresponding to the early reconstruction attempts; see Fig. 7.

(I3) Impact of Averaging. While under FedSGD the attacker observes the gradient update after each mini-batch, under FedAvg the attacker observes the gradient update at the end of the batch/round; this gradient update is thus the time average of ∂ℓ(F(x_i, w), y_i)/∂b_h^1 during the training round. We can apply Theorems 1 and 2 to the FedAvg case by re-defining g_i(w) as the time average of ∂ℓ(F(x_i, w), y_i)/∂b_h^1 during the training round, and the impact of each parameter is then the same as discussed above for FedSGD. In Sec. 4.1, we show the impact of the FedAvg parameters (B, E) on the DLG attack: as B decreases and/or E increases, the attack becomes less accurate, due to the faster model convergence caused by multiple local gradient descent steps (see Fig. 9 and Fig. 10).

(I4) Impact of multiple users. FL involves the participation of multiple users, who jointly affect the convergence of the global model. Prior work has shown that as the data diversity across multiple users increases, i.e., as the dissimilarity or heterogeneity between users increases, the global model may converge more slowly or even diverge [24, 35]. As discussed above, the global model's convergence rate impacts the DLG attacker's reconstruction accuracy. Thus, when the data diversity across multiple users increases, we expect the global model to converge more slowly, resulting in more accurate reconstruction of user locations by the DLG attacker.
In Sec. 4.4, we empirically show how the similarity of multiple users affects the DLG attacker's performance (see Table 3 in Sec. 4.4).

We evaluate the success of the DLG attack in different scenarios; we specify the exact configuration and parameter tuning for online federated learning (Algorithm 1), the DLG attack (Algorithm 2), and any defense mechanism in Section 4. In Sec. 3.1, we describe the two real-world datasets that we use as input to our simulations. In Sec. 3.2, we define privacy metrics that quantify the privacy loss due to the attack.

Campus Dataset [5]. This dataset was collected on a university campus and is the one depicted in Fig. 1. It contains real traces from seven different devices used by student volunteers and faculty members, using two cellular providers, over a period of four months. IRB review was not required, as the proposed activity was deemed non-human-subjects research by the IRB of our institution. It is a relatively small dataset: it spans an entire university campus, a geographical area of approx. 3 km² and 25 cell towers. However, it is very dense: it consists of 169,295 measurements in total. Pseudo-ids are associated with the measurements, which facilitates the simulation of user trajectories in FL. The number of measurements per cell tower and user varies; details are deferred to Appendix A.2. For the evaluation in Sec. 4, we pick the cell tower x204 with the largest number of data points, and we choose user 0, who has the most measurements, as the target user. Unless mentioned otherwise, this is the default DLG setup. We also provide in the appendix additional results with another cell tower (x355). The campus is depicted in Fig. 1c, and the locations of the target (user 0) are depicted in Fig. 1a (all cells) and Fig. 4a (cell x204 only). It is worth noting that we know the frequently visited locations coincide with the home (on the right part of the picture) and work (campus buildings on the left part of the picture) locations of the user. This side information becomes useful when we assess the success of the attack. Moreover, the attacker uses the campus boundaries (an area of 3 km²) as the defined area of the attack; if some reconstructed locations fall outside this area, they are treated as diverged.

Radiocells Dataset [1]. We also consider the large-scale, real-world Radiocells Dataset, which comes from a community project founded in 2009 under the name openmap.org. It contains data from over 700 thousand cell towers around the world and is publicly available [1]. Raw measurement data are available for in-depth analysis of communication networks or cell coverage, and can be used for our problem of signal maps. The measurements are contributed by real users without logging their user ids. Users log and upload their data in multiple upload files, each containing device information, the version of the app, etc., which can be used for distinguishing users, as in [11]. We focus on cellular data from 2017 (8.5 months) in the area of London, UK, from the top cell tower (x455), which had the most measurements (64,302 in total) from approx. 3,500 upload files, over a geographical area that spans approx. 5,167 km². Each upload file corresponds to a single device, typically containing 16 measurements per 2h on average. Since no pseudo-ids are provided that would allow us to link multiple upload files of a user, we use heuristics to create longer user trajectories, so that we have enough data points per batch; see details in Sec. 4.4.
We use the RMSE of the signal map prediction problem as our utility metric. We also need metrics that capture how similar or dissimilar the reconstructed locations are from the real ones. Any FL algorithm, and any defense mechanism, must be evaluated w.r.t. the privacy-utility tradeoff it achieves.

Visualization. Visually comparing the reconstructed (x_DLG) to the true locations in the batch (D_t^k) provides intuition. For example, Fig. 4(a) shows the real locations (in light blue) and the locations reconstructed per 1h-long batch (in color), for a user in the Campus Dataset. One can see that the reconstructed locations match the frequently visited locations of the user. For example, DLG indeed seems to reconstruct locations on the right side of the figure, where we confirmed that graduate student housing is located on this campus. This is expected, based on our key observation O2 (the characteristics of human mobility, i.e., continuous trajectories and spending most of the time at a few important locations).

Distance from the Centroid. In order to assess how accurate the attack is within a single round t, we use the distance ||x_DLG − x̄_t||, i.e., how far (in meters) the reconstructed location is from the average location of the points in that batch B_t^k. Based on key observation O1, we expect this distance to be small when the attack converges.

Comparing location distributions. In order to assess the success of the attack considering all rounds t = 1, ..., R, we need a metric that captures how similar or dissimilar the reconstructed locations are from the real ones; see, e.g., Fig. 4 for an example of real (light blue) vs. reconstructed (in color) locations over multiple rounds. We considered the KL-divergence and the Jensen-Shannon distance, which are well-known metrics for comparing distributions. However, they would only capture the difference in probability mass, not the spatial distance between real and inferred frequent locations, which is of interest in our case, as per key observation O2. We use the Earth Mover's Distance (EMD) [9, 21] to capture the distance between the 2D distributions of reconstructed and real locations. It has been previously used in location privacy as a measure of t-closeness [34] and l-diversity [38]. It is defined as:

    EMD(P, Q) = inf_{γ ∈ Π(P,Q)} E_{(x,y)∼γ} [ ||x − y|| ],

where Π(P, Q) is the set of all joint distributions whose marginals are P (true locations) and Q (reconstructed locations). EMD takes into account the spatial correlations and returns the minimum cost required to convert one probability distribution into another by moving probability mass. We use the Euclidean distance when calculating EMD on GPS coordinates in UTM. We compute EMD using Monte Carlo approximations with N = 1000 projections, which is more computationally efficient than the exact EMD calculation and is suitable for 2D distributions.

The range of EMD values depends on the dataset and the spatial area. EMD = 0 would mean that the distributions of real and reconstructed locations are identical. Low EMD values indicate successful location reconstruction by the DLG attack, and thus high privacy loss. To get more intuition, let us revisit Fig. 4. In Fig. 4(a), the DLG attacker reconstructed locations around the frequently visited locations (close to home) and achieved EMD = 5.3. In Fig. 4(b), we show the same number of locations chosen uniformly at random, which leads to EMD = 21.33; this provides an upper bound on privacy (random guesses by the attacker) in this scenario.
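The projection-based EMD approximation above can be sketched with the POT library [21] (whose sliced Wasserstein routine implements exactly this kind of Monte Carlo approximation [9]); the function name and sanity check are illustrative.

    import numpy as np
    import ot  # POT: Python Optimal Transport [21]

    def emd_privacy(real_xy, reconstructed_xy, n_projections=1000, seed=0):
        """Sliced (Monte Carlo) approximation of the EMD between the 2D
        distributions of real and reconstructed locations (UTM coordinates,
        Euclidean ground distance). Lower EMD = higher privacy loss."""
        return float(ot.sliced_wasserstein_distance(
            np.asarray(real_xy), np.asarray(reconstructed_xy),
            n_projections=n_projections, seed=seed))

    # Sanity check: identical location sets give EMD close to 0.
    pts = np.random.default_rng(0).normal(size=(100, 2))
    print(emd_privacy(pts, pts))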
% Attack Divergence. In our simulations, we observed that: (i) if the DLG attack converges, it converges to x̄ regardless of the initialization; (ii) however, the DLG attack did not always converge, depending on the location variance of the batch and the tuning of the parameters of the DLG optimizer and FedAvg. Examples of attack divergence are shown in Fig. 7a. In Sec. 4, we define a rectangular geographical area of interest for the attacker, e.g., the entire 3 km² campus in the Campus Dataset. If some reconstructed locations are outside the boundaries, we declare them "diverged" and (i) discard them when computing the privacy (EMD) metric, but also (ii) report the fraction of attacks that diverged. In practice, if an attack diverges outside the area of interest, the attacker can relaunch the DLG attack with different initializations until it converges to x̄. This, however, is costly for the attacker; therefore the % of attacks that diverged is another metric of the success or failure of the DLG attack.

Next, we evaluate the DLG attack in a range of scenarios. In Section 4.1, we consider FedSGD, which is the most favorable scenario for DLG, i.e., the strongest attack. In Section 4.2, we show that the averaging inherent in FedAvg provides a moderate level of protection against DLG, while also improving utility. In Section 4.3, we propose a simple defense that users can apply locally: Diverse Batch curates local batches with high variance. In Section 4.4, we show the effect of multiple users on the success of the DLG attack. We use the Campus Dataset and the Radiocells Dataset; we partition them into time intervals T to simulate measurements arriving over time, and we tune the DNN architecture and hyper-parameters to maximize utility (i.e., minimize prediction error). We use a default η = 0.001 (or η = 10^−5 in the case of mini-batches); details on tuning η are deferred to Appendix A.3. W.l.o.g., we focus on a particular target, whose locations are to be reconstructed by the DLG attacker, and we report the privacy (EMD, % diverged attacks) vs. utility (RMSE) tradeoff.

Strongest Attack. FedSGD is a special case of Algorithm 1 with B = ∞, E = 1: in each round t, the target user performs a single SGD step on their data D_t^k and sends the local model parameters w_t^k to the server, which in turn computes the gradient (line 10 in Algorithm 1). ∇w_t^k corresponds to the true gradient obtained on the data D_t^k, and this is the best scenario for the attacker.

Impact of time interval T. Fig. 4a shows the true vs. the reconstructed locations via DLG for intervals with T = 1h. Fig. 8 shows the results for the same data but divided into intervals with duration longer than 1h, i.e., 24h and 1 week. The shorter the interval, the better the reconstruction of locations: we can confirm this visually and quantitatively via EMD, while the utility (RMSE) is not affected significantly. This is in agreement with Insight I1, since smaller T leads to smaller batch variance in the target's trajectory. However, the most visited locations of the user (i.e., home and work) are successfully reconstructed for all T; this is due to the mobility pattern of the target, who repeats their home-work trajectory over time.

Impact of DLG Initialization. First, we consider a single FL round and evaluate the effect of the DLG initialization. For example, consider LocalBatch to consist of the measurements of the target from week t = 7 (D_7^0) in the Campus Dataset, and perform one local SGD step to train the local model.
The global model is initialized to the same random weights before local training. The attacker splits the geographical area into a grid with 350-meter cells and uses the center of each grid cell as a candidate initial dummy point in Algorithm 2. Fig. 6 shows the results for 20 random initializations. We observe that, for all initializations, the attack converges (cosine loss < −0.9988) to the average location x̄ of the local data, regardless of the distance between the initialization and the centroid of the data. This is explained by Insight I2 in Sec. 2.2: the gradient provides information for the attack, since the model has not converged.

Impact of Number of FL Rounds. Second, we repeat the same experiment, but with the goal of evaluating the effect of multiple FL rounds, everything else staying the same. To that end, we consider the LocalBatch data to be the same for all rounds, D_t^0 = D_7^0, t = 0, ..., 20 (and the same as in the previous experiment); this removes the effect of local data changing over time in an online fashion. We also use the exact same 20 random initializations as above. The global model is now updated in each round and iterates with the target as in Algorithm 1. The norm of the flattened per-layer weight matrices approaches zero after round 9 (see Fig. 7d), and at this point the DLG attack starts diverging: the reconstructed point moves further away from the mean of the data, as shown in Fig. 7c, and the final cosine loss starts increasing (Fig. 7b). In the worst case, when the attack diverges, the reconstructed point is 1 km away from the centroid location of the batch. This can be explained by Insight I2 in Sec. 2.2: after several rounds, the model starts converging and the gradients decrease; the average gradient goes to 0, the bound in Theorem 1 goes to ∞, and x_DLG can end up far from x̄.

(Figure 7: The local data are the same from one round to the next, and the same as the batch used in Fig. 6. The attacker uses the same 20 random initializations as in the single-round case. The effect of multiple FL rounds is that (b) the gradients converge to zero, which results in (c) higher cosine loss for the DLG attacker, and eventually (d) the reconstructed points x_DLG end up further away from the centroid x̄.)

Thus, even in the worst scenario of FedSGD, without any add-on averaging or defense mechanisms, there is some protection against the DLG attack after the initial FL rounds, once the global model converges.

DLG Initialization Strategies. There are different strategies for initializing the dummy points in each batch. (i) The attacker could pick randomly within the geographical area of interest, as we did in Fig. 6, or use the middle of the campus. (ii) The attacker could use a rough estimate of x̄ plus Gaussian noise. (iii) The attacker could leverage the reconstructed location from a previous round to initialize the dummy point in the next round, in order to exploit the continuity of user mobility and make an educated guess, especially in the online setting.

FedAvg is the general Algorithm 1, of which FedSGD is a special case with B = ∞, E = 1. FedAvg [39] has many parameters that control the computation, communication and storage at the users and server. The learning rate η is tuned for each dataset, see Appendix A.2; the number of FL rounds R was discussed previously for FedSGD; the fraction C of users per round relates to the global averaging discussed in Section 4.4.
Our focus here is to evaluate the effect of local averaging, through the use of B-sized mini-batches and E epochs, as a way to defend against DLG attacks. Intuitively, the more local SGD steps (smaller B and higher E), the more averaging over local gradients, and the less successful the DLG attack by the server based on the observed w_t^k − w_{t−1}^k. Interestingly, more averaging improves both convergence and utility [39]. Throughout this section, we focus on T = 1 week, which gives 11 intervals.

Impact of Mini-batch Size B in FedAvg. Fig. 9 shows the utility (RMSE) and privacy (EMD) metrics when the target splits its local data into mini-batches of size B and performs one SGD step per mini-batch. At one extreme, B = ∞, the entire LocalBatch is treated as one mini-batch, and this becomes FedSGD. At the other extreme, B = 1, there is one SGD step per local data point, which maximizes privacy. Smaller B values decrease RMSE, especially for lower η, but for the default η = 0.001 the RMSE is not significantly affected. In addition to EMD, we show in Fig. 9c the fraction of diverged DLG attacks. As B decreases, the fraction of diverged attacks increases (which makes the attack less accurate), due to the increased number of gradient descent steps. It also makes the attack more expensive.

(Figure 9: [T = 1 week, E = 1.] Reducing the mini-batch size B introduces more averaging of the gradients, which increases EMD (and privacy) and makes the attack more expensive due to divergence. The default η seems to be less sensitive to B in terms of RMSE. Here B = 1000 corresponds to B = ∞, thus FedSGD, which leads to reduced privacy but also to higher RMSE for the lower η.)

We choose B = 20 (3rd marker in Fig. 9) and the lower η = 10^−5, in order to get some privacy protection (EMD increases slightly) and to maximize utility.

Impact of Local Epochs E. Another parameter of FedAvg that affects the number of SGD steps is the number of epochs E, i.e., the number of local passes over the dataset during local training. We set B = 20, based on the previous experiment, and evaluate the impact of local epochs for two learning rates η. Fig. 10 shows that increasing E increases privacy, both in terms of EMD and divergence. It also improves utility (not shown): it can reduce RMSE from 6.25 to 4.75 dBm.

Putting it together. We choose E = 5 and B = 20, which together provide improved privacy and utility. In summary, the averaging in FedAvg provides some moderate protection against DLG attacks, which is also consistent with Insight I3. However, even with these parameters, the frequently visited locations (i.e., home and work) of the target can still be revealed (e.g., see Fig. 11a), which motivated us to design the following algorithm for further improvement.

Intuition. So far, we have considered that every user, including the target, processes all the local data in the order they arrive during round t, i.e., in line 16 of Algorithm 1, LocalBatch = D_t^target. The local averaging in FedAvg prevents the server from obtaining the real gradient, thus providing some protection. We can do better by exploiting key observation O1, supported analytically by insight I1: the variance of the locations in a batch affects how far the reconstructed x_DLG is from the batch centroid x̄. If the target preprocesses the data to pick a subset LocalBatch ⊆ D_t^target such that the selected locations have high variance, then we can force the DLG attack to have high |x_DLG − x̄| and possibly even diverge.

Diverse Batch Algorithm. There are many ways to achieve the aforementioned goal.
We designed Diverse Batch to maximize the variance of the locations in LocalBatch, using DBSCAN clustering. In each FL round t, the target does the following at line 16 of Algorithm 1:

1. The data D_t^target that arrived during that round are considered candidates for inclusion in LocalBatch.
2. Apply DBSCAN on those points and identify the clusters.
3. Select one point from each cluster and add it to LocalBatch; this intuitively increases variance.
4. If more datapoints are needed, remove the selected points from D_t^target and repeat steps 1-2 recursively on the remaining data, until the desired LocalBatch size is reached.

We refer to this selection of LocalBatch as Diverse Batch, and it is applied at line 16 of Algorithm 1. After that, FedAvg continues as usual, potentially using mini-batches and multiple epochs (a minimal code sketch of the selection step is given at the end of this subsection). Appendix A.3 provides more details.

Data Minimization. In our implementation of Diverse Batch, the target applies step 3 above exactly once and skips step 4. As a result, Diverse Batch uses significantly fewer points, and thus has a data minimization effect, in addition to high variance. For example, with eps = 0.05 km, Diverse Batch achieves EMD = 15.23 using 1% of the locations, whereas a random sampling of 1% of the points leads to EMD = 7.6; see Fig. 11(b) vs. 11(c), as well as Table 2.

Tuning Diverse Batch. First, we tune the learning rate η = 0.001 via a grid search; see the appendix for details. An important parameter of DBSCAN is eps, which controls the maximum distance between points in a cluster: higher eps leads to fewer clusters, i.e., fewer datapoints chosen for LocalBatch, but higher variance in the coordinates of the points in LocalBatch. Table 2 shows the performance of Diverse Batch for various eps values. We make the following observations. First, as expected, increasing eps decreases the number of training points and results in smaller batches, due to fewer clusters. Second, without averaging, the RMSE increases with eps; with B = 20, E = 5, the RMSE is not affected by eps. On the other hand, EMD increases for larger eps in both cases, but with averaging the EMD is higher overall. The percentage of diverged attacks increases with eps, which makes the attack more expensive and less accurate. Third, we consider two baselines: (i) the same number of points as DBSCAN in each round, but randomly selected; (ii) FedAvg with B = 20, E = 5 but LocalBatch = D_t^target: the EMD remains low regardless of eps, since the batch variance is not controlled.

Performance of FedAvg with Diverse Batch. Fig. 11 shows that adding Diverse Batch to FedAvg (B = 20, E = 5) significantly improves privacy, from EMD = 9.7 to EMD = 15.23. Random sampling of the same number of points per batch performs worse (EMD = 6.6), because it preserves the spatial correlation in trajectories. This indicates that the main benefit of Diverse Batch is controlling the batch location variance, rather than data minimization. In Fig. 12, we compare all methods (in order of increasing privacy: FedSGD, FedAvg with mini-batches and epochs, and Diverse Batch) w.r.t. both privacy (EMD, divergence, distance from centroid) and utility (RMSE). (a) EMD starts at 7.5 in FedSGD, increases in FedAvg, and almost doubles (15) with Diverse Batch. (b) In the FedSGD attack, there is almost no divergence, while with Diverse Batch more than 60% of the attacks diverge. Divergence makes the DLG attack more expensive, since the attacker needs to relaunch the attack with other initializations. (c) We also report the distance between the reconstructed location and the centroid of the batch, |x_DLG − x̄|: in FedSGD, this is less than 30 meters, while with Diverse Batch it increases to more than 350 meters.
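The one-shot selection of Diverse Batch (steps 1-3, as in our implementation) can be sketched as follows, using scikit-learn's DBSCAN. The min_samples setting and the choice of the first point as each cluster's representative are illustrative simplifications, and the coordinates are assumed to be projected to the same units as eps (e.g., km).

    import numpy as np
    from sklearn.cluster import DBSCAN

    def diverse_batch(coords, eps=0.05):
        """One-shot Diverse Batch selection: cluster the round's locations
        with DBSCAN and keep one representative point per cluster."""
        labels = DBSCAN(eps=eps, min_samples=1).fit_predict(coords)
        keep = [np.flatnonzero(labels == c)[0] for c in np.unique(labels)]
        return np.asarray(keep)  # indices into D_t^k forming LocalBatch

    # Example: larger eps -> fewer clusters -> fewer, more spread-out points.
    xy = np.random.default_rng(0).uniform(size=(500, 2))
    print(len(diverse_batch(xy, eps=0.05)), len(diverse_batch(xy, eps=0.2)))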
(Table 2: Performance of Diverse Batch. Parameters: T = 1 week; DBSCAN is run once in each round; η = 0.001, dropout = 0.05. When two numbers are reported (X/Y), they correspond to FedSGD and FedAvg (B = 20, E = 5), respectively. For each value of the main DBSCAN parameter eps, we report the following metrics. Since Diverse Batch picks a LocalBatch of different size in every round, we report the average batch size. Since it picks a subset of all the data, LocalBatch ⊂ D_t^k, we report the % of datapoints chosen. The utility (RMSE) is not significantly affected by eps, but is improved by FedAvg, as expected. For privacy, we report the EMD between reconstructed and real locations, the % of diverged attacks, and the average distance |x_DLG − x̄| in meters. In general, as eps increases, privacy increases. In the last column, as a baseline for comparison, we report the utility and privacy if the same number of points as in column 3 are picked uniformly at random: the EMD is approx. half.)

Throughout this paper, we have focused on the target user exchanging local (w_t^target) and global (w_t) model parameter updates with the server, for t = 1, ..., R. When more users participate in FL in addition to the target, they contribute their updates, and there is also global averaging of the gradients across users to produce w_t (line 12 in Algorithm 1). The data used for updates across rounds may then differ, depending on how similar the users' trajectories are. As discussed in (I4) in Sec. 2.2, when the diversity of the data (in this case, across rounds) increases, convergence slows down and the gradient magnitude |ḡ| remains large for more rounds, which makes the DLG attack more successful. Here, we evaluate the performance of all previous methods (FedSGD, FedAvg, FedAvg with Diverse Batch) under the DLG attack, still from the perspective of a single target user, but considering the presence of multiple users updating the global model.

Creating Multiple Synthetic Users. We use the London subset of the Radiocells Dataset. As described in Sec. 3.1, we use upload files to create "users", but we also consider the cell id, because we train a DNN model for signal strength per cell. In addition, we merge files with the same device information as follows: if the start and end locations of two upload files are within a certain distance Z, then we assign them to the same "synthetic user". The intuition is that users typically repeat their trajectories across days, i.e., a user starts/ends logging data from/to the same places, such as home or work. By setting Z = 1 mile, we obtain 933 potential synthetic users. We select five of them (users 1-5), each with a sufficiently high number of measurements per week, spanning several weeks. We merge all remaining upload files of the dataset (1059 out of 1351) into one "background" user 0. The entire dataset (partitioned into synthetic users 0-5) is depicted in Fig. 13. We confirmed, both visually and via pair-wise EMD similarity, that users 1-4 are highly similar to each other, and dissimilar to users 0 and 5. We pick user 3 as the target, since they had the most measurements.
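A greedy sketch of the merging heuristic just described is shown below. The record structure, field names, and the greedy grouping order are illustrative assumptions, not the exact implementation.

    import numpy as np

    def merge_upload_files(files, z=1.0):
        """Merge upload files with the same device info whose start and end
        locations are within distance Z (here in miles) into synthetic users.
        `files` is a list of dicts with keys 'device', 'start_xy', 'end_xy'."""
        users = []                                   # each user: a list of files
        for f in files:
            for u in users:
                r = u[0]
                if (f["device"] == r["device"]
                        and np.linalg.norm(np.subtract(f["start_xy"], r["start_xy"])) <= z
                        and np.linalg.norm(np.subtract(f["end_xy"], r["end_xy"])) <= z):
                    u.append(f)
                    break
            else:                                    # no match: start a new user
                users.append([f])
        return users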
Impact of Additional Users. We simulate DLG on FL with the target and additional synthetic users participating, and we report the results in Table 3. First, we consider the target (user 3) alone: the DLG attack achieves EMD 17.08 under FedSGD, 24.13 under FedAvg, and up to 26.9 for FedAvg with Diverse Batch and eps = 0.01; the RMSE is roughly the same across all FedAvg configurations and better than under FedSGD, as expected. Next, we consider that the background user 0 joins. Adding Diverse Batch to FedAvg offers significant protection to the target: for eps = 0.005, one third of the runs resulted in 100% divergence. Increasing eps to 0.01 resulted in two thirds of the runs with 100% divergence, and the remaining runs had EMD = 26.9 and RMSE = 5.44. In the multi-user case, it is sufficient for the target to apply Diverse Batch locally to get privacy protection, e.g., EMD = 31.33 and 96% divergence in the case of users (3, 5). This effect is amplified when users are dissimilar, as for users (3, 5), rather than for the background pair (3, 0).

Signal Maps Prediction Framework. There has been significant interest in signal map prediction based on a limited number of spatio-temporal cellular measurements. Approaches include propagation models [12, 49], data-driven approaches [13, 20, 25], and combinations thereof [43]. Increasingly sophisticated machine learning models are being developed to capture various spatial, temporal and other characteristics of signal strength [4, 19, 45] and throughput [47, 50]. The problem has so far been considered only in a centralized, not a distributed, setting. To the best of our knowledge, this paper is the first to consider signal map prediction (i) in the FL framework, (ii) with online learning in the case of streaming data, and (iii) under a DLG attack specifically in this setting. Online federated learning is an emerging area [14, 15, 37]. To the best of our knowledge, existing work in the online setting does not yet consider privacy leakage, but instead focuses on convergence and device heterogeneity. Work in other communities [31] considers the case of online location data, although not in an FL setting and with different goals (e.g., location prediction), where the utility lies in the location itself.

Location Privacy. Numerous works have evaluated the location privacy of trajectory data, e.g., [7, 16, 17, 28, 44], where the utility of the dataset lies in the location itself. In this work, we focus on location privacy in mobile crowdsourcing systems (MCS), similarly to [11, 44]. As [11] pointed out, an important difference is that the utility in MCS does not lie in the location itself, but in the measurement associated with that location. In our case, the measurement is RSRP, and location is only a feature in an ML model. However, location is the primary feature needed to predict signal strength: features other than location and time (such as frequency, device information, environment, etc.) bring only incremental benefit [4].

Privacy Attacks in Federated Learning. There has been a significant amount of prior work in this area, such as [6, 23, 26, 40, 51, 52], to mention just a few representatives. In this paper, we purposely consider the basic FedAvg algorithm, without adding well-known, but often expensive, defense mechanisms based on DP [18, 41] or SA [8]. In this particular setting, we capitalize on our key observations (O1, O2) and on analytical insight I1 (on the effect of batch variance) to design a novel pre-processing step (Diverse Batch) that can be applied locally to create local batches with high location variance, based on clustering, and thus offers protection against the DLG attack.
Further protection can always be obtained by applying DP or SA on top, but this is an orthogonal direction.

Reconstruction attacks based on gradients. It has been shown that observing gradients [51, 52] (as in FedSGD) or model parameters [23] (as in FedAvg) can enable reconstruction of the local training data. Such attacks have been demonstrated in the past in other contexts, mostly for reconstructing image or text training data, with a few exceptions such as [6], which inferred a user's browsing history. First, "deep leakage from gradients" (DLG) [52] reconstructed training data (images and text) and their corresponding labels from observing a single gradient during the training of DNN image classifiers, without the need for additional models (e.g., GANs [26]) or side information. [52] discussed potential defenses, including the tuning of training parameters such as mini-batch size, which we incorporate in our setting. The closest work to our design of the DLG attack is the work on "inverting gradients in federated learning" [23], which modified the DLG attack to use a cosine-based distance instead of the Euclidean one, and also evaluated, for the first time, DLG against FedAvg and the impact on the attack of gradients averaged over local epochs. In contrast to prior work, our DLG attack reconstructs one location (the average) from a gradient update computed on the N locations in a data batch, as opposed to all N data points in the batch.

Summary. We demonstrated, for the first time, a DLG attack on location, in the context of online federated learning for cellular signal maps. In contrast to state-of-the-art DLG attacks on images, we showed that DLG in our setting can reconstruct the centroid location of each data batch. We showed that the averaging mechanisms of FL provide some level of protection. We proposed a new algorithm (Diverse Batch) that users can apply locally to create high-variance batches and further protect themselves. Ultimately, the success of a DLG attack depends not only on the FedAvg configuration, but also on the input data, i.e., the characteristics of the target user's trajectory and its similarity to the trajectories of other participating users. Throughout the paper, we leveraged insights from both analysis and real-world cellular signal strength datasets.

Future Directions. (1) Signal maps are only one application of DLG attacks on online federated learning. The same methodology can be applied to other learning tasks that rely on crowdsourced measurements collected in an online fashion. (2) In this paper, we treated DLG attacks across rounds independently of each other. However, due to the characteristics of human trajectories, there are more dependencies and opportunities to explore in the design of DLG attacks and defenses; we only briefly touched upon this in the DLG initialization.

Assumption 1. The DNN model y_i = F(x_i, w) starts with a biased fully-connected layer, defined as y_i^1 = σ(w^1 x_i + b^1), where y_i^1 ∈ R^H is the output of the first layer with dimension H, and w^1 ∈ R^{H×2} and b^1 ∈ R^H represent the weight matrix and bias vector of the first layer.

Assumption 2. Given a data mini-batch of size B: Σ_{i=1}^B ∂ℓ(F(x_i, w), y_i)/∂b_h^1 ≠ 0 for some h, where b_h^1 is the h-th element of the bias vector of the first layer (i.e., b^1 = [b_1^1, ..., b_H^1]^T).

The first assumption states that the DNN model starts with a biased fully-connected layer. While the first layer in our DNN is a biased fully-connected layer with H = 224, there is a dropout layer which nullifies some elements (5% of them) and thus breaks the fully-connected assumption.
We are now ready to prove Lemma 1 in Sec. 2.2, which holds under the two assumptions stated above.

Proof of Lemma 1: Under Assumption 1, the first layer of the service model is a biased fully-connected layer, defined as y^1_i = σ(w^1 x_i + b^1). Based on the chain rule, we can calculate the following partial derivatives:

∂ℓ(F(x_i, w), y_i) / ∂w^1_h = (∂ℓ(F(x_i, w), y_i) / ∂y^1_{i,h}) · σ'(w^1_h x_i + b^1_h) · x_i^T,   (4)

∂ℓ(F(x_i, w), y_i) / ∂b^1_h = (∂ℓ(F(x_i, w), y_i) / ∂y^1_{i,h}) · σ'(w^1_h x_i + b^1_h),   (5)

where w^1_h is the h-th row of the weight matrix w^1 (note that w^1 ∈ R^{H×2}). Combining Eq. (4) and Eq. (5), we can derive:

∂ℓ(F(x_i, w), y_i) / ∂w^1_h = (∂ℓ(F(x_i, w), y_i) / ∂b^1_h) · x_i^T.   (6)

Note that the above steps are borrowed from the proof of Proposition D3.1 in [23]; the following steps are novel and specific to our application. Based on Eq. (6) and the definition of the DLG attack, the reconstructed user location x_DLG matches the gradient averaged over the mini-batch, i.e., it satisfies:

(1/B) Σ_{i=1}^B g_i(w) · x_DLG^T = (1/B) Σ_{i=1}^B g_i(w) · x_i^T,   (7)

where g_i(w) = ∂ℓ(F(x_i, w), y_i) / ∂b^1_h. By combining Eq. (6) and Eq. (7), we can further derive:

(Σ_{i=1}^B g_i(w)) · x_DLG^T = Σ_{i=1}^B g_i(w) · x_i^T.   (8)

Therefore, under Assumption 2, we can divide by Σ_{i=1}^B g_i(w) ≠ 0 and get:

x_DLG = (Σ_{i=1}^B g_i(w) · x_i) / (Σ_{i=1}^B g_i(w)),   (9)

i.e., the DLG attack reconstructs a weighted average of the user locations in the mini-batch.

A.1.2 Proof of Theorem 1

Subject to Assumptions 1 and 2, we are ready to prove Theorem 1 (see Sec. 2.2).

Proof of Theorem 1: Let x̄ = (1/B) Σ_{i=1}^B x_i denote the center of the user locations in the mini-batch. Using Eq. (9), the distance between the reconstructed user location and this center can be bounded as follows:

||x_DLG − x̄|| = ||(1 / Σ_{i=1}^B g_i(w)) · Σ_{i=1}^B g_i(w)(x_i − x̄)|| ≤ (Σ_{i=1}^B |g_i(w)| · ||x_i − x̄||) / |Σ_{i=1}^B g_i(w)|,

which yields the bound stated in Theorem 1.

A.1.3 Proof of Theorem 2

Assumption 3. The gradient function g_i(w) = ∂ℓ(F(x_i, w), y_i) / ∂b^1_h is Lipschitz continuous with respect to (x_i, y_i); that is, there exists a constant L > 0 such that |g_i(w) − g_j(w)| ≤ L ||(x_i, y_i) − (x_j, y_j)|| for all (x_i, y_i), (x_j, y_j). Note that, while the ReLU activation function violates this assumption, any smooth approximation of it would satisfy it [36].

Subject to Assumptions 1, 2 and 3, we are ready to prove Theorem 2 (see Sec. 2.2).

Proof of Theorem 2: Based on Assumption 3, we can derive that, for all (x_i, y_i), (x_j, y_j), the following inequality holds:

|g_i(w) − g_j(w)| ≤ L ||(x_i, y_i) − (x_j, y_j)||.   (12)

Since x_DLG − x̄ = (1 / Σ_{i=1}^B g_i(w)) · Σ_{i=1}^B g_i(w)(x_i − x̄), we can prove that:

||x_DLG − x̄|| ≤ (Σ_{i=1}^B |g_i(w)| · ||x_i − x̄||) / |Σ_{i=1}^B g_i(w)|.   (14)

Finally, combining Eq. (12) and Eq. (14), we can derive the bound stated in Theorem 2.

This section of the appendix extends Sec. 3.1.

Campus Dataset. Fig. 14a shows the number of measurements per user for each of the top three cell towers. Fig. 14b and Fig. 14c show, for cell x204, the number of batches per user and the distribution of the number of datapoints per batch for each batch granularity (T), respectively. For coarser T, we obtain fewer batches, but they contain significantly more datapoints: e.g., for cell x204 and user 0, the average batch size is 3492 for 1-week batches, 817 for 1-day, 205 for 3-hour, and 79 for 1-hour batches.

Radiocells Dataset. Fig. 17 shows the distribution of datapoints per batch for each user for T = 1 week. It is significantly sparser than the Campus Dataset; the average batch size per user is 50. Fig. 18 shows the pairwise user similarity, based on the EMD between the location distributions of each pair of users.
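The pairwise similarity in Fig. 18 can be computed along the following lines; this is a minimal sketch using POT ("Python optimal transport", which appears in our references). The uniform empirical weights, the Euclidean ground metric, and the use of raw (lat, lon) coordinates are illustrative assumptions, not necessarily our exact configuration.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def location_emd(locs_a, locs_b):
    """EMD between two users' empirical location distributions.

    Treats each user's (lat, lon) samples as a uniform empirical
    distribution and solves the exact optimal-transport problem under
    the pairwise Euclidean ground metric.
    """
    a = ot.unif(len(locs_a))                          # uniform weights
    b = ot.unif(len(locs_b))
    M = ot.dist(locs_a, locs_b, metric='euclidean')   # ground-cost matrix
    return ot.emd2(a, b, M)                           # transport cost (EMD)

# Toy usage with made-up trajectories for two users.
user_a = np.array([[33.640, -117.840], [33.650, -117.830], [33.640, -117.850]])
user_b = np.array([[33.680, -117.800], [33.690, -117.810]])
print(location_emd(user_a, user_b))
```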
This section of the appendix extends Sec. 4.

Campus Dataset, cell x355. Fig. 15 demonstrates the DLG attack for another cell tower (x355).

Tuning learning rates η for FedSGD. We performed a grid search over η ∈ {10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1} for two scenarios: FedSGD, and FedAvg with E = 1 and B = 20. To minimize RMSE, we choose η = 0.001 as our default learning rate. In the case of mini-batches, we observe that a lower η = 10^-5 can slightly lower the RMSE. Fig. 19 shows the corresponding η values for 1-week rounds with FedSGD, and for the case where the mini-batch size is B = 20. In Sec. 4.1, we chose η = 0.001 for FedSGD and set η = 10^-5 when B = 20, so as to minimize the RMSE, and then measured the privacy leakage.

Tuning the DBSCAN parameter in Diverse Batch. Fig. 20 shows the standard deviation of each coordinate and how it increases as we increase the eps parameter of DBSCAN: a higher eps yields fewer clusters, which in turn increases the location variance of the batch.

Tuning η in Diverse Batch. Fig. 21a shows how the selected η minimizes RMSE in the case of Diverse Batch. Fig. 21b shows the results for η = 10^-5, which increases RMSE, but the privacy patterns are similar to those obtained with the tuned η.

Cumulative Online FL. In Sec. 4, we considered online FL where users train locally on the data that becomes available in the current round and discard the local data from previous rounds. Here, we consider the other extreme, namely "Cumulative Online", where previously seen data are not discarded but re-used, in addition to the newly acquired batch of data, in each round. Training on all available data provides some level of protection from location leakage (although the important locations of a user are still revealed), at the cost of increased training time and storage on the device. Ultimately, how much "memory" to keep in the training data depends on those practical constraints and on the data itself; in our case, human mobility patterns lead to data repeating over days and weeks, so forgetting is acceptable. Moreover, if a user keeps re-using older data during local training, the RMSE might increase if that older data corresponds to outdated mobility patterns of the user. In the Campus Dataset, there are only eleven weeks of data, so the mobility patterns remain similar. For the above reasons, we proposed Diverse Batch, which achieves a similar RMSE and higher privacy (due to higher EMD values), while using only a small subset of the data in each round and without requiring storage of older data; a minimal sketch of the two regimes follows.
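The following sketch contrasts the two local-training regimes; the helper names are hypothetical, and the model, loss, and optimizer are placeholders, not our exact training configuration.

```python
import torch

def local_update(model, batches, lr=1e-3, epochs=1):
    """One round of local training over a list of (x, y) batches."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, y in batches:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

def online_round(model, new_batch):
    """Online FL (Sec. 4): train only on the newly arrived batch."""
    local_update(model, [new_batch])

history = []
def cumulative_round(model, new_batch):
    """Cumulative Online: keep a growing history and retrain on all of it,
    trading device storage and compute for a more diluted per-round update."""
    history.append(new_batch)
    local_update(model, history)

# Toy usage: two rounds of data for one user.
model = torch.nn.Sequential(torch.nn.Linear(2, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 1))
round1 = (torch.randn(20, 2), torch.randn(20, 1))
round2 = (torch.randn(20, 2), torch.randn(20, 1))
online_round(model, round1)
online_round(model, round2)   # round1's data has been forgotten
```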
Acknowledgments. We thank Emmanouil Alimpertis for providing access to the Campus Dataset. This work has been supported by NSF Awards 1900654, 1956393, 1901488, and 1956435.

References
Your Apps Know Where You Were Last Night, and They're Not Keeping It Secret
City-wide signal strength maps: Prediction with random forests
System for crowdsourcing signal strength maps on campus
FedPacket: A federated learning approach to mobile packet classification
Human mobility characterization from cellular network data
Practical secure aggregation for privacy-preserving machine learning
Sliced and Radon Wasserstein barycenters of measures
Large scale online learning. Advances in Neural Information Processing Systems
On (the lack of) location privacy in crowdsourcing applications
IST-4-027756 WINNER II D1.1.2 V1.2 WINNER II channel models
SpecSense: Crowdsensing for efficient querying of spectrum occupancy
Asynchronous online federated learning for edge devices with non-IID data
FLeet: Online federated learning via staleness awareness and performance prediction
Unique in the crowd: The privacy bounds of human mobility
Inferring individual daily activities from mobile phone traces: A Boston example. Environment and Planning B: Planning and Design
Differential privacy. Encyclopedia of Cryptography and Security
RAIK: Regional analysis with geodata and crowdsourcing to infer key performance indicators
Zip-Weave: Towards efficient and reliable measurement-based mobile coverage maps
Python optimal transport. JMLR
The 5G EVE multi-site experimental architecture and experimentation workflow
Inverting gradients - how easy is it to break privacy in federated learning? In NeurIPS
On the convergence of local descent methods in federated learning
Steering crowdsourced signal map construction via Bayesian compressive sensing
Deep models under the GAN: Information leakage from collaborative deep learning
Challenges in 5G: How to empower SON with big data for enabling 5G
Identifying important places in people's lives from cellular network data
Advances and open problems in federated learning
Spatial signal strength prediction using 3D maps and deep learning
Real-time location prediction with taxi-GPS data streams. Transportation Research Part C: Emerging Technologies
Hyperband: A novel bandit-based approach to hyperparameter optimization
Efficient mini-batch training for stochastic optimization
t-Closeness: Privacy beyond k-anonymity and l-diversity
Federated optimization in heterogeneous networks
On the loss landscape of adversarial training: Identifying challenges and how to overcome them
FedVision: An online visual object detection platform powered by federated learning
Enhancing privacy using location semantics in location-based services
Communication-efficient learning of deep networks from decentralized data
Exploiting unintended feature leakage in collaborative learning
Toward robustness and privacy in federated learning: Experimenting with local and central differential privacy
Practical radio environment mapping with geostatistics
Knock knock, who's there? Membership inference on aggregate location data
Localization of LTE measurement records with missing information
Dropout: A simple way to prevent neural networks from overfitting
Spatiotemporal modeling and prediction in cellular networks: A big data enabled deep learning approach
Accuracy characterization of cell tower localization
Ray tracing for radio propagation modeling: Principles and applications
Long-term mobile traffic forecasting using deep spatio-temporal neural networks
iDLG: Improved deep leakage from gradients
Deep leakage from gradients