Secure Federated Learning for Residential Short Term Load Forecasting
Joaquin Delgado Fernandez; Sergio Potenciano Menci; Charles Lee; Gilbert Fridgen
2021-11-17
The inclusion of intermittent and renewable energy sources has increased the importance of demand forecasting in power systems. Smart meters can play a critical role in demand forecasting due to the measurement granularity they provide. Despite their virtues, smart meters used for forecasting face several constraints: consumers' privacy concerns, the reluctance of utilities and vendors to share data with competitors or third parties, and regulatory constraints. This paper examines a collaborative machine learning method, federated learning extended with privacy-preserving techniques, for short-term demand forecasting using smart meter data as a solution to these constraints. The combination of privacy-preserving techniques and federated learning ensures consumers' confidentiality concerning their data, the models generated from it (Differential Privacy), and the communication channel (Secure Aggregation). To evaluate this paper's collaborative, secure federated learning setting, we review the current literature to select the baseline for our simulations and evaluation. We simulate and evaluate several scenarios that explore how traditional centralized approaches could be projected towards a decentralized, collaborative, and private system. The evaluations yield good performance and, in a privacy setting using differential privacy, near-perfect privacy budgets of $(1.39, 10^{-5})$ and $(2.01, 10^{-5})$ with a negligible performance compromise. Forecasting is necessary to ensure balance, maintain quality, secure electricity supply, and operate the power system at lower costs [1]. Amongst the various types of forecasting, load forecasting is crucial for energy actors, such as market agents, because it allows for a better understanding of consumption and pricing patterns [2]. Within load forecasting, short term load forecasting (STLF) focuses on the load side with a time window ranging from a few minutes or hours to one day ahead or a week [3]. STLF is vital for many operational processes in the power system, such as planning, operating, and scheduling [1]. Residential STLF aims to forecast electrical household consumption [kWh] and to assist, for example, market agents in tackling energy deviations. These energy deviations impact the energy price, which in turn has a direct impact on the electricity costs that customers face. Moreover, there is a recent shift towards increased residential electricity consumption due to electrification processes [4, 5] and the COVID-19 pandemic with its consequent increase in the adoption of remote work [6]. While COVID-19 is a historical exception, this general shift towards increased residential electricity consumption might not be an anomaly in the future. Therefore, the importance of load forecasting and its categories increases with demand, not only from an economic point of view but, as previously stated, also from an operational point of view. There are traditional methods for STLF, but these methods build on limiting assumptions.
Early techniques use statistical time series models relying on seasonal autoregressive integrated moving average (ARIMA) [7], exponential smoothing for double seasonality, or linear transfer functions. These techniques fall short because they assume linear data. Hence, there is a need for models capable of coping with non-linear dependencies, an area where Artificial Intelligence (AI) methods are gaining momentum and offer better performance [8, 9, 10, 11, 12]. Whether statistical or AI-based, forecasting techniques need data, and data scarcity drastically reduces their accuracy [13]. Residential STLF is not an exception, and there is already a solution that reduces data scarcity through digitalization: the push for advanced metering infrastructure (AMI) through smart meters increases the frequency and granularity of data collection [14]. As a result, STLF can utilize the available data aggregated from smart meters using centralized or decentralized solutions. Central solutions require transferring the data to a central system and consequently face a twofold problem. There are considerable privacy challenges concerning smart meter data use due to the sensitivity and correlatability of granular data. Data collected from smart meters installed in residences is granular enough that one can extract individual customers' behaviour and thus identify customers [15]. From a regulatory perspective, the transfer and aggregation of smart meter data is challenging under current regulatory regimes such as the EU's General Data Protection Regulation (GDPR), the framework introducing a set of guidelines for collecting and processing personal information from European citizens [16, 17]. The regulatory challenge increases as device ownership (who owns the device) impacts data ownership, and this varies even within the EU [18, 19]. Therefore, centralized collaborative approaches such as Belgium's Atrias [20] or Norway's Elhub [21], which provide so-called data lakes, are not possible in every market and jurisdiction. Decentralized approaches partly tackle the issues of centralized solutions (transfer and aggregation) by connecting different and distributed entities rather than creating a central data pool such as a data lake. A recent decentralized, collaborative approach for forecasting is Federated Learning (FL) [22, 23]. It offers a collaboration framework to share prediction models instead of raw data. However, FL as a standalone is not de-facto private. While FL addresses the transfer and aggregation issue, as data stay with their owners, it does not offer a viable solution to privacy concerns. For example, researchers have shown that it is possible to reconstruct the original raw data from the resulting models, both in the context of deep learning (DL) [24] and FL [25]. Therefore, a standalone FL implementation requires an extension to ensure the full privacy of connected entities. This extension comes from additional privacy-preserving techniques: differential privacy (DP) when computing model or gradient updates (the information needed for learning), and secure aggregation (SecAgg) when communicating them. Residential STLF can benefit from the decentralized collaborative approach offered by FL extended with privacy-preserving techniques. In the literature, DP and FL have each been tested in isolation for STLF [26, 27, 28, 29, 30, 31]. However, FL combined with privacy-preserving techniques has not been tested on residential STLF using smart meter data.
Therefore, this paper investigates this research gap driven by three main points. Firstly, to analyze residential STLF neural network (NN) models under distributed conditions (an FL setting), since they appear frequently in the literature and offer increased accuracy. Secondly, to analyze whether the inclusion of privacy-preserving techniques (DP, SecAgg) in FL implies a substantial drop in residential STLF forecasting accuracy, as these techniques perturb the exchanged information. Thirdly, to identify the main constraints for applying privacy-preserving techniques (DP, SecAgg) to FL in the context of residential STLF. The rest of this paper is structured as follows. Section 2 provides the related work on our main conceptual pillars: federated learning techniques and privacy-preserving techniques. Section 3 describes our application of privacy-preserving techniques for residential STLF using FL. The structure of this section follows the description of (1) our secure and non-secure FL settings and (2) how these FL models operate. Section 4 covers (1) the simulation environment, (2) the dataset, and (3) the evaluation metrics. Section 5 focuses on the evaluation of centralized NN models under a distributed condition, an FL setting with no privacy. Furthermore, this section evaluates our different FL settings for residential STLF through five scenarios. These scenarios enable us to compare the forecasting accuracy obtained. The scenarios cover non-secure (standard) FL, the impact of data correlation, a DL architecture different from the baseline model, and secure FL implementing two privacy-preserving techniques: DP and SecAgg. Finally, Section 6 provides an overview of our results and potential future directions for research. In most fields, AI has already proven its value, though the performance of models is highly dependent on the quantity and quality of data. Generally speaking, the challenge of designing high-performing AI is hindered by problems related to data fragmentation and isolation, mostly due to competitive pressure and tight regulatory frameworks (related to data privacy and security). The authors in [22, 23] proposed a fundamentally new method, FL. The main idea of FL is to allow the training of ML models between multiple disconnected clients without physically moving raw data or explicitly exposing local raw data to one another. In other words, FL allows competing clients (e.g., companies) to leverage each others' datasets without revealing their individual datasets. In doing so, models trained with FL can potentially obtain a more accurate forecasting output than models each client trains independently. To date, there are two different training approaches and three different configurations for the distribution of data and errors. The two main approaches to train FL models are federated stochastic gradient descent (Fed-SGD) and federated averaging (Fed-Avg) [22]. Although both rely on similar functionalities, there are differences between the two approaches. Fed-SGD works by averaging the clients' gradients (the direction of learning) at every step in the learning phase. A client can be thought of as one disconnected entity within FL. More specifically, clients locally compute the gradients of their loss (the difference between prediction and ground truth). The clients subsequently send their locally computed gradients to a central server. The central server aggregates the locally computed gradients by applying a (weighted) average over each client's update.
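To make this server-side aggregation step concrete, the following is a minimal NumPy sketch, not the authors' implementation; the function name and the uniform weighting are illustrative assumptions. The same averaging applies to Fed-SGD gradients and, as described next, to Fed-Avg weight updates:

```python
import numpy as np

def federated_average(client_updates, client_weights=None):
    """Weighted average of per-client updates (gradients in Fed-SGD,
    model weight deltas in Fed-Avg). Each update is a list of layer arrays."""
    n = len(client_updates)
    if client_weights is None:
        client_weights = [1.0 / n] * n  # uniform weighting assumed
    averaged = []
    for layer_idx in range(len(client_updates[0])):
        # Weighted sum of the same layer across all clients.
        layer = sum(w * upd[layer_idx]
                    for w, upd in zip(client_weights, client_updates))
        averaged.append(layer)
    return averaged

# Example: three clients, each holding a two-layer update.
updates = [[np.random.randn(4, 2), np.random.randn(2)] for _ in range(3)]
new_global = federated_average(updates)
```

In a weighted setting, `client_weights` would typically be proportional to each client's number of local samples, as in [22].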
In Fed-Avg, the central server averages the models' updates once all the clients have finished computing their local models. In other words, Fed-Avg modifies the Fed-SGD algorithm by letting each client compute its own model weights (based on its own gradients). Each client k trains its own model in parallel. In doing so, the impact is twofold: there is a reduction in the number of communication rounds (per batch in Fed-SGD versus per epoch in Fed-Avg) and an improvement in forecasting performance [22]. Each client uses the received model weights as its base model for the next iteration round. This is repeated until the end of the prescribed rounds. As mentioned above, Fed-SGD and Fed-Avg are two approaches to FL, but there are also multiple configurations. The configurations depend on how the feature space X, the label space Y, and the space formed by the identifiers I are distributed. Different setups of the triplet (X, Y, I) can be classified as Horizontal, Vertical, and Assisted Federated Learning [32]. Take for instance two clients i and j.
• Horizontal Federated Learning is when i and j share feature and label space but differ in identifiers: X_i = X_j and Y_i = Y_j, while I_i ≠ I_j.
• Vertical Federated Learning is when I_i = I_j, but X_i ≠ X_j and Y_i ≠ Y_j.
• Assisted Learning (AL) works through collided data between clients. The authors in [33] defined collision as clients holding the same data entries of a dataset D but differing in feature space. AL leverages the sharing of error terms: clients share errors with each other, and one client may use the errors of another for its own benefit to increase its training performance.
Regardless of the approach and configuration, FL is subject to moral hazard issues [34], so-called 'soft' attacks on the contextual integrity of the data shared between federated clients. The moral hazard issue arises because FL is by nature collaborative [35]. Multiple clients come together to train models iteratively using their respective data. Therefore, the involved clients must trust each other's data and behaviour to train and use the final model. Furthermore, clients do not exchange raw data with each other, but the information they do exchange consists of inferences upon raw data. FL as a standalone (non-secure) does not guarantee data privacy because of this information exchange between clients. In [36], researchers found a way to use gradient updates to retrieve the original raw data of a client. Such possibilities stand in conflict with requirements such as the European Union's GDPR. Therefore, FL requires additional complementary adjustments when applied in a realistic environment where the data privacy of clients is sought. There is a limited amount of literature that directly addresses the implementation of FL for STLF. For example, the authors in [30] measured the performance of an FL model under clustering. The results displayed are on average 10% better than centralized learning techniques. Building on their work, the authors in [37] applied k-means algorithms to group users according to socio-economic factors. The authors in [31] followed a similar approach to demonstrate the application of FL over a dataset of more than 200 households. The testing phase used a set of four scenarios in which the authors presented the utility of FL applied to STLF with handcrafted models. The number of clients analyzed in their scenarios ranged from 5 to 20, and the local epochs from 1 to 5.
The authors in [38] applied FL to a total of 5 smart meters, comparing the performance of models trained using two FL techniques and two different time resolutions. While the authors in [30, 31, 37, 38] use FL, they do not use any privacy techniques in their papers. Over the last years, the increase in data usage has led to new techniques aiming to extract every last drop out of "statistical information" [39]. This form of data extraction is often at odds with the subject's privacy. However, the privacy-preserving techniques that extend FL not only cover the privacy of statistical information but also the means of communicating it. Hence, within this section we describe DP as a way to protect an individual's information, and SecAgg as a mechanism to protect the communication channel and the exchange of updates among clients and a central server. The seminal work of [40] introduced differential privacy (DP) as a new method to counter adversarial attacks irrespective of auxiliary information. As described in [39], "differential privacy addresses the paradox of knowing nothing about an individual while learning useful information about a population." In other words, DP hides individual data trends using noise. Notably, [40] proposes the concept of epsilon differential privacy ($\epsilon$-DP), as follows: "For every pair of inputs x and y that differ in one row, for every output in S, an adversary should not be able to use the output in S to distinguish between any x and y". The privacy budget ($\epsilon$) determines how much of an individual's privacy a query may use, or to what extent it may increase the risk of breaching an individual's privacy. For instance, a value of $\epsilon = 0$ reflects perfect privacy, which means any analysis done will not affect an individual's privacy at all [41]. The authors in [42] extended the concept of $\epsilon$-DP to ($\epsilon$, $\delta$)-DP, where $\delta$ is the failure probability. DP is accomplished by adding random noise to the data query so that the absence of a single entry is obscured. Laplacian or Gaussian noise is the basis of this approach [39]. Finding an adequate trade-off between the noise and the utility of the resulting model is crucial and not a trivial endeavour. SecAgg uses a secure communication channel to perform secure training without the need for additive noise [43], thanks to cryptographic primitives. SecAgg is a secure multi-party computation (SMPC) protocol. The protocol allows a set of distributed, unknown clients to aggregate a value x without revealing the value to the rest of the participants. The backbone of SecAgg uses Shamir's t-out-of-n Secret Sharing, which enables a user to split a secret s into n shares [44]. To reconstruct the secret, at least t shares are needed; any collection of fewer than t shares provides no information about the original secret. Even though SecAgg provides a secure environment for the training of models, latent patterns could still point towards the original data owner. More specifically, Model Inversion (MI) attacks aim to reconstruct the original training data from the model parameters [45]. Therefore, under SecAgg settings, models are not safe from MI attacks.
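To illustrate the t-out-of-n property underlying SecAgg, below is a minimal sketch of Shamir's scheme [44] over a prime field; the field size and helper names are illustrative assumptions, not the SecAgg protocol itself:

```python
import random

PRIME = 2**61 - 1  # illustrative prime field

def share_secret(secret, n, t):
    """Split `secret` into n shares; any t of them reconstruct it.
    The secret is the constant term of a random degree-(t-1) polynomial."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t - 1)]
    # Share i is the polynomial evaluated at x = i (i > 0).
    return [(i, sum(c * pow(i, k, PRIME) for k, c in enumerate(coeffs)) % PRIME)
            for i in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    secret = 0
    for i, (x_i, y_i) in enumerate(shares):
        num, den = 1, 1
        for j, (x_j, _) in enumerate(shares):
            if i != j:
                num = (num * -x_j) % PRIME
                den = (den * (x_i - x_j)) % PRIME
        secret = (secret + y_i * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = share_secret(123456789, n=5, t=3)
assert reconstruct(shares[:3]) == 123456789  # any 3 of 5 shares suffice
```

In the full SecAgg protocol of [43], such shares protect the seeds of the random masks applied to model updates rather than the updates themselves, which is what allows dropped-out clients to be tolerated.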
In the context of hiding smart meter data, there is an ample variety of privacy-preserving methods in use. One general approach is to add Gaussian noise to each smart meter's data to prevent adversarial attacks [46]. Similarly, the authors in [47] add adaptive noise to Battery-based Load Hiding (BLH) values. Another approach is to look at temporally and spatially aggregated profiles from smart meters [48]. The authors in [49] execute DP with the addition of Laplacian noise to the aggregated dataset. These examples demonstrate the use of privacy-preserving methods to hide smart meter data. Within the STLF literature [27, 49, 47], authors consistently perturb the datasets by adding noise drawn from either a Gaussian or a Laplacian distribution. Even though the referenced papers use this technique, we could not find any mathematical proof that it ensures privacy. Hence, in this paper, we follow the proven approach proposed by [50], where Gaussian noise is added after every communication round in FL without modifying the original data. To account for the privacy loss during the training of our models, we followed the approach of [22]. Furthermore, to date the literature does not explore the addition of SecAgg for STLF. This section explores our secure and non-secure FL models. We provide an overview of the different settings necessary for such FL models. Furthermore, within this section, we explore how the secure and non-secure FL models operate by describing their steps. Our FL models predict the next hour's consumption [kWh] based on the consumption data of the previous 12 h. Our secure and non-secure FL models use the same FL algorithm. Between Fed-Avg and Fed-SGD, we selected Fed-Avg. This algorithm has advantages over Fed-SGD, such as the reduced number of communication rounds needed and a performance increase in STLF, as shown by [38]. Given the kind of problem we are solving, where clients share the feature space, the configuration is horizontal. The main difference between the secure and non-secure FL models is the addition of privacy-preserving techniques in the former. To implement DP as the privacy-preserving technique, we follow the steps of [50] rather than [27, 51], where noise is added to the dataset before training. This DP implementation avoids modifying the entire dataset and only obfuscates the models per query on the server side. The addition of DP in FL requires the definition of the function sensitivity (S) and a clipping strategy. The sensitivity of a query function is a crucial element in DP, as it determines the actuation range of the added noise. As defined by [39], it is the Euclidean distance between the outputs of the query over two datasets C differing in at most one element k. Considering the first lemma from [50] and assuming all clients are weighted equally, the sensitivity is bounded as S(f(C)) ≤ S/n, with n the number of clients. The vectors ∆k contain the different model updates computed by the clients. To bound the sensitivity, we need to keep the models' updates within a known range. A standard solution is to clip them by a defined value before averaging. There are two different strategies for clipping the values of a neural network: per-layer clipping and flat clipping [50]. In our case, to reduce complexity, we use flat clipping as ∆k = π(∆k, S), with S the overall clipping parameter. Both strategies rely on the same principle; by layer or by network, the values of the updates are projected onto an l2 ball with the norm determined by the clipping value. The first implementation of flat clipping clips values using a fixed norm, which may not be the best solution; the authors in [52] introduced an innovative strategy that adapts the clipping value based on a quantile of the update-norm distribution, known as adaptive clipping.
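As an illustration, the flat-clipping projection π(∆k, S) amounts to rescaling the flattened update onto an l2 ball of radius S; a minimal sketch, with function names assumed for illustration:

```python
import numpy as np

def flat_clip(update, clip_norm):
    """Project a model update (list of layer arrays) onto an l2 ball
    of radius `clip_norm`, treating the whole network as one vector."""
    flat = np.concatenate([layer.ravel() for layer in update])
    norm = np.linalg.norm(flat)
    # Rescale only if the update exceeds the clipping norm.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [layer * scale for layer in update]

update = [np.random.randn(4, 2), np.random.randn(2)]
clipped = flat_clip(update, clip_norm=0.4)  # e.g., S = 0.4
```

Per-layer clipping would apply the same projection to each layer separately, with one clipping norm per layer.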
Once the sensitivity is defined and bounded using a clipping strategy, we evaluate how the noise scales with the sensitivity to obtain a privacy guarantee. In our model, we add Gaussian noise as defined by N(0, σ²) with σ = z · S, where z is the noise scale and S is the sensitivity of the query. In each query, all rows are selected (q = 1 in the first theorem of [50]). The addition of noise determines the overall privacy protection provided by a DP analysis. This privacy protection ($\epsilon$) determines how much of an individual's privacy the system may use [41]. $\epsilon$ varies depending on the amount of noise added, favoring privacy with high noise and low $\epsilon$, or favoring utility with low noise and high $\epsilon$. To compute this value, we use the accountant provided by Rényi Differential Privacy (RDP) [53]. Additionally, we consider SecAgg [43] separately. It provides us with a security layer leveraging cryptographic primitives on both the client and server side. DP secures the models' updates through the addition of random noise, offering protection against any malicious agent reconstructing the original data from the model updates. However, it does not protect the communication channel between the clients and the central server, as SecAgg does. In SecAgg [43], the authors leverage the work of [54], allowing "a group of mutually distrustful parties u ∈ U each hold a private value x u and collaborate to compute an aggregate value". In SecAgg, the server will know that at least k users participated, but neither which users nor their individual contributions. SecAgg involves two main algorithms: sharing and reconstruction. The sharing algorithm transforms a secret into a set of shares of the secret associated with different clients. These shares follow [44]; hence, collusion between n − 1 participants is insufficient to disclose other clients' private information. The reconstruction algorithm works in the opposite direction: it takes the mentioned shares from the clients and reconstructs the secret. In other words, clients share their secrets (models' updates) with the server through a secure channel without the server being able to reconstruct the secrets. The central server forwards the received shares (encrypted with other clients' public keys). Each client, upon reception, masks the input and sends it back to the server. Finally, the central server asks for the shares of a client. It reconstructs the aggregated value (the secure sum of the different clients' models' updates) without knowing the individual secrets or the participating clients. In this subsection, we describe how our models operate in a standard FL, FL-DP, and FL-SecAgg setting. All three share six main steps to compute the forecast. In Figure 1 we illustrate the entire process with the additional step FL-DP requires. Furthermore, FL-SecAgg requires a set of cryptographic primitives to secure the communication between client and server. This requires further communication rounds for sharing the clients' secrets and public keys; Figure 1 does not include these additional communication rounds. Firstly, the central server has to select a baseline model architecture and initialize it using Glorot initialization [55]. We describe our baseline model selection in subsection 5.1. Secondly, the central server shares the initial model with the respective clients. Thirdly, each client starts training the model received from the server on its own data. Fourthly, clients send their updates (with respect to the initial model) to the central server after each epoch. Fifthly, the central server averages these models' updates and, in the case of DP, adds noise drawn from a Gaussian distribution (step 5' in Figure 1). Sixthly and finally, the server returns the model to the clients. The clients continue this training process, sending and receiving updates, until they reach their common goal.
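A minimal sketch of the server-side fifth step with the DP addition (step 5'), assuming equally weighted clients so that the sensitivity of the average is the clipping norm divided by n; this is an illustration of the mechanism from [50], not the TFF implementation:

```python
import numpy as np

def dp_average(clipped_updates, clip_norm, noise_scale):
    """Average clipped client updates and add Gaussian noise.
    For n equally weighted clients the sensitivity of the average is
    clip_norm / n, and sigma = noise_scale * sensitivity (sigma = z * S)."""
    n = len(clipped_updates)
    sigma = noise_scale * clip_norm / n
    noisy = []
    for layer_idx in range(len(clipped_updates[0])):
        mean = sum(upd[layer_idx] for upd in clipped_updates) / n
        noise = np.random.normal(0.0, sigma, size=mean.shape)
        noisy.append(mean + noise)
    return noisy
```

The privacy cost of repeating this query over 300 communication rounds is what the RDP accountant [53] tracks to report the final ($\epsilon$, $\delta$) budget.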
The simulations in this paper were conducted using the high performance computing (HPC) facilities of the University of Luxembourg [56] within the IRIS cluster. Depending on the availability of Graphics Processing Units (GPUs), the federations ran in an environment with 32 Intel Skylake cores and two NVIDIA Tesla V100 cards with 16 GB or 32 GB. The federation code is written in Python based on the framework provided by TensorFlow Federated (TFF), whereas the DL models are written in Keras [57]. Concerning the timeline, the simulations lasted around 800 hours distributed over June, July, and August 2021. Our evaluations use a combination of two datasets. The first dataset is from [58]. Its data belong to the Low Carbon London project, led by UK Power Networks between November 2011 and February 2014 in London, United Kingdom. It contains the electrical consumption [kWh] of 5567 households at half-hour resolution. This dataset also contains the Classification Of Residential Neighbourhoods (ACORN) [59]. The dataset is divided into individual household entries known as LCLids (Low Carbon London ids). Additionally, the second dataset is composed of daily and weekly weather profiles for the Greater London area. Consequently, all customers have the same weather profile, although their locations might differ within Greater London. The pipeline used for dataset treatment consists of three main steps. First, we modify the time window of our data. Initially, the data is in half-hour timestamps, and we downscale it to hourly data; this modification reduces the computational burden of our analysis. The downscaling sums two subsequent half-hours. Due to the short measurement timeframe, sensors might fail at certain measurements, so it is normal to have abnormal or null values; during this first step of the pipeline, we trimmed these values. Afterwards, we rescaled all variables to the same range using a standard scaler as in [8]. The rescaling is necessary to ease the FL learning process. Finally, we combine the datasets to create our own standard dataset for the analysis. Figure 2 provides an example of the pipeline result: the electricity consumption [kWh] of 5 randomly selected LCLids over a 2-day period at 1 h timestamps.
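The preprocessing pipeline could be sketched with pandas as follows; the file and column names are illustrative assumptions, not the project's actual layout:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed columns: 'LCLid', 'timestamp', 'kwh'.
df = pd.read_csv("low_carbon_london.csv", parse_dates=["timestamp"])

# Step 1: trim null/abnormal readings, then downscale half-hourly
# readings to hourly by summing two subsequent half-hours.
df = df.dropna(subset=["kwh"])
df = df[df["kwh"] >= 0]
hourly = (df.set_index("timestamp")
            .groupby("LCLid")["kwh"]
            .resample("1H").sum()
            .reset_index())

# Step 2: standard-scale the consumption values.
scaler = StandardScaler()
hourly["kwh_scaled"] = scaler.fit_transform(hourly[["kwh"]]).ravel()

# Step 3: merge with the (assumed) hourly weather dataset on timestamp.
weather = pd.read_csv("london_weather.csv", parse_dates=["timestamp"])
final = hourly.merge(weather, on="timestamp", how="left")
```

Keeping the fitted scaler is important: the same transform must be applied to the forecast targets so that the scaled metrics discussed next remain comparable.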
Metrics are an important element in the development and testing of forecasting models. Indeed, since FL models are known to converge to a middle point [60], some metrics could give a misleading picture of a model's performance. AI models optimize the prediction error with respect to the ground truth. In a mixed environment where there are many ground truths, models tend to minimize the mean of the loss across datasets. This tendency could lead an FL model to predict the average of each of the datasets and hence offer a promising mean squared error (MSE), Equation 1, and mean absolute error (MAE), Equation 2. These misleading measurements would provide, at a later stage, inaccurate results far from the actual performance of the model. Given this, in this paper we analyze not only the MSE and MAE but also the mean absolute percentage error (MAPE), Equation 4, and the root mean square error (RMSE), Equation 3, to increase the objectivity of the results. The formal equations, computed over n samples with ground truth $y_i$ and forecast $\hat{y}_i$, are as follows:
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ (1)
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$ (2)
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ (3)
$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$ (4)
Although we covered the secure and non-secure models in section 3, there is a crucial part left to cover: the underlying learning method the FL models will use, as it is the base of any AI model. Thus, we evaluate the performance of the different DL methods available in the literature (Table 1) and compare them using the metrics exposed in subsection 4.3. (A fully connected layer, as referenced in Table 1, is a layer where all neurons are connected to the previous layer's neurons.) This comparison allows for the later selection of the model our scenarios will use. The models were trained over 100 epochs using the dataset explained in subsection 4.2 (the models were implemented following their corresponding articles or the code the authors provided). From the methods above, we can observe two major trends: the recurrent use of the UCI dataset [69] and, above all, an increase in the depth of the neural networks over the years. Accordingly, Figure 3 illustrates the performance results of the models tested. Some models behave worse on our dataset than claimed by their authors. In [67, 68], the authors calculated the metrics on a non-scaled dataset, meaning that the transformation of the dataset prior to or after the computation of the metric could bias the results. For instance, the non-scaled MSE equals the scaled MSE multiplied by σ², with σ the standard deviation of the dataset prior to standard scaling; the same factor (σ) appears when measuring the RMSE. In other words, all calculated metrics have to be either scaled or non-scaled to offer a fair comparison. Subsequently, from now on we calculate all displayed metrics on scaled data to standardize our simulation and evaluation (section 4). The models from [61] and [8] behave similarly, although with a remarkable difference in the number of network parameters. Training FL models is an intricate and expensive endeavour. From now on, we use the models proposed by [61] and [8] as the baseline models. We design a set of scenarios to analyze the performance (metrics) of our FL settings applied to STLF. These scenarios allow us to assess the benefits of our FL settings. Firstly, we start with a non-secure (standard) FL setting using the baseline model of [61]. Secondly, we analyse the potential performance increase when imposing data correlation. Thirdly, we evaluate whether a different and bigger DL baseline model might or might not be beneficial, using the model of [8]. Lastly, we evaluate two different privacy-preserving techniques (DP and SecAgg) along with FL. In total, we designed five scenarios (lettered A to E), collected and summarized in Table 2. In each scenario, we run eight simulations. Each simulation assesses the behaviour of the models in settings where we increase the number of clients (federation size). The scaling process uses federation sizes of 2, 5, 8, 11, 14, 17, 20, and 23 clients. This scaling enables us to evaluate the models' performance as they grow in number of clients. Each client contains data from exactly one LCLid. We limit the number of clients (LCLids) to 23 due to the severe computational burden when training the federated models. For instance, the model trained with [8] as the baseline model needed around eight days to finish. Furthermore, the training of the models uses a fixed batch size of 100 and 300 communication rounds.
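As a reference for the scenario results that follow, Equations 1 to 4 translate directly into code; a minimal sketch on scaled data, with function names chosen for illustration:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)          # Equation 1

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))         # Equation 2

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))             # Equation 3

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y))   # Equation 4 (as a fraction)

y_true = np.array([1.2, 0.8, 1.5])
y_pred = np.array([1.0, 0.9, 1.4])
print(mse(y_true, y_pred), mape(y_true, y_pred))
```

Note that MAPE is reported here as a fraction, consistent with the 0.20 to 0.35 range discussed for Scenario A below, and that it is undefined when the ground truth contains zeros.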
Scenario A analyzes the scaling performance of FL using [61] as the baseline model. We use an FL architecture without any privacy-preserving techniques. Each client uses data from one random LCLid taken from the final dataset (subsection 4.2). Furthermore, in this scenario we impose no data correlation among the clients. Table 3 collects the metric results, expressed in absolute values, and the training time needed per round, expressed in seconds [s]. There is an improvement in the MAE and MAPE metrics when increasing the number of clients, as displayed in Table 3. These results are similar to those offered by [37, 38]. However, the MSE and RMSE metrics are almost constant. These results underline the importance of collecting several metrics when analyzing any forecasting model. Nonetheless, it is necessary to remark that although more clients imply more data points for the model, FL is not a categorical example of a pure correlation between performance and data size. Different clients might have data that drags down the performance of the individual models by moving them in opposite directions. Therefore, in FL, it is not about the amount of data but rather about the quality and similarity of data. Nevertheless, it is clear from our results that in our FL setting the computational time increases as the number of clients increases, potentially creating time constraints for our FL model. Additionally, Figure 4 shows the MAPE of the models over the training rounds for the eight simulations. We can observe a quasi-exponential decrease in the MAPE over 300 rounds. The spikes were investigated and are due to the data itself, where there is a significant difference in the consumption data input (batch). Nevertheless, throughout the 300 rounds, the MAPE obtained is between 0.20 and 0.35, which can be considered a reasonable forecast based on [70]. In Scenario B, we analyze the performance of a standard (non-secure) FL setting with an imposed correlation among the clients in the federation. Hence, we build on top of Scenario A (5.3). The method we use for the correlation is the Pearson correlation, as in [71]. We consider only data from specific ACORNs (H and L), serving as a correlation filter. Then, for each federation size (2, 5, 8, 11, 14, 17, 20, and 23 clients), we calculate all possible non-repeated combinations and their correlations, from which we select the combinations with the highest correlations. Similar to Scenario A (5.3), we collect the results in Table 4, displaying the metric values and the correlation rate, both expressed in absolute values. The computation time remains similar to that of Scenario A (5.3). The results are in line with [37], where the application of k-means to cluster customers offers a performance gain between 10% and 15%; they are also similar to those offered by [38]. Hence, just by using correlations among the data used, metrics tend to improve. From an energy point of view, the increase in performance can potentially reduce the imbalance costs caused by forecasting errors. From a general power system perspective, the forecasting accuracy increase is a positive outcome, as the system operator could also plan assets and calculate potential congestions better.
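The correlation-based client selection of Scenario B can be sketched as follows; a minimal illustration with pandas and itertools, where the data layout is an assumption and the exhaustive search is only tractable for small federation sizes:

```python
import itertools
import pandas as pd

def most_correlated_subset(consumption: pd.DataFrame, size: int):
    """Pick the subset of clients (columns) with the highest mean
    pairwise Pearson correlation. `consumption` has one column per
    LCLid and one row per hourly timestamp."""
    corr = consumption.corr(method="pearson")
    best, best_score = None, -1.0
    for combo in itertools.combinations(consumption.columns, size):
        pairs = [corr.loc[a, b] for a, b in itertools.combinations(combo, 2)]
        score = sum(pairs) / len(pairs)
        if score > best_score:
            best, best_score = combo, score
    return best, best_score
```

Filtering first by ACORN group, as the paper does, keeps the candidate pool small enough for this combinatorial search.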
In Scenario C, we explore how a bigger DL architecture, in terms of parameters, impacts the metrics. The motivation for Scenario C comes from our preliminary conceptual review exposed in subsection 5.1. We noticed that the model of [8] behaves similarly to [61], but carries almost five times the number of parameters, bringing more complexity to the model while opening the door to patterns invisible to the smaller model. Given its size and the potential computational burden, we implement three modifications. The first modification concerns the GPUs, where we modify the settings of each of the two NVIDIA Tesla cards allocated on the HPC. For each of them, we create two virtual cards, resulting in four cards for the FL model to train on. The second modification is the batch size, which we increase from 100 to 200. Ideally, the batch size increase should prevent overfitting, since there are more data entries available to compute the loss of the model. Finally, we modify the model proposed by [8]: we transform the initially proposed LSTM layers into CuDNNLSTM layers [72]. The transformation enables the LSTMs to use the Compute Unified Device Architecture (CUDA) kernels of our Tesla GPUs to reduce the computation time. As in the previous scenarios, we collect the metric results obtained from our simulations in Table 5 in absolute values, together with the average computational time expressed in seconds [s]. The results of Scenario C clearly show the computational burden of training a complex FL model: the computational time recorded is almost five times that of the previous scenarios. Concerning the metrics, there is a performance increase for only 8 out of 32 metrics when compared to Scenario A. These are the cases of federation size 2 (all metrics) and federation size 5 (MSE and RMSE), with a performance increase ranging from 13% up to 68%. Contrariwise, the performance decreases for the remaining 24 metrics, ranging from 0.39% up to 264%. These results point to a clear case of overfitting, where the FL model's performance dramatically decreases as the number of clients increases. Overfitting is usually defined as the lack of generalization of a model: an overfitted model has crossed the line between learning tendencies or patterns and memorizing the data received as input. Models in FL tend to converge to a middle point [60] where all the different clients find their local minimum. During Fed-Avg, the models are averaged at every communication round. Averaging models that have understood patterns results in new models that can devise shared patterns; when overfitted models are averaged, the result does not differ from blatant noise. Furthermore, when exploring the MAPE results over 300 rounds, depicted in Figure 5, the overfitting is clearly visible. The green line, two LCLids, shows a sharp slope within the first 40 rounds.
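The GPU modification of Scenario C can be reproduced with TensorFlow's logical-device API; a minimal sketch in which the memory split per virtual card is an assumption:

```python
import tensorflow as tf

# Split each physical Tesla V100 into two logical (virtual) GPUs so the
# federated simulation can schedule clients across four devices in total.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.set_logical_device_configuration(
        gpu,
        [tf.config.LogicalDeviceConfiguration(memory_limit=8192),  # MB, assumed
         tf.config.LogicalDeviceConfiguration(memory_limit=8192)])

print(tf.config.list_logical_devices("GPU"))  # expect four logical GPUs
```

This configuration must run before any TensorFlow operation touches the GPUs, since devices cannot be repartitioned once initialized.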
Scenario D focuses on implementing DP as a privacy-preserving technique and on how this impacts the FL model. We implement the fixed clipping and adaptive clipping techniques explained in section 3.1. We only consider one federation size, 17, in this scenario, since all federation sizes follow the same logic for implementing DP. The first technique implemented is fixed clipping, following the steps in [50]. There are two main steps to follow. The first is to determine the lowest possible clipping value (S). We treated S as a hyper-parameter for our model. Clipping could negatively affect the convergence rate of any model, as it clips all values bigger than S. The second step is to identify a tolerable level of noise for our simulations. These values enable us to compute the privacy guarantee and to know the privacy budget of our model. The identification of the lowest clipping value follows an iterative calculation, starting at S = 0.01 and ending at S = 0.5 in steps of 0.05, where the selection of the starting point follows the recommendations of [50]. We use the results of Scenario A (for federation size 17) as a benchmark to find a suitable clipping value. The idea is to select the lowest possible clipping value that still allows the model to converge, hence maximizing the standard deviation of the noise our model can handle. We collected the results for S in Table 6 and selected S = 0.4 as our fixed clipping value. Having identified the lowest clipping parameter, we can compute the standard deviation of the noise. With a clipping value S' = 0.4 and the expected number of clients qw = 17, we apply S = S'/qw to obtain the sensitivity, and σ = z · S for the standard deviation of the noise level. As in the previous step, we treat z as a hyper-parameter and proceed in iterations. We calculated the privacy guarantee using the Rényi Differential Privacy accountant [53]. Table 7 collects the metric results and the calculated privacy guarantees obtained in the iterative process, starting with σ = 0.023 (z = 1) and ending with σ = 0.72 (z = 32). We ran different simulations to explore how the addition of noise affects the performance of the FL models. Considering the values obtained in Table 8, the addition of DP distorts the values by adding noise, thus reducing the performance. Technically, when comparing the results in Table 8 with the previous results for the respective federation size (17, Scenario A, Table 3), the performance decreases on average. The decrease grows as the noise scale (z) increases, resulting at the end of our simulations in an average performance decrease of 27% in the metrics, mainly due to the metric results of z = 32. However, we should contextualize these results. We can consider the measurements obtained as accurate results in themselves: the metrics are relatively low, displaying a decent forecasting performance, although with a higher error than Scenario A (no DP). Yet, it is necessary to remark that DP provides a privacy guarantee of (1.39, 10^{-5})-DP, where the lower the score, the better. Our privacy results are close to perfect privacy ($\epsilon$ = 1). The second technique implemented in our analysis is adaptive clipping. For our FL model, we consider the implementation in [52], where the algorithm iteratively adjusts the clipping norm, trying to approximate it to a fixed quantile of the update-norm distribution. Hence, contrary to fixed clipping, there is no need to search for the lowest clipping value and noise: the clipping value adapts per round. Figure 6 represents the evolution and adjustment of the clipping value over the training rounds. The spike at the very beginning is due to the low initial clipping value C_0 = 0.1. Such a low value means that few data points participate in selecting the following clipping values during the initial rounds. The fewer the data points, the more difficult it is to estimate the optimal value. Consequently, the size of the steps required to make this estimate increases exponentially until it reaches the real quantile, resulting in an even steeper growth.
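A minimal sketch of the quantile-based rule of [52], omitting the noise the full algorithm adds to the quantile estimate itself; the target quantile γ, the learning rate η, and the helper names are illustrative assumptions:

```python
import numpy as np

def adapt_clip(clip, update_norms, gamma=0.5, eta=0.2):
    """One round of adaptive clipping via a geometric update: shrink the
    clip norm if more than a fraction `gamma` of the client updates
    already fall below it, and grow it otherwise."""
    b = np.mean([norm <= clip for norm in update_norms])  # fraction within clip
    return clip * np.exp(-eta * (b - gamma))

clip = 0.1  # low initial value C0, as in Figure 6
for _ in range(5):  # a few illustrative rounds
    norms = np.abs(np.random.randn(17)) + 0.3  # mock client update norms
    clip = adapt_clip(clip, norms)
    print(round(clip, 3))
```

The multiplicative form explains the exponential growth visible in Figure 6: while the estimated fraction stays far from the target quantile, each round scales the clip norm by a roughly constant factor.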
We collect the metrics obtained with adaptive clipping in Table 9. Similar to the previous results with fixed clipping, there is a decrease in performance compared to Scenario A, although at around 21% it is lower than for fixed clipping. Hence, adaptive clipping does increase the performance compared to fixed clipping. Nonetheless, the same contextualization as for fixed clipping applies: the results are technically worse than an FL setting with no DP, yet we can justify the performance decrease because privacy is guaranteed. The privacy guarantee obtained in this scenario is (2.01, 10^{-5})-DP. Even though it is slightly worse than the one obtained for fixed clipping, it is still a remarkable privacy guarantee. Although fixed and adaptive clipping offer similar results, fixed clipping requires an earlier step to find an appropriate clipping value. This step delays the introduction of privacy. In our simulations, we can calculate the absolute best-fitting clipping value because we hold all the data; in a real configuration, each client would need to calculate their own clipping value, and these would subsequently be aggregated into a final clipping value. Scenario E implements SecAgg. Figure 7 depicts the error decrease in terms of MAPE, which proceeds almost in a logarithmic manner and stabilizes at the end of the 300 rounds. Concerning the results expressed in Table 10, the application of SecAgg affects the computation time of the model in a negligible way compared with Scenario A and even Scenario B. Consequently, SecAgg provides better metric performance than DP. This work discusses the application of a collaborative forecasting technique, federated learning, to residential short term load forecasting in several steps. Firstly, this paper examined the most relevant short term load forecasting literature on our dataset to choose a deep learning architecture. The chosen baseline architecture served as a foundation for our federated learning settings. The stepwise analysis over different considerations (size, correlation, and a different deep learning architecture) let us achieve promising results for the application of federated learning to residential short term load forecasting. From our standard federated setting results and analysis, we can extrapolate the following: (1) there is a performance increase when a federation trains on highly correlated data, (2) bigger models tend to overfit, affecting the performance, and (3) the deep learning architecture highly impacts the computation time. Concerning the metrics themselves, the application of federated learning to residential short term load forecasting is encouraging, as the obtained metrics score low errors. Hence, STLF network models trained in a decentralized approach offer performance similar to central solutions. Secondly, this paper covers the application of privacy-preserving techniques for short term load forecasting in a federated learning setting. To our knowledge, this is the first paper to cover them. We introduced Differential Privacy and Secure Aggregation to procure a secure setting in a federated learning context and to explore the performance decrease they create when applied. Their application results in a minimal error increase. On the one hand, the privacy guarantee obtained using Differential Privacy is remarkably close to a secure theoretical setting ($\epsilon$ = 0). Exploring both fixed and adaptive clipping, we obtained (1.39, 10^{-5}) and (2.01, 10^{-5}) as the best privacy budgets in ($\epsilon$, $\delta$) terms. Notably, the federated learning models behave consistently under both fixed and adaptive clipping.
On the other hand, secure aggregation secures the communication channel and the aggregations done by the central server, without a substantial drop in performance and with better metrics than Differential Privacy. The main lesson learned from our analysis is that finding an adequate trade-off between noise, performance, and utility is not a trivial endeavour. Nonetheless, after the secure federated learning analysis, we can posit the following: (1) an initial scaling of the data positively affects the privacy budget because it reduces the sensitivity of the query; (2) adaptive clipping reduces the precomputation needed; (3) the addition of secure aggregation barely affects the performance, and the time required to create a secure configuration is worth the service it provides; (4) our secure federated learning for STLF has various constraints. These are: the resources needed for the computation, since we need to simulate all clients; data quality, as it affects the learning; data inputs beyond electricity consumption and weather data; the deep learning architecture; the privacy level needed; and the number of clients. These constraints are likely common to all secure federated learning settings. From our perspective, adaptive clipping in DP together with secure aggregation enables real-world application, as it avoids preliminary computations to find an adequate clipping value. A potential real-world application that would benefit from our secure federated learning setting is local flexibility markets using local data markets. Energy vendors can increase their local short-term load forecast performance by collaborating under a privacy guarantee and thus diminish their potential imbalance costs. Similarly, our secure federated learning setting is applicable to regional generation forecasting in a local flexibility market environment. However, the computation resources needed might hinder these real-world applications. Secure federated learning is a computationally expensive process in the training phase, whereas the instantiation of the model is not. To start a collaborative secure federated learning setting, energy vendors might need to procure specialized hardware if they do not already possess it. Despite the training phase being computationally expensive, this computation is not always required to update models; for instance, a federation of energy vendors could update their model fortnightly. Finally, the next steps in this research are (1) to assess bigger (scaled) settings with additional correlation indicators, such as the existence of distributed energy resources (i.e., photovoltaics, electric vehicles, or home energy management systems), to improve correlation, and (2) to investigate data input disruptions produced by a hostile agent or by a mistake caused by a smart metering device malfunction. The authors declare no conflict of interest.
[1] Short-term Forecasting in Power Systems: A Guided Tour
[2] Electricity load forecasting: a systematic review
[3] Enhanced load forecasting
[4] Global energy and climate outlook 2019: Electrification for the low-carbon transition
[5] Electricity final consumption by sector, world
[6] Impact of the lockdown during the COVID-19 pandemic on electricity use by residential users
[7] Time series analysis and prediction of electricity consumption of health care institution using ARIMA model
[8] Towards efficient electricity forecasting in residential and commercial buildings: A novel hybrid CNN with a LSTM-AE based framework
[9] Analysis of modern approaches for the prediction of electric energy consumption
[10] Electric load forecasting: Literature survey and classification of methods
[11] Neural networks for short-term load forecasting: a review and evaluation
[12] Advances in machine learning modeling reviewing hybrid and ensemble methods
[13] An overview of forecasting problems and techniques in power systems
[14] A survey on advanced metering infrastructure
[15] Disaggregation of household load profiles
[16] Smart meter data: Balancing consumer privacy concerns with legitimate applications
[17] Report on data access and data handling
[18] Format and procedures for electricity (and gas) data access and exchange in member states
[19] Smart metering and electricity demand: Technology, economics and international experience
[20] The central hub in providing information in the energy market
[21] Neutral data hub for metering data and market processes
[22] Communication-efficient learning of deep networks from decentralized data
[23] Federated optimization: Distributed machine learning for on-device intelligence
[24] Deep leakage from gradients
[25] Inverting gradients - how easy is it to break privacy in federated learning
[26] Differential privacy for real smart metering data
[27] Market value of differentially-private smart meter data
[28] A technique to provide differential privacy for appliance usage in smart metering
[29] Smart meter data privacy
[30] Federated learning for short-term residential energy demand forecasting
[31] Electrical load forecasting using edge computing and federated learning
[32] Federated machine learning: Concept and applications
[33] Assisted learning: A framework for multi-organization learning
[34] Advances and open problems in federated learning
[35] Federated learning of deep networks using model averaging
[36] Deep leakage from gradients
[37] Short-term energy consumption forecasting at the edge: A federated learning approach
[38] Distributed load forecasting using smart meter data: Federated learning with recurrent neural networks
[39] The algorithmic foundations of differential privacy
[40] Differential privacy
[41] Differential privacy: A primer for a non-technical audience
[42] Evaluating differentially private machine learning in practice
[43] Practical secure aggregation for privacy-preserving machine learning
[44] How to share a secret
[45] Model inversion attacks that exploit confidence information and basic countermeasures
[46] A randomized response model for privacy preserving smart metering
[47] Achieving differential privacy of data disclosure in the smart grid
[48] A privacy-preserving distributed smart metering temporal and spatial aggregation scheme
[49] Differential privacy for real smart metering data
[50] Learning differentially private recurrent language models
[51] Blockchain and federated learning for privacy-preserved data sharing in industrial IoT
[52] Differentially private learning with adaptive clipping
[53] Rényi differential privacy
[54] Multiparty computation from somewhat homomorphic encryption
[55] Understanding the difficulty of training deep feedforward neural networks
[56] Management of an academic HPC cluster: The UL experience
[57] Keras
[58] Smart meter data from London area
[59] The ACORN user guide
[60] Federated learning: Challenges, methods, and future directions
[61] Building energy load forecasting using deep neural networks
[62] Short-term residential load forecasting based on LSTM recurrent neural network
[63] Building energy consumption prediction: An extreme deep learning approach
[64] Deep learning for household load forecasting - a novel pooling deep RNN
[65] Multi-step short-term power consumption forecasting with a hybrid deep learning strategy
[66] Electric energy consumption prediction by deep learning with state explainable autoencoder
[67] Predicting residential energy consumption using CNN-LSTM neural networks
[68] Improving electric energy consumption prediction using CNN and Bi-LSTM
[69] Individual household electric power consumption data set
[70] Industrial and business forecasting methods: A practical guide to exponential smoothing and curve fitting
[71] Short-term electricity price forecasting based on similar day-based neural network
[72] Optimizing performance of recurrent neural networks on GPUs
This work has been supported by funding from the European Union (EU) within its Horizon 2020 programme, project MDOT (Medical Device Obligations Taskforce), Grant agreement 814654, and from the Kopernikus project "SynErgie" by the German Federal Ministry of Education and Research (BMBF). Additionally, we would like to thank Tom Josua Barbereau for the time invested in polishing the paper.