Title: Acceleration of Federated Learning with Alleviated Forgetting in Local Training
Authors: Chencheng Xu, Zhiwei Hong, Minlie Huang, Tao Jiang
Date: 2022-03-05

Abstract: Federated learning (FL) enables distributed optimization of machine learning models while protecting privacy by independently training local models on each client and then aggregating parameters on a central server, thereby producing an effective global model. Although a variety of FL algorithms have been proposed, their training efficiency remains low when the data are not independently and identically distributed (non-i.i.d.) across different clients. We observe that the slow convergence rates of the existing methods are (at least partially) caused by the catastrophic forgetting issue during the local training stage on each individual client, which leads to a large increase in the loss function concerning the previous training data at the other clients. Here, we propose FedReg, an algorithm to accelerate FL with alleviated knowledge forgetting in the local training stage by regularizing locally trained parameters with the loss on generated pseudo data, which encode the knowledge of previous training data learned by the global model. Our comprehensive experiments demonstrate that FedReg not only significantly improves the convergence rate of FL, especially when the neural network architecture is deep and the clients' data are extremely non-i.i.d., but also protects privacy better in classification problems and is more robust against gradient inversion attacks. The code is available at: https://github.com/Zoesgithub/FedReg.

Federated learning (FL) has emerged as a paradigm to train a global machine learning model in a distributed manner while taking privacy concerns and data protection regulations into consideration by keeping data on clients (Voigt & Von dem Bussche, 2017). The main challenge faced in FL is how to reduce the communication costs (in training) without degrading the performance of the final resultant model, especially when the data on different clients are not independently and identically distributed (non-i.i.d.) (Yuan & Ma, 2020; Wang et al., 2020b). The most popular FL algorithm is FedAvg (McMahan et al., 2017a). In each training round of FedAvg, local training steps are separately performed at every client and the locally trained models are transferred to the server. Then, the server aggregates the local models into a global model by simply averaging their parameters. Although FedAvg succeeds in many applications, its training process often diverges when the data are non-i.i.d., also known as heterogeneous, across the clients (Li et al., 2019; Zhao et al., 2018). Several FL algorithms have been designed to improve FedAvg and tackle the heterogeneity issue, mainly by reducing the difference between locally trained parameters (Karimireddy et al., 2020) or aggregating these parameters into different groups (Wang et al., 2020a; Yurochkin et al., 2019). However, the performance of these methods is still far from satisfactory when a deep neural network architecture is employed (Rothchild et al., 2020; Li et al., 2021). On the other hand, recent work in the literature (Geiping et al., 2020) shows that the transmission of trained model parameters does not ensure the protection of privacy.
Privacy-sensitive information can be recovered with gradient inversion attacks (Zhu et al., 2019). Differential privacy (DP) (McMahan et al., 2017b; Abadi et al., 2016) is one of the most widely used strategies to prevent the leakage of private information. However, when DP is incorporated into FL, the performance of the resulting model decays significantly (Jayaraman & Evans, 2019).

Figure 1: The catastrophic forgetting issues on non-i.i.d. MNIST (a) and EMNIST (b) data, and an illustration of FedReg (c). Here, $\mathrm{loss}^{(t-1)}$ denotes the loss on data $D_j$, which is the local data of client $j$ sampled in round $t-1$, computed with parameters $\theta^{(t-1)}$, and $\mathrm{loss}^{(t)}$ denotes the averaged loss on the same data computed with the locally trained parameters in round $t$. See Appendix B.1 for their detailed definitions. The values of $\mathrm{loss}^{(t)}$ are significantly larger than those of $\mathrm{loss}^{(t-1)}$, indicating that after the local training stage, the knowledge about the training data in the previous round at the other clients has been forgotten. To accelerate the convergence rate of FL by alleviating such forgetting in the local training stage, FedReg generates pseudo data $D^s_i$ and perturbed data $D^p_i$, and regularizes the local parameters $\theta^{(t,i)}$ by constraining the loss on $D^s_i$ and $D^p_i$. Therefore, instead of using the plain gradient with learning rate $\eta_\theta$, FedReg updates $\theta^{(t,i)}$ with the gradient $g_\gamma(D_i)$ computed with slowly-updated parameters $\theta^{(t,i)}_\gamma$, which prevents the gradient from converging to zero in a few steps, and then it regularizes $\theta^{(t,i)}$ with a combination of the gradients $g_\beta(D^s_i)$ and $g_\beta(D^p_i)$ computed from the pseudo and perturbed data.

We observe that when the data are non-i.i.d. across the clients, the locally trained models suffer from severe forgetting of the knowledge of previous training data at the other clients (i.e., the well-known catastrophic forgetting issue), perhaps due to the discrepancy between the local data distributions and the global data distribution. As shown in Figure 1 and supplementary Figure C.1, this forgetting issue leads to a large increase in the loss concerning these training data under the non-i.i.d. setting, thereby slowing down the convergence speed. In this work, we propose a novel algorithm, FedReg, that reduces the communication costs in training by alleviating the catastrophic forgetting issue in the local training stage. FedReg reduces knowledge forgetting by regularizing the locally trained parameters with generated pseudo data, which are obtained by modifying local data to encode the knowledge of previous training data as learned by the global model. The potential conflict with the knowledge in the local data introduced by the pseudo data is dampened by the use of perturbed data, which are generated by making small perturbations to the local data and help ensure that the predicted values of the local data are preserved. The generation of the pseudo data and perturbed data only relies on the global model received from the server and the local data on the current client. Thus, compared with FedAvg, no extra communication costs concerning the data on the other clients are incurred. Our extensive experiments demonstrate the superiority of FedReg in accelerating the FL training process, especially when the neural network architecture is deep and the clients' data are extremely heterogeneous. Furthermore, the pseudo data can be further utilized in classification problems to defend against gradient inversion attacks.
More specifically, we show that with a similar degree of protection of private information, the degradation in the performance of the global model learned by our method is much smaller than that of the model learned by FedAvg combined with DP.

Our contributions. We demonstrate that when the data across clients are non-i.i.d., catastrophic forgetting in the local training stage is an important factor that slows down the FL training process. We therefore propose a novel algorithm, FedReg, that accelerates FL by alleviating catastrophic forgetting with generated pseudo data. We also perform extensive experiments to establish the superiority of FedReg. We further show that in classification problems, the pseudo data generated in FedReg can help protect private information from gradient inversion attacks with much less impact on the performance of the resultant model compared with DP.

FL is proposed to address privacy concerns in a distributed learning environment. In the FL paradigm, a client $i$ keeps its local data $D_i$ on the client machine during the whole training process so that the server and the other clients cannot directly access $D_i$. The objective is to find the parameters $\theta^*$ that minimize the loss on the global data $\cup_{i \in C} D_i$, where $C$ is the set of clients, i.e.,

$$\theta^* = \arg\min_{\theta} \sum_{i \in C} \sum_{d \in D_i} L_\theta(d). \tag{1}$$

In this equation, $L_\theta$ is the loss function with parameters $\theta$, which can be cross-entropy or in some other custom-defined form. FedAvg (McMahan et al., 2017a) and FedProx (Li et al., 2020) are the most widely used FL algorithms. In FedAvg, in a training round $t$, the server sends the initial parameters $\theta^{(t-1)}$ to a set of sampled clients $C^{(t)}$. Then, the parameters are updated independently for $S$ epochs on each of these clients to minimize the loss on the local data, and the locally trained parameters $\theta^{(t,i)}$ are then sent to the server. The server aggregates the parameters by simply averaging over them and obtains the parameters $\theta^{(t)}$, i.e., $\theta^{(t)} = \frac{1}{K} \sum_{i \in C^{(t)}} \theta^{(t,i)}$, where $K$ is the number of sampled clients in round $t$. FedAvg is unstable and may diverge when the data are non-i.i.d. across different clients. FedProx stabilizes FedAvg by including a proximal term in the loss function to limit the distance between $\theta^{(t,i)}$ and $\theta^{(t-1)}$. Although FedProx provides a theoretical proof of convergence, empirically it fails to achieve good performance when deep neural network architectures are used (Li et al., 2021). SCAFFOLD (Karimireddy et al., 2020) assumes that the heterogeneity of the data leads to a client drift in the gradients and corrects the drift with gradient correction terms. As clients need to send extra information to the server, SCAFFOLD increases the risk of privacy leakage and doubles the communication burden compared with FedAvg. Furthermore, the accuracy of the correction terms is highly dependent on the training history of the client. Thus, the performance of SCAFFOLD decreases significantly when the number of clients is large, in which case each client is only sampled a few times and the estimation of the gradient correction terms is often inaccurate. FedCurv (Shoham et al., 2019) aims to tackle the forgetting issue on non-i.i.d. data with elastic weight consolidation (Kirkpatrick et al., 2017). To achieve this, FedCurv needs to transfer Fisher information between the server and the clients, which increases the communication costs to 2.5 times those of FedAvg.
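For concreteness, the following is a minimal PyTorch sketch of one FedAvg round as described above; the client abstraction, the plain SGD local optimizer, and the helper names (fedavg_round, client.loader) are illustrative assumptions rather than details from the paper.

```python
import copy
import random
import torch
import torch.nn.functional as F

def fedavg_round(global_model, clients, K, local_epochs, lr):
    """One FedAvg round: local training on K sampled clients, then averaging."""
    sampled = random.sample(clients, K)              # the sampled set C^(t)
    local_states = []
    for client in sampled:
        model = copy.deepcopy(global_model)          # initialize with theta^(t-1)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(local_epochs):                # S local epochs
            for x, y in client.loader:
                opt.zero_grad()
                F.cross_entropy(model(x), y).backward()
                opt.step()
        local_states.append(model.state_dict())      # theta^(t,i), sent to the server
    # Server aggregation: theta^(t) = (1/K) * sum_i theta^(t,i)
    avg = {k: torch.stack([s[k].float() for s in local_states]).mean(dim=0)
           for k in local_states[0]}
    global_model.load_state_dict(avg)
    return global_model
```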
Multiple methods (Luo et al., 2021; Hao et al., 2021; Goetz & Tewari, 2020) have also tried to introduce synthesized data to help reduce the effect of heterogeneity, but they either rely on specific neural network components (Luo et al., 2021), such as batch normalization (Ioffe & Szegedy, 2015), or require the synthesized data to be shared with the server (Hao et al., 2021; Goetz & Tewari, 2020), which contradicts the privacy protection objectives of FL.

Catastrophic forgetting occurs specifically when a neural network is trained sequentially on multiple tasks. In this case, the optimal parameters for the current task might perform poorly on the objectives of previous tasks. Many algorithms have been proposed to alleviate the forgetting issue. Memory-based methods (Parisi et al., 2019) have achieved excellent performances in accommodating new knowledge while retaining previously learned experience. Such memory-based methods, including gradient episodic memory (GEM) (Lopez-Paz & Ranzato, 2017) and averaged gradient episodic memory (A-GEM) (Chaudhry et al., 2018), store a subset of data from previous tasks and replay the memorized data when training on the current task. For instance, A-GEM treats the losses on the episodic memories of previous tasks as inequality constraints in optimizing the objectives of the current task and changes the model update from the plain gradient $g$ to $g - w\,g_{ref}$, where $g_{ref}$ is the gradient computed from the loss on the memorized data and $w$ is a non-negative weight. Unfortunately, these memory-based techniques are not suitable for FL due to privacy concerns.
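To make the A-GEM-style correction above concrete, here is a small numpy sketch; following Chaudhry et al. (2018), the correction is only applied when the plain gradient conflicts with the gradient on the memorized data, so that the subtracted term corresponds to a non-negative weight on $g_{ref}$. The function name and the flattened-vector interface are our own assumptions.

```python
import numpy as np

def a_gem_update(g, g_ref):
    """Correct the plain gradient g so it does not increase the memory loss.

    g     -- flattened gradient on the current task
    g_ref -- flattened gradient on the episodic memory of previous tasks
    """
    dot = float(np.dot(g, g_ref))
    if dot >= 0.0:
        return g                      # no conflict: keep the plain gradient
    # dot < 0 here, so this adds a non-negative multiple of g_ref to g,
    # removing the component of g that conflicts with previous tasks
    return g - (dot / float(np.dot(g_ref, g_ref))) * g_ref
```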
FL provides a privacy guarantee by keeping the users' data on individual client machines locally and only sharing the model updates. However, recent research has shown that data information can be recovered from the parameter updates (Geiping et al., 2020; Zhu et al., 2019) in the FedAvg framework by simply finding data with updates similar to the values returned from the client. DP, which avoids privacy leakage by adding noise to the training data (Sun et al., 2019) or the model updates (Abadi et al., 2016; McMahan et al., 2017b), is the most widely used strategy to protect private information. When adding noise to model updates, such as in differentially private SGD (DPSGD) (Abadi et al., 2016), the noise level in DP is controlled by the gradient norm bound $C$ and the noise scale $\sigma$. DP often causes a large performance degradation in the resultant model.

Our main challenge is how to alleviate the forgetting of previously learned knowledge at each client without having to access data at the other clients in the local training stage. For any data set $D$, we denote the set of data near $D$ within the Euclidean distance of $\delta$ as $N(D, \delta)$. We assume that in round $t$ on client $i$, (1) for all data $d^- \in \cup_{j \in C \setminus \{i\}} D_j$ local to the other clients, the global model with parameters $\theta^{(t-1)}$ has a better feature representation than the local model with parameters $\theta^{(t,i)}$; and (2) for any $d^- \in \cup_{j \in C \setminus \{i\}} D_j$, there exists a data point $d \in N(D_i, \delta)$ whose predicted value $f_\theta(d)$ is close to $f_\theta(d^-)$, where $f_\theta$ is the prediction function with parameters $\theta$. Assumption (2) guarantees that, in the local training stage, the range of changes in the predicted values of previous training data at the other clients can be controlled by constraining the changes in the predicted values of the data near $D_i$. Based on these assumptions, we first generate pseudo data and then alleviate the catastrophic forgetting issue by regularizing the locally trained parameters with the loss on the pseudo data. In other words, we hope that the pseudo data would achieve the same effect as the previous training data in constraining the local model. Note that FedReg does not assume any particular form of the loss function, but the method of modifying gradients to enhance privacy protection (to be discussed below in Section 3.3) is currently only applicable to classification problems. The pseudo code of FedReg is provided in Appendix D.

With the above assumptions, the catastrophic forgetting issue can be alleviated by constraining the predicted values for the data in $N(D_i, \delta)$. However, such a constraint is too strong and would hinder the learning of new knowledge from the local data. Moreover, it is also computationally inefficient to enumerate all such data. Observe that for a data point $d \in N(D_i, \delta)$ whose predicted value changes little during local training, the constraint on $d$ will not contribute to addressing the forgetting issue. Therefore, to efficiently regularize the locally trained parameters, only data points with large changes in their predicted values should be used to alleviate knowledge forgetting. Such data points are usually within a small distance of the data points in $D_i$, but their predicted values given by $\theta^{(t-1)}$ could be very different from the labels in $D_i$. Data satisfying these conditions can be easily obtained with the fast gradient sign method (Goodfellow et al., 2014), and we denote the data generated in this way as pseudo data $D^s_i$. More precisely, each data point $(x^s, y^s) \in D^s_i$ is iteratively generated from a data point $(x, y) \in D_i$ as follows:

$$x^s_0 = x, \qquad x^s_e = x^s_{e-1} + \eta_s\,\mathrm{sign}\!\left(\nabla_x L_{\theta^{(t-1)}}(x^s_{e-1}, y)\right), \qquad y^s = f_{\theta^{(t-1)}}(x^s_E), \tag{2}$$

where $\eta_s$ is the step size and $E$ is the number of steps. Despite the relaxation in constraints achieved by the pseudo data generated above, some information conflicting with the knowledge embedded in the local data $D_i$ could possibly be introduced in $D^s_i$ due to the inaccuracy of the global model during the training process. To further eliminate such conflicting information, perturbed data $D^p_i = \{(x^p, y^p)\}$ with small perturbations on $D_i$ are iteratively generated as follows:

$$x^p_0 = x, \qquad x^p_e = x^p_{e-1} + \eta_p\,\mathrm{sign}\!\left(\nabla_x L_{\theta^{(t-1)}}(x^p_{e-1}, y)\right), \qquad y^p = y, \tag{3}$$

where the step size $\eta_p$ satisfies $\eta_p \ll \eta_s$ in order to make sure that the perturbed data are much closer to the local data than the pseudo data. In our experiments, we take $\eta_p = 0.01\,\eta_s$ and $E = 10$ to reduce the complexity of tuning hyper-parameters. With the pseudo data $D^s_i$ and perturbed data $D^p_i$ generated above, we regularize $\theta^{(t,i)}$ by requiring

$$\sum_{d^s \in D^s_i} L_{\theta^{(t,i)}}(d^s) \le \sum_{d^s \in D^s_i} L_{\theta^{(t-1)}}(d^s), \tag{4}$$

$$\sum_{d^p \in D^p_i} L_{\theta^{(t,i)}}(d^p) \le \sum_{d^p \in D^p_i} L_{\theta^{(t-1)}}(d^p), \tag{5}$$

where constraint (4) alleviates the catastrophic forgetting issue and constraint (5) eliminates the conflicting information introduced in (4). Constraint (5) also helps improve the robustness of the resultant model (Madry et al., 2018). Expanding both sides of (4) and of (5), respectively, around $\theta^{(t,i)}_\beta = 0.5\,(\theta^{(t,i)} + \theta^{(t-1)})$ so that the second-order terms cancel, and ignoring higher-order terms (see Appendix A.1 for the details), we obtain

$$\left(\theta^{(t-1)} - \theta^{(t,i)}\right)^{T} g_\beta(D^s_i) \ge 0, \tag{6}$$

$$\left(\theta^{(t-1)} - \theta^{(t,i)}\right)^{T} g_\beta(D^p_i) \ge 0, \tag{7}$$

where $g_\beta(\cdot)$ denotes the gradient of the corresponding loss computed at $\theta^{(t,i)}_\beta$. Due to the existence of potentially conflicting information in $D^s_i$ and $D^p_i$, directly finding $\theta^{(t,i)}$ to satisfy the above two inequalities is mathematically ill-defined. Besides, given the complex architectures of deep neural networks, it is computationally too challenging to solve either of the above inequalities exactly. Therefore, we approximate the optimal parameters $\theta^{(t,i)*}$ by solving the following constrained optimization problems:

$$\hat{\theta}^{(t,i)} = \arg\min_\theta \left\|\theta - \theta^{(t,i)}\right\|^2 \quad \text{s.t.} \quad \left(\theta^{(t-1)} - \theta\right)^{T} g_\beta(D^s_i) \ge 0, \tag{8}$$

$$\hat{\theta}^{(t,i)} = \arg\min_\theta \left\|\theta - \theta^{(t,i)}\right\|^2 \quad \text{s.t.} \quad \left(\theta^{(t-1)} - \theta\right)^{T} g_\beta(D^p_i) \ge 0. \tag{9}$$

By solving (8) and (9) (see Appendix A.2 for the detailed derivation), we obtain

$$\theta^{(t,i)} \leftarrow \theta^{(t,i)} - w_s\, g_\beta(D^s_i) - w_p\, g_\beta(D^p_i), \qquad w_s = \max\!\left(0, \frac{\left(\theta^{(t,i)} - \theta^{(t-1)}\right)^{T} g_\beta(D^s_i)}{\left\|g_\beta(D^s_i)\right\|^2}\right), \tag{10}$$

with $w_p$ defined analogously using $D^p_i$. Based on the generation process of the pseudo data, it is easy to note that for each $d^s \in D^s_i$ the label $y^s$ is the prediction of the global model, so $L_{\theta^{(t-1)}}(d^s)$ is already minimal, and thus the inequality sign in (4) can in fact be replaced by an equality sign. In practice, however, we have observed that the training process is more stable and results in non-negative $w_s$ and $w_p$ with the inequality formulation, probably due to errors introduced in the above approximation.
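The data-generation steps (2) and (3) above can be sketched in PyTorch as follows; this is a minimal illustration under our reading of the reconstructed equations (in particular, taking the pseudo label to be the global model's hard prediction), not the reference implementation.

```python
import torch
import torch.nn.functional as F

def generate_pseudo_and_perturbed(global_model, x, y, eta_s, E=10):
    """Generate pseudo data (x_s, y_s) and perturbed data (x_p, y) from (x, y)."""
    eta_p = 0.01 * eta_s                       # eta_p << eta_s, as in the paper
    x_s, x_p = x.clone(), x.clone()
    for _ in range(E):
        for data, eta in ((x_s, eta_s), (x_p, eta_p)):
            data.requires_grad_(True)
            loss = F.cross_entropy(global_model(data), y)
            grad, = torch.autograd.grad(loss, data)
            with torch.no_grad():
                data += eta * grad.sign()      # fast gradient sign step
            data.requires_grad_(False)
    with torch.no_grad():                      # pseudo labels from the global model
        y_s = global_model(x_s).argmax(dim=1)
    return (x_s, y_s), (x_p, y)                # perturbed data keep the local labels
```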
Finally, to prevent the gradient from converging to zero in a very small number of steps, slowly-updated parameters (Zhang et al., 2019b) $\theta^{(t,i)}_\gamma = (1-\gamma)\,\theta^{(t-1)} + \gamma\,\theta^{(t,i)}$ are used to compute the gradient $g_\gamma(D_i)$ on the local data. In summary, taking into account the above improvements, the local parameters are updated in each training step as

$$\theta^{(t,i)} \leftarrow \theta^{(t,i)} - \eta_\theta\, g_\gamma(D_i) - w_s\, g_\beta(D^s_i) - w_p\, g_\beta(D^p_i). \tag{11}$$

We further notice that in classification problems, the pseudo data can be used to modify the gradient $g_\gamma(D_i)$ to enhance the protection of privacy. If we denote $\bar{D}^s_i = \{(x, \frac{1}{n}\sum_{j=1}^{n} e^{(j)}) \mid (x, y) \in D^s_i\}$ for an $n$-classification problem, where $e^{(j)}$ is the standard basis vector with a 1 at position $j$, then the data points in $\bar{D}^s_i$ are similar to those in $D_i$, but they may have totally different labels. Hence, it is reasonable to assume that $g_\gamma(D_i)$ and $g_\gamma(\bar{D}^s_i)$ contain similar semantic information but different classification information. Since in the training of FL only the classification information contributes to the model performance, removing the semantic information in $g_\gamma(D_i)$ will enhance the privacy protection capability of FL without severely degrading the performance of the resultant model. Based on this intuition, we presume that the components of $g_\gamma(D_i)$ in the same (or roughly the same) direction as $g_\gamma(\bar{D}^s_i)$ contain the semantic information and the other (orthogonal) components contain the classification information. Thus, we compute a modified gradient (MG) to enhance the protection of privacy as

$$\tilde{g}_\gamma(D_i) = g_\gamma(D_i) - \frac{g_\gamma(D_i)^{T}\, g_\gamma(\bar{D}^s_i)}{\left\|g_\gamma(\bar{D}^s_i)\right\|^2}\, g_\gamma(\bar{D}^s_i).$$

We refer to the version of FedReg in which the $g_\gamma(D_i)$ term in (11) is replaced by $\tilde{g}_\gamma(D_i)$ as FedReg with MG.
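A sketch of one FedReg local step under the reconstructed update (11), together with the modified gradient (MG), is given below; parameters and gradients are treated as flattened vectors for clarity, and the function names are our own.

```python
import torch

def projection_weight(theta, theta_prev, g):
    """Lagrange-multiplier weight from the constraint (theta_prev - theta)^T g >= 0."""
    w = torch.dot(theta - theta_prev, g) / torch.dot(g, g)
    return torch.clamp(w, min=0.0)              # w_s and w_p are non-negative

def fedreg_local_step(theta, theta_prev, g_gamma, g_beta_s, g_beta_p, lr):
    """One local step of update (11): local gradient step, then regularization."""
    theta = theta - lr * g_gamma                # step on the local data D_i
    w_s = projection_weight(theta, theta_prev, g_beta_s)   # pseudo data D^s_i
    w_p = projection_weight(theta, theta_prev, g_beta_p)   # perturbed data D^p_i
    return theta - w_s * g_beta_s - w_p * g_beta_p

def modified_gradient(g, g_s):
    """MG: remove the component of g parallel to the pseudo-data gradient g_s."""
    return g - (torch.dot(g, g_s) / torch.dot(g_s, g_s)) * g_s
```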
We compare the performance of FedReg against the above-mentioned baseline FL algorithms, including FedAvg, FedProx, FedCurv and SCAFFOLD, as well as the well-known SGD algorithm (Bottou, 2012). Note that SGD corresponds to a special case of FedAvg where the local training stage consists of only one step and all data on a client are used in a single large batch. FedReg is evaluated on MNIST (Deng, 2012), EMNIST (Cohen et al., 2017), CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), and CT images of COVID-19 (He, 2020). To simulate a scenario for FL, the training data in each dataset are split across multiple clients in different ways, and the performance of the trained model is evaluated on the test data. The data preparation steps for each dataset are described below and more experimental details are provided in Appendix B.

MNIST. The training data are split into 5,000 clients. Each client has images either from only one class, named MNIST (one class), or from two classes, named MNIST (two classes). The number of images per client follows a power law distribution. A 5-layer neural network with three convolutional (CNN) layers and two fully connected (FC) layers is used.

EMNIST. The training data of the digits in EMNIST are split into 10,000 clients. Each client owns 24 images from the same class. The same model architecture used on MNIST is used here.

CIFAR-10. The training data are split into 10,000 clients. Each client owns 5 images from random classes, named CIFAR-10 (uniform), or from only one class, named CIFAR-10 (one class). A ResNet-9 network with the Fixup initialization (Zhang et al., 2019a) is trained from scratch.

CIFAR-100. The training data are split into 50,000 clients and each client owns 1 image. A ResNet-9 network with the Fixup initialization is trained from scratch, named CIFAR-100 (ResNet-9), and a pre-trained transformer (Dosovitskiy et al., 2020) is fine-tuned, named CIFAR-100 (Transformer).

CT images related to COVID-19. The dataset contains 349 CT images of COVID-19 patients with source information and 397 non-COVID-19 CT images without source information. The data are split into training and test data in the same way as provided in (He, 2020). Then, the CT images of COVID-19 patients in the training data are split so that images from the same source are assigned to the same client; 32 clients are obtained in this way. The non-COVID-19 images in the training data are distributed to each client following the proportion of COVID-19 images. A DenseNet-121 network (Huang et al., 2017) and a 10-layer neural network with 9 CNN layers and one FC layer are trained on these CT images to distinguish the COVID-19 images from the non-COVID-19 ones.

The results on MNIST and EMNIST are shown in Table 1. FedReg required fewer communication rounds to converge compared with the baseline methods and achieved a higher final accuracy. FedCurv also aims to alleviate the forgetting issue, but its performance is not significantly better than that of FedProx. Note that due to its heavy reliance on the optimization history when estimating the correction terms, SCAFFOLD performed worse than FedAvg in our experiments. Table 2 illustrates that, on CIFAR-10, when the data are randomly distributed, FedReg significantly outperformed all the baseline methods in both convergence speed (measured in communication rounds) and final model accuracy, demonstrating the superiority of FedReg in training deep neural networks. When the data across clients become more heterogeneous and the images at each client are from the same class, the multiple local training steps in the baseline algorithms (other than SGD) failed to improve the performance of the locally trained model, resulting in a worse performance than that of SGD, while FedReg was still able to save about half of the communication costs compared with SGD. A similar conclusion can be drawn when training the ResNet-9 network from scratch and fine-tuning the transformer model on CIFAR-100, as shown in Table 3.

To verify that FedReg indeed alleviates catastrophic forgetting, the increases of the loss values concerning previous training data at the other clients are compared between FedReg and FedAvg. As shown in Figure 2, the increases of the loss are significantly lower in FedReg than in FedAvg, suggesting that although FedAvg and FedReg both forget some of the learned knowledge, the forgetting issue is less severe in FedReg. To further explore the role of the generated pseudo data in the alleviation of forgetting, the values of the empirical Fisher information (Ly et al., 2017) on the pseudo data and on the previous training data are compared. As shown in Figure 2, the values on the pseudo data and the previous training data are significantly correlated. Following the Laplace approximation (MacKay, 1992; Kirkpatrick et al., 2017), if we approximate the posterior distribution of the optimal parameters on a dataset by a normal distribution centered at $\theta^{(t-1)}$ with a diagonal precision matrix, then the precision matrix can be approximated by the empirical Fisher information. The similarity in the empirical Fisher information suggests that a model with high performance on the pseudo data has a relatively high probability of performing well on the previous training data. Therefore, regularizing the parameters with the pseudo data can help reduce the performance degradation on previous training data and alleviate the knowledge forgetting issue.
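The diagonal empirical Fisher information compared above can be computed with a sketch like the following, which squares and averages per-example gradients; processing one example at a time, tensor-valued labels, and a classification loss are assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

def diagonal_empirical_fisher(model, dataset):
    """Diagonal empirical Fisher: average of squared per-example gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in dataset:                        # one (x, y) tensor pair at a time
        model.zero_grad()
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(dataset) for n, f in fisher.items()}
```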
The impact of the hyper-parameter $\eta_s$, which determines the distance between the pseudo data and the local data, is explored here. Intuitively, when $\eta_s$ is too large, the distance between the generated pseudo data and the local data will be too large to effectively regularize the model parameters. On the other hand, when $\eta_s$ is too small, the conflicting information from an inaccurate global model will slow down the convergence speed. This intuition is empirically confirmed by the experiments presented in supplementary Figure C.2. On the EMNIST and CIFAR-10 (one class) datasets, when $\eta_s$ is too small, it takes more communication rounds to achieve the same accuracy. As $\eta_s$ increases, the number of communication rounds decreases to a minimum and then bounces back, indicating that the regularization of the model parameters in the local training stage has become less effective.

The protection of private information is one of the most important concerns in the FL paradigm, especially in medical applications. To illustrate the privacy security of FedReg with MG, gradient inversion attacks are used to recover information from the updated parameters in classification problems. Under a comparable degree of privacy protection, the performance of the resultant model is compared between FedReg with MG and the baseline methods with DPSGD (Abadi et al., 2016). Using gradient inversion attacks, the quality of images recovered from the updated parameters in a local step is examined. Note that when only one local training step is performed in each round, the updated parameters in FedAvg, FedProx, FedCurv, and SGD are all the same, leading to the same images recovered from these methods. We do not include SCAFFOLD here since its performance was significantly inferior to that of the other methods in the above experiments. As shown in supplementary Figures C.3 and C.4, for some images sampled from the EMNIST, CIFAR-10 and COVID-19 CT datasets, the quality of the images recovered from FedReg with MG is significantly worse than that of those recovered from the baseline methods, exhibiting a better privacy protection capability of FedReg with MG. To compare the impact of privacy protection on model performance, we tune the noise level in DPSGD to generate images of similar quality as measured by PSNR (Hore & Ziou, 2010) and compare the performance of the resultant models. The results are shown in Table 4. It can be seen that the protection of private information with DP costs a large performance degradation in the baseline methods, while the performance of FedReg with MG degrades much less while maintaining a similar privacy protection level. Note that the CT images used in the above experiments contain some privacy-sensitive information, including the date of birth of the patient and the date when the image was taken. In the basic FedAvg, when a 10-layer neural network with 9 CNN layers and one FC layer is used, such information can be easily recovered with gradient inversion attacks. When DP is applied, although such time information can be protected, the resultant model suffers from severe performance degradation, as shown in Table 4. In contrast, FedReg with MG is able to protect the sensitive time information with only mild model performance decay.
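The PSNR used above to match the quality of recovered images across methods can be computed as follows (Hore & Ziou, 2010); images are assumed to be arrays scaled to [0, 1].

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio between a recovered image and its reference."""
    mse = np.mean((np.asarray(img, dtype=np.float64) -
                   np.asarray(ref, dtype=np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")                     # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```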
In this work, we proposed a novel algorithm, FedReg, to accelerate the convergence speed of FL by alleviating the catastrophic forgetting issue in the local training stage. Pseudo data are generated to carry the knowledge about previous training data learned by the global model, without incurring extra communication costs or accessing the data at the other clients. Our extensive experiments show that the generated pseudo data contain similar Fisher information as the previous training data at the other clients, and thus the forgetting issue can be alleviated by regularizing the locally trained parameters with the pseudo data. The pseudo data can also be used to defend against gradient inversion attacks in classification problems with only a mild performance decay on the resultant model compared with DP. Although FedReg exhibits competitive performance in accelerating convergence, its performance is influenced by hyper-parameters. Automatic tuning of the hyper-parameters could make FedReg easier to use. Meanwhile, the question of how to modify gradients in FedReg to effectively protect privacy in regression problems remains to be explored. Moreover, a rigorous proof of the convergence rate of FedReg is of theoretical interest. We leave these questions to future work.

A.1. Expanding both sides of (4) around $\theta^{(t,i)}_\beta$, dividing them by $|D^s_i|$ and ignoring all higher-order terms, only the first-order terms are left and thus inequality (6) is obtained, i.e.,

$$\left(\theta^{(t-1)} - \theta^{(t,i)}\right)^{T} g_\beta(D^s_i) \ge 0.$$

Inequality (7) can be derived from (5) in a similar way.

A.2. The constrained optimization problems in (8) and (9) can be solved by the Lagrange multiplier method. More specifically, for problem (8), i.e.,

$$\hat{\theta}^{(t,i)} = \arg\min_\theta \left\|\theta - \theta^{(t,i)}\right\|^2 \quad \text{s.t.} \quad \left(\theta^{(t-1)} - \theta\right)^{T} g_\beta(D^s_i) \ge 0,$$

its Lagrangian can be written as

$$L(\theta, \alpha) = \left\|\theta - \theta^{(t,i)}\right\|^2 - \alpha \left(\theta^{(t-1)} - \theta\right)^{T} g_\beta(D^s_i), \qquad \alpha \ge 0.$$

Setting the derivative of $L(\theta, \alpha)$ w.r.t. $\theta$ to zero, the value of $\theta$ minimizing $L(\theta, \alpha)$ is obtained as

$$\theta = \theta^{(t,i)} - \frac{\alpha}{2}\, g_\beta(D^s_i).$$

Thus, the value of $\hat{\theta}^{(t,i)}$ is

$$\hat{\theta}^{(t,i)} = \theta^{(t,i)} - w_s\, g_\beta(D^s_i), \qquad w_s = \max\!\left(0, \frac{\left(\theta^{(t,i)} - \theta^{(t-1)}\right)^{T} g_\beta(D^s_i)}{\left\|g_\beta(D^s_i)\right\|^2}\right).$$

The constrained optimization problem (9) can be solved similarly.

In all the experiments, the learning rates for the different methods are optimized by grid search, the optimal weight for the proximal term in FedProx is searched among {1.0, 0.1, 0.01, 0.001}, the optimal λ for FedCurv is searched among {0.001, 0.0001, 0.00001}, and the hyper-parameters (γ and $\eta_s$) in FedReg are optimized by grid search as well. The number of epochs in the local training stage is optimized in FedAvg and applied to the other methods.

B.1 Computation of $\mathrm{loss}^{(t-1)}$ and $\mathrm{loss}^{(t)}$. Formally, the value of $\mathrm{loss}^{(t-1)}$ for a client $i \in C^{(t-1)}$ is computed as

$$\mathrm{loss}^{(t-1)}(i) = \frac{1}{|D_i|} \sum_{d \in D_i} L_{\theta^{(t-1)}}(d),$$

and the value of $\mathrm{loss}^{(t)}$ is computed as

$$\mathrm{loss}^{(t)}(i) = \frac{1}{K} \sum_{j \in C^{(t)}} \frac{1}{|D_i|} \sum_{d \in D_i} L_{\theta^{(t,j)}}(d),$$

where $\theta^{(t,j)}$ are the parameters sent from client $j$ in round $t$.

The hyper-parameters in the experiments are listed below:

• MNIST. Ten clients are sampled in each training round and 500 rounds are run in total with a batch size of 10 so that all training data are utilized once on average. The learning rate in all methods is 0.1. On MNIST (two classes), in the local training stage, the local data are processed in 40 epochs. The weight for the proximal term in FedProx is 0.001 and λ in FedCurv is $10^{-4}$. In FedReg, γ = 0.4 and $\eta_s = 0.2$. On MNIST (one class), the local data are processed in 20 epochs. The weight for the proximal term in FedProx is 0.01 and λ in FedCurv is $10^{-4}$. In FedReg, γ = 0.3 and $\eta_s = 0.2$.
• EMNIST. 20 clients are sampled in each training round, and in the local training stage, the local data are processed in 20 epochs with a batch size of 24. 500 rounds are run in total so that all the training data are utilized once on average. The learning rate in SCAFFOLD is 0.1 and in the other methods it is 0.2. The weight for the proximal term in FedProx is 0.001 and λ in FedCurv is $10^{-4}$. In FedReg, γ = 0.4 and $\eta_s = 0.05$.

• CIFAR-10. Following the experimental settings in (Rothchild et al., 2020), 240 rounds are run in total with 100 clients sampled in each training round, and in the local training stage, the batch size is 5. On CIFAR-10 (uniform), the local data are processed in 30 epochs. The learning rate is 0.05 in FedAvg, FedCurv and FedProx, 0.1 in SGD and FedReg, and 0.01 in SCAFFOLD, respectively. The weight for the proximal term in FedProx is 0.001 and λ in FedCurv is $10^{-5}$. In FedReg, γ = 0.5 and $\eta_s = 0.1$. On CIFAR-10 (one class), the local data are processed in 20 epochs. The learning rate is 0.05 in FedAvg, FedCurv, FedProx, SCAFFOLD and FedReg, and 0.1 in SGD, respectively. The weight for the proximal term in FedProx is 0.01 and λ in FedCurv is $10^{-3}$. In FedReg, γ = 0.5 and $\eta_s = 0.1$.

• CIFAR-100. When ResNet-9 is adopted, 1,200 rounds are run in total with 100 clients sampled in each training round, and in the local training stage, the local data are processed in 10 epochs with a batch size of 1. The learning rate is 0.05 in FedAvg, FedCurv and FedProx, and 0.1 in SGD and FedReg, respectively. The weight for the proximal term in FedProx is 0.01 and λ = $10^{-3}$ in FedCurv. In FedReg, γ = 0.25 and $\eta_s = 0.1$. When the pre-trained transformer is adopted, 100 rounds are run in total with 500 clients sampled in each training round, and in the local training stage, the local data are processed in 5 epochs with a batch size of 1. The learning rate is 0.1 in all methods. The weight for the proximal term in FedProx is 0.001 and λ = $10^{-5}$ in FedCurv. In FedReg, γ = 0.02 and $\eta_s = 10^{-2}$. Note that as SCAFFOLD is too memory-consuming to run on CIFAR-100 given our computational resources, the results of SCAFFOLD on CIFAR-100 are not reported.

• CT images related to COVID-19. In our experiments, ten rounds are run in total and in each round, 10 clients are sampled. In the local training stage, the local data are processed in 20 epochs with a batch size of 10. The learning rate is $1 \times 10^{-3}$ in FedAvg, FedCurv and FedProx, $5 \times 10^{-4}$ in SCAFFOLD, $1 \times 10^{-4}$ in SGD, and $5 \times 10^{-3}$ in FedReg, respectively. The weight for the proximal term in FedProx is 1.0 and λ in FedCurv is $10^{-4}$. In FedReg, γ = 0.5 and $\eta_s = 10^{-6}$. Note that group normalization (Wu & He, 2018) is used in the DenseNet-121 network as the batch size is small in the local training stage.

• Landmarks-User-160k. The training data in this dataset consist of 164,172 images of 2,028 landmarks from 1,262 users, and the test data contain 19,526 images. Each user has images of 29 classes on average. A DenseNet-121 network is trained on this dataset. 2,000 rounds are run in total, and in each round 100 clients are sampled. In the local training stage, the local data are processed in 40 epochs. The learning rate is set to 0.1 in SGD, FedAvg, FedReg, and pFedGP (Achituve et al., 2021).
As pFedGP is designed for personalized federated learning scenarios and trains an independent GP-tree for each client, to apply it to this dataset, a small global dataset is built by sampling 2,028 images from all the training data, which covers all the classes contained in Landmarks-User-160k. In each round, the deep kernel, which is the DenseNet-121 network without the classification layer, is trained on each client and aggregated on the server. Then, a global GP-tree is trained on the sampled global sub-dataset to classify the images in the test data. The hyper-parameters used here, except γ in FedReg, are the same as the values used in the previous section.

• Computation of the loss increment. To fairly compare the forgetting issue in FedReg and FedAvg, γ is set to 1.0 in FedReg, and in each round, the client parameters are initialized with the parameters obtained in FedReg to avoid a large discrepancy between the values of $\mathrm{loss}^{(t-1)}$ computed in FedReg and FedAvg. For a client $i \in C^{(t-1)}$, the loss increase is computed as $\mathrm{loss}^{(t)}(i) - \mathrm{loss}^{(t-1)}(i)$.

• Computation of the Fisher information. Following (Kunstner et al., 2019), in round $t$ on client $i \in C^{(t)}$, for the previous training data $D^{(t-)} = \cup_{j \in C^{(t-)}} D_j$, where $C^{(t-)} = \cup_{s \in [t-10,\, t-1]} C^{(s)}$, the empirical Fisher information is computed as

$$F\!\left(D^{(t-)}\right) = \frac{1}{\left|D^{(t-)}\right|} \sum_{d \in D^{(t-)}} \left(\nabla_\theta L_{\theta^{(t,i)}}(d)\right)^2.$$

Similarly, for the pseudo data $D^s_i$, the empirical Fisher information is computed as

$$F\!\left(D^s_i\right) = \frac{1}{\left|D^s_i\right|} \sum_{d \in D^s_i} \left(\nabla_\theta L_{\theta^{(t,i)}}(d)\right)^2.$$

Note that, considering the generation process of the pseudo data, the parameter values $\theta^{(t,i)}$ used to compute the Fisher information are the same local model parameters after a local training step.

The target of the gradient inversion attacks in (Geiping et al., 2020) and (Zhu et al., 2019) is to find

$$d^* = \arg\min_d\; \mathrm{dist}\!\left(\Delta\theta\!\left(\theta^{(t)}, d\right),\, \Delta\theta\!\left(\theta^{(t)}, d_{true}\right)\right) + w\, TV(d),$$

where $\Delta\theta(\theta^{(t)}, \cdot)$ denotes the amounts (updates) that the parameters have changed from $\theta^{(t)}$ with the corresponding data, $TV(d)$ the total variation of $d$, and $d_{true}$ a data point on the client. The distance function can be the cosine distance (Geiping et al., 2020) or the L2-distance (Zhu et al., 2019). Note that the method proposed in (Zhu et al., 2019) sets $w = 0$.
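A minimal sketch of this attack objective with the L2-distance follows: a dummy input is optimized with Adam so that its one-step update matches the observed update, plus a total-variation prior. The single-gradient-step update and the known label are simplifying assumptions; see Geiping et al. (2020) for the full method.

```python
import torch
import torch.nn.functional as F

def total_variation(x):
    """Anisotropic total variation of a batch of images (B, C, H, W)."""
    return ((x[..., 1:, :] - x[..., :-1, :]).abs().mean()
            + (x[..., :, 1:] - x[..., :, :-1]).abs().mean())

def inversion_attack(model, observed_update, label, shape, w_tv, steps, lr=1.0):
    """Recover a dummy input whose one-step gradient matches observed_update."""
    dummy = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([dummy], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(dummy), label)
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        update = torch.cat([g.flatten() for g in grads])   # candidate update
        obj = (update - observed_update).pow(2).sum() + w_tv * total_variation(dummy)
        obj.backward()
        opt.step()
    return dummy.detach()
```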
To consider the worst case, in our attacking experiments, the parameters are only trained for one step with each data point, in which case the values of $\Delta\theta(\theta^{(t)}, d)$ obtained in FedAvg, FedCurv, FedProx and SGD are exactly the same. 32,000 attacking iterations are used in all experiments and both distance functions (i.e., cosine and L2) are considered. The Adam (Kingma & Ba, 2014) optimizer is used, which empirically performs better than L-BFGS. The learning rate and the weight for the total variation term are optimized in (the basic) FedAvg and applied to the other methods. The learning rate decaying scheme follows the method provided in (Geiping et al., 2020). More details concerning each dataset are discussed below.

• EMNIST. The images to be recovered are randomly sampled. The model architecture and hyper-parameters used in FedAvg, FedProx, FedCurv, SGD, and FedReg are the same as the ones used in Section B.2. In the baseline methods with DPSGD, C = 1.0 and σ = 0.01, and the other hyper-parameters are the same as the ones used in the corresponding baseline methods (without DPSGD). In FedReg with MG, the learning rate and the value of γ are the same as the ones used in FedReg, and $\eta_s = 0.03$. In the attacking algorithm, the initial learning rate is 1.0, the distance function is the L2-distance, and the weight for the total variation term is $10^{-8}$.

• CIFAR-10. The images to be recovered are randomly sampled. The model architecture and hyper-parameters used in FedAvg, FedProx, FedCurv, SGD, and FedReg are the same as the ones used in Section B.2. In the baseline methods with DPSGD, C = 1.0 and σ = 0.05, and the other hyper-parameters are the same as the ones used in the corresponding baseline methods (without DPSGD). In FedReg with MG, the learning rate and the value of γ are the same as the ones used in FedReg, and $\eta_s = 0.001$. In the attacking algorithm, the initial learning rate is 1.0, the distance function is the cosine distance, and the weight for the total variation term is $10^{-6}$.

• CT images. Only the images containing time information are used here. As DenseNet-121 is too complex to be attacked, we use a 10-layer neural network with 9 CNN layers and one FC layer instead. 10 rounds are run in total and in each round, 10 clients are sampled. In the local training stage, the local data are processed in 20 epochs. The learning rate is $5 \times 10^{-4}$ in SGD, FedAvg, FedProx and FedReg, and $1 \times 10^{-3}$ in FedCurv, respectively. The weight for the proximal term in FedProx is 0.1 and λ in FedCurv is 0.001. In FedReg, γ = 0.5 and $\eta_s = 10^{-6}$. In the baseline methods with DPSGD, C = 1.0 and σ = 0.001, and the other hyper-parameters are the same as the ones used in the corresponding methods (without DPSGD). In FedReg with MG, the learning rate and the values of γ and $\eta_s$ are the same as those in FedReg. In the attacking algorithm, the initial learning rate is 1.0, the distance function is the L2-distance, and the weight for the total variation term is $10^{-8}$.

Figure C.1: The values of $\mathrm{loss}^{(t)}$ and $\mathrm{loss}^{(t-1)}$ on homogeneously distributed (a) EMNIST and (b) MNIST, where the whole dataset is uniformly partitioned into 10,000 clients for EMNIST and 2,000 clients for MNIST, respectively. It can be seen that the values of $\mathrm{loss}^{(t)}$ are comparable to those of $\mathrm{loss}^{(t-1)}$, indicating that not much of the previous training data has been forgotten.

Appendix D. The pseudo code of FedReg:

1: for t = 1, 2, ..., T do
2:   Randomly sample K clients C^(t)
3:   for i ∈ C^(t) do
4:     θ^(t,i) ← θ^(t−1); generate D^s_i and D^p_i with η_s, η_p and E
5:     for s = 1, 2, ..., S do
6:       Compute g_γ(D_i); θ^(t,i) ← θ^(t,i) − η_θ g_γ(D_i); θ^(t,i)_β ← 0.5 (θ^(t,i) + θ^(t−1))
7:       Compute g_β(D^s_i), g_β(D^p_i), w_s and w_p; θ^(t,i) ← θ^(t,i) − w_s g_β(D^s_i) − w_p g_β(D^p_i)
8:     end for
9:   end for
10:  θ^(t) ← (1/K) Σ_{i∈C^(t)} θ^(t,i)
11: end for
12: return {θ^(t)}_{t=1}^{T}

Figure C.4: High-resolution versions of the CT images (rows 1 and 2) and the images recovered from FedAvg (rows 3 and 4). Note that the time stamps located at the upper right corners of the recovered images are visible.
References

Deep learning with differential privacy.
Personalized federated learning with Gaussian processes.
Stochastic gradient descent tricks.
Efficient lifelong learning with A-GEM.
EMNIST: Extending MNIST to handwritten letters.
The MNIST database of handwritten digit images for machine learning research.
An image is worth 16x16 words: Transformers for image recognition at scale.
Inverting gradients - how easy is it to break privacy in federated learning?
Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples.
Evaluating differentially private machine learning in practice.
SCAFFOLD: Stochastic controlled averaging for federated learning.
Adam: A method for stochastic optimization.
Overcoming catastrophic forgetting in neural networks.
Learning multiple layers of features from tiny images.
Limitations of the empirical Fisher approximation for natural gradient descent.
FedScale: Benchmarking model and system performance of federated learning.
Bingsheng He, and Dawn Song. Model-contrastive federated learning.
Federated optimization in heterogeneous networks.
On the convergence of FedAvg on non-IID data.
Gradient episodic memory for continual learning.
No fear of heterogeneity: Classifier calibration for federated learning with non-IID data.
Wagenmakers. A tutorial on Fisher information.
A practical Bayesian framework for backpropagation networks.
Towards deep learning models resistant to adversarial attacks.
Communication-efficient learning of deep networks from decentralized data.
Learning differentially private recurrent language models.
Continual lifelong learning with neural networks: A review.
FetchSGD: Communication-efficient federated learning with sketching.
Liron Mor-Yosef, and Itai Zeitak. Overcoming forgetting in federated learning on non-IID data.
Differential privacy for data and model publishing of medical data.
The EU General Data Protection Regulation (GDPR).
Federated learning with matched averaging.
Tackling the objective inconsistency problem in heterogeneous federated optimization.
Group normalization.
Federated accelerated stochastic gradient descent.
Bayesian nonparametric federated learning of neural networks.
Fixup initialization: Residual learning without normalization.
Lookahead optimizer: k steps forward, 1 step back.
Federated learning with non-IID data.
Deep leakage from gradients.