title: Federated Class-Incremental Learning
authors: Dong, Jiahua; Wang, Lixu; Fang, Zhen; Sun, Gan; Xu, Shichao; Wang, Xiao; Zhu, Qi (* equal contributions, ordered alphabetically; † corresponding authors)
date: 2022-03-22

Federated learning (FL) has attracted growing attention via data-private collaborative training on decentralized clients. However, most existing methods unrealistically assume that the object classes of the overall framework are fixed over time. This makes the global model suffer from significant catastrophic forgetting on old classes in real-world scenarios, where local clients often collect new classes continuously and have very limited storage memory for old classes. Moreover, new clients with unseen new classes may participate in the FL training, further aggravating the catastrophic forgetting of the global model. To address these challenges, we develop a novel Global-Local Forgetting Compensation (GLFC) model to learn a global class-incremental model that alleviates catastrophic forgetting from both local and global perspectives. Specifically, to address local forgetting caused by class imbalance at the local clients, we design a class-aware gradient compensation loss and a class-semantic relation distillation loss to balance the forgetting of old classes and distill consistent inter-class relations across tasks. To tackle the global forgetting brought by the non-i.i.d. class imbalance across clients, we propose a proxy server that selects the best old global model to assist the local relation distillation. Moreover, a prototype gradient-based communication mechanism is developed to protect privacy. Our model outperforms state-of-the-art methods by 4.4%-15.1% in terms of average accuracy on representative benchmark datasets.

Federated learning (FL) [4, 18, 42, 46] enables multiple local clients to collaboratively learn a global model while providing secure privacy protection for local clients. It successfully addresses the data-island challenge without completely compromising clients' privacy [12, 22]. Recently, it has attracted significant interest in academia and achieved remarkable success in various industrial applications, e.g., autonomous driving [39], wearable devices [33], medical diagnosis [10, 52] and mobile phones [36]. Generally, most existing FL methods [16, 42, 46, 52] are modeled in a static application scenario, where the data classes of the overall FL framework are fixed and known in advance. However, real-world applications are often dynamic, where local clients receive the data of new classes in an online manner. To handle such a setting, existing FL methods typically require storing all training data of old classes at the local clients' side so that a global model can be obtained via FL; however, the high storage and computation overhead may render FL unrealistic when new classes arrive dynamically [33, 36, 47, 52]. If these methods [42, 52] are required to learn new classes continuously with very limited storage memory, they may suffer from significant performance degradation (i.e., catastrophic forgetting [20, 37, 40]) on old classes. Moreover, in real-world scenarios, new local clients that collect the data of new classes in a streaming manner may want to participate in the FL training, which could further exacerbate the catastrophic forgetting on old classes in the global model training.
To address these practical scenarios, we consider a challenging FL problem named Federated Class-Incremental Learning (FCIL) in this work. In the FCIL setting, each local client collects training data continuously with its own preference, while new clients with unseen new classes can join the FL training at any time. More specifically, the data distributions of the collected classes across the current and newly-added clients are non-independent and identically distributed (non-i.i.d.). FCIL requires these local clients to collaboratively train a global model to learn new classes continuously, under the constraints of privacy preservation and limited memory storage [37, 49]. To better comprehend the FCIL problem, we use COVID-19 diagnosis among different hospitals as a possible example [6]. Imagine that before the pandemic, there could be hundreds of hospitals working collaboratively to train a global infectious disease diagnosis model via FL. Due to the sudden emergence of COVID-19, these hospitals would collect a large amount of new data related to COVID-19 and add it into the FL training as new classes. Moreover, new hospitals whose main focus is not infectious diseases may join the fight against COVID-19, where they have little data of the old infectious diseases, and all hospitals should learn to diagnose both the old diseases and the new COVID-19 variants. In such scenarios, most existing FL methods will likely suffer from catastrophic forgetting on old disease diagnosis under the sudden emergence of new COVID-19 variant data.

An intuitive way to handle new classes (e.g., learning new COVID-19 variants) continuously in the FCIL setting is to simply integrate FL [4, 32, 42] and class-incremental learning (CIL) [17, 37, 50] together. However, such a strategy requires the central server to know when and where the data of new classes arrives (privacy-sensitive information), which violates the requirement of privacy preservation in FL. In addition, although local clients can utilize conventional CIL [17, 37] to address their local catastrophic forgetting, the non-i.i.d. class imbalance across clients may still cause heterogeneous forgetting on different clients, and this simple integration strategy could further exacerbate local catastrophic forgetting due to the heterogeneous global catastrophic forgetting on old classes across clients.

To tackle these challenges in FCIL, we propose a novel Global-Local Forgetting Compensation (GLFC) model in this paper, which effectively addresses the local catastrophic forgetting occurring on local clients and the global catastrophic forgetting across clients. Specifically, we design a class-aware gradient compensation loss to alleviate the local forgetting brought by the class imbalance at the local clients via balancing the forgetting of different old classes, and propose a class-semantic relation distillation loss to distill consistent inter-class relations across different incremental tasks. To overcome the global catastrophic forgetting caused by the non-i.i.d. class imbalance across clients, we design a proxy server to select the best old global model for the class-semantic relation distillation at the local side. To preserve privacy, the proxy server collects perturbed prototype samples of new classes from local clients via a prototype gradient-based communication mechanism, and then utilizes them to monitor the performance of the global model for selecting the best one.
Our model achieves a 4.4%∼15.1% improvement in terms of average accuracy on several benchmark datasets, when compared with a variety of baseline methods. The major contributions of this paper are summarized as follows:
• We address a practical FL problem, namely Federated Class-Incremental Learning (FCIL), in which the main challenges are to alleviate the catastrophic forgetting on old classes brought by the class imbalance at the local clients and the non-i.i.d. class imbalance across clients.
• We develop a novel Global-Local Forgetting Compensation (GLFC) model to tackle the FCIL problem, alleviating both local and global catastrophic forgetting. To our best knowledge, this is the first attempt to learn a global class-incremental model in the FL setting.
• We design a class-aware gradient compensation loss and a class-semantic relation distillation loss to address local forgetting, by balancing the forgetting of old classes and capturing consistent inter-class relations across tasks.
• We design a proxy server to select the best old model for class-semantic relation distillation on the local clients to compensate for global forgetting, and we protect the communication between this proxy server and the clients with a prototype gradient-based mechanism for privacy.

Federated Learning (FL) is a decentralized learning framework that can train a global model by aggregating local model parameters [24, 44, 51, 53]. To collaboratively learn a global model, [32] proposes to aggregate local models via a weight-based mechanism. [38] introduces a proximal term to help local models approximate the global one. [42] focuses on minimizing the model discrepancies across clients via an improved EWC. Moreover, [4] designs a layer-wise aggregation strategy to reduce computation overhead [55, 56]. [18] sacrifices local optimality for rapid convergence, while [12, 22] aim to improve the performance of local models. [34] integrates unsupervised domain adaptation [5, 9, 27, 28, 57] into the federated learning framework [14, 31]. However, these existing FL methods cannot effectively learn new classes continuously, due to the limited memory available to store old classes at the local clients' side.

Class-Incremental Learning (CIL) aims to learn new classes continuously while tackling forgetting on old classes [1, 19, 54]. Without access to the data of old classes, [20] designs new regularizers for balancing the biased model optimization caused by new classes, and [25, 41] use knowledge distillation to surmount catastrophic forgetting. [40, 48] introduce generative adversarial networks to produce synthetic data of old classes. As claimed in [11, 30, 37, 49], the class imbalance between old and new classes is a crucial challenge for exemplar replay methods. [29, 50] design a self-adaptive network to balance biased predictions. [17] uses the causal effect in knowledge distillation to rectify class imbalance. [43] introduces a geodesic path into traditional knowledge distillation. [1] combines task-wise knowledge distillation and a separated softmax for bias compensation. These CIL methods, however, cannot be applied to tackle our FCIL problem, due to their strong assumptions on when and where the data of new classes arrives.

In standard class-incremental learning [37, 41, 43], there is a sequence of streaming tasks T = {T^t}_{t=1}^T, where T denotes the number of tasks and the t-th task T^t = {(x^t_i, y^t_i)}_{i=1}^{N^t} consists of N^t pairs of samples x^t_i and their one-hot encoded labels y^t_i ∈ Y^t.
Y^t represents the label space of the t-th task, including C^t new classes that are different from the old classes ∪_{j=1}^{t-1} Y^j observed in the previous t − 1 tasks. Inspired by [30, 37, 49], we construct an exemplar memory M that stores |M|/C^p exemplars for each old class in the t-th incremental task, where C^p denotes the number of old classes and the memory budget |M| is far smaller than the amount of training data seen for the old classes.

We then extend conventional class-incremental learning to Federated Class-Incremental Learning (FCIL). Given K local clients {S_l}_{l=1}^K and a global central server S_G, at the r-th global round (r = 1, · · · , R), a set of local clients is randomly selected to participate in the gradient aggregation. Specifically, once the l-th client S_l is selected at a global round of the t-th incremental task, it receives the latest global model Θ^{r,t} and trains Θ^{r,t} on its privately accessible data of the t-th incremental task, T^t_l ∪ M_l, where T^t_l ∼ P_l ⊂ T^t is the training data of new classes, M_l denotes its exemplar memory, and P_l is the class distribution of the l-th client. The distributions {P_l}_{l=1}^K are non-independent and identically distributed (i.e., non-i.i.d.) across clients. At the t-th incremental task, the label space Y^t_l ⊂ Y^t of the l-th local client is a subset of Y^t = ∪_{l=1}^K Y^t_l, and it includes C^t_l new classes in addition to the C^p_l = Σ_{j=1}^{t-1} C^j_l old classes from the previously observed label space ∪_{j=1}^{t-1} Y^j_l. After loading Θ^{r,t} and conducting local training at the t-th incremental task, S_l obtains a locally updated model Θ^{r,t}_l. All locally updated models of the selected clients are then uploaded to the global server S_G to be aggregated into the global model Θ^{r+1,t} of the next round. The global server S_G then distributes the parameters Θ^{r+1,t} to local clients for the next global round.

In the FCIL setting, we divide the local clients {S_l}_{l=1}^K into three categories (i.e., {S_l}_{l=1}^K = S_o ∪ S_b ∪ S_n) in each incremental task. Specifically, S_o consists of K_o local clients that do not receive new data of the current task but keep the exemplar memory stored from previously learned tasks; S_b includes K_b clients collecting the new data of the current task as well as the exemplar memory of previous tasks; and S_n consists of K_n newly-added clients that receive the new data of the current task but have no exemplar memory of old classes. These categories change dynamically as incremental tasks arrive. Namely, we randomly determine {S_o, S_b, S_n} at each global round, and clients in S_n can be added irregularly at any global round in FCIL, which causes K = K_o + K_b + K_n to grow gradually over the streaming tasks. Moreover, we have no prior knowledge about the number of streaming tasks T, the data distributions {P_l}_{l=1}^K, or when new classes are collected and new local clients are added. The goal of FCIL is to effectively train a global model Θ^{R,T} that learns new classes consecutively while alleviating the catastrophic forgetting on old classes under the requirement of privacy preservation, by communicating only local model parameters with the global central server S_G.

The overview of our model is depicted in Figure 1. To address the FCIL requirements, our model solves local forgetting via a class-aware gradient compensation loss and a class-semantic relation distillation loss (Section 4.1), while tackling global forgetting via a proxy server that selects the best old model for local clients (Section 4.2).
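Before turning to the method, the following minimal Python sketch mirrors the bookkeeping of the FCIL setting just described: each client carries new-task data and an exemplar memory, a random subset of clients is selected every global round, and the server aggregates the locally updated parameters. All names here (Client, local_train, aggregate, fcil_round) are hypothetical, the model is reduced to a dictionary of parameters, and FedAvg-style averaging is assumed for the aggregation step.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Client:
    """Hypothetical container for one local client S_l."""
    cid: int
    new_task_data: list = field(default_factory=list)    # T_l^t (new classes, non-i.i.d. subset)
    exemplar_memory: list = field(default_factory=list)  # M_l (old-class exemplars)

def local_train(global_params: Dict[str, float], client: Client) -> Dict[str, float]:
    # Placeholder for optimizing Eq. (6) on T_l^t ∪ M_l; here it only perturbs parameters.
    return {k: v + 0.01 * random.uniform(-1, 1) for k, v in global_params.items()}

def aggregate(local_params: List[Dict[str, float]]) -> Dict[str, float]:
    # FedAvg-style parameter averaging performed by the central server S_G (assumed).
    keys = local_params[0].keys()
    return {k: sum(p[k] for p in local_params) / len(local_params) for k in keys}

def fcil_round(global_params, clients, num_selected=10):
    # One global round r: S_G randomly selects clients, they train locally, S_G aggregates.
    selected = random.sample(clients, min(num_selected, len(clients)))
    return aggregate([local_train(global_params, c) for c in selected])

# Between tasks, newly added clients S_n (with empty exemplar memory) simply join `clients`,
# so K = K_o + K_b + K_n grows as streaming tasks arrive.
clients = [Client(cid=i) for i in range(30)]
theta = {"w": 0.0, "b": 0.0}
for r in range(10):           # global rounds within one incremental task
    theta = fcil_round(theta, clients)
```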
At the t-th incremental task, given the l-th local client S_l ∈ S_b with the training data T^t_l of new classes and exemplar memory M_l, the classification loss L_CE for a mini-batch {X^t_lb, Y^t_lb} ⊂ T^t_l ∪ M_l is defined as L_CE = (1/b) Σ_{i=1}^{b} D_CE(P^t_l(x^t_{li}, Θ^{r,t}), y^t_{li}), where b is the batch size and Θ^{r,t} is the classification model at the r-th global round for the t-th task, which is transmitted from the global server to local clients. P^t_l(x^t_{li}, Θ^{r,t}) ∈ R^{C^p + C^t} denotes the sigmoid probability predicted via Θ^{r,t}, and D_CE(·, ·) is the binary cross-entropy loss. As aforementioned, the class imbalance between old and new classes (T^t_l and M_l) at the local side forces the local training to suffer from significant performance degradation (i.e., local catastrophic forgetting) on old classes. To prevent local forgetting, as shown in Figure 1, we develop a class-aware gradient compensation loss and a class-semantic relation distillation loss for local clients, which correct imbalanced gradient propagation and ensure inter-class semantic consistency across incremental tasks.

• Class-Aware Gradient Compensation Loss: After S_G distributes Θ^{r,t} to local clients, the class-imbalanced distributions at the local side cause imbalanced gradient back-propagation in the last output layer of Θ^{r,t}. This forces the update of the local model Θ^{r,t}_l to exhibit different learning paces within new classes and different forgetting paces within old classes after local training. This phenomenon heavily worsens the local forgetting on old classes as new streaming data continuously becomes part of the old classes. As a result, we design a class-aware gradient compensation loss L_GC that normalizes the learning paces of new classes and the forgetting paces of old classes by re-weighting their gradient propagation. Specifically, inspired by [45, 46], for a single sample (x^t_{li}, y^t_{li}) with ground-truth label y^t_{li} (whose one-hot vector is y^t_{li}), we obtain a gradient measurement G^t_{li} with respect to the y^t_{li}-th neuron N^t_{y^t_{li}} of the last output layer in Θ^{r,t}_l, i.e., the gradient of D_CE(P^t_l(x^t_{li}, Θ^{r,t}_l), y^t_{li}) with respect to N^t_{y^t_{li}}. To normalize the learning paces of new classes and the forgetting paces of old classes, we perform separate gradient normalization for old and new classes: Ḡ^n and Ḡ^o denote the gradient means over the new-class and old-class samples in the mini-batch, respectively, computed with the indicator function I(·), which equals 1 if the subscript condition is true and 0 otherwise. The re-weighted L_CE loss, denoted L_GC (Eq. (4)), then re-weights each sample's binary cross-entropy term according to its corresponding gradient mean Ḡ_i, where Ḡ_i is set to Ḡ^n when the i-th sample x^t_{li} belongs to new classes and to Ḡ^o otherwise.
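A small PyTorch sketch of the class-aware gradient compensation idea follows, under stated assumptions: for a sigmoid/binary cross-entropy head, the gradient of the loss with respect to the ground-truth output neuron has magnitude 1 − sigmoid(z_y), which serves as the gradient measurement G^t_{li}; the group means Ḡ^n and Ḡ^o are taken over new- and old-class samples in the mini-batch; and since the exact re-weighting form of Eq. (4) is not recoverable from the text here, the sketch simply divides each sample's loss by the gradient mean of its class group as one plausible normalization.

```python
import torch
import torch.nn.functional as F

def class_aware_gc_loss(logits, targets_onehot, is_new_class):
    """
    logits:         (b, C_p + C_t) raw outputs of the local model
    targets_onehot: (b, C_p + C_t) float one-hot labels
    is_new_class:   (b,) bool mask, True if the sample belongs to a new class of task t
    The re-weighting form is an assumption: each sample's BCE term is divided by the
    mean gradient magnitude of its class group (new vs. old).
    """
    # Per-sample binary cross-entropy over all seen classes (sigmoid outputs, as in L_CE).
    per_sample_ce = F.binary_cross_entropy_with_logits(
        logits, targets_onehot, reduction="none").sum(dim=1)

    # Gradient measurement w.r.t. the ground-truth output neuron:
    # for sigmoid + BCE, dL/dz_y = sigmoid(z_y) - 1, so |G_i| = 1 - sigmoid(z_y).
    probs = torch.sigmoid(logits)
    p_true = (probs * targets_onehot).sum(dim=1)
    g = (1.0 - p_true).detach().abs() + 1e-8

    # Separate gradient means for new and old classes (G_bar^n, G_bar^o).
    g_new = g[is_new_class].mean() if is_new_class.any() else g.new_tensor(1.0)
    g_old = g[~is_new_class].mean() if (~is_new_class).any() else g.new_tensor(1.0)
    g_bar = torch.where(is_new_class, g_new, g_old)

    # Normalize the learning/forgetting paces by re-weighting each sample's loss.
    return (per_sample_ce / g_bar).mean()

# Usage (hypothetical): loss = class_aware_gc_loss(model(x), y_onehot.float(), new_mask)
```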
• Class-Semantic Relation Distillation Loss: During the training of the local model Θ^{r,t}_l, which is initialized as the current global model Θ^{r,t}, the probability predicted by Θ^{r,t}_l indicates the inter-class semantic similarity relations. To ensure inter-class semantic consistency across different incremental tasks, we design a class-semantic relation distillation loss L_RD that considers the underlying relations among old and new classes. As depicted in Figure 1, we forward a mini-batch {X^t_lb, Y^t_lb} into the stored old model Θ^{t−1}_l and the current local model Θ^{r,t}_l, respectively, and obtain the corresponding predicted probabilities of old and new classes. These probabilities reflect the inter-class relations between old and new classes. Different from existing knowledge distillation strategies [1, 3, 8, 17, 26] that only ensure the semantic consistency of old classes between Θ^{t−1}_l and Θ^{r,t}_l, we consider inter-class relations between old and new classes simultaneously via optimizing L_RD. Namely, we utilize a variant of the one-hot encoding labels, obtained by smoothing the one-hot labels, which effectively indicates the inter-class semantic similarity relations for both old and new classes. L_RD (Eq. (5)) is then formulated as the Kullback-Leibler divergence D_KL(·||·) between these smoothed labels and the probabilities predicted by the current local model Θ^{r,t}_l.

Overall, the optimization objective for the l-th local client is L = λ_1 L_GC + λ_2 L_RD (Eq. (6)), where λ_1, λ_2 are hyper-parameters. We update the local model Θ^{r,t}_l by optimizing Eq. (6), and then aggregate all local models in the global server S_G to obtain the global model Θ^{r+1,t} of the next round. When t = 1, there is no old model Θ^{t−1}_l to perform L_RD, and we set λ_1 = 1.0, λ_2 = 0; otherwise, λ_1 = 0.5, λ_2 = 0.5. Note that local clients in S_o and S_n have the same objective (i.e., Eq. (6)) as the clients in S_b, except for the definition of Ḡ_i in Eq. (4): Ḡ_i is always set to Ḡ^o for S_o and to Ḡ^n for S_n.

• Task Transition Detection: When optimizing Eq. (6), it is essential for local clients to know when new classes arrive, so that they can update the exemplar memory M_l and store the old classification model Θ^{t−1}_l used for L_RD. However, in FCIL, we have no prior knowledge about when local clients receive the data of new classes. To tackle this issue, a trivial solution is to identify whether the labels of the training data have been seen before. However, this cannot determine whether newly received labels are from new classes or from old classes observed by other local clients, due to the non-i.i.d. class distributions. Another intuitive solution is to use performance degradation as the signal of collecting new classes. This solution is infeasible in FCIL, since the random selection of {S_o, S_b, S_n} and their non-i.i.d. class distributions can cause sharp performance degradation even without receiving new classes. To this end, we propose a task transition detection mechanism to accurately identify when local clients receive new classes. Specifically, at the r-th global round, every client computes the average entropy H^{r,t}_l of the received global model Θ^{r,t} on its current training data T^t_l, i.e., H^{r,t}_l = (1/|T^t_l|) Σ_{x ∈ T^t_l} I(P^t_l(x, Θ^{r,t})) (Eq. (7)), where I(p) = −Σ_i p_i log p_i is the entropy function. When H^{r,t}_l encounters a sudden rise satisfying H^{r,t}_l − H^{r−1,t}_l ≥ r_h, we conclude that the local clients are receiving new classes and update t by t ← t + 1. They can then update the memory M_l and store the old model Θ^{t−1}_l. We empirically set r_h = 1.2.
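The task transition detection rule is simple enough to sketch directly. The helper below computes the average prediction entropy of the received global model on the client's current data and flags a new task when the round-over-round rise exceeds r_h = 1.2; softmax probabilities are used here for simplicity, whereas the paper's classifier outputs sigmoid probabilities, and the function names are ours.

```python
import torch

@torch.no_grad()
def average_entropy(model, loader, device="cpu"):
    """Average prediction entropy H_l^{r,t} of the received global model on T_l^t (Eq. (7))."""
    model.eval()
    total, count = 0.0, 0
    for x, _ in loader:
        probs = torch.softmax(model(x.to(device)), dim=1)
        entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=1)
        total += entropy.sum().item()
        count += x.size(0)
    return total / max(count, 1)

def detect_task_transition(h_curr, h_prev, r_h=1.2):
    """Flag a new task when the entropy rises sharply: H^{r,t} - H^{r-1,t} >= r_h."""
    return (h_curr - h_prev) >= r_h

# Usage (hypothetical): if detect_task_transition(h_this_round, h_last_round): t += 1
```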
Although Eq. (6) can tackle the local catastrophic forgetting brought by the class imbalance at the local side, it cannot address heterogeneous forgetting from other local clients (i.e., global catastrophic forgetting). In other words, the non-i.i.d. class-imbalanced distributions across local clients result in global catastrophic forgetting on old classes, which further worsens the local catastrophic forgetting. Thus, it is necessary to address the heterogeneous forgetting across clients from a global perspective. As aforementioned, the proposed class-semantic relation distillation loss L_RD in Eq. (5) requires a stored old classification model Θ^{t−1}_l of previous tasks to distill inter-class relations. A better Θ^{t−1}_l can globally increase the distillation gain from previous tasks, strengthening the memory of old classes in a global view. As a result, the selection of Θ^{t−1}_l plays an important role in compensating global catastrophic forgetting and should be considered from a global perspective. However, in FCIL, it is difficult to select the best Θ^{t−1}_l due to privacy protection.

An intuitive solution is for every client to store its own best old model Θ^{t−1}_l for each task, selected during the (t − 1)-th task using its training data T^{t−1}_l. Unfortunately, this solution considers the selection of Θ^{t−1}_l only from a local perspective and cannot guarantee that the selected Θ^{t−1}_l has the best memory of all old classes, since each local client observes only a subset of the old classes (non-i.i.d.). To this end, we employ a proxy server S_P to select the best Θ^{t−1} for all clients from a global perspective, as depicted in Figure 1. Specifically, when local clients have identified new classes (i.e., T^t_l) at the beginning of the t-th task via task transition detection, they transmit perturbed prototype samples of the new classes to S_P via a prototype gradient-based communication mechanism. After receiving these gradients, S_P reconstructs the perturbed prototype samples and utilizes them to monitor the performance of the global model Θ^{r,t} (received from S_G) until the best one is found. When stepping into the next task (t + 1), S_P distributes this best Θ^{r,t} to local clients, and local clients regard it as the best old model to perform L_RD.

• Prototype Gradient-Based Communication: Given the l-th local client S_l ∈ S_b ∪ S_n that receives the training data T^t_l of new classes for the t-th task, S_l identifies the new classes via task transition detection. Then S_l selects only one representative prototype sample x^t_{lc*} from T^t_l for each new class c = C^p_l + 1, · · · , C^p_l + C^t_l, where the feature of x^t_{lc*} is closest to the mean embedding of all samples belonging to the c-th class in the latent feature space. We then feed these prototype samples and their labels {x^t_{lc*}, y^t_{lc*}}_{c=C^p_l+1}^{C^p_l+C^t_l} ⊂ T^t_l into an L-layer gradient encoding network Γ = {W_i}_{i=1}^L to compute the gradient ∇Γ_lc, where Γ is much shallower than Θ^{r,t}_l for communication efficiency and W_i denotes the parameters of the i-th layer. The i-th element ∇_{W_i}Γ_lc of ∇Γ_lc is defined as ∇_{W_i}Γ_lc = ∇_{W_i} D_CE(P^t_l(x^t_{lc*}, Γ), y^t_{lc*}), where P^t_l(x^t_{lc*}, Γ) is the probability predicted via Γ. Then S_l transmits the C^t_l gradients {∇Γ_lc}_{c=C^p_l+1}^{C^p_l+C^t_l} to S_P for the reconstruction of prototype samples. S_P randomly shuffles all gradients received from the clients selected in this global round to construct a gradient pool ∇Γ^t, and we assume there are N^t_g gradients in this pool. This shuffling operation prevents S_P from tracing individual clients through distinctive gradient distributions. For the n-th element ∇Γ^t_n of ∇Γ^t, we can obtain its corresponding ground-truth label y^t_n (with one-hot encoding y^t_n) by observing the gradient signs of the last layer in ∇Γ^t_n (as proposed in [46, 58]). Given a dummy sample x̃^t_n initialized from a standard Gaussian N(0, 1), we forward each pair {x̃^t_n, y^t_n} through a network Γ that is the same as the gradient encoding network used by local clients, in order to recover a prototype sample for each new class. The reconstruction loss L_RT and the update of x̃^t_n are expressed as L_RT = ||∇_Γ D_CE(P^t(x̃^t_n, Γ), y^t_n) − ∇Γ^t_n||^2 (Eq. (8)) and x̃^t_n ← x̃^t_n − η ∂L_RT/∂x̃^t_n (Eq. (9)), where P^t(x̃^t_n, Γ) is the probability predicted via Γ and η denotes the learning rate used to update x̃^t_n.
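Below is a hedged sketch of the gradient-based reconstruction on the proxy server, in the spirit of deep leakage from gradients [58]: a dummy input initialized from N(0, 1) is optimized with L-BFGS so that the gradients it induces on the shared encoding network match the received gradients. The encoder interface, the use of standard (rather than binary) cross-entropy, and the assumption that the label has already been recovered from the last-layer gradient signs are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reconstruct_prototype(encoder: nn.Module, target_grads, label, shape,
                          steps=200, lr=1.0):
    """
    encoder:      network Gamma shared between clients and the proxy server
    target_grads: list of gradient tensors, ordered like encoder.parameters()
    label:        class index recovered from the last-layer gradient signs
    shape:        input shape, e.g. (3, 32, 32); steps/lr follow the paper's settings
    """
    x_dummy = torch.randn(1, *shape, requires_grad=True)   # initialized from N(0, 1)
    y = torch.tensor([label])
    optimizer = torch.optim.LBFGS([x_dummy], lr=lr)

    def closure():
        optimizer.zero_grad()
        encoder.zero_grad()
        loss = F.cross_entropy(encoder(x_dummy), y)
        dummy_grads = torch.autograd.grad(loss, encoder.parameters(), create_graph=True)
        # Reconstruction loss L_RT: squared distance between dummy and received gradients.
        l_rt = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, target_grads))
        l_rt.backward()
        return l_rt

    for _ in range(steps):
        optimizer.step(closure)
    return x_dummy.detach()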
• Selection of the Best Old Model: S_P only receives gradients from local clients at the first round of the t-th task, when they detect new classes. S_P then reconstructs the N^t_g prototype samples of new classes and their labels (i.e., {x̃^t_n, y^t_n}_{n=1}^{N^t_g}) by optimizing Eq. (9). At the t-th task, S_P forwards these reconstructed samples into the global model Θ^{r,t} (received from S_G) and selects the best Θ^t by evaluating which global model achieves the best accuracy on them, until receiving gradients of new classes from the next task. At every global round starting from the second task, S_P distributes the best models of the last task and the current task (i.e., Θ^{t−1} and Θ^t) to all selected clients. If the selected clients detect new classes (i.e., data from T^{t+1}_l) while at the t-th task, they set Θ^t as the old model Θ^{t−1}_l; otherwise, Θ^{t−1} is set as Θ^{t−1}_l to perform L_RD.

• Perturbed Prototype Sample Construction: Although the network Γ is privately accessible only to S_P and the local clients, malicious attackers may steal Γ and these gradients to reconstruct the raw prototype samples {x^t_{lc*}, y^t_{lc*}} ∈ T^t_l of the l-th local client. To achieve privacy preservation, we propose to add perturbations to these prototype samples, so that attackers can obtain little useful information from the perturbed prototype samples even if they manage to reconstruct them. To be specific, given a prototype sample {x^t_{lc*}, y^t_{lc*}} ∈ T^t_l, we forward it into the local model Θ^{r,t}_l that has been trained via Eq. (6) and apply back-propagation to update this sample. To produce the perturbed prototype sample, we introduce Gaussian noise into the latent feature of the prototype sample and then update x^t_{lc*} via Eq. (11), where Φ(x^t_{lc*}) denotes the latent feature of x^t_{lc*}, and P^t_l(Φ(x^t_{lc*}) + γN(0, σ²), Θ^{r,t}_l) is the probability predicted via Θ^{r,t}_l when Gaussian noise N(0, σ²) is added to Φ(x^t_{lc*}). σ² represents the variance of the features of all samples belonging to class y^t_{lc*}, and we empirically set γ = 0.1 to control the effect of the Gaussian noise. Some reconstructed prototype samples are visualized in Figure 2.
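A minimal sketch of the perturbed prototype construction above, assuming the local model is split into a feature extractor Φ and a classifier head: Gaussian noise scaled by γ and the class-wise feature variance σ² is injected into the latent feature, and the input is updated by back-propagation so that it remains classifiable under that noise. The particular loss (cross-entropy) and the gradient-descent direction are assumptions; γ = 0.1, 100 iterations, and the SGD learning rate of 0.1 follow the paper.

```python
import torch
import torch.nn.functional as F

def perturb_prototype(feature_extractor, classifier, x_proto, y_proto,
                      sigma2, gamma=0.1, steps=100, lr=0.1):
    """
    x_proto: (1, C, H, W) prototype sample; y_proto: class-index tensor
    sigma2:  per-dimension feature variance of samples from the same class (broadcastable)
    The loss choice and the split into feature_extractor/classifier are assumptions.
    """
    x = x_proto.clone().detach().requires_grad_(True)
    y = y_proto.view(1)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        feat = feature_extractor(x)
        noise = gamma * torch.randn_like(feat) * sigma2.sqrt()  # gamma * N(0, sigma^2)
        loss = F.cross_entropy(classifier(feat + noise), y)     # keep x classifiable under noise
        loss.backward()
        opt.step()
    return x.detach()
```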
Starting from the first incremental task, all clients are required to compute the average entropy of their private training data via Eq. (7) at the beginning of each global round, and they follow iCaRL [37] to update their exemplar memory M_l. For each global training round, the central server S_G randomly selects a set of local clients to conduct local training. After that, when the selected clients identify new classes via the task transition detection strategy, they construct perturbed prototype samples of these new classes and share the corresponding gradients with the proxy server S_P via the prototype gradient-based communication mechanism. After receiving these gradients, S_P reconstructs the prototype samples and utilizes them to select the best global model Θ^t until the next time it collects gradients. Starting from the second task (t = 2), S_P distributes the best models of the last and current task (i.e., Θ^{t−1} and Θ^t) to the selected clients. The l-th client then uses Θ^{t−1} as its Θ^{t−1}_l to update the current local model Θ^{r,t}_l by optimizing Eq. (6) when it does not detect new classes via task transition detection; otherwise, it uses Θ^t to train the current local model Θ^{r,t}_l. Finally, S_G aggregates the updated local models Θ^{r,t}_l to obtain the global model Θ^{r+1,t} of the next round. The optimization pipeline is provided in the supplementary material.

We use three datasets in our experiments: CIFAR-100 [13, 21], ImageNet-Subset [7], and TinyImageNet [35]. For a fair comparison with baseline class-incremental learning methods [1, 11, 17, 37, 43, 49] in the FCIL setting, we follow the same protocols proposed by [37, 49] to set the incremental tasks, utilize the identical class order generated from iCaRL [37], and employ the same backbone (i.e., ResNet-18 [15]) as the classification model [2]. An SGD optimizer with a learning rate of 2.0 is used to train all models. The exemplar memory M_l of each client is set to 2,000 for all streaming tasks. A shallow LeNet [23] with only 4 layers is used as the network Γ. We employ an SGD optimizer with a learning rate of 0.1 to construct perturbed samples for local clients, and an L-BFGS optimizer with a learning rate of 1.0 to reconstruct prototype samples on the proxy server. We initialize the number of local clients to 30 in the first incremental task and introduce 10 additional new local clients as learning tasks arrive consecutively. At each global round, we randomly select 10 clients to conduct 20-epoch local training. Each client randomly receives 60% of the classes from the label space of its current task. We run our experiments 3 times with 3 random seeds (2021, 2022, 2023) and report the averaged results. Please refer to the supplementary material for more details.

This section presents comparison experiments to illustrate the effectiveness of our GLFC model, as shown in Tables 1, 2 and 3, which also report the improvement of our model over each comparison method. We observe that our model outperforms existing class-incremental methods [1, 11, 17, 37, 43, 49] in the FCIL setting by a margin of 4.4%∼15.1% in terms of average accuracy. This validates that our model enables local clients to collaboratively train a global class-incremental model. Moreover, our model shows stable performance improvement over the other methods for all incremental tasks, which verifies its effectiveness in addressing forgetting in FCIL.

As shown in Tables 1, 2 and 3, we investigate the effects of each module in our model via ablation studies. Ours-w/oCGC, Ours-w/oCRD and Ours-w/oPRS denote the performance of our model without using L_GC, L_RD and the proxy server S_P, respectively, where Ours-w/oCGC and Ours-w/oCRD utilize L_CE and the knowledge distillation proposed in iCaRL [37] as replacements. Compared with Ours, the performance of Ours-w/oCGC, Ours-w/oCRD and Ours-w/oPRS degrades evidently, with drops of 1.1%∼10.1%. This verifies that all modules cooperate effectively and are essential to train a global class-incremental model. The proxy server is also essential for selecting the best old model by evaluating reconstructed samples (shown in Figure 2).

In this section, we conduct qualitative analysis of various incremental tasks (T = 5, 10, 20) on the benchmark datasets to validate the superior performance of GLFC, as shown in Figures 3 and 4. According to these curves, our model performs better than the other competing baseline methods [1, 11, 17, 37, 43, 49] for all incremental tasks, across settings with different numbers of tasks (T = 5, 10, 20). This demonstrates that the GLFC model enables multiple local clients to learn new classes in a streaming manner while addressing local and global forgetting. As shown in Table 4, we study the effects of different exemplar memory sizes on the performance of our GLFC model, by setting M_l to {500, 1000, 1500, 2000} on CIFAR-100 [21].
From the results in Table 4, we conclude that the larger the exemplar memory M_l is, the better the GLFC model performs when learning new classes in a streaming manner. This validates that storing more training data of old classes at the local side strengthens the memory ability of our proposed GLFC model for old classes. Moreover, the presented results also illustrate the effectiveness of our GLFC model in identifying new classes via the task transition detection mechanism and updating the exemplar memory.

In this paper, we propose a practical Federated Class-Incremental Learning (FCIL) problem and develop a novel Global-Local Forgetting Compensation (GLFC) model to address the local and global catastrophic forgetting in FCIL. Specifically, a class-aware gradient compensation loss and a class-semantic relation distillation loss are designed to locally address the catastrophic forgetting, by correcting the imbalanced gradient propagation and ensuring consistent inter-class relations across tasks. We also employ a proxy server to tackle global forgetting by selecting the best old model for preserving the memory of old classes. Extensive experiments on representative benchmark datasets demonstrate the effectiveness of our proposed GLFC model.

A.1. Datasets CIFAR-100 [21] is composed of 60,000 color images from 100 classes. Each class has 500 training and 100 evaluation samples of size 32 × 32. ImageNet-Subset [7] is a subset of ImageNet [7] and includes 100 classes sampled in the same way as [17]. We split each class into 500 training and 100 test samples of size 224 × 224. TinyImageNet [35] includes 100,000 samples for 200 classes, and each sample is downsized to 64 × 64. Each class has 500 training and 50 test samples.

For the federated learning setting, in the first task there are a total of 30 local clients, and for each global round we randomly select 10 clients to conduct 20-epoch local training. After local training, these clients share their updated models to participate in the global aggregation of this round. When the number of streaming tasks is T = 10, for CIFAR-100 and ImageNet-Subset, each task includes 10 new classes trained for 10 global rounds, and each task transition introduces 10 additional new clients. For TinyImageNet, each task includes 20 new classes for the same 10 global rounds, and each task transition also introduces 10 new clients. Therefore, in the last task of T = 10, the total number of local clients is 120, and we again randomly select 10 clients to perform the global aggregation. For the case of T = 20, each task has 5 classes with 10 global rounds of training for CIFAR-100 and ImageNet-Subset, and 10 classes for TinyImageNet; note that the number of newly introduced clients at each task transition is 5 in this case. We also conduct experiments with T = 5, where CIFAR-100 and ImageNet-Subset contain 20 classes per task while TinyImageNet has 40 classes per task. For all three datasets, each task then covers 20 global rounds, and 20 new clients join the framework at each task transition. To build the non-i.i.d. setting, every client owns only 60% of the classes of the label space in the current task, and these classes are randomly selected. During the task-transition global round, we assume 90% of the existing clients are from S_b, while the remaining 10% of clients are from S_o.
For fair comparisons with other class-incremental learning methods in the FCIL setting, we follow the same protocols proposed by [37, 49] to split classes into incremental tasks and utilize the identical class order generated from iCaRL [37]. Moreover, we use ResNet-18 as the classification model. As for the gradient encoding network, we use a shallow LeNet with only 4 layers. We use an SGD optimizer with an initial learning rate of 2.0 to train all classification models, and the learning rate is divided by 5, 25 and 125 when the accumulated local epochs of a task hit 100, 150 and 180, respectively. In addition, the optimizer for the back-propagation-based perturbation generation is SGD with a learning rate of 0.1, and the sample reconstruction performed on the proxy server uses L-BFGS with a learning rate of 1.0 to save memory. The batch size is 128, and the exemplar memory M_l of each client has a sample size of 2,000 across all streaming tasks. As for the optimization iterations, the prototype perturbation generation runs for 100 iterations, and the prototype sample reconstruction runs for 200 iterations per gradient. We run our experiments three times with three random seeds (2021, 2022, 2023) and report the average results in our comparison experiments.

This paper is the first exploration of the federated class-incremental learning (FCIL) problem, and there is no existing baseline method built on a similar setting. Therefore, for fair comparisons, we compare our GLFC model with several state-of-the-art class-incremental methods (i.e., iCaRL [37], BiC [49], PODNet [11], DDE [17], GeoDL [43] and SS-IL [1]) under the federated learning (FL) setting to validate its effectiveness. The top-1 accuracy metric is employed to evaluate the performance of all comparison methods and our proposed GLFC model.

Starting from the first incremental task, all clients are required to compute the average entropy of their private training data via Eq. (7) at the beginning of each global round, and they follow iCaRL [37] to update their exemplar memory M_l. For each global training round, the central server S_G randomly selects a set of local clients to conduct local training. After that, when the selected clients identify new classes via the task transition detection strategy, they construct perturbed prototype samples of these new classes and share the corresponding gradients with the proxy server S_P via the prototype gradient-based communication mechanism. After receiving these gradients, S_P reconstructs the prototype samples and utilizes them to select the best global model Θ^t until the next time it collects gradients. Starting from the second task (t = 2), S_P distributes the best models of the last and current task (i.e., Θ^{t−1} and Θ^t) to the selected clients. The l-th client then uses Θ^{t−1} as its Θ^{t−1}_l to update the current local model Θ^{r,t}_l by optimizing Eq. (6) when it does not detect new classes via task transition detection; otherwise, it uses Θ^t to train the current local model Θ^{r,t}_l. Finally, S_G aggregates the updated local models Θ^{r,t}_l to obtain the global model Θ^{r+1,t} of the next round. The detailed optimization pipeline is provided in Algorithm 1 (reproduced at the end of this document).

C. Experiments on TinyImageNet Dataset
C.1. As shown in Tables 5 and 6, we present comparison experiments between our model and other baseline class-incremental learning methods on the TinyImageNet dataset. The presented results show that our GLFC model significantly outperforms the other state-of-the-art comparison methods by 4.7%∼11.0% in terms of average accuracy. This illustrates the effectiveness of our model in addressing both local and global catastrophic forgetting in the FCIL setting. Moreover, the performance of our model is the best among all incremental tasks, with a large performance improvement for each incremental task. This phenomenon validates that the proposed proxy server is effective in addressing the global catastrophic forgetting brought by the non-i.i.d. class imbalance across clients via the prototype sample construction mechanism. Meanwhile, the proposed class-aware gradient compensation loss and class-semantic relation distillation loss guarantee that our model effectively alleviates local catastrophic forgetting at the local client side.

This subsection investigates the effectiveness of different variants of our model on the TinyImageNet dataset, as presented in Tables 5 and 6. Compared with Ours, Ours-w/oCGC degrades the performance by 2.7%∼2.8% in terms of average accuracy, which validates the effectiveness of the class-aware gradient compensation loss in compensating imbalanced gradient propagation. We observe that Ours performs better than Ours-w/oCRD by 10.1%∼10.2% in terms of average accuracy; the class-semantic relation distillation loss ensures inter-class semantic consistency across different incremental tasks to address local catastrophic forgetting. Moreover, the performance of Ours-w/oPRS is worse than Ours by 3.2%∼4.6% in terms of average accuracy. This performance degradation verifies that the global catastrophic forgetting brought by the non-i.i.d. class imbalance across clients can be effectively addressed via the proxy server. All proposed modules in our GLFC model cooperate well to address the FCIL problem. When any one of the proposed components is removed, as shown in Tables 5 and 6, Ours-w/oCGC, Ours-w/oCRD and Ours-w/oPRS suffer significant performance degradation.

As presented in Tables 7 and 8, in this subsection we introduce the qualitative analysis of various incremental tasks (T = 5, 10) on the TinyImageNet dataset to validate the effectiveness of the proposed GLFC model. From the results in Tables 7 and 8, we observe that the performance of our proposed model shows a large improvement (3.2%∼10.0% in terms of average accuracy) over the other state-of-the-art comparison methods for all incremental tasks. Even under settings with different numbers of tasks (T = 5, 10), our proposed GLFC model still achieves the best performance, which verifies that our model can effectively tackle both local and global catastrophic forgetting in the FCIL setting. Moreover, the significant performance improvement illustrates that our model enables multiple local clients to learn new classes consecutively, while addressing catastrophic forgetting on previously learned classes under the privacy-preservation and limited-memory constraints of local clients.

In this subsection, as shown in Table 9, we further conduct extensive experiments (T = 10) on the CIFAR-100 dataset to investigate the effects of different exemplar memory sizes on the performance of our proposed GLFC model, setting M_l to {500, 1000, 1500, 2000}.
From the presented results in Table 9, we observe that our model achieves better performance on all incremental tasks when local clients have larger memory storage for the exemplar samples of old classes. Moreover, storing more training data of old classes at the local side promotes memory replay on old classes, which further addresses catastrophic forgetting of old classes at the local clients' side. Besides, it validates that our proposed model can efficiently distinguish new classes via the task transition detection strategy and update the corresponding exemplar memory M_l at the local side. The updated exemplar memory plays an essential role in tackling local catastrophic forgetting on old classes.

This section discusses the limitations of our proposed model and the potential societal impact of this paper. This paper mainly focuses on addressing the Federated Class-Incremental Learning (FCIL) problem from an algorithmic perspective. In the future, it is necessary to develop mathematical theory for understanding the FCIL problem and the proposed GLFC model. A possible way to develop such theory for our model is to consider existing mathematical explanations of federated learning (FL) and class-incremental learning (CIL) simultaneously. However, to our knowledge, theoretical analysis of CIL is rare, so it may be difficult to propose brand-new theoretical support for the regular CIL problem. Instead, we will try to establish a theoretical analysis of the FCIL problem from an FL perspective in future work.

The FCIL problem discussed in our paper does not have any negative societal impact. On the contrary, we believe our work can solve real-world problems and bring about extensive benefits. In comparison with standard federated learning (FL), the proposed Federated Class-Incremental Learning (FCIL) is more practical, as we assume that the data of new classes as well as new clients will indiscriminately and continuously participate in FCIL. To solve the FCIL problem, our proposed GLFC model enables a global class-incremental learning model to be trained on decentralized devices without data sharing (i.e., without uploading decentralized data to a central server or exchanging data between participating devices). Compared to regular class-incremental learning methods that always need access to the training data, FCIL protects the private information of participants by keeping the local data where it is collected. We believe that the proposed GLFC model can bring benefits to a number of information-sensitive scenarios, such as medical diagnosis, smartphone applications, pharmaceutical companies, and high-technology enterprises. In summary, this work is the first attempt to learn a global class-incremental model in the FL setting, which expedites the development of FL-based applications with the requirement of privacy preservation.
References (title entries as recovered from the extraction):
Ss-il: Separated softmax for incremental learning
Two-stage label embedding via neural factorization machine for multi-label classification
Learning to ground visual objects for visual dialog
Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation
Denoised maximum classifier discrepancy for source-free unsupervised domain adaptation
Federated learning for predicting clinical outcomes in patients with covid-19
Imagenet: A large-scale hierarchical image database
Unbiased mean teacher for cross-domain object detection
Zhen Fang, and Zhengming Ding. Where and how to transfer: Knowledge aggregation-induced transferability perception for unsupervised domain adaptation
What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation
Podnet: Pooled outputs distillation for small-tasks incremental learning
Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach
Learning bounds for open-set learning
Fedboost: A communication-efficient algorithm for federated learning
Deep residual learning for image recognition
Federated adversarial debiasing for fair and transferable representations
Distilling causal effect of data in class-incremental learning
SCAFFOLD: Stochastic controlled averaging for federated learning
Split-and-bridge: Adaptable class incremental learning within a single neural network
Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks
Learning multiple layers of features from tiny images
Unsupervised model personalization while preserving privacy and scalability: An open problem
Gradient-based learning applied to document recognition. Proceedings of the IEEE
Fedbn: Federated learning on non-iid features via local batch normalization
Learning without forgetting
Learning deep kernels for non-parametric two-sample tests
Bapa-net: Boundary adaptation and prototype alignment for cross-domain semantic segmentation
Undoing the damage of label shift for cross-domain semantic segmentation
Adaptive aggregation networks for class-incremental learning
Mnemonics training: Multi-class incremental learning without forgetting
A novel attribute reconstruction attack in federated learning
Federated learning of deep networks using model averaging
Dïot: A crowdsourced self-learning approach for detecting compromised iot devices
Federated adversarial domain adaptation
Tiny imagenet visual recognition challenge. CS231N course
Federated learning for emoji prediction in a mobile keyboard
icarl: Incremental classifier and representation learning
On the convergence of federated optimization in heterogeneous networks
Distributed federated learning for ultra-reliable low-latency vehicular communications
Continual learning with deep generative replay
Incremental learning of object detectors without catastrophic forgetting
Liron Mor-Yosef, and Itai Zeitak. Overcoming forgetting in federated learning on non-iid data
On learning the geodesic path for incremental learning
Federated learning with matched averaging
Eavesdrop the composition proportion of training labels in federated learning
Addressing class imbalance in federated learning
Non-transferable learning: A new approach for model ownership verification and applicability authorization
Memory replay gans: Learning to generate new categories without forgetting
Large scale incremental learning
Der: Dynamically expandable representation for class incremental learning
Achieving linear speedup with partial worker participation in non-iid federated learning
FLOP: federated learning on medical datasets using partial networks
Bayesian nonparametric federated learning of neural networks
Knowledge aware emotion recognition in textual conversations via multi-task incremental transformer
DATA: differentiable architecture approximation with distribution guided sampling
You only search once: Single shot neural architecture search via direct sparse optimization
Clarinet: A one-step approach towards budget-friendly unsupervised domain adaptation
idlg: Improved deep leakage from gradients

Algorithm 1 (consolidated from the extracted fragments):
Given: At the r-th global round and the t-th task (assume t ≥ 2), the central server S_G randomly selects a set of clients {S_{s_1}, S_{s_2}, ..., S_{s_m}} of size m; the selected clients have their local training data {T^t_{s_1}, T^t_{s_2}, · · · , T^t_{s_m}} and local exemplar memories {M_{s_1}, M_{s_2}, · · · , M_{s_m}}; S_G sends the latest global classification model Θ^{r,t} to all selected clients; the gradient encoding network Γ and the proxy server S_P are given.
All Clients: for S_l in {S_1, S_2, · · · , S_K} do: use T^t_l to compute the average entropy H^{r,t}_l via Eq. (7); apply iCaRL [37] on T^t_l to update M_l.
Selected Clients: receive Θ^{r,t} from S_G as the local classification model; for each new class c do: generate the perturbed sample (x^t_{lc*}, y^t_{lc*}); compute the gradient ∇Γ_lc = ∪_{W_i} ∇_{W_i} D_CE(P^t_l(x^t_{lc*}, Γ), y^t_{lc*}); set ∇Γ^t_l ← ∇Γ^t_l ∪ ∇Γ_lc; send ∇Γ^t_l to the proxy server S_P.
Proxy Server: receive ∇Γ^t = {∇Γ^t_{s_1}, ∇Γ^t_{s_2}, ..., ∇Γ^t_{s_m}} from the selected local clients, with N^t_g gradients in ∇Γ^t; receive Θ^{r,t} from S_G as the classification model; forward {X^t_P, Y^t_P} to Θ^{r,t} and get the best Θ^t; distribute Θ^{t−1} and Θ^t to all selected local clients.