title: IOP-FL: Inside-Outside Personalization for Federated Medical Image Segmentation
authors: Jiang, Meirui; Yang, Hongzheng; Cheng, Chen; Dou, Qi
date: 2022-04-16

Federated learning (FL) allows multiple medical institutions to collaboratively learn a global model without centralizing all clients' data. It is difficult, if not impossible, for such a global model to achieve optimal performance on every individual client, due to the heterogeneity of medical data from various scanners and patient demographics. This problem becomes even more significant when deploying the global model to unseen clients outside the FL with new distributions not presented during federated training. To optimize the prediction accuracy of each individual client for critical medical tasks, we propose a novel unified framework for both Inside and Outside model Personalization in FL (IOP-FL). Our inside personalization is achieved by a lightweight gradient-based approach that exploits a local adapted model for each client, accumulating both the global gradients for common knowledge and the local gradients for client-specific optimization. Moreover, and importantly, the obtained local personalized models and the global model form a diverse and informative routing space for personalizing a new model for clients outside the FL. Hence, we design a new test-time routing scheme, driven by a consistency loss with a shape constraint, to dynamically incorporate these models given the distribution information conveyed by the test data. Our extensive experimental results on two medical image segmentation tasks present significant improvements over SOTA methods on both inside and outside personalization, demonstrating the great potential of our IOP-FL scheme for clinical practice. Code will be released at https://github.com/med-air/IOP-FL.

Federated learning (FL) in medical image analysis has been an increasingly important topic, owing to its advantage of jointly learning a global model across many medical institutions in a decentralized way [1]-[12]. However, such a global model is unlikely to be optimal for each individual FL client (see Fig. 1), due to the various data distributions associated with the specific scanners, protocols, and patient demographics at each local client [13]-[19]. Instead of training a global model that is commonly used everywhere, we argue it is crucial to personalize FL models, in order to optimize prediction accuracy at each client for critical medical applications.

Fig. 1. In federated learning, the global model usually fails to yield good performance on each individual client due to data heterogeneity. Our proposed inside-outside personalization approach (called IOP-FL) addresses this challenge, towards optimizing medical image segmentation accuracy for any specific client in clinical practice.

Moreover, to make the best use of the federated training effort, it is desirable that the federated model also performs well on clients outside the federation in the wild. Unfortunately, due to the data distribution shift, the global model may encounter even more severe performance degradation on unseen testing clients. In these regards, FL frameworks equipped with comprehensive personalization strategies for both inside and outside clients are highly demanded.
Existing works relevant to this challenging problem mainly address model personalization for inside FL clients. Some approaches incorporate extra learning paradigms to balance local and global information for each local client model, such as multi-task learning [20], [21] and meta-learning [22], [23]. These methods complicate the federated training, e.g., multi-task learning needs to add auxiliary tasks and solve them simultaneously. Some recent works specify part of the network parameters for personalization [19], [24]-[26], such as client-specific classification heads [24], [25]. However, it is not trivial to identify and separate the shareable and personalizable subsets of parameters. Moreover, all these personalization approaches have to be performed during federated training and are hardly applicable to outside FL clients during model deployment.

For outside FL model personalization, the key challenge is the unavailability of the original data distributions of FL training, i.e., only the learned FL model and test data with new distributions are available at test time. Furthermore, for the sake of wider applicability, we consider the practical situation in which no labels are provided by outside clients and personalization needs to be achieved at test time. Recently, some test-time adaptation methods [27], [28] use the inference sample as a hint of the new data distribution to adapt certain model layers for better performance under distribution shift. Although they can be helpful, directly combining test-time adaptation methods with personalized FL methods in an ad-hoc way is less effective. One reason is that test-time adaptation methods are designed for a single model, and it is not trivial to decide whether a personalized internal model or the global model should be adapted. In addition, current methods are mainly restricted to selecting a subset of model parameters to update.

In this paper, we propose a novel and unified framework to achieve both Inside and Outside Personalization in FL (IOP-FL). For the inside personalization, we design a lightweight gradient-based method that calculates a local adapted model as the personalized model for each inside client. Intuitively, the global gradients aggregated from all clients contain general unbiased information, and the local gradients based on each client's own data represent the local data distribution. Our local adapted model dynamically accumulates both types of gradients so that each personalized model optimizes towards a client-specific direction without diverging from the agreed common knowledge. On the other hand, for the outside personalization, since the data distribution of an outside client may not be present during federated training, we consider it important to construct a more diverse and informative parameter space as a basis for personalizing a new model at test time. Such a space can be composed of our various local personalized models, which represent different data distributions, as well as the common global model for general patterns. We therefore design a coefficient matrix to dynamically combine the local personalized models and the global model in a layer-wise manner, optimized by a dedicated test-time routing scheme to adapt to the new distributions of inference data. The test-time personalization is driven by a consistency regularization with a shape constraint for segmentation, in an unsupervised way.
Our main contributions are highlighted as follows:
• We propose a new framework to achieve inside-outside personalization in FL for medical image segmentation. To the best of our knowledge, this is the first work to improve the performance of federated models for both inside and outside heterogeneous local distributions.
• We design a lightweight gradient-based personalization method that calculates local adapted models from the global and local gradients, which effectively and efficiently yields personalized models for inside clients.
• We present a novel test-time personalization scheme which dynamically mixes the various local personalized models and the global model in an unsupervised way to obtain personalized models for outside clients.
• We conduct extensive experiments on two medical image segmentation tasks with a number of challenging heterogeneous clients. Our unified IOP-FL framework achieves superior performance on both inside and outside clients over state-of-the-art (SOTA) methods.

Personalized federated learning aims to learn local models personalized to the distribution of each client. A variety of approaches have been proposed to achieve model personalization for inside FL clients, using local fine-tuning [29], [30], multi-task learning [20], [21], clustering [31], [32], transfer learning [30], knowledge distillation [33] and meta-learning [22], [23]. The most recent state-of-the-art methods partition the network into shared global parameters and unique local parameters [19], [24]-[26]. For example, FedRep [25] shares the feature extractor and personalizes a classification head for each client. In FedBN [19], the batch normalization layers are kept locally without being uploaded to the central server. For these methods, however, it can be difficult to decide which parts of the model parameters should be shared across clients and which parts should be personalized. Very recently, a few early attempts have been made to study personalized FL in medical applications. Roth et al. [34] combine FL with an AutoML technique and propose an adaptation scheme to obtain personalized model architectures for each client. Chen et al. [35] propose a personalized retrogress-resilient framework to achieve a personalized model with improved performance for each client. All previous personalized FL works focus only on model personalization for clients inside the federation, whereas we propose a unified FL framework to personalize the model for both inside and outside clients whose data distributions differ.

There are currently very few works aiming to improve the performance of federated models on outside unseen sites. A very recent work, FedDG [36], shares extra frequency information across clients to improve the global model's generalizability to unseen domains. Another, FedADG [37], introduces an extra distribution generator to measure and align different distributions and learn domain-invariant representations. However, the global model in these methods is not dedicatedly optimized towards the distributions of outside FL clients, and thus still cannot guarantee optimal performance for unseen clients with different distributions. Besides, these methods incur extra training cost (frequency information sharing in FedDG, the domain generator in FedADG).
Test-time learning has recently been proposed in domain adaptation/generalization works to obtain better predictions on new domains by adapting models with the distribution information available at test time. Test-time training (TTT) [27] designs a rotation-recognition auxiliary task on images to adapt the encoder of the model. Tent [28] measures and minimizes an entropy loss to conduct test-time adaptation on the batch normalization layers. Although these methods are applicable to model personalization outside the federation, their model updating is restricted to a limited parameter space and cannot fully utilize the diverse personalized FL models of internal clients. Instead, our test-time update scheme personalizes a new model from a diverse parameter space represented by our various learned local personalized models, which largely enhances the representation capability at test time. Experimental comparisons with these approaches demonstrate the superior performance of our method.

To address the personalization problem for both inside and outside clients, we propose an effective new personalized FL framework, i.e., IOP-FL. We start with the formulation of federated heterogeneous medical image segmentation, and then describe the local adapted model for inside personalization and test-time routing for outside personalization.

Preliminaries: Let $(\mathcal{X}, \mathcal{Y})$ be the joint image and label space, $\{\mathcal{D}^s_1, \cdots, \mathcal{D}^s_K\}$ be the set of distributions of the $K$ distributed training clients involved in FL, and $\mathcal{D}^o$ be the outside client's distribution (we use a single outside client for simplicity; the framework can be easily deployed to multiple outside clients). We consider heterogeneous distributions, i.e., the samples $\mathcal{X}$ are non-iid across clients (e.g., medical images from different hospitals). Specifically, for the $k$-th client, let $\{(x^s_{i,k}, y^s_{i,k})\}_{i=1}^{n_k}$ be the training samples drawn independently from the training distribution $\mathcal{D}^s_k$, and let $\{x^o_i\}$ be samples from $\mathcal{D}^o$ without labels available. Standard FL aims to find the best global model $w$ by solving the overall empirical risk minimization problem, i.e.,
$$\min_{w} \; \sum_{k=1}^{K} \frac{n_k}{\sum_{k'} n_{k'}} \cdot \frac{1}{n_k} \sum_{i=1}^{n_k} \mathcal{L}\big(f_w(x^s_{i,k}),\, y^s_{i,k}\big),$$
where $\mathcal{L}$ is the loss function and $f_w$ is the model function parameterized by $w$. However, since the loss of client $k$ only depends on its own local data, there is a major concern with this standard objective: the global model may not achieve the best performance for a client whose distribution differs significantly from the population. Our framework instead uses personalized objectives to learn a specific model for each client. For inside clients, we have the objective
$$\min_{w_k} \; \mathcal{L}_k(f_{w_k}) + \lambda\, s(w_k, \bar{w}), \quad k = 1, \cdots, K, \tag{1}$$
where $\bar{w} = \frac{1}{K}\sum_{k=1}^{K} w_k$ is the average global model, $\mathcal{L}_k$ denotes the local loss, and $\lambda$ controls the trade-off on the model similarity measured by the function $s(\cdot,\cdot)$. For outside clients, we propose the objective
$$\min_{R} \; \mathcal{L}_t(f_{w_o}), \quad w_o \in \mathcal{W} = R \cdot [w_1, \cdots, w_K, \bar{w}], \tag{2}$$
where $\mathcal{W}$ is the model weight routing space that incorporates all models obtained after training, including the local personalized models and the global model, and $\cdot$ denotes matrix multiplication. $R \in \mathbb{R}^{L \times (K+1)}$ is the routing matrix consisting of re-weighting coefficients $r^l_k$ for the $l$-th layer of the $k$-th model. All terms in the routing matrix are updated by a specifically designed test-time loss $\mathcal{L}_t$. The overall framework is shown in Fig. 2. In the following, we describe the details of solving the two objectives for both inside and outside clients.
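To ground the preliminaries, the following is a minimal, generic sketch of the FedAvg-style baseline that the empirical risk minimization above formalizes: each client minimizes its local risk and the server averages the resulting weights into the global model. This is not the authors' released IOP-FL code; function names such as `local_update` and `fedavg_aggregate` are illustrative.

```python
import copy
import torch


def local_update(model, loader, loss_fn, lr=1e-3, epochs=1):
    """One round of local training: client k minimizes its own empirical risk L_k."""
    model = copy.deepcopy(model)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)  # local empirical risk on this client's data
            loss.backward()
            optimizer.step()
    return model.state_dict()


def fedavg_aggregate(client_states, client_sizes):
    """Server step: size-weighted average of client weights gives the global model w-bar."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state
```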
A good local personalized model should make full use of the information contributed by all clients, to mitigate insufficient local data, and should also be optimized towards the direction of each individual client's own distribution. With this insight, we propose to calculate a local adapted model that gathers both the global gradients and the local gradients in an accumulated way for inside personalization. The global gradients help extract a relatively unbiased description of the general pattern so that each client benefits from the large distributed data, and the local gradients help to personalize toward each client's distribution. Benefiting from this congregation of gradients, the local adapted model not only contains information from the current round, but also keeps tracking information from previous rounds. This alleviates the influence of client heterogeneity and helps to stabilize the local personalization process.

To step into the specific design of the local adapted model, we start by solving the objective of Eq. (1), which allows each client to learn a specific model $w_k$ different from the others under a constraint given by the dissimilarity measurement $s$. This type of objective has been explored in the emerging literature [22], [38], [39], e.g., using the L2-norm to penalize model differences toward a common global model, or designing a new loss term to update the local model. However, these methods are sensitive to the choice of $\lambda$ in Eq. (1). Ideally, the local models are optimized towards different directions depending on the value of $\lambda$. When $\lambda = 0$, each client $k$ only calculates local gradients based on samples drawn from $\mathcal{D}^s_k$. In the limit case $\lambda \to \infty$, intuitively, all local models are forced to be identical, which is equivalent to optimizing solely with the global average gradients. Hence, for $\lambda \in (0, \infty)$, the local models should be an interpolation between the pure local models and the global model. Such an interpolation can gain benefits from both global and local gradients.

Considering the procedure of local gradient descent, after $i$ local training iterations, the model at the next iteration either continues local gradient descent or performs the aggregation to obtain the global gradients. The process of local model training can thus be seen as alternating optimization with either local gradients or overall global gradients. Based on this property, we choose to mix the local gradients and the global gradients as a natural constraint for the dissimilarity measurement $s$, thus calculating a local adapted model. By doing this, the local adapted model is updated towards learning a common pattern as well as fitting the local distinct distribution, and we do not need to explicitly specify $\lambda$.

Denote the local adapted model as $P_k$. To make the problem well-defined, we specify $P^t_k = w^t_k - \eta \nabla \mathcal{L}_k$ at communication round $t = 0$. The local adapted model of each client is then obtained by accumulating, at every round, the newly mixed global and local gradients with the history of previous rounds, where $\tau$ is the accumulation rate used to congregate history information and stabilize training. After communication with the server, the gradients from local data are further calculated and mixed with the global gradients to jointly optimize and obtain the local adapted model. In our implementation, we experimentally set the rate to 0.9, and we further study the effect of different choices of $\tau$ in our experiments.
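As a concrete illustration, below is a hedged sketch of one plausible form of this accumulation step (the exact update equation is given in the paper; this is only an exponential-moving-average reading consistent with the description that a higher $\tau$ emphasizes the current round and a lower $\tau$ keeps more history). The argument names are illustrative, and state dictionaries are assumed to contain floating-point tensors.

```python
def update_local_adapted_model(prev_adapted, locally_updated_global, tau=0.9):
    """
    prev_adapted:           local adapted model P_k from the previous round (state_dict)
    locally_updated_global: the aggregated global model of the current round after client
                            k's own local gradient steps, i.e., carrying both the global
                            and the local gradient information
    tau:                    accumulation rate (0.9 in the paper's experiments); tau = 1.0
                            would discard all previous history
    """
    return {
        key: tau * locally_updated_global[key] + (1.0 - tau) * prev_adapted[key]
        for key in prev_adapted
    }
```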
For the local gradient descent, in practice we iteratively conduct stochastic gradient descent on all sampled local training data. Note that since the aggregation is naturally performed in federated learning after a certain number of local training epochs, calculating the local adapted model from the global and local gradients does not incur any extra communication cost.

With the models obtained during inside personalization, we further study personalization on new data outside the federation, which is more challenging due to the unseen distributions and unavailable labels. Unfortunately, neither directly deploying a single model nor a straightforward ensemble of models delivers good performance on the unseen distributions of the test data. Although current test-time adaptation methods are applicable, they are designed for a single model and are mainly restricted to updating a limited subset of parameters. A smarter way to utilize the multiple models produced by personalized FL is needed. Therefore, based on these internal models, we design a test-time routing scheme to overcome these challenges. This test-time routing uses the test data to find optimal re-weighting coefficients that aggregate all learnable layers from all models, thus flexibly incorporating both training knowledge and test data information. The objective of the test-time routing is shown in Eq. (2). We aim to obtain a new personalized model $w_o$ from all models obtained at the training stage; we call the collection of these models, together with the corresponding re-weighting coefficients, the routing space, denoted by $\mathcal{W}$. By minimizing a specifically designed unsupervised loss on the test data from the outside client, we dynamically update these coefficients to obtain a new outside personalized model from the routing space. The details of routing space construction and test-time routing are described below.

Routing space construction. To unleash the potential of test-time routing, we construct the routing space from all local personalized models as well as the global model. The global model contains strong common knowledge of the general pattern, and each personalized model represents its local distribution. Combining these diverse models in the routing space largely improves the representation capacity. Furthermore, different from traditional ensemble methods which simply aggregate outputs at the model level, we consider a layer-wise combination to diversify the feature extraction procedure. That is, for a convolutional layer with parameters $W^l_k$ from the $k$-th model and a given input, the routed layer uses the combined weight $\widetilde{W}^l = \sum_{k} r_k W^l_k$, where $r_k$ (the superscript $l$ is omitted for ease of notation) are the re-weighting coefficients of the linear combination. We use the test data to dynamically learn the coefficients $r_k$ for each learnable layer, and all the coefficients form the routing matrix $R$. Specifically, the learnable layers include all convolutional layers in the neural network. As shown in Fig. 3, given all model weights, we aim to optimize the routing matrix $R$ to calculate the outside personalized model weights $w_o$ from the routing space $\mathcal{W} = R \cdot [w_1, \cdots, w_K, \bar{w}]$.
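To make the layer-wise combination above concrete, here is a minimal PyTorch sketch of a routed convolutional layer whose effective kernel is a re-weighted sum of the candidate kernels from the $K$ personalized models and the global model. It is a sketch under assumptions, not the released implementation: the softmax normalization of the coefficients and the use of one small routing network per candidate are design choices of this illustration, the batch-averaged pooling is a simplification, and the bias term is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoutedConv2d(nn.Module):
    """Conv layer whose weight is a routed combination of K+1 candidate kernels."""

    def __init__(self, candidate_weights, pooled_dim, hidden=16):
        super().__init__()
        # frozen candidate kernels, stacked as (K+1, out_ch, in_ch, kh, kw)
        self.register_buffer("candidates", torch.stack(candidate_weights))
        # one small routing network theta per candidate, fed the pooled layer input
        self.thetas = nn.ModuleList([
            nn.Sequential(nn.Linear(pooled_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in candidate_weights
        ])

    def forward(self, x):
        # psi(h^l): adaptive average pooling of the layer input, averaged over the batch
        pooled = F.adaptive_avg_pool2d(x, 1).flatten(1).mean(0)
        r = torch.cat([theta(pooled) for theta in self.thetas])   # one coefficient per model
        r = torch.softmax(r, dim=0)                               # normalization is an assumption
        weight = torch.einsum("k,kocij->ocij", r, self.candidates)  # sum_k r_k * W_k
        return F.conv2d(x, weight, padding=self.candidates.shape[-1] // 2)
```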
It is important to note that the routing matrix $R$ is the core of this dynamic behavior. Intuitively, given an outside client, its data distribution could be similar to that of one inside client, or lie within their mixed distributions. In the first case, each column vector in the matrix would degrade closely to a one-hot format to fit the similar distribution. In the other case, $R$ coordinates all candidates in the model set to generate parameters that fit the test data distribution as closely as possible. In the following, we illustrate the design of the losses used to optimize the coefficients in the routing matrix.

Fig. 3. Illustration of the test-time parameter routing. All trained models present diverse distribution information, and the test-time routing aggregates their plentiful parameters to fit the new test data.

Test-time routing process. From the previous discussion, we have constructed the routing space. Here, we focus on how to combine the existing models to obtain a high-performance personalized model. The key is to bridge the existing models and our desired personalized model via the routing matrix $R$, which should be optimized with respect to the test data distribution. Denoting $\psi$ as adaptive average pooling, we learn each coefficient $r^l_k$ in $R$ as
$$r^l_k = \theta^l_k\big(\psi(h^l)\big),$$
where $h^l$ denotes the input to the $l$-th layer (the input image for the first layer and feature maps for later layers), and $\theta^l_k$ is a multi-layer linear network that maps the pooled input to the re-weighting coefficient. In this way, each coefficient in the matrix $R$ is determined by the input and the corresponding learnable network $\theta^l_k$. As shown in Fig. 2, we can dynamically fuse the models to fit the given data from the outside client by optimizing the routing matrix via our designed losses.

Specifically, we design a consistency regularization to capture the input image feature distribution. Given a test image $x^o_i$, we generate random Gaussian noise $\epsilon \sim \mathcal{N}(0, (\tfrac{1}{2})^2)$ as a perturbation and add it to this image. Denote the output prediction maps of $x^o_i$ and $x^o_i + \epsilon$ as $z \in \mathbb{R}^{L\times H\times C}$ and $z' \in \mathbb{R}^{L\times H\times C}$, respectively. The consistency loss $\mathcal{L}_{cons}$ is defined as the discrepancy between the two prediction maps; it does not need any label information of the test data.

Furthermore, we observe that for segmentation tasks, the main source of the performance drop at unseen clients is incomplete shapes. Concerning this challenge, we further propose a shape-relevant constraint to preserve the regular shape of segmentation masks by making the model predictions within a local neighborhood more consistent. Denote $V^c_{(i,j)}$ as the probability vector of the segmentation prediction of class $c$ at position $(i,j)$, and $Q_{(i,j)} = \{(i \pm d, j \pm l) \,|\, d, l \in [k]\}$ as the set of prediction probabilities in the neighborhood centered at $(i,j)$, where $k$ is the neighborhood radius. We define the distance $d_{(i,j)}$ as the discrepancy between $V^c_{(i,j)}$ and the predictions in $Q_{(i,j)}$, and design an unsupervised shape constraint $\mathcal{L}_{shape}$ that minimizes the uncertainty of the predictions and the distance $d_{(i,j)}$ over all prediction probabilities. This shape loss yields segmentation maps with smoother boundaries and lower uncertainty. Finally, the overall test-time personalization loss is
$$\mathcal{L}_t = \mathcal{L}_{cons} + \beta\, \mathcal{L}_{shape},$$
where $\beta$ is a balancing hyper-parameter. With the complete test-time personalization loss, the routing matrix $R$, consisting of the learnable weights, is optimized as $\arg\min_R \mathcal{L}_t$ to obtain the personalized weights $w_o$ for the outside client. The entire test-time personalization procedure is performed in an unsupervised way and does not damage the original training knowledge. In our IOP-FL framework, we use a U-Net [41] backbone and the Adam optimizer; the model is trained with a learning rate of $1e^{-3}$ and a batch size of 16.
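For concreteness, the sketch below shows how the unsupervised test-time objective could be assembled: a consistency term between predictions on a test image and its noise-perturbed copy, plus a shape term penalizing prediction uncertainty and disagreement among neighboring pixels. The exact mathematical form of both losses in the paper is not reproduced here; the MSE consistency and the entropy-plus-local-smoothness surrogate are approximations of this illustration, and names such as `radius` are hypothetical. The value $\beta = 0.01$ follows the implementation details.

```python
import torch
import torch.nn.functional as F


def consistency_loss(model, x, noise_std=0.5):
    """L_cons: predictions on x and on x + Gaussian noise (std 0.5) should agree."""
    z = torch.softmax(model(x), dim=1)
    z_noisy = torch.softmax(model(x + noise_std * torch.randn_like(x)), dim=1)
    return F.mse_loss(z, z_noisy)


def shape_loss(prob, radius=1):
    """L_shape surrogate: low-uncertainty, locally consistent probability maps.

    prob: (B, C, H, W) softmax probabilities.
    """
    entropy = -(prob * torch.log(prob + 1e-8)).sum(dim=1).mean()          # prediction uncertainty
    neighbor_avg = F.avg_pool2d(prob, kernel_size=2 * radius + 1,
                                stride=1, padding=radius)                  # neighborhood Q_(i,j)
    distance = (prob - neighbor_avg).abs().mean()                          # distance d_(i,j)
    return entropy + distance


def test_time_loss(model, x, beta=0.01):
    prob = torch.softmax(model(x), dim=1)
    return consistency_loss(model, x) + beta * shape_loss(prob)
```

At test time, presumably only the routing networks (and hence the matrix $R$) would be updated by minimizing this loss, while the candidate model weights stay frozen, consistent with the statement that the original training knowledge is preserved.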
For inside-client training, we empirically set the accumulation rate $\tau$ to 0.9 for calculating the local adapted model, and further analyze this parameter in the ablation study. We perform local training for 1 epoch in each communication round and train the model for 100 rounds in total. For outside personalization, we update the routing matrix $R$ for 10 epochs using all test samples. The loss weight $\beta$ is empirically set to 0.01 to balance the magnitudes of the different loss terms. We use the predictions of the outside personalized model with the lowest test-time unsupervised loss.

In this section, we extensively evaluate the effectiveness of our IOP-FL framework for both inside and outside clients on two typical medical image segmentation datasets, including quantitative performance comparison, qualitative segmentation visualization, and comprehensive ablation studies. We evaluate our approach on the prostate segmentation task with T2-weighted MRI images from 6 different sources [42]-[45] and the optic disc and cup segmentation task with retinal fundus images from 4 different institutions [46]-[48]. We treat the data from each source as a single client, resulting in clients A to F for prostate images and clients A to D for retinal fundus images. For each client, we first split off 20% of the data as a test set; the remaining data are further split into 80% for training and 20% for validation. For data pre-processing, we resized the axial slices of the prostate MRI images and the retinal fundus images to 384 × 384, and normalized each image individually to zero mean and unit variance. During training, we adopted data augmentation of random rotation and flipping for all images. The common Dice score metric is employed to evaluate model performance on both tasks.

Experimental setting. For the inside-client comparison, we involve all clients in training and report the performance on each client's test set using the personalized model with the best validation performance. The performance of our local adapted model is compared with single-site training (i.e., each client trains a model individually), the baseline FedAvg [40] (i.e., learning a common global model), and recent SOTA personalized FL methods: FedRep [25] (ICML 2021) learns a common representation and a personalized head for each client; FedBN [19] (ICLR 2021) keeps client-specific batch normalization layers locally; and pFedMe [22] (NeurIPS 2020) uses Moreau envelopes as a regularized loss function to pursue each client's own model in a different direction.

Comparison results. Table I reports the results on every single client and the average performance for both datasets. All results are reported as mean and standard deviation over three independent runs. It can be observed that FedAvg [40] outperforms single-site training on average, showing the importance of utilizing the distributed data. On top of this, all personalized FL methods achieve further improvement over FedAvg, demonstrating the benefits of model personalization. Compared with these methods, our IOP-FL framework achieves higher overall performance as well as higher single-client performance on 9 out of 10 clients. The improvements are attributed to our gradient-based design. Specifically, compared with partially personalized models, our gradient-based personalization lets all model parameters benefit from both global and local information, providing a more complete personalization.
In addition, the accumulation keeps tracking gradient information, mitigating the instability caused by client heterogeneity during personalization. As a result, our approach achieves consistent improvements over FedAvg on all inside FL clients, with a 2.63% increase in average Dice for prostate segmentation, and 2.79% and 2.09% increases in average Dice for optic disc and cup segmentation, respectively. Besides the quantitative improvements, the left-hand side of Fig. 4 shows qualitative segmentation results in comparison with the baseline FedAvg and other state-of-the-art personalization methods on the two tasks. Each row shows an image from a different client, and these cases illustrate the heterogeneity. Compared with the ground truth, our method accurately segments the shape, while the compared methods sometimes fail to recover the complete structure.

Experimental setting. For the outside-client comparison, we conduct leave-one-client-out experiments, i.e., each time we hold out one client as the outside client and perform federated training on the remaining clients.

Comparison results. Table II shows that FedDG [36] and the test-time learning approaches TTT [27] and Tent [28] can generally improve model performance over FedAvg on outside clients, but not as effectively as our proposed method. This demonstrates the benefit of our test-time personalization from a diverse parameter space represented by the learned training models. We observe that the Average and Ensemble approaches obtain inferior performance to FedAvg [40], showing that simply averaging or ensembling the models can hardly achieve good performance on outside clients whose data distributions differ from those of the inside clients. Our proposed test-time routing, instead, can effectively personalize a new model given the distribution information provided by the test data, thus achieving superior performance. Specifically, we achieve the highest average results on prostate segmentation, increasing Dice by 5.26% over Ensemble and 3.47% over FedAvg. Superior performance is also obtained with our IOP-FL on retinal fundus images, with the best overall Dice for optic disc/cup segmentation. The segmentation results are shown on the right-hand side of Fig. 4; the compared methods may fail to obtain accurate boundaries due to the different unseen data distributions, while our test-time personalization produces more accurate boundaries.

We further conduct ablation studies to investigate the key properties of our IOP-FL framework. For inside personalization, we analyze: i) convergence with different numbers of local training epochs; and ii) the effect of the accumulation rate $\tau$. For outside personalization, we analyze: i) scalability with respect to the number of participating inside clients; ii) the performance of our test-time routing with respect to time cost; iii) the contribution of each test-time loss term; and iv) robustness to model inversion attacks when setting up the outside personalization.

1) Inside Personalization: Convergence with different local epochs: Different aggregation frequencies may affect the learning behavior, so we analyze the training performance curves of our IOP-FL compared with different personalized FL methods. In Fig. 5, we explore local training epochs E = 1, 2, 4, 8, 10 with 100 communication rounds. The training curve of our IOP-FL increases faster than the others, indicating a faster convergence rate of our method.
At the late stage of training, especially for small local epoch numbers, the curve of our method is smoother, demonstrating the benefit of gradient information accumulation in stabilizing the training of the local adapted model.

Effects of the accumulation rate: We further study how the accumulation rate $\tau$ affects the performance of our method. The local adapted model focuses more on current information with a higher rate, and pays more attention to historical information with a lower one. We plot the variation of Dice scores in Fig. 6. For inside clients, $\tau \geq 0.5$ gives better results on both datasets with smaller standard deviations. For outside clients, the performance also shows an increasing trend with a higher accumulation rate. We additionally plot $\tau = 1.0$ with dashed lines; the results indicate that totally discarding all previous history may not affect the performance of inside personalization, but it leads to a large performance drop for outside personalization. Overall, our approach performs well with $\tau = 0.9$ for both inside and outside clients.

2) Outside Personalization: Outside performance with different numbers of inside clients: We analyze how the performance of our method on the outside client changes when different numbers of inside clients participate in FL. Fig. 7(a) shows the results on one prostate MRI outside client with the number of inside clients ranging from 1 to K−1. We see that the performance improves as more clients are involved, indicating the necessity of using FL to aggregate more data with different distributions. Furthermore, compared with FedAvg, the improvement of our method has an increasing trend, i.e., a higher performance gain is obtained with more clients involved. This demonstrates the efficacy of our test-time routing in exploiting the diverse and informative routing space represented by a larger number of local models.

Inference speed-performance relation study: For test-time personalization, it is very important to achieve fast inference while maintaining high performance. To this end, we explore the relationship between the test performance on outside clients and the inference time of our method. We use the prostate datasets and plot the overall performance curve by leaving each one of the six clients out as the outside client, adding the standard deviation range in Fig. 7(b). The x-axis denotes the average time to process a single sample. With more time consumed, i.e., more iterations of our test-time routing scheme, the performance quickly increases and then saturates. The upper bound of the shaded area shows that for some outside clients, our approach fits their distributions at the very beginning with high and stable performance. As for the lower bound, the test Dice score quickly increases and reaches a point near the highest performance within less than one second per sample, showing the effectiveness of our approach in adapting to various outside clients with little time consumed.

Contribution of each loss term for test-time routing: We validate the effects of the two key designs in our test-time routing, i.e., the consistency loss and the unsupervised shape constraint. We report the results of removing either component in Fig. 8. The results show that removing either loss term decreases the final performance on both segmentation tasks. This consistent observation confirms the contributions of our two key designs.
That is, the consistency loss helps the outside personalized model fit the overall context distribution, while the shape constraint loss focuses more on the detailed structural information.

Fig. 9. Visualization of inversion attacks attempting to reconstruct samples.

Robustness to model inversion attacks: When performing the outside personalization, the trained models need to be transferred to outside clients, which may raise the risk of model inversion attacks. However, with only the models available, a model inversion attack from outside clients is difficult to perform if no additional information (e.g., gradients) is given [49], [50]. Here we investigate two kinds of attacks: the gradient inversion attack [51], which utilizes the gradient information of training samples to reconstruct original images, and the latent feature inversion attack [52], which requires that the latent features be accessible to the unseen client. As shown in Fig. 9, even given the gradient or latent feature information, the original data can hardly be faithfully reconstructed.

With the presence of various client distributions in FL, the global model may suffer different degrees of performance degradation on each local client. This problem becomes even more severe on outside testing clients due to the unseen data distribution shift. To optimize the model prediction accuracy on each individual client, it is important to perform model personalization, especially for critical medical applications with low error tolerance [53]. Previous methods either focus on training personalized models for internal clients or on adapting/fine-tuning the model for external clients, so the personalization for inside and outside clients is performed separately. Different from the traditional training paradigm, FL yields many models from different clients, which provides a new way to construct a unified personalization framework. In this paper, we present a comprehensive strategy to perform model personalization for both inside and outside clients. Specifically, we propose a lightweight gradient-mixing method to calculate personalized models for inside clients, and on top of this, we further present a novel test-time routing scheme that incorporates all existing models to generate the outside personalized model.

For inside clients, our gradient-mixing method derives from the similarity constraint between the pure local model and the global model. Some existing works incorporate different regularization terms or feature contrastive learning as the similarity constraint [22], [38], [54]. Our approach shares the same insight, but implements it in a more lightweight way by mixing local and global gradients to realize the similarity constraint. Intuitively, the global gradients help learn general patterns and the local gradients capture the various local distributions; combining them helps the personalized model fit each individual distribution without drifting away from the common knowledge. For outside clients, our experiments have shown that simply averaging model predictions or using a conventional model ensemble cannot yield satisfactory performance, thus failing to fully exploit the various models obtained from training. Therefore, we propose to combine the internal models under the guidance of our novel test-time routing scheme, which calculates the outside personalized model by optimizing the re-weighting coefficients used to aggregate the existing models.
The coefficients are updated to make the combined target model fit the test data distribution, while the original training knowledge in all internal models is well preserved during the test-time update. With the correct guidance towards fitting the new data distribution, the training efforts can further contribute to testing on external clients, enhancing the overall applicability of FL. We have demonstrated the applicability of our method in the comparisons with different methods in Table I and Table II. Previous methods tackle inside or outside personalization separately, i.e., with very different methods, whereas our framework solves both problems in a unified way.

One limitation of our approach is that when performing the outside personalization, all internal models need to be transferred to the outside client to form a diverse and informative routing space. Constructing the routing space thus incurs a higher communication cost as the number of training clients increases. In future work, a promising extension of our method is to explore metrics that evaluate the representation capability of the internal models, such that test-time personalization can be achieved by selecting and transferring only representative models. This would largely reduce the cost during the model deployment phase for outside clients, especially for large-scale FL systems. In general, our inside personalization approach and the idea of test-time routing are agnostic to the model architecture. In this work, we mainly consider segmentation tasks, but the framework can also be applied to other tasks (e.g., diagnosis or detection) via specific changes to the test-time loss design. Exploring different tasks can further enhance the applicability of our proposed framework, which is a promising direction for future work.

This work presents a new unified framework, called IOP-FL, to achieve personalization for each individual client both inside and outside the federation. In our framework, we propose the local adapted model to serve as the inside personalization, and then further utilize the inside personalized models as well as the global model in a newly designed test-time routing scheme to generate a new outside personalized model. The efficacy of our method is well demonstrated on two medical image segmentation tasks with extensive experimental analysis. Our framework has great potential to be widely used in different hospitals to improve performance through either inside or outside personalization.
[1] The future of digital health with federated learning
[2] Federated learning in distributed medical databases: Meta-analysis of large-scale subcortical brain data
[3] Federated learning in medicine: Facilitating multi-institutional collaborations without sharing patient data
[4] Federated learning for breast density classification: A real-world implementation
[5] Federated simulation for medical imaging
[6] Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results
[7] Inverse distance aggregation for federated learning with non-iid data
[8] Federated learning for predicting clinical outcomes in patients with COVID-19
[9] Federated transfer learning for EEG signal classification
[10] Federated deep learning for detecting COVID-19 lung abnormalities in CT: A privacy-preserving multinational validation study
[11] HarmoFL: Harmonizing local and global drifts in federated learning on heterogeneous medical images
[12] Federated semi-supervised medical image classification via inter-client relation matching
[13] Toward personalized federated learning
[14] Personalized federated learning for intelligent IoT applications: A cloud-edge based framework
[15] Survey of personalization techniques for federated learning
[16] Advances and open problems in federated learning
[17] The non-iid data quagmire of decentralized machine learning
[18] Federated learning for healthcare informatics
[19] FedBN: Federated learning on non-IID features via local batch normalization
[20] FedU: A unified framework for federated multi-task learning with Laplacian regularization
[21] Federated multi-task learning
[22] Personalized federated learning with Moreau envelopes
[23] Personalized federated learning: A meta-learning approach
[24] FedBABU: Towards enhanced representation for federated image classification
[25] Exploiting shared representations for personalized federated learning
[26] Federated learning with personalization layers
[27] Test-time training with self-supervision for generalization under distribution shifts
[28] Tent: Fully test-time adaptation by entropy minimization
[29] Federated evaluation of on-device personalization
[30] Salvaging federated learning by local adaptation
[31] An efficient framework for clustered federated learning
[32] Clustered federated learning: Model-agnostic distributed multi-task optimization under privacy constraints
[33] FedMD: Heterogeneous federated learning via model distillation
[34] Federated whole prostate segmentation in MRI with personalized neural architectures
[35] Personalized retrogress-resilient framework for real-world medical federated learning
[36] FedDG: Federated domain generalization on medical image segmentation via episodic learning in continuous frequency space
[37] Federated learning with domain generalization
[38] Federated optimization in heterogeneous networks
[39] Federated learning of a mixture of global and local models
[40] Communication-efficient learning of deep networks from decentralized data
[41] U-Net: Convolutional networks for biomedical image segmentation
[42] MS-Net: Multi-site network for improving prostate segmentation with heterogeneous MRI data
[43] Evaluation of prostate segmentation algorithms for MRI: The PROMISE12 challenge
[44] Computer-aided detection and diagnosis for prostate cancer based on mono and multi-parametric MRI: A review
[45] NCI-ISBI 2013 challenge: Automated segmentation of prostate structures
[46] RIM-ONE: An open retinal image database for optic nerve evaluation
[47] A comprehensive retinal image dataset for the assessment of glaucoma from the optic nerve head analysis
[48] REFUGE challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs
[49] Model inversion attacks that exploit confidence information and basic countermeasures
[50] The secret revealer: Generative model-inversion attacks against deep neural networks
[51] Inverting gradients - how easy is it to break privacy in federated learning?
[52] An analysis of the vulnerability of two common deep learning-based medical image segmentation techniques to model inversion attacks
[53] The medical algorithmic audit
[54] Model-contrastive federated learning